Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Machine learning of DNA shape and spatial geometry
(USC Thesis Other)
Machine learning of DNA shape and spatial geometry
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
MachinelearningofDNAshapeandspatialgeometry
by
JinsenLi
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computational Biology and Bioinformatics)
December 2021
Copyright 2021 JinsenLi
IdedicatethisthesistotranscriptionfactorandDNAbinding,
fortheirneverendingmystery.
ii
Acknowledgements
The author acknowledges the financial support of Andrew Viterbi fellowship and USC research
enhancementfellowship.
IwouldliketothankmyadvisorRemoRohsforhiscontinuingsupportduringmyPhD.When
Ifaceobstaclesinresearch,healwaysgivemegoodadvisesonhowtosolvethem. Heofferedme
manyideasinthisthesis. Withoutthesupportofhim,thisthesiswillneverbeaccomplished.
I would like to thank my committee members Remo Rohs, Fengzhu Sun, Rosa Di Felice and
AiichiroNakano.
IwouldliketothankJiahuiLiuforherendlesssupports.
Iwouldliketothankmyparentsfortheirneverendingloveandsupports.
Iwouldliketothankallofmyfriends.
iii
TableofContents
Dedication ii
Acknowledgements iii
ListofTables vii
ListofFigures viii
Abstract x
Chapter1: Introduction 1
1.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 ThebasicsofDNA,TF,andTF-DNAbinding . . . . . . . . . . . . . . . . . . . . 2
1.3 Thebasicsofmachinelearninganddeeplearning . . . . . . . . . . . . . . . . . . 5
1.4 ExperimentalmethodsonstudyingTF-DNAbinding . . . . . . . . . . . . . . . . 7
1.5 ComputationalmethodsonstudyingTF-DNAbinding . . . . . . . . . . . . . . . 8
1.6 Overviewofthethesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Chapter2: ExpandingtherepertoireofDNAshapefeaturesforgenome-scalestudiesof
transcriptionfactorbinding 11
2.1 ChapterBackground . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 MaterialsandMethods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Datasetselectionandpreprocessing . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 DNAshapepredictionandquerytablegeneration . . . . . . . . . . . . . . 15
2.2.3 IntroductionofadditionalDNAshapefeatures . . . . . . . . . . . . . . . 16
2.2.4 Implementingstatisticalregressionmodels . . . . . . . . . . . . . . . . . 17
2.2.5 Predictionaccuracyandperformanceassessment . . . . . . . . . . . . . . 19
2.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Performance comparisons of TF binding models derived using structural
datafromdifferentsources . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 ModelperformanceimprovesasthenumberofDNAshapefeaturesincreases 24
2.3.3 Performance comparison of shape-augmented models versus k-mer-based
models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.4 DNAshapeRbioconductorpackageforadditionalshapefeatures . . . . . . 30
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
iv
Chapter 3: Expanding the repertoire of DNA shape features to spatial geometric deep
learning 33
3.1 Chapterbackground . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 MaterialsandMethods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 TF-DNAbindingdata . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.2 High-throughputDNAshapeprediction . . . . . . . . . . . . . . . . . . . 36
3.2.3 B-DNA3Dstructureprediction . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.4 Datapreprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.5 Modelinputanddataaugmentation . . . . . . . . . . . . . . . . . . . . . 39
3.2.6 Trainingandtestsplit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.7 Spatialgeometrymodel . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.8 Graphconvolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.9 Basepairconvolution/embedding . . . . . . . . . . . . . . . . . . . . . . 41
3.2.10 Residualneuralnetwork(ResNet) . . . . . . . . . . . . . . . . . . . . . . 42
3.2.11 Learningschedule,hyper-parameter,andtransferlearning . . . . . . . . . 42
3.2.12 Sequencebaselinemodel . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.13 Partialmodelwithonlycertainchemicalgroups . . . . . . . . . . . . . . . 43
3.2.14 ArtificiallybentDNA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 Performancebeatsthesequencebaselinemodel . . . . . . . . . . . . . . . 44
3.3.2 PerformanceremainsthesameforXRCderivedDNArebuiltstructures . . 46
3.3.3 Backbone-onlymodelsrevealTF-family-specificshapeandbackbonepref-
erence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.4 SpatialgeometrymodelcanprocessdeformedDNAandfindspioneerfactors 48
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Chapter4: Sideproject-AnalysisonDNAshapefeaturesoftheyeastoriginrecognition
complex 55
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.1 HighthroughputDNAshapefeaturesprediction . . . . . . . . . . . . . . 56
4.2.2 RebuildingDNAstructuresandaligning . . . . . . . . . . . . . . . . . . . 56
4.2.3 Alignedstructurepositioningandtheircurvatureanalysis . . . . . . . . . . 57
4.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Chapter5: Conclusionsandfutureworks 61
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Futureworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2.1 RebuildingDNAstructuresondemandonthefly . . . . . . . . . . . . . . 62
5.2.2 Optimizinginputwithfixedparametersforthespatialgeometrymodel . . 63
5.2.3 IncorporatingDNAmodificationsintothespatialgeometrymodel . . . . . 64
5.2.4 Optimizingthemodelarchitecturetobetranslationalandrotationalinvariant 64
v
References 66
Appendices 79
A SupplementarymaterialsandmethodsforChapter2 . . . . . . . . . . . . . . . . . 80
A.1 PreparationoftheUniversalProteinBindingMicroarray(uPBM)Dataset . 80
A.2 MonteCarlo(MC)Simulations . . . . . . . . . . . . . . . . . . . . . . . 80
A.3 MolecularDynamics(MD)Simulations . . . . . . . . . . . . . . . . . . . 81
A.4 FeatureEncodingandNormalization . . . . . . . . . . . . . . . . . . . . 81
A.5 MeanSquaredError(MSE)Calculation . . . . . . . . . . . . . . . . . . . 82
A.6 LimitationsofDynaSeq-derivedShapeFeatures . . . . . . . . . . . . . . . 82
B SupplementaryfiguresforChapter2 . . . . . . . . . . . . . . . . . . . . . . . . . 84
C SupplementarytablesforChapter2 . . . . . . . . . . . . . . . . . . . . . . . . . 95
D SupplementaryfiguresforChapter3 . . . . . . . . . . . . . . . . . . . . . . . . . 96
E SupplementarytablesforChapter3 . . . . . . . . . . . . . . . . . . . . . . . . . 104
vi
ListofTables
3.1 R
2
PerformancetableforSELEX-seqdatasets. . . . . . . . . . . . . . . . . . . . . 45
3.2 TFspreferringthebentbackbonemodel,groupedbyTFfamilies. . . . . . . . . . 52
C.1 Spearman’srankcorrelationcoefficientsbetweenfeaturesindifferentquerytables
andfeaturesderivedfrom590protein-boundXRCstructures. . . . . . . . . . . . 95
C.2 XRCdatasetcomprisedof590X-rayco-crystalstructures . . . . . . . . . . . . . 98
E.1 IndexofatomsofDNAstructurestobesortedas. . . . . . . . . . . . . . . . . . . 104
E.2 R
2
Performance table comparing full spatial geometry model and partial models
forSELEX-seqdatasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
vii
ListofFigures
1.1 Chemicalstructuresfornucleobases . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 CartoonrepresentationofFISandDNAbindingstructure. . . . . . . . . . . . . . 4
1.3 Positionweightmatrices(PWM)sequencelogo . . . . . . . . . . . . . . . . . . . 9
2.1 Introductionofanexpandedrepertoireof13DNAshapefeatures. . . . . . . . . . 13
2.2 WorkflowtoencodeDNAsequenceandshape . . . . . . . . . . . . . . . . . . . 18
2.3 DirectcomparisonofR
2
valuesbetween1mer+4shapeversus1mermodels. . . . . 22
2.4 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 AdditionalShapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Summaryofweightedaverageperformances . . . . . . . . . . . . . . . . . . . . . 29
3.1 Workflowofpreprocessingsequencesandmodeltrainingprocedures. . . . . . . . 38
3.2 Diagramsofthespatialgeometrymodel. . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Diagramofthepartialspatialgeometrymodel. . . . . . . . . . . . . . . . . . . . 44
3.4 R
2
Performance table comparing our spatial geometry model to the baseline se-
quencemodelforHT-selexdatasets. . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 GeometrymodelcanMCorXRCderivedfeatures . . . . . . . . . . . . . . . . . 47
3.6 ∆MSEvaluebetweenthefullmodelandthebackbone-onlymodelforHT-SELEX
datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.7 R
2
between the full model and the backbone-only model for ETS family from the
HT-SELEXdataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1 High-throughput prediction of different structural shape parameters of DNA be-
tweenTTTTTTTCCGCGmotifandthreeknownACSs . . . . . . . . . . . . . . . 59
4.2 Three different views of rebuilt DNA structures and their bend angles for core
motifsites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
viii
B.1 Performancecomparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
B.2 Performancecomparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
B.3 Summaryoftheweighted-averageperformancesfordifferentmodels. . . . . . . . 86
B.4 Meansquarederror(MSE)fordifferentTFdatasetsfordifferentmodels.. . . . . . 87
B.5 MSEdecreasesasN increasesin1mer+Nshapemodels. . . . . . . . . . . . . . . . 88
B.6 Meansquarederror(MSE)fordifferentTFdatasetsfordifferentmodels.. . . . . . 89
B.7 Principal component analysis (PCA) reveals different DNAbinding specificities
withinandbetweenTFfamilies(intra-vs. inter-family). . . . . . . . . . . . . . . 90
B.8 ComparisonbetweenfeaturequantityandR
2
performanceinthreedifferentmodel
settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
B.9 Principal component analysis (PCA) reveals different DNAbinding specificities
withinandbetweenTFfamilies(intra-vs. inter-family). . . . . . . . . . . . . . . 92
B.10 R
2
performancecomparing1mermodelsaugmentedbyDynaSeq
ave
(averagedver-
sion)shapefeaturesand1mer+shapemodels. . . . . . . . . . . . . . . . . . . . . 93
B.11 R
2
performance comparing between 1mer models augmented by shape features
derived from DynaSeq
ave
(averaged version) and DynaSeq
ens
(ensemble version)
and3mer-basedmodels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
D.1 DiagramofresidualblockfortheResNetarchitechture. . . . . . . . . . . . . . . 96
D.2 Diagramofthereal-timedataaugmentation. . . . . . . . . . . . . . . . . . . . . . 97
D.3 Workflowofthebaselinesequencemodel. . . . . . . . . . . . . . . . . . . . . . . 97
D.4 Learningcurve(R
2
)forP53fromSELEX-seqdataset. . . . . . . . . . . . . . . . 99
D.5 Learningcurve(MSE)forP53fromSELEX-seqdataset. . . . . . . . . . . . . . . 100
D.6 Spatial geometyr model compared to the baseline sequence CNN model with dif-
ferenthyper-parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
D.7 IllustrationofthemethodforartificallybendingDNA. . . . . . . . . . . . . . . . 102
D.8 Predicted binding affinity versus the experimental binding affinity on the test data
ofP53. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
ix
Abstract
Uncoveringthemechanismsthataffectthebindingspecificityoftranscriptionfactors(TFs)iscrit-
ical for understanding the principles of gene regulation. Although sequence-based models have
beenusedsuccessfullytopredictTFbindingspecificities,IfoundthatincludingDNAshapeinfor-
mationinthesemodelsimprovedtheiraccuracyandinterpretability. Previously,ourlabdeveloped
a method for modeling DNA binding specificities based on DNA shape features extracted from
Monte Carlo (MC) simulations. Prediction accuracies of these models, however, have not yet
been compared to accuracies of models incorporating DNA shape information extracted from X-
ray crystallography (XRC) data or Molecular Dynamics (MD) simulations. In this dissertation, I
integratedDNAshapeinformationextractedfromMCorMDsimulationsandXRCdataintopre-
dictivemodelsofTFbindingandcomparedtheirperformance. Modelsthatincorporatedstructural
informationconsistentlyshowedimprovedperformanceoversequence-basedmodelsregardlessof
datasource. Furthermore, IderivedandvalidatednineadditionalDNAshapefeaturesbeyondour
originalsetoffourfeatures. Theexpandedrepertoireof13distinctDNAshapefeatures,including
six intra-base pair and six inter-base pair parameters and minor groove width, is available in the
R/Bioconductor package DNAshapeR from our lab and enables a comprehensive structural de-
scriptionofthedoublehelixonagenome-widescale. Inaddition,toimprovetheperformanceand
interpretability even more, I propose a brand-new 3D geometrical model to replace methods that
aremajorlyrelyingonone-hotencodingofDNA,whichisbasedoncategoricalcharacters,andin
practice,requiresatonofhyperparametersearchtofit. Themodelisgreatlybenefitedbythenatu-
raldataaugmentationthusnotrequiredtotuneitshyperparameters. Toillustratetheeffectiveness
ofthismodel,IusedthesamebutclassicTF-DNAbindingspecificitypredictionasthebenchmark.
The result shows a superior performance and easy-learnability compared to sequence-based deep
x
learning model and provides a more natural display on the models interpretability. In addition, as
ashowcase,Ipresentthattheframeworkhasitsarchitecturalpotentialsindealingwithdeformed
ordynamicDNAscomparedtoanysequence-basedmodels.
xi
Chapter1
Introduction
This chapter talks about motivation, background and applications of my work, and reviews previ-
ousresearchinthefield.
1.1 Motivations
Ever since DNA (Deoxyribonucleic acid) was discovered, the functions of DNA have been a fan-
tastic topic to study. People firstly noted that DNA directly encoded RNA that can be used to
produce proteins. However, especially in eukaryotes genome, there is a tremendous amount of
DNA that doesnt directly encode any proteins. There are only 20000 proteins encoded in the
human genome, where they only consist of less than 2% of the whole genome, while 98% are
non-coding DNAs. Non-coding DNAs don’t code proteins but some of them determine where
transcriptionfactorsbind. Transcriptionfactors(TFs)areproteinsthatplayimportantrolesinreg-
ulatinggeneexpressionineukaryotes genome, servingasactivatorsorrepressors. ATFregulates
genesexpressionbybindingtoaspecificnon-codingDNAsequence. Asfarasweknow,thereare
about1600TFsinhumansgenome[1]. Therefore,itisextremelyimportanttostudyTFandDNA
bindingmechanisms. StudyingTFandDNAbindingisfundamentallyintriguinginunderstanding
how eukaryotesgenome works. In addition, some TF mutationscan causediseases. Forexample,
mutations of TBX15 gene, which encodes T-box transcription factor TBX15, can cause Cousin
Syndrome[2]. HeterozygousdeletionofZIC1andZIC4isrelatedtoDandy-Walkermalformation
1
[3]. Understandingwhatandhowthemutationcausingthediseaseisalsocrucialthatmayleadto
possibletherapy.
1.2 ThebasicsofDNA,TF,andTF-DNAbinding
DNAconsistsoftwopolynucleotidechainswhichformadoublehelixstructurebycoilingaround
eachother. Onenucleotideconsistsofonenucleobase,adeoxyribosesugarandaphosphategroup.
Thegeneticinformationisstoredinthefourbasicnucleobases. Theyareadenine(A),cytosine(C),
guanine (G) and thymine (T) [Figure 1.1]. A and G are purines, while C and T are pyrimidines.
Hydrogen bonds between A and T, or C and G are key forces to stabilize the double helix struc-
ture. Another force to stabilize the double helix is the base-stacking interactions among aromatic
nucleobases. Major groove and minor groove can be seen from the side of DNA repeatedly. Due
tothesetwostrongforcesandtheintrinsicchemicalstructures,DNAstructuresareveryconserved
compared to RNA and proteins. However, there are still variations in DNA structures. For exam-
ple,conformationsofDNAincluderight-handedA-DNAandB-DNAandleft-handedZ-DNA[4].
ChemicalmodificationscanalsohappenonDNAmolecules,suchasDNAmethylation. Cytosine
can be methylated to become 5-methylcytosine, which is prone to deamination that transforms
itself into thymine, thus a mutation. DNA methylation also has its biological function such as
X-inactivationofchromosomes[5]. BecauseDNAiscomposedofbasicandstandardnucleotides,
DNAstructuresaresequence-dependentwhennotboundtoothermolecules. However,molecules
are not static in solvent, a way to describe DNA structure is needed. Since the nature of DNA
structure is rigid and follows the double helix rule, one can define shape parameters about inter-
and intra-base pair spatial placements, as along the placement of the axes and backbone torsion
angle[6]. Inthisthesis,Iexpandedtheprevioushigh-throughputpredictionofDNAshapeparam-
eterstocover13DNAshapesincludingalltheinter-andintra-basepairfeatures. Byusingallthe
inter- and intra-base pair shape features, one can place base pairs onto a 3D space with standard
DNAstructureasreference.
2
Figure1.1: Chemicalstructuresfor4regularnucleobases.
Ingeneral,DNAstoresgeneticinformation. AverysmallamountofDNAinmosteukaryotes
isusedtotranscribeintomRNAthentranslateintoproteins,andothersasnon-codingDNAsserve
their genetic information differently. Rather than directly coding proteins, DNA can interact with
DNA-binding proteins, which is a process that regulates the activity of genome in different cell
linesanddifferentcellstages. AmongDNA-bindingproteins,onefundamentalfamilyofproteins
calledhistones arepresented in eukaryotic cell neclei, servingas spools forDNAtowrap around.
Other DNA-binding proteins that can adjust gene transcription, whether upwards to downwards,
can be called transcription factors (TFs). Their function relies on the binding to specific DNA
sequences.
To investigate the mechanisms of TF and DNA binding, it is important to understand the
physicsandchemistryofmolecularinteraction. Attractiveandrepulsiveforcesexistbetweennon-
bonded atoms when they are in a certain distance. Electrostatic interactions and Van der Waals
forces are the main definitors in non-bonded interactions. Due to atoms bonding, charges are un-
evenlydistributedonmoleculessurface,whichresultsinpositivelychargedandnegativelycharged
atoms. TheyareattractedbyeachothergivenCoulombpotential. VanderWaalsforcecoversthose
3
that are not categorized into electrostatic potential, including dispersion, repulsion and induction
andsoon[7]. SeeFigure1.2assampleforaprotein-DNAboundcomplex.
Figure 1.2: Cartoon representation of FIS and DNA binding structure. PDB ID: 5DS9 (Note that
FISbindinginducesbendingonDNA.)
Despite interactions of TF and DNA are for lowering the systematic energy, TF tends to read
DNAstructureintwoways,sequencereadoutandshapereadout[8]. Sequencereadoutmeansthe
TF directly contacts the exposed atoms of the nucleobases, while shape readout is a more struc-
tural readout that a TF prefers a specific conformation of DNA, such as a narrower minor groove.
Different TFs have different DNA readout mechanisms. Transcription factors are usually catego-
rized into different families by their intrinsic structures, gene functions and binding mechanisms,
suchasbasichelix-loop-helix(bHLH),basic-leucinezipper(bZIP),helix-turn-helix(HTH),home-
odomain proteins, zinc fingers and so on. bHLH consists of two α-helices connected by a loop
and favors to be functional as dimers. TFs studied in this thesis include MAX, C-Myc and more.
bZIP is another large family that forms dimer to bind DNA. The DNA binding region consists of
basicaminoacidssuchasarginineandlysine[9]. HTHbindsthemajorgrooveofDNAandincor-
poratestwoα helicesjoinedbyashortstrandofaminoacids. Proteinsfromhomeodomainfamily
4
are encoded by homeobox, a 180bp long DNA sequence and play important roles in anatomical
developmentinnumerouseukaryotes[10]. Zincfingersarecharacterizedbythezincionsinvolved
instabilizingthesmallprotein. Subclassesinzincfingershavedifferentstructures[11]. Cys2His2
(C2H2) zinc finger is one of the most well-studied subclass [12]. C2H2 is a sequence-specific
DNAbindingproteinthatreadsdirectlyintomajorgrooveofDNA[13].
Furthermore, TF may prefer a DNA motif in vitro and binds to it with high affinity, it may
notbindtoeverylocusofthissequenceintheeukaryotesgenomes,duetothecomplexityinvivo.
In nucleus of eukaryoktes living cells, DNAs are wrapper around histones, forming nucleosomes
[14] as the unit of chromatin [15]. This structure allows DNA to be stored in a super compressed
manner. LocaladjustmentstochromatincompactionandaccessibilityaremadetofreeDNAfrom
thiscondensedstructuretobeactivelytranscribed. Thenumberoffactorsinvolvedintheseadjust-
ments are intriguing but still not fully understood. Specific modifications can happen on histones
or DNAs, which are relevant to cell reprogramming, gene silencing and cancer development [16,
17]. These epigenetic modifications can impact TF-DNA binding in vivo or vice versa [18–24].
Chromatin remodeling events can be initiated by TFs called pioneer factors [25–27], which may
enablecellularreprogramming. Onadifferentscale,duetotheeffectsofDNAwrappingandcon-
densation, the 3D chromosome architecture (3D genome) is generally kept in a cell-type-specific
manner,whichalsocontributestothelocusselectionofinvivoTF-DNAbinding[28].
1.3 Thebasicsofmachinelearninganddeeplearning
Statisticalmachinelearninghasbecomeevenmorepopularincomputationalbiologysincenewer
high-throughput data is available. Machine learning methods can be categorized into supervised
and unsupervised learning. Supervised machine learning is statistical model that can be fitted by
a training dataset. The training data should contain the input and output for the model. These
statistical models generalize some aspects of features or knowledge from the training data. This
generalization,whenfittedproperly,canpredictoutputsfromnewdatawithgoodaccuracyorlow
5
variance. Traditional supervised machine learning methods include bayes inference [29], linear
regression, logistic regression [30], decision tree [31], neural network [32] and more. The output
from a machine learning model can be labeled for classification, one numerical value or multi-
ple/manynumericalvalues. Unsupervisedmachinelearningisstatisticalmodelthatprocessesdata
withoutoutputlabelsorvalues. Popularunsupervisedmachinelearningmethodsincludegaussian
mixturemodels,k-meanclusteringandmore[33].
One big challenge of traditional machine learning is feature engineering [34]. Data is not al-
ways linear and sometimes not numerical. The input data must be altered according to domain
knowledge to make it machine processable. A better engineered feature vector is usually benefi-
cial. In Chapter 2, the project I published in Nucleic Acids Research [35] involves clever feature
engineering that improves the generalization of a simple linear regression model. If feature engi-
neeringisnotoptionaloronewantstoimprovetheperformanceofthemodelevenmore,akernel
function to expand feature vectors or introducing higher-order features can be adopted. However,
they are generally time-consuming and may not always be the best choice. The higher-order fea-
turesmaynothavetheinterpretableability.
Deep learning [36] is a special case of machine learning that contains extensive layers of ar-
tificial neural networks. Note that in general, the number of parameters in deep learning network
are larger than the instances in training and testing datasets. Deep learning architectures almost
eliminate the requirement of feature engineering by translating the input into intermediate repre-
sentations. Deep learning is an area that has been rapidly evolving. Older deep learning architec-
tures have been replaced in a matter of months by newer, more complex models. Deep learning
is capable of processing 1D, 2D and 3D or more dimensions data as inputs. The most common
applicationsofdeeplearningareusedtoprocess2Dimageswithcolorsasfeaturechannels. Well-
engineered deep learning models not only can predict image labels [37, 38], recognize 3D shapes
[39–42],butalsoplaygames[43–46],andevenpredictproteinfolding[47–50]. Simplermachine
learning models using features built by domain knowledge cannot compete with deep learning on
6
performance. However, drawbacksfordeep learning modeldoexist. Deeplearninghasbeencrit-
icized as a black-box model since the interpretability of the model is very hard [51]. Features are
abstracted from layers to layers that transform linearity into non-linearity. Models have been pro-
posed to tackle this problem [52–54]. In Chapter 3, I will build a spatial geometry deep learning
modelfromscratchthatcantakeDNAstructuresasinput.
1.4 ExperimentalmethodsonstudyingTF-DNAbinding
No computational analyses and models are valid without experimental data available, especially
those generating high-throughput data. Experiments can be divided into two major categories, in
vitroandinvivo. InvitromethodsfocusonrevealingtheintrinsicTF-DNAbindingpreferenceina
freeenvironmentwithouttheconstraintsofhistones,celltypeandmore,inalargescaleandhigh-
throughputmanner,whileinvivomethodsfocusonbindinginarealisticenvironment. Toexamine
in vivo TF-DNA binding events, chromatin immunoprecipitation coupled with high-throughput
sequencing (ChIP-seq) is the main method [55]. Firstly, genomic regions are extracted through
immunoprecipitation. Then, with the help of next generation sequencing, a massive quantity of
sequences from the genomic regions are collected and verified. Modified ChIP-seq were also
invented to improve the identification of binding sites at single-nucleotide resolution [56, 57].
With the massive sequencing data, popular software such as MEME-ChIP [58] is used at last to
locatetheTFbindingregionsonreferencegenome. InvivoTF-DNAbindingisgreatlyaffectedby
thecelltype,cellularenvironmentandcellstageandrelieslessonitsintrinsicbindingpreference.
In this thesis, I will focus on high-throughput data acquired from in vitro experiments. There are
several experimental methods available to separate DNA sequences with higher binding affinity
from a pool of DNA sequences. Protein Binding Microarray (PBM) is one of the methods to
capture TF-DNA binding in a high-throughput fashion [59, 60], prior to the advancement of next
generationsequencing. Custom-designedmicrochipsattachedwithhighdensityofDNAsareused
to measure TF-DNA binding specificity. Another method, called systematic evolution of ligands
7
byexponentialenrichment(SELEX),hasbeendevelopingfordecadestoaccommodatethenewest
technology advancement in sequencing [61–63]. By repeatedly selecting DNAs with higher TF
bindingaffinityroundbyroundandthengoingthroughnext-generationsequencing(SELEX-seq),
thesemethodsusuallyprovideamassivedatasettoanalyze. Qualitycontrolandpost-hocanalysis
software like [64] is usually then used to filter the sequencing data and generate datasets that can
beusedforlateranalysis.
1.5 ComputationalmethodsonstudyingTF-DNAbinding
With experimental datasets available, methods and models have been published to identify the
binding preference of each TF. At first, models like the position weight matrices (PWM) were
introduced to annotate the motif sequence a TF prefers. PWM uses sequence logo to identify
binding preference at the binding site. Bigger base (A,C,G,T) logo means a more favorable base
pair at this position [Figure 1.3]. PWM is straightforward, but it lacks the ability to produce a
meaningful logo for TFs that are favorable at certain DNA shapes. Flanking regions outside the
core binding sites also impair binding affinities [65], but flanking regions are usually omitted by
PWM. To compensate the disadvantages of PWM, multiple methods have been proposed. Di-
nucleotides can be used to replace single nucleotides in the sequence logo [66]. However, di-
nucleotidesareashapeparameterfromanotherpointofviewbecausethebasepairstackingeffect
is di-nucleotide-dependent. In this thesis, I also analyzed this di-nucleotide model and tested the
ability of the di-nucleotides model in explaining all DNA shape features. If one wants to improve
the predictability of the model, using statistical machine learning to replace simple statistics such
as PWM might be a feasible way. Standard B-DNAs can be encoded as simple as 4 digits of 0 or
1,whereeachpositionrepresentsthebaseA,C,GorT.Usingsimplelinearregressionmodel,one
canachievegoodpredictabilityofTF-DNAbindingspecificity.
As mentioned above, DNA shape features are used to numericize DNA structures as values.
Ourlabproposedahigh-throughputpredictionmethodof4DNAshapefeatures,calledDNAShape
8
Figure1.3: PWMforMAXproteinDNAbindingmotifproducedbyJASPAR2020[67]
[68]. The four DNA shape features are minor groove width (MGW), roll (Roll), propeller twist
(ProT), helix twist (HelT). Roll and HelT describe two rotating angles between two immediately
closed base pairs. ProT describes one rotating angle within one base pair for the two different
bases. For a detailed visualization, see Figure 2.1 from Chapter 2. To utlize the high-throughput
prediction of DNA shape features, Rohs lab established a framework to improve the performance
of traditional DNA sequence-only machine learning model by incorporating DNA shape features
[69]. Other machine learning models including deep learning have been also applied to work on
TF-DNA binding data, either from in vivo or in vitro [35, 70–77], where the objectives are not
limitedtopredictTF-DNAbindingspecificity.
1.6 Overviewofthethesis
In this thesis, I will start from high-throughput DNA shape features prediction, comparing DNA
shape features acquired from different sources and then investigate a spatial geometry deep learn-
ing model for DNA structures as a natural expansion. Chapter 2 was published in Nucleic Acids
Research [35]. In Chapter 2, I will go through the original DNAShape method and expand the
method to include all inter- and intra-base pair shape features in addition to minor groove width.
I will then compare the additional shape features from MC simulations with MD simulations and
X-ray crystallography data. In the end, I will show the performance improvement after adding
standard deviation values for all predicted DNA shape features. Chapter 2 completely expanded
theoriginalDNAShapemethodandgaveathoroughcomparisonbetweendifferentdatasources.
9
InChapter3,IusedtheexpandedDNAShapetohigh-throughputgenerateDNAstructureswith
the help of 3DNA rebuild program. I construct a spatial geometry deep learning model that can
processstandardDNAstructuresasinput. Themodelonlyrequiresthecoordinatesofheavyatoms
in DNA and has data augmentation built-in to be translational and rotational invariant. The new
spatial geometry deep learning model is easier to train and has less parameters to tune to beat the
baseline sequence deep learning model. Intermediate layers in the spatial geometry model embed
theinputstructuralfeatures. ThemodelisalsoabletoprocesspartialinputssuchasDNAbasepair
coordinatesandDNAbackbonecoordinates. ProcessingDNAbackboneasinputprovidesinsights
on the importance of backbone in TF-DNA binding specificity for different TF families. The
modelisalsocapableofprocessingdeformedDNAasinput, sincethemodeltakesincoordinates
of heavy atoms, despite any confirmations of DNA can be. To provide a showcase for the model
as a potential application, I simulate two scenarios that the model may be taken to. The result
showsthatthemodelcandistinguishdifferentconformationsofthesamesequenceinput,whichis
a scenario that a sequence model finds a hard time at. The model also can potentially distinguish
pioneerfactorsfromthenon-pioneerfactors.
InChapter4,asasideprojectbutcontributedtoapaperpublishedinNatureCommunications
[78], I applied the high-throughput prediction of addtional DNA shape features and use them to
answer why a mutation in the yeast Origin Recognition Complex (ORC) dramatically changes its
DNAbindingpreference.
In the end, I summarize all chapters, illustrate the applications, and propose future research
plans.
10
Chapter2
ExpandingtherepertoireofDNAshapefeaturesfor
genome-scalestudiesoftranscriptionfactorbinding
withcollaborationofJaredSagendord,Tsu-PeiChiu,MarcoPasi,AlbertoPerezandRemoRohs
2.1 ChapterBackground
The binding of transcription factors (TFs) to DNA is a fundamental and crucial step for gene
regulation. However, many mechanisms involving TF binding are still unknown [79, 80]. Basic
questionsremainastohowaTFselectivelyrecognizesandbindstospecificDNAsequences. Ex-
perimentalprotocolshavebeenestablishedtomeasureTFDNAbindingspecificitiesquantitatively
andtounderstandtheunderlyingmechanisms. Forexample,protein-bindingmicroarrays(PBMs)
[59]andsystematicevolutionofligandsbyexponentialenrichmentcombinedwithmassivelypar-
allelsequencing(SELEX-seq)[81]orhigh-throughputSELEX(HT-SELEX)[82]arewidelyused
methodstoprobeinvitroTFDNAbindingquantitatively[83].
To analyze and interpret the large amounts of data that are obtained from high-throughput
(HT) binding assays, researchers have proposed various models for TFDNA binding specificities,
the most widely used of which is the position weight matrix (PWM) [84]. PWMs use position
frequency matrices, created by counting the probability that a nucleotide will occur at each in-
dividual position of the binding site, while ignoring dependencies between nucleotide positions
11
[84]. This approach has been successfully used to approximate TFDNA binding events and has
been expanded to include interdependencies between adjacent nucleotides in predicting TFDNA
binding [85–87]. Despite its success, researchers have pointed out limitations of the PWM model
[88],leadingtothedevelopmentofalternativeapproaches[89].
One alternative representation of interdependencies between nucleotide positions considers
thatTFsnotonlyrecognizeDNAsequencesbase-by-basethroughhydrogenbondsorotheramino
acidcontactsbutalsofavorcertainDNAconformations[83]. Manystudieshaveverifiedtherecog-
nition of three-dimensional DNA structure for diverse TF families [69, 90–102]. DNA conforma-
tioncanberepresentedbyDNAshapefeatures[68],whichassignnumericalvaluestotranslations
androtationsbetweenandwithinbasepairs(bp)[103]ormeasuregroovewidthbetweenopposite
phosphodiester backbones [8]. These parameters can be calculated by different software tools,
such as CURVES or 3DNA [6, 104], and incorporated into quantitative models [69]. Information
on DNA shape can be extracted from either experimental studies or molecular simulations [105].
For experimentally solved structures, X-ray crystallography (XRC) data provide much larger se-
quence coverage than nuclear magnetic resonance (NMR) spectroscopy data. Molecular simula-
tions, such as Monte Carlo (MC) or Molecular Dynamics (MD) simulations [106–108], likewise
provide a wide sequence coverage. However, a HT method is needed to obtain shape information
forlargenumbersofsequencesofarbitrarylengthorentiregenomes.
In previous work, our lab used a sliding-window approach together with a query table for all
unique pentamers generated from MC simulations to produce a predictive HT method of DNA
shape [68]. Using this method, anyone is able to predict four DNA shape features: minor groove
width (MGW), propeller twist (ProT), helix twist (HelT) and Roll. Experimental structures have
indicated that these four DNA shape features are among the most important structural character-
istics for evaluating proteinDNA readout modes [8, 87, 109–112]. However, given the intricacies
of DNA structure, it could be hypothesized that these four DNA shape features might not suffice
in describing the entire set of DNA recognition mechanisms. Therefore, I derived and validated
12
nine additional DNA shape parameters, expanding our available repertoire to a total set of 13 fea-
tures: sixinter-bpparameters,sixintra-bpparametersandMGW(Figure2.1). Ievaluatedwhether
including 13 shape features in our quantitative models would enhance the accuracy of TF binding
predictionsbeyondpriormodelsbasedonjustfourshapefeatures[69].
A
B
C
D
Molecular Dynamics (MD) simulation
Monte Carlo (MC) simulation
DNA shape features derived from
X-ray crystallography (XRC) Experiment
Computation
Minor groove width
(MGW)
Inter-base pair
Intra-base pair
Shift Rise
Tilt
Slide
Roll Helix Twist (HelT)
Shear Stagger
Buckle
Stretch
Propeller Twist (ProT) Opening
features
features
Figure 2.1: Introduction of an expanded repertoire of 13 DNA shape features. (A) We used
three methods, including MC and MD simulations and XRC, to derive DNA shape features. (B)
SchematicrepresentationofaDNAfragment(PDBID:1BNAtakenfromtheProteinDataBank)
with definition of MGW, inter-bp and intra-bp parameters. (C) Schematic representation of all
inter-bp DNA shape features used in this research. Each long brick represents a bp. Translations
androtationsofbpareshownintopandbottomrow,respectively. (D)Schematicrepresentationof
allintra-bpDNAshapefeaturesusedinthisresearch. Eachshortbrickrepresentsabase. Transla-
tionsandrotationsofbasesareshownintopandbottomrow,respectively.
To evaluate the contributions of these additional shape features, comparing model perfor-
mancesusingMC-derivedshapefeaturestomodelsbasedonfeaturesderivedfromothermethods
13
is needed. With the recent publication of 1-µs MD simulations of all unique tetramers [113], my
collaborator were able to extract the same DNA shape information from MD simulations and to
comparemodelsusingtheseMDdatatomodelsincludingequivalentdataextractedfromMCsim-
ulations. Considering that MC and MD simulations are both computational prediction methods,
I added an experimental reference and extracted DNA shape features from XRC data available in
theProteinDataBank(PDB)[114]. AlthoughtheNucleicAcidDatabase[115]containsthesame
structures for nucleic acids, I selected the reported PDB identifiers for consistency with previous
work [68]. To summarize, DNA shape features used in this study were derived from MC or MD
simulationsandXRCexperiments.
BecauseDNAshapefeatureswerederivedfromvarioussources,Ishowedthevalueofthisad-
ditionalinformationinstudyingseveralbiologicalquestions,whichareroutinelyaddressedbased
on DNA sequence methods [116]. Previously, people found that using shape information from
MC simulations improved prediction accuracy and reduced computational complexity compared
to k-mer models [69, 95, 117]. I sought to verify the robustness of models using additional shape
featuresandtocomparetheperformancesofmodelsbasedonMC-derivedshapefeaturestomod-
elsusingMDandXRC-derivedshapefeatures. Therefore,Ievaluatedandcomparedperformances
of regression models for TF binding specificity predictions incorporating 13 DNA shape feature
categoriesinanalogytopreviousmodelsusingjustfourshapecategories[69,92].
2.2 MaterialsandMethods
2.2.1 Datasetselectionandpreprocessing
I used three different datasets, which were designed for detecting TFDNA binding specificities
and were derived from different experimental platforms: genomic-context PBM (gcPBM) [87],
HT-SELEX [82] and SELEX-seq [81]. The gcPBM data used here contained data for the human
basic helix-loop-helix (bHLH) TF dimers Mad1/Max (Mad), Max/Max (Max) and c-Myc/Max
(Myc) (available in the Gene Expression Omnibus [GEO] under accession number GSE59845)
14
[69]. The SELEX-seq data contained 21 different datasets for Drosophila Hox TFs in complex
with their cofactor Extradenticle (Exd) (available under GEO accession number GSE65073) [81].
The HT-SELEX data included 240 datasets for 215 TFs from 27 protein families (available in the
EuropeanNucleotideArchive[ENA]understudyidentifierPRJEB14744)[95].
I preprocessed raw data from gcPBM and SELEX-seq experiments using methods from [69],
and HT-SELEX data using methods from [95], which essentially involved PWM-based sequence
alignment and trimming. After preprocessing, I collected more than 10 000 sequences per TF
for gcPBM data, and thousands of sequences per TF for SELEX-seq and HT-SELEX data. Each
sequence was assigned a relative binding affinity score, which was measured in the respective
experiment. Log-affinityscoreswereusedinthisstudy.
2.2.2 DNAshapepredictionandquerytablegeneration
I sought to use DNA shape information derived from three different data sources (MC and MD
simulationsandXRCexperiments)inregressionalgorithmsforpredictingTFDNAbindingspeci-
ficities. DifferentHTpredictionsofDNAshapefeatureswereobtainedbygeneratingquerytables
from different data sources. Our previous work used a pentamer query table generated from MC
data [68]; however, the available MD data did not cover all 512 unique pentamers. To ensure
complete sequence coverage and to enable comparisons between data sources, I generated three
tetramer query tables from (i) the 1-µs MD data [113], (ii) an XRC dataset used for validation
in our previous work [68] (Table S2) and (iii) MC data [68]. The new MC tetramer query table
wasgeneratedbasedonthesameMCsimulationdataasusedinourpreviouslygeneratedpentamer
querytable(seeAppendixSupplementaryMaterialsandMethodsAand[68]fordetailsontheMC
simulation protocol), by mining data for all tetramers instead of pentamers. For MD data, simu-
lations were available for 39 oligomers with sequences designed so that all 136 unique tetramers
werecoveredwithanaverageoccurrenceof3.9(seeAppendixSupplementaryMaterialsandMeth-
ods A and [113] for details on the MD simulation protocol). The XRC data provided an average
occurrence of each tetramer of 44.6, and the MC data provided an average coverage of 249.8 for
15
eachuniquetetramer. AlthoughthePDBcontainsadditionalDNAstructures,weusedtheoriginal
XRCdatasettobeconsistentwithouroriginalDNAshapemethod[68]. XRCdatawerefilteredfor
unusual deformations, and chemically modified structures were removed as previously described
[68]. ThedifferentcoverageoftetramersintheMD,XRCandMCdatasetsindicateswhyonlyour
MC-basedDNAshapemethodprovidedsufficientcoverageforall512uniquepentamers[68].
Eachquerytableprovided13DNAshapefeatures,includingsixinter-bporbp-stepparameters
(HelT, Rise, Roll, Shift, Slide and Tilt), six intra-bp or bp parameters (Buckle, Opening, ProT,
Shear,StaggerandStretch),andMGW.Iimplementedaslidingwindowalgorithm[68]thatusesa
slidingpentamerwindowtoacquirenumericalshapevaluesfromtheMC-derivedpentamerquery
table (available for download at http://rohslab.usc.edu/DNAshape+/) and to combine the
values into a feature vector. For example, considering a sequence of length n, when the sliding
window begins at position i, I use the query table and find the shape feature values for position
i+2(forbpparameters)andpositionsi+1andi+2(forbp-stepparameters)usingthefirstpentamer.
Then, we move the sliding window to position i+1 and use the next pentamer until we reach the
end of the query sequence. I determine bp-step parameters from overlapping pentamers using the
arithmeticaverageofvaluesderivedfromtwoadjacentpentamers. Thealgorithmworkssimilarly
on tetramer query tables based on a tetramer query window, with the exception that a tetramer is
assignedtwocentralbpparametersandonecentralbp-stepparameter.
2.2.3 IntroductionofadditionalDNAshapefeatures
IvalidatedthenineadditionalDNAshapefeaturesthatwereintroducedinthisstudybycomparing
theHTpredictionswithequivalentfeaturesderivedfromXRC,followingaprotocoldescribedfor
the DNAshape method [68]. I calculated Spearmans rank correlation coefficients for the compar-
ison with experimental data (Appendix Table C.1). Whereas the MC and MD simulations were
performedforunboundDNAfragments,theXRCdatasetincludedDNAconformationsfrompro-
teinDNA complexes due to the scarcity of experimental structures for free DNA molecules [105].
I removed structures with deformations due to crystal packing effects and other deformations, as
16
previously described [68]. I chose to use Spearmans rank correlation as a criterion because it
captures the pattern between minima and maxima in shape features and is less sensitive to fluctu-
ations in actual values of shape parameters, which can be affected by crystal packing artifacts or
protein-induceddeformations[68].
The general concept of DNA shape includes conformational flexibility. Certain DNA se-
quences are obviously more flexible than others, which can influence TF binding. To capture
this effect, I used the standard deviation (SD) of each DNA shape feature as an approximation of
DNA flexibility. Instead of calculating SD values along a simulation trajectory for the single oc-
currenceofapentamer,IpooledallofthepentamersofthesameidentitytogethertogenerateSD
values. This approach has the limitation that not all pentamers occurred with the same frequency
in our dataset. Nevertheless, the approach provided a single SD value for each shape feature in
a given sequence environment, independent of different occurrences of a pentamer or tetramer in
our dataset. For each bp-step parameter, one SD value was derived to prevent averaging between
twoSDvalues.
2.2.4 Implementingstatisticalregressionmodels
Feature vectors for a given sequence were generated in the following manner. Each type of nu-
cleotidewasassignedoneoffourbinaryvariablevectors(mononucleotideor1mermodel): Awas
encoded as 1 0 0 0, C as 0 1 0 0, G as 0 0 1 0, and T as 0 0 0 1 (Figure 2.2). Following the same
scheme, dinucleotides were encoded by one of 16 binary vectors (dinucleotide or 2mer model):
AAwasencodedas1followedby150valuesandsoon. Ingeneral,ak-mermodelrequiresO(4
k
)
binaryfeaturestoencodethesequence(seeAppendixSupplementaryMaterialsandMethodsA).
DNA shape features were predicted with our DNAshape method [68], normalized and con-
catenated with the encoded sequence feature into a feature vector (see Figure 2.2 and Appendix
Supplementary Materials and Methods A). For each TF dataset, which contains a list of aligned
sequences and their corresponding binding affinity scores, each sequence was used to create a
representative feature vector. Based on the resulting feature matrix and corresponding binding
17
Final feature vector
Encoding
Final feature matrix
1.0
0.5
0.3
TACTATT
TACTGTT
GGGTATA
TACTATT
DNAshapeR
Encode
sequence
Encode
shape
...
...
...
...
...
...
1-mer 2-mer 3-mer
MGW
Shear
Stretch
Stagger
Buckle
ProT
Opening
Shift
Slide
Rise
Tilt
Roll
HelT
L2 regularized MLR
Relative binding affinities
Figure 2.2: Work flow to encode DNA sequence and shape, which were used to train L2-
regularized MLR models that predicted TF binding specificities. DNA sequences were encoded
into sequence and shape features. Any combination of sequence and 13 shape features could be
chosen. DNA sequence and corresponding relative binding affinities were acquired from exper-
imental data. Encoded DNA sequences (final feature matrix) and corresponding binding affinity
scoreswereusedintraininganMLRmodel. ToselectL2-regularizationparameters,10-foldcross-
validationwasused. Alldatasetsweredividedinto10-foldtrainingandtestdatasets.
affinity scores, we applied a multiple linear regression (MLR) model with L2 regularization to
preventoverfitting[118]. MLRmodelswithL2regularizationhavethefollowinglossfunctionfor
minimizationsusingaclosed-formsolution:
L =
n
∑
i=1
(y
i
−x
⊺
i
ω)
2
+λ∥ω∥
2
2
(2.1)
where,y
i
representsthei-thobservedbindingaffinityscore,x
i
representsthei-thfeaturevector,ω
representsfeatureweightsandλ istheL2regularizationparameter. TominimizeL,aclosed-form
solutioncanbederivedas:
ˆ ω=(X
⊺
X+λI)
−1
X
⊺
y (2.2)
18
where,I isanidentitymatrixthathasthesizeofthenumberoffeaturesinthefeaturevector. The
L2 regularization parameterλ in loss functionL penalizes largeω values to prevent overfitting.
To apply this regression model, each dataset was separated into 10-fold training and testing data.
Models were trained on 90% of the data and tested on the remaining 10%. This procedure was
repeated ten times until each tenth of the data was assigned to the training data. To select the
best regularization parameter λ, another 10-fold cross validation was performed on each fold of
the training data. After optimizing λ, we applied the model to all training data and calculated
featureweights. Withtheregularizationparameterandweights,Iappliedthemodeltothetestdata
and retrieved predicted binding affinity scores. To improve robustness of the models, rather than
randomlydividingdatainto10-fold,Idivideddatainto10-foldwithsimilarbindingaffinityscore
distributionsbyselectingoneofthe10entriesintoafoldbasedonthesortedbindingaffinity-score
list. EachmodelwastrainedseparatelyforeachindividualTFdatasetbecausethederivedbinding
affinities (representing the response variable) could be incomparable for different experiments or
eachindividualTF.
2.2.5 Predictionaccuracyandperformanceassessment
Accuracies of the predicted binding affinity scores were determined by using the coefficient of
determination(R
2
):
R
2
=1−
∑
i
(y
i
− ˆ y
i
)
2
∑
i
(y
i
− ¯ y)
2
(2.3)
where,y
i
, ˆ y
i
and ¯ yrepresenttheobserved,predictedandaveragedobservedbindingaffinityscores,
respectively. When comparing performances between different sources of DNA shape data, I di-
rectly compared R
2
results to see if there was a significant difference in model performance. In
nearly every dataset, the sample size was much larger than the length of our feature vector; there-
fore, an adjusted R
2
was not introduced. R
2
is an indicator of performance for each TF dataset.
I calculated the weighted-average R
2
, which considers the number of available sequences to de-
riveanoverallperformanceforcertaindatasetsorTFfamilies. Iassumedthatdatasetswithgreater
19
numbersofsequencescouldbeusedtopredictbindingaffinityscoresmoreaccuratelythandatasets
with fewer sequences, assuming that the binding affinity scores were measured with similar sys-
tematicerrors.
In addition to R
2
, mean squared error (MSE) to achieve rigorous model assessment was used.
In regression models, MSE is computed based on the number of statistical degrees of freedom
(sample size minus the number of features) and, therefore, will penalize long feature vectors (see
Appendix Supplementary Materials and Methods A for equation A.2). For example, if model B
hasahigherR
2
butdoesnothavealowerMSEthanmodelA,thenonecannotclaimthatmodelB
performsbetterthanmodelA.
2.3 Results
2.3.1 PerformancecomparisonsofTFbindingmodelsderivedusingstructural
datafromdifferentsources
MGW, ProT, HelT and Roll were used in our previous studies [68, 69] and served as a reference
for comparison of the three new query tables, now tetramer-based and derived from different data
sources. R
2
values for predicted binding scores based on the three tetramer tables (derived from
MC,MDorXRCdata, denoted1mer+4shape
MC4
, 1mer+4shape
MD4
and1mer+4shape
XRC4
mod-
els,respectively)andtheMC-derivedpentamertable(1mer+4DNAshapemodel)comparedtothe
sequence-only 1mer model were plotted for each TF (Figure 2.3). R
2
values for the four models
thatincludedshapefeaturesweresignificantlygreaterthanvaluesforthe1mermodel. Inclusionof
shape information substantially improved R
2
values compared to the 1mer model for the gcPBM
data (bHLH family), SELEX-seq data (homeodomain family), and HT-SELEX data (multiple TF
families).
For each bp, I needed four binary variables to encode the 1mer sequence and four numerical
values to encode shape features. A typical TFDNA binding dataset contains DNA sequences of
12 bp in length. Thus, I performed MLR with 96 features for each dataset. To ensure satisfactory
20
prediction accuracy, large amounts of data were required. The three datasets I used all had more
than 1000 entries available for training; thus, additional filtering was not performed. In practice,
if a dataset had an R
2
value below 0, it was removed. Fortunately, the three datasets that we used
(gcPBM,SELEX-seqandHT-SELEX)allwereofhighquality.
Icomparedperformancesofmodelscombiningthe1mersequencewiththefourshapefeatures
fromdifferentdatasources(Figure2.4AandAppendixFigureB.1). SmalldifferencesinR
2
values
were observed between DNAshape (MC- and pentamer-based), MC, MD and XRC (all tetramer-
based). However,whenIremovedthe1mersequencefeatures,whicharecrucialforpredictiondue
tohydrogenbondsanddirectcontactsbetweenaminoacidsandthebases,largerdifferencesinthe
prediction accuracies of models from different sources started to emerge. An explicit comparison
of R
2
values (Appendix Figure B.2) indicated that XRC data provided lower-quality structural in-
formationthanthetwocomputationalapproaches(MCandMDsimulations). Thisfindingbecame
evenmoreapparentwhenIconsideredtheweightedaverageoverR
2
valuesforeachdatasetbased
on their sequence quantities. This approach allowed us to compare the prediction accuracy across
thethreedatasets(Figure2.4B)oracrossalldifferentTFfamilies(AppendixFigureB.3). Results
of plotting the MSE values of one model against those of another model (Appendix Figure B.4)
confirmedtheseconclusions.
Whereas shape features greatly benefitted models based on gcPBM datasets, even in the ab-
sence of sequence features, this was not the case for SELEX-seq and HT-SELEX datasets. For
these two datasets, using only shape features was not comparable to using only the 1mer se-
quencefeatures. However,the1mersequence+shapemodelscontinuouslyoutperformedthe1mer
sequence-onlymodelsforallthreedatasets,inagreementwithpreviousstudiesusingMCdata[69,
95].
21
0.0
0.5
1.0
0.0 0.5 1.0
1merR
2
1mer+4shape
MD4
R
2
0.0
0.5
1.0
0.0 0.5 1.0
1merR
2
1mer+4shape
MC4
R
2
0.0
0.5
1.0
0.0 0.5 1.0
1merR
2
1mer+4shape
XRC4
R
2
0.0
0.5
1.0
0.0 0.5 1.0
1merR
2
1mer+4DNAshapeR
2
Family
bHLH
bZIP
C2H2
CENPB
CP2
CUT
ETS
Forkhead
GATA
GCM
Homeo−
domain
HSF
IRF
MAD
MEIS
MYB
NFAT
NFI
Nuclear
receptor
PAX
POU
PROX
RRM
RUNX
SAND
TBX
TEA
Dataset
gcPBM
HT−SELEX
SELEX−seq
Figure 2.3: Direct comparison of R
2
values between 1mer+4shape versus 1mer models. As an
indicator of the accuracy of predicted TF binding specificity using the trained MLR model, R
2
was computed on the test dataset. Shape features were predicted based on tetramer query tables
derivedfrom(A)MCdata(1mer+4shape
MC4
model),(B)MDdata(1mer+4shape
MD4
model)and
(C) XRC data (1mer+4shape
XRC4
model) and (D) a pentamer query table derived from MC data
(1mer+4DNAshapemodel). 1merindicatesthatDNAsequenceswereencodedasmononucleotide
occurrences. 1mer+4shape indicates that DNA sequences were encoded as 1mer features aug-
mentedbyfourDNAshapefeatures(MGW,ProT,HelTandRoll).
22
Family
bHLH
bZIP
C2H2
CENPB
CP2
CUT
ETS
Forkhead
GATA
GCM
+RPHRí domain
HSF
IRF
MAD
MEIS
MYB
NFAT
NFI
Nuclear
receptor
PAX
POU
PROX
RRM
RUNX
SAND
TBX
TEA
Dataset
gcPBM
+7í6(/(; 6(/(;íVHT 0.0
0.5
1.0
0.0 0.5 1.0
1mer+4'1$VKDSH R
2
1mer+4Vhape
MD4
R
2
0.0
0.5
1.0
0.0 0.5 1.0
1mer+4'1$VKDSH R
2
1mer+4Vhape
XRC4
R
2
gcPBM +7í6(/(; 6(/(;íVHT PHU 4VhapH ;RC4
4Vh ap H MC4
4DNAV hapH 4VhapH MD4
PHU Vh ap H ; RC4
PHU V hapH MC4
PHU DNAVh apH PHU V hap H MD4
0
0.2
0.4
0.6
0.8
0
0.2
0.4
0.6
0
0.2
0.4
0.6
R
2
A B
Figure 2.4: (A) Direct comparison of R2 values between 1mer+4shape models. Points near the
diagonal suggest similar performance of compared models. (B) Summary of weighted-average
performance for different models. Shape features used here were MGW, ProT, HelT and Roll.
Error bars were calculated based on standard errors of the mean. Performances were divided into
threegroups based on differentexperimentalmethods (see AppendixFigure B.3 forgroups based
onTFfamilies). Identicalcolorwasusedformodelsusingthesamesourceofshapefeatures.
23
2.3.2 Model performance improves as the number of DNA shape features
increases
I explored whether the additional nine DNA shape features (Buckle, Opening, Rise, Shear, Shift,
Slide, Stagger, Stretch and Tilt) improved existing binding specificity models. These DNA struc-
tural features were previously defined [6, 119] but not available in HT predictions [68, 120].
After validating these features using Spearmans rank correlation coefficients (Appendix Table
C.1), I added each of the 13 features to models individually, using the sequence-only model as
a baseline. Figure 2.5A presents the performance differences (in∆R
2
) between the 1mer+1shape,
1mer+2shape and sequence-only 1mer models. By sorting the shape features based on their
average performance with four different data sources (MC, MD, XRC tetramer-based and MC
pentamer-based),Irevealedtheorderofimportanceforeachfeaturewhenaddedtoa1mermodel
(Figure 2.5A). Regardless of which feature I added, the performance of any 1mer+1shape model
was improved compared to the 1mer model. The next-most-informative feature would be the one
whose inclusion in the best 1mer+1shape model provided the greatest gain in performance over
the 1mer+1shape model. The resulting model after adding the second shape feature became the
1mer+2shapemodel(Figure2.5A).Althoughfeatureimportanceamongstructuralparametersvar-
ieddependingontheexperimentaldatasetthatwasused,theeffectofaddinganothershapefeature
intothesequence-onlymodelswasconsistentlybeneficial.
When I continued to add shape features to the best 1mer+2shape model, the average perfor-
mance increased further (shown versus number of shape features in the model in Figure 2.5B).
Here,IusedDNAshapeasourstandarddatasourcetochoosethenext-most-informativefeaturein
each round. Although model performances varied among different experimental datasets, adding
increasinglymoreshapefeaturestothe1mermodelledtoageneralupwardtrendinperformance.
TheMSEresults,representingmodelperformance(AppendixFigureB.5),supportedthisfinding.
OurresultsindicatethatmodelstrainedongcPBMdataoutperformedmodelstrainedonSELEX-
seq and HT-SELEX data (Figure 2.5B; Appendix Figures B.5 and B.6). This finding is not sur-
prising given the higher quality of gcPBM data due to inclusion of 15 bp flanking a binding site
24
5 and 3 of its core [87]. To demonstrate the importance of including information on longer bind-
ing sites or flanking regions, I tested our models on additional data, a widely used universal PBM
(uPBM) dataset, generated by the fifth dialogue for reverse engineering assessments and methods
(DREAM5) [116]. This dataset only contained information on 810 variable bp centered at the
coreofthebindingsite(seeAppendixSupplementaryMaterialsandMethodsA;AppendixFigure
B.6). Therefore, the data were of much lower quality than the SELEX-seq or HT-SELEX-derived
experimentaldata.
It was important to determine whether our inclusion of nine additional shape features in the
models directly influenced the ability to classify different TF families. I selected the top-affinity
bindingsitesinHT-SELEXdataamongmultipleTFfamiliesandencodedthemusingourconcate-
natedfeaturevector. Iappliedprincipalcomponentanalysis(PCA),ratherthananMLRmodel,on
these vectors to evaluate whether the additional shape features could help in separating different
TFfamilies. Theresultsrevealedthatintroducingnineadditionalshapefeatureswasbeneficialfor
classifyingTFfamilies(AppendixFigureB.7).
25
DNAshape MC4 MD4 XRC4
Buckle
Stretch
Tilt
ProT
Rise
Shift
Shear
HelT
Opening
Roll
Stagger
MGW
Slide
0%
5%
10%
0%
5%
10%
0%
5%
10%
0%
5%
10%
Relative R
2
increase
1mer+1shape
1mer+2shape
A B
gcPBM +7í6(/(; 6(/(;íVHT 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0.75
0.80
0.85
0.90
0.60
0.65
0.60
0.65
0.70
0.75
0.80
PHUNVKDSH R
2
'1$VKDSH MC4
MD4
;5& Figure 2.5: (A) Box and dot plots showing performance gain after adding one or two shape
features into the 1mer model. Red dots represent the performance of adding one shape fea-
ture (1mer+1shape model). Box plots illustrate the performance of adding two shape features
(1mer+2shapemodel), includingthefeatureindicatedonthex-axis. Blackdotsareoutliersofthe
box plots. Shape features were sorted based on their average performance of four data sources.
(B)Plotsgeneratedafterrepeatedlyaddingshapefeaturesintothepreviousmodel. Ineachround,
the most-informative feature was added based on performance of using the DNAshape query ta-
ble. Model performance was evaluated by using a weighted mean R2 among all datasets in each
experimental category. Error bars indicate maximum and minimum performance when N shape
features were added. DNAshape is pentamer-based; MD, MC and XRC data are tetramer-based
duetolimitationsinsequencecoveragefortheMDandXRCmethods.
26
2.3.3 Performance comparison of shape-augmented models versus k-mer-
basedmodels
Encoding a sequence of length N into k-mer features requires N 4k features. The number of
requiredfeatureswilldramaticallyincreaseaskincreases,especiallyifkis¿3. Icomparedperfor-
mances of our 1mer+13shape models with those of 2mer (dinucleotide) models. I also compared
performances of the 1mer+13shape models complemented by SDs for every shape feature with
performances of the 3mer (trinucleotide) models (Figure 2.6). Including SDs might be a way to
representDNAconformationalflexibility,whichseemstobeanimportantpropertyofDNAbind-
ingsites.
The number of features used in 1mer+13shape+ SD models (30 features per nucleotide posi-
tion) was still far below the number of features per nucleotide used in 3mer models (64 features
per nucleotide position) (Figure 2.6). Reduction of the computational cost compared to k-mer
sequence models is a major advantage of augmenting the 1mer model with shape features [69].
I analyzed whether, at a similar computational cost, the 1mer+Nshape models can outperform k-
mermodels. Ourstudyrevealedthatthe1mer+13shapemodel(17featurespernucleotideposition)
outperformed2merand3mermodelsforgcPBMdatasets(Figure2.6). Whenconsideringcompu-
tational cost (feature quantity), sequence+shape models performed consistently better (Appendix
FigureB.8). However,1mer+13shapemodelsdidnotoutperform3mermodelsfortheSELEX-seq
and HT-SELEX datasets, due to the lack of DNA shape information at the 5 and 3 terminals of
each sequence. The effect of adding information from a single 3mer at the end wasshownin [95]
to boost the performance of 1mer+shape models. Therefore, if I could find a feasible method to
predict shape features at the terminal ends of DNA sequences, the performance of 1mer+shape
modelswouldbehigher.
The performance also increased when I included SDs in our models, suggesting that SDs can
potentially be used to model DNA flexibility (Figure 2.6). However, I did not compute the SD
valuesfrom thetrajectoryofasimulation(see MethodsandMaterialssection). Therefore, further
improvements in deriving SDs or including DNA flexibility are needed. In general, our findings
27
indicatethatinformationcanbelearnedfromvaryingconformationalflexibilitiesofdifferentDNA
segments,andthatflexibilityisaveryimportantfeatureinproteinDNAbinding. ThePCAresults
furthersupportthisconclusion(AppendixFigureB.9).
28
Dataset
gcPBM
SELEX−seq
HT−SELEX
Seq Only Shape Only Seq+Shape
Performance Features
1mer
2mer
3mer
4shape
4shape+SD
13shape
13shape+SD
1mer+4shape
1mer+4shape+SD
1mer+13shape
1mer+13shape+SD
0.4
0.6
0.8
0
20
40
60
R
2
Number
Figure 2.6: Summary of weighted average performances (R
2
) between different models and the
featurenumbersusedinthemodel. SDwasthestandarddeviationofcorrespondingshapefeature
values. All shape features in this figure were generated from the MC query table. The four shape
features were MGW, ProT, HelT and Roll. The 13 shape features were MGW, six inter-bp and
sixintra-bpparameters. Addingmoreshapefeaturesincreasedthefeaturenumber(computational
cost). See Appendix Figure B.8 for a direct comparison between feature quantity and model per-
formance.
29
2.3.4 DNAshapeRbioconductorpackageforadditionalshapefeatures
TomaketheadditionalDNAshapeinformationbroadlyavailable,mycollaboratorandIextended
ourDNAshapeRpackage[120]builtinR/Bioconductor. Thesoftwarepackageisnownotonlyable
to predict the previously provided four shape features (MGW, Roll, ProT and HelT), but can also
generateDNAshapepredictionsfortheadditionalnineDNAshapefeaturesderivedfromanMC-
derived pentamer query table. To use the additional shape features, the user must input additional
parametersintotheRfunction. ThepackageanditsexpandedmanualareavailableatBioconductor
(http://www.bioconductor.org/packages/devel/bioc/html/DNAshapeR.html)andGitHub
(http://tsupeichiu.github.io/DNAshapeR/). Thesoftwarecanpredict,plot,andencodeval-
uesof13DNAshapefeaturesforanynumberofDNAsequencesorentiregenomes.
2.4 Discussion
In this study, my collaborators and I applied shape-augmented machine-learning models to pre-
dict TF binding specificities and compared the effects of DNA shape features from different data
sources (MC and MD simulations and XRC experimental data). I used three representative ex-
perimental datasets, acquired by gcPBM, SELEX-seq and HT-SELEX HT binding assays. The
datasets contained tens or hundreds of TFs and offered variable data qualities. All datasets had
comparablepredictionaccuracies,asindicatedbypreviousanalyses[69,81,95,118](4,13,19,45).
Our shape-augmented regression models outperformed models without shape information. Re-
gardless of the source, structural information improved the accuracy of predictions of TF binding
specificities. WhenIcombinedsequencewithfourshapefeatures,acquiredfromDNAshape(pen-
tamer),MC(tetramer),MD(tetramer)andXRC(tetramer),allprovidedsignificantimprovements
comparedto 1mer sequence-only models. Giventhe consistencyofour results, I conclude that, at
least in predicting TF binding specificities, adding DNA shape information from any of the MD,
MCorXRCdatasourceswillproduceimprovedquantitativemodels.
30
ItestedthepredictionperformancesofnineadditionalDNAinter-andintra-bpshapeparame-
ters. Theseshapefeatures,predictedbasedonMCsimulation,canbevalidatedonalargenumber
ofXRCstructures. AddingmoreDNAshapefeaturestothefeaturematrixproducedperformance
gains, although saturation was reached for a larger set of shape features. Performance gains were
also visible when more DNA shape features derived from MD simulations or XRC data were
added. Generally speaking, adding four shape features to the sequence-only models is the key
stepinimprovingmodelperformance. Ifdatasetshavealargenumberofsequences, addingmore
shape features to the model is always favorable; otherwise, one should opt to use as small of a
setofshapefeaturesaspossible. Ifcomputationalcostisaconsideration,sequence+shapemodels
are always preferable over k-mer models. Including additional DNA shape features might also
enhanceshape-augmentedthermodynamicmodelingapproaches[121]andmethodsforinvivoTF
bindingsiteprediction[90,94].
IalsoevaluatedthepredictionperformanceafteraddingSDvaluesforeachDNAshapefeature.
SDvalueswerecalculatedforeachpentamerandeachDNAshapecategorybasedonthemultiple
occurrencesofthatpentamerinourMC-deriveddataset. AlthoughourSDvalueswerenotderived
from a simulation trajectory for a single pentamer occurrence, our results showed that using SD
values, which can be considered as a surrogate of DNA flexibility, is a promising approach in
predictingTFDNAbinding. However,oneshouldexercisecautionwhenusingSDvaluesbecause
theyarecurrentlyimpossibletovalidatewithXRCdata. Furtherstudiesarerequiredtoidentifythe
roleofconformationalflexibilityinTFbindingspecificity. Aftersubmissionofthismanuscript,an
article was published that reported the derivation of additional features describing DNA structure
andflexibilityfromMDsimulationsusingadifferentprotocol[122]. Likelyduetothatprotocol’s
limitations(seeAppendixSupplementaryMethodsandMaterialsA),modelsusingthesealternate
MD-derived features [122] were outperformed by the models based on any of the three datasets
usedinthisstudy(MC,MDorXRCdata)(AppendixFiguresB.10andB.11).
A limitation of our expanded repertoire of DNA shape features is that I did not include phos-
phodiester backbone features, which are an important subset of DNA shape features. Backbone
31
features are difficult to validate through experimental structures because variations are not well
capturedatlimitedXRCresolutionorNMRaccuracy. Furthervalidationisrequiredtounlockthe
useofbackbonefeatures. AnotherlimitationisthatIcalculatedeachDNAshapefeaturebasedon
ensemble averages, which may not capture the actual distribution. For certain shape features (e.g.
backbone dihedrals and HelT), the distribution might be bimodal [123] and computing the mean
mightnotbethebestwaytoencodeDNAshapeinformation.
2.5 Conclusion
Our work demonstrated that DNA shape features are important in TF binding, regardless of the
source used to acquire the structural information (MC, MD or XRC data). To the best of our
knowledge, our study is the first to derive DNA structural features on a HT basis from multiple
different data sources and to test these features in statistical machine-learning applications. Even
when derived from different data sources, structural information consistently improved predic-
tion of TF binding specificity. Furthermore, I provided evidence that sequence+shape models,
especially models using the expanded repertoire of 13 DNA shape features, offer advantages over
k-mer models in terms of performance, computational cost and interpretability. For the benefit of
the community, the expanded repertoire of 13 DNA shape features has been made available for
useinourRpackage,DNAshapeR,athttp://bioconductor.org/packages/release/bioc/
html/DNAshapeR.html.
32
Chapter3
ExpandingtherepertoireofDNAshapefeaturestospatial
geometricdeeplearning
3.1 Chapterbackground
Transcription factors (TFs) regulate cellular activities through its binding to certain DNA se-
quences, usually be 6 to 12 base-pairs (bp). Many TFs read a specific DNA sequence through
base contacts, or base readout while some other TFs prefer certain DNA shapes, or shape read-
out [8]. Most TFs use both recognition pattern to read DNA information. Invented in the earlier
days, models as simple as motif or position-weighted matrices (PWM) were introduced to help
determine if a given DNA sequence is able to bind or not. Although these models are easier for
human to understand, which makes them still viable and popular, they do not work as precise as
newer machine learning models [35, 69, 70] which are able to quantitatively predict DNA bind-
ing specificities, primarily unlocked by high throughput deep sequencing technology. These high
throughput data can be generated by different experimental techniques. Systematic evolution of
ligands by exponential enrichment (SELEX) based methods usually provide the most coverage of
sequences when measuring binding affinity against certain TFs [63], in vitro. As for in vivo envi-
ronment,duetoobstaclessuchaschromatinaccessibilityandinternal3Dgenomestructureinthe
nucleus, binding can be very inconsistent on the same motif along the whole genome. ChIP-seq
33
[124]isthewidelyusedmethodtomeasuresuchbindingevents. Inthispaper,wefocusoninvitro
datasets,wherebindingissequencedependent.
Traditional machine learning methods combined with DNA shape features or k-mer based
DNA annotation have been proved to be useful in TF-DNA binding study [35],, as well as con-
volutional neural network (CNN) models [70], where CNN layers are treated like motif scanners
so it learns from data without alignment, and usually beat the performance of traditional models
inaglimpse. Deeplearning modelsuseDNAsequencesasinputwere alsousedtolearndifferent
objectives[74,76,77,125–130]. MostdeeplearningmethodsusedinTF-DNAbindingstudyrely
on one-hot encoding which encodes DNA sequences based on their bases by 0 or 1, while some
models incorporate additional information such as cell line information [131]. However, none
of those models take the molecules as input. Instead, they take the human-simplified sequence
format, which is built from 4 characters ACGT. Although the human-simplified sequence should
contain all the information to represent structures, structures can still be dynamic and deformed.
TFsmayprefercertainconformationsofthesameDNAsequence. DNAconformationcanbedra-
matically changed after one TF bound to DNA, which may cause the second TF binding or have
an effect in distance, resulting in allostery [81, 132–135]. No models that take DNA structure as
inputcurrently. Incomparison,althoughRNAsandproteinsarealsoabletoberepresentedasnu-
cleotidesoraminoacidssequences,manyworkshavebeendonetodirectlyincorporatestructures
into models [136–138], as well as smaller molecules [139]. The reason causing this difference is
probably the rigidity of DNA structures, which makes predicting DNA structures from sequence
much easier, so the mapping from sequence to structures has less drama. However, DNA can
still form A-DNA, B-DNA and Z-DNA form, be chemically modified, contain mismatches, and
haveuncommonbase-pairs[140–144]. Therefore,ageneraldeeplearningframeworkthatreceives
DNAstructuresasinputisneeded.
Thereweretwomajorapproachestorepresent3Dmolecularstructuresindeeplearningmodel.
The most intuitive approach is to represent structures as voxelized matrices, where the relative
geometrical relations among atoms retain despite any rotation and dislocation. Multiple research
34
hasproventhatitcanbeefficienttodosoonproteins[137,145,146](9-11)withatranslationally
invariant, 3D representation. However, this approach is computationally heavy if one wants to
generatethevoxelizedmatricesasaccurateaspossible.
Anotherapproachistouseagraphneuralnetwork(GNN),as3Dobjectcanberepresentedas
a collection of vortexes and edges. For small molecules, the vortexes are usually the atoms and
theedgesaretheconnectionsbetweenatoms. Physicalandchemicalfeaturesarethenassignedto
the vortexes and edges correspondingly. This representation usually omits the Euclidean distance
or coordinates therefore is naturally translationally invariant. For any given one structure, one
representation is generated and should include enough information for any later applications. By
using GNN, problems such as protein-protein interface prediction [147–149], antibody-antigen
binding interface prediction [150] and protein-ligand prediction [151] have been tackled in an
accuratefashion. Forlargermolecules,acompressionbasedonpriorknowledgeisusuallyusedto
collapseneighingfunctionalgroup,forexample,justusingvortexforoneaminoacidbutincluding
manycomputedlocalchemicalfeaturesandrelativespatialfeaturestodistinguishoneaminoacid
fromanother[147].
AlthoughweareabletodirectlyportanygeometricalproteinmodelsintoDNAssincetheyare
both molecules and their chemical composition is similar, it is still very important to consider the
differencebetweenproteinsandDNAstodiscoveranypotentialimprovements. Onekeydifference
fromproteinsisthatregularunmodifiedDNAsintheirB-DNAformhaveveryconservestructures
and components that only include 4 bases, sugar and phosphate backbones, where unmodified
regularproteinscontaindifferent20aminoacidsandcanvastlyvaryontheir3Dstructuresdueto
protein folding. In addition, for regular unmodified DNAs, the number of atoms appeared in each
normally paired base is exactly the same. This is a significant difference that can greatly reduce
thecomplexityofdesignofadeeplearningarchitectureforDNAs,whichleadstotheformationof
ourproposedmodelthatsuccessfullycombinessequentialsequenceinformationand3Dstructural
informationintoonesinglepiece. Asanexample,weusedTF-DNAbindingaffinityscoresasour
response variable to show the superiority of this model, despite this model can be easily ported to
35
replaceanymodelsthatwerepreviouslyusingone-hotencodingofDNAsequences. Inaddition,a
graph network model using only 3D coordinates and atom information for RNA structure scoring
wasrecentlypublishedandachievedstrikingperformance[138].
Inadditiontoexhibittheperformanceofthemodelcomparedtosequencedeeplearningmodel,
here,wealsopresentspecialscenariosthatasequencedeeplearningmodelisnotcapableof,orat
leastnotnaturalto,asshowcasesofthesuperiorityofourmodel. Onespecialscenarioistoprocess
only backbone structure, where performance of such input may indicate the overall importance
of backbone contributed in TF-DNA readout. Another special scenario is to process deformed
DNAs, or DNAs with the same sequence but in dynamics, where the shapes of DNA play the
major role. Research has been published specializing deformed DNA caused by protein binding
[133], nucleosomal DNA and TF binding [152, 153], DNA bendability measurement [154]. As
experimental designs and technology advance, newer data will become available that can already
use this future-proof model. In this article, we simulated special scenarios to prove the capability
of the spatial geometry model. The results provide insights into understanding TF-DNA binding
mechanisms.
3.2 MaterialsandMethods
3.2.1 TF-DNAbindingdata
SELEX-seqTF-DNArelativebindingaffinitydata,MAX,P53andMEF2Bisacquiredfrommulti-
pleliteratures[64,155]andtheyaregeneratedbySELEX-seqpackage[64]. 215TF-DNArelative
bindingaffinitydatafrom23TFfamiliesisacquiredfrom[95],wherethedatawentthroughqual-
itycontrol,filteringandaligningwiththecore-motifbasedonliterature[156].
3.2.2 High-throughputDNAshapeprediction
For any given DNA sequence, a full set of 12 DNA shapes including shift, slide, rise, tilt, roll,
helix twist, shear, stretch, stagger, buckle, propeller twist and opening [6], can be predicted by an
36
modifiedversionofDNAshapemethod[35]. The12DNAshapesdescribethespatialrelationship
between base pairs and within base pairs. For two sequential base pair and 3 axes (x, y and z),
each axis has a dislocation value and a rotation value, which counts to 6 shape features. The
same applies to two bases within a base pair, which counts to 6 shapes features. Together, there
are 12 shape features to be predicted. In this approach, a slightly modified query table is used,
which includes 5-mer with N at the terminus such as NNACG or ACGNN. The modified table
was directly computed from the original query table with simple average for those 5-mer with N
at the terminus. The purpose of this modification is to add the possibility to predict the flanking
2 bps in both ends so we can provide DNA shapes to the full length of sequences for rebuilding
purpose. Dependingonthequerytable,themethodcanpredictDNAshapefeaturesfromdifferent
sources such as Monte Carlo (MC) simulation and X-ray crystallography (XRC). This method is
alsoextremelyfastandcanprocesstensofthousandsofDNAsequencesinashortperiodoftime.
3.2.3 B-DNA3Dstructureprediction
AnystandardB-DNAstructurescanberebuiltbyX3DNA[104]subprogramcalledrebuild,which
takesinoneDNAsequencealongwithawholesetof12DNAshapeparametersasinputandPDB
fileasoutput. Theprogramusesthe12DNAshapeparameterstoplacebasepairsalongz-axison
a 3D space, and then link them together by adding connections. The generated PDB file has the
coordinates for all heavy atoms including C, N, O and P and their connection. According to PDB
file format, each atom has its own atom name, residue number and chain number to distinguish
itself from other atoms. These unique identifiers were used to index and sort the atom in later
data preprocessing. The connection information is also stored in the PDB files as atom linkage
information based on residue number. In order to enable the ability to perform graph network
computation, in later data preprocessing step, this information is also extracted. For a TF-DNA
binding affinity data that contains N entries of DNA sequences and their corresponding relative
bindingaffinity,NPDBfileswillbegeneratedandtemporarilystoredforlaterpreprocessing.
37
Figure 3.1: Workflow of preprocessing sequences and model training procedures. (a) Diagram
of SELEX-seq datasets. Each sequence in the dataset pairs with a relative binding affinity (noted
as Aff.). (b) Prepocess of DNA sequences in the datasets. 12 DNA shape features are predicted
for each DNA sequence. DNA structures are rebuilt with sequence and shape information, and
structures are represented as 3D coordinates of atoms and atoms adjacency (connection) matrix.
(c) Workflow of the whole process. Input data goes through prepocess, train/test split and then
feed into the training process. Trained model is then used to predict relative binding affinity for
testdata.
3.2.4 Datapreprocessing
For each TF-DNA binding data that has N entries of DNA and their corresponding N relative
binding affinity, N PDB files will be generated from the previous step. The data preprocessing is
to extract the needed information and merge those into a single H5 file to expedite the training
and test process. The needed information here in this work is the atom coordinates, which is
orderedbasedonitsresiduenumberandatomname,andthelinkageinformationforatoms. Fora
k-mersequence,41k atomswillbepresentedinthePDBfiles. Notethatitisnotalwaysnecessary
to extract all atoms from the PDB files. Only extracting coordinates of base pair atoms is also
possible. Extracted atom coordinates will be sorted based on predefined index (See Appendix
Table E.1). Therefore, if we extract n atoms from one PDB file, where n=41k if all heavy atoms
areextracted,thispreprocessingstepwillgenerateacoordinatesmatrixofdimension(n,3)andan
38
atom connection matrix of dimension (n,n) represented as a sparse matrix to save storage space.
Optionally, chemical information of atoms will also be extracted as an one-hot encoded matrix.
Since there are only 4 kinds of heavy atoms in DNA (C, N, O, P), this optional output will be
a matrix with dimension (n,4). The data is then stored in a H5 file with N entries. Each entry
in the H5 file will be indexed by its sequence, and contains atoms coordinates, optional atoms
id (chemical information), atoms connection matrix and the response variable (relative binding
affinity).
3.2.5 Modelinputanddataaugmentation
Unlike one-hot encoded sequence, where each sequence gets a one-to-one encode, coordinates-
based3Dgeometricalmodelsarenaturallyaugmented,whereeachstructurecanbetranslatedand
rotated randomly, while the model should always be fitted to predict them correctly. During the
trainingandevaluatingprocess, inputsare augmentedinreal-timeinthedeeplearninggraph. For
each input batch of size b, b translation matrices and rotation matrices are randomly generated
to augment the batch. See Appendix Figure D.2 for reference. Through this method, the model
will learn the true knowledge from the relative 3D coordinates because for each input, the model
willfacedifferentrepresentationsofthesameinputmultipletimeswhiletherelationshipsbetween
atoms remain. To expedite the speed and to reduce the needed disk space, this data augmenting
process was directly coded into the deep learning graph before the input. There is a parameter to
adjust how strong one would want to augment the input. In evaluating mode, the model will also
performthisdataaugmentingprocess. Consideringabatchsizeb,whichishowmanyentrieseach
iterationthemodelprocesses,intrainingmode,themodelwillinrealtimerandomlydislocateand
rotatebentriestothenetwork. Inevaluatingmode,toincreasetheaccuracyofprediction,acertain
numberofdislocationandrotationisrequired. Therefore,ifthenumberofdislocationandrotation
is n
r
, and the batch size is b, the model will be fed with b/n
r
entries of data so that the model can
generate the same batch size in evaluating mode, and the final loss and prediction is merged for
eachentryontheaveragebasebyn
r
.
39
3.2.6 Trainingandtestsplit
All the dataset was separated into training and testing set. MAX, MEF2B and P53 SELEX-seq
datasets were separated into training and testing by 90/10. 215 HT-SELEX TF-DNA binding
datasetswereseparatedintotrainingandtestingby80/20.
3.2.7 Spatialgeometrymodel
To be as simple as possible but still completely represent a 3D DNA structure, we used a regular
3D Cartesian coordinates system. For a DNA structure with n heavy atoms, the i-th atom on the
DNA structure is represented as a vector v
i
=(x
i
,y
i
,z
i
), where atoms are ordered based on their
chemicalgroupandpositionsinthelinearDNAstructure. Inaddition,hydrogensareomittedinthis
representation. In natural unmethylated DNAs, each base pair, whether its A-T or C-G, contains
l=41heavyatoms. Therefore,startingfromthefirstatomv
1
tothe41statomv
41
willrepresentthe
first base pair [Figure 3.2(a)]]. Inside the 41 atoms, their order is shown in Supplementary Table
S1. The representation wont store any chemical features such as the atom name. Due to a very
rigorous structure a DNA has, this information is redundant and may lead the model to learn only
sequence features. One can simply draw those atoms on a 3D space based on this representation
[Figure 3.2(b)]. Following that, an atom adjacency matrixA∈R
N×N
is added. In this model, we
only consider if two atoms are connected or not, therefore, A
ij
=0,1. To improve robustness,A
is normalized as
˜
A=D
−1
·(A+I), whereD=diag(d) for d
i
the degree of atom/node i. See
Figure3.2(d)forabriefoverviewofthemodelarchitechture.
3.2.8 Graphconvolution
To compute the output of graph convolution layer [Figure 3.2(c)], values from the neighboring
atoms are collected, weighted, summed and activated (ReLU) through the multiplication of adja-
cency matrixA. Therefore, after each layer of graph convolution, information will be gathered
40
Figure 3.2: Diagrams of the spatial geometry model. (a) Count of atoms per base pair. (b) DNA
represented as 3D coordinates and linkages. (c) Two major convoluational layers in the model,
graphconvolutionandbasepairconvolution. (d)Thenetworkgraphofthespatialgeometrymodel.
and computed into the centering atoms from the neighboring atoms. After n layers of graph con-
volution, one atom will contain features calculated from neighboring n-level atoms. This process
enablesustoperformconvolutionalongchemicalbonds,whichisthenaturalconnectionbetween
atoms in a molecule. Considering a layer input to beX and adjacency matrixA, layer weight as
W andlayerbiasasB,outputofthisgraphconvolutionallayeris
˜
X =X·A·W +B.
3.2.9 Basepairconvolution/embedding
The base pair convolution [Figure 3.2(c)], or base pair embedding layer, is a convolution layer
withl stridesandkernelsizeofl,wherel=41fortheregularfullmodel. Dependingontheatoms
per base pair if the model is not complete, l will be changed accordingly to compensate. This
layercollapsesallatomsinabasepairintoonesinglenodewhichensembletheinputofaone-hot
encoding DNA sequence model. Although the feature number of a one-hot encoding model is 4
if the sequence only contains ACGT, the output of this layer depends on the adjustable choice of
filtersize. Inourexample,64isthechosenfiltersize. Thislayercombinedwiththepreviousgraph
41
convolutionallayerstransformstheinput3DcoordinatesofDNAintoanembedfeaturematrixof
thestructure.
3.2.10 Residualneuralnetwork(ResNet)
Aresidualnetworkarchitecture[37]isusedafterbasepairconvolution/poolinglayer. Ourmodel-
specific ResNet is built upon residual blocks. One residual block contains 3 convolutional layers,
2batchnormalizationlayersand2activationfunctions. TheResNetusedinthismodelcontains4
residual blocks (ResBlock) with 3 max pooling layers in between. (See Appendix Figure D.1 for
details)Duetothedeepnatureofourapproach,usingResNetacceleratethespeedoftrainingand
reducingtheeffectofvanishinggradient.
3.2.11 Learningschedule,hyper-parameter,andtransferlearning
All models are trained by Adamax, a variation of Adam optimizer [157] and 0.01 learning rate.
There is no need to search for hyper-parameter for this model, although one can set dropout rate
and weight decay. In our case, dropout rate is 0 and weight decay is 0. One key challenge of this
model is to learn the structure subject to translation and rotation. The learnt model should have
the ability to process structures no matter where the camera is put. Therefore, transfer learning is
possible. For each k-mer data, a standard model trained is trained. Any new models with k-mer
inputwillinitiateitsparametersbyre-usetheparametersinthestandardmodel.
3.2.12 Sequencebaselinemodel
A sequence model is introduced to compare with the spatial geometry model as a baseline. The
sequencemodelfeaturesaninputofone-hotsequenceencodingandthesameoutputasthespatial
geometrymodel. Tohavethebestbaselineperformanceaspossible,thearchitectureofthismodel
involves multiple 1D convolutional layers and multiple dense layers. The number of parameters
in this model is even larger than the spatial geometry model. The model was trained on the same
42
data and by Adam or SGD optimizers depending on hyper-parameters. See Appendix Figure D.3
fordetail.
3.2.13 Partialmodelwithonlycertainchemicalgroups
Asthemodelisdefinedtotakeinputsofgeometricalcoordinatesinacertainorder,themodelcan
beeasilymodifiedtoacceptadifferentsizeofinput,suchascertainchemicalcomponents. Inthis
article, we also present a modified version of the model that utilize only base pairs or backbone
atomsasinput. Theonlymodificationisthebasepairconvolutiontobechangedtoaccommodate
different quantities of atoms per base pair. Instead of using strides and kernel size of 41, the
stridesandkernelsizeofbasepairconvolutionallayeraregoingtobechangedtofittheinput. For
example, there are 19 atoms per base pair if we only use base pair atoms and 22 atoms per base
pairifusingbackbone(sugar+phosphate)atoms,whileeveryotheraspectremainsthesame. See
Figure3.3forreference.
3.2.14 ArtificiallybentDNA
Artificially bent DNA was generated by performing a post-hoc editing of 3 DNA shape features
(roll, slide and helical twist) on the high-throughput predicted DNA shape profile, which is used
to rebuild the 3D structure of the DNA sequence. According Roll-and-Slide model [158], by
periodically adding values from -10 degrees to 10 degrees to roll, -1 or 2.5 to slide and -5 or 5
degrees to helical twist on every 5 bp, the DNA will bend similarly as a nucleosome-bound DNA
will. In addition, the artificially bent DNA can be pre-processed to contain only backbone atoms.
SeeFigureD.7forreferenece.
43
Figure3.3: Diagramofthepartialspatialgeometrymodel. AtomsfromDNAsugarandphosphate
backbonesaremarkedgreen. AtomsfromDNAbasepairsaremarkedyellow. Thepartialinputis
thensentintothesamearchitechturefortrainingandtesting.
3.3 Results
3.3.1 Performancebeatsthesequencebaselinemodel
The very first gold standard when comparing different models is of course the performance of
themagainstthesamedatasets. Here,weusedrelativeTF-DNAbindingaffinitydatasetsofMAX,
MEF2B, and P53, which is generated from SELEX-seq. These datasets are not aligned based on
their motifs to preserve the maximal information from the experiments. In addition, we included
published aligned HT-SELEX datasets including 215 TF-DNA binding data, with lower quantity
and quality. These HT-SELEX datasets are aligned based on their known motifs from JASPAR
database[67].
44
The standard sequence deep learning model easily suffers from performance loss due to im-
properselectionofhyper-parameters,underfittingoroverfitting,especiallyforthosedatasetswith
lower quality data (See Appendix Figure D.6 for reference) and needs thorough hyper-parameter
search to complete. This hyper-parameter search includes learning rate, optimizer, dropout rate,
weight decay, layer width (filter size), layer depth, training steps, batch size and more. Typically,
for each dataset and each search, one should randomly select a set of parameters and train the
modelonasubsetoftrainingdataandtestthemonanothersubsetoftrainingdata(validationdata)
tofindthebestcombinationtobeusedagainstthewholetrainingdata.
Here,wepresentourspatialgeometrymodelwithoutthenecessitytoperformhyper-parameter
searchbecauseofthenaturaldataaugmentationandthecapabilityoftransferlearning. Themodel
is able to converge in a short training cycles [Appendix Figure D.4 and D.5] and shows a better
performancethanthesequencebaselinemodelonbothunalignedSELEX-seqdatasetsandaligned
HT-SELEXdataset[Table3.1andFigure3.4].
Table3.1: R
2
PerformancetableforSELEX-seqdatasets.
1
TF BaselinesequenceCNNmodel spatialgeometrymodel
MAX 0.988 0.982
MEF2B 0.957 0.980
P53 0.848 0.972
While using a Cartesian coordinates system means the system is translation and rotation de-
pendent, it is not necessarily meant to be a disadvantage in our application. Because the input is
indexed, not similar to a random sorted point clouds, relationship between atoms can be well pre-
servedduringcomputation. However,randomtranslatingandrotatingtheinputbatchisnecessarily
needed to drive the model learning towards the real information about relative atom positioning.
Thisprocessisanaturaldataaugmentationandthemodelshouldbeimmunetoanytranslationand
rotation of the input. By introducing random translation and rotation into the model, the training
timetoconvergenceislongerbuttheperformanceoftestdataincreases.
1
Modelsaretrainedwiththesamesettings.
45
Figure 3.4: R
2
Performance table comparing our spatial geometry model to the baseline sequence
modelforHT-selexdatasets. Thebaselinesequencemodelwastrainedfor1000epochsusingSGD
optimizerwithlearningrateof0.001,anddropoutratewassetto0.25withnoweightdecay.
3.3.2 PerformanceremainsthesameforXRCderivedDNArebuiltstructures
The DNA rebuilding process involves with the prediction of DNA shape values for any given
sequences, wherevaluescanbeextractedfromeitherMC,MDorXRC[35]. Differentsourcesof
DNA shape may lead to different result. It is necessary to illustrate the performance of our model
by using different kind of DNA structures. Therefore, we demonstrate that performance remains
veryidenticalforXRC-derivedshapecomparedtoMC-derivedshape[Figure3.5].
3.3.3 Backbone-only models reveal TF-family-specific shape and backbone
preference
Unlike any sequence models that treat one single base pair as a one-hot encoded character, our
spatial geometry model can take in any number of atoms for each base pair. Considering this ad-
vantage, one can strip down the DNA molecules to the basic chemical groups, such as backbones
46
Figure 3.5: Geometry model can MC or XRC derived features (a) Diagram of using two differ-
ent query tables for rebuidling structures. (b) Performance comparison between MC-derived and
XRC-derived DNA structures for HT-SELEX dataset. (c) Performance comparison between MC-
derivedandXRC-derivedDNAstructuresfor3SELEX-seqdatasets.
orbasepairs,tostudytheirowncontributiontowardsTF-DNAbindingspecificity. Here,wereport
thatourmodelcanprocessbackboneasinput,asalongbasepair. Inourscenario,DNAstructures
are synthetically rebuilt based on DNA sequences and DNA shapes by firstly lying the base pairs
downin3DspacebasedonstandardB-DNAlinkage,nearlyallofinformationwillbepassedtothe
positionofbasepairs. WefirstlytestediftheaboveassumptionistruefortheSELEX-seqdatasets,
and it is not surprised that the base-pair-only model matches the performance of full model [Ap-
pendix Table E.2]. However, results become more interesting for the backbone-only model. The
position of the backbone is decided by the placement of DNA base pairs, which in some con-
nections will contain information about the exact DNA sequences inside of the backbone. We
assumed that if a TF favors shape and backbone recognition, backbone-only models are going to
beaseffective,andifaTFfavorssequencereadoutorusebackboneonlytostabilizebindingwith
no specificity required, backbone-only models are going to be not as effective but will retain a
certain level of performance since they are still generated by DNA sequences and DNA shapes.
47
If the above assumption is valid, backbone binding preference of TFs can be appraised from the
performance of backbone-only model among and within TF families. Different TF families have
different ways to recognize DNAs, depending on the DNA binding domain preferences. There-
fore, a family-specific shape and backbone preference can be analyzed [Figure 3.6]. This result
showsthatspecificitiescanbelearntonbackbone-onlywithonlyshapeinformation,onseveralTF
families. On those TF families prefer direct sequence readout, the model learns less information.
C2H2 zinc fingers are a TF family that tends to directly read sequence, with minimal backbone
contacts. ETS(E26transformation-specific)familytendstoreadDNAsequencedirectlywithlim-
ited backbone contact in addition. POU family uses helix-turn-helix binding structural motif to
bind DNA sequence, which is known for its direct sequence readout [159]. TBX (T-Box) domain
usually reads into major and minor grooves together with very few specific contacts to the base
pairs or sugars, while most of contacts happen on backbone phosphate groups [160, 161]. Basic
Helix-turn-helix(bHLH) binding domains rely on direct contacts to sequences and multiple DNA
backbone contacts to stabilize the binding, while the key determinants of binding specificity may
berelatedtobackbonephosphatebinding[162]. ThesedifferencesbetweenTFfamiliesalignwith
our findings [Figure 3.6]. Subclasses within TF families are also necessary to investigate. One
example is ETS family. Subclass I in ETS family [163] tends to use additional backbone contacts
than other subclasses. Datasets from subclass I in ETS TF families recapture higher amount of
information compared to other subclasses by using the backbone-only model [Figure 3.7]. This
backbone-onlymodelisprovidinginsightsintothepreferenceofTF-DNArecognition.
3.3.4 SpatialgeometrymodelcanprocessdeformedDNAandfindspioneer
factors
While tremendously conserved in 3D structure compared to proteins, DNA can be still flexible
on a global scale, especially in vivo, where molecules that can alter DNA structures exist. For
example,DNAsareusuallywrappedroundhistoneswhennotactive. Pioneerfactorsthatcanbind
nucleosomalDNAcanstillbindDNAs,althoughthoseTFsthattendtobindfreeDNAsmayshow
48
Figure 3.6: ∆MSE value between the full model and the backbone-only model for HT-SELEX
datasets. Themodenegativevaluesmeanalowerperformanceofbackbone-onlymodelcompared
tothefullmodel.
a different binding affinity. Considering an experiment that can high-throughput measure TF and
bent DNA binding affinity as while as TF and straight DNA but with the sequence DNA binding
affinity,sequencemodelsmayinstantlylosetheabilitytoshine,whilethespatialgeometrymodel
is able to distinguish the normal DNAs and the bent DNAs. Here, we simulate two scenarios that
the input DNA structures are bent and test if our model can still process it. The first one is to
pair the bent DNA with higher relative binding affinity compared the normal straight DNA with
the original values. If the model can distinguish bent DNA from normal straight DNA, the model
shouldperformwellinsuchascenario. Thesimulatedresultshowsthatthemodelcandistinguish
bent DNAs from straight DNAs and prefict their different binding affinity with high confidense
[AppendixFigureD.8].
The second scenario is to apply the backbone-only model on bent DNAs. Our assumption
here is that, if a TF prefers a bent DNA, the performance of this backbone-only model on bent
DNAwilloutperformthebackbone-onlymodelonstraightDNA.Inanotherword,thebentDNAs
49
Figure3.7: R
2
betweenthefullmodelandthebackbone-onlymodelforETSfamilyfromtheHT-
SELEXdataset. TFsfromSubclassIrecapturehigheramountofinformationfrombackbone-only
comparedtoothersubclasses.
contain more information needed for converging a better model. We chose only to use aligned
HT-SELEX data here because the bending simulation is to bend the DNA from the middle. The
bendingproceduresimulatesbendingasifthemiddleoftheDNAisthedyadofanucleosome. We
applied a Roll-and-Slide model that will set HelT features be a fixed value of 36 degrees at first
and alter Slide and HelT shape features every 5bp. Roll shape features will be altered as well in
a 5bp manner periodically. TFs that do not have pioneering function may still bind nucleosomal
DNA performing different functions [153]. We tested on all 215 TFs in the aligned HT-SELEX
dataset. Since the simulated bending process arbitrarily set HelT features to a fixed value, from
machine learning stand of points, the bent DNA should have less information. If the model is
50
learning sequence information from the structure, all bent datasets should perform worse than the
straightcounterpart. Strikingly, thereare manyTFdatasetswithhigherperformanceonthegroup
ofbentbackbone. Notethatifthebackbone-onlymodelwithstraightDNAsisalreadyonparwith
the full model, even the TF prefers bent DNAs, the performance is not going to improve dramati-
cally. Topreventartifacts,herewesetathresholdofMSEimprovementof0.0001,comparingthe
bentmodeltothestraightmodel. SomeTFsperformexceptionallybetter,suchasCEBP-B,which
is one of the centromere components, binding CENP-B box [164] . This protein binding induces
kinks in the DNA which is at nearly identical position as our bending simulation. ZBTB2B from
C2H2isahomologinhumantoGAGAfactorinDrosophila,whichisapioneerfactor[165]. SP3
from C2H2 family performs significantly better in bent version of DNA backbone. SP3 can inter-
actwithandrecruitmanyproteinstoremodelchromatinandregulategeneexpression[166]. ESR1
fromnuclearreceptorfamilycooperateswithFoxA1,awell-studiedpioneerfactor[25]thatbinds
toDNAhypomethylatedregionsofchromatinseeninbreastcancer,evenwithoutligandactivating
[167, 168]. TFE3 from homeodomain induces hypersensitivity in chromatin and mediates local
nucleosomepositioning[169],whichcanbeconsideredaspioneeringfunction. ESX1fromhome-
odomain family is a chromatin bound factor that was found highly active in mature sperm [170].
ETSfamilyhasbeenshowninvolvingchromatin-openingincancers[171]. Amongourhighlighted
TFs,SPDEF,ETV1,FLI1andETV4arefromETSfamily. ETV1isanoncogenereportedserving
asamasterregulatoringastrointestinalstromaltumour,melanomaandprostatecancer[172–174].
EWS-FLI1 fusion has been shown to induce chromatin remodeling in ewing sarcoma [175]. In
ETV4knock-outexperiment,lossofETV4affectschromatinaccessibilityslightly[176],suggest-
ing its potential but limited pioneering function in endometrial cancer cells. POU4F3 from POU
family also shows pioneer activity that drives mechanoreceptor differentiation [177]. ATF4 from
bZIP family may function as a pioneer factor to recruit histone acetyltransferase activity to a sub-
set of amino acid response target genes [178]. TBX2 from TBX family was reported to have a
cross-gyrebindingmode[153,179],formingdimer.
51
Table3.2: TFspreferringthebentbackbonemodel,groupedbyTFfamilies.
TFfamily TFpreferingbentDNAbackbonemodel
ETS SPDEF,ETV1,FLI1,ETV4
CENPB CENPB
C2H2 ZIC4,SP3,ZBTB7B,HINFP1
homeodomain Rhox11,ESX1,Gbx2
TBX TBX2
nuclearreceptor ESR1
bHLH TFE3
bZIP ATF4
POU POU4F3
In general, if a TF performs better in bent DNA backbone model, this TF may be able to bind
nucleosomalDNA,indicatingpotentialpioneeringfunction.
3.4 Discussion
In this work, we built a graph-based spatial geometry model to process heavy atoms coordinates
on regular B-DNA molecules as input. We collected data from high-throughput TF-DNA binding
studiessuchasSELEX-seqandHT-SELEX,asexamplestotestourmodel. SincerealworldDNA
structures are not always available from X-ray crystallography, in our test, we high-throughput
rebuilt DNA structures based on their sequences and the DNA shape predicted also based on se-
quences. GivenDNAstructuresaresequence-dependent,inourcase,itisdirectlyrelatedtoDNA
sequences. Therefore,thereisnoinformationgainfromDNAsequencestotherebuiltDNAstruc-
tures despite adding prior knowledge about DNA base pairs and backbone positioning. However,
themodelisstillabletobeattheperformanceofstandardsequencedeeplearningmodel,especially
for those datasets with lower quality that leads to lower generalization. In our opinion, sequence
deeplearningmodel,ifgoingthroughthoroughnetworkdesignsandhyper-parametersearch,may
beabletoachievethesameorbetterperformance. Inthatcase,ourmodelisstillgoingtobefaster
in training due to transfer learning capability with no hyper-parameter search needed. Our DNA
rebuildingpipelineusedinthispaperisverystraight-forward. ConsideringabetterDNAstructure
52
rebuilding method that relies on database analysis, like AlphaFold2[48] with nearly all PDB pro-
teinstructuresanalyzed,theperformanceofthismodelmayimproveevenmore. Theintermediate
basepairconvolutionallayerofthemodelgeneratesanembeddingofDNAstructuralfeatures,and
potentially have additional applications. One key advantage of this framework is that it directly
processes coordinates of heavy atoms on DNA. One can define how many atoms a base pair of
DNAshouldbe processed, suchas asubsetof atoms, forexampleonlybackbone atoms, toinves-
tigate the importance of certain chemical groups. The backbone-only input has no information of
thebasepairinthedoublehelix. TheonlyknowledgeitmaydirectlyhaveistheDNAshapes. Our
result shows the binding preference of backbones is on par with published literatures and can be
usedtoexplainbindingfavorabilityofDNAshapesandbackboneatoms.
Anotherkeyadvantageofthismodelis thatithasthecapabilitytodistinguishdifferentstages
in structures of the same DNA sequence, or deformed DNAs. Given the capability, the model is
able to learn through dynamics of DNA molecules or in our simulation, learn bending preference
ofthesameDNAsequence. OursimulationmadebentDNAshavehigherrelativebindingaffinity
compared to straight DNAs. Our model successfully learnt the property of different structure
conformations of the same sequences and can predict the binding affinity with high confidence.
Thisissomethingasequencemodelhastroubletodealwith,becausethebestthesequencemodel
can do is to extract additional features such as DNA bending angles as input to be as unique for
different entries. As research has been digging into deformed DNA and TF binding, such as a
recentpaperrevealingthefunctionsofDNAmismatchesinTF-DNArecognition[143],ourmodel
providesageneralframeworktoanalyzefuturehigh-throughputdatafordeformedDNAs.
Limitation of our method exists. At the current stage, the model is not able to process data
withdifferentlength, whichmeanstheinputofonemodelmustbethesamelengthforallentries.
In addition, since DNA methylation adds another heavy atom (Carbon) into the molecules, which
results in more atoms per base pair, this model is not yet designed to accomplish it. Further
improvement can be established to solve the above limitations. For example, instead of taking
53
into 41 atoms per base pair, the model can be modified to take into more atoms per base pair to
accomplishtheDNAmodifications.
3.5 Conclusion
We successfully designed a spatial geometry deep learning network for processing DNA dataset
andapplieditonTF-DNAbindingpredictionstudy. Thismodelworkstremendouslywellonpre-
dictingTF-DNAbindingspecificityanditmarginallybeatsthestate-of-artsequencedeeplearning
model. The model is self-regularized by data augmentation on the fly and can perform transfer
learningfromtrainedmodeltoaccommodatenewdataset. Themodelhasagreatpotentialtopro-
cessdeformedDNAstructuresasinputandcantakebarechemicalcomponentssuchasbackbone
as input, which, to the best of our knowledge, is brand-new. Our model is the first of its kind for
DNAonlyinput.
54
Chapter4
Sideproject-AnalysisonDNAshapefeaturesoftheyeast
originrecognitioncomplex
4.1 Background
In this Chapter, I will go through the work published in Nature Communications [78] with our
collaboratorsfromtheHongKongUniversityofScience&Technology. Inthiswork,ourcollabo-
ratorsengineeredtheOriginRecognitionComplex(ORC)fromyeastbyremovinga19-aminoacid
insertionhelixintheOrc4subunit. ORCisaconservedproteincomplexwith6subunits. However,
ORC prefers different DNA sequences between yeasts and humans. Humans, like all other meta-
zoans, use different replication origins during different developmental stages and different level
of chromatin opening [180–182]. The yeast, however, is a single-celled organism, initiates the
DNA replication from motifs, called autonomously replication sequences (ARSs). ORC consists
of 6 subunits from Orc1 to Orc6 [183]. With atomic-level structure for ORC bound to the ARS
motifavailable[184],interactionsbetweentheARSandORCcanbeviewedspecifically. Wehave
learned from the structure that binding between ORC and ACS happens mostly on the backbone
of DNA, and the insertion helix (IH) of Orc4 contacts the DNA major groove with base-specific
binding. ThisinteractionbetweenARSDNAandtheORCalsoinducesDNAbendingatthebind-
ing site, which is shared among other metazoans [185]. Our collaborators analyzed the protein
sequence alignment of Orc4 for many species and found out that the 19-amino-acids IH might
55
be the key difference between metazoan and yeast. This 19-amino acid removal from yeast ORC
transformed the binding behavior of the yeast ORC from base-specific binding into what the hu-
man ORC prefers. The altered ORC, called ORC-IH∆, prefers a brand-new DNA motif of T-rich,
fardifferentfromtheordinaryARSsequences.
4.2 Methods
4.2.1 HighthroughputDNAshapefeaturesprediction
DNAshapefeaturesarepredictedinahigh-throughputmanner,usingasliding-windowsalgorithm
toslidepentamersfromtheinputsequence[SeeSection2.2.3fromChapter2]. Valuesarequeried
from query tables for each pentamer. 5 shape features, Roll, ProT, HelT, MGW and electrostatic
potential (EP), are predicted for 4 DNA sequences. Three sequences are from regular ARSs and
one sequence is from the motif discovery of the mutant Orc2 binding peaks. EP is shape feature
that describes the electrostatic potential in the minor groove [186]. Roll and HelT are inter-base
pair features, while ProT is intra-base pair features. MGW and EP are features for minor groove
characteristicsandareassignedtobase-pairs,whileRollandHelTareassignedtointer-baseposi-
tions.
4.2.2 RebuildingDNAstructuresandaligning
With DNA shape features predicted, rebuild program from X3DNA [187] rebuild the DNA struc-
tures. Rebuilt DNA sequences have intrinsic shapes that may not be straight. Rebuilt DNA struc-
turesarethenalignedatthetheircoremotifs,pre-definedbymotifenrichmenttests. Alignmentof
DNAstructuresareperformedbyalignprogramofferedbyPyMOL[188]. Firstly,atomsconsisted
of the core motifs are selected, using select program offered by PyMOL, based on their residual
IDs. Then, the selected motif atoms from the 3 wildtype (WT) ARS are aligned with the selected
atomsfromthemutant.
56
4.2.3 Alignedstructurepositioningandtheircurvatureanalysis
Positioning the aligned structures are performed through orient program offered PyMOL. This
program calculates the principal components of the atoms with the XYZ axes. The highest and
second highest principal component are used to be the new X and Y axes. To generate different
postures of the aligned structures, rotate program from PyMOL is then used. Curvature analysis
is performed by Curves 5.3 program [6]. This program calculates a global curvature value of the
givenDNAstructure.
4.3 Results
Based on the de novo motif enrichment analysis from our collaborators, in addition to our soft-
ware, DNAshapeR that can high-throughput predict DNA shape features, I analyzed Roll, ProT,
HelT, MGW and electrostatic potential (EP) of DNA motifs from both the mutant and wildtypes.
Furthermore,IreconstructedtheDNAstructuresandalignthematthebindingsitetovisualizethe
structural differences. The result indicates a lower Roll values, lower ProT values, higher HelT
values,lowerMGWandETvaluesatthemutantsbindingsite[fig]. ThereconstructedDNAstruc-
tures,usingthemethodsinChapter3,verifythattheintrinsicDNAshapeofthemutantDNAmotif
hasmorecurvature[fig].
4.4 Discussions
TheresultsindicateanextensivenarrowerMGWatthemotifofmutant. RepetitiveA-Tbasepairs
greatly reduces the DNA minor groove width and in general, twist the DNA, and usually create
intrinsic DNA bending. Due to deletion of the 19-amino acids from Orc4, the mutant loses the
ability to bind to the original binding sites (ACSs), but prefers to bind to a DNA sequence with
thisdistinctDNAshapes. Genome-widestudyofhumanORCidentifiednearly100kbindingsites
inhumangenome,however,consensusmotifsofORCbindingcantbepredicted[189]. Similarly,
57
the mutant yeast ORC, which loses its ability to bind the ACSs, may share the same mechanisms
tothehumanORC.Ourcollaboratorsconductedfurtherresearchonopenchromatinandfoundout
that the mutant ORC prefers the regions that do not require nucleosome repositioning, while the
ORChastheabilitytoalterchromatinopening[190].
Therefore,theresultfromouranalysissuccessfullyreenforcedtheassumptionthatthemutated
ORC prefers DNA sequences with distinct intrinsic DNA shapes. This analysis also proves that
ourhigh-throughputpredictionofDNAshapefeaturesproposedinChapter2andChapter3isvery
useful.
58
Figure 4.1: High-throughput prediction of different structural shape parameters of DNA between
TTTTTTTCCGCG motif and three known ACSs from ARS607, ARS305, and ARS1. Shape fea-
tures include minor groove width (MGW), propeller twist (ProT), helix twist (HelT), electrostatic
potential (kT/e), and Roll. HelT and Roll describe rotation angles between adjacent base pairs.
ProTdescribesarotationanglebetweentwobasesthatformabasepair.
59
x
y
x
z
z
y
Core motif
bend angle (°)
12.5
4.0
1.6
7.0
Figure 4.2: Three different views of rebuilt DNA structures and their bend angles for core motif
sites.
Structures are aligned based on their core motif sites (10 bp in the center). The x- and y-axes
were generated based on the highest and second-highest principal components of all atoms in the
aligned structures. Bend angles are calculated using Curves 5.3 for their core motif sites. Bend
angles represent the angles between two vectors formed between two bp at each end of the core
motif (bp 1 and 2, and bp n−1 and n, respectively) of the core motif sites, where the vector is
defined between a reference point from one bp and the equivalent reference point on another bp.
Fordefinitionofnucleicacidreferencesystem,see[6].
60
Chapter5
Conclusionsandfutureworks
5.1 Conclusion
In this thesis, I started from the project expanding the capability of high-throughput prediction
of DNA shape features. The project successfully established the expanded version of DNAshape
and verified the credibility of the predicted additional DNA shape features. The machine learning
experimentsshowedthatbyincludingadditionalDNAshapefeatures,theperformanceofpredict-
ingTF-DNA binding specificities improves. In this project, I also compared DNAshape acquired
from different sources, such as MC or MD, and they showed a similar improvement. In addition,
I proved that adding standard deviations of the predicted shape features, which may be an indica-
tor of DNA flexibility, can improve the performance even further. This project was published in
NucleicAcidsResearch[35].
Withtheavailabilityofhigh-throughputpredictionofDNAshapefeatures,moreaccurateDNA
structurescanberebuiltinahigh-throughputfashion. Thisleadsmetoconceiveanideatobuilda
model that directly takes DNA structures as input. In Chapter 4, I discussed this project from the
methodstoresults. Iestablishedabrand-newdeeplearningmodelthatnotonlybeatsthesequence
baseline model by reducing the needs of hyper-parameters search, but also proves that additional
applicationsareachievableinthemodel,suchasanalyzingtheshape/backbonepreferencesofTF
families,tryingtodistinguishTFsthathavepotentialtobindnucleosomalDNAs.
61
Furthermore, as a side project collaborated with researchers from Hong Kong, I used the pre-
vious established high-throughput prediction of DNA shape features and the high-throughput re-
building of DNA structures pipeline to study the intrinsic shape difference between two kinds of
motifs. The result shows a great difference of intrinsic shapes between the mutant motif and the
wildtype motifs, which may explain the difference in binding site of yeast ORC from metazoan
ORC.ThisprojectwaspublishedNatureCommunications[78].
AsaconclusionofmyPhD,theseprojectsgreatlyexpandtheoriginalDNAshapemethod[68],
lead the path of further applications, and extend our understanding of TF-DNA binding mecha-
nisms. From the point of view in methodology, these projects expansively utilize a wide varieties
of machine learning, statistics, domain knowledge of DNA structures and TF-DNA binding and
buildabrand-newdeeplearningarchitecturedesignedtodealwithDNAstructures.
5.2 Futureworks
Scientific research is a never ending exploration to answer unsolved, intriging questions, and
methodologies can always be improved. Therfore, before I end my thesis, I would like to go
throughsomefutureworksthatIhavebeenworkingorpotentiallytobeworkedon. bae
5.2.1 RebuildingDNAstructuresondemandonthefly
Proposal Following Chapter 3, many potential directions of improvement can be investigated.
Although high-throughput predictions of DNA shape features and rebuilding DNA structures are
generally fast, intermediate results must be stored on disk and pre-processed for following analy-
ses. DNA shape predictions are based on window-sliding algorithm of DNA sequences and com-
pute values based on query tables. Information lose during in this process, where such lost can
be prevented if the model directly learns all information from raw simulation or crystal-structural
data. ArtificiallybendingDNAaccordingtoastaticmodelisquickbutcanuseimprovementsfrom
62
real world solved structures. A project can be conceived to tackle these problems. Given the suc-
cessofAlphaFoldandAlphaFold2[47,48],consideringthesimplicityofDNAstructures,amodel
much simpler is achievable. However, this proposed model is a generative model, not a discrimi-
nativemodel. Asuccessfulgenerativemodelisusuallyhardertobuildandtrain[191,192]. Inthis
section, I propose two directions to investigate on rebuilding DNA structures via deep learning.
The first direction is to rely on prior knowledge. Base pairs can be arbitrarily put on to 3D space
given DNA sequences. The model should then adjust coordinates of them, one can rebuild DNA
structuresonthefly. Dependingonthetrainingdata,thismodelshouldwhateverDNAstructures.
It can be regular B-DNA, A-DNA, Z-DNA, or deformed DNAs. DNA modifications should be
also available. This tool will be very helpful for biologists to investigate intrinsic DNA properties
and be a very good companion of the deep learning model proposed in Chapter 3. The second
direction is to go with auto generative model like generative adversarial nets (GAN) [193]. This
architecturefeaturesadiscriminativemodelandgenerativemodelonthesamegraph. Trainingon
thegenerativemodelhappenssimultaneouslywiththediscriminativemodellikeacompetition.
Preliminary results I implemented a small piece of model that takes input of k×4 matrix,
representingone-hotencodingofk-mersequence. Theoutputofthismodelis41k×3,representing
41k heavy atoms and coordinates in 3D space. The model is built upon transposed convolution
layers, sometimes called deconvolution layers. These layers amplify the input to be larger in size.
For experimental purpose only, I took 4096 rebuilt 5-mer DNA structures into training the model.
However, the model is not able to converge. Improvements must be made to incorporate prior
knowledges. Trainingsuchamodelwithbruteforcedoesn’tyieldsatisfactoryresults.
5.2.2 Optimizinginputwithfixedparametersforthespatialgeometrymodel
The goal for deep learning is to reduce the defined loss function by tuning the weight parameters
through gradient descent. Given a deep learning architecture, parameters and an input, output can
becomputed. However,theparameterscanbeturnedtofixedandnottrainable,withtheinputtobe
63
viewedasparameters. ThisapproachmayhelptolearnanoptimizedstructurethataTFfavorswith
agivensequence. However, sincethe inputof the spatialgeometry modelis only 3Dcoordinates,
there are no constraints to limit their geometrical positions. Artifacts will be produced without
refraining criteria. Here, I propose two methods to make optimizing input possible. The first
solutionistoaddanotherlossfunctionthatwillpenalizeirregularstructuresviaforcefields. Asa
priorknowledge,weknowthatatomsinbasepairs,duetoaromaticstructuresandhydrogenbonds,
are extremely hard to be lifted off plane, and chemical bonds should be kept within a reasonable
length and torsion angles. Structures will be penalized greatly if such an abnormal conformation
is evolved during optimizing. To penalize an abnormal conformation, force fields used in MD
simulation [194] can be adapted into a new loss function. The second method is to start with a
constrainted simplified model, such as coarse grain DNA model [195]. The input optimizer will
happenonthecoarsegrainmodelinstead. Thiswilleliminatemanypossibleartifacts.
5.2.3 IncorporatingDNAmodificationsintothespatialgeometrymodel
The spatial geometry model proposed in Chapter 3 is well defined for regular B-DNA. It can be
directlyportedonA-DNA,Z-DNAandDNAswithmismatches,sincetheydonotalterthenumber
of heavy atoms per bases. However, the most common 5-methylated cytosine adds one more
carbon into account. The proposal is to increase the number of atoms per bases. For instance,
includethecoordinatesoftheadditionalcarbon,andforthosebasepairswithnoadditionalatoms,
usethecoordinatesoftheconnectedatomasifthereisacarbonatomadded.
5.2.4 Optimizing the model architecture to be translational and rotational
invariant
Research has been conducted in machine learning field to deal with the Cartesian coordinates
system, such as point cloud. For example, architectures have been proposed to be rotation-and
translation-equivariant [196, 197]. The spatial geometry model used to scoring RNA predictions
64
[138]successfullyadaptedoneoftheabovearchitechutres. Thisarchitectureisworthexploringto
seeifitcansolvethetrainingbottleneckcausedbyextensiveautomaticdataaugmentations.
65
References
1. Babu, M. M., Luscombe, N. M., Aravind, L., Gerstein, M. & Teichmann, S. A. Structure
andevolutionoftranscriptionalregulatorynetworks.CurrentOpinioninStructuralBiology
14,283–291.doi:10.1016/j.sbi.2004.05.004 (2004).
2. Lausch, E., Hermanns, P., Farin, H. F., Alanay, Y., Unger, S., Nikkel, S., et al. TBX15
Mutations Cause Craniofacial Dysmorphism, Hypoplasia of Scapula and Pelvis, and Short
StatureinCousinSyndrome.TheAmericanJournalofHumanGenetics83,649–655.doi:10.
1016/j.ajhg.2008.10.011 (2008).
3. Grinberg, I., Northrup, H., Ardinger, H., Prasad, C., Dobyns, W. B. & Millen, K. J. Het-
erozygous deletion of the linked genes ZIC1 and ZIC4 is involved in Dandy-Walker mal-
formation.NatureGenetics36,1053–1055.doi:10.1038/ng1420(2004).
4. Ghosh,A.&Bansal,M.AglossaryofDNAstructuresfromAtoZ.ActaCrystallographica
Section D Biological Crystallography 59, 620–626. doi:10.1107/s0907444903003251
(2003).
5. Klose, R. J. & Bird, A. P. Genomic DNA methylation: the mark and its mediators. Trends
inBiochemicalSciences31,89–97.doi:10.1016/j.tibs.2005.12.008 (2006).
6. Lavery, R. & Sklenar, H. Defining the Structure of Irregular Nucleic Acids: Conventions
andPrinciples.JournalofBiomolecularStructureandDynamics6,655–667.doi:10.1080/
07391102.1989.10507728 (1989).
7. Stone,A.TheTheoryofIntermolecularForces(OxfordUniversityPress(OUP),2013).
8. Rohs, R., West, S. M., Sosinsky, A., Liu, P., Mann, R. S. & Honig, B. The role of DNA
shape in proteinDNA recognition. Nature 461, 1248–1253. doi:10.1038/nature08473
(2009).
9. Ellenberger, T. Getting a grip on DNA recognition: structures of the basic region leucine
zipper, and the basic region helix-loop-helix DNA-binding domains. Current Opinion in
StructuralBiology4,12–21.doi:https://doi.org/10.1016/S0959-440X(94)90054-X
(1994).
10. Brglin,T.R.&Affolter,M.Homeodomainproteins:anupdate.Chromosoma125,497–521.
doi:10.1007/s00412-015-0543-8 (2015).
11. Klug,A.Thediscoveryofzincfingersandtheirapplicationsingeneregulationandgenome
manipulation.Annualreviewofbiochemistry79,213–231(2010).
12. Iuchi, S. Three classes of C2H2 zinc finger proteins. Cellular and Molecular Life Sciences
CMLS58,625–635(2001).
13. Lam,K.N.,VanBakel,H.,Cote,A.G.,VanDerVen,A.&Hughes,T.R.Sequencespeci-
ficity is obtained from the majority of modular C2H2 zinc-finger arrays. Nucleic acids re-
search39,4680–4690(2011).
66
14. McGinty,R.K.&Tan,S.Nucleosomestructureandfunction.Chemicalreviews115,2255–
2273(2015).
15. Wolffe,A.Chromatin:structureandfunction(Academicpress,1998).
16. Cedar,H.&Bergman,Y.LinkingDNAmethylationandhistonemodification:patternsand
paradigms.NatureReviewsGenetics10,295–304(2009).
17. Brtov, E., Krej, J., Harniarov, A., Galiov, G. & Kozubek, S. Histone modifications and
nuclear architecture: a review. Journal of Histochemistry & Cytochemistry 56, 711–721
(2008).
18. Kilpinen, H., Waszak, S. M., Gschwind, A. R., Raghav, S. K., Witwicki, R. M., Orioli, A.,
et al. Coordinated effects of sequence variation on DNA binding, chromatin structure, and
transcription.Science342,744–747(2013).
19. Benveniste, D., Sonntag, H.-J., Sanguinetti, G. & Sproul, D. Transcription factor binding
predictshistonemodificationsinhumancelllines.ProceedingsoftheNationalAcademyof
Sciences111,13367–13372(2014).
20. DantasMachado,A.C.,Zhou,T.,Rao,S.,Goel,P.,Rastogi,C.,Lazarovici,A.,etal.Evolv-
ing insights on how cytosine methylation affects protein–DNA binding. Briefings in func-
tionalgenomics14,61–73(2015).
21. Hu,S.,Wan,J.,Su,Y.,Song,Q.,Zeng,Y.,Nguyen,H.N.,etal.DNAmethylationpresents
distinctbindingsitesforhumantranscriptionfactors.elife2,e00726(2013).
22. Liu, L., Jin, G. & Zhou, X. Modeling the relationship of epigenetic modifications to tran-
scriptionfactorbinding.Nucleicacidsresearch43,3873–3885(2015).
23. Rao,S.,Chiu,T.-P.,Kribelbauer,J.F.,Mann,R.S.,Bussemaker,H.J.&Rohs,R.System-
atic prediction of DNA shape changes due to CpG methylation explains epigenetic effects
onprotein–DNAbinding.Epigenetics&chromatin11,1–11(2018).
24. Xin, B. & Rohs, R. Relationship between histone modifications and transcription factor
bindingisproteinfamilyspecific.Genomeresearch28,321–333(2018).
25. Zaret,K.S.&Carroll,J.S.Pioneertranscriptionfactors:establishingcompetenceforgene
expression.Genes&development 25,2227–2241(2011).
26. Iwafuchi-Doi,M.&Zaret,K.S.Pioneertranscriptionfactorsincellreprogramming.Genes
&development 28,2679–2692(2014).
27. Zaret,K.S.&Mango,S.E.Pioneertranscriptionfactors,chromatindynamics,andcellfate
control.Currentopinioningenetics&development 37,76–81(2016).
28. Bonev, B. & Cavalli, G. Organization and function of the 3D genome. Nature Reviews
Genetics17,661–678(2016).
29. Morris,C.N.ParametricempiricalBayesinference:theoryandapplications.Journalofthe
AmericanstatisticalAssociation78,47–55(1983).
30. Dreiseitl, S. & Ohno-Machado, L. Logistic regression and artificial neural network clas-
sification models: a methodology review. Journal of biomedical informatics 35, 352–359
(2002).
31. Somvanshi, M., Chavan, P., Tambade, S. & Shinde, S. A review of machine learning tech-
niquesusingdecisiontreeandsupportvectormachinein2016internationalconferenceon
computingcommunicationcontrolandautomation(ICCUBEA)(2016),1–7.
32. Gevrey,M.,Dimopoulos,I.&Lek,S.Reviewandcomparisonofmethodstostudythecon-
tribution of variables in artificial neural network models. Ecological modelling 160, 249–
264(2003).
67
33. Alloghani, M., Al-Jumeily, D., Mustafina, J., Hussain, A. & Aljaaf, A. J. A systematic
review on supervised and unsupervised machine learning algorithms for data science. Su-
pervisedandunsupervisedlearningfordatascience,3–21(2020).
34. Turner,C. R.,Fuggetta, A., Lavazza,L. & Wolf,A. L. Aconceptual basis forfeature engi-
neering.JournalofSystemsandSoftware49,3–15(1999).
35. Li,J.,Sagendorf,J.M.,Chiu,T.-P.,Pasi,M.,Perez,A.&Rohs,R.Expandingtherepertoire
of DNA shape features for genome-scale studies of transcription factor binding. Nucleic
AcidsResearch45,12877–12887.doi:10.1093/nar/gkx1145 (2017).
36. LeCun,Y.,Bengio,Y.&Hinton,G.Deeplearning.nature521,436–444(2015).
37. He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition in
Proceedingsof the IEEEConferenceon ComputerVisionand PatternRecognition(CVPR)
(2016).
38. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet Classification with Deep Convolu-
tionalNeuralNetworksinAdvancesinNeuralInformationProcessingSystems(edsPereira,
F.,Burges,C.J.C.,Bottou,L.&Weinberger,K.Q.)25(CurranAssociates,Inc.,2012).
39. Maturana, D. & Scherer, S. VoxNet: A 3D Convolutional Neural Network for real-time ob-
jectrecognitionin2015IEEE/RSJInternationalConferenceonIntelligentRobotsandSys-
tems(IROS)(2015),922–928.doi:10.1109/IROS.2015.7353481.
40. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., et al. 3D ShapeNets: A deep
representation for volumetric shapes in 2015 IEEE Conference on Computer Vision and
PatternRecognition(CVPR)(IEEE,2015).doi:10.1109/cvpr.2015.7298801.
41. Qi, C. R., Su, H., Mo, K. & Guibas, L. J. PointNet: Deep Learning on Point Sets for 3D
Classification and Segmentation in Proceedings of the IEEE Conference on Computer Vi-
sionandPatternRecognition(CVPR)(2017).
42. Qi, C. R., Yi, L., Su, H. & Guibas, L. J. PointNet++: Deep Hierarchical Feature Learning
on Point Sets in a Metric Space in Proceedings of the 31st International Conference on
Neural Information Processing Systems (Curran Associates Inc., Long Beach, California,
USA,2017),5105–5114.
43. Mnih,V.,Kavukcuoglu,K.,Silver,D.,Graves,A.,Antonoglou,I.,Wierstra,D.,etal.Play-
ingAtariwithDeepReinforcementLearning.CoRRabs/1312.5602(2013).
44. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., et al. A gen-
eral reinforcement learning algorithm that masters chess, shogi, and Go through self-play.
Science362,1140–1144(2018).
45. Silver,D.,Schrittwieser,J.,Simonyan,K.,Antonoglou,I.,Huang,A.,Guez,A.,etal.Mas-
teringthegameofgowithouthumanknowledge.nature550,354–359(2017).
46. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., et al.
Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575,
350–354(2019).
47. Senior,A.W.,Evans,R.,Jumper,J.,Kirkpatrick,J.,Sifre,L.,Green,T.,etal.Improvedpro-
teinstructurepredictionusingpotentialsfromdeeplearning.Nature577,706–710(2020).
48. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., et al. Highly
accurateproteinstructurepredictionwithAlphaFold.Nature596,583–589(2021).
49. Xu, J. Distance-based protein folding powered by deep learning. Proceedings of the Na-
tionalAcademyofSciences116,16856–16865(2019).
68
50. Hiranuma, N., Park, H., Baek, M., Anishchenko, I., Dauparas, J. & Baker, D. Improved
protein structure refinement guided by deep learning based accuracy estimation. Nature
communications12,1–11(2021).
51. Rudin,C.Stopexplainingblackboxmachinelearningmodelsforhighstakesdecisionsand
useinterpretablemodelsinstead.NatureMachineIntelligence1,206–215(2019).
52. Ribeiro,M.T.,Singh,S.&Guestrin,C. ” Why should i trust you?” Explaining the predic-
tions of any classifier in Proceedings of the 22nd ACM SIGKDD international conference
onknowledgediscoveryanddatamining(2016),1135–1144.
53. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions in Pro-
ceedings of the 31st international conference on neural information processing systems
(2017),4768–4777.
54. Shrikumar,A.,Greenside,P.&Kundaje,A.Learningimportantfeaturesthroughpropagat-
ing activation differencesin International Conference on Machine Learning(2017),3145–
3153.
55. Furey, T. S. ChIP–seq and beyond: new and improved methodologies to detect and charac-
terizeprotein–DNAinteractions.NatureReviewsGenetics13,840–852(2012).
56. Rhee,H.S.&Pugh,B.F.Comprehensivegenome-wideprotein-DNAinteractionsdetected
atsingle-nucleotideresolution.Cell147,1408–1419(2011).
57. He,Q.,Johnston,J.&Zeitlinger,J.ChIP-nexusenablesimproveddetectionofinvivotran-
scriptionfactorbindingfootprints.Naturebiotechnology33,395–401(2015).
58. Machanick, P. & Bailey, T. L. MEME-ChIP: motif analysis of large DNA datasets. Bioin-
formatics27,1696–1697(2011).
59. Berger, M. F., Philippakis, A. A., Qureshi, A. M., He, F. S., Estep, P. W. & Bulyk, M. L.
Compact, universal DNA microarrays to comprehensively determine transcription-factor
binding site specificities. Nature Biotechnology 24, 1429–1435. doi:10.1038/nbt1246
(2006).
60. Berger,M.F.&Bulyk,M.L.Universalprotein-bindingmicroarraysforthecomprehensive
characterizationoftheDNA-bindingspecificitiesoftranscriptionfactors.NatureProtocols
4,393–411.doi:10.1038/nprot.2008.195 (2009).
61. Tuerk, C. & Gold, L. Systematic evolution of ligands by exponential enrichment: RNA
ligandstobacteriophageT4DNApolymerase.science249,505–510(1990).
62. Ogawa,N.&Biggin,M.D.inGeneRegulatoryNetworks51–63(Springer,2012).
63. Riley, T. R., Slattery, M., Abe, N., Rastogi, C., Liu, D., Mann, R. S., et al. in Methods in
Molecular Biology 255–278 (Springer New York, 2014). doi:10.1007/978-1-4939-
1242-1_16.
64. Rastogi, C., Rube, H. T., Kribelbauer, J. F., Crocker, J., Loker, R. E., Martini, G. D., et al.
Accurate and sensitive quantification of protein-DNA binding affinity. Proceedings of the
NationalAcademyofSciences115,E3692–E3701(2018).
65. Inukai,S.,Kock,K.H.&Bulyk,M.L.Transcriptionfactor–DNAbinding:beyondbinding
sitemotifs.Currentopinioningenetics&development 43,110–119(2017).
66. Rube, H. T., Rastogi, C., Kribelbauer, J. F. & Bussemaker, H. J. A unified approach for
quantifyingandinterpretingDNAshapereadoutbytranscriptionfactors.MolecularSystems
Biology14.doi:10.15252/msb.20177902 (2018).
69
67. Fornes, O., Castro-Mondragon, J. A., Khan, A., Van der Lee, R., Zhang, X., Richmond,
P.A.,etal.JASPAR2020:updateoftheopen-accessdatabaseoftranscriptionfactorbinding
profiles.Nucleicacidsresearch48,D87–D92(2020).
68. Zhou, T., Yang, L., Lu, Y., Dror, I., Machado, A. C. D., Ghane, T., et al. DNAshape: a
method for the high-throughput prediction of DNA structural features on a genomic scale.
NucleicAcidsResearch41,W56–W62.doi:10.1093/nar/gkt437(2013).
69. Zhou,T.,Shen,N.,Yang,L.,Abe,N.,Horton,J.,Mann,R.S.,etal.Quantitativemodeling
of transcription factor binding specificities using DNA shape. Proceedings of the National
AcademyofSciences112,4654–4659.doi:10.1073/pnas.1422023112 (2015).
70. Alipanahi,B.,Delong,A.,Weirauch,M. T.& Frey,B.J.Predictingthesequencespecifici-
ties of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology 33, 831–
838.doi:10.1038/nbt.3300(2015).
71. Phuycharoen,M.,Zarrineh,P.,Bridoux,L.,Amin,S.,Losa,M.,Chen,K.,etal.Uncovering
tissue-specificbindingfeaturesfromdifferentialdeeplearning.NucleicAcidsResearch48,
e27–e27.doi:10.1093/nar/gkaa009(2020).
72. Fu,L.,Zhang,L.,Dollinger,E.,Peng,Q.,Nie,Q.&Xie,X.Predictingtranscriptionfactor
binding in single cells through deep learning. Science Advances 6. doi:10.1126/sciadv.
aba9031(2020).
73. Quan, L., Sun, X., Wu, J., Mei, J., Huang, L., He, R., et al. Learning useful representa-
tions of DNA sequences from ChIP-seq datasets for exploring transcription factor binding
specificities. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1–1.
doi:10.1109/tcbb.2020.3026787 (2020).
74. Wang,M.,Tai,C.,E,W.&Wei,L.DeFine:deepconvolutionalneuralnetworksaccurately
quantify intensities of transcription factor-DNA binding and facilitate evaluation of func-
tionalnon-codingvariants.Nucleicacidsresearch46,e69–e69(2018).
75. Park, S., Koh, Y., Jeon, H., Kim, H., Yeo, Y. & Kang, J. Enhancing the interpretability of
transcription factor binding site prediction using attention mechanism. Scientific Reports
10.doi:10.1038/s41598-020-70218-4 (2020).
76. Quang, D. & Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network
for quantifying the function of DNA sequences. Nucleic Acids Research 44, e107–e107.
doi:10.1093/nar/gkw226(2016).
77. Zhou,J.&Troyanskaya,O.G.Predictingeffectsofnoncodingvariantswithdeeplearning-
basedsequencemodel.NatureMethods12,931–934.doi:10.1038/nmeth.3547 (2015).
78. Lee, C. S. K., Cheung, M. F., Li, J., Zhao, Y., Lam, W. H., Ho, V., et al. Humanizing the
yeastoriginrecognitioncomplex.NatureCommunications12.doi:10.1038/s41467-020-
20277-y(2021).
79. Shlyueva,D.,Stampfel,G.&Stark,A.Transcriptionalenhancers:frompropertiestogenome-
widepredictions.NatureReviewsGenetics15,272–286.doi:10.1038/nrg3682(2014).
80. Levo,M.&Segal,E.Inpursuitofdesignprinciplesofregulatorysequences.NatureReviews
Genetics15,453–468.doi:10.1038/nrg3684(2014).
81. Slattery, M., Riley, T., Liu, P., Abe, N., Gomez-Alcala, P., Dror, I., et al. Cofactor Binding
Evokes Latent Differences in DNA Binding Specificity between Hox Proteins. Cell 147,
1270–1282.doi:10.1016/j.cell.2011.10.053 (2011).
70
82. Jolma, A., Kivioja, T., Toivonen, J., Cheng, L., Wei, G., Enge, M., et al. Multiplexed mas-
sively parallel SELEX for characterization of human transcription factor binding specifici-
ties.GenomeResearch20,861–873.doi:10.1101/gr.100552.109 (2010).
83. Slattery, M., Zhou, T., Yang, L., Machado, A. C. D., Gordn, R. & Rohs, R. Absence of a
simplecode:howtranscriptionfactorsreadthegenome.TrendsinBiochemicalSciences39,
381–399.doi:10.1016/j.tibs.2014.07.002 (2014).
84. Stormo,G.D.DNAbindingsites:representationanddiscovery. Bioinformatics16,16–23.
doi:10.1093/bioinformatics/16.1.16 (2000).
85. Zhao, Y., Ruan, S., Pandey, M. & Stormo, G. D. Improved Models for Transcription Fac-
torBindingSiteIdentificationUsingNonindependentInteractions.Genetics191,781–790.
doi:10.1534/genetics.112.138685 (2012).
86. Sharon, E., Lubliner, S. & Segal, E. A Feature-Based Approach to Modeling ProteinDNA
Interactions. PLoS Computational Biology 4 (ed Stormo, G.) e1000154. doi:10.1371/
journal.pcbi.1000154 (2008).
87. Gordn,R.,Shen,N.,Dror,I.,Zhou,T.,Horton,J.,Rohs,R.,etal.GenomicRegionsFlank-
ingE-BoxBindingSitesInfluenceDNABindingSpecificityofbHLHTranscriptionFactors
through DNA Shape. Cell Reports 3, 1093–1104. doi:10.1016/j.celrep.2013.03.014
(2013).
88. Zabet, N. R. & Adryan, B. Estimating binding properties of transcription factors from
genome-wide binding profiles. Nucleic Acids Research 43, 84–94. doi:10.1093/nar/
gku1269(2014).
89. Siebert,M.&Sding,J.BayesianMarkovmodelsconsistentlyoutperformPWMsatpredict-
ing motifs in nucleotide sequences. Nucleic Acids Research44, 6055–6069. doi:10.1093/
nar/gkw521(2016).
90. Yang, J. & Ramsey, S. A. A DNA shape-based regulatory score improves position-weight
matrix-based recognition of transcription factor binding sites. Bioinformatics 31, 3445–
3450.doi:10.1093/bioinformatics/btv391 (2015).
91. Krietenstein, N., Wal, M., Watanabe, S., Park, B., Peterson, C. L., Pugh, B. F., et al. Ge-
nomicNucleosomeOrganizationReconstitutedwithPureProteins.Cell167,709–721.e12.
doi:10.1016/j.cell.2016.09.045 (2016).
92. Abe, N., Dror, I., Yang, L., Slattery, M., Zhou, T., Bussemaker, H. J., et al. Deconvolving
theRecognitionofDNAShapefromSequence.Cell161,307–318.doi:10.1016/j.cell.
2015.02.008(2015).
93. Villa, R., Schauer, T., Smialowski, P., Straub, T. & Becker, P. B. PionX sites mark the X
chromosome for dosage compensation. Nature537, 244–248. doi:10.1038/nature19338
(2016).
94. Mathelier, A., Xin, B., Chiu, T.-P., Yang, L., Rohs, R. & Wasserman, W. W. DNA Shape
Features Improve Transcription Factor Binding Site Predictions In Vivo. Cell Systems 3,
278–286.e4.doi:10.1016/j.cels.2016.07.001 (2016).
95. Yang,L.,Orenstein,Y.,Jolma,A.,Yin,Y.,Taipale,J.,Shamir,R.,etal.Transcriptionfactor
family-specific DNA shape readout revealed by quantitative specificity models. Molecular
SystemsBiology13,910.doi:10.15252/msb.20167238 (2017).
96. Schne,S.,Jurk,M.,Helabad,M.B.,Dror,I.,Lebars,I.,Kieffer,B., et al.Sequencesflank-
ing the core-binding site modulate glucocorticoid receptor structure and activity. Nature
Communications7.doi:10.1038/ncomms12621 (2016).
71
97. Shakked,Z.,Guzikevich-Guerstein,G.,Frolow,F.,Rabinovich,D.,Joachimiak,A.&Sigler,
P. B. Determinants of repressor/operator recognition from the structure of the trp operator
bindingsite.Nature368,469–473.doi:10.1038/368469a0(1994).
98. Luscombe, N. M., Austin, S. E., Berman, H. M. & Thornton, J. M. An overview of the
structuresofprotein-DNAcomplexes.GenomeBiology1,reviews001.1.doi:10.1186/gb-
2000-1-1-reviews001 (2000).
99. Garvie,C.W.&Wolberger,C.RecognitionofSpecificDNASequences.MolecularCell8,
937–946.doi:10.1016/s1097-2765(01)00392-6 (2001).
100. Locasale, J. W., Napoli, A. A., Chen, S., Berman, H. M. & Lawson, C. L. Signatures of
Protein-DNA Recognition in Free DNA Binding Sites. Journal of Molecular Biology 386,
1054–1065.doi:10.1016/j.jmb.2009.01.007 (2009).
101. Rohs, R., Jin, X., West, S. M., Joshi, R., Honig, B. & Mann, R. S. Origins of Specificity
in Protein-DNA Recognition. Annual Review of Biochemistry 79, 233–269. doi:10.1146/
annurev-biochem-060408-091030 (2010).
102. Machado,A.C.D.,Saleebyan,S.B.,Holmes,B.T.,Karelina,M.,Tam,J.,Kim,S.Y.,etal.
Proteopedia: 3D visualization and annotation of transcription factor-DNA readout modes.
Biochemistry and Molecular Biology Education 40, 400–401. doi:10.1002/bmb.20650
(2012).
103. Olson,W.K.,Bansal,M.,Burley,S.K.,Dickerson,R.E.,Gerstein,M.,Harvey,S.C.,etal.
A standard reference frame for the description of nucleic acid base-pair geometry. Journal
ofMolecularBiology313,229–237.doi:10.1006/jmbi.2001.4987 (2001).
104. Lu,X.-J.3DNA:asoftwarepackagefortheanalysis,rebuildingandvisualizationofthree-
dimensionalnucleicacidstructures.NucleicAcidsResearch31,5108–5121.doi:10.1093/
nar/gkg680(2003).
105. Rohs, R., West, S. M., Liu, P. & Honig, B. Nuance in the double-helix and its role in pro-
teinDNA recognition. Current Opinion in Structural Biology 19, 171–177. doi:10.1016/
j.sbi.2009.03.002(2009).
106. Dixit, S. B., Beveridge, D. L., Case, D. A., Cheatham, T. E., Giudice, E., Lankas, F.,
et al. Molecular Dynamics Simulations of the 136 Unique Tetranucleotide Sequences of
DNAOligonucleotides.II:SequenceContextEffectsontheDynamicalStructuresofthe10
UniqueDinucleotideSteps.BiophysicalJournal89,3721–3740.doi:10.1529/biophysj.
105.067397(2005).
107. Rohs,R.,Sklenar,H.&Shakked,Z.StructuralandEnergeticOriginsofSequence-Specific
DNA Bending: Monte Carlo Simulations of Papillomavirus E2-DNA Binding Sites. Struc-
ture13,1499–1509.doi:10.1016/j.str.2005.07.005 (2005).
108. Ivani,I.,Dans,P.D.,Noy,A.,Prez,A.,Faustino,I.,Hospital,A.,etal.Parmbsc1:arefined
force field for DNA simulations. Nature Methods 13, 55–58. doi:10.1038/nmeth.3658
(2015).
109. Chen, Y., Zhang, X., Machado, A. C. D., Ding, Y., Chen, Z., Qin, P. Z., et al. Structure
of p53 binding to the BAX response element reveals DNA unwinding and compression to
accommodate base-pair insertion. Nucleic Acids Research 41, 8368–8376. doi:10.1093/
nar/gkt584(2013).
110. Chang,Y.P.,Xu,M.,Machado, A. C.D., Yu,X. J.,Rohs,R.&Chen,X. S.Mechanismof
Origin DNA Recognition and Assembly of an Initiator-Helicase Complex by SV40 Large
72
Tumor Antigen. Cell Reports 3, 1117–1127. doi:10.1016/j.celrep.2013.03.002
(2013).
111. Machado, A. C. D., Zhou, T., Rao, S., Goel, P., Rastogi, C., Lazarovici, A., et al. Evolving
insightsonhowcytosinemethylationaffectsprotein-DNAbinding.BriefingsinFunctional
Genomics14,61–73.doi:10.1093/bfgp/elu040 (2014).
112. Li,J.,Machado,A.C.D.,Guo,M.,Sagendorf,J.M.,Zhou,Z.,Jiang,L.,etal.Structureof
theForkheadDomainofFOXA2BoundtoaCompleteDNAConsensusSite.Biochemistry
56,3745–3753.doi:10.1021/acs.biochem.7b00211 (2017).
113. Pasi, M., Maddocks, J. H., Beveridge, D., Bishop, T. C., Case, D. A., Cheatham, T., et al.
µABC: a systematic microsecond molecular dynamics study of tetranucleotide sequence
effects in B-DNA. Nucleic Acids Research 42, 12272–12283. doi:10.1093/nar/gku855
(2014).
114. Berman,H.M.TheProteinDataBank.NucleicAcidsResearch28,235–242.doi:10.1093/
nar/28.1.235(2000).
115. Berman, H., Olson, W., Beveridge, D., Westbrook, J., Gelbin, A., Demeny, T., et al. The
nucleicaciddatabase.Acomprehensiverelationaldatabaseofthree-dimensionalstructures
ofnucleicacids.BiophysicalJournal63,751–759.doi:10.1016/s0006-3495(92)81649-
1(1992).
116. Weirauch,M.T.,Cote,A.,Norel,R.,Annala,M.,Zhao,Y.,Riley,T.R.,etal.Evaluationof
methods for modeling transcription factor sequence specificity. Nature Biotechnology 31,
126–134.doi:10.1038/nbt.2486(2013).
117. Ma,W.,Yang,L.,Rohs,R.&Noble,W.S.DNAsequence+shapekernelenablesalignment-
free modeling of transcription factor binding. Bioinformatics 33 (ed Hancock, J.) 3003–
3010.doi:10.1093/bioinformatics/btx336 (2017).
118. Yang,L.,Zhou,T.,Dror,I.,Mathelier,A.,Wasserman,W.W.,Gordn,R.,etal.TFBSshape:
amotifdatabaseforDNAshapefeaturesoftranscriptionfactorbindingsites.NucleicAcids
Research42,D148–D155.doi:10.1093/nar/gkt1087 (2013).
119. Olson, W. K., Gorin, A. A., Lu, X.-J., Hock, L. M. & Zhurkin, V. B. DNA sequence-
dependent deformability deduced from protein-DNA crystal complexes. Proceedings of
the National Academy of Sciences 95, 11163–11168. doi:10.1073/pnas.95.19.11163
(1998).
120. Chiu,T.-P.,Comoglio,F.,Zhou,T.,Yang,L.,Paro,R.&Rohs,R.DNAshapeR:anR/Bioconductor
package for DNA shape prediction and feature encoding. Bioinformatics 32, 1211–1213.
doi:10.1093/bioinformatics/btv735 (2015).
121. Peng,P.-C.&Sinha,S.QuantitativemodelingofgeneexpressionusingDNAshapefeatures
ofbindingsites.NucleicAcidsResearch44,e120–e120.doi:10.1093/nar/gkw446(2016).
122. Andrabi,M.,Hutchins,A.P.,Miranda-Saavedra,D.,Kono,H.,Nussinov,R.,Mizuguchi,K.,
et al. Predicting conformational ensembles and genome-wide transcription factor binding
sites from DNA sequences. Scientific Reports 7. doi:10.1038/s41598-017-03199-6
(2017).
123. Dans, P. D., Prez, A., Faustino, I., Lavery, R. & Orozco, M. Exploring polymorphisms in
B-DNA helical conformations. Nucleic Acids Research 40, 10668–10678. doi:10.1093/
nar/gks884(2012).
124. Park, P. J. ChIP–seq: advantages and challenges of a maturing technology. Nature reviews
genetics10,669–680(2009).
73
125. Angermueller,C.,Lee,H.J.,Reik,W.&Stegle,O.DeepCpG:accuratepredictionofsingle-
cell DNA methylation states using deep learning. Genome Biology 18. doi:10.1186/
s13059-017-1189-z(2017).
126. Yang, B., Liu, F., Ren, C., Ouyang, Z., Xie, Z., Bo, X., et al. BiRen: predicting enhancers
withadeep-learning-basedmodelusingtheDNAsequencealone.Bioinformatics33,1930–
1936(2017).
127. Xu, H., Jia, P. & Zhao, Z. Deep4mC: systematic assessment and computational prediction
forDNAN4-methylcytosinesitesbydeeplearning.BriefingsinBioinformatics22,bbaa099
(2021).
128. Yu,H.&Dai,Z.SNNRice6mA:ADeepLearningMethodforPredictingDNAN6-Methyladenine
SitesinRiceGenome.FrontiersinGenetics10.doi:10.3389/fgene.2019.01071(2019).
129. Avsec,., Weilert, M., Shrikumar, A., Krueger, S., Alexandari, A., Dalal, K., et al. Base-
resolution models of transcription-factor binding reveal soft motif syntax. Nature Genetics
53,354–366.doi:10.1038/s41588-021-00782-6 (2021).
130. Hoffman, G. E., Bendl, J., Girdhar, K., Schadt, E. E. & Roussos, P. Functional interpreta-
tion of genetic variants using deep learning predicts impact on chromatin accessibility and
histonemodification.Nucleicacidsresearch47,10597–10611(2019).
131. Qin, Q. & Feng, J. Imputation for transcription factor binding predictions based on deep
learning. PLOS Computational Biology 13 (ed Ioshikhes, I.) e1005403. doi:10.1371/
journal.pcbi.1005403 (2017).
132. Osz,J.,Brlivet,Y.,Peluso-Iltis,C.,Cura,V.,Eiler,S.,Ruff,M., et al.Structuralbasisfora
molecular allosteric control mechanism of cofactor binding to nuclear receptors. Proceed-
ingsoftheNationalAcademyofSciences109,E588–E594(2012).
133. Kim,S.,Brostrmer,E.,Xing,D.,Jin,J.,Chong,S.,Ge,H.,etal.Probingallosterythrough
DNA.Science339,816–819(2013).
134. Drsata,T.,Zgarbov,M.,pakov,N.,Jureka,P.,Sponer,J.&Lankas,F.Mechanicalmodelof
DNAallostery.Thejournalofphysicalchemistryletters5,3831–3835(2014).
135. Xu, X., Ge, H., Gu, C., Gao, Y. Q., Wang, S. S., Thio, B. J. R., et al. Modeling spatial
correlationofDNAdeformation:DNAallosteryinproteinbinding.TheJournalofPhysical
ChemistryB117,13378–13387(2013).
136. Jimnez, J., Doerr, S., Martnez-Rosell, G., Rose, A. S. & Fabritiis, G. D. DeepSite: protein-
binding site predictor using 3D-convolutional neural networks. Bioinformatics 33 (ed Va-
lencia,A.)3036–3042.doi:10.1093/bioinformatics/btx350 (2017).
137. Wang,X.,Terashi,G.,Christoffer,C.W.,Zhu,M.&Kihara,D.Proteindockingmodeleval-
uation by 3D deep convolutional neural networks. Bioinformatics 36 (ed Ponty, Y.) 2113–
2118.doi:10.1093/bioinformatics/btz870 (2019).
138. Townshend, R. J. L., Eismann, S., Watkins, A. M., Rangan, R., Karelina, M., Das, R., et
al. Geometric deep learning of RNA structure. Science 373, 1047–1051. doi:10.1126/
science.abe5650(2021).
139. Stokes, J. M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N. M., et al. A
deeplearningapproachtoantibioticdiscovery.Cell180,688–702(2020).
140. Travers, A. A. DNA conformation and protein binding. Annual review of biochemistry 58,
427–452(1989).
141. Rich,A.&Zhang,S.Z-DNA:thelongroadtobiologicalfunction.NatureReviewsGenetics
4,566–572(2003).
74
142. Medvedeva, Y. A., Khamis, A. M., Kulakovskiy, I. V., Ba-Alawi, W., Bhuyan, M. S. I.,
Kawaji,H.,etal.Effectsofcytosinemethylationontranscriptionfactorbindingsites.BMC
genomics15,1–12(2014).
143. Afek,A.,Shi,H.,Rangadurai,A.,Sahay,H.,Senitzki,A.,Xhani,S.,etal.DNAmismatches
revealconformationalpenaltiesinprotein–DNArecognition.Nature587,291–296(2020).
144. Kitayner,M.,Rozenberg,H.,Rohs,R.,Suad,O.,Rabinovich,D.,Honig,B.,etal.Diversity
inDNArecognitionbyp53revealedbycrystalstructureswithHoogsteenbasepairs.Nature
structural&molecularbiology17,423–429(2010).
145. Townshend,R.,Bedi,R.,Suriana,P.&Dror,R.End-to-endlearningon3dproteinstructure
for interface prediction. Advances in Neural Information Processing Systems 32, 15642–
15651(2019).
146. Simonovsky,M.&Meyers,J.DeeplyTough:learningstructuralcomparisonofproteinbind-
ingsites.Journalofchemicalinformationandmodeling60,2356–2366(2020).
147. Fout, A. M. Protein interface prediction using graph convolutional networks PhD thesis
(ColoradoStateUniversity,2017).
148. Gainza,P.,Sverrisson,F.,Monti,F.,Rodola,E.,Boscaini,D.,Bronstein,M.,etal.Decipher-
inginteraction fingerprints from proteinmolecular surfacesusing geometricdeep learning.
NatureMethods17,184–192(2020).
149. Dai, B. & Bailey-Kellogg, C. Protein interaction interface region prediction by geometric
deeplearning.Bioinformatics(2021).
150. Pittala,S.&Bailey-Kellogg,C.Learningcontext-awarestructuralrepresentationstopredict
antigenandantibodybindinginterfaces.Bioinformatics36,3996–4003(2020).
151. Torng,W.&Altman,R.B.Graphconvolutionalneuralnetworksforpredictingdrug-target
interactions.Journalofchemicalinformationandmodeling59,4131–4149(2019).
152. Tan,C.&Takada,S.Nucleosomeallosteryinpioneertranscriptionfactorbinding.Proceed-
ingsoftheNationalAcademyofSciences117,20586–20596(2020).
153. Zhu,F.,Farnung,L.,Kaasinen,E.,Sahu,B.,Yin,Y.,Wei,B.,etal.Theinteractionlandscape
betweentranscriptionfactorsandthenucleosome.Nature562,76–81(2018).
154. Basu, A., Bobrovnikov, D. G., Qureshi, Z., Kayikcioglu, T., Ngo, T. T., Ranjan, A., et al.
MeasuringDNAmechanicsonthegenomescale.Nature589,462–467(2021).
155. Dantas Machado, A. C., Cooper, B. H., Lei, X., Di Felice, R., Chen, L. & Rohs, R. Land-
scapeofDNAbindingsignaturesofmyocyteenhancerfactor-2Brevealsauniqueinterplay
ofbaseandshapereadout.Nucleicacidsresearch48,8529–8544(2020).
156. Jolma,A.,Yan,J.,Whitington,T.,Toivonen,J.,Nitta,K.R.,Rastas,P.,etal.DNA-binding
specificitiesofhumantranscriptionfactors.Cell152,327–339(2013).
157. Kingma,D.P.&Ba,J.Adam:Amethodforstochasticoptimization.arXivpreprintarXiv:1412.6980
(2014).
158. Tolstorukov,M.Y.,Colasanti, A.V.,McCandlish,D.M.,Olson,W.K.&Zhurkin,V.B. A
novelroll-and-slidemechanismofDNAfoldinginchromatin:implicationsfornucleosome
positioning.Journalofmolecularbiology371,725–738(2007).
159. Brennan, R. G. & Matthews, B. W. The helix-turn-helix DNA binding motif. Journal of
BiologicalChemistry264,1903–1906(1989).
160. Liu, C. F., Brandt, G. S., Hoang, Q. Q., Naumova, N., Lazarevic, V., Hwang, E. S., et
al. Crystal structure of the DNA binding domain of the transcription factor T-bet suggests
75
simultaneous recognition of distant genome sites. Proceedings of the National Academy of
Sciences113,E6572–E6581(2016).
161. Coll, M., Seidman, J. G. & Mller, C. W. Structure of the DNA-bound T-box domain of
human TBX3, a transcription factor responsible for ulnar-mammary syndrome. Structure
10,343–356(2002).
162. De Masi, F., Grove, C. A., Vedenko, A., Alibes, A., Gisselbrecht, S. S., Serrano, L., et al.
Using a structural and logics systems approach to infer bHLH–DNA binding specificity
determinants.Nucleicacidsresearch39,4553–4563(2011).
163. Wei, G.-H., Badis, G., Berger, M. F., Kivioja, T., Palin, K., Enge, M., et al. Genome-wide
analysis of ETS-family DNA-binding in vitro and in vivo. The EMBO journal 29, 2147–
2160(2010).
164. Tanaka, Y., Nureki, O., Kurumizaka, H., Fukai, S., Kawaguchi, S., Ikuta, M., et al. Crystal
structure of the CENP-B protein–DNA complex: the DNA-binding domains of CENP-B
inducekinksintheCENP-BboxDNA.TheEMBOjournal20,6612–6618(2001).
165. Matharu, N. K., Hussain, T., Sankaranarayanan, R. & Mishra, R. K. Vertebrate homologue
ofDrosophilaGAGAfactor.Journalofmolecularbiology400,434–447(2010).
166. Li, L. & Davie, J. R. The role of Sp1 and Sp3 in normal and cancer cell biology. Annals of
Anatomy-AnatomischerAnzeiger 192,275–283(2010).
167. Caizzi, L., Ferrero, G., Cutrupi, S., Cordero, F., Ballar, C., Miano, V., et al. Genome-wide
activity of unliganded estrogen receptor-α in breast cancer cells. Proceedings of the Na-
tionalAcademyofSciences111,4892–4897(2014).
168. Srandour, A. A., Avner, S., Percevault, F., Demay, F., Bizot, M., Lucchetti-Miganeh, C., et
al.EpigeneticswitchinvolvedinactivationofpioneerfactorFOXA1-dependentenhancers.
Genomeresearch21,555–565(2011).
169. Ishii, H., Sen, R. & Pazin, M. J. Combinatorial control of DNase I-hypersensitive site for-
mation and erasure by immunoglobulin heavy chain enhancer-binding proteins. Journal of
BiologicalChemistry279,7331–7338(2004).
170. Johnson, G. D., Jodar, M., Pique-Regi, R. & Krawetz, S. A. Nuclease footprints in sperm
projectpastandfuturechromatinregulatoryevents.Scientificreports6,1–17(2016).
171. Yang, H., Schramek, D., Adam, R. C., Keyes, B. E., Wang, P., Zheng, D., et al. ETS fam-
ily transcriptional regulators drive chromatin dynamics and malignancy in squamous cell
carcinomas.Elife4,e10870(2015).
172. Baena,E.,Shao,Z.,Linn,D.E.,Glass,K.,Hamblen,M.J.,Fujiwara,Y.,etal.ETV1directs
androgen metabolism and confers aggressive prostate cancer in targeted mice and patients.
Genes&development 27,683–698(2013).
173. Chi, P., Chen, Y., Zhang, L., Guo, X., Wongvipat, J., Shamu, T., et al. ETV1 is a lineage
survival factor that cooperates with KIT in gastrointestinal stromal tumours. Nature 467,
849–853(2010).
174. Jan-Valbuena,J.,Widlund,H.R.,Perner,S.,Johnson,L.A.,Dibner,A.C.,Lin,W.M.,etal.
AnoncogenicroleforETV1inmelanoma.Cancerresearch70,2075–2084(2010).
175. Riggi,N.,Knoechel,B.,Gillespie,S.M.,Rheinbay,E.,Boulay,G.,Suv,M.L.,etal.EWS-
FLI1 utilizes divergent chromatin remodeling mechanisms to directly activate or repress
enhancerelementsinEwingsarcoma.Cancercell26,668–681(2014).
76
176. Rodriguez, A. C., Vahrenkamp, J. M., Berrett, K. C., Clark, K. A., Guillen, K. P., Scherer,
S. D., et al. ETV4 is necessary for estrogen signaling and growth in endometrial cancer
cells.Cancerresearch80,1234–1245(2020).
177. Haoze, V. Y., Tao, L., Llamas, J., Wang, X., Nguyen, J. D., Trecek, T., et al. POU4F3
pioneeractivityenablesATOH1todrivediversemechanoreceptordifferentiationthrougha
feed-forwardepigeneticmechanism.ProceedingsoftheNationalAcademyofSciences118
(2021).
178. Shan, J., Fu, L., Balasubramanian, M. N., Anthony, T. & Kilberg, M. S. ATF4-dependent
regulationoftheJMJD3geneduringaminoaciddeprivationcanberescuedinAtf4-deficient
cells by inhibition of deacetylation. Journal of Biological Chemistry 287, 36393–36403
(2012).
179. Makowski, M. M., Gaullier, G. & Luger, K. Picking a nucleosome lock: sequence-and
structure-specificrecognitionofthenucleosome.Journalofbiosciences45,1–7(2020).
180. Hansen, R. S., Thomas, S., Sandstrom, R., Canfield, T. K., Thurman, R. E., Weaver, M.,
etal.SequencingnewlyreplicatedDNArevealswidespreadplasticityinhumanreplication
timing.ProceedingsoftheNationalAcademyofSciences107,139–144(2010).
181. Prioleau,M.-N.&MacAlpine,D.M.DNAreplicationorigins—wheredowebegin?Genes
&development 30,1683–1697(2016).
182. MacAlpine,H.K.,Gordn,R.,Powell,S.K.,Hartemink,A.J.&MacAlpine,D.M.Drosophila
ORClocalizestoopenchromatinandmarkssitesofcohesincomplexloading. Genome re-
search20,201–211(2010).
183. Bell,S.P.&Stillman,B.ATP-dependentrecognitionofeukaryoticoriginsofDNAreplica-
tionbyamultiproteincomplex.Nature357,128–134(1992).
184. Li, N., Lam, W. H., Zhai, Y., Cheng, J., Cheng, E., Zhao, Y., et al. Structure of the origin
recognitioncomplexboundtoDNAreplicationorigin.Nature559,217–222(2018).
185. Bleichert, F., Leitner, A., Aebersold, R., Botchan, M. R. & Berger, J. M. Conformational
control and DNA-binding mechanism of the metazoan origin recognition complex. Pro-
ceedingsoftheNationalAcademyofSciences115,E5906–E5915(2018).
186. Chiu,T.-P.,Rao,S.,Mann,R.S.,Honig,B.&Rohs,R.Genome-widepredictionofminor-
groove electrostatic potential enables biophysical modeling of protein–DNA binding. Nu-
cleicacidsresearch45,12565–12576(2017).
187. Lu, X.-J. & Olson, W. K. 3DNA: a versatile, integrated software system for the analysis,
rebuilding and visualization of three-dimensional nucleic-acid structures. Nature protocols
3,1213–1227(2008).
188. DeLano,W.L.etal.Pymol:Anopen-sourcemoleculargraphicstool.CCP4Newsletteron
proteincrystallography40,82–92(2002).
189. Miotto, B., Ji, Z. & Struhl, K. Selectivity of ORC binding sites and the relation to replica-
tiontiming,fragilesites,anddeletionsincancers. Proceedings of the National Academy of
Sciences113,E4810–E4819(2016).
190. Belsky, J. A., MacAlpine, H. K., Lubelsky, Y., Hartemink, A. J. & MacAlpine, D. M.
Genome-wide chromatin footprinting reveals changes in replication origin architecture in-
ducedbypre-RCassembly.Genes&development 29,212–224(2015).
191. Serban, I. V., Lowe, R., Charlin, L. & Pineau, J. Generative deep neural networks for dia-
logue:Ashortreview.arXivpreprintarXiv:1611.06216 (2016).
77
192. Huszr,F.How(not)totrainyourgenerativemodel:Scheduledsampling,likelihood,adver-
sary?arXivpreprintarXiv:1511.05101(2015).
193. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al.
Generativeadversarialnets.Advancesinneuralinformationprocessingsystems27(2014).
194. Ivani,I.,Dans,P.D.,Noy,A.,Prez,A.,Faustino,I.,Hospital,A.,etal.Parmbsc1:arefined
forcefieldforDNAsimulations.Naturemethods13,55–58(2016).
195. Knotts IV, T. A., Rathore, N., Schwartz, D. C. & De Pablo, J. J. A coarse grain model for
DNA.TheJournalofchemicalphysics126,02B611(2007).
196. Thomas, N., Smidt, T., Kearnes, S., Yang, L., Li, L., Kohlhoff, K., et al. Tensor field
networks: Rotation-and translation-equivariant neural networks for 3d point clouds. arXiv
preprintarXiv:1802.08219(2018).
197. Fuchs,F.B.,Worrall,D.E.,Fischer,V.&Welling,M.Se(3)-transformers:3droto-translation
equivariantattentionnetworks.arXivpreprintarXiv:2006.10503(2020).
198. Sklenar, H., Wstner, D. & Rohs, R. Using internal and collective variables in Monte Carlo
simulations of nucleic acid structures: Chain breakage/closure algorithm and associated
Jacobians. Journal of Computational Chemistry 27, 309–315. doi:10.1002/jcc.20345
(2005).
199. Cornell, W. D., Cieplak, P., Bayly, C. I., Gould, I. R., Merz, K. M., Ferguson, D. M., et
al. A Second Generation Force Field for the Simulation of Proteins, Nucleic Acids, and
Organic Molecules. Journal of the American Chemical Society 117, 5179–5197. doi:10.
1021/ja00124a002(1995).
200. Rohs, R., Etchebest, C. & Lavery, R. Unraveling Proteins: A Molecular Mechanics Study.
BiophysicalJournal76,2760–2768.doi:10.1016/s0006-3495(99)77429-1 (1999).
201. Lavery,R., Zakrzewska,K., Beveridge, D.,Bishop, T.C.,Case, D. A.,Cheatham, T., et al.
A systematic molecular dynamics study of nearest-neighbor effects on base pair and base
pair step conformations and fluctuations in B-DNA. Nucleic Acids Research 38, 299–313.
doi:10.1093/nar/gkp834(2009).
202. Berendsen, H. J. C., Grigera, J. R. & Straatsma, T. P. The missing term in effective pair
potentials.TheJournalofPhysicalChemistry91,6269–6271.doi:10.1021/j100308a038
(1987).
203. Dang, L. X. Mechanism and Thermodynamics of Ion Selectivity in Aqueous Solutions of
18-Crown-6Ether:AMolecularDynamicsStudy.JournaloftheAmericanChemicalSoci-
ety117,6954–6960.doi:10.1021/ja00131a018 (1995).
204. Lavery, R., Moakher, M., Maddocks, J. H., Petkeviciute, D. & Zakrzewska, K. Conforma-
tionalanalysisofnucleicacidsrevisited:Curves+.NucleicAcidsResearch37,5917–5929.
doi:10.1093/nar/gkp608(2009).
205. Fujii,S.,Kono,H.,Takenaka,S.,Go,N.&Sarai,A.Sequence-dependentDNAdeformabil-
ity studied using molecular dynamics simulations. Nucleic Acids Research35, 6063–6074.
doi:10.1093/nar/gkm627(2007).
206. Prez, A., Lankas, F., Luque, F. J. & Orozco, M. Towards a molecular dynamics consensus
view of B-DNA flexibility. Nucleic Acids Research 36, 2379–2394. doi:10.1093/nar/
gkn082(2008).
207. Prez, A., Marchn, I., Svozil, D., Sponer, J., Cheatham, T. E., Laughton, C. A., et al. Re-
finement of the AMBER Force Field for Nucleic Acids: Improving the Description ofα/γ
78
Conformers. Biophysical Journal 92, 3817–3829. doi:10.1529/biophysj.106.097782
(2007).
79
Appendices
A SupplementarymaterialsandmethodsforChapter2
A.1 PreparationoftheUniversalProteinBindingMicroarray(uPBM)Dataset
AnanalysisofuPBMdata[59,60]wascomparedwithresultsfromotherexperimentalbindingas-
says. TheuPBMdatasetusedinthisstudycontaineddatafor66mousetranscriptionfactors(TFs)
used in the DREAM5 challenge [116]. Briefly, uPBM data comprise DNA sequences designed to
ensurethatallpossible10-bpsequencesarerepresentedonthearray. Thepre-processingofuPBM
datasets from the DREAM5 challenge [116] was previously described [69]. These data contain
sequences of shorter length than sequences from gcPBM [87], SELEX-seq [92], or HT-SELEX
[95]datasets.
A.2 MonteCarlo(MC)Simulations
All-atom MC simulations of 2,121 DNA fragments of 1227 bp in length were analyzed to de-
rive DNA shape features. These previously reported MC simulations [68] were performed over 2
million MC cycles with standard B-DNA used as starting configuration [107]. MC sampling was
based on collective and internal variables combined with analytic chain closure using associated
Jacobians[198]. EnergycalculationsemployedtheAMBERforcefield[199]andimplicitsolvent
combined with explicit sodium counter ions [200]. Average structures were generated by using a
previously described simulation protocol [107] and analyzed with CURVES [6]. Resulting three-
dimensional MC predictions for 2,121 DNA sequences were mined in terms of the 512 unique
80
pentamers (data used in our DNAshape method [68] and labeled as DNAshape) or 136 unique
tetramers(resultingdatasetlabeledasMC4).
A.3 MolecularDynamics(MD)Simulations
MolecularDynamics(MD)SimulationsMDtrajectoriesoftheABCdatasetof39double-stranded
B-DNA oligomers [201] were analyzed to obtain information on the microsecond-scale average
tetranucleotide-dependent structure of B-DNA [113]. In particular, each oligomer in the ABC
datasetwas18-bplong,constructedbyrepeatingaparticular4-bpsequenceABCDthreeandahalf
times: 5-GC-CD-ABCD-ABCD-ABCD-GC-3. Sequences were designed such that the resulting
setof18-merscoveredeachofthe136distincttetra-nucleotidesequences(resultingdatasetlabeled
as MD4) with a minimum of three occurrences. One microsecond MD simulations on each of
the oligomers in the ABC dataset were performed in explicit solvent [202] with a physiological
concentration of K+ Cl- [203] by using the parmbsc0 modifications to the AMBER parm99 force
field. CURVES+ [204] was used to obtain the ensemble averages and standard deviations (SDs)
ofhelicalparametersfromthecentermostoccurrenceofeachtetranucleotide[113,201]. Thefirst
10% of the trajectories were excluded from the analysis. Further details on the MD simulation
protocolandanalysiswerepreviouslydescribed[113].
A.4 FeatureEncodingandNormalization
For a single sequence, we first encoded its sequence features as 1-mers, 2-mers, or 3-mers, using
one-hot encoding, and encoded shape features for that sequence. We used the DNAshape method
[68] within the DNAshapeR package [120] to generate these features. For a single shape feature
category,suchasMGW,wepredictedvaluesforagivensequenceasavectorof{MGW
3
,MGW
4
,...,MGW
n−2
}
(the first two positions are not predictable because the method is pentamer-based). To be better
compatible with regression models, predicted shape features were normalized by subtracting a
global minimum value acquired from the respective query table (see Materials and Methods) and
81
dividingbyanSDvalue. ThisSDvaluewascalculatedforeachshapefeatureandbasedonallval-
ues for this shape feature category from a given dataset. The SD value used in normalization was
computed separately for each individual dataset. In addition, if a position weight matrix (PWM)
wasknownforaparticularTFanditwaspalindromic,weaveragedeachfeaturevector,including
thesequencefeatureandshapefeaturescomplementedbythereversecomplement.
Once each feature was encoded, normalized, and complemented by its reverse palindrome (if
applicable),weconcatenatedallfeaturesinafinalfeaturevector:
{
⃗
S
1
,...
⃗
S
n
;MGW
3
,...,MGW
n−3
;HelT
2
,...HelT
n−1
;...
}
(A.1)
where
⃗
S
i
denotesthek-mersequencefeatureatthei-thposition,whichshouldhavealengthoffour
for 1mer, 16 for 2mer, and 64 for 3mer features. Other features, such as MGW and HelT, denote
shapefeatures;andndenotesthesequencelength.
A.5 MeanSquaredError(MSE)Calculation
MSEiscloselyrelatedtoR
2
andcanbecomputedasfollows:
MSE=
∑
i
(y
i
− ˆ y
i
)
2
n−m
(A.2)
where y
i
and ˆ y
i
representtheobservedandpredictedbindingaffinityscores,respectively; nrepre-
sents the sample size (e.g., total number of sequences); and m represents the length of the feature
vector,includingtheintercept. Therefore,when R
2
ishigher,MSEcanalsobehigherbecausethe
lengthofthefeaturevectoraffectsit.
A.6 LimitationsofDynaSeq-derivedShapeFeatures
While our work was in revision, a different group published alternative DNA shape features de-
rived only from MD simulations [122]. These DynaSeq shape features have several differences
82
comparedtotheMDfeaturesderivedfromtheABCdataset(seeMDSimulationssection). Firstly,
theDynaSeqshapefeatureswerederivedfromasequencedesigninwhicheachofthe136unique
tetramersoccurredonlyonceandwerealwaysflanked5and3byGCGCtetrads[205]. Thehydra-
tionpatternsofGC-richsequenceshavebeenshowntodifferfromthoseofothersequenceenviron-
ments; thus, the use of a single sequence environment might not fully sample the conformational
space of the central tetramer [206]. Secondly, DynaSeq features were based on MD simulations
that were an order of magnitude shorter than the ABC data used here. Thirdly, DynaSeq used the
Bsc0forcefield[207],despiteavailabilityoftheBsc1forcefieldthatcouldimprovetheconforma-
tional sampling of DNA and the accuracy of MD simulations [108]. Moreover, DynaSeq-derived
shapefeatureswerenotvalidatedthroughdirectcomparisonwithexperimentaldata. Instead,they
werevalidatedindirectlythroughtheabilitytoidentifyasequencewithinarandomsetofotherse-
quences based on structural information. However, this validation included many DNA structures
affectedbydrugbindingandevenbpmismatches[122].
83
B SupplementaryfiguresforChapter2
● ● ● ● 0.0
0.5
1.0
0.0 0.5 1.0
1mer+4DNAshape R
2
1mer+4shape
MC4
R
2
Family
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● bHLH
bZIP
C2H2
CENPB
CP2
CUT
ETS
Forkhead
GATA
GCM
Homeo−
domain
HSF
IRF
MAD
MEIS
MYB
NFAT
NFI
Nuclear
receptor
PAX
POU
PROX
RRM
RUNX
SAND
TBX
TEA
Dataset
● gcPBM
HT−SELEX
SELEX−seq
Figure B.1: Performance comparison using the four shape features MGW, ProT, HelT, and Roll
from MC-derived pentamers combined with 1mer sequence (1mer+4DNAshape model) and MC-
derived tetramer query tables combined with 1mer sequence (1mer+4shapeMC4 model) for mul-
tipledatasetsandTFfamilies.
84
● ● ● ● 0.0
0.5
1.0
0.0 0.5 1.0
4DNAshape R
2
4shape
MC4
R
2
● ● ● ● 0.0
0.5
1.0
0.0 0.5 1.0
4DNAshape R
2
4shape
XRC4
R
2
● ● ● ● 0.0
0.5
1.0
0.0 0.5 1.0
4shape
MC4
R
2
4shape
MD4
R
2
● ● ● ● 0.0
0.5
1.0
0.0 0.5 1.0
4DNAshape R
2
4shape
MD4
R
2
Family
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● bHLH
bZIP
C2H2
CENPB
CP2
CUT
ETS
Forkhead
GATA
GCM
Homeo−
domain
HSF
IRF
MAD
MEIS
MYB
NFAT
NFI
Nuclear
receptor
PAX
POU
PROX
RRM
RUNX
SAND
TBX
TEA
Dataset
● gcPBM
HT−SELEX
SELEX−seq
Figure B.2: Performance comparison between shape-only models (MC-derived tetramer-based
4shapeMC4, MC-derived pentamer-based 4shapeDNAshape, and MD-derived tetramer-based
4shapeMD4 models) for multiple datasets and TF families. We note that 1mer sequence features
werenotincludedinthesemodels.
85
NR PAX POU PROX RRM RUNX SAND TBX TEA
1mer
4shape
XRC4
4shape
MC4
4DNAshape
4shape
MD4
1mer+4shape
XRC4
1mer+4shape
MC4
1mer+4DNAshape
1mer+4shape
MD4
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
R
2
GCM Homeodomain HSF IRF MAD MEIS MYB NFAT NFI
1mer
4shape
XRC4
4shape
MC4
4DNAshape
4shape
MD4
1mer+4shape
XRC4
1mer+4shape
MC4
1mer+4DNAshape
1mer+4shape
MD4
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
0.8
R
2
bHLH bZIP C2H2 CENPB CP2 CUT ETS Forkhead GATA
1mer
4shape
XRC4
4shape
MC4
4DNAshape
4shape
MD4
1mer+4shape
XRC4
1mer+4shape
MC4
1mer+4DNAshape
1mer+4shape
MD4
0.0
0.2
0.4
0.6
0.8
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
0.8
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.0
0.2
0.4
0.6
R
2
Figure B.3: Summary of the weighted-average performances for different models. Shape features
usedherewereMGW,ProT,HelT,andRoll(representingthefourfeaturesderivedfromMC,MD,
orXRCdatabyminingthe136uniquetetramers,orthepentamer-basedDNAshape[68]method).
Error bars were calculated based on the standard error of the mean. Model performances were
dividedintogroupsbasedonTFfamiliesusedinthisstudy. Identicalcolorwasappliedformodels
usingthesamesourceofshapefeatures. AmonglabelsforTFfamilies,NRrepresentsthenuclear
receptorfamily.
86
Family
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
bHLH
bZIP
C2H2
CENPB
CP2
CUT
ETS
Forkhead
GATA
GCM
Homeo−
domain
HSF
IRF
MAD
MEIS
MYB
NFAT
NFI
Nuclear
receptor
PAX
POU
PROX
RRM
RUNX
SAND
TBX
TEA
Dataset
gcPBM
HT−SELEX
SELEX−seq
0
0.25
0.5
0.75
1
0 0.25 0.5 0.75 1
1mer MSE (× 10
−4
)
1mer+4DNAshape MSE (× 10
−4
)
0
0.25
0.5
0.75
1
0 0.25 0.5 0.75 1
1mer+4DNAshape MSE (× 10
−4
)
1mer+4shape
MD4
MSE (× 10
−4
)
Figure B.4: Mean squared error (MSE) for different TF datasets for different models. A few
datasetsmayhavelargerMSEvaluesthatareoutsidetheplot.
A)Performanceof1mermodelscomparedtomodelsaugmentedbyfourMCderivedandpentamer-
basedshapefeatures(1mer+4DNAshapemodels).
B) Performance of 1mer+4DNAshape models compared to MD-derived and tetramer-based
1mer+4shapeMD4models.
87
2.5e−06
5.0e−06
7.5e−06
0 1 2 3 4 5 6 7 8 9 10 11 12 13
1mer+Nshape
MSE
Dataset
gcPBM
HT−SELEX
SELEX−seq
Figure B.5: MSE decreases as N increases in 1mer+Nshape models. MC-derived and pentamer-
basedDNAshapefeatures[68]wereusedinthisanalysis. Eachdotrepresentsthemeanamongall
datasetsineachcategory,whereastheerrorbarrepresentsthestandarderrorofthedatasetsineach
experimentalcategory.
88
0.4
0.6
0.8
0 1 2 3 4 5 6 7 8 9 10 11 12 13
1mer+Nshape
R
2
gcPBM
HT−SELEX
SELEX−seq
uPBM
Figure B.6: Plot equivalent to Figure 2.5(B), with the addition of the DREAM5 uPBM dataset
[116], generated after repeatedly adding shape features into the previous model. Shape features
used in this plot were only derived from MC data based on pentamers, denoted DNAshape [68].
In each round, the most-informative feature was added based on model performance by using
the DNAshape query table. Performance measures were calculated based on weighted mean R
2
among all datasets in each experimental category. Error bars indicate maximum and minimum
performanceswhenN shapefeatureswereadded.
89
● ● ● ● ●
●
●
●
●
●
●
● ●
●
●
●
● ● ● ● ●
●
●
● ●
●
● ●
●
● ● ● ● ●
●
●
●
● ●
●
● ●
● ● ● ● ●
●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
● ●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
● ●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ●
●
●
● ● ●
● ●
●
● ●
●
●
●
●
●
● ●
●
●
● ●
●
●
●
● ●
●
●
●
●
● ●
●
● ●
●
● ●
●
● ●
● ● ●
●
● ●
●
● ● ● ● ●
●
●
●
●
●
● ● ●
●
●
● ●
●
● ●
●
●
● ● ●
● ● ● ● ● ● ● ● ●
●
●
● ●
●
●
● ●
●
●
● ●
●
● ● ●
● ● ●
●
●
●
●
● ●
●
●
●
●
● ● ●
●
●
●
●
●
● ●
●
●
● ●
● ● ● ●
●
● ●
●
● ●
●
●
●
●
●
●
● ●
●
●
●
● ●
●
● ● ● ●
●
●
●
●
●
●
●
● ● ●
●
●
●
● ●
●
●
●
● ● ●
●
●
●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ● ●
●
●
●
●
●
●
● ● ●
● ● ●
●
●
●
●
●
●
● ●
●
●
● ●
● ● ●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
● ●
●
● ●
● ●
● ● ●
●
● ●
●
●
● ● ● ●
●
●
● ● ● ●
●
●
● ●
● ●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
● ● ●
● ● ●
●
●
●
● ● ●
●
●
●
●
●
● ● ● ●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ●
● ●
●
●
●
●
● ●
● ● ●
●
●
● ●
●
●
●
●
●
●
●
● ● ●
● ●
●
●
●
● ● ●
● ●
●
● ●
●
●
●
●
●
●
●
●
● ●
●
●
● ●
●
●
● ●
●
●
● ●
●
● ● ●
●
●
● ●
● ● ●
● ● ● ● ● ● ● ● ●
●
● ● ●
●
●
●
●
●
● ●
● ● ●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
● ●
● ● ●
●
●
●
●
●
●
● ● ● ●
● ● ●
● ● ●
●
● ● ● ● ● ● ● ●
●
●
● ● ●
●
●
● ● ●
●
● ● ● ●
●
● ●
●
● ●
● ● ● ●
●
● ● ● ● ● ● ● ●
● ● ● ●
●
● ●
● ●
●
●
●
●
●
●
●
●
●
● ● ●
● ● ●
● ● ●
● ●
●
●
●
●
●
●
● ●
●
● ● ●
●
●
●
●
●
●
●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
● ●
●
● ●
● ●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
● ●
●
●
● ●
●
●
●
●
●
●
● ●
● ●
●
● ● ● ●
●
● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ●
● ●
●
● ●
●
●
● ●
●
● ●
●
● ●
●
● ●
● ●
●
● ●
●
●
●
● ●
● ●
●
● ●
● ●
●
● ●
● ●
● ●
●
● ●
● ●
●
●
● ●
● ●
● ● ● ●
●
● ●
●
●
● ●
●
●
●
●
●
● ●
●
●
● ●
● ●
●
●
● ●
● ● ●
●
●
● ●
●
● ●
● ●
● ● ● ●
●
●
●
● ● ● ●
●
● ● ● ● ●
●
● ● ● ●
●
● ●
●
●
●
●
●
● ●
●
● ● ● ●
● ●
● ●
● ● ●
●
●
● ● ●
●
●
● ●
●
●
● ●
●
● ● ●
● ●
● ● ● ●
●
● ● ●
●
● ●
● ●
●
● ● ● ●
● ●
●
● ● ●
●
● ●
●
● ●
● ●
●
3.54
0
10
20
Inter−family Intra−family
Euclidean distance
1mer+13DNAshape model
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
● ● ●
● ●
●
●
●
●
●
● ● ●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
● ●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
● ● ● ● ●
●
●
●
● ●
●
● ●
●
●
●
● ● ● ● ●
● ● ● ● ● ●
● ●
●
●
●
●
●
● ● ●
● ●
●
●
● ● ●
●
●
● ● ●
●
●
●
● ●
●
●
● ●
● ● ●
● ●
●
●
●
●
●
● ● ●
● ● ●
● ●
●
●
●
●
●
● ● ●
●
●
● ●
●
● ●
●
●
●
●
●
● ● ●
●
●
●
●
●
● ●
●
●
●
● ●
●
●
● ● ●
●
● ●
●
●
● ●
●
● ●
●
●
●
●
● ● ●
●
● ●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
● ●
●
● ● ● ● ●
●
● ● ●
● ●
● ● ● ●
●
●
●
●
●
● ●
● ● ●
●
●
●
●
●
● ●
●
●
●
●
●
●
● ●
●
● ●
●
●
●
● ● ●
● ●
●
● ●
●
●
● ● ●
●
●
●
●
● ●
●
●
●
● ● ●
●
●
● ● ●
● ●
● ● ●
● ● ● ●
● ● ●
●
● ● ● ● ● ● ● ●
●
● ● ● ● ●
●
● ● ● ●
●
● ● ● ● ● ● ● ● ● ● ● ●
● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
● ●
●
●
● ● ●
●
●
●
●
● ●
●
●
●
●
●
●
● ●
● ● ● ●
● ● ●
●
● ●
● ●
●
● ● ●
●
● ●
●
● ●
● ● ● ●
●
●
● ●
●
●
● ●
● ● ●
● ●
●
● ●
● ●
●
●
● ●
● ● ● ●
● ●
● ●
● ● ● ● ● ●
● ● ● ●
●
●
●
● ● ● ●
●
●
● ●
●
● ●
●
● ●
● ● ● ● ●
● ●
●
● ●
● ●
● ● ● ●
●
●
●
● ● ●
●
●
●
● ● ● ●
●
●
●
● ● ●
●
●
●
● ●
●
●
●
● ●
● ● ●
●
● ● ● ●
● ●
● ●
● ● ●
●
● ● ●
●
● ●
●
● ●
● ● ●
● ● ● ●
●
● ● ●
●
● ●
● ● ● ● ●
● ●
●
● ● ●
●
● ●
●
● ●
● ●
●
2.98
0
5
10
15
Inter−family Intra−family
Euclidean distance
1mer+4DNAshape model
A B
C D
●
●
●
●
●
●
●
●
●
●
●
● ●
● ●
●
● ● ●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
● ●
● ●
●
●
●
●
●
●
●
● ●
●
●
● ●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
● ●
●
●
●
● ●
●
●
●
●
●
● ●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
● ●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
● ● ●
●
●
● ●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
í 0
í í 3& explained var 3& explained var ●
●
●
●
●
●
●
●
bHLH
bZIP
&+ Forkhead
Homeodomain
Nuclear receptor
POU
TBX
1mer+4DNAshape model
●
●
●
●
●
●
●
●
●
● ●
● ●
● ●
●
● ● ●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
● ●
● ●
● ●
● ●
●
●
● ● ●
●
●
●
●
● ● ● ●
●
● ● ●
●
●
●
●
● ● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
● ●
● ●
●
●
●
●
● ● ●
●
●
● ●
●
●
● ●
●
●
●
●
●
● ●
● ●
●
● ●
●
●
●
●
●
● ●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
● ●
● ● ●
●
●
●
●
●
●
●
●
●
● ● ● ●
●
í í í 3& e [SODLQHG v ar 3& e [SODLQHG var PHU'1$VKDSHPRG HO FigureB.7: Principalcomponentanalysis(PCA)revealsdifferentDNAbindingspecificitieswithin
andbetweenTFfamilies(intra-vs. inter-family).
A) PCA results using 1mer+4shape features. Each dot represents a TF dataset. Dots of identical
colorbelongtothesameTFfamily. AnellipsewasdrawnforeachTFfamilyrepresentingthe0.68
probabilitycontouroftherespectivefittedtwo-variatenormaldistribution.
B)PCAresultsusing1mer+13shapefeatures(i.e.,nineshapefeaturesaddedwithrespectto(A)).
C)Boxplotsofinter-andintra-familyTFdistancesderivedfrom(A).
D)Boxplotsofinter-andintra-familyTFdistancesderivedfrom(B).
90
gcPBM SELEX−seq HT−SELEX
20 40 60
0.75
0.80
0.85
0.90
0.95
0.4
0.5
0.6
0.7
0.8
0.4
0.5
0.6
0.7
Feature quantity
R
2
Model
Sequence Only
Sequence+Shape
Shape Only
Figure B.8: Comparison between feature quantity and R
2
performance in three different
model settings. Points shown in Sequence+Shape group are 1mer+4shape, 1mer+4shape+SD,
1mer+13shape, and 1mer+13shape+SD models. SD represents the standard deviations of shape
features and, thus, conformational flexibility. Points shown in Shape Only group are 4shape,
4shape+SD,13shape,and13shape+SDmodels. PointsshowninSequenceOnlymodelsare1mer,
2mer,and3mermodels. ModelsintheSequence+Shapegroupperformedusuallybetterinachiev-
inghigherR
2
scores,particularlywhenusingfewerfeatures.
91
●
● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ● ● ●
●
●
●
●
● ●
●
● ●
●
●
●
●
● ●
●
●
●
● ● ● ● ●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
● ● ● ● ●
●
●
●
●
● ● ● ● ●
●
●
●
●
●
●
●
●
● ●
●
● ● ●
●
●
●
●
●
●
●
● ●
●
● ● ●
●
●
●
●
●
●
●
● ●
●
● ●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
● ● ● ● ● ●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
● ● ● ● ●
● ●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
● ●
●
●
●
●
● ●
●
●
● ●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
● ●
●
●
●
●
● ●
●
●
● ●
●
● ●
●
●
●
●
● ●
●
●
●
● ●
●
● ●
●
●
●
●
●
● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
●
● ●
●
● ● ●
●
● ● ●
●
●
● ● ●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
● ●
●
●
● ●
●
●
●
● ●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ● ●
●
●
● ●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ●
●
●
● ●
●
● ●
●
●
●
● ● ● ●
●
●
●
●
●
●
●
●
●
●
● ● ●
● ●
●
●
●
● ●
● ● ● ● ● ● ●
● ●
● ● ● ● ●
● ● ●
● ● ● ●
●
●
●
●
●
●
●
● ●
●
●
● ●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
● ●
●
● ●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
● ● ●
● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
● ● ●
●
●
●
● ●
●
●
●
●
● ● ●
●
●
●
● ●
●
●
●
●
● ● ●
●
●
●
● ●
●
● ●
●
● ●
● ●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ● ●
●
●
●
●
●
●
● ●
● ●
●
●
●
● ● ● ● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
● ● ● ●
●
●
●
● ● ●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
● ●
●
●
●
●
●
● ● ●
● ●
● ●
●
●
●
●
●
● ●
● ●
●
●
● ●
●
●
●
●
● ● ● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
● ● ●
●
●
● ●
●
●
●
●
●
● ● ● ●
●
●
●
●
●
●
●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ●
● ● ●
●
●
●
●
●
●
●
●
● ● ●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ●
● ●
●
● ● ●
●
●
●
● ●
●
● ●
●
●
●
● ● ● ● ● ●
●
●
●
●
●
● ●
●
●
●
● ●
●
●
●
● ●
● ● ●
●
●
●
●
● ● ●
●
●
● ●
●
●
●
● ● ● ● ●
●
●
● ● ●
●
●
● ● ●
●
● ● ●
●
●
●
●
●
●
● ●
●
●
● ●
●
●
●
● ●
●
● ●
●
●
●
● ● ● ● ● ●
●
● ●
●
●
● ● ● ● ● ● ● ● ●
● ● ●
●
●
● ● ●
●
●
●
●
●
● ●
●
● ●
●
●
● ●
● ●
●
●
● ● ●
●
●
●
●
●
● ●
● ● ●
● ● ● ● ●
●
●
●
● ●
●
●
● ●
●
● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
● ● ● ● ● ● ●
● ● ●
●
● ● ● ● ● ● ● ●
●
●
● ● ●
●
●
● ● ●
●
● ● ● ●
●
● ●
●
● ●
●
●
● ●
●
●
●
● ● ●
●
● ●
● ● ● ●
●
● ●
● ●
●
●
●
●
●
●
●
●
●
● ● ●
● ● ●
● ● ●
●
● ● ●
●
●
●
●
● ● ●
● ● ●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
● ●
●
●
● ●
●
●
●
●
●
●
● ●
●
● ●
●
●
● ●
●
●
●
●
●
●
● ●
●
● ●
●
●
● ● ● ● ● ●
●
● ● ●
●
● ●
●
● ●
● ●
●
● ●
● ●
● ●
● ●
●
● ●
●
● ●
●
● ●
● ●
● ●
●
● ●
●
● ●
● ●
●
● ●
● ●
●
● ●
● ●
● ●
●
● ●
● ●
●
●
● ●
● ●
● ● ● ● ● ●
●
●
● ●
●
●
●
● ●
●
● ●
● ●
●
● ●
● ● ●
●
● ●
●
● ●
● ●
● ● ● ●
● ●
●
● ● ●
●
●
●
● ● ● ●
●
●
●
● ● ●
●
●
●
● ●
●
●
●
●
● ●
● ● ●
●
● ●
● ●
● ●
● ●
● ●
● ● ●
●
●
●
● ● ●
●
●
●
● ●
●
●
●
● ●
●
●
● ● ●
● ●
● ●
● ● ● ● ● ● ●
●
● ●
●
● ●
●
● ● ●
●
● ●
●
● ●
● ●
●
4.38
0
10
20
30
40
Inter−family Intra−family
Euclidean distance
1mer+13DNAshape+SD model
● ●
● ●
● ●
● ●
●
● ●
●
●
●
● ●
● ● ●
●
●
● ●
●
●
● ●
●
●
● ●
●
●
● ●
●
●
●
● ●
●
●
● ● ●
●
●
● ●
●
●
●
● ● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
● ● ●
●
●
●
●
●
●
● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
● ● ● ●
● ● ●
●
●
●
●
●
●
● ● ● ● ●
●
●
●
●
●
●
●
●
● ●
● ●
●
●
●
●
●
● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
● ●
●
●
● ●
●
●
●
●
● ●
●
●
● ●
●
●
●
● ●
●
●
●
● ●
●
● ●
● ●
●
●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
● ● ● ●
●
●
● ● ●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ●
●
● ●
● ●
●
●
●
● ●
●
● ●
●
●
●
● ●
●
●
●
●
● ●
●
● ●
●
●
●
●
●
● ●
●
●
● ●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
● ●
●
●
● ● ● ●
●
●
● ● ●
●
●
● ● ●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
● ●
●
●
● ●
●
●
●
●
● ●
●
●
● ●
●
●
●
●
●
● ●
●
●
●
●
●
●
● ● ● ●
●
●
● ● ●
●
●
● ● ●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
● ● ●
●
● ●
● ●
●
●
● ● ●
●
●
●
●
●
● ●
●
●
●
●
●
●
● ● ● ●
●
●
● ● ●
●
●
● ● ●
●
●
●
●
●
● ●
●
●
●
●
●
● ● ● ●
●
●
● ● ●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
● ●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
● ●
●
● ●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
● ●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
3.37
0
20
40
60
Inter−family Intra−family
Euclidean distance
3mer model
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
● ● ●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
● ●
●●
● ●
● ●
●
●
● ● ●
●
●
●
●
● ● ● ●
●
● ● ●
●
●
●
●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
● ●
●
●
●
●● ● ●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
● ●
●
●
● ●
●
●
●
● ●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
● ●
●
●
●
● ● ●
●
●
●
● ● ●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
● ●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
í í í í 3& e [SODLQHG v ar 3&e [SODLQHG var PHU'1$VKDSH6' PRGHO ●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
● ● ●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
● ● ●
●
●
●
● ●
●
●
● ● ●
●
●
●
●
●
●● ●
●
● ● ●
●
●
●
●
●
● ●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ● ●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●● ● ●
●
●
● ●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
● ●
● ● ●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
● ●
● ●
● ●
●
●
●
● ●
●
●
●● ●
●
●
●
●
● ● ●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
● ● ●
● ●
●
●
●
●
●
●
●
●
●
● ●
●
−5
0
5
−6 −3 0 3 6
PC1 (2.12% explained var.)
PC2 (1.96% explained var.)
●
●
●
●
●
●
●
●
bHLH
bZIP
C2H2
Forkhead
Homeodomain
Nuclear receptor
POU
TBX
3mer model
A B
C D
FigureB.9: Principalcomponentanalysis(PCA)revealsdifferentDNAbindingspecificitieswithin
andbetweenTFfamilies(intra-vs. inter-family).
A) PCA results using 3mer features. Each dot represents a TF. Dots of identical color belong to
the same TF family. An ellipse was drawn for each TF family representing the 0.68 probability
contouroftherespectivefittedtwo-variatenormaldistribution.
B) PCA results using 1mer+13shape+SD features (30 features per nucleotide position) compared
to3merfeatures(64featurespernucleotideposition)usedin(A).
C)Boxplotsofinter-andintra-familyTFdistancesderivedfrom(A).
D)Boxplotsofinter-andintra-familyTFdistancesderivedfrom(B).
92
0.0
0.5
1.0
0.0 0.5 1.0
1mer + DynaSeq
ave
R
2
1mer + 13DNAshapeR
2
0.0
0.5
1.0
0.0 0.5 1.0
1mer + DynaSeq
ave
R
2
1mer + 13shape
MC4
R
2
0.0
0.5
1.0
0.0 0.5 1.0
1mer + DynaSeq
ave
R
2
1mer + 13shape
XRC4
R
2
0.0
0.5
1.0
0.0 0.5 1.0
1mer + DynaSeq
ave
R
2
1mer + 13shape
MD4
R
2
Family
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
bHLH
bZIP
C2H2
CENPB
CP2
CUT
ETS
Forkhead
GATA
GCM
Homeo−
domain
HSF
IRF
MAD
MEIS
MYB
NFAT
NFI
Nuclear
receptor
PAX
POU
PROX
RRM
RUNX
SAND
TBX
TEA
Dataset
gcPBM
HT−SELEX
SELEX−seq
Figure B.10: R
2
performance comparing 1mer models augmented by DynaSeq
ave
(averaged ver-
sion) shape features [122] and 1mer+shape models introduced in this work. DynaSeqave repre-
sents 13 shape features corresponding in number to our 13 shape features. All four models used
inthisresearch(1mer+13DNAshape,1mer+13shape
MC4
,1mer+13shape
MD4
,1mer+13shape
XRC4
)
achieved higher performance in predicting TF-DNA binding specificity, indicating the limitations
oftheDynaSeqapproach(seeAppendixA).
93
0.0
0.5
1.0
0.0 0.5 1.0
1mer + DynaSeq
ave
R
2
1mer + DynaSeq
ens
R
2
0.0
0.5
1.0
0.0 0.5 1.0
1mer + DynaSeq
ens
R
2
3merR
2
Family
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
bHLH
bZIP
C2H2
CENPB
CP2
CUT
ETS
Forkhead
GATA
GCM
Homeo−
domain
HSF
IRF
MAD
MEIS
MYB
NFAT
NFI
Nuclear
receptor
PAX
POU
PROX
RRM
RUNX
SAND
TBX
TEA
Dataset
gcPBM
HT−SELEX
SELEX−seq
Figure B.11: R
2
performance comparing between 1mer models augmented by shape features de-
rived from DynaSeq
ave
(averaged version) and DynaSeq
ens
(ensemble version) [122] and 3mer-
basedmodels. DynaSeq
ave
represents13shapefeaturescomparabletothe13shapefeaturesintro-
ducedinthiswork. DynaSeq
ens
represents5-binensemblesof13shapefeatures,whichsumtoato-
talof65shapefeatures(69when1merfeaturesareincludedin1mer+DynaSeq
ens
models). There-
fore, DynaSeq
ens
is superior to DynaSeq
ave
in this comparison. However, the 1mer+DynaSeq
ens
modelwith69featuresdoesnotoutperformcomparable3mermodelswith64features.
94
C SupplementarytablesforChapter2
TableC.1: Spearman’srankcorrelationcoefficientsbetweenfeaturesindifferentquerytablesand
features derived from 590 protein-bound XRC structures. Coefficients were calculated based on
a concatenated vector that was generated from all 590 DNA conformations (see Appendix Table
C.1forPDBIDs). ThisvalidationisequivalenttothevalidationusedforvalidatingtheDNAshape
method[68].
DNAshape(Pentamer) MC(Tetramer)
HelT 0.62 0.62
ProT 0.68 0.68
Opening 0.57 0.55
Stretch 0.55 0.60
Shift 0.64 0.64
Buckle 0.59 0.63
Rise 0.72 0.72
Stagger 0.69 0.69
Slide 0.62 0.62
MGW 0.70 0.64
Tilt 0.56 0.60
Roll 0.70 0.69
Shear 0.60 0.60
95
D SupplementaryfiguresforChapter3
FigureD.1: DiagramofresidualblockfortheResNetarchitechture. Inputgoesthroughtwo3×3
convolutional layers with batch normalization and ReLu activation. The + symbol means direct
additionoftwooutputsfromtwolayers,toserveasthefinaloutputofthisoneresidualbock.
96
FigureD.2: Diagramofthereal-timedataaugmentation. Randomtranlationaldistances,rotational
angles,axesamplifiersaresampledbyuniformdistributionandnormaldistribution. Inputbatches
aretranlated,rotatedandamplifiedinreal-timeinGPU.
Figure D.3: Workflow of the baseline sequence CNN model. Filter size set for the 3 CNN layers
are256,512and1024. Nopoolinglayerispresenttoconservemaximalinformation. Inregression
mode,thereare5dense(fullyconnected)layers,withunitsof512,256,128,64and32.
97
TableC.2: XRCdatasetcomprisedof590X-rayco-crystalstructuresusedincalculatingAppendix
TableC.1.
Dataset PDBIDs
XRC
bound
(Used for
generating
XRC
query
table)
1P3O, 1S97, 1C0W, 2H1O, 2ER8, 2C6Y, 1N48, 1R0A, 2NVX, 2ADY, 1EFA, 1AKH, 2R5Y,
2E2I, 1JJ6, 1SRS, 1RZR, 1GU5, 1QN4, 2IS4, 1NJX, 2J6U, 1IGN, 2GLI, 1OWG, 1IAW, 2IIE,
1OH6, 2OST, 1S32, 1ZX4, 2PI5, 1NWQ, 1PUE, 1KB2, 1H88, 2AU0, 1QLN, 1K78, 1MM8,
1P3M, 1NVP, 1OZJ, 1BP7, 1JKQ, 1DDN, 1MUH, 2AGQ, 2PPB, 1DSZ, 2H27, 2IEF, 1IC8,
1RH6, 2AOQ, 1S10, 1LBG, 1F4K, 1OH7, 2EUV, 1P34, 2D5V, 1XO0, 1R8D, 1LWT, 1T8I,
1CQT, 1D2I, 2O8E, 2DY4, 2D45, 1R4R, 1ID3, 2E2J, 1NG9, 1TW8, 2I9T, 4KTQ, 1LE8, 1P3B,
1QNC, 2HMI, 1GJI, 1ZME, 2C2R, 2ERG, 1CDW,1SKN, 2OWO,1HDD, 2C28, 1MUS, 2O93,
1TL8, 1KBU, 2I13, 2BNZ, 3PVI, 2JEJ, 1JKR, 1MA7, 1IG9, 2I9K, 1NJY, 1K79, 1KX5, 1AOI,
1P3G, 1LMB, 1EVW, 1IU3, 1F66, 2AS5, 1N6J, 1Z19, 1J1V, 2EWJ, 2HOT, 1MOW, 1C9B,
2AOR, 1Z63, 1FJL, 1TKD, 2ATL, 1PP8, 1HJC, 1S0N, 1OH8, 2P0J, 2A3V, 1TTU, 1HW2,
1AU7, 2BQ3, 1CZ0, 1HCR, 1L3U, 1R0N, 1Q9Y, 1G2F, 1G9Y, 2O5I, 2HVR, 2IRF, 2EUZ,
2KTQ, 1LLM, 1QN8, 2CAX, 1U8B, 1NKB, 2EVG, 1SKM, 2UVR, 2J6S, 1NZB, 1U3E, 1EJ9,
1DUX, 1N6Q, 2AQ4, 2HOS, 1T8E, 2GM4, 1RPE, 1A74, 1IPP, 1SAX, 2HOF, 1P7D, 2I3Q,
1DH3, 2UVV, 1APL, 1R0O, 1PVP, 1P3F, 1P3K, 1J59, 1W7A, 1EOO, 1MUR, 1P8K, 1FJX,
1CEZ, 1BC7, 2OG0, 2CGP, 1YTF, 1ODH, 1A0A, 1P4E, 1WBD, 1KSY, 1GA5, 4CRX, 1ZRE,
1KSX, 1SC7, 2JEF, 1ZLK, 2AJQ, 1DFM, 1HLZ, 1NK0, 2HDD, 2IIF, 2RVE, 1EQZ, 1BL0,
1ECR, 2EZV, 1K7A, 1PVI, 1PYI, 1U0D, 1LPQ, 1H0M, 1R7M, 2F8X, 2EVH, 2ERE, 1TF6,
1JJ4,1VRR,1YO5,1TK8,1CF7,2ASJ,1M5R,2ASD,2HHQ,1ZG1,1I3J,1Q0T,1P7H,1VKX,
1B72, 3CRX, 1IXY, 1QNE, 1HF0, 2OH2, 2BR0, 1XBR, 1NKC, 2CV5, 2OR1, 1SXQ, 1F0O,
1QN9, 1P3L, 1KB6, 2P6R, 2IVH, 1A6Y, 1FLO, 9ANT, 8MHT, 1TQE, 2A66, 1ZRF, 1W0U,
2BNW,1R4O,2F5P,1LLI,1NK5,2PI4,1VOL,2GIE,1JEY,1HWT,1LE9,2F8N,1PER,2OAA,
2NVZ, 1XNS, 1TK0, 2AYB, 1BPZ, 1NGM, 2EVI, 1W0T, 1M1A, 1NNE, 2FO1, 1CIT, 2HVS,
2H8R, 2BSQ, 1XHV, 2HOI, 1P47, 2I3P, 1UA1, 2NLL, 1H9T, 3MHT, 1H89, 2HHV, 1U8R,
2B9S, 1XHU, 2HR1, 1RUN, 1FOK, 1QSS, 1Z1G, 1K6O, 1Q3V, 1ZS4, 1G2D, 1D66, 2H1K,
1IO4, 1WBB, 2EVF, 1L5U, 1N5Y, 1RYS, 1H8A, 1QN6, 1FOS, 1ZRC, 1ZYQ, 2ETW, 1A36,
1YSA, 1REP, 1U35, 1HLV, 1F5T, 1T3N, 2GIG, 1BDT, 2C7A, 1NKP, 2R5Z, 1R9T, 1TGH,
2HAN, 1K61, 1ZNS, 1D5Y, 1FW6, 1L3L, 1RZ9, 1M5X, 1G3X, 2CRX, 1O3T, 2HT0, 2O61,
2ACJ, 1GLU, 1L3V, 1IJW, 1QP9, 1NLW, 1TX3, 2BQU, 1WB9, 1EGW, 1M19, 1G9Z, 1QNA,
1ZRD,1AWC,1Q3U,1GTW,1AM9,1JKO,2JEG,1NH2,2EVJ,1MNN,3CRO,1OUZ,1EYU,
2EUX, 1PVR, 2IS6, 1U78, 2EX5, 2BQR, 2HHS, 2IT0, 1LE5, 1A3Q, 1QN3, 1WTE, 1S9F,
1BY4, 1M6X, 1PUF, 1OH5, 1PVQ, 2UVU, 1D3U, 1E3M, 1R49, 1J5O, 2DTU, 1LQ1, 2F5N,
2A07, 1IHF, 2IMW, 1TRR, 2DPD, 1RM1, 2NTC, 1LAT, 2EUW, 4BDP, 1DU0, 1R71, 2P5O,
2GIH, 1OWR, 1T05, 1ZR4, 2Q2T, 1BF5, 1F2I, 1GU4, 1QN7, 1LRR, 1CGP, 1CRX, 2FIO,
1Q9X, 1YF3, 2NRA, 1HLO, 1LB2, 1NK8, 1MJQ, 1N3E, 1Z1B, 2C9L, 2FLD, 1GD2, 1M18,
3HDD, 1KC6, 1PP7, 2HZV, 2IVK, 1IG7, 1QNB, 1DRG, 1RIO, 1NJW, 1JKP, 1MNM, 1CYQ,
1GT0, 1JX4, 1A02, 2GEQ, 5CRX, 1TSR, 2Q2U, 2H7H, 1ZBB, 1MJ2, 2UVW, 1JJ8, 1MEY,
1MHD, 1JNM, 1KB4, 1CKT, 1U0C, 1N3F, 1B3T, 1RAM, 1YFH, 2ASL, 2C2E, 1N56, 1EWQ,
1LWS, 1ZR2, 2C2D, 1EA4, 1OCT, 2DRP, 1O3R, 1BDV, 1IMH, 1KX3, 1S0O, 1S0M, 1GXP,
1GDT,2ISZ,2NTZ,2J6T,2O6G,1P3P,1P3A,1O3Q,1OUQ,1TRO,1IF1,2BGW,2ODI,2JEI,
1PAR, 1RR8, 2HAP, 1TUP, 2IS2, 2AGO, 1B8I, 2C22, 2Q10, 1UBD, 1YFI, 2GII, 1PT3, 2NP2,
1RTD,2E2H,1QN5,1DC1,2ATA,1PZU,1NK9,1MJO,1P3I,1JFI,1T2K,1E3O,1T2T,1W36,
1R4I,2F5O,1JE8,1K4T,1YTB,2FJ7,2NZD,1JT0,1RYR,1KX4,2O5J,1O3S,1HBX,1ZLA,
2GIJ, 1T9J, 1A73, 1S9K, 1RUO, 1F44, 1QTM, 1YNW, 2HHT, 1MDM, 1HCQ, 2AYG, 2AC0,
1TC3, 1OWF, 1YRN, 1ZG5, 1PDN, 1H6F, 2AHI, 1MDY, 1Z9C, 1M0E, 1HJB, 2IS1, 1K82,
1SA3,6PAX,1YFJ,1QSY,2AGP,1JWL,1T9I,1AN4,3KTQ,1RRJ
98
Figure D.4: Learning curve (R
2
) for P53 from SELEX-seq dataset. This model is trained on P53
rebuiltonXRCDNAshapequerytable. Learningrateis0.01andoptimizerisAdamax.
99
FigureD.5: Learningcurve(loss)forP53fromSELEX-seqdataset. ThismodelistrainedonP53
rebuilt on XRC DNA shape query table. Loss is mean squared loss (MSE). Learning rate is 0.01
andoptimizerisAdamax.
100
FigureD.6: SpatialgeometyrmodelcomparedtothebaselinesequenceCNNmodelwithdifferent
hyper-parameters. AllthebaselinemodelsshownistrainedbySGDoptimizerwith0.001learning
rate. In the first row, the only variant is dropout rate, which is 0.5, 0.25 and 0. Weight decay is 0
and training epoch is 1000. In the second row, the first one has a dropout rate of 0.0 and weight
decayisalso0.0,butthetrainingepochissettobe100. Thesecondoneinthesecondrowhas500
trainingepoch,0.25dropoutrateand1×10
−8
weightdecay.
101
FigureD.7: Illustrationoftheprocessofourartificallybending. Basepairsinthemiddleistreated
likeitisdyadinnucleosomalDNA.BendinghappenedforbothsideaccordingtoaRoll-and-Slide
model[158].
102
Figure D.8: Predicted binding affinityversusthe experimentaltrue relativebinding affinityon the
testdataofP53. TheinputofthisdataisafusionbetweenstraightDNAsandbentDNAs. Relative
bindingaffinitiesforbentDNAsaresettobe20%higherthanthestraightcounterpart.
103
E SupplementarytablesforChapter3
TableE.1: IndexofatomsofDNAstructurestobesortedas.
AtomID A C G T Group
P Yes Yes Yes Yes
Backbone
OP1 Yes Yes Yes Yes
OP2 Yes Yes Yes Yes
O5’ Yes Yes Yes Yes
C5’ Yes Yes Yes Yes
C4’ Yes Yes Yes Yes
O4’ Yes Yes Yes Yes
C3’ Yes Yes Yes Yes
O3’ Yes Yes Yes Yes
C2’ Yes Yes Yes Yes
C1’ Yes Yes Yes Yes
C2 Yes Yes Yes Yes
BasePair
C4 Yes Yes Yes Yes
C5 Yes Yes Yes Yes
C6 Yes Yes Yes Yes
N1 Yes Yes Yes Yes
N3 Yes Yes Yes Yes
C8 Yes No Yes No
N7 Yes No Yes No
N9 Yes No Yes No
O2 No Yes No Yes
C7 No No No Yes
N2 No Yes Yes No
N4 No Yes No No
N6 Yes No No No
O4 No Yes No Yes
O6 No Yes Yes No
TotalAtoms 21 19 22 20 All
TableE.2: R
2
Performancetablecomparingfullspatialgeometrymodelandpartialmodels.
TF Fullmodel base-pair-onlymodel backbone-onlymodel
MAX 0.982 0.988 0.938
MEF2B 0.980 0.969 0.736
P53 0.972 0.982 0.902
104
Abstract (if available)
Abstract
Uncovering the mechanisms that affect the binding specificity of transcription factors (TFs) is critical for understanding the principles of gene regulation. Although sequence-based models have been used successfully to predict TF binding specificities, I found that including DNA shape information in these models improved their accuracy and interpretability. Previously, our lab developed a method for modeling DNA binding specificities based on DNA shape features extracted from Monte Carlo (MC) simulations. Prediction accuracies of these models, however, have not yet been compared to accuracies of models incorporating DNA shape information extracted from X-ray crystallography (XRC) data or Molecular Dynamics (MD) simulations. In this dissertation, I integrated DNA shape information extracted from MC or MD simulations and XRC data into predictive models of TF binding and compared their performance. Models that incorporated structural information consistently showed improved performance over sequence-based models regardless of data source. Furthermore, I derived and validated nine additional DNA shape features beyond our original set of four features. The expanded repertoire of 13 distinct DNA shape features, including six intra-base pair and six inter-base pair parameters and minor groove width, is available in the R/Bioconductor package DNAshapeR from our lab and enables a comprehensive structural description of the double helix on a genome-wide scale. In addition, to improve the performance and interpretability even more, I propose a brand-new 3D geometrical model to replace methods that are majorly relying on one-hot encoding of DNA, which is based on categorical characters, and in practice, requires a ton of hyperparameter search to fit. The model is greatly benefited by the natural data augmentation thus not required to tune its hyperparameters. To illustrate the effectiveness of this model, I used the same but classic TF-DNA binding specificity prediction as the benchmark. The result shows a superior performance and easy-learnability compared to sequence-based deep learning model and provides a more natural display on the models interpretability. In addition, as a show case, I present that the framework has its architectural potentials in dealing with deformed or dynamic DNAs compared to any sequence-based models.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
PDF
Quantitative modeling of in vivo transcription factor–DNA binding and beyond
PDF
Profiling transcription factor-DNA binding specificity
PDF
DNA shape at transcription factor binding sites: from purifying selection to a new alphabet
PDF
Understanding protein–DNA recognition in the context of DNA methylation
PDF
Improved methods for the quantification of transcription factor binding using SELEX-seq
PDF
Decoding protein-DNA binding determinants mediated through DNA shape readout
PDF
Deciphering protein-nucleic acid interactions with artificial intelligence
PDF
Data-driven approaches to studying protein-DNA interactions from a structural point of view
PDF
Site-directed spin labeling studies of sequence-dependent DNA shape and protein-DNA recognition
PDF
Application of machine learning methods in genomic data analysis
PDF
Simulating the helicase motor of SV40 large tumor antigen
PDF
Feature engineering and supervised learning on metagenomic sequence data
PDF
Fast search and clustering for large-scale next generation sequencing data
PDF
Explorations in the use of quantum annealers for machine learning
PDF
Mapping 3D genome structures: a data driven modeling method for integrated structural analysis
PDF
Anatomically based human hand modeling and simulation
PDF
A generalization of the accumulation curve in species sampling and its applications to high‐throughput sequencing
PDF
Comparative analysis of DNA methylation in mammals
Asset Metadata
Creator
Li, Jinsen
(author)
Core Title
Machine learning of DNA shape and spatial geometry
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Degree Conferral Date
2021-12
Publication Date
11/02/2021
Defense Date
09/30/2021
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
3D,binding,binding affinity,binding specificity,Computational Biology,deep learning,deformed DNA,DNA,DNA sequencing,DNA shape,DNA structure,machine learning,molecular dynamics,Monte Carlo simulation,OAI-PMH Harvest,pioneer factor,protein,protein binding microarray,SELEX,spatial geometry,structural biology,systematic evolution of ligands by exponential enrichment,transcription factor
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Rohs, Remo (
committee chair
), Di Felice, Rosa (
committee member
), Nakano, Aiichiro (
committee member
), Sun, Fengzhu (
committee member
)
Creator Email
jasonljsxxx@gmail.com,jinsenli@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC16345211
Unique identifier
UC16345211
Legacy Identifier
etd-LiJinsen-10192
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Li, Jinsen
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
3D
binding
binding affinity
binding specificity
deep learning
deformed DNA
DNA
DNA sequencing
DNA shape
DNA structure
machine learning
molecular dynamics
Monte Carlo simulation
pioneer factor
protein
protein binding microarray
SELEX
spatial geometry
structural biology
systematic evolution of ligands by exponential enrichment
transcription factor