Understanding protein–DNA recognition in the context of DNA methylation
(USC Thesis Other)
Understanding Protein–DNA Recognition in the Context of DNA Methylation

by Satyanarayan Rao

Dissertation committee:
Dr. Remo Rohs (Chair)
Dr. Michael S Waterman
Dr. Andrew D Smith
Dr. Aiichiro Nakano

Department of Biological Sciences
Computational Biology & Bioinformatics
University of Southern California

A dissertation presented to the faculty of the USC Graduate School in partial fulfillment of the requirements for the degree of Doctor of Philosophy

December 2018

I would like to dedicate this dissertation to my loving mother and father, who simply believed in what I was doing. Since early primary education, I have received constant support from my aunt and dear late uncle in terms of guidance and discipline, which I still fathom! My sincere dedication goes to all my primary, secondary and high-school teachers, who played a pivotal role in what I am today. Hailing from a computer science background, one of the major driving forces behind pursuing a PhD was my eldest brother, who suggested that if I really wanted to contribute toward science and education I should do a PhD. Being a younger member of the family, I received support from my elders and my cousins, which made me a better person, so my sincere thanks and dedication go to them too! This page would look empty without my beloved wife's name, Puja, printed on it. It was she who supported me through the difficult times in the later years of my PhD. I firmly believe that I am the luckiest person on the face of the earth because I have Puja in my arms! I cannot forget to dedicate this to all my childhood, undergraduate, and graduate friends with whom I shared invaluable time! Through the highs and lows of my PhD career, one thing that kept a smile on my face was Vidushak, an improv theater group at USC, so I would like to dedicate this to all my friends from this troupe.

Declaration

I hereby declare that the contents presented in this dissertation are original and have not been submitted to any other University.
The work of others has been explicitly referenced.

Satyanarayan Rao
December 2018

Acknowledgements

My sincere acknowledgement goes:

To my supervisor, Dr. Remo Rohs, who supported me throughout my PhD career. Beyond the academic field, I have learned a lot about scientific communication and presentation under his guidance. I was extremely lucky to find him as my supervisor! I will miss the fun part of the Rohs Lab (the summer party and much more)!

To my qualification exam and dissertation committee, composed of Prof. Michael S Waterman, Dr. Andrew D Smith, Dr. Myron F. Goodman, Dr. Chi H. Mak, Dr. Yan Liu, Dr. Aiichiro Nakano and Dr. Remo Rohs (Chair), for assessing my work and giving me useful feedback. I also want to thank them for their valuable time.

To Dr. Harmen Bussemaker, with whom we collaborated on the methyl-DNAshape project. I received constant guidance and productive critical comments on my work. I would also like to acknowledge the Bussemaker and Mann Lab members. I thank Judith (who generated the SELEX-seq data that I used) and Chaitanya for their valuable comments!

To Dr. B. Jayaram and Dr. Avinash Mishra, who supported me at the time when I was starting to step into research. A year and a half of research assistant work under their guidance generated enough interest in me to choose PhD study as a career option!

To my dear friend Tsu-Pei, who is one of the best people you can have around. Apart from research, where we have had countless discussions, he is filled with humor and humility. It took a while to understand his personality, but it was such an awesome experience after that.

Here I also take the opportunity to thank all Rohs lab members and my cohort; with them I had quality time at USC and in Los Angeles. I want to thank them for making my wife's transition to the USA a smooth experience. I extend my acknowledgements to Luigi, who was of key help with technical support. I sincerely acknowledge the support I received from funding agencies and the Viterbi fellowship!
My sincere thanks go to the developer communities of all open-source software, particularly Python, R and Overleaf (the LaTeX template was adapted from one by Krishna Kumar).

Table of contents

List of figures
List of tables
List of symbols
1 Introduction
  1.1 Deoxyribonucleic Acid (DNA)
    1.1.1 Physico-chemical signatures of DNA
    1.1.2 Structural features of DNA
  1.2 Proteins
  1.3 Protein–DNA interactions
    1.3.1 Genomics perspective
    1.3.2 Interaction perspective
    1.3.3 Filling the gap
  1.4 Thermodynamics of protein–DNA interactions
2 Estimation of methylated DNA shape features
  2.1 Motivation
  2.2 Methodology
    2.2.1 Sequence and structure datasets
    2.2.2 All-atom MC simulations
    2.2.3 Building the methyl-DNAshape Pentamer Query Table
    2.2.4 methyl-DNAshape: method for high-throughput prediction of methylated DNA shape features
  2.3 Results
    2.3.1 Effect of CpG methylation on DNA shape features
    2.3.2 Effect of CpG methylation on MGW of A-tracts
3 Deoxyribonuclease I and CpG methylation
  3.1 DNase I
    3.1.1 DNase I–DNA cocrystal structure
    3.1.2 Sequence-context model
  3.2 CpG methylation and shape-to-affinity modeling
4 Methylation sensitivity of Pbx-Hox complexes
  4.1 Pbx-Hox system and CpG methylation
5 Physico-chemical modeling of protein–DNA interactions
  5.1 Physico-chemical encoding of DNA
  5.2 Modeling of protein–DNA interaction data
    5.2.1 DeepRec framework
    5.2.2 Framework components
    5.2.3 Modeling and interpretation of Pbx-HoxA1–DNA binding
6 Discussion and future work
References
Appendix A Supplementary material for chapter 2
  A.1 Structures from Protein Data Bank (PDB)
  A.2 PDB IDs of methylated DNA structures
  A.3 Count statistics of transcription factor (TF) binding motifs containing CpG step(s)
  A.4 Validation of methyl-DNAshape using experimentally determined structures
  A.5 Types and counts of sequences considered for Monte Carlo (MC) simulations
  A.6 All-atom Monte Carlo simulations
  A.7 Sequence composition of mPQT
  A.8 Pentamers used in scatter plot analysis
  A.9 Illustration of shape vector calculation
Appendix B Supplementary material for chapter 3 & 4
  B.1 DNase I cleavage data and statistical modeling
    B.1.1 Data preprocessing
    B.1.2 Statistical modeling
  B.2 Supplementary materials for Pbx-Hox data analyses and more
    B.2.1 CpG context for unmethylated DNA
    B.2.2 Methylation sensitivities of Pbx-HoxA1 and A5
    B.2.3 T-test for IUPAC-based shape analysis for Pbx-Hox data

List of figures

1.1 DNA bases and base pairs
1.2 Functional groups on major and minor grooves
1.3 Illustration of Roll angle
1.4 Illustration of Propeller twist
1.5 Illustration of binding free energy
2.1 PDB statistics and MotifDb
2.2 methyl-DNAshape method
2.3 Pentamer table comparison
2.4 Methylation effect on A-tracts
3.1 DNase I cleavage schematics
3.2 DNase I–DNA cocrystal structure & contact map
3.3 Shape-to-affinity modeling
4.1 Methylation sensitivity of Pbx-Hox complexes
5.1 Physico-chemical encoding of sequence
5.2 Multi-module and multi-task deep learning framework
5.3 Understanding convolution
5.4 DeepRec: interpretation
A.1 MGW profiles for selected DNA fragments or protein–DNA complexes
A.2 Shape vector calculation
B.1 Methylation sensitivity of Pbx-HoxA1 and A5
B.2 Snapshot of the first page of Rao et al. [1]

List of tables

3.1 Observed DNase I cleavage activities
A.1 Variables considered in MC simulations
A.2 Oligo counts used in MC simulations
A.3 Count breakdown of unique pentamers seen in mPQT
B.1 Phosphate cleavage count entry of AAApAAA in tier1 table

List of symbols

Greek Symbols
$G^\circ$: Standard free energy
$\Delta G^\circ$: Standard free energy difference
$\Delta\Delta G^\circ$: Relative standard free energy difference
$\Delta\Delta\Delta G^\circ$: Change in relative standard free energy difference

Acronyms / Abbreviations
$K_a$: Binding affinity
$\mathrm{rel}K_a$: Relative binding affinity
pc-OHE: physico-chemical one-hot encoding
s-OHE: sequence one-hot encoding

Chapter 1
Introduction

The goal of this chapter is to introduce some basic concepts and briefly explain terminology that may help in understanding the later chapters.

1.1 Deoxyribonucleic Acid (DNA)

The building blocks of DNA are called deoxyribonucleotides. A deoxyribonucleotide has three components: a pentose sugar, a base and a phosphate. The base and the phosphate are attached to the pentose sugar. Four types of bases, Adenine (A), Cytosine (C), Guanine (G), and Thymine (T) [2], are usually found in DNA (see Figure 1.1). DNA usually exists as a duplex helix, and the most common form is B-form, a right-handed helix [3], but other forms, for example A-form (PDB ID: 137D), have also been observed [4–6]. The human genome, for example, is a long DNA macromolecule with approximately 3 billion base pairs [7] encoding the information unique to the individual.

Figure 1.1 Building blocks of the DNA macromolecule.
Adenine (A) pairs with Thymine (T) (top panel), and Guanine (G) pairs with Cytosine (C). The base and sugar parts are marked for G. The phosphate backbone is labeled for A and G. The same applies to the other nucleotides. Dashed lines represent hydrogen bonds between bases (distance in Å shown above the line).

1.1.1 Physico-chemical signatures of DNA

The base of each nucleotide has functional groups that are exposed either in the major or in the minor groove (see Figure 1.2). These functional groups play important roles in protein–DNA recognition and have long been of interest [8–10].

Figure 1.2 Illustration of physico-chemical signatures on DNA. a Two views of the same DNA highlighting functional groups exposed in the major (left) and minor groove (right). A 2D sheet view of physico-chemical signatures follows the 3D DNA structure. The 3DNA program [11, 12] (option: B-DNA (generic)) was used to build the DNA structure with sequence composition d(GACT)$_3$ [13]. The phosphate backbone and sugars are colored grey. Images were generated using the PyMOL program. b Linear (1D) view of physico-chemical signatures of base pairs. For example, the A–T base pair offers a hydrogen-bond acceptor, donor, acceptor and methyl group (left to right) in the major groove, and a hydrogen-bond acceptor, non-polar hydrogen and hydrogen-bond acceptor in the minor groove (see a). The pattern reverses for the T–A base pair in the major groove but remains the same on the minor groove side (a similar trend holds for the C–G bp). Representing signatures in this form easily accommodates selected chemical modifications of bases, e.g., methylation of the cytosine base at the C5 position (5mC), which changes only a single signature (gray to yellow). c A well-known case of CpG methylation, where a letter representation would require additional letters (one and two for hemi- and fully methylated CpG, respectively).
1.1.2 Structural features of DNA

Although the DNA model proposed by Watson and Crick [3] was a right-handed duplex helix (B-form DNA), deviations from its ideal form (for example, A-form DNA and Z-form DNA) have been extensively observed [4, 13–15]. These deviations result from changes in helical parameters at the level of a single base pair (bp; base-dependent) or at higher order (more than one bp; sequence-dependent; see Figure 1.3). They arise partly from the tendency to avoid water (propeller twist, a natural phenomenon at the bp level) and partly from stable stacking between base pairs [16]. One well-known higher-order example is the A-tract (poly[A/T] without a TpA step), where bps are highly propeller twisted so that an adenine can make an additional hydrogen bond with the next thymine on the other strand [13, 15, 17–19] (see Figure 1.4).

Illustration of roll angle and propeller twist

Figure 1.3 a-c Illustration of the impact of roll angle, a bp-step feature, on DNA conformation. Red and green cartoons represent deviations from the reference structure (grey) obtained by perturbing the roll angle by 20° in the negative and positive directions, respectively (the grey bp is the reference). The minor and major groove views of both perturbations are shown in a and b, respectively; c shows both perturbations combined. Negative roll bends DNA toward the minor groove (red cartoon coming out of the page in a, c), and positive roll bends it away from the minor groove (green cartoon going into the page in b, c). The x3DNA software suite was used to perform these perturbations [20]. A video tutorial was of great help in performing these perturbations. The DNA used in this figure is the Dickerson dodecamer (PDB ID: 1BNA) [21]. PyMOL [22] and Inkscape were used to create the image.

Figure 1.4 a-c Illustration of the impact of propeller twist, a bp feature, on DNA conformation.
Red and green cartoons represent deviations from the reference position (grey) obtained by perturbing propeller twist by 30 degrees in the negative and positive directions, respectively (this extreme is not usually observed, but is used here for better visualization). a shows the minor groove view of both perturbations combined. b-c An A-tract exhibiting highly negative propeller twist values to make an additional hydrogen bond. O4 of thymine (T7) and NH$_2$ of adenine (A17) form the additional hydrogen bond (c; major groove view is shown). The same structure and tools were used as in Figure 1.3.

It is natural, then, to study these deformations for a better understanding of DNA structure and of how they can influence protein–DNA interactions. However, determining the structure of macromolecules is not easy, particularly for DNA, because good X-ray diffraction patterns are hard to obtain [23, 24]. The PDB statistics report as of this date shows that the number of protein structures is larger than the number of DNA structures by two orders of magnitude. To fill the gap, significant effort has been invested in computational modeling of nucleic acid chains [25–30].

Macromolecules, for example DNA, RNA and proteins, usually function in their native conformations, meaning that their functional states have the minimum total energy, which accounts for all bonded and non-bonded interactions. With this criterion as an objective function, computational methods such as molecular dynamics [31] and Monte Carlo simulations [25, 28] have been developed that attempt to approximate the best possible conformation of a given macromolecular sequence. Despite the success of computational modeling approaches, they are limited by the scale at which biological processes occur. For example, computational modeling of the whole cell is still a challenge.
Moreover, the time complexity of these approaches is exponential, and even at a significantly smaller scale, for example modeling a 50 bp long DNA sequence, a simulation may take more than a week. To a first approximation, however, the structural parameters of DNA can be estimated by running simulations for a population of DNA structures and then mining the simulation trajectories [32, 1] (see Chapter 2 for more details).

1.2 Proteins

Proteins are the end product of the Central Dogma, which describes the pathway from DNA to protein. DNA codes for RNA (ribonucleic acid), and RNA codes for proteins, which actually perform the biological functions [33]. Proteins are ubiquitous in any given cell. A subset of the proteins found in the nucleus has been given a special name, Transcription Factors (TFs), owing to their pivotal role in transcription, where they achieve precise regulation of gene expression. Other proteins also exist, for example DNase I (discussed further in Chapter 3), which does not regulate any gene but plays an important role in apoptosis (cell death) by cleaving accessible DNA. My work mainly focuses on proteins that are found in the nucleus and interact with DNA.

1.3 Protein–DNA interactions

1.3.1 Genomics perspective

Protein–DNA interactions are key to understanding gene regulation. Regulation refers to the controlled expression of a desired gene to obtain the right amount of protein needed for the cell to develop and function. In general, the nucleus has a set of proteins referred to as General Transcription Factors (GTFs), which, together with RNA polymerase, form the transcription initiation complex and are capable of initiating transcription of any gene. However, the desired rate of transcription may not be achieved without TFs, which further stabilize or disrupt the complex.
A given protein qualifies as a transcription factor if it possesses one or more DNA Binding Domains (DBDs) that interact with small stretches of DNA sequence, generally known as binding motifs. Based on their function in the context of regulation, these stretches are referred to as promoter, enhancer or repressor elements [34]. Technological and research advances over the last two decades have produced a plethora of in vivo experimental and computational methods [35–41] capable of profiling the binding of a given TF in a cell type of interest in a high-throughput manner. However, these analyses mainly reveal high-affinity binding sites and do not profile the complete range of binding, partly because binding dynamics lead to low coverage and partly because of experimental limitations (washing steps in ChIP-seq) [42, 43].

1.3.2 Interaction perspective

DNA is overall a negatively charged macromolecule because its phosphate backbone carries a strongly negative charge (-1 unit) [44]. The bases, however, carry both positively and negatively charged atoms. These characteristics, together with DNA structure, are important in the context of protein–DNA interactions. Residues of a given protein may exploit the base, the structure (shape), or both at the same time to achieve specificity. Specificity refers to the selected contacts or conformations that are essential for binding to occur. Usually, charged residues on the protein take part in making specific contacts, for example the well-known arginine (protein)–guanine (DNA) interaction in the major groove, where they form two hydrogen bonds. However, the importance of hydrophobic or nonpolar groups found either on DNA or on protein cannot be ignored, because they contribute to binding through van der Waals interactions [45–47]. The influence of hydrophobic groups on DNA on protein binding has been seen in both positive and negative directions [48–51, 47, 1].
For instance, Lac repressor prefers to have a methyl group in the major groove and forms a hydrophobic contact with this group [52, 53]. In contrast, many bacterial restriction enzymes that cleave foreign DNA are very sensitive to methylation, mainly because the host defends its own DNA by methylating its bases. For example, methylation at position 6 of adenine inhibits cleavage of DNA by the restriction endonuclease EcoRI, and methylation at position 5 of cytosine disrupts the cleavage activity of HpaII. Many other examples are cited by McClelland and Nelson [54].

Owing to these different recognition modes, two keywords, base readout and shape readout, have been widely accepted in the community to distinguish between them. Usually, base readout accounts for interactions on the major groove side of DNA, where functional groups are widely exposed on the DNA surface and can easily be contacted by protein residues. Shape readout accounts for local deformations in DNA structure that are utilized by the protein to achieve specificity. See Figure 4.1a, where base readout is achieved through the helices of DBDs and shape readout is achieved through the N-terminal tails of proteins.

1.3.3 Filling the gap

High-throughput experimental methods generate vast amounts of data but do not profile low-affinity binding sites [43]. On the other hand, determining absolute binding affinities of a protein to different DNA sequences is a difficult pursuit [55, 56]. Similarly, structural determination of complexes of a protein bound to different DNA sequences using either X-ray crystallography or Nuclear Magnetic Resonance (NMR) methods is low throughput. To fill the gap, several in vitro high-throughput methods have been developed, for example Systematic Evolution of Ligands by EXponential Enrichment (SELEX) [57–59] and Protein Binding Microarray (PBM) [60, 61], which give a better landscape of protein–DNA binding.
In the interest of epigenetics, these methods, particularly SELEX-based methods, have been extended (EpiSELEX-seq [47]; methyl-SELEX [51]) to learn the effect of chemical modifications of DNA on protein–DNA binding. Chapter 2 of this work is dedicated to quantifying the impact of CpG methylation on structural deformations of DNA, followed by its applications (Chapters 3 and 4).

To learn the binding mechanisms from the in vitro data explained above, one has to go back to the principles behind binding, which are the laws of thermodynamics. The next section partially explains the thermodynamic insights in the context of protein–DNA interactions.

1.4 Thermodynamics of protein–DNA interactions

Thermodynamics is the major driving force behind the rules of protein–DNA binding [62, 56, 63]. Enrichment of a given probe (often termed an oligo or fragment) containing a binding sequence (a.k.a. ligand) is directly proportional to its ability to form a thermodynamically stable complex, often referred to as binding affinity ($K_a$: association constant). Hence, understanding the connection between thermodynamics and data is a must. My attempt here is to give a brief explanation of thermodynamics in the context of protein–DNA interaction that will be useful in later chapters. Please note that the following derivation is inspired by Foat et al. [56] (see also [64, 65]).

Consider the simplest system, where we have two sequences, $S_{\mathrm{ref}}$ and $S_{\mathrm{mut}}$ (one mutation apart from $S_{\mathrm{ref}}$; see Figure 1.5), known to bind the protein with the highest and second-highest binding affinity ($K_a$).
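The reaction between protein and DNA can be written as the following equation:

\[ S + P \underset{k_{\mathrm{off}}}{\overset{k_{\mathrm{on}}}{\rightleftharpoons}} PS \qquad \forall S \in \{S_{\mathrm{ref}}, S_{\mathrm{mut}}\} \tag{1.1} \]

\[ K_a(S) = \frac{1}{K_d(S)} = \frac{[PS]}{[S][P]} = \frac{k_{\mathrm{on}}}{k_{\mathrm{off}}} = e^{-\Delta G^\circ / RT} \tag{1.2} \]

Ideally, we would know the absolute $K_a$ for each possible sequence, but in practice this is very hard because the fraction of unbound protein $P$ at equilibrium is difficult to determine [66, 55]. For this reason, we use computational modeling to make inferences about $K_a$ or the relative $K_a$ ($\mathrm{rel}K_a$). The most common way to eliminate $[P]$ from the equation is to take the ratio:

\[ \frac{K_a(S_{\mathrm{mut}})}{K_a(S_{\mathrm{ref}})} = \frac{e^{-\Delta G^\circ_{\mathrm{mut}} / RT}}{e^{-\Delta G^\circ_{\mathrm{ref}} / RT}} = e^{-\Delta\Delta G^\circ / RT} \tag{1.3} \]

\[ \Delta\Delta G^\circ = \Delta G^\circ_{\mathrm{mut}} - \Delta G^\circ_{\mathrm{ref}} \tag{1.4} \]

\[ \mathrm{rel}K_a = e^{-\Delta\Delta G^\circ / RT} \tag{1.5} \]

\[ \Delta\Delta G^\circ / RT = -\log(\mathrm{rel}K_a) \tag{1.6} \]

Without loss of generality, in later derivations I am going to ignore the constant $RT$ term. To a first approximation, it is assumed that the base at each position in the binding site contributes independently to the binding free energy (denoted as the change in standard free energy difference, $\Delta\Delta G^\circ$) [56, 67, 68].

\[ \Delta G^\circ(S) = \sum_{i=1}^{L} \Delta G^\circ_i(S_i) \tag{1.7} \]

\[ \Delta\Delta G^\circ = \Delta G^\circ(S_{\mathrm{ref}}) - \Delta G^\circ(S_{\mathrm{mut}}) \tag{1.8} \]

Note that eq. (1.8) takes the difference as reference minus mutant, the opposite order from eq. (1.4). This is the sign convention used in eqs. (1.9) to (1.12); under it, and with $RT$ dropped, eq. (1.3) reads $K_a(S_{\mathrm{mut}}) / K_a(S_{\mathrm{ref}}) = e^{\Delta\Delta G^\circ}$.

For $S_{\mathrm{ref}}$ and $S_{\mathrm{mut}}$:

\[ \Delta\Delta G^\circ(S_{\mathrm{ref}} \rightarrow S_{\mathrm{mut}}) = \sum_{i=1}^{L} \left[ \Delta G^\circ_i(S^{i}_{\mathrm{ref}}) - \Delta G^\circ_i(S^{i}_{\mathrm{mut}}) \right] = \Delta G^\circ_4(T) - \Delta G^\circ_4(G) \tag{1.9} \]

Referring to eq. (1.9), we can generalize: if a mutated sequence bears more than one mutation relative to the reference, the difference in binding free energy between them is the sum of the differences at the mutated positions.

\[ \Delta\Delta G^\circ = \sum_{j} \Delta\Delta G^\circ_{jb} \tag{1.10} \]

where $\Delta\Delta G^\circ_{jb}$ is the relative binding free energy due to mutation $b$ at position $j$ in the reference sequence.

We define a new term:

\[ w_{jb} = e^{\Delta\Delta G^\circ_{jb}} \tag{1.11} \]

Using eqs. (1.10) and (1.11) in eq. (1.3):

\[ \frac{K_a(S_{\mathrm{mut}})}{K_a(S_{\mathrm{ref}})} = e^{\sum_j \Delta\Delta G^\circ_{jb}} = \prod_{j} e^{\Delta\Delta G^\circ_{jb}} = \prod_{j} w_{jb} \]

\[ K_a(S_{\mathrm{mut}}) = K_a(S_{\mathrm{ref}}) \prod_{j} w_{jb} \tag{1.12} \]

This shows that a linear change in sequence (mutation) results in a linear change on the binding free energy scale (eq.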
(1.9)), but a multiplicative change on the binding affinity scale (eq. (1.12)).

Figure 1.5 Protein–DNA interaction and binding free energy.

Referring to eq. (1.1), the occupancy of probe $S$ is defined as:

\[ N(S) = \frac{[PS]}{[S] + [PS]} = \frac{[P]}{[P] + K_d(S)} \tag{1.13} \]

For simplicity we assume that $[P] \ll K_d(S)$, because in a given reaction the protein may get consumed by the highest-affinity binding sites.

\[ N(S) \approx \frac{[P]}{K_d(S)} = [P] K_a(S) \tag{1.14} \]

Suppose we have another sequence that is one mutation apart from $S$. Then,

\[ \frac{N(S_{\mathrm{mut}})}{N(S_{\mathrm{ref}})} = \frac{K_a(S_{\mathrm{mut}})}{K_a(S_{\mathrm{ref}})} \tag{1.15} \]

\[ N(S_{\mathrm{mut}}) = N(S_{\mathrm{ref}}) \prod_{j} w_{jb} = [P] K_a(S_{\mathrm{ref}}) \prod_{j} w_{j S^{j}_{\mathrm{mut}}} \tag{1.16} \]

This establishes a relationship between the occupancy of the best possible sequence in an ideal world ($S_{\mathrm{ref}}$) and the occupancies of sequences that are some mutations away from it ($S_{\mathrm{mut}}$). One may argue that, since each position in the sequence contributes independently to $\Delta\Delta G / RT$ (refer to eq. (1.10)), we need only $3 \times L$ sequences (all possible sequences that are one mutation away from the reference; $L$: sequence length) plus 1 (the reference sequence) to build a free energy matrix. However, this approach has two caveats. First, the experiment does not ensure enrichment of all sequences that are one mutation away from the reference sequence. Second, the observations from a given experiment cannot be considered ground truth, simply because of the error associated with any experimental setup. Hence, complete data modeling using the DNA sequence or any function of the sequence, for example DNA shape, seems to be a reasonable and robust approach to gain insights into protein–DNA interactions.

Chapter 2
Estimation of methylated DNA shape features

This chapter describes the method adopted to estimate shape feature values for methylated DNA. Following the methods part, I also show comparative analyses of features in unmethylated vs. methylated form.
2.1 Motivation

We recently studied how DNA shape contributes to protein–DNA recognition [15, 13, 32, 69]. However, we have not yet systematically quantified the effect of DNA methylation on protein binding [46]. Motivated by the widespread occurrence of CpG dinucleotides in TF binding motifs of different protein families [70–72], we aimed to study CpG methylation in the context of gene regulation (see Figure 2.1b). Understanding the protein–DNA readout of methylated cytosine requires structural insight derived from experimentally determined structures. Unfortunately, the current content of the Protein Data Bank (PDB) [73] includes only a few structures containing cytosine modifications (see Figure 2.1a). To close this knowledge gap, we utilized computational modeling of many DNA fragments to study the intrinsic effects induced by cytosine methylation, in a manner analogous to previous high-throughput studies of DNA shape of unmethylated genomic regions [32, 74, 75]. The resulting query tables can be utilized to analyze systematically the effect of methylation on protein–DNA interactions, as we demonstrate for DNase I cleavage and Pbx-Hox binding data.

Figure 2.1 Current statistics of available structures and abundance of CpG dinucleotides in TF binding sites. a Count statistics of protein–DNA complex and unbound DNA structures available in the PDB as of 31 May 2017. Counts of subsets of structures (right two bars) containing methylated DNA at CpG site(s) or in other sequence contexts were two orders of magnitude lower than the count of structures containing unmethylated DNA. Systematic profiling of the effect of methylation on three-dimensional DNA structure would require a substantially larger number of structures. Counts include structures solved by X-ray crystallography and NMR spectroscopy. b Abundance of CpG steps in TF binding motifs in HT-SELEX data for human TF datasets [70], derived using MotifDb [76].
CpG dinucleotides can be observed in binding sites irrespective of TF family. The five largest human TF families (based on the number of binding sites containing at least one CpG step) are specified. Almost 90% of ETS family motifs contain CpG steps. Numbers on each bar represent counts of motifs containing CpG or no CpG steps.

This work was mainly motivated by an observation made by Lazarovici et al. [77], who found that DNase I cleaves DNA with greater efficiency when the immediately downstream CpG dinucleotide is methylated (see Chapter 3). At the same time, it aimed to offer a service to the scientific community through which features for methylated DNA can be profiled in a high-throughput manner, as is already done for unmethylated DNA [32].

2.2 Methodology

2.2.1 Sequence and structure datasets

A total of 3518 DNA fragments of lengths varying from 13 to 24 base pairs (bp) were considered in all-atom Monte Carlo (MC) simulations, based on a previously published protocol (see Appendix A for details) [78]. Before performing simulations, we added 5-methyl groups at CpG steps to the core sequence (central regions in sequences in Additional file 2: Table S1) of every DNA fragment [77]. Sequences of these fragments were designed to capture the complete pentamer space in terms of the sequence context. Each considered sequence was defined as having at least one CpG step. For better coverage of the sequence space, four different nucleotide combinations were used to flank each designed sequence. Canonical B-DNA structures for all DNA fragments were generated by the JUMNA program [79] and used as input for the all-atom MC simulations [78].

2.2.2 All-atom MC simulations

MC simulations (see Figure 2.2c) traverse the energy landscape by making random moves [26], thus combining effective sampling with fast equilibration [25]. For this study, MC sampling was expanded to include 5mC.
Rotation of the 5-methyl group added one degree of freedom, which was implemented in a manner analogous to the rotation of the thymine 5-methyl group. Partial charges for 5mC were taken from a database of AMBER force fields for naturally occurring modified nucleotides [77, 80]. For a given DNA structure, the MC simulation protocol included two million MC cycles, with each cycle attempting random variations of all degrees of freedom (see Appendix A.1). After completion of the MC simulations, trajectories were analyzed by using snapshots that were stored every 100 MC cycles. After we discarded the first half-million MC cycles as an equilibration period, we mined the remaining trajectories using CURVES analysis [27] (see Figure 2.2d, and Appendix A for a detailed description of the methodology).

Figure 2.2 Workflow for the high-throughput methyl-DNAshape method. a Sequence pool. DNA fragments were considered for MC simulations to capture a sequence space that includes CpG methylation. Published sequences (left rectangular box) [81] and manually designed sequences (right rectangular box) included DNA fragments comprising a variable core (containing at least one methylated CpG step, called “mg” step) and flanks (4 bp in length). Right flanks were reverse complements of left flanks. For a given length of core sequence (5, 6, or 7 bp), all possible sequences (sequence_pool.xlsx) were considered for MC simulations. b Seed structures. Canonical B-DNA structures were generated for all selected sequences. The 5-methyl groups (orange circles) were introduced at cytosine positions with letter “m” (on Watson and Crick strands). c All-atom MC trajectories. Simulations were performed on seed structures for 2 million MC cycles, with snapshots recorded every 100 cycles after equilibration. d Mining trajectories. Recorded snapshots were analyzed for DNA shape features (see Appendix A) associated with corresponding DNA sequences.
e Pentamer Query Table (PQT). A pentamer sliding-window approach was applied to the analyzed DNA fragments. Calculated DNA shape features (HelT, MGW, ProT, and Roll) were recorded at the center of each pentamer. The value assigned to a shape feature for a given pentamer in the PQT is the average of all values of that feature observed for the pentamer across the sequence pool. f Front-end interface. Our easy-to-use methyl-DNAshape web server or the DNAshapeR Bioconductor/R package can be used to profile shape features of any genomic region and DNA sequences of any length by using a pentamer sliding-window approach. The methyl-DNAshape web server also outputs the effect of methylation on shape features in terms of ∆shape (shown here for MGW).

2.2.3 Building the methyl-DNAshape Pentamer Query Table

Mining of the MC trajectories generates average structural features for a given sequence. We assigned minor groove width (MGW) values to nucleotides in a strand-independent manner [32]. We adopted a pentamer sliding-window approach to record DNA shape feature values from representative structures. For a given sequence of length N, the approach profiled the shape features of N − 4 pentamers due to end effects. For MGW and propeller twist (ProT), values were assigned to the central bp of the corresponding pentamer. For Roll and helix twist (HelT), two values were recorded, one for bp step 2–3 and one for bp step 3–4 of a pentamer. Shape feature values from multiple occurrences of a given pentamer in different DNA fragments were averaged and assigned as representative values for that pentamer (see Figure A.2).

All possible pentamers were categorized into unmethylated and methylated groups. Unmethylated pentamers contained letters from the standard DNA alphabet, {A, C, G, T}. Methylated pentamers contained letters from the expanded alphabet, {A, C, G, T, m, g}. We assigned the letter “m” to 5mC and lowercase “g” to guanine base-paired with 5mC.
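The averaging step of the pentamer sliding-window approach described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: the function name `build_pentamer_table`, the input layout, and all numerical values are assumptions made for the example, and the strand-independent (reverse-complement) treatment and the two-value handling of Roll and HelT are omitted.

```python
from collections import defaultdict


def build_pentamer_table(fragments):
    """Average a per-bp shape value (e.g. MGW) over all pentamer occurrences.

    fragments: list of (sequence, values) pairs, where values[i] is the
    shape value measured at base pair i of that sequence (hypothetical layout).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for seq, values in fragments:
        # A sequence of length N contains N - 4 pentamer windows.
        for start in range(len(seq) - 4):
            pentamer = seq[start:start + 5]
            center = start + 2  # central bp of the window
            sums[pentamer] += values[center]
            counts[pentamer] += 1
    return {p: sums[p] / counts[p] for p in sums}


# Toy input; the numbers are invented for illustration only.
toy_fragments = [
    ("AAmgTT", [5.0, 4.0, 4.5, 5.5, 5.0, 4.8]),
    ("TAAmgT", [5.2, 4.9, 4.2, 4.7, 5.3, 5.1]),
]
table = build_pentamer_table(toy_fragments)
```

In this toy input the pentamer AAmgT occurs in both fragments, so its table entry is the mean of the two central values.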
We assumed that there is no partial methylation; thus, for a DNA fragment of length N, methylation on the forward strand at index i (5′–3′) also indicates methylation at index i + 1 (3′–5′) on the reverse strand. The G base-paired to 5mC in a methylated 5mC/G bp cannot be treated in the same way as a G base-paired to unmethylated C. In addition, due to the requirement of DNA methylation at both Cs of a CpG step, each 5mC will be followed by a G base-paired to another 5mC on the opposite strand. Thus, “m” and “g” cannot be considered as independent letters. Introduction of the two letters “m/g” for a 5mC/G bp increased the number of possible unique pentamers, with 475 new pentamers being added to the 512 unique pentamers representing unmethylated DNA (Additional file 5: Table S3).

Here, we discuss two specific examples. In the first example, NNmgN, where N ∈ {A, C, G, T}, has a single methylation mark at position 3. The second example is the complex case of gmgNm. To assign shape feature values, we have to consider that 5mC precedes “g” on its 5′-flank and that “g” follows “m” on its 3′-flank (Additional file 6: Fig. S2). We ran MC simulations with these combinations of methylated CpG steps to enrich pentamers of these types of compositions (see sequence_pool.xlsx for a list of all sequences studied with MC simulations).

2.2.4 methyl-DNAshape: method for high-throughput prediction of methylated DNA shape features

The methyl-DNAshape method derives DNA shape features of methylated DNA at nucleotide resolution, while considering the local sequence context. In a manner analogous to our DNAshape method for unmethylated DNA [32], we used a pentamer centered at position i to estimate DNA shape features at that position. We adopted the equivalent approach for DNA with methylated CpG dinucleotides, to capture the methylation properties of mammalian genomes.
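The pentamer counts quoted in Section 2.2.3 (512 unmethylated and 475 methylated unique pentamers) can be reproduced by brute-force enumeration. The sketch below assumes the rules stated in the text: “m” and “g” are complements of each other, every “m” that is not the last letter must be followed by “g”, every “g” that is not the first letter must be preceded by “m”, and a pentamer and its reverse complement count once. The function names are illustrative and do not come from the methyl-DNAshape code.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "m": "g", "g": "m"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_valid(seq):
    """Enforce full methylation of CpG steps: 'm' and 'g' occur only as 'mg',
    except where the partner letter falls outside the pentamer window."""
    for i, base in enumerate(seq):
        if base == "m" and i + 1 < len(seq) and seq[i + 1] != "g":
            return False
        if base == "g" and i > 0 and seq[i - 1] != "m":
            return False
    return True


def unique_pentamers(alphabet, require_methylation):
    seen = set()
    for letters in product(alphabet, repeat=5):
        seq = "".join(letters)
        if not is_valid(seq):
            continue
        if require_methylation and "m" not in seq and "g" not in seq:
            continue
        # Keep one representative per reverse-complement pair.
        seen.add(min(seq, reverse_complement(seq)))
    return seen


unmethylated = unique_pentamers("ACGT", require_methylation=False)
methylated = unique_pentamers("ACGTmg", require_methylation=True)
print(len(unmethylated), len(methylated))  # 512 475
```

The complex example gmgNm from the text passes the validity check because its leading “g” and trailing “m” have their partner letters outside the window.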
We derived the methyl-DNAshape Pentamer Query Table (mPQT), in analogy to the DNAshape Pentamer Query Table (PQT). DNA shape features at nucleotide position i were determined by querying the mPQT based on a pentamer using two neighboring nucleotides in both flanks ($P_i = N_{i-2} N_{i-1} N_i N_{i+1} N_{i+2}$). Ultimately, methyl-DNAshape calculates four feature vectors, one for each of the shape features HelT, MGW, ProT, and Roll (see Figure 2.2f).

As in our previous work, we selected four DNA shape features that play important roles in protein–DNA recognition [32]. ProT is an intra-bp parameter that accounts for bp twisting along the base-pairing axis. Increased values of ProT lead to an opportunity to form an additional inter-bp hydrogen bond in the major groove [13]. Roll and HelT are bp step features that estimate deformation at the dinucleotide level. The MGW feature plays a pivotal role in DNA shape readout [15]. A narrow minor groove enhances negative electrostatic potential and offers favorable interactions for positively charged amino acids [15]. Although the scarcity of experimentally solved structures with CpG methylation prohibited us from performing a validation such as is possible for unmethylated structures, we compared MGW predictions using methyl-DNAshape with X-ray cocrystal structures (see Figure A.1).

The methyl-DNAshape method is available as a web server at http://rohslab.usc.edu/methyl-DNAshape/ and as an extension to the R/Bioconductor package DNAshapeR [82] at http://bioconductor.org/packages/devel/bioc/html/DNAshapeR.html

2.3 Results

2.3.1 Effect of CpG methylation on DNA shape features

To quantify the effects of cytosine methylation on DNA shape features, we compared values for all unique pentamers that contained a single CpG step, as derived from DNAshape [32] (designed for unmethylated DNA) and methyl-DNAshape (our high-throughput prediction method designed for methylated DNA; see section 2.2).
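The query step of Section 2.2.4 and the comparison between methylated and unmethylated values can be illustrated with a short sketch. It assumes a dictionary-based table keyed directly by pentamer, full methylation of every CpG step, and invented MGW values; the helper names (`methylate_cpg`, `query_profile`, `delta_shape`) are hypothetical and are not the API of the web server or of DNAshapeR.

```python
def methylate_cpg(seq):
    """Rewrite every CpG step as 'mg' (assumes full methylation)."""
    return seq.replace("CG", "mg")


def query_profile(table, seq):
    """Look up one value per position i using the pentamer N(i-2)..N(i+2).
    The two terminal positions on each end have no full window."""
    return [table[seq[i - 2:i + 3]] for i in range(2, len(seq) - 2)]


def delta_shape(pqt, mpqt, seq):
    """Methylated minus unmethylated profile for the same sequence."""
    unmethylated = query_profile(pqt, seq)
    # Windows without 'm'/'g' fall back to the unmethylated table.
    methylated = query_profile({**pqt, **mpqt}, methylate_cpg(seq))
    return [m - u for m, u in zip(methylated, unmethylated)]


# Toy tables; the values are invented for illustration only.
toy_pqt = {"AACGT": 5.0, "ACGTT": 5.4}
toy_mpqt = {"AAmgT": 5.3, "AmgTT": 5.5}
profile = delta_shape(toy_pqt, toy_mpqt, "AACGTT")
```

A positive entry in `profile` would correspond to a feature value that is larger in the methylated context than in the unmethylated one.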
We considered four DNA shape features (HelT, MGW, ProT, and Roll) in this analysis. Roll and ProT exhibited strong methylation effects (50–100% of the range observed across all unmethylated-DNA sequences). At methylated CpG steps, Roll increased by an average of 6° (range 5.1°–7.2°), representing a similar effect size as previously observed in molecular dynamics simulations [83]. In methylated C/G bp, ProT decreased by an average of 5° (range −4.5° to −6.0°). By contrast, we observed relatively small effects for MGW and HelT (see Figure 2.3). An increase in Roll caused partial unstacking of the bp step, leading to widening of the minor groove. This conformational change might affect hydrogen bond formation in the major groove by exposing amino groups of guanine bases and oxygens of cytosine bases with different relative orientations. Presence of a methylated CpG step at position 1 or 3 (in the 5′–3′ direction) in pentamers resulted in a lowering of HelT by approximately 2° (see Figure 2.3c). Only subtle changes in MGW were observed, except for some particular sequence contexts.

Figure 2.3 Effect size of CpG methylation on DNA shape features. Methylation-induced changes were analyzed for four shape features: a, e Roll, b, f propeller twist (ProT), c, g helix twist (HelT), d, h minor groove width (MGW). For each shape feature, values for pentamers from the DNAshape query table for unmethylated DNA were plotted against values for corresponding pentamers from the methyl-DNAshape query table for methylated DNA. For simplicity, pentamers with one and only one CpG/mpg step (where “m” represents 5-methylcytosine and “g” represents G base-paired with “m” on the reverse-complement strand) were considered, for a total of 116 occurrences (Additional file 1). For bp step features Roll and HelT, values at bp steps 2–3 of each pentamer were used. For MGW and ProT, values at the central bp of each pentamer were used.
CpG methylation increased Roll by an order of magnitude (light-orange dots). The opposite was observed when methylation occurred at the immediately following bp step (light-blue dots). Presence of a methyl group at the central bp, either on the forward (light-blue dots) or reverse (light-orange dots) strand, caused a decrease in ProT.

2.3.2 Effect of CpG methylation on MGW of A-tracts

A-tracts, or poly[A/T] tracts, consist of a continuous run of at least three As or Ts without any TpA step. A-tracts, which play an important role in TF-DNA binding [17, 18], have a rigid conformation due to inter-bp hydrogen bonds in the major groove [19]. We analyzed the effect of methylation on the MGW of A-tracts flanked by CpG steps. As we derived the shape features from pentamers, we considered A-tracts of limited length, of either three (e.g., AAACG) or four (e.g., AAAAC; see Figure 2.4) nucleotides. For A-tracts that were three bp in length, the subsequent CpG context extended into one nucleotide position flanking the pentamer because 5mC at the fifth position of a pentamer implicitly assumes a G/5mC bp at the following position.

Box plot analysis revealed that the observed narrowing or widening of the minor groove upon CpG methylation depended on the sequence composition of As and Ts in the A-tract. For example, consecutive mutation from A to T in AAAAC led to a bell-shaped MGW profile, due to the introduction of a flexible TpA “hinge” step [84]. Maximal narrowing of the minor groove upon CpG methylation was observed for AATTC (see Figure 2.4). This result might be due to the fact that this particular A-tract had a narrow minor groove, an effect that was amplified through cytosine methylation in the adjacent CpG step. Effects of DNA methylation on MGW were larger and more variable for 4-bp than for 3-bp A-tracts.
This difference is likely due to the more distinct minor groove narrowing of longer A-tracts and suggests that the methylation effect can be amplified depending on the A-tract features of the surrounding sequence.

Figure 2.4 Effect of CpG methylation on minor groove width (MGW) of adjacent A-tracts. a MGW values at the central nucleotide of 3-bp A-tracts, which are shown from AAACG to TTTCG with an exchange of one bp (A/T to T/A) from the 3′ end. Methylation did not decrease MGW at the central bp, except in the ATTCG sequence. Wilcoxon test P values were calculated for methylation narrowing the minor groove at the central nucleotide as the alternative hypothesis (* 0.01 < P value ≤ 0.05; ** 0.001 < P value ≤ 0.01). Four A-tracts followed by a CpG step at the 3′ end include A-tracts preceded by a CpG step at the 5′ end because of symmetry in sequence and cytosine methylation. b MGW at the central nucleotide of 4-bp A-tracts follows a bell-shaped curve from AAAAC to TTTTC. One bp at a time was exchanged from A/T to T/A, starting at the 3′ end. Paired t-test P values were calculated for methylation narrowing the minor groove at the central bp as the alternative hypothesis. Two pentamers, AATTC and ATTTC, showed significant P values, meaning that methylation narrowed the minor groove. MC simulations were performed on longer DNA fragments containing hexamer sequences with a CpG/mpg bp step at position 5, and MGW values were measured at the central position 3.

Chapter 3 Deoxyribonuclease I and CpG methylation

The goal of this chapter is to further the understanding of Deoxyribonuclease I (DNase I) cleavage bias in the context of CpG methylation [77, 46]. The aim is to explain how the shape change induced by CpG methylation results in enhanced cleavage rates of DNase I. At the same time, the chapter serves as a validation and application of the high-throughput method methyl-DNAshape.
3.1 DNase I

DNase I is an endonuclease that cleaves the DNA phosphate backbone via a hydrolysis reaction. It creates a nick on just one strand of duplex DNA. Thus, completely detaching a double-stranded piece from a longer stretch of DNA requires four such nicks (two on the forward and two on the reverse-complement strand; see Figure 3.1; thanks to Prof. Bussemaker’s Cold Spring Harbor lecture). DNase I has been widely used in genomics research to profile accessible chromatin as well as regions bound by regulatory factors, where it cannot cleave (DNase I footprints) [85].

Ideally, an unbiased assessment of open chromatin regions using DNase I would require it to cleave naked DNA in an unbiased fashion, meaning that every cut event following DNase I binding to DNA should be equally probable. However, the sequence preference of DNase I in its cleavage activity has long been a matter of interest and debate [86, 87]. Recently, Lazarovici et al. [77] found that the hexamer context (three base pairs up- and downstream) explains almost all the variance in its cleavage activity. Moreover, they found that its cleavage preference is influenced by chemical modification of the DNA (CpG methylation). Examining the protein–DNA contact map of the existing DNase I–DNA cocrystal structure may partially explain these sequence preferences.

Figure 3.1 Schematic of DNase I cleaving DNA. DNase I hydrolyzes the circled phosphodiester bonds and releases the cleaved DNA.

3.1.1 DNase I–DNA cocrystal structure

As explained in the introductory section 1.3, a few protein residues, e.g., arginine and histidine, play pivotal roles in DNA recognition. The protein–DNA complex structure allows us to look into these components to make structural inferences [1, 69, 88].
Lahm and Suck [89], who were the first to solve the cocrystal structure of DNase I (PDB ID: 2DNJ) bound to the DNA sequence d(5′-GCGATCGC-3′)₂, found that binding to DNA is achieved through the insertion of positively charged residues into the minor groove upstream of the cleavage site. To disentangle the sequence-derived structural dependency of DNase I, the same group solved a second crystal structure of this system with a different DNA sequence, d(5′-GGCATGCC-3′)₂, that was not cleaved by DNase I [90]. By studying these two structures, they concluded that DNase I bends DNA significantly on binding. Related to this observation, they also suggested that DNA sequence-dependent structural features, for example, bendability [91], minor groove width [87], and other helical parameters, are important determinants of the biased cleavage activity [86] of DNase I.

Figure 3.2 Complex of DNase I endonuclease bound to DNA. a Cocrystal structure of the complex (PDB ID 2DNJ) illustrates that DNase I binds DNA through contacts to the DNA minor groove. A rotated view (bottom part) of the region contacted by DNase I shows that positively charged Arg9 and Arg41 residues recognize the negative electrostatic potential in the DNA minor groove, while Tyr76 stacks with a hydrophobic sugar moiety. Electrostatic potential was calculated for only the DNA molecule taken from the complex at physiologic ionic strength (0.145 M, based on a previously described protocol [15, 92]). Electrostatic potential is shown at the molecular surface of the DNA, using blue for +10 kT/e, red for −10 kT/e, and white for neutral potentials. b Protein–DNA contact map of 2DNJ. Similar information is shown in 2D representation. Adapted from [46].

3.1.2 Sequence-context model

Although DNase I contacts DNA through the minor groove, and is therefore not base- or sequence-specific [89], it does interact with DNA in a sequence-dependent manner by exploiting structural variations.
The contact surface in the minor groove extends over 6 bp [89, 90, 93]. This finding is further supported by the recent analyses of Lazarovici et al. [77], who found that most of the variance in DNase I cleavage data lay within the hexamer context (three bp up- and downstream of the cleaved phosphate; see Figure 1A in [77]). An excerpt of the hexamer table is shown in Table 3.1 (see Appendix B.1).

Table 3.1 Observed DNase I cleavage activities. The letter “p” marks the cleaved phosphate. Adapted from [77].

Hexamer    Observed cuts   Genomic position   Ratio     Scaled ratio (SR)
ACTpTAG    90,964          1,092,889          0.08323   1.00000
ACTpTGT    99,223          1,284,748          0.07723   0.92790
ACTpTGG    91,281          1,360,831          0.06708   0.80590
ACTpTAA    119,341         1,840,040          0.06486   0.77924
TCTpTAG    85,512          1,335,788          0.06402   0.76912
...        ...             ...                ...       ...
CGGpTTT    10              201,805            0.00005   0.00060
CGCpGCG    3               81,371             0.00004   0.00044
GACpGCG    0               49,356             0.00000   0.00000

3.2 CpG methylation and shape-to-affinity modeling

A closer look at the DNase I cleavage data revealed that a subset of hexamers with a CpG dinucleotide step immediately downstream of the cleavage site ($C_{+1}G_{+2}$) have cleavage rates that are higher by an order of magnitude when methylated (see Figure 3.3b). Knowing that DNase I binds DNA through the minor groove, we reasoned that methylation-induced structural deformation may account for the enhanced cleavage. To probe the relationship between methylation-induced changes in shape features and cleavage bias, we designed a linear model that we call the “shape-to-affinity” model (see Figure 3.3a).

The training process included only unmethylated data, so that the model learns to predict ∆∆G/RT purely from DNA shape feature values of unmethylated sequences [32]. We then fed the methylation-induced changes in shape feature values to the model to predict the methylation effect (see Figure 3.3c).
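The two-stage logic of the shape-to-affinity model (train on unmethylated data, then feed ∆shape) can be sketched as below. This is a simplified illustration under stated assumptions: it uses only an L2 penalty solved in closed form, whereas the model described here combines L1 and L2 regularization, and the features, weights, and noise level are synthetic. Because the model is linear, the bias cancels when the unmethylated prediction is subtracted from the methylated one, so ∆shape can be multiplied by the weights directly.

```python
import numpy as np


def fit_ridge(X, y, l2=1e-3):
    """Closed-form L2-regularized least squares with an unpenalized bias."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    penalty = l2 * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the bias term
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def predict_methylation_effect(weights, delta_shape):
    """Predicted change in free energy from a change in shape features.
    The bias (weights[0]) cancels in the difference, so it is not used."""
    return delta_shape @ weights[1:]


# Synthetic "unmethylated" training data (all values are hypothetical).
rng = np.random.default_rng(0)
true_w = np.array([1.0, 0.5, -0.2])        # bias, feature 1, feature 2
X = rng.normal(size=(200, 2))              # stand-in shape features
y = true_w[0] + X @ true_w[1:] + rng.normal(scale=0.01, size=200)

w = fit_ridge(X, y)
# Stand-in methylation-induced shift in the first shape feature only.
pred = predict_methylation_effect(w, np.array([[1.0, 0.0]]))
```

With these synthetic inputs the predicted effect is close to the weight of the shifted feature, which is the behavior that lets the trained weights be read as per-feature sensitivities.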
In this way, we test the ability of the shape feature values derived by our high-throughput method to predict the functional effect of methylation on protein–DNA interaction.

Figure 3.3 Modeling of methylation-induced shifts in cleavage rates using methylation-induced shifts in shape feature profile. a Points on plot represent inferred binding free energy (∆∆G/RT) values of DNase I to unmethylated hexamers and corresponding methylated hexamers with absolute phosphate cleavage count ≥ 25. Methylation-induced effects are shown for sequences with the $C_{+1}G_{+2}$ offset. A downward shift from the diagonal indicates a log-fold increase in cleavage activity of DNase I for methylated hexamers. Shape-to-affinity modeling and use of methyl-DNAshape features. b Shape-to-affinity model (L1- and L2-regularized linear regression model) built using unmethylated data. DNA shape features for unmethylated hexamers and their corresponding free energies (∆∆G/RT) were used as predictors and response variables, respectively. The model used the methylation effects on shape features (∆shape) calculated by methyl-DNAshape to predict ∆∆∆G (methylation effects on free energy, indicated by $\Delta\Delta\Delta\hat{G}$). Linearity of the model allowed direct use of ∆shape as input variable. Roll values are shown for illustration purposes. c Predictive powers of different shape-based models. Observed ∆∆∆G/RT, with a median around −2, is shown in the gray box. The Roll-based model accurately predicts the cleavage bias for the $C_{+1}G_{+2}$ offset. Adapted from [1].

Shape-to-affinity modeling together with the use of methyl-DNAshape reveals that the Roll angle best recapitulates the observed DNase I cleavage bias. From the weight vector of the trained model, we learned that there is a strong inverse relationship between the Roll angle at offset +1/+2 and ∆∆G/RT:
\[ \vec{W} = [w_{\mathrm{bias}}, w_1, \ldots, w_5] \tag{3.1} \]
\[ \phantom{\vec{W}} = [3.75,\ 0.10,\ -0.08,\ -0.07,\ -0.19,\ 0.03] \tag{3.2} \]

Roughly, a unit increase in the Roll angle at offset +1/+2 would decrease the binding free energy by 0.19 units or, equivalently, enhance the relative binding affinity by a factor of $e^{0.19} \approx 1.21$. CpG methylation indeed increases the Roll angle and hence enhances binding affinity. We can relate this to the effect of an increased Roll angle (see Figure 2.3) on DNA structure, that is, bending of the DNA toward the major groove (see Figure 1.3). This bending could make it easier for DNase I to cleave the DNA, because DNase I cleaves bent DNA more frequently [91].

Chapter 4 Methylation sensitivity of Pbx-Hox complexes

This chapter serves as another validation of our high-throughput method, methyl-DNAshape. Here we take human Pbx-Hox heterodimer binding to DNA as a model system and show the potential of a shape feature derived by the high-throughput method (MGW) to reveal methylation effects on binding.

4.1 Pbx-Hox system and CpG methylation

SELEX-seq followed by DNA shape analysis of the sequences bound by Exd (Extradenticle)-Hox (Homeobox) proteins revealed a shape-readout mechanism (based on minor groove width (MGW) in this case) [94, 59]. Different MGW signatures were preferred by Hox paralogs (Labial (Lab) to Abdominal-B (AbdB)) along the anterior-posterior axis [59]. A recent extension of the SELEX-seq method to EpiSELEX-seq, by Kribelbauer et al. [47], produced high-throughput binding data for probing the effect of CpG methylation on protein–DNA binding. One of the model systems analyzed in that study is Pbx (pre-B-cell leukemia homeobox)-Hox, chosen to probe the effect of methylation in vertebrates (Pbx is the mammalian ortholog of Exd). Three paralogs of Hox proteins, namely HoxA1, HoxA5, and HoxA9, were considered in the analyses [47].
Interestingly, for Pbx-HoxA1 and Pbx-HoxA5, a destabilizing effect on binding was observed when offset 6/7 of the 12-bp consensus binding site (NTGAYNNAYNNN; see Figure 4.1a) was methylated (NN(CG) → NN(mg), ↓ binding; see section B.2.2; data for HoxA9 did not show enrichment of binding sequences with CpG at offset 6/7). Because the protein makes almost no contact in this region, we hypothesized that methylation may alter the shape of the DNA in neighboring regions that are important for recognition.

Figure 4.1 CpG methylation induces a DNA shape change that explains its effect on Pbx-Hox binding. a Schematic representation of the Pbx-Hox heterodimer bound to DNA (PDB ID 1PUF), and of the effect of CpG methylation on binding. Pbx (green) and Hox (blue) homeodomains bind up- and downstream of the central spacer region (indicated in red), respectively. CpG methylation at offsets 6/7 and 10/11 reduces binding, whereas methylation at offset 9/10 enhances binding. Methyl group readout was previously identified as the underlying mechanism for the latter offset [47]. b Scatter-plot representation of relative binding affinities of methylated versus unmethylated sequences for the Pbx-HoxA1 complex. Sequences carrying a single methylation event and their corresponding unmethylated counterparts were considered. Green, magenta, and blue points correspond to methylation at offsets 6/7, 9/10, and 10/11, respectively. Sequences containing CpG dinucleotides at other offsets (relatively weakly affected by methylation) are colored gray. c Alternative representation of the data in b, showing the effect of methylation on binding free energy, denoted as ∆∆∆G/RT. Positive (e.g., offsets 6/7 and 10/11) and negative (e.g., offset 9/10) shifts from the dashed line (indicating no methylation effect) reflect reduced and enhanced binding (on logarithmic scale) due to methylation.
CpG dinucleotides at offsets 6/7 and 10/11 produce the same hexamer context for $A_4$ and $A_8$ (NNAYCG/NGAYCG) and, hence, were assigned a common color, dark-cyan. d Analysis of the methylation-induced change in MGW at positions $A_4$ and $A_8$ within the Pbx-Hox binding site (NNGAYNNAYNNN), for the different hexameric/pentameric contexts that the Pbx-Hox heterodimer may encounter within its binding sequence. Coloring corresponds to that of labels and rectangular patches in c. Statistically significant widening of the minor groove (first two boxes) plausibly explains the observed reduced binding due to methylation at CpG offsets 6/7 and 10/11. No significant change in MGW upon methylation was observed for offset 9/10. Adapted from [1].

As previously reported [47], direct comparison of the relative binding affinities for unmethylated versus methylated sequences (see Figure 4.1b, c) shows that cytosine methylation can have either a stabilizing or a destabilizing effect on Pbx-Hox binding, depending on the position of the CpG dinucleotide within the binding site. For example, methylation of a CpG dinucleotide at offset 6/7 (NTGAYCGAYNNN; $C_6G_7$; green points/box in Figure 4.1b, c) and offset 10/11 (NTGAYNNAYCGN; $C_{10}G_{11}$; blue points/box in Figure 4.1b, c) suppresses binding, whereas methylation at offset 9/10 (NTGAYNNACGNN; $C_9G_{10}$; magenta points/box in Figure 4.1b, c) enhances binding by an order of magnitude. We previously proposed a plausible mechanism for the latter stabilizing effect, which we postulated to involve direct contacts to the methyl group in the major groove [47]. However, an explanation of the suppressed binding at the CpG offsets 6/7 and 10/11 was lacking (see Figure 4.1a).
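The offset bookkeeping and the sign convention used above can be made concrete with a small sketch. It assumes that ∆∆G/RT is the negative natural logarithm of the relative binding affinity, which is consistent with positive ∆∆∆G/RT values reflecting reduced binding; the function names and the example 12-mer are illustrative only.

```python
import math


def cpg_offsets(site):
    """Return 1-based (C, G) offsets of every CpG step in a binding site."""
    return [(i + 1, i + 2)
            for i in range(len(site) - 1)
            if site[i:i + 2] == "CG"]


def ddd_g_over_rt(affinity_methylated, affinity_unmethylated):
    """Methylation effect on binding free energy in units of RT.
    Assumed convention: ddG/RT = -ln(relative affinity), so a positive
    result means that methylation weakens binding."""
    return -math.log(affinity_methylated / affinity_unmethylated)


# One instance of NTGAYCGAYNNN with arbitrary choices for N and Y.
print(cpg_offsets("ATGATCGATAAA"))   # [(6, 7)]

tenfold_gain = ddd_g_over_rt(10.0, 1.0)   # about -2.3
tenfold_loss = ddd_g_over_rt(0.1, 1.0)    # about +2.3
```

Under this convention, the order-of-magnitude enhancement reported for offset 9/10 corresponds to a shift of roughly −2.3 on the ∆∆∆G/RT axis.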
Neither of the two available cocrystal structures of Pbx-Hox complexes (PDB ID 1B72, Pbx1-HoxB1, and PDB ID 1PUF, Pbx1-HoxA9, each bound to DNA) shows direct contacts between protein residues and DNA at offset 6/7, particularly in the major groove, where the methylation mark is located. However, the nucleotides at offset 6/7 form a spacer located between two AY dinucleotides (see Figure 4.1a), which were previously shown to exhibit strong shape preferences. Specifically, minor groove narrowing at AY positions adjacent to the central spacer was shown to be associated with enhanced binding when the nucleotide sequence was varied for unmethylated DNA [69, 59]. Therefore, we hypothesized that a methylation-induced change in DNA shape near the CpG dinucleotide could affect binding affinity.

We used the pentamer-based shape tables that form the foundation of DNAshape [74] and methyl-DNAshape to investigate this effect systematically. A pentamer window centered at the $A_8$ position includes a CpG dinucleotide at offset 9/10 within its 5 bp (NNGAYNNACGNN). However, a CpG step at offsets 6/7 and 10/11 only includes one bp of the CpG dinucleotide (NNGAYCGAYNNN or NNGAYNNAYCGN) and indirectly constrains the nucleotide identity at a sixth position after the pentamer window. This distinction became important when we predicted MGW. In the case of the methylated-DNA table (mPQT), the presence of a (methylated) C at position 5 within the pentamer implies the presence of a G at the following position in the training set from which the pentamer tables were derived. This is not the case for the unmethylated-DNA table (PQT). The pentamer tables do not capture a weak dependency of shape on the sixth position, which confounds our estimate of the methylation effect on shape.
For this reason, we compiled an additional table consisting of unmethylated-DNA shape parameters for all hexamers ending with CpG and heptamers with CpG flanks (see section B.2.1), which we used to estimate the effect of methylation on shape.

Figure 4.1d shows that cytosine methylation in a sequence context consistent with the presence of a CpG step at offset 6/7 or 10/11 within the 12-bp Pbx-Hox binding site results in widening of the minor groove (see section B.2.3 for details on the t-test performed). This observation, combined with the known inverse relationship between MGW and binding affinity for unmethylated DNA, provides a plausible explanation for the methylation-induced weakening of binding observed at these offsets (see Figure 4.1b). In contrast, no effect of methylation on MGW can be observed for the CpG offset 9/10, where direct contacts in the major groove already provided a mechanistic explanation [47].

Chapter 5 Physico-chemical modeling of protein–DNA interactions

This chapter describes the application of deep neural networks (DNNs) in the context of protein–DNA interactions. Although the project is still in a development phase, our results so far reiterate previous findings. The work shown in this chapter was conceived and carried out jointly with my dear colleague Tsu-Pei Chiu.

5.1 Physico-chemical encoding of DNA

Protein–DNA interaction data are often represented as a mapping from a DNA sequence to the strength with which a protein binds that sequence (or, more precisely, a shorter fragment of it). Representation of a DNA sequence for computational purposes often requires one-hot encoding of the sequence letters (s-OHE); e.g., for the typical DNA alphabet {A, C, G, T}, we need a 4-bit encoding.
A simple illustration of the encoding function for a given sequence S of length L is shown below:

\[ \text{s-OHE}(S) = [\text{s-OHE}(S_1), \text{s-OHE}(S_2), \ldots, \text{s-OHE}(S_L)]^T, \qquad S_i \equiv \text{base at position } i \text{ in } S \]

where:

\[ \text{s-OHE}(S_i) = (I_A, I_C, I_G, I_T)^T \quad \forall i \in \{1, \ldots, L\}, \qquad \sum_{k \in \{A, C, G, T\}} I_k = 1, \qquad I_k = \begin{cases} 1 & k = S_i \\ 0 & k \neq S_i \end{cases} \]

Our proposed physico-chemical representation of DNA, however, adds another dimension that accounts for offsets of functional group signatures on the major/minor groove faces. For instance, with four different functional group signatures on the major groove face and four offsets, encoding of one bp requires 16 bits, compared to four in the case of the letter encoding explained above. Similarly, 9 bits are needed for a bp on the minor groove face, with three edge offsets and three distinct signatures. The encoding function, followed by a channel-based view of the encoding for the major groove, is illustrated below:

\[ \text{pc-OHE}_{\text{major}}(S) = [\text{pc-OHE}_{\text{major}}(S_1), \text{pc-OHE}_{\text{major}}(S_2), \ldots, \text{pc-OHE}_{\text{major}}(S_L)]^T \]

where:

\[ \text{pc-OHE}_{\text{major}}(S_i) = [\text{pc-OHE}(\text{Edge-Offset}_1), \ldots, \text{pc-OHE}(\text{Edge-Offset}_4)]^T \]

\[ \text{pc-OHE}(\text{Edge-Offset}_j) = (I_A, I_D, I_M, I_N)^T \quad \forall j \in \{1, 2, 3, 4\}, \qquad \sum_{k \in \{A, D, M, N\}} I_k = 1, \qquad I_k = \begin{cases} 1 & k = \text{major}(S_{ij}) \\ 0 & k \neq \text{major}(S_{ij}) \end{cases} \]

Figure 5.1 Physico-chemical encoding for the major groove of sequence 5′-GACT-3′ (sequence at the top of the colored box). A linear encoding is reshaped in this fashion for interpretability and computational convenience (described later in this chapter). In analogy to the representation of an image as RGB channels [95], physico-chemical signatures are presented as four different channels: nonpolar (grey, Channel 1), methyl (yellow, Channel 2), H-bond donor (blue, Channel 3), and H-bond acceptor (red, Channel 4). Rows and columns represent major groove offsets and bp positions, respectively.
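The two encodings defined above can be sketched with NumPy. The per-base major-groove patterns in `MAJOR_GROOVE` are an assumption taken from the standard textbook description of functional groups in the major groove (acceptor, donor, methyl, nonpolar hydrogen), not from the DeepRec source code. The channel order follows the order (A, D, M, N) of the equation rather than the channel numbering of Figure 5.1, and the array layout (offsets × positions × channels) is one possible choice.

```python
import numpy as np

BASES = "ACGT"
SIGNATURES = "ADMN"  # acceptor, donor, methyl, nonpolar hydrogen

# Assumed major-groove signature at edge offsets 1..4 for each base
# (and its Watson-Crick partner); labeled as an assumption in the text.
MAJOR_GROOVE = {"A": "ADAM", "T": "MADA", "G": "AADN", "C": "NDAA"}


def s_ohe(seq):
    """Sequence one-hot encoding: one row of 4 bits per base."""
    out = np.zeros((len(seq), len(BASES)), dtype=int)
    for i, base in enumerate(seq):
        out[i, BASES.index(base)] = 1
    return out


def pc_ohe_major(seq):
    """Physico-chemical one-hot encoding of the major groove.
    Shape: (4 edge offsets, L bp positions, 4 signature channels)."""
    out = np.zeros((4, len(seq), len(SIGNATURES)), dtype=int)
    for pos, base in enumerate(seq):
        for offset, signature in enumerate(MAJOR_GROOVE[base]):
            out[offset, pos, SIGNATURES.index(signature)] = 1
    return out


letters = s_ohe("GACT")         # 4 bits per bp
groove = pc_ohe_major("GACT")   # 16 bits per bp
```

Each bp contributes exactly one set bit per edge offset, which gives the 16-bit representation per bp mentioned above.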
5.2 Modeling of protein–DNA interaction data

The image-like representation of DNA sequences (probes), together with the possibility that a binding site is present at any offset within a probe, motivated us to use a Convolutional Neural Network (CNN), because CNNs are well suited to image-based problems and have a translation-invariance property. CNN-based models have recently been an area of interest in the genomics community as well [96–98]. One of the caveats of deep learning-based models is that they are considered to be a “black box”, meaning that interpretation of the learned model is very difficult. Despite this challenge, progress is being made toward making such models interpretable [99, 100]. Kelley et al. [96] used saturated in silico mutagenesis to interpret their model, where the change in prediction due to a single change in the input DNA sequence was regarded as the importance of that position. Angermueller et al. [101, 102] give a detailed review of the use of deep learning in computational biology.

5.2.1 DeepRec framework

We have developed a multi-module deep learning framework that the user can extend simply by plugging in a user-defined module.

Figure 5.2 DeepRec: a multi-module and multi-task deep learning framework. Raw heterogeneous features ([X, Y, .]; multi-module) mapped to a response vector (O; multi-task) can be used as input to this framework. In addition, user-defined functions (f_1(.)) can easily be applied to any given feature. The convolutional layer summarizes the convolution, rectification, and max-pooling processes applied to the input. The framework also allows cascaded convolutions by simply editing the framework configuration file. The joint layer is a flattened representation of the output of the convolutional layer. Hidden layers are fully connected neural nets.
The output layer calculates the classification error/loss (see later sections), which is then back-propagated to update the parameters of the network.

5.2.2 Framework components

Unified data representation

In the big-data regime, compiling data into a format compatible with a program/package may not be as trivial as it seems. In particular, with the flexible framework outlined above, adding a new cartridge may produce one or more input files to incorporate during training. As a result, many changes to the configuration files would be required. To overcome this issue, we have put significant effort into making the package object-oriented so that it is easy to use for end users.

Convolutional layer

The concept of convolution has its roots in Digital Signal Processing (DSP). In layman's terms, one can regard multiplication of two numbers as convolution (see Figure 5.3). In our framework, the neural network is preceded by the convolution layer(s). For each convolution layer, the user defines the number and size of kernels (like the edge detector shown in Figure 5.3b). These kernels are randomly initialized, and throughout the training process the network learns kernels that minimize the loss.

Figure 5.3 Convolution in layman's terms (a) and its application (b), illustrating that a 3 × 3 matrix with all elements being −1 except the center (8) works as an edge detector.

It is worth noting that the convolution process is a linear transformation of the input. Non-linearity, which has recently become very popular due to its outstanding performance in image classification/recognition, is achieved by applying a Rectified Linear Unit (ReLU) function to the output of the convolution. It simply replaces negative values with zero. In our physico-chemical space, encoding of a sequence results in a three-dimensional matrix with the third dimension as channels.
Convolutional kernels have the same size as the input in the channel dimension but are smaller in the other two. Mathematically, the convolution of the input matrix X with a given kernel k is:

\[ \mathrm{Conv}(X)_{ijk} = \sum_{m=1}^{M} \sum_{n=1}^{N} \sum_{c=1}^{C} k_{mnc} \, X_{i+m,\, j+n,\, c} \tag{5.1} \]

where:
(i, j): indices of the output convolution matrix
(M, N, C): numbers of rows, columns, and channels of the kernel

Adding the ReLU function leads to the following:

\[ \mathrm{Conv}(X)_{ijk} = \mathrm{ReLU}\left( \sum_{m=1}^{M} \sum_{n=1}^{N} \sum_{c=1}^{C} k_{mnc} \, X_{i+m,\, j+n,\, c} \right) \tag{5.2} \]

where:

\[ \mathrm{ReLU}(x) = \begin{cases} x & x \geq 0 \\ 0 & x < 0 \end{cases} \tag{5.3} \]

Pooling layer

The pooling layer introduces another nonlinearity by taking the max() over a defined window. In addition, it downsamples the data. Pooling with a window size (r × s) can be defined as:

\[ \mathrm{Pooling}(X)_{ijk} = \max X^{\mathrm{sub}}_{ijk} \tag{5.4} \]

where:
(i, j): indices of the output matrix
X^{sub}_{ijk}: submatrix of X, X[i, ..., i+r] × [j, ..., j+s]

Joint layer

Convolution followed by pooling results in a reduced representation of each cartridge. At the joint layer, we simply flatten each cartridge. It is important to note that up to this layer the cartridges do not cross-talk.

Neural network layer

A given neuron at this layer takes input from all the neurons in the joint layer. The steps from the input to the joint layer can be regarded as feature extraction, and the neural network layers learn inter-dependencies between features.

Output layer

The output layer consists of a loss function that calculates the average error in predictions. The training process adopts the error back-propagation method to update the weights of the network. Usually, regularization is incorporated in the loss function to control overfitting.
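The convolution, rectification, pooling, and flattening steps of Equations 5.1 to 5.4 can be sketched in plain Python as follows. This is an illustrative sketch only and not the DeepRec code: the function names are hypothetical, indices are zero-based, a single kernel is applied without padding, and non-overlapping pooling windows are assumed because the stride is not stated above. The last helper computes a mean squared error as one example of the loss evaluated by the output layer.

```python
def conv_relu(x, kernel):
    """Valid convolution (Eq. 5.1) followed by ReLU (Eqs. 5.2 and 5.3), one kernel.

    x is indexed [row][column][channel]; kernel is indexed [m][n][c] and
    spans every channel of x.
    """
    rows, cols = len(x), len(x[0])
    m_size, n_size, c_size = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for i in range(rows - m_size + 1):
        out_row = []
        for j in range(cols - n_size + 1):
            total = 0.0
            for m in range(m_size):
                for n in range(n_size):
                    for c in range(c_size):
                        total += kernel[m][n][c] * x[i + m][j + n][c]
            out_row.append(max(0.0, total))  # ReLU: negative sums become zero
        out.append(out_row)
    return out


def max_pool(feature_map, r, s):
    """Max over r x s windows (Eq. 5.4); non-overlapping windows are ASSUMED."""
    pooled = []
    for i in range(0, len(feature_map) - r + 1, r):
        pooled.append([
            max(feature_map[a][b] for a in range(i, i + r) for b in range(j, j + s))
            for j in range(0, len(feature_map[0]) - s + 1, s)
        ])
    return pooled


def flatten(feature_maps):
    """Joint layer: concatenate all pooled maps into one feature vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def mse(predicted, observed):
    """Mean squared error between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


if __name__ == "__main__":
    # Toy single-channel input and a 2 x 2 kernel that sums the main diagonal.
    x = [[[1], [2], [3]], [[4], [5], [6]], [[7], [8], [9]]]
    kernel = [[[1], [0]], [[0], [1]]]
    fmap = conv_relu(x, kernel)
    print(fmap, max_pool(fmap, 2, 2), flatten([max_pool(fmap, 2, 2)]))
```

As in most deep learning libraries, the kernel is applied without flipping, which matches the index pattern X_{i+m, j+n, c} of Equation 5.1.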
As an example of such a loss function, the Mean Squared Error (MSE) is shown below:

\[ L = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 \tag{5.5} \]

5.2.3 Modeling and interpretation of Pbx-HoxA1–DNA binding

One of the immediate advantages of physico-chemical modeling is that it can incorporate both unmethylated and methylated data at the same time without expanding the alphabet (see Figure 1.2). The data produced by Kribelbauer et al. [47] are a good fit for this purpose; they showed that a CpG methylation event at offset 9/10 in the consensus binding site (TGAYNNAYNNN) enhances binding (see Figure 4.1). To assess the utility of DeepRec, we used these data to train the network (part of the data was held out for testing). Physico-chemical encoding of the input sequences was performed as described above (see section 5.1). No scaling of the response variable (relK_a) was performed, as its values were already in the [0, 1] range.

For interpreting the trained model, we adopted a variant of saturated in silico mutagenesis. For instance, the height of the third ‘A’ corresponding to ‘G’ in the top subplot (see Figure 5.4a) is the average of the differences of two pairs of predictions. For the first pair, we predict the binding affinities (relK_a) with the original and the muted physico-chemical signature of the input sequence (the signature for ‘A’, [0, 0, 0, 1], is set to [0, 0, 0, 0]). The difference of the −log of these two values is then calculated. For the second pair, we follow the same process with one change (G_3 → A_3) in the input sequence, because adenine also offers an acceptor at ‘Edge offset 1’. The same process is adopted for all other letters.

Figure 5.4 Interpretation of the trained deep learning model using in silico mutagenesis. a Predicting the importance of physico-chemical signatures at a given offset in the major groove. The Y-axis presents the change in predicted relative binding free energies due to the presence and absence of chemical signatures (see section 5.2.3 for details). b-d Contact maps of selected physico-chemical signatures found important in (a). b The dinucleotide T7pG8 corresponds to T2pG3 in panel a (sequence at the top); the interpretation reveals the well-known Guanine-Arginine-Thymine triad [45]. c A similar hydrophobic contact was found for T10 (T5 in a). d The interpretation also confirms the finding by Kribelbauer et al. [47] that the methyl group at T14 (T9 in a) is important, because methylated cytosines led to enhanced binding compared to unmethylated cytosines (see Figure 4.1b,c). Grids around nucleotides/residues represent electron densities, which indicate the confidence in the location of atoms. The light-cyan shaded region is not considered for interpretation because this dinucleotide step falls in the spacer region (see Figure 4.1). Work in this direction is in progress.

Along with revealing the importance of the methyl group at T9 (see Figure 5.4a,d), our interpretation method also yielded interesting insights, such as the formation of the Guanine-Arginine-Thymine triad (see Figure 5.4a,b), where arginine makes two H-bonds with guanine (the heights of the As at both edge offsets 1 and 2 are positive and dominant) and hydrophobic contacts with thymine's methyl group. Although the interpretation shown here is for one input sequence, similar patterns were seen for others. We are in the process of performing post hoc analysis in order to assign significance values to each letter height.

Chapter 6
Discussion and future work

Many eukaryotic genomes, including the human genome, undergo epigenetic modifications. CpG methylation is one such modification of DNA and is the most prevalent [103]. CpG methylation, also known as DNA methylation, is often linked with the silencing of genes, but recent work has suggested other dimensions to it; for example, Klf4 prefers to bind to methylated DNA in developing cells [81].
Moreover, in vitro studies have provided evolving insights into its impact on protein–DNA binding [47, 51]. We attempted to understand its impact in both base- and shape-readout contexts. Using methyl-DNAshape, we were able to learn the functional importance of CpG methylation in protein–DNA interactions for two model systems, DNase I–DNA and Pbx-Hox. We were able to validate the importance of different DNA shape features in the two systems, e.g., Roll angle in DNase I–DNA and MGW in Pbx-Hox interactions.

Recent work by Li et al. [104] expanded the repertoire of DNAshape [32] by adding nine additional DNA shape features to DNAshape [82]. We aim to extend methyl-DNAshape in the same direction and offer an interface where the user can simply learn the impact of methylation on shape features. In addition, our plan is to incorporate methyl-DNAshape into TFBSshape [74], a web-based tool widely used for DNA shape analysis.

In addition, we introduced a linear-regression-based quantitative method, the “Shape-to-affinity” model (see Figure 3.3), that can be applied to protein–DNA binding systems with a known background of shape readout, as we saw in the case of DNase I; alternatively, the model can guide the search for important DNA shape feature(s) in systems in which shape readout is not well understood. Moreover, the IUPAC-based shape analysis (see Figure 4.1), a non-quantitative method developed in the process of analyzing the Pbx-Hox data, could serve as an important tool for a first assessment of the effect of methylation on shape readout for a given system.

We introduced a novel concept, the physico-chemical encoding of DNA sequences, that enables learning of protein–DNA interactions at a higher resolution. It has the potential to better explain the base-readout recognition mode of protein–DNA binding.
In addition, this encoding naturally incorporates both unmethylated and methylated data at the same time without the need to expand the encoding space, as is required in DNA alphabet space (see Chapter 1). The 2D-sheet representation of DNA made it a good fit for convolutional neural networks, which we harnessed to model the data (see Chapter 5). We employed a variant of saturated in silico mutagenesis to interpret the model. Our next step is to obtain significance levels and introduce error bars on the letters (see Figure 5.4).

References

[1] S. Rao, T. P. Chiu, J. F. Kribelbauer, R. S. Mann, H. J. Bussemaker, and R. Rohs. Systematic prediction of DNA shape changes due to CpG methylation explains epigenetic effects on protein-DNA binding. Epigenetics Chromatin, 11(1):6, Feb 2018. [PubMed Central:PMC5800008] [DOI:10.1186/s13072-018-0174-4] [PubMed:29409522].
[2] John Kuriyan et al. The molecules of life: Physical and chemical principles. Garland Science, 2012.
[3] J. D. Watson and F. H. Crick. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171(4356):737–738, Apr 1953. [PubMed:13054692].
[4] X. J. Lu, Z. Shakked, and W. K. Olson. A-form conformational motifs in ligand-bound DNA structures. J. Mol. Biol., 300(4):819–840, Jul 2000. [DOI:10.1006/jmbi.2000.3690] [PubMed:10891271].
[5] A. H. Wang, G. J. Quigley, F. J. Kolpak, J. L. Crawford, J. H. van Boom, G. van der Marel, and A. Rich. Molecular structure of a left-handed double helical DNA fragment at atomic resolution. Nature, 282(5740):680–686, Dec 1979. [PubMed:514347].
[6] H. Drew, T. Takano, S. Tanaka, K. Itakura, and R. E. Dickerson. High-salt d(CpGpCpG), a left-handed Z’ DNA double helix. Nature, 286(5773):567–573, Aug 1980. [PubMed:7402336].
[7] E. S. Lander, L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, R. Funke, et al. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, Feb 2001. [DOI:10.1038/35057062] [PubMed:11237011].
[8] N. C. Seeman, J. M. Rosenberg, and A. Rich. Sequence-specific recognition of double helical nucleic acids by proteins. Proc. Natl. Acad. Sci. U.S.A., 73(3):804–808, Mar 1976. [PubMed Central:PMC336007] [PubMed:1062791].
[9] N. M. Luscombe, R. A. Laskowski, and J. M. Thornton. Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level. Nucleic Acids Res., 29(13):2860–2874, Jul 2001. [PubMed Central:PMC55782] [PubMed:11433033].
[10] N. M. Luscombe and J. M. Thornton. Protein-DNA interactions: amino acid conservation and the effects of mutations on binding specificity. J. Mol. Biol., 320(5):991–1009, Jul 2002. [PubMed:12126620].
[11] G. Zheng, X. J. Lu, and W. K. Olson. Web 3DNA–a web server for the analysis, reconstruction, and visualization of three-dimensional nucleic-acid structures. Nucleic Acids Res., 37(Web Server issue):W240–246, Jul 2009. [PubMed Central:PMC2703980] [DOI:10.1093/nar/gkp358] [PubMed:19474339].
[12] X. J. Lu and W. K. Olson. 3DNA: a versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nat Protoc, 3(7):1213–1227, 2008. [PubMed Central:PMC3065354] [DOI:10.1038/nprot.2008.104] [PubMed:18600227].
[13] R. Rohs, X. Jin, S. M. West, R. Joshi, B. Honig, and R. S. Mann. Origins of specificity in protein-DNA recognition. Annu. Rev. Biochem., 79:233–269, 2010. [PubMed Central:PMC3285485] [DOI:10.1146/annurev-biochem-060408-091030] [PubMed:20334529].
[14] K. E. Flick, M. S. Jurica, R. J. Monnat, and B. L. Stoddard. DNA binding and cleavage by the nuclear intron-encoded homing endonuclease I-PpoI. Nature, 394(6688):96–101, Jul 1998. [DOI:10.1038/27952] [PubMed:9665136].
[15] R. Rohs, S. M. West, A. Sosinsky, P. Liu, R. S. Mann, and B. Honig. The role of DNA shape in protein-DNA recognition. Nature, 461(7268):1248–1253, Oct 2009. [PubMed Central:PMC2793086] [DOI:10.1038/nature08473] [PubMed:19865164].
[16] Chris R Calladine and Horace Drew. Understanding DNA: the molecule and how it works. Academic Press, 1997.
[17] T. E. Haran and U. Mohanty. The unique structure of A-tracts and intrinsic DNA bending. Q. Rev. Biophys., 42(1):41–81, Feb 2009. [DOI:10.1017/S0033583509004752] [PubMed:19508739].
[18] H. S. Koo, H. M. Wu, and D. M. Crothers. DNA bending at adenine . thymine tracts. Nature, 320(6062):501–506, 1986. [DOI:10.1038/320501a0] [PubMed:3960133].
[19] H. C. Nelson, J. T. Finch, B. F. Luisi, and A. Klug. The structure of an oligo(dA).oligo(dT) tract and its biological implications. Nature, 330(6145):221–226, 1987. [DOI:10.1038/330221a0] [PubMed:3670410].
[20] A. V. Colasanti, X. J. Lu, and W. K. Olson. Analyzing and building nucleic acid structures with 3DNA. J Vis Exp, (74):e4401, Apr 2013. [PubMed Central:PMC3667640] [DOI:10.3791/4401] [PubMed:23644419].
[21] H. R. Drew, R. M. Wing, T. Takano, C. Broka, S. Tanaka, K. Itakura, and R. E. Dickerson. Structure of a B-DNA dodecamer: conformation and dynamics. Proc. Natl. Acad. Sci. U.S.A., 78(4):2179–2183, Apr 1981. [PubMed Central:PMC319307] [PubMed:6941276].
[22] Schrödinger, LLC. The PyMOL molecular graphics system, version 1.8. November 2015.
[23] Y. Timsit and D. Moras. Crystallization of DNA. Meth. Enzymol., 211:409–429, 1992. [PubMed:1406318].
[24] G. Giannoni, F. J. Padden, and H. D. Keith. Crystallization of DNA from dilute solution. Proc. Natl. Acad. Sci. U.S.A., 62(3):964–971, Mar 1969. [PubMed Central:PMC223693] [PubMed:4895221].
[25] R. Rohs, H. Sklenar, and Z. Shakked. Structural and energetic origins of sequence-specific DNA bending: Monte Carlo simulations of papillomavirus E2-DNA binding sites. Structure, 13(10):1499–1509, Oct 2005. [DOI:10.1016/j.str.2005.07.005] [PubMed:16216581].
[26] C. H. Mak. Loops MC: an all-atom Monte Carlo simulation program for RNAs based on inverse kinematic loop closure. Mol Simul, 37, 2011. doi: 10.1080/08927022.2011.565761.
URL https://doi.org/10.1080/08927022.2011.565761.
[27] R. Lavery and H. Sklenar. Defining the structure of irregular nucleic acids: conventions and principles. J. Biomol. Struct. Dyn., 6(4):655–667, Feb 1989. [DOI:10.1080/07391102.1989.10507728] [PubMed:2619933].
[28] H. Sklenar, D. Wustner, and R. Rohs. Using internal and collective variables in Monte Carlo simulations of nucleic acid structures: chain breakage/closure algorithm and associated Jacobians. J Comput Chem, 27(3):309–315, Feb 2006. [DOI:10.1002/jcc.20345] [PubMed:16355439].
[29] A. Perez, F. J. Luque, and M. Orozco. Dynamics of B-DNA on the microsecond time scale. J. Am. Chem. Soc., 129(47):14739–14745, Nov 2007. [DOI:10.1021/ja0753546] [PubMed:17985896].
[30] A. A. Travers. The structural basis of DNA flexibility. Philos Trans A Math Phys Eng Sci, 362(1820):1423–1438, Jul 2004. [DOI:10.1098/rsta.2004.1390] [PubMed:15306459].
[31] T. E. Cheatham and M. A. Young. Molecular dynamics simulation of nucleic acids: successes, limitations, and promise. Biopolymers, 56(4):232–256, 2000. [DOI:3.0.CO;2- H] [PubMed:11754338].
[32] T. Zhou, L. Yang, Y. Lu, I. Dror, A. C. Dantas Machado, T. Ghane, R. Di Felice, and R. Rohs. DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale. Nucleic Acids Res., 41(Web Server issue):56–62, Jul 2013. [PubMed Central:PMC3692085] [DOI:10.1093/nar/gkt437] [PubMed:23703209].
[33] F. Crick. Central dogma of molecular biology. Nature, 227(5258):561–563, Aug 1970. [PubMed:4913914].
[34] D. Shlyueva, G. Stampfel, and A. Stark. Transcriptional enhancers: from properties to genome-wide predictions. Nat. Rev. Genet., 15(4):272–286, Apr 2014. [DOI:10.1038/nrg3682] [PubMed:24614317].
[35] Y. Zhang, T. Liu, C. A. Meyer, J. Eeckhoute, D. S. Johnson, B. E. Bernstein, C. Nusbaum, R. M. Myers, M. Brown, W. Li, and X. S. Liu. Model-based analysis of ChIP-Seq (MACS). Genome Biol., 9(9):R137, 2008.
[PubMed Central:PMC2592715] [DOI:10.1186/gb-2008-9-9-r137] [PubMed:18798982].
[36] J. Feng, T. Liu, B. Qin, Y. Zhang, and X. S. Liu. Identifying ChIP-seq enrichment using MACS. Nat Protoc, 7(9):1728–1740, Sep 2012. [PubMed Central:PMC3868217] [DOI:10.1038/nprot.2012.101] [PubMed:22936215].
[37] J. Feng, T. Liu, and Y. Zhang. Using MACS to identify peaks from ChIP-Seq data. Curr Protoc Bioinformatics, Chapter 2:Unit 2.14, Jun 2011. [PubMed Central:PMC3120977] [DOI:10.1002/0471250953.bi0214s34] [PubMed:21633945].
[38] J. Wang, J. Zhuang, S. Iyer, X. Y. Lin, M. C. Greven, B. H. Kim, J. Moore, B. G. Pierce, X. Dong, D. Virgil, E. Birney, J. H. Hung, and Z. Weng. Factorbook.org: a Wiki-based database for transcription factor-binding data generated by the ENCODE consortium. Nucleic Acids Res., 41(Database issue):D171–176, Jan 2013. [PubMed Central:PMC3531197] [DOI:10.1093/nar/gks1221] [PubMed:23203885].
[39] J. Wang, A. Malecka, G. Trøen, and J. Delabie. Comprehensive genome-wide transcription factor analysis reveals that a combination of high affinity and low affinity DNA binding is needed for human gene regulation. BMC Genomics, 16 Suppl 7:S12, 2015. [PubMed Central:PMC4474539] [DOI:10.1186/1471-2164-16-S7-S12] [PubMed:26099425].
[40] S. J. Clark, H. J. Lee, S. A. Smallwood, G. Kelsey, and W. Reik. Single-cell epigenomics: powerful new methods for understanding gene regulation and cell identity. Genome Biol., 17:72, Apr 2016. [PubMed Central:PMC4834828] [DOI:10.1186/s13059-016-0944-x] [PubMed:27091476].
[41] A. Rotem, O. Ram, N. Shoresh, R. A. Sperling, A. Goren, D. A. Weitz, and B. E. Bernstein. Single-cell ChIP-seq reveals cell subpopulations defined by chromatin state. Nat. Biotechnol., 33(11):1165–1172, Nov 2015. [PubMed Central:PMC4636926] [DOI:10.1038/nbt.3383] [PubMed:26458175].
[42] M. Nettling, H. Treutler, J. Cerquides, and I. Grosse. Detecting and correcting the binding-affinity bias in ChIP-seq data using inter-species information.
BMC Genomics, 17:347, 05 2016. [PubMed Central:PMC4862171] [DOI:10.1186/s12864-016-2682-6] [PubMed:27165633].
[43] J. Crocker, N. Abe, L. Rinaldi, A. P. McGregor, N. Frankel, S. Wang, A. Alsawadi, P. Valenti, S. Plaza, F. Payre, R. S. Mann, and D. L. Stern. Low affinity binding site clusters confer Hox specificity and regulatory robustness. Cell, 160(1-2):191–203, Jan 2015. [PubMed Central:PMC4449256] [DOI:10.1016/j.cell.2014.11.041] [PubMed:25557079].
[44] C. O. Pabo and R. T. Sauer. Protein-DNA recognition. Annu. Rev. Biochem., 53:293–321, 1984. [DOI:10.1146/annurev.bi.53.070184.001453] [PubMed:6236744].
[45] Y. Liu, X. Zhang, R. M. Blumenthal, and X. Cheng. A common mode of recognition for methylated CpG. Trends Biochem. Sci., 38(4):177–183, Apr 2013. [PubMed Central:PMC3608759] [DOI:10.1016/j.tibs.2012.12.005] [PubMed:23352388].
[46] A. C. Dantas Machado, T. Zhou, S. Rao, P. Goel, C. Rastogi, A. Lazarovici, H. J. Bussemaker, and R. Rohs. Evolving insights on how cytosine methylation affects protein-DNA binding. Brief Funct Genomics, 14(1):61–73, Jan 2015. [PubMed Central:PMC4303714] [DOI:10.1093/bfgp/elu040] [PubMed:25319759].
[47] J. F. Kribelbauer, O. Laptenko, S. Chen, G. D. Martini, W. A. Freed-Pastor, C. Prives, R. S. Mann, and H. J. Bussemaker. Quantitative Analysis of the DNA Methylation Sensitivity of Transcription Factor Complexes. Cell Rep, 19(11):2383–2395, Jun 2017. [PubMed Central:PMC5533174] [DOI:10.1016/j.celrep.2017.05.069] [PubMed:28614722].
[48] R. Yuan and M. Meselson. A specific complex between a restriction endonuclease and its DNA substrate. Proc. Natl. Acad. Sci. U.S.A., 65(2):357–362, Feb 1970. [PubMed Central:PMC282910] [PubMed:4984237].
[49] R. J. Roberts. Restriction and modification enzymes and their recognition sequences. Gene, 4(3):183–194, Nov 1978. [PubMed:369952].
[50] H. O. Smith. Nucleotide sequence specificity of restriction endonucleases. Science, 205(4405):455–462, Aug 1979. [PubMed:377492].
[51] Y. Yin, E.
Morgunova, A. Jolma, E. Kaasinen, B. Sahu, S. Khund-Sayeed, P. K. Das, T. Kivioja, K. Dave, F. Zhong, K. R. Nitta, M. Taipale, A. Popov, P. A. Ginno, S. Domcke, J. Yan, D. Schubeler, C. Vinson, and J. Taipale. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science, 356(6337), 05 2017. [DOI:10.1126/science.aaj2239] [PubMed:28473536].
[52] E. F. Fisher and M. H. Caruthers. Studies on gene control regions XII. The functional significance of a lac operator constitutive mutation. Nucleic Acids Res., 7(2):401–416, Sep 1979. [PubMed Central:PMC328025] [PubMed:386283].
[53] A. Razin and A. D. Riggs. DNA methylation and gene function. Science, 210(4470):604–610, Nov 1980. [PubMed:6254144].
[54] M. McClelland and M. Nelson. The effect of site-specific DNA methylation on restriction endonucleases and DNA modification methyltransferases–a review. Gene, 74(1):291–304, Dec 1988. [PubMed:2854811].
[55] G. D. Stormo, Z. Zuo, and Y. K. Chang. Spec-seq: determining protein-DNA-binding specificity by sequencing. Brief Funct Genomics, 14(1):30–38, Jan 2015. [PubMed Central:PMC4366588] [DOI:10.1093/bfgp/elu043] [PubMed:25362070].
[56] B. C. Foat, A. V. Morozov, and H. J. Bussemaker. Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics, 22(14):e141–149, Jul 2006. [DOI:10.1093/bioinformatics/btl223] [PubMed:16873464].
[57] C. Chai, Z. Xie, and E. Grotewold. SELEX (Systematic Evolution of Ligands by EXponential Enrichment), as a powerful tool for deciphering the protein-DNA interaction space. Methods Mol. Biol., 754:249–258, 2011. [DOI:10.1007/978-1-61779-154-3_14] [PubMed:21720957].
[58] A. R. Oliphant, C. J. Brandl, and K. Struhl. Defining the sequence specificity of DNA-binding proteins by selecting binding sites from random-sequence oligonucleotides: analysis of yeast GCN4 protein. Mol. Cell. Biol., 9(7):2944–2949, Jul 1989.
[PubMed Central:PMC362762] [PubMed:2674675].
[59] M. Slattery, T. Riley, P. Liu, N. Abe, P. Gomez-Alcala, I. Dror, T. Zhou, R. Rohs, B. Honig, H. J. Bussemaker, and R. S. Mann. Cofactor binding evokes latent differences in DNA binding specificity between Hox proteins. Cell, 147(6):1270–1282, Dec 2011. [PubMed Central:PMC3319069] [DOI:10.1016/j.cell.2011.10.053] [PubMed:22153072].
[60] M. L. Bulyk. Analysis of sequence specificities of DNA-binding proteins with protein binding microarrays. Meth. Enzymol., 410:279–299, 2006. [PubMed Central:PMC2747587] [DOI:10.1016/S0076-6879(06)10013-0] [PubMed:16938556].
[61] M. L. Bulyk. Protein binding microarrays for the characterization of DNA-protein interactions. Adv. Biochem. Eng. Biotechnol., 104:65–85, 2007. [PubMed Central:PMC2727742] [PubMed:17290819].
[62] Garry D Stormo. Introduction to Protein-DNA Interactions: Structure, Thermodynamics, and Bioinformatics. Cold Spring Harbor Laboratory Press, February 2013.
[63] C. Rastogi, H. T. Rube, J. F. Kribelbauer, J. Crocker, R. E. Loker, G. D. Martini, O. Laptenko, W. A. Freed-Pastor, C. Prives, D. L. Stern, R. S. Mann, and H. J. Bussemaker. Accurate and sensitive quantification of protein-DNA binding affinity. Proc. Natl. Acad. Sci. U.S.A., 115(16):E3692–E3701, Apr 2018. [PubMed Central:PMC5910815] [DOI:10.1073/pnas.1714376115] [PubMed:29610332].
[64] T. P. Chiu, S. Rao, R. S. Mann, B. Honig, and R. Rohs. Genome-wide prediction of minor-groove electrostatic potential enables biophysical modeling of protein-DNA binding. Nucleic Acids Res., 45(21):12565–12576, Dec 2017. [PubMed Central:PMC5716191] [DOI:10.1093/nar/gkx915] [PubMed:29040720].
[65] A. Mishra, S. Rao, A. Mittal, and B. Jayaram. Capturing native/native like structures with a physico-chemical metric (pcSM) in protein folding. Biochim. Biophys. Acta, 1834(8):1520–1531, Aug 2013. [DOI:10.1016/j.bbapap.2013.04.023] [PubMed:23665455].
[66] Z. Zuo, Y. Chang, and G. D. Stormo.
A quantitative understanding of lac repressor’s binding specificity and flexibility. Quant Biol, 3(2):69–80, Jun 2015. [PubMed Central:PMC4704127] [DOI:10.1007/s40484-015-0044-z] [PubMed:26752632].
[67] P. V. Benos, M. L. Bulyk, and G. D. Stormo. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res., 30(20):4442–4451, Oct 2002. [PubMed Central:PMC137142] [PubMed:12384591].
[68] G. D. Stormo. Computer methods for analyzing sequence recognition of nucleic acids. Annu Rev Biophys Biophys Chem, 17:241–263, 1988. [DOI:10.1146/annurev.bb.17.060188.001325] [PubMed:3293587].
[69] N. Abe, I. Dror, L. Yang, M. Slattery, T. Zhou, H. J. Bussemaker, R. Rohs, and R. S. Mann. Deconvolving the recognition of DNA shape from sequence. Cell, 161(2):307–318, Apr 2015. [PubMed Central:PMC4422406] [DOI:10.1016/j.cell.2015.02.008] [PubMed:25843630].
[70] A. Jolma, J. Yan, T. Whitington, J. Toivonen, K. R. Nitta, P. Rastas, E. Morgunova, M. Enge, M. Taipale, G. Wei, K. Palin, J. M. Vaquerizas, R. Vincentelli, N. M. Luscombe, T. R. Hughes, P. Lemaire, E. Ukkonen, T. Kivioja, and J. Taipale. DNA-binding specificities of human transcription factors. Cell, 152(1-2):327–339, Jan 2013. [DOI:10.1016/j.cell.2012.12.009] [PubMed:23332764].
[71] A. T. Spivak and G. D. Stormo. ScerTF: a comprehensive database of benchmarked position weight matrices for Saccharomyces species. Nucleic Acids Res., 40(Database issue):D162–168, Jan 2012. [PubMed Central:PMC3245033] [DOI:10.1093/nar/gkr1180] [PubMed:22140105].
[72] A. Mathelier, O. Fornes, D. J. Arenillas, C. Y. Chen, G. Denay, J. Lee, W. Shi, C. Shyr, G. Tan, R. Worsley-Hunt, A. W. Zhang, F. Parcy, B. Lenhard, A. Sandelin, and W. W. Wasserman. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res., 44(D1):D110–115, Jan 2016. [PubMed Central:PMC4702842] [DOI:10.1093/nar/gkv1176] [PubMed:26531826].
[73] H. M. Berman, J.
Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Res., 28(1):235–242, Jan 2000. [PubMed Central:PMC102472] [PubMed:10592235].
[74] L. Yang, T. Zhou, I. Dror, A. Mathelier, W. W. Wasserman, R. Gordan, and R. Rohs. TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res., 42(Database issue):D148–155, Jan 2014. [PubMed Central:PMC3964943] [DOI:10.1093/nar/gkt1087] [PubMed:24214955].
[75] T. P. Chiu, L. Yang, T. Zhou, B. J. Main, S. C. Parker, S. V. Nuzhdin, T. D. Tullius, and R. Rohs. GBshape: a genome browser database for DNA shape annotations. Nucleic Acids Res., 43(Database issue):D103–109, Jan 2015. [PubMed Central:PMC4384032] [DOI:10.1093/nar/gku977] [PubMed:25326329].
[76] P. Shannon and M. Richards. MotifDb: an annotated collection of protein–DNA binding sequence motifs. R package, version 1.20.0, 2017.
[77] A. Lazarovici, T. Zhou, A. Shafer, A. C. Dantas Machado, T. R. Riley, R. Sandstrom, P. J. Sabo, Y. Lu, R. Rohs, J. A. Stamatoyannopoulos, and H. J. Bussemaker. Probing DNA shape and methylation state on a genomic scale with DNase I. Proc. Natl. Acad. Sci. U.S.A., 110(16):6376–6381, Apr 2013. [PubMed Central:PMC3631675] [DOI:10.1073/pnas.1216822110] [PubMed:23576721].
[78] X. Zhang, A. C. Dantas Machado, Y. Ding, Y. Chen, Y. Lu, Y. Duan, K. W. Tham, L. Chen, R. Rohs, and P. Z. Qin. Conformations of p53 response elements in solution deduced using site-directed spin labeling and Monte Carlo sampling. Nucleic Acids Res., 42(4):2789–2797, Feb 2014. [PubMed Central:PMC3936745] [DOI:10.1093/nar/gkt1219] [PubMed:24293651].
[79] R. Lavery, K. Zakrzewska, and H. Sklenar. JUMNA (junction minimisation of nucleic acids). Comput Phys Commun, 91, 1995. doi: 10.1016/0010-4655(95)00046-I. URL https://doi.org/10.1016/0010-4655(95)00046-I.
[80] R. Aduri, B. T. Psciuk, P. Saro, H. Taniga, H. B. Schlegel, and J. SantaLucia.
AMBER Force Field Parameters for the Naturally Occurring Modified Nucleosides in RNA. J Chem Theory Comput, 3(4):1464–1475, Jul 2007. [DOI:10.1021/ct600329w] [PubMed:26633217].
[81] S. Hu, J. Wan, Y. Su, Q. Song, Y. Zeng, H. N. Nguyen, J. Shin, E. Cox, H. S. Rho, C. Woodard, S. Xia, S. Liu, H. Lyu, G. L. Ming, H. Wade, H. Song, J. Qian, and H. Zhu. DNA methylation presents distinct binding sites for human transcription factors. Elife, 2:e00726, Sep 2013. [PubMed Central:PMC3762332] [DOI:10.7554/eLife.00726] [PubMed:24015356].
[82] T. P. Chiu, F. Comoglio, T. Zhou, L. Yang, R. Paro, and R. Rohs. DNAshapeR: an R/Bioconductor package for DNA shape prediction and feature encoding. Bioinformatics, 32(8):1211–1213, 04 2016. [PubMed Central:PMC4824130] [DOI:10.1093/bioinformatics/btv735] [PubMed:26668005].
[83] A. Perez, C. L. Castellazzi, F. Battistini, K. Collinet, O. Flores, O. Deniz, M. L. Ruiz, D. Torrents, R. Eritja, M. Soler-Lopez, and M. Orozco. Impact of methylation on the physical properties of DNA. Biophys. J., 102(9):2140–2148, May 2012. [PubMed Central:PMC3341543] [DOI:10.1016/j.bpj.2012.03.056] [PubMed:22824278].
[84] J. Hizver, H. Rozenberg, F. Frolow, D. Rabinovich, and Z. Shakked. DNA bending by an adenine–thymine tract and its role in gene regulation. Proc. Natl. Acad. Sci. U.S.A., 98(15):8490–8495, Jul 2001. [PubMed Central:PMC37463] [DOI:10.1073/pnas.151247298] [PubMed:11438706].
[85] J. R. Hesselberth, X. Chen, Z. Zhang, P. J. Sabo, R. Sandstrom, A. P. Reynolds, R. E. Thurman, S. Neph, M. S. Kuehn, W. S. Noble, S. Fields, and J. A. Stamatoyannopoulos. Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nat. Methods, 6(4):283–289, Apr 2009. [PubMed Central:PMC2668528] [DOI:10.1038/nmeth.1313] [PubMed:19305407].
[86] G. P. Lomonossoff, P. J. Butler, and A. Klug. Sequence-dependent variation in the conformation of DNA. J. Mol. Biol., 149(4):745–760, Jul 1981. [PubMed:6273590].
[87] H. R. Drew and A. A.
Travers. DNA structural variations in the E. coli tyrT promoter. Cell, 37(2):491–502, Jun 1984. [PubMed:6327070].
[88] A. Kundaje, W. Meuleman, J. Ernst, M. Bilenky, A. Yen, A. Heravi-Moussavi, P. Kheradpour, Z. Zhang, J. Wang, M. J. Ziller, V. Amin, et al. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317–330, Feb 2015. [PubMed Central:PMC4530010] [DOI:10.1038/nature14248] [PubMed:25693563].
[89] A. Lahm and D. Suck. DNase I-induced DNA conformation. 2 Å structure of a DNase I-octamer complex. J. Mol. Biol., 222(3):645–667, Dec 1991. [PubMed:1748997].
[90] S. A. Weston, A. Lahm, and D. Suck. X-ray structure of the DNase I-d(GGTATACC)2 complex at 2.3 Å resolution. J. Mol. Biol., 226(4):1237–1256, Aug 1992.
[91] I. Brukner, R. Sanchez, D. Suck, and S. Pongor. Sequence-dependent bending propensity of DNA as revealed by DNase I: parameters for trinucleotides. EMBO J., 14(8):1812–1818, Apr 1995. [PubMed Central:PMC398274] [PubMed:7737131].
[92] Robert C. Harris, Travis Mackoy, Ana Carolina Dantas Machado, Darui Xu, Remo Rohs, and Marcia Oliveira Fenley. Chapter 3 opposites attract: Shape and electrostatic complementarity in protein-DNA complexes. 2:53–80, 2012. doi: 10.1039/9781849735056-00053. URL http://dx.doi.org/10.1039/9781849735056-00053.
[93] D. Suck. DNA recognition by DNase I. J. Mol. Recognit., 7(2):65–70, Jun 1994. [DOI:10.1002/jmr.300070203] [PubMed:7826675].
[94] R. Joshi, J. M. Passner, R. Rohs, R. Jain, A. Sosinsky, M. A. Crickmore, V. Jacob, A. K. Aggarwal, B. Honig, and R. S. Mann. Functional specificity of a Hox protein mediated by the recognition of minor groove structure. Cell, 131(3):530–543, Nov 2007. [PubMed Central:PMC2709780] [DOI:10.1016/j.cell.2007.09.024] [PubMed:17981120].
[95] Thomas Porter and Tom Duff. Compositing digital images. SIGGRAPH Comput. Graph., 18(3):253–259, January 1984. ISSN 0097-8930. doi: 10.1145/964965.808606. URL http://doi.acm.org/10.1145/964965.808606.
[96] D. R. Kelley, J. Snoek, and J.
L. Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res., 26(7):990–999, 07 2016. [PubMed Central:PMC4937568] [DOI:10.1101/gr.200535.115] [PubMed:27197224].
[97] J. Zhou and O. G. Troyanskaya. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods, 12(10):931–934, Oct 2015. [PubMed Central:PMC4768299] [DOI:10.1038/nmeth.3547] [PubMed:26301843].
[98] D. D. Le, T. C. Shimko, A. K. Aditham, A. M. Keys, S. A. Longwell, Y. Orenstein, and P. M. Fordyce. Comprehensive, high-resolution binding energy landscapes reveal context dependencies of transcription factor binding. Proc. Natl. Acad. Sci. U.S.A., 115(16):E3702–E3711, Apr 2018. [PubMed Central:PMC5910820] [DOI:10.1073/pnas.1715888115] [PubMed:29588420].
[99] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. CoRR, abs/1704.02685, 2017. URL http://arxiv.org/abs/1704.02685.
[100] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013. URL http://arxiv.org/abs/1312.6034.
[101] C. Angermueller, T. Parnamaa, L. Parts, and O. Stegle. Deep learning for computational biology. Mol. Syst. Biol., 12(7):878, 07 2016. [PubMed Central:PMC4965871] [PubMed:27474269].
[102] C. Angermueller, H. J. Lee, W. Reik, and O. Stegle. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol., 18(1):67, Apr 2017. [PubMed Central:PMC5387360] [DOI:10.1186/s13059-017-1189-z] [PubMed:28395661].
[103] Q. Song, B. Decato, E. E. Hong, M. Zhou, F. Fang, J. Qu, T. Garvin, M. Kessler, J. Zhou, and A. D. Smith. A reference methylome database and analysis pipeline to facilitate integrative and comparative epigenomics. PLoS ONE, 8(12):e81148, 2013.
[PubMed Central:PMC3855694] [DOI:10.1371/journal.pone.0081148] [PubMed:24324667].
[104] J. Li, J. M. Sagendorf, T. P. Chiu, M. Pasi, A. Perez, and R. Rohs. Expanding the repertoire of DNA shape features for genome-scale studies of transcription factor binding. Nucleic Acids Res., 45(22):12877–12887, Dec 2017. [PubMed Central:PMC5728407] [DOI:10.1093/nar/gkx1145] [PubMed:29165643].
[105] L. Yang, Y. Orenstein, A. Jolma, Y. Yin, J. Taipale, R. Shamir, and R. Rohs. Transcription factor family-specific DNA shape readout revealed by quantitative specificity models. Mol. Syst. Biol., 13(2):910, 02 2017. [PubMed Central:PMC5327724] [PubMed:28167566].
[106] E. Wingender, X. Chen, R. Hehl, H. Karas, I. Liebich, V. Matys, T. Meinhardt, M. Pruss, I. Reuter, and F. Schacherer. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res., 28(1):316–319, Jan 2000. [PubMed Central:PMC102445] [PubMed:10592259].
[107] Wendy D Cornell, Piotr Cieplak, Christopher I Bayly, Ian R Gould, Kenneth M Merz, David M Ferguson, David C Spellmeyer, Thomas Fox, James W Caldwell, and Peter A Kollman. A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. Journal of the American Chemical Society, 117(19):5179–5197, 1995.

Appendix A
Supplementary material for chapter 2

A.1 Structures from Protein Data Bank (PDB)
To generate count statistics, we used an advanced search interface to query the PDB [73] for occurrences of methylated cytosine. Fig. 1a presents the numbers of structures retrieved from the PDB on 31 May 2017 (as a snapshot in time). Counts are expected to evolve over time as new structures are added to the database. Numbers can be updated by running the Python script QueryPDBCounts.py (available here).

A.2 PDB IDs of methylated DNA structures
Our analysis revealed a very small subset of structures with methylated CpG dinucleotide step(s).
PDB IDs of structures containing methylated CpG step(s) are the following: 1IG4, 1IH3, 1R3Z, 265D, 270D, 2KY8, 2MOE, 329D, 3C2I, 3VXX, 4C63, 4F6N, 4GJP, 4GLG, 4HP1, 4LG7, 4M9E, 4MKW, 4R2A, and 4LT5.

A.3 Count statistics of transcription factor (TF) binding motifs containing CpG step(s)
Counts of binding motifs containing CpG step(s) for TF families were retrieved from MotifDb [76], an R package comprising TFBS databases, such as HT-SELEX sequences from Jolma et al. [70], the expanded HT-SELEX dataset published in [105], JASPAR_CORE [72], TRANSFAC [106], and others.

A.4 Validation of methyl-DNAshape using experimentally determined structures
Additional filtering by experimental procedure (i.e., X-ray structures only) and by methylation status (i.e., fully methylated rather than hemi-methylated CpG steps) reduced the 20 structures with methylated CpG step(s) (see PDB IDs in section A.2) to only 10 structures. Such a low count is not sufficient for a comparative analysis or validation, especially given other influences such as crystal-packing artifacts. Nevertheless, we visualized the MGW profiles of some of these structures (see Figure A.1), and this limited comparison indicated agreement between the X-ray crystallography and methyl-DNAshape results. The limited experimental data do not provide validation, but rather emphasize the need for a computational method to fill the current gap in structural information on methylated DNA.

Figure A.1 a-d MGWs for DNA sequences (x-axis labels) of four structures (X-ray based; PDB IDs - 4LG7, 3C2I, 4M9E, and 265D) were predicted with methyl-DNAshape (blue points; this work) or calculated with CURVES (orange points; [27]). Underlined subsequence is expanded in every plot, presenting point-to-point correspondence between methyl-DNAshape predictions and CURVES-derived values of MGW.
Pearson correlation coefficients (PCCs) between methyl-DNAshape and X-ray-based values and their corresponding P-values are included in each panel. Panels are shown in the order of significance (P-value).

A.5 Types and counts of sequences considered for Monte Carlo (MC) simulations
An ensemble of sequences was selected to represent pentamers with methylated cytosine(s) in the CpG context. Additional file 7 (sequence_pool.xlsx) contains selected sequences/fragments that were considered for MC simulations. Table A.2 summarizes the counts for different types of fragments.

A.6 All-atom Monte Carlo simulations
MC simulations utilize a random sampling method to probe the search space while considering all of the atoms. This method treats most bond angles and all bond lengths as constants, which results in a substantial reduction in the number of degrees of freedom [28]. Variables considered in the MC simulations are summarized in Table A.1. MC simulations were performed by using an implicit solvent with sigmoidal distance-dependent dielectric function, explicit sodium counter ions, and associated Jacobians [25].

Table A.1 Variables considered in MC simulations
Type of variable | Count | Description
Collective | 6 | 3 rigid-body rotations; 3 rigid-body translations of nucleotides
Internal | 6 (7 for T or m) | Glycosidic torsion angle; two endocyclic torsions; bond angle; sugar phase; amplitude; (methyl group rotation for T or m)

Total system energy was calculated by using the AMBER force field [107]. The force field for 5mC differed from that of the cytosine parameters due to the added methyl group. We used partial charges derived for 5mC from the database of AMBER force fields for naturally occurring modified nucleotides [80].
MC simulations started from a seed structure (in this case, a canonical B-DNA structure of a sequence generated with standard structural parameters using JUMNA [79]) as input and ran for 2 million MC cycles. Trajectory snapshots were stored every tenth cycle. The first half-million MC cycles were discarded as the equilibration period. Following equilibration, a total of 150,000 snapshots representing 1.5 million MC cycles were stored in the trajectory file of each individual simulation. The trajectory analysis program traversed through all of these snapshots and recorded the average shape parameter values derived by CURVES [27]. The program also generated an average MC structure as a representation of the sequence.

Table A.2 Types of DNA fragments and their counts. Summary of types of sequences considered for all-atom Monte Carlo (MC) simulations. Most sequences were designed to cover different flanking sequences. Sequences between “-” symbols in column 1 are “core sequences”. Other sequences are regarded as flanks. “N” in designed sequences represents general DNA alphabet letters A, C, G, T. Methylated cytosine (“m”) and subsequent guanine (“g”) bases are underlined.
Fragments | Number of MC simulations | Selected from
Human HOXA9 binding sequences | 3 | [81]
Human HOXA5 binding sequences | 84 | [81]
CGNN-5mer-NNCG | 1054 | Designed
CGNN-NNNNmg-NNCG | 1298 | Designed
CGCG-NNmgNN-CGCG | 253 | Designed
CGCG-NNNmgN-CGCG | 256 | Designed
CGNN-mgNNNmg-NNCG | 496 | Designed
CGNN-Poly[A/T]4mg-NNCG | 74 | Designed
Total | 3518 |

A.7 Sequence composition of mPQT
Introduction of the letters “m” for 5-methylcytosine (5mC) and “g” for guanine base-paired to 5mC resulted in a total of 1,974 pentamers. As we considered DNA shape features of a pentamer on the forward strand, this experimental design also covered features of its reverse complement. Thus, the total number of entries in our methylated Pentamer Query Table (mPQT) was half of the total count (987).
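As a cross-check on the counts of 1,974 and 987, the pentamer enumeration can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in this work. It assumes that “m” must be followed by “g” unless it is the last letter of the pentamer, that “g” must be preceded by “m” unless it is the first letter, and that “m” and “g” are treated as complements of each other. The function names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGTmg"
# Assumption: m (5mC) pairs with g (the G of the methylated CpG step).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "m": "g", "g": "m"}


def is_valid(pentamer):
    """Check that every internal 'm' is followed by 'g' and every internal 'g' follows 'm'."""
    for i, base in enumerate(pentamer):
        if base == "m" and i + 1 < len(pentamer) and pentamer[i + 1] != "g":
            return False
        if base == "g" and i > 0 and pentamer[i - 1] != "m":
            return False
    return True


def reverse_complement(pentamer):
    """Reverse complement over the extended alphabet {A, C, G, T, m, g}."""
    return "".join(COMPLEMENT[base] for base in reversed(pentamer))


# All pentamers over the extended alphabet that satisfy the "mg" constraint.
valid = ["".join(p) for p in product(ALPHABET, repeat=5) if is_valid("".join(p))]

# Keep one representative per reverse-complement pair.
unique = {min(p, reverse_complement(p)) for p in valid}

# Split into canonical pentamers and pentamers that contain 'm' or 'g'.
canonical = {p for p in unique if set(p) <= set("ACGT")}
methylated = unique - canonical

print(len(valid), len(unique), len(canonical), len(methylated))
```

Under these assumptions the enumeration reproduces the totals quoted in this section, because no pentamer of odd length can be its own reverse complement and every pentamer therefore pairs with exactly one distinct partner.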
We refer to these 987 pentamers as the unique pentamers for the methyl-DNAshape method. Of these, 512 unique pentamers consisted of the nucleotides A, C, G, and T. The remaining 475 unique pentamers contained at least one of the two newly introduced letters, “m” and “g”. Table A.3 gives a detailed breakdown of the pentamers found in the mPQT.

Table A.3 Count breakdown of unique pentamers seen in mPQT
Column group 1: DNA alphabet Σ = {A, C, G, T}. Column groups 2 and 3: DNA alphabet with two new letters, Σ = {A, C, G, T, m, g}; m: 5mC; g: G of 5mCpG bp step.
Strand orientation | 5-mers | Count | 5-mers containing "mg" | Count | 5-mers beginning with 'g' or ending with 'm' | Count
Forward | NNANN | 256 | mgNNN | 64 | NNNNm | 256
Forward | NNCNN | 256 | NmgNN | 64 | mgNNm | 16
Forward | | | mgmgN | 4 | NmgNm | 16
Forward | | | mgNmg | 4 | NNmgm | 16
Forward | | | | | mgmgm | 1
Forward | | | | | *gNNNm | 64
Forward | | | | | gmgNm | 4
Reverse | NNTNN | 256 | NNNmg | 64 | gNNNN | 256
Reverse | NNGNN | 256 | NNmgN | 64 | gNNmg | 16
Reverse | | | Nmgmg | 4 | gNmgN | 16
Reverse | | | | | gmgNN | 16
Reverse | | | | | gmgmg | 1
Reverse | | | | | gNmgm | 4
Total | | 1024 | | 268 | | 682
Strand-specific total | | 512 | | 134 | | 341
Total count of pentamers used in the table = 512 + 134 + 341 = 987
*only counted in forward strand

A.8 Pentamers used in scatter plot analysis
To understand the influence of a single methylation event on DNA shape features, we considered pentamers with only a single CpG/mpg bp step (Fig. 3). With this constraint, a total of 116 pentamers were selected (see the derivation in Eqs. A.1 to A.5).
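The count of 116 can also be checked by brute-force enumeration. The sketch below is only an illustration under the assumption that a pentamer and its reverse complement are counted once; the helper names are hypothetical and not part of the methyl-DNAshape code.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of an unmethylated DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def count_cpg_steps(seq):
    """Number of CpG dinucleotide steps (5'-CG-3') in the sequence."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")


def unique_pentamers(min_steps, max_steps):
    """Strand-independent pentamers whose CpG step count lies in [min_steps, max_steps]."""
    selected = set()
    for letters in product("ACGT", repeat=5):
        pentamer = "".join(letters)
        if min_steps <= count_cpg_steps(pentamer) <= max_steps:
            # CG is its own reverse complement, so both strands share the same count.
            selected.add(min(pentamer, reverse_complement(pentamer)))
    return selected


at_least_one = unique_pentamers(1, 2)   # a pentamer holds at most two CpG steps
exactly_two = unique_pentamers(2, 2)
exactly_one = unique_pentamers(1, 1)

print(len(at_least_one), len(exactly_two), len(exactly_one))
```

The three set sizes correspond to the quantities in Eq. (A.3), the sum of Eqs. (A.4) and (A.5), and the final count of pentamers with exactly one CpG step.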
$$\#\{\text{pentamers of type } 5'\text{-CGNNN-}3'\} = 64 \quad (\text{covers } 5'\text{-NNNCG-}3') \tag{A.1}$$
$$\#\{\text{pentamers of type } 5'\text{-NCGNN-}3'\} = 64 \quad (\text{covers } 5'\text{-NNCGN-}3') \tag{A.2}$$
Symmetry occurs in (A.1) for pentamers of type CGNCG, which contribute only 2 pentamers (CGACG and CGCCG), and redundancy occurs in the count of pentamers of type CGCGN. This results in a total count of 122 pentamers containing at least one CpG step:
$$64 + 64 - 2\ (\text{symmetric}) - 4\ (\text{redundant}) = 122 \tag{A.3}$$
Count of pentamers containing exactly two CpG steps:
$$\#\{\text{pentamers of type } 5'\text{-CGCGN-}3'\} = 4 \tag{A.4}$$
$$\#\{\text{pentamers of type } 5'\text{-CGNCG-}3'\} = 2 \tag{A.5}$$
Hence, the total number of pentamers containing exactly one CpG step is 122 − 6 = 116.

A.9 Illustration of shape vector calculation
We illustrated graphically how the bp step feature values of inter-bp shape features were assigned at each nucleotide position. The following figure shows the pentamer table lookup process to evaluate the feature vector for Roll and MGW.

Figure A.2 Shape vector calculation. a Two Roll values, Roll1 and Roll2, were assigned to a given pentamer by using the query table for bp steps 2-3 and 3-4, respectively (illustrated at the top). The PQT lookup procedure is explained for calculation of the Roll feature vector for DNA sequence 5′-TTTGACT-3′ as an example. Retrieval of the Roll feature vector for this sequence queries the lookup table three times for listed pentamers in the table. Because the third query pentamer, 5′-TGACT-3′, finds its reverse complement 5′-AGTCA-3′ in the table, the search resulted in the reversal of Roll1 and Roll2 values of 5′-AGTCA-3′. The same process was adopted for the base-pair step feature HelT. b Illustration of MGW feature vector calculation. The process is simplified in this case because the search returns a single value at the central bp for a given pentamer. MGW values for two flanking nucleotides are undefined because values at these positions cannot be calculated as per definition of minor groove.
⟨·⟩: average; σ: standard deviation.

Appendix B
Supplementary material for chapter 3 & 4

B.1 DNase I cleavage data and statistical modeling
B.1.1 Data preprocessing
We used methylation status-dependent DNase I cleavage as a model system to validate our high-throughput method methyl-DNAshape. DNase I is an endonuclease that cleaves the phosphodiester backbone of DNA [77, 46]. In a genomic context, DNase I can be used to profile the accessible regions of chromatin in a process called “DNase I footprinting”. DNase-seq is a sequencing-based method that utilizes DNase I cleavage to identify open regions of chromatin in a high-throughput manner. We used DNase-seq data generated from DNase I treatment in the IMR90 human cell line (GEO accession number: GSM723024). Data analysis revealed a sequence context-dependent bias of the DNase I cleavage activity. In particular, the presence of a methylated CpG step immediately downstream of the cleavage site resulted in a strong bias. We categorized each cleaved site as high or low methylation status, depending on the degree of methylation of CpG step(s) in the neighboring sequence. We used DNA methylation data generated by whole-genome shotgun bisulfite sequencing in the same cell line (GEO Accession ID: GSM432687-92) to determine high or low methylation status. Analysis of the co-crystal structure of DNase I with DNA (PDB ID: 2DNJ) [89] revealed that positively charged arginine residues formed contacts in the minor groove immediately upstream of the cleavage site. A larger fraction of variation in cut rates was explained by the sequence context 3 bp up- or downstream of the cleaved site, leading to a hexamer model of sequence- and methylation-status-dependent DNase I cleavage, as revealed in our previous study [77]. Results of the genome-wide analysis of phosphodiester cleavage events were recorded in tabular format.
Based on the methylation level of the genomic region, five tables (tier 1 to tier 5, from lowest to highest level of methylation; available here) were generated. Each table contains 4,096 hexamers with multiple entries depending on the frequency of cleavage. For example, the first hexamer entry in the tier 1 table consists of three rows (see Table B.1). Information in the table can be summarized as follows: a total of 5,664 phosphates of type AAApAAA in the genome were cleaved once, and 7 were cleaved twice. Thus, the absolute phosphate cleavage count for AAApAAA equals 1 × 5,664 + 2 × 7 = 5,678.

Table B.1 Data preprocessing of DNase I cleavage data
Hexamer | Frequency | Count
AAApAAA | 0 | 13037815
AAApAAA | 1 | 5664
AAApAAA | 2 | 7

Following Lazarovici et al. [77], we normalized the absolute phosphate cleavage counts by the total counts of a given hexamer in the genome (Additional file 10: Table S5, column 4). Normalized values were further divided by the maximum relative phosphate cleavage rate (maximum value from column 4) to keep all values in the range [0, 1] (resulting in normalized values in column 5). These Scaled Ratio (SR; Table S5, column 5) values are cut rates relative to that of the most frequently cleaved hexamer (ACTpTAG). Absence of a CpG step in ACTpTAG leads to an unbiased comparison of SR values of unmethylated and methylated hexamers containing CpG step(s). SR values were converted into relative binding free energy (∆∆G) values by scaling to the negative log. The following equation represents the conversion process:
$$\text{RBFE}_{\text{hexamer}} \equiv \frac{\Delta\Delta G_{\text{hexamer}}}{RT} = -\log(\text{SR}_{\text{hexamer}})$$

B.1.2 Statistical modeling
To understand DNase I cleavage bias from a DNA shape perspective, we adopted a statistical modeling method, L1- and L2-regularized multiple linear regression, to refine our previously published shape-to-affinity model [46].
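The preprocessing steps of section B.1.1 (absolute cleavage count, normalization, scaling, and conversion to RBFE) can be sketched as follows. This is a minimal illustration rather than the pipeline used in this work. It assumes that the total count of a hexamer in the genome is the sum of the Count column over all cleavage frequencies and that the logarithm is the natural logarithm. The AAApAAA rows come from Table B.1, whereas the second entry, its values, and all function names are hypothetical placeholders.

```python
import math


def absolute_cleavage_count(rows):
    """Sum of frequency x count over the rows of one hexamer."""
    return sum(frequency * count for frequency, count in rows)


def genomic_occurrences(rows):
    """Assumed total number of sites of this hexamer in the genome (all frequencies)."""
    return sum(count for _, count in rows)


def scaled_ratios(table):
    """Relative cleavage rate of each hexamer divided by the maximum rate (range [0, 1])."""
    rates = {
        hexamer: absolute_cleavage_count(rows) / genomic_occurrences(rows)
        for hexamer, rows in table.items()
    }
    max_rate = max(rates.values())
    return {hexamer: rate / max_rate for hexamer, rate in rates.items()}


def rbfe(scaled_ratio):
    """Relative binding free energy in units of RT (natural log assumed)."""
    return -math.log(scaled_ratio)


tier1 = {
    # (frequency, count) rows taken from Table B.1
    "AAApAAA": [(0, 13037815), (1, 5664), (2, 7)],
    # Hypothetical toy entry with made-up counts, only to provide a maximum rate
    "toy_hexamer": [(0, 1000), (1, 40), (2, 5)],
}

aaa_count = absolute_cleavage_count(tier1["AAApAAA"])
sr = scaled_ratios(tier1)
energies = {hexamer: rbfe(value) for hexamer, value in sr.items()}
print(aaa_count, energies)
```

By construction, the hexamer with the highest relative cut rate has SR = 1 and RBFE = 0, and every other hexamer has a positive RBFE.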
To build the predictive model, we only used unmethylated hexamer data, namely DNA shape features as predictors and RBFE values as response variables. DNA shape features of unmethylated hexamers were predicted using DNAshape [32]. Predictions from DNAshape are unavailable in flanking regions (Additional file 6: Fig. S2). To assign values in these regions in an unbiased manner, we extended the sequence flanks by a general nucleotide “N” (with N ∈ {A, C, G, T}) to create a pentamer window with the bp of interest at the center. For the leftmost or rightmost bp, we extended the window by two Ns. For the second bp from either the left or right flank, we extended the window by a single N. DNAshape values obtained for all possible permutations of pentamers formed by N (4 for single N, 16 for NN) were averaged to assign a single value at each position of the flanking regions. Considering the very low count of observed cut events (Additional file 9: Table S4, column 2) relative to the number of available genomic positions (column 3), we concluded that the DNase I cleavage activity followed a Poisson process. To avoid uncertainties in counting, we used the following criterion:
$$\sigma \text{ or } S \le 0.2 \times \text{ObservedCount}_{\text{hexamer}} \tag{B.1}$$
where σ is the standard deviation and S is the sampling error. In a Poisson distribution, we can use the standard deviation as an estimate of the sampling error:
$$\sigma = \sqrt{\text{ObservedCount}_{\text{hexamer}}} \tag{B.2}$$
Substituting eq. (B.2) into eq. (B.1), we get:
$$\text{ObservedCount}_{\text{hexamer}} \ge 25 \tag{B.3}$$
With the above considerations, we included the 3,037 hexamers with an absolute phosphate cleavage count ≥ 25 in the training set for the model (see Data preprocessing for details). Because the model is linear, we can infer changes in RBFE (∆∆∆G) by using these counts and the methylation-induced changes in shape features (∆shape), as formalized in Eqs. (B.4) to (B.6).
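The count threshold of Eqs. (B.1) to (B.3) and the flank-averaging step described above can be sketched as follows. This is a hedged illustration, not the implementation used in this work. The argument `pentamer_value` is a hypothetical stand-in for a DNAshape pentamer lookup that returns the feature value at the central bp, and the toy lookup at the end exists only to make the example executable.

```python
import math
from itertools import product


def passes_count_threshold(observed_count, max_relative_error=0.2):
    """Poisson criterion: sigma = sqrt(count) must not exceed 20% of the count."""
    return observed_count > 0 and math.sqrt(observed_count) <= max_relative_error * observed_count


def flank_averaged_value(sequence, position, pentamer_value):
    """Feature value at `position`, averaged over all N completions of the pentamer window."""
    left_pad = max(0, 2 - position)                          # Ns needed on the 5' side
    right_pad = max(0, position + 2 - (len(sequence) - 1))   # Ns needed on the 3' side
    core = sequence[max(0, position - 2): position + 3]
    values = [
        pentamer_value("".join(left) + core + "".join(right))
        for left in product("ACGT", repeat=left_pad)
        for right in product("ACGT", repeat=right_pad)
    ]
    return sum(values) / len(values)


def toy_lookup(pentamer):
    """Hypothetical stand-in for a query-table lookup: counts the A's in the pentamer."""
    return float(pentamer.count("A"))


profile = [flank_averaged_value("AAAAAA", i, toy_lookup) for i in range(6)]
print(passes_count_threshold(25), passes_count_threshold(24), profile)
```

For a hexamer, the two outermost positions average over 16 pentamers each, the second and fifth positions average over 4 pentamers each, and the two central positions use a single pentamer.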
$$\Delta\Delta\hat{G}_{\text{methylated}} = W^{T} \cdot \text{Shape}_{\text{methylated}} + b \tag{B.4}$$
$$\Delta\Delta\hat{G}_{\text{unmethylated}} = W^{T} \cdot \text{Shape}_{\text{unmethylated}} + b \tag{B.5}$$
$$\Delta\Delta\Delta\hat{G} = \Delta\Delta\hat{G}_{\text{methylated}} - \Delta\Delta\hat{G}_{\text{unmethylated}} = W^{T} \cdot \Delta\text{shape} \tag{B.6}$$
For modeling, we used the widely adopted tool glmnet with hybrid regularization (both L1- and L2-regularization by setting alpha = 0.5). Vignettes for glmnet are available here.

B.2 Supplementary materials for Pbx-Hox data analyses and more
B.2.1 CpG context for unmethylated DNA
Both DNAshape [32] and methyl-DNAshape (this work) are pentamer sliding-window based DNA shape feature prediction methods. In addition to offering the shape feature prediction, methyl-DNAshape offers users the ability to predict methylation-induced shape changes (∆shape; Fig. 2). However, simply subtracting the DNAshape feature vector from the methyl-DNAshape feature vector may not result in the ∆shape originating solely from DNA methylation in all cases. For example, in the mPQT, the estimated MGW at the central ‘A’ for pentamer 5′-TGATm-3′ is 5.23 Å, calculated by averaging the MGW values of all pentamers of this type in the methylated sequence pool. However, this pentamer always had a “g” (guanine following the methylated cytosine indicated by “m”) at the sixth position flanking the pentamer on the 3′ side, due to the assumption of “mpg” dinucleotide steps in case of methylated cytosines. The unmethylated counterpart of this pentamer, 5′-TGATC-3′, with estimated MGW value 4.77 Å, is averaged over any nucleotide (N ∈ {A, C, G, T}) at the sixth position flanking the pentamer in the unmethylated sequence pool. Hence, ∆MGW (5.23 Å − 4.77 Å = 0.46 Å) confounds methylation and sequence effects because of the identity of the nucleotide at the sixth position. To address this technical subtlety, we compiled an additional table, called the CpG context table. We illustrated the use of this table to predict the ∆MGW (Additional file 6: Fig. S2).
Apart from existing MC simulation data used to build the PQT used in DNAshape, we ran additional MC simulations to enrich the count for such pentamers with the CG context in their flanks. With this new query table, we believe that we can look at the effect of methylation on shape feature values more closely (∆MGW from DNAshape = 0.46 Å vs. ∆MGW from the CpG context table = 0.22 Å).

B.2.2 Methylation sensitivities of Pbx-HoxA1 and A5
A methylation event in the spacer region had a common destabilizing effect on both Pbx-HoxA1 and Pbx-HoxA5 complexes binding to DNA. The figure below shows the data for both systems.

Figure B.1 Similar representation of relative binding affinities and change in relative binding free energies as shown in Figure 4.1. The boxes surrounded by a dashed rectangle represent methylation effects (destabilizing, ∆∆∆G > 0) at offset 6/7 in the 12 bp binding site.

A CpG dinucleotide in the spacer region (offset 6/7) was not selected in enriched sequences for Pbx-HoxA9.
$ grep -c "^......CG" \
    GSE98652_Pbx_HOXA9_hum_filtered_rel_affinity_table_LibU_vs_LibM.txt
0
$

B.2.3 T-test for IUPAC-based shape analysis for Pbx-Hox data
A two-tailed paired t-test was used to infer the significance of ∆MGW for hexamers and pentamers of types NNAYCG or NGAYCG and NNACG or NGACG, respectively (Fig. 6d). The latter hexamers or pentamers where T replaces N at the initial position, representing the most preferred binding site (TGAYCG, count = 2; TGACG, count = 1), were not included in the plot because there were too few possible instances to perform significance tests. Nevertheless, the ∆MGW for TGATCG is 0.22 Å, and for TGACCG it is 0.16 Å.

Figure B.2
Abstract
Protein-DNA interaction is one of the important pieces of the gene regulation puzzle. Since the discovery of DNA, many developments have led to the understanding that, although the four standard bases adenine (A), cytosine (C), guanine (G), and thymine (T) are the building blocks of DNA, chemical modifications of some of these bases play important roles in different aspects of organism development and disease. These chemical modifications alter the strength of protein (in general, transcription factor) binding to DNA in either a positive or a negative direction, which is reflected in the transcription level of the targeted genes.
To date, we know that proteins, in particular transcription factors (TFs), achieve DNA binding specificity through two well-known DNA readout mechanisms, base readout and shape readout. In base readout, a TF exploits the physico-chemical signatures of DNA bases and makes specific non-bonded contacts, for instance hydrogen bonds, whereas in shape readout it utilizes biophysical properties of DNA, for instance electrostatic potential (associated with minor-groove width), to strengthen the binding. At one end of the spectrum, protein binding to DNA is very specific (for example, the restriction endonuclease EcoRI cleaving GAATTC), whereas at the other end it is largely non-specific (DNase I and some transcription factors, such as the Hox family). This dissertation work attempts to understand protein-DNA interactions from physico-chemical and shape perspectives in the context of chemical modification, particularly CpG methylation.
Cytosine methylation, the most frequent modification, represents the addition of a methyl group at the major groove edge of the cytosine base. In mammalian genomes, cytosine methylation most frequently occurs at CpG dinucleotides and on both strands of DNA. In addition to changing the chemical signature of C/G base pairs, cytosine methylation can affect DNA structure. These chemical and structural effects can contribute to base and shape readout in either a positive (strengthened binding) or a negative direction. The dissertation work detailed here attempts to quantify the effect of CpG methylation on TF-DNA binding from DNA shape and physico-chemical perspectives.
Asset Metadata
Creator
Rao, Satyanarayan (author)
Core Title
Understanding protein–DNA recognition in the context of DNA methylation
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
11/12/2018
Defense Date
06/20/2018
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
CpG methylation,DNA methylation,DNA structure,DNase I cleavage sensitivity,epigenetics,human Hox protein binding specificity,methyl-DNAshape,OAI-PMH Harvest
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Rohs, Remo (committee chair), Nakano, Aiichiro (committee member), Smith, Andrew D. (committee member), Waterman, Michael S. (committee member)
Creator Email
satyanar@usc.edu,satyanarayan.iiitm@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-102673
Unique identifier
UC11675523
Identifier
etd-RaoSatyana-6961.pdf (filename),usctheses-c89-102673 (legacy record id)
Legacy Identifier
etd-RaoSatyana-6961.pdf
Dmrecord
102673
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Rao, Satyanarayan
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
CpG methylation
DNA methylation
DNA structure
DNase I cleavage sensitivity
epigenetics
human Hox protein binding specificity
methyl-DNAshape