Profiling transcription factor-DNA binding specificity
PROFILING TRANSCRIPTION FACTOR-DNA BINDING SPECIFICITY

by Lin Yang

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Computational Biology and Bioinformatics)

August 2016

© 2016 Lin Yang. All Rights Reserved

Abstract

In the genomics field, representation of the DNA is simplified as a string of letters A, C, G and T. This sequence-based representation of DNA has greatly facilitated the ongoing endeavor that aims to decode the human genome. On the other hand, DNA is a molecule that has a three-dimensional structure. Depending on the specific application, the over-simplification of DNA as a plain sequence of letters may result in loss of information. To overcome this problem, structural parameters that describe the conformational characteristics of DNA (“DNA shape”) are considered, in addition to the primary sequence of DNA. This more enriched description of the DNA is intermediate between the over-simplification of the DNA using only a string of letters and the atomic models used in structural biology which are too complex. In the modeling of transcription factor (TF)-DNA binding specificity, simple sequence-based models, such as the position weight matrix models, can be augmented by introducing parameters that account for DNA shape. Moreover, mechanistic insights into how TFs achieve their DNA binding specificity can be derived from such shape-augmented models. In this thesis, I present a database for DNA shape features of TF binding sites. These features are shown to improve performance of machine learning models of TF-DNA binding specificity. Analyses of DNA shape features in nucleosome sequences and Drosophila Hox-TFBSs reveal the importance of DNA shape recognition in both sequence non-specific and specific DNA binding proteins.
Modeling of Hox-DNA binding specificity combined with feature selection techniques is shown to reveal mechanisms in protein-DNA recognition. Finally, the method is generalized to study the role of DNA shape at base pair-resolution for a variety of TF families.

Acknowledgements

I would like to express my sincere thanks to my advisor Remo Rohs, for his continuous support and guidance as a mentor during the past few years. Your mentoring and influence on me are far beyond science. I am grateful that Professor Xianghong Zhou recognized my potential in doing scientific research during the CBB admission process. Thanks to Professor Frank Alber, Professor Liang Chen, Professor Xiaojiang Chen and Professor Fei Sha for being members of my Ph.D. guidance committee. And thanks to Professor Michael Waterman, Professor Yan Liu and Professor Frank Alber for being members of my dissertation committee.

Thanks to my collaborators Professor Ron Shamir from Tel Aviv University, Professor Richard Mann from Columbia University, and Professor Raluca Gordân from Duke University, for being extremely supportive of my research and career. Thanks to the present and past Rohs lab members Xiaofei Wang, Ana Carolina Dantas Machado, Tsu-Pei Chiu, Satyanarayan Rao, Beibei Xin, Jared Sagendorf, Jinsen Li, Richard Li, Dr. Rosa Di Felice, Dr. Iris Dror and Dr. Tianyin Zhou, for all the discussions and suggestions on my research and presentations. Special thanks go to Xiaofei Wang, who became my wife and without whom I would not have survived the enormous stress and anxiety during dissertation writing and preparation for the defense.

Finally, I would like to thank the Viterbi Fellowship for funding my first two years at USC. And thanks to my parents and my Alma Mater, USTC, who trained me to keep wondering about the world and always long for the truth.
Table of Contents

LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
  1.1 The DNA double helix
  1.2 Decoding the human genome
  1.3 Principles of TF-DNA binding specificity
  1.4 Experimental methods to study TF-DNA binding specificity
    1.4.1 Approaches of structural biology
    1.4.2 Approaches of genomics
  1.5 From experimental data to predictive models
    1.5.1 Approaches of structural biology
    1.5.2 Approaches of genomics
  1.6 Overview of this thesis
Chapter 2 TFBSshape: a motif database for DNA shape features of transcription factor binding sites
  2.1 Introduction
  2.2 Database
    2.2.1 Database architecture and methodology
    2.2.2 Interface with JASPAR
    2.2.3 Interface with UniPROBE
    2.2.4 User interface for analysis of DNA shape profile of one TF dataset
    2.2.5 User interface for comparison of DNA shape profiles of two TF datasets
  2.3 Biological applications
    2.3.1 DNA shape preferences of human bHLH TFs
    2.3.2 DNA shape preferences of Hox TFs in mouse
  2.4 Conclusions
Chapter 3 DNA shape readout in sequence specific and non-specific DNA binding proteins
  3.1 DNA shape readout in nucleosomes
    3.1.1 Introduction
    3.1.2 Methods and materials
    3.1.3 Results
    3.1.4 Conclusions and discussions
  3.2 DNA shape readout in Drosophila Hox TFs
    3.2.1 Introduction
    3.2.2 Methods and materials
    3.2.3 Results
    3.2.4 Discussion
Chapter 4 Transcription factor family-specific DNA shape readout revealed by quantitative specificity models
  4.1 Introduction
  4.2 Methods and materials
    4.2.1 Re-sequencing DNA libraries generated from previous HT-SELEX experiments
    4.2.2 Deriving relative TF-binding affinity for DNA M-words from HT-SELEX reads
    4.2.3 PCA and regression analysis
    4.2.4 Quality control of the data
    4.2.5 Generating DNA shape logos
  4.3 Results
    4.3.1 PCA analysis reveals TF-family specific DNA binding specificities and heterogeneities within TF families
    4.3.2 DNA shape features improve modeling of DNA binding specificities across different TF families
    4.3.3 Analysis reveals for different TF families importance of DNA shape features in flanking regions
    4.3.4 Feature selection provides insights into TF-DNA readout mechanisms
  4.4 Discussion
Chapter 5 Concluding remarks
Bibliography

LIST OF FIGURES

Figure 1.1 Schematic representation of nucleotide base pairs (bps) and the DNA double helix.
Figure 1.2 Schematic representation of the DNA structural parameters.
Figure 2.1 TFBSshape database flowchart.
Figure 2.2 Example TFBSshape analysis of DNA shape preferences for an Hnf4a TF dataset from UniPROBE.
Figure 2.3 Example TFBSshape comparison of DNA shape preferences of two TF datasets from UniPROBE for the homologous TFs Max from mouse and Cbf1 from yeast.
Figure 2.4 DNA shape preferences of human bHLH TFs.
Figure 2.5 DNA shape analysis of human bHLH TFBSs.
Figure 2.6 DNA shape analysis of human bHLH TFBSs.
Figure 2.7 DNA shape analysis of human bHLH TFBSs.
Figure 2.8 DNA shape analysis of human bHLH TFBSs.
Figure 2.9 DNA shape features distinguish TFBSs of anterior and posterior Hox proteins in mouse.
Figure 2.10 DNA shape analysis of mouse Hox TFBSs.
Figure 3.1 Agreement between DNA minor groove width predicted by DNAshape and hydroxyl (OH) cleavage intensity predicted by ORChID2.
Figure 3.2 Variation in minor groove width predicted by DNAshape (blue) and OH-cleavage intensity predicted by ORChID2 (green) on average.
Figure 3.3 Variation in Roll (blue), HelT (green) and ProT (red) on average in nucleosome sequences.
Figure 3.4 Anterior and Posterior Hox Proteins Select for Sequences with Distinct Minor Groove Shapes.
Figure 3.5 Amino acid sequences of Scr variants. Numbering is relative to the first residue in the homeodomain.
Figure 3.6 Loss of MG Width Preferences in the Absence of MG-Recognizing Residues.
Figure 3.7 Amino acid sequences (from the Exd interaction motif, YPWM, through the N-terminal arm of the homeodomain) of Antp variants.
Figure 3.8 Shape Readout Properties of Antp Variants with Scr-Specific Residues.
Figure 3.9 Comparison of Model Evaluation Based on Support Vector Regression versus Multiple Linear Regression.
Figure 3.10 DNA Shape Features Improve Quantitative Predictions of DNA Binding Specificities of Exd-Hox Heterodimers.
Figure 3.11 DNA Shape Features Improve Quantitative Predictions of DNA Binding Specificities of Exd-Hox Heterodimers and Hox Monomers.
Figure 3.12 Models that Deconvolve DNA Sequence and Shape Further Demonstrate the Additional Information Contained by Shape-Based Models.
Figure 3.13 Models that Deconvolve DNA Sequence and Shape.
Figure 4.1 PCA analysis reveals different DNA binding specificities between TF families.
Figure 4.2 Performance comparisons between models.
Figure 4.3 Comparable performance of 1mer+2mer+3mer and 1mer+shape models for the gcPBM data.
Figure 4.4 Equivalent performance of 1mer+2mer+3mer and 3mer models for the HT-SELEX data.
Figure 4.5 Schematic representation of the feature selection process.
Figure 4.6 The importance of DNA shape features as a function of nucleotide positions revealed by feature selection with machine learning.
Figure 4.7 Positional DNA shape importance revealed by feature selection for TF families.
Figure 4.8 Structure view and DNA sequence and shape logos for the homeodomain TFs PITX2/PITX3 and GBX1.
Figure 5.1 Structure-based illustration of the complexity of in vivo TF-DNA binding specificity.

LIST OF TABLES

Table 2.1 DNA shape analysis of human bHLH TFBSs.

Chapter 1 Introduction

1.1 The DNA double helix

Deoxyribonucleic acid (DNA) is a polymer built from nucleotide building blocks. Each nucleotide is composed of a phosphate group, a five-carbon sugar, and a nitrogenous base. The base is one of four types: adenine (A), cytosine (C), guanine (G) and thymine (T). A and G are purines, whereas C and T are pyrimidines. In DNA molecules, an adenine forms two hydrogen bonds with a thymine, and a cytosine forms three hydrogen bonds with a guanine (Figure 1.1A). These are called AT/TA and CG/GC base pairs (bps). The consecutive stacking of bps, each rotated relative to the previous one, forms a helical structure that resembles a right-handed twisted ladder, in which phosphodiester bonds between the 3’ and 5’ hydroxyl groups of adjacent nucleotides form the two rails and the bps form the steps (Figure 1.1B) (Watson and Crick 1953). The structure is thus called the DNA double helix. In this configuration, the imaginary central axis of the helix passes approximately through the center of each bp. DNA in this conformation is called B-form DNA, or B-DNA, which is the most common functional form in living cells (Drew, Wing et al. 1981). In this thesis, DNA refers to B-DNA unless otherwise specified.
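The pairing rules described above can be written down in a few lines. The following is a minimal sketch, not part of the original thesis; the function and constant names are hypothetical and chosen only for illustration. It derives the complementary strand of a sequence and counts the hydrogen bonds that hold the duplex together, using two bonds per AT/TA bp and three per CG/GC bp as stated in the text.

```python
# Watson-Crick pairing rules as stated in the text (names are illustrative).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
# Hydrogen bonds in the bp that contains a given base: AT/TA = 2, CG/GC = 3.
HYDROGEN_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}
PURINES = {"A", "G"}


def reverse_complement(sequence):
    """Return the complementary strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


def count_hydrogen_bonds(sequence):
    """Total number of Watson-Crick hydrogen bonds in the duplex."""
    return sum(HYDROGEN_BONDS[base] for base in sequence)


def is_purine(base):
    """True for A and G (purines), False for C and T (pyrimidines)."""
    return base in PURINES


example = "AACG"
print(reverse_complement(example))
print(count_hydrogen_bonds(example))
```

Every bp in this scheme joins one purine with one pyrimidine, which is consistent with the roughly uniform width of the ladder described above.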
In the DNA double helix, hydrogen bonds within the AT/TA and CG/GC bps hold the two complementary DNA strands (the two rails) together, providing the structural basis for storing information (Figure 1.1). As a result, most organisms use DNA as the genetic material that carries the “instructions” for the development and functioning of the organisms. The representation of DNA is commonly simplified as a sequence of letters A, C, G and T, which is referred to as the DNA sequence.

Figure 1.1 Schematic representation of nucleotide base pairs (bps) and the DNA double helix. (A) Annotation of functional groups on the major and minor groove edges of different bps. Two hydrogen bonds (yellow dashed lines) form in an AT/TA bp and three hydrogen bonds (yellow dashed lines) in a GC/CG bp. Nonpolar hydrogens are not explicitly shown. (B) Stacking of the bps gives rise to the DNA double helix.

Although informative, the primary DNA sequence, which describes only the order of the constituent nucleotides, does not provide a complete picture of the DNA molecule. Both the global and local structural characteristics of DNA have been found to play important roles in the regulation of gene expression, although through different mechanisms. The global structure of the DNA refers to the three-dimensional (3D) organization of the genome in the cellular nucleus, which not only enables packing of the long DNA polymer inside the nucleus, but also establishes the infrastructure that supports all the molecular machineries that “read” from the instruction molecule, i.e., the DNA (Rao, Huntley et al. 2014, Tjong, Li et al. 2016). DNA local structure, on the other hand, refers to the nuances in the geometric properties of the DNA bps and backbones and their dynamics (Rohs, West et al. 2009a). Because DNA bases are planar, each can be represented schematically as a thin brick-like shape. A bp can also be further abstracted as a single “brick”.
Based on this representation, two types of parameters are used to characterize the local conformation of DNA bps, i.e., intra-bp parameters and inter-bp parameters (Dickerson 1989). The intra-bp parameters describe the relative translational and rotational positioning of the two bases within the same bp. The inter-bp parameters describe the relative translational and rotational positioning of two adjacent bps. Translation and rotation with respect to different directions and axes give rise to a variety of intra- and inter-bp parameters (Figure 1.2) (Dickerson 1989). Besides these structural parameters, the geometric constraints required for the formation of hydrogen bonds between bases within each bp lead to an asymmetry between the two grooves formed by the two phosphate-sugar backbones in the DNA double helix. The wider groove is called the major groove, and the narrower groove is called the minor groove (Figure 1.1B). Different configurations of the intra- and inter-bp structural parameters result in variation in the width and depth of the major and minor grooves as well as local curvature and kinks in the DNA, which have been shown to play important roles in protein-DNA recognition (Rohs, Jin et al. 2010). DNA local structure, being the result of the interactions between nearby bases and bps, depends on the local DNA sequence context. Moreover, although it is stabilized by the bp stacking interactions and by the hydrogen bonding that “zips” the two strands together, the DNA double helix is not a rigid object. Its structural flexibility and the resulting dynamics add another layer of complexity to the analysis of DNA structure (Olson, Gorin et al. 1998, Fujii, Kono et al. 2007, Rohs, West et al. 2009a).

Figure 1.2 Schematic representation of the DNA structural parameters that describe the translational and rotational relationships within each base pair (bp) and between adjacent bps.
Figure adapted from (Ho and Carter 2011).

1.2 Decoding the human genome

Since the release of the DNA sequence of the complete human genome by the international Human Genome Project (Lander, Linton et al. 2001), global efforts have continuously been spent on decoding the 23 chromosomes, whose DNA consists of approximately 3,000,000,000 bps in total (Consortium, Birney et al. 2007, Consortium 2012). On one hand, our knowledge of the human genome, and of the genomes of other species in general, at the molecular level has greatly expanded. We now know that there are ~20,000 protein coding genes (Consortium 2012). We have established databases that compile the DNA sequences of different genes and the amino acid sequences of different proteins (Pruitt, Tatusova et al. 2012, UniProt 2015). Within genes, we have identified intron and exon regions as well as protein isoforms that arise from different combinations of exons being spliced together during transcription. These discoveries have been relatively straightforward given the rapid advancements in sequencing technologies that enable massive sequencing of DNA, RNA and proteins. On the other hand, the protein coding genes make up only ~3% of the human genome (Consortium 2012). Understanding the functions of the rest of the genome has been one of the biggest challenges faced by the science community. It has now been established that these non-coding regions contain the information required for the regulation of gene expression (Consortium 2012). As an important transcription-regulation mechanism, a group of proteins called transcription factors (TFs) bind to specific elements embedded in the non-coding DNA and as a result promote or repress the transcription level of the targeted genes (Consortium 2012, Pennacchio, Bickmore et al. 2013). Regions containing such TF binding sites (TFBSs) have regulatory effects on gene transcription upon TF binding and are called enhancers (Pennacchio, Bickmore et al. 2013).
Enhancers do not necessarily lie in proximity, in terms of the primary DNA sequence, to the targeted gene promoters, the regions where RNA polymerases bind and initiate transcription. Instead, they can be millions of bps away from the promoter and carry out regulatory functions by being near the promoter in space due to the 3D organization of the genome (Pennacchio, Bickmore et al. 2013). This greatly increases the complexity of gene regulation in the human genome. Understanding the general principles of how TFs select their DNA target sites, i.e., TF-DNA binding specificity, and establishing quantitative and predictive models of TF-DNA binding specificity are key steps towards the understanding of transcription regulation. Therefore, the main focus of this thesis is the analysis, specifically quantitative modeling, of TF-DNA binding specificity. Non-specific histone-DNA binding involved in nucleosome formation is also discussed.

1.3 Principles of TF-DNA binding specificity

Currently, there is no general set of rules that can precisely predict the DNA binding specificity for any given TF, and it is not likely that such a simple “code” exists (Slattery, Zhou et al. 2014). However, technologies such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy have helped solve thousands of all-atom structures of protein-DNA complexes. Analyses of these structures have revealed some general principles of TF-DNA binding specificity. Specifically, two major binding modes have been identified as the main mechanisms by which TFs achieve specific DNA binding, i.e., base readout and shape readout (Rohs, Jin et al. 2010), which are also known as direct readout and indirect readout (Otwinowski, Schevitz et al. 1988). Base readout refers to the recognition of specific DNA bases by the amino acid side chains in a TF (Seeman, Rosenberg et al. 1976).
In the major groove, each of the AT, TA, CG and GC bps has a unique pattern of functional groups on its edge, i.e., the arrangement of the hydrogen bond acceptors and donors and the hydrophobic methyl groups (Figure 1.1A). TFs, whose amino acid side chains can form direct hydrogen bonds, water-mediated hydrogen bonds and hydrophobic contacts with the functional groups in bps, require a specific steric configuration of these functional groups to form an ideal interface, resulting in DNA binding specificity (Seeman, Rosenberg et al. 1976). This mechanism involves the recognition of specific bases through physical TF-DNA interactions and is thus called base readout (Rohs, Jin et al. 2010). As an example, an arginine residue can recognize a base G by forming bidentate hydrogen bonds with it (Seeman, Rosenberg et al. 1976). In contrast, DNA shape readout refers to the recognition of specific structural features in the DNA by TFs (Rohs, Jin et al. 2010). Depending on the sequence context, the local DNA structure may deviate from ideal B-DNA, e.g., through DNA bending, which in some cases can facilitate the complementarity between the interfaces of the TF and the DNA when one docks to the other (Rohs, West et al. 2009a). In other cases, a TF may deform the DNA to obtain a more ideal interface, and thus the sequence-dependent flexibility or deformability of the DNA has an effect on the eventual binding affinity (Olson, Gorin et al. 1998). Another example of shape readout is the recognition of a narrow DNA minor groove by positively charged amino acid side chains. It has been shown that A-tracts tend to have a narrower DNA minor groove than other sequences (Hizver, Rozenberg et al. 2001).
Because the phosphate groups in the DNA carry a negative charge, narrowing of the minor groove results in enhanced negative electrostatic potential in the minor groove, so that the presence of a positively charged residue such as an arginine in this minor groove strengthens the protein-DNA interaction (Rohs, West et al. 2009b). All these mechanisms in which the TF-DNA binding specificity is achieved through the indirect recognition of DNA structural features are collectively called DNA shape readout (Rohs, Jin et al. 2010).

1.4 Experimental methods to study TF-DNA binding specificity

As the general principles of TF-DNA binding specificity are not sufficient for accurately predicting the DNA binding specificity of TFs, it is necessary to interrogate these problems case by case using experimental methods. The fields of structural biology and genomics have been tackling the problem of TF-DNA binding specificity from different methodological perspectives. In particular, structural biology has been approaching this problem by solving atomic structures of individual TF-DNA complexes, whereas the genomics field takes advantage of the rapid development in high-throughput DNA sequencing technologies and has developed assays that can systematically measure a TF’s binding preference for tens of thousands of different DNA sequences.

1.4.1 Approaches of structural biology

Mechanistic understanding of molecular interactions has been driven mainly by the field of structural biology. Researchers in this field solve structures of molecules and molecular complexes in order to understand their functions. X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy are the two most commonly used methods for determining the atomic structures of molecules (Rhodes 2010, Shi 2014).
X-ray crystallography

X-ray crystallography is a technique that takes advantage of the X-ray diffraction properties of crystals to infer the periodic 3D arrangement of atoms in the crystal. The procedure includes growing crystals of the target molecule or molecular complex, e.g., the TF-DNA complex of interest, collecting X-ray diffraction data from the crystal, inferring an electron density map from the diffraction data, and fitting atoms into the density map to generate molecular models. A crystal is the result of arranging atoms in a periodic way that is highly ordered and structured. Therefore, an imaginary box, or unit cell, can be constructed such that it contains all the necessary atoms and their spatial arrangement, and which, when repeated, forms the 3D array of unit cells that gives rise to the crystal. When an X-ray beam is cast upon a crystal, the crystal diffracts the beam according to Bragg’s law. As a result, the X-ray beams diffracted along different directions have different intensities and phases. The intensities are readily detectable by measuring the strength of the diffracted X-ray beams. However, obtaining the phases, which is called the phase problem, is not as trivial, and various techniques, such as isomorphous replacement and molecular replacement, have been developed to tackle this problem. Details of these methods are skipped here. The intensity and phase of each diffracted X-ray beam can be expressed as the result of a Fourier transform of the electron density function, which describes the distribution of electrons within the unit cell. Because the Fourier transform is invertible, the electron density function can be reconstructed by applying an inverse Fourier transform to the collected diffraction intensities and phases. Once the electron density is available, atoms of the molecules can be fitted into the density map to generate structure models.
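The role of the phases in this reconstruction can be illustrated with a toy calculation. The sketch below is not from the thesis: it assumes a one-dimensional "unit cell" containing two Gaussian-shaped "atoms", uses a plain discrete Fourier transform in place of real diffraction physics, and all names are hypothetical. It shows that the inverse transform recovers the density when both amplitudes and phases are used, and fails when the phases are discarded, which is the essence of the phase problem.

```python
import cmath
import math


def dft(values):
    """Forward discrete Fourier transform: density samples -> 'structure factors'."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * h * k / n) for k, v in enumerate(values))
        for h in range(n)
    ]


def inverse_dft(coefficients):
    """Inverse discrete Fourier transform: 'structure factors' -> density samples."""
    n = len(coefficients)
    return [
        sum(c * cmath.exp(2j * math.pi * h * k / n) for h, c in enumerate(coefficients)) / n
        for k in range(n)
    ]


def toy_density(n=64):
    """Assumed 1D electron density: two Gaussian 'atoms' in a unit cell of length 1."""
    xs = [k / n for k in range(n)]
    return [
        math.exp(-((x - 0.3) / 0.05) ** 2) + 0.5 * math.exp(-((x - 0.7) / 0.05) ** 2)
        for x in xs
    ]


density = toy_density()
structure_factors = dft(density)
amplitudes = [abs(f) for f in structure_factors]        # related to measured intensities
phases = [cmath.phase(f) for f in structure_factors]    # not measured directly

# Reconstruction with amplitudes and phases versus amplitudes alone.
with_phases = [
    z.real for z in inverse_dft([a * cmath.exp(1j * p) for a, p in zip(amplitudes, phases)])
]
without_phases = [z.real for z in inverse_dft(amplitudes)]

max_error_with_phases = max(abs(a - b) for a, b in zip(with_phases, density))
max_error_without_phases = max(abs(a - b) for a, b in zip(without_phases, density))
print(max_error_with_phases, max_error_without_phases)
```

In this toy setting the first error is at the level of floating-point rounding, while the second is large, because setting all phases to zero moves the density toward the origin of the cell.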
Since growing crystals of proteins often requires non-physiological conditions, the molecules in a crystal may be deformed from their functional forms in cells. Other factors such as crystal packing can also contribute to deformation of the molecules, which results in deformed molecular models. Nevertheless, X-ray crystallography has so far been the most successful method for solving atomic structures of bio-molecules.

NMR spectroscopy

NMR spectroscopy is another technique that has been used to solve the atomic structure of molecules. The procedure includes preparing a solution sample of the target molecule, placing the sample in a strong magnetic field to collect nuclear resonance data, i.e., the NMR spectrum, deriving distance restraints between nuclei from the spectrum, and generating structure models that satisfy the distance restraints. Atomic nuclei with an odd number of protons and/or neutrons, such as hydrogen ¹H, have a nonzero quantum property called spin. The spins in a solution sample placed in an external magnetic field can be manipulated by electromagnetic radiation. As a result, the solution absorbs energy and then relaxes, generating electromagnetic radiation of particular radio frequencies, which are determined by the “host” nuclei, the chemical environment of the spins, and the coupling between spins. Two types of spectroscopy experiments are combined to derive distance restraints between atoms in the molecule, i.e., correlation spectroscopy (COSY) and nuclear Overhauser effect spectroscopy (NOESY). COSY reveals through-bond coupling, also called J-coupling, between neighboring nuclei that are connected by chemical bonds, whereas NOESY reveals through-space coupling between nuclei that are near each other in space, but not necessarily connected through chemical bonds. Stereochemically and energetically feasible molecular models are then generated to comply with the distance restraints derived from COSY and NOESY experiments.
1.4.2 Approaches of genomics

The advantage of X-ray crystallography and NMR spectroscopy is that the solved atomic structure provides an extremely detailed picture of the TF-DNA complex. However, both techniques are low-throughput: each complex, laborious, and sometimes very expensive experiment reveals the binding mechanism of only a single TF-DNA complex. Also, we may not always need the knowledge of all the atomic details in each TF-DNA complex for pragmatic purposes. The overall binding preference of a TF with respect to a large pool of variant DNA sequences is in many cases more interesting and useful. In this regard, assays have been developed to systematically test the DNA binding specificity of any given TF. The most commonly used high-throughput in vitro methods are the Universal Protein Binding Microarray (PBM) (Berger, Philippakis et al. 2006) and the Systematic Evolution of Ligands by Exponential Enrichment combined with massively parallel sequencing (SELEX-seq) (Slattery, Riley et al. 2011), which is also called high-throughput SELEX (HT-SELEX) (Zhao, Granas et al. 2009, Jolma, Kivioja et al. 2010).

Universal PBM

In a universal PBM experiment, the TF of interest is cloned to contain an epitope tag, which serves as a “bait” for its later detection. The purified TF is then applied to a DNA microarray in which each spot contains a unique DNA sequence called the probe. As the TF binds freely to the DNA probes with association and dissociation rates determined by its binding affinity for the probes, its enrichment at different spots is determined by the underlying DNA preferences. Next, a fluorophore-conjugated antibody that is specific to the bait epitope tag is used to “prey” on the TF. Scanning for fluorescence of the prey antibody then generates the fluorescence intensity signal for different probes, which reflects the DNA binding specificity of the TF.
The probes in a universal PBM are designed and generated from a de Bruijn sequence that covers all possible k-mers (DNA sequences of length k). In practice, k is often chosen to be 9 or 10. This design enables the coverage of all possible k-mers with a compact set of probes, e.g., all possible 10-mers can be covered by approximately 44,000 probes of 35 bp in length.

SELEX-seq/HT-SELEX

A SELEX-seq/HT-SELEX experiment consists of several rounds of DNA selection by the TF followed by polymerase chain reaction (PCR) amplification and DNA sequencing. Initially, a pool of DNA oligomers, each consisting of a random variable region flanked by constant primers and adapters required for PCR and Illumina sequencing, is prepared. The purified TF is then allowed to bind to the DNA. Next, the bound DNA sequences are isolated and amplified with PCR. A sample of these TF-selected DNA sequences is sequenced, whereas the rest goes to the next round of TF selection and PCR amplification. This process is repeated for several rounds, with the sequences selected in the previous round serving as the starting pool for the next round. The composition of the TF-selected DNA sequences from each round can be recovered from the sequencing data. Since the evolution of the sequence composition in a SELEX-seq/HT-SELEX experiment is governed by the TF’s preference for the DNA, the relative binding affinity of the TF to tens of thousands of DNA oligomers can be simultaneously calculated based on the changes in the enrichment of these selected sequences across different rounds with a few mathematical assumptions (Levine and Nilsen-Hamilton 2007).

1.5 From experimental data to predictive models

Despite the discrepancy in what each experimental method has to offer in the study of TF-DNA binding specificity, one of the shared goals is to build predictive models that are useful in applying our knowledge of TF-DNA binding to the understanding of biological processes.
Again, in the pursuit of such models, the fields of structural biology and genomics have taken separate paths. Specifically, through molecular modeling, structural biology has developed the molecular mechanics (MM) and the hybrid quantum mechanics/molecular mechanics (QM/MM) approaches to simulate the behavior of molecules (Warshel and Levitt 1976). The genomics field, in contrast, has relied more on probabilistic and statistical as well as machine learning approaches in the modeling of TF-DNA binding (Stormo and Zhao 2010).

1.5.1 Approaches of structural biology

The MM approach, as revealed by its name, models molecular systems with classical mechanics. In the MM model, atoms are treated as point charges with an associated mass. Chemical bonds between atoms are simplified as spring-like interactions. A set of predefined parameters is used to describe how atoms interact with each other, and these parameters are collectively called the force field. A force field includes parameters that describe atom charges, the equilibrium bond length and torsion angle of each type of chemical bond, the equilibrium angle between adjacent bonds, and parameters for calculating the electrostatic and van der Waals forces. The overall potential energy of a molecular system can be broken down into separate terms:

\[ E = E_{\text{bond}} + E_{\text{angle}} + E_{\text{dihedral}} + E_{\text{electrostatic}} + E_{\text{van der Waals}} \]

where $E_{\text{bond}}$ represents the potential resulting from deviation of bond lengths from equilibrium lengths, $E_{\text{angle}}$ represents the potential resulting from deviation of bond angles from equilibrium angles, $E_{\text{dihedral}}$ represents the potential resulting from dihedral torsions, $E_{\text{electrostatic}}$ represents the potential resulting from electrostatic interactions, generally computed based on Coulomb’s law, and $E_{\text{van der Waals}}$ represents the potential resulting from van der Waals interactions, generally computed using the Lennard-Jones potential.
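As an illustration of this decomposition, the following minimal Python sketch evaluates one instance of each of the five energy terms. The functional forms (harmonic bond and angle terms, a cosine dihedral term, Coulomb's law, and the 12-6 Lennard-Jones potential) are common textbook choices assumed here for illustration; the text above does not specify them beyond Coulomb's law and the Lennard-Jones potential, and all function names are hypothetical.

```python
import math

# Approximate Coulomb conversion factor in kcal*Angstrom/(mol*e^2).
# Assumption: distances in Angstrom, charges in units of the elementary charge.
COULOMB_CONST = 332.06


def bond_energy(r, r0, k):
    """Harmonic penalty for a bond length r deviating from equilibrium r0."""
    return k * (r - r0) ** 2


def angle_energy(theta, theta0, k):
    """Harmonic penalty for a bond angle (radians) deviating from theta0."""
    return k * (theta - theta0) ** 2


def dihedral_energy(phi, k, n, delta):
    """Periodic torsion term with multiplicity n and phase delta (radians)."""
    return k * (1.0 + math.cos(n * phi - delta))


def electrostatic_energy(qi, qj, r):
    """Coulomb's law for two point charges separated by distance r."""
    return COULOMB_CONST * qi * qj / r


def van_der_waals_energy(r, epsilon, sigma):
    """12-6 Lennard-Jones potential with well depth epsilon and size sigma."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)


def total_energy(bond, angle, dihedral, electrostatic, van_der_waals):
    """Sum of the five terms, mirroring the decomposition in the text."""
    return bond + angle + dihedral + electrostatic + van_der_waals
```

In a real force field each term is summed over all bonds, angles, dihedrals, and non-bonded atom pairs, with parameters taken from the force field rather than chosen by hand.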
Within this framework, energy minimization techniques can be used to search for molecular conformations that have a low energy, as functional forms of molecules in biological systems tend to be in low energy states. One popular method is the Monte Carlo (MC) simulation of molecular systems (Rohs, Sklenar et al. 2005). In an MC simulation, random moves of atoms are performed, which results in a new conformation. The potential energy of this new conformation is then calculated and, based on how it compares to the energy of the previous conformation, the random changes made to the conformation are either accepted or rejected. This procedure is repeated until conformations with satisfactorily low energy have been found or the maximum number of allowed iterations has been reached. Alternatively, the dynamics of the molecular system can be simulated based on Newton’s second law of motion, F=ma. This method is called molecular dynamics (MD) simulation (Cheatham and Kollman 2000). In this case, forces acting on each atom are calculated based on the force field and the current conformation of the molecular system. The atoms are then allowed to move to new coordinates, which are predicted based on Newton’s law for a short time period, e.g., 1 picosecond (ps). Repeating this procedure generates a trajectory of the molecular system in space and time, which reveals the dynamics of the system. While MM is purely based on classical mechanics, QM/MM combines quantum mechanics and classical mechanics by treating part of the molecular system with quantum mechanics, and thus takes advantage of both the accuracy of the quantum treatment and the efficiency of the classical treatment. QM/MM can be used to simulate chemical reactions.
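The accept-or-reject loop of an MC simulation described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the protocol of the cited work: it uses the Metropolis criterion (an assumption, since the text only says that moves are accepted or rejected by comparing energies), it perturbs generic coordinates rather than atoms in a molecular model, and the function name and parameters are hypothetical.

```python
import math
import random


def metropolis_search(energy, start, n_steps=10000, step_size=0.1, kT=0.6, seed=0):
    """Search for a low-energy state with Metropolis Monte Carlo moves.

    energy: function mapping a list of coordinates to a potential energy.
    kT:     thermal energy controlling how often uphill moves are accepted.
    """
    rng = random.Random(seed)
    current = list(start)
    e_current = energy(current)
    best, e_best = list(current), e_current
    for _ in range(n_steps):
        # Random move: perturb one randomly chosen coordinate.
        trial = list(current)
        i = rng.randrange(len(trial))
        trial[i] += rng.uniform(-step_size, step_size)
        e_trial = energy(trial)
        delta = e_trial - e_current
        # Downhill moves are always accepted; uphill moves are accepted
        # with probability exp(-delta / kT), otherwise rejected.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best
```

For example, `metropolis_search(lambda x: sum((v - 1.0) ** 2 for v in x), [3.0, -2.0])` explores a toy two-dimensional harmonic well and returns the lowest-energy point visited together with its energy.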
1.5.2 Approaches of genomics

Just as X-ray crystallography and NMR spectroscopy are low-throughput techniques that can only study the structure of one TF-DNA complex at a time, the molecular simulation methods mentioned above are computationally too expensive for high-throughput analyses. For example, a 100 ns MD trajectory of a molecular system consists of 100,000 steps, each a 1 ps update of the 3D coordinates of all the atoms, the number of which generally ranges from tens of thousands to millions. As a result, an MD simulation typically takes from hours to days of computation time, depending on the size of the molecular system and the computational power provided by the hardware. Moreover, force fields are prone to artifacts that can result in biologically irrelevant simulation trajectories.

With the rich TF-DNA binding data generated by high-throughput assays such as the universal PBM and SELEX-seq/HT-SELEX, an alternative approach is to use compact mathematical models to represent TF-DNA binding specificity (Stormo and Zhao 2010). These parameterized models can be tuned to fit the experimental data as a form of knowledge extraction. The resulting model thus summarizes the information in the data and provides a compact description of the TF-DNA binding specificity. The benefits of such models are twofold: 1) as we are essentially extracting knowledge from large-scale data, we can learn the mechanisms underlying the TF-DNA binding specificity by inspecting the tuned models; 2) the models have predictive power and can be deployed to make predictions for new DNA sequences that did not appear in the original measurement. Development of such models requires knowledge from multiple disciplines including biophysics, biochemistry, statistics and computer science. With this approach, models are designed based on mathematical abstraction and assumptions of the TF-DNA interaction process.
These parameterized mathematical models are then tuned to fit the experimental data with the aid of computer programming. In the field of genomics, the most widely used model of TF-DNA binding specificity is the position weight matrix (PWM) model and its variants (Stormo 2013). In this model, a 4-by-n parameter matrix is used to represent the DNA binding preference of a TF, where n is chosen according to the width of the TFBS, i.e., the stretch of DNA bps occupied when bound by the TF. The rows of a PWM correspond to the four different types of nucleotides A, C, G and T, and the columns correspond to nucleotide positions within the TFBS. Each element in the matrix is a parameter that describes the preference for the corresponding nucleotide at the corresponding position. For any given DNA sequence of length n, summation of the corresponding elements in the PWM generates a quantitative prediction of the binding preference of the TF. The prevalent usage of PWM models has demonstrated their power in capturing TF-DNA binding specificity. Another factor that has contributed to the PWM’s popularity is that a PWM can be easily plotted as a sequence logo, which provides an intuitive visual representation of the DNA binding specificity of a TF (Schneider and Stephens 1990). The major limitation of PWM models is the assumption that bps within the TFBS interact with the TF independently, which in reality is not always true (Man and Stormo 2001, Bulyk, Johnson et al. 2002). For example, two adjacent bps may interact with the TF in a synergistic way, such that the presence of particular di-nucleotides facilitates the local amino acid-nucleotide interactions, significantly lowering the overall energy of the TF-DNA complex. To overcome this limitation, PWM models have been extended to include parameters that account for nucleotide interdependency, e.g., parameterizing di-nucleotide and tri-nucleotide preferences, and so on (Zhao, Ruan et al. 2012).
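The additive scoring rule of a PWM can be made concrete with a short Python sketch. The matrix below is a toy 4-by-3 example with made-up weights chosen only for illustration; it does not describe any real TF, and the function names are hypothetical.

```python
# Toy PWM with made-up weights (rows: nucleotides, columns: positions).
TOY_PWM = {
    "A": [-1.0, 1.0, -1.0],
    "C": [1.0, -1.0, 1.0],
    "G": [-0.5, -1.0, -0.5],
    "T": [-1.0, -0.5, -1.0],
}


def score_pwm(pwm, site):
    """Sum the PWM elements selected by each nucleotide of an n-bp site."""
    n = len(pwm["A"])
    if len(site) != n:
        raise ValueError(f"expected a site of length {n}, got {len(site)}")
    return sum(pwm[base][pos] for pos, base in enumerate(site.upper()))


def best_site(pwm, sequence):
    """Scan all n-bp windows and return (score, start) of the best window."""
    n = len(pwm["A"])
    return max(
        (score_pwm(pwm, sequence[i:i + n]), i)
        for i in range(len(sequence) - n + 1)
    )
```

Because each position contributes its own term to the sum, the score cannot capture a case in which a particular di-nucleotide is preferred more strongly than its two nucleotides taken separately, which is the independence limitation discussed above.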
An alternative approach is to integrate knowledge from the structural biology field and describe the DNA not only as a simple sequence of A, C, G and T, but also using structural features of the double helix, or DNA shape (Zhou, Yang et al. 2013). This more enriched description of the DNA is intermediate between the over-simplification of the DNA using only a string of letters A, C, G and T, and the atomic models, which are too complex. As a result, simple sequence-based models of TF-DNA binding specificity, such as the PWM models, can be augmented by introducing parameters that account for DNA shape without losing their compactness. Moreover, mechanistic insights into how the TF achieves its DNA binding specificity can be derived from such shape-augmented models.

1.6 Overview of this thesis

This thesis presents a database for DNA shape features of TFBSs. These features are shown to improve performance of machine learning models of TF-DNA binding specificity. Analyses of DNA shape features in nucleosome sequences and Drosophila Hox-TFBSs reveal the importance of DNA shape readout in both sequence non-specific and specific DNA binding proteins. Modeling of the Hox-DNA binding specificity combined with feature selection techniques is shown to reveal mechanisms in protein-DNA recognition. Finally, this method is generalized to study DNA shape readout at bp-resolution for a variety of TF families.

Chapter 2 TFBSshape: a motif database for DNA shape features of transcription factor binding sites

Reproduced from Lin Yang, Tianyin Zhou, Iris Dror, Anthony Mathelier, Wyeth W. Wasserman, Raluca Gordân, and Remo Rohs: TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res.
42, D148-155 (2014)

2.1 Introduction

The DNA binding specificities of TFs can be described as consensus sequences or position frequency matrices (PFMs) representing the probability of occurrence of each nucleotide at each position of a DNA binding site. These probability matrices are usually transformed into position weight matrices (PWMs) (Stormo 2000, Stormo 2013) and can be visualized as motif logos (Schneider and Stephens 1990). PWMs traditionally assume independence between individual nucleotide positions within the binding site. Recent approaches have expanded the basic concept of PWMs by adding dinucleotide parameters, based on observations that individual nucleotide positions within a motif are not independent from each other (Sharon, Lubliner et al. 2008, Zhao, Ruan et al. 2012, Weirauch, Cote et al. 2013). Interdependencies between nucleotide positions within a motif give rise to the three-dimensional structure of DNA and, thus, the shape of TFBSs. The important role of DNA shape as a determinant of protein-DNA binding specificity has been previously discussed (Rohs, West et al. 2009b, Rohs, Jin et al. 2010, Parker and Tullius 2011, Ostuni and Natoli 2013), and we have demonstrated mechanisms of DNA shape readout for numerous TFs (Joshi, Passner et al. 2007, Kitayner, Rozenberg et al. 2010, Slattery, Riley et al. 2011, Chen, Bates et al. 2012, Eldar, Rozenberg et al. 2013, Gordan, Shen et al. 2013, Dror, Zhou et al. 2014) and other DNA binding proteins (Chang, Xu et al. 2013, Hancock, Ghane et al. 2013, Lazarovici, Zhou et al. 2013). The DNA structure implicitly contains all interdependencies between nucleotide positions of a TFBS and does not require explicit knowledge of individual interdependencies. Although DNA shape is a function of sequence, the sequence-structure relationship is highly complex and degenerate. DNA shape can explain why sequences that flank TFBSs contribute to binding specificity (Gordan, Shen et al. 2013).
Spacers between binding sites of different DNA binding proteins (Kim, Brostromer et al. 2013) or half sites of multimeric TFs can play a similarly important role (Chen, Zhang et al. 2013, Watson, Kuchenbecker et al. 2013). Structural data have only been obtained for a small number of relatively short DNA sequences that have been studied experimentally or by computationally expensive molecular simulations. This limited availability has been a major bottleneck for using DNA shape information in genome analysis. To overcome this limitation, we recently developed a fast and efficient method for the high-throughput prediction of DNA shape, and validated the method with massive experimental and computational data (Zhou, Yang et al. 2013). Using this approach, in the present study, we described the development of our TFBSshape database, which provides DNA structural features for nucleotide sequences preferred by different TFs. We analyzed 739 datasets derived from open-access motif databases that describe the DNA binding specificities of TFs from 23 different species. We used the sequence information provided by JASPAR (Mathelier, Zhao et al. 2014) and UniPROBE (Robasky and Bulyk 2011) to calculate DNA shape features of TFBSs. These features include minor groove width (MGW), Roll, propeller twist (ProT) and helix twist (HelT). Our TFBSshape database qualitatively illustrates the TFBS shape profiles in heat maps for TF core binding sites. Flanking sequences are included whenever such information is available. Download options provide quantitative data for further analysis. TFBSshape includes a tool to compare, both qualitatively and quantitatively, any two selected TFBS shape profiles. A user can also upload a sequence dataset and compare its DNA shape features with any chosen TFBS shape profile in the database. We applied the TFBSshape approach to different biological applications.
We analyzed the differential DNA shape preferences of the human bHLH TFs Mad2 (‘Mad’), Max and c-Myc (‘Myc’) using genome-context protein binding microarray (gcPBM) data (Mordelet, Horton et al. 2013). To demonstrate the added value of describing TFBSs using structural features, we used L2-regularized multiple linear regression (MLR) to predict the DNA binding specificities of these bHLH factors, comparing a model based on nucleotide sequence alone with a model that combines DNA sequence and shape. We showed that shape-augmented MLR models improved the accuracy in DNA binding specificity predictions by >20%. We also described the DNA shape preferences of Hox proteins in mouse using DNA binding sequences derived from universal protein binding microarray (PBM) experiments (Berger, Badis et al. 2008). The results of this analysis showed that distinct DNA shape features of TFBSs for anterior versus posterior Hox TFs, as previously reported for Drosophila (Slattery, Riley et al. 2011), can be observed across species.

2.2 Database

2.2.1 Database architecture and methodology

TFBSshape derives TFBS sequence information from the motif databases JASPAR (Mathelier, Zhao et al. 2014) and UniPROBE (Robasky and Bulyk 2011) and generates DNA shape data for TFBSs based on the high-throughput prediction of DNA structural features, including the parameters MGW, Roll, ProT and HelT (Zhou, Yang et al. 2013). The approach uses a sliding pentamer window and query tables of structural features derived from all-atom Monte Carlo simulations for all 512 unique pentamers. We previously validated this method using massive experimental data from X-ray crystallography, NMR spectroscopy and hydroxyl radical cleavage experiments, as well as statistical analysis and cross-validation (Zhou, Yang et al. 2013).
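The sliding pentamer window can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the text: a pentamer and its reverse complement are treated as one table entry (which yields 1024 / 2 = 512 unique pentamers, consistent with the number quoted above), only a feature assigned to the central nucleotide of each pentamer is handled, and the table values are placeholders (the number of A/T bases in the pentamer), not real structural data. All names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(pentamer):
    """Pick one representative of a pentamer and its reverse complement."""
    return min(pentamer, reverse_complement(pentamer))


# All pentamers, collapsed by reverse-complement equivalence.
UNIQUE_PENTAMERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=5)})

# Placeholder query table: A/T count per pentamer, NOT a real shape feature.
TOY_TABLE = {p: float(sum(b in "AT" for b in p)) for p in UNIQUE_PENTAMERS}


def predict_profile(sequence, table):
    """Assign each position the table value of the pentamer centered on it.

    The first and last two positions have no complete pentamer around them
    and are therefore reported as None.
    """
    seq = sequence.upper()
    if len(seq) < 5:
        return [None] * len(seq)
    middle = [table[canonical(seq[i - 2:i + 3])] for i in range(2, len(seq) - 2)]
    return [None, None] + middle + [None, None]
```

The two undefined positions at each end of the profile are the reason 2-bp flanks are required to obtain feature values for an entire core binding site.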
The backend of TFBSshape consists of a MySQL database, PHP scripts hosted on an Apache server, and other scripts invoked by the PHP scripts to perform TFBS assembly and DNA shape prediction upon user request (Figure 2.1). The frontend of TFBSshape includes HTML web pages with CSS and JavaScript components that provide a user-friendly interface for retrieving data from the database. In addition, it provides an interface for comparison of two TFBS shape profiles from the database and an interface for generating DNA shape data for user-uploaded TFBS sequences, which can also be compared to a TFBS shape profile in the database. Among the TFBS data derived from JASPAR and UniPROBE, 371 TFs are from JASPAR (Portales-Casamar, Thongjuea et al. 2010), including 149 TFs from the latest JASPAR 2014 release (Mathelier, Zhao et al. 2014), and 368 TFs are from UniPROBE (Robasky and Bulyk 2011). TFs in JASPAR or UniPROBE without TFBS sequence information are not included in TFBSshape. Due to their different storage formats, sequence data from JASPAR and UniPROBE need to undergo different pre-processing steps prior to TFBS assembly and DNA shape prediction.

Figure 2.1 TFBSshape database flowchart. Input data are nucleotide sequences derived from the motif databases JASPAR and UniPROBE, which are stored and managed using a MySQL database (yellow). Following this pre-processing, TFBSs are assembled, DNA shape features are predicted “on the fly”, and an Apache server provides the user interface (blue). Figure adapted from (Yang, Zhou et al. 2014).

2.2.2 Interface with JASPAR

JASPAR curates TFBS sequencing data derived from the literature in its sub-database JASPAR CORE (Portales-Casamar, Thongjuea et al. 2010). For TFs with available motif information, the TFBS sequences are provided in FASTA format, with the core binding site highlighted in upper-case letters and the flanking sequences in lower-case letters.
Using this sequence data, TFBSshape derives DNA shape features for all nucleotide positions within the core binding site. This prediction is possible because TFBS sequences from JASPAR always contain the core binding sites, whereas their flanking sequences can be missing. Due to the methodology used to predict DNA shape features at the center of a sliding pentamer (Zhou, Yang et al. 2013), 2-bp flanks are needed to calculate the structural features for the entire core binding motif.

Because the flanking sequence information can be missing, TFBSshape first calculates the portion of TFBS sequences that contain ≥ 2-bp flanks on each side of the core binding site. If this portion is > 50% of all sequences available for a TF, sequences that do not contain 2-bp flanks are removed, and the TFBSs are assembled in the form of nnNNN…NNNnn using the remaining sequences. Here, the stretch of NNN…NNN represents the core binding site, and nn represents the 2-bp flanks required for predicting DNA shape features for the two NN positions at either end of the core binding site. If the portion of TFBSs with flanking sequence information is ≤ 50%, no sequences are removed, and TFBSs are assembled in the form of NNN…NNN. In this case, DNA structural features cannot be predicted for the first and last two NN positions of the core binding site. In either case, DNA shape features are predicted for the core binding sites and visualized as heat maps. A white space at the left and right margins of the heat maps indicates that structural features for the end positions of the core binding site are not available due to missing flanks. This situation is more common for TFBS data from the previous JASPAR release (Portales-Casamar, Thongjuea et al. 2010). It does not happen for data from the JASPAR 2014 update (Mathelier, Zhao et al. 2014). For each structural feature, a heat map for individual sequences and an average heat map are provided.
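The flank-handling rule for JASPAR sequences described above can be sketched in Python as follows. This is an illustrative reading of the rule, not the TFBSshape source code: the function names are hypothetical, and the input is assumed to be a list of strings with the core binding site in upper case and any flanks in lower case.

```python
import re

SITE_PATTERN = re.compile(r"([acgtn]*)([ACGTN]+)([acgtn]*)")


def split_site(sequence):
    """Split a sequence into (left flank, core, right flank) by letter case."""
    match = SITE_PATTERN.fullmatch(sequence)
    if match is None:
        raise ValueError(f"unexpected sequence format: {sequence!r}")
    return match.groups()


def assemble_tfbs(sequences, flank=2):
    """Assemble TFBSs with flanks if more than 50% of the sequences have them.

    Returns (assembled sequences, flanks_included).
    """
    parts = [split_site(s) for s in sequences]
    flanked = [
        (left, core, right)
        for left, core, right in parts
        if len(left) >= flank and len(right) >= flank
    ]
    if len(flanked) > 0.5 * len(parts):
        # Keep only flanked sequences, trimmed to the form nnNNN...NNNnn.
        return [left[-flank:] + core + right[:flank] for left, core, right in flanked], True
    # Otherwise keep every sequence but only its core, NNN...NNN.
    return [core for _, core, _ in parts], False
```

For instance, `assemble_tfbs(["ttCACGTGaa", "gcCACGTGt", "acgCACGTGcat"])` keeps the two fully flanked sequences and returns them with 2-bp flanks, whereas a dataset in which at most half of the sequences have flanks is reduced to core sites only.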
Each row in the heat map for individual sequences represents the structural feature values of a corresponding sequence (Figure 2.2A). If the number of rows is ≤ 3000, these rows are clustered using the agglomerative hierarchical clustering algorithm, with distances defined as the Euclidean distance (ED) between the values of the individual rows. The average heat map provides average structural features at each nucleotide position of the TFBS (Figure 2.2B). Although these heat maps can be seen as a qualitative analysis, links are provided for the user to download the actual DNA shape data for further quantitative analysis. The PWM generated based on the analyzed set of sequences is visualized in the TFBSshape database as a motif logo (Figure 2.2C). In this format, the numbering of the nucleotide positions corresponds to the numbering used in the structural feature heat maps.

Figure 2.2 Example TFBSshape analysis of DNA shape preferences for an Hnf4a TF dataset from UniPROBE. (A) Heat map showing predicted MGW profiles for individual sequences, clustered based on Euclidean distances of MGW profiles, and (B) average heat map for all sequences. The color code for both heat maps uses red for narrow MGW, blue for wide MGW, and white for intermediate values. (C) PWM calculated using all analyzed TFBS sequences, aligned with DNA shape heat maps as nucleotide sequence reference. Figure adapted from (Yang, Zhou et al. 2014).

2.2.3 Interface with UniPROBE

UniPROBE hosts TF binding data generated from universal PBM experiments (Robasky and Bulyk 2011). In these experiments, each array is designed to contain probes that cover all possible 10-mer variants (Berger, Philippakis et al. 2006). A TF of interest can access each probe to initiate a binding event, and the binding intensities for different probes are compared based on their fluorescence signal intensities. These data are used to derive PFMs that represent the in vitro DNA binding specificities of the TF.
TFBSshape retrieves the probe set and PFM from UniPROBE and uses them as input data of the Find Individual Motif Occurrences (FIMO) algorithm (Grant, Bailey et al. 2011) from the MEME Suite of motif-based sequence analysis tools (Bailey, Boden et al. 2009). FIMO searches in the input sequences for occurrences of a motif specified by a PFM. Each motif occurrence found by FIMO is associated with a P-value, indicating the statistical significance level of the detected occurrence. TFBSshape determines motif occurrences with P ≤ 10^-3 as core binding sites. This significance level is an empirical threshold based on the fact that PFMs generated from such TFBSs are highly consistent with the original PFMs provided by UniPROBE (Robasky and Bulyk 2011). The assumption that TFBSs are enriched among probes with higher PBM signal intensities is generally true for TFBSs found in the above manner. A barcode visualizes the enrichment of the TFBSs among the ranked probes, with vertical bars representing probes with PBM signal intensities in descending order from left to right, and a white bar indicating no occurrence of a TFBS, a yellow bar indicating one TFBS and a brown bar indicating multiple TFBSs. TFBS-containing probes are subjected to the same TFBS assembly and other procedures as described for JASPAR. TFBSs derived from UniPROBE data usually have ≥ 2-bp flanks because most TFBSs are not located at the end of the probe. The TFBS sequence and shape data can be downloaded for further quantitative analysis.

2.2.4 User interface for analysis of DNA shape profile of one TF dataset

TFBSshape provides tab pages that dynamically display specific content and form an interface for retrieving data. The ‘Selection’ tab enables the user to specify the search criteria for either JASPAR or UniPROBE data or to upload custom-aligned sequences.
After the user initiates the selection, the ‘Refine’ tab displays a table listing all of the TFs that satisfy the search criteria or shows a form for submitting sequences. The user can select a TF from the list or upload custom DNA sequences. The ‘Results’ tab displays a table containing information on the analyzed dataset, with a download link for the DNA sequence and shape data. The ‘Results’ tab also displays structural feature heat maps for individual sequences (Figure 2.2A), average heat maps for each shape parameter (Figure 2.2B), and the motif logo representing the PWM calculated using TFBS sequence information (Figure 2.2C). The TFBSshape interface dynamically updates the content under the tab pages, allowing the user to maintain a temporary customization throughout an analysis session.

2.2.5 User interface for comparison of DNA shape profiles of two TF datasets

TFBSshape provides an interface for comparing two TFBS shape profiles from the database, or for comparing an uploaded TFBS dataset with a user-chosen reference TF dataset from the database. Under the ‘Selection’ tab, the user can initiate the selection to compare two TFs or to upload custom-aligned sequences. The ‘Refine’ tab then displays a form that guides the user through the selection of the two desired TF datasets, derived from either JASPAR or UniPROBE, or a form that enables the user to upload the sequences and select the desired reference TF dataset from the database. In this form, the user needs to specify the alignment of the two TF motifs by setting the reference positions for the compared datasets or the offset in nucleotide positions for the uploaded sequences. After this step, the user will find the comparison of DNA shape features under the ‘Results’ tab. The user can return to the ‘Refine’ tab to adjust the alignment. As an example of this functionality of TFBSshape, we compared UniPROBE datasets for the bHLH TFs Max and Cbf1, which are from mouse and yeast, respectively.
The TFBSs were aligned based on the PWMs for both TFs (Figure 2.3A). Using the requested sequence alignment, TFBSshape visualized quantitative comparisons of average heat maps for the DNA shape features MGW, Roll, ProT and HelT, and provided Pearson’s correlation coefficients (PCC) and EDs as quantitative measures for the comparison (Figure 2.3B).

Figure 2.3 Example TFBSshape comparison of DNA shape preferences of two TF datasets from UniPROBE for the homologous TFs Max from mouse and Cbf1 from yeast. (A) PWMs calculated for Max and Cbf1 using all analyzed TFBS sequences, with nucleotide positions numbered according to the user-determined alignment. (B) Using the chosen alignment, average heat maps for the four DNA shape features MGW, Roll, ProT and HelT are shown for Max (TF1) and Cbf1 (TF2). These shape profiles were quantitatively compared using PCC and ED. Figure adapted from (Yang, Zhou et al. 2014).

2.3 Biological applications

2.3.1 DNA shape preferences of human bHLH TFs

Paralogous TFs often bind to TFBSs with very similar and often identical core binding motifs, although they bind to different target sites in the genome to execute their specific in vivo functions. We previously showed that the yeast bHLH factors Cbf1 and Tye7 select distinct DNA shape features that contribute to their DNA binding specificities to genomic target sites, despite their strong preference for the CACGTG E-box as a shared core binding motif (Gordan, Shen et al. 2013). TFBSshape can be used to compare structural features of TFBSs in analyzing DNA binding specificities among TFs of the same family. Therefore, in this report, we extended our study to human bHLH factors. We analyzed the TFBSs of Mad, Max and Myc derived from gcPBM experiments (Mordelet, Horton et al. 2013). Heat maps and box plots for MGW (Figure 2.4A, Figure 2.5A-C), Roll, ProT and HelT (Figure 2.6, Figure 2.7, Figure 2.8) clearly indicated the unique structural features of the E-box.
In addition, a Kolmogorov–Smirnov (K-S) significance test revealed that Mad and Max exhibit much more similar DNA shape preferences compared to the more distinct DNA binding specificity of Myc (Figure 2.4B, Figure 2.5, Figure 2.6, Figure 2.7, Figure 2.8). Whereas the E-box as a core binding motif is shared between all three TFs, differential DNA binding specificities can be detected through motif-based analysis of DNA shape preferences. These differences can be due to variations in the flanking sequences, as shown for Cbf1 and Tye7, or nucleotide variants within the E-box (Table 2.1). To confirm the significance of the detected TFBS shape differences, we analyzed a replicate experiment using Myc. The results indicated that the DNA shape features selected by the same TF in two independent gcPBM experiments were not distinct, according to K-S P-values (Figure 2.4B, Figure 2.5D, Table 2.1).

Figure 2.4 DNA shape preferences of human bHLH TFs. Heat maps illustrate MGW selections of (A) the Mad2-Max heterodimer (‘Mad’), the Max homodimer (‘Max’) and the c-Myc-Max heterodimer (‘Myc’). Sequence data were derived from gcPBM experiments (Mordelet, Horton et al. 2013) using 25% of the probes with highest signal intensities after removing probes with multiple TFBSs. (B) MGW preferences of the three TFs were compared, and nucleotide positions with significant MGW differences based on a K-S test were indicated for comparisons of Mad versus Max, Mad versus Myc and Max versus Myc (positions with different MGW distributions are shown in orange for P < 0.001 and yellow for P < 0.05; positions without significant differences are shown as green background). A replicate experiment for Myc verified that the gcPBM experiment and shape analysis did not detect any significant differences for Myc1 versus Myc2. The DNA shape features were symmetrized based on the palindromic E-box, which is located at the central positions –3 to +3 (frame).
(C) L2-regularized MLR and 10-fold cross-validation were used to test the accuracy of binding specificity predictions, showing that shape-augmented models (purple) outperformed specificity models using nucleotide sequence alone (brown) for all three human bHLH TFs. Adding randomly shuffled DNA shape features did not lead to the observed improvement (magenta). Figure adapted from (Yang, Zhou et al. 2014).

Figure 2.5 DNA shape analysis of human bHLH TFBSs. (A-C) MGW preferences of the three TFs were compared using box plots. Boxes represent the median (line inside the box), 1st and 3rd quartiles (edges of the box), and the whiskers define the furthest data points within 1.5 × inter-quartile range from the edges of the box. Asterisks indicate nucleotide positions where differences in MGW distributions selected by (A) Mad vs. Max, (B) Mad vs. Myc, and (C) Max vs. Myc were significant based on a K-S test (black asterisk, P < 0.05; red asterisk, P < 0.001). (D) As a negative control, two independent experiments for Myc (‘Myc1’ and ‘Myc2’) were also compared and DNA shape selections were found to be essentially identical. MGW features were symmetrized based on the palindromic E-box. Figure adapted from (Yang, Zhou et al. 2014).

Figure 2.6 DNA shape analysis of human bHLH TFBSs. Heat maps illustrate Roll selections of (A) Mad, (B) Max, and (C) Myc. (D-F) Roll preferences of the three TFs were compared using box plots. Boxes represent the median (line inside the box), 1st and 3rd quartiles (edges of the box), and the whiskers define the furthest data points within 1.5 × inter-quartile range from the edges of the box. Asterisks indicate nucleotide positions where differences in Roll distributions selected by (D) Mad vs. Max, (E) Mad vs. Myc, and (F) Max vs. Myc were significant based on a K-S test (black asterisk, P < 0.05; red asterisk, P < 0.001). Roll features were symmetrized based on the palindromic E-box. Figure adapted from (Yang, Zhou et al. 2014).
Figure 2.7 DNA shape analysis of human bHLH TFBSs. Heat maps illustrate ProT selections of (A) Mad, (B) Max, and (C) Myc. (D-F) ProT preferences of the three TFs were compared using box plots. Boxes represent the median (line inside the box), 1st and 3rd quartiles (edges of the box), and the whiskers define the furthest data points within 1.5 × inter-quartile range from the edges of the box. Asterisks indicate nucleotide positions where differences in ProT distributions selected by (D) Mad vs. Max, (E) Mad vs. Myc, and (F) Max vs. Myc were significant based on a K-S test (black asterisk, P < 0.05; red asterisk, P < 0.001). ProT features were symmetrized based on the palindromic E-box. Figure adapted from (Yang, Zhou et al. 2014).

Figure 2.8 DNA shape analysis of human bHLH TFBSs. Heat maps illustrate HelT selections of (A) Mad, (B) Max, and (C) Myc. (D-F) HelT preferences of the three TFs were compared using box plots. Boxes represent the median (line inside the box), 1st and 3rd quartiles (edges of the box), and the whiskers define the furthest data points within 1.5 × inter-quartile range from the edges of the box. Asterisks indicate nucleotide positions where differences in HelT distributions selected by (D) Mad vs. Max, (E) Mad vs. Myc, and (F) Max vs. Myc were significant based on a K-S test (black asterisk, P < 0.05; red asterisk, P < 0.001). HelT features were symmetrized based on the palindromic E-box. Figure adapted from (Yang, Zhou et al. 2014).

MGW (Figure 2.5)
N.P.      Mad vs. Max   Mad vs. Myc   Max vs. Myc   Myc1 vs. Myc2
-16       8.73E-01      5.55E-01      6.50E-01      1.00E+00
-15       9.11E-01      4.65E-01      4.56E-01      1.00E+00
-14       8.02E-01      8.40E-01      9.80E-01      1.00E+00
-13       6.20E-01      8.47E-01      1.00E+00      1.00E+00
-12       9.08E-01      4.19E-01      9.98E-01      1.00E+00
-11       8.64E-01      2.08E-01      7.14E-01      1.00E+00
-10       6.28E-01      4.74E-02      6.40E-01      1.00E+00
-9        8.75E-01      2.24E-01      8.93E-01      1.00E+00
-8        3.07E-01      1.06E-01      9.11E-01      1.00E+00
-7        1.75E-01      8.43E-02      8.08E-01      1.00E+00
-6        7.12E-01      1.20E-04      1.13E-04      9.98E-01
-5        9.62E-02      2.52E-03      3.19E-03      1.00E+00
-4        3.26E-02      7.34E-06      1.26E-01      1.00E+00
-3        4.27E-01      1.38E-10      9.37E-11      1.00E+00
-2        2.32E-02      8.46E-06      6.24E-08      9.99E-01
-1        5.98E-01      0.00E+00      0.00E+00      2.37E-01
+1        5.98E-01      0.00E+00      0.00E+00      2.37E-01
+2        2.32E-02      8.46E-06      6.24E-08      9.99E-01
+3        4.27E-01      1.38E-10      9.37E-11      1.00E+00
+4        3.26E-02      7.34E-06      1.26E-01      1.00E+00
+5        9.62E-02      2.52E-03      3.19E-03      1.00E+00
+6        7.12E-01      1.20E-04      1.13E-04      9.98E-01
+7        1.75E-01      8.43E-02      8.08E-01      1.00E+00
+8        3.07E-01      1.06E-01      9.11E-01      1.00E+00
+9        8.75E-01      2.24E-01      8.93E-01      1.00E+00
+10       6.28E-01      4.74E-02      6.40E-01      1.00E+00
+11       8.64E-01      2.08E-01      7.14E-01      1.00E+00
+12       9.08E-01      4.19E-01      9.98E-01      1.00E+00
+13       6.20E-01      8.47E-01      1.00E+00      1.00E+00
+14       8.02E-01      8.40E-01      9.80E-01      1.00E+00
+15       9.11E-01      4.65E-01      4.56E-01      1.00E+00
+16       8.73E-01      5.55E-01      6.50E-01      1.00E+00

Roll (Figure 2.6)
N.P.      Mad vs. Max   Mad vs. Myc   Max vs. Myc   Myc1 vs. Myc2
-17|-16   8.93E-01      4.24E-01      9.72E-01      9.84E-01
-16|-15   4.86E-01      1.48E-01      9.56E-01      1.00E+00
-15|-14   4.73E-01      2.09E-01      5.77E-01      1.00E+00
-14|-13   6.65E-01      2.58E-01      9.82E-01      1.00E+00
-13|-12   9.21E-01      4.53E-01      9.80E-01      9.99E-01
-12|-11   3.52E-01      6.56E-02      9.76E-01      1.00E+00
-11|-10   2.62E-01      4.11E-02      4.56E-01      1.00E+00
-10|-9    2.39E-01      2.17E-02      8.32E-01      9.92E-01
-9|-8     1.35E-01      1.25E-02      8.62E-01      1.00E+00
-8|-7     1.32E-01      6.14E-02      1.00E+00      1.00E+00
-7|-6     5.04E-02      2.82E-05      1.31E-01      1.00E+00
-6|-5     1.14E-01      3.99E-05      1.05E-01      1.00E+00
-5|-4     6.24E-03      3.60E-02      1.74E-05      9.55E-01
-4|-3     1.59E-02      6.28E-02      8.56E-02      1.00E+00
-3|-2     6.62E-03      0.00E+00      4.44E-16      9.15E-01
-2|-1     7.69E-01      0.00E+00      0.00E+00      6.89E-01
-1|+1     5.57E-01      0.00E+00      9.55E-15      2.30E-01
+1|+2     7.69E-01      0.00E+00      0.00E+00      6.89E-01
+2|+3     6.62E-03      0.00E+00      4.44E-16      9.15E-01
+3|+4     1.59E-02      6.28E-02      8.56E-02      1.00E+00
+4|+5     6.24E-03      3.60E-02      1.74E-05      9.55E-01
+5|+6     1.14E-01      3.99E-05      1.05E-01      1.00E+00
+6|+7     5.04E-02      2.82E-05      1.31E-01      1.00E+00
+7|+8     1.32E-01      6.14E-02      1.00E+00      1.00E+00
+8|+9     1.35E-01      1.25E-02      8.62E-01      1.00E+00
+9|+10    2.39E-01      2.17E-02      8.32E-01      9.92E-01
+10|+11   2.62E-01      4.11E-02      4.56E-01      1.00E+00
+11|+12   3.52E-01      6.56E-02      9.76E-01      1.00E+00
+12|+13   9.21E-01      4.53E-01      9.80E-01      9.99E-01
+13|+14   6.65E-01      2.58E-01      9.82E-01      1.00E+00
+14|+15   4.73E-01      2.09E-01      5.77E-01      1.00E+00
+15|+16   4.86E-01      1.48E-01      9.56E-01      1.00E+00
+16|+17   8.93E-01      4.24E-01      9.72E-01      9.84E-01

ProT (Figure 2.7)
N.P.      Mad vs. Max   Mad vs. Myc   Max vs. Myc   Myc1 vs. Myc2
-16       2.67E-02      1.31E-02      9.48E-01      9.96E-01
-15       6.15E-03      9.38E-03      9.99E-01      1.00E+00
-14       1.62E-02      1.50E-03      9.14E-01      9.81E-01
-13       1.70E-02      1.39E-02      8.77E-01      9.86E-01
-12       6.49E-03      1.28E-03      1.00E+00      8.99E-01
-11       8.50E-04      7.38E-05      9.85E-01      1.00E+00
-10       4.50E-05      4.25E-05      9.75E-01      9.30E-01
-9        2.78E-03      1.69E-03      1.00E+00      1.00E+00
-8        9.68E-05      8.65E-05      9.58E-01      1.00E+00
-7        5.66E-05      4.06E-05      9.63E-01      9.78E-01
-6        1.70E-04      1.08E-06      4.93E-01      9.84E-01
-5        2.22E-02      2.04E-02      5.58E-01      1.00E+00
-4        4.72E-02      0.00E+00      3.53E-14      9.98E-01
-3        1.39E-02      0.00E+00      0.00E+00      1.00E+00
-2        8.79E-03      8.70E-11      8.28E-06      9.50E-01
-1        6.26E-01      0.00E+00      5.41E-14      2.59E-01
+1        6.26E-01      0.00E+00      5.41E-14      2.59E-01
+2        8.79E-03      8.70E-11      8.28E-06      9.50E-01
+3        1.39E-02      0.00E+00      0.00E+00      1.00E+00
+4        4.72E-02      0.00E+00      3.53E-14      9.98E-01
+5        2.22E-02      2.04E-02      5.58E-01      1.00E+00
+6        1.70E-04      1.08E-06      4.93E-01      9.84E-01
+7        5.66E-05      4.06E-05      9.63E-01      9.78E-01
+8        9.68E-05      8.65E-05      9.58E-01      1.00E+00
+9        2.78E-03      1.69E-03      1.00E+00      1.00E+00
+10       4.50E-05      4.25E-05      9.75E-01      9.30E-01
+11       8.50E-04      7.38E-05      9.85E-01      1.00E+00
+12       6.49E-03      1.28E-03      1.00E+00      8.99E-01
+13       1.70E-02      1.39E-02      8.77E-01      9.86E-01
+14       1.62E-02      1.50E-03      9.14E-01      9.81E-01
+15       6.15E-03      9.38E-03      9.99E-01      1.00E+00
+16       2.67E-02      1.31E-02      9.48E-01      9.96E-01

HelT (Figure 2.8)
N.P.      Mad vs. Max   Mad vs. Myc   Max vs. Myc   Myc1 vs. Myc2
-17|-16   4.17E-01      1.90E-01      9.99E-01      1.00E+00
-16|-15   5.57E-01      7.57E-02      8.13E-01      1.00E+00
-15|-14   1.78E-01      5.63E-01      8.77E-01      1.00E+00
-14|-13   7.98E-01      1.67E-01      5.90E-01      1.00E+00
-13|-12   1.65E-01      4.27E-01      9.73E-01      9.70E-01
-12|-11   6.79E-01      2.19E-01      9.94E-01      1.00E+00
-11|-10   4.92E-02      1.20E-01      9.74E-01      9.89E-01
-10|-9    2.45E-01      8.25E-02      9.06E-01      1.00E+00
-9|-8     1.81E-01      1.79E-02      9.63E-01      1.00E+00
-8|-7     3.89E-02      1.49E-01      9.25E-01      9.65E-01
-7|-6     1.32E-01      1.76E-01      1.00E+00      1.00E+00
-6|-5     7.02E-01      1.95E-01      4.04E-01      1.00E+00
-5|-4     1.25E-02      1.27E-01      6.73E-02      1.00E+00
-4|-3     1.60E-01      3.24E-05      9.13E-09      1.00E+00
-3|-2     3.79E-03      9.82E-12      1.26E-06      7.63E-01
-2|-1     3.46E-01      2.28E-08      3.53E-10      8.72E-01
-1|+1     5.82E-01      0.00E+00      2.91E-13      2.91E-01
+1|+2     3.46E-01      2.28E-08      3.53E-10      8.72E-01
+2|+3     3.79E-03      9.82E-12      1.26E-06      7.63E-01
+3|+4     1.60E-01      3.24E-05      9.13E-09      1.00E+00
+4|+5     1.25E-02      1.27E-01      6.73E-02      1.00E+00
+5|+6     7.02E-01      1.95E-01      4.04E-01      1.00E+00
+6|+7     1.32E-01      1.76E-01      1.00E+00      1.00E+00
+7|+8     3.89E-02      1.49E-01      9.25E-01      9.65E-01
+8|+9     1.81E-01      1.79E-02      9.63E-01      1.00E+00
+9|+10    2.45E-01      8.25E-02      9.06E-01      1.00E+00
+10|+11   4.92E-02      1.20E-01      9.74E-01      9.89E-01
+11|+12   6.79E-01      2.19E-01      9.94E-01      1.00E+00
+12|+13   1.65E-01      4.27E-01      9.73E-01      9.70E-01
+13|+14   7.98E-01      1.67E-01      5.90E-01      1.00E+00
+14|+15   1.78E-01      5.63E-01      8.77E-01      1.00E+00
+15|+16   5.57E-01      7.57E-02      8.13E-01      1.00E+00
+16|+17   4.17E-01      1.90E-01      9.99E-01      1.00E+00

Table 2.1 DNA shape analysis of human bHLH TFBSs.
The table provides the Kolmogorov–Smirnov (K-S) test p-values for the comparison of DNA shape selections of the Mad2-Max heterodimer (‘Mad’), Max homodimer (‘Max’), and c-Myc-Max heterodimer (‘Myc’). The central E-box location at nucleotide positions (N.P.) -3 to +3 is highlighted in gray. Statistically significant differences in the selection of DNA shape parameters MGW (Figure 2.5), Roll (Figure 2.6), ProT (Figure 2.7), and HelT (Figure 2.8) are highlighted in bold black for p < 0.05 (black asterisks in box plots) and in bold red for p < 0.001 (red asterisks in box plots). As a negative control, two independent experiments for Myc (‘Myc1’ and ‘Myc2’) were also compared and DNA shape selections were found to be essentially identical.

To further test whether the subtle differences in DNA shape features of the binding sites of the three paralogous bHLH TFs Mad, Max and Myc contribute to binding specificity beyond nucleotide sequence, we used L2-regularized MLR and 10-fold cross-validation to assess the prediction accuracy of models using sequence alone compared to a combination of sequence and shape parameters (MGW, Roll, ProT and HelT). We found that sequence-only models predicted the experimentally determined DNA binding specificities with R² values between 0.65 and 0.70, whereas shape-augmented models reached R² values between 0.80 and 0.88 (Figure 2.4C). Thus, by incorporating DNA shape features into binding specificity predictions, we achieved improvements of ∼26% for Mad, ∼26% for Max and ∼23% for Myc. These improvements indicate an important contribution of DNA shape features in protein–DNA recognition. In this model, DNA sequence and shape features were encoded using a strategy similar to our previous study of yeast bHLH TFs (Gordan, Shen et al. 2013), but here we also considered variations within the E-box core motif. Using randomly shuffled shape parameters did not lead to any improvement over the sequence-based model (Figure 2.4C).
These results clearly demonstrate the added value of shape-augmented descriptions of TFBSs in the modelling of DNA binding specificities.

2.3.2 DNA shape preferences of Hox TFs in mouse

We previously demonstrated that anterior and posterior Drosophila Hox proteins prefer distinct minor groove geometries (Joshi, Passner et al. 2007, Slattery, Riley et al. 2011), and recently analyzed DNA shape preferences of mouse homeodomain TFs (Dror, Zhou et al. 2014) derived from universal PBM experiments (Berger, Badis et al. 2008). Here, we show that the distinct DNA shape preferences of anterior and posterior Hox proteins hold for mouse, based on comparisons of their MGW (Figure 2.9A), Roll, ProT and HelT profiles (Figure 2.10). Using Euclidean distances (EDs) between MGW profiles, we generated a dendrogram revealing relationships between DNA binding specificities of mouse Hox proteins, and demonstrated clear distinctions in MGW preferences between anterior and posterior Hox TFs in mouse (Figure 2.9B), similar to the distinction previously observed for Drosophila (Slattery, Riley et al. 2011). Thus, the TFBSshape database can be utilized to study relationships in DNA binding specificities of closely related TFs within protein families.

Figure 2.9 DNA shape features distinguish TFBSs of anterior and posterior Hox proteins in mouse. (A) Heat map illustrating average MGW profiles of binding sites preferred by mouse Hox TFs determined by universal PBM (28). (B) A dendrogram based on EDs between average MGW profiles of preferred TFBSs demonstrates the different DNA shape preferences of anterior and posterior Hox TFs in mouse. Figure adapted from (Yang, Zhou et al. 2014).

Figure 2.10 DNA shape analysis of mouse Hox TFBSs. Heat maps illustrate (A) Roll, (B) ProT, and (C) HelT preferences of monomeric mouse Hox TFs determined by universal PBM experiments (26). Figure adapted from (Yang, Zhou et al. 2014).
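The construction of a dendrogram from EDs between average MGW profiles (Figure 2.9B) can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce the figure: the text does not specify the linkage rule, so average linkage is assumed here, and the TF names and MGW values are hypothetical placeholders.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two equally long shape profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def cluster_distance(c1, c2, profiles):
    """Average-linkage distance: mean ED over all cross-cluster pairs."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(euclidean(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


def build_dendrogram(profiles):
    """Return the merge history as (cluster_1, cluster_2, distance) tuples."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(pair[0], pair[1], profiles))
        merges.append((c1, c2, cluster_distance(c1, c2, profiles)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical average MGW profiles (angstroms); not data from this study.
toy_profiles = {
    "anterior_a": [5.0, 4.2, 5.1, 4.3],
    "anterior_b": [5.1, 4.3, 5.0, 4.2],
    "posterior_a": [5.0, 4.2, 5.6, 5.5],
    "posterior_b": [5.2, 4.1, 5.7, 5.6],
}
merges = build_dendrogram(toy_profiles)
```

In this toy example the two "anterior" profiles merge first, the two "posterior" profiles second, and the two groups are joined last, which is the kind of separation the dendrogram is meant to expose.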
2.4 Conclusions

We have demonstrated that augmenting existing motif databases with DNA shape features provides new insights into the mechanisms used by TFs to achieve DNA binding specificity. Analyzing DNA shape preferences can help to differentiate between similar DNA binding specificities of paralogous TFs (Gordan, Shen et al. 2013). Such studies can be generalized to compare DNA binding specificities of homologous TFs from different species. Comparisons of structural features of TFBSs could potentially reveal evolutionary relationships between TFs based on the shape of their DNA binding sites (Dror, Zhou et al. 2014). Integrating TFBSshape with the motif databases JASPAR (Mathelier, Zhao et al. 2014) and UniPROBE (Robasky and Bulyk 2011) makes DNA shape information readily available for known motifs. Whereas TFBSshape currently contains data for 23 species from the open-access motif databases JASPAR and UniPROBE, species-specific databases (Zhu, Christensen et al. 2011, de Boer and Hughes 2012) can easily be integrated to expand the repertoire of datasets for comparative analysis of TF binding specificities. The availability of DNA shape features for TFBSs suggests many further applications, such as shape-augmented genome annotations (Parker and Tullius 2011) and TFBS predictions using DNA structural features (Meysman, Thanh et al. 2011, Hooghe, Broos et al. 2012, Maienschein-Cline, Dinner et al. 2012).

Chapter 3 DNA shape readout in sequence-specific and non-specific DNA binding proteins

Sections 3.1.2 and 3.1.3 are reproduced from Tianyin Zhou, Lin Yang, Yan Lu, Iris Dror, Ana Carolina Dantas Machado, Tahereh Ghane, Rosa Di Felice, and Remo Rohs: DNAshape: a method for the high-throughput prediction of DNA structural features on a genome-wide scale. Nucleic Acids Res. 41, W56-62 (2013), and Tsu-Pei Chiu, Lin Yang, Tianyin Zhou, Bradley J. Main, Stephen C.J. Parker, Sergey V. Nuzhdin, Thomas D.
Tullius, and Remo Rohs: GBshape: a genome browser database for DNA shape annotations. Nucleic Acids Res. 43, D103-109 (2015).

Section 3.2 is reproduced from Namiko Abe, Iris Dror, Lin Yang, Matthew Slattery, Tianyin Zhou, Harmen J. Bussemaker, Remo Rohs, and Richard S. Mann: Deconvolving the recognition of DNA sequence from shape. Cell 161, 307-318 (2015).

3.1 DNA shape readout in nucleosomes

3.1.1 Introduction

The nucleosome is a protein-DNA complex in which approximately 147 bp of DNA wrap around a histone core in about 1.65 left-handed superhelical turns (Davey, Sargent et al. 2002). The histone core consists of eight histone proteins, i.e., two copies each of the histones H2A, H2B, H3, and H4. In eukaryotes, nucleosome formation packs the long DNA chain into chromatin, which can fit into the cellular nucleus. In addition to this architectural role, nucleosomes play fundamental roles in living cells by controlling DNA accessibility. For example, it is proposed that cell lineages are determined by the repertoires of accessible regulatory regions that are reflected in the nucleosome landscape. This epigenomic landscape is further shaped by TF-DNA binding events, which depend on the cell-autonomous and environment-stimulated expression of TFs (Ostuni and Natoli 2013). Because of the functional roles of nucleosomes in gene regulation, it is important to study how nucleosomes are distributed in the genome, i.e., nucleosome positioning, and the mechanisms underlying nucleosome positioning preferences.

It has been revealed that one determinant of nucleosome positioning is the intrinsic DNA sequence preference, as in vitro construction of nucleosomes suggested that different DNA sequences have different energy barriers in forming nucleosomes (Lowary and Widom 1998). In vivo nucleosome sequences exhibit a 10-bp periodicity signal in the distribution of certain dinucleotides and A-tracts within nucleosomes (Segal, Fondufe-Mittendorf et al.
2006, Field, Kaplan et al. 2008, Rohs, West et al. 2009b). These sequence signals, however, lack a mechanistic explanation. As shown in the nucleosome structure, the DNA bends to wrap around the histone octamer, resulting in a 10-bp periodicity in the width of the DNA minor groove and the electrostatic potential in the groove (Davey, Sargent et al. 2002, Rohs, West et al. 2009b). Moreover, DNA A-tracts have been shown to favor a narrow DNA minor groove and a slight DNA bending. Taken together, this suggests a mechanism in which nucleosome formation is facilitated by the structural characteristics of the DNA sequence, or DNA shape readout. Signals such as the 10-bp periodicity in dinucleotide and A-tract occurrences would then be secondary effects of such structural preferences. This hypothesis is supported by analyses of DNA structural features in nucleosome sequences from different species.

3.1.2 Methods and materials

Nucleosome sequences

We obtained experimentally derived nucleosome sequence data for different species, including Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Homo sapiens, from published studies. In particular, we collected 23076 nucleosome sequences from Saccharomyces cerevisiae (Field, Kaplan et al. 2008) and 25654 nucleosome sequences from Drosophila melanogaster (Mavrich, Jiang et al. 2008). The modENCODE consortium generated extensive lists of nucleosome sequences for Caenorhabditis elegans, Drosophila melanogaster and human (Ho, Jung et al. 2014). From this source, we collected ~3.6 million Caenorhabditis elegans nucleosome sequences, ~3.8 million Drosophila melanogaster nucleosome sequences, and ~13.1 million human nucleosome sequences.

Predicting DNA structural features for nucleosome sequences

DNA structural features of the nucleosome sequences can be predicted using two recently developed tools that are based on different methodologies, namely, the DNAshape (Zhou, Yang et al.
2013) approach and the ORChID2 (Bishop, Rohs et al. 2011) approach.

DNAshape is based on all-atom Monte Carlo simulations of short DNA fragments of 12 to 27 bp in length. These short DNA sequences cover all possible nucleotide pentamers, so that a structural profile can be established for each pentamer and compiled into a library. This profile includes the minor groove width (MGW) and propeller twist (ProT) for the central base pair, as well as the roll (Roll) and helix twist (HelT) for the two central base-pair steps. In order to predict the DNA structural features MGW, ProT, Roll, and HelT for a new sequence, a 5-bp wide sliding window is applied to the sequence, and the structural features for the pentamer covered by the window are looked up in the compiled library of pentamer structural profiles (Zhou, Yang et al. 2013).

ORChID2 is based on experimental measurements of the cleavage rate of the DNA backbone by the hydroxyl (OH) radical. The OH-cleavage rate depends mostly on the solvent accessibility of the 5’, 5’’, and 4’ hydrogen atoms that lie on the outer edge of the DNA minor groove. The solvent accessibility in turn depends on the width of the DNA minor groove. The wider the minor groove is, the more accessible the hydrogens are to the solvent, and thus the higher the OH-cleavage rate is. As a result, there is a direct correlation between the measured OH-cleavage rate on the DNA and the corresponding DNA minor groove width. One can perform hydroxyl radical cleavage experiments on different DNA sequences to infer the cleavage rate for all possible DNA pentamers. These metrics, which reflect the DNA minor groove width, can be similarly compiled into a query table and used to infer the minor groove width for new DNA sequences using the sliding-window strategy described above (Bishop, Rohs et al. 2011).
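The sliding-window lookup described above can be sketched in Python as follows. This is a minimal illustration and not the DNAshape implementation: the function names are hypothetical, the two table entries are placeholder numbers rather than values from the DNAshape library, and the sketch covers only features assigned to the central base pair (such as MGW), assuming that a pentamer and its reverse complement share the same value so that the table needs to store only one of the two.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def predict_central_feature(sequence, pentamer_table):
    """Assign to each position the table value of the pentamer centered on it.

    The first and last two positions have no full pentamer and receive None.
    """
    if len(sequence) < 5:
        return [None] * len(sequence)
    profile = [None, None]
    for start in range(len(sequence) - 4):
        pentamer = sequence[start:start + 5]
        if pentamer not in pentamer_table:
            # Assumption: only one orientation of each pentamer is stored.
            pentamer = reverse_complement(pentamer)
        profile.append(pentamer_table[pentamer])
    profile.extend([None, None])
    return profile


# Placeholder MGW values in angstroms; a real table would cover all pentamers.
toy_table = {"AAAAA": 3.4, "AAAAC": 4.1}
profile = predict_central_feature("AAAAAC", toy_table)
```

Step features such as Roll and HelT would need one extra detail that is omitted here: each pentamer carries two central steps, and their order is swapped when the reverse complement is looked up.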
3.1.3 Results

DNAshape and ORChID2 predictions of DNA minor groove width are highly correlated

For the 23076 yeast and 25654 fly nucleosome sequences, we used DNAshape to predict the DNA minor groove width and compared these minor groove width profiles, averaged over all nucleosome sequences, to the published ORChID2 profiles of the same data (Figure 3.1). For the more extensive modENCODE data, we used DNAshape and ORChID2 to predict the DNA minor groove width and OH-cleavage intensity, respectively, and compared the two profiles averaged over all the nucleosome sequences (Figure 3.2). All these comparisons consistently revealed strong correlation between the DNAshape and ORChID2 predictions. This agreement between two methods of completely different nature reinforces the reliability of the DNA minor groove width prediction.

Figure 3.1 Agreement between DNA minor groove width predicted by DNAshape and hydroxyl (OH) cleavage intensity predicted by ORChID2. The minor groove width is predicted for (A) 23076 yeast and (B) 25654 fly in vivo nucleosome sequences, and its average is compared with previously published OH-cleavage data derived from ORChID2 (Bishop, Rohs et al. 2011). The profiles derived from the two different methods are highly correlated, both revealing the 10-bp periodicity observed in dinucleotide distributions (Field, Kaplan et al. 2008, Mavrich, Jiang et al. 2008). Numbering of the nucleotide position starts with 0 for the central base pair. Figure adapted from (Zhou, Yang et al. 2013).

Figure 3.2 Variation in minor groove width predicted by DNAshape (blue) and OH-cleavage intensity predicted by ORChID2 (green) on average derived from (A) Caenorhabditis elegans, (B) Drosophila melanogaster and (C) human nucleosome sequences. Numbering of the nucleotide position starts with -1 and 1 for the central two base pairs, respectively. Figure adapted from (Chiu, Yang et al. 2015).
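The comparison of averaged profiles, and a simple check for a 10-bp periodicity, can be sketched as follows. This is an illustrative example only: the function names are hypothetical, the profiles are synthetic cosine signals rather than nucleosome data, and the lagged self-correlation is one possible way to expose a period, not necessarily the analysis used for the figures.

```python
import math


def average_profile(profiles):
    """Position-wise mean over equally long, aligned profiles."""
    return [sum(column) / len(column) for column in zip(*profiles)]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def lag_correlation(profile, lag):
    """Correlation of a profile with itself shifted by `lag` positions."""
    return pearson(profile[:-lag], profile[lag:])


# Synthetic stand-ins: an MGW-like signal with a 10-bp period and a
# cleavage-like signal that is a linear function of it.
mgw = [5.0 + 0.2 * math.cos(2.0 * math.pi * i / 10.0) for i in range(60)]
cleavage = [0.5 * value + 1.0 for value in mgw]

agreement = pearson(mgw, cleavage)          # close to +1 by construction
in_phase = lag_correlation(mgw, 10)         # close to +1 for a 10-bp period
out_of_phase = lag_correlation(mgw, 5)      # close to -1 at half a period
```

For real averaged profiles the lag-10 correlation would be well below 1, so its value relative to neighboring lags, rather than its absolute size, is what would indicate periodicity.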
10-bp periodicity of DNA structural features in nucleosome sequences

Apart from the agreement between DNAshape and ORChID2 predictions, which in a sense serves as a cross-validation of the DNA minor groove width prediction, we observed clear 10-bp periodicity signals in the MGW property of nucleosome sequences from different organisms (Figure 3.1, Figure 3.2). The signals in the modENCODE data are generally smoother and more apparent, possibly due to the much larger number of nucleosome sequences obtained from the high-throughput experiments (Ho, Jung et al. 2014). In addition to MGW, DNAshape is able to predict DNA structural parameters including Roll, HelT and ProT, which all displayed a 10-bp periodicity signal (Figure 3.3). Whereas the 10-bp periodicity was shared between human, fly and worm, details of the DNA shape profiles of nucleosome sequences varied across species due to the different nucleotide compositions of these genomes. Analysis of the other DNA shape features Roll, HelT and ProT further confirmed the shared 10-bp periodicity as well as distinctions in DNA shape between nucleosome sequences in these genomes (Figure 3.3). The maxima and minima of the MGW, Roll and ProT patterns overlapped, whereas the troughs in the HelT patterns matched the peaks in the other parameters, indicating a local helix unwinding at positions where a more positive Roll locally widens the minor groove.

Figure 3.3 Variation in Roll (blue), HelT (green) and ProT (red) on average in nucleosome sequences from (A) Caenorhabditis elegans, (B) Drosophila melanogaster and (C) human genomes. Numbering of the nucleotide position starts with -1 and 1 for the central two base pairs, respectively. Figure adapted from (Chiu, Yang et al. 2015).

3.1.4 Conclusions and discussions

The periodicity observed in the structural properties of nucleosome sequences suggests a mechanism in which nucleosome formation has intrinsic DNA sequence preferences through DNA shape readout.
The previously observed 10-bp periodicity in sequence features may simply be a secondary effect of the DNA structural characteristics selected by nucleosomes. However, the data do not rule out the possibility that the cause-and-effect relationship is the other way around, as the DNA structure is sequence-dependent. In this alternative model, however, the mechanism for the selection of DNA sequence features is not obvious. It is also important to note that the 10-bp periodicity signals, although observable, are extremely weak, as indicated by the magnitude of variation in the structural profiles (Figure 3.1, Figure 3.2, Figure 3.3). This suggests that the intrinsic DNA sequence preference alone cannot completely explain in vivo nucleosome positioning. Other mechanisms of in vivo nucleosome positioning, such as the “barrier” model, have been suggested. In the barrier model, nucleosomes form against a strongly positioned barrier nucleosome, e.g., the +1 nucleosome in promoters, resulting in an array of regularly positioned nucleosomes whose regularity decays with distance from the barrier. The barrier nucleosomes are positioned not only by intrinsic sequence preferences, but also by transcription-related events such as transcription factor binding and RNA polymerase II occupancy (Mavrich, Ioshikhes et al. 2008, Zhang, Moqtaderi et al. 2009).

3.2 DNA shape readout in Drosophila Hox TFs

3.2.1 Introduction

Precise control of gene expression relies on the ability of transcription factors to recognize specific DNA binding sites. Two distinct modes of protein-DNA recognition have been described: base readout, the formation of hydrogen bonds or hydrophobic contacts with functional groups of the DNA bases, primarily in the major groove (Seeman, Rosenberg et al. 1976), and shape readout, the recognition of the 3D structure of the DNA double helix (Rohs, West et al. 2009a).
The importance of shape readout has been inferred from crystal structures of protein-DNA complexes (Joshi, Passner et al. 2007, Meijsing, Pufall et al. 2009, Rohs, West et al. 2009b, Kitayner, Rozenberg et al. 2010) and from structural features of DNAs selected by DNA-binding proteins in high-throughput binding assays (Slattery, Riley et al. 2011, Gordan, Shen et al. 2013, Lazarovici, Zhou et al. 2013, Dror, Zhou et al. 2014, Yang, Zhou et al. 2014). However, as DNA shape is a function of the nucleotide sequence, it is difficult to tease apart whether a DNA binding protein favors a particular binding site because it recognizes its nucleotide sequence or, alternatively, structural features of the DNA molecule. Thus, whether DNA shape is a direct determinant of protein-DNA recognition remains an open question. Shape readout is potentially an important mode of DNA recognition; moreover, if DNA binding proteins directly use shape readout, then incorporating DNA structural information should significantly improve models for predicting DNA binding specificity, which remains challenging with existing methods (Weirauch, Cote et al. 2013, Slattery, Zhou et al. 2014).

We previously described a role for DNA shape in the recognition of specific binding sites by the Hox family of transcription factors, which in vertebrates and Drosophila specify the unique characteristics of embryonic segments along the anterior-posterior axis (Joshi, Passner et al. 2007, Mann, Lelli et al. 2009, Slattery, Riley et al. 2011). Using in vitro selection combined with deep sequencing (SELEX-seq), which examines millions of sequences in an unbiased manner, we found that while Hox proteins bind highly similar sequences as monomers, heterodimerization with the cofactor Extradenticle (Exd) uncovers latent DNA binding specificities (Slattery, Riley et al. 2011). High-throughput DNA shape predictions (Zhou, Yang et al.
2013) for sequences selected by each Exd-Hox complex (containing the motif NGAYNNAY) revealed that anterior and posterior Hox proteins prefer sequences with distinct minor groove (MG) topographies. Whereas all Exd-Hox complexes preferred sequences with a narrow MG near the AY of the Exd half-site (NGAY), only anterior Hox proteins (Lab, Pb, Dfd, and Scr) selected for sequences containing an additional minimum in MG width at the AY of the Hox half-site (NNAY) (Figure 3.4) (Slattery, Riley et al. 2011). However, this study, as well as analyses of other protein-DNA complexes (Gordan, Shen et al. 2013, Yang, Zhou et al. 2014), did not rule out the possibility that these shape preferences were merely a secondary consequence of base readout preferences.

Figure 3.4 Anterior and Posterior Hox Proteins Select for Sequences with Distinct Minor Groove Shapes. Heat map of the average MGW at each position of 16-mers selected by each Exd-HoxWT heterodimer. Except for Pb and UbxIa*, the SELEX data from (Slattery, Riley et al. 2011) were re-analyzed using a common error cutoff of 20%. UbxIa* denotes new SELEX data, collected because of low counts in the previous dataset. In addition, because the previous dataset used a truncated form of Pb, we carried out a new SELEX experiment with full-length Pb. Dark green represents narrow minor grooves while white represents wider minor grooves. The numbers to the right of the heat map indicate the number of sequences analyzed for each complex. Black lines demarcate where Arg5 inserts into the minor groove (A5Y6) and, for Scr, where Arg3 and His-12 insert into the minor groove (A9Y10). Figure adapted from (Abe, Dror et al. 2015).

A key prediction of the shape-recognition model is that if the residues that recognize a distinct structural feature of the DNA, for example a local minimum in MG width, are mutated then the transcription factor should no longer prefer to bind DNA sequences containing that feature.
Alternatively, if the structural feature were merely a byproduct of the DNA sequences selected by a base readout mechanism in the major groove, the binding sites preferred by the mutant factor would still contain that feature. Here, we tested this prediction using the anterior Hox protein Scr, which binds DNA with Exd to regulate Scr-specific target genes during Drosophila embryogenesis (Ryoo and Mann 1999). In a co-crystal structure of the Exd-Scr heterodimer bound to an Scr-specific target site, fkh250 (AGATTAAT), both shape readout and base readout mechanisms were evident (Joshi, Passner et al. 2007). In agreement with the SELEX-seq data (Figure 3.4), the fkh250 binding site contained two MG width minima, one recognized by Scr residues His-12 and Arg3, and the second recognized by Scr residue Arg5 (Joshi, Passner et al. 2007). As these residues did not form hydrogen bonds with bases, the implication is that they use shape readout, and not base readout, as their sole mode of DNA recognition.

To test if Hox proteins directly use shape readout, we characterized the properties of mutant proteins that, based on the Exd-Scr co-crystal structures, are predicted to either lose or gain the ability to read specific MG topographies. When MG-inserting residues of Scr were mutated to alanines, thus impairing its ability to use shape readout, the mutant proteins no longer preferred sequences containing these MG width minima. Conversely, when MG-recognizing residues from Scr were transferred to a Hox protein that normally does not select for this structural feature, the proteins selected binding sites with two MG minima in vitro. Finally, we show that taking DNA shape features into consideration significantly improved the ability to predict Exd-Hox binding site specificities compared to models that only depend on DNA sequence.
Together, these findings demonstrate that transcription factors directly use shape readout for protein-DNA recognition, and that in silico prediction of DNA binding specificities will benefit from taking DNA structural features into consideration.

3.2.2 Methods and materials

SELEX-Seq

SELEX-Seq experiments were carried out for Exd-Hox mutant heterodimers defined in Figure 3.5 and Figure 3.7. Fifth-order Markov models were constructed using Round 0 (R0) sequences to predict the number of 12-, 14-, and 16-mer sequences in each initial library as described (Slattery, Riley et al. 2011, Riley, Slattery et al. 2014). R3 data were used for all Hox variants in order to optimize counts and minimize sampling error. 12-, 14-, and 16-mer relative binding affinities were generated by taking the cube root of the enrichment ratio (counts in R3 divided by the expected counts predicted using the Markov model derived from R0 data).

L2-Regularized Multiple Linear Regression Models for Predicting Binding Specificities Quantitatively

To predict the relative binding affinity for each of the sequences bound by any of the Hox monomers, Exd-Hox heterodimers, and Exd-Hox mutant heterodimers, we trained L2-regularized multiple linear regression (MLR) models (Yang, Zhou et al. 2014). To measure the predictive power of the models, a 10-fold cross-validation was performed with an embedded 10-fold cross-validation on the training set to determine the optimal λ parameter. We trained different categories of models that (i) encoded the nucleotide sequence of each of the bound sequences as binary features (sequence models), (ii) encoded different combinations of the DNA shape features MG width, ProT, Roll, and HelT (shape models), and (iii) combined nucleotide sequence and DNA shape features at the corresponding position (sequence+shape models).
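To make the construction of one training example concrete, the following Python sketch computes a relative binding affinity from SELEX counts and builds feature vectors for the sequence and sequence+shape model categories. It is an assumption-laden illustration rather than the code used in this work: the function names are hypothetical, four binary features per position is one common way to encode nucleotides, and the shape values passed in are placeholders.

```python
BASES = "ACGT"


def relative_affinity(observed_count, expected_count, rounds=3):
    """Enrichment ratio after `rounds` of selection, reduced to one round.

    For R3 data this is the cube root of (observed / expected).
    """
    return (observed_count / expected_count) ** (1.0 / rounds)


def one_hot(sequence):
    """Encode each nucleotide as four binary features (A, C, G, T)."""
    features = []
    for base in sequence:
        features.extend(1.0 if base == candidate else 0.0 for candidate in BASES)
    return features


def encode(sequence, shape_profiles=None):
    """Sequence model if no shape is given, sequence+shape model otherwise.

    `shape_profiles` is a list of per-position value lists, e.g. one list
    for MG width and one for ProT (already normalized).
    """
    features = one_hot(sequence)
    for profile in shape_profiles or []:
        features.extend(profile)
    return features


# Placeholder numbers for illustration only.
affinity = relative_affinity(observed_count=8.0, expected_count=1.0)
sequence_features = encode("AC")
combined_features = encode("AC", shape_profiles=[[0.3, -0.1]])
```

A shape-only model would simply omit the one-hot part; the regression itself is independent of which feature categories are concatenated.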
To measure the predictive power of each of the models, we calculated the coefficient of determination R² between the predicted and experimentally determined logarithm of relative binding affinities using 10-fold cross validation. We used all 14-mer sequences from R3 of the selection with a count of >50, aligned based on the TGAYNNAY core motif for heterodimers, and the logarithm of the relative binding affinity as response variable. We used 14-mers in this analysis as a trade-off between sequence length and read coverage. Sequences with the core motif not located in the center resulted in missing flanks due to the alignment. We assigned features with a value of zero to these end positions. Sequences that did not contain the motif or contained more than one core motif were not included in the analysis.

As a form of feature selection, we trained variants of these models where we added or removed features at specific positions and evaluated the performance of these models based on a ΔR² with respect to a reference model. Shape features at position i include MG width and ProT at position i, whereas the definition for Roll and HelT includes the base pair steps between nucleotides i-1 and i as well as i and i+1 (Zhou, Yang et al. 2013). For the analysis of monomer binding specificities, we aligned all single occurrences of TAAT motifs in 9-mer sequences.

To evaluate the robustness of our results, we compared MLR-based models with models trained using support vector regression (ε-SVR) with a linear kernel (Gordan, Shen et al. 2013, Zhou, Shen et al. 2015) by calculating the Pearson correlation between the R² values derived from the two methods. We performed the ε-SVR analysis for sequence and sequence+shape models based on 10-fold cross validation for Exd-Hox WTs using 16-mer relative binding affinities as response variable and determined the hyper-parameters C and ε in a grid search using nested cross validation.
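The regression procedure described above (L2-regularized linear regression evaluated by cross-validated R²) can be sketched in plain Python. This is an illustrative sketch and not the implementation used in this work: the function names are hypothetical, the intercept is left unpenalized by assumption, and λ must be positive for the linear system to be well conditioned.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ridge_fit(X, y, lam):
    """Closed-form ridge regression; w[0] is an unpenalized intercept (assumption)."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    for i in range(1, p):
        xtx[i][i] += lam  # L2 penalty on every weight except the intercept
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

def ridge_predict(w, X):
    return [w[0] + sum(wi * xi for wi, xi in zip(w[1:], r)) for r in X]

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def cross_validated_r2(X, y, lam, k=10, seed=0):
    """R² between observed values and held-out predictions from k-fold CV."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    preds = [0.0] * len(y)
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        train = [i for i in idx if i not in held]
        w = ridge_fit([X[i] for i in train], [y[i] for i in train], lam)
        for i, p in zip(fold, ridge_predict(w, [X[i] for i in fold])):
            preds[i] = p
    return r_squared(y, preds)

def select_lambda(X, y, grid, k=10):
    """Pick the λ from a candidate grid with the highest cross-validated R²."""
    return max(grid, key=lambda lam: cross_validated_r2(X, y, lam, k))
```

In a nested scheme like the one described, `select_lambda` would be called on the training portion of each outer fold, so that the held-out fold never influences the choice of λ.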
Classification Models for Distinguishing ScrWT-like and AntpWT-like Binding Specificities

To use sequence and shape features to classify Hox binding specificities, we aligned 14-mers selected by Exd-ScrWT or Exd-AntpWT according to the core motif TGAYNNAY. For the 14-mers with a single occurrence of the core motif, the top 50% with the highest relative affinities were selected from the datasets prepared as described above as the preferred binding sites for ScrWT, AntpWT, and the mutants. As a result, the ScrWT dataset comprised 7078 and the AntpWT dataset 4962 sequences. Among these sequences, we removed 2416 sequences that were shared between both datasets, resulting in 4662 ScrWT preferred sites (assigned the label +1) and 2546 AntpWT preferred sites (assigned the label –1). The models were evaluated based on this training data using L2-regularized MLR and 10-fold cross-validation, and area under the receiver-operating characteristic curve (AUC) was used as performance measure. The trained models were used to classify the top 50% aligned binding sites preferred by the mutants. The MLR prediction resulted in a continuous number, which was converted into a binary classification measure based on whether the response variable is >0 for ScrWT-like or <0 for AntpWT-like binding specificities.

The Pearson correlation was calculated between the response variable (class labels) and the normalized MG width at each position to reveal which positions affect the classification most (positions with strong correlation, either positive or negative). We normalized MG width by dividing the difference of MG width at position i and the MG width mean over all unique pentamers by the standard deviation in MG width over all 512 possible pentamers as derived from the DNAshape method (Zhou, Yang et al. 2013). All shape parameters were normalized by the same scheme for the aforementioned MLR analysis.
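The normalization described above is a z-score relative to the distribution of a shape feature over the pentamer table. A minimal sketch follows, assuming a hypothetical dictionary `pentamer_table` that maps each unique pentamer to its predicted MG width; whether the population or sample standard deviation was used is not stated, so the population form is an assumption here.

```python
from statistics import mean, pstdev

def pentamer_stats(pentamer_table):
    """Mean and (population) standard deviation of a shape feature over all pentamers."""
    values = list(pentamer_table.values())
    return mean(values), pstdev(values)

def normalize_profile(profile, mu, sigma):
    """Z-score each position of a shape profile: (value - mean) / standard deviation."""
    return [(v - mu) / sigma for v in profile]

# Toy two-entry table with made-up values, for illustration only.
toy_table = {"AAAAA": 4.0, "CCCCC": 6.0}
mu, sigma = pentamer_stats(toy_table)
normalized = normalize_profile([4.0, 5.0, 6.0], mu, sigma)
```

Normalizing every shape feature on the same scale keeps the L2 penalty from favoring features that happen to have larger numeric ranges.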
3.2.3 Results

Mutants that Interfere with Scr’s Ability to Read MG Shape

In an initial set of experiments to tease apart the contributions of shape readout from base readout, we mutated Scr residues His-12, Arg3, and Arg5, which, in a co-crystal structure, only use shape readout as their mode of recognition (Joshi, Passner et al. 2007). We generated a series of mutant proteins that change these residues to alanines and, consequently, impair Scr’s ability to recognize local MG topographies. We mutated either Arg3 alone (ScrArg3A), His-12 alone (ScrHis-12A), both His-12 and Arg3 (ScrHis-12A,Arg3A), or Arg5 alone (ScrArg5A) and tested the effect of these mutations in complex with Exd on Scr’s DNA binding site preferences using SELEX-seq (Figure 3.5).

Figure 3.5 Amino acid sequences of Scr variants. Numbering is relative to the first residue in the homeodomain. Only sequences from the Exd-interaction motif YPWM through the homeodomain N-terminal arm are shown. The rest of the protein is wild-type in all cases. Red highlights mutated residues. Figure adapted from (Abe, Dror et al. 2015).

To determine if His-12, Arg3, and Arg5 directly enable the selection of sequences with narrow MGs, we computed the average MG width profile for thousands of 16-mer sequences that were preferentially bound by each Scr variant in our SELEX-seq experiments. We employed DNAshape, a high-throughput method for the prediction of the structural features of DNA sequences based on the average conformations of pentamers derived from all-atom Monte Carlo simulations (Zhou, Yang et al. 2013). Sequences selected by ScrHis-12A,Arg3A had an average MG width at A9 and Y10 that was significantly wider compared to those selected by ScrWT, without affecting the selection of the MG width minimum at A5Y6 (Figure 3.6) (p < 2 × 10⁻¹⁶; Mann-Whitney U test).
Sequences selected by the single mutant ScrArg3A, but not ScrHis-12A, had an intermediate width at A9Y10, suggesting that His-12 and Arg3 synergistically contribute to MG recognition at the Hox half-site (Figure 3.6). Conversely, compared to ScrWT, ScrArg5A selected sequences with a wider MG specifically in the Exd half-site (A5Y6), but these sequences retained the minimum at A9Y10 (Figure 3.6). These results provide strong support for the idea that Arg5 directly selects sequences with a MG minimum at A5Y6, while Arg3/His-12 directly select sequences with a MG width minimum at A9Y10. Selection of these MG width minima occurs independently, even though they are only separated by two base pairs.

Despite its importance in selecting Scr-specific features of MG topography, Arg3 is present in many Hox homeodomains, including Antennapedia (Antp), that do not select a MG minimum at A9Y10 (Figure 3.4) (Slattery, Riley et al. 2011). This observation prompted the question of why Arg3 in Antp and other posterior Hox proteins does not select for a narrow MG at this position. We speculated that the amino acids flanking Arg3 might play a role in binding site selection by correctly positioning this MG-inserting side chain. Indeed, although both Scr and Antp have Arg3 and Arg5, these residues are part of an N-terminal arm motif that differs between these two Hox proteins (R3Q4R5T6 in Scr and R3G4R5Q6 in Antp) (Figure 3.5). To test whether residues flanking these arginines play a role in Scr binding specificity we characterized an additional mutant, ScrHis-12AGQ (Figure 3.5). In ScrHis-12AGQ, His-12 is mutated to alanine and the fourth and sixth positions in the Scr homeodomain are changed to those of Antp (Gln4 to Gly4 and Thr6 to Gln6) to mimic Antp’s R3G4R5Q6 motif. Strikingly, this mutant failed to select sequences with a minimum at the Hox half-site (A9Y10) (Figure 3.6).
An additional mutant, ScrLinkGQ, that, in addition to having Antp’s R3G4R5Q6 motif, has Antp’s linker (residues in between the YPWM Exd interaction motif and the homeodomain, Figure 3.5) in place of Scr’s linker, showed very similar behavior to ScrHis-12AGQ (Figure 3.6). Together, these data suggest that additional residues within and adjacent to the N-terminal arm, which do not make direct contact with the DNA (minor or major groove), play an important role in selecting Hox-specific MG topographies, likely by positioning the MG-inserting side chains of Arg3, Arg5, and His-12.

Figure 3.6 Loss of MG Width Preferences in the Absence of MG-Recognizing Residues. Heat map of the average MG width at each position of 16-mers selected by each Exd-Hox heterodimer. Dark green represents narrow MG regions whereas white represents wider MG regions. The number of sequences analyzed for each complex is shown on the right. Black lines demarcate where Arg5 inserts into the MG (A5Y6) and, for ScrWT, where Arg3 and His-12 insert into the MG (A9Y10). Figure adapted from (Abe, Dror et al. 2015).

Mutants that Transfer Scr’s Ability to Read MG Shape to Antp

The above experiments demonstrate that MG-inserting side chains in Scr are necessary for Scr’s ability to select sequences with local MG width minima. To test whether MG recognizing residues are sufficient to confer Scr’s binding preferences to a different Hox protein, we introduced these residues into Antp, which normally prefers sequences with wider MG regions at A9Y10 (Figure 3.6). We created a series of Antp mutants that contained various combinations of Scr-specific amino acids in two regions, the linker and the N-terminal arm motif R3Q4R5T6 (Figure 3.7).

Figure 3.7 Amino acid sequences (from the Exd interaction motif, YPWM, through the N-terminal arm of the homeodomain) of Antp variants. Green highlights residues specific to AntpWT, and red highlights residues specific to ScrWT.
Non-highlighted residues are common between the two Hox proteins. Numbering is relative to the first residue of Scr’s homeodomain. The rest of the protein is wild-type in all cases. Figure adapted from (Abe, Dror et al. 2015).

To determine if these Antp mutants share Scr’s MG shape preferences, we used DNAshape to predict the MG widths of 16-mers selected by these proteins. In general, the average MG width at A9Y10 of the sequences selected by the Antp mutant series became narrower, toward that of Scr, upon the introduction of Scr-specific residues (Figure 3.8), where, with the exception of AntpHQ, each successive mutant selected sequences with a statistically significant narrowing of the average MG at these positions. These results suggest that Scr residues Gln4, Thr6, His-12, and the linker all contribute to the recognition of DNA shape. Moreover, these residues are sufficient to confer the shape preferences of Scr when inserted into another Hox protein.

Figure 3.8 Shape Readout Properties of Antp Variants with Scr-Specific Residues. Heat map of the average MG width at each position of 16-mers selected by each Exd-Hox heterodimer. Dark green represents narrow MG regions whereas white represents wider MG regions. The number of sequences analyzed for each complex is shown on the right. Black lines demarcate where Arg5 inserts into the MG (A5Y6) and, for ScrWT, where Arg3 and His-12 insert into the MG (A9Y10). Figure adapted from (Abe, Dror et al. 2015).

DNA Shape Features Improve Accuracy of Binding Specificity Predictions

If shape readout is a direct and independent determinant of Hox-DNA binding specificity, we speculated that shape features of the target DNA could be used to improve quantitative predictions of relative binding affinities. To test this notion, we trained an L2-regularized multiple linear regression (MLR) model (Yang, Zhou et al. 2014) for each of the mutants and WT Hox proteins.
We used 10-fold cross validation in order to train and determine the accuracy of a given model, quantified as the coefficient of determination R². These MLR-derived R² values are robust as they are highly correlated with R² values derived using an alternative machine learning approach, support vector regression (ε-SVR) with a linear kernel (Figure 3.9; see Methods and materials) (Gordan, Shen et al. 2013, Zhou, Shen et al. 2015).

Figure 3.9 Comparison of Model Evaluation Based on Support Vector Regression versus Multiple Linear Regression. The coefficients of determination R² derived from support vector regression (ε-SVR; y axis) and L2-regularized multiple linear regression (MLR; x axis) for sequence models (red) and sequence+shape models (blue) for 7 Exd-Hox WTs, together, result in a Pearson correlation close to 1. The heterodimer datasets for Lab and UbxIVa were too large for the ε-SVR analysis due to the required grid search to determine the C and ε hyper-parameters. The comparison of the two machine learning methods used 16-mer relative binding affinity data, as shown in Figure 3.6 and Figure 3.8. Figure adapted from (Abe, Dror et al. 2015).

Using MLR, addition of MG width to a model based only on nucleotide sequence resulted in a modest improvement in R² of on average 12% (Figure 3.10A and Figure 3.11A). Like MG width, adding three other shape features one at a time, Roll, propeller twist (ProT), and helix twist (HelT), also led to a modest improvement in accuracy (Figure 3.11A). Inclusion of all four DNA shape features in combination further increased prediction accuracy (Figure 3.10A and Figure 3.11A). Incorporating all four shape features yielded the largest improvement in binding affinity prediction accuracy, on average 26%, with high significance (p = 6 × 10⁻⁵; Mann-Whitney U test).
The addition of any combination of three shape features led to an intermediate increase in prediction accuracy, in some cases similar to that after addition of all four shape features (Figure 3.11A). These results suggest that all four features contribute to Exd-Hox-DNA target selection in a non-additive manner, consistent with the interdependency of these features (Olson, Gorin et al. 1998). Thus, including DNA shape features in addition to MG width improves binding site predictions over models based only on nucleotide sequence. For comparison, we also assessed the benefit of adding shape features for the prediction of Hox monomer specificities. Interestingly, in this case the improvement in R² was, on average, only 6.4%, suggesting a larger role for DNA shape in conferring heterodimer specificity than monomer specificity (Figure 3.11B).

Figure 3.10 DNA Shape Features Improve Quantitative Predictions of DNA Binding Specificities of Exd-Hox Heterodimers. (A) Scatter plot representing the coefficient of determination R² obtained using a sequence-only model (x axis) compared to a model using sequence and MG width (y axis). Each point represents a different Exd-Hox heterodimer and is color-coded as indicated. (B) Scatter plot representing the coefficient of determination R² obtained using a sequence-only model (x axis) compared to a model using sequence and four DNA shape features (MG width, Roll, ProT and HelT) (y axis). (C) Box plots illustrating the contribution from DNA shape features to model accuracy when shape features were added to a sequence model at each position individually. The effect on the coefficient of determination ΔR² is shown for adding four shape features (MG width, Roll, ProT and HelT) position-by-position to the sequence model. The centerline of the box plots represents the median, the edge of the box the first and third quartile, and the whiskers indicate minimum/maximum values within 1.5 times the interquartile range from the box.
(D) Box plots illustrating the contribution from DNA shape features to model accuracy when shape features were removed. The effect on the coefficient of determination ΔR² is shown for leaving out four shape features (MG width, Roll, ProT, and HelT) position-by-position from a shape-only model that does not contain any sequence information. The box plots are defined in (C). Figure adapted from (Abe, Dror et al. 2015).

Figure 3.11 DNA Shape Features Improve Quantitative Predictions of DNA Binding Specificities of Exd-Hox Heterodimers and Hox Monomers. (A) Adding one of the four shape features (MGW, Roll, ProT and HelT) to a sequence model improves DNA binding specificity predictions of Exd-Hox heterodimers about equally well, measured based on the coefficient of determination R², whereas addition of any three shape features simultaneously results in a compounded effect and addition of all four shape features at the same time results in the largest effect. P-values calculated based on a one-sided Mann-Whitney test for the sequence model versus shape-augmented models demonstrate the significance of the effect, with the sequence+shape model resulting in the most significant improvement. The centerline of the box plots represents the median, the edge of the box the 1st and 3rd quartile, and the whiskers indicate minimum/maximum values within 1.5 times the interquartile range from the box. (B) Adding one of the four shape features (MGW, Roll, ProT and HelT) or all four shape features at once to a sequence model has only a modest effect on DNA binding specificity predictions of Hox monomers, measured based on the coefficient of determination R² and one-sided Mann-Whitney test p-values. While this observation might in part be due to the stringent filtering of only TAAT motifs, it demonstrates the larger role of DNA shape on Exd-Hox heterodimer binding. The box plots are defined in (A). Figure adapted from (Abe, Dror et al. 2015).
DNA Shape Contributes to Binding Specificities in a Position-Specific Manner

Next, we hypothesized that if shape features contribute to an improvement in binding specificity prediction, then it might be possible to localize this effect within the binding site. We trained models using the sequence of the entire binding site augmented by all four shape features at individual positions one at a time, resulting in a set of models that tested the contribution of shape at each position of the binding site. We compared these models to a sequence-only model and calculated a ΔR². This analysis highlighted the importance of DNA shape for predicting Exd-Hox binding specificities in the core, but not the flanks, of the binding site (Figure 3.10C).

To analyze the role of DNA shape in a complementary manner, we trained shape-only models using the four shape features at all nucleotide positions, leaving out this information one position at a time, resulting in a set of models that assessed the relative importance of DNA shape at each position of the binding site. ΔR² values were calculated relative to a model that included the four shape features at all positions. In this analysis, prediction accuracy was expected to decrease most when shape features were removed from the model at positions that were important for shape readout. Interestingly, we detected the greatest effect at the A9 position of the Hox half-site, followed by slightly weaker effects at the adjacent Y10 position and the G4 position of the Exd half-site (Figure 3.10D). Eliminating shape features from the remaining positions had a smaller impact on the ability to predict binding specificities.

Within each SELEX-seq dataset, the sequences were most variable at the N8 position, raising the possibility that the success of these models might be driven in large part by this position.
To test this idea and better assess the role of DNA shape throughout the binding site we trained additional models in which we removed sequence information at the N8 position (“sequence-N8 model”). Leaving out sequence information at the N8 position did not significantly affect the accuracy of a sequence+shape model, suggesting that sequence information at N8 is not essential for its performance (Figure 3.12A). When MG width information was added to the sequence-N8 model, the ability to predict binding specificities was greatly enhanced compared to the same model without MG width information (Figure 3.12B and Figure 3.12C). These results argue that MG width information is more important than sequence at positions with a degenerate sequence signal, such as at N8, where direct readout is not playing a role. The removal of the confounding sequence information at this position uncovered MG width as an independent specificity determinant.

When all four shape features were added to the sequence-N8 model at single positions one at a time, the contribution of DNA shape within the core motif was very apparent (Figure 3.13A), and significantly stronger than when the starting model included sequence information at the N8 position (compare with Figure 3.10C). If instead of all four DNA shape parameters only MG width was added position by position to the sequence-N8 model, the average improvement in R², while smaller, was most apparent at or adjacent to Y6N7 and A9Y10 (Figure 3.13B). Thus, although DNA shape is generally important within the entire core of the binding site, the contribution of MG width is strongest at the two AY regions, precisely where local minima in MG width were observed in the Exd-Hox X-ray structures (Joshi, Passner et al. 2007) and SELEX-seq data (Figure 3.6 and Figure 3.8).
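The model variants compared in this section differ only in which per-position features enter the regression. The sketch below shows one way such feature sets could be assembled and compared; the helper names, the 0-based position indices, and the per-position data layout are assumptions made for illustration, not the layout used in this work.

```python
def build_features(seq_feats, shape_feats, drop_seq_at=(), shape_at=()):
    """Assemble one feature vector for a model variant.

    seq_feats:   list with one 4-element one-hot list per position.
    shape_feats: list with one list of shape values per position.
    drop_seq_at: positions whose sequence features are left out
                 (e.g. the index of N8 for a sequence-N8 model).
    shape_at:    positions whose shape features are added.
    """
    row = []
    for pos, onehot in enumerate(seq_feats):
        if pos not in drop_seq_at:
            row.extend(onehot)
    for pos in shape_at:
        row.extend(shape_feats[pos])
    return row

def delta_r2(r2_variant, r2_reference):
    """Change in the coefficient of determination relative to a reference model."""
    return r2_variant - r2_reference

# Toy three-position site with made-up shape values.
seq = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
shape = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
reference = build_features(seq, shape, drop_seq_at={1})
variant = build_features(seq, shape, drop_seq_at={1}, shape_at=[2])
```

Training one model per choice of `shape_at` and reporting `delta_r2` against the reference gives a position-by-position profile of the kind summarized in the box plots.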
Taken together, quantitative predictions based on regression models indicated that shape features become important where sequence information is not well defined, more likely at positions that are not involved in base readout. In these cases, shape features contain more information than sequence alone, and removing the signal from sequence enables the quantitative modeling of the role of shape features on binding specificity.

Figure 3.12 Models that Deconvolve DNA Sequence and Shape Further Demonstrate the Additional Information Contained by Shape-Based Models. (A) Scatter plot representing the coefficient of determination R² obtained using a sequence+shape model (x axis) compared to a sequence+shape model with sequence information removed at the N8 position ([sequence–N8]+shape model) (y axis). Removing sequence features at the N8 position where sequence is most variable across the selected sequences has essentially no effect on model accuracy as all points lie on or close to the diagonal. Each point represents a different Hox variant and is color-coded as indicated. (B) Scatter plot representing the coefficient of determination R² obtained using a sequence–N8 model (x axis) compared to the sequence–N8 model with MG width (MGW) features added to all positions (y axis). All points above the diagonal line represent complexes in which the MGW-augmented model improves the prediction accuracy of the logarithm of relative binding affinities in comparison to a model that does not use this information. The color code for different heterodimers is equivalent to (A). Removing sequence features at the N8 position deconvolves the contribution of MGW to model accuracy from sequence. (C) Box plots illustrate that adding MGW to the sequence–N8 model improves DNA binding specificity predictions of Exd-Hox heterodimers, measured based on the coefficient of determination R². One-sided Mann-Whitney p-values demonstrate the significance of the effect.
The centerline of the box plots represents the median, the edge of the box the 1st and 3rd quartile, and the whiskers indicate minimum/maximum values within 1.5 times the interquartile range from the box. (D) Classification models based on individual shape features perform well in distinguishing AntpWT-like from ScrWT-like binding specificities, measured based on area under the receiver-operating characteristic curve (AUC), with the combined shape model performing equally well as the sequence model. Figure adapted from (Abe, Dror et al. 2015).

Figure 3.13 Models that Deconvolve DNA Sequence and Shape. (A) Removing sequence features at the N8 position where sequence is least constrained across the selected sequences from the sequence+shape model further emphasizes the contribution of adding DNA shape to model accuracy. Whereas removing sequence information at this position has essentially no effect on model accuracy (Figure 3.12A), adding MG width to the sequence-N8 model has a large effect on prediction accuracy (Figure 3.12B). Based on this finding, the effect on the coefficient of determination ΔR² is shown in box plots for adding four shape features (MG width, Roll, ProT, and HelT) position-by-position to the sequence-N8 model. The centerline of the box plots represents the median, the edge of the box the first and third quartile, and the whiskers indicate minimum/maximum values within 1.5 times the interquartile range from the box. (B) Box plots illustrating the effect on the coefficient of determination ΔR² for adding MG width information position-by-position to the sequence-N8 model emphasize the role of the AY and immediately adjacent positions. The box plots are defined in (A).
(C) Pearson correlations (red) between MG width (MGW) and binding site labels (+1 for ScrWT-like versus –1 for AntpWT-like) track with the MGW pattern (blue) observed in the co-crystal structure (Joshi, Passner et al. 2007), emphasizing the important role of MGW in the core region of the Exd-Hox binding site. (D) A sequence+shape classification model captures the gradual change of binding specificities introduced by mutations of the N-terminal arm and linker sequences with some Exd-Hox mutant heterodimer specificities classified as Scr-like (red) and others as Antp-like (blue). Figure adapted from (Abe, Dror et al. 2015).

DNA Shape Features Discriminate Anterior from Posterior Hox Binding Specificities

To understand to what extent shape features can help distinguish Exd-ScrWT from Exd-AntpWT binding specificities, we assigned a value of +1 to the top 50% of sequences selected by Exd-ScrWT and –1 to the top 50% of sequences selected by Exd-AntpWT (see Methods and materials). We then used sequence- and shape-based models to evaluate the discriminative power of the selected features. Using L2-regularized MLR and 10-fold cross validation, we calculated the area under the receiver-operating characteristic curve (AUC) as a criterion for a model to discriminate ScrWT-like from AntpWT-like binding specificities. We found that MG width alone, without using sequence or additional shape features, discriminates between the binding specificities of both Exd-Hox complexes with high accuracy (Figure 3.12D). Thus, MG width does not merely refine binding specificity but is a powerful descriptor on its own, at least for discriminating between these two Exd-Hox complexes. Classification models using other shape parameters performed similarly well (Figure 3.12D), indicating that a classification between two states is less sensitive than quantitative prediction of binding strength using regression models.
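For readers unfamiliar with the AUC criterion used here, the following sketch computes it directly from its pairwise definition for sites labeled +1 (ScrWT-preferred) and –1 (AntpWT-preferred), and applies the sign threshold on the continuous regression output. It is an illustration with hypothetical function names, not the evaluation code used in this work; assigning a score of exactly 0 to the AntpWT-like class is an assumption, since only the >0 and <0 cases are specified.

```python
def auc(labels, scores):
    """Area under the ROC curve from its pairwise definition.

    Equals the fraction of (+1, -1) site pairs in which the +1 site
    receives the higher score; tied scores count as one half.
    """
    pos = [s for l, s in zip(labels, scores) if l > 0]
    neg = [s for l, s in zip(labels, scores) if l < 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def classify(scores):
    """Convert continuous regression outputs into class names by sign."""
    return ["ScrWT-like" if s > 0 else "AntpWT-like" for s in scores]

# Toy example with made-up scores.
toy_auc = auc([1, 1, -1, -1], [0.9, 0.2, 0.3, -0.5])
toy_classes = classify([0.4, -0.1])
```

The pairwise loop is quadratic in the number of sites, which is adequate for a sketch; a rank-based formulation would be preferable for large datasets.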
Further, these results suggest that the qualitative differences that are apparent in the MG width heat maps (Figure 3.6 and Figure 3.8) reflect a quantitative difference in anterior and posterior Hox specificities.

Next, we asked which positions in the binding site had the highest impact on this classification. To answer this question, we calculated the Pearson correlation between the class labels +1 and –1 for Exd-ScrWT and Exd-AntpWT, respectively, and MG width at each position (see Methods and materials). Several positions showed strong, either positive or negative, correlations that enabled the classification into ScrWT-like and AntpWT-like binding specificities (Figure 3.13C). Two regions showing a negative Pearson correlation aligned with the two MG width minima observed in the Exd-Scr co-crystal structure, and a region of positive Pearson correlation marked the region between these minima. This observation confirms that the core region is important for the differences in binding specificity between paralogous Hox factors. Interestingly, not only is the AY region of the Hox half-site important, but the shape of the entire core, presumably due to the influence of all core positions on the shape of this region.

Finally, we used classification models to predict whether the DNA shape mutants defined in Figure 3.5 and Figure 3.7 tend to show ScrWT-like or AntpWT-like binding specificities. Here, a sequence was classified as ScrWT-like if the class label was predicted to be >0, and as AntpWT-like if the class label was predicted to be <0. This classification indicated a gradual change in the fraction of sequences selected by any of the mutants assigned as ScrWT- versus AntpWT-preferred sequences (Figure 3.13D). These data quantitatively confirm the qualitative observations shown above (Figure 3.6 and Figure 3.8) that MG width topography is an important binding specificity signal for Hox proteins.
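The per-position correlation analysis described in this section can be sketched as follows. The function names and the toy MG width values are assumptions for illustration; each row of `mgw_rows` stands for the normalized MG width profile of one aligned binding site.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def positional_correlation(labels, mgw_rows):
    """Correlate class labels (+1 / -1) with MG width at every position."""
    n_pos = len(mgw_rows[0])
    return [pearson(labels, [row[i] for row in mgw_rows]) for i in range(n_pos)]

# Toy data: two ScrWT-like (+1) and two AntpWT-like (-1) sites, two positions.
labels = [1, 1, -1, -1]
mgw_rows = [[3.0, 6.0], [3.2, 5.8], [5.0, 4.2], [5.2, 4.0]]
correlations = positional_correlation(labels, mgw_rows)
```

With this labeling, a negative correlation at a position means that ScrWT-preferred sites tend to have a narrower minor groove there, which is how the two negative regions in Figure 3.13C are read.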
3.2.4 Discussion

Despite significant effort in the field, it is still not possible to accurately decipher the regulatory information that is encoded in the DNA sequences of eukaryotic genomes (Slattery, Zhou et al. 2014). In the work described here, we used a combination of experimental and computational approaches to show that intrinsic DNA structural characteristics, collectively referred to as DNA shape, are being directly read by DNA binding proteins when they recognize their binding sites. Thus, analogous to mechanisms in which DNA base pairs are directly read by proteins via hydrogen bonds, the recognition of DNA shape independently contributes to both binding affinity and specificity. Using this information, we show that including DNA shape features significantly enhances the ability to predict DNA binding specificities and thus will greatly improve models for accurately predicting transcription factor binding in eukaryotic genomes.

Separable Contributions of DNA Shape and Sequence to Protein-DNA Recognition

Although several previous reports suggested the importance of DNA shape in protein-DNA recognition, all prior work was unable to definitively discriminate between the roles of DNA shape and sequence. Although DNA shape features, such as MG width, were previously found to contribute to binding specificity (Gordan, Shen et al. 2013, Lazarovici, Zhou et al. 2013, Dror, Zhou et al. 2014, Yang, Zhou et al. 2014), here the roles of DNA sequence and shape have been separated and analyzed in an unbiased manner. To achieve this, we mutated Scr amino acid side chains that do not make direct base contacts in the major groove, but instead either insert into the MG (His-12, Arg3, Arg5) or indirectly influence these interactions (Gln4, Thr6, linker). The combination of SELEX-seq with high-throughput DNA shape analysis allowed us to show the effect of these mutations on the selection of DNA binding sites with distinct shape characteristics.
Further, not only were these amino acid side chains necessary for conferring the DNA binding preferences of these proteins, they were sufficient to confer this specificity when grafted into a different Hox protein, Antp. These experiments effectively tease apart the contributions of shape readout from base readout. We speculate that the readout of DNA shape may be a general mechanism that transcription factors use to recognize their binding sites. Moreover, for transcription factors that are members of large paralogous families, such as the Hox proteins, DNA shape may be essential for distinguishing between binding sites that are difficult to discriminate based on base readout alone.

Statistical Machine Learning Reveals DNA Structure-Based Binding Specificity Signals

To complement and extend the experimental studies, we used statistical machine learning, in this case multiple linear regression (MLR), to computationally analyze the contributions of DNA sequence and shape. Using this approach, we were able to (1) quantify the overall contribution of shape features to binding specificity and (2) compute the relative contributions of DNA shape and sequence at individual positions within the binding site. Extensive experimental work, involving structure determination and mutagenesis, represents the current standard approach for uncovering DNA readout mechanisms of transcription factors. The quantitative modeling introduced here suggests an alternate route for deriving such mechanistic information from high-throughput sequencing data. These methods will therefore likely be valuable when used to predict the DNA binding specificities of other transcription factors and when analyzing their interactions with genomes.

To identify positions in the binding site where shape features contribute substantially to binding specificity, we used a form of feature selection in which we compared models with different feature sets by computing a ΔR² relative to a reference model.
We found that the shape features in the core of the Exd-Hox heterodimer binding site were important for paralogous binding specificity. This observation is distinct from previous observations for another family of transcription factors, basic helix-loop-helix (bHLH) factors, where shape features in regions flanking the core binding site play an important role in discriminating binding specificities of related family members in yeast (Gordan, Shen et al. 2013) and human (Yang, Zhou et al. 2014). Further, our feature selection approach indicates that shape features at the AY region of the Hox half-site were the most critical for determining binding specificity. This finding agrees with qualitative observations in a previous study (Slattery, Riley et al. 2011) and in this work (Figure 3.6 and Figure 3.8) that shape selections varied most substantially at this position for both wild-type and mutant Hox proteins. While this was previously a qualitative observation, the current study shows the effect quantitatively. The machine learning and feature selection methods that reveal this information will likely provide a powerful approach when analyzing data from high-throughput binding assays for other transcription factors. In particular, it is noteworthy that we were able to derive structural mechanisms used by Hox transcription factors from sequence data alone, without solving a 3D structure.

Broader Implications for Recognition of Genomic Target Sites by Transcription Factors

Based on our findings, we propose that as more high-throughput DNA binding data become available (Zhu, Christensen et al. 2011, Jolma, Yan et al. 2013, Hume, Barrera et al. 2015), DNA shape parameters should be taken into consideration when analyzing DNA binding site preferences and subsequently scanning genomes for binding sites. Further, although different families of transcription factors may use DNA shape in various ways, this information may be used to inform binding site prediction algorithms.
As shown here quantitatively, Exd-Hox heterodimers use distinct structural features in the DNA, such as local regions of narrow MG, to achieve DNA binding specificity. Because MG width minima are distinct structural motifs, we were able to separate their contributions to DNA recognition both biochemically, by mutating amino acids that recognize these motifs, and computationally, by training models that include or exclude specific subsets of DNA features. For other protein families, the contribution of DNA structure may not be as readily separable as it is for Exd-Hox binding. For example, although previous work demonstrated a role for DNA shape in conferring the binding specificity of bHLH proteins, this effect was mediated by sequences flanking the core binding site (E-box), where no known protein-DNA interactions (base or shape readout) occur (Gordan, Shen et al. 2013). In this case, the role of DNA shape may be biochemically inseparable from base readout because it is unlikely that a distinct structural motif is formed by the flanking sequences. Our results have implications for the design of binding site search and de novo motif discovery methods, which currently rely mostly on DNA base features alone (Weirauch, Cote et al. 2013). There are some examples where large sets of overlapping DNA structural features, which are highly interdependent and inseparable from sequence, have been integrated in motif search algorithms (Meysman, Thanh et al. 2011, Hooghe, Broos et al. 2012, Maienschein-Cline, Dinner et al. 2012). The results described here, however, suggest that for some transcription factor families, distinct structural motifs that can be defined independently from sequence, such as MG topography, can be directly integrated in genome analysis tools as quantifiable search parameters.
The ability to independently define and quantify the role of distinct structural motifs will likely yield more powerful algorithms that may help identify low affinity, high specificity Hox binding sites that are unrecognizable with standard approaches (Crocker, Abe et al. 2015). Further, machine learning approaches may also contribute to more accurate models of cooperative transcription factor binding, for example in the interferon-β enhanceosome (Panne, Maniatis et al. 2007), or in vivo, where DNA shape has been identified as a predictive feature for transcription factor binding (Barozzi, Simonatto et al. 2014). We further propose that the computational approaches described here will also be valuable for deconvolving and discovering the roles of DNA shape and sequence even for transcription factors such as the bHLH factors where DNA shape cannot be as readily separated biochemically from DNA sequence. The ability to quantitatively assess the distinct roles of DNA sequence and shape will therefore advance our ability to identify bona fide genomic binding sites and the ability to interpret eukaryotic genomes.

Chapter 4 Transcription factor family-specific DNA shape readout revealed by quantitative specificity models

Reproduced from manuscript: Lin Yang, Yaron Orenstein, Arttu Jolma, Jussi Taipale, Ron Shamir, and Remo Rohs: DNA shape readout at base pair resolution reveals differences between transcription factor families.

4.1 Introduction

Protein-DNA interactions play a central role in gene regulation. A group of proteins that recognize specific DNA sequences, known as transcription factors (TFs), bind to regulatory regions in the genome and consequently activate or repress transcription of target genes. Despite having an optimal DNA binding sequence, TFs can bind to various DNA sequences with different binding affinities, i.e. DNA binding specificities.
In the last decade, technologies for measuring protein DNA-binding specificities have advanced tremendously (Slattery, Zhou et al. 2014). To date, platforms based on microarray technology, such as protein binding microarray (PBM) (Berger, Philippakis et al. 2006), and high-throughput sequencing technology, such as high-throughput SELEX (HT-SELEX) (Jolma, Kivioja et al. 2010) or SELEX-seq (Slattery, Riley et al. 2011), have enabled measurements of protein binding against thousands and even millions of different DNA sequences. The computational challenge is to develop accurate and quantitative models of protein-DNA binding specificities from these large datasets and to further infer the binding mechanisms. Position weight matrix (PWM) or PWM-like models are widely used to represent the DNA binding preferences of proteins (Stormo 2000). In these models, a matrix is used to represent the TF binding site (TFBS), with each element representing the contribution to the overall binding affinity from a nucleotide at the corresponding position. An inherent assumption of traditional PWM models is position independence, i.e. the contribution of different nucleotide positions within a TFBS to the overall TF-DNA binding affinity is additive. Although this is a good approximation, it has been demonstrated that the assumption does not hold for many proteins (Man and Stormo 2001, Bulyk, Johnson et al. 2002). Therefore, PWM models have been extended to include additional parameters, e.g. k-mer features, that account for the position interdependencies within TFBSs (Zhao, Ruan et al. 2012, Mathelier and Wasserman 2013, Mordelet, Horton et al. 2013, Weirauch, Cote et al. 2013, Riley, Lazarovici et al. 2015). Such interdependencies can result from TFs’ preferences for sequence-dependent DNA conformation or deformability, which we call DNA shape readout (Rohs, West et al. 2009b, Rohs, Jin et al. 2010).
Based on this rationale, an alternative approach to augment traditional PWM models is through the inclusion of DNA structural parameters. Such an approach was shown to improve the performance in modeling TF-DNA binding specificities to an extent that is comparable to using additional k-mer features, with a much smaller number of parameters (Zhou, Shen et al. 2015). Using this approach, we revealed in previous work the importance of DNA shape readout for members of the basic helix-loop-helix (bHLH) and homeodomain TF families (Dror, Zhou et al. 2014, Yang, Zhou et al. 2014, Zhou, Shen et al. 2015). We were also able, for Hox TFs, to narrow down regions in the TFBSs where DNA shape readout was used, demonstrating the power of such an approach to reveal mechanistic insights into TF-DNA recognition (Abe, Dror et al. 2015). However, this was done for only two protein families due to the lack of large-scale high-quality TF-DNA binding data. With the recent abundance of high-throughput measurements of protein-DNA binding, it is now possible to dissect the role of DNA shape readout for many TF families. In this study, we augmented and used the most extensive mammalian TF-DNA binding affinity data available to date to learn DNA shape-based binding models. The data are derived from HT-SELEX experiments covering hundreds of human and mouse TFs (Jolma, Yan et al. 2013). To improve the statistical robustness of the analysis, we re-sequenced each experiment and increased the sequencing depth by ~10-fold on average. We implemented a pipeline to derive accurate TF-binding intensities for all possible DNA M-words (sequences of length M, where M in general varies from 9 to 24) from HT-SELEX reads. These M-word binding affinities supplement the previously published PWMs as a much richer description of the binding landscape of the tested proteins (Zhao, Granas et al. 2009, Jolma, Kivioja et al. 2010, Jolma, Yan et al. 2013).
Using these preprocessed data, we next trained machine learning models of TF-DNA binding specificities. By comparing performance of models that use different features, we show that DNA shape readout is important for a variety of TF families, despite heterogeneity within the same TF family. Using feature selection, we pinpointed positions in the TFBSs where DNA shape readout is most likely to occur. We showed that the selected features concur with available structures solved by NMR spectroscopy. Considering the prevalence of DNA shape readout, we propose a new visualization of DNA shape preferences, which we call DNA shape logos. Overall, our results indicate the prevalence of DNA shape readout across TF families, and we argue for new methods that integrate structure into quantitative modeling of data derived from protein-DNA binding experiments.

4.2 Methods and materials

4.2.1 Re-sequencing DNA libraries generated from previous HT-SELEX experiments

In a previous study, HT-SELEX experiments were performed for an extensive collection of human and mouse TFs or their DNA binding domains (DBDs) belonging to different protein families to reveal their DNA binding specificity (Jolma, Yan et al. 2013). However, due to the relatively low number of reads generated for these individual HT-SELEX experiments, high-confidence quantitative TF-binding affinities cannot be derived for a large enough sample of DNA sequences. Therefore, new sequencing data were generated by re-pooling the existing PCR-amplified SELEX-ligands into new Illumina sequencing libraries where the samples were multiplexed to a lesser extent (~55X vs. ~800X) than in the previous study. The libraries were then sequenced using Illumina Hiseq2 platform as in the previous study (Jolma, Yan et al. 2013). The read count generated for each HT-SELEX experiment was thus increased on average to about 10 times that of the previous data.
The completed dataset includes 548 experiments covering 410 different TFs (disregarding mouse/human, full-length protein/DBD differences) from 40 protein families.

4.2.2 Deriving relative TF-binding affinity for DNA M-words from HT-SELEX reads

For each TF, we selected a core-binding motif from the literature (mainly taken from (Weirauch and Hughes 2011)) that agreed with the motif published in (Jolma, Yan et al. 2013). We then calculated the binding score for each M-word that included the core motif in the center (allowing for a few mismatches) and any possible flanking sequences 5’ and 3’ of the motif. We focused on non-cooperative DNA binding by the TFs. So, to avoid the possibility of cooperative TF-DNA binding in which multiple copies of the TF occupy different DNA binding sites on the same sequence, and to minimize noise caused by inaccurate alignment of M-words based on the core motif, we excluded HT-SELEX reads that contain multiple instances of the core motifs. An HT-SELEX experiment includes several rounds of binding site (BS) selection by the TF, where the specificity of the selected DNA sequences increases in each round. We therefore calculated the M-word score as the ratio of the frequency of the M-word in the i-th round over its estimated frequency in the initial round using a 5th-order Markov model (Riley, Slattery et al. 2014). The final output of this process for each HT-SELEX experiment is the relative TF-binding affinity for different M-words.

4.2.3 PCA and regression analysis

For each DNA sequence, 1-mer, 2-mer, and 3-mer features were encoded into feature vectors $\boldsymbol{\phi}^{\mathrm{1mer}}$, $\boldsymbol{\phi}^{\mathrm{2mer}}$, and $\boldsymbol{\phi}^{\mathrm{3mer}}$, respectively, in a similar way to those used in (Zhou, Yang et al. 2013). First-order DNA shape features minor groove width (MGW), propeller twist (ProT), roll (Roll), and helix twist (HelT) were generated by our DNAshape prediction method (Zhou, Yang et al. 2013) and denoted as $\boldsymbol{\phi}^{\mathrm{MGW}}$, $\boldsymbol{\phi}^{\mathrm{ProT}}$, $\boldsymbol{\phi}^{\mathrm{Roll}}$, and $\boldsymbol{\phi}^{\mathrm{HelT}}$, respectively.
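As an illustration of the 1-mer, 2-mer, and 3-mer encodings mentioned above, the following is a minimal sketch that assumes a plain one-hot encoding of every k-mer window. The function names `kmer_one_hot` and `encode_sequence` are hypothetical, and the exact encoding used in the cited work may differ in ordering or in how sequence ends are treated.

```python
from itertools import product

BASES = "ACGT"


def kmer_one_hot(seq, k):
    """One-hot encode every k-mer window of seq (4**k entries per window)."""
    # Assumption: k-mers are indexed in lexicographic order over A, C, G, T.
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    index = {kmer: j for j, kmer in enumerate(kmers)}
    vec = []
    for start in range(len(seq) - k + 1):
        block = [0] * len(kmers)
        block[index[seq[start:start + k]]] = 1
        vec.extend(block)
    return vec


def encode_sequence(seq, ks=(1, 2, 3)):
    """Concatenate the 1-mer, 2-mer, and 3-mer one-hot vectors of seq."""
    vec = []
    for k in ks:
        vec.extend(kmer_one_hot(seq, k))
    return vec


features = encode_sequence("TAATTG")
print(len(features), sum(features))
```

Each window contributes 4, 16, or 64 entries for k = 1, 2, or 3, which is where a count of 4 + 16 + 64 = 84 sequence features per position comes from.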
These DNA shape features are of different natures. Therefore, the following normalization was performed:

$$\phi_i^{\mathrm{MGW}} = \frac{MGW_i - MGW_{\min}}{MGW_{\mathrm{sd}}},$$

where $MGW_i$ is the predicted MGW at position $i$, $MGW_{\min}$ is the minimum MGW over all possible pentamers, and $MGW_{\mathrm{sd}}$ is the standard deviation of MGW in the data. Similarly,

$$\phi_i^{\mathrm{ProT}} = \frac{ProT_i - ProT_{\min}}{ProT_{\mathrm{sd}}}, \quad \phi_i^{\mathrm{Roll}} = \frac{Roll_i - Roll_{\min}}{Roll_{\mathrm{sd}}}, \quad \phi_i^{\mathrm{HelT}} = \frac{HelT_i - HelT_{\min}}{HelT_{\mathrm{sd}}}.$$

Second-order DNA shape features are derived from first-order DNA shape features and denoted as $\boldsymbol{\phi}^{\mathrm{MGW}(2)}$, $\boldsymbol{\phi}^{\mathrm{ProT}(2)}$, $\boldsymbol{\phi}^{\mathrm{Roll}(2)}$, and $\boldsymbol{\phi}^{\mathrm{HelT}(2)}$. These second-order shape features are the product of adjacent first-order DNA shape features, normalized by the standard deviation. Since MGW and ProT are defined for each bp, whereas Roll and HelT are defined for each bp step, in the feature selection analysis the DNA shape features at nucleotide position $i$, or shape$_i$, consist of $\phi_i^{\mathrm{MGW}}$, $\phi_i^{\mathrm{ProT}}$, $\phi_i^{\mathrm{Roll}}$, $\phi_{i+1}^{\mathrm{Roll}}$, $\phi_i^{\mathrm{HelT}}$, $\phi_{i+1}^{\mathrm{HelT}}$, $\phi_i^{\mathrm{MGW}(2)}$, $\phi_{i+1}^{\mathrm{MGW}(2)}$, $\phi_i^{\mathrm{ProT}(2)}$, $\phi_{i+1}^{\mathrm{ProT}(2)}$, $\phi_i^{\mathrm{Roll}(2)}$, and $\phi_i^{\mathrm{HelT}(2)}$. If the core-motif sequence is palindromic, the last step in feature encoding is to symmetrize the feature vector by averaging it with the feature vector encoding the reverse complementary strand. After feature encoding, L2-regularized multiple linear regression (MLR) and 10-fold cross validation were performed for each dataset to gauge model performance, as previously described (Yang, Zhou et al. 2014, Abe, Dror et al. 2015, Zhou, Shen et al. 2015). L2-regularized MLR was chosen for its simplicity and interpretability. In PCA, the feature vector encoded for the sequence of highest DNA binding affinity of a TF was used to represent that TF.

4.2.4 Quality control of the data

We performed two stages of quality control for these datasets. In the first stage, we used the following three criteria:

1. Only M-words with at least 9 counts were included in a dataset.
In addition, the M-word with the highest binding affinity within a dataset must have at least 100 counts. Otherwise, the dataset is discarded. These requirements ensure statistical robustness.

2. If a dataset contains the relative binding affinity for fewer than 1000 different M-words for the TF, this dataset is discarded. This ensures that we have enough data for training quantitative regression models.

3. In a TF dataset, the 90th-percentile binding affinity must be at least 0.2 greater than the 10th-percentile binding affinity. This ensures that DNA M-words of diverse TF-binding affinities are represented in each dataset.

Next, we filtered out datasets by $R^2$ performance criteria. L2-regularized MLR was run on each of the remaining datasets using different combinations of features. Because MLR models are linear, model A is expected to perform at least as well as model B whenever B uses a subset of the features that A uses. We defined a dataset as invalid only when the performance of model A was lower than that of model B by more than 3%, given that B uses a subset of the features that A uses. This reduced the number of valid datasets to 533. Next, datasets for which even the best model has an $R^2$ < 0.5 were excluded from our analyses, which resulted in 525 datasets that finally passed our quality control. These 525 datasets cover 197 human/mouse TFs that belong to 25 different TF families. For TFs covered by multiple datasets, only the dataset with the highest $R^2$ is included in the downstream analyses. Since the PCA analysis requires only one representative binding site sequence for each TF, we separately generated 12-word data using reads from the last round of the corresponding HT-SELEX experiment, as the last round is expected to be the most specific. We used the top 12-word as the representative binding site for each TF. In doing so, as many as 294 TFs were covered in the PCA analysis.
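The three first-stage criteria of Section 4.2.4 can be sketched as a single filter. This is a minimal illustration, not the pipeline used in the study: the tuple layout, the function names, and the linear-interpolation percentile are assumptions, and the text does not state whether the top-affinity M-word is identified before or after the count filter (here it is identified after).

```python
def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100] (an assumed definition)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def passes_first_stage(dataset, min_count=9, min_top_count=100,
                       min_words=1000, min_spread=0.2):
    """dataset: list of (mword, count, affinity) tuples (hypothetical layout)."""
    # Criterion 1a: keep only M-words with at least min_count counts.
    kept = [row for row in dataset if row[1] >= min_count]
    # Criterion 2: require enough distinct M-words for regression.
    if not kept or len(kept) < min_words:
        return False
    # Criterion 1b: the highest-affinity M-word must be well supported.
    top = max(kept, key=lambda row: row[2])
    if top[1] < min_top_count:
        return False
    # Criterion 3: affinities must span a sufficiently wide range.
    affinities = [row[2] for row in kept]
    return percentile(affinities, 90) - percentile(affinities, 10) >= min_spread
```

The default thresholds mirror the values stated in the text (9 counts, 100 counts, 1000 M-words, 0.2 spread).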
4.2.5 Generating DNA shape logos

We generated DNA shape logos with the seq2logo program using the PSSM-logo option (Thomsen and Nielsen 2012). The learned weights for the features in a shape model based only on first-order shape features were used to construct the position-specific scoring matrix (PSSM), which served as the input to the seq2logo program.

4.3 Results

4.3.1 PCA analysis reveals TF-family specific DNA binding specificities and heterogeneities within TF families

We performed principal component analysis (PCA) to visualize TF-family specific DNA binding specificities. For each TF, we represented its DNA binding preference using the DNA M-word that has the highest binding affinity for this TF. We encoded this M-word into a numeric feature vector in two ways: i) a vector that includes only mononucleotide features, i.e. 1-mer features; ii) a vector that includes both 1-mer and DNA shape features. Figure 4.1 shows the first two principal components obtained using these feature vectors. We observed that different TF families tend to form distinct clusters in the PCA plots. To compare the clustering quality in the two plots, we obtained the 2D Euclidean distances between all pairs of TFs from Figure 4.1A and Figure 4.1B, respectively. These distances were classified into two groups, intra-family and inter-family, and visualized as boxplots in Figure 4.1C and Figure 4.1D, respectively. Although the inter-family distances are generally larger than intra-family distances, using both 1-mer and DNA shape features led to a larger difference between the median of the inter-family group and the intra-family group, compared to using 1-mer features alone (Figure 4.1C, D). This is consistent with Figure 4.1A and B, which show that a larger fraction of the variance was explained by introducing DNA shape features.

Figure 4.1 PCA analysis reveals different DNA binding specificities between TF families. (A) PCA using 1mer features. Each dot represents a TF.
Dots of the same color belong to the same TF family. An ellipse was drawn for each TF family. The ellipse is a contour of a fitted bivariate normal distribution that encloses 0.68 (R package default) probability. (B) PCA using 1mer and DNA shape features, annotated in the same way as (A). (C) Boxplot of inter-family TF distances and intra-family TF distances derived from (A). The difference between the medians of the inter-family and the intra-family distances is 2.02. (D) Boxplot of inter-family TF distances and intra-family TF distances derived from (B). The difference between the medians of the inter-family and the intra-family distances is 3.68.

4.3.2 DNA shape features improve modeling of DNA binding specificities across different TF families

We tested the importance of DNA shape recognition by each TF through quantitative modeling of TF-DNA binding specificities and comparison of model performance in terms of $R^2$ between predicted and experimental M-word scores. Similar to the methodology used in a previous study (Zhou, Shen et al. 2015), we built regression models that use only DNA mononucleotide features, i.e. 1mer models, and models that combine both DNA mononucleotide and shape features, i.e. 1mer+shape models. When a 1mer+shape model outperforms a 1mer model, it indicates that DNA shape readout might play a role in the TF binding. Based on analysis of 197 TFs from 25 different families, we found that 1mer+shape models generally outperform 1mer models (Figure 4.2A), indicating the prevalence of DNA shape readout across different TF families. It is also apparent that the importance of DNA shape recognition varies both between TF families and within TF families. For example, the model performance for homeodomain TFs is generally more substantially improved than for C2H2 TFs. Within the homeodomain family, there is a prominent variance among individual members. The homeodomain and bHLH TFs have been observed to be sensitive to DNA shape features (Slattery, Riley et al.
2011, Gordan, Shen et al. 2013, Yang, Zhou et al. 2014, Zhou, Shen et al. 2015). Here we confirmed this conclusion and extended this observation to TF families bZIP, CENPB, CP2, E2F, GATA, HOMEZ, IRF, MYB, nuclear receptor, PAX, POU, RFX, TEA, and TFAP, based on the observation that at least half of the members in the family, covered by our data, showed >10% performance improvement by adding DNA shape features to the model. Note that some families are under-represented in the data and have only one TF. It is important to note that the homeodomain TFs in this study presumably bind DNA as monomers, whereas our previous studies demonstrated the importance of DNA shape for Exd-Hox heterodimers (Slattery, Riley et al. 2011). X-ray and NMR structures of homeodomain DBDs in complex with DNA repeatedly show that one N-terminal tail of the homeodomain DBD interacts with the DNA through minor groove and backbone contacts, which is a signature of DNA shape readout (Banerjee-Basu and Baxevanis 2001).

Figure 4.2 Performance comparisons between (A) 1mer and 1mer+shape models, (B) 1mer+2mer+3mer and 1mer+shape models, (C) 1mer+2mer+3mer and 1mer+shape+3merE2 models, (D) 1mer+shape and 1mer+shape+3merE2 models. Each dot represents one dataset. Coordinates of the dot are determined by the performance, measured in $R^2$, of the corresponding models indicated in parentheses. Shape and color of the dots indicate the TF family. The dashed line in (A) has a slope of 1.1, marking a 10% performance increase. The dashed lines in (C) have slopes of 1.1 and 0.9.

4.3.3 Analysis reveals for different TF families importance of DNA shape features in flanking regions

We previously observed that 1mer+2mer+3mer models usually outperform 1mer+shape models (Zhou, Shen et al. 2015). Here, we further explored this phenomenon and gained insights into possible explanations for this observation. As noted previously (Zhou, Shen et al.
2015), both 2mer and 3mer features are indirect representations of DNA shape characteristics. 2-mer features describe stacking interaction between adjacent base-pairs, and 3-mer features describe short structural elements such as A-tracts that tend to form narrow minor groove regions. Thus, it is not surprising that 1mer+2mer+3mer models can capture TF-DNA binding specificities with high accuracy. Using the high-quality HT-SELEX data, we observed that, for the majority of TFs, 1mer+2mer+3mer models outperform 1mer+shape models (Figure 4.2B). The reason may be that since our prediction of local DNA shape features is based on a sliding window that covers 5 bps (Zhou, Yang et al. 2013), we are limited by this method in that we cannot predict shape features for the two extreme positions at both the 5'- and 3'-end of each DNA sequence. This could give an edge to 1mer+2mer+3mer models, as we can always encode 2-mer and 3-mer features for those positions, which in turn work as a proxy for DNA shape. To test this hypothesis, we further added 3-mer features from only the extreme positions, denoted as 3merE2 features, to the 1mer+shape model. The resulting model is thus called 1mer+shape+3merE2 model. Results show that adding 3merE2 features indeed further increased model performance to a level that is comparable to the 1mer+2mer+3mer model (Figure 4.2C). By the same argument, if longer flanking sequences were available for predicting shape features, 1mer+shape models would perform similarly to 1mer+2mer+3mer models without adding 3merE2 features. To verify this, we used an independent dataset generated by the gcPBM platform (Zhou, Shen et al. 2015). As mentioned above, this type of assay samples potential TFBSs within their genomic context, and thus, includes long flanking regions outside the core TFBS motif. As expected, 1mer+shape models performed comparably to 1mer+2mer+3mer models for this data without additional 3merE2 features (Figure 4.3).
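The end-position limitation described above follows directly from the 5-bp sliding window. The sketch below lists which positions of an M-word receive a pentamer-based shape value and which do not, and then extracts the 3-mer windows nearest the two ends. The helper names are hypothetical, and `end_3mers` is only one possible reading of the 3merE2 features, whose exact definition is not spelled out in the text.

```python
def shape_covered_positions(seq_len, window=5):
    """0-based positions that sit at the center of a full sliding window."""
    half = window // 2
    return list(range(half, seq_len - half))


def uncovered_end_positions(seq_len, window=5):
    """Positions without a shape prediction (two at each end for window=5)."""
    covered = set(shape_covered_positions(seq_len, window))
    return [i for i in range(seq_len) if i not in covered]


def end_3mers(seq, n_ends=2):
    """First and last n_ends 3-mer windows (assumed stand-in for 3merE2)."""
    windows = [seq[i:i + 3] for i in range(len(seq) - 2)]
    return windows[:n_ends], windows[-n_ends:]


mword = "GCTAATTGCA"
print(shape_covered_positions(len(mword)))
print(uncovered_end_positions(len(mword)))
print(end_3mers(mword))
```

With longer flanks, as in the gcPBM data, the uncovered positions fall outside the region of interest, which matches the argument made for why no 3merE2 features are needed there.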
These results further imply that DNA shape features in flanking regions are determinants of TF-DNA binding specificities, which was previously known for bHLH TFs (Gordan, Shen et al. 2013, Yang, Zhou et al. 2014, Zhou, Shen et al. 2015). Here we showed for the first time that this phenomenon is of general nature, as adding 3merE2 features as proxy for missing DNA shape features consistently improved the model performance for various TF families (Figure 4.2D).

Figure 4.3 Comparable performance of 1mer+2mer+3mer and 1mer+shape models for the gcPBM data.

Beyond better interpretability of shape-augmented models, an important distinction between the models is the different number of features required to achieve similar performance. The 1mer+shape model requires 12 features (including 2nd order DNA shape features) per position compared to 84 features used by the 1mer+2mer+3mer model per nucleotide position (Zhou, Shen et al. 2015). While we previously included lower-order 1-mers and 2-mers in our 1mer+2mer+3mer models for reasons of interpretability, we note here that the 3-mer features contain all the information of 1-mer and 2-mer features. As a result, a 3mer model is actually equivalent to a 1mer+2mer+3mer model (Figure 4.4). This would, however, still leave the 3mer model with 64 required features per position.

Figure 4.4 Equivalent performance of 1mer+2mer+3mer and 3mer models for the HT-SELEX data.

4.3.4 Feature selection provides insights into TF-DNA readout mechanisms

Next, we performed feature selection to identify positions at which DNA shape features contribute to TF binding specificities. The method is similar to what we previously developed in the study of Hox proteins (Abe, Dror et al. 2015). In particular, for each TF, we first evaluated the $R^2$ performance of the baseline 1mer model, denoted as $R^2_{\mathrm{1mer}}$. Next, we evaluated models that combine 1mer features with DNA shape features at each nucleotide position $i$, denoted as 1mer+shape$_i$ models.
We denoted the performance as $R^2_{\mathrm{1mer+shape}_i}$. We then calculated the difference in model performance $\Delta R_i^2 = R^2_{\mathrm{1mer+shape}_i} - R^2_{\mathrm{1mer}}$ for each nucleotide position $i$ (Figure 4.5A). Thus, a positive ratio $\Delta R_i^2 / R^2_{\mathrm{1mer}}$ represents the percentage of performance gain due to available DNA shape features at nucleotide position $i$. The ratio at position $i$ compared to the other positions reflects the relative importance of DNA shape features at different nucleotide positions. We visualized the ratio $\Delta R_i^2 / R^2_{\mathrm{1mer}}$ as a function of position $i$ for each TF in the form of a heat map (Figure 4.6A). To avoid interference from DNA sequence information, we devised another feature selection approach in which we removed DNA shape features position by position from a shape-only model (Figure 4.5B, Figure 4.6B). Such quantitative information on the position-dependent DNA shape importance in TF-DNA recognition at single bp-resolution provides a way to gain structural insights into protein-DNA recognition mechanisms from sequence data, which to date has depended on the availability of experimentally solved structures.

Figure 4.5 Schematic representation of the feature selection process. (A) Add DNA shape features at individual positions one by one to a sequence-only model. (B) Remove DNA shape features from individual positions one by one from a shape-only model.

Figure 4.6 shows the position-dependent DNA shape importance for the homeodomain TFs that recognize a 5’-TAAT-3’ DNA motif. For the majority of these TFs, DNA shape is shown to be more important on the 3’ side of the core motif, indicated by the darkness of colors (Figure 4.6). Interestingly, homeodomain TFs that recognize a different motif, e.g., 5’-ATAAAA-3’ and 5’-TCGTAAA-3’, have a different positional DNA shape preference (Figure 4.7). Such positional preferences are also family-specific. For example, for bHLH and bZIP TFs, DNA shape in the two regions flanking the core motif is generally more important.
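The position-wise ratio used for the heat maps in Section 4.3.4 can be sketched as follows. The function names and the numerical values are hypothetical and serve only to show the arithmetic; in the study the R² values come from cross-validated L2-regularized MLR models, which are not reproduced here.

```python
def relative_shape_gain(r2_1mer, r2_with_shape_at):
    """Return {position: (R2_{1mer+shape_i} - R2_1mer) / R2_1mer}.

    r2_with_shape_at maps a nucleotide position i to the R^2 of the
    1mer+shape_i model; r2_1mer is the R^2 of the baseline 1mer model.
    """
    if r2_1mer <= 0:
        raise ValueError("baseline R^2 must be positive to form the ratio")
    return {i: (r2 - r2_1mer) / r2_1mer for i, r2 in r2_with_shape_at.items()}


def most_important_position(gains):
    """Position with the largest relative gain from adding shape features."""
    return max(gains, key=gains.get)


# Hypothetical R^2 values for illustration only.
gains = relative_shape_gain(0.60, {3: 0.63, 4: 0.72, 5: 0.66})
for position in sorted(gains):
    print(position, round(100 * gains[position], 1), "%")
print("largest gain at position", most_important_position(gains))
```

Expressing the gain relative to the baseline, rather than as a raw difference, makes positions comparable across TFs whose 1mer models differ in accuracy.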
The homeodomain, GATA and RFX TFs are generally sensitive to DNA shape on only one flank of the core motif (Figure 4.7).

Figure 4.6 The importance of DNA shape features as a function of nucleotide positions revealed by feature selection with machine learning. (A) Heat map based on adding DNA shape features to a sequence-only model. (B) Heat map based on removing DNA shape features from a shape-only model. The letter-case of the TF names indicates the species, with upper case being human TFs and lower case being mouse TFs.

[Figure 4.7, panels A to K: paired heat maps for the bHLH, bZIP, GATA, E2F, Homeodomain, Myb, Nuclear receptor, PAX, POU, RFX, and TFAP families. Rows are individual TFs plus a family average; columns are labeled with the motif nucleotides; color scales are in %.]

Figure 4.7 Positional DNA shape importance revealed by feature selection for TF families (A) bHLH; (B) bZIP; (C) E2F; (D) GATA; (E) Homeodomain; (F) Myb; (G) Nuclear receptor; (H) PAX; (I) POU; (J) RFX; (K) TFAP.

We illustrate the relevance of this approach based on structures of PITX2 (PDB ID 2LKX) and GBX1 (PDB ID 2ME6) downloaded from the Protein Data Bank (PDB), as the structures provide possible explanations for the entries PITX3 and GBX1 on the heat maps (Figure 4.8A and Figure 4.8B). (Since no PDB structure for PITX3 is available, we used an NMR structure for PITX2, as it shares the same DNA binding domain as PITX3.) In the heat maps, PITX3 has darker colors on the 3’ side of the 5’-TAAT-3’ motif, indicating a more important role of DNA shape at those darker positions. Interestingly, in the PITX2 structure, the N-terminal tail of the protein interacts with the DNA in the minor groove of the 5'-TAAT-3' motif. The structure contains a narrowed DNA minor groove region on the 3’ side of a 5’-TAAT-3’ motif (Figure 4.8A).
Taken together, these observations suggest that the protein might exploit the DNA structural characteristics at the positions highlighted in the heat maps to achieve its binding specificity. We observed a similar agreement between heat map analysis and structural analysis for the TF GBX1 (Figure 4.8B). Moreover, the heat maps are consistent with our conclusion that DNA shape features in flanking regions are important for TF-DNA binding specificities (Figure 4.2C, D).

To visualize the detailed DNA shape preferences of individual TFs, we propose a new visualization, the DNA shape logo, in analogy to the sequence logos used for position weight matrices (PWMs). In this logo, we use the letters H, M, P, and R to represent the DNA shape features HelT, MGW, ProT, and Roll, respectively. The height of each letter indicates the weight in the model of the corresponding DNA shape feature at a specific position. As an example, we used the learned weights associated with DNA shape features in a shape model to generate the shape logos for PITX3 and GBX1 (Figure 4.8C and Figure 4.8D). In both shape logos, a prominent letter M on the negative axis indicates a preference for a narrow minor groove at position 6, which agrees well with the PDB structures showing a narrow DNA minor groove near that position. Again, since DNA shape information is missing for the two nucleotide positions at each end of the TFBS, no letters are shown at those positions in the shape logo.

Figure 4.8 Structure view and DNA sequence and shape logos for the homeodomain TFs PITX2/PITX3 and GBX1. (A) NMR structure of PITX2 in complex with DNA (PDB ID: 2LKX). (B) NMR structure of GBX1 in complex with DNA (PDB ID: 2ME6). (C) DNA sequence and shape logos for PITX3. (D) DNA sequence and shape logos for GBX1.

4.4 Discussion

Protein-DNA binding models have evolved tremendously in the last decade (Weirauch, Cote et al. 2013, Slattery, Zhou et al. 2014). In the past, binding models were based on a few high-affinity BSs.
These enabled the identification and prediction of the most likely BSs in vivo, but missed many potential low-affinity sites (Stormo 2000, Tanay 2006). Recently, it has been shown that weak and suboptimal TFBSs play an important role in transcriptional regulation (Crocker, Abe et al. 2015, Farley, Olson et al. 2015). These findings further emphasize the necessity of a quantitative understanding of TF-DNA binding specificities.

Solving structures through X-ray crystallography and NMR spectroscopy provides detailed mechanisms of protein-DNA binding involving single DNA target sites and has greatly advanced our understanding of protein-DNA recognition. However, it is inherently difficult to bring these insights to a high-throughput level. Protein crystallization is still a time-consuming process, and deriving distance constraints from NMR experiments is nontrivial and costly. Electron microscopy is now gaining much attention due to technological breakthroughs that allow it to solve structures at an atomic-level resolution (Hang, Wan et al. 2015). However, it remains a low-throughput platform.

Alternatively, in the genomics field, sequencing- and microarray-based high-throughput methods have now made it possible to systematically study in vitro TF-DNA binding specificities by simultaneously measuring binding affinities to thousands and even millions of different DNA sequences. In vitro platforms such as HT-SELEX and PBM provide an effective way to gain quantitative knowledge of TF-DNA binding (Berger, Philippakis et al. 2006, Zhao, Granas et al. 2009, Jolma, Kivioja et al. 2010), because the confounding factors that operate in vivo are absent. With sequencing depth increased 10-fold on average relative to the original data (Jolma, Yan et al. 2013), the HT-SELEX data generated in this study are so far the most extensive set of TF-DNA binding measurements for mammalian TFs.
We constructed a pipeline that derives binding affinities for different DNA M-words from these HT-SELEX data, gaining a much more detailed view of the binding energy landscape than simple PWM models provide. This enabled us to explore, through machine learning methods, how the mechanisms of DNA shape readout are employed by a variety of TF families. With feature selection techniques, we revealed positional DNA shape importance for TFs at bp resolution. The results concur with available NMR structures. We would like to emphasize that this study provides a way to derive binding mechanisms from sequence data without relying on solving structures.

We see several limitations in our study. First, while increasing the sequencing depth improved the statistical robustness of the binding affinities derived for M-words, the sequencing depth could be further increased to a level at least comparable to that used in (Slattery, Riley et al. 2011). It should be noted, however, that this is not trivial considering the large number of TFs being studied. Second, we resolved some of the reported biases in HT-SELEX data by basing our analysis on known core motifs, allowing only one core motif within each oligo, removing PCR duplicates, and normalizing by the initial round. Still, the data may suffer from other unknown artifacts. Third, although DNA shape logos provide an interpretable visualization of the breakdown of DNA shape readout into detailed preferences for the four structural parameters, i.e., MGW, HelT, ProT, and Roll, they may be inaccurate because the four DNA shape features are not independent of each other. The resulting shape logos thus provide a general guide to the DNA shape preferences and should be used with caution. Fourth, although we found TF-DNA structures that supported the results, this support is far from conclusive, since the PDB does not yet contain structures providing direct evidence for the majority of the studied TFs.
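The shape-logo construction from Section 4.3, which the third limitation refers to, can be sketched in a few lines of Python. This is only an illustrative sketch and not the code used to produce Figure 4.8: the function name `shape_logo_stacks`, the input layout, and the toy weights are all hypothetical. It also assumes, for simplicity, that all four features are indexed on the same positions, which glosses over the fact that HelT and Roll are defined on base-pair steps rather than on single nucleotides.

```python
import math

# One-letter codes used in the text for the four DNA shape features.
SHAPE_LETTERS = {"HelT": "H", "MGW": "M", "ProT": "P", "Roll": "R"}


def shape_logo_stacks(weights):
    """Convert per-position model weights into letter stacks for a shape logo.

    `weights` maps a feature name to a list with one weight per position;
    None or NaN marks a position without shape information. The result has
    one dict per position with an 'up' stack (positive weights) and a 'down'
    stack (negative weights). Each stack is a list of (letter, height)
    pairs ordered from the axis outward, i.e. smallest height first.
    """
    if not weights:
        return []
    lengths = {len(w) for w in weights.values()}
    if len(lengths) != 1:
        raise ValueError("every feature needs one weight per position")
    n_positions = lengths.pop()

    logo = []
    for pos in range(n_positions):
        up, down = [], []
        for feature, letter in SHAPE_LETTERS.items():
            if feature not in weights:
                continue
            w = weights[feature][pos]
            # Positions without shape data, or with zero weight, get no letter.
            if w is None or math.isnan(w) or w == 0:
                continue
            (up if w > 0 else down).append((letter, abs(w)))
        up.sort(key=lambda item: item[1])
        down.sort(key=lambda item: item[1])
        logo.append({"up": up, "down": down})
    return logo


# Hypothetical weights for an 8-bp site (not taken from the thesis). None marks
# the two positions at each end where no shape value is available.
toy_weights = {
    "HelT": [None, None, 0.2, 0.1, -0.1, 0.3, None, None],
    "MGW": [None, None, 0.1, -0.9, -0.4, 0.2, None, None],
    "ProT": [None, None, -0.3, 0.2, 0.1, 0.0, None, None],
    "Roll": [None, None, 0.4, -0.2, 0.5, -0.1, None, None],
}

for position, stack in enumerate(shape_logo_stacks(toy_weights), start=1):
    print(position, stack)
```

In this toy input, the large negative MGW weight at the fourth position ends up as the outermost letter of the 'down' stack, which corresponds to a prominent M on the negative axis of a logo. Because the four features are correlated, letter heights from such a stack should be read as a rough guide only, as noted in the limitation above.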
Nevertheless, we believe that with the methodology used in this study and by improving the quality of TF-DNA binding measurements in the future, we will gain accurate mechanistic insights into TF-DNA binding. Last, although understanding in vitro protein-DNA binding mechanisms is a critical step towards understanding in vivo binding, we should keep in mind that the in vivo scenario consists of multiple layers of complexity, such as the 3D genome architecture (Rao, Huntley et al. 2015), DNA accessibility (Neph, Vierstra et al. 2012), nucleosome competition (Barozzi, Simonatto et al. 2014) and TF cooperativity (Slattery, Riley et al. 2011, Crocker, Abe et al. 2015). A full understanding of gene regulation will require the integration of knowledge obtained by different fields using various technologies.

To conclude, in this work we explored, for the first time, the role of DNA shape readout systematically for a comprehensive variety of TF families using high-quality HT-SELEX data. It is also the first attempt to dissect the role of DNA shape at base-pair resolution for this large number of TFs. Through our study, we produced a valuable TF-DNA binding data resource for the community by increasing the sequencing depth of previous HT-SELEX experiments (Jolma, Yan et al. 2013) and developing tools for deriving TF-DNA binding affinities from the experimental data.

Chapter 5 Concluding remarks

Studies of TF-DNA binding specificity compiled in this thesis are mainly based on protein-DNA binding data derived from in vitro platforms. These platforms create an isolated and controllable environment that is free from the myriad confounding factors in vivo, i.e., in living cells, that interfere with TF-DNA binding events and their functional consequences. From these studies, we can learn the physical principles that underlie the DNA binding specificity of TFs and construct quantitative and predictive models of TF-DNA binding specificity.
All this knowledge, however, ultimately needs to be applied to the understanding of in vivo TF-DNA binding specificity, because of its important role in gene regulation. Experimental techniques such as chromatin immunoprecipitation with massively parallel DNA sequencing (ChIP-seq) have been developed to identify TF-DNA binding events that occur in cells (Park 2009). It has been reassuring to see that DNA sites bound in vivo are similar to those identified in in vitro experiments (Liu, Lee et al. 2006). This means that TF-DNA binding specificity in vivo is also governed by principles that operate under in vitro conditions. However, this is far from a full explanation of where TFs do and do not bind in the genome in vivo, and of how the binding events eventually lead to functional consequences such as the differentiation of cell types during the development of an organism.

In controlled in vitro experiments, TFs are generally purified and applied to short fragments of naked DNA. In the in vivo situation, TF-DNA interactions occur in the cell nucleus, which is filled with RNA, histones and all kinds of non-histone proteins. The carrier of genetic information, DNA, consists of long chains of bps packaged into nucleosomes. In addition, the global 3D organization of the DNA inside the nucleus has also been shown to play important roles in gene regulation. Based on in vitro data, one may infer millions of possible DNA binding sites in the human genome for a TF. However, the TF binds to only a small subset (thousands) of the putative binding sites in vivo. DNA accessibility controlled by the presence or absence of nucleosomes (Liu, Lee et al. 2006), specificity resulting from cooperative DNA binding of TFs (Panne 2008, Slattery, Riley et al. 2011), and epigenetic modifications such as DNA methylation (Maurano, Wang et al. 2015) partially explain this phenomenon (Figure 5.1). Still, this question remains largely unanswered.
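A back-of-the-envelope calculation shows why the number of putative sites can easily reach the millions. The Python sketch below is an illustration only and not an analysis from this thesis: it counts the expected exact matches to a single fixed k-mer in random DNA, assuming independent and equiprobable bases and an approximate haploid human genome size of 3.1 Gb. The function name `expected_motif_hits` and the constant `HUMAN_GENOME_BP` are hypothetical, and real in vitro models tolerate mismatches, so the true number of candidate sites would differ.

```python
def expected_motif_hits(genome_length, motif_length, both_strands=True):
    """Expected number of exact matches to one fixed k-mer in random DNA.

    Assumes independent, equiprobable bases (probability 1/4 each), which
    real genomes violate, so the result is an order-of-magnitude guide only.
    Counting both strands doubles the expectation; that doubling is not
    appropriate for palindromic motifs.
    """
    if motif_length < 1 or genome_length < motif_length:
        raise ValueError("need 1 <= motif_length <= genome_length")
    windows = genome_length - motif_length + 1
    strands = 2 if both_strands else 1
    return strands * windows / 4 ** motif_length


# Approximate haploid human genome size in bp (an assumption of this sketch).
HUMAN_GENOME_BP = 3_100_000_000

for k in (4, 6, 8, 10):
    hits = expected_motif_hits(HUMAN_GENOME_BP, k)
    print(f"k={k}: about {hits:,.0f} expected exact matches")
```

Under these assumptions, a 6-bp core motif is expected to occur on the order of a million times, whereas the text above notes that only thousands of sites are bound in vivo.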
Figure 5.1 Structure-based illustration of the complexity of in vivo TF-DNA binding specificity. (A) The basic helix-loop-helix (bHLH) Mad-Max heterodimer (PDB ID: 1nlw) binds to only a subset of putative DNA binding sites (blue). Some TFBSs are masked by nucleosomes (PDB ID: 1kx5), whereas other accessible TFBSs are not selected by the TF. (B) Higher-order determinants of TF binding include cofactor cooperativity (e.g., Hox-Exd heterodimer; PDB ID: 2r5z), multimeric binding (e.g., p53 tetramer; modeled based on PDB IDs 2ady and 1aie (Kitayner, Rozenberg et al. 2006)), enhanceosome cooperativity (e.g., interferon-β enhanceosome; modeled based on PDB IDs 1t2k, 2pi0, 2o6g and 2o61 (Panne, Maniatis et al. 2007)), and chromatin accessibility due to nucleosome formation (PDB ID: 1kx5 (Davey, Sargent et al. 2002)). Figure adapted from (Slattery, Zhou et al. 2014).

Unravelling the regulatory “grammar” of the genome is yet another challenge. This question involves the identification of regions, e.g., enhancers, that have regulatory effects on gene expression upon TF binding, and how the binding events in these regions lead to transcriptional regulation of the corresponding target genes. Studies have shown that enhancers in general contain clusters of binding sites for different TFs (Spitz and Furlong 2012). It is the assembly of multiple TFs on the enhancer that leads to gene regulation. Within this framework, the enhanceosome model and the billboard model have been proposed to explain the specific mechanisms involved (Arnosti and Kulkarni 2005). In the enhanceosome model, enhancer activity depends on the cooperative assembly of TFs on the enhancer. The TFBSs in the enhancer are spaced and oriented in a specific way that accommodates the binding of a set of TFs. The resulting TF-DNA complex is called the enhanceosome. Changes in gene expression only happen when the enhanceosome is assembled.
A prominent example of an enhanceosome is the human interferon-β enhanceosome, which assembles upon virus infection and consequently activates the interferon-β gene (Figure 5.1B) (Thanos and Maniatis 1995). The billboard model, on the other hand, suggests a more flexible regulatory architecture. In this model, TFs do not function as a cooperative unit in an all-or-nothing manner, so TFBSs do not need to be placed in a specific configuration to enable cooperative TF binding. Instead, it is the combinatorial binding of TFs to the diverse binding sites available within the enhancer that dictates the enhancer activity, which depends on which TFs are present (Kulkarni and Arnosti 2003). The enhanceosome and billboard models are two extremes; the mechanisms of real enhancers may lie somewhere in between.

The above discussion highlights the complexity of gene regulation, which is only a sub-problem of how cells operate at the molecular level. I conjecture that understanding this more general question will rely on a structural understanding of cells, which will continue to be pushed forward as both experimental technologies and computational power keep improving. A structural understanding is emphasized here because of its central importance to the function of any object in a general sense. To be able to perform a particular function, an object must have the right structure. For example, for a cup to function as a cup, it must have a concave structure that enables it to hold water. For DNA to function as the genetic material, it must have the double helix structure that enables the encoding of information. For this reason, answers to questions that seem unapproachable now may become much more apparent once the structure is revealed. As an example, it was once heavily debated whether protein or DNA is the genetic material. Following that, the controversy over how DNA replicates itself emerged.
With the discovery of the DNA double helix structure, these questions were soon resolved. The same logic applies to the broader question of how living cells function: they must have an architecture that “holds” the instruction molecule, DNA, in a structured way that allows the physical form of information flow described in the central dogma of molecular biology. Answers to how genomes function in the cell nucleus may readily reveal themselves once we obtain the structure of all components in the nucleus.

During this pursuit of the “ultimate” answer, computational biology will continue to play a critical role, as it has since the completion of the human genome project (Deonier, Tavaré et al. 2005). Large-scale data generated by modern molecular biology experiments provide the basis for quantitative modeling of biological processes using machine learning methods, such as the modeling of TF-DNA binding specificity discussed in this thesis. As a result, molecular biology is transforming from a descriptive subject into a predictive one. This trend will continue and translate into more quantitative and precise medicine for human health.

Bibliography

Abe, N., I. Dror, L. Yang, M. Slattery, T. Zhou, H. J. Bussemaker, R. Rohs and R. S. Mann (2015). "Deconvolving the recognition of DNA shape from sequence." Cell 161(2): 307-318. Arnosti, D. N. and M. M. Kulkarni (2005). "Transcriptional enhancers: Intelligent enhanceosomes or flexible billboards?" J Cell Biochem 94(5): 890-898. Bailey, T. L., M. Boden, F. A. Buske, M. Frith, C. E. Grant, L. Clementi, J. Y. Ren, W. W. Li and W. S. Noble (2009). "MEME SUITE: tools for motif discovery and searching." Nucleic Acids Research 37: W202-W208. Banerjee-Basu, S. and A. D. Baxevanis (2001). "Molecular evolution of the homeodomain family of transcription factors." Nucleic Acids Res 29(15): 3258-3269. Barozzi, I., M. Simonatto, S. Bonifacio, L. Yang, R. Rohs, S. Ghisletti and G. Natoli (2014). "Coregulation of Transcription Factor Binding and Nucleosome Occupancy through DNA Features of Mammalian Enhancers." Molecular Cell 54(5): 844-857. Berger, M. F., G. Badis, A. R. Gehrke, S. Talukder, A. A. Philippakis, L. Pena-Castillo, T. M. Alleyne, S. Mnaimneh, O. B. Botvinnik, E. T. Chan, F. Khalid, W. Zhang, D. Newburger, S. A. Jaeger, Q. D. Morris, M. L. Bulyk and T. R. Hughes (2008). "Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences." Cell 133(7): 1266-1276. Berger, M. F., A. A. Philippakis, A. M. Qureshi, F. X. S. He, P. W. Estep and M. L. Bulyk (2006). "Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities." Nature Biotechnology 24(11): 1429-1435. Bishop, E. P., R. Rohs, S. C. Parker, S. M. West, P. Liu, R. S. Mann, B. Honig and T. D. Tullius (2011). "A map of minor groove shape and electrostatic potential from hydroxyl radical cleavage patterns of DNA." ACS Chem Biol 6(12): 1314-1320. Bulyk, M. L., P. L. Johnson and G. M. Church (2002). "Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors." Nucleic Acids Res 30(5): 1255-1261. Chang, Y. P., M. Xu, A. C. Machado, X. J. Yu, R. Rohs and X. S. Chen (2013). "Mechanism of origin DNA recognition and assembly of an initiator-helicase complex by SV40 large tumor antigen." Cell Rep 3(4): 1117-1127. Cheatham, T. E., 3rd and P. A. Kollman (2000). "Molecular dynamics simulation of nucleic acids." Annu Rev Phys Chem 51: 435-471. Chen, Y., D. L. Bates, R. Dey, P. H. Chen, A. C. Machado, I. A. Laird-Offringa, R. Rohs and L. Chen (2012). "DNA binding by GATA transcription factor suggests mechanisms of DNA looping and long-range gene regulation." Cell Rep 2(5): 1197-1206. Chen, Y., X. Zhang, A. C. Dantas Machado, Y. Ding, Z. Chen, P. Z. Qin, R. Rohs and L. Chen (2013).
"Structure of p53 binding to the BAX response element reveals DNA unwinding and compression to accommodate base-pair insertion." Nucleic Acids Res 41(17): 8368-8376. Chiu, T. P., L. Yang, T. Zhou, B. J. Main, S. C. Parker, S. V. Nuzhdin, T. D. Tullius and R. Rohs (2015). "GBshape: a genome browser database for DNA shape annotations." Nucleic Acids Res 43(Database issue): D103-109. Consortium, E. P. (2012). "An integrated encyclopedia of DNA elements in the human genome." Nature 489(7414): 57-74. Consortium, E. P., E. Birney, J. A. Stamatoyannopoulos, A. Dutta, R. Guigo, T. R. Gingeras, E. H. Margulies, Z. Weng, M. Snyder, E. T. Dermitzakis, R. E. Thurman, M. S. Kuehn, C. M. Taylor, S. Neph, C. M. Koch, S. Asthana, A. Malhotra, I. Adzhubei, J. A. Greenbaum, R. M. Andrews, P. Flicek, P. J. Boyle, H. Cao, N. P. Carter, G. K. Clelland, S. Davis, N. Day, P. Dhami, S. C. Dillon, M. O. Dorschner, H. Fiegler, P. G. Giresi, J. Goldy, M. Hawrylycz, A. Haydock, R. Humbert, K. D. James, B. E. Johnson, E. M. Johnson, T. T. Frum, E. R. Rosenzweig, N. Karnani, K. Lee, G. C. Lefebvre, P. A. Navas, F. Neri, S. C. Parker, P. J. Sabo, R. Sandstrom, A. Shafer, D. Vetrie, M. Weaver, S. Wilcox, M. Yu, F. S. Collins, J. Dekker, J. D. Lieb, T. D. Tullius, G. E. Crawford, S. Sunyaev, W. S. Noble, I. Dunham, F. Denoeud, A. Reymond, P. Kapranov, J. Rozowsky, D. Zheng, R. Castelo, A. Frankish, J. Harrow, S. Ghosh, A. Sandelin, I. L. Hofacker, R. Baertsch, D. Keefe, S. Dike, J. Cheng, H. A. Hirsch, E. A. Sekinger, J. Lagarde, J. F. Abril, A. Shahab, C. Flamm, C. Fried, J. Hackermuller, J. Hertel, M. Lindemeyer, K. Missal, A. Tanzer, S. Washietl, J. Korbel, O. Emanuelsson, J. S. Pedersen, N. Holroyd, R. Taylor, D. Swarbreck, N. Matthews, M. C. Dickson, D. J. Thomas, M. T. Weirauch, J. Gilbert, J. Drenkow, I. Bell, X. Zhao, K. G. Srinivasan, W. K. Sung, H. S. Ooi, K. P. Chiu, S. Foissac, T. Alioto, M. Brent, L. Pachter, M. L. Tress, A. Valencia, S. W. Choo, C. Y. Choo, C. Ucla, C. Manzano, C. 
Wyss, E. Cheung, T. G. Clark, J. B. Brown, M. Ganesh, S. Patel, H. Tammana, J. Chrast, C. N. Henrichsen, C. Kai, J. Kawai, U. Nagalakshmi, J. Wu, Z. Lian, J. Lian, P. Newburger, X. Zhang, P. Bickel, J. S. Mattick, P. Carninci, Y. Hayashizaki, S. Weissman, T. Hubbard, R. M. Myers, J. Rogers, P. F. Stadler, T. M. Lowe, C. L. Wei, Y. Ruan, K. Struhl, M. Gerstein, S. E. Antonarakis, Y. Fu, E. D. Green, U. Karaoz, A. Siepel, J. Taylor, L. A. Liefer, K. A. Wetterstrand, P. J. Good, E. A. Feingold, M. S. Guyer, G. M. Cooper, G. Asimenos, C. N. Dewey, M. Hou, S. Nikolaev, J. I. Montoya-Burgos, A. Loytynoja, S. Whelan, F. Pardi, T. Massingham, H. Huang, N. R. Zhang, I. Holmes, J. C. Mullikin, A. Ureta-Vidal, B. Paten, M. Seringhaus, D. Church, K. Rosenbloom, W. J. Kent, E. A. Stone, N. C. S. Program, C. Baylor College of Medicine Human Genome Sequencing, C. Washington University Genome Sequencing, I. Broad, I. Children's Hospital Oakland Research, S. Batzoglou, N. Goldman, R. C. Hardison, D. Haussler, W. Miller, A. Sidow, N. D. Trinklein, Z. D. Zhang, L. Barrera, R. Stuart, D. C. King, A. Ameur, S. Enroth, M. C. Bieda, J. Kim, A. A. Bhinge, N. Jiang, J. Liu, F. Yao, V. B. Vega, C. W. Lee, P. Ng, A. Shahab, A. Yang, Z. Moqtaderi, Z. Zhu, X. Xu, S. Squazzo, M. J. Oberley, D. Inman, M. A. Singer, T. A. Richmond, K. J. Munn, A. Rada-Iglesias, O. Wallerman, J. Komorowski, J. C. Fowler, P. Couttet, A. W. Bruce, O. M. Dovey, P. D. Ellis, C. F. Langford, D. A. Nix, G. Euskirchen, S. Hartman, A. E. Urban, P. Kraus, S. Van Calcar, N. Heintzman, T. H. Kim, K. Wang, C. Qu, G. Hon, R. Luna, C. K. Glass, M. G. Rosenfeld, S. F. Aldred, S. J. Cooper, A. Halees, J. M. Lin, H. P. Shulha, X. Zhang, M. Xu, J. N. Haidar, Y. Yu, Y. Ruan, V. R. Iyer, R. D. Green, C. Wadelius, P. J. Farnham, B. Ren, R. A. Harte, A. S. Hinrichs, H. Trumbower, H. Clawson, J. Hillman-Jackson, A. S. Zweig, K. Smith, A. Thakkapallayil, G. Barber, R. M. Kuhn, D. Karolchik, L. Armengol, C. P. Bird, P. I.
de Bakker, A. D. Kern, N. Lopez-Bigas, J. D. Martin, B. E. Stranger, A. Woodroffe, E. Davydov, A. Dimas, E. Eyras, I. B. Hallgrimsdottir, J. Huppert, M. C. Zody, G. R. Abecasis, X. Estivill, G. G. Bouffard, X. Guan, N. F. Hansen, J. R. Idol, V. V. Maduro, B. Maskeri, J. C. McDowell, M. Park, P. J. Thomas, A. C. Young, R. W. Blakesley, D. M. Muzny, E. Sodergren, D. A. Wheeler, K. C. Worley, H. Jiang, G. M. Weinstock, R. A. Gibbs, T. Graves, R. Fulton, E. R. Mardis, R. K. Wilson, M. Clamp, J. Cuff, S. Gnerre, D. B. Jaffe, J. L. Chang, K. Lindblad-Toh, E. S. Lander, M. Koriabine, M. Nefedov, K. Osoegawa, Y. Yoshinaga, B. Zhu and P. J. de Jong (2007). "Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project." Nature 447(7146): 799-816. Crocker, J., N. Abe, L. Rinaldi, A. P. McGregor, N. Frankel, S. Wang, A. Alsawadi, P. Valenti, S. Plaza, F. Payre, R. S. Mann and D. L. Stern (2015). "Low affinity binding site clusters confer hox specificity and regulatory robustness." Cell 160(1-2): 191-203. Davey, C. A., D. F. Sargent, K. Luger, A. W. Maeder and T. J. Richmond (2002). "Solvent mediated interactions in the structure of the nucleosome core particle at 1.9 angstrom resolution." Journal of Molecular Biology 319(5): 1097-1113. de Boer, C. G. and T. R. Hughes (2012). "YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities." Nucleic Acids Research 40(D1): D169-D179. Deonier, R. C., S. Tavaré and M. Waterman (2005). Computational genome analysis: an introduction, Springer Science & Business Media. Dickerson, R. E. (1989). "Definitions and nomenclature of nucleic acid structure components." Nucleic Acids Res 17(5): 1797-1803. Drew, H. R., R. M. Wing, T. Takano, C. Broka, S. Tanaka, K. Itakura and R. E. Dickerson (1981). "Structure of a B-DNA dodecamer: conformation and dynamics." Proc Natl Acad Sci U S A 78(4): 2179-2183. Dror, I., T. Zhou, Y. Mandel-Gutfreund and R. Rohs (2014). "Covariation between homeodomain transcription factors and the shape of their DNA binding sites." Nucleic Acids Res 42(1): 430-441. Eldar, A., H. Rozenberg, Y. Diskin-Posner, R. Rohs and Z. Shakked (2013). "Structural studies of p53 inactivation by DNA-contact mutations and its rescue by suppressor mutations via alternative protein-DNA interactions." Nucleic Acids Res 41(18): 8748-8759. Farley, E. K., K. M. Olson, W. Zhang, A. J. Brandt, D. S. Rokhsar and M. S. Levine (2015). "Suboptimization of developmental enhancers." Science 350(6258): 325-328. Field, Y., N. Kaplan, Y. Fondufe-Mittendorf, I. K. Moore, E. Sharon, Y. Lubling, J. Widom and E. Segal (2008). "Distinct modes of regulation by chromatin encoded through nucleosome positioning signals." PLoS computational biology 4(11): e1000216. Fujii, S., H. Kono, S. Takenaka, N. Go and A. Sarai (2007). "Sequence-dependent DNA deformability studied using molecular dynamics simulations." Nucleic Acids Res 35(18): 6063-6074. Gordan, R., N. Shen, I. Dror, T. Zhou, J. Horton, R. Rohs and M. L. Bulyk (2013). "Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape." Cell Rep 3(4): 1093-1104. Grant, C. E., T. L. Bailey and W. S. Noble (2011). "FIMO: scanning for occurrences of a given motif." Bioinformatics 27(7): 1017-1018. Hancock, S. P., T. Ghane, D. Cascio, R. Rohs, R. Di Felice and R. C. Johnson (2013). "Control of DNA minor groove width and Fis protein binding by the purine 2-amino group." Nucleic Acids Res 41(13): 6750-6760. Hang, J., R. Wan, C. Yan and Y. Shi (2015). "Structural basis of pre-mRNA splicing." Science 349(6253): 1191-1198. Hizver, J., H. Rozenberg, F. Frolow, D. Rabinovich and Z. Shakked (2001). "DNA bending by an adenine--thymine tract and its role in gene regulation." Proc Natl Acad Sci U S A 98(15): 8490-8495. Ho, J. W., Y. L. Jung, T. Liu, B. H. Alver, S. Lee, K. Ikegami, K. A. Sohn, A. Minoda, M. Y. Tolstorukov, A.
Appert, S. C. Parker, T. Gu, A. Kundaje, N. C. Riddle, E. Bishop, T. A. Egelhofer, S. S. Hu, A. A. Alekseyenko, A. Rechtsteiner, D. Asker, J. A. Belsky, S. K. Bowman, Q. B. Chen, R. A. Chen, D. S. Day, Y. Dong, A. C. Dose, X. Duan, C. B. Epstein, S. Ercan, E. A. Feingold, F. Ferrari, J. M. Garrigues, N. Gehlenborg, P. J. Good, P. Haseley, D. He, M. Herrmann, M. M. Hoffman, T. E. Jeffers, P. V. Kharchenko, P. Kolasinska-Zwierz, C. V. Kotwaliwale, N. Kumar, S. A. Langley, E. N. Larschan, I. Latorre, M. W. Libbrecht, X. Lin, R. Park, M. J. Pazin, H. N. Pham, A. Plachetka, B. Qin, Y. B. Schwartz, N. Shoresh, P. Stempor, A. Vielle, C. Wang, C. M. Whittle, H. Xue, R. E. Kingston, J. H. Kim, B. E. Bernstein, A. F. Dernburg, V. Pirrotta, M. I. Kuroda, W. S. Noble, T. D. Tullius, M. Kellis, D. M. MacAlpine, S. Strome, S. C. Elgin, X. S. Liu, J. D. Lieb, J. Ahringer, G. H. Karpen and P. J. Park (2014). "Comparative analysis of metazoan chromatin organization." Nature 512(7515): 449-452. Ho, P. S. and M. Carter (2011). "DNA Structure: Alphabet Soup for the Cellular Soul, DNA Replication-Current Advances, Dr Herve Seligmann (Ed.)." InTech. Hooghe, B., S. Broos, F. van Roy and P. De Bleser (2012). "A flexible integrative approach based on random forest improves prediction of transcription factor binding sites." Nucleic Acids Res 40(14): e106. Hume, M. A., L. A. Barrera, S. S. Gisselbrecht and M. L. Bulyk (2015). "UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein-DNA interactions." Nucleic Acids Res 43(Database issue): D117-122. Jolma, A., T. Kivioja, J. Toivonen, L. Cheng, G. Wei, M. Enge, M. Taipale, J. M. Vaquerizas, J. Yan, M. J. Sillanpaa, M. Bonke, K. Palin, S. Talukder, T. R. Hughes, N. M. Luscombe, E. Ukkonen and J. Taipale (2010). "Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities." Genome Res 20(6): 861-873. Jolma, A., J. Yan, T. Whitington, J. 
Toivonen, K. R. Nitta, P. Rastas, E. Morgunova, M. Enge, M. Taipale, G. Wei, K. Palin, J. M. Vaquerizas, R. Vincentelli, N. M. Luscombe, T. R. Hughes, P. Lemaire, E. Ukkonen, T. Kivioja and J. Taipale (2013). "DNA-binding specificities of human transcription factors." Cell 152(1-2): 327-339. Joshi, R., J. M. Passner, R. Rohs, R. Jain, A. Sosinsky, M. A. Crickmore, V. Jacob, A. K. Aggarwal, B. Honig and R. S. Mann (2007). "Functional specificity of a Hox protein mediated by the recognition of minor groove structure." Cell 131(3): 530-543. Kim, S., E. Brostromer, D. Xing, J. S. Jin, S. S. Chong, H. Ge, S. Y. Wang, C. Gu, L. J. Yang, Y. Q. Gao, X. D. Su, Y. J. Sun and X. S. Xie (2013). "Probing Allostery Through DNA." Science 339(6121): 816-819. Kitayner, M., H. Rozenberg, N. Kessler, D. Rabinovich, L. Shaulov, T. E. Haran and Z. Shakked (2006). "Structural basis of DNA recognition by p53 tetramers." Mol Cell 22(6): 741-753. Kitayner, M., H. Rozenberg, R. Rohs, O. Suad, D. Rabinovich, B. Honig and Z. Shakked (2010). "Diversity in DNA recognition by p53 revealed by crystal structures with Hoogsteen base pairs." Nat Struct Mol Biol 17(4): 423-429. Kulkarni, M. M. and D. N. Arnosti (2003). "Information display by transcriptional enhancers." Development 130(26): 6569-6575. Lander, E. S., L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, J. Howland, L. Kann, J. Lehoczky, R. LeVine, P. McEwan, K. McKernan, J. Meldrim, J. P. Mesirov, C. Miranda, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A. Sheridan, C. Sougnez, Y. Stange-Thomann, N. Stojanovic, A. Subramanian, D. Wyman, J. Rogers, J. Sulston, R. Ainscough, S. Beck, D. Bentley, J. Burton, C. Clee, N. Carter, A. Coulson, R. Deadman, P. Deloukas, A. Dunham, I. Dunham, R. Durbin, L. French, D. Grafham, S. Gregory, T. Hubbard, S. Humphray, A. Hunt, M. Jones, C. Lloyd, A. McMurray, L. Matthews, S. Mercer, S. Milne, J.
C. Mullikin, A. Mungall, R. Plumb, M. Ross, R. Shownkeen, S. Sims, R. H. Waterston, R. K. Wilson, L. W. Hillier, J. D. McPherson, M. A. Marra, E. R. Mardis, L. A. Fulton, A. T. Chinwalla, K. H. Pepin, W. R. Gish, S. L. Chissoe, M. C. Wendl, K. D. Delehaunty, T. L. Miner, A. Delehaunty, J. B. Kramer, L. L. Cook, R. S. Fulton, D. L. Johnson, P. J. Minx, S. W. Clifton, T. Hawkins, E. Branscomb, P. Predki, P. Richardson, S. Wenning, T. Slezak, N. Doggett, J. F. Cheng, A. Olsen, S. Lucas, C. Elkin, E. Uberbacher, M. Frazier, R. A. Gibbs, D. M. Muzny, S. E. Scherer, J. B. Bouck, E. J. Sodergren, K. C. Worley, C. M. Rives, J. H. Gorrell, M. L. Metzker, S. L. Naylor, R. S. Kucherlapati, D. L. Nelson, G. M. Weinstock, Y. Sakaki, A. Fujiyama, M. Hattori, T. Yada, A. Toyoda, T. Itoh, C. Kawagoe, H. Watanabe, Y. Totoki, T. Taylor, J. Weissenbach, R. Heilig, W. Saurin, F. Artiguenave, P. Brottier, T. Bruls, E. Pelletier, C. Robert, P. Wincker, D. R. Smith, L. Doucette-Stamm, M. Rubenfield, K. Weinstock, H. M. Lee, J. Dubois, A. Rosenthal, M. Platzer, G. Nyakatura, S. Taudien, A. Rump, H. Yang, J. Yu, J. Wang, G. Huang, J. Gu, L. Hood, L. Rowen, A. Madan, S. Qin, R. W. Davis, N. A. Federspiel, A. P. Abola, M. J. Proctor, R. M. Myers, J. Schmutz, M. Dickson, J. Grimwood, D. R. Cox, M. V. Olson, R. Kaul, C. Raymond, N. Shimizu, K. Kawasaki, S. Minoshima, G. A. Evans, M. Athanasiou, R. Schultz, B. A. Roe, F. Chen, H. Pan, J. Ramser, H. Lehrach, R. Reinhardt, W. R. McCombie, M. de la Bastide, N. Dedhia, H. Blocker, K. Hornischer, G. Nordsiek, R. Agarwala, L. Aravind, J. A. Bailey, A. Bateman, S. Batzoglou, E. Birney, P. Bork, D. G. Brown, C. B. Burge, L. Cerutti, H. C. Chen, D. Church, M. Clamp, R. R. Copley, T. Doerks, S. R. Eddy, E. E. Eichler, T. S. Furey, J. Galagan, J. G. Gilbert, C. Harmon, Y. Hayashizaki, D. Haussler, H. Hermjakob, K. Hokamp, W. Jang, L. S. Johnson, T. A. Jones, S. Kasif, A. Kaspryzk, S. Kennedy, W. J. Kent, P. Kitts, E. V. Koonin, I. Korf, D. Kulp, D. 
Lancet, T. M. Lowe, A. McLysaght, T. Mikkelsen, J. V. Moran, N. Mulder, V. J. Pollara, C. P. Ponting, G. Schuler, J. Schultz, G. Slater, A. F. Smit, E. Stupka, J. Szustakowki, D. Thierry-Mieg, J. Thierry-Mieg, L. Wagner, J. Wallis, R. Wheeler, A. Williams, Y. I. Wolf, K. H. Wolfe, S. P. Yang, R. F. Yeh, F. Collins, M. S. Guyer, J. Peterson, A. Felsenfeld, K. A. Wetterstrand, A. Patrinos, M. J. Morgan, P. de Jong, J. J. Catanese, K. Osoegawa, H. Shizuya, S. Choi, Y. J. Chen, J. Szustakowki and C. International Human Genome Sequencing (2001). "Initial sequencing and analysis of the human genome." Nature 409(6822): 860-921. Lazarovici, A., T. Zhou, A. Shafer, A. C. Dantas Machado, T. R. Riley, R. Sandstrom, P. J. Sabo, Y. Lu, R. Rohs, J. A. Stamatoyannopoulos and H. J. Bussemaker (2013). "Probing DNA shape and methylation state on a genomic scale with DNase I." Proc Natl Acad Sci U S A 110(16): 6376-6381. Levine, H. A. and M. Nilsen-Hamilton (2007). "A mathematical analysis of SELEX." Comput Biol Chem 31(1): 11-35. Liu, X., C. K. Lee, J. A. Granek, N. D. Clarke and J. D. Lieb (2006). "Whole-genome comparison of Leu3 binding in vitro and in vivo reveals the importance of nucleosome occupancy in target site selection." Genome Res 16(12): 1517-1528. Lowary, P. T. and J. Widom (1998). "New DNA sequence rules for high affinity binding to histone octamer and sequence-directed nucleosome positioning." J Mol Biol 276(1): 19-42. Maienschein-Cline, M., A. R. Dinner, W. S. Hlavacek and F. P. Mu (2012). "Improved predictions of transcription factor binding sites using physicochemical features of DNA." Nucleic Acids Research 40(22). Man, T. K. and G. D. Stormo (2001). "Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay." Nucleic Acids Res 29(12): 2471-2478. Mann, R. S., K. M. Lelli and R. Joshi (2009). "Hox specificity unique roles for cofactors and collaborators."
Curr Top Dev Biol 88: 63-101. Mathelier, A. and W. W. Wasserman (2013). "The next generation of transcription factor binding site prediction." PLoS Comput Biol 9(9): e1003214. Mathelier, A., X. Zhao, A. W. Zhang, F. Parcy, R. Worsley-Hunt, D. J. Arenillas, S. Buchman, C. Y. Chen, A. Chou, H. Ienasescu, J. Lim, C. Shyr, G. Tan, M. Zhou, B. Lenhard, A. Sandelin and W. W. Wasserman (2014). "JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles." Nucleic Acids Res 42(Database issue): D142-147. 120 Maurano, M. T., H. Wang, S. John, A. Shafer, T. Canfield, K. Lee and J. A. Stamatoyannopoulos (2015). "Role of DNA Methylation in Modulating Transcription Factor Occupancy." Cell Rep 12(7): 1184-1195. Mavrich, T. N., I. P. Ioshikhes, B. J. Venters, C. Jiang, L. P. Tomsho, J. Qi, S. C. Schuster, I. Albert and B. F. Pugh (2008). "A barrier nucleosome model for statistical positioning of nucleosomes throughout the yeast genome." Genome Res 18(7): 1073- 1083. Mavrich, T. N., C. Jiang, I. P. Ioshikhes, X. Li, B. J. Venters, S. J. Zanton, L. P. Tomsho, J. Qi, R. L. Glaser, S. C. Schuster, D. S. Gilmour, I. Albert and B. F. Pugh (2008). "Nucleosome organization in the Drosophila genome." Nature 453(7193): 358-362. Meijsing, S. H., M. A. Pufall, A. Y. So, D. L. Bates, L. Chen and K. R. Yamamoto (2009). "DNA binding site sequence directs glucocorticoid receptor structure and activity." Science 324(5925): 407-410. Meysman, P., H. D. Thanh, K. Laukens, R. De Smet, Y. Wu, K. Marchal and K. Engelen (2011). "Use of structural DNA properties for the prediction of transcription-factor binding sites in Escherichia coli." Nucleic Acids Research 39(2). Mordelet, F., J. Horton, A. J. Hartemink, B. E. Engelhardt and R. Gordan (2013). "Stability selection for regression-based models of transcription factor-DNA binding specificity." Bioinformatics 29(13): i117-125. Mordelet, F., J. Horton, A. J. Hartemink, B. E. Engelhardt and R. 
Gordan (2013). "Stability selection for regression-based models of transcription factor-DNA binding specificity." Bioinformatics 29(13): 117-125. Neph, S., J. Vierstra, A. B. Stergachis, A. P. Reynolds, E. Haugen, B. Vernot, R. E. Thurman, S. John, R. Sandstrom, A. K. Johnson, M. T. Maurano, R. Humbert, E. Rynes, H. Wang, S. Vong, K. Lee, D. Bates, M. Diegel, V. Roach, D. Dunn, J. Neri, A. Schafer, R. S. Hansen, T. Kutyavin, E. Giste, M. Weaver, T. Canfield, P. Sabo, M. Zhang, G. Balasundaram, R. Byron, M. J. MacCoss, J. M. Akey, M. A. Bender, M. Groudine, R. Kaul and J. A. Stamatoyannopoulos (2012). "An expansive human regulatory lexicon encoded in transcription factor footprints." Nature 489(7414): 83-90. Olson, W. K., A. A. Gorin, X. J. Lu, L. M. Hock and V. B. Zhurkin (1998). "DNA sequence-dependent deformability deduced from protein-DNA crystal complexes." Proc Natl Acad Sci U S A 95(19): 11163-11168. Ostuni, R. and G. Natoli (2013). "Lineages, cell types and functional states: a genomic view." Curr Opin Cell Biol 25(6): 759-764. 121 Otwinowski, Z., R. W. Schevitz, R. G. Zhang, C. L. Lawson, A. Joachimiak, R. Q. Marmorstein, B. F. Luisi and P. B. Sigler (1988). "Crystal-Structure of Trp Repressor Operator Complex at Atomic Resolution." Nature 335(6188): 321-329. Panne, D. (2008). "The enhanceosome." Curr Opin Struct Biol 18(2): 236-242. Panne, D., T. Maniatis and S. C. Harrison (2007). "An atomic model of the interferon- beta enhanceosome." Cell 129(6): 1111-1123. Park, P. J. (2009). "ChIP-seq: advantages and challenges of a maturing technology." Nat Rev Genet 10(10): 669-680. Parker, S. C. and T. D. Tullius (2011). "DNA shape, genetic codes, and evolution." Curr Opin Struct Biol 21(3): 342-347. Pennacchio, L. A., W. Bickmore, A. Dean, M. A. Nobrega and G. Bejerano (2013). "Enhancers: five essential questions." Nat Rev Genet 14(4): 288-295. Portales-Casamar, E., S. Thongjuea, A. T. Kwon, D. Arenillas, X. B. Zhao, E. Valen, D. Yusuf, B. Lenhard, W. W. 
Wasserman and A. Sandelin (2010). "JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles." Nucleic Acids Research 38: D105-D110. Pruitt, K. D., T. Tatusova, G. R. Brown and D. R. Maglott (2012). "NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy." Nucleic Acids Res 40(Database issue): D130-135. Rao, S. S. P., M. H. Huntley, N. C. Durand, E. K. Stamenova, I. D. Bochkov, J. T. Robinson, A. L. Sanborn, I. Machol, A. D. Omer, E. S. Lander and E. L. Aiden (2014). "A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping." Cell 159(7): 1665-1680. Rao, S. S. P., M. H. Huntley, N. C. Durand, E. K. Stamenova, I. D. Bochkov, J. T. Robinson, A. L. Sanborn, I. Machol, A. D. Omer, E. S. Lander and E. L. Aiden (2015). "A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping (vol 159, pg 1665, 2014)." Cell 162(3): 687-688. Rhodes, G. (2010). Crystallography made crystal clear: a guide for users of macromolecular models, Academic press. Riley, T. R., A. Lazarovici, R. S. Mann and H. J. Bussemaker (2015). "Building accurate sequence-to-affinity models from high-throughput in vitro protein-DNA binding data using FeatureREDUCE." Elife 4: e06397. 122 Riley, T. R., M. Slattery, N. Abe, C. Rastogi, D. Liu, R. S. Mann and H. J. Bussemaker (2014). "SELEX-seq: a method for characterizing the complete repertoire of binding site preferences for transcription factor complexes." Methods Mol Biol 1196: 255-278. Robasky, K. and M. L. Bulyk (2011). "UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein-DNA interactions." Nucleic Acids Res 39(Database issue): D124-128. Rohs, R., X. Jin, S. M. West, R. Joshi, B. Honig and R. S. Mann (2010). "Origins of specificity in protein-DNA recognition." Annu Rev Biochem 79: 233-269. Rohs, R., H. Sklenar and Z. Shakked (2005). 
"Structural and energetic origins of sequence-specific DNA bending: Monte Carlo simulations of papillomavirus E2-DNA binding sites." Structure 13(10): 1499-1509. Rohs, R., S. M. West, P. Liu and B. Honig (2009a). "Nuance in the double-helix and its role in protein-DNA recognition." Current opinion in structural biology 19(2): 171-177. Rohs, R., S. M. West, A. Sosinsky, P. Liu, R. S. Mann and B. Honig (2009b). "The role of DNA shape in protein–DNA recognition." Nature 461(7268): 1248-1253. Ryoo, H. D. and R. S. Mann (1999). "The control of trunk Hox specificity and activity by Extradenticle." Genes Dev 13(13): 1704-1716. Schneider, T. D. and R. M. Stephens (1990). "Sequence logos: a new way to display consensus sequences." Nucleic Acids Res 18(20): 6097-6100. Seeman, N. C., J. M. Rosenberg and A. Rich (1976). "Sequence-specific recognition of double helical nucleic acids by proteins." Proceedings of the National Academy of Sciences 73(3): 804-808. Segal, E., Y. Fondufe-Mittendorf, L. Chen, A. Thastrom, Y. Field, I. K. Moore, J. P. Wang and J. Widom (2006). "A genomic code for nucleosome positioning." Nature 442(7104): 772-778. Sharon, E., S. Lubliner and E. Segal (2008). "A feature-based approach to modeling protein-DNA interactions." PLoS Comput Biol 4(8): e1000154. Shi, Y. G. (2014). "A Glimpse of Structural Biology through X-Ray Crystallography." Cell 159(5): 995-1014. 123 Slattery, M., T. Riley, P. Liu, N. Abe, P. Gomez-Alcala, I. Dror, T. Y. Zhou, R. Rohs, B. Honig, H. J. Bussemaker and R. S. Mann (2011). "Cofactor Binding Evokes Latent Differences in DNA Binding Specificity between Hox Proteins." Cell 147(6): 1270-1282. Slattery, M., T. Zhou, L. Yang, A. C. Dantas Machado, R. Gordan and R. Rohs (2014). "Absence of a simple code: how transcription factors read the genome." Trends Biochem Sci 39(9): 381-399. Spitz, F. and E. E. Furlong (2012). "Transcription factors: from enhancer binding to developmental control." Nat Rev Genet 13(9): 613-626. Stormo, G. D. 
(2000). "DNA binding sites: representation and discovery." Bioinformatics 16(1): 16-23. Stormo, G. D. (2013). "Modeling the specificity of protein-DNA interactions." Quantitative Biology 1(2): 115-130. Stormo, G. D. and Y. Zhao (2010). "Determining the specificity of protein-DNA interactions." Nat Rev Genet 11(11): 751-760. Tanay, A. (2006). "Extensive low-affinity transcriptional interactions in the yeast genome." Genome Res 16(8): 962-972. Thanos, D. and T. Maniatis (1995). "Virus induction of human IFN beta gene expression requires the assembly of an enhanceosome." Cell 83(7): 1091-1100. Thomsen, M. C. and M. Nielsen (2012). "Seq2Logo: a method for construction and visualization of amino acid binding motifs and sequence profiles including sequence weighting, pseudo counts and two-sided representation of amino acid enrichment and depletion." Nucleic Acids Res 40(Web Server issue): W281-287. Tjong, H., W. Li, R. Kalhor, C. Dai, S. Hao, K. Gong, Y. Zhou, H. Li, X. J. Zhou, M. A. Le Gros, C. A. Larabell, L. Chen and F. Alber (2016). "Population-based 3D genome structure analysis reveals driving forces in spatial genome organization." Proc Natl Acad Sci U S A 113(12): E1663-1672. UniProt, C. (2015). "UniProt: a hub for protein information." Nucleic Acids Res 43(Database issue): D204-212. Warshel, A. and M. Levitt (1976). "Theoretical studies of enzymic reactions: dielectric, electrostatic and steric stabilization of the carbonium ion in the reaction of lysozyme." J Mol Biol 103(2): 227-249. 124 Watson, J. D. and F. H. Crick (1953). "Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid." Nature 171(4356): 737-738. Watson, L. C., K. M. Kuchenbecker, B. J. Schiller, J. D. Gross, M. A. Pufall and K. R. Yamamoto (2013). "The glucocorticoid receptor dimer interface allosterically transmits sequence-specific DNA signals." Nat Struct Mol Biol 20(7): 876-883. Weirauch, M. T., A. Cote, R. Norel, M. Annala, Y. Zhao, T. R. Riley, J. Saez-Rodriguez, T. 
Cokelaer, A. Vedenko, S. Talukder, D. Consortium, H. J. Bussemaker, Q. D. Morris, M. L. Bulyk, G. Stolovitzky and T. R. Hughes (2013). "Evaluation of methods for modeling transcription factor sequence specificity." Nat Biotechnol 31(2): 126-134. Weirauch, M. T. and T. Hughes (2011). A catalogue of eukaryotic transcription factor types, their evolutionary origin, and species distribution. A Handbook of Transcription Factors, Springer Netherlands: 25-73. Yang, L., T. Zhou, I. Dror, A. Mathelier, W. W. Wasserman, R. Gordan and R. Rohs (2014). "TFBSshape: a motif database for DNA shape features of transcription factor binding sites." Nucleic Acids Res 42(Database issue): D148-155. Zhang, Y., Z. Moqtaderi, B. P. Rattner, G. Euskirchen, M. Snyder, J. T. Kadonaga, X. S. Liu and K. Struhl (2009). "Intrinsic histone-DNA interactions are not the major determinant of nucleosome positions in vivo." Nature Structural & Molecular Biology 16(8): 847-852. Zhao, Y., D. Granas and G. D. Stormo (2009). "Inferring binding energies from selected binding sites." PLoS Comput Biol 5(12): e1000590. Zhao, Y., S. Ruan, M. Pandey and G. D. Stormo (2012). "Improved models for transcription factor binding site identification using nonindependent interactions." Genetics 191(3): 781-790. Zhou, T., N. Shen, L. Yang, N. Abe, J. Horton, R. S. Mann, H. J. Bussemaker, R. Gordan and R. Rohs (2015). "Quantitative modeling of transcription factor binding specificities using DNA shape." Proceedings of the National Academy of Sciences of the United States of America 112(15): 4654-4659. Zhou, T., L. Yang, Y. Lu, I. Dror, A. C. Dantas Machado, T. Ghane, R. Di Felice and R. Rohs (2013). "DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale." Nucleic Acids Res 41(Web Server issue): W56- 62. 125 Zhu, L. H. J., R. G. Christensen, M. Kazemian, C. J. Hull, M. S. Enuameh, M. D. Basciotta, J. A. Brasefield, C. Zhu, Y. Asriyan, D. S. Lapointe, S. Sinha, S. A. 
Wolfe and M. H. Brodsky (2011). "FlyFactorSurvey: a database of Drosophila transcription factor binding specificities determined using the bacterial one-hybrid system." Nucleic Acids Research 39: D111-D117.
Asset Metadata
Creator: Yang, Lin (author)
Core Title: Profiling transcription factor-DNA binding specificity
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Computational Biology and Bioinformatics
Publication Date: 01/22/2017
Defense Date: 06/06/2016
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: bioinformatics, Computational Biology, DNA, genomics, machine learning, Molecular Biology, OAI-PMH Harvest, protein, quantitative biology, structural biology, transcription factor
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Rohs, Remo (committee chair), Alber, Frank (committee member), Liu, Yan (committee member), Waterman, Michael (committee member)
Creator Email: yang23@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-276533
Unique identifier: UC11280283
Identifier: etd-YangLin-4599.pdf (filename), usctheses-c40-276533 (legacy record id)
Legacy Identifier: etd-YangLin-4599.pdf
Dmrecord: 276533
Document Type: Dissertation
Rights: Yang, Lin
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA