DNA SHAPE AT TRANSCRIPTION FACTOR BINDING
SITES: FROM PURIFYING SELECTION TO A NEW
ALPHABET
by
Xiaofei Wang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computational Biology and Bioinformatics)
August 2018
© 2018
Xiaofei Wang
All Rights Reserved
Abstract
Noncoding DNA sequences, which play various roles in gene expression and
regulation, are under evolutionary pressure. Gene regulation requires specific protein–
DNA binding events, and previous studies showed that both DNA sequence and shape
readout are employed by transcription factors (TFs) to achieve DNA binding specificity.
Here, we established a link between disruptive local DNA shape changes and loss of
specific TF binding by investigating the shape-disrupting properties of single nucleotide
polymorphisms (SNPs) in human regulatory regions and described cases where disease-
associated SNPs may alter TF binding through DNA shape changes. This link led us to
hypothesize that local DNA shape within and around TF binding sites is under selection
pressure. Our results indicate that common SNPs in functional regions tend to maintain
DNA shape, whereas shape-disrupting SNPs are more likely to be eliminated through
purifying selection. These results show the importance of DNA shape in TF-DNA
recognition from the perspective of evolution. We next proposed a new DNA
shape-augmented alphabet that incorporates interdependency between adjacent nucleotides
into the classic PWM model. The TF family-specific shape alphabet, learned through a
simulated annealing algorithm, significantly improved the performance of TF-DNA
binding prediction compared with the typical four-letter A, C, G, T-alphabet that
represents the four nucleobases of DNA nucleotides.
Acknowledgements
First and foremost, I would like to express my sincere thanks to my adviser Dr.
Remo Rohs, for his constant support and mentoring through the past five years. It would
not have been possible to write this thesis without his encouragement and help. He has
taught me not only how to be a good researcher but how to be a caring person.
I would also like to thank my qualifying exam and dissertation committee, Dr.
Frank Alber, Dr. Liang Chen, Dr. Fengzhu Sun, Dr. Paul D. Thomas and Dr. Michael S.
Waterman, for their valuable time reading my dissertation proposal and thesis and for
sharing their constructive comments and feedback with me.
I thank my dear colleagues in the Rohs lab, Dr. Lin Yang, Dr. Beibei Xin, Jinsen Li,
Dr. Ana Carolina Dantas Machado, Dr. Tianyin Zhou, Richard Li, Dr. Tsu-Pei Chiu, Dr.
Satyanarayan Rao, Yingfei Wang, Brendon Cooper, Liana Engie, Jared Sagendorf and Dr.
Rosa Di Felice, for all the discussions and suggestions on my research and presentations.
I thank Xiaoshen, Junyin, Beibei, Jie, Chao, Han and the many great friends I made
over the five years for their company, support and help. They have made my life as a
Ph.D. student a memorable experience.
Finally, I would like to thank my family. I thank my beloved husband, Lin Yang,
for helping me survive the enormous stress and anxiety during thesis writing and
preparation for the defense. I thank my daughter, Arabella, for coming into the world and
bringing us so much happiness. I thank my parents for supporting every decision I have
made all these years and for being there for me whenever I need them. This thesis would
not have been written without them.
Table of Contents
Abstract
Acknowledgements
Table of Contents
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 Molecular Evolution
1.1.1 Mutations
1.1.2 Single Nucleotide Polymorphism
1.1.3 Natural Selection
1.2 The Structure of DNA
1.3 Specific TF-DNA Binding
1.4 Modeling the TF-DNA Binding Specificity
1.4.1 Experimental approaches
1.4.2 Computational approaches
1.5 Overview of This Thesis
Chapter 2 Analysis of Genetic Variation Indicates DNA Shape Involvement in Purifying Selection
2.1 Introduction
2.2 Results
2.2.1 Effect of SNPs on DNA Shape Varies with Allele Type and Local DNA Context
2.2.2 SNPs with Imbalanced DNA Accessibility Induced Larger DNA Shape Changes Compared with SNPs Without Imbalance
2.2.3 Disease-Associated SNPs Potentially Alter TF Binding Through Shape Changes
2.2.4 DNA Shape in Functional Regulatory Regions of the Drosophila Genome Is More Conserved
2.2.5 SNPs with Higher Minor Allele Frequency Tend to Relate to Smaller DNA Shape Change
2.2.6 Large Data Set of SNPs in Putative CRMs Confirmed the Observations for SNPs in Validated CRMs
2.3 Materials and Methods
2.3.1 Single Nucleotide Polymorphism Data
2.3.2 Definition of Functional and Nonfunctional Regions in Drosophila
2.3.3 Genome-Wide Prediction of DNA Shape
2.3.4 Calculation of Euclidean Distance of Minor Groove Width between Two Alleles
2.3.5 Statistical Analysis
2.3.6 Shuffling of the Pentamer Query Table and Statistical Verification
2.4 Discussion
Chapter 3 Beyond A, C, G, and T: learning a new alphabet
3.1 Introduction
3.2 Materials and Methods
3.2.1 Transcription Factor Binding Data
3.2.2 DNA Shape-Augmented Alphabet (Shape Alphabet)
3.2.3 Evaluating the Quality of the Shape Alphabet
3.2.4 Finding the Optimal Shape Alphabet – Simulated Annealing Algorithm
3.2.5 PANTHER, Regroup of Transcription Factor Families
3.3 Results
3.3.1 DNA Shape Alphabet Better Discriminates Bound Sequences from Unbound Sequences Than the Typical 4-letter A, C, G, T-Alphabet
3.3.2 Finding the Universal Shape Alphabet That Can Be Utilized to Predict TFBSs
3.3.3 It Is More Feasible to Find Transcription Factor Family-Specific Shape Alphabets
3.3.4 Transcription Factor Family-Specific Shape Alphabets
3.3.5 Regrouping the TFs into Evolutionarily Related Families Improved the Prediction Power of Family-Specific Shape Alphabet
3.3.6 Family-Specific Shape Alphabet Versus Dinucleotide Alphabet
3.3.7 Trade-off between Model Complexity and Performance
3.4 Discussion
Chapter 4 Concluding Remarks
Bibliography
LIST OF FIGURES
Figure 1.1 Schematic representation of a DNA fragment (PDB ID: 1BNA taken from the Protein Data Bank) with definition of MGW and inter-bp and intra-bp parameters.
Figure 1.2 Major groove and minor groove of DNA.
Figure 1.3 Visualization of PWM logos.
Figure 2.1 Pipeline for evaluation of SNP effects on DNA shape.
Figure 2.2 Local effect of SNPs on MGW profiles.
Figure 2.3 Distribution of MGW changes for strongly imbalanced SNPs, weakly imbalanced SNPs, and SNPs without imbalance in human.
Figure 2.4 Boxplots for Mann-Whitney P-values in shuffling tests.
Figure 2.5 DNA shape variation caused by disease-associated SNPs.
Figure 2.6 Distributions of MGW changes for Drosophila SNPs in experimentally validated CRMs at different locations and with different MAFs.
Figure 2.7 Distribution of ΔMGW values for SNPs in functional and nonfunctional regions, obtained by using DNAshape-derived MGW in multi-allelic analysis.
Figure 2.8 Distributions of MGW changes for Drosophila SNPs in putative CRMs at different locations and with different MAFs.
Figure 3.1 DNA shape-augmented alphabet.
Figure 3.2 Flow chart of simulated annealing algorithm.
Figure 3.3 Comparison of AUROC between A, C, G, T-alphabet and the learned shape alphabet for gcPBM data.
Figure 3.4 Scatter plot for leave-one-out experiment on all 215 TFs.
Figure 3.5 Scatter plot for leave-one-family-out experiments.
Figure 3.6 Scatter plot for intra-family leave-one-out experiments in 11 TF families.
Figure 3.7 Comparison of AUROCs between A, C, G, T-alphabet and the family-specific shape alphabet for HT-SELEX data.
Figure 3.8 Comparison of AUROCs between A, C, G, T-alphabet and the family-specific shape alphabet for HT-SELEX data based on PANTHER regrouped TF families.
Figure 3.9 Comparison of AUROCs between the original family-specific shape alphabets and PANTHER family-specific shape alphabets for HT-SELEX data.
Figure 3.10 Comparison of AUROCs between the typical mononucleotide alphabet and dinucleotide alphabet for HT-SELEX data.
Figure 3.11 Comparison of AUROCs between dinucleotide alphabet and the family-specific shape alphabet for HT-SELEX data.
Figure 3.12 Performance comparison between models of different complexity.
LIST OF TABLES
Table 2.1 TFs used to determine functional regions.
Table 3.1 Family membership of the TFs from HT-SELEX data in this study
Chapter 1 Introduction
1.1 Molecular Evolution
A hundred and fifty-nine years ago, Charles Darwin published his book On the
Origin of Species (1859), presenting his theory of evolution by natural selection and
starting an era in which people came to understand that all living and extinct species on
earth are interrelated in a great Tree of Life (Darwin 1859). In Darwin’s theory,
organisms compete with each other for food, for water and for all the resources that are
needed to survive and reproduce. Individuals that are better adapted to the
environment have more offspring than those that are less well adapted: “the survival of the
fittest”. At the time, Darwin knew nothing about genetics, while Gregor
Mendel was carrying out the pea plant breeding experiments that later led him to develop
Mendel’s laws of inheritance (Mendel 1865, Huxley 1942).
Nearly a hundred years later, in 1953, James Watson and Francis Crick discovered the
structure of DNA (Watson and Crick 1953). Since then, scientists have known
that the hereditary factor Mendel proposed is DNA, through which living organisms
pass down their genetic information from generation to generation (except for some
viruses). It also became clear that natural selection acts at the level of DNA through
mutations (Dobzhansky and Dobzhansky 1937), which can be caused by random errors in
DNA replication or repair or by chemical or radiation damage.
1.1.1 Mutations
Mutations are permanent nucleotide changes in DNA sequences of genomes.
Typically, mutations occur randomly across the whole genome and accumulate slowly
from generation to generation. Mutations differ in their effects on DNA sequences.
By their effect on DNA structure, mutations can be classified into: 1) insertions, which add
one or more extra nucleotides into the DNA; 2) deletions, which remove one or more
nucleotides from the DNA; 3) substitutions, which exchange a single nucleotide for another;
and 4) large-scale chromosomal structure mutations, such as amplifications and
translocations.
By their effect on fitness, mutations can be classified into: 1) deleterious mutations, which
decrease the fitness of the organism; 2) beneficial mutations, which
increase the fitness of the organism; and 3) neutral mutations, which have no
harmful or beneficial effect on the organism.
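The three small-scale structural classes can be illustrated on a short string. The following is a minimal sketch only; the function names and the toy reference sequence are hypothetical and are not taken from this thesis.

```python
def insert(seq: str, pos: int, bases: str) -> str:
    """Insertion: add one or more nucleotides before index pos."""
    return seq[:pos] + bases + seq[pos:]


def delete(seq: str, pos: int, length: int = 1) -> str:
    """Deletion: remove `length` nucleotides starting at index pos."""
    return seq[:pos] + seq[pos + length:]


def substitute(seq: str, pos: int, base: str) -> str:
    """Substitution: exchange the single nucleotide at index pos for another."""
    return seq[:pos] + base + seq[pos + 1:]


reference = "ACGTAC"                      # toy sequence, for illustration only
print(insert(reference, 2, "TT"))         # ACTTGTAC
print(delete(reference, 2))               # ACTAC
print(substitute(reference, 2, "A"))      # ACATAC
```

Only the substitution keeps the sequence length unchanged, which is why it is the class that gives rise to the single nucleotide polymorphisms discussed next.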
1.1.2 Single Nucleotide Polymorphism
If a mutation changes only a single nucleotide position, it is called a
point mutation. Point mutations result in single nucleotide polymorphisms (SNPs) in a
population.
A SNP is a variation in a single nucleotide at a specific locus in a genome. Each
SNP has two or more possible nucleotide variants, called alternate alleles. Some individuals
in the population have one allele at the locus, while others have another allele at
the same locus in the genome. For example, at a specific locus in the human genome,
allele A may appear in most individuals, while allele G may occupy the position in a
minority of individuals.
The frequency with which an allele appears in a particular population is called its
allele frequency. The frequency of the second most frequent allele of a SNP is defined as the
minor allele frequency (MAF). A large MAF indicates that both of the two most frequent alleles
of the SNP are commonly observed in the population and are tolerated in its
environment.
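As a worked illustration of these definitions, the sketch below computes allele frequencies and the MAF from a list of observed alleles at one locus. The helper names and the sample of ten alleles are hypothetical.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Return the frequency of each allele observed at one locus."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def minor_allele_frequency(alleles):
    """Frequency of the second most frequent allele (0.0 if only one allele is seen)."""
    freqs = sorted(allele_frequencies(alleles).values(), reverse=True)
    return freqs[1] if len(freqs) > 1 else 0.0


# Hypothetical sample: allele A in eight chromosomes, allele G in two.
sample = ["A"] * 8 + ["G"] * 2
print(allele_frequencies(sample))        # {'A': 0.8, 'G': 0.2}
print(minor_allele_frequency(sample))    # 0.2
```

For a SNP with more than two alleles, the function still returns the frequency of the second most frequent allele, in line with the definition above.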
1.1.3 Natural Selection
The key mechanism of evolution is natural selection. In a particular population,
individuals carrying different inherited genotypes with different fitness compete with each
other. The genotype of the “winner” in the competition increases in frequency, while the
genotype of the “loser” decreases in frequency.
There are two types of selection in the process: positive selection and negative
selection. Positive selection increases the frequency of rare variants that improve
fitness. Negative selection, also known as purifying selection, removes
deleterious variants.
The purging of deleterious alleles by purifying selection can take place at the
population genetics level, with as little as a single nucleotide mutation being the unit of
selection. In such cases, individuals with the harmful single nucleotide mutation have
fewer offspring each generation, reducing the frequency of the mutation in the gene pool.
In other words, the minor allele frequency of a SNP locus under purifying selection is
expected to be small.
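The decline in frequency can be made concrete with a toy calculation. The sketch below uses a deterministic haploid selection model, a standard textbook simplification that is assumed here for illustration and is not a model used in this thesis. The selection coefficient s, the starting frequency, and the function names are arbitrary choices.

```python
def next_frequency(p: float, s: float) -> float:
    """One generation of haploid selection.

    The deleterious allele has relative fitness 1 - s, the other allele has
    fitness 1, so the new frequency is p(1 - s) divided by the mean fitness.
    """
    mean_fitness = p * (1 - s) + (1 - p)
    return p * (1 - s) / mean_fitness


def trajectory(p0: float, s: float, generations: int):
    """Frequencies of the deleterious allele over successive generations."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], s))
    return freqs


# Illustrative parameters only: start at 10% with a 5% fitness cost.
freqs = trajectory(p0=0.10, s=0.05, generations=100)
print(f"start: {freqs[0]:.3f}, after 100 generations: {freqs[-1]:.4f}")
```

With s greater than zero the frequency falls in every generation, and with s equal to zero it stays constant, which matches the contrast between deleterious and neutral variants described above. Real populations also experience drift, new mutation, and diploid effects, all of which this sketch ignores.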
1.2 The Structure of DNA
The carrier of genetic information, deoxyribonucleic acid (DNA), is a double-stranded
molecule composed of a sequence of nucleotides. Each nucleotide contains a
phosphate group, a 5-carbon sugar called deoxyribose, and one of four nitrogen-containing
nucleobases: adenine (A), cytosine (C), guanine (G), or thymine (T).
A repeating pattern of one phosphate group and one deoxyribose forms the backbone of a
single DNA strand, while the two DNA strands are held together by hydrogen bonds
formed through complementary base pairing (adenine pairs with thymine and
guanine pairs with cytosine), giving DNA a stable double helix structure (Watson and
Crick 1953).
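Complementary base pairing means that one strand determines the other. The short sketch below derives the partner strand from a given strand; it relies on the well-known fact, not stated above, that the two strands run antiparallel, so the partner is conventionally written as the reverse complement. The function names and example sequence are illustrative.

```python
# Watson-Crick pairing rules: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq: str) -> str:
    """Base-by-base pairing partner, read in the same direction as seq."""
    return "".join(COMPLEMENT[base] for base in seq)


def reverse_complement(seq: str) -> str:
    """Partner strand written in its own 5'-to-3' direction."""
    return complement(seq)[::-1]


print(complement("ACGTTG"))           # TGCAAC
print(reverse_complement("ACGTTG"))   # CAACGT
```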
The genetic information is stored in the ordered combination of nucleotides
carrying different bases. Accordingly, the representation of DNA is simplified to a
sequence of the letters A, C, G and T, which is referred to as the DNA sequence. Ribonucleic
acid (RNA) and proteins recognize the DNA sequence during DNA replication,
protein synthesis and all the other biochemical processes inside a cell that involve
DNA.
Figure 1.1 Schematic representation of a DNA fragment (PDB ID: 1BNA taken from the
Protein Data Bank) with definition of MGW and inter-bp and intra-bp parameters.
Figure adapted from (Li, Sagendorf et al. 2017).
Besides the information stored in the primary sequence of DNA, the global and
local structural characteristics of the molecule also carry information. The global structure of
the DNA refers to the three-dimensional (3D) organization of the genome in the cell
nucleus, which not only enables packing of the long DNA polymer in a condensed
manner to fit into the tiny nucleus, but also establishes the infrastructure that
supports all the molecular machineries that “read” from the instruction molecule (Rao,
Huntley et al. 2014, Tjong, Li et al. 2016). DNA local structure, on the other hand, refers
to the subtle variations in the geometric properties of the DNA base pairs (bps) and backbones
(Rohs, West et al. 2009). Bases and bps can move and rotate slightly relative to the helical axis:
the two bases of a pair with respect to one another (intra-bp features), and each bp with respect
to its neighboring bps (inter-bp features) (Figure 1.1). In addition to the intra-bp and inter-bp
features, the hydrogen bonding between paired bases causes asymmetry of the two grooves along
the backbones of the DNA double helix. The wider groove is called the major groove, and the
narrower groove is called the minor groove (Figure 1.2). The bp movements affect the width of
the major and minor grooves. All these features together play an important role in protein-DNA
recognition.
Note that the energetic cost of these slight bp movements depends
on the sequence. Different sequences have different propensities for certain
structural deformations. Thus, the structural variation in the double helix is
sequence-dependent. On the other hand, different sequences can give rise to a similar
structure.
Figure 1.2 Major groove and minor groove of DNA. The shape of the DNA minor groove
(dark grey, narrow and deep minor groove; white, wide and shallow minor groove) is
recognized by many TFs and used to achieve binding specificity. Figure credited to
Tianyin Zhou.
1.3 Specific TF-DNA Binding
Organisms synthesize proteins in cells according to the DNA sequence. The gene
expression process follows the central dogma of molecular biology. A fragment of DNA,
or a gene, is first copied into a single-stranded RNA in a process called transcription.
Then proteins are synthesized according to the RNA product, or transcript, in a
process called translation. Multiple mechanisms act during these two processes to ensure
that organisms produce not only the right proteins but also the right amounts at the right time.
These mechanisms are collectively called the regulation of gene expression. Gene regulation
enables cells to differentiate and operates at different stages of gene
expression. One of the most widely used mechanisms acts during transcription initiation,
through transcription factor-DNA binding.
Transcription factors (TFs) regulate the transcription of a gene by binding to
regulatory sequences located relatively close to the target gene. These regulatory
sequences contain specific sites, called TF binding sites (TFBSs), that are recognized by
the TFs. TF-DNA binding can trigger other biochemical events, which in turn help
activate or repress the transcription of the gene through different mechanisms. For
example, in prokaryotes some bound TFs interact directly with the RNA polymerase to stabilize
its binding and thus promote transcription, whereas in eukaryotes bound
TFs can help recruit the transcription machinery. In eukaryotes, bound
TFs can also recruit chromatin remodeling factors or other proteins that help transcription
initiation or elongation. On the other hand, some TFs bind to RNA polymerase binding
sites and exclude the polymerase, so that transcription is repressed. From a systems
perspective, various TFs work together to regulate the transcription levels of different genes.
Each gene may have one or multiple regulatory TFBSs located in its vicinity, which
are recognized by specific TFs, resulting in activation or repression of transcription of that
gene. Therefore, understanding TF-DNA binding specificity is a key step in deciphering
gene regulation.
TFs use both base readout and shape readout to bind to
TFBSs. In base readout, TFs recognize the specific chemical signatures of different
DNA bases, often through hydrogen bonding or hydrophobic interactions. In shape
readout, TFs read DNA through its local three-dimensional structure (DNA shape), such
as the minor groove width (MGW) of DNA (Figure 1.2). Usually both readout modes
are used when a TF binds. However, their relative contributions to the overall
TF-DNA binding specificity vary among TF families.
1.4 Modeling the TF-DNA Binding Specificity
1.4.1 Experimental approaches
uPBM
Protein binding microarray (PBM) experiments detect the binding
preference of a TF for a large pool of variant DNA sequences in a high-throughput manner.
In such experiments, sequences of interest are exposed to a specific purified
TF. Each spot in the microarray contains replicates of a unique sequence called a probe.
TFs are allowed to bind freely to the probes in the microarray. The binding affinity of the
TF for the probes can be estimated by measuring the enrichment of the TF at different spots
through a fluorescence scan.
The probes in a universal protein binding microarray (uPBM) (Berger, Philippakis
et al. 2006) are designed and generated from a de Bruijn sequence that covers all possible
k-mers (DNA sequences of length k). In practice, k is often set to 9 or 10. This design
enables the coverage of all possible k-mers with a compact set of probes. For example,
all possible 10-mers can be covered by approximately 44,000 probes of 35 bps in length.
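The idea of covering every k-mer with a compact probe set can be sketched in a few lines. The code below builds a de Bruijn sequence with the standard recursive Lyndon-word construction and cuts it into overlapping probes. It is a simplified illustration, not the actual uPBM array design: the function names are hypothetical, and real arrays add constant flanks, controls, and other design constraints that are ignored here.

```python
def de_bruijn_cyclic(k: int, alphabet: str = "ACGT") -> str:
    """Cyclic de Bruijn sequence of order k (standard recursive construction)."""
    n = len(alphabet)
    a = [0] * (k + 1)
    out = []

    def db(t: int, p: int) -> None:
        if t > k:
            if k % p == 0:
                out.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in out)


def de_bruijn_linear(k: int) -> str:
    """Linear version: append the first k - 1 letters so no k-mer is lost at the wrap."""
    cyc = de_bruijn_cyclic(k)
    return cyc + cyc[:k - 1]


def tile_probes(seq: str, k: int, probe_len: int):
    """Cut seq into probes that overlap by k - 1 bases, so every k-mer stays intact."""
    step = probe_len - k + 1
    return [seq[i:i + probe_len] for i in range(0, len(seq) - k + 1, step)]


def min_probes(k: int, probe_len: int) -> int:
    """Lower bound on the probe count: 4**k k-mers, probe_len - k + 1 per probe."""
    step = probe_len - k + 1
    return (4 ** k + step - 1) // step


seq = de_bruijn_linear(3)
print(len(seq))                       # 66 = 4**3 + 3 - 1
print(len(tile_probes(seq, 3, 8)))    # 11
```

Under these simplifying assumptions, `min_probes(10, 35)` evaluates to 40,330, which is the same order of magnitude as the approximately 44,000 probes quoted above.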
gcPBM
Genomic context protein binding microarray (gcPBM) experiments are similar to
uPBM experiments, with the difference that they use probes sampled from real
genomic contexts (Mordelet, Horton et al. 2013). Data from uPBM experiments are first used
to locate the putative binding sites in the genome. Then the probes are designed to be the 30-bp
genomic regions encompassing the putative binding sites, i.e., the putative binding site
and its 15-bp flanks. The “bound” and “unbound” probes are determined by whether the
genomic region is bound or unbound in the ChIP-chip experiment.
HT-SELEX
A high-throughput systematic evolution of ligands by exponential enrichment
(HT-SELEX) experiment consists of several rounds of DNA selection by the TF, each followed
by polymerase chain reaction (PCR) amplification and DNA sequencing (Jolma, Yan et
al. 2013). First, a pool of DNA oligomers consisting of a random variable region flanked
by constant primers and adapters is prepared. Unlike in PBM experiments, after
the first round of TF binding, the bound DNA sequences are isolated and amplified by
PCR. Some of these selected sequences are sequenced, while the rest go on to the next
round of selection and PCR amplification. After this process is repeated for several
rounds, the relative binding affinity of the TF for the oligomers can be calculated from
the changes in the enrichment of different oligomers measured by sequencing.
1.4.2 Computational approaches
Position weight matrix
The position weight matrix (PWM) is the most widely used model of TF-DNA
binding specificity (Stormo 2000, Stormo 2013). In this model, the TF binding preference
is represented by a 4×L parameter matrix, where L is the width of the TFBS. The rows of
a PWM correspond to the four types of nucleotides A, C, G and T. The columns
correspond to nucleotide positions of the TFBS. Each element in the matrix represents the
probability of a certain type of nucleotide appearing at a specific nucleotide position
within a TFBS (in practice, usually stored as a log-transformed weight so that the
contributions of individual positions add). A PWM of a TF can be constructed from a set of
aligned TFBS sequences or through de novo motif discovery. Given a sequence of length L, the
quantitative binding preference of a TF can be calculated by summing the corresponding
elements in the PWM of the TF; this sum is called the PWM score. PWMs can also be visualized as
motif logos (Schneider and Stephens 1990) (Figure 1.3).
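The construction and scoring steps can be sketched as follows. This is a minimal illustration under stated assumptions: weights are stored as log-odds relative to a uniform background of 0.25, a pseudocount of 1 avoids taking the logarithm of zero, and the aligned sites are made up. None of these choices is taken from this thesis. The matrix is stored column by column, one dictionary per position, which is the transpose of the 4-row layout described above.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build log-odds weights from equal-length aligned binding sites."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in BASES
        })
    return pwm


def pwm_score(pwm, seq):
    """PWM score: sum of the weights selected by each base of seq."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal the PWM width")
    return sum(column[base] for column, base in zip(pwm, seq))


# Made-up aligned sites, for illustration only.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pwm = build_pwm(sites)
print(pwm_score(pwm, "TGACTCA") > pwm_score(pwm, "ACGTACG"))   # True
```

Because each position contributes an independent term to the sum, this scoring scheme treats nucleotide positions as independent of one another.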
Figure 1.3 Visualization of PWM logos. Figure credited to Lin Yang.
As an alternative to consensus sequences, the PWM was introduced by Gary Stormo
and colleagues in 1982 and became an important part of many software tools for
computational motif discovery and binding site prediction. The MEME Suite is one of
these motif-based sequence analysis tools (Bailey, Johnson et al. 2015). A limitation of
the traditional PWM method is that it assumes independence between individual nucleotide
positions within the binding sites, which is not always true. The mechanisms of TF-DNA
interaction are highly complex, and nucleotide interdependency, especially between
adjacent nucleotides, can arise from the synergistic effect of the presence of two specific
nucleotides on the binding affinity. To address this limitation, the PWM has recently been
extended to include additional parameters, such as dinucleotide or even trinucleotide
features, that account for the interdependencies between adjacent nucleotide positions.
Regression models
As mentioned above, TFs also use DNA shape readout to achieve binding
specificity. DNA shape features derived from simulated DNA structures implicitly
contain information about interdependencies between nucleotide positions within a TFBS.
Recently, machine learning methods that take into account both sequence features and
DNA shape features have been developed and have been shown to predict TF-DNA
interactions more accurately than sequence-only methods.
In regression models, such as the support vector regression (SVR) model (Zhou,
Shen et al. 2015) and the multi-linear regression (MLR) model (Yang, Orenstein et al.
2017), sequence features are represented as a series of binary numbers and shape features
are calculated with the DNAshape tool (Zhou, Yang et al. 2013). The models are trained
on a set of aligned TFBS sequences associated with affinity scores from PBM or
SELEX experiments. The trained models can then be used to quantitatively predict the
binding affinity of the TF for a new sequence.
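A common way to represent sequence as binary numbers is one-hot encoding, with four bits per position. The sketch below assumes that encoding; the cited models may use a different layout. Shape features are passed in by the caller, because the DNAshape tool is not called here, and the zeros in the example are placeholders rather than predicted shape values.

```python
BASES = "ACGT"


def one_hot(seq: str):
    """Encode each nucleotide as four binary features (A, C, G, T order)."""
    features = []
    for base in seq:
        features.extend(1 if base == b else 0 for b in BASES)
    return features


def feature_vector(seq: str, shape_values=()):
    """Concatenate binary sequence features with externally computed shape features."""
    return one_hot(seq) + list(shape_values)


print(one_hot("ACG"))
# [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]

# Placeholder shape values (not real predictions), one per position.
print(len(feature_vector("ACG", shape_values=[0.0, 0.0, 0.0])))   # 15
```

Each column of the resulting feature matrix refers to a fixed position in the binding site, which is why the training sequences must be aligned before such a model is fitted.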
Regression models incorporating both sequence and shape features of DNA
significantly improve the prediction accuracy of TF binding preference over sequence-only
models. However, one limitation of these models is that the sequences used to
construct the regression models need to be aligned first using PWM-based methods if they
were not designed to be aligned in the on-chip experiments. This makes it hard to conduct
de novo motif discovery with such models.
Alignment-free models
One type of alignment-free model is based on k-mer counts. Such models in a
way combine the extended PWM model and regression models. They use the occurrence
of each k-mer as the features of the model and train regression models on those k-mer
features (Ma, Yang et al. 2017). In this way, the sequences used to construct models do
not need to be aligned. Such models can include shape features as well.
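The k-mer count features can be sketched as a fixed-length vector with one entry per possible k-mer. The function name and example sequences below are hypothetical, and the sketch assumes that sequences contain only the letters A, C, G and T.

```python
from itertools import product


def kmer_counts(seq: str, k: int):
    """Count every k-mer in seq; all 4**k possible k-mers appear as keys."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts


counts = kmer_counts("ACGTACG", 2)
print(len(counts))                    # 16
print(counts["AC"], counts["GT"])     # 2 1

# Sequences of different lengths map to vectors of the same length,
# so no alignment step is needed before model training.
print(len(kmer_counts("ACGT", 2)) == len(kmer_counts("ACGTACGTTTGA", 2)))   # True
```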
Another type of alignment-free model is based on deep learning (Alipanahi,
Delong et al. 2015). Deep learning has been studied actively in the machine learning field
and has shown superior prediction power in problems such as natural language processing,
image and text recognition, and many other artificial intelligence tasks. With more and more
high-throughput experimental data available, researchers have introduced deep learning
models into regulatory genomics and reported performance improvements. When
a deep learning model is trained to predict TF binding preference, the deep neural network
can learn to extract the “right” sequence and shape features automatically without much
preprocessing of the data. Such models are used as black boxes, and extensive
studies are still needed to uncover the biological meaning inside them.
1.5 Overview of This Thesis
In this thesis, we established a link between disruptive local DNA shape changes
and loss of specific TF binding by investigating the shape-disrupting properties of SNPs
in human regulatory regions. We also described cases where disease-associated SNPs
may alter TF binding through DNA shape changes. This link led us to hypothesize that
local DNA shape within and around TF binding sites is under selection pressure. Our
results indicate that common SNPs in functional regions tend to maintain DNA shape,
whereas shape-disrupting SNPs are more likely to be eliminated through purifying
selection. These results show the importance of DNA shape in TF-DNA recognition from
the perspective of evolution. We next proposed a new DNA shape-augmented alphabet
that incorporates interdependency between adjacent nucleotides to the classic PWM
model. The TF family-specific shape alphabet learned through a simulated annealing
algorithm significantly improved the performance of TF-DNA binding prediction
compared with the standard four-letter alphabet (A, C, G, and T) that represents the
nucleobases of DNA.
Chapter 2 Analysis of Genetic Variation Indicates DNA Shape
Involvement in Purifying Selection
Reproduced from Xiaofei Wang, Tianyin Zhou, Zeba Wunderlich, Matthew T. Maurano,
Angela H. DePace, Sergey V. Nuzhdin, and Remo Rohs: Analysis of genetic variation
indicates DNA shape involvement in purifying selection. Molecular Biology and
Evolution. msy099. doi: 10.1093/molbev/msy099 (2018)
2.1 Introduction
Biomedical disease research has focused primarily on identifying pathogenic
nonsynonymous variants in protein-coding regions of the genome because of the absence
of functional annotation of noncoding variants (Ward and Kellis 2012). However,
noncoding genetic variants can also be pathogenic, as shown by numerous genome-wide
association studies (Welter, MacArthur et al. 2014). Noncoding variants may affect
disease pathogenesis through various gene regulatory mechanisms, such as transcription
factor (TF) binding, RNA splicing, and mRNA degradation (Faustino and Cooper 2003,
Abelson, Kwan et al. 2005, Maurano, Humbert et al. 2012).
TFs regulate gene expression by binding to specific TF binding sites (TFBSs),
which are traditionally described by position weight matrices (PWMs) representing the
independent binding-affinity contributions of nucleotides at each position of the TFBS
(Stormo 2000, Stormo 2013). Nucleotide mutations within and near TFBSs, or in TFBS-rich
regions, can potentially disrupt TF binding and consequently up- or down-regulate
gene expression (Mogno, Kwasnieski et al. 2013). It is, therefore, not surprising that
TFBS regions are under strong selection pressure (Andolfatto 2005). The three-
dimensional structure of DNA, or “DNA shape”, is an important determinant of specific
TF binding (Rohs, West et al. 2009, Rohs, Jin et al. 2010). TFs employ base readout
(direct contacts between amino acids and functional groups of the base pairs) and shape
readout (recognition of three-dimensional DNA structure) to achieve DNA binding
specificity (Slattery, Zhou et al. 2014). Adding DNA shape features to models of TF–
DNA binding increases the prediction accuracy of TF binding (Abe, Dror et al. 2015,
Zhou, Shen et al. 2015, Yang, Orenstein et al. 2017).
Here, by analyzing the connection between DNA shape and TF binding using
single nucleotide polymorphisms (SNPs) called from human DNase-seq data, we found
that SNP-induced allelic differences of TF binding can be partially explained by SNP-
induced local DNA shape changes. This observation prompted us to hypothesize that
DNA shape in TFBS regions is under purifying selection pressure. To test this hypothesis,
we analyzed SNPs derived from 216 natural strains of Drosophila melanogaster.
Comparing SNPs located in functional versus nonfunctional regions and SNPs with high
versus low minor allele frequency (MAF), we statistically showed that DNA shape is
under purifying selection.
Figure 2.1 Pipeline for evaluation of SNP effects on DNA shape. A) Human SNPs derived
from DNase-seq data were divided into three groups, (i) strongly imbalanced SNPs, (ii)
weakly imbalanced SNPs, and (iii) SNPs without imbalance, according to their ability to
vary chromatin accessibility. Drosophila SNPs, called from 216 natural strains of D.
melanogaster and located within blastoderm stage-active CRMs, were divided into two
groups, (i) SNPs in functional regions and (ii) SNPs in nonfunctional regions, based on
the criteria shown in the center. The same analysis was repeated for a larger number of
Drosophila SNPs in putative CRMs. B) Example calculation of DNA shape variation for
one SNP. One single-nucleotide variant at position 0 would result in DNA shape changes
at the five nucleotide positions centered around the variant. First, vectors of MGW for
each allele were predicted using DNAshape (Zhou, Yang et al. 2013). Euclidean
distances between MGWs of the two alleles were calculated as the DNA shape variation
of the SNP (see Materials and Methods).
2.2 Results
2.2.1 Effect of SNPs on DNA Shape Varies with Allele Type and Local DNA Context
To predict the effects of SNPs on local DNA shape, we used the DNAshape
method (Zhou, Yang et al. 2013) to analyze human and Drosophila datasets (Figure
2.1A). We used the predicted minor groove width (MGW) as the predominant DNA shape
feature for evaluating the functional implications of DNA shape readout (Figure 2.1B)
(Parker, Hansen et al. 2009). Given that the predicted MGW at each nucleotide position
is a continuous value, we defined the DNA shape change for each SNP, in the form of
∆MGW, as the Euclidean distance between vectors of MGW profiles of the two alleles
(see Materials and Methods). We observed that the effect of SNPs on local DNA shape
patterns varied with allele type and local DNA context. At one extreme of the spectrum
were SNPs that had a weak effect on local shape patterns (Figure 2.2A and 2B). When a
TF recognizes the DNA shape pattern of its binding site, this recognition is likely to be
preserved for both the reference and alternative allele of those SNPs. At the other
extreme were SNPs that completely disrupted the local DNA shape (Figure 2.2C and 2D).
We suggest that loss of function is more likely to occur with those shape-disrupting SNPs.
Between these two extremes were SNPs that caused modest variations in local DNA
shape, which would likely still affect TF binding (Figure 2.2E and 2F). Subtle changes in
MGW lead to changes in electrostatic potential, which will affect DNA shape readout
(Chiu, Rao et al. 2017).
Figure 2.2 Local effect of SNPs on MGW profiles. Effects of SNPs on their surrounding
MGW patterns varied with allele type and local DNA context. MGW patterns of local
DNA region for two alternate alleles of SNPs were plotted in blue and red, respectively.
At one extreme of the spectrum were SNPs (A and B) that had very small effects on local
MGW. At the other extreme were SNPs (C and D) that completely disrupted the local
MGW geometry. Between these two extremes were SNPs that led to an intermediate
extent of variation in local DNA shape, while potentially still affecting TF binding (E and
F).
2.2.2 SNPs with Imbalanced DNA Accessibility Induced Larger DNA Shape
Changes Compared with SNPs Without Imbalance
To assess the effect of SNP-induced DNA shape changes on TF binding, we
analyzed the correlation between these shape changes and DNA accessibility changes in
an in vivo human dataset. DNA accessibility is a hallmark of specific TF binding
(Maurano, Haugen et al. 2015); if alleles differ in their degree of DNase I accessibility,
then we can infer that they will differ in their degree of TF binding. By analyzing DNase-
seq signals at heterozygous sites, Maurano et al. (2015) classified SNPs located in DNase
I hypersensitive sites into three groups: 1) strongly imbalanced SNPs (0.1% false
discovery rate [FDR] and >70% imbalance), 2) weakly imbalanced SNPs (5% FDR), and
3) SNPs without imbalance, as defined by (Maurano, Haugen et al. 2015).
Figure 2.3 Distribution of MGW changes for strongly imbalanced SNPs, weakly
imbalanced SNPs, and SNPs without imbalance in human. Distributions of ΔMGW
values for imbalanced SNPs (red and green plots) were shifted rightward compared to
SNPs without imbalance (blue plot). The more imbalanced the SNPs were, the larger the
ΔMGW or change in DNA shape was. Asterisks are color-coded to indicate the SNP
distributions being compared. Sample sizes for all groups are listed in the legend.
We calculated and plotted the distribution of ΔMGW values of SNPs for all three
groups (Figure 2.3). Although the ∆MGW distributions for all three groups were similar
in shape, the distributions for imbalanced SNPs were significantly shifted towards larger
∆MGW values (without imbalance vs. strongly imbalanced: Mann-Whitney P=1.44×10⁻¹²,
shuffling test P=0.023 using “bogus” MGW predictions; without imbalance vs.
weakly imbalanced: Mann-Whitney P=1.40×10⁻³, shuffling test P=0.201; strongly
imbalanced vs. weakly imbalanced: Mann-Whitney P=4.10×10⁻⁸, shuffling test P=0.011;
see Materials and Methods for shuffling test; Figure 2.4A). As the group of SNPs
became increasingly imbalanced, the distribution shifted towards larger ∆MGW values.
This observation revealed an association between imbalanced SNPs and increased
∆MGW, and indicated that drastic DNA shape changes could lead to loss of specific TF
binding.
Figure 2.4 Boxplots for Mann-Whitney P-values in shuffling tests. All of the Mann-
Whitney tests shown in Figure 2.3 and Figure 2.6 were repeated by using 1,000
arbitrarily shuffled MGW predictions. Base-10 logarithms of the corresponding 1,000
Mann-Whitney P-values were plotted with box plots. Mann-Whitney P-values obtained by
using DNAshape-derived MGW patterns are marked with a red asterisk. Tests for A)
human and B) Drosophila data.
2.2.3 Disease-Associated SNPs Potentially Alter TF Binding Through Shape
Changes
We have illustrated that alternate alleles of SNPs could result in different DNA
shape patterns, thereby potentially influencing TF binding through effects on shape
readout and possibly leading to disease due to altered expression of the target genes
(Maurano, Humbert et al. 2012, Mogno, Kwasnieski et al. 2013). Thus, the pathogenesis
of some diseases may be attributed to the loss of preferred DNA shape at TFBSs.
However, loss of the preferred DNA shape at TFBSs is not the only possible pathogenic
mechanism for disease-associated SNPs. Therefore, we did not attempt to correlate
disease with DNA shape changes generally, but instead analyzed the effects on DNA
shape by SNPs that were shown in the literature to affect TF binding. Here we highlight a
few examples in which shape-disrupting SNPs could potentially cause disease.
For example, the T allele of SNP rs339331 increases the DNA binding affinity of
HOXB13, leading to overexpression of Regulatory Factor X 6 (RFX6), which promotes
prostate cancer cell growth and invasion (Huang, Whitington et al. 2014). Examining the
effect of the allele on local DNA shape (Figure 2.5A), we found that the risk allele T
induced a narrower minor groove in the core-binding site TTTTAT. The ∆MGW of
approximately 1.3 Å is at the right tail of the distribution in Figure 2.3, indicating a large
∆MGW for this SNP. This finding is consistent with our previous studies showing that
MGW plays a role in achieving homeodomain binding specificity (Slattery, Riley et al.
2011, Abe, Dror et al. 2015, Dror, Golan et al. 2015).
Figure 2.5 DNA shape variation caused by disease-associated SNPs. A) DNA shape
variation caused by SNP rs339331 in the HOXB13 binding site. HOXB13 prefers binding
to a narrower MGW induced by risk allele T. B) DNA shape variation caused by SNP
rs6893009 in the PU.1 binding site. The SNP caused large variance in MGW, which was
previously reported to be a predominant structural determinant of PU.1 binding. DNA
shape variation caused by C) SNP rs445 in the c-MYB binding site, D) a SNP in the
GATA3 binding site, E) SNP rs909116 in the ER-α binding site, and F) SNP rs6983267 in
the TCF7L2 binding site.
Another SNP, rs6893009, located in a PU.1 binding site, is a strong binding
quantitative trait locus as measured by ChIP-seq experiments (Tehranchi, Myrthil et al.
2016). These ChIP-seq measurements found that the SNP was in perfect linkage
disequilibrium with a Crohn’s disease-associated SNP, rs4958847 (Parkes, Barrett et al.
2007). Interestingly, we previously showed that DNA shape features help to guide
binding of PU.1 (Barozzi, Simonatto et al. 2014). Investigation of how rs6893009 affects
local DNA shape showed that the SNP caused a large change in MGW of approximately
1.3 Å (Figure 2.5B), consistent with our previous finding that MGW was a predominant
structural determinant of PU.1 binding.
Our analysis also revealed medium-to-strong MGW changes caused by SNPs
located in the c-MYB, GATA3, ER-α, and TCF7L2 binding sites (Figure 2.5C, D, E, and
F), all of which are disease-associated SNPs with evidence of disrupted TF binding (Jin,
Zhao et al. 2010, Miyoshi, Murase et al. 2010, Alipanahi, Delong et al. 2015, Mathelier,
Lefebvre et al. 2015).
2.2.4 DNA Shape in Functional Regulatory Regions of the Drosophila Genome Is
More Conserved
The above results suggested a possible link between DNA shape changes and loss
of TF–DNA binding, leading us to hypothesize that DNA shape in TFBS regions is under
purifying selection. To test this hypothesis, we analyzed SNP data derived from 216
natural strains of D. melanogaster to uncover the signal of DNA shape selection. We
used these data because of the large numbers of sequenced individuals, annotated cis-
regulatory modules (CRMs), and available PWMs for TFs in the D. melanogaster
genome.
First, we identified SNPs within experimentally verified CRMs in the Drosophila
genome that regulate gene expression during the blastoderm stage, as annotated in the
REDfly database (see Materials and Methods) (Gallo, Gerrard et al. 2011). We focused
on these CRMs because we were able to identify a high-confidence set of functioning
TFs at the blastoderm stage, whereas such a set of TFs was unavailable for other
developmental stages. CRMs are composed of TFBSs and intervening sequences. We
assumed that within CRMs, TFBSs and their flanking regions as well as nucleotide
positions with high conservation scores are functionally important. This assumption
enabled us to divide CRMs into two distinct DNA fragment sets, that is, functional and
nonfunctional regions (see Materials and Methods). We conducted comparative studies
on SNPs in those two regions.
Figure 2.6 Distributions of MGW changes for Drosophila SNPs in experimentally
validated CRMs at different locations and with different MAFs. A) Distribution of ΔMGW
values for SNPs in functional and nonfunctional regions (see Materials and Methods for
definition) using the DNAshape-derived MGW. Compared to the distribution for
functional regions (red plot), the distribution for nonfunctional regions (blue plot) was
significantly shifted rightward, indicating that SNPs induced greater changes in ΔMGW
in nonfunctional than in functional regions. B) Distribution of ΔMGW values for SNPs in
functional and nonfunctional regions, using one of the shuffled MGW predictions. Using
arbitrarily shuffled MGW, no signal of purifying selection emerged. C) Distribution of
ΔMGW values for SNPs with high and low MAF in functional regions. Distribution of
ΔMGW values for low MAF was significantly shifted towards the right. D) Distribution of
ΔMGW values for SNPs with high and low MAF in nonfunctional regions. Distributions
of these two groups exhibited no significant difference. Sample sizes for all groups are
listed in the legends.
Next, we plotted the distributions of ∆MGW values for SNPs in functional and
nonfunctional regions within Drosophila CRMs that are active in the blastoderm
developmental stage (Figure 2.6A). Compared with the ∆MGW distribution for
functional regions (red), the distribution for nonfunctional regions (blue) was
significantly shifted towards larger ∆MGW values (Mann-Whitney P=2.62×10⁻⁴;
shuffling test P=0.001; Figure 2.4B). When we compared the distributions of shuffled
∆MGW values using “bogus” MGW predictions, the shift was not significant (see
Materials and Methods; Figure 2.6B shows one shuffling). This result indicated that
SNPs in functional regions were less likely to induce drastic DNA shape changes,
implying that SNPs that greatly change MGW in these regions were removed by
purifying selection. Results of multi-allelic analysis confirmed this notion that local DNA
shape was more conserved in functional than in nonfunctional regions (see Materials
and Methods; Figure 2.7).
Figure 2.7 Distribution of ΔMGW values for SNPs in functional and nonfunctional
regions, obtained by using DNAshape-derived MGW in multi-allelic analysis. Compared
to the distribution for functional regions (red plot), the distribution for nonfunctional
regions (blue plot) was significantly shifted towards the right. This trend indicates that
the change in MGW values induced by SNPs in nonfunctional regions was greater than
the change induced by SNPs in functional regions. Sample sizes for all groups are listed
in the legend.
2.2.5 SNPs with Higher Minor Allele Frequency Tend to Relate to Smaller DNA
Shape Change
We further classified SNPs in functional regions within CRMs into high- and low-
frequency groups based on their MAFs. To create similarly sized groups, we used a
frequency cutoff of 0.04. Plotting the ∆MGW distributions for SNPs in these two groups,
we observed that SNPs with higher MAFs tended to have smaller variations of MGW
(Figure 2.6C; Mann-Whitney P=3.97×10⁻²; shuffling test P=0.045; Figure 2.4B).
Although this observation implies that SNPs that greatly change DNA shape are less
likely to have large MAFs, we caution that the biological effect is likely small for most
SNPs. To ensure that the observed negative correlation between MAF and ∆MGW values
was meaningful, we used SNPs in nonfunctional regions as a negative control group.
Using this negative control, we compared distributions of ∆MGW values for the high-
and low-frequency groups as we did for SNPs in functional regions. In this case, no
negative correlation between MAF and ∆MGW could be found (Figure 2.6D; Mann-
Whitney P=0.822; shuffling test P=0.819; Figure 2.4B). These results further support our
hypothesis that purifying selection acts on DNA shape near TFBSs.
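The grouping by MAF described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in this study: the function names are hypothetical, the MAF is taken as the frequency of the second most common allele, and whether a SNP with an MAF exactly equal to the cutoff belongs to the high- or low-frequency group is an assumption (here it is assigned to the low-frequency group).

```python
from collections import Counter


def minor_allele_frequency(alleles) -> float:
    """MAF from the alleles observed across individuals at one SNP.

    The minor allele is taken to be the second most common allele
    (an assumption; for biallelic SNPs this is simply the rarer allele).
    """
    counts = sorted(Counter(alleles).values(), reverse=True)
    if len(counts) < 2:
        return 0.0  # monomorphic site: no minor allele observed
    return counts[1] / sum(counts)


def split_by_maf(snps, cutoff: float = 0.04):
    """Split (maf, delta_mgw) pairs into high- and low-frequency groups.

    Returns two lists of delta-MGW values: (high_maf, low_maf).
    SNPs with maf > cutoff are treated as high frequency (assumption).
    """
    high = [d for maf, d in snps if maf > cutoff]
    low = [d for maf, d in snps if maf <= cutoff]
    return high, low
```

The two returned lists are the samples that would then be compared with the Mann-Whitney U test.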
2.2.6 Large Data Set of SNPs in Putative CRMs Confirmed the Observations for
SNPs in Validated CRMs
We repeated the same analysis on a larger dataset of SNPs in putative CRMs of
the Drosophila genome defined based on DNase I accessibility (see Materials and
Methods). In comparison to the ∆MGW distribution for functional regions, the
distribution for nonfunctional regions was significantly shifted towards larger ∆MGW
values (Mann-Whitney P=5.31×10⁻²⁹; Figure 2.8A). The shift between distributions of
shuffled ∆MGW values that were calculated from “bogus” MGW predictions was not
significant (Figure 2.8B). This analysis also confirmed the negative correlation between
MAF and ∆MGW (Mann-Whitney P=6.27×10⁻¹⁵; Figure 2.8C). Using SNPs in
nonfunctional regions as negative control, no negative correlation between MAF and
∆MGW could be found (Figure 2.8D). Thus, this complementary analysis using a large
dataset of SNPs in putative CRMs reaffirmed our observations for SNPs in
experimentally validated CRMs.
Figure 2.8 Distributions of MGW changes for Drosophila SNPs in putative CRMs at
different locations and with different MAFs. A) Distribution of ΔMGW values for SNPs in
functional and nonfunctional regions (see Materials and Methods for definition) using
the DNAshape-derived MGW. Compared to the distribution for functional regions (red
plot), the distribution for nonfunctional regions (blue plot) was significantly shifted
rightward, indicating that SNPs induced greater changes in ΔMGW in nonfunctional
than in functional regions. B) Distribution of ΔMGW values for SNPs in functional and
nonfunctional regions, using one of the shuffled MGW predictions. Using arbitrarily
shuffled MGW, no signal of purifying selection emerged. C) Distribution of ΔMGW
values for SNPs with high and low MAF in functional regions. Distribution of ΔMGW
values for low MAF was significantly shifted towards the right. D) Distribution of ΔMGW
values for SNPs with high and low MAF in nonfunctional regions. Distributions of these
two groups exhibited no significant difference. Sample sizes for all groups are listed in
the legends.
2.3 Materials and Methods
2.3.1 Single Nucleotide Polymorphism Data
Human
A total of 362,284 common variants called from 493 high-resolution DNase-seq
profiles were considered in this study (Maurano, Haugen et al. 2015). Among these
variants, 64,597 SNPs were identified as allelically imbalanced in TF occupancy and DNA
accessibility in vivo, based on the allelically imbalanced read counts in the DNase-seq data.
Among these imbalanced SNPs, 55,141 were identified as weakly imbalanced (5% FDR),
and 9,456 were strongly imbalanced (0.1% FDR and >70% imbalance; Figure 2.1A).
Drosophila
Genomic data for 216 natural strains of D. melanogaster from two resequencing
projects (Mackay, Richards et al. 2012, Campo, Lehmann et al. 2013) were downloaded
from NCBI Sequence Read Archive (accession numbers PRJNA36679 and
PRJNA74721). For this vast amount of data, SNP calling as described in (Campo,
Lehmann et al. 2013) yielded 2,605,315 SNPs that occurred at least twice among the 216
individuals. The MAF of each SNP was calculated from these data. Here, we only
investigated SNPs located within blastoderm stage-active CRMs, representing regions in
which TFs bind during the blastoderm stage. We obtained the set of CRMs annotated as
“blastoderm embryo” in the REDfly database (Gallo, Gerrard et al. 2011). These CRMs
have been experimentally validated to regulate gene expression during Drosophila
embryonic development (Gallo, Gerrard et al. 2011). We filtered this list to eliminate
regions <100-bp long, as described in (Su, Teichmann et al. 2010), which resulted in 342
regions. By requiring an overlap with these CRMs, we narrowed the number of SNPs
down to 1,603 (Figure 2.1A).
Alternatively, with a less stringent criterion, we were able to identify a larger
number of putative CRMs based on DNase I accessible regions found during Drosophila
embryonic development (Thomas, Li et al. 2011). We selected regions that were
accessible during stage 5 and “developmentally dynamic”, that is, differentially
accessible at reported developmental time points (Thomas, Li et al. 2011). We further
filtered this list to eliminate regions <100 bp long (Su, Teichmann et al. 2010), resulting
in 4,699 regions. Requiring an overlap with these putative CRMs yielded 41,147 SNPs
(Figure 2.1A). Although this definition of putative CRMs likely included false positives,
it allowed testing of our hypothesis based on a large number of SNPs (Figure 2.8).
2.3.2 Definition of Functional and Nonfunctional Regions in Drosophila
Functional genomic regions are more likely than nonfunctional regions to be
under selection (Consortium 2012). PhastCons (Siepel, Bejerano et al. 2005) is a widely
used approach to identify evolutionarily conserved elements based on multiple sequence
alignment and a phylogenetic tree. This approach produces continuous values for
conservation scores for each nucleotide position of the genome. The higher the score is,
the more conserved the respective nucleotide position is.
We defined the functional and nonfunctional regions within CRMs based on the
following criteria (Figure 2.1A):
1. A nucleotide position with phastCons conservation score >0.1 was considered to
be in a functional region.
2. A nucleotide position that had conservation score ≤0.1 and was not located within
any of the identified TFBSs or their immediate 5-bp flanking regions was
considered to be in a nonfunctional region.
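The two criteria above can be expressed as a short Python sketch. The function names, the inclusive `(start, end)` interval convention, and the label `"unassigned"` for poorly conserved positions inside TFBSs or their flanks are assumptions of this example; the listed criteria only state that such positions are not counted as nonfunctional.

```python
def in_tfbs_or_flank(position: int, tfbs_intervals, flank: int = 5) -> bool:
    """True if `position` lies in any TFBS or within `flank` bp of one.

    `tfbs_intervals` is a list of (start, end) coordinates, both inclusive
    (a convention assumed for this sketch).
    """
    return any(start - flank <= position <= end + flank
               for start, end in tfbs_intervals)


def classify_position(phastcons_score: float, near_tfbs: bool,
                      threshold: float = 0.1) -> str:
    """Assign a nucleotide position to a region class.

    Criterion 1: score > threshold            -> "functional"
    Criterion 2: score <= threshold and not
                 in a TFBS or its flank       -> "nonfunctional"
    Otherwise (low score but near a TFBS)     -> "unassigned" in this sketch.
    """
    if phastcons_score > threshold:
        return "functional"
    if not near_tfbs:
        return "nonfunctional"
    return "unassigned"
```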
We excluded TFBSs with low conservation scores from nonfunctional regions to
rule out the possibility of underestimation of conservation levels calculated by the
sequence-based method. PhastCons conservation scores at each nucleotide position,
based on the alignment of 14 insect genomes with D. melanogaster (dm3 assembly), were
downloaded from the UCSC Genome Browser (Tyner, Barber et al. 2017).
TFBSs in CRMs were located through motif scans. To search for motif matches,
we used the PATSER program (Stormo, Schneider et al. 1982) with PWMs of the 34
principal TFs (Supplementary Table 1) that are active during the blastoderm stage of
Drosophila development, as defined by FlyFactorSurvey (Zhu, Christensen et al. 2011). We
used a GC content of 0.406, corresponding to the intergenic GC content of D.
melanogaster (Berman, Nibu et al. 2002), and a P-value cutoff of 0.001.
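To illustrate how a GC content of 0.406 enters a motif scan, the following Python sketch scores each window of a sequence with a log-odds PWM against a GC-based background. It is a simplified stand-in and does not reproduce PATSER or its P-value calculation; the function names and the toy two-column PWM are hypothetical, and the PWM is assumed to contain no zero probabilities.

```python
import math


def background_from_gc(gc: float = 0.406) -> dict:
    """Background base frequencies implied by a given GC content."""
    return {"A": (1 - gc) / 2, "T": (1 - gc) / 2, "C": gc / 2, "G": gc / 2}


def log_odds_score(site: str, pwm, background) -> float:
    """Sum of log2(P(base | motif position) / P(base | background))."""
    return sum(math.log2(column[base] / background[base])
               for base, column in zip(site, pwm))


def scan(sequence: str, pwm, background):
    """Score every window of the forward strand; returns (start, score) pairs."""
    width = len(pwm)
    return [(i, log_odds_score(sequence[i:i + width], pwm, background))
            for i in range(len(sequence) - width + 1)]


# Toy PWM (made-up probabilities) preferring the dinucleotide "AT".
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
hits = scan("GGATCC", toy_pwm, background_from_gc())
best_start, best_score = max(hits, key=lambda hit: hit[1])
```

A full scan would also score the reverse strand and convert scores to P-values before applying the cutoff.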
Table 2.1 TFs used to determine functional regions.
TF                | Symbol  | DNA binding domain | Regulatory class
Bicoid            | bcd     | homeodomain        | A/P Maternal
Caudal            | cad     | homeodomain        | A/P Maternal
forkhead          | fkh     | forkhead domain    | A/P Zygo-gap/term
giant             | gt      | b-zip domain       | A/P Zygo-gap/term
huckebein         | hkb     | TFIIIA Zn finger   | A/P Zygo-gap/term
hunchback         | hb      | TFIIIA Zn finger   | A/P Zygo-gap/term
knirps            | kni     | receptor Zn finger | A/P Zygo-gap/term
knirps like       | knil    | receptor Zn finger | A/P Zygo-gap/term
Kruppel           | Kr      | TFIIIA Zn finger   | A/P Zygo-gap/term
orthodenticle     | oc      | homeodomain        | A/P Zygo-gap/term
tailless          | tll     | receptor Zn finger | A/P Zygo-gap/term
Dichaete          | D       | HMG/SOX class      | A/P Zygo-pair rule
even skipped      | eve     | homeodomain        | A/P Zygo-pair rule
ftz               | ftz     | homeodomain        | A/P Zygo-pair rule
hairy             | h       | bHLH               | A/P Zygo-pair rule
odd paired        | opa     | TFIIIA Zn finger   | A/P Zygo-pair rule
paired            | prd     | homeo & Prd domain | A/P Zygo-pair rule
runt              | run     | runt domain        | A/P Zygo-pair rule
sloppy paired 1   | slp1    | forkhead domain    | A/P Zygo-pair rule
sloppy paired 2   | slp2    | forkhead domain    | A/P Zygo-pair rule
Stat92E           | Stat92E | Stat domain        | A/P Zygo-pair rule
sis of bowl & odd | sob     | Zn finger          | A/P Zygo-pair rule
odd-skipped       | odd     | Zn finger          | A/P Zygo-pair rule
bowl              | bowl    | Zn finger          | A/P Zygo-pair rule
Daughterless      | da      | bHLH               | D/V Maternal
dorsal            | dl      | NFkB/rel           | D/V Maternal
brinker           | brk     | novel              | D/V Zygo
Mad               | Mad     | SMAD-MH1           | D/V Zygo
Medea             | Med     | SMAD-MH1           | D/V Zygo
schnurri          | shn     | TFIIIA Zn finger   | D/V Zygo
snail             | sna     | TFIIIA Zn finger   | D/V Zygo
twist             | twi     | bHLH               | D/V Zygo
zerknult 1        | zen 1   | homeodomain        | D/V Zygo
zerknult 2        | zen 2   | homeodomain        | D/V Zygo
2.3.3 Genome-Wide Prediction of DNA Shape
The MGW values used in this study were predicted with DNAshape, a high-
throughput method for predicting DNA structural features (Zhou, Yang et al. 2013, Zhou,
Shen et al. 2015, Chiu, Comoglio et al. 2016). DNAshape uses a precomputed pentamer
query table that stores the DNA shape information for all 512 unique pentamers (Figure
2.1B). We focused in this study on MGW due to its well-established role in DNA shape
readout and because MGW, unlike dinucleotide-based features such as helix twist,
propeller twist, or roll, is defined over a region of several nucleotides.
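The figure of 512 unique pentamers can be checked with a short enumeration. The sketch below assumes that "unique" means merging each pentamer with its reverse complement, which describes the same double-stranded DNA read from the opposite strand; the helper names are chosen for this example only.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Lexicographically smaller of a sequence and its reverse complement."""
    return min(seq, reverse_complement(seq))


all_pentamers = ["".join(p) for p in product("ACGT", repeat=5)]
unique_pentamers = {canonical(p) for p in all_pentamers}
print(len(all_pentamers), len(unique_pentamers))  # prints: 1024 512
```

Because a pentamer has odd length, its central base can never equal its own complement, so no pentamer is its own reverse complement and the 1,024 pentamers pair up into exactly 512 classes.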
2.3.4 Calculation of Euclidean Distance of Minor Groove Width between Two
Alleles
Pentamer-based MGW prediction with the DNAshape approach was shown to
agree well with experimental structural data (Zhou, Yang et al. 2013). Based on this
modeling assumption, one single-nucleotide variant would result in DNA shape changes
of the five consecutive nucleotide positions centered around it. Thus, prediction of MGW
at these five positions relies on the 9-mer sequence context (Figure 2.1B). For any SNP,
we denoted the 9-mer sequence context as $N_{-4}N_{-3}N_{-2}N_{-1}N_{0}N_{1}N_{2}N_{3}N_{4}$, where $N_{0}$ is the
SNP locus. We defined the DNA shape for each allele $a$ of this SNP as $MGW^{a}$, where
$$MGW^{a} = \left(MGW_{-2}^{a},\, MGW_{-1}^{a},\, MGW_{0}^{a},\, MGW_{1}^{a},\, MGW_{2}^{a}\right) \qquad (2.1)$$
We regarded the most frequently occurring allele as the reference allele and all
other alleles as alternative alleles.
To quantify the variation of local MGW between two alternate alleles of a SNP,
we calculated the Euclidean distance ∆MGW between the reference allele (ref) and
alternative allele (alt), as follows:
$$\Delta MGW = \sqrt{\sum_{i=-2}^{2} \left(MGW_{i}^{ref} - MGW_{i}^{alt}\right)^{2}} \qquad (2.2)$$
where $i$ indicates the position relative to the SNP (Figure 2.1B). Quantification of the
variation in local MGW can be expanded to multi-allelic SNPs by averaging ΔMGW
values between the reference allele and each alternative allele. For example, consider a
SNP having n>2 alleles (one reference allele and n–1 alternative alleles). For each
alternative allele $alt_{m}$, a $\Delta MGW_{m}$ value can be calculated using the above formula. The
MGW change of the multi-allelic SNP can be defined as follows:
$$\Delta MGW = \frac{1}{n-1} \sum_{m=1}^{n-1} \Delta MGW_{m} \qquad (2.3)$$
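The calculation in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the DNAshape implementation: `TOY_MGW_TABLE` is a placeholder pentamer table filled with made-up values (it neither contains DNAshape data nor enforces reverse-complement symmetry), and all function names are hypothetical.

```python
import math
import random
from itertools import product

# Placeholder pentamer -> MGW table with made-up values (NOT DNAshape data).
_rng = random.Random(0)
TOY_MGW_TABLE = {"".join(p): _rng.uniform(3.0, 6.5)
                 for p in product("ACGT", repeat=5)}


def mgw_profile(ninemer: str, table=TOY_MGW_TABLE) -> list:
    """MGW at relative positions -2..+2 of a 9-mer centered on the SNP.

    The pentamer starting at index s is centered at index s + 2, so
    s = 0..4 corresponds to relative positions -2..+2.
    """
    if len(ninemer) != 9:
        raise ValueError("expected a 9-mer centered on the SNP")
    return [table[ninemer[s:s + 5]] for s in range(5)]


def delta_mgw(ref_9mer: str, alt_9mer: str, table=TOY_MGW_TABLE) -> float:
    """Euclidean distance between the MGW profiles of two alleles."""
    ref = mgw_profile(ref_9mer, table)
    alt = mgw_profile(alt_9mer, table)
    return math.sqrt(sum((r - a) ** 2 for r, a in zip(ref, alt)))


def delta_mgw_multiallelic(ref_9mer: str, alt_9mers, table=TOY_MGW_TABLE) -> float:
    """Average delta-MGW between the reference and each alternative allele."""
    return sum(delta_mgw(ref_9mer, alt, table) for alt in alt_9mers) / len(alt_9mers)
```

Passing a real pentamer query table as `table` would replace the placeholder values without changing the rest of the calculation.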
2.3.5 Statistical Analysis
The Mann-Whitney U test, also called the Wilcoxon rank sum test, is a
nonparametric statistical test of whether values in one dataset tend to be larger than
those in the other, and it is commonly used to compare the medians of
two datasets. We applied the Mann-Whitney U test to evaluate the difference in ∆MGW
distributions between SNPs with and without imbalance, between SNPs in functional and
nonfunctional regions, and between SNPs with high and low MAFs. The P-value of each
test was calculated.
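For readers unfamiliar with the test, the following self-contained Python sketch computes the U statistic and an approximate two-sided P-value. It uses the normal approximation with a tie correction and no continuity correction, which is an assumption of this example; the software and exact settings used for the P-values reported in this chapter are not specified here, and the function name is hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic of sample x, approximate two-sided P-value).
    Tied values receive average ranks and the variance is tie-corrected.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Items i..j-1 share ranks i+1..j; assign their average.
        average_rank = (i + 1 + j) / 2.0
        group_size = j - i
        tie_term += group_size ** 3 - group_size
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a shift
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u_x, p_value
```

For small samples an exact P-value is preferable to the normal approximation used in this sketch.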
2.3.6 Shuffling of the Pentamer Query Table and Statistical Verification
The DNAshape method predicts MGW based on a pentamer query table that is
derived from data mining of all-atom Monte Carlo simulations (Zhou, Yang et al. 2013).
To rule out the possibility that the statistical significance of detected purifying selection
signals is an artifact of associating each pentamer with a floating-point number (i.e., MGW),
we generated 1,000 shuffled pentamer tables. For each shuffled table, we derived the
“bogus” ∆MGW values for each SNP (shuffled ∆MGWs) and computed a new P-value
for every statistical test that we conducted.
We calculated the shuffling test P-value as the ratio between the number of new
P-values that were lower than the originally reported P-value and the number of shuffles
(1,000). In other words, the shuffling test P-value was the probability that the distribution
difference can be observed by randomly associating a pentamer with a floating-point number.
Statistical verification using the shuffled pentamer tables indicated that the observed
purifying selection indeed acted on DNA shape rather than DNA sequence.
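The shuffling procedure described above can be outlined in Python as follows. This is a sketch under assumptions rather than the code used in this study: `p_value_from_table` is a hypothetical callable that recomputes ∆MGW for every SNP from a given pentamer table and returns the resulting test P-value, and the shuffle reassigns values freely among all table keys.

```python
import random


def shuffle_table(table: dict, rng: random.Random) -> dict:
    """Randomly reassign the stored MGW values among the pentamer keys."""
    keys = list(table)
    values = list(table.values())
    rng.shuffle(values)
    return dict(zip(keys, values))


def shuffling_test_p_value(table: dict, p_value_from_table,
                           n_shuffles: int = 1000, seed: int = 0) -> float:
    """Fraction of shuffled tables yielding a lower P-value than the original.

    `p_value_from_table(table)` must return the P-value of the statistical
    test when delta-MGW values are derived from `table`.
    """
    rng = random.Random(seed)
    original_p = p_value_from_table(table)
    n_lower = sum(
        p_value_from_table(shuffle_table(table, rng)) < original_p
        for _ in range(n_shuffles)
    )
    return n_lower / n_shuffles
```

With 1,000 shuffles, the smallest nonzero value this estimate can take is 0.001.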
2.4 Discussion
Comparative studies have shown that at least 3–8% of all nucleotides in the
human genome are under purifying selection, and many of these nucleotides are located
in noncoding regions of the genome (Encode Project Consortium 2012). The
conservation level of a nucleotide position reflects its functional importance. However,
nucleotide sequence may not be the sole target of selective pressure (Parker, Hansen et al.
2009). In this study, our findings suggest that purifying selection also acts on DNA shape.
Due to the methodology currently available to probe structural features of the genome
(Zhou, Yang et al. 2013, Li, Sagendorf et al. 2017), only local DNA shape could be
analyzed in this study. We, therefore, did not address global genomic topological
information, such as data obtained from chromosome conformation capture-based
experiments (Dekker 2016, Dekker and Mirny 2016), which is also likely to be under
evolutionary selection (Kovina, Petrova et al. 2017).
In DNase-seq data, allelic imbalance of a SNP reflects a change in protein binding
at the locus. We showed that allelically imbalanced SNPs in human DNase I hypersensitive
sites tended to induce slightly larger changes in DNA shape than SNPs without allelic
imbalance in chromatin accessibility in vivo. This shift in DNA shape change suggested a
role for DNA shape in determining TF binding specificity and sheds light on the
mechanism of DNA shape-targeted purifying selection. Although statistically significant,
the magnitude of the shift of shape variation was relatively small. One possible
explanation for this result is that, although many TFs employ DNA shape readout in
addition to sequence readout, not all TFs use shape readout to the same extent (Yang,
Orenstein et al. 2017). When data are pooled from regions where many different TFs
bind, signals from TFs with different levels of dependence on DNA shape for binding
specificity are aggregated, which likely dilutes the overall signal.
The effect of DNA shape on TF binding is also supported by the work of other
researchers. In a study of the effect of SNPs on TF binding, allele-specific binding events
strongly correlated with TFBS alterations (Shi, Fornes et al. 2016). Within each TF
binding motif, certain positions were more sensitive to allele-specific binding events. For
example, the binding of the CCAAT/enhancer-binding protein (CEBPB), which binds to
an 11-base pair (bp) DNA motif as a dimer, was particularly sensitive to position 6 at the
center of the motif (Shi, Fornes et al. 2016). Interestingly, in separate work that used
machine learning to study the position-dependent DNA shape importance in TF binding
(Yang, Orenstein et al. 2017), we observed that one side of the CEBPB half-site exhibited
stronger shape importance, coinciding with the central position identified by Shi, Fornes
et al. (2016). Taken together, these findings suggest a mechanism by which DNA shape
affects TF binding as a prerequisite for natural selection. Alterations in TF binding due to
shape-disrupting SNPs could also cause disease, as illustrated above with examples of
disease-associated SNPs.
Furthermore, we found statistical evidence that purifying selection acts on DNA
shape features, based on SNP data derived from 216 natural strains of D. melanogaster.
In functional TF binding regions, DNA tended to maintain its shape. We hypothesized
that, if DNA shape is functional, then the more common a SNP is, the more likely it is
that the local DNA shape will be conserved, as shape-disrupting SNPs would have been
removed through purifying selection due to their deleteriousness. Our statistical analysis
confirmed this hypothesis. Specifically, we found that in functional TF binding regions,
SNPs with larger MAFs were more likely to result in smaller shape variations, whereas
such a correlation was not present in nonfunctional regions. Although the difference in
shape changes between high and low MAFs was small, it was statistically significant for
SNPs in experimentally validated CRMs, with the significance level being much higher
for a larger dataset of SNPs in putative CRMs.
In summary, we propose that selection acts on TF binding partially through
maintenance of DNA shape. This understanding adds a new perspective to the practical
study of genome evolution. The accuracy of regulatory SNP prediction has recently
increased dramatically due to more efficient machine-learning methods (Alipanahi,
Delong et al. 2015, Lee, Gorkin et al. 2015). As adding DNA shape information has been
demonstrated to improve the modeling of TF–DNA binding specificities (Zhou, Shen et
al. 2015, Mathelier, Xin et al. 2016, Yang, Orenstein et al. 2017), we envision that adding
information on DNA shape variation can further improve prediction accuracy in the
identification of regulatory or pathogenic noncoding variants.
Chapter 3 Beyond A, C, G, and T: learning a new alphabet
3.1 Introduction
Before the role of DNA shape in TF-DNA recognition came into focus,
transcription factor binding sites (TFBSs) were traditionally described by position weight
matrices (PWMs) (Stormo 2000, Stormo 2013), which represent independent
contributions of nucleotides at each position of a TFBS to binding affinity and can be
visualized as sequence logos (Schneider and Stephens 1990). As an alternative to
consensus sequences, the PWM was introduced by Dr. Gary Stormo and his colleagues in
1982 and became an important part of many software tools for motif discovery and
binding site prediction. The MEME Suite is one of these PWM-based sequence analysis
tools (Bailey, Johnson et al. 2015).
The problem with the traditional PWM method is that it assumes independence between
individual nucleotide positions within the binding sites, which is not always the case. The
mechanisms of TF-DNA interaction are highly complex, and interdependency
between adjacent nucleotides can arise from the synergistic effect of the presence of two
specific nucleotides on the binding affinity. Recent work has demonstrated conclusively
that TFs also adopt DNA shape readout to achieve binding specificity. DNA shape
features that were derived from simulated DNA structures implicitly contain information
on interdependencies between nucleotide positions within a TFBS (Rohs, West et al.
2009).
The shape features at a given nucleotide can now be accurately predicted based
upon the nucleotide’s pentamer neighborhood in a high-throughput manner (Zhou, Yang
et al. 2013, Chiu, Comoglio et al. 2016). Accordingly, machine learning methods that
take into account the local DNA shape around a given nucleotide are demonstrably more
accurate in their ability to predict TF-DNA binding specificity (Zhou, Shen et al. 2015,
Ma, Yang et al. 2017, Yang, Orenstein et al. 2017). On the other hand, these models
typically require fitting many more parameters than a traditional, weight matrix-based
model. To control the number of parameters and also to allow direct re-use of existing
PWM-based methods, our goal in this work is to learn a shape-augmented DNA alphabet
(for simplicity, the rest of the thesis will denote it as “shape alphabet”) that considers the
interdependency among nucleotide positions. In particular, we searched for a mapping
from DNA pentamers to a small number of discrete classes, i.e. the shape alphabet,
optimizing for the ability of the shape alphabet-based position weight matrices (PWMs)
to discriminate between bound and unbound sites for a wide variety of TFs.
With a large collection of TF binding data derived from both genomic-context
protein binding microarray (gcPBM) and high-throughput SELEX (HT-SELEX)
experiments, we developed a simulated annealing algorithm to search for the shape
alphabet that is optimal in terms of its classification accuracy in distinguishing
bound from unbound sequences.
Our results showed that a universal alphabet that better discriminates between
bound and unbound DNA sequences than the A, C, G, T-alphabet for all TFs belonging
to different TF families may not exist, and it is more feasible to learn family-specific
shape alphabets for each TF family. With basic helix-loop-helix (bHLH) family TFs from
gcPBM data, the optimized shape alphabet improved the performance of TF-DNA
binding prediction by more than 10% on average compared with the traditional A, C, G,
T-alphabet. The average performance gain of the family-specific alphabets learned with
the HT-SELEX data was 7.6%. Although the dinucleotide-enhanced PWM model, i.e. a
PWM model that considers frequencies of dinucleotides instead of just single nucleotides,
achieved better prediction performance than the traditional PWM based on the A, C, G,
T-alphabet for some TF data sets in this study, the shape-augmented alphabet still
showed better performance in TFBS prediction than the dinucleotide model.
We also observed that the performance of shape alphabet-based PWM models increased
as the number of letters and the length of the k-mer increased.
3.2 Materials and Methods
3.2.1 Transcription Factor Binding Data
gcPBM Binding Data
The genomic context protein binding microarray (gcPBM) data were downloaded
from GEO accession number GSE59845 (Zhou, Shen et al. 2015). Each 36 bp sequence
is centered at the binding motif for the TF dimer Mad2−Max (Mad), Max−Max (Max), or
c-Myc−Max (Myc), all of which belong to the basic helix-loop-helix (bHLH) TF family.
HT-SELEX Binding Data
The sequencing data of HT-SELEX experiments were downloaded from the
European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) under study identifier
PRJEB14744 (Yang, Orenstein et al. 2017). The data were produced by repooling
existing PCR-amplified SELEX ligands from a previously published HT-SELEX
experiment (Jolma, Yan et al. 2013) into new Illumina sequencing libraries, where
samples were multiplexed to a lesser extent than in the previous study. The complete data
include 548 experiments covering 410 different TFs and 40 protein families, including
mouse/human full-length protein-DNA binding domain differences. The family
membership of the TFs can be found in Jolma, Yan et al. (2013).
Following the quality control criteria of Yang, Orenstein et al. (2017), the binding data of
a total of 215 TFs from 27 different families remained (Table 3.1). All
sequences were aligned according to the corresponding core binding motifs derived from
Jolma, Yan et al. (2013) and Weirauch and Hughes (2011) and were
centered on the motifs.
Table 3.1 Family membership of the TFs from HT-SELEX data in this study
TF Family TF
bHLH Bhlhb2, MAX, TCF3, TCF4, TFAP4, TFE3, TFEB, USF1
bZIP ATF4, Cebpb, CEBPB, CEBPD, CEBPE, CEBPG, CREB3L1,
CREB3, Dbp, DBP, JDP2, Mafb, MAFK, NFIL3, XBP1
C2H2 BCL6B, Egr1, GLI2, Hic1, HINFP1, KLF16, SNAI2, SP1, SP3,
YY1, YY2, ZBTB49, ZBTB7A, ZBTB7B, Zfp740, ZIC1,
ZIC4, ZNF740, ZNF784
CENPB CENPB
CP2 GRHL1
CUT CUX1, ONECUT1, ONECUT3
ETS ELF1, Elf5, ELF5, ELK1, ELK3, ELK4, ERG, ETS1, ETV1,
ETV4, ETV5, ETV6, FEV, FLI1, SPDEF
Forkhead FOXB1, Foxc1, FOXC1, FOXC2, FOXD2, Foxg1, FOXI1,
FOXJ2, FOXJ3, FOXL1, FOXP3
GATA GATA3
GCM GCM1, GCM2
Homeodomain Alx1, ALX3, Alx4, ALX4, Arx, ARX, Barhl1, BARHL2,
BARX1, BSX, CART1, DLX1, Dlx1, Dlx2, DLX2, DLX3,
DLX4, DLX6, DMBX1, DPRX, EMX2, EN1, En2, ESX1,
EVX1, EVX2, GBX1, Gbx1, Gbx2, GBX2, GSC, GSX2,
HMX2, HOXA10, Hoxa2, HOXA2, HOXB13, HOXB2,
HOXB3, Hoxc10, HOXC10, HOXC11, HOXD12, Hoxd13,
Hoxd9, Irx3, ISX, LBX2, LHX2, Lhx4, LHX6, LHX9,
LMX1B, MEOX1, Meox2, MEOX2, MIXL1, MSX1, NKX2-3,
NKX3-1, Nkx3-1, NKX3-2, NKX6-1, NKX6-2, NOTO, OTX1,
OTX2, PDX1, PHOX2A, PHOX2B, PITX1, PITX3, PROP1,
PRRX1, Prrx2, PRRX2, RAXL1, Rhox11, RHOXF1, SHOX2,
Shox2, SHOX, UNCX, VAX1, VAX2, Vsx1, VSX1, VSX2
HSF HSF1
IRF IRF7
MAD SMAD3
MEIS MEIS1, Meis2, MEIS3, Pknox2
MYB MYBL1, MYBL2
NFAT NFATC1
NFI NFIA, NFIX
Nuclear receptor Ar, AR, ESR1, Esrra, HNF4A, NR2C2, Nr2e1, NR2F1,
NR2F6, NR3C1, NR3C2, RARA, Rara, Rarb, RARG, RORA,
RXRA, RXRG, THRA
PAX PAX2, PAX7
POU POU2F1, POU2F2, POU2F3, POU3F1, POU3F2, POU3F3,
POU3F4, POU4F3
PROX PROX1
RRM CPEB1
RUNX RUNX3
SAND GMEB2
TBX MGA, TBX15, TBX19, TBX21, TBX2
TEA TEAD3
3.2.2 DNA Shape-Augmented Alphabet (Shape Alphabet)
Currently, the most common approach to model TFBSs is to build a PWM from a
given set of known binding sites. A straightforward way to incorporate DNA shape into
this approach is to switch from a 4-letter A, C, G, T-alphabet to an alphabet that
implicitly encodes shape features. Because it has been shown that the two flanking
nucleotides on both sides of a given nucleotide are sufficient to predict local shape
features with high accuracy, one could simply switch to a 1024-letter alphabet (each
pentamer mapped to a letter) and be guaranteed that this alphabet accurately captures
local shape properties.
The problem with this simple approach is that the number of free parameters one
must fit for a PWM of length $L$ increases from $3 \times L$ to $1023 \times L$. Accordingly, our goal
is to find a mapping of the 1024 pentamers into a small number $m$ of categories, such that
a PWM defined in this $m$-letter alphabet still accurately reflects local DNA shape
properties.
With the new $m$-letter alphabet, one could translate the DNA sequences
represented by the A, C, G, T-alphabet to sequences represented by the new alphabet
through a 5-base pair (bp) sliding window. Then, we could build PWMs that are based on
such a shape alphabet, calculate PWM scores, predict binding sites, and perform motif
discovery as we do with A, C, G, and T (Figure 3.1).
Figure 3.1 DNA shape-augmented alphabet. (A) Cartoon of the mapping from pentamers to
an $m$-letter shape alphabet ($m = 8$). There are 1024 pentamer sequences in total, and
each pentamer is mapped to a letter in the shape alphabet. (B) Translation of DNA
sequences from the A, C, G, T-alphabet to the shape alphabet. With the learned shape
alphabet that best discriminated bound from unbound sequences, we
translated the DNA sequences in the A, C, G, T-alphabet into sequences in the shape alphabet.
Traditionally, the A, C, G, T-alphabet is used to build PWMs, whereas our method uses
the mapped shape alphabet (8 letters in total here).
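The translation step can be sketched as follows. This is a minimal illustration, not the thesis implementation: `toy_mapping` is a hypothetical placeholder that assigns pentamers to letters by index (it is not a learned alphabet and ignores reverse complementation), and letters are represented as integers 0 to m-1.

```python
from itertools import product

# All 4^5 = 1024 pentamers over A, C, G, T.
PENTAMERS = ["".join(p) for p in product("ACGT", repeat=5)]


def toy_mapping(m=8):
    """Hypothetical placeholder mapping from pentamers to m letters (0..m-1)."""
    return {pentamer: index % m for index, pentamer in enumerate(PENTAMERS)}


def translate(sequence, mapping):
    """Translate a DNA string with a 5-bp sliding window, one letter per pentamer."""
    return [mapping[sequence[i:i + 5]] for i in range(len(sequence) - 4)]


def free_parameters(alphabet_size, length):
    """Free parameters of a PWM: (alphabet size - 1) per position."""
    return (alphabet_size - 1) * length


shape_sequence = translate("ACGTACGTAC", toy_mapping(8))  # 10 bp -> 6 letters
```

Without padding, a sequence of length n yields n - 4 shape letters, because the first and last two positions lack a full pentamer context. For a PWM of length 10, `free_parameters` gives 30 for the 4-letter alphabet, 70 for an 8-letter alphabet, and 10230 for the 1024-letter alphabet.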
3.2.3 Evaluating the Quality of the Shape Alphabet
Define a shape alphabet $A = \{a_1, \ldots, a_m\}$. Each nucleotide pentamer $p$ is mapped to a
shape letter $a$ via $f_A(p) = a$. We measured the quality of $f_A$ with respect to a collection of
TF binding data sets $D = \{D_1, \ldots, D_n\}$, where each data set $D_i$ comprised a collection
$B_i$ of bound sites (positive data set) and a collection $U_i$ of unbound sites (negative data
set). The quality $Q(A, D)$ of alphabet $A$ relative to TF binding data $D$ was the average
quality relative to each TF-specific data set within $D$:

$$Q(A, D) = \frac{1}{n} \sum_{i=1}^{n} Q(A, D_i) \qquad (3.1)$$

This quality, in turn, was defined as the ability of a PWM defined in alphabet $A$ to
successfully separate, in a cross-validated test, bound from unbound sites. Say that the
sequences in data set $D_i$ can be separated into bound sites $B_i$ and unbound sites $U_i$. We
randomly subdivided these sites into training sequences $B_i^{\mathrm{train}}, U_i^{\mathrm{train}}$ and test sequences
$B_i^{\mathrm{test}}, U_i^{\mathrm{test}}$. From the training sequences we constructed an $m \times L$ PWM $W_i$ in alphabet $A$,
where $L$ is the width of the sequences in $D_i$, such that the entry corresponding to
position $j$ and letter $a_l$ is

$$W_i(j, a_l) = \log \frac{\left|\{ s \in B_i^{\mathrm{train}} : f_A(s_{j-2:j+2}) = a_l \}\right|}{\frac{1}{L} \sum_{j'=1}^{L} \left|\{ s \in B_i^{\mathrm{train}} : f_A(s_{j'-2:j'+2}) = a_l \}\right|} \qquad (3.2)$$

The score of a given sequence $s$ was the sum of the corresponding entries in $W_i$:

$$S_i(s) = \sum_{j=1}^{L} W_i\left(j, f_A(s_{j-2:j+2})\right) \qquad (3.3)$$

We then measured the ability of $W_i$ to score sequences in $B_i^{\mathrm{test}}$ better than sequences
in $U_i^{\mathrm{test}}$. For this measurement, we used the area under the ROC curve (AUROC).
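The PWM construction, scoring, and AUROC evaluation described here can be sketched as below. This is a hedged illustration under stated assumptions: sequences are already translated into shape letters (integers 0 to m-1), the background for each letter is its average count over positions, and a pseudocount is added to avoid taking the logarithm of zero. The pseudocount and all function names are assumptions, not part of the text.

```python
import math


def build_pwm(bound_train, m, pseudocount=1.0):
    """Log-ratio PWM from bound training sequences given as lists of letters.

    Each entry is log(count of a letter at a position / average count of that
    letter over all positions). The pseudocount is an added assumption.
    """
    width = len(bound_train[0])
    counts = [[pseudocount] * m for _ in range(width)]
    for sequence in bound_train:
        for position, letter in enumerate(sequence):
            counts[position][letter] += 1
    background = [sum(counts[j][l] for j in range(width)) / width for l in range(m)]
    return [[math.log(counts[j][l] / background[l]) for l in range(m)]
            for j in range(width)]


def pwm_score(pwm, sequence):
    """Sum of the PWM entries selected by the letters of the sequence."""
    return sum(pwm[position][letter] for position, letter in enumerate(sequence))


def auroc(positive_scores, negative_scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney formulation).
    """
    wins = sum((p > n) + 0.5 * (p == n)
               for p in positive_scores for n in negative_scores)
    return wins / (len(positive_scores) * len(negative_scores))


# Toy data in a 3-letter alphabet: bound sites favor letter 0 at position 0.
toy_pwm = build_pwm([[0, 1, 2], [0, 1, 1], [0, 2, 2]], m=3)
bound_score = pwm_score(toy_pwm, [0, 1, 2])
unbound_score = pwm_score(toy_pwm, [2, 0, 0])
```

The pairwise AUROC is quadratic in the number of sequences; for large data sets a rank-based implementation would be preferable, but the pairwise form keeps the definition explicit.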
3.2.4 Finding the Optimal Shape Alphabet – Simulated Annealing Algorithm
For a fixed alphabet size $m$, a candidate alphabet over pentamers can be represented
as a string of length 1024 drawn from an $m$-letter alphabet. Thus, for a maximum alphabet
size $M$, there are

$$\sum_{m=1}^{M} m^{1024} \qquad (3.4)$$

possible alphabets to consider. Because exhaustive enumeration of this space is infeasible,
the simplest way to optimize over possible alphabets is via a heuristic method such as
simulated annealing, a genetic algorithm, or even random guessing.
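To give a sense of scale, the count of candidate alphabets can be evaluated directly with Python's arbitrary-precision integers. This sketch assumes the sum runs over alphabet sizes from 1 to the maximum size, each contributing (size)^1024 strings; it ignores the reverse-complement constraint introduced later, which reduces the count. The function name is hypothetical.

```python
def count_alphabets(max_size, n_pentamers=1024):
    """Number of length-n_pentamers strings over alphabets of size 1..max_size."""
    return sum(size ** n_pentamers for size in range(1, max_size + 1))


# Number of decimal digits of the count for a maximum alphabet size of 8.
digits_for_eight_letters = len(str(count_alphabets(8)))
```

For a maximum size of 8 the count has 925 decimal digits, which is why a heuristic search is used instead of enumeration.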
The first algorithm to consider is simulated annealing.
Simulated annealing is a method for finding a good (not necessarily optimal)
solution to an optimization problem. By injecting randomness early in the process,
simulated annealing avoids getting trapped at local maxima, i.e., solutions that are
better than any others nearby but are not the very best. In addition, simulated
annealing is straightforward to implement.
To learn an $m$-letter alphabet, the algorithm was implemented as
follows (Figure 3.2):
1. Prepare positive and negative datasets. Each TF dataset was divided into a
positive dataset (bound sequences) and a negative dataset (unbound sequences)
according to the relative binding affinity of the sequences. The positive dataset
includes the 50% of sequences with the highest relative binding affinity, and the
negative dataset includes the 50% with the lowest relative binding affinity.
2. Prepare training, development, and testing datasets. For each TF, the sequences were
randomly divided into training sequences (90%) and testing sequences (10%).
3. Generate the initial mapping. Randomly map the 1024 pentamers to the $m$ letters. Each
letter does not necessarily contain an equal number of pentamers. Note that the
$m$-letter alphabet should still preserve the property of reverse complementation, as in
the A, C, G, T-alphabet. Set the initial temperature of the simulated annealing algorithm
to $T_{\mathrm{previous}} = 1$ and set $\alpha = 0.9$.
4. Prepare shape alphabet sequences. Using the current mapping, represent the
sequences in the shape alphabet.
5. Build PWMs using the training data with formula (3.2) and calculate PWM scores
for the testing data with formula (3.3).
6. Calculate $\mathrm{AUROC}_{\mathrm{current}}$ for the current mapping.
7. Generate a random neighboring mapping. Randomly select a pentamer and map it
to any other letter. Map the reverse complement of the selected pentamer
to the corresponding letter at the same time.
8. Repeat steps 4-6 for the neighboring mapping.
9. Compare $\mathrm{AUROC}_{\mathrm{current}}$ and $\mathrm{AUROC}_{\mathrm{previous}}$:
a. If $\mathrm{AUROC}_{\mathrm{current}} > \mathrm{AUROC}_{\mathrm{previous}}$, accept the current mapping;
b. If $\mathrm{AUROC}_{\mathrm{current}} \le \mathrm{AUROC}_{\mathrm{previous}}$, accept the current mapping with
acceptance probability $p$:

$$p = e^{\left(\mathrm{AUROC}_{\mathrm{current}} - \mathrm{AUROC}_{\mathrm{previous}}\right) / T_{\mathrm{previous}}} \qquad (3.5)$$

10. Update the temperature: $T_{\mathrm{current}} = \alpha T_{\mathrm{previous}}$.
Repeat steps 7-10 until convergence or until a set number of iterations is reached.
Figure 3.2 Flow chart of simulated annealing algorithm. Simulated annealing was
applied to search for the best mapping from pentamers to letters in the shape alphabet. An
initial mapping is assigned randomly while preserving the reverse complementation property.
Then, at each iteration, a neighboring mapping is randomly picked, and the AUROCs of the
current and previous mappings are compared to decide whether to accept the
new mapping.
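The search loop can be sketched as follows. This is a minimal sketch, not the thesis code: `score` stands for any function that returns the quality of a mapping (in the thesis, the average cross-validated AUROC), and the pairing of letter c with letter m - 1 - c is an assumed convention for keeping the alphabet closed under reverse complementation. The guard for a temperature that has underflowed to zero is also an addition.

```python
import math
import random
from itertools import product

PENTAMERS = ["".join(p) for p in product("ACGT", repeat=5)]
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def random_mapping(m, rng):
    """Random pentamer -> letter map; letter c pairs with m - 1 - c (assumed)."""
    mapping = {}
    for pentamer in PENTAMERS:
        if pentamer not in mapping:
            letter = rng.randrange(m)
            mapping[pentamer] = letter
            mapping[reverse_complement(pentamer)] = m - 1 - letter
    return mapping


def anneal(score, m=8, iterations=1000, t_init=1.0, alpha=0.9, seed=0):
    """Simulated annealing over mappings; returns the best mapping and its score."""
    rng = random.Random(seed)
    mapping = random_mapping(m, rng)
    previous = score(mapping)
    best, best_score = dict(mapping), previous
    temperature = t_init
    for _ in range(iterations):
        # Neighbor: move one pentamer and its reverse complement to new letters.
        pentamer = rng.choice(PENTAMERS)
        partner = reverse_complement(pentamer)
        old = mapping[pentamer]
        new = rng.choice([c for c in range(m) if c != old])
        mapping[pentamer], mapping[partner] = new, m - 1 - new
        current = score(mapping)
        delta = current - previous
        if delta > 0:
            accept = True  # always accept an improvement
        elif temperature > 0.0:
            accept = rng.random() < math.exp(delta / temperature)
        else:
            accept = False  # temperature underflowed; behave greedily
        if accept:
            previous = current
            if current > best_score:
                best, best_score = dict(mapping), current
        else:
            mapping[pentamer], mapping[partner] = old, m - 1 - old  # undo move
        temperature *= alpha  # geometric cooling
    return best, best_score
```

With alpha = 0.9 applied at every iteration the temperature shrinks quickly, so after the first few hundred moves the search accepts almost only improvements. Tracking the best mapping seen so far is a design choice of this sketch; the text describes keeping the currently accepted mapping.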
3.2.5 PANTHER, Regroup of Transcription Factor Families
The PANTHER (Protein Analysis Through Evolutionary Relationships)
Classification System was designed to classify proteins (and their genes) in order to
facilitate high-throughput analysis (Thomas, Campbell et al. 2003, Mi, Muruganujan et al.
2013). Protein families in PANTHER were clustered by their evolutionary relationships
based on sequence alignments. We regrouped the transcription factor families in the
HT-SELEX data using the PANTHER classification system to search for family-specific shape
alphabets. The 215 TFs in the HT-SELEX data were grouped into 59 different PANTHER
protein families.
3.3 Results
3.3.1 DNA Shape Alphabet Better Discriminates Bound Sequences from Unbound
Sequences Than the Typical 4-letter A, C, G, T-Alphabet
We first applied the simulated annealing algorithm to the gcPBM data, which
include sequence probes for three TF dimers: Mad2-Max (Mad), Max-Max (Max), and
c-Myc-Max (Myc). All three TFs belong to the basic helix-loop-helix (bHLH) TF family and
share the same core motif, 5’-CACGTG-3’. Each sequence probe is composed of a 15-bp
flanking sequence, the core motif, and another 15-bp flanking sequence, for a total length
of 36 bp. Each probe is associated with an intensity value that represents its binding
affinity. The positive data set (bound sequences) and negative data set (unbound sequences)
were divided based on the intensity value. We first set the number of letters in the shape
alphabet to 8.
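The data preparation described here can be sketched as below, assuming probes are given as (sequence, intensity) pairs, the top half by intensity is treated as bound, and a random 90%/10% split separates training from testing sequences. Function names, the seed, and the example probe are hypothetical.

```python
import random


def core_of(probe, flank=15, core_len=6):
    """Core motif of a probe laid out as flank + core + flank."""
    return probe[flank:flank + core_len]


def split_by_intensity(probes):
    """Top half by intensity -> bound sequences, bottom half -> unbound sequences."""
    ranked = sorted(probes, key=lambda item: item[1], reverse=True)
    half = len(ranked) // 2
    bound = [seq for seq, _ in ranked[:half]]
    unbound = [seq for seq, _ in ranked[half:]]
    return bound, unbound


def train_test_split(sequences, test_fraction=0.1, seed=0):
    """Random split into training and testing sequences."""
    rng = random.Random(seed)
    shuffled = list(sequences)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical 36-bp probe: 15-bp flank + CACGTG core + 15-bp flank.
example_probe = "A" * 15 + "CACGTG" + "T" * 15
```

Repeating `train_test_split` with different seeds, or partitioning into ten folds, would give the cross-validated estimate mentioned in the figure captions.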
At each iteration of the algorithm, the average AUROC of the three TF data sets was
calculated as $\mathrm{AUROC}_{\mathrm{current}}$. After 60,000 iterations, with the learned shape
alphabet the AUROC reached 0.975 for the Mad dataset, 0.973 for the Max dataset, and 0.968
for the Myc dataset, whereas the typical A, C, G, T-alphabet gave AUROCs of 0.869, 0.862,
and 0.868, respectively. With the learned shape alphabet, the average AUROC was 12.2%
higher than with the typical A, C, G, T-alphabet (Figure 3.3).
Figure 3.3 Comparison of AUROC between A, C, G, T-alphabet and the learned shape
alphabet for gcPBM data. When using the learned shape alphabet, the AUROC based on
a 10-fold cross validation is 12.2% higher than that using the typical A, C, G, T-alphabet.
This result supported our hypothesis that, by introducing the shape alphabet, the
ability to discriminate bound from unbound sequences could be improved.
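As a quick check, the reported gain can be recomputed from the six AUROC values given for Mad, Max, and Myc. The text does not state the formula; the reading below, a relative difference between the two family averages, is an assumption that reproduces the 12.2% figure.

```latex
\bar{A}_{\text{shape}} = \frac{0.975 + 0.973 + 0.968}{3} = 0.972, \qquad
\bar{A}_{\text{ACGT}} = \frac{0.869 + 0.862 + 0.868}{3} \approx 0.866,
\qquad
\frac{\bar{A}_{\text{shape}} - \bar{A}_{\text{ACGT}}}{\bar{A}_{\text{ACGT}}}
  \approx \frac{0.972 - 0.866}{0.866} \approx 0.122 .
```

The gain is therefore relative (about 12.2% of the baseline AUROC), corresponding to an absolute AUROC difference of about 0.106.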
3.3.2 Finding the Universal Shape Alphabet That Can Be Utilized to Predict TFBSs
The above experiment was performed on three TFs that belong to the same TF
family – bHLH. To apply the shape alphabet to future TFBS prediction and de novo TF
motif discovery, it would be ideal to have one universal shape alphabet that predicts
TFBSs better than the typical A, C, G, T-alphabet for all or most TFs.
Thus, the next question was whether we could find such a universal shape alphabet. If
so, the universal shape alphabet could be used directly for motif discovery or
motif scanning.
[Figure 3.3 bar chart: AUROC (axis range 0 to 1) for Mad, Max, and Myc; series: A, C, G, T-Alphabet and Shape Alphabet.]
To explore the answer to this question, we need a large data set that contains as
many TFs as possible. Here, we used HT-SELEX data published in Yang et al. (2017)
covering 215 TFs from 27 different families after quality filtering (Yang, Orenstein et al.
2017).
A universal shape alphabet should also be usable to predict the binding of a new TF
that has not been seen in the training data. We therefore used a leave-one-out strategy:
we trained the shape alphabet on 214 TFs and tested the prediction ability of the
learned shape alphabet on the one TF that was not in the training data. In this way, we
learned 215 different shape alphabets in total.
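The leave-one-out design can be sketched as a generator of (held-out TF, training TFs) pairs. The function name is hypothetical, and the three TF names in the example are taken from the bHLH row of Table 3.1 purely for illustration.

```python
def leave_one_out(tfs):
    """Yield (held_out_tf, training_tfs) for every TF in the list."""
    for index, held_out in enumerate(tfs):
        yield held_out, tfs[:index] + tfs[index + 1:]


# Illustrative subset; the full experiment would iterate over all 215 TFs.
example_splits = list(leave_one_out(["MAX", "TCF3", "USF1"]))
```

For each pair, an alphabet would be learned on the training TFs and the test AUROC computed on the held-out TF.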
The algorithm was run for 60,000 iterations to reach convergence. Among the 215
experiments, 88 showed higher test AUROCs with the learned shape alphabet than
with the A, C, G, T-alphabet (Figure 3.4). In other words, more than half of the learned
alphabets (127 of 215) did not improve the prediction of binding for a new TF. This result
suggested that a universal shape alphabet might not exist.
Figure 3.4 Scatter plot for leave-one-out experiment on all 215 TFs. The scatter plot
shows the test shape-alphabet AUROCs for all 215 leave-one-out experiments compared
to the A, C, G, T-alphabet AUROCs. Each dot stands for one experiment (TF) and the
color and shape stand for the TF families the tested TF belongs to. Among the 215 dots,
88 of them showed higher AUROC with learned shape alphabet than that with sequence
alphabet.
3.3.3 It Is More Feasible to Find Transcription Factor Family-Specific Shape
Alphabets
Because we could not find a universal shape alphabet, it is natural to ask
whether we could instead search for transcription factor family-specific shape alphabets.
As mentioned before, TF-DNA recognition mechanisms differ considerably among
TF families: some rely more on shape readout and others more on sequence readout.
[Figure 3.4 scatter plot: panel title "Leave-one-out on all 215 TFs"; axes: AUROC (A,C,G,T-alphabet) and AUROC (shape alphabet), each from 0.4 to 1.0; legend: the 27 TF families listed in Table 3.1.]
To show that shape alphabets learned from other TF families contribute little
to TF binding prediction, I first conducted leave-one-family-out experiments. For
each TF family, a shape alphabet learned from all TFs belonging to the other TF families
was used to test the prediction of binding. In those experiments, the shape
alphabet did not show better prediction power (Figure 3.5). This result supported the
idea that it is more feasible to learn family-specific shape alphabets.
Figure 3.5 Scatter plot for leave-one-family-out experiments. The scatter plot shows the
test shape-alphabet AUROCs for the leave-one-family-out experiments compared to the A,
C, G, T-alphabet AUROCs. Each dot stands for one experiment (TF) and the color and
shape stand for the TF families the tested TF belongs to.
[Figure 3.5 scatter plot: panel title "Leave-one-family-out"; axes: AUROC (A,C,G,T-alphabet) and AUROC (shape alphabet), each from 0.4 to 1.0; legend: the 27 TF families listed in Table 3.1.]
I next performed intra-family leave-one-out experiments to test whether TF
family-specific shape alphabets are a viable approach. For these experiments, only TF
families that contain 3 or more TF members were considered,
leaving 192 TFs in 11 TF families in the data set. For each family, one TF was left out
each time for testing and the remaining TFs in that family were used as training data.
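The two family-aware designs, leave-one-family-out and intra-family leave-one-out, can be sketched with a dictionary from family name to member TFs. The function names are hypothetical, and the four families in the example are a small subset of Table 3.1 chosen for illustration.

```python
def leave_one_family_out(families):
    """Yield (held_out_family, training_tfs, test_tfs) for each family."""
    for held_out, members in families.items():
        training = [tf for family, tfs in families.items()
                    if family != held_out for tf in tfs]
        yield held_out, training, list(members)


def intra_family_leave_one_out(families, min_size=3):
    """Yield (family, held_out_tf, training_tfs) within sufficiently large families."""
    for family, members in families.items():
        if len(members) < min_size:
            continue  # too few members to train on the remainder
        for index, held_out in enumerate(members):
            yield family, held_out, members[:index] + members[index + 1:]


# Illustrative subset of Table 3.1.
example_families = {
    "CUT": ["CUX1", "ONECUT1", "ONECUT3"],
    "GCM": ["GCM1", "GCM2"],
    "GATA": ["GATA3"],
    "TBX": ["MGA", "TBX15", "TBX19", "TBX21", "TBX2"],
}
family_splits = list(leave_one_family_out(example_families))
intra_splits = list(intra_family_leave_one_out(example_families, min_size=3))
```

In this subset only CUT and TBX pass the size filter, giving 3 + 5 = 8 intra-family experiments.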
The result showed that for certain TF families, such as Homeodomain, Forkhead,
bHLH, CUT, and ETS, the shape alphabet showed better prediction power than A, C, G,
T-alphabet (Figure 3.6). Those families also showed more performance gain when adding
shape features to regression models in predicting binding affinity in another work (Yang,
Orenstein et al. 2017).
TFs that belong to the same family are more likely to share the same mode of
TF-DNA recognition, and the above results support this suggestion. Thus, it
is more feasible to find a shape alphabet for each individual TF family, especially for those
families that rely more on shape readout in TF-DNA recognition.
Figure 3.6 Scatter plot for intra-family leave-one-out experiments in 11 TF families. The
scatter plot shows the test shape-alphabet AUROCs for the intra-family leave-one-out
experiments compared to the A, C, G, T-alphabet AUROCs. Each dot stands for one
experiment and the color and shape stand for the TF families the tested TF belongs to.
For TF families such as Homeodomain, Forkhead, bHLH, CUT, and ETS, the learned
shape alphabets showed better prediction power than A, C, G, T-alphabet.
3.3.4 Transcription Factor Family-Specific Shape Alphabets
Since transcription factor family-specific shape alphabets were more feasible to
learn for future motif prediction, I then ran the simulated annealing algorithm on each
TF family in the HT-SELEX data, as I did for the gcPBM data. Here I only considered TF
families that contained two or more TFs, leaving 203 TFs belonging to 15 different
families.
[Figure 3.6 scatter plot: panel title "Intra-family leave-one-out"; axes: AUROC (A,C,G,T-alphabet) and AUROC (shape alphabet), each from 0.4 to 1.0; legend: bHLH, bZIP, C2H2, CUT, ETS, Forkhead, Homeodomain, MEIS, Nuclear receptor, POU, TBX.]
The result showed that for 179 out of 203 TFs, the shape alphabet discriminated
bound from unbound sequences better than the A, C, G, T-alphabet
did (Figure 3.7). The average AUROC for the shape alphabet was 0.856, higher than that for
the A, C, G, T-alphabet, which was 0.798. The average AUROC increase after adopting the shape
alphabet for all 203 TFs was 7.58% compared to prediction with the A, C, G, T-alphabet. For
some TF families, such as bHLH, bZIP, CUT, ETS, Forkhead, GCM, MEIS, MYB, NFI,
PAX, and TBX, the shape alphabet showed higher prediction power for all TFs belonging to
them. This result was consistent with the intra-family leave-one-out experiments
described in the previous section.
Figure 3.7 Comparison of AUROCs between A, C, G, T-alphabet and the family-specific
shape alphabet for HT-SELEX data. The scatter plot shows the AUROCs of the test data
with the learned family-specific shape alphabet. Each dot stands for one TF and the color
and shape stand for the TF families the TF belongs to.
3.3.5 Regrouping the TFs into Evolutionarily Related Families Improved the
Prediction Power of Family-Specific Shape Alphabet
A better classification of TF families could possibly help with learning
family-specific shape alphabets. I next used the PANTHER Classification System to refine the
grouping of TF families. The PANTHER database regrouped the 215 TFs in the HT-SELEX
data into 59 different families based on their evolutionary relationships. Taking the
previous Homeodomain family as an example, its TFs were regrouped into 15
PANTHER families.
[Figure 3.7 scatter plot: panel title "Family-specific alphabet"; axes: AUROC (A,C,G,T-alphabet) and AUROC (shape alphabet), each from 0.4 to 1.0; legend: bHLH, bZIP, C2H2, CUT, ETS, Forkhead, GCM, Homeodomain, MEIS, MYB, NFI, Nuclear receptor, PAX, POU, TBX.]
Similar intra-family leave-one-out experiments were conducted on those
Homeodomain TFs with their new families. Among the 81 TFs in these experiments, 55
showed higher AUROCs with the shape alphabet learned from the PANTHER families than
with the one learned from the previous Homeodomain family.
Figure 3.8 Comparison of AUROCs between A, C, G, T-alphabet and the family-specific
shape alphabet for HT-SELEX data based on PANTHER regrouped TF families. The
scatter plot shows the AUROCs of the test data with the learned family-specific shape
alphabet. Each dot stands for one TF and the color and shape stand for the TF families
the TF belongs to.
[Figure 3.8 scatter plot: panel title "PANTHER family-specific alphabet"; axes: AUROC (A,C,G,T-alphabet) and AUROC (shape alphabet), each from 0.4 to 1.0; legend: bHLH, bZIP, C2H2, ETS, Forkhead, GCM, Homeodomain, MEIS, MYB, NFI, Nuclear receptor, PAX, POU, TBX.]
With the new classification of TF families, I reran the simulated annealing to
learn family-specific shape alphabets based on those family classes. I only considered
families that contain two or more TFs in this experiment, leaving 187 TFs belonging to
32 families. The PANTHER family-specific shape alphabets showed higher prediction power
than the A, C, G, T-alphabet for 181 of the 187 TFs (Figure 3.8). The result indicated that
a refined TF family classification based on evolutionary relationships among TFs could
indeed improve the prediction power of family-specific shape alphabets.
I further compared the AUROCs obtained with the PANTHER family-specific shape
alphabets with those obtained with the original family classes. Of the 187 TFs, 177 showed
higher AUROCs with the PANTHER family-specific shape alphabets than
with the original family-specific shape alphabets (Figure 3.9).
Figure 3.9 Comparison of AUROCs between the original family-specific shape alphabets
and PANTHER family-specific shape alphabets for HT-SELEX data. Each dot stands for
one TF and the color and shape stand for the TF families the TF belongs to.
[Figure 3.9 scatter plot: panel title "PANTHER family vs. original family"; axes: AUROC (shape alphabet) and AUROC (shape alphabet, PANTHER), each from 0.4 to 1.0; legend: bHLH, bZIP, C2H2, ETS, Forkhead, GCM, Homeodomain, MEIS, MYB, NFI, Nuclear receptor, PAX, POU, TBX.]
3.3.6 Family-Specific Shape Alphabet Versus Dinucleotide Alphabet
As traditional PWM models assume no interdependencies between adjacent
nucleotides when interacting with TFs, some researchers extended the model to include
parameters that account for nucleotide interdependency, such as dinucleotide preferences,
to address this limitation. Recent arguments have suggested that dinucleotide features could
almost perfectly parameterize the local DNA shape features.
Figure 3.10 Comparison of AUROCs between the typical mononucleotide alphabet and
dinucleotide alphabet for HT-SELEX data. Each dot stands for one TF and the color and
shape stand for the TF families the TF belongs to.
[Figure 3.10 plot. Title: A,C,G,T-alphabet vs. dinucleotide alphabet. Axes: AUROC
(A,C,G,T-alphabet) and AUROC (dinucleotide alphabet), each from 0.4 to 1.0. Legend
(TF families): bHLH, bZIP, C2H2, CUT, ETS, Forkhead, GCM, Homeodomain, MEIS,
MYB, NFI, Nuclear receptor, PAX, POU, TBX.]
To compare the shape alphabet model with the dinucleotide model within the same
framework, I also built a straightforward mapping from the 16 dinucleotides to 16
letters, which I call the dinucleotide alphabet. Analogous to a PWM, a dinucleotide
position weight matrix gives the probability of observing each pair of nucleotides at each
pair of adjacent positions in a binding site. Here I compared the prediction performance
of the typical A, C, G, T-alphabet (mononucleotide alphabet), the dinucleotide alphabet,
and the learned family-specific shape alphabet. As shown in Figure 3.10, prediction with
the dinucleotide alphabet was not significantly better than that with the mononucleotide
alphabet: the average AUROC was 0.798 for the mononucleotide alphabet and 0.807 for
the dinucleotide alphabet. When the family-specific shape alphabet was compared with
the dinucleotide alphabet, the family-specific shape alphabet still performed better
(Figure 3.11).
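The dinucleotide alphabet described above can be sketched as follows. This is a minimal illustration under stated assumptions: the letter labels 'a' to 'p', the pseudocount, and the function names are hypothetical choices, not the implementation used in this study.

```python
from itertools import product
from typing import Dict, List, Sequence

NUCLEOTIDES = "ACGT"

# Assumption: the 16 dinucleotides, in lexicographic order, are labeled with
# the arbitrary letters 'a'..'p'. Any one-to-one labeling would do.
DINUC_TO_LETTER: Dict[str, str] = {
    "".join(pair): chr(ord("a") + i)
    for i, pair in enumerate(product(NUCLEOTIDES, repeat=2))
}


def encode_dinucleotides(seq: str) -> str:
    """Re-encode a DNA sequence of length L as L - 1 dinucleotide letters.
    Raises KeyError for characters other than A, C, G, T."""
    seq = seq.upper()
    return "".join(DINUC_TO_LETTER[seq[i:i + 2]] for i in range(len(seq) - 1))


def position_frequency_matrix(encoded_sites: Sequence[str],
                              alphabet: Sequence[str],
                              pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Column-wise letter probabilities for equal-length encoded sites."""
    length = len(encoded_sites[0])
    if any(len(site) != length for site in encoded_sites):
        raise ValueError("all encoded sites must have the same length")
    matrix = []
    for pos in range(length):
        counts = {letter: pseudocount for letter in alphabet}
        for site in encoded_sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({letter: c / total for letter, c in counts.items()})
    return matrix


# Toy binding sites, for illustration only.
sites = [encode_dinucleotides(s) for s in ("ACGT", "ACGA")]
pfm = position_frequency_matrix(sites, sorted(DINUC_TO_LETTER.values()))
print(sites, round(pfm[0]["b"], 3))
```

Because each dinucleotide becomes a single letter, the resulting matrix has the same form as an ordinary PWM, only over a 16-letter alphabet, which is what allows the comparison within one framework.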
Figure 3.11 Comparison of AUROCs between dinucleotide alphabet and the family-
specific shape alphabet for HT-SELEX data. Each dot stands for one TF and the color
and shape stand for the TF families the TF belongs to.
The above results indicate that although the dinucleotide PWM model extends the
traditional mononucleotide PWM model by considering interdependency, its TF binding
prediction power did not match that of the family-specific shape alphabet. Note that the
dinucleotide alphabet contains 16 letters, whereas the shape alphabet here has only 8
letters.
3.3.7 Trade-off between Model Complexity and Performance
[Figure 3.11 plot. Title: Shape alphabet vs. dinucleotide alphabet. Axes: AUROC
(dinucleotide alphabet) and AUROC (shape alphabet), each from 0.4 to 1.0. Legend (TF
families): bHLH, bZIP, C2H2, CUT, ETS, Forkhead, GCM, Homeodomain, MEIS, MYB,
NFI, Nuclear receptor, PAX, POU, TBX.]
In the previous experiments, I mapped 5-mer sequences (pentamers) to 8-letter shape
alphabets. However, the k-mer length and the number of letters in the new alphabet are
hyperparameters of the model. Here I tested how these two parameters influence model
performance using gcPBM data.
Figure 3.12 Performance comparison between models of different complexity. The 3D
histogram shows the average AUROC of the 3 TFs in the gcPBM data using shape
alphabets with n = 4, 8, 16, 32 letters and k-mer lengths k = 4, 5, 6, 7. The prediction
power of the shape alphabet increased as the number of letters and the length of the
k-mer increased.
As shown in Figure 3.12, the prediction power of the shape alphabet increased as the
number of letters and the length of the k-mer went up, indicating that the more
complex the model was, the better it performed. One can always build a more complex
model to achieve better performance, whereas a simpler model saves computational time
and resources. As the results here show, even the simpler models performed rather well.
Because local DNA shape features can be predicted from pentamer sequences, choosing
a k-mer length of 5 could help with further analysis of the shape alphabet learned in
this study.
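The two hyperparameters can be made concrete with a sliding-window re-encoding. The sketch below is only an illustration: `toy_kmer_mapping` assigns k-mers to letters round-robin as a placeholder and is not the learned shape alphabet, and all names are hypothetical.

```python
from itertools import product
from string import ascii_uppercase, digits
from typing import Dict

SYMBOLS = ascii_uppercase + digits  # 36 printable symbols, enough for n <= 32


def toy_kmer_mapping(k: int, n_letters: int) -> Dict[str, str]:
    """Placeholder mapping from all 4**k k-mers to n_letters letters.
    Letters are assigned round-robin; a learned alphabet would instead
    group k-mers by how well the grouping predicts binding."""
    if not 1 <= n_letters <= len(SYMBOLS):
        raise ValueError("n_letters must be between 1 and 36")
    letters = SYMBOLS[:n_letters]
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: letters[i % n_letters] for i, kmer in enumerate(kmers)}


def encode(seq: str, mapping: Dict[str, str]) -> str:
    """Slide a window of length k over seq and emit one letter per window,
    so a sequence of length L yields L - k + 1 letters."""
    k = len(next(iter(mapping)))
    seq = seq.upper()
    return "".join(mapping[seq[i:i + k]] for i in range(len(seq) - k + 1))


def lookup_table_sizes(ks=(4, 5, 6, 7)) -> Dict[int, int]:
    """Number of k-mers that must be assigned a letter for each k."""
    return {k: 4 ** k for k in ks}


mapping = toy_kmer_mapping(k=5, n_letters=8)
print(encode("ACGTACGTAC", mapping))
print(lookup_table_sizes())
```

The lookup table grows as 4^k (256, 1024, 4096, and 16384 entries for k = 4 to 7), which is one reason a moderate k such as 5 is attractive even though larger k performed slightly better.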
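[Figure 3.12 3D histogram. Horizontal axes: k-mer length (k = 4, 5, 6, 7) and number of
letters in the alphabet (n = 4, 8, 16, 32). Vertical axis: AUROC, from 0.8 to 1.0. Bar
labels as extracted (their assignment to individual (k, n) cells is not recoverable from
the transcript): 0.948, 0.959, 0.963, 0.964, 0.96, 0.972, 0.976, 0.977, 0.972, 0.981,
0.986, 0.986, 0.988, 0.988, 0.992, 0.993.]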
3.4 Discussion
PWM-based models are still widely used to represent the DNA-binding
preferences of transcription factors. These models assume that independent nucleotide
positions in a TF motif contribute additively to the overall binding affinity. Although
such models can be extended with additional parameters, such as dinucleotide and
trinucleotide features, to account for the interdependencies between adjacent nucleotide
positions, the number of parameters increases exponentially with the length of the
features. In this study we have presented a new dimension of PWM-based models by
introducing a whole new alphabet, which reduced the number of k-mer features to a
manageable size and at the same time enhanced the performance of PWM models in
discriminating bound sequences from unbound sequences.
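The growth in model size can be illustrated with a simple count. The sketch below assumes one weight per k-mer per window position and counts every matrix entry rather than free parameters; this counting convention and the function names are illustrative assumptions, not definitions taken from this study.

```python
def additive_kmer_parameters(site_length: int, k: int) -> int:
    """Matrix entries for a model with one weight per k-mer at each of the
    site_length - k + 1 window positions (k = 1 is an ordinary PWM)."""
    if not 1 <= k <= site_length:
        raise ValueError("k must be between 1 and site_length")
    return (4 ** k) * (site_length - k + 1)


def recoded_alphabet_parameters(site_length: int, k: int, n_letters: int) -> int:
    """Matrix entries for a PWM over an n-letter alphabet obtained by mapping
    each k-mer window to a single letter."""
    if not 1 <= k <= site_length:
        raise ValueError("k must be between 1 and site_length")
    return n_letters * (site_length - k + 1)


# Hypothetical 10-bp binding site.
L = 10
for k in (1, 2, 3, 5):
    print(f"{k}-mer features: {additive_kmer_parameters(L, k)} entries")
print(f"8-letter alphabet over pentamers: {recoded_alphabet_parameters(L, 5, 8)} entries")
```

For a 10-bp site, explicit pentamer features would need 6144 entries, whereas an 8-letter alphabet over the same pentamer windows needs 48, close to the 40 of a mononucleotide PWM.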
The interdependencies between neighboring nucleotide positions reflect the DNA
shape preference in TF-DNA recognition. Transcription factors achieve DNA-binding
specificity not only through contacts with functional groups of bases (base readout) but
also through readout of local structural features of the double helix (shape readout). A
universal shape alphabet that could discriminate bound from unbound sequences for all
TFs in all TF families would be ideal. In that case, the binding of a new TF that has
never been seen in the training data could still be modeled using this shape alphabet, and
new TF motifs could be found by performing de novo motif discovery with it. However,
the recognition mechanism can be quite different for TFs in different families. For
example, they use DNA shape readout to different extents (Yang, Orenstein et al. 2017).
Therefore, it is more feasible to search for TF family-specific DNA shape alphabets. A
similar situation arises in analogy-based protein structure prediction, where it remains a
challenge to predict a structure that has never been seen before.
The embedding of DNA shape features within DNA sequences has been
discussed increasingly in recent years. DNAshape, a high-throughput predictor of
sequence-dependent local DNA structure, has now been extended to predict 13 structural
features. Studies have shown the advantage of adding DNAshape-predicted shape
features to machine-learning models when quantitatively predicting binding affinity.
Recent work suggested that DNA shape features can be mostly captured by introducing
just dinucleotide features. In my study, the dinucleotide-based PWM performed slightly
better on average than the traditional mononucleotide PWM, although the difference
was not significant. However, the family-specific shape alphabet, which has fewer
letters, still performed better than the dinucleotide alphabet.
There always exists a trade-off between model complexity and interpretability in
modeling. Although a more complex model tends to offer higher performance, it requires
more training data and is more expensive in terms of computational resources. It also
tends to be less interpretable. Simpler models may sacrifice performance to some extent,
but they also tend to be less time-consuming in terms of data collection and computational
workload and offer more interpretability. In our case, the prediction performance
increased as the number of letters and the length of k-mers went up. We chose to use
pentamers because they provided reasonably good performance and because the
DNAshape tool is also pentamer based, which facilitates potential downstream shape
analysis of the learned shape alphabet.
Finally, many existing de novo motif discovery and motif scanning programs
were developed based on the PWM method and have been used widely in TF binding
studies. Our family-specific shape alphabet can be readily incorporated into these PWM-
based analysis tools, such as the MEME Suite, which has recently been re-engineered to
enable the usage of custom alphabets.
Chapter 4 Concluding Remarks
Only about 2% of the human genome consists of protein-coding sequences, leaving the
vast majority of genomic regions waiting to be annotated (Lander, Linton et al. 2001,
Venter, Adams et al. 2001). In recent years, increasing attention has been paid to
non-coding genomic regions, with a growing catalogue of genomic elements (Consortium,
Birney et al. 2007, Consortium 2012). The pervasive existence of evolutionarily
conserved non-coding sequences revealed by comparative genomics approaches indicates
the functional importance of this portion of the genome (Miller, Makova et al. 2004).
These conserved non-coding sequences have been shown to be maintained mainly through
purifying selection (Asthana, Noble et al. 2007, Casillas, Barbadilla et al. 2007, Ward and
Kellis 2012).
In this thesis, our analysis of SNPs located at transcription factor binding sites
supported our hypothesis that local DNA shape in these regions is under purifying
selection. In a way, we expanded the definition of conservation of non-coding genomic
regions from the primary sequence level to a three-dimensional structural level. At some
loci, the primary DNA sequence may be polymorphic while the local DNA shape is
maintained. When shape-readout-dominant TFs “read” the DNA at such loci, the
structural features can still be recognized, so transcription is not expected to be affected
and selective constraint is not expected to act on these loci. On the other hand, mutations
that disrupt local DNA shape at TFBSs and thereby impede TF binding tend to be
eliminated by purifying selection.
The conserved non-coding regions are highly associated with TFBSs and other
cis-regulatory elements. I believe that studying the conservation level of local DNA
shape, as with the conservation level of primary DNA sequence, could help identify the
regulatory modules in non-coding genomic regions.
Since the first crystal structures of protein-DNA complexes were published in the
1980s, more than 4000 entries have accumulated in the Protein Data Bank (PDB) (Berman,
Westbrook et al. 2000). Structural biologists solve all-atom structures using
technologies such as X-ray crystallography and nuclear magnetic resonance (NMR)
(Rhodes 2010, Shi 2014). These structures have helped to reveal many principles of
TF-DNA binding specificity. However, solving a molecular structure with any
experimental technology is time-consuming.
In parallel to structural biology approaches, genomic approaches such as protein
binding microarrays, HT-SELEX, and ChIP-seq have generated large amounts of TF
binding data both in vitro and in vivo. Analysis of such big data requires high-throughput
computational modeling. The past two decades witnessed the fast development of
modeling of TF-DNA binding specificity, from in vitro prediction to in vivo prediction,
from PWMs to deep learning, from sequence-based models to those that involve local
DNA shape, DNA methylation (Rao, Chiu et al. 2018) and histone modification (Xin and
Rohs 2018), and even from classical machine learning to quantum annealing (Li, Di
Felice et al. 2018).
The new alphabet we proposed in this thesis opens a door to adding new
features to classical and efficient models. Other factors that affect TF-DNA
recognition could potentially be incorporated into the model in a similar way.
For the computational biology field as a whole, the rapid development of
high-throughput experimental technologies is producing an explosion of “big”
biological data. Highly complex models can be built and tuned with such data to give
us a more precise view of how living organisms function.
Bibliography
Abe, N., I. Dror, L. Yang, M. Slattery, T. Zhou, H. J. Bussemaker, R. Rohs and R. S.
Mann (2015). "Deconvolving the recognition of DNA shape from sequence." Cell 161(2):
307-318.
Abelson, J. F., K. Y. Kwan, B. J. O'Roak, D. Y. Baek, A. A. Stillman, T. M. Morgan, C.
A. Mathews, D. L. Pauls, M. R. Rasin, M. Gunel, N. R. Davis, A. G. Ercan-Sencicek, D.
H. Guez, J. A. Spertus, J. F. Leckman, L. S. t. Dure, R. Kurlan, H. S. Singer, D. L.
Gilbert, A. Farhi, A. Louvi, R. P. Lifton, N. Sestan and M. W. State (2005). "Sequence
variants in SLITRK1 are associated with Tourette's syndrome." Science 310(5746): 317-
320.
Alipanahi, B., A. Delong, M. T. Weirauch and B. J. Frey (2015). "Predicting the
sequence specificities of DNA- and RNA-binding proteins by deep learning." Nat
Biotechnol 33(8): 831-838.
Andolfatto, P. (2005). "Adaptive evolution of non-coding DNA in Drosophila." Nature
437(7062): 1149-1152.
Asthana, S., W. S. Noble, G. Kryukov, C. E. Grant, S. Sunyaev and J. A.
Stamatoyannopoulos (2007). "Widely distributed noncoding purifying selection in the
human genome." Proc Natl Acad Sci U S A 104(30): 12410-12415.
Bailey, T. L., J. Johnson, C. E. Grant and W. S. Noble (2015). "The MEME Suite."
Nucleic Acids Res 43(W1): W39-49.
Barozzi, I., M. Simonatto, S. Bonifacio, L. Yang, R. Rohs, S. Ghisletti and G. Natoli
(2014). "Coregulation of transcription factor binding and nucleosome occupancy through
DNA features of mammalian enhancers." Mol Cell 54(5): 844-857.
Berger, M. F., A. A. Philippakis, A. M. Qureshi, F. S. He, P. W. Estep, 3rd and M. L.
Bulyk (2006). "Compact, universal DNA microarrays to comprehensively determine
transcription-factor binding site specificities." Nat Biotechnol 24(11): 1429-1435.
Berman, B. P., Y. Nibu, B. D. Pfeiffer, P. Tomancak, S. E. Celniker, M. Levine, G. M.
Rubin and M. B. Eisen (2002). "Exploiting transcription factor binding site clustering to
identify cis-regulatory modules involved in pattern formation in the Drosophila genome."
Proc Natl Acad Sci U S A 99(2): 757-762.
Berman, H. M., J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N.
Shindyalov and P. E. Bourne (2000). "The Protein Data Bank." Nucleic Acids Res 28(1):
235-242.
Campo, D., K. Lehmann, C. Fjeldsted, T. Souaiaia, J. Kao and S. V. Nuzhdin (2013).
"Whole-genome sequencing of two North American Drosophila melanogaster
populations reveals genetic differentiation and positive selection." Mol Ecol 22(20):
5084-5097.
Casillas, S., A. Barbadilla and C. M. Bergman (2007). "Purifying selection maintains
highly conserved noncoding sequences in Drosophila." Mol Biol Evol 24(10): 2222-2234.
Chiu, T. P., F. Comoglio, T. Zhou, L. Yang, R. Paro and R. Rohs (2016). "DNAshapeR:
an R/Bioconductor package for DNA shape prediction and feature encoding."
Bioinformatics 32(8): 1211-1213.
Chiu, T. P., S. Rao, R. S. Mann, B. Honig and R. Rohs (2017). "Genome-wide prediction
of minor-groove electrostatic potential enables biophysical modeling of protein-DNA
binding." Nucleic Acids Res.
Consortium, E. P. (2012). "An integrated encyclopedia of DNA elements in the human
genome." Nature 489(7414): 57-74.
Consortium, E. P., E. Birney, J. A. Stamatoyannopoulos, A. Dutta, R. Guigo, T. R.
Gingeras, E. H. Margulies, Z. Weng, M. Snyder, E. T. Dermitzakis, R. E. Thurman, M. S.
Kuehn, C. M. Taylor, S. Neph, C. M. Koch, S. Asthana, A. Malhotra, I. Adzhubei, J. A.
Greenbaum, R. M. Andrews, P. Flicek, P. J. Boyle, H. Cao, N. P. Carter, G. K. Clelland,
S. Davis, N. Day, P. Dhami, S. C. Dillon, M. O. Dorschner, H. Fiegler, P. G. Giresi, J.
Goldy, M. Hawrylycz, A. Haydock, R. Humbert, K. D. James, B. E. Johnson, E. M.
Johnson, T. T. Frum, E. R. Rosenzweig, N. Karnani, K. Lee, G. C. Lefebvre, P. A. Navas,
F. Neri, S. C. Parker, P. J. Sabo, R. Sandstrom, A. Shafer, D. Vetrie, M. Weaver, S.
Wilcox, M. Yu, F. S. Collins, J. Dekker, J. D. Lieb, T. D. Tullius, G. E. Crawford, S.
Sunyaev, W. S. Noble, I. Dunham, F. Denoeud, A. Reymond, P. Kapranov, J. Rozowsky,
D. Zheng, R. Castelo, A. Frankish, J. Harrow, S. Ghosh, A. Sandelin, I. L. Hofacker, R.
Baertsch, D. Keefe, S. Dike, J. Cheng, H. A. Hirsch, E. A. Sekinger, J. Lagarde, J. F.
Abril, A. Shahab, C. Flamm, C. Fried, J. Hackermuller, J. Hertel, M. Lindemeyer, K.
Missal, A. Tanzer, S. Washietl, J. Korbel, O. Emanuelsson, J. S. Pedersen, N. Holroyd, R.
Taylor, D. Swarbreck, N. Matthews, M. C. Dickson, D. J. Thomas, M. T. Weirauch, J.
Gilbert, J. Drenkow, I. Bell, X. Zhao, K. G. Srinivasan, W. K. Sung, H. S. Ooi, K. P.
Chiu, S. Foissac, T. Alioto, M. Brent, L. Pachter, M. L. Tress, A. Valencia, S. W. Choo,
C. Y. Choo, C. Ucla, C. Manzano, C. Wyss, E. Cheung, T. G. Clark, J. B. Brown, M.
Ganesh, S. Patel, H. Tammana, J. Chrast, C. N. Henrichsen, C. Kai, J. Kawai, U.
Nagalakshmi, J. Wu, Z. Lian, J. Lian, P. Newburger, X. Zhang, P. Bickel, J. S. Mattick, P.
Carninci, Y. Hayashizaki, S. Weissman, T. Hubbard, R. M. Myers, J. Rogers, P. F.
Stadler, T. M. Lowe, C. L. Wei, Y. Ruan, K. Struhl, M. Gerstein, S. E. Antonarakis, Y.
Fu, E. D. Green, U. Karaoz, A. Siepel, J. Taylor, L. A. Liefer, K. A. Wetterstrand, P. J.
Good, E. A. Feingold, M. S. Guyer, G. M. Cooper, G. Asimenos, C. N. Dewey, M. Hou,
S. Nikolaev, J. I. Montoya-Burgos, A. Loytynoja, S. Whelan, F. Pardi, T. Massingham, H.
Huang, N. R. Zhang, I. Holmes, J. C. Mullikin, A. Ureta-Vidal, B. Paten, M. Seringhaus,
D. Church, K. Rosenbloom, W. J. Kent, E. A. Stone, N. C. S. Program, C. Baylor College
of Medicine Human Genome Sequencing, C. Washington University Genome
Sequencing, I. Broad, I. Children's Hospital Oakland Research, S. Batzoglou, N.
Goldman, R. C. Hardison, D. Haussler, W. Miller, A. Sidow, N. D. Trinklein, Z. D.
Zhang, L. Barrera, R. Stuart, D. C. King, A. Ameur, S. Enroth, M. C. Bieda, J. Kim, A. A.
Bhinge, N. Jiang, J. Liu, F. Yao, V. B. Vega, C. W. Lee, P. Ng, A. Shahab, A. Yang, Z.
Moqtaderi, Z. Zhu, X. Xu, S. Squazzo, M. J. Oberley, D. Inman, M. A. Singer, T. A.
Richmond, K. J. Munn, A. Rada-Iglesias, O. Wallerman, J. Komorowski, J. C. Fowler, P.
Couttet, A. W. Bruce, O. M. Dovey, P. D. Ellis, C. F. Langford, D. A. Nix, G. Euskirchen,
S. Hartman, A. E. Urban, P. Kraus, S. Van Calcar, N. Heintzman, T. H. Kim, K. Wang, C.
Qu, G. Hon, R. Luna, C. K. Glass, M. G. Rosenfeld, S. F. Aldred, S. J. Cooper, A. Halees,
J. M. Lin, H. P. Shulha, X. Zhang, M. Xu, J. N. Haidar, Y. Yu, Y. Ruan, V. R. Iyer, R. D.
Green, C. Wadelius, P. J. Farnham, B. Ren, R. A. Harte, A. S. Hinrichs, H. Trumbower,
H. Clawson, J. Hillman-Jackson, A. S. Zweig, K. Smith, A. Thakkapallayil, G. Barber, R.
M. Kuhn, D. Karolchik, L. Armengol, C. P. Bird, P. I. de Bakker, A. D. Kern, N. Lopez-
Bigas, J. D. Martin, B. E. Stranger, A. Woodroffe, E. Davydov, A. Dimas, E. Eyras, I. B.
Hallgrimsdottir, J. Huppert, M. C. Zody, G. R. Abecasis, X. Estivill, G. G. Bouffard, X.
Guan, N. F. Hansen, J. R. Idol, V. V. Maduro, B. Maskeri, J. C. McDowell, M. Park, P. J.
Thomas, A. C. Young, R. W. Blakesley, D. M. Muzny, E. Sodergren, D. A. Wheeler, K.
C. Worley, H. Jiang, G. M. Weinstock, R. A. Gibbs, T. Graves, R. Fulton, E. R. Mardis,
R. K. Wilson, M. Clamp, J. Cuff, S. Gnerre, D. B. Jaffe, J. L. Chang, K. Lindblad-Toh, E.
S. Lander, M. Koriabine, M. Nefedov, K. Osoegawa, Y. Yoshinaga, B. Zhu and P. J. de
Jong (2007). "Identification and analysis of functional elements in 1% of the human
genome by the ENCODE pilot project." Nature 447(7146): 799-816.
Darwin, C. X. (1859). On the origin of species, 1859, Routledge.
Dekker, J. (2016). "Mapping the 3D genome: Aiming for consilience." Nat Rev Mol Cell
Biol 17(12): 741-742.
Dekker, J. and L. Mirny (2016). "The 3D Genome as Moderator of Chromosomal
Communication." Cell 164(6): 1110-1121.
Dobzhansky, T. and T. G. Dobzhansky (1937). Genetics and the Origin of Species,
Columbia university press.
Dror, I., T. Golan, C. Levy, R. Rohs and Y. Mandel-Gutfreund (2015). "A widespread
role of the motif environment in transcription factor binding across diverse protein
families." Genome Res 25(9): 1268-1280.
Faustino, N. A. and T. A. Cooper (2003). "Pre-mRNA splicing and human disease."
Genes Dev 17(4): 419-437.
Gallo, S. M., D. T. Gerrard, D. Miner, M. Simich, B. Des Soye, C. M. Bergman and M. S.
Halfon (2011). "REDfly v3.0: toward a comprehensive database of transcriptional
regulatory elements in Drosophila." Nucleic Acids Res 39(Database issue): D118-123.
Huang, Q., T. Whitington, P. Gao, J. F. Lindberg, Y. Yang, J. Sun, M. R. Vaisanen, R.
Szulkin, M. Annala, J. Yan, L. A. Egevad, K. Zhang, R. Lin, A. Jolma, M. Nykter, A.
Manninen, F. Wiklund, M. H. Vaarala, T. Visakorpi, J. Xu, J. Taipale and G. H. Wei
(2014). "A prostate cancer susceptibility allele at 6q22 increases RFX6 expression by
modulating HOXB13 chromatin binding." Nat Genet 46(2): 126-135.
Huxley, J. (1942). Evolution the modern synthesis, George Allen and Unwin.
Jin, S. H., H. W. Zhao, Y. Yi, Y. Nakata, A. Kalota and A. M. Gewirtz (2010). "c-Myb
binds MLL through menin in human leukemia cells and is an important driver of MLL-
associated leukemogenesis." Journal of Clinical Investigation 120(2): 593-606.
Jolma, A., J. Yan, T. Whitington, J. Toivonen, K. R. Nitta, P. Rastas, E. Morgunova, M.
Enge, M. Taipale, G. Wei, K. Palin, J. M. Vaquerizas, R. Vincentelli, N. M. Luscombe, T.
R. Hughes, P. Lemaire, E. Ukkonen, T. Kivioja and J. Taipale (2013). "DNA-binding
specificities of human transcription factors." Cell 152(1-2): 327-339.
Kovina, A. P., N. V. Petrova, E. S. Gushchanskaya, K. V. Dolgushin, E. S. Gerasimov, A.
A. Galitsyna, A. A. Penin, I. M. Flyamer, E. S. Ioudinkova, A. A. Gavrilov, Y. S.
Vassetzky, S. V. Ulianov, O. V. Iarovaia and S. V. Razin (2017). "Evolution of the
Genome 3D Organization: Comparison of Fused and Segregated Globin Gene Clusters."
Mol Biol Evol 34(6): 1492-1504.
Lander, E. S., L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon,
K. Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, J.
Howland, L. Kann, J. Lehoczky, R. LeVine, P. McEwan, K. McKernan, J. Meldrim, J. P.
Mesirov, C. Miranda, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A.
Sheridan, C. Sougnez, Y. Stange-Thomann, N. Stojanovic, A. Subramanian, D. Wyman,
J. Rogers, J. Sulston, R. Ainscough, S. Beck, D. Bentley, J. Burton, C. Clee, N. Carter, A.
Coulson, R. Deadman, P. Deloukas, A. Dunham, I. Dunham, R. Durbin, L. French, D.
Grafham, S. Gregory, T. Hubbard, S. Humphray, A. Hunt, M. Jones, C. Lloyd, A.
McMurray, L. Matthews, S. Mercer, S. Milne, J. C. Mullikin, A. Mungall, R. Plumb, M.
Ross, R. Shownkeen, S. Sims, R. H. Waterston, R. K. Wilson, L. W. Hillier, J. D.
McPherson, M. A. Marra, E. R. Mardis, L. A. Fulton, A. T. Chinwalla, K. H. Pepin, W. R.
Gish, S. L. Chissoe, M. C. Wendl, K. D. Delehaunty, T. L. Miner, A. Delehaunty, J. B.
Kramer, L. L. Cook, R. S. Fulton, D. L. Johnson, P. J. Minx, S. W. Clifton, T. Hawkins,
E. Branscomb, P. Predki, P. Richardson, S. Wenning, T. Slezak, N. Doggett, J. F. Cheng,
A. Olsen, S. Lucas, C. Elkin, E. Uberbacher, M. Frazier, R. A. Gibbs, D. M. Muzny, S. E.
Scherer, J. B. Bouck, E. J. Sodergren, K. C. Worley, C. M. Rives, J. H. Gorrell, M. L.
Metzker, S. L. Naylor, R. S. Kucherlapati, D. L. Nelson, G. M. Weinstock, Y. Sakaki, A.
Fujiyama, M. Hattori, T. Yada, A. Toyoda, T. Itoh, C. Kawagoe, H. Watanabe, Y. Totoki,
T. Taylor, J. Weissenbach, R. Heilig, W. Saurin, F. Artiguenave, P. Brottier, T. Bruls, E.
Pelletier, C. Robert, P. Wincker, D. R. Smith, L. Doucette-Stamm, M. Rubenfield, K.
Weinstock, H. M. Lee, J. Dubois, A. Rosenthal, M. Platzer, G. Nyakatura, S. Taudien, A.
Rump, H. Yang, J. Yu, J. Wang, G. Huang, J. Gu, L. Hood, L. Rowen, A. Madan, S. Qin,
R. W. Davis, N. A. Federspiel, A. P. Abola, M. J. Proctor, R. M. Myers, J. Schmutz, M.
Dickson, J. Grimwood, D. R. Cox, M. V. Olson, R. Kaul, C. Raymond, N. Shimizu, K.
Kawasaki, S. Minoshima, G. A. Evans, M. Athanasiou, R. Schultz, B. A. Roe, F. Chen, H.
Pan, J. Ramser, H. Lehrach, R. Reinhardt, W. R. McCombie, M. de la Bastide, N. Dedhia,
H. Blocker, K. Hornischer, G. Nordsiek, R. Agarwala, L. Aravind, J. A. Bailey, A.
Bateman, S. Batzoglou, E. Birney, P. Bork, D. G. Brown, C. B. Burge, L. Cerutti, H. C.
Chen, D. Church, M. Clamp, R. R. Copley, T. Doerks, S. R. Eddy, E. E. Eichler, T. S.
Furey, J. Galagan, J. G. Gilbert, C. Harmon, Y. Hayashizaki, D. Haussler, H. Hermjakob,
K. Hokamp, W. Jang, L. S. Johnson, T. A. Jones, S. Kasif, A. Kaspryzk, S. Kennedy, W.
J. Kent, P. Kitts, E. V. Koonin, I. Korf, D. Kulp, D. Lancet, T. M. Lowe, A. McLysaght,
T. Mikkelsen, J. V. Moran, N. Mulder, V. J. Pollara, C. P. Ponting, G. Schuler, J. Schultz,
G. Slater, A. F. Smit, E. Stupka, J. Szustakowki, D. Thierry-Mieg, J. Thierry-Mieg, L.
Wagner, J. Wallis, R. Wheeler, A. Williams, Y. I. Wolf, K. H. Wolfe, S. P. Yang, R. F.
Yeh, F. Collins, M. S. Guyer, J. Peterson, A. Felsenfeld, K. A. Wetterstrand, A. Patrinos,
M. J. Morgan, P. de Jong, J. J. Catanese, K. Osoegawa, H. Shizuya, S. Choi, Y. J. Chen, J.
Szustakowki and C. International Human Genome Sequencing (2001). "Initial
sequencing and analysis of the human genome." Nature 409(6822): 860-921.
Lee, D., D. U. Gorkin, M. Baker, B. J. Strober, A. L. Asoni, A. S. McCallion and M. A.
Beer (2015). "A method to predict the impact of regulatory variants from DNA
sequence." Nat Genet 47(8): 955-961.
Li, J., J. M. Sagendorf, T. P. Chiu, M. Pasi, A. Perez and R. Rohs (2017). "Expanding the
repertoire of DNA shape features for genome-scale studies of transcription factor
binding." Nucleic Acids Res 45(22): 12877-12887.
Li, R. Y., R. Di Felice, R. Rohs and D. A. Lidar (2018). "Quantum annealing versus
classical machine learning applied to a simplified computational biology problem." npj
Quantum Inf 4.
Ma, W., L. Yang, R. Rohs and W. S. Noble (2017). "DNA sequence+shape kernel
enables alignment-free modeling of transcription factor binding." Bioinformatics 33(19):
3003-3010.
Mackay, T. F., S. Richards, E. A. Stone, A. Barbadilla, J. F. Ayroles, D. Zhu, S. Casillas,
Y. Han, M. M. Magwire, J. M. Cridland, M. F. Richardson, R. R. Anholt, M. Barron, C.
Bess, K. P. Blankenburg, M. A. Carbone, D. Castellano, L. Chaboub, L. Duncan, Z.
Harris, M. Javaid, J. C. Jayaseelan, S. N. Jhangiani, K. W. Jordan, F. Lara, F. Lawrence,
S. L. Lee, P. Librado, R. S. Linheiro, R. F. Lyman, A. J. Mackey, M. Munidasa, D. M.
Muzny, L. Nazareth, I. Newsham, L. Perales, L. L. Pu, C. Qu, M. Ramia, J. G. Reid, S. M.
Rollmann, J. Rozas, N. Saada, L. Turlapati, K. C. Worley, Y. Q. Wu, A. Yamamoto, Y.
Zhu, C. M. Bergman, K. R. Thornton, D. Mittelman and R. A. Gibbs (2012). "The
Drosophila melanogaster Genetic Reference Panel." Nature 482(7384): 173-178.
Mathelier, A., C. Lefebvre, A. W. Zhang, D. J. Arenillas, J. Ding, W. W. Wasserman and
S. P. Shah (2015). "Cis-regulatory somatic mutations and gene-expression alteration in
B-cell lymphomas." Genome Biol 16: 84.
Mathelier, A., B. Xin, T. P. Chiu, L. Yang, R. Rohs and W. W. Wasserman (2016).
"DNA Shape Features Improve Transcription Factor Binding Site Predictions In Vivo."
Cell Syst 3(3): 278-286 e274.
Maurano, M. T., E. Haugen, R. Sandstrom, J. Vierstra, A. Shafer, R. Kaul and J. A.
Stamatoyannopoulos (2015). "Large-scale identification of sequence variants influencing
human transcription factor occupancy in vivo." Nat Genet 47(12): 1393-1401.
Maurano, M. T., R. Humbert, E. Rynes, R. E. Thurman, E. Haugen, H. Wang, A. P.
Reynolds, R. Sandstrom, H. Qu, J. Brody, A. Shafer, F. Neri, K. Lee, T. Kutyavin, S.
Stehling-Sun, A. K. Johnson, T. K. Canfield, E. Giste, M. Diegel, D. Bates, R. S. Hansen,
S. Neph, P. J. Sabo, S. Heimfeld, A. Raubitschek, S. Ziegler, C. Cotsapas, N.
Sotoodehnia, I. Glass, S. R. Sunyaev, R. Kaul and J. A. Stamatoyannopoulos (2012).
"Systematic localization of common disease-associated variation in regulatory DNA."
Science 337(6099): 1190-1195.
Mendel, G. (1865). "Experiments in plant hybridization (1865)." Verhandlungen des
naturforschenden Vereins Brünn. Available online: www.mendelweb.org/Mendel.html
(accessed on 1 January 2013).
Mi, H., A. Muruganujan and P. D. Thomas (2013). "PANTHER in 2013: modeling the
evolution of gene function, and other gene attributes, in the context of phylogenetic
trees." Nucleic Acids Res 41(Database issue): D377-386.
Miller, W., K. D. Makova, A. Nekrutenko and R. C. Hardison (2004). "Comparative
genomics." Annual Review of Genomics and Human Genetics 5(1): 15-56.
Miyoshi, Y., K. Murase, M. Saito, M. Imamura and K. Oh (2010). "Mechanisms of
estrogen receptor-alpha upregulation in breast cancers." Medical Molecular Morphology
43(4): 193-196.
Mogno, I., J. C. Kwasnieski and B. A. Cohen (2013). "Massively parallel synthetic
promoter assays reveal the in vivo effects of binding site variants." Genome Res 23(11):
1908-1915.
Mordelet, F., J. Horton, A. J. Hartemink, B. E. Engelhardt and R. Gordan (2013).
"Stability selection for regression-based models of transcription factor-DNA binding
specificity." Bioinformatics 29(13): i117-125.
Parker, S. C., L. Hansen, H. O. Abaan, T. D. Tullius and E. H. Margulies (2009). "Local
DNA topography correlates with functional noncoding regions of the human genome."
Science 324(5925): 389-392.
Parkes, M., J. C. Barrett, N. J. Prescott, M. Tremelling, C. A. Anderson, S. A. Fisher, R.
G. Roberts, E. R. Nimmo, F. R. Cummings, D. Soars, H. Drummond, C. W. Lees, S. A.
Khawaja, R. Bagnall, D. A. Burke, C. E. Todhunter, T. Ahmad, C. M. Onnie, W.
McArdle, D. Strachan, G. Bethel, C. Bryan, C. M. Lewis, P. Deloukas, A. Forbes, J.
Sanderson, D. P. Jewell, J. Satsangi, J. C. Mansfield, C. Wellcome Trust Case Control, L.
Cardon and C. G. Mathew (2007). "Sequence variants in the autophagy gene IRGM and
multiple other replicating loci contribute to Crohn's disease susceptibility." Nat Genet
39(7): 830-832.
Rao, S., T. P. Chiu, J. F. Kribelbauer, R. S. Mann, H. J. Bussemaker and R. Rohs (2018).
"Systematic prediction of DNA shape changes due to CpG methylation explains
epigenetic effects on protein-DNA binding." Epigenetics Chromatin 11(1): 6.
Rao, S. S., M. H. Huntley, N. C. Durand, E. K. Stamenova, I. D. Bochkov, J. T. Robinson,
A. L. Sanborn, I. Machol, A. D. Omer, E. S. Lander and E. L. Aiden (2014). "A 3D map
of the human genome at kilobase resolution reveals principles of chromatin looping."
Cell 159(7): 1665-1680.
Rhodes, G. (2010). Crystallography made crystal clear: a guide for users of
macromolecular models, Elsevier.
Rohs, R., X. Jin, S. M. West, R. Joshi, B. Honig and R. S. Mann (2010). "Origins of
specificity in protein-DNA recognition." Annu Rev Biochem 79: 233-269.
Rohs, R., S. M. West, P. Liu and B. Honig (2009). "Nuance in the double-helix and its
role in protein-DNA recognition." Curr Opin Struct Biol 19(2): 171-177.
Rohs, R., S. M. West, A. Sosinsky, P. Liu, R. S. Mann and B. Honig (2009). "The role of
DNA shape in protein-DNA recognition." Nature 461(7268): 1248-1253.
Schneider, T. D. and R. M. Stephens (1990). "Sequence logos: a new way to display
consensus sequences." Nucleic Acids Res 18(20): 6097-6100.
Shi, W., O. Fornes, A. Mathelier and W. W. Wasserman (2016). "Evaluating the impact
of single nucleotide variants on transcription factor binding." Nucleic Acids Res 44(21):
10106-10116.
Shi, Y. (2014). "A glimpse of structural biology through X-ray crystallography." Cell
159(5): 995-1014.
Siepel, A., G. Bejerano, J. S. Pedersen, A. S. Hinrichs, M. Hou, K. Rosenbloom, H.
Clawson, J. Spieth, L. W. Hillier, S. Richards, G. M. Weinstock, R. K. Wilson, R. A.
Gibbs, W. J. Kent, W. Miller and D. Haussler (2005). "Evolutionarily conserved elements
in vertebrate, insect, worm, and yeast genomes." Genome Res 15(8): 1034-1050.
Slattery, M., T. Riley, P. Liu, N. Abe, P. Gomez-Alcala, I. Dror, T. Zhou, R. Rohs, B.
Honig, H. J. Bussemaker and R. S. Mann (2011). "Cofactor binding evokes latent
differences in DNA binding specificity between Hox proteins." Cell 147(6): 1270-1282.
Slattery, M., T. Zhou, L. Yang, A. C. Dantas Machado, R. Gordan and R. Rohs (2014).
"Absence of a simple code: how transcription factors read the genome." Trends Biochem
Sci 39(9): 381-399.
Stormo, G. D. (2000). "DNA binding sites: representation and discovery." Bioinformatics
16(1): 16-23.
Stormo, G. D. (2013). "Modeling the specificity of protein-DNA interactions." Quant
Biol 1(2): 115-130.
Stormo, G. D., T. D. Schneider, L. Gold and A. Ehrenfeucht (1982). "Use of the
'Perceptron' algorithm to distinguish translational initiation sites in E. coli." Nucleic
Acids Res 10(9): 2997-3011.
Su, J., S. A. Teichmann and T. A. Down (2010). "Assessing computational methods of
cis-regulatory module prediction." PLoS Comput Biol 6(12): e1001020.
Tehranchi, A. K., M. Myrthil, T. Martin, B. L. Hie, D. Golan and H. B. Fraser (2016).
"Pooled ChIP-Seq Links Variation in Transcription Factor Binding to Complex Disease
Risk." Cell 165(3): 730-741.
Thomas, P. D., M. J. Campbell, A. Kejariwal, H. Mi, B. Karlak, R. Daverman, K. Diemer,
A. Muruganujan and A. Narechania (2003). "PANTHER: a library of protein families and
subfamilies indexed by function." Genome Res 13(9): 2129-2141.
Thomas, S., X. Y. Li, P. J. Sabo, R. Sandstrom, R. E. Thurman, T. K. Canfield, E. Giste,
W. Fisher, A. Hammonds, S. E. Celniker, M. D. Biggin and J. A. Stamatoyannopoulos
(2011). "Dynamic reprogramming of chromatin accessibility during Drosophila embryo
development." Genome Biol 12(5): R43.
Tjong, H., W. Li, R. Kalhor, C. Dai, S. Hao, K. Gong, Y. Zhou, H. Li, X. J. Zhou, M. A.
Le Gros, C. A. Larabell, L. Chen and F. Alber (2016). "Population-based 3D genome
structure analysis reveals driving forces in spatial genome organization." Proc Natl Acad
Sci U S A 113(12): E1663-1672.
Tyner, C., G. P. Barber, J. Casper, H. Clawson, M. Diekhans, C. Eisenhart, C. M. Fischer,
D. Gibson, J. N. Gonzalez, L. Guruvadoo, M. Haeussler, S. Heitner, A. S. Hinrichs, D.
Karolchik, B. T. Lee, C. M. Lee, P. Nejad, B. J. Raney, K. R. Rosenbloom, M. L. Speir,
C. Villarreal, J. Vivian, A. S. Zweig, D. Haussler, R. M. Kuhn and W. J. Kent (2017).
"The UCSC Genome Browser database: 2017 update." Nucleic Acids Res 45(D1): D626-
D634.
Venter, J. C., M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O.
Smith, M. Yandell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Amanatides, R. M. Ballew,
D. H. Huson, J. R. Wortman, Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski,
G. Subramanian, P. D. Thomas, J. Zhang, G. L. Gabor Miklos, C. Nelson, S. Broder, A.
G. Clark, J. Nadeau, V. A. McKusick, N. Zinder, A. J. Levine, R. J. Roberts, M. Simon,
C. Slayman, M. Hunkapiller, R. Bolanos, A. Delcher, I. Dew, D. Fasulo, M. Flanigan, L.
Florea, A. Halpern, S. Hannenhalli, S. Kravitz, S. Levy, C. Mobarry, K. Reinert, K.
Remington, J. Abu-Threideh, E. Beasley, K. Biddick, V. Bonazzi, R. Brandon, M. Cargill,
I. Chandramouliswaran, R. Charlab, K. Chaturvedi, Z. Deng, V. Di Francesco, P. Dunn,
K. Eilbeck, C. Evangelista, A. E. Gabrielian, W. Gan, W. Ge, F. Gong, Z. Gu, P. Guan, T.
J. Heiman, M. E. Higgins, R. R. Ji, Z. Ke, K. A. Ketchum, Z. Lai, Y. Lei, Z. Li, J. Li, Y.
Liang, X. Lin, F. Lu, G. V. Merkulov, N. Milshina, H. M. Moore, A. K. Naik, V. A.
Narayan, B. Neelam, D. Nusskern, D. B. Rusch, S. Salzberg, W. Shao, B. Shue, J. Sun, Z.
Wang, A. Wang, X. Wang, J. Wang, M. Wei, R. Wides, C. Xiao, C. Yan, A. Yao, J. Ye,
M. Zhan, W. Zhang, H. Zhang, Q. Zhao, L. Zheng, F. Zhong, W. Zhong, S. Zhu, S. Zhao,
D. Gilbert, S. Baumhueter, G. Spier, C. Carter, A. Cravchik, T. Woodage, F. Ali, H. An,
A. Awe, D. Baldwin, H. Baden, M. Barnstead, I. Barrow, K. Beeson, D. Busam, A.
Carver, A. Center, M. L. Cheng, L. Curry, S. Danaher, L. Davenport, R. Desilets, S.
Dietz, K. Dodson, L. Doup, S. Ferriera, N. Garg, A. Gluecksmann, B. Hart, J. Haynes, C.
Haynes, C. Heiner, S. Hladun, D. Hostin, J. Houck, T. Howland, C. Ibegwam, J. Johnson,
F. Kalush, L. Kline, S. Koduru, A. Love, F. Mann, D. May, S. McCawley, T. McIntosh, I.
McMullen, M. Moy, L. Moy, B. Murphy, K. Nelson, C. Pfannkoch, E. Pratts, V. Puri, H.
Qureshi, M. Reardon, R. Rodriguez, Y. H. Rogers, D. Romblad, B. Ruhfel, R. Scott, C.
Sitter, M. Smallwood, E. Stewart, R. Strong, E. Suh, R. Thomas, N. N. Tint, S. Tse, C.
Vech, G. Wang, J. Wetter, S. Williams, M. Williams, S. Windsor, E. Winn-Deen, K.
Wolfe, J. Zaveri, K. Zaveri, J. F. Abril, R. Guigo, M. J. Campbell, K. V. Sjolander, B.
Karlak, A. Kejariwal, H. Mi, B. Lazareva, T. Hatton, A. Narechania, K. Diemer, A.
Muruganujan, N. Guo, S. Sato, V. Bafna, S. Istrail, R. Lippert, R. Schwartz, B. Walenz, S.
Yooseph, D. Allen, A. Basu, J. Baxendale, L. Blick, M. Caminha, J. Carnes-Stine, P.
Caulk, Y. H. Chiang, M. Coyne, C. Dahlke, A. Mays, M. Dombroski, M. Donnelly, D.
Ely, S. Esparham, C. Fosler, H. Gire, S. Glanowski, K. Glasser, A. Glodek, M. Gorokhov,
K. Graham, B. Gropman, M. Harris, J. Heil, S. Henderson, J. Hoover, D. Jennings, C.
Jordan, J. Jordan, J. Kasha, L. Kagan, C. Kraft, A. Levitsky, M. Lewis, X. Liu, J. Lopez,
D. Ma, W. Majoros, J. McDaniel, S. Murphy, M. Newman, T. Nguyen, N. Nguyen, M.
Nodell, S. Pan, J. Peck, M. Peterson, W. Rowe, R. Sanders, J. Scott, M. Simpson, T.
Smith, A. Sprague, T. Stockwell, R. Turner, E. Venter, M. Wang, M. Wen, D. Wu, M.
Wu, A. Xia, A. Zandieh and X. Zhu (2001). "The sequence of the human genome."
Science 291(5507): 1304-1351.
Ward, L. D. and M. Kellis (2012). "Evidence of abundant purifying selection in humans
for recently acquired regulatory functions." Science 337(6102): 1675-1678.
Ward, L. D. and M. Kellis (2012). "Interpreting noncoding genetic variation in complex
traits and human disease." Nat Biotechnol 30(11): 1095-1106.
Watson, J. D. and F. H. Crick (1953). "Molecular structure of nucleic acids; a structure
for deoxyribose nucleic acid." Nature 171(4356): 737-738.
Weirauch, M. T. and T. R. Hughes (2011). "A catalogue of eukaryotic transcription factor
types, their evolutionary origin, and species distribution." Subcell Biochem 52: 25-73.
Welter, D., J. MacArthur, J. Morales, T. Burdett, P. Hall, H. Junkins, A. Klemm, P.
Flicek, T. Manolio, L. Hindorff and H. Parkinson (2014). "The NHGRI GWAS Catalog,
a curated resource of SNP-trait associations." Nucleic Acids Res 42(Database issue):
D1001-1006.
Xin, B. B. and R. Rohs (2018). "Relationship between histone modifications and
transcription factor binding is protein family specific." Genome Res 28(3): 321-333.
Yang, L., Y. Orenstein, A. Jolma, Y. Yin, J. Taipale, R. Shamir and R. Rohs (2017).
"Transcription factor family-specific DNA shape readout revealed by quantitative
specificity models." Mol Syst Biol 13(2): 910.
Zhou, T., N. Shen, L. Yang, N. Abe, J. Horton, R. S. Mann, H. J. Bussemaker, R. Gordan
and R. Rohs (2015). "Quantitative modeling of transcription factor binding specificities
using DNA shape." Proc Natl Acad Sci U S A 112(15): 4654-4659.
Zhou, T., L. Yang, Y. Lu, I. Dror, A. C. Dantas Machado, T. Ghane, R. Di Felice and R.
Rohs (2013). "DNAshape: a method for the high-throughput prediction of DNA
structural features on a genomic scale." Nucleic Acids Res 41(Web Server issue): W56-
62.
Zhu, L. J., R. G. Christensen, M. Kazemian, C. J. Hull, M. S. Enuameh, M. D. Basciotta,
J. A. Brasefield, C. Zhu, Y. Asriyan, D. S. Lapointe, S. Sinha, S. A. Wolfe and M. H.
Brodsky (2011). "FlyFactorSurvey: a database of Drosophila transcription factor binding
specificities determined using the bacterial one-hybrid system." Nucleic Acids Res
39(Database issue): D111-117.
Asset Metadata
Creator: Wang, Xiaofei (author)
Core Title: DNA shape at transcription factor binding sites: from purifying selection to a new alphabet
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Computational Biology and Bioinformatics
Publication Date: 08/06/2020
Defense Date: 06/21/2018
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: alphabet, DNA shape, OAI-PMH Harvest, purifying selection, SNP, TFBS
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Rohs, Remo (committee chair), Chen, Liang (committee member), Thomas, Paul (committee member), Waterman, Michael (committee member)
Creator Email: fay.xiaofeiwang@gmail.com, xiaofei@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-60249
Unique identifier: UC11670522
Identifier: etd-WangXiaofe-6688.pdf (filename), usctheses-c89-60249 (legacy record id)
Legacy Identifier: etd-WangXiaofe-6688.pdf
Dmrecord: 60249
Document Type: Dissertation
Rights: Wang, Xiaofei
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA