Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Improved methods for the quantification of transcription factor binding using SELEX-seq
(USC Thesis Other)
Improved methods for the quantification of transcription factor binding using SELEX-seq
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Improved Methods for the Quantification of
Transcription Factor Binding Using SELEX-seq
by
Brendon Hunter Cooper
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
December 2022
Copyright 2022 Brendon Hunter Cooper
ii
Acknowledgements
Since joining USC, I have felt incredibly fortunate to be a part of Dr. Remo Rohs’s lab. Remo
has been instrumental in helping me grow as a researcher, mentor, and communicator.
Remo persistently encouraged all of us to be involved in decisions that determined the
fate of the program and gave us the freedom to organize and plan massive academic
retreats. Thank you, Remo, for allowing me to explore a wide variety of research
interests and for always promoting a healthy work-life balance. I am looking forward to
our continued relationship well beyond this step in my professional career.
I would also like to thank the long list of extraordinarily brilliant researchers and
friends who have helped make this an incredibly positive experience through their
support both academically and personally. This includes professors as USC, such as Dr.
Rosa Di Felice, Dr. Fengzhu Sun, Dr. Mark Chaisson, Dr. Peter Qin, Dr. Oscar Aparicio, and
Dr. Michael Waterman, as well as our external collaborators Dr. Nicolas Gompel, Dr.
Richard Mann, and Dr. Harmen Bussemaker. This also includes current and former lab
members Yingfei Wang, Dr. Ana Carolina Dantas Machado, Dr. Tsu-pei Chiu, Dr. Jared
Sagendorf, Dr. Jinsen Li, Dr. Satyanarayan Rao, Dr. Beibei Xin, Dr. Xiaofei Wang, Dr. Lin
Yang, Dr. Richard Li, George Wang, Yibei Jiang, Raktim Mitra, and Jesse Weller. Outside of
the lab, there are so many students and graduates I want to acknowledge, but in particular,
I would like to thank Katie Boeck, Yan Gan, Peter Luu, Dr. Tsung-Yu Lu, Keon Rabbani, Dr.
Valerie Thomas, Dr. Wen Shi, Dr. Rose Chiang, Dr. Joseph Larsen, Stefano Ceolin, Liucong
Ling, Aleique Allen, and Yue Li. All of you and many more have taught me so much over
iii
the years and have made my time at USC incredibly enjoyable and memorable. I am
excited to see where you all end up and am glad I can continue to count on you for advice.
Finally, I would like to thank my immensely supportive family and partner. I have
been incredibly lucky to have two parents, Jolie and Brent Cooper, and three siblings,
Ariana, Paul, and Cameron, who have always supported my dreams and provided a loving
and accepting environment for me to flourish. Thank you for lifting me up when I was
down and for pushing me when I needed it. My partner, Dr. Dustin Tsai, has been with me
since the beginning of my journey and has helped me with so many aspects of life as a
PhD student, from time management to editing, that I don’t know how I could have done
it all without him. It has been incredible traveling the world with you Dustin, and I feel so
lucky every day to have you in my life.
iv
Table of Contents
Acknowledgements ....................................................................................................................................... ii
List of Tables ................................................................................................................................................... vi
List of Figures ................................................................................................................................................ vii
Abstract ............................................................................................................................................................... x
Chapter 1: Introduction ............................................................................................................................... 1
Determinants of DNA binding site recognition .............................................................................. 1
High-throughput prediction of local DNA shape features ......................................................... 3
Methods to measure DNA binding preferences ............................................................................. 4
ChIP-seq .................................................................................................................................................... 4
Protein binding microarray ............................................................................................................... 5
HT-SELEX / SELEX-seq ....................................................................................................................... 6
Representations of DNA binding preferences ................................................................................ 8
Thesis Overview ......................................................................................................................................... 9
Chapter 2: Top-Down Crawl: A method for the ultra-rapid and motif-free alignment of
sequences with associated binding metrics ...................................................................................... 11
Introduction ............................................................................................................................................... 11
Implementation ........................................................................................................................................ 12
Results and comparison with alternate methods ....................................................................... 14
Application to in-vivo binding site identification and modulation ...................................... 17
Background of the spot
196
enhancer............................................................................................. 17
Fine tuning the activity of the spot
196
enhancer ...................................................................... 18
Chapter 3: Multi-step alignment of SELEX-seq data to quantify the DNA binding
specificity of all four S. cerevisiae forkhead box transcription factors .................................... 22
Introduction ............................................................................................................................................... 22
Material and methods ............................................................................................................................ 26
Protein expression and purification ............................................................................................ 26
Oligonucleotide synthesis ................................................................................................................ 27
SELEX-seq binding assay .................................................................................................................. 27
Competitive binding assay ............................................................................................................... 27
Multi-step alignment .......................................................................................................................... 29
Relative enrichment and free energy determination ............................................................ 32
Multiple linear regression ................................................................................................................ 36
Results and discussion ........................................................................................................................... 38
Comparing the reproducibility of PBM and SELEX-seq data .............................................. 38
Modified approach to determine enrichment .......................................................................... 41
v
Core-based alignment of full length reads ................................................................................. 42
Core binding sites exhibits interdependencies and shape preferences ......................... 43
Core binding sites of Fkh1 and Fkh2 exhibit differing selectivity .................................... 46
Flanking sequence contributions across differing cores ..................................................... 47
Interdependencies of flanking positions quantified using MLR ....................................... 49
Experimental validation of flanking contributions ................................................................ 52
Conclusions ................................................................................................................................................ 53
Data availability ........................................................................................................................................ 54
Chapter 4: Expanded analysis of simulated SELEX-seq data ...................................................... 55
Introduction ............................................................................................................................................... 55
Aligning to multiple binding sites ..................................................................................................... 56
Simulation of multiple binding sites ............................................................................................ 57
Simulation of a fixed adapter .......................................................................................................... 59
Simulation of R2 data ......................................................................................................................... 59
Simulation of overlapping binding sites ..................................................................................... 61
Enrichment of partial binding sites .............................................................................................. 61
Modeling full binding sites .............................................................................................................. 62
Chapter 5: Measuring the binding of CRISPR-Cas12a to flexible off-targets using
modified SELEX-seq .................................................................................................................................... 63
Introduction ............................................................................................................................................... 63
Experiment overview ............................................................................................................................. 65
Expected observations .......................................................................................................................... 66
Results and discussion ........................................................................................................................... 67
Chapter 6: Investigating MEF2B binding preferences using SELEX-seq and molecular
dynamics ......................................................................................................................................................... 72
Overview ..................................................................................................................................................... 72
SELEX-seq of MEF2B .......................................................................................................................... 72
Molecular dynamics simulations of MEF2B .............................................................................. 74
References ...................................................................................................................................................... 77
Appendices ..................................................................................................................................................... 85
Supplementary Tables ........................................................................................................................... 85
Supplementary Figures ......................................................................................................................... 92
vi
List of Tables
Table 2.1 Pseudocode of entire alignment process performed by TDC ................................ 15
Supplementary Table 2.1 Comparison of AR and GR PWMs generated
from various methods ................................................................................................................................ 85
Supplementary Table 2.2 Comparison of MEF2B and AbdA-Exd PWMs
generated from various methods .......................................................................................................... 86
Supplementary Table 2.3 Comparison of Dfd-Exd and Lab-Exd PWMs
generated from various methods .......................................................................................................... 86
Supplementary Table 2.4 Comparison of PbFl-Exd and Scr-Exd PWMs
generated from various methods .......................................................................................................... 87
Supplementary Table 2.5 Comparison of UbxIa-Exd and UbxIVa-Exd
PWMs generated from various methods. ........................................................................................... 87
Supplementary Table 2.6 Alignment agreement with TDC between
various alignment methods ..................................................................................................................... 88
Supplementary Table 2.7 MLR model performances for various alignment
methods ........................................................................................................................................................... 88
Supplementary Table 2.8 Wall-clock times for various alignment methods .................... 89
Supplementary Table 2.9 Peak memory usage for various alignment methods ............. 89
Supplementary Table 3.1 Oligo sequences used for SELEX-seq and competitive
binding experiments. .................................................................................................................................. 90
Supplementary Table 3.2 ΔΔG/RT measurements for our selected core
sequences averaged over every window and sorted by Fkh1 .................................................... 91
vii
List of Figures
Figure 1.1 Structure of DNA-binding protein E2 from bovine papillomavirus-1
bound to bent DNA, generated using DNAproDB [7] ...................................................................... 2
Figure 2.1 Overview, performance, and generated PWMs from TDC applied to
10 SELEX-seq datasets ............................................................................................................................... 13
Figure 2.2 Overview of validated Dll sites within spot
196
enhancer and proposed
variants ............................................................................................................................................................ 19
Figure 3.1 Binding logos derived from ChIP-seq, HT-SELEX, and PBM experiments ...... 23
Figure 3.2 Representative structures of FOX DBDs bound to DNA. ........................................ 25
Figure 3.3 Experiment and analysis framework overview ........................................................ 28
Figure 3.4 Relationship between the number of cores we included during
alignment versus the fraction of reads that align to a single core. ........................................... 31
Figure 3.5 Violin plot of ΔΔG/RT estimates from every window resulting from
two rounds of SELEX-seq for 48 selected core sequences relative to the most
enriched core ................................................................................................................................................. 33
Figure 3.6 Summary of flanking position contributions across shifts and
across differing cores ................................................................................................................................. 35
Figure 3.7 Comparison of PBM and SELEX-seq enrichments .................................................... 39
Figure 3.8 Comparison of the Z-score normalized enrichment of various
length k-mers ................................................................................................................................................. 40
Figure 3.9 PWMs generated from our selected 7-bp or 6-bp cores,
weighting each sequence by its relative enrichment. .................................................................... 43
viii
Figure 3.10 Comparison of the Fkh1 ΔΔG/RT measurements for different
mutations of the core DNA target sequence with respect to the reference
sequence GTAAACA ..................................................................................................................................... 44
Figure 3.11 Comparison of –ΔΔG/RT of 7-bp core sequences .................................................. 46
Figure 3.12 Maximum ΔΔG/RT averaged across all cores for each flanking position .... 48
Figure 3.13 Plot of the predicted electrostatic potential (EP) in the center of
the minor groove along the binding site ............................................................................................. 49
Figure 3.14 Competitive binding assay of Fkh1 ............................................................................. 52
Figure 4.1 Representation of how multiple binding sites contribute to
the binding observed after 2 rounds of selection for a SELEX-seq experiment .................. 60
Figure 5.1 Expected distribution of total mismatches over a 6-bp region,
given that one strand is completely randomized ............................................................................ 67
Figure 5.2 Bar chart showing the average equilibrium constant (K) for
sequences with a variable number of mismatches ......................................................................... 68
Figure 5.3 Bar chart showing the maximum K for sequences with 2
mismatches at variable positions relative to the PAM .................................................................. 70
Figure 5.4 Bar chart showing the maximum K for sequences with 3
mismatches at variable positions relative to the PAM .................................................................. 71
Figure 6.1 Backbone contacts between MEF2B and the minor groove of DNA ................. 74
Figure 6.2 Comparison of hydrogen bonding between K23 and R23 of
MEF2B with DNA.......................................................................................................................................... 75
Supplementary Figure 3.1 Distribution of the four bp at various positions
along the 16-mer in the initial library.................................................................................................. 92
ix
Supplementary Figure 3.2 ΔΔG/RT measurements for each aligned
core averaged over the 40 independent sets of aligned reads from the
Fkh1 SELEX-seq experiments ................................................................................................................. 92
Supplementary Figure 3.3 ΔΔG/RT measurements for each aligned
core averaged over the 40 independent sets of aligned reads from the
Fkh2 SELEX-seq experiments ................................................................................................................. 93
Supplementary Figure 3.4 ΔΔG/RT measurements for each aligned
core averaged over the 40 independent sets of aligned reads from the
Hcm1 SELEX-seq experiments ................................................................................................................ 93
Supplementary Figure 3.5 ΔΔG/RT measurements for each aligned
core averaged over the 40 independent sets of aligned reads from the
Fhl1 SELEX-seq experiments ................................................................................................................... 94
Supplementary Figure 5.1 Bar chart showing the maximum K for
sequences with 1 mismatch at a variable position relative to the PAM ................................. 94
Supplementary Figure 5.2 Bar chart showing the maximum K for
sequences with 4 mismatches at variable positions relative to the PAM .............................. 95
Supplementary Figure 5.3 Bar chart showing the maximum K for
sequences with 5 mismatches at variable positions relative to the PAM .............................. 95
x
Abstract
For life to exist, cells must be able to grow, divide, and respond to their environment. All
these processes require strict control over when genes are turned on, where they are
expressed, and how they can be silenced. This can be controlled through a variety of
means such as modification of transcriptional rate, translational rate, or degradation
post-translation. Here, we focus on the modification of transcriptional rate through
transcription factor binding. Transcription factors are DNA-binding proteins which bind
to specific targets in-vivo to upregulate or downregulate expression. Understanding what
factors play a role in target site recognition is an area of active research that is particularly
interesting for closely related transcription factors which exhibit similar binding
preferences.
Here, we investigate members of the forkhead box family of transcription factors,
which play a role in a number of processes from tumor suppression to cell growth and
differentiation. After generating SELEX-seq data for all four forkhead box homologs in S.
cerevisiae, we developed a multi-step pipeline for alignment of full-length SELEX-seq
reads which revealed important determinants of binding specificity outside the core of
the binding site. Part of the pipeline used for the rapid alignment of k-mer level data was
made publicly available through a webserver and was later applied to scanning and
modifying transcription factor binding sites in-vivo. Separately, we developed a modified
approach to SELEX-seq which revealed off-target binding of the CRISPR-Cas12a system
to mismatched off-targets. Finally, I describe an earlier work exploring the binding
preferences of the MEF2B transcription factor and the role of A-tract polarity.
1
Chapter 1: Introduction
Although DNA is the blueprint for life on Earth, it would be meaningless without the help
of DNA-binding proteins that can read, copy, and modify the underlying substrate. Many
of these proteins bind non-specifically to DNA, forming contacts with the backbone of the
DNA rather than specific bases. This is an important function for structural proteins such
as histones [1] or for DNA-bending proteins such as HMG1, which help reorganize DNA
in 3D space [2]. These proteins, and many more, function in a variety of contexts and need
to reliably form contacts with DNA exhibiting a great deal of sequence diversity.
On the other end of the spectrum, there are DNA-binding proteins that target
specific stretches of DNA based on the underlying sequence. Transcription factors (TFs)
are a major class of DNA-binding proteins which upregulate or downregulate
transcription of their target genes. This can be done through a variety of mechanisms,
from recruitment of histone modifiers or DNA methyltransferases, to direct interactions
with cofactors and RNA polymerase II. Being able to tune specific genes is an important
process for responding to environmental changes, but also a key requirement for cellular
differentiation and development. Furthermore, many genes involved in developmental
pathways, such as the Hox family, rely on TFs with remarkably subtle differences in
binding preferences [3]. Consequently, hundreds of mutations in TFs, binding sites, and
other regulatory proteins have been found to be implicated in human disease [4].
Determinants of DNA binding site recognition
DNA-binding proteins recognize their targets through a series of interactions classified
as either base or shape readout [5]. Base readout involves the recognition of unique
2
signatures, primarily exposed along the major groove of DNA. These signatures include
hydrogen bond donors and acceptors, as well as non-polar groups that can be recognized
by amino acids on the protein.
Although differing bases can be directly recognized by a DNA-binding proteins
using base readout, stretches of DNA bases can interact with each other to influence 3D
shape characteristics of the DNA. Additionally, conditions within the cell and other
binding proteins can induce changes to shape that would not be observed in unbound
DNA. This variability allows for a secondary mode of recognition referred to as shape
readout. Shape readout is broken into the recognition of global shape features, such as
DNA bend, which is measured over many base pairs, or local shape features, which are
measured between complementary or stacked bases, or over a small number of bases as
is the case for minor groove width.
In some cases, the bend angle of the DNA can facilitate contacts with the DNA-
binding protein. One such example is the E2 protein of bovine papillomavirus-1, in which
the target DNA appears to bend around the E2 dimer [6] (Figure 1.1). The bending allows
the two recognition helices to sit within the major grooves of the DNA to form base-
Figure 1.1 Structure of DNA-binding protein E2 from bovine papillomavirus-1 bound to bent DNA,
generated using DNAproDB [7]
3
specific contacts with the target. DNA bending can be either intrinsic, because of stacking
interactions over many base pairs, or induced, due to binding by a protein. The ease at
which DNA bends is described by a property known as DNA flexibility. Although many
factors influence flexibility, A-tracts longer than 4 bp are considered to be quit rigid,
whereas YpR steps, particularly TpA, allow for major deformations and kinking of the
DNA [8].
The local shape of DNA is described by features such as minor and major groove
width, propeller twist, helical twist, and roll. The minor groove width has been of
particular interest due to its role in modulating electrostatic potential. Narrowing of the
groove results in a phenomenon known as electrostatic focusing, which condenses field
lines to produce a more negative electrostatic potential [9]. The negative potential is then
able to interact with positively charged residues on a protein such as arginine and lysine,
which are long enough to reach into the narrow space of the minor groove. In fact, a
previous study by the Rohs lab found that 60% of residues falling within a narrow minor
groove are arginines, which reduces to 22% when looking at minor grooves wider than
5.0 Å [9].
High-throughput prediction of local DNA shape features
Although DNA shape features can be measured directly from a crystal structure,
generating these structures in the lab is incredibly time consuming, expensive, and low
throughput. Instead, structures can be generated by running all-atom Monte Carlo or
molecular dynamics (MD) simulations. Monte Carlo simulations work by randomly
sampling different conformations of the DNA with the goal of finding the most stable
conformation. Comparatively, MD simulations follow atoms of the structure over millions
4
of time steps given random initial velocities based on a given temperature and pressure.
Although this is expected the lead to a stable conformation based on the forces defined in
the simulation, the method may follow an indirect, and more computationally expensive,
path to get there.
In order to better understand the relationship between DNA sequence and shape,
researchers from the Rohs lab performed Monte Carlo simulations of 2121 highly
variable DNA fragments with a length of 12-27 bp [10]. For each simulation, several shape
features, including minor groove width, propeller twist, helical twist and roll, were
measured using Curves and averaged over 150,000 snapshots [11]. The researchers
found that shape features centered on a given position or a given step could be reliably
predicted using a 5-bp sliding window. Over the 2121 structures simulated, each
pentamer was represented an average of 44 times. The average shape features for every
unique pentamer were then saved as a lookup table for each shape which could be used
for the rapid prediction of shape features for any given sequence. The tool was initially
made available through a webserver, but was later provided as an R package through the
Bioconductor repository [12]. This package was later updated to provide predictions for
a total of 13 local shape features [13]. Being able to rapidly predict shape features enabled
researcher to further understand their impact on binding.
Methods to measure DNA binding preferences
ChIP-seq
DNA binding preferences for a given protein can be experimentally measured in a variety
of ways, both in-vivo and in-vitro. ChIP-seq is one such method which identifies binding
sites in-vivo by first inducing cross-links between DNA-binding proteins and the DNA
5
they are bound to [14]. Next, the DNA is randomly sheared by sonication and your protein
of interest, along with whatever DNA it is bound to, is pulled out of the solution using
antibodies specific to your protein. After this step, the bound DNA can be purified and
prepared for next generation sequencing. ChIP-seq reads can then be aligned to the
genome to indicate not only what your protein was bound to, but where. Unfortunately,
a major drawback of this method is a relatively low resolution, with fragments lengths
typically spanning around 300 bp. This makes identifying exactly where a given protein
is binding difficult and imprecise. A newer method, SLIM-ChIP [15], utilizes an MNase
digestion post-sonification to increase the resolution, but most available datasets were
created using the former technology. Perhaps more importantly, ChIP-seq experiments
exhibit a large amount of bias relative to in-vitro methods. This is in terms of sequence
composition, since certain sequences will be over or underrepresented in any given
genome, and in terms of accessibility [16]. Sequences that may otherwise be
preferentially bound by a given protein may be inaccessible at the timepoint the
experiment is performed, due to the positioning and composition of nucleosomes along
the DNA. Alternatively, unfavourable sequences falling within a constitutively accessible
region may be bound for more often than one might expect based on intrinsic DNA-
binding preferences alone.
Protein binding microarray
Measuring the DNA-binding preferences for a protein with an in-vitro experiment
provides a more controlled environment, free from many of the biases seen in ChIP-seq
data. To understand how cofactors, accessibility, and DNA modifications affect binding, it
is important to first understand native binding preferences. One of the first methods to
6
be popularized is the universal protein binding microarray (PBM) [17]. In this
experiment, each microarray includes 44,000 spots containing known DNA sequences
designed to cover every unique 8-mer at least 32 times. Next, a GST-tagged protein is
washed over the microarray, allowing it to bind to its preferred sites. A GST-targeting
fluorescent antibody is then washed over the microarray and an image is taken.
Fluorescent intensity can then be used as a measure of binding affinity, with brighter
fluorescence corresponding to stronger binding. As with ChIP-seq, a major challenge with
this approach is resolution. Each DNA sequence on the microarray is 60 bp long, so
identifying which binding site, or binding sites, are responsible for the observed
fluorescence can be challenging. Furthermore, PBM experiments are vulnerable to their
own biases based on where the binding site occurs along a particular 60-mer relative to
the surface of the slide, its orientation, and where the spot occurs on the microarray.
Fortunately, many of these biases can be corrected computationally, and are still much
smaller than those seen in ChIP-seq experiments. However, while PBM experiments
provide reproducible measurements for high-affinity binding sites, a huge portion of
moderate to low-affinity sites fall below the threshold of detection. Furthermore, biases
induced by the selection of probe sequences for the microarray can affect measurements
in a way that are not corrected computationally by the current framework. The effect of
this is explored further in Chapter 3.
HT-SELEX / SELEX-seq
As the price of next generation sequencing came down, an alternate approach, known as
high-throughput (HT)-SELEX [18], gained popularity. Unlike PBM, this method is able to
characterize binding preferences for hundreds of proteins in parallel and multiplexed
7
manner. In this experiment, each protein of interest is immobilized within a 96-well plate.
For each well, a DNA library containing a 14-40 bp randomized region is incubated with
the protein and free DNA is washed away. Bound fragments can then be purified and
amplified for an additional round of selection or barcoded and sent for sequencing. In this
experiment high affinity sequences are more likely to be enriched after selection
compared to the initial library or previous rounds. Unfortunately, due to the massive
scale of these experiments, a shallow depth of sequencing is typically utilized. In the
original experiment, each sample was covered by an average of 168,000 reads, but this
number varied drastically between samples. A later project resequenced these samples
to an average depth of around 1,656,000 reads per sample [19]. While this is a major
improvement, researchers found many samples to be of low quality due to biases in the
initial libraries, PCR bias, or sequencing bias. It is also not clear to what extent
immobilization of the protein affects its ability to bind DNA. In the original experiment,
410 TFs were included, however only 218 of these could be used in downstream analysis
following quality control. Although HT-SELEX has been instrumental in understanding
binding preferences due to its massive scale, performing the experiment is a complicated
and expensive process that requires specialized equipment not available to most labs.
An alternate approach, SELEX-seq, was developed around the same time as HT-
SELEX, and likewise takes advantage of low-cost next generation sequencing [20, 21].
While SELEX-seq is easy to run and only requires common laboratory equipment, it is
only designed to measure the DNA binding preferences of one protein at a time. In this
experiment, a randomized library is generated like the one described for HT-SELEX, and
then incubated with a protein of interest. Next, the mixture is run on a native
polyacrylamide gel to separate DNA fragments which have been bound and fragments
8
which are free. The bound fragments are cut from the gel, purified, amplified, and used
for an additional round of selection, or barcoded and sent for sequencing. Unlike in an
HT-SELEX experiment, the protein and DNA are floating freely in solution during
incubation. Although the sequencing depth is determined by the user, deep sequencing
produces enrichment measurements that are highly reproducible across a wide range of
affinities, allowing for the detection of sup-optimal binding sites and small-scale changes
between sites with similar affinities.
Representations of DNA binding preferences
Historically, DNA binding preferences of a given protein are represented using a position
weight matrix (PWM) [22]. To generate a traditional PWM, previously identified binding
sites are aligned and the frequency of each nucleotide is calculated for every position
along the alignment. For each position, the information content is calculated based on
some predetermined background distribution of the nucleotides, and a logo is generated
to show the importance of each base at each position. For PBM, HT-SELEX, and SELEX-
seq data, methods such as BEEML-PBM [23], SelexGLM [24], BEESEM [25], and NRLB [26]
have been developed to generate a PWM-based representation of binding, but in all these
cases, nucleotides are assumed to contribute independently to binding. Unfortunately,
this makes it impossible to detect how interdependencies between positions of the
binding site impact affinity. Consequently, this also neglects the impacts of shape readout
on DNA binding.
Using high-throughput predictions of DNA-shape, researchers have since been
able to add shape features to previously published PBM, HT-SELEX, and SELEX-seq
datasets and look for conserved shape preferences across a binding site [13]. Including
9
these features consistently resulted in an increase in predictive power, suggesting an
important role of local shape in determining binding site affinity for most TFs. In response,
a public database, called TFBSshape, was created to show DNA shape profiles, heat maps,
and box plots for binding sites derived from publicly available TF databases [27].
Although shape features improve our understanding of binding site recognition,
there may be additional influential interdependencies which are not captured by shape.
Instead of relying on a simplified model of binding based on a PWM+shape, it may be
better to utilize the entire set of measured k-mers enrichments provided by the
experiment. Using full-length k-mers does not require assumptions to be made about
interdependencies and have been successful in provide better predictions of in-vitro
binding [28].
Thesis Overview
In the following chapter, I introduce a method I developed called Top-Down Crawl (TDC),
which is soon to be published as an Application Note in Bioinformatics. The tool is
available as a publicly accessible webserver at topdowncrawl.usc.edu. The tool was
designed for the ultra-rapid alignment of k-mer level quantitative binding data from
experiments such as PBM and SELEX-seq. In comparing the tool with alternate
approaches, I show a high level of performance with very little computational cost. In an
additional subsection, I describe how we are using TDC to help identify and modulate
binding sites in-vivo using aligned k-mers from a SELEX-seq experiment.
In chapter 3, I discuss the use of Top-Down Crawl as part of a multi-step
framework used to align full-length SELEX-seq reads from experiments I performed on
all four forkhead box TFs found in S. cerevisiae. This work is currently in the revision
10
phase and is set to be published in Nucleic Acids Research. In this framework, I use TDC
to identify potential binding site cores to be used as a frame of reference for alignment of
full-length reads. This helps reveal the importance of flanking contributions to binding
site affinity and how DNA shape plays a role. In chapter 4, I discuss the simulation of
SELEX-seq seq data to probe ways in which the multi-step framework from the previous
chapter could be improved. In particular, I discuss methods for understanding binding
when multiple sites are present on a strand.
In chapter 5, I discuss a novel approach for applying SELEX-seq to gauge CRISPR-
Cas12a’s ability to bind flexible off-targets. The approach was designed in collaboration
with Aleique Allen from Dr. Peter Qin’s lab. We find that mismatched DNA, which is more
flexible and susceptible to kinking, is more likely to be bound by the Cas12a complex,
even when the target strand is not complementary to the guide RNA.
In the final chapter, I discussed an early work with Dr. Ana Carolina Dantas
Machado in which we used SELEX-seq to reveal the role of A-tract polarity in MEF2B
binding. This paper was published in Nucleic Acids Research, of which I was second
author. In addition to analysing preferential binding patterns across target sites, I used
all-atom molecular dynamics simulations to investigate how mutations on MEF2B affect
binding.
11
Chapter 2: Top-Down Crawl: A method for the ultra-rapid and
motif-free alignment of sequences with associated binding
metrics
Introduction
High-throughput in vitro binding methods, such as protein binding microarrays (PBM)
[17], SELEX-seq [20, 21], and SMiLE-seq [29], have given researchers the ability to
precisely quantify transcription factor (TF) binding in a controlled environment using
unbiased pools of DNA. For each of these methods, the enrichment of each individual
probe is not as informative as the enrichment of k-mers. While each full-length probe may
occur a few times within a sample, k-mers will occur far more frequently, providing highly
reproducible measures of binding affinity.
Since k-mer enrichment is inherently context-free, it is common to see a high level
of enrichment for k-mers that only covers a portion of the binding site. Determining which
part of the binding site a k-mer covers depends on alignment. Alignment allows
researchers to pinpoint TF–DNA interactions along the binding site and is a necessary
step in the application of conventional machine learning approaches such as multiple
linear regression (MLR). Previously described approaches, such as MEME [30], BEESEM
[25], and SelexGLM [24], are able to generate position weight matrices (PWMs) which can
subsequently be used to align k-mer level data, but they are not made for this purpose
and require a significant amount of time or memory to run. Furthermore, using a PWM to
summarize binding preferences for a TF is an unnecessary abstraction from the original
12
k-mer level data and results in the loss of information regarding interdependencies
between positions of the binding site. It is already known that DNA shape is dependent
on local interactions across several base pairs (bp), and plays a significant role in protein–
DNA binding for many TFs [31]. Here, we describe an original approach called Top-Down
Crawl (TDC), which can rapidly align large sets of k-mer level quantitative binding data
in a rank-dependent manner that does not depend on experiment-specific
parameterization. Top-Down Crawl can be run online at topdowncrawl.usc.edu or locally
as a python package available through pip: pypi.org/project/TopDownCrawl/.
Implementation
TDC was developed with one goal in mind: Usage of high-affinity sequences to describe
the binding of similar, but lower-affinity sequences. Then, use those sequences to align
other similar sequences. More specifically, the algorithm starts by assigning the k-mer
with the largest binding metric a shift of zero bp and is set as the first reference. All
unaligned k-mers that are one single bp mutation away from the reference are then added
to the alignment and assigned a shift equal to that of the reference (zero in this case). All
unaligned k-mers overlapping the reference by up to k–2 bp are then added to the
alignment with a shift equal to that of the reference plus or minus one or two depending
on if that sequence is overlapping on the 5’ or 3’ end of the reference (Figure 2.1A). For
example, the sequence AGTAAAC would overlap with the 5’ end of GTAAACA with a shift
of –1 bp. After this round, the reference sequence is marked as “complete” and the next
reference is determined as the most enriched k-mer amongst those which have been
aligned, excluding sequences already marked as complete. As before, the new reference
is used as a starting point for the addition of more k-mers to the alignment, so long as
13
Figure 2.1 Overview, performance, and generated PWMs from TDC applied to 10 SELEX-seq
datasets (A) Depiction of a single iteration of TDC, showing how the algorithm would align several
similar k-mers based on a reference. (B) Violin plots showing several method-dependent metrics
across 10 SELEX-seq datasets. For model performance, MLR models were trained using base
sequence, minor groove width, and electrostatic potential information along aligned 10-mers to
predict the log enrichment of 10-mers with a Z-score larger than 2. Models were trained using 5-
fold cross validation with elastic net regularization and the median performance across the test
tests is reported. Wall-clock time and peak memory usage requirements are inclusive of the k-mer
level enrichment calculation for TDC and MEME. This is the most time-consuming and memory-
intensive step for both methods. The alignment agreement indicates what fraction of sequences
were assigned to the same shift according to TDC and a given method. (C) TDC PWMs generated
using all 10-mers aligned with a shift of ±5, weighting each sequence by its relative enrichment.
14
they have not been added previously. This process is terminated when all sequences
added to the alignment have been marked as “complete.” Detailed pseudocode of this
process is provided in Table 2.1.
TDC is provided through a freely accessible webserver and requires a list of
sequences with their corresponding binding measurements as input. After processing the
upload, the user is given a summary of the alignment as well as a representative PWM,
weighting each sequence by its associated binding metric. This is most appropriate for
SELEX-seq data, for which the relative enrichment is expected to approximate the relative
binding affinity. The alignment itself is provided as a tab-delimited file where gaps are
represented by the “_” character. The output is stored for 48 hours and can be accessed
using a unique link generated for the submission. TDC can also be run locally as a python
package available through python’s package repository, pip.
Results and comparison with alternate methods
Given a PWM from a motif-generating method such as MEME, BEESEM, or SelexGLM, k-
mers can be assigned to their most likely “shift” relative to the reference, similar to TDC.
This is done by padding a given PWM with neutral positions on the 5’ and 3’ ends, then
sliding each k-mer along every window along the PWM to see which shift results in the
highest score. The alignment generated by this method can then be directly compared
with that provided by TDC. A generalizable workflow for PWM-based k-mer alignment
and evaluation is provided at github.com/bhcooper/TDC_evaluation.
MEME is a well-established method for the alignment of sequences, but it does not
take quantitative data as input. Therefore, every sequence is weighted equivalently in the
15
construction of the PWM. For this reason, we run MEME using only k-mers with a log
enrichment 2 standard deviations above the mean. The resulting PWM can then be used
TDC Algorithm
Input: Table containing standard DNA sequences (A,C,G,T) of equal length in column 1 and binding metrics in column 2
1: df ← dataframe containing binding metrics indexed by sequence
2:
3: // Average reverse complements or copy value from partner if absent from input
4: df_rc ← Copy df and reverse complement all sequences
5: df ← Append df_rc to df, group by index, and save mean for each index
6:
7: // Initialization
8: df ← Insert boolean column, isAligned, filled with False, to keep track of sequences which have already been
added to the alignment
9: df ← Insert boolean column, wasRef, filled with False, to keep track of sequences which have already been used as
a reference for adding other sequences to the alignment
10: df ← Insert integer column, shift, filled with N/A, to keep track of the shift assigned to each sequence
11: top ← Index of seqeunce with largest binding metric from df
12: k ← length(top)
13: df[top][shift] ← 0
14: df[top][isAligned] ← True
15: Delete reverseComplement(top) from df
16:
17: // Continue iterating until all aligned sequences have been used as a reference or until all seqeunces are aligned
18: while (isAligned == True and wasRef == False for any row in df) and (isAligned == False for any row in df)
19: unchecked ← Subset of df including rows where isAligned == True and wasRef == False
20: ref ← Index with largest binding metric from unchecked
21: refshift ← df[ref][shift]
22:
23: SNPs ← Indices within df that are 1 mismatch away from ref and where isAligned == False
24: df[SNPs][shift] ← refshift
25: df[SNPs][isAligned] ← True
26: Delete reverseComplement(SNPs) from df
27:
28: olap -1 ← Indices within df that overlap with the first k – 1 bases of ref and where isAligned == False
29: df[olap -1][shift] ← refshift – 1
30: df[olap -1][isAligned] ← True
31: Delete reverseComplement(olap -1) from df
32:
33: olap -2 ← Indices within df that overlap with the first k – 2 bases of ref and where isAligned == False
34: df[olap -2][shift] ← refshift – 2
35: df[olap -2][isAligned] ← True
36: Delete reverseComplement(olap -2) from df
37:
38: olap +1 ← Indices within df that overlap with the last k – 1 bases of ref and where isAligned == False
39: df[olap +1][shift] ← refshift + 1
40: df[olap +1][isAligned] ← True
41: Delete reverseComplement(olap +1) from df
42:
43: olap +2 ← Indices within df that overlap with the last k – 2 bases of ref and where isAligned == False
44: df[olap +2][shift] ← refshift + 2
45: df[olap +2][isAligned] ← True
46: Delete reverseComplement(olap +2) from df
47:
48: df[ref][wasRef] ← True
49:
50: df ← Subset of df where isAligned == True
51: df ← Pad indices of df with “-” based on shift
52: Output Save df as a table including the padded sequence, averaged binding metric, and shift
Table 2.1 Pseudocode of entire alignment process performed by TDC
16
for the alignment of all k-mers as described above. Alternatively, BEESEM was made
specifically for creating PWMs from SELEX-seq data, but is computationally limited to
producing motifs no longer than 10 bp and relies on subsampling for particularly large
datasets [25]. Although SelexGLM is able to generate much longer PWMs [24], the
currently available implementation has considerable memory requirements (Figure 2.1B)
and the output depends on the specification of several hyperparameters.
To compare TDC with alternate alignment methods, we include the analysis of 10
high-quality SELEX-seq datasets that have previously been published [24, 32, 33]. We use
a k-mer length of 10 which covers the known binding site for most of the TFs considered
and contains more information about suboptimal binding sites compared to longer k-
mers. Although the goal of TDC is alignment rather than PWM generation, we can
generate a logo from each alignment, weighting each sequence by its relative enrichment
(Figure 2.1C, Supplementary Tables 2.1-2.5). We found the resulting PWMs to be most
similar to those generated by BEESEM, with additional information outside the center of
the binding site, covering about 15 informative positions for GR and AR (Supplementary
Table 2.1).
For a more in-depth comparison, we determine what percent of significantly
enriched 10-mers are assigned to the same shift, using TDC as the reference. TDC showed
a high level of agreement with BEESEM, followed by MEME and SelexGLM (Figure 2.1B,
Supplementary Table 2.6). To evaluate the quality of each alignment, an MLR model was
trained to predict the log enrichment of aligned 10-mers which were significantly
enriched. Base pairs were one-hot encoded for each position, and the predicted minor
groove width and electrostatic potential were included to account for interdependencies
between positions [12]. For MLR to perform well, sequences need to be aligned such that
17
position specific permutations along the binding site predictably modulate binding
affinity. We found TDC to exhibit the best average performance, as measure by the
median R
2
using 5-fold cross validation with elastic net regularization (Figure 2.1B,
Supplementary Table 2.7). Comparing the wall-clock times, BEESEM was by far the
slowest, requiring many hours to complete, whereas TDC only takes seconds (Figure 2.1B,
Supplementary Table 2.8). SelexGLM was a bit faster than BEESEM, but required an
extremely large amount of memory (Figure 2.1B, Supplementary Table 2.9). The primary
reason these methods are more computationally demanding is because they work at the
level of the full-length read, rather than at the level of the k-mer. Since MEME was only
used to align k-mers with a log enrichment 2 standard deviations above the means, its
wall-clock time was highly dependent on the number of sequences passing this threshold.
While the quickest batch, including 653 sequences, was aligned in about 5 seconds, the
slowest batch, including just 4889 sequences took 18 minutes, demonstrating a 216-fold
increase in wall-clock time for a 7.5-fold increase in the number of sequences aligned.
Although we demonstrate TDC’s speed and performance using SELEX-seq data,
the alignment framework is highly flexible as it only depends on the rank of the sequences
provided. This allows for the alignment of binding data from a variety of experimental
approaches used today and those that are produced in the future.
Application to in-vivo binding site identification and modulation
Background of the spot
196
enhancer
Given a table of k-mers and their associated binding metrics from an experiment such as
SELEX-seq, it may seem trivial to use this directly as a lookup table for scanning a genome
for binding sites. However, there are several issues that complicate this type of analysis.
18
Primarily, chromatin and DNA modifications can dramatically alter the accessibility of
potential binding sites by a given transcription factor. Since the traditional SELEX-seq
experiment is only evaluated on unmodified and unwound DNA, the associated
measurements of affinity must only be compared across regions that are equally
accessible.
Transcriptional activators are transcription factors which upregulate the
transcription of promoters by binding to nearby target sites in regions called enhancers
[34]. Enhancers may contain multiple binding sites for different proteins which may
upregulate or, to a lesser extent, downregulate transcription [35]. Using relatively crude
PWM-based models of transcription factor specificity, researchers can identify and
abolish known binding sites, but fine-tuning these sites based on detailed SELEX-seq
enrichments has yet to be performed.
The spot
196
enhancer in D. melanogaster controls the regulation of the yellow gene,
which is responsible for the appearance of a dark spot the distal region of the developing
wing [36]. Within this enhancer, researchers previously identified and validated four
Distal-less (Dll) binding sites through in-vitro binding assays, indicated as Dll-a through
Dll-d [36] (Figure 2.2A). By cloning variations of the spot
196
enhancer upstream of a
DsRed reporter, researchers have been able to visualize how different mutations affect
the intensity and localization of yellow protein across the wing [37]. In an initial test of
this assay, researchers replaced 10-18 bp stretches of DNA with A-tracts, all along the
spot
196
enhancer. Four of these tests abolished the four Dll binding sites validated
previously, resulting in the elimination or reduction of the yellow reporter’s expression
[36, 37] (Figure 2.2A).
Fine tuning the activity of the spot
196
enhancer
19
Instead of completely abolishing the binding site, as was done in the previous
assay, we aimed to take one of these validated binding sites and either increase or
Figure 2.2 Overview of validated Dll sites within spot
196
enhancer and proposed variants (A)
Region of spot
196
enhancer covering the four Dll-sites previously validated in-vitro and in-vivo. The
Dll sites within the enhancer were each replaced with A -tracts and placed upstream a DsRed
reporter. Fluorescence of the reporter within the wings were imaged to indicate the activity of the
enhancer relative to the WT enhancer. Every mutant shows reduced fluorescence. (B) Relative
affinities according to a SELEX-seq experiment targeting Dll using the full list of 8-mers or a filtered
list of aligned 8-mers based on TDC. (C) 10 variants proposed to replace the WT Dll binding site at
Dll-c. All variants exhibit a single peak to replace the predominant peak from the WT sequence.
Variants cover a wide range of affinities based on the SELEX-seq experiment.
decrease its affinity based on measurements from a Dll SELEX-seq experiment. This
would serve as an in-vivo validation of the hypothesis that in-vitro binding affinity and
in-vivo transcriptional activity are correlated. First, we performed a naïve scan of the
spot
196
enhancer using the entire table of relative affinities for 8-mers from the SELEX-
seq experiment (Figure 2.2B). We used a k-mer length of 8 bp to ensure that we do not
have any missing values in the lookup table. This is also justified since the binding site for
Dll is expected to have a length of around 6 bp. In this first scan, three of the four known
binding sites, excluding Dll-b, appear as broad peaks spanning more than 8 bp.
20
Unfortunately, this makes is particularly difficult to identify the exact positions Dll is
binding, and to determine whether multiple binding sites exist close together.
The reason we see broad peaks with this type of scan is because we are using
unaligned k-mers. In a standard SELEX-seq experiment, any k-mer that overlaps a strong
binding site will be highly enriched, even if that k-mer is insufficient in inducing binding
on its own. As an example, SELEX-seq experiments of Fkh1 reveal strong binding to a 7-
bp sequence, “GTAAACA”. Comparatively, replacing the first bp with a “T” drops the
affinity nearly 10-fold. The shifted 7-mer, “TAAACAA”, will also be one of the most
enriched 7-mers in our dataset, with a relative enrichment equal to 70% of that of the
reference, because many of these sequences will be preceded by a “G”. Therefore, if you
are scanning along an enhancer with the sequence “TTAAACAA,” you will see a peak
centered on TAAACAN. Given that we know this sequence is preceded by a “T,” we already
know that this is a relatively weak binding site.
Instead of scanning with the entire table of k-mers, we can instead align the table
using TDC and scan using only a single shift which covers the entire binding site. In this
way, each binding site will be represented by one peak, and the exact positions of the
binding site will be in alignment relative to the peak. We once again scanned the spot
196
enhancer, using only 8-mers in alignment with the optimum Dll binding site, “TAATTA”
surrounded by 1 bp 5’ and 1 bp 3’ of the site (Figure 2.2B). With this framework, we were
able to identify all four previously validated Dll binding sites, including Dll-b, a low
affinity binding site which was previously shrouded by a noisy signal when using the
unaligned table. For all but the third Dll site, Dll-c, the exact binding position is indicated
by a single peak which pinpoints exactly where binding is expected to occur. Within Dll-
c, we actually found 3 potential binding sites, with one site exhibiting an exceptionally
21
low affinity relative to the other 2. This would be impossible to detect when scanning
with the unaligned table.
In collaboration with researchers from Dr. Nicolas Gompel’s lab at LMU Munich,
we designed 10 variants of Dll-c in the spot
196
enhancer of D. melanogaster. We chose this
target because the predominant binding site exhibits moderate affinity, which allows us
to evaluate the effect of increasing or decreasing Dll affinity at this site. To evaluate the
contribution of flanking positions to binding affinity, we selected the 10 variants from a
table of aligned 10-mers, which include the 6-bp core plus 2 positions 5’ and 3’ of the core.
We then replaced the WT site with these 10 variants and scanned the enhancer with the
complete table of 8-mers as was done previously (Figure 2.2C). This was to ensure none
of the variants unintentionally created a secondary binding site within this region. We
did not scan the enhancer using the table of 10-mers because there is more missing data
at this k-mer length, and we want to be sure even weak secondary sites are not created.
As before, the variants were cloned upstream from a DsRed reporter and
replicates were imaged and quantified by our collaborators in the Gompel lab. Although
imaging is not yet complete, early observations reveal upregulation and downregulation
of activity in concordance with our expectations based on the aligned SELEX-seq data.
22
Chapter 3: Multi-step alignment of SELEX-seq data to quantify
the DNA binding specificity of all four S. cerevisiae forkhead
box transcription factors
Introduction
The forkhead box (FOX) transcription factor (TF) family consists of over 43 homologs in
human characterized by a highly conserved winged-helix DNA binding domain (DBD).
Members within the family play crucial roles in regulating a variety of key processes from
proliferation and development, to tumor suppression and aging [38]. Because of this, it is
not surprising that members within the FOX family recognize distinct DNA binding sites
in vivo [39]. Despite this, previous attempts to characterize DNA binding preferences
across the FOX family using in vitro methods such as protein binding microarray (PBM)
or high-throughput (HT)-SELEX approaches have revealed little variability in the position
weight matrices (PWMs) between members (Figure 3.1). Although several in vivo
conditions can modulate the DNA binding specificity and activity of a TF, it is important
to understand inherent binding preferences and the limitations of previous attempts to
characterize them.
PBM and HT-SELEX binding assays have unveiled the motifs of hundreds of TFs.
Despite these important advances, these methods are often inadequate to capture precise
information about moderate- to low-affinity sequences. For PBM, also known as universal
PBM (uPBM), the fluorescence of low-affinity probes is often indistinguishable from the
background or not included in the array design, and HT-SELEX typically utilizes a
23
Figure 3.1 Binding logos derived from ChIP-seq, HT-SELEX, and PBM experiments (A) FOX
homologs found in human or (B) S. cerevisiae. Data was downloaded from JASPAR [40] and
UniPROBE [41] and then visualized using ggseqlogo.
dramatically shallower sequencing depth per TF compared to more targeted methods
such as SELEX-seq [31, 42]. However, low-affinity binding sites frequently make up
actively bound regions that modulate transcription of their target genes in vivo [43]. In
closely related homologs, such as those within the Hox family of TFs, differential binding
to suboptimal sites can help distinguish family members [44].
To explore this phenomenon in the context of the FOX family, we chose to further
investigate the binding of all four paralogs in yeast: Fkh1, Fkh2, Hcm1, and Fhl1. Based
on DNA binding profiles published in UniPROBE [41], calculated using BEEML-PBM [23],
24
Fkh1 and Fkh2 exhibit virtually indistinguishable binding preferences (Figure 3.1B).
While these TFs are capable of binding many of the same loci in vivo, hundreds of non-
shared sites have also been identified using ChIP-chip [45]. Unfortunately, ChIP-chip and
ChIP-seq data is notoriously noisy, with broad peaks that are often unable to pinpoint the
nucleotides bound by the TF.
To further explore the binding preferences of these proteins, we have performed
SELEX-seq with deep sequencing to collect over 10 million reads per sample after up to
two rounds of selection. With this framework, we were able to collect more information
for moderate- to low-affinity binding sites that would have been exponentially diluted
had we utilized many rounds of selection. Because of the extreme sequencing depth
utilized, enrichment measurements are remarkably consistent across independent
windows and between subsampled sets of reads.
One of the biggest challenges in analyzing SELEX-seq data is identifying the
location of the binding site or binding sites along the length of a given read. The original
SELEX-seq protocol provides a method to calculate the relative affinity between k-mers
[20], but the location of the binding site within these k-mers is unknown. This is especially
problematic since the sequence context of any given k-mer is lost during this process, so
interdependencies with positions outside of the k-mer would be lost. Because of this,
enriched k-mers are likely to include incomplete binding sites, or even multiple
overlapping binding sites. This makes it difficult to understand the biophysical processes
of binding site recognition without further data processing. Although several methods
have since been developed to generate position weight matrix (PWM) models of the
binding site, these models inherently assume that base pairs (bp) contribute
independently to binding [46]. When this assumption is violated, the alignment of false
25
binding sites can dilute the signal of nucleotide positions with a relatively small impact
on binding. Models including dinucleotide features have been able to capture some of
these interdependencies [47-49], but may still be inadequate in summarizing binding for
the vast number of moderate- to low-affinity sequences.
Figure 3.2 Representative structures of FOX DBDs bound to DNA. Winged regions are indicated
in blue, and the main recognition helix is indicated in red with the core of the DNA binding site
indicated in pink and flanking bp indicated in green. (A) Rattus norvegicus Foxd3 (PDB ID: 2HDC)
[50], (B) Human FOXA3 (PDB ID: 1VTN) [51], (C) Human FOXK2 (PDB ID: 2C6Y) [52].
With respect to the FOX family, nucleotide positions flanking the core of the
binding site may be particularly insightful in differentiating the binding of closely related
homologs [53]. The winged-helix domain of FOX proteins consists of a highly conserved
recognition helix that binds in the major groove, making several base-specific contacts to
its target (Figure 3.1, Figure 3.2). These positions contact a 6-7-bp region of the binding
site we refer to as the ‘core’, which exhibits high sequence specificity across FOX
homologs (Figure 3.1). For some co-crystal and NMR structures, these regions form
contacts with the minor groove regions flanking the core of the binding site [50-53]
(Figure 3.2).
To further explore these flanking positions, as well as interdependencies between
positions within the core, we developed a multi-step alignment framework that allows us
to align full-length reads and precisely detect the contribution of flanking positions to
binding specificity in a core-specific manner. For Fkh1 and Fkh2, this approach has
26
allowed us to expand the canonical 7-bp binding site to 13 bp by revealing the
contribution of six positions flanking the core of the binding site. Although previously
ignored, the cumulative impact of these positions can reduce the affinity of the best core
to be lower than the worst. Comparatively, Hcm1 and Fhl1 exhibited minimal dependence
on positions flanking the core.
Material and methods
Protein expression and purification
The DBD of Fkh1 and its surrounding residues (amino acids 243-484) were cloned into
the pET-28b(+) DNA vector such that a His-Tag is added to the N-terminal end of the
polypeptide (Millipore Sigma: 69865). The product was then transformed into Rosetta
2(DE3) Competent Cells (Millipore Sigma: 71405) and expression was induced overnight
at 16 °C using 0.2 mM IPTG. Cells were centrifuged at 4 °C and resuspended in MCAC-0
buffer (20 mM Tris-Cl pH 8.0, 0.5 M NaCl, 10% glycerol, 1 μg/mL pepstatin A, 50 μg/mL
TPCK, 1 mM benzamidine, 1 mM PMSF) on ice. The cells were then broken by sonication
and the lysate was clarified by centrifugation at 4 °C. The lysate was then incubated with
a Ni-NTA resin (Qiagen: 30210) equilibrated in MCAC-0. Next, the resin was sequentially
washed with MCAC buffer containing 10, 20, 30, 40 mM imidazole, followed by elution
with 250 mM imidazole. Salts were then removed by buffer exchange using Amicon Ultra
Centrifugal Filters (Millipore Sigma: UFC901024, UFC800324). The final protein products
were verified using SDS-PAGE. The same procedure was followed for the expression and
purification of the Fkh2 (amino acids 280-520), Hcm1 (amino acids 41-213), and Fhl1
(amino acids 401-638), using pET-28a(+) as the DNA vector (Millipore Sigma: 69864).
27
Oligonucleotide synthesis
The library and all other DNA oligonucleotides were synthesized by Integrated DNA
Technologies and purified by standard desalting (Supplementary Table 3.1).
SELEX-seq binding assay
The SELEX-seq procedure was carried out following the original protocol by Slattery et
al. [20, 21]. The library was designed with a 16-bp randomized region surrounded by
fixed adapters. Purification was performed using Qiagen’s MinElute PCR Purification Kit
(Cat: 28004). Binding reactions were performed in a binding buffer consisting of 10%
glycerol, 50 mM KCl, 20 mM Tris HCl (pH 7.9), 5 mM MgCl2, 0.1 mg/mL BSA, and 3 mM
DTT. We used a 6-FAM labeled library for tracking, and all gels were visualized on
Invitrogen’s iBright
TM
CL1000 Imaging System. Bound fragments were excised and
purified by phenol chloroform extraction followed by ethanol precipitation. We sent
samples for sequencing from round zero (R0), round one (R1), and round two (R2) of
selection.
Competitive binding assay
A competitive binding assay was performed to compare the binding affinity of Fkh1 to
two sequences containing binding sites with the same core, but differing flanks
(Supplementary Table 3.1). In this assay, a fixed amount of 6-FAM labeled probe is
incubated with a limited amount of binding protein such that non-specific binding is
prevented. An increasing amount of unlabeled competitor is included across several
samples until binding by the labeled probe is visibly reduced by at least 50%,
representing the point at which both probes are bound equally. In our experiment, the
28
sequence with optimal flanking sequence was labeled with a 5’ 6-FAM modification and
the sequence with the suboptimal flanking sequence was used as the competitor.
Figure 3.3 Experiment and analysis framework overview (A) A library containing a 16-bp fully
randomized region is incubated with the TF of interest and run on a non-denaturing
polyacrylamide gel. The bound DNA is then extracted, purified, and amplified. Aliquots are sent for
sequencing or used for additional rounds of selection. (B) Enriched 7-mers are identified and
aligned as described in Materials and Methods to identify core sequences to use for alignment of
the full 16-mers. Each 16-mer containing a hit to exactly one core sequence is assigned to a group
based on where the core sequence begins. (C) For each position outside of each core, the relative
enrichments between gapped 8-mers are used to calculate ΔΔG/RT. The averages across all
windows are plotted in a condensed view to allow for easier comparison across cores.
29
Multi-step alignment
Rather than using a PWM-based method for alignment, we aligned the 16-bp variable
region of full-length reads to a predetermined set of 6-bp cores for Fhl1, and 7-bp cores
for Fkh1, Fkh2 and Hcm1. The short k-mers allow for sequence diversity in the flanking
region while remaining long enough to accurately pinpoint true binding sites. This also
allows us to specifically focus on identifying the contributions of positions flanking the
core. The alignment is performed over a multi-step process summarized in Figure 3.3.
First, the relative enrichment of every k-mer was calculated as described in the
original SELEX-seq protocol [20, 21]. We then performed a preliminary alignment of the
longest set of k-mers which exhibited a high degree of coverage given a 100-count cutoff.
For every dataset we assessed, more than 95% of all unique 9-mers meet this threshold,
compared to an approximate 30% coverage of unique 10-mers. The preliminary
alignment is performed using a recently developed approach named Top-Down Crawl
(TDC). Starting from the most enriched sequence as the reference, sequences that are 1-
bp mutation away or overlapping by at least k−2 bp are added to an alignment
accordingly. The next iteration begins from the next most enriched sequence included in
the alignment and is repeated until no additional sequences can be aligned. This
framework was created specifically for the alignment of quantitative binding data, using
highly enriched sequences as a template to explain the binding of those that are less
enriched. To determine a set of candidate cores, this preliminary alignment is then
trimmed to the 6-7-bp region exhibiting the highest level of sequence conservation.
Unique sequences falling within this region are treated as candidate cores to be used in
the next step, which we refer to as reprioritization.
30
To avoid complications resulting from combinatorial effects between multiple
binding sites, we restrict our analysis to reads that only align to one core. This also helps
ensure that observed flanking preferences are truly acting to modulate the given core
rather than creating additional cores. However, this creates a trade-off between the
number of cores we choose to analyze and the number of reads we can align. Therefore,
rather than including the full list of candidate cores aligned to the most enriched k-mer,
we must prioritize a subset of these sequences. The most obvious way to do this is to
simply rank them by their enrichment; however, it is important to consider that the
observed enrichment of any given k-mer is a result of its activity when bound as a core in
addition to its activity when bound as an optimal flank to an adjacent core, since the
context is lost during the counting process. Since we aim to prioritize sequences based
solely on their contribution as a core, we developed an iterative approach to reprioritize
candidate cores as described below.
The highest priority core was assigned as the most enriched k-mer as calculated
previously. Next, all reads containing a match to that k-mer on the forward or reverse
strand were removed from the dataset and k-mer enrichments were recalculated for the
remaining reads. The next core prioritized was then the most enriched k-mer from this
reduced set of reads, such that the k-mer belongs to our list of candidate cores determined
previously. Any read from this reduced set containing a match to that k-mer is then
removed. This process was repeated until the enrichment of the most recently removed
k-mer is measured to be at least 95% of the average enrichment of the five previously
removed k-mers. This restricts our analysis to a reduced set of core sequences that
explain binding for a large number of full-length reads.
31
After this process, every read from the full dataset was realigned to the set of
prioritized sequences, discarding reads with multiple hits. For comparison, we separately
aligned the full dataset to the original list of candidate cores prioritized by raw
enrichment. Shown in Figure 3.4, the reprioritized list consistently allowed for the
alignment of more reads compared to the original list. To simplify later comparisons
between Fkh1, Fkh2, Hcm1, and Fhl1, we used the same set of core sequences for the
alignment of every dataset. To generate this list, we took the union of the core sequences
prioritized by each. The final list contained 49 sequences including 10 6-bp Fhl1-based
cores and 39 7-bp cores (Figure 3.5).
Figure 3.4 Relationship between the number of cores we included during alignment versus the
fraction of reads that align to a single core. The list of candidate core sequences is sorted by raw
enrichment or by iterative reprioritization as described in the manuscript. Using the top k-mers
from the reprioritized lists allow for more sequences to be included in downstream analysis.
32
Relative enrichment and free energy determination
Given a core sequence of length k, there will be 16–k+1 windows across the 16-bp
variable region where that k-mer can occur. Including occurrences on the reverse-
complement strand, this number doubles. Because our alignment framework only allows
for one core per read, each read can only be assigned to one window. Additionally, this
makes the enrichment of any given sequence within one window independent of its
enrichment in any other window (Figure 3.3B). To determine the enrichment of a core at
a given window for a given round, its proportion is divided by its proportion at that same
window in round zero. As described in the original SELEX-seq protocol [20, 21], the r
th
root of the relative enrichment, with r representing the number of rounds of selection
that have taken place, represents a close approximation of the true relative affinity [21].
The measurements can then be averaged over all windows to represent the best
estimates of the true relative affinities. The change in affinity can also be represented as
the ΔΔG/RT by taking the negative natural log of the relative affinity [54] (Figure 3.5).
This represents the difference in the free energy released by binding scaled by 1/RT. In
this case, a more positive value represents a less favorable binding interaction relative to
the most enriched core. Analysis of ΔΔG/RT rather than relative affinity also helps
accentuate changes between moderate- to low-affinity binding sites, and better reveals
the additive impact of multiple mutations.
To measure how flanking positions modulate binding site affinity, we then
calculated ΔΔG/RT at every variable nucleotide position outside of the core. Given a core
sequence of length k, each window will contain 16–k flanking positions across the 16-bp
variable region. These can occur 5’ and/or 3’ of the core sequence depending on the
window being analyzed. Assuming positions flanking the core contribute to binding
33
Figure 3.5 Violin plot of ΔΔG/RT estimates from every window resulting from two rounds of
SELEX-seq for 48 selected core sequences relative to the most enriched core (A) which is
GTAAACA for Fkh1, Fkh2, and Hcm1, and (B) GACGCA for Fhl1. Larger values indicate a greater
disruption to binding relative to the reference. Error bars show 95% confidence intervals.
independently of each other, their effects can be measured by looking at the relative
enrichment between gapped (k+1)-mers, including one flanking position and a fixed 6-
bp or 7-bp core. By looking at gapped (k+1)-mers, we can calculate the core-specific
effects of flanking positions up to at least nine bp away from the core. This core-specific
approach also allows us to avoid the dilution of flanking contributions if false binding
34
sites are included during the alignment process, and to determine to what extent these
contributions are dependent on the core.
Like the core, we measure flanking contributions in terms of ΔΔG/RT. For a given
window, a position frequency matrix (PFM) is generated by counting the occurrence of
each bp at every variable nucleotide position outside of each core. These are the gapped
(k+1)-mer counts mentioned previously. The enrichment of each bp is then determined
in a position specific manner by dividing the frequency of each bp by its frequency
observed in R0 at that same position (Supplementary Figure 3.). The enrichment of each
bp is divided by the mean enrichment at every position to get relative enrichments. As
described in the original SELEX-seq paper, the relative enrichment of samples collected
after R rounds of selection is only equivalent to the relative affinity if the R
th
root is taken.
However, this scaling factor may vary slightly depending on the efficiency of
separating bound and unbound fractions. Instead, we estimate the scaling factor my
dividing the mean log relative enrichment of cores from R2 by that from R1. This scaling
factor was equal to 1.89, 1.97, 1.76, and 1.85 for Fkh1, Fkh2, Hcm1, and Fhl1, respectively.
The scaled relative enrichments are then converted to ΔΔG/RT by applying the negative
natural log. Positive values indicate bp that are more disruptive to binding than average,
and negative values indicate bp which facilitate binding. We represent the values using a
heat map instead of a traditional motif logo in order to facilitate comparisons between
independent windows and cores (Figure 3.3C). This entire process is repeated for every
window along the 16-bp variable region for every core (Figure 3.6A). The ΔΔG/RT
measurements are then averaged across all windows (Figure 3.6B).
35
Figure 3.6 Summary of flanking position contributions across shifts and across differing cores (A)
Graphic representation of ΔΔG/RT for every possible bp at positions outside of the core binding
site, given a core sequence of GTAAACA. Rows represent the 40 independent sets of aligned 16-
mers, each with independent measurements. Larger values indicate greater destabilization of
binding relative to other possible bp at that position. Contributions are highly consistent across
samples. (B) ΔΔG/RT measurements for each aligned core averaged over the 40 independent sets
of aligned 16-mers. Rows are clustered with the UPGMA algorithm using Manhattan distance as
the metric. The measurements suggest that flanking contributions are largely independent of the
core.
36
Multiple linear regression
To evaluate the effect of interdependencies between nucleotides within the core binding
site, we trained a multiple linear regression (MLR) model using only mononucleotide (1-
mer) features to predict ΔΔG/RT values measured from our experiment [31]. If the
nucleotide positions truly contribute independently to binding, then the 1-mer model
should be able to accurately predict the measured ΔΔG/RT. Each bp of sequence was
encoded using a one-of-four encoding representing the four bases of DNA. With a
sequence of length L, the full encoding includes 4 x L features. Training was performed
using the LassoCV function from the sklearn python package with the maximum number
of iterations set to one million, and all other settings unchanged from their default
configuration [55]. The model performance was evaluated by 5-fold cross validation
scored using the coefficient of determination, R
2
. In a separate model, minor groove width
features predicted from the DNAshapeR package [12] were included.
One assumption in our calculation of flanking position contributions was that they
contribute independently to binding. To probe the validity of this assumption, we again
used MLR to model and predict ΔΔG/RT for longer k-mers that cover multiple flanking
positions. Since individual k-mer counts are larger for shorter sequences, their associated
ΔΔG/RT measurements are less noisy. However, longer k-mers can be more informative
regarding interdependencies between nucleotide positions. Using the set of core-aligned
reads, the ΔΔG/RT can be calculated for extended core sequences by including additional
positions flanking the core. We included at most 6 flanking positions covering 4 bp 5’ and
2 bp 3’ of the core. Instead of using a 100-count threshold for this length sequence, we
filtered out sequences that do not appear in at least two independent alignment windows,
37
allowing for the analysis or more sequences without compromising the correlation of
ΔΔG/RT measured between subsets of the data.
For each model, sequences were encoded to capture differing levels of information
depending on the question asked. For the simplest encoding, flanking positions were
encoded independently of each other and independently of the core using the one-of-four
encoding described previously. Since the core was restricted to a relatively small list of
predefined sequences, the core was encoded using a one-of-C encoding, with C
representing the number of cores included during the alignment. This allowed us to avoid
the assumption that core positions contribute independently to binding. With this simple
encoding, independent positions flanking the core were modeled as being independent
of the core, and independent of each other.
Alternatively, flanking positions can be encoded in a core-dependent manner by
matrix multiplication of the 4 x L flanking features by the one-of-C feature, resulting in a
4 x L x C set of features for each k-mer. This model is expected to capture similar
information as the previously mentioned framework for calculating flanking
contributions based on gapped (k+1)-mers. The model can also be extended to include
interdependencies between flanking positions by encoding all pairwise dinucleotides.
This is done by matrix multiplication of the 4 x L flanking features by itself to produce a
16 * L
2
set of features for each k-mer. Since this matrix is symmetric, only the upper
triangle was kept for modeling. As before, the performance is measured terms of the
average R
2
using 5-fold cross-validation.
38
Results and discussion
Comparing the reproducibility of PBM and SELEX-seq data
Across the FOX protein family, members typically bind to a strongly conserved 7-bp motif
surrounded by several flanking nucleotides, which appear to exhibit degenerate binding
specificities (Figure 3.1A). Whereas PBM experiments have been instrumental in
revealing the binding preferences for hundreds of TFs in a high-throughput manner [17,
56], the typical experiment is designed to examine binding sites up to 8-bp long. Although
the Seed-and-Wobble approach was proposed to extend the binding site beyond 8-bp
[17], motifs generated by the approach were found to greatly underperform relative to
BEEML-PBM, putting its broad applicability into question [57]. While fluorescent
intensity and binding strength are correlated in uPBM data, unrelated factors may also
affect fluorescence, such as the binding site location and orientation within the 60 bp
probes. In a typical PBM experiment, each 8-mer occurs in at least 32 positions on the
chip, so the effects from these secondary factors are thought to be minimal relative to the
effects of binding strength [17]. However, for moderate- to low-affinity sequences, these
effects can lead to noisy or irreproducible measurements across different microarray
designs (Figure 3.7A). This also complicates the detection of small-scale preferences that
may be found in the positions flanking the core.
Previously, Bulyk and co-workers derived binding profiles for 89 yeast TFs,
including Fkh1, Fkh2, and Fhl1 using PBM experiments [58]. For each TF, replicate
experiments were performed using two separate microarrays designed with
independent de Bruijn sequences. In comparing the Z-score normalized PBM
enrichments between the two microarray designs probed with Fkh1, we detected weak
correlations for the majority of 8-mers measured (Figure 3.7A). This was
39
Figure 3.7 Comparison of PBM and SELEX-seq enrichments (A) Comparison of Z-score normalized
enrichment scores for 8-mer sequences according to two independent PBM microarray designs
performed on Fkh1 and Fkh2. [41] (B) Comparison of Z-score normalized enrichment scores for
9-mer sequences derived from two separate SELEX-seq experiments with different library designs
performed on Fkh1. Fkh2 was only evaluated using the second library design.
especially apparent for 8-mers exhibiting low signal intensities. In comparing Fkh1 and
Fkh2 intensities measured using the same microarray design, a stronger correlation was
observed (Figure 3.7B). This suggests that any differences in binding between Fkh1 and
Fkh2 are more subtle than the variance that is observed across different microarray
designs. For this reason, a uPBM study of this design is not suitable for revealing the
subtle differences in binding preferences between the two homologs. An alternative
approach that we did not chose could be the genomic-context PBM (gcPBM) approach
also developed by Bulyk and co-workers [59]. Our SELEX-seq experiment addresses
many of the discussed concerns simply due to the extreme depth of sequencing utilized.
40
Figure 3.8 Comparison of the Z-score normalized enrichment of various length k-mers derived
from two equally sized sets of unique reads from Round 2 of the Fkh1 SELEX-seq experiment. The
coverage represents the proportion of unique k-mers for which we could calculate the enrichment.
We ultimately collected around 35 million reads per sample for Fkh1 and Fkh2, and about
10 million reads per sample for Hcm1 and Fhl1. For comparison with uPBM experiments,
we first calculated the context-free k-mer enrichment as described in previous work [20].
To explore distal positions flanking the core binding site, we used a library with a
16-bp variable region. With a k-mer length of 16, approximately 4.3 billion unique
sequence permutations are possible. With the size of a typical sequencing run, only a
small subset of these permutations can be captured and the count for each is often too
low to provide a meaningful measurement of binding affinity. By looking at the
enrichment of shorter k-mers within the randomized regions we can collapse the number
of possible permutations several fold and greatly increase the number of occurrences for
41
each. This creates a tradeoff between k-mer length, noise, and how many of the possible
k-mers can be analyzed given a 100-count threshold (Figure 3.8). Compared to previously
mentioned uPBM experiments, the Z-score normalized SELEX-seq enrichments exhibit
remarkably consistent measurements despite measuring k-mers of up to 13 bp in length.
Modified approach to determine enrichment
In previous work analyzing SELEX-seq data [21, 60] an enrichment table was determined
by counting every k-mer across a sliding window and dividing it by the expected count
according to a Markov Model generated from R0 k-mer counts. Although a sliding window
increases the number of observed counts for each k-mer, it comes at the cost of losing
positional information amongst the original reads. Additionally, we found that R0 biases
are dependent on the position, so representing the bias universally with a position-
agnostic Markov model may not be appropriate (Supplementary Figure 3.). Furthermore,
by counting k-mers using a sliding window, smaller subsequences are counted multiple
times, so k-mer counts cannot be considered independent of each other. For example, if
the most enriched core is GTAAACA, then we know that k-mers of the form TAAACAN will
also be highly enriched, even if those sequences are not intrinsically able to significantly
promote binding in any other context.
Problematically, any given k-mer may be enriched due to its activity as a strong
core binding site, optimal flanking sequence, or a combination of both. By limiting the
number of cores per sequence to one, as described in Materials and Methods, we can be
confident that the observed core is acting as the most likely binding site amongst each
read, and that flanking positions will be aligned accordingly. Additionally, if multiple
42
binding sites were to be permitted, then flanking positions may be biased to prefer the
creation of additional cores, rather than by modulating the affinity of the aligned core.
Core-based alignment of full length reads
From data shown in Figure 3.1, we know that the most conserved region of the binding
site, referred to as the ‘core,’ is 7-bp long for most FOX proteins. Considering a 7-bp core,
and a 16-bp randomized region, there are a total of 20 different positions in which the
core may reside, including ten on the forward strand and ten on the reverse strand
(Figure 3.3B). Assuming the data reflects sequence-specific binding, we expect every 16-
mer to have at least one predominant binding site. Since the core sequence is the most
influential region in determining binding, we sought to create a table of putative core
sequences that could be used to identify and align the binding sites for the largest number
of 16-mers. Details of how this table is generated are described in Materials and Methods.
We ultimately decided to use a list of 49 core sequences including ten 6-bp Fhl1
cores (Supplementary Table 3.2). For the R1 data, we were able to align 25.9%, 28.6%,
28.0%, and 13.2% of the reads bound by Fkh1, Fkh2, Hcm1, and Fhl1, respectively. For
R2, we aligned a more substantial 63.3%, 63.7%, 46.1%, and 19.9% of the reads bound
by Fkh1, Fkh2, Hcm1, and Fhl1, respectively. Unlike PWM-based methods this framework
does not assume any interdependencies within the core and reveals many high-affinity
sequences with surprisingly high dissimilarity from the most enriched core. Perhaps
most importantly, this framework aligns full-length reads which enables the analysis of
positions at least nine positions away from the core on either side. For comparison to
traditional methods, we generated a PWM weighting each core by its relative enrichment.
Using only our reduced set of 7-bp cores for Fkh1, Fhh2, and Hcm1, and the 6-bp cores
43
for Fhl1, the generated PWMs (Figure 3.9) are highly similar to the uPBM-derived motifs
published previously (Figure 3.1B) [58].
Figure 3.9 PWMs generated from our selected 7-bp or 6-bp cores, weighting each sequence by its
relative enrichment.
Core binding sites exhibits interdependencies and shape preferences
For every alignment window, ΔΔG/RT was calculated for each core sequence relative to
the most preferred core sequence, as described in Materials and Methods
(Supplementary Table 3.2). Since each window consists of an independent set of
sequences that were selected by the protein independently, they can be treated as
independent samples. For a core length of seven bp, each sequence can be measured
across 40 samples, including ten samples per strand per round. ΔΔG/RT measurements
for our set of cores exhibit little variability across independent samples resulting in
narrow confidence intervals for the averaged values (Figure 3.5). One core, GTATACA,
was excluded from analysis since we were concerned it may be filtered at a much higher
rate. Adding just one thymine to the first position 5’ of this core results in a palindromic
sequence containing two cores.
44
Figure 3.10 Comparison of the Fkh1 ΔΔG/RT measurements for different mutations of the core
DNA target sequence with respect to the reference sequence GTAAACA, exemplifying a non-
additive relationship. Out of this set of sequences, the three most enriched cores exhibit similar
minor groove width (MGW) and electrostatic potential (EP) profiles as predicted using
DNAshapeR (IUPAC: W = A/T; M = A/C).
For Fkh1, Fkh2, and Hcm1, the most enriched sequence was GTAAACA. It becomes
clear upon further inspection that nucleotide contributions in the core do not appear to
contribute independently. Using Fkh1 as an example, we considered the four core
sequences shown in Figure 3.10. If only the third position is mutated from an A to a C, we
see a ΔΔG/RT of 0.67. If we mutate the second position from a T to an A, we see a large
ΔΔG/RT of 1.90 If the effects of these mutations acted independently, then mutating both
positions would result in a ΔΔG/RT of 2.57. Instead, we observe a relatively modest effect
of 1.36, nearly half of its expected value, and even lower than the single mutation from T
to A. This phenomenon is repeated throughout the table of cores analyzed, exemplifying
45
complex interdependencies between positions within the core. This further suggests that
a PWM-based representation would not accurately describe binding preferences in this
system.
To explain these interdependencies, we examined DNA shape features for the four
core sequences discussed previously. The DNAshapeR package [12] predicts several DNA
shape features using a 5-bp sliding window with values based on previously run Monte
Carlo simulations of free DNA [10]. Although the package is able to predict 13 different
shape features, we are most interested in minor groove width and electrostatic potential
due to their biophysical origin arising from the identity of multiple bp beyond
dinucleotides and their potential to influence interactions with charged residues within
the winged regions of the DNA binding domain [61]. Although, the GAAAACA core is only
one mutation away from the reference core, it is disadvantaged in Fkh1 binding
compared to GACAACA, which contains an additional mutation at the third position.
Although this second mutation may disrupt some preferred bp-specific contacts, it
appears to increase the minor groove width and electrostatic potential of the DNA so that
it is more similar to the preferred reference (Figure 3.10). It is worth noting that this
secondary mutation disrupts what would be a 4-bp A-tract, a feature known to cause
intrinsic DNA bending [62].
To explore the importance of interdependencies in this region more rigorously,
we evaluated the use of a position specific affinity matrix (PSAM) for the prediction of
ΔΔG/RT for every 7-bp core containing more than 1 mutation from the reference, which
includes 31 enriched sequences. The PSAM is generated using the measured ΔΔG/RT for
every 7-bp core that is 1 mutation away from the reference. The PSAM predictions were
only weakly correlated (r
2
= 0.46) with the measured values, further emphasizing the
46
importance of including these interdependencies when prioritizing cores and identifying
binding sites.
Figure 3.11 Comparison of –ΔΔG/RT of 7-bp core sequences between (A) Fkh1 and Fkh2 or (B)
Hcm1 and Fkh2 color-coded by bp identity at each position.
Core binding sites of Fkh1 and Fkh2 exhibit differing selectivity
The ΔΔG/RT measurements from every window for each core are displayed in a violin
plot in Figure 3.5, and all averages are provided in Supplementary Table 3.2. To identify
any sequence-specific differences in DNA binding specificity between Fkh1 and Fkh2, we
plotted ΔΔG/RT of every 7-bp core and color-coded each point by the bp identity of the
core at each nucleotide position along the 7-mer (Figure 3.11A). Although only a few
positions exhibited variability in sequence, we noticed a sequence-dependent shift at the
second position of the core. At this position, Fkh2 exhibited a greater tolerance for
adenine at this position relative to Fkh1. Comparing Hcm1 with Fkh2, we find that Hcm1
is less tolerant of a thymine at position 6 of the core (Figure 3.11B). Across these three
homologs, the base-contacting residues, based on co-crystal structures for other FOX
proteins, appear to be highly conserved. This suggests that the observed differences in
preferences may be a result of a higher-order feature such as DNA shape.
47
Flanking sequence contributions across differing cores
To begin investigating the effects of flanking sequences on binding, we plotted ΔΔG/RT
for every possible bp outside of the core for every window. This type of analysis assumes
that edge positions contribute to binding independently of each other. For Fkh1, Fkh2,
and Hcm1, measurements were not only consistent across different windows (Figure
3.6A), but also highly consistent across different cores (Figure 3.11B, Supplementary
Figures 3.2-3.5). This suggests that our multi-step approach to identifying core sequences
and aligning reads is valid, since flanking preferences are also aligned. It is interesting
that this observation also holds true for differing length cores. The decision to include the
Fhl1-based cores in the alignment of all homologs was supported by previous
crystallographic evidence of FoxN3 that showed conserved amino acid contacts between
a 6-bp Fhl motif, GACGCA [63], and the standard 7-bp Fkh motif [64]. As described by the
authors, the three-dimensional DNA shape of the shorter core was altered in order to
align two ‘registration positions’ at the edges of each motif [64]. In our case, the 6-bp and
7-bp motifs exhibited highly similar flanking preferences both upstream and
downstream of the core, despite differing lengths, further supporting the discovery of
registration positions that align contacts flanking the core (Figure 3.6B).
We describe the sensitivity to flanking positions as the difference in ΔΔG/RT
between the most and least favored bp at each position, using the averages over all cores
(Figure 3.12). For Fkh1 and Fkh2, we found the largest flanking contributions at the four
positions 5’ of the core and two positions 3’ of the core. For Hcm1, only one position 3’ of
the core was found to have a similarly large impact on binding. It is interesting to see such
a stark difference in sensitivity to flanking positions even though all three homologs share
48
Figure 3.12 Maximum ΔΔG/RT averaged across all cores for each flanking position. The red box
highlights the positions that had the largest impact on Fkh1 binding.
similar preferences for the core. For Fhl1, flanking positions did not appear to contribute
significantly to binding. Although the 6-bp cores are far more enriched than those that
are 7-bp, the overall difference in ΔΔG/RT between the best and worst core is on a much
smaller scale than we see across Fkh1, Fkh2, and Hcm1 cores. This suggests that Fhl1
exhibits less specific binding to its preferred cores, which could impact alignment by
allowing “false” binding sites to be aligned.
Looking at the two nucleotide positions 3’ of the core, referred to as the +1 and +2
positions, we see a slightly diminished range in binding affinities by Fkh1 when the sixth
position of the core is a thymine, rather than a cytosine (Figure 3.6B). Likewise, we find
that the +1 position appears to have a smaller impact on the predicted electrostatic
potential at positions 5 and 6 of the core when there is a thymine at the sixth position,
using GTAAA(C/T)A as an example (Figure 3.13). This may explain why
49
Figure 3.13 Plot of the predicted electrostatic potential (EP) in the center of the minor groove
along the binding site given a (A) Cytosine or (B) Thymine at position six and a variable bp at the
first position 3’ of the core.
several core sequences containing a thymine at position 6 appear to be less sensitive to
variations at the +1 position. More broadly, this observed preference for a more negative
electrostatic potential in the 3’ flanking region is consistent with the hypothesis that
positively charged residues in the winged regions of Fkh proteins act to stabilize binding
to the DNA. Compared to Fkh1, Fkh2 exhibits a substantial reduction in specificity at the
+1 and +2 positions across all cores (Figure 3.12). This observation suggests reduced
sensitivity to electrostatic potential by Fkh2 in this region.
Interdependencies of flanking positions quantified using MLR
We decided to further investigate interdependencies using 13-mer windows covering all
7-bp cores and the six flanking positions that were most impactful to binding. We used
Fkh1 as an example since it exhibited the largest feature contributions of flanking
50
positions. Additionally, we only utilized the R2 data since a small set of 13-mers are
significantly enriched in the R1 data. Although we have already quantified the
contribution of flanking positions and the core in terms of ΔΔG/RT, these values are
assumed to be largely independent. If correct, then ΔΔG/RT of a 13-mer can be predicted
by the sum of the ΔΔG/RT values of each feature relative to a reference. Instead of using
our measured contributions, we can fit a model using MLR to predict an optimal set of
feature contributions.
An MLR model can also be expanded to include interdependencies by using an
alternate encoding of the input sequence. We can therefore measure the effect of
interdependencies by comparing the performance between a model that considers
interdependencies and one that does not. As described previously, measurements of 13-
mer enrichment are inherently noisy relative to shorter k-mers. If we split our dataset
into two equal parts of randomly sampled sequences, ensuring that duplicate reads are
in the same set, we can compare the –ΔΔG/RT measurements between the two to get a
sense of how much of the variance is due to noise. We find the measurements moderately
well correlated with a Pearson r
2
of 0.89. This puts an approximate upper bound on the
performance of the model since it is not designed to explain noise. In reality, we expect
the noise to be slightly reduced since we assessed our models using averages from the
full set of data.
For additional details regarding the encoding process, see Materials and Methods.
The most basic encoding assumes independence between flanking positions and
independence of flanking positions from the core but does not assume independence
between positions of the core. This simple model achieves an average cross-fold R
2
of 0.70.
To include interdependencies between flanking positions, we encode dinucleotide
51
features for every possible pair of positions outside the core. These features are believed
to capture stacking interactions between bp, as well as long-range interactions between
distant positions. Adding the dinucleotide encoding, we obtain an improved R
2
of 0.75 for
the trained model. This suggests that flanking positions can interact, but these
interactions only explain a small fraction of the overall variance.
Next, we examined core-specific contributions by encoding flanking positions
conditional on the core, removing the dinucleotide encoding for now. For this type of
encoding, the model achieves an R
2
of 0.83. This improvement in performance confirms
our previous observation that different cores slightly modulate flanking position
contributions. Combined with our previous finding, we find that contributions of the
flanking positions are dependent on the core as well as other flanking positions. With this
in mind, we concluded that the best model would consider both types of
interdependencies. Indeed, this model achieved the highest performance with an R
2
of
0.90, roughly matching the limit of the model approximated previously based on the
noisiness of the data.
If we assume the 5’ flanking positions contribute independently of the 3’ flanking
positions, we can repeat the above tests using less noisy 11-mer sequences including the
four 5’ flanking positions and the core. Between the same two equally sized halves
mentioned previously, we observe an r
2
of 0.98. The simplest model achieved an R
2
of
0.89. The model including dinucleotide interactions between flanking positions achieved
an R
2
of 0.91. The model including core-dependent flanks achieved an R
2
of 0.93. Lastly,
the model including dinucleotide interactions and core-dependent flanks achieved an R
2
of 0.96. As seen before, including dinucleotide interactions resulted in only modest
52
improvements to performance. Therefore, we chose to disregard these interactions to
facilitate analysis and report flanking contributions.
Experimental validation of flanking contributions
Based on the assumption of independence between positions, mutating the flanking
positions to their most unfavorable bps results in an average increase in ΔΔG/RT of 4.00
across all cores for Fkh1 and 3.24 for Fkh2. This corresponds to a roughly 55-fold and 26-
fold reduction in binding affinity, respectively. Alternatively, mutating the core to the
lowest affinity sequence included in the alignment, an Fhl1-based core, resulted in a
ΔΔG/RT of 3.68 for Fkh1 and 3.57 for Fkh2.
Figure 3.14 Competitive binding assay of Fkh1. The probe containing an optimum flank is FAM-
labeled. The competitor with a suboptimal flank is unlabeled. The molar ratio of competitor to
probe is shown. We see the intensity of the bound probe is noticeably reduced after a 32-fold
excess of competitor relative to the probe.
We experimentally validated the importance of the flanking positions using a
competitive binding assay to compare the binding of Fkh1 to the core, GTCAACA,
surrounded by either optimal or suboptimal flanking nucleotides. We find that the core
surrounded by the suboptimal flank is required in an excess of 32-fold to displace at least
half of the fragments containing the optimal flank (Figure 3.14). This aligns with our
53
expectations well and further emphasizes the importance of including flanking positions
in discriminating the binding affinity between identified binding sites.
The techniques utilized in this work provide a novel framework for the
identification and alignment of binding sites derived from SELEX-seq data. By utilizing a
targeted approach, focusing on the alignment of high-confidence cores, we were able to
reveal and quantify the specificity of Fkh1 and Fkh2 over an expanded binding site
spanning at least 13 bp.
Conclusions
With our multi-step alignment approach, we have been able to thoroughly explore how
flanking nucleotide positions contribute to binding site affinity in a way that previous
approaches cannot [reviewed in 3]. By focusing on the alignment of full-length reads, we
have revealed patterns of flanking nucleotide preferences that are highly consistent
across independent windows and across drastically different cores. Although the impact
of each nucleotide position may be small, their combined effect can greatly impair binding
to a putative DNA target. Additionally, these contributions are often lost in traditional
PWM-based analytical frameworks, for which the alignment of false binding sites can
dilute their effect. By using a restricted set of cores, we can pinpoint the effects of
mutating flanking positions without assuming independence between positions of the
core.
In this work, we explore the binding preferences of Fkh1 and Fkh2, and reveal
small-scale, but consistent differences that have not been characterized previously.
Including flanking contributions, we were able to expand the binding site to a 13-bp
window including four bp 5’ of the core and two bp 3’ of the core. Likewise, the framework
54
can be adapted to fully capture the impact of flanking positions for other TFs whose
binding has been measured using next generation sequencing. In this work, we selected
a list of candidate cores using two novel approaches, Top-Down Crawl, and iterative
reprioritization. The computational analysis framework is flexible and can be applied to
any list of cores desired by the researcher, even when those cores are not the same length.
Data availability
SELEX-seq data was collected as described in Materials and Methods and submitted to
the Gene Expression Omnibus (GEO) with accession number GSE178811. A small-scale
test run targeting Fkh1 was included as well as the large-scale runs discussed throughout
this work.
Workflow and scripts to perform the multi-step alignment approach and calculate
the resulting ΔΔG/RT values are provided at https://github.com/bhcooper/multi-step-
align. All steps of the approach can be performed by following the publicly available
workflow provided.
55
Chapter 4: Expanded analysis of simulated SELEX-seq data
Introduction
Although the previously described alignment and analysis pipeline was shown to work
well for my Fkh1, Fkh2, Hcm1, and Fhl1 SELEX-seq datasets, its development also
revealed several drawbacks of conventionally used methods. PBMs exhibit significant
noise for moderate to low affinity binding sites, and they are not able to analyze a large
variable region. SELEX-seq experiments can capture more reproducible predictions
across a wide range of affinities since they are able to expose the protein to a much larger
set of potential binding sites and utilize very deep sequencing for a robust signal. As
discussed previously, the randomized region of a SELEX-seq experiment is 16 bp long.
Even with the extreme depth provided by next generation sequencing, it is not capable of
capturing enough fragments to differentiate the affinity of the 4.3 billion unique
sequences possible. An enriched subset of all possible 16-mers can be analyzed by
performing several additional rounds of selection to restrict variability, but with each
additional round, the number of moderate to low affinity binding sites drops off
exponentially. Fortunately, most transcription factors primarily recognize a relatively
short region of 6-12 bp. Therefore, we can restrict our analysis to only consider the
enrichment of shorter k-mers and ignore positions which do not contribute to binding.
Additionally, when positions of the binding site contribute independently to binding, a
model can be developed using masked k-mers to infer the binding affinity of longer k-
mers without compromising signal quality from the experiment, as was shown previously.
56
Aligning to multiple binding sites
To properly characterize a binding site, it important to align sequences such that any
given nucleotide along the binding site is exposed to a consistent region along the
transcription factor’s binding surface. In my study of FOX proteins, I decided to align
every 16-mer to a predefined set of 7-mers determined through the sequential removal
of highly enriched sequences. Although positions outside of the 7 bp core were later
proven to contribute to binding, the 7 bp core contains the most influential positions in
identifying a true binding site, and flanking positions were assumed inconsequential to
this determination. For the Fkh proteins, this assumption was later validated by the
consistent contribution measurements for flanking positions across distinct cores. One
major restriction imposed by my alignment is that each 16-mer was only aligned if a
single core was present. This was done to avoid the identification of flanking sequences
that contribute to enrichment by creating additional binding sites, rather than by
contributing as flanking positions to the aligned binding site. This was also done to avoid
counting 16-mer multiple times across independent windows. However, this restriction
comes with several drawbacks that I wanted to address before applying this method to
other datasets.
Most importantly, this restriction does not allow for the alignment of palindromic
sequences, because these sequences inherently contain two binding sites. For Fkh1, Fkh2,
and Hcm1, this is not a problem since the binding site has an odd number of positions and
therefore cannot be perfectly palindromic. Nevertheless, this restriction will not filter all
cores equally, since some cores are close to being palindromic, or have subsequences that
significantly overlap other cores. In this case, a small number of flanking positions can
create a secondary binding site that would be filtered out by the alignment process. For
57
example, a secondary identical binding site is created on the opposite strand of core
sequence GTATACA by adding a T to the −1 position. In practice, the effect of this is only
apparent for a small number of our predefined Fkh cores, but in expanding the analysis
to other datasets, this cannot be ignored.
Based on the setup of the SELEX-seq experiment, we know that every DNA
fragment is bound by exactly one protein molecule. Therefore, when multiple potential
binding sites are present along a given fragment, we know only one of them is bound. The
probability of binding to a given site is directly proportional to the affinity of that site
relative to the affinity of every other site present on the fragment. Assuming that every
binding site contributes additively to the enrichment of a given fragment, the observed
count of a given 16-mer can be divided such that a given 7 bp core is only counted relative
to the amount it contributed to that observation. Since the 16-mer is divided based on
relative enrichment, and relative enrichment is determined by the divided counts, we can
iteratively alternate between these steps until convergence to find the true relative
affinities.
Simulation of multiple binding sites
To validate this hypothesis, I created a simple simulation to model the data. First, I
generate a set of N sequences made up of L binding sites grabbed randomly from a pool
of potential binding sites with predefined affinities. For each sequence, I calculate its
expected enrichment after one round of selection based on the predefined affinities of
each binding site. These represent the observations we get from a standard SELEX-seq
experiment. To determine the “unknown” relative affinities of each binding site, the
observed enrichment of each sequence is evenly divided and assigned to every binding
58
site it contains. This gives an initial estimate of the relative enrichments, which would
closely resemble that of conventional SELEX-seq analytical frameworks. However, these
values are markedly different from the true values that we predefined. We continue the
algorithm by redistributing the observed enrichment of each sequence proportional to
the relative enrichments of each binding site. This gives an updated set of predicted
relative enrichments that more closely resembles the set of predefined affinities. This
process is repeated several more times until convergence. The final enrichments for each
binding site were remarkably close to the predefined affinities, even when L was made to
be relatively large (100 binding sites).
To reflect my dataset, I was also curious how this algorithm would perform when
we only sought to determine the enrichment of a subset of potential binding sites (i.e.,
predefined cores). In this case, the observed enrichments were generated as before, but
were only distributed between a selected subset of potential binding sites. In this case,
binding sites outside of this subset were treated as if they did not contribute to binding.
After iterating until convergence, the final relative enrichments amongst the subset of
selected binding sites did not resemble their predefined relative affinities. Alternatively,
I can create an additional term to represent the average enrichment of any potential
“alternate” binding site outside of my selected subset. In this case, the observed
enrichments are divided between our subset of binding sites and any number of alternate
binding sites. In this case, the final relative enrichments amongst the subset once again
resembled the predefined affinities. This shows the importance of including a catchall
term for potentially unrecognized binding sites.
59
Simulation of a fixed adapter
To expand on this finding, I further evaluated the assumption that fixed adapter
sequences outside of the variable region can be ignored. To do this I added a fixed set of
randomly selected binding sites to the end of every sequence. I used the full sequences to
calculate the expected enrichment observations, but only used the “variable region” for
binding site enrichment calculations. After iteration and convergence, the relative
enrichments did not match the predefined relative affinities. The effect of this was
proportional to the affinities of the “adapter” binding sites. Therefore, although adapter
sequences can contribute to affinity estimations, ignoring them will not have a large effect
if they do not contain strong binding sites. The Fkh adapters we used were designed for
this purpose, but including the adapter sequences in this process can further improve
predictions moving forward. This also helps account for binding sites that “bleed” off the
edge of the variable region.
Simulation of R2 data
After one round of selection, any given sequence will be independently enriched by each
of the binding sites present. This means the enrichment is directly proportional to the
sum of binding site affinities, as shown below.
𝐸 𝑠𝑒 𝑞 𝑅 1
∝ ∑ 𝐾 𝑖 𝑖
This allows for the straightforward redistribution of counts proportional to their
relative affinities as described previously. However, after a second round of selection, the
enrichment of a given sequence is proportional to the square of this sum.
𝐸 𝑠𝑒 𝑞 𝑅 2
∝ ( ∑ 𝐾 𝑖 𝑖 )
2
60
If the measured enrichment of the 16-mer was not noisy, we could simply take the
square root of the enrichment and distribute the counts as before. However, as discussed
previously, the extreme amount of sequence variability results in an exceedingly small
count for any individual sequence. Therefore, I aimed to avoid using the square root of
the 16-mer and instead relied on information at the binding site level in determining how
counts should be distributed. In this way, I expanded the polynomial based on my current
prediction of binding site enrichment and normalized by the sum of all elements, as
represented in Figure 4.1. I then distributed the count of the 16-mer amongst the
normalized terms along the diagonal. The predicted affinity of a given binding site is then
determined by taking the square root of the enrichment for the binding site based on the
redistributed counts. I evaluated this using simulated R2 data and found the final
predicted binding site affinities to closely resemble their predefined values.
Figure 4.1 Representation of how multiple binding sites contribute to the binding observed after
2 rounds of selection for a SELEX-seq experiment. Given L binding sites along a sequence, the
enrichment of that sequence after two rounds of selection is proportional the sum of all pairwise
products.
In the original SELEX-seq paper, every 16-mer was assumed to have one binding
site responsible for the observed enrichment. As shown by the expanded polynomial, the
effects of this false assumption are exponentially amplified with each round of selection.
61
Simulation of overlapping binding sites
Although the simulation validates many aspects of the iterative redistribution approach,
it is highly simplified in that binding sites are completely independent points rather than
multi-base strings. I expanded the simulation by generating random 16 bp DNA
sequences and assigned a predefined affinity to every k-mer. For simplicity, I choose a k-
mer length of 5. I then calculated the expected R1 observation based on these relative
affinities, treating every 5 bp window as independent binding sites. After iterating until
convergence, I found the relative enrichments to match the predefined enrichments, even
for binding sites which overlap extensively.
Enrichment of partial binding sites
In our simulation, binding sites and their affinities are fully described by individual k-
mers of fixed length. When predicting these affinities, we can scan for the enrichment of
k-mers of the same length to fully explain binding. In a true SELEX-seq experiment, the
full length necessary to fully define binding sites may not be known. Furthermore, if the
binding sites are particularly long, the measured enrichments of k-mers of equal length
may be too noisy to provide meaningful predictions of relative affinity. Therefore, we
must depend on information from k-mers which are shorter than the full length of the
binding site.
To begin evaluating this framework, I chose a predefined affinity for several 4 bp
sequences to represent the “core” of the binding site. To every core, I added an additional
nucleotide 3’ of the 4-mer sequence and modulated the affinity by a fixed multiple
depending on which base was added. In this case, 3’ “flank” contributes independently of
the core, but the full binding site is nevertheless defined by 5 bp. I generated the expected
62
R1 observation based on a 5 bp sliding window for every sample, and then calculated the
enrichment of every 4-mer. After iteration and convergence, I found the relative
enrichment between the 4 bp “cores” did not match their predefined values. Since any
given 5-mer is made up of two 4-mers, the enrichment of any given 4-mer can represent
contributions from multiple distinct binding sites, and distinguishing their contributions
becomes impossible. However, it should be noted that the effect of this is minimal when
the “flank” contributions are small relative to the core, as is the case for our Fkh proteins.
Modeling full binding sites
Based on my previous analysis, I predict the binding site of Fkh1 and Fkh2 to be fully
defined by 13 bp, including a 7 bp core and 6 flanking positions. Since enrichment
information is highly noisy for 13-mers, we can instead use the MLR model defined
previously predict the relative affinity of 13 bp binding sites for redistribution of counts.
The MLR model could then be iteratively updated using the updated alignment until
convergence. Moving forward, I plan to pass on this knowledge to the next generation of
students from the Rohs lab for a future work.
63
Chapter 5: Measuring the binding of CRISPR-Cas12a to flexible
off-targets using modified SELEX-seq
Introduction
Since 2012, CRISPR-Cas systems have gained tremendous popularity for their potential
in facilitating gene editing. As opposed to alternate systems, such as zinc fingers and
TALENs that rely on protein contacts for target recognition, the CRISPR-Cas system is
guided by an RNA template. Not only does this greatly improve target recognition, but it
also makes the system far simpler and cheaper to customize due to the ease with which
polynucleotides can be synthesized. However, using CRISPR-Cas systems in a clinical
setting requires an in depth understanding of off-target activity to ensure a near-zero
chance of creating unintended deleterious mutations. Since their discovery, many distinct
CRISPR-Cas systems have been studied and modified in order to modulate their off-target
activity. Off-target effects can be measured with in-vivo experiments, such as GUIDE-seq
[65] and digenome seq [66], or with in-vitro experiments such as CIRCLE-seq [67], BLT
[68], or Spec-seq [69]. Generally speaking, CRISPR-Cas systems have been found to bind
to many off-targets in-vivo, but only cut a very small percentage of these compared to
sites that perfectly match the target [69-71]. Mechanistically, the Cas-specific PAM is
recognized first, followed by unwinding of the 8-12 PAM-proximal bases, or “seed” region,
to form a RNA-DNA duplex with the guide [72-75]. This initial step stabilizes the complex
but does not directly result in cutting [74]. Without stable pairing, the DNA duplex
eventually reforms to eject the Cas complex [72, 74]. This leads to the observation that
PAM-proximal mismatches with the guide are least likely to be tolerated [76-78].
64
Following PAM-proximal pairing, the DNA is continuously unwound to probe the
remaining PAM-distal positions, thus extending the RNA-DNA duplex though the end of
the guide [74, 79]. This induces a conformational change in the Cas protein, favouring the
creation of a double stranded break in the DNA [69, 80, 81].
Interestingly, off-target activity has been found to depend not only on the Cas
protein, but on the sequence of the guide RNA [76, 78]. Through experimentation, several
rules have been determined to facilitate the design of guides that are both specific and
efficient at cutting target sequences [82]. Biophysical mechanisms explain some of these
rules, but many are only weakly explained by our current understanding of the CRISPR-
Cas system. In the Cas12a system, shape-readout has been found to play a critical role in
recognition of the 5’-TTTN-3’ PAM [83]. In a crystal structure of the Cas12a system bound
to its target, the AT-rich PAM duplex was found to exhibit a narrow minor groove that
interacted with a positive lysine residue on the protein [83].
More recently, research by Wei Jiang of Peter Qin’s lab at USC revealed the
potential role of DNA flexibility in the recognition of off-targets by the Cas12a system [72].
The group designed a competition assay in which a small amount of radiolabelled probe
with perfect complementarity to the guide RNA is challenged with an excess
concentration of predefined competing duplexes. If a competitor binds to the Cas12a
complex, it will prevent cutting of the radiolabelled probe. Reaction products were
evaluated using denaturing PAGE. The main goal of this experiment was to gauge the
binding of off-targets containing “bubbly” or unpaired DNA. This was done by creating
mismatches between the two strands in the protospacer region directly adjacent to the
PAM. Interestingly, the researchers found unpaired DNA to bind to the Cas12a complex
even when complementarity to the guide was completely abolished. The researchers
65
evaluated this further by varying the “bubble” length immediately adjacent to the PAM.
This revealed a modest increase in binding with respect to bubble length up to a length
of four mismatched bases. In all these cases, the DNA was made to be non-complementary
with the guide RNA. Without the formation of a RNA-DNA duplex, the researchers
proposed that the bubbly DNA was able to be kinked in such a way that facilitates stable
binding to the Cas12a complex without the risk of ejection during DNA duplex
reformation. [72]
To explore this phenomenon at a wider scale, I have been working with Aleique
Allen from the Qin Lab to perform an analyse a modified SELEX-seq protocol in which
random mutations are introduced to 6 PAM-proximal positions on the non-target strand.
This alters the flexibility of the target strand without altering its sequence composition.
As before, the target strand was made to be non-complementary to the RNA guide, as a
perfect match would be cut at an extremely high rate regardless of its flexibility, thus
making it difficult to measure differences.
Experiment overview
The target strand was designed with an optimal PAM sequence followed by a fixed
protospacer, TCCTCA, that is not complementary to the Cas12a guide RNA. The non-
target strand was synthesized using a 6 bp fully randomized region immediately adjacent
the PAM. The two strands were duplexed using a standard annealing reaction, resulting
in a library of sequences with varying levels of flexibility. This library was incubated with
the Cas12a effector complex and run on a non-denaturing polyacrylamide gel. As in a
typical SELEX-seq experiment, the bound fraction is isolated, purified, and amplified then
sent for sequencing. Since the two strands of the library are non-identical, amplification
66
will result in 50% of the sample corresponding to the target strand and 50%
corresponding to the non-target strand. After sequencing, reads from the non-target
strands can be identified by their variations from the predefined target strand.
Unfortunately, this makes analysis of perfectly DNA impossible. However, based on the
previously described experiment, these sequences are the least flexible and least likely to
be bound by the Cas12a complex.
Expected observations
With a fully randomized library, the initial distribution in the number of mismatches can
be predicted using simple combinatorics. The probability of observing 𝑚 mismatches,
can be calculated as shown below.
𝑝 𝑚 = (
3
4
)
𝑚 (
1
4
)
6 − 𝑚 (
6
𝑚 )
Given a particular position, the probability of a mismatch is 3/4. With a 6 bp
variable region, there are 6 choose 𝑚 possible sequences with exactly 𝑚 mismatches. It
is immediately apparent that a small proportion of our initial reads will contain 2 or fewer
mismatches (Figure 5.1). If we were only interested in the total number of mismatches,
this experimental design overly samples sequences with a larger number of mismatches.
However, if we are interested in the specific sequence of the mismatches, this framework
provides a more uniform sampling of all possible sequences. This is useful in analyzing
distinct types of mismatches which may be more amenable to the formation of non-
Watson-Crick base pairs, resulting in reduced flexibility. Furthermore, it may be possible
that bases within a large bubble could pair with non-opposing bases on the opposite
67
strand, resulting in abnormal secondary structure. This could lead to kinks in the DNA
which may promote or inhibit interaction with the Cas12a effector complex.
Figure 5.1 Expected distribution of total mismatches over a 6-bp region, given that one strand is
completely randomized
Additionally, since mismatches may not be adjacent along the randomized region, or may
be distant from the PAM, the total number of mismatches may not be informative of
bubble length or flexibility near the PAM. The fully randomized region allows us to
differentiate many of these scenarios and identify any potential features that were not
considered previously.
Results and discussion
The bound and unbound DNA fragments were barcoded and sent for sequencing as well
as the initial library. By including the unbound fraction, we were able to simultaneously
measure enrichment in the bound fraction and depletion in the unbound fraction. This
helped enhance the signal of sequences which were preferentially bound by the Cas12a
complex. For each unique sequence, given index i, we calculate the equilibrium binding
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0 1 2 3 4 5 6
proportion
# mismatches
Expected Mismatch Distribution
68
constant Ki as described for the Spec-seq protocol [84] and as shown below, with pi,
indicating the proportion of reads in the bound (B) or unbound (U) fractions.
𝐾 𝑖 =
𝑝 𝑖 𝐵 𝑝 𝑖 𝑈
Sequences that are more likely to be bound than unbound achieve a K > 1.
Additionally, we chose to filter out sequences which occurred fewer than twenty times in
total across the bound and unbound fractions because these observations are noisy and
fluctuate greatly with minor changes to either count.
In our first comparison, we look at the average K stratified by the total number of
mismatches relative to the target strand (Figure 5.2). Beyond three mismatches, we see
a strong positive correlation between the number of mismatches and the likelihood of
being bound. This agrees with our expectation that mismatched sequences will exhibit
increased flexibility.
Figure 5.2 Bar chart showing the average equilibrium constant (K) for sequences with a variable
number of mismatches
Next, we investigate whether the position of the mismatch impacts preferential
binding. As a control, this is compared across sequences with the same total number of
0
0.5
1
1.5
2
2.5
3
3.5
1 2 3 4 5 6
average K
# mismatches
Preferential Binding of Mismatched Sequences
69
mismatches. Additionally, instead of looking at the average K for each group, we consider
the maximum. Specific types of mutations can drastically alter the observed equilibrium
constant, and we are most interested in the maximum impact by mutating given positions.
Here we represent the location of the mismatch relative to the PAM, with position 1 being
the first position adjacent to the PAM and position 6 being the 6
th
position away from the
PAM. Using a single mismatch, each group would be represented by 3 unique sequences.
At the sequencing depth utilized, this made comparisons between sequences with a single
mismatch relatively uninformative (Supplementary Figure 5.1). Instead, we show the
comparison between sequences with at least 2 mismatches, which are represented by at
least 9 unique sequences. Looking at sequences containing two mismatches, we find
those with the largest K are more likely to have mismatches at positions near the PAM,
particularly at positions 1 or 2 (Figure 5.3). However, the sequences containing
mismatches at positions 1 and 2 were comparatively unfavorable. I hypothesize that
gapped mismatches are able to destabilize the intervening base pairs to further increase
flexibility of the duplex. Indeed, we find many sequences where adjacent mismatches
exhibit a reduced binding by the Cas12a complex.
70
Figure 5.3 Bar chart showing the maximum K for sequences with 2 mismatches at variable
positions relative to the PAM
Looking at sequences with exactly 3 mismatches, location bias becomes much
more influential (Figure 5.4). The most favorable mispairings occur at positions 1, 3, and
5, resulting in a nearly 10-fold increase in binding relative to the weakest sequence with
3 mismatches. If mismatches are gapped by exactly 1 bp, the intervening bp is not
stabilized by stacking interactions while paired with its complement. Therefore,
mutations at positions 1, 3, and 5 may be able to easily destabilize pairing at 5 positions,
resulting in a substantial increase in flexibility near the PAM. Comparatively, mismatches
with a similar spacing that start farther from the PAM (e.g., 2,4,6) are not enriched,
reinforcing the importance of flexibility near the PAM.
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
1,5 1,3 1,6 2,5 3,5 1,4 2,4 4,6 2,6 1,2 3,4 5,6 2,3 3,6 4,5
maximum K
mismatch locations
Max K for Sequences with 2 Mismatches
71
Figure 5.4 Bar chart showing the maximum K for sequences with 3 mismatches at variable
positions relative to the PAM
Looking at sequences with 4 or 5 mismatches, we see similar trends, with
preferential binding when positions near the PAM are mutated and separated by small
gaps (Supplementary Figures 5.2-5.3). In a follow up study, it would be interesting to see
how cytosine modifications affect flexibility and off-target binding. Various cytosine
modification have been found to alter DNA flexibility, with 5-mC resulting in increased
flexibility [85]. This may suggest that off-targets in methylated regions could be more
likely to be bound than their unmethylated counterparts. Additionally, while flexibility
may facilitate the necessary kinking induced by the Cas12a complex, duplex stability may
also affect this process. A weak duplex is able to be melted more easily and this melting
can facilitate kinking [86]. In our case, we did not evaluate differing duplexes because we
use a constant sequence on the target strand, but a more typical SELEX-seq approach
could be used in a future study to measure binding against all possible duplexes.
0
1
2
3
4
5
6
7
8
1,3,5
1,2,4
1,2,3
1,2,5
1,3,4
1,3,6
1,2,6
3,5,6
2,3,5
2,5,6
2,4,5
1,4,5
1,4,6
2,3,6
2,3,4
1,5,6
4,5,6
3,4,5
2,4,6
3,4,6
maximum K
mismatch locations
Max K for Sequences with 3 Mismatches
72
Chapter 6: Investigating MEF2B binding preferences using
SELEX-seq and molecular dynamics
Overview
After becoming a member of the Rohs lab, I joined Carolina, now Dr. Machado, and fellow
collaborators in the study of MEF2B, a member of the myocyte enhancer factor-2 (MEF2)
family. With me as the second author, our work was published 2020 in Nucleic Acids
Research [32]. Members of the MEF2 family of transcription factors are responsible for a
range of biological processes from the development of various muscle types to the
formation of neurons in the brain [87-89]. MEF2B is particularly interesting due to its
role in lymph node differentiation with mutations leading to B-cell lymphoma [90-92].
MEF2 transcription factors bind to their targets as homo- or heterodimers, recognizing
the 5’-YTAW4TAR-3’ motif [93, 94]. Observations of MEF2 crystal structures bound to
DNA reveal several base-specific contacts to the peripheral nucleotides of the binding site,
but only non-specific minor groove contacts in the central W4 region of the motif [32, 95-
97]. Without base-specific contacts in this region, MEF2 transcription factors
nevertheless bind to specific targets in vivo. If shape recognition is utilized in this region,
as is suggested by the crystal structures, then it is important to consider
interdependencies between nucleotides in the prediction of optimal binding sites.
SELEX-seq of MEF2B
To investigate this further, Dr. Machado first performed SELEX-seq using wild type
MEF2B [32]. Potential binding sites were compared by analyzing the relative
73
enrichments of all 10-mers conforming to the YTAW4TAR motif. The most enriched
sequences were often found to contain an A-tract within the central W4 region. Since A-
tracts within this region create a non-palindromic binding site, this finding is particularly
fascinating, since MEF2B binds as a homodimer. Interestingly, binding sites containing A-
tracts were also found to exhibit asymmetric sensitivity to variations at the 5’ and 3’
peripheral positions of the binding site. In previous works, A-tracts have been shown to
exhibit altered shape features from their 5’ to 3’ ends, including a narrowing minor
groove width and increased bending [98-100]. Although A-tracts create non-palindromic
binding sites, the resulting A-tract polarity appears to have an overall positive affect on
binding. This also explains the asymmetric specificity observed at the two peripheral
ends outside of the A-tract [32].
SELEX-seq was also used to probe four MEF2B mutants that are known to be
associated with cancer [32]. This includes two N-terminal mutants, K4E and K5E, as well
as two recognition helix mutants, R15G and K23R. Although SELEX-seq cannot be used to
directly compare affinities between experiments, it is useful in revealing changes in
specificity. Although the central W4 region did not exhibit noticeably altered specificity
between MEF2B mutants, we did see changes when comparing the peripheral triplet
preferences. The K23R mutant was found to be far more tolerant of DNA variations in the
first position of the peripheral triplets relative to the wild-type protein. R15G also
exhibits this phenomenon to a lesser extent, but K4E and K5E triplet preferences do not
appear to differ significantly from the wild type. This was particularly surprising, since
previous studies suggest that K4E is one of the most potent mutations in causing non-
Hodgkin’s lymphoma [101], and K5E is a very similar mutation.
74
Molecular dynamics simulations of MEF2B
To further explore the biomechanical impacts of these mutations, I performed molecular
dynamics (MD) simulations of the four MEF2B variants bound to DNA, using a previously
solved X-ray crystal structure a starting point (PDB ID: 1TQE) [102]. The DNA contains
the binding site 5’-TTATAAATAG-3’. As seen in a representative snapshot from the
simulation, the N-terminal tail fits into the minor groove of the DNA and contains several
positive residues (Figure 6.1A).
Figure 6.1 Backbone contacts between MEF2B and the minor groove of DNA (A) Residues of the
N-terminal shown interacting with the minor groove of the DNA in a representative snapshot from
the MD simulation of wild type MEF2B (B) Average minor groove width along the binding site over
the entire MD simulation for wild type MEF2B and four mutants.
While R3 sits within the minor groove, K4 and K5 span the minor groove to make
contacts with the negatively charged backbone of the DNA. By replacing either of these
residues with a negatively charged glutamate, this interaction is no longer favorable.
Looking at the average minor groove width over the entire simulation, we see widening
in the K4E and K5E mutants compared to the WT (Figure 6.1B). Although we didn’t
observe N-terminal tail ejection during the course of the simulation, literature suggests
that widening of the minor groove will result in reduced negativity [103], which will in
75
Figure 6.2 Comparison of hydrogen bonding between K23 and R23 of MEF2B with DNA (A)
Interaction of K23 from WT MEF2B with the ApG step at the 3’ end of the binding site (B)
Interaction of R23 from the K23R mutant of MEF2B with a 5’ ApA step and (C) a 3’ ApG step and
an opposing base.
turn destabilize binding by R3 [9]. Furthermore, since these are shape-specific contacts
occurring in the minor groove, they are not expected to affect sequence-readout. This
explains why we did not see modulated sequence preferences based on the SELEX-seq
data, even though mutations likely reduced overall binding affinity.
In the major groove of the DNA, K23 forms three hydrogen bonds with the ApG
step of the 3’ triplet, TAG, including a bidentate bond with guanine (Figure 6.2A). At the
5’ end, K23 does not form any hydrogen bonds with the DNA. From our previous analysis
of SELEX-seq data, the 5’ triplet, TTA, is bound at a much lower rate than the optimal
sequence CTA. To investigate the importance of this interaction, we ran another
simulation after mutating the 5’ triplet to CTA, which is palindromic to the 3’ triplet. In
this simulation, the three hydrogen bonds are seen at both ApG steps on the 5’ and 3’ ends
of the binding site. This further exemplifies that these interactions are stable and
reproducible across simulations, and are important in determining sequence specificity.
Although lysine and arginine are both positive residues, their geometries are quite
distinct. Whereas lysine contains three hydrogen bond donors coming from one nitrogen
atom, arginine contains five, which are spread across three distinct nitrogen atoms. Based
76
on our simulation of the K23R mutant, this creates more opportunities to form hydrogen
bonds with the DNA, even when the sequence is suboptimal. Opposed to wild type MEF2B,
the K23R mutant forms hydrogen bonds with the 5’ triplet, TTA (Figure 6.2B), as well as
the 3’ triplet, TAG (Figure 6.2C). Although hydrogen bonding occurs at both ends of the
binding site, more bonds are observed at the 3’ end. This observation is consistent with
the SELEX-seq data for the K23R mutant, which shows TAG as the optimal triplet, with
increased tolerance for variations. The R15G mutant also exhibited reduced triplet
specificity in the SELEX-seq data, but R15 only contacts the highly flexible ends of the
DNA in our MD simulation, which are not representative of in vivo conditions. For this
reason, I chose not to speculate on its mechanism of DNA binding. In a future work, the
DNA could be extended to provide a more stable binding interface. Additionally, it would
be interesting to see if lysine to arginine mutations result in reduced specificity in other
systems as well.
77
References
1. Davey, C.A., et al., Solvent Mediated Interactions in the Structure of the Nucleosome
Core Particle at 1.9Å Resolution††We dedicate this paper to the memory of Max
Perutz who was particularly inspirational and supportive to T.J.R. in the early stages
of this study. Journal of Molecular Biology, 2002. 319(5): p. 1097-1113.
2. Paull, T.T., M.J. Haykinson, and R.C. Johnson, The nonspecific DNA-binding and-
bending proteins HMG1 and HMG2 promote the assembly of complex nucleoprotein
structures. Genes & development, 1993. 7(8): p. 1521-1534.
3. Slattery, M., et al., Absence of a simple code: how transcription factors read the
genome. Trends in biochemical sciences, 2014. 39(9): p. 381-399.
4. Lee, T.I. and R.A. Young, Transcriptional regulation and its misregulation in disease.
Cell, 2013. 152(6): p. 1237-51.
5. Rohs, R., et al., Origins of specificity in protein-DNA recognition. Annu Rev Biochem,
2010. 79: p. 233-69.
6. Hegde, R.S., et al., Crystal structure at 1.7 Å of the bovine papillomavirus-1 E2 DMA-
binding domain bound to its DNA target. Nature, 1992. 359(6395): p. 505-512.
7. Sagendorf, J.M., et al., DNAproDB: an expanded database and web-based tool for
structural analysis of DNA–protein complexes. Nucleic acids research, 2020. 48(D1):
p. D277-D287.
8. Nikolova, E.N., et al., Probing sequence-specific DNA flexibility in A-tracts and
pyrimidine-purine steps by nuclear magnetic resonance 13C relaxation and
molecular dynamics simulations. Biochemistry, 2012. 51(43): p. 8654-8664.
9. Rohs, R., et al., The role of DNA shape in protein–DNA recognition. Nature, 2009.
461(7268): p. 1248-1253.
10. Zhou, T., et al., DNAshape: a method for the high-throughput prediction of DNA
structural features on a genomic scale. Nucleic Acids Research, 2013. 41(W1): p.
W56-W62.
11. Lavery, R. and H. Sklenar, Defining the structure of irregular nucleic acids:
conventions and principles. Journal of Biomolecular Structure and Dynamics, 1989.
6(4): p. 655-667.
12. Chiu, T.-P., et al., DNAshapeR: an R/Bioconductor package for DNA shape prediction
and feature encoding. Bioinformatics, 2016. 32(8): p. 1211-1213.
78
13. Li, J., et al., Expanding the repertoire of DNA shape features for genome-scale studies
of transcription factor binding. Nucleic Acids Research, 2017. 45(22): p. 12877-
12887.
14. Mardis, E.R., ChIP-seq: welcome to the new frontier. Nature methods, 2007. 4(8): p.
613-614.
15. Gutin, J., et al., Fine-Resolution Mapping of TF Binding and Chromatin Interactions.
Cell Rep, 2018. 22(10): p. 2797-2807.
16. Park, D., et al., Widespread misinterpretable ChIP-seq bias in yeast. PloS one, 2013.
8(12): p. e83506.
17. Berger, M.F., et al., Compact, universal DNA microarrays to comprehensively
determine transcription-factor binding site specificities. Nature biotechnology,
2006. 24(11): p. 1429-1435.
18. Jolma, A., et al., Multiplexed massively parallel SELEX for characterization of human
transcription factor binding specificities. Genome research, 2010. 20(6): p. 861-
873.
19. Yang, L., et al., Transcription factor family-specific DNA shape readout revealed by
quantitative specificity models. Molecular Systems Biology, 2017. 13(2): p. 910.
20. Riley, T.R., et al., SELEX-seq: a method for characterizing the complete repertoire of
binding site preferences for transcription factor complexes, in Hox Genes. 2014,
Springer. p. 255-278.
21. Slattery, M., et al., Cofactor binding evokes latent differences in DNA binding
specificity between Hox proteins. Cell, 2011. 147(6): p. 1270-1282.
22. Stormo, G.D., et al., Use of the ‘Perceptron’algorithm to distinguish translational
initiation sites in E. coli. Nucleic acids research, 1982. 10(9): p. 2997-3011.
23. Zhao, Y. and G.D. Stormo, Quantitative analysis demonstrates most transcription
factors require only simple models of specificity. Nature biotechnology, 2011. 29(6):
p. 480-483.
24. Zhang, L., et al., SelexGLM differentiates androgen and glucocorticoid receptor DNA-
binding preference over an extended binding site. Genome research, 2018. 28(1): p.
111-121.
25. Ruan, S., S.J. Swamidass, and G.D. Stormo, BEESEM: estimation of binding energy
models using HT-SELEX data. Bioinformatics, 2017. 33(15): p. 2288-2295.
26. Rastogi, C., et al., Accurate and sensitive quantification of protein-DNA binding
affinity. Proceedings of the National Academy of Sciences, 2018. 115(16): p.
E3692-E3701.
79
27. Chiu, T.-P., et al., TFBSshape: an expanded motif database for DNA shape features of
transcription factor binding sites. Nucleic Acids Research, 2020. 48(D1): p. D246-
D255.
28. Weirauch, M.T., et al., Evaluation of methods for modeling transcription factor
sequence specificity. Nature biotechnology, 2013. 31(2): p. 126-134.
29. Isakova, A., et al., SMiLE-seq identifies binding motifs of single and dimeric
transcription factors. Nature methods, 2017. 14(3): p. 316-322.
30. Bailey, T.L. and C. Elkan, Fitting a mixture model by expectation maximization to
discover motifs in bipolymers. 1994.
31. Yang, L., et al., Transcription factor family‐specific DNA shape readout revealed by
quantitative specificity models. Molecular systems biology, 2017. 13(2): p. 910.
32. Dantas Machado, A.C., et al., Landscape of DNA binding signatures of myocyte
enhancer factor-2B reveals a unique interplay of base and shape readout. Nucleic
acids research, 2020. 48(15): p. 8529-8544.
33. Abe, N., et al., Deconvolving the recognition of DNA shape from sequence. Cell, 2015.
161(2): p. 307-18.
34. Dillon, N. and P. Sabbattini, Functional gene expression domains: defining the
functional unit of eukaryotic gene regulation. Bioessays, 2000. 22(7): p. 657-665.
35. Li, L. and Z. Wunderlich, An Enhancer's Length and Composition Are Shaped by Its
Regulatory Task. Front Genet, 2017. 8: p. 63.
36. Arnoult, L., et al., Emergence and diversification of fly pigmentation through
evolution of a gene regulatory module. Science, 2013. 339(6126): p. 1423-1426.
37. Le Poul, Y., et al., Regulatory encoding of quantitative variation in spatial activity of
a Drosophila enhancer. Science advances, 2020. 6(49): p. eabe2955.
38. Golson, M.L. and K.H. Kaestner, Fox transcription factors: from development to
disease. Development, 2016. 143(24): p. 4558-4570.
39. Lalmansingh, A.S., et al., Multiple modes of chromatin remodeling by Forkhead box
proteins. Biochimica et Biophysica Acta (BBA) - Gene Regulatory Mechanisms,
2012. 1819(7): p. 707-715.
40. Fornes, O., et al., JASPAR 2020: update of the open-access database of transcription
factor binding profiles. Nucleic Acids Research, 2019. 48(D1): p. D87-D92.
41. Hume, M.A., et al., UniPROBE, update 2015: new tools and content for the online
database of protein-binding microarray data on protein–DNA interactions. Nucleic
acids research, 2015. 43(D1): p. D117-D122.
80
42. Zhou, T., et al., Quantitative modeling of transcription factor binding specificities
using DNA shape. Proceedings of the National Academy of Sciences, 2015. 112(15):
p. 4654-4659.
43. Tanay, A., Extensive low-affinity transcriptional interactions in the yeast genome.
Genome research, 2006. 16(8): p. 962-972.
44. Crocker, J., et al., Low affinity binding site clusters confer hox specificity and
regulatory robustness. Cell, 2015. 160(1-2): p. 191-203.
45. Ostrow, A.Z., et al., Fkh1 and Fkh2 bind multiple chromosomal elements in the S.
cerevisiae genome with distinct specificities and cell cycle dynamics. PloS one, 2014.
9(2).
46. Stormo, G.D., et al., Use of the ‘Perceptron’ algorithm to distinguish translational
initiation sites in E. coli. Nucleic Acids Research, 1982. 10(9): p. 2997-3011.
47. Ruan, S. and G.D. Stormo, Comparison of discriminative motif optimization using
matrix and DNA shape-based models. BMC Bioinformatics, 2018. 19(1): p. 86.
48. Rube, H.T., et al., A unified approach for quantifying and interpreting DNA shape
readout by transcription factors. Molecular Systems Biology, 2018. 14(2): p. e7902.
49. Sharon, E., S. Lubliner, and E. Segal, A feature-based approach to modeling protein–
DNA interactions. PLoS computational biology, 2008. 4(8): p. e1000154.
50. Jin, C., et al., Dynamic DNA Contacts Observed in the NMR Structure of Winged Helix
Protein-DNA Complex. Journal of Molecular Biology, 1999. 289(4): p. 683-690.
51. Clark, K.L., et al., Co-crystal structure of the HNF-3/fork head DNA-recognition motif
resembles histone H5. Nature, 1993. 364(6436): p. 412-420.
52. Tsai, K.-L., et al., Crystal Structure of the Human FOXK1a-DNA Complex and Its
Implications on the Diverse Binding Specificity of Winged Helix/Forkhead Proteins*.
Journal of Biological Chemistry, 2006. 281(25): p. 17400-17409.
53. Li, J., et al., Structure of the Forkhead Domain of FOXA2 Bound to a Complete DNA
Consensus Site. Biochemistry, 2017. 56(29): p. 3745-3753.
54. Riley, T.R., et al., Building accurate sequence-to-affinity models from high-
throughput in vitro protein-DNA binding data using FeatureREDUCE. Elife, 2015. 4:
p. e06397.
55. Pedregosa, F., et al., Scikit-learn: Machine learning in Python. the Journal of
machine Learning research, 2011. 12: p. 2825-2830.
56. Badis, G., et al., Diversity and complexity in DNA recognition by transcription factors.
Science, 2009. 324(5935): p. 1720-1723.
81
57. Weirauch, M.T., et al., Evaluation of methods for modeling transcription factor
sequence specificity. Nature Biotechnology, 2013. 31(2): p. 126-134.
58. Zhu, C., et al., High-resolution DNA-binding specificity analysis of yeast transcription
factors. Genome research, 2009. 19(4): p. 556-566.
59. Gordân, R., et al., Genomic regions flanking E-box binding sites influence DNA
binding specificity of bHLH transcription factors through DNA shape. Cell reports,
2013. 3(4): p. 1093-1104.
60. Abe, N., et al., Deconvolving the recognition of DNA shape from sequence. Cell, 2015.
161(2): p. 307-318.
61. Azad, R.N., et al., Experimental maps of DNA structure at nucleotide resolution
distinguish intrinsic from protein-induced DNA deformations. Nucleic acids
research, 2018. 46(5): p. 2636-2647.
62. Haran, T.E. and U. Mohanty, The unique structure of A-tracts and intrinsic DNA
bending. Q Rev Biophys, 2009. 42(1): p. 41-81.
63. Badis, G., et al., A library of yeast transcription factor motifs reveals a widespread
function for Rsc3 in targeting nucleosome exclusion at promoters. Molecular cell,
2008. 32(6): p. 878-887.
64. Rogers, J.M., et al., Bispecific Forkhead Transcription Factor FoxN3 Recognizes Two
Distinct Motifs with Different DNA Shapes. Molecular Cell, 2019. 74(2): p. 245-
253.e6.
65. Tsai, S.Q., et al., GUIDE-seq enables genome-wide profiling of off-target cleavage by
CRISPR-Cas nucleases. Nature biotechnology, 2015. 33(2): p. 187-197.
66. Kim, D., et al., Digenome-seq: genome-wide profiling of CRISPR-Cas9 off-target
effects in human cells. Nature methods, 2015. 12(3): p. 237-243.
67. Tsai, S.Q., et al., CIRCLE-seq: a highly sensitive in vitro screen for genome-wide
CRISPR–Cas9 nuclease off-targets. Nature methods, 2017. 14(6): p. 607.
68. Huston, N.C., et al., Identification of guide-intrinsic determinants of Cas9 specificity.
The CRISPR journal, 2019. 2(3): p. 172-185.
69. Zhang, L., et al., Systematic in vitro profiling of off-target affinity, cleavage and
efficiency for CRISPR enzymes. Nucleic acids research, 2020. 48(9): p. 5037-5053.
70. Wu, X., et al., Genome-wide binding of the CRISPR endonuclease Cas9 in mammalian
cells. Nat Biotechnol, 2014. 32(7): p. 670-6.
71. O'Geen, H., et al., A genome-wide analysis of Cas9 binding specificity using ChIP-seq
and targeted sequence capture. Nucleic Acids Res, 2015. 43(6): p. 3389-404.
82
72. Jiang, F., et al., Structures of a CRISPR-Cas9 R-loop complex primed for DNA cleavage.
Science, 2016. 351(6275): p. 867-871.
73. Jinek, M., et al., A Programmable Dual-RNA–Guided DNA Endonuclease in Adaptive
Bacterial Immunity. Science, 2012. 337(6096): p. 816-821.
74. Sternberg, S.H., et al., DNA interrogation by the CRISPR RNA-guided endonuclease
Cas9. Nature, 2014. 507(7490): p. 62-67.
75. Swarts, D.C., J. van der Oost, and M. Jinek, Structural basis for guide RNA processing
and seed-dependent DNA targeting by CRISPR-Cas12a. Molecular cell, 2017. 66(2):
p. 221-233. e4.
76. Pattanayak, V., et al., High-throughput profiling of off-target DNA cleavage reveals
RNA-programmed Cas9 nuclease specificity. Nature Biotechnology, 2013. 31(9): p.
839-843.
77. Fu, B.X.H., et al., Distinct patterns of Cas9 mismatch tolerance in vitro and in vivo.
Nucleic Acids Research, 2016. 44(11): p. 5365-5377.
78. Hsu, P.D., et al., DNA targeting specificity of RNA-guided Cas9 nucleases. Nature
Biotechnology, 2013. 31(9): p. 827-832.
79. Stella, S., P. Alcón, and G. Montoya, Structure of the Cpf1 endonuclease R-loop
complex after target DNA cleavage. Nature, 2017. 546(7659): p. 559-563.
80. Stella, S., et al., Conformational Activation Promotes CRISPR-Cas12a Catalysis and
Resetting of the Endonuclease Activity. Cell, 2018. 175(7): p. 1856-1871.e21.
81. Yang, M., et al., The Conformational Dynamics of Cas9 Governing DNA Cleavage Are
Revealed by Single-Molecule FRET. Cell Reports, 2018. 22(2): p. 372-382.
82. Mohr, S.E., et al., CRISPR guide RNA design for research applications. The FEBS
Journal, 2016. 283(17): p. 3232-3238.
83. Yamano, T., et al., Crystal structure of Cpf1 in complex with guide RNA and target
DNA. Cell, 2016. 165(4): p. 949-962.
84. Stormo, G.D., Z. Zuo, and Y.K. Chang, Spec-seq: determining protein–DNA-binding
specificity by sequencing. Briefings in Functional Genomics, 2015. 14(1): p. 30-38.
85. Ngo, T.T.M., et al., Effects of cytosine modifications on DNA flexibility and
nucleosome mechanical stability. Nature Communications, 2016. 7(1): p. 10813.
86. Forties, R.A., R. Bundschuh, and M.G. Poirier, The flexibility of locally melted DNA.
Nucleic acids research, 2009. 37(14): p. 4580-4586.
87. Assali, A., A.J. Harrington, and C.W. Cowan, Emerging roles for MEF2 in brain
development and mental disorders. Current opinion in neurobiology, 2019. 59: p.
49-58.
83
88. Edmondson, D.G., et al., Mef2 gene expression marks the cardiac and skeletal muscle
lineages during mouse embryogenesis. Development, 1994. 120(5): p. 1251-1263.
89. Gossett, L.A., et al., A new myocyte-specific enhancer-binding factor that recognizes
a conserved element associated with multiple muscle-specific genes. Molecular and
cellular biology, 1989. 9(11): p. 5022-5033.
90. Krenács, D., et al., Pattern of MEF2B expression in lymphoid tissues and in malignant
lymphomas. Virchows Archiv, 2015. 467(3): p. 345-355.
91. Brescia, P., et al., MEF2B instructs germinal center development and acts as an
oncogene in B cell lymphomagenesis. Cancer cell, 2018. 34(3): p. 453-465. e9.
92. Morin, R.D., et al., Genetic landscapes of relapsed and refractory diffuse large B-cell
lymphomas. Clinical cancer research, 2016. 22(9): p. 2290-2300.
93. Andrés, V., M. Cervera, and V. Mahdavi, Determination of the consensus binding site
for MEF2 expressed in muscle and brain reveals tissue-specific sequence constraints.
Journal of Biological Chemistry, 1995. 270(40): p. 23246-23249.
94. Jolma, A., et al., DNA-binding specificities of human transcription factors. Cell, 2013.
152(1-2): p. 327-39.
95. Santelli, E. and T.J. Richmond, Crystal structure of MEF2A core bound to DNA at 1.5
Å resolution. Journal of molecular biology, 2000. 297(2): p. 437-449.
96. Wu, Y., et al., Structure of the MADS-box/MEF2 domain of MEF2A bound to DNA and
its implication for myocardin recruitment. Journal of molecular biology, 2010.
397(2): p. 520-533.
97. Han, A., et al., Sequence-specific recruitment of transcriptional co-repressor Cabin1
by myocyte enhancer factor-2. Nature, 2003. 422(6933): p. 730-734.
98. Koo, H.-S., H.-M. Wu, and D.M. Crothers, DNA bending at adenine· thymine tracts.
Nature, 1986. 320(6062): p. 501-506.
99. MacDonald, D., et al., Solution structure of an A-tract DNA bend1 1Edited by I. Tinoco.
Journal of Molecular Biology, 2001. 306(5): p. 1081-1098.
100. Merling, A., N. Sagaydakova, and T.E. Haran, A-Tract Polarity Dominate the
Curvature in Flanking Sequences. Biochemistry, 2003. 42(17): p. 4978-4984.
101. Ying, C.Y., et al., MEF2B mutations lead to deregulated expression of the oncogene
BCL6 in diffuse large B cell lymphoma. Nature immunology, 2013. 14(10): p. 1084-
1092.
102. Han, A., et al., Mechanism of recruitment of class II histone deacetylases by myocyte
enhancer factor-2. Journal of molecular biology, 2005. 345(1): p. 91-102.
84
103. Honig, B. and A. Nicholls, Classical electrostatics in biology and chemistry. Science,
1995. 268(5214): p. 1144-1149.
85
Appendices
Supplementary Tables
AR GR
TDC
BEESEM
SelexGLM
MEME
Supplementary Table 2.1 Comparison of AR and GR PWMs generated from various methods. The
TDC PWM is generated using all 10-mers aligned with a shift of ±5, weighting each sequence by its
relative enrichment. The units of the SelexGLM method are provided in terms of -ΔΔG/RT as
described in the original method. All others are shown in terms of bits.
86
MEF2B AbdA-Exd
TDC
BEESEM
SelexGLM
MEME
Supplementary Table 2.2 Comparison of MEF2B and AbdA-Exd PWMs generated from various
methods.
Dfd-Exd Lab-Exd
TDC
BEESEM
SelexGLM
MEME
Supplementary Table 2.3 Comparison of Dfd-Exd and Lab-Exd PWMs generated from various
methods.
87
PbFl-Exd Scr-Exd
TDC
BEESEM
SelexGLM
MEME
Supplementary Table 2.4 Comparison of PbFl-Exd and Scr-Exd PWMs generated from various
methods.
UbxIa-Exd UbxIVa-Exd
TDC
BEESEM
SelexGLM
MEME
Supplementary Table 2.5 Comparison of UbxIa-Exd and UbxIVa-Exd PWMs generated from
various methods.
88
BEESEM SelexGLM MEME
AR 63% 44% 43%
GR 68% 74% 56%
MEF2B 86% 85% 57%
AbdA-Exd 98% 97% 95%
Dfd-Exd 97% 89% 65%
Lab-Exd 96% 90% 51%
PbFl-Exd 85% 83% 72%
Scr-Exd 96% 92% 87%
UbxIa-Exd 99% 97% 93%
UbxIVa-Exd 100% 98% 99%
Supplementary Table 2.6 Alignment agreement with TDC between various alignment methods
TDC BEESEM SelexGLM MEME
AR 0.84 0.80 0.83 0.83
GR 0.76 0.75 0.74 0.78
MEF2B 0.56 0.60 0.63 0.40
AbdA-Exd 0.67 0.69 0.70 0.66
Dfd-Exd 0.65 0.64 0.59 0.64
Lab-Exd 0.69 0.66 0.59 0.40
PbFl-Exd 0.75 0.76 0.75 0.73
Scr-Exd 0.82 0.76 0.78 0.65
UbxIa-Exd 0.81 0.77 0.73 0.59
UbxIVa-Exd 0.85 0.85 0.84 0.85
Supplementary Table 2.7 MLR model performances for various alignment methods
89
Enrichment
TDC +
Enrichment
BEESEM SelexGLM
MEME +
Enrichment
AR 0h 3m 9s 0h 3m 41s 21h 17m 59s 1h 26m 0s 0h 11m 32s
GR 0h 3m 5s 0h 3m 40s 19h 6m 31s 0h 41m 6s 0h 18m 55s
MEF2B 0h 4m 1s 0h 4m 28s 10h 6m 30s 4h 46m 34s 0h 11m 56s
AbdA-Exd 0h 1m 47s 0h 1m 57s 4h 24m 37s 0h 27m 10s 0h 1m 53s
Dfd-Exd 0h 2m 2s 0h 2m 9s 3h 58m 42s 0h 21m 53s 0h 2m 7s
Lab-Exd 0h 2m 23s 0h 2m 37s 4h 28m 12s 0h 26m 6s 0h 2m 33s
PbFl-Exd 0h 1m 30s 0h 2m 17s 9h 8m 54s 1h 34m 51s 0h 19m 35s
Scr-Exd 0h 1m 46s 0h 2m 1s 4h 14m 29s 0h 36m 9s 0h 2m 0s
UbxIa-Exd 0h 2m 9s 0h 2m 27s 5h 18m 2s 0h 53m 4s 0h 3m 18s
UbxIVa-Exd 0h 1m 51s 0h 2m 4s 4h 31m 49s 0h 27m 14s 0h 2m 2s
Supplementary Table 2.8 Wall-clock times for various alignment methods
Enrichment
TDC +
Enrichment
BEESEM SelexGLM
MEME +
Enrichment
AR 19 GB 20 GB 63 GB 116 GB 20 GB
GR 19 GB 19 GB 55 GB 81 GB 19 GB
MEF2B 23 GB 23 GB 31 GB 228 GB 23 GB
AbdA-Exd 18 GB 18 GB 15 GB 42 GB 18 GB
Dfd-Exd 18 GB 18 GB 14 GB 35 GB 18 GB
Lab-Exd 18 GB 18 GB 16 GB 47 GB 18 GB
PbFl-Exd 12 GB 12 GB 28 GB 110 GB 12 GB
Scr-Exd 16 GB 16 GB 15 GB 52 GB 16 GB
UbxIa-Exd 17 GB 17 GB 18 GB 68 GB 17 GB
UbxIVa-Exd 18 GB 18 GB 16 GB 47 GB 18 GB
Supplementary Table 2.9 Peak memory usage for various alignment methods
90
Oligo Name Sequence
Library GAGTTCTACAGTCCGACGATCCAGNNNNNNNNNNNNNNNNTCCGTATCGCTCCTCCAATG
Positive control GAGTTCTACAGTCCGACGATCCAGAAAAGGTAAACAAGAATCCGTATCGCTCCTCCAATG
Negative control GAGTTCTACAGTCCGACGATCCAGAGAGTTAGCCGATGTTTCCGTATCGCTCCTCCAATG
Forward Primer GAGTTCTACAGTCCGACGATC
Reverse Primer CATTGGAGGAGCGATACG
5’ adapter AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGA
3’ adapter – R0 CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCACGTGATCATTGGAGGAGCGATAC
3’ adapter – R1 Fkh1 CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCAACATCGCATTGGAGGAGCGATAC
3’ adapter – R1 Fkh2 CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCAGCCTAACATTGGAGGAGCGATAC
3’ adapter – R2 Fkh1 CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCACACTGTCATTGGAGGAGCGATAC
3’ adapter – R2 Fkh2 CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCAATTGGCCATTGGAGGAGCGATAC
3’ adapter – R1 Hcm1 CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCAATTGGCCATTGGAGGAGCGATAC
3’ adapter – R1 Fhl1 CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCAACATCGCATTGGAGGAGCGATAC
3’ adapter – R2 Hcm1 CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCAGATCTGCATTGGAGGAGCGATAC
3’ adapter – R2 Fhl1 CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCAGCCTAACATTGGAGGAGCGATAC
Positive flank GCTCAATTGTCAACATATCGG
Negative flank GCTCGGCGGTCAACAGGTCGG
| | | | | | | | |
1 10 20 30 40 50 60 70 80
Supplementary Table 3.1 Oligo sequences used for SELEX-seq and competitive binding
experiments. The 16-bp variable region of the library is bolded. Positive and negative controls
were used to determine the appropriate protein:DNA ratio to minimize non-specific binding, as
described in Materials and Methods. Forward and reverse primers were used in the amplification
of the library. Illumina adapters and barcodes were added to the final library products using four
cycles of PCR with the 5’ and 3’ adapter oligos shown above. The flanking positions that were
modulated for the competitive binding experiment are underlined.
91
Core
Sequence
Fkh1
( ΔΔG/RT)
Fkh2
( ΔΔG/RT)
Hcm1
( ΔΔG/RT)
Fhl1
( ΔΔG/RT)
GTAAACA 0.000 0.000 0.000 1.209
ATAAACA 0.536 0.524 0.191 1.266
GTCAACA 0.673 0.557 0.284 1.167
GTAAATA 0.835 0.550 1.205 1.456
ATCAACA 1.011 0.949 0.090 1.205
GTACACA 1.325 1.399 0.644 0.861
GACAACA 1.363 0.911 0.709 1.164
CACAACA 1.481 0.797 0.458 0.842
CAAAACA 1.483 0.790 0.783 1.166
CTAAACA 1.546 1.467 1.218 1.218
ATAAATA 1.548 1.298 1.610 1.441
AACAACA 1.610 0.934 0.235 1.129
GTCAATA 1.631 1.162 1.420 1.320
ACAAACA 1.692 1.299 0.761 0.976
ATACACA 1.862 1.793 0.707 0.904
ATATACA 1.881 1.764 1.314 1.176
GAAAACA 1.899 1.237 1.062 1.327
AATAACA 1.901 0.928 0.773 1.261
CCAAACA 1.926 1.349 0.958 0.862
CATAACA 1.938 1.072 0.871 1.037
AAAAACA 2.019 1.378 0.902 1.348
ATCAATA 2.028 1.699 1.335 1.338
GCAAACA 2.040 1.468 1.173 0.789
TACAACA 2.221 1.630 1.472 1.074
AGCAACA 2.285 1.966 1.114 0.842
CAAAATA 2.325 1.580 2.068 1.328
GGCAACA 2.426 1.975 1.635 1.075
AATAATA 2.440 1.511 2.071 1.436
CAATACA 2.475 2.014 1.453 1.065
GACAATA 2.475 1.952 2.190 1.327
GATAACA 2.482 1.484 1.448 1.196
ACCAACA 2.493 1.949 1.572 1.094
CACAATA 2.523 1.863 1.766 0.960
AAATACA 2.534 2.118 1.367 1.325
AGAAACA 2.577 2.432 1.516 1.291
GACGCA- 2.599 3.164 2.174 0.000
CATAATA 2.628 1.632 1.973 1.210
GAATACA 2.739 2.284 1.515 1.235
GTCGCA- 2.914 3.317 2.581 0.294
CATCACA 2.966 2.365 1.563 0.910
CACGCA- 3.048 3.168 2.272 0.070
AACGCA- 3.083 3.140 2.377 0.209
GACGCT- 3.209 3.383 2.904 0.385
TACGCA- 3.224 3.218 2.566 0.386
CGCGCA- 3.229 3.520 2.640 0.271
GACGCG- 3.233 3.444 2.738 0.194
GGCGCA- 3.406 3.565 2.694 0.360
GACTCA- 3.676 3.536 2.978 0.239
Supplementary Table 3.2 ΔΔG/RT measurements for our selected core sequences averaged over
every window and sorted by Fkh1
92
Supplementary Figures
Supplementary Figure 3.1 Distribution of the four bp at various positions along the 16-mer in
the initial library. This exemplifies the need for a position-specific background model when
determining enrichment.
Supplementary Figure 3.2 ΔΔG/RT measurements for each aligned core averaged over the 40
independent sets of aligned reads from the Fkh1 SELEX-seq experiments. Rows are clustered with
the UPGMA algorithm using Manhattan distance as the metric.
93
Supplementary Figure 3.3 ΔΔG/RT measurements for each aligned core averaged over the 40
independent sets of aligned reads from the Fkh2 SELEX-seq experiments. Rows are clustered with
the UPGMA algorithm using Manhattan distance as the metric.
Supplementary Figure 3.4 ΔΔG/RT measurements for each aligned core averaged over the 40
independent sets of aligned reads from the Hcm1 SELEX-seq experiments. Rows are clustered with
the UPGMA algorithm using Manhattan distance as the metric.
94
Supplementary Figure 3.5 ΔΔG/RT measurements for each aligned core averaged over the 40
independent sets of aligned reads from the Fhl1 SELEX-seq experiments. Rows are clustered with
the UPGMA algorithm using Manhattan distance as the metric.
Supplementary Figure 5.1 Bar chart showing the maximum K for sequences with 1 mismatch at
a variable position relative to the PAM
0
0.2
0.4
0.6
0.8
1
1.2
1 4 5 6 2 3
maximum K
mismatch location
Max K for Sequences with 1 Mismatch
95
Supplementary Figure 5.2 Bar chart showing the maximum K for sequences with 4 mismatches
at variable positions relative to the PAM
Supplementary Figure 5.3 Bar chart showing the maximum K for sequences with 5 mismatches
at variable positions relative to the PAM
0
10
20
30
40
50
60
maximum K
mismatch locations
Max K for Sequences with 4 Mismatches
0
5
10
15
20
25
30
35
40
45
1,2,3,5,6 1,2,3,4,5 1,2,3,4,6 1,2,4,5,6 1,3,4,5,6 2,3,4,5,6
maximum K
mismatch locations
Max K for Sequences with 5 Mismatches
Abstract (if available)
Abstract
For life to exist, cells must be able to grow, divide, and respond to their environment. All these processes require strict control over when genes are turned on, where they are expressed, and how they can be silenced. This can be controlled through a variety of means such as modification of transcriptional rate, translational rate, or degradation post-translation. Here, we focus on the modification of transcriptional rate through transcription factor binding. Transcription factors are DNA-binding proteins which bind to specific targets in-vivo to upregulate or downregulate expression. Understanding what factors play a role in target site recognition is an area of active research that is particularly interesting for closely related transcription factors which exhibit similar binding preferences.
Here, we investigate members of the forkhead box family of transcription factors, which play a role in a number of processes from tumor suppression to cell growth and differentiation. After generating SELEX-seq data for all four forkhead box homologs in S. cerevisiae, we developed a multi-step pipeline for alignment of full-length SELEX-seq reads which revealed important determinants of binding specificity outside the core of the binding site. Part of the pipeline used for the rapid alignment of k-mer level data was made publicly available through a webserver and was later applied to scanning and modifying transcription factor binding sites in-vivo. Separately, we developed a modified approach to SELEX-seq which revealed off-target binding of the CRISPR-Cas12a system to mismatched off-targets. Finally, I describe an earlier work exploring the binding preferences of the MEF2B transcription factor and the role of A-tract polarity.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Quantitative modeling of in vivo transcription factor–DNA binding and beyond
PDF
Machine learning of DNA shape and spatial geometry
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Profiling transcription factor-DNA binding specificity
PDF
Forkhead transcription factors regulate replication origin firing through dimerization and cell cycle-dependent chromatin binding in S. cerevisiae
PDF
Deciphering protein-nucleic acid interactions with artificial intelligence
PDF
Decoding protein-DNA binding determinants mediated through DNA shape readout
PDF
Probabilistic modeling and data integration to examine RNA-protein interactions
PDF
Understanding protein–DNA recognition in the context of DNA methylation
PDF
Computational methods for translation regulation analysis from Ribo-seq data
PDF
Malignant cell fraction prediction using deep learning: from point estimate to uncertainty quantification
PDF
DNA shape at transcription factor binding sites: from purifying selection to a new alphabet
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
PDF
Mapping genetic variants for nonsense-mediated mRNA decay regulation across human tissues
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
The relationship between DNA methylation and transcription factor binding in colon cancer cells
PDF
Characterization of the ZFX family of transcription factors that bind downstream of the start site of CpG island promoters
PDF
Fast search and clustering for large-scale next generation sequencing data
PDF
Comparative genomics of translational regulation
Asset Metadata
Creator
Cooper, Brendon Hunter
(author)
Core Title
Improved methods for the quantification of transcription factor binding using SELEX-seq
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Degree Conferral Date
2022-12
Publication Date
09/19/2022
Defense Date
07/21/2022
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
CRISPR-Cas,DNA-binding proteins,forkhead box,MEF2B,OAI-PMH Harvest,SELEX-seq,sequence alignment,transcription factor binding
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Rohs, Remo (
committee chair
), Chaisson, Mark (
committee member
), Di Felice, Rosa (
committee member
), Qin, Peter (
committee member
), Sun, Fengzhu (
committee member
)
Creator Email
bhcooper@ucdavis.edu,bhcooper@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC112013936
Unique identifier
UC112013936
Legacy Identifier
etd-CooperBren-11230
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Cooper, Brendon Hunter
Type
texts
Source
20220921-usctheses-batch-983
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
CRISPR-Cas
DNA-binding proteins
forkhead box
MEF2B
SELEX-seq
sequence alignment
transcription factor binding