Copyright 2022 Ethan Nhat-Huy Phan
A Diagrammatic Analysis of the Secondary Structural Ensemble
of CNG Trinucleotide Repeat
by
Ethan Nhat-Huy Phan
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirement for the Degree
DOCTOR OF PHILOSOPHY
(CHEMISTRY)
August 2022
Table of Contents
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Trinucleotide repeats expansion disorders as a motivating case
1.2 Basic notions about backbone entropy and free energy
1.3 Introduction to Dual graphs
Chapter 2: Topological Constraints and Their Conformational Entropic Penalties on RNA Folds
2.1.1 Relationship Between Constraints and Backbone Conformational Entropy
2.1.2 Topological Representation of Secondary and Tertiary Structural Constraints
2.1.3 Factoring Diagrams into Approximately Independent Pieces
2.1.4 Monte Carlo Simulation Studies
2.2 Results
2.2.1 Hairpin Loops
2.2.2 Initiation of a Second Hairpin
2.2.3 Two-way Junctions
2.2.4 Three-way Junctions
2.2.5 Initiation of a Third Hairpin
2.2.6 Four-way Junctions
2.3 Discussion
2.4 References
Chapter 3: Quantifying Structural Diversity of CNG Trinucleotide Repeats Using Diagrammatic Algorithms
3.1.1 Backbone Conformational Entropy and Secondary Structures
3.1.2 Monte Carlo Simulations
3.1.3 Evaluating Conformational Ensembles of CNG Repeats
3.2.1 Loop Initiation Entropies Involving Hoogsteen Pairs
3.2.2 Quadruplexes
3.2.3 Pseudoknots
3.2.4 Ensembles of CNG Repeats
3.3 Discussion
Chapter 4: Diagrammatic Approaches to RNA Structures with Trinucleotide Repeats
4.1 Materials and Methods
4.1.1 Graph Representations
4.1.2 Specializing to (CNG) Repeat Sequences
4.1.3 Graph Elements and Loop Entropy Contributions
4.1.4 Stabilities of GC|CG Helix Doublets and G-Quadruplexes
4.1.5 Diagrammatic Renormalization
4.2 Results and Discussion
4.3 References
Chapter 5: Conclusion
References
Bibliography
Appendix A: Additional Conformational Cost Data
Appendix B: H-Type Pseudoknot Closure Maps and Costs
List of Tables
Table 2.1 Free energy cost of forming a two-way junction in kcal/mol as a function of the 5’ and 3’ junction lengths in nt, 𝑏 and 𝑐, respectively. Error estimates from the simulation are given in parentheses.
Table 2.2 Free energy cost of forming a three-way junction in kcal/mol as a function of the 5’ and 3’ junction lengths in nucleotides (𝑎 and 𝑐, respectively), with the center junction length (𝑏) kept at 1 nt as a parameter.
Table 2.3 Free energy cost of forming a four-way junction in kcal/mol as a function of the 5’ and 3’ junction lengths in nucleotides (𝑎 and 𝑑, respectively), with the middle junction lengths fixed (𝑏 = 𝑐 = 4 nt).
Table 3.1 Initiation free energies for the a, b, and c loops inside a double-deck quadruplex structure from MC simulations. The typical statistical error on each value is approximately 0.05 kcal/mol.
Table 4.1 Contributions of loop entropies to the folding free energy at 310 K from the data library in Refs. (14, 15) (𝑅𝑇 = 0.616 kcal/mol). Entropies of the loops in a multibranch junction are in general correlated, but their sum scales with the total junction lengths. Loop entropies of the junction internal to the branches are uncorrelated with the loops on the other sides of the branches. Empirically, higher multibranch structures cost more entropy.
Table A.1 Free energy costs of forming a three-way junction in kcal/mol for center junction lengths (b) 0, 2, and 3.
Table A.2 Free energy costs of forming a three-way junction in kcal/mol for center junction lengths (b) 4 through 6.
Table A.3 Free energy costs of forming a three-way junction in kcal/mol for center junction lengths (b) 7 and 8.
Table B.1 Entropic cost for the stepwise formation of the a=b=c=1 pseudoknot.
Table B.2 Entropic cost for the stepwise formation of the a=b=1, c=4 pseudoknot.
Table B.3 Entropic cost for the stepwise formation of the a=4, b=c=1 pseudoknot.
Table B.4 Entropic cost for the stepwise formation of the a=c=4, b=1 pseudoknot.
List of Figures
Figure 1.1 Examples of different structures of the 5’-NG(CNG)8CN-3’ repeat sequence containing identical base pair content while having distinct folded shapes with different connectivity.
Figure 1.2 Structures of the 5’-NG(CNG)8CN-3’ repeat sequence from Fig. 1.1 along with their base-pair-only description given in brackets and their dual graph representation.
Figure 2.1 Various secondary structures, the total enumeration of the constraints that define them, and their conversion into a diagrammatic topological representation followed by factorization.
Figure 2.2 Sample conformations obtained from the same starting constraints (helix in the middle of the strand) for a 34 nt polyU chain.
Figure 2.3 Free energy cost due to conformational entropy loss at 310 K for loop initiation in an unconstrained chain.
Figure 2.4 Free energy cost at 310 K to initiate a second loop of length 𝑏 in a chain already containing a loop.
Figure 2.5 The free energy costs of forming a two-way junction with 5’ and 3’ junction lengths 𝑏 and 𝑐, respectively, given that a loop 𝑎 is already in place.
Figure 2.6 The free energy cost of forming symmetric two-way junctions plotted for chains with different sizes of the first loop, 𝑎, and for different lengths of the stem separating 𝑎 from the 2-way junction (𝑏, 𝑐).
Figure 2.7 Reduced topological representation of the set of constraints defining a three-way junction.
Figure 2.8 The free energy costs of forming a three-way junction with 5’ and 3’ junction lengths 𝑎 and 𝑐, respectively, given that junction length 𝑏 is fixed at 1 nt; this surface corresponds to the data given in Table 2.2.
Figure 2.9 The free energy cost of initiating a third hairpin of length 𝑐 in the presence of two existing loops (𝑎 and 𝑏).
Figure 2.10 Reduced topological representation of the set of constraints defining a four-way junction.
Figure 2.11 Diagrammatic representation of the topology of a three-way junction and how it can be altered by introduction of new tertiary interactions.
Figure 3.1 Examples of diagrams representing different secondary structural elements.
Figure 3.2 Factorization of the constraints in a three-way junction into approximately independent contributions. The total entropy S is reduced to the sum of the three closed diagrams on the right.
Figure 3.3 Geometric criteria used in defining (a) Watson-Crick base pairing geometry (33), (b) G-G Hoogsteen geometry (31, 32), and (c) purine-pyrimidine Hoogsteen geometry (31, 34).
Figure 3.4 (a-d) Examples of different structures of the 5’-NG(CNG)8CN-3’ repeat sequence belonging to distinct topological classes. (e) Maximal hairpin structure of 5’-NG(CNG)19CN-3’.
Figure 3.5 Comparison of loop initiation using different sets of base pairing criteria. In comparison to the Watson-Crick initiation cost, the purine-pyrimidine Hoogsteen initiation costs are effectively shifted up by a constant.
Figure 3.6 (a) A possible quadruplex structure relevant to CNG repeat sequences. The structure’s entropy is determined by the three loops labeled a, b, and c. (b) Diagrammatic representation of a quadruplex, showing its dependence on the three loop lengths a, b, and c, as well as the number of layers.
Figure 3.7 (a) The conformational entropy of a pseudoknot is determined by the lengths of the three loops labeled a, b, and c, as well as the duplex lengths a and b. In the pseudoknot structures most relevant to CNG repeat sequences, a and b are both 2 nt, the interhelix length b is 1 nt, and the loop lengths a and c are 1, 4, 7, ...
Figure 3.8 All graphs for (CNG)17 at total degree 8, their RAG-ID (9) and corresponding ensemble-averaged cost, entropy, and the graph free energy. Δ𝐹 and Δ𝐺 are in kcal/mol.
Figure 3.9 All graphs for (CNG)17 at total degree 12, their RAG-ID and corresponding ensemble-averaged cost, entropy, and the graph free energy. Δ𝐹 and Δ𝐺 are in kcal/mol.
Figure 3.10 All graphs for (CNG)17 at total degree 16, their RAG-ID and corresponding ensemble-averaged cost, entropy, and the graph free energy. Δ𝐹 and Δ𝐺 are in kcal/mol.
Figure 3.11 All non-pseudoknot graphs for (CNG)17 at total degree 20, their corresponding ensemble-averaged cost, entropy, and the graph free energy. Δ𝐹 and Δ𝐺 are in kcal/mol.
Figure 3.12 The single non-pseudoknot graphs for (CNG)17 at total degree 28 and all non-pseudoknot graphs of total degree 28, their corresponding ensemble-averaged cost, entropy, and the graph free energy. Δ𝐹 and Δ𝐺 are in kcal/mol.
Figure 4.1 Examples of a 5’-NG(CNG)8CN-3’ repeat sequence in five different conformations. (a) The maximal hairpin “necklace” structure. (b) and (c) Structures with an asymmetric internal junction. (d) and (e) Structures with three-way junctions.
Figure 4.2 Example showing factorization of the diagram on the left into the factors on the right.
Figure 4.3 Dual graph representation of all structural elements included in this study: helix, hairpin, 2-way junction, 3-way junction, loops in a quadruplex, the quadruplex core, bridge, and unpaired ends.
Figure 4.4 Dyson equation for the root function 𝑅 3 including hairpins, 2- and 3-way junctions, as well as quadruplexes.
Figure 4.5 Ensemble averages of the number of helices (solid line), bridges (dashed lines), hairpin loops (open circles), 2-way junctions (dotted dashed lines), 3-way junctions (open triangles) and quadruplexes (squares) computed from the physically-relevant solution for a (CNG)60 repeat, as a function of quadruplex stability (stable on the left, unstable on the right).
Figure 4.6 Diagrams illustrating some of the structures observed in the results in Figs. 4.5 and 4.7.
Figure 4.7 Ensemble averages of features computed for a (CNG)60 repeat as a function of extra stability added to each 2-way junction (favorable on the left, unfavorable on the right).
Figure 4.8 A “phase diagram” summarizing the results from Figs. 4.5 and 4.7.
Figure 4.9 (a) Divergence of the partition function 𝑍𝜆 when 𝜆 approaches the singular point 𝜆𝑐. The scale on the top shows the average repeat lengths ⟨𝑛⟩ for each 𝜆. (b) Structural features as a fraction of the repeat length as a function of λ. The scale on the top maps ⟨𝑛⟩ to λ.
Figure 4.10 Ensemble averages of features computed for a (CNG)60 repeat as a function of extra stability added to each helix (favorable on the left, unfavorable on the right).
Figure A.1 The entropic cost plot for the formation of two-way junctions. Two different sets of constraints were used. The top surface in each view corresponds to the constraints shown in the inset of Fig. 2.3.
Figure B.1 Map showing the thermodynamic pathways used in calculating the entropic cost to form a pseudoknot structure with the junction length a=b=c=1.
Figure B.2 Map showing the thermodynamic pathways used in calculating the entropic cost to form a pseudoknot structure with the junction length a=b=1, c=4.
Figure B.3 Map showing the thermodynamic pathways used in calculating the entropic cost to form a pseudoknot structure with the junction length a=4, b=c=1.
Figure B.4 Map showing the thermodynamic pathways used in calculating the entropic cost to form a pseudoknot structure with the junction length a=c=4, b=1.
Abstract
Functional ribonucleic acid chains (RNA) can fold into many complex structures in
biological settings using a collection of secondary and tertiary structural motifs. These
structures are often necessary for the proper function of the RNA and a misfold can
produce undesired outcomes. Of particular interest to us are instances of gain-of-
function in RNA which have been implicated in the pathogenesis of trinucleotide repeat
expansion disorders where the expanded microsatellites are found in intronic and
untranslated regions. The expanded microsatellite yields an expanded RNA chain which
then gains unintended function due to newly accessible folded structures. Thus, an
understanding of how the final folded structure is determined by factors such as chain
length is important in understanding how expanded trinucleotide repeats give rise to
diseases. In this study, we focused on the influence that entropy—particularly of the
sugar-phosphate backbone—has in determining the final folded structure of RNA
chains.
We began in Chapter 2 by drawing from the field of topology and the existing body
of work on RNA graphs to design a diagrammatic scheme to represent the different
types of RNA structures and the constraints they impose upon the sugar-phosphate
backbone. Applying the diagrammatic scheme to a folded RNA structure allows us
to factorize the structure by grouping its constraints into mutually
independent subsets, which in turn enables the total conformational entropy penalty of the fold to
be calculated as a sum of independent terms. We then simulated large ensembles of
single-stranded RNA sequences in solution using high throughput Monte Carlo
simulations to validate the underlying assumptions of our diagrammatic scheme,
examining the entropic costs for the initiation of two major secondary structural motifs:
hairpins and multiway junctions. Further simulations of higher complexity constraints
such as pseudoknots and quadruplexes yielded additional insight into the distinct
topological classes of secondary and tertiary structures, the interactions between
multiple constraints on RNA structures, and how some functional RNA sequences may
operate by transformation between different topological classes.
With the foundations laid for our diagrammatic factorization, we applied our
methodology in Chapters 3 and 4 to (CNG)n trinucleotide repeats (TNRs), which are
transcripts of unstable microsatellites whose spontaneous expansions have been
linked to genetic diseases.
These so-called trinucleotide repeat expansion disorders (TREDs) exhibit complex
mechanisms of pathogenesis, some of which are attributed to a potential RNA transcript
gain-of-function. We therefore investigated the structures to which these expanded
transcripts have access and the diversity of their conformational ensembles. In Chapter 3, we
simulated and cataloged the secondary structure of NG-(CNG)16-CN and NG-(CNG)50-
CN oligomers and sorted them into sub-ensembles based on their defining
characteristics and quantified the structural diversity and thermodynamic stability for
these ensembles. Our findings showed that though it maximizes the number of base
pairing contacts, the generally assumed structure for these repeats—a series of
alternating short Watson-Crick helices and two-way junctions capped by a hairpin—may
not be the most thermodynamically favorable, and the structural ensembles are
characterized by largely open conformations. Furthermore, our data show that the
diversity of the ensembles has a non-negligible length-dependence, suggesting that
further, more generalized study is needed as TREDs are associated with expansions of
more than 60 to 100 repeat units.
To generalize the analysis, in Chapter 4 we introduced another diagrammatic
method which can be used to analyze the structural diversity of an arbitrary (CNG)n
sequence. By representing the structural elements on the chain’s conformation by a set
of graphs and employing elementary diagrammatic methods often seen in physics, we
were able to formulate a renormalization procedure to re-sum these graphs and
produce a closed-form expression for the ensemble partition function of the chain. By
making a simple approximation for the renormalization, this theory can be applied to
extended (CNG)n sequences to comprehensively capture an arbitrarily large set of
conformations containing any number and combination of duplexes, hairpins, multiway
junctions, H-type pseudoknots, and quadruplexes. We then numerically solved the
analytical equations obtained from the renormalization theory to obtain equilibrium
estimates for secondary structural content for each chain to study the structural
ensembles of (CNG)n repeats with large n (n ~ 60). Our findings suggest that, as with
the more restrictive analysis in Chapter 3, the ensemble is surprisingly diverse.
Furthermore, they show that the distribution is sensitive to the identity of the N nucleotide.
As the N nucleotide can participate in non-canonical pairs and determines whether the
(CNG)n sequence in question can sustain stable quadruplexes, the results show that
different choices of N bias the stabilities of different motifs and affect the
secondary structures of the chain, along with how they may undergo structural switches
when perturbed.
Chapter 1: Introduction
1.1 Trinucleotide repeats expansion disorders as a motivating case
The genome is an essential component of a cell, and its stability is taken for granted
in healthy cells. The genome is, however, not as resistant to change as its biological
function would suggest. Microsatellites, tracts of short repeating DNA motifs, are known
to exhibit large length variability (1–3) in the genome. Of particular interest are
microsatellites composed of repeated trinucleotide motifs. Based on their locations
within the gene and the length to which they have expanded, these repeats are known
to cause a variety of neurological disorders referred to as trinucleotide repeat expansion
disorders, or TREDs. Some recognizable examples of TREDs include Huntington’s
disease and fragile X syndrome. When the expansions occur within exons, the
disorders are caused by the resulting defective protein. When these expansions occur
in introns, however, it is often unclear whether the RNA transcript of the repeats
themselves or another aberrant gene product is responsible for their cytotoxicity (4–7).
In the following study, we will delve into the structure of the RNA transcripts of CNG
repeats which are a subset of trinucleotide repeats responsible for many TREDs. While
expanded CNG repeats may lead to aberrant protein products with extended sequences
of glutamines, giving rise to so-called polyQ diseases, many of these trinucleotide
expansions are not found in coding regions. In these cases, cytotoxicity may be due to
the mRNA transcripts via a gain or loss of function. Examples of gain of function have
been demonstrated in myotonic dystrophy type 1 (DM1) (4), where expanded CTG
repeats in the 3’-untranslated regions of the dystrophia myotonica protein kinase gene
produce an RNA transcript which interacts with CUG-binding proteins (CUGBP1) and
muscleblind-like (MBNL1) proteins. These interactions alter protein levels in the cell,
which in turn affects their function as splicing regulators, leading to symptoms (8–10).
These examples demonstrate a need to understand the in vivo structure of the mRNA
transcripts of these repeats to elucidate how RNA with expanded CNG sequences may
behave in the cell to give rise to TRED symptoms.
To study the in vivo structures, we need to capture all the possible folds that these
mRNA transcripts can access. This effort requires that we address two major issues:
the repetitive nature of the transcripts and the length scale due to expansion. Since the
microsatellites that give rise to TREDs, and repeat expansion disorders in general, are
composed of many repeating units of a DNA motif, the resulting RNA transcripts also
contain many repeated motif units. This allows the RNA to form base pairs that are
non-stationary in sequence space. An example relevant to CNG repeats can be found in
Figure 1.1.
Figure 1.1 Examples of different structures of the 5’-NG(CNG)8CN-3’ repeat sequence containing identical base
pair content while having distinct folded shapes with different connectivity.
This means that knowing the base pairs in the fold alone is insufficient to pin down the
fold as the contacts can be moved along the sequence from one repeat unit to another
to yield different folds that are indistinguishable from a base-pairing perspective. This
together with the length scale issue—that is, longer chains can naturally access more
structures—requires us to take a different approach to studying these structures.
Rather than focusing on the base pairs, it is more useful for us to center our effort on
the unpaired regions of the folded structure.
Formation of base pairs is the result of introducing constraints onto the sugar-
phosphate backbone. While the same set of base pairing contacts can be made
between different subsets of the repeated units, each choice of subunit set will produce
a different set of unpaired regions, each with their own conformational entropy cost.
Thus, the collection of all unpaired regions within a folded structure along with the
conformational penalty inflicted on the chain to create them can serve to distinguish
between folded structures of RNA with repeating subunits much better than the set of
base pairs alone. This can then be combined with diagrammatic methods and graph
theory to allow for a more compact and computationally convenient representation of
the structure to be introduced for use in scaling our analysis to arbitrary chain lengths.
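The point that base-pair content alone does not pin down a fold in a repeat sequence can be illustrated with a minimal sketch. The two folds below are hypothetical examples chosen only for illustration (they are not the structures of Figure 1.1), and the helper names are assumptions introduced here: each fold places the same single GC|CG doublet at a different register of the 5’-NG(CNG)8CN-3’ sequence, and the code lists the resulting unpaired regions.

```python
import re

def repeat_sequence(n_repeats):
    """Build the 5'-NG(CNG)nCN-3' repeat sequence as a string."""
    return "NG" + "CNG" * n_repeats + "CN"

def dot_bracket(length, pairs):
    """Write a list of (i, j) base pairs (0-based, i < j) in dot-bracket form."""
    db = ["."] * length
    for i, j in pairs:
        db[i], db[j] = "(", ")"
    return "".join(db)

def unpaired_runs(db):
    """Lengths of consecutive unpaired stretches, read 5' to 3'."""
    return [len(run) for run in re.findall(r"\.+", db)]

seq = repeat_sequence(8)

# Two hypothetical folds, each containing one GC|CG helix doublet
# (two G-C pairs), placed at different repeat units.
fold_a = [(1, 26), (2, 25)]    # doublet closes a long loop near the chain ends
fold_b = [(10, 17), (11, 16)]  # same doublet closes a short loop mid-chain

for pairs in (fold_a, fold_b):
    # Every pair must join a G with a C for the fold to be admissible.
    assert all({seq[i], seq[j]} == {"G", "C"} for i, j in pairs)
    db = dot_bracket(len(seq), pairs)
    print(db, unpaired_runs(db))
```

Both folds contain exactly the same base pairs by type and number, yet the first leaves a 22 nt loop with 1 nt tails while the second leaves a 4 nt loop with 10 nt tails. The list of unpaired regions therefore separates the two folds where a count of base pairs cannot.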
Our tasks at hand are then as follows. First, obtain a consistent set of free energy
values associated with the unpaired regions in a folded structure and determine how best to map
these and the base pairing free energies onto a graphical representation for RNA structures.
Second, apply the free energy data and graph representation to the analysis of the
structural ensembles of CNG oligomers and the diversity of their structures. Third, extend our
analysis to longer chain lengths. The first task and the initial steps of the second will be
covered in Chapter 2, the remaining bulk of the work needed to analyze oligomers
will be addressed in Chapter 3, and the extension to longer chain lengths will be the
topic of Chapter 4. With our tasks in hand and our sights set on the backbone entropy of
RNA structures, we now introduce some basic ideas regarding the conformational
entropy of the RNA backbone and the folding free energy as well as the graph
representation that will be used throughout this study.
1.2 Basic notions about backbone entropy and free energy
RNA sequences are predominantly found single-stranded in the cell, but they can
assemble into specific higher-order structures by utilizing secondary and tertiary
structural building blocks. The free energy change starting from an open unfolded chain
going to the final folded conformation, $\Delta G_{\mathrm{fold}}$, determines the stability of the fold. A
number of molecular factors control this folding free energy, including chain
conformational fluctuations, base stacking, base complementarity interactions, as well
as other solvent-induced forces such as counterion-mediated intrachain attractions (11–16).
For the fold to be thermodynamically stable, the overall $\Delta G_{\mathrm{fold}}$ from these various
factors must add to produce a downhill driving force, i.e. a net negative $\Delta G_{\mathrm{fold}}$. Of all the
factors that make up $\Delta G_{\mathrm{fold}}$, there is only one term that is guaranteed to be positive, and
this is $-T\Delta S_{\mathrm{b}}$, where $\Delta S_{\mathrm{b}}$ is the change in conformational entropy of the RNA backbone
upon folding.
Formation of secondary and tertiary contacts on the RNA sequence introduces
constraints into the conformation of the chains. The conformational contribution to the
free energy $-T\Delta S_{\mathrm{b}}$ must therefore be uphill. On the secondary structural level, base
pairing requires two nucleobases from different positions on the RNA sequence to adopt
a specific relative geometry, while base stacking constrains two adjacent bases to a
different relative geometry putting one base on top of the other. On the tertiary level,
contacts such as kissing hairpins or loop-receptor type interactions place other kinds of
constraints on the conformation of the chain. A thermodynamic ensemble of free chains
has none of these constraints, and the variational statement of the second law of
thermodynamics states that the introduction of internal constraints into the ensemble
must raise the free energy, or at minimum leave it unchanged (17, 18). Therefore,
the conformational entropy of the RNA backbone is necessarily suppressed when
constraints are imposed. Another way to view this is to consider a chain that has been
compacted by internal constraints. Upon the removal of these constraints, it will unfurl if
no force other than chain conformational entropy is present. Therefore, folding must
lower the backbone conformational entropy, making $\Delta S_b$ negative and producing a
thermodynamically uphill penalty against the folded conformation.
The fact that the change in chain conformational entropy upon folding, $\Delta S_b$, is always
less than zero has an important consequence. If we denote all terms in $\Delta G_{\mathrm{fold}}$ due to
factors other than backbone entropy (base complementarity interactions, base stacking
interactions, and counterion-mediated electrostatic interactions) by $\Delta G'$, the thermodynamic
requirement that $\Delta G_{\mathrm{fold}} = \Delta G' - T\Delta S_b < 0$ for a stably folded RNA demands that $\Delta G'$
must be more negative than $T\Delta S_b$. The magnitude of $T\Delta S_b$ therefore places a rigorous
lower bound on the strengths of all the other thermodynamic forces that make the fold
overall stable. This idea will be revisited when ensembles of oligomers are analyzed in
Chapters 3 and 4.
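The bound can be written out step by step. The short derivation below only restates the argument above and uses no assumption beyond the sign of the backbone entropy change:

```latex
\begin{align*}
\Delta G_{\mathrm{fold}} = \Delta G' - T\Delta S_b < 0
  &\quad\Longrightarrow\quad \Delta G' < T\Delta S_b , \\
\Delta S_b < 0
  &\quad\Longrightarrow\quad T\Delta S_b = -T\,\lvert \Delta S_b \rvert < 0 , \\
\text{and therefore}\qquad
  \lvert \Delta G' \rvert &> T\,\lvert \Delta S_b \rvert .
\end{align*}
```

In words, the stabilizing contributions collected in the non-backbone term must exceed the backbone entropic penalty in magnitude for the fold to be stable.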
1.3 Introduction to Dual graphs
Diagrammatic approaches for classifying RNA structures have been used widely
(19–30). Graphs provide an elegant method for categorizing the many diverse
conformational structures that can be adopted by RNA sequences and may be used to
facilitate recognition and analysis of common topological features in RNA structures that
are otherwise difficult to decipher from their 2- or 3-dimensional structures. Graphs also
provide an alternate space within which RNA secondary structures can be understood
(31, 32) and they are the basis of the algorithms (33, 34) behind some of the most
widely used RNA secondary structure prediction tools (35–37). Graphs also help
elucidate the rich connection between RNA structure and topology, enabling topological
interpretations to be used for annotating RNA structures (38–43).
The use of graphs in RNA studies began with loop diagrams or “line graphs” (44) in
which the entire sequence was mapped to a straight-line graph with a vertex for every
base in the sequence. Arcs were then drawn to connect vertices representing paired
nucleotides along the sequence length together. This was then followed by tree graphs
which represented junctions and loops in secondary structures as vertices, or points, of
a graph and helices as the edges connecting the vertices of the graph (29, 30). Though
useful in allowing graph theoretic results to be applied to analyzing RNA structure, loop
diagrams and tree graphs have their shortcomings. Loop diagrams allow for complete
enumeration of secondary structures, including pseudoknots, and the easy identification
of such knotting structures. In exchange, they give up information on the overall shape of
the folded structures, which must be reconstructed from the base pairing information
given. Tree graphs, meanwhile, can only show structures that contain helices and loops
without pseudoknots. Dual graphs were later introduced by Schlick and coworkers (21,
45–47), incorporating a combination of desirable traits from both loop diagrams
and tree graphs. In the dual graph representation, helices are represented by vertices of
the graph, while the unpaired segments are represented by the edges connecting the
vertices. This results in a graph that is not visually relatable to the 2D secondary
structure but allows pseudoknots and structures such as quadruplexes and triple helices
to be shown explicitly and allows the topology and overall shape of the secondary
structure to be easily discernible. Examples of some possible conformations of a short
(CNG) repeat with different secondary structures and their dual graphs are
shown in Fig. 1.2.
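The mapping from a secondary structure to its dual graph can be sketched in a few lines of code. The following is only an illustration under stated assumptions, not the RAG implementation: it assumes a pseudoknot-free structure written in dot-bracket notation with balanced brackets, treats every uninterrupted run of stacked base pairs as one helix (vertex), and records each unpaired segment between consecutive helix strands as an edge annotated with its length in nt. Dangling 5' and 3' ends are ignored. The function name `dual_graph` and the example structures are hypothetical.

```python
def dual_graph(dot_bracket):
    """Return (number_of_helices, edges) for a pseudoknot-free structure.

    Each edge is (helix_from, helix_to, n_unpaired), listed in 5'-to-3' order.
    A hairpin loop appears as a self-loop (same helix at both ends).
    """
    # Build the pairing table from matching brackets.
    stack, pair = [], {}
    for k, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(k)
        elif ch == ")":
            i = stack.pop()
            pair[i], pair[k] = k, i

    # Group stacked pairs (i, j), (i+1, j-1), ... into helices.
    helix, n_helices = {}, 0
    for i in sorted(pair):
        j = pair[i]
        if i > j:
            continue  # visit each pair once, from its 5' partner
        if pair.get(i - 1) == j + 1:
            helix[i] = helix[j] = helix[i - 1]
        else:
            helix[i] = helix[j] = n_helices
            n_helices += 1

    # Walk the backbone; every jump between helix strands is an edge.
    edges, prev = [], None
    for k in range(len(dot_bracket)):
        if k not in pair:
            continue
        if prev is not None:
            same_strand = (
                helix[prev] == helix[k]
                and k == prev + 1
                and (prev < pair[prev]) == (k < pair[k])
            )
            if not same_strand:
                edges.append((helix[prev], helix[k], k - prev - 1))
        prev = k
    return n_helices, edges


# Hypothetical three-way junction: two hairpins closed by a terminal helix.
print(dual_graph("((..((...))..((...))..))"))
```

For the example string, the sketch reports three helices joined by three junction segments plus two hairpin self-loops, which is the connectivity of a three-way junction.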
Figure 1.2 Structures of the 5’-NG(CNG)8CN-3’ repeat sequence from Fig. 1.1 along with their base-pair-only
description given in brackets and their dual graph representation. By reducing the amount of information present,
the graph allows for clear depiction of the different underlying chain structures.
Fig. 1.2(a) shows a series of base pair stacks separated by short unpaired regions
with a hairpin loop cap, yielding a ladder-like structure. Fig. 1.2(b) shows a so-called
“3-way junction” in which two hairpin motifs are joined together to a 3rd base paired region.
This class of structure is easily identified in graphical representation due to its visual
resemblance to a 3-way intersection. In the dual graph representation, all the edges are
treated on even footing with only the number of unpaired nucleotides being denoted.
Vertices, as we will explore in later chapters, are classified by their base pairing content
as well as the number of edges that connect to them. By streamlining the information
present and separating the length and base pair content from the visual depiction, dual
graphs give a clear view of the underlying topology for each structure. Finally, the dual
graph representation is highly suggestive, highlighting potential “break points” in the
structure which will be useful for the development of our diagrammatic method. A more
in-depth look at dual graphs, how constraints are grouped within them, and their
interpretation will be presented in Chapters 2 and 3.
1.4 References
1. Mirkin, S.M. 2004. Molecular Models for Repeat Expansions. Biochem. Mol. Biol. 24.
2. Khristich, A.N., and S.M. Mirkin. 2020. On the wrong DNA track: Molecular mechanisms of repeat-
mediated genome instability. J. Biol. Chem. 295:4134–4170.
3. Wells, R.D., R. Dere, M.L. Hebert, M. Napierala, and L.S. Son. 2005. Advances in mechanisms of
genetic instability related to hereditary neurological diseases. Nucleic Acids Res. 33:3785–3798.
4. Li, L.-B., and N.M. Bonini. 2010. Roles of trinucleotide-repeat RNA in neurological disease and
degeneration. Trends Neurosci. 33:292–298.
5. Mirkin, S.M. 2006. DNA structures, repeat expansions and human hereditary disorders. Curr Opin
Struct Biol. 16:351–8.
6. Mirkin, S.M. 2007. Expandable DNA repeats and human disease. Nature. 447:932–40.
7. Nelson, D.L., H.T. Orr, and S.T. Warren. 2013. The Unstable Repeats—Three Evolving Faces of
Neurological Disease. Neuron. 77:825–843.
8. Timchenko, L.T., N.A. Timchenko, C.T. Caskey, and R. Roberts. 1996. Novel Proteins with Binding
Specificity for DNA CTG Repeats and RNA CUG Repeats: Implications for Myotonic Dystrophy.
Hum. Mol. Genet. 5:115–121.
9. Miller, J.W., C.R. Urbinati, P. Teng-umnuay, M.G. Stenberg, B.J. Byrne, C.A. Thornton, and M.S.
Swanson. 2000. Recruitment of human muscleblind proteins to (CUG)n expansions associated with
myotonic dystrophy. EMBO J. 19:4439–4448.
10. Napierala, M., and W.J. Krzyzosiak. 1997. CUG Repeats Present in Myotonin Kinase RNA Form
Metastable “Slippery” Hairpins. J. Biol. Chem. 272:31079–31085.
11. Draper, D.E., D. Grilley, and A.M. Soto. 2005. Ions and RNA Folding. Annu. Rev. Biophys. Biomol.
Struct. 34:221–243.
12. Wong, G.C.L., and L. Pollack. 2010. Electrostatics of Strongly Charged Biological Polymers: Ion-
Mediated Interactions and Self-Organization in Nucleic Acids and Proteins. Annu. Rev. Phys. Chem.
61:171–189.
13. Chen, S.-J. 2008. RNA Folding: Conformational Statistics, Folding Kinetics, and Ion Electrostatics.
Annu. Rev. Biophys. 37:197–214.
14. Liu, L., and S.-J. Chen. 2010. Computing the conformational entropy for RNA folds. J. Chem. Phys.
132:235104.
15. Woodson, S.A. 2010. Compact intermediates in RNA folding. Annu. Rev. Biophys. 39:61–77.
16. Turner, D.H. 1996. Thermodynamics of base pairing. Curr. Opin. Struct. Biol. 6:299–304.
17. Chandler, D. 1987. Introduction to Modern Statistical Mechanics. 1st ed. New York, NY: Oxford
University Press.
18. Hill, T.L. 2013. Statistical mechanics: principles and selected applications. Courier Corporation.
19. Gan, H.H., D. Fera, J. Zorn, N. Shiffeldrim, M. Tang, U. Laserson, N. Kim, and T. Schlick. 2004.
RAG: RNA-As-Graphs database – concepts, analysis, and features. Bioinformatics. 20:1285–1291.
20. Gan, H.H., S. Pasquali, and T. Schlick. 2003. Exploring the repertoire of RNA secondary motifs
using graph theory; implications for RNA design. Nucleic Acids Res. 31:2926–2943.
21. Fera, D., N. Kim, N. Shiffeldrim, J. Zorn, U. Laserson, H.H. Gan, and T. Schlick. 2004. RAG: RNA-
As-Graphs web resource. BMC Bioinformatics. 5:88.
22. Gevertz, J., H.H. Gan, and T. Schlick. 2005. In vitro RNA random pools are not structurally diverse:
A computational analysis. RNA. 11:853–863.
23. Izzo, J.A., N. Kim, S. Elmetwaly, and T. Schlick. 2011. RAG: An update to the RNA-As-Graphs
resource. BMC Bioinformatics. 12:219.
24. Laing, C., and T. Schlick. 2011. Computational approaches to RNA structure prediction, analysis,
and design. Curr. Opin. Struct. Biol. 21:306–318.
25. Kim, N., C. Laing, S. Elmetwaly, S. Jung, J. Curuksu, and T. Schlick. 2014. Graph-based sampling
for approximating global helical topologies of RNA. Proc. Natl. Acad. Sci. USA. 111:4079–4084.
26. Jain, S., C.S. Bayrak, L. Petingi, and T. Schlick. 2018. Dual Graph Partitioning Highlights a Small
Group of Pseudoknot-Containing RNA Submotifs. Genes. 9:371.
27. Schlick, T. 2018. Adventures with RNA graphs. Methods. 143:16–33.
28. Jain, S., S. Saju, L. Petingi, and T. Schlick. 2019. An extended dual graph library and partitioning
algorithm applicable to pseudoknotted RNA structures. Methods. 162–163:74–84.
29. Shapiro, B.A., and K. Zhang. 1990. Comparing multiple RNA secondary structures using tree
comparisons. Bioinformatics. 6:309–318.
30. Le, S.-Y., R. Nussinov, and J.V. Maizel. 1989. Tree graphs of RNA secondary structures and their
comparisons. Comput. Biomed. Res. 22:461–473.
31. Waterman, M.S., and T.F. Smith. 1978. RNA secondary structure: a complete mathematical
analysis. Math. Biosci. 42:257–266.
32. Penner, R.C., and M.S. Waterman. 1993. Spaces of RNA Secondary Structures. Adv. Math.
101:31–49.
33. Waterman, M.S., and T.F. Smith. 1986. Rapid dynamic programming algorithms for RNA secondary
structure. Adv. Appl. Math. 7:455–464.
34. Rivas, E., and S.R. Eddy. 1999. A dynamic programming algorithm for RNA structure prediction
including pseudoknots. J. Mol. Biol. 285:2053–2068.
35. Zuker, M. 2003. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids
Res. 31:3406–3415.
36. Hofacker, I.L. 2003. Vienna RNA secondary structure server. Nucleic Acids Res. 31:3429–3431.
37. Lorenz, R., S.H. Bernhart, C. Höner zu Siederdissen, H. Tafer, C. Flamm, P.F. Stadler, and I.L.
Hofacker. 2011. ViennaRNA Package 2.0. Algorithms Mol. Biol. 6:26.
38. Orland, H., and A. Zee. 2002. RNA folding and large N matrix theory. Nucl. Phys. B. 620:456–476.
39. Rødland, E.A. 2006. Pseudoknots in RNA Secondary Structures: Representation, Enumeration, and
Prevalence. J. Comput. Biol. 13:1197–1213.
40. Bon, M., G. Vernizzi, H. Orland, and A. Zee. 2008. Topological Classification of RNA Structures. J.
Mol. Biol. 379:900–911.
41. Andersen, J.E., L.O. Chekhov, R.C. Penner, C.M. Reidys, and P. Sułkowski. 2013. Topological
recursion for chord diagrams, RNA complexes, and cells in moduli spaces. Nucl. Phys. B. 866:414–
443.
42. Vernizzi, G., and H. Orland. 2015. Random matrix theory and ribonucleic acid (RNA) folding. Oxf.
Handb. Random Matrix Theory.
43. Vernizzi, G., H. Orland, and A. Zee. 2016. Classification and predictions of RNA pseudoknots based
on topological invariants. Phys. Rev. E. 94:042410.
44. Schmitt, W.R., and M.S. Waterman. 1994. Linear trees and RNA secondary structure. Discrete
Appl. Math. 51:317–323.
45. Laing, C., and T. Schlick. 2011. Computational approaches to RNA structure prediction, analysis,
and design. Curr. Opin. Struct. Biol. 21:306–318.
46. Kim, N., C. Laing, S. Elmetwaly, S. Jung, J. Curuksu, and T. Schlick. 2014. Graph-based sampling
for approximating global helical topologies of RNA. Proc. Natl. Acad. Sci. 111:4079–4084.
47. Gan, H.H., D. Fera, J. Zorn, N. Shiffeldrim, M. Tang, U. Laserson, N. Kim, and T. Schlick. 2004.
RAG: RNA-As-Graphs database – concepts, analysis, and features. Bioinformatics. 20:1285–1291.
Chapter 2: Topological Constraints and Their Conformational
Entropic Penalties on RNA Folds
In this chapter, our goal is to develop the theoretical basis for calculating $\Delta S_b$, the
change in backbone entropy from the unfolded free chain to the folded structure in
solution, as a function of the constraints on the RNA backbone imposed by known
secondary and tertiary structures. The first question is a technical one. Is there an
efficient computational methodology to accurately quantify backbone conformational
entropy? The second question is a conceptual one. How do we define these constraints,
and more importantly, how do we decide whether a set of constraints are independent
or correlated? We will address these two questions by formulating a topological view of
RNA folds.
2.1 Materials and Methods
2.1.1 Relationship Between Constraints and Backbone Conformational Entropy
Examples of the kinds of constraints that define the secondary and tertiary
structures of an RNA include base pairs, stacked bases, and tertiary interactions. We
denote each constraint symbolically by $c_j$, and in a folded RNA there could be $N$ of
these. In a thermal ensemble of free RNA chains in solution, the entropy cost $\Delta S_b$ of
imposing these constraints $\{c_1, c_2, c_3, \ldots, c_N\}$ on the chains can be calculated from the
probability of observing chains that meet these conditions (1, 2):
$$P(c_1, c_2, c_3, \ldots, c_N) = e^{\Delta S_b / R}, \qquad (2.1)$$
where $R$ is the gas constant. For even a short chain with any appreciable secondary or
tertiary structure, the number of base pairs, stacked bases and other tertiary contacts is
usually quite large. The joint probability of all these constraints occurring on the same
chain is consequently small, and $\Delta S_b$ is usually large and very negative. While Eq. 2.1 is
a possible way to compute $\Delta S_b$, the number of chain conformations that must be
sampled is prohibitively large.
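To make the sampling problem concrete, the sketch below inverts Eq. 2.1 to obtain the entropy from a joint probability and estimates how many uncorrelated conformations would be needed to observe the fully constrained state a given number of times. The probability values, the function names, and the target of 100 observations are hypothetical choices made only for illustration.

```python
import math

R = 8.314462618  # gas constant in J/(mol K)


def backbone_entropy(p_joint):
    """Eq. 2.1 inverted: Delta S_b = R ln P, in J/(mol K)."""
    return R * math.log(p_joint)


def frames_needed(p_joint, hits=100):
    """Rough number of uncorrelated frames needed to see `hits` events."""
    return math.ceil(hits / p_joint)


# Hypothetical joint probabilities, from one loose constraint to many tight ones.
for p in (1e-2, 1e-6, 1e-12):
    print(f"P = {p:.0e}  dS_b = {backbone_entropy(p):8.1f} J/(mol K)  "
          f"frames ~ {frames_needed(p):.1e}")
```

The required number of frames grows as the inverse of the joint probability, which is why direct evaluation of Eq. 2.1 quickly becomes impractical as constraints accumulate.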
A reduction of the joint probability is possible if the constraints can be divided into
subsets which are independent of each other. In that case, Eq. 2.1 can be
simplified. For instance, if there are six constraints and they can be factored into three
independent subsets $\{c_1, c_2\}$, $\{c_3\}$ and $\{c_4, c_5, c_6\}$, then
$$e^{\Delta S_b / R} = P(c_1, c_2, c_3, \ldots, c_6) = P(c_1, c_2)\,P(c_3)\,P(c_4, c_5, c_6),$$
and the entropy becomes a sum of three independent terms, one for satisfying each of
these three independent sets of constraints. The entropy can then be more easily
evaluated because each of the joint probabilities that must be computed requires many
fewer conditions to be jointly satisfied. In the next
section, we will develop the topological representation of these constraints introduced in
Chapter 1 to help us better understand how to factor them into independent subsets.
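Taking the logarithm of the factored probability makes the additivity explicit. In the worked form below, the superscripted labels on the three entropy terms are notation introduced here for illustration only:

```latex
\begin{align*}
\Delta S_b &= R \ln P(c_1, c_2, \ldots, c_6) \\
           &= R \ln\!\big[ P(c_1, c_2)\, P(c_3)\, P(c_4, c_5, c_6) \big] \\
           &= \underbrace{R \ln P(c_1, c_2)}_{\Delta S_b^{(1)}}
            + \underbrace{R \ln P(c_3)}_{\Delta S_b^{(2)}}
            + \underbrace{R \ln P(c_4, c_5, c_6)}_{\Delta S_b^{(3)}} .
\end{align*}
```

Each term involves at most three jointly satisfied constraints rather than six, so each probability is larger and can be estimated from far fewer sampled conformations.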
2.1.2 Topological Representation of Secondary and Tertiary Structural
Constraints
In the dual graph representation introduced by Schlick and coworkers (3–6), helices
are represented by vertices of the graph, while the unpaired segments are represented
by the edges connecting the vertices. Figure 2.1 shows several examples of the
secondary structural motifs seen in many RNA folds and their corresponding graph
representation. Figure 2.1(a) depicts a three-way junction with two hairpins in the
interior of the sequence and a helix between the 5’ and 3’ terminal residues, with three
intervening single-stranded loop segments. In this case, the constraints associated with
the secondary structure are the base pairing and stacking forces that hold the helices
together. If these forces are removed, the chain will unfurl. The backbone
conformational entropy is the logarithm of the joint probability of observing all these
constraints being satisfied on one chain. In the middle row of Fig. 2.1(a), we group all
the constraints that come from the same stem into one set. There are three stems in
this structure and hence three subsets of constraints. We choose to
view each stem as one subset because the multiple constraints in each set (i.e. base
pairs and base stacks) are clustered. Unless there are additional tertiary contacts
between these stems, the constraints in one set should be largely unaware of the
existence of the constraints in the other sets.
While the division of constraints into the three subsets depicted in the second row of
Fig. 2.1(a) seems reasonable, we have omitted the central fact that the three helices are
connected by single-stranded segments that make up the rest of the three-way junction.
The connectivity among the helices, while not explicitly given in our list of constraints, is
implicit due to the backbone continuity of the RNA. In the topological representation, the
segments labelled a through c in the second row of Fig. 2.1(a) remind us that these
strands, as well as those in the hairpins d and e, must be counted as implicit constraints
for this construct.
Figure 2.1 Various secondary structures, the total enumeration of the constraints that define them, and their
conversion into a diagrammatic topological representation followed by factorization. (a) A three-way junction is
defined by five single-stranded lengths and 3 helices. It is factored into 3 independent subsets which can be
treated separately. (b) A pseudoknot is defined by three single-stranded lengths and two helices. Due to backbone
connectivity, the diagram is not factorizable. (c) A triple helix is defined by two single-stranded loops and one triple
helix structure. The factorization suggests the two loops are approximately independent of each other. (d) A
quadruplex is defined by several loops threaded through the quadruplex core. The factorization shown here
suggests that the three loops, after topological reduction, should become approximately independent of each
other.
The third row of Fig. 2.1(a) shows our topological representation of all the
constraints inherent in the structure 2.1(a), including both explicit as well as implicit
ones. All the constraints due to a single stem (base pairing and base stacking forces)
are represented by one solid circle. Following standard terminology in topology, each
circle is a “vertex”. The loops labelled a through e are called “arcs”, or edges of the
graph, and they make manifest the implicit constraints coming from the backbone
connectedness. Notice that four arcs pass through every vertex. This corresponds to
the physical observation that each helix can have at most 2 strands coming from either
end of the helix. The half circle at the lower right actually constitutes two arcs, denoting
the 5’ and 3’ free termini of the chain. Free ends on the 5’ and 3’ termini of a chain do
not cost any entropy, hence $\Delta S_b$ for a structure with or without free ends is the
same. This topological reduction of the secondary structure in Fig. 2.1(a)
delineates the key constraints that define the fold as well as the relationships among
them. Notice that while all helices are represented by just dots, the intrinsic entropy of
each stem depends on the size of each helix measured in nucleotide (nt) units, which
must be specified for its entropy to be evaluated correctly.
Figure 2.1(b) shows a schematic structure of a pseudoknot, which helps illustrate
additional features of our topological representation. The second row of Fig. 2.1(b)
suggests that the constraints coming from each stem can be grouped together into one
subset. The three single-stranded regions internal to the pseudoknot are labelled a
through c. These segments constitute the implicit constraints originating from the
connectedness of the backbone. The third row of Fig. 2.1(b) shows the topological
representation of all these constraints in reduced form. The arcs labelled a, b and c
correspond to the loops depicted in the second row. As in the three-way junction, four
arcs go through every vertex. Though not explicitly shown in the topological
representation, the number of nucleotides between the entry point into the pseudoknot
and the exit point, labelled d in the second row of Fig. 2.1(b), needs to be specified for
the entropy to be evaluated properly. Again, the free 5’ and 3’ ends are indicated by
open arcs, but as described above they do not cost additional entropy.
Figure 2.1(c) shows a schematic drawing of a triple helix, and 2.1(d) shows a
quadruplex. The same topological reduction procedures described above lead to the
diagrams on the third row of Figs. 2.1(c) and (d). For the triple helix in Fig. 2.1(c), its
topological representation has only one vertex, but six arcs go through it. To
differentiate this from the vertices in Fig. 2.1(a) and (b), the vertex in Fig. 2.1(c) is
shown as a solid triangle. The two relevant loops are labelled a and b. Again, the size of
the triple helix in nt units must be specified for the entropy to be computed properly. The
quadruplex structure in the first row of Fig. 2.1(d) reduces to the diagram on the third
row. There are three loops labelled a, b, and c. This vertex, which has eight arcs going
through it, is shown as a solid square. The size of the quadruplex stack in nt units must
be specified for the total entropy to be calculated properly.
2.1.3 Factoring Diagrams into Approximately Independent Pieces
While the topological reductions introduced in the last section transform the
constraints that define the secondary and/or tertiary structure of an RNA fold into
diagrammatic elements, the fact that the vertices and arcs in the topological
representation remain connected suggests that they are still correlated with each other.
However, there exists an implicit assumption within the literature for RNA secondary
structure modeling that loops can be factored into independent components. Examples
of this assumption being used include the nearest-neighbor model of Turner and
Mathews (7–10), web servers that utilize the nearest-neighbor model to calculate free
energies of RNA structures such as Mfold (7, 11–14) or NUPACK (15), and discrete chain
models in which loops are formed as part of a random walk (16, 17). In the following, we
develop a rigorous factorization scheme to divide each diagram into approximately
independent pieces in a way that is consistent with the existing literature.
A possible factorization scheme is illustrated on the last row of Fig. 2.1(a) for the
three-way junction. First, as discussed earlier, the free segments on the 5’ and 3’ ends
of the chain do not incur any entropic costs. In the factored diagram, the two open arcs
representing these two termini have been eliminated. Second, the loops labelled d and
e have been factored out from the composite arc a-b-c. This factorization scheme is
motivated by the fact that the hairpin loop on one end of each stem is largely isolated
from the loops on the opposite end of the stem, except in the case where they make
direct contact with each other, such as in a pseudoknot. Otherwise, loops on opposite
ends of a helix are largely agnostic of each other except for the fact that they are both
on the same stem, so factoring the loops on the opposite ends of a stem into
approximately independent parts seems to be justified, as long as there are no explicit
constraints between them. In this sense, every vertex “insulates” a pair of arcs on one
side of the vertex from another pair of arcs on the other side, facilitating this
factorization. We note that this postulated independence is not exact but only
approximate. The validity of this conjecture will be demonstrated by the simulation
studies presented below, and the data will show this postulated independence is quite
accurate.
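The peeling-off of loops that begin and end on the same vertex can be sketched as follows. This is a deliberately restricted illustration, not a general factorization algorithm: it assumes the free 5' and 3' ends have already been dropped, it keeps all arcs that join two different vertices together as a single composite piece, and it is therefore only appropriate for diagrams with at most one composite loop, such as those in Fig. 2.1. The function name, the vertex numbering, and the assignment of arc labels to vertices are assumptions made for this example.

```python
def factor_self_loops(arcs):
    """Split labelled arcs into approximately independent pieces.

    arcs: list of (label, vertex_u, vertex_v) with free ends already removed.
    An arc with vertex_u == vertex_v begins and ends on the same vertex and
    is peeled off as its own piece; every other arc stays in one composite
    piece, which is not factored further.
    """
    pieces = [[label] for label, u, v in arcs if u == v]
    composite = [label for label, u, v in arcs if u != v]
    if composite:
        pieces.append(composite)
    return pieces


# Three-way junction: junction segments a, b, c and hairpin loops d, e.
three_way = [("a", 0, 1), ("d", 1, 1), ("b", 1, 2), ("e", 2, 2), ("c", 2, 0)]
print(factor_self_loops(three_way))   # hairpins separate from the a-b-c loop

# Pseudoknot: three segments all running between the same two helices.
pseudoknot = [("a", 0, 1), ("b", 1, 0), ("c", 0, 1)]
print(factor_self_loops(pseudoknot))  # a single, unfactorizable piece
```

A structure with nested junctions would contain more than one composite loop and would need a fuller decomposition than this sketch provides.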
While the factorization shown in the last row of Fig. 2.1(a) suggests the two hairpin
loops d and e are largely independent of the three loops a, b, and c forming the three-
way junction, the composite a-b-c loop cannot be factorized further. The reason is that
each vertex only insulates a pair of arcs from another pair, and the a-b-c loop must be
treated as interdependent.
Before going on to demonstrate how to factorize the other diagrams in Fig. 2.1, we
turn to the theory of topology to try to show why vertices with four arcs going through
them can be factorized, but those with only two cannot. For planar networks such as the
ones shown in the third row of Fig. 2.1, a basic theorem in topology on Eulerian circuits
guarantees that an entire connected network of arcs joined only by even vertices (i.e. those
with an even number of arcs going through them) can be traversed by a continuous
closed path that traces over each arc once and only once. Conversely, if a closed path
can traverse a network over each arc once and only once, the vertices must all be even
(18, 19). In the context of RNAs, this theorem simply expresses the
obvious fact that an RNA, having a continuous backbone, must be able to traverse all the
constraints on its folded structure; therefore, all vertices representing such constraints
are necessarily even. Furthermore, if we factor the diagram in the third row of Fig. 2.1(a)
into that on the last row, the requirement of backbone continuity remains intact because
every even vertex ensures that there is a closed path on both sides of the vertex after it
has been factored. Conversely, if we factor a diagram and find that one or more of the
elements in the resulting diagram can no longer be traversed by a closed path, then
chain connectivity has been violated and such factorization is illegitimate. Thus, the
fewest number of edges that must be connected to a vertex to ensure that each
subgraph maintains backbone continuity is two, and vertices with only two arcs cannot
be factored further as this is equivalent to splitting the helix along its length. By this, we
see that further factoring the a-b-c loop in the last row of Fig. 2.1(a) is impossible
because that would necessarily break one or more implicit constraints imposed by the
continuity requirement of the RNA backbone. With this, it is easy to see that any part of
a diagram that begins and ends on the same vertex can be factored out if and only if
there is a closed path that traverses all the arcs inside this part of the diagram once and
only once. This is commonly referred to in graph theory as a circuit decomposition.
Because of this, all self-contained peripheral loops, like those in the last row of
Fig. 2.1(a), are factorizable from the rest of the diagram. Therefore, to facilitate the
factorization of diagrams, it is convenient to introduce another topological feature called
an “articulation point”. An articulation point is any vertex which when removed separates
the diagram into two disjoint parts, each of which can be traversed by a closed path.
The three vertices in Fig. 2.1(a) all represent articulation points.
Now going to the example of the pseudoknot in Fig. 2.1(b), we can first remove the
two free ends producing the diagram in the last row of Fig. 2.1(b). Further factorization
of this diagram is impossible because the two vertices are now both odd (i.e. having an
odd number of arcs going through them). A theorem in topology states that a connected
network with exactly two odd vertices can be traversed over each arc once and only once
only by an open path that begins on one of the odd vertices and ends on the other.
Further factorizing the diagram would violate the continuity requirement of the chain
because neither of the two vertices is an articulation point. Finally, for the triple
helix in Fig. 2.1(c) and the quadruplex in
Fig. 2.1(d), factorization leads to the diagrams on the last row. The results of these
factorizations are analogous to the three-way junction in Fig. 2.1(a), producing multiple
disjoint closed loops. Though the diagrammatic factorization would suggest that triple
helices and quadruplexes have mostly independent loops, there is currently no data to
support the factorization for Fig. 2.1(c) or 2.1(d). Thus, the factorizations suggested for
Fig. 2.1(b), 2.1(c), and 2.1(d) are only conjectures. The work contained within this
chapter will focus on validating the factorization for multi-way junctions which all share
the same topology as Fig. 2.1(a). This will provide theoretical support to the
long-standing assumption of factorizability for loops in secondary structure and serve as a
lead into our work in Chapter 3, which addresses the factorization of the more
complex structures.
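The parity arguments used above are easy to check numerically. The sketch below counts the arcs meeting each vertex (a loop that starts and ends on the same vertex counts twice) and applies Euler's criterion for a connected diagram: no odd vertices permits a closed traversal, exactly two permits only an open one. Connectivity is assumed rather than tested, and the function names, vertex numbering, and edge lists are illustrative assumptions based on the diagrams described for Fig. 2.1(a) and (b).

```python
from collections import Counter


def arc_counts(edges, free_ends=()):
    """Number of arcs meeting each vertex; a self-loop contributes two."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    for v in free_ends:
        deg[v] += 1  # each dangling 5' or 3' end adds one arc
    return dict(deg)


def traversal_type(edges):
    """Euler's criterion for a connected diagram without free ends."""
    n_odd = sum(1 for d in arc_counts(edges).values() if d % 2)
    if n_odd == 0:
        return "closed path"
    if n_odd == 2:
        return "open path"
    return "none"


three_way = [(0, 1), (1, 1), (1, 2), (2, 2), (2, 0)]
pseudoknot = [(0, 1), (0, 1), (0, 1)]

print(arc_counts(three_way, free_ends=(0, 0)))   # four arcs at every vertex
print(arc_counts(pseudoknot, free_ends=(0, 1)))  # four arcs at every vertex
print(traversal_type(three_way))                 # closed path
print(traversal_type(pseudoknot))                # open path
```

With the free ends included, both diagrams have four arcs at every vertex; once the free ends are dropped, the pseudoknot is left with two odd vertices, consistent with the statement that it cannot be factored further.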
It should be noted that this separation of constraints into independent subsets and
the subsequent factorization are valid for the backbone conformational
term, $\Delta S_b$. There are terms in $\Delta G'$ (i.e. the “everything else” term), particularly the
electrostatics and the excluded volume interactions, that are not expected to factor due
to the long-range nature of these forces. However, the intrinsic factorizability of the
backbone conformational entropy term, $\Delta S_b$, is unaltered.
2.1.4 Monte Carlo Simulation Studies
The factorization schemes introduced above for dividing constraints inherent in
known secondary/tertiary structures of an RNA into approximately independent subsets
were tested against large-scale Monte Carlo simulations. We simulated large
ensembles of poly-U sequences with and without constraints to ascertain the
interdependences of different constraints corresponding to the ones that define hairpins
with various loop lengths, as well as two-way, three-way, and four-way junctions of
different sizes.
The Monte Carlo (MC) simulations were carried out using our in-house Nucleic MC
program based on a previously described computational method. The Nucleic MC
program enables high-throughput atomistic MC simulations to be carried out for RNAs
or DNAs by using a mixed numerical/analytical method to treat the sugar-phosphate
backbone. Given positions and orientations of the bases, Nucleic MC uses a chain
closure algorithm to sum over all possible backbone conformations arising from the
torsional degrees of freedom of the sugar-phosphate backbone for all nucleotide units
on the chain (20–24). In the process, the summation takes into account steric
interactions within all parts of the chain: between atoms in the sugar-phosphate
backbone, between all bases in the side chains, and between the backbone and
nucleobase side chain. Unlike molecular dynamics, Nucleic MC can sum over a
massive number of backbone conformations with numerical efficiencies orders of
magnitude faster, enabling a diverse ensemble of chain conformations to be generated
rapidly. To further cut down on CPU requirements, Nucleic MC also uses high-level
theoretical models (25–30) to represent the solvent’s and the counterions’ influences on
the nucleic acid implicitly. Using our in-house parallel computing resources, a thermal
ensemble consisting of several million uncorrelated chain conformations for RNA and
DNA sequences up to a hundred nucleotides could be simulated in several days. The
accuracy of Nucleic MC in terms of the chain structures that it produces has been fully
validated in several prior studies (20–22).
To focus our investigation exclusively on backbone entropic effects, we turned off all
base stacking and base complementarity interactions except those explicitly dictated by
the constraints during the simulations. The steric interactions, in keeping with our focus
on entropic effects, are represented by the Weeks-Chandler-Andersen (WCA) potential
(25). The WCA potential captures the repulsive branch of common two-body potentials
such as Lennard-Jones and reflects the lack of stabilization associated with base pairing
and base stacking. Counterion-mediated forces are necessary to accurately mimic
physiological ionic conditions, and we calibrated these interactions in our simulations to
match the ambient ionic strength of an approximately 0.1 M NaCl solution (20, 26, 31).
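For reference, the standard WCA form is the Lennard-Jones potential truncated at its minimum and shifted up by the well depth, so that only the repulsive branch remains. The sketch below uses reduced units with placeholder values of epsilon and sigma; these are not the parameters used in Nucleic MC, which are not stated here.

```python
def wca(r, epsilon=1.0, sigma=1.0):
    """Weeks-Chandler-Andersen pair potential for a separation r > 0.

    Lennard-Jones shifted up by epsilon inside r_cut = 2**(1/6) * sigma,
    and exactly zero beyond it, so the potential is purely repulsive.
    """
    r_cut = 2.0 ** (1.0 / 6.0) * sigma
    if r >= r_cut:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6) + epsilon


# Placeholder reduced-unit separations: steep wall, contact, and beyond cutoff.
for r in (0.9, 1.0, 1.1, 1.5):
    print(f"r = {r:.2f}  V = {wca(r):.4f}")
```

Because the potential never goes negative, it contributes excluded volume without any attractive stabilization, which is the property the simulations rely on.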
Several series of simulations were carried out. These consisted of: (1) polyU chains
with no constraints, to assess the entropic costs of hairpin loop initiations, (2) polyU
chains with one internal constraint corresponding to a pre-formed hairpin loop in the
interior of the sequence, to assess the entropic costs of initiating a second base-pair
contact anywhere else along the chain, seeding the formation of either a two-way
junction or a second hairpin loop, (3) polyU chains with two internal constraints
corresponding to two pre-formed hairpin loops separated by a variable-length loop
between them, to assess the entropic costs of initiating different three-way junctions of
various sizes, and (4) polyU chains with three internal constraints corresponding to
three pre-formed hairpin loops separated by two fixed-length loops, to assess the
entropic costs of initiating a four-way junction. Entropic costs were evaluated by
conducting a counting experiment on all MC frames produced by Nucleic MC. The
number of times that a given pair of nucleotides—labeled as 𝑖 and 𝑗 —satisfies the base
pairing constraints (vide infra) is collected and normalized by the total number of MC
frames analyzed. This provides a probability of observing the nucleotides 𝑖 and 𝑗 in a
configuration that satisfies the base pairing constraint, 𝑃 (𝑖 , 𝑗 ), within the thermal
ensemble. The associated entropy cost is then calculated as
$$\Delta G = -k_B T \ln\left[P(i, j)\right] \qquad (2.2)$$
All entropic costs in this work were calculated at 310 K. While these simulations were
designed to test the conjectures made above regarding the interdependencies of
various constraints, the full thermodynamic data set presented below will also enable
any researcher to easily calculate the backbone entropy costs of any known RNA fold.
Care should be taken when using or referencing the reported values as they pertain
only to the backbone entropy cost. Thus, the values should not be compared directly to
experimental entropy values which have contributions from all parts of the system—the
solvent for example. The reported entropy costs should ideally serve as a guide to
determine trends in dependence and extrapolation into larger loop sizes at which point
the backbone entropy tends to be the dominant contributor to the free energy.
Alternatively, these values can also serve as a validity check towards studies of
enthalpy as the sum of all non-entropy parts must offset, at a minimum, the backbone
entropy costs reported within this work. Figure 2.2 shows sample snapshots of chain
conformations from the MC simulation used in calculating the cost of internal junction
formation.
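The counting experiment behind Eq. 2.2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis code used in this work: the function name `backbone_free_energy` is hypothetical, the Boltzmann constant is expressed in kcal/(mol K), and the counts in the usage note are made-up numbers.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def backbone_free_energy(n_hits, n_frames, temperature=310.0):
    """Free energy cost (kcal/mol) from a counting experiment, as in Eq. 2.2.

    n_hits   : number of MC frames in which nucleotides i and j satisfy
               the base pairing constraints
    n_frames : total number of MC frames analyzed
    """
    if n_frames <= 0 or not 0 <= n_hits <= n_frames:
        raise ValueError("need 0 <= n_hits <= n_frames and n_frames > 0")
    if n_hits == 0:
        # Event never observed: the ensemble gives no finite estimate.
        return math.inf
    probability = n_hits / n_frames
    return -K_B * temperature * math.log(probability)
```

As a hypothetical usage example, `backbone_free_energy(1000, 2_000_000)` evaluates to roughly 4.7 kcal/mol at 310 K. An event that never appears in the ensemble returns infinity, the counterpart of a table entry that cannot be estimated from the available frames.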
Figure 2.2 Sample conformations obtained from the same starting constraints (helix in the middle of the strand) for
a 34 nt polyU chain. The newly formed base pair is highlighted in red. Conformations (a) and (b) show no newly
formed base pairs. Conformations (c) and (d) show newly formed base pair initiating loops in the head and tail
respectively. Conformations (e) and (f) show newly formed base pair creating internal junctions.
2.2 Results
2.2.1 Hairpin Loops
While U does not form canonical base pairs with itself, the entropic penalty
necessary to put the sugar-phosphate backbone into a conformation ready to facilitate
base pairing between them can be easily computed by counting the number of chain
conformations that meet the conditions shown in the inset of Fig. 2.3 over the entire
ensemble. This combination of Nb -Nb distance (9.0 ± 0.5 Å), virtual bond angles (125 ±
20°) and virtual torsion angle (0 ± 40°) between the two C1’-Nb glycosidic bonds of the
two bases to be paired selects out base configurations which are in position to form an
“ideal” complementary pair (http://ndbserver.rutgers.edu/) (32, 33). It should be noted
that the choice of accepted values for the four base pairing criteria can be tightened or
relaxed to match experimental geometries. As this determines the phase space volume
that is associated with the constraints, the calculated entropy cost to form a structure
will decrease as the range of accepted values for the criteria is increased and vice
versa, so the entropy will have a constant offset depending on how the constraints are
precisely defined. For an example, see Figure A.1 in Appendix A. Figure 2.3 shows the
free energy ∆𝐺 = −𝑇∆𝑆_b at 𝑇 = 310 K for the spontaneous initiation of hairpin loops of
different lengths 𝑎 anywhere along the sequence of a (U)22 strand as the open circles.
The loop initiation free energy increases smoothly from about 4.7 kcal/mol for a 3-nt
hairpin loop to 6.6 kcal/mol for a 10-nt loop. The free energies for loop initiation starting at
specific locations on the sequence are also shown for several positions in Fig. 2.3, red
(toward the 5’ end) to violet (toward the 3’ end). Experimental values are included in
green. Loop initiation free energies seem to be slightly lower on the chain ends as they
are expected to have more freedom, but only by a very small amount. Interior loops
farther from the chain ends appear to be formed with roughly uniform probability along
the entire sequence. Both the magnitude and loop-length dependence of these data
compare well with the thermodynamic data reported by Turner and Mathews in green
(https://rna.urmc.rochester.edu/NNDB/index.html) (7–10, 34) based on RNA melting
experiments; most of the deviations are within 0.6 kcal/mol (1 k_B at 310 K). The observed
trends and deviations from experimental values collected in the work of Turner and
Mathews match those obtained by prior simulation studies (16, 17, 35). As we are only
investigating the part of the free energy that comes from the backbone, these differences
are expected: the experimental values capture contributions from other
terms besides the backbone. They are also consistent with previous MC data from our
group using slightly different backbone closure parameters (21, 22).
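The geometric test described above can be sketched as follows. This is an illustrative reading of the four criteria, not the code used in this work: we assume the two virtual bond angles are C1’-Nb···Nb’ and Nb···Nb’-C1’ and the virtual torsion is C1’-Nb···Nb’-C1’, and the name `is_pair_ready` and its argument layout are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def _angle_deg(a, b, c):
    """Angle in degrees at vertex b for the three points a-b-c."""
    u, v = _sub(a, b), _sub(c, b)
    cosine = _dot(u, v) / (_norm(u) * _norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

def _torsion_deg(p0, p1, p2, p3):
    """Dihedral angle in degrees for p0-p1-p2-p3 (0 = cis, 180 = trans)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    b2_hat = tuple(x / _norm(b2) for x in b2)
    m1 = _cross(n1, b2_hat)
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))

def is_pair_ready(c1_i, nb_i, nb_j, c1_j,
                  dist=(9.0, 0.5), angle=(125.0, 20.0), torsion=(0.0, 40.0)):
    """True if two bases meet all four criteria (center, half-width).

    Coordinates are (x, y, z) tuples in angstroms. The torsion window is
    compared without wrap-around, which is adequate for a center of 0.
    """
    separation = _norm(_sub(nb_j, nb_i))
    angle_i = _angle_deg(c1_i, nb_i, nb_j)
    angle_j = _angle_deg(c1_j, nb_j, nb_i)
    dihedral = _torsion_deg(c1_i, nb_i, nb_j, c1_j)
    return (abs(separation - dist[0]) <= dist[1]
            and abs(angle_i - angle[0]) <= angle[1]
            and abs(angle_j - angle[0]) <= angle[1]
            and abs(dihedral - torsion[0]) <= torsion[1])
```

Passing the windows as arguments makes it easy to tighten or relax the criteria, which, as noted above, shifts the computed entropy by a constant offset rather than changing its trends.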
Figure 2.3 Free energy cost due to conformational entropy loss at 310 K for loop initiation in an unconstrained
chain. The cost increases smoothly as a function of loop size (nt) with no significant position dependence along the
sequence other than at the chain’s ends where the cost decreases slightly. Experimental data for hairpin initiation
obtained from melting experiments and aggregated in the nearest-neighbor model’s database (21) have been
included for comparison purposes. Error bars have been included for all points in the average value series. (Inset)
the backbone geometric criteria used to define a base pair in the MC simulation. All parameters are chosen to put
the C1’-Nb glycosidic bonds in the correct geometry to form a Watson-Crick pair.
Once a loop has been initiated, the helix can propagate by stacking more paired
bases onto the first one. MC data show that the free energy cost due to backbone
conformational entropy required for propagating the stem is 5.22 ± 0.03 kcal/mol per
rung, in agreement with previous results (20). This value is independent of the length of
the existing helix.
2.2.2 Initiation of a Second Hairpin
The formation of a second hairpin on an RNA strand that already contains one
provides the first test for assessing whether the constraints associated with two side-by-side
hairpins are independent. Figure 2.4 shows initiation free energy for the second
hairpin as a function of its loop length. The open circles are loop initiation free energies
for the first hairpin taken from Fig. 2.3. The green markers are initiation free energies for
a second hairpin formed on the strand in which the first loop has a minimal stem length
of 1 and the spacer length 𝑐 is variable. The grayscale markers are initiation costs for a
fixed length of 𝑐 and variable length in the stem of loop 𝑎 . Fig. 2.4 shows that, to within
statistical errors, the initiation of the second hairpin costs as much entropy as the first
one of the same loop length 𝑏 . This indicates that the constraints associated with two
side-by-side hairpins are indeed largely independent.
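The independence test can be stated compactly. In the sketch below, $P(a)$ and $P(a,b)$ are notation introduced here for illustration: the probability that the constraint closing loop $a$ is satisfied, and the probability that the constraints closing loops $a$ and $b$ are satisfied simultaneously. The cost of the second hairpin is a conditional free energy, and it reduces to the free-chain value of Eq. 2.2 whenever the joint probability factorizes.

```latex
\Delta G(b \mid a) = -k_B T \ln \frac{P(a,b)}{P(a)},
\qquad
P(a,b) \approx P(a)\,P(b)
\;\Longrightarrow\;
\Delta G(b \mid a) \approx -k_B T \ln P(b) = \Delta G(b)
```

Agreement between the second-hairpin series and the open circles in Fig. 2.4 is therefore equivalent, within statistical error, to the factorization of the joint probability.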
Figure 2.4 Free energy cost at 310 K to initiate a second loop of length 𝑏 in a chain already containing a loop. The
second, third, and fourth series show the cost of the second loop 𝑏 is independent of the spacer length 𝑐 between
it and the first loop 𝑎 , which has a minimal stem length of 1. The fifth, sixth, and seventh series show the cost of
loop 𝑏 is independent of the stem length of loop 𝑎 for a spacer length 𝑐 = 2 nt. Other data showing similar
independence for different spacer lengths 𝑐 as well as the stem length on loop 𝑎 are not presented. Note that error
bars were included even though some of them are too small to be observed.
2.2.3 Two-way Junctions
The free energies for forming two-way junctions are shown in Fig. 2.5. In the
topological representation of a two-way junction, depicted on the top left of Fig. 2.5,
there are three relevant loop lengths: a is the length of the hairpin loop, b is the length
of the junction on the 5’ side, and c is the length of the junction on the 3’ side. The dangling
free ends of the chain are omitted as usual because they do not cost free energy.
Figure 2.5 shows the additional free energy needed to initiate a two-way junction after
the hairpin loop a is in place, as a function of the two junction lengths 𝑏 and 𝑐 in nt.
Figure 2.5 illustrates that the free energy ∆𝐺 (𝑏 , 𝑐 ) is approximately the same when 𝑏
and 𝑐 are swapped, indicating that the initiation cost of a two-way junction is roughly
symmetric with respect to the 5’ and 3’ junction lengths. The numerical values for
∆𝐺 (𝑏 , 𝑐 ) are tabulated in Table 2.1, with error estimates given in parentheses. While a
precise comparison between the numerical values obtained from experiments versus
simulations is difficult due to the simulations only accounting for the backbone entropy,
the trends observed in our data are nevertheless similar to those from experiments used
in constructing the nearest neighbor model. The entropic cost in general increases as
the size of the loop (𝑏 + 𝑐 ) grows and exhibits asymptotic behavior for sufficiently large
loop size (8, 9, 35).
Figure 2.5 The free energy costs of forming a two-way junction with 5’ and 3’ junction length 𝑏 and 𝑐 respectively
given that a loop 𝑎 is already in place. (a) Top view. (b) Side view. In general, the free energy cost grows as the
junction size increases and is roughly symmetric when the 5’ and 3’ lengths are swapped.
Table 2.1 Table of free energy cost of forming a two-way junction in kcal/mol as a function of the 5’ and 3’ junction lengths in nt, 𝑏 and 𝑐 ,
respectively. Error estimates from the simulation are given in parentheses.
Figure 2.6 shows how the two-way junction free energy depends on the loop length
of the hairpin on the other side of the helix and the length of the stem itself. The
conjecture that motivates our topological reduction scheme argues that they should be
largely independent. Figure 2.6 plots the free energy of initiating a symmetric two-way
junction (i.e. 𝑏 = 𝑐 ) as a function of the junction size for a 4 nt hairpin loop with three
different stem lengths (1 nt, 4 nt, and 6 nt), as well as a 6 nt hairpin loop with a 1 nt
stem, and a 7 nt loop with a 1 nt stem. Clearly, the entropic costs for junction formation
are independent of the hairpin on the other side of the constraint as well as the helix
length. Note that the variation in costs for larger loop sizes is a natural result of the
counting experiment. A higher entropic cost corresponds to a smaller number of
recorded occurrences which is more heavily impacted by counting uncertainty. While
not shown explicitly here, results for all two-way junctions, symmetric or asymmetric,
demonstrate similar independence. Error bars are shown explicitly for a few data points
to illustrate the size of the typical uncertainties.
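The growth of the uncertainty with entropic cost follows from standard error propagation through Eq. 2.2. The estimate below is a sketch that assumes the MC frames are statistically independent, so that the number of observed events $n$ out of $N$ frames follows binomial statistics; the symbols $n$, $N$, $\sigma_P$, and $\sigma_{\Delta G}$ are introduced here for illustration.

```latex
P = \frac{n}{N}, \qquad
\sigma_P \approx \sqrt{\frac{P(1-P)}{N}}, \qquad
\sigma_{\Delta G} = k_B T \,\frac{\sigma_P}{P}
  \approx k_B T \sqrt{\frac{1-P}{n}}
  \approx \frac{k_B T}{\sqrt{n}} \quad (P \ll 1)
```

With $k_B T \approx 0.62$ kcal/mol at 310 K, a junction observed only 4 times carries an uncertainty of roughly 0.3 kcal/mol, whereas one observed 100 times carries roughly 0.06 kcal/mol. Rare, high-cost junctions are thus intrinsically noisier in a counting experiment of fixed size.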
Figure 2.6 The free energy cost of forming symmetric two-way junctions plotted for chains with different sizes of
the first loop, 𝑎 , and for different lengths of the stem separating 𝑎 from the 2-way junction (𝑏 , 𝑐 ). Over the set of
three values used for 𝑎 , the free energy costs to close the junction are consistent with each other. This indicates
that the two-way junction cost depends only on the two junction lengths 𝑏 and 𝑐 , but not on the loop on the opposite
side, 𝑎 . Over the three different stem lengths, the cost to close the symmetric junction shows no discernible
dependence on the length of the stem. Typical error bars for selected data points are included. The error for larger
loop sizes can be attributed to errors in the counting experiment. The dashed line is a guide to the eye.
2.2.4 Three-way Junctions
Three-way junctions are characterized by three different junction lengths as shown
in Fig. 2.7. As in the case of two-way junctions, the free energy cost of initiating a three-
way junction is largely independent of the hairpins on the opposite side of all three
constraints. In Table 2.2, we tabulate the values of ∆𝐺 (𝑎 , 𝑏 , 𝑐 ), where 𝑎 is the length of
the 5’ junction, 𝑐 is the length of the 3’ junction, and 𝑏 is the length of the junction in the
middle; Figure 2.8 shows the corresponding free energy surface. Only one value for 𝑏 is
shown in Table 2.2; data tables for all other values of 𝑏 studied are included in Appendix
A, Table A.1. Not surprisingly, closing a three-way junction costs more free energy than
two-way junctions, but this additional cost is only marginal. Comparison of our data
against experimental results shows some deviations; this is expected as the introduction
of larger loops and more branching helices yields larger contribution to the experimental
results from sources that are not included in our simulations such as sequence-
dependent stabilization and coaxial stacking of helices. In terms of comparing against
existing simulation results, we observed the same dependence on loop size and
number of branching helices as Aalberts & Nandagopal (35). As the loop size increases,
the free energy cost increases. Additionally, as the number of branching helices
increases, there is an overall destabilizing effect that increases the cost for all loop sizes
(35). This can be seen in the decreased range spanned by the entropy cost as we move
from two-way junction to three- and four-way junctions. The trends are also similar to
results obtained in other studies (16, 17), though our predicted entropic costs are
somewhat higher. This difference most likely originates from the way in which each
simulation handles the torsional motion of the backbone with the other studies using
highly discretized models—diamond lattice for Cao & Chen (16) and discrete states
configuration space for Zhang et al. (17).
Figure 2.7 Reduced topological representation of the set of constraints defining a three-way junction. For Table 2.2
and each sub-table of Tables A.1, A.2, and A.3 in the Supplemental Information, the value for b is fixed while a and
c change to give rise to the different sizes of three-way junctions.
Table 2.2 Table of free energy costs of forming a three-way junction in kcal/mol as a function of the 5’ and 3’ junction lengths in nucleotides (𝑎
and 𝑐 respectively) with the center junction length (𝑏 ) kept at 1 nt as a parameter. For 𝑏 = 0 and 𝑏 ≥ 2, see Tables A.1, A.2, and A.3 in the
supplemental material. Error estimates from the simulation are given in parentheses. Entries which have “inf” errors were too infrequently
observed during the simulation for errors to be accurately calculated.
Figure 2.8 The free energy costs of forming a three-way junction with 5’ and 3’ junction length 𝑎 and 𝑐 respectively
given that junction length 𝑏 is fixed at 1 nt; this surface corresponds to the data given in Table 2.2 above. (a) Top
view. (b) Side view.
2.2.5 Initiation of a Third Hairpin
Figure 2.9 shows the free energy for initiating a third hairpin c after two others (a
and b) have been formed, as a function of loop length 𝑐 in nt. The open circles are
initiation free energy for the first hairpin taken from Fig. 2.3. Red circles show hairpin
initiation on the 5’ side of loop a. Violet squares show hairpin initiation on the 3’ side of
loop b, and green diamonds show hairpin initiation on the strand between a and b.
Analogous to the results for the initiation of a second hairpin shown in Fig. 2.4, the third
hairpin is largely independent of the first two. The segment length between any two
hairpins in this set of data varies from 0 to 4 nt.
Figure 2.9 The free energy cost of initiating a third hairpin of length 𝑐 in the presence of two existing loops (𝑎 and
𝑏 ). When compared against the cost of initiating a hairpin loop on the free chain, the cost of the third loop is
comparable and shows no dependence on the location of the new loop relative to the existing loops. This suggests
that independence of hairpin loops can be extended to any number of loops within a chain. Note that error bars
were included for the average cost of the first hairpin as in Fig. 2.3; some of them are not visible due to their size.
2.2.6 Four-way Junctions
Figure 2.10 shows the reduced topological representation of a four-way junction,
with the loop on the other side of every hairpin having been factored out. The free
energy of formation of a four-way junction is a function of the four junction lengths 𝑎 , 𝑏 ,
𝑐 and 𝑑 . Initiation free energies for an example of a four-way junction are tabulated in
Table 2.3, for one particular combination of junction lengths 𝑏 = 𝑐 = 4 nt. Data shown
are the additional free energy cost for the fourth constraint to be met after the first three
constraints are in place. To obtain this data set, an ensemble of 2 million MC simulated
conformations of (U)42 chains was used. Free energies in Table 2.3 show that closing a
four-way junction generally costs more entropy than a three-way junction (see
Table 2.2), which in turn costs more entropy than two-way junctions. Again, error
estimates are given in parentheses. The error bars are a little larger than for the two-
and three-way junctions because the probability of observing a four-way junction was
quite low. In Table 2.3, cells that are blank indicate combinations that failed to show up
in the 2-million-member MC simulated ensemble.
Figure 2.10 Reduced topological representation of the set of constraints defining a four-way junction. For the
purposes of this study, two of the lengths were constrained to be equal and fixed in value (𝑏 = 𝑐 = 4 nt) while
the other lengths (𝑎 and 𝑑 ) can vary.
Table 2.3 Table of free energy cost of forming a four-way junction in kcal/mol as a function of the 5’ and 3’ junction length in nucleotide (𝑎
and 𝑑 respectively) with the middle junction lengths fixed (𝑏 = 𝑐 = 4 nt). Error estimates from the simulation are given in parentheses. Blank
entries correspond to events that were not observed during the simulation despite the large size of the ensemble generated.
2.3 Discussion
The topological representation we have developed above has been used to aid in
the factorization of the joint constraints imposed by typical RNA secondary structure
motifs into approximately independent subsets. Here, we discuss the broader
application of this scheme.
First, using the topological reduction scheme and data presented above, calculating
the total free energy cost arising from backbone conformational constraints associated
with any structure is simple. Using the three-way junction from Fig. 2.1(a) as an
example, we will illustrate this procedure for junction lengths 𝑎 = 6, 𝑏 = 4, 𝑐 = 5, 𝑑 = 3, 𝑒
= 6 nt, with one of the two stems having 𝑓 base pair steps and the other having 𝑔 . From
Fig. 2.3, the free energy for seeding hairpin loops 𝑑 = 3 nt and 𝑒 = 6 nt are 4.8 and 6.6
kcal/mol, respectively. The cost for propagating a seeded hairpin is 5.2 kcal/mol/base-
pair-step, so the free energy associated with the two stems combined is
5.2 × (𝑓 + 𝑔 ) kcal/mol. From Table S1(d), the free energy for a 6-4-5 three-way junction
is 7.4 kcal/mol. The total is therefore 18.8 + 5.2(𝑓 + 𝑔 ) kcal/mol.
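The bookkeeping in this example can be written as a short lookup-and-sum. The sketch below hard-codes only the values quoted in the paragraph above; the dictionary and function names are hypothetical, and a practical version would read the full tables instead.

```python
# Values quoted in the text above (kcal/mol at 310 K).
HAIRPIN_INIT = {3: 4.8, 6: 6.6}        # hairpin loop length (nt) -> initiation cost
STEM_STEP = 5.2                        # cost per base-pair step of stem propagation
THREE_WAY_JUNCTION = {(6, 4, 5): 7.4}  # junction lengths (a, b, c) in nt -> cost

def total_backbone_cost(hairpin_loops, junction, n_stem_steps):
    """Sum the independent backbone entropy contributions of a factorized fold.

    hairpin_loops : iterable of hairpin loop lengths in nt
    junction      : (a, b, c) junction lengths in nt
    n_stem_steps  : total number of base-pair steps over all stems (f + g)
    """
    cost = sum(HAIRPIN_INIT[length] for length in hairpin_loops)
    cost += THREE_WAY_JUNCTION[junction]
    cost += STEM_STEP * n_stem_steps
    return cost
```

Calling `total_backbone_cost((3, 6), (6, 4, 5), 0)` reproduces the 18.8 kcal/mol constant term, and each additional base-pair step adds 5.2 kcal/mol. The sum is legitimate only because the factorized subsets are approximately independent.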
Topological reduction can also be used to analyze the interdependence of more
complex constraints coming from tertiary contacts. An example is shown in Fig. 2.11.
Many riboswitches, such as the guanine-responsive riboswitch from the xpt-pbuX
operon of B. subtilis (36) and TPP-specific riboswitch of Arabidopsis thaliana (37), make
use of a three-way junction architecture to form their aptamer domain. When the
aptamer binds its target ligand, additional constraints arising from the reconfiguration of
the binding pocket either destabilize existing tertiary interactions or stabilize additional
tertiary contacts, leading to a rearrangement of the folded structure that causes an
upstream or downstream switching sequence to rehybridize and produces a global
shape transformation in the riboswitch RNA (38–40). Figure 2.11 shows how some of
these interactions renormalize the topology of a three-way junction.
Figure 2.11 Diagrammatic representation of the topology of a three-way junction and how it can be altered by
introduction of new tertiary interactions. (a) An unmodified three-way junction like the one shown in Fig. 2.1(a). (b)
Representation of kissing loops. The new constraint represented by the thick dash line in the top row of (b) results
in a change in connectivity that no longer allows the two loops 𝒃 and 𝒅 to be factored. (c) Representation of ligand-
mediated base-base contact in the three-way junction. The new constraint closes a portion of the three-way
junction into a loop, giving rise to a diagram that is factorizable into 4 independent subsets corresponding to two
hairpins, one two-way junction, and one three-way junction. (d) The kissing loop and ligand-mediated base-base
interaction are combined. The effect changes the connectivity to yield a factorizable diagram consisting of a two-
way junction and the structure previously seen in (b). (e) The kissing loop interaction is now combined with a triple
base interaction. This yields a new structure factorizable into a two-way junction and a new multiply-connected
loop structure.
The top row of Fig. 2.11(a) shows the same three-way junction architecture from
Fig. 2.1(a) without tertiary contacts. The second row in Fig. 2.11(a) shows its topological
representation and the third row shows the final factorized diagram from Fig. 2.1(a). As
described above, without tertiary contacts the two hairpin loops and the junctions are
largely independent, and from this we derive three disjoint sets of constraints. Now
consider the addition of a kissing-loop interaction, denoted in Fig. 2.11(b) by a thick
dashed line, between hairpins b and d. The topological representation of this structure
is shown in the second row of Fig. 2.11(b), where the constraint imposed by the kissing-
loop interaction is represented by a white circle. Due to this extra constraint, this
structure is no longer factorizable because it contains no articulation points. Therefore,
the kissing-loop interaction modifies the topological structure of the diagram
fundamentally. In the language of topology, this diagram now belongs to a different
“class” than the diagram in Fig. 2.11(a). This new nonfactorizable topological class is
shown on the bottom row of Fig. 2.11(b).
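The loss of articulation points can be checked mechanically. The sketch below uses a simplified encoding that is our own assumption rather than the diagrams of Fig. 2.11: loops (`J`, `b`, `d`) and constraints (`v1`, `v2`, `v3`, `k`) are both nodes, and an edge joins a constraint to every loop it borders. A constraint counts as an articulation point if deleting it disconnects the remaining graph. All node labels and function names are hypothetical.

```python
def build_graph(edges):
    """Undirected adjacency map built from a list of (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def is_connected_without(graph, removed):
    """Depth-first search that ignores the node `removed`."""
    nodes = [n for n in graph if n != removed]
    if not nodes:
        return True
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(nodes)

def articulation_constraints(edges, constraints):
    """Constraints whose removal splits the diagram into separate pieces."""
    graph = build_graph(edges)
    return {c for c in constraints if not is_connected_without(graph, c)}

# Three-way junction J with two hairpin loops b and d (no tertiary contacts).
THREE_WAY = [("J", "v1"), ("v1", "b"), ("J", "v2"), ("v2", "d"), ("J", "v3")]
# Same structure with a kissing-loop constraint k joining loops b and d.
KISSING = THREE_WAY + [("b", "k"), ("k", "d")]

plain = articulation_constraints(THREE_WAY, ["v1", "v2", "v3"])
kissed = articulation_constraints(KISSING, ["v1", "v2", "v3", "k"])
```

In this toy encoding, `plain` contains `v1` and `v2`, the two constraints at which the hairpins can be factored off, while `kissed` is empty: the kissing-loop constraint closes a cycle through both hairpins and the junction, so no single constraint separates them. The brute-force removal test is chosen for clarity over efficiency.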
In Fig. 2.11(c), a different tertiary interaction is introduced into the three-way
junction. The dashed line in the top row of Fig. 2.11(c) denotes a new base-base
contact between two of the junctions mediated by a ligand upon binding. The topological
representation of this structure is shown in the second row of Fig. 2.11(c), and complete
factorization leads to the diagram on the bottom row of Fig. 2.11(c). In this case, the two
loops b and d corresponding to the hairpins remain factorizable, but the new interaction
between loops a and e renormalizes the diagram into a different topological class. The
final factorized representation, shown on the bottom row of Fig. 2.11(c), is topologically
equivalent to two hairpin loops, one two-way junction, and one three-way junction.
The structure in Fig. 2.11(d) combines a kissing-loop tertiary contact between b and
d with a base-base tertiary interaction between a and e. The final factorized diagram is
shown on the bottom row of Fig. 2.11(d), consisting of one two-way junction, plus three
multiply-connected loops, which happens to belong to the same topological class as the
structure in Fig. 2.11(b).
Finally, Fig. 2.11(e) introduces a new type of tertiary interaction. The thick three-way
dashed line in Fig. 2.11(e) denotes a triple base interaction, such as the one observed
in the crystallographic structure of the G-box riboswitch when a guanine is bound into
the aptamer domain. The ligand forms contacts simultaneously with three bases,
leading to a triplet interaction. Figure 2.11(e) considers the topological renormalization
that is produced by mixing a kissing-loop interaction between hairpins b and d with a
base-triple interaction among junctions a, c, and e. The final factorized diagram is
shown in the bottom row of Fig. 2.11(e). This diagram suggests that the structure in
Fig. 2.11(e) is topologically equivalent to one two-way junction, plus four mutually
connected loops. This result also explains how riboswitches based on a three-way
junction motif might utilize tertiary interactions coming from ligand binding to induce
loop-loop interactions in distal regions of their RNA sequences.
We conclude by mentioning one useful property of factorizable diagrams. After
complete factorization, each disjoint piece consists of a self-contained substructure that
traces out a closed circuit beginning with an initial vertex and ending on the same vertex,
traversing every arc inside the substructure once and only once. For each of these
substructures, a basic theorem in topology states that the choice of the initial vertex is
arbitrary, and the choice of the first arc to follow to start the circuit is also arbitrary. This
means that when calculating the entropy of a substructure, the answer does not depend
on which constraint (i.e. vertex) one starts and ends with. On the other hand,
substructures that do not begin and end on the same vertex, such as the one in
Fig. 2.1(b), must have exactly two odd vertices. Any traversal of
the entire path through such structures must start on one of the odd vertices
and end on the other one.
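The distinction between the two cases rests on vertex degree parity, which is simple to check. The sketch below assumes the substructure is connected and is given as a list of arcs between labeled vertices; the function name `classify_traversal` and the example arc lists are hypothetical.

```python
from collections import Counter

def classify_traversal(arcs):
    """Classify a connected diagram by the parity of its vertex degrees.

    arcs : list of (u, v) vertex pairs; a hairpin-like self-loop (u == v)
           contributes 2 to the degree of its vertex.
    Returns a label and the sorted list of odd-degree vertices.
    """
    degree = Counter()
    for u, v in arcs:
        degree[u] += 1
        degree[v] += 1
    odd = sorted(vertex for vertex, d in degree.items() if d % 2 == 1)
    if not odd:
        return "closed circuit", odd   # may start and end at any vertex
    if len(odd) == 2:
        return "open path", odd        # must start and end at the odd vertices
    return "not traversable in one pass", odd
```

For instance, `classify_traversal([(1, 2), (2, 3), (3, 1)])` reports a closed circuit with no odd vertices, while `classify_traversal([(1, 2), (2, 3)])` reports an open path whose end points are vertices 1 and 3.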
The examples in this and the preceding section show how our proposed topological
perspective of RNA structures could lead to new insights into the interplay among
multiple constraints inherent in the secondary and tertiary structures of folded RNAs. By
extending our study to more complex secondary structures such as those in Fig. 2.1(c)
and 2.1(d), we should be able to examine the validity of the factorization hypothesis on
more complex elements and evaluate their entropic costs from simulation. This can then
be used to study more complex tertiary folds by mapping the 3D structure to the
corresponding 2D graphs which we can separate into the independent subsets to
calculate their entropic costs. Work that expands our current analysis to more complex
constraints such as pseudoknots and quadruplexes will be presented in the next
chapter.
2.4 References
1. De Gennes, P.-G. 1979. Scaling Concepts in Polymer Physics. Cornell University
Press.
2. Flory, P.J., and M. Volkenstein. 1969. Statistical mechanics of chain molecules. Wiley Online Library.
3. Laing, C., and T. Schlick. 2011. Computational approaches to RNA structure prediction, analysis, and
design. Curr. Opin. Struct. Biol. 21:306–318.
4. Kim, N., C. Laing, S. Elmetwaly, S. Jung, J. Curuksu, and T. Schlick. 2014. Graph-based sampling
for approximating global helical topologies of RNA. Proc. Natl. Acad. Sci. 111:4079–4084.
5. Gan, H.H., D. Fera, J. Zorn, N. Shiffeldrim, M. Tang, U. Laserson, N. Kim, and T. Schlick. 2004.
RAG: RNA-As-Graphs database—concepts, analysis, and features. Bioinformatics. 20:1285–1291.
6. Fera, D., N. Kim, N. Shiffeldrim, J. Zorn, U. Laserson, H.H. Gan, and T. Schlick. 2004. RAG: RNA-
As-Graphs web resource. BMC Bioinformatics. 5:88.
7. Mathews, D.H., J. Sabina, M. Zuker, and D.H. Turner. 1999. Expanded sequence dependence of
thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288:911–
940.
8. Turner, D.H., and D.H. Mathews. 2009. NNDB: the nearest neighbor parameter database for
predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 38:D280–D282.
9. Mathews, D.H., and D.H. Turner. 2006. Prediction of RNA secondary structure by free energy
minimization. Curr. Opin. Struct. Biol. 16:270–278.
10. Diamond, J.M., D.H. Turner, and D.H. Mathews. 2001. Thermodynamics of three-way multibranch
loops in RNA. Biochemistry. 40:6971–6981.
11. Zuker, M. 1989. [20] Computer prediction of RNA structure. In: Methods in Enzymology. Academic
Press. pp. 262–288.
12. Zuker, M., D.H. Mathews, and D.H. Turner. 1999. Algorithms and Thermodynamics for RNA
Secondary Structure Prediction: A Practical Guide. In: Barciszewski J, BFC Clark, editors. RNA
Biochemistry and Biotechnology. Dordrecht, Netherlands: Springer Netherlands. pp. 11–43.
13. Zuker, M. 2003. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids
Res. 31:3406–3415.
14. Zuker, M. 1989. On finding all suboptimal foldings of an RNA molecule. Science. 244:48–52.
15. Zadeh, J.N., C.D. Steenberg, J.S. Bois, B.R. Wolfe, M.B. Pierce, A.R. Khan, R.M. Dirks, and N.A.
Pierce. 2011. NUPACK: Analysis and design of nucleic acid systems. J. Comput. Chem. 32:170–173.
16. Cao, S., and S.-J. Chen. 2005. Predicting RNA folding thermodynamics with a reduced chain
representation model. RNA. 11:1884–1897.
17. Zhang, J., M. Lin, R. Chen, W. Wang, and J. Liang. 2008. Discrete state model and accurate
estimation of loop entropy of RNA secondary structures. J. Chem. Phys. 128:03B624.
18. Arnold, B.H. 2011. Intuitive concepts in elementary topology. Courier Corporation.
19. Balakrishnan, R., and K. Ranganathan. 2012. A textbook of graph theory. Springer Science &
Business Media.
20. Mak, C.H. 2015. Atomistic Free Energy Model for Nucleic Acids: Simulations of Single-Stranded DNA
and the Entropy Landscape of RNA Stem–Loop Structures. J. Phys. Chem. B. 119:14840–14856.
21. Mak, C.H., L.L. Sani, and A.N. Villa. 2015. Residual Conformational Entropies on the Sugar–
Phosphate Backbone of Nucleic Acids: An Analysis of the Nucleosome Core DNA and the Ribosome.
J. Phys. Chem. B. 119:10434–10447.
22. Mak, C.H., T. Matossian, and W.-Y. Chung. 2014. Conformational entropy of the RNA phosphate
backbone and its contribution to the folding free energy. Biophys. J. 106:1497–1507.
23. Mak, C.H. 2008. RNA conformational sampling. I. Single‐nucleotide loop closure. J. Comput. Chem.
29:926–933.
24. Mak, C.H., W.-Y. Chung, and N.D. Markovskiy. 2011. RNA conformational sampling: II. Arbitrary
length multinucleotide loop closure. J. Chem. Theory Comput. 7:1198–1207.
25. Weeks, J.D., D. Chandler, and H.C. Andersen. 1971. Role of repulsive forces in determining the
equilibrium structure of simple liquids. J. Chem. Phys. 54:5237–5247.
26. Henke, P.S., and C.H. Mak. 2014. Free energy of RNA-counterion interactions in a tight-binding
model computed by a discrete space mapping. J. Chem. Phys. 141:08B612_1.
27. Mak, C.H., and P.S. Henke. 2012. Ions and RNAs: free energies of counterion-mediated RNA fold
stabilities. J. Chem. Theory Comput. 9:621–639.
28. Mak, C.H. 2016. Unraveling Base Stacking Driving Forces in DNA. J. Phys. Chem. B. 120:6010–20.
29. Rury, A.S., C. Ferry, J.R. Hunt, M. Lee, D. Mondal, S.M.O. O’Connell, E.N.H. Phan, Z. Peng, P.
Pokhilko, D. Sylvinson, Y. Zhou, and C.H. Mak. 2016. Solvent Thermodynamic Driving Force
Controls Stacking Interactions between Polyaromatics. J. Phys. Chem. C. 120:23858–23869.
30. Hummer, G., S. Garde, A.E. Garcia, A. Pohorille, and L.R. Pratt. 1996. An information theory model
of hydrophobic interactions. Proc. Natl. Acad. Sci. USA. 93:8951–8955.
31. Henke, P.S., and C.H. Mak. 2016. An implicit divalent counterion force field for RNA molecular
dynamics. J. Chem. Phys. 144.
32. Coimbatore Narayanan, B., J. Westbrook, S. Ghosh, A.I. Petrov, B. Sweeney, C.L. Zirbel, N.B.
Leontis, and H.M. Berman. 2013. The Nucleic Acid Database: new features and capabilities. Nucleic
Acids Res. 42:D114–D122.
33. Berman, H.M., W.K. Olson, D.L. Beveridge, J. Westbrook, A. Gelbin, T. Demeny, S.-H. Hsieh, A.R.
Srinivasan, and B. Schneider. 1992. The nucleic acid database. A comprehensive relational database
of three-dimensional structures of nucleic acids. Biophys. J. 63:751–759.
34. Serra, M.J., and D.H. Turner. 1995. [11] Predicting thermodynamic properties of RNA. Methods
Enzymol. 259:242–261.
35. Aalberts, D.P., and N. Nandagopal. 2010. A two-length-scale polymer theory for RNA loop free
energies and helix stacking. RNA. 16:1350–1355.
36. Batey, R.T., S.D. Gilbert, and R.K. Montange. 2004. Structure of a natural guanine-responsive
riboswitch complexed with the metabolite hypoxanthine. Nature. 432:411–415.
37. Thore, S., M. Leibundgut, and N. Ban. 2006. Structure of the eukaryotic thiamine pyrophosphate
riboswitch with its regulatory ligand. Science. 312:1208–1211.
38. Manzourolajdad, A., and J. Arnold. 2015. Secondary structural entropy in RNA switch (Riboswitch)
identification. BMC Bioinformatics. 16:133.
39. Roth, A., and R.R. Breaker. 2009. The structural and functional diversity of metabolite-binding
riboswitches. Annu. Rev. Biochem. 78:305–34.
40. Montange, R.K., and R.T. Batey. 2008. Riboswitches: emerging themes in RNA structure and
function. Annu. Rev. Biophys. 37:117–33.
Chapter 3: Quantifying Structural Diversity of CNG
Trinucleotide Repeats Using Diagrammatic Algorithms
The structure most often associated in the literature with the gain-of-function
hypothesis for expanded CNG sequences is a necklace-like structure composed
of a long stretch of successive two-way junctions interposed by short helices and
capped by a hairpin stem-loop (1–6). Many of these studies, however, are based on
CNG repeat oligomers, whereas the threshold of TRED disease onset is typically
associated with expansions of 60 to 100 units or longer. Additionally, the structures
resolved are limited to those that can be isolated and crystallized. As the length of the
CNG repeats grows, the diversity of accessible structures could grow rapidly as well, with
the necklace structure comprising only one possible subset of motifs out of many. This
leaves a gap in our understanding of the structure-function relationship that may be
responsible for pathogenesis in CNG-related TREDs. The structural diversity of CNG
chains with lengths comparable to the critical expansion thresholds of TREDs
remains unclear. In this chapter, we present backbone entropic cost information for
the constraints needed to form base triples, pseudoknots, and quadruplexes which, together
with the data presented in the previous chapter, completes the list of
constraints we expect to encounter in the secondary structural ensemble of CNG repeats.
The compiled database will be used to perform calculations aimed at addressing the
question of structural diversity as the chains approach the critical expansion threshold
using a diagrammatic algorithm.
3.1 Materials and Methods
3.1.1 Backbone Conformational Entropy and Secondary Structures
We begin with a quick recap of basic notions in RNA folding, constraints, and dual
graphs. An open RNA strand is characterized by an ensemble of many diverse
conformations and is in a high-entropy state. If this RNA sequence can fold and develop
secondary structure(s), the entropy of the chain will decrease as the chain folds
because the number of conformations that are consistent with the secondary structures
in the folded state is necessarily lower than the unfolded state. This decrease in
conformational entropy 𝑆 of an RNA can be viewed as a result of the constraint(s)
imposed by the secondary structural elements present in the fold on the possible
conformations of the chain. The entropy loss from the unfolded state to the folded state
is:
\[
\Delta S = S(\text{with constraints}) - S(\text{no constraints}) = -k_B \sum_c \left[ P'_c \ln P'_c - P_c \ln P_c \right] \tag{3.1}
\]
where 𝑐 is a chain conformation, $P'_c$ and $P_c$ are the normalized probabilities of that
conformation with and without the constraint(s) imposed by the secondary structure, respectively,
and $k_B$ is Boltzmann’s constant. If an alternative fold with a different set of secondary
structural elements exists, it will in general have a different conformational entropy
because the constraints are different. Adding more constraints to the chain in the form
of more complex secondary structures will necessarily lead to a more negative Δ𝑆 , but
the constraints imposed by the different secondary structural elements in a fold are in
general not independent. That is, Δ𝑆 with constraints A + B is not necessarily equal to
Δ𝑆 with constraint A plus Δ𝑆 with constraint B.
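As a minimal numerical sketch of Eq. 3.1, the following Python snippet evaluates the entropy of a toy ensemble before and after a constraint removes most of its conformations. The eight-state ensemble, the `constrain` helper, and the units chosen for the Boltzmann constant are illustrative assumptions and are not taken from our simulations.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K); unit choice is an assumption


def entropy(probs):
    """Return -k_B * sum(p ln p) for a normalized probability distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -K_B * sum(p * math.log(p) for p in probs if p > 0.0)


def constrain(probs, allowed):
    """Zero out conformations that violate the constraint and renormalize the rest."""
    kept = [p if ok else 0.0 for p, ok in zip(probs, allowed)]
    norm = sum(kept)
    return [p / norm for p in kept]


# Toy open chain: 8 equally likely conformations (hypothetical).
open_chain = [1.0 / 8] * 8
# Hypothetical constraint satisfied by only 2 of the 8 conformations.
folded = constrain(open_chain, [True, True] + [False] * 6)

# Entropy change on imposing the constraint; negative because states are lost.
delta_s = entropy(folded) - entropy(open_chain)
```

For a uniform ensemble the result reduces to $k_B \ln(2/8)$, which is negative, in line with the statement that adding constraints lowers the conformational entropy.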
In order to more easily characterize RNA secondary structures, Schlick et al. have
proposed a diagrammatic scheme (7–9). Figure 3.1 shows examples of some of the
diagrams representing different kinds of secondary structural elements. The bottom row
of Fig. 3.1(a) illustrates the diagram of a three-way junction. Each of the three helices is
represented by a black dot. Unpaired loops are represented by curved lines. The bottom
row of Fig. 3.1(b) shows the diagrammatic representation of a pseudoknot. The two
black dots now represent the two paired regions, whereas the unpaired loops are
represented by straight or curved lines. A triplex is illustrated in Fig. 3.1(c). A black
triangle is used in its diagrammatic representation to represent the three-base
interaction in this secondary structural element. Fig. 3.1(d) shows a quadruplex, and a
black square is used to represent the interactions of the four bases in this secondary
structure. In these diagrams, the points where two or more lines converge are called
“vertices” while the lines themselves are called “edges”.
For the remainder of this chapter, we will analyze the structural ensemble of CNG
repeats and lay the foundation for the theory that will be used in Chapter 4 to extend our
analysis to arbitrarily long CNG chains. This formalism requires additional terms which
we now introduce. The number of edges emanating from a vertex 𝑣 defines the degree
$d_v$ of the vertex. A dot (representing a duplex) always has four edges and $d_v = 4$. A
triangle (representing a triplex) always has six edges with $d_v = 6$, and a square
(representing a quadruplex) always has eight edges and $d_v = 8$. Because all RNAs are
linear polymers, any graph representing an RNA fold is necessarily Eulerian (10),
meaning there is a way to trace through the entire diagram that traverses each of its
edges exactly once. This also implies that either the degree of every vertex is even, or
exactly two vertices have odd degree while all others are even. In such Eulerian graphs,
the number of edges 𝐸 including the two dangling ends on the 5’ and 3’ ends is
\[
E = 1 + \frac{1}{2} D \tag{3.2}
\]
where $D = \sum_v d_v$ is the total degree over all vertices in the diagram.
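The bookkeeping behind Eq. 3.2 can be sketched in a few lines of Python. The dictionary and function names below are illustrative choices and not part of any existing program; the degrees assigned to each vertex type are those stated in the text.

```python
# Degree contributed by each vertex type, as stated in the text.
VERTEX_DEGREE = {"duplex": 4, "triplex": 6, "quadruplex": 8}


def total_degree(vertices):
    """Sum the degrees d_v over all vertices of a diagram."""
    return sum(VERTEX_DEGREE[v] for v in vertices)


def edge_count(vertices):
    """Number of edges E = 1 + D/2, counting the two dangling 5' and 3' ends."""
    return 1 + total_degree(vertices) // 2


hairpin = ["duplex"]            # one helix: 5' end, hairpin loop, 3' end
three_way = ["duplex"] * 3      # three helices of a three-way junction
quadruplex = ["quadruplex"]     # one square: 5' end, three linker loops, 3' end
```

With these definitions a single quadruplex vertex yields five edges, which matches the two dangling ends plus the three linker loops a, b, and c of a quadruplex.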
Figure 3.1 Examples of diagrams representing different secondary structural elements. (a) A three-way junction
where each dot represents a helix and loops are represented by lines. (b) A pseudoknot where black dots
represent paired regions and unpaired loops are represented by lines. (c) A triplex structure represented
diagrammatically by a black triangle. (d) A quadruplex represented by a black square. Figure adapted from (11).
In addition to their utility in characterizing RNA secondary structures, the
diagrammatic representation proposed by Schlick et al. is also useful for the calculation
of the conformational entropy of folded RNA structures. In the last chapter, we
described how the constraints imposed by the secondary structural elements of any
folded state can be broken into approximately independent sets using a factorization
strategy based on how the elements of the diagram are connected. An example of how
this factorization works is illustrated in Fig. 3.2 for a three-way junction. Additionally, we
introduced the idea of an “articulation point” within the structure. In an effort to formalize
our language in preparation for Chapter 4, we will forgo further usage of that term and
instead introduce “fragile vertices”. A fragile vertex is defined as any vertex that if
removed from the diagram disconnects it into two or more disjoint pieces. Fig. 3.2
shows that all three vertices in the diagram of a three-way junction are fragile.
Disconnecting the diagram at these fragile vertices generates the factorized diagrams
on the far right of Fig. 3.2. When a diagram is completely factorized, it breaks up into
irreducible pieces. We have proven that the conformational entropy of the fold is also
approximately separable when a diagram is reducible, and Δ𝑆 becomes the simple sum
of the entropies of all the irreducible pieces. For example, the total entropy Δ𝑆 of the
three-way junction in Fig. 3.2 can be reduced to the sum of the three closed diagrams
on the right. The big circle with arcs labeled a, b and c represents the unpaired loops in
the three-way junction. The two smaller circles labeled d and e represent the loops in
the hairpins. The edges corresponding to the 5’ and 3’ ends of the folded structure
contribute nothing to Δ𝑆 since they contain no additional constraints and can be omitted
from the completely factorized diagram shown on the right. The dots represent duplexes
of different helix lengths (𝛼, 𝛽, or 𝛾). Each fragile vertex carries additional enthalpic and
entropic free energy contributions depending on its size. The free energy of each fragile
vertex can, for example, be estimated using Turner’s nearest-neighbor model (12–14),
data from computer simulations, or other experimental data.
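One way to make the definition of a fragile vertex operational is sketched below in Python. The encoding is our own assumption for illustration: each edge is listed with the vertices it touches (one vertex for a dangling end, the same vertex twice for a hairpin loop), and a vertex is called fragile if deleting it leaves the edges in two or more disjoint pieces. Under this encoding, splitting off a dangling end also counts as a disconnection. The edge and vertex labels follow Fig. 3.2.

```python
from collections import defaultdict


def count_pieces(edges, removed):
    """Count disjoint groups of edges once the vertex `removed` is deleted."""
    touching = defaultdict(set)  # surviving vertex -> names of edges touching it
    for name, ends in edges.items():
        for vertex in ends:
            if vertex != removed:
                touching[vertex].add(name)
    unseen = set(edges)
    pieces = 0
    while unseen:
        pieces += 1
        stack = [unseen.pop()]
        while stack:  # flood-fill across edges that share a surviving vertex
            edge = stack.pop()
            for vertex in edges[edge]:
                if vertex == removed:
                    continue
                for neighbor in touching[vertex]:
                    if neighbor in unseen:
                        unseen.remove(neighbor)
                        stack.append(neighbor)
    return pieces


def fragile_vertices(edges):
    """Return the set of vertices whose removal disconnects the diagram."""
    vertices = {v for ends in edges.values() for v in ends}
    return {v for v in vertices if count_pieces(edges, v) >= 2}


# Three-way junction of Fig. 3.2: junction arcs a, b, c and hairpin loops d, e.
THREE_WAY_JUNCTION = {
    "5'": ("alpha",), "3'": ("alpha",),
    "a": ("alpha", "beta"), "b": ("beta", "gamma"), "c": ("gamma", "alpha"),
    "d": ("beta", "beta"), "e": ("gamma", "gamma"),
}
```

Applied to this encoding of the three-way junction, the sketch flags all three helices as fragile, in agreement with the statement above.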
Figure 3.2 Factorization of the constraints in a three-way junction into approximately independent contributions.
The total entropy S is reduced to the sum of the three closed diagrams on the right.
3.1.2 Monte Carlo Simulations
Diagrammatic factorization provides a simple recipe for the calculation of the
conformational entropy of any RNA fold. To make use of it, a library of conformational
entropy data must be compiled for every irreducible element representing various types
of secondary structural elements (hairpin, junction, duplex, triplex, quadruplex, etc.) of
different sizes, as well as for other unfactorizable structures such as pseudoknots. This
library can be sourced from experimental data or from computer simulations. In the last
chapter, we provided a complete and consistent set of Monte Carlo simulation
results for the entropy values of hairpins, two-, three- and four-way junctions. These
correspond to diagrams in which the vertices are all degree 4. While some of the same
data are available from melting experiments (12, 14, 15), not everything is. We have
relied on extensive computer simulations to compile an internally consistent data set.
This chapter further extends this data library, adding results for non-canonical base
pairs, quadruplexes, and pseudoknots. These new data also allow us to examine whether
pseudoknots and quadruplexes are factorizable, highlighting comparisons
and contrasts with what has already been proven for two-, three-, and four-way
junctions. This data library provides entropic contributions to the free energy associated
with the unpaired loops of each diagram. Free energies of formation of the paired
regions associated with the vertices in each diagram are independent of the edges in
these MC simulations, and they are added back in separately as needed. Vertex free
energies are not included in the reported data for hairpin initiation, pseudoknot
formation, and quadruplex formation obtained from direct simulation.
Monte Carlo (MC) simulations have been carried out using our in-house Nucleic MC
program for high-throughput conformational sampling of RNAs (16). Detailed
discussions of the mixed numerical/analytical treatment and closure algorithm used in
simulating the sugar-phosphate backbone (16–20) and accounting for steric interactions
(21, 22), solvent effects (22–24), and the influence of counterions (25, 26) have been
presented in previous publications. Using Nucleic MC, we generated thermal ensembles
consisting of several million uncorrelated conformations for chains with many different
secondary-structural constraints corresponding to several different classes of diagrams.
To evaluate the conformational entropy of quadruplexes and pseudoknots, poly-U
constructs of many different structures were simulated. For diagrams involving helices
with Watson-Crick (WC) base pairs, long hairpin structures from the Protein Data Bank were
melted to obtain the appropriate starting conformation. Using the same parameters for
defining base pairing events as in the previous chapter, we then identified and counted
spontaneous base pair formations during the simulation to measure the entropic cost of
initiating any new base pair constraint within the structure. Multiple base-pairing
constraints are associated with some of these structures. To evaluate the entropy of
these, we computed the entropy cost for forming the first constraint, and holding the first
constraint, we then computed the additional entropy cost for forming the second
constraint, etc. Since entropy is a state function, any thermodynamic pathway between
the initial (open) and final (folded) states will yield the correct Δ𝑆 . For example, the
starting structure of a pseudoknot was chosen to produce the proper length for the 𝛼
helix as shown in Figure 3.1, with the size of the seeded hairpin chosen to provide a
range of lengths in the final assembled pseudoknot structure.
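The stepwise evaluation described above can be illustrated with a short Python sketch. It assumes that the cost of a constraint is estimated as $-k_B T \ln$ of the fraction of sampled conformations that satisfy it; the exact estimator used inside Nucleic MC is not spelled out here, and the temperature and all counts below are hypothetical placeholders rather than simulation output.

```python
import math

K_B = 0.0019872041  # kcal/(mol*K)
T = 310.15          # assumed temperature in K; not specified in the text


def constraint_cost(n_satisfying, n_total, temperature=T):
    """Entropic free energy cost, -k_B T ln(fraction satisfying the constraint)."""
    if not 0 < n_satisfying <= n_total:
        raise ValueError("need 0 < n_satisfying <= n_total")
    return -K_B * temperature * math.log(n_satisfying / n_total)


# Stage 1 (hypothetical counts): open chain, count first-constraint formations.
n_open, n_first = 5_000_000, 2_500
# Stage 2 (hypothetical counts): first constraint held, count second formations.
n_held, n_second = 4_000_000, 800

step1 = constraint_cost(n_first, n_open)
step2 = constraint_cost(n_second, n_held)
total = step1 + step2  # costs along a pathway add because entropy is a state function
```

Because the probability of satisfying both constraints is the product of the first probability and the conditional second probability, the sum of the two stepwise costs equals the cost computed from the joint probability.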
For quadruplexes, parameters for identifying Hoogsteen base pairs are needed. The
structures of pyrimidine-purine base pairs utilizing the purine’s Hoogsteen edge (PDB IDs
1GQU (27), 2QS6 (28), 1K2G (29), and 2H49 (30)) as well as quadruplexes with
different topologies (PDB IDs 1KF1 (31) and 143D (32)) were used to define the base
pairing criteria for identifying Hoogsteen pairs. These selection parameters for WC and
Hoogsteen pairs are summarized in Fig. 3.3.
Figure 3.3 Geometric criteria used in defining (a) Watson-Crick base pairing geometry (33), (b) G-G Hoogsteen
geometry (31, 32), and (c) purine-pyrimidine Hoogsteen geometry (31, 34).
3.1.3 Evaluating Conformational Ensembles of CNG Repeats
Using the factorization scheme described above and a library of the entropy values
of the irreducible elements, the entropy of the conformational ensemble of any RNA
sequence can be evaluated. We illustrate this using the CNG repeat sequence
5’-NG(CNG)8CN-3’ as an example. Figures 3.4(a) through (d) show four possible
secondary structures of this sequence, and their corresponding diagrammatic
representations are shown next to each. If we consider only WC base pairs, the longest
uninterrupted canonically paired duplex length in CNG repeat sequences is only 2 base
pairs (bp). These are highlighted by the blue boxes in Fig. 3.4. In the corresponding
graphs, these are represented by blue dots. Since the structure in (a) has four 2-bp
duplexes whereas (b) only has three, the total vertex free energy of (a) should be lower
because base pairs are stabilizing. But on the other hand, (b) has fewer secondary
structural constraints than (a), and therefore (b) is expected to have a more favorable
conformational entropy. In an equilibrium ensemble of this sequence, we expect a
thermodynamic competition between maximizing the number of vertices versus
maximizing the diversity of the conformational ensemble. Furthermore, as our data
library disallows hairpins shorter than 3-nt, the structure in (a) is the only conformation
consistent with the diagram shown in (a). However, for structure (b), there are multiple
alternative structures consistent with the graph shown in (b). These alternative
structures can be obtained by permuting the junctions among the various positions
along the sequence. For example, permuting the two junctions in the asymmetric bulge
leads to a different structure without affecting its topology. Also, transposing a
subsegment within one junction with another junction produces a different structure
without altering the topology. For example, one can remove a single (CNG) unit from
the hairpin and transpose it into the first junction, making both 4-nt long, to derive a new
structure with a symmetric bulge instead of the asymmetric one in (b) without altering
the topology. The entropy associated with the configurational diversity of the topological
class represented by the graph in Fig. 3.4(b) is therefore higher than (a) and favors (b)
over (a). In addition to (a) and (b), there are many other structures for the same
sequence which belong to other topological classes. Fig. 3.4(c) and (d) show two
additional examples. The structure in (c) corresponds to a three-way junction, while the
structure in (d) consists of a pseudoknot plus a hairpin. Whereas (a) and (b) have
different numbers of 2-bp duplex units, structures (b), (c) and (d) all have three duplexes.
Because of this, structures (b), (c) and (d) have approximately the same vertex free
energy and their competition for relevance within the conformational ensemble of this
sequence is controlled by entropy alone. Our goal is to quantify the size and diversity of
these CNG repeat ensembles using the diagrammatic techniques described above.
Notice that each structure is characterized by a certain number of nucleotide (nt) units ℓ
distributed over the unpaired regions among the loops and junctions, which in the
graphs are associated with 𝐸 edges. We will see that the problem of calculating the
entropy is equivalent to finding all the possible ways of distributing the ℓ unpaired
nucleotides over the 𝐸 edges in the diagram.
Figure 3.4 (a-d) Examples of different structures of the 5’-NG(CNG)8CN-3’ repeat sequence belonging to distinct
topological classes. (e) Maximal hairpin structure of 5’-NG(CNG)19CN-3’.
To evaluate the volume of the conformational ensemble of a 5’-NG(CNG)$_M$CN-3’
repeat sequence of a certain length 𝑀 , we divide the ensemble into subsets according
to the total degree of the graphs. Recall that the total degree 𝐷 of a graph is equal to the
sum of the degrees over all its vertices. Using this definition, the total degree of the
graph in Fig. 3.4(a) is 16 because each vertex corresponding to a duplex is degree 4,
since 4 edges emanate from it. On the other hand, the graphs in (b), (c) and (d) all have
total degree 𝐷 = 12, because of the three duplexes present in each of those structures.
Earlier, we have also mentioned that the total degree of a graph is related to the number
of edges 𝐸 in it by 𝐸 = 1 + 𝐷 /2. Because of this, we now recognize that even though
the graphs in Fig. 3.4(b), (c) and (d) all belong to distinct topological classes, they all
have the same number of edges because they have the same total degree. The
diagrams in Fig. 3.4(b), (c) and (d) all have 7 edges because they are all degree 12.
Furthermore, since the vertices are fixed-length duplexes, the total lengths of all the
edges for all diagrams of degree 𝐷 are also the same for sequences containing the
same number of (CNG) repeats 𝑀 . For example, the structures in Fig. 3.4(b), (c) and (d)
all have total edge lengths of ℓ = 16 nt.
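The nucleotide count quoted above can be checked with a small Python sketch. The helper names are illustrative and not part of any existing code; the only inputs are the sequence pattern and the 2-bp duplex length stated in the text.

```python
def repeat_sequence(m, n="N"):
    """Build 5'-NG(CNG)_m CN-3' as a string; `n` stands in for the variable base."""
    return f"{n}G" + f"C{n}G" * m + f"C{n}"


def unpaired_length(m, n_duplexes, bp_per_duplex=2):
    """Nucleotides left for the edges after removing those paired in duplexes."""
    paired = 2 * bp_per_duplex * n_duplexes  # two strands per duplex
    return len(repeat_sequence(m)) - paired


# Structures (b), (c), and (d) of Fig. 3.4 each contain three 2-bp duplexes.
ell_three_duplexes = unpaired_length(8, 3)
# Structure (a) contains four 2-bp duplexes.
ell_four_duplexes = unpaired_length(8, 4)
```

For the 28-nt sequence 5’-NG(CNG)8CN-3’, three duplexes leave 16 nt for the edges, as stated above, while four duplexes leave 12 nt.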
In a previous paragraph, we described how the entropy of a certain class of
diagrams is derived from the permutation of the edges and the transposition of
subsegment lengths among the edges. Restating this more precisely in terms of
combinatorics, the structural diversity of a certain topological class is related to the
number of possible ways in which the total edge length in a structure consisting of ℓ nts
can be distributed among the 𝐸 edges in the diagram. Because of this, grouping
diagrams by total degree is advantageous compared to grouping them according to
topological class. Since diagrams of the same degree also have the same number of
edges, the combinatoric problem is identical for diagrams across the same degree,
regardless of which topological class they belong to. This allows us to recycle the
solution of the same combinatorics problem on diagrams of many different classes, as
long as their total degrees are the same. This also means that the intrinsic diversities of
the different subsets of the ensemble represented by different topological classes of
graphs belonging to the same total degree are identical. The only difference between
two topological classes belonging to the same total degree lies in the conformational
entropies of the irreducible elements, which are different for different types of secondary
structures. For example, the probability of observing an 8-nt hairpin loop in the
ensemble of all possible conformations is very different from that of observing an 8-nt
loop inside a three-way junction, even though they are both loops of the same length.
These different entropy values are supplied by the library we compiled using the MC
simulations described above.
We summarize our solution to the combinatorics problem with a few useful
expressions here. Referring to Fig. 3.4, notice that each edge segment has a minimum
length of 1 nt and can vary only by multiples of 3 nt. Therefore, the length of the 𝑖-th
edge in a diagram can be represented by $1 + 3j_i$, where $j_i$ is a non-negative integer, and
there are 𝐸 of these. The chain, which takes the form 5’-NG(CNG)$_M$CN-3’, contains 𝑀 + 1
(CNG) repeats. We always assume both the 5’ and 3’ ends have a dangling N
nucleotide, so the total length of interest for the sequence is (3𝑀 + 4) nt. A degree-𝐷
diagram has 𝐷 nts in the duplexes because all base pairs come in stacks of 2, so the
total edge length is ℓ = (3𝑀 + 4) − 𝐷. The number of edges is $E = 1 + \frac{1}{2} D$. In terms of
this, ℓ = 3(𝑀 + 2 − 𝐸) + 𝐸. Subtracting the minimum length of 1 nt for each of the 𝐸
edges, the number of transposable nts is 3(𝑀 + 2 − 𝐸), but they must occur in 3-nt
multiples. Therefore, the combinatorics problem is reduced to finding all sets of
non-negative integers $\{j_1, j_2, \cdots, j_E\}$ such that
\[
\sum_{i=1}^{E} j_i = J \equiv (M + 2 - E) = \left(M + 1 - \frac{1}{2} D\right),
\]
which also implies that the maximum theoretical degree for a chain with 𝑀 + 1 repeats is
2(𝑀 + 1). The process of dividing up the nucleotides into the edges of the graph is equivalent
to creating a string of 𝐸 non-negative integers, with zeroes allowed, such that they sum to
𝐽. This is the problem of determining all weak compositions of 𝐽 in combinatorics; for a
graph which has 𝐸 edges, this is the enumeration of all weak 𝐸-compositions of 𝐽 (10,
35). For the work presented in this chapter, the enumeration is done by brute force, with
each valid composition stored as a structure of the graph ensemble. The
collection of valid structures 𝛼 for each graph Ξ forms an ensemble, with each structure
contributing a weight
\[
\omega_\Xi(\alpha) = \exp\left(-\frac{F(\alpha)}{kT}\right),
\]
corresponding to its inherent free energy cost 𝐹(𝛼), to the partition function of the
graph ensemble,
\[
Z(\Xi) = \sum_\alpha \omega_\Xi(\alpha).
\]
For each graph ensemble, we can then define the ensemble-averaged conformational cost
and the entropy,
\[
\Delta F(\Xi) = \sum_\alpha \frac{\omega_\Xi(\alpha)}{Z(\Xi)} F(\alpha), \qquad
\Delta S(\Xi) = -k_B \sum_\alpha \frac{\omega_\Xi(\alpha)}{Z(\Xi)} \ln\left(\frac{\omega_\Xi(\alpha)}{Z(\Xi)}\right).
\]
All graph ensemble calculations were carried out for two different RNA strands,
NG-(CNG)16-CN and NG-(CNG)50-CN. For brevity, these will be referred to by their total
number of (CNG) repeats 𝑛, which is 17 and 51, respectively, in the Results section.
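The enumeration and the ensemble averages defined in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the code used for the calculations in this chapter: the per-structure cost `toy_cost` is a hypothetical placeholder for the library values, the temperature is assumed, and no element-specific restrictions (such as a minimum hairpin loop length) are applied.

```python
import math
from itertools import combinations

K_B = 0.0019872041  # kcal/(mol*K)
T = 310.15          # assumed temperature in K


def weak_compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    slots = total + parts - 1
    for bars in combinations(range(slots), parts - 1):  # stars-and-bars placement
        previous, comp = -1, []
        for bar in bars + (slots,):
            comp.append(bar - previous - 1)
            previous = bar
        yield tuple(comp)


def edge_lengths(m, total_degree):
    """Yield edge-length tuples (1 + 3*j_i) for 5'-NG(CNG)_m CN-3' at degree D."""
    n_edges = 1 + total_degree // 2
    j_total = m + 2 - n_edges
    if j_total < 0:
        return  # degree exceeds the theoretical maximum for this chain
    for comp in weak_compositions(j_total, n_edges):
        yield tuple(1 + 3 * j for j in comp)


def ensemble_stats(free_energies, temperature):
    """Return Z, the ensemble-averaged cost, and the entropy of the ensemble."""
    kt = K_B * temperature
    weights = [math.exp(-f / kt) for f in free_energies]
    z = sum(weights)
    probs = [w / z for w in weights]
    mean_f = sum(p * f for p, f in zip(probs, free_energies))
    entropy = -K_B * sum(p * math.log(p) for p in probs)
    return z, mean_f, entropy


def toy_cost(lengths):
    """Hypothetical stand-in for the library lookup of F(alpha), in kcal/mol."""
    return sum(1.0 + 0.5 * math.log(n) for n in lengths)


structures = list(edge_lengths(8, 12))
z, mean_f, entropy = ensemble_stats([toy_cost(s) for s in structures], T)
```

For 𝑀 = 8 and 𝐷 = 12 the sketch enumerates the weak 7-compositions of 𝐽 = 3, each of which corresponds to a set of edge lengths summing to 16 nt. As a consistency check on the definitions, the combination of averaged cost minus 𝑇 times the entropy reproduces $-kT \ln Z$.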
3.2 Results
3.2.1 Loop Initiation Entropies Involving Hoogsteen Pairs
Previously, we reported data for the entropies of initiating hairpin loops of different
sizes seeded for WC pairs. To initiate an 𝑛-nt loop, the constraint associated with the
base pair suppresses the conformational diversity of the backbone, which suffers an
entropy penalty leading to a free energy cost Δ𝐺(𝑛) that increases with the loop length
𝑛. The free energies of formation of loops utilizing WC base pairing geometry are shown
in Fig. 3.5 as the black filled circles. RNA triplexes and quadruplexes, on the other
hand, must use noncanonical base pairing geometry on their Hoogsteen edges to
initiate loops. The geometric constraints on the backbone needed to facilitate a purine-
purine Hoogsteen pair (e.g. G:G) or a purine-pyrimidine Hoogsteen pair (e.g. A:C) are
shown in Fig. 3.3(b) and (c), respectively. These Hoogsteen-specific geometric
constraints produce higher free energy requirements for loop initiation compared to
loops formed via WC interactions. Fig. 3.5 shows MC results for initiation free energies
needed to form a loop using G:G Hoogsteen geometry (solid blue squares) and loops
formed via purine:pyrimidine (R:Y) Hoogsteen geometry (open squares). Both types of
Hoogsteen-pair geometry loops require higher free energy compared to WC-pair
initiated hairpins. Note that these values pertain only to the entropic cost placed on the
chain to close the loop; additional contributions from the base pairs themselves are
added separately as needed.
Figure 3.5 Comparison of loop initiation using different sets of base pairing criteria. In comparison to the Watson-
Crick initiation cost, the purine-pyrimidine Hoogsteen initiation costs are effectively shifted up by a constant. This is
consistent with the difference in the targeted inter-base distance and the almost identical range of bond and
torsion angles. The G-G Hoogsteen loop initiation, despite sharing the same inter-base distance, suffers a larger
cost at small loop lengths that is associated with the base positioning and the different inter-base angles.
3.2.2 Quadruplexes
Quadruplex structures in DNA have been observed in d(GGG-NNN)n repeats (36).
These G-quadruplexes typically consist of a triple-deck sandwich of four Gs on each
layer, interacting with each other via G:G Hoogsteen pairs. The d(NNN) sequences act
as linkers, connecting the vertices of the triple sandwich. Various linker topologies have
been identified. These are exemplified by the structures found in PDB IDs 1KF1 (31)
and 143D (32). In 1KF1, the linkers are threaded through the G-quadruplex structure
connecting the bottom corner of one edge of the triple sandwich with the top of an
adjacent edge. In 143D, the linkers are threaded by connecting either the bottom corner
of one edge with the bottom corner of an adjacent edge, or the top corner with the top
corner of an adjacent edge.
The type of quadruplexes most relevant to 5’-NG(CNG)$_M$CN-3’ RNA repeats
are the double-deck sandwich structures illustrated in Fig. 3.6. Instead of three layers,
the quadruplex structure in Fig. 3.6 has only two layers. The linker topology shown in
Fig. 3.6 follows a bottom-to-top threading pattern, analogous to 1KF1. (CXG)n repeats
where X = G can potentially produce quadruplex structures of the type shown in Fig. 3.6,
with each linker being 1 nt (-C-), 4 nt (-CGGC-), or 7 nt (-CGGCGGC-) in length, or
even longer.
Figure 3.6 (a) A possible quadruplex structure relevant to CNG repeat sequences. The structure’s entropy is
determined by the three loops labeled a, b, and c. (b) Diagrammatic representation of a quadruplex, showing its
dependence on the three loop lengths a, b, and c, as well as the number of layers .
There are three linker loops in a quadruplex structure. These are labeled a, b, and c
in Fig. 3.6 in the 5’ to 3’ direction. The entropic free energy costs for initiating the first
loop a, the second loop b and the third loop c to connect the G on the bottom of one
edge of the quadruplex to the G on the top of the next edge are tabulated in Table 3.1
for a double-deck quadruplex structure. In the MC simulations, loop a was initiated first.
After this loop was formed, the free energy of initiating loop b was computed by holding
the first two edges of the quadruplex fixed. After loop b was formed, the free energy of
initiating loop c was then computed by holding the first three edges of the quadruplex
fixed. The results in Table 3.1 suggest that as the linker loops get longer, the free
energy cost of forming the loop generally increases. This trend is similar to that
observed in Fig. 3.5 for the hairpin initiation free energies. But as the quadruplex
structure was assembled, the loop free energies also became progressively higher from
a to b to c. This is presumably due to increased steric congestion in the core of the
quadruplex structure, making it more difficult for loop b to form compared to a, and in
turn more difficult for loop c to form compared to b. For linker loop c, its formation
was observed too rarely in the MC simulations to determine its
free energies accurately for lengths > 5 or < 2 nt, and these have been left out of Table
3.1. In addition to the loops a, b and c, there are entropic penalties associated with
constraining the backbone to the four edges of the quadruplex. The total free energy
cost for this is also given in Table 3.1.
Loop length (nt)   ΔG loop a (kcal/mol)   ΔG loop b (kcal/mol)   ΔG loop c (kcal/mol)
1                  4.2                    5.3                    -
2                  4.9                    6.8                    -
3                  5.5                    7.0                    7.0
4                  6.0                    7.4                    8.4
5                  6.3                    7.2                    6.9
6                  7.0                    7.4                    -
7                  7.3                    8.7                    -
Other ΔG (kcal/mol), all edges: 12.6
Table 3.1 Initiation free energies for the a, b, and c loops inside a double-deck quadruplex structure from MC
simulations. The typical statistical error on each value is approximately 0.05 kcal/mol.
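To show how the entries of Table 3.1 might be combined, the Python sketch below sums the stepwise initiation costs of loops a, b, and c with the edge-constraint term. Treating the total as a simple sum of the tabulated steps is our assumption here, motivated by the sequential a, then b, then c protocol; the function name and the error handling are illustrative.

```python
# Table 3.1 values in kcal/mol, keyed by loop length in nt ("-" entries omitted).
LOOP_A = {1: 4.2, 2: 4.9, 3: 5.5, 4: 6.0, 5: 6.3, 6: 7.0, 7: 7.3}
LOOP_B = {1: 5.3, 2: 6.8, 3: 7.0, 4: 7.4, 5: 7.2, 6: 7.4, 7: 8.7}
LOOP_C = {3: 7.0, 4: 8.4, 5: 6.9}
ALL_EDGES = 12.6  # cost of constraining the backbone to the four edges


def quadruplex_cost(a, b, c):
    """Sum the stepwise costs for loops a, b, c plus the edge term (assumed additive)."""
    try:
        return LOOP_A[a] + LOOP_B[b] + LOOP_C[c] + ALL_EDGES
    except KeyError as exc:
        raise ValueError(f"no tabulated value for loop length {exc.args[0]}") from None


# Linkers of 1, 1, and 4 nt, i.e. (-C-), (-C-), and (-CGGC-).
example_cost = quadruplex_cost(1, 1, 4)
```

Loop lengths without a tabulated value raise an error rather than being extrapolated, since the corresponding events were too rare to measure.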
Previously, we found that vertices in graphs such as the helix in a hairpin or a stem
in a 2-, 3-, or 4-way junction divide diagrams into approximately independent pieces.
For the quadruplex, this independence is not strictly obeyed, because as Table 3.1
shows, the initiation free energies of the a, b and c loops are asymmetric with respect to
exchange, and they are no longer independent of each other. To accommodate this, we
can modify the graph factorization scheme by simply redefining the entire quadruplex
structure together with its a, b, and c loops as one irreducible element, instead of
assuming the loops are separable. Since quadruplexes typically have very limited loop
lengths, this does not affect the validity or impact the utility of the graph factorization
scheme described above.
3.2.3 Pseudoknots
The conformational entropy of a pseudoknot is determined by the lengths of the
three loops a, b, and c, as well as the duplex lengths 𝛼 and 𝛽. In the pseudoknot
structures most relevant to CNG repeat sequences, 𝛼 and 𝛽 are 2 bp, the interhelix
length b is 1 nt, and the loop lengths a and c are 1, 4, 7, .... An example of how such
pseudoknots fit into a (CNG) repeat sequence is shown in Fig. 3.4(d). Prior studies in
the literature by Cao and Chen suggest that the three loops of a pseudoknot can be
treated independently (37–39) when considering their entropies. Our simulation results
for the pseudoknot structures most relevant to CNG repeats do not corroborate this.
To calculate the conformational entropy costs for pseudoknots, we calculated the
cost for each step in a thermodynamic pathway that folds a free chain into the final
pseudoknot structure, passing through a hairpin structure along the way (an example of
such pathways for the a = b = c = 1 case can be found in Fig. B.1 in Appendix B).
Consequently, the formation of the pseudoknot’s three loops in our method is due to a
single base pairing event which turns an existing hairpin structure into a pseudoknot,
and this can happen on either the 5’ side or the 3’ side. Additionally, there are extra
entropy costs for extending the helices to reach their target lengths 𝛼 and 𝛽. In Fig. 3.7,
we summarize the entropic cost and standard error of forming the entire pseudoknot for
the four smallest pseudoknot structures relevant to (CNG)n repeats. The cost is
calculated as the sum of the costs to go from an open chain to the appropriate hairpin and
then from the hairpin to the final pseudoknot structure, relative to the cost of two 2-bp
duplexes. Multiple pathways connecting the initial open chain and the final structure
were used, and the averages are shown in Fig. 3.7. The map of the pathways used for
each of the structures can be found in Appendix B as Figs. B.1-B.4, and the costs for
each step of the pathways can be found in Tables B.1-B.4.
Figure 3.7 (a) The conformational entropy of a pseudoknot is determined by the lengths of the three loops labeled
a, b, and c, as well as the duplex lengths a and b. In the pseudoknot structures most relevant to CNG repeat
sequences, a and b are both 2 nt, the interhelix length b is 1 nt, and the loop lengths a and c are 1, 4, 7, ... (b)
Diagrammatic representation of a pseudoknot and calculated average cost for the four smallest pseudoknot
structures relevant to CNG repeats.
3.2.4 Ensembles of CNG Repeats
Given a 5’-NG(CNG)MCN-3’ repeat sequence, we first partitioned the conformational
space according to the total degree of the diagrams, and then by collecting all
accessible folded structures which share the same graph representation into a subset
ensemble. For each structure within a subset ensemble, we calculated its weight using
the entropic costs of all its irreducible elements as determined by the library derived
from our MC data. The ensemble’s partition function of the subset represented by a
graph Ξ was then used to calculate its sub-ensemble average conformational cost
$\Delta F(\Xi)$, its entropy $\Delta S(\Xi)$, and then the free energy, according to
$\Delta G(\Xi) = \Delta F(\Xi) - T\,\Delta S(\Xi) + \Delta G^0$. The entropy $\Delta S(\Xi)$ is a measure of the diversity of the subset,
that is, the number of folded conformations that can be represented by the graph $\Xi$. $\Delta G(\Xi)$
determines the overall thermodynamic stability of this subset relative to other subsets
and the open chain. $\Delta G^0$ is the stabilization contributed by the duplexes in the structure.
To determine the contribution from each duplex, we used the experimental $\Delta G_{\mathrm{exp}}$ data
reported by Sobczak et al. for (CNG)20 oligomers in 100 mM NaCl (40) for N = A, C, G
and U as the target $\Delta G(\Xi)$. The conformation that was reported for (CNG)20 has the
maximal hairpin structure shown in Fig. 3.4(e)—which is comprised of nine duplexes,
eight symmetric (1,1) internal junctions, one 4-nt hairpin, and two dangling ends. The
$\Delta F(\Xi)$ contributed by the loops of this structure was then calculated using our library of
entropic costs: 5.10 kcal/mol for the 4-nt hairpin, 5.98 kcal/mol for each of the (1,1)
internal junctions, and 0 for the dangling ends. As the maximal hairpin structure has no
permutable segments on the junctions, there is only one possible structure that matches
the graph in Fig. 3.4(e). The stabilization contributed by each node is then calculated as
$\Delta G^0(\mathrm{duplex}) = \frac{1}{9}\left(\Delta G(\Xi) - \Delta F(\Xi)\right) = \frac{1}{9}\left(\Delta G(\Xi) - (5.10 + 8 \times 5.98)\right)$ kcal/mol. This
calculation was carried out for each N = A, C, G and U using the experimental $\Delta G_{\mathrm{exp}}$
data reported by Sobczak et al. The smallest stabilization came from N = C with
$\Delta G^0(\mathrm{duplex}) = -6.17$ kcal/mol, followed by U ($\Delta G^0(\mathrm{duplex}) = -6.39$ kcal/mol), A
($\Delta G^0(\mathrm{duplex}) = -6.57$ kcal/mol), and G ($\Delta G^0(\mathrm{duplex}) = -6.62$ kcal/mol). In the following
sections, we will use the $\Delta G^0(\mathrm{duplex})$ value for N = C to report our numerical data
involving duplexes (black dots), as this represents a lower bound to stability for all the
structures. The other values for N = A, G or U can be obtained by applying the
appropriate offset to the values for each dot (duplex).
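As an illustration, the per-duplex arithmetic can be sketched in Python. The function and variable names are illustrative assumptions and not part of the original analysis. The total free energy used as input is back-calculated from the reported N = C result; it is not a value taken from the experimental paper.

```python
# Loop costs (kcal/mol) quoted in the text for the maximal (CNG)20 hairpin.
HAIRPIN_4NT = 5.10     # one 4-nt hairpin loop
INTERNAL_1X1 = 5.98    # each symmetric (1,1) internal junction
N_JUNCTIONS = 8
N_DUPLEXES = 9


def loop_cost():
    """Total Delta F of the loops; dangling ends contribute zero."""
    return HAIRPIN_4NT + N_JUNCTIONS * INTERNAL_1X1


def duplex_stabilization(dg_total):
    """Delta G0(duplex) = (Delta G(Xi) - Delta F(Xi)) / 9."""
    return (dg_total - loop_cost()) / N_DUPLEXES


# Hypothetical input: the total Delta G implied by the reported -6.17 kcal/mol.
dg_implied = N_DUPLEXES * (-6.17) + loop_cost()
print(round(loop_cost(), 2), round(duplex_stabilization(dg_implied), 2))
```

Passing a measured total free energy for another N to `duplex_stabilization` would give the corresponding per-duplex offset.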
The graph ensemble calculations reported below have been carried out for RNA
strands of total length 𝑛 = 17 and 51. These were done by grouping graphs according to
their total degree 𝐷 , equal to the sum over the degrees of all vertices. Note that the
number of graphs at each total degree 𝐷 proliferates rapidly as 𝐷 increases. However,
the number of permissible graphs at each 𝐷 is also constrained by the length of the
RNA chain making it possible to exhaustively enumerate all graphs for chains that are
not too long. In particular, for the 𝑛 = 17 oligomer, we were able to enumerate all
graphs up to 𝐷 = 28, which is the highest-order permissible set. At 𝐷 = 28, there is only
one permissible graph, which is shown in Fig. 3.12(b), corresponding to the maximally
base-paired structure for 𝑛 = 17. All permissible graphs from 𝐷 = 8 to 28 are displayed
in Figs. 3.8 through 3.12 for the 𝑛 = 17 oligomer (the 𝐷 = 4 set is trivial and contains
only one graph, which is not displayed). While exhaustively enumerating all graphs in
the ensemble for 𝑛 = 17 is possible, this quickly becomes impossible for longer repeat
lengths. In the 𝑛 = 51 case, the significantly longer chain length prevents the same exhaustive
enumeration from being carried out, and only graphs up to a total degree of 𝐷 = 16 are
presented in the data below.
Fig. 3.8 shows all graphs with total degree 𝐷 = 8 and the calculated values of
Δ𝐹 (Ξ) and Δ𝐺 (Ξ) in kcal/mol, and Δ𝑆 (Ξ)/𝑅 for each. We derived these diagrams from
the list of graphs enumerated by Schlick et al. (7, 9), after removing those containing
structures for which we have no corresponding data or those requiring non-secondary
structural motifs to form. Three motifs are present with total degree of 8: hairpins, two-
way junctions, and pseudoknots. The thermodynamic stabilities of the three subsets
reported in their $\Delta G$ values include the intrinsic stabilization provided by the free energy
in the duplexes, $\Delta G^0 = 2 \times (-6.17)$ kcal/mol. Similarly, data are shown in Fig. 3.9 for all
graphs with total degree $D = 12$ and in Fig. 3.10 for total degree $D = 16$. The
stabilization provided by the duplexes is $\Delta G^0 = 3 \times (-6.17)$ kcal/mol and
$\Delta G^0 = 4 \times (-6.17)$ kcal/mol, respectively. Figs. 3.11 and 3.12(a) show the non-pseudoknot
graphs at $D = 20, 24$ for $n = 17$, and Fig. 3.12(b) shows the single permissible graph at
$D = 28$, which corresponds to a maximally base-paired structure.
We also considered quadruplexes. The only graph that contains a single
quadruplex is shown in Fig. 3.10. A single quadruplex has total degree 𝐷 = 16. To
establish the intrinsic free energy of the quadruplex core, we again rely on experimental
data from Sobczak et al. Their study identified two trinucleotide oligomers that can form
quadruplexes in solution, (UGG)17 and (AGG)17, whereas (CNG) repeats cannot. To estimate
the effects of including quadruplexes in the (CNG)n repeat ensembles, we used the
experimental free energies of (UGG)17 and (AGG)17. The proposed structure for both
contains a single 2-layer quadruplex structure with a guanine tetrad in each layer, similar
to what is shown in Fig. 3.6. The three loop lengths in the structure proposed by
Sobczak et al. are all 1 nt in length. As in the maximal hairpin example above, we
calculate $\Delta G(\Xi)$ for a quadruplex structure that contains only a single
quadruplex node, which is represented in our graphs by a black square. Recall that
$\Delta G(\Xi) = \Delta F(\Xi) - T\,\Delta S(\Xi) + \Delta G^0(\mathrm{quad})$, and the loop contribution $\Delta F(\Xi)$ is
comprised of the cost of forming each of the three linker loops a, b, and c, and the cost
of aligning the guanines in each of the tetrad columns. As only one quadruplex structure
is reported in the work of Sobczak et al. (a quadruplex on the 5’ end and a long terminal
tail), $\Delta S(\Xi)$ is zero. The costs of loop a, loop b, and aligning all columns in the
quadruplex core are read directly from Table 3.1: 4.2 kcal/mol, 5.3 kcal/mol, and 12.6
kcal/mol, respectively. The contribution of loop c is approximated as 7.0 kcal/mol because
no explicit simulated value is available at 1 nt. The free energy of a quadruplex core,
$\Delta G^0(\mathrm{quad})$, is then calculated as
$\Delta G^0(\mathrm{quad}) = \Delta G(\Xi) - \Delta F(\Xi) = \Delta G(\Xi) - (4.2 + 5.3 + 7.0 + 12.6)$ kcal/mol.
The experimentally determined $\Delta G_{\mathrm{exp}} = \Delta G(\Xi)$ values for (UGG)17
and (AGG)17 in 100 mM NaCl at 260 nm from Sobczak et al. were used, yielding an
approximation for the quadruplex core of −32.97 kcal/mol from (AGG)17 and −33.14
kcal/mol from (UGG)17. The value −33 kcal/mol was used in calculating the free energy
of the single quadruplex-containing graph in Fig. 3.10.
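The quadruplex core estimate follows the same bookkeeping, and a minimal Python sketch is given here for illustration. The names are assumptions introduced for this example. The "implied" totals are back-calculated from the two core values reported above and are not independent experimental inputs.

```python
# Costs (kcal/mol) quoted in the text for the 2-layer quadruplex.
LOOP_A = 4.2
LOOP_B = 5.3
LOOP_C = 7.0        # approximated in the text; no simulated value at 1 nt
ALIGN_CORE = 12.6   # aligning the guanines in all tetrad columns


def quadruplex_loop_cost():
    """Delta F(Xi) for the single-quadruplex graph."""
    return LOOP_A + LOOP_B + LOOP_C + ALIGN_CORE


def quadruplex_core(dg_total):
    """Delta G0(quad) = Delta G(Xi) - Delta F(Xi), with Delta S(Xi) = 0."""
    return dg_total - quadruplex_loop_cost()


# Core estimates reported in the text, and the totals they imply.
reported_core = {"(AGG)17": -32.97, "(UGG)17": -33.14}
implied_total = {k: v + quadruplex_loop_cost() for k, v in reported_core.items()}
mean_core = sum(reported_core.values()) / len(reported_core)
print(round(quadruplex_loop_cost(), 1), round(mean_core))
```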
Figure 3.8 All graphs for (CNG)17 at total degree 8, their RAG-ID (9) and corresponding ensemble-averaged cost,
entropy, and the graph free energy. Δ𝐹 and Δ𝐺 are in kcal/mol. Δ𝑆 are reported in units of 𝑅 , the gas constant.
Figure 3.9 All graphs for (CNG)17 at total degree 12, their RAG-ID and corresponding ensemble-averaged cost,
entropy, and the graph free energy. Δ𝐹 and Δ𝐺 are in kcal/mol. Δ𝑆 are reported in units of 𝑅 , the gas constant.
Figure 3.10 All graphs for (CNG)17 at total degree 16, their RAG-ID and corresponding ensemble-averaged cost,
entropy, and the graph free energy. Δ𝐹 and Δ𝐺 are in kcal/mol. Δ𝑆 are reported in units of 𝑅 , the gas constant. The
list also includes a quadruplex structure, with an estimate for the intrinsic Δ𝐺 of the core, referenced against the
same standard state (four 2-bp duplexes) used for the rest of the structures in this figure.
Figure 3.11 All non-pseudoknot graphs for (CNG)17 at total degree 20, their corresponding ensemble-averaged
cost, entropy, and the graph free energy. Δ𝐹 and Δ𝐺 are in kcal/mol. Δ𝑆 are reported in units of 𝑅 , the gas
constant. The graphs have been sorted from most to least stable free energy rather than by RAG-ID.
Figure 3.12 (a) All non-pseudoknot graphs for (CNG)17 at total degree 24 and (b) the single non-pseudoknot graph at
total degree 28, their corresponding ensemble-averaged cost, entropy, and the graph free energy. Δ𝐹 and Δ𝐺 are
in kcal/mol. Δ𝑆 are reported in units of 𝑅 , the gas constant. The graphs have been sorted from most to least stable free
energy rather than by RAG-ID. Note that due to the constraint of chain length, there are no bubble graphs with 6 or
7 nodes.
3.3 Discussion
Data in Figs. 3.8-3.12 reveal the basic characteristics of the structural ensembles
typical of CNG repeat sequences. While this direct enumeration approach is limited to
graphs of low total degrees and/or relatively short total chain lengths, the results
demonstrate central features that allow us to make projections about graphs of higher
degrees and longer total chain length. We begin with a discussion of the graph
ensembles for 5’-NG(CNG)16CN-3’ (𝑛 = 17).
At total degree 𝐷 = 8, Fig. 3.8 reveals that the graph with the lowest overall free
energy is (2,1). While this is just one graph, it is important to remember that a large
number of conformations are represented by it, where the segments in the graph can
have variable lengths but they are restricted in such a way that their sum must equal the
total length of the full RNA chain. The entropy Δ𝑆 /𝑅 listed under the graph is a measure
for the number of these conformations. The value $\Delta S/R = 5.87$ for graph (2,1) when $n = 17$
suggests that there are roughly $e^{5.87} \approx 354$ conformations represented. The value $\Delta F$
is the sub-ensemble free energy cost associated with suppressing the backbone
conformational degrees of freedom to force the chain to conform with the constraints
implied by the vertices in the graph. For (2,1) it is 10.65 kcal/mol when $n = 17$. Notice
that for every conformation in this sub-ensemble the backbone conformational cost is
different, and the conformation tally of ~354 as well as the cost $\Delta F = 10.65$ kcal/mol are
ensemble-averaged properties. The overall free energy of graph (2,1) when $n = 17$ is
$\Delta F - (RT)(\Delta S/R) + \Delta G^0 = 10.65 - (0.616)(5.87) + \Delta G^0 = -5.30$ kcal/mol, where $\Delta G^0$ is
the intrinsic free energy associated with two 2-bp duplexes, equal to −12.34 kcal/mol as
indicated in Fig. 3.8, and $T = 310$ K.
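The free energy bookkeeping for a single graph can be checked with a short Python sketch. The function name and argument names are assumptions made for this example; the numerical inputs are the values quoted for graph (2,1) at 𝑛 = 17, and the result agrees with the reported −5.30 kcal/mol to within rounding.

```python
import math

RT = 0.616  # kcal/mol at T = 310 K, as used in the text


def graph_free_energy(delta_f, s_over_r, delta_g0, rt=RT):
    """Delta G = Delta F - RT * (Delta S / R) + Delta G0, all in kcal/mol."""
    return delta_f - rt * s_over_r + delta_g0


# Values quoted for graph (2,1) at n = 17; two 2-bp duplexes at -6.17 each.
dg_21 = graph_free_energy(delta_f=10.65, s_over_r=5.87, delta_g0=2 * (-6.17))
n_conf = math.exp(5.87)  # approximate number of conformations represented
print(f"{dg_21:.2f} kcal/mol, about {n_conf:.0f} conformations")
```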
The graph that has the next lowest overall free energy in Fig. 3.8 is (2,2), with Δ𝐺 =
−4.07 kcal/mol. The entropy of graph (2,2) is similar to that of (2,1). This is because their
topologies are similar, except two segments in (2,2) are constrained into a 2-way
junction, whereas in (2,1) one of them is constrained inside a hairpin and the other is
free. Contrasting this to the graph (2,3) in Fig. 3.8, which contains a pseudoknot, Δ𝑆 /𝑅
is quite a bit lower for (2,3), even though (2,3) contains the same number of segments
as (2,2). Note that Eq. (3.1) dictates that the total number of edges 𝐸 and the total
degree 𝐷 are related by 𝐸 = 1 + 𝐷 /2, and all graphs in Fig. 3.8 necessarily have the
same number of edges. The reason why the structure containing the pseudoknot has a
significantly lower entropy compared to (2,1) and (2,2) is related to the dependence of
the cost function of a pseudoknot on the length of the unpaired loop regions that
comprise it. Long loop lengths are suppressed in a pseudoknot compared to hairpins or
2-way junctions, and this leads to a lower diversity in the sub-ensemble associated with
graph (2,3) compared to (2,1) or (2,2). This also results in a higher overall Δ𝐺 for graphs
containing pseudoknots.
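The edge-count relation quoted above is simple enough to tabulate, as in the following Python sketch. The helper names are hypothetical. The duplex count assumes that every vertex is a duplex node of degree 4, which is inferred from the duplex counts given in the text for each total degree and does not apply to the quadruplex graph.

```python
def edge_count(total_degree):
    """E = 1 + D/2, the relation the text attributes to Eq. (3.1)."""
    if total_degree % 2:
        raise ValueError("total degree must be even")
    return 1 + total_degree // 2


def duplex_count(total_degree):
    """Number of duplex nodes, assuming each vertex has degree 4."""
    if total_degree % 4:
        raise ValueError("total degree must be a multiple of 4")
    return total_degree // 4


for d in (8, 12, 16, 20, 24, 28):
    print(d, edge_count(d), duplex_count(d))
```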
While each graph represents a sub-ensemble of the conformations at a certain total
degree $D$ and its entropy value $\Delta S/R$ reflects the diversity of that subset, an additional
ensemble-level entropy is associated with the superset of graphs at each $D$. This
ensemble-level entropy at a degree $D$ is given by $\Delta S_D/R = -\sum_{\Xi} P(\Xi) \ln P(\Xi)$, where the
normalized probability $P(\Xi)$ of graph $\Xi$ is given by
$P(\Xi) = e^{-\Delta G(\Xi)/RT} / \sum_{\Xi} e^{-\Delta G(\Xi)/RT}$. This
value can be interpreted as a measure of how the conformations at the $D = 8$ level
are distributed amongst the graphs in Fig. 3.8. For the graphs in Fig. 3.8 with $n = 17$,
$\Delta S_D/R$ comes out to be 0.37, suggesting that at the ensemble level, the information
content in the superset of $D = 8$ graphs is roughly equivalent to just $e^{0.37} \approx 1.4$ graphs.
This suggests that the secondary structural content attributed to the graphs in Fig. 3.8
is not evenly distributed amongst the three graphs. As $\Delta G(\Xi)$ determines the weight of
each graph, most of the configurations are expected to be consistent with the non-pseudoknot
graphs. Finally, the overall free energy of the ensemble $\Delta G_D$ can be
computed either from $e^{-\Delta G_D/RT} = Z = \sum_{\Xi} e^{-\Delta G(\Xi)/RT}$ or from
$\Delta G_D = \langle \Delta G \rangle - T\,\Delta S_D$, where
$\langle \Delta G \rangle = Z^{-1} \sum_{\Xi} \Delta G(\Xi)\, e^{-\Delta G(\Xi)/RT}$.
$\Delta G_D$ serves as a measure of the thermodynamic stability of the
$D = 8$ sub-ensemble (that is, the collection of all $D = 8$ structures) relative to the open
chain. For the $D = 8$ sub-ensemble, $\Delta G_D = -5.38$ kcal/mol.
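The ensemble-level quantities can be evaluated directly from a list of graph free energies, as in the following Python sketch. The function name is an assumption made for this example. Only the two Δ𝐺 values quoted in the text for 𝐷 = 8, those of graphs (2,1) and (2,2), are used; the pseudoknot graph (2,3) is omitted because its value is not quoted here, so the output is an approximation that lands close to the reported 0.37 and −5.38 kcal/mol.

```python
import math

RT = 0.616  # kcal/mol at 310 K


def ensemble_summary(dg_values, rt=RT):
    """Return (probabilities, S_D/R, Delta G_D) for a set of graph free energies."""
    g_min = min(dg_values)
    # Shift by the minimum before exponentiating, for numerical stability.
    weights = [math.exp(-(g - g_min) / rt) for g in dg_values]
    z_shifted = sum(weights)
    probs = [w / z_shifted for w in weights]
    s_over_r = -sum(p * math.log(p) for p in probs if p > 0.0)
    dg_d = g_min - rt * math.log(z_shifted)  # equals -RT ln Z
    return probs, s_over_r, dg_d


dg_graphs = [-5.30, -4.07]  # graphs (2,1) and (2,2) only
probs, s_over_r, dg_d = ensemble_summary(dg_graphs)
mean_dg = sum(p * g for p, g in zip(probs, dg_graphs))
# Both routes to Delta G_D given in the text should agree.
print(f"S_D/R = {s_over_r:.2f}, dG_D = {dg_d:.2f}, <dG> - RT*S = {mean_dg - RT * s_over_r:.2f}")
```

The last printed quantity checks that the partition-function route and the ⟨Δ𝐺⟩ − 𝑇Δ𝑆 route give the same sub-ensemble free energy.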
Moving to the 𝐷 = 12 graphs in Fig. 3.9, the diversity of the graphs expands. The
member of this superset with the lowest overall free energy Δ𝐺 is (3,1). The one with the
highest free energy is again the structure with a pseudoknot. 𝐷 = 12 is also the lowest
order at which a 3-way junction appears, such as in (3,5). Similar to the $D = 8$ superset
in Fig. 3.8, the member with the highest sub-ensemble diversity is (3,4), which has the
highest entropy $\Delta S/R = 7.75$, corresponding to approximately $e^{7.75} \approx 2300$ distinct
conformations. The ensemble-level entropy for the set $D = 12$ is $\Delta S_D/R = 0.75$,
corresponding to ~2.1 graphs. The overall free energy of the $D = 12$ sub-ensemble is
$\Delta G_D = -7.02$ kcal/mol. The values show that there are more graphs that comprise the
$D = 12$ sub-ensemble and collectively they are more favorable than the $D = 8$ sub-ensemble.
Going to $D = 16$ in Fig. 3.10, the diversity of the graphs expands even further. All
the graphs in Fig. 3.10 contain four 2-bp duplexes, except the one with a quadruplex.
The intrinsic free energy associated with four 2-bp duplexes is $\Delta G^0 = -24.68$ kcal/mol,
which is indicated in Fig. 3.10. On the other hand, the intrinsic free energy of a 2-layer
quadruplex core, according to the reasoning at the end of the last section, is estimated
to be $\Delta G^0 \approx -32.97$ kcal/mol. Fig. 3.10 shows that the $\Delta G$ value for the quadruplex graph
is −4.97 kcal/mol. Its entropy is low for the same reason as for the
pseudoknots: the quadruplex structure is confined to short loop lengths, which reduces
the conformational diversity of the subset. Notice that the quadruplex structure is only
possible when the sequence is (CGG)M and thus is not expected to contribute
significantly to the structural ensemble of (CNG) repeats.
Like the graphs in Figs. 3.8 and 3.9, the graph with the lowest overall free energy in
Fig. 3.10 is graph (4,1). We will refer to graphs having this topology as “bubble
diagrams”. For graph (4,1), the overall free energy is Δ𝐺 = −7.60 kcal/mol. Notice that
going from 𝐷 = 12 in Fig. 3.9 to 𝐷 = 16 in Fig. 3.10, the entropy of the bubble diagram
actually decreases from Δ𝑆 /𝑅 = 6.55 for graph (3,1) in Fig. 3.9 to 5.62 for graph (4,1) in
Fig. 3.10. This indicates that when going to higher total degree 𝐷 , the fixed length of the
full RNA chain becomes a factor limiting the number of combinations of segment
lengths that could fit into the total number of nucleotides on the sequence. However,
this effect is also dependent on other features of the graphs. For example, the type of
graph with the highest entropy in both Figs. 3.9 and 3.10 is the “necklace diagram”,
exemplified by structures (3,4) and (4,14). Going from (3,4) to (4,14), the entropy value
of the necklace diagram continues to increase from Δ𝑆 /𝑅 = 7.75 to 8.20, growing from
~2300 to ~3600 configurations. Furthermore, some of the other partial necklace
diagrams, such as (3,2) in Fig. 3.9 and (4,4) and (4,8) in Fig. 3.10, also show a continued increase in
diversity going from lower to higher total degree. In addition to these, the
diagrams associated with 3- or 4-way junctions also seem to expand in
diversity. The ensemble-level entropy for the set $D = 16$ is $\Delta S_D/R = 1.64$, corresponding
to ~5.2 graphs, and comparing this to $\Delta S_D/R = 0.75$ for $D = 12$ supports this
observation. The overall free energy of the $D = 16$ sub-ensemble is $\Delta G_D = -8.05$
kcal/mol, indicating that the superset of $D = 16$ diagrams continues to be more
thermodynamically favorable than the graphs of lower total degree.
Going beyond 𝐷 = 16, Fig. 3.11 shows all non-pseudoknot graphs at 𝐷 = 20
containing up to 4-way junctions. Fig. 3.12(a) shows all non-pseudoknot graphs at 𝐷 =
24. Fig. 3.12(b) shows the single non-pseudoknot graph at 𝐷 = 28. For 𝐷 > 16,
pseudoknot graphs have been omitted due to their falloff in thermodynamic stability and
the small sub-ensembles they represent. Our graphs also do not include multiway
junctions higher than 4 because our library does not contain simulation data for 5-way
or higher junctions. For 𝑛 = 17 chains, however, only one possible graph at 𝐷 = 28
would have a 5-way junction. For 𝑛 = 17 chains, there are no graphs beyond 𝐷 = 28.
In the $D = 20$ graphs in Fig. 3.11, the bubble diagram continues to be the most
energetically favorable. However, the $D = 20$ graphs collectively are hardly more favorable
than the graphs of lower degrees. Most of them have $\Delta G$ similar to the graphs in Fig. 3.10,
and some of them are less favorable. The $D = 20$ sub-ensemble entropy is $\Delta S_D/R = 2.66$,
corresponding to ~14.4 graphs, and the free energy is $\Delta G_D = -8.10$ kcal/mol. All of the non-pseudoknot
graphs in Fig. 3.10 at $D = 16$ have four duplexes (dots), whereas the
graphs in Fig. 3.11 at $D = 20$ have five. Our results show that relative to the open chain,
5-duplex structures collectively are only marginally more favorable than 4-duplex
structures. This trend continues in Fig. 3.12(a) for $D = 24$, where all the graphs are now
less thermodynamically stable than their counterparts at lower degrees. This decrease
in thermodynamic stability is an intrinsic entropic effect produced by the total length of
the RNA chain, which constrains the number of permutable loop segments. As more
nodes are added for a given total chain length, fewer free nucleotides are available to
be assigned to the unpaired loop regions. This leads to a decrease in the
structural diversity $\Delta S/R$, which would otherwise offset the entropic cost of constraining the backbone. This
can be seen in the $\Delta S/R$ values of the graphs going from $D = 16$ to $D = 20$ and beyond.
The $D = 24$ sub-ensemble has an entropy $\Delta S_D/R = 3.01$, corresponding to ~20.3
graphs, and a free energy $\Delta G_D = -6.84$ kcal/mol. These values indicate that despite more graphs
of total degree $D = 24$ being present, the 6-node structures corresponding to them are
less favorable collectively than the 5-node and 4-node graphs. The chain length further
constrains the graphs down to a single diagram in Fig. 3.12(b) for $D = 28$, which is the
maximum degree possible for $n = 17$. This diagram has only one permutable segment
and hence a low entropy.
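The trend across total degrees for 𝑛 = 17 can be collected in one place with a short Python sketch. The dictionary below holds only the sub-ensemble values quoted in the text; the variable names are assumptions made for this example, and no value is quoted for 𝐷 = 28, so that degree is left out.

```python
import math

# Quoted for n = 17: D -> (S_D/R, Delta G_D in kcal/mol).
SUB_ENSEMBLES = {
    8: (0.37, -5.38),
    12: (0.75, -7.02),
    16: (1.64, -8.05),
    20: (2.66, -8.10),
    24: (3.01, -6.84),
}

# Effective number of graphs, exp(S_D/R), at each total degree.
effective_graphs = {d: math.exp(s) for d, (s, _) in SUB_ENSEMBLES.items()}

# Most stable sub-ensemble among the degrees with quoted values.
d_opt = min(SUB_ENSEMBLES, key=lambda d: SUB_ENSEMBLES[d][1])

for d, (s, g) in SUB_ENSEMBLES.items():
    print(f"D={d:2d}  S_D/R={s:.2f}  exp(S)={effective_graphs[d]:5.1f}  dG_D={g:.2f}")
print("most favorable D among those listed:", d_opt)
```

The effective number of graphs keeps growing with 𝐷 over this range, while the sub-ensemble free energy passes through its minimum at an intermediate degree.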
Data for the 𝑛 = 51 chain demonstrate the effects chain length exerts on the
characteristics of the graph sub-ensembles. Figs. 3.8-3.10 show that the average cost
Δ𝐹 of all graphs increases when the chain grows from 𝑛 = 17 to 𝑛 = 51. This is
expected as longer RNA chains should have access to secondary structures with longer
loop lengths. However, the changes in cost reflect a change in loop length of only 3 to 6
nucleotides, and the secondary structures of the graphs only grew by one or two
additional trinucleotide units as the total RNA chain length is tripled. This points to a
preferential placement of nucleotide units into the dangling ends or bridging junctions
which incur no cost of formation while contributing to the overall structural diversity of
the graph.
At the same time, the increase in RNA chain length has a large favorable effect on
the graphs’ entropies. Every graph exhibited an increase in Δ𝑆 /𝑅 when the chain length
grew from 𝑛 = 17 to 51. This is also expected as there are now more transposable
units, and the RNA chain has access to a larger number of conformations for each
graph. The changes in Δ𝐹 and Δ𝑆 combine to yield more favorable Δ𝐺 for all graphs as
the total RNA chain length grows. The longer total RNA chain length also increases the
maximal number of nodes that can be present in the graphs. Though we are not able to
investigate the graphs of total degree greater than 16 for the 𝑛 = 51 chain, the results
from the 𝑛 = 17 case suggest that the Δ𝐺 of the graphs will peak at some total
degree near the maximum, which for 𝑛 = 51 is 𝐷 = 96 or 24 nodes.
Our data suggest that as a function of repeat length $n$, there are two opposing
factors that control the thermodynamic stability of the graphs at different degrees $D$.
First, longer repeat lengths permit a larger number of duplexes to be made, and the
maximum degree $D^{\ddagger}$ is proportional to $n$. The duplexes contain extra thermodynamic
stability due to the base pairs and base stacks, and graphs at higher $D$ have more node
stabilization. But at the same time, more duplexes constrain the backbone
conformations, producing lower conformational entropy for the chains. This acts against
larger $D$ and destabilizes them. Finally, the repeat length $n$ also limits the number of
possible diagrams as $D$ increases toward the maximum $D^{\ddagger}$. This further truncates the
size of the sub-ensembles approaching $D^{\ddagger}$. This tradeoff between duplex stability and
chain conformational diversity results in an optimum in the sub-ensemble free energy
$\Delta G_D$ at a certain $D$. Stronger intrinsic duplex stability shifts this optimum position toward
$D^{\ddagger}$, whereas a weaker duplex stability shifts this optimum position toward lower $D$.
Depending on the width of this distribution, the ensemble may be characterized by
graphs from many different degrees $D$, with open diagrams being the most common
sub-ensemble.
These theoretical predictions are at variance with the prevailing view that the
dominant structure of CNG repeats is a maximal hairpin structure with 2-way junctions
plus a single hairpin. The results instead point to many potential structures of similar
prevalence, with a large contribution from open and bubble-diagram type structures.
Although crystallographic data point to the dominance of hairpin structures, this leaves open the
question of how an ensemble of mostly open structures can be detected in solution.
Techniques such as small-angle X-ray scattering (SAXS) (41–44), UV melting (45), and
Förster resonance energy transfer (FRET) (46) can all be used to probe the solution
structure of RNA. While the use of the thermodynamic data of Sobczak et al. does provide a
point of contact between the calculated free energies and experimental measurements,
the ensemble predicted by our results is diverse enough that a one-to-one
correspondence to experiments is unlikely. Instead, general features consistent with a
diverse ensemble of open and flexible structures should be looked for in the experimental
data to confirm our prediction. In particular, the predicted ensemble should have a
SAXS profile yielding a Kratky plot consistent with flexible structures, UV melting data
consistent with a broad structural distribution, and FRET measurements that show
contacts between positions proximal to one another with stretches of no contacts in
between.
We now address possible limitations of our model in its current form and future steps
to improve on what we have presented here. As the focus of this study was to
understand the diversity of (CNG) repeats at a secondary structural level, long-range
interactions were not included. The free energies of these tertiary contacts can be
measured using the same Nucleic MC simulation framework, and
these studies are currently in progress. Additionally, the current model assumes WC-only base
pairing and, as such, ignores base pairs between the non-GC residues in the repeats.
However, the enumeration of diagrams involving non-WC pairs is possible, and a library
of free energies for these contacts can be constructed using the same MC simulation
framework; this work is also currently underway. Enumerating diagrams with non-WC
pairs introduces additional graph elements, which
makes the enumeration process more complex, but this limitation can be
addressed by the diagrammatic summation method proposed in the companion paper,
wherein we apply a partition function method to study the graph ensembles, removing
the need for explicit enumeration (47).
3.4 References
1. Kiliszek, A., R. Kierzek, W.J. Krzyzosiak, and W. Rypniewski. 2012. Crystallographic
characterization of CCG repeats. Nucleic Acids Res. 40:8155–8162.
2. Kiliszek, A., R. Kierzek, W.J. Krzyzosiak, and W. Rypniewski. 2011. Crystal structures of CGG RNA
repeats with implications for fragile X-associated tremor ataxia syndrome. Nucleic Acids Res.
39:7308–7315.
3. Mooers, B.H.M., J.S. Logue, and J.A. Berglund. 2005. The structural basis of myotonic dystrophy
from the crystal structure of CUG repeats. PNAS. 102:16626–16631.
4. Kumar, A., H. Park, P. Fang, R. Parkesh, M. Guo, K.W. Nettles, and M.D. Disney. 2011. Myotonic
Dystrophy Type 1 RNA Crystal Structures Reveal Heterogeneous 1 × 1 Nucleotide UU Internal
Loop Conformations. Biochemistry. 50:9928–9935.
5. Tamjar, J., E. Katorcha, A. Popov, and L. Malinina. 2012. Structural dynamics of double-helical
RNAs composed of CUG/CUG- and CUG/CGG-repeats. Journal of Biomolecular Structure and
Dynamics. 30:505–523.
6. Coonrod, L.A., J.R. Lohman, and J.A. Berglund. 2012. Utilizing the GAAA Tetraloop/Receptor To
Facilitate Crystal Packing and Determination of the Structure of a CUG RNA Helix. Biochemistry.
51:8330–8337.
7. Izzo, J.A., N. Kim, S. Elmetwaly, and T. Schlick. 2011. RAG: An update to the RNA-As-Graphs
resource. BMC Bioinformatics. 12:219.
8. Gan, H.H., S. Pasquali, and T. Schlick. 2003. Exploring the repertoire of RNA secondary motifs
using graph theory; implications for RNA design. Nucleic Acids Res. 31:2926–2943.
9. Fera, D., N. Kim, N. Shiffeldrim, J. Zorn, U. Laserson, H.H. Gan, and T. Schlick. 2004. RAG: RNA-
As-Graphs web resource. BMC Bioinf. 5:88.
10. Walker, R. 1992. Implementing discrete mathematics: combinatorics and graph theory with
Mathematica, Steven Skiena. Pp 334. 1990. ISBN 0-201-50943-1 (Addison-Wesley). The
Mathematical Gazette. 76:286–288.
11. Mak, C.H., and E.N.H. Phan. 2018. Topological Constraints and Their Conformational Entropic
Penalties on RNA Folds. Biophysical Journal. 114:2059–2071.
12. Mathews, D.H., J. Sabina, M. Zuker, and D.H. Turner. 1999. Expanded sequence dependence of
thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288:911–
940.
13. Turner, D.H. 1996. Thermodynamics of base pairing. Curr. Opin. Struct. Biol. 6:299–304.
14. Turner, D.H., and D.H. Mathews. 2009. NNDB: the nearest neighbor parameter database for
predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 38:D280–D282.
15. Diamond, J.M., D.H. Turner, and D.H. Mathews. 2001. Thermodynamics of three-way multibranch
loops in RNA. Biochemistry. 40:6971–6981.
16. Mak, C.H. 2015. Atomistic Free Energy Model for Nucleic Acids: Simulations of Single-Stranded
DNA and the Entropy Landscape of RNA Stem–Loop Structures. J. Phys. Chem. B. 119:14840–
14856.
17. Mak, C.H. 2008. RNA conformational sampling. I. Single‐nucleotide loop closure. J. Comput. Chem.
29:926–933.
18. Mak, C.H., W.-Y. Chung, and N.D. Markovskiy. 2011. RNA conformational sampling: II. Arbitrary
length multinucleotide loop closure. J. Chem. Theory Comput. 7:1198–1207.
19. Mak, C.H., T. Matossian, and W.-Y. Chung. 2014. Conformational entropy of the RNA phosphate
backbone and its contribution to the folding free energy. Biophys. J. 106:1497–1507.
20. Mak, C.H., L.L. Sani, and A.N. Villa. 2015. Residual Conformational Entropies on the Sugar–
Phosphate Backbone of Nucleic Acids: An Analysis of the Nucleosome Core DNA and the
Ribosome. J. Phys. Chem. B. 119:10434–10447.
21. Weeks, J.D., D. Chandler, and H.C. Andersen. 1971. Role of repulsive forces in determining the
equilibrium structure of simple liquids. J. Chem. Phys. 54:5237–5247.
22. Mak, C.H. 2016. Unraveling Base Stacking Driving Forces in DNA. J. Phys. Chem. B. 120:6010–20.
23. Hummer, G., S. Garde, A.E. Garcia, A. Pohorille, and L.R. Pratt. 1996. An information theory model
of hydrophobic interactions. Proc. Natl. Acad. Sci. USA. 93:8951–8955.
24. Rury, A.S., C. Ferry, J.R. Hunt, M. Lee, D. Mondal, S.M.O. O’Connell, E.N.H. Phan, Z. Peng, P.
Pokhilko, D. Sylvinson, Y. Zhou, and C.H. Mak. 2016. Solvent Thermodynamic Driving Force
Controls Stacking Interactions between Polyaromatics. J. Phys. Chem. C. 120:23858–23869.
25. Henke, P.S., and C.H. Mak. 2014. Free energy of RNA-counterion interactions in a tight-binding
model computed by a discrete space mapping. J. Chem. Phys. 141:08B612_1.
26. Mak, C.H., and P.S. Henke. 2012. Ions and RNAs: free energies of counterion-mediated RNA fold
stabilities. J. Chem. Theory Comput. 9:621–639.
27. Abrescia, N.G.A., A. Thompson, T. Huynh-Dinh, and J.A. Subirana. 2002. Crystal structure of an
antiparallel DNA fragment with Hoogsteen base pairing. PNAS. 99:2806–2811.
28. Pous, J., L. Urpí, J.A. Subirana, C. Gouyette, J. Navaza, and J.L. Campos. 2008. Stabilization by
Extra-Helical Thymines of a DNA Duplex with Hoogsteen Base Pairs. J. Am. Chem. Soc. 130:6755–
6760.
29. Kitamura, A., Y. Muto, S. Watanabe, I. Kim, T. Ito, Y. Nishiya, K. Sakamoto, T. Ohtsuki, G. Kawai,
K. Watanabe, K. Hosono, H. Takaku, E. Katoh, T. Yamazaki, T. Inoue, and S. Yokoyama. 2002.
Solution structure of an RNA fragment with the P7/P9.0 region and the 3′-terminal guanosine of the
Tetrahymena group I intron. RNA. 8:440–451.
30. Shankar, N., S.D. Kennedy, G. Chen, T.R. Krugh, and D.H. Turner. 2006. The NMR Structure of an
Internal Loop from 23S Ribosomal RNA Differs from Its Structure in Crystals of 50S Ribosomal
Subunits. Biochemistry. 45:11776–11789.
31. Parkinson, G.N., M.P.H. Lee, and S. Neidle. 2002. Crystal structure of parallel quadruplexes from
human telomeric DNA. Nature. 417:876–880.
32. Wang, Y., and D.J. Patel. 1993. Solution structure of the human telomeric repeat d[AG3(T2AG3)3]
G-tetraplex. Structure. 1:263–282.
33. Olson, W.K., M. Bansal, S.K. Burley, R.E. Dickerson, M. Gerstein, S.C. Harvey, U. Heinemann, X.J.
Lu, S. Neidle, Z. Shakked, H. Sklenar, M. Suzuki, C.S. Tung, E. Westhof, C. Wolberger, and H.M.
Berman. 2001. A standard reference frame for the description of nucleic acid base-pair geometry.
Journal of molecular biology. 313:229–237.
34. Vedula, L.S., J. Jiang, T. Zakharian, D.E. Cane, and D.W. Christianson. 2008. Structural and
mechanistic analysis of trichodiene synthase using site-directed mutagenesis: Probing the catalytic
function of tyrosine-295 and the asparagine-225/serine-229/glutamate-233–Mg2+B motif. Archives
of Biochemistry and Biophysics. 469:184–194.
35. Richmond, B., and A. Knopfmacher. 1995. Compositions with distinct parts. Aeq. Math. 49:86–97.
36. Gilbert, D.E., and J. Feigon. 1999. Multistranded DNA structures. Current Opinion in Structural
Biology. 9:305–314.
37. Cao, S., and S.-J. Chen. 2005. Predicting RNA folding thermodynamics with a reduced chain
representation model. RNA. 11:1884–1897.
38. Cao, S., and S.-J. Chen. 2006. Predicting RNA pseudoknot folding thermodynamics. Nucleic Acids
Res. 34:2634–2652.
39. Cao, S., and S.-J. Chen. 2009. Predicting structures and stabilities for H-type pseudoknots with
interhelix loops. RNA. 15:696–706.
40. Sobczak, K., G. Michlewski, M. de Mezer, E. Kierzek, J. Krol, M. Olejniczak, R. Kierzek, and W.J.
Krzyzosiak. 2010. Structural Diversity of Triplet Repeat RNAs. J. Biol. Chem. 285:12755–12764.
41. Chen, Y., and L. Pollack. 2016. SAXS studies of RNA: structures, dynamics, and interactions with
partners. WIREs RNA. 7:512–526.
42. Bernadó, P., E. Mylonas, M.V. Petoukhov, M. Blackledge, and D.I. Svergun. 2007. Structural
Characterization of Flexible Proteins Using Small-Angle X-ray Scattering. J. Am. Chem. Soc.
129:5656–5664.
43. Kikhney, A.G., and D.I. Svergun. 2015. A practical guide to small angle X-ray scattering (SAXS) of
flexible and intrinsically disordered proteins. FEBS Letters. 589:2570–2577.
44. Burke, J.E., and S.E. Butcher. 2012. Nucleic Acid Structure Characterization by Small Angle X-Ray
Scattering (SAXS). Current Protocols in Nucleic Acid Chemistry. 51:7.18.1-7.18.18.
45. Xia, T., D.H. Mathews, and D.H. Turner. 1999. 6.03 - Thermodynamics of RNA Secondary Structure
Formation. In: Barton SD, K Nakanishi, O Meth-Cohn, editors. Comprehensive Natural Products
Chemistry. Oxford: Pergamon. pp. 21–47.
46. Füchtbauer, A.F., M.S. Wranne, M. Bood, E. Weis, P. Pfeiffer, J.R. Nilsson, A. Dahlén, M. Grøtli,
and L.M. Wilhelmsson. 2019. Interbase FRET in RNA: from A to Z. Nucleic Acids Res. 47:9990–
9997.
47. Mak, C.H., and E.N.H. Phan. 2020. Diagrammatic Theory of RNA Structures and Ensembles with
Trinucleotide Repeats. bioRxiv. 2020.05.30.125641.
Chapter 4: Diagrammatic Approaches to RNA Structures with
Trinucleotide Repeats
In this chapter, we further develop our diagrammatic methods to compute the
conformational diversity of trinucleotide repeat RNA sequences. Examples of some
possible conformations of a short (CNG) repeat with different secondary structures are
shown in Fig. 4.1. Because of their repeat structures, at least one-third of the
nucleotides on (CNG)n sequences cannot form canonical base pairs upon folding.
Depending on the identity of the N nucleotide, they may also interact with themselves or
with the G or C nucleotides.
Figure 4.1 Examples of a 5’-NG(CNG)8CN-3’ repeat sequence in five different conformations. (a) The maximal
hairpin “necklace” structure. (b) and (c) Structures with an asymmetric internal junction. (d) and (e) Structures with
three-way junctions. The dual graph representation is shown next to each example, where each 2-bp duplex is
represented by a dot, hairpin loops by circles with one dot, 2-way junctions by circles with two dots, 3-way
junctions by circles with three dots and an arc represents the two unpaired ends. In the graphs, the number
adjacent to each edge indicates its length in nt. The base pair representation is shown below the dual graph of
each example.
Fig. 4.1(a) illustrates a maximally canonically paired “necklace” structure. To the
right of it is shown its dual graph representation. The length of each junction is specified
in number of nucleotides (nt). The base-pair representation of the structure is shown
below the dual graph. The base-pair or "matrix" representation explicitly enumerates the
sequence positions of the nucleotides bound by canonical interactions. Fig. 4.1(b) and
(c) show two other examples where one of the two-way junctions is asymmetric. These
two structures have one fewer helix and thus lower base pair and stacking stability than
(a). Their dual graph representations are shown to the right of (b) and (c), suggesting
that their loop structures are topologically distinct from (a). Different junction lengths
also cost different amounts of conformational entropy for the sugar-phosphate backbone.
The loop entropies in the various secondary structures must be accounted for to
correctly determine their free energies. In general, (b) and (c) do not have the same
loop entropies even though they contain the same number of nucleotides inside their
loops (five 1-nt loops, one 4-nt loop and one 7-nt loop). This is because the 4x1 internal
loops in (c) adjacent to the hairpin may sterically interfere with each other and with the
helices differently compared to the 1x1 internal loops in (b). Loop entropies are
therefore dependent on where and how they appear on the structure relative to each
other.
Fig. 4.1(d) and (e) show two examples with three-way junctions. In general, higher
multiway junctions cost more entropy because they represent a more stringent
conformational constraint for the sugar-phosphate backbone, and they also experience
more steric congestion for the helices around the junction. The dual graph
representation of each is shown to the right. Even though (d) and (e) are topologically
equivalent, they do not contain the same loop entropies because their loops are
arranged differently along the sequence. Notice that while (e) has identical junction
lengths to (b) and (c), the loop entropies of these three structures are also intrinsically
different.
Entropies of loops and junctions, or more precisely the loss in their conformational
entropies, arise from constraints coming from the base pairs. If $c$ denotes a chain
conformation and $P(c)$ its probability, the total entropy content of this ensemble is given
by $S = -k_B \sum_c P(c) \ln P(c)$. Under these constraints, the new probability for each
conformation, $P'(c) = P(c \mid \text{constraints})$, incurs a penalty, and the loss of entropy
is given by:
$$\Delta S = S(\text{with constraints}) - S(\text{no constraints}) = -k_B \sum_c \left[ P'_c \ln P'_c - P_c \ln P_c \right] \qquad (4.1)$$
where the sum runs over all conformations. If one can determine how the constraints
imposed by the secondary and tertiary structures in the fold transform $P(c) \to P'(c)$, $\Delta S$
can be determined.
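As a minimal numerical sketch of Eq. (4.1), the snippet below evaluates the entropy loss for a toy ensemble. The eight equally likely conformations and the constraint that keeps two of them are purely illustrative assumptions, not data from this work, and entropies are reported in units of $k_B$.

```python
import math

def entropy(probs):
    """Entropy in units of k_B: S/k_B = -sum_c P(c) ln P(c); zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def constrain(probs, allowed):
    """Condition on a constraint: zero out forbidden conformations and renormalize."""
    kept = [p if ok else 0.0 for p, ok in zip(probs, allowed)]
    total = sum(kept)
    return [p / total for p in kept]

# Hypothetical toy ensemble: 8 equally likely conformations, 2 of which satisfy the constraint.
p_free = [1.0 / 8] * 8
allowed = [True, True] + [False] * 6
p_constrained = constrain(p_free, allowed)

# Delta S / k_B = S(with constraints) - S(no constraints) = ln 2 - ln 8 = -ln 4
delta_s = entropy(p_constrained) - entropy(p_free)
print(delta_s)
```

For a uniform ensemble the result reduces to the logarithm of the fraction of conformations that survive the constraint, which is the usual intuition behind a loop entropy penalty.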
In general, the constraints imposed by secondary/tertiary structures are correlated.
“Factorizability” describes how these constraints may break up into independent (or
approximately independent) subsets. For instance, if the fold introduces 4 constraints $A$,
$B$, $C$ and $D$, but the effects of $A$ and $B$ are separable from $C$, which is also separable from
$D$, then $P'(c) = P(c \mid A, B, C, D) = P(c \mid A, B) \cdot P(c \mid C) \cdot P(c \mid D)$. Under this
factorization, the entropy change in Eq. (4.1) would simply be equal to
$$\Delta S = \Delta S(\text{with constraints } A, B) + \Delta S(\text{with constraint } C) + \Delta S(\text{with constraint } D).$$
Different approximations have been used to account for loop entropies in RNA
folding predictions. These range from ignoring loop entropies altogether (1–4), to
treating each loop in the secondary structure as independent and approximating its
value by additivity rules (5–7), to assigning experimentally-derived free energy to loops
of specific known sequences (8). The most sophisticated of these is NNDB (9), which
Mfold (10) is based on. NNDB employs thermodynamic data to assign approximate
functional forms to interpolate experimentally measured loop free energies of hairpins,
bulges, internal loops and multibranch junctions. In one form or another, an intrinsic
factorizability in the loop entropies is assumed by all these approaches. For example,
NNDB approximates the loop entropies in multiway junctions of order higher than two by a
sum of the form 𝑎 + 𝑏 × 𝑢 + 𝑐 × ℎ, where 𝑢 is the number of unpaired nucleotides, ℎ is
the number of branching helices, and the empirical constants 𝑎 , 𝑏 , 𝑐 are parameters that
were found by maximizing the accuracy of secondary structure prediction (11). For
many RNA folding problems, this assumption may be well justified because the
thermodynamic driving force for the secondary structure comes from the stability of the
pairing and stacking of bases in the helices. But for (CNG) trinucleotide repeat
sequences, this may not be the case since each helix is no more than a two-base-pair
stack of GC|CG, and they lack the more substantial stacking free energy that stabilizes
longer helices (12). Indeed, experimental measurements suggest that the helix free
energy estimated from Mfold greatly overestimates the stability of GC|CG stacks in
(CNG) repeats (13). Because of this, the role of the loop entropies, their factorizability,
and how they influence the conformational diversity of (CNG) repeats should be
examined.
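The factorizability assumption behind the linear multibranch form quoted above can be made concrete with a short sketch. The parameter values used here are hypothetical placeholders, not the NNDB constants; the point is only that a penalty of the form a + b·u + c·h depends on the totals u and h and is blind to how the unpaired nucleotides are distributed around the junction.

```python
def multibranch_penalty(unpaired, helices, a, b, c):
    """Linear multibranch loop penalty a + b*u + c*h (parameters are caller-supplied)."""
    return a + b * unpaired + c * helices

# Hypothetical parameters, chosen only for illustration.
A_HYP, B_HYP, C_HYP = 1.0, 0.1, 0.5

# Two 3-way junctions with unpaired segments (1, 1, 4) and (2, 2, 2):
# the arrangements differ, the totals (u = 6, h = 3) do not.
penalty_114 = multibranch_penalty(sum((1, 1, 4)), 3, A_HYP, B_HYP, C_HYP)
penalty_222 = multibranch_penalty(sum((2, 2, 2)), 3, A_HYP, B_HYP, C_HYP)
print(penalty_114, penalty_222)  # identical by construction
```

Any dependence of the loop entropy on the arrangement of segments, of the kind discussed for Fig. 4.1, is therefore outside the reach of this functional form.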
Using a large body of empirical data derived from Monte Carlo (MC) conformational
sampling (14, 15), we have determined cases where constraints are approximately
independent and provided quantitative metrics for their factorizability in Chapters 2 and
3. We now show how these approximate factorizabilities of the loop entropies can be
expressed diagrammatically for chains of arbitrary lengths, and apply this to study the
conformational diversity of (CNG) repeat sequences.
4.1 Materials and Methods
4.1.1 Graph Representations
Tinoco et al. (16) used an adjacency matrix representation to denote the canonically
bound base pairs in RNA secondary structures. This representation is given in Fig. 4.1
to the lower right of each structure. Waterman et al. (5, 6, 17) have described several
equivalent representations, such as chord diagrams and linear trees. Schlick et al. (18–
20) employed dual graphs to represent the same information, and examples of these
are shown in Fig. 4.1 to the upper right of each structure. Though topologically
equivalent, various representations emphasize different aspects of the folding free
energies. The matrix representation and the chord diagrams, for example, emphasize
the paired bases, whereas dual graphs highlight the unpaired segments on the loops
and junctions, as pointed out by Liu and Bundschuh (8).
Since the focus of this manuscript is on loops, we will rely on dual graphs. In the
previous chapters, we described the approximate factorizabilities of certain secondary
structural features that were observed in the MC data in Refs. (14, 15). These
factorizabilities can be expressed using diagrams. For instance, the loop entropies of
the unpaired segments in any two-way junction are correlated, but they are largely
uncorrelated with the loops on the other sides of the helices exiting from the two-way
junction. Fig. 4.2 shows how this factorization works for the two structures in Fig. 4.1(b)
and (c). Each of the objects on the right side of Fig. 4.2 contains loop entropies that can
be retrieved from the data library in Refs. (14, 15). Similar factorizabilities exist for
higher multiway junctions, and their dual graph representations can also be used to
express this in a way analogous to Fig. 4.2.
Figure 4.2 Example showing factorization of the diagram on the left into the factors on the right. The circle with
one dot represents a hairpin loop of size 𝑑 . Circles with two dots represent 2-way junctions. The two open line
segments represent open strands. The three filled dots represent 2-bp (4-nt) duplexes. The corresponding
expression for the composite probability is given in Eq. (4.2).
The composite probability of the graph on the left in Fig. 4.2 is given by:
$$P_1(d)\, P_2(b, f)\, P_2(c, e)\, P_0(a)\, P_0(g)\, [P_\bullet(4)]^3 \qquad (4.2)$$
where $P_1(x)$ is the probability associated with a hairpin loop (or a “1-way junction”) of
length $x$, $P_2(x, y)$ is the probability of a 2-way junction with loop lengths $x$ and $y$,
$P_0(x) = 1$ is the probability associated with an open strand and $P_\bullet$ is the probability of
the duplex. For the loops in hairpins and junctions, the probabilities are given by
$P = e^{\Delta S / k_B}$, where $\Delta S$ is the conformational entropy of a loop relative to an open
strand. $P_\bullet(4) = e^{\Delta S_\bullet / k_B - \Delta H_\bullet / k_B T}$, the probability of a 2-bp (4-nt) duplex, has both
enthalpic and entropic contributions, which involve stacking and base pairing interactions as
well as the loss of conformational freedom suffered by the backbone to stack. An example of all
the decomposable factors of a necklace diagram is given on the right side of Fig. 4.2.
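A short sketch of how the product in Eq. (4.2) could be evaluated is given below. All numerical inputs (entropies in units of $k_B$ and the duplex enthalpy in units of $k_B T$) are hypothetical placeholders rather than values from the data library, and the function names are introduced here only for illustration.

```python
import math

def loop_probability(delta_s_over_kb):
    """P = exp(Delta S / k_B) for a hairpin loop or junction."""
    return math.exp(delta_s_over_kb)

def duplex_probability(delta_s_over_kb, delta_h_over_kbt):
    """P_dot = exp(Delta S / k_B - Delta H / k_B T) for a 2-bp duplex."""
    return math.exp(delta_s_over_kb - delta_h_over_kbt)

def composite_probability(p_hairpin, p_junctions, p_duplex, n_duplex):
    """Product of factors as in Eq. (4.2); open strands contribute P_0 = 1."""
    prob = p_hairpin
    for p in p_junctions:
        prob *= p
    return prob * p_duplex ** n_duplex

# Hypothetical inputs: one hairpin, two 2-way junctions, three duplexes.
ds_hairpin, ds_j1, ds_j2 = -8.0, -10.0, -10.5   # Delta S / k_B
ds_duplex, dh_duplex = -6.0, -16.0              # Delta S / k_B and Delta H / k_B T

p_total = composite_probability(
    loop_probability(ds_hairpin),
    [loop_probability(ds_j1), loop_probability(ds_j2)],
    duplex_probability(ds_duplex, dh_duplex),
    n_duplex=3,
)
print(p_total)
```

Because every factor is an exponential, the logarithm of the composite probability is simply the sum of the individual exponents, which is the additivity that the factorization is meant to deliver.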
4.1.2 Specializing to (CNG) Repeat Sequences
To specialize the formulation to apply to 5’-NG(CNG)8CN-3’ repeat sequences
specifically, we take into account their repeat structure. By “repeat structure”, we are
referring to the periodicity of the nucleotide sequence. In our calculations, we employ
constructs with the following architecture:
5’-(N-GC)-(N-GC)-(N-GC)-(N-GC)-(N-GC)-(N-GC)- … -(N-GC)-(N-GC)-N-3’
with 𝑛 repeating units of (NGC). Formally, this construct has 𝑙 = 3𝑛 + 1 nucleotides
instead of 3𝑛 . This is done to ensure that the 5’ and 3’ ends of the chain do not have to
be treated differently, but it does not materially alter the results or the formulation.
As described above, the periodicity of the sequence permits canonical base pairing
producing 2-bp duplexes only. Beyond that, the ability of the N nucleotides to form
noncanonical base pairs can favor different structures depending on whether N = A, C,
G or U. These noncanonical effects can be captured by assigning an extra bias to the 2-
way junctions of those sequences where noncanonical base pair or stacking can add
stability to the chain. Because of the repeat structure, unpaired segments on the
sequence are limited to lengths equal to 1, 4, 7, 10, … nt. To account for this, every loop
length in the formulation is replaced by its length divided by 3. For example, the lengths
$\{a, b, \ldots, g\}$ in Fig. 4.2 become $\{a' = a \backslash 3,\ b' = b \backslash 3,\ \ldots,\ g' = g \backslash 3\}$, where $\backslash$ denotes
an integer division that discards the remainder. A loop with length $a' = 0$ is 1-nt long. A
loop with $a' = 1$ is 4-nt long, etc. The only exceptions to this rule are a 2-bp (4-nt) duplex,
which is assigned a length of 2 repeat units instead of 1, and a quadruplex, which is assigned
4 repeat units.
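The length bookkeeping can be sketched in a few lines. This assumes, as the examples above imply, that allowed loop lengths are of the form 3a' + 1 nt (1, 4, 7, 10, …); the function names and the dictionary of special cases are illustrative only.

```python
def loop_repeat_units(length_nt):
    """Map a loop length in nt to repeat units by integer division by 3."""
    if length_nt % 3 != 1:
        raise ValueError("loop lengths on a (CNG) repeat are assumed to be 3a' + 1 nt")
    return length_nt // 3

def loop_length_nt(repeat_units):
    """Inverse map: a loop of a' repeat units is 3a' + 1 nt long."""
    return 3 * repeat_units + 1

# Exceptions stated in the text: footprints of paired elements in repeat units.
PAIRED_ELEMENT_UNITS = {"duplex": 2, "quadruplex": 4}

units = [loop_repeat_units(n) for n in (1, 4, 7, 10)]
print(units)
```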
Bundschuh et al. (8, 21) have applied a related diagrammatic method to various
trinucleotide repeats. They employed a diagrammatic recursion relation for the partition
function 𝑍 to study the crossover from asymptotic scaling behavior to finite-length
effects. They found that in the presence of multiloop junctions, the crossover to the
scaling regime is related to the chain’s ability to make branches. For (GCA)n chains,
their results show that the scaling regime is reached with just a handful of repeats,
whereas for (GCC)n sequences the crossover does not occur until the sequence is
hundreds of repeats long because of the extra pairing coming from the N = C
nucleotides in the junctions with the G residues adjacent to them. These studies
suggest that the interaction of the N nucleotide in (CNG) repeats may play a significant
role in determining their prevalent structures. In our work, we have employed a graph
renormalization scheme based on diagrammatic decomposition to study the
concentrations of different structural elements on the chain, whereas in the work of
Bundschuh et al. (8, 21) their graph recursion on 𝑍 was better suited to studying the
emergence of repeat-length-dependent asymptotic behaviors. But the two methods
share common diagrammatic features.
4.1.3 Graph Elements and Loop Entropy Contributions
The secondary structural elements considered in this study are shown in Fig. 4.3. A
dot represents a GC|CG helix. Its probability $P_\bullet$ contains the pairing and stacking free
energy, as well as the backbone entropy of the doublet. Circles with one, two or three
holes represent the loops in a hairpin, a two-way junction, and a three-way junction,
respectively, and their probabilities $P_1$, $P_2$ and $P_3$ contain the loop entropies. Hairpins
and two-way junctions have been found in experimental thermodynamic studies (13) to
be most relevant for (CNG) repeat sequences. In this study we also include three-way
junctions to assess their relevance. In addition to these, quadruplexes, represented by
the diagram with three loops emanating from a square core in Fig. 4.3, have also been
included because they have been observed in experimental studies of other
trinucleotide repeat sequences, notably (AGG) and (UGG) (13). The core of each
quadruplex contains a double-deck tetrad structure with eight G nucleotides bound with
Hoogsteen base pairs and is represented diagrammatically by a solid square. Its
probability $P_q$ contains the pairing and stacking free energy as well as the backbone
entropy of the bases in the tetrad. Since only G can form tetrads, quadruplexes are
possible only on (CGG) repeat sequences. For multibranch structures, while we have
limited ourselves to 3-way junctions in this chapter, 4-, 5- or any higher multiway
junctions may be added without complications, but the results will show that multiway
junctions are of less importance for (CNG) repeats. The 5’ or 3’ unpaired ends of the
chain, represented by the last diagram in Fig. 4.3, do not cost any extra entropy
compared to an open chain.
Figure 4.3 Dual graph representation of all structural elements included in this study: helix, hairpin, 2-way junction,
3-way junction, loops in a quadruplex, the quadruplex core, bridge, and unpaired ends.
The loop entropies contained in each graph element are supplied by the data library
in Refs. (14, 15). For example, the entropies of the two loops in a 2-way junction are
dependent but their total can be expressed as a function of the sum of their lengths. The
portions of the library relevant to (CNG) repeats are reproduced in Table 4.1 for total
loop length in units of the number of repeats 𝑛 . Loop entropy data for all relevant
elements in Fig. 4.3 are given in Table 4.1.
Loop free energy as a function of total loop length (kcal/mol)

Feature (nickname)          n = 0    n = 1    n = 2    n = 3    n > 3
hairpin (1wj)               ∞        5.02     5.85     6.16     3.9 + 1.08 ln(3n + 1)
two-way junction (2wj)      5.97     6.53     6.79     6.88     4.4 + 1.08 ln(3n + 2)
three-way junction (3wj)    7.12     7.33     7.46     7.53     4.9 + 1.08 ln(3n + 3)
quadruplex (quad)           15.5     17.6     19.0     19.9     ∞
Table 4.1 Contributions of loop entropies to the folding free energy at 310 K from the data library in Refs. (14, 15)
(𝑅𝑇 = 0.616 kcal/mol). Entropies of the loops in a multibranch junction are in general correlated, but their sum
scales with the total junction lengths. Loop entropies of the junction internal to the branches are uncorrelated with
the loops on the other sides of the branches. Empirically, higher multibranch structures cost more entropy.
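The entries of Table 4.1 can be turned into Boltzmann weights with a small lookup, sketched below under the assumption that the tabulated free energies enter as exp(−ΔG/RT) with RT = 0.616 kcal/mol. The dictionary layout and function names are illustrative choices, and the quadruplex row is omitted because it has no closed form for n > 3.

```python
import math

RT = 0.616  # kcal/mol at 310 K, as quoted in the caption of Table 4.1

# feature -> (tabulated values for n = 0..3, offset and shift of the n > 3 formula)
TABLE_4_1 = {
    "1wj": ([math.inf, 5.02, 5.85, 6.16], 3.9, 1),
    "2wj": ([5.97, 6.53, 6.79, 6.88], 4.4, 2),
    "3wj": ([7.12, 7.33, 7.46, 7.53], 4.9, 3),
}

def loop_free_energy(feature, n):
    """Loop free energy (kcal/mol) for total loop length n in repeat units."""
    tabulated, offset, shift = TABLE_4_1[feature]
    if n <= 3:
        return tabulated[n]
    return offset + 1.08 * math.log(3 * n + shift)

def boltzmann_weight(feature, n):
    """exp(-G/RT); an infinite free energy gives a weight of exactly zero."""
    return math.exp(-loop_free_energy(feature, n) / RT)

print(boltzmann_weight("1wj", 0), loop_free_energy("2wj", 4))
```

The infinite entry for a hairpin with n = 0 simply removes that element from the ensemble, since its weight vanishes.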
The basic premise of the present work considers free energies of the loops to be a
fundamental determinant of RNA structures. This is somewhat different from the
traditional view, where bases paired in helices, triplexes, quadruplexes or through tertiary
interactions are considered the drivers. Both these factors are of course present in any
RNA system, but in some problems paired structures are more important, whereas in
others loop entropies may outweigh pairs. For the type of problem studied in this paper,
where the ensemble may be dominated by open instead of strongly paired structures,
careful consideration must be given to the loop entropies. Our results will show that for
the (CNG) repeats, treating the loop entropies carefully is the key to understanding their
conformational ensembles.
4.1.4 Stabilities of GC|CG Helix Doublets and G-Quadruplexes
The core thermodynamic stabilities of paired structures, such as the helices and
quadruplexes in Fig. 4.3, are taken from experiments as described in the previous
chapter. For example, to determine the free energy contribution from each duplex, we
used the experimental $\Delta G_{\text{exp}}$ data reported by Sobczak et al. for (CNG)20 oligomers in
100 mM NaCl (13) for N = A, C, G and U. The only conformation that was reported for
(CNG)20 has the maximal hairpin structure analogous to that shown in Fig. 4.1(a). Using
the loop entropy values from our library, in conjunction with the experimentally
observed $\Delta G_{\text{exp}}$ for the maximal hairpin, we determined the free energies of the helix
cores in each of the (CNG)20 repeats for N = A, C, G and U separately. The weakest came
from N = C with $\Delta G^0(\text{duplex}) = -6.17$ kcal/mol, followed by U (−6.39 kcal/mol), A (−6.57
kcal/mol), and G (−6.62 kcal/mol). In the results below, we will use the N = C
$\Delta G^0(\text{duplex})$ value as the reference, as this represents a lower bound to stability. The
other results for N = A, G or U were obtained by applying the appropriate offset to the
values for each duplex. For quadruplexes, experimental data from Sobczak et al.
suggest that (UGG)17 and (AGG)17 can form quadruplexes, but (CGG) repeats cannot.
To estimate the effects of including quadruplexes in the (CNG)n repeat ensembles, we
determined the free energy of a quadruplex core from the experimental $\Delta G_{\text{exp}}$ for
(UGG)17 and (AGG)17 in 100 mM NaCl (13). These yielded an approximation for the
quadruplex core free energy of ~ −20.4 kcal/mol from (AGG)17 and (UGG)17. In our
calculations, we varied the quadruplex
stability from zero up to and beyond these values to examine how the potential
formation of quadruplexes might affect the structures of (CNG) repeats.
The values of the duplex free energies derived from the experimental data of
Sobczak et al. (13) using the method above are ~ 3 kcal/mol weaker per GC|CG helix
compared to the nearest-neighbor model of Turner et al. (9, 22). Using Mfold (10) to
calculate the free energy of a typical (CNG) repeat produces exclusively the maximal
hairpin structure analogous to Fig. 4.1(a) as the only significant conformation. But using
the helix free energies obtained according to the prescription in the last paragraph,
structural alternatives to the maximal hairpin become more competitive. In general,
non-maximally paired structures enjoy higher entropies because loop segments in
hairpins and junctions are less constrained compared to paired bases. In the results
below, we will see that the tradeoff between higher entropy in the more open structures
and the higher stability of the helices and quadruplexes in compact structures
produces a diverse, mixed ensemble for most (CNG) repeat sequences, rather than
favoring a single dominant maximal hairpin structure.
4.1.5 Diagrammatic Renormalization
The graph approach described here shares many features with those employed in
field theory and in liquids, where diagrammatic techniques have been used extensively
to manipulate graphs (23). Previous work has also applied diagrammatic techniques to
study RNAs (1, 4, 5, 7, 8, 21, 24, 25).
The canonical partition function of the ensemble $Z(n)$ as a function of the number of
(CNG) repeats $n$ is represented by diagrams. The generating function,
$Z(\lambda) = \sum_{n=0}^{\infty} Z(n) \exp(-\lambda n)$, which is the grand canonical ensemble partition
function allowing variable repeat lengths, can then be expressed in terms of the generating
functions of the probabilities of the diagrammatic elements described above at 310 K.
Standard renormalization allows the graphs to be re-summed, giving
$$Z(\lambda) = 1 / [1 - e^{-\lambda} - R(\lambda)] \qquad (4.3)$$
where the root function $R$ is a sum over all irreducible diagrams. Recursion relations
similar to those in Eq. (4.3) have previously been described in the context of RNA structural
studies (1, 2, 4–8, 21, 24, 25). Pillsbury et al. (2) and Reidys et al. (3) reported similar
recursion relations for RNA, while the use of irreducible diagrams was introduced by
Orland et al. (1, 2, 24, 26) for studying RNA structures. The root function
satisfies the Dyson equation (1, 2, 24, 26, 27), which is shown diagrammatically in Fig.
4.4. Including multibranch loops up to 3-way junctions, this self-consistent equation for
the root function $R_3(\lambda)$ is quadratic. Recursion relations for $Z$ have also been used by
Liu and Bundschuh (8) to examine how the partition function scales with repeat lengths.
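To illustrate how a generating function of the form in Eq. (4.3) encodes the fixed-length partition functions, the sketch below expands $1/[1 - x - R(x)]$ in powers of $x = e^{-\lambda}$. It assumes the root function is already known as a power series $R(x) = \sum_{k \ge 1} r_k x^k$ with no constant term; in the actual calculation $R$ comes from the self-consistent Dyson equation, which is not reproduced here, so the coefficients used below are hypothetical. Multiplying out $Z(x)[1 - x - R(x)] = 1$ gives $Z(0) = 1$ and $Z(n) = Z(n-1) + \sum_{k=1}^{n} r_k Z(n-k)$.

```python
def partition_coefficients(r, n_max):
    """Series coefficients Z(0..n_max) of 1 / (1 - x - R(x)), with R(x) = sum_k r[k] x**k.

    `r` maps the power k >= 1 to the coefficient r_k; missing powers count as zero.
    """
    z = [1.0]
    for n in range(1, n_max + 1):
        total = z[n - 1]                      # term from the bare "-x" (one more open repeat)
        for k in range(1, n + 1):
            total += r.get(k, 0.0) * z[n - k]  # term from an irreducible diagram spanning k repeats
        z.append(total)
    return z

# With no irreducible diagrams only the open chain exists, so Z(n) = 1 for all n.
print(partition_coefficients({}, 4))

# Hypothetical root function R(x) = x**2: the coefficients follow the Fibonacci numbers.
print(partition_coefficients({2: 1.0}, 5))
```

The same recursion makes the finite-length content of the grand canonical expression explicit, which is convenient when comparing against results quoted for a specific repeat number.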
Figure 4.4 Dyson equation for the root function $R_3$ including hairpins, 2- and 3-way junctions, as well as
quadruplexes.
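The inputs, $P_\bullet(\lambda)$, $P_1(\lambda)$, $P_2(\lambda)$, $P_3(\lambda)$ and $P_q(\lambda)$, were obtained from the loop free
energies of duplexes, hairpins, 2- and 3-way junctions, as well as quadruplexes, and from the
duplex and quadruplex stabilities described in the last subsection. The functional
dependence of the loop free energies on the loop lengths was extended beyond the
finite-length data available from the simulations by using the same scaling relationships
that have been adopted by Turner et al. in the nearest-neighbor model (9, 28, 29), which
was based on Stockmayer et al. (30), yielding the following expressions at $T = 310$ K:
$$P_\bullet(\lambda) = e^{-\left(2\lambda - \frac{6.17}{0.616}\right)} \qquad (4.4a)$$
$$P_1(\lambda) = e^{-\left(\lambda + \frac{5.016}{0.616}\right)} + e^{-\left(2\lambda + \frac{5.848}{0.616}\right)} + e^{-\left(3\lambda + \frac{6.159}{0.616}\right)} + e^{-\left(4\lambda + \frac{5.086}{0.616}\right)} \cdot \Phi\!\left(e^{-\lambda}, 1.75, \tfrac{13}{3}\right) \qquad (4.4b)$$
$$P_2(\lambda) = Q_2(\lambda) - \frac{dQ_2(\lambda)}{d\lambda} \qquad (4.4c)$$
$$Q_2(\lambda) \equiv e^{-\left(\frac{5.970}{0.616}\right)} + e^{-\left(\lambda + \frac{6.528}{0.616}\right)} + e^{-\left(2\lambda + \frac{6.797}{0.616}\right)} + e^{-\left(3\lambda + \frac{6.880}{0.616}\right)} + e^{-\left(4\lambda + \frac{5.587}{0.616}\right)} \cdot \Phi\!\left(e^{-\lambda}, 1.75, \tfrac{14}{3}\right) \qquad (4.4d)$$
$$P_3(\lambda) = \frac{1}{2}\left[2 Q_3(\lambda) - 3\frac{dQ_3(\lambda)}{d\lambda} + \frac{d^2 Q_3(\lambda)}{d\lambda^2}\right] \qquad (4.4e)$$
$$Q_3(\lambda) \equiv e^{-\left(\frac{7.124}{0.616}\right)} + e^{-\left(\lambda + \frac{7.327}{0.616}\right)} + e^{-\left(2\lambda + \frac{7.458}{0.616}\right)} + e^{-\left(3\lambda + \frac{7.524}{0.616}\right)} + e^{-\left(4\lambda + \frac{6.087}{0.616}\right)} \cdot \Phi\!\left(e^{-\lambda}, 1.75, \tfrac{15}{3}\right) \qquad (4.4f)$$
$$P_q(\lambda) \equiv e^{-\left(4\lambda + \frac{20.4}{0.616}\right)} \left[ e^{-\left(\frac{15.5}{0.616}\right)} + 3 e^{-\left(\lambda + \frac{17.6}{0.616}\right)} + 3 e^{-\left(2\lambda + \frac{19.0}{0.616}\right)} + e^{-\left(3\lambda + \frac{19.9}{0.616}\right)} \right] \qquad (4.4g)$$
$$P_k(\lambda) \equiv e^{-\left(4\lambda + \frac{12.34}{0.616}\right)} \left[ e^{-\left(\frac{13.2}{0.616}\right)} + 2 e^{-\left(\lambda + \frac{14.0}{0.616}\right)} + 3 e^{-\left(2\lambda + \frac{14.7}{0.616}\right)} + 4 e^{-\left(3\lambda + \frac{15.0}{0.616}\right)} \right] \qquad (4.4h)$$
where $\Phi$ is the Lerch transcendent (31).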
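The Lerch-transcendent terms in Eqs. (4.4b), (4.4d) and (4.4f) can be cross-checked against the $n > 3$ column of Table 4.1. The sketch below is one reading of how those terms arise, not a statement of the procedure actually used: summing $e^{-n\lambda - G(n)/RT}$ over $n \ge 4$ with $G(n) = g_0 + 1.08 \ln(3n + s)$ gives $e^{-(4\lambda + [g_0 + 1.08 \ln 3]/RT)}\, \Phi(e^{-\lambda}, 1.08/RT, 4 + s/3)$, where $1.08/0.616 \approx 1.75$. It also checks that the derivative combinations in Eqs. (4.4c) and (4.4e) multiply the $e^{-n\lambda}$ term by $(n+1)$ and $(n+1)(n+2)/2$, which equal the number of ways to split $n$ repeat units among two or three loop segments. The Lerch function is evaluated here by a truncated series in plain Python; function names are illustrative.

```python
import math

RT = 0.616    # kcal/mol at 310 K
SLOPE = 1.08  # coefficient of the logarithm in Table 4.1 (kcal/mol)

def lerch_prefactor(offset):
    """Constant paired with 4*lambda in the tail term: offset + SLOPE * ln 3."""
    return offset + SLOPE * math.log(3.0)

def tail_direct(lam, offset, shift, terms=2000):
    """Direct sum over n >= 4 of exp(-n*lam - G(n)/RT), G(n) = offset + SLOPE*ln(3n + shift)."""
    return sum(
        math.exp(-n * lam - (offset + SLOPE * math.log(3 * n + shift)) / RT)
        for n in range(4, 4 + terms)
    )

def tail_lerch(lam, offset, shift, terms=2000):
    """Same tail written as a prefactor times a truncated Lerch series Phi(z, s, a)."""
    s = SLOPE / RT           # about 1.75
    a = 4 + shift / 3        # 13/3, 14/3, 15/3 for shift = 1, 2, 3
    phi = sum(math.exp(-m * lam) / (m + a) ** s for m in range(terms))
    return math.exp(-(4 * lam + lerch_prefactor(offset) / RT)) * phi

def two_way_multiplicity(n):
    """Q - dQ/dlam acting on exp(-n*lam) multiplies it by (1 + n)."""
    return 1 + n

def three_way_multiplicity(n):
    """(2Q - 3Q' + Q'')/2 acting on exp(-n*lam) multiplies it by (2 + 3n + n^2)/2."""
    return (2 + 3 * n + n * n) // 2

print(round(lerch_prefactor(3.9), 3), round(lerch_prefactor(4.4), 3), round(lerch_prefactor(4.9), 3))
```

The three prefactors agree with the constants 5.086, 5.587 and 6.087 in the equations to within rounding, which supports the reading that the tails are the analytic continuation of the Table 4.1 scaling form.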
4.2 Results and Discussion
We have applied the calculations described in Materials and Methods to (CNG)
repeats, where N = A, C, G or U, to compute the ensemble average number of
secondary structure features associated with the conformations of the chains. The
Dyson equation in Fig. 4.4 is quadratic in $R_3$ and there are in general two roots. In all of
the cases studied, we found only one of them to yield physical results, while the other
root produced a negative value for the partition function $Z$. Results from the physically
relevant solution are shown in Fig. 4.5. Since (CAG), (CCG) and (CUG) repeat
sequences cannot physically produce quadruplexes but (CGG) repeats may, we have
plotted the results as a function of the stability of the quadruplex core $\mu_q^0/RT$. While
(CGG) repeat sequences can potentially form quadruplexes, experimental evidence
shows little to no quadruplex structures on (CGG)17 or (CGG)20 sequences (13). On the
other hand, (AGG) repeats have been found to fold predominantly into quadruplex-rich
structures (13). We have employed experimental data for (AGG) repeats to establish an
upper limit for how stable a quadruplex could be if it were to exist in (CGG) repeats. This
upper limit is on the left side of the graphs in Fig. 4.5, and the quadruplex core stability
decreases (i.e. $\mu_q^0/RT$ becomes more positive) moving to the right. (CAG), (CCG) and
(CUG) repeats are therefore associated with the right side of Fig. 4.5. The expected
structural features of (CNG)60 chains are displayed as a function of $\mu_q^0/RT$.
Figure 4.5 Ensemble averages of the number of helices (solid line), bridges (dashed lines), hairpin loops (open
circles), 2-way junctions (dotted dashed lines), 3-way junctions (open triangles) and quadruplexes (squares)
computed from the physically-relevant solution for a (CNG)60 repeat, as a function of quadruplex stability (stable on
the left, unstable on the right).
Before discussing the results, we point out that what have been calculated are
ensemble averages, and as such, they may contain contributions from a large number
of different structures. When considering the data, it is therefore important to not
associate the averages with a single conformation, keeping in mind that there may be
many structures within each ensemble. For example, while the maximal hairpin
structure depicted in Fig. 4.1(a) may be one of the prevalent structures in a (CNG)
repeat ensemble, it may be only one of many. In fact, the ensembles we have computed
are rather diverse, and the averages of all the structural features vary smoothly across
the entire parameter space studied.
Fig. 4.5 shows that the structural characteristics of (CNG)60 are strongly dependent on
the ability of the chain to make quadruplexes. When quadruplexes are unstable, the
structures on the right side of Fig. 4.5 correspond to an ensemble with largely open
chains with high concentrations of bridges and hairpin loops and some 2-way junctions,
but relatively few 3-way junctions and no quadruplexes. Interestingly, the number of
hairpin loops is almost identical to the number of bridges on the right side of Fig. 4.5.
This suggests that the structures in this ensemble are dominated by the “1+2” diagrams,
an example of which is illustrated in Fig. 4.6. Furthermore, a large number of bridges is
also indicative of largely open structures, but the number of helices observed here is
somewhat less than the maximum number that could be sustained on a (CNG)60 repeat
(theoretical maximum is 29). Instead of being driven by the favorable enthalpy of
formation of the helices, the formations in this ensemble seem to be dominated by loop
entropies.
Figure 4.6 Diagrams illustrating some of the structures observed in the results in Figs. 4.5 and 4.7.
Next, focusing on the left side of Fig. 4.5, we examine how the presence of
quadruplexes alters the structural characteristics of the ensemble. As the stability of the
quadruplex is increased (i.e. $\mu_q^0/RT$ going from right to left in Fig. 4.5), quadruplexes
begin to displace the helices. This is revealed by a decrease in the concentration of helices
and a concomitant increase in the concentration of quadruplexes. The number of bridges on
the chain also increases, while the number of 2- and 3-way junctions decreases. These
changes occur because as the quadruplexes displace the helices, the chain must
dissolve other structures in order to give way to the quadruplexes, since quadruplexes
have a larger footprint on the sequence (one quadruplex takes up a minimum of four
CNG repeats, whereas a helix only takes up two). Dissolution of the other structures
creates more bridge segments. Based on these observations, we can conclude that the
most relevant graphs in the stable-quadruplex limit (left side) of Fig. 4.5 are the “lei”
diagrams in Fig. 4.6, where quadruplexes are distributed along a largely open chain.
Experimental evidence shows little to no quadruplex formation for short (CGG)
repeat sequences (13). Based on this and the results in Fig. 4.5, we can estimate that
the stability of a quadruplex on a (CGG) chain, $\mu_q^0$, must be at least ~ 6 $RT$ lower than
on an (AGG) chain. We indicate this estimate in Fig. 4.5 by a vertical dotted line. This
suggests that a quadruplex in (CGG) repeats must be at least ~ 3.7 kcal/mol less
stable than in (AGG) repeats.
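The conversion between the ~ 6 RT estimate and ~ 3.7 kcal/mol follows directly from the value of RT at 310 K, as the short check below shows. The gas constant is the standard value in kcal/(mol K); the function names are illustrative.

```python
R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)

def rt(temperature_k):
    """Thermal energy RT in kcal/mol."""
    return R_KCAL * temperature_k

def rt_units_to_kcal(value_in_rt, temperature_k=310.0):
    """Convert a free energy expressed in units of RT to kcal/mol."""
    return value_in_rt * rt(temperature_k)

print(round(rt(310.0), 3))            # RT at 310 K
print(round(rt_units_to_kcal(6), 2))  # 6 RT in kcal/mol
```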
Next, we examine the structures of (CNG) repeats in the absence of quadruplexes.
As we have seen already, even though (CGG) repeats can form quadruplexes,
quadruplexes in (CGG) repeats are expected to be ~ 3.7 kcal/mol less stable than those
in (AGG) repeats. The other (CNG) repeats, N = A, C, U, cannot physically form
quadruplexes. In Fig. 4.7, we show results for these (CNG) repeats after placing a large
unfavorable bias against quadruplex formation on the chains.
Figure 4.7 Ensemble averages of features computed for a (CNG)60 repeat as a function of extra stability added to
each 2-way junction (favorable on the left, unfavorable on the right).
In actual (CNG) repeat sequences, the ability of the N nucleotides to form
noncanonical base pairs is expected to favor different structures depending on whether
N = A, C, G or U. We can capture these effects in our model by assigning an extra bias
to the 2-way junctions of those sequences where noncanonical base pairing or stacking
can add stability to the chain. The bias is applied to every 2-way junction regardless of
size, primarily to account for the propensity of an N nucleotide to stack against either of
the helices on the junction. To make these effects easy to see, the results in Fig. 4.7 are
reported as a function of this bias $\mu_2/RT$, where $\mu_2$ is a chemical potential imposed on
each 2-way junction. A negative value adds a bonus, and a positive value imposes a
penalty. Approximate values of the bias for N = G, A, U and C are indicated on the top
of Fig. 4.7.
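The effect of a per-junction chemical potential on ensemble averages can be illustrated with a toy calculation. The sketch below is not the diagrammatic partition function of this chapter: the three structure classes and their free energies are hypothetical and serve only to show how a bias applied to every 2-way junction reweights an ensemble, and that the mean junction count equals the negative derivative of ln Z with respect to the bias.

```python
import math

# Hypothetical structure classes: (label, unbiased free energy in units of RT,
# number of 2-way junctions). All values are illustrative only.
TOY_ENSEMBLE = [
    ("open chain", 0.0, 0),
    ("one junction", -1.0, 1),
    ("necklace-like", -1.5, 4),
]


def partition_function(mu2_over_rt, ensemble=TOY_ENSEMBLE):
    """Sum of Boltzmann weights with a bias mu2 added for every 2-way junction."""
    return sum(math.exp(-(g + n2 * mu2_over_rt)) for _, g, n2 in ensemble)


def mean_junctions(mu2_over_rt, ensemble=TOY_ENSEMBLE):
    """Ensemble average of the number of 2-way junctions at a given bias."""
    z = partition_function(mu2_over_rt, ensemble)
    weighted = sum(
        n2 * math.exp(-(g + n2 * mu2_over_rt)) for _, g, n2 in ensemble
    )
    return weighted / z


if __name__ == "__main__":
    # Negative bias (bonus) favors junction-rich classes; positive bias suppresses them.
    for mu in (-2.0, 0.0, 2.0):
        print(f"mu2/RT = {mu:+.1f}: <N_2> = {mean_junctions(mu):.3f}")
```

In this toy ensemble a negative bias shifts the population toward the junction-rich class and a positive bias toward the open chain, which mirrors the qualitative trend described for Fig. 4.7.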
In the limit where 2-way junctions are very stable (left side of Fig. 4.7), the structures
are dominated by a large number of helices and 2-way junctions but very few hairpins or
bridges. This suggests that the ensemble is characterized by closed and compact
structures. These conformations correspond to the “necklace” diagrams in Fig. 4.6 that
we have discussed in Materials and Methods.
Turning to the right side of Fig. 4.7, in the limit of a large bias imposed against the
formation of 2-way junctions, the solutions correspond to the “bubble” diagrams in Fig.
4.6, which are the hairpin-capped counterparts of the lei diagrams. They have almost
as many bridge segments as hairpins, but the number of helices is far from the
theoretical maximum of 30. These chains are therefore largely open, and they are
dominated by the entropies of the loop segments. Results from Fig. 4.7 suggest that
noncanonical base pairs or favorable stacking of the N nucleotide within the junctions
can produce a significant effect on the conformations of (CNG) repeats. The values of
the bias $\mu_2/RT$ used to generate the results in Fig. 4.7 span a range of only ~3.1
kcal/mol, but within this very narrow range, the structures in these ensembles vary
drastically.
Figure 4.8 A “phase diagram” summarizing the results from Figs. 4.5 and 4.7. The horizontal axis indicates 2-way
junction stability, and the vertical axis quadruplex stability. Phases that have been identified by the calculations are
labeled. See Fig. 4.6 for their graphical representations. Phase boundaries are approximate. The star shows the position
for which the scaling analysis in Fig. 4.9 was carried out.
Fig. 4.8 shows a “phase diagram” summarizing all the findings from above, where
variations in quadruplex stability from Fig. 4.5 are plotted along the vertical direction and
variations in two-way junction stability from Fig. 4.7 are plotted along the horizontal
direction. On this phase diagram, “(AGG)” and “(CGG)” indicate the approximate
quadruplex stabilities in (AGG) versus (CGG) chains. Approximate values of the stability
of 2-way junctions in (CNG) repeats for N = G, A, U and C are also indicated on the top
of Fig. 4.8. Non-quadruplex-forming (CNG)60 repeat sequences occupy the center of
this phase diagram, with most of their structures dominated by the 1+2 and bubble
diagrams illustrated in Fig. 4.6, which are semi-open structures. A minor fraction of the
ensemble is also made up of necklace structures, which are closed and compact. These
results point to the existence of many potential structures of similar prevalence with
contributions from both open and compact structures. Though crystallographic data on
(CNG) repeats suggest the dominance of hairpin structures (32, 33, 34), they leave open the
question of how an ensemble of diverse structures could be detected in solution.
Techniques such as small-angle X-ray scattering (SAXS) (35–38), UV melting (39), and
Förster resonance energy transfer (FRET) (40) can all be used to probe the solution
structure of RNA. While the thermodynamic data of Sobczak et al. (13) do
provide a point of contact between the calculated free energies and experimental
measurements, the ensemble predicted by our results is diverse enough that a one-to-
one correspondence to specific structure(s) revealed by experiments is unlikely. Also
important is that multiway junctions seem to be of low abundance because higher
branching costs more entropy according to data in Table 4.1, so while multibranch
structures higher than three-way can be included in the calculations, they are not likely
to alter the results significantly.
Figure 4.9 (a) Divergence of the partition function $Z(\lambda)$ as $\lambda$ approaches the singular point $\lambda_c$. The scale on the
top shows the average repeat length $\langle n \rangle$ for each $\lambda$. (b) Structural features as a fraction of the repeat length as a
function of $\lambda$. The scale on the top maps $\langle n \rangle$ to $\lambda$. Short repeats and long repeats have very different structural
compositions, and the crossover appears to occur between 30 and 60 repeats.
The conformational ensembles are functions of the repeat length. This repeat length
dependence is illustrated in Fig. 4.9 for a point on the phase diagram marked by the star
in Fig. 4.8. Fig. 4.9(a) shows the divergence of the partition function $Z(\lambda)$ as $\lambda$
approaches the singular point $\lambda_c$. The slope is ~ −1, suggesting that it is a simple pole.
This result is expected because this problem is isomorphic to the enumeration of all paths
from the 5’ to the 3’ end of the chain in the space in which the folding problem is embedded, and
generating functions of paths all have the same dominant singularity, which is a simple
pole (41). The scale on the top of Fig. 4.9(a) shows the average repeat length $\langle n \rangle$ for
each $\lambda$, and repeat lengths greater than approximately 60 appear to be in the scaling region. Fig.
4.9(b) shows how each of the features, as a fraction of the repeat length, varies as a
function of $\lambda$, again with the scale on the top mapping $\langle n \rangle$ to $\lambda$. Short repeats and long
repeats have very different structural compositions, and the crossover appears to occur
between 30 and 60 repeats. Note that in the scaling limit, there are almost equal densities
of bridges, hairpins and two-way junctions on the chain, and the ensemble is dominated
by largely open structures.
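The link between a simple pole, a log-log slope of −1, and the mapping from λ to the average repeat length can be sketched numerically. The example below uses a toy generating function with a simple pole, not the partition function computed in this chapter; the value of `LAMBDA_C` and the function names are hypothetical. It relies on the standard relation ⟨n⟩ = d ln Z / d ln λ for a generating function in which λ counts repeats.

```python
import math

LAMBDA_C = 0.5  # hypothetical singular point of the toy generating function


def z(lam):
    """Toy Z(lam) = sum_n (lam / LAMBDA_C)**n, which has a simple pole at LAMBDA_C."""
    return 1.0 / (1.0 - lam / LAMBDA_C)


def mean_length(lam, h=1e-7):
    """Average repeat length <n> = d ln Z / d ln(lam), by central finite difference."""
    dlnz = (math.log(z(lam + h)) - math.log(z(lam - h))) / (2 * h)
    return lam * dlnz


def pole_slope(lam1, lam2):
    """Slope of ln Z versus ln(LAMBDA_C - lam); equals -1 for a simple pole."""
    d_lnz = math.log(z(lam2)) - math.log(z(lam1))
    d_lndist = math.log(LAMBDA_C - lam2) - math.log(LAMBDA_C - lam1)
    return d_lnz / d_lndist


if __name__ == "__main__":
    print("slope:", pole_slope(0.49, 0.499))
    for lam in (0.40, 0.49, 0.499):
        print(f"lambda = {lam}: <n> = {mean_length(lam):.1f}")
```

For this toy function ⟨n⟩ = λ/(λ_c − λ), so the average length grows without bound as λ approaches λ_c, which is why values of λ close to the singular point probe the long-repeat scaling regime.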
Figure 4.10 Ensemble averages of features computed for a (CNG)60 repeat as a function of extra stability added to
each helix (favorable on the left, unfavorable on the right).
Finally, since there is a significant discrepancy between the stability of the GC|CG
duplexes predicted by NNDB and the experimentally derived results collected
specifically from (CNG) repeat sequences, we want to know to what extent the stability
of the duplexes affects the computed results. Fig. 4.10 shows the structural
characteristics of (CNG)60 as a function of a bias placed on the helices, more stable to
the left, less stable to the right. Toward the right, as the helices become less stable,
they are displaced by quadruplexes, which are the only structures other than helices
that can cap the end of a branch. These map to the lei diagrams in Fig. 4.6. Toward the
left, as the helices become more stable, they seed an increasing number of two- and
three-way junctions in favor of hairpins. The resulting structures correspond to the
necklace and “three-way tree” structures in Fig. 4.6. Notice that $-3RT$ on the left edge of
Fig. 4.10 corresponds to only −1.8 kcal/mol of extra stability, and this small difference
produces a significant change in structural compositions. Therefore, a more accurate
experimental assessment of the thermodynamic stability of the GC|CG duplexes may
be important for understanding (CNG) repeat structures.
4.3 References
1. Orland, H., and A. Zee. 2002. RNA folding and large N matrix theory. Nuclear Physics B. 620:456–
476.
2. Pillsbury, M., H. Orland, and A. Zee. 2005. Steepest descent calculation of RNA pseudoknots. Phys.
Rev. E. 72:011911.
3. Reidys, C.M., F.W.D. Huang, J.E. Andersen, R.C. Penner, P.F. Stadler, and M.E. Nebel. 2011.
Topology and prediction of RNA pseudoknots. Bioinformatics. 27:1076–1085.
4. Andersen, J.E., L.O. Chekhov, R.C. Penner, C.M. Reidys, and P. Sułkowski. 2013. Topological
recursion for chord diagrams, RNA complexes, and cells in moduli spaces. Nuclear Physics B.
866:414–443.
5. Waterman, M.S., and T.F. Smith. 1978. RNA secondary structure: a complete mathematical
analysis. Mathematical Biosciences. 42:257–266.
6. Penner, R.C., and M.S. Waterman. 1993. Spaces of RNA Secondary Structures. Advances in
Mathematics. 101:31–49.
7. Waterman, M.S., and T.F. Smith. 1986. Rapid dynamic programming algorithms for RNA secondary
structure. Advances in Applied Mathematics. 7:455–464.
8. Liu, T., and R. Bundschuh. 2004. Analytical description of finite size effects for RNA secondary
structures. Phys. Rev. E. 69:061912.
9. Turner, D.H., and D.H. Mathews. 2010. NNDB: the nearest neighbor parameter database for
predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 38:D280–D282.
10. Zuker, M. 2003. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic
Acids Res. 31:3406–3415.
11. Mathews, D.H., and D.H. Turner. 2002. Experimentally Derived Nearest-Neighbor Parameters for
the Stability of RNA Three- and Four-Way Multibranch Loops. Biochemistry. 41:869–880.
12. Yakovchuk, P., E. Protozanova, and M.D. Frank-Kamenetskii. 2006. Base-stacking and base-pairing
contributions into thermal stability of the DNA double helix. Nucleic acids research. 34:564–574.
13. Sobczak, K., G. Michlewski, M. de Mezer, E. Kierzek, J. Krol, M. Olejniczak, R. Kierzek, and W.J.
Krzyzosiak. 2010. Structural Diversity of Triplet Repeat RNAs. J. Biol. Chem. 285:12755–12764.
14. Mak, C.H., and E.N.H. Phan. 2018. Topological Constraints and Their Conformational Entropic
Penalties on RNA Folds. Biophysical Journal. 114:2059–2071.
15. Phan, E.N.H., and C.H. Mak. 2020. Quantifying Structural Diversity of CNG Trinucleotide Repeats
Using Diagrammatic Algorithms. bioRxiv. 2020.05.30.124636.
16. Tinoco, I., O.C. Uhlenbeck, and M.D. Levine. 1971. Estimation of Secondary Structure in Ribonucleic
Acids. Nature. 230:362–367.
17. Schmitt, W.R., and M.S. Waterman. 1994. Linear trees and RNA secondary structure. Discrete Appl.
Math. 51:317–323.
18. Gan, H.H., D. Fera, J. Zorn, N. Shiffeldrim, M. Tang, U. Laserson, N. Kim, and T. Schlick. 2004.
RAG: RNA-As-Graphs database—concepts, analysis, and features. Bioinformatics. 20:1285–1291.
19. Izzo, J.A., N. Kim, S. Elmetwaly, and T. Schlick. 2011. RAG: An update to the RNA-As-Graphs
resource. BMC Bioinformatics. 12:219.
20. Schlick, T. 2018. Adventures with RNA graphs. Methods. 143:16–33.
21. Bundschuh, R. 2014. Unified approach to partition functions of RNA secondary structures. J. Math.
Biol. 69:1129–1150.
22. Turner, D.H. 1996. Thermodynamics of base pairing. Curr. Opin. Struct. Biol. 6:299–304.
23. Mattuck, R.D. 1992. A Guide to Feynman Diagrams in the Many-Body Problem: Second Edition.
2nd edition. New York, USA: Dover Publications.
24. Vernizzi, G., and H. Orland. 2015. Random matrix theory and ribonucleic acid (RNA) folding. The
Oxford Handbook of Random Matrix Theory.
25. Vernizzi, G., H. Orland, and A. Zee. 2016. Classification and predictions of RNA pseudoknots based
on topological invariants. Phys. Rev. E. 94:042410.
26. Bon, M., G. Vernizzi, H. Orland, and A. Zee. 2008. Topological Classification of RNA Structures.
Journal of Molecular Biology. 379:900–911.
27. Dyson, F.J. 1949. The S Matrix in Quantum Electrodynamics. Phys. Rev. 75:1736–1755.
28. Serra, M.J., and D.H. Turner. 1995. Predicting thermodynamic properties of RNA. Methods
Enzymol. 259:242–61.
29. Lu, Z.J., D.H. Turner, and D.H. Mathews. 2006. A set of nearest neighbor parameters for predicting
the enthalpy change of RNA secondary structure formation. Nucleic Acids Res. 34:4912–24.
30. Jacobson, H., and W.H. Stockmayer. 1950. Intramolecular Reaction in Polycondensations. I. The
Theory of Linear Systems. J. Chem. Phys. 18:1600–1606.
31. Gradshteĭn, I.S., and D. Zwillinger. 2014. Table of integrals, series, and products. Eighth edition.
San Diego, CA: Academic Press.
32. Mooers, B.H.M., J.S. Logue, and J.A. Berglund. 2005. The structural basis of myotonic dystrophy
from the crystal structure of CUG repeats. PNAS. 102:16626–16631.
33. Kiliszek, A., R. Kierzek, W.J. Krzyzosiak, and W. Rypniewski. 2012. Crystallographic characterization
of CCG repeats. Nucleic Acids Res. 40:8155–8162.
34. Kumar, A., H. Park, P. Fang, R. Parkesh, M. Guo, K.W. Nettles, and M.D. Disney. 2011. Myotonic
Dystrophy Type 1 RNA Crystal Structures Reveal Heterogeneous 1 × 1 Nucleotide UU Internal Loop
Conformations. Biochemistry. 50:9928–9935.
35. Chen, Y., and L. Pollack. 2016. SAXS studies of RNA: structures, dynamics, and interactions with
partners. WIREs RNA. 7:512–526.
36. Bernadó, P., E. Mylonas, M.V. Petoukhov, M. Blackledge, and D.I. Svergun. 2007. Structural
Characterization of Flexible Proteins Using Small-Angle X-ray Scattering. J. Am. Chem. Soc.
129:5656–5664.
37. Kikhney, A.G., and D.I. Svergun. 2015. A practical guide to small angle X-ray scattering (SAXS) of
flexible and intrinsically disordered proteins. FEBS Letters. 589:2570–2577.
38. Burke, J.E., and S.E. Butcher. 2012. Nucleic Acid Structure Characterization by Small Angle X-Ray
Scattering (SAXS). Current Protocols in Nucleic Acid Chemistry. 51:7.18.1-7.18.18.
39. Xia, T., D.H. Mathews, and D.H. Turner. 1999. 6.03 - Thermodynamics of RNA Secondary Structure
Formation. In: Barton SD, K Nakanishi, O Meth-Cohn, editors. Comprehensive Natural Products
Chemistry. Oxford: Pergamon. pp. 21–47.
40. Füchtbauer, A.F., M.S. Wranne, M. Bood, E. Weis, P. Pfeiffer, J.R. Nilsson, A. Dahlén, M. Grøtli, and
L.M. Wilhelmsson. 2019. Interbase FRET in RNA: from A to Z. Nucleic Acids Res. 47:9990–9997.
41. Flajolet, P., and R. Sedgewick. 2009. Analytic Combinatorics. Cambridge University Press.
Chapter 5: Conclusion
In this work, we took a fresh look at the secondary structure of RNA and, in
particular, the structural ensemble of CNG trinucleotide repeats. A more backbone-
centric view of how to interpret the various types of secondary and tertiary structural
motifs encountered in typical RNA folds has been presented. Among the various terms
in the folding free energy, the free energy coming from entropy depression due to the
loss of backbone conformational freedom is the only term that is guaranteed to be
always uphill, and as such, it provides a rigorous lower bound on the magnitudes of all
the other free energy contributors that must act to stabilize the fold. To focus on the
unpaired regions of the structure, a more thorough understanding of the backbone
entropy and how to integrate it into structural analysis was needed. This we addressed
with a proposed diagrammatic scheme to quantify the entropic penalty imposed on the
sugar-phosphate backbone of a folded RNA coming from constraints imposed by the
secondary and/or tertiary contacts needed to stabilize the folded structures.
A simple diagrammatic device was designed to help factor the many secondary- and
tertiary-constraints typically seen in folded RNAs into approximately independent sets,
to allow the separation of the backbone entropy into additive parts. This new approach
generates an interesting and intuitive topological view of RNA structures. We further
show how topological reduction can be carried out for typical secondary and tertiary
structure motifs, and by comparing the results of the reduction against large-scale Monte
Carlo simulations of equilibrium ensembles of different RNA constructs in solution, we
demonstrate the accuracy and the usefulness of the topological perspective. Extensive
data sets and simple recipes are provided in this work to enable any RNA scientist to
easily estimate the magnitude of backbone entropy depression due to the following
common RNA secondary motifs: hairpin loops, multiway junctions, pseudoknots, and
quadruplexes.
We then utilized the established data sets and recipes to study the structural
ensemble of CNG trinucleotide repeats. Our analysis of the graph ensembles of CNG
repeat chains at the oligomer scale lays the foundation for a theoretical model for
analyzing the structural diversity of trinucleotide repeat chains, as well as observations
that are germane to understanding RNA conformational ensembles. By using a graph
factorization method and our data library built from simulations, we were able to group
accessible secondary structures together into subsets represented by graphs and have
calculated metrics for their thermodynamic stability Δ𝐺 (Ξ), as well as the structure
content Δ𝑆 (Ξ) of the graph sub-ensembles. The results show that most structures are
thermodynamically stable with a range of stability which would result in the prevalence
of certain structures over others. The addition of helices incurs a backbone entropy cost
that is offset by an increase in the number of structures accessible to the chain, and it is
the balance between these two factors that determines the thermodynamic favorability
of the structure. The results show that the extent to which the structural diversity of
different classes of diagrams can grow as the total base pairing content increases is
also dictated by the chain length and the stabilization provided by the helices present in
the structure. This manifested as a critical total degree 𝐷 at which some structures
begin to lose diversity and become less favorable while others continue to proliferate.
Altogether, the results show that the structural diversity and propensities for different
structural elements on CNG repeat chains are determined by an interplay between the
length of the full RNA chain, the stabilizing strength of the helices, and the complexity of
the graphs in the ensemble.
Finally, we have formulated a diagrammatic theory to study the conformational
ensembles of (CNG)n RNA sequences to understand the structures of long,
overexpanded CNG microsatellites implicated in TREDs. To understand the structures
of these (CNG) repeat sequences, we performed a series of calculations aimed at
characterizing their equilibrium ensembles. Instead of direct simulation, our calculations
were based on a diagrammatic representation of the partition function of the chain and
the factorization scheme that we had developed at the start of the study. Using
generating function mathematics, the factorization scheme, and diagrammatic re-
summation techniques, we were able to derive a closed-form expression for the partition
function in terms of a renormalized root function. Employing a simple approximation for
this root function, we derived analytical expressions for the partition function and its
corresponding thermodynamic observables. Including hairpins, 2- and 3-way junctions,
helices and quadruplexes in the root function, the partition function captures an infinite
set of conformations with any number and any combination of these structural elements.
Together with simulation data from our self-consistent library of entropic costs obtained
in the first part of the study for the various graph elements, as well as experimentally
derived free energies for the helices and quadruplexes, we solved the resulting
equations to arrive at numerical estimates for the ensemble expectation values of the
number of structural features on the chain, including bridges, hairpin loops, 1-, 2- and 3-
way junctions and quadruplexes. This enabled us to quantitatively characterize the
structural diversity of different (CNG)n ensembles.
While most studies in the field have implicitly assumed that the ensemble of a
(CNG)n sequence is dominated by a single structure having the maximal number of
paired bases forming duplexes interposed by 2-way junctions between them, the results
of this study suggest otherwise (1–5). Our study shows that the structural ensembles of
(CNG)n repeat sequences with n ~ 60 are surprisingly diverse. The equilibrium numbers of
duplexes, hairpins, junctions, bridges, and quadruplexes on these sequences indicate
that their secondary structure contents are far from the expected maximally paired
conformation. To the contrary, the ensemble is dominated by a mixture of open and
compact structures. We have mapped out the resulting structures as a function of the
ability of the N nucleotide (N = A, C, G or U) in (CNG) repeats to make noncanonical
pairs, as well as their ability to sustain stable quadruplexes. The “phase diagram” that
emerges shows a diversity of different structures across this parameter space,
demonstrating that ensembles of (CNG) repeat sequences can potentially contain many
alternate conformations. The results show how perturbations in the form of biases on
the stabilities of the various structural motifs (duplexes, junctions, hairpins, and
quadruplexes) could affect the secondary structures of the chains in either direction,
and how these structures may switch when they are perturbed, e.g., when they interact
with or bind other molecules. This may in turn have implications for how these (CNG)n
sequences could acquire unintended functions in the cell, leading to their cytotoxicity.
References
1. Mirkin, S.M. 2006. DNA structures, repeat expansions and human hereditary disorders. Curr Opin
Struct Biol. 16:351–8.
2. Orr, H.T., and H.Y. Zoghbi. 2007. Trinucleotide Repeat Disorders. Annu. Rev. Neurosci. 30:575–621.
3. Sobczak, K., G. Michlewski, M. de Mezer, E. Kierzek, J. Krol, M. Olejniczak, R. Kierzek, and W.J.
Krzyzosiak. 2010. Structural Diversity of Triplet Repeat RNAs. J. Biol. Chem. 285:12755–12764.
4. Kiliszek, A., R. Kierzek, W.J. Krzyzosiak, and W. Rypniewski. 2012. Crystallographic characterization
of CCG repeats. Nucleic Acids Res. 40:8155–8162.
5. Kiliszek, A., R. Kierzek, W.J. Krzyzosiak, and W. Rypniewski. 2011. Crystal structures of CGG RNA
repeats with implications for fragile X-associated tremor ataxia syndrome. Nucleic Acids Res.
39:7308–7315.
Bibliography
Aalberts, D. P., and N. Nandagopal. 2010. A two-length-scale polymer theory for RNA loop
free energies and helix stacking. RNA 16:1350-1355.
Aalberts, D.P., and N.O. Hodas. 2005. Asymmetry in RNA pseudoknots: observation and
theory. Nucleic Acids Res. 33:2210–2214.
Abrescia, N.G.A., A. Thompson, T. Huynh-Dinh, and J.A. Subirana. 2002. Crystal structure
of an antiparallel DNA fragment with Hoogsteen base pairing. PNAS. 99:2806–2811.
Andersen, J.E., L.O. Chekhov, R.C. Penner, C.M. Reidys, and P. Sułkowski. 2013.
Topological recursion for chord diagrams, RNA complexes, and cells in moduli spaces.
Nucl. Phys. B. 866:414–443.
Arnold, B. H. 2011. Intuitive concepts in elementary topology. Dover Publications, Mineola,
N.Y.
Balakrishnan, R., and K. Ranganathan. 2012. A textbook of graph theory. Springer
Science & Business Media, New York, NY.
Batey, R. T., S. D. Gilbert, and R. K. Montange. 2004. Structure of a natural guanine-
responsive riboswitch complexed with the metabolite hypoxanthine. Nature 432:411-415.
Berman, H. M., W. K. Olson, D. L. Beveridge, J. Westbrook, A. Gelbin, T. Demeny, S.-H.
Hsieh, A. Srinivasan, and B. Schneider. 1992. The nucleic acid database. A
comprehensive relational database of three-dimensional structures of nucleic acids.
Biophys. J. 63:751-759.
Bernadó, P., E. Mylonas, M.V. Petoukhov, M. Blackledge, and D.I. Svergun. 2007.
Structural Characterization of Flexible Proteins Using Small-Angle X-ray Scattering. J. Am.
Chem. Soc. 129:5656–5664.
Bon, M., G. Vernizzi, H. Orland, and A. Zee. 2008. Topological Classification of RNA
Structures. J. Mol. Biol. 379:900–911.
Broda, M., E. Kierzek, Z. Gdaniec, T. Kulinski, and R. Kierzek. 2005. Thermodynamic
Stability of RNA Structures Formed by CNG Trinucleotide Repeats. Implication for
Prediction of RNA Structure. Biochemistry. 44:10873–10882.
Bundschuh, R. 2014. Unified approach to partition functions of RNA secondary structures.
J. Math. Biol. 69:1129–1150.
Burke, J.E., and S.E. Butcher. 2012. Nucleic Acid Structure Characterization by Small
Angle X-Ray Scattering (SAXS). Curr. Protoc. Nucleic Acid Chem. 51:7.18.1-7.18.18.
Cao, S., and S.-J. Chen. 2005. Predicting RNA folding thermodynamics with a reduced
chain representation model. RNA. 11:1884–1897.
Cao, S., and S.-J. Chen. 2006. Predicting RNA pseudoknot folding thermodynamics.
Nucleic Acids Res. 34:2634–2652.
Cao, S., and S.-J. Chen. 2009. Predicting structures and stabilities for H-type pseudoknots
with interhelix loops. RNA. 15:696–706.
Chandler, D. 1987. Introduction to Modern Statistical Mechanics. Oxford University Press,
New York, NY.
Chen, S.J. 2008. RNA folding: conformational statistics, folding kinetics, and ion
electrostatics. Annu Rev Biophys. 37:197–214.
Chen, Y., and L. Pollack. 2016. SAXS studies of RNA: structures, dynamics, and
interactions with partners. Wiley Interdiscip. Rev.: RNA. 7:512–526.
Coimbatore Narayanan, B., J. Westbrook, S. Ghosh, A. I. Petrov, B. Sweeney, C. L. Zirbel,
N. B. Leontis, and H. M. Berman. 2013. The Nucleic Acid Database: new features and
capabilities. Nucleic Acids Res. 42:D114-D122.
Coonrod, L.A., J.R. Lohman, and J.A. Berglund. 2012. Utilizing the GAAA
Tetraloop/Receptor To Facilitate Crystal Packing and Determination of the Structure of a
CUG RNA Helix. Biochemistry. 51:8330–8337.
De Gennes, P.-G. 1979. Scaling concepts in polymer physics. Cornell University Press,
Ithaca, NY.
Diamond, J.M., D.H. Turner, and D.H. Mathews. 2001. Thermodynamics of three-way
multibranch loops in RNA. Biochemistry. 40:6971–6981.
Ding, F., C. A. Lavender, K. M. Weeks, and N. V. Dokholyan. 2012. Three-dimensional
RNA structure refinement by hydroxyl radical probing. Nat. Methods 9:603.
Ding, F., S. Sharma, P. Chalasani, V. V. Demidov, N. E. Broude, and N. V. Dokholyan.
2008. Ab initio RNA folding by discrete molecular dynamics: From structure prediction to
folding mechanisms. RNA 14:1164-1173.
Draper, D. E., D. Grilley, and A. M. Soto. 2005. Ions and RNA Folding. Annu. Rev.
Biophys. Biomol. Struct. 34:221-243.
Fera, D., N. Kim, N. Shiffeldrim, J. Zorn, U. Laserson, H.H. Gan, and T. Schlick. 2004.
RAG: RNA-As-Graphs web resource. BMC Bioinf. 5:88.
Flajolet, P., and R. Sedgewick. 2009. Analytic Combinatorics. Cambridge University Press.
Flory, P., and M. Volkenstein. 1969. Statistical Mechanics of Chain Molecules. Interscience
Publishers, New York, NY.
Füchtbauer, A.F., M.S. Wranne, M. Bood, E. Weis, P. Pfeiffer, J.R. Nilsson, A. Dahlén, M.
Grøtli, and L.M. Wilhelmsson. 2019. Interbase FRET in RNA: from A to Z. Nucleic Acids
Res. 47:9990–9997.
Gan, H. H., D. Fera, J. Zorn, N. Shiffeldrim, M. Tang, U. Laserson, N. Kim, and T. Schlick.
2004. RAG: RNA-As-Graphs database—concepts, analysis, and features. Bioinformatics
20:1285-1291.
Gan, H.H., S. Pasquali, and T. Schlick. 2003. Exploring the repertoire of RNA secondary
motifs using graph theory; implications for RNA design. Nucleic Acids Res. 31:2926–2943.
Gevertz, J., H.H. Gan, and T. Schlick. 2005. In vitro RNA random pools are not structurally
diverse: A computational analysis. RNA. 11:853–863.
Gilbert, D.E., and J. Feigon. 1999. Multistranded DNA structures. Curr. Opin. Struct. Biol.
9:305–314.
Gradshteĭn, I.S., and D. Zwillinger. 2014. Table of integrals, series, and products. Eighth
edition. San Diego, CA: Academic Press.
Henke, P. S., and C. H. Mak. 2016. An implicit divalent counterion force field for RNA
molecular dynamics. J. Chem. Phys. 144.
Henke, P.S., and C.H. Mak. 2014. Free energy of RNA-counterion interactions in a tight-
binding model computed by a discrete space mapping. J. Chem. Phys. 141:08B612_1.
Hill, T. L. 2013. Statistical mechanics: principles and selected applications. McGraw-Hill,
New York, NY.
Hofacker, I.L. 2003. Vienna RNA secondary structure server. Nucleic Acids Res. 31:3429–
3431.
Hummer, G., S. Garde, A.E. Garcia, A. Pohorille, and L.R. Pratt. 1996. An information
theory model of hydrophobic interactions. Proc. Natl. Acad. Sci. USA. 93:8951–8955.
Isambert, H., and E.D. Siggia. 2000. Modeling RNA folding paths with pseudoknots:
Application to hepatitis delta virus ribozyme. Proc. Natl. Acad. Sci. U.S.A. 97:6515–6520.
Izzo, J.A., N. Kim, S. Elmetwaly, and T. Schlick. 2011. RAG: An update to the RNA-As-
Graphs resource. BMC Bioinf. 12:219.
Jacobson, H., and W.H. Stockmayer. 1950. Intramolecular Reaction in Polycondensations.
I. The Theory of Linear Systems. J. Chem. Phys. 18:1600–1606.
Jain, S., C.S. Bayrak, L. Petingi, and T. Schlick. 2018. Dual Graph Partitioning Highlights a
Small Group of Pseudoknot-Containing RNA Submotifs. Genes. 9:371.
Jain, S., S. Saju, L. Petingi, and T. Schlick. 2019. An extended dual graph library and
partitioning algorithm applicable to pseudoknotted RNA structures. Methods. 162–163:74–
84.
Khristich, A.N., and S.M. Mirkin. 2020. On the wrong DNA track: Molecular mechanisms of
repeat-mediated genome instability. J. Biol. Chem. 295:4134–4170.
Kierzek, R., M.E. Burkard, and D.H. Turner. 1999. Thermodynamics of Single Mismatches
in RNA Duplexes. Biochemistry. 38:14214–14223.
Kikhney, A.G., and D.I. Svergun. 2015. A practical guide to small angle X-ray scattering
(SAXS) of flexible and intrinsically disordered proteins. FEBS Lett. 589:2570–2577.
Kiliszek, A., and W. Rypniewski. 2014. Structural studies of CNG repeats. Nucleic Acids
Res. 42:8189–8199.
Kiliszek, A., R. Kierzek, W.J. Krzyzosiak, and W. Rypniewski. 2011. Crystal structures of
CGG RNA repeats with implications for fragile X-associated tremor ataxia syndrome.
Nucleic Acids Res. 39:7308–7315.
Kiliszek, A., R. Kierzek, W.J. Krzyzosiak, and W. Rypniewski. 2012. Crystallographic
characterization of CCG repeats. Nucleic Acids Res. 40:8155–8162.
Kim, N., C. Laing, S. Elmetwaly, S. Jung, J. Curuksu, and T. Schlick. 2014. Graph-based
sampling for approximating global helical topologies of RNA. Proc. Natl. Acad. Sci. USA.
111:4079–4084.
Kitamura, A., Y. Muto, S. Watanabe, I. Kim, T. Ito, Y. Nishiya, K. Sakamoto, T. Ohtsuki, G.
Kawai, K. Watanabe, K. Hosono, H. Takaku, E. Katoh, T. Yamazaki, T. Inoue, and S.
Yokoyama. 2002. Solution structure of an RNA fragment with the P7/P9.0 region and the
3′-terminal guanosine of the Tetrahymena group I intron. RNA. 8:440–451.
Kucharík, M., I.L. Hofacker, P.F. Stadler, and J. Qin. 2016. Pseudoknots in RNA folding
landscapes. Bioinformatics. 32:187–194.
Kumar, A., H. Park, P. Fang, R. Parkesh, M. Guo, K.W. Nettles, and M.D. Disney. 2011.
Myotonic Dystrophy Type 1 RNA Crystal Structures Reveal Heterogeneous 1 × 1
Nucleotide UU Internal Loop Conformations. Biochemistry. 50:9928–9935.
Laing, C., and T. Schlick. 2011. Computational approaches to RNA structure prediction,
analysis, and design. Curr. Opin. Struct. Biol. 21:306–318.
Le, S.-Y., R. Nussinov, and J.V. Maizel. 1989. Tree graphs of RNA secondary structures
and their comparisons. Comput. Biomed. Res. 22:461–473.
Li, L.-B., and N.M. Bonini. 2010. Roles of trinucleotide-repeat RNA in neurological disease
and degeneration. Trends in Neurosci. 33:292–298.
Liu, L., and S.-J. Chen. 2010. Computing the conformational entropy for RNA folds. J.
Chem. Phys. 132:235104.
Liu, T., and R. Bundschuh. 2004. Analytical description of finite size effects for RNA
secondary structures. Phys. Rev. E. 69:061912.
Lorenz, R., S.H. Bernhart, C. Höner zu Siederdissen, H. Tafer, C. Flamm, P.F. Stadler,
and I.L. Hofacker. 2011. ViennaRNA Package 2.0. Algorithms Mol. Biol. 6:26.
Lu, Z.J., D.H. Turner, and D.H. Mathews. 2006. A set of nearest neighbor parameters for
predicting the enthalpy change of RNA secondary structure formation. Nucleic Acids Res.
34:4912–24.
Mak, C.H. 2008. RNA conformational sampling. I. Single‐nucleotide loop closure. J.
Comput. Chem. 29:926–933.
Mak, C.H. 2015. Atomistic Free Energy Model for Nucleic Acids: Simulations of Single-
Stranded DNA and the Entropy Landscape of RNA Stem–Loop Structures. J. Phys. Chem.
B. 119:14840–14856.
Mak, C.H. 2016. Unraveling Base Stacking Driving Forces in DNA. J. Phys. Chem. B.
120:6010–20.
Mak, C.H., and E.N.H. Phan. 2018. Topological Constraints and Their Conformational
Entropic Penalties on RNA Folds. Biophys. J. 114:2059–2071.
Mak, C.H., and E.N.H. Phan. 2020. Diagrammatic Theory of RNA Structures and
Ensembles with Trinucleotide Repeats. bioRxiv. 2020.05.30.125641.
Mak, C.H., and P.S. Henke. 2012. Ions and RNAs: free energies of counterion-mediated
RNA fold stabilities. J. Chem. Theory Comput. 9:621–639.
Mak, C.H., L.L. Sani, and A.N. Villa. 2015. Residual Conformational Entropies on the
Sugar–Phosphate Backbone of Nucleic Acids: An Analysis of the Nucleosome Core DNA
and the Ribosome. J. Phys. Chem. B. 119:10434–10447.
Mak, C.H., T. Matossian, and W.-Y. Chung. 2014. Conformational entropy of the RNA
phosphate backbone and its contribution to the folding free energy. Biophys. J. 106:1497–
1507.
Mak, C.H., W.-Y. Chung, and N.D. Markovskiy. 2011. RNA conformational sampling: II.
Arbitrary length multinucleotide loop closure. J. Chem. Theory Comput. 7:1198–1207.
Manzourolajdad, A., and J. Arnold. 2015. Secondary structural entropy in RNA switch
(Riboswitch) identification. BMC Bioinf. 16:133.
Mathews, D.H., and D.H. Turner. 2006. Prediction of RNA secondary structure by free
energy minimization. Curr. Opin. Struct. Biol. 16:270–278.
Mathews, D.H., J. Sabina, M. Zuker, and D.H. Turner. 1999. Expanded sequence
dependence of thermodynamic parameters improves prediction of RNA secondary
structure. J. Mol. Biol. 288:911–940.
Matsugami, A., K. Ouhashi, T. Ikeda, T. Okuizumi, H. Sotoya, S. Uesugi, and M. Katahira.
2002. Unique quadruplex structures of d(GGA)4 (12-mer) and d(GGA)8 (24-mer)—Direct
evidence of the formation of non-canonical base pairs and structural comparison—.
Nucleic Acids Symp. Ser. (Oxf). 2:49–50.
Matsugami, A., T. Okuizumi, S. Uesugi, and M. Katahira. 2003. Intramolecular Higher
Order Packing of Parallel Quadruplexes Comprising a G:G:G:G Tetrad and a
G(:A):G(:A):G(:A):G Heptad of GGA Triplet Repeat DNA. J. Biol. Chem. 278:28147–28153.
Mattuck, R.D. 1992. A Guide to Feynman Diagrams in the Many-Body Problem: Second
Edition. 2nd edition. New York, USA: Dover Publications.
Miller, J.W., C.R. Urbinati, P. Teng-umnuay, M.G. Stenberg, B.J. Byrne, C.A. Thornton,
and M.S. Swanson. 2000. Recruitment of human muscleblind proteins to (CUG)n
expansions associated with myotonic dystrophy. EMBO J. 19:4439–4448.
Mirkin, S.M. 2004. Molecular Models for Repeat Expansions. Chemtracts: Biochem. Mol.
Biol. 24.
Mirkin, S.M. 2006. DNA structures, repeat expansions and human hereditary disorders.
Curr. Opin. Struct. Biol. 16:351–8.
Mirkin, S.M. 2007. Expandable DNA repeats and human disease. Nature. 447:932–40.
Montange, R.K., and R.T. Batey. 2008. Riboswitches: emerging themes in RNA structure
and function. Annu. Rev. Biophys. 37:117–133.
Mooers, B.H.M., J.S. Logue, and J.A. Berglund. 2005. The structural basis of myotonic
dystrophy from the crystal structure of CUG repeats. Proc. Natl. Acad. Sci. USA. 102:16626–16631.
Napierala, M., and W.J. Krzyzosiak. 1997. CUG Repeats Present in Myotonin Kinase RNA
Form Metastable “Slippery” Hairpins. J. Biol. Chem. 272:31079–31085.
Nelson, D.L., H.T. Orr, and S.T. Warren. 2013. The Unstable Repeats—Three Evolving
Faces of Neurological Disease. Neuron. 77:825–843.
Neueder, A. 2019. RNA-Mediated Disease Mechanisms in Neurodegenerative Disorders.
J. Mol. Biol. 431:1780–1791.
Olson, W.K., M. Bansal, S.K. Burley, R.E. Dickerson, M. Gerstein, S.C. Harvey, U.
Heinemann, X.J. Lu, S. Neidle, Z. Shakked, H. Sklenar, M. Suzuki, C.S. Tung, E. Westhof,
C. Wolberger, and H.M. Berman. 2001. A standard reference frame for the description of
nucleic acid base-pair geometry. J. Mol. Biol. 313:229–237.
Orland, H., and A. Zee. 2002. RNA folding and large N matrix theory. Nucl. Phys. B.
620:456–476.
Orr, H.T., and H.Y. Zoghbi. 2007. Trinucleotide Repeat Disorders. Annu. Rev. Neurosci.
30:575–621.
Parkinson, G.N., M.P.H. Lee, and S. Neidle. 2002. Crystal structure of parallel
quadruplexes from human telomeric DNA. Nature. 417:876–880.
Penner, R.C., and M.S. Waterman. 1993. Spaces of RNA Secondary Structures. Adv.
Math. 101:31–49.
Phan, E.N.H., and C.H. Mak. 2020. Quantifying Structural Diversity of CNG Trinucleotide
Repeats Using Diagrammatic Algorithms. bioRxiv. 2020.05.30.124636.
Pous, J., L. Urpí, J.A. Subirana, C. Gouyette, J. Navaza, and J.L. Campos. 2008.
Stabilization by Extra-Helical Thymines of a DNA Duplex with Hoogsteen Base Pairs. J.
Am. Chem. Soc. 130:6755–6760.
Qawasmi, L., M. Braun, I. Guberman, E. Cohen, L. Naddaf, A. Mellul, O. Matilainen, N.
Roitenberg, D. Share, D. Stupp, H. Chahine, E. Cohen, S.M.D.A. Garcia, and Y. Tabach.
2019. Expanded CUG Repeats Trigger Disease Phenotype and Expression Changes
through the RNAi Machinery in C. elegans. J. Mol. Biol. 431:1711–1728.
Ranum, L.P., and T.A. Cooper. 2006. RNA-mediated neuromuscular disorders. Annu. Rev.
Neurosci. 29:259–77.
Richmond, B., and A. Knopfmacher. 1995. Compositions with distinct parts. Aeq. Math.
49:86–97.
Rivas, E., and S.R. Eddy. 1999. A dynamic programming algorithm for RNA structure
prediction including pseudoknots. J. Mol. Biol. 285:2053–2068.
Rødland, E.A. 2006. Pseudoknots in RNA Secondary Structures: Representation,
Enumeration, and Prevalence. J. Comput. Biol. 13:1197–1213.
Roth, A., and R.R. Breaker. 2009. The structural and functional diversity of metabolite-
binding riboswitches. Annu. Rev. Biochem. 78:305–334.
Rury, A.S., C. Ferry, J.R. Hunt, M. Lee, D. Mondal, S.M.O. O’Connell, E.N.H. Phan, Z.
Peng, P. Pokhilko, D. Sylvinson, Y. Zhou, and C.H. Mak. 2016. Solvent Thermodynamic
Driving Force Controls Stacking Interactions between Polyaromatics. J. Phys. Chem. C.
120:23858–23869.
Schlick, T. 2018. Adventures with RNA graphs. Methods. 143:16–33.
Schmitt, W.R., and M.S. Waterman. 1994. Linear trees and RNA secondary structure.
Discret. Appl. Math. 51:317–323.
Serra, M.J., and D.H. Turner. 1995. Predicting thermodynamic properties of RNA. Method
Enzymol. 259:242–61.
Shankar, N., S.D. Kennedy, G. Chen, T.R. Krugh, and D.H. Turner. 2006. The NMR
Structure of an Internal Loop from 23S Ribosomal RNA Differs from Its Structure in
Crystals of 50S Ribosomal Subunits. Biochemistry. 45:11776–11789.
Shapiro, B.A., and K. Zhang. 1990. Comparing multiple RNA secondary structures using
tree comparisons. Bioinformatics. 6:309–318.
Sobczak, K., G. Michlewski, M. de Mezer, E. Kierzek, J. Krol, M. Olejniczak, R. Kierzek,
and W.J. Krzyzosiak. 2010. Structural Diversity of Triplet Repeat RNAs. J. Biol. Chem.
285:12755–12764.
Tamjar, J., E. Katorcha, A. Popov, and L. Malinina. 2012. Structural dynamics of double-
helical RNAs composed of CUG/CUG- and CUG/CGG-repeats. J. Biomol. Struct. Dyn.
30:505–523.
Thore, S., M. Leibundgut, and N. Ban. 2006. Structure of the eukaryotic thiamine
pyrophosphate riboswitch with its regulatory ligand. Science. 312:1208–1211.
Timchenko, L.T., N.A. Timchenko, C.T. Caskey, and R. Roberts. 1996. Novel Proteins with
Binding Specificity for DNA CTG Repeats and RNA CUG Repeats: Implications for
Myotonic Dystrophy. Hum. Mol. Genet. 5:115–121.
Turner, D.H. 1996. Thermodynamics of base pairing. Curr. Opin. Struct. Biol. 6:299–304.
Turner, D.H., and D.H. Mathews. 2010. NNDB: the nearest neighbor parameter database
for predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 38:D280–
D282.
Vedula, L.S., J. Jiang, T. Zakharian, D.E. Cane, and D.W. Christianson. 2008. Structural
and mechanistic analysis of trichodiene synthase using site-directed mutagenesis: Probing
the catalytic function of tyrosine-295 and the asparagine-225/serine-229/glutamate-233–
Mg2+B motif. Arch. Biochem. Biophys. 469:184–194.
Vernizzi, G., and H. Orland. 2015. Random matrix theory and ribonucleic acid (RNA)
folding. The Oxford Handbook of Random Matrix Theory.
Vernizzi, G., H. Orland, and A. Zee. 2016. Classification and predictions of RNA
pseudoknots based on topological invariants. Phys. Rev. E. 94:042410.
Walker, R. 1992. Implementing discrete mathematics: combinatorics and graph theory with
Mathematica, Steven Skiena. Pp 334. 1990. ISBN 0-201-50943-1 (Addison-Wesley). The
Math. Gaz. 76:286–288.
Wang, Y., and D.J. Patel. 1993. Solution structure of the human telomeric repeat
d[AG3(T2AG3)3] G-tetraplex. Structure. 1:263–282.
Waterman, M.S., and T.F. Smith. 1978. RNA secondary structure: a complete
mathematical analysis. Math. Biosci. 42:257–266.
Waterman, M.S., and T.F. Smith. 1986. Rapid dynamic programming algorithms for RNA
secondary structure. Adv. Appl. Math. 7:455–464.
Weeks, J.D., D. Chandler, and H.C. Andersen. 1971. Role of repulsive forces in
determining the equilibrium structure of simple liquids. J. Chem. Phys. 54:5237–5247.
Wells, R.D., R. Dere, M.L. Hebert, M. Napierala, and L.S. Son. 2005. Advances in
mechanisms of genetic instability related to hereditary neurological diseases. Nucleic Acids
Res. 33:3785–3798.
Wong, G.C.L., and L. Pollack. 2010. Electrostatics of Strongly Charged Biological
Polymers: Ion-Mediated Interactions and Self-Organization in Nucleic Acids and Proteins.
Annu. Rev. Phys. Chem. 61:171–189.
Woodson, S.A. 2010. Compact intermediates in RNA folding. Annu. Rev. Biophys. 39:61–77.
Xia, T., D.H. Mathews, and D.H. Turner. 1999. Thermodynamics of RNA Secondary
Structure Formation. In: Barton, S.D., K. Nakanishi, and O. Meth-Cohn, editors. Comprehensive
Natural Products Chemistry. Oxford: Pergamon. pp. 21–47.
Zadeh, J.N., C.D. Steenberg, J.S. Bois, B.R. Wolfe, M.B. Pierce, A.R. Khan, R.M.
Dirks, and N.A. Pierce. 2011. NUPACK: Analysis and design of nucleic acid systems. J.
Comput. Chem. 32:170–173.
Zhang, J., M. Lin, R. Chen, W. Wang, and J. Liang. 2008. Discrete state model and
accurate estimation of loop entropy of RNA secondary structures. J. Chem. Phys.
128:03B624.
Zuker, M. 1989. Computer prediction of RNA structure. In: Methods in Enzymol.
Academic Press. 262–288.
Zuker, M. 1989. On finding all suboptimal foldings of an RNA molecule. Science. 244:48–52.
Zuker, M. 2003. Mfold web server for nucleic acid folding and hybridization prediction.
Nucleic Acids Res. 31:3406–3415.
Zuker, M., D.H. Mathews, and D.H. Turner. 1999. Algorithms and Thermodynamics for
RNA Secondary Structure Prediction: A Practical Guide. In: RNA Biochemistry and
Biotechnology. J. Barciszewski and B.F.C. Clark, editors. Springer Netherlands,
Dordrecht. 11–43.
Appendix A: Additional Conformational Cost Data
Figure A.1 Entropic cost plot for the formation of two-way junctions. Two different sets of constraints were
used. The top surface in each view corresponds to the constraints shown in the inset of Fig. 2.3. The lower surface
corresponds to allowing all possible values of the bond and torsion angles; that is, only the N_b–N_b distances
were used. The offset comes from the increase in conformational volume as the constraints are relaxed. The offset
for removing all bond and torsion angle constraints shown here is −2.2 ± 0.3 kcal/mol.
Center loop length b = 0 nt
5’ loop length (nt), a
3’ loop length (nt), c
0 1 2 3 4 5 6 7
0 6.74 (0.17) 6.82 (0.18) 6.95 (0.21) 7.49 (0.37) 7.38 (0.32) 6.95 (0.21) 7.13 (0.25) 6.86 (0.19)
1 6.95 (0.21) 7.01 (0.22) 6.74 (0.17) 7.07 (0.23) 7.13 (0.25) 7.29 (0.29) 7.13 (0.25) 7.49 (0.37)
2 6.70 (0.17) 7.07 (0.23) 7.13 (0.25) 7.49 (0.37) 8.06 (0.76) 7.13 (0.25) 8.48 (Inf) 7.13 (0.25)
3 7.07 (0.23) 7.13 (0.25) 7.81 (0.53) 7.63 (0.43) 7.13 (0.25) 7.07 (0.23) 6.86 (0.19) 7.13 (0.25)
4 6.90 (0.20) 7.01 (0.22) 6.86 (0.19) 7.01 (0.22) 7.49 (0.37) 7.49 (0.37) 7.63 (0.43) 7.20 (0.27)
5 7.13 (0.25) 7.01 (0.22) 7.63 (0.43) 7.38 (0.32) 7.38 (0.32) 7.20 (0.27) 7.49 (0.37) 7.29 (0.29)
6 7.20 (0.27) 6.95 (0.21) 7.63 (0.43) 7.20 (0.27) 7.20 (0.27) 7.07 (0.23) 6.95 (0.21) 7.13 (0.25)
7 7.07 (0.23) 6.90 (0.20) 7.07 (0.23) 6.90 (0.20) 7.38 (0.32) 7.07 (0.23) 7.38 (0.32) 7.49 (0.37)
Center loop length b = 2 nt
5’ loop length (nt), a
3’ loop length (nt), c
0 1 2 3 4 5 6 7
0 6.63 (0.12) 7.47 (0.27) 6.90 (0.16) 7.04 (0.18) 7.47 (0.27) 7.17 (0.20) 7.04 (0.18) 7.33 (0.23)
1 7.17 (0.20) 7.04 (0.18) 7.12 (0.19) 7.55 (0.29) 7.33 (0.23) 7.55 (0.29) 7.17 (0.20) 7.55 (0.29)
2 7.00 (0.17) 7.17 (0.20) 7.17 (0.20) 7.64 (0.32) 7.17 (0.20) 7.39 (0.25) 7.22 (0.21) 7.27 (0.22)
3 7.17 (0.20) 7.12 (0.19) 7.08 (0.18) 7.64 (0.32) 7.22 (0.21) 7.47 (0.27) 7.27 (0.22) 7.22 (0.21)
4 7.39 (0.25) 7.00 (0.17) 7.47 (0.27) 7.64 (0.32) 7.39 (0.25) 7.27 (0.22) 7.39 (0.25) 7.76 (0.37)
5 7.17 (0.20) 7.33 (0.23) 7.47 (0.27) 7.47 (0.27) 7.33 (0.23) 7.39 (0.25) 7.47 (0.27) 7.47 (0.27)
6 7.55 (0.29) 7.55 (0.29) 7.47 (0.27) 7.64 (0.32) 7.22 (0.21) 7.22 (0.21) 7.33 (0.23) 7.17 (0.20)
7 7.33 (0.23) 6.93 (0.16) 7.39 (0.25) 7.76 (0.37) 7.33 (0.23) 7.47 (0.27) 7.12 (0.19) 7.04 (0.18)
Center loop length b = 3 nt
5’ loop length (nt), a
3’ loop length (nt), c
0 1 2 3 4 5 6 7
0 7.09 (0.19) 7.04 (0.18) 7.04 (0.18) 7.36 (0.25) 7.72 (0.37) 7.43 (0.27) 7.13 (0.20) 7.00 (0.18)
1 7.29 (0.23) 7.24 (0.22) 7.18 (0.21) 7.51 (0.29) 7.72 (0.37) 7.51 (0.29) 7.51 (0.29) 7.09 (0.19)
2 7.09 (0.19) 7.36 (0.25) 7.61 (0.32) 7.51 (0.29) 7.43 (0.27) 7.18 (0.21) 7.51 (0.29) 7.04 (0.18)
3 7.43 (0.27) 7.61 (0.32) 7.86 (0.43) 7.36 (0.25) 7.51 (0.29) 7.61 (0.32) 7.72 (0.37) 7.43 (0.27)
4 7.43 (0.27) 7.51 (0.29) 7.61 (0.32) 7.29 (0.23) 7.43 (0.27) 8.04 (0.53) 8.04 (0.53) 7.51 (0.29)
5 8.04 (0.53) 7.13 (0.20) 7.61 (0.32) 7.36 (0.25) 8.04 (0.53) 7.86 (0.43) 7.43 (0.27) 7.51 (0.29)
6 7.09 (0.19) 7.51 (0.29) 7.36 (0.25) 7.13 (0.20) 7.61 (0.32) 7.61 (0.32) 7.43 (0.27) 7.24 (0.22)
7 7.61 (0.32) 7.13 (0.20) 7.09 (0.19) 7.24 (0.22) 7.13 (0.20) 7.43 (0.27) 7.04 (0.18) 7.13 (0.20)
Table A.1 Free energy costs (kcal/mol) of forming a three-way junction for center junction lengths (b) of 0, 2, and 3.
Center loop length b = 4 nt
5’ loop length (nt), a
3’ loop length (nt), c
0 1 2 3 4 5 6 7
0 7.13 (0.17) 7.20 (0.18) 7.40 (0.22) 7.52 (0.25) 7.25 (0.19) 7.77 (0.32) 7.16 (0.18) 7.13 (0.17)
1 7.34 (0.21) 7.29 (0.20) 7.25 (0.19) 7.45 (0.23) 7.77 (0.32) 7.52 (0.25) 7.29 (0.20) 7.25 (0.19)
2 7.34 (0.21) 7.29 (0.20) 7.67 (0.29) 7.67 (0.29) 7.34 (0.21) 7.77 (0.32) 7.40 (0.22) 7.52 (0.25)
3 7.52 (0.25) 7.45 (0.23) 7.59 (0.27) 7.40 (0.22) 7.45 (0.23) 7.45 (0.23) 7.45 (0.23) 7.29 (0.20)
4 7.34 (0.21) 7.34 (0.21) 7.59 (0.27) 7.88 (0.37) 7.52 (0.25) 7.52 (0.25) 8.02 (0.43) 8.20 (0.53)
5 7.16 (0.18) 7.67 (0.29) 7.77 (0.32) 8.20 (0.53) 7.16 (0.18) 7.45 (0.23) 7.77 (0.32) 7.40 (0.22)
6 7.09 (0.17) 7.25 (0.19) 7.52 (0.25) 7.45 (0.23) 7.20 (0.18) 7.40 (0.22) 7.20 (0.18) 7.40 (0.22)
7 7.16 (0.18) 7.59 (0.27) 7.20 (0.18) 7.29 (0.20) 7.20 (0.18) 7.52 (0.25) 7.59 (0.27) 7.59 (0.27)
Center loop length b = 5 nt
5’ loop length (nt), a
3’ loop length (nt), c
0 1 2 3 4 5 6 7
0 7.03 (0.18) 7.39 (0.25) 7.46 (0.27) 7.39 (0.25) 7.64 (0.32) 7.88 (0.43) 7.21 (0.21) 7.46 (0.27)
1 7.46 (0.27) 7.21 (0.21) 7.54 (0.29) 7.26 (0.22) 7.54 (0.29) 7.88 (0.43) 7.54 (0.29) 7.46 (0.27)
2 7.64 (0.32) 7.88 (0.43) 7.75 (0.37) 7.64 (0.32) 7.39 (0.25) 7.39 (0.25) 7.32 (0.23) 8.06 (0.53)
3 7.54 (0.29) 7.88 (0.43) 7.46 (0.27) 7.88 (0.43) 7.54 (0.29) 7.39 (0.25) 7.88 (0.43) 7.64 (0.32)
4 7.46 (0.27) 7.64 (0.32) 7.54 (0.29) 8.31 (0.76) 7.26 (0.22) 7.75 (0.37) 7.54 (0.29) 7.32 (0.23)
5 7.64 (0.32) 7.64 (0.32) 7.64 (0.32) 7.88 (0.43) 7.54 (0.29) 7.88 (0.43) 7.88 (0.43) 7.21 (0.21)
6 7.75 (0.37) 7.54 (0.29) 7.54 (0.29) 7.75 (0.37) 7.26 (0.22) 7.26 (0.22) 7.64 (0.32) 7.46 (0.27)
7 7.16 (0.20) 7.32 (0.23) 7.32 (0.23) 7.26 (0.22) 7.11 (0.19) 8.31 (0.76) 7.46 (0.27) 7.39 (0.25)
Center loop length b = 6 nt
5’ loop length (nt), a
3’ loop length (nt), c
0 1 2 3 4 5 6 7
0 7.79 (0.37) 7.58 (0.29) 7.30 (0.22) 7.79 (0.37) 7.79 (0.37) 7.42 (0.25) 7.42 (0.25) 7.42 (0.25)
1 7.30 (0.22) 7.25 (0.21) 7.67 (0.32) 7.50 (0.27) 7.50 (0.27) 7.67 (0.32) 7.30 (0.22) 7.15 (0.19)
2 7.15 (0.19) 8.10 (0.53) 7.58 (0.29) 8.35 (0.76) 7.67 (0.32) 7.50 (0.27) 7.36 (0.23) 7.30 (0.22)
3 7.79 (0.37) 7.42 (0.25) 8.10 (0.53) 7.67 (0.32) 7.92 (0.43) 7.79 (0.37) 7.79 (0.37) 7.67 (0.32)
4 7.20 (0.20) 7.79 (0.37) 7.50 (0.27) 7.79 (0.37) 7.79 (0.37) 7.58 (0.29) 7.58 (0.29) 7.58 (0.29)
5 7.79 (0.37) 7.42 (0.25) 7.42 (0.25) 7.67 (0.32) 7.92 (0.43) 7.79 (0.37) 7.30 (0.22) 7.79 (0.37)
6 7.36 (0.23) 7.58 (0.29) 7.58 (0.29) 7.79 (0.37) 7.58 (0.29) 7.50 (0.27) 7.58 (0.29) 7.58 (0.29)
7 6.93 (0.16) 7.79 (0.37) 7.58 (0.29) 7.79 (0.37) 7.11 (0.18) 7.36 (0.23) 7.25 (0.21) 7.30 (0.22)
Table A.2 Free energy costs (kcal/mol) of forming a three-way junction for center junction lengths (b) of 4 through 6.
Center loop length b = 7 nt
5’ loop length (nt), a
3’ loop length (nt), c
0 1 2 3 4 5 6 7
0 7.51 (0.29) 7.36 (0.25) 7.72 (0.37) 7.61 (0.32) 7.86 (0.43) 7.43 (0.27) 7.61 (0.32) 7.51 (0.29)
1 7.43 (0.27) 7.61 (0.32) 8.28 (0.76) 7.86 (0.43) 7.61 (0.32) 8.03 (0.53) 7.61 (0.32) 7.36 (0.25)
2 7.61 (0.32) 7.61 (0.32) 7.72 (0.37) 8.03 (0.53) 8.28 (0.76) 7.72 (0.37) 7.51 (0.29) 7.72 (0.37)
3 7.43 (0.27) 7.51 (0.29) 7.61 (0.32) 7.51 (0.29) 7.43 (0.27) 8.03 (0.53) 7.61 (0.32) 7.61 (0.32)
4 8.03 (0.53) 7.18 (0.21) 8.03 (0.53) 7.86 (0.43) 7.61 (0.32) 7.61 (0.32) 7.61 (0.32) 7.61 (0.32)
5 7.86 (0.43) 7.51 (0.29) 7.61 (0.32) 7.51 (0.29) 7.72 (0.37) 7.43 (0.27) 7.23 (0.22) 7.36 (0.25)
6 7.36 (0.25) 8.03 (0.53) 7.61 (0.32) 7.72 (0.37) 7.36 (0.25) 7.51 (0.29) 7.61 (0.32) 8.03 (0.53)
7 7.72 (0.37) 7.04 (0.18) 7.43 (0.27) 7.61 (0.32) 7.43 (0.27) 7.61 (0.32) 7.72 (0.37) 7.18 (0.21)
Center loop length b = 8 nt
5’ loop length (nt), a
3’ loop length (nt), c
0 1 2 3 4 5 6 7
0 7.92 (0.43) 8.10 (0.53) 7.42 (0.25) 7.92 (0.43) 7.78 (0.37) 7.57 (0.29) 7.49 (0.27) 7.78 (0.37)
1 7.57 (0.29) 7.42 (0.25) 7.49 (0.27) 7.92 (0.43) 7.49 (0.27) 7.67 (0.32) 7.35 (0.23) 7.57 (0.29)
2 7.67 (0.32) 7.92 (0.43) 7.92 (0.43) 7.78 (0.37) 7.92 (0.43) 7.49 (0.27) 7.67 (0.32) 7.19 (0.20)
3 7.57 (0.29) 8.10 (0.53) 7.67 (0.32) 8.34 (0.76) 7.57 (0.29) 7.42 (0.25) 7.67 (0.32) 7.29 (0.22)
4 7.49 (0.27) 7.57 (0.29) 7.78 (0.37) 7.35 (0.23) 7.67 (0.32) 8.77 (Inf) 7.78 (0.37) 7.42 (0.25)
5 7.49 (0.27) 7.49 (0.27) 7.57 (0.29) 7.78 (0.37) 7.67 (0.32) 8.34 (0.76) 7.67 (0.32) 7.92 (0.43)
6 7.35 (0.23) 7.78 (0.37) 7.49 (0.27) 7.67 (0.32) 7.78 (0.37) 7.57 (0.29) 7.42 (0.25) 7.78 (0.37)
7 7.42 (0.25) 7.42 (0.25) 7.35 (0.23) 7.03 (0.17) 7.24 (0.21) 7.67 (0.32) 7.67 (0.32) 7.67 (0.32)
Table A.3 Free energy costs (kcal/mol) of forming a three-way junction for center junction lengths (b) of 7 and 8.
Appendix B: H-Type Pseudoknot Closure Maps and Costs
Figure B.1 Map showing the thermodynamic pathways used in calculating the entropic cost to form a pseudoknot
structure with junction lengths a=b=c=1. The entropic cost associated with each step can be found in Table B.1.
Steps leading to symmetric structures in which the choice of 5’ or 3’ for the base pairing event is indistinguishable
are marked with an asterisk and have entropic costs reported for both the 5’ and 3’ base pairing.
Start  End       Cost
0      1a        5.01
0      1b        5.61
1a     2b        5.82
1a     2c        6.00
1a     2d        5.57
1a     2e (5')   5.11
1a     2e (3')   5.21
1b     2a (5')   5.77
1b     2a (3')   5.82
1b     2b        5.26
1b     2c        5.22
1b     3d        4.85
2a     3a        6.03*
2a     3c        6.03*
2b     3b        6.03*
2b     3c        6.75*
2c     3a        6.75*
2c     3d        6.03*
2d     3a        7.46
2d     3b        6.34
2d     3c        7.46
2d     3d        6.38
2e     3b        6.75*
2e     3d        6.75*
3a     4         6.03*
3b     4         6.75*
3c     4         6.03*
3d     4         6.75*
Table B.1 Entropic cost for the stepwise formation of the a=b=c=1 pseudoknot. Cost of each step in the
thermodynamic pathways of Figure B.1. Numbers marked with an asterisk are obtained via closed-path
calculations rather than directly from simulations. All costs are reported in units of kcal/mol.
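As a rough illustration of how the step costs in Table B.1 combine, the sketch below sums the costs along each pathway from the starting structure 0 to the final pseudoknot structure 4. It uses only a subset of the table (the 5'/3' variants, the 2a and 2e branches, and the "1b 3d" entry are left out), and it assumes that step costs simply add along a pathway. The names STEP_COST, pathways, and total_cost are hypothetical and do not come from the thesis.

```python
# Subset of the step costs in Table B.1 (kcal/mol), keyed by (start, end).
# Illustrative only: several rows of the table are omitted on purpose.
STEP_COST = {
    ("0", "1a"): 5.01, ("0", "1b"): 5.61,
    ("1a", "2b"): 5.82, ("1a", "2c"): 6.00, ("1a", "2d"): 5.57,
    ("1b", "2b"): 5.26, ("1b", "2c"): 5.22,
    ("2b", "3b"): 6.03, ("2b", "3c"): 6.75,
    ("2c", "3a"): 6.75, ("2c", "3d"): 6.03,
    ("2d", "3a"): 7.46, ("2d", "3b"): 6.34,
    ("2d", "3c"): 7.46, ("2d", "3d"): 6.38,
    ("3a", "4"): 6.03, ("3b", "4"): 6.75,
    ("3c", "4"): 6.03, ("3d", "4"): 6.75,
}


def pathways(start, end, costs):
    """Enumerate every pathway from start to end over the listed steps."""
    if start == end:
        return [[end]]
    found = []
    for (a, b) in costs:
        if a == start:
            for tail in pathways(b, end, costs):
                found.append([start] + tail)
    return found


def total_cost(path, costs):
    """Sum the step costs along one pathway (additivity is assumed)."""
    return sum(costs[(a, b)] for a, b in zip(path, path[1:]))


all_paths = pathways("0", "4", STEP_COST)
totals = {"->".join(p): total_cost(p, STEP_COST) for p in all_paths}

for name, cost in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{name}: {cost:.2f} kcal/mol")
```

Comparing the totals across pathways gives a quick consistency check on the tabulated steps, since the overall cost of reaching structure 4 should not depend strongly on the route taken.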
Figure B.2 Map showing the thermodynamic pathways used in calculating the entropic cost to form a pseudoknot
structure with junction lengths a=b=1, c=4. The entropic cost associated with each step can be found in Table
B.2.
Start  End   Cost
0      1a    5.02
0      1b    5.62
0      1c    5.85
0      1d    6.20
1a     2b    6.39
1a     2d    5.51
1a     2e    5.94
1b     2a    6.63
1b     2d    4.85
1b     2f    6.18
1c     2c    5.51
1c     2e    5.15
1c     2f    5.86
1d     2a    6.05*
1d     2b    5.21*
1d     2c    4.85
2a     3b    6.03*
2a     3c    6.03*
2b     3a    6.03*
2b     3c    6.75*
2c     3a    -
2c     3b    -
2d     3c    7.71
2d     3d    7.37
2e     3a    6.75*
2e     3d    6.75*
2f     3b    6.75*
2f     3d    6.03*
3a     4     6.75*
3b     4     6.03*
3c     4     6.03*
3d     4     6.75*
Table B.2 Entropic cost for the stepwise formation of the a=b=1, c=4 pseudoknot. Cost of each step in the
thermodynamic pathways of Figure B.2. Numbers marked with an asterisk are obtained via closed-path
calculations rather than directly from simulations. Steps for which neither simulation data nor sufficient data for
closed-path calculations were available are marked with a dash. All costs are reported in units of kcal/mol.
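The closed-path values can be illustrated with a simple thermodynamic cycle. The sketch below assumes that the cost of reaching a given structure does not depend on the route, so an unsimulated step can be inferred from two simulated steps that reach the same structure through a different intermediate. With the simulated values from Table B.2 this reproduces the starred entries for 1d → 2a and 1d → 2b; whether it matches the exact procedure used for every starred entry is an assumption, and the function name closed_path_cost is hypothetical.

```python
# Directly simulated step costs taken from Table B.2 (kcal/mol).
SIMULATED = {
    ("0", "1a"): 5.02,
    ("0", "1b"): 5.62,
    ("0", "1d"): 6.20,
    ("1a", "2b"): 6.39,
    ("1b", "2a"): 6.63,
}


def closed_path_cost(start, known_mid, missing_mid, end, costs):
    """Infer the cost of missing_mid -> end by closing a cycle.

    Assumes path independence:
        cost(start -> known_mid) + cost(known_mid -> end)
      = cost(start -> missing_mid) + cost(missing_mid -> end)
    """
    known_route = costs[(start, known_mid)] + costs[(known_mid, end)]
    return known_route - costs[(start, missing_mid)]


# 1d -> 2a inferred from the route 0 -> 1b -> 2a.
cost_1d_2a = closed_path_cost("0", "1b", "1d", "2a", SIMULATED)
# 1d -> 2b inferred from the route 0 -> 1a -> 2b.
cost_1d_2b = closed_path_cost("0", "1a", "1d", "2b", SIMULATED)

print(f"1d -> 2a: {cost_1d_2a:.2f} kcal/mol")
print(f"1d -> 2b: {cost_1d_2b:.2f} kcal/mol")
```

Both inferred values agree with the corresponding starred entries in Table B.2 (6.05 and 5.21 kcal/mol).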
Figure B.3 Map showing the thermodynamic pathways used in calculating the entropic cost to form a pseudoknot
structure with junction lengths a=4, b=c=1. The entropic cost associated with each step can be found in Table
B.3.
Start  End   Cost
0      1a    5.02
0      1b    5.62
0      1c    5.85
0      1d    6.20
1a     2e    6.39
1a     2a    5.98
1a     2d    5.51
1b     2b    6.08
1b     2d    4.85
1b     2f    6.51
1c     2a    5.20
1c     2b    5.90
1c     2f    5.51
1d     2a    -
1d     2b    -
1d     2c    -
2a     3a    6.75*
2a     3b    6.75*
2b     3a    6.03*
2b     3d    6.75*
2c     3a    -
2c     3b    -
2d     3a    6.86
2d     3c    8.14
2e     3b    6.03*
2e     3c    6.03*
2f     3c    6.75*
2f     3d    6.03*
3a     4     6.75*
3b     4     6.75*
3c     4     6.03*
3d     4     6.03*
Table B.3 Entropic cost for the stepwise formation of the a=4, b=c=1 pseudoknot. Cost of each step in the
thermodynamic pathways of Figure B.3. Numbers marked with an asterisk are obtained via closed-path
calculations rather than directly from simulations. Steps for which neither simulation data nor sufficient data for
closed-path calculations were available are marked with a dash. All costs are reported in units of kcal/mol.
Figure B.4 Map showing the thermodynamic pathways used in calculating the entropic cost to form a pseudoknot
structure with junction lengths a=c=4, b=1. The entropic cost associated with each step can be found in Table
B.4. Steps leading to symmetric structures in which the choice of 5’ or 3’ for the base pairing event is
indistinguishable are marked with an asterisk and have entropic costs reported for both the 5’ and 3’ base pairing.
Note that structures 1b and 2c have few available values and were omitted from the pathway analysis.
Consequently, 2e is also not included because it is accessible only from 1b. These structures have been included
here for the sake of completeness.
Start  End       Cost
0      1a        5.85
0      1b        6.20
1a     2a (5')   5.97
1a     2a (3')   5.90
1a     2b        6.38
1a     2c        -
1a     2d        6.26
1b     2b        -
1b     2c        -
1b     2d        -
1b     2e        -
2a     3b        6.75*
2a     3c        6.75*
2b     3b        6.03*
2b     3d        6.75*
2c     3a        -
2c     3b        -
2c     3c        -
2c     3d        -
2d     3a        6.75*
2d     3c        6.03*
3a     4         6.03*
3b     4         6.75*
3c     4         6.75*
3d     4         6.03*
Table B.4 Entropic cost for the stepwise formation of the a=c=4, b=1 pseudoknot. Cost of each step in the
thermodynamic pathways of Figure B.4. Numbers marked with an asterisk are obtained via closed-path
calculations rather than directly from simulations. Steps for which neither simulation data nor sufficient data for
closed-path calculations were available are marked with a dash. Steps leading out of structure 2e have been omitted from the table because
no simulation data are available either for its formation from a prior structure or for the subsequent steps.
All costs are reported in units of kcal/mol.
Abstract
Functional ribonucleic acid (RNA) chains can fold into many complex structures in biological settings using a collection of secondary and tertiary structural motifs. These structures are often necessary for the proper function of the RNA, and a misfold can produce undesired outcomes. Of particular interest to us are instances of RNA gain-of-function, which have been implicated in the pathogenesis of trinucleotide repeat expansion disorders in which the expanded microsatellites are found in intronic and untranslated regions. The expanded microsatellite yields an expanded RNA chain, which then gains unintended function through newly accessible folded structures. Understanding how factors such as chain length determine the final folded structure is therefore important for understanding how expanded trinucleotide repeats give rise to disease. In this study, we focused on the influence that entropy, particularly that of the sugar-phosphate backbone, has in determining the final folded structure of RNA chains.
We began in Chapter 2 by drawing from the field of topology and the existing body of work on RNA graphs to design a diagrammatic scheme that represents the different types of RNA structures and the constraints they impose upon the sugar-phosphate backbone. Applying the diagrammatic scheme to a folded RNA structure factorizes the fold by grouping its constraints into mutually independent subsets, which enables the total conformational entropy penalty of the fold to be calculated as a sum of independent terms. We then simulated large ensembles of single-stranded RNA sequences in solution using high-throughput Monte Carlo simulations to validate the underlying assumptions of our diagrammatic scheme, examining the entropic costs for the initiation of two major secondary structural motifs: hairpins and multiway junctions. Further simulations of higher-complexity constraints such as pseudoknots and quadruplexes yielded additional insight into the distinct topological classes of secondary and tertiary structures, the interactions between multiple constraints on RNA structures, and how some functional RNA sequences may operate by transforming between different topological classes.
With the foundations laid for our diagrammatic factorization, we applied our methodology in Chapters 3 and 4 to (CNG)n trinucleotide repeats (TNRs), which are transcripts of unstable microsatellites whose spontaneous expansions have been linked to genetic diseases. These so-called trinucleotide repeat expansion disorders (TREDs) exhibit complex mechanisms of pathogenesis, some of which are attributed to a potential RNA transcript gain-of-function. We therefore investigated the structures to which these expanded transcripts have access and the diversity of their conformational ensembles. In Chapter 3, we simulated and cataloged the secondary structures of NG-(CNG)16-CN and NG-(CNG)50-CN oligomers, sorted them into sub-ensembles based on their defining characteristics, and quantified the structural diversity and thermodynamic stability of these ensembles. Our findings showed that the generally assumed structure for these repeats (a series of alternating short Watson-Crick helices and two-way junctions capped by a hairpin), although it maximizes the number of base-pairing contacts, may not be the most thermodynamically favorable, and that the structural ensembles are characterized by largely open conformations. Furthermore, our data show that the diversity of the ensembles has a non-negligible length dependence, suggesting that further, more generalized study is needed, as TREDs are associated with expansions of more than 60 to 100 repeat units.
To generalize the analysis, in Chapter 4 we introduced another diagrammatic method that can be used to analyze the structural diversity of an arbitrary (CNG)n sequence. By representing the structural elements of the chain’s conformation as a set of graphs and employing elementary diagrammatic methods often seen in physics, we formulated a renormalization procedure to re-sum these graphs and produce a closed-form expression for the ensemble partition function of the chain. With a simple approximation for the renormalization, this theory can be applied to extended (CNG)n sequences to comprehensively capture an arbitrarily large set of conformations containing any number and combination of duplexes, hairpins, multiway junctions, H-type pseudoknots, and quadruplexes. We then numerically solved the analytical equations obtained from the renormalization theory to obtain equilibrium estimates of the secondary structural content of each chain and used them to study the structural ensembles of (CNG)n repeats with large n (n ~ 60). Our findings suggest that, as with the more restrictive analysis in Chapter 3, the ensemble is surprisingly diverse. Furthermore, they show that the distribution is sensitive to the identity of the N nucleotide. Because the N nucleotide can participate in non-canonical pairs and determines whether the (CNG)n sequence in question can sustain stable quadruplexes, different choices of N bias the stabilities of different motifs and affect both the secondary structures of the chain and how they may undergo structural switches when perturbed.
Asset Metadata
Creator: Phan, Ethan Nhat-Huy (author)
Core Title: A diagrammatic analysis of the secondary structural ensemble of CNG trinucleotide repeat
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Chemistry
Degree Conferral Date: 2022-08
Publication Date: 07/15/2022
Defense Date: 06/21/2022
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: conformational entropy, diagrams, graph, graph renormalization, graph theory, Monte Carlo simulations, nucleic acid, OAI-PMH Harvest, RNA, secondary structures, structural diversity, structure, thermodynamics, trinucleotide repeats
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Mak, Chi (committee chair), Goodman, Myron (committee member), Prezhdo, Oleg (committee member)
Creator Email: ethanpha@usc.edu, huytruc1993@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC111371448
Unique identifier: UC111371448
Legacy Identifier: etd-PhanEthanN-10837
Document Type: Thesis
Rights: Phan, Ethan Nhat-Huy
Type: texts
Source: 20220715-usctheses-batch-953 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu