Analysis and algorithms for distinguishable RNA secondary structures
(USC Thesis Other)
ANALYSIS AND ALGORITHMS FOR DISTINGUISHABLE RNA SECONDARY STRUCTURES

by
Masaru Nakajima

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(PHYSICS)

May 2023

Copyright 2023 Masaru Nakajima

Dedication

I dedicate this work to my wife, Astrid, who has supported me through the journey to complete this program.

Table of Contents

Dedication
List of Figures
Abstract
Chapter 1: Introduction
    1.1 Novel contributions
    1.2 Overview and organization
Chapter 2: Background
    2.1 Secondary structures
    2.2 Formalization for secondary structures
    2.3 Free energy models
        2.3.1 Base-pair models
        2.3.2 Loop models
    2.4 Computational model
    2.5 Minimum free energy problem
    2.6 Partition function problem
Chapter 3: Counting Secondary Structures
    3.1 Counting for linear sequences
    3.2 Counting for circular sequences
    3.3 Interacting sequences
        3.3.1 Generalized sequence
        3.3.2 Secondary structures for generalized sequences
        3.3.3 Counting problem
Chapter 4: Symmetry and Distinguishability
    4.1 Unifying the types of sequence
    4.2 Symmetry of generalized sequence
    4.3 Indistinguishable structures
    4.4 Counting distinguishable structures
Chapter 5: Characterization of Periodic Subsets
    5.1 Defining periodic subsets
    5.2 Pair orbits
    5.3 Conditions for pair orbits
Chapter 6: Central Orbit Algorithm
    6.1 Central orbit
    6.2 Circular sequences
    6.3 Multi sequence
Chapter 7: Reduced Structure Algorithm
    7.1 Defining reduced structures
    7.2 Counting reduced structures for circular sequences
    7.3 Counting reduced structures for multi sequences
    7.4 Reduced structure algorithm for distinguishable count
Chapter 8: Applications of Developed Method
    8.1 Unlabeled circular sequences
    8.2 Difference between the two algorithms
        8.2.1 Symmetric count problem
        8.2.2 Design problem
    8.3 Pair count
        8.3.1 Pair count problem
        8.3.2 Distinguishable pair count
    8.4 Distinguishable count with pseudoknots
        8.4.1 Distinguishable count
Chapter 9: Conclusion
Bibliography
Appendices
Chapter A: Proofs
    A.1 Proof of Theorem 3
    A.2 Proof of Theorem 4
Chapter B: Counting structures for two interacting sequences

List of Figures

2.1 Example structures. The sequence of length $n$ is drawn in a circular fashion with 0 as the 0th site. A solid red line represents a pair. (A) The structure without any pair is valid. (B) An example of a valid structure with pairs. (C) An example of an invalid structure that violates condition L2. (D) An example of an invalid structure that violates condition L3. Note that two pairs inducing a pseudoknot result in crossing lines.

2.2 Types of loop.

3.1 Graphical representations of Equations (3.1) and (3.2).

3.2 Graphical representations of generalized sequences. The sequences are directed such that the 5'-3' direction is clockwise. (A) A generalized sequence with 5 constituent sequences. (B) A generalized sequence with 4 identical sequences (symmetry 4). (C) A generalized sequence of symmetry 4 (sequences of the same color are identical). (D) A generalized sequence of symmetry 3.

3.3 Example structures for a generalized sequence consisting of 5 constituent sequences ($m = 5$), drawn in a circular fashion. A solid red line represents a base pair. (A) An example of a valid structure. (B) An example of an invalid structure which violates condition L4 for nicks $d_3$ and $d_4$. (C) An example of an invalid structure which violates condition L4 for nicks $d_0$ and $d_2$.

3.4 Graphical representations of Equations (3.4), (3.5), and (3.6). As for $R^{e}_{ij}$, we assume the third case.

4.1 Rotation by 3 bases of (A) a linear sequence, (B) a circular sequence, and (C) a multi sequence. Each colored circle represents a particular base: red → A, blue → C, and green → U.

4.2 Example orbits of different sizes for a generalized sequence $\psi = (\phi, \emptyset)$ (a circular sequence) of symmetry 4. (A) Orbit size of 4 and symmetry 1. (B) Orbit size of 2 and symmetry 2. (C) Orbit size of 1 and symmetry 4.

5.1 Example structures with pair orbits for a circular sequence of symmetry 4. The pairs in each pair orbit are emphasized. (A) Pair orbits in a 0-periodic structure each consist of one pair. (B) Pair orbits in a 2-periodic structure each consist of two pairs. (C) Pair orbits in a 1-periodic structure each consist of 4 pairs.

5.2 Example structures for a circular sequence of symmetry 4. The pairs in internal pair orbits are drawn in red, and those in external pair orbits are drawn in blue. (A) For a 2-periodic structure, $r = 2n/4 = n/2$. (B) For a 1-periodic structure, $r = n/4$. The pairs in external orbits cross the dotted lines dividing the sequence according to the periodicity of the secondary structure.

5.3 Consequences of violating the conditions for reduced structures. (A) Violation of P3 between two internal orbits. (B) Violation of P3 between internal and external orbits. (C) Violation of P3 between two external orbits. (D) Violation of P4 between two external orbits. (E) Violation of P5 between an internal and an external orbit. The corresponding secondary structures all violate condition L3.

5.4 Consequences of violating conditions P6-P8. (A) Violation of P6. (B) Violation of P7. (C) Violation of P8. In all cases, the resulting structures violate condition L4.

6.1 Partitioning of periodic structures based on the central orbit. We assume that the circular sequence has symmetry 4. (A) 2-periodic and (B) 1-periodic structures. In both cases, the central orbit is the external orbit $(i,j)_{\mathrm{ex}}$. The red and blue regions show the two types of substructures.

7.1 Correspondence between $b$-periodic structures and reduced structures. The generalized sequences considered here all have symmetry 4. The internal orbits and internal reduced pairs are in red, while the external orbits and external reduced pairs are in blue. (A) A 2-periodic structure for a circular sequence and a corresponding reduced structure. (B) A 1-periodic structure for a circular sequence and a corresponding reduced structure. (C) A 2-periodic structure for a multi sequence and a corresponding reduced structure. (D) A 1-periodic structure for a multi sequence and a corresponding reduced structure.

7.2 Graphical representation of Equations (7.3), (7.4), and (7.5).

7.3 Graphical representation of Equations (7.7), (7.8), and (7.9). As for $H^{e}_{ij}$, the third case is assumed.

8.1 Process of solving Problem 2 using the central orbit algorithm.

8.2 Example of a structure containing pseudoknots. (A) Representation used by Dirks et al. [9]. (B) Representation used by Chitsatz et al. [5].

8.3 Example structures for two interacting sequences. (A) Arc $a'$ covers bond $b$ while $a$ does not. (B) Arcs $a$ and $a'$ are interacting since they both cover at least one common bond. Moreover, they subtend each other. (C) Arc $a$ subtends arc $a'$, but not the other way around. (D) Arc $a'$ subtends arcs $a$ and $a'$. (E) Arcs $a$ and $a'$ are interacting, but neither subtends the other, violating condition M4.

8.4 Cases for 1-periodic structures containing an arc $\{i,j\}$ interacting with the corresponding arc $\{i',j'\}$.

8.5 Recursion relation for $G_{ij}$. Here $i' = i$, $j' = j$, $k' = k$, and $l' = l$.

8.6 1-periodic structures which do not contain an arc interacting with its corresponding arc.

B.1 The quantities defined for computing the structures for two interacting sequences. The quantity $C_{iji'j'}$ is defined as the number of structures for two substrings $\phi[i..j-1]$ and $\phi'[i'..j'-1]$. The other quantities are defined analogously with additional conditions indicated by the line color. Blue lines indicate arcs and red lines indicate interacting bonds.
A superscript containing $a$ means that either $i$ or $j'-1$ is part of an arc or interacting bond (4 cases). A superscript containing $b$ indicates that both $i$ and $j'-1$ are part of an arc or interacting bond (4 cases). The quantity $C^{b}_{iji'j'}$ is the sum over all 4 cases. The superscripts containing $e$ have the same conditions as those containing $b$, with the additional condition that all arcs are covered by the interacting bond containing either $i$ or $j'-1$.

B.2 The recursion relations for $C_{iji'j'}$, $C^{a1}_{iji'j'}$, $C^{a2}_{iji'j'}$, $C^{a3}_{iji'j'}$, $C^{a4}_{iji'j'}$, and $C^{b}_{iji'j'}$.

B.3 The recursion relations for $C^{b1}_{iji'j'}$, $C^{b2}_{iji'j'}$, $C^{b3}_{iji'j'}$, $C^{b4}_{iji'j'}$.

Abstract

RNA secondary structures are essential abstractions for understanding the spatial folding behavior of these macromolecules. Many algorithms that solve problems over secondary structures share a common dynamic programming setup, which exploits the property that secondary structures can be decomposed into substructures. Dirks et al. (2007) noted that this setup cannot directly address an issue of distinguishability among secondary structures, which arises for classes of sequences that admit non-trivial symmetry, including circular sequences and interacting sequences. We examine the problem of counting distinguishable secondary structures. Drawing from elementary results in group theory, we identify useful subsets of secondary structures. We then build on an algorithm due to Hofacker et al. (2012) for computing the sizes of these subsets of possible structures. The result is a cubic time algorithm to count the distinguishable structures compatible with a given circular sequence. We also develop another algorithm for the same problem which has certain advantages for some related problems.
This general approach may be employed to solve similar problems for different types of RNA sequences and with different constraints on structures.

Chapter 1: Introduction

The RNA secondary structure model offers a framework in which efficient analysis and prediction of RNA conformations can be made [22]. A secondary structure is a specifically constrained set of paired positions in a sequence. Algorithms to identify optimal structures for given RNA sequences, i.e., to “fold” RNA, have focused on maximizing the number of paired bases [28] and on finding minimum free energy secondary structures [43, 53]. The landmark algorithm of McCaskill [25] efficiently computes the partition function for the set of secondary structures, and enabled a detailed characterization of equilibrium secondary structural features. The complexity of the folding and partition function problems increases dramatically when structures with so-called pseudoknots are included [32, 1]. However, efficient algorithms exist for cases involving many forms of biologically relevant pseudoknots [32, 1, 10].

Most algorithmic work on secondary structures has focused on ordinary, linear RNA sequences, but certain variations have emerged as important. For example, circular RNAs are relevant in studies of certain viruses [12] and viroids [11]. Though circular RNA sequences are fundamentally different from ordinary RNA sequences, the usual algorithms can often be adapted with slight modifications to solve the corresponding problems on circular sequences [52, 51]. Secondary structures for interacting RNA sequences have also been studied, driven in part by applications in biotechnology [15]. Algorithms to fold multiple interacting RNA sequences have been developed [8, 2]. Dirks et al. [9] gave a cubic time algorithm to compute the partition function for a multiset of RNA sequences under restrictions that remain useful in practice.

The core of many folding and partition function algorithms is dynamic programming that exploits properties of substructures. In their basic form, such algorithms closely resemble those for parsing context-free grammars [33] and may be viewed as variants thereof. In this work, we refer to this class of dynamic programming schemes as the standard dynamic programming schemes. For ordinary RNA sequences, the secondary structures accounted for by such algorithms all represent physically distinguishable conformations. However, this need not hold for circular RNAs or multiple interacting RNAs. For circular RNAs, if the sequence consists of repeated substrings, different secondary structures may be indistinguishable [14]. Dirks et al. [9] also observed that secondary structures of interacting RNA sequences can be physically indistinguishable. The number of indistinguishable secondary structures is directly related to the symmetry of these structures [9].

The presence of indistinguishable secondary structures and their relation to symmetry in the underlying sequence(s) pose two major issues if we are to apply the standard dynamic programming scheme. First, indistinguishable secondary structures lead to overcounting of some of these secondary structures. Second, the free energy of a secondary structure needs to be corrected based on the symmetries of the structure [9]. In the context of the partition function, Dirks et al. [9] observed that, fortuitously, the two issues together can be handled with minor modifications to the standard dynamic programming schemes. In the context of optimization problems, including finding the maximum number of paired bases over any structure or finding a structure that minimizes free energy, this “overcounting” is not an issue. The apparent existence of different, yet indistinguishable, optima would not lead to reporting an incorrect optimum.
However, the symmetry correction to free energy cannot be directly applied within the standard dynamic programming schemes. This is because, by their nature, dynamic programming algorithms operate on local substructures, while symmetry is a global property. Only a partial solution to this issue has been found [14].

A problem related to the partition function and folding problems, and affected by the issues raised above, is that of counting possible secondary structures. Given an RNA sequence, the counting problem pertains to enumerating the secondary structures which can be adopted by the sequence. For ordinary RNA sequences, the problem can be solved efficiently by using a standard dynamic programming scheme. For some circular sequences and interacting sequences, however, some secondary structures are indistinguishable, which leads to overcounting. Currently, to our knowledge, there is no efficient algorithm that accounts for indistinguishable secondary structures. Solving this problem is the major focus of this work.

1.1 Novel contributions

To solve the problem of counting distinguishable structures, we drew from previous studies [9, 14] to mathematically formulate the symmetry of secondary structures. In particular, we focused on the observation [9] that mutually indistinguishable secondary structures form an orbit under the action of a cyclic group. This implies that counting distinguishable structures means counting the orbits. We therefore used a well-known lemma from group theory on the number of orbits under the action of a group. Based on the lemma, we identified the subsets of secondary structures, called the periodic subsets, the sizes of which are needed to solve the distinguishable count problem. Some properties of periodic subsets have been identified by Hofacker et al. [14]. In this study, we make a complete characterization of the periodic subsets. This allowed us to extend the method by Hofacker et al.
[14] to compute the sizes of the periodic subsets. The outcome of using this method is a cubic time algorithm, called the central orbit algorithm, for the problem of counting distinguishable structures.

In addition to the above algorithm, we also developed another algorithm for the distinguishable count problem. This algorithm, called the reduced structure algorithm, takes advantage of the properties of periodic structures. The time complexity of the reduced structure algorithm is the same as that of the central orbit algorithm in the context of the distinguishable count problem. However, a feature unique to the reduced structure algorithm is that the relevant quantities are computed and saved dynamically by a standard dynamic programming scheme. This allows online computation of relevant quantities without changing the overall time complexity, a property absent in the central orbit algorithm.

The algorithms developed in this work can be applied to other related problems. We apply the central orbit algorithm to count unlabeled circular sequences, which is relevant in the combinatorial analysis of secondary structures. The comparison between the two algorithms is shown in the context of related problems. We also solve the problem of counting distinguishable secondary structures containing a certain pair. The analysis in this work can also be extended to include a certain class of pseudoknots, which are often ignored for the sake of computational efficiency.

1.2 Overview and organization

In Chapter 2 we define secondary structures for RNA sequences, and review the algorithms for some fundamental problems regarding RNA secondary structures. In Chapter 3 we introduce the problem of counting secondary structures for linear, circular, and interacting sequences. In Chapter 4 we introduce the notion of symmetry in circular sequences as well as interacting sequences, and define the problem of counting distinguishable secondary structures.
In Chapter 5 we characterize the properties of periodic subsets. In Chapter 6 we develop an efficient algorithm, called the central orbit algorithm, to solve the distinguishable count problem. In Chapter 7 we develop the reduced structure algorithm, which also solves the distinguishable count problem efficiently. In Chapter 8 we apply the two algorithms to related problems, some of which illustrate the advantage of the reduced structure algorithm.

Chapter 2: Background

The usefulness of the RNA secondary structure model comes from the fact that it encapsulates an essential part of the behavior of RNAs, i.e., the base pairing conformations. The model also allows efficient computation of relevant quantities of RNAs. In this chapter, we define secondary structures of RNA sequences and introduce essential algorithms for solving secondary structure problems. In particular, we show that common to these algorithms is a dynamic programming scheme which takes advantage of the fact that secondary structures can be partitioned into smaller substructures.

2.1 Secondary structures

Ribonucleic acid (RNA) is a molecule consisting of ribonucleotides in a chain. Each ribonucleotide contains one of four bases: adenine (A), cytosine (C), guanine (G), and uracil (U). RNA molecules have directionality; the phosphate group attached to the 5' hydroxyl group of one nucleotide can form a linkage with the 3' hydroxyl of another nucleotide. The resulting chain of nucleotides thus has a 5'-3' (or 3'-5') direction. In characterizing an RNA, its primary structure is defined as the string of letters in {A, C, G, U}. An RNA molecule bends around itself, and bases in close proximity form hydrogen bonds if they are complementary; that is, if they form a Watson-Crick base pair A-U or C-G.

When analyzing the structures of RNA molecules, there are several classes of models. Quantum chemical models [36, 38] allow the highest level of detail of molecular behavior.
Being computationally expensive, such models are limited to a few bases in proximity. Atomistic molecular dynamics models [6, 45] also provide realistic analysis of RNA molecules. The scope of such models is also limited due to their intensive computational requirements [37]. Other coarse-grained RNA models [16, 7, 37] exist with varying scope and accuracy. In the secondary structure model, unlike the above 3D models, an RNA structure is simply a collection of base pairs. The secondary structure energy models are based on empirical studies [35, 47, 23, 24, 42, 41] which parametrize the contributions of various substructures to the total free energy. The secondary structure model has a computational advantage over the 3D models, while the detailed energy model allows accurate prediction of RNA structures.

2.2 Formalization for secondary structures

We let $\Sigma_{\mathrm{RNA}} = \{A, C, G, U\}$ and let $\Sigma^{*}_{\mathrm{RNA}}$ be the set of all strings consisting of the symbols in $\Sigma_{\mathrm{RNA}}$. We define an RNA sequence to be a string in $\Sigma^{*}_{\mathrm{RNA}}$. We call such a string a linear sequence, or simply a sequence. When considering a linear sequence $\phi \in \Sigma^{*}_{\mathrm{RNA}}$, we will use $n$ to denote its length. For convenience, we use zero-based indexing, so $\phi = \phi_0 \phi_1 \cdots \phi_{n-1}$, with $\phi_i \in \Sigma_{\mathrm{RNA}}$. We adopt the convention of indexing $\phi$ in the order from the 5' end to the 3' end of the RNA molecule. We will refer to each position in an RNA sequence as a site, and at site $i$ the value of $\phi_i$ is called a base.

Given a natural number $n$, a secondary structure $s$ is a set of unordered pairs $\{i,j\}$, with $i, j \in [0, n-1]$. We remark that we use the terminology of “pairs” to indicate that for any $\{i,j\} \in s$, we have $i \neq j$ and thus $|\{i,j\}| = 2$. We keep this assumption throughout the dissertation. To work with physically relevant secondary structures, we define valid secondary structures as those which satisfy the following conditions:

L1. For all $\{i,j\}$ in $s$, $i \neq j$.
L2. For all $\{i,j\}$ and $\{k,l\}$ in $s$, either $\{i,j\} = \{k,l\}$ or $\{i,j\} \cap \{k,l\} = \emptyset$.
L3.
For all distinct pairs $\{i,j\}$ and $\{k,l\}$ in $s$, if $i < k < j$, then $i < l < j$.

From here onward, we write structures to mean secondary structures. Condition L1 imposes the restriction that a base cannot pair with itself. In practical applications, a steric constraint such as $|j - i| > 3$ is imposed [24], but in this study we work with the above condition and note that the algorithm performance will not change. Condition L2 ensures that no site is involved in more than one pair. Condition L3 prohibits any two pairs $\{i,j\}$ and $\{k,l\}$ from having the relationship $i < k < j < l$. If this relationship is satisfied, we say the secondary structure has a pseudoknot, and we consider such structures invalid. Graphical representations of some example structures are shown in Figure 2.1.

We say that $s$ is compatible with a length-$n$ sequence $\phi$ if, and only if, $s$ is valid and, for every $\{i,j\} \in s$, the bases $\phi_i$ and $\phi_j$ form a valid base pair, i.e., they form {A, U} or {C, G}. For the sake of simplicity, we ignore the wobble pair. We therefore define the set $B_{\mathrm{RNA}} = \{\{A,U\}, \{C,G\}\}$, and say that $\phi_i$ and $\phi_j$ form a valid base pair if and only if $\{\phi_i, \phi_j\} \in B_{\mathrm{RNA}}$. We denote by $\Omega_{\phi}$ the set of all structures compatible with $\phi$.

[Figure 2.1: Example structures, panels (A) to (D). The sequence of length $n$ is drawn in a circular fashion with 0 as the 0th site. A solid red line represents a pair. (A) The structure without any pair is valid. (B) An example of a valid structure with pairs. (C) An example of an invalid structure that violates condition L2. (D) An example of an invalid structure that violates condition L3. Note that two pairs inducing a pseudoknot result in crossing lines.]

2.3 Free energy models

One of the primary goals of RNA secondary structure analysis is identifying the structure(s) that a given sequence is likely to adopt. To this end, we need a free energy model which measures the relative preference of a given sequence among the possible structures.
Among many energy models, we introduce two, which we call the base pair model and the loop model. The first model is the simplest one, while the second is more realistic and therefore more practical. Although the base pair model requires a simpler algorithm than the loop model, the essence of these algorithms is the same, i.e., to take advantage of the substructures.

2.3.1 Base-pair models

In the base pair model the free energy of a secondary structure is characterized by a single constant $E > 0$. For a sequence $\phi$ and a structure $s$ in $\Omega_{\phi}$, the free energy of $s$ in the base pair model, denoted by $F_B(s)$, is defined as

\[ F_B(s) = -E\,|s|. \]

That is, the free energy in this model is linearly proportional to the number of pairs in $s$. In this model, the more pairs a structure contains, the lower its free energy, and therefore the more likely the sequence is to adopt the structure.

2.3.2 Loop models

A more realistic model than the base pair model is the loop model. Empirical studies [34, 24] have been conducted to parametrize the free energies associated with secondary structures. This widely accepted model has been used to successfully predict minimum free energy structures for RNA strands [50].

In the loop model, every pair in a structure $s \in \Omega_{\phi}$ defines a unique subset of sites. To define loops, we introduce the following terminology:

• Site $k$ is covered by pair $\{i,j\}$ if $i < k \le j$.
• Pair $\{k,l\}$ is covered by pair $\{i,j\}$ if both $k$ and $l$ are covered by $\{i,j\}$.
• Site $k$ is accessible from pair $\{i,j\}$ if $\{i,j\}$ covers $k$ and no pair covered by $\{i,j\}$ covers $k$.
• Pair $\{k,l\}$ is accessible from pair $\{i,j\}$ if $\{i,j\}$ covers $\{k,l\}$ and no pair covered by $\{i,j\}$ covers $\{k,l\}$.

Given a pair $\{i,j\}$ in $s$, the loop closed by $\{i,j\}$, denoted by $L(i,j)$, consists of the unpaired sites and the sites in the pairs accessible from $\{i,j\}$. We define the size of a loop, denoted by $n_L$, as the number of unpaired sites accessible in the loop, and we denote by $\alpha$ the number of accessible pairs.
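
The covering and accessibility definitions above can be turned into a small computation. The following Python sketch is an illustration only, not code from this work; the names `loop_profile` and `loop_type`, and the representation of a structure as a set of `(i, j)` tuples, are assumptions made here. For a valid structure it reports, for each closing pair, the number of accessible pairs (α) and the number of accessible unpaired sites (the loop size).

```python
def loop_profile(structure):
    """Map each closing pair (i, j), i < j, to (alpha, n_L).

    alpha is the number of pairs accessible from (i, j) and n_L is the
    number of unpaired sites accessible from (i, j). The structure is
    assumed to be valid (conditions L1 to L3 hold).
    """
    pairs = sorted(tuple(sorted(p)) for p in structure)
    profile = {}
    for (i, j) in pairs:
        # Pairs lying strictly inside (i, j), i.e. covered by it.
        inside = [(k, l) for (k, l) in pairs if i < k and l < j]
        # A covered pair is accessible if no other covered pair encloses it.
        accessible = [
            (k, l) for (k, l) in inside
            if not any(a < k and l < b for (a, b) in inside)
        ]
        # Sites that are paired in, or hidden beneath, an accessible pair.
        hidden = set()
        for (k, l) in accessible:
            hidden.update(range(k, l + 1))
        n_unpaired = sum(1 for k in range(i + 1, j) if k not in hidden)
        profile[(i, j)] = (len(accessible), n_unpaired)
    return profile


def loop_type(alpha):
    """Classify a loop by its number of accessible pairs."""
    if alpha == 0:
        return "hairpin"
    if alpha == 1:
        return "interior"
    return "multiloop"
```

For example, in the structure `{(0, 9), (2, 4), (5, 8)}` the pair `(0, 9)` closes a loop with two accessible pairs and one accessible unpaired site (site 1), while `(2, 4)` and `(5, 8)` each close a loop with no accessible pairs. The sketch runs in time quadratic in the number of pairs per closing pair, which is fine for illustration but not what a production implementation would use.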
The free energy in the loop model, denoted by $F_L(s)$, is the sum of the free energies of the loops:

\[ F_L(s) = \sum_{\{i,j\} \in s} F(L(i,j)), \tag{2.1} \]

where $F(L(i,j))$ is the free energy associated with the loop $L(i,j)$. The loops can be categorized into three types: hairpin loop, interior loop, and multiloop.

Hairpin loop. A loop $L(i,j)$ is called a hairpin loop if $\alpha = 0$. Due to a steric constraint, the size of the loop must satisfy $n_L \ge 3$. The energy of a hairpin loop can be parameterized by the closing pair, i.e.,

\[ F(L(i,j)) = F_{HP}(i,j). \tag{2.2} \]

Interior loop. A loop $L(i,j)$ is called an interior loop if $\alpha = 1$. Let $\{k,l\}$ be the pair accessible from $\{i,j\}$. The energy model for an interior loop depends on the locations of the two pairs:

\[ F(L(i,j)) = F_{IN}(i,j,k,l). \tag{2.3} \]

Multiloop. A loop $L(i,j)$ is called a multiloop if $\alpha \ge 2$. The energy of a multiloop depends linearly on $\alpha$ and $n_L$:

\[ F(L(i,j)) = F_m + \alpha F_p + n_L F_u, \tag{2.4} \]

where $F_m$, $F_p$, and $F_u$ are positive constants. More accurate energy models also depend on the location of each accessible pair, but we omit such considerations for the sake of simplicity. An example of each loop type is shown in Figure 2.2.

2.4 Computational model

Before we introduce the algorithms for RNA secondary structures, we state the assumptions on the computational model. Some of the following algorithms, particularly those for counting, involve integer multiplications. Moreover, those integers grow exponentially [43] with the size of the sequence. Since the fastest current algorithm to multiply $c$-bit integers takes $\Theta(c \log c)$ time [13], in a random access machine model a multiplication would take $O(n \log n)$ time for a sequence of size $n$.
In order to keep the performance analysis of the algorithms consistent with previous studies on RNA secondary structure algorithms, we assume a computational model which takes constant time for any type of multiplication. This detail regarding the assumed model of computation also emerges in the partition function problem. Algorithms for computing partition functions for RNA structures [25, 9] have cubic-time performance, and this requires floating point arithmetic, which can lead to arbitrary numerical error.

[Figure 2.2: Types of loop. Columns labeled $\alpha = 0$, $\alpha = 1$, and $\alpha = 2$; panel labels: hairpin loop, interior loop, bulge loop, stacked pairs, and multi-loop.]

2.5 Minimum free energy problem

With an energy model for secondary structures, we can ask which structure is most likely to be adopted by a given sequence. That is, the problem pertains to finding a minimum free energy (MFE) structure. In this section, we review the algorithms developed for the two energy models.

Base pair model. In the base pair model, the MFE problem amounts to finding a structure

\[ \arg\min_{s \in \Omega_{\phi}} F_B(s) = \arg\max_{s \in \Omega_{\phi}} |s|. \]

There is an efficient algorithm [28] to solve this problem. Given a sequence $\phi$ of length $n$, let $W_{ij}$ be the maximum number of pairs a structure compatible with $\phi[i..j-1]$ can have. This quantity can be expressed recursively, for $0 \le i < j \le n$, as

\[ W_{ij} = \max \begin{cases} W_{i+1,j} \\ \max_{i < k \le j} \left\{ W^{b}_{ik} + W_{kj} \right\} \end{cases} \tag{2.5} \]

where $W^{b}_{ij}$ is the maximum number of pairs for structures compatible with $\phi[i..j-1]$ that contain the pair $\{i, j-1\}$. In Equation (2.5), the first case considers the structures where site $i$ is not paired, while the second case considers those in which site $i$ is paired with another site. The base cases are $W_{ij} = 0$ for $i = j$. For $0 \le i < j \le n$, the quantity $W^{b}_{ij}$ is 0 if $\{\phi_i, \phi_{j-1}\} \notin B_{\mathrm{RNA}}$ and otherwise

\[ W^{b}_{ij} = 1 + W_{i+1,j-1}. \tag{2.6} \]

The quantities $W_{ij}$ and $W^{b}_{ij}$ for all $0 \le i \le j \le n$ can be computed in $O(n^3)$ time by dynamic programming.
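
The recursion in Equations (2.5) and (2.6) can be sketched as a table-filling routine. The Python below is a minimal illustration under the conventions of this section (half-open ranges $[i, j)$, Watson-Crick pairs only, no steric constraint); the function name `max_pairs` and the table layout are assumptions made here, not the author's implementation.

```python
def max_pairs(seq):
    """Return W[0][n], the maximum number of pairs over structures
    compatible with seq, following Equations (2.5) and (2.6)."""
    n = len(seq)
    valid = {frozenset("AU"), frozenset("CG")}  # the set B_RNA
    # W[i][j] and Wb[i][j] refer to the substring seq[i..j-1].
    W = [[0] * (n + 1) for _ in range(n + 1)]
    Wb = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            # Equation (2.6): sites i and j-1 pair with each other.
            # A single site cannot pair with itself, so length >= 2.
            if length >= 2 and frozenset((seq[i], seq[j - 1])) in valid:
                Wb[i][j] = 1 + W[i + 1][j - 1]
            # Equation (2.5), first case: site i is unpaired.
            best = W[i + 1][j]
            # Second case: site i pairs with site k-1, for i < k <= j.
            for k in range(i + 1, j + 1):
                best = max(best, Wb[i][k] + W[k][j])
            W[i][j] = best
    return W[0][n]
```

Filling the table by increasing substring length guarantees that every entry on the right-hand side is available when it is needed. The three nested loops give the cubic running time stated in the text.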
The entry $W_{0n}$ corresponds to the maximal possible number of pairs in a structure compatible with $\phi$. A backtracking algorithm can be used to find a structure with the maximum number of pairs without changing the overall time complexity.

Loop model. We now address the problem of finding a structure with minimum free energy (MFE) for a given sequence $\phi$ using the loop model. Efficient dynamic programming algorithms have been developed [44, 53] for this problem. We first define $V_{ij}$ as the MFE for the substring $\phi[i..j-1]$. The MFE for the entire sequence is given by $V_{0n}$. For $0 \leq i < j \leq n$, we have the following relation:

$$V_{ij} = \min\left\{ V_{i+1,j},\ \min_{i < k \leq j} \left\{ V^b_{ik} + V_{kj} \right\} \right\}, \tag{2.7}$$

where the quantity $V^b_{ij}$ is the MFE for the substring $\phi[i..j-1]$ with the condition that the structure contains the pair $\{i,j-1\}$. The base cases are defined as $V_{ij} = 0$ for $i = j$. Equation (2.7) has the same form as Equation (2.5), with the $\max$ operation replaced by the $\min$ operation. For $0 \leq i < j \leq n$, the quantity $V^b_{ij}$ is $\infty$ if $\{\phi_i, \phi_{j-1}\} \notin B_{\mathrm{RNA}}$ and otherwise

$$V^b_{ij} = \min\left\{ F_{HP}(i,j-1),\ \min_{i<k<l<j} \left\{ F_{IN}(i,j-1,k,l-1) + V^b_{kl} \right\},\ \min_{i<k<j} \left\{ F_m + V^m_{i+1,k} + V^{m1}_{kj} \right\} \right\}, \tag{2.8}$$

where $V^m_{ij}$ is the MFE for the substructures in the range $[i,j)$ that are part of a multiloop and contain at least one pair, while $V^{m1}_{ij}$ is the MFE for the same range in a multiloop containing exactly one pair involving site $i$. In the above equation, the three cases correspond to the loop $L(i,j-1)$ being a hairpin loop, an interior loop, and a multiloop, in that order.

For $0 \leq i \leq j \leq n$, the recursion relations for $V^m_{ij}$ and $V^{m1}_{ij}$ are

$$V^m_{ij} = \min_{i \leq k < j} \left[ V^{m1}_{kj} + \min\left\{ (k-i) F_u,\ V^m_{ik} \right\} \right] \tag{2.9}$$

and

$$V^{m1}_{ij} = \min\left\{ F_p + V^b_{ij},\ F_u + V^{m1}_{i,j-1} \right\}, \tag{2.10}$$

with base cases $V^m_{ij} = \infty$ and $V^{m1}_{ij} = \infty$ for $i = j$ or $i+1 = j$. In Equation (2.9), we account for structures for the range $[i,j)$ that have at least one pair. Let $k \in [i,j)$ be the site (closer to the 5' end) involved in the pair closest to the 3' end. By definition, the quantity $V^{m1}_{kj}$ accounts for the range $[k,j)$.
As for the range $[i,k)$, we may either have no pair, in which case the energy is given by $(k-i)F_u$, or at least one pair, which is accounted for by $V^m_{ik}$. In Equation (2.10), site $i$ can pair either with $j-1$, in which case the MFE is given by $F_p + V^b_{ij}$, or with a site in $(i,j-1)$, for which the MFE is given by $F_u + V^{m1}_{i,j-1}$ (since site $j-1$ would not be paired).

The iterations in Equation (2.8) suggest that it takes quadratic time to determine any entry $V^b_{ij}$. This implies that it takes $O(n^4)$ time to compute $V_{ij}$ for all $0 \leq i < j \leq n$. We note that the time complexity of the algorithm can be improved to cubic time by introducing auxiliary quantities [21]. The entry $V_{0n}$ is the MFE for the sequence $\phi$, and a quadratic-time backtracking algorithm can be used to identify a structure with the minimum free energy.

2.6 Partition function problem

A major milestone in RNA structure analysis besides the MFE problem is the computation of the partition function for the space of structures by McCaskill [25]. The partition function allows the computation of base pairing probabilities. Given an RNA sequence $\phi$ of length $n$, the partition function is defined as

$$Q(\Omega_\phi) = \sum_{s \in \Omega_\phi} e^{-\beta F_L(s)}, \tag{2.11}$$

where $\beta = 1/(k_B T)$, with Boltzmann constant $k_B$ and temperature $T$. The computation of the partition function can be done in a manner similar to the MFE computation. We define $Q_{ij}$, $Q^b_{ij}$, $Q^m_{ij}$, and $Q^{m1}_{ij}$ as the partition functions for the substructures accounted for by $V_{ij}$, $V^b_{ij}$, $V^m_{ij}$, and $V^{m1}_{ij}$, respectively. The recursion relations are obtained by replacing $\min$ with $+$, $+$ with $\times$, and each energy $F$ with its Boltzmann factor $e^{-\beta F}$. For $0 \leq i \leq j \leq n$, $Q_{ij} = 1$ if $i = j$ (the Boltzmann factor of the base case $V_{ij} = 0$) and otherwise

$$Q_{ij} = Q_{i+1,j} + \sum_{i < k \leq j} Q^b_{ik} Q_{kj}. \tag{2.12}$$

For $0 \leq i < j \leq n$, $Q^b_{ij}$ is $0$ if $\{\phi_i, \phi_{j-1}\} \notin B_{\mathrm{RNA}}$ and otherwise

$$Q^b_{ij} = e^{-\beta F_{HP}(i,j-1)} + \sum_{i<k<l<j} e^{-\beta F_{IN}(i,j-1,k,l-1)} Q^b_{kl} \tag{2.13}$$

$$\qquad\qquad + \sum_{i<k<j} e^{-\beta F_m} Q^m_{i+1,k} Q^{m1}_{kj}. \tag{2.14}$$

For $0 \leq i \leq j \leq n$, $Q^m_{ij} = 0$ if $i = j$ or $i+1 = j$, and otherwise

$$Q^m_{ij} = \sum_{i \leq k < j} Q^{m1}_{kj} \left( e^{-\beta (k-i) F_u} + Q^m_{ik} \right).$$
(2.15)

For $0 \leq i < j \leq n$, $Q^{m1}_{ij} = 0$ if $i+1 = j$ and otherwise

$$Q^{m1}_{ij} = e^{-\beta F_p} Q^b_{ij} + e^{-\beta F_u} Q^{m1}_{i,j-1}. \tag{2.16}$$

Equations (2.12)-(2.16) indicate a quartic-time computation of these quantities for all $0 \leq i < j \leq n$. With modifications, the above computation can be done in cubic time [20]. The entry $Q_{0n}$ corresponds to the total partition function for the given sequence $\phi$.

Chapter 3

Counting Secondary Structures

In this chapter, we focus on the problem of counting secondary structures. The counting problem is related to the MFE and partition function problems in that it is solved by similar dynamic programming algorithms. We review the counting algorithms for three different types of sequences: linear sequences, circular sequences, and interacting sequences.

3.1 Counting for linear sequences

We discuss the problem of counting secondary structures for a given linear sequence in $\Sigma^*_{\mathrm{RNA}}$. Given a linear sequence $\phi$ of length $n$, we define the quantity $C_{ij}$ as the number of structures compatible with the substring $\phi[i..j-1]$. The solution to the problem is $|\Omega_\phi| = C_{0n}$. The recursion relation for $0 \leq i < j \leq n$ is given as

$$C_{ij} = C_{i+1,j} + \sum_{i < k \leq j} C^b_{ik} C_{kj}, \tag{3.1}$$

where $C^b_{ij}$ is defined as the number of structures compatible with the substring $\phi[i..j-1]$ containing the pair $\{i,j-1\}$. The base cases are given as $C_{ij} = 1$ for $i = j$. In Equation (3.1), the first term corresponds to the case where site $i$ is not paired. The second term corresponds to the structures containing the pair $\{i,k-1\}$ for $i < k \leq j$.

Algorithm 1 Counting structures compatible with $\phi$
Require: Sequence $\phi$ of length $n$.
Ensure: The size of $\Omega_\phi$.
  $C_{ij} \leftarrow 0$ and $C^b_{ij} \leftarrow 0$ for all $0 \leq i \leq j \leq n$
  $C_{ii} \leftarrow 1$ for all $0 \leq i \leq n$
  for $j \leftarrow 0$ to $n$ do
    for $i \leftarrow j-1$ down to $0$ do
      if $\{\phi_i, \phi_{j-1}\} \in B_{\mathrm{RNA}}$ then
        $C^b_{ij} \leftarrow C_{i+1,j-1}$
      $C_{ij} \leftarrow C_{i+1,j}$
      for $k \leftarrow i+1$ to $j$ do
        $C_{ij} \leftarrow C_{ij} + C^b_{ik} C_{kj}$
  return $C_{0n}$

For $0 \leq i < j \leq n$, the quantity $C^b_{ij}$ is $0$ if $\{\phi_i, \phi_{j-1}\} \notin B_{\mathrm{RNA}}$, and otherwise

$$C^b_{ij} = C_{i+1,j-1}.$$
(3.2)

That is, the number of structures containing the pair $\{i,j-1\}$, assuming such a pair is allowed, is that of substructures for the substring $\phi[i+1..j-2]$. The quantities $C_{ij}$ and $C^b_{ij}$ can be computed in cubic time via dynamic programming. The pseudocode for this method is given in Algorithm 1. A graphical representation of the recursion relations is shown in Figure 3.1. Waterman [43] observed that the number of structures grows exponentially with the length of the sequence. This means that the elements of $C_{ij}$ and $C^b_{ij}$ grow exponentially as well. In order to keep the complexity analysis independent of the model of computation, we assume, unless stated otherwise, that addition and multiplication both take constant time and constant space.

Figure 3.1: Graphical representations of Equations (3.1) and (3.2).

3.2 Counting for circular sequences

A circular RNA is like a linear RNA molecule except that the 5' and 3' ends of the sequence are linked via the phosphate backbone. Circular RNAs occur naturally in viroids [11, 39] and viruses [12, 40]. Circular RNAs also appear as intermediates in a sequencing process for RNA viruses [18]. Mathematically, a circular RNA sequence $w$ is a cyclically ordered multiset of RNA bases, i.e., of elements of $\Sigma_{\mathrm{RNA}}$. Cyclic ordering implies, for example, that circular sequences spelled ACG, CGA, and GAC are all identical. Therefore, we can define, without loss of generality, a circular sequence $w$ as a string in $\Sigma^*_{\mathrm{RNA}}$, just like a linear sequence, with the notion of identity characterized above.

Consider a circular RNA sequence $w$ of length $n$. Regarding structures, the only difference between linear and circular RNA sequences is that, depending on the steric condition, the positions $0$ and $n-1$ may not pair. Since the steric condition in this document only requires that $i \neq j$, the above condition reduces to condition L1.
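As an aside, the counting recursion of Equations (3.1) and (3.2) can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the dissertation's code: the name `count_structures` is hypothetical, and `B_RNA` is assumed to contain the Watson-Crick pairs plus the G-U wobble pair. Because only the steric condition $i \neq j$ is used, the same routine applies whether the input string is read as a linear or as a circular sequence.

```python
# Assumption: B_RNA holds the Watson-Crick pairs and the G-U wobble pair.
B_RNA = {frozenset("AU"), frozenset("CG"), frozenset("GU")}


def count_structures(phi):
    """Return C[0][n], the number of structures compatible with phi."""
    n = len(phi)
    C = [[0] * (n + 1) for _ in range(n + 1)]
    Cb = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        C[i][i] = 1  # base case: the empty substring has one (empty) structure
    for j in range(n + 1):
        for i in range(j - 1, -1, -1):
            # Equation (3.2): pair {i, j-1} encloses the substring phi[i+1..j-2].
            if frozenset((phi[i], phi[j - 1])) in B_RNA:
                Cb[i][j] = C[i + 1][j - 1]
            # Equation (3.1): site i unpaired, plus site i paired with site k-1.
            total = C[i + 1][j]
            for k in range(i + 1, j + 1):
                total += Cb[i][k] * C[k][j]
            C[i][j] = total
    return C[0][n]


print(count_structures("ACGU"))  # prints 5
```

For `"ACGU"` the five structures are the empty structure, the single pairs $\{0,3\}$, $\{1,2\}$, and $\{2,3\}$, and the nested combination of $\{0,3\}$ with $\{1,2\}$. Python integers have arbitrary precision, so the exponentially growing entries are exact, although each multiplication is then no longer a constant-time operation as assumed in the analysis.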
Therefore, given $n$, the conditions for valid structures for circular RNA sequences are identical to those for linear RNA sequences, i.e., L1-L3. We note that this would not be the case if the steric condition were $|j-i| > c$ for $c > 0$. We denote by $\Omega_w$ the set of structures satisfying conditions L1-L3 where $\{w_i, w_j\} \in B_{\mathrm{RNA}}$ for every pair $\{i,j\} \in s$, i.e., the set of structures compatible with $w$. Since circular sequences satisfy the same conditions as linear sequences, we can use $C_{ij}$ and $C^b_{ij}$ as defined in Equations (3.1) and (3.2) to count the structures compatible with a circular sequence $w$, i.e., $|\Omega_w| = C_{0n}$. For a linear sequence $\phi$ and a circular sequence $w$, we have $|\Omega_\phi| = |\Omega_w|$ if $\phi = w$.

3.3 Interacting sequences

Besides the studies on naturally occurring nucleic acids, there has been remarkable advancement in engineering and modifying nucleic acids to develop nanodevices for specific functionalities. For example, DNA has been modified with electronically functional materials to build nanoelectronic circuits [31, 30]. The hybridization property of DNA has also been used to develop dynamic nanodevices which demonstrate tweezer-like or scissor-like behaviors [48, 17]. The hybridization of RNA has been used to develop nanodevices which can regulate specific genes [4, 19]. Other RNA nanodevices have been engineered to perform catalytic reactions [46, 26, 27, 29]. Crucial in developing such nanodevices is an understanding of the interaction of multiple nucleic acids.

Here, we review a useful secondary structure model for multiple RNA sequences. Multiple RNAs can interact through base pairs; a base of one sequence pairs with a base from another sequence. There are several models for interacting sequences [8, 2, 9, 5]. Among these, we adopt the model developed by Dirks et al. [9]. The reason for this choice is that this model allows a clear illustration of the symmetry which exists in interacting sequences, which is the main focus of this dissertation.
Also, this model is the foundation for a widely used secondary structure analysis and design tool [49].

3.3.1 Generalized sequence

Consider $m$ sequences $\phi^{(0)}, \phi^{(1)}, \ldots, \phi^{(m-1)}$ of lengths $n_0, n_1, \ldots, n_{m-1}$, and let $n = \sum_{c=0}^{m-1} n_c$. In the model developed by Dirks et al. [9], the $m$ sequences are concatenated and treated like a single sequence, but with additional information about the concatenation points. There are on the order of $m!$ ways to arrange the order of concatenation, although some of the orders are equivalent (see [9]). The above authors showed that, by imposing certain conditions on the secondary structures (described in Section 3.3.2), every possible secondary structure has a "natural" order of the sequences. The problem of accounting for the secondary structures of interacting sequences can then be divided into subproblems, each of which corresponds to a particular order of concatenation. Here we assume an arbitrary order since the analysis does not depend on the order of concatenation.

Let the sequences be concatenated as follows:

$$\phi = \phi^{(0)} \phi^{(1)} \cdots \phi^{(m-1)}.$$

The sequence $\phi$ is indexed in the range $[0, n-1]$. To identify the $m$ concatenation points, we define a set $D = \{d_0, d_1, \ldots, d_{m-1}\}$ where

$$d_h = \sum_{c=0}^{h-1} n_c. \tag{3.3}$$

The elements of $D$ are called nicks, and they are the 5' ends of the constituent sequences. The sequence $\phi$ together with the set $D$ represents $m$ sequences in a particular order. We therefore define a generalized sequence $\psi$ as the pair

$$\psi = (\phi, D).$$

We say that the generalized sequence $\psi$ has size $m$ and length $n$. This is a generalized concept in that if $D = \{0\}$, $\psi$ represents a linear sequence $\phi$. A graphical representation of a generalized sequence is shown in Figure 3.2A.

Figure 3.2: Graphical representations of generalized sequences. The sequences are directed such that the 5'-3' direction is clockwise. (A) A generalized sequence with 5 constituent sequences. (B) A generalized sequence with 4 identical sequences (symmetry 4).
(C) A generalized sequence of symmetry 4 (sequences of the same color are identical). (D) A generalized sequence of symmetry 3.

3.3.2 Secondary structures for generalized sequences

Similar to a structure on a single sequence, a secondary structure on a generalized sequence is a set of pairs. In general, the rules for a valid structure for a sequence extend to generalized sequences, but generalized sequences bring additional considerations. In particular, we are interested in structures that are connected. Intuitively, this means all constituent sequences are connected by pairs. Formally, a structure $s$ is valid for a generalized sequence $(\phi, D)$ if $s$ satisfies L1-L3 and additionally

L4. For any two distinct nicks $d < d'$ in $D$, there exists a pair $\{i,j\}$ in $s$ such that $i \in [d, d')$ and $j \notin [d, d')$.

The purpose of L4 is to ensure that secondary structures are connected. Example structures for a generalized sequence consisting of five constituent sequences are shown in Figure 3.3. We note that condition L4 is equivalent to the following two conditions:

L4a. For all $\{i,j\} \in s$, at most one element of $\bar{D}$ is accessible from $\{i,j\}$.
L4b. For all $d \in \bar{D}$, there exists $\{i,j\} \in s$ such that $\{i,j\}$ covers $d$.

Here, $\bar{D} = D \setminus \{0\}$, and coverage and accessibility are as defined for single-stranded RNA sequences. In this document, we will use condition L4 and the two conditions L4a and L4b interchangeably.

Figure 3.3: Example structures for a generalized sequence consisting of 5 constituent sequences ($m = 5$), drawn in a circular fashion. A solid red line represents a base pair. (A) An example of a valid structure. (B) An example of an invalid structure which violates condition L4 for nicks $d_3$ and $d_4$. (C) An example of an invalid structure which violates condition L4 for nicks $d_0$ and $d_2$.

A structure $s$ is compatible with a generalized sequence $\psi = (\phi, D)$ if and only if $s$ is valid and, for every pair $\{i,j\} \in s$, $\{\phi_i, \phi_j\} \in B_{\mathrm{RNA}}$ holds.
The first part, validity of $s$, is determined by relationships within $s$ and by how $s$ relates to $D$. The second part, validity of base pairs, is determined by how members of $s$ relate to the letters in $\phi$. We let $\Omega_\psi$ denote the set of structures compatible with $\psi$.

3.3.3 Counting problem

The counting problem for a generalized sequence can be solved in a similar way as the partition function problem. In order to compute $|\Omega_\psi|$, we first define $C_{ij}$ as the number of structures compatible with $\psi[i,j-1]$, where

$$\psi[k,l] = \begin{cases} (\phi[k..l], \emptyset) & \text{if } D = \emptyset, \\ (\phi[k..l], (D \cap [k,l]) \cup \{k\}) & \text{otherwise} \end{cases}$$

is a sub-generalized-sequence of $\psi$. The recursion relation for this quantity is defined, for $0 \leq i < j \leq n$, as

$$C_{ij} = \gamma_{i+1} C_{i+1,j} + \sum_{i < k \leq j} \gamma_k C^b_{ik} C_{kj}, \tag{3.4}$$

where

$$\gamma_i = \begin{cases} 0 & \text{if } i \in D, \\ 1 & \text{otherwise.} \end{cases}$$

The first term of Equation (3.4) accounts for the cases where $i$ is not paired, which is possible only if $i+1$ is not a nick since otherwise the structure would be disconnected. The second term sums over all possible $k$ such that $i$ pairs with $k-1$. The factor $\gamma_k$ ensures that the structure is connected.

The quantity $C^b_{ij}$ is defined as the number of structures compatible with $\psi[i,j-1]$ containing the pair $\{i,j-1\}$. For $0 \leq i < j \leq n$, $C^b_{ij} = 0$ if $\{\phi_i, \phi_{j-1}\} \notin B_{\mathrm{RNA}}$ and otherwise

$$C^b_{ij} = \gamma_{i+1} \gamma_{j-1} C_{i+1,j-1} + C^e_{ij}. \tag{3.5}$$

The first term of Equation (3.5) counts the cases where the loop $L(i,j-1)$ is not an exterior loop. Such a structure is possible only if neither $i+1$ nor $j-1$ is a nick, which is ensured by the factor $\gamma_{i+1} \gamma_{j-1}$. The second term accounts for the exterior loops.

The quantity $C^e_{ij}$ is defined as the number of secondary structures compatible with $\psi[i,j-1]$ containing the pair $\{i,j-1\}$, where the loop $L(i,j-1)$ is an exterior loop. The recursion relation is defined as

$$C^e_{ij} = \begin{cases} 0 & \text{if } i+1 \in D,\ j-1 \in D, \text{ and } i+1 < j, \\ C_{i+1,j-1} & \text{if either } i+1 \in D \text{ or } j-1 \in D, \\ \sum_{d \in D,\ i < d < j} C_{i+1,d}\, C_{d,j-1} & \text{otherwise.} \end{cases} \tag{3.6}$$

The recursion relations for $C_{ij}$, $C^b_{ij}$, and $C^e_{ij}$ are shown in Figure 3.4.
We show the algorithm for counting the structures for a generalized sequence in Algorithm 2.

Figure 3.4: Graphical representations of Equations (3.4), (3.5), and (3.6). As for $C^e_{ij}$, we assume the third case.

Algorithm 2 Counting structures compatible with $\psi$
Require: Generalized sequence $\psi = (\phi, D)$ of length $n$.
Ensure: The size of $\Omega_\psi$.
  $C_{ij} \leftarrow 0$, $C^b_{ij} \leftarrow 0$, $C^e_{ij} \leftarrow 0$ for all $0 \leq i \leq j \leq n$
  $C_{ii} \leftarrow 1$ for all $0 \leq i \leq n$
  for $j \leftarrow 0$ to $n$ do
    for $i \leftarrow j-1$ down to $0$ do
      if $\{\phi_i, \phi_{j-1}\} \in B_{\mathrm{RNA}}$ then
        if $i+1 \in D$, $j-1 \in D$, and $i+1 < j$ then
          $C^e_{ij} \leftarrow 0$
        else if ($i+1 \in D$ and $j-1 \notin D$) or ($i+1 \notin D$ and $j-1 \in D$) then
          $C^e_{ij} \leftarrow C_{i+1,j-1}$
        else
          for $d$ in $D$ with $i < d < j$ do
            $C^e_{ij} \leftarrow C^e_{ij} + C_{i+1,d}\, C_{d,j-1}$
        $C^b_{ij} \leftarrow \gamma_{i+1} \gamma_{j-1} C_{i+1,j-1} + C^e_{ij}$
      $C_{ij} \leftarrow \gamma_{i+1} C_{i+1,j}$
      for $k \leftarrow i+1$ to $j$ do
        $C_{ij} \leftarrow C_{ij} + \gamma_k C^b_{ik} C_{kj}$
  return $C_{0n}$

Chapter 4

Symmetry and Distinguishability

One of the properties of RNA sequences is that they have 5'-3' asymmetry. This makes every structure for a sequence physically distinguishable from the rest of the structures. This property does not necessarily hold for circular sequences or generalized sequences. In either case, nontrivial symmetry in the sequences gives rise to indistinguishable structures. It has been observed that the common algorithms introduced in the previous chapter cannot directly account for indistinguishable structures. In this chapter, we introduce the notion of symmetry of sequences and show how it can affect the secondary structure analyses. In particular, we focus on the counting problem as it is directly affected by the issue of distinguishability.

4.1 Unifying the types of sequence

So far, we have separately addressed three types of sequences: linear sequences, circular sequences, and generalized sequences. In fact, these types can all be addressed in the framework of generalized sequences. As mentioned in Section 3.3.1, a linear sequence $\phi$ can be seen as a generalized sequence $\psi = (\phi, D)$ where $D = \{0\}$.
That is, the only nick is that between the first and last sites of $\phi$. A circular sequence does not contain a nick, so it can be thought of as a generalized sequence $\psi = (\phi, D)$ where $D = \emptyset$. The following table summarizes the unifying definition of a generalized sequence $\psi = (\phi, D)$ depending on the size of $D$.

Sequence type    D
Linear           $|D| = 1$
Circular         $|D| = 0$
Multi            $|D| > 1$

To differentiate from linear and circular sequences, we say that $\psi$ is a multi sequence when $|D| > 1$. For simplicity, we use expressions such as linear sequence ($|D| = 1$), circular sequence ($|D| = 0$), and multi sequence ($|D| > 1$). In the following sections of this chapter, we adopt this unified definition of generalized sequence. We note that, even with the assumption that $|D| \geq 0$ to allow linear and circular sequences, the conditions L1-L4 still apply. When $|D| \leq 1$ (linear and circular sequences), condition L4 is trivially satisfied. Therefore, the definition of the set of secondary structures, $\Omega_\psi$, has no ambiguity.

4.2 Symmetry of generalized sequence

Dirks et al. [9] and Hofacker et al. [14] observed that circular and multi sequences can have nontrivial symmetry, a property absent in linear sequences. In this section, we formally define the symmetry of generalized sequences.

For a generalized sequence $\psi = (\phi, D)$ of length $n = |\phi|$, we define the rotation of $\psi$ by $c \in [0,n)$ bases, denoted by $Y_c(\psi) = \psi' = (\phi', D')$, such that $|\phi| = |\phi'|$, $|D| = |D'|$,

$$\phi_{(i+c) \bmod n} = \phi'_i,$$

and

$$D' = \{(d+c) \bmod n : d \in D\}.$$

That is, rotation $Y_c$ cyclically shifts the sequence and the nicks by $c$ bases. For a sequence $\phi = \mathrm{ACUACUACUACU}$, consider three generalized sequences $\psi_l = (\phi, \{0\})$, $\psi_c = (\phi, \emptyset)$, and $\psi_m = (\phi, \{0,6\})$, as shown in Figure 4.1. They share the underlying sequence $\phi$, but their nicks make them linear, circular, and multi. The rotation by 3 bases, the mapping $Y_3$, of these generalized sequences results in new generalized sequences as follows.
$\psi$                         $\psi'$
(ACUACUACUACU, $\{0\}$)        (ACUACUACUACU, $\{3\}$)
(ACUACUACUACU, $\emptyset$)    (ACUACUACUACU, $\emptyset$)
(ACUACUACUACU, $\{0,6\}$)      (ACUACUACUACU, $\{3,9\}$)

These rotations are also shown graphically in Figure 4.1.

Given a generalized sequence $\psi$ of length $n$, we define its symmetry, denoted by $p$, as the number of integers $c \in [0,n)$ such that $Y_c(\psi) = \psi$; that is, both $\phi$ and $D$ remain invariant. Using the language of group theory, given the action of the cyclic group of order $n$, denoted by $\mathbb{Z}/n\mathbb{Z}$, on the set of generalized sequences, the symmetry of $\psi$ is the size of the stabilizer subgroup of $\psi$. In the example above, the symmetries of $\psi_l$, $\psi_c$, and $\psi_m$ are, in order, 1, 4, and 2. We say that the symmetry $p$ of $\psi$ is nontrivial if $p > 1$, and trivial otherwise. In fact, the symmetry of a linear sequence $\psi = (\phi, \{0\})$ is always trivial since the only integer $c \in [0,n)$ that leaves the nick set $\{0\}$ unchanged is $c = 0$. In contrast, circular and multi sequences can have nontrivial symmetry.

By definition, the symmetry $p$ of a generalized sequence $\psi = (\phi, D)$ (the size of a stabilizer, which is a subgroup of $\mathbb{Z}/n\mathbb{Z}$) is a divisor of $n$. We define the unit length of $\psi$, denoted by $t$, as $t = n/p$. The unit length is the smallest positive integer such that

$$\phi_i = \phi_{(i+t) \bmod n}$$

for all $i \in [0,n)$, and

$$(d+t) \bmod n \in D$$

for all $d \in D$. As shown in the next section, the unit length has a special property for secondary structures for $\psi$.

Figure 4.1: Rotation by 3 bases of (A) a linear sequence, (B) a circular sequence, and (C) a multi sequence. Each colored circle represents a particular base: red → A, blue → C, and green → U.

4.3 Indistinguishable structures

Dirks et al. [9] observed that, for a multi sequence with nontrivial symmetry, some structures are distinct as sets of pairs yet represent indistinguishable conformations. Hofacker et al. [14] also observed the same phenomenon for circular sequences. This means that using the standard counting algorithms (Algorithms 1 and 2) leads to overcounting.
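The rotation $Y_c$ and the symmetry $p$ of Section 4.2 can be checked directly on the example sequence ACUACUACUACU. The following Python sketch is illustrative only; the function names `rotate`, `symmetry`, and `unit_length` are hypothetical, and the symmetry is found by brute force over all $c \in [0,n)$ rather than by any method from the cited works.

```python
def rotate(phi, D, c):
    """Apply Y_c to the generalized sequence (phi, D): phi'_i = phi_{(i+c) mod n}."""
    n = len(phi)
    phi_rot = "".join(phi[(i + c) % n] for i in range(n))
    D_rot = frozenset((d + c) % n for d in D)
    return phi_rot, D_rot


def symmetry(phi, D):
    """Count the rotations c in [0, n) that leave both phi and D invariant."""
    D = frozenset(D)
    return sum(1 for c in range(len(phi)) if rotate(phi, D, c) == (phi, D))


def unit_length(phi, D):
    """Return t = n / p for a non-empty sequence."""
    return len(phi) // symmetry(phi, D)


phi = "ACUACUACUACU"
print(symmetry(phi, {0}))       # prints 1 (linear)
print(symmetry(phi, set()))     # prints 4 (circular)
print(symmetry(phi, {0, 6}))    # prints 2 (multi)
print(unit_length(phi, {0, 6})) # prints 6
```

The three printed symmetries agree with the values 1, 4, and 2 stated for $\psi_l$, $\psi_c$, and $\psi_m$.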
In this section, we formally address the issue of indistinguishable structures and define a new problem, that of counting the distinguishable structures.

We start by defining, given a positive integer $n$, the rotation of an integer $i \in [0,n)$ by $c \in [0,n)$ as

$$Y_c(i) = (i+c) \bmod n. \tag{4.1}$$

We also extend the definition of the rotation to pairs of integers:

$$Y_c(\{i,j\}) = \{Y_c(i), Y_c(j)\} \tag{4.2}$$

for $c, i, j \in [0,n)$. For a generalized sequence $\psi = (\phi, D)$ of length $n$ and symmetry $p$, consider a structure $s \in \Omega_\psi$. We define the rotation of a structure as

$$Y_c(s) = \{Y_c(\{i,j\}) : \{i,j\} \in s\} \tag{4.3}$$

for $c \in [0,n)$. That is, rotation by $c$ shifts every pair in $s$ by $c$. We note that rotation is invertible, and the inverse of a rotation is given by $Y_c^{-1} = Y_{(n-c) \bmod n}$ for $c \in [0,n)$. In general, $Y_c(s)$ does not necessarily belong to $\Omega_\psi$. The following proposition shows that compatibility is conserved for certain values of $c$.

Proposition 1. For a generalized sequence $\psi$ of length $n$ and symmetry $p$, let the unit length be $t = n/p$. Then, $s \in \Omega_\psi$ implies $s' = Y_t(s) \in \Omega_\psi$.

Proof. We show that $s'$ satisfies conditions L1-L4 and that $\{\phi_i, \phi_j\} \in B_{\mathrm{RNA}}$ for every pair $\{i,j\}$ in $s'$.

Condition L1. For any pair $\{i,j\} \in s$ and the corresponding pair $\{i',j'\} = Y_t(\{i,j\}) \in s'$, $i \neq j$ implies $i' \neq j'$ by the definition of rotation.

Condition L2. Consider two pairs $\{i',j'\} = Y_t(\{i,j\})$ and $\{k',l'\} = Y_t(\{k,l\})$ in $s'$ for some $\{i,j\}$ and $\{k,l\}$ in $s$. Since rotation is invertible, $\{i,j\} = \{k,l\}$ implies $\{i',j'\} = \{k',l'\}$, and $\{i,j\} \cap \{k,l\} = \emptyset$ implies $\{i',j'\} \cap \{k',l'\} = \emptyset$. Since $s$ satisfies L2, these two are the only possibilities.

Condition L3. Consider pairs $\{i,j\}$ and $\{k,l\}$ in $s$, and let $i < j$, $k < l$, and $i < k$ without loss of generality. Since $s$ satisfies L3, either $i < j < k < l$ or $i < k < l < j$. Assume $i < j < k < l$, and let rotation $Y_t$ applied to $\{i,j\}$ and $\{k,l\}$ give $\{i',j'\}$ and $\{k',l'\}$, respectively. Possible orders for the rotated sites are the four circular rotations of $i' < j' < k' < l'$.
In all four possibilities, $\{i',j'\}$ and $\{k',l'\}$ satisfy L3. We reach a similar conclusion if $i < k < l < j$.

Condition L4. Since $s \in \Omega_\psi$, for two nicks $d_a$ and $d_b$ in $D$, there is a pair $\{i,j\}$ in $s$ such that $i \in [d_a, d_b)$ and $j \notin [d_a, d_b)$. For the two nicks $d'_a = Y_t(d_a)$ and $d'_b = Y_t(d_b)$ in $D$, assume $d'_a < d'_b$ without loss of generality. The pair $\{i',j'\} = Y_t(\{i,j\})$ satisfies either $i' \in [d'_a, d'_b)$ and $j' \notin [d'_a, d'_b)$, or $i' \notin [d'_a, d'_b)$ and $j' \in [d'_a, d'_b)$, satisfying condition L4 for the nicks $d'_a$ and $d'_b$. Since this is the case for every pair of nicks, condition L4 is satisfied.

Compatibility. Since $\phi_{Y_t(i)} = \phi_i$ for all $i \in [0,n)$, $\{\phi_i, \phi_j\} \in B_{\mathrm{RNA}}$ implies $\{\phi_{i'}, \phi_{j'}\} \in B_{\mathrm{RNA}}$, where $\{i',j'\} = Y_t(\{i,j\})$, for all $\{i,j\}$ in $s$.

With rotation defined, we can now define the symmetry and distinguishability among structures. Consider a generalized sequence $\psi$ of length $n$ and symmetry $p$, and let $t = n/p$. Proposition 1 shows that rotation by $t$ maps a structure $s$ in $\Omega_\psi$ to another structure $Y_t(s)$ in $\Omega_\psi$. We can continue applying the rotation to find $Y_{ht}(s) \in \Omega_\psi$ for $h \in [0,p)$. In the language of group theory, the $p$ rotations $Y_{ht}$ for $h \in [0,p)$ constitute an action of the cyclic group $\mathbb{Z}/p\mathbb{Z}$ of order $p$ on $\Omega_\psi$. Here, we define $\mathbb{Z}/p\mathbb{Z} = \{0, 1, \ldots, p-1\}$ where the composition rule is addition modulo $p$. The orbit and the stabilizer of $s \in \Omega_\psi$ under this action are denoted by $\mathrm{orb}_p(s)$ and $\mathrm{stab}_p(s)$, respectively, and defined as

$$\mathrm{orb}_p(s) = \{Y_{ht}(s) : h \in [0,p)\},$$

$$\mathrm{stab}_p(s) = \{h : h \in [0,p) \text{ and } Y_{ht}(s) = s\}.$$

The action of $\mathbb{Z}/p\mathbb{Z}$ partitions $\Omega_\psi$ into disjoint orbits, each of which is an equivalence class. We say that two structures $s$ and $s'$ are indistinguishable from each other if they belong to the same orbit and distinguishable otherwise. We define the symmetry of $s$ as the size of the stabilizer $\mathrm{stab}_p(s)$. The orbit-stabilizer theorem states

$$|\mathrm{orb}_p(s)|\,|\mathrm{stab}_p(s)| = p$$

for every $s \in \Omega_\psi$. Therefore, the symmetry of a secondary structure is a divisor of $p$.
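For very short sequences, the orbits defined above can be enumerated by brute force, which gives a concrete picture of indistinguishable structures. The sketch below is a hypothetical illustration, not an algorithm from this work: it handles only circular sequences ($D = \emptyset$, so L4 is trivial), it assumes a non-empty input, it assumes `B_RNA` contains the Watson-Crick pairs plus the G-U wobble pair, and it takes exponential time. The names `structures`, `rotate_structure`, and `orbits_of_circular` are introduced here for illustration.

```python
# Assumption: B_RNA holds the Watson-Crick pairs and the G-U wobble pair.
B_RNA = {frozenset("AU"), frozenset("CG"), frozenset("GU")}


def structures(phi, i=0, j=None):
    """Yield each structure on sites i..j-1 as a frozenset of 2-element frozensets."""
    if j is None:
        j = len(phi)
    if i >= j:
        yield frozenset()
        return
    # Site i is unpaired.
    yield from structures(phi, i + 1, j)
    # Site i pairs with site k; splitting at k keeps the pairs free of crossings (L3).
    for k in range(i + 1, j):
        if frozenset((phi[i], phi[k])) in B_RNA:
            for inner in structures(phi, i + 1, k):
                for outer in structures(phi, k + 1, j):
                    yield inner | outer | {frozenset((i, k))}


def rotate_structure(s, c, n):
    """Apply Y_c to every pair of the structure s."""
    return frozenset(frozenset((x + c) % n for x in pair) for pair in s)


def orbits_of_circular(phi):
    """Return (p, Omega, orbits) for the circular sequence (phi, empty nick set)."""
    n = len(phi)
    p = sum(1 for c in range(n) if phi[c:] + phi[:c] == phi)  # symmetry of phi
    t = n // p                                                # unit length
    omega = set(structures(phi))
    remaining, orbits = set(omega), []
    while remaining:
        s = next(iter(remaining))
        orbit = {rotate_structure(s, h * t, n) for h in range(p)}
        orbits.append(orbit)
        remaining -= orbit
    return p, omega, orbits


p, omega, orbits = orbits_of_circular("AUAU")
print(p, len(omega), len(orbits))  # prints 2 7 5
```

For the circular sequence AUAU ($p = 2$, $t = 2$), the seven structures fall into five orbits: three structures are fixed by the rotation $Y_2$ (orbit size 1, symmetry 2), and the four single-pair structures form two orbits of size 2 (symmetry 1), in agreement with the orbit-stabilizer relation.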
Example orbits and corresponding symmetries for a circular sequence of symmetry 4 are shown in Figure 4.2. For a generalized sequence $\psi$ of length $n$ and symmetry $p$, we define the set of distinguishable structures, denoted by $\Lambda_\psi$, as the quotient set of $\Omega_\psi$ under the group action of $\mathbb{Z}/p\mathbb{Z}$, i.e.,

$$\Lambda_\psi = \Omega_\psi / (\mathbb{Z}/p\mathbb{Z}) = \{\mathrm{orb}_p(s) : s \in \Omega_\psi\}. \tag{4.4}$$

That is, we treat each orbit as one distinguishable structure.

Figure 4.2: Example orbits of different sizes for a generalized sequence $\psi = (\phi, \emptyset)$ (a circular sequence) of symmetry 4. (A) Orbit size of 4 and symmetry 1. (B) Orbit size of 2 and symmetry 2. (C) Orbit size of 1 and symmetry 4.

4.4 Counting distinguishable structures

Our computational problem is the counting problem defined as follows:

Problem 1: Distinguishable Count
Input: A generalized sequence $\psi$ of length $n$
Question: What is the size of $\Lambda_\psi$?

For a linear sequence $\psi = (\phi, \{0\})$, since its symmetry is $p = 1$ and its unit length is $t = n/1 = n$, every orbit consists of one secondary structure. This means that every structure is distinguishable from every other. Therefore we have $|\Lambda_\psi| = |\Omega_\psi|$, and the problem can be solved by Algorithm 1. For circular and multi sequences, there has not been an efficient algorithm for the Distinguishable Count problem. As pointed out by Dirks et al. [9], this is because dynamic programming methods such as Algorithms 1 and 2 operate on substructures and cannot account for a global parameter such as the symmetry of a structure. Developing an efficient algorithm for the Distinguishable Count problem, as shown in the following chapters, is one of the main contributions of this work.

Chapter 5

Characterization of Periodic Subsets

In the previous chapter, we defined the problem of counting distinguishable structures. To develop efficient methods to solve this problem, we first use an elementary result of group theory to identify relevant subsets of structures, called the periodic subsets. We then show the properties of such subsets.
5.1 Defining periodic subsets

Consider a generalized sequence $\psi$ of length $n$ and symmetry $p$, and let $t = n/p$. Counting distinguishable structures amounts to counting the orbits under the action of the cyclic group $\mathbb{Z}/p\mathbb{Z}$. Burnside's lemma provides the relationship between the number of orbits and particular subsets of $\Omega_\psi$, i.e.,

$$|\Lambda_\psi| = \frac{1}{p} \sum_{h \in \mathbb{Z}/p\mathbb{Z}} |\mathrm{Fix}(h)|, \tag{5.1}$$

where

$$\mathrm{Fix}(h) = \{s \in \Omega_\psi : Y_{ht}(s) = s\} \tag{5.2}$$

is the subset of $\Omega_\psi$ fixed under the group action of $h \in \mathbb{Z}/p\mathbb{Z}$. We call $\mathrm{Fix}(h)$ the $h$-periodic subset of $\Omega_\psi$, and the structures in $\mathrm{Fix}(h)$ $h$-periodic structures. We note that $\mathrm{Fix}(0) = \Omega_\psi$ since any structure in $\Omega_\psi$ is fixed under the action of the identity element, and we know its size is given by $C_{0n}$. For $h \in \mathbb{Z}/p\mathbb{Z} \setminus \{0\}$, let $\langle h \rangle$ denote the subgroup of $\mathbb{Z}/p\mathbb{Z}$ generated by $h$. For any $h \in \mathbb{Z}/p\mathbb{Z} \setminus \{0\}$, if $s \in \mathrm{Fix}(h)$, then $s \in \mathrm{Fix}(h')$ for all $h' \in \langle h \rangle$. Since $\langle h \rangle = \langle \gcd(h,p) \rangle$, we conclude that

$$\mathrm{Fix}(h) = \mathrm{Fix}(\gcd(h,p))$$

for every $h \in \mathbb{Z}/p\mathbb{Z} \setminus \{0\}$. To compute $|\Lambda_\psi|$, we therefore only need to compute the size of $\mathrm{Fix}(b)$ for every proper divisor $b$ of $p$.

5.2 Pair orbits

In this section, we identify some properties of the pairs in periodic structures. For a generalized sequence $\psi$ of length $n$ and symmetry $p$, consider a $b$-periodic structure $s \in \mathrm{Fix}(b)$ for some proper divisor $b$ of $p$. By definition of $\mathrm{Fix}(b)$, it holds that $Y_r(s) = s$, where $r = bn/p$. This implies that, for every pair $\{i,j\} \in s$, the pair $Y_r(\{i,j\})$ is also in $s$. This in turn implies that $Y_{2r}(\{i,j\})$ is in $s$ and so on, before returning to the original pair $Y_{qr}(\{i,j\}) = \{i,j\}$ where $q = p/b$. This allows us to partition $s$ into disjoint subsets, called pair orbits, of $q$ pairs. The pair orbit of a pair $\{i,j\}$, denoted by $[\{i,j\}]_q$, is defined as

$$[\{i,j\}]_q = \{Y_{hr}(\{i,j\}) : h \in [0,q)\}, \tag{5.3}$$

where $r = n/q$. A pair orbit can be represented by any of its members; for a pair $\{i,j\}$ in $s$, $[\{i,j\}]_q = [Y_{hr}(\{i,j\})]_q$ for any $h \in [1,q]$. Example pair orbits for periodic structures are shown in Figure 5.1. The following proposition shows that there are two types of pair orbits.

Proposition 2.
Consider a generalized sequence $\psi$ of length $n$ and symmetry $p$. Suppose $s \in \mathrm{Fix}(b)$, where $b$ is a proper divisor of $p$. Let $q = p/b$ and $r = n/q$. Then, every pair orbit in $s$ can be expressed as $[\{i,j\}]_q$ or $[\{i+r,j\}]_q$ where $0 \leq i < j < r$.

Figure 5.1: Example structures with pair orbits for a circular sequence of symmetry 4. The pairs in each pair orbit are emphasized for each pair orbit. (A) Pair orbits in a 0-periodic structure each consist of one pair. (B) Pair orbits in a 2-periodic structure each consist of two pairs. (C) Pair orbits in a 1-periodic structure each consist of 4 pairs.

Proof. Consider a pair $\{i,j\}$ in $s$, and let $i' = i \bmod r$ and $j' = j \bmod r$. Since $\{\phi_i, \phi_j\} \in B_{\mathrm{RNA}}$, it must be that $\phi_i \neq \phi_j$. Furthermore, we have $\phi_i = \phi_{i'}$ and $\phi_j = \phi_{j'}$. This implies $i' \neq j'$, and we can assume $i' < j'$ without loss of generality. We consider two cases. In the first case, we have $\lfloor i/r \rfloor = \lfloor j/r \rfloor$. Let $k = q - \lfloor i/r \rfloor$ so that $Y_{kr}(\{i,j\}) = \{i',j'\}$, and we can express $[\{i,j\}]_q = [\{i',j'\}]_q$.

Now consider the second case where $\lfloor i/r \rfloor \neq \lfloor j/r \rfloor$, and let $k = q - \lfloor j/r \rfloor$ so that $Y_{kr}(\{i,j\}) = \{i'+hr, j'\}$ for some $h \in [0,q)$. We show that the only possible value of $h$ is 1. By assumption, $h \neq 0$. If $q \leq 2$, the only possibility is $h = 1$. For $q > 2$, to show that $h = 1$ by contradiction, assume $h > 1$. Consider the pair $Y_r(\{i'+hr, j'\}) = \{Y_r(i'+hr), j'+r\}$. If $h$

Figure 5.2: Example structures for a circular sequence of symmetry 4. The pairs in internal pair orbits are drawn in red, and those in external pair orbits are drawn in blue. (A) For a 2-periodic structure, $r = 2n/4 = n/2$. (B) For a 1-periodic structure, $r = n/4$. The pairs in external orbits cross the dotted lines dividing the sequence according to the periodicity of the secondary structure.

1) of length $n$ and symmetry $p$, and for a proper divisor $b$ of $p$, let $q = p/b$ and $r = n/q$.
Then, a structure $s$ is in $\mathrm{Fix}(b)$ if and only if it consists of internal and/or external pair orbits, orbit compatible with $\psi$, and the pair orbits in $s$ satisfy, in addition to P1-P5, the following conditions.

P6. For any two distinct nicks $d < d'$ in $D \cap [0,r)$, there exists a pair orbit $(i,j)_x$ in $s$ such that $|\{i,j\} \cap [d,d')| = 1$.
P7. If a nick $d \in D \cap [0,r)$ is accessible from an external orbit $(i,j)_{ex}$, then there is at least one external orbit $(k,l)_{ex}$ covered by $(i,j)_{ex}$.
P8. There is at least one external orbit in $s$.

Conditions P1-P8 ensure that the periodic structures do not violate conditions L1-L4. The following table shows the correspondence between the conditions for periodic structures and general secondary structures.

Periodic structures    Secondary structures
P1                     L1
P2                     L2
P3-P5                  L3
P6-P8                  L4

Figure 5.3 shows some consequences of violating conditions P3-P5. Conditions P6-P8 pertain only to multi sequences. Some consequences of violating these conditions are shown in Figure 5.4.

Figure 5.3: Consequences of violating the conditions for reduced structures. (A) Violation of P3 between two internal orbits. (B) Violation of P3 between internal and external orbits. (C) Violation of P3 between two external orbits. (D) Violation of P4 between two external orbits. (E) Violation of P5 between an internal and an external orbit. The corresponding secondary structures all violate condition L3.

Figure 5.4: Consequences of violating conditions P6-P8. (A) Violation of P6. (B) Violation of P7. (C) Violation of P8. In all cases, the resulting structures violate condition L4.

Chapter 6

Central Orbit Algorithm

In the previous chapter, we characterized the properties of the periodic subsets for circular and multi sequences (Theorems 3 and 4). In this chapter, we develop an efficient algorithm, called the central orbit algorithm. This algorithm draws from the insight provided by Hofacker et al. [14].
6.1 Central orbit

In this section, we develop an algorithm for counting periodic structures by extending an observation made in previous research [14]. We start with the following observations, which follow from Theorems 3 and 4.

Corollary 5. For a generalized sequence $\psi = (\phi, D)$ with length $n$ and symmetry $p$, consider a structure $s \in \mathrm{Fix}(b)$ for some proper divisor $b$ of $p$. If $s$ has at least one external orbit, then there exists exactly one external orbit that is covered by all the other external orbits in $s$. We call such an external orbit the central orbit.

Proof. Assume that $s$ has at least one external orbit. For any two distinct external orbits in $s$, condition P4 requires that one of the orbits covers the other. Furthermore, given three external orbits $o_1$, $o_2$, and $o_3$, if $o_1$ covers $o_2$, and $o_2$ covers $o_3$, then $o_1$ covers $o_3$. The external orbits are therefore totally ordered by covering, and it follows that there exists exactly one external orbit in $s$ which is covered by all the other external orbits in $s$.

Figure 6.1: Partitioning of periodic structures based on the central orbit. We assume that the circular sequence has symmetry 4. (A) A 2-periodic structure and (B) a 1-periodic structure. In both cases, the central orbit is the external orbit $(i,j)_{\mathrm{ex}}$. The red and blue regions show the two types of substructures.

Corollary 6. For a generalized sequence $\psi$ with length $n$ and symmetry $p$, consider a structure $s \in \mathrm{Fix}(b)$ for some proper divisor $b$ of $p$. If $s$ contains at least one external orbit, and the central orbit is $(i,j)_{\mathrm{ex}}$ where $0 \le i < j < r$ ($r = nb/p$), then $s$ has two substructures $s_{(i,j)}$ and $s_{[j,i+r]}$. Here, a substructure is a subset of pairs in $s$ for which the involved sites are all in the indicated range.

Proof. By the definition of the central orbit, we first note that there is no external orbit $(k,l)_{\mathrm{ex}}$ such that $i < k < l < j$. This means that we only have internal orbits in the range $(i,j)$, which in turn implies that any paired site in this range is paired with another site in the same range, thus indicating the substructure $s_{(i,j)}$.
We also have the pair $\{j,i+r\}$ belonging to the central orbit $(i,j)_{\mathrm{ex}}$. This pair defines the substructure $s_{[j,i+r]}$.

Corollary 6 allows us to partition a periodic structure into substructures based on the central orbit. Figure 6.1 shows this partitioning of periodic structures.

6.2 Circular sequences

We now construct an algorithm for Problem 1 for circular sequences based on the findings in the previous section. Consider a circular sequence $\psi = (\phi, \emptyset)$ of length $n$ and symmetry $p$, and a proper divisor $b$ of $p$. For a $b$-periodic structure $s \in \mathrm{Fix}(b)$, Corollary 6 shows that the central orbit $(i,j)_{\mathrm{ex}}$, if it exists, implies the substructures $s_{(i,j)}$ and $s_{[j,i+r]}$. We note that these two substructures combine to make a larger substructure $s_{(i,i+r]}$. Since this substructure spans a range of length $r$, and since $Y_r(s) = s$, the entire structure $s$ is characterized by $s_{(i,i+r]}$, that is,

$$s = \bigcup_{h \in [0,q)} Y_{hr}\, s_{(i,i+r]}.$$

Furthermore, the number of structures in $\mathrm{Fix}(b)$ containing the central orbit $(i,j)_{\mathrm{ex}}$ is given by the product $C_{i+1,j}\, C^b_{j,i+r+1}$, corresponding to the substructures $s_{(i,j)}$ and $s_{[j,i+r]}$, the latter of which contains the pair $\{j,i+r\}$. We therefore arrive at the following result.

Proposition 7. For a circular sequence $\psi = (\phi, \emptyset)$ of length $n$ and symmetry $p$, consider the $b$-periodic subset $\mathrm{Fix}(b)$ for a proper divisor $b$ of $p$. Let $r = nb/p$. Then, the size of the subset is given by

$$|\mathrm{Fix}(b)| = C_{0r} + \sum_{0 \le i < j \le r} C_{i+1,j}\, C^b_{j,i+r+1}. \tag{6.1}$$

Proof. The structures in $\mathrm{Fix}(b)$ either have no external orbits or have at least one external orbit. In the former case, we only have internal orbits. We note that, in the absence of external orbits, the conditions P1-P5 reduce to conditions equivalent to L1-L3. Since internal orbits $(i,j)_{\mathrm{in}}$ are defined on the range $0 \le i < j < r$, the number of structures in $\mathrm{Fix}(b)$ without external orbits is given by $C_{0r}$, which is the first term in Equation (6.1). If there is at least one external orbit, then Corollary 5 states that there is a unique central orbit.
Let $(i,j)_{\mathrm{ex}}$ be the central orbit. As implied by Corollary 6, the number of secondary structures with central orbit $(i,j)_{\mathrm{ex}}$ is $C_{i+1,j}\, C^b_{j,i+r+1}$. The second term in Equation (6.1) sums this product over all possible $i$ and $j$. We note that, since the central orbit, if it exists, is unique, no structure is counted more than once in the summation.

Algorithm 3 Counting distinguishable structures compatible with a circular sequence (central orbit algorithm)
Require: Circular sequence $\psi = (\phi, \emptyset)$ of length $n$ and symmetry $p$.
Ensure: The size of $\Lambda_\psi$.
 1: Compute $C_{ij}$ and $C^b_{ij}$ for $\phi$ (Algorithm 1).
 2: $F(h) \leftarrow 0$ for all $h \in [0,p)$
 3: $F(0) \leftarrow C_{0n}$
 4: for $h$ from 1 to $p-1$ do
 5:     $b \leftarrow \gcd(h,p)$
 6:     if $b \ne h$ then
 7:         $F(h) \leftarrow F(b)$
 8:     else
 9:         $r \leftarrow nb/p$
10:         $F(b) \leftarrow C_{0r}$
11:         for $i \leftarrow 0$ to $r$ do
12:             for $j \leftarrow i+1$ to $r$ do
13:                 $F(b) \leftarrow F(b) + C_{i+1,j}\, C^b_{j,i+r+1}$
14: $\Lambda \leftarrow 0$
15: for $h$ from 0 to $p-1$ do
16:     $\Lambda \leftarrow \Lambda + F(h)$
17: return $\Lambda / p$

Proposition 7 provides an important identity for computing the sizes of periodic subsets. We note that Equation (6.1) is a generalization of the identity established previously [14]. The pseudocode for this method, called the central orbit algorithm, is given in Algorithm 3.

6.3 Multi sequence

In this section, we build algorithms for counting distinguishable structures for multi sequences. This amounts to incorporating the additional conditions for periodic structures stated in Theorem 4.

Using the central orbit, a relationship analogous to Equation (6.1) in the context of generalized sequences is

$$|\mathrm{Fix}(b)| = \sum_{0 \le i < j \le r} \gamma_{i+1}\, \gamma_j\, C_{i+1,j}\, C^b_{j,i+r+1}, \tag{6.2}$$

where $r = nb/p$, and $C_{ij}$ and $C^b_{ij}$ are as defined in Equations (3.4) and (3.5). We note that the first term in Equation (6.1) is absent in Equation (6.2). This is due to condition P8 which prohibits structures with only

Algorithm 4 Counting distinguishable structures compatible with a multi sequence (central orbit algorithm)
Require: Multi sequence $\psi$ of length $n$ and symmetry $p$.
Ensure: The size of $\Lambda_\psi$.
 1: Compute $C_{ij}$ and $C^b_{ij}$ for $\psi$ (Algorithm 2).
 2: $F(h) \leftarrow 0$ for all $h \in [0,p)$
 3: $F(0) \leftarrow C_{0n}$
 4: for $h$ from 1 to $p-1$ do
 5:     $b \leftarrow \gcd(h,p)$
 6:     if $b \ne h$ then
 7:         $F(h) \leftarrow F(b)$
 8:     else
 9:         $r \leftarrow nb/p$
10:         for $i \leftarrow 0$ to $r$ do
11:             for $j \leftarrow i+1$ to $r$ do
12:                 $F(b) \leftarrow F(b) + \gamma_{i+1}\, \gamma_j\, C_{i+1,j}\, C^b_{j,i+r+1}$
13: $\Lambda \leftarrow 0$
14: for $h$ from 0 to $p-1$ do
15:     $\Lambda \leftarrow \Lambda + F(h)$
16: return $\Lambda / p$

internal orbits. The factors $\gamma_{i+1}\gamma_j$ ensure that only connected structures are counted. The algorithm for counting distinguishable structures for a generalized sequence $\psi$ is shown in Algorithm 4.

With the central orbit algorithms for both circular and multi sequences, we can state the following result.

Theorem 8. With a computational model in which addition and multiplication take constant time and space, the Distinguishable Count problem can be solved by the central orbit algorithm in $O(n^3)$ time and $O(n^2)$ space.

Proof. We refer to Algorithms 3 and 4. A standard dynamic programming scheme to compute $C_{ij}$ and $C^b_{ij}$ takes $O(n^3)$ time and $O(n^2)$ space. The remaining steps require only $O(pn^2)$ time. Since $p \le n$, the theorem holds.

Chapter 7
Reduced Structure Algorithm

In Chapter 6, we introduced the central orbit algorithm, which solves Problem 1 efficiently (Theorem 8). In this chapter, we introduce another algorithm for the same problem, called the reduced structure algorithm. As shown below, this algorithm has the same time and space complexity as the central orbit algorithm. The purpose of introducing it is that it sheds light on some properties of periodic structures that allow us to treat them like another type of "secondary structure" (i.e., a reduced structure). Besides allowing an analogy with secondary structures, the concept of a reduced structure provides certain computational advantages. The two algorithms are compared in Section 8.2.

7.1 Defining reduced structures

Consider a generalized sequence of length $n$ and symmetry $p > 1$, and a proper divisor $b$ of $p$. Let $q = p/b$ and $r = n/q$.
Proposition 2 demonstrates that any $b$-periodic structure $s$ consists of internal and/or external pair orbits; that is, for example,

$$s = (i,j)_{\mathrm{in}} \cup (k,l)_{\mathrm{ex}} \cup \cdots,$$

where $0 \le i < j < r$, $0 \le k < l < r$, and so on. Rather than viewing $s$ as a union of pair orbits, we can also treat $s$ as a set of pair orbits. To this end, for each internal orbit

$$(i,j)_{\mathrm{in}} = \bigcup_{h \in [0,q-1]} Y_{hr}(\{i,j\}),$$

we define the internal reduced pair $\{i,j\}_{\mathrm{in}}$. Similarly, for each external orbit

$$(k,l)_{\mathrm{ex}} = \bigcup_{h \in [0,q-1]} Y_{hr}(\{k+r,l\}),$$

we define the external reduced pair $\{k,l\}_{\mathrm{ex}}$. The internal and external reduced pairs are simply pairs of sites in the range $[0,r)$ with a binary label (internal or external). For a $b$-periodic structure $s \in \mathrm{Fix}(b)$, we can then define the reduced structure

$$\sigma = \{\{i,j\}_{\mathrm{in}}, \{k,l\}_{\mathrm{ex}}, \cdots\}. \tag{7.1}$$

The correspondence between $s$ and $\sigma$ is depicted in Figure 7.1.

The reduced pairs in the reduced structure $\sigma$ involve sites in the range $[0,r)$. We note that $r$ depends on the value of $b$ ($r = nb/p$). Since $s \in \mathrm{Fix}(b)$ is orbit compatible with $\psi$, every reduced pair $\{i,j\}_x$ ($x$ is either in or ex) is such that $\{\phi_i,\phi_j\} \in B_{\mathrm{RNA}}$. Therefore, for a multi sequence $\psi$ with length $n$ and symmetry $p$, and for a proper divisor $b$, the $b$-periodic subset $\mathrm{Fix}(b)$ corresponds to the set of reduced structures "compatible" with $\psi/q$, where $q = p/b$ and

$$\psi/q = (\phi[0..r-1],\, D \cap [0,r)).$$

We note that $r = n/q$. The notation of division in the expression $\psi/q$ is meant to indicate that $\psi/q$ is shorter than $\psi$ by a factor of $q$. Let $\psi/q = \psi' = (\phi', D')$, and we formally define $\Gamma_{\psi'}$ as reduced

Figure 7.1: Correspondence between $b$-periodic structures and reduced structures. The generalized sequences considered here all have symmetry 4. The internal orbits and internal reduced pairs are in red, while the external orbits and external reduced pairs are in blue. (A) A 2-periodic structure for a circular sequence and the corresponding reduced structure. (B) A 1-periodic structure for a circular sequence and the corresponding reduced structure.
(C) A 2-periodic structure for a multi sequence and the corresponding reduced structure. (D) A 1-periodic structure for a multi sequence and the corresponding reduced structure.

structures compatible with $\psi'$ as those consisting of reduced pairs $\{i,j\}_x$ ($x$ is either in or ex) satisfying $\{\phi'_i,\phi'_j\} \in B_{\mathrm{RNA}}$ as well as the following conditions. Below, the symbols $x$ and $y$ are either in or ex.

R1. Reduced pairs $\{i,j\}_x$ satisfy $i \ne j$.
R2. For any two reduced pairs $\{i,j\}_x$ and $\{k,l\}_y$, either $\{i,j\}_x = \{k,l\}_y$ or $\{i,j\} \cap \{k,l\} = \emptyset$.
R3. For any $\{i,j\}_x$ and $\{k,l\}_y$, if $i < k < j$, then $i < l < j$.
R4. For two distinct external reduced pairs, one reduced pair covers the other.
R5. No internal reduced pair covers an external reduced pair.
R6. For any two distinct nicks $d < d'$ in $D \cap [0,r)$, there exists a reduced pair $\{i,j\}_x$ such that $|\{i,j\} \cap [d,d')| = 1$.
R7. If a nick $d \in D \cap [0,r)$ is accessible from an external reduced pair $\{i,j\}_{\mathrm{ex}}$, then there is at least one external reduced pair $\{k,l\}_{\mathrm{ex}}$ covered by $\{i,j\}_{\mathrm{ex}}$.
R8. If $|D| \ne 0$, there is at least one external reduced pair in $s$.

Here, the definitions of covering and being accessible are analogous to those for pair orbits (Section 5.2).

The correspondence between the $b$-periodic subset $\mathrm{Fix}(b)$ and the set $\Gamma_{\psi'}$ is one to one. The following observation states this formally.

Observation 9. Consider a generalized sequence $\psi$ of length $n$ and symmetry $p$ ($> 1$). For every proper divisor $b$ of $p$, let $q = p/b$ and $\psi' = \psi/q$. Then,

$$|\mathrm{Fix}(b)| = |\Gamma_{\psi'}|. \tag{7.2}$$

Therefore, counting periodic structures for a generalized sequence is equivalent to counting reduced structures for a shorter generalized sequence.

7.2 Counting reduced structures for circular sequences

Just as the conditions L1-L3 can be used to develop a dynamic programming algorithm to count the structures (Algorithm 1), we aim to use the conditions R1-R5 (R6-R8 are irrelevant for circular sequences) to develop a dynamic programming algorithm to count reduced structures.
Given a circular sequence $\psi = (\phi, \emptyset)$ of length $n$ and symmetry $p$, and for a proper divisor $b$ of $p$, let $q = p/b$ and define $\psi' = \psi/q$. The length of $\psi'$ is $r = n/q$. We define $R_{ij}$ as the number of reduced structures compatible with $\psi'[i,j-1]$, for $0 \le i < j \le r$. We consider two types of reduced structures compatible with $\psi'[i,j-1]$: those containing only internal reduced pairs and those containing at least one external reduced pair. In the first case, we observe that, without external reduced pairs, the conditions R1-R5 become identical to conditions L1-L3. Therefore, the number of reduced structures in the first case is $C_{ij}$, as defined in Equation (3.1). Let $H_{ij}$ be the number of reduced structures in the second case, which contain at least one external reduced pair. Then we can write

$$R_{ij} = C_{ij} + H_{ij} \tag{7.3}$$

for $0 \le i < j \le r$. The quantity $R_{0r}$ corresponds to $|\Gamma_{\psi'}|$.

We now consider the recursion relation for $H_{ij}$. We consider two cases: one in which site $i$ is not involved in any reduced pair and the other in which it is. The number of reduced structures in the first case is $H_{i+1,j}$. In the second case, site $i$ is either part of an internal reduced pair $\{i,k-1\}_{\mathrm{in}}$ or an external reduced pair $\{i,k-1\}_{\mathrm{ex}}$ for some $k \in [i+2,j)$. With $\{i,k-1\}_{\mathrm{in}}$, there must be at least one external reduced pair in the range $[k,j)$, since the internal reduced pair $\{i,k-1\}_{\mathrm{in}}$ cannot cover any external reduced pair (condition R5). Therefore, the number of reduced structures containing $\{i,k-1\}_{\mathrm{in}}$ is $C^b_{ik} H_{kj}$, where $C^b_{ik}$ is as defined in Equation (3.2). If site $i$ is part of an external reduced pair $\{i,k-1\}_{\mathrm{ex}}$, then there must not be any external reduced pair in the range $[k,j)$ (condition R4). Therefore, the number of reduced structures containing $\{i,k-1\}_{\mathrm{ex}}$ is $H^b_{ik} C_{kj}$, where $H^b_{ik}$ is the number of reduced structures compatible with $\psi[i..k-1]$ containing the external reduced pair $\{i,k-1\}_{\mathrm{ex}}$. We can therefore express

$$H_{ij} = H_{i+1,j} + \sum_{i < k \le j} \left( C^b_{ik} H_{kj} + H^b_{ik} C_{kj} \right) \tag{7.4}$$

for $0 \le i \le j \le r$ with base cases $H_{ii} = 0$ for $i \in [0,r]$.
For $0 \le i < j \le r$, $H^b_{ij} = 0$ if $\{\phi_i,\phi_{j-1}\} \notin B_{\mathrm{RNA}}$ and otherwise

$$H^b_{ij} = C_{i+1,j-1} + H_{i+1,j-1}. \tag{7.5}$$

In Equation (7.5), the first term corresponds to the cases with only internal reduced pairs in the range $[i+1,j-1)$ and the second term to those with at least one external reduced pair in this range. The algorithm for counting reduced structures is shown in Algorithm 5. The graphical representation of the recursion relations is shown in Figure 7.2. Similar to Algorithm 1, the time complexity of Algorithm 5 is cubic.

Algorithm 5 Counting reduced structures for circular sequences
Require: Circular sequence $\psi = (\phi, \emptyset)$ of length $r$.
Ensure: The size of $\Gamma_\psi$.
 1: Obtain $C_{ij}$ and $C^b_{ij}$ for $\phi$ (Algorithm 1)
 2: $H_{ij} \leftarrow 0$, $R_{ij} \leftarrow 0$, and $H^b_{ij} \leftarrow 0$ for all $0 \le i \le j \le r$
 3: $H_{ii} \leftarrow 0$ for all $i \in [0,r]$
 4: for $j \leftarrow 0$ to $r$ do
 5:     for $i \leftarrow j-1$ down to 0 do
 6:         if $\{\phi_i,\phi_{j-1}\} \in B_{\mathrm{RNA}}$ then
 7:             $H^b_{ij} \leftarrow C_{i+1,j-1} + H_{i+1,j-1}$
 8:         $H_{ij} \leftarrow H_{i+1,j}$
 9:         for $k \leftarrow i+1$ to $j$ do
10:             $H_{ij} \leftarrow H_{ij} + C^b_{ik} H_{kj} + H^b_{ik} C_{kj}$
11:         $R_{ij} \leftarrow C_{ij} + H_{ij}$
12: return $R_{0r}$

7.3 Counting reduced structures for multi sequences

We can also develop an algorithm analogous to the reduced structure algorithm (Algorithm 5) for multi sequences. For a generalized sequence $\psi$ of length $n$ and symmetry $p$, and for a proper divisor $b$ of $p$, let $q = p/b$ and $r = nb/p$. We count the reduced structures for $\psi' = \psi/q$, which has length $r$. We define $R_{ij}$ in the same way as for circular sequences. The only difference in the multi sequence case is that condition R8 requires that there be at least one external reduced pair. We therefore have the relationship

$$R_{ij} = H_{ij}, \tag{7.6}$$

for $0 \le i \le j \le r$.

Figure 7.2: Graphical representation of Equations (7.3), (7.4), and (7.5).

The recursion relation for $H_{ij}$ is expressed as

$$H_{ij} = \gamma_{i+1} H_{i+1,j} + \sum_{i < k \le j} \left( \gamma_k\, C^b_{ik} H_{kj} + H^b_{ik} C_{kj} \right), \tag{7.7}$$

where $H^b_{ij}$ is the number of reduced structures compatible with $\psi[i,j-1]$ containing the external reduced pair $\{i,j-1\}_{\mathrm{ex}}$.
The terms in Equation (7.7) are analogous to those in Equation (7.4), with the factors $\gamma_{i+1}$ and $\gamma_k$ ensuring that condition R6 is satisfied. The recursion relation for $H^b_{ij}$ can be expressed as

$$H^b_{ij} = \gamma_{i+1}\, \gamma_{j-1} \left( C_{i+1,j-1} + H_{i+1,j-1} \right) + H^e_{ij}, \tag{7.8}$$

where $H^e_{ij}$ is the number of reduced structures counted by $H^b_{ij}$ in which a nick is accessible from $\{i,j-1\}_{\mathrm{ex}}$. The recursion relation for $H^e_{ij}$ is expressed as

$$H^e_{ij} = \begin{cases} 0 & \text{if } i+1 \in D \text{ and } j-1 \in D, \\ H_{i+1,j-1} & \text{if either } i+1 \in D \text{ or } j-1 \in D, \\ \sum_{\substack{i < d < j \\ d \in D}} \left( C_{i+1,d} H_{d,j-1} + H_{i+1,d} C_{d,j-1} \right) & \text{otherwise.} \end{cases} \tag{7.9}$$

In Equation (7.9), the first condition is a prohibited case that violates condition R6 or R7. Under the second condition, either the nick $i+1$ or the nick $j-1$ is accessible from $\{i,j-1\}_{\mathrm{ex}}$. We use $H_{i+1,j-1}$ to ensure that there is at least one external reduced pair in the range, as required by condition R7. In the final case, we sum over every nick $d$ in the range $(i,j)$ that can be accessible from $\{i,j-1\}_{\mathrm{ex}}$. With $d$ accessible, condition R7 requires that there be at least one external reduced pair either in the range $(i,d)$ or in the range $[d,j-1)$; these two possibilities are accounted for by the two terms. The recursion relations among $H_{ij}$, $H^b_{ij}$, and $H^e_{ij}$ are shown in Figure 7.3. The quantities $H_{ij}$, $H^b_{ij}$, and $H^e_{ij}$ can be computed for all $0 \le i \le j \le r$ by dynamic programming. The pseudocode for the algorithm for counting reduced structures for a generalized sequence is shown in Algorithm 6. Similar to Algorithm 5, this algorithm runs in cubic time.

7.4 Reduced structure algorithm for distinguishable count

Sections 7.2 and 7.3 showed the methods to count reduced structures for circular and multi sequences. Since the number of reduced structures is directly related to that of periodic structures (Observation 9), we can use Algorithms 5 and 6 to solve Problem 1. For a generalized sequence $\psi = (\phi, D)$ of length $n$ and nontrivial symmetry $p$, we must compute the sizes of the $b$-periodic subsets $\mathrm{Fix}(b)$ for all proper divisors $b$ of $p$ (Equation (5.1)).
Observation 9 implies that we must compute the sizes of $\Gamma_{\psi'}$ where $\psi' = \psi/q$ ($q = p/b$). Let $b_{\max}$ be the largest proper divisor of $p$, and $q_{\min} = p/b_{\max}$. We note that, for any proper divisor $b$ of $p$, letting $q = p/b$, the generalized sequence $\psi/q$ is a sub-generalized-sequence of $\psi/q_{\min}$. This means the quantities $R_{ij}$ computed for $\psi/q_{\min}$ contain the numbers of reduced structures for all these sub-generalized-sequences. Specifically, letting $r_{\max} = n/q_{\min}$, imagine that we obtain the values of $R_{ij}$ for all $0 \le i \le j \le r_{\max}$ by Algorithm 5 or 6. Then, for a proper divisor $b$ of $p$ (letting $q = p/b$ and $r = n/q$), the size of $\Gamma_{\psi/q}$, and therefore the size of $\mathrm{Fix}(b)$, is given by $R_{0r}$. That is, we only need to compute $R_{ij}$ for $\psi/q_{\min}$. The pseudocode for this algorithm is shown in Algorithm 7.

Figure 7.3: Graphical representation of Equations (7.7), (7.8), and (7.9). For $H^e_{ij}$, the third case is assumed.

Algorithm 6 Counting reduced structures for multi sequences
Require: Multi sequence $\psi = (\phi, D)$ of length $r$.
Ensure: The size of $\Gamma_\psi$.
 1: Obtain $C_{ij}$ and $C^b_{ij}$ for $\psi$ (Algorithm 2)
 2: $R_{ij} \leftarrow 0$, $H_{ij} \leftarrow 0$, $H^b_{ij} \leftarrow 0$, and $H^e_{ij} \leftarrow 0$ for all $0 \le i < j \le r$
 3: $H_{ii} \leftarrow 1$ for all $0 \le i \le r$
 4: for $j \leftarrow 0$ to $r$ do
 5:     for $i \leftarrow j-2$ down to 0 do
 6:         if $\{\phi_i,\phi_{j-1}\} \in B_{\mathrm{RNA}}$ then
 7:             if $i+1 \in D$ and $j-1 \in D$ then
 8:                 $H^e_{ij} \leftarrow 0$
 9:             else if ($i+1 \in D$ and $j-1 \notin D$) or ($i+1 \notin D$ and $j-1 \in D$) then
10:                 $H^e_{ij} \leftarrow H_{i+1,j-1}$
11:             else
12:                 for $d$ in $D$ with $i < d < j$ do
13:                     $H^e_{ij} \leftarrow H^e_{ij} + (C_{i+1,d} H_{d,j-1} + H_{i+1,d} C_{d,j-1})$
14:             $H^b_{ij} \leftarrow \gamma_{i+1}\, \gamma_{j-1} (C_{i+1,j-1} + H_{i+1,j-1}) + H^e_{ij}$
15:         $H_{ij} \leftarrow \gamma_{i+1} H_{i+1,j}$
16:         for $k \leftarrow i+1$ to $j$ do
17:             $H_{ij} \leftarrow H_{ij} + \gamma_k\, C^b_{ik} H_{kj} + H^b_{ik} C_{kj}$
18:         $R_{ij} \leftarrow H_{ij}$
19: return $R_{0r}$

The following theorem states the result of applying this algorithm.

Theorem 10. With a computational model in which addition and multiplication take constant time and space, the Distinguishable Count problem can be solved by the reduced structure algorithm in $O(n^3)$ time and $O(n^2)$ space.

Proof. We refer to Algorithm 7.
The only difference between this algorithm and the central orbit algorithm (Algorithm 3) is the way in which the values $|\mathrm{Fix}(b)|$ are computed. Since this task is done in cubic time by Algorithm 5, the theorem holds.

Algorithm 7 Counting distinguishable structures (reduced structure algorithm)
Require: Generalized sequence $\psi = (\phi, D)$ of length $n$ and symmetry $p$.
Ensure: The size of $\Lambda_\psi$.
 1: $b_{\max} \leftarrow$ maximum proper divisor of $p$
 2: $q_{\min} \leftarrow p/b_{\max}$
 3: $r_{\max} \leftarrow n/q_{\min}$
 4: if $|D| \le 1$ then
 5:     Obtain $C_{ij}$ for $\phi$ (Algorithm 1)
 6:     Obtain $R_{ij}$ for $\phi/q_{\min}$ (Algorithm 5)
 7: else
 8:     Obtain $C_{ij}$ for $\psi$ (Algorithm 2)
 9:     Obtain $R_{ij}$ for $\psi/q_{\min}$ (Algorithm 6)
10: $F(h) \leftarrow 0$ for all $h \in [0,p)$
11: $F(0) \leftarrow C_{0n}$
12: for $h$ from 1 to $p-1$ do
13:     $b \leftarrow \gcd(h,p)$
14:     if $b \ne h$ then
15:         $F(h) \leftarrow F(b)$
16:     else
17:         $r \leftarrow nb/p$
18:         $F(h) \leftarrow R_{0r}$
19: $\Lambda \leftarrow 0$
20: for $h$ from 0 to $p-1$ do
21:     $\Lambda \leftarrow \Lambda + F(h)$
22: return $\Lambda / p$

Chapter 8
Applications of Developed Method

In Chapters 6 and 7 we developed two efficient algorithms for Problem 1. The theoretical work developed in this process can be applied to solve related problems. In this chapter, we show examples of ways in which these algorithms can be applied. We also show that, for some problems, the reduced structure algorithm is more efficient than the central orbit algorithm.

8.1 Unlabeled circular sequences

Analyzing unlabeled sequences is relevant in studying the combinatorial characteristics of structures. Waterman [43] derived a recursion relation for the number of structures compatible with an unlabeled sequence of length $n$. This amounts to removing the requirement that every pair forms a valid base pair. Let $C(n)$ be the number of structures for a single-stranded sequence of length $n$, that is, those satisfying conditions L1-L3. It can be recursively expressed as

$$C(n) = C(n-1) + \sum_{k=0}^{n-2} C(k)\, C(n-k-2) \tag{8.1}$$

for $n > 0$ with base case $C(0) = 1$.

We carry out the same analysis on distinguishable structures for unlabeled circular sequences.
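Equation (8.1) can be evaluated bottom-up in quadratic time. The following is a minimal Python sketch of that evaluation; the function name `count_unlabeled` is a hypothetical choice made here for illustration and does not appear in the thesis.

```python
def count_unlabeled(n_max):
    """Return the list [C(0), ..., C(n_max)] defined by Equation (8.1)."""
    C = [1] * (n_max + 1)  # base case C(0) = 1; later entries are overwritten
    for n in range(1, n_max + 1):
        # First term: C(n-1). Sum: k runs over 0..n-2, as in Equation (8.1).
        C[n] = C[n - 1] + sum(C[k] * C[n - k - 2] for k in range(n - 1))
    return C


if __name__ == "__main__":
    print(count_unlabeled(6))  # [1, 1, 2, 4, 9, 21, 51]
```

Because no minimum loop size is imposed in Equation (8.1), the values coincide with the Motzkin numbers.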
We define $\Lambda(n,p)$, for a divisor $p$ of $n$, as the number of distinguishable structures for the unlabeled sequence of length $n$ and symmetry $p$. Based on Equation (5.1), we have

$$\Lambda(n,p) = \frac{1}{p} \left( \sum_{h \in (0,p)} F(\gcd(h,p)\, n/p) + C(n) \right), \tag{8.2}$$

where $F(r)$, for $r > 0$, is the number of $b$-periodic structures for a circular sequence of length $n$ and symmetry $p$ such that $r = bn/p$. From Equation (6.1),

$$F(r) = C(r) + \sum_{k=0}^{r-2} (r-k-1)\, C(k)\, C(r-k-2). \tag{8.3}$$

We note that the case of $p = 2$ is special in that it allows for pairs of type $\{i,i+r\}$, which cannot be valid base pairs (see [14] for details). In the above analysis, we omitted the contribution from this case, though it can be handled separately. Dynamic programming can be used to compute $F(k)$ for $k \in [1,r]$ in $O(r^2)$ time. Therefore, $\Lambda(n,p)$ can be computed in $O(n^2)$ time.

8.2 Difference between the two algorithms

The two algorithms, the central orbit algorithm and the reduced structure algorithm, have the same time and space complexities for Problem 1. However, given a related but different problem, the time complexities of these algorithms can be dramatically different. In this section, we introduce two such problems.

8.2.1 Symmetric count problem

Given a sequence $\phi$ of length $n$, we can create a multi sequence $\psi = (\phi\phi, \{0,n\})$ by concatenating 2 copies of $\phi$. The symmetry of $\psi$ is 2. Imagine that we wish to compute the size of the 1-periodic subset $\mathrm{Fix}(1)$, that is, the subset of structures with symmetry 2. It takes both the central orbit algorithm and the reduced structure algorithm cubic time to compute $|\mathrm{Fix}(1)|$ (Algorithms 4 and 7). However, what if we wish to compute the size of the $\mathrm{Fix}(1)$ subset for $\psi = (\phi[i..j]\phi[i..j], \{0,n_{ij}\})$, where $n_{ij} = j-i+1$ is the length of $\phi[i..j]$, for all $0 \le i < j < n$? Formally, we define the following problem.

Problem 2: Symmetric Count
Input: A sequence $\phi$ of length $n$
Question: Compute the size of the 1-periodic subset $\mathrm{Fix}(1)$ of $\Omega_\psi$, where $\psi = (\phi[i..j]\phi[i..j], \{0,n_{ij}\})$, for all $0 \le i \le j < n$.
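Returning briefly to Section 8.1, Equations (8.2) and (8.3) can be checked numerically with a few lines of code. The sketch below is an illustration only, not one of the thesis algorithms: the names `count_unlabeled`, `periodic_count`, and `distinguishable_count` are hypothetical, and the result is returned as an exact `Fraction` because the formula omits the special pairs $\{i,i+r\}$ discussed above, so for even $p$ the value is not guaranteed to be an integer.

```python
from fractions import Fraction
from math import gcd


def count_unlabeled(n_max):
    """[C(0), ..., C(n_max)] from Equation (8.1)."""
    C = [1] * (n_max + 1)
    for n in range(1, n_max + 1):
        C[n] = C[n - 1] + sum(C[k] * C[n - k - 2] for k in range(n - 1))
    return C


def periodic_count(r, C):
    """F(r) from Equation (8.3), given the table C of Equation (8.1)."""
    return C[r] + sum((r - k - 1) * C[k] * C[r - k - 2] for k in range(r - 1))


def distinguishable_count(n, p):
    """Lambda(n, p) from Equation (8.2); p is assumed to divide n."""
    if n % p != 0:
        raise ValueError("p must divide n")
    C = count_unlabeled(n)
    total = C[n]  # contribution of h = 0
    for h in range(1, p):
        total += periodic_count(gcd(h, p) * n // p, C)
    return Fraction(total, p)


if __name__ == "__main__":
    print(distinguishable_count(6, 3))  # 19
    print(distinguishable_count(5, 5))  # 5
```

For odd $p$, where the omitted case does not arise, the examples above give integers, as expected of an orbit count.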
Using the central orbit algorithm, we know that it takes cubic time to compute $|\mathrm{Fix}(1)|$ for a generalized sequence $\psi$ for a particular $i$ and $j$. When we change the generalized sequence $\psi$ by changing $i$ and/or $j$, we must update the values of $C_{ij}$ and $C^b_{ij}$, which takes cubic time. Making this update for all $i$ and $j$ would then lead to a total computational complexity of $\Omega(n^5)$. The following proposition states this result formally.

Proposition 11. With a computational model in which addition and multiplication take constant time and space, the central orbit algorithm requires $\Omega(n^5)$ time to solve Problem 2.

Proof. For $0 \le i \le j < n$, let $\psi = (\phi[i..j]\phi[i..j], \{0,n_{ij}\})$. The graphical representation of this multi sequence is shown in Figure 8.1A. The length of $\psi$ is $2n_{ij}$. Based on Equation (6.2), the central orbit algorithm requires the values of $C_{kl}$ and $C^b_{kl}$ for $0 \le k \le l \le 2n_{ij}$. In dynamic programming, computing $C_{kl}$ for these values of $k$ and $l$ is essentially filling out the upper triangle of a $(2n_{ij}+1) \times (2n_{ij}+1)$ matrix, as shown in Figure 8.1B. Now consider incrementing $j$ to a new value $j' = j+1$ and computing the size of $\mathrm{Fix}(1)$ for $\psi = (\phi[i..j']\phi[i..j'], \{0,n_{ij'}\})$ (Figure 8.1C). The length of the new multi sequence is $2n_{ij'} = 2n_{ij}+2$. The matrix size is now $(2n_{ij}+3) \times (2n_{ij}+3)$. Two new rows and two new columns are inserted: one at row $n_{ij}+1$ and another at row $2n_{ij}+2$ (row numbers are 0-based). As shown in Figure 8.1D, the new matrix entry $C_{n_{ij}+1,n_{ij}+1}$ affects the entries $C_{kl}$ for $0 \le k \le n_{ij}+1$ and $n_{ij}+1 \le l \le 2n_{ij}+2$ due to the recursion relation of $C_{ij}$. Updating these quantities takes $\Omega(n^3)$ time. Since we must do this computation $\Omega(n^2)$ times, the total time complexity of this algorithm would be $\Omega(n^5)$.

Figure 8.1: Process of solving Problem 2 using the central orbit algorithm.

We now solve Problem 2 using the reduced structure algorithm.
The following proposition shows that, instead of quintic time, the reduced structure algorithm takes cubic time.

Proposition 12. With a computational model in which addition and multiplication take constant time and space, Problem 2 can be solved using the reduced structure algorithm in $O(n^3)$ time.

Proof. Given the input sequence $\phi$ of length $n$, define a generalized sequence $\psi = (\phi, \{0\})$. We can use Algorithm 6 to compute $R_{i,j+1}$ for all $0 \le i \le j \le n$ in cubic time. It suffices to notice that the value $R_{ij}$ is precisely the size of $\mathrm{Fix}(1)$ for the generalized sequence $\psi' = (\phi[i..j]\phi[i..j], \{0,n_{ij}\})$ (Equation (7.2)). We have therefore solved Problem 2 in $O(n^3)$ time.

8.2.2 Design problem

Another example where the two algorithms differ is when we look for a sequence which satisfies a certain condition on the number of 1-periodic structures. We define the following problem.

Problem 3: Symmetric Count Design
Input: Positive integer $n$
Question: Find a sequence $\phi$ of length $n$ such that $|\mathrm{Fix}(1)|$ is maximized for the generalized sequence $\psi$ consisting of two copies of $\phi$.

We first consider a naive approach of computing $|\mathrm{Fix}(1)|$ for every possible generalized sequence consisting of two copies of an $n$-long sequence. There are $4^n$ such generalized sequences. For each such generalized sequence, we must compute the size of $\mathrm{Fix}(1)$. Given a generalized sequence, computing $|\mathrm{Fix}(1)|$ using the central orbit algorithm takes cubic time. Therefore, computing it for all possible generalized sequences would take $\Omega(4^n n^3)$ time.

We now consider solving Problem 3 using reduced structures. We consider building a rooted tree of depth $n$, where each node has four child nodes corresponding to the bases $\{A,C,G,U\}$. We label each node with the RNA sequence which is the concatenation of the characters on the path from the root to that node. At depth $k$, the nodes are labeled with $k$-long sequences. The nodes at depth $n$ correspond to the $4^n$ sequences for which we need to compute $|\mathrm{Fix}(1)|$.
Starting with the root, we design an algorithm which visits the nodes in a breadth-first manner. For each node $u$ at depth $k$, we build and keep a $(k+1) \times (k+1)$ matrix $H_{ij}$ for the generalized sequence consisting of the sequence labeling the node. We do this by copying the matrix $H_{ij}$ for the parent node $v$ of $u$, adding a row and a column, and updating the $2k-1$ entries of the matrix. Using Algorithm 6, such an update takes $O(k^2) = O(n^2)$ time. We note that the number of nodes in the tree is $O(4^n)$. Since the update time at each node is $O(n^2)$, the overall time complexity would be $O(4^n n^2)$. We note that, if we were to use the central orbit algorithm to update the matrices, the update for each node would take $O(n^3)$ time.

8.3 Pair count

In this section, we discuss the problem of counting secondary structures that contain a particular pair. Often, we are interested in the probability of an RNA sequence forming a particular pair. An efficient algorithm has been developed to address this issue. We will extend the method to the cases where generalized sequences have nontrivial symmetries.

8.3.1 Pair count problem

Consider a circular sequence $\psi = (\phi, \emptyset)$ and the set $\Omega_\psi$ of secondary structures compatible with $\psi$. For a pair $\{i,j\}$ with $0 \le i < j \le n$, we define the subset

$$\Omega_\psi(i,j) = \{ s \in \Omega_\psi : \{i,j\} \in s \}. \tag{8.4}$$

Computing the size of $\Omega_\psi(i,j)$ is a variant of the problem of finding the pair probability, for which a cubic time algorithm has been found [20]. To begin, we assume that the quantities $C_{ij}$ and $C^b_{ij}$, as defined in Equations (3.1) and (3.2), respectively, have been computed for all $0 \le i \le j \le n$. We then define $\hat{C}^b_{ij}$ to be the number of substructures for the range $[0,i] \cup [j,n)$ assuming that there is a pair $\{i,j-1\}$. For $0 \le i < j \le n$, $\hat{C}^b_{ij} = 0$ if $\{\phi_i,\phi_{j-1}\} \notin B_{\mathrm{RNA}}$ and otherwise

$$\hat{C}^b_{ij} = C_{0i}\, C_{jn} + \sum_{k \in [0,i)} \hat{C}^{b1}_{kj}\, C_{k+1,i} \tag{8.5}$$

where

$$\hat{C}^{b1}_{ij} = \sum_{k \in [j,n)} C_{jk}\, \hat{C}^b_{i,k+1}.$$
(8.6)

In Equation (8.5), the first term corresponds to the structures with no pairs covering $\{i,j-1\}$, while the second term accounts for the cases where $\{i,j-1\}$ is accessible from another pair, of which one site is $k \in [0,i)$. The quantities $\hat{C}^b_{ij}$ and $\hat{C}^{b1}_{ij}$ can be computed in cubic time using a dynamic programming algorithm. The pair count is given by

$$|\Omega_\psi(i,j)| = \hat{C}^b_{i,j+1}\, C^b_{i,j+1}$$

for $0 \le i < j \le n$. Therefore, $|\Omega_\psi(i,j)|$ can be computed for all possible pairs in cubic time.

8.3.2 Distinguishable pair count

In this section we study the pair count problem in the context of the distinguishable count problem. Given a circular sequence $\psi = (\phi, \emptyset)$ of length $n$ and symmetry $p$ and a structure $s \in \Omega_\psi$, we consider the orbit $\mathrm{orb}_p(s)$ to be one distinguishable structure. If a pair $\{i,j\}$ is in $s$, this implies that every pair in the pair orbit $[\{i,j\}]_p$, where

$$[\{i,j\}]_p = \{ Y_{ht}(\{i,j\}) : h \in [0,p] \}$$

with $t = n/p$, is in at least one structure in $\mathrm{orb}_p(s)$. Therefore, when counting distinguishable structures, it is more reasonable to consider the pair orbit instead of an individual pair. We therefore define the quantity

$$\Lambda_\psi(i,j) = \{ \mathrm{orb}(s) : s \in \Omega_\psi \text{ and } s \cap [\{i,j\}]_p \ne \emptyset \}. \tag{8.7}$$

This leads to the following problem.

Problem 4: Distinguishable pair count
Input: A circular sequence $\psi = (\phi, \emptyset)$ of length $n$ and a pair $\{i,j\}$ with $0 \le i < j \le n$
Question: Compute $|\Lambda_\psi(i,j)|$

To solve Problem 4, we define the complement set of $\Lambda_\psi(i,j)$, i.e.,

$$\bar{\Lambda}_\psi(i,j) = \Lambda_\psi \setminus \Lambda_\psi(i,j) = \{ \mathrm{orb}(s) : s \in \Omega_\psi \text{ and } s \cap [\{i,j\}]_p = \emptyset \}. \tag{8.8}$$

We note that it is possible to compute the size of $\bar{\Lambda}_\psi(i,j)$ simply by prohibiting the pairs in $[\{i,j\}]_p$ when computing $|\Lambda_\psi|$. In other words, we impose additional conditions such that $C^b_{kl} = 0$ if $\{k,l-1\} \in [\{i,j\}]_p$. We can then use either Algorithm 3 or Algorithm 7, and the output is $|\bar{\Lambda}_\psi(i,j)|$. The desired count follows as $|\Lambda_\psi(i,j)| = |\Lambda_\psi| - |\bar{\Lambda}_\psi(i,j)|$.

Theorem 13. Problem 4 can be solved in $O(n^3)$ time and $O(n^2)$ space.

Proof. As shown above, Algorithm 3 can be used with additional conditions to solve Problem 4 within the specified time and space complexities.
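The definitions in Equations (8.7) and (8.8) can be checked by brute force on a very small example. The sketch below is not one of the thesis algorithms and runs in exponential time. It assumes an unlabeled circular sequence (every pair is treated as a valid base pair, as in Section 8.1, so every rotation by $n/p$ is a symmetry), it represents a pair directly by its two sites, and all function names are hypothetical.

```python
def noncrossing_structures(n):
    """All sets of disjoint, non-crossing pairs on sites 0..n-1."""
    def build(lo, hi):  # structures on the half-open range [lo, hi)
        if hi - lo < 2:
            return [frozenset()]
        out = list(build(lo + 1, hi))          # site lo is unpaired
        for k in range(lo + 1, hi):            # site lo is paired with site k
            for inner in build(lo + 1, k):
                for outer in build(k + 1, hi):
                    out.append(inner | outer | {(lo, k)})
        return out
    return build(0, n)


def rot_pair(pair, shift, n):
    """Rotate one pair by `shift` sites and return it with sorted sites."""
    a, b = ((x + shift) % n for x in pair)
    return (min(a, b), max(a, b))


def rotate(s, shift, n):
    """Rotate every pair of structure s by `shift` sites."""
    return frozenset(rot_pair(pr, shift, n) for pr in s)


def pair_count_check(n, p, i, j):
    """Return (|Lambda|, |Lambda(i,j)|, |complement of Lambda(i,j)|)."""
    t = n // p
    pair_orbit = {rot_pair((i, j), h * t, n) for h in range(p)}
    all_orbits, with_pair, without_pair = set(), set(), set()
    for s in noncrossing_structures(n):
        orbit = frozenset(rotate(s, h * t, n) for h in range(p))
        all_orbits.add(orbit)
        (with_pair if s & pair_orbit else without_pair).add(orbit)
    return len(all_orbits), len(with_pair), len(without_pair)


if __name__ == "__main__":
    print(pair_count_check(5, 5, 0, 1))  # (5, 3, 2)
```

In this example the second and third entries add up to the first, which is the complement relation that the approach above relies on.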
Theorem 13 implies that it takes $O(n^5)$ time to compute $|\Lambda_\psi(i,j)|$ for all possible pairs $\{i,j\}$.

8.4 Distinguishable count with pseudoknots

So far, we have dealt with structures without pseudoknots. There have been developments in extending the RNA secondary structure algorithms to structures with pseudoknots [1, 32, 3, 10, 5]. With the new algorithms for counting distinguishable structures, we now extend them to include a certain class of pseudoknots. In particular, we focus on two interacting sequences.

Consider a generalized sequence consisting of two sequences. An example of a secondary structure containing a pseudoknot is shown in Figure 8.2A. As shown in the figure, some lines indicating the base pairs cross each other. To avoid crossing lines, we change the representation to that shown in Figure 8.2B.

We do not allow all types of pseudoknots; rather, we focus on a class of pseudoknots studied by Chitsaz et al. [5]. The reason for dealing with this particular class of structures is that it contains naturally occurring structures and still allows efficient computation. Given two sequences $\phi$ and $\phi'$ of length $n$ and $n'$, we index the sequences by $i \in [0,n)$ and $i' \in [0,n')$. We call pairs within a strand, i.e., of the forms $\{i,j\}$ and $\{i',j'\}$, arcs. The pairs between the two strands, of the form $\{i,i'\}$, are called bonds. Figure 8.2B shows an example structure with arcs (in red) and bonds (in blue). An arc $\{i,j\}$ ($\{i',j'\}$) is said to cover a bond $\{k,k'\}$ if $i < k < j$ ($i' < k' < j'$). Two arcs $\{i,j\}$ and $\{i',j'\}$ are said to be interacting if they both cover the same bond. For two interacting arcs $\{i,j\}$ and $\{i',j'\}$, we say $\{i,j\}$ subtends $\{i',j'\}$ if every bond covered by $\{i',j'\}$ is covered by $\{i,j\}$.

Figure 8.2: Example of a structure containing pseudoknots. (A) Representation used by Dirks et al. [9]. (B) Representation used by Chitsaz et al. [5].

We define the conditions for the structures, each a set of bonds and arcs, as follows.

M1.
Conditions L1-L3 are satisfied among arcs. M2. Any two distinct bonds{i,i ′ } and{j,j ′ } satisfyi̸=j andi ′ ̸=j ′ , and ifi<j theni ′ >j ′ . M3. Any arc{i,j} ({i ′ ,j ′ }) and any bond{k,k ′ } satisfy{i,j}∩{k}=∅ ({i ′ ,j ′ }∩{k ′ }=∅). M4. For any interacting arcs{i,j} and{i ′ ,j ′ }, at least one subtends the other. M5. The structure contains at least one bond. Examples structures are shown in Figure 8.3. The computation of partition function over the structures specified above have been accomplished by Chitsaz et al. [5]. Here, we develop an analogous algorithm for the counting problem. Givenϕ of lengthn andϕ ′ of lengthn ′ , we define C iji ′ j ′ to be the number of the structures for the substringsϕ [i,j− 1] and ϕ ′ [i ′ ,j ′ − 1]. The desired quantity is then given byC 0n0n ′. In addition, we define the quantity C b iji ′ j ′ to be 65 (B) (C) (D) (E) (A) Figure 8.3: Example structures for two interacting sequences. (A) Arca ′ covers bondb whilea does not. (B) Arcs a and a ′ are interacting since they both cover at least one same bond. Moreover, they subtend each other. (C) Arca subtends arca ′ , but not the other way around. (D) Arca ′ subtends arcsa anda ′ . (E) Arcsa anda ′ are interacting, but neither subtends the other, violating condition M4. the number of structures counted byC iji ′ j ′ where sitesi andj ′ − 1 are part of either a bond or an arc that is interacting with another arc. Together with other mutually-recursive quantities, these quantities can be computed for all0≤ i≤ j≤ n and0≤ i ′ ≤ j ′ ≤ n ′ inO(n 6 ) time. The details of the computation are in Appendix B. The quantityC 0n0n ′ corresponds to the number of structures satisfying conditions M1-M5. 8.4.1 Distinguishablecount Ifϕ andϕ ′ are distinct, the structures counted byC 0n0n ′ are all distinguishable. If the two sequences are identical, however, the generalized sequenceψ consisting of two identical sequences has symmetry 2, and some structures become indistinguishable. 
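The covering, interacting, and subtending relations of Section 8.4 can be made concrete with a few predicates. The following is a minimal sketch under assumptions that are not part of the text: arcs are represented as tuples (i, j) with i < j, bonds as tuples (k, k'), and all function names are hypothetical. Condition M1 is omitted because conditions L1-L3 are defined elsewhere in the dissertation.

```python
from itertools import combinations


def covered_bonds(arc, bonds, strand):
    """Bonds covered by an arc; strand 0 compares k, strand 1 compares k'."""
    lo, hi = arc
    return {b for b in bonds if lo < b[strand] < hi}


def interacting(arc, arc_p, bonds):
    """True if the two arcs cover at least one common bond."""
    return bool(covered_bonds(arc, bonds, 0) & covered_bonds(arc_p, bonds, 1))


def subtends(arc, arc_p, bonds):
    """True if every bond covered by arc_p (strand 1) is also covered by arc (strand 0)."""
    return covered_bonds(arc_p, bonds, 1) <= covered_bonds(arc, bonds, 0)


def satisfies_m2(bonds):
    """Distinct bonds share no site, and i < j implies i' > j'."""
    for (i, ip), (j, jp) in combinations(bonds, 2):
        if i == j or ip == jp or (i < j) != (ip > jp):
            return False
    return True


def satisfies_m3(arcs, arcs_p, bonds):
    """No bond uses a site that is an endpoint of an arc on the same strand."""
    ends = {x for a in arcs for x in a}
    ends_p = {x for a in arcs_p for x in a}
    return all(k not in ends and kp not in ends_p for k, kp in bonds)


def satisfies_m4(arcs, arcs_p, bonds):
    """For every interacting pair of arcs, at least one subtends the other."""
    for a in arcs:
        ca = covered_bonds(a, bonds, 0)
        for ap in arcs_p:
            cap = covered_bonds(ap, bonds, 1)
            if ca & cap and not (ca >= cap or cap >= ca):
                return False
    return True


# Hypothetical example: three bonds ordered as M2 requires.
bonds = [(1, 9), (3, 7), (5, 5)]
print(satisfies_m4([(0, 6)], [(4, 8)], bonds))  # True: arc (0, 6) subtends arc (4, 8)
print(satisfies_m4([(0, 4)], [(4, 8)], bonds))  # False: interacting, neither subtends
```

The second call mirrors the situation described for Figure 8.3(E): the two arcs share the bond (3, 7), but each also covers a bond that the other does not.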
We wish to compute the number of distinguishable structures. Since the symmetry of the generalized sequence is 2, Burnside's lemma implies that

\[ |\Lambda_\psi| = \frac{|\mathrm{Fix}(0)| + |\mathrm{Fix}(1)|}{2}. \tag{8.9} \]

We note that $|\mathrm{Fix}(0)| = C_{0n0n}$, so it remains to compute $|\mathrm{Fix}(1)|$.

In order to compute $|\mathrm{Fix}(1)|$, we identify some properties of the structures in $\mathrm{Fix}(1)$. For a structure $s \in \mathrm{Fix}(1)$, an arc $\{i,j\} \in s$ implies another arc $\{i',j'\} \in s$ where $i = i'$ and $j = j'$. We can then distinguish two cases: (1) there is an arc $\{i,j\}$, with corresponding arc $\{i',j'\}$, such that the two are interacting with each other; (2) there is no such pair of interacting arcs.

In case (1), let $\{i,j\}$ be the arc that is interacting with the corresponding arc $\{i',j'\}$ such that any other arc $\{k,l\}$ interacting with its corresponding arc $\{k',l'\}$ satisfies $i < k < l < j$. The number of such 1-periodic structures is given by, summing over all possible $i$ and $j$,

\[ \sum_{0 \le i < j \le n} G_{i+1,j} \left( C_{0,i,j'+1,n} + C_{0,i}\, C_{j'+1,n} \right), \]

where $G_{ij}$ is the number of 1-periodic structures for two interacting sequences $\phi[i..j-1]$ and $\phi[i..j-1]$ containing at least one bond. A graphical representation of this relationship is shown in Figure 8.4.

Figure 8.4: Cases for 1-periodic structures containing an arc $\{i,j\}$ interacting with corresponding arc $\{i',j'\}$.

The recursion relation for $G_{ij}$ is given by

\[ G_{ij} = \sum_{i \le k < l \le j} \left[ C^{b}_{l,j,i',k'-1}\, C_{k+1,l} + G_{k+1,l} \left( C_{l+1,j,i',k'} + C_{l+1,j}\, C_{i'k'} \right) \right], \tag{8.10} \]

where $C^{b}_{iji'j'}$ is defined in Equation (B.6). The recursion relation for $G_{ij}$ is shown in Figure 8.5.

Figure 8.5: Recursion relation for $G_{ij}$. Here $i' = i$, $j' = j$, $k' = k$, and $l' = l$.

In case (2), if there is no arc interacting with its corresponding arc, then the number of such structures can be obtained by summing over the sites $i$ and $j$ involved in a bond or an interacting arc,

\[ \sum_{0 \le i < j \le n} C^{b}_{j,n,0,i'}\, C_{i+2,j}, \]

as shown in Figure 8.6.

Figure 8.6: 1-periodic structures which do not contain an arc interacting with its corresponding arc.
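As an aside, the use of Burnside's lemma in Equation (8.9) can be sanity-checked by brute force on a deliberately simplified model. The sketch below assumes structures made of bonds only (no arcs) between two copies of the same sequence, subject to condition M2 and M5, with Watson-Crick and G-U pairing; it further assumes that the non-trivial symmetry operation acts by exchanging the strands, so bond (i, k') maps to (k, i'). These modelling choices and all names are illustrative assumptions, not the $O(n^6)$ algorithm of the text.

```python
from itertools import combinations

# Assumption: Watson-Crick and G-U wobble pairs.
CAN_PAIR = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def bond_structures(seq):
    """All non-empty bond sets between two copies of seq that satisfy M2."""
    n = len(seq)
    cands = [(i, k) for i in range(n) for k in range(n) if (seq[i], seq[k]) in CAN_PAIR]
    out = []
    for r in range(1, n + 1):                  # at most one bond per site
        for subset in combinations(cands, r):
            ok = all(
                a[0] != b[0] and a[1] != b[1] and ((a[0] < b[0]) == (a[1] > b[1]))
                for a, b in combinations(subset, 2)
            )
            if ok:
                out.append(frozenset(subset))
    return out


def swap(s):
    """Exchange the two strands: bond (i, k') becomes (k, i')."""
    return frozenset((k, i) for i, k in s)


def count_by_burnside(structs):
    fix0 = len(structs)                               # identity fixes everything
    fix1 = sum(1 for s in structs if swap(s) == s)    # structures invariant under the swap
    return (fix0 + fix1) // 2


def count_orbits(structs):
    """Direct count of orbits {s, swap(s)}."""
    return len({frozenset((s, swap(s))) for s in structs})


structs = bond_structures("GC")
print(len(structs), count_by_burnside(structs), count_orbits(structs))  # 3 2 2
```

For the sequence GC the toy model has three structures, one of which is invariant under the strand exchange, so both counts give (3 + 1) / 2 = 2 distinguishable structures.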
Combining the two cases, we arrive at the following identity:

\[ |\mathrm{Fix}(1)| = \sum_{0 \le i < j \le n} G_{i+1,j} \left( C_{0,i,j'+1,n} + C_{0,i}\, C_{j'+1,n} \right) + \sum_{0 \le i < j \le n} C^{b}_{j,n,0,i'}\, C_{i+2,j}. \tag{8.11} \]

Therefore, we can compute $|\Lambda_\psi|$ in $O(n^6)$ time.

Chapter 9: Conclusion

In this dissertation, we worked on the problem of accounting for indistinguishable secondary structures, which arise for circular sequences and for multiple interacting sequences. We did this by formally defining the notions of symmetry and distinguishable secondary structures for generalized sequences. We relied on “rotations” of secondary structures, which are essentially the action of the cyclic group on the set of secondary structures. Each orbit under this action is then treated as a single distinguishable secondary structure, and its symmetry is the size of the stabilizer subgroup corresponding to that orbit. This framework gives a convenient formulation for the problem of counting distinguishable structures for generalized sequences, a problem that, to our knowledge, has not yet seen an efficient algorithm.

We addressed the problem of counting distinguishable structures by using elementary identities from group theory to identify certain subsets of structures, i.e., the periodic subsets, which are used to solve the problem. We characterized the periodic subsets by identifying the conditions that the structures in these subsets satisfy. We then extended the algorithm of [14], applying it to these periodic subsets, allowing us to compute their sizes. This leads to a cubic-time algorithm (the central orbit algorithm). We also developed another algorithm, the reduced structure algorithm, of equivalent time complexity to solve the same problem. While the two algorithms have identical time complexities, we showed that the reduced structure algorithm has dramatically better performance than the central orbit algorithm for the symmetric count problem.
The established mathematical framework and algorithms are general, and they can be used to solve other related problems. We used them to analyze the enumeration of distinguishable structures for unlabeled circular sequences, which is useful for combinatorial analysis of such quantities. We extended the pair count problem to define the distinguishable pair count problem and provided an efficient algorithm to solve it. Furthermore, we provided an efficient algorithm to count distinguishable structures for two identical sequences, including a certain class of pseudoknots.

We note that the issue raised by [9] regarding the minimum free energy problem in the presence of an entropic energy correction still remains. The challenge stems from accounting for such a global property as symmetry while using a dynamic programming scheme, which operates on local problems. We believe that our work, following the previous works [9, 14], contributes to the advancement in addressing the issue of symmetry in secondary structures.

Bibliography

[1] Tatsuya Akutsu. “Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots”. In: Discrete Applied Mathematics 104.1-3 (2000), pp. 45–62.
[2] Mirela Andronescu, Zhi Chuan Zhang, and Anne Condon. “Secondary structure prediction of interacting RNA molecules”. In: Journal of Molecular Biology 345.5 (2005), pp. 987–1001.
[3] Eckart Bindewald, Kirill Afonin, Luc Jaeger, and Bruce A Shapiro. “Multistrand RNA secondary structure prediction and nanostructure design including pseudoknots”. In: ACS Nano 5.12 (2011), pp. 9542–9551.
[4] James Chappell, Melissa K Takahashi, and Julius B Lucks. “Creating small transcription activating RNAs”. In: Nature Chemical Biology 11.3 (2015), pp. 214–220.
[5] Hamidreza Chitsaz, Raheleh Salari, S Cenk Sahinalp, and Rolf Backofen. “A partition function algorithm for interacting nucleic acid strands”. In: Bioinformatics 25.12 (2009), pp. i365–i373.
[6] Wendy D Cornell, Piotr Cieplak, Christopher I Bayly, Ian R Gould, Kenneth M Merz, David M Ferguson, David C Spellmeyer, Thomas Fox, James W Caldwell, and Peter A Kollman. “A second generation force field for the simulation of proteins, nucleic acids, and organic molecules”. In: Journal of the American Chemical Society 117.19 (1995), pp. 5179–5197.
[7] Rhiju Das, John Karanicolas, and David Baker. “Atomic accuracy in predicting and designing noncanonical RNA structure”. In: Nature Methods 7.4 (2010), pp. 291–294.
[8] Roumen A Dimitrov and Michael Zuker. “Prediction of hybridization and melting for double-stranded nucleic acids”. In: Biophysical Journal 87.1 (2004), pp. 215–226.
[9] Robert M Dirks, Justin S Bois, Joseph M Schaeffer, Erik Winfree, and Niles A Pierce. “Thermodynamic analysis of interacting nucleic acid strands”. In: SIAM Review 49.1 (2007), pp. 65–88.
[10] Robert M Dirks and Niles A Pierce. “A partition function algorithm for nucleic acid secondary structure including pseudoknots”. In: Journal of Computational Chemistry 24.13 (2003), pp. 1664–1677.
[11] Ricardo Flores, Sonia Delgado, María-Eugenia Gas, Alberto Carbonell, Diego Molina, Selma Gago, and Marcos De la Peña. “Viroids: the minimal non-coding RNAs with autonomous replication”. In: FEBS Letters 567.1 (2004), pp. 42–48.
[12] Severin O Gudima, Jinhong Chang, and John M Taylor. “Features affecting the ability of hepatitis delta virus RNAs to initiate RNA-directed RNA synthesis”. In: Journal of Virology 78.11 (2004), pp. 5737–5744.
[13] David Harvey and Joris Van Der Hoeven. “Integer multiplication in time O(n log n)”. In: Ann. Math. 193.2 (2021), pp. 563–617.
[14] Ivo L Hofacker, Christian M Reidys, and Peter F Stadler. “Symmetric circular matchings and RNA folding”. In: Discrete Mathematics 312.1 (2012), pp. 100–112.
[15] Farren J Isaacs, Daniel J Dwyer, and James J Collins. “RNA synthetic biology”. In: Nature Biotechnology 24.5 (2006), pp. 545–554.
[16] Magdalena A Jonikas, Randall J Radmer, Alain Laederach, Rhiju Das, Samuel Pearlman, Daniel Herschlag, and Russ B Altman. “Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters”. In: RNA 15.2 (2009), pp. 189–199.
[17] Natasa Jonoska and Nadrian C Seeman. DNA Computing: 7th International Workshop on DNA-Based Computers, DNA7, Tampa, FL, USA, June 10-13, 2001, Revised Papers. Vol. 2340. Springer, 2003.
[18] Xiang-duo Kong, Shi-zhen Zhu, Xiao-jun Gou, Xiao-ping Wang, Hong-ying Zhang, and Jin Zhang. “A circular RNA–DNA enzyme obtained by in vitro selection”. In: Biochemical and Biophysical Research Communications 292.4 (2002), pp. 1111–1115.
[19] Chang C Liu, Lei Qi, Julius B Lucks, Thomas H Segall-Shapiro, Denise Wang, Vivek K Mutalik, and Adam P Arkin. “An adaptor from translational to transcriptional control enables predictable assembly of complex regulation”. In: Nature Methods 9.11 (2012), pp. 1088–1094.
[20] Ronny Lorenz, Christoph Flamm, Ivo L Hofacker, and Peter F Stadler. “Efficient Computation of Base-pairing Probabilities in Multi-strand RNA Folding.” In: BIOINFORMATICS. 2020, pp. 23–31.
[21] Rune B Lyngsø, Michael Zuker, and CN Pedersen. “Fast evaluation of internal loops in RNA secondary structure prediction.” In: Bioinformatics (Oxford, England) 15.6 (1999), pp. 440–445.
[22] David H Mathews. “Revolutions in RNA secondary structure prediction”. In: Journal of Molecular Biology 359.3 (2006), pp. 526–532.
[23] David H Mathews, Matthew D Disney, Jessica L Childs, Susan J Schroeder, Michael Zuker, and Douglas H Turner. “Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure”. In: Proceedings of the National Academy of Sciences 101.19 (2004), pp. 7287–7292.
[24] David H Mathews, Jeffrey Sabina, Michael Zuker, and Douglas H Turner. “Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure”. In: Journal of Molecular Biology 288.5 (1999), pp. 911–940.
[25] John S McCaskill. “The equilibrium partition function and base pair binding probabilities for RNA secondary structure”. In: Biopolymers: Original Research on Biomolecules 29.6-7 (1990), pp. 1105–1119.
[26] Alexander S Mironov, Ivan Gusarov, Ruslan Rafikov, Lubov Errais Lopez, Konstantin Shatalin, Rimma A Kreneva, Daniel A Perumov, and Evgeny Nudler. “Sensing small molecules by nascent RNA: a mechanism to control transcription in bacteria”. In: Cell 111.5 (2002), pp. 747–756.
[27] Ali Nahvi, Narasimhan Sudarsan, Margaret S Ebert, Xiang Zou, Kenneth L Brown, and Ronald R Breaker. “Genetic control by a metabolite binding mRNA”. In: Chemistry & Biology 9.9 (2002), pp. 1043–1049.
[28] Ruth Nussinov and Ann B Jacobson. “Fast algorithm for predicting the secondary structure of single-stranded RNA”. In: Proceedings of the National Academy of Sciences 77.11 (1980), pp. 6309–6313.
[29] Gerry A Prody, John T Bakos, Jamal M Buzayan, Irving R Schneider, and George Bruening. “Autolytic processing of dimeric plant virus satellite RNA”. In: Science 231.4745 (1986), pp. 1577–1580.
[30] J Richter, M Mertig, W Pompe, and H Vinzelberg. “Low-temperature resistance of DNA-templated nanowires”. In: Applied Physics A 74.6 (2002), pp. 725–728.
[31] Jan Richter, Ralf Seidel, Remo Kirsch, Michael Mertig, Wolfgang Pompe, Jens Plaschke, and Hans K Schackert. “Nanoscale palladium metallization of DNA”. In: Advanced Materials 12.7 (2000), pp. 507–510.
[32] Elena Rivas and Sean R Eddy. “A dynamic programming algorithm for RNA structure prediction including pseudoknots”. In: Journal of Molecular Biology 285.5 (1999), pp. 2053–2068.
[33] Itiroo Sakai. “Syntax in universal translation”. In: Proceedings of the International Conference on Machine Translation and Applied Language Analysis. 1961.
[34] John SantaLucia. “A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics”. In: Proceedings of the National Academy of Sciences 95.4 (1998), pp. 1460–1465.
[35] Martin J Serra and Douglas H Turner. “[11] Predicting thermodynamic properties of RNA”. In: Methods in Enzymology. Vol. 259. Elsevier, 1995, pp. 242–261.
[36] Jiří Šponer, Petr Jurecka, and Pavel Hobza. “Accurate interaction energies of hydrogen-bonded nucleic acid base pairs”. In: Journal of the American Chemical Society 126.32 (2004), pp. 10142–10151.
[37] Petr Šulc, Flavio Romano, Thomas E Ouldridge, Jonathan PK Doye, and Ard A Louis. “A nucleotide-level coarse-grained model of RNA”. In: The Journal of Chemical Physics 140.23 (2014), 06B614_1.
[38] Daniel Svozil, Pavel Hobza, and Jiri Sponer. “Comparison of intrinsic stacking energies of ten unique dinucleotide steps in A-RNA and B-DNA duplexes. Can we determine correct order of stability by quantum-chemical calculations?” In: The Journal of Physical Chemistry B 114.2 (2010), pp. 1191–1203.
[39] Martin Tabler and Mina Tsagris. “Viroids: petite RNA pathogens with distinguished talents”. In: Trends in Plant Science 9.7 (2004), pp. 339–348.
[40] TS Wadkins and MD Been. “Ribozyme activity in the genomic and antigenomic RNA strands of hepatitis delta virus”. In: Cellular and Molecular Life Sciences CMLS 59.1 (2002), pp. 112–125.
[41] Amy E Walter, Douglas H Turner, James Kim, Matthew H Lyttle, Peter Müller, David H Mathews, and Michael Zuker. “Coaxial stacking of helixes enhances binding of oligoribonucleotides and improves predictions of RNA folding.” In: Proceedings of the National Academy of Sciences 91.20 (1994), pp. 9218–9222.
[42] Amy E Walter, Ming Wu, and Douglas H Turner. “The stability and structure of tandem GA mismatches in RNA depend on closing base pairs”. In: Biochemistry 33.37 (1994), pp. 11349–11354.
[43] Michael S Waterman. “Secondary structure of single-stranded nucleic acids”. In: Adv. Math. Suppl. Studies 1 (1978), pp. 167–212.
[44] Michael S Waterman and Temple F Smith. “RNA secondary structure: A complete mathematical analysis”. In: Mathematical Biosciences 42.3-4 (1978), pp. 257–266.
[45] Laura Wesson and David Eisenberg. “Atomic solvation parameters applied to molecular dynamics of proteins in solution”. In: Protein Science 1.2 (1992), pp. 227–235.
[46] Wade Winkler, Ali Nahvi, and Ronald R Breaker. “Thiamine derivatives bind messenger RNAs directly to regulate bacterial gene expression”. In: Nature 419.6910 (2002), pp. 952–956.
[47] Tianbing Xia, John SantaLucia Jr, Mark E Burkard, Ryszard Kierzek, Susan J Schroeder, Xiaoqi Jiao, Christopher Cox, and Douglas H Turner. “Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs”. In: Biochemistry 37.42 (1998), pp. 14719–14735.
[48] Bernard Yurke, Andrew J Turberfield, Allen P Mills, Friedrich C Simmel, and Jennifer L Neumann. “A DNA-fuelled molecular machine made of DNA”. In: Nature 406.6796 (2000), pp. 605–608.
[49] Joseph N Zadeh, Conrad D Steenberg, Justin S Bois, Brian R Wolfe, Marshall B Pierce, Asif R Khan, Robert M Dirks, and Niles A Pierce. “NUPACK: analysis and design of nucleic acid systems”. In: Journal of Computational Chemistry 32.1 (2011), pp. 170–173.
[50] Michael Zuker. “Mfold web server for nucleic acid folding and hybridization prediction”. In: Nucleic Acids Research 31.13 (2003), pp. 3406–3415.
[51] Michael Zuker. “On finding all suboptimal foldings of an RNA molecule”. In: Science 244.4900 (1989), pp. 48–52.
[52] Michael Zuker and David Sankoff. “RNA secondary structures and their prediction”. In: Bulletin of Mathematical Biology 46.4 (1984), pp. 591–621.
[53] Michael Zuker and Patrick Stiegler. “Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information”. In: Nucleic Acids Research 9.1 (1981), pp. 133–148.

Appendices

Appendix A: Proofs

A.1 Proof of Theorem 3

Proof. Assume $s \in \mathrm{Fix}(b)$.
Proposition 2 shows that $s$ consists of internal and/or external orbits and is orbit compatible with $\psi$. We show that $s$ satisfies conditions P2-P5.

Condition P1: This follows immediately from the definition of pair orbits.

Condition P2: Assume there are two orbits $(i,j)_x$ and $(k,l)_y$ such that $(i,j)_x \neq (k,l)_y$. If $\{i,j\} \cap \{k,l\} \neq \emptyset$, it would imply that there are two pairs, one from each orbit, which are distinct yet share a site, a violation of L2.

Condition P3: If P3 is violated, then $s$ contains pair orbits $(i,j)_{\mathrm{in}}$ (or $(i,j)_{\mathrm{ex}}$) and $(k,l)_{\mathrm{in}}$ (or $(k,l)_{\mathrm{ex}}$) such that $i < k < j < l$. We have four cases. If $(i,j)_{\mathrm{in}}$ and $(k,l)_{\mathrm{in}}$ are in $s$, then $\{i,j\}$ and $\{k,l\}$ are in $s$, and $s$ violates L3. If $(i,j)_{\mathrm{ex}}$ and $(k,l)_{\mathrm{ex}}$ are in $s$, then pairs $\{i+r, j\}$ and $\{k+r, l\}$ are in $s$, with $j < l < i+r < k+r$, and $s$ violates L3. If $(i,j)_{\mathrm{ex}}$ and $(k,l)_{\mathrm{in}}$ are in $s$, then pairs $\{i+r, j\}$ and $\{k,l\}$ are in $s$, with $k < j < l < i+r$, and $s$ violates L3. We reach a similar contradiction for the case where $(i,j)_{\mathrm{in}}$ and $(k,l)_{\mathrm{ex}}$ are in $s$.

Condition P4: If P4 is violated, then $s$ contains two external orbits $(i,j)_{\mathrm{ex}}$ and $(k,l)_{\mathrm{ex}}$ such that $i < k$ and $j < l$. As we have already shown (P3), this implies $i < j < k < l$. Consider two pairs $\{i+r, j\}$ and $\{k+r, l\}$ in $s$. Since $j - i < r$ and $l - k < r$, we have $j < l$ […] $1$ as L4 is relevant only when there are at least two nicks. If $b = p$, then every pair orbit can be thought of as an internal orbit containing only one pair, and P6 is equivalent to L4. For a proper divisor $b$ of $p$, P8 implies there is an external orbit $(i,j)_{\mathrm{ex}}$ in $s$. The pairs in this orbit ensure that L4 is satisfied between any two nicks $d < d'$ in $\{Y^{hr}(0) : h \in [0,q)\}$, where $q = p/b$. Consider two nicks in $D \cap [0,r)$. By P6, there is at least one orbit $(i,j)_x$ such that $|\{i,j\} \cap [d,d')| = 1$. If $x = \mathrm{in}$, then L4 is satisfied between any two nicks in $\{Y^{hr}(d) : h \in [0,q)\} \cup \{Y^{hr}(d') : h \in [0,q)\}$. If there is no such internal orbit, then there is one external orbit $(i,j)_{\mathrm{ex}}$ such that $|\{i,j\} \cap [d,d')| = 1$. This implies that either $i < d \le j < d'$ or $d \le i < d' \le j$.
The first case implies that L4 is satisfied between any two nicks in $\{Y^{hr}(d) : h \in [0,q)\} \cup \{Y^{hr}(d') : h \in [0,q)\}$ except between $Y^{hr}(d)$ and $Y^{h'r}(d)$ for any $h, h' \in [0,q)$. However, since $d$ is accessible from $(i,j)_{\mathrm{ex}}$, P7 implies that there is another external orbit $(k,j)_{\mathrm{ex}}$ covered by $(i,j)_{\mathrm{ex}}$. This implies that L4 is satisfied between $Y^{hr}(d)$ and $Y^{h'r}(d)$ for any $h, h' \in [0,q)$. Therefore, L4 is satisfied for any two nicks in $D$.

Appendix B: Counting structures for two interacting sequences

Given $\phi$ of length $n$ and $\phi'$ of length $n'$, we define $C_{iji'j'}$ to be the number of the structures for the substrings $\phi[i, j-1]$ and $\phi'[i', j'-1]$ satisfying conditions M1-M5. The graphical representations of the rest of the quantities are shown in Figure B.1.

The recursion relations among the quantities are given as follows.

\[ C_{iji'j'} = \sum_{k \in [i,j]} C_{ik} \left( C^{a1}_{kji'j'} + C^{a3}_{kji'j'} \right) \tag{B.1} \]
\[ C^{a1}_{iji'j'} = \sum_{k' \in (i',j']} C_{k'j'} \left( C^{b1}_{ijk'j'} + C^{b3}_{ijk'j'} \right) \tag{B.2} \]
\[ C^{a2}_{iji'j'} = \sum_{k \in [i,j]} C_{ik} \left( C^{b1}_{kji'j'} + C^{b2}_{kji'j'} \right) \tag{B.3} \]
\[ C^{a3}_{iji'j'} = \sum_{k' \in (i',j']} C_{k'j'} \left( C^{b2}_{ijk'j'} + C^{b4}_{ijk'j'} \right) \tag{B.4} \]
\[ C^{a4}_{iji'j'} = \sum_{k \in [i,j]} C_{ik} \left( C^{b3}_{kji'j'} + C^{b4}_{kji'j'} \right) \tag{B.5} \]
\[ C^{b}_{iji'j'} = C^{b1}_{iji'j'} + C^{b2}_{iji'j'} + C^{b3}_{iji'j'} + C^{b4}_{iji'j'} \tag{B.6} \]
\[ C^{b1}_{iji'j'} = g_{i,j'-1} \left( C_{i+1,j}\, C_{i',j'-1} + C_{i+1,j,i',j'-1} \right) \tag{B.7} \]
\[ C^{b2}_{iji'j'} = C^{e2}_{iji'j'} + \sum_{\substack{k \in (i,j] \\ k' \in (i',j')}} C^{e2}_{ikk'j'}\, C^{b}_{kji'k'} \tag{B.8} \]
\[ C^{b3}_{iji'j'} = C^{e3}_{iji'j'} + \sum_{\substack{k \in (i,j] \\ k' \in (i',j')}} C^{e3}_{ikk'j'}\, C^{b}_{kji'k'} \tag{B.9} \]
\[ C^{b4}_{iji'j'} = C^{e4}_{iji'j'} + \sum_{\substack{k \in (i,j] \\ k' \in (i',j')}} C^{e4}_{ikk'j'}\, C^{b}_{kji'k'} \tag{B.10} \]
\[ C^{e2}_{iji'j'} = \sum_{k \in (i,j)} g_{ik}\, C^{a2}_{i+1,k,i',j'}\, C_{k+1,j} \tag{B.11} \]
\[ C^{e3}_{iji'j'} = \sum_{k' \in (i',j')} g_{k'-1,j'-1}\, C^{a1}_{i,j,k',j'-1}\, C_{i',k'-1} \tag{B.12} \]
\[ C^{e4}_{iji'j'} = \sum_{k' \in (i',j')} g_{k'-1,j'-1}\, C^{a3}_{i,j,k',j'-1}\, C_{i',k'-1} + \sum_{k \in (i,j)} g_{ik}\, C^{a4}_{i+1,k,i',j'}\, C_{k+1,j} - \sum_{\substack{k \in (i,j) \\ k' \in (i',j')}} g_{ik}\, g_{k'-1,j'-1}\, C_{i+1,k,k',j'-1}\, C_{k+1,j}\, C_{i',k'-1} \tag{B.13} \]

Here, $g_{ij} = 1$ if $\{\phi_i, \phi_j\} \in B_{\mathrm{RNA}}$ and 0 otherwise. The quantities $g_{i'j'}$ and $g_{ii'}$ are defined analogously. The graphical representations of the above recursion relations are shown in Figures B.2 and B.3. The above quantities can be computed for all possible values of $i$, $j$, $i'$, and $j'$ in $O((n+n')^6)$ time.

Figure B.1: The quantities defined for computing the structures for two interacting sequences. The quantity $C_{iji'j'}$ is defined as the number of structures for two substrings $\phi[i..j-1]$ and $\phi'[i'..j'-1]$. The other quantities are defined analogously, with additional conditions indicated by the line colors. Blue lines indicate arcs and red lines indicate interacting bonds. Superscripts containing $a$ mean that either $i$ or $j'-1$ is part of an arc or interacting bond (4 cases). Superscripts containing $b$ indicate that both $i$ and $j'-1$ are part of an arc or interacting bond (4 cases). The quantity $C^{b}_{iji'j'}$ is the sum of all 4 cases. Superscripts containing $e$ have the same conditions as those containing $b$, with the additional condition that every arc is covered by the interacting bond containing either $i$ or $j'-1$.

Figure B.2: The recursion relations for $C_{iji'j'}$, $C^{a1}_{iji'j'}$, $C^{a2}_{iji'j'}$, $C^{a3}_{iji'j'}$, $C^{a4}_{iji'j'}$, and $C^{b}_{iji'j'}$.

Figure B.3: The recursion relations for $C^{b1}_{iji'j'}$, $C^{b2}_{iji'j'}$, $C^{b3}_{iji'j'}$, $C^{b4}_{iji'j'}$, $C^{e2}_{iji'j'}$, $C^{e3}_{iji'j'}$, and $C^{e4}_{iji'j'}$.
Abstract
RNA secondary structures are essential abstractions for understanding spatial folding behaviors of those macromolecules. Many algorithms to solve problems over secondary structures involve a common dynamic programming setup to exploit the property that secondary structures can be decomposed into substructures. Dirks et al. (2007) noted that this setup cannot directly address an issue of distinguishability among secondary structures, which arises for classes of sequences that admit non-trivial symmetry – including circular sequences and interacting sequences. We examine the problem of counting distinguishable secondary structures. Drawing from elementary results in group theory, we identify useful subsets of secondary structures. We then build on an algorithm due to Hofacker et al. (2012) for computing the sizes of these subsets of possible structures. The result is a cubic time algorithm to count the distinguishable structures compatible with a given circular sequence. We also develop another algorithm for the same problem which has certain advantages for some related problems. This general approach may be employed to solve similar problems for different types of RNA sequences and with different constraints on structures.
Asset Metadata
Creator: Nakajima, Masaru (author)
Core Title: Analysis and algorithms for distinguishable RNA secondary structures
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Physics
Degree Conferral Date: 2023-05
Publication Date: 02/22/2023
Defense Date: 01/19/2023
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: distinguishability, OAI-PMH Harvest, RNA, secondary structure, symmetry
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Haas, Stephan (committee chair), Di Felice, Rosa (committee member), Fraser, Scott (committee member), Nakano, Aiichiro (committee member), Smith, Andrew (committee member)
Creator Email: masarun@usc.edu, masarun23@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC112755774
Unique identifier: UC112755774
Identifier: etd-NakajimaMa-11481.pdf (filename)
Legacy Identifier: etd-NakajimaMa-11481
Document Type: Dissertation
Rights: Nakajima, Masaru
Internet Media Type: application/pdf
Type: texts
Source: 20230228-usctheses-batch-1008 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu