Close
The page header's logo
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected 
Invert selection
Deselect all
Deselect all
 Click here to refresh results
 Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
00001.tif
(USC Thesis Other) 

00001.tif

doctype icon
play button
PDF
 Download
 Share
 Open document
 Flip pages
 More
 Download a page range
 Download transcript
Copy asset link
Request this asset
Transcript (if available)
Content INFORMATION TO USERS This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type o f computer printer. The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion. Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand comer and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back o f the book. Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6” x 9” black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order. UMI A Bell & Howell Information Company 300 North Zeeb Road, Ann Arbor MI 48106-1346 USA 313/761-4700 800/521-0600 Com binatorial P roblem s from M apping and R eading DNA Sequences by Daniela R en ate Martin A Dissertation Presented to th e FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of th e Requirements for the Degree DOCTOR O F PHILOSOPHY (Applied Mathematics) May 1995 UMI N um ber: 9621668 UMI Microform 9621668 Copyright 1996, by UMI Company. A ll rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. UMI 300 North Z eeb Road Ann Arbor, M I 48103 Daniela Renate M artin Michael W aterm an Combinatorial Problem s from M apping and Reading DNA Sequences T his thesis considers combinatorial problems on restriction maps and DNA sequencing which can be represented involving alternating Eulerian paths in edge-bicolored graphs. We introduce order transformations, which transform an alternating Eulerian path into another one. The key is Pevzner’s theorem : Every two alternating Eulerian paths in an edge-bicolored graph can be transformed in to each other by a sequence of order transformations. W hen restriction sites are modeled by a random process, the number of solutions to the double digest problem (DDP) increases exponentially with the length of the DNA. Cassette transform ations define equivalence classes on th e set of solutions in the case of no coincident cuts. Pevzner completely characterized the solutions to the DD P in the case of no coincident cut sites by associating solutions with alternating Eulerian paths in an edge-bicolored graph. In P a rt I we extend cassettes and their transformations to the case al­ lowing coincident cut sites. Solutions are again characterized by associating them with alternating E ulerian cycles in a graph. Part II considers problem s related to th e question ab o u t the num ber of equivalence classes in th e case of no coincident cut sites. For example, we take a look a t possible g rap h topologies. Also, we consider the probability for getting o n e of the graphs in a very specific fashion (‘walking’). In P art I I I we consider th e problem of uniquely reconstructing a sequence from its 1-tuple content. W e first prove Poisson approxim ation theorems for the num ber of long repeats in a sequence, with letters chosen uniformly at random from a finite alphabet, according to a not necessarily uniform distribution. We will introduce the Ukkonen transform ation and call a se­ quence u n iq u ely /-recoverable if no such transform ations, except possibly trivial ones, are possible. W e get an approxim ation theorem for the proba­ bility of a random ly chosen sequence to b e /-recoverable from substrings of length I « lo g 1/p(m2/ 2 ), a n d we give a ra te of convergence. 2 UNIVERSITY OF SOUTHERN CALIFORNIA THE GRADUATE SCHOOL UNIVERSITY PARK LOS ANGELES, CALIFORNIA 90007 This dissertation, written by t 'Danie 1 aa Re nate: t#_ Mart in...................................... under the direction of hex Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of re­ quirements for the degree of DOCTOR OF PHILOSOPHY Dean of Graduate Studies D a te 8,... 199 5 DISSERTATION COMMITTEE Chairperson Chairperson D edication I dedicate this dissertation to the most wonderful people in my life: my husband and my parents. A cknow ledgem ents I am deeply grateful for all the help and support from my advisor Mike W aterman. Also, I would like to thank Gesine Reinert for a great collaboration on P art III, Pavel Pevzner, who inspired most of this thesis, for very helpful comments, my committee members Richard Arratia, Simon Tavare and Doug Ierardi for reading a preliminary draft of this thesis, my fellow students Fengzhu Sun and Ethan Port for being friends, and Kengee Lewis for patient help with the typesetting. C ontents D ed ication ii A ckn ow led gem ents iii L ist O f F igures vi 1 Introd uction 1 1 Equivalence Classes for the Double D igest Problem w ith Coincident Cut Sites 13 2 In trod uction to Part I 14 3 P relim inaries 17 3.1 D efinitions................................................................................................ 17 3.2 A lternating Eulerian C y c le s .............................................................. 18 3.3 The Double Digest P r o b le m .......................................................................... 18 4 T he D ou b le D ig e st P rob lem A n d E quivalence C lasses O f S olu tion s w ith ou t C oincident C u t S ites 22 4.1 Restriction Maps and Border Block Graphs ............................................ 22 4.2 Cassette Transformations and Restriction M a p s ......................... 23 5 T he D ou b le D ig e st P rob lem A n d E quivalence C lasses O f S olu tion s w ith C oincident C ut S ites 29 5.1 The Extended Border Block G r a p h ............................................................. 29 5.2 The Extended Cassette Transformations of Restriction M a p s ...................32 II The Num ber of Equivalence Classes 6 Introd uction to Part II 38 39 7 W alking 42 7.1 Starting, Extending and Ending the G r a p h ............................................... 43 7.2 The ‘W alking-Probability’ .............................................................................. 49 7.2.1 The ‘W alking-Probability’ for an infinitely long DNA strand . 49 7.2.2 ‘W alking-Probability’ for a finitely long DNA strand .................. 54 8 G raph T op ologies 56 9 T h e E x p ected N u m b er o f B order B lock G raphs 60 9.1 Estim ating the Expected Number of Border Block G r a p h s ..................... 60 III Poisson Approxim ation for Long R epeats in a Ran­ dom Sequence w ith Application to Sequencing by Hy­ bridization 66 10 In trod uction to Part III 67 11 R eview o f a P oisson A p proxim ation T h eorem 69 12 A P oisson A p proxim ation for Long R ep ea ts 71 12.1 N o ta tio n ............................................................................................................... 71 12.2 Poisson A pproxim ation.................................................................................... 72 13 U nique R ecoverab ility 84 13.1 Notation and Ukkonen Transform ations...................................................... 84 13.2 Unique Recoverability .................................................................................... 86 13.3 C onclusion............................................................................................................ 90 v List O f Figures 2.1 Schematic of a Restriction M a p .................................................................... 16 3.1 Order Transformations: a) shows an order reflection and b) shows an order e x c h a n g e ................................................................................................. 19 3.2 M ultiple Solutions ........................................................................................... 21 4.1 Border Block Graph ........................................................................................ 24 4.2 Cassette E x c h a n g e ............................................................................................ 26 4.3 Cassette R e fle c tio n ............................................................................................ 27 5.1 Extended Border Block G r a p h ........................................................................ 31 5.2 Examples of Extended C assettes.................................................................... 34 5.3 E x ch an g es............................................................................................................ 35 5.4 R eflections............................................................................................................. 36 7.1 W a lk in g ................................................................................................................ 43 7.2 Walking for r o d d ............................................................................................... 44 7.3 Starting the graph ............................................................................................ 45 7.4 Extending the graph: (a) i odd and (b) i e v e n ......................................... 46 7.5 Ending the graph: (a) r odd and (b) r even .......................................... 48 8.1 ‘tree skeleton’: (a) shows a graph having a ‘tree skeleton’ and (b) shows an example without o n e ....................................................................... 57 12.1 Two cases for overlapping fragments; / — 1 = 4. In the top picture, a, = dj = dj+ 1 = cij+ 2 = aJ+3; hence e5 = 1. In the bottom picture, di — dj — Qj+ 2 find fli+i = flj+i = Oj+3) hence e3 = 2................................. 74 12.2 The “worst” case for overlapping fragments; / — 1 = 4 . We have d{ = dj = Uj+i = Uj+2 = ® i' = = ®j'+i = ®i'+ 2 ~ ®j'+3i hence eg — 1. 77 12.3 The type of overlap if only two segments may overlap; / — 1 = 4 . We have 3 equations of length 3 and 2 equations of length 2; hence e3 = 3, e2 — 2........................................................................................................ 79 vi C hapter 1 Introduction In the first part of this introduction, we will give an overview of the the biological framework relevant for this thesis. Then we will go into more details about the three separate parts of this work. The genome of all living organisms consists of DNA, a very long two-stranded chemical polymer. Each DNA strand is composed of four different nucleotides, that are held in sequence joined to a sugar-phosphate backbone to form a very long chain. These four nucleotides are symbolized by A, G, C and T, th at stand for the four bases adenine, guanine, cytosine and thymine. In its usual two-stranded form, the strands are joined to each other by hydrogen bonds between the nucleotides- A basepairs with T and G basepairs with C. Each DNA molecule contains many genes, which are its functional units. Each gene codes for the production of a protein, and all organisms are largely composed of proteins. Humans have approximately three billion nucleotide pairs in the genome that contains about 100,000 genes on 23 pairs of chromosomes, where each chromosome is a separate DNA molecule. One first step to understand the human genome would be to determine the order and spacings of all genes in the genome. This information would give us a gene map. Maps involve measuring the positions of easily observed landmarks. There are different kind of maps, depending on the measuring system and the landmarks. One kind of map, a so-called genetic linkage map, describes the arrangements of genes and DNA markers on the basis of patterns of their inheritance. Genes tend to be inherited together (that is linked) if they are close on such a map. The order of genes on a chromosome measured by a linkage map is the same as the order in a physical map, but there is no constant factor th at relates physical and genetic distance. 1 We will not consider genetic linkage maps in this thesis. Physical m aps specify distances between landmarks along a chromosome. Ideally, the distances axe mea­ sured in nucleotides. This is for example guaranteed the case when we sequence the DNA. In Part III of this work, we will consider m athem atical problems arising by using a special m ethod of sequencing, namely sequencing by hybridization. More details about this m ethod will also be given at the end of this introduction. Sequencing gives us the highest resolution map. But this is not always the most desirable since there is so much effort required to sequence the DNA, and we are interested in efficient ways of doing it. In addition a large map with much detailed information is hard to use. Assume that we have one map of the USA which contains all streets in the USA. We would have a difficult tim e using this map. For example, if we would look for a specific street somewhere in the USA, all the complete details of the streets, many with the same name, are counterproductive. So it is desirable to have also lower resolution maps. In general, the lower resolution maps help generate a higher resolution map, in our USA example, and make their usage easier. If one wants to make such a map, one approach probably would be to first map each state separately. And in each state, one would first locate all cities, and then, for each city, one would start m apping the streets. The same strategy is applied in mapping the genome. Probably the most used physical map is the so-called restriction map. The landmarks in such a physical map are the cleavage sites of so called restriction enzymes. These restriction enzymes cut the DNA always at specific patterns in the DNA. The maps can be calibrated in nucleotides by measuring the sizes of the DNA fragments produced when the DNA is cut with such a restriction enzyme. We will give much more details about restriction maps and the problems obtaining them later in this introduction. They will play a mayor part in this thesis. Having a map of the genome is basic for genetic engineering. In the last years, the ability to clone and analyze individual genes has rapidly increased. This helps to deduce the amino acid sequences of the involved proteins, which is the foundation of understanding genetic disorders, the immune system, endocrine abnormalities, coronary artery disease, infectious diseases and cancer as well as neurological and behavioral diseases such as Huntington’s disease and Alzheimer’s disease. The dream of course is, th at if we understand the diseases on that basic level, th at we can 2 ultim ately prevent and cure genetic diseases. Higher resolution maps of the hum an genome, with the ultim ate goal of having the complete sequence, would accelerate the progress in understanding genetic diseases and would help in developing new approaches in diagnosis (genetic screening), treatm ent and prevention. This thesis consists of three different parts, where Part I and II both concern re­ striction maps and Part III concerns DNA sequencing. We will now give background m aterial and summaries for Parts I-III. In 1970 Hamilton Smith isolated the first enzyme that cuts DNA at specific nucleotide sequences (sites). For example, the enzyme EcoRI cuts the DNA at the sequence GAATTC. These enzymes are known as restriction enzymes. The DNA restriction fragments th at arise from cutting the DNA with such an restriction enzyme give rise to a restriction map. Such a m ap shows the ordering of DNA restriction fragments along a stretch of DNA. For a review see Nathans and Smith. For an example of a restriction m ap see the following figure, where the vertical lines indicate the restriction sites. There are several experimental approaches for finding a restriction map. In P art I and II of this thesis we consider aspects of the problem of mapping the position of the cut sites of two restriction enzymes. One way to construct such a map is to measure the fragment lengths (not order) from digests using each of the two enzymes singly (single digest) and then from a digest where the two enzymes are used together (double digest). The problem of determining the positions of the cut sites from the fragment length data is known as the Double Digest Problem (DDP). The wide use of restriction maps motivates the variety of approaches of solving the DDP: set partition problem (Fitch et a l, 1983), interval graphs (W aterm an and Griggs, 1986), stochastic annealing (Goldstein and W aterm an, 1987), flows in net­ work (Pevzner, 1988 and 1990), Shostak’s separation theory (Allison and Yee,1988, Ho et al., 1990) and minimal cycle mean in a digraph (Pevzner, 1992). Goldstein and W aterman (1987) proved that the DDP is NP-hard. They also showed th at the number of solutions to fragment length data produced from random cut sites increases exponentially as the length of the DNA increases. Biologists are interested in finding the unique restriction map with site locations identical to those 3 in the DNA. This makes m ultiple solutions to the DDP more problematic than if the problem were th at of finding at least one solution to the DDP. Although the num ber of solutions grows very fast, observations revealed th at a large num ber of them are very ‘similar’ in the sense that they can be obtained from each other by simple transformations. Schmitt and W aterman (1991) introduced these transform ations along with an equivalence relation on restriction maps in the case of no coincident cut sites. An equivalence class consists of all solutions that can be obtained from each other by a sequence of cassette transformations. They raised the question to completely characterize the equivalence classes of solutions to the DDP. In Part I Chapter 3, we will first give some basic definitions about edge-colored graphs. Then we will give the definition of order transformations of alternating Eulerian paths. This are transformations that applied to an alternating Eulerian path gives us another one. These transformation were introduced by Pevzner (1994) and give rise to the following very im portant theorem due to Pevzner: Every two alternating Eulerian cycles in an edge-bicolored graph can be transformed into each other by a sequence of order transformations. This theorem is very basic for all three parts of this thesis, since all the problems we discuss can be formulated in terms of alternating Eulerian cycles in edge-bicolored graphs. After th at we will formally describe the DDP. In Part I Chapter 4, we will show how Pevzner solved the problem of completely characterizing the solutions to the DDP. In his paper, the maps do not contain any coincident cut sites. We will first show how he associated a restriction map with an alternating Eulerian path in an edge-bicolored graph. We will then show how the cassette transform ations from Schmitt and W aterm an (1991) correspond to the order transform ations defined by Pevzner (1994). To do this Pevzner generalized the definition by Schmitt and W aterman of cassettes slightly. He then showed that each equivalence class of solutions corresponds to one specific graph, called the border block graph. We will give the details, because they are relevant for our purposes later. There is a computer program called MAPSUN (Pevzner, 1993) which is included in the software package DNASUN (Mironov et al., 1993) th at uses these ideas and generates all equivalence classes of restriction maps to find the true restriction map. 4 Very often, restriction maps do have coincident cut sites. So it is interesting to have a characterization of the solutions to the DDP for coincident cut sites. This is the motivation behind Part I, especially behind Part I Chapter 5. We will first extend the notion of the border block graph to accommodate the coincident cut sites. We are doing this by merging the ‘old’ graphs Pevzner defined. But we must merge them in a way th at makes sense and takes care of the fact that we still want a relationship between some extended cassette transform ations and order transform ation. To do this we must introduce some new vertices to the graphs. This makes the construction more complicated. We then extend cassette transformations to the general case, allowing coincident cut sites. In the case we do not consider coincident cut sites, cassette transformations only happen in one coincident cut fragment. This covers the general situation only between two coincident cut sites. Now cassette transformations can happen between coincident cuts and even cross the border between two such coincident cuts. We then give a complete characterization of solutions to the general DDP. The proof is based on the fact th at we can still associate a restriction m ap with an alternating Eulerian cycle in an edge-bicolored graph. Using Pevzner’s theorem stated above we can show that an equivalence class will correspond to a specific graph, namely to an extended border block graph. In P art I, we are concerned with identifying and characterizing the equivalence classes of restriction maps in the case that we have coincident cuts. In Part II we study the number of equivalence classes in the original setup, where we are no longer considering the case th at we have coincident cut sites. For a person assembling a map, the following question is quite naturally: Assume that we have the two single digest fragment length data a = ||A || = {A, : 1 < i < n} from enzyme A and b = ||f?|| = {B t : 1 < i < m} from enzyme B as well as the double digest c=||AA.B|| = { C ,: l < * < / } . Note th at all these sets might be multisets. Fixing a, b, and c, we would like to know how many essentially different answers there are. T hat is, how many different equivalence classes are there? Or, going to the setup of Goldstein and W aterman, we think of a = ||A || and b = ||£?|| as randomly generated. The cut sites are created by cutting each nucleotide with probability Pa and p s respectively, until we reach the end of the DNA (no coincident cut sites up to this point). This gives us a single fixed double digest c = ||A A f?||. Question we would like to answer is: 5 • How many equivalence classes are there? • How does the number of equivalence classes grow as the length of the DNA grows? We do not solve this problem, but we study a related problem th a t might be helpful in solving and understanding the combinatorial structure of the equivalence classes. Another question, related to our original problem, is the following. Assume we have the two fixed single digests a and b. Let 7 ra and xa> be two perm utations of the elements of a and let Xb and X b< be two perm utations of the elements of b. From (xa, Xb) and (xa«,Xb<) we might get different double digests c and c'. We write a = {(AJ, /,) : 1 < i < n'} where /, indicates the frequency with which the fragment length A\ is listed in the set a. Note that /< = n. For example if a = {1,1,2,2,2,3} we would rewrite it as {(1,2), (2,3), (3,1)}. We do the same for the set b = {(#,', < 7,), 1 < i < m '}, where = m - Therefore the total number of different perm utations of a and b equals n! m! m i l / . ! rirJ is ,!' Some pairs (xa, Xb) will give the same double digests and some will not. Interesting questions include the following: • How many different double digests are there? • Given some fixed double digest, how many different border block graphs are there? To give some insight in the stochastics and combinatorics of the border block graphs, consider again the model used in Goldstein and W aterm an. We have two enzymes A and B th at cut the DNA with rate p \ and p b , respectively. We stop this process as soon as we have a coincident cut. Now we take a look at a very specific subset of the double digests, called the border block fragments, in the order they were generated along the DNA. The border block fragments (/l 5. . . , /r+i), th at then induce a border block graph. (The border block fragments give us the vertices of the graph.) 6 In Part II Chapter 7, we will show how to calculate the probability that we walk from the border block fragment li to the border block fragment I2 , then to the border block fragment fo, and so on until we reach a coincident cut after the last border block fragment /r+1. (See Figure 7.1.) All graphs that we get by walking in this m anner from border block fragment /j up to the border block fragment /r+i will be denoted by S (h , • • •, K+i)- In Chapter 7.1, we will give recursive formulas th at can be used to calculate P(C /(/i,... ,/r+i)), by separately considering the cases th at we start, extend or end the graph. In Chapter 7.2, we use these recursions to explicitly calculate P(£?(/i,. . . , In the first section of Chapter 7.2, we will calculate P ((? (/i,. . . , lr+i)) for the case that the DNA is infinitely long and th at we stop the cutting process as soon as we reach a coincident cut site. We will show th at P(C7(/i,. . . , fr+i)) only depends on the number of border block fragments and on the total length L = lj of the involved border block fragments (and of course on pa and p s). In the second section of Chapter 7.2, we will fix the length D of the involved DNA. In this case, we let the enzymes cut with the same cut rate (because the calculations are very complicated), but without coincident cut sites until the end of the DNA. The calculations to find the exact expression for P(<7(/i,. . . , b+i)) are too involved, and we therefore only give a lower bound. Analogous to the case where the DNA is infinitely long, we get th at P (t? (/j,. . . , /r+i)) depends only on the number of border block fragments, on the total length L = 5Ij=i h -> and here also on D, the length of the DNA (and of course on the rate pa = Pb )- In Part II Chapter 8 , we take a look at the possible tree topologies. To be more precise, we will give a lower bound R (r, k) on the number of border block graphs hav­ ing k vertices and r edges. We already know from Goldstein and W aterman (1987) th at there are many different solutions. The combinatorics here indicates that there are also many different border block graphs. For example, there is superexponential growth in the number of vertices. In Part II Chapter 9, we consider the expected number of border block graphs. We consider the situation where the enzymes A and B move down an infinitely long DNA strand until we finally have a coincident cut. How many different border block graphs do we expect (fixing some param eter)? We only give a relatively complicated and rough estim ate, and we not have any asymptotics. 7 Im portant problems in this area remain to be solved. We do not solve the original problem due to the complex relationship between the double digest fragments and the border block fragments. T hat is, we would like to answer questions like the following: Fixing some double digest, which subsets of fragments are potential border block fragments? How do we recognize these and what are the possible corresponding graphs? W hat is the expected ‘size’ of a border block graph? In Parts I and II, we were concerned with problems arising by making a restriction map. In Part III, we are interested in making a higher resolution map, like when we sequence. We are considering a very special m ethod of sequencing DNA, namely sequencing by hybridization. One of the prim ary goals of the Human Genome Project is to increase the rate of DNA sequencing and to reduce its costs. W hile gel based m ethods for determining the sequence of nucleotides (A, G, C, T) are being autom ated and improved, new approaches to DNA sequencing are being explored. Sequencing by hybridization (SBH) is a novel approach of determining DNA sequence th at was proposed by several groups around the same tim e [Drmanac and Crkvenjakov (1987); Bains and Smith (1988); Lysov et al. (1988); Southern (1988); Macevicz (1989)]. Sequencing by hybridization is based on the following setup. A short single­ stranded DNA of 8-25 letters is called a probe. The probe will bind or hybridize to a single-stranded target DNA if the substring complementary to the probe exists in the target. If the target is presented to all probes of length / (called /-tuples), then the /-tuple content of the target is known, and this data can be used to determine the sequence of the target. To accomplish the repeated probing of all 4( probes of length /, all the probes are attached to the surface of a substrate where each probe is at a known position. This is called a sequencing chip. Then the labeled target is presented to the sequencing chip, and hybridizations detected by an instrum ent sensitive to the label. The ex­ perim ental challenges to making this approach successful include fixing DNA to the substrate in a reliable manner, devising efficient detection systems for DNA-DNA hybridization (that is, for label detection), controlling the substantial differences be­ tween the binding energies of complementary DNAs of the same length but different base composition, and detecting perfectly complementary duplexes from those that 8 axe complementary except for one mismatched pair of bases. There have been rig­ orous efforts to overcome these challenges and significant progress has been made, although determ ination of longer sequences is not yet routine [Pevzner and Lip- shutz (1994)]. Fodor and colleagues have developed light-directed polymer synthesis [Fodor et al. (1991); Fodor et al. (1993); Pease et al. (1994)] and recently synthe­ sized a sequencing chip with all 48 8 -tuples. Certainly a sequencing chip with all 410 10-tuples is a near term possibility. Certainly the experimental aspects of sequencing by hybridization are of impor­ tance in developing the technology; in addition, the computational and m athem at­ ical sides of sequencing by hybridization are critical too. To see the basic problem, consider the ideal case. A sequence a = a ia 2...am is to be sequenced, and the d ata are all /-tuples present in the sequence. This set of occurring /-tuples (mul­ tiplicities allowed) is known as the spectrum of a, 5 (a). It is natural to pose the sequence recoverability problem as a traveling salesman or Hamilton path prob­ lem. The graph G h = (V h ,E h ) has vertex set Vh = 5 (a), and (u, v) € E h when u = u iu 2 ...u/, v = V\V2 ...v i € 5 (a) and U 2 U 3 ...U 1 = i>iv2 . . . u/_j. Of course this is known to be NP-complete. Pevzner (1989) ingeniously translated this prob­ lem in the following way: The vertices of a new graph are the (/ — l)-tuples from 5 (a ) and the edges correspond to 5(a). Define the graph G e = (Vg, E e) where Ve — { x ix 2 . . . x/_ 1 : x \x 2 ... xi- 1 = v\v 2 ... 1 or v 2 v2 ...v i and v iv 2 ... v; € 5(a)} is the vertex set and E e = {(xix2 . .. xj_i, x2x3 . .. x/) : xi x 2 ...x/ € 5(a)} is the edge set. The problem of determining the sequence a is translated into an Eulerian path problem, one for which there is an efficient solution. Clearly computational difficulties arise when the data have errors. In the ideal case, however, it is instructive to ask how big m ust / be to expect uniquely determine a random sequence. This problem was the subject of a recent paper by [Dyer et al. (1993)]. The relation between this part and [Dyer et al. (1993)] is described below. T he intuition for the form of the answer is easy to derive. For the sequencing by hybridization data to give a unique answer, there should not be any (I — l)-tuple repeats. The longest repeat in a random sequence is approximately / = logi « log 1 ( ) , where p = P ( two random letters match). See [Karlin et al. (1987)] or P [Arratia et al. (1990a)]. A simple calculation (ignoring technical details) gives about (™)pl expected repeats of length / and solving 1 = (™)pl gives the above value 9 of / = logi Making this intuition precise involves the m ethods of Poisson approximations recently described in [Waterman (1995)]. Now we give an overview of Part III. In Chapter 11, we will review the underlying theoretical result, a Poisson approximation theorem by [Arratia et al. (1990)] (The­ orem 7). Employing the Chen-Stein method, [Arratia et al. (1990)] also provide a rate of convergence in the total variation distance. (See also [Barbour et al. (1992)].) Chapter 12 contains the application of this theorem for repeats of long subse­ quences in a randomly chosen sequence a = aia 2 ...am. We will assume th at the “letters” ai,a 2, ...,a m are chosen independently from a finite alphabet, according to a not necessarily uniform distribution. As a long sequence may occur more than once in a word and these occurrences may overlap, the dependence structure is different from th at in sequence alignments. Still, it is a local dependence structure. Hence we can apply Theorem 7 from the previous chapter to obtain a Poisson approxima­ tion theorem for the number of long repeats, together with a rate of convergence (Theorem 8 ). In order to make the approximation nontrivial, we recover as length I for the repeated subsequence that I fa logi (^ -) (Corollary 1). If we only consider P non-overlapping repeats, the calculations simplify, and the lim iting behavior remains unchanged (Corollary 9). Chapter 13 gives the application of the above results to the problem of unique recoverability of a sequence. We will introduce the so-called Ukkonen transformations. They transform one string into another string with the same spectrum. We will call a sequence (uniquely) recoverable if neither transpo­ sitions nor rotations in the string are possible (except possibly trivial ones leaving the string fixed). This will be explained in more detail in this section. The most likely structure th at would potentially destroy unique recoverability is th at there are several pairs of disjoint repeating regions in the sequence - the subsequences between the repeats can then be interchanged without changing the spectrum. The probability for the latter event can be easily estim ated using the Poisson approxima­ tion result from the last section. The probabilities for other structures th at would destroy unique recoverability can be easily estim ated directly. Combining these es­ tim ates, we get an approximation theorem for the probability of randomly chosen sequence to be uniquely recoverable from substrings of length I fa l o g i ( ^ ) , and we P 1 give a rate of convergence (Theorem 10). 10 One of the benefits of our Poisson approximation is th at while the limit form is given by [Karlin et al. (1987)], the results below have explicit bounds on the approximation necessary for application to sequencing by hybridization in the fol­ lowing chapter. The two theorems for overlapping and non-overlapping repeats can be applied to estim ate statistical significance in the analysis of patterns in a single sequence. W hile the Chen-Stein m ethod has been carefully worked out for sequence m atching in [Arratia et al. (1990a)] and [Waterman (1995)], the correlation struc­ ture for overlapping repeats involves some details th at axe quite distinct and not obvious. The probability of unique recoverability can be used as a measure for the effectiveness of a SBH chip; [Pevzner and Lipshutz (1994)] call it the “resolving power” of a chip. The rate of convergence, together with the lim iting distribution, thus provides a means to estim ate this resolving power. To conclude this part we return to the paper of [Dyer et al. (1993)]. In that paper, for the case of uniformly distributed letters, a qualitative lim it theorem for the probability that a sequence can be uniquely recovered by subsequences of length I fa logi (^2~) is given. Their proof uses actually two approximations. Firstly, they show that the quantity in question is close to another quantity, which is then shown to converge in distribution to a Poisson variable. We believe th at our proof gives more insight into the structure of the problem, and it also fills some gaps in the proofs by [Dyer et a/.(1993)]. Furthermore, [Dyer et a/.(1993)] do not obtain a rate of convergence, which is desirable for practical purposes. Finally, it is well-known that the nucleotide distribution is not uniform. [Dyer et a/.( 1993)] claim that a general distribution on the alphabet would change the proofs and the results only slightly. The presentation below shows th at the arguments needed are more complex and th at the general results stated by [Dyer et a/.(1993)] are not valid (compare for instance their expression for the expectation with Lemma 18). We introduce a random variable W counting the number of length / — 1 repeats. There are two im portant cases: that where overlapping repeats are counted and that where repeats must be non-overlapping to be counted. In the case of overlapping repeats which is the relevant structure for SBH, the value of E (W ) cannot be explicitly given in a closed form (Lemma 18) for a general distribution, whereas for a uniform distribution it has a closed form (Lemma 19). For non-overlapping repeats, E (W ) has a closed form for a general distribution (Theorem 9). Note th at E(VF) in Theorem 9 almost 11 coincides with the corresponding expression suggested by [Dyer et al. (1994)] except th at occurrences of to — / + 1 are replaced by m — 21 + 3 in our formula. Their formula was proposed for the overlapping case. The most im portant difference between a uniform and a non-uniform distribution on the alphabet is that, in the uniform case, coincidence of letters a i,a j, a* occur independently, P(a< = aj|a; = a*) = P (a ; = aj). In the non-uniform case, this is no longer true, and the dependencies caused by overlapping sequences require careful analysis. 12 Part I Equivalence C lasses for th e D ouble D igest P roblem w ith C oincident C ut Sites 13 C hapter 2 Introduction to P art I The DNA restriction fragments that arise from cutting the DNA with a restriction enzyme give rise to a restriction map. Such a map shows the ordering of DNA restriction fragments along a stretch of DNA. For a review see Nathans and Smith. For an example how a restriction m ap might look like see Figure 2.1. There are several experimental approaches to finding a restriction map. In P art I and II of this thesis we consider the problem of mapping the position of the cut sites of two restriction enzymes. One way to construct such a map is to measure the fragment length (not order) from digests by each of the two enzymes singly (single digest) and then from a digest where the two enzymes are used together (double digest). The problem of determining the position of the cut sites from the fragment length data is known as the Double Digest Problem (DDP). Goldstein and W aterman (1987) proved th at the DDP is NP-hard. They also showed that the number of solutions to fragment length d ata produced from random cut sites increases exponentially as the length of the DNA increases. Biologists are interested in finding the unique restriction map with site locations identical to those in the DNA. This makes m ultiple solutions to the DDP more problematic th an if the problem were that of finding at least one solution to the DDP. Although the number of solutions grows very fast, observations revealed that a large num ber of them are very ‘similar’ in the sense th at they can be obtained from each other by simple transformations. Schmitt and W aterman (1991) introduced these transformations formally along with an equivalence relation on restriction maps in the case of no coincident cut sites. An equivalence class consists of all solutions th at can be obtained from each other 14 by a sequence of cassette transformations. They raised the question to completely characterize the solutions to the DDP. In P art I Chapter 3, we will first give some basic definitions about edge-colored graphs. We will mention Kotzig’s theorem, th at gives a characterization of graphs containing an alternating Eulerian path. Then we will give the definition of order transformations of alternating Eulerian paths. These are transform ations th at ap­ plied to an alternating Eulerian path gives us another alternating Eulerian path. These transform ation were introduced by Pevzner (1994) and give rise to the follow­ ing very im portant theorem by Pevzner: Every two alternating Eulerian cycles in an edge-bicolored graph can be transformed into each other by a sequence of order transformations. After th at we will formally describe the DDP. In P art I Chapter 4 we will show, how Pevzner solved the problem of completely characterizing the solutions to the DDP. Note th at the m aps in his paper do not contain any coincident cut sites. We will first show how he associated a restriction map with an alternating Eulerian path in an edge-bicolored graph. We will then show how the cassette transformation from Schm itt and W aterm an (1991) correspond to the order transform ations defined by Pevzner (1994). To do this Pevzner generalized the definition by Schmitt and W aterman of cassettes slightly. He then showed that each equivalence class of solutions corresponds to one specific graph, called the border block graph. We will give the details, because there are relevant for our purposes later. In Part I Chapter 5 we consider restriction maps with coincident cut sites. We will first extend the notion of the border block graph to accommodate the coincident cut sites. We are doing this by merging the ‘old’ graphs Pevzner defined. But to be able to merge them in a way that makes sense and takes care of the fact, th at we still want a relationship between some extended cassette transform ations and order transform ation, we had to introduce some new vertices to the graphs. This makes the construction more complicated. We then extend cassette transform ations to the general case, allowing coincident cut sites. In the case th a t we did not consider coincident cut sites, cassette transform ations only happened in one coincident cut fragment, th at is only between two coincident cut sites. Now cassette transform a­ tions can happen between coincident cut fragment and even over the border between 15 Figure 2.1: Schematic of a Restriction Map two such coincident cut fragments. We then give a complete characterization of solu­ tions to the general DDP. The proof is based on the fact that we can still associate a restriction map with an alternating Eulerian cycle in an edge-bicolored graph. Using Pevzner’s theorem that we stated above we can show th at an equivalence class will correspond to a specific graph, namely to an extended border block graph. 16 C hapter 3 Prelim inaries 3.1 D efinitions We consider an undirected graph G (V ,E ), where V denotes the vertices and E the edge set of the graph. An edge-coloring of a graph is an assignment of colors to its edges. Therefore G(V, E ) will be edge-colored in / colors, if there is a function / : E — ► {1,2,...,/}. In this paper a colored graph will be an edge-colored graph. A sequence of vertices F = x^x 2 . .. xm is called a path, if (x,, Xj+i) € E for 1 < i < m — l.A cycle is a path with xi = xm. A path is called alternating, if two consecutive edges are colored differently, and it is called Eulerian if it traverses every edge in E exactly once. Set dc(v) to be the number of c-colored edges of E incident to v. The degree d(v) of v satisfies d(v) = 5Zc=i de(v). A graph is called balanced if m axcdc(u) < d(v)/2. A multigraph can have both multiple edges between vertices and self loops. We have the following theorem characterizing graphs containing alternating Eulerian cycles. T h e o re m 1 (K o tzig ) Let G be a connected edge-colored graph with vertices of even degree. Then there exists an alternating Eulerian cycle in G if and only if G is balanced. Consider two bicolored graphs G = G(V, E ) and G' = (V ', E '), with say colors A and B . If V = V ', E = E 1 and if for each v, w £ V with (v , w) £ E the number of A- colored edges (v , w) in G equals the number of A-colored edges (v, w) in G' (therefore the same holds for the number of 5-colored edges), we call the two graphs G and G' identical as bicolored graphs. 17 3.2 A lternating Eulerian C ycles Let G = G(V, E) be an edge-bicolored graph. We will now introduce transformations which allow us to transform an alternating Eulerian path into another alternating Eulerian path. Let F = x i . . . x , ... Xj ... x m be an alternating path in G with x, = Xj. Set F\ = x j ... Xi, F2 = Xi... Xj and F3 = Xj ... xm. Denote by F J = XjXj-i ...Xi the reflection of the path Fj. Then the transform ation < f> : F = F 1F 2F3 — ► F* = F^F^F^ is called an order reflection, if 4>{F) = F* is still an alternating path. For an example see Figure 3.1(a). Let F = Xi... Xi... Xj... Xk ... xn ... xm be an alternating path in G with x, = xit and Xj = x„. Set F = F 1F 2F3F 4F 5 with F\ = xi...x ,, F 2 = x,...X j,F 3 = Xj... Xk, F 4 = Xk-..xn and F 5 = x „ ... xm. The transform ation given by < f > : F = F 1F 2F 3F4F 5 — » F* = F 1F4F 3F 2F 5 is called an order exchange, if < f> (F ) = F* is still an alternating path. For an example see Figure 3.1(b). The following theorem will be very useful later. T h e o re m 2 (P e v z n e r) Every two alternating Eulerian cycles in an edge-bicolored graph G can be transformed into each other by a sequence o f order transformations, that is by order exchanges and order reflections. A proof can be found in [Pevzner (1994)]. 3.3 T he D ouble D igest Problem We will consider the simplest version of the Double Digest Problem involving linear DNA, three digests and no measurement errors. We are interested in determining the positions of sites of two restriction enzymes, say A and B , along the DNA. The data consists of the measurements of the fragment lengths (not order) coming from a digest of the DNA by each of the two enzymes singly and from a digest where the two enzymes are used together. The problem of determining the position of the cut sites from the fragment lengths is called the Double Digest Problem (DDP). The formal version of the DDP is the following. Our data has the form: a = ||A || = {A i: 1 < t < n} from the first digest, b = ||f?|| = {Bi : 1 < i < m} from the second 18 a) b) F 1 F 2 F 3 F 4 F 5 F 1 F 4 F 3 F 2 F 5 Figure 3.1: O rder Transformations: a) shows an order reflection and b) shows an order exchange 19 digest and c = ||A AB\\ = ||C || = {C, : 1 < i < /} from the double digest. Note that, in general, a, b and c will be multisets. In an abuse of notation we often denote by A = {Ax,..., A„} and B = { B i , , B m}, the sets of fragment lengths coming from the single digests, and by C = {Ci,..., C/} the set of fragment lengths coming from the double digest. The DDP corresponds therefore to the determining of orderings of the sets A and B , which are consistent with C in the following sense. We call (er, p) a configuration where a is a perm utation of ( 1, 2 ,..., n) and p a perm utation of (1,2,..., m). By ordering A and B according to a and p we get a set of locations of cut sites S = {s : s = Hx<j<r Aff(j) or 5Zi<j<< ^ (j)> 0 < r < n ,0 < t < m}. Sites which are common to both enzymes are called coincident cut sites. Note that S is not considered as a m ultiset, although we might have coincident cut sites, since we record the location of a cut site only once. By labeling the elements of S such that S = {sj : 0 < j < /}, with s, < Sj for i < j. Note th at the double digest implied by the configuration (<r, fi) is C(cr,fi) = {Cj(o, fi) : sj — Sj_x, 1 < j < /}. A solution to the DDP is therefore a configuration (<r, fi) such th at C(a,fi) = C. A solution will be called a restriction map and will be denoted by (A, B ). For an example see Figure 3.2. Note that a solution is not necessarily unique. For a treatm ent of the combinatorics of multiple solutions of the DDP we refer to Schm itt and W aterman [Schmitt and W aterm an (1991)]. For an example where the solution is not unique see again Figure 3.2. We consider the bases of the DNA being m apped as an interval [1, N], with the fragments corresponding to intervals < j. We define an order by [i,j] < [&,/] if i < k. A = {Ax,..., An} is now just a set of non-empty intervals, called the blocks of A, where 1JA, = [1, A T ] and A * < Aj if i < j. 20 a = IIA II ={1,3,3.12} b = IIB II = {1,2,3,3,4,6) c = IIC II = {1,1,1,1,2,2,2,3,6} Figure 3.2: Multiple Solutions C hapter 4 T h e D ouble D igest Problem A nd E quivalence C lasses O f Solutions w ithout C oincident C ut Sites 4.1 R estriction M aps and Border B lock Graphs From now on we will assume that a restriction map (A, B ) is given. Let A = {i4j,..., An},B = { B i, ..., Bmj and A A B = C = { C \,..., C/}. In this section we will assume th at A and B have no coincident cut sites, that is I = n + m — 1. Pevzner introduced a graph (see [Pevzner (1994)]) called the border block graph, and then showed th at any restriction map corresponds to an alternating Eulerian path in the border block graph. We will review the main facts here again since the construction is relevant for the case when we allow coincident cut sites. A fork of a block Ai is the set of intervals F(A{) = {Cj : Cj C A J . A fork containing at least two blocks, that is |F (A )| > 1, is called a multifork. Note th a t every two forks F(Ai) and F(Bj) have at most one block in common. When |F (A )| > 1,C* G F(X) is called border block if either C* = mmcJ^F(X)Cj or C* = maxc}eF(X)Cj- Define F*(X) to be the set of border blocks of the multifork F(X) and let B be the set of all border blocks of the pair (A, B). Let M F = {F(A )} where F(X) is a multifork and VH = {|Cj| : Cj € B} be the set of border block lengths. Note that VH is not a m ultiset ( blocks C, and Cj correspond to the same element of VH if |C <| = |Cj|). We now construct a bicolored (with colors A and B) m ultigraph H = H(VH, EH) where E H = {(|C,-|, |C,-|) : {Cu Cj} = F*(X) for F(X) G MF}. Note th at two vertices |C, | and |Cj| might be joined by more than one edge and that 22 we might have loops. An edge will be A— colored if X = A*, for some i and it will be 5 — colored if X = Bj for some j. The bicolored graph H = 5 ((A , B)) is called the border block graph of the map (A, B). See Figure 4.1 for an example where the top picture shows how the graph is constructed from the map. The ‘edges’ below the m ap show which fragments are used in the construction of the border block graph. If we join two border block fragments that border a 5-fragm ent we denote that by a 5-edge, otherwise by an A-edge. The bottom picture shows the ‘collapsed’ border block graph, where that double digest fragments of equal length correspond to the same vertex and that a fragment of length i corresponds to vertex i. Note that all vertices of H are balanced, except for possibly |Ci| and |Cj|, since every multifork contains exactly two border blocks. If \C\ \ and |C/| are not balanced, then Ic^dCil) — d s(|C i|)| = 1 and Id^dC/l) — dB{\Ci\)\ = 1. We can therefore transform 5 into a balanced graph by adding one or two edges. Order the border blocks in B : C\ = C;, < C ,2 < • • • < Clt = C|, where k — 1 = M , the total number of multiforks. Lemma 1 The path F = |Ctl ||Ci21 • • • |C,t | is an alternating Eulerian path in H, the border block graph corresponding to (A, 5 ). Therefore every restriction map (A, 5 ) corresponds to an alternating Eulerian path in the corresponding border block graph. 4.2 C assette Transform ations and R estriction M aps In this section we define cassette transformations th at will allow us to define an equivalence relation on restriction maps. These transformations were first intro­ duced by Schmitt and W aterman in [Schmitt and W aterm an (1991)] and slightly generalized in [Pevzner (1994)]. Note that we still assume no coincident cuts. Fix i and j and let Ic = {Ck • Ci < Ck < Cj} be the set of intervals between C{ and Cj. The cassette defined by Ic is the pair (7 ^ ,/b ), where I a and I s are the sets of all blocks of A and 5 respectively, which contain a block of Ic- Define and m s to be the minimal elements of the leftmost blocks of I a and I b respectively. The left overlap is defined to be — m j. The right overlap is defined similarly, by substituting maximal for minimal and rightmost for leftmost. 23 H -1 — I H -------------------------1 -------------- 1 — I A 3 I ~l 3 I 2 I 8-----------1-3 I I ■ 2i !, 2, 4 , 2i 2 i 2t 4 , 3 , A B Figure 4.1: Border Block Graph 24 Let Sr denote the rightmost cut site of the cassette and let sr denote the next cut site to the left (that is it will be a cut site of the other enzyme). If m = S r — sr = Cj for some j G {1,...,/} , we say th a t the right overlap of size m corresponds to a double digest fragment, namely to Cj with \Cj\ = m . Similarly we can define when the left overlap corresponds to a double digest fragment. Note that the left and right overlaps of the cassette Ic correspond to double digest fragments if and only if the left overlap has size |C,_i| and the right overlap has size |Cj+i|. If two disjoint cassettes of the solution (A, B ) have identical left and right overlap and the overlaps correspond to double digest fragments, then they can be exchanged as in Figure 4.2 and we get a new solution (A B ' ) to DD P(a, b, c). If the left and right overlap of a cassette have the same size but different sign and the overlaps correspond to double digest fragments then the cassette can be reflected as in Figure 4.3 and we get a new solution (A ", B "). The seemingly unnatural restriction to cassettes whose overlaps correspond to double digest fragments arises from the definition of the border block graph. There the vertex i corresponds to a double digest fragment of length i. Therefore a cassette whose left overlap corresponds to a double digest fragment of length m and whose right overlap corresponds to a double digest fragment of length M , gives rise to a path in the border block graph from vertex m to vertex M . This would not be the case in general. This gives us the following lemma that states th at order exchanges and reflec­ tions in the border block graph correspond to cassette exchanges and reflections in restriction maps. L em m a 2 Let H be the border block graph o f (A, B ) and F be the alternating Eu­ lerian path in H corresponding to (A,B). (i) Let (A ',B ') be obtained from (A ,B ) by a cassette exchange (reflection) and F ' be the alternating Eulerian path corresponding to (A 1 , B'). Then there is an order exchange (reflection) taking F to F'. (ii) Let F ' be obtained from F by means o f an order exchange (reflection). Then there is a cassette exchange (reflection) taking (A , B ) to (A ',B '), where F ' corre­ sponds to (A B ' ) . More details can be found in [Pevzner (1994)]. 25 11 4 , 3 , 5 h 3 . 6 3 1 1 1 1 : 1, 3 , 2 ' 'S 1 1 * ] 3 1 7 I 1 I 1! 2 1 1 ! V i 2 i V S 4 i 2 12 h 2 4 1 3 1 r l ^ I I 1 M I I I __________________i I_______ 1 1 • 4 li , i; 3 6 i * i 2| 4 | 3 | 5 3 I I : * r ~ i 3 i l l 8 '5 1. 3 , 2 ! 1 1 I 7 I I I 1! 2 !| 2 : 1 1 4 i 2 12 h *i2 1 *i 1 4 1 3 1 1 1 1 1 1 |l l II 1 1 Figure 4.2: Cassette Exchange 26 Figure 4.3: Cassette Reflection 27 We can now introduce an equivalence relation on the set of all solutions to DDP(a, b, c). Let F(Ai) be a multifork with |F (A ,)| > 2. Define uncut fragments of F(Ai) as U(Ai) = {Bj : Bj C Ai and Bj € F(Ai) — F*(Ai)}, that is £/(A.) is the set of uncut fragments Bj C Ai which are not border blocks of Ai. We analogously define U(Bi). Every perm utation of uncut fragments determines a new solution to DDP(a, b ,c), leaving the border block graph unchanged. Note that we have to ex­ clude the borderblocks from the perm utation, since otherwise the border block graph might change. We let (A, B ) = (A', B') if and only if there is a sequence of cassette transfor­ m ations and uncut fragment perm utations transforming (A ,B ) into (A ',B '). This equivalence relation partitions the set of solutions into equivalence classes, each cor­ responding to a border block graph. This gives us now a complete characterization of equivalent restriction maps in the case of no coincident cut sites. T h eorem 3 Let H be the border block graph o f (A , B ) and let H ' be the borderblock graph o f (A', B'). Then (A , B) = (A', B') if and only if the graphs H and H ' are identical (as bicolored graphs). 28 C hapter 5 T he D ouble D igest P roblem A nd E quivalence C lasses O f Solutions w ith C oincident C ut Sites 5.1 T he E xtended B order B lock G raph We will now describe a graph that accommodates a map th at includes coincident cuts. The graph will be made up of the border block graphs described above, con­ structed for each component separately and put together in a way th at will allow us to find a correspondence between alternating Eulerian paths and the underlying map, giving us a complete characterization of the solutions to the DDP. We will again assume that a physical map (A, B) is given and that A = {Ai,..., An}, B = {J9i,. . . , Bm} and A A B = C = {Ci,..., C/}. If the m ap has c — 1 coincident cut sites, then we have c components. From now on we will allow coincident cut sites, th at is we are allowing more than one component. Let us assume that we have c components. Let L be the set of leftmost fragments in these components and R the set of rightmost fragments of the components. If the component just consists of one A-fragment and one 19-fragment (therefore one C-fragment), we include it in L. L and R can be naturally identified with a subset of B, the set of all border blocks. Let L a = {A, : A, € L} and R a = {A, : A, € R] be the set of all A-fragments that are the leftmost and rightmost fragment of a component. We analogously define L b and R b - We will now introduce the extended border block graph. Let V H ' = {\Ck\ : C* 6 B} U : Bk ( = L b U R b } U {0}. As before, V H ' is not a multiset ( note that |C,| and |C<|* are two different elements). 29 We let E H ' = {(|Cj|, |Cj|) : (Ci,Cj) = F"(X) for F(X) G MF} U |0 ,D : Bi G L B U R b } U { (\B i\\0 ) : 0 ; G Lb U 0 B} U {(|i4,|,0) : At G L a U R a } \J s= 1 {(0,0)}. Note th at E H 1 is a multiset and that the last term in the definition of EH' corresponds to c loops at 0, one for each component. The coloring is according to the following rules: 1. (|C;|, |Cj|) : (Ci,Cj) = F*(X) for F(X) G M F will be colored as before. 2. (|0 ,|, 1-0,1*) : 0 , G L b U R b will be 0 - colored. 3. (|0j|*,O) : Bj G L b U R b will be A-colored. 4. (|A,-|,0) : Ai G LA U R A will be /4-colored. 5. (0,0) will be 0-colored. H ' = H '(V H ', EH') = H((A, 0 )) will be called the extended border block graph. H ' is again a bicolored multigraph. For an example see Figure 5.1. The top picture shows the map and how the connections between the components in the graph are made. Each tim e the last fragment of a component is a 0-fragm ent of length j , we use the vertex j * to bridge between the border block fragment of length j and the vertex 0, which we can think as representing the overlap of size 0. The bottom picture shows the complete extended border block graph. Note th at H ' is a balanced graph and similar as before, a restriction map corresponds to an alternating Eulerian cycle. Lemma 3 Every restriction map (z4,0 ) corresponds to an alternating Eulerian cy­ cle in the corresponding extended border block graph. 30 Figure 5.1: Extended Border Block Graph 31 5.2 T he E xten d ed C assette Transform ations of R estriction M aps As in the case with no coincident cut sites, we introduce transform ations th at will transform a solution (A ,B ) into another solution (A ',B '). Again fix i and j and let Ic = {Ci, : Ci < Ck < Cj}. As before, we denote by I a and I s the sets of all blocks of A and B respectively, which contain a block of I c • Recall th at L and R (L , R C A U B) corresponded to the set of border blocks of the components. Let C = L U R. In an abuse of notation, we will at the same tim e use C for the subset of B corresponding to the set of border blocks of the component and for the elements in L U R which gave rise to them. To define the extended cassette, I'c = (I'A, I'B), we will to consider different cases: 1. Ci £ C and Cj C. Then the cassette (IA,I'g) is defined as before, th at is I'A = IA and I'B = lg . See Figure 5.2a for two examples. 2. Either Ci or Cj belong to C. Then Ic can define up to two cassettes. (a) l'c = = (I'A, I'g) where I'A and I'B are the sets of all blocks of A and B respectively, which contain a block of Ic- (b) • Consider the case Ci € C. Assume th at C i-\ € C. If the fragment Ci- 1 corresponds to a fragment A r for some r, then I A = I A U { A r } . If the fragment Ci-1 corresponds to a fragment B r for some r, then I'g = I g U {Br}. We let I ^ = {I'A , I'g). • Consider the case Cj € C. Assume th at Cj+i € C. If the fragment Cj+i corresponds to a fragment A r for some r, then I A = I A U ( A rj. If the fragment Cj+i corresponds to a fragment B r for some r, then I'g = I B U {Br}. We let I'c = (I'A , I'g). See Figure 5.2b for an example. The left picture corresponds to the choice of case 2 (a) and the right picture to the choice of case 2 (b). 3. Ci, Cj € C. Then Ic can define up to four cassettes. (a) rc = (IA,Ig ) where IA = IA and I'B = Ig. See Figure 5.2c(i) for an example. 32 (b) Assume that Ci- 1 € C. If the fragment C,_i corresponds to a fragment At for some r, then IA = IA U {Ar}. If the fragment C,_i corresponds to a fragment Br for some r, then I'B = /g U {f?r }• We let I'c = (IA, IB). See Figure 5.2c(ii) for an example. (c) Assume that cj+ i e c. If the fragment Cj+i corresponds to a fragment A r for some r, then I'A = 1a U {Ar }. If the fragment Cj+1 corresponds to a fragment B T for some r, then ffi = f s U {f?r }. We let = (I'A, I'B). See Figure 5.2c(iii) for an example. (d) Assume th at Ci-1 € C and Cj+i ^ If the fragment C,_i corresponds to a fragment A r for some r, then IA = I A U {Ar}. If the fragment Cj+i corresponds to a fragment A , for some s, then I'A = I'A U {A,}. If the fragment i corresponds to a fragment B r for some r, then I B = I B U {Br}. If the fragment Cj+1 corresponds to a fragment B , for some s, then I'B = I'B U {# ,}. We let = (IA, I'B). See Figure 5.2c(iv) for an example. Note that in case (3) there are fewer than four possible cassettes if at least one of Ci or Cj is the rightmost or leftmost fragment in the restriction map. The overlaps of a cassette are defined the same way as before. Note that the overlap might be zero if the cassette borders a coincident cut site. The criterion for exchanging or reflecting cassettes these (extended) cassettes are also unchanged, th at is we can exchange two disjoint cassettes if their right and left overlaps are the same and if the nonzero overlaps correspond to double digest fragments. We reflect a cassette if the left and right overlaps are of the same size but different sign and if the nonzero overlaps correspond to double digest fragments. Note th at an overlap of size zero cannot correspond to a double digest fragment, since each fragment has positive length. But the graph is constructed to deal exactly with th at case. For examples of exchanges of extended cassettes and reflections of extended cassettes see Figure 5.3 and Figure 5.4. The following lemma give us the relation between the cassette transformations and the order transform ations in the extended border block graph. Note that all the above definitions and the following lemma are consistent with the case that we have no coincident cut sites. 33 t , l[ 8 | 2 j 5 j 4 1 4 5 0 1 j « ! : 1 , ' " " T — '5 4 , T ' \ l | 2 , 'r — j 3 2H ' s ' j I ' : 1 3 ; 4 i p 1 1 2 3 2 | 2 1 ; 1 25 2 | 1 : 1 1 1 : 1 ! < i i 0 4 I 1 3 1 4 I 1 4 1 1 3 4 1 1 1 : 1 i i : ' 2 i 2 i i * " " l 3 1 ""25 2 • | 5 1 3 I t i 3 . 2 i 2 . 1 I I I ; 2 2 3 2 ! 2 1 2 | 2 | ! ! < 1 • 1 ' T J 2 I I i 1 j 1 1 1 1 (ii) . . . . . . 0 1 T 1 4 I , 1, 2 5 4 1 4 | 11!2 4 1 1 1 1 1 1 “ 3 j 2 - 3 -1 m 2 | 2 | , 3 1 2 , 3 'l 2 1 1 : 1 i !i2 ! 2 1 2 11 1 1 1 : 1 0 2 . 2 , ,1.25 2 . 2,1 1 1 I I ; 1 1 1 1 2 2 1 1 II ! 1 1 | 1 I I I (iv) ____ ■ 1 1 • 1 1 4 | ) i________________ 1 J| 2 1 4 , 1 4 1 i 1! 2 4 r 1 1 j I , 3 j 2 3 1 < 1 ' ' 5 T q 2 , , 2 1 3 " H 2 1 1 ; 1 1 2 j 2 2 1 5 1 1 5 1 ! 5 1 2 . 1 2 i 2 1 2 . in 1 2 , I I I 1 1 * 1 1 1 1 ■ 1 Figure 5.2: Examples of Extended Cassettes 34 I 1 2 | 3 | 1 r------------------- f----------------------------------;------------------------j 6 { 1 , | IS 6 j 1< 2 | 3 ! 1 1 1 s i"i 2 f----S 3 T s 1 1 s s 2 , i r V , , 2 , TV'S 1 0 1 ! "Vi 2 r— y 3 1 ! 1 i 1 1 1 11{- 2 i 1 1 I f I I i l l 2 1 x\ 2 11 1 1 1 1 *■ 2 11\ 2 1 1 i i 1| 1| li 2 I 1 I I I ; 1 j. I M i l s 1 1 ______S I_____ i l I j 1 ii 2i 3 1 2 4 11 1 i 2"; 2 f------ 4 — 1 2 T'T 1 i i | i | ii 1 | li 2 | 1 | 1 1 i 1 2 l f 1| M j 1 j H s 1 1 t t , 4 2 4 ! 1 j 2 1 3 h i 1 ; | ? ~ s i | S 1 ! 1 2 i 4 i 2 . 3 | 1 i I 1 ! 1 i i i 2 J 2 - 1 j I S 1 . 1 * 2 . I , 1 1 s 0 ^ J M i l } I I _ i 5_ _ .5 : 2 3 | 4 2 0 4 ; 2 4 j 4 j 2 3 1 | 3 V " J3 , 3 2 0 1 1 i r r ' j 2 S i 1 1 li T i 3 { 3 i'2"j | • 2 | 1 I 2 , 11 3 2 1 ] 1 l.lj 2 , 2 1 j 1 | | 1 | ij 2 | 1 | 3 S 2 | 1; 2 | M i l 0 1 1 1 4 0 1 1 I I 1 1 j 1 ___ J S_____ i 1 --------- f...... l! 2 i 2 r 3 ! 3 I 1 i 1 | 1 | 2! 2 5 1 J 1. 2 i 1 1 A 1 2 1 1 1 | i 1 1 1 j 1 1 0 Figure 5.3: Exchanges Figure 5.4: Reflections L e m m a 4 Let H ' be the extended border block graph of(A,B) and F be the alter­ nating Eulerian path in H ' corresponding to (A, £?). (i) Let (A',B') be obtained from (A , B ) by a cassette exchange (reflection) and F ' be the alternating Eulerian path corresponding to (A ',B '). Then there is an order exchange (reflection) taking F to F'. (ii) Let F ' be obtained from F by means o f an order exchange (reflection). Then there is cassette exchange (reflection) taking (A , B ) to {A', B '), where F 1 corresponds to (A ',£ ')- As before this allows us to define an equivalence relation on the set of solutions to the DDP, allowing coincident cut sites. We let (A, B ) = (A', £?') if and only if there is a sequence of (extended) cassette transform ations and uncut fragment perm utations transforming (A, B ) into (A', B '). This equivalence relation partitions the set of solutions into equivalence classes, each corresponding to an extended border block graph H'. Therefore we get the following theorem, which is an extension of Theorem 3. T h e o re m 4 Let H ' be the extended border block graph of (A, B ) and H " be the extended border block graph of(A',B'). Then (A, B ) = (A ',B ') if and only if the two graphs H ' and H" are identical (as bicolored graphs). 37 Part II T he N um ber o f Equivalence C lasses C hapter 6 Introduction to Part II In this part we take a look at the number of equivalence classes in the original setup, where we assume there are no coincident cut sites. For a person assembling a map, the following question is quite natural: Assume th at we have the two single digest fragment length data a = ||A|) = {A; : 1 < i < n} from enzyme A and b = ||i?|| = {5, : 1 < i < m} from enzyme B as well as the double digest c = ||A A f?|| = {C, : 1 < i < /}. Note that all these sets might be multisets. Fixing a, b, and c, we would like to know how many essentially different answers there are. T hat is, how many different equivalence classes are there. Or, going to the setup of Goldstein and W aterman, we think of a = ||A || and b = ||i?|| as randomly generated. The cut sites are created by cutting each nucleotide with probability pa and p s respectively, until we reach the end of the DNA (no coincident cut sites up to this point). This gives us a single fixed double digest c = ||A A B\\. Questions we would like to know the answer to is: • How many equivalence classes are there? • How does the number of equivalence classes grow as the length of the DNA grows? In Chapter 7 and 8 , we study a related problem that might be helpful in solving and understanding the original problems. Another question, related to our original problem, is the following: Assume we have the two fixed single digests a and b. Let 7 ra and 7 ra< be two perm utations of the elements of a and let 7 T b and 7 T b ' be two perm utations of b. From (7 ra, 7T b) and (7 ra« , 7 T b < ) we might get different double digests c and c'. We write a = {(Aj, /,) : 1 < 39 i < n '} where /, indicates the frequency w ith which the fragment length A\ is listed in the set a. Note th at /< = n. For example if a = {1,1,2,2,2,3} we would rewrite it as {(1,2), (2,3), (3,1)}. We do the same for the set b = {(£?,•, < 7 ,), 1 < i < m'}, where 5« = m ■ Therefore the total number of different perm utations of a and b equals n! m\ n r : , / . ! rr='i</.!' Some pairs (x#, 7 Tb) will give the same double digests and some will not. Interesting questions include the following: • How many different double digests are there? • Given some fixed double digest, how many different border block graphs are there? To give some insight in the stochastics and combinatorics of the border block graphs, consider the model used in Goldstein and W aterman. We have two enzymes A and B th at cut the DNA with rate pa and p s, respectively. We stop this process as soon as we have a coincident cut. Now we take a look at a very specific subset of the double digests, called the border block fragments, in the order they are generated along the DNA. Say we have the border block fragments ( l i , ..., /r+i), th at then induce a border block graph. (The border block fragments give us the vertices in the graph.) In P art II Chapter 7, we will show how to calculate the probability th at we walk from the border block fragment li to the border block fragment /2, then to the border block fragment /3, and so on until we reach a coincident cut after the last border block fragment /r+i. (See Figure 7.1.) All graphs th at we get by walking in this m anner from border block fragment l\ up to the border block fragment /r+i will be denoted by .. ,/r+i)- In Chapter 7.1, we will give recursive formulas on how to calculate P((?(/i,..., /r+i)), by considering the cases that we start, extend or end the graph separately. In Chapter 7.2, we use these recursions to really calculate P(£7(/i,..., lr+1)). In the first section of Chapter 7.2, we will calculate P(£7(/i,..., /r+i)) for the case that the DNA is infinitely long and that we stop the cutting process as soon as we reach a coincident cut site. We will show that P(<?(/i,..., lr+ 1)) only depends on the number 40 of border block fragments and on the total length L = U °f the involved border block fragments (and of course on pa and ps). In the second section of Chapter 7.2, we will fix the length D of the involved DNA. In this case, we let the enzymes cut with the same cut rate, because otherwise the calculations are very complicated, but without coincident cut sites until the end of the DNA. The calculations to find the exact expression for P(C/(/i,..., lr+i)) are too messy, and we will therefore only give a lower bound. Analogous to the case that the DNA is infinitely long, we get that P((?(/i,..., /r+i)) depends only on the number of border block fragments, on the total length L = 52i=i U, and here also on £), the length of the DNA (and of course on the rate pa = Pb )- In Part II Chapter 8 , we take a look at the possible tree topologies. To be more exact, we will give a lower bound R(r, k) on the number of border block graphs having k vertices and r edges. We already know from Goldstein and W aterm an (1987) that there are many different solutions. The combinatorics here indicates th a t there are also many different border block graphs. For example, there is superexponential growth in the number of vertices. In P art II Chapter 9, we consider the expected number of border block graphs, where the enzymes A and B run on an infinitely long DNA strand until we finally have a coincident cut. How many different border block graphs do we expect (fixing some param eter)? We only give a relatively complicated and rough estim ate, and do not have any asymptotics. 41 C hapter 7 W alking In this chapter we compute the probability that we get a border block graph as illustrated in Figure 7.1. We first have a border block fragment of length L, then an A-edge, then a border block fragment of length I2 , then a fi-edge and so on. The last border block fragment before the coincident cut has length lT+i. Thus we have r edges, and the last edge before the coincident cut is either an A-edge if r is odd or a 5-edge if r is even. (See again Figure 7.1.) The set of graphs we obtain from this walking-process will be denoted by G (l\ , ..., /r+i)- For convenience, we will use the notation G(l 1,..., /r+i) for a single graph as well as for the whole set of graphs we get by this walking process. Note th at the 1,’s are ordered in our setup, which they are not in the border block graph. Therefore a specific border block graph can come from a lot of different walks, because two border block fragment lengths |/,| and |/j|,i ^ j correspond to the same vertex in the graph if |/,| = |/j|. We, of course consider the discrete case, where the /*’s are integers. We now fix a tuple 1 = (/*,..., /r+j) of border block fragment lengths. Let A\ denote the set of allowable (ordered) cut sites corresponding to the tuple 1 . A\ is the set of possible cut sites Si of the two enzymes, alternating from enzyme A to enzyme B, th at are relevant for the restriction m ap and that will get us a graph G(l 1,. .., /r+1)- (See Figure 7.2.) Therefore, if we assume the DNA to be infinitely long, we can describe A\ in the following way: A\ — {($ 1,..., ST) : Si > I\ + h and Si > 5,_ 1 + ft+i for 2 < i < r}. 42 h h h Wi A B -J^T" .... _L_ r cxld •l I2 * 3 Wl Figure 7.1: Walking Let G(Si , . . . , Sr) denote the graph corresponding to these cut sites. Then we obtain P (G (/1, . . . , / r+1) ) = £ P(G(Su . . . , S r)). (7.1) (5j ,...f 5r)€«4j We will now show how to compute P ( G ( 5 j,. . . , Sr)) recursively. 7.1 Starting, E xten d in g and E nding th e Graph Let W a denote the waiting tim e corresponding to the enzyme A; Wa has a geometric distribution with success probability pa• Analogously Wb has a geometric distribu­ tion with success probability ps- Let G(S\ , . . . , St) denote the part of the graph up to the cut site 5,. We can think of building up the graph by wandering from one relevant cut site to the next. First we consider starting the graph. See Figure 7.3 for the setup to calculate the probability that we start the graph with an A-edge connecting a border block fragment of length /1 with a border block fragment of length li- We have two cases to consider: the case that Si > li + I2 and the case that Si = h + h- 43 Figure 7.2: Walking for r odd Case 1 (5i > h + h ) gives us P(G(Sl)) = P(WA = Si)P(WB = /i)P (5 -c u t at position £ - l2 )P(WB > h) = P /i(l - P /0 Si_1Pb(1 - P b )'1-1Pb(1 - P s ) '2 = pA( 1 - Pa )Si- V b (1 ~ PB),1+h~'. (7.2) Case 2 (Si = /i + l2) gives us P (G (S ,)) = P(W A = Sl)P(WB = /,)P (W B > /,) = PA( 1 - P^)'1+'2_1Pb ( 1 - Pb )< i+<2_1- (7.3) Now we consider to extend G (S i,. . . , S,_i) to G(S\,..., Si) for i > 2. (See Figure 7.4.) Here we consider four cases. If i is odd and i < r, we are in the situation of Figure 7.4a). We get for the situation S, > S,_i + /,+i P ( G ( S ! ,...,S ,) ) = P(G(SU ..., S,_i))P(W /4 = Si - (S.--, - li)\WA > k) •P(B-cut at position S, — li+i)P(W fl > /;+i) ____________ | : A S,-l2 _ j _______________; B u | Figure 7.3: Starting the graph = P(G(S 1, . . . , 5,_ 1))pa( 1 - P^)S|- s- , - 1Pb( 1 - P s ) ,,+1 ■ (7.4) For the case that 5, = 5,-_i + /;+i, we get similarly P (G (S l t .. •, Si)) = P (G{SU . .., 5,--i))p^( 1 - P ^)'1 + 1 - J(l - Pb)'1 + i • (7.5) If i is even and i < r, we are in the case of Figure 7.4b) and S, > 5,_i + /,+i. We obtain P (G(S 1 , . . . , S i)) = P ( G ( 5 ,,. . . , 5,_ 1))P(W b = 5,- - (5,_! - U)\WB > U) •P(A-cut at position 5, — /,-+i)P(Wyt > /,+i) = P (G (5 1, . . . , 5 i_i))pfl(l - p B)s*-5i- 1- 1p ^(l - p ^ ) /,+1- (7.6) In the case that 5, = + /,+i, we have P ( G ( S i,... ,5 ,)) = P ( G ( 5 i,. . . ,5,_i))pb(1 - P b ) ,‘+1_1(1 - Pa )1 * 1- (7.7) 45 (a) Si-i-li Si- (b) Si., Si-i-l; Si Si’li+l 'i+l Si’lj+1 Si ‘ i+l A B C A B A B C Figure 7.4: Extending the graph: (a) i odd and (b) i even 46 Now we want to consider the situation of ending the graph (with a coincident cut). In this situation we have to distinguish between having an even or an odd number of edges. If r is odd, we are in the situation of Figure 7.5a). Here we obtain for ST > S T- 1 + /r+1 : P(G (S„...,S,)) = P(G(S,,. . • , Sr_i))P(W.A = Sr - (Sr-! ~ lr)\WA > /,) •P(B-cut at position ST — /r+i)P(W B = /r+i) = P(G(5!,..., 5r_1))p^(l - Pa )S' - S'-'-1P2 b ( 1 - Ps),r+1-1- (7.8) In the case that ST = ST~i + /r+i, we have P(G (5i,.. •, Sr)) = P(G(SU . • •, 5r-i))p^(l - pAf " - lPB( 1 - pb)1 * 1- 1. (7.9) In the case that r is even, we get the following results (considering Figure 7.5b). If Sr > Sr- 1 + /r+i, then P (G (S „...,S r)) = P(G(Si, . . . , Sr-l))P(WB = S r - (Sr-1 ~ lr)\WB > lT) •P(A-cut at position ST — /r+i)P(VFA = lr+l) = P(G(SU...,S r -1))pB( l - p B ) Sr- Sr-i- y A( l-P A )1 ^ - 1. (7.10) For Sr = Sr- 1 + /r+i, we have P (G(SU . .. ,Sr)) = P(G (5i,..., Sr_ ,))M l - PBt+'~lpA( 1 - pa)1 * ' - 1. (7.11) 47 (a) Sr-l-lr Sr - Sr S r_1 T+l ‘ r+ 1 A B C (b) Sr-l Sr-l-lr Sr'lr+1 Sr l r+l A B C Figure 7.5: Ending the graph: (a) r odd and (b) r even 48 7.2 T he ‘W alking-P robability’ 7.2.1 The ‘W alking-Probability’ for an infinitely long D N A strand In this section, we will compute P ( G ( /i,. . . , /r+i)). Recall that P ( G ( / „ ...,/ r+1)) = E E ••• E p ( G ( s „ . . . , 5 r)). Si > ll+ h ^2>Si +/3 5 r> 5 r_ 1 + /r+l We will do the calculation by starting from the inside. As before, we have different cases to consider. Case 1 corresponds to an odd number of edges and Case 2 to an even num ber of edges. Case 1 (r odd): First we calculate the innermost sum IT = E s r>sr.!+ ir+i P ( G (S i,. . . , Sr))- Using equations (7.8) and (7.9), we have Ir = £ P (G (5 1, . . . , 5 r)) SV>.Sr_ l +/r+l = P (G(SU . . . , Sr-i))PA( 1 - Pa)‘^ - 1Pb( 1 - PB) ,r+1- 1 + P (G(S1, . . . , 5 r_1))p^(l - p a Y ^ - '- 'p U 1 - P s ),r+1_1 oo E ( 1 ~ P a ) 1 /= 5 r_i + /r+i +1 = P ( G ( S „ . . . , 5r_ i))(l - Pa) 1^ P b { 1 - P e ) ,r+1-1 A r, (7.12) where A r = - ^ — + pB. 1 ~ Pa The next step will be to calculate / r_j = H sr_,>sr_2+;r Ir• Noting that r — 1 is even and then using equations (7.6) and (7.7), we have I r - 1 = E I t S r - 1 > S r_ j + /r = P (G (S „...,S r_2))p!(l - p B)'"+^ - 2(l - p ^ ) ' ^ A r +P(G(51,...,5 r_2))p|(l -P B ) 1^ - Sr- 2~2PA{ 1 - PA),r+,r+1 49 •A r- £ (1 ~ P b ) 1 l~ S T— 2 + /r+ l = P(G(5i,..., 5r-2))pe(l — PB)/r+<r+1_1(l — Pa)1t+1'+1 Ar_i, (7.13) where ir~ '= (i^k<+ Ps) { t = T b + pa) - For 2 < * < r — 1, we let /j = Es.>s,_i+(,+i A+i- Then we have the following result: C la im 1 For 2 < i < r and r odd, we have /, = P(G(Si,..., 5,_1))pB(l - Pb)(E- ^ 'jM (1 - Pa 'J A,-, where <r-i+l)/2 , no _ \ (r *+1 )/2 A< = ~ l" Pb ) ^ g + p A y - — i f i is even, + Pb ) 1+(r 0/2 + pa) (r 0/2 if i is odd. Proof: Note th at we already know that the statem ent is true for r and r — 1. We will use these two results as starting values to calculate the others. We first consider the case that i is even, so i — 1 is odd. Then, we obtain for /*_i = using equations (7.4) and (7.5) b -i = £ a Si- 1 > $ _ 2 +/, = P ( G ( 5 i ,.. .,5 ,_ 2))pb(1 — P b ) ^ j=‘ /j)-1Pa(1 - P 4 ) (^ ,=‘ ,j)_1 A , + P ( G ( S „ . . . , S,_2))p !(1 - p b )(E % ,j)- 'p a ( 1 - P ,0 (E - ^ o o •A< £ ) ( i - P ^ y /=5|_2+/»+l = P ( G ( 5 i ,. . . , $ - 2))p b (1 - Pb )(E - - ‘ ,,)_1(1 - Pa )E % 'j 50 Comparing the exponents of pa / ( l — Pa ) + Pb and p s / ( l — Pb ) + Pa , we see that this corresponds to the result of the claim since i — 1 is odd. The case th at i is odd goes analogously using equations (7.6) and (7.7). □ From the claim, we obtain h = P(G(51))pi,(l — p b ) ^ ' ^ ! — Pa)E ^ ' jA 2, where / \ (r l)/2 / \ ( r - l ) / 2 a ’ = (t^ I +Ps) ( r = ^ +PJ) • Using this we obtain, using equations (7.2) and (7.3) E h S i> li+ h = pA( 1 - Pa)1 ^ ’*1 1,)~1Pb{1 - P b )(^ j=i ,j)-2 A 2 + p ,i(l - p^ ) ^ ' " 3 ^ ” 1^ 1 - P b )(^ ' =i l>)_2A 2 o o • E ( i - p * ) ' i='l+/2 + l = (1 - P / t ) ^ J=1 /j( l - Pb )(^ j=1 ,j)-2Pb A 2 ^ 1 + Pb j • This proves the following lemma. L e m m a 5 If r is odd, then we have P (G (/„...,/r+1)) = (1 - pA) ^ ]= 1 /jP b ( 1 ~ P b )(^ j=1 l,)~2 / n \ < r+1)/2 / r, \ (r_1)/2 f e + P ‘ ) (l^H ' Now we return to Case 2 (r even). The result in this case is calculated analogously as Case 1. So we only give the outline and the interm ediate results. W ith the same notation as before we have the following claim. C la im 2 For 2 < i < r and r even, we have li = P (G (SU . .., S ;- i) ) p * ( l - Pa ) ( E ^ + ' 'j)_1(1 - P b )E - ‘ I+ 1 ,j A „ 51 where A< = sl+(r-i)/2 (r-,+l)/2 if i is even if i is odd. Therefore in this case we obtain for / 2 the following: I2 = P(G(5x))pA(l -P 4 )(E^ ' jM (1 - Pb) ^ ‘3&2, where / „ \ (r - 2) /2 / „ \ i+(r- 2)/2 ( t^ H ■ Using this we can now calculate 5 . E / 2 = p \ { 1 - P a )(^ = 1 1 , ) ~ 2P b ( 1 - Pb)(^ = 1 ,j)_1 A 2 +P^(1 - Pa){^ ’=3I,)~2Pb( 1 “ Pb)(^ = 1 /j)_1A2 00 /= /l + / 2 + l This give us the following lemma. L e m m a 6 If r is even, then we have P ( G ( / „ ...,/ r+1)) = Pa ( 1 - Pyt)(^ =1 (,)_1Pb (1 - Pb )(^ j=1 ,>)_1 Thus far we have studied the case th at we started the graph with an A-edge. Of course there is the corresponding case that we start with an 5-edge. Let the set of graphs th at corresponds to a walking via ( L ,. . . , /r+i), but starting with a 5-edge, be denoted by G'(l\ , . . . , /r+i). By symmetry, we get P ( G '( /i,. . . , *r+i)) by interchanging Pa and ps in P (G (/i,. . . , L+i))- Therefore we have in this case the following lemma. 52 L e m m a 7 I f r is odd, we have P ( G '( / „ . . . , / r+i)) = p \( 1 - i,)-2(l - p b ) ^ j=1 / r, \ (’ - 1 )/2 / „ \ (r+1)/2 And, if r is even, then we have = pA( 1 - Pyi)(^ =I ,j)_1p b (1 - Pb)(^ j=1 ,,)-1 Note that P ( G ( /i,. . . , /r+i)) and P (C r'(/i,. . . , lr+i)) only depend on P a , P b , on the number of edges r, and on L = J lj, the total length of the border block fragments. Let /r+i) denote the set of graphs we get by walking via ( /j ,. . . , /r+i). Note G{hi • • • i L+i) is the union of . . , /r+j) and G'(li, . . . , L+i)- Therefore here it does not m atter whether we started with an A-edge or a 19-edge. Combining the lemmas above we therefore have the following result. T h e o re m 5 Let pa and ps denote the cut rates of the enzyme A and B respectively. Also let qA = 1 - p A,qB = l~PB,rA = pa /( 1 ~ P a ) + Pb andrB = P b /(1 ~ P b ) + Pa- Then we have for some fixed number r P ( S ( / i , . . . , / r+i)) ! t lpAPB{qAqB)L~l {y/rArB)T _ if r is even, (qAqB)L(y/rArBy ((PBlqB?\JrAlrB + IPA/qA)2\fr B /r A) if r is odd, where r+l L = £ / ; - j=i 53 7.2.2 ‘W alking-Probability’ for a finitely long D N A strand In this section, we will give a lower bound on P ( G ( /i,. . . , /r+1))- Because the calcu­ lations become very tedious, we assume that p = Pa = PB and q = 1 — p. We assume th at the DNA has finite length D. Recall that P(G(lU ...,l r+l)) E E £ £ P(G(Su . . . , S r)). D — (ir+l +/r)> ^r_2 >S’ r_3+/r — 1 ^ “^ r+1 >^r— 1 2+^r In the case th at p = Pa = Pb and q = 1 —p, the recursions for starting, extending and ending the graph simplify and are now independent from the case distinction even/odd we had before. Ending the graph becomes (Sr > 5r_i + /r+i ) : P ( G (S i,. . . , Sr)) = p3qD- s^ +,^ ~ 2 P(G(S1, . . .,S r-i)). Note that we used the fact th at Sr = D in our former recursions. Extending the graph becomes (S, > S{-\ + /,+i) : P ( G ( 5 „ ... ,5 ,)) = p 2 qs'-Si- '+li+'-l P(G(Su ... ,5,•_!)). And starting the graph becomes (Si > /i + /2) ' P (G(Si)) = p3qSl+,1+h. We again start the calculation with the innermost sum. £ P V - * - > +''+> - 2P (G (5x, . . . , 5 r_ !)) D — lT +1 >Sr — 1 >5r_2 +/r > E p2qS'- 1- Sr-i+lr- 1p3 qD- S ^ +t^ - 2P(G(Su ..., Sr- 2)) D— lr+i>Sr— i >5r_2+/p > p 5qD- s'-*+,'+'+,' - 3 P(G(Su ..., Sr_2)). (7.14) 54 Note that we used a very rough estim ate in the last inequality. For the next sum, we obtain, £ Py>-*-*+''+»+,'- 3P (G (s„ ..., sr_2)) S r-l-(/r+l+lr)>Sr_2>Sr-3+/r-l > p 7qD- Sr- 3 +lr+l+,r+lr- '- 4P(G(Su . . . , S r- 3)). (7.15) Inductively we obtain P (G(Si,...,Sr)) > p 2(r-3)+39D -S i+ £^H r-3)-2p(£(5i)) ^ Hence P (G (5 i,...,5 r)) > p * 9 D+l £ ! ,*-<r+1> . (7.1.7) Using this result and the fact that we can start by an yl-edge or an B-edge, we have the following claim. C la im 3 Let p = Pa = Pb denote the cut rates and q = 1 —p. For some fixed number r, we have P(G(51,...,5 r) ) > ( ^ ) 9°+L~ \ where r+l j=1 55 C hapter 8 G raph T opologies In this section, we will give a lower bound on the number of border block graphs having k vertices and r edges. Recall that, from the walking process described in Chapter 7, we obtain the border block graph by collapsing border block fragments of the same length to a single vertex. Let R(r, k) denote the number of different, balanced m ultigraphs (except possibly two vertices) having k vertices and r edges. In this section we will give a lower bound on R(r,k). To get a lower bound, we will only consider the graphs which have a ‘tree skeleton’; this means th at after collapsing m ultiple edges to single ones and eliminating all loops, we will have a tree. See Figure 8.1 for an example. Recall th at a tree having k vertices has k — 1 edges and that a tree with at least three vertices has at least two term inal vertices and therefore at least two terminal edges. Therefore, if we ignore any two term inal edges, we used k — 3 edges. We will distinguish two cases, namely the case that we have a few edges (r < 2k — 4) and the case th at we have plenty of edges (r > 2k — 4). If we have at least 2k—A edges we can construct from any tree an almost balanced graph (that is, except possibly two vertices) by doubling all the edges, besides two chosen term inal ones. We therefore used 2{k — 3 ) + 2 = 2 fc — 4 edges up to now. Hence, we still can distribute r — (2k — 4) edges on the graph in a way th at does not destroy the balance of the graph. We will do this by always distributing a pair of (same) edges (they might be a pair of loops) instead of single ones. So we will always keep the balance of the graph. The number of possible places to put the edges equals 2 k — 1 ; therefore k possible choices for loops (the vertices) and k — 1 possible choices for the ‘real’ edges. To calculate the number of ways to distribute the edges, 56 Figure 8.1: ‘tree skeleton’: (a) shows a graph having a ‘tree skeleton’ and (b) shows an example without one 57 we have to distinguish between an even number of edges and an odd number. If r is even, we have to distribute (r — 2k + 4)/2 pairs of edges. The num ber of ways to do this corresponds to selecting a m ultiset of size (r — 2k + 4)/2 of a set of size 2k — 1. Therefore the num ber of ways of placing these edges is 2 ifc — 1 + (r — 2Jb + 4)/2 - (r — 2fc + 4)/2 l \ = / (2 k + r ) / 2 \ ) \ ( r - 2 k + * ) / 2 ) ' If r is odd, we have to distribute (r — 2 A : + 4 — l)/2 pairs of edges (there is one edges left; it will go to one of the term inal edges). The number of ways to do this corresponds to selecting a m ultiset of size (r — 2k + 3)/2 of a set of size 2k— I, which can be done in the following number of ways: / 2k — 1 + (r — 2 J fc + 3)/2 — 1 \ / (2* + r — l)/2 \ \ (r — 2k + 3)/2 j ” \ (r-2fc + 3)/2 ) ' This gives us a lower bound for the number of ways to put the edges on the underlying tree. The number of different topologies, th at is the number of labeled trees on k vertices, equals kk~2. Therefore we have the following lemma. L em m a 8 For r > 2k — 4, we have the following lower bound. If r is even, then ,* _ 2 I (2 * + r ) / 2 R(r, k) > k I v . (8.1) \ (r - 2 k + 4)/2 ) If r is odd, we have ,* -2 1 (2 * + r - l ) / 2 f l ( r , i t ) > r - 2 v " . (8 .2 ) \ ( r - 2 k + 3)/2 J We know that the number of solutions tends to be large. From the above argu­ ments, it seems intuitive th at there are also a large number of graphs. For the case th at we have only few edges, we will use a very rough estim ate. Here we just consider the case th at the tree skeleton is a line. Because of the symmetry, there are k\/2 possible choices for such a (labeled) tree. This gives us the following (rough) lower bound. 58 L e m m a 9 In the case that k — I < r < 2k — 4 roe have R (r,k)> k\/2 . (8.3) 59 C hapter 9 T he E xp ected N um ber o f B order B lock Graphs It is natural in probability to compute expectations of random variables. Therefore we study the expected number of border block graphs. Our only approach to do this is via Chapter 7 and 8 , where we calculate the walking probability and bound the number of graph topologies. We observe th at a border block graph with vertex sum n has very small probability, when n is large. If we obtain a large lower bound to the expected value of border block graphs w ith vertex sum n, this would be very informative. However, even a small upper bound is not correspondingly informative, since the biologist is faced with proposing maps consistent with a fixed set of data. 9.1 E stim ating th e E xp ected N um ber o f B order B lock Graphs Note that a graph is specified by its vertices and edges. We will consider border block graphs with constant vertex sum. The graph has vertices vlf... ,vt and vj H ( - u* is constant. Let E (n) denote the expected number of border block graphs with total vertex sum n. E (n) = E ( # of border block graphs with total vertex sum n) ^ j / # of border block graphs with total vertex sum n, \ t= 1 y having k vertices J " ~ / # of border block graphs with total vertex sum n , \ = y , y , E I I (9.1) *= i r=k-1 V having k vertices and r edges / 60 Let qk(n) denote the number of partitions of the integer n into k distinct (non­ trivial) parts. For example, 92(b) = 2(1 + 4,2 + 3) and 92(6 ) = 2(1 + 5,2 + 4). Let pg(n,r,k) be a lower bound on P ( £ ( /i,..., /r+i)) such that we have k distinct s th at add up to n; that is, the border block graph has k distinct vertices that add up to n and the graph has r edges. Hence 9*(n) gives us the num ber of different vertex constellations, and R(r, k) gives us a lower bound on the number of different almost balanced graphs having k vertices and r edges; that is, qk(n)R(r, k ) gives us a lower bound on the number of different almost balanced graphs having r edges and k vertices that add up to n. Also pg(n, r, k) is a lower bound on getting such a border block graph. Hence from equation (9.1) we obtain. n 00 E (” ) ^ Y , H qk{n)R(r,k)ps (n,r,k) k = 1 r = k- 1 n 00 > Y ^ ( n ) Y R(r,k)pg(n,r,k). (9.2) k —4 r = k- 1 We now deal with the expression f?(r , k)pg(n,r, k). To get a lower bound on P(C?(/i,. . . , L+i)), we use th at P ( £ ( /i,. . . , L+i)) decreases (for fixed r) as L increases. Therefore to find a lower bound, we just have to find the maximal walking distance to get a border block graph with k vertices that add up to n having r edges. But if we have enough edges, that is r > 2k — 4, the maximal distance equals 2n — 3 + n(r — (2k — 4)), because we use 2k — 4 edges to connect the tree skeleton (with double edges except two term inal ones) by a walk of length < 2n — 3 using k different vertices. Then we still have to use r — 2 A : + 4 edges, connecting border block fragments of length < n. As before, let 9^ = 1 ~Pa,Qb = 1 — Pb, = pa/(1~Pa)+Pb and rg — pb/( 1 — Pb) + Pa- If r < 2 A : — 4, then we used k — 1 edges to connect the tree skeleton (a line) by a walk of length < n. Hence the total length is less than or equal n + n (r — (k — 1)) = n(r — k + 2). We have the following lemma. L e m m a 10 Then we have for r > 2k — 4, pg(n,r, k ) 61 > %PAPB(qAqB)2n+n{r 2k+4) 4(v 'r i r fi)f i f r is even, (qAqB)2n+n{r- 2k+ 4)- 3 (y/w i ) r (9.3) x ((PB/qB)2 ^ r A/rB + (pa/qA ) 2 \JrB/ ra) if r is odd, and, for k — 1 < r < 2fc — 4, pg(n,r,k) f 2PAPB{qAqB)n(r~k+i)~l {\/rArB)r ifr is even, (qAqB)n{T~k+2)(y/rArB)r (9.4) x ((PB/qB)2\JrA/rB + ( P /i/^ ) 2^ / ^ ) ifr is odd. We now want to compute ^ ( r > k)xT f°r k > 4. We first compute a lower bound for £ ^ 2f c -4 ^ ( r > k)xT- If r is even, by replacing r with 2r', we obtain ~ / (2k + r)/2 \ r ^ - 4 \ ( r - 2 k + 4)/2 J r even = £ ( * + r# W ' ,CT-2 \ r ' - k + 2 ) 1 oo = ^ r j j ! J ; < * + r ')<‘ + r ' - *) •••<’• ' - k + 3>* 1 oo --------- £ r '(r ' — 1) • • • (r' — 2 A : + 3 )(i2)r -2fc+2+fc-2 (2fc 2)! r,=2A _ / 1 \ 2*_1 - ( r h O where for the last equation we used the fact that / i s n+l oo ( — _ - ) = 1/n! ^ 2 s(s - 1) • • • (s - n + l)x s-n. x ' 3=n Analogously, we obtain for r odd, by rewriting r as 2 r' -f 1, ~ / ( 2 fc + r - l ) / 2 \ r r . ^ \ (r-2fc + 3)/2 ) X r odd 62 CO = £ k + r' r '= k —2 \ r ' - k + 2 2k— 1 X 2r'+l _2k-3 _ / _ i _ V 2 \ l - x 2J X The term s for k — 1 < r < 2k — 4 can be calculated by Y xr = (x * - 1 - X 2 k ~ 4 ) / ( 1 - x). r = k - l We proved the following lemma. L e m m a 11 For k > 4 we have: oo / 1 \ 2k-l £ rt(r,*)xr > * fc - 2 ( — - ) x2*-4, (9.5) r=2k-4 X J oo / 1 \ 2fc-l £ R(r,k)xr>k k - 2 { ^ — 2 J x 2* " 3 (9.6) r=2ifc-4 r odd and 2k-5 u / k - 1 _ „2fc-4\ £ f?(r,fc)xr > ' . (9.7) r=k— 1 ^ ( l X) Now we put the last two lemmas together to get a lower bound on Y R{r,k)pG(n,r,k). r= k -l We have OO Y R(r,k)p 0 (n,r,k) r=2Jfc-4 r even > 2 pAPB(qAqB)2n~4 (qAqB)n{~2k+4) 63 • X R ( r 1 k)((QAqB)n\/rArBy r = 2 k - 4 r even = 2pi 4pB (^9j3)2n_4fcf e " 2 (rJ 4rB)fc-2(l/(l - {qAqB)2nrArB)f k ~ 1 and O O X) R(r,k)pc(n,r,k) r = 2 k —4 r odd > (qAqB)2n~z{qAqB)n(~2k+4) ([pB/qB) 2 ^ r A/rB + (pA/qA) 2 \fni/rT ) o o • X ^(r^)((9A gfl)n \/rA rB )r r = 2 k —4 r od d = (qAqB)3n- 3 kk- 2 (rArB)k- 2 V ^ ( l/U - (<Mgfl)2nr „ r B ) ) 2* - 1 x ( {PBlqB)2\lrAlrB + {pAlqA)2 \jrBlr ^ j We also have 2fc-5 X R (r, k)PG(n,r,k) r = k - 1 I 1 - (qAqB)ny/r^TB ■ ( 2 PAPB{qAqB) _1 + {pB/qB)2 yJrA/rB + (P/i/9a)2\/^ b 7 ^ a) • So we finally obtain the following lower bound, combining the last three results. T h e o re m 6 For n > 4, we have E(n) > X qk{n )r{k), k=4 where, for k > 4, O O r(k) = X R(r,k)p 0 (n,r,k) r —k —l 64 . / ZpAPBjqAqB)~*____ \ \ + ((PB/qB)2 yJrA/r B + { p a ^ a Y ^ s / t a ) y/rATB{qAqB)n ) + ( 9 ^ f l ) n(2_fc)_1i f c ! / 2 ( ~ ((g»9B)n0 ^ ) 2 t~4 V 1 - (qAqB)ny/rArB • ( 2PAPB{qAqB)~l + (PB/qB) 2\JrA/rB + (PA/qA)2 y/rB/rA) • 65 Part III P oisson A pproxim ation for Long R ep eats in a R andom Sequence w ith A pplication to Sequencing by H ybridization 66 C hapter 10 Introduction to Part III In Part III, we are considering a very special m ethod of sequencing DNA, namely sequencing by hybridization. Sequencing by hybridization is based on the following setup. A short single­ stranded DNA of 8-25 letters is called a probe. The probe will bind or hybridize to a single-stranded target DNA if the substring complementary to the probe exists in the target. If the target is presented to all probes of length / (called /-tuples), then the /-tuple content of the target is known, and this data can be used to determine the sequence of the target. The basic problem (in the ideal case) is the following. A sequence a = a\a 2 ...am is to be sequenced, and the data are all /-tuples present in the sequence. This set of occurring /-tuples (multiplicities allowed) is known as the spectrum of a, 5(a). The problem of determining if the sequence can be uniquely reconstructed from the spectrum is called the recoverability problem. It is natural to pose the sequence recoverability problem as a traveling salesman or Hamilton path problem. The graph Gh = (V h,E h ) has vertex set Vh = 5 (a), and (u,v) € Eh when u = U1U 2 ... uj, v = v\v 2 5 (a) and u 2U 3 ...ui = viv 2 ... u/_ i. Of course this is known to be NP-complete. Pevzner (1989) ingeniously translated this problem in the following way: The vertices of a new graph are the (/ — l)-tuples from 5(a) and the edges correspond to 5(a). Define the graph Ge = (Ve,Ee) where Ve = {xix 2 . . . x/_i : X1X 2 ... x/_i = iqt> 2 • • • ^1-1 or ^2^2 • • • vi and V\V2 ... vi £ 5(a)} is the vertex set and Ee = {(X1X 2 ... x/_i, x 2x3... xj) : x\x2...xi € 5(a)} is the edge set. The problem of determining the sequence a is translated into an Eulerian path problem, one for which there is an efficient solution. 67 Now we give an overview of Part III. In Chapter 11, we will review the underlying theoretical result, a Poisson approximation theorem by [Arratia et al. (1990)] (The­ orem 7). Employing the Chen-Stein m ethod, [Arratia et al. (1990)] also provide a rate of convergence in the total variation distance. (See also [Barbour et al. (1992)].) Chapter 12 contains the application of this theorem for repeats of long subse­ quences in a randomly chosen sequence a = a1a2---am - We will assume that the “letters” a i ,a 2, ...,a m are chosen independently from a finite alphabet, according to a not necessarily uniform distribution. As a long sequence may occur more than once in a word and these occurrences may overlap, the dependence structure is different from th at in sequence alignments. Still, it is a local dependence structure. Hence we can apply Theorem 7 from the previous chapter to obtain a Poisson approxima­ tion theorem for the number of long repeats, together with a rate of convergence (Theorem 8 ). In order to make the approximation nontrivial, we recover as length I for the repeated subsequence that I « lo g i(^ -) (Corollary 1). If we only consider P non-overlapping repeats, the calculations simplify, and the limiting behavior remains unchanged (Corollary 9). Chapter 13 gives the application of the above results to the problem of unique recoverability of a sequence. We will introduce the so-called Ukkonen transformations. They transform one string into another string with the same spectrum. We will call a sequence uniquely recoverable if neither transposi­ tions nor rotations in the string are possible (except possibly trivial ones, leaving the string fixed). This will be explained in more detail in this section. The most likely structure th at would potentially destroy unique recoverability is th at there are several pairs of disjoint repeating regions in the sequence - the subsequences between the repeats can then be interchanged without changing the spectrum. The probability for the latter event can be easily estim ated using the Poisson approxima­ tion result from the last section. The probabilities for other structures th at would destroy unique recoverability can be easily estim ated directly. Combining these es­ tim ates, we get an approximation theorem for the probability of randomly chosen sequence to be unique recoverable from substrings of length I ~ l o g i ( ^ ) , and we P i give a rate of convergence (Theorem 10). 68 C hapter 11 R eview o f a P oisson A pproxim ation T heorem In this section we will review the Poisson approximation theorem which we will use later, following the setup and notation of [Arratia et al. (1990b)]. The theorem gives bounds in term s of the total variation distance between two distributions. See also [Barbour et al. (1992)] for more details and alternative approaches for Poisson approximation using the Chen-Stein method. Assume that (fl,«4, P ) is a probability space. Let C(Y) denote the law or distri­ bution of the random variable Y. If h is a real valued function defined on the support of Yo and Y\ , and if we let \\h\\ = 8UPfc|ft(fc)|, then we can define the total variation distance between Yo and as \\C(Y0) - C ( Y l )\\ = suP|W|=1|E(MK0)) - E ( f t( y 0))| (11-1) = 2sup|P(F0 e A ) - P ( Y 1 e A)\. (11.2) A€A Note th at if fi is Z + = { 1 ,2 ,3 ,...} , then ||£ (V b ,m ) — £(^i)ll 0 as m — ► oo implies that P (V o ,m = k) — * P ( y ] = k) as m — * oo. Consider a finite or countable index set I. For each a € I, let X a be a Bernoulli random variable with pa = P (A a = 1) > 0, W = '£/ X a and A = E(W ). 69 We assume A 6 (0,oo). Suppose that for each a G / we have chosen a set B a C I with a € B a. We think of B a as a neighborhood set of a , consisting of a set of indices /? such that X a and Xp are dependent. We define bi = Y> P°P0 ' (1L3) O r€ //36 B a 62 = 5Z IZ P°P where pap = E ( X aX p), (11.4) a £ / a # / 3 e B a and b 3 = ' £ E \ E ( X a - Pa\a(X,} : p t Ba))\. (11.5) a€l In some sense 6i measures the neighborhood size, b2 measures the expected num­ ber of neighbors of a given occurrence and b 3 measures the dependence between an event and the num ber of occurrences outside its neighborhood. Computing 6i and b 2 involves computing the first and second moments of W. In the case th at X a is inde­ pendent of {Xp : /? ^ B a }, the term 63 equals 0. Here we state the Poisson approxi­ mation theorem by R. A rratia, L. Goldstein and L. Gordon [Arratia et al. (1990b)]. T h eorem 7 (P oisson A p proxim ation) Let W = be the number of oc­ currences o f dependent events and let Z be a Poisson random variable with E (Z) = E (W ) = A . Then ||£(W 0 - £ (Z )|| < 2(bt + b 2 + b3) and |P (W = 0) - e " A | < (1 A A~1)(61 + b2 + 63). We will now set up the notation for the problem we described in the Introduction. 70 C hapter 12 A P oisson A pproxim ation for Long R ep eats 12.1 N otation Let E be a fixed alphabet with s letters and a = a ia 2 ■ ■ • am be a string of length m chosen from Em, and let I be an integer < m . We assume th at the letters are i.i.d. but we do not necessarily assume a uniform distribution on the alphabet. Denote by £a = P (a, = a ) for a € E. Let p = p2 = P (ak = am) = ]T P(a* = am = a ) = £ th at is p is the probability that two specific random letters in the word a are the same. We put PZ ~ P(^fc = = ®n) = ^ ! P ( ^ — am — Q > n — O) — ^ ~ or€E and in general pr = P (a, = a i+ 1 = • • • = a,+r_i) = ^ a r £2 The special case of having a uniform distribution on the alphabet gives us that £a = P(ai = a) = l / s for all a € S, and hence p = 1/s. From now on we will use p instead of p2. Let a[*,i + / — 2] denote the substring 0 * 0,+1 • • • a ,+ /_ 2 of length / — 1. For 1 < i < j < m — I -1- 2 let = 1 (a[*,i + / - 2 ] = a [ j,j + 1 - 2 } and (i = 1 or (a,_! ^ a ^ .i) ) ) , 71 th at is X (itj) is the indicator that at position i and j in the string a, the same substring of length at least / — 1 starts. The condition = 1 or (a,_i ^ declumps the different substrings. Then W' = £ *<«> l<«j<m -/+ 2 counts the number of exact repeats of substrings of length at least / — 1 in the sequence. Our goal is to prove that W is approximately Poisson, that is th a t the number of clumps almost has a Poisson distribution. Observe th at this problem is related to the problem of sequence matching, but in sequence matching two independent strings are compared, whereas here we look for repeats in the same sequence. This makes the dependence structure more complicated. Although the results from se­ quence matching cannot be applied here, we can make use of similar techniques as in [Waterman (1995)]. 12.2 P oisson A pproxim ation We will call the subword a[»,i + / — 2] an i-fragment or sometimes just a fragment. Let u = ( i,j ) ,i < j, and I = {(*, J )|1 < * < i < m — / + 2}. We define the neighborhood B u as the set of all v = (i ', j ') 6 I such th at at least one of the i- fragment or the / -fragment overlaps at least one of the /'-fragment or the /-fragm ent. For v = ( * ',/) € B'u we thus need that at least two of the four substring starting at positions i , i ', j and j ' overlap. There are 4 different ways to choose the overlapping indices, and for the string that overlaps we have at most I different starting positions. Hence we can estim ate the neighborhood size by |Z?U | < 4m/. Let 61,62 and 63 denote the expressions defined in the equations (11.3), (11.4) and (11.5) for the random variable W and the above defined neighborhood B u. Our aim is to prove that & ! + 62 + 63 is small, if / is suitably chosen. The following lemma, which is a relatively straight forward consequence of Holder’s inequality, will be a key to all further calculations. Recall th at p = p2. Then we have the following (see [Waterman (1995)]). 72 L em m a 12 For r > 3 we have pr = p r S ( T ) + r / 2 where £(r) G (0,1/2 — 1 / r]. Using Lemma 12 we have that p l / 0 - 1 ) _ p r5 (r)/(r-l)+ r/(2 (r-l)) _ p . p r « ( r ) /( r - l) + ( 2 -r ) /( 2 ( r - l» which we will use later. L em m a 13 1. I f the i-fragment and the j-fragm ent overlap, we have p u = P (X U = 1) < for Si G [0,1/2). 2. I f at least two o f the four involved fragments overlap, we have P (X U = l , X v = l ) < for S2 G [0,1/2). We will see th at = ~di € [0, — ^ and S2 = — i)+i € ^0, — ^ , where r 6 (r) 2 — r dr = ~ - 7 + 7 T 7 7T € ( 2 — r ' \ 2 (r — 1) ’ . r — 1 2 (r — 1) and 6 (r) is defined in equation ( 1 2.1). Proof: For the first part: We have P (X U = 1) < P(a[i,» + / - 2] = a [j,j + I - 2]). We now write the / — 1 equations, a,' = a_,, a , + i = f l j + n • • • 2 — Q j+ 1 -2 ( 12. 1) ( 12.2 ) (12.3) 73 j j Figure 12.1: Two cases for overlapping fragments; / — 1 = 4. In the top picture, d = a.j = a.j+i = aj+2 = dj+34 , hence e5 = 1. In the bottom picture, a, = aj = a,j+ 2 and = Uj+i — aj+3; hence e3 = 2 . as e2 equations of the form a* = a m, e3 equations of the form a* = am = a„, e4 equations of the form a* = am = an = ar, and, in general, e* equations of the form a,, = a ,2 = • • • = aik, where {i'i, i2, ... , 1*} is a subset of — 2 , j , ..., j + 1 — 2 ,}. Thus e\ = 1 if all involved letters are the same. We have e, > 0 as well as e2 + 2e3 + 3e4 + ••• + (&— 1 )et + ••■ + ( / — l)e/ = / — 1. (12.4) This is the case since we assume no redundancy, that is each a, appears in exactly one type of equation. See also Figure 12.1. Note that we assume that some of the fragments overlap. This can result in having the information “collapse” to information about only / different letters; hence we have as the maximal equation chain length. This gives us P ( * u = l) < P (a[i, i + I — 2 ] = &\j,j + / — 2]) 74 = P(ah = am)e 2 P (ak = am = an)ei P (ak = am = an = ar)et • • • P(a< = ai + 1 = • • • = aj)e‘ = pV Pe i (12.5) where the e, > 0 and satisfy equation (12.4). Using equation (12.2), we get p ( x . = i ) < r=3 r=3 = J J ^ 5 ( r ) / ( r - l ) + ( 2 - r ) / ( 2 ( r - l ) ) y r - 1> er r=2 r=3 = r i p (r~i)er r i( P dr)(r- i)% ( i 2 .6 ) r=2 r=3 where dT was defined in equation (12.3). We let i , = 3 ™ x { - i ) e [°. | ) - So we get from equation (12.6), P(XU = 1) < p6-l) jp p-«.(r-l)«r < p(l-^)(/-l) (12-7) r = 3 Note that 0 < Si < 1/2 since 0 < — dr < | for all r. Also, Holder’s inequality gives (r-l)/r 2 US Pr = E e - e . < ( e ( C ,)'/Ir" ,f . ) ■ 1 = f f c " /r. \or€S / Thus p ; ' 1 '-" <;> v : , and hence dr > dr+i- Therefore < 5 i = — di. This proves the first part of the lemma. 75 To prove the second statem ent, we now write similarly as above P ( X U = 1 ,X V = 1) < P (a [i,i + / - 2] = a [ j,j + I - 2],a[*',t' + / - 2] = a \ j',j' + / - 2]) < P (ak = a m ) ' 2 P (a f c = a m = a n )e3 P(a* = am - an = ar)u ■ ■ ■ ■ • ■ P ( ai = a,+1 = ■ ■ ■ = aj = ■•■ = av = ••• = a j< )e 2 (,_1)+ 1 = P? P? 1, (12-8) where e2 is the number of equations of the form a* = am, ek is the number of equations of the form a„ = a ,2 = • • • = a,t , and {?i, • • •, **} is a subset of + + 1 - 2 , + 1 - 2 , + 1 - 2 } . Thus e2(;_1)+i = 1 indicates the fact that all involved letters are the same. Again e, > 0 and since we assume no redundancy, 62 + 2e3 -f 3 e4 + • • • + (k — l) e * + • • • + 2(1 — 1 )c2(/— i)+ i = 2(1 — 1). (1 2 .9 ) Note th at we still assume that some of the fragments overlap (that is why we have e2(/_i)+i as the maximal equation chain length; see Figure 12.2). We can now rewrite equation (12.8) using equation (12.2) in the following way: p(x„=i,x„=i) < p - n n ’ r=3 r=3 2(/-l)+l 2(/-l)+l = n p ( r - " " n ( p * ) ( 1 2 . 1 0 ) r—2 r— 3 where dT as in equation (12.3). We let So we get from equation (12.10) 6 2 = max 3<r<2((— i ) + i ' 2 (/-l)+ l P ^ ^ l ^ ^ l ) ^ ? 2^ - 1) n p - « 2( r - l ) e r < p (1-52)2(/-1) ( 12 U ) r=3 76 Figure 12.2: The “worst” case for overlapping fragments; / — 1 = 4. We have a, = a,j = flj.fi = flj+2 = = = ®j'+i = ®j'+2 = ®i'+3) hence eg = 1 . Note th at 0 < £2 < 1/2 since 0 < — dr < | for all r. By the same argument as above, < ^ 2 = — ^ 2(l-l)+l, and 8 2 > & \. □ L e m m a 14 We have b\ < 2 + 4 m 2/2p2(1-#1h,_1) and 63 = 0, 8 \ € [0 ,1 / 2 ) as defined above. P ro o f: By construction of the neighborhood B u we have 63 = 0. Note that |/| < and |BU| < 4/m. Therefore we have * > i = H X) PuPv < |5 u|p(1_'il)('_1)^ F u u€/v€Bu «€/ < 4m/p*1_Sl^/-1* (|/|p ^ -1^ + m/p^1-5l^ ,_1^ < 2m 3/p<2- Sl)<'-1) + 4m 2/2p2< 1- 6l)(' - 1 ) For the second inequality, we split up the sum g/p„ into the part with no overlap between the i- and the j-fragm ent, and in the part with overlap. For the first part, 77 independence gives us that pu , and for the part with overlap we used Lemma 13 to bound pu. □ L e m m a 15 In the uniform case, that is p = l /s , we have Si = 0. P ro o f: We have for all 5 -(»—i) — p T — p rS(T)+T/ 2 = s ~ (rS(r )+T/ 2)' Therefore S(r) = (r — 2)/(2r), and hence r 6 (r) 2 — r r = ------- 1 + o7 7 7 = r — 1 2 (r — 1 ) which implies that S\ = 0 . □ So, to get a reasonable Poisson approximation theorem it remains to get a good estim ate for b2. Later we will approximate A = E( W ), the Poisson parameter. L e m m a 16 We get the following estimate o f b2 : b2 < 64m2/ V (1^ 2)('_1) + 2 m 3/p§<'-i>+3(J-D«(3) where S2 € [0, from Lemma 13 and 6(3) was defined in equation (12.1). P ro o f: Recall that *2 = E E E (X u X v )- u6 / u^uGBu We will distinguish between different potential overlap situations. Denote i ~ j if |i — j | < 21, that is the f-fragment and the j-fragm ent potentially overlap. Then we can write the above sum as follows: = E E e ( ^ ^ ) + E E e ( ^ ^ ) + E . E e ( ^ „ ) + y : i*i i'+j' + E E 1I * ~ or * + ~ •Vi *'+}’ 78 j' Figure 12.3: The type of overlap if only two segments may overlap; / — 1 = 4. We have 3 equations of length 3 and 2 equations of length 2; hence e3 = 3, e2 = 2. The first four term s can be easily estim ated by 1 6 m 2/2p2(1- 52M ,_1) considering the number of possible overlap situations and estimating E(A"U A„) using Lemma 13. As an example consider the third term. For i 'f j we have less than m 2 possible constellations and if € B u (at least two of the four involved fragments overlap) and i' ~ j ' we have less than 16/2 possibilities. So it remains to consider the term l[t ~ i \ j / j 1 or i ^ i',j ~ j /]E(A uX u). For v = (i ',j ') € B u we need overlap and since i j and i' / j ' and i ~ i ',j / j 1 or i i ',j ~ j', only two fragments can overlap. So the longest equation chains are of the form a* = a/ = a m. (See Figure 12.3.) Using th at e2 + 2 e3 = 2(1 — 1) and hence e3 < l — l and noting that 6(3) € [0,1/6), we get for such u and v, using equation ( 1 2.1), E(A:U X„) < p e 22 p e 33 = p^+(3«(3)+3/2)e3 = p 2(/-l)+(3S(3)-l/2)e3 < p 2((-l)+ (3 « (3 )-l/2 )(/-l) < p (/-l)3/2+3«(3)(/-l) 79 Since |/ | < m 2/ 2 and \BU\ < 4m/ we obtain b2 < 64m 2/ V (1"42)('_1) + 2 m 3/p i('- 1> +3(,- 1)* < 3> . □ L em m a 17 In the uniform case, that is p = 1/s , we have S2 = 0. Proof: In the proof for Si we showed th at dr = 0. □ For the expectation of W, which will be the param eter or mean of the Poisson distribution, it is tedious to calculate the exact value because of the strong depen­ dencies involved. However, we can easily give the following estimates. L em m a 18 - 21 + 3y , . „ (1 _ p) ^ n w ) £ - 21 + 3 y , _ „ + We will denote E(W) by A = A(m). Proof: We have E (W) = £ P(^« = l) l< t< i< m -/+ 2 j r p(x„ = i)+ E p(*« = i) l< » < j< m -f+ 2 l< « < j< m -/+ 2 j - i > l j — 1</— 1 < E / - ,) + E p o -* 1 '1 - 1 ' l< t'< j< m -/+ 2 l < i < j < m - l + 2 < (m r 2/ + 2jim ~ 2f- ± j ) / - i ) + (i2 .i2 ) Using the second equation we get the lower bound by ignoring the overlapping fragment terms (second term ), and by observing P (X U = 1) > p*,_1*(l — p). □ In the uniform case, the probabilities in the two sums are easy to calculate. L em m a 19 In the uniform case, we have E(W) = (m - / + l ) ^ ' - 1* + ^ ~ l 2 + - s-1 ). 80 P ro o f: The first term corresponds to the case i — 1; we have P (X (W) = 1) = P (a [l, / - 1) = a\j,j + 1-2}) = s-V -'K The case i / 1 gives us P (X (< t J - ) = 1) = P(a[i, i + / - 2] = a [j,j + 1-2] and a,_i ^ a,_ j) = s"('- Using the above results we get the Poisson approximation theorem in the general case. T h e o re m 8 Assume any distribution on the alphabet £ with |E| = s. Let X(ij) = 1 (a[i,i + / - 2] = a [j,j + I - 2]&(i = 1 or (a,_i ± a_,_i))) and W = E x«j)- l< i< j< m — l+2 Assume that Z is a Poisson random variable with E (Z ) = E(VU) = A as described in Lemma 18. Let 8 = S2 E [0,1/2) with S2 as above and let p = p2 = Then by Theorem 7 we get \\C(W)-C(Z)\\<2(b1 + b2), where 6, < 2m V 2- W - l> + 4m 2/ V ( 1-* » '-1) and 62 < 64m 2/2pat1- fM'-1) + 2m 3Ip ^ 1 - 1^ 1 - 1^ . From the above theorem we get the following corollary. Note that we write f(n) « g(n) if f(n)/g(n) — ► c as n — » 0 0 for some constant c > 0 . 81 C o ro lla ry 1 I f we let I = [log1/p(m 2/c)J for some constant c > 0, then the right hand side in Theorem 8 tends to zero as m tends to infinity, that is ||£(W ) - C(Z) || = O (m 2 S 1 log1/p(m )) + O (m 65(3) log1/p(m )) . So W converges, as m tends to infinity, to a Poisson random variable with mean (1 — p)c c A € 2p ’ 2p ' P ro o f: Using I = |j0 gi/p(m 2/ c)J> we see ~ m 3 log1/p(m )m 25 4 + m 2 (log1/p(m ) ) 2 m 4(1 6 ) « m 24-1 log1/p(m) + (m w _1 log1/p(m ) ) 2 . We claim th at 0 < 8 < 1/2 — 5(3) and hence th at the term s on the right hand side converge to 0 as m tends to infinity. Recall that Holder’s inequality gives us p «(r)+l/2 = p l / r < p l / ( r - l ) = p « (r - l)+ l/2 and therefore 8(r) > 5(r — 1), th at is the 6(r)'s form a decreasing sequence. Us­ ing additionally that pT < pr_i we get prS(T )+ T /2 < p(r- 1)* (> — 1)+0— U/2, This implies rS(r) + r/2 > (r — l)5 (r — 1) 4 - (r — l)/2 , and hence r S(r) + *(r ~ 1) + j; > ^(3) + I r — 1 2 (r — 1) — ' ' 2 ~ ' 2 Therefore ! + * = 1 + _ ^ <(r) + + > * (,) + ‘ . This gives us — dr < 1/2 — 5(3) and therefore 5 < 1/2 — 5(3). We now proved that bi converges to 0 as m tends to infinity. Analogously we get for the 62 term 62 < 64m2/2p2(1- 4» '- 1) + 2 m 3/p t('-1)+3('-1 )5(3) 82 « m 4® -2 (log1/p(m ) ) 2 + m _6® (3) logi/p(m), which tends to 0 as m tends to infinity since 46 — 2 < 0 and 6(3) > 0. To get the bounds on the param eter A, we use Lemma 18 and observe that m 2 pl — > c and m/p^1-® 1^ ~ m 2® 1-1 log1^p(m) — > 0 as m — > oo. Therefore £ ( ! - , ) ! < B ( I V ) < ^ . □ Note th at the dominating term in 1LW is of the order of m 2 p(l~l\ since the term m/p(1- ® d(,~1) is negligible. If we do not consider overlapping repeats, th at is, if we count only the number of repeats of non-overlapping segments, we can give the expectation of W explicitly. Also, the expression for b\ simplifies, now summing only over those indices i and j such that j — i > I. For b2, however, the problem of overlapping sequences remains; th at estim ate cannot easily be simplified. Thus we get the following corollary. T h e o re m 9 Assume any distribution on the alphabet E with |E| = s. Let Y(ij) = 1 (a[i, i + l - 2] = a\j,j + / - 2] and (i = 1 or (a<_i ^ aj_i)) and j > i + /) and » " = £ r (y ,. l<i<j<m— 1+2 Assume that Z is a Poisson random variable with E (Z ) = E (W ) = (m - 2 /+ 2)p('_1) + ^/ + 3^P(,_1)(1 - P ) C ( ! - P ) . P Let 6 and p be as in Theorem 8, and let I = Ll°gi/P(m2/ c)J f or some constant c > 0. Then as in Theorem 8 we get \\C(W) - C(Z) I I = O (m 2®-1 log1/p(m )) + O ( m '6® * 3* log1/p(m )) . 83 C hapter 13 U nique R ecoverability Now we will return to the problem of unique recoverability of a sequence, as described in the introduction. The outline of this section is as follows. Firstly, in Section 3.1, we define the spectrum of a sequence, Ukkonen’s transform ations and we explain the notion of Ukkonen recoverability. In Section 3.2, we largely follow the approach by [Dyer et al. (1994)] to get an estim ate for the probability th at a randomly chosen sequence can be reconstructed from it spectrum of substrings of length / — 1. As we do not require the uniform distribution on the alphabet, the dependencies are more complicated than in [Dyer et al. (1994)]. Thus the proof of Lemma 20 needs new arguments. Lemma 21 has a detailed proof based on the argument sketched in [Dyer et al. (1994)]. The final result, Theorem 10, gives the limiting behavior for the probability of unique recoverability. This is an extension of Theorem 1 in [Dyer et al. (1994)], which considers the uniform case. Furthermore, we give a rate of convergence in the total variation distance. There is no related result for this in [Dyer et al. (1994)]. 13.1 N otation and U kkonen Transform ations For each b € E 1 let 7V(b, a) denote the number of occurrences of b as a substring of a. Then we define Nt(a) = (Af(bi, a), N(bz, a),..., N (b gi, a)) as the spectrum of a where b i, b 2, ..., b 8 i is some enumeration of E(. We say th a t a is I-recoverable, if a' € E m and a ^ a ' implies iV/(a) / A^(a'), that is a can be (uniquely) reconstructed from its spectrum of substrings of length I. 84 Ukkonen [Ukkonen (1992)] introduced certain transform ations, which transform one word into another one, but do not change the spectrum . Let a be a string. These are the transformations: • Transpositions: 1. If a can be written as a = a i b i a 2b 2a 3 b i a 4 b 2 as for some (I — 1)- strings b i and b 2 and for some strings a i,... ,a 5, then the string a! = a i b i a 4 b 2a 3 b i a 2 b 2 a 5 , where a 2 and a* have changed positions is called a transposition of a. 2. If a can be written as a = a ib a 2 b a 3 b &4 where b is a (/ — l)-string and a i ,...,a 4 are some strings, then a' = a ib a 3 b a 2 b a 4 , where a 2 and a 3 have changed positions, is also called a transposition of a. • Rotations: If a can be w ritten as a = b i a i b 2 a 2 b i for some (/ — l)-strings b i and b 2 and some strings a i and a 2 , then the string a = b 2 a 2biaxb 2 is called a rotation of a. These transform ations always give us two words with the same spectrum . For the reverse Ukkonen posed the conjecture th at two words with the same spectrum can be transformed into each other by a finite sequence of transpositions and ro­ tations. ([Pevzner (1989)]) reduced the problem to an Eulerian path problem in edge-bicolored graphs. In 1992 Pevzner ([Pevzner (1994)]) showed th at the answer to the Ukkonen conjecture is positive, th at is two words with the same spectrum can always be transformed into each other by a sequence of transpositions and ro­ tations and vice versa. This gives us now a criterion to determ ine when a word is uniquely I-recoverable, namely if no nontrivial transformations are possible. We have to exclude the trivial transform ations, leaving the string fixed, to get the uniqueness. For example consider the definition of a transposition, that is we have a = a ib ia 2 b 2 a 3 b i a 4 b 2 a 5 . Assume a 2 = a4 , then we have the possibility of a transposition without changing the sequence a. We are interested in lim P (a is /-recoverable) m— M X ) x ' 85 and the rate of convergence. 13.2 U nique R ecoverability We now continue with some additional notation. Let a starts and ends with the same / — 1 letters, and ) a rotation would result in a word different from a J and _ f 36 € E' " 1 : N (b,a) > 3 and ( the strings between the occurrences are not identical ! 3(/ — l)-tuple which appears at least three times, and the strings between the occurrences are not identical Straightforward calculation gives us the following results. L e m m a 20 Let I = Lfo<h/p(m 2/c)J f or some constant c > 0. Then 1. P (£ ) < p*'"1* = o(l). 2 . P (G) = O ((logI/P(m ))2m 4*-3) + O (m 25- 1) = o(l). P ro o f: The proof is similar to the one given in [Dyer et al. (1994)] (Lemma 1) for the uniform case . Obviously P (£ ) = by independence (no overlap). Let Qijk = {»[*, i + I ~ 2] = a\j, j + I - 2] = a[fc, k + l — 2]}, for (i,j,k) 6 S = {(i,j,k) : 1 < i < j < k < m — I + 2}. We divide S into Si = {(i,j , k) 6 S : m ax{j — *, k — j} > I — 2}, th at is the set of indices such th at at least one fragment is independent of the others, and we put 5 2 = S \ S\. Because of this and the first part of Lemma 13 we get P ( Q n k ) < P('_1) p*1- ^ ' - 1* < p*2-*)*'-1) (13.1) . (13.2) 86 for (i,j, k) € S\. In the case that (i,j, k) € 5 2, the second part of Lemma 13 gives P (a [i,i + / - 2] = a \j,j + l - 2 } = a[/b, k + l - 2]) < Note that |5 i| < m3 — 0(m 3) and I < 3!m/2 = 0 ( m l2). So we get p ( 0 ) < E p ( a « * ) + E p (G « k ) («',j,fc)e5i («J,fc)€S 2 < m V2 - t)(,- 1 ) + = O((log1/p(m ))2m 4*"3) + 0 (m 2S_1) = o(l). □ Let J- be the event that there are two pairs of disjoint maximal common sub­ strings (markers), say m j and m 2, of length at least / — 1 which occur in a as . .. m 1 . .. m 2 . . . m j ... m 2 ___ By the above exposition of the Ukkonen transformations we know that we are not able to determine the positions of the two different substrings between mi and m 2. Note th at the maximality condition guarantees th at we have only nontrivial transpositions. Recall that Pevzner has shown th at a is uniquely /-recoverable if neither £,Q nor T occur, since £ would allow a nontrivial rotation and Q and T would allow nontrivial transpositions (see the above definitions). Assume we have a string of n pairs of parenthesis, that is a string which has n ‘(’s and n ‘)’s. A string of parenthesis is well form ed if, reading the string of parenthesis from left to right, we never encounter more ‘)’s than ‘(’s. The number of well formed string of n pairs of parenthesis is known as the Catalan Numbers C(n) and has the following form: C(n) = ( ^ f °r n-°' The first elements of the sequence are 1,1,2,5,14; see [Graham et al. (1987)]. Let £k denote the event that we have k pairs of disjoint markers (that is k pairs of disjoint repeating regions of length at least I — 1). We have the following lemma which is proved in [Dyer et al. (1994)]. For clarification and completeness, we will repeat the proof here. 87 L e m m a 21 P ( ? c\Sk) = (fc + 1)! P ro o f: Call the k pairs of marker m i,. . . , m*. We have (2/c)!/2f c different order­ ings of these markers. If we do not want T to occur we have k\C(k ) possibilities for this. To see this map T° onto the set W* of all well formed strings of k pairs of parenthesis. We have by definition |W*| = C(k). The mapping scans the string a from left to right. The first encounter of some m arker m, gives us ‘(’ and the second encounter gives us ‘) \ So each marker constellation gives us by construction a well formed string of parenthesis. We claim that each element in W* has exactly k\ preimages. To see this we note th at in a well formed string each parenthesis ‘( ’ has a unique corresponding parenthesis ') ’ (use induction on the number of pairs; remov­ ing an innermost pair of parenthesis). So we have k\ preimages, since by assigning the k markers to the k ‘(’s we always get distinct elements in T c. Hence / 2k \ 1 2* [ k ) * + 1(2*)! This gives us now all the pieces to calculate lim P (a is /-recoverable). m— ♦ oo T h e o re m 1 0 Letting I = U0 g i/P(m2/ c)J f or some constant c > 0 we get °° (2A)f c lim P ( a is Ukkonen I-recoverable) = e~x Y'' 77—-----— v ; *!(* + !)! where A € [^ 2^ , 2p] • ra^e ° f convergence in the total variation distance is O ° + o • Note th at the rate of convergence depends on the choice of c. The rate is of the same magnitude as the rate obtained in [Waterman (1995)] for sequence matching. 88 As the sequence matching and the problem of unique recoverability are related, this was to be expected. Proof: By Lemma 20, limm _ 00 P (£ c) = 1 and l i m m _ o o P(£/c) = 1- We have P (a is /-recoverable) = P (£ c fl Qc fl T ° ) = Y{FcC\Qc) - 'P { { F r \ g c) \{ £ cC\GcC\Fc)). Note that P ( ( ^ cn g c) \ (£ c n g c n r c)) = P(u> : w e ? c n g c,w<t£c) = P(w : w £ P u g , w € £ ) < P(Sb We also have p (t c n g c) By Theorem 8 , Y p { F c n g c n { w = k}) = Y P ( ^ c n £ k) k=0 k=0 o o c o c \k Y P { F c \ £ k ) p ( S k ) = Y 7 T T T u P « W = n ^ k=o k=o + i )- lim P (£ fc ) = lim P(W = k) = P (Z = k) where Z is a Poisson random variable with expectation A = limm^oo E(W), th at is P(Z = k) = 'A* it! ‘ Hence P (a is /-recoverable) — e A Y . (2A) £ S « (* + !)! 00 9^ PjPnfl-nn-Ejrnj o o nk P(Z = k) - E ^ T T Y m W = k ) - PiZ = h)l 89 + |P {{w = k} n gc) - P (W = *)|) + P(f) oo ok < 1 / 2 | | £ ( ^ ) - £ ( Z ) | | ^ - — — + e2P (0 ) + P (5 ) Jt=o \ K + i )- < (bi + b2)e2 + P(€) + eaP (0 ) < e2 (eSm’ / y V - W - 1) + 2m 3# 2- ™ - 1* + 2m 3/p3/ 2(/-i)+3(/-i)«(3)) + p ' _1 + e2 ( m V a- J)(/- 1) + G m iy t1- 6*1 - 1') . Inserting / = log1/(p (m 2/c) and observing that p,_1 is of smaller order than the other term s gives e2 ( 6 8m 212pW -W '-U + 2 m 3/p<2-*>('-1> + 2m 3/p3/ 2<'-1> +3<'-1W3> + m 3p(2- s)(,- 1)-+ 6 m /V < 1- 4H'-1)) + p' " 1 « e2 ^272(log1/p(m ))2m -2+4* ^ + 4log1/p(m )m ~1+w / \ 3/2+35(3) / \ / \ 2 - 6 +4 log1/p(m )m -6^ ^ J + ^ J m " 2 + m "1+2‘ ^ J +4(log1/,( m ))Im - ” « Q ,(1 “' j l°8i/p(m ) / c \ 3' W m 6«(3) I p I 4-4e2---------- ,64(3) This concludes the proof. □ 13.3 C onclusion One of the benefits of our Poisson approximation is th at while the limit form is given by [Karlin et al. (1987)], the results below have explicit bounds on the approxima­ tion necessary for application to sequencing by hybridization in the following chap­ ter. The two theorems for overlapping and non-overlapping repeats can be applied to estim ate statistical significance in the analysis of patterns in a single sequence. W hile the Chen-Stein method has been carefully worked out for sequence matching 90 in [Arratia et al. (1990a)] and [Waterman (1995)], the correlation structure for over­ lapping repeats involves some details th at are quite distinct and not obvious. The probability of unique recoverability can be used as a measure for the effectiveness of a SBH chip; [Pevzner and Lipshutz (1994)] call it the “resolving power” of a chip. The rate of convergence, together with the limiting distribution, thus provides a means to estim ate this resolving power. To conclude this part we return to the paper of [Dyer et al. (1993)]. In their paper, for the case of uniformly distributed letters, a qualitative limit theorem for the probability th at a sequence can be uniquely recovered by subsequences of length / sa logi (^~ ) is given. Their proof uses actually two approximations. Firstly, they show th at the quantity in question is close to another quantity, which is then shown to converge in distribution to a Poisson variable. We believe that our proof gives more insight into the structure of the problem, and it also fills some gaps in the proofs by [Dyer et a/.(1993)]. Furthermore, [Dyer et a/.(1993)] do not obtain a rate of convergence, which is desirable for practical purposes. Finally, it is well-known that the nucleotide distribution is not uniform. [Dyer et a/.(1993)] claim th at a general distribution on the alphabet would change the proofs and the results only slightly. The presentation below shows th at the arguments needed are more complex and that the general results stated by [Dyer et a/.(1993)] are not valid (compare for instance their expression for the expectation with Lemma 18). We introduce a random variable W counting the number of length / — 1 repeats. There are two im portant cases: that where overlapping repeats are counted and th at where repeats must be non-overlapping to be counted. In the case of overlapping repeats which is the relevant structure for SBH, the value of E(IF) cannot be explicitly given in a closed form (Lemma 18) for a general distribution, whereas for a uniform distribution it has a closed form (Lemma 19). For non-overlapping repeats, E(W ) has a closed form for a general distribution (Theorem 9). Note that E(W ) in Theorem 9 almost coincides with the corresponding expression suggested by [Dyer et al. (1994)] except that occurrences of m — / + 1 are replaced by m — 21+ 3 in our formula. Their formula was proposed for the overlapping case. The most im portant difference between a uniform and a non-uniform distribution on the alphabet is th at, in the uniform case, coincidence of letters a,, aj,ak occur independently, P (a, = aj|a< = a*) = P (a; = aj). 91 In the non-uniform case, this is no longer true, and the dependencies caused by overlapping sequences require careful analysis. Now we will discuss some possible extensions of this part. Since the results (Theorems 2 and 3) are valid in general for repeats in random sequences, they can be employed to estim ate statistical significance of repeats in sequence. See Chapter 12 in [Waterman (1995)] for a general treatm ent of patterns in a sequence, including a m ultivariate central limit theorem. It should be possible to extend the argum ent to handle fc-fold repeats although the combinatorics of the neighbor­ hood structure will be challenging. A natural extension of Theorems 2 and 3 is to study long repeats that m atch in at least fraction a of the positions. Of course Theorems 2 and 3 are for a = 1. To retain large deviations behavior, we need a > P(Tw o random letters are identical) = p. In addition, declumping becomes a challenging issue. In [Arratia et al. (1990a)] Poisson approximation for long fraction a matches between two random sequences is studied, and declumping is accomplished by application of the ballot theorem. The difficulty arises because for a < 1 it is possible th at there is a match of at least fraction a of length t or greater and none of exactly length t. Generalizing [Arratia et al. (1990a)] to overlapping repeats will require careful analysis. Finally we mention generalization of Theorems 2 and 3 to non-iid sequences, e.g. Markov, sequences. By far the most practical generalization of Section 3 on sequencing by hybridiza­ tion would be to find the probability of correctly determining the sequence in the presence of hybridization errors. Notice that our paper as Dyer et aPs paper is about unique sequence. This is because if the data are error free the correct sequence must be among all those consistent with the data. W hen there are hybridizations, there may be no sequence consistent with the data or if there are consistent sequences, they may not include the correct sequence. To our knowledge, a good stochastic model of SBH errors has not yet been given. After a good tractable model of errors is given, we might ask if / = logj/p(m) still gives the correct sequence. There are two types of hybridization errors: false positives (l„, = 1 when w 6 5 (a)). Per­ haps false positives might be modeled randomly and independently: P(l„, = 1 and w € 5 (a)) = p*. However the false negatives are not so easily treated. False nega­ tives arise from tuples in the neighborhood of w 6 5 (a), and the tuples might differ 92 from w by 1 or 2 mismatches. Indels axe unlikely to occur in this context. In addi­ tion because of the energy of hybridization, these mismatches tend to be at the end of the tuple. After the errors are modeled, the algorithm for inferring sequence from d ata with errors must be specified in order th at the probability of correct sequence determ ination can be estimated. We conclude then th at problems of substantial interest remain to be studied. 93 R eference List [Allison and Yee (1988)] L. Allison and C. N. Yee, Restriction site m apping is in separation theory, C ABIO S 4 (1988), 97-101. [Arratia et al. (1990a)] R. A rratia, L. Gordon and M.S. W aterm an, The Erdos- Renyi law in distribution, for coin tossing and sequence matching, Ann. Statist. 18 (1990), 539-570. [Arratia et al. (1990b)] R. A rratia, L. Goldstein and L. Gordon, Poisson Approxi­ mation and the Chen-Stein Method. Statistical Science 5 (1990), 403-434. [Bains and Smith (1988)] W. Bains and G.C. Smith, A novel m ethod for DNA se­ quence determination. J. Theor. Biol.,135, (1988), 303-307. [Barbour et al. (1992)] A.D. Barbour, L. Holst, S. Janson, Poisson approximation. Clarendon, Oxford 1992. [Drmanac and Crkvenjakov (1987)] R. Drmanac and R. Crkvenjakov, Yugoslav Patent Application 570. (1987). [Dyer et al. (1994)] M. Dyer, A. Frieze and S. Suen, The Probability of Unique Solutions of Sequencing by Hybridization. J. Comp. Biol.,1, (1994), 105-110. [Fitch et al. (1983)] W. M. Fitch, T. F. Smith and W. W. Ralph, Mapping the order of DNA restriction fragments, Gene 22 (1983), 19-29. [Fodor et al. (1991)] S.P.A. Fodor, J.L. Read, M.S. Pirrung, L. Stryer, A.T. Lu and D. Solas, Light-directed spatially addressable parallel chemical synthesis. Sci­ ence, 251, (1991) 767-773. [Fodor et al. (1993)] S.P.A. Fodor, R.P. Rava, X.C. Huang, A.C. Pease, C.P. Holmes and C.L. Adams, Multiplex biochemical assays with biological chips. Nature, 364, (1993) 555-556. [Goldstein and W aterman (1987)] L. Goldstein and M. S. W aterm an, Mapping DNA by stochastic relaxation, Adv. Appl. Math. 8 (1987), 197-207. [Graham et al. (1987)] R.L. Graham, D.E. K nuth and O. Patashnik, Concrete Mathematics. Addison-Wesley, 1988. 94 [Ho et al. (1990)] S. T. S. Ho, L. Allison and C. N. Yee, Restriction site mapping for three or more enzymes, CABIO S 6 (1990), 195-204. [Karlin et al. (1987)] S. Karlin and F. Ost, Counts on long aligned word matches along random letter sequences. Adv. Appl. Prob. 19 (1987), 293-351. [Kotzig (1968)] A. Kotzig, Moves without forbidden transitions in a graph, Matem- atiky casopis 18 (1968), 76-80. [Lysov et al. (1988)] Yu.P. Lysov, V.L. Florent’ev, A.A. Khorlin, Khrapko, V.V. Shik and A.D. Mirzabekov, DNA Sequencing by hybridization with oligonu­ cleotides. A novel method. Dokl. Acad. Sci USSR, 303, (1988) 1508-1511. [Macevicz (1989)] S.C. Macevicz, International Patent Application PS US89 04741 (1989). [Mironov et al.] A. A. Mironov, N. N. Alexandrov, N.Yu. Bogodarova, A. Grigorjev, V. F. Lunovskaya, P. A. Pevzner and M. E. Truchan, DNASUN: A Package of Com puter Programs for Biotechnology Laboratory. [Nathan and Smith (1975)] D. Nathans and H. O. Smith, Restriction ebdonuclease in the analysis and restructuring of DNA molecules, Ann. Rev. Biochem. 44 (1975), 273-293. [Pease et al. (1994)] A.C. Pease, D. Solas, E.J. Sullivan, M.T. Cronin, C.P. Holmes and S.P.A. Fodor, Oligonucleotide arrays for for rapid DNA sequence analysis. Proc. Natl. Acad, of Sci. USA, 91, (1994) 5022-5026. [Pevzner (1988)] P. A. Pevzner, Graphs of restrictions and DNA physical mapping, Biopolymers and Cell 5 (1988), 233-237 (in Russian). [Pevzner (1989)] P.A. Pevzner, 1-tuple DNA Sequencing: Com puter Analysis. Jour­ nal o f Biomolecular Structure and Dynamics 7 (1989), 63-73. [Pevzner (1990)] P. A. Pevzner, DNA physical mapping.In M.D. Frank-Kamenetzky (ed.), Computer analysis of genetic texts, ’ ’Nauka” Moscow, (1990), 154-188 (in Russian). [Pevzner (1992)] P. A. Pevzner, DNA physical mapping, flows in networks and min­ imum cycles means in graphs, DIM ACS 8 (1992), 99-112. [Pevzner (1994)] P.A. Pevzner, DNA physical mapping and alternating Eulerian cycles in colored graphs. Algorithmica 12 (1994) [Pevzner] P. A. Pevzner, MAPSUN: a DNA physical mapping computer algorithm (in preparation). 95 [Pevzner and Lipshutz (1994)] P.A. Pevzner and R.A. Lipshutz, Towards DNA Se­ quencing Chips. 19th Symposium on Mathematical Foundations o f Computer Science, Kosice, Slovakia, Lecture Notes in Computer Science, 841, (1994) 143- 158. [Schmitt and W aterman (1991)] W. Schmitt and M. S. W aterm an, M ultiple Solu­ tions of DNA Restriction Mapping Problems, Adv. Appl. Math. 12 (1991), 412-427. [Southern (1988)] E. Southern, United Kingdom Patent Application GB8810400. (1988). [Ukkonen (1992)] E. Ukkonen, Approximate string-m atching with q-grams and maximal matches. Theoretical Computer Science 92 (1992), 191-211. [Waterman (1995)] M. S. W aterman, Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman Hall 1995. [Waterman and Griggs (1986)] M. S. W aterman and J. R. Griggs, Interval graphs and maps of DNA, Bull. Math. Biol. 48 (1986), 189-195. 96 
Asset Metadata
Core Title 00001.tif 
Tag OAI-PMH Harvest 
Permanent Link (DOI) https://doi.org/10.25549/usctheses-oUC11257345 
Unique identifier UC11257345 
Legacy Identifier 9621668 
Linked assets
University of Southern California Dissertations and Theses
doctype icon
University of Southern California Dissertations and Theses 
Action button