Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Algorithms for phylogenetic tree reconstruction based on genome rearrangements
(USC Thesis Other)
Algorithms for phylogenetic tree reconstruction based on genome rearrangements
PDF
Download
Share
Open document
Flip pages
Copy asset link
Request this asset
Request accessible transcript
Transcript (if available)
Content
ALGORITHMS FOR PHYLOGENETIC TREE RECONSTRUCTION BASED ON GENOME REARRANGEMENTS by Guillaume Bourque A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (APPLIED MATHEMATICS) August 2002 Copyright 2002 Guillaume Bourque R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. UMI Number: 3094307 UMI UMI Microform 3094307 Copyright 2003 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest Information and Learning Company 300 North Zeeb Road P.O. Box 1346 Ann Arbor, Ml 48106-1346 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. UNIVERSITY OF SOUTHERN CALIFORNIA THE GRADUATE SCHOOL UNIVERSITY PARK LOS ANGELES. CALIFORNIA 90007 This dissertation, w ritten by under the direction of ...... Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillm ent of re quirements for the degree of DOCTOR OF PH ILOSOPH Y Dean of Graduate Studies D a te ..... DISSERTATION COMMITTEE August 6 , 2002 Chairperson Sira cm " Z JCU/£kjN? R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. D edication to my fam ily and m y little girl R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. A cknow ledgem ents This thesis will always stand for me as a memento of my wonderful experience at USC. There are many people I would like to thank for making this adventure such a success. First, David Sankoff, who hired me as an undergraduate student, initiated me to this inspiring field and subtly suggested, what I now know to be, the best possible path for me. I am also indebted to Mike W aterman for his warm welcoming and for his understanding of my recent peculiar situation. His advice was always most valuable. I will never forget how a single suggestion in my very first semester at USC helped me avoid a potential disaster. I am extremely grateful to my advisor, Pavel Pevzner, for stim ulating discussions and precious supervision. In many ways, I wish we could have had a lot more of those rollerblading sessions. I must also thank Bernard Moret and Glenn Tesler for providing me with their genome rearrangement programs without which none of this work would have been possible. Special thanks to Glenn for multiple discussions and comments th at significantly improved the way I work. I am obliged to Bill Murphy for providing human, cat, cow and mouse comparative iii R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. m aps prior to publication. Furthermore, I am also appreciative of the support provided to me by USC and by grants from the N atural Sciences and Engineering Research Council of Canada. My final words are for my wife, Caroline, her daughter, Charlotte, and our three months old little Ophelie. I would like to thank them for allowing me tim e to disappear in the mysterious world of m athem aticians and computer scientists but also for helping me create my own little world around them. Last but not least, I cannot neglect to thank my parents and my sister who helped me become the person I am today. Merci. G.B. Montreal, Quebec May 2002. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. C ontents D ed ication ii A cknow ledgem ents iii L ist of Figures vii List of Tables xii A bstract x iii 1 Introduction 1 2 P h ylogen etic Tree R econ stru ction M eth od s 11 2.1 Distance-based m e th o d s .......................................................................... 12 2.2 Likelihood-based methods ............................................................ 13 2.3 Maximum parsimony m e th o d s ................................................................ 14 3 G enom es, R earrangem ents and P airw ise A lgorith m s 21 3.1 G enom es........................................................................................................ 21 3.2 R earrangem ents........................................................................................... 23 3.3 Pairwise Algorithms ................................................................................. 27 3.3.1 Unichromosomal g e n o m e s .......................................................... 28 3.3.2 Multichromosomal c a s e ................................................................. 32 4 M ultiple G enom e R earrangem ent P rob lem 33 4.1 Algorithm .................................................................................... 33 4.2 T e s ts .............................................................................................................. 39 4.2.1 Simulated d a t a .............................................................................. 39 4.2.2 Herpesvirus d a t a ........................................................................... 44 4.2.3 Human, fruit fly, and sea urchin mtDNA d a t a ..................... 47 4.2.4 Metazoan mtDNA d a ta ................................................................. 47 4.2.5 Campanulaceae cpDNA d a t a ....................................................... 50 v R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 5 R econ stru ctin g A ncestral G ene O rders for M ultichrom osom al G enom es 52 5.1 A lg o rith m .................................... 52 ■ 5 .2 T e s ts ........................................................................ . 55 5.2.1 Simulated data ...................................... 55 5.2.2 Gene order of the human-cat-mouse common ancestor . . . 56 6 V ariants and A pp lication s 63 6.1 Towards the recovery of mammalian ancestral gene orders .... 63 6.1.1 Human-cat-mouse common ancestor r e v is ite d ...................... 64 6.1.2 Human-cat-cow common a n c e s to r............................................. 69 6.2 Reconstructing ancestral gene orders for a fixed phylogeny .... 71 7 M G R as a W eb Tool 77 7.1 I n p u t............................................................................................................... 78 7.2 O u tp u t.................................................................................................. 80 7.2.1 Newick F o r m a t .......................................................................... . 81 7.2.2 ASCII F o rm a t.......................................................................... 82 7.3 Campanulaceae cpDNA output .............................................. 83 8 Sum m ary 86 R eference L ist 90 vi R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. List o f Figures 1.1 Phylogenetic tree of a subset of 13 plants from the Campanulaceae family. Modern organisms correspond to leaves while internal nodes correspond to ancestral g en o m es............................................. 1.2 (a) Coding genes on the Gorilla’s mitochondrial DNA. (b) Coding genes on the Earthworm ’s m itochondrial DNA. The list of genes is the same but their order differs. For instance, the sequence “C0X1 C0X2 ATP8” is followed by “ATP6” in Gorilla but by “C0X3” in Earthworm. tRNA genes have been left out of this figure to simplify the example. GenBank accession numbers: NCJ301645 and NC.001673............................................................... .......................... 1.3 Example of a most parsimonious rearrangement scenario with 7 reversals between Gorilla’s mitochondrial DNA and Earthworm ’s mitochondrial DNA. The gene numbers correspond to genes in Fig ure 1.2 starting with “CQX1” in Gorilla as “1” and going clockwise until “ND2” is assign “13”. Of all the genes in the two mtDNAs, only “ND6” in Gorilla was on the opposite DNA strand which is why it is represented by “-10”............................................................... 1.4 Reversal distance, d(7 r, 7 ), versus actual number of reversals per formed to transform tt into 7 , where 7 is a genom e/perm utation th at evolved from the identity perm utation t t = 1, 2,_ _ , 100 by k random reversals. The simulations were repeated 10 times for every k. We compute the average difference between the reversal distance and the actual number of reversals performed k .............. 2.1 Computation of the breakpoint distance, d { ,p(7 r), and the rever sal distance, d(ir), between t t = 1 -4 -3 -5 2 6 and the identity perm utation. ........................ 3.1 Example of a directed chromosome. The two endpoints of the chromosome are identifiable. This implies that flipping the chro mosome would be noticeable. .................................................. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 3.2 (a) Original DNA segment where each integer corresponds to a gene, (b) Same twisted DNA segment. During replication, if the loop is copied in its current order, it constitutes a reversal and the segment is transformed into 12-5-4 -3 6. Note th at the sign or the strand of the genes is modified at the same tim e as the order. The loop can also be totally ignored or skipped during replication, when this happens, it constitutes a deletion and the segment is transformed into 1 2 6............................................................................. . 24 3.3 Breakpoint graph associated with the perm utation ir = 1 -4 -3 -5 2 6. Black edges are represented by regular lines while grey edges are represented by dashed lines. .................... 29 4.1 Comparison of M GR-M EDIAN and GRAPPA (3 genomes equidistant from the ancestor). The genomes Gh, G2, G3 are obtained by k reversals each from the ancestral identity perm utation 1 2 ... n (n = 30 and n — 100). The simulations were repeated 10 times for every ratio ^reversals/^m arkers — 3k/n. (a) and (b) The average difference between the number of reversals on the tree recovered by the algorithm and the number of reversals on the actual tree (equal to 3k). (c) and (d) The average reversal distance between the solution recovered and the actual ancestor.................................... 41 4.2 Comparison of M GR-M EDIAN and G RAPPA (3 genomes nonequidistant from the ancestor). The genomes Gh, Gh, G3 are obtained by k, k and 2 k reversals respectively each from the ancestral identity perm utation 12... n (n = 30 and n = 100). The simulations were repeated 10 times for every ratio ^reversals/^m arkers = Ak/n. (a) and (b) The average difference between the num ber of reversals on the tree recovered by the algorithm and the num ber of reversals on the actual tree (equal to 4k). (c) and (d) The average reversal distance between the solution recovered and the actual ancestor. . 42 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 4.3 Comparison of M G R and G RAPPA (4 genomes). We start from an unrooted tree with 4 leaves and select one of the two internal nodes to be the identity perm utation 12... n (n = 30 and n = 100). We then perform k reversals on each branch of the tree to obtain the genomes G\ , Gy, G3 , G4 as the 4 leaves of the tree. The simulations were repeated 10 times for every ratio # reversals/^m arkers = 5k /n . (a) and (b) The average difference between the number of reversals on the tree recovered by the algorithm and the number of reversals on the actual tree (equal to 5k). (c) and (d) The average reversal distance between the best (i.e., closest) internal node in the solution recovered and the identity perm utation.......................... 43 4.4 Comparison of M G R and G RAPPA (to genomes each with 30 m ark ers). The genomes (?j, Gy, ■ ■ ■ , Gm correspond to a subset of leaves from a complete unrooted binary tree on which we have performed k reversals on each branch. The simulations were repeated 10 times for every m. (a) and (b) The average difference between the num ber of reversals on the tree recovered by the algorithm and the number of reversals on the actual tree when k = 2 and k — 3. . . . 45 4.5 Gene orders of Herpes simplex virus (HSV), Epstein-Barr virus (EBV) and Cytomegalovirus (CMV) taken from Hannenhalli et al., 1995 [HCKP95]. Also shown is the ancestral gene order (A) and the optimal evolutionary scenario recovered by M GR-M EDIAN. . 46 4.6 Human, sea urchin and fruit fly mitochondrial gene order taken from Sankoff et ah, 1996 [SSK96]. A is the ancestral gene order suggested by M GR-M EDIAN........................................................................... 47 4.7 Phylogeny of 11 m etazoan genomes reconstructed by M G R . The gene order data is taken from the MGA Source Guide which is compiled by Jeffrey L. Boore. The genomes come from 6 m ajor metazoan groupings: nematodes (NEM), annelids (ANN), mollusks (MOL), arthropods (ART), echinoderms (ECH), and chordates (CHO). Numbers on the edges show the number of reversals.......................... 49 4.8 Phylogeny of the Campanulaceae cpDNA dataset as reconstructed by M G R . Numbers on the edges show the num ber of reversals. . . . 51 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 5.1 Performance of M G R -M C (3 multicliromosomal genomes equidistant from the ancestor). The ancestral genomes are obtained from the identity perm utation 1 2 . . . n (n = 30 and n — 100) by insert ing b chromosomes breaks (b = 2 when n — 30 and 6 = 9 when n = 100). The genomes Gh, (?2, Gz are obtained by k rearrange ments each from the ancestral genomes. Each rearrangement is a reversal/ translocation with probability p and a fusion/fission with probability 1 —p. The simulations were repeated 10 times for every ratio r = 3k/n. We compute the average score difference which is the difference between the num ber of rearrangements on the tree recovered by the algorithm and the actual number of rearrange ments (equal to 3k). We also compute the average distance of solution between the solution recovered and the actual ancestor. . 57 5.2 Ancestral median for human, mouse, and cat genomes found by M G R-M C. We used the gene order of 114 markers spread over the chromosomes in all three species. The numbers above the chromo somes correspond to these 114 markers and the numbering is such that the hum an genome corresponds to the identity perm utation broken in 20 pieces. The names below the chromosomes correspond to the name of the markers. We attribute a color to each human chromosome. The color of any marker (in any genome) indicates on which hum an chromosome is the homolog of this marker. Each marker segm ent is traversed by a diagonal line. See Figure 5.3 for an explanation and a zoomed in version of chromosome X............... 61 5.3 Ancestral chromosome X for hum an, mouse, and cat genomes found by M G R -M C in Figure 5.2. Each m arker segm ent is tra versed by a diagonal line. These diagonal lines are such that the hum an chromosomes are traversed from top left to bottom right and are design to provide visual help to identify where rearrange ments occurred. For this chromosome, and this data set, the gene order of the ancestor coincides with the cat gene order and only differs by one segment consisting of genes 108 and 109 (break in the diagonal line) from the hum an gene order. The mouse X chro mosome is broken into 7 segments as compared to the ancestor (shown by 7 broken segments of the diagonal line)............................. 62 6.1 Hypothetical phylogenetic tree of m am m als.............................. 64 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 6.2 Ancestral gene order recovered by M G R -M C on the signed perm u tations obtained from running the cluster-algorithm at threshold 4 on the human-cat-mouse 470 markers data set. Each chromo some segment unit represents a cluster and is not drawned propor tionally to the number of markers it contains. Color coding and diagonal lines are used as in Figures 5.2 and 5.3. . . . . . . . . . . 73 6.3 Ancestral gene order recovered by M G R -M C on the signed permu tations obtained from running the cluster-algorithm at threshold 6 on the human-cat-mouse 470 markers data set. Each chromo some segment unit represents a cluster and is not drawned propor tionally to the number of markers it contains. Color coding and diagonal lines are used as in Figures 5.2 and 5.3................................ 74 6.4 Ancestral gene order recovered by M G R -M C on the signed perm uta tions obtained from running the cluster-algorithm at threshold 4 on the human-cat-cow 311 markers data set. Each chromosome seg ment unit represents a cluster and is not drawned proportionally to the number of markers it contains. Color coding and diagonal lines are used as in Figures 5.2 and 5.3.................................................. 75 6.5 Ancestral gene order recovered by M G R -M C on the signed perm uta tions obtained from running the cluster-algorithm at threshold 6 on the human-cat-cow 311 markers data set. Each chromosome seg ment unit represents a cluster and is not drawned proportionally to the number of markers it contains. Color coding and diagonal lines are used as in Figures 5.2 and 5.3.................................................. 76 7.1 Human genome as an example of the multichromosomal input for m at used by M G R . . ................................................................................. 79 7.2 M G R web tool sample input form ................................ 80 7.3 M G R simple sample output. Some of the instructions were removed from the web page to improve the readability of other elements. The tree is displayed with edges proportional to the number of rearrangem ents.................................... 84 7.4 M G R sample output using the Campanulaceae cpDNA data set. The tree is displayed with fixed size edges.............................................. 85 xi R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. List o f Tables 2.1 The growth of the number of unrooted tree is more than exponen tial with the number of genomes (leaves). .................... 3.1 Examples of possible chromosomal m utations. ............. ... 6.1 Results after applying the clustering-algorithm for various thresh olds and M G R -M C to the human-cat-mouse data set with 470 unsigned markers. ( l) /( 2) corresponds to the num ber/ percentage of m ark ers which are not dropped as singleton clusters. (3) is the number of clusters th at are not singleton clusters. “A” is the ancestral gene order recovered by M G R -M C ................................................................... 6.2 Results after applying clustering-algorithm for various thresholds and M G R -M C to the human-cat-cow data set with 311 unsigned markers. (1) /( 2) corresponds to the num ber/percentage of mark ers which are not dropped as singleton clusters. (3) is the number of clusters that are not singleton clusters. “A” is the ancestral gene order recovered................................................................................... R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. A bstract Recent progress in genome-scale sequencing and comparative mapping raises new challenges in studies of genome rearrangements. Although the pairwise genome rearrangement problem is well-studied, algorithms for reconstructing the most parsimonious rearrangement scenarios for multiple species are in great demand. The previous approaches to the m ultiple genome rearrangement problem were largely based on the breakpoint distance rather than on a more biologically accu rate rearrangement (reversal) distance. Another shortcoming of the existing soft ware tools is their inability to analyze rearrangements (inversions, translocations, fusions, and fissions) of multichromosomal genomes. This thesis proposes a new multiple genome rearrangem ent algorithm that is based on the rearrangement (rather than breakpoint) distance and th at is applicable to both uni chromosomal and multichromosomal genomes. We further apply this algorithm for genome-scale phylogenetic tree reconstruction and deriv ing ancestral gene orders. In particular, our analysis suggests a new improved rearrangement scenario for a very difficult Campanulaceae cpDNA dataset and xiii R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. putative rearrangement scenarios for the human, cat, cow and mouse genomes. In this thesis, we also offer a new lead for the integration of comparative m ap ping data in genome rearrangement studies. Finally, we describe the web version of the algorithm, which has recently been made available for phylogenetic tree reconstruction and for the exploration of the rearrangement scenarios. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. C hapter 1 Introduction Nothing in biology makes sense except in the light o f evolution. — Theodosius Dobzhansky The goal of phylogenetics is to identify genetic relationships between species. These relationships are often represented by trees, also known as phylogenies, where contemporary genomes correspond to leaves (see Figure 1.1 for an exam ple). The problems of understanding biodiversity, classifying current species and uncovering their ancestral history have been central to biology for over a century. The affinities of all the beings of the same class have sometimes been represented by a great tree ... As buds give rise by growth to fresh buds, and these if vigorous, branch out and overtop on all sides many a feebler branch, so by generation I believe it has been with the great Tree of Life, which fills with its dead and broken branches the crust of the earth, and covers the surface with its ever branching and beautiful ramifications. Charles Darwin, 1859 Phylogenetic tree reconstruction was initially based on the comparison of mor phological characters. More recently, the advent of DNA sequencing helped create a new tradition of tree reconstruction based on the analysis of individual genes 1 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. (G raur and Li, 2000 [GL00]). One of the difficultie with these reconstructions is that, for the same set of species, studying different genes could suggest different phylogenies. An alternative approach, also making use of the incredible potential of the newly available molecular data, are genome rearrangement studies. These studies are based on genome-wide analysis of gene orders rather than of individ ual genes (Palm er and Herbon, 1988 [PH88], Palm er, 1992 [Pal92], Sankoff et al., 1992 [SLA+92], Olmstead et al., 1994 [OP94], Bafna and Pevzner, 1995 [BP95], Hannenhalli et al., 1995 [HCKP95], Blanchette et al., 1999 [BKS99], Cosner et al., 2000 [CJM+0Qa]). Some of these studies even described closely related genomes with very similar genes but very dissimilar gene order making a gene-based study difficult and a genome rearrangement study extremely valuable. « Cam panula I A denophora I Trachelium * -------------------Sym phyandra E W ahlenbergia M erciera j A syneum a '— Triodanus » ------------Legousia 2 Codonopsis » C yananthus __________________________________ I ------------Platycodon ----------------- Tobacco Figure 1.1: Phylogenetic tree of a subset of 13 plants from the Campanulaceae family. Modern organisms correspond to leaves while internal nodes correspond to ancestral genomes. 2 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. More than 60 years ago, Dobzhansky and Sturtevant, 1938 [DS38], pioneered the study of genome rearrangements with an analysis of inversions in Drosophila pseudoobscura (fruit fly). The interest in genome-wide analysis of gene order flourished in recent years mostly due to progress in large-scale sequencing and comparative mapping (O ’Brien et al., 1999 [0 +99], Murphy et al., 2000 [MSOOl], Lander et al., 2001 [L+01], Venter et al., 2001 [V+01]), which generated a flood of new data. In this thesis, we will consider two different types of genomes: unichromoso- mal and multichromosomal. These types of genomes can be divided even further, which will be done in Chapter 3, but until then, we will assume that we are deal ing with unichromosomal circular genomes. A unichromosomal circular genome consists of a single genomic segment wrapped around itself, see Figure 1.2 for an example. In th at example, the two mitochondrial DNA segments share the same set of genes (a common alphabet) but the order of the genes differs. In the context of genome rearrangements, genomes are typically viewed as signed permutations where each integer corresponds to a unique gene and the sign corresponds to its orientation (strand). For unichromosomal genomes, the most common rearrangements, affecting the order of genes, are inversions that are often referred to as reversals in bioinformatics. A reversal p(i,j), applied to a permutation it = ... 7 r8 _! m ... t t j 7Tj + 1 . . . 7 r n, reverses the segment 7r,-. . . t t j and produces the permutation 7r • p(i,j) = t t i . . . 7 p _ i — ttj — t t - ,- 1 . . . — 7r,- 7r7 + i . . . n n . 3 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. For instance, the effect of the reversal p(4 ,8) on the identity perm utation is the following: 1 2 3 4 5 6 7 8 9 10 I />(4,8) 4 1 2 3 -8 -7 -6 -5 -4 9 10 Given two perm utations t t and 7 , the reversal distance, d(ir, 7), is defined as the minimum number of reversals required to convert one perm utation into the other. Take the example of the two mitochondrial genomes with 13 genes shown in Figure 1.2. The corresponding perm utations are shown on the first and the last line of Figure 1.3. W hat is the minimum number of reversals required to convert one perm utation into the other? The problem is far from being trivial. In this particular case, the answer is 7 reversals and Figure 1.3 shows one such rearrangement scenario. The study of reversal distance was pioneered by David Sankoff (Sankoff, 1992 [San92], Kececioglu and Sankoff, 1994 [KS94]) and increas ingly efficient polynomial-time algorithms have been developed to compute the reversal distance (Hannenhalli and Pevzner, 1995 [HP95b], Berman and Hannen- halli, 1996 [BH96], Kaplan et ah, 1997 [KST97], Moret et al., 2000 [MWB+01], Bergeron, 2001 [BerOl]). Some of these results will be reviewed in Section 3.3. The problem of interest in this thesis is to find a phylogenetic tree describing the most “plausible” rearrangement scenario for m ultiple species (Hannenhalli et al., 1995 [HCKP95], Sankoff et al., 1996 [SSK96]). This is known as the Multiple 4 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Genome Rearrangement Problem. Formally, given a set of m signed perm utations (existing genomes) of order n, find a tree T with the m perm utations as leaf nodes and assign perm utations (ancestral genomes) to internal nodes such that D(T) is minimized, where D(T)= Y , 4 ' , 7) G,"/)£T is the sum of the reversal distances over all edges of the tree. The special case of three genomes (to = 3) is called the Median Problem (Figure 4.5) . Although the reversal distance for a pair of genomes can be computed in polynomial tim e (Hannenhalli and Pevzner, 1999 [HP99]), its use in studies of multiple genome rearrangements has been somewhat limited since it was not clear how to combine pairwise rearrangem ent scenarios into a multiple rearrange m ent scenario. In particular, Caprara, 1999 [Cap99] dem onstrated th at even the simplest version of the Multiple Genome Rearrangement Problem, the Median Problem, is NP-hard. As a result this line of research was later abandoned in favor of the breakpoint analysis approach and the current tools use the so-called breakpoint distance (W atterson et al., 1982 [WEHM82], Nadeau and Taylor, 1984 [NT84]) to derive the rearrangement scenarios. However, the breakpoint analysis has some limitations in the analysis of pairwise genome rearrangements (Pevzner, 2000 [PevOO]). One of the reasons why the breakpoint distance dominated the analysis of multiple rearrangements in the last few years is th at it was not clear how to compute the plausible reversal-based evolutionary scenarios. The work we 5 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. present uses the reversal distance for computing multiple rearrangement scenar ios and discusses some advantages of this approach over the breakpoint distance approach which we will review in Chapter 2. Our algorithm explores the specifics of the reversal distance and is based on the observation that the reversal distance is a good approximation of the true distance for many biologically relevant cases. Let 7 be a genome th at evolved from a genome t t by k reversals (i.e., the true distance between tt and 7 is k). We say that t t and 7 form a valid pair if d(?r, 7) = k' otherwise we say that d(ir, 7 ) underestimates the true distance (Wang and Warnow, 2001 [WW01]). Typically, two genomes form a valid pair if the number of rearrangements between them is relatively small. This happens to be frequently the case in genome rearrangement studies. For a genome with n = 100 markers, Figure 1.4 illustrates th at the reversal distance approximates the true distance very well as long the number of reversals remains below 0.4n. In many biologically relevant cases (e.g., rearrangements of the X chromosome in mammalian species) the number of rearrangement events is well below 0.4ra. As a result, the reversal distance often corresponds to valid (or “almost valid”) pairs of genomes. Therefore the genome- based evolutionary trees are often additive or “almost additive” (Buneman, 1971 [Bun71]). In such trees, the distances are expected to equal the sums of the branch lengths between species. This property allows one to design new genome rearrangement algorithms that explore the specifics of additive trees. 6 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Let t t and 7 be the leaves (existing genomes) in the evolutionary tree T and let t t — oy, er2, ... , oy_i, (T /t = 7 be a path between t t and 7 in T passing through the ancestral genomes er2, ... , <7k-i- Define J t (tt,7) = ^ 2 d(ai:cri+i) l-CiKk—l For a valid pair, ( I t ( t t , 7) = d(7 r ,7). We define the deficit of tt and 7 as def(7 r, 7) = dx(ir,7) — <f(7 r, 7). The deficit of the tree T is defined as def(T) = Y2 def(7 T , 7) where the sum is taken over all pairs of leaves. The closer the tree is to being additive, the smaller the deficit of the tree will be. Many genome-based trees are “almost additive” , for example the herpes virus tree from Hannenhalli et al., 1995 [HCKP95] has deficit 1, while the mtDNA tree from Sankoff et al., 1996 [SSK96] has deficit 3. Our algorithms are implicitly based on this obser vation and we dem onstrate below that they provide an accurate reconstruction of the ancestral genomes for trees with small deficit. They use pairwise genomic distance software as a subroutine (implemented by Glenn Tesler and available via the GRIMM web server1 on Pavel Pevzner!s laboratory web site). The multiple genome rearrangement software described here has also been made available on the same web server. 1http://www-cse.ucsd.edu/groups/bioinformatics/software.html 7 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. This thesis is organized as follows. First, in Chapter 2 we describe previous work on phylogenetic tree reconstruction methods. In Chapter 3, we review the notation and the terminology for genomes and rearrangements. We also summarize some of the results for computing pairwise distances. Next, in Chapter 4, we describe a new algorithm to solve the M ultiple Genome Rearrangement Problem for unichromosomal genomes. Then, in Chapter 5, we study the same problem but for multichromosomal genomes. In Chapter 6, we describe some variants of the algorithm in a few applications. Finally, in Chapter 7, we describe the web version of the program and show examples extracted from it. Parts of Chapters 1 and 2 and most of Chapters 4 and 5 are extracted from the paper by Bourque and Pevzner, 2002 [BP02]. 8 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. (a) (b) COXl ND2 ND3 COXl CYTB ND1 ND6. ND1 ND2 ND6 ND5 CYTB ‘ COXl 'ATP6 ND4 ND3 COXl ND5 ND4 Figure 1.2: (a) Coding genes on the Gorilla’s mitochondrial DNA. (b) Coding genes on the Earthworm’s mitochondrial DNA. The list of genes is the same but their order differs. For instance, the sequence “COXl C0X2 ATP8” is followed by “ATP6” in Gorilla but by “C0X3” in Earthworm. tRNA genes have been left out of this figure to simplify the example. GenBank accession numbers: NC-001645 and NC-001673. Gorilla mtDNA 1 2 3 4 5 6 7 8 9 -10 11 12 13 p(6, 10) 1 2 3 4 5 6 7 8 9 -10 11 12 13 p{ 4,7) 1 2 3 4 5 10 -9 -8 -7 -6 11 12 13 p( 4 ,6) 1 2 3 9 -10 -5 -4 -8 -7 -6 11 12 13 K M 2) 1 2 3 5 10 -9 -4 -8 -7 -6 11 12 13 p(7,10) 1 2 3 5 10 -12 -11 6 7 8 4 9 13 p( 9,12) 1 2 3 5 10 -12 -8 -7 -6 11 4 9 13 p(6, 11) 1 2 3 5 10 -12 -8 -7 -9 -4 -11 6 13 Worm mtDNA 1 2 3 5 10 11 4 9 7 8 12 6 13 Figure 1.3: Example of a most parsimonious rearrangement scenario with 7 rever sals between Gorilla’s mitochondrial DNA and Earthworm ’s mitochondrial DNA. The gene numbers correspond to genes in Figure 1.2 starting with “COXl” in Gorilla as “1” and going clockwise until “N D2” is assign “13”. Of all the genes in the two mtDNAs, only “N D 6” in Gorilla was on the opposite DNA strand which is why it is represented by “-10”. 9 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 5 -15 -2 0 - _ 2 5 _____________ i_____ . ___ - ___i_____________ i_____________ I_____________ i_____________ i_____________ > _____________ i_____________ i_____________ 0 10 20 30 40 50 60 70 80 90 100 actual number of reversals performed (k) Figure 1.4: Reversal distance, d(7 T , 7), versus actual number of reversals per formed to transform it into 7, where 7 is a genome/ perm utation th at evolved from the identity perm utation 7 r = 1 ,2,... , 100 by k random reversals. The sim ulations were repeated 10 times for every k. We compute the average difference between the reversal distance and the actual num ber of reversals performed k. 10 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. C hapter 2 P h ylogen etic Tree R econstruction M ethods Phytogenies are typically represented by unrooted binary trees where the leaves of the trees correspond to contemporary genomes (species alive today) and inter nal nodes correspond to ancient genomes which are now extinct (see Figure 1.1). These trees are unrooted because identifying their root is generally a distinct problem. Often, a distant genome, also known as an outgroup, is used to root a tree by generating a long branch on which the root is then positioned. Phyloge netic tree reconstruction is difficult largely because the number of unrooted trees grows at a rate which is more than exponential with the number of leaves. More precisely, for m genomes (or leaves), m > 3, there are (2m - 5)!! = (2m - 5) * (2m - 7) * ... * 3 * 1 unrooted binary trees (see Table 2.1). Once the topology of a tree is determined, the actual labeling of the internal nodes can add even more complexity to the problem. 11 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Nb of genomes (m) Nb of unrooted tree 3 1 4 3 5 15 6 105 7 945 8 10395 9 1.35135 10 2027025 11 34459425 12 654729075 13 13749310575 14 316234143225 15 7905853580625 Table 2.1: The growth of the number of unrooted tree is more than exponential with the number of genomes (leaves). Most of the work on phylogenetic tree reconstruction was developed in the context of gene-based or sequence-based phylogenies. B ut, the same methods can be adapted to gene order data. We will review the three main classes of reconstruction methods: distance-based methods, likelihood-based methods and maximum parsimony methods. 2.1 D istance-based m ethods Methods of this type construct trees strictly based on all the pairwise distances between the leaves of the tree. Each of these methods starts by computing the pairwise distance m atrix between all the sequences or the genomes of interest. For 12 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. sequences, the pairwise score could, for instance, be extracted from an optimal alignment. For genomes, or more specifically for gene order data, the score could be the actual rearrangement distance or even the breakpoint distance. Distance- based methods differ in how they use the distance m atrix to reconstruct the trees. Currently, the most common family of distance-based methods is probably “neighbor-joining” which was first proposed by Saitou and Nei, 1987 [SN87]. Methods in this class are typically very efficient; in most cases phylogenies can be inferred in polynomial time. On the other hand, these methods, at least when applied to gene order data, tend to lack in accuracy versus maximum par simony methods (Wang et ah, 2002 [WJM+Q2]). One might argue that because distance-based methods do not label internal nodes when reconstructing a tree, the result can be unrealistic. This is not the case with likelihood-based or maxi mum parsimony methods which are described below. 2.2 L ikelihood-based m ethods If we make assumptions about the mechanisms of evolution (for sequences or gene order) and the rates at which these changes occur, we can seek the tree which is the most likely to have generated the data observed. Such methods are called maximum-likelihood methods but they tend to be extremely computationally intensive. The complexity is due in part to extrem ely vast space of potential internal nodes that m ust be explored. 13 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. We only know of one maximum-likelihood framework developed specifically for gene order data (Dicks, 2000 [DicOO]). The method was only applied to small instances of the problem and, unfortunately, it is not clear how it could be scaled to more complex problems. Another recent and promising development in phylogenetic tree inference involves constructing a Bayesian framework and using Markov chain Monte Carlo methods to sample trees and model param eter values (Rannala and Yang, 1996 [RY96], Mau et al., 1997 [MNL97], Larget and Simon, 1999 [LS99], Li et ah, 2000 [LPD00]). It will be interesting to see if these methods can be adapted to gene order data. 2.3 M axim um parsim ony m ethods Methods seeking the most parsimonious scenario attem pt to recover the tree, and its internal nodes, th at minimizes the number of events on its branches. It corresponds to the Steiner Tree Problem for various possible distances. In the examples th at we consider, the general assumption is that the rearrangements events are rare. When that is the case, the most parsimonious tree probably corresponds to the actual tree. The first m ethods of this type were developed for sequence data (Fitch, 1977 [Fit77]) but they were later rewritten for gene order data (Hannenhalli et al., 1995 [HCKP95], Sankoff et al., 1996 [SSK96]). 14 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. The problem of interest in this thesis, the Multiple Genome Rearrangement Problem, defined in Chapter 1, is an example where we seek the most parsimo nious scenario with respect to the total number of rearrangem ents. Studies of the Multiple Genome Rearrangement Problem evolved from the special case of the Median Problem. T hat is, given the gene order of three unichromosomal genomes G i, G2 and G3, find the ancestral genome A which minimizes the total reversal distance d(A, G\) + d(A, G2) + d(A, G3). In 1997, Sankoff and Blanchette [SB97]) attem pted to solve the median problem by minimizing the breakpoint distance instead of the reversal distance. We now describe their work. A pair of elements in perm utations 7 r and 7 form a breakpoint if they are consecutive in one perm utation but non-consecutive in the other. More formally, given a signed perm utation, 7 r = ^ 1^ 2 . • • we extend it by adding ttq = 0 and 7rn + i = n + 1. Let i ~ j if j — i = 1. We call a pair of elements (7 r,-, 77+1), 0 < i < n, of 7 r an adjacen cy if 77 ~ 7 r« +i, and a breakpoint otherwise. Now, given n and the identity perm utation, /, the breakpoint distance, dbp(n) = df,p(iT, [), is the number of breakpoints in 7 r. For instance, if 7 r — 1 -4 -3 -5 2 6 then dbp(x) = 4. See Figure 2.1 for an example of how the breakpoint and the reversal distances are calculated. Of course, the definition of the breakpoint distance could easily be generalized for two perm utations which are not the identity but we will spare the reader at this point. We use the same convention th at d(ir) = d(ir, /) is the reversal distance between 7 r and the identity perm utation. 15 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 7 T = 1 -4 -3 -5 2 -6 7 T — 1 -4 -3 -5 2 -6 1 -4 -3 -2 5 -6 1 2 3 4 5 -6 1 2 3 4 5 6 dbp(ir) = 5 d{7 r) - 3 Figure 2.1: Computation of the breakpoint distance, and the reversal distance, d{ir), between 7 r = 1 -4 -3 -5 2 6 and the identity perm utation. Both the breakpoint and the reversal distance are measures of gene order similarity. Both distances can be computed in polynomial tim e but only the breakpoint distance is a direct calculation. A trivial lower bound for the reversal distance can be obtained from the breakpoint distance: (2.D For instance, in the example of Figure 2.1, we get th at d(w) = 3 > | = . The rationale behind the inequality is th at each reversal can resolve at most 2 breakpoints. Sankoff and Blanchette, 1997 [SB97] described a reduction of the median problem for the breakpoint distance to the Traveling Salesman Problem for which relatively efficient algorithms are available. Using this result, Blanchette et al., 1997 [BBS97] and Blanchette et ah, 1999 [BKS99] developed breakpoint anal ysis, a m ethod attem pting to reconstruct the most parsimonious scenario for to genomes under the breakpoint distance instead of the rearrangement distance. 16 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Their m ethod first looked for the optimal assignment of internal nodes for a given topology (this is also known as the small parsimony problem) by a reduction to successive median problems. The breakpoint analysis m ethod is then used to scan the space of all possible tree topologies in an attem pt to find the best tree (this is the large parsimony problem). One of the drawbacks of this approach is that the tree space quickly becomes unmanageable (see Table 2.1). Moreover, it appears difficult to adapt both the breakpoint distance and the breakpoint analysis to the more general case of multichromosomal genomes. Those genomes have “natural” breaks within them and it is not clear how this information could be included in the method. A final significant issue with breakpoint analysis is that the breakpoint dis tance, in contrast to the reversal distance, does not correspond to a minimum number of rearrangement events. As a result, the breakpoint median, recovered by breakpoint analysis, rarely corresponds to the ancestral median, the genome that minimizes the overall number of rearrangements in the evolutionary scenario. Our simulations demonstrate th at in many cases the ancestral median correctly reconstructs the ancestral genome. To illustrate this point, consider the following simple example. Suppose th at the genomes Gi, G^, and Gs, evolved from the ancestral genome A = 1 2 3 4 5 6 by one reversal each such that 17 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Gi = 1 2 -4 -3 5 6 G2 = 1 -4 -3 -2 5 6 G3 = 1 2 3 4 -5 6 Searching for the breakpoint median will produce 4 optim al solutions: A, but also Gi, G2, and G3. If the median is A, then we have 2 breakpoints on each edge of the tree for a total of 6. But if the median is any of the 3 genomes, we also get a total of 6 = 04-3+3 breakpoints. Hence, in this simple case, the breakpoint median fails to unambiguously identify the ancestor. Conversely, the only solution for the ancestral median is A since it is the only perm utation generating a tree with a total score of 3 reversals. Picking any of the 3 other genomes would give a total score of 4 reversals. This thesis studies the ancestral median problem since it appears to be more biologically accurate than the breakpoint median. Initial attem pts to recover the ancestral median were m ade by Hannenhalli et al., 1995 [HCKP95] and Sankoff et al., 1996 [SSK96] who came up with approaches th at may work well for 3 close genomes. However, it is not clear how to generalize their approaches for more than three genomes. In particular, Hannenhalli et al., 1995 [HCKP95] were successful in reconstructing a genome rearrangement scenario for 3 herpesviruses but failed to resolve a much more complicated data set of 13 Campanulaceae cpDNAs with 105 markers (the comparative maps were constructed by Mary Beth Cosner and colleagues in early 90s). The variety of rearrangements in these flowering plants 18 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. far exceeds that reported in any group of land plants thus making this data set a challenging problem for any genome rearrangement study. The first relatively large data set of rearranged genomes was studied by Blanchette et al., 1999 [BKS99] who used B PA nalysis, the original implemen tation of breakpoint analysis [BBS97], to analyze 11 metazoan mtDNA with 35 markers. As for the Campanulaceae problem, it remained unsolved for almost ten years until Cosner et al., 2000 [CJM+00b], [CJM+00a] improved on BPA nalysis and constructed a rearrangement scenario with 67 reversals. Recently, Moret et al., 2001 [MWB+01] developed the software GRAPPA th at further improved on BPA nalysis. Finally, in a recent breakthrough, Moret et al., 2001 [MWWW01] described a million-fold speedup over G RAPPA and re-evaluated the Campanu laceae rearrangement scenario. Their analysis returned 216 trees with reversal distance 67 as compared to only 4 such trees in the previous analysis. Our M G R algorithm described below improves on this recent result by generating a rear rangement scenario with only 65 reversals th at was overlooked by Moret et al., 2001 [MWWW01]. At the tim e we initiated this work, apart from the initial attem pts by Hannen halli et al., 1995 [HCKP95] and Sankoff et al., 1996 [SSK96], no other algorithm for computing any version of the M ultiple Genome Rearrangement Problem was available. Since then, two independent groups have published new results on the Median Problem. Caprara, 2001 [CapOl] and Siepel et al., 2001 [ACS01] 19 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. described two very different branch-and-bound algorithms for the special case of the Multiple Genome Rearrangement Problem with 3 genomes. Their work confirms the interest of the community in phylogenetic tree reconstruction based on actual rearrangements events. 20 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. C hapter 3 G enom es, R earrangem ents and Pairw ise A lgorithm s In this chapter, we present the notation and terminology used in the context of genome rearrangem ents. In Section 3.1, we present the various types of genomes we might encounter. In Section 3.2, we present the different types of rearrange ments affecting the gene order of these genomes. Finally, in Section 3.3, we present an overview of some of the results for computing the rearrangement dis tance between 2 genomes th at we use in later chapters. 3.1 G enom es In the context of genome rearrangem ents, the focus is on the relative order of specific genes in different genomes. The goal is to extract phylogenetic relation ships from this information. For th at purpose, it is sufficient and common usage to view genes as integers since this representation reports directly on their rel ative order. The actual numbering of the genes is somewhat arbitrary but it is typically chosen such th at one of the genomes corresponds to the monotonically 21 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. increasing sequence of numbers 1 ... n where n corresponds to the total number of genes of interest in the genome. When the orientation or the strand of each gene is also known, we associate a ± to each number. This distinguishes between signed and unsigned genomes. In the remainder of this thesis, we will assume th at the genomes are signed unless specified otherwise. We make an additional distinction between genomes depending on whether they consist of a single chromosome or a collection of chromosomes. The first ones are unichromosomal while the others are multichromosomal. There are three types of chromosomes: circular, linear directed and linear undirected. A common example of a circular unichromosomal genome is a mitochondrion (see Figure 1.2 for two such examples). In contrast, linear chromosomes have two endpoints instead of being wrapped around themselves. We make a distinction between directed and undirected linear chromosomes depending on whether these end points are identifiable and have a fixed order. If the whole chromosome can be flipped without any noticeable change it is said to be undirected. On the other hand, if, for instance, a chromosome has a p arm (or short arm) and a q arm (or long arm) it is said to be directed since flipping it would imply a change. See Figure 3.1. In the remainder of this study, a multichromosomal genome II will be represented by a set of N linear undirected chromosomes {tt(1), ... , 7 r(lV)} representing the “broken” signed perm utation. We only consider genomes with linear undirected chromosomes at this point because those are the genomes for 22 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. which a rearrangement distance engine is available (Tesler, 2002 [Tes02a]). But the adaptation to linear directed chromosomes, for instance, would be relatively easy to do. p arm q arm ( x Figure 3.1: Example of a directed chromosome. The two endpoints of the chro mosome are identifiable. This implies that flipping the chromosome would be noticeable. These subtle differences behind the various kinds of genome are im portant because they are a factor when comparing genomes. Consider, for example, the following three genomes: Gi = 1 2 3, G 2 — 2 3 1 and Gz = -3 -2 -1. As cir cular chromosomes, all three representation are equivalent. As undirected linear chromosomes, only G\ and Gz correspond to the same representation. Finally, all three genomes would be different if they were directed linear chromosomes. 3.2 R earrangem ents During the course of evolution, a variety of events can modify the genomic con tent. These events are called mutations and they usually arise from DNA replica tion errors. These m utations are separated in two main categories: point m uta tions and chromosomal m utations. Point m utations occur when a single DNA base pair is modified. These m utations are further divided in three subcategories depending on their specific effect when translated: nonsense, missense or silent. 23 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Because point mutations generally only affect individual genes, they do not influ ence the specific order of the genes and they will be ignored in the remainder of this study. In contrast, chromosomal mutations directly modify the order of the genes. There are various types of chromosomal m utations with the most common being: reversals (or inversions), translocations, fusions, fissions, transpositions, inverted transpositions, insertions, deletions and duplications. See Figure 3.2 for an example of how reversals and deletions can occur during DNA replica tion. Only translocations, fusions and fissions are specific to multichromosomal genomes. The first six types of chromosomal m utations rearrange the genes but they do not modify the set of genes present in a given genome. Insertions, dele tions and duplications on the other hand, modify the gene content of a genome by adding, or removing, some genes or by generating m ultiple copies of the same gene. Examples of each possible chromosomal m utation are shown in Table 3.1. 1 2 3 4 5 6 (a) (b) Figure 3.2: (a) Original DNA segment where each integer corresponds to a gene, (b) Same twisted DNA segment. During replication, if the loop is copied in its current order, it constitutes a reversal and the segment is transformed into 12-5 -4 -3 6. Note that the sign or the strand of the genes is modified at the same time as the order. The loop can also be totally ignored or skipped during replication, when this happens, it constitutes a deletion and the segment is transformed into 1 2 6 . 24 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. M utation Type Before After Impact Reversal Translocation Fusion Fission Transposition Inv. Transposition Insertion Deletion Duplication 1 2 3 4 5 6 =4 1 2 -5 - 4 -3 6 gene order 1 2 3 4 5 6 1 2 5 6 3 4 gene order 1 2 3 4 5 6 = > • 1 2 3 4 5 6 gene order 1 2 3 4 5 6 => 1 2 3 4 gene order 5 6 1 |2 3j 4 5 j I 6 =* 1 4 5 2 3 6 1 [23]4 5 R 6 = f> 14 5 -3 -2 6 1234Q56 =» 12 3 4 7 5 6 1 | 3 4 5 6 => 1 3 4 5 6 1 2 [3105 6 =* 1 2 3 4 3’ 4’ 5 6 gene order gene order gene content gene content gene content Table 3.1: Examples of possible chromosomal mutations. In the current study, we will only consider four types of mutations: reversals, translocations, fusions and fissions. These m utations only affect the gene order and not the gene content. This selection of possible rearrangements is motivated in part by the fact that in most cases they are the most common rearrange ments (Pevzner, 2000 [PevOO]). It is also because the only efficient algorithms for computing pairwise rearrangement distances available are restricted to these rearrangements (for unichromosomal genomes: Hannenhalli and Pevzner, 1995 25 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. [HP95b], Berman and Hannenhalli, 1996 [BH96], Kaplan et al., 1997 [KST97], Moret et al., 2000 [MWB+01] and Bergeron, 2001 [BerOl]; for multichromosomal genomes: Tesler, 2002 [Tes02b]). Including m utations affecting gene content con stitute an even bigger challenge that will be interesting to pursue. It will require including studies of multigene families such as in Sankoff, 1999 [San99]. The assumption that we just made about the possible rearrangements implies th at it is sufficient to compare the genomes of interest on a common set of n genes (i.e. a common alphabet of size n). We have seen, in Chapter 1, the effect of a reversal, p(i,j), applied to a perm utation (or chromosome) it = t t\ ... 7 r*_i 7 T ; . . . nj 7 T j+i ... 7 rn. It reverses the segment tt8 - ... 7 Xj and produces the perm utation 7 T\ . . . 7 T z ‘ _i TTj — Ttj^i . . . . . . 7 Tn . The other types of rearrangements are restricted to multichromosomal genomes and simultaneously affect two chromosomes. Assume we have two such chromosomes 7 r = 7ri7r2 ... and a = o\U2 . . . cr„(c r ) where n(7r) is the number of genes in the chromosome n. Let 1 < i < n(tv) and 1 < j < n(a). A transloca tion p(tt, a, i,j) exchanges genes between chromosomes 7 r and a and transforms them into chromosomes 7 T i . . . 7 Ti^fCTj . . . O ' and 0 \ . . . O j^ \7 ^ i . . . 26 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Fusions and fissions are special cases of translocations. A fusion concatenates the chromosomes t t and a. It corresponds to the translocation p(tt, a, n(%) + 1, 1), resulting in the new chromosome • • * ^n(7r)& 1 * • • A fission “breaks” a chromosome tt into two chromosomes. It corresponds to the translocation p(n, 0, i, 1) and so it affects two chromosomes but one of those is empty (0). It produces two new chromosomes 71*1... 7 T j_x and 7 Tj... . 3.3 Pairw ise A lgorithm s In this section, we present an overview of the algorithms for computing the distance and finding the shortest pairwise rearrangem ent scenario between two genomes. We will focus on the algorithms for computing the pairwise distance, developed by Bafna and Pevzner, 1996 [BP96] and Hannenhalli and Pevzner, 1999 [HP99], summarized by Pevzner, 2000 [PevOO] and improved by Tesler, 2002 [Tes02a], since they represent the basis of the m ethod presented in later chapters. We will first describe the unichromosomal case and then briefly review the multichromosomal case. 27 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 3.3.1 U nich rom osom al gen om es As mentioned before, for unichromosomal genomes, the only rearrangement events th at we consider are reversals. Finding the shortest reversal scenario transforming one genome into the other then corresponds to sorting a perm uta tion using reversals since we can label the genes such th at the other genome is the identity perm utation. Assume we have a perm utation 7 r th at we wish to sort. The first step is to convert t t , a signed perm utation, into t t ', an unsigned perm utation, by mim icking every directed element of i by two undirected elements i a and if, repre senting the head and the tail of i. If t t was a perm utation of size n, t t ' will be a perm utation of size 2n. In practice, we replace each positive element -H of the perm utation by 2i — 1 and 2i, and each negative element —i by 2i and 2i — 1. This transformation was first proposed by Bafna and Pevzner, 1996 [BP96]. The next step is to construct the breakpoint graph associated with t t ' . We extend the perm utation t t ' by adding t t ' 0 = 0 and T r ' 2 n + 1 — 2n + 1. The breakpoint graph of tt', G (tt), is an edge-colored graph with 2n + 2 vertices {tt'0, tt[,... , tt'2ti, T T 2n+l\ = {0,1,2,... ,2 n ,2 n + 1}. Add a black edge between vertices Tr'2 i and TT2 i + l for 0 < i < n. Add a g r e y edge between 2i and 2i -f 1 for 0 < i < n. Black edges correspond to the actual state of the perm utation while grey edges correspond to the sorted perm utation we seek. See Figure 3.3 for an example. 28 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Tt' 0 1 2 8 7 6 5 10 9 3 4 12 11 13 Tt 1 - 4 - 3 - 5 2 -6 Figure 3.3: Breakpoint graph associated with the perm utation 7 r = 1 -4 -3 -5 2 6. Black edges are represented by regular lines while grey edges are represented by dashed lines. Bafna and Pevzner, 1996 [BP96], and later Hannenhalli and Pevzner, 1999 [HP99], showed that the G(tt) contains all the necessary information for efficiently sorting the perm utation t t . The key idea is to look at the maximal cycle decom position of the breakpoint graph. Finding the maximal cycle decomposition of a graph in general can be a very difficult problem but, fortunately, because of the way the breakpoint graph was constructed for a signed perm utation, each vertex has degree two and so the problem is trivial in this case. Suppose c(7 r) is the maximum number of edge-disjoint alternating cycles. Alternating because in any cycle, of the breakpoint graph of a signed perm utation, each pair of consecutive edges can never be of the same color. Bafna and Pevzner proved the lower bound d ( ‘ 7r) > n + 1 — c ( t t ) , (3.1) 29 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. which proved to be tighter than (2 .1). For instance, looking once again at the example from Figure 2.1 which is the same as in Figure 3.3, we see th at n + 1 — c(tt) = 6 + 1 —4 = 3 which is a tighter lower bound than | from (2.1). In order to present the result by Hannhenhalli and Pevzner, 1995 [HP95b] for calculating the reversal distance of a signed perm utation, we need to define a few additional concepts on the breakpoint graph. A gray edge in G(ir) is said to be oriented if it spans an odd number of vertices (when the vertices of G(n) are arranged in the canonical order tT q, ... , ttr 2n+1). A cycle is said to be oriented if it contains at least one oriented gray edge. Cycles which are not oriented are said to be unoriented unless they are of size 2 which we will describe as trivial. The term oriented comes from the fact th at whenever we traverse an oriented cycle we will traverse at least one black edge from left to right and one black edge from right to left. For instance, in Figure 3.3, edge (4,5) is oriented and hence the cycle (4,5,10,11,13,12) is oriented. On the other hand, cycle (2,3,9,8) is unoriented since both grey edges (2,3) and (8,9) are unoriented. We now define the overlap graph of G, 0(G). For each grey edge in G we will associated a new vertex ve in 0(G ). Whenever two grey edges e and e! overlap or cross in the canonical representation of G, we will connect the corresponding vertices ve and ve> . Below, by component we will mean connected component in 0. A component will be oriented if it contains a vertex ve for which the corresponding grey edge e is oriented. As for cycles, a component which consists 30 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. of a single vertex (grey edge) will be said to be trivial. In Figure 3 .3 , there are only 2 trivial component and one larger oriented component since at least one of its grey edge is oriented. As we will see, the difficulty in sorting perm utations comes from unoriented components. Unoriented components can be classified into two categories: hurdles and protected nonhurdle. A pro tected nonhurdle is an unoriented component that separates other unoriented components in G when vertices in G are placed in canonical order. A hurdle is any unoriented component which is not a protected nonhurdle. A hurdle is a superhurdle if deleting it would transform a protected nonhurdle into a hurdle otherwise it is said to be a sim ple hurdle. Finally, t t is said to be a fo rtre ss if there exists an odd number of hurdles and all are superhurdles in 0 (G (7 r)) (Setubal and Meidanis, 1997 [SM97]). The result by Hannenhalli and Pevzner, 1995 [HP95b] is that: d(tv) = n + 1 - c(ir) + h (tt) + /(?r), (3.2) where h ( r c ) is the number of hurdles in - k and / ( tt) Is 1 if tt is a fortress and 0 otherwise. They also developed the machinery for actually recovering an optimal sequence of sorting reversals. 31 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 3 .3 .2 M ultichrom osom al case Hannenhalli and Pevzner, 1995 [HP95a] also derived an equation similar to (3.2) for multichromosomal genomes when rearrangements are: reversals, transloca tions, fusions and fissions. We refer the reader to Pevzner, 2000 [PevQO] and Tesler, 2002 [Tes02a] for the details of the calculation but we will briefly review how the formula for the rearrangement distance is obtained. The main idea for computing the rearrangement distance between two mul tichromosomal genomes n and P is to concatenate their chromosomes into two perm utations 7 r and 7 . The idea behind these concatenated genomes is th at every rearrangement in a multichromosomal genome H can be mimicked by a reversal in a perm utation tt. In an o p tim a l concatenate, sorting t t with respect to 7 actually corresponds to sorting H with respect to P. Tesler, 2002 [Tes02a] even showed that when such an optimal concatenate does not exist, a n e a r-o p tim a l concatenate exists such th at sorting this concatenate mimics sorting the m ulti chromosomal genomes and uses a single extra reversal which corresponds to a reordering of the chromosomes. 32 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. C hapter 4 M ultiple G enom e R earrangem ent Problem 4.1 A lgorithm We first explain the idea of our algorithm for the case of 3 unichromosomal genomes. For this type of genomes, the only rearrangem ent events th at we con sider are reversals. Our algorithm evaluates all possible reversals for each of the 3 genomes, identifying good reversals. Intuitively, a reversal is good if it brings a genome closer to the ancestral genome. Of course, the ancestral genome is unknown and therefore it is unclear how to find good reversals. However, we argue (and confirm by simulations) that the reversals which bring Gi closer to both (?2 and Gz are likely to be good reversals. If this is correct, then we do not need the ancestral genome to find good reversals. We then carry on good reversals in the genomes G\, G 2, and G 3 in the hope th at they will bring us closer to the ancestor, and iterate until the genomes G\, G 2 and G 3 are transformed into an identical genome (converge to the ancestor A). Ideally, at this point, we 33 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. have reached the most likely ancestral median. Of course, this approach works well only for “almost additive” trees with small deficit and we argue that it is the case for many biologically interesting samples. A good reversal p in genome Gi is a reversal th at reduces the reversal distance between G\ and G ? 2 and the reversal distance between G\ and G3. Define A (p) as the overall reduction in the reversal distances: A(p) = (d(Gx, G2) + d(Gi, G3)) — (d(Gi • p 1 G 2) + d{G\ ■ p, G3) ) . The reversal p is good if A (p) = 2. Good reversals in genomes G2 and G 3 are defined similarly. The idea of our algorithm is embarrassingly simple: look for good reversals in Gi, G2, and G 3 and perform them (if there are any) until each of the genomes is turned into the same ancestral genome. In many cases (in particular for additive trees) good reversals are sufficient to bring all three genomes to the ancestor. We call an instance of the median problem th at can be resolved using only good reversals a perfect triple. If we run out of good reversals before the three genomes converge to the ancestor, we relax our definition of good reversals. This approach leaves room for some flexibility. Often there is a variety of good reversals and it is not clear which one to choose. For example, the list of good reversals often contains non-overlapping reversals p(ii,ji) and p ( i2, J 2 ) with 34 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. i\ < j i < '< 2 < J2 and the order in which these reversals are performed is often irrelevant. Our objective is to choose good reversals in such a way that we do not run out of good reversals until all three genomes converge to the ancestor. One way to address this problem is to test all pairs/triples/... of reversals in order to avoid reversals that would cause us to run out of good reversals in a few steps. We also use a heuristic to choose the best reversal from the list of good reversals. The heuristic is based on an observation th at good reversals, if carried out in the correct order, should not affect most of the other good reversals th at are available. Hence, for each good reversal /> , we compute np, the number of good reversals that will be available if p is carried out. The heuristic picks the good reversal p with the maximal n p as the best reversal, the reversal to be carried out. We implemented these procedures in a program called MGR-M EDIAN and it turned out to work well in practice. In some cases, no good reversal will be available, i.e. A(p) < 2 for all reversals p in each of the three genomes. In those situations, the best reversal will be the result of a depth k search minimizing the total pairwise reversal distances. Suppose we have a sequence of k reversals pi, p i ... p k to be applied to Gi. Define A ( p i , p2, ■ ■. , pk) — d(Gi, G2) + d(Gi, G3) — (d(Gi ■ pi ■ pk, G 2) + d(Gi ■ p i • • • p k , Gs)). 35 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Let A = max A (pi,p2, ... , p k) Pli-A and p i,... , pk be a sequence of reversals achieving this maximum. A then cor responds to the maximal reduction in the reversal distance after the depth k search. The best reversal in Gi will be the first reversal of the sequence, i.e. p\ (the best reversals in (?2 and &3 are defined similarly). W hen no good reversal is available, the reversal th at will be carried out by MGR-M EDIAN will be the result of this search. The depth k search should be taken with caution when one of the genomes is already within distance less than k from the ancestor. In this case we consider k reversals pi,p 2, ■.. , pk, where the first x reversals are applied to G1, the next y reversals are applied to and the remaining k — x — y reversals are applied to G 3 , and maximize the function A(pi,p2j • • • ,Pk) = d(G\, G 2) + d(Gi, G 3) -{ - d(G 2 , G 3 ) d{G\ • pi ■ * • pXf G % • px+l " " ' Px+y) — d{G\ ■ P i ■ ■ • P x i G s ■ P x + y + 1 • • • P k ) — <f((?2 ‘ Px+1 ' ' ' Px+y> G 3 ■ Px^.y+1 • • • p k ) • Putting all these steps together we get the algorithm MGR-M EDIAN: 36 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. MGR-MEDIAN(6'i, G2, G3) 1. w hile Gi, G 2 and G3 are not identical 2. find all good reversals 3. w hile there are good reversals 4. find and carry out the best reversal 5. find all good reversals 6 . if G\, G 2 and G3 are not identical 7. find and carry out the best reversal from a depth k search 8. return Gi Now consider the case of m > 3 unichromosomal genomes G i, G2,... , Gm. We generalize the previous definition of good reversal in G ,- to be any reversal th at reduces the reversal distance from Gi to all other genomes. We define A (/?) once again as the reduction in the reversal distances: A(p) = J 2 i(G„ Gj) - £ d(G, • p, Gj) A good reversal p in genome Gi is now a reversal with A(p) = rn — 1. We iteratively carry on good reversals until any two of m genomes become identical. When we reach that point, we remove one of the two genomes and start the procedure again with m — 1 genomes. We keep removing genomes until we are back to solving the median problem. In many cases, especially when m is large, we will run out of good reversals before converging to the ancestral genome. In such cases, we have developed a heuristic to complete the recovery of the phylogeny. The heuristic relies on the MGR-MEDIAN program for 3 genomes described in the previous section. Starting 37 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. from the 3 closest genomes (in terms of the reversal distance), it iteratively adds one more genome to reconstruct the full phylogeny. Whenever possible, we choose the genome that is the “closest” to the partially reconstructed tree and such that it also forms a perfect triple with the two endpoints of one of the edges in the tree. We seek the closest genomes first because, as we will see in Section 4.2, the closer the genomes, the more accurate the ancestor produced by M GR-M EDIAN. Assume that genomes G\, G% , ... , Gi are already included in the tree T that corresponds to the partially reconstructed phylogeny. The problem of adding genome Gi+i to T corresponds to identifying the edge of T th at should be split by the edge leading to Gi+i . We call th at edge the split edge. The heuristic once again uses a simple greedy approach to find the split edge. For each edge (u,v) in T, compute M = M (u,v,G i+ 1) the median of u, v, and Gi+1- The cost of adding G;+1 to this edge, C(u,v), is the total reversal score of the median less the reversal distance between u and v (the score of the edge being removed). Formally, C (u7 v) = d(u, M) + d(v, M ) + d(Gi+1, M ) — d(u, v) The split edge on which to add Gi+1 is then the one with the smallest cost. Putting all these steps together, we get the algorithm M G R : The described algorithms rely on our ability to find good reversals. Instead of computing A (p) for all possible reversals while looking for good reversals, we have implemented a speedup making the algorithms more computationally 38 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. MGR(Gi, G2, • • • , Gm) 1. find all good reversals 2. w hile there are good reversals 3. find and carry out the best reversal 4. find all good reversals 5. find 3 closest genomes and build tree T using M GR-M EDIAN 6. w h ile there are genomes not in T 7. find Gi, the closest genome to T not in T 8. find the split edge on T using M GR-M EDIAN and add Gi 9. return T efficient. The speedup makes use of the concept of conserved adjacency. We call an ordered pair of markers, (x, y), a conserved adjacency if (x,y) or its inverse (—y, —x) is present in all genomes as consecutive elements. When looking for good reversals, we only consider reversals that do not break any conserved adjacency. The justification behind this shortcut comes from a result of Hannenhalli and Pevzner, 1996 [HP96] and a theorem recently proved by Glenn Tesler (Tesler, 2002 [Tes02a]). 4.2 Tests 4.2.1 Sim ulated d ata We compared our MGR algorithm to the two im plem entation of breakpoint anal ysis: BPA nalysis [BBS97] and GRAPPA [MWB+01j. Our initial tests showed that these two programs were producing nearly identical results, and so we decided only to include results from GRAPPA since it was a more efficient implementation. 39 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. W hen testing the algorithm, we are interested not only in the phylogeny th at we recover but also in the correct labeling of the internal (ancestral) nodes. We used the following simulated data for benchmarking. Starting from the identity perm utation A with n genes/markers, we performed k reversals to get genome Gh, k to get G2 and k to get G3. We used the resulting 3 perm utations as the input to M GR-M EDIAN and G RAPPA and checked whether they reconstructed the ancestral identity perm utation. Figures 4.1a,b show the difference between the total reversal distance D (T ) of the tree recovered by the algorithm and the actual number of reversals (equal to 3k). Figures 4.1c,d show the reversal distance between the ancestral perm utation recovered by the algorithm and the actual ancestor, the identity perm utation. The tests are conducted for various ratios r — ^reversals/^m arkers which corre sponds to the total number of reversals on the tree divided by the total number of markers in one genome. Both GRAPPA and M GR-M EDIAN produce very similar solutions for r < 0.20. But as the ratio r increases, G RAPPA starts making errors. In contrast, M GR-M EDIAN persists in finding correct solutions and in some cases find solutions that even have fewer reversals, than the actual ancestor. The issue here is that as the ratio r increases, the assumption that the ancestor corresponds to the most parsimonious scenario sometimes fails. In Figures 4.1c,d we see th at as the ratio r increases, both algorithms start having difficulty recovering the actual ancestor, with the 40 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. solution produced by G RAPPA further away on average than the ancestor produced by M GR-M EDIAN. (a) M edian Problem with 30 m arkers (b) M edian Problem with 100 m arkers © GRAPPA MGR • Actual o c 2 o 8 CD CO r e 0.4 0.6 / 0.2 ratio # reversals / # m arkers •© GRAPPA MGR • - Actual o c £ < s .o r o o c o 0 ) > 03 -2 0.8 0.2 ratio # reversals / # m arkers / (c) M edian Problem with 30 m arkers • e GRAPPA MGR .2 6 o - v > 5 os- 0.2 0.4 0.6 ratio # reversals / # m arkers 0.8 (d) M edian Problem with 100 m arkers -O GRAPPA MGR o 15 S 1 0 ; 0.8 m arkers 0.2 ratio # reversals t # m arkers 0.4 0.6 Figure 4.1: Comparison of M GR-M EDIAN and GRAPPA (3 genomes equidistant from the ancestor). The genomes Gi, G 2, (?3 are obtained by k reversals each from the ancestral identity perm utation 12... n (n = 30 and n = 100). The simulations were repeated 10 times for every ratio ^reversals/^m arkers = 3kjn. (a) and (b) The average difference between the number of reversals on the tree recovered by the algorithm and the number of reversals on the actual tree (equal to 3k). (c) and (d) The average reversal distance between the solution recovered and the actual ancestor. Figure 4.2 presents the results of similar experiments with nonequidistant genomes starting from the identity perm utation A and performing k, k and 2k random reversals to obtain Gh, G 2 and G3. Once again, G RAPPA starts failing to recover the optimal solution at r > 0.20, while M GR-M EDIAN keeps finding the true ancestor. 41 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. (a) Assym etric M edian with 30 m arkers (b) Assym etric M edian with 100 m arkers - e GRAPPA MGR — Actual 0.5 g? -0 .5 -1.5 0.2 ratio # reversals / # m arkers 0.6 / 0.8 © GRAPPA MGR — Actual ..................7> / / / d / / / / .JO (c) Assym etric M edian with 30 m arkers 0.2 0.4 0.6 0.8 ratio # reversals / # m arkers (d) Assym etric M edian with 100 m arkers © GRAPPA MGR c o 1 } o o e g . 52 ® O ) <0 ? f lj 0® - 0.2 ratio # reversals I # m arkers 0.4 0.8 0.6 © GRAPPA 0.2 0 .4 0.6 ratio # reversals / # m arkers Figure 4.2: Comparison of M GR-M EDIAN and GRAPPA (3 genomes nonequidis- tant from the ancestor). The genomes Gx, G2, G 3 are obtained by k, k and 2 k reversals respectively each from the ancestral identity perm utation 12... n (n = 30 and n = 100). The simulations were repeated 10 times for every ratio # reversals/^m arkers — Ak/n. (a) and (b) The average difference between the number of reversals on the tree recovered by the algorithm and the number of reversals on the actual tree (equal to 4k). (c) and (d) The average reversal distance between the solution recovered and the actual ancestor. We tested the performance of M G R for four and more genomes using a similar setup. First, we considered a small tree with 4 genomes as leaves and 2 internal (ancestral) nodes. For simplicity, we picked one of the ancestral nodes to be the identity perm utation. We then randomly simulated k reversals on each branch of the tree. We used the resulting 4 leaves of the tree as the input for M G R and G RAPPA and calculated the difference between the total reversal distance of the tree recovered with the actual num ber of performed reversals equal to 5k (Figure 42 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 4.3a,b). We also calculated how close the solutions recovered would get of the true ancestral perm utation (the identity perm utation). Since in each solution there are two internal nodes, we picked the one that is closer to the identity and recorded the reversal distance between it and the identity perm utation (Figure 4.3c,d). (a) 4 genomes with 30 markers (b) 4 genom es with 100 markers GRAPPA P 0 5 8 6 © ■ GRAPPA - x - MGR 0 ' MGR — Actual / £ ■ ■ • • Actual 0 0 - 0 0.5 1 ratio # reversals / # markers (c) 4 genomes with 30 markers 0.2 0.4 0.6 ratio # reversals / # markers (d) 4 genomes with 100 markers • e - GRAPPA MGR 2 4 05 o '•fi ^ G O 0.5 1 ratio # reversals / # markers • e GRAPPA MGR 2 3 9 V i ^ o ® - 0.4 0.6 0.8 0.2 ratio # reversals / # markers Figure 4.3: Comparison of M G R and GRAPPA (4 genomes). We start from an unrooted tree with 4 leaves and select one of the two internal nodes to be the identity perm utation 1 2 ... n (n = 30 and n = 100). We then perform k reversals on each branch of the tree to obtain the genomes Gf, (?2, Gs, G4 as the 4 leaves of the tree. The simulations were repeated 10 times for every ratio ^reversals/^m arkers = 5kjn. (a) and (b) The average difference between the number of reversals on the tree recovered by the algorithm and the number of reversals on the actual tree (equal to 5k). (c) and (d) The average reversal distance between the best (i.e., closest) internal node in the solution recovered and the identity perm utation. 43 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Finally, to see the effect of adding more genomes, we constructed larger com plete unrooted binary trees and simulated k random reversals on each branch. To obtain a sample input with m genomes that we would feed into M G R and GRAPPA, we simulated the smallest complete binary tree such th at the number of leaves was larger than m and randomly removed the extra leaves. The results in Figure 4.4 show the difference between the total reversal distance of the tree recovered and the total reversal distance of the simulated tree. Note th at it is difficult to use the ratio r — ^reversals/^m arkers here as it changes depending on the size of the tree. For example, when k = 1, if m is 4 then r = 5/30 ~ 0.167, but if m is 8 then r = 13/30 ~ 0.433 and if m is 16 then r — 27/30 = 0.9. Unfortunately, running G RAPPA on more than 10 genomes genomes turned out to be impossible on our workstations, as the tree space was too large. The only way to get around this problem would have been to suggest a tree topology to G RAPPA (which is exactly what we are trying to recover in the first place). However, even if we did suggest the actual tree topology to GRAPPA, we would still get an average score difference of 7.3 for n = 30 and of 19.1 for n = 100. 4 .2 .2 H erp esvirus d ata Hannenhalli et ah, 1995 [HCKP95] used herpesvirus gene orders as a test case for one of the first studies on the Multiple Genome Rearrangement Problem. They developed a rather elaborate m ethod to solve a relatively simple instance of the 44 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. (a) 30 m arkers an d k=2 reversals p er branch (b) 30 m arkers and k=3 reversals p er branch GRAPPA MGR — Actuai 0 5 10 15 e GRAPPA MGR • • Actual a > c CD X 3 o o 0 5 10 15 nb of genomes (m ) n b of genomes (m ) Figure 4.4: Comparison of MGR and GRAPPA (m genomes each with 30 markers). The genomes G\ , G2 ,. • • , Gm correspond to a subset of leaves from a complete unrooted binary tree on which we have performed k reversals on each branch. The simulations were repeated 10 times for every to. (a) and (b) The average difference between the number of reversals on the tree recovered by the algorithm and the number of reversals on the actual tree when k — 2 and k — 3. median problem for Herpes simplex virus (HSV), Epstein-Barr virus (EBV) and Cytomegalovirus (CMV) (Figure 4.5a). As the authors themselves pointed out, the method used would not be applicable to more complex problems and new algorithms would be required. The optim al solutions recovered involved 7 reversals. The ratio # reversals/ ^m arkers in this example is: r = 7/25 = 0.28. Our simulations indicate th at MGR-M EDIAN typically reconstructs the correct scenario with such ratios while GRAPPA typically fails for r > 0.2. 45 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. HSV: 1 23 4 5 6 7 8 9 10 11 12 13 14 15 16-19-18-17 20 21 22 23 -25 -24 EBV: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 6 -2 0 -1 9 -1 8 -1 7 21 22 23 24 25 CM V: 1 2 3 4 5 6 7 8 9 10 11 -13 -12 -16 -15 -14 -25 -24 1718 19 20 21 22 23 @ A: 1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 21 22 23 24 25 reversal (-19 -18 -17) reversal (-25 -24) reversal (-16 -15 -14) reversal (-13 -12) \ reversal (-23 -2 \ reversal (-25 CM V / \ reversal (-23 -22 -21 -2 0 -1 9 -18 -17) y / \ . reversal (-25 -24 17 18 19 20 21 22 23) E B V ) ( c M v ) Figure 4.5: Gene orders of Herpes simplex virus (HSV), Epstein-Barr virus (EBV) and Cytomegalovirus (CMV) taken from Hannenhalli et al., 1995 [HCKP95]. Also shown is the ancestral gene order (A) and the optimal evo lutionary scenario recovered by M GR-M EDIAN. We tested both M GR-M EDIAN and GRAPPA on these 3 herpesviruses to see whether they would recover the ancestral genome suggested by [HCKP95]. M GR-M EDIAN found this genome and reconstructed the rearrangement scenario with 7 reversals (Figure 4.5b) even though it did not correspond to a perfect triple. In contrast, G RAPPA returned a suboptim al solution with 8 reversals. Actu ally, the ancestor suggested by GRAPPA was the genome H S V itself, indicating the problem with the breakpoint distance described in Chapter 2. 46 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 4 .2 .3 H um an, fruit fly, and sea urchin m tD N A d a ta Sankoff et al., 1996 [SSK96] analyzed human, sea urchin, and fruit fly mtDNA to derive the ancestral gene order. Using MGR-MEDIAN, we found the ancestral gene order A with a total reversal distance of 39 (Figure 4.6). The solution is different from the ones found in [SSK96] but the total reversal distance is the same. The ratio ^reversals/^m arkers for this data set is r = 39/33 ~ 1.18, an indication of a difficult problem. Running GRAPPA on these genomes, we obtained a solution that has a total reversal distance of 43. H u m a n : 26 13 17 12 -24 IS 18 32 -2 -1 6 -3 -3 3 4 -2 8 7 5 1 10 19 25 2 2 11 29 14 20 -21 -8 6 30 -2 3 9 27 31 S e a u rch in : 26 4 25 22 S 1 -2 8 19 11 29 20 -21 6 9 2 7 8 30 23 -2 4 16 14 -2 32 3 -31 15 -7 3 3 10 13 17 12 18 F ru it fly: -26 -31 -2 7 12 -2 4 15 18 32 -3 -3 3 4 13 5 7 1 10 19 2 25 16 29 8 -9 -20 -1 1 -2 2 3 0 -2 3 21 6 28 -1 7 -14 A : 26 13 17 12 -2 4 15 18 32 -2 8 7 -6 21 -2 0 -2 9 -11 -2 2 -2 5 -1 6 8 -3 -3 3 4 14 -2 -1 9 -1 0 -1 -5 30 -2 3 9 2 7 31 Figure 4.6: Human, sea urchin and fruit fly mitochondrial gene order taken from Sankoff et al., 1996 [SSK96]. A is the ancestral gene order suggested by M GR-M EDIAN. 4 .2 .4 M etazoan m tD N A d a ta Blanchette et ah, 1999 [BKS99] used B PA nalysis in the rearrangement study of 11 metazoan mtDNAs. The genomes come from 6 m ajor m etazoan groupings: nematodes (NEM), annelids (ANN), mollusks (MOL), arthropods (ART), echin- oderms (ECH), and chordates (CHO). They are: Ascaris suum and Onchocerca volvulus from the netm atodes, Lumbricus terrestris from the annelids, Albinaria coerulea, Cepaea nemoralis and Katharina tunicata from the mollusks, Drosophila yakuba and Artemia franciscana from the arthropods, Asterina pectinifera and 47 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Strongylocentrotus purpuratus from the echinoderms, and finally, Human from the chordates. They were originally selected in [BKS99] to provide the analysis with exemplar of the most diverse members of each group. The two “optim al” phylogenies recovered in [BKS99] had 199 breakpoints. We studied the same data set with M G R and GRAPPA and used the curated gene order data of the 11 genomes from the MG A Source Guide 1 compiled by Jeffrey L. Boore. After removing two genes th at were not shared by all mtDNAs we were left with a common set of 36 genes. M G R recovered a phylogeny with 150 reversals (Figure 4.7). The tree space for 11 genomes is very large and searching it exhaustively with GRAPPA is very tim e consuming. After 48 hours on a workstation, G RAPPA had recovered 3 “optim al” trees with 175 reversals and 200 breakpoints. Even suggesting the topology found by M G R to G RAPPA would only produce a fourth tree with 175 reversals. The tree recovered by MGR is closely related to one of the optimal trees in Blanchette et al., 1999 [BKS99]. The weak association of Katharina tunicata with the mollusks was already discussed in [BKS99]. Apart from this and from the weak grouping of the two arthropods, the induced phylogeny also agrees with the metazoan phylogeny proposed by Boore and Brown, 2000 [BB00] (the nemath- odes and the echinoderms were not discussed in this paper). We remark that Blanchette et al., 1999 [BKS99] obtained their tree in a semi-autom ated regime 1 http://w w w .jgi.doe.gov/programs/comparative/MGA_Source_.Guide.html 48 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 15 2 1 J1L 23 17 7 -Onchocerca volvulus (NEM) -15---- Ascaris suum (NEM) — 5-----Lumbricus terrestris (ANN) — ----- Cepaea nemoralis (MOL) — -----Albinaria coerulea (MOL) — -----Katharina tunicata (MOL) — -----Artemia franciscana (ART) — -----Drosophila yakuba (ART) — -----Strongylocentrotus pupuratus (ECH) — 1 ----- Asterina pectinifera (ECH) — Human (CHO) Figure 4.7: Phylogeny of 11 m etazoan genomes reconstructed by M G R . The gene order data is taken from the MGA Source Guide which is compiled by Jeffrey L. Boore. The genomes come from 6 m ajor metazoan groupings: nematodes (NEM), annelids (ANN), mollusks (MOL), arthropods (ART), echinoderms (ECH), and chordates (CHO). Numbers on the edges show the number of reversals. by making a selection between the potential phylogenies and disregarding the ones breaking any one of the 6 m etazoan groupings. Although such assumptions about the data were not included in M G R , it did not prevent it from the discovery of a very similar tree in a fully autom ated fashion (Figure 4.7). Rooted differ ently, we see that the nemathodes are a late-branching sister taxon of the annelids which is the same as in [BKS99]. The deuterostomes (chordate and echinoderm) association was successfully identified both in [BKS99] and in the tree from M G R 49 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. but not in any of the first 3 trees produced by GRAPPA (excluding the one we suggested as a constraint). 4 .2 .5 Campanulaceae cp D N A d ata We analyzed the Campanulaceae chloroplast dataset with 13 cpDNAs and 105 markers. It is one of the most challenging genome rearrangement data sets studied by Cosner et al, 2000 [CJM+00b], [CJM+00a] and Moret et al., 2001 [MWWW01]. The tree space for 13 genomes is too large and cannot be searched exhaus tively by G RAPPA. To analyze the tree space in this case [CJM+00b], [CJM+00a], [MWWW01] described various techniques to obtain constraint trees to suggest to the program. G RAPPA then searched the space of refinements of these con straint trees trying to minimize the total num ber of reversals. Moret et al., 2001 [MWWW01] recovered 216 trees with a total of 67 reversals. GRAPPA was not able to decide which of those trees corresponds to the most likely reconstruction of the rearrangement scenario. Running M G R on the same data set did not require the preprocessing of a constraint tree and recovered a tree with only 65 reversals, shown in Figure 4.8. The topology of the tree recovered actually corresponds to the topology of one of the trees recovered by GRAPPA, but the labeling of the internal nodes differs. Since our tree minimizes the num ber of reversals we argue that M G R provided a better reconstruction of the rearrangement scenario than GRAPPA. 50 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 0 0 0 - 2 - Campanula — — Adenophora - 2 - Trachelium — i- Symphyandra 9 — — Wahlenbergia --4- Merciera Asyneuma - 2 - Triodanus a Legousia — — Codonopsis C Cyananthus Platycodon — Tobacco Figure 4.8: Phylogeny of the Campanulaceae cpDNA dataset as reconstructed by M G R . Numbers on the edges show the number of reversals. 51 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. C hapter 5 R econstru ctin g A ncestral G ene Orders for M ultichrom osom al G enom es 5.1 A lgorithm Consider three multichromosomal genomes Gi, G2 and G3. The median problem is to find the ancestral genome A which minimizes the total genomic distance d(A, Gi) + d(A, G2) + d(A, Gz). The genomic distance in this case is defined in terms of reversals, translocations, fusions, and fissions, the most common rearrangement events in multichromosomal genomes (Pevzner, 2000 [PevOO]). Given two genomes II and T, the genomic distance, d(U, T), is defined as the minimum number of reversals, translocations, fusions and fissions required to convert one genome into the other. The genomic distance was first studied by 52 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Hannenhalli and Pevzner, 1995 [HP95a] who developed a polynomial-time algo rithm to compute a rearrangement scenario between man and mouse involving 131 rearrangements. M G R -M C algorithm is a generalization of the M GR-M EDIAN algorithm for unichro- mosomal genomes. First, we evaluate all possible rearrangements (reversals, translocations, fusions, and fissions) for each of the 3 genomes, identifying good rearrangements. As in Section 4.1, a rearrangement is good if it brings a genome closer to the ancestral genome. We will argue once again that the rearrangements which bring G\ closer to both G2 and G3 are likely to be good. We iteratively carry on these good rearrangements until the genomes G i, (?2 and (?3 are trans formed into an identical genome hoping th at we have reached the most likely ancestral median. Since we are dealing with multi chromosomal genomes and with four different types of rearrangements, we need to be aware of an ambiguous situation th at can occur when solving the median problem. Consider the following simple example: Gi = 1 2 3 4 5 G2 = 12-5-4-3 G3 = 1 2 3 4 5 In this example, the parsimony principle does not allow one to unambiguously reconstruct the evolutionary scenario. If the ancestor coincides with G 1 then a reversal occurred on the way to G2 and a fission occurred on the way to G3. But we could also have a similar scenario starting from G % as the ancestor or 53 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. even starting from Gz if we assume that 2 fusions occurred. In this example, d(Gi, G 2) = d(Gi, Gz) = d((?2, Gz) = 1. We did not have this kind of ambiguity for unichromosomal genomes because it was impossible to find 3 genomes that would all be within 1 reversal of the each other. These ambiguities motivate a more careful selection within the good rearrangem ents. We use the observa tion th at in most genomes of interest (e.g. mammalian genomes) reversals and translocations are more common than fusions and fissions. W hen looking for the best rearrangement to be carried out within the good rearrangem ents, we always select reversals/ translocations before fusions/fissions. W ithin a given kind of rearrangement, we will use the same criteria we use previously and select the rearrangement p with the maximal n p, where n p is the number of good rear rangements that would be available if p was carried out. As in Section 4.1, we call an instance of the median problem that can be resolved using only good rearrangements a perfect triple. If we run out of good reversals before reaching a solution, the best rearrangement will be the result of a depth k search minimizing the total pairwise rearrangement distances. Putting all these steps together we get the M G R -M C algorithm for multichromosomal genomes: The algorithm M G R described in Section 4.1 can now be generalized to work with m multichromosomal genomes by looking for good rearrangem ents instead of good reversals and by using M G R-M C instead of M GR-M EDIAN. 54 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. MGR-MC (Gi, G '2; G 3) 1. w h i l e Gi, G2 and G3 are not identical 2 . find all good reversals and translocations 3. w hile there is a good reversal or translocation 4. find and carry on the best reversal or translocation 5. find all good reversals and translocations 6 . find all good fusions and fissions 7. if there is a good fusion or fission 8. find and carry on the best fusion or fission 9. else carry on the result of a depth k search 10. return G 1 5.2 Tests 5.2.1 Sim ulated d ata We used the following simulations to test the performance of M G R -M C . Starting from the identity perm utation A of size n, we first randomly selected b chro mosome breaks to simulate a multichromosomal ancestor. Next, to transform A into Gi (1 < i < 3), we performed k rearrangements where each rearrange ment was randomly assigned to be a reversal/ translocation with probability p and a fusion/fission with probability 1 — p. We used the resulting 3 genomes as the input for M G R -M C . We are interested in the difference between the score (total number of rearrangements) of the solution recovered by the algorithm and the actual score of the simulated tree (equal to 3k). We are also interested in the rearrangement distance between the ancestral genome recovered by the algorithm and the actual ancestor. The tests are conducted for various ratios 55 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. r = ^rearrangem ents/^m arkers which correspond to the total number of rear rangements on the tree divided by the total number of markers in one genome. Figure 5.1 illustrates that M G R -M C has no difficulty recovering ancestral genomes with a score which is at least as good as the one of the actual ances tor. Actually, in all the tests conducted, not once was the solution produced by M G R -M C worst than the true ancestor. The solutions produced for small ratios r — #rearrangem ents/#m arkers tend to be very close the actual ancestor. But, as the ratio r increases, we see the same effect as for unichromosomal genomes: M G R -M C starts finding solutions with genomic distance which is smaller than the true number of rearrangement events. As a result, the average distance between the solution recovered and the true ancestor increases. The comparison of Figures 5.1a,b and 5.1c,d illustrates that accuracy of reconstructions deteriorates with the increase in the rate of fusions/fissions. Actually, it looks like the combinatorial explosion associated with having four types of rearrangements for multichromo somal genomes tends to jeopardize the most parsimonious scenario assumption for even smaller values of r compared to the unichromosomal case. 5.2.2 G ene order o f th e h u m an -cat-m ou se com m on an cestor The modern comparative mapping studies generated a wealth of data on dif ferences in genomic organization for many m am m alian species. However, most 56 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. (a) MGR-MC with 30 markers and p=0.9 (b) MGR-MC with 100 markers and p=0.9 average score difference / - x - average score difference • • ■ ■ actual ancestor — actual ancestor •x - average distance of solution / C O 5 ■ x • average distance of solution I 0.2 0.4 0.6 0.8 1 ratio # rearrangements / # markers (c) MGR-MC with 30 markers and p=0.7 - x - average score difference — actual ancestor ■ x average distance of solution 0.2 0.4 0.6 0.8 1 ratio # rearrangements / # markers - 1 0 1 0 0.2 0.4 0.6 0.8 ratio # rearrangements / # markers (d) MGR-MC with 100 markers and p=0.7 10r average score difference actual ancestor average distance of solution - 1 0 L 0 0.2 0.4 0.6 0.8 ratio # rearrangements / # markers Figure 5.1: Performance of M G R -M C (3 multichromosomal genomes equidistant from the ancestor). The ancestral genomes are obtained from the identity permu tation 12... n (n = 30 and n = 100) by inserting b chromosomes breaks (b = 2 when n — 30 and b = 9 when n = 100). The genomes G i, G2, G3 are obtained by k rearrangements each from the ancestral genomes. Each rearrangement is a reversal/ translocation with probability p and a fusion/fission with probability I —p. The simulations were repeated 10 times for every ratio r = 3k fn. We com pute the average score difference which is the difference between the number of rearrangements on the tree recovered by the algorithm and the actual number of rearrangements (equal to 3k). We also compute the average distance of solution between the solution recovered and the actual ancestor. existing comparative maps are pairwise maps representing genome organization of two species rather than multiple maps representing the genomic organization for more than two species. Since the num ber of established universal markers (O ’Brien et al., 1999 [0 +99]) th at work across many genomes is relatively small, 57 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. it is often not clear how to integrate pairwise comparative maps into multiple maps. The first sufficiently detailed triple comparative maps appeared recently as the results of rat (W atanabe et al, 1999 [WBM+99]) and cat (Murphy et al., 2000 [MSC+00]) comparative mapping projects. We collaborated with Bill Murphy at the National Cancer Institute to integrate the pairwise human-mouse, hum an-cat, and mouse-cat comparative maps into a triple human-mouse-cat map and we use this map for deriving the ancestral genome organization. The previous attem pts to derive rearrangement history of multichromosomal genomes concentrated on human and mouse genomes (Nadeau and Taylor, 1984 [NT84], Hannenhalli and Pevzner, 1995 [HP95a]). The cat data used in this paper comes from Murphy et al., 2000 [MSC+00] and consists of 193 markers shared by all three species. The number of markers is still too small to derive a detailed rearrangement scenario but it allows one to get some insights into a large-scale organization of the ancestor. Ultimately, this organization may be refined with M G R -M C as soon as more markers shared by all three species become available. Comparative maps usually correspond to unsigned perm utations, i.e., no information on the direction (signs) of the genes/markers is available. Since mammalian comparative maps contain m any singletons (Pevzner, 2000 [PevOO]) the existing algorithms for analyzing unsigned perm utations become too time- consuming in this case. As a result we have to assign an orientation to the 58 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. markers, since the current implementation of M G R -M C only supports signed permu- t at ions /genom es. Ultimately, this should not be a problem as more data becomes available. We used strips in unsigned perm utations (Hannenhalli and Pevzner, 1996 [HP96]) to infer the signed perm utations from the original unsigned permu tations. Using human genome as a reference, we first identified all the strips both in cat and in mouse genomes. We then assigned an orientation to the markers based on these strips. Any m arker for which we could not assign an orientation using this m ethod either in cat or in mouse genome was removed and we were left with a common set of 114 markers. This process obviously inserts a bias towards blocks of preserved markers, while removing information about more local disruptions, e.g., single marker reversals. At least partly because of this bias, in this common alphabet of size 114, the genomic distance between human and mouse genomes is only 33 rearrangem ents. This is much smaller than the famous estim ate of 178 ± 3 9 rearrangements by Nadeau and Taylor, 1984 [NT84]. We used the human, cat, and mouse gene orders as input for M G R -M C which recovered the most likely common ancestor after 12 rearrangements in human, 14 rearrangements in cat and 22 rearrangements in mouse (Figure 5.2). Out of the total of 48 rearrangem ents, there are only 3 fusions and 2 fissions. Notice that this corresponds to a ratio ^rearrangem ents/ ^m arkers = 48/114 0.42 and a proportion of reversals/ translocations over all rearrangements of p = 43/48 ~ 0.9. 59 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Comparing these ratios to the simulations in Figure 5.1b, we see that M G R -M C typically produces good results in th at range. The resulting ancestral gene order generated b y MGR-MC is shown in Figure 5.2 and 5.3. Although most of the elements of the ancestral organization in Fig ure 5.2 are consistent with the existing biological conjectures, the organization of ancestral chromosomes 4 and 17 is surprising and even counterintuitive. Accord ing to our scenario the chromosomes 4 and 17 in the ancestor were combined into the chromosome 5 in human and the chromosome A1 in cat. We do not argue that it is a correct reconstruction of the ancestral chromosomes 4 and 17 (more markers are needed to support this conjecture) but rem ark instead th at MGR-MC provided us with solid combinatorial reasons why such scenario makes sense. Such reasons are not straightforward and hard to explain without the multichromo- somal genome rearrangement software th at never was available to evolutionary biologists in the past. Therefore, the non-trivial combinatorial arguments used by M G R -M C in the construction of Figure 5.2 may escape the attention of biolo gists that studied this problem in the past. The detailed biological analysis of our human-mouse-cat ancestral reconstruction will be discussed in another paper (joint project with Bill M urphy). 6 0 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. A N C E S T O R H U M A N C A T a t i e / M O U S E i 2 ' 1 ' 7 i ^ r A2 1 I 2 , f l i n g B M i i i i i A3 M M i f i t 3 W H , 5 « B . f l j As 4 m f i i t i f t t B2 MSMMMSM s | i B 3 « 6 i A f l H , M As , t i 7 ?»»40H 4 2 4 3 01 i t s t M e i , « l o S t f . • 1 5 1 7 I8 1 9 5 0 8 S * / / / £ _ j° t U M m i 1 1 E y E r A ? d i m m m m f k ( f i t i t S J . c l i p t t a " f i t 1 5 s p , « l . i t 2 8 5 5 2 6 13 C S i i s - f 1 3 , A? A ,* “ S S w 2 1 2 2 9 B 9| 6 3 & 1 - 6 6 - A 5 is i^ M m l u m E 1 : f s t i i 1 5 i p T i H 2 P J f , f l i t* i6 i K f i i t f > i 20 WBSM t t i * i i 17 i 3 | 21 H / i t f / t f i t s i t -/iff, mSmmi X f f i T i i , V T f f f A 7:"«• / » m » y a m f f f j f / f / f l f t t t s t j j Figure 5.2: Ancestral median for human, mouse, and cat genomes found by M G R -M C . We used the gene order of 114 markers spread over the chromosomes in all three species. The numbers above the chromosomes correspond to these 114 markers and the numbering is such that the human genome corresponds to the identity permutation broken in 20 pieces. The names below the chromosomes correspond to the name of the markers. We attribute a color to each human chromosome. The color of any marker (in any genome) indicates on which human chromosome is the homolog of this marker. Each marker segment is traversed by a diagonal line. See Figure 5.3 for an explanation and a zoomed in version of chromosome X. 61 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. HUMAN CAT 92 93 94 95 96 97 98 99 100 10! 102 i 03 104 105 106 107 108 109 9 f ’ & £ ^ 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 -109-108 MOUSE 92 93 -105-104-103-102-101 94 95 108 109 -107-106-100 -99 -98 96 97 & 92 93 94 95 % 97 9B 99 100 101 102 103 104 105 106 107 -109-108 ANCESTOR / i f $ / # / ^ / / / ■ £ # ' <? e 4? -i Figure 5 .3 : Ancestral chromosome X for human, mouse, and cat genomes found by MGR-MC in Figure 5 .2 . Each marker segm ent is traversed by a diagonal line. These diagonal lines are such that the hum an chromosomes are traversed from top left to bottom right and are design to provide visual help to identify where rearrangements occurred. For this chromosome, and this data set, the gene order of the ancestor coincides with the cat gene order and only differs by one segment consisting of genes 108 and 109 (break in the diagonal line) from the human gene order. The mouse X chromosome is broken into 7 segments as compared to the ancestor (shown by 7 broken segments of the diagonal line). 62 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. C hapter 6 Variants and A pplications 6.1 Towards th e recovery o f m am m alian ances tral gene orders The original question that inspired much of the work presented in this thesis was the following: can we use gene order to reconstruct the phylogenetic tree of mammals which spans over 100- to 150-million years (O ’Brien et ah, 1999 [0 +99])? Can we also recover the internal nodes of the tree th at corresponds to ancestral species (see Figure 6.1)? M G R-M C, the algorithm described in Chapter 5, corresponds to a first step in attem pting to answer these questions. Now, as more and more data becomes available with comparative mapping and sequencing, we can start exploring the information provided by gene order. For instance, it becomes interesting to verify if our m ethod, based on rearrangem ents, manages to recover some of the ancestral chromosomes, or chromosomal associations, which have been suggested by other studies such as chromosome painting. 63 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Human Chimpanzee Gorilla Cat • • • Sheep Cow M ouse Rat Figure 6.1: Hypothetical phylogenetic tree of mammals. 6.1.1 H u m an -cat-m ou se com m on an cestor revisited In Section 5.2.2, we described the results obtained by using M G R -M C on a data set of 193 markers common to human, cat and mouse. The newest data set, also provided by Bill Murphy at the National Cancer Institute, now contains 470 markers common to the same species. This allows us to refine the rearrangement scenario th at we initially proposed in Bourque and Pevzner, 2002 [BP02]. The more markers are available for the gene order analysis, the more the data is representative and so the more reliable the ancestor recovered becomes. In order to analyze with M G R -M C this data set of 470 common markers, we need to infer, once again, the signed perm utations from the unsigned perm utations obtained from comparative mapping. For that purpose, we have refined the method described in Section 5.2.2. The simplest approach to infer sign from 64 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. comparative maps is to look for -strips common to all the genomes of interest. A strip is a sequence of unsigned markers aq a 2 ... aq- with k > 2. It is common to all genomes (or conserved) if either the sequence o q a q ... or the reverse sequence a-kCtk- 1 • • ■ a i can be found in all genomes. Maximizing the size of these conserved strips returns a unique decomposition of the genomes. Finally, we assign a “+ ” orientation in the first genome to all the strips and a “+ ” or a ” in the other genomes depending on whether the strip is found in the same or in the reverse order. Intertwined between these strips are singletons, unsigned markers which are not part of any conserved strip. It is impossible to assign an orientation to these markers with a high level of confidence and so they are not included in the signed perm utation. We will refer to this m ethod as the strip-algorithm. The strip-algorithm provides us with a signed perm utation which slightly underestimates the actual complexity of the unsigned perm utation since we remove singletons. The m ethod is also very sensitive to noise and to local rear rangements. Recently, we have been experimenting with a new technique devel oped in collaboration with Glenn Tesler th at is more flexible in what can consti tute a strip. Instead of strips, which are sequences of markers, we will consider clusters which are sets of “close” markers. The closeness criteria is parameterized 65 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. by a threshold that determines which markers will be clustered together. For each pair of unsigned markers (i, j) where 1 < i , j < n, we compute the score m s ( i j ) = l— l where al(i) is the indice of marker i in genome I. W hen s(i,j ) < i , for a given threshold t, an edge is drawn between markers i and j in the cluster graph, C, which has one vertex for each unsigned markers. Each component of C corre sponds to a cluster of genes for which we will assign a unique number between 1 and n'. These clusters will constitute the common alphabet on which we will compare the genomes by running an algorithm such as M G R-M C. The signed per m utation associated with each genome is constructed by computing the average position of the genes in each cluster. The orientation of each cluster is set to in the first genome and to “+ ” or to ” in all other genomes depending on what minimizes the unsigned distance between the cluster in the current genome and the one in the first genome. This technique does not allow us to determine the orientation of clusters with a single gene, and so, just as before, singletons will need to be dropped from the signed perm utation. We will refer to this method as the cluster-algorithm. The cluster-algorithm is a generalization of the strip-algorithm. For instance, with 3 genomes, running the clustering-algorithm with threshold 4 will return 66 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. the same result as the strip-algorithm. Of course, for higher thresholds, the clustering-algorithm tends to eliminate a lot of the noise and even some local rearrangements initially implied by the data. Consequently, the rearrange m ent distances between the new signed perm utations will underestim ate the dis tances between the corresponding unsigned perm utations. At high threshold, the cluster-algorithm allows us to construct signed perm utations which focuses on the macro-rearrangement that affected the genomes. Running the cluster-algorithm for various thresholds on the newest human- cat-mouse data set with 470 unsigned markers produced the results that are shown in Table 6.1. threshold 4 5 6 8 12 35 (1) nb of markers used 276 345 379 409 424 443 (2) ~ % of markers used 59 73 81 87 90 94 (3) nb of clusters 112 114 106 94 85 74 (4) d(A,Human) 29 27 29 24 24 25 (5) d(A,Cat) 28 24 21 22 16 13 (6) d(A,Mouse) 66 80 72 64 65 52 (7) total distance 123 131 122 110 105 90 (8) deficit 99 103 94 91 77 64 Table 6.1: Results after applying the clustering-algorithm for various thresholds and M G R -M C to the human-cat-mouse data set with 470 unsigned markers. (l)/(2 ) corresponds to the num ber/ percentage of markers which are not dropped as sin gleton clusters. (3) is the num ber of clusters that are not singleton clusters. “A” is the ancestral gene order recovered by M G R-M C. Typically, as we increase the threshold, the number of markers we use (the number of markers which are not dropped as singletons) increases, the number of 67 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. clusters decreases and so does the overall distance. The initial idea behind this clustering algorithm was to relax the definition of strip in order to account for potential mapping errors. The fact that there are more markers and more clusters for threshold 5 than 4 is probably an indication th at the clustering technique works in avoiding some of the local difficulties. Threshold 6 also looks very good. Not all the results are shown in Table 6.1. For instance, running the cluster- algorithm for thresholds higher than 35 no longer changed the signed perm uta tions recovered. At that level, it could be interesting to see if sorting these large clusters can recover the actual macro-rearrangements th at affected the genomes. In Figure 6.2 and 6.3, we show the ancestral gene order recovered by MGR-MC on the signed perm utations obtained from running the cluster-algorithm at threshold 4 and 6. Threshold 4 is interesting because it corresponds to the strip-algorithm which was used to generate Figure 5.2 and so those two figures can be compared. We note that the ancestor in Figure 6.2 is more fragmented, it has more colored blocks, than the one in Figure 5.2. This is due to the fact that in this new data set, the mouse genome itself is much more fragmented: it has 36 colored blocks in Figure 5.2 versus 65 in Figure 6.2. These colored blocks correspond to syntenies with respect to human chromosomes. The newest data set, which is more complete, contains more of the subtleties of the mouse genome which is known to be extremely shuffled compared to the hum an or the cat genome. The ancestor suggested in Figure 6.2 contains a few surprising chromosomes such 68 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. as chromosome 1 and chromosomes 9, 12 and 15. Chromosome 1 contains 5 colored blocks from 3 different human chromosomes. The chromosomes 9, 12 and 1 -5 all combine markers from human chromosomes 1 and 10. These surprising associations might be an artifact of local errors or rearrangements but they also might be the result of the heavy influence of the very shuffled mouse genome. For threshold 6 in Figure 6.3, there are 69 colored blocks in mouse, even more than for threshold 4, but the ancestral gene order also seems much more consistent with the expected chromosomal organization of the mammalian ancestor. 6.1.2 H u m an -cat-cow com m on an cestor One of the difficulties, in the previous section, of trying to recover the ancestral gene order of human-cat-mouse is that, as we have seen in Figures 5.2 and 6.2, the mouse genome is extremely shuffled compared to the hum an and the cat genomes. Sankoff et ah, 1996 [SSK96] were the first to point th at a “triangulation effect in the median problem seems to pin down the solution” . Unfortunately, this nice “triangulation effect” which leads to unique solutions, or almost unique solutions, tends to fade out when one of three genomes is extremely distant from the other two. In contrast, the human-cat-cow data set is much more balanced with none of the three genomes coming out as drastically different from the other two. The data, provided once again by Bill Murphy, consists of 311 unsigned markers 69 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. positioned in all 3 genomes. A difference between this data set and the human- cat-mouse data set is that it has been curated of “obvious” local mapping error by a biologist. Results after running the cluster-algorithm for various thresholds are shown in Table 6.2. This table stops at threshold 20 since raising it more did not modify the resulting permutations. threshold 4 5 6 8 20 (1) nb of markers used 248 262 276 286 298 (2) ~ % of markers used 80 84 89 92 96 (3) nb of clusters 81 74 70 60 44 (4) d(A,Human) 14 14 16 13 8 (5) d(A,Cat) 22 25 21 20 10 (6) d(A,Cow) 34 29 27 20 16 (7) total distance 70 68 64 53 34 (8) deficit 59 58 55 45 29 Table 6.2: Results after applying clustering-algorithm for various thresholds and M G R -M C to the human-cat-cow data set with 311 unsigned markers. (l)/(2 ) corre sponds to the num ber/percentage of markers which are not dropped as singleton clusters. (3) is the number of clusters th at are not singleton clusters. “A” is the ancestral gene order recovered. Note that in this case, the percentage of used markers is much higher even for low thresholds. This is a consequence of the genomes being closer but also an indication th at the local mapping error correcting was effective. In Figure 6.4, we show the ancestral gene order associated with threshold 4, the threshold for which the reconstructed perm utations have the most clusters. Having more clusters means that the signed perm utation captures more of the complexity of the unsigned perm utations. All chromosomal associations of the ancestor in 6.4, 70 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. except for chromosome 1, are found either in human or in cat. We also show in Figure 6.5, the ancestral gene order associated with threshold 6. Note that for threshold 4, it was not possible to identify a cluster (actually, a strip) on chromosome 27 of the cow genome. This is not the case with threshold 6 even though it has fewer clusters. 6.2 R econstructing ancestral gene orders for a fixed phylogeny Even though the algorithms described in Chapter 4 and 5 were designed to recover phylogenetic trees with unknown topology (large maximum parsimony problem) they can be adapted to work on the same problems when the topology is known (small maximum parsimony problem). This is can be especially useful when the topology of a tree is suggested by other methods and we wish, for instance, to compare the score of th at tree to the score of the tree recovered by the solving the large maximum parsimony problem. Since we have a m ethod to recover the median of three genomes th at works well the closer these genomes are to each other, it suggests a trivial heuristic to solve the problem. Find the triplet (Gh, ( ? 2 , G 3 ) with the smallest total pairwise distance such that two of the genomes ( G j , G 2 ) are leaves of the same node in the given topology. Compute the median, M , replace G\ and G 2 by M and iterate. 71 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. We did not conduct extensive tests of this heuristic yet, but it is reasonable to assume that for well behave trees it probably performs satisfactorily. 72 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. ANCESTOR HUMAN CAT MOUSE Figure 6.2: Ancestral gene order recovered by MGR-MC on the signed perm utations obtained from running the cluster-algorithm at threshold 4 on the human-cat- mouse 470 markers data set. Each chromosome segment unit represents a cluster and is not drawned proportionally to the number of markers it contains. Color coding and diagonal lines are used as in Figures 5.2 and 5.3. 73 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. ANCESTOR HUMAN CAT MOUSE 9 10 14 15 16 17 - 18 1 9 i i * 20 2 1 i i m 17 18 19 20 2 1 H 22 g j El E2 E3 FI F2 x Y u 12 13 14 15 ■ M l Figure 6.3: Ancestral gene order recovered by M G R-M C on the signed perm utations obtained from running the cluster-algorithm at threshold 6 on the human-cat- mouse 470 markers data set. Each chromosome segment unit represents a cluster and is not drawned proportionally to the num ber of markers it contains. Color coding and diagonal lines are used as in Figures 5.2 and 5.3. 74 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. ANCESTOR 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 2 1 22 23 H U M A N li— 1 CAT COW • > JO -12 -11 75 76 77 3 1 2 3 4 5 6 7 S 9 10 11 12 13 57 58 14 59 60 15 p i ii f e j 61 62 16 [ 63 64 17 n 1 9 73 74 2 0 75 76 7. 7S J2 E Z 3 75 x . 51 52 ii 54 55 » A! A2 A3 BI B 2 B3 B4 C l C2 D1 D2 D3 D4 El E2 E3 FI F2 X S -26 21 7 -17 16 IS 31 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 [ 3 1 -3 -56 H 15 -17 -16 1 9 6 9 - 6 6 -6 5 -6 8 -67 -' 20 * -25 24 21 p y s a a -62 W 22 lililM liill 23 24 25 26 28 29 3 1 30 -29 Figure 6 .4 : Ancestral gene order recovered by MGR-MC on the signed perm utations obtained from running the cluster-algorithm at threshold 4 on the human-cat-cow 311 markers data set. Each chromosome segment unit represents a cluster and is not drawned proportionally to the num ber of markers it contains. Color coding and diagonal lines are used as in Figures 5.2 and 5.3. 75 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. ANCESTOR HUMAN CAT COW 2 9 ■ Figure 6.5: Ancestral gene order recovered by M G R -M C on the signed perm utations obtained from running the cluster-algorithm at threshold 6 on the human-cat-cow 311 markers data set. Each chromosome segment unit represents a cluster and is not drawned proportionally to the num ber of markers it contains. Color coding and diagonal lines are used as in Figures 5.2 and 5.3. 76 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. C hapter 7 M G R as a W eb Tool MGR implements an algorithm which, given a set of genomes, seeks a tree such th at the sum of the rearrangements is minimized over all the edges of the tree. It can be used for phylogeny inference and also for inference of ancestral gene orders. A web version has been made available in 2001 at h ttp ://w w w -cse.u csd .ed u /g ro u p s/b io in fo rm atics/M G R . A computing time limit is currently set on the web tool to prevent an overload of the server but more computationally intensive problems can be resolved by contacting directly the authors. The implementation w ritten in C makes extensive use of G R IM M , the pairwise distance engine and web server developed by Glenn Tesler [Tes02b]. The goal behind the web server was to make the algorithm easily accessible to the scientific community. For instance, biologists with gene order data are now able to visualize and experiment directly with the results from M G R . It is even possible to view actual rearrangement scenarios by clicking directly on the edges of the phylogenetic trees produced since each of these edges is linked to the web version of G R IM M . This can be of great help to users who wish to verify 77 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. or confirm the solution obtained. To our knowledge, the web server represents the first phylogeny reconstructing program, based on gene order, available in a “friendly” user interface (without having to compile code). 7.1 Input The algorithm and the web tool are versatile in the sense that they work for 3 different types of genomes: - Uni chromosomal circular - Unichromosomal linear (directed) - Multi chromosomal (or linear undirected) By directed we mean that we can distinguish between the start and the end of the chromosome. An example of the representation used for multichromosomal genomes can be found in Fig. 7.1. The representation of uni chromosomal genomes is similar but the name of the genome is followed by the gene order on a single line and the “$” at the end is optional. For unichromosomal genomes, the rearrangem ent events that we consider are reversals. For multichromosomal genomes, the rearrangement events that we consider are reversals, translocations, fissions, and fusions. M G R works on a list of at least 3 genomes. It is possible to input the genomes from 3 different ways: 78 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. >HUMAN 1 2 3 4 5 6 7 8 $ 9 10 11 12 13 14 15 16 17 18 19 20 $ 21 22 23 24 $ 25 26 27 28 29 30 31 34 35 $ 36 37 $ 38 39 40 41 42 43 $ 44 45 46 47 48 $ 49 50 51 52 53 54 55 57 58 59 60 61 62 63 69 70 $ 71 72 73 74 75 76 77 80 81 $ 82 83 $ 84 85 86 87 $ 88 $ 89 90 91 $ 92 93 94 95 96 97 98 110 111 112 113 114 Figure 7.1: Human genome as an example of the multichromosomal input format used by M G R . - Specifying the number of genomes and then filling the form one by one. - Copying and pasting a list of genomes from a file using the format as in Fig. 7.1. This is probably the most practical m ethod for users dealing with large instances of the problem. - Select one of the precomputed examples. The input form is shown in Fig. 7.2. A few options are currently imple mented. First, it is possible to request the pairwise distance m atrix instead of the phylogeny reconstruction, the latter being much more computationally inten sive. Next, it is also possible to specify whether the tree produced (see Section 79 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 7.2) will have fixed size edges or edges where the length is proportional to the number of rearrangements. In the latter case, it is also possible to specify the desired total width of the tree. The “unsigned” option, for genomes in which the orientation of the markers is still unknown, is not yet implemented for multiple genomes as the algorithm to obtain the pairwise distance is too computationally intensive. % genom es I ] j U pdate~| i P a ir w is e ggno.T.e foruij Genome 1: Genome 2: Genome 3: : ' T" ............................................................................................L _ Chromosomes: O circular O linear (oriented) ® multichromosomal or unoriented Signs: ®signed ©unsigned ( r u n | [u n d o j | c l e a r fo rm ) Or, j c h o o s e s a m p le d a t a ; £ | Multiple genome options Action: ® Distance matrix only O Phylogenetic tree (MGR) Tree size: ® Edge length proportional to distance Total width of tree [siT Q Not proportional Figure 7.2: M G R web tool sample input form. 7.2 O utput The report produced by the web server is separated in 3 sections. The first section contains the reconstructed phylogeny and its score. The phylogenies are produced by M G R in Newick and ASCII formats described below. The second section contains the pairwise distance m atrix of the leaf nodes (i.e. the genomes 80 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. used as input). Each entry in the m atrix is clickable and is linked to the G R IM M web server showing an optimal pairwise rearrangement scenario. This m atrix can provide useful information when studying a phylogeny. For instance, it can be use to determine whether the recovered tree is additive. Finally, the third section contains a list of all the gene order sequences of the genomes used as input and of the ancestors recovered. The ancestral gene orders being an intrinsic part of the results. In the report, the ancestral nodes are labelled by the letter “A” followed by a number (e.g. A4). Consider the following simple example with 4 unichromosomal linear genomes: > Genome 1 >Genome2 > GenomeS >Genome4 1 2 3 4 5 1 -2 3 4 5 1 2 3 -5 -4 1 -3 -5 -4 2 Feeding this example in the web server would produce the result shown in Fig. 7.3. 7.2.1 N ew ick Form at The Newick Format uses nested parenthesis for representing trees. It allows the labelling of the leaf and the internal nodes. In contrast to the Phylip Format, branch lengths, which correspond to the number of rearrangem ents, can also be incorporated, preceded by a colon. For instance, given the example above, M G R would produce the following Newick representation: (Genome4:2, Genome2:1, (Genome 1:0, GenomeS :1)A4:0)A5; 81 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. One of the purpose for including this representation in the report is that it can easily be transposed to other more elaborate phylogeny drawing programs. 7.2.2 A SC II Form at The report also includes an ASCII representation of the phylogeny. For instance, the same example would produce the following tree: — 2----------------- Genome4 1 Genome2 + -A 5 | +G enom el +-A 4 -|------- 1--------GenomeS The main purpose of this representation is to provide the user with direct visualization of the solution. The tree diagram is generated by a modified version of the RETREE program available in the PHYLIP package by Joe Felsenstein1. The number of rearrangements th at occurred on each edge is shown. When no number is shown on an edge it means th at no rearrangement occurred on th at edge. In the tree produced by the web tool, it is possible to view actual rearrangement scenarios since each edge is clickable and is linked to the G R IM M web server. It is also possible to click on any genome to view its specific gene order in last section of the report. 1 http://evolution.genetics.washington.edu/phylip.html 82 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 7.3 C am panulaceae cp D N A output The Campanulaceae cpDNA data set is one of the most challenging genome rearrangement problems studied yet. It was analyzed by Cosner et al, 2000 [CJM+00a] using GRAPPA. It consists of 13 cpDNAs with 105 signed markers. The report from the M G R web tool is shown in Fig. 7.4. The “fixed size edges” option was selected for this report. 83 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. MGR Report Tree recovered (Newick Standard) (Genome4:2,Genome2:1 ,(Genomei:0,Genome3:1 )A4:0)A5; Score: 4 rearrangements ASCII representation of the unrooted tree recovered + •-................... - - - 2 - G enom e4 j + - - - - - - - - - - - - - - - - - - - - x -----------------------------------G en om e2 + -A 5 • + - G e n o m e l + -A 4 + --------------------------------1 -----------------------------------G en om e3 Pairwise distance matrix of the input genomes (leaf nodes) r ................. [Genomei |Genome2i|Genome3[|Genome4j [G enom ei 1 o « 1 I I 1 I I 2 I |G enom e2 | 1 ll o I I 2 I | 3 j [GenomeS | 1 I I 2 I I 0 | 3 ! [Genome4 I 2 I I 3 1 1 3 || 0 i Back to top of report or enter new data or change options Gene order sequence of the genomes used as input and of the ancestors recovered Genomet 1 2345 Genome2 1 -2345 Genome3 12 3-5-4 Genome4 1 -3 -5 -4 2 A4 1 2345 A5 1 2345 Back to top of report or enter new data or change options GRIMM 1.04 by Glenn Tesler. University of California, San Diego. Copyright © 2001-2002, The University of California. MGR 1.0 by Guillaume Bourque. University of Southern California. Copyright © 2001, University of Southern California. Figure 7.3: M G R simple sample output. Some of the instructions were removed from the web page to improve the readability of other elements. The tree is displayed with edges proportional to the num ber of rearrangem ents. 84 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Tree recovered (Newick Standard) (Platycodon:11,(Codonopsis:5,Cyananthus:5)A22:3,(((Asyneuma:5,(Legousia:4,Triodanus:1) A19:2)A20:4,((Merciera:4,Wahlenbergia:2)A16:4,((Adenophora:3,Campanula:0)A14:1, (Symphyandra:1,Trachelium:0)A13:0)A15:0)A17:1)A18:7,Tobacco:2)A21:0)A23; S c o re : 65 rearrangements ASCII representation of the unrooted tree recovered - 1 1 - -A22 - P l a t y c o d o n - C o d o n o p s i s -A23 -A20 | + +-2-A19 -Alt ---- -A17 -A14 5---Cy anan thu s 5-------Asyneuma L-egcusia Triodanus 4--- Merciera 2 Wahlenbergia 3--- Adenophora - -..Campanula 1 Symphyandra Trachelium Tobacco 4. - - 1 -- Figure 7.4: M G R sample output using the Campanulaceae cpDNA data set. The tree is displayed with fixed size edges. 85 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. C hapter 8 Sum m ary The first two chapters of this thesis were devoted to describing problems and methods associated with phylogenetic tree reconstruction based on gene order. In Chapter 3, we described the various types of gene order data that we consider and we also described some assumptions about how these gene orders evolve over tim e, more precisely, what kinds of rearrangements can occur. In Chapter 4, we first developed a method to find the most parsimonious rearrangement scenario for 3 unichromosomal genomes and then adapted the m ethod for more genomes, which corresponds to the Multiple Genome Rearrangement Problem. In Chapter 5, we adapted the m ethod presented in Chapter 4 to work with multichromosomal genomes and the rearrangements associated with these kinds of genomes. In Chapter 6, we presented an approach to integrate comparative mapping data into rearrangement studies. We applied this approach to two new data sets of common markers in human-cat-mouse and in human-cat-cow. In the same chapter, we also described the basis of a heuristic to recover the most parsimonious rearrangement scenario of a phylogeny with a fixed topology. Finally, in Chapter 7, we showed some examples extracted from the web version of the program. 86 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. As stated in the introduction, the goal of this thesis was to explore the specifics of additive trees. The focus was not so much on the speed of the implementation but more on the design of a versatile method making use of some simple prop erties found in most real life rearrangement studies. At the core of the Multiple Genome Rearrangement Problem is the most parsimonious assumption that the actual rearrangement scenario which generated the current genomes is the one with the fewest rearrangements. W hen genomes are too shuffled and trees are far from being additive, finding the scenario with the fewest rearrangements proba bly becomes less significant since the most parsimonious assumption itself loses some of its strength. The more markers become available, the less “shuffled” the genomes will appear since the number of rearrangements that actually occurred is unknown but fixed. Unfortunately, because the num ber of potential markers is limited by the length of the actual biological sequences, this does not guarantee that the most parsimonious assumption can be valid for all data sets. Once again, because the focus of this work was not on the speed of the imple mentation, many algorithm engineering techniques could now be used to speed up the program. One of the obvious bottlenecks of the algorithm is the recov ery of good rearrangements, the current m ethod m ust look at many potential rearrangements and compute various rearrangement distances to see if they have been modified. Recent work by Siepel, 2002 [Sie02] is extremely promising and suggests a modification th at might represent a significant speed improvement. 87 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. They developed a m ethod based on the cycle decomposition, presented briefly in C hapter 3, that finds all the sorting reversals of a unichromosomal genome. Such a m ethod could allow for a much faster identification of good rearrangements at least in the unichromosomal case. Another example of a potential speed improve ment would be the use of a lower bound when looking for the split edge that would allow a quicker pruning of potential medians. The iterative m ethod describe in this thesis is very flexible in how it selects rearrangem ents. It is easy to include, in the selection process, knowledge that surpasses the most parsimonious assumption. For instance, in Chapter 5, we included the extra assumption that fusions and fissions are less frequent than reversals and translocations. In the same way, it is conceivable that we could prioritize, for instance, short reversals over all other types of rearrangem ents. If it were possible to efficiently include a similar prioritization in the pairwise distance calculation, the iterative m ethod we presented would be easily adaptable to fully integrate the prioritization for multiple genomes. A few words now on recent work on the M ultiple Genome Rearrangement Problem by other groups. We are only aware of work on the special case of the Median Problem for unichromosomal genomes. After the two branch-and- bound algorithms by Caprara, 2001 [CapOl] and Siepel et al., 2001 [ACS01], a new branch-and-bound algorithm was described in work by Siepel as part of his thesis. This most recent algorithm follows a path very similar the one we 88 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. proposed in [BP02] since it first restricts itself to good reversals , which are stated differently, instead of blindly looking at all potential reversals as in [ACS01]. Then, if necessary, it relaxes this restriction just as we do in our algorithm. Unfortunately, we did not have tim e to do a full-scale comparison of all these methods for recovering the reversal median and it will be interesting to do so in the near future. It will also be interesting to see if these methods will be adaptable, like M G R , to more general instances of the problem. A final word on the work described in Chapter 6. In that chapter, we only showed partial results from using the clustering-algorithm instead of the strip-algorithm to obtain signed perm utations from comparative mapping data. Instead of looking at specific thresholds, another approach could be to look for a consensus ancestor: look for patterns (or chromosome associations) which are preserved for most thresholds. The clustering m ethod could also be used for a wide range of applications such as finding a rearrangem ent scenario on two levels by first identifying clusters which have been rearranged locally and then finding more global rearrangements which affected the clusters themselves. This could simply be accomplished by raising the threshold used in the clustering-algorithm and could be useful both for signed and unsigned data. Such a method would need to be assessed against other methods for estim ating and recovering conserved segments between genomes. 89 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. R eference List [ACSOlj [BBOO] [BBS97] [BerOl] [BH96] [BKS99] [BP9-5] B. M. E. Moret A. C. Siepel. Finding an optimal inversion median: Experim ental results. In Olivier Gascuel and Bernard M. E. M oret, editors, Algorithms in Bioinformatics, First International Work shop, W ABI 2001, volume 2149 of Lecture Notes in Computer Sci ence, pages 189-203, Aarhus, Denmark, 2001. Springer. J.L. Boore and W.M. Brown. Mitochondrial genomes of galathea- linum, helobdella, and platynereis: Sequence and gene arrangement comparisons show th at pogonophora is not a phylum and annelida and arthropoda are not sister taxa. Molecular Biology and Evolu tion, 7:87-106, 2000. M. Blanchette, G. Bourque, and D. Sankoff. Breakpoint phyloge- nies. In S. Miyano and T. Takagi, editors, Genome Informatics Workshop (G IW 1997), pages 25-34, Tokyo, Japan, 1997. Univ. Academy Press. A. Bergeron. A very elem entary presentation of the Hannenhalli- Pevzner theory. In Proceedings 12th Annual Symposium on Combi natorial Pattern Matching, volume 2089 of Lecture Notes in Com puter Science, pages 106-117, Jerusalem, Israel, 2001. P. Berman and S. Hannenh.al.li. Fast sorting by reversal. In Com binatorial Pattern Matching. 7th Annual Symposium, volume 1075 of Lecture Notes in Computer Science, pages 168-185, New York, 1996. Springer. M. Blanchette, T. Kunisawa, and D. Sankoff. Gene order break point evidence in animal mitochondrial phylogeny. Journal of Molecular Evolution, 49:193-203, 1999. V. Bafna and P.A. Pevzner. Sorting by reversals: genome rear rangements in plant organelles and evolutionary history of X chro mosome. Molecular Biology and Evolution, 12:239-246, 1995. 90 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. [BP96] [BP02] [Bun71] [Cap99] [CapOl] [CJM+OOa] [CJM+OOb] [DicOO] [DS-38] [Fit 77] V. Bafna and P.A. Pevzner. Genome rearrangements and sorting by reversal. SIA M Journal on Computing, 25:272-289, 1996. G. Bourque and P.A. Pevzner. Genome-scale evolution: recon structing gene orders in the ancestral species. Genome Res., 12:26- 36, 2002. P. Buneman. The recovery of trees from measures of dissimilarity. In Mathematics in the Archaeological and Historical Sciences, pages 387-395. Edinburgh University Press, 1971. A. Caprara. Formulations and complexity of multiple sorting by reversals. In S. Istrail, P.A. Pevzner, and M.S. W aterman, editors, Proceedings of the Third Annual International Conference on Com putational Molecular Biology (RECOMB-99), pages 84-93, Lyon, France, April 1999. ACM Press. A. Caprara. On the practical solution of the reversal median prob lem. In Olivier Gascuel and Bernard M. E. M oret, editors, Algo rithms in Bioinformatics, First International Workshop, W ABI 2001, volume 2149 of Lecture Notes in Computer Science, pages 238-251, Aarhus, Denmark, 2001. Springer. M.E. Cosner, R.K. Jansen, B.M.E. Moret, L.A. Raubeson, L.S. Wang, T. War now, and S. W yman. A new fast heuristic for com puting the breakpoint phylogeny and experimental phylogenetic analyses of real and synthetic data. In Proceedings o f the 8 th Inter national Conference on Intelligent Systems for Molecular Biology (ISMB-2000), pages 104-115, San Diego, 2000. M.E. Cosner, R.K. Jansen, B.M.E. Moret, L.A. Raubeson, L.S. Wang, T. War now, and S. Wyman. An empirical comparison of phylogenetic methods on chloroplast gene order data in campan- ulaceae. In Comparitive Genomics (DCAF-2000), pages 99-121, Montreal, 2000. Kluwer Acad. Pubs. J. Dicks. Chromtree: Maximum likelihood estim ation of chromo somal phylogenies. In Comparitive Genomics (DCAF-2000), pages 333-342, Montreal, 2000. Kluwer Acad. Pubs. T. Dobzhansky and A. Sturtevant. Inversions in the chromosomes of Drosophila pseudoobscura. Genetics, 23:28-64, 1938. W.M. Fitch. On the problem of discovering the most parsimonious tree. Amer. Natur., 111:223-257, 1977. 91 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. [GLOO] D. Graur and W.-H. Li. Fundamentals of Molecular Evolution. Sunderland, MA: Sinauer Associates, Inc., 2000. [HCKP95] [HP95a] [HP95b] [HP96] [HP99] [KS94] [KST97] [L+01] [LPD00] S. Hannenhalli, C. Chappey, E. Koonin, and P. Pevzner. Genome sequence comparison and scenarios for gene rearrangem ents: A test case. Genomics, 30:299-311, 1995. S. Hannenhalli and P. Pevzner. Transforming men into mice: poly nomial algorithm for genomic distance problem. In Proceedings of the 36th IEEE Symposium on Foundations of Computer Science, pages 581-592, 1995. S. Hannenhalli and P.A. Pevzner. Transforming cabbage into turnip (polynomial algorithm for sorting signed perm utations by rever sals). In Proceedings of the 27th Annual AC M -SIAM Symposium on the Theory of Computing, pages 178-189, 1995. S. Hannenhalli and P. Pevzner. To cut... or not to cut (applica tions of comparative physical maps in molecular evolution). In Proceedings o f the 7th AC M -SIAM Symposium on Discrete Algo rithms (SODA), pages 304-313, 1996. S. Hannenhalli and P.A. Pevzner. Transforming cabbage into turnip: polynomial algorithm for sorting signed perm utations by reversals. Journal o f ACM, 46:1-27, 1999. J. Kececioglu and D. Sankoff. Efficient bounds for oriented chro mosome inversion distance. In M. Crochemore and D. Gusfield, editors, Combinatorial Pattern Matching. 5th Annual Symposium, volume 807 of Lecture Notes in Computer Science, pages 307-325, New York, 1994. Springer. H. Kaplan, R. Shamir, and R.E. Tar j an. Faster and simpler algo rithm for sorting signed perm utations by reversals. In Proceedings of the 8 th Annual A C M -SIAM Symposium on Discrete Algorithms, pages 344-351, New York, 1997. ACM. E. Lander et al. Initial sequencing and analysis of the human genome. Nature, 409:860-921, 2001. S. Li, D. Pearl, and H. Doss. Phylogenetic tree construction using markov chain monte carlo. J. o f the American Statistical Associa tion, 95:493-508, 2000. 92 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. [LS99] [MNL97] [MSC+OO] [MSOOl] [MWB+01] [MWWW01] [NT 84] [0+99] [OP94] [Pal92] [PevOO] B. Larget and D.L. Simon. Markov chain monte carlo algorithms for the bayesian analysis of phylogenetic trees. Molecular Biology and Evolution, 16(6):750-759, 1999. B. Man, M.A. Newton, and B. Larget. Bayesian phylogenetic infer ence via markov chain monte carlo methods. Molecular Biology and Evolution, 14:717-724, 1997. W .J. Murphy, S. Sun, Z.-Q. Chen, N. Yuhki, D. Hirschmann, M. Menotti-Raymond, and S.J. O ’Brien. A radiation hybrid map of the cat genome: Implications for comparative mapping. Genome Research, 10:691-702, 2000. W .J. Murphy, R. Stanyon, and S.J. O ’Brien. Evolution of m am malian genome organization inferred from comparative gene m ap ping. Genome Biol., 2:1-8, 2001. B.M.E. Moret, S. W yman, D.A. Bader, T. Warnow, and M. Yan. A new implementation and detailed study of breakpoint analysis. In Proceedings of the 6th Pacific Symposium on Biocomputing (PSB 2001), pages 583-594, 2001. B.M.E. Moret, L. Wang, T. Warnow, and S.K. Wyman. New approaches for reconstructing phylogenies from gene order data. In Proceedings o f the 9th International Conference on Intelligent Systems fo r Molecular Biology (ISMB-2001), pages 165-173, 2001. J.H. Nadeau and B.A. Taylor. Lengths of chromosomal segments conserved since divergence of m an and mouse. Proceedings of the National Academy of Sciences USA, 81:814-818, 1984. S.J. O ’Brien et al. The promise of comparative genomics in m am mals. Science, 286:458-481, 1999. R.G. Olmstead and J.D. Palmer. Chloroplast DNA systematics: a review of methods and data analysis. American Journal of Botany, 81:1205-1224, 1994. J.D. Palmer. Chloroplast and mitochondrial genome evolution in land plants. In R. Herrmann, editor, Cell Organelles, pages 99-133. Springer Verlag, 1992. P.A. Pevzner. Computational molecular biology: an algorithmic approach, chapter 10. The M IT Press, 2000. 93 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. [PH88] [RY96] [San92] [San99] [SB97] [Sie02] [SLA+92] [SM97] [SN87] [SSK96] J.D. Palmer and L.A. Herbon. Plant mitochondrial DNA evolves rapidly in structure, but slowly in sequence. Journal of Molecular Evolution, 27:87-97, 1988. B. Rannala and Z. Yang. Probability distribution of molecular evolutionary trees: a new m ethod of phylogenetic inference. J. Molecular Evolution, 43:304-311, 1996. D. Sankoff. Edit distance for genome comparison based on non local operations. In A. Apostolico, M. Crochemore, Z. Galil, and U. Manber, editors, Proceedings o f the 3rd Annual Symposium on Combinatorial Pattern Matching, pages 121-135, Tucson, AZ, 1992. Springer-Verlag, Berlin. D. Sankoff. Genome rearrangement with gene families. Bioinfor matics, 15(11):909-17, November 1999. D. Sankoff and M. Blanchette. The median problem for breakpoints in comparative genomics. In Computing and Combinatorics, Pro- ceeedings of COCOON ‘97, Lecture Notes in Computer Science, pages 251-263, New York, 1997. Springer Verlag. A. Siepel. An algorithm to find all sorting reversals. In G. Myers, S. Hannenhalli, S. Istrail, P.A. Pevzner, and M.S. W aterman, edi tors, Proceedings o f the Sixth Annual International Conference on Computational Molecular Biology (RECOMB-02), pages 281-290, Washington, D.C., April 2002. ACM Press. D. Sankoff, G. Leduc, N. Antoine, B. Paquin, B. Lang, and R. Ced- ergren. Gene order comparisons for phylogenetic inference: Evo lution of the mitochondrial genome. Proceedings of the National Academy o f Sciences USA, 89:6575-6579, 1992. J. Setubal and J. Meidanis. Introduction to Computational Molec ular Biology, chapter 7. PWS Publishing Company, 1997. N. Saitou and M. Nei. The neighbor-joining method: A new m ethod for reconstructing phylogenetic trees. Molecular Biology and Evo lution, 4:406-425, 1987. D. Sankoff, G. Sundaram, and J. Kececioglu. Steiner points in the space of genome rearrangem ents. International Journal on Foun dations o f Computer Science, 7:1-9, 1996. 94 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. [Tes02a] [Tes02b] [V+01] [WBM+99] [WEHM82] [WJM+02] [WW01] G. Tesler. Efficient algorithms for multi chromosomal genome rear rangements. (subm itted), 2002. G. Tesler. GRIMM: genome rearrangements web server. Bioinfor matics, 18(3):492— 493, 2002. J.C. Venter et al. The sequence of the human genome. Science, 291:1304-1352, 2001. T.K. W atanabe, M.T Bihoreau, L.C. McCarthy, S.L. Kiguwa, H. Hishigaki, A. Tsuji, J. Browne J, Y. Yamasaki, A. Mizoguchi- M iyakita, K. Oga K, T. Ono, S. Okuno, N Kanemoto, E. Takahashi, K. Tom ita, H. Hayashi, M. Adachi, and C. Webber. A radiation hybrid m ap of the rat genome containing 5,255 markers. Nature Genetics, 22:27-36, 1999. G.A. W atterson, W .J. Ewens, T.E. Hall, and A. Morgan. The chro mosome inversion problem. Journal of Theoretical Biology, 99:1-7, 1982. L.S. Wang, R.K. Jansen, B.M.E. Moret, L. Raubeson, and T. Warnow. Fast phylogenetic methods for the analysis of genome rearrangement data: an empirical study. In Proceedings of the 7th Pacific Symposium on Biocomputing (PSB 2002), pages 524-535, Hawaii, 2002. L.-S. Wang and T. Warnow. Estim ating true evolutionary distances between genomes. In Proceedings 33rd Symposium on Theory of Computing (ST O C ’Ol), pages 637-646. ACM Press, 2001. 95 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission.
Asset Metadata
Creator
Bourque, Guillaume (author)
Core Title
Algorithms for phylogenetic tree reconstruction based on genome rearrangements
Contributor
Digitized by ProQuest
(provenance)
School
Graduate School
Degree
Doctor of Philosophy
Degree Program
Applied Mathematics
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
biology, genetics,Computer Science,mathematics,OAI-PMH Harvest
Language
English
Advisor
Waterman, Michael (
committee chair
), [illegible] (
committee member
), Pevzner, Pavel (
committee member
), Tavare, Simon (
committee member
)
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c16-274680
Unique identifier
UC11339284
Identifier
3094307.pdf (filename),usctheses-c16-274680 (legacy record id)
Legacy Identifier
3094307.pdf
Dmrecord
274680
Document Type
Dissertation
Rights
Bourque, Guillaume
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA
Tags
biology, genetics
Linked assets
University of Southern California Dissertations and Theses