Long term evolution of gene duplicates in arabidopsis polyploids
(USC Thesis Other)
LONG TERM EVOLUTION OF GENE DUPLICATES IN ARABIDOPSIS POLYPLOIDS

by Peter L Chang

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)

December 2012

Copyright 2012 Peter L Chang

To my parents, Henry and Melissa, and brother, David

Acknowledgements

I want to give a heartfelt thank you to my advisor, Sergey Nuzhdin, who was kind enough to take me on as a student and guide me these past few years. For allowing me to explore a wide range of scientific topics and ideas. For traveling with me to various regions around the globe and introducing me to experts in the field. For countless hours of conversation that have provided lots of guidance, motivation, and intellectual stimulation. I will proudly carry the title of "Nuzhdin Alumni" wherever my future experiences take me.

I would like to thank the members of my doctoral committee for their guidance, including USC Professors Fengzhu Sun, David Conti, Matthew Dean, and Andrew Smith. I thank other members of the USC Molecular and Computational Biology department, including Professors Michael Waterman, Ting Chen, Steven Finkel, and Xuelin Wu.

To my fellow colleagues that I worked with every day. Special thanks to Maren Friesen for joining me on numerous collaborations. Thank you for allowing me to take part in Medicago and Rhizobia and having that branch out into so many other wonderful collaborations. I would like to thank Joseph Dunham and Wendy Vu of the Nuzhdin Lab for their wonderful camaraderie and friendship. I will miss our annual winter skiing and summer camping trips. To my graduate cohorts, Justin Dalton, Christopher Corzett, Sarmad Al-Bassam, Xiting Yan, Yang-Ho Chen, and John Nguyen for their academic and motivational support.
I have enjoyed working with all of you and look forward to watching as we become contributing members and leaders in our scientific fields.

All of my work has been in collaboration with other scientists and researchers here at USC and around the world. I would like to thank Michelle Arbeitman and Kimberly Hughes from Florida State University, Fabrizio Ghiselli and Liliana Milani from the University of Bologna in Italy, Magnus Nordborg and Glenda Willems from the Gregor Mendel Institute in Vienna, and Aaron Tarone from Texas A&M University.

I would also like to acknowledge two friends of Kaimuki High School that I have known for almost 20 years. Fred Lee and Ronald Oyama have been very supportive of me and my family in their roles as coaches, teachers, leaders, and dearest friends. Everything important I really needed to know in high school was passed down to me by these two men. I often think about how the values they taught about working hard, consistency, and perseverance carry over into every corner of my everyday life. To Fred, who reminded me to "Mai huli kua a loa'a ka lei o ka lanakila". To Ronald, who encouraged me to practice hard every day when it didn't matter because it will prepare me for when it really does. For reminding me that winners are not defined by what they do, but rather how they work hard and prepare themselves to ultimately accomplish their goals. Two amazing leaders, without whom I would have never known the significance of hard work, or believed that I could become a scientist.

Earning a doctorate degree requires an unbelievable amount of mental energy and a lifetime of commitment. I thank everyone for helping me along the way and letting me borrow whatever strength I needed. I share this accomplishment with each and every one of you!

Table of Contents

Acknowledgements
List of Tables
List of Figures
Preface
Abstract
Chapter 1: Introduction
    1.1 Introduction to Polyploids
    1.2 Immediate Effects of Polyploidization
    1.3 Long-term Effects of Polyploidization
    1.4 Arabidopsis Polyploids
Chapter 2: Homeolog-specific retention and use in Arabidopsis suecica
    2.1 Abstract
    2.2 Introduction
    2.3 Materials and Methods
    2.4 Results
        2.4.1 Comparison of hybridization between AS and F1AS
        2.4.2 Homeolog-specific genomic retention in AS
        2.4.3 Use of AT and AA homeologs in the AS transcriptome
        2.4.4 Network analysis of homeolog-specific genes
        2.4.5 Variation within AT and AA
    2.5 Discussion
        2.5.1 Evolved AS patterns
        2.5.2 Resolving incompatibilities in allotetraploid networks
    2.6 Conclusions
Chapter 3: Population genomics of pooled Arabidopsis lyrata genome data show gene network evolution patterns consistent with flies, worms and yeast
    3.1 Abstract
    3.2 Introduction
    3.3 Materials and Methods
    3.4 Results
        3.4.1 Test for selection between orthologs
        3.4.2 Network evolution
    3.5 Discussion
    3.6 Conclusions
Chapter 4: Sequencing of Arabidopsis halleri ssp gemmifera reveal regions of duplication undergoing positive selection
    4.1 Introduction
    4.2 Materials and Methods
    4.3 Results
        4.3.1 Mapping of AH reads to the AL reference
        4.3.2 Structural variation
        4.3.3 Depth coverage of the gemmifera and halleri subspecies
        4.3.4 Sequence analysis and variation
    4.4 Discussion
    4.5 Conclusions
Chapter 5: Genomic characterization of three Arabidopsis kamchatica genomes
    5.1 Introduction
    5.2 Materials and Methods
    5.3 Results
        5.3.1 Homeolog-specific genomic retention
        5.3.2 Sequence evolution of homeologs
        5.3.3 Analysis using other parental genomes of AH and AL
        5.3.4 Analysis of chloroplast sequences
        5.3.5 Simulation of Illumina data from AH and AL
    5.4 Discussion
        5.4.1 Characterization of Illumina sequences from AK
        5.4.2 Patterns of homeologous gene loss and gain
        5.4.3 Network evolution of homeologous genomes
        5.4.4 Differential selection of homeologous genomes
    5.5 Conclusions
Chapter 6: Summary and Conclusions
References
Appendix A: List of data sets analyzed
Appendix B: Homeolog-specific retention and use in Arabidopsis suecica
Appendix C: Sequencing of Arabidopsis halleri ssp gemmifera reveal regions of duplication undergoing positive selection
Appendix D: Genomic characterization of three Arabidopsis kamchatica genomes
Appendix E: The ecological and genomic basis of salinity adaptation in Tunisian Medicago truncatula
Appendix F: Genetic variation of transgenerational plasticity on seed transcriptome and offspring early response to salinity
Appendix G: Genomic basis of aging and life-history evolution in Drosophila melanogaster
Appendix H: Somatic sex-specific transcriptome differences in Drosophila revealed by whole transcriptome sequencing
Appendix I: Sex-specific signaling in the blood brain barrier is required for male courtship in Drosophila
Appendix J: Genetic basis of long term courtship suppression of Drosophila males revealed by transcriptome sequencing
Appendix K: A versatile method for cell-specific profiling of translated mRNAs in Drosophila
Appendix L: A de novo transcriptome assembly of Lucilia sericata (Diptera: Calliphoridae) with predicted alternative splices, single nucleotide polymorphisms and transcript expression estimates
Appendix M: De novo assembly of the Manila clam Ruditapes philippinarum transcriptome provides new insights into expression bias, mitochondrial doubly uniparental inheritance and sex determination

List of Tables

1.1 Summary of four hypotheses explaining high retention of gene duplicates
2.1 Regions of putative alterations in AS
2.2 Homeolog-specific retention in AS
2.3 Gene Ontology annotation for homeolog-biased genes in AS
2.4 Co-biased pairs of AS homeologs in gene networks
3.1 Serpentine genes in AL with membrane and cytosol functions
3.2 Ranks of Gene Ontology enrichment in AL serpentine genes
4.1 Summary of Illumina data analysis in AH
4.2 Structural variants in AH
4.3 Regions of increase depth coverage in AH gemmifera
4.4 Number of genes with increase depth coverage in AH
5.1 Homeolog-specific retention in AK
A.1 List of data sets sequenced and analyzed
D.1 Analysis using other parental genomes of AH and AL

List of Figures

1.1 Two types of polyploids
1.2 Homeologs are orthologs that are merged into a new cell
1.3 Leaves and rosettes of Arabidopsis
2.1 Distribution of probe intensities in AS (Chr4)
2.2 Distribution of probe intensities in AS (Chr3)
2.3 Probe intensities before and after normalization
2.4 Histogram distribution of homeolog bias in AS
2.5 Distribution of clusters of biased homeolog transcripts in AS
2.6 Sequenced read alignments to AT and AA orthologs
2.7 Concordance of homeolog-specific expression in AS
2.8 Fraction of AS gene pairs co-biased for different connectivity
2.9 Divergence of homeologs in AS compared to homeolog-specific expression
4.1 Depth ratio of genome-wide 10kb-regions in AH
4.2 Genome-wide analysis of AH gemmifera across 8 chromosomes
4.3 Genome-wide analysis of "8292" AH halleri population sample across 8 chromosomes
4.4 Nucleotide substitution rates in AH
5.1 Geographical distribution of AK, AL, and AH in the Northern Hemisphere
5.2 Dot plots of gene depth in AK Japan
5.3 Distribution of fold change of gene depth in AK
5.4 Analysis of homeologous sequence evolution in AK Japan
5.5 Maximum parsimony tree of Arabidopsis samples inferred from 3,766 polymorphisms within chloroplast sequences
5.6 ROC curves comparing True Positive and False Positive Rates
B.1 Distribution of probe intensities in AS (Chr1 3MB)
B.2 Distribution of probe intensities in AS (Chr1 8MB)
B.3 Distribution of probe intensities in AS (Chr1 12MB)
B.4 Distribution of probe intensities in AS (Chr2 6MB)
B.5 Distribution of probe intensities in AS (Chr2 17MB)
B.6 Distribution of probe intensities in AS (Chr3 20MB)
B.7 Distribution of probe intensities in AS (Chr4 7MB)
B.8 Distribution of probe intensities in AS (Chr4 16MB)
B.9 Distribution of probe intensities in AS (Chr5 8MB)
B.10 Histogram distribution of alpha statistic in AS and F1AS
B.11 Distribution of alpha statistic in AS DNA
B.12 Distribution of alpha statistic in AS cDNA
C.1 Genome-wide analysis of "8293" AH halleri population sample across 8 chromosomes
C.2 Genome-wide analysis of "8294" AH halleri population sample across 8 chromosomes
C.3 Genome-wide analysis of "8295" AH halleri population sample across 8 chromosomes
C.4 Genome-wide analysis of "8296" AH halleri population sample across 8 chromosomes
C.5 Genome-wide analysis of "8297" AH halleri population sample across 8 chromosomes
D.1 Dot plots of gene depth in AK Alaska
D.2 Dot plots of gene depth in AK British Columbia
D.3 Analysis of homeologous sequence evolution in AK Alaska
D.4 Analysis of homeologous sequence evolution in AK British Columbia
D.5 Distribution of fold change of gene depth in simulated data
D.6 Analysis of homeologous sequence evolution in simulated data

Preface

The following dissertation consists of two main sections outlining research on two Arabidopsis polyploid systems: suecica and kamchatica. The research performed here at USC has also resulted in several prepared publications as listed below:

Chang PL, Dilkes BP, McMahon M, Comai L, Nuzhdin SV: Homoeolog-specific retention and use in allotetraploid Arabidopsis suecica depends on parent of origin and network partners. Genome Biology 2010, 11(12):R125. (Chapter 2)

Chang PL, Leong W, Zhang P, Friesen ML: Population genomics of pooled Arabidopsis lyrata genome data show gene network evolution patterns consistent with flies, worms and yeast. Submitted to Molecular Ecology. (Chapter 3)

Chang PL, Steets JA, Wolf DE, Takebayashi N, Nordborg M, Nuzhdin SV: Sequencing of Arabidopsis halleri ssp gemmifera reveal regions of duplication undergoing positive selection. In Prep. (Chapter 4)

Chang PL, Steets JA, Wolf DE, Takebayashi N, Nordborg M, Nuzhdin SV: Genomic characterization of three Arabidopsis kamchatica genomes. In Prep.
(Chapter 5)

Friesen ML, von Wettberg EJB, Badri M, Moriuchi KS, Barhoumi F, Cuellar-Ortiz S, Chang PL, Cordeiro MA, Vu WT, Arraouadi S, Djebali N, Zribi K, Badri Y, Porter SS, Aouani ME, Cook DR, Strauss SY, Nuzhdin SV: The ecological and genomic basis of salinity adaptation in Tunisian Medicago truncatula. Submitted to the Proceedings of the National Academy of Sciences of the United States of America. (see Abstract in Appendix E)

Vu WT, Chang PL, Moriuchi KS, Friesen ML: Genetic variation of transgenerational plasticity on seed transcriptome and offspring early response to salinity. Submitted to Current Biology. (see Abstract in Appendix F)

Remolina SC, Chang PL, Leips J, Nuzhdin SV, Hughes KA: Genomic basis of aging and life-history evolution in Drosophila melanogaster. Evolution 2012, 66(11):3390-3403. (see Abstract in Appendix G)

Chang PL, Dunham JP, Nuzhdin SV, Arbeitman MN: Somatic sex-specific transcriptome differences in Drosophila revealed by whole transcriptome sequencing. BMC Genomics 2011, 12(1):364. (see Abstract in Appendix H)

Hoxha V, Lama C, Chang PL, Saurabh S, Olate N, Patel N, Dauwalder B: Sex-specific signaling in the blood brain barrier is required for male courtship in Drosophila. PLoS Genetics. In Press. (see Abstract in Appendix I)

Winbush A, Reed D, Chang PL, Nuzhdin SV, Arbeitman MN: Genetic basis of long term courtship suppression of Drosophila males revealed by transcriptome sequencing. G3: Genes, Genomes, Genetics 2012, 2(11):1437-1445. (see Abstract in Appendix J)

Thomas A, Lee P-J, Dalton JE, Nomie KJ, Stoica L, Costa-Mattioli M, Chang PL, Nuzhdin SV, Arbeitman MN, Dierick HA: A versatile method for cell-specific profiling of translated mRNAs in Drosophila. PLoS ONE 2012, 7(7):e40276.
(see Abstract in Appendix K)

Sze S-H, Dunham JP, Carey B, Chang PL, Li F, Edman RM, Fjeldsted C, Scott MJ, Nuzhdin SV, Tarone AM: A de novo transcriptome assembly of Lucilia sericata (Diptera: Calliphoridae) with predicted alternative splices, single nucleotide polymorphisms and transcript expression estimates. Insect Molecular Biology 2012, 21(2):205-21. (see Abstract in Appendix L)

Ghiselli F, Milani L, Chang PL, Hedgecock D, Davis JP, Nuzhdin SV, Passamonti M: De novo assembly of the Manila clam Ruditapes philippinarum transcriptome provides new insights into expression bias, mitochondrial doubly uniparental inheritance and sex determination. Molecular Biology and Evolution 2012, 29(2):771-86. (see Abstract in Appendix M)

Peter Chang
Los Angeles, California, USA
December 2012

Abstract

Recent developments in genomics are revolutionizing our views of genome evolution, demonstrating that perhaps all higher organisms, including mammals, have undergone full or partial genome duplications. Polyploidization, a form of genome duplication, is the increase in genome size caused by the inheritance of additional sets of chromosomes. Over time, these genomes undergo diploidization, reducing polyploid genomes back towards a diploid state. In some ways, polyploidy may be the single most common mechanism of speciation in plants.

Our work on Arabidopsis polyploids focused on two allotetraploids: Arabidopsis suecica (AS) and Arabidopsis kamchatica (AK). In AS, we observed that homeologs originating from parental Arabidopsis thaliana (AT) were lost faster than those originating from Arabidopsis arenosa (AA). We also found that AT homeologs were more likely to be silenced in the leaf transcriptome and that this silencing was network-dependent. The networks of AS are evolving to be more AT-like or more AA-like, rather than mixed. In general, genes within an interspecies network are typically more co-adapted with each other than with genes from other homeologous networks.
Here, we found that mixed networks were significantly underrepresented in AS.

In AK, we sequenced three accessions collected from Japan, Alaska, and British Columbia and documented a pattern of consistent loss of homeologs from parental Arabidopsis lyrata (AL), compared to Arabidopsis halleri (AH). We also examined the correlation between network connectivity and gene retention and found that genes lost or further gained after polyploidization displayed a significant decrease in the number of network partners and/or expression correlation coefficient. Comparison of divergence between parental AL and AH homologs showed that homeolog loss in the polyploid is more common in less divergent genes, as well as in genes with high within-species variation in AL and AH lineages. In cases where genes were duplicated after polyploidization, they exhibited increased Ka/Ks ratios, suggesting that some duplicates were undergoing neofunctionalization that introduced new functions.

In the study of polyploidization and genome duplication, our results across six species of Arabidopsis illustrate the importance of understanding gene evolution in the context of network topology. In the AS polyploid system, AS networks are evolving to be more AT-like or more AA-like, due to co-evolution of genes within a network in AT and AA lineages leading up to hybridization. In the AK polyploid system, gene evolution occurs fastest among genes constrained to nodes of lower centrality and connectivity. These are the genes that were undergoing positive selection for local adaptation in parental AL. These are the genes whose homeologs were lost or further gained after polyploidization in AK. Consistent with the Gene Balance Hypothesis, we find that "connected" genes are not usually lost or duplicated.
In the rare cases where this does happen, genes with copy number fluctuations were found farther and more isolated from the network, minimizing their dosage effects and allowing networks to more easily adapt to evolutionary changes.

Chapter 1

Introduction

1.1 Introduction to Polyploids

In evolutionary biology, polyploidization is the increase in genome size caused by the inheritance of additional sets of chromosomes. Polyploidy is commonplace in the ancestry of flowering plants [104], and speciation via polyploidy likely served as one of the more predominant modes of sympatric speciation in plants [122]. In addition to contributing to 15% of all flowering plant speciation events [177] and being detectable in the ancestry of all plant genomes sequenced to date, polyploidy has occurred numerous times throughout the domestication of the majority of cultivated crops [122, 177]. While previous studies suggested that polyploidy occurred sometime in the past of 57% to 70% of flowering plants [122], recent analyses based on whole genome sequences show traces of genome doubling early in their evolution [76, 77, 124, 157, 158, 164], predating the radiation of flowering plants and thus suggesting that all flowering plants are paleopolyploids. Upon radiation, many lineages have undergone additional, independent and more recent duplications [164].

Figure 1.1: Autopolyploids inherit genetic material from the same species. Allotetraploids inherit one set each from two different species. (Panels: Autopolyploidy, Allopolyploidy.)

Polyploidization can be classified into two general groups depending on the divergence of its parental species and subsequently the divergence of their chromosomal structure. Allopolyploidy is probably the most common speciation mechanism in plants [122] and occurs when both parental organisms are from different species. Allopolyploidy most likely occurs
via hybridization to form diploid hybrid species, followed by eventual genome doubling [151]. This process combines divergent genomes with chromosome doubling to stabilize the polyploid by allowing every chromosome to have an identical homolog partner to pair with during meiosis. In contrast, autopolyploidy results from genome doubling within a single species or by crossing different plants or populations within a species, involving the production and merger of unreduced gametes from genetically and chromosomally similar individuals.

Figure 1.2: Same-colored chromosomes in a cell are homologs. Different colored chromosomes between two species are orthologs. When orthologs are merged into a single cell, they are homeologs. (Panel labels: Homologs are DNA copies of each other. Orthologs are homologs in different species derived from a common ancestor. Homeologs are orthologs that are merged into a new cell.)

Polyploidy is more commonly found in plants for a variety of possible reasons. One predominant view is that polyploidy interferes with the sex determination pathway [114], which is almost non-existent in plants, as the majority contain both male and female reproductive organs. A second obstacle is the disruption of dosage compensation mechanisms that balance the ratio of X to autosomal gene products [119]. Animals have evolved such mechanisms in XX/XY systems, and the doubling of chromosomes introduces XXXX/XXXY/XXYY organisms that are not usually seen within these animal lineages. For the most part, plant species do not need to deal with dosage compensation. It has been recently noted, however, that polyploidy is fairly common among dioecious plants with male and female reproductive organs in different organisms [119]. It is more likely that plants evolved mechanisms to cope with changes in ploidy because of the lability of their genomes.
Compared to animals, plants have very dynamic and variable genomes, with higher frequencies of recombination and transposon activity resulting in extreme phenotypic plasticity [83]. Animals, on the other hand, have stable compartmentalized genomes that have been conserved over the course of their radiation [83]. This plasticity is extremely important for polyploid species as they sometimes struggle to find appropriate mates. The ability of plant species to have a longer lifespan and to adopt different sexual lifestyles such as self-fertilization or asexual reproduction further improves their chances of success.

1.2 Immediate Effects of Polyploidization

In many plant species, newly formed polyploids are unstable and undergo immediate genomic rearrangements [107, 173]. In Brassica polyploids, rearrangements and fragment loss were observed within five generations in newly created allopolyploids [152]. That study suggested that genomic rearrangements may be necessary to restore nuclear-cytoplasmic compatibility, as rearrangements tended to occur in paternally contributed genomes [152]. More recently, studies have documented genomic changes in newly created wheat [46], potato [11], and Arabidopsis allopolyploids [39, 81]. Using chromosome-specific tags for allotetraploid crosses of Arabidopsis thaliana and arenosa, Comai et al. observed that although chromosomes of different parental origins coalesced at early meiosis, chromosomal rearrangements rapidly enabled homeologous chromosomes to properly align [39].

Rearrangements in polyploids are usually accompanied by massive changes in transposable elements, siRNA patterns and other similar epigenetic factors [34, 43, 59, 116, 121]. One possible explanation for such rapid change is that transposable elements that are silent in one parental line become active in allopolyploids [105-107, 173].
For example, it has been shown that maternally derived siRNAs are not sufficient to repress retrotransposons in the paternal genome of Arabidopsis hybrids [79]. Such transposable elements may contribute to physical changes in the structural configuration by facilitating movement of genes and promoting unequal crossover [121]. In addition to epigenetic factors, divergence of centromeres and homeologous regions can lead to segregation distortion, nondisjunction, and nonreciprocal exchanges in hybrids due to similarity of structure between homeologous chromosomes [102, 121].

Changes in genomic structure inevitably lead to changes in gene expression and silencing. This pattern has been seen in many allopolyploids, including Spartina [137], Gossypium [1, 174], and Arabidopsis [36, 79], in which methylation, histone, and heterochromatin patterns have been altered [40, 144]. Because Arabidopsis polyploids are relatively easy to manufacture artificially in the laboratory and have become a popular model, a lot of information regarding gene silencing has come from these studies. Early AFLP-based approaches established that approximately 11% of alleles displayed patterns of expression in artificial arenosa x thaliana F1 allotetraploids that were non-additive when compared to the parental diploid [90]. Allele-specific silencing appeared largely stochastic, with the same allele silenced in some but active in other accessions, all of which had been derived from the same genotypical parents [90]. Silencing also varied between generations within accessions and appeared tissue-specific [171]. In a similar study, Wang et al. used spotted 70-mer oligonucleotide arrays to compare whole genome levels of expression between thaliana, arenosa, and F1 allotetraploids, showing that more than 15% of the transcriptome appeared divergent between the parental diploid species [170].
In F1, 5% of genes deviated in expression level from the additive mid-parent expectation, with the majority being repressed. Interestingly, 94% of these genes were more strongly expressed in the thaliana parent, with their levels of expression in F1 resembling arenosa [16, 35].

1.3 Long-term Effects of Polyploidization

Polyploidy would not serve a major role in speciation if all gene duplicates were quickly removed. Early classical work on duplicate gene evolution predicted that partially recessive deleterious mutations would rise in frequency with little opposition from selection as long as one functional gene was present [122]. Once a deleterious mutation results in removal and loss of one duplicate, selection would act more strongly to preserve the function of the remaining gene [122]. Data from a variety of ancient polyploids, however, have shown that a much larger fraction of duplicated gene copies are retained than expected, ranging from 8% in yeast (100MY) to nearly 80% in maize (11MY), Xenopus (30MY), and salmonids (25-100MY) [121]. Such high levels of retention have led to alternative theories of how gene duplicates are maintained.

Several hypotheses have been brought forth to account for the high retention of duplicated genes. One such example is "neofunctionalization," where genes acquire beneficial mutations that differentiate and preserve gene duplicates. This cannot explain the high levels of retention, however, since a deleterious mutation that eliminates function and results in inactivation is significantly more likely to occur. Walsh modeled this process and showed that indeed, inactivation was more likely [169]. Genes can also undergo immediate "specialization," where the expression of gene duplicates is shared between different tissues [96, 97]. In a surprising discovery, Adams et al.
found that alleles in polyploid cotton hybridized from two different parental genomes differ in expression patterns among tissues examined for 11 of 18 genes considered [1]. In some cases, one parental gene was entirely expressed in carpels while the homeologous gene from the other parent was entirely expressed in petals and stamens. These results imply that duplicated genes can undergo immediate divergence in function as a pleiotropic effect of polyploidization and cast doubt over the assumption that selection is entirely relaxed for newly duplicated genes.

A third hypothesis for the retention of duplicated genes is "subfunctionalization," where deleterious mutations accumulate in both gene duplicates, disrupting some but not all functions of the gene until both duplicates become essential for the functions of the original single-copy gene [50, 98]. This process ensures that both gene copies remain under selection and are retained, and is consistent with instances where duplicated genes differ in the timing of expression in different tissues [47, 50, 69]. In fact, Force et al. showed that total expression patterns of duplicated genes coincide with expression patterns of single-copy orthologs in related organisms [50].

Another popular explanation, termed the "Gene Balance Hypothesis," suggests that gene duplicates are retained because the removal of certain genes results in haploinsufficiencies that lead to reduced fitness [20, 21]. Haploinsufficiency is a state in which a lower than normal amount of wild-type gene product confers a detectable phenotype [164, 166]. Because polyploidy does not change the gene balance, the ratio of gene products remains the same after duplication. However, when the duplicated gene copy of a member of a multi-subunit complex is removed while other members retain both gene copies, for example, the gene balance is disturbed [19, 44, 168].
Thus, these "connected genes" should be difficult to remove from a duplicated genome one at a time, and impossible to remove in concert [52].

Table 1.1: Summary of four hypotheses explaining high retention of gene duplicates

Neofunctionalization: Gene duplicates acquire mutations until one encounters a beneficial mutation. Problem: Deleterious mutations are significantly more likely to occur.

Specialization: Expression of gene duplicates is shared between different tissues at different times. Genes with variable expression are more likely to be retained and specialized. Problem: Deleterious mutations are significantly more likely to occur.

Subfunctionalization: Deleterious mutations accumulate in both gene duplicates until both duplicates become essential for the functions of the original single-copy gene. Problem: Does not predict the relationship between gene function and retention.

Gene Balance Hypothesis: Gene duplicates are retained because the removal of certain genes results in haploinsufficiencies that lead to reduced fitness. Problem: Does not explain the selection mechanism for retained genes, just that they are retained longer for selection to act.

Numerous studies have shown that certain biological processes are extremely sensitive to the quantitative expression of genes involved and that gene copy retention is strongly correlated with the "connectivity" of a gene. Veitia showed experimental and simulation evidence that the activity of typical transcription factor complexes relative to the concentration of any one of the subunits is sigmoidal [166, 167]. In human, Papp et al. showed that transcriptional regulators and proteins that are part of signal transduction pathways were significantly oversensitive to gene dose [123]. These investigators also noted that not all genes retained from the yeast tetraploidy event 100MYA have been equally retained as pairs and found that genes encoding ribosomal proteins were over-retained as pairs.
In rice, genes encoding transcription factors were vastly over-retained at 50%, compared to the average genome retention of 16% [161]. A strong case for the Gene Balance Hypothesis was made in Arabidopsis, where numerous studies all confirmed that genes encoding transcriptional regulators and protein kinases were significantly over-retained [22, 101, 143, 160]. These results provide strong theoretical and experimental support for this hypothesis, which predicts that pairs will be preserved because particular multi-subunit machines or cascades cannot end up in a haploinsufficient state. Freeling and Thomas argued that preservation of gene balance is the best single explanation for the retention of gene pairs following large-scale genome duplications [52].

Despite the interest in genome duplication, most of the research in the field has been directed at the immediate genomic and phenotypic changes resulting from polyploidy. Numerous synthetic polyploids have been constructed in controlled settings to identify factors that affect the degree of rearrangements or changes in gene expression and silencing. Regardless of the reasons behind gene retention, it is clear that the fitness landscape faced by a polyploid is different from that of its parental diploid species, with selection acting to fine-tune one or both gene copies for more specialized functions [121]. Evolutionary theories have been brought forth to account for the high retention of duplicated genes, leading to accurate predictions as to the type of genes that are over-retained in yeast, rice, and Arabidopsis. The amount of literature on genome duplication is a testament to its importance in speciation and diversification. The subsequent and slower evolution of gene sequences and content, however, has rarely been analyzed at base-level precision combined with whole-genome coverage in natural polyploids.
1.4 Arabidopsis Polyploids

Of all the plants that have been surveyed and studied in the genomics era, Arabidopsis has served as one of the premier plant model systems. Its flagship species, Arabidopsis thaliana, is a popular model plant with detailed descriptions of phenotypic effects, network relationships, and expression patterns for many genes and pathways. The fully sequenced genome of thaliana is currently annotated with 39 thousand genes and appears to have undergone at least three rounds of polyploidization [164]. It is an annual, weedy, and mostly autogamous species that is native to Europe and central Asia and now naturalized worldwide [5].

Figure 1.3: Leaves and rosettes of Arabidopsis.

Arabidopsis thaliana is believed to be the maternal parent of Arabidopsis suecica, an allopolyploid species for which a single origin has been established with mitochondrial, chloroplast, and nuclear DNA markers [75, 115, 135]. Arabidopsis suecica is highly, although not completely, selfing [136] and is found mainly in central Sweden and southern Finland [70]. Arabidopsis suecica (2n = 26) originated ca. 30,000 years ago from a cross between a largely homozygous ovule-parent thaliana (2n = 10) and a pollen-parent Arabidopsis arenosa (2n = 16) [75, 115, 135], presumably south of the ice cover, and spread north when the ice retreated 10,000 years ago [75]. Unlike thaliana, arenosa is a self-incompatible member of the Arabidopsis genus, carrying the highest level of genetic diversity among its species group [85]. The estimated time of divergence between thaliana and arenosa is approximately 5 MYA [84], though recent work using previously overlooked fossil evidence suggests a divergence time of 13 MYA [17].

A second polyploid within the family is Arabidopsis kamchatica.
While it is not as well known and studied as Arabidopsis suecica, it may in some ways serve as a better model system for polyploidy and hybrid speciation because of its natural variation and adaptation to different extreme environments. Arabidopsis kamchatica is an allopolyploid for which two subspecies are recognized based on morphology, life history, and habitats. The first subspecies, Arabidopsis kamchatica ssp. kamchatica, is a perennial originally from Kamchatka, Russia, but now reportedly found in East Asia and North America [147]. The second subspecies, Arabidopsis kamchatica ssp. kawasakiana, is an annual found in sandy open habitats along seashores or lakeshores in western Japan [147, 154]. Chloroplast and nuclear DNA markers reveal an allopolyploid origin of Arabidopsis kamchatica (2n = 32) from Arabidopsis lyrata (2n = 16) and Arabidopsis halleri (2n = 16) [145], from multiple individuals of its diploid parents [147], two species that have been extensively studied.

Arabidopsis lyrata is a perennial outcrossing species [37] that occurs under a variety of climatic and ecological conditions, but is most often cold-tolerant and grows in low-competition habitats [38, 78]. It has been identified in North America and Europe, growing in patchy regions extending from Central Europe to Norway [78]. The sequencing of its genome is complete and available to the plant community. Arabidopsis halleri is also a perennial outcrossing species [111] and has served as one of the best model systems for heavy metal tolerance and accumulation in plant species [9, 53, 80, 86, 108, 109, 125-128, 134, 165], having a preference for high altitudes and harsh soil conditions. While kamchatica is morphologically similar to lyrata, halleri is believed to be the maternal parent of kamchatica [147]. The estimated time of divergence between lyrata and halleri is approximately 2 MYA [85].
Chapter 2

Homeolog-specific retention and use in Arabidopsis suecica

The following chapter was published as a Research Article in Genome Biology: Chang PL, Dilkes BP, McMahon M, Comai L, Nuzhdin SV: Homoeolog-specific retention and use in allotetraploid Arabidopsis suecica depends on parent of origin and network partners. Genome Biology 2010, 11(12):R125.

2.1 Abstract

Allotetraploids carry pairs of diverged homeologs for most genes. With the genome doubled in size, the number of putative interactions is enormous. This poses challenges on how to coordinate the two disparate genomes, and creates opportunities by enhancing the phenotypic variation. New combinations of alleles co-adapt and respond to new environmental pressures. Three stages of the allopolyploidization process (parental species divergence, hybridization, and genome duplication) have been well analyzed. The last stage of evolutionary adjustments remains mysterious.

Homeolog-specific retention and use were analyzed in Arabidopsis suecica (AS), a species derived from Arabidopsis thaliana (AT) and Arabidopsis arenosa (AA) in a single event 12,000 to 300,000 years ago. We used 405,466 diagnostic features on tiling microarrays to recognize AT and AA contributions to the AS genome and transcriptome: 614 genes lacked AT contributions and 324 genes lacked AA contributions within AS. In leaf tissues, 3,458 genes preferentially expressed AT homeologs while 4,150 favored AA homeologs. These patterns were validated with resequencing. Genes with preferential use of AA homeologs were enriched for expression functions, consistent with the dominance of AA transcription. Heterologous networks, mixed from AT and AA transcripts, were underrepresented.

Thousands of deleted and silenced homeologs in the genome of AS were identified. Since heterologous networks may be compromised by interspecies incompatibilities, these networks evolve co-biases, expressing either only AT or only AA homeologs.
This progressive change towards predominantly pure parental networks might contribute to phenotypic variability and plasticity, and enable the species to exploit a larger range of environments.

2.2 Introduction

An allotetraploid is formed from the merging of two different species, which may have diverged evolutionarily for millions of years. The resulting plant, if viable, might have a competitive edge, such as broader ecological tolerance compared to its parents [45, 57, 120]. The evolutionary importance of polyploidy is reflected in its prevalence in flowering plants: ancient polyploidy is apparent in all plant genomes sequenced to date and is estimated to have been involved in 15% of all plant speciation events [177]. Why are polyploids so evolutionarily, ecologically, and agriculturally successful?

One can generate an artificial F1 allotetraploid (F1AS) in the lab by performing a cross between a tetraploid AT ovule-parent and a tetraploid AA pollen donor [40]. The resulting primary species hybrid contains two genomes from AT and two from AA. We can use this as an estimate of the genomic composition and homeolog-specific expression at the time of allopolyploid speciation [34, 59, 171]. Taking these patterns as reflective of the AS ancestral state, we observed how evolution has shaped the AS genome. As AT is a selfer and AA an outcrosser [85], AT-originated homeologs might have possessed more deleterious mutations due to Hill-Robertson interference [82]. Are AA-originated homeologs more commonly retained? AT and AA evolved orthologous networks in which genes were finely tuned to coordinate, separately within each species. Interference of AT and AA homeologs may cause mis-regulation within mixed AS networks. This is akin to Dobzhansky-Muller incompatibilities [41]. Do heterologous networks evolve to restore their original orthologous-like compositions? Here, we address these and other questions.
2.3 Materials and Methods

Plant material, DNA and RNA preparations

Affymetrix GeneChip Arabidopsis Tiling 1.0R Arrays were hybridized with samples from four different sources. Genomic DNA was obtained from tetraploid AT accession Ler [42], tetraploid AA accession Care-1 [40], allotetraploid AS accession Sue-1 [42], and an F1 allotetraploid produced by crossing the tetraploids AT and AA as maternal and paternal parents, respectively [40]. cDNA was prepared from AS Sue-1 leaf samples. All genomic DNA and cDNA samples were hybridized in three biological replicates using standard hybridization protocols. Affymetrix CEL files are available for download from the public repository ArrayExpress under the accessions E-MEXP-2968 and E-MEXP-2969. Illumina library construction was performed according to standard protocols. Raw Illumina sequences are available for download at the NCBI Short Read Archive under the study SRA025958.

Microarray preprocessing and normalization

The Arabidopsis Tiling Array is composed of over 3.2 million probe pairs tiled throughout the complete AT genome. Probes are tiled at an average spacing of 35 base pairs. To ensure that arrays within genotypes are comparable to each other, Robust Multiarray Analysis [73, 74] was implemented to perform background correction. Intensities for three biological replicates were summarized using quantile normalization [23]. In addition, intensities for the three biological replicates of AS and F1AS were summarized altogether using quantile normalization. PM probes exhibited some mismatches for the AT genotype, as this array is based on a different reference; the arrays exhibited an additional lower hybridization intensity peak. PM probes from conserved exon regions were much more robust. Consistency and density plots can be found in the online version of the manuscript [32].
As expected from interspecific sequence divergence, the number of AA higher-intensity probes decreased, while the number of lower-intensity probes increased. Note, however, that "conservative features" and "divergent features" peak at similar intensities in both species, making the analyses easier. Similar to AT, AA lower-intensity probes were overrepresented in non-coding regions.

Identifying AS genomic regions with putative multi-gene alterations

Probe intensities among three biological replicates in AS were averaged and paired with the corresponding average among the three F1AS replicates. For each gene, a paired Wilcoxon rank-sum test (FDR < 0.05) [18] of all probes was used to identify genes with differential hybridization. The significance of individual genes might be misleading, but the pattern for multigene regions is robust. We scanned for windows in which at least 27 (90%) out of 30 genes exhibited unidirectional stronger or unidirectional weaker hybridization in AS in comparison with F1AS. We also required these differences to be significant at FDR < 0.05 for at least 9 (30%) genes. Overlapping windows were collapsed to identify the entirety of these regions.

Multi-genotype normalization and identification of diagnostic features

Our goal here is to select probe features enabling the comparison of AT and AA signal representation in AS DNA and RNA. To enable cross-comparison of DNA and RNA, the analyses have to be made gene-by-gene, with DNA and RNA hybridization signals normalized to the same level for each gene. First, probes representing conserved signatures between genotypes were identified and used to scale the entire gene. For every probe in a gene, its average intensity among replicates in AT was compared to the average intensity in AA. These ratios formed a unimodal distribution and the peak of this distribution was used as the scaling factor with which to normalize between genotypes for that gene.
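The peak-finding step just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (the thesis reports using the mlv function in R for this step): the function names `density_mode` and `scaling_factor`, the Gaussian kernel, the Silverman rule-of-thumb bandwidth, and the grid search are all choices made for the example.

```python
import math
import statistics


def density_mode(values, grid_points=512):
    """Return the location of the peak of a Gaussian kernel density estimate."""
    n = len(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return values[0]  # all values identical, so the peak is that value
    # Silverman's rule-of-thumb bandwidth (an assumption, not from the thesis).
    h = 1.06 * sd * n ** (-1 / 5)
    lo, hi = min(values), max(values)
    best_x, best_density = lo, -1.0
    for k in range(grid_points):
        x = lo + (hi - lo) * k / (grid_points - 1)
        # Unnormalized density: the constant factor does not move the peak.
        d = sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values)
        if d > best_density:
            best_x, best_density = x, d
    return best_x


def scaling_factor(at_means, aa_means):
    """Peak of the per-probe AT/AA intensity ratios for one gene."""
    ratios = [t / a for t, a in zip(at_means, aa_means)]
    return density_mode(ratios)
```

In this sketch, probes that hybridize equally well in both genotypes cluster around a common ratio, while divergent probes fall in the tail, so the density peak is insensitive to them in a way the mean ratio would not be.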
Mathematically, for probe i in the gene, the average intensity among 3 biological replicates in both genotypes is defined as:

\[ T_i = \frac{1}{3} \sum_{j=1}^{3} t_{ij} \tag{2.1} \]

and

\[ A_i = \frac{1}{3} \sum_{j=1}^{3} a_{ij} \tag{2.2} \]

where $a_{ij}$ and $t_{ij}$ represent the probe intensities of the jth replicate of the ith probe in AA and AT, respectively. Defining $x_i$ as:

\[ x_i = \frac{T_i}{A_i} \tag{2.3} \]

the scaling factor, $x_{\max}$, is defined as:

\[ x_{\max} = \arg\max_{x} f(x) \tag{2.4} \]

The value for $x_{\max}$ was estimated using the mlv function in R, which calculates the kernel density $f(x)$ of the $x_i$ and searches for the $x$ that maximizes that estimated density function. From here on, we replace all $a_{ij}$ values with rescaled values represented by $x_{\max} a_{ij}$. We disregarded genes whose $f(x)$ failed the Shapiro-Wilk normality test. This normalization method is similar to one recently outlined by Robinson and Oshlack [133], where a scaling parameter is used to normalize between two samples.

Second, we identified single feature polymorphisms or DFs between AT and AA using a Welch t-test of log2-transformed values, followed by an FDR correction. These approaches enabled the analysis of homeolog-specific retention in 24,344 out of approximately 39,000 AT genes.

Analysis of DFs in DNA samples from AS

If an AS gene retained both parental homeologs, we should observe an equal mix of AT and AA signals. A linear model was used to determine whether AS has probe intensities within a gene contributed by i) both parents (mixed), ii) parental AT only (AT-like), or iii) parental AA only (AA-like). For a gene with n DFs, the vector of intensities in AS, $S = [s_1, s_2, \ldots, s_n]$, may be contributed by corresponding AT- and AA-specific signals, such that $S = \alpha_1 A + \beta_1 T$, and the contribution of AA, $\alpha_1$, can be estimated using a simple linear regression. Specifically:

\[ s_{ij} = \alpha_1 A_i + \beta_1 T_i + \varepsilon_{ij} \tag{2.5} \]

where i = 1, 2, ..., n, j = 1, 2, 3 for the three biological replicates, and $A_i$ and $T_i$ are the mean intensities in AA and AT, respectively.
$\varepsilon_{ij}$ are error terms that are independent random variables from a Normal distribution with mean 0 and variance $\sigma^2$. The strength of our experimental design is in F1AS, in which a null model holds true for genomic DNA. For F1AS, this expectation has the same form as Equation (2.5), with coefficients $\alpha_2$ and $\beta_2$ in place of $\alpha_1$ and $\beta_1$ (Equation 2.6).

To detect deviations from the null, we tested whether $\alpha_1$ is significantly different from $\alpha_2$. Under the null hypothesis that $\alpha_1 = \alpha_2$, and assuming $\alpha + \beta = 1$, the test statistic of Equation (2.7) follows an F distribution with 1 and 6n-1 degrees of freedom. This assumption of $\alpha + \beta = 1$ can be made since the contributions of AT and AA are weighted. The bias was labeled as AA-like if $\alpha_1 > \alpha_2$ and as AT-like if $\alpha_1 < \alpha_2$. To account for multiple testing issues arising from thousands of genes tested, Benjamini-Hochberg's FDR was employed to adjust the significance level at 0.05.

As with all linear regression models, we assume that the error terms follow a Normal distribution. We investigated this by applying a Shapiro-Wilk test on each gene to check that the error terms were Normal. We removed over 7,000 genes that failed these tests. We found little discrepancy in the results of the analyses when $\alpha$ was defined as the AT contribution. We also determined significance by performing a permutation test for each gene and found no discrepancy with the F distribution shown above.

Analysis of DFs in AS transcripts

Since we are estimating the relative contribution of AA rather than the absolute, the expression level of every gene in the AS transcriptome was normalized to identical hybridization levels with its corresponding genomic DNA. This was done using probes representing conserved signatures, identified as previously described. We then analyzed the homeolog-specific expression with the same linear model approach as above, using DFs identified between RNA and DNA, and $\alpha$ found in F1AS DNA as the null reference point. When the intensities of these DFs are biased in one direction, we can determine homeolog-specific expression.
Furthermore, for each gene, $\alpha$ was estimated by regressing over all DFs in the set, minimizing spurious effects of individual probes. Forty-nine percent of genes were expressed. The homeolog-specific expression was assayed in 18,876 genes.

Illumina data analysis

Paired-end 72-base Illumina reads were aligned to 102 AA transcript sequences and their orthologous AT sequences using BWA version 0.5.7 [92], allowing up to 10 mismatches. A pairwise global alignment identified SNPs and short insertion/deletion variants between orthologous AT and AA gene pairs. Reads that mapped to either of the two orthologs were scanned for these variants to ensure that they were clustered with the appropriate ortholog (Figure 2.6). The number of reads mapped to each ortholog was normalized to FPK (fragments per kilobase of exon) to account for slightly variable sequence length between orthologs. This analysis and its results are summarized in Figures 2.6 and 2.7 and the Appendix.

2.4 Results

To address the long-term evolution of gene duplicates in polyploids, we studied the Arabidopsis suecica system using a genome-wide Arabidopsis tiling array. We analyzed the genomes of four Arabidopsis genotypes: thaliana (AT), arenosa (AA), suecica (AS), and an F1 allotetraploid produced by crossing thaliana with arenosa (F1AS). We focused on 1,393,557 probes annotated in coding regions, each mapping to a single gene within the genome using Bowtie [88]. Following microarray preprocessing and normalization, we devised a method to discriminate AT- and AA-originated homeologs and to estimate their contribution to the pools of DNA and RNA present in AS. For every gene in AS, we determined whether both AT and AA homeologs are present in the genome and whether they are expressed evenly or in homeolog-specific fashion.
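As an illustration of how the AA contribution can be estimated from diagnostic features, the following Python sketch fits the linear model of Section 2.3 under the stated constraint that the AA and AT contributions sum to one. It is a minimal sketch, not the authors' code: the thesis does not say how the constraint was imposed, so substituting beta = 1 - alpha and fitting a regression through the origin is an assumption, and the function name `estimate_alpha` and its argument layout are hypothetical.

```python
def estimate_alpha(aa_means, at_means, hybrid_replicates):
    """Least-squares estimate of the AA contribution (alpha) for one gene.

    aa_means, at_means: mean intensity of each diagnostic feature in AA and AT.
    hybrid_replicates: one list of per-feature intensities per replicate of
    the hybrid (AS or F1AS).

    With beta = 1 - alpha (an assumption about how the constraint is applied),
    s = alpha*A + (1 - alpha)*T + error becomes (s - T) = alpha*(A - T) + error,
    which is a regression through the origin.
    """
    numerator = 0.0
    denominator = 0.0
    for replicate in hybrid_replicates:
        for a, t, s in zip(aa_means, at_means, replicate):
            numerator += (a - t) * (s - t)
            denominator += (a - t) ** 2
    return numerator / denominator


def alpha_shift(aa_means, at_means, as_replicates, f1as_replicates):
    """Difference in estimated alpha between AS and the F1AS reference."""
    return (estimate_alpha(aa_means, at_means, as_replicates)
            - estimate_alpha(aa_means, at_means, f1as_replicates))
```

Under these assumptions a positive shift points towards an AA-like gene and a negative shift towards an AT-like gene; the significance test and FDR correction described in the methods are not reproduced here.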
2.4.1 Comparison of hybridization between AS and F1AS

The Arabidopsis array features 3.2 million 25-base-long probes tiled throughout the complete genome every 35 bases. As these features are homologous to the AT reference, they should, on average, exhibit a lower hybridization with AA DNA. Probe intensities confirm this expectation. Two typical examples are shown for chromosomes 3 and 4 (Figures 2.1 and 2.2; see Figures B.1 - B.9 in Appendix B for other examples). F1AS signals are a sharp intermediate between AT and AA. AS shows remarkable correspondence with F1AS, with the exception of several extended regions. We hypothesize that these regions correspond to historic losses of homeologous chromosomal regions in AS.

We mapped features onto the genes and compared intensities between AS and F1AS; 6,790 genes exhibited differential hybridization (Wilcoxon rank-sum test, FDR < 0.05). To identify large putative alterations, we scanned for clusters containing at least 30 genes with a strong unidirectional bias (at least 27 with the same bias, significant for at least 9 genes). We identified 39 clusters, encompassing 1,643 genes (Table 2.1). Some clusters were due to differential abundance of transposable-element-like sequences. Chr1 13.66M, Chr1 14.00M, Chr3 12.44M, Chr3 13.36M, and Chr5 11.06M mainly consisted of copia-like, gypsy-like, or CACTA-like retrotransposons. Other regions (for instance, on Chr1 0.29M, Chr3 0.30M, Chr3 5.58M, Chr3 21.60M, and Chr3 22.98M) appeared free from this problem. Interestingly, the region 1.60M-1.78M on chromosome 4 (Figure 2.1) is coincident with the heterochromatic knob known to be hypervariable in AT [24]. The 22.98M-23.46M region of chromosome 3 (Figure 2.2) looked like an AT-homeolog deletion. These results show that tiling arrays can be a useful tool for detecting copy number variation [14] and large-scale alterations in the AS genome.

Figure 2.1: Distribution of probe intensities in AS. 100kb sliding window averages for AT (red), AA (blue), AS (gold), and F1AS (brown) on Chromosome 4. Chromosome positions and gene annotations correspond to the AT genome. Gray boxes indicate clusters containing at least 30 genes with a strong unidirectional bias, where at least 27 genes have the same bias, and significant for at least 9 genes. A list of clusters can be found in Table 2.1.

Figure 2.2: Distribution of probe intensities in AS. 100kb sliding window averages for AT (red), AA (blue), AS (gold), and F1AS (brown) on Chromosome 3.

Table 2.1: Regions of putative alterations in AS.

Chr   Region             Number     Percent with     Percent   Number      Higher
                         of genes   differential     TEs       of probes   hybridization
                                    hybridization                          in
AT1   0.29M - 0.39M      38         44.7             0         2,537       F1AS
AT1   0.82M - 0.91M      32         28.1             3.1       2,266       F1AS
AT1   3.16M - 3.29M      43         37.2             0         3,175       AS
AT1   8.40M - 8.49M      37         29.7             2.7       1,991       F1AS
AT1   13.66M - 13.86M    43         58.1             51.2      3,547       F1AS
AT1   14.00M - 14.39M    70         42.9             51.4      5,998       F1AS
AT1   29.97M - 30.07M    40         32.5             0         2,536       F1AS
AT2   1.96M - 2.03M      34         32.4             8.8       1,377       AS
AT2   4.57M - 4.69M      30         30.0             36.7      2,302       F1AS
AT2   6.50M - 6.67M      43         27.9             16.3      3,214       AS
AT2   10.88M - 11.01M    38         26.3             0         3,182       AS
AT2   14.74M - 14.84M    37         27.0             0         2,440       F1AS
AT2   19.60M - 19.68M    36         38.9             0         2,065       F1AS
AT3   0.30M - 0.36M      33         42.4             0         1,568       F1AS
AT3   5.58M - 5.68M      32         46.9             0         2,299       AS
AT3   7.30M - 7.38M      31         32.3             16.1      1,822       F1AS
AT3   12.44M - 12.61M    36         27.8             61.1      3,055       F1AS
AT3   13.36M - 13.50M    34         55.9             50.0      2,431       AS
AT3   14.55M - 14.70M    39         38.5             33.3      2,904       AS
AT3   20.25M - 20.34M    31         32.3             3.2       2,165       F1AS
AT3   20.93M - 21.00M    30         30.0             0         1,881       F1AS
AT3   21.30M - 21.43M    44         34.1             2.3       3,227       F1AS
AT3   21.60M - 21.73M    45         44.4             0         3,217       F1AS
AT3   22.11M - 22.22M    37         29.7             0         2,520       F1AS
AT3   22.98M - 23.46M    198        79.8             2.0       12,309      F1AS
AT4   1.13M - 1.33M      59         28.8             1.7       4,967       AS
AT4   1.60M - 1.78M      33         57.6             39.4      2,762       F1AS
AT4   7.59M - 7.68M      34         29.4             2.9       2,052       AS
AT4   7.67M - 7.82M      47         23.4             21.3      3,232       AS
AT4   16.89M - 16.96M    32         34.4             0         1,797       AS
AT4   17.86M - 17.95M    39         38.5             0         2,000       F1AS
AT5   9.92M - 10.11M     44         43.2             22.7      4,269       AS
AT5   11.06M - 11.27M    42         45.2             59.5      2,948       F1AS
AT5   13.76M - 13.89M    38         36.8             18.4      2,785       AS
AT5   18.49M - 18.61M    33         30.3             0         2,882       AS
AT5   20.53M - 20.70M    34         29.4             2.9       2,621       AS
AT5   23.48M - 23.56M    33         30.3             0         1,991       F1AS
AT5   26.41M - 26.47M    34         29.4             0         1,453       F1AS
ATM   0.02M - 0.24M      30         50.0             0         1,447       F1AS

2.4.2 Homeolog-specific genomic retention in AS

To analyze the homeolog-specific retention and expression of individual genes, we focused on 1,393,557 probes mapping to coding regions. Since AT and AA sequences differ at about 1 out of 20 bases, some 25-base oligonucleotides designed for AT are a perfect match for AA sequences. Whenever orthologous AA sequences mismatch the AT chip, hybridization is weakened; such probes are hereafter termed "diagnostic features" (DFs). Separately for every gene, we identified a scaling factor based on probes with similar signatures of hybridization to normalize intensities between species. We then identified homeolog-specific DFs and only retained those (405,466) robust over replicates (Figure 2.3). We could only follow 24,344 genes, as the fastest-evolving genes have too many DFs for normalization.

We tested for deviations from an equal representation of the two homeologs in the AS genome [36, 90, 170]. As a reference point, we used the F1AS DNA, in which homeologs are present at equal doses. For each gene within the regions of putative alterations, we tested for changes in $\alpha$ between AS and F1AS, where $\alpha$ represents the relative contribution of AA DF hybridization strengths in a hybrid genome. The degree of departure between $\alpha$ in AS and F1AS was used to determine whether a hybridization signal in AS is contributed by both parents, parental AT only, or parental AA only. There was an upward shift in $\alpha$ in AS compared to F1AS (one-sided paired t-test, p < 2e-17), suggesting a preferential retention of homeologs derived from the AA parent (Figure 2.4).
Supporting this, more genes were called AA-like (614) than AT-like (324). This bias is significant, although moderate compared to other studies [29, 33, 49, 93, 132, 159, 178]. This might reflect the limited power of microarrays. For instance, we analyzed 30 genes encoded by the mitochondrial organelle, known to be AT-derived. Only one plastid-encoded gene had enough DFs to be unambiguously classified, and was biased towards maternal AT, as expected.

Figure 2.3: Probe intensities before and after normalization. Probe intensities for every gene were normalized to identical levels in all arrays. A t-test between AT (red) and AA (blue) replicates identified diagnostic features (shown with asterisks) that were used to identify homeolog-specific hybridization. F1AS (brown) is shown as a null reference against which to compare AS (gold).

Figure 2.4: Histogram distribution of homeolog bias in AS. $\Delta\alpha$ is shown for the genome of AS, using F1AS as a null reference. Distribution is nearly symmetrical and centered at 0.004.

2.4.3 Use of AT and AA homeologs in the AS transcriptome

To identify homeologous transcripts in AS, we extracted RNA from leaf tissues and processed microarrays with SNP-detection protocols similar to those above. More than 49% of genes were called expressed, with 3,458 and 4,150 exhibiting AT-enriched and AA-enriched DFs, respectively.
Overal l, we conclude that, over the 12,000 to 300,000 years, AS has accumulated more deletions of AT- originated homeologs and uses the remain- ing AT- originated homeologs somewhat less. Genes physically clustered together might 26 co-express and co-evolve in transcript levels, as observed in flies [110] and maize [141] . To test whether biases in homeolog-specific expression were concordant between nearby genes, we calculated running averages of 6a along chromosomes (Figure 2.5), and fo und regions with clusters of AT- enriched and AA-enriched transcription. Table 2.2: Homeolog-specific retention in AS. AT -like AA-like AS genome AS transcriptome 324 3,458 614 4,150 To validate the tiling array-based procedures above, we prepared Illumina libraries and performed RNA-sequencing of the AS transcriptome. The AA genome is not yet assembled, but we identified 52 AA genes from GenBank and acquired an additional 50 genes from the UC Genome Center. We identified the orthologous AT genes for these AA genes and mapped the Illumina reads to both homologs. Nine genes did not contain any reads that mapped to either homolog. For 14 genes, reads only mapped to either the AT or the AA homolog. For the remaining genes, reads were aligned to both homologs and clustered as either derived from AT or AA (Figure 2.6). We consider the number of uniquely mapped reads as a measure of homeolog-specific expression. A strong correlation in AA:A T expression ratio between tiling arrays and the RNA-seq (R 2 = 0.646, p < 5e- 07) proves that both approaches work. This concordance is nice (Figure 2.7) considering that RNA samples were extracted from independently grown plants, and that microarray estimates are frequently noisy. 27 Chromosome 1 10 15 20 25 30 Pos1t1on (MB) Chromosome 2 ; 0 � �:A .. - w·� •H . : ·� .a,., ... 6..Jio � ' 9 �L-,: __ �_· __ �-- --,_� != ! 
__ �� ___ -;�� T =-- -� :_ ·_� __ =:-,- ·- ·�:_ -__ � -- ,- -- -- -- -- ,- -- -- -- -- � 10 15 20 25 30 Pos1t1on (MB) Chromosome 3 : 0 9 � : O L '"' 0 --�o4 .. h "'17'\7- ,A � - :·;: •. a i ' �L . __,-·_- __ · __ ... _•_o __ v, ,_' _.,... __ Vf_ · __ ""�_ j -T* =-- ' i '--_ :-=' c__-- , : _� _c * __ .,. ___ :_ , = ---� I'---- ,-----..,--" 0 0 � rr jl : .. . ,.._ J 0 = · � vv � 0 ' 10 15 Pos1t1on (MB) Chromosome 4 * ... fl -... .... '[l� "�VN;-v � * * * * 10 . ... ... � .... � :-�u .. 15 Pos1t1on (MB) Chromosome 5 Pos1t1on (MB) 20 25 30 20 25 30 Figure 2.5: Distribution of clusters of biased homeolog transcripts. Lines above the center indicate clusters of AT-like genes, and those below indicate of AA-like genes. Asterisks depict significance using a genome-wide permutation test. Presence of another asterisk indicates a nearby region that is also clustered with AT- or AA-enriched transcription. 28 AT1 G65450.1 ... T .. T . G T Mapped to Aa Ortho log ..... G .... A .... T .. T . . . . . . G .... A .... T .. T . G T . G T . ........ T . . . . . . . G .. .. A .... T .. T . G . T .... T. .. ...... ...••••• ... . G ............... G .. .. A .... T .. T . G . T ....... C ......................... G ............... G .... A .... T .. T .. G .. T .. T ......... C ......................... G ..... .......... G .... A ... T. T.. . G ... T .. �CAGTTTT AACT GC T TACGC AAAGG C GAAAT GCAAGGC ATTGCTT GAAGAGCC GTT TGGGAGGATTGT G GAAAT AGTAGGTGA TGG GG CAAATAGGAT AAC GGAT GAGTAT GCGCGGTCT GCTATAGATTGGGGJ �C GGTTTT AACC GCA TACGC AAAGG A GAAAT GCAAGGC ATTGCT T GAAGAGCC GTT TGGGAGGATTGT AG A AAT GGTAGG AG AAGG GTCAAA G AGGAT AAC GGAT GAGT AT GCGCGGTCT GCTATAGATTGGGGJ . A ........ G .. T .. T .. .... A ...... T ... ..... A .. .. G .... A .. A ... T .. . G .. A .. ......... G .. ........ A ... G ..... C . A ........ G .. T .. T .. .... A ...... T .......... ..... ... .... T ... ..... A .. . G .... A .. A ... T .. . G .. A .. ............. ....... A G .. T .. T .. .... A ...... T . . . . 
. . . . T ... ..... A .. .. G .... A .. .. T ... G .. A .. . ...... A .......... . .. T .. T .. .... A ...... T ...... .. T ... ..... A .. .. G .G .. A.GA Mapped to At Ort holog Figure 2.6: Sequenced read alignments to AT and AA orthologs. Orthologous AT and AA sequences shown at center contain diagnostic SNPs in red and blue, respectively, that can be used to align and cluster Illumina reads. 2.4.4 Network analysis of homeolog -specific genes The summary of the Gene Ontology analysis of genes exhibiting homeolog-specific reten- tion and expression is shown in Table 2.3. The categories Cell Communication and Signal Tr ansduction were underrepresented, while DNA Repair and Response to DNA Damage Stimulus were overrepresent ed. AA-enriched transcripts were overrepresented in the Gene Expression category, including subprocesses involved in transcription, translation, RNA processing and gene silencing by miRNA. Lastly, we considered homeolog-specific expression in the context of AT transcriptional networks [99]. Of the 7,608 genes with homeolog-specific expression, connectedness esti- mates were available for 6,941 gene pairs. We tested whether bins of higher-connected gene pairs exhibited higher concordance of homeolog-specific expression (Figure 2.8). The fraction of concordant pairs was approximately 0.4 in low-connectedness bins, but increased to 0.8 for the high-connected gene pairs (R 2 = 0.47, p < 0.0001). We also investigated whether AT and AA-biased genes were less frequently mixed within these networks. We partitioned networks with homeolog-specific expressions of at least two 29 0 " Q) (/) <( z Ct: "' c E � � .2 0 � Ct: c 0 ·u; (/) � Q_ >< w "" N lO cj lO N cj 0.0 625 • • • 0.25 4 16 64 Expression Ratio for Affymetrix Tiling Array Figure 2. 7: Concordance between homeolog-specific expression estimated from AT tiling microarray (X-axis) and Illumina resequencing (Y-axis). R 2 � 0.646, p < 5e-07. 
genes as co-biased for AT (219), co-biased for AA (325), or with mixed biases (302) (Table 2.4). The latter "mixed" group was significantly underrepresented in comparison with random expectation (χ², p < 6e-08), suggesting that homeologous networks evolve towards pure AT or AA profiles.

Table 2.3: Gene Ontology annotation for homeolog-biased genes in AS, overrepresented unless stated.

AS genome (AT-like, AA-like)
p-value   Biological process
0.00078   Sulfur amino acid metabolic process
0.0054    Response to fungus
0.0054    Heat acclimation
0.012     Aspartate family amino acid metabolic process
0.012     mRNA metabolic process
0.013     Riboflavin biosynthetic process
0.013     Membrane lipid metabolic process
0.013     Cellular sodium ion homeostasis
0.021     Cellular calcium ion homeostasis
0.024     Aspartate family amino acid metabolic process
0.035     Purine ribonucleoside monophosphate metabolic process
0.036     Cellular potassium ion homeostasis
0.021     Protein amino acid glycosylation
0.029     Defense response, underrepresented
0.024     DNA repair
0.024     Response to DNA damage stimulus
0.028     RNA metabolic process
0.031     Cell communication, underrepresented
0.033     Signal transduction, underrepresented
0.044     Hormone transport
0.0441    Microtubule cytoskeleton organization

AS transcriptome (AT-like, AA-like)
p-value   Biological process
6.1e-05   One carbon metabolic process
0.00012   Intracellular protein transport
0.00012   Macromolecule localization
0.00045   Microtubule-based movement
0.00045   Cytoskeleton-dependent intracellular transport
0.0030    Protein complex assembly
0.0039    Cellular component organization
0.0039    Cytoskeleton organization and biogenesis
0.0053    Photorespiration
0.0069    Seryl-tRNA aminoacylation
0.0071    Aspartate family amino acid metabolic process
0.011     mRNA metabolic process
0.020     Response to drug, underrepresented
0.020     Drug transport, underrepresented
0.024     Pyrimidine base metabolic process
0.024     Phosphate transport
0.024     Inflammatory response
0.0013    Oxidative phosphorylation
0.0024    ATP synthesis coupled electron transport
0.0028    Programmed cell death
0.0043    Cell development
0.0058    Glycerol metabolic process
0.0058    Alcohol metabolic process
0.0058    Hormone metabolic process
0.0081    Phagocytosis
0.0081    Endocytosis
0.012     Hormone catabolic process
0.014     Photomorphogenesis
0.017     tRNA metabolic process, underrepresented
0.023     Transcription
0.031     Nuclear transport
0.034     Regulation of cell cycle
0.034     RNA polyadenylation

[Figure 2.8 plot: x-axis: Pearson Correlation Coefficient.]

Figure 2.8: Fraction of gene pairs co-biased as either AT or AA for bins of different connectivity. R² = 0.47, p < 0.0001. Red dots represent bins with higher fraction of AT co-biased genes within bin. Blue dots represent bins with higher fraction of AA co-biased genes within bin.

Table 2.4: Co-biased pairs of AS homeologs in AT-identified gene networks (χ², p < 6e-08).

             Co-biased as AT   Biased as AT and AA   Co-biased as AA
Occurrence   219               302                   325
Expected     173.1             419.2                 253.7

2.4.5 Variation within AT and AA

Note that although extant accessions of AT, AA, and F1AS were used, AS was formed 12,000 to 300,000 years ago, perhaps from different accessions. DFs and Illumina resequencing may potentially result in misleading conclusions. Nevertheless, 5 million years of sequence divergence between AT and AA compares favorably with the smaller amount of standing sequence variation and with the unaccounted extra divergence since AS formation. From the above resequencing data, we estimated the divergence of the AA homeolog within AS from the homologous gene in AA. Likewise, we estimated the divergence of the AT homeolog within AS from the homologous gene in AT. Consistent with high sequence variation in AA [85], the divergence from parental homologs is larger in AA, as sequence variation in natural AT is very limited [5]. This would result in fewer AA-like calls, and lower biases detected in this manuscript.
Note that, as expected from [98], more strongly expressed genes appear more conserved and exhibit smaller AT and AA divergences (Figure 2.9).

[Figure 2.9 plot: x-axis: Homeolog Expression.]

Figure 2.9: Divergence of AT and AA homeologs in AS in comparison with AT and AA references (Y-axis) compared to homeolog-specific expression (X-axis).

2.5 Discussion

2.5.1 Evolved AS patterns

In allopolyploid speciation, two genomes that have experienced long independent evolution are combined. Their genomes were shaped in different ways in response to extrinsic environmental and intrinsic lifestyle pressures. We focused on AS, a species that evolved 12,000 to 300,000 years ago from a single hybrid individual formed from an ovule of AT and pollen of AA [75, 115, 135]. Orthologous genes of AT and AA have an average sequence divergence of 5% [85], exhibit differences in tissue-specific expression [2, 171], and are located on five versus eight chromosomes. The allotetraploid hybrid initially had low fertility, if one can conclude this from the performance of artificial hybrids in the lab. This fertility can be restored through the complex interplay of genetic and epigenetic processes [130]. Several groups have been fascinated with this rapid but complex process [2, 16, 34, 35, 40, 59, 117, 130, 170, 171]. We focus on the subsequent longer-term molecular evolution, by comparing an evolved natural AS with an "unevolved" F1AS hybrid.

We found that in AS, AA homeologs are more frequently retained and more actively transcribed than their AT counterparts. We hypothesize that these AA-favoring biases are not random, but rather represent a signature of an evolutionary process. To explain these patterns, we propose a concept of "homeolog competition."
Genes are subject to detrimental mutations at approximately constant rates [82]. Purifying selection removes these mutations with varying efficiencies depending on the gene redundancy, dominance, and other characteristics [56, 121, 122, 179]. As some F1AS homeologs are functionally redundant, they should be progressively lost to mutations and deletions. From the initial pool of homeologs, natural selection would preferentially maintain those with a higher contribution to fitness. In this sense, homeologs "compete." Despite stoichiometric constraints to maintain stable ratios of dosage among genes [20, 21], there is a well-documented shrinkage of polyploid genomes over time [4, 87, 121, 122, 129, 138, 150, 160, 173], as not all genes are haploinsufficient [56].

Why would AT-originated homeologs be less valuable? Our first hypothesis is inspired by Hill and Robertson [56]. Selfing organisms, such as AT, are less capable of purging mildly deleterious mutations. This is because of severely reduced recombination in comparison to outcrossers, such as AA [63, 156, 179]. This may seem paradoxical, as AT maintains much less variation than AA [85], which one might interpret as mutations in AA. Indeed, when selfing evolves, segregating mutations are quickly purged, as they exhibit their deleterious nature in autozygous individuals. In the short term, selfers are in fact better off [179]. With time, however, Muller's ratchet kicks in one slightly deleterious mutation after another, resulting in low standing variation but inferior functionality [82]. From an initial pool of alleles in a genome, natural selection would preferentially maintain those with a higher contribution to fitness. We hypothesize that AA homeologs are more valuable as they originate from an outcrosser.
Our interpretation is also supported by the fact that selfing is typical of terminal branches on phylogenetic trees, interpreted as evidence that selfing is an evolutionary dead-end [146, 153, 156]. The preferential retention of AA alleles in AS might illustrate the mutational buildup typical for AT orthologs after turning onto this dead-end evolutionary path.

Our second hypothesis involves historical factors. Suppose the southern-adapted AT accession hybridized with the northern-adapted AA accession, and that the emerging AS accession spent most of the 12,000 to 300,000 years in the northern environment [75, 136]. AA-originated homeologs would be a better fit for the environment, would be more frequently retained, and would evolve to be preferentially used [98]. To test this hypothesis, one must sample AS accessions from multiple locations, resequence their genomes and transcriptomes, and identify environment-specific molecular evolution since the unique AS speciation event. Our model assumes a large standing variation in the genome and transcriptome, which has been well-documented in Tragopogon [93, 159]. More direct, rather than biogeographic-type, evidence might be obtained with Gossypium [48]. This species displays a similar strengthening of parentally skewed expression when natural allotetraploids are compared with F1 allotetraploid controls.

Thirdly, recall that the AA transcription machinery is preferentially expressed in F1AS [170]. Homeologs pre-adapted to function under AA transcriptional control will then be selected for, reinforcing this initial pattern of dominance [49, 141]. Homeolog-specific methylation might be at the heart of these processes [34, 43, 59, 64, 65, 116]. Indirectly supporting this idea, AA-like genes in AS exhibited enrichment in the "gene expression" category (with subprocesses: transcription, translation, RNA processing, and gene silencing by miRNA).
Recent reports in Arabidopsis and Brassica allopolyploids indicate a high proportion of nonadditive expression for genes within these categories as well [54, 59, 170]. In synthetic thaliana-lyrata allotetraploids, AT rRNA expression was repressed [16, 35], perhaps just the tip of a genome-wide trend against AT gene expression within interspecific hybrids. This hypothesis is supported by dominant AA phenotypes in F1AS and the observation that lyrata and synthetic thaliana-lyrata allotetraploids share similar morphological characteristics, such as flower morphology, plant stature and long-lived habit [16, 117]. Similar results have also been shown in Senecio [61, 62]. One explanation for this bias in F1AS is that more efficient selection has resulted in considerable co-evolution of AA trans-acting factors and their targets in the cis-regulatory sequences. This could result in AA transcriptional control dominating in the F1. This assertion is supported by our observation that the GO categories "Gene Expression" and "Transcriptional Control" were strongly enriched for retention of AA alleles in AS. As mutation stochastically inactivates gene duplicates in a polyploid, retention of homeologs with greater proportional effects on the phenotype and greater expression in the hybrid (AA genes for Gene Expression and Transcriptional Control) should be preferred, while the loss of less active homeologs (AT genes) will be under less selective pressure [28, 141, 160].

2.5.2 Resolving incompatibilities in allotetraploid networks

Imagine ancestral genes A1 and A2 that formed a functional dimer in the common ancestor of AT and AA 5 million years ago. These genes evolved into AT1 and AT2 orthologs in the AT lineage, and into AA1 and AA2 orthologs in the AA lineage. Within these lineages, AT1 and AT2 have been selected for the ability to form a dimer. Likewise, co-evolution has been taking place between AA1 and AA2 proteins [41].
In AS, along with the parental dimers AT1-AT2 and AA1-AA2, there will also be heterologous AT1-AA2 and AA1-AT2 dimers. Are these dimers likely to be functional? Dobzhansky and Muller hypothesized that some would not be [162]. Somewhat more generally, all genes within a network should be more co-adapted with each other than with genes from homeologous networks. Adverse fitness effects from genetic interactions, termed Dobzhansky-Muller (D-M) incompatibilities, are thought to accumulate between allopatric populations and contribute to reproductive isolation between newly arising species on the timescale of a few million years [41]. Strongly decreased fitness of AT × AA F1 and F2 seeds and meiotic disruptions in F1s attest to the presence of these intrinsic incompatibilities contributing to the reproductive isolation of these two species [179].

These results provide insight into gene retention of duplicated genes and suggest a concerted evolution of gene clusters. One might hypothesize that after millions of years, paleopolyploids should preferentially retain genes from one of the two homeologous sets. This pattern was observed in the most recent whole genome duplication of Arabidopsis [160], and more recently in maize [140-142, 178], cotton [49] and wheat [129], and is now recognized as a general feature of paleopolyploids [27, 28, 30, 51, 87, 138]. We provide here molecular evidence that this process initiates very early (e.g. is detectable within a few thousands of years of the polyploidy event) and is coincident with homeolog-specific retention. It is worth noting that the differential retention of genes in AS that we have observed was dependent on genetic divergence between homeologs at the time of hybridization. Following this, we speculate that the polyploid ancestors of flowering plants that gave rise to the fractionated genomes we observe today were allopolyploids.
Our observations suggest an evolutionary path to fitness restoration during allopolyploid speciation by preferentially co-expressing only one parental set of interacting homeologs, with mixed networks being less common. Co-expressed genes involved in the same biological networks are more likely to exhibit D-M interactions, and networks could alleviate such negative interactions by evolving towards pure AT or AA profiles. This type of "D-M homeolog conflict resolution" should be typical for polyploid ancestors of the flowering plants and should further contribute to the fractionated genomes we observe today [87, 129, 138, 151, 160]. As we now know the identity of networks having evolved to a "pure" parental type, our strong prediction is that the experimenter-induced heterologous state in these networks shall result in detectable reproductive losses.

2.6 Conclusions

When an allotetraploid is formed, the functions of homeologs are partially redundant, and the genome is set for gene silencing and deletion. Thousands of genes affected by these processes in AS were identified with tiling arrays and resequencing. These new computational approaches enable the use of widely available and economical tiling microarrays for the whole-genome analyses of species closely related to the sequenced references. In the AS allotetraploid, more AT-originated homeologs were lost and silenced than AA-originated homeologs. We hypothesize that these AA-favoring biases are not random, but rather represent a signature of an evolutionary process. Whenever more than one gene experiences silencing within a network, the homeolog bias of the first event influences the likewise bias for the subsequent silencing; networks evolve towards their ancestral types. The mosaics of predominantly pure-parental networks in allotetraploids might contribute to phenotypic variability and plasticity, and enable the species to exploit a larger range of environments.
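The underrepresentation of mixed networks reported in Table 2.4 can be checked directly from the tabulated counts. The following is a minimal sketch, not the original analysis code: it recomputes a Pearson χ² statistic from the observed and expected values in Table 2.4 and, as an assumption made here for illustration, evaluates the p-value with 2 degrees of freedom (the text does not state the degrees of freedom used).

```python
import math

# Counts and expectations copied from Table 2.4.
observed = {"co-biased AT": 219, "mixed AT and AA": 302, "co-biased AA": 325}
expected = {"co-biased AT": 173.1, "mixed AT and AA": 419.2, "co-biased AA": 253.7}


def chi_square_statistic(obs, exp):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over categories."""
    return sum((obs[k] - exp[k]) ** 2 / exp[k] for k in obs)


def chi_square_pvalue_df2(stat):
    """Upper-tail p-value for 2 degrees of freedom (an assumption here).

    For df = 2 the chi-square survival function has the closed form
    exp(-stat / 2), so no statistics library is needed.
    """
    return math.exp(-stat / 2.0)


stat = chi_square_statistic(observed, expected)
p_value = chi_square_pvalue_df2(stat)
print(f"chi-square = {stat:.2f}, p = {p_value:.2e}")
```

Under these assumptions the statistic is about 65, dominated by the deficit of mixed networks, and the p-value falls well below the reported bound of 6e-08.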
Chapter 3

Population genomics of pooled Arabidopsis lyrata genome data show gene network evolution patterns consistent with flies, worms and yeast

The following chapter was submitted as a Research Note to Molecular Ecology: Chang PL, Leong W, Zhang P, Friesen ML: Population genomics of pooled Arabidopsis lyrata genome data show gene network evolution patterns consistent with flies, worms and yeast. Submitted to Molecular Ecology.

3.1 Abstract

Genes located more centrally in protein interaction networks may be constrained from sequence changes. Since genes also interact functionally with their neighbors in networks, positively selected genes and their neighbors may co-evolve with similar rates of evolution. To test gene network evolution predictions, we calculated dN/dS ratios for 6,322 genes in existing genomic resequencing data of Arabidopsis lyrata populations growing in serpentine and granitic soils. We also calculated ratios of serpentine dN/dS over granitic dN/dS (S/G ratios) and designated 283 genes with high S/G (greater than 1.2) as potentially involved in serpentine adaptation. dN/dS and S/G were correlated with measures of network centrality calculated from a modified graphical Gaussian network and a co-expression network from the literature. Genes with higher dN/dS and S/G were constrained to nodes of lower connectivity and betweenness, and higher closeness in gene networks, consistent with patterns of gene network evolution in Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae. Genes under higher positive selection co-evolve with neighboring genes in 10 out of 30 instances, indicating that highly selected genes can affect the evolutionary rate of their neighbors, though whether they do depends on the location of the polymorphism.

3.2 Introduction

Genes and proteins do not function independently, but rather in complicated networks in cells.
Biological networks share features with complex technological and social networks, allowing mathematical analysis of network characteristics [13]. The centrality of a gene in a network depends upon connectivity - the number of nodes a gene is connected to, betweenness - the frequency with which the gene lies on the shortest path between all other nodes, and closeness - the number of nodes connecting a gene to all other genes [172]. Central genes have a larger impact on the network and are predicted to be more evolutionarily conserved; this has been shown in Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae [60]. Conversely, genes for local adaptation need to evolve quickly to environmental changes, and should not be centrally located in gene networks. Genes also interact with their neighbors in protein interaction networks and a breakdown of this interaction leads to loss of function in protein complexes [183]; hence positively selected genes are expected to co-evolve with their neighbors in the network.

To test gene network evolution predictions, we used a whole genome Arabidopsis lyrata data set generated to look for genes involved in local adaptation to serpentine soil [163]. Twenty-five plants were pooled from each of 4 populations - two growing on serpentine soils and two growing on granitic soils - and sequenced to produce a total of ~39X coverage. Turner et al. identified genomic regions with high Fst that contained metal transporters and candidate genes with polymorphisms for serpentine adaptation [163]. We reanalyzed this data set to calculate dN/dS ratios for 6,322 genes. To calculate network statistics, we used data from two Arabidopsis thaliana networks - one generated using a modified graphical Gaussian model (GGM) [99] and another a co-expression network generated using selective probes and chipsets (AGCN) [103].
We used orthologous Arabidopsis thaliana networks to test the following gene network evolution predictions: 1) Genes with higher centrality scores have lower dN/dS ratios, 2) Genes involved in serpentine adaptation in Arabidopsis lyrata have lower centrality scores, and 3) Positively selected genes evolve at similar rates with their neighbors in networks.

3.3 Materials and Methods

Genomic DNA resequencing

Genomic DNA resequencing using the Illumina platform was previously performed by Turner et al. [163]. This data included sequencing of four populations of Arabidopsis lyrata (AL), where two populations were sampled from serpentine and two from granitic sites. At each of these sites, 25 plant individuals were pooled and sequenced at ~10X coverage. Sequenced reads of 36 bases were aligned to the AL genome release using RMAP [149]. Alignments were allowed up to 3 mismatches and reads that aligned uniquely to the genome reference were mapped. An alignment is considered unique if there is only one alignment with the best score. To minimize the effects of sequencing errors, polymorphic sites were defined as base positions where more than one nucleotide had at least 3 basecalls at that site.

Test for selection

The current genome annotation of AL contains 32,670 genes, although only 12,341 CDS orthologs were identified using existing Arabidopsis thaliana (AT) genes. An additional 126 were removed due to unannotated start or end sites for the coding regions. For the remaining genes, a pairwise alignment to the orthologous AT sequence was performed using ClustalW [11]. For 6,322 AL genes aligned to their AT ortholog, dN/dS ratios were calculated for each gene in each of the four populations. AT was used as the reference sequence for dN/dS calculation because the AL reference was not the outgroup to the four resequenced populations in a quarter of gene trees.
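The polymorphic-site rule stated under Genomic DNA resequencing (more than one nucleotide with at least 3 basecalls at a position) can be sketched as follows. This is an illustrative sketch only: the pileup dictionary, its positions, and the function names are hypothetical and are not taken from the original pipeline.

```python
MIN_BASECALLS = 3  # threshold stated in the text


def supported_alleles(base_counts, min_calls=MIN_BASECALLS):
    """Return the nucleotides with at least `min_calls` basecalls at one site."""
    return sorted(base for base, n in base_counts.items() if n >= min_calls)


def is_polymorphic(base_counts, min_calls=MIN_BASECALLS):
    """A site is polymorphic when more than one nucleotide passes the threshold."""
    return len(supported_alleles(base_counts, min_calls)) > 1


# Hypothetical basecall counts at three positions in one pooled population.
pileup = {
    101: {"A": 8, "G": 4},          # two well-supported alleles
    102: {"C": 11, "T": 1},         # the lone T is treated as a sequencing error
    103: {"A": 3, "C": 3, "T": 2},  # A and C pass the threshold, T does not
}

polymorphic_sites = [
    pos for pos, counts in sorted(pileup.items()) if is_polymorphic(counts)
]
print(polymorphic_sites)
```

Requiring a minimum count per allele, rather than a minimum frequency, keeps the rule independent of local coverage, at the cost of missing rare alleles at low-coverage positions.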
At base positions containing multiple alleles resulting in multiple codons, the dN/dS at that codon position was calculated by averaging over all codons, weighted by their number of occurrences.

Gene ontology enrichment analysis

Genes were tested for over-representation in Gene Ontology (GO) categories using BiNGO 2.3 in Cytoscape 2.6 [12]. We compared GO clustering frequencies with the GOSlim Plant database in TAIR [13] using hypergeometric tests with a Benjamini and Hochberg False Discovery Rate (FDR) correction [14]. A term was deemed GO-enriched if its FDR-corrected p-value was less than 0.05.

Network analysis

We reconstructed Ma et al.'s graphical Gaussian (GGM) network [99] and Mao et al.'s AGCN network [103] as one-mode undirected networks using Pajek [15]. For each gene, we calculated measures of connectivity (the number of nodes connected to the gene), betweenness (how many times the gene lies between two other genes in the network), and closeness (sum of the number of steps in shortest paths connecting the gene to all other genes it can reach in the network).

3.4 Results

Out of 32,670 genes in the Arabidopsis lyrata (AL) genome, only 12,341 had Arabidopsis thaliana (AT) orthologs. We disregarded genes that lacked sufficient coding sequence information or did not align properly to their AT ortholog. We conservatively limited our analyses to 6,322 genes that had complete CDS annotation and good alignment to the AT ortholog.

3.4.1 Test for selection between orthologs

Distributions of dN/dS for the four populations were similar to each other. To identify genes undergoing positive selection in the two serpentine populations (AL3 and AL4), we calculated the ratio between dN/dS of pooled serpentine and pooled granitic sequences (S/G ratio). As expected, S/G ratios were centered around 1 and more than 70% of genes had ratios between 0.90 and 1.10. We identified 283 genes with S/G ratios above 1.2 as potentially involved in serpentine adaptation (hereafter "serpentine genes"). Of the 283 serpentine genes, only one overlapped with a list of 20 genes identified as candidate polymorphisms for serpentine adaptation by Turner et al. using Fst as a metric for selection (Table 3.1). The serpentine genes we identified included genes that coded for membrane proteins, metal ion transporters, metal ion binding and cellular response to metal ions. These are likely to be involved in adaptation to serpentine soils, which are enriched in heavy metals and deficient in nutrients.

GO enrichment analysis was carried out on the 6,322 genes that were analyzed, as well as the 283 serpentine genes with S/G above 1.2. Categories were ranked in terms of significance of GO enrichment (Table 3.2). Membrane and cytosol function were most significantly enriched in serpentine genes, but only ranked 6 and 7, respectively, in the complete gene set.

Table 3.1: Serpentine genes with membrane and cytosol functions having high S/G ratios. ANNAT1 was identified as having serpentine adapted polymorphisms by Turner et al.

Rank  Gene       S/G     Annotation
9     AT1G35720  2.6301  Annexin 1 (ANNAT1); cadmium ion response, copper ion binding
14    AT4G26570  2.2953  Calcineurin B-like 3 (ATCBL3); calcium ion binding
20    AT5G08670  2.1397  ATP binding / hydrogen ion transporting ATP synthase
24    AT3G27060  1.9169  TSO2; oxidoreductase activity, transition metal ion binding
53    AT5G19780  1.5169  TUA5; cadmium ion response
55    AT1G23490  1.5025  ADP-ribosylation factor 1 (ATARF1); cadmium ion response
63    AT5G38940  1.4471  manganese ion binding, nutrient reservoir activity; salt stress response
72    AT5G47050  1.3937  protein binding, zinc ion binding
100   AT2G14720  1.2973  VSR2; calcium ion binding
107   AT1G11670  1.2828  MATE efflux family protein; antiporter, transporter activity
114   AT1G05500  1.2720  Located in plasma membrane, contains C2 calcium binding region
138   AT5G01490  1.2425  CAX4; cation exchanger and antiporter
146   AT4G11650  1.2365  Osmotin 34 (ATOSM34); salt stress response
148   AT3G15730  1.2349  Phospholipase D alpha 1 (PLDALPHA1); cadmium ion response
151   AT2G01770  1.2333  VIT1; vacuolar iron ion transmembrane transporter
162   AT5G04160  1.2262  Phosphate translocator; organic anion transmembrane transporter
182   AT5G52670  1.2143  Heavy-metal-associated domain-containing protein; metal ion binding
197   AT2G46800  1.2030  Zinc transporter (ZAT); metal ion transmembrane transporter

3.4.2 Network evolution

We calculated dN/dS for 3,063 out of 6,760 genes in the GGM network. Most genes had low dN/dS of less than 0.4, and were distributed evenly across centrality values. Genes with higher connectivity and betweenness had lower dN/dS, and genes with higher dN/dS had lower connectivity and betweenness. The relationship between dN/dS and closeness also shared the same expected pattern as that for connectivity and betweenness. For example, genes with high dN/dS did not have low closeness, suggesting that they are farther and more isolated from the network.
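The three centrality measures compared with dN/dS here follow the definitions given under Network analysis in the Methods. The sketch below shows one way to compute them on a small undirected network; it is not the Pajek procedure used in this study. The toy graph and gene names are hypothetical, betweenness is computed with Brandes' algorithm (a substituted technique), and closeness is taken as the sum of shortest-path steps, so that larger values mean a more remote gene.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path length (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def connectivity(graph, gene):
    """Number of nodes directly connected to the gene."""
    return len(graph[gene])


def closeness(graph, gene):
    """Sum of shortest-path steps from the gene to all genes it can reach."""
    return sum(bfs_distances(graph, gene).values())


def betweenness(graph):
    """Unnormalized betweenness via Brandes' algorithm (unweighted, undirected)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack, queue = [], deque([s])
        preds = {v: [] for v in graph}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in graph}    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                     # accumulate dependencies back to s
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in score.items()}


# Hypothetical toy network: hub H with neighbors A, B, C; D hangs off C.
graph = {
    "H": ["A", "B", "C"],
    "A": ["H"],
    "B": ["H"],
    "C": ["H", "D"],
    "D": ["C"],
}
between = betweenness(graph)
for gene in graph:
    print(gene, connectivity(graph, gene), between[gene], closeness(graph, gene))
```

In this toy network the hub H has the highest connectivity and betweenness and the lowest closeness sum, while the peripheral gene D shows the opposite profile, which is the pattern reported above for genes with high dN/dS.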
These same patterns were observed for the 3,170 out of 5,743 genes in the AGCN network for which we had dN/dS values. The two networks shared an overlap of 1,211 genes with dN/dS values. A significant difference in dN/dS was detected between genes of high and low betweenness in the AGCN network (p < 0.0004), as well as genes of high and low closeness in the GGM network (p < 0.02). dN/dS was significantly correlated with expression in both GGM and AGCN networks (p < 0.0001).

Table 3.2: Ranks of GO enrichment. NA indicates term was not significantly overrepresented.

Gene Ontology Term                               Rank for S/G > 1.2   Rank for all genes
Cytosol                                          1                    7
Membrane                                         1                    6
Structural molecule activity                     3                    9
Intracellular                                    4                    4
Generation of precursor metabolites and energy   4                    18
Cytoplasm                                        6                    1
Cellular process                                 7                    11
Thylakoid                                        8                    28
Photosynthesis                                   10                   33
Cell                                             NA                   3
Catalytic activity                               NA                   2
Metabolic process                                NA                   5
Vacuole                                          NA                   30

We had S/G ratios for 1,714 and 1,779 genes from GGM and AGCN networks respectively. Of these, 52 and 48 genes, respectively, had S/G ratios greater than 1.2 (serpentine genes). In both networks, serpentine genes had lower connectivity and betweenness, and higher closeness. Serpentine genes had significantly lower betweenness than non-serpentine genes in the GGM network (one-tailed t-test, p < 0.005). All other comparisons were not significant, although this can be attributed to the smaller sample size.

To investigate if genes that neighbor high dN/dS genes also undergo positive selection, we calculated mean dN/dS for genes within two edges of 30 nodes with the highest dN/dS (ranging from 1.01 to 6.59). All but two of the 30 two-neighbor subnetworks were independent of each other. Genes within the two-neighbor radius of the highest 30 nodes had higher dN/dS than the remaining genes in the network (one-tailed t-test, p < 0.0086).
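The two-neighbor comparison above can be sketched as a breadth-first search limited to two edges, followed by a mean over the neighbors that have a dN/dS estimate. This is a hedged illustration, not the code used for the analysis: the chain-shaped toy network, gene labels, and dN/dS values are hypothetical, and the t-test against the remaining genes is omitted.

```python
from collections import deque


def genes_within(graph, focal, max_edges=2):
    """Genes reachable from `focal` in at most `max_edges` steps, excluding `focal`."""
    dist = {focal: 0}
    queue = deque([focal])
    while queue:
        node = queue.popleft()
        if dist[node] == max_edges:
            continue  # do not expand beyond the requested radius
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    del dist[focal]
    return set(dist)


def neighborhood_mean_dnds(graph, dnds, focal, max_edges=2):
    """Mean dN/dS over neighbors that have an estimate; others are skipped."""
    values = [dnds[g] for g in genes_within(graph, focal, max_edges) if g in dnds]
    return sum(values) / len(values) if values else float("nan")


# Hypothetical chain g1 - g2 - g3 - g4 - g5; g5 has no dN/dS estimate.
graph = {
    "g1": ["g2"],
    "g2": ["g1", "g3"],
    "g3": ["g2", "g4"],
    "g4": ["g3", "g5"],
    "g5": ["g4"],
}
dnds = {"g1": 2.0, "g2": 0.8, "g3": 0.6, "g4": 0.1}

neighbors = genes_within(graph, "g1")
mean_near_g1 = neighborhood_mean_dnds(graph, dnds, "g1")
print(sorted(neighbors), mean_near_g1)
```

Skipping neighbors without a dN/dS value mirrors the situation described in the Discussion, where estimates were not available for all possible neighboring genes.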
3.5 Discussion

By comparing dN/dS ratios of serpentine and granitic populations, we were able to identify genes involved in oxidative stress, metal ion binding, and plasma membrane receptors and transporters that are potentially involved in adaptation to serpentine soils, which are enriched in heavy metals and low in nutrients. As expected, we found significant enrichment in membrane and cytosol genes. Surprisingly, relatively few of the potential serpentine adaptation genes we identified in this study overlapped with the 96 high Fst ratio polymorphisms identified by Turner et al. [163]. One explanation is that, because we tested for selection only in coding sequences, we could have missed genes where polymorphisms occur in non-coding sequences, which may play a regulatory role in expression. Out of 20 candidate serpentine adaptation genes identified by Turner et al., half had polymorphisms in non-coding regions. Since there is little overlap in genes identified, multiple tests for selection can be used while screening to detect more potential adaptive genes.

Our data also adds to the current understanding of gene network evolution [60]. Although dN/dS correlated negatively with connectivity and betweenness, the relationship between each pair of measures is not linear. The great majority of genes had low dN/dS, independent of centrality, which explains our inability to detect significant differences using statistical comparisons. The few genes with high dN/dS had low connectivity and betweenness scores. Connectivity and betweenness are both measures of how many other genes in the network a gene is connected to directly or indirectly, and are indicative of the number of genes in the network that may be affected by a change in the gene of interest.
A greater number of connections increases the likelihood that a change in the gene will be incompatible with other genes in the network, so only relatively unconnected genes can incorporate non-synonymous changes. Contrary to this pattern, high dN/dS genes had higher closeness. Closeness is a measure of the average distance between the gene and all other genes in the network. While essential genes may be constrained to short efficient pathways [60], genes with higher dN/dS that may be important in serpentine adaptation are less connected with other members of the network.

Patterns of S/G ratio versus centrality closely mimic those between dN/dS and centrality. It is important to note that high S/G genes are not the same as those previously identified as having high dN/dS. We do not expect a gene that is under positive selection between AL resequenced populations and AT to be the same as that between edaphic ecotypes. We hypothesized that genes involved in local adaptation will belong to evolutionarily flexible parts of networks. Consistent with our hypothesis, we found that serpentine genes (S/G > 1.2) were restricted to positions of low centrality in networks.

Finally, our data showed that nodes neighboring positively selected (high dN/dS) genes also exhibited positive selection in a significant number of instances, suggesting that these genes are co-evolving. This pattern could partly be due to the fact that we did not have dN/dS values for all possible neighboring genes. Close neighbors showed a range of dN/dS values, with some genes co-evolving more than others. We had enough neighboring dN/dS values for the 30 genes we analyzed to reasonably expect that sampling error does not bias our analysis. However, being able to detect whether a neighboring gene co-evolves depends on whether the polymorphism is located in a site that would affect the interaction with the neighboring gene.
Hence it is reasonable that not all neighboring genes would co-evolve. We detected significant co-evolution of the immediate subnetwork in a third of the cases we tested.

The use of AT as a reference annotation for AL assumes that many of the networks and pathways have been conserved since both diverged from a common ancestor. This is a fair assumption due to the relatively short divergence time of 5 million years [85]. It should be noted, however, that the two species have differing reproductive strategies, with AT being a predominantly selfing annual, whereas AL is an obligate outcrosser. The genome of AT has shrunk considerably, spanning five chromosomes as compared to eight for all other members of the Arabidopsis family. Coupled with a higher mutation rate and shorter generation time, it is possible that the smaller genome of AT and its application to our data results in failure to detect some AL pathways. Nevertheless, we are left with a robust dataset with many pathways. Furthermore, as our analysis focuses on adaptation to serpentine environments, the divergence of reproductive lifestyles should have a minimal impact.

As advancements in sequencing technologies increase the efficiency and reduce the cost of sequencing, we are confronted with the problem of how to effectively get the most information out of the large amounts of data generated. In this work, we re-analyzed existing resequencing data to extract more information. Our analysis can be applied to new datasets where individuals are pooled instead of sequenced directly, reducing sequencing costs. We produced comparative data to show that gene network evolution in a plant is consistent with existing data that has been generated for metazoans and yeast [60]. In examining evolution of subnetworks around highly selected genes, we also presented a novel way to empirically test predictions of gene network evolution.
3.6 Conclusions

We reanalyzed existing genomic resequencing data of Arabidopsis lyrata to identify 283 genes undergoing positive selection in populations growing in serpentine soils. Some of these genes were involved in oxidative stress, metal ion binding, and plasma membrane receptors and transporters that are potentially involved in adaptation to serpentine soils. We also analyzed genes in the context of their networks and found that serpentine genes were constrained to nodes of lower centrality, allowing them to evolve more quickly in response to environmental changes. These patterns of connectivity, betweenness, and closeness also applied to genes undergoing positive selection in Arabidopsis lyrata when compared to its sister species Arabidopsis thaliana. Our data also show that genes under higher positive selection co-evolve with neighboring genes, indicating that highly selected genes can affect the evolutionary rate of their neighbors.

Chapter 4

Sequencing of Arabidopsis halleri ssp gemmifera reveal regions of duplication undergoing positive selection

Chang PL, Steets JA, Wolf DE, Takebayashi N, Nordborg M, Nuzhdin SV: Sequencing of Arabidopsis halleri ssp gemmifera reveal regions of duplication undergoing positive selection. In Prep.

4.1 Introduction

In the first stage of allopolyploidization, parental species diverge and adapt to specific environments. Directional selection contributes to the fixation of species-specific beneficial mutations in coding and regulatory regions [118, 131], while slightly deleterious mutations fix due to drift. One plant system where this can be studied is Arabidopsis kamchatica (AK), an allopolyploid formed from the hybridization of Arabidopsis lyrata (AL) and Arabidopsis halleri (AH) [145], two species from sister taxa that have similar lifespan and lifestyle, but with adaptation to different extreme environments.
AL is a perennial outcrossing species [37] that occurs under a variety of climatic and ecological conditions, but is most often cold-tolerant and grows in low competition habitats [38, 78]. AH is also a perennial outcrossing species [111] and has served as one of the best model systems for heavy metal tolerance and accumulation in plant species [9, 53, 80, 86, 108, 109, 125-128, 134, 165], having a preference for high altitudes and harsh soil conditions. In order to understand the evolutionary changes that occur upon allopolyploidization, it is important to first identify genetic differences between parental species. Recognizing these differences will allow identification of changes that are occurring in one parentally contributed genome or the other. Using the AL genome as a reference, we used Illumina sequencing at 63X coverage to characterize the genome of its sister species, AH, using short read mapping coupled with de novo assembly and anchoring of sequences to the AL reference. At the nucleotide level, we identified more than 1.8M fixed polymorphisms between AH and AL. At the gene level, we identified 738 de novo genes in AH and characterized thousands more that were found in differing amounts between the two subspecies of AH, gemmifera and halleri, and relative to AL. At the larger regional level, we searched for genomic structure variation between AH and AL, identifying segments of insertions, translocations, and transversions, as well as 34 regions in the gemmifera genome that were present in higher copy numbers. Our results confirm the increase in genome size of the gemmifera subspecies, resulting from the segmental duplication of several regions. We also found that these regions displayed a significant increase in Ka/Ks ratios, consistent with positive selection acting on these gene duplicates. The relationship between AL and the two subspecies of AH suggests a duplication event in the gemmifera lineage after the split with halleri.
4.2 Materials and Methods

Plant material and Illumina library preparation

Genomic DNA was prepared from six wild populations of Arabidopsis halleri ssp. halleri collected from France, Austria, and Germany. Each population sample consisted of leaves collected from 25 individuals. Two samples were also prepared from leaves of two individuals of Arabidopsis halleri ssp. gemmifera collected from repository stocks at the University of Alaska at Fairbanks, originally collected from Japan. Illumina library construction was performed according to standard protocols and each sample was sequenced on one flow cell lane.

Illumina data analysis

Paired-end 76-base Illumina reads were aligned using mrFAST Version 0.5.7 [6] to the publicly available Arabidopsis lyrata (AL) reference genome [68]. Read pairs were retained when each end aligned with at most six mismatches and the insert size was less than 580 bases. Reads that did not align as pairs were aligned independently as single ends to all aligned regions. Species-specific polymorphisms fixed in Arabidopsis halleri (AH) were called when they were supported by more than 95% of reads at these positions in the pooled sample. These fixed polymorphisms were incorporated into the AL reference to create an AH reference. De novo contigs were assembled using Velvet 1.0.15 [181], with a kmer length of 47 bases. BLAST algorithms [8, 182] were used to annotate contigs having E-values less than 1e-5 and whose alignments span at least 80% of the query (BLASTN) or gene hit (BLASTX). Contigs with unique BLASTN hits to previously unmapped regions of the AH reference were anchored using customized Perl scripts. Genes covered by mapped reads and contigs were also characterized using customized Perl scripts.

Analysis of structural variation

Paired-end 76-base Illumina reads were aligned using mrFAST to the AH reference using the "--discordant-vh" option.
This option reports paired-end reads that lack an acceptable mapping, that is, pairs whose mates did not map within 580 bases of each other or were mapped in the wrong orientation. The resulting ".vh" file was filtered to include paired-end reads that mapped to only two regions in the reference (one per end). These reads were analyzed using VariationHunter Version 0.3 [66, 67] to identify translocations and transversions in the genome. BLAST alignments were also used to characterize structural variation. Together with the 144,383 and 361,214 contiguously mapped genomic segments in the halleri and gemmifera references, respectively, unmapped reads were used by Velvet to construct de novo contigs. These contigs were aligned to the genomic segments using BLASTN with E-values less than 1e-40. Chimeric contigs were identified when one portion of the contig was uniquely mapped to one genomic segment, while an adjacent portion was uniquely mapped to another genomic segment.

Analysis of depth coverage

To search for CNVs, the average read depth was calculated for all nonoverlapping 10kb windows of the genome for each sample. Because the average coverage differed between samples, windows were standardized by dividing by the median coverage for all windows within that sample. Windows from the six halleri subsamples were further averaged to create a depth reference, with the subsequent comparative genomic analysis comparing all six halleri subsamples and two gemmifera subsamples to this reference. Regions with a higher or lower number of reads in gemmifera than in all halleri subsamples were identified by comparing halleri subspecies samples to the depth reference and setting a 0.5% cutoff at both ends of the distribution of ratios. Genomic regions where the depth ratio was found beyond these cutoffs in both gemmifera subsamples were considered significant.
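The window-level steps of this depth analysis can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scripts used in this work: the function names, the list-of-lists input layout (one list of per-window mean depths per sample), and the way the empirical cutoffs are picked from the sorted halleri ratios are all hypothetical choices, and windows with a zero depth reference are simply skipped.

```python
from statistics import median


def normalize_windows(depths):
    """Divide each window's mean depth by the sample's median window depth."""
    med = median(depths)
    if med == 0:
        raise ValueError("median window depth is zero")
    return [d / med for d in depths]


def tail_cutoffs(ratios, tail=0.005):
    """Empirical cutoffs leaving roughly `tail` of the ratios at each end."""
    ordered = sorted(ratios)
    k = int(len(ordered) * tail)
    return ordered[k], ordered[len(ordered) - 1 - k]


def significant_windows(halleri_raw, gemmifera_raw, tail=0.005):
    """Return (window index, direction) for windows where every gemmifera
    sample lies beyond the cutoffs of the halleri ratio distribution."""
    halleri = [normalize_windows(s) for s in halleri_raw]
    gemmifera = [normalize_windows(s) for s in gemmifera_raw]

    # Depth reference: per-window average of the normalized halleri samples.
    n = len(halleri)
    reference = [sum(col) / n for col in zip(*halleri)]
    usable = [i for i, r in enumerate(reference) if r > 0]

    # Background distribution: halleri-to-reference ratios.
    background = [s[i] / reference[i] for s in halleri for i in usable]
    low, high = tail_cutoffs(background, tail)

    hits = []
    for i in usable:
        ratios = [g[i] / reference[i] for g in gemmifera]
        if all(r > high for r in ratios):
            hits.append((i, "higher"))
        elif all(r < low for r in ratios):
            hits.append((i, "lower"))
    return hits
```

Requiring every gemmifera sample to exceed the cutoff mirrors the rule that a window counts only when both gemmifera subsamples agree; a real analysis would also need to handle unequal window counts and missing data.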
Extended regions of increased depth coverage in gemmifera were identified by scanning for 200kb windows in which at least 19 out of 20 10kb-regions (95%) had higher depths in both gemmifera subsamples than in all halleri subsamples. These differences also needed to be significant for at least 10 regions (50%). Overlapping windows were collapsed to identify the entirety of these extended regions and were considered part of the same CNV.

Sequence analysis of molecular evolution

Analysis of molecular evolution was performed using PAML [180], which implements several programs in a maximum likelihood framework to select the best model for the data. Pairwise alignments of gene sequences between AL and AH orthologs were performed using ClustalW [89] to identify the codon frame, removing codons that contained insertions. These alignments were analyzed with the "codeml" program of PAML to calculate measures of molecular evolution, such as total divergence, synonymous/nonsynonymous mutation frequencies, and Ka/Ks ratios relative to the AL gene reference.

4.3 Results

4.3.1 Mapping of AH reads to the AL reference

In order to understand the evolutionary changes that occur upon allopolyploidization, it is important to first identify genetic differences between parental species. Illumina sequenced reads from eight samples of two subspecies of Arabidopsis halleri (AH) were mapped to the Arabidopsis lyrata (AL) genome. The pooled data from all samples had a total of 185M paired-end reads that were 76 bases long, of which 143M (77%) were mapped across 174M bases (84% of the reference). When broken down into the two subspecies, halleri and gemmifera, the percentage of reads that were mapped was similar. Species-specific polymorphisms fixed in AH were called when they were supported by more than 95% of reads at these positions in the pooled sample, and totaled 1,692,614 nucleotide substitutions, 55,150 single-base deletions, and 43,146 insertions.
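The 95% support rule for fixed polymorphisms can be illustrated with a short Python sketch. It is a hedged example, not the pipeline used here: the function names, the pileup representation (a string of read bases per position), and the restriction to single-base substitutions are assumptions made for illustration only.

```python
from collections import Counter


def call_fixed_variant(ref_base, read_bases, min_fraction=0.95):
    """Return the non-reference base if more than `min_fraction` of the
    reads at this position support it, otherwise None."""
    depth = len(read_bases)
    if depth == 0:
        return None
    allele, count = Counter(read_bases).most_common(1)[0]
    if allele != ref_base and count / depth > min_fraction:
        return allele
    return None


def apply_substitutions(reference, pileups, min_fraction=0.95):
    """Build a modified reference by substituting fixed variants.

    `pileups` maps a 0-based position to the read bases observed there
    (hypothetical input format; indels are not handled in this sketch).
    """
    seq = list(reference)
    for pos, bases in pileups.items():
        alt = call_fixed_variant(seq[pos], bases, min_fraction)
        if alt is not None:
            seq[pos] = alt
    return "".join(seq)
```

For example, a position covered by 40 reads of which 39 carry G against a reference A is called as fixed, whereas 19 of 20 reads (exactly 95%) is not, because the threshold is strict.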
These fixed polymorphisms were incorporated into the AL reference and all Illumina reads were remapped to identify halleri- and gemmifera-specific polymorphisms, called when they were supported by more than 95% of reads at these positions in their respective subspecies. There were 42M reads among all samples that failed to map to the reference. More than 7M reads were used by Velvet to construct 46,803 contigs, ranging in length from 300 to 6,526 bases, with N50 and L50 being 15,762 contigs and 464 bases, respectively. BLAST algorithms were used to map and anchor de novo contigs back to previously unmapped regions in the reference, then to AL protein sequences, identifying 714 sequences with similarities to full-length genes. 415 of these 714 sequences had similarities to genes annotated as transposable elements, such as retrotransposons or reverse transcriptase, which can be highly diverse and occur in high copy numbers [10]. 172 sequences were similar to NB-LRR disease resistance genes, which are known to be part of a large family and have high sequence variability [112]. Contigs were also aligned to Arabidopsis thaliana and Genbank protein sequences to identify an additional 1,568 genes not found in the AL genome. Most were classified as transposable elements or disease resistance genes.

Table 4.1: Summary of Illumina data analysis in AH.

                                          AH halleri     AH gemmifera
Total reads                               142,104,943    42,907,300
Total reads, mapped                       109,769,740    33,568,570
Total reads, mapped (percentage)          77.2%          78.2%
Total reads, assembled into contigs       5,472,629      1,733,716
Total de novo contigs                     36,303         10,500
Max Length                                4,412          6,526
Median Length                             394            412
N50                                       8,317          4,290
L50                                       451            525

De novo genes identified (pooled contigs)
- from searching Arabidopsis lyrata                       714
- from searching Arabidopsis thaliana and Genbank         1,568
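The contig summary statistics reported for the Velvet assembly can be computed as in the following Python sketch. It is only an illustration with a hypothetical function name. Note that naming conventions differ: the text above reports the contig count as N50 and the contig length as L50, whereas many tools use N50 for the length and L50 for the count, so the function returns both values as a plain (length, count) pair.

```python
def assembly_half_point(lengths):
    """Return (length, count) at the half-assembly point.

    Contigs are sorted from longest to shortest and summed until the
    running total reaches half of the total assembled bases. `length`
    is the size of the contig that crosses that point and `count` is
    the number of contigs needed to reach it.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contigs given")
```

With contig lengths 2, 2, 2, 3, 3, 4, 8 and 8 (32 bases in total), the two longest contigs already cover 16 bases, so the function returns a length of 8 and a count of 2.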
Interestingly, the Gene Ontology categories Transmembrane Receptor Activity (p < 5.8e-20), Signal Transducer Activity (p < 2e-13), and Metal Ion Transporter (p < 6e-12) were significantly overrepresented among all genes found within contigs assembled from unmapped reads. These results are notable considering that AH has shown constitutive tolerance to Zn and Cd [80, 86, 108, 109, 125-128, 134, 165], and that these genes, identified only in AH, may play a role in this adaptation.

4.3.2 Structural variation

Inspection of mapped reads across the genome shows 144,383 and 361,214 contiguously mapped segments in the halleri and gemmifera references, respectively. These segments were separated by genomic sequences that were not mapped across, possibly due to structural changes in the genome. To search for these changes, we used VariationHunter to identify regions within the genome that were rearranged. Using paired-end information, VariationHunter searched for paired reads that mapped to the ends of two different segments and considered their mapping orientation to identify the type of rearrangement. Within the halleri subspecies, VariationHunter identified 9,458-12,071 translocations among the six population samples, where genomic regions were moved around but kept in the same orientation. There were 1,040 translocations common to all six populations. The program also identified 721-1,115 transversions, where a segment of DNA was reoriented. Only 33 of these were detected in all six populations. For all rearrangements, the median distances between contigs were 1,949 and 108,307 bases for translocations and transversions, respectively. For the gemmifera subspecies, we also identified 3,462 and 937 translocations and transversions, with the median distances being 5,472 and 255,393 bases, respectively. BLAST alignments were also used to characterize structural variation.
Together with the contiguously mapped genomic segments in the halleri or gemmifera references, unmapped reads were used by Velvet to construct de novo contigs. Chimeric contigs were identified when one portion of the contig was uniquely mapped to one genomic segment, while an adjacent portion was uniquely mapped to another genomic segment. These alignments provide stronger evidence for scaffolding of genomic segments since these segments themselves are mapped and connected on the same contiguously assembled contig. We identified and scaffolded 37,385 and 28,051 contig-pairs in halleri and gemmifera, respectively, with 308 and 984 contig-pairs also found in results reported by VariationHunter. It is interesting to note that the median distances between joined contigs in the same orientation were only 40 and 31 bases in halleri and gemmifera, respectively, compared to 1,949 and 5,472 as reported by VariationHunter. This is because scaffolding via Velvet can join contigs adjacent to each other on the reference sequence, whereas VariationHunter requires that they are farther apart than 580 bases. It is important to note that while VariationHunter considers the reference sequence and positions where reads are mapped, Velvet only uses the sequences of the genomic segments and the unmapped reads to assemble, without considering their location in the genome.

Table 4.2: Structural variants in AH. Numbers in parentheses indicate median distances of translocations or transversions.

                                          AH halleri               AH gemmifera
Translocations using PE reads             12,071 (1,949 bases)     3,462 (5,472 bases)
Transversions using PE reads              1,115 (108,307 bases)    937 (255,393 bases)
Translocations using chimeric contigs     35,717 (40 bases)        26,768 (31 bases)
Transversions using chimeric contigs      1,668                    1,283
4.3.3 Depth coverage of the gemmifera and halleri subspecies

To compare the genomic content of all eight AH samples, read depths were averaged across 10kb-regions and normalized against the average read depth of the corresponding region in the depth reference. Regions across the genome where the depth is higher may reveal genes that have been duplicated in one subspecies. Genome-wide scans of depth show 4,847 10kb-regions where both gemmifera samples were mapped with a higher (4,529) or lower (318) number of reads than in all halleri subspecies samples. These regions were identified by comparing halleri subspecies samples to the depth reference and setting a 0.5% cutoff at both ends of the distribution of ratios.

Figure 4.1: Depth ratio of genome-wide 10kb-regions in AH normalized against the average read depth of the corresponding regions in the depth reference (x-axis: ratio of 10kb depth). Black lines represent halleri subspecies. Red lines represent gemmifera subspecies.

Table 4.3: Regions of increased depth coverage in AH gemmifera.

Scaffold      Position             Average Fold Change
Scaffold 1    6.33M - 6.64M        1.64
              17.85M - 18.47M      3.00
              29.80M - 30.07M      4.37
              30.10M - 30.73M      3.40
              30.84M - 31.36M      2.69
              31.68M - 32.10M      4.88
Scaffold 3    17.59M - 18.02M      3.42
              18.57M - 18.91M      2.38
              19.76M - 20.18M      2.58
              20.20M - 20.49M      4.28
              23.11M - 23.39M      2.57
              24.06M - 24.47M      3.45
Scaffold 4    3.70M - 4.00M        2.53
              5.73M - 5.96M        2.93
              6.76M - 7.03M        4.28
Scaffold 5    0.00M - 0.28M        2.07
              5.85M - 6.11M        3.29
              6.84M - 7.26M        3.04
              18.81M - 19.08M      1.85
              20.97M - 21.22M      1.64
Scaffold 6    13.36M - 13.56M      3.02
              17.20M - 17.63M      2.49
              18.22M - 18.76M      2.68
Scaffold 7    1.24M - 1.54M        2.09
              8.72M - 9.02M        1.94
              17.82M - 18.06M      2.79
              18.53M - 18.85M      2.97
              23.28M - 23.59M      2.24
Scaffold 8    5.40M - 5.66M        3.12
              5.89M - 6.24M        2.62
              6.65M - 7.43M        3.88
              10.59M - 11.04M      4.07
              22.77M - 22.96M      2.21
Scaffold 9    0.44M - 0.64M        3.36
Genomic regions where the depth ratio was found at these extremes in both gemmifera samples were considered significant (Figure 4.1). Among 10kb-regions classified as significant, six had zero reads mapped in both gemmifera samples, and only one had zero reads mapped in all halleri samples, indicating a deletion in either subspecies. We identified two regions of at least 500kb among these 4,847 10kb-regions (Chr1: 29.81M - 30.73M and Chr8: 6.57M - 7.58M) whose depths were ≥3.40X higher in gemmifera when compared to halleri. We also identified 34 regions of at least 200kb across 9 chromosomes/scaffolds whose depths were on average 2.95X higher in gemmifera, suggesting that these regions may have been duplicated. As shown in Table 4.3 and Figure 4.2, these regions are clustered on several chromosome arms. In addition to chromosome regions, this analysis of depth coverage was extended to the individual gene level, identifying 1,800 genes with a higher or lower number of reads mapped in all gemmifera samples when compared to all halleri samples; 275 genes had zero reads mapped in either all gemmifera or all halleri samples. Due to the presence of gene paralogs, it is not straightforward to conclude that a gene itself has been duplicated merely based on depth. It is, however, possible to identify gene families whose copy numbers have changed, as well as the types of biological processes and functions in which these genes are implicated.

Table 4.4: Number of genes with increased depth coverage in AH.

                                 All      Genes identified from    Genes identified from
                                          Arabidopsis lyrata       Arabidopsis thaliana or Genbank
Biased towards halleri           290      220                      70
  Only found in halleri          239      172                      67
  Also found in gemmifera        51       48                       3
Biased towards gemmifera         1,510    1,392                    118
  Only found in gemmifera        36       25                       11
  Also found in halleri          1,474    1,367                    107
Analysis of the 1,510 genes having a higher number of reads in gemmifera shows enrichment of several Gene Ontology categories, including Photosynthesis (p < 2.7e-6), Electron Transport (p < 5.3e-5), Copper Ion Binding (p < 0.001), and Metal Ion Binding (p < 0.002). The 290 genes with a higher number of reads in halleri show enrichment of Transmembrane Receptor Activity (p < 2.9e-7) and ATP Binding (p < 0.0002).

Figure 4.2: Genome-wide analysis of AH gemmifera across 8 chromosomes, displaying several patterns and statistics. Outermost bars represent 8 chromosomes in different colors. Next two tracks (going inwards) show histograms and heat map of read depth coverage. Next two tracks represent heterozygosity and Ka/Ks ratios. Dark blue indicates high values. Next track shown in colors of the 8 chromosomes shows density of structural variants described in Table 4.2. Inner circle of lines shows intra-chromosomal links, where two segments of DNA from two different chromosomes are adjacent to each other in the gemmifera genome.

Figure 4.3: Genome-wide analysis of "8292" AH halleri population sample across 8 chromosomes, displaying several patterns and statistics. Outermost bars represent 8 chromosomes in different colors. Next two tracks (going inwards) show histograms and heat map of read depth coverage. Next two tracks represent heterozygosity and Ka/Ks ratios. Dark blue indicates high values. Next track shown in colors of the 8 chromosomes shows density of structural variants described in Table 4.2. Inner circle of lines shows intra-chromosomal links, where two segments of DNA from two different chromosomes are adjacent to each other in the halleri genome. See plots of additional halleri samples in Appendix C.

4.3.4 Sequence analysis and variation

The data from AH were further analyzed to characterize genome-wide sequence variation and polymorphism.
Overall, compared to AL, substitution rates and Ka and Ks were smaller in the halleri subsamples than in the gemmifera subsamples (p < 2.2e-16), although Ka/Ks ratios were similar in both subsamples (p < 0.62) (Figure 4.4). Within the 34 regions of at least 200kb with increased depth coverage in gemmifera (Table 4.3, ranging from 1.6X - 4.9X), polymorphism and heterozygosity increased compared to the genome-wide average (Wilcoxon Rank Sum Test: p < 1.3e-06), consistent with the presence of additional chromosome duplicates homologous to these regions [148]. We also identified the genes within these regions and found a decrease in the rate of synonymous nucleotide substitutions compared to the entire genome (p < 1.0e-04). Interestingly, there was no difference in the rate of non-synonymous substitutions for genes within these regions (p < 0.73), leading to higher Ka/Ks ratios (Figure 4.4, shown in red), consistent with positive selection acting for amino acid changes in these regions with newly duplicated gene pairs.

Figure 4.4: Nucleotide substitution rates of genes in Arabidopsis halleri, comparing halleri subspecies (AH) with gemmifera subspecies (AG). Panels: Comparison of Ka, Comparison of Ks, Comparison of Ka/Ks. Genes within regions of increased depth coverage are shown in red.

4.4 Discussion

With the availability of the Arabidopsis lyrata (AL) genome and its gene annotation, the attention here focuses on the sequencing of Arabidopsis halleri (AH) and identifying species-specific markers to distinguish between the two species. Reads from Illumina sequencing of AH were mapped to 84% of the AL genome, identifying 1,692,614 fixed nucleotide substitutions, 55,150 single-base deletions, and 43,146 insertions, when comparing AL and AH.
While there are undoubtedly polymorphisms that have not been identified, such as in regions that are fast evolving, using conservative estimates, the number of diagnostic nucleotides averages 36 per gene, which is sufficient to identify homeologs that have been lost in an allopolyploid genome in later analyses. The percentage of reads that were mapped using the reference of a sister species is impressive considering the 6% divergence between both species. In a similar analysis, 89% of Illumina reads from AL mapped to the reference, which is higher than the 77% mapped here using Illumina reads sequenced from AH. This decrease in mapping percentage is also consistent with the AH genome being larger than the AL genome. Sequences specific to AH would not map to the AL reference and could account for a portion of the 23% of reads that were unmapped. The use of the AL genome, nevertheless, enabled the identification of diagnostic nucleotides specific to AH. Of the 32,670 genes annotated in the AL genome, 31,863 genes (98%) were mapped onto by reads from AH and contained diagnostic nucleotides, indicating that a large portion of the AL genic content was covered. Sequenced reads that could not be initially aligned to the reference were assembled into contigs. Since they are longer than short reads, these contigs could be mapped to regions in the reference using BLASTN. Nearly all contigs that mapped to the reference exhibited sequence variation that would have prevented short reads from being mapped across, such as long insertions or deletions, as well as a high number of mismatches. These contigs were used to fill in gaps throughout the reference that were not covered by short reads. The remaining contigs were scanned for full-length ORFs using TBLASTX searches to genes from the AL reference. Since these genes from AH already had a significant number of reads mapped to them, these ORFs may represent gene duplicates in AH.
It is important to note that some of these samples were derived from a pool of individuals and that these sequences could represent different alleles of the same gene, although this is extremely unlikely since different alleles would have also aligned to the gene during the short read mapping. One of the issues with using the AL reference is that de novo-assembled sequences not found in AL cannot be easily characterized in AH. This is evident in the thousands of contigs that had no similarities to other sequences and had to be disregarded. It is also possible that these contigs do not contain CDS sequences and/or that these contig sequences have extremely high divergence between AL and AH. Using only the unmapped reads, 2,282 full-length ORFs were identified, with most having similarities to transposable elements and disease resistance genes. This is not unusual as they can occur in high copy numbers [10, 112]. Only 738 ORFs remained after filtering for these sequences. Among newly identified ORFs in the AH species, many were implicated in functions involved in transmembrane receptor activity and tolerance to Zn and Cd, which supports previous work showing constitutive adaptation to metal-polluted soil throughout this species [80, 86, 108, 109, 125-128, 134, 165]. In the analysis of Illumina reads from AL, only seven ORFs were identified from the unmapped reads; five had sequence similarities to transposable elements. The use of contiguously mapped regions in the reference (long sequences) along with the unmapped reads (short sequences) also provided some information on the structure and synteny of the chromosome. Scaffolds assembled from these long and short sequences bridged together regions that were adjacent to each other on the reference. In some instances, sequences annotated as being far from each other on the reference were connected on the same scaffold.
Unfortunately, most of these chimeric scaffolds connected contigs that were already adjacent on the reference rather than distant regions. This analysis identified numerous megabase-length sequences whose subsequences were in the same colinear order on the genome reference; the presence of these long sequences indicates that there were no structural rearrangement events within these regions in the genome. Nevertheless, identifying actual rearrangements is not exactly the opposite of identifying contiguous regions. The ability to identify structural variants requires sufficient coverage as well as the right sequence and gene composition, and plant genomes are notorious for having many duplication events. Since these conservative approaches only use uniquely mapped contigs when identifying structural variation, these results may reflect the lack of power when using short reads. Many assembled genomes that rely purely on short reads have routinely ended up with a huge number of contigs separated by large gap regions, without really being able to place these sequences on the correct chromosome of the actual species under study [7, 139]. In future work, using the AL reference in combination with Sanger sequences with lengths in the tens of thousands, or even Illumina reads with variable gap lengths, would significantly improve the ability to find structural variants. To compare the two subspecies of AH, the depth ratio within the halleri subspecies samples was used as the background. Once a cutoff was chosen, it was used to identify regions that were lost in either the gemmifera or halleri subspecies. It is interesting to note that the average halleri-to-halleri variance in the ratio distribution (0.44) is larger than the gemmifera-to-gemmifera variance (0.16).
Although both subspecies are outcrossers [37, 111], this increase in variance is expected since the halleri samples were taken from pooled populations while the gemmifera samples were taken from single individuals within the same population. This increase in variance when comparing gemmifera to halleri may result from gains and losses of gene orthologs in gemmifera. It is also worth noting that while the distribution of normalized depth ratio is nearly symmetrical among the halleri subspecies, this distribution is heavily skewed towards the right in the gemmifera subspecies. These results are consistent with the gemmifera subspecies having an even larger genome than the halleri subspecies, with sequences from these additional genomic regions also mapping to similar regions in the genome reference. Along with seven 10kb regions unmapped in either subspecies, 4,847 other regions exhibited higher mappings in one subspecies over the other. The disparity in genome size is even more pronounced considering that 4,529 of these 4,847 regions had more reads mapping in the gemmifera subspecies. The segmental duplication of entire chromosome regions would drastically increase the number of reads mapped at these sites, a pattern that was observed. We identified two regions of at least 500kb and 34 regions of at least 200kb across 9 scaffolds (Table 4.3 and Figure 4.2), all of whose depths are significantly higher in gemmifera when compared to halleri. It is sequences homologous to these regions that contribute to the additional genetic material in the gemmifera genome. There were 1,800 genes that exhibited a higher number of reads in one subspecies over the other. Most of these were found in the gemmifera subspecies, again consistent with it having a larger genome. Of the 290 with a higher number of reads in halleri, 239 were found only in halleri; i.e., most genes that were biased towards halleri were only found in halleri.
On the other hand, only 36 of the 1,510 genes biased towards gemmifera were found exclusively in gemmifera, indicating that most genes were also found in halleri. Interestingly, when compared to the average depth in their respective genomes, the depth coverage of gemmifera-biased genes was high in gemmifera, but was found to be similar to the average in halleri, confirming that the higher number of reads in gemmifera is mostly due to additional duplicates of these genes in gemmifera, rather than a loss in halleri. It is interesting to note that genes within these regions are undergoing positive selection, having higher Ka/Ks ratios than across the rest of the genome (as well as across the halleri subspecies). This could be a result of neofunctionalization [113], occurring when gene duplicates adopt amino acid changes that ultimately cause a change in function of the gene duplicate. Sequence analysis of both genomes shows that substitution rates are lower in the halleri subspecies than in the gemmifera subspecies. This may reflect differential evolution acting on the genomes of both subspecies. However, this difference may result from an ascertainment bias since the halleri samples were taken from pooled individuals, leading to larger sequence variation and resulting in fewer genome positions called as fixed polymorphisms. This is similar to regions in gemmifera having increased depth coverage, with the average number of diagnostic nucleotides decreased to 28, compared to the genome-wide average of 36. Although we observed an increase in SNP frequency and heterozygosity within these regions, it is at the cost of losing diagnostic nucleotides.
Nevertheless, while these genes probably have additional paralogs mapped onto them, there is still a considerable number of fixed polymorphisms that could only have resulted from an early gene conversion event following the split of AH and AL but before the split of the halleri and gemmifera subspecies, followed by gene duplication events occurring after the halleri-gemmifera split. The identification of diagnostic nucleotides serves its purpose in determining the parental origin of sequences isolated from an allopolyploid genome.

4.5 Conclusions

The Arabidopsis halleri genome was characterized through Illumina resequencing of two of its subspecies, gemmifera and halleri. Almost 80% of sequenced reads were mapped onto 95% of the genome reference, identifying l. SM fixed polymorphisms between Arabidopsis halleri and Arabidopsis lyrata. These polymorphisms will serve as diagnostic nucleotides in the future when analyzing allopolyploid genomes for parental-specific gene loss and retention among 32K genes. The de novo assembly of unmapped reads also identified an additional 2,000 genes, some having similarities to those implicated in transmembrane receptor activity, although most were similar to transposable elements or disease resistance genes. While the gemmifera and halleri subspecies were similar in many ways, there exist some profound differences: analysis of genomic read depth identified 34 regions in the gemmifera genome that may have been duplicated in the lineage after the gemmifera-halleri split, consistent with its increased genome size. Genes found in higher amounts in gemmifera were enriched for different biological processes and functions compared with those having higher amounts in halleri. Not surprisingly, genes that were found in higher amounts in gemmifera had a higher number of reads than other gemmifera genes, confirming that this is mostly due to duplication of these genes in gemmifera, rather than a loss in halleri.
Genes that were found in higher amounts displayed an increase in Ka/Ks ratios, consistent with positive selection acting for amino acid changes on these gene duplicates.

Chapter 5

Genomic characterization of three Arabidopsis kamchatica genomes

Chang PL, Steets JA, Wolf DE, Takebayashi N, Nordborg M, Nuzhdin SV: Sequencing of three Arabidopsis kamchatica genomes. In Prep.

5.1 Introduction

In Stages 2 and 3 of allopolyploidization, diverged species hybridize and increase ploidy, with the two events sometimes reversed in order [151]. This change in ploidy enables the correct pairing at meiosis and in the short term may be beneficial by stabilizing the positive effects of hybridization, rather than conferring immediate benefits through genome duplication [31]. Newly created polyploids often experience rapid intragenomic adjustments, and hybridization frequently results in phenotypic instability, widespread genomic rearrangements, epigenetic silencing, and unusual splicing [2, 3, 36, 46, 48, 81, 90, 95, 100, 106, 120, 121, 130, 152, 171, 173]. Stages 2 and 3 are well-studied with artificial polyploids constructed in the laboratory [2, 36, 46, 81, 90, 94, 100, 130, 152, 171] or spontaneously arising in nature [3, 48].

Stage 4 is the long term evolution of homeologous genes (i.e., homologous genes from two parents joined into one polyploid genome and stably inherited). When allopolyploids are formed, two pairs of homeologs find themselves in the company of diverged counterparts with similar functional roles. There could be three potential outcomes.

1) If all four alleles are identical in function, the homeologs can become redundant. Purifying selection is relaxed and accelerated accumulation of now-neutral mutations is expected. This process would be less prominent for haploinsufficient genes, or for genes strongly constrained for precise transcript levels [82]. This redundancy could also promote accumulation of advantageous mutations [58].
2) Genes with strong interspecific differences between homeologs may have diverged due to selection relaxation, either in one parental species or both. They may continue to accumulate deleterious mutations in the polyploid.

3) Genes with strong interspecific divergence between homeologs may also have diverged due to directional selection in one or both parental species, and they may continue to do so in the polyploid.

Stage 4 occurs much more slowly on the evolutionary time-scale, with interesting patterns having been reported [1, 25, 26, 48, 129]. Notably, the retention and expression of homeologs is frequently biased towards one parental species, as seen in cotton [33, 49, 132], maize [140-142, 178] and Tragopogon [27, 28, 30].

In this work, we studied the long term evolutionary effects of genome duplication on gene duplicates in Arabidopsis kamchatica (AK), an allopolyploid resulting from a recent duplication event for which both parental species have been identified. Because Arabidopsis has served as one of the premier plant model systems, there is a wealth of information on network relationships and expression patterns for many genes and pathways. AK has two parental diploids that have been well studied in the literature. Paternal Arabidopsis lyrata (AL) is a perennial outcrossing species [37] that occurs under a variety of climatic and ecological conditions, but is most often cold-tolerant and grows in low competition habitats [38, 78]. It has been identified in North America and Europe, growing in patchy regions extending from Central Europe to Norway [78]. Maternal Arabidopsis halleri (AH) is also a perennial outcrossing species [111] that has served as one of the best model systems for heavy metal tolerance and accumulation in plant species [9, 53, 80, 86, 108, 109, 125-128, 134, 165], having a preference for high altitudes and harsh soil conditions. Both species have provided substantial contributions to the plant community.
The AK, AL, and AH system, combined with high-throughput sequencing, offers an opportunity to address the long term evolutionary effects of genome duplication, including the accumulation of new beneficial coding and regulatory mutations for each homeolog. These changes may be due to selection relaxation and adaptation to a range of environments across multiple populations. It also provides general insights into the molecular underpinning of environment-related phenotypic plasticity.

Figure 5.1: Geographical distribution of AK, AL, and AH in the Northern Hemisphere (map legend: gemmifera, halleri, kamchatica, lyrata, petraea). AH is classified into two subspecies, and is generally found in Europe and East Asia. AL is classified into two subspecies, and is generally found in Europe and North America. AK is also classified into two subspecies, and is generally found in East Asia and North America.

5.2 Materials and Methods

Plant material and Illumina library preparation

Genomic DNA was prepared from leaves of three individuals of Arabidopsis kamchatica (AK) obtained from repository stocks at the University of Alaska at Fairbanks, originally collected from Japan, Alaska, and British Columbia. These three regions were chosen to represent the geographical distribution, and because the genome size of AK differs between the Japanese and North American subspecies. The Arabidopsis kamchatica ssp. kawasakiana from Japan is reportedly 617Mb while the two AK subspecies from North America are reportedly 549Mb. Illumina library construction was performed according to standard protocols and each sample was sequenced across two flow cell lanes.

Illumina data analysis

Paired-end 76-base Illumina reads were aligned using mrFAST Version 0.5.7 [6] to both the Arabidopsis lyrata ssp. petraea and Arabidopsis halleri ssp. gemmifera reference sequences.
All alignments having up to six mismatches per end, where both ends were aligned with insert size less than 580 bases, were mapped. Whenever a read was aligned to one parental reference and not the corresponding homeologous region in the other parental reference, it was mapped to the first parent. If a read was aligned to the same homeologous region in both parental references, diagnostic nucleotides within the aligned regions were used to map the read. Reads were mapped to the parental reference whose diagnostic nucleotides matched those found within the read. When a read did not contain any diagnostic nucleotides (33.5%), or when a read exhibited diagnostic nucleotides from both parental references (3.4%), it was discarded and not used any further. Reads that did not align as pairs were aligned independently as single ends to all aligned regions and mapped to either parental reference based on the same mapping scheme as described above. De novo contigs were assembled using Velvet 1.0.15 [181], with a k-mer length of 47 bases. BLAST algorithms [8, 182] were used to annotate contigs having E-values less than 1e-5, and contigs with unique hits to previously unmapped regions of the references were anchored using customized Perl scripts. Genes covered by mapped reads and contigs were also characterized using customized Perl scripts. Reference sequences from other samples of both parental species were also used for the alignment of reads from AK. These samples include four from pooled populations of Arabidopsis lyrata ssp. lyrata collected throughout North America as well as six from pooled populations of Arabidopsis halleri ssp. halleri collected throughout Europe. See Appendix A for all data analyzed in this study.

Sequence analysis of molecular evolution

Analysis of molecular evolution was performed using PAML [180], which implements several programs in a maximum likelihood framework to select the best model for the data.
Pairwise alignments of gene sequences between homologs were performed using ClustalW [89] to identify the codon frame, removing codons that contained insertions. These alignments were analyzed with the "codeml" program of PAML to calculate measures of molecular evolution, such as total divergence, synonymous/nonsynonymous mutation frequencies, and Ka/Ks ratios relative to the AL gene reference.

5.3 Results

Arabidopsis kamchatica (AK) is an allopolyploid found in East Asia and North America originating from the hybridization of Arabidopsis halleri (AH) and Arabidopsis lyrata (AL). To study genome duplication and its effects on the evolution of homeologous genomes, three accessions of AK were sequenced using the Illumina sequencing platform. Over 120M reads were mapped onto the AH and AL references, with almost 80M reads containing unambiguous diagnostic nucleotides that classified them as being derived from either AH or AL homeologs. Using these reads and information from diagnostic nucleotides, we analyzed over 32K gene sequences and characterized their parental source in the AK genome.

5.3.1 Homeolog-specific genomic retention

For each of the three accessions of AK, we identified genes whose homeologs were not found within the allopolyploid genome. We defined a homeolog as being present when at least 80% of the parental-specific nucleotides in that homeolog were covered by sequenced reads. Conversely, we defined a homeolog as being lost when none of its diagnostic nucleotides were covered by any sequenced read. For the AK accession from Japan, of the 31,255 genes present in both pure parental genomes, we identified 705, 173, and 271 genes in which only the AH homeolog, only the AL homeolog, or neither, respectively, were present in the polyploid. In the AK accession from Alaska, 699, 174, and 264 genes were present from only the AH homeolog, only the AL homeolog, or neither, respectively.
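As an illustration only, the presence and loss definitions given above can be written as a short Python sketch. The function names, the input format (sets of diagnostic positions), and the "intermediate" label for homeologs that meet neither definition are assumptions introduced here for the example; the text states only that genes were characterized with customized Perl scripts.

```python
def classify_homeolog(diagnostic_positions, covered_positions, present_fraction=0.8):
    """Classify one homeolog from read coverage of its diagnostic nucleotides.

    present: at least `present_fraction` of diagnostic nucleotides are covered
    lost:    no diagnostic nucleotide is covered by any read
    Anything in between is labelled "intermediate" (a label assumed here).
    """
    diagnostic = set(diagnostic_positions)
    if not diagnostic:
        return "unclassifiable"  # no diagnostic nucleotides to evaluate
    covered = len(diagnostic & set(covered_positions))
    if covered == 0:
        return "lost"
    if covered / len(diagnostic) >= present_fraction:
        return "present"
    return "intermediate"


def retention_class(ah_call, al_call):
    """Combine the AH and AL calls into the three classes counted per accession."""
    if ah_call == "present" and al_call == "lost":
        return "only AH homeolog in polyploid"
    if ah_call == "lost" and al_call == "present":
        return "only AL homeolog in polyploid"
    if ah_call == "lost" and al_call == "lost":
        return "neither in polyploid"
    return "other"


# Hypothetical gene with ten diagnostic sites per homeolog.
ah = classify_homeolog(range(10), [0, 1, 2, 3, 4, 5, 6, 7])  # 8 of 10 covered
al = classify_homeolog(range(10), [])                        # none covered
print(ah, al, retention_class(ah, al))
```

Counting the returned class over all genes of one accession would give the per-accession totals reported in this section.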
In the AK accession from British Columbia, this trend continued, with 639, 159, and 265 genes present from only the AH homeolog, only the AL homeolog, or neither, respectively. This bias in loss against genes derived from AL is found in all three accessions, in which genes that were not retained as duplicates were four times more likely to be lost from the AL homeolog than the AH homeolog (Table 5.1).

Table 5.1: Homeolog-specific retention in AK.

AK sample          Only AH homeolog   Only AL homeolog   Neither
                   in polyploid       in polyploid       in polyploid
Japan              705                173                271
Alaska             699                174                264
British Columbia   639                159                265
All Accessions     270                48                 222

As AK has had multiple origins [147], we find it remarkable that the accessions exhibited similar patterns of gene loss. There were 272 AH-specific and 48 AL-specific genes found in the same class in all three accessions of AK, which is more than would be expected considering the demographic relationship of the accessions. Nevertheless, there was some variation, with some genes grouped into one class in one accession and another class in another accession.

The data presented above identified genes in which no reads were mapped onto their homeologs, indicating a complete loss of that homeolog. In cases where part of the homeolog remains, such as for pseudogenes, it may still display reads mapping onto its sequences. For each gene in the polyploid, we compared the number of reads mapped across diagnostic nucleotides in the pure parental genomes to their respective homeologs within the polyploid (Figure 5.2, left, shown for Japan sample). These results confirm that fewer reads are mapped in the polyploid, either due to gene loss or from a loss of sequence conservation during diploidization.

Figure 5.2: Dot plots of gene depth in AK Japan and its two parental species, AH and AL.

We also took the ratio of read counts between the AH
homeolog and the AH homolog in the pure parental species, as well as the AL homeolog to the AL homolog in the pure parental species, and compared these ratios to each other (Figure 5.2, right, shown for Japan sample; see Figures D.1 and D.2 in Appendix D for similar patterns in the Alaska and British Columbia examples). Distributions of ratios show genes with higher or lower read counts in the polyploid genome (Figure 5.3), compared to their respective parental species. While this ratio is symmetrical for AH, we find that it is skewed towards the left in AL, indicating an excess of gene loss for genes derived from AL. This pattern was consistently observed in all three samples of AK and is in agreement with our previous results where gene loss is more likely to occur for AL-derived homeologs.

Figure 5.3: Distribution of fold change in gene depth in AK compared to its parental species (x-axis: ratio of gene depth). Three solid lines represent the fold change of AH homeologs from AK Japan (red), AK Alaska (black), and AK British Columbia (blue) compared to the parental AH. Three dotted lines represent the fold change of AL homeologs compared to the parental AL.

To identify potential functional biases of genes whose copy numbers have changed in the polyploid, we took the 5% of genes found at each tail and searched for Gene Ontology biological processes that were overrepresented within these regions. In all three samples of AK, genes in the lower 5% (fewer reads in the polyploid) were enriched for processes involved in energy production, such as Photosynthesis (p < 1e-06), Generation of metabolites and energy (p < 4e-17), Electron transport (p < 3e-04), ATP metabolic process (p < 2e-04), and Oxidative phosphorylation (p < 0.03). These results are best explained considering that the genome size has doubled while the amount of chloroplast remained the same, hence the decreased number of reads observed for these plastid genes.
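The ratio-and-tails procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in this work: the function names are hypothetical, depths are assumed to be already normalized for sequencing effort, and genes with zero depth in either sample are simply skipped so that the logarithm is defined.

```python
import math


def log2_depth_ratios(polyploid_depth, parent_depth):
    """Return log2(polyploid depth / parental depth) for each gene.

    Both arguments map gene name -> read depth. Genes with zero depth in
    either sample are skipped (an assumption made for this sketch).
    """
    ratios = {}
    for gene, parent in parent_depth.items():
        poly = polyploid_depth.get(gene, 0)
        if parent > 0 and poly > 0:
            ratios[gene] = math.log2(poly / parent)
    return ratios


def depth_ratio_tails(ratios, fraction=0.05):
    """Return the genes in the lower and upper `fraction` of the distribution."""
    ranked = sorted(ratios, key=ratios.get)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


# Hypothetical depths for 20 genes: the parent is flat, the polyploid increases.
parent = {f"g{i}": 100 for i in range(20)}
polyploid = {f"g{i}": 10 * (i + 1) for i in range(20)}
lower, upper = depth_ratio_tails(log2_depth_ratios(polyploid, parent))
print("lower tail:", lower, "upper tail:", upper)
```

The gene lists from each tail would then be passed to a Gene Ontology enrichment tool, which is outside the scope of this sketch.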
Other processes overrepresented in the lower 5% include those involved in DNA geometric change (p < 2e-07) and Organ development (p < 0.002). This set also had a significant underrepresentation of genes involved in Regulation of gene expression (p < 0.001) and Regulation of biological processes (p < 0.003), which involve multiple pathways with multiple interactions, and whose loss may affect the efficiency of these regulatory processes [52]. Both of these processes were underrepresented for the AH-homeologous genome as well as the AL-homeologous genome in all three AK samples for genes that were found in fewer amounts in the polyploid.

Genes that were found in the upper 5% had more reads in the polyploid than in the pure parental lines, and were consistently enriched for those involved in Response to stress (p < 4e-04), Response to stimulus (p < Se-04), Immune response (p < 1e-04), and Defense response (p < Se-07). Polyploids have been found to tolerate a wide range of stress conditions and extreme environments, including salt and metal tolerance, and temperature and oxidative resistance [12]. Increased duplication followed by neo- or subfunctionalization of genes that play a role in these biological processes could be one mechanism by which polyploids respond to these environments, as was previously described in wheat [55, 91] and rice [155]. Four other processes of interest include Response to carbohydrate stimulus (p < 1e-05) and Cell cycle checkpoint (p < 0.002), which may include genes that play a role in how carbohydrate availability regulates the progression through the cell cycle during Cell differentiation (p < 2e-09) and Cell development (p < 4e-10). Strangely, while genes for ancient cellular functions are not usually retained in whole genome duplications, genes involved in the cell cycle are an exception [71, 72, 176].
For example, most cyclins underwent functional diversification and are represented by additional paralogs retained after a whole genome duplication prior to the radiation of vertebrates [71, 72]. Since polyploid genomes need to sort out homeologous chromosomes from homologous chromosomes in order to properly undergo mitosis, genes involved in these processes play a vital role [175]. Genes within these processes were heavily enriched and may have been duplicated in the polyploids, leading to the increased regulatory complexity seen in other eukaryotic systems [71, 72]. We also found that genes involved in Programmed cell death (p < 3e-22) were heavily enriched in this set, which may be a mechanism by which polyploid systems ensure proper replication by removing faulty cells.

Within these tails at either end, we did not find enrichment of genes implicated in any signal transduction pathway, consistent with existing hypotheses suggesting that the disruption of stoichiometric ratios of these genes leads to haploinsufficiencies [166, 167]. In fact, in the upper 5% tail, we found a slight underrepresentation of genes involved in Hormone-mediated signaling (p < 0.03).

To address the issue of gene connectivity, we used data from two Arabidopsis thaliana networks. One network was generated using a modified graphical Gaussian model (GGM) [99] and the other is a co-expression network generated using selective probes and chipsets (AGCN) [103]. For genes within any of these tail regions (2,495 genes), we found that their AT orthologs had a significant decrease in the number of network partners when compared to the entire genome (Wilcoxon rank-sum test p < 1e-17). When considered separately, for both tail regions in both parental homeolog comparisons in all three AK samples, we found a significant decrease in network partners and/or expression correlation coefficient for genes in 10 out of 12 tail regions.
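For readers unfamiliar with the Wilcoxon rank-sum test used for the network comparison above, the following Python sketch shows one standard form of it: the two-sided test with a normal approximation and a correction for ties, without continuity correction. The text does not state which implementation or options were used, so the function below and the partner counts in the usage example are illustrative assumptions only.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test.

    Uses average ranks for ties, a tie-corrected variance, and the normal
    approximation without continuity correction. Returns (U, z, p), where
    U is the statistic for sample x.
    """
    pooled = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 share one value and receive ranks i+1..j.
        average_rank = (i + 1 + j) / 2.0
        group_size = j - i
        tie_term += group_size ** 3 - group_size
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 0.0, 1.0  # all values identical: no evidence of a shift
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u, z, p


# Hypothetical numbers of network partners for tail genes and for other genes.
tail_partners = [1, 2, 2, 3, 4]
other_partners = [3, 5, 6, 8, 9, 12]
print(rank_sum_test(tail_partners, other_partners))
```

A negative z for the tail genes would correspond to the reported decrease in network partners relative to the rest of the genome.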
Similar to our previous work in Arabidopsis lyrata (Chapter 3), these results confirm that genes with copy number fluctuations in AK at these tail regions are typically found farther and more isolated from the network. Gene homeologs that have been lost or duplicated are constrained to nodes of lower connectivity and centrality, minimizing their dosage effects and allowing networks to more easily adapt to these changes.

Aside from any parental and functional biases, we wanted to understand how sequence divergence between the two pure parental species might have played a role in the evolutionary outcome of gene duplicates. We found that gene duplicates whose homeologs were lost had lower sequence divergence between parental AH and AL. Homeologs with little sequence divergence probably have very similar functions and can be interchangeable. As a result, unless they must be kept in dosage-sensitive amounts, the loss of one homeolog could be compensated by the other. Our results are in agreement with these hypotheses and consistent with genes in the lower 5% tail having Ka/Ks ratios similar to the rest of the genome and not undergoing differential selection.

5.3.2 Sequence evolution of homeologs

Allopolyploidization often brings together two parental species with different lifestyles. These differences in lifestyles and reproductive patterns can impose different levels of selection on both parental genomes outside of polyploids. Upon polyploidization, selection is relaxed and genes are free to accumulate additional mutations. After assigning sequenced reads from the polyploid to either parental reference, we identified nucleotide positions within the homeologous genomes that differed when compared to their respective parental genomes.
Using these sequences, we characterized sequence evolution of the AH and AL homeologs in the polyploid and found a significant increase in sequence divergence between homeologs within the polyploid, compared to the divergence of their corresponding parental orthologs in the pure parental species (Wilcoxon rank-sum test p < 5e-115). Interestingly, there seems to be a minimal yet significant correlation between the rate of sequence evolution for the AH-homeolog and the AL-homeolog in the polyploid, when both were compared to their respective parental orthologs. Nevertheless, when seen in the context of homeologous gene loss and duplication, genes whose homeologs in the polyploids had increased depth coverage had high Ka/Ks ratios compared to the rest of the genome (Figure 5.4, shown for Japan sample). This suggests that some genes duplicated after polyploidy are undergoing neofunctionalization [113], with positive selection acting for amino acid changes on these gene duplicates. On the other hand, genes with less depth coverage had ratios similar to the rest of the genome, which is likely to occur when genes are not undergoing selection. These results were consistent across all three AK samples (see Figures D.3 and D.4 in Appendix D for the Alaska and British Columbia samples).

5.3.3 Analysis using other parental genomes of AH and AL

Arabidopsis kamchatica may have been formed from multiple parental strains of AH and AL [147]. Results shown above utilized the AH gemmifera and AL petraea subspecies as references, which are believed to be derived from the parental lineage of AK. While we do not have other samples of the AH gemmifera or AL petraea subspecies, we do have Illumina reads and genomic sequences from six additional populations of AH halleri and four additional populations of AL lyrata sampled across Europe and North America.
Results using these alternative AH and AL as parental references were similar to those found using the original references. Table D.1 in Appendix D shows that 50-65% of genes having a loss of at least one homeolog using the AH gemmifera or AL petraea subspecies were also reported as such using the AH halleri or AL lyrata populations. We continued to record a bias in gene retention favoring the allele from AH, with GO categories similar to those outlined above, which is not too surprising considering that most of these genes were unchanged. We also found similar results when analyzing the sequences of homeologous genes within the polyploids. These similarities are attributed to the use of AH or AL species-specific nucleotides that served as diagnostic markers for Illumina sequences from AK. As such, they are robust to within-species variation found in AH and AL. Therefore, although AH gemmifera and AL petraea may differ from the actual parental subspecies of AK, the evolutionary patterns and conclusions were similar regardless of the exact parental subspecies used as references.

Figure 5.4: Analysis of homeologous sequence evolution in AK Japan. Top Left compares divergence between homeologs to divergence between parental species. Top Right compares Ka/Ks in AK-halleri to AK-lyrata. Ka/Ks for both homeologs were calculated in reference to their respective parental species. Best-fit linear regression line: y = 0.19x - 1.3, p < 8.2e-24, R^2 = 0.04. Both plots at bottom compare gene depth between homeologs and their respective parental species, at different values of Ka/Ks, for AH (left) and AL (right).
5.3.4 Analysis of chloroplast sequences

While allopolyploidization can occur in both parental directions, polyploids in the Arabidopsis genus only occur in one direction. Arabidopsis suecica was formed from the hybridization of a maternal Arabidopsis thaliana and paternal Arabidopsis arenosa, and synthetic allotetraploids hybridized in the lab from these parents are also unidirectional (see Chapter 2). We aligned chloroplast sequences of seven AH references, six AL references, and three AK references and used 3,766 polymorphic nucleotides to show that the samples segregate into either the AH or the AL clade (Figure 5.5). Within the AH clade, all six AH samples derived from pooled populations across Europe segregated together, while the three polyploid samples of AK segregated with AH gemmifera. These results are in agreement with existing data supporting the maternal relationship between AH gemmifera and AK. It is also interesting to note that the Alaska and British Columbia samples of AK segregate with each other and are monophyletic; bootstrapping scores indicate that 95% of resampled trees display this pattern. Considering these results, and that the second subspecies of AK, Arabidopsis kamchatica kawasakiana, is found exclusively within Japan, the sample from Japan may be a member of this subspecies, with the two samples of AK from Alaska and British Columbia representing the more commonly found Arabidopsis kamchatica kamchatica subspecies.

Figure 5.5: Maximum parsimony tree of Arabidopsis samples inferred from 3,766 polymorphisms within chloroplast sequences. AT and AL are based on the references taken from Arabidopsis thaliana Columbia ecotype and Arabidopsis lyrata, respectively. (Tip labels recoverable from the tree include AL, ALP, AL1, AL4, AL3, AL2, AK AK, AK BC, AH GEM, AK JP, AH 8296, AH 8297, AH 8292, AH 8294, AH 8295, AH 8293, and AT.)
We also note that the four AL samples derived from pooled North American populations cosegregated, with the AL petraea and AL lyrata subspecies serving as outgroups.

5.3.5 Simulation of Illumina data from AH and AL

In addition to comparing results using other subspecies of AH and AL, we simulated 25 million paired-end (PE) 76-base Illumina reads from the AH gemmifera and AL lyrata references at four different divergence levels. We mapped these reads back onto the AH and AL references and assigned them using the same mapping scheme as described above. These simulations allowed us to characterize any ascertainment or mapping bias that our procedures may have introduced, and to quantify the number of wrongly classified reads. These simulation results show a general increase in the number of reads correctly assigned as the number of allowed mismatches increases, with the expected decrease in specificity (Figure 5.6). When reads were simulated with higher divergence at 0.01, 0.05, and 0.10, the True Positive Rate is lower at 0 mismatches allowed, but rises to the same performance among all divergence levels as more mismatches are allowed. These results demonstrate the ability to assign reads from an allopolyploid, even when the divergence rate between the homeologous genome and the parental reference is high. By increasing the number of mapping mismatches allowed, the divergence of the homeologous genome and its parental reference can be overcome. Increasing the number of tolerated mismatches allows mapping of the reads to a gene, while diagnostic nucleotides permit the assigning of a read to either of the two homeologous genomes. For each gene, we looked at the ratio of reads mapped among the simulated reads to those from the pure parental samples.
One interesting point to note is that regardless of the divergence rate of the simulated genome, the number of reads that were mapped was nearly identical at 8 mismatches allowed, as suggested by the earlier True Positive and False Positive analyses where higher divergence can be overcome by increasing the number of mapping mismatches allowed. We also find that the distribution of simulated read depth among both parental references is symmetrical here (Figure D.5), which confirms that the earlier skewness towards the left in AL homeologs is not an artifact of the genome annotation but rather reflects actual loss of genes derived from AL.

Figure 5.6: ROC curves comparing True Positive and False Positive Rates of read simulation (x-axis: False Positive Rate, 1 - specificity). Red lines represent simulation results of AH gemmifera reads mapped back onto the AH gemmifera reference. Blue lines represent simulation results of AL petraea reads mapped back onto the AL petraea reference. Darkest lines represent 0% divergence, with subsequent lighter lines representing 1%, 5%, and 10% simulated divergence.

Finally, we identified SNPs that were reported in the simulated genomes as if they were from actual genomic samples. Compared to the divergence of parental AH gemmifera and AL petraea, there did not seem to be an increase in divergence between the simulated homeologous genomes when they were simulated without divergence. As the divergence of the simulated genomes increased to 0.01, 0.05, and 0.10, we observed the expected upward shifts (Figure D.6), indicating the presence of mutations within the homeologous sequences.
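To make the rates plotted in Figure 5.6 concrete, the sketch below computes a True Positive Rate and a False Positive Rate for one parental reference from simulated reads whose true origin is known. The definitions of positives and negatives used here are one plausible reading and are an assumption, since the text does not spell them out; the function name and the example counts are likewise hypothetical.

```python
def assignment_rates(records, parent):
    """Compute (TPR, FPR) for reads assigned to `parent`.

    `records` is an iterable of (true_origin, assigned) pairs, where
    `assigned` is "AH", "AL", or None for a read that was unmapped or
    discarded. Assumed definitions for this sketch:
      TPR = reads from `parent` assigned to `parent` / all reads from `parent`
      FPR = reads from the other parent assigned to `parent` / all such reads
    """
    tp = fn = fp = tn = 0
    for true_origin, assigned in records:
        if true_origin == parent:
            if assigned == parent:
                tp += 1
            else:
                fn += 1
        elif assigned == parent:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr


# Hypothetical outcome for 10 simulated AH reads and 10 simulated AL reads.
simulated = (
    [("AH", "AH")] * 8 + [("AH", None)] * 2
    + [("AL", "AH")] * 1 + [("AL", "AL")] * 9
)
print(assignment_rates(simulated, "AH"))
```

Repeating the calculation for each number of allowed mismatches and each simulated divergence level would trace out one curve per condition.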
Although these simulated sequences exhibited an increase in mutations, we did not observe selection taking place and did not find a correlation between Ka/Ks ratios and gene depth like that in the actual data (data not shown). These simulation results confirm that the patterns observed in the actual sequencing of all three samples of AK are due to homeologous sequences acquiring mutations, and that these mutations are not random, but are the result of differential selection acting on homeologous gene sequences.

5.4 Discussion

Our work here focuses on the genomic characterization of three samples of Arabidopsis kamchatica (AK), originally collected from Japan, Alaska, and British Columbia. To study genome duplication and its effects on the evolution of homeologous genomes, we sequenced their genomes using the Illumina sequencing platform and analyzed over 32K gene sequences.

5.4.1 Characterization of Illumina sequences from AK

Analysis of Illumina reads resulted in the mapping of over 120M reads out of a total of 150M sequenced reads. This 80% lies between the percentages of AH and AL reads sequenced from their pure parental species that were mapped onto their respective references (Chapter 4). Of the 120M mapped reads, only 80M reads contained a previously identified diagnostic nucleotide that permitted the classification of the read as being derived from the AH or the AL lineage. Distribution of gene depth in all three AK samples showed median coverages of 362-411 reads per gene, with the number of genes without any reads ranging between 1,067 and 1,154 among the three AK samples. We also found a strong concordance in the number of reads mapped between samples of AK; whenever a gene was covered with fewer reads in one AK sample, it was covered at similar levels in the other AK samples.
While there is some possibility that genes present in the genomes were not sequenced, we do not know of any part of the experimental procedure that would have produced a biological bias for a specific set of genes. We considered whether the use of both AH and AL parental references to analyze the Illumina data minimized any mapping bias. Though dependent on the divergence between parental species, the use of only one genome, such as the AL reference, may prevent the identification of gene sequences that have acquired numerous mutations in the AH lineage. To circumvent this, we previously sequenced AH and AL from several samples across Europe, Asia, and North America and used polymorphisms fixed and exclusively found within their lineages to identify AH- and AL-specific nucleotides. These diagnostic nucleotides serve to identify Illumina reads sequenced from the polyploids as derived from either AH or AL homeologs. We also varied the subspecies of AH and AL used as references, but did not see any qualitative differences in the results. Most of the genes and biological processes that were significant using the original parental references were also found to be significant using other references. We also used a set of simulated data to characterize the reads that were mapped back onto the genome reference and found a level of error that we were willing to tolerate (Figure 5.6). It is interesting to note that while our analysis was initiated with the AL reference, we found a bias in the three polyploid genomes against AL genes, which could not have resulted from a lack of available sequence information from the AL lineage.

It is also important to consider that homeologs within the polyploids may have diverged from their parental lineages. Mutations that occur in AK may prevent alignment to any of the reference genomes.
However, the divergence between parental AH and AL is estimated at 2 MYA, which is orders of magnitude older than the time at which allopolyploidization is believed to have taken place. Through our simulation studies, we find that even at a divergence of 0.10, we are still able to assign reads to the appropriate homeologs, based on nucleotides that differ between parental AH and AL and are fixed within their respective lineages. Under the infinite alleles model, it is unlikely that polymorphisms fixed as different nucleotides in AH and AL acquire mutations in the homeologs that switch their alleles. We were careful to only consider reads that mapped to a gene and displayed a diagnostic nucleotide signature derived from that parental reference.

Furthermore, we note that the gene-level analysis depends on changes in gene depth and not on the absolute number of reads. Therefore, we considered the number of reads mapped to a homeolog in relation to the number of reads mapped in the pure parental species, and used this as a proxy for gene loss and duplication. We were careful to note the difference between a gene not mapped by any diagnostic read (that is, covered only by reads without a diagnostic nucleotide) and a gene not mapped by any read.

We also identified homeologs in AK where an increase in the number of reads mapped to the AH homeolog coincided with a decrease in the AL homeolog, or vice versa. These patterns may be representative of gene conversion, where an AL allele is converted into an AH allele and the sequencing of the homeologs in the polyploid reflects this event. These patterns could also result from mischaracterizing a read derived from the AL homeolog as being from the AH homeolog. We evaluated numerous instances of these, and in all cases, the reads assigned to a parental reference clearly contained several diagnostic nucleotides that associated them with the correct parental reference.
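The read-assignment and depth-ratio logic described in this section can be illustrated with a minimal sketch. It is not the pipeline used in this work: the table of diagnostic positions, the function names, and the toy values are assumptions introduced only for illustration. A read is assigned to AH or AL only when every diagnostic position it covers points to the same lineage, and homeolog depth is then expressed relative to the depth observed in the pure parental species.

```python
from collections import Counter

# Hypothetical diagnostic table: gene position -> (AH nucleotide, AL nucleotide).
DIAGNOSTIC = {10: ("A", "G"), 25: ("C", "T")}


def classify_read(start, seq, diagnostic=DIAGNOSTIC):
    """Return 'AH', 'AL', 'ambiguous', or 'non-diagnostic' for one mapped read.

    start is the 0-based position of the read's first base on the gene.
    """
    votes = Counter()
    for offset, base in enumerate(seq):
        alleles = diagnostic.get(start + offset)
        if alleles is None:
            continue  # not a diagnostic position
        ah_base, al_base = alleles
        if base == ah_base:
            votes["AH"] += 1
        elif base == al_base:
            votes["AL"] += 1
        # a base matching neither allele casts no vote
    if not votes:
        return "non-diagnostic"
    if len(votes) > 1:
        return "ambiguous"  # diagnostic positions disagree
    return next(iter(votes))


def relative_depth(polyploid_reads, parental_reads):
    """Homeolog depth in the polyploid relative to the pure parental depth."""
    if parental_reads == 0:
        return None  # no parental coverage, so the ratio is undefined
    return polyploid_reads / parental_reads
```

Keeping "non-diagnostic" separate from "no reads at all" mirrors the distinction drawn above between a gene covered only by reads lacking a diagnostic nucleotide and a gene not mapped by any read.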
5.4.2 Patterns of homeologous gene loss and gain

Following polyploidization, diploidization reduces tetraploid genomes back towards a diploid state. However, polyploidy would not serve a major role in speciation if all gene duplicates were quickly removed. Data from a variety of ancient polyploids suggest that a much larger fraction of duplicated gene copies is retained than expected. We focused on genes that were present in both parental AH and AL lineages and identified nearly 900 genes whose homeologs were lost in either of the two homeologous genomes. The data also display an asymmetry, in which genes that were not retained as duplicates were four times more likely to be lost from the AL homeolog. Since this pattern is found in all three samples of AK, it is less likely that these results are due to environmental factors.

These patterns may represent mechanistic properties in which the direction of an allopolyploidization plays a large role. Although we did not analyze reciprocal polyploid crosses, our results are consistent with previous work in other plant species showing a bias against genes derived from the paternal genotype [152]. The patterns found in other plant species show an increase in genomic rearrangement and degradation in the paternal genome, presumably due to cytoplasmic incompatibility factors in the ovum [152]. These rearrangements may lead to instability in the paternal genome and, as a result, can affect the consistent expression of these genes [3, 36, 46, 48, 65, 81, 90, 95, 100, 106, 120, 130, 152, 171]. While this variation can lead to subfunctionalization in the early generations after polyploidization [2], it can also lead to repression of a gene and its eventual loss [140, 141]. In all three samples of AK, we observed a pattern of increased gene loss for homeologs derived from the AL parent.
Homeologs retained in the polyploid displayed the properties typically expected of retained duplicates, such as involvement in complex regulation and cell-cell signaling [52]. In addition, the hybridization and mixing of genomes that accompany allopolyploidization have the potential to result in inadvertent gene duplication as well. Several Gene Ontology biological processes were enriched in this latter set, including those involved in Cell cycle checkpoint and Programmed cell death. Polyploids in Drosophila and other insects display conserved body size, with the increased size of polyploid cells compensated by fewer cells [122]. Plant polyploids, however, are not limited in size and grow larger and faster than their diploid counterparts [122]. We detected a significant increase of reads in genes involved in these processes, which could play a role in how cells regulate their cell count and body size.

While genome duplication sometimes involves duplication of exact genes, polyploidization in Arabidopsis introduces gene sequences that have diverged for millions of years. As these gene orthologs diverge in sequence, they modify and fine-tune their functions [118] such that when they are brought back into the polyploids as homeologs, they are both required in their respective homeologous genomes and are more likely to be retained in the polyploid. Along these lines, genes that had little sequence divergence between parental AH and AL were less often kept as duplicates in the polyploid. We also found that these genes were not under any special selection, exhibiting Ka/Ks ratios similar to the rest of the genome, which is expected when the single-copy gene is not free to acquire deleterious mutations and must perform all of its functions [113]. This contrasts with genes that have been duplicated in the polyploid, which have significantly higher Ka/Ks ratios, representative of positive selection occurring in genes free to undergo neofunctionalization [113].
While we have described a functional rationale for how genes are retained, there could also be an adaptive component. AK kawasakiana collected from Japan has been shown to be metal tolerant. We identified three genes in the AK Japan sample whose AH-derived homeologs were retained and under purifying selection. This contrasts with both the Alaska and British Columbia samples, where the same genes have acquired numerous mutations. In general, while we have focused on evolutionary patterns consistent in all three polyploids, it would be important to evaluate the patterns that differ among the three, and perhaps identify local and environmental factors that may have affected these patterns. We were not able to identify genes associated with cold tolerance, possibly because all three samples of AK were found in temperate regions of the globe where temperatures are typically cooler.

5.4.3 Network evolution of homeologous genomes

One popular theory for the evolution of duplicated genes is the Gene Balance Hypothesis, where duplicates are retained because the removal of certain genes results in haploinsufficiencies [20, 21]. Inspection of the biological processes with increased or decreased read count resulted in a few GO terms that were relevant. However, when considering genes in the context of orthologous Arabidopsis thaliana networks, we find that genes lost or duplicated in the polyploid had a significant decrease in the number of network partners. Consistent with the Gene Balance Hypothesis, in our AK polyploids, we find that "connected" genes are not usually lost or duplicated. Central genes have a larger impact on the network, and fluctuations in their copy numbers would result in haploinsufficiency. These "connected" genes also exhibited higher expression correlation coefficients with their binding partners in the network.
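The connectivity comparison described above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis performed here: the edge list, the gene names, and the choice of the median degree as the summary statistic are all hypothetical, and a real analysis would add a significance test on the two distributions.

```python
from collections import defaultdict
from statistics import median


def degrees(edges):
    """Count network partners (degree) for every gene in an undirected edge list."""
    partners = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            partners[a].add(b)
            partners[b].add(a)
    return {gene: len(neigh) for gene, neigh in partners.items()}


def median_degree(genes, degree_by_gene):
    """Median number of partners for a gene set; genes absent from the network count as 0."""
    return median(degree_by_gene.get(g, 0) for g in genes)


# Toy network (hypothetical): g1 is a hub, g5 and g6 are peripheral, g7 is unconnected.
edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"),
         ("g2", "g3"), ("g4", "g5"), ("g5", "g6")]
deg = degrees(edges)
retained = ["g1", "g2", "g3"]      # hypothetical genes kept as duplicates
copy_changed = ["g5", "g6", "g7"]  # hypothetical genes lost or further duplicated

print(median_degree(retained, deg), median_degree(copy_changed, deg))
```

Treating genes absent from the network as having zero partners is a design choice of this sketch; excluding them instead would restrict the comparison to genes with at least one annotated interaction.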
We had previously reported that genes undergoing positive selection for local adaptation in AL were not centrally located in gene networks (Chapter 3). We extend our results to add that genes with copy number fluctuations in AK were typically found farther and more isolated from the network, constrained to nodes of lower centrality and connectivity, minimizing their dosage effects and allowing networks to more easily adapt to evolutionary changes.

5.4.4 Differential selection of homeologous genomes

Comparison of gene retention in AK shows a preference for AH-derived homeologs. This is evident in the patterns of genes that were lost in the polyploids. Support for this also comes from the rates of selection of both homeologs, when compared to their pure parental references. Although an R² value of 0.04 suggests that other factors play a role, there is clearly a significant positive correlation between the selection of both homeologs (p < 8.2e-24, Figure 5.4, Top Right), with AH homeologs evolving faster than AL homeologs. With the gene sequences of homeologs available for further analysis, it would be very interesting to identify sets of genes under differential selection in the two homeologous genomes and to understand the properties common to them.

While the analyses of AK have been at the individual gene level, it is also important to understand how gene retention acts at the syntenic level. It is believed that duplicated genomes are marked upon polyploidy in such a way as to promote fractionation towards a diploid state [65, 160]. In this scenario, one would predict that homeologs that were lost or undergoing relaxed selection would be clustered in several regions across the genome. Indeed, this is a pattern that has been observed in numerous plant species [49, 129, 140-142, 178] and is a general feature of polyploids [27, 28, 30, 51, 87, 138].
We believe that this process of fractionation begins very early, is aided by the genetic divergence between homeologs at the time of polyploidization, and will become extremely pronounced as the genome evolves. Future work based on identifying local gene clusters on a chromosome will address the issues of fractionation and differential selection of homeologous genomes, and their roles in homeolog-specific retention.

5.5 Conclusions

Polyploidization of AK from diploid AH and AL resulted in immediate duplication of thousands of genes. While some genes were lost in the initial chaotic restructuring, others were subsequently lost over thousands of years through stochastic mutagenesis caused by genetic redundancy. With the help of genome sequences from numerous parental references, hundreds of genes were found to be lost in AK. Gene homeologs that were lost in AK typically had low sequence divergence between parental AH and AL lines, while having high polymorphism within their parental lineages. These genes were also found to be away from network hubs, exhibiting less centrality and connectivity. Genes from AL were more likely to be lost, and exhibited less positive selection than homeologs from AH. The preferential retention of AH homeologs may reflect the current geographical distribution of AK in an area dominated by AH.

Chapter 6 Summary and Conclusions

Recent developments in genomics are revolutionizing our views of genome evolution, demonstrating that perhaps all higher organisms, including mammals, have undergone full or partial genome duplications. Most biologists believe that genome duplication followed by divergence is an important source of novel adaptation, resulting from widespread genomic and phenotypic changes and the subsequent niche separation. Polyploidization, a form of genome duplication, is the increase in genome size caused by the inheritance of additional sets of chromosomes.
Over time, these genomes undergo diploidization, reducing polyploid genomes back towards a diploid state. Nevertheless, polyploidy would not serve a major role in speciation if all gene duplicates were quickly removed. In some ways, polyploidy may be the single most common mechanism of speciation in plants.

Our early work on Arabidopsis polyploids focused on Arabidopsis suecica (AS), which resulted from the hybridization of Arabidopsis thaliana (AT) and Arabidopsis arenosa (AA) 30K years ago. We observed that AT-originated homeologs were lost faster than AA-originated homeologs in the AS allotetraploid. We also found that AT homeologs were more likely to be silenced in the allotetraploid leaf transcriptome and that this silencing was network-dependent. The networks of AS are evolving to be more AT-like or more AA-like, rather than mixed. We believe that this is due to co-evolution of genes within a network in AT and AA lineages leading up to hybridization, which might be compromised in the allotetraploid. In general, genes within an interspecies network are typically more co-adapted with each other than with genes from other homeologous networks. Here, we found that mixed networks were significantly underrepresented in AS.

The analyses of homeologous gene loss in AS offer insight into the kinds of questions that can be addressed using large-scale genomic data from multiple species. However, while studies of AT-AA polyploids provided an excellent opportunity to link functional genetics with evolutionary patterns, the evolution of the AS genome may have been confounded by parental biases that are difficult to separate from ecological and demographic factors. For example, while both are annual herbs, AT is a predominant selfer whereas AA is an obligate outcrosser. This difference in reproductive lifestyle already puts AT homeologs at a disadvantage, as they are less fit than those of the outcrosser AA.
In addition, by using AT as a genome and annotation reference, it is difficult to study traits and genes that have already been selected out and removed in AT.

We followed our studies of AS with a second polyploid in the Arabidopsis genus that has not garnered much attention, Arabidopsis kamchatica (AK). While AS is native to Northern Europe and Scandinavia, AK is reportedly found in East Asia and North America. The AK system offers the opportunity to study genome duplication and hybrid speciation in a polyploid model with natural variation and adaptation to extreme environments, having two parental species that have adapted to a variety of climatic and ecological conditions. Arabidopsis lyrata (AL) is a perennial outcrossing species that is cold-tolerant and grows in low-competition habitats, while Arabidopsis halleri (AH) has served as one of the best model systems for heavy metal tolerance and accumulation in plant species, having a preference for high altitudes and harsh soil conditions.

Our initial efforts to characterize the genome of AL focused on the ability of two North American populations to adapt to serpentine environments. We reanalyzed existing genomic resequencing data for these populations to identify 283 genes undergoing positive selection in populations growing in serpentine soils. Some of these genes were involved in oxidative stress, metal ion binding, and plasma membrane receptors and transporters that are potentially involved in adaptation to serpentine soils. We also analyzed genes in the context of their networks and found that serpentine genes were constrained to nodes of lower centrality, allowing them to evolve more readily in response to environmental changes. Moreover, these analyses in AL showed that genes under higher positive selection co-evolved with neighboring genes, indicating that highly selected genes can affect the evolutionary rate of their neighbors.
We also sequenced six populations of the AH halleri subspecies sampled across France, Austria, and Germany, as well as two plant individuals of AH gemmifera sampled from Japan. Our initial focus was to identify genomic polymorphisms fixed and specific to all samples of AH that would be useful when analyzing sequences derived from AK. Along the way, we characterized genomic differences between both subspecies, identifying genes that were found exclusively or primarily in halleri or gemmifera. Some of these de novo genes were implicated in transmembrane receptor activity and metal ion binding, which is important considering that AH is a constitutive metal-tolerant species that is adapted to metallicolous environments. We also found numerous regions in the gemmifera genome showing patterns of segmental duplication, with genes in these regions exhibiting higher Ka/Ks patterns consistent with positive selection acting for amino acid changes in newly duplicated gene pairs.

Finally, in AK, we sequenced three accessions collected from Japan, Alaska, and British Columbia and documented a pattern of consistent loss of AL homeologs, in which genes that were not retained as duplicates were four times more likely to be lost from the AL homeolog. We also examined the correlation between network connectivity and gene retention and found that genes lost or further gained after polyploidization displayed a significant decrease in the number of network partners and/or expression correlation coefficient. Comparison of divergence between parental AL and AH homologs showed that homeolog loss is more common in less divergent genes, as well as in genes with high within-species variation in AL and AH lineages. In cases where genes were duplicated after polyploidization, they exhibited increased Ka/Ks ratios, suggesting that some duplicates were undergoing neofunctionalization resulting in amino acid changes that may have introduced new functions.
In the study of polyploidization and genome duplication, our results across six species of Arabidopsis illustrate the importance of understanding gene evolution in the context of network topology. In the AS polyploid system, we found that AS networks are evolving to be more AT-like or more AA-like, due to co-evolution of genes within a network in AT and AA lineages leading up to hybridization. In the AL, AH, and AK polyploid system, we found that gene evolution occurs fastest among genes constrained to nodes of lower centrality and connectivity. These are the genes that were undergoing positive selection for local adaptation in AL. These are the genes whose homeologs were lost or further gained after polyploidization in AK. Consistent with the Gene Balance Hypothesis, we find that "connected" genes are not usually lost or duplicated. In the rare cases where this does happen, genes with copy number fluctuations were typically found farther and more isolated from the network, minimizing their dosage effects and allowing networks to more easily adapt to evolutionary changes.

References

[1] ADAMS, K. L., CRONN, R. C., PERCIFIELD, R., AND WENDEL, J. F. Genes duplicated by polyploidy show unequal contributions to the transcriptome and organ-specific reciprocal silencing. Proceedings of the National Academy of Sciences of the United States of America 100, 8 (2003), 4649.
[2] ADAMS, K. L., PERCIFIELD, R., AND WENDEL, J. F. Organ-specific silencing of duplicated genes in a newly synthesized cotton allotetraploid. Genetics 168, 4 (2004), 2217.
[3] ADAMS, K. L., AND WENDEL, J. F. Novel patterns of gene expression in polyploid plants. Trends in Genetics 21, 10 (2005), 539-543.
[4] ADAMS, K. L., AND WENDEL, J. F. Polyploidy and genome evolution in plants. Current Opinion in Plant Biology 8, 2 (2005), 135-141.
[5] AL-SHEHBAZ, I. A., AND O'KANE, S. L. Taxonomy and phylogeny of Arabidopsis (Brassicaceae). The Arabidopsis Book 6 (2002), 1-22.
[6] ALKAN, C., KIDD, J. M., MARQUES-BONET, T., AKSAY, G., ANTONACCI, F., HORMOZDIARI, F., KITZMAN, J. O., BAKER, C., MALIG, M., MUTLU, O., SAHINALP, S. C., GIBBS, R. A., AND EICHLER, E. E. Personalized copy number and segmental duplication maps using next-generation sequencing. Nature Genetics 41 (2009), 1061-1067.
[7] ALKAN, C., SAJJADIAN, S., AND EICHLER, E. E. Limitations of next-generation genome sequence assembly. Nature Methods 8 (2011), 61-65.
[8] ALTSCHUL, S. F., MADDEN, T. L., SCHAFFER, A. A., ZHANG, J., ZHANG, Z., MILLER, W., AND LIPMAN, D. J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25 (1997), 3389-3402.
[9] ANTONOVICS, J., BRADSHAW, A. D., AND TURNER, R. Heavy metal tolerance in plants. Advances in Ecological Research 7 (1971), 1-85.
[10] AVERSANO, R., ERCOLANO, M. R., CARUSO, I., FASANO, C., ROSELLINI, D., AND CARPUTO, D. Molecular tools for exploring polyploid genomes in plants. International Journal of Molecular Sciences 13, 8 (2012), 10316-35.
[11] BACHEM, C. W., VAN DER HOEVEN, R. S., DE BRUIJN, S. M., VREUGDENHIL, D., ZABEAU, M., AND VISSER, R. G. Visualization of differential gene expression using a novel method of RNA fingerprinting based on AFLP: analysis of gene expression during potato tuber development. The Plant Journal 9, 5 (1996), 745-753.
[12] BARABASCHI, D., GUERRA, D., LACRIMA, K., LAINO, P., MICHELOTTI, V., URSO, S., VALE, G., AND CATTIVELLI, L. Emerging knowledge from genome sequencing of crop species. Molecular Biotechnology 50, 3 (2012), 250-66.
[13] BARABASI, A.-L., AND OLTVAI, Z. N. Network biology: understanding the cell's functional organization. Nature Reviews Genetics 5, 2 (2004), 101-113.
[14] BAROSS, A., DELANEY, A. D., LI, H. I., NAYAR, T., FLIBOTTE, S., QIAN, H., CHAN, S. Y., ASANO, J., ALLY, A., CAO, M., BIRCH, P., BROWN-JOHN, M., FERNANDES, N., GO, A., KENNEDY, G., LANGLOIS, S., EYDOUX, P., FRIEDMAN, J., AND MARRA, M. A.
Assessment of algorithms for high throughput detection of genomic copy number variation in oligonucleotide microarray data. BMC Bioinformatics 8 (2007), 368.
[15] BATAGELJ, V., AND MRVAR, A. Pajek-program for large network analysis. Connections 21, 2 (1998), 47-57.
[16] BEAULIEU, J., JEAN, M., AND BELZILE, F. The allotetraploid Arabidopsis thaliana-Arabidopsis lyrata subsp. petraea as an alternative model system for the study of polyploidy in plants. Molecular Genetics and Genomics 281, 4 (2009), 421-435.
[17] BEILSTEIN, M. A., NAGALINGUM, N. S., CLEMENTS, M. D., MANCHESTER, S. R., AND MATHEWS, S. Dated molecular phylogenies indicate a Miocene origin for Arabidopsis thaliana. Proceedings of the National Academy of Sciences of the United States of America 107, 43 (2010), 18724-28.
[18] BENJAMINI, Y., AND HOCHBERG, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society B: Statistical Methodology 57, 1 (1995), 289-300.
[19] BIRCHLER, J. A., RIDDLE, N. C., AUGER, D. L., AND VEITIA, R. A. Dosage balance in gene regulation: biological implications. Trends in Genetics 21, 4 (2005), 219-226.
[20] BIRCHLER, J. A., AND VEITIA, R. A. The gene balance hypothesis: from classical genetics to modern genomics. The Plant Cell 19, 2 (2007), 395-402.
[21] BIRCHLER, J. A., AND VEITIA, R. A. The gene balance hypothesis: implications for gene regulation, quantitative traits and evolution. New Phytologist 186 (2010), 54-62.
[22] BLANC, G., AND WOLFE, K. H. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. The Plant Cell 16, 7 (2004), 1667.
[23] BOLSTAD, B. M., IRIZARRY, R. A., ASTRAND, M., AND SPEED, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 2 (2003), 185.
[24] BOREVITZ, J. O., HAZEN, S. P., MICHAEL, T. P., MORRIS, G. P., BAXTER, I. R., HU, T.
T., CHEN, H., WERNER, J. D., NORDBORG, M., SALT, D. E., KAY, S. A., CHORY, J., WEIGEL, D., JONES, J. D., AND ECKER, J. R. Genome-wide patterns of single-feature polymorphism in Arabidopsis thaliana. Proceedings of the National Academy of Sciences of the United States of America 104, 29 (2007), 12057.
[25] BOTTLEY, A., AND KOEBNER, R. M. Variation for homoeologous gene silencing in hexaploid wheat. The Plant Journal 56, 2 (2008), 297-302.
[26] BOTTLEY, A., XIA, G., AND KOEBNER, R. M. Homoeologous gene silencing in hexaploid wheat. The Plant Journal 47, 6 (2006), 897-906.
[27] BUGGS, R. J., CHAMALA, S., WU, W., GAO, L., MAY, G. D., SCHNABLE, P. S., SOLTIS, D. E., SOLTIS, P. S., AND BARBAZUK, W. B. Characterization of duplicate gene evolution in the recent natural allopolyploid Tragopogon miscellus by next-generation sequencing and Sequenom iPLEX MassARRAY genotyping. Molecular Ecology 19 (2010), 132-146.
[28] BUGGS, R. J., CHAMALA, S., WU, W., TATE, J. A., SCHNABLE, P. S., SOLTIS, D., SOLTIS, P. S., AND BARBAZUK, W. B. Rapid, repeated, and clustered loss of duplicate genes in allopolyploid plant populations of independent origin. Current Biology 22, 3 (2012), 248-252.
[29] BUGGS, R. J., DOUST, A., TATE, J. A., KOH, J., SOLTIS, K., FELTUS, F. A., PATERSON, A. H., SOLTIS, P. S., AND SOLTIS, D. E. Gene loss and silencing in Tragopogon miscellus (Asteraceae): comparison of natural and synthetic allotetraploids. Heredity 103 (2009), 73-81.
[30] BUGGS, R. J., ELLIOTT, N. M., ZHANG, L., KOH, J., VICCINI, L. F., SOLTIS, D. E., AND SOLTIS, P. S. Tissue-specific silencing of homoeologs in natural populations of the recent allopolyploid Tragopogon mirus. New Phytologist 186, 1 (2010), 175-183.
[31] BUGGS, R. J., SOLTIS, P. S., AND SOLTIS, D. E. Does hybridization between divergent progenitors drive whole-genome duplication? Molecular Ecology 18, 16 (2009), 3334-9.
[32] CHANG, P. L., DILKES, B. P., MCMAHON, M., COMAI, L., AND NUZHDIN, S. V.
Homoeolog-specific retention and use in allotetraploid Arabidopsis suecica depends on parent of origin and network partners. Genome Biology 11, 12 (2010), R125.
[33] CHAUDHARY, B., FLAGEL, L., STUPAR, R. M., UDALL, J. A., VERMA, N., SPRINGER, N. M., AND WENDEL, J. F. Reciprocal silencing, transcriptional bias and functional divergence of homeologs in polyploid cotton (Gossypium). Genetics 182, 2 (2009), 503-17.
[34] CHEN, M., HA, M., LACKEY, E., WANG, J., AND CHEN, Z. J. RNAi of met1 reduces DNA methylation and induces genome-specific changes in gene expression and centromeric small RNA accumulation in Arabidopsis allopolyploids. Genetics 178, 4 (2008), 1845-58.
[35] CHEN, Z. J., COMAI, L., AND PIKAARD, C. S. Gene dosage and stochastic effects determine the severity and direction of uniparental ribosomal RNA gene silencing (nucleolar dominance) in Arabidopsis allopolyploids. Proceedings of the National Academy of Sciences of the United States of America 95, 25 (1998), 14891.
[36] CHEN, Z. J., AND NI, Z. Mechanisms of genomic rearrangements and gene expression changes in plant polyploids. BioEssays 28, 3 (2006), 240.
[37] CLAUSS, M. J., AND KOCH, M. A. Poorly known relatives of Arabidopsis thaliana. Trends in Plant Science 11, 9 (2006), 449-59.
[38] CLAUSS, M. J., AND MITCHELL-OLDS, T. Population genetic structure of Arabidopsis lyrata in Europe. Molecular Ecology 15, 10 (2006), 2753-2766.
[39] COMAI, L., TYAGI, A. P., AND LYSAK, M. A. FISH analysis of meiosis in Arabidopsis allopolyploids. Chromosome Research 11, 3 (2003), 217-226.
[40] COMAI, L., TYAGI, A. P., WINTER, K., HOLMES-DAVIS, R., REYNOLDS, S. H., STEVENS, Y., AND BYERS, B. Phenotypic instability and rapid gene silencing in newly formed Arabidopsis allotetraploids. The Plant Cell 12, 9 (2000), 1551.
[41] COYNE, J. A., AND ORR, H. A. The evolutionary genetics of speciation. Philosophical Transactions of the Royal Society B: Biological Sciences 353, 1366 (1998), 287.
[42] DILKES, B.
P., SPIELMAN, M., WEIZBAUER, R., WATSON, B., BURKART-WACO, D., SCOTT, R. J., AND COMAI, L. The maternally expressed WRKY transcription factor TTG2 controls lethality in interploidy crosses of Arabidopsis. PLoS Biology 6 (2008), e308.
[43] DU, C., FEFELOVA, N., CARONNA, J., HE, L., AND DOONER, H. K. The polychromatic Helitron landscape of the maize genome. Proceedings of the National Academy of Sciences of the United States of America (2009).
[44] EDGER, P., AND PIRES, J. Gene and genome duplications: the impact of dosage sensitivity on the fate of nuclear genes. Chromosome Research 17 (2009), 699-717.
[45] EHRENDORFER, F. Polyploidy and distribution. Basic Life Sciences 13 (1979), 45-60.
[46] FELDMAN, M., LIU, B., SEGAL, G., ABBO, S., LEVY, A. A., AND VEGA, J. M. Rapid elimination of low-copy DNA sequences in polyploid wheat: a possible mechanism for differentiation of homoeologous chromosomes. Genetics 147, 3 (1997), 1381.
[47] FERRIS, S. D., AND WHITT, G. S. Evolution of the differential regulation of duplicate genes after polyploidization. Journal of Molecular Evolution 12, 4 (1979), 267-317.
[48] FLAGEL, L. E., UDALL, J. A., NETTLETON, D., AND WENDEL, J. F. Duplicate gene expression in allopolyploid Gossypium reveals two temporally distinct phases of expression evolution. BMC Biology 6 (2008), 16.
[49] FLAGEL, L. E., AND WENDEL, J. F. Evolutionary rate variation, genomic dominance and duplicate gene expression evolution during allotetraploid cotton speciation. New Phytologist 186, 1 (2010), 184-193.
[50] FORCE, A., LYNCH, M., PICKETT, F. B., AMORES, A., YAN, Y.-L., AND POSTLETHWAIT, J. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151, 4 (1999), 1531.
[51] FREELING, M. Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annual Review of Plant Biology 60 (2009), 433-453.
[52] FREELING, M., AND THOMAS, B. C.
Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Research 16, 7 (2006), 805.
[53] FUKUDA, N., HOKURA, A., KITAJIMA, N., TERADA, Y., SAITO, H., ABE, T., AND NAKAI, I. Micro X-ray fluorescence imaging and micro X-ray absorption spectroscopy of cadmium hyper-accumulating plant, Arabidopsis halleri ssp. gemmifera, using high-energy synchrotron radiation. Journal of Analytical Atomic Spectrometry 23, 8 (2008), 1068-1075.
[54] GAETA, R. T., YOO, S.-Y., PIRES, J. C., DOERGE, R. W., CHEN, Z. J., AND OSHLACK, A. Analysis of gene expression in resynthesized Brassica napus allopolyploids using Arabidopsis 70mer oligo microarrays. PLoS One 4, 3 (2009), e4760.
[55] GENG, S., ZHAO, Y., TANG, L., ZHANG, R., SUN, M., GUO, H., KONG, X., LI, A., AND MAO, L. Molecular evolution of two duplicated CDPK genes CPK7 and CPK12 in grass species: A case study in wheat (Triticum aestivum L.). Gene 475, 2 (2011), 94-103.
[56] GILLESPIE, J. H. Junk ain't what junk does: neutral alleles in a selected context. Gene 205, 1-2 (1997), 291-299.
[57] GRANT, V. Plant Speciation (1981).
[58] HA, M., KIM, E.-D., AND CHEN, Z. J. Duplicate genes increase expression diversity in closely related species and allopolyploids. Proceedings of the National Academy of Sciences of the United States of America 106, 7 (2009), 2295.
[59] HA, M., LU, J., TIAN, L., RAMACHANDRAN, V., KASSCHAU, K. D., CHAPMAN, E. J., CARRINGTON, J. C., CHEN, X., WANG, X.-J., AND CHEN, Z. J. Small RNAs serve as a genetic buffer against genomic shock in Arabidopsis interspecific hybrids and allopolyploids. Proceedings of the National Academy of Sciences of the United States of America 106, 42 (2009), 17835-17840.
[60] HAHN, M. W., AND KERN, A. D. Comparative Genomics of Centrality and Essentiality in Three Eukaryotic Protein-Interaction Networks. Molecular Biology and Evolution 22, 4 (2005), 803-806.
[61] HEGARTY, M. J., BARKER, G. L., WILSON, I.
D., ABBOTT, R. J., EDWARDS, K. J., AND HISCOCK, S. J. Transcriptome shock after interspecific hybridization in Senecio is ameliorated by genome duplication. Current Biology 16, 16 (2006), 1652-1659.
[62] HEGARTY, M. J., AND HISCOCK, S. J. Hybrid speciation in plants: new insights from molecular studies. New Phytologist 165, 2 (2005), 411-23.
[63] HELLER, R., AND SMITH, J. M. Does Muller's ratchet work with selfing? Genetics Research 32, 03 (2009), 289-293.
[64] HOLLISTER, J. D., AND GAUT, B. S. Epigenetic silencing of transposable elements: A trade-off between reduced transposition and deleterious effects on neighboring gene expression. Genome Research 19, 8 (2009), 1419-1428.
[65] HOLLISTER, J. D., SMITH, L. M., GUO, Y.-L., OTT, F., WEIGEL, D., AND GAUT, B. S. Transposable elements and small RNAs contribute to gene expression divergence between Arabidopsis thaliana and Arabidopsis lyrata. Proceedings of the National Academy of Sciences of the United States of America 108, 6 (2011), 2322-2327.
[66] HORMOZDIARI, F., ALKAN, C., EICHLER, E. E., AND SAHINALP, S. C. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Research 19 (2009), 1270-1278.
[67] HORMOZDIARI, F., HAJIRASOULIHA, I., DAO, P., HACH, F., YORUKOGLU, D., ALKAN, C., EICHLER, E. E., AND SAHINALP, S. C. Next-generation variation hunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics 26, 12 (2010), i350-i357.
[68] HU, T. T., PATTYN, P., BAKKER, E. G., CAO, J., CHENG, J.-F., CLARK, R. M., FAHLGREN, N., FAWCETT, J. A., GRIMWOOD, J., GUNDLACH, H., HABERER, G., HOLLISTER, J. D., OSSOWSKI, S., OTTILAR, R. P., SALAMOV, A. A., SCHNEEBERGER, K., SPANNAGL, M., WANG, X., YANG, L., NASRALLAH, M. E., BERGELSON, J., CARRINGTON, J. C., GAUT, B. S., SCHMUTZ, J., MAYER, K. F. X., PEER, Y. V. D., AND GRIGORIEV, I. V. The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nature Genetics 43, 5 (2011).
[69] HUGHES, M. K., AND HUGHES, A. L. Evolution of duplicate genes in a tetraploid animal, Xenopus laevis. Molecular Biology and Evolution 10, 6 (1993), 1360.
[70] HULTEN, E. Atlas of the distribution of vascular plants in northwestern Europe. Generalstabens litografiska anstalts forlag, Stockholm.
[71] HUMINIECKI, L., AND CONANT, G. C. Polyploidy and the evolution of complex traits. International Journal of Evolutionary Biology 2012 (2012), 292068.
[72] HUMINIECKI, L., AND HELDIN, C. 2R and remodeling of vertebrate signal transduction engine. BMC Biology 8 (2010), 146.
[73] IRIZARRY, R. A., BOLSTAD, B. M., COLLIN, F., COPE, L. M., HOBBS, B., AND SPEED, T. P. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 31, 4 (2003), e15.
[74] IRIZARRY, R. A., HOBBS, B., COLLIN, F., BEAZER-BARCLAY, Y. D., ANTONELLIS, K. J., SCHERF, U., AND SPEED, T. P. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 2 (2003), 249.
[75] JAKOBSSON, M., HAGENBLAD, J., TAVARE, S., SALL, T., HALLDEN, C., LIND-HALLDEN, C., AND NORDBORG, M. A unique recent origin of the allotetraploid species Arabidopsis suecica: evidence from nuclear DNA markers. Molecular Biology and Evolution 23, 6 (2006), 1217.
[76] JIAO, Y., LEEBENS-MACK, J., AYYAMPALAYAM, S., BOWERS, J., MCKAIN, M., MCNEAL, J., ROLF, M., RUZICKA, D., WAFULA, E., WICKETT, N., WU, X., ZHANG, Y., WANG, J., ZHANG, Y., CARPENTER, E., DEYHOLOS, M., KUTCHAN, T., CHANDERBALI, A., SOLTIS, P., STEVENSON, D., MCCOMBIE, R., PIRES, J., WONG, G., SOLTIS, D., AND DEPAMPHILIS, C. A genome triplication associated with early diversification of the core eudicots. Genome Biology 13 (2012), R3.
[77] JIAO, Y., WICKETT, N. J., AYYAMPALAYAM, S., CHANDERBALI, A., LANDHERR, L., RALPH, P. E., TOMSHO, L. P., HU, Y., LIANG, H., SOLTIS, P. S., SOLTIS, D. E., CLIFTON, S. W., SCHLARBAUM, S. E., SCHUSTER, S. C., MA, H., LEEBENS-MACK, J., AND DEPAMPHILIS, C. W.
Ancestral polyploidy in seed plants and angiosperms. Nature 473, 7345 (2011), 97-100.
[78] JONSELL, B., KUSTAS, K., AND NORDAL, I. Genetic variation in Arabis petraea, a disjunct species in northern Europe. Ecography 18, 4 (1995), 321-332.
[79] JOSEFSSON, C., DILKES, B. P., AND COMAI, L. Parent-dependent loss of gene silencing during interspecies hybridization. Current Biology 16, 13 (2006), 1322-8.
[80] KASHEM, M. A., SINGH, B. R., KUBOTA, H., NAGASHIMA, R. S., KITAJIMA, N., KONDO, T., AND KAWAI, S. Assessing the potential of Arabidopsis halleri ssp. gemmifera as a new cadmium hyperaccumulator grown in hydroponics. Canadian Journal of Plant Science 87, 3 (2007), 499.
[81] KASHKUSH, K., FELDMAN, M., AND LEVY, A. A. Gene loss, silencing and activation in a newly synthesized wheat allotetraploid. Genetics 160, 4 (2002), 1651.
[82] KEIGHTLEY, P. D., AND OTTO, S. P. Interference among deleterious mutations favours sex and recombination in finite populations. Nature 443, 7107 (2006), 89-92.
[83] KEJNOVSKY, E., LEITCH, I. J., AND LEITCH, A. R. Contrasting evolutionary dynamics between angiosperm and mammalian genomes. Trends in Ecology and Evolution 24, 10 (2009), 572-82.
[84] KOCH, M. A., HAUBOLD, B., AND MITCHELL-OLDS, T. Comparative evolutionary analysis of chalcone synthase and alcohol dehydrogenase loci in Arabidopsis, Arabis, and related genera (Brassicaceae). Molecular Biology and Evolution 17, 10 (2000), 1483.
[85] KOCH, M. A., AND MATSCHINGER, M. Evolution and genetic differentiation among relatives of Arabidopsis thaliana. Proceedings of the National Academy of Sciences of the United States of America 104, 15 (2007), 6272.
[86] KUBOTA, H., AND TAKENAKA, C. Arabis gemmifera is a hyperaccumulator of Cd and Zn. International Journal of Phytoremediation 5, 3 (2003), 197-201.
[87] LANGHAM, R. J., WALSH, J., DUNN, M., KO, C., GOFF, S. A., AND FREELING, M. Genomic duplication, fractionation and the origin of regulatory novelty.
Genetics 166, 2 (2004), 935-945.
[88] LANGMEAD, B., TRAPNELL, C., POP, M., AND SALZBERG, S. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10 (2009), R25.
[89] LARKIN, M. A., BLACKSHIELDS, G., BROWN, N. P., CHENNA, R., MCGETTIGAN, P. A., MCWILLIAM, H., VALENTIN, F., WALLACE, I. M., WILM, A., LOPEZ, R., THOMPSON, J. D., GIBSON, T. J., AND HIGGINS, D. G. Clustal W and Clustal X version 2.0. Bioinformatics 23, 21 (2007), 2947-2948.
[90] LEE, H.-S., AND CHEN, Z. J. Protein-coding genes are epigenetically regulated in Arabidopsis polyploids. Proceedings of the National Academy of Sciences of the United States of America 98, 12 (2001), 6753.
[91] LI, A.-L., ZHU, Y.-F., TAN, X.-M., WANG, X., WEI, B., GUO, H.-Z., ZHANG, Z.-L., CHEN, X.-B., ZHAO, G.-Y., KONG, X.-Y., JIA, J.-Z., AND MAO, L. Evolutionary and functional study of the CDPK gene family in wheat (Triticum aestivum L.). Plant Molecular Biology 66, 4 (2008), 429-443.
[92] LI, H., AND DURBIN, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25 (2009), 1754-1760.
[93] LIM, K., SOLTIS, D., SOLTIS, P., TATE, J., MATYASEK, R., SRUBAROVA, H., KOVARIK, A., PIRES, J., XIONG, Z., AND LEITCH, A. Rapid chromosome evolution in recently formed polyploids in Tragopogon (Asteraceae). PLoS One 3 (2008), e3353.
[94] LIU, B., BRUBAKER, C. L., MERGEAI, G., CRONN, R. C., AND WENDEL, J. F. Polyploid formation in cotton is not accompanied by rapid genomic changes. Genome 44, 3 (2001), 321-330.
[95] LIU, B., VEGA, J. M., AND FELDMAN, M. Rapid genomic changes in newly synthesized amphiploids of Triticum and Aegilops. II. Changes in low-copy coding DNA sequences. Genome 41, 4 (1998), 535-542.
[96] LIU, S.-L., AND ADAMS, K. L. Dramatic change in function and expression pattern of a gene duplicated by polyploidy created a paternal effect gene in the Brassicaceae. Molecular Biology and Evolution 27, 12 (2010), 2817-2828.
[97] LIU, S.-L., BAUTE, G. J., AND ADAMS, K. L. Organ and cell type-specific complementary expression patterns and regulatory neofunctionalization between duplicated genes in Arabidopsis thaliana. Genome Biology and Evolution 3 (2011), 1419-36.
[98] LYNCH, M., AND FORCE, A. The probability of duplicate gene preservation by subfunctionalization. Genetics 154, 1 (2000), 459.
[99] MA, S., GONG, Q., AND BOHNERT, H. J. An Arabidopsis gene network based on the graphical Gaussian model. Genome Research 17, 11 (2007), 1614-25.
[100] MADLUNG, A., MASUELLI, R. W., WATSON, B., REYNOLDS, S. H., DAVISON, J., AND COMAI, L. Remodeling of DNA methylation and phenotypic and transcriptional changes in synthetic Arabidopsis allotetraploids. Plant Physiology 129, 2 (2002), 733.
[101] MAERE, S., DE BODT, S., RAES, J., CASNEUF, T., VAN MONTAGU, M., KUIPER, M., AND VAN DE PEER, Y. Modeling gene and genome duplications in eukaryotes. Proceedings of the National Academy of Sciences of the United States of America 102, 15 (2005), 5454-5459.
[102] MALIK, H. S., AND BAYES, J. J. Genetic conflicts during meiosis and the evolutionary origins of centromere complexity. Biochemical Society Transactions 34, Pt 4 (2006), 569-573.
[103] MAO, L., VAN HEMERT, J. L., DASH, S., AND DICKERSON, J. A. Arabidopsis gene co-expression network and its functional modules. BMC Bioinformatics 10 (2009), 346.
[104] MASTERSON, J. Stomatal Size in Fossil Plants: Evidence for Polyploidy in Majority of Angiosperms. Science 264, 5157 (1994), 421-424.
[105] MATZKE, M. A., AND MATZKE, A. J. Polyploidy and transposons. Trends in Ecology and Evolution 13, 6 (1998), 241.
[106] MATZKE, M. A., SCHEID, O. M., AND MATZKE, A. J. Rapid structural and epigenetic changes in polyploid and aneuploid genomes. BioEssays 21, 9 (1999), 761-767.
[107] MCCLINTOCK, B. The significance of responses of the genome to challenge. Science 226, 4676 (1984), 792-801.
[108] MEYER, C.-L., KOSTECKA, A.
A., SAUMITOU-LAPRADE, P., CREACH, A., CASTRIC, V., PAUWELS, M., AND FREROT, H. Variability of zinc tolerance among and within populations of the pseudometallophyte species Arabidopsis halleri and possible role of directional selection. New Phytologist 185, 1 (2010), 130-42.
[109] MEYER, C.-L., VITALIS, R., SAUMITOU-LAPRADE, P., AND CASTRIC, V. Genomic pattern of adaptive divergence in Arabidopsis halleri, a model species for tolerance to heavy metal. Molecular Ecology 18, 9 (2009), 2050-2062.
[110] MEZEY, J. G., NUZHDIN, S. V., YE, F., AND JONES, C. D. Coordinated evolution of co-expressed gene clusters in the Drosophila transcriptome. BMC Evolutionary Biology 8 (2008), 2.
[111] MITCHELL-OLDS, T. Arabidopsis thaliana and its wild relatives: a model system for ecology and evolution. Trends in Ecology and Evolution 16, 12 (2001), 693-700.
[112] MONDRAGON-PALOMINO, M., MEYERS, B. C., MICHELMORE, R. W., AND GAUT, B. S. Patterns of positive selection in the complete NBS-LRR gene family of Arabidopsis thaliana. Genome Research 12, 9 (2002), 1305-1315.
[113] MOORE, R. C., AND PURUGGANAN, M. D. The evolutionary dynamics of plant duplicate genes. Current Opinion in Plant Biology 8, 2 (2005), 122-8.
[114] MULLER, H. J. Why polyploidy is rarer in animals than in plants? American Naturalist 59, 663 (1925), 346-353.
[115] MUMMENHOFF, K., AND HURKA, H. Allopolyploid origin of Arabidopsis suecica (Fries) Norrlin: Evidence from chloroplast and nuclear genome markers. Botanica Acta 108 (1995), 449-456.
[116] NAITO, K., ZHANG, F., TSUKIYAMA, T., SAITO, H., HANCOCK, C. N., RICHARDSON, A. O., OKUMOTO, Y., TANISAKA, T., AND WESSLER, S. R. Unexpected consequences of a sudden and massive transposon amplification on rice gene expression. Nature 461 (2009), 1130-1134.
[117] NASRALLAH, M. E., YOGEESWARAN, K., SNYDER, S., AND NASRALLAH, J. B. Arabidopsis species hybrids in the study of species differences and evolution of amphiploidy in plants.
Plant Physiology 124, 4 (2000), 1605.
[118] NUZHDIN, S. V., WAYNE, M. L., HARMON, K. L., AND MCINTYRE, L. M. Common pattern of evolution of gene expression level and protein sequence in Drosophila. Molecular Biology and Evolution 21, 7 (2004), 1308-1317.
[119] ORR, H. A. Why polyploidy is rarer in animals than in plants revisited. American Naturalist 136, 6 (1990), 759-770.
[120] OSBORN, T. C., PIRES, J. C., BIRCHLER, J. A., AUGER, D. L., CHEN, Z. J., LEE, H.-S., COMAI, L., MADLUNG, A., DOERGE, R. W., COLOT, V., AND MARTIENSSEN, R. A. Understanding mechanisms of novel gene expression in polyploids. Trends in Genetics 19, 3 (2003), 141-147.
[121] OTTO, S. P. The evolutionary consequences of polyploidy. Cell 131, 3 (2007), 452-62.
[122] OTTO, S. P., AND WHITTON, J. Polyploid incidence and evolution. Annual Review of Genetics 34 (2000), 401-437.
[123] PAPP, B., PAL, C., AND HURST, L. D. Dosage sensitivity and the evolution of gene families in yeast. Nature 424, 6945 (2003), 194-197.
[124] PATERSON, A. H., BOWERS, J. E., AND CHAPMAN, B. A. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proceedings of the National Academy of Sciences of the United States of America 101, 26 (2004), 9903-9908.
[125] PAUWELS, M., FREROT, H., BONNIN, I., AND SAUMITOU-LAPRADE, P. A broad-scale analysis of population differentiation for Zn tolerance in an emerging model species for tolerance study: Arabidopsis halleri (Brassicaceae). Journal of Evolutionary Biology 19, 6 (2006), 1838-1850.
[126] PAUWELS, M., ROSENBERG, N. A., FREROT, H., AND SAUMITOU-LAPRADE, P. When population genetics serves genomics: putting adaptation back in a spatial and historical context. Current Opinion in Plant Biology 11, 2 (2008), 129-34.
[127] PAUWELS, M., SAUMITOU-LAPRADE, P., HOLL, A. C., PETIT, D., AND BONNIN, I.
Multiple origin of metallicolous populations of the pseudometallophyte Arabidopsis halleri (Brassicaceae) in central Europe: the cpDNA testimony. Molecular Ecology 14, 14 (2005), 4403-4414.
[128] PAUWELS, M., WILLEMS, G., ROSENBERG, N. A., FREROT, H., AND SAUMITOU-LAPRADE, P. Merging methods in molecular and ecological genetics to study the adaptation of plants to anthropogenic metal-polluted sites: Implications for phytoremediation. Molecular Ecology 17, 1 (2008), 108-119.
[129] PONT, C., MURAT, F., CONFOLENT, C., BALZERGUE, S., AND SALSE, J. RNA-seq in grain unveils fate of neo- and paleopolyploidization events in bread wheat (Triticum aestivum L.). Genome Biology 12, 12 (2011), R119.
[130] PONTES, O., NG, P., SILVA, M., LEWIS, M. S., MADLUNG, A., COMAI, L., VIEGAS, W., AND PIKAARD, C. S. Chromosomal locus rearrangements are a rapid response to formation of the allotetraploid Arabidopsis suecica genome. Proceedings of the National Academy of Sciences of the United States of America 101, 52 (2004), 18240.
[131] RANZ, J. M., CASTILLO-DAVIS, C. I., MEIKLEJOHN, C. D., AND HARTL, D. L. Sex-dependent gene expression and evolution of the Drosophila transcriptome. Science 300, 5626 (2003), 1742-1745.
[132] RAPP, R., UDALL, J., AND WENDEL, J. Genomic expression dominance in allopolyploids. BMC Biology 7 (2009), 18.
[133] ROBINSON, M. D., AND OSHLACK, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11, 3 (2010), R25.
[134] ROOSENS, N. H., WILLEMS, G., AND SAUMITOU-LAPRADE, P. Using Arabidopsis to explore zinc tolerance and hyperaccumulation. Trends in Plant Science 13, 5 (2008), 208-215.
[135] SALL, T., JAKOBSSON, M., LIND-HALLDEN, C., AND HALLDEN, C. Chloroplast DNA indicates a single origin of the allotetraploid Arabidopsis suecica. Journal of Evolutionary Biology 16, 5 (2003), 1019-1029.
[136] SALL, T., LIND-HALLDEN, C., JAKOBSSON, M., AND HALLDEN, C.
Mode of reproduction in Arabidopsis suecica. Hereditas 141, 3 (2005), 313-317.
[137] SALMON, A., AINOUCHE, M. L., AND WENDEL, J. F. Genetic and epigenetic consequences of recent hybridization and polyploidy in Spartina (Poaceae). Molecular Ecology 14, 4 (2005), 1163-1175.
[138] SANKOFF, D., ZHENG, C., AND ZHU, Q. The collapse of gene complement following whole genome duplication. BMC Genomics 11 (2010), 313.
[139] SCHATZ, M. C., DELCHER, A. L., AND SALZBERG, S. L. Assembly of large genomes using second-generation sequencing. Genome Research 20, 9 (2010), 1165-1173.
[140] SCHNABLE, J. C., AND FREELING, M. Genes identified by visible mutant phenotypes show increased bias toward one of two subgenomes of maize. PLoS One 6, 3 (2011), e17855.
[141] SCHNABLE, J. C., SPRINGER, N. M., AND FREELING, M. Differentiation of the maize subgenomes by genome dominance and both ancient and ongoing gene loss. Proceedings of the National Academy of Sciences of the United States of America 108, 10 (2011), 4069-74.
[142] SCHNABLE, P. S., WARE, D., FULTON, R. S., STEIN, J. C., WEI, F., PASTERNAK, S., LIANG, C., ZHANG, J., FULTON, L., GRAVES, T. A., MINX, P., REILY, A. D., COURTNEY, L., KRUCHOWSKI, S. S., TOMLINSON, C., STRONG, C., DELEHAUNTY, K., FRONICK, C., COURTNEY, B., ROCK, S. M., BELTER, E., DU, F., KIM, K., ABBOTT, R. M., COTTON, M., LEVY, A., MARCHETTO, P., OCHOA, K., JACKSON, S. M., GILLAM, B., CHEN, W., YAN, L., HIGGINBOTHAM, J., CARDENAS, M., WALIGORSKI, J., APPLEBAUM, E., PHELPS, L., FALCONE, J., KANCHI, K., THANE, T., SCIMONE, A., THANE, N., HENKE, J., WANG, T., RUPPERT, J., SHAH, N., ROTTER, K., HODGES, J., INGENTHRON, E., CORDES, M., KOHLBERG, S., SGRO, J., DELGADO, B., MEAD, K., CHINWALLA, A., LEONARD, S., CROUSE, K., COLLURA, K., KUDRNA, D., CURRIE, J., HE, R., ANGELOVA, A., RAJASEKAR, S., MUELLER, T., LOMELI, R.
, SCARA, G., KO, A., DELANEY, K., WISSOTSKI, M., LOPEZ, G., CAMPOS, D., BRAIDOTTI, M., ASHLEY, E., GOLSER, W., KIM, H., LEE, S., LIN, J., DUJMIC, Z., KIM, W., TALAG, J., ZUCCOLO, A., FAN, C., SEBASTIAN, A., KRAMER, M., SPIEGEL, L., NASCIMENTO, L., ZUTAVERN, T., MILLER, B., AMBROISE, C., MULLER, S., SPOONER, W., NARECHANIA, A., REN, L., WEI, S., KUMARI, S., FAGA, B., LEVY, M. J., MCMAHAN, L., VAN BUREN, P., VAUGHN, M. W., YING, K., YEH, C.-T., EMRICH, S. J., JIA, Y., KALYANARAMAN, A., HSIA, A.-P., BARBAZUK, W. B., BAUCOM, R. S., BRUTNELL, T. P., CARPITA, N. C., CHAPARRO, C., CHIA, J.-M., DERAGON, J.-M., ESTILL, J. C., FU, Y., JEDDELOH, J. A., HAN, Y., LEE, H., LI, P., LISCH, D. R., LIU, S., LIU, Z., NAGEL, D. H., MCCANN, M. C., SANMIGUEL, P., MYERS, A. M., NETTLETON, D., NGUYEN, J., PENNING, B. W., PONNALA, L., SCHNEIDER, K. L., SCHWARTZ, D. C., SHARMA, A., SODERLUND, C., SPRINGER, N. M., SUN, Q., WANG, H., WATERMAN, M., WESTERMAN, R., WOLFGRUBER, T. K., YANG, L., YU, Y., ZHANG, L., ZHOU, S., ZHU, Q., BENNETZEN, J. L., DAWE, R. K., JIANG, J., JIANG, N., PRESTING, G. G., WESSLER, S. R., ALURU, S., MARTIENSSEN, R. A., CLIFTON, S. W., MCCOMBIE, W. R., WING, R. A., AND WILSON, R. K. The B73 maize genome: Complexity, diversity, and dynamics. Science 326, 5956 (2009), 1112-1115.
[143] SEOIGHE, C., AND GEHRING, C. Genome duplication led to highly selective expansion of the Arabidopsis thaliana proteome. Trends in Genetics 20, 10 (2004), 461-464.
[144] SHAKED, H., KASHKUSH, K., OZKAN, H., FELDMAN, M., AND LEVY, A. A. Sequence elimination and cytosine methylation are rapid and reproducible responses of the genome to wide hybridization and allopolyploidy in wheat. The Plant Cell 13, 8 (2001), 1749.
[145] SHIMIZU, K. K., FUJII, S., MARHOLD, K., WATANABE, K., AND KUDOH, H. Arabidopsis kamchatica (Fisch. ex DC.) K. Shimizu & Kudoh and A. kamchatica subsp. kawasakiana (Makino) K. Shimizu & Kudoh, New Combinations.
Acta Phytotaxonomica et Geobotanica 56, 2 (2005), 163-172.
[146] SHIMIZU, K. K., KUDOH, H., AND KOBAYASHI, M. J. Plant sexual reproduction during climate change: gene function in natura studied by ecological and evolutionary systems biology. Annals of Botany (2011).
[147] SHIMIZU-INATSUGI, R., LIHOVA, J., IWANAGA, H., KUDOH, H., MARHOLD, K., SAVOLAINEN, O., WATANABE, K., YAKUBOV, V. V., AND SHIMIZU, K. K. The allopolyploid Arabidopsis kamchatica originated from multiple individuals of Arabidopsis lyrata and Arabidopsis halleri. Molecular Ecology 18, 19 (2009), 4024-4048.
[148] SIMPSON, J. T., MCINTYRE, R. E., ADAMS, D. J., AND DURBIN, R. Copy number variant detection in inbred strains from short read sequence data. Bioinformatics 26, 4 (2010), 565-567.
[149] SMITH, A. D., XUAN, Z., AND ZHANG, M. Q. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 9 (2008), 128.
[150] SOLTIS, D. E., BUGGS, R. J., BARBAZUK, W., SCHNABLE, P., AND SOLTIS, P. S. On the origins of species: does evolution repeat itself in polyploid populations of independent origin? Cold Spring Harbor Symposia on Quantitative Biology 74 (2009), 215-23.
[151] SOLTIS, P. S., AND SOLTIS, D. E. The role of hybridization in plant speciation. Annual Review of Plant Biology 60 (2009), 561-588.
[152] SONG, K., LU, P., TANG, K., AND OSBORN, T. C. Rapid genome change in synthetic polyploids of Brassica and its implications for polyploid evolution. Proceedings of the National Academy of Sciences of the United States of America 92, 17 (1995), 7719.
[153] STEBBINS, G. L. Self fertilization and population variability in the higher plants. American Naturalist 91, 861 (1957), 337-354.
[154] SUGISAKA, J., AND KUDOH, H. Breeding system of the annual Cruciferae, Arabidopsis kamchatica subsp. kawasakiana. Journal of Plant Research 121, 1 (2008), 65-8.
[155] SUN, H.-Z., AND GE, S.
Molecular evolution of the duplicated TFIIAgamma genes in Oryzeae and its relatives. BMC Evolutionary Biology 10 (2010), 128.
[156] TAKEBAYASHI, N., AND MORRELL, P. L. Is self-fertilization an evolutionary dead end? Revisiting an old hypothesis with genetic theories and a macroevolutionary approach. American Journal of Botany 88, 7 (2001), 1143-1150.
[157] TANG, H., BOWERS, J. E., WANG, X., MING, R., ALAM, M., AND PATERSON, A. H. Synteny and collinearity in plant genomes. Science 320, 5875 (2008), 486-488.
[158] TANG, H., WANG, X., BOWERS, J. E., MING, R., ALAM, M., AND PATERSON, A. H. Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Research 18, 12 (2008), 1944-1954.
[159] TATE, J., JOSHI, P., SOLTIS, K., SOLTIS, P., AND SOLTIS, D. On the road to diploidization? Homoeolog loss in independently formed populations of the allopolyploid Tragopogon miscellus (Asteraceae). BMC Plant Biology 9 (2009), 80.
[160] THOMAS, B. C., PEDERSEN, B., AND FREELING, M. Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome Research 16, 7 (2006), 934-46.
[161] TIAN, C., XIONG, Y., LIU, T., SUN, S., CHEN, L., AND CHEN, M. Evidence for an ancient whole-genome duplication event in rice and other cereals. Acta Genetica Sinica 32, 5 (2005), 519.
[162] TRUE, J., AND HAAG, E. Developmental system drift and flexibility in evolutionary trajectories. Evolution and Development 3 (2001), 109-119.
[163] TURNER, T. L., BOURNE, E. C., VON WETTBERG, E. J., HU, T. T., AND NUZHDIN, S. V. Population resequencing reveals local adaptation of Arabidopsis lyrata to serpentine soils. Nature Genetics 42, 3 (2010), 260-3.
[164] VAN DE PEER, Y., FAWCETT, J. A., PROOST, S., STERCK, L., AND VANDEPOELE, K. The flowering world: a tale of duplications. Trends in Plant Science 14, 12 (2009), 680-8.
[165] VAN ROSSUM, F., BONNIN, I., FENART, S., PAUWELS, M., PETIT, D., AND SAUMITOU-LAPRADE, P. Spatial genetic structure within a metallicolous population of Arabidopsis halleri, a clonal, self-incompatible and heavy-metal-tolerant species. Molecular Ecology 13, 10 (2004), 2959-67.
[166] VEITIA, R. A. Exploring the etiology of haploinsufficiency. BioEssays 24, 2 (2002), 175-84.
[167] VEITIA, R. A. Nonlinear effects in macromolecular assembly and dosage sensitivity. Journal of Theoretical Biology 220, 1 (2003), 19-25.
[168] VEITIA, R. A., BOTTANI, S., AND BIRCHLER, J. A. Cellular reactions to gene dosage imbalance: genomic, transcriptomic and proteomic effects. Trends in Genetics 24, 8 (2008), 390-397.
[169] WALSH, J. B. How often do duplicated genes evolve new functions. Genetics 139, 1 (1995), 421-428.
[170] WANG, J., TIAN, L., LEE, H.-S., WEI, N. E., JIANG, H., WATSON, B., MADLUNG, A., OSBORN, T. C., DOERGE, R. W., COMAI, L., AND CHEN, Z. J. Genomewide nonadditive gene regulation in Arabidopsis allotetraploids. Genetics 172, 1 (2006), 507-17.
[171] WANG, J., TIAN, L., MADLUNG, A., LEE, H.-S., CHEN, M., LEE, J. J., WATSON, B., KAGOCHI, T., COMAI, L., AND CHEN, Z. J. Stochastic and epigenetic changes of gene expression in Arabidopsis polyploids. Genetics 167, 4 (2004), 1961.
[172] WASSERMAN, S., AND FAUST, K. Social network analysis: methods and applications, 1994.
[173] WENDEL, J. F. Genome evolution in polyploids. Plant Molecular Biology 42, 1 (2000), 225-249.
[174] WENDEL, J. F., SCHNABEL, A., AND SEELANAN, T. Bidirectional interlocus concerted evolution following allopolyploid speciation in cotton (Gossypium). Proceedings of the National Academy of Sciences of the United States of America 92, 1 (1995), 280.
[175] WILSKER, D., CHUNG, J. H., AND BUNZ, F. Chk1 suppresses bypass of mitosis and tetraploidization in p53-deficient cancer cells. Cell Cycle 11, 8 (2012), 1564-72.
[176] WOLFE, K. H., AND SHIELDS, D. C.
Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387 (1997), 708-713.
[177] WOOD, T. E., TAKEBAYASHI, N., BARKER, M. S., MAYROSE, I., GREENSPOON, P. B., AND RIESEBERG, L. H. The frequency of polyploid speciation in vascular plants. Proceedings of the National Academy of Sciences of the United States of America 106, 33 (2009), 13875-9.
[178] WOODHOUSE, M. R., SCHNABLE, J. C., PEDERSEN, B. S., LYONS, E., LISCH, D., SUBRAMANIAM, S., AND FREELING, M. Following Tetraploidy in Maize, a Short Deletion Mechanism Removed Genes Preferentially from One of the Two Homeologs. PLoS Biology 8, 6 (2010), e1000409.
[179] WRIGHT, S. I., LAUGA, B., AND CHARLESWORTH, D. Rates and patterns of molecular evolution in inbred and outbred Arabidopsis. Molecular Biology and Evolution 19, 9 (2002), 1407.
[180] YANG, Z. PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution 24, 8 (2007), 1586-91.
[181] ZERBINO, D. R., AND BIRNEY, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18, 5 (2008), 821.
[182] ZHANG, Z., SCHWARTZ, S., WAGNER, L., AND MILLER, W. A greedy algorithm for aligning DNA sequences. Journal of Computational Biology 7, 1-2 (2000), 203-214.
[183] ZOTENKO, E., MESTRE, J., O'LEARY, D. P., AND PRZYTYCKA, T. M. Why Do Hubs in the Yeast Protein Interaction Network Tend To Be Essential: Reexamining the Connection between the Network Topology and Essentiality. PLoS Computational Biology 4, 8 (2008), 16.

Appendix A
List of data sets analyzed

Table A.1: List of data sets sequenced and analyzed. (*) indicates that the data has not yet been released.
Species | Sample | Types | Experiment | Platform | Notes
Arabidopsis thaliana 4x Ler | 3 samples | WGS | 3 arrays | Affymetrix GeneChip AT 1.0R | Chapter 2; ArrayExpress E-MEXP-2969
Arabidopsis arenosa Care-1 | 3 samples | WGS | 3 arrays | Affymetrix GeneChip AT 1.0R | Chapter 2; ArrayExpress E-MEXP-2969
Arabidopsis AT x AA F1 | 3 samples | WGS | 3 arrays | Affymetrix GeneChip AT 1.0R | Chapter 2; ArrayExpress E-MEXP-2969
Arabidopsis suecica Sue-1 | 3 samples | WGS | 3 arrays | Affymetrix GeneChip AT 1.0R | Chapter 2; ArrayExpress E-MEXP-2969
Arabidopsis suecica Sue | 3 samples | WTS | 3 arrays | Affymetrix GeneChip AT 1.0R | Chapter 2; ArrayExpress E-MEXP-2968
Arabidopsis suecica Sue | 1 sample | WTS | 1 lane (39M 76b PE reads) | Illumina Genome Analyzer II | Chapter 2; NCBI SRA: SRA025958
Arabidopsis lyrata lyrata | 4 populations | WGS | 64 lanes (540M 36b PE reads) | Illumina Genome Analyzer | Chapter 3; see Turner et al. [163]
Arabidopsis lyrata petraea | 1 genotype | WGS | 1 lane (211M 100b PE reads) | Illumina HiSeq 2000 | Chapter 5
Arabidopsis halleri halleri | 6 populations | WGS | 6 lanes (142M 76b PE reads) | Illumina Genome Analyzer II | Chapter 4
Arabidopsis halleri gemmifera | 2 genotypes | WGS | 2 lanes (43M 76b PE reads) | Illumina Genome Analyzer II | Chapter 4; NCBI SRA055768*
Arabidopsis kamchatica (AK) | 1 genotype | WGS | 2 lanes (46M 76b PE reads) | Illumina Genome Analyzer II | Chapter 5; NCBI SRA055771*
Arabidopsis kamchatica (BC) | 1 genotype | WGS | 2 lanes (52M 76b PE reads) | Illumina Genome Analyzer II | Chapter 5; NCBI SRA055771*
Arabidopsis kamchatica (JP) | 1 genotype | WGS | 2 lanes (52M 76b PE reads) | Illumina Genome Analyzer II | Chapter 5; NCBI SRA055771*
Medicago truncatula (HapMap) | 40 genotypes | WGS | 40 lanes (1,421M 90b PE reads) | Illumina Genome Analyzer II | Appendix E; NCBI SRA010994
Medicago truncatula (Tunisia) | 39 genotypes | WGS | 15 lanes (310M 90b PE reads) | Illumina Genome Analyzer II | Appendix E; NCBI SRA026748
Medicago truncatula (Soliman) | 96 genotypes | WGS | 12 lanes (1,444M 101b PE reads) | Illumina HiSeq 2000 | Appendix E; NCBI SRA055772
Medicago truncatula (TN1.13) | 6 samples | WTS | 3 lanes (27M 76b reads) | Illumina Genome Analyzer II | Appendix F; NCBI SRA051536
Medicago truncatula (TN1.15) | 6 samples | WTS | 3 lanes (45M 76b reads) | Illumina Genome Analyzer II | Appendix F; NCBI SRA051536
Medicago truncatula (TN7.22) | 6 samples | WTS | 3 lanes (38M 76b reads) | Illumina Genome Analyzer II | Appendix F; NCBI SRA051536
Medicago truncatula (TN8.22) | 6 samples | WTS | 3 lanes (35M 76b reads) | Illumina Genome Analyzer II | Appendix F; NCBI SRA051536
Spartina alterniflora | 3 samples | RAD | 1 lane (5M 76b reads) | Illumina Genome Analyzer II | Manuscript in Prep
Spartina foliosa | 3 samples | RAD | 1 lane (4M 76b reads) | Illumina Genome Analyzer II | Manuscript in Prep
Spartina SA x SF F1 | 6 samples | RAD | 1 lane (10M 76b reads) | Illumina Genome Analyzer II | Manuscript in Prep
Drosophila melanogaster (Infection) | 8 populations | WGS | 8 lanes (182M 108b reads) | Illumina Genome Analyzer II | Manuscript in Prep; NCBI SRA055891*
Drosophila melanogaster (Aging) | 6 populations | WGS | 6 lanes (514M 76b PE reads) | Illumina Genome Analyzer II | Appendix G; NCBI SRA038471
Drosophila melanogaster (Female) | 6 samples | WTS | 6 lanes (44M 36b PE reads) | Illumina Genome Analyzer | Appendix H; NCBI SRA026048
Drosophila melanogaster (Male) | 6 samples | WTS | 6 lanes (49M 36b PE reads) | Illumina Genome Analyzer | Appendix H; NCBI SRA026048
Drosophila melanogaster (Tra) | 6 samples | WTS | 6 lanes (32M 36b PE reads) | Illumina Genome Analyzer | Appendix H; NCBI SRA026048
Drosophila melanogaster (EBB) | 12 samples | WTS | 12 lanes (895M 75b PE reads) | Illumina Genome Analyzer II | Appendix I
Drosophila melanogaster (Courtship) | 16 samples | WTS | 4 lanes (100M 76b reads) | Illumina Genome Analyzer II | Appendix J; NCBI SRA055890*
Drosophila melanogaster (Polysome) | 9 samples | WTS | 3 lanes (42M 76b reads) | Illumina Genome Analyzer II | Appendix K
Drosophila melanogaster (Embryo) | 1 population | WTS | 1 lane (19M 76b reads) | Illumina Genome Analyzer II | Appendix L; NCBI SRA043769
Drosophila melanogaster (Germline) | 33 samples | WTS | 11 lanes (211M 76b reads) | Illumina Genome Analyzer II | Manuscript in Prep
Drosophila melanogaster (Diet) | 12 samples | WTS | 3 lanes (35M 76b reads) | Illumina Genome Analyzer II | Manuscript in Prep
Lasius niger | 6 populations | WGS | 1 lane (22M 76b reads) | Illumina Genome Analyzer II | Manuscript in Prep; NCBI SRA055926*
Lasius niger | 6 populations | WGS | 1 lane (401M 100b PE reads) | Illumina HiSeq 2000 | Manuscript in Prep; NCBI SRA055926*
Zyginidia pullula | 5 samples | WTS | 1 lane (50M 76b PE reads) | Illumina Genome Analyzer II | Manuscript in Prep; NCBI SRA056395*
Ruditapes philippinarum | 24 samples | WTS | 2 lanes (90M 76b PE reads) | Illumina Genome Analyzer II | Appendix M; NCBI SRA037984
Ruditapes decussatus | 24 samples | WTS | 2 lanes (166M 76b PE reads) | Illumina Genome Analyzer II | Manuscript in Prep; NCBI SRA055897*
Mesorhizobium loti | 48 genotypes | WGS | 5 lanes (297M 76b PE reads) | Illumina Genome Analyzer II | Manuscript in Prep; NCBI SRA055773*
Sinorhizobium | 96 genotypes | WGS | 10 lanes (217M 76b reads) | Illumina Genome Analyzer II | Manuscript in Prep; NCBI SRA055774*
Escherichia coli (SF2003) | 10 populations | WGS | 10 lanes (111M 44b reads) | Illumina Genome Analyzer | Manuscript in Prep
Escherichia coli (SF2009) | 12 populations | WGS | 12 lanes (701M 36b PE reads) | Illumina Genome Analyzer | Manuscript in Prep

Appendix B
Homeolog-specific retention and use in Arabidopsis suecica

[Figure: Chromosome 1; x-axis: Chromosome Pos (MB); marked region 3.16M-3.28M, 41 Genes]
Figure B.1: Distribution of probe intensities in AS. 100kb sliding window averages for AT (red), AA (blue), AS (gold), and F1AS (brown) on Chromosome 1.

[Figure: Chromosome 1; x-axis: Chromosome Pos (MB); marked region 8.40M-8.49M, 37 Genes]
Figure B.2: Distribution of probe intensities in AS. 100kb sliding window averages for AT (red), AA (blue), AS (gold), and F1AS (brown) on Chromosome 1.

[Figure: Chromosome 1; x-axis: Chromosome Pos (MB); marked region 13.66M-13.86M, 43 Genes]
Figure B.3: Distribution of probe intensities in AS.
100kb sliding window averages for AT (red), AA (blue), AS (gold), and F1AS (brown) on Chromosome 1.

[Figure B.4 plot: Chromosome 2; x-axis: Chromosome Pos (MB), 5.5 to 7.5; marked region 6.50M-6.66M, 41 Genes]
Figure B.4: Distribution of probe intensities in AS. 100kb sliding window averages for AT (red), AA (blue), AS (gold), and F1AS (brown) on Chromosome 2.

[Figure B.5 plot: Chromosome 2; x-axis: Chromosome Pos (MB), 16.5 to 18.5; marked regions 17.22M-17.30M, 31 Genes, and 18.11M-18.18M, 32 Genes]
Figure B.5: Distribution of probe intensities in AS. 100kb sliding window averages for AT (red), AA (blue), AS (gold), and F1AS (brown) on Chromosome 2.

[Figure B.6 plot: Chromosome 3; x-axis: Chromosome Pos (MB), 20.0 to 22.0; marked regions 20.24M-20.40M, 62 Genes; 20.91M-21.00M, 36 Genes; 21.30M-21.54M, 78 Genes; 21.58M-21.86M, 86 Genes]
Figure B.6: Distribution of probe intensities in AS. 100kb sliding window averages for AT (red), AA (blue), AS (gold), and F1AS (brown) on Chromosome 3.

[Figure B.7 plot: Chromosome 4; x-axis: Chromosome Pos (MB), 6.5 to 8.5; marked region 7.67M-7.82M, 45 Genes]
Figure B.7: Distribution of probe intensities in AS. 100kb sliding window averages for AT (red), AA (blue), AS (gold), and F1AS (brown) on Chromosome 4.

[Figure B.8 plot: Chromosome 4; x-axis: Chromosome Pos (MB), 16.0 to 18.0; marked region 16.89M-16.96M, 32 Genes]
Figure B.8: Distribution of probe intensities in AS. 100kb sliding window averages for AT (red), AA (blue), AS (gold), and F1AS (brown) on Chromosome 4.

[Figure B.9 plot: Chromosome 5; x-axis: Chromosome Pos (MB), 8.0 to 10.0; marked region 8.04M-8.14M, 30 Genes]
Figure B.9: Distribution of probe intensities in AS. 100kb sliding window averages for AT (red), AA (blue), AS (gold), and F1AS (brown) on Chromosome 5.

[Figure B.10 plot: histogram; x-axis: alpha, 0.0 to 1.0]
Figure B.10: Histogram distribution of α for F1AS (brown), AS (gold), and AS cDNA (black). The α statistic represents the relative contribution of AA signals in a mixed allotetraploid sample for a given gene. Bars represent α for all genes. Lines represent kernel density estimates.
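The caption of Figure B.10 describes α only as the relative contribution of AA signals in a mixed sample. As a rough illustration, and not the estimator used in the thesis, the sketch below assumes a simple linear mixing model in which each probe intensity of the mixed sample equals α times the AA intensity plus (1 - α) times the AT intensity, and solves for α by least squares over the probes of one gene. The function name `estimate_alpha` and the toy intensities are hypothetical.

```python
from typing import Sequence


def estimate_alpha(mixed: Sequence[float],
                   at: Sequence[float],
                   aa: Sequence[float]) -> float:
    """Least-squares mixing proportion for one gene.

    Assumed model (hypothetical, for illustration only):
        mixed[i] = alpha * aa[i] + (1 - alpha) * at[i]
    """
    if not (len(mixed) == len(at) == len(aa)) or len(mixed) == 0:
        raise ValueError("need equal-length, non-empty probe vectors")
    # Rearranged model: (mixed - at) = alpha * (aa - at)
    numerator = sum((m - t) * (a - t) for m, t, a in zip(mixed, at, aa))
    denominator = sum((a - t) ** 2 for t, a in zip(at, aa))
    if denominator == 0:
        raise ValueError("AA and AT signals are identical; alpha is undefined")
    # Clip to [0, 1] so noise cannot push the proportion outside its range.
    return min(1.0, max(0.0, numerator / denominator))


# Toy gene with three probes; the intensities are made up.
at_signal = [1.0, 2.0, 3.0]
aa_signal = [3.0, 2.0, 5.0]
mixed_signal = [0.25 * a + 0.75 * t for a, t in zip(aa_signal, at_signal)]
alpha = estimate_alpha(mixed_signal, at_signal, aa_signal)
print(f"alpha = {alpha:.2f}")
```

Under this assumption, a value near 0 means the mixed sample looks AT-like for that gene and a value near 1 means it looks AA-like, which matches how the histograms in Figures B.10 to B.12 are read.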
"iii <'£ ;; N 0 � 0.8 1.0 0.0 . . .. . fl • • :· , ... · . . .. : . . . . .. . .. . , . . . . ' " . , . . :. , ., . . · ,:. ·. :� .,. � :- . c .. 1, -: ., · . .... .. . . • • • , .,. , , ""! , • � . . · .. · "' .. ' ."'· ' . . . \ �· · · · ·· . . .. � , · -:: · � · ... .. . . . . .. . · '\ · : . . . . 0.2 .. . . . . 0.4 0.6 Fl As alpha 0.8 1.0 Figure B.ll: (A) Histogram distribution of a for AS (gold) and FlAS (brown) . Bars represent a for all genes. Lines represent kernel density estimates of a for genes with different ial hybridization of AS against FlAS within AS (gold) or FlAS (brown) . (B) Scatter plot of a for genes with differen tial hybridization, each point representing a single gene classified as AT-biased (red) , mixed (black), or AA (blue) in AS. A :;; B q . · � "' 0 :::l "' ..r;:; � 0 c. � " "iii 'Vi <( c z Q) 0 - a u ;; <'£ q N ci ci � � 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 alpha As DNA alpha Figure B.12: (A) Histogram distribution of a for AS eDNA (black) and AS DNA (gold) . Bars represent a for all genes. Lines represent kernel density estimates of a for genes with differen tial hybridization of AS eDNA against AS DNA within AS eDNA (black) or AS DNA (gold) . (B) Scatter plot of a for genes with differential hybridization, each point representing a single gene classified as AT-biased (red) , mixed (black), or AA (bl ue) in AS eDNA. 131 Appendix C Sequencing of Arabidopsis halleri ssp gemmifera reveal regions of duplication undergoing positive selection 132 0 s 0 N 0 0 Figure C.l: Genome-wide analysis of "8293" AH halleri population sample across 8 chro mosomes, displaying several patterns and statistics. Outermost bars represent 8 chromosomes in different colors. Next two tracks (going inwards ) show histograms and heat map of read depth coverage. Next two tracks represent heterozygosity and Ka/Ks ratios. Dark blue indi cates high values. 
Next track, shown in colors of the 8 chromosomes, shows density of structural variants described in Table 4.2. Inner circle of lines shows inter-chromosomal links, where two segments of DNA from two different chromosomes are linked together.

Figure C.2: Genome-wide analysis of "8294" AH halleri population sample across 8 chromosomes, displaying several patterns and statistics. Outermost bars represent 8 chromosomes in different colors. Next two tracks (going inwards) show histograms and heat map of read depth coverage. Next two tracks represent heterozygosity and Ka/Ks ratios. Dark blue indicates high values. Next track, shown in colors of the 8 chromosomes, shows density of structural variants described in Table 4.2. Inner circle of lines shows inter-chromosomal links, where two segments of DNA from two different chromosomes are linked together.

Figure C.3: Genome-wide analysis of "8295" AH halleri population sample across 8 chromosomes, displaying several patterns and statistics. Outermost bars represent 8 chromosomes in different colors. Next two tracks (going inwards) show histograms and heat map of read depth coverage. Next two tracks represent heterozygosity and Ka/Ks ratios. Dark blue indicates high values. Next track, shown in colors of the 8 chromosomes, shows density of structural variants described in Table 4.2. Inner circle of lines shows inter-chromosomal links, where two segments of DNA from two different chromosomes are linked together.

Figure C.4: Genome-wide analysis of "8296" AH halleri population sample across 8 chromosomes, displaying several patterns and statistics. Outermost bars represent 8 chromosomes in different colors. Next two tracks (going inwards) show histograms and heat map of read depth coverage. Next two tracks represent heterozygosity and Ka/Ks ratios. Dark blue indicates high values.
Next track, shown in colors of the 8 chromosomes, shows density of structural variants described in Table 4.2. Inner circle of lines shows inter-chromosomal links, where two segments of DNA from two different chromosomes are linked together.

Figure C.5: Genome-wide analysis of "8297" AH halleri population sample across 8 chromosomes, displaying several patterns and statistics. Outermost bars represent 8 chromosomes in different colors. Next two tracks (going inwards) show histograms and heat map of read depth coverage. Next two tracks represent heterozygosity and Ka/Ks ratios. Dark blue indicates high values. Next track, shown in colors of the 8 chromosomes, shows density of structural variants described in Table 4.2. Inner circle of lines shows inter-chromosomal links, where two segments of DNA from two different chromosomes are linked together.

Appendix D Genomic characterization of three Arabidopsis kamchatica genomes

[Figure D.1 plots: axis labels: Gene Depth AH-halleri + AL-lyrata; Ratio of Gene Depth AK-lyrata:AL-lyrata]
Figure D.1: Dot plots of gene depth in AK Alaska and its two parental species, AH and AL.

[Figure D.2 plots: axis labels: Gene Depth AH-halleri + AL-lyrata; Ratio of Gene Depth AK-lyrata:AL-lyrata]
Figure D.2: Dot plots of gene depth in AK British Columbia and its two parental species, AH and AL.

[Figure D.3 plots: axis labels: Parental AH-halleri:AL-lyrata Divergence; AK-halleri:AH-halleri Ka/Ks; AK-lyrata:AL-lyrata Ka/Ks]
Figure D.3: Analysis of homeologous sequence evolution in AK Alaska. Top Left compares divergence between homeologs to divergence between parental species.
Top Right compares Ka/Ks in AK-halleri to AK-lyrata. Ka/Ks for both homeologs were calculated in reference to their respective parental species. Best-fit linear regression line: y = 0.19x - 1.4, p < 6.9e-21, R² = 0.04. Both plots at bottom compare gene depth between homeologs and their respective parental species, at different values of Ka/Ks, for AH (left) and AL (right).

[Figure D.4 plots: axis labels: Parental AH-halleri:AL-lyrata Divergence; AK-halleri:AH-halleri Ka/Ks; AK-lyrata:AL-lyrata Ka/Ks]
Figure D.4: Analysis of homeologous sequence evolution in AK British Columbia. Top Left compares divergence between homeologs to divergence between parental species. Top Right compares Ka/Ks in AK-halleri to AK-lyrata. Ka/Ks for both homeologs were calculated in reference to their respective parental species. Best-fit linear regression line: y = 0.19x - 1.4, p < 3.4e-30, R² = 0.04. Both plots at bottom compare gene depth between homeologs and their respective parental species, at different values of Ka/Ks, for AH (left) and AL (right).

Table D.1: Analysis using other parental genomes of AH and AL.
Alaska              AH Only  Overlap  AL Only  Overlap  Neither  Overlap
AH_GEM - AL_ALP         699        -      174        -      264        -
AH_8292 - AL_ALP        705      345      262       71      354      171
AH_8293 - AL_ALP        817      395      274       69      371      173
AH_8294 - AL_ALP        862      408      296       72      402      186
AH_8295 - AL_ALP        814      394      282       77      405      179
AH_8296 - AL_ALP        852      412      258       71      407      196
AH_8297 - AL_ALP        917      429      255       70      426      194
AH_GEM - AL_AL1         768      481      180      106      308      168
AH_GEM - AL_AL2         755      474      176      103      306      170
AH_GEM - AL_AL3         760      481      179      102      315      168
AH_GEM - AL_AL4         774      470      157       85      286      155

British Columbia    AH Only  Overlap  AL Only  Overlap  Neither  Overlap
AH_GEM - AL_ALP         639        -      159        -      265        -
AH_8292 - AL_ALP        665      333      244       71      352      177
AH_8293 - AL_ALP        734      344      270       72      352      180
AH_8294 - AL_ALP        765      358      272       70      398      196
AH_8295 - AL_ALP        743      324      251       66      382      184
AH_8296 - AL_ALP        740      343      256       73      398      202
AH_8297 - AL_ALP        812      379      237       66      412      201
AH_GEM - AL_AL1         646      414      146       87      307      170
AH_GEM - AL_AL2         646      406      143       86      314      178
AH_GEM - AL_AL3         650      413      153       88      318      172
AH_GEM - AL_AL4         650      399      122       74      293      161

Japan               AH Only  Overlap  AL Only  Overlap  Neither  Overlap
AH_GEM - AL_ALP         705        -      173        -      271        -
AH_8292 - AL_ALP        703      367      263       77      360      180
AH_8293 - AL_ALP        793      394      292       84      364      183
AH_8294 - AL_ALP        805      397      287       77      415      201
AH_8295 - AL_ALP        811      389      282       81      398      191
AH_8296 - AL_ALP        847      412      269       84      408      204
AH_8297 - AL_ALP        905      434      261       82      433      203
AH_GEM - AL_AL1         759      473      188      111      315      175
AH_GEM - AL_AL2         749      466      179      108      321      181
AH_GEM - AL_AL3         749      469      185      109      326      178
AH_GEM - AL_AL4         760      434      167       96      294      161

[Figure D.5 plot: density curves; x-axis: Ratio of Gene Depth, 1/16 to 16]
Figure D.5: Distribution of fold change in gene depth in simulated data compared to its parental species. Four solid lines represent the fold change of AH homeologs simulated at divergence rates of 0, 0.01, 0.05, and 0.1. Four dotted lines represent the fold change of AL homeologs simulated at the same divergence rates. Reads were mapped with up to 8 mismatches and assigned using the same mapping scheme as described. Note that regardless of divergence, the number of reads that were mapped was nearly identical.
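The "Ratio of Gene Depth" axis in Figures D.1, D.2, and D.5 compares the read depth of a homeolog with the depth of the same gene in its parental species. The following is a minimal sketch of such a per-gene fold change on a log2 scale. It assumes, as an illustration rather than as the thesis's procedure, that each sample is first scaled by its total depth over the shared genes. The function name `log2_depth_ratios` and the gene depths are hypothetical.

```python
import math
from typing import Dict


def log2_depth_ratios(homeolog: Dict[str, float],
                      parent: Dict[str, float]) -> Dict[str, float]:
    """log2 ratio of homeolog depth to parental depth for shared genes.

    Each sample is scaled by its total depth over the shared genes so that
    a difference in sequencing effort does not shift every ratio.
    Genes with zero depth in either sample are skipped (log is undefined).
    """
    shared = [gene for gene in homeolog if gene in parent]
    homeolog_total = sum(homeolog[g] for g in shared)
    parent_total = sum(parent[g] for g in shared)
    if homeolog_total == 0 or parent_total == 0:
        raise ValueError("no read depth in the shared genes")
    ratios = {}
    for gene in shared:
        h, p = homeolog[gene], parent[gene]
        if h > 0 and p > 0:
            ratios[gene] = math.log2((h / homeolog_total) / (p / parent_total))
    return ratios


# Made-up depths for three genes in an AK homeolog and its AH parent.
ak_halleri_depth = {"geneA": 10.0, "geneB": 30.0, "geneC": 40.0}
ah_parent_depth = {"geneA": 20.0, "geneB": 30.0, "geneC": 30.0}
ratios = log2_depth_ratios(ak_halleri_depth, ah_parent_depth)
for gene, value in sorted(ratios.items()):
    print(gene, round(value, 3))
```

On this scale a value of -2 corresponds to the 1/4 tick and +2 to the 4 tick on the figure axes, so a gene that has lost one homeolog copy would sit to the left of zero.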
[Figure D.6 plots: four panels; x-axis: Parental AH-halleri:AL-lyrata Divergence]
Figure D.6: Analysis of homeologous sequence evolution in simulated data. Comparison of divergence of simulated AH and AL homeologs to divergence between parental species for 0 (Top Right), 0.01 (Top Left), 0.05 (Lower Left), and 0.1 (Lower Right) divergence. Plots can be compared to actual data shown in Top Left plots in Figures 5.4, D.3, and D.4.

Appendix E The ecological and genomic basis of salinity adaptation in Tunisian Medicago truncatula

The following was submitted as a Research Article to the Proceedings of the National Academy of Sciences of the United States of America: Friesen ML, von Wettberg EJB, Badri M, Moriuchi KS, Barhoumi F, Cuellar-Ortiz S, Chang PL, Cordeiro MA, Vu WT, Arraouadi S, Djebali N, Zribi K, Badri Y, Porter SS, Aouani ME, Cook DR, Strauss SY, Nuzhdin SV: The ecological and genomic basis of salinity adaptation in Tunisian Medicago truncatula. Submitted to the Proceedings of the National Academy of Sciences of the United States of America.

As our world becomes warmer, agriculture is increasingly impacted by rising soil salinity and we need to understand trade-offs that limit plant adaptation to salt stress to enable effective crop breeding. Salt tolerance is a complex plant phenotype and we know little about the pathways utilized by naturally tolerant plants. Legumes' capacity for symbiotic nitrogen fixation makes them cornerstone species in agricultural and natural ecosystems but they are particularly vulnerable to salt since symbiosis is salt sensitive.
Our studies of the model legume Medicago truncatula in natural field and controlled greenhouse settings demonstrate that Tunisian populations are locally adapted to saline soils. Whole genome re-sequencing of 40 wild accessions reveals a small number of candidate genomic regions that assort non-randomly with saline source population. These candidate regions contain genes that regulate physiological acclimation to salt stress, such as abscisic acid and jasmonic acid signaling, including a novel salt-tolerance candidate orthologous to the uncharacterized gene AtCIPK21. Unexpectedly, these regions also contain biotic stress genes, including a NB-LRR gene, and flowering time pathway genes, including CONSTANS. We show that flowering time, which varies with climate in Arabidopsis thaliana, is differentiated between saline and non-saline populations and may allow salt stress escape. Selection acts on several traits simultaneously and likely affects different life stages, leading to the complex genetic basis we observe that contradicts predictions that local adaptation should have a simple architecture. Our work demonstrates that local adaptation can occur despite multi-faceted selection and the involvement of many genes and phenotypes.

Appendix F Genetic variation of transgenerational plasticity on seed transcriptome and offspring early response to salinity

The following was submitted as a Research Article to Current Biology: Vu WT, Chang PL, Moriuchi KS, Friesen ML: Genetic variation of transgenerational plasticity on seed transcriptome and offspring early response to salinity. Submitted to Current Biology.

In many cases, the environment that parents experience can influence offspring traits early in development, resulting in transgenerational phenotypic plasticity in which the parental environment (PE) can be a predictor of offspring phenotype.
One mechanism by which parental environmental experience is transmitted to offspring is through the deposition of long-lived RNA transcripts in seeds. PE influence on seed characters is expected to be large because most of seed development and maturation occur while the seed is attached to the parental plant. To study the effects of transgenerational phenotypic plasticity under salt stress, we sequenced the mature seed transcriptome of four selfed Medicago truncatula inbred genotypes grown in saline and non-saline environments to determine the extent of genetic variation and the response of these genotypes to contrasting PE. Out of 9,321 genes detected in our seed transcriptome, we found evidence of genotype-dependent transcription for 1,500 genes, PE-dependent transcription for 321 genes, and the interaction between genotype and PE for 1,362 genes. Gene ontology enrichment analysis of the total genes expressed in our seed transcriptome revealed significant enrichment of terms related to stress response, epigenetics, post-embryonic development, symbiosis and post-transcriptional and translational regulation. We observe minimal overlap of PE responsive genes among the four genotypes, which is in agreement with the genotypic variation observed in offspring early developmental traits in response to PE. Some of the most interesting G by PE transcripts code for AGO4 protein, Dicer-like protein, DNA cytosine 5 methyltransferase, chromatin remodeling proteins, histone modifying factors and DEAD-box ATP-dependent helicase proteins, all of which have been implicated in RNA-directed DNA methylation pathways. These pathways have been well documented to mediate de novo methylation in plants and studies have suggested their involvement in transgenerational epigenetic inheritance.
Appendix G Genomic basis of aging and life-history evolution in Drosophila melanogaster

The following was published as a Research Article in Evolution: Remolina SC, Chang PL, Leips J, Nuzhdin SV, Hughes KA: Genomic basis of aging and life-history evolution in Drosophila melanogaster. Evolution 2012, 66(11):3390-3403.

Natural diversity in patterns of aging is a hallmark of organismal variation. Related species, populations, and individuals within populations show genetically based variation in lifespan and other aspects of age-related performance. Population differences are especially useful for investigating natural variation because they can be large relative to within-population differences and they occur in organisms with otherwise similar genomes. We used experimental evolution to produce populations divergent for lifespan and late-life fertility, and then employed deep genome sequencing to produce a screen for natural variants with nucleotide-level resolution. Several genes and genome regions exhibited strong signatures of selection, and the same regions were implicated in three independent comparisons, suggesting repeatable evolution at the molecular level. Genes with functions related to oogenesis, immunity, and protein degradation were over-represented, implying that these functions are important modifiers of late-life performance. Expression profiling and functional analysis of sequence changes allowed us to narrow the list of strong candidate genes to 42. Most of these are novel candidates for regulating aging and therefore highlight the importance of studying natural variation. The experimental populations exhibited a negative correlation between lifespan and early-age fecundity; therefore the alleles we identified also are candidate regulators of a major life-history tradeoff.
Appendix H Somatic sex-specific transcriptome differences in Drosophila revealed by whole transcriptome sequencing

The following was published as a Research Article in BMC Genomics: Chang PL, Dunham JP, Nuzhdin SV, Arbeitman MN: Somatic sex-specific transcriptome differences in Drosophila revealed by whole transcriptome sequencing. BMC Genomics 2011, 12(1):364.

Understanding animal development and physiology at a molecular-biological level has been advanced by the ability to determine at high resolution the repertoire of mRNA molecules by whole transcriptome resequencing. This includes the ability to detect and quantify rare transcripts and isoform-specific mRNA variants produced from a gene. The sex hierarchy consists of a pre-mRNA splicing cascade that directs the production of sex-specific transcription factors that specify nearly all sexual dimorphism. We used deep RNA sequencing to gain insight into how the Drosophila sex hierarchy generates somatic sex differences, by examining gene and isoform expression differences between the sexes in adult head tissues. Here we find 1,381 genes and 1,370 isoforms that differ in expression levels between females and males. Additionally, we find 512 genes regulated upstream of transformer that are more highly expressed in males than females. These 512 genes are enriched on the X chromosome and reside adjacent to dosage compensation complex entry sites, which taken together suggests that their residence on the X chromosome might be sufficient to confer male-biased expression. There are no transcription unit structural features that are robustly significantly different among genes with differences in the ratio of their isoforms, suggesting that there is no single molecular mechanism that generates isoform differences between the sexes, even though the sex hierarchy is known to include three pre-mRNA splicing factors.
We identify thousands of genes that show sex-specific differences in overall gene expression levels, and identify hundreds of additional genes that have differences in isoform expression. No transcription unit structural feature was robustly enriched in the sex-differentially expressed isoforms. Additionally, we found that many genes with male-biased expression were enriched on the X chromosome and reside adjacent to dosage compensation entry sites, suggesting that differences in sex chromosome composition contribute to dimorphism in gene expression. Taken together, this study provides new insight into the molecular underpinnings of sexual differentiation.

Appendix I Sex-specific signaling in the blood brain barrier is required for male courtship in Drosophila

The following was published as a Research Article in PLoS Genetics: Hoxha V, Lama C, Chang PL, Saurabh S, Olate N, Patel N, Dauwalder B: Sex-specific signaling in the blood brain barrier is required for male courtship in Drosophila. PLoS Genetics. In Press.

Soluble circulating proteins play an important role in the regulation of mating behavior in Drosophila. However, how these factors signal through the blood brain barrier to interact with the sex-specific brain circuits that control courtship is unknown. Here we show that the blood brain barrier is sexually dimorphic and that male-specific factors in the bbb are physiologically required for normal male courtship behavior. Feminization of the bbb of adult males significantly reduces male courtship. We show that the bbb-specific G-protein coupled receptor moody and bbb-specific Go signaling in adult males are necessary for normal courtship and present genetic evidence that moody signals through Go. These data identify sex-specific factors and signaling processes in the bbb as important regulators of male mating behavior.
Appendix J Genetic basis of long term courtship suppression of Drosophila males revealed by transcriptome sequencing

The following was published as a Research Article in G3: Genes, Genomes, Genetics: Winbush A, Reed D, Chang PL, Nuzhdin SV, Arbeitman MN: Genetic basis of long term courtship suppression of Drosophila males revealed by transcriptome sequencing. G3: Genes, Genomes, Genetics 2012, 2(11):1437-1445.

In Drosophila melanogaster long-term memory is an important component of the insect's behavioral repertoire that allows an individual to modify behaviors based on previous experiences. Male flies that have been repeatedly rejected by mated females during courtship advances are less likely than naive males to subsequently court another mated female. This long-term courtship suppression lasts for several days after the initial rejection period and is a form of behavioral plasticity. While genes with functions in associative learning and memory, including those that function in the cyclic AMP signaling and RNA translocation, have been identified as playing critical roles in long-term courtship suppression, it is clear that genetic mechanisms outside of associative learning and memory also contribute to long-term courtship suppression. We have used deep RNA sequencing to identify differentially expressed genes and transcript isoforms between naive males and males subjected to courtship conditioning regimens that are sufficient for inducing long-term courtship suppression. Transcriptome analysis of head tissues revealed differentially expressed genes with functions in cytoskeletal dynamics, translation and chromatin remodeling.
A much larger number of differentially expressed transcript isoforms were identified, including those from genes previously implicated in associative memory and other genes with a wide variety of other functions associated with neuronal development and establishment of circuitry that may play functional roles in olfaction and courtship behavior. Our results shed light on the complexity of the neurogenetics and plasticity behind LTCS memory and reveal several areas for further study.

Appendix K A versatile method for cell-specific profiling of translated mRNAs in Drosophila

The following was published as a Research Article in PLoS ONE: Thomas A, Lee P-J, Dalton JE, Nomie KJ, Stoica L, Costa-Mattioli M, Chang PL, Nuzhdin SV, Arbeitman MN, Dierick HA: A versatile method for cell-specific profiling of translated mRNAs in Drosophila. PLoS ONE 2012, 7(7):e40276.

Until recently, expression analyses exploring gene regulation were largely limited to easily obtainable tissues or whole organisms, which confounded the analysis of transcriptional expression occurring within specific cellular subtypes in those tissues. Here, we created transgenic Drosophila strains expressing a GFP-tagged ribosomal protein, RpL10A, under the control of the UAS promoter to perform cell-type specific translatome profiling. Using polysome affinity purification we were able to enrich transcripts from targeted tissues with sufficient sensitivity to analyze expression in small cell populations. This method can be used to determine the unique translatome profiles in different cell-types under varied physiological and pathological conditions.
Appendix L A de novo transcriptome assembly of Lucilia sericata (Diptera: Calliphoridae) with predicted alternative splices, single nucleotide polymorphisms and transcript expression estimates

The following was published as a Research Article in Insect Molecular Biology: Sze S-H, Dunham JP, Carey B, Chang PL, Li F, Edman RM, Fjeldsted C, Scott MJ, Nuzhdin SV, Tarone AM: A de novo transcriptome assembly of Lucilia sericata (Diptera: Calliphoridae) with predicted alternative splices, single nucleotide polymorphisms and transcript expression estimates. Insect Molecular Biology 2012, 21(2):205-21.

The blow fly Lucilia sericata (Diptera: Calliphoridae) (Meigen) is a nonmodel organism with no reference genome that is associated with numerous areas of research spanning the ecological, evolutionary, medical, veterinary and forensic sciences. To facilitate scientific discovery in this species, the transcriptome was assembled from more than six billion bases of Illumina and twenty-one million bases of 454 sequence derived from embryonic, larval, pupal, adult and larval salivary gland libraries. The assembly was carried out in a manner that enabled identification of putative single nucleotide polymorphisms (SNPs) and alternative splices, and that provided expression estimates for various life history stages and for salivary tissue. The assembled transcriptome was also used to identify transcribed transposable elements in L. sericata. The results of this study will enable blow fly biologists, dipterists and comparative genomicists to more rapidly develop and test molecular and genetic hypotheses, especially those regarding blow fly development and salivary gland biology.
Appendix M De novo assembly of the Manila clam Ruditapes philippinarum transcriptome provides new insights into expression bias, mitochondrial doubly uniparental inheritance and sex determination

The following was published as a Research Article in Molecular Biology and Evolution: Ghiselli F, Milani L, Chang PL, Hedgecock D, Davis JP, Nuzhdin SV, Passamonti M: De Novo Assembly of the Manila Clam Ruditapes philippinarum transcriptome provides new insights into expression bias, mitochondrial doubly uniparental inheritance and sex determination. Molecular Biology and Evolution 2012, 29(2):771-86.

Males and females share the same genome, thus, phenotypic divergence requires differential gene expression and sex-specific regulation. Accordingly, the analysis of expression patterns is pivotal to the understanding of sex determination mechanisms. Many bivalves are stable gonochoric species, but the mechanism of gonad sexualization and the genes involved are still unknown. Moreover, during the period of sexual rest, a gonad is not present and sex cannot be determined. A mechanism associated with germ line differentiation in some bivalves, including the Manila clam Ruditapes philippinarum, is the doubly uniparental inheritance (DUI) of mitochondria, a variation of strict maternal inheritance. Two mitochondrial lineages are present, one transmitted through eggs and the other through sperm, as well as a mother-dependent sex bias of the progeny. We produced a de novo annotation of 17,186 transcripts from R. philippinarum and compared the transcriptomes of males and females and identified 1,575 genes with strong sex-specific expression and 166 sex-specific single nucleotide polymorphisms, obtaining preliminary information about genes that could be involved in sex determination.
We compared the transcriptomes between a family producing predominantly females and a family producing predominantly males to identify candidate genes involved in regulation of sex-specific aspects of DUI, finding a relationship between sex bias and differential expression of several ubiquitination genes. In mammalian embryos, sperm mitochondria are degraded by ubiquitination. A modification of this mechanism is hypothesized to be responsible for the retention of sperm mitochondria in male embryos of DUI species. Ubiquitination can additionally regulate gene expression, playing a role in sex determination of several animals. These data enable us to develop a model that incorporates both the DUI literature and our new findings.
Abstract
Recent developments in genomics are revolutionizing our views of genome evolution, demonstrating that perhaps all higher organisms, including mammals, have undergone full or partial genome duplications. Polyploidization, a form of genome duplication, is the increase in genome size caused by the inheritance of additional sets of chromosomes. Over time, these genomes undergo diploidization, reducing polyploid genomes back towards a diploid state. In some ways, polyploidy may be the single most common mechanism of speciation in plants.

Our work on Arabidopsis polyploids focused on two allotetraploids: Arabidopsis suecica (AS) and Arabidopsis kamchatica (AK). In AS, we observed that homeologs originating from parental Arabidopsis thaliana (AT) were lost faster than those originating from Arabidopsis arenosa (AA). We also found that AT homeologs were more likely to be silenced in the leaf transcriptome and that this silencing was network-dependent. The networks of AS are evolving to be more AT-like or more AA-like, rather than mixed. In general, genes within an interspecies network are typically more co-adapted with each other than with genes from other homeologous networks. Here, we found that mixed networks were significantly underrepresented in AS.

In AK, we sequenced three accessions collected from Japan, Alaska, and British Columbia and documented a pattern of consistent loss of homeologs from parental Arabidopsis lyrata (AL), compared to Arabidopsis halleri (AH). We also examined the correlation between network connectivity and gene retention and found that genes lost or further gained after polyploidization displayed a significant decrease in the number of network partners and/or expression correlation coefficient. Comparison of divergence between parental AL and AH homologs showed that homeolog loss in the polyploid is more common in less divergent genes, as well as in genes with high within-species variation in AL and AH lineages. In cases where genes were duplicated after polyploidization, they exhibited increased Ka/Ks ratios, suggesting that some duplicates were undergoing neofunctionalization that introduced new functions.

In the study of polyploidization and genome duplication, our results across six species of Arabidopsis illustrate the importance of understanding gene evolution in the context of network topology. In the AS polyploid system, AS networks are evolving to be more AT-like or more AA-like, due to co-evolution of genes within a network in AT and AA lineages leading up to hybridization. In the AK polyploid system, gene evolution occurs fastest among genes constrained to nodes of lower centrality and connectivity. These are the genes that were undergoing positive selection for local adaptation in parental AL. These are the genes whose homeologs were lost or further gained after polyploidization in AK. Consistent with the Gene Balance Hypothesis, we find that "connected" genes are not usually lost or duplicated. In the rare cases where this does happen, genes with copy number fluctuations were found farther and more isolated from the network, minimizing their dosage effects and allowing networks to adapt more easily to evolutionary changes.
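The abstract reports that genes lost or gained after polyploidization have fewer network partners than genes that kept a stable copy number. The sketch below shows one simple way such a comparison could be set up: count the distinct partners of each gene in an undirected network and compare the mean count between two gene sets. The network, the gene names, and the two gene sets are invented for illustration, and the thesis may have used a different connectivity measure or statistical test.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def node_degrees(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Number of distinct network partners for each gene."""
    partners = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            partners[u].add(v)
            partners[v].add(u)
    return {gene: len(p) for gene, p in partners.items()}


def mean_degree(genes: Iterable[str], degree: Dict[str, int]) -> float:
    """Average degree of a gene set; genes absent from the network count as 0."""
    return mean(degree.get(gene, 0) for gene in genes)


# Hypothetical co-expression network given as an edge list.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]
degree = node_degrees(edges)

stable_genes = ["a", "b", "c"]   # copy number unchanged (made up)
changed_genes = ["e", "f"]       # lost or duplicated (made up); "f" has no partners

print("stable :", round(mean_degree(stable_genes, degree), 2))
print("changed:", round(mean_degree(changed_genes, degree), 2))
```

A real analysis would also need a significance test and a network built from expression correlations, neither of which is attempted here.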
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Innovative sequencing techniques elucidate gene regulatory evolution in Drosophila
Investigating the evolution of gene networks through simulated populations
Natural variation of Arabidopsis thaliana methylome and its impact on genome evolution
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
The evolution of gene regulatory networks
Analysis of genomic polymorphism in Arabidopsis thaliana
Biological interactions on the behavioral, genomic, and ecological scale: investigating patterns in Drosophila melanogaster of the southeast United States and Caribbean islands
Comparative analysis of DNA methylation in mammals
Association mapping in Arabidopsis thaliana
Mapping epigenetic and epistatic components of heritability in natural population
Identifying allele-specific DNA methylation in mammalian genomes
Comparative transcriptomics: connecting the genome to evolution
Probing the genetic basis of gene expression variation through Bayesian analysis of allelic imbalance and transcriptome studies of oil palm interspecies hybrids
Evolutionary genomic analysis in heterogeneous populations of non-model and model organisms
A population genomics approach to the study of speciation in flowering columbines
Mapping genetic variants for nonsense-mediated mRNA decay regulation across human tissues
Computational and experimental approaches for the identification of genes and gene networks in the Drosophila sex-determination hierarchy
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
Ecologically responsible domestication of kelp facilitated by genomic tools
Orthogonal shared basis factorization: cross-species gene expression analysis using a common expression subspace
Asset Metadata
Creator
Chang, Peter L. (author)
Core Title
Long term evolution of gene duplicates in arabidopsis polyploids
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
11/20/2014
Defense Date
10/05/2012
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
adaptation,arabidopsis,gene evolution,genome duplication,genome evolution,metal tolerance,OAI-PMH Harvest,polyploidy
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Nuzhdin, Sergey V. (committee chair), Conti, David V. (committee member), Dean, Matthew D. (committee member), Smith, Andrew D. (committee member), Sun, Fengzhu Z. (committee member)
Creator Email
peterc@usc.edu,peterlchang@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-117583
Unique identifier
UC11290607
Identifier
usctheses-c3-117583 (legacy record id)
Legacy Identifier
etd-ChangPeter-1318.pdf
Dmrecord
117583
Document Type
Dissertation
Rights
Chang, Peter L.
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA