Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Natural variation of Arabidopsis thaliana methylome and its impact on genome evolution
(USC Thesis Other)
Natural variation of Arabidopsis thaliana methylome and its impact on genome evolution
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
NATURAL VARIATION OF ARABIDOPSIS THALIANA METHYLOME AND ITS IMPACT ON GENOME EVOLUTION by Pei Zhang A Dissertation Presented to the FACULTY OF THE USC GRAUDATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Molecular Biology) Dec 2014 Copyright 2014 Pei Zhang ii DEDICATION To my parents and husband iii ACKNOWLEDGEMENT I would first like to express my deepest gratitude to my PhD advisor, Dr. Magnus Nordborg, for admitting me and giving me a wonderful research environment, patiently mentoring for my research projects, and generously supporting me for the past seven years. I also would like to thank the members of my qualification and dissertation committee, Dr. Steven Finkel, Dr. Matt Dean, Dr. David Conti, Dr. Xuelin Wu and Dr. Sergey Nuzhdin, who have all provided valuable advice for me. I want to thank all collaborators on my projects, Dr. Qiang Song, Dr. Manu Dubin, Dr. Marie Remigereau, Dr. Dazhe Meng, Dr.Edward Osborne, Dr. Andrew Smith and Dr. Richard Clark. Without their help, I could not manage to finish. In addition, I would like to thank all my former and present lab members, both at USC and GMI. They care about me. The memory of spending all these years is sweet and precious, especially when I am far away from my family. Last but not least, I would like to thank my parents, my husband and friends. They have always been around to support and encourage me, accompanying me through these tough days and nights. iv TABLE OF CONTENTS DEDICATION II ACKNOWLEDGEMENT III LIST OF TABLES VI LIST OF FIGURES VII ABBREVIATIONS XIV ABSTRACT XV CHAPTER 1 : INTRODUCTION 1 1.1 DNA methylation in Arabidopsis thaliana 4 1.2 Comparing the Mutational and Polymorphism Spectrum for Arabidopsis thaliana 7 CHAPTER 2 : NATURAL VARIATION OF METHYLOME FOR ARABIDOPSIS THALIANA 10 2.1 The basic pattern of variation of the Arabidopsis thaliana methylome 13 2.2 The influence of the environment on the Arabidopsis thaliana methylome 16 2.3 The effect of methylation variation on gene expression 19 2.3.1 Correlation between positional methylation and expression of genes 19 2.3.2 Clustering analysis shows two types of methylation for genes 27 2.3.3 GWAS using DNA methylation to explain gene expression 29 2.4 Linking genetic and methylation variants 32 2.5 Methylation and transcription for genes with copy number variation 44 v CHAPTER 3 : ONGOING CG-ENRICHMENT IN REGIONS WITH NON-CPG METHYLATION IN ARABIDOPSIS THALIANA 51 3.1 Ongoing enrichment of CG in Swedish Arabidopsis thaliana 51 3.2 CG enrichment is related to the methylation level 56 3.3 Strong CG enrichment for TEs and genic regions under the RdDM pathway 59 3.4 Genes with CG enrichment show greater variability for expression and methylation 66 3.5. Discussion 69 CHAPTER 4 : GEOGRAPHICAL DIFFERENCES IN GBM AND CYTOSINE NUMBER 71 4.1 Northern Arabidopsis thaliana lines exhibit higher GBM 71 4.2 Northern Arabidopsis thaliana lines have more cytosines 74 4.3 Exploration of the geographical differences in number of cytosines 76 4.3.1 Not related to methylation level 76 4.3.2 Not due to codon bias 78 4.3.3 Also observed in world-wide collections and in Arabidopsis lyrata 79 4.3.4 Demography or selection? 80 4.3.5 Mutation bias 83 4.4 Discussion 83 CHAPTER 5 : MATERIALS AND METHODS 84 5.1 Generation of raw data 84 5.2 Sequence analysis 85 5.3 Annotation 90 REFERENCE 92 vi LIST OF TABLES Table 2.1 Comparison of gene specific profile for two types of methylation ....................................... 27 Table 3.1 U-norm for comparing two opposite derived allele frequencies. A U-norm approximately equal to 0.5 indicates an ‘absence of fixation bias’. A U- norm greater than 0.5 indicates a biased fixation for cytosines. Thus, the CG enrichment is mainly caused by a C⬄T substitution. ................................................ 54 Table 3.2 Ongoing W S fixation Bias in Swedish Arabidopsis is present for all cytosine contexts (W S DAF skew is quantified by the U-norm statistic) ................................. 58 Table 4.1 Pearson Correlation score between total cytosine number for each accession and minimum temperature of the accession’s location of origin. .................................. 76 Table 4.2 The correlation between cytosine number and MinTemp for different contexts and alternative alleles. .................................................................................................................... 77 Table 4.3 The Pearson Correlation score between total cytosine number for each accession and minimum temperature for the accession’s location of origin. ........... 78 Table 4.4 A correlation between latitude and number of cytosine in Arabidopsis lyrata ............... 79 Table 4.5 The ratio of Pi for deamination mutation type divided by pi for all mutation ................. 83 vii LIST OF FIGURES Figure 2.1 Generation of the data set. ................................................................................................................. 10 Figure 2.2 Phenotypic polymorphisms for Swedish Arabidopsis thaliana (Photo taken by Suzi Atwell) .......................................................................................................................................... 12 Figure 2.3 The distribution of leaf number when flowering for our Swedish population compared to the worldwide collection. The red histogram indicates the Swedish accessions and the light blue histogram is the worldwide accessions. The range of this phenotype is roughly the same for the two collections. ........................................................................................................................................... 12 Figure 2.4 Flowering time as a function of two temperatures in growth chambers for Swedish Arabidopsis thaliana genotypes. Each line corresponds to an inbred line. Accessions in red flowered 20% earlier in 10°C than 16°C, while accessions in green were the opposite. Flowering time at 10°C for accessions in grey was within ±20% of 16°C. ....................................................................... 13 Figure 2.5 An example of two methylated regions in the genome .The blue blocks indicate annotated protein coding genes. The green blocks indicate annotated transposable elements. The orange bars mark the position of a methylated CpG sites, and the height of bar is proportional to the methylation level (between 0 and 1). The green bar marks the position of a CpG with enough coverage to infer its methylated state. The height of bar is proportional to 1 minus the methylation level (1 if CpG is un-methylated). ................................................. 15 Figure 2.6 The distribution of methylation levels for each cytosine context. The error bar indicates the standard error of fraction within 155 accessions with BS-seq data at 10°C. ....................................................................................................................................... 15 Figure 2.7 The distribution of average genomic methylation level for CpG, CHG and CHH among 155 accessions at 10°C. The red line denotes Col-0. ............................... 16 Figure 2.8 The effect of temperature on the number of methylated cytosines. For 110 accessions with available BS-seq data for both temperatures, there are statistically significant more CHH methylated at 16°C compared to 10°C (p value<2.2e-16). ................................................................................................................................. 18 Figure 2.9 The mean CHH methylation level for accessions at 16°C are significantly higher than those for the same accessions at 10°C. Each line is one accession, and Y-axis is the mean methylation level. The red lines denote 99 accessions whose CHH methylation was higher at 16°C. The green lines denote 13 lines with the opposite pattern. ............................................................................... 18 Figure 2.10 A Manhattan plot of GWAS results using average levels of CHH methylation at 10°C as the phenotype. The strongest signal falls near CMT2. ................................ 19 Figure 2.11 The odds ratio for genes being expressed if they are methylated. Other than the 5’UTR, other genomic feature all show a significant odds ratio. An odds viii ratio above 1 (above 0 in the log2 transformed plot) means more likely to be transcribed if methylated. Red line indicates 0. .............................................................. 21 Figure 2.12 The relationship between DNA methylation and transcription. Genes were rank ordered and binned based on expression in 9-leaf-stage rosettes. Each bin on the X-axis is a 5% bin of RPKM value. The average CpG methylation for gene body region and promoter region for each bin are plotted. ........................... 21 Figure 2.13 A correlation between gene expression and mean CpG methylation for 200bp sliding windows, for genes with the most significant negative association. (A) The X-axis is the position relative to the TSS(Transcription Start site), in units of 100bp. The Y-axis is the correlation coefficient between gene expression and the methylation for each sliding window. The black vertical line marks the TSS (Transcription Start Site). (B) Correlation between gene expression and methylation in 200bp windows for genes with most significant negative correlation. Windows for the whole gene region are with negative correlation. The blue line is the mean for correlation score for all genes, and the shade means the standard error of the mean correlation. .......................................................................................................................................... 24 Figure 2.14 A correlation between gene expression and mean CpG methylation for 200bp sliding windows, for genes with the most significant positive association. (A) The X-axis is the position relative to TSS, in unist of 100bp. Y-axis is the correlation coefficient between gene expression and the methylation for the sliding window. The green vertical line marks the TSS .(B) A correlation between gene expression and methylation in 200bp windows for genes with the most significant positive correlation. Usually windows for the whole gene region are positively correlated. The blue line is the mean for correlation score for all genes, and the shade indicates the standard error of the mean correlation. ................................................................................................................. 26 Figure 2.15 The distribution of genomic position for the 200bp window with the most significant correlation for top significant genes. Negative correlation is strongest at the Transcription Start Site (TSS), while positive correlation is usually stronger in the gene body region(left panel). ......................................................... 26 Figure 2.16 AT3G25720, an example of methylation covering the promoter region, possibly caused by the spreading of methylation from nearby TEs. The SNP haplotype structure (left) in and around genes (blue arrow) is strongly correlated to the methylation pattern, which in turn correlates with differences in gene expression (center). The different colors for methylation across the gene corresponds to each 20% quantile: blue <green <yellow <red <purple. ..................................................................................................................................... 28 Figure 2.17 An example of GBM: SNP haplotype structure (left) in and around the gene AT2G20650, appears to show no correlation with expression and methylation pattern. The blue arrow indicates the gene position and the direction of transcription. .............................................................................................................. 29 ix Figure 2.18 A genome-wide scan for association with gene expression as phenotype and mean methylation for 1kb sliding window as genotype. Expression and methylation at 10°C are used in this plot and regions with highly significant association (p value <10 -7 ) are plotted. Light blue to dark red indicates the gradient of more significant p values. The diagonal dotted line indicates methylation around the gene region that associated significantly with gene expression (i.e. a cis-correlation). ............................................................................................. 30 Figure 2.19 An example of the highest association observed in a GWAS (10°C expression vs. 1kb methylation): AT3G30720, Qua-Quine Starch (QQS). The five colors denote the five chromosomes, in left to right order 1-5. The dotted line denotes the Bonferroni threshold for an alpha level of 5%. .................................... 31 Figure 2.20 The methylation and expression for AT3G30720 (QQS).The orange bar is the constructed methylated region and is based on methylated cytosines (orange vertical lines). The green bar indicates expression values (RPKM). Expression is likely to be suppressed by methylation around QQS, since the methylation pattern appears to have spread across the whole region. ....................... 32 Figure 2.21 Pearson correlation between mean SNP LD and r 2 of methylation status of single cytosine for each gene. Blue bar denotes all protein coding genes; Red bar denotes genes with significant positive correlation between expression and methylation (based on cor>0.15); Green bar denotes genes with significant negative correlation between expression and methylation (based on cor<-‐0.15). ...................................................................................................................... 34 Figure 2.22 Pearson correlation between mean SNP LD and r 2 of methylation status of 200bp sliding window for each gene. Blue bar denotes all protein coding genes; Red bar denotes genes with significant positive correlation between expression and methylation (based on cor>0.15); Green bar denotes genes with significant negative correlation between expression and methylation (based on cor<-‐0.15). .................................................................................. 35 Figure 2.23 A gene profile for AT4G03470, which shows a significant correlation between expression and CG methylation (cor=0.64). A strong haplotype pattern correlates with methylation and expression among accessions. (A) Linkage disequilibrium in SNPs, and correlation heatmap for methylation. Sliding windows of CHG and CHH methylation shows a similar pattern with the CpG sliding window, therefore this data is not shown. (B) R squared for LD and the correlation between methylation for 200 sliding window and single cytosines. The Y-axis is the distance between two SNPs, or two sliding windows, or two cytosines. R squared for SNPs decreases with distance slower when compared to methylation. (C) Clustering based on methylation and expression among samples. Each row is one accession. Two haplotypes defined by local SNP pattern correlate strongly with methylation and expression. In the left panel, green to red means increasing of methylation and expression level. In the middle panel, the blue bar is the gene position, and SNPs within 1kb are included in the plot, with red color x indicates minor allele. In the right panel, green to red indicates increasing of methylation level. ......................................................................................................................... 39 Figure 2.24 (A) Using the mean methylation level for the gene region of AT4G03470 as phenotype and full sequence SNP as genotype, GWAS identified one SNP 2kb downstream of the gene as the most significantly associated SNP to the methylation haplotype. (B) The Y-axis is the mean CpG methylation level for AT4G03470. Two groups are the SNP type at the position most significantly associated identified in figure (A). Accessions with ‘A’ at this SNP position show significantly higher methylation compared to accessions with ‘G’ at this SNP position. ...................................................................................................... 39 Figure 2.25 A gene profile for AT5G32460, which shows a significant correlation between expression and CG methylation (cor=0.46). Strong haplotype pattern correlates with methylation and expression type among accessions. (A) Linkage disequilibrium in SNPs, and correlation heatmap for methylation. (B) R squared for LD and correlation between methylation for 200 sliding window and single cytosines. R squared for SNPs decreases with distance much slower when compared to methylation. (C) Clustering based on methylation and expression among samples. Each row is one accession. Methylation does not have clear haplotypes, and does not correlate clearly with the SNP haplotype. .............................................................................. 41 Figure 2.26 A gene profile for AT3G26612, which shows a significant correlation between expression and CG methylation (cor=-0.77). Strong correlation between methylation status nearby, yet no significant correlation with SNP haplotype. (A) Linkage disequilibrium in SNPs, and correlation heatmap for methylation. Sliding window with CHG and CHH methylation shows similar pattern with CpG sliding window, thus not shown. (B) R square for LD and correlation between methylation for 200 sliding window and single cytosine. (C) Clustering based on methylation and expression among samples. Each row is one accession. Methylation shows two clear haplotypes, and does not correlate with SNP haplotypes. ................................................ 42 Figure 2.27 A gene profile for AT2G31150, which shows a significant correlation between expression and CG methylation (cor=-0.75). Strong correlation between methylation status are seen nearby but no significant correlation is observed with SNP haplotype. (A) Linkage disequilibrium in SNPs, and correlation heatmap for methylation. Sliding window using CHG and CHH methylation show as similar pattern to the CpG sliding window, therefore this data is not shown. (B) R squared for LD and correlation between methylation for 200 sliding window and single cytosines. (C) Clustering based on methylation and expression among samples. Each row is one accession. Methylation shows two clear haplotypes, and correlate clearly with SNPs. ............................................................................................................................................ 43 Figure 2.28 A genome-wide scan for associations using estimated gene copy numbers as the phenotype and SNPs as the genotype. Highly significant associations (p value <10 -10 ) are plotted. The light blue to dark red colour indicates the xi gradient of p values significance. The X-axis is the SNP position, or in other words the candidate position where the new copy is located. The Y-axis is the gene position in Col-0. The diagonal dotted line indicates SNPs that are associated with the copy number and are where the gene is annotated in Col-0 (cis). ........................................................................................................................................... 45 Figure 2.29 Comparison of methylation levels for genes with trans and cis signals shown when running GWAS for copy number variations. Only protein coding genes are included in the analysis. The Y-axis is the mean methylation level for the gene across all copies. 4 groups for the boxplots were presented here: CpG methylation for genes with trans association, CpG methylation for genes with only cis association, CHG methylation for genes with trans association, CHG methylation for genes with only cis association. ............................. 46 Figure 2.30 For genes with CNV and identified with a potential insertion location on another chromosome. Each line is one significant association between the estimated copy number for a gene and SNPs. The left bar is the location of the SNPs (potential insertion position). The right bar is the location where the gene is annotated in Col-0. The histograms on two sides are the total hits for each position. ...................................................................................................................... 47 Figure 2.31 One gene example AT1G27540: expression and methylation increases with more copy number. (a) The Y-axis shows gene expression (RPKM) for each accession, the X-axis is the normalized copy number of genes. Accessions with a higher copy numbers show higher expression (cor=0.71). (b) The Y- axis is the mean CpG methylation for genes in each accession, the X-axis is the normalized cop number of genes. Accessions with a higher copy number show higher methylation (cor=0.62). (c) The X axis is mean CpG methylation for genes in each accession, the Y- axisis gene expression (RPKM). Expression and methylation are positively correlated (cor=0.55).). .................................................................................................................................................................. 48 Figure 2.32 One gene example: methylation decreases as copy number increases. (a) The Y-axis is gene expression (RPKM) in each accession, the X-axis is the normalized copy number of all the genes. Accessions with higher copy numbers show marginally higher expression (cor=0.16). (b) The Y- axis is the mean CpG methylation for genes in each accession, the X-axi is the normalized copy number of all the genes. Accessions with higher copy numbers show lower methylation (cor=-0.48). (c) The X-axis is the mean CpG methylation for genes in each accession, the Y-axis is gene expression (RPKM). Expression and methylation are weakly positively correlated (cor=0.16). .......................................................................................................................................... 49 Figure 2.33 One gene example: expression and methylation do not change much as copy number increases. (a) The Y-axis is gene expression (RPKM) in each accession, the X-axis the normalized copy of genes. Accessions with higher copy numbers show similar expression. (b) The Y-axis is the mean CpG methylation for genes in each accession, the X- axis is the normalized copy number of all the genes. Accessions with higher copy numbers show similar xii methylation. (c) The X-axis is the mean CpG methylation for genes in each accession, the Y-axis is gene expression (RPKM). Expression and methylation are weakly positively correlated (cor=0.14). ................................................ 49 Figure 3.1 The number of the six different types of nucleotide substitution. The number of SNPs are grouped by the mutation type that generated them. The inserted figure shows the estimated mutation rate for each type (Ossowski et al. 2010). While GC to AT mutation are by far the most common, SNPs resulting from mutation in the opposite direction are more common in the wild. ........................................................................................................................................................ 52 Figure 3.2 DAF (Derived Allele Frequency) distribution shows ongoing CG enrichment in Swedish Arabidopsis population. (A) DAF for StoW (Strong to Weak) and WtoS (Weak to Strong) alleles. (B) DAF for C=>T and T=>C alleles. ..................... 54 Figure 3.3 Derived allele frequency distribution for six possible mutation types. AT to GC mutations are over-represented at frequencies near fixation, suggesting that selection is responsible for the higher GC content of the A. thaliana genome, and is still ongoing. ........................................................................................................ 55 Figure 3.4 The derived allele frequency distribution for six different types of nucleotide substitutions for five CpG methylation level categories. The methylation level for each SNP was annotated using the mean CpG methylation level for the most adjacent 200bp-sliding window across the genome. Mean CpG methylation level was classified into 5 categories and the upper limit is shown. More AT to GC mutations reached high derived allele frequencies in higher methylated regions. ....................................................................................................... 57 Figure 3.5 The Derived allele frequency distribution for GBM, TE and Gene RdDM. Ongoing CG enrichment is observed in TE and genes with RdDM. ............................. 60 Figure 3.6 U-norm for SNPs in TEs and genes under the RdDM pathway show a biased fixation as methylation levels increase, while SNPs in genes with gene body methylation do not show a strongly biased fixation. A higher U-norm indicates a greater fixation bias towards CG enrichment. ............................................... 61 Figure 3.7 The methylation rates for accessions with cytosines at the position show higher methylation in regions with similar mean methylation levels. Mrate is defined as the ratio of methylation among accessions with cytosines and reads mapped to them. .................................................................................................................... 65 Figure 3.8 A correlation between methylation rates for accessions with cytosines at the given position and the Derived Allele Frequency for that mutation type. .................. 65 Figure 3.9 The median Mrate differences for ATtoGC for high DAF – low DAF. ............................ 65 Figure 3.10 The median Mrate difference for GCtoAT for high DAF – low DAF. ............................ 66 Figure 3.11 GO enrichment for genes in Gene RdDM ................................................................................. 67 Figure 3.12 Genes under the RdDM pathway (with CG enrichment) show a negative correlation between expression and methylation, low expression level (red line indicates the genome average). .......................................................................................... 68 xiii Figure 3.13 Genes with a significantly negative correlation between expression and methylation show CG enrichment. ............................................................................................. 69 Figure 4.1 Northern accessions show a much higher CpG methylation for protein coding genes. (A) Mean CpG methylation levels for protein coding genes significantly correlate with latitude (cor=0.74). Y-axis is the mean CpG methylation level for the genic region in one accession. (B) Violin plot for mean CG methylation of protein coding genes for accessions originating in northern and southern Sweden. ................................................................................................... 71 Figure 4.2 The correlation between mean CpG methylation for different genomic features and latitude. The red line indicates the correlation between global CpG methylation and latitude. All protein coding gene-related features show a significant correlation(cor >0.6). ............................................................................................... 72 Figure 4.3 The correlation between gene CpG methylation and climate data. The minimum temperature from the sampling location showed the strongest correlation with CG methylation for genes. The red line indicates the correlation between latitude and methylation. ...................................................................... 72 Figure 4.4 Gene AT5G20420 with the strongest correlation between latitude and methylation. (left panel) The Y-axis is the mean CG methylation for this gene, the X axis is the latitude. (right panel) The Y-axis is the RPKM expression for this gene, the X-axis is the latitude. ............................................................. 74 Figure 4.5 The number of cytosines strongly correlate with minimum temperature of the coldest month in the accession’s location of origin. (left panel). A correlation between MinTemp and number of cytosines. (right panel) A strong correlation between total number of CAG in the genome and minimum temperature of the accession’s location of origin. ........................................... 75 Figure 4.6 A correlation between latitude and number of cytosines in Arabidopsis lyrata. Four panels are for CpG number, CHG number, CHH number and total cytosine number respectively. ...................................................................................................... 80 Figure 4.7 The five allele frequency categories used to calculate the number of cytosines ........... 82 Figure 4.8 The genic region for each category in Figure 4.7 and all alleles with cytosines ............................................................................................................................................... 82 xiv ABBREVIATIONS GBM Gene body methylation, calculated by the mean methylation level between transcription start site and stop site TE Transposable elements RdDM RNA directed DNA methylation TSS Transcription start site from Tair10 annotation TTS Transcription stop site from Tair10 annotation SNP Single nucleotide polymorphism DAF Derived allele frequency. Polarized by outgroup Arabidopsis lyrata and Capsella rubella LD Linkage disequilibrium DMR Differentially methylated regions CNV Copy number variation BS-seq Bisulfite sequencing GWAS Genome wide association studies W2S Weak to strong, denotes nucleotide substitution from A/T to C/G S2W Strong to weak, denotes nucleotide substitution from C/G to A/T MWU Mann Whitney U test Mrate Methylation rate among the accessions with a cytosine at that position xv ABSTRACT DNA methylation involves the addition of one methyl group to a cytosine nucleotide, thereby altering the underlying genetic structure and increasing the frequency of mutations brought about by cytosine to thymine deamination. However, how natural selection i) responds to this aforementioned higher mutation rate and ii) shapes the “epigenome” despite a constantly changing genome composition remains unclear. To better understand how i) the genotype and environment interact and affect DNA methylation and transcription and ii) the role methylation plays in genome evolution, we generated transcriptome- and methylome-sequencing data for 200 natural Swedish Arabidopsis thaliana accessions. These accessions were grown in two different environments, 10°C and 16°C (chosen because they led to different flowering phenotypes). Combined with existing genomic resequencing- (Long et al., 2012) and phenotypic-data, our comprehensive dataset enabled us to explore the complex relationships that exist between genotype and phenotype. For each gene, we constructed a gene-specific profile by integrating: i) SNP data ii) gene expression iii) DNA methylation and iv) correlations between these different layers of data, including local LD structure. These profiles served as a powerful research tool. We investigated genetic variation and epigenetic variation in Swedish Arabidopsis thaliana, and demonstrated that there is an ongoing CG enrichment in regions with non- CpG methylation, while regions with only CpG methylation do not show similar patterns of selection. This suggests that DNA methylation in TEs, or RdDM (RNA directed DNA methylation) methylated regions that are under strong selection, is likely to be of more xvi functional consequence than regions that are methylated only in the CpG context (i.e. mostly genic regions). Furthermore, this finding may i) help solve the many points of contention that surround the functionality of GBM (Gene body methylation), and ii) explain the contrasting correlation patterns observed between expression and methylation for TE/RdDM regions (a negative correlation) and gene body regions (a positive correlation). Additionally, the mean global GBM and total number of cytosines in the genome shows significant geographical differences, and although the reason behind this pattern remains elusive, selection is likely to play a part. Chapter 1 : Introduction Arabidopsis thaliana is a commonly used model system to investigate genetic- epigenetic interactions and to delineate the genotype-phenotype map, due to the following reasons i) their small genome (~125Mb) as availability of inbred lines is less costly for whole-genome sequencing ii) environmental effects can be easily controlled, and iii) predictions can be experimentally verified using replicable genotypes. To further our understanding of natural variation and elucidate how the genotype, epi-genotype and environment interact and shape important life-history traits, we quantified changes in gene expression and DNA methylation in 200 Swedish Arabidopsis thaliana accessions grown in two environments (10°C and 16°C), and analyzed these ‘molecular phenotypes’ in combination with existing genomic sequencing datasets (Long et al., 2012). In this dissertation, we focus on an analysis of the natural variation of the DNA methylome and its effect on genome evolution, using this novel Swedish Arabidopsis dataset. Significant correlations between expression and DNA methylation are widely observed and extensively studied in plants. The following general trends have been observed: i) TEs are highly methylated and methylation acts to limit transcription and proliferation ii) promoter regions of genes are methylated at a low level which is negatively correlate with expression iii) moderately expressed genes are methylated too, and conversely the methylation level is positively correlated with expression (Zilberman et al., 2007). However, these reported trends are based on taking an average of all genes 2 and notably are limited by the sample sizes used. This may yield to many misleading generalizations for any particular gene. With available RNA-seq and BS-seq data for our large collection of natural Swedish Arabidopsis accessions, we have the power and resolution to dissect the correlation between DNA methylation and gene expression at a single gene level. The function of methylation in TE and promoter regions is well studied. For TE regions in plants, cytosines are methylated and heterochromatin structure is induced by histone modification. This stable silencing effect appears to involve a positive feedback loop and the spreading of DNA-methylation. In promoter regions methylation is hypothesized to repress expression for two reasons 1) methylation binding proteins can recruit other proteins and introduce changes in the chromatin structure and 2) the added methyl group to the cytosine base results in a sequence motif in the promoter region that is unrecognizable to some transcription factors (Zilberman, 2008). However, the function of GBM remains largely unresolved. Though a positive correlation between gene expression and methylation is evidently apparent (Zemach, McDaniel, Silva, & Zilberman, 2010), whether methylation is induced by expression or whether it affects the regulation of transcription has been a matter of intense debate. To help solve this puzzle, we investigated whether GBM is functional or not by comparing the selection signal on genomic sequences with and without GBM. The underlying assumption is that in regions where methylation affects gene expression and thereby possibly fitness, cytosines are being selected for since they are required for methylation to take place. Subsequently, mutations that result in the gain of cytosines would be favored, while those that result in cytosine loss would be selected against. On 3 the other hand, in regions where methylation is a downstream by-product of gene expression, cytosines would not be selected for in order to maintain the methylation status. In summary, if GBM is functional, different selection pressures regarding mutations related to cytosines should be found across regions that differ in methylation status. Moreover, the natural population of Arabidopsis thaliana used in this study was collected from heterogeneous environments in Sweden. These accessions have readily adapted to latitudinal or altitudinal gradients with different temperatures, soil conditions, and water availability. All Arabidopsis accessions were collected with coordinate information on where they originate. This geographical data provides important information regarding adaptive and demographic history, which influences the genetic and epigenetic differentiation between populations. In this dissertation, we present novel discoveries in analyzing methylome data comprehensively with other genome-wide datasets (transcriptome, genome re- sequencing). In Chapter 1, we start by briefly reviewing recent studies on the biological mechanisms associated with the establishment and maintenance of DNA methylation patterns in Arabidopsis. In Chapter 2, we describe the natural variation of the Arabidopsis methylome, the environmental effects, and the effect of methylation variation on expression. At the end, we generated a single gene profile database with all levels of information as resources for other researchers, incorporating gene expression, methylation, local SNP patterns, copy number estimates and local LD patterns. 4 In Chapter 3, we demonstrate that there is ongoing enrichment of CG nucleotides in Arabidopsis thaliana, and such enrichment is only found in regions with non-CpG methylation, which implies positive selection for such regions. In Chapter 4, we report the significant geographical bias in genic CpG methylation and total cytosine number in the whole genome. Accessions originating from northern Sweden show significantly higher genic methylation and significantly higher cytosine number, especially for CHG number in genic regions. Based on GWAS studies, the bias in CpG methylation seems to be under genetic control, while the reason behind the bias in cytosine number is still unclear. Finally, we present our materials and methods. 1.1 DNA methylation in Arabidopsis thaliana DNA methylation refers to the addition of a methyl group to a cytosine nucleotide. Cytosines reside in three sequence contexts: CpG, CHG and CHH (where H is any nucleotide other than G). In mammals, CpG sites are methylated exclusively, while in plants all three contexts are methylated. Furthermore, whereas 70%-80% CpGs are methylated in mammals, DNA methylation is sparse in plants. In Arabidopsis thaliana, approximately 24% CpGs, 6.7% CHGs and 1.7% CHHs are methylated (Cokus et al., 2008). In Arabidopsis thaliana, distinct enzymes methylate cytosines in three sequence contexts. CpG is methylated by MET1 (DNA Methyltransferase 1), and the methylation status of CpG can be almost faithfully transmitted from one generation to the next through the VIM-MET1 pathway, where VIM (Variant In Methylation) recognizes 5 hemimethylated symmetrical CpG sites generated through DNA replication and recruits MET1 (Law & Jacobsen, 2010). CHG is methylated by CMT3 (Chromo- methyltransferase 3) and CHG methylation can be partially maintained by the KYP- CMT3 pathway. KYP dimethylates H3K9 of histone and then H3K9me2 can be bound by CMT3, which methylates CHG nearby (S Inagaki & Kakutani, 2013). CHH is methylated by DRM2 (Domain Rearranged Methyltransferase 2) in the RdDM pathway. Core enzymes in the RdDM pathway are plant specific DNA-dependent RNA polymerase PolIV and PolV. TEs, endogenous repeats, DNA viruses, and some protein coding genes are all targets of the RdDM pathway (Stroud, Greenberg, Feng, Bernatavichute, & Jacobsen, 2013). Looking across the genome, higher methylation patterns are observed for all sequence contexts at the pericentromeric regions (Lister et al., 2008), where TEs are prevalent. There are two types of methylation in terms of genomic feature in plants: GBM and TE methylation. Methylation of the gene body region (defined as the region between transcription start and stop site) is usually CpG exclusive, while TE regions can be methylated in all sequence contexts. The lack of non-CpG methylation in the genic region is due to IBM1 (Increase in Bonsai Methylation 1), which appears to be recruited to actively transcribed genes and de-methylates H3K9me2. Since H3K9me2 is a signal to activate KYP-CMT3 pathway to maintain CHG methylation, CHG methylation is lost in active genes (Fan et al., 2012; Soichi Inagaki et al., 2010). Although TE methylation has been acquired independently in several evolutionary lineages, GBM is a conserved feature for many eukaryotes, except for fungi (Zemach & 6 Zilberman, 2010). While the function of TE methylation is well studied, both the evolutionary dynamics and the function of GBM are uncertain. Patterns of gene body methylation exhibit polymorphism in different wild accessions of Arabidopsis. However, orthologs of rice and Brachypodium distachyon show a strong conservation of GBM (Takuno & Gaut, 2013). Genes with GBM are have been shown to be conserved between species, functionally important and to evolve slowly (Takuno & Gaut, 2013). It has been hypothesized that GBM acts to increase the accuracy of splicing, prevent aberrant transcription within genes, and exclude H2A.Z from genes that are constitutively expressed (Coleman-Derr & Zilberman, 2012; Zilberman, Coleman-Derr, Ballinger, & Henikoff, 2008). However, many studies reported a stable DNA methylation pattern across different tissue and development stages, despite variable expression being observed across the same samples (Schmid et al., 2005; Schmitz et al., 2013). Moreover, in met1 mutants, where gene body methylation is largely removed, the expression of genes does not change significantly (Lister et al., 2008). This indicates a plausible expression-regulating role of GBM. Whether GBM is byproduct of moderate transcription, or whether GBM is functional in terms of regulating transcription is currently being debated. In this thesis, we generated and analyzed the natural variation of methylomes under two temperatures and combined this together with transcriptome and genome data to explore both the regulation and functionality of methylation. 7 1.2 Comparing the Mutational and Polymorphism Spectrum for Arabidopsis thaliana Epigenetic changes are heritable and do not involve changing the underlying DNA sequence. Epigenetic marks, such as DNA methylation, evolve faster than genetic changes. In Arabidopsis thaliana the epimutation rate, estimated in Mutation Accumulation lines, is 6.37 x10 7 times faster than the spontaneous mutation rate of base substitutions (Becker et al., 2011). In plants, since no germ line is preserved aside from the somatic tissues, somatic mutations can be transmitted from one generation to the next. The methylation status of CpG and CHG contexts are largely maintained in the Arabidopsis male germ line, however, reprogramming of CHH methylation occurs in the pollen and this is mediated by siRNA. Therefore, epigenetic diversity in natural accessions serves as an important platform for natural selection and evolution to act on. Moreover, methylation can affect the genomic composition directly due to an elevated mutation ratio for methylated cytosines. Spontaneous deamination of methylated cytosines can lead to G:C=>A:T substitutions, which is believed to be a major source of genetic mutations. In mutation accumulation lines of Arabidopsis thaliana, transitions (purine to purine, or pyrimidine to pyrimidine mutations) were estimated to be 2.4 times more frequent than transversions (purine to pyrimidine, and vice versa mutations), and G:C=>A:T was indeed the most frequent type of mutations (Ossowski et al., 2010). DNA methylation impacts genome evolution for two reasons: 1) the underlying mutation process varies between methylated and un-methylated cytosines, since the spontaneous deamination of methylated cytosines results in a high mutation rate and a preferential replacement of cytosine(C) with thymine (T). 2) Natural selection acts 8 differently on sites that affect fitness. Assuming that the methylation status contributes to the regulation of gene expression, the replacement of cytosines that are methylated should be more functionally constrained compared to those that are un-methylated,. The observed AT ratio is far less than the genomic equilibrium expected by the mutation rate (68% vs. 85%) (Ossowski et al., 2010), so either the genome is presently evolving towards a higher AT content, or there is a selection pressure that prevents AT accumulation. Genomes of the same species vary in sequence content as a result of evolution. The dropping cost of obtaining whole genome sequencing data makes it possible to infer evolutionary processes using extant patterns of both genetic and epigenetic variation. The advantage of using whole genome re-sequencing data to perform population genetic analyses is a reduction in ascertainment bias. Since the SNPs were called by the pileup of all 259 Swedish samples pooled together, it eliminated the ascertainment bias that existed in the 250k SNP dataset, which was based on a SNP survey for 20 worldwide accessions. A survey with 80 world-wide collections of Arabidopsis thaliana showed an excess of G:C=>A:T polymorphism present in alleles with a low derived allele frequency, most of which appear to be recent mutations. In contrast, A:T=>G:C and G:C=>A:T mutations are almost equal for alleles with a high derived allele frequency (Cao et al., 2011). This contrasting pattern implies positive selection for A:T=>G:C mutations, and we will test this trend in more detail in chapter 3, and elaborate the relationship between selection on cytosines and methylation. A recent study examined the natural variation of the Arabidopsis methylome by sequencing the methylome of 152 wild accessions living in the northern hemisphere. The 9 team reported that single cytosine methylation polymorphisms are not linked to the genotype (Schmitz et al., 2013). Additionally, they showed that RNA-seq data taken from different tissues clustered on the tissue type, while DNA methylation data clustered on the genotype. This indicated that methylation is far less dynamic when compared to expression, in natural populations. Furthermore, through genome wide association mapping they demonstrated that a large fraction of DMRs show a putative causal genetic variation. However, they did not investigate the reverse of this relationship—the impact of methylation on genome evolution, which we have included in our analysis. 10 Chapter 2 : Natural Variation of Methylome for Arabidopsis thaliana Natural wild accessions of Arabidopsis thaliana taken from geographically distinct regions, provides a large collection of both genetic and epigenetic variants, which can be inherited and replicated under laboratory conditions to reproduce diverse phenotypes. To characterize the epigenetic variants that exist in natural populations, and to elucidate the link between the methylome and transcriptome as well as genetic variants, we quantified transcriptome and methylome data in two environments (10°C and 16°C) for 200 lines of Arabidopsis thaliana through next generation sequencing (Figure 2.1). Figure 2.1 Generation of the data set. There are three reasons Swedish accessions were chosen for this study: 1. Sample size for Swedish accessions is large enough that the distribution of phenotypes, such as flowering time and leaf number, has a similar range to the worldwide 11 samples (Figure 2.2, Figure 2.3). 200 accessions were selected from 330 available Swedish lines, based on the diversity of phenotype and geographic location. 2. Less genic heterogeneity in terms of functional alleles enabled us to control for population structure more efficiently and carry out GWAS in local populations with a good resolution. 3. A large sample size from a restricted geographical region with diverse local environments allowed us to look for evidence of selection and adaptation. We grew 200 Swedish accessions in two environments, 10°C and 16°C, chosen because phenotypic effects were observed at these two temperatures. Some accessions flowered faster at 10°C than at 16°C, while others showed the opposite pattern (Figure 2.4). This implied gene regulation was occurring in a temperature dependent manner in these two environments. Indeed, in a previous study using 10°C, 16°C and 22°C, Atwell et al., (2010), found that Swedish accessions grew best at 10°C, which is closer to their natural habitat, while temperatures above 16°C stressed the plants. Hence, we chose these two temperatures to explore the effect of the environment on the methylome and transcriptome. To avoid the massive regulatory changes that come with the onset of flowering, all samples were collected at a young plant stage with 9 true leaves, i.e well before flowering time. The methylation profiles we studied, therefore, represent the sum of methylomes from several leaf tissues and differentiated cell types (e.g. epidermis, photosynthetic mesophyll cells, vascular tracts, apical meristem), but sampled at the same developmental stage. 12 Figure 2.2 Phenotypic polymorphisms for Swedish Arabidopsis thaliana (Photo taken by Suzi Atwell) Figure 2.3 The distribution of leaf number when flowering for our Swedish population compared to the worldwide collection. The red histogram indicates the Swedish accessions and the light blue histogram is the worldwide accessions. The range of this phenotype is roughly the same for the two collections. 13 Figure 2.4 Flowering time as a function of two temperatures in growth chambers for Swedish Arabidopsis thaliana genotypes. Each line corresponds to an inbred line. Accessions in red flowered 20% earlier in 10°C than 16°C, while accessions in green were the opposite. Flowering time at 10°C for accessions in grey was within ±20% of 16°C. 2.1 The basic pattern of variation of the Arabidopsis thaliana methylome DNA methylation in Arabidopsis thaliana has been described in the reference genome Col-0 and in several mutants of key enzymes involved in the maintenance and the de novo generation of DNA methylation marks (Cokus et al., 2008; Lister et al., 2008; Stroud et al., 2013). In addition, the natural variation of the genome wide pattern of DNA methylation in wild Arabidopsis populations has been published recently by Schmitz et al. (2013). However, a basic description of natural variation of the Arabidopsis methylome compared to the Col-0 reference is lacking. In addition, the wealth of information from the whole methylome of 200 wild Arabidopsis lines provides a unique 14 opportunity to use this natural variation as a virtual repeated experiment of methylation mutagenesis. This gives us the power to define correlation trends between methylation variations, genome and transcriptome properties with much more accuracy than could be achieved with a single line. The overall density of methylated sites increased in pericentromeric regions and heterochromatin knobs, as previously observed in Col-0 (Lister et al., 2008). A closer examination of the distribution of methylated cytosines along chromosome arms confirmed that the Arabidopsis genome is mostly un-methylated or methylated at low levels through single isolated methylated cytosines. In contrast, most methylated sites occur in clusters, often overlapping genes and transposable elements (Figure 2.5). Methylation across different genotypes is very stable (Figure 2.5), as reported previously by Schmitz et al., (2013). The fraction of methylated sites for CpG, CHG and CHH are estimated at a similar level across natural accessions (Figure 2.6). Consistent with previous studies, though the methylation profile is generally stable on a gene level, a lot of variation was found between accessions even at the level of genome-wide averages (Figure 2.7). 15 Figure 2.5 An example of two methylated regions in the genome .The blue blocks indicate annotated protein coding genes. The green blocks indicate annotated transposable elements. The orange bars mark the position of a methylated CpG sites, and the height of bar is proportional to the methylation level (between 0 and 1). The green bar marks the position of a CpG with enough coverage to infer its methylated state. The height of bar is proportional to 1 minus the methylation level (1 if CpG is un-methylated). Figure 2.6 The distribution of methylation levels for each cytosine context. The error bar indicates the standard error of fraction within 155 accessions with BS-seq data at 10°C. Figure'1' A.'Distribution$of$methylation$level$in$each$context$ B.'Average$methylation$levels$of$annotated$features$of$the$genome$ A 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Intergenic Promoter (-1kb to TSS) Coding gene Pseudogene miRNA ncRNA rRNA TE - class 1 TE - class 2 Average CpG methylation 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 Intergenic Promoter (-1kb to TSS) Coding gene Pseudogene miRNA ncRNA rRNA TE - class 1 TE - class 2 Average non-CpG methylation CHG CHH B 16 Figure 2.7 The distribution of average genomic methylation level for CpG, CHG and CHH among 155 accessions at 10°C. The red line denotes Col-0. 2.2 The influence of the environment on the Arabidopsis thaliana methylome Although the natural variation seen in the Arabidopsis methylome has been published recently by Schmitz et al., (2013), the contrast of environmental settings in our study is unique and this enabled us to explore the genotype by environment effect on the methylome. We tested whether the genome-wide methylation levels are different between plants grown at 10°C and 16°C respectively. In general, plants grown in these two environments show a similar CpG and CHG methylation level. Interestingly, however, CHH methylation at 16°C is statistically significantly higher than at 10°C, based on a paired Welch’s t-test (p value <2.2e-16). On average, there are 115,240 more methylated cytosines in the CHH context at 16°C, based on a paired comparison for the same genotype (Figure 2.8). Mean CHH methylation for 110 accessions with BS-seq at both temperatures was 14% higher at 16°C than at 10°C (Figure 2.9). In plants, CHH methylation mainly occurs in TEs, and this is catalyzed by the RdDM pathway. Thus, the temperature effect is more likely due to an increased activity CpG meth CpG meth Frequency 0.20 0.22 0.24 0.26 0.28 0.30 0.32 0 5 10 15 20 25 CHG meth CHG meth Frequency 0.060 0.070 0.080 0.090 0 5 10 15 CHH meth CHH meth Frequency 0.010 0.014 0.018 0.022 0 5 10 15 20 17 of enzymes in the RdDM pathway. CpG and CHG do not show significant differences between temperatures, which can partially be explained by an inheritable property for symmetric methylation. As expected, the CHH methylation differences between temperatures are more significant for TEs. To investigate the genetic basis for CHH methylation, we performed a genome wide association study using the average CHH methylation level for each accession as the phenotype. A clear peak of association was observed on chromosome 4 (after Bonferroni correction) from our GWAS result (Figure 2.10). The most significantly associated SNP is located around 38kb downstream from CMT2, which was recently reported to methylate CHG and CHH sites (Stroud et al., 2014). To confirm the effect of CMT2 on CHH methylation, we performed F2 crosses, QTL mapping and mutant analysis. These experiments demonstrated that although CMT2 is responsible for the variation seen for CHH methylation, the temperature differences is likely due to the elevated activity of the RdDM pathway at higher temperatures, and independent of CMT2 (Dubin, Zhang, Meng, & Remigereau, unpublished results). 18 Figure 2.8 The effect of temperature on the number of methylated cytosines. For 110 accessions with available BS-seq data for both temperatures, there are statistically significant more CHH methylated at 16°C compared to 10°C (p value<2.2e-16). Figure 2.9 The mean CHH methylation level for accessions at 16°C are significantly higher than those for the same accessions at 10°C. Each line is one accession, and Y-axis is the mean methylation level. The red lines denote 99 accessions whose CHH 19 methylation was higher at 16°C. The green lines denote 13 lines with the opposite pattern. Figure 2.10 A Manhattan plot of GWAS results using average levels of CHH methylation at 10°C as the phenotype. The strongest signal falls near CMT2. 2.3 The effect of methylation variation on gene expression 2.3.1 Correlation between positional methylation and expression of genes A strong correlation between gene expression and DNA methylation has been observed previously (Zilberman et al., 2007), but this observation was only based on taking the average of all genes within one genome. Our comprehensive dataset with natural variations enabled us to study another kind of correlation—at a gene specific level— between mean methylation for sliding window and gene expression across all accessions. Though some trans effects from the genomic background can confound the relationship between expression and methylation, focusing on genes that show a strong correlation should avoid this problem. We focus on CpG methylation in this section for two reasons 1) CpG methylation can be faithfully inherited from one generation to the next based on the symmetric structure and 2) in genic regions, which is more interesting in terms of function, usually 20 CpG is exclusively methylated. Even for genes with non-CpG methylation, the level of CpG methylation correlates well with non-CpG methylation (Cokus et al., 2008). For the correlation between methylation and expression that has been previously reported, sequencing data provided an improved resolution compared to previous results based on microarray data. Methylated genes are more likely to be transcribed, especially when they are methylated in the exon, intron, and 3’UTR region. In contrast, genes methylated in the region around the TSS are less likely to be expressed (Figure 2.11). This is consistent with the relationship between methylation and expression in Figure 2.12. Genes with GBM show a bell shaped relationship between expression and methylation, with positive correlations seen for low and moderately expressed genes, while negative correlations were seen for highly expressed (above 75%) or extremely low expressed genes (<10%). This negative correlation is significant for the promoter region, especially for short genes (i.e. <1kb). 21 Figure 2.11 The odds ratio for genes being expressed if they are methylated. Other than the 5’UTR, other genomic feature all show a significant odds ratio. An odds ratio above 1 (above 0 in the log2 transformed plot) means more likely to be transcribed if methylated. Red line indicates 0. Figure 2.12 The relationship between DNA methylation and transcription. Genes were rank ordered and binned based on expression in 9-leaf-stage rosettes. Each bin on the X- axis is a 5% bin of RPKM value. The average CpG methylation for gene body region and promoter region for each bin are plotted. Genes vary for lots of properties (length, base composition, promoter, mechanisms of regulation, function). The fact that a strong trend appears by averaging all genes indicates either: 1) a potential rule exists that is fundamental enough to regulate the expression of all genes by a universal mechanism, or 2) the methylation pattern that we observe is merely a reflection of gene expression. Results obtained by averaging all genes can be too great a generalization to draw a conclusion about the relationship between methylation and expression. In addition, this 22 kind of analysis cannot utilize data from multiple genomes such as natural variations, which are hundreds of accessions with a different genetic background. In our study, since we have large sample size (~150) for each gene, the correlation between sliding window methylation and expression can be calculated for each gene across all samples, with a resolution of positional methylation. Three classes of genes were expected in this analysis: (1) genes with variable methylation that correlate with fluctuating expression; (2) genes with variable methylation but do not correlate with changes in expression (either there is stable expression across samples, or the change in expression is independent of methylation) (3) genes with no variation among accessions. The first class of genes are potential targets of epigenetic regulation, and may present selective signatures that are associated with methylation. Thus, we would like to pinpoint those genes. Notably however, since we are limited by the developmental stage we sampled at, we may have missed genes with a functional methylation in other stages in our study. We computed the correlation between mean CpG methylation of 200bp sliding windows and expression on a per gene basis, from 1kb upstream of the TSS to 3kb downstream of the TSS, and from 1kb upstream of the TTS to 1kb downstream of the TTS. 7,925 genes were detected with at least one 200bp methylation window showing a significant correlation with gene expression. Among them, 3,779 genes showed a negative correlation (Pearson cor<-0.2), and 5,247 genes showed a positive correlation (Pearson cor>0.2). The correlation in sliding windows for genes that showed the strongest negative correlation between expression and methylation are depicted in Figure 2.13a, and the 23 trends are summarized in Figure 2.13b. The most significant negative correlation lies at the TSS position itself, and the whole gene region shows a negative correlation from 1kb upstream of the TSS to 1kb downstream of the TTS. This lies in stark contrast to the results seen by taking the average of all genes, were negative correlations were limited to 200bp upstream of the TSS, to 800bp downstream of the TSS (Song, 2013). Similarly, the correlation in sliding windows for genes that showed the strongest positive correlation between expression and methylation are depicted in Figure 2.14a, and the trends are summarized in Figure 2.14b. The most significant positive correlation lies at 1kb upstream of the TTS, and the whole gene region shows a positive correlation that stretches to 1kb around the gene. When taking the average of all genes the results showed that the positive correlation was limited to 1000bp downstream of the TSS to 800bp upstream of the TSS. Therefore, averaging all genes gives a misleading conclusion because it mixes the signals of two distinct correlations together. For each gene that showed a significant correlation between expression and sliding window methylation, the position with the strongest association was marked and plotted as a distribution (Figure 2.15). Positions with the strongest negative correlation between gene expression and sliding window methylation tightly center on the TSS, while the positions with the strongest positive correlation center in the gene body region and away from both the TSS and TTS. This is consistent with the general trend we observed in Figure 2.11: expression of genes methylated in the promoter tends to be repressed and genes that are methylated in the middle are expressed. 24 Figure 2.13 A correlation between gene expression and mean CpG methylation for 200bp sliding windows, for genes with the most significant negative association. (A) The X-axis is the position relative to the TSS(Transcription Start site), in units of 100bp. The Y-axis is the correlation coefficient between gene expression and the methylation for each 25 sliding window. The black vertical line marks the TSS (Transcription Start Site). (B) Correlation between gene expression and methylation in 200bp windows for genes with most significant negative correlation. Windows for the whole gene region are with negative correlation. The blue line is the mean for correlation score for all genes, and the shade means the standard error of the mean correlation. 26 Figure 2.14 A correlation between gene expression and mean CpG methylation for 200bp sliding windows, for genes with the most significant positive association. (A) The X-axis is the position relative to TSS, in unist of 100bp. Y-axis is the correlation coefficient between gene expression and the methylation for the sliding window. The green vertical line marks the TSS .(B) A correlation between gene expression and methylation in 200bp windows for genes with the most significant positive correlation. Usually windows for the whole gene region are positively correlated. The blue line is the mean for correlation score for all genes, and the shade indicates the standard error of the mean correlation. Figure 2.15 The distribution of genomic position for the 200bp window with the most significant correlation for top significant genes. Negative correlation is strongest at the Transcription Start Site (TSS), while positive correlation is usually stronger in the gene body region(left panel). 27 2.3.2 Clustering analysis shows two types of methylation for genes To further investigate the relationship between expression and methylation, for each gene, we clustered accessions based on local SNPs (i.e. SNPS located 1000bp around the gene), and plotted the methylation and expression level accordingly. From such gene- specific profiles, we could observe that genes with opposite correlations show a different methylation pattern (Table 2.1). Table 2.1 Comparison of gene specific profile for two types of methylation Type 1 gene Type 2 gene Correlation between methylation and expression Negative correlation Positive correlation Methylation at gene border (TSS, TTS) Methylation across gene border Methylation avoiding gene border Methylation level High level Low level Non-CpG methylation Present Absent Methylation pathway RdDM MET1 Number of genes 3779 5247 Example Profile Figure 2.16 Figure 2.17 The first type of methylation is promoter methylation (Figure 2.16), where high methylation levels are apparent in addition to a continuous long region of methylation that spans both the TSS and TTS. The local SNP haplotype pattern usually correlates tightly with the methylation pattern, which appears to be the result of nearby TE insertions. The long stretch of methylation involves non-CpG methylation and histone modification, which is under the RdDM pathway. Methylation in this region limits transcription. The second type of methylation is GBM (Figure 2.17). Here the methylation level is relatively low and variable. The methylation itself is restricted from the gene borders (i.e. the TSS and TTS). No apparent association between the methylation type and SNP 28 haplotype is immediately obvious. Genes of this type are usually exclusively methylated in the CpG context. The restriction of GBM rom the gene borders explains why the correlation between expression and methylation dropped to 0 as it approaches the gene border (Figure 2. 14). In summary, gene specific profiles that integrate both methylome and transcriptome data are a powerful tool to dissect correlations between DNA methylation and gene expression. Figure 2.16 AT3G25720, an example of methylation covering the promoter region, possibly caused by the spreading of methylation from nearby TEs. The SNP haplotype structure (left) in and around genes (blue arrow) is strongly correlated to the methylation pattern, which in turn correlates with differences in gene expression (center). The different colors for methylation across the gene corresponds to each 20% quantile: blue <green <yellow <red <purple. 38 Figure!2.27.!!An!example!of!GBM:!SNP!haplotype!structure!(left)!in!and!around!gene! AT2G20650!appears!no!correlation!with!expression!and!methylation!pattern.!The! blue!arrow!indicates!the!gene!position!and!the!transcribed!direction.!! ! Figure! 2.28,! AT3G25720,! an! example! of! methylation! covering! promoter! region,! possibly!by!extension!of!methylation!from!nearby!TEs.!SNP!haplotype!structure!(left)! in!and!around!genes!(blue!arrow)!can!be!found!strongly!correlated!to!Methylation! pattern,!which!in!turn!correlates!with!differences!in!gene!expression!(center).!! ! s8902680 s8902682 s8902732 s8902739 s8902788 s8902900 s8902930 s8902966 s8903048 s8903070 s8903155 s8903249 s8903312 s8903429 s8903504 s8903665 s8903813 s8903876 s8904163 s8904274 s8904535 s8904595 s8904618 s8904632 s8904702 s8904796 s8905125 s8905146 s8905234 s8905431 s8905443 s8905600 s8905673 s8905854 s8905863 s8905922 s8905971 s8906050 s8906059 s8906119 s8906120 s8906308 s8906445 s8906667 s8906830 s8906928 s8907054 s8907098 s8907134 s8907210 s8907216 s8907217 s8907220 s8907226 s8907254 s8907255 s8907316 s8907335 s8907353 s8907365 s8907366 s8907378 s8907380 s8907388 s8907391 s8907399 s8907401 s8907403 s8907405 s8907410 s8907412 s8907413 s8907414 s8907419 s8907422 s8907428 s8907432 s8907434 s8907436 s8907448 s8907450 s8907453 s8907456 s8907473 s8907482 s8907492 s8907495 s8907498 s8907500 s8907501 s8907504 s8907517 s8907518 s8907520 s8907521 s8907524 s8907528 s8907535 s8907539 s8907540 s8907549 s8907550 s8907552 s8907553 s8907554 s8907555 s8907559 s8907572 s8907584 s8907596 s8907610 s8907633 s8907636 s8907664 s8907672 s8907691 s8907723 s8907731 s8907781 s8907823 s8907829 s8907831 s8907832 s8907849 s8907855 s8907864 s8907865 s8907979 s8907994 s8907995 s8907997 s8907998 s8907999 s8908011 s8908015 s8908018 s8908024 s8908032 s8908038 s8908044 s8908053 s8908058 s8908059 s8908065 s8908069 s8908072 s8908086 s8908089 s8908091 s8908094 s8908103 s8908106 s8908110 s8908112 s8908124 s8908125 s8908132 s8908135 s8908139 s8908140 s8908146 s8908149 s8908165 s8908174 s8908176 s8908187 s8908197 s8908198 s8908212 s8908216 s8908218 s8908223 s8908231 s8908237 s8908238 s8908240 s8908250 s8908260 s8908277 s8908278 s8908284 s8908287 s8908289 s8908291 s8908298 s8908313 s8908316 s8908319 s8908320 6235_10C 6235_16C 8387_10C 6901_10C 6901_16C 6016_10C 6043_10C 6043_16C 6046_10C 6046_16C 6243_10C 5856_10C 6090_10C 6193_16C 8369_10C 8369_16C 6150_10C 6013_10C 6069_10C 6039_10C 6088_16C 8242_16C 8241_10C 8241_16C 6188_10C 6188_16C 7519_10C 7519_16C 8426_10C 8334_10C 8256_10C 8256_16C 9454_10C 9454_16C 8326_10C 9436_10C 6122_10C 6122_16C 6042_16C 9437_10C 8283_10C 9470_10C 9470_16C 6036_10C 6036_16C 6040_10C 6136_16C 9434_16C 8249_10C 8249_16C 6231_10C 6173_10C 6173_16C 6244_10C 6244_16C 9323_10C 9323_16C 6071_10C 6071_16C 6913_10C 6913_16C 6209_10C 6209_16C 9427_10C 6216_10C 6070_10C 6070_16C 6237_16C 9332_10C 9332_16C 9371_10C 9371_16C 6221_10C 9363_16C 6011_10C 1552_16C 8227_10C 8227_16C 8335_10C 6242_10C 6242_16C 6189_10C 6189_16C 6153_10C 9452_10C 9452_16C 8376_10C 8376_16C 6092_10C 7516_10C 6030_10C 6030_16C 6102_10C 6102_16C 6076_10C 6076_16C 8240_10C 6128_10C 6413_16C 6074_16C 9416_10C 6100_10C 6100_16C 6252_16C 9381_10C 6149_10C 9383_16C 1257_16C 9336_10C 9336_16C 9058_10C 9058_16C 1061_10C 1074_10C 1074_16C 9394_10C 9382_10C 6192_10C 9057_10C 9057_16C 6184_10C 6085_10C 6085_16C 6973_10C 6973_16C 6118_10C 6118_16C 6094_10C 1137_10C 1137_16C 8222_16C 8258_16C 5832_10C 5832_16C 8351_16C 6151_10C 6151_16C 6114_10C 992_10C 992_16C 8237_10C 6112_10C 6112_16C 6125_10C 6131_10C 6131_16C 9399_10C 9395_10C 1062_16C 6104_16C 6132_10C 6109_10C 991_16C 6113_10C 6034_10C 6034_16C 9481_16C 6202_10C 6202_16C 6180_16C 5865_10C 6105_16C 9413_16C 9352_10C 9353_10C 9369_16C 9339_16C 9339_10C 6098_10C 6098_16C 1 2 6235_10C 6235_16C 8387_10C 6901_10C 6901_16C 6016_10C 6043_10C 6043_16C 6046_10C 6046_16C 6243_10C 5856_10C 6090_10C 6193_16C 8369_10C 8369_16C 6150_10C 6013_10C 6069_10C 6039_10C 6088_16C 8242_16C 8241_10C 8241_16C 6188_10C 6188_16C 7519_10C 7519_16C 8426_10C 8334_10C 8256_10C 8256_16C 9454_10C 9454_16C 8326_10C 9436_10C 6122_10C 6122_16C 6042_16C 9437_10C 8283_10C 9470_10C 9470_16C 6036_10C 6036_16C 6040_10C 6136_16C 9434_16C 8249_10C 8249_16C 6231_10C 6173_10C 6173_16C 6244_10C 6244_16C 9323_10C 9323_16C 6071_10C 6071_16C 6913_10C 6913_16C 6209_10C 6209_16C 9427_10C 6216_10C 6070_10C 6070_16C 6237_16C 9332_10C 9332_16C 9371_10C 9371_16C 6221_10C 9363_16C 6011_10C 1552_16C 8227_10C 8227_16C 8335_10C 6242_10C 6242_16C 6189_10C 6189_16C 6153_10C 9452_10C 9452_16C 8376_10C 8376_16C 6092_10C 7516_10C 6030_10C 6030_16C 6102_10C 6102_16C 6076_10C 6076_16C 8240_10C 6128_10C 6413_16C 6074_16C 9416_10C 6100_10C 6100_16C 6252_16C 9381_10C 6149_10C 9383_16C 1257_16C 9336_10C 9336_16C 9058_10C 9058_16C 1061_10C 1074_10C 1074_16C 9394_10C 9382_10C 6192_10C 9057_10C 9057_16C 6184_10C 6085_10C 6085_16C 6973_10C 6973_16C 6118_10C 6118_16C 6094_10C 1137_10C 1137_16C 8222_16C 8258_16C 5832_10C 5832_16C 8351_16C 6151_10C 6151_16C 6114_10C 992_10C 992_16C 8237_10C 6112_10C 6112_16C 6125_10C 6131_10C 6131_16C 9399_10C 9395_10C 1062_16C 6104_16C 6132_10C 6109_10C 991_16C 6113_10C 6034_10C 6034_16C 9481_16C 6202_10C 6202_16C 6180_16C 5865_10C 6105_16C 9413_16C 9352_10C 9353_10C 9369_16C 9339_16C 9339_10C 6098_10C 6098_16C 2 4 6 8 10 12 Value Color Key 8902000 8904000 8906000 8908000 AT2G20650 Lines 8903000 8905000 8907000 Chr.2 Lines Haplotype!by!SNP !!!!!!!!Expression Methylation s9379247 s9379278 s9379279 s9379284 s9379289 s9379293 s9379298 s9379300 s9379314 s9379322 s9379323 s9379327 s9379343 s9379345 s9379347 s9379357 s9379376 s9379404 s9379407 s9379410 s9379421 s9379425 s9379432 s9379439 s9379446 s9379483 s9379491 s9379522 s9379564 s9379581 s9379597 s9379606 s9379625 s9379631 s9379638 s9379655 s9379669 s9379671 s9379680 s9379687 s9379691 s9379712 s9379713 s9379728 s9379750 s9379759 s9379776 s9379778 s9379814 s9379851 s9379882 s9379896 s9379904 s9379905 s9379906 s9379907 s9379909 s9379910 s9379913 s9379932 s9379940 s9379946 s9379956 s9379963 s9379971 s9379976 s9379978 s9379996 s9380049 s9380115 s9380119 s9380129 s9380136 s9380137 s9380150 s9380188 s9380271 s9380272 s9380273 s9380274 s9380284 s9380296 s9380304 s9380308 s9380332 s9380346 s9380377 s9380387 s9380391 s9380411 s9380415 s9380456 s9380485 s9380493 s9380516 s9380558 s9380560 s9380588 s9380600 s9380670 s9380692 s9380698 s9380703 s9380727 s9380806 s9380830 s9380834 s9380837 s9380893 s9380920 s9380953 s9380963 s9380965 s9380978 s9380979 s9380987 s9381013 s9381014 s9381055 s9381071 s9381101 s9381133 s9381157 s9381212 s9381221 s9381230 s9381249 s9381252 s9381266 s9381277 s9381331 s9381351 s9381360 s9381376 s9381377 s9381407 s9381415 s9381432 s9381433 s9381445 s9381455 s9381460 s9381461 s9381468 s9381476 s9381478 s9381483 s9381484 s9381504 s9381512 s9381514 s9381526 s9381530 s9381539 s9381540 s9381748 s9381754 s9381774 s9381791 s9381825 s9381840 s9381852 s9381875 s9381879 s9381881 s9381975 s9382024 s9382137 s9382143 s9382168 s9382213 s9382225 s9382240 s9382249 s9382281 s9382293 s9382325 s9382332 s9382378 s9382384 8249_10C 8326_10C 8242_16C 5832_16C 8227_10C 8227_16C 6094_10C 1552_16C 9381_10C 9413_16C 6243_10C 9383_16C 9336_10C 9336_16C 8237_10C 8283_10C 9427_10C 9399_10C 7519_10C 7519_16C 9470_16C 6074_16C 1137_10C 1137_16C 6216_10C 9371_16C 6901_16C 6188_10C 6188_16C 6118_10C 6118_16C 6193_16C 6085_10C 6085_16C 6043_10C 6043_16C 6046_16C 6153_10C 6109_10C 992_10C 992_16C 6016_10C 6973_16C 6244_16C 9332_10C 7516_10C 9339_10C 9369_16C 6252_16C 8222_16C 6036_10C 5865_10C 6042_16C 6071_16C 9352_10C 6209_10C 6209_16C 9323_10C 9323_16C 6237_16C 5856_10C 6173_16C 6069_10C 9452_16C 9452_10C 9353_10C 6913_16C 6913_10C 6235_10C 6231_10C 6221_10C 6070_16C 6070_10C 6011_10C 6013_10C 9416_10C 8387_10C 9057_10C 9057_16C 6189_10C 6189_16C 6100_10C 6100_16C 6413_16C 6098_10C 6098_16C 6136_16C 8369_10C 8369_16C 8335_10C 6104_16C 6150_10C 1062_16C 6040_10C 991_16C 6149_10C 6122_16C 1061_10C 6122_10C 6076_10C 6076_16C 6180_16C 1074_10C 1074_16C 6102_10C 6102_16C 8426_10C 6034_10C 6034_16C 6039_10C 6113_10C 9454_10C 9454_16C 8240_10C 6151_10C 6151_16C 9058_10C 9058_16C 9394_10C 6030_10C 6030_16C 8258_16C 8256_10C 8256_16C 8334_10C 9481_16C 6242_10C 6242_16C 6105_16C 6114_10C 6131_10C 6131_16C 8351_16C 1257_16C 8376_10C 8376_16C 6088_16C 6184_10C 6125_10C 6112_10C 6112_16C 8241_16C 6090_10C 8241_10C 9437_10C 9436_10C 6202_10C 6202_16C Haplotype!by!SNP Expression Methylation gene 1 2 8249_10C 8326_10C 8242_16C 5832_16C 8227_10C 8227_16C 6094_10C 1552_16C 9381_10C 9413_16C 6243_10C 9383_16C 9336_10C 9336_16C 8237_10C 8283_10C 9427_10C 9399_10C 7519_10C 7519_16C 9470_16C 6074_16C 1137_10C 1137_16C 6216_10C 9371_16C 6901_16C 6188_10C 6188_16C 6118_10C 6118_16C 6193_16C 6085_10C 6085_16C 6043_10C 6043_16C 6046_16C 6153_10C 6109_10C 992_10C 992_16C 6016_10C 6973_16C 6244_16C 9332_10C 7516_10C 9339_10C 9369_16C 6252_16C 8222_16C 6036_10C 5865_10C 6042_16C 6071_16C 9352_10C 6209_10C 6209_16C 9323_10C 9323_16C 6237_16C 5856_10C 6173_16C 6069_10C 9452_16C 9452_10C 9353_10C 6913_16C 6913_10C 6235_10C 6231_10C 6221_10C 6070_16C 6070_10C 6011_10C 6013_10C 9416_10C 8387_10C 9057_10C 9057_16C 6189_10C 6189_16C 6100_10C 6100_16C 6413_16C 6098_10C 6098_16C 6136_16C 8369_10C 8369_16C 8335_10C 6104_16C 6150_10C 1062_16C 6040_10C 991_16C 6149_10C 6122_16C 1061_10C 6122_10C 6076_10C 6076_16C 6180_16C 1074_10C 1074_16C 6102_10C 6102_16C 8426_10C 6034_10C 6034_16C 6039_10C 6113_10C 9454_10C 9454_16C 8240_10C 6151_10C 6151_16C 9058_10C 9058_16C 9394_10C 6030_10C 6030_16C 8258_16C 8256_10C 8256_16C 8334_10C 9481_16C 6242_10C 6242_16C 6105_16C 6114_10C 6131_10C 6131_16C 8351_16C 1257_16C 8376_10C 8376_16C 6088_16C 6184_10C 6125_10C 6112_10C 6112_16C 8241_16C 6090_10C 8241_10C 9437_10C 9436_10C 6202_10C 6202_16C 0 1 2 3 4 5 6 Value Color Key 1 2 8249_10C 8326_10C 8242_16C 5832_16C 8227_10C 8227_16C 6094_10C 1552_16C 9381_10C 9413_16C 6243_10C 9383_16C 9336_10C 9336_16C 8237_10C 8283_10C 9427_10C 9399_10C 7519_10C 7519_16C 9470_16C 6074_16C 1137_10C 1137_16C 6216_10C 9371_16C 6901_16C 6188_10C 6188_16C 6118_10C 6118_16C 6193_16C 6085_10C 6085_16C 6043_10C 6043_16C 6046_16C 6153_10C 6109_10C 992_10C 992_16C 6016_10C 6973_16C 6244_16C 9332_10C 7516_10C 9339_10C 9369_16C 6252_16C 8222_16C 6036_10C 5865_10C 6042_16C 6071_16C 9352_10C 6209_10C 6209_16C 9323_10C 9323_16C 6237_16C 5856_10C 6173_16C 6069_10C 9452_16C 9452_10C 9353_10C 6913_16C 6913_10C 6235_10C 6231_10C 6221_10C 6070_16C 6070_10C 6011_10C 6013_10C 9416_10C 8387_10C 9057_10C 9057_16C 6189_10C 6189_16C 6100_10C 6100_16C 6413_16C 6098_10C 6098_16C 6136_16C 8369_10C 8369_16C 8335_10C 6104_16C 6150_10C 1062_16C 6040_10C 991_16C 6149_10C 6122_16C 1061_10C 6122_10C 6076_10C 6076_16C 6180_16C 1074_10C 1074_16C 6102_10C 6102_16C 8426_10C 6034_10C 6034_16C 6039_10C 6113_10C 9454_10C 9454_16C 8240_10C 6151_10C 6151_16C 9058_10C 9058_16C 9394_10C 6030_10C 6030_16C 8258_16C 8256_10C 8256_16C 8334_10C 9481_16C 6242_10C 6242_16C 6105_16C 6114_10C 6131_10C 6131_16C 8351_16C 1257_16C 8376_10C 8376_16C 6088_16C 6184_10C 6125_10C 6112_10C 6112_16C 8241_16C 6090_10C 8241_10C 9437_10C 9436_10C 6202_10C 6202_16C 0 1 2 3 4 5 6 Value Color Key gene 9378000 9379000 9380000 9381000 AT3G25720 Lines 9379500 9380500 9381500 9382500 Chr.2 Lines 29 Figure 2.17 An example of GBM: SNP haplotype structure (left) in and around the gene AT2G20650, appears to show no correlation with expression and methylation pattern. The blue arrow indicates the gene position and the direction of transcription. 2.3.3 GWAS using DNA methylation to explain gene expression Due to the widely observed correlation, mean methylation level for sliding window analysis across a genome can be used as a genotype marker to run a GWAS to explain gene expression (Figure 2.18). Hundreds of gene expression patterns can be mapped this way. While the trans peaks are difficult to verify, the cis peaks (144 genes with a p value <1e-7) are almost certainly true, since more often than not local SNPs are significantly associated with both gene expression and local methylation. Many studies have observed extensive epigenetic variation at genes that have originated de novo. QQS (Qua-Quine Starch) is one such example. This gene is embedded in a repeat rich region, and is present in several lines of Arabidopsis. Moreover, QQS does not show linkage with adjacent TE insertions (Silveira et al., 2013). 38 Figure!2.27.!!An!example!of!GBM:!SNP!haplotype!structure!(left)!in!and!around!gene! AT2G20650!appears!no!correlation!with!expression!and!methylation!pattern.!The! blue!arrow!indicates!the!gene!position!and!the!transcribed!direction.!! ! Figure! 2.28,! AT3G25720,! an! example! of! methylation! covering! promoter! region,! possibly!by!extension!of!methylation!from!nearby!TEs.!SNP!haplotype!structure!(left)! in!and!around!genes!(blue!arrow)!can!be!found!strongly!correlated!to!Methylation! pattern,!which!in!turn!correlates!with!differences!in!gene!expression!(center).!! ! s8902680 s8902682 s8902732 s8902739 s8902788 s8902900 s8902930 s8902966 s8903048 s8903070 s8903155 s8903249 s8903312 s8903429 s8903504 s8903665 s8903813 s8903876 s8904163 s8904274 s8904535 s8904595 s8904618 s8904632 s8904702 s8904796 s8905125 s8905146 s8905234 s8905431 s8905443 s8905600 s8905673 s8905854 s8905863 s8905922 s8905971 s8906050 s8906059 s8906119 s8906120 s8906308 s8906445 s8906667 s8906830 s8906928 s8907054 s8907098 s8907134 s8907210 s8907216 s8907217 s8907220 s8907226 s8907254 s8907255 s8907316 s8907335 s8907353 s8907365 s8907366 s8907378 s8907380 s8907388 s8907391 s8907399 s8907401 s8907403 s8907405 s8907410 s8907412 s8907413 s8907414 s8907419 s8907422 s8907428 s8907432 s8907434 s8907436 s8907448 s8907450 s8907453 s8907456 s8907473 s8907482 s8907492 s8907495 s8907498 s8907500 s8907501 s8907504 s8907517 s8907518 s8907520 s8907521 s8907524 s8907528 s8907535 s8907539 s8907540 s8907549 s8907550 s8907552 s8907553 s8907554 s8907555 s8907559 s8907572 s8907584 s8907596 s8907610 s8907633 s8907636 s8907664 s8907672 s8907691 s8907723 s8907731 s8907781 s8907823 s8907829 s8907831 s8907832 s8907849 s8907855 s8907864 s8907865 s8907979 s8907994 s8907995 s8907997 s8907998 s8907999 s8908011 s8908015 s8908018 s8908024 s8908032 s8908038 s8908044 s8908053 s8908058 s8908059 s8908065 s8908069 s8908072 s8908086 s8908089 s8908091 s8908094 s8908103 s8908106 s8908110 s8908112 s8908124 s8908125 s8908132 s8908135 s8908139 s8908140 s8908146 s8908149 s8908165 s8908174 s8908176 s8908187 s8908197 s8908198 s8908212 s8908216 s8908218 s8908223 s8908231 s8908237 s8908238 s8908240 s8908250 s8908260 s8908277 s8908278 s8908284 s8908287 s8908289 s8908291 s8908298 s8908313 s8908316 s8908319 s8908320 6235_10C 6235_16C 8387_10C 6901_10C 6901_16C 6016_10C 6043_10C 6043_16C 6046_10C 6046_16C 6243_10C 5856_10C 6090_10C 6193_16C 8369_10C 8369_16C 6150_10C 6013_10C 6069_10C 6039_10C 6088_16C 8242_16C 8241_10C 8241_16C 6188_10C 6188_16C 7519_10C 7519_16C 8426_10C 8334_10C 8256_10C 8256_16C 9454_10C 9454_16C 8326_10C 9436_10C 6122_10C 6122_16C 6042_16C 9437_10C 8283_10C 9470_10C 9470_16C 6036_10C 6036_16C 6040_10C 6136_16C 9434_16C 8249_10C 8249_16C 6231_10C 6173_10C 6173_16C 6244_10C 6244_16C 9323_10C 9323_16C 6071_10C 6071_16C 6913_10C 6913_16C 6209_10C 6209_16C 9427_10C 6216_10C 6070_10C 6070_16C 6237_16C 9332_10C 9332_16C 9371_10C 9371_16C 6221_10C 9363_16C 6011_10C 1552_16C 8227_10C 8227_16C 8335_10C 6242_10C 6242_16C 6189_10C 6189_16C 6153_10C 9452_10C 9452_16C 8376_10C 8376_16C 6092_10C 7516_10C 6030_10C 6030_16C 6102_10C 6102_16C 6076_10C 6076_16C 8240_10C 6128_10C 6413_16C 6074_16C 9416_10C 6100_10C 6100_16C 6252_16C 9381_10C 6149_10C 9383_16C 1257_16C 9336_10C 9336_16C 9058_10C 9058_16C 1061_10C 1074_10C 1074_16C 9394_10C 9382_10C 6192_10C 9057_10C 9057_16C 6184_10C 6085_10C 6085_16C 6973_10C 6973_16C 6118_10C 6118_16C 6094_10C 1137_10C 1137_16C 8222_16C 8258_16C 5832_10C 5832_16C 8351_16C 6151_10C 6151_16C 6114_10C 992_10C 992_16C 8237_10C 6112_10C 6112_16C 6125_10C 6131_10C 6131_16C 9399_10C 9395_10C 1062_16C 6104_16C 6132_10C 6109_10C 991_16C 6113_10C 6034_10C 6034_16C 9481_16C 6202_10C 6202_16C 6180_16C 5865_10C 6105_16C 9413_16C 9352_10C 9353_10C 9369_16C 9339_16C 9339_10C 6098_10C 6098_16C 1 2 6235_10C 6235_16C 8387_10C 6901_10C 6901_16C 6016_10C 6043_10C 6043_16C 6046_10C 6046_16C 6243_10C 5856_10C 6090_10C 6193_16C 8369_10C 8369_16C 6150_10C 6013_10C 6069_10C 6039_10C 6088_16C 8242_16C 8241_10C 8241_16C 6188_10C 6188_16C 7519_10C 7519_16C 8426_10C 8334_10C 8256_10C 8256_16C 9454_10C 9454_16C 8326_10C 9436_10C 6122_10C 6122_16C 6042_16C 9437_10C 8283_10C 9470_10C 9470_16C 6036_10C 6036_16C 6040_10C 6136_16C 9434_16C 8249_10C 8249_16C 6231_10C 6173_10C 6173_16C 6244_10C 6244_16C 9323_10C 9323_16C 6071_10C 6071_16C 6913_10C 6913_16C 6209_10C 6209_16C 9427_10C 6216_10C 6070_10C 6070_16C 6237_16C 9332_10C 9332_16C 9371_10C 9371_16C 6221_10C 9363_16C 6011_10C 1552_16C 8227_10C 8227_16C 8335_10C 6242_10C 6242_16C 6189_10C 6189_16C 6153_10C 9452_10C 9452_16C 8376_10C 8376_16C 6092_10C 7516_10C 6030_10C 6030_16C 6102_10C 6102_16C 6076_10C 6076_16C 8240_10C 6128_10C 6413_16C 6074_16C 9416_10C 6100_10C 6100_16C 6252_16C 9381_10C 6149_10C 9383_16C 1257_16C 9336_10C 9336_16C 9058_10C 9058_16C 1061_10C 1074_10C 1074_16C 9394_10C 9382_10C 6192_10C 9057_10C 9057_16C 6184_10C 6085_10C 6085_16C 6973_10C 6973_16C 6118_10C 6118_16C 6094_10C 1137_10C 1137_16C 8222_16C 8258_16C 5832_10C 5832_16C 8351_16C 6151_10C 6151_16C 6114_10C 992_10C 992_16C 8237_10C 6112_10C 6112_16C 6125_10C 6131_10C 6131_16C 9399_10C 9395_10C 1062_16C 6104_16C 6132_10C 6109_10C 991_16C 6113_10C 6034_10C 6034_16C 9481_16C 6202_10C 6202_16C 6180_16C 5865_10C 6105_16C 9413_16C 9352_10C 9353_10C 9369_16C 9339_16C 9339_10C 6098_10C 6098_16C 2 4 6 8 10 12 Value Color Key 8902000 8904000 8906000 8908000 AT2G20650 Lines 8903000 8905000 8907000 Chr.2 Lines Haplotype!by!SNP !!!!!!!!Expression Methylation s9379247 s9379278 s9379279 s9379284 s9379289 s9379293 s9379298 s9379300 s9379314 s9379322 s9379323 s9379327 s9379343 s9379345 s9379347 s9379357 s9379376 s9379404 s9379407 s9379410 s9379421 s9379425 s9379432 s9379439 s9379446 s9379483 s9379491 s9379522 s9379564 s9379581 s9379597 s9379606 s9379625 s9379631 s9379638 s9379655 s9379669 s9379671 s9379680 s9379687 s9379691 s9379712 s9379713 s9379728 s9379750 s9379759 s9379776 s9379778 s9379814 s9379851 s9379882 s9379896 s9379904 s9379905 s9379906 s9379907 s9379909 s9379910 s9379913 s9379932 s9379940 s9379946 s9379956 s9379963 s9379971 s9379976 s9379978 s9379996 s9380049 s9380115 s9380119 s9380129 s9380136 s9380137 s9380150 s9380188 s9380271 s9380272 s9380273 s9380274 s9380284 s9380296 s9380304 s9380308 s9380332 s9380346 s9380377 s9380387 s9380391 s9380411 s9380415 s9380456 s9380485 s9380493 s9380516 s9380558 s9380560 s9380588 s9380600 s9380670 s9380692 s9380698 s9380703 s9380727 s9380806 s9380830 s9380834 s9380837 s9380893 s9380920 s9380953 s9380963 s9380965 s9380978 s9380979 s9380987 s9381013 s9381014 s9381055 s9381071 s9381101 s9381133 s9381157 s9381212 s9381221 s9381230 s9381249 s9381252 s9381266 s9381277 s9381331 s9381351 s9381360 s9381376 s9381377 s9381407 s9381415 s9381432 s9381433 s9381445 s9381455 s9381460 s9381461 s9381468 s9381476 s9381478 s9381483 s9381484 s9381504 s9381512 s9381514 s9381526 s9381530 s9381539 s9381540 s9381748 s9381754 s9381774 s9381791 s9381825 s9381840 s9381852 s9381875 s9381879 s9381881 s9381975 s9382024 s9382137 s9382143 s9382168 s9382213 s9382225 s9382240 s9382249 s9382281 s9382293 s9382325 s9382332 s9382378 s9382384 8249_10C 8326_10C 8242_16C 5832_16C 8227_10C 8227_16C 6094_10C 1552_16C 9381_10C 9413_16C 6243_10C 9383_16C 9336_10C 9336_16C 8237_10C 8283_10C 9427_10C 9399_10C 7519_10C 7519_16C 9470_16C 6074_16C 1137_10C 1137_16C 6216_10C 9371_16C 6901_16C 6188_10C 6188_16C 6118_10C 6118_16C 6193_16C 6085_10C 6085_16C 6043_10C 6043_16C 6046_16C 6153_10C 6109_10C 992_10C 992_16C 6016_10C 6973_16C 6244_16C 9332_10C 7516_10C 9339_10C 9369_16C 6252_16C 8222_16C 6036_10C 5865_10C 6042_16C 6071_16C 9352_10C 6209_10C 6209_16C 9323_10C 9323_16C 6237_16C 5856_10C 6173_16C 6069_10C 9452_16C 9452_10C 9353_10C 6913_16C 6913_10C 6235_10C 6231_10C 6221_10C 6070_16C 6070_10C 6011_10C 6013_10C 9416_10C 8387_10C 9057_10C 9057_16C 6189_10C 6189_16C 6100_10C 6100_16C 6413_16C 6098_10C 6098_16C 6136_16C 8369_10C 8369_16C 8335_10C 6104_16C 6150_10C 1062_16C 6040_10C 991_16C 6149_10C 6122_16C 1061_10C 6122_10C 6076_10C 6076_16C 6180_16C 1074_10C 1074_16C 6102_10C 6102_16C 8426_10C 6034_10C 6034_16C 6039_10C 6113_10C 9454_10C 9454_16C 8240_10C 6151_10C 6151_16C 9058_10C 9058_16C 9394_10C 6030_10C 6030_16C 8258_16C 8256_10C 8256_16C 8334_10C 9481_16C 6242_10C 6242_16C 6105_16C 6114_10C 6131_10C 6131_16C 8351_16C 1257_16C 8376_10C 8376_16C 6088_16C 6184_10C 6125_10C 6112_10C 6112_16C 8241_16C 6090_10C 8241_10C 9437_10C 9436_10C 6202_10C 6202_16C Haplotype!by!SNP Expression Methylation gene 1 2 8249_10C 8326_10C 8242_16C 5832_16C 8227_10C 8227_16C 6094_10C 1552_16C 9381_10C 9413_16C 6243_10C 9383_16C 9336_10C 9336_16C 8237_10C 8283_10C 9427_10C 9399_10C 7519_10C 7519_16C 9470_16C 6074_16C 1137_10C 1137_16C 6216_10C 9371_16C 6901_16C 6188_10C 6188_16C 6118_10C 6118_16C 6193_16C 6085_10C 6085_16C 6043_10C 6043_16C 6046_16C 6153_10C 6109_10C 992_10C 992_16C 6016_10C 6973_16C 6244_16C 9332_10C 7516_10C 9339_10C 9369_16C 6252_16C 8222_16C 6036_10C 5865_10C 6042_16C 6071_16C 9352_10C 6209_10C 6209_16C 9323_10C 9323_16C 6237_16C 5856_10C 6173_16C 6069_10C 9452_16C 9452_10C 9353_10C 6913_16C 6913_10C 6235_10C 6231_10C 6221_10C 6070_16C 6070_10C 6011_10C 6013_10C 9416_10C 8387_10C 9057_10C 9057_16C 6189_10C 6189_16C 6100_10C 6100_16C 6413_16C 6098_10C 6098_16C 6136_16C 8369_10C 8369_16C 8335_10C 6104_16C 6150_10C 1062_16C 6040_10C 991_16C 6149_10C 6122_16C 1061_10C 6122_10C 6076_10C 6076_16C 6180_16C 1074_10C 1074_16C 6102_10C 6102_16C 8426_10C 6034_10C 6034_16C 6039_10C 6113_10C 9454_10C 9454_16C 8240_10C 6151_10C 6151_16C 9058_10C 9058_16C 9394_10C 6030_10C 6030_16C 8258_16C 8256_10C 8256_16C 8334_10C 9481_16C 6242_10C 6242_16C 6105_16C 6114_10C 6131_10C 6131_16C 8351_16C 1257_16C 8376_10C 8376_16C 6088_16C 6184_10C 6125_10C 6112_10C 6112_16C 8241_16C 6090_10C 8241_10C 9437_10C 9436_10C 6202_10C 6202_16C 0 1 2 3 4 5 6 Value Color Key 1 2 8249_10C 8326_10C 8242_16C 5832_16C 8227_10C 8227_16C 6094_10C 1552_16C 9381_10C 9413_16C 6243_10C 9383_16C 9336_10C 9336_16C 8237_10C 8283_10C 9427_10C 9399_10C 7519_10C 7519_16C 9470_16C 6074_16C 1137_10C 1137_16C 6216_10C 9371_16C 6901_16C 6188_10C 6188_16C 6118_10C 6118_16C 6193_16C 6085_10C 6085_16C 6043_10C 6043_16C 6046_16C 6153_10C 6109_10C 992_10C 992_16C 6016_10C 6973_16C 6244_16C 9332_10C 7516_10C 9339_10C 9369_16C 6252_16C 8222_16C 6036_10C 5865_10C 6042_16C 6071_16C 9352_10C 6209_10C 6209_16C 9323_10C 9323_16C 6237_16C 5856_10C 6173_16C 6069_10C 9452_16C 9452_10C 9353_10C 6913_16C 6913_10C 6235_10C 6231_10C 6221_10C 6070_16C 6070_10C 6011_10C 6013_10C 9416_10C 8387_10C 9057_10C 9057_16C 6189_10C 6189_16C 6100_10C 6100_16C 6413_16C 6098_10C 6098_16C 6136_16C 8369_10C 8369_16C 8335_10C 6104_16C 6150_10C 1062_16C 6040_10C 991_16C 6149_10C 6122_16C 1061_10C 6122_10C 6076_10C 6076_16C 6180_16C 1074_10C 1074_16C 6102_10C 6102_16C 8426_10C 6034_10C 6034_16C 6039_10C 6113_10C 9454_10C 9454_16C 8240_10C 6151_10C 6151_16C 9058_10C 9058_16C 9394_10C 6030_10C 6030_16C 8258_16C 8256_10C 8256_16C 8334_10C 9481_16C 6242_10C 6242_16C 6105_16C 6114_10C 6131_10C 6131_16C 8351_16C 1257_16C 8376_10C 8376_16C 6088_16C 6184_10C 6125_10C 6112_10C 6112_16C 8241_16C 6090_10C 8241_10C 9437_10C 9436_10C 6202_10C 6202_16C 0 1 2 3 4 5 6 Value Color Key gene 9378000 9379000 9380000 9381000 AT3G25720 Lines 9379500 9380500 9381500 9382500 Chr.2 Lines 30 However, in our Swedish population, QQS was identified as the most significant cis- associated gene with expression and methylation (Figure 2.19), showing a clear association with a local SNP pattern. This finding directly contradicts previous reports that QQS is purely de novo. Expression of QQS is likely to be suppressed by methylation around it, most likely caused by the silencing of nearby TEs (Figure 2.20). Figure 2.18 A genome-wide scan for association with gene expression as phenotype and mean methylation for 1kb sliding window as genotype. Expression and methylation at 10°C are used in this plot and regions with highly significant association (p value <10 -7 ) are plotted. Light blue to dark red indicates the gradient of more significant p values. The 35 ! ! Figure!2.24.!Genome]wide!scan!for!association!with!gene!expression!as!phenotype! and! mean! methylation! for! 1kb! sliding! window! as! genotype.! Expression! and! methylation! at! 10°C! are! used! in! this! plot! and! regions! with! highly! significant! association!(pval!<10 ]7 )!are!plotted.!Light!blue!to!dark!red!indicates!the!gradient!of! more!significant!p!value.!The!diagonal!dotted!line!indicates!methylation!around!the! gene!region!associated!significantly!with!gene!expression!(cis]correlation).!!! ! 31 diagonal dotted line indicates methylation around the gene region that associated significantly with gene expression (i.e. a cis-correlation). Figure 2.19 An example of the highest association observed in a GWAS (10°C expression vs. 1kb methylation): AT3G30720, Qua-Quine Starch (QQS). The five colors denote the five chromosomes, in left to right order 1-5. The dotted line denotes the Bonferroni threshold for an alpha level of 5%. 32 Figure 2.20 The methylation and expression for AT3G30720 (QQS).The orange bar is the constructed methylated region and is based on methylated cytosines (orange vertical lines). The green bar indicates expression values (RPKM). Expression is likely to be suppressed by methylation around QQS, since the methylation pattern appears to have spread across the whole region. 2.4 Linking genetic and methylation variants To determine the extent to which variation in DNA methylation and genotype are linked, single base resolution LD analysis was performed using SNPs and methylation data. LD of methylated cytosines was explored by Schmitz et al., (2013), and they reported that genetic variation for single nucleotide methylation polymorphisms are independent to CpG methylated regions. However, as already mentioned this study presented a general trend by taking the average for all genes. With our dataset, we were able to perform LD analysis for each gene individually and explore the distinct correlation between LD patterns and local methylation. LD patterns reveal a combination of information about i) historic recombination events ii) history of mutations iii) genetic drift and iv) signatures of selection. Assuming that local methylation levels are determined by the underlying DNA sequence, the disruption of SNP LD by recombination should also change the local methylation pattern. Therefore, a similar LD pattern for both methylation and SNPs is expected. In contrast, if methylation patterns are preserved independently and regardless of historic recombination patterns, it should show a pattern that cannot be tagged by SNPs. To test which scenario is true, we turned our focus to local LD structure, and only examined co-segregation of alleles that were the result of linkage. We calculated r 2 for local SNPs for each gene (SNP LD) and r 2 for methylation, separately for sliding window 33 methylation and single cytosine methylation. For each gene, we could analyze the gene specific profile by integrating SNPs, methylation, and expression for all samples. First, to check whether genes with strong SNP LD also exhibit high LD for methylation status, we calculated the mean r 2 for each gene (SNP, methylation status of each cytosine, methylation status of 200bp sliding window), and computed the Pearson correlation between each pair. For genes with negative correlation between expression and methylation, local LD pattern is significantly negatively correlated with the methylation r 2 of single CpG, but positively correlated with methylation r 2 of single CHG and CHH. Genes without negative correlation between expression and methylation do not show significant correlation between SNP LD and methylation (Figure 2.21). Thus at least for genes with RdDM (Table 2.1), SNP polymorphism correlates significantly with methylation variation. Genes with strong LD pattern exhibit strong correlation between single CHG and CHH methylation. However, single CpG methylation did not follow the same pattern. In addition, for the Pearson correlation coefficient between LD and r 2 of mean methylation level for 200bp sliding windows, all 3 contexts showed similar negative value. It is expected that 3 contexts behave similarly since such genes are mainly under RdDM thus non- distinguishing cytosine context. However, why genes with stronger LD show lower correlation between LD and single CpG/ sliding window methylation is unclear. One possible explanation is, genes under strong LD are tightly regulated by cis DNA sequence thus the effect of single CpG methylation and sliding window mean methylation is minor, while non-CpG methylation is more crucial to help regulate the 34 transcription. In addition, single CpG methylation can be maintained between cell division, thus the correlation observed can reflect the residue of regulation in other developmental stage. In other words, more noise is introduced and maintained in CpG signal. In summary, while methylation status of single CHG and CHH tags the co- inheritance of SNPs well, CpG methylation and sliding window methylation present more randomness. Figure 2.21 Pearson correlation between mean SNP LD and r 2 of methylation status of single cytosine for each gene. Blue bar denotes all protein coding genes; Red bar denotes genes with significant positive correlation between expression and methylation (based on cor>0.15); Green bar denotes genes with significant negative correlation between expression and methylation (based on cor<-‐0.15). 35 Figure 2.22 Pearson correlation between mean SNP LD and r 2 of methylation status of 200bp sliding window for each gene. Blue bar denotes all protein coding genes; Red bar denotes genes with significant positive correlation between expression and methylation (based on cor>0.15); Green bar denotes genes with significant negative correlation between expression and methylation (based on cor<-‐0.15). For genes with a positive correlation between expression and methylation, LD and r 2 of methylation status of the genic region is not significantly different compared to all genes. However, we still find some examples where there are significantly different methylation groups among samples, the clustering of groups tend to correlate nicely with SNP haplotype structure. On the other hand, most genes show methylation with a similar mean level but no significant correlation between the methylation status of cytosines, and the methylation pattern does not significantly correlate with the SNP haplotype. Take gene AT4G03470 as an example for the first case, for which CG methylation and expression are highly correlated (cor=0.64), the LD structure of SNPs and the correlation between a sliding window of methylation and single cytosines all show a consistent pattern (Figure 2.23). Using the mean methylation level for this gene as a phenotype to run GWAS, one SNP at 2kb downstream of the gene was identified as the 36 most significantly associated SNP to the methylation haplotype, which neatly explains the variation in methylation (Figure 2.24). This implies a significant amount of genetic control on the methylation pattern. Take gene AT5G32460 as an example for the second case, for which CG methylation and expression are highly correlated (cor=0.46). While LD across the gene is strong, the correlation between sliding window of methylation and single cytosine is not significant (Figure 2.25). However, there are no clearly defined groups of methylation among the accessions. For genes with a negative correlation between expression and methylation, correlation between local methylation is significantly stronger compared to all genes, while LD across the genic region can be either extensive or not. The strong correlation in local methylation status is likely due to the positive feedback loop for reinforcing the methylation by chromatin modification, which can act independent of the DNA sequence to some extent. Take gene AT3G26612 as an example for the first case, for which CG methylation and expression are highly negatively correlated (cor=-0.77). While LD structure of methylation for sliding window and single cytosine are strong, the SNP LD is much weaker (Figure 2.26). The clustering plot also shows no apparent SNP haplotype structure around the gene, though methylation and expression divide into two groups. This case is likely caused by nearby TE insertions. Take gene AT2G31150 as an example of the second case, for which CG methylation and expression is highly correlated (cor=-0.75). The LD structure of SNPs and the correlation between sliding window of methylation and single cytosines all show a 37 consistent pattern (Figure 2.27), and methylation strongly correlates with the SNP haplotype. In summary, based on the gene-specific profile with LD/expression/methylation information, methylation status of both sliding window and single cytosine correlate tightly with local SNP haplotype structure for genes with RdDM and subsets of genes with GBM. In addition, since the epi-genome is much more dynamic compared to the genome, such a strong correlation indicates epigenetic variants depend on genetic variants. Analyzing links between epigenetic and genetic variants using single gene profiles yields different conclusions compared to those using patterns generated by taking the average of all genes. This adds weight to our claim that our comprehensive dataset can aid research regarding not only natural variation, but also basic mechanism questions about transcription and methylation. 38 A B C 39 Figure 2.23 A gene profile for AT4G03470, which shows a significant correlation between expression and CG methylation (cor=0.64). A strong haplotype pattern correlates with methylation and expression among accessions. (A) Linkage disequilibrium in SNPs, and correlation heatmap for methylation. Sliding windows of CHG and CHH methylation shows a similar pattern with the CpG sliding window, therefore this data is not shown. (B) R squared for LD and the correlation between methylation for 200 sliding window and single cytosines. The Y-axis is the distance between two SNPs, or two sliding windows, or two cytosines. R squared for SNPs decreases with distance slower when compared to methylation. (C) Clustering based on methylation and expression among samples. Each row is one accession. Two haplotypes defined by local SNP pattern correlate strongly with methylation and expression. In the left panel, green to red means increasing of methylation and expression level. In the middle panel, the blue bar is the gene position, and SNPs within 1kb are included in the plot, with red color indicates minor allele. In the right panel, green to red indicates increasing of methylation level. Figure 2.24 (A) Using the mean methylation level for the gene region of AT4G03470 as phenotype and full sequence SNP as genotype, GWAS identified one SNP 2kb downstream of the gene as the most significantly associated SNP to the methylation haplotype. (B) The Y-axis is the mean CpG methylation level for AT4G03470. Two groups are the SNP type at the position most significantly associated identified in figure (A). Accessions with ‘A’ at this SNP position show significantly higher methylation compared to accessions with ‘G’ at this SNP position. 40 1" AT5G32460 _ cg_259 Physical Length:4.3kb R 2 Color Key 0 0.2 0.4 0.6 0.8 1 AT5G32460 _ chg_259 Physical Length:1.5kb R 2 Color Key 0 0.2 0.4 0.6 0.8 1 AT5G32460 _ chh_259 Physical Length:3.7kb R 2 Color Key 0 0.2 0.4 0.6 0.8 1 AT5G32460 _ SW_259_cg Physical Length:4.2kb R 2 Color Key 0 0.2 0.4 0.6 0.8 1 AT5G32460 _ snp_259 Physical Length:4.2kb R 2 Color Key 0 0.2 0.4 0.6 0.8 1 AT4G03470 _ snp_filter Physical Length:4.1kb R 2 Color Key 0 0.2 0.4 0.6 0.8 1 CpG"200bp"methyla0on" SNP"LD" CHG"methyla0on" CHH"methyla0on" CpG"methyla0on" AT5G32460" ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 1000 2000 3000 4000 0.0 0.2 0.4 0.6 0.8 1.0 AT5G32460_1kb around_snp_259 distance r 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 1000 2000 3000 4000 0.0 0.2 0.4 0.6 0.8 1.0 AT5G32460_1kb around_cg_259 distance r 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● 0 500 1000 1500 0.0 0.2 0.4 0.6 0.8 1.0 AT5G32460_1kb around_chg_259 distance r 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 1000 2000 3000 0.0 0.2 0.4 0.6 0.8 1.0 AT5G32460_1kb around_chh_259 distance r 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 1000 2000 3000 4000 0.0 0.2 0.4 0.6 0.8 1.0 AT5G32460_1kb around_SW_259_chg distance r 2 2" CpG"200bp"" meth" SNP"LD" CpG"meth" CHG"200bp" "meth" CHH"200bp" "meth" CHG"meth" CHH"meth" ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 1000 2000 3000 4000 0.0 0.2 0.4 0.6 0.8 1.0 AT5G32460_1kb around_SW_259_cg distance r 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 1000 2000 3000 4000 0.0 0.2 0.4 0.6 0.8 1.0 AT5G32460_1kb around_SW_259_chh distance r 2 AT5G32460" Legend"of" meth_Exp" meth"Region" meth"level" Methyla0on" Exp" Local"SNP"paJern" Methylated"Region" Gene" posi0on" Clustering"samples"by"methyla0on"and"expression"does"show"haplotype"of"SNPs." Each"row"is"one"sample."" genes_CG promoters_CG genebody_CG Expression −2 0 2 Column Z−Score 0 100 200 300 Color Key and Histogram Count 3" 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 seq(0, 1, 0.1) seq(0, 1, 0.1) <0.4 0.4−0.5 0.5−0.6 0.6−0.7 0.7−0.8 >0.8 <0.4 0.4−0.5 0.5−0.6 0.6−0.7 0.7−0.8 >0.8 <0.5 0.5−0.6 0.6−0.7 0.7−0.8 >0.8 genes_CG promoters_CG genebody_CG Expression −4 −2 0 2 4 Column Z−Score 0 100 200 Color Key and Histogram Count 12080000 12084000 12088000 AT5G32460 Lines 12081000 12083000 12085000 Chr.2 Lines AT5G32460" A B C 41 Figure 2.25 A gene profile for AT5G32460, which shows a significant correlation between expression and CG methylation (cor=0.46). Strong haplotype pattern correlates with methylation and expression type among accessions. (A) Linkage disequilibrium in SNPs, and correlation heatmap for methylation. (B) R squared for LD and correlation between methylation for 200 sliding window and single cytosines. R squared for SNPs decreases with distance much slower when compared to methylation. (C) Clustering based on methylation and expression among samples. Each row is one accession. Methylation does not have clear haplotypes, and does not correlate clearly with the SNP haplotype. 42 Figure 2.26 A gene profile for AT3G26612, which shows a significant correlation between expression and CG methylation (cor=-0.77). Strong correlation between methylation status nearby, yet no significant correlation with SNP haplotype. (A) Linkage disequilibrium in SNPs, and correlation heatmap for methylation. Sliding window with CHG and CHH methylation shows similar pattern with CpG sliding window, thus not shown. (B) R square for LD and correlation between methylation for 200 sliding window and single cytosine. (C) Clustering based on methylation and expression among samples. Each row is one accession. Methylation shows two clear haplotypes, and does not correlate with SNP haplotypes. 43 Figure 2.27 A gene profile for AT2G31150, which shows a significant correlation between expression and CG methylation (cor=-0.75). Strong correlation between methylation status are seen nearby but no significant correlation is observed with SNP haplotype. (A) Linkage disequilibrium in SNPs, and correlation heatmap for methylation. Sliding window using CHG and CHH methylation show as similar pattern to the CpG sliding window, therefore this data is not shown. (B) R squared for LD and correlation between methylation for 200 sliding window and single cytosines. (C) Clustering based on methylation and expression among samples. Each row is one accession. Methylation shows two clear haplotypes, and correlate clearly with SNPs. 44 2.5 Methylation and transcription for genes with copy number variation When genes duplicate, new copies often can insert themselves into or adjacent to the location of the already existing gene (tandem repeats), or they move to a new trans- location. Three possible scenarios for a methylation strategy exist for these two cases: (1) The mean methylation level does not change, when methylation spreads nearby locally for the new tandem repeats. (2) The mean methylation level increases, when more siRNAs are generated from the new copies, which in turn enhance the RdDM pathway. This case can happen for both cis- and trans-copies. (3) Mean methylation level decreases, when new copies are not methylated due to lack of a prior template and proper regulation, which should be more frequent for a new trans copy. To check whether these cases are truly present in the real data, we performed an analysis for genes showing copy number variation. Genes with copy number variations can be called by calculating the coverage directly through the mapping of genomic sequencing reads. Copy number for each gene was estimated by coverage for this gene divided by the mean coverage of the surrounding 3Mb window. Genes were called as CNVs if the estimated coverage was at least two- fold higher than the background in at least 15 accessions. In the end, we identified 1,667 genes with CNV. To determine whether the extra copies are trans or cis, first the potential location of newly inserted copy was identified by running GWAS, using the number of estimated copies as the phenotype and genome-wide SNPs as the genotype, since genetic variants around newly inserted copies are linked to the presence of that copy. 499 genes showed a significant association between copy number and SNPs, this association was mostly in cis 45 and this means that the potential insertion location for the extra copies is typically around the reference position (Figure 2.28). Figure 2.28 A genome-wide scan for associations using estimated gene copy numbers as the phenotype and SNPs as the genotype. Highly significant associations (p value <10 -10 ) are plotted. The light blue to dark red colour indicates the gradient of p values significance. The X-axis is the SNP position, or in other words the candidate position where the new copy is located. The Y-axis is the gene position in Col-0. The diagonal dotted line indicates SNPs that are associated with the copy number and are where the gene is annotated in Col-0 (cis). Genes with trans associations are more methylated compared to genes showing only a cis signal (Figure 2.29). While genes with only a cis association are methylated at 46 similar levels as other protein coding genes without copy number variations. Genes with a trans association are methylated at much higher level but show lower TE methylation. Figure 2.29 Comparison of methylation levels for genes with trans and cis signals shown when running GWAS for copy number variations. Only protein coding genes are included in the analysis. The Y-axis is the mean methylation level for the gene across all copies. 4 groups for the boxplots were presented here: CpG methylation for genes with trans association, CpG methylation for genes with only cis association, CHG methylation for genes with trans association, CHG methylation for genes with only cis association. Based on previous GWAS results, not only can we give a list of potential positions where the inserted copy is for a gene with CNV, but also the hotspots for insertions can be determined (Figure 2.30). Potential hotspots for insertions are in the centromeric region, which is consistent with the a report claiming that reactivated TEs tend to jump into centromeric regions (Tsukahara et al., 2009), though the mechanism behind such targeting is not clear. 47 Figure 2.30 For genes with CNV and identified with a potential insertion location on another chromosome. Each line is one significant association between the estimated copy number for a gene and SNPs. The left bar is the location of the SNPs (potential insertion position). The right bar is the location where the gene is annotated in Col-0. The histograms on two sides are the total hits for each position. For genes identified with copy number variation, gene expression and mean methylation show different relationships with copy number. Transcription levels can be independent of copy number variation or they can increase with copy number. 48 For methylation, we observe all 3 possible cases as we expected: Some genes were methylated at higher level with more copies (Figure 2.31), probably due to better silencing by increased siRNA copy, which is used by the RdDM pathway; or due to elevated transcription levels, if methylation is a byproduct of expression. In order to distinguish which is the case for this gene, non-CpG methylation, as an indicator of the RdDM pathway was checked. Indeed, AT1G27540 has a considerate amount of non-CpG methylation in a subsets of samples, thus it is a target of the RdDM pathway. In addition, it functions in both pollen and seed development (Meyer et al., 2012), and recently Schmitz et al. (2013) confirmed that RdDM genes are indeed involved in pollen function. Figure 2.31 One gene example AT1G27540: expression and methylation increases with more copy number. (a) The Y-axis shows gene expression (RPKM) for each accession, the X-axis is the normalized copy number of genes. Accessions with a higher copy numbers show higher expression (cor=0.71). (b) The Y- axis is the mean CpG methylation for genes in each accession, the X-axis is the normalized cop number of genes. Accessions with a higher copy number show higher methylation (cor=0.62). (c) The X axis is mean CpG methylation for genes in each accession, the Y- axisis gene expression (RPKM). Expression and methylation are positively correlated (cor=0.55).). Some genes show a lower mean methylation level with more copies, this is likely due to the new-copies being un-methylated (Figure 2.32). For this example, gene expression weakly correlates positively with copy number and methylation, indicating the extra copies are not transcribed as extensively as the original copy. 49 Figure 2.32 One gene example: methylation decreases as copy number increases. (a) The Y-axis is gene expression (RPKM) in each accession, the X-axis is the normalized copy number of all the genes. Accessions with higher copy numbers show marginally higher expression (cor=0.16). (b) The Y- axis is the mean CpG methylation for genes in each accession, the X-axi is the normalized copy number of all the genes. Accessions with higher copy numbers show lower methylation (cor=-0.48). (c) The X-axis is the mean CpG methylation for genes in each accession, the Y-axis is gene expression (RPKM). Expression and methylation are weakly positively correlated (cor=0.16). Other genes show no significant change in the mean methylation level with the increase of copy number. These are mostly genes with cis associations. Such genes appear to have tandem repeats at the original location, thus methylation is likely to spread from the original copy, causing little change in the mean methylation level (Figure2.33). Figure 2.33 One gene example: expression and methylation do not change much as copy number increases. (a) The Y-axis is gene expression (RPKM) in each accession, the X- axis the normalized copy of genes. Accessions with higher copy numbers show similar expression. (b) The Y-axis is the mean CpG methylation for genes in each accession, the X- axis is the normalized copy number of all the genes. Accessions with higher copy numbers show similar methylation. (c) The X-axis is the mean CpG methylation for genes in each accession, the Y-axis is gene expression (RPKM). Expression and methylation are weakly positively correlated (cor=0.14). 50 In summary, analyzing genes with copy number variations using methylome, transcriptome and genome sequencing data can provide new insights into how methylation functions, in addition to providing gene specific resolution profiles for other researches. However, it is important to note that these deductions need supporting experimental work to be fully verified. 51 Chapter 3 : Ongoing CG-enrichment in regions with non-CpG methylation in Arabidopsis thaliana Complicated and interacting methylation pathways create distinct methylation landscapes in plants, which also puts additional mutation and selection pressure on cytosine nucleotides. To explore such effects on genomic compositions related to methylation, we analyzed the derived allele frequency distribution in Swedish Arabidopsis thaliana populations and found a strong positive selection signal for regions with non-CpG methylation. 3.1 Ongoing enrichment of CG in Swedish Arabidopsis thaliana The mutation type for each SNP in Arabidopsis thaliana can be determined by comparing the allele types for the same position in Arabidopsis lyrata and Capsella rubella. The shared allele is annotated as an ancestral allele, while the private alleles in Arabidopsis thaliana are defined as derived alleles. The mutation spectrum in Arabidopsis thaliana has been estimated from a 30- generation mutation accumulation line and is shown in the blue inserted plot in Figure 3.1. Out of 6 possible types of single nucleotide changes, G:C=>A:T occurs the most frequently and accounts for half of all observed mutations (Ossowski et al., 2010). However, the spectrum of polymorphisms for 259 Swedish Arabidopsis thaliana accessions presented a remarkably different pattern (Figure 3.1), with much smaller differences between G:C=>A:T mutation type and the other 5 mutation types. The 52 spectrum of polymorphisms in wild populations results from mutation and selection, combined with demographical processes and other factors (eg. biased gene conversion). Since mutation accumulation lines suffer little selection pressure, the different pattern in natural populations is a result of biased fixation rather than biased mutation. More specifically, the underrepresented C=>T mutation is likely due to a strong positive selection for T=>C or a strong purifying selection against C=>T, or, more likely still, a combination of both. Figure 3.1 The number of the six different types of nucleotide substitution. The number of SNPs are grouped by the mutation type that generated them. The inserted figure shows the estimated mutation rate for each type (Ossowski et al. 2010). While GC to AT mutation are by far the most common, SNPs resulting from mutation in the opposite direction are more common in the wild. Many metazoan genomes show a bias towards substitutions that change weak (A,T) base pairs to strong (G,C) base pairs in fast evolving regions. Katzman et al demonstrated the presence of an ongoing fixation bias favoring G and C alleles throughout the human 53 genome and suggested that the bias was caused by a recombination-associated process, such as GC biased gene conversion (Katzman, Capra, & Pollard, 2011). To check whether the same bias exists in Arabidopsis thaliana, we computed the derived allele frequency spectra for the W2S (Weak to Strong) mutations and the S2W (Strong to Weak) mutations separately. An offset in the spectra between the two categories was then tested with a two-sided MWU (Mann Whitney U) test. This test has been shown to have good power in detecting fixation biases. Applying the MWU test to simulations of a neutral coalescent model showed that the p-values from this test accurately reflected the fraction expected by chance from a neutrally evolving locus (Katzman et al., 2011). A U-norm from this MWU test was calculated to compare the distribution of W2S and S2W. A U-norm around 0.5 indicates ‘absence of fixation bias’ and a U-norm greater than 0.5 means biased fixation for a W2S type of mutation. Distribution of DAFs for all SNPs was plotted in Figure 3.2 (A). ‘S’ refers to the ‘Strong allele’: C and G; while ‘W’ refers to the ‘Weak allele’: A and T. Among all six possible mutation types for each SNP with ancestral alleles defined, we annotated ‘T=>C’ and ‘A=>C’ as ‘W2S’ ; ‘C=>T’ and ‘C=>A’ as ‘S2W’ . For 1.8 million SNPs that passed our conservative filtering for ancestral allele annotation, W2S mutations segregated at higher derived allele frequencies than S2W (Figure 3.2A). This implies that regardless of the rate of introduction of W2S or S2W mutations, W2S mutations are more likely to reach high frequency and eventually fixation in the Arabidopsis thaliana population. Since the U-norm of C⬄A substitution shows no significant sign of biased fixation, the CG enrichment observed is mainly caused by a C⬄T substitution (Table 3.1, Figure 3.2). 54 Table 3.1 U-norm for comparing two opposite derived allele frequencies. A U-norm approximately equal to 0.5 indicates an ‘absence of fixation bias’. A U-norm greater than 0.5 indicates a biased fixation for cytosines. Thus, the CG enrichment is mainly caused by a C⬄T substitution. WeakóStrong CóT CóA U norm 0.573 0.604 0.497 Figure 3.2 DAF (Derived Allele Frequency) distribution shows ongoing CG enrichment in Swedish Arabidopsis population. (A) DAF for StoW (Strong to Weak) and WtoS (Weak to Strong) alleles. (B) DAF for C=>T and T=>C alleles. Since the Swedish Arabidopsis thaliana populations in this study mainly come from two regions of Sweden, North and South (Long et al., 2012), we checked using the U- 55 norm statistics whether the populations from the two regions behaved differently in terms of CG enrichment. Accessions from both regions show a similar skew towards ‘W2S’ mutation. Figure 3.3 Derived allele frequency distribution for six possible mutation types. AT to GC mutations are over-represented at frequencies near fixation, suggesting that selection is responsible for the higher GC content of the A. thaliana genome, and is still ongoing. To pin down which mutation type causes this ongoing CG enrichment, we plotted the DAF (Derived Allele Frequency) spectrum for each mutation type (Figure 3.3). Out of the 6 substitution types, T=>C showed a distinct distribution with a dramatic skew towards high derived allele frequency. This implies that the differences between the mutational and polymorphism spectra is likely due to a strong positive selection, specifically on the T=>C nucleotide replacement, which leads to an ongoing CG enrichment in Figure 3.2. CG enrichment is certainly consistent with a mechanism of gene conversion that favors selection of G or C alleles from a heterozygote or some other selective force generally favoring higher GC content. However, since the skew to higher allele frequency only applies to AT=>GC, and biased gene conversion should affect A:T=>G:C and A:T=>C:G substitution equally, the CG enrichment pattern in Arabidopsis thaliana can 56 not be explained by biased gene conversion. This is completely different to the CG enrichment in human populations, which can be largely explained by biased gene conversion and recombination hotspots (Katzman et al., 2011). Taken together, ongoing CG enrichment is observed in natural populations of Arabidopsis thaliana, and is due to an excess A:T=>G:C mutations. 3.2 CG enrichment is related to the methylation level Deamination of methylated cytosine accounts for at least part of the excess of C=>T substitutions (Cao et al., 2011; Ossowski et al., 2010). Thus, we investigated how the methylation status affects CG enrichment. For each SNP, we annotated the local methylation level by the mean methylation level for the most adjacent 200bp-sliding window across the genome. Since CpG methylation is more frequent and stable than non- CpG methylation, mean CpG methylation was used in the study. CpG methylation levels were classified into 5 categories and the upper limit is shown in Figure 3.4. As shown in the figure, the strength of fixation bias gets stronger with increasing methylation levels. Higher methylated regions show a stronger positive selection for T=>C and a stronger negative selection against C=>T. The other 4 types of mutation appear to show minor differences between the 5 categories, although a small level of a positive selection for A=>C (CG enrichment) and a negative selection for C=>A (CG loss) is still evident. 57 Figure 3.4 The derived allele frequency distribution for six different types of nucleotide substitutions for five CpG methylation level categories. The methylation level for each SNP was annotated using the mean CpG methylation level for the most adjacent 200bp- sliding window across the genome. Mean CpG methylation level was classified into 5 categories and the upper limit is shown. More AT to GC mutations reached high derived allele frequencies in higher methylated regions. In Arabidopsis thaliana, methylation for different cytosine contexts and genomic features show different properties. Therefore, the next question we chose to explore was 58 whether the strength of CG enrichment is affected by these factors even after controlling for neighboring methylation levels. The DAF skew is present for all C contexts, and is relatively higher for CHG and weaker for CHH (Table 3.2). As expected, CHH has the weakest signal, since CHH is the most abundant in the genome, methylated at the lowest level and cannot be reliably inherited from one generation to the next due to its nonsymmetrical structure. However, that CHG shows the strongest bias is interesting, since CHG was postulated to have a key function in gene methylation (Saze & Kakutani, 2011), leading to our next question as to whether this signal is enriched in the TE or genic region. Table 3.2 Ongoing W ➔S fixation Bias in Swedish Arabidopsis is present for all cytosine contexts (W ➔S DAF skew is quantified by the U-norm statistic) All CpG CHG CHH W2S/S2W 0.548 0.547 0.558 0.540 C=>T/T=>C 0.604 0.595 0.607 0.602 The DNA methylation patterns for CpG, CHG, CHH are different from each other in the Arabidopsis thaliana genome, with CpG methylated at the highest level and CHH methylated at the lowest. Thus, a similar level of positive selection for all cytosine contexts would imply that either such a selection is not related to DNA methylation, or that a selection bias is only related to methylation type that does not distinguish between cytosine contexts. As shown previously, such enrichment was stronger as the methylation level increased, which left us with the second possibility that CG enrichment is only present in 59 regions with non-CpG methylation, and that no CG enrichment should be expected in the methylated gene body, since GBM features exclusive CpG methylation. 3.3 Strong CG enrichment for TEs and genic regions under the RdDM pathway To explore whether CG enrichment is confined to regions with methylation for all cytosine contexts, the DAF distribution for TEs, genes with GBM, and genes with non- CpG methylation (Gene RdDM) were compared (Figure 3.5). In this figure, ‘GBM’ contains SNPs within the Tair10-annotated protein-coding genes after removing SNPs with CHG or CHH DNA methylation levels above 5%. ‘TE’ contains SNPs within the Tair10-annotated transposable elements, or transposable element genes. ‘Gene RdDM’ contains SNPs within the Tair10-annotated protein-coding genes, but with a CHG or CHH methylation level above 5%. As expected, TE regions and genes with RdDM methylation (where all cytosines are methylated regardless of the context) showed a strong ongoing CG enrichment, while the CG enrichment pattern is completely absent for genes with GBM, even after controlling for the background methylation level. A U-norm calculation with different methylation levels also confirms that the fixation bias is only found in TEs and genes with non-CpG methylation (Figure 3.6). 60 Figure 3.5 The Derived allele frequency distribution for GBM, TE and Gene RdDM. Ongoing CG enrichment is observed in TE and genes with RdDM. CGmeth0.01 CGmeth0.4 CGmeth0.6 CGmeth0.8 CGmeth1 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 ATtoGC GCtoAT ATtoCG ATtoTA GCtoTA GCtoCG 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 DAF density MutationType ATtoGC GCtoAT ATtoCG ATtoTA GCtoTA GCtoCG Gene Body Methylation CGmeth0.01 CGmeth0.4 CGmeth0.6 CGmeth0.8 CGmeth1 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 ATtoGC GCtoAT ATtoCG ATtoTA GCtoTA GCtoCG 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 DAF density MutationType ATtoGC GCtoAT ATtoCG ATtoTA GCtoTA GCtoCG TE Methylation CGmeth0.01 CGmeth0.4 CGmeth0.6 CGmeth0.8 CGmeth1 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 ATtoGC GCtoAT ATtoCG ATtoTA GCtoTA GCtoCG 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 DAF density MutationType ATtoGC GCtoAT ATtoCG ATtoTA GCtoTA GCtoCG Gene RdDM Methylation 61 Figure 3.6 U-norm for SNPs in TEs and genes under the RdDM pathway show a biased fixation as methylation levels increase, while SNPs in genes with gene body methylation do not show a strongly biased fixation. A higher U-norm indicates a greater fixation bias towards CG enrichment. Why does CG enrichment only happen in highly methylated TEs and genes with RdDM? It seems that loss of cytosines in these regions is deleterious while the gain of new cytosines is beneficial. One explanation for this is that by having more cytosines in these regions the maintenance of a high methylation status is facilitated. If CG enrichment facilitates high methylation, newly derived cytosines should be methylated at a higher ratio compared to older cytosines. The addition of these cytosines would otherwise lower the mean methylation level. To confirm this prediction we calculated the Mrate (defined as methylation rate among the accessions with a cytosine at that position) for each cytosine with polymorphism data in the Swedish population. The 0.0 0.2 0.4 0.6 0.8 1.0 0.50 0.55 0.60 0.65 0.70 0.75 0.80 CG meth vs U norm CG methylation U norm All TE GeneBody_meth Gene_RdDM 62 distribution of Mrate was plotted while controlling for background methylation level (Figure 3.7). As expected, in highly methylated regions, TEs and genes with RdDM show a bias towards a higher Mrate. Thus, cytosines in the highly methylated regions in TEs and genes under RdDM are more likely to be methylated in the population, supporting the potential role of CG enrichment in these regions for facilitating methylation. To directly confirm that the derived cytosines are more frequently methylated, the correlation between Mrate (as defined in Figure 3.7) and DAF were computed (Figure 3.8). Both substitution types A=>C and T=>C show a significantly positive correlation, indicating that derived cytosines can reach higher frequencies in the population if they are more methylated in the population for the same SNP. Surprisingly, this signal is true for all contexts, including genes with only GBM, where no CG enrichment is observed. One possible explanation is that the methylation rate for single polymorphic cytosines is not consistent with the sliding window estimate of background methylation in Figure 3.5. In fact, the methylation rate among 200 samples is a result of a combination of selection, mutation and methylation on the polymorphic cytosine. Therefore, it is not necessarily representative of the mean local methylation for the 200bp sliding window. As the last step to confirm that high derived frequency cytosines are more highly methylated than low derived frequency cytosines, we divided all cytosines into two groups - high DAF (DAF>0.5) and low DAF (DAF<0.5) - and computed the median differences in the methylation rate between these two groups while controlling for the local methylation level. As expected, the T=>C substitution type showed a high methylation rate for the high DAF group, especially when the local methylation level is higher (Figure 3.9). Interestingly, the C=>T substitution type showed a high methylation 63 rate for the low DAF group, especially when the local methylation level is lower (Figure 3.10), further supporting the conclusion that in highly methylated regions, cytosines with high methylation rates among accessions are pushed in the low DAF direction to shield them from being lost. 64 CGmeth0.01 CGmeth0.4 CGmeth0.6 CGmeth0.8 CGmeth1 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 ATtoGC GCtoAT ATtoCG GCtoTA GCtoCG 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 Mrate density MutationType ATtoGC GCtoAT ATtoCG GCtoTA GCtoCG gene body meth CGmeth0.01 CGmeth0.4 CGmeth0.6 CGmeth0.8 CGmeth1 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 ATtoGC GCtoAT ATtoCG GCtoTA GCtoCG 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 Mrate density MutationType ATtoGC GCtoAT ATtoCG GCtoTA GCtoCG gene body meth CGmeth0.01 CGmeth0.4 CGmeth0.6 CGmeth0.8 CGmeth1 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 ATtoGC GCtoAT ATtoCG GCtoTA GCtoCG 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 Mrate density MutationType ATtoGC GCtoAT ATtoCG GCtoTA GCtoCG gene RdDM meth CGmeth0.01 CGmeth0.4 CGmeth0.6 CGmeth0.8 CGmeth1 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 ATtoGC GCtoAT ATtoCG GCtoTA GCtoCG 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 Mrate density MutationType ATtoGC GCtoAT ATtoCG GCtoTA GCtoCG gene body meth CGmeth0.01 CGmeth0.4 CGmeth0.6 CGmeth0.8 CGmeth1 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 ATtoGC GCtoAT ATtoCG GCtoTA GCtoCG 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 Mrate density MutationType ATtoGC GCtoAT ATtoCG GCtoTA GCtoCG TE meth CGmeth0.01 CGmeth0.4 CGmeth0.6 CGmeth0.8 CGmeth1 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 ATtoGC GCtoAT ATtoCG GCtoTA GCtoCG 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 Mrate density MutationType ATtoGC GCtoAT ATtoCG GCtoTA GCtoCG gene body meth 65 Figure 3.7 The methylation rates for accessions with cytosines at the position show higher methylation in regions with similar mean methylation levels. Mrate is defined as the ratio of methylation among accessions with cytosines and reads mapped to them. Figure 3.8 A correlation between methylation rates for accessions with cytosines at the given position and the Derived Allele Frequency for that mutation type. Figure 3.9 The median Mrate differences for ATtoGC for high DAF – low DAF. -‐0.2 -‐0.1 0 0.1 0.2 0.3 0.4 ATtoGC GCtoAT ATtoCG GCtoTA GCtoCG gene gene_Rddm TE 0 0.05 0.1 0.15 0.2 0.25 TE gene_Rddm Gene 66 Figure 3.10 The median Mrate difference for GCtoAT for high DAF – low DAF. 3.4 Genes with CG enrichment show greater variability for expression and methylation Since CG enrichment is only observed in TEs and genes with non-CpG methylation, it is possible that these genes are merely mis-annotated TEs. This would make it easier to understand that high methylation in the TE region will limit their transcription and proliferation, providing a fitness advantage. To check whether genes with CG enrichment are infact true genes, GO enrichment analysis was performed. 458 genes (annotated as protein coding genes in Tair10) are defined as genes with CG enrichment, since they contain more than 2 T=>C SNPs with DAF higher than 0.5. GO enrichment analysis found DMT2 and two other methyltransferases, leading to enrichment of DNA-methyltransferase activity (Figure -‐0.16 -‐0.14 -‐0.12 -‐0.1 -‐0.08 -‐0.06 -‐0.04 -‐0.02 0 TE gene_Rddm Gene 67 3.11). This list of genes also includes several that are involved in pollen and seed development. Therefore these genes are in fact real genes, rather than disguised TEs. Figure 3.11 GO enrichment for genes in Gene RdDM In addition, we referred to our gene-specific profile generated with the transcriptome and methylome dataset and found that genes with CG enrichment are transcribed at lower levels compared to the genomic average, and with a significant negative correlation between expression and methylation (Figure 3.12). Again, expression can be detected for these genes (even in our limited sampling stage), supporting the conclusion that they are real protein coding genes. 68 Figure 3.12 Genes under the RdDM pathway (with CG enrichment) show a negative correlation between expression and methylation, low expression level (red line indicates the genome average). Tatarinova et al. reported that regions with higher GC rates exhibit more expression and methylation variability (Tatarinova, Elhaik, & Pellegrini, 2013). The advantage of having more cytosines is likely to be an enablement of precise gene regulation, especially for stress- or development-related genes. For other genes, methylation is unlikely to be a major player in the regulation of gene expression, and this is supported by the observation that expression varies much more than methylation, between tissues (Schmitz et al., 2013). As further evidence that CG enrichment is evolutionarily beneficial, genes without CpG sites around the TSS have a significantly lower level expression compared to all genes. 308 of 372 genes with no methylated CpG (Col-0) are not transcribed. Although these are extreme examples, the CpG site appears important for gene expression. Genes with CG enrichment were defined as having at least two earlier T=> C SNPs, which is a very arbitrary cutoff. Since these genes seem to show a negative correlation between expression and methylation, we would like to check whether genes with a negative correlation between methylation and expression present strong CG enrichment in general. In our dataset, 247 genes with a correlation between expression and mean CG methylation of <-0.2 do show strong CG enrichment (Figure 3.13). 69 Figure 3.13 Genes with a significantly negative correlation between expression and methylation show CG enrichment. 3.5. Discussion Taken together, genes with a negative correlation between expression and methylation, and TEs that are repressed by DNA methylation show strong ongoing CG enrichment. On the other hand, gene body regions with a positive correlation between expression and methylation do not show CG enrichment. This suggests that methylation for RdDM regions and TEs are more functional and consequently under stronger natural selection, while methylation in gene body regions are not. Based on these observations, we propose the following hypothesis: Methylation functions upstream of expression in genes that show a negative correlation between expression and CG methylation. Therefore, gaining cytosines presents an evolutionary advantage in terms of tuning the regulation of this expression. Accordingly, losing cytosines is a disadvantage. However, methylation in the gene body region is downstream 70 of gene expression and so losing or gaining a cytosine here does significantly impact the genome. In this study, methylation and transcription levels were measured in whole rosettes of the 9-leaf stage for all plants, with a mixture of leaf tissue and meristem cells. DNA methylation is largely stable across tissues and developmental stages (Schmitz et al., 2013) and therefore methylation measured at this stage can be used as an estimate of methylation levels for other stages. This is why the mean methylation level for 200bp sliding window is used as the expected methylation level for overlapping SNPs. However, gene expression is variable for different tissues and development stages (Schmid et al., 2005; Schmitz et al., 2013), so expression estimates from this dataset can only be used to interpret the 9-leaf stage. Consequently, our list of genes with CG is limited by our sampling stage. 71 Chapter 4 : Geographical differences in GBM and cytosine number 4.1 Northern Arabidopsis thaliana lines exhibit higher GBM rDNA shows a significant geographical bias in Swedish populations (Long et al., 2012), and CMT2 has been shown to play a role in climate adaptation. This observation prompted us to explore whether methylation patterns also show geographical biases. Surprisingly, the genomic mean methylation level does show a significant correlation with the latitudes of the lines’ origins. Using latitude of 59.3 as the cutoff between southern and northern populations, northern lines showed a higher gene body CpG methylation compared to the southern lines (cor=0.74) (Figure 4.1). CHG and CHH also show some degree of correlation with geographical distribution, though this is not as significant as CpG and GBM. Figure 4.1 Northern accessions show a much higher CpG methylation for protein coding genes. (A) Mean CpG methylation levels for protein coding genes significantly correlate with latitude (cor=0.74). Y-axis is the mean CpG methylation level for the genic region in one accession. (B) Violin plot for mean CG methylation of protein coding genes for accessions originating in northern and southern Sweden. 72 Checking the correlation for different genomic features confirmed that the strongest correlation comes from CpG methylation for protein coding genes with GBM, while the correlation for TEs is much weaker (Figure 4.2) However, latitude is not a phenotypic or environmental factor, so we explored the climate dataset to find the possible parameter that drives this apparent correlation. The minimum temperature taken from the location of our plant samples correlates well with the CG methylation level of genes (Figure 4.3). Therefore it seems temperature may be an important factor in adaptation. Figure 4.2 The correlation between mean CpG methylation for different genomic features and latitude. The red line indicates the correlation between global CpG methylation and latitude. All protein coding gene-related features show a significant correlation(cor >0.6). Figure 4.3 The correlation between gene CpG methylation and climate data. The minimum temperature from the sampling location showed the strongest correlation with CG methylation for genes. The red line indicates the correlation between latitude and methylation. intron Gene exon utr3 utr5 CpG promoter ncRNA miRNA pseudo TE2 rRNA TE1 intergenic Correlation with latitude for 153 accs @10C 0.0 0.2 0.4 0.6 0.8 intron Gene exon utr3 CpG promoter utr5 ncRNA miRNA pseudo TE2 TE1 rRNA intergenic Correlation with latitude for 122 accs @16C 0.0 0.2 0.4 0.6 0.8 ParSpring MaxTemp MinRain Rain MaxRain Aridity LenGrow Humidity ColdDays FreeDays temp lat DayLength MinTemp Cor with gene CG meth @10C abs(cor) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 73 To check whether this geographical bias in CG methylation only occurs in small subsets of genes or in the majority of genes, we quantified how many genes show significantly different methylation levels based on a 2-sample test between northern and southern populations. 60% of all protein coding genes show significantly different methylation levels in northern and southern groups (pval<0.05). Even with strict Bonferroni correction, 20% of genes remain differentially methylated between the two regions. Among these, more than 95% of genes are more highly methylated in the northern compared to the southern population. Thus, a bias exists in a large proportion of protein coding genes. Genes with GBM usually exhibit a positive correlation between expression and methylation. Therefore, higher methylation in northern populations may indicate a higher gene transcription activity in northern accessions. Indeed this appears to be the case, since 66% of genes with a significant difference in gene expression between north and south are more highly expressed in the northern populations. Genes with a significant correlation between methylation and latitude were also characterized as long genes with active transcription. CHR42 (Chromatin Remodeling 42) was identified as the gene with the most significantly different CG methylation between southern and northern samples (Pearson cor=0.84), and it is more highly expressed in the north (cor=0.35) (Figure 4.4). 74 Figure 4.4 Gene AT5G20420 with the strongest correlation between latitude and methylation. (left panel) The Y-axis is the mean CG methylation for this gene, the X axis is the latitude. (right panel) The Y-axis is the RPKM expression for this gene, the X-axis is the latitude. Such GBM differences with latitude/minimum temperature were demonstrated to have a genetic basis, and genes affected by the trans mQTLs identified were involved in cell division and flowering time. These observations might be a result of adaptations to colder temperatures (Dubin et al, unpublished results). 4.2 Northern Arabidopsis thaliana lines have more cytosines The observation that gene-body CG methylation is higher in northern lines led us to investigate latitudinal differences at the DNA sequence level. Because methylated cytosines show a high mutation rate, geographic differences in the level of methylation imply geographical differences in mutation rates, which should be reflected in the pattern of genetic polymorphisms that involve methylated cytosines. Furthermore, if selection shapes the pattern of methylation, it may also act directly on the sequence level, for instance by altering the number of cytosines that can be methylated. ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 56 58 60 62 64 0.1 0.2 0.3 0.4 0.5 0.6 0.7 AT5G20420 GeneMeth vs Lat, cor=0.8424052 at10$lat at10$genes_CG_meth ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 56 58 60 62 64 0 1 2 3 4 5 6 AT5G20420 Expression vs Lat, cor=0.346 at10$lat at10$expEJ 75 Total cytosine numbers for each context were calculated based on full genome sequencing data for 259 Swedish accessions. Accessions originating from colder regions had a higher total number of cytosines (cor = -0.31). While the total numbers of CpG and CHH do not show a significant difference between cold and warm areas in genic regions, a strong correlation (cor = -0.65) was observed between the CHG number and minimum temperatures in the location where the accessions originated. This correlation was driven by a number of genic CAG sites (cor = -0.7) which represent 80% of all CHG sites (Figure 4.5, Table 4.1). In TE regions, the total number of cytosines showed a similar level of significant negative correlation with minimum temperature (number of CpG: -0.17, number of CHG: -0.19, number of CHH: -0.19). Figure 4.5 The number of cytosines strongly correlate with minimum temperature of the coldest month in the accession’s location of origin. (left panel). A correlation between MinTemp and number of cytosines. (right panel) A strong correlation between total number of CAG in the genome and minimum temperature of the accession’s location of origin. GeneRegion TERegion Pearson Correlation Cnum vs. MinTemp Correlation vs. MinTemp −0.6 −0.2 0.0 0.2 0.4 0.6 CpGNum CHGNum CHHNum totalCnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −140 −100 −80 −60 −40 −20 4923000 4925000 4927000 TotalCAGnum vs MinTemp(−O) Cor= −0.49 MinTemp(−O) TotalCAGnum Srho=−0.41 Spval=2.3e−11 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −140 −100 −80 −60 −40 −20 3663000 3665000 3667000 GenicCAGnum vs MinTemp(−O) Cor= −0.56 MinTemp(−O) GenicCAGnum Srho=−0.48 Spval=3.2e−16 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −140 −100 −80 −60 −40 −20 839000 841000 843000 TECAGnum vs MinTemp(−O) Cor= −0.23 MinTemp(−O) TECAGnum Srho=−0.2 Spval=0.0014 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −140 −100 −80 −60 −40 −20 921500 922500 InterCAGnum vs MinTemp(−O) Cor= 0.22 MinTemp(−O) InterCAGnum Srho=0.17 Spval=0.0086 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −140 −100 −80 −60 −40 −20 1917000 1918000 GeneCGCAGnum vs MinTemp(−O) Cor= −0.45 MinTemp(−O) GeneCGCAGnum Srho=−0.37 Spval=1.1e−09 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −140 −100 −80 −60 −40 −20 1160800 1161200 1161600 1162000 GeneNoCGCAGnum vs MinTemp(−O) Cor= −0.62 MinTemp(−O) GeneNoCGCAGnum Srho=−0.6 Spval=6.5e−26 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −140 −100 −80 −60 −40 −20 209100 209300 209500 OldTECAGnum vs MinTemp(−O) Cor= −0.042 MinTemp(−O) OldTECAGnum Srho=0.0024 Spval=0.97 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −140 −100 −80 −60 −40 −20 718000 720000 YoungTECAGnum vs MinTemp(−O) Cor= −0.24 MinTemp(−O) Y oungTECAGnum Srho=−0.23 Spval=0.00031 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −140 −100 −80 −60 −40 −20 3043000 3044000 RealGeneCpGnum vs MinTemp(−O) Cor= 0.069 MinTemp(−O) RealGeneCpGnum Srho=0.096 Spval=0.13 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −140 −100 −80 −60 −40 −20 3762500 3763500 3764500 RealGeneCHGnum vs MinTemp(−O) Cor= −0.65 MinTemp(−O) RealGeneCHGnum Srho=−0.56 Spval=1.5e−22 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −140 −100 −80 −60 −40 −20 16260000 16261500 16263000 RealGeneCHHnum vs MinTemp(−O) Cor= 0.14 MinTemp(−O) RealGeneCHHnum Srho=0.2 Spval=0.0016 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −140 −100 −80 −60 −40 −20 23065000 23067000 RealGeneCnum vs MinTemp(−O) Cor= −0.31 MinTemp(−O) RealGeneCnum Srho=−0.26 Spval=2.7e−05 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −140 −100 −80 −60 −40 −20 3078500 3079500 RealGeneCAGnum vs MinTemp(−O) Cor= −0.7 MinTemp(−O) RealGeneCAGnum Srho=−0.57 Spval=9.5e−23 76 Table 4.1 Pearson Correlation score between total cytosine number for each accession and minimum temperature of the accession’s location of origin. CpG Number CHG Number CHH Number C Number CAG Number Genomewide -0.04 -0.38 -0.07 -0.31 -0.49 Genic Region 0.069 -0.65 0.14 -0.31 -0.7 TE Region -0.17 -0.19 -0.19 -0.31 -0.23 4.3 Exploration of the geographical differences in number of cytosines 4.3.1 Not related to methylation level This biased pattern in cytosine number may be related to methylation in two ways: 1) Methylation generates mutations through cytosine deamination and so cytosines that show a CT polymorphism should also show a stronger pattern, while those that show a CA polymorphism should not. 2) This pattern may result from a selection for certain methylation patterns, and therefore this biased pattern should appear in highly methylated regions while absent from non-methylated regions. We decided to test these two possibilities. First, the correlation between cytosine number and MinTemp (or latitude) was observed for all cytosines without preference to CT polymorphisms or CA polymorphisms (Table 4.2). Therefore the pattern is almost certainly not due to deamination, making Selection much more likely. The only context that showed a preference to deamination CT polymorphism was TE region CpG sites. 77 Table 4.2 The correlation between cytosine number and MinTemp for different contexts and alternative alleles. Pearson Correlation Alleles Genome- wide Genic TE CHG All -0.38 -0.65 -0.19 CT -0.29 -0.47 -0.24 AC -0.44 -0.49 -0.26 CG -0.05 -0.13 0.17 HAC 0.23 0.19 -0.02 HCT -0.20 0.05 0.11 The second possibility is also not valid, since this pattern was observed not only for C’s that could be methylated but also for other sequence motifs. Other genic tri- nucleotides, such as AAA (cor=-0.6), CTA (cor=-0.53) and TGA (cor=0.43), also showed significantly strong correlations with latitude. CAG is nevertheless the strongest. Thus, this pattern seems to have nothing to do with methylation, at least in the genic part. In addition, since geographic differences in gene CG methylation were observed in the same population, we investigated whether the difference in cytosine number in genes related to the methylation level. A strong correlation between the number of CHG sites and minimum temperature was shown for all genes. Furthermore, the correlation was stronger for genes without CG methylation (cor=-0.63) compared to genes with CG methylation (cor=-0.45) (Table 4.3). Genes with GBM have been reported to be highly conserved and evolve slowly. This indicates that genes with GBM are under a strong selection constraint. However, this does not explain why only the number of CHG sites showed a strong correlation with latitude (MinTemp) in the genic region. 78 This biased pattern for number of CpGs was stronger for old TEs that are shared by Arabidopsis thaliana and Arabidopsis lyrata, while the numbers of CHGs and CHHs correlated more significantly for young TEs. Young TEs were methylated at a higher level than older ones, and generally showed a greater loss of cytosines in southern accessions. However, such differences in CHG /CHH numbers almost disappeared in old TEs. One possibility is that fixed CHG sites in colder regions were gradually lost and over time, reached the same level as accessions in warmer region. This implies that the numbers of CHGs and CHHs were more sensitive to deamination as a result of high methylation, while CpG sites were better preserved compared to CHG sites. In summary, in TE regions, methylation may play a role in shaping the observed sequence bias with latitude, while in genic regions, methylation is not likely to play a major role in this pattern. Table 4.3 The Pearson Correlation score between total cytosine number for each accession and minimum temperature for the accession’s location of origin. CpG Number CHG Number CHH Number C Number CAG Number Gene with CG meth -0.065 -0.45 0.079 -0.4 -0.45 Gene without CG meth 0.19 -0.63 0.15 -0.14 -0.62 Old TE -0.28 0.1 0.15 -0.01 -0.042 Young TE -0.18 -0.23 -0.16 -0.33 -0.24 4.3.2 Not due to codon bias One possibility of such tri-nucleotide differences (CHG) between northern and southern populations is codon bias. If this is true, we would expect to see a correlation with latitude only for codon position but not for CHGs outside the reading frame. 79 To test this possibility, we counted all tri-nucleotides for all three sliding positions. The pattern with latitude is not restricted to the codon position. Therefore we concluded that codon bias does not explain the pattern. Given that this pattern is not restricted to the coding sequence region also suggests that it is not due to selection on amino acid products. 4.3.3 Also observed in world-wide collections and in Arabidopsis lyrata We counted the number of different tri-nucleotide combinations for all 1001 genomes and observed a similar trend with latitude for motifs as ‘CAG’ (cor=0.22), ’GAA’ (cor=0.32). Interestingly, the total number of cytosines for lyrata also appears to correlate with latitude (CpG cor=0.54, CHG cor=0.35), although we only have 18 lyrata accessions. This implies that lyrata experienced a similar force driving a change in the number of cytosines in cold and warm regions (Table 4.4, Figure 4.6). Table 4.4 A correlation between latitude and number of cytosine in Arabidopsis lyrata CpG Number CHG Number CHH Number C Number GenomeWide 0.54 0.4 -0.37 0.44 Genic 0.54 0.35 -0.41 0.27 TE 0.53 0.54 0.12 0.47 Non-Gene/TE 0.41 0.14 -0.14 0.23 80 Figure 4.6 A correlation between latitude and number of cytosines in Arabidopsis lyrata. Four panels are for CpG number, CHG number, CHH number and total cytosine number respectively. 4.3.4 Demography or selection? When genetic similarity decays with geographic distance between the origins of the individuals, genetic variation can be spatially structured. Demographic processes should similarly affect all loci, while selection should only affect a subset of loci. If the biased pattern was due to a large frequency of differences between north and south at a small number of sites, it should appear as a selection sweep. However, while a pattern due to small frequency differences at a large number of sites may also be ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 5620000 5625000 5630000 5635000 5640000 CpGnum vs Latitude Cor= 0.54 Latitude CpGnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 3710000 3715000 3720000 CpGGenenum vs Latitude Cor= 0.54 Latitude CpGGenenum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 972000 974000 976000 978000 CpGTEnum vs Latitude Cor= 0.53 Latitude CpGTEnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 1428000 1430000 1432000 1434000 CpGInternum vs Latitude Cor= 0.41 Latitude CpGInternum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 6116000 6120000 6124000 CHGnum vs Latitude Cor= 0.4 Latitude CHGnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 4486000 4487000 4488000 4489000 4490000 CHGGenenum vs Latitude Cor= 0.35 Latitude CHGGenenum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 990000 992000 994000 996000 CHGTEnum vs Latitude Cor= 0.54 Latitude CHGTEnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 1215500 1216500 1217500 1218500 CHGInternum vs Latitude Cor= 0.14 Latitude CHGInternum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 4924000 4926000 4928000 CAGnum vs Latitude Cor= 0.029 Latitude CAGnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 3643000 3644000 3645000 3646000 CAGGenenum vs Latitude Cor= −0.2 Latitude CAGGenenum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 786000 787000 788000 789000 CAGTEnum vs Latitude Cor= 0.48 Latitude CAGTEnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 960000 960500 961000 961500 962000 CAGInternum vs Latitude Cor= −0.17 Latitude CAGInternum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 31165000 31170000 31175000 CHHnum vs Latitude Cor= −0.37 Latitude CHHnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 19486000 19490000 19494000 19498000 CHHGenenum vs Latitude Cor= −0.41 Latitude CHHGenenum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 5603000 5603500 5604000 5604500 5605000 CHHTEnum vs Latitude Cor= 0.12 Latitude CHHTEnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 8680000 8682000 8684000 CHHInternum vs Latitude Cor= −0.14 Latitude CHHInternum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 42910000 42920000 42930000 Cnum vs Latitude Cor= 0.44 Latitude Cnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 27685000 27690000 27695000 27700000 CGenenum vs Latitude Cor= 0.27 Latitude CGenenum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 7565000 7570000 7575000 CTEnum vs Latitude Cor= 0.57 Latitude CTEnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 11326000 11330000 11334000 CInternum vs Latitude Cor= 0.23 Latitude CInternum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 5620000 5625000 5630000 5635000 5640000 CpGnum vs Latitude Cor= 0.54 Latitude CpGnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 3710000 3715000 3720000 CpGGenenum vs Latitude Cor= 0.54 Latitude CpGGenenum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 972000 974000 976000 978000 CpGTEnum vs Latitude Cor= 0.53 Latitude CpGTEnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 1428000 1430000 1432000 1434000 CpGInternum vs Latitude Cor= 0.41 Latitude CpGInternum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 6116000 6120000 6124000 CHGnum vs Latitude Cor= 0.4 Latitude CHGnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 4486000 4487000 4488000 4489000 4490000 CHGGenenum vs Latitude Cor= 0.35 Latitude CHGGenenum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 990000 992000 994000 996000 CHGTEnum vs Latitude Cor= 0.54 Latitude CHGTEnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 1215500 1216500 1217500 1218500 CHGInternum vs Latitude Cor= 0.14 Latitude CHGInternum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 4924000 4926000 4928000 CAGnum vs Latitude Cor= 0.029 Latitude CAGnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 3643000 3644000 3645000 3646000 CAGGenenum vs Latitude Cor= −0.2 Latitude CAGGenenum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 786000 787000 788000 789000 CAGTEnum vs Latitude Cor= 0.48 Latitude CAGTEnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 960000 960500 961000 961500 962000 CAGInternum vs Latitude Cor= −0.17 Latitude CAGInternum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 31165000 31170000 31175000 CHHnum vs Latitude Cor= −0.37 Latitude CHHnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 19486000 19490000 19494000 19498000 CHHGenenum vs Latitude Cor= −0.41 Latitude CHHGenenum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 5603000 5603500 5604000 5604500 5605000 CHHTEnum vs Latitude Cor= 0.12 Latitude CHHTEnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 8680000 8682000 8684000 CHHInternum vs Latitude Cor= −0.14 Latitude CHHInternum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 42910000 42920000 42930000 Cnum vs Latitude Cor= 0.44 Latitude Cnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 27685000 27690000 27695000 27700000 CGenenum vs Latitude Cor= 0.27 Latitude CGenenum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 7565000 7570000 7575000 CTEnum vs Latitude Cor= 0.57 Latitude CTEnum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 45 50 55 60 65 11326000 11330000 11334000 CInternum vs Latitude Cor= 0.23 Latitude CInternum 81 selection, it can be divided into two further cases: i) is the pattern due to alleles at intermediate frequencies, or ii) to rare alleles? If common alleles were the cause, then selection likely created the pattern, since mutation rates could never have been high enough to affect intermediate allele frequencies. If rare alleles were the cause, then both selection and mutation were involved and we should be able to model these scenarios. To tackle this problem, we divided all alleles into five non-overlapping categories based on their cytosine frequency in northern and southern populations: 1) Alleles that are fixed in northern populations 2) both allelic frequencies being greater than 97.5% 3) both allelic frequencies being greater than 80% 4) both allelic frequencies being greater than 2.5% (the median) and 5) both allelic frequencies being <2.5% (i.e. rare) (Figure 4.7). The pattern of cytosine number bias was mainly caused by cytosines that were fixed in northern lines and almost reaching fixation in southern lines. All other categories showed no significant correlation, except for alleles with a median frequency showing a strong opposite correlation, both for genes and TEs (Figure 4.8). These results show that a large number of sites with small effects can create such a distinct pattern. 82 Figure 4.7 The five allele frequency categories used to calculate the number of cytosines Figure 4.8 The genic region for each category in Figure 4.7 and all alleles with cytosines 83 4.3.5 Mutation bias To test whether southern accessions have higher mutation rates as a result of cytosine deamination, we calculated nucleotide diversity (pi) for all 6 substitution types and normalized pi for sites with C=>T mutations with pi for all sites. The observed south/north ratio above 1 is consistent with the hypothesis that southern accessions have lost more cytosines due to deamination. However, since we failed to connect the biased pattern with methylation, the reason behind this result remains unclear. Table 4.5 The ratio of Pi for deamination mutation type divided by pi for all mutation CtoT Pi Ratio North CtoT Pi Ratio South South/North Genome-wide 26.49% 26.64% 1.0056 Genic region 28.92% 28.98% 1.0023 Genes with CG meth 29.79% 29.88% 1.0032 Genes without CG meth 27.93% 27.98% 1.0018 TE region 26.83% 27.23% 1.0150 4.4 Discussion In this study, we observed a strong association between cytosine number and latitude (or the minimum temperature where the plants originated). Though we failed to nail down the mechanism behind the biased pattern, we believe it is worthwhile to bring attention to such strong patterns in sequence composition with geographical differences when designing studies with natural populations of Arabidopsis thaliana. 84 Chapter 5 : Materials and Methods 5.1 Generation of raw data 5.1.1 Plant growth A diverse set of 200 Swedish accessions were sown on soil and stratified for 3 days at 4°C in the dark. They were then transferred to environmentally controlled growth chambers set at 10°C or 16°C under long day conditions (04:00-20:00) and individual seedlings were transplanted to single pots after one week. When plants reached the 9- true-leaf stage of development, whole rosettes were collected between 15:00 and 16:00 and flash-frozen in liquid nitrogen. 5.1.2 RNA-seq library preparation For each accession, 3 plants were pooled and total RNA was extracted by TRIzol (Invitrogen 15596-018), DNase treated and mRNA purified with oligo dT Dynabeads (Life Technology). RNA was then fragmented using Ambion Fragmentation buffer and first and second strand cDNA synthesis was carried out using Invitrogen kit 18064-071. The ends of sheared fragments were repaired using Epicentre kit ER81050. After A- tailing using exo-Klenow fragment (New England Biolabs, MA, NEB M0212L), barcoded adaptors were ligated with Epicentre Fast-Link DNA Ligation Kit (Epicentre LK6201H). Adaptor-ligated DNA was resolved on 1.5% low melt agarose gels for 1 hour at 100V. DNA in the range of 200 -250 bp was excised from the gel and purified with the Zymoclean Gel DNA recovery kit (Zymo Research). The libraries were amplified by PCR for 15 cycles with Illumina PCR primers 1.1 and 1.2 with Phusion polymerase (NEB F-530L). 85 Single-end 32 bp sequencing was performed at the University of Southern California Epigenome Center on an Illumina GAIIx instrument using 4-fold multiplexing. 5.1.3 BS-seq library preparation For each accession three individual plants were pooled and total DNA was extracted using CTAB and phenol-chloroform. Approximately two micrograms of genomic DNA was used for BS-seq library construction and sequencing (92 bp paired-end) by BGI. 5.2 Sequence analysis 5.2.1 Genome sequences Illumina sequencing data from 180 published Swedish genomes (Long et al., 2012) were combined with sequencing data from another 79 genomes (1001genomes.org), which had been processed using the same pipeline to yield polymorphism data for a total of 259 accessions. Only high-quality Single Nucleotide Polymorphism (SNP) data were included in the analysis (Qual >30). 5.2.2 SNP filtering For a particular position to be included in the analysis, it had to have a minimum of 50% of the samples with a base call of A, C, G, or T. Thus, positions with many Ns (due to low quality or low coverage) were excluded. For analyses that focused on the difference between northern and southern samples, we required that the number of missing samples be fewer than half in both populations, to ensure an even representation of DAF for comparison. 86 Furthermore, for calculating the derived allele frequencies or minor allele frequencies at polymorphic positions, singletons were removed. Singletons are variant positions where only one sample supports the minor allele type, typically called because of errors in sequencing. Though based on the high coverage of sequencing data in our dataset (>50X), every singleton can be confirmed by multiple reads, thus singleton filtering is relatively strict. Finally, a further requirement of bi-allelic polymorphisms was checked for DAF calculation. Alleles from the outgroup (A.lyrata and Capsella rubella) need to be present in the bi-alleles of A.thaliana belonging to the Swedish population. If A.lyrata and Capsella rubella do not share at least one of the A.thaliana alleles at the same location, the SNPs were ignored for DAF analysis. After all three steps of filtering, 1918440 SNPs were left in the analysis, 46.2% out of 4148936 SNPs passed Q30 filtering for 259 Swedish accessions. The mean SNP density is 1 SNP every 62 bp. 5.2.3 DAF analysis Alleles were polarized into ancestral versus derived state, based on the global alignment between A.thaliana and A.lyrata (Hu et al., 2011). The shared alleles between A.lyrata and A.thaliana were defined as ‘ancestral’. Cytosines have higher mutation rates compared to other nucleotides (Ossowski et al., 2010); thus, the real ancestral allele ‘C’ may mutate independently, in both A.lyrata and A.thaliana lineage, into ‘T’ resulting in incorrectly annotated ancestral states of ‘T’ instead of ‘C’. In this study, the analysis may be severely biased by such mis-annotations, 87 since we were trying to identify high-frequency derived alleles under strong positive selection. To remove spurious classification of ancestral alleles, another filtering of SNPs was based on additional outgroup information. Genome re-sequencing data from 18 A.lyrata (2 for A.lyrata. lyrata, 16 for A.lyrata.petraea), and 1 C.rubella were mapped to the A.thaliana Col-0 reference, and SNPs were called (data from Polina Novikova). This combined outgroup SNP dataset was compared to the SNP dataset for the 259 Swedish A.thaliana accessions. 175k SNP polymorphisms from C.rubella and 20k A.lyrata SNP polymorphisms were shared with A.thaliana. Thus, roughly 8.2% of A.thaliana SNPs can be polarized using additional information. First, 20k shared polymorphic sites between A.thaliana and 18 A.lyrata were removed due to the difficulty in identifying the true ancestral allele. Second, 67k SNPs were removed because the allele in C.rubella was different to the allele in A.lyrata. In total, 4.5% SNPs were removed from the analysis because of additional outgroup information. In the end, 1.83 million SNPs were included in the study with ancestral allele information. To check whether mis-classification of ancestral allele is enriched in C⬄T polymorphism (possibly caused by cytosine deamination), the ratio of each polymorphism type were compared between SNPs that passed the polarization filtering and those that failed filtering. No significant enrichment of CT polymorphism was found in SNPs that didn’t meet the filtering criteria. This implies two possibilities 1) either there is no such bias in mis-annotation for CT polymorphism in our dataset, or 2) our method is not efficient enough to remove such mis-annotations. 88 5.2.4 MWU test To determine whether mutations from an ancestral weak (A or T) base pair to a strong (G or C) base pair (W2S mutations) are more likely to spread in the population represented by our samples, we compared W2S segregating sites to S2W segregating sites for each target region and for the aggregate set of segregating sites. We performed a Mann-Whitney U (MWU) test for a difference between the W2S and S2W derived allele frequency spectrums. The test was performed using the R language with the command "wilcox.test (paired=FALSE, alternative=two.sided)". For the target regions with significant p-values on either the MWU or MK tests, we tested whether the significance was due to a restricted locus within the region by removing all segregating sites and fixed differences under a mask of a given size at a given position within the region and rerunning the test with the remaining data. 5.2.6 BS-seq data processing Reads were aligned as previously described (Dinh et al., 2012) to a modified pseudo- reference chromosome in which SNPs were inserted into the TAIR10 reference genome using NextGenMap (version 0.4.3) (Sedlazeck, Rescheneder, & von Haeseler, 2013) allowing up to a 10% mismatch between the reads (-i 0.90) and the reference sequence and discarding reads that map equally well to more than one genomic location or having less than 45 nucleotides mapping without error to the reference sequence (-R 45). Average coverage was 12.6 X. To correct for sample or data mixups, the raw data was also aligned to the first chromosome of the Col-0 TAIR reference genome, as described above, and SNP calling was performed using the BISsnp package (Liu, Siegmund, Laird, & Berman, 2012). The 89 polymorphism data was then compared to data from genome sequencing (1001genomes.org). Accessions that did not have the highest similarity to the expected genotype were excluded from further analysis. Methylation was estimated individually for each cytosine using a python script provided with the BSMAP software package (Xi & Li, 2009) . Conversion efficiency was estimated from the fraction of methylated cytosines in chloroplasts using the R software package (www.r-project.org, version 2.15.2). After eliminating one outlier, the samples had conversion efficiencies ranging from 99.25%–99.80% (mean=99.59%). Genome wide average methylation levels were calculated separately for the CG, CHG and CHH contexts. The average variance between 11 biological replicates was 2.2%, 3.2% and 7.3% for CG, CHG and CHH methylation respectively, while for identical genotypes grown at different temperatures (111 pairs) CG, CHG and CHH methylation variance was 2.7%, 4,6% and 15.9% respectively. The variance in genome wide methylation levels for the 152 accessions grown at 10C was respectively 7.6%, 9.2% and 13.2% for CG, CHG and CHH methylation, while for the 121 accessions grown at 16 C genome wide CG, CHG and CHH methylation varied 8.5%, 9.5% and 14.3% respectively. The Bioconductor package Repitools (version 0.6.0) (Statham et al., 2010) was then used to average methylation over genomic features of interest (e.g., all genes, all transposons over 4 kB or a subset of transposons of interest). Pairwise DMRs were called individually for each accession using the R software package methylKit (version 0.5.6) (Akalin et al., 2012) using a window size of 200 bp, an FDR rate of 0.05 and a minimum fold change of 0.3. Overlap of DMRs with (TAIR10) genomic features such as transposons and genes were calculated using the Bioconductor package ChIPpeakAnno 90 (version 2.8.0) (Zhu et al., 2010) . For each accession, methylation data was smoothed independently for each context using the Bioconductor package BSmooth (version 0.4.5) (Hansen, Langmead, & Irizarry, 2012) using the default settings. Average methylation was then calculated for 200bp windows across the genome. 5.3 Annotation 5.3.1 SNP context annotation This study focused on enrichment of CG alleles. Based on the fact that different context of CG alleles (CpG, CHG and CHH) were under control of different Methyltransferases, and usually with a different DNA methylation level, the strength of CG enrichment may be related to the different context. Thus, it is important to account for CG contexts in the analysis. Identifying CG contexts for population samples is not straightforward, since in large populations the context change can be very complicated. We created a genome-wide FASTA alignment of all accessions with high quality SNPs set against TAIR 10, and then identified the cytosine context in terms of CpG, CHG and CHH. Indels relative to TAIR 10 were ignored so that the alignment had the same coordinate system as the TAIR 10 reference. For each cytosine position, if it had been identified at least once as CpG or CHG, we annotated this SNP accordingly. For cytosine sites not identified as CpG or CHG in the union set, we identified them as CHH sites. For cytosine sites identified both as CpG and CHG, we excluded them from our analysis. In the end, 220249 CpG, 174632 CHG and 405715 CHH sites were identified. 91 5.3.2 Gene/TE annotation We classified SNPs based on Tair 10 as coding, intron, un-translated regions, transposable elements and inter-genic regions. Each SNP was also annotated by its position relative to the nearest protein coding gene and transposable elements. Transposable element classifications were based on TAIR 10 annotated transposable element genes as well as transposable element fragments. (Gan et al., 2011) By gene information, we can relate expression / gene function annotation to evaluate each SNP comprehensively. 5.3.3 DNA methylation annotation Mean methylation level for 200 bp sliding window was calculated with 100bp overlapping between windows. Methylation level for each SNP was annotated as the nearest window overlapping with the particular SNP. 5.3.4 North South population annotation Samples in the study were collected from Sweden. Hierarchical clustering of genome-wide SNPs reveal two large clusters consisting of accessions separated by latitude 59.3. Thus we annotated accessions came from north of latitude 59.3 as northern population, and the others as southern population. 92 REFERENCE Akalin, A., Kormaksson, M., Li, S., Garrett-bakelman, F. E., Figueroa, M. E., Melnick, A., & Mason, C. E. (2012). methylKit : a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biology, 13(10), R87. doi:10.1186/gb-2012-13-10-R87 Atwell, S., Huang, Y. S., Vilhjálmsson, B. J., Willems, G., Horton, M., Li, Y., … Nordborg, M. (2010). Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature, 465(7298), 627–31. doi:10.1038/nature08800 Becker, C., Hagmann, J., Müller, J., Koenig, D., Stegle, O., Borgwardt, K., & Weigel, D. (2011). Spontaneous epigenetic variation in the Arabidopsis thaliana methylome. Nature, 480(7376), 245–9. doi:10.1038/nature10555 Cao, J., Schneeberger, K., Ossowski, S., Günther, T., Bender, S., Fitz, J., … Weigel, D. (2011). Whole-genome sequencing of multiple Arabidopsis thaliana populations, 43(10). doi:10.1038/ng.911 Cokus, S. J., Feng, S., Zhang, X., Chen, Z., Merriman, B., Haudenschild, C. D., … Jacobsen, S. E. (2008). Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature, 452(7184), 215–9. doi:10.1038/nature06745 Coleman-Derr, D., & Zilberman, D. (2012). Deposition of Histone Variant H2A.Z within Gene Bodies Regulates Responsive Genes. PLoS Genetics, 8(10), e1002988. doi:10.1371/journal.pgen.1002988 Dinh, H. Q., Dubin, M., Sedlazeck, F. J., Lettner, N., Mittelsten Scheid, O., & von Haeseler, A. (2012). Advanced methylome analysis after bisulfite deep sequencing: an example in Arabidopsis. PloS One, 7(7), e41528. doi:10.1371/journal.pone.0041528 Dubin, M. J., Zhang, P., Meng, D., & Remigereau, M. (n.d.). DNA methylation variation in Arabidopsis has a genetic basis and shows evidence of local adaptation Abstract : Fan, D., Dai, Y., Wang, X., Wang, Z., He, H., Yang, H., … Ma, L. (2012). IBM1, a JmjC domain-containing histone demethylase, is involved in the regulation of RNA- directed DNA methylation through the epigenetic control of RDR2 and DCL3 expression in Arabidopsis. Nucleic Acids Research, 40(18), 8905–8916. doi:10.1093/nar/gks647 93 Gan, X., Stegle, O., Behr, J., Steffen, J. G., Drewe, P., Hildebrand, K. L., … Mott, R. (2011). Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature, 477(7365), 419–23. doi:10.1038/nature10414 Hansen, K. D., Langmead, B., & Irizarry, R. A. (2012). BSmooth : from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology, 13(10), R83. doi:10.1186/gb-2012-13-10-R83 Hu, T. T., Pattyn, P., Bakker, E. G., Cao, J., Cheng, J.-F., Clark, R. M., … Guo, Y.-L. (2011). The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nature Genetics, 43(5), 476–81. doi:10.1038/ng.807 Inagaki, S., & Kakutani, T. (2013). What Triggers Differential DNA Methylation of Genes and TEs: Contribution of Body Methylation? Cold Spring Harbor Symposia on Quantitative Biology. doi:10.1101/sqb.2013.77.016212 Inagaki, S., Miura-Kamio, A., Nakamura, Y., Lu, F., Cui, X., Cao, X., … Kakutani, T. (2010). Autocatalytic differentiation of epigenetic modifications within the Arabidopsis genome. The EMBO Journal, 29(20), 3496–506. doi:10.1038/emboj.2010.227 Katzman, S., Capra, J. A., & Pollard, K. S. (2011). Ongoing GC-biased evolution is widespread in the human genome and enriched near recombination hotspots Supplementary Material, 1–8. Law, J. a, & Jacobsen, S. E. (2010). Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nature Reviews. Genetics, 11(3), 204– 20. doi:10.1038/nrg2719 Lister, R., O’Malley, R. C., Tonti-Filippini, J., Gregory, B. D., Berry, C. C., Millar, a H., & Ecker, J. R. (2008). Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell, 133(3), 523–36. doi:10.1016/j.cell.2008.03.029 Liu, Y., Siegmund, K. D., Laird, P. W., & Berman, B. P. (2012). Bis-SNP : Combined DNA methylation and SNP calling for Bisulfite-seq data. Genome Biology, 13(7), R61. doi:10.1186/gb-2012-13-7-r61 Long, Q., Rabanal, F. A., Meng, D., Huber, C. D., Farlow, A., Platzer, A., … Nordborg, M. (2012). Massive genomic variation and strong selection in Swedish Arabidopsis thaliana. Meyer, R. C., Witucka-Wall, H., Becher, M., Blacha, A., Boudichevskaia, A., Dörmann, P., … Altmann, T. (2012). Heterosis manifestation during early Arabidopsis seedling development is characterized by intermediate gene expression and enhanced metabolic activity in the hybrids. The Plant Journal : For Cell and Molecular Biology, 71(4), 669–83. doi:10.1111/j.1365-313X.2012.05021.x 94 Ossowski, S., Schneeberger, K., Lucas-Lledó, J. I., Warthmann, N., Clark, R. M., Shaw, R. G., … Lynch, M. (2010). The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. Science (New York, N.Y.), 327(5961), 92–4. doi:10.1126/science.1180677 Saze, H., & Kakutani, T. (2011). Differentiation of epigenetic modifications between transposons and genes. Current Opinion in Plant Biology, 14(1), 81–7. doi:10.1016/j.pbi.2010.08.017 Schmid, M., Davison, T. S., Henz, S. R., Pape, U. J., Demar, M., Vingron, M., … Lohmann, J. U. (2005). A gene expression map of Arabidopsis thaliana development. Nature Genetics, 37(5), 501–6. doi:10.1038/ng1543 Schmitz, R. J., Schultz, M. D., Urich, M. a., Nery, J. R., Pelizzola, M., Libiger, O., … Ecker, J. R. (2013). Patterns of population epigenomic diversity. Nature. doi:10.1038/nature11968 Sedlazeck, F. J., Rescheneder, P., & von Haeseler, A. (2013). NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics (Oxford, England), 29(21), 2790–1. doi:10.1093/bioinformatics/btt468 Silveira, A. B., Trontin, C., Cortijo, S., Barau, J., Del Bem, L. E. V., Loudet, O., … Vincentz, M. (2013). Extensive natural epigenetic variation at a de novo originated gene. PLoS Genetics, 9(4), e1003437. doi:10.1371/journal.pgen.1003437 Song, Q. (2013). Dissertation: Whole Genome Bisulfite Sequencing Analytical Methods and Biological Insights. Statham, A. L., Strbenac, D., Coolen, M. W., Stirzaker, C., Clark, S. J., & Robinson, M. D. (2010). Repitools: an R package for the analysis of enrichment-based epigenomic data. Bioinformatics (Oxford, England), 26(13), 1662–3. doi:10.1093/bioinformatics/btq247 Stroud, H., Do, T., Du, J., Zhong, X., Feng, S., Johnson, L., … Jacobsen, S. E. (2014). Non-CG methylation patterns shape the epigenetic landscape in Arabidopsis. Nature Structural & Molecular Biology, 21(1), 64–72. doi:10.1038/nsmb.2735 Stroud, H., Greenberg, M. V. C., Feng, S., Bernatavichute, Y. V, & Jacobsen, S. E. (2013). Comprehensive analysis of silencing mutants reveals complex regulation of the Arabidopsis methylome. Cell, 152(1-2), 352–64. doi:10.1016/j.cell.2012.10.054 Takuno, S., & Gaut, B. S. (2012). Body-Methylated Genes in Arabidopsis thaliana Are Functionally Important and Evolve Slowly. Molecular Biology and Evolution, 29(1), 219–27. doi:10.1093/molbev/msr188 95 Takuno, S., & Gaut, B. S. (2013). Gene body methylation is conserved between plant orthologs and is of evolutionary consequence. Proceedings of the National Academy of Sciences of the United States of America, 2012. doi:10.1073/pnas.1215380110 Tatarinova, T., Elhaik, E., & Pellegrini, M. (2013). Cross-species analysis of genic GC3 content and DNA methylation patterns. Genome Biology and Evolution, 5(8), 1443– 56. doi:10.1093/gbe/evt103 Tsukahara, S., Kobayashi, A., Kawabe, A., Mathieu, O., Miura, A., & Kakutani, T. (2009). Bursts of retrotransposition reproduced in Arabidopsis. Nature, 461(7262), 423–6. doi:10.1038/nature08351 Xi, Y., & Li, W. (2009). BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics, 10, 232. doi:10.1186/1471-2105-10-232 Zemach, A., McDaniel, I. E., Silva, P., & Zilberman, D. (2010). Genome-wide evolutionary analysis of eukaryotic DNA methylation. Science (New York, N.Y.), 328(5980), 916–9. doi:10.1126/science.1186366 Zemach, A., & Zilberman, D. (2010). Evolution of eukaryotic DNA methylation and the pursuit of safer sex. Current Biology : CB, 20(17), R780–5. doi:10.1016/j.cub.2010.07.007 Zhu, L. J., Gazin, C., Lawson, N. D., Pagès, H., Lin, S. M., Lapointe, D. S., & Green, M. R. (2010). ChIPpeakAnno: a Bioconductor package to annotate ChIP-seq and ChIP- chip data. BMC Bioinformatics, 11, 237. doi:10.1186/1471-2105-11-237 Zilberman, D. (2008). The evolving functions of DNA methylation. Current Opinion in Plant Biology, 11(5), 554–9. doi:10.1016/j.pbi.2008.07.004 Zilberman, D., Coleman-Derr, D., Ballinger, T., & Henikoff, S. (2008). Histone H2A.Z and DNA methylation are mutually antagonistic chromatin marks. Nature, 456(7218), 125–9. doi:10.1038/nature07324 Zilberman, D., Gehring, M., Tran, R. K., Ballinger, T., & Henikoff, S. (2007). Genome- wide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription. Nature Genetics, 39(1), 61– 9. doi:10.1038/ng1929
Abstract (if available)
Abstract
DNA methylation involves the addition of one methyl group to a cytosine nucleotide, thereby altering the underlying genetic structure and increasing the frequency of mutations brought about by cytosine to thymine deamination. However, how natural selection i) responds to this aforementioned higher mutation rate and ii) shapes the “epigenome” despite a constantly changing genome composition remains unclear. ❧ To better understand how i) the genotype and environment interact and affect DNA methylation and transcription and ii) the role methylation plays in genome evolution, we generated transcriptome‐ and methylome‐sequencing data for 200 natural Swedish Arabidopsis thaliana accessions. These accessions were grown in two different environments, 10°C and 16°C (chosen because they led to different flowering phenotypes). Combined with existing genomic resequencing‐ (Long et al., 2012) and phenotypic‐data, our comprehensive dataset enabled us to explore the complex relationships that exist between genotype and phenotype. ❧ For each gene, we constructed a gene‐specific profile by integrating: i) SNP data, ii) gene expression, iii) DNA methylation, and iv) correlations between these different layers of data, including local LD structure. These profiles served as a powerful research tool. ❧ We investigated genetic variation and epigenetic variation in Swedish Arabidopsis thaliana, and demonstrated that there is an ongoing CG enrichment in regions with non‐CpG methylation, while regions with only CpG methylation do not show similar patterns of selection. This suggests that DNA methylation in TEs, or RdDM (RNA directed DNA methylation) methylated regions that are under strong selection, is likely to be of more functional consequence than regions that are methylated only in the CpG context (i.e. mostly genic regions). Furthermore, this finding may i) help solve the many points of contention that surround the functionality of GBM (Gene body methylation), and ii) explain the contrasting correlation patterns observed between expression and methylation for TE/RdDM regions (a negative correlation) and gene body regions (a positive correlation). ❧ Additionally, the mean global GBM and total number of cytosines in the genome shows significant geographical differences, and although the reason behind this pattern remains elusive, selection is likely to play a part.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Analysis of genomic polymorphism in Arabidopsis thaliana
PDF
Comparative genomics of arabidopsis thaliana
PDF
Natural variation in Arabidopsis thaliana
PDF
Long term evolution of gene duplicates in arabidopsis polyploids
PDF
Mapping epigenetic and epistatic components of heritability in natural population
PDF
Identification and analysis of shared epigenetic changes in extraembryonic development and tumorigenesis
PDF
Innovative sequencing techniques elucidate gene regulatory evolution in Drosophila
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
A population genomics approach to the study of speciation in flowering columbines
PDF
Integrated genomic & epigenomic analyses of glioblastoma multiforme: Methods development and application
PDF
Understanding DNA methylation and nucleosome organization in cancer cells using single molecule sequencing
PDF
An analysis of conservation of methylation
PDF
Functional DNA methylation changes in normal and cancer cells
PDF
Identifying allele-specific DNA methylation in mammalian genomes
PDF
Evolution of E. coli in unstructured and structured environments
PDF
Comparative analysis of DNA methylation in mammals
PDF
New tools for whole-genome analysis of DNA replication timing and fork elongation in saccharomyces cerevisiae
PDF
DNA methylation inhibitors and epigenetic regulation of microRNA expression
PDF
Genome sequencing and transcriptome analysis of the phenotypically plastic spadefoot toads
PDF
Exploring the application and usage of whole genome chromosome conformation capture
Asset Metadata
Creator
Zhang, Pei
(author)
Core Title
Natural variation of Arabidopsis thaliana methylome and its impact on genome evolution
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Molecular Biology
Publication Date
11/18/2015
Defense Date
11/18/2014
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
Arabidopsis thaliana,genome evolution,methylome,natural variation,OAI-PMH Harvest
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Nordborg, Magnus (
committee chair
), Conti, David V. (
committee member
), Dean, Matthew D. (
committee member
), Finkel, Steven E. (
committee member
)
Creator Email
peizhang@usc.edu,zhangpei511@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-517908
Unique identifier
UC11298508
Identifier
etd-ZhangPei-3092.pdf (filename),usctheses-c3-517908 (legacy record id)
Legacy Identifier
etd-ZhangPei-3092.pdf
Dmrecord
517908
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Zhang, Pei
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
Arabidopsis thaliana
genome evolution
methylome
natural variation