Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Mapping epigenetic and epistatic components of heritability in natural population
(USC Thesis Other)
Mapping epigenetic and epistatic components of heritability in natural population
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
MAPPING EPIGENETIC AND EPISTATIC COMPONENTS OF HERITABILITY IN NATURAL POPULATION by Dazhe Meng A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS) May 2015 Copyright 2015 Dazhe Meng Dedication To my family, especially my father, who showed me the path to science and a meaningful life. ii Acknowledgments Deepest acknowledgment to my advisor Magnus, lab members and committee members iii Contents Dedication ii Acknowledgments iii List of Tables vii List of Figures viii Abstract xi 1 Introduction 1 1.1 Epigenetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 DNA methylation in Arabidopsis thaliana . . . . . . . . . . . . . . . 6 1.3 Epistasis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.4 Linear mixed models for GWAS . . . . . . . . . . . . . . . . . . . . 13 1.4.1 Calculating the kinship matrix . . . . . . . . . . . . . . . . . 14 1.4.2 Estimating the parameters . . . . . . . . . . . . . . . . . . . 16 1.4.3 Multiple variance component models . . . . . . . . . . . . . 16 2 Association mapping with DNA methylation 18 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2 Data preparation and preliminary analysis . . . . . . . . . . . . . . 19 2.2.1 Sample preparation and sequencing . . . . . . . . . . . . . . 19 2.2.2 Filtering of data . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2.3 Correlation between expression and methylation . . . . . . . 21 2.3 Association mapping of expression levels . . . . . . . . . . . . . . . 21 2.3.1 Test for global methylation effect . . . . . . . . . . . . . . . 21 2.3.2 Marginal association study . . . . . . . . . . . . . . . . . . . 24 2.3.3 Stepwise GWAS . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.4 Detection of new peaks from DNA methylation . . . . . . . 29 2.3.5 Mapping methylation effect on top of SNP effects . . . . . . 31 2.3.6 Cis DNA methylation regulated expression . . . . . . . . . . 35 2.4 Application to flowering time . . . . . . . . . . . . . . . . . . . . . 35 iv 2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3 Causal analysis in structured data 42 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2 Population causality model . . . . . . . . . . . . . . . . . . . . . . . 43 3.2.1 Qualitative description . . . . . . . . . . . . . . . . . . . . . 43 3.2.2 Mathematical formulation . . . . . . . . . . . . . . . . . . . 46 3.2.3 Tests of significance . . . . . . . . . . . . . . . . . . . . . . . 48 3.3 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4 Application to real data . . . . . . . . . . . . . . . . . . . . . . . . 50 3.4.1 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . 50 3.4.2 DNA methylation more often appear causal . . . . . . . . . 50 3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4 Mapping epistasis 58 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.2 Brief description of dataset used . . . . . . . . . . . . . . . . . . . . 60 4.3 SNP by SNP epistasis . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.4 Gene by gene epistasis . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.4.1 Haplotype based model . . . . . . . . . . . . . . . . . . . . . 64 4.4.2 Variance component model . . . . . . . . . . . . . . . . . . . 66 4.4.3 Simulation of variance component model . . . . . . . . . . . 70 4.4.4 Application to real data . . . . . . . . . . . . . . . . . . . . 73 4.5 SNP by genetic background epistasis . . . . . . . . . . . . . . . . . 73 4.5.1 PCA based model . . . . . . . . . . . . . . . . . . . . . . . . 74 4.5.2 Variance component model . . . . . . . . . . . . . . . . . . . 76 4.5.3 Application to real data . . . . . . . . . . . . . . . . . . . . 76 4.6 Global sum of epistasis . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.6.1 Global epistatic model . . . . . . . . . . . . . . . . . . . . . 78 4.6.2 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.6.3 Application to real data . . . . . . . . . . . . . . . . . . . . 80 4.7 Extension to higher order interactions . . . . . . . . . . . . . . . . . 82 4.8 Long range linkage disequilibrium . . . . . . . . . . . . . . . . . . . 83 4.8.1 Correction for structural inflation of r 2 . . . . . . . . . . . . 83 4.8.2 Expected number of significant pairs . . . . . . . . . . . . . 84 4.8.3 Filtering of sequencing artifacts . . . . . . . . . . . . . . . . 86 4.8.4 Signs of selection . . . . . . . . . . . . . . . . . . . . . . . . 88 4.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 Reference List 91 v A Terminology 100 A.1 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 A.2 Default meaning of mathematical symbols . . . . . . . . . . . . . . 101 B More on linear mixed models 102 B.1 Likelihood ratio test for linear mixed models . . . . . . . . . . . . . 102 B.2 R 2 measurement for linear mixed models . . . . . . . . . . . . . . . 103 vi List of Tables 1.1 Summary of two types of Arabidopsis DNA methylation . . . . . . . 8 vii List of Figures 1.1 Number of genes with different mean levels of CG methylation and CHG methylation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.2 CMT2 and environmental effect on CHH methylation level . . . . . 9 2.1 Significant correlation between expression and DNA methylation bins 22 a CG gene body methylation . . . . . . . . . . . . . . . . . . . 22 b CG TE-like methylation . . . . . . . . . . . . . . . . . . . . 22 c CHG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 d CHH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.2 p-value for additional methylation background effect . . . . . . . . . 23 a Using gene body methylation . . . . . . . . . . . . . . . . . 23 b Using CHG methylation . . . . . . . . . . . . . . . . . . . . 23 2.3 QQ-plot for association mapping of expression . . . . . . . . . . . . 25 a QQ plot in normal scale . . . . . . . . . . . . . . . . . . . . 25 b QQ plot in log scale . . . . . . . . . . . . . . . . . . . . . . 25 2.4 Association mapping of expression on SNPs and DNA methylation . 26 2.5 Step-wise association mapping of expression on DNA methylation . 28 2.6 Most significant p-value of association for expression of each gene . 30 2.7 Number of genes with cis effects detectable at various Bonferroni α 31 viii 2.8 Distance of peak cis loci to gene . . . . . . . . . . . . . . . . . . . . 32 2.9 MarginalR 2 of SNPs+methylation fixed effect versus those of SNPs alone of each expressed gene . . . . . . . . . . . . . . . . . . . . . . 34 2.10 GWAS and EWAS for expression of a R-gene . . . . . . . . . . . . . 35 2.11 Manhattan plot of GWAS and EWAS for flowering time of plants grown at 10 ◦ C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.1 Possible causal structure between genotype and two traits . . . . . 45 3.2 Illustrative example of causal mediating testing of height vs shadow 46 3.3 Testing causality model with simulation . . . . . . . . . . . . . . . 49 3.4 Number of cases where either DNA methylation or expression is fully mediating the other . . . . . . . . . . . . . . . . . . . . . . . . 51 3.5 -BIC comparison for model I and II . . . . . . . . . . . . . . . . . . 52 a CG gene body methylation . . . . . . . . . . . . . . . . . . . 52 b CG in TE-like methylation . . . . . . . . . . . . . . . . . . . 52 c CHG methylation . . . . . . . . . . . . . . . . . . . . . . . . 52 d CHH methylation . . . . . . . . . . . . . . . . . . . . . . . . 52 3.6 -BIC comparison for model I and II after inverse normal transform . 53 a CG gene body methylation . . . . . . . . . . . . . . . . . . . 53 b CG in TE-like methylation . . . . . . . . . . . . . . . . . . . 53 c CHG methylation . . . . . . . . . . . . . . . . . . . . . . . . 53 d CHH methylation . . . . . . . . . . . . . . . . . . . . . . . . 53 4.1 QQ plot for SNP by SNP epistatic effect tests . . . . . . . . . . . . 62 a QQ plot in normal scale . . . . . . . . . . . . . . . . . . . . 62 b QQ plot in log scale . . . . . . . . . . . . . . . . . . . . . . 62 4.2 Multiple-pairs simulation result of gene by gene epistatic models . . 71 ix a Power for different effect size . . . . . . . . . . . . . . . . . . 71 b Power for different causal fraction . . . . . . . . . . . . . . . 71 4.3 Single-pair simulation result of gene by gene epistatic models . . . . 72 4.4 FT10 SNP marginal GWAS and local-global kinship GWAS . . . . 73 4.5 FT10 gene by gene epistasis detected by variance component model 74 4.6 Candidate SNP by genetic background epistasis . . . . . . . . . . . 77 a Variance component model . . . . . . . . . . . . . . . . . . . 77 b PCA model . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.7 Global epistatic variance component model on real phenotypes . . . 81 4.8 Filtering of high linkage disequilibirium SNP pairs . . . . . . . . . . 87 4.9 Long range linkage disequilibrium of SNPs in Swedish Arabidopsis thaliana . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 x Abstract Genome wide association mapping has become the tool of choice for dissecting the genetic basis of heritable traits. Most studies today utilizes linear mixed mod- els, in which the genetic contribution is divided into a number of fixed additive effect and random effect terms corresponding to small contribution from all genetic variants. In this work I expand existing models and explore contribution to her- itability from two sources: DNA methylation patterns and epistasis between loci in natural samples of Arabidopsis thaliana. Our results indicate contribution from both, but they do not impress in number of statistically significant loci nor effect size. On the other hand, we also derived statistical model for inferring causality in a structured sample. Our efforts showed that additive genetic components are still dominant in exploring heritability, though there is a promising future for continued exploration of other contributors. xi Chapter 1 Introduction Themechanismsgoverningheritablebiologicalvariationhavebeenafascinating question in biology ever since Gregor Mendel first studied characteristics of the pea plant Pisum sativum over 150 years ago(Mendel, 1865). Two aspects of this problem have seen tremendous progress in recent years thanks for advancement in technology: First, we seek to uncover the identity of ’genes’ underlying heritable traits, crucial for delving the biological pathway underlying traits, and for us to be able to manipulate them eventually. Second, we try to estimate effect sizes of naturally occurring alleles, especially important for understanding the adaptive forces that shapes genomes and for predication of phenotypes. These questions are answered together with the coming of age of trait mapping. Atfirst, traitsofinterestwerestudiedincontiguousgenerations, eitherinpedigrees or experimental crosses, in designs known as linkage mapping or quantitative trait locus (QTL) mapping (Sax, 1923). Although successful, linkage mapping studies sufferfrompoorresolution. Recombinationisarareevent, andevenwithverylarge cross or pedigrees, QTLs found usually span large genomic regions and contain numerous genes. This makes it more or less impossible to pinpoint the exact causal polymorphism or even the underlying gene. One way of obtaining greater precision is the use of historical recombinations that shaped the genomes of natural populations over a much longer period of time and thus much more fine-scaled. Nonetheless, these treasure trove of genetic 1 information have been impractical to dig cost-wise, until the the advent of high- throughput genotyping, especially short-read sequencing. As of 2014, the genetic information of a genome as large as ours can finally be determined to high precision at slightly over $1,000 with the release of Illumina Hi-seq X Ten. Armed with these new tools, genome-wide association studies (GWAS) has become the method of choice in finding genetic loci underlying traits, and by extension, in understanding heritable traits. It has been applied with great success in finding loci influencing human diseases or traits in model organisms Atwell et al. (2010); Boyko et al. (2010); Visscher et al. (2012). The basic idea behind GWAS is straightforward. In a sample with heritable phenotypic differences, the individuals would harbor genetic difference at loci con- trolling these phenotypes. By genotyping all genetic variations or a representative subsample, we would find the causal loci, or close-by tag variants, associating with the phenotype. Genetic variations at unlinked locations would be associated by chance, but with a large enough sample size these false-positive associations would be rare. In practice, numerous practical challenges exist for studying complex traits, in particular the inclusiveness of the models used. This likely contribute to the notorious ’missing heritability’ problem, a discrepancy between top-down esti- mates of heritability from linkage designs and bottom-up approach by summing up significant allelic effects from loci detected through GWAS(Robinson et al., 2014). An example concerns human height, where even the top 150 associate loci detected by marginal GWAS barely explains 10% of the heritability(Allen et al., 2010), in contrast with the expected level of about 80% narrow sense heritability estimated by pedigree studies. By modeling height as a summation of thousands of small effect loci in a variance component model, however, Yang et al. (2010) showed that more than 40% can be explained by the same SNPs used in previous 2 studies. Combined effects from numerous small effect loci are only one type of her- itable effects missed by including only significant individual effect in linear models. Other notable examples include epistasis, genotype by environment effects, non- linear effects or un-genotyped loci. In this work, I devise and implement methods to explore two particular cases of these: two-way epistasis and heritable DNA methylation changes, and apply these methods to the model plant Arabidopsis thaliana. In the subsequent chapters of the introductory chapter, more background on epigenetics and epistasis is presented, highlighting what we already know and what methods were used to study them. At the end of the chapter is a brief introduction about linear mixed models, the standard GWAS tool in structured data. We then proceed to chapter 2, where association study is performed on the combined genome and epigenome with various models, in order to find any addi- tional contribution of epigenetic information. The foundation of the analysis in chapter 2 is more thoroughly studied in chapter 3, where we demonstrate statisti- cal evidence that epigenetic signal is not always the product of traits and instead can act as an mediator. Finally, I present methods to detect epistasis at various levels in chapter 4. Many of the methods presented are considerably novel, and are accompanied with simulation results that attempt to show their effectiveness. The usefulness of these models, their biological findings and potential impact are discussed in detail after each chapter, and overall conclusion and outlook for the entire work is summarized in chapter 5. After the main text, I included quick appendix sections that attempt to define commonly used terms as well as default meaning of many mathematical symbols to avoid ambiguity. Following that, some additional useful developments on linear mixed models is described. 3 1.1 Epigenetics The faithful nature of DNA which make it the perfectly resilient schematic of life also undermines a role as the carrier of more transient functional information. For instance, when all cells carry the same DNA sequence, how does the organism guide groups of them through similar developmental process and form tissues? In order to bridge this gap, it was thought that genes can be switched on or off through persistant signals other than DNA(Waddington, 2014; Holliday & Pugh, 1975). This initial definition of epigenetics has evolved over the years, and is now a prosperous topic of research focusing on any ’intermediate’ signals that can influence development or more broadly phenotypes in general. Among the many signals considered epigenetic, DNA methylation, or more nar- rowly cytosine methylation, is perhaps the best studied. Since the tentative dis- covery of a ’new’ type of methylated nucleotide by Hotchkiss (Hotchkiss, 1948), the 5-methylcytosine has been found in DNA of a wide variety of different species(Feng et al., 2010), in particular the plant model Arabidopsis thaliana, the animal model Mus musculus, and in us humans. The importance of methylation is strongly suggested by the conservation of DNA-methyltransferases(Goll & Bestor, 2005), and directly shown when some of these enzymes are knocked out: developmen- tal defects in Arabidopsis, and lethality in cases of mouse(Li et al., 1992). Even the signal themselves are shown to be conserved in closely related plant species (Takuno & Gaut, 2013). Unsurprisingly, the addition of a methyl group has many desirable properties as a vessel of medium-term information. Firstly, it is a fully reversible reaction and capable of being a molecular switch. Secondly, it involves covalent bonding and is stable enough. Last but certainly not least, DNA methylation, especially in the CpG context, has well understood and reliable mechanisms to copy itself like 4 a ’5th base’ during DNA replication. It is sometimes argued that it is this partic- ular potential to be inherited that make a signal truly ’epigenetic’. True to this potential, DNA methylation has been found active in a variety of developmental processes involving carrying information to daughter cells, such as cellular differ- entiation(Lister et al., 2009), genomic imprinting(Suzuki et al., 2007), persistent change in response to environment(Heijmans et al., 2008), and more sinisterly the etiology of various forms of cancer(Jones, 1986; Jaenisch & Bird, 2003; Jones & Baylin, 2007). Even more fascinating is the discovery of DNA methylation status that carry information across multiple generations. This is best exemplified by various ’epi- alleles’ found in plants(Weigel & Colot V, 2012). More recent research have not only demonstrated that spontaneous mutations in DNA methylation can be inher- ited without accompanying DNA changes(Schmitz et al., 2011; Becker et al., 2011), but that induced DNA methylation changes can bring about heritable changes in flowering time and root development by themselves(Cortijo et al., 2014). Although speculated for a long time(Wong et al., 2005; Slatkin, 2009), these work shows direct evidence that DNA methylation matter beyond developmental regulation and can have active roles in adaptation and evolution. This immediately raised the question of whether and how much organisms utilize this in nature. In the chapters 2 and 3, we attempt to answer this question in a collection of Arabidopsis thaliana from Sweden. Before going further into details of the study, it is useful to take a step back and examine what we already know about Arabidopsis DNA methylation: 5 1.2 DNA methylation in Arabidopsis thaliana Arabidopsis thaliana has long served as one of the best models to study DNA methylation. In fact, many important understanding of DNA methylation have come from, and will likely continue to come from this model plant. There are some major differences between methylation in plants and mammals. Firstly, the epigenetic reprogramming in embryo is often not extensive, where most of the DNA methylation state is reset each generation. On the other hand, DNA methylation in the plant occur consistently over all three cytosine contexts: CG (CpG), CHG and CHH, where H denote any base other than a guanine. Upon closer inspection, it seems that there are broadly two types of DNA methylation that result in very different patterns: one in which cytosine in all the three contexts are methylated, is much denser, and typically occur in and around transposable elements; and the other in which only CpG sites are methylated, is less dense and primarily occurs in gene bodies away from gene ends. These differences can be seen in Figure 1.1 plotted from our Arabidopsis data, where the ’line’ on the left are genes with CG gene body methylation while those on the top middle ’cloud’ are genes with C methylated in all contexts. The differences goes beyond the signals, of course: The first type, which we shall term ’transposable element like methylation’ or TEM, are thought to be initialized with RNA-directed DNA methylation pathway, a de novo pathway in which unmethylated cytosines in a region are indiscriminately methylated by the methyltransferase DRM2(Chan et al., 2004). These signals can then be transmit- ted during cell division through maintenance transferases including MET1, CMT3 and recently found CMT2(Zemach et al., 2013; Stroud et al., 2013). The second type, ’gene body methylation’ or GBM, is less understood and appear to only involve the maintenance transferase MET1(Cokus et al., 2008; Coleman-Derr & 6 0.0 0.2 0.4 0.6 0.8 1.0 gene body CHG methylation 0.0 0.2 0.4 0.6 0.8 1.0 gene body CG methylation Figure 1.1: Number of genes with different mean levels of CG methylation and CHG methylation. Color indicate number of genes in each bin, with deep blue meaning 0. Zilberman, 2012). In terms of function, these two classes of methylation also differ significantly. Whereas TEM are almost surely responsible for gene/TE silencing, the role of GBM is still debated, with many believing it to regulate splicing and robust expression(Kim & Zilberman, 2014). Given these differences, which are summarized in Table ??, we shall proceed to treat them separately in most cases. Aside from these mechanisms, much progress have also been made towards understanding the source of methylation differences in nature. As expected, local DNA sequences, in particular presence of transposable element, can dictate the presence or absence of methylation. In the absence of DNA sequence changes, DNA methylation patterns tend to be maintained, though heritable mutations do occur at a low rate (Schmitz et al., 2011; Becker et al., 2011). Developmental regulation of DNA methylation is also evident, such as the activation of RdDM 7 Gene body methylation TE like methylation Context CpG CpG, CHG and CHH Lost by knocking out... MET1 MET1, CMT2/3, DRM2 Typically found at... Protein coding genes Transposable elements Density of methylated C Low High Function Debated Silencing Table 1.1: Summary of two types of Arabidopsis DNA methylation pathwaysduringembryogenesis(Rojasetal.,2012), thoughalsodependentonDNA sequence. These sources of variation can be summarized in the following abstract equa- tion: M =M 0 β 0 +Xβ X +D(X,t,env) + (1.1) Where M is the measurement of DNA methylation of a current sample, M 0 is the DNA methylation state of parental germline with β 0 its contribution,X and β X genetic contribution, D(X,t) the effect from development at time t and envi- ronment env, while the unexplained random factors including noise in measure- ments. Although the relationship is represented here as additive linear model, the actual relationship is of course much more complex and poorly understood, with interaction between all terms. In a parallel study conducted together with Manu Dubin and Pei Zhang, we investigated the various contribution to these variation, and found substantial evidence for trans genetic control as well as environmental regulation(Dubin et al.). Some of these results are presented in Figure 1.2. In this thesis, I instead focus on the role of DNA methylation. The most recognized molecular function of DNA methylation is regulating expression levels incis(Jaenisch&Bird,2003;Reik,2007), probablythroughalteringbindingsitesof 8 b a d Chr1 Chr2 Chr3 Chr4 Chr5 Chr1 Chr2 Chr3 Chr4 Chr5 SNPs DMRs methylation levels CMT2a-nr CMT2-r CMT2b-nr 3e-06 2e-06 1e-06 0 -1e-06 -2e-06 -3e-06 CMT2a-nr CMT2-r CMT2b-nr e e 0 .6 0 0 .2 0 .4 0 .8 1 .0 0 .0 0 .2 0 .4 0 .6 0 .8 1 .0 Env Cis CisXEnv Trans TransXEnv CMT2 CMT2XEnv total porportion of variance explaned porportion of variance explaned 0 .6 0 0 .2 0 .4 0 .8 1 .0 density c Figure1.2: CMT2andenvironmentaleffectonCHHmethylationlevel. a)Marginal twas on global CHH level. b) global BLUP-corrected CHH level for different alleles of CMT2. c) Number of new TEs called for different alleles of CMT2. d) 2D plots of association between CHH levels and SNPs e) Variance explained by different factors for bins of CHH methylation transcription factors. This effect has indeed been observed in Arabidopsis thaliana, and is the primary trait we study. 9 The other required component is of course differential methylation. Research on plant DNA methylation has been greatly enhanced by existence of viable Ara- bidopsis methylation mutants like met1 and cmt2-7. Yet these methods disrupts DNA methylation at whole genome level, and also impact important cellular pro- cesses associated with them. Attempt to utilize DNA methylation mutations to generate large changes in isogenic lines was also limited by mutation rate. In addi- tion, these ’artificial’ approaches to generate differential methylation only provide indirect estimates of impact in nature. Fortunately, natural polymorphisms in Arabidopsis DNA methylation abounds (Schmitz et al., 2013; Dubin et al.), and will form the basis of our study. We pursue the following overlapping questions: 1. Given the fact that DNA methylation can influence traits, how much do they contribute in nature? Furthermore, in terms of using GWAS in forward genetics, can DNA methylation help us finding loci that otherwise lack DNA polymorphisms? Or more generally, how much information do they provide? (This is investigated in Chapter 2) 2. What is the role of gene body methylation? Is it a passive record of expres- sion history, simply keeping expression levels/splice forms in check, or more actively regulating gene expression? (This is investigated in Chapter 3) 1.3 Epistasis Another source of heritability that would not be captured by additive models came from interaction between pairs of loci, or epistasis. Despite current interest in detecting such interactions, epistasis is a century old concept, and the term is first coined by William Bateson to describe one gene masking the effect of the other 10 (Bateson & Mendel, 1909). Much like epigenetics, the proper scope of the term is still ambiguous today. Here we employ the more population genetics oriented statistical definition of epistasis by Fisher, which is any deviation from additive allelic effects(Fisher, 1919). In a diploid organism, this include interaction between genes(Kempthorne, 1954) as well as dominance effects (Cockerham, 1954). Many cases of epistasis are found throughout different domains and kingdoms. In bacteria (Elena & Lenski, 1997); In plants (Kroymann & Mitchell-Olds, 2005); In animals (Anholt et al., 2003); and last but not least, in yeast (Tong et al., 2004). Although it is beyond question that epistasis can have a large impact on phe- notypes, their importance for complex traits in natural populations is not at all straightforward. Some have argued that epistatic effect are expected to be abun- dant from evolutionary perspective(Hemani et al., 2013); There is also claim that they should explain more variance than popularly attributed simply due to diffi- culty in searching for them(Nelson et al., 2013). Yet others built statistical argu- ments that epistatic variance must be small in outbred populations for complex traits, in scenarios where pairs of interactions have similar effect sizes as marginal effects(Hill et al., 2008; Maki-Tanila & Hill, 2014). At this moment, experimental evidence from genome scale studies generally favor the camp arguing for small effect, for instance in studies on Drosophila(Huang et al., 2012) and Humans (Hemani et al., 2014; Brown et al., 2014). It is notable that in these studies, many instance of epistasis are found, but they barely explain much variance. On the other hand, a study using yeast(Bloom et al., 2013) found larger con- tribution from epistasis, but unfortunately they used a F1-based sample. The distinction of using F1 population is important, since genetic variance attributable for epistasis in a population depend not only on effect size but also frequency of 11 respective allele combinations. This one of the major source of difficulty for detect- ing epistasis in a GWAS setting. Just like the notorious problem of detecting ’rare SNPs’ in marginal GWAS, epistasis of even very large effect would be impossible to observe without the specific combination of contributing alleles in the sample. In natural populations, polymorphisms usually segregate at frequencies distributed like a U-shaped curve, with very few at intermediate frequencies. This means that most pairs of SNPs would have an interaction term that is similar to a very rare SNP in terms of detection. This does seem to paint a gloomy picture for importance of epistasis in natural populations. The huge amount of effort devoted to ease computation challenge might not matter if only small contributions are found. Nonetheless, even though epistasis may not explain much phenotypic variance, the identity of interacting pairs itself are important biological findings. The very possibility that interacting pairs act in the same pathway gave molecular biologist a huge insight on how they carry out their function. For many human diseases, for example, the crucial drug target can be the interacting partner for a protein of interest. Thus, it is still important to develop methods and datasets for detection of interaction, especially for organisms where a cross design might not be feasible or too expensive compared with simply sampling more individuals. One important aspect of epistatic GWAS that must be mentioned, is the basic assumption about the functional form of the trait. Recall that Fisher’s definition imply anything non additive is epistatic. Thus, if allelic effects are multiplica- tive for example, then we would actually detect synergistic epistasis between loci, sometimes strongly so. In these cases, ’rampant’ epistasis would simply disappear, and our understand and prediction of the trait would be much better if we simply 12 look at the logarithm of trait instead. Sometimes, inappropriate assumption like this is detectable by looking at the distribution of the trait. 1.4 Linear mixed models for GWAS The popularity of linear mixed models in GWAS can be credited to the prob- lem of population structure(Novembre et al., 2008) confounding in GWAS , where false positive associations arise between traits and polymorphisms that by chance have different allele frequencies in sub populations. There have been much effort to address this, including famously genomic control and principle component anal- ysis(Price et al., 2006). Currently, the most used model for association mapping in structured samples is the linear mixed model. These flexible models are first developed by scientist working on animal breed- ing (Henderson et al., 1959) to predict breeding values, and has received tremen- dous attention today due to its ability to correct for relatedness at all levels(Kang et al., 2010). Although first used for population structure correction in marginal association tests, the correction really works because the random effect would capture combined effect from the genetic background (Vilhjálmsson & Nordborg, 2012), just like how the model is used in animal breeding studies. This other aspect of mixed model has been utilized to great success to estimate genetic variance for many complex traits in GWAS setting (Yang et al., 2010). Mixed model have been extended to variety of other situations, such as GWAS on associated traits(Korte et al., 2012) and multi-locus GWAS(Segura et al., 2012). Different forms of linear mixed models are the majority of models used in this thesis, and here I briefly introduce the simplest one that is frequently used in GWAS today. 13 Consider a trait y, Y =μ + X i X i β i +g + (1.2) Whereg∼N (0,K S σ 2 S ) is the random genetic effect and∼N (0,σ 2 ) the random noise. Thus, Y is also normally distributed, and an equivalent way to represent the mixed model is: Y∼N (μ + X i X i β i ,K S σ 2 S +Iσ 2 ) (1.3) K S is usually computed beforehand, and in a marginal test only one SNP or X is included. Thus there are 4 parameters in the model: μ the phenotypic mean, β the SNP effect, σ 2 S the variance of random genetic effect and σ 2 the variance of noise. Among these, β is the parameter of interest for marginal test, while σ 2 S or the ratio betweenσ 2 S andσ 2 is the parameter of interest for total genetic variance. Most GWAS applications on linear mixed model focus on value and significance of these statistics. But firstly, I shall describe the calculation ofK based on genomic data. 1.4.1 Calculating the kinship matrix The ’kinship matrix’K S used to refer to one that is constructed pedigree data. However, this is clearly not possible for most GWAS nowadays using randomly sampled individuals. With the availability of whole genome data, a relatedness matrix constructed from SNP data have been used instead and have been shown to performreallywell. Intherestofthethesis, itisusuallyassumedthatbya’kinship matrix’, we refer to some form of relatedness or covariance matrix estimated from polymorphism data. Its calculation are well documented in many papers on linear mixed models (Yang et al., 2010), which I try to briefly describe here. 14 The rationale behind calculation of this type of kinship matrices can be thought as an extension of the ’infinitesimal’ model of Fisher. Typically, a small random effect is assigned to all biallelic SNPs: Y =μ + n X k X k z k + (1.4) Or for each individual: y i =μ + n X k x ik z k + i (1.5) Where z j ∼N (0,σ 2 S ), i ∼N (0,σ 2 ) are i.i.d. Gaussian random variables. Now consider the covariance between measurement of the trait of individuals i and j (i6=j): Cov(y i ,y j ) = Cov( n X k x ik z k , n X k x jk z k ) = n X k Cov(x ik z k ,x jk z k ) =σ 2 S n X k x ik x jk (1.6) Thus, supposeX = (x ij ) is the matrix form for vectors of SNPs, then Var(Y ) =X 0 Xσ 2 S +Iσ 2 (1.7) orK =X 0 X Ofcourse,wewillgetdifferentformofK dependingonwhatSNPsareincluded, and how the SNPs are encoded. In some studies, the SNPs are first scaled to mean 0 and variance 1 (by the factor of q p(1−p), p is the allele frequency), andK is exactly the covariance matrix. CalculatingK thiswayreflectamodelinwhichthetwoallelesofeachSNPhave opposite effect, and that the heritability of all SNPs have the same distribution 15 regardless of allele frequency. Alternatively, we can assume that the two alleles of each SNP are independently distributed, and the effect is not scaled by a function of allele frequency. The resulting matrix from this assumption is the same as the identical by state or IBS kinship matrix(Kang et al., 2008), and can also be easily calculated. For instance, for a homozygous organism, we can encode the two alleles of each SNP as 1 and -1, then: K IBS S = 1 2 X 0 X + 1 2 (1.8) This matrix has been found to perform slightly better than the covariance matrix, and by default we use the IBS version of the kinship matrices. 1.4.2 Estimating the parameters There have been tremendous progress in estimating parameters for the linear mixed model. In most cases, the parameters are either chosen to maximize the like- lihoodL(μ,β,σ 2 S ,σ 2 |Y ) or the restricted likelihoodL(ˆ μ, ˆ β,σ 2 S ,σ 2 |Y ) where ˆ μ and ˆ β are least square estimates of μ and β without random effect. Currently, several packages implement efficient estimation of these parameters as well as providing p-value for marginal tests (Zhou & Stephens, 2012; Lippert et al., 2011). 1.4.3 Multiple variance component models One useful extension of the basic linear mixed model is the inclusion of multiple random effect terms that correspond to different subsets of SNPs. For instance, SNPs around individual genes can be grouped together into an additional ’local’ kinship term, which can capture the aggregate signal from these SNPs. Another application of this approaches is by constructing a random effect term for SNPs 16 from each chromosomes in humans, and it was realized that the estimated effect size of each correlate with the length of the respective chromosomes Yang et al. (2011). 17 Chapter 2 Association mapping with DNA methylation 2.1 Introduction Our first study pertain to assessing the magnitude of effect of DNA methyla- tion on trait means, mainly the expression level of genes. More specifically, we want to investigate the contribution of naturally occurring methylation polymor- phisms, especially in contrast to contribution of genetic factors. This is actually an impossible task to statistically resolve in most cases, since DNA methylation levels are often correlated with local haplotype patterns, most notably to pres- ence of nearby transposable element, and indeed to the entire global population stratification(Dubin et al.). In those cases, a conservative answer would be that these methylation polymorphisms are in turn regulated by DNA sequences, and ultimately did not carry any independent information. However, there exist the moreliberalpossibilitywouldbethatitwasthesemethylationpolymorphismsthat are causing the changes while haplotype patterns are merely linked. Nonetheless, no matter which of the possibility is true, we can estimate how much additional variance we can explain with DNA methylation data, and answer whether they contribute to ’missing heritability’ of complex traits. 18 Another goal of our study is with respect to mapping traits to specific loci with association studies. Here we are in particular concerned with methylation- associatedtraitswherenogeneticpolymorphismsarefoundtobeassociatedorvery weakly associated. These cases, if exist, would mean direct gain in power to detect loci that might not be detectable with genetic data even at much larger sample size. It would also be interesting to note whether these cases are overrepresented in functional categories that demands more rapid adaptation, for instance disease- resistancegenes. Aproblemwiththisapproachislackofvalidationformethylation sites found, so we are focusing on cis associations with expression levels which can be generally assumed to be true positives. This is reasonable statistically, since even in our case of modest sample size and potentially large measurement errors, cis loci are very limited in number. Biologically, this is also highly plausible due to well known mechanism of cis DNA/DNA methylation effects on expression. One important precautions must be taken in interpretation of association. Very importantly, talking of ’effect’ is actually assuming that DNA methylation is causal or at least act as an intermediate upstream of the trait. While this is quite accept- able assumption for TEM, it is an unanswered question for GBM, and we shall address this question for GBM in the next chapter. 2.2 Data preparation and preliminary analysis 2.2.1 Sample preparation and sequencing Arabidopsis inbred lines were collected from Sweden and grown in chambers at constanttemperature. ForDNAsequencing, sampleswerecollectedfromrootsand sequenced with Illumina at 72-100 base pair paired-end runs . These were then mapped with BWA and SNPs were called with GATK in our custom pipeline. 19 RNA samples were collected from leaves at the 9-leaf stage, and sequenced at 4- fold multiplexing with 36-40 base pair runs. For DNA methylation, samples were again collected from leaves at the 9-leaf stage and bisulfite-sequenced with base pair runs. All samples used in this study were grown at a constant 10 ◦ C. 2.2.2 Filtering of data In order to reduce the number of false associations and , the three dataset were filtered according to the following rules: SNPs: all variants less than 0.05 minor allele frequency are removed. This is not only because we have less power to detect effects from rare SNPs, but it is also easier for false positives to arise from them due to sensitivity of parametric linear regression to outliers. In GWAS, this would often manifest as a trait associated with a large number of SNPs across the genome, appearing as ?horizontal lines? on a 2D-association plot. Expression: all genes whose mean expression level is lower than 3 after anscombe transformation are removed. A minimum coefficient of variation of 0.05 is demanded for the remaining set so that only genes with some varying expression are kept. Methylation bins: DNA methylation data from individual cytosine sites are combined in overlapping 200bp windows and the average methylation level, defined astotalnumberofmethylatedreadstototalnumberofrespectivecytosinesites, are used. similar to expression levels, all methylation bins are filtered for a minimum coefficient of variation of 0.05. We also devised a new filter based on similar motivation as minor allele frequency filter for SNPs but extended to quantitative data: methylation levels are normalized to μ = 0 and σ 2 = 1, and the sum of squared distance of the 5 furthest values are calculated. Is this number is greater 20 than 75, the methylation bin is filtered out. This reduces chance that a few outliers would distort association results. 2.2.3 Correlation between expression and methylation We first investigate whether there is any correlation between expression level of aparticulargeneandDNAmethylationacrossindividuals, bycalculatingPearson’s r between them. Unsurprisingly, we detect many associations between them, espe- cially in cis. A summary plot (Figure 2.1) is shown for cis associations divided into 4 types of bins: CG gene body methylation, and CG, CHG, CHH in transposable- element like methylation. Expectantly, TE-like methylation level in all context are predominantly negatively associated with expression, particularly at the tran- scription start site. However, CG gene body methylation are mostly positively correlated with expression. This type of methylation are usually found in gene bodies away from either end as indicated by the location of strongest associations. The result from correlation analysis correspond nicely with what we already know about ’promoter’ methylation of transposable elements. However, it is one of the first study in which positive correlation between segregating gene body methylation levels and expression across different individuals is observed. 2.3 Association mapping of expression levels 2.3.1 Test for global methylation effect It has been recognized that traits can be affected by a large number of small effect loci, and together such effect can add up to explain a substantial amount of trait variation(Yang et al., 2010). In GWAS, these effects can manifest as 21 4000 2000 0 2000 4000 6000 8000 10000 Location of highest correlated bin (0 being TSS) 0 10 20 30 40 50 60 70 80 Count of genes among these with r 2 >0.1 (a) CG gene body methylation 4000 2000 0 2000 4000 6000 8000 10000 Location of highest correlated bin (0 being TSS) 0 10 20 30 40 50 Count of genes among these with r 2 >0.1 (b) CG TE-like methylation 4000 2000 0 2000 4000 6000 8000 10000 Location of highest correlated bin (0 being TSS) 0 5 10 15 20 25 30 35 40 45 Count of genes among these with r 2 >0.1 (c) CHG 4000 2000 0 2000 4000 6000 8000 10000 Location of highest correlated bin (0 being TSS) 0 5 10 15 20 25 30 Count of genes among these with r 2 >0.1 (d) CHH Figure 2.1: Significant correlation between expression and DNA methylation bins. The overlapping histograms are distribution of locations, relative to transcription start site, of the highest associated methylation bin for each gene. The direction of correlation of such bins are differentiate by color, with blue indicating positive and green negative r. Only genes with significantly associated methylation bins (r 2 >0.1) are included. substantial inflation of test statistics for large number of loci due to correlation with population stratification. As being shown in these studies, such effects can be effectively modeled as a summation of i.i.d. random Gaussian effects for all SNPs, and is routinely included in association tests with linear mixed models to reduce spurious association. With the possibility that methylation sites also affects traits independent of SNPs, small trans effects from methylation status can also add up to substantial effect on heritability of traits. Therefore, one first task before further 22 association study is to explore whether such effects exist and should be included as an additional factor in regression. This is performed by comparing the following nested models with likelihood ratio test: Y∼N (μ,K S σ 2 S +K M σ 2 M +σ 2 ) (2.1) Y∼N (μ,K S σ 2 S +σ 2 ) (2.2) HereK S andK M refer to a relatedness matrix computed by SNPs or methylation bins alone, whereas the σs and e indicate the strength of the respective effects and noise. Since it might be expected that the methylation variants from different contexts has varying effects on traits, separate tests are conducted on CG GBM and CG/CHG/CHH methylation in TEM for a total of four comparisons per trait. (a) Using gene body methylation (b) Using CHG methylation Figure 2.2: Distribution of likelihood ratio test p-value comparing variance com- ponent model with both SNP and methylation random effects versus null model of only SNP random effects. Mean expression levels are used as traits. The results of these tests (Figure 2.2) largely correspond to our expectation that the methylation relatedness matrices did not improve the dissection of trait variance. Except for very rare cases, adding an additional variance component yield the same likelihood as the model with SNPs only. This shows that the either 23 correlation structure between methylation sites are well captured by SNPs already, thereforeyieldingverysimilarkinshipmatrices, orthatmethylationsitesingeneral does not affect the trait in large amount, adding up to insignificant amounts of trait heritability. (Potentially additional analysis here: comparing with models with only methylation kinship, matrix comparison, look at dissection of variance similar to rnaSeq/meth paper) 2.3.2 Marginal association study Association study on both genotype data (GWAS) and DNA methylation bins (EWAS), is performed with linear mixed models to correction for population struc- ture [cite]. As shown in the above section, we do not expect further inflation of test statistics due to stratification in methylation data, and only the SNP kinship is included as additional variance component. The model used is: Y∼N (μ +Xβ,K s σ 2 s +σ 2 e ) (2.3) Again, Y is the vector of phenotypes, X is a single vector of SNP or methylation bin, while the betas correspond to allelic effect sizes. K s and σ 2 s are again the genetic related matrix and its corresponding random effect size, while σ 2 e is the residual variance due to unexplained environment or noise. Marginal F-statistic is calculated as in ordinary linear model after rotating the phenotype Y and X by (Λδ +I) − 1 2 Q T , where QΛQ T is the spectral decomposition of the symmetric relatedness matrix K andδ is the ratio betweenσ s andσ e . To simplify calculation, we used the same approximation as in EMMAX/P3D that only calculate the ratio δ once for the null model in which only the interceptμ is included in the fixed part 24 of the Gaussian distribution. The significance level as measured in p-value is then obtained by F-test for SNPs and methylation bins of all contexts. (a) QQ plot in normal scale (b) QQ plot in log scale Figure 2.3: Combined QQ-plot for association mapping with either SNPs (blue and red) or methylation bins (green and cyan). In b) the axes are in log scale to magnify the tail distribution. Before proceeding to analyze the peaks, we turn to assess the performance of ourmixedmodelincontrollingspuriousassociationduetopopulationstratification (Figure 2.3). As expected, p-value for association with neither trans SNPs nor methylation bins are significantly inflated, and lie almost entirely on the expected line except for the tail. For cis-loci, in contrast, we do expect more significantly associations due to abundance of cis-regulatory elements of transcription. Indeed, we find significant inflation (lower p-values than expected) at all cis-loci quantiles due to linkage disequilibrium. A global view of associated loci shows abundance of cis-associations with scat- tered instances of trans associations, a pattern that is shared between GWAS and EWAS. Notably, only in cis region do we detect shared loci between SNP and methylation bins (generously defined as any significantly associated pair within 25 0 10 20 30 10 0 10 20 0 10 0 10 20 Genomic location of associated variant (Mb) 0 10 20 30 10 0 10 20 0 10 0 10 20 Genomic location of gene (Mb) Figure 2.4: Association mapping of mean expression on SNPs and DNA methy- lation. For each gene, results are merged in 10kb windows and a dot is plotted whenever there is a single associated loci above Bonferroni corrected α of 0.05. The color denote the type of loci: red-SNP, blue-CG, green-CHG/CHH, black- Both SNP and DNA methylation 10kb of each other). This is another circumstantial evidence that trans associated peaks are likely spurious. 26 2.3.3 Stepwise GWAS In a structured natural population like Arabidopsis, samples closer to each other are more likely to have a similar ancestral recombination graph even at loci distant to each other due to selection or random chance. This manifest in genomic data as long range linkage equilibrium between otherwise unlikes loci, and show up as additional peaks at distant locations to the real causal loci. On the other hand, allelic heterogeneity in a single regulatory region often produce a single peak which mask the existence of multiple causal factors. One proposed method to deal with these issues and obtain a more reliable sets of associations is what has been aptly termed multi-locus mixed model (MLMM), which has been demonstrated to work well in both Arabidopsis and human data (Segura et al., 2012). The basic approach is a combination of mixed model correction for population stratification and stepwise model selection for large effect loci. We employ the same strategy and the mBonf threshold that the authors suggested: the highest contributing loci is iteratively added as fixed cofactors, until no more remaining loci is significant at 0.05 family wise error rate. The models being explored is an extension of the marginal case: Y∼N (μ + n S X i X S,i β S,i + n M X j X M,j β M,j ,K S σ 2 S +σ 2 ) (2.4) Here X s and β s are SNP vectors and their respective effect, whereas X M and β M are methylation bin vectors and effect. In practice, we want to compare between models that involve only SNPs, only methylation bins and both. Therefore, we perform the stepwise-association for all three settings. 27 The results are very similar to those obtained from marginal tests. Except for a few cases, most of the trans peaks are not due to correlation with cis loci or each other. 0 10 20 30 10 0 10 20 0 10 0 10 20 Genomic location of associated variant (Mb) 0 10 20 30 10 0 10 20 0 10 0 10 20 Genomic location of gene (Mb) Figure 2.5: Step-wise association mapping of mean expression on DNA methy- lation, showing all significant steps. For each gene, results are merged in 10kb windows and a dot is plotted whenever there is a single associated loci above Bonferroni correctedα of 0.05. The color denote the type of loci: blue-CG, green- CHG/CHH, black-Both SNP and DNA methylation 28 2.3.4 Detection of new peaks from DNA methylation In this section we investigate any potential gain in power to detect new and credible associated loci by having epigenome data on top of genomic data based on marginal tests. Although we can’t say for certain validity of trans peaks without further confirmation, cis-associated peaks are expected to be true. Indeed from the association tests we can estimate an upper bound on the rate of false positives for cis loci, by the extreme assumption that all trans peaks are false hits. Therefore: # of false cis associations = # of trans associations # of trans loci ×# of cis loci We used results from the stepwise model for this calculation, which shows around 0.5% error rate for cis associations. We shall thus proceed with the approximation that all cis associations are true. First, we examine the top p-value of associated SNP and methylation bins for each gene. At an arbitrary significance threshold of 10 −5 , a large number of gene expression (n=338) with associated cis DMR is also associated with cis SNP(s). However, a significant number (n=141) of expression traits are not detected by GWAS, but instead have a cis methylation peak. A more visual breakdown of Figure 2.6 is by looking at detection of cis effect at varioussignificancelevelsasinFigure2.7. Ingeneral, SNPsdominatebynumberof cis associations, but at all significance levels there is a significant number of cases in which only methylation bins (Blue and purple) are associated. Furthermore, if we focus on the group of genes with methylation-only cis association at the Bonferroni significance threshold of α = 0.01 (purple in the bars), we find that even at much more lenient threshold level (10 −4.5 ), many of them are still not significantly associated with any cis SNPs. This means that, even if we draw more 29 0 10 20 30 40 50 60 top cis SNP p-value 0 10 20 30 40 50 60 top cis methylation bin p-value n=338 n=2011 n=141 Figure 2.6: Most significant p-value of association for expression of each gene. Only genes with at least one significantly associated (p-value less than 10 −5 ) SNP or methylation loci are shown. Here cis is defined as a distance of less than 50kb from the transcription start site of the gene being tested. samples from the same population as existing individuals, we are still unlikely to detect any signal from cis SNPs at normally used significance levels. A plausible reason for this, other than random chance, is that there is no segregating genetic variants in the population that regulates the expression of these genes, or that they are unlinked with any genotyped SNPs. In practice, this means that knowing the epigenomes does allow us to detect more potential regulators, including a fraction that would not be found easily by genotype-only designs at much larger scale. 30 0.01 0.02 0.05 0.1 0.15 10 −5.5 10 −5 10 −4.5 Bonferroni correction alpha/absolute p-value 0 500 1000 1500 2000 2500 3000 3500 Number of genes with detectable cis loci both snp only meth only, genes in first bar meth only, other Figure 2.7: Number of genes with cis effects detectable at various Bonferroni α (first 5) or absolute p-value levels (last 3). Purple bands track the group of genes that are detected by EWAS only at alpha=0.01 Finally we examine the peaks the angle of precision. It appear that associated DMRs are slightly closer to the gene, allowing small improvement in mapping resolution. However, the difference is not great, and we should not expect EWAS to be the solution for dissecting complex peaks. 2.3.5 Mapping methylation effect on top of SNP effects In the above sections we explored detection of new QTLs through methylation where there is an absence of marginal cis SNP association. However these are only a fraction of the cases where methylation status has an detectable effect on trait. In many cases, methylation simply explains more variance than SNPs. To investigate this rigorously, we perform a association study that explicitly compare variance explained by DNA alone versus DNA methylation and DNA together. 31 0 2000 4000 6000 8000 10000 distance to gene 0 10 20 30 40 50 60 70 80 peak snp peak methylation Figure 2.8: Histogram showing distribution of distance from peak cis loci to the gene whose expression the loci is associated with. Only genes with both cis DMR and SNP detected are shown. We try to capture as much variance explained as possible without overfitting by utilizing previous knowledge on genetic architecture of the traits in question, expression levels. In a companion project investigating genetic and environmental contribution to heritability of expression traits, we realized that allelic hetero- geneity, especially for cis region, can have a large impact on dissecting expression variation. Although we have used stepwise models as a way to better capture these effects, it only works on large effect and would ignore modest contributions from a larger number of loci. To capture these effects, linear mixed models that include an additional variance component for cis loci have been shown to perform adequately (Companion project, not published). 32 Utilizing this knowledge, we devised additional models that incorporates large effect loci, local allelic heterogeneity as well as background trans effects. In par- ticular, local allelic heterogeneity is modeled by the local equivalent of the global relatedness matrix. The full models are: genetics :Y∼N (μ +X S β S ,K S σ 2 S +K l σ 2 l +σ 2 ) (2.5) epigenetics :Y∼N (μ +X M β M ,K S σ 2 S +K lM σ 2 lM +σ 2 ) (2.6) full :Y∼N (μ +X S β S +X M β M ,K S σ 2 S +K l σ 2 l +K lM σ 2 lM +σ 2 ) (2.7) , where K l and σ 2 l are the cis SNP kinship and its effect, while K lM and σ 2 lM are cis methylation kinship and its effect. We do not include a global methylation kinship since that has been found to exert no influence. Weuseddifferentvariantsofthesemodelsbasedondifferentassumptionsabout underlying biology as well as faith on trans associations. For the most conservative model, we includes only the most significant cis loci for both SNP and methylation, and also assumed no term for local methylation allelic heterogeneity. In the inter- mediate model we added the the methylation kinship term that intend to capture more methylation effect at cis level. The most relaxed model also include signifi- cant trans loci detected from step-wise GWAS. To compare between the models, we used both likelihood ratio test as well as measures ofR 2 on linear mixed models. More discussion about R 2 measures for linear mixed models are given in Section B.2. In Figure 2.9 we present results from a slightly simplified model (Equation ??) involving only fixed effects, cis SNP random effect and global SNP random effect. For majority of genes, methylation does not significantly add to the variance explained. Howeverthereisaminorityofgenes, inred,withadditionalmethylation 33 Figure 2.9: Marginal R 2 of SNPs+methylation fixed effect versus those of SNPs alone. Highlighted in red are genes with methylation effect that is significant at Bonferroni threshold. effect that are significant even at the Bonferroni threshold. Note that since we are not showing genetic effect attributed to the cis kinship nor global kinship, the implied heritability explained are smaller than actual values. 34 2.3.6 Cis DNA methylation regulated expression FromtheassociationtestingwithmodelinEquation??, wederivedacandidate list of genes that shows significant regulations by a cis DNA methylation bin. In total, we found 210 genes with significant cis methylation association. Figure 2.10: Manhattan plot of GWAS and EWAS for expression of a member of the NB-LRR family. Here blue represent SNP associations, green represent CG GBM associations, and red represent CG TEM associations. An example of such a loci is given in Figure 2.10. The NB-LRR family is a large family of heavily regulated resistance genes, which play a central part in combating pathogens. As such, they are ideal targets of epigenetic switches constructed with transcription start site methylation. Interestingly, the expression level of this particular member of the family is not associated with any local SNP variant, and appear to be solely regulated by the DNA methylation status. Of course, this can also be due to untyped structural variants, in particular a TE insertion, that is still difficult to type with present technologies. 2.4 Application to flowering time Flowering time is a well known quantitative trait in Arabidopsis, and its precise quantitative determination is crucial to the survival of the weed in its competitive habitat. Manygenesandlocihasbeenfoundtoregulatefloweringtime, integrating various signals including external environmental cue and internal developmental 35 Figure 2.11: Manhattan plot of GWAS and EWAS for flowering time of plants grown at 10 ◦ C. Here green represent SNP associations, blue represent CG GBM associations, and red represent CHG associations progress. Due to varying nature of environment, it can be expected that some degree of responsiveness from rapidly mutating epigenetic changes can play a sig- nificant role in its regulation. To verify this possibility, we performed marginal EWAS test to flowering time of an overlapping set of 133 inbred lines grown at 10C. We identify a potential TEM regulatory peak at AOP1-3 cluster, which are themselves potential candidate (Kerwinetal.,2011)aswellasbeingnearaknownregulatoroffloweringtimegene. In addition, the expression level of AOP2 has been found to be directly associated to flowering time, painting a potential pathway of DNA regulating expression, which in turn regulate flowering time. The validity of this pathway is still under active investigation, as the methylation level has been found to be associated with population structure of the sample, and the significance level can be still inflated. 2.5 Discussion Through careful data preparation and extensive modelling, we began a tenta- tive step towards understanding the role that heritable DNA methylation variation plays on regulating phenotypes at a genome-wide scale. Although it has been long 36 thought that inheritable DNA methylation polymorphisms can regulate pheno- types, our study demonstrated a large number of concrete associations as well as providing a well-reasoned, albeit rough, estimate on the extent of their effects. Care must be taken to interpret even the most significant of associations. The mostsevereweaknessofourstudyisperhapsthelimitoftechnologyandthenature of data collected. We recognize that considerable noise exist in our measurement of expression data, and to a lesser extent, DNA methylation data. This not only createsstatisticalchallengestodiscovertrueassociations, butisalsoacauseofcon- cern for high false positive rates among the tremendous number of tests involved. This is especially a problem for trans associated loci, and essentially render any conclusion based on them risky. On the other hand, though great efforts are put in to collect data at fixed devel- opmentalstageandtimeofday, thecriteriausedareexternalandisonlyoneaspect of growth. Some systematic biases has to be expected due to profound differences that our models struggle to capture. Another source of information loss is the fact that we are but taking single snapshots of ongoing dynamic biological processes. One very specific flaw is not taking data from the entire circadian cycle which would miss some crucial parts of expression data. But most crucially, in taking sample from a mixture of tissues, we are missing important distinctions between the transcriptional and possibly DNA methylation landscape among differentiated cells. Due to these limitations, we have been cautious in making conclusions, and many interesting features of the ’epigenetic architecture’ of traits discovered are expectedtohold. First, werecognizedthatcombinedeffectoflargenumberoftrans methylation sites, or the epigenetic background, does not seem to help explaining much phenotypic variation. This could of course mean that trans methylation 37 sites do not influence expression. But more likely, due to sharing the same pop- ulation history as genetic variation, contribution from methylation sites correlate heavily with pattern of nearby SNP, and would be absorbed into SNP effects in a model that only involve SNPs. This result frees any further tests from considering potential confounding from an epigenetic relatedness matrix. Second, by examining cis associated methylation bins, we found a number of cases where there is no significantly associated SNP in vicinity. This is interesting, since even untyped genetic variants like transposable elements are expected to be linked in some degree to local SNP variants. Therefore, it might be expected that in some of these cases that methylation variation are not controlled by any cis genetic loci. This independence is perhaps an indication of their status as pure epialleles. On other other hand, it is certainly intriguing how such methylation polymorphism came to segregate in a population without any linked SNPs. A possible explanation is the high rate of epi-mutations: it might simply be that it arises independently under many different local genetic haplotypes. Last but certainly not least, the extent of cis regulation by methylation sites appear limited in comparison to SNPs overall. Considering our inability in disen- tangling methylation and SNP effects, this is one area that certainly demands more careful study. Nonetheless, it is perhaps safe to say that epigenomic data would unlikely be the magical missing link in missing heritability of traits in Arabidopsis, and perhaps to plants in general. This is perhaps not surprising, given that the genome should be a much more robust long term storage for adaptive signals. If this view is correct, what we are observing as effects of methylation should come as a combination of two processes: residual signal of a temporary solution that is being taken over by DNA in the long term, and regulatory functions demanded of much more rapidly changing environmental cue than DNA could handle. The first 38 process is of course controlled by selective pressure and mutation rate, while the second also depend on rate and prevalence of such changing external forces. These factors are very difficult to quantitatively gauge at the moment, but the apparent effect size of DNA methylation place a theoretical ceiling on those processes. Here we are ignoring potentially large contributions from trans loci. This reflect our lack of confidence in these associations, and warrant further investigation. In terms of mapping for new regulatory targets, epigenome-wide association mapping certainly presents a large number of new potential loci, many of them uncorrelatedevenwithnearbygeneticsignals. Evenamongperfectlylinkedmethy- lation polymorphisms, it can be expected that some of them to be the causal loci. In practice, we do detect a significant number of ’real’ cis associations, many of them undetectable by SNPs even with many times the sample size. Therefore, epigenomic data does expand our ability to find new links in regulatory network of traits, though again, this gain in power appear limited to a small fraction of traits. Looking to the future, our work presented many exciting future research direc- tions. In terms of utilizing existing data, a detailed analysis of potential splice forms may reveal much about the role that DNA methylation play in them. Many of the methylation-regulated genes also deserve further scrutiny, both in the details of regulation mechanisms such as transcriptional factor bonding or chromatin restructuring, as well as aspects of their biological function which might merit epigenetic regulation. In terms of additional data collection, a simple but certainly costly approach is to simply gather many more samples and thus reduce the impact of noise. This is particularly important for fishing out real large trans signals, which is profoundly difficult among the apparently large number of false associations in our current sample. A more homogeneous sampling of single tissues would also greatly reduce 39 difficulties associated in treating mixtures. More ambitiously, longitudinal data especially of expression levels would be very useful in dissecting the precise mech- anism of their regulation, even allowing us to map regulation of variance or the genetic/epigenetic canalization, which is implausible in the present dataset. Even a deeper bisulfite sequencing would allow much greater ability to examine DNA methylation at the base-pair level without compromising quality. On the theoretical fronts, many results here opens avenues of research into underlying models of adaptation. Indeed, aspects of epigenetic regulation and adaptation are already under intense investigation from theoretical direction, and both examples found in our study as well as the estimate of effect sizes are perhaps of great interest in working out the underlying evolutionary model. An example of research that greatly benefit from high quality genomic data and epigenomic data used would be investigating potential genomic signatures that could come from epigenetic regulation and comparing with loci detected in our work. One of the most exciting potential extension would be a combination of these directions in working out the precise model of epigenetic regulation at the local level. This is actually one promising aspect of having base pair resolution methy- lation in the first place: that we have the capacity to find out exactly how DNA methylation status on single loci combine and translate into biologically meaning- ful signals. Unfortunately, the sheer dimensionality of this problem renders our efforts mostly fruitless at this point. It is however not difficult to imagine that a combination of high quality data, biochemical insights and careful statistics would present us with a likely model. In conclusion, we are but presenting one more piece of puzzle in solving the complex system of trait determination. Our contribution is not only the various patterns that we observed, but also the statistical approaches to problem, and 40 perhaps most importantly a greatly enlarged knowledge border that other pieces of the great puzzle can fit in. 41 Chapter 3 Causal analysis in structured data 3.1 Introduction It has long been the cautionary tale among students of statistics that correla- tion does not equate causality. However, this distinction is not greatly emphasized in linkage studies and more recently most GWAS studies. The reason is simple: in those studies, one of the factors, the DNA sequences of the individuals, are expected to be unaffected by any other factors in the time frame of the studies. Therefore, any correlation can only come from two sources: direct/indirect causa- tion by DNA sequence, or shared common/confounding factor. The latter of which includes the frequent case in which the associated variant is genetically linked to thecausalvariant. Anotherfamousexampleisassociationstudyofchopstickskills: in a casual GWAS, all private SNPs in the Chinese/Japanese population would be associated with the skill, but the reason for association is shared correlation with environment: culture. Correlation with external factors are much less likely in study of model organ- isms in controlled laboratory setting. Nonetheless, the nature of DNA methylation itself means the general causal interpretation in association study with them as either as a ’genotype’ or a ’phenotype’ is misleading, even if studies yields plenti- ful results. For instance, correlation between a DNA methylation polymorphism and a nearby SNP can be simply a result of shared population history, much like linkage disequilibrium between nearby SNPs. On the other hand, the possibility of 42 reverse causation in the case of association with traits is certainly real, and is the target of our analysis here. In fact, there are many evidence that DNA methyla- tion is regulated by expression. In face of these evidence, we must wonder whether DNA methylation matter for traits, even as an intermediate carrier/aggregator of information. Luckily, with the availability of DNA sequence data, we can address this difficult causality question through formal statistical framework based on direct acyclic graphs. Our lack of direct ways of intervention is compensated by strong background knowledge and corresponding assumptions. Similar approaches have already been successfully developed and deployed in linkage studies (Schadt et al., 2005) and more recently in a study of human (Gutierrez-Arcelus et al., 2013). Unlike these situations, we face much more serious confounding in the form of background genetic factors, or ’population structure’ as more commonly known. These factors can often explain 80% of variance in a DMR (Dubin et al.), and also 25% of variance in expression level (companion project, unpublished). Ignor- ing these factors will both create false positives in which we mistook these shared genetic factors as a causal link, and also lower our power to derive the direction of causality in the case no strong cis SNP QTLs are found to serve as instrument variable. Based on previous experience in association mapping, we modeled these effect as a sum of iid additive small effects. 3.2 Population causality model 3.2.1 Qualitative description Our methods try to infer causal relationship between three variables: genetic factors (G), or more precisely DNA sequence, and two traits A and B. In the 43 context of the thesis, A is DNA methylation (M) and B is the trait (Y), in this context mostly referring to expression traits. Among these, it is assumed that genetic factors (G) are not subject to influences from the other factors. This is not true in general due to effects of selection and mutation rates on DNA sequences, but these effects are negligible for data collected in this study which are at most several generations apart. Therefore, we only consider the four possible scenarios below: Model I: (Methylation mediating) In this model, DNA methylation act as the inter- mediate in determining the trait Y. In another word, as long as we know the DNA methylation status of the corresponding loci, we do not need DNA sequence information to predict the trait to the best of our abilities. Model II: (Trait mediating) In this model, DNA methylation is actively and only deter- mined by the trait. Model III: (Independent) In this model, both DNA methylation and trait are controlled by DNA sequence directly. Model IV: (Full/Null) In this model, both DNA methylation and trait are controlled by DNA sequence directly. However, DNA sequence is not sufficient to explain all the association between M and Y. In this case, it is impossible to deter- mine statistically whether it’s M influencing Y, Y influencing M or they are mutually influencing each other. These models are illustrated in Figure 3.1. To determine the best fitting model among these and therefore infer direction of causality between M and Y, we calcu- late the likelihood of each suitably controlled for their respective model complexity. Including genetic background as a random effect 44 Inferring direction of causation between complex traits by selection on mixed effect models Suppose the two traits of interests are measured by A and B, while the genotype is G. We consider the following possible scenario, and compare between them by Bayesian information criteria (BIC). Where the conditional distributions are: (showing only B due to symmetry) Dazhe Meng 1,5 , Manu Dubin 1 , Pei Zhang 1,5 , Bjarni Vilhájlmsson 2 , Oliver Stegle 3 , Richard Clark 4 , Magnus Nordborg 1,5 1 Gregor Mendel Institute, Vienna, Austria; 2 Aarhus University , Aarhus, Denmark; 3 The European Bioinformatics Institute, Cambridge, UK; 4 University of Utah, Salt Lake City , US; 5 University of Southern California, Los Angeles, US We often study traits that are not easily manipulated, preventing - tively independent variables in most designs; Including them in causal graphical models enables us to make various causal inter- pretations. 1 In this study we try to infer direction of causal relationship be- tween two traits. One indicator of trait A causally regulating trait B is conditional independence as illustrated above: B is as- sociated with genotype G, but only through A. 2 this can be problematic when there is: • • Genetic confounding when using structured sample We attempt to address these issues through extension of existing methods as outlined below. 1 Mendelian randomization is sometimes used to denote all such studies 2 Th is strategy (Schadt 2005) has been successfully employed in many studies Th e distinction in our model is the inclusion of polygenic terms that attempt to capture the collective eff ect of numerous loci with undetectable eff ects individually. As an example, consider a hypothetical experiment in which we want to determine causal relationship between height and length of shadow a person cast. Th e genotypes can explain a signifi cant portion of height variation with a sum of random eff ects 1 , but would not provide further information on shadow length given height. Th is help us infer height shadow length. Like in the height example, we can model such polygenic eff ects as a the summation of iid random eff ect of individual SNPs. Al- ternatively, top principle components can be used. In this post- er, only the mixed eff ect model is presented. 1 (Y ang 2010) estimates this to be 45% of variance To investigate performance of our model, we performed a sim- ulation study by simulating pairs of traits using Arabidopsis SNP dataset. Three sets of simulations are run: • Trait B is A plus another Gaussian error. • Model III: Trait A and B are both sum of 0-1 shared large ef- 10000, and a Gaussian error. • Model IV: Similar to III, but one of the traits also contain a linear term of the other. The results are summarized as belows: When the underlying model is I (II) or III, we can deduce the correct model most of the time. However, when the real model is IV , it is very hard to capture. We have devised a method to derive direction of causality in a general population sample. It is worthwhile to note that there are many limitations and areas of improvement: • Sample structural correlation with environmental factors can infl ate polygenic term • Error terms for the traits are likely not normally distributed. Generalization is possible e.g. kernel based error modeling. • Th e requirement for full mediation is stringent, and would fail when there are suffi ciently strong genetic factor(s) controlling each trait separately. • Diffi cult to derive a test of signifi cance Despite these concerns, with careful designs, our method can see applications to study of complex traits, especially in epide- miology and social studies. A B Loci of small eff ects Acknowledgment involved in data collection, preprocessing and dis- cussion. This work is primarily supported by the Center of Excellence in Genomic Sciences Award. Scan for pdf Site has contact for me; or just use: dazhemen@usc.edu Arabidopsis As part of our study of epigenetic contributions to missing heritability , we en- counter the question of whether gene body methylation are regulating gene expression they are associated with. Unlike the better understood transposable element methylation, plant gene body methylation primarily occur away from gene ends and has been shown to only occur on actively transcribed genes. This naturally give rise to the hypothesis that gene body methylation occur due to transcriptional activity . We applied our model selection method on DNA methylation and expression data from a collection of 135 Swedish Arabidopsis thali- ana. Here trait A is methylation or M, whilst trait B is expression or E. Only correlated DNA methyl- ation expression pairs that share genetic factors are considered. Conclusions: • Gene body methylation cannot be adequately explained by expression • Similar to methylation on TEs, gene body methylation can regulate expression • Note that although we express the relationship between SNP and methylation as causal, it can also be due to shared common factor (e.g. linkage due to shared coalescent). Best model shown with color. Most M-E pairs fi t the independent model best, and model I is generally prefered to model II Genotype G (e.g. SNPs) Trait A Trait B Trait A Trait B Random effects Zu Fixed effects Trait A Trait B Random effects Zu Fixed effects Trait A Trait B Random effects Zu Fixed effects Trait A Trait B Random effects Zu Fixed effects I: Full mediation A-B In this model all genetic through A. This corre- spond to A regulating B. II: Full mediation B-A Opposite model of I. This correspond to B causing A. III: Independent In this model genetic - ations in A and B by themselves. This can be too conservative due to IV: Complete model This is the null model: As- sociation between A and B is partially due to genetic factors. Cannot infer direc- tional relationship between A and B. More cases fi t model I than model II, for all types of methylation Figure 3.1: Possible causal structure between genotype and two traits. Our main innovation lies in interpretation and representation of the genetic factor G: whereas in past studies it is often taken as the most significantly asso- ciated SNP, we include an additional term: the combined effect of many small loci represented as random terms in linear mixed model or equivalently additional covariance term that depend on the genetic relatedness matrix K. As a hypothetical example to illustrate the rationale behind this, consider a fictional experiment in which we collect human height and length of their shadow (see Figure 3.2). We want to derive the correct causal relationship that height determine shadow length from data alone with help of genomic data in a situation similar to Mendelian randomization. In this case, the length of shadow is almost completely determined by height, but can also be affected by external factors like time of the day. We know from numerous GWA studies that individual loci contributeverylittletoheritabilityofheight, whilethecombinedeffectsfromsmall loci amount to 40% of heritability (Yang et al., 2010). Thus, although we would be hard pressed to produce a good single loci to use like an instrument variable, the combined polygenic effect can fulfill the role. 45 Inferring direction of causation between complex traits by selection on mixed effect models Suppose the two traits of interests are measured by A and B, while the genotype is G. We consider the following possible scenario, and compare between them by Bayesian information criteria (BIC). Where the conditional distributions are: (showing only B due to symmetry) Dazhe Meng 1,5 , Manu Dubin 1 , Pei Zhang 1,5 , Bjarni Vilhájlmsson 2 , Oliver Stegle 3 , Richard Clark 4 , Magnus Nordborg 1,5 1 Gregor Mendel Institute, Vienna, Austria; 2 Aarhus University , Aarhus, Denmark; 3 The European Bioinformatics Institute, Cambridge, UK; 4 University of Utah, Salt Lake City , US; 5 University of Southern California, Los Angeles, US We often study traits that are not easily manipulated, preventing - tively independent variables in most designs; Including them in causal graphical models enables us to make various causal inter- pretations. 1 In this study we try to infer direction of causal relationship be- tween two traits. One indicator of trait A causally regulating trait B is conditional independence as illustrated above: B is as- sociated with genotype G, but only through A. 2 this can be problematic when there is: • • Genetic confounding when using structured sample We attempt to address these issues through extension of existing methods as outlined below. 1 Mendelian randomization is sometimes used to denote all such studies 2 Th is strategy (Schadt 2005) has been successfully employed in many studies Th e distinction in our model is the inclusion of polygenic terms that attempt to capture the collective eff ect of numerous loci with undetectable eff ects individually. As an example, consider a hypothetical experiment in which we want to determine causal relationship between height and length of shadow a person cast. Th e genotypes can explain a signifi cant portion of height variation with a sum of random eff ects 1 , but would not provide further information on shadow length given height. Th is help us infer height shadow length. Like in the height example, we can model such polygenic eff ects as a the summation of iid random eff ect of individual SNPs. Al- ternatively, top principle components can be used. In this post- er, only the mixed eff ect model is presented. 1 (Y ang 2010) estimates this to be 45% of variance To investigate performance of our model, we performed a sim- ulation study by simulating pairs of traits using Arabidopsis SNP dataset. Three sets of simulations are run: • Trait B is A plus another Gaussian error. • Model III: Trait A and B are both sum of 0-1 shared large ef- 10000, and a Gaussian error. • Model IV: Similar to III, but one of the traits also contain a linear term of the other. The results are summarized as belows: When the underlying model is I (II) or III, we can deduce the correct model most of the time. However, when the real model is IV , it is very hard to capture. We have devised a method to derive direction of causality in a general population sample. It is worthwhile to note that there are many limitations and areas of improvement: • Sample structural correlation with environmental factors can infl ate polygenic term • Error terms for the traits are likely not normally distributed. Generalization is possible e.g. kernel based error modeling. • Th e requirement for full mediation is stringent, and would fail when there are suffi ciently strong genetic factor(s) controlling each trait separately. • Diffi cult to derive a test of signifi cance Despite these concerns, with careful designs, our method can see applications to study of complex traits, especially in epide- miology and social studies. A B Loci of small eff ects Acknowledgment involved in data collection, preprocessing and dis- cussion. This work is primarily supported by the Center of Excellence in Genomic Sciences Award. Scan for pdf Site has contact for me; or just use: dazhemen@usc.edu Arabidopsis As part of our study of epigenetic contributions to missing heritability , we en- counter the question of whether gene body methylation are regulating gene expression they are associated with. Unlike the better understood transposable element methylation, plant gene body methylation primarily occur away from gene ends and has been shown to only occur on actively transcribed genes. This naturally give rise to the hypothesis that gene body methylation occur due to transcriptional activity . We applied our model selection method on DNA methylation and expression data from a collection of 135 Swedish Arabidopsis thali- ana. Here trait A is methylation or M, whilst trait B is expression or E. Only correlated DNA methyl- ation expression pairs that share genetic factors are considered. Conclusions: • Gene body methylation cannot be adequately explained by expression • Similar to methylation on TEs, gene body methylation can regulate expression • Note that although we express the relationship between SNP and methylation as causal, it can also be due to shared common factor (e.g. linkage due to shared coalescent). Best model shown with color. Most M-E pairs fi t the independent model best, and model I is generally prefered to model II Genotype G (e.g. SNPs) Trait A Trait B Trait A Trait B Random effects Zu Fixed effects Trait A Trait B Random effects Zu Fixed effects Trait A Trait B Random effects Zu Fixed effects Trait A Trait B Random effects Zu Fixed effects I: Full mediation A-B In this model all genetic through A. This corre- spond to A regulating B. II: Full mediation B-A Opposite model of I. This correspond to B causing A. III: Independent In this model genetic - ations in A and B by themselves. This can be too conservative due to IV: Complete model This is the null model: As- sociation between A and B is partially due to genetic factors. Cannot infer direc- tional relationship between A and B. More cases fi t model I than model II, for all types of methylation Figure 3.2: Illustrative example of causal mediating testing of height vs shadow There is many ways to model the combined polygenic effect, including a fixed ’membership vector’ after dividing samples into distinct groups, or using the first few principle component of the genetic relatedness matrix. We choose here to implement it as a summation of random effects, similar to what’s commonly used in mixed model association mapping. The detailed mathematical formula is given in the next subsection. 3.2.2 Mathematical formulation Here our goal is to distinguish between the four potential models considered. We base our selection on well established information criteria that are themselves based on maximum likelihood of the respective models. Suppose the genotype and 46 DNA methylation data areG andM respectively, and we observed for a particular gene its expression levels E. The likelihood for the 4 models considered are: Model I:L(M 1 |g,m,e) =p(e|m)p(m|g)p(g) Model II:L(M 2 |g,m,e) =p(m|e)p(e|g)p(g) Model III:L(M 3 |g,m,e) =p(e|g)p(m|g)p(g) Model IV:L(M 4 |g,m,e) =p(e|m,g)p(m|g)p(g) =p(m|e,g)p(e|g)p(g) (3.1) IncaseswhereM isconfinedtooneobservationperindividuallikeexpressionlevels, the relationship between E and M are considered linear with Gaussian noise: p(e|m)∼N (μ E +mβ M , E )| m p(m|e)∼N (μ M +eβ E , M )| e (3.2) Whereas the distribution involving genotype would contain both fixed terms for large effects as well as random terms for genetic background: p(e|g)∼N (μ E +Xβ X,M ,K S σ 2 S,E + E )| g p(m|g)∼N (μ M +Xβ X,E ,K S σ 2 S,M + M )| g p(e|m,g)∼N (μ E +Xβ X,M +mβ M ,K S σ 2 S,E + E )| m,g p(m|e,g)∼N (μ M +Xβ X,E +eβ E ,K S σ 2 S,M + M )| e,g (3.3) The highest likelihood/loglikelihood estimate of Equation 3.2 can be found eas- ily by least square estimate of βs, while those of Equation 3.3 can be found by numerical method implemented in mixmogam or LIMIX. Since we are interested in the likelihood itself, the ’ML’ criteria is chosen as the optimization criteria for 47 the latter. After we obtain the maximum log likelihood for each component, we simply add them up to obtain the overall log likelihood of the models minus a constant. BIC is chosen as our selection criteria, corresponding to the fact that we already have all potential models chosen. 3.2.3 Tests of significance One interesting corollary from assuming Gaussian distribution for the variables came from the fact that in these situations, conditional independence reduce to conditional correlation of 0. Therefore, we can obtain a very rough conditional independencetestbycomparingtheincompletemodels(ModelI,IIandIII)against Model IV. In practice, this allow us to obtain an estimate on significance of the causal relationship from Model I and II. Here we used a difference in log likelihood of 3 as a threshold for significance. 3.3 Simulation studies In order to demonstrate the ability of our method to distinguish the different cases of causality, we simulated data based on the 4 different models and assessed the performance to deduce the correct models. For genotype data G, we used the same SNP dataset as we used in the analysis. These data are constructed sequentially by calling on some combination of the following functions. 1. G→ M;G→ E: methylation and expression data is generated as a sum- mation of additive effects from two types of SNPs. Firstly, a large number of loci with i.i.d. Gaussian distributed effects to model genetic background. Secondly, a single loci with Gaussian/exponential distributed effect for the 48 large cis effect. A random Gaussian error is then added to the phenotype in order to fix the trait heritability to certain levels, in our case 0.5. 2. M→E;E→M: In the case that methylation and expression data directly influence the other in models 1 and 2, they are linked by a linear relationship with Gaussian noise. 3. (G,M)→ E; (G,E)→ M: Similar as G→ M,G→ E, but with an addi- tional fixed term for M/E Since models I and II are symmetric, we simulated data according to models I, III and IV only. The resulting datasets are then fed into the models, and the best fit is recorded. Inferring direction of causation between complex traits by selection on mixed effect models Suppose the two traits of interests are measured by A and B, while the genotype is G. We consider the following possible scenario, and compare between them by Bayesian information criteria (BIC). Where the conditional distributions are: (showing only B due to symmetry) Dazhe Meng 1,5 , Manu Dubin 1 , Pei Zhang 1,5 , Bjarni Vilhájlmsson 2 , Oliver Stegle 3 , Richard Clark 4 , Magnus Nordborg 1,5 1 Gregor Mendel Institute, Vienna, Austria; 2 Aarhus University , Aarhus, Denmark; 3 The European Bioinformatics Institute, Cambridge, UK; 4 University of Utah, Salt Lake City , US; 5 University of Southern California, Los Angeles, US We often study traits that are not easily manipulated, preventing - tively independent variables in most designs; Including them in causal graphical models enables us to make various causal inter- pretations. 1 In this study we try to infer direction of causal relationship be- tween two traits. One indicator of trait A causally regulating trait B is conditional independence as illustrated above: B is as- sociated with genotype G, but only through A. 2 this can be problematic when there is: • • Genetic confounding when using structured sample We attempt to address these issues through extension of existing methods as outlined below. 1 Mendelian randomization is sometimes used to denote all such studies 2 Th is strategy (Schadt 2005) has been successfully employed in many studies Th e distinction in our model is the inclusion of polygenic terms that attempt to capture the collective eff ect of numerous loci with undetectable eff ects individually. As an example, consider a hypothetical experiment in which we want to determine causal relationship between height and length of shadow a person cast. Th e genotypes can explain a signifi cant portion of height variation with a sum of random eff ects 1 , but would not provide further information on shadow length given height. Th is help us infer height shadow length. Like in the height example, we can model such polygenic eff ects as a the summation of iid random eff ect of individual SNPs. Al- ternatively, top principle components can be used. In this post- er, only the mixed eff ect model is presented. 1 (Y ang 2010) estimates this to be 45% of variance To investigate performance of our model, we performed a sim- ulation study by simulating pairs of traits using Arabidopsis SNP dataset. Three sets of simulations are run: • Trait B is A plus another Gaussian error. • Model III: Trait A and B are both sum of 0-1 shared large ef- 10000, and a Gaussian error. • Model IV: Similar to III, but one of the traits also contain a linear term of the other. The results are summarized as belows: When the underlying model is I (II) or III, we can deduce the correct model most of the time. However, when the real model is IV , it is very hard to capture. We have devised a method to derive direction of causality in a general population sample. It is worthwhile to note that there are many limitations and areas of improvement: • Sample structural correlation with environmental factors can infl ate polygenic term • Error terms for the traits are likely not normally distributed. Generalization is possible e.g. kernel based error modeling. • Th e requirement for full mediation is stringent, and would fail when there are suffi ciently strong genetic factor(s) controlling each trait separately. • Diffi cult to derive a test of signifi cance Despite these concerns, with careful designs, our method can see applications to study of complex traits, especially in epide- miology and social studies. A B Loci of small eff ects Acknowledgment involved in data collection, preprocessing and dis- cussion. This work is primarily supported by the Center of Excellence in Genomic Sciences Award. Scan for pdf Site has contact for me; or just use: dazhemen@usc.edu Arabidopsis As part of our study of epigenetic contributions to missing heritability , we en- counter the question of whether gene body methylation are regulating gene expression they are associated with. Unlike the better understood transposable element methylation, plant gene body methylation primarily occur away from gene ends and has been shown to only occur on actively transcribed genes. This naturally give rise to the hypothesis that gene body methylation occur due to transcriptional activity . We applied our model selection method on DNA methylation and expression data from a collection of 135 Swedish Arabidopsis thali- ana. Here trait A is methylation or M, whilst trait B is expression or E. Only correlated DNA methyl- ation expression pairs that share genetic factors are considered. Conclusions: • Gene body methylation cannot be adequately explained by expression • Similar to methylation on TEs, gene body methylation can regulate expression • Note that although we express the relationship between SNP and methylation as causal, it can also be due to shared common factor (e.g. linkage due to shared coalescent). Best model shown with color. Most M-E pairs fi t the independent model best, and model I is generally prefered to model II Genotype G (e.g. SNPs) Trait A Trait B Trait A Trait B Random effects Zu Fixed effects Trait A Trait B Random effects Zu Fixed effects Trait A Trait B Random effects Zu Fixed effects Trait A Trait B Random effects Zu Fixed effects I: Full mediation A-B In this model all genetic through A. This corre- spond to A regulating B. II: Full mediation B-A Opposite model of I. This correspond to B causing A. III: Independent In this model genetic - ations in A and B by themselves. This can be too conservative due to IV: Complete model This is the null model: As- sociation between A and B is partially due to genetic factors. Cannot infer direc- tional relationship between A and B. More cases fi t model I than model II, for all types of methylation Figure 3.3: Testing causality model with simulation. Model I and II are symmetric and therefore only Model I is simulated. We simulated 5000 pairs of methylation and expression data for each of the three schema. For models 1 (2), the more crucial misclassification rate into the reverse model (for 1 and 2) or orthogonal models (misclassifying as model 3 for 1 and 2, misclassifying as model 1 or 2 for model 3) is low. Unfortunately, simulated data under model 4 shows a high misclassification rate, probably showing difficulty 49 in deriving this complicated situations at our sample size. Yet, there is no bias against misclassification into reverse model (i.e. model 1 or 2 to each other) in all simulations. This means that no matter the underlying model being more SNP involved (models 3 and 4), we should not expect to see more of model 1 or 2 unless it is indeed the case that methylation or expression is regulating one another. 3.4 Application to real data 3.4.1 Data preparation For checking such causal mediation models, we need DNA methyla- tion/expression pairs only correlated to each other, but also show evidence of being regulated by genetic factors. We thus prepared the following set of data for use in causal analysis. First we correlated expression level with all cis methylation bins that is within the gene or within 2000bp of the transcription start site. If any bin is correlated with a r 2 greater than 0.2, the pair of expression and the highest correlated bin is added to a testing pool. From this pool, mixed model GWAS is perform on each pair of expression/methylation, and any pair that does not 1. share associated SNP at the Bonferroni threshold for trans SNPs (defined as >50kb away) or at 10 −5 for cis SNPs, or 2. have the genetic kinship component explain at least 5% of variance, are filtered out. This result in a final set of 297 pairs of data. 3.4.2 DNA methylation more often appear causal We calculated the likelihood for each of the 4 models on our dataset and com- pared between them based onBIC. In most ofthe cases model 3is favored, showing independent regulation of methylation and expression by SNPs. However, there is 50 evidently a stronger fit in most cases for model 1 than 2, showing the likely infer- ence that there is more cases that DNA methylation is regulating expression level. This trend is present for both GBM and TEM, indicating that at least for some genes the gene body methylation is not the passive product of their expression. Perhaps more surprisingly, there is some evidence that it is expression regulating TEM in some pairs. This probably due to difficulty in distinguishing models where the true underlying model is the full model (model 4), or perhaps this hints at less- known biological mechanisms for TEM regulation. Results mostly robust against distribution assumption Figure 3.4: Comparing Number of cases where either DNA methylation or expres- sion is fully mediating the other One assumption that will greatly affect validity of the models is the presence of outliers in the variables involved. To address this issues, we performed the same model selection after transforming both methylation and expression data with a 51 800 700 600 500 400 300 200 100 0 G>M>E 800 700 600 500 400 300 200 100 0 G>E>M BIC comparison between models for CG_only methylation (a) CG gene body methylation 800 700 600 500 400 300 200 100 0 G>M>E 800 700 600 500 400 300 200 100 0 G>E>M BIC comparison between models for CG_C methylation (b) CG in TE-like methylation 800 700 600 500 400 300 200 100 0 G>M>E 800 700 600 500 400 300 200 100 0 G>E>M BIC comparison between models for CHG_C methylation (c) CHG methylation 800 700 600 500 400 300 200 100 0 G>M>E 800 700 600 500 400 300 200 100 0 G>E>M BIC comparison between models for CHH_C methylation (d) CHH methylation Figure 3.5: -BIC comparison for model I and II. The colors indicate the best model by BIC: blue: model 1; green: model 2; red: model 3; black: model 4; white: did not converge. Negative BIC is plotted to aid visual identification of the better model, so that for any gene plotted the closer axis correspond to the better fit model rank-inverse Gaussian transformation. By this we mean that we sort the respective data in order of magnitude, then replacing each value by one drawn from Gaussian distribution based on its rank. This essentially render the test largely distribution- free, but of course would reduce the magnitude of any real relationship. 52 800 700 600 500 400 300 200 100 0 G>M>E 800 700 600 500 400 300 200 100 0 G>E>M BIC comparison between models for CG_only methylation (a) CG gene body methylation 800 700 600 500 400 300 200 100 0 G>M>E 800 700 600 500 400 300 200 100 0 G>E>M BIC comparison between models for CG_C methylation (b) CG in TE-like methylation 800 700 600 500 400 300 200 100 0 G>M>E 800 700 600 500 400 300 200 100 0 G>E>M BIC comparison between models for CHG_C methylation (c) CHG methylation 800 700 600 500 400 300 200 100 0 G>M>E 800 700 600 500 400 300 200 100 0 G>E>M BIC comparison between models for CHH_C methylation (d) CHH methylation Figure 3.6: -BIC comparison for model I and II after inverse normal transform. The colors indicate the best model by BIC: blue: model 1; green: model 2; red: model 3; black: model 4; white: did not converge. Negative BIC is plotted to aid visual identification of the better model, so that for any gene plotted the closer axis correspond to the better fit model As expected, the likelihood of all four models are reduced after transforma- tion. Notably the number of pairs falling under model 3 is drastically reduced. This probably reflect the fact that the coefficient of relationship between SNP and expression/methylation is dramatically reduced by the transformation, and in terms of likelihood cannot overcome the additional penalty imposed by BIC. Still, 53 the trend stand that model 1 is favored over model 2, showing that our inference is not affected by the normality assumption. 3.5 Discussion We have demonstrated with statistical evidence that at least some DNA methy- lation do not simply act as passive receiver of information and can have an active influence on expression of a nearby gene. This regulatory function does not only occur to the well known mechanism of gene silencing by methylation of cytosines in all context at the transcriptional start sites, but seem to extend to methylation of CpG sites well within gene bodies. This is a crucial result, since it was previously often considered that gene body CpG methylation might be simply a by-product of expression, or probably serve a function in preventing aberrant transcription or regulating splice forms. This conclusion not only justify association studies on DNA methylation data of all context in searching for causal loci, but also hints at thus far unclear mechanisms on how these epigenetic signals regulates. There are several concerns about our method that need more careful considera- tion. One major problem is the lack of an accurate significance test. As seen in the results section, in many of the cases the distinction between the models are small, and it can be argued that these would simply be due to uncertainty in estimation of respective likelihood. Ideally one should obtain an empirical distribution of like- lihood and base a significance test on this, however efforts in this direction was not fruitful and we instead present only an approximate test based on a simple log likelihood difference threshold of 3. Fortunately, even with very strong noise, we would not expect there to be a large bias in the direction of causality thus inferred between methylation and expression. With this argument, we should not expect 54 by chance to observe the significantly higher number of cases where methylation seem to be causal to expression. Furthermore, the likelihood difference we obtain from some pairs of data is extremely large, indicating a very small chance for the opposite relationship to be true. Another problem is the assumption that noise in both M and Y follow Gaus- sian distribution. This is mostly a matter of convenience, especially for easily interpretable relationship with conditional independence. We tried to take this concern into consideration by applying inverse Gaussian transformation on the data before testing the causal models and the results are encouragingly consistent with the direction of causality. Nonetheless, this approach is not fail-safe and tend to remove too much real signal. Other than the direction of causality between methylation and expressions, one must be cautious to interpret other parts of the causal model. Despite a direct link between G and M in three of the models being compared, it must be emphasized again that this relationship need not necessarily be causal and can represent a shared common cause. Biologically, though it is reasonable to assume no effect on DNA from DNA methylation in a short time frame, it is still possible that both DNA and much of DNA methylation variation are simply inherited, and therefore shaped by population history of the samples. To visualize this, consider two linked SNPs A and B. Although A cannot directly determine the nucleotide at B, they are nonetheless associated and their association would be indistinguishable from causal in our models. It is also a very fine line between the independent model (model 3) and the models which suggest a direct relationship between methylation and expression (model 1, 2 and 4). As evident from applying the same model on transformed data, thedifferenceinlikelihooddependontheestimatedmagnitudeofeffectofthe 55 genetic factor. In fact, since we choose the large effect SNP and also fit the genetic background effect after we choose the pairs of methylation and expression data, we are biasing ourselves towards overfitting the genetic effect. Taking a step back, we have known from simpler correlation analysis in previous chapter that TEM is negatively correlated with expression, whereas GBM is positively correlated. In either case, if expression or methylation are simply independently regulated, it is implausible that the direction of regulation would mostly be opposite (for TEM) or the same (for GBM). Of course, this argument only holds if there is two common mechanisms behind relationship between DNA methylation and expression, one for each of GBM and TEM. So is there? Unfortunately our data and method cannot examine this at the moment. Nonetheless, it might be the simplest explanation underlying most of the relationships we detected. Although it is not difficult to imagine completely separate mechanisms managing expression and methylation in different genes, the question is evolutionary advantages. For instance, it is much simpler to have a single primary RNA polymerase and regulate its behavior by ’add-ons’ rather than having many different polymerases. In the case of DNA methylation, it would certainly be easier to imagine a common regulatory mechanism which is in turn fine tuned. Such a system will both be easier to arise from evolution, and also consistent with other well known biological processes. Morepractically,therearenumerousavenuesinwhichourunderstandingcanbe takenforward. Intermsofmethodology, weshouldexploreotherwaystomodelthe effects, forinstanceusingPCAtomodelpolygeniceffect. Itisalsorathersimplistic to use only a single methylation bin as a representative of overall methylation pattern. Even a rough sensitivity analysis would also help us understanding the robustness of our conclusions. All of the approaches would benefit enormously 56 from a greatly expanded sample size, longitudinal data (especially data from early embryogenesis) and data under strong environmental perturbations. More importantly, further experiments should be devised to check the causal relationships inferred by more direction intervention. Although it is still difficult to manipulate methylation status without upsetting the entire system, it is much easier to control expression levels through interference/silencing/promoter change, and we should be able to check reverse causation for many loci found (model 2) with carefully designed setups. On the other hand, we are indeed beginning to find milder trans regulators of methylation levels (Dubin et al.), and it would not be unreasonable to believe that we can obtain methylation changes without knocking out crucial pathways. In summary, we attempted a very challenging statistical problem and learned rough but significant result with the help of previous understanding and careful modelling. Our result shed light on a controversial topic, and would help guide further research in similar directions. Remarkably, everything we learnt here came from a natural population, hinting at the importance of processes involved in adaptation at large. 57 Chapter 4 Mapping epistasis 4.1 Introduction The biological background to the study of epistasis is discussed in chapter 1, where I discussed the outlook of detecting epistasis in a natural population. In this introduction I focus on the general statistical methods for finding epistasis in a haploid or highly inbred population sample, where dominance effects can be ignored. We begin with Fisher’s definition of statistical epistasis as any deviation from the additive model; In a haploid organism or mostly homozygous organism like Arabidopsis thaliana, the additive model means Y =α i +α j + (4.1) Where α i and α j are allelic effects of loci i and j respectively, while is again the random effect due to environment. This definition form a basis for statistical tests for epistasis. For instance, in a GWAS setting, we have the genotype information at n loci X 1 ..X n . The additive model is thus: Y =μ + X i X i β i + (4.2) 58 Whereβ i denote effect of loci i. With epistasis, the model can simply be expanded the more terms: Y =μ + X i X i β i + X i<j X i X j β ij + X i<j<k X i X j X k β ijk +... + (4.3) Whereβ ij denote the two-way epistatic effect between loci i and j, β ijk denote the the three-way epistatic effect between loci i, j and k and so on. On the other hand, the product termsX i X j ,X i X j X k and so on represent the ’interaction’ categories, and for biallelic SNPs these are simply the product of corresponding SNP vectors. Statistical test of epistasis thus involves comparing the epistatic model 4.3 with the null model 4.2. Most studies focus on detection of SNP by SNP epistatic effect; much alike marginal GWAS, this involves a test for each combination of loci, and the number of tests quickly grew to computationally intractable numbers, not to mention extreme sample size required to reach statistical significance with multiple test correction. Recent efforts in method development thus mainly focus on ways to address the computational burden, for example by only testing pairs among SNPs with a minimum level of marginal effect. Computational implementations capitalizing on massive parallel power of GPU computing also increased the . We show our own effort to detect such SNP by SNP epistasis in Section 4.3. Nonetheless, it is not clear whether these marginal epistatic effect are the only interesting effects we can find, given that combinations of effects can be easier to detect in either effect size or multiple testing correction. In the subsequent sections, I discuss detection of biological relevant combinations of some of the marginal effects, from groups of SNPs in genes, effect of one SNP in different genetic background, to a model that attempt to capture summation of all epistatic effect across the genome. I will discuss situations under which these models will 59 be interesting, some mathematical formulation I attempted as well as potential caveats and problems. 4.2 Brief description of dataset used Throughout this chapter the various methods are applied to the following datasets: 250K SNP data: SNP data called from Affymetric SNP array on world- wide accessions of Arabidopsis thaliana. After rigorous filtering, 1307 accessions with 214k SNPs remain in the dataset, and any missing data are imputed with NPUTE(Roberts et al., 2007). This dataset is fully described in (Horton et al., 2012). 107 phenotype data: This is the 107 phenotypes published in (Atwell et al., 2010) and covers a wide range of physiological traits. These number of samples measure for each range from below 100 to 196. Flowering time at 10 ◦ C: Or ’FT10’ data, a greatly expanded dataset of sim- ilar named trait measured as in the 107 phenotype data, containing XX accession. This data is collected by Susanne Atwell and as of now unpublished. 4.3 SNP by SNP epistasis Thedetectionofmarginalepistaticeffectfollowsfromthemajorityofliterature, in that a model consisting of both additive and interactive effect from a pair of SNPs are compared with the null in which only additive effects are included. Of course, this epistatic GWAS suffer from the same problem of population structure confounding as marginal GWAS. Therefore we simply extended existing mixed model to correct for these structural confounding, in a similar way as a recent 60 exhaustive epistatic search on human traits (Lippert et al., 2013). Specifically, for testing a pair of SNPs i and j, we compared the following models: H add :Y∼N (μ +X i β i +X j β j ,K S σ 2 S +σ 2 ) H epi :Y∼N (μ +X i β i +X j β j +X i X j β ij ,K S σ 2 S +σ 2 ) (4.4) In order to speed up this test, we deployed the same heuristic used in EMMAX and P3D, and fixed the ratio between σ 2 S and σ 2 to the level ˆ δ 0 in the null model without any additive effect: H null :Y∼N (μ,K S σ 2 S +σ 2 ) (4.5) ˆ δ 0 = ˆ σ 2 ˆ σ 2 S H null (4.6) This is followed by spectral decomposition of the kinship matrix K S : K S =QΛQ T (4.7) Where Λ is the diagonal matrix of the eigenvalues of K. The phenotype, SNPs and their epistatic vector are then transformed: ˜ X = (Λ +δI) − 1 2 Q T X (4.8) so that any association can be tested with simple f-test on the transformed vectors. We implemented this based on REML estimate of variance components in mix- mogam (package by Bjarni) and applied it on the 107 phenotype data. To reduce multiple testing, only SNPs that has a marginal association p-value in the top 5% 61 tail are included. Despite this effort, none of any pair of SNPs are found to be significant at the Bonferroni corrected α of 0.05. (a) QQ plot in normal scale (b) QQ plot in log scale Figure 4.1: QQ plot for SNP by SNP epistatic effect tests. Three sample pheno- typesareshown: floweringtimeat10 ◦ C(FT10), Probabilityofsurvival(PropSurv) and expression level of gene FRI (FRI expr) We turn to examine the distribution of p-value for epistatic effects. Interest- ingly, there seem to be a slight inflation of p-value for many of the tested pheno- types, especiallyattailend(Forexample, floweringtimeat10 ◦ CandProbabilityof survival in Figure 4.1). This seem to indicate firstly that there are many numerous small epistatic effects that cannot be explained by the global background addi- tive kinship. At the tail end, there appear to be some larger epistatic effects that simply require larger sample size to reach significance. 4.4 Gene by gene epistasis Although it is ideal to detect causal pairs of variants interacting with each other, in some applications we are primarily interested in the identities of genes interacting with each other. For instance, when transgenic resource are available, 62 we might only need to isolate a groups genes so that they can be studied in alter- native designs utilizing non-natural mutations. One innate advantage to test only pairs of genes is the great reduction in number of test and multiple testing burden. However, these models can also be more powerful in a variety of biological settings. Notably, it is very likely that interaction between a pair of genes is not limited to one single pair of SNPs. Analysis of marginal GWAS studies indicate many peak SNPs can be due to synthetic association(Zhang et al., 2014; Platt et al., 2010), and there are evidences of local allelic heterogeneity. This possibility have spawned a variety of methods to address it, such as burden test(Ionita-Laza et al., 2013) and local variance component test(Listgarten et al., 2013; Ionita-Laza et al., 2013). With similar logic, we can expect epistasis between genes to be often the result of a combination of effects between many pairs of SNPs, each small in size and difficult to detect individually. By methods that attempt to add up effect from individual pairs and detecting the total, we can obtain greater power to find such epistatic genes. On the other hand, we might be missing important polymorphism in dataset we study. Whereas a correlated tag SNPs might be detectable for marginal additive GWAS, epistasis is much more demanding to find. This problem is considerably alleviated with cheaper sequencing, but still persist for harder to find polymor- phisms such as structural variants. In situations like this, a gene based search can also be more powerful. There are some previous attempts to detect epistasis at gene by gene level, for instance by functional model or more specifically PCA (Zhang et al., 2014). In this section, we instead devised and applied two approaches, the first based on haplotypes and the second an extension of the ’local kinship’ method for GWAS. 63 4.4.1 Haplotype based model ThehaplotypebasedgenebygeneinteractiontestisanextensionoftheSNPby SNP test. In marginal additive tests, single marker tests are usually favored over haplotype based method, since it is usually expected that some SNPs coincide with haplotype classification perfectly. However, in an epistatic setting, using haplotypes as a single factor instead of multiple SNPs for each gene greatly reduce multiple testing burden. Suppose there are t i haplotypes for gene i. Instead of a single vector, the design matrix X i instead consists of t i − 1 column vectors, each corresponding to a incidence vector of a particular haplotype. The interaction design matrix X i X j thus consist of at most (t i − 1)(t j − 1) columns of incidence vectors of haplotype 64 combinations. As an example, consider a gene with 4 haplotypes against another with 2 haplotypes: X i = 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 ,X j = 1 0 1 1 0 0 1 0 0 0 ,)X i X j = 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 (4.9) These design matrices are transformed in the same manner as in Equation 4.8, and are also tested with ANOVA f-test exactly like snp by snp interaction. The challenge with this approach lies in the specification of the haplotypes. Here a variety of approaches can be used, from simple hierarchical clustering, models involving recombination breakpoints like fastPHASE (Scheet & Stephens, 2006) or those derived from probably ancestral recombination graphs (ARGs) at each gene(Rasmussen et al., 2014). Among these, the ARG based method make the most sense biologically, as functionally similar haplotypes tend to arise from the same phylogenetic branch. On the other hand, ARG itself is very difficult to 65 determine(Rasmussenetal.,2014). Itisalsoadifficultchoicetochoosethenumber of haplotypes. Whereas a stringent number will merge biologically interesting hap- lotypes with others and diminish potential signals, a relaxed criteria will produce too many factors to test. Here objective criteria such as minimum substitutions between haplotypes or percentage of genetic variation captured by haplotypes can be used to prevent subjective or arbitrary choice. 4.4.2 Variance component model Another completely different approach is based on constructing an epistatic variance component and compare variance component model with and without the epistatic component. In a way, this can be thought of as an extension of the ’local-global kinship’ model, variants of which are applied and still studied exten- sively [cite Quan, unpublished]. Aside from the obvious benefit of multiple testing reduction, there are also evidence that with a combined estimate of effect, we alle- viate the problem of allelic heterogeneity and can obtain higher power to detect additive allelic effect [cite Quan, unpublished]. This increase in power probably applies to epistatic interactions as well. Indeed, if epistatic effects are small and numerous as expected from theory, we would expect higher power by combining these effects. Westartwiththeconstructionoftheepistaticvariancecomponentbetweentwo genes A and B, with n A and n B SNPs respectively. Similar to the construction of the genetic kinship in linear mixed models, the effect size of each SNP as well as interaction between pairs of SNPs are considered to be random: Y =μ + n A X i X Ai u i + n B X j X Bj v j | {z } local gene effects + n A X i n B X j X Ai X Bj w ij | {z } epistatic effects + n X k X k z i | {z } background + (4.10) 66 where u i ∼N (0,λ Ai σ 2 A ),v j ∼N (0,λ Bi σ 2 B ),w ij ∼N (0,λ ij σ 2 AB ) (4.11) are independently Gaussian distributed. A scaling factorλ is included to represent different assumptions about the effect of minor allele frequency: in some studies it is assumed that all SNPs irrespective of allele frequency contribute the same level of heritability (λ Ai = 1/(p Ai (1−p Ai )) where p Ai is the allele frequency of SNP i at gene A) whereas in others there is no scaling (λ≡ 1). Here we proceed the latter case. Similar to the the procedure in section 1.4.1, the variance of Y can be decomposed into: Var(Y ) =K A σ 2 A +K B σ 2 B +K AB σ 2 AB +K S σ 2 S +Iσ 2 (4.12) To derive the variance of Y, we can again follow the same steps as in sec- tion 1.4.1, and compute the complete n A n B by m matrix of interaction terms X AB = (x Ai,k x Bj,k ) and then calculate K AB asX 0 AB X AB . However, a much sim- pler alternative is available when we carefully consider the terms and the exact form of matrix to compute. Now, we carefully examine the interaction term in Equation 4.10. For individual k, the interaction term is: y k =... + n A X i n B X j x Ai,k x Bj,k w ij +... (4.13) 67 Thus, since w ij are independent of each other, the covariance between individuals k and l due to interaction is: Cov(y k ,y l ) =... + n A X i n B X j Cov(x Ai,k x Bj,k w ij ,x Ai,l x Bj,l w ij ) +... (4.14) =... +σ 2 AB n A X i n B X j x Ai,k x Bj,k x Ai,l x Bj,l +... (4.15) =... +σ 2 AB n A X i x Ai,k x Ai,l n B X j x Bj,k x Bj,l +... (4.16) Note that n A X i x Ai,k x Ai,l is exactly the (k,l) entry of the kinship matrixK A , (K A ) k,l ; similarly for B In another word, the interaction kinship matrixK AB can be easily calculated as the element wise multiplication ofK A andK B : K AB =K A K B (4.17) Untilnow, wehavenotdiscussedtheexactencodingofSNPentriesX andwhat exactly the resulting kinship matrices K A , K B and K AB mean. As mentioned earlier, different scaling of the effect sizes with λ would produce different form of K and thus variance estimates, although the difference would not be great. We would also get difference result if we assume that the two alleles for each SNP produceexactlyoppositeeffect, orthattheyhaveindependentrandomeffects. The combination of no scaling and independent effect would result in what is called the identity by state (IBS) kinship matrix, which has been shown to perform slightly better in earlier GWAS. 68 Thus, we proceed to derive a IBS form of the local and epistatic kinship. Here, we assume that we are working with a haploid/homozygous sample and all poly- morphic sites are biallelic. The IBS kinship matrices for global background effects as well as for genes are similar to those shown in Section. Briefly, we encode the two alleles of each SNP as 1 and -1, and: K IBS A = X 0 A X A 2n A + 1 2 (4.18) For the epistatic effect, ideally, we would assign 4 difference random variables forthe4differentallelecombinations(1,1), (1,-1), (-1,1)and(-1,-1). Forsimplicity, weassumethatthecombinations(1,1)and(-1,-1)producethesameepistaticeffect, similarly for (-1,1) and (1,-1). This is so that only the product matters, and we can use the result in Equation 4.17 to ease computation: K IBS AB = X 0 A X A 2n A X 0 B X B 2n B + 1 2 (4.19) Thus, to test whether there are interaction between two gene, we compare the following two models: H add :Y∼N (μ,K A σ 2 A +K B σ 2 B +K S σ 2 S +σ 2 ) (4.20) H epi :Y∼N (μ,K A σ 2 A +K B σ 2 B +K AB σ 2 AB +K S σ 2 S +σ 2 ) (4.21) A likelihood ratio tests are used to obtain a p-value for significance. Alternatively, the relative contribution for ML effect of interaction, ˆ σ 2 AB /( ˆ σ 2 A + ˆ σ 2 B + ˆ σ 2 AB + ˆ σ 2 S + ˆ σ 2 ) is also potentially an interesting statistic to derive a test from. These tests are implemented in python with LIMIX as the backend for estima- tion of variance component parameters. As mentioned, the raw kinship matrix for 69 each gene is precomputed, and the epistatic kinship matrix is computed at run- time by element wise multiplication. Unfortunately, even with this speed up, a major constraint is the speed at which the variance components can be estimated. For a test involving 1000 individuals, a single pair of genes took on average 28s to run. This creates a computational problem for exhaustive search over entire genomes. Additional speed up, such as using only a subset of SNPs for kinship matrix as in FaSTLMM(Listgarten et al., 2012) have the potential to further speed up the test and bring it to current GWAS standard. 4.4.3 Simulation of variance component model We simulated phenotype data based on the 250K SNP data in order to assess the validity of the variance component models and performance compared with SNP by SNP tests. Many of the upper limit of parameters in this simulation are arbitrary and intended to cover a huge range of biologically feasible genetic architecture. We first randomly choose two annotated genes in TAIR10, and isolate the SNPs within 5kb of ends of each gene. If there are too few SNPs (<5) located near either gene, we start over. A random fraction (∼U(0, 1)) of SNPs near each gene are then assigned a phenotypic effect drawn fromN (0, 1), and the summed effect are scaled to a target gene heritability randomly ranging from 0 to 50% variance explained. From the rest of the genome, we randomly choose another 10000 SNPs, and again assign each a phenotypic effect (∼N (0, 1)). The summed effect is again scaled to a target global effect heritability between 0 to 50%. These 3 target heritability’s (1 for each gene, and 1 global) are chosen so that their sum does not exceed a 70% total. 70 We then consider pairs of SNPs between the two genes. A predetermined fraction of them are chosen to carry an effect, again sampled fromN (0, 1). The sum of epistatic effects are scaled by predetermined target heritability between 0% and 20%. Finally, another effect is drawn fromN (0, 1) and scaled to the 1 minus the sum of target heritability’s to simulate noise term . (a) Power for different effect size (b) Power for different causal fraction Figure 4.2: Simulation result of gene by gene variance component model compared with SNP by SNP model. The threshold for positive is chosen so that in the the control set (0 simulated epistatic effect), the type I error rate is 5%. a) Power at different total effect size for epistasis simulated, when 10% of the pairs of SNPs are assigned non-zero effect size. b) Power when different fractions of pairs of SNPs are assigned non-zero effect size, while keeping the total epistatic effect size constant. In both panels, the power for SNP by SNP model at Bonferroni significance threshold of α=5% are shown for reference. The results for this simulation shown in Figure 4.2, each sub panel with 5000 runs for target epistatic heritability of 0% as control and another 10000 for target epistatic heritability >0%. The result indicate that when it is indeed the case that many pairs of SNPs have an epistatic effect, the variance component epistatic model have more power to detect interaction at gene by gene level. The likelihood ratio test for the variance component also consistently perform better than a test 71 based on estimated size of effect. Interestingly, the sparsity of epistatic effect, or fraction of interacting pairs of SNPs, have no discernible influence on power of the variance component test. The overwhelmingly important parameter of the underlying genetic architecture seem to be total effect size of epistasis. This is not the case for SNP by SNP test, where the dilution of effect reduces power. Figure 4.3: Single-pair simulation result of gene by gene epistatic models. In another set of simulations, instead of a fraction of SNP at each gene or a fraction of pairs, we simulated data such that exactly one SNP from each gene and the interaction term between them have an effect. Their effects are again scaled by the same target heritabilities. This setup mirrors the model in SNP by SNP studies. The result are shown in figure 4.3. In this case, power of the SNP by SNP test is superior, but the variance component model still perform reasonable well. These results together indicate that the variance component model is a useful model to 72 4.4.4 Application to real data We applied both the haplotype based method and the variance component model to detect epistasis at gene by gene level in the full FT10 dataset. For the former, we used hierarchical clustering to determine haplotypes, where the number of clusters is decided by 5% average substitution between clusters. For the variance component model, we tested only a subset of genes which are significant by the local-global kinship test, the results of which are shown as a manhattan plot in figure 4.4. 0 10 20 0 10 0 10 20 0 10 0 10 20 0 2 4 6 8 −log(p−value) Figure 4.4: FT10 SNP marginal GWAS and local-global kinship GWAS The results for the variance component model is shown in figure 4.5. 4.5 SNP by genetic background epistasis Another interesting combination of epistatic effect is the summation of pairs of effect between one particular SNP with the rest of the genome. This sum present the changes in effect size of a SNP in different genetic background. In terms of GWAS, the presence of these effects means that we would have different power to detect a causal loci present in different populations at the same allele frequency. In terms of evolution, the same allele would face different selective pressure when pitted in similar environment. As of now, there have been several people trying out this type of model. Notably, a very similar variance component model is presented during Biology of Genome 2014. 73 Figure 4.5: FT10 Gene by gene epistasis detected by variance component model. The axes are genomic positions of interacting genes in Mb. In red/top left are shown pairs significant by likelihood ratio test, while those in blue/bottom right are pairs where the estimated epistatic variance components explain more than 5% of variance 4.5.1 PCA based model Principle component analysis have long been used to capture signals of popu- lation structure. Recently, it has been demonstrated to perform well at correcting 74 structural effects in GWAS. The basic idea behind this is simple: individuals that are more closely related will randomly share more alleles. When we perform eigen- decomposition of the population covariance matrix (which is equal to the IBD kinship matrix), the eigenvectors associated with top eigenvalues, the top prin- ciple components, will explain the greatest amount of genetic variation between individuals, and tends to separate major population groups from each other. An example of this can be seen in figure X, showing top two principle components plotted against each other in the 1001 genomes project. Sincetheprinciplecomponentsareproxiesofpopulationdifferentiationtosome degree, they also represent the genetic backgrounds associated with these popula- tions. Hence, interaction terms produced by product of the principle components with a SNP are proxies to the locus by genetic background epistatic effect. Thus, taking n p top principle components v 1 ,v 2 ,...,v np , we can model phenotypes as: Y =μ + X i β i |{z} marginal SNP effect + np X j X i v j β ij | {z } SNP by background epistasis + X k X k z k | {z } background effect + (4.22) This can then be compared with the null model with f-test: Y =μ +X i β i + X k X k z k + (4.23) Again, we can use the EMMAX assumption and transformation in Equation 4.8 on the epistatic terms X i v j to speed up our test. 75 4.5.2 Variance component model Similar to Section 4.4.2, a variance component term can be constructed for the SNP by background epistasis by summing up random effects from all SNPs pairing with the SNP in question: Y =μ + X i β i |{z} marginal SNP effect + n X j X i X j u ij | {z } SNP by background epistasis + X k X k z k | {z } background effect + (4.24) , where u ij ∼N (0,σ 2 iS ) are i.i.d. Gaussian random effects. Thus Y∼N (μ +X i β i ,K iS σ 2 iS +K S σ 2 S +σ 2 ) (4.25) , whereK iS can be computed with relative ease by brute force calculation ofX i X first. 4.5.3 Application to real data Both the PCA based and variance component model are applied to the FT10 data. Here I present results from a list of candidate SNPs found by marginal additive GWAS on the same data. As shown in Figure 4.6, both models identify some of the candidate SNPs as having epistatic effect with the genetic background, although the effect size apparently varies. In both cases, the contribution from such epistatic effect is minor compared with rest of the global additive effects. 76 (a) Variance component model (b) PCA model Figure4.6: CandidateSNPbygeneticbackgroundepistasis, showingtheestimated effect size 77 4.6 Global sum of epistasis Visscher showed that the sum of numerous small additive effect can explain a substantialamountofheritabilityinquantitativetraitslikehumanheight. Weused similar method and showed again that the same is true for traits like expression in Arabdopsis thaliana. A very straightforward extension of these work is to try to showthesamethingforepistaticeffects: combiningallpossiblepairsofinteractions into a variance component and show whether these contribute to a detectable amount of heritability. This idea is certainly not new, and have been long used in animal breeding, and more recently in GWAS settings to Drosophila and rice. Here we use the same idea and apply it to Arabidopsis data. This model can be seen as the extension of restricted variance component models in previous sections to all potential pairs. 4.6.1 Global epistatic model We start from the very general epistatic model as in Equation 4.3, but here ignoring higher-order interactions and treat the effect sizes βs as random effects instead. Thus: Y =μ + X i X i z i | {z } Sum of additive effects + X i<j X i X j u ij | {z } Sum of 2-way epistatic effects + (4.26) , whereu ij ∼N (0,σ 2 E 2 )Themajorchallengewiththisapproachisthecomputation of the epistatic kinship matrix term. Again, we make use of similar trick as in Section 4.4.2, and describe how to adapt it to calculate the global 2-way epistatic kinship matrixK E 2 . 78 Consider the covariance between phenotypes of individuals k and l due to two- way epistasis: Cov(y k ,y l ) epi = n X i n X j>i Cov(x i,k x j,k u ij ,x i,l x j,l u ij ) =σ 2 E 2 n X i n X j>i x i,k x j,k x i,l x j,l = 1 2 σ 2 E 2 ( n X i x i,k x i,l n X j x j,k x j,l − n X i x i,k x i,k x i,l x i,l ) = 1 2 σ 2 E 2 (( n X i x i,k x i,l ) 2 − n X i x i,k x i,l ) (4.27) Therefore, the epistatic kinship matrix can be easily calculated as: K E 2 = 1 2 (K S K S −K S ) (4.28) The IBS versions for this when SNPs are encoded (1,-1) is: K IBS E 2 = K E 2 n 2 + 1 2 (4.29) These can the be plugged into model comparison between the purely additive and epistatic model: H add :Y∼N (μ,K S σ 2 S +σ 2 ) (4.30) H epi Y∼N (μ,K S σ 2 S +K E 2 σ 2 E 2 +σ 2 ) (4.31) Again, a likelihood ratio test can be used as a test of significance. 79 4.6.2 Simulation A simulation study is conducted in similar manner to the gene by gene case in Section 4.4.3. Again, a number of SNPs and a number of pairs of SNPs are chosen and assigned effects sampled fromN (0, 1). The sum of the additive and 2-way epistatic effects are scaled to designated heritability levels, with the rest filled with a noise term also sampled fromN (0, 1) and scaled accordingly. Unfortunately, the results from this simulation is disappointing. Although it detects epistasis readily in the absence of simulated additive effect, it have little power to distinguish between pure additive or epistatic traits, and assign effect size poorly. This is very likely due to the fact that the global epistatic term here is heavily correlated with the global additive term, and variance component models cannot distinguish effects between them effectively. 4.6.3 Application to real data Similar to expection from simulation, when we applied the whole genome epistatic model to quantitative traits in the 107 phenotypes data, the result appear inconsistent. For most traits, no heritability is assigned to the epistatic term at all, while for others the epistatic term completely eclipses the additive term. These results are nonetheless presented in Figure 4.7. 80 Figure 4.7: Global epistatic variance component model on real phenotypes 81 4.7 Extension to higher order interactions One major advantage of variance component models to capture epistasis is the ease in which it can be extended to higher order interactions. Here we extend the model we used for gene by gene interaction (Section 4.4.2) as an example. Suppose we are interested in the three way interaction among genes A, B and C: Y =μ + n A X i A X Ai A u A i A + n B X i B X Bi B u B i B + n C X i C X Ci C u C i C | {z } local gene effects + n A X i A n B X i B X Ai A X Bi B v i A i B + n C X i C n B X i B X Ci C X Bi B v i C i B + n A X i A n C X i C X Ai A X Ci C v i A i C | {z } 2-way epistatic effects + n A X i A n B X i B n C X i C X Ai A X Bi B X Ci C w i A i B i C | {z } 3-way epistatic effects + n X k X k z i | {z } background + (4.32) , where u, v, w and are again Gaussian random variables. Since other terms are already calculated before, here I focus on the 3-way epistatic covariance only: Cov 3-way (y j ,y l ) = n A X i A n B X i B n C X i C Cov(x A:j,i A x B:j,i B x C:j,i C w i A i B i C ,x A:l,i A x B:l,i B x C:l,i C w i A i B i C ) =σ 2 ABC n A X i A n B X i B n C X i C (x A:j,i A x B:j,i B x C:j,i C x A:l,i A x B:l,i B x C:l,i C ) =σ 2 ABC n A X i A (x A:j,i A x A:l,i A ) n B X i B (x B:j,i B x B:l,i B ) n C X i C (x C:j,i C x C:l,i C ) (4.33) Thus, K ABC =K A K B K C (4.34) These can then be easily plugged into a very large variance component model. However, perhaps the only situation we want to find interactions higher than order 82 2 is in the global model, which have been found to perform inadequately even at order 2. 4.8 Long range linkage disequilibrium A major departure from the rest of the models considered in this chapter, selectively non-neutral epistasis can also be detected by carefully observing poly- morphism patterns across the genome. The basis behind this approach is the fact that when some combination of alleles at two distinct sites convey a selective advantage, they would be overrepresented in the population. Similarly, selectively disadvantageous combinations will be found less than expected randomly. An extreme example of this is synthetic lethality, when no individuals harboring par- ticularcombinationswillsurvive. Thisapproachhasbeenusedtogreatsuccessina sample of Drosophila melanogastor(Corbett-Detig et al., 2013), where the authors found 22 pairs of incompatible SNPs. A similar approach has been investigated in a Swedish sample of Arabidopsis thaliana in our genomes paper(Long et al., 2013a). Here I describe details of that work. 4.8.1 Correction for structural inflation of r 2 A serious problem with this in natural population is population structure: even SNPs on different chromosomes are more likely than random to be correlated due to shared population history. One way to visualize this is to consider the coalescent process at two distant loci. Due to separation of the populations, the ARGs at these loci are more likely than random to be similar. As a result, mutations that randomly occurred on branches of the ARGs are also more likely than random to be correlated. 83 Bjarni devised a method to reduce such effects (described in supplementary notes to (Long et al., 2013b)), and a similar approach has also been independently described in (Mangin et al., 2011). The idea behind this is similar to that in popu- lation structure correction in GWAS. Briefly, we treat the SNPs as independently sampled from a distribution with covariance matrix V =X 0 X, which is inciden- tally the IBD kinship matrix. Then, we transform the SNPs X i into pseudo-SNPs X 0 i =V − 1 2 X i . The r 2 between transformed SNPs: r 2 i,j = Cov(X 0 i ,X 0 j ) Var(X 0 i ) − 1 2 Var(X 0 j ) − 1 2 (4.35) are expected to be same in distribution as those calculated from uncorrelated individuals. The effectiveness of this procedure can be seen in panel a) of figure 4.9. 4.8.2 Expected number of significant pairs Even when population structure is fully accounted for, we might expect some significant r 2 values simply by chance. In this subsection, I estimate an expected number of such pairs in a random sample with the same number of individuals and similar SNP allele frequency distribution. The null distribution for the probability that a SNP pair is in LD can be obtained when assuming that every SNPs are independent and identically dis- tributed vectors with some distribution that is permutable in the order of the accessions. Given two binary SNPs X and Y, let x and y be their minor allele count respectively, and N the total number of accessions sampled, which in our case is 180. In addition, let the number of samples having both the minor allele of 84 X and Y be u. Then, the linkage disequilibirium between X and Y is a function of x, y and u: r 2 (X,Y ) =r 2 (x,y,u) = (Nu−xy) 2 (Nx−x 2 )(Ny−y 2 ) Thus, given a r 2 lower threshold ξ (0.8 in our case), if u is unknown: P (r 2 (X,Y )≥ξ|x,y) = X u P (r 2 (x,y,u)≥ξ)P (u|x,y) = X u I [ξ,1] (r 2 (x,y,u)) x u N−x y−u N y (Here, I A (E) is the indicator function of E∈A) Then, the overall probability that random X and Y are significantly correlated, if we known the distribution of minor allele count of each SNP: P (r 2 (X,Y )≥ξ) = P (r 2 (X,Y )≥ξ|x,y)P (x)P (y) Here, we can use the estimated distribution of minor allele count from our dataset forP (x). AswelimitoursearchingofpairstoSNPswithminorallelecountgreater or equal to 15, we can simply shift the distribution so that P (x) = 0∀x< 15. Given M SNPs X 1 , X 2 , ... X M , the expected number of LRLD pairs L can then be easily estimated : E(L) =E(number of pairs with r 2 ≥ξ) = E( X 1≤i≤M i<j≤M I [ξ,1] (r 2 (X i ,X j )) = X 1≤i≤M i<j≤M E(I [ξ,1] (r 2 (X i ,X j )) = M 2 ! P (r 2 (X,Y )≥ξ) 85 The actual distribution of the number of LRLD pairs L is very complex since the pairs are not independent. Thus, the significance level of an observed number of LRLD pairs L obs is difficult to obtain. However, a very rough but still useful bound can be obtained since L≥ 0, by noticing E(L) = P (L≥L obs )E(L|L≥L obs ) +P (L<L obs )E(L|L<L obs ) ≥ P (L≥L obs )L obs +P (L<L obs )(0) =P (L≥L obs )L obs Thus, we have an upper bound for the p-value: P (L≥L obs )≤ E(L) L obs 4.8.3 Filtering of sequencing artifacts We discovered a total of XX significant pairs of long-range LD with corrected r 2 >0.8. However, it is reasonable to expect that some of these are spurious cor- relation due to systematic errors in SNP calling. One particularly nasty problem is when SNPs called are not where they are on the reference genome. To ensure SNPs called are not from similar sequences, we Blasted 75bp sequences surround- ing SNPs against the reference genome. Any SNP that hit two or more regions even relatively loosely (’homology’>90%) are filtered out. Unknown structural variants, such as duplications not present on reference genomes, present an even greater challenge. To find out whether some of the LRLD pairs are actually close together in at least some of the Swedish samples, we again aligned short sequences (both 75bp and 150bp) surrounding each SNP, this time to libraries of De-novo assembled scaffolds. If sequences flanking both 86 Remai ni ng f r ac t i on of SNP af t er each f i l t er i ng s t ep Fr ac t i on of SNPs f i l t er ed by each st ep 12510/ 20483 SNPs ( 717205/ 787237 Pai r s ) Fi l t er i ng of pai r s c r eat ed by popul at i on s t r uct ur e t hr ough SNP t r ans f or mat i on 7109/ 7973 SNPs ( 67523/ 70032 Pai r s) Fi l t er i ng of pai r s pot ent i al l y c r eat ed by mobi l e el ement s t hr ough TAI R10 TE annot at i ons 21769/ 42252 SNPs ( 2053531/ 2840768 Pai r s ) Fi l t er i ng of pai r s c r eat ed by mi s mapped r eads by mul t i pl e bl ast hi t s i n r ef er enc e genome Figure 4.8: Filtering of high linkage disequilibirium SNP pairs SNP in a pair map near each other to the same scaffold, we consider them linked in the corresponding accession. We find 17 pairs which could be caused by structural variations, either with strong (linked in more than 10 accessions using 150bp) or weak (other cases) evi- dence. We certainly do not expect this method to be very sensitive due to length limitations of De novo assembled scaffolds in general. By a very rough estimate, we should be able to pick up at least 40% of SV-caused LRLD pairs. Note that this estimate does not take into consideration the possibility that many SNP pairs would never be found on the same scaffolds, so the actual percentage could be much lower. As an alternative method, we also checked for paired end reads that map near the SNPs in LRLD pairs. This method did not enable us to detect additional pairs that might be caused by structural changes. In conclusion, we are reasonably sure that SVs would not be a major explanation for LRLD. 87 4.8.4 Signs of selection One disadvantage with detecting epistatis this way is the lack of experimental data to support a difference in fitness. However, many types of circumstantial evidence exist, and enrichment of these are also proof that combinations of some of these pairs of SNPs are indeed under selection. We first checked LRLD pairs against published data on pairwise protein interaction. However, we detect no significance in those sets, indicating no evidence that pairs of SNPs are those interacting proteins. Next, we looked for indirect evidence in the form of published local adaptation associated loci (Hancock et al., 2011). We checked whether SNPs involved in LRLD are close (5kb/10kb) to the peak SNPs (defined as p-value of 0.001) for all 13 climatic traits, and calculated the p-value based on permutations, by rotating SNP positions along the chromosomes which maintains LD structure between the SNPs. We detected signal of enrichment in two of the traits, relative humidity and minimum precipitation, after correcting for multiple testing with Bonferroni significant level of (0.05/13). Finally, there is an overlap of some of LRLD SNP locations with selective sweep peaks, although apparently the selective sweep event is not the one causing the LRLD in question. 4.9 Discussion We have devised and applied a variety of models that seek to detect epistasis at the SNP by SNP, gene by gene, SNP by background and global level. Although we captures some hints of epistasis, nothing very substantial are detected, in accor- dance with some theoretical predictions and other attempts at find epistasis in natural populations. 88 b a LR-LD likely caused by SV Remaining pairs with R = 1 2 Remaining pairs 2 CLR Northern Sweden Precipitation associated SNPs Pairwise R 2 Figure 4.9: . Long range linkage disequilibrium of SNPs in Swedish Arabidopsis thaliana. a) Raw (upper left) and corrected (lower right) r 2 between SNPs on each chromosome. Centromeres are highlighted in gray. Note that massive LD between SNPs on different chromosomes disappear as expected after population structure correction. b) Pairs of long range LD after filtering for potentially duplicated regions and population structure effects. Also shown are some evidence of selection that are correlated, SNPs associated with precipitation levels and selective sweeps found by Sweepfinder in Northern Swedish accessions. Although it might be unsurprising to find no huge hits, it is important to sit back and reflect on what we are mapping in the first place. By Fisher’s definition, anything non-additive would show up as epistasis. For most traits, we should simplyexpectthatadditivityisaveryroughapproximation. Theabsenceofstrong epistasis seem to indicate that our simple effort to transform quantitative data to be more normal-like does indeed work well to convert traits into more additive like functional form. On the other hand, it can also indicate that we simply do not possess the sample size to study these issues. In terms of methodological development, most of the models we implemented are not new and are indeed being re-invented by many. However, our application 89 certainly shows that such models are useful in our particular setting, and continued efforts should be spent on . Forthefuture, Ibelievethatinsteadoftryingtosearchforepistasis, moreeffort should be focused at directly investigating different functional forms on additive GWAS. These findings can then by confirmed by expanded sample size and differ- ently structured real data. 90 Reference List Allen HL, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, Willer CJ, Jackson AU, Vedantam S, Raychaudhuri S, Ferreira T, Wood AR, Weyant RJ, Segrè AV, Speliotes EK, Wheeler E, Soranzo N, Park JH, Yang J, Gudbjartsson D, Heard-Costa NL, Randall JC, Qi L, Smith AV, Mägi R, Pastinen T, Liang L, Heid IM, Luan J, Thorleifsson G, Winkler TW, Goddard ME, Lo KS, Palmer C, Workalemahu T, Aulchenko YS, Johansson Å, Zillikens MC, Feitosa MF, Esko T, Johnson T, Ketkar S, Kraft P, Mangino M, Prokopenko I, Absher D, Albrecht E, Ernst F, Glazer NL, Hayward C, Hottenga JJ, Jacobs KB, Knowles JW, Kutalik Z, Monda KL, Polasek O, Preuss M, Rayner NW, Robertson NR, Steinthorsdottir V, Tyrer JP, Voight BF, Wiklund F, Xu J, Zhao JH, Nyholt DR, Pellikka N, Perola M, Perry JRB, Surakka I, Tammesoo ML, Altmaier EL, Amin N, Aspelund T, Bhangale T, Boucher G, Chasman DI, Chen C, Coin L, Cooper MN, Dixon AL, Gibson Q, Grundberg E, Hao K, Junttila MJ, Kaplan LM, Kettunen J, König IR, Kwan T, Lawrence RW, Levinson DF, Lorentzon M, McKnight B, Morris AP, Müller M, Ngwa JS, Purcell S, Rafelt S, Salem RM, Salvi E, Sanna S, Shi J, Sovio U, Thompson JR, Turchin MC, Vandenput L, Verlaan DJ, Vitart V, White CC, Ziegler A, Almgren P, Balmforth AJ, Camp- bell H, Citterio L, De Grandi A, Dominiczak A, Duan J, Elliott P, Elosua R, Eriksson JG, Freimer NB, Geus EJC, Glorioso N, Haiqing S, Hartikainen AL, Havulinna AS, Hicks AA, Hui J, Igl W, Illig T, Jula A, Kajantie E, Kilpeläinen TO, Koiranen M, Kolcic I, Koskinen S, Kovacs P, Laitinen J, Liu J, Lokki ML, Marusic A, Maschio A, Meitinger T, Mulas A, Paré G, Parker AN, Peden JF, Petersmann A, Pichler I, Pietiläinen KH, Pouta A, Ridderstråle M, Rotter JI, Sambrook JG, Sanders AR, Schmidt CO, Sinisalo J, Smit JH, Stringham HM, Walters GB, Widen E, Wild SH, Willemsen G, Zagato L, Zgaga L, Zitting P, Alavere H, Farrall M, McArdle WL, Nelis M, Peters MJ, Ripatti S, van Meurs JBJ, Aben KK, Ardlie KG, Beckmann JS, Beilby JP, Bergman RN, Bergmann S, Collins FS, Cusi D, den Heijer M, Eiriksdottir G, Gejman PV, Hall AS, Ham- sten A, Huikuri HV, Iribarren C, Kähönen M, Kaprio J, Kathiresan S, Kiemeney L, Kocher T, Launer LJ, Lehtimäki T, Melander O, Mosley Jr TH, Musk AW, Nieminen MS, O’Donnell CJ, Ohlsson C, Oostra B, Palmer LJ, Raitakari O, Ridker PM, Rioux JD, Rissanen A, Rivolta C, Schunkert H, Shuldiner AR, 91 Siscovick DS, Stumvoll M, Tönjes A, Tuomilehto J, van Ommen GJ, Viikari J, Heath AC, Martin NG, Montgomery GW, Province MA, Kayser M, Arnold AM, Atwood LD, Boerwinkle E, Chanock SJ, Deloukas P, Giege... (2010) Hun- dreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467:832–838. Anholt R, Dilda CL, Chang S, Fanara JJ (2003) The genetic architecture of odor- guided behavior in Drosophila: epistasis and the transcriptome. ... genetics . Atwell S, Huang YS, Vilhjálmsson BJ, Willems G, Horton M, Li Y, Meng D, Platt A, Tarone AM, Hu TT, Jiang R, Muliyati NW, Zhang X, Amer MA, Baxter I, Brachi B, Chory J, Dean C, Debieu M, de Meaux J, Ecker JR, Faure N, Kniskern JM, Jones JDG, Michael T, Nemri A, Roux F, Salt DE, Tang C, Todesco M, Traw MB, Weigel D, Marjoram P, Borevitz JO, Bergelson J, Nordborg M (2010) Genome-wideassociationstudyof107phenotypesinArabidopsisthalianainbred lines. Nature 465:627–631. Bateson W, Mendel G (1909) Mendel’s principles of heredity. Becker C, Hagmann J, Müller J, Koenig D, Stegle O, Borgwardt K, Weigel D (2011) Spontaneous epigenetic variation in the Arabidopsis thaliana methylome. Nature 480:245–249. Bloom JS, Ehrenreich IM, Loo WT, Lite T, Kruglyak L (2013) Finding the sources of missing heritability in a yeast cross. Nature . Boyko AR, Quignon P, Li L, Schoenebeck JJ, Degenhardt JD, Lohmueller KE, Zhao K, Brisbin A, Parker HG, vonHoldt BM, Cargill M, Auton A, Reynolds A, Elkahloun AG, Castelhano M, Mosher DS, Sutter NB, Johnson GS, Novembre J, Hubisz MJ, Siepel A, Wayne RK, Bustamante CD, Ostrander EA (2010) A Simple Genetic Architecture Underlies Morphological Variation in Dogs. PLoS biology 8:e1000451. Brown AA, Buil A, Viñuela A, Lappalainen T, Zheng HF (2014) Genetic inter- actions affecting human gene expression identified by variance association map- ping. eLife . Chan SWL, Zilberman D, Xie Z, Johansen LK, Carrington JC, Jacobsen SE (2004) RNA Silencing Genes Control de Novo DNA Methylation. Sci- ence 303:1336–1336. Cockerham CC (1954) An Extension of the Concept of Partitioning Hereditary VarianceforAnalysisofCovariancesamongRelativesWhenEpistasisIsPresent. Genetics 39:859. 92 Cokus SJ, Feng S, Zhang X, Chen Z, Merriman B (2008) Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature . Coleman-Derr D, Zilberman D (2012) Deposition of Histone Variant H2A.Z within Gene Bodies Regulates Responsive Genes. PLoS Genetics 8:e1002988. Corbett-Detig RB, Zhou J, Clark AG, Hartl DL, Ayroles JF (2013) Genetic incom- patibilities are widespread within species. Nature 504:135–137. Cortijo S, Wardenaar R, Colome-Tatche M, Gilly A, Etcheverry M, Labadie K, Caillieux E, Hospital F, Aury JM, Wincker P, Roudier F, Jansen RC, Colot V, Johannes F (2014) Mapping the Epigenetic Basis of Complex Traits 343:1145–1148. Crainiceanu CM (2008) Likelihood Ratio Testing for Zero Variance Components in Linear Mixed Models In Random Effect and Latent Variable Model Selection, pp. 3–17. Springer New York, New York, NY. Crainiceanu CM, Ruppert D (2004) Likelihood ratio tests in linear mixed models with one variance component. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66:165–185. Dubin MJ, Zhang P, Meng D, Remigereau MS, Osborne EJ, Casale FP, Drewe P, Kahles A, VilhjÃąlmsson B, Jagoda J, Irez S, Voronin V, Song Q, Long Q, RÃďtsch G, Stegle O, Clark RM, Nordborg M . Elena SF, Lenski RE (1997) Test of synergistic interactions among deleterious mutations in bacteria. Nature . Feng S, Cokus SJ, Zhang X, Chen PY, Bostick M, Goll MG, Hetzel J, Jain J, Strauss SH, Halpern ME, Ukomadu C, Sadler KC, Pradhan S, Pellegrini M, Jacobsen SE (2010) Conservation and divergence of methylation pat- terning in plants and animals. Proceedings of the National Academy of Sci- ences 107:8689–8694. Fisher RA (1919) XV.—The Correlation between Relatives on the Supposition of Mendelian Inheritance. Transactions of the royal society of Edinburgh . Goll MG, Bestor TH (2005) EUKARYOTIC CYTOSINE METHYLTRANS- FERASES. Annu Rev Biochem 74:481–514. Gutierrez-Arcelus M, Lappalainen T, Montgomery SB (2013) Passive and active DNA methylation and the interplay with genetic variation in gene regulation. Elife . 93 Hancock AM, Brachi B, Faure N, Horton MW (2011) Adaptation to climate across the Arabidopsis thaliana genome. Science . Heijmans BT, Tobi EW, Stein AD, Putter H, Blauw GJ, Susser ES, Slagboom PE, Lumey LH (2008) Persistent epigenetic differences associated with pre- natal exposure to famine in humans. Proceedings of the National Academy of Sciences 105:17046–17049. Hemani G, Knott S, Haley C (2013) An Evolutionary Perspective on Epistasis and the Missing Heritability. PLoS Genetics 9:e1003295. Hemani G, Shakhbazov K, Westra HJ, Esko T, Henders AK, McRae AF, Yang J, Gibson G, Martin NG, Metspalu A, Franke L, Montgomery GW, Visscher PM, Powell JE (2014) Detection and replication of epistasis influencing transcription in humans. Nature 508:249–253. Henderson CR, Kempthorne O, Searle SR, von Krosigk CM (1959) The Esti- mation of Environmental and Genetic Trends from Records Subject to Culling. Biometrics 15:192. Hill WG, Goddard ME, Visscher PM (2008) Data and Theory Point to Mainly Additive Genetic Variance for Complex Traits. PLoS Genetics 4:e1000008. Holliday R, Pugh JE (1975) DNA modification mechanisms and gene activity during development. Science . Horton MW, Hancock AM, Huang YS, Toomajian C, Atwell S, Auton A, Muliyati NW, Platt A, Sperone FG, Vilhjálmsson BJ, Nordborg M, Borevitz JO, Bergel- sonJ(2012) Genome-widepatternsofgeneticvariationinworldwideArabidopsis thaliana accessions from the RegMap panel 44:212–216. Hotchkiss RD (1948) The quantitative separation of purines, pyrimidines, and nucleosides by paper chromatography. Journal of Biological Chem- istry 175:315–332. Huang W, Richards S, Carbone MA, Zhu D, Anholt RRH, Ayroles JF, Duncan L, Jordan KW, Lawrence F, Magwire MM, Warner CB, Blankenburg K, Han Y, Javaid M, Jayaseelan J, Jhangiani SN, Muzny D, Ongeri F, Perales L, Wu YQ, Zhang Y, Zou X, Stone EA, Gibbs RA, Mackay TFC (2012) Epistasis dominates the genetic architecture of Drosophila quantitative traits. Proceedings of the National Academy of Sciences 109:15553–15559. Ionita-Laza I, Lee S, Makarov V, Buxbaum JD (2013) Sequence kernel association testsforthecombinedeffectofrareandcommonvariants. The American Journal of ... . 94 Jaenisch R, Bird A (2003) Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nature Genet- ics 33:245–254. Jones PA (1986) DNA methylation and cancer. Cancer research . Jones PA, Baylin SB (2007) The Epigenomics of Cancer. Cell 128:683–692. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong S (2010) Variance compo- nent model to account for sample structure in genome-wide association studies. Nature . Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, Eskin E (2008) Efficient Control of Population Structure in Model Organism Association Mapping. Genetics 178:1709–1723. Kempthorne O (1954) The Correlation between Relatives in a Random Mating Population. Proceedings of the Royal Society of London. Series B - Biological Sciences 143:103–113. Kerwin RE, Jimenez-Gomez JM, Fulop D (2011) Network quantitative trait loci mapping of circadian clock outputs identifies metabolic pathway-to-clock link- ages in Arabidopsis. The Plant Cell ... . Kim MY, Zilberman D (2014) DNA methylation as a system of plant genomic immunity. Trends in Plant Science 19:320–326. Korte A, Vilhjálmsson BJ, Segura V, Platt A, Long Q (2012) A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nature ... . Kroymann J, Mitchell-Olds T (2005) Epistasis and balanced polymorphism influ- encing complex trait variation. Nature . Li E, Bestor TH, Jaenisch R (1992) Targeted mutation of the DNA methyltrans- ferase gene results in embryonic lethality. Cell 69:915–926. Lippert C, Listgarten J, Davidson RI, Baxter J, Poon H (2013) An exhaustive epistatic SNP association analysis on expanded wellcome trust data. Scientific reports . Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D (2011) FaST linear mixed models for genome-wide association studies. Nature Meth- ods 8:833–835. 95 Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G (2009) Human DNA methylomes at base resolution show widespread epigenomic differences. nature . Listgarten J, Lippert C, Kadie CM, Davidson RI, Eskin E, Heckerman D (2012) Improved linear mixed models for genome-wide association studies 9:525–526. Listgarten J, Lippert C, Kang EY, Xiang J, Kadie CM, Heckerman D (2013) A powerful and efficient set test for genetic markers that handles confounders. Bioinformatics 29:1526–1533. Long Q, Rabanal FA, Meng D, Huber CD, Farlow A, Platzer A, Zhang Q, Vilhjálmsson BJ, Korte A, Nizhynska V, Voronin V, Korte P, Sedman L, Mandáková T, Lysak MA, Seren Ü, Hellmann I, Nordborg M (2013a) Mas- sive genomic variation and strong selection in Arabidopsis thaliana lines from Sweden. Nature Genetics 45:884–890. Long Q, Rabanal FA, Meng D, Huber CD, Farlow A, Platzer A, Zhang Q, Vilhjálmsson BJ, Korte A, Nizhynska V, Voronin V, Korte P, Sedman L, Mandáková T, Lysak MA, Seren Ü, Hellmann I, Nordborg M (2013b) Mas- sive genomic variation and strong selection in Arabidopsis thaliana lines from Sweden 45:884–890. Maki-Tanila A, Hill WG (2014) Influence of Gene Interaction on Complex Trait Variation with Multilocus Models. Genetics 198:355–367. Mangin B, Siberchicot A, Nicolas S, Doligez A, This P, Cierco-Ayrolles C (2011) Novel measures of linkage disequilibrium that correct the bias due to population structure and relatedness. Heredity 108:285–291. Mendel G (1865) Versuche über Pflanzenhybriden pp. 1–77. Nakagawa S, Schielzeth H (2013) A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolu- tion 4:133–142. Nelson RM, Pettersson ME, Carlborg Ö (2013) A century after Fisher: time for a new paradigm in quantitative genetics. Trends in Genetics 29:669–676. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR (2008) Genes mirror geography within Europe. Nature . Platt A, Vilhjálmsson BJ, Nordborg M (2010) Conditions Under Which Genome-Wide Association Studies Will be Positively Misleading. Genet- ics 186:1045–1052. 96 Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006) Principal components analysis corrects for stratification in genome-wide associ- ation studies. Nature Genetics 38:904–909. Rasmussen MD, Hubisz MJ, Gronau I, Siepel A (2014) Genome-wide inference of ancestral recombination graphs. PLoS genetics . Reik W (2007) Stability and flexibility of epigenetic gene regulation in mammalian development. Nature 447:425–432. Roberts A, McMillan L, Wang W, Parker J, Rusyn I (2007) Inferring missing genotypes in large SNP panels using fast nearest-neighbor searches over sliding windows. ... . Robinson MR, Wray NR, Visscher PM (2014) Explaining additional genetic vari- ation in complex traits. Trends in Genetics 30:124–132. Rojas D, Fischer RL, Tamaru H, Zilberman D (2012) Active DNA demethylation in plant companion cells reinforces transposon methylation in gametes. Science . Sax K (1923) The Association of Size Differences with Seed-Coat Pattern and Pigmentation in PHASEOLUS VULGARIS. Genetics 8:552. Schadt EE, Lamb J, Yang X, Zhu J, Edwards S (2005) An integrative genomics approach to infer causal associations between gene expression and disease. Nature ... . Scheet P, Stephens M (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and hap- lotypic phase. The American Journal of Human Genetics . Schmitz RJ, Schultz MD, Lewsey MG, O’Malley RC, Urich MA, Libiger O, Schork NJ, Ecker JR (2011) Transgenerational Epigenetic Instability Is a Source of Novel Methylation Variants. Science 334:369–373. Schmitz RJ, Schultz MD, Urich MA, Nery JR, Pelizzola M, Libiger O, Alix A, McCosh RB, Chen H, Schork NJ, Ecker JR (2013) Patterns of population epige- nomic diversity 495:193–198. Segura V, Vilhjálmsson BJ, Platt A, Korte A, Seren Ü, Long Q, Nordborg M (2012) An efficient multi-locus mixed-model approach for genome-wide associa- tion studies in structured populations. Nature Genetics 44:825–830. Slatkin M (2009) Epigenetic Inheritance and the Missing Heritability Problem. Genetics 182:845–850. 97 Stram DO, Lee JW (1994) Variance components testing in the longitudinal mixed effects model. Biometrics . Stroud H, Do T, Du J, Zhong X, Feng S, Johnson L, Patel DJ, Jacobsen SE (2013) Non-CG methylation patterns shape the epigenetic landscape in Ara- bidopsis 21:64–72. Suzuki S, Ono R, Narita T, Pask AJ, Shaw G, Wang C (2007) Retrotransposon silencing by DNA methylation can drive mammalian genomic imprinting. PLoS genetics . Takuno S, Gaut BS (2013) Gene body methylation is conserved between plant orthologs and is of evolutionary consequence. Proceedings of the National Academy of Sciences 110:1797–1802. Tong A, Lesage G, Bader GD, Ding H, Xu H, Xin X (2004) Global mapping of the yeast genetic interaction network. science . Vilhjálmsson BJ, Nordborg M (2012) The nature of confounding in genome-wide association studies. Nature Reviews Genetics 14:1–2. Visscher PM, Brown MA, McCarthy MI, Yang J (2012) Five Years of GWAS Discovery. The American Journal of Human Genetics 90:7–24. Waddington CH (2014) The strategy of the genes. Weigel D, Colot V (2012) Epialleles in plant evolution. Genome Biol . Wong AHC, Gottesman II, Petronis A (2005) Phenotypic differences in geneti- cally identical organisms: the epigenetic perspective. Human Molecular Genet- ics 14:R11–R18. Yang J, Manolio TA, Pasquale LR, Boerwinkle E (2011) Genome partitioning of genetic variation for complex traits using common SNPs. ... genetics . Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA,HeathAC,MartinNG,MontgomeryGW,GoddardME,VisscherPM(2010) Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42:565–569. ZemachA,KimMY,HsiehPH,Coleman-DerrD(2013) The<i>Arabidopsis</i> Nucleosome Remodeler DDM1 Allows DNA Methyltransferases to Access H1- Containing Heterochromatin. Cell . Zhang F, Boerwinkle E, Xiong M (2014) Epistasis analysis for quantitative traits by functional regression model 24:989–998. 98 Zhou X, Stephens M (2012) Genome-wide efficient mixed-model analysis for asso- ciation studies. Nature Genetics 44:821–824. 99 Appendix A Terminology Here I list a glossary of terminologies that frequently appear in this thesis, as well as default definition of mathematical symbols A.1 Glossary Natural population: Here, specifically refer to a population sample formed by random individuals, as opposed to say trios or one derived from a cross LD: Linkage disequilibrium. By default it is measured by r 2 QTL: Quantitative trait loci. gene: For the purpose of most studies, include both protein coding genes and transcribed transposable element. GBM: Gene body methylation. A form of methylation that predominantly occur on gene bodies away from ends of transcription. Only cytosines in CpG context are methylated. TEM: Transposable element like methylation. A form of methylation that typically associate with silencing of transposons. Cytosines of all different contexts (CG, CHG, CHH) are methylated. EWAS: Epigenome-wide association mapping. Just like genome-wide associa- tion mapping, but using variations in methylation data. 100 A.2 Default meaning of mathematical symbols n: Number of polymorphic sites, most often SNPs m: Number of individuals, most often Arabidopsis thaliana accession Y: Vector of phenotypes μ: Mean value of phenotypes X,X: Vectors of polymorphisms, by default SNPs, and the matrix made up with these vectors β: Effect sizes, usually considered a fixed value u,v,w,z: Effect sizes, usually considered random and normally distributed ,σ 2 : Random gaussian error and its variance K: Kinship or relatedness matrix that appear in variance from summation of i.i.d. random effects K S ,σ 2 S : Global kinship or relatedness matrix and its corresponding variance α: Probability of observation given null model is true; type I error rate when used as significance threshold for p-value 101 Appendix B More on linear mixed models In the introductory chapter I discussed the basic linear mixed model used in GWAS. Although it is very commonly used nowadays, mixed models behave very differently than fixed effect linear models. In this follow-up chapter, I discuss some additional results about them that are used in the projects. B.1 Likelihoodratiotestforlinearmixedmodels We have extensively used likelihood ratio test in this thesis to compare between linearmixedmodelswithdifferentnumberofvariancecomponent. Thedistribution oftheloglikelihoodratioforthesekindofmodelcomparisondonottypicallyfollow the normal χ 2 distribution expected from nested fixed effect models. In (Stram & Lee, 1994; Crainiceanu & Ruppert, 2004) the authors derived asymptotic distribution for comparing variance component models that differ by one term, and found it to be a mixture distribution of 1 2 χ 2 0 : 1 2 χ 2 1 . Further enhance- ment were introduced by (Crainiceanu, 2008), which showed that certain simulated distributions perform better than the asymptotic distribution. Inthisthesis, weusetheasymptoticdistributionforderivingsignificancevalues by default. As in (Crainiceanu, 2008), this can be too harsh and . Thus for many applications, we also used empirical estimation of error rate based on simulations on the null model. 102 B.2 R 2 measurement for linear mixed models Definition for R 2 is tricky for mixed models. Unlike models with only fixed effects, adding additional fixed or random effect to a linear mixed model will not always cause the best fitting error due to noise to decrease. As such, a casual definition such as unexplained variance will lead to inconsistent interpretation and generally cannot be used to compare between models. Unfortunately, there is no other statistic that has been found that fulfill the role of the traditional R 2 well. Here I shall describe marginal R 2 and conditional R 2 , two statistic that are described in (Nakagawa & Schielzeth, 2013). Although they apply to generalized linear models too, i shall continue to focus on the simpler case. Consider a mixed model that have both a number of fixed terms as well as a number of random effect terms: Y =μ + n X i X i β i + m X j Z j u j + (B.1) Where u j ∼N (0,σ 2 j ), ∼N (0,σ 2 ) The definition of marginal R 2 for a subset of fixed effects and conditional R 2 for a subset of random effects are summarized as follows: Marginal R 2 :R 2 M := ˆ σ 2 β ˆ σ 2 β + P m j ˆ σ 2 j + ˆ σ 2 Conditional R 2 :R 2 C := ˆ σ 2 β + P m j ˆ σ 2 j ˆ σ 2 β + P m j ˆ σ 2 j + ˆ σ 2 Where σ 2 β := Var( n X i X i ˆ β i ) (B.2) 103 Note that the conditional R 2 can be similarly defined for a subset S of random effects: Conditional R 2 :R 2 C (S) := ˆ σ 2 β + P j∈S ˆ σ 2 j ˆ σ 2 β + P m j ˆ σ 2 j + ˆ σ 2 (B.3) 104
Abstract (if available)
Abstract
Genome wide association mapping has become the tool of choice for dissecting the genetic basis of heritable traits. Most studies today utilizes linear mixed models, in which the genetic contribution is divided into a number of fixed additive effect and random effect terms corresponding to small contribution from all genetic variants. In this work I expand existing models and explore contribution to heritability from two sources: DNA methylation patterns and epistasis between loci in natural samples of Arabidopsis thaliana. Our results indicate contribution from both, but they do not impress in number of statistically significant loci nor effect size. On the other hand, we also derived statistical model for inferring causality in a structured sample. Our efforts showed that additive genetic components are still dominant in exploring heritability, though there is a promising future for continued exploration of other contributors.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Analysis of genomic polymorphism in Arabidopsis thaliana
PDF
Association mapping in Arabidopsis thaliana
PDF
Natural variation of Arabidopsis thaliana methylome and its impact on genome evolution
PDF
Identification and analysis of shared epigenetic changes in extraembryonic development and tumorigenesis
PDF
Comparative analysis of DNA methylation in mammals
PDF
Exploring the genetic basis of quantitative traits
PDF
Long term evolution of gene duplicates in arabidopsis polyploids
PDF
Investigating the evolution of gene networks through simulated populations
PDF
Hierarchical approaches for joint analysis of marginal summary statistics
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Identifying allele-specific DNA methylation in mammalian genomes
PDF
Functional DNA methylation changes in normal and cancer cells
PDF
Understanding protein–DNA recognition in the context of DNA methylation
PDF
Effects of chromatin regulators during carcinogenesis
PDF
Genetic and molecular insights into the genotype-phenotype relationship
PDF
Computational algorithms for studying human genetic variations -- structural variations and variable number tandem repeats
PDF
Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
PDF
Identification and fine-mapping of genetic susceptibility loci for prostate cancer and statistical methodology for multiethnic fine-mapping
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Data modeling approaches for continuous neuroimaging genetics
Asset Metadata
Creator
Meng, Dazhe
(author)
Core Title
Mapping epigenetic and epistatic components of heritability in natural population
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
10/22/2015
Defense Date
10/09/2014
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
Arabidopsis thaliana,causal analysis,DNA methylation,epigenetics,epistasis,genome wide association mapping,linear mixed model,OAI-PMH Harvest,quantitative genetics,statistical genetics
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Nordborg, Magnus (
committee chair
), Conti, David V. (
committee member
), Ralph, Peter L. (
committee member
), Sun, Fengzhu Z. (
committee member
)
Creator Email
dazhe.meng@gmail.com,dazhemen@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-556854
Unique identifier
UC11301008
Identifier
etd-MengDazhe-3368.pdf (filename),usctheses-c3-556854 (legacy record id)
Legacy Identifier
etd-MengDazhe-3368.pdf
Dmrecord
556854
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Meng, Dazhe
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
Arabidopsis thaliana
causal analysis
DNA methylation
epigenetics
epistasis
genome wide association mapping
linear mixed model
quantitative genetics
statistical genetics