Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Comparative analysis of DNA methylation in mammals
(USC Thesis Other)
Comparative analysis of DNA methylation in mammals
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
COMPARATIVE ANALYSIS OF DNA METHYLATION IN MAMMALS by Jianghan Qu A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS) December 2017 Copyright 2017 Jianghan Qu Comparative analysis of DNA methylation in mammals Jianghan Qu Abstract The mammalian methylome holds regulatory information for cell-specific gene expression. DNA methylation in sperm is among the most important factors influencing the evolution of mam- malian genomes. Yet little is known about how this system itself has evolved in the mammalian germline. Most existing comparative epigenome studies are limited to pair-wise comparisons between species, examination of genome-wide correlations, or studying species-specific changes. Deeper understanding of epigenomic evolution requires identification of conservation and changes of epi- genetic states at higher genomic resolution and annotation of epigenetic changes to specific lin- eages. Although DNA methylation can be profiled at single-CpG resolution, interspecies compar- ison of methylome is challenged by the fact that most CpG sites have diverged between species. However, DNA methylation, like other epigenetic modifications, displays strong autocorrelation along the genome, which is an important relationship to incorporate into evolutionary models, and is information that can be leveraged to overcome the challenge of divergent sites across species. We introduce a phylo-epigenetic model for methylome evolution, which accommodates the high correlation of methylation states at neighboring CpG sites during evolution, and allows for infer- ence of ancestral methylation states. To understand the evolution of sperm methylomes, we study the whole-genome single-CpG resolution DNA methylation profiles in sperm from human, chimp, gorilla, rhesus, mouse, rat and dog, with the proposed phylo-epigenetic model. We find that the hypomethylated portion of the mammalian sperm methylome has experienced an evolutionary expansion, driven both by the birth of newly hypomethylated regions and by the extension of ancestral hypomethylated regions. The extensions are over-represented, and contribute to progressive expansion of individual hypomethy- lated regions during evolution. The changes of methylation states are nonuniformly distributed ii between genomic contexts and between species: promoter regions exhibit less divergence than other regions, and rodents have accumulated more divergence than primates. The rodent-specific and the primate-specific hypomethylated regions are enriched for binding sites of similar transcrip- tion factors, and are associated with genes of similar functions in early embryonic development. Together, we provide an evolutionary modeling strategy and computational methods for com- parative epigenomics. With the proposed approaches, we characterize evolutionary changes in mammalian sperm methylomes at high spatial and temporal resolution. Our analyses reveal the global trend and lineage-specific features of expanding DNA hypomethylation during evolution, and provide insights to how evolution of the epigenome may contribute to species-specific fine- tuning of conserved mammalian developmental programs. Applying this phylo-epigenetic model to more types of epigenetic modifications will shed light on their functional roles, and expand our knowledge about the evolution of the mammalian gene regulation network. iii Tomyteachers,colleagues,andespeciallytomyparents iv Acknowledgments I would like to thank my mentor Dr. Andrew Smith, who first introduced me to the field of epige- netics, and offered patient guidance and kind support during my graduate studies. His constructive criticism, wise advice and inspiring discussions with me were essential for my progress. This work would not have been possible without the contribution of my collaborators Drs. Emily Hodges and Antoine Molaro, the guidance and advice from Drs. Gregory Hannon, Matthew Dean and Pascal Gagneux, and the generous assistance from Drs. Beckie Williams and Joseph Hacia. I am grateful to my proposal and thesis committee Drs. Matthew Dean, Fengzhu Sun, Caleb Finch, Peter Calabrese and Xianghong Jasmine Zhou, for their attention and discussion regarding my education and research. I would also like to thank members of the Smith lab past and present who have helped me both in research and in life. v Contents Dedication iv Acknowledgments v List of Figures viii List of Tables x 1 Introduction 1 1.1 Mammalian epigenome and evolution . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Comparative epigenomic studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Modeling methylome evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Background 7 2.1 Biology of DNA methylation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Methods for profiling DNA methylation . . . . . . . . . . . . . . . . . . . . . . . 13 2.3 Comparative epigenomic studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4 Questions arising from phylo-epigenomic studies . . . . . . . . . . . . . . . . . . 17 2.5 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.6 Problem description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3 Models for mammalian methylome evolution 27 3.1 State space and units of measurement for DNA methylation . . . . . . . . . . . . . 27 3.2 Modeling a methylome profiled by whole-genome bisulfite sequencing . . . . . . . 28 3.3 Modeling methylome evolution with independent sites . . . . . . . . . . . . . . . 33 3.4 Modeling methylome evolution with interdependent sites . . . . . . . . . . . . . . 35 3.4.1 Model description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4.2 Model learning and inference . . . . . . . . . . . . . . . . . . . . . . . . 37 3.4.3 Results on simulated data . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.4.4 Results on real data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.5 Alternative models for epigenome evolution . . . . . . . . . . . . . . . . . . . . . 50 vi 4 Evolution of mammalian sperm methylomes 57 4.1 Lineage-specific features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.2 Methylation features are conserved at promoters and divergent distally . . . . . . . 61 4.3 Methylation loss exceeds methylation gain across species . . . . . . . . . . . . . . 66 4.4 Widening of hypomethylated regions has contributed significantly to methylation loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.5 Sequence signatures driven by sperm methylome divergence . . . . . . . . . . . . 81 4.6 Methylation divergence is associated with histone modification and sequence diver- gence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.7 Lineage-specific changes involve common transcription factors and developmental genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5 Conclusions 96 Reference List 100 A Appendix 114 A.1 Computing derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 A.2 MCMC convergence diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 A.3 Two-tuple Bayesian network model . . . . . . . . . . . . . . . . . . . . . . . . . 116 A.4 Two-tuple Markov Random Field . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 A.5 Intra-species methylome variation . . . . . . . . . . . . . . . . . . . . . . . . . . 122 A.6 Supplemental methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 vii List of Figures 2.1 Example visualization of WGBS methylome . . . . . . . . . . . . . . . . . . . . . 14 2.2 Seven mammalian species. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.1 Inferring methylation states from WGBS methylome assuming independent sites. . 31 3.2 Methylation states are autocorrelated and organized in blocks . . . . . . . . . . . . 32 3.3 Inferring methylation states from WGBS methylome with HMM. . . . . . . . . . . 33 3.4 Performance on simulated data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.5 Trace plot for example sufficient statistics. . . . . . . . . . . . . . . . . . . . . . . 45 3.6 Phylogenetic trees estimated by independent-site and interdependent-site models . 48 3.7 Inference of ancestral methylomes reveal small but conserved HMRs . . . . . . . . 51 3.8 Graphical models for epigenome evolution . . . . . . . . . . . . . . . . . . . . . . 52 4.1 Distribution of sperm methylation levels . . . . . . . . . . . . . . . . . . . . . . . 59 4.2 Distribution of genomic context . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.3 Hierarchical clustering of mammalian sperm and somatic methylomes . . . . . . . 60 4.4 Sperm HMRs are more abundant and wider than somatic HMRs . . . . . . . . . . 61 4.5 Promoter methylation profiles in sperm and somatic cells . . . . . . . . . . . . . . 62 4.6 Pericentromeric satellites are hypomethylated in dog sperm . . . . . . . . . . . . . 63 4.7 Conserved promoter HMRs and divergent distal HMRs . . . . . . . . . . . . . . . 64 4.8 Sperm DNA methylation of seven species in an example orthologous region . . . . 65 4.9 Sperm promoter HMR sizes differ between species . . . . . . . . . . . . . . . . . 65 4.10 Total region size of species-specific methylation and hypomethylation . . . . . . . 70 4.11 Schematic presentation of interdependent-site phyloepigenetic model . . . . . . . . 71 4.12 Evolution trees for genome and sperm methylome . . . . . . . . . . . . . . . . . . 73 4.13 HMR divergence by an exponential decay model . . . . . . . . . . . . . . . . . . 74 4.14 Evolutionary increase of the fraction of genomic hypomethylation . . . . . . . . . 75 4.15 Sizes of HMR gain and loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.16 Comparison of independent-site models indicates non-stationary process . . . . . . 76 4.17 Schematic annotation of methylome evolutionary event . . . . . . . . . . . . . . . 77 4.18 Total size of methylation evolution events by type and lineage . . . . . . . . . . . 77 4.19 Secondary mutation events are closer to gene promoters . . . . . . . . . . . . . . . 78 4.20 Enrichment of different types of methylome evolution events . . . . . . . . . . . . 79 4.21 More conserved sperm HMRs in the orthologous genome are wider . . . . . . . . 80 4.22 HMR sizes in the orthologous genome and species-specific genome . . . . . . . . 81 viii 4.23 Inferred age of human sperm HMR and CpG enrichment . . . . . . . . . . . . . . 82 4.24 Human HMR CpG enrichment by HMR age and average methylation level . . . . 82 4.25 Opposing scenarios for germline promoter HMR size evolution. . . . . . . . . . . 83 4.26 CpG enrichment profile at orthologous promoters . . . . . . . . . . . . . . . . . . 85 4.27 Mouse CpG enrichment in rodent-specific vs. conserved methylated regions . . . . 85 4.28 Sperm DNA methylation and histone modification . . . . . . . . . . . . . . . . . . 86 4.29 Sequence conservation levels in lineage-specific HMRs . . . . . . . . . . . . . . . 88 4.30 Relative substitution rates in genomic background and lineage-specific HMRs . . . 90 4.31 Enriched transcription factor binding sites in HMR birth and extension regions . . 92 4.32 Biological processes associated with promoter HMR extensions on human lineage . 93 4.33 Example lineage-specific HMR births . . . . . . . . . . . . . . . . . . . . . . . . 94 4.34 Lineage-specific HMRs on parallel lineages . . . . . . . . . . . . . . . . . . . . . 95 A.1 Comparison of Markov blankets . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 A.2 HMR number and average size in various cell types and tissues. . . . . . . . . . . 123 A.3 Correlation of sperm DNA methylation in 6 mouse individuals from two strains. . . 123 A.4 Correlation of sperm DNA methylation in 4 (20-40yr) human individuals. . . . . . 124 ix List of Tables 2.1 Sequencing and mapping statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2 List of mammalian methylome data sets . . . . . . . . . . . . . . . . . . . . . . . 22 3.1 The Markov blanket and probabilities involved in MCMC sampling procedure. . . 55 3.2 Partial derivatives with respect to model parameters required in the maximization step of EM algorithm (Section 3.4.2). . . . . . . . . . . . . . . . . . . . . . . . . 56 4.1 Summary statististics about methylomes . . . . . . . . . . . . . . . . . . . . . . . 58 4.2 Aligning methylomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.3 Retrotransposon hypomethylation in sperm . . . . . . . . . . . . . . . . . . . . . 63 4.4 Enriched Biological Processes associated with ultra-conserved HMRs . . . . . . . 67 4.5 Enriched Molecular Functions associated with ultra-conserved HMRs . . . . . . . 68 4.6 Enriched Cellular Components associated with ultra-conserved HMRs . . . . . . . 69 4.7 Sizes of species-specific hypomethylated and hypermethylated regions . . . . . . . 70 4.8 Phylogenetic tree and phylo-epigenetic tree models . . . . . . . . . . . . . . . . . 72 4.9 Enrichment of histone modifications in HMRs with lineage-specific HMR extensions 87 4.10 Overlap between lineage-specific sperm HMRs and human somatic HMRs . . . . . 88 4.11 Lineage-specific HMRs contain more substitutions on the same lineage . . . . . . 91 x Chapter 1 Introduction 1.1 Mammalian epigenome and evolution Mammalian DNA methylation is an epigenetic modification that occurs primarily on cytosines in CpG dinucleotides and has wide-ranging connections to mammalian development (Reik et al., 2001; Smith & Meissner, 2013). A typical mammalian genome is globally methylated, with inter- spersed hypomethylated regions. The hypomethylated regions often overlap with important regu- latory elements, such as gene promoters, enhancers and insulators, with diverse level of regulatory activity (Jones, 2012). Dynamic changes of hypomethylated domains, including switch of methy- lation states and shift of boundary locations, occur during embryogenesis, cell differentiation, and in response to environmental influences and diseases (Meissner et al., 2008; Irizarry et al., 2009; dos Santos et al., 2015). DNA methylation has substantially impacted the large-scale evolution of mammalian genomes. Spontaneous deamination elevates the mutation rate at methylated cytosines (Bird, 1980). As a consequence, germline DNA methylation profiles have shaped the CpG landscape of mammalian genomes (Cohen et al., 2011), resulting in the CpG island phenomenon (Gardiner-Garden & From- mer, 1987). In germ cells, DNA methylation is necessary for suppressing germline retro-element proliferation (Walsh et al., 1998a; Bourc’his & Bestor, 2004). Mounting evidence supports a contribution of germ cell DNA methylation to the epigenomic state of embryonic cells and the early activation of developmental genes (Brykczynska et al., 2010; Carrell & Hammoud, 2010; Smith et al., 2012; Branco et al., 2016; Atsem et al., 2016). Examination of germline DNA methy- lation is therefore of particular interest in understanding how the requirements of epigenomic reg- ulatory programming are balanced against the increased mutational burden of DNA methylation. 1 Taking a broader view, patterns of DNA methylation and the way they vary between species afford us a glimpse into the evolution of gene regulation. As highlighted by the seminal work of King & Wilson (1975) and validated in the current genomics area, species with high genetic similarity in their protein coding regions can show substantial phenotypic differences which may be related to differences in gene expression and regulation (Karaman et al., 2003; Odom et al., 2007; Bauernfeind et al., 2015). Epigenetic divergence reflects altered protein-DNA interactions, which may associate with changes in both the sequence ofcis-regulatory elements (Heijmans et al., 2007; Lienert et al., 2011) and the binding preference of trans-acting factors (Cheng et al., 2014). Thus, characterizing epigenetic divergence is a first step towards understanding the evolution of regulatory systems in different species. 1.2 Comparative epigenomic studies Comparative analyses of mammalian epigenomes can increase our understanding about the evo- lution of epigenetic regulation. In combination with genomic evolution signatures, they enable a better understanding of the specific mechanisms behind phenotypic evolution. Depending on the specific epigenetic modification of interest, these studies mainly fall into two groups. The first group focus on specific histone modifications, and rely on current understanding of the com- binatorial “histone code” (Strahl & Allis, 2000; Rando, 2012) to identify specific classes of regulatory elements, such as active or poised gene promoters or enhancers (Villar et al., 2015; Lesch et al., 2016). The second group of studies take interest in DNA methylation for its broad link to regulatory elements with various activities, and thus capture a wider range of regulatory elements. The development of whole-genome bisulfite sequencing techniques (Lister et al., 2009; Urich et al., 2015) have enabled observations of methylation signals at single-nucleotide resolu- tion and whole-genome scale. These methylation signals are automatically normalized and can be directly compared across different genomic locations and across samples, which is a substantial improvement over ChIP-based experiments. In this thesis, we focus on comparative methylome 2 analysis, but have histone modifications in mind when we develop models for epigenome evolution and explore the regulatory relevance of epigenetic divergence. Existing comparative analyses of mammalian methylomes have increased our understanding of DNA methylation divergence. Methylation states across multiple primate species recapitulate their phylogenetic relationship (Martin et al., 2011), and greater variation has been observed between tissues than between species (Pai et al., 2011; Molaro et al., 2011). Regions with species-specific DNA methylation have been reported to harbor excessive species-specific substitutions (Hernando- Herraez et al., 2015). However, such studies have been limited to species most closely related to human, and have focused on species-specific differentially methylated regions, instead of system- atically studying methylation changes in all lineages of the phylogenetic tree without bias. 1.3 Modeling methylome evolution It remains unclear how and why DNA methylation patterns have evolved in mammalian species: where in the genome, and when during evolutionary history have changes occurred? Addressing these questions involves modeling an evolutionary process that may be viewed as analogous to sequence evolution. In homogeneous healthy cell populations, and over nearly the entire genome, continuous methylation levels at CpG sites can be functionally understood as having either high or low methylation, and a methylome can be modeled as a sequence of binary states. Phylogenetic modeling for molecular sequence evolution has a rich history (Felsenstein, 1981a; Yang, 1994b; Siepel & Haussler, 2004), and approaches have been applied to DNA methylation changes in tumor growth and development (Siegmund et al., 2009; Capra & Kostka, 2014). However, modeling methylome evolution at individual CpG sites presents a unique challenge: the majority of those sites have mutated between distant species due to their hypermutability. Focusing on CpG sites that are fully conserved in the relevant species dramatically reduces the portion of the orthologous genome that can be analyzed. Solutions that aggregate data, for example in pre-determined windows along the genome, lower resolution while only partially solving the 3 problem. At the same time, it is well known that methylation states of neighboring CpG sites are highly interdependent and a phenotypically relevant epigenomic change tends to involve multiple consecutive CpG sites. This dependence between neighboring sites provides information that may be leveraged to overcome this challenge. 1.4 Contributions 1. An evolutionary model for mammalian DNA methylation: Tools are lacking for compar- ative methylation analysis. We designed a probabilistic evolutionary model for mammalian DNA methylation. Similar to likelihood-based phylogenetic models, we model the evolution process of methylation states along the phylogenetic tree as a time-continuous Markov pro- cess. Meanwhile, we incorporate the local dependency of methylation states in an individual species by using a discrete-state Markov chain as formerly used in models for detecting hypomethylated regions. These two orthogonal processes, one operating along the time line, the other along the genome, are combined to describe the evolution of mammalian methy- lomes. This model provides new opportunities for making inference and testing hypotheses with regard to the evolution of mammalian methylomes. It also paves the way for modeling the co-evolution of DNA sequence and epigenetic marks. 2. A computational algorithm for efficient approximation: Our proposed graphical model is a Bayesian (directed acyclic) network. In application, we are faced with a missing data problem: not only ancestral methylomes are unobserverable, methylation states at a significant fraction of sites in each extant species are also unobserved. Obtaining the maximum-likelihood estimates (MLE) of the model parameters through exact Expectation- Maximization (EM) algorithm would incur computational complexity that is exponential to the number of hidden variables and intractable in practice. We resort to approximation approaches in the E-step, and develop an efficient Monte Carlo EM (MCEM) algorithm (Wei & Tanner, 1990) for model learning and inference. 4 3. Biological implication from phylo-epigenomic analysis of mammalian sperm methy- lomes: Our ultimate goal in this research is to understand the history and impact of epi- genetic changes in mammalian species. We examine whole genome bisulfite sequencing data from the sperm genomes of 7 mammalian species – human, chimpanzee, gorilla, rhesus macaque, mouse, rat and dog – that spans about 100 million years of mammalian evolu- tion to investigate the evolution of germline DNA methylation. We create a comprehensive annotation of evolutionary events along individual lineages of the phylogenetic tree. Several key findings emerged from our analysis. There is a global trend towards expansion of the hypomethylated fraction of the genome on all phylogenetic lineages. Hypomethylated inter- vals around transcription start sites have evolved to be substantially wider in primates and dog than in rodents. Genomic intervals with lineage-specific hypomethylation are associated with common developmental processes, and are often close to sets of genes with significant overlap. Analysis of transcription factor binding data indicates that regions of rodent-specific and primate-specific hypomethylation are enriched for binding sites of orthologous transcrip- tion factors. These observations indicate how evolution of the epigenome may contribute to species-specific fine-tuning of conserved mammalian developmental programs. 1.5 Outline In the second chapter, we review the biological regulation and functional roles of DNA methylation in the mammalian genome and experimental techniques for profiling DNA methylation. Then we give a summary of previous studies, and list out central questions of comparative studies of mammalian DNA methylation. Next, we introduce the methylome datasets that we generated in this study or collected from previous studies. Finally, we give detailed information about the Markov model for molecular sequence evolution, which is the foundation for phylo-epigenetic models that we develop in the following chapter. 5 In Chapter 3, we develop models for methylome evolution. We first introduce the state space and unit for mammalian DNA methylation marks and describe how observed data from bisulfite sequencing experiments are modeled. Next, we introduce a phylo-epigenetic model that assumes independent evolution at different sites, which is a direct application of the classic model for molec- ular sequence evolution. Then, we develop a phylo-epigenetic model with autocorrelated sites, which captures the well-know autocorrelated genomic distribution of mammalian epigenetic marks and overcomes the challenge that the majority of substrates of DNA methylation have mutated between distant species. We describe in detail the model learning and inference procedures and show their performance on simulated and real datasets. In Chapter 4, we perform in-depth comparative analyses of sperm methylomes from seven mammalian species. We first describe cell-type-specific and species-specific aspects of the sperm methylome evolution. Using the proposed phylo-epigenetic model, we characterize global trends of evolution occurring along parallel lineages of the phylogenetic tree, infer ancestral methylomes and annotate evolutionary events at high spatial and temporal resolution. Based on observations and model inference, we propose a hypothesis that, averaged across lineages and over the orthol- ogous genome, existing epigenetic features tend to spread locally during evolution. Last but not least, we associate derived methylome features with genes by proximity to understand their func- tional relevance. In Chapter 5, we conclude this study by highlighting the merits of our proposed phylo- epigenetic model and methods for learning and inference, and insights to how evolution of the epigenome may contribute to species-specific fine-tuning of conserved mammalian developmental programs. 6 Chapter 2 Background 2.1 Biology of DNA methylation An ancient epigenetic modification DNA methylation is an epigenetic modification found in almost all kingdoms of life. Two major types of DNA methylation are 5-methylcytosine (5mC) and N6-methyladenine (6mA). In many bacteria and archaea, cytosine or adenine methylation is part of the restriction modification sys- tem, and is important for DNA replication, mismatch repair as well as gene regulation (Wion & Casades´ us, 2006). Cytosine methylation is observed in all eukaryotic kingdoms, showing both conservation and divergence in the sequence context and genomic distribution across kingdoms and species (Zemach et al., 2010). In plants, cytosine methylation occurs in three sequence contexts: CpG, CHG, and CHH. In animals, it mainly occurs in CpG dinucleotides, with additional non-CpG methylation at low levels in certain tissues and cell types, such as in brain, embryonic stem cells and oocytes (Ziller et al., 2011; Keown et al., 2017; Yu et al., 2017), which may be the product ofdenovo DNA methylation activities of DNMT3A and DNMT3L (Shirane et al., 2013). Animals and plants share homologous enzymes for de novo methylation and the maintenance of methylation through cell cycles (Law & Jacobsen, 2010). Invertebrates have quite heterogenous DNA methylation patterns. Many invertebrate genomes are predominantly unmethylated (Bird & Taggart, 1980). Some invertebrate genomes, such as Ciona intestinalis (sea squirt), Crassostrea gigas (oyster) and Apis mellifera (honey bee), are 7 sparsely or moderately methylated mainly in gene bodies (Simmen et al., 1999; Gavery & Roberts, 2010; Elango et al., 2009). Vertebrates share a common feature known as global DNA methylation. The vast majority of cell types in vertebrate species, including representatives of fish, amphibians, reptiles, birds, and mammals, have 60-90% of CpG dinucleotides containing 5-methylcytosine (5mC). The transition from fractional to global methylation occurred close to the origin of vertebrates, as amphioxus has a typically invertebrate methylation pattern whereas primitive vertebrates (hagfish and lamprey) have patterns that are typical of vertebrates (Tweedie et al., 1997). Methylation at gene bodies and transposable elements appears to be ancient property of eukaryotic genomes (Zemach et al., 2010), while promoter methylation seems to be divergent between vertebrates and invertebrates in terms of both genomic distribution and impact on gene expression (Keller et al., 2015). From here on, we focus specifically on mammalian DNA methylation. Machinery of mammalian DNA methylation and demethylation Mammalian DNA methyltransferases. Two founding papers by Riggs (1975) and by Holliday & Pugh (1975) proposed DNA methylation as a mechanism to heritably regulate gene expres- sion in eukaryotic cell differentiation, and predicted the properties of DNA methylation enzymes including a de novo methylation activity that methylates both strands at certain palindromic DNA sequences, and a maintenance activity with the preference to methylate the newly synthesized strand at semi-methylated sites after DNA replication. The maintenance methyltransferase, DNA methyltransferase 1 (DNMT1), was purified in 1983 (Bestor & Ingram, 1983). The de novo DNA methylation activity in mammalian organisms was first reported in mouse (J¨ ahner et al., 1982; Stewart et al., 1982), and two de novo DNA methyltransferases, DNMT3A and DNMT3B, were 8 identified through sequence homology search with known prokaryotic cytosine DNA methyltrans- ferases (Okano et al., 1998). The DNA methyltransferase-like protein (DNMT3L) lacks the com- mon catalytic domain conserved in other DNA methyltransferases, but acts as a general stimula- tory factor for de novo methylation by DNMT3A (Hata et al., 2002; Ch´ edin et al., 2002), and is indespensible for maternal methylation at imprinting centers (Bourc’his et al., 2001) and spermato- genesis (Hata et al., 2002). Recently, Barau et al. (2016) identified a rodent-specificdenovo DNA methyltransferase geneDnmt3c, which was created by a duplication ofDnmt3b, and is involved in suppressing activities of young retrotransposons in the male germ line. Demethylation Cells can remove DNA methylation through either a passive demethylation pro- cess or an active demethylation process. The passive demethylation dilutes genomic DNA methy- lation by half after each cell division under reduced or absent DNMT activity, which is observed during embryogenesis (Monk et al., 1991; Rougier et al., 1998). This process globally reduces genomic DNA methylation, and can lead to close to complete erasure of genomic DNA methylation through multiple rounds of cell devision in specific developmental stages. Active demethylation, by contrast, is required for fine-tuned local removal of methylation marks, and is more commonly associated with cell differentiation. Several pathways have been proposed for active demethyla- tion (Ooi & Bestor, 2008), among which, pathways of oxidation-mediated demethylation appear to be the most favorable, after the recent discovery of cytosine hydroxymethylation (5hmC) and advancement in studies of ten-eleven translocation (TET) proteins. TET family enzymes can con- vert 5mC into 5hmC (Tahiliani et al., 2009). Later, two more cytosine derivatives, 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC) were observed as products of TET activities (Ito et al., 2011). Active DNA demethylation possibly occurs through TET-catalyzed oxidation, followed with deam- ination by activation-induced deaminase (AID) (Guo et al., 2011) and DNA repair enzyme thymine DNA glycosylase (TDG) mediated base excision repair (Cortellino et al., 2011). 9 Roles of methylation in mammalian gene regulation and genome evolution DNA methylation at regulatory elements Cells may modulate gene expression through pro- moter methylation. The methylation status at promoters varies considerably across different cell types, with methylation of promoters generally associated with low or no transcription (Suzuki & Bird, 2008). Aberrant methylation at the promoters of tumor suppressor genes have been observed in cancer cells (Portela & Esteller, 2010). Enhancers are also targets of epigenetic reg- ulation. Cell-type-specific hypomethylation has been observed to accompany cell-type-specific binding of transcription factors (Stadler et al., 2011). Methylation alters the chemical property of DNA molecules and their interaction with DNA-binding proteins. For example, DNA strand separation, an important step in gene expression, has been reported to be affected by methyla- tion level and sequence context (Severin et al., 2011). Binding of certain transcription factors to DNA can be blocked by methylation at the binding site (Watt & Molloy, 1988). Meanwhile, DNA methylation may also silence gene expression indirectly by recruiting chromatin-remodeling factors, such as those with methyl-CpG-binding domains (MBD), that change the chromatin into a more compact structure (Boyes & Bird, 1991; Wade et al., 1999; Ballestar & Wolffe, 2001; Martinowich et al., 2003). DNA methylation at gene bodies Gene body hypermethylation is conserved in most plants and animals (Feng et al., 2010) and may have distinct regulatory roles for gene expression from pro- moter methylation. However, the role of gene body methylation is not yet clear. A recent study showed that the enzymatic activity of DNMT3B and recruitment to gene body by H3K36me3 are required to suppress spurious transcription from the gene body (Neri et al., 2017). Intragenic methylation may also play a role in alternative splicing. For example, exonic DNA methylation can reduce the inclusion of upstream exons by blocking CTCF binding (Shukla et al., 2011), and promote exon inclusion by recruiting methyl-CpG-binding protein MeCP2 and subsequent histone acetylation (Maunakea et al., 2013). 10 DNA methylation at retrotransposable elements Transposition activities of transposable ele- ments (TE) may create insertion mutations, disturb gene regulation or cause chromosomal rear- rangement by homologous recombination between non-allelic repeats. DNA methylation is part of the cellular apparatus suppressing TE activities. Induced demethylation of TEs, such as mouse IAP elements (Walsh et al., 1998b) and human SINE and LINE elements (Liu et al., 1994; Woodcock et al., 1997), are correlated with elevated TE transcription levels. DNA methylation likely suppresses TE activity through the same mechanisms that suppress expression of genes with promoter methylation (Bourc’his & Bestor, 2004; Aravin et al., 2008). During millions of years of evolution in the host genome, the majority of TEs have have lost ability to transpose due to mutations accumulated. DNA methylation at TEs may help accelerate this process through spontaneous deamination of the methylated cytosines. Only a small proportion of TEs still possess transposition abilities. For example, about 35-40 subfamilies of Alu, L1 and SV A elements are mobile in human genome today (Mills et al., 2007). These active TEs continue to bring in genetic diversity, and potentially cause diseases. Some TEs carry sequences that can assume regulatory functions (Van de Lagemaat et al., 2003; Jordan et al., 2003; Thornburg et al., 2006; Bourque et al., 2008; Kunarso et al., 2010). For example, the eutherian specific transposable element MER20 have the epigenetic signatures of enhancers, insulators and repressors, and enrichment of MER20 surrounding differentially regu- lated orthologous genes in endometrium cells during pregnancy was observed in a cross-species analysis (Lynch et al., 2011). DNA methylation status of TEs may serve as a rough surrogate, for the inference of their transcription activity and regulatory role in both single species analysis and cross-species evolutionary studies. Imprinting via DNA methylation Imprinting refers to changes in the state of a chromo- some that occur during one generation and are inherited and recognized in the next gener- ation (Crouse, 1960). The imprint on a chromosome is solely determined by its parent of 11 origin, and is independent of the genetic content of the chromosome. The parent-of-origin- depedent gene expression is mediated by imprinting control regions (ICR) with differential methy- lation between the paternal and maternal alleles. A single ICR usually has control over the expression pattern of a cluster of imprinted genes. The differential epigenetic state of ICR is propagated in an allele-specific manner via long noncoding RNAs (lncRNA), insulator pro- teins, histone modifications and higher-order chromatin structure. The evolution of imprinting is tightly linked with the emergence of placenta in mammals. Studies in human and mouse have proved an important role of gene imprinting in placenta development, where the paternal expression of genes usually enhance, and maternal expression usually restrict the growth of pla- centa. Comparative studies of imprinted gene clusters in human and mouse have shown that imprinting is a highly dynamic process under fast evolutionary adaptation (Paulsen et al., 2000; Johnstone et al., 2006). Two waves of reprogramming in germ-cell and zygotic development Epigenetic reprogram- ming is the process of close to complete erasure and re-establishment of DNA methylation marks. It happens at two important stages during development. The first wave of epigenetic reprogram- ming happens soon after the formation of the zygote. The paternal and maternal genome goes through demethylation separately, and at around embryo day E3.5 in mouse, the embryo reaches its lowest global methylation level (Seisenberger et al., 2013). Then at embryonic day E6.5, the genomic methylation level is restored globally. This is an important step to reset the germ cell marks to embryonic methylation patterns. A number of regions, however, are protected from demethylation in this process, including imprinted regions and specific types of repetitive elements (Rougier et al., 1998; Reik et al., 2001; Lane et al., 2003; Borgel et al., 2010; Smallwood et al., 2011). The second wave of epigenetic reprogramming occurs during the formation of primordial germ cells (PGC), which arise around E7.25 in the epiblast of the developing embryo (Ginsburg et al., 1990). Parental imprints and somatic epigenetic patterns are erased during the reprogram- ming, giving way to epigenetic settings that prelude gamete totipotency (Hajkova et al., 2002; 12 Lane et al., 2003; Guibert et al., 2012). There are element that can escape demethylation during the PGC reprogramming as well, including thousands of retrotransposons and a few CpG-dense regions at gene promoter and gene-body (Hackett et al., 2013). Impact of DNA methylation on genomic evolution Spontaneous deamination of 5- methylcytosine (5mC) results in thymine (Coulondre et al., 1978; Shen et al., 1994). This has caused drastic underrepresentation of CpG sites in species showing global DNA methylation pat- terns. The rate of deamination is positively correlated with temperature (Shen et al., 1994). Among vertebrates, there is a correlation between the level of CpG depletion and body temperature (Jab- bari et al., 1997; Varriale & Bernardi, 2006). Although vertebrate genomes are generally depleted in the CpG dinucleotide, they also contain interspersed CpG islands (CGI), which are regions of DNA with a high G+C content and a high frequency of CpG dinucleotides relative to the bulk genome (Gardiner-Garden & Frommer, 1987). CGIs are associated with the 5’ ends of housekeep- ing genes and many tissue-specific genes. The CGI phenomenon is potentially a result of genomic DNA methylation patterns in germline cells (Molaro et al., 2011). Regions consistently methylated in the germline gradually loose CpGs during evolution, while regions constitutively protected from DNA methylations retain its CpG density that is higher than the genomic background. 2.2 Methods for profiling DNA methylation Sodium bisulfite conversion Frommer et al. (1992) designed a sequencing protocol that yields positive identification of 5-methylcytosine residues from individual genomic DNA molecules. The method treats genomic DNA with sodium bisulfite, which converts cytosines to uracils and is non-reactive with 5-methylcytosine. “The gold-standard for determining methylation states of individual CpG sites is to combine sodium bisulfite conversion with PCR and Sanger sequencing.” (Urich et al., 2015). 13 CpG Islands 50 kb GADD45A GADD45A GADD45A GNG12 GNG12-AS1 DIRAS3 WLS WLS WLS MIR1262 H1ESC H9ESC Sperm BCell HMR Methylation level 0 1 Figure 2.1: Example visualization of WGBS methylome Whole genome bisulfite sequencing (WGBS) is used to interrogate cytosine methylation states at single-base resolution at a genome-wide scale. Usually, WGBS can cover over 90% of the total CpG sites in mammalian genomes. Methylation level at an individual CpG site is estimated with the proportion of mapped reads (after removal of PCR duplicates) covering this site that have base ‘C‘ at this location. We can visualize a WGBS methylome in genome browser (Figure 2.1) Reduced representation bisulfite sequencing (RRBS) first digest DNA with restriction enzyme MspI that cleaves DNA at restriction sites CCGG, then size-select DNA fragments that are between 100-150 bp. The isolated DNA fragments are treated with sodium bisulfite and sequenced with high-throughput sequencing instruments. Because of the restriction sites and size-selection, the sequencing library is enriched with fragments from CpG-rich regions, which covers about 50% of RefSeq gene promoters, and about 85% of CpG islands (Stevens et al., 2013). Methylation arrays are used to interrogate DNA methylation states of pre-selected regions in the human genome. They contain DNA probes designed for methylated and unmethylated sequences after bisulfite conversion in selected genomic regions of prior interest. These regions include the majority of CpG islands, RefSeq gene promoters and cover over 850k 14 CpG sites (Illumina MethylationEPIC array) out of 28 million CpG sites in human genome (Moran et al., 2016). Methylated DNA immunoprecipitation (MeDIP) uses an antibody against 5-methylcytosine to capture and enrich 5mC-containing DNA fragments in the DNA library that can be profiled with DNA micro-arrays or high-throughput sequencing (Weber et al., 2005; Down et al., 2008). Long-read sequencing Long-read sequencing technologies, such as the single molecule real- time (SMRT) sequencing by Pacific BioSciences and the MinION nanopore sequencing by Oxford Nanopore Technologies (ONT), have been recently applied in epigenetics research. SMRT mea- sures characteristics of DNA polymerization, including the interpulse duration (IPD), which is the time duration between successive fluorescence pulses from base incorporations, and is able to dis- tinguish between 5mC, 5hmC and unmodified cytosines (Flusberg et al., 2010). Similarly, DNA modifications can be detected as changes in MinION’s ionic current signal (Rand et al., 2017). These sequencing techniques not only have the advantage of producing long (e.g. above 10 kb) reads from single molecules, but also can provide simultaneous profiling of multiple epigenetic modifications. 2.3 Comparative epigenomic studies King & Wilson (1975) found that the differences in protein sequences are two small to account for the phenotypic differences between closely related species, and proposed that regulatory muta- tions account for major biological difference between species. Divergent regulation of ortholo- gous genes has been characterized in comparative gene expression analysis in multiple species (Enard et al., 2002; Karaman et al., 2003; Blekhman et al., 2010). The degree of gene expres- sion difference between species varies for different tissues, and in certain tissue more changes have accumulated in one species than in other species (Enard et al., 2004; Khaitovich et al., 2005; Blekhman et al., 2008). Comparative epigenomic studies aim to uncover the specific epigenetic 15 changes and regulatory mechanisms underlying such divergent expression of orthologous genes in different species. Findings from recent comparative epigenetic studies Methylation states across multiple pri- mate species recapitulate their phylogenetic relationship (Martin et al., 2011). Greater variation has been observed between tissues than between species (Pai et al., 2011; Molaro et al., 2011). Differential promoter methylation between species explains a significant proportion of differences in gene expression levels in matching tissues (Pai et al., 2011), and may be an important mech- anism driving species-specific traits such as brain evolution and disease vulnerabilities in human (Wang et al., 2012; Zeng et al., 2012). Divergent sequence signatures are associated with divergent methylation patterns. In germ-line cells, differential methylation have led to divergent levels of CpG decay between species (Molaro et al., 2011). Meanwhile, differential methylation between species may arise as a result of sequence divergence. For example, gain and loss of CTCF-binding sites in the genomes of multiple primate species are associated with cross-species epigenetic diver- gence (Fukuda et al., 2013). Regions with species-specific DNA methylation have been reported to harbor excessive species-specific substitutions (Hernando-Herraez et al., 2015). Choosing a method for methylome profiling Methods of lower resolution or targeting a sub- set of genomic regions, such as MeDIP (Weber et al., 2005) and RRBS (Meissner et al., 2005) have been used to capture conserved and species-specific methylation patterns in earlier studies. However, comparison of high resolution whole genome methylation patterns across species is still limited in the literature, partly because WGBS is still a relatively new and expensive technology. Nevertheless, WGBS data sets in matching cell types in multiple species have been accumulating from individual studies, and are available for comparative studies (Song et al., 2013). Choosing a cell type for studying methylome evolution Overall, somatic methylation profile reflects the methylation state in germ cells. For example, CpG decay rates in CGIs with conserved methylation levels across the neutrophils of multiple primates showed strong correlation with the 16 somatic methylation levels (Martin et al., 2011). However, as epigenetic studies of diverse cell types have shown, while the methylome is globally stable, focal changes of methylation state are correlated with, and sometimes responsible for, the speciation of different lineages of cell types (Hodges et al., 2011). Moreover, germ cells have methylation patterns that are distinct from somatic methylomes. For example, while the majority of gene promoters are associated with HMRs in the promoter region in both germ cells, embryonic stem cells (ESC) and somatic cells, the germ cell promoter HMRs are usually much wider than ESC and somatic promoter HMRs. This structure of core and extended HMR at gene promoters in germ cell and somatic cells is conserved between human and chimp (Molaro et al., 2011), but less noticeable in mouse. In addition, most germline-specific genes are hypermethylated in somatic cells (Weber et al., 2007). Therefore, germ cell methylation profiles would be a more accurate reflection of the mutational pressure on the DNA sequences in evolution. 2.4 Questions arising from phylo-epigenomic studies Phylo-epigenomics – the comparative study of epigenomes across species – is still in an early stage, giving rise to many valid questions awaiting proper answers. Here we discuss a few questions that we attempt to address in this project. What are the large-scale evolutionary relationships among methylomes of mammals? Evo- lutionary trees for species can be built from any number of phenotypic traits or genetic features (Felsenstein, 2004). Such trees usually have the same topology, as they typically reflect speciation relationships, but the branch lengths can differ depending on the measured trait – even when those lengths are intended to reflect historical time. Due to the accuracy of molecular evolution mod- els, neutrally evolving DNA sequence is often taken as the most consistent with the fossil record. Certain traits, for example the structure of proteins in a particular family, can then be understood 17 relative to historical time, and one can examine whether the accelerated or constrained change indicates adaptation or selection relative to historical time. What are the species-specific features of mammalian methylomes? As an epigenetic mark, DNA methylation has a potentially large margin of autonomy during evolution. Apart from cytosines that act as methylation substrates, there is no prior reason why neutral changes to the sequence must alter the epigenome. Yet we observe substantial genome-wide alterations in DNA methylation between closely related species. When considering the sperm epigenome, exam- ples include the primate-specific hypomethylation of pericentromeric satellites and human-specific hypomethylation of SV A elements, or the reduced size of hypomethylated regions in mouse (Molaro et al., 2011; Molaro et al., 2014). These large-scale changes are too wide-spread to be easily associated with specific genes or molecular functions. Yet they have the potential to broadly impact the function of the epigenome. These changes may reflect evolution of how the methyl- transferases are regulated, the efficiency and fidelity of the DNA methylation and de-methylation systems, or the way particular species organize chromosomes within the nucleus. A primary goal of our research is to characterize these large-scale epigenomic changes within a restricted, but representative, mammalian tree. In broad strokes, we seek to understand when the DNA methyla- tion system itself appears to have made significant adaptations, and in which taxa it seems tightly constrained. What is the relationship between genomic and epigenomic evolution? Divergent epigenetic features reflect divergent usage or location of regulatory elements between species. Regulatory ele- ments are often sequence-specific binding sites of transcription factors. For orthologous regions showing conserved epigenetic states indicative of conserved transcription factor binding events, the underlying sequences might be evolving under constraint because of the conserved binding specificity of orthologous transcription factors. They might also co-evolve with the binding speci- ficity of the interactingtrans factors and display weaker sequence conservation. It is also possible 18 that these sequences are bound by different factors in different species, in which case, the sequence conservation level may be expected to be even lower. On the other hand, would regions associated with species-specific epigenetic features show accelerated evolution in the same species-specific manner? It might be the case for newly formed species-specific regulatory elements. However, as may be limited by the scope of cell types under study, the observed species-specific epigenetic feature may result from a species- and cell-type-specific usage of a conserved regulatory elements. These scenarios are described from the perspective that the underlying sequence determines bind- ing affinity of transcription factors, which subsequently yield divergent epigenetic states. Of equal importance but harder to prove is the hypothesis that genetic assimilation of acquired traits can transform ancestral plasticity into genetically coded phenotypes in derived species (Waddington, 1953). How divergent methylome features contribute to the evolution of different species? Which features alter regulatory interaction in local scale, such as transcription factor turnover, and which alter the regulatory network at a larger scale, such as binding of chromatin-looping factors that can alter domains of chromatin-chromatin interactions? Which features increase the stability of gene expression, and which ones can increase the plasticity of gene expression in response to environmental stimuli? Have different species adopted alternative strategies for silencing, pos- ing and activating genes, which we can detect from the epigenetic footprints of these different mechanisms? Addressing these questions will require more information about the gene expres- sion pattern and chromatin structure of matching cell types from multiple species, and rely heavily on the annotation of regulatory elements and their evolution history along different lineages of species. 19 Human Chimp Gorilla Orangutan Gibbon Rhesus Crab eating macaque Baboon Green monkey Marmoset Squirrel monkey Bushbaby Chinese tree shrew Squirrel Lesser Egyptian jerboa Prairie vole Chinese hamster Golden hamster Mouse Rat Naked mole rat Guinea pig Chinchilla Brush tailed rat Rabbit Pika Pig Alpaca Bactrian camel Dolphin Killer whale Tibetan antelope Cow Sheep Domestic goat Horse White rhinoceros Cat Dog Ferret Panda Pacific walrus Weddell seal Black flying fox Megabat David's myotis Microbat Big brown bat Hedgehog Shrew Star−nosed mole Elephant Cape elephant shrew Manatee Cape golden mole Tenrec Aardvark Armadillo Opossum Tasmanian devil Wallaby Platypus Primate Mammal Afrotheria Laurasiatheria Euarchontoglires human chimp gorilla rhesus mouse rat dog 6.65 Mya 9.06 Mya 29.44 Mya 90 Mya 96 Mya 20.9 Mya A B Figure 2.2: Seven mammalian species. (A) Placing of the seven species in a phylogenetic tree of 64 mammals extracted from the phyloge- netic tree used during the multiz multiple alignment of 100 vertebrate species (Murphy et al., 2001; Blanchette et al., 2004). Species are colored by a human-centered nested clade hierarchy. (B) Phy- logeny of the seven species and consensus estimates of divergence time (Hedges et al., 2015). 2.5 Data description WGBS data sets are available for many mammalian species and cell types. Sperm methylomes of human, chimpanzee, rhesus macaque and mouse have been profiled in previous studies (Molaro et al., 2011; Hammoud et al., 2014; Gao et al., 2017). We generated WGBS data sets for 3 more mammals: gorilla, rat and dog. Technical properties of the data, including mapping and conversion rates, are presented in Table 2.1. Together, we collected sperm methylomes for seven mammals representing approximately 100 million years of evolutionary history (Figure 2.2). We also col- lected WGBS data sets for ESC and somatic cell types for these species. Whenever possible, we 20 Sample Rep. Lib. #PE % Mapped % Unique BS Depth Fraction Level (M) conv. sites F M W CHH CXG CCG Gorilla R1 1 562.3 67.4% 49.5% 0.993 11.8 0.89 0.769 0.706 0.718 0.14% 0.14% 0.19% 2 223.7 68.3% 58.1% 0.993 Rat R1 1 293.9 72.5% 97.0% 0.996 8.7 0.87 0.834 0.771 0.765 0.09% 0.08% 0.10% R2 1 229.3 74.9% 97.5% 0.996 8.1 0.81 0.839 0.757 0.787 0.10% 0.09% 0.11% Dog R1 1 127.1 78.9% 95.3% 0.995 24.0 0.97 0.818 0.754 0.726 0.07% 0.07% 0.10% 2 249.2 80.5% 91.9% 0.995 3 191.3 91.0% 89.6% 0.996 15.0 0.96 0.816 0.763 0.769 0.05% 0.06% 0.12% 4 109.1 91.2% 91.8% 0.996 R2 1 157.4 90.8% 93.4% 0.996 19.1 0.95 0.809 0.744 0.746 0.06% 0.06% 0.10% 2 142.1 90.5% 94.1% 0.996 R3 1 131.6 91.9% 92.7% 0.995 28.5 0.97 0.814 0.739 0.707 0.06% 0.06% 0.09% 2 262.4 84.2% 90.0% 0.995 Table 2.1: Sequencing and mapping statistics #PE: total number of paired-end reads in million. Unique: non-PCR-duplicate reads. Fraction sites: fraction of total CpGs in the native genome covered by reads. Methylation level: F–fraction of CpG sites with methylated status; M–mean methylation level across all CpG sites; W–average CpG methylation level weighted by read coverage; CHH/CXG/CCG–weighted average cytosine methylation level for each non-CpG context. 21 use methylomes for blood cell types in comparison with sperm methylomes to examine tissue- specific features of mammalian methylome. Methylome data sets used in this study are listed in Table 2.2. Species Cell type Rep. Accession Citation Human Sperm 2 GSE30340 Molaro et al. (2011) Human Sperm 2 GSE49624 Hammoud et al. (2014) Chimpanzee Sperm 2 GSE30340 Molaro et al. (2011) Gorilla Sperm 1 GSE79566 private Rhesus Macaque Sperm 1 GSE60166 Gao et al. (2017) Mouse Sperm 4 GSE49624 Hammoud et al. (2014) Rat Sperm 2 GSE79566 private Dog Sperm 3 GSE79566 private Human B cell 1 GSE31971 Hodges et al. (2011) Chimp B cell 1 GSE31971 Hodges et al. (2011) Gorilla Whole blood 1 SRP059313 Hernando-Herraez et al. (2015) Rhesus PBMC 6 GSE34128 Tung et al. (2012) Mouse B cell 2 SRP029721 Kieffer-Kwon et al. (2013) Rat Left ventricle 4 ERP002215 Johnson et al. (2014) Human H9 ESC 1 SRX026814 Lister et al. (2011) Human H1 ESC 1 GSE16256 Lister et al. (2009) Mouse ESC 1 GSE30202 Stadler et al. (2011) Dog Kidney epithelial cell line 1 GSE48527 Carmona et al. (2014) Table 2.2: List of mammalian methylome data sets 2.6 Problem description In a phylo-epigenomic study, where the epigenomes of multiple species are profiled in a matched tissue or cell type, we assume that the phylogenetic relationship among these species has been reli- able resolved. The basic problem is given the phylogeny and epigenomic states of the orthologous genome in extant species, can we infer the epigenomic evolution history and, specifically, the time and location of epigenomic divergence events. We use “event” to describe the rise or decline of an epigenetic feature in analogy with genetic evolution events, such as gene duplication. 22 This problem is very similar to those in phylogenetic studies of DNA and protein sequences. Mathematical models for molecular sequence evolution have been successfully employed in these context (Lio & Goldman, 1998). The most widely used model framework is the maximum likelihood-based methods with parametric Markov models for the molecular substitution process (Felsenstein, 1981a). Below, we give a basic summary of this modeling framework. Markov models for molecular sequence evolution Discrete states Suppose a biological sequence has lengthN. Each siten (n = 1;:::;N) has a discrete states n (a random variable) from the alphabet . For example, DNA =fA;T;C;Gg; amino acids =fA;R;N;D; ;W;Y;Vg; methylome =fUnmethylated, Methylatedg; and H3K4me3 =fAbsent, Presentg: Then the biological sequence of interest can be written asS =s 1 s 2 :::s N . Sites evolve independently Assume different sites in the sequence evolve independently from each other. The independence assumption greatly simplifies computation, however, is often vio- lated in reality. In the context of DNA sequence evolution, for example, substitutions are more frequent at the 3rd codon positions, and C!T transition is more frequent when followed by a G. Models that consider such context dependency have been developed (Siepel & Haussler, 2004), which can be viewed as extensions to the basic independent site model. 23 Markov process Assume all sites evolve according to a continuous time Markov process, with substitution rates between pairs of states represented by a transition rate matrixQ jjjj that satis- fies Q ij 0 (i6=j) andQ ii = X j6=i Q ij ; wherei;j2 : This transition rate matrix can be fully parameterized, or simplified in various ways according to assumptions about the evolution process. Given the state i of a site at time t 0 , the probability distribution of the state of this site at timet 0 + t is thei th row in the transition probability matrix P (t) = exp(Qt) = 1 X k=0 Q k t k k! : Phylogeny Suppose sequences in extant species evolved from a common ancestor following a phylogenetic tree structure =fV;Eg, whereV are tree vertices indicating ancestral and extant species, andE are directed tree branches linking an internal vertex to vertices representing its direct descendants. A branching event in the phylogenetic tree represents the speciation of lineages from an ancestral species. Each edge (u;v)2E is associated with a branch length ` uv , representing the evolutionary time that species v have evolved since its speciation from its ancestor u. After speciation, sequences on parallel branches evolve independently. Initial distribution To express the state distribution of a biological sequence along the phyloge- netic tree, we also need to specify the initial state distribution = ( i ) 1jj at the root species, i.e. the last common ancestor of extant species. Together, a phylogenetic model for molecular sequence evolution is represented by its parameter space =f;Q;;g: Complete data likelihood Suppose the states at all sites in all species in the phylogeny are given. Letv n be the state of then th site in speciesv, for anyn = 1;:::;N andv2V. We denote 24 the complete state variable set withY =fv n :v2V;n = 1;:::;N:g. The likelihood function of the model parameter can be expressed as follows. L(jY ) = N Y n=1 L(jY n ) = N Y n=1 rn Y (u;v)2E P (` uv ) unvn : (2.1) Computation of the complete data likelihood has time complexityO(NjVj). Model estimation through maximum likelihood approach is to find an estimate ^ for the model parameter such that ^ = arg max L(jY ): Incomplete data likelihood and pruning algorithm States at ancestral species are usually unobservable, and only states at extant leaf species are observed. Let X denote states that are observable and Z denote states at unobserved sites. Computing the incomplete data likelihood involves summing over all possible combinations of states at unobserved sites. L(jX) = X Z L(jX;Z) = N Y n=1 X Zn L(jX n ;Z n ): Felsenstein (1981a) proposed a dynamic programming algorithm called “pruning” for com- puting the likelihood function with incomplete data. The pruning algorithm is explained here for computing the observed (incomplete) data likelihood at a single siten. LetX n (u) be set of states at siten at leaf nodes that are descendants of nodeu, and it follows thatX n (r) = X. Denote the conditional probability of observing statesX n (v) at terminal descendants of nodev given that the parentu has statej with p j (v n ) = Pr(X n (v)ju n =j; ): (2.2) 25 For notational convenience, define q k (v n ) = 8 < : Pr(v n =k) ifv is a leaf node, Q c2child(v) p k (c n ) otherwise. (2.3) The probabilityp j (v n ) can then be written as the recurrence p j (v n ) = X k P (` v ) jk q k (v n ) : (2.4) The likelihood of the observed data for a single site is then L(jX n ) = Pr(X n (r)j) = X j2 j p j (r n ): (2.5) Under the assumption that methylation states at distinct sites evolve independently, the likelihood for observed data at multiple sites is L(jX) = Q N n=1 L(jX n ): Computing the observed data likelihood through the pruning algorithm has time complexityO(N jVjjj 2 ), which is equivalent to linear complexityO(NjVj) as the alphabet size is a small constant. 26 Chapter 3 Models for mammalian methylome evolution In this chapter, we design a phylo-epigenetic model for mammalian methylome evolution. Although the model is primarily designed for DNA methylation, it is suitable for the evolution- ary processes for other epigenetic modifications as well. To make it easy to generalize the model description and the methods from parameter estimation and model inference for other types of epigenetic modifications , we separate the modeling of whole genome bisulfite sequencing data from the modeling of the evolution of epigenomic states. In our phylo-epigenetic model, we make specific choices on which relationships to model and how we model them. Alternative models that take different paths on these choices are discussed at the end of this chapter. 3.1 State space and units of measurement for DNA methylation DNA methylation is often discussed in terms of levels at individual CpG sites, and the level reflects the fraction of cells that have the discrete methyl mark at that site (more accurately molecules with the mark, in the case of non-haploid cells). In multiple studies since 2009, when considering cells that are relatively pure in terms of phenotype, the methylation levels have been observed to fall into two categories: high and low levels. This is seen in the global bi-modal distributions of methyla- tion levels, and when one observes profiles of DNA methylation in a genome browser. There are special cases of intermediate methylation levels. For example, cancers have large domains of par- tial methylation. In addition, imprinted loci have intermediate methylation, with the paternal or the maternal allele methylated through an imprinting control region in most somatic cells. However, 27 for the vast majority of the sites, major phenotypic differences among healthy cells usually involve methylation changing from low to high, or from high to low. For this reason, all our modeling is in terms of low and high methylation states, and we use the corresponding state spacef0; 1g. This allows for a distribution of the observed levels associated with the “low” methylation state, and another distribution for the observed levels at sites occupying the “high” state. Current technology for measuring DNA methylation (e.g. WGBS) can interrogate single nucleotides (in mammals, with specific exceptions, these are cytosines of CpG dinucleotides). In modeling how methylation status relates between species we must use measurements that can be considered orthologous. We devise a strategy to model individual CpG sites directly, even when those sites are not mutually orthologous within all considered species. Below we explain our mod- eling approach in terms of CpG sites, but the same approaches can be applied for any kind of unit to which a methylation level may be assigned. 3.2 Modeling a methylome profiled by whole-genome bisulfite sequencing The observed data from a WGBS experiment are read proportions at individual CpG sites. For a specific CpG site, let the total number of reads covering this site be R, and R C be the numbers of reads containing a C at this site. We are interested in identifying the underlying methylation states of individual CpG sites based on these read proportion data. In preparation for subsequent modeling strategies, we first introduce several related concepts. Methylation state is a binary variable with states s2f0; 1g representing hypomethylation state and hypermethylation state respectively. At single molecule level, the binary methylation characterizes presence of either a 5mC or an unmodified cytosine. This, however, is a simpli- fication of the reality, as multiple forms of cytosine modification have been identified, such as 5hmC, 5fC and 5caC (Kriaucionis & Heintz, 2009; Ito et al., 2011), which are involved in the 28 demethylation pathway and may have functional roles distinct from those of 5mC (Wu et al., 2011; Szulwach et al., 2011). At cell population level, by assuming binary methylation state at a CpG site, we implicitly assume that cells have homogeneous epigenetic state at single-molecule level. The epigenetic state can be understood as the permissive or inhibiting conformation of the local DNA for interactions with other molecules. Methylation probability accounts for the stochastic inter-cell variation in a homogeneous cell population. It is the probability that any molecule from the population is methylated at a particular CpG site, i.e. p2 (0; 1). For CpG sites that are methylated (or hypomethylated) in the vast majority of cells, their methylation probabilities can vary due to heterogeneous sequence context and chromatin structure along the genome. With this consideration, we model methylation probabilities at individual sites with a two-component Beta-mixture distribution. Methylation level is the observed ratioR C =R from a WGBS experiment. Given the methyla- tion probabilityp at a CpG site, the read proportion follows a binomial distributionB(R;p). Posterior methylation probability is the conditional probability that a site has the high methy- lation state (s = 1) given the observed read proportions and parameters of the beta-mixture model. Independent sites Without prior knowledge about the relationship between methylation states at different sites, we can assume that methylation states and, subsequently, read proportions are independent between sites. The read proportion at an individual site follows a two-mixture Beta-Binomial distribution: R C 8 < : BetaBin(R; 0 ; 0 ) , with probability 0 ; BetaBin(R; 1 ; 1 ) , with probability 1 = 1 0 : The read coverageR is site-specific and given by the WGBS experiment. The model parameters =f 0 ; i ; i :i = 0; 1g are shared across all sites in the experiment. 29 Estimation We use the maximum likelihood estimation method to find model parameters. The methylation states at individual sites are hidden variables. We solve this estimation problem with an expectation-maximization (EM) algorithm. In each iteration of the EM algorithm, given the current parameter estimates, in the expectation step (E-step) we compute Q(j (t) ) =E SjfRg; (t) logL(jR) = N X n=1 1 X i=0 Pr(s n =ijR Cn ;R n ; 0 ) log Pr(R Cn js n =i;;R n ): In the maximization step (M-step), we update parameter estimates with (t+1) = arg max Q(j (t) ): Inference The posterior methylation probability at a site given the model parameters 0 and observed read proportion (R C ;R) can be computed as follows: Pr(s = 1jR C ;R; 0 ) = 1 Pr(R C js = 1; 0 ;R) P i=0;1 i Pr(R C js =i; 0 ;R) : Results of application on a sperm methylome Applying this analysis to a rat sperm methylome, we estimated that the mixing proportions for hypomethylated and hypermethylated states are 19% and 81% respectively. The average posterior methylation probability is 80.9%. With a cutoff value of 0.5 on the posterior methylation probability, 85.1% sites were classified as methylated. Figure 3.1 shows the distribution of methylation levels computed from WGBS read proportions, and posterior methylation probabilities estimated using the two-beta-binomial mixture model. The estimated posterior methlyation probabilities give a better separation of CpG sites than methylation levels. 30 Rat sperm R1 Methylation level Density 0.0 0.2 0.4 0.6 0.8 1.0 0 5 10 15 20 Beta−hypo * 0.19 Beta−hyper * 0.81 Rat sperm R1 Methylation state posterior probability Density 0.0 0.2 0.4 0.6 0.8 1.0 0 5 10 15 20 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Rat sperm R1 Methylation level Methylation state posterior probability Figure 3.1: Inferring methylation states from WGBS methylome assuming independent sites. Left: Distribution of methylation levels, computed directly from WGBS read proportions, overlaid with the density curve for the two components in the estimated beta mixture distribution scaled by respective mixing proportions. Middle: Distribution of posterior methylation probability at indi- vidual CpG sites. Right: Smoothed scatter plot showing the correspondence between methylation levels and posterior methylaiton probabilities. Autocorrelated sites From empirical observations, we know that methylomes are organized as focal hypomethy- lated regions interspersed in the hyper-methylated genomic background (Figure 3.2). Hid- den Markov model has been successfully applied to partition hypomethylated regions (HMR) from the background global methylation for primary healthy samples (Molaro et al., 2011; Song et al., 2013). The sequence of hidden methylation states along the genome forms a Markov chain over the state spacef0; 1g with initial distribution = ( 0 ; 1 ) and state transition probability matrixG. For each siten, given the methylation states2f0; 1g, the emission distribution follows a beta- binomial distribution, i.e. R Cn jS n = s BetaBin(R n , s , s ). Together, the methylome profile from WGBS experiment is modeled with a two-state Markov chain with beta-binomial emissions. Results of application on a sperm methylome With this method, the average posterior methy- lation probability of the rat sperm methylome sample is 85.2%. Using 0.5 as cutoff to determine methylation states, we can classify 85.6% CpG sites as methylated (compared with 85.1% using 31 0 200 400 600 800 1000 0.0 0.2 0.4 0.6 0.8 Distance (bp) between neighboring CpG sites Correlation of methylaton levels BCell H1ESC CpG Islands 2 kb GADD45A GADD45A GADD45A H1ESC H9ESC Sperm BCell HMR Methylation level A B Figure 3.2: Methylation states are autocorrelated and organized in blocks (A) A hypomethylated region in human methylomes. (B) Auto-correlation of CpG methylation levels. the independent-site beta-mixture model). Figure 3.3 shows the inference result. The posterior methylation probabilities are distributed tightly close to the extreme values (Figure 3.3 Middle), reflecting the impact of sharing information with neighboring sites through the Markov transition relationships. This smoothing effect of Markov chain model can be also observed from the scatter plot heatmap for the 2-D distribution of methylation levels and posterior methylation probabilities (Figure 3.3 Right): there are sites with extremely low methylation levels and above 0.6 posterior methylation probability. These sites are likely sparsely located within a cluster of heavily methy- lated sites, and the state transition cost is higher than the cost of assigning these sites to methylated state. To answer the question of whether these sites are indeed hypomethylated sites, greater sequencing depth is required. Alternatively, multiple species showing consistent low methylation levels at orthologous sites can also be evidence that these sites are truly hypomethylated, which is similar to increasing the sequencing depth. The posterior methylation probabilities, or the most likely methylation states at individual sites, determined by either of the above two methods, can be used as the input data for a phylo-epigenetic model. Of course, these are not the only ways to arrive at estimates of methylation states from WGBS experiments. With other sequencing or methylation-state profiling technologies, suitable models are required to translate observed data to probabilistic inference of methlyation states. 32 Rat sperm R1 Methylation level Density 0.0 0.2 0.4 0.6 0.8 1.0 0 5 10 15 Beta−hypo * 0.148 Beta−hyper * 0.852 Rat sperm R1 Methylation state posterior probability Density 0.0 0.2 0.4 0.6 0.8 1.0 0 10 20 30 40 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Rat sperm R1 Methylation level Methylation state posterior probability Figure 3.3: Inferring methylation states from WGBS methylome with HMM. Left: Distribution of methylation levels, computed directly from WGBS read proportions, overlaid with beta distribution density curves. Middle: Distribution of posterior methylation probability at individual CpG sites. Right: Smoothed scatter plot showing the correspondence between methyla- tion levels and posterior probabilities of the methylated state. Above, we have described the processing of WGBS data as an isolated problem, because we want to treat the phylo-epigenetic model as a module that can be used for multiple types of epige- nomic profiles, such as chromatin accessibility and different types of histone modifications. Never- theless, the beta-binomial emission distribution can be easily integrated with the phylo-epigenetic model to take WGBS read proportions as input. 3.3 Modeling methylome evolution with independent sites We use a simple application of the maximum likelihood model for molecular sequence evolution reviewed in the previous chapter. This application is simpler than the general model because we assume that the alphabet has two characters (states). Consequently, the eigendecomposition of the transition rate matrix can be easily solved, and the transition probability matrix can be explicitly expressed as a function of the branch length and state transition rates. Here, we restate the model, likelihood function and parameter estimation methods the binary state Markov process as a special case of the general model. 33 We assume that the methylation state of a site is a binary random variable, taking value from f0, 1g, and it evolves according to a continuous-time Markov process. For a methylome sequence of multiple sites, we assume sites in the root node are independent and identically distributed according to an initial distribution = ( 0 ; 1 ). Sites in the methylome evolve independently along a phylogenetic tree following a continuous time Markov chain. Let the transition rate matrix be Q = 2 4 3 5 : The transition probability matrix between two time points separated by time t is determined by two terms (t( +);=). Let + = 1, so that the mutation rate and branch length parameters are identifiable. Thus, the transition probability matrix is P (t) = exp(Qt) = 2 4 1T T (1)T 1 (1)T 3 5 ; whereT = 1 exp(t), soT2 (0; 1) fort> 0. Let =fV;Eg be the phylogenetic tree, whereV is the set of vertices andE is the set of branches. We assume has known topology and unknown branch lengths. The model is repre- sented by its parameter space =f;Q;g: Notations We user to denote the root node of the phylogenetic tree, and useu,v andw to denote consecutive nodes on a lineage, such that (u;v)2E and (v;w)2E. LetN be the total number of sites in the methylome. Let random variableu n 2f0; 1g be the state of siten in speciesu. Let` v be the length of edge (u;v)2E. We usej andk to denote methylation states. In general, we will usej to denote the state at the parent of a node whose state is denoted byk. Let X n (1 n N) be the set of methylation states associated with leaf nodes at site n. X = X 1 :::X N denote the observed methylation states at all leaf nodes. Let X(u n ) be the set 34 of methylation states at leaf nodes that are descendents ofu n . Thus, all observed data at the leaf nodes of siten is represented byX(r n ), which we also denote withX n . Observed data likelihood The likelihood for observations at leaf nodes at multiple sites is L(;X) = N Y n=1 L(;X n ) = N Y n=1 X j2f0;1g j p j (r n ); where p j (r n ) is as defined in Equation 2.2, and is computed through recursions in the pruning algorithm (Equation 2.4). In our implementation, we optimize parameters using gradient descent. The likelihood and gradients are recursively computed in the same spirit of the pruning algorithm (Felsenstein, 1981a). With the MLE of model parameters, we then compute the maximum likelihood reconstruction of the HME at each CpG site. The reconstruction can be achieved through a dynamic programming algorithm with time complexity linear to the number of nodes in the tree (Pupko et al., 2000). 3.4 Modeling methylome evolution with interdependent sites The autocorrelation of methylation states between neighboring sites is an important relationship to be incorporated in modeling the evolution of methylomes at single-base resolution. Here, we develop a model to allow for two processes that jointly describe the observed mammalian methy- lome: one is the process of methylation state inheritance from ancestral species, and the other is the process governing the correlation observed between methylation states at neighboring sites within a species. 3.4.1 Model description As in the independent-site model, we use =fV;Eg to denote the phylogenetic tree relating extant species, with known topology and unknown branch lengths. The inheritance process is defined by 35 Q, as previously introduced. At the root species, the dependence between neighboring CpG sites is described with a discrete-time Markov chain over the state spacef0; 1g, where the time points correspond to CpG sites in the genome. Let the initial distribution be in the root species. Let the transition probability matrix between neighboring CpG sites of the root species be F = 2 6 4 f 0 1f 0 1f 1 f 1 3 7 5 : In non-root species, we also use a transition probability matrix to describe the autocorrelation relationship of CpGs within individual methylomes. The transition probability matrix, G = 2 6 4 g 0 1g 0 1g 1 g 1 3 7 5 ; is assumed to be homogeneous in all non-root species. We use different horizontal transition probability matrices for the root species and non-root species because the horizontal process at non-root species interacts with the vertical inheritance process to determine the genomic distribu- tion of methylation states, while the the horizontal process at root species alone determines the methylation states in the genome. The model parameter space is now =f;;Q;F;Gg: We assume the CpG sites are ordered from 1 toN by their position within the reference human genome, ignoring chromosome boundaries for simplicity. Consider two neighboring CpG sites n 1 andn (1<nN) and a branch (u;v)2E in the phylogenetic tree, withu being the parent ofv. Let random variablev n denote the methylation state of siten in speciesv. We assume the 36 conditional probability of methylation state k is proportional to G ik P (` v ) jk , and thus define the conditional distribution ofv n , given the previous site’s statev n1 and its ancestral stateu n , as: p v (i;j;k) = Pr(v n =kjv n1 =i;u n =j) = G ik P (` v ) jk P k 0 =0;1 G ik 0P (` v ) jk 0 ; wherei;j;k2f0; 1g and 1<nN: (3.1) This conditional probability distribution models the majority of sites, but there are important spe- cial cases: (i) the states of all nodes at positionn = 1, and (ii) the states of all sites at the root node r. For sites at positionn = 1, the methylation state evolution is described by the initial distribution , the continuous-time Markov process with transition rate matrix Q, and the branch lengths in the phylogenetic tree. The methylome of the root species is modeled only with the discrete-time Markov chainF and the initial distribution. 3.4.2 Model learning and inference Complete data likelihood and sufficient statistics Assume the complete-dataY – methylation states at every siten2f1;:::;Ng and in every speciesv2V – are observed, including ancestral species. For convenience below we will use the symbol v n , which we previously defined as a random variable, to denote the observed state of that variable. With model parameters , the complete data likelihood is L(jY ) = Pr(Yj) = Pr(Y 1 j) N Y n=2 F r n1 rn Y u2I Y v2child(u) p v (v n1 ;u n ;v n ): 37 Withu denoting the parent ofv andr denoting the root, we define the following: w j = 1fr 1 =jg (corresponds to site 1 in root species) w jk (v) = 1fu n =j;v n =kg (site 1 in non-root species) w ik = P N n=2 1fr n1 =i;r n =kg (remaining root sites) w ijk (v) = P N n=2 1fv n1 =i;u n =j;v n =kg (all other sites) (3.2) The log-likelihood can then be written as logL(jY ) = X v=r j2f0;1g w j log j + X v6=r j;k2f0;1g w jk (v) logP (` v ) jk + X i;k2f0;1g w ik logF ik + X v6=r i;j;k2f0;1g w ijk (v) logp v (i;j;k): (3.3) Therefore, the following are sufficient statistics for our model parameters: W = w j ;w ik ;w jk (v);w ijk (v) :i;j;k2f0; 1g;v2Vnfrg : Among these statistics,fw ijk g represent the vast majority of information from the data, while fw j g andfw jk g carry negligible information. Under complete data, we can use these sufficient statistics to efficiently compute the MLE for model parameters by numerical methods. As with the independent-site model, our implementation also uses gradient ascent to arrive at the MLE of parameters in the dependent-site model. Learning model with EM algorithm from incomplete data In reality, observations about DNA methylation states are only available at a subset of sites in extant species. Given the model param- eter and observed statesX, we can obtain expected values of the sufficient statistics E ZjX; W . 38 Given the sufficient statistics in (3.2), we can derive the MLE of model parameters by maximiz- ing the complete-data likelihood in (3.3). These two processes are exactly the two steps in an EM algorithm. In the E-step, we computeQ(j (t) ), which is defined as the expected value of the complete-data log-likelihood logL(jX;Z) with respect to the unknown dataZ given the observed dataX and parameter estimates (t) : Q(j (t) ) = E ZjX; (t) logL(jX;Z) = X v=r j2f0;1g E(w j jX; (t) ) log j + X v6=r j;k2f0;1g E(w jk (v)jX; (t) ) logP (` v ) jk + X i;k2f0;1g E(w ik jX; (t) ) logF ik + X v6=r i;j;k2f0;1g E(w ijk (v)jX; (t) ) logp v (i;j;k): (3.4) The E-step is equivalent to computing the expected value of WjX; (t) as explained below. The maximization (M) step can be expressed as: (t+1) = arg min Q(j (t) ): E-step and Markov chain Monte Carlo approximation In the E-step, we use Markov chain Monte Carlo (MCMC), specifically Gibbs sampling, to estimate the expectation of the sufficient statistics given the model parameters and observations at leaf nodes. Recall that we used Z to denote (unobserved) methylation states at internal nodes in previous sections, and assumed that observation at leaf nodes are complete. Here we describe the algorithm for a more general situation, where observations at a number of leaf nodes may be missing at a number of sites. Hereafter, we useZ to denote the methylation states that are not observed, including all sites from internal species and sites with missing data at leaf species. We useX to denote all the observed methylation states, which must come from leaf nodes. We apply MCMC to approximate the conditional distribution of 39 Pr(ZjX;). LetZ (t) be thet th sample in the chain. LetMB(v) be the Markov blanket of nodev in the Bayesian network. Given the states of all variables in MB(v) asb, the conditional distribution Pr(v =ijMB(v) =b) can be computed easily (see Table 3.1). Let MB(v;t) denote the states of nodes in MB(v) inZ (t) . The sampling procedure forZ (t) is as follows: 1. Draw a starting sampleZ (t=0) , which is a specific initiation of the states for all nodes with unobserved methylation states. 2. Fort = 1; 2;::: : Iterate over all sites from site 1 to siteN. At each site, iterate over the nodes of the phylo- genetic tree according to a post-order traversal. If the node has an unobserved methylation statez2Z: (a) Make a proposal to change the state ofz to:z prop = 1z (t1) . (b) Accept the proposalz (t) =z prop with probability = Pr(z prop jMB(z;t 1)): (c) If the proposal is rejected, letz (t) =z (t1) . The chainfZ (t) g is positive recurrent and aperiodic on a finite state space. It follows that the chain is uniformly ergodic (Roberts & Polson, 1994). The same are true for the chain of sample statisticsfW (t) g derived fromfZ (t) g. Theories about MCMC (Theorem 3.1 of (Gilks et al., 1995)) guarantees that Pr 1 T T X t=1 W (t) ! E ZjX; W = 1: 40 Our goal in the E-step is to approximate E ZjX; W using MCMC samples. For any scalar sample statisticw2W , the output from MCMC is summarized in terms of ergodic averages of the form w t = P t i=1 w (i) t : The Markov chain central limit theorem states that p t( w t E ZjX; [w]) d !N(0; 2 w ); ast!1; where w is a constant givenX and. Given an estimate of w , we can form a confidence interval for E ZjX; [w], and continue sampling until the confidence interval is sufficiently small. We adopt a fixed-width stopping rule for the Markov chain based on a consistent batch means method, i.e. the CBM method referred to by (Jones et al., 2006). See Appendix A.2 for details. M-step Let ~ Q(j (t) ) be an approximation to Q(j (t) ) with E ZjX; (t)W substituted by the MCMC approximation ~ W . We use gradient ascent to maximize ~ Q(j (t) ) = X v=r j2f0;1g ~ w j log j + X v6=r j;k2f0;1g ~ w jk (v) logP (` v ) jk + X i;k2f0;1g ~ w ik logF ik + X v6=r i;j;k2f0;1g ~ w ijk (v) logp v (i;j;k): Partial derivatives required for parameter optimization are shown in Table 3.2. Summary of model inference procedure Together, the procedure for model parameter estima- tion and ancestral state reconstruction is summarized as below: 1. Choose start point for model parameters (t) . 2. Iterate the following EM procedure 41 E-step: Use Gibbs sampling to approximate E ZjX; (t)W . M-step: update model parameters to (t+1) . until convergence:k (t+1) (t) k<. 3. Given the final model parameter estimates, generate MCMC samples as in the E-step. After discarding the first half of samples, construct marginal posterior distribution with the second half of samples at individual sites. Use the MAP estimates as the methylation states at individual sites. 3.4.3 Results on simulated data We implemented a software package Epiphyte hosted on github. In this section, we report the performance of this implementation, and evaluate the impact of phylogenetic tree structure, scale and noise in the observations. We simulated an methylome evolution process for a hypotheti- cal genome containing 70k CpG sites in 400 blocks. The concept of block is to accommodate non-homologous chromosomes or fragments of chromosomes that can be assumed to evolve inde- pendently. The parameter values used for simulation are as follows: = (((S4 : 0:02;S5 : 0:02)S3 : 0:05; (S7 : 0:04;S8 : 0:04)S6 : 0:03)S2 : 0:01;S9 : 0:04)S1; F :f 0 = 0:9;f 1 = 0:95; G :g 0 = 0:85;g 1 = 0:8; = 0:4 0 = 0:485: In the phylogenetic tree, species are numbered according to pre-order traversal. We use the index of each non-root species to also denote the branch that leads to it. This toy phylogenetic tree contains two parallel clades of species with different time points of existence for their respective last common ancestors. Species S9 can be considered an out-group species, which is often included in phylogenetic studies to pivot the divergence time of in-group species. Data simulation To simulate methylomes under the interdependent-site assumption, we gener- ate entire methylomes species by species, starting from the root species. The root methylome is 42 simulated according to the Makrov chain (;F ). Then, in depth-first traversal order, we simulate the methylome at each descendant species. The methylation state of a site follows a Bernoulli dis- tribution given the methylation states at its previous neighboring site and its parent site, which is determined by mutation rates and branch lengthfQ;`g and the horizontal transition probabilities G. The simulated methylomes are binary states at a sequence of sites. Model learning with complete methylation state data First we examine the accuracy of param- eter estimation when complete data is given. We generated 100 independent data sets with the model parameters specified above. Our implementation of the maximum likelihood parameter optimization provide unbiased estimates (Figure 3.4A). MCMC convergence with input leaf methylation states We used methylation states at leaf species as input data, and monitored the MCEM iterations on one simulated dataset. Figure 3.5 shows the trace plot for a subset of sufficient statistics. Within each iteration, the MCMC sample statistics converge quickly, and over iterations converge to approximate the true statistic values of the simulated complete dataset. This is true for branches located in different parts of the phylo- genetic tree, for example internal branch #2 is closest to the root and furthest from observed data, internal branch #6 is located in the middle section of the tree, and external branch #8 has observed methylation states at the descendant end of the branch. Model learning and inference with methylation states at leaf nodes When methylation states are only observed at leaf species, we use MCEM algorithm for parameter estimation. The esti- mates are also accurate and unbiased for all model parameters (Figure 3.4B). With the parameter estimates, we used MCMC sampling method to estimate the posterior methylation probabilities at all sites of the internal nodes, and used the MAP estimates as the inference for methylation states at sites without observations. The accuracy of methylation state estimates is shown in Figure 3.4C. The accuracy (sensitivity) is above 99% for both the hypomethylated state and the hypermethy- lated state at all internal species. However, there is a difference between internal species that are 43 Figure 3.4: Performance on simulated data. Performance is tested on five scenarios with different amount of missing data and observational errors. closer to external species and those that are further from external species. For example, species S3 and S6, though having the same height in the phylogenentic tree, have different “distances” from extant species. Our method had better accuracy of methylation state inference at species S3, 44 Figure 3.5: Trace plot for example sufficient statistics. Sufficient statisticsw ijk (v) is shown for branches associated with nodev = 2; 6; 8. Gray dashed lines indicate starts of EM iterations. Horizontal lines mark the true statistic value from the simu- lated complete data set. which is closer to leaf species, than species S6. Next, we examined the accuracy of the infer- ence of evolutionary events. The evolutionary events are separated into four categories for each pair of parent-child species, which are conserved hypomethylation, conserved hypermethylation, hypo-to-hypermethylation (methylation gain) and hyper-to-hypomethylation (methylation loss). Our inference of conservation events showed high accuracy (> 99%). Inference of evolutionary changes had lower accuracy, at around 80% along all branches, except for branch #2, which is the branch between the root node and the highest non-root internal node. However, the branches linking the last common ancestor of the in-group species with the out-group species are often not of interest in evolutionary analysis. Model learning and inference with noisy observations at leaf nodes The methylation states at leaf species are not directly observed, but inferred from experimental evidence, such as WGBS read proportions. Therefore, we examined the impact of imperfect observations on the performance of model learning and inference. On top of the simulated binary valued methylation states, we added independent and identically distributed noises form a folded normal distribution bounded between 0 and 1. The noise term is added to 0, and subtracted from 1, depending on the simulated binary methylation states. The noisy observations represent uncertainty about the methylation states at 45 leaf nodes. We simulated noise at 3 different levels, with the folded normal scale parameter taking values of 0:05; 0:1; 0:2, representing increasing levels of noise. Adding noise to observations had a huge impact on model learning, especially on the esti- mates of external branch lengths and the horizontal transition parameters for non-root species (Figure 3.4DGJ). The external branch lengths were substantially overestimated, with more devia- tion from the true values under higher level of leaf observation uncertainty. The horizontal transi- tion parameters for non-root species (g 0 ;g 1 ) were underestimated, and also deviated more from the true values under higher level of certainty. However, the relative order of terminal branch lengths was preserved in the estimates. In contrast to terminal branches, lengths of internal branches were only slightly underestimated. The rate parameter was over estimated, which may be a case-specific phenomenon. As for inference of methylation states at internal species, noisy leaf observations caused the accuracy to decrease (Figure 3.4E, H, K). Interestingly, internal species higher up in the phyloge- netic tree suffered less than those closer to leaf nodes. This is also true for conservation events. However, for methylation gain and loss events, the accuracy stayed relatively stable for terminal branches, and decreased more for internal branches further away from leaf species. The model inference process overestimates terminal branch lengths to account for uncertainty in leaf methylation state observations. We re-estimated model parameters with the MAP methy- lation states at all species as if complete data were given. At low noise level ( = 0:05), model parameters were reliably recovered except for lengths of branches linking the last common ances- tor of in-group species with out-group species (Figure 3.4F). At higher noise level, the terminal branch length estimates and non-root horizontal transition parameters also regressed towards their true values. However, estimates for internal branches worsened (Figure 3.4IL), quite severely at the highest noise level ( = 0:2) examined. Branch length scaling and interpretation In phylogenetic models based on continuous Markov chain, the branch lengths have an interpretation as the expected number of substitutions per site 46 per unit time (i.e. unit branch length). The mutation rate matrix and branch lengths can be scaled so that a unit branch length represents 1 expected substitution. For the independent-site phyloepi- genetic model, the scaling factor is indep = 0 + (1 0 ) (1): For the interdependent site phyloepigenetic model, the scaling factor is dep =p 00 G 01 G 00 +p 10 G 11 G 10 +p 01 G 00 G 01 (1) +p 11 G 10 G 11 (1); where p ik (i;k 2f0; 1g) is the proportion of neighboring-site pairs with the pattern ik, which we estimate using the complete ancestral and extant species methylation patterns inferred by the model. To illustrate the consistency between the two models, and to show the tree scaling process, we apply the models on the same real dataset. Because the independent-site model (INDEP) only consider orthologous sites across species, we used 200bp bins as “sites” or unit of methylome evolution. Only bins that cover at least one WGBS-profiled CpG in each of the seven species are included. Single CpG sites located in the same bin are pooled together for classifying the methylation state in each extant species. BecauseEpiphyte treats isolated site in the same way as the site-independent model, to contrast the results of these two methods, we removed isolated bins from the dataset. There are a total of 270,488 bins, which are separated into 17,632 continuous blocks. The estimated parameter from the INDEP model is (^ ; ^ = 0:55; ^ 0 = 0:24), where is ((((Human:0.0262,Chimp:0.0387):0.00941,Gorilla:0.0495):0.0311, Rhesus:0.0582):0.0578, (Mouse:0.0983, Rat:0.0989):0.0822, Dog:0.0854):0; 47 Human Chimp Gorilla Rhesus Mouse Rat Dog EPIPHYTE Human Chimp Gorilla Rhesus Mouse Rat Dog INDEP 0.01 Figure 3.6: Phylogenetic trees estimated by independent-site and interdependent-site models The scaling factor is ^ indep = 0:48. The estimated parameter fromEpiphyte is (^ ; ^ = 0:44; ^ g 0 = 0:91; ^ g 1 = 0:94; ^ 0 = 0:19), where is ((((Human:0.0601,Chimp:0.0703):0.0221,Gorilla:0.0899):0.0589, Rhesus:0.117):0.134, (Mouse:0.152,Rat:0.152):0.162, Dog:0.213):0; The scaling factor is dep = 0:27. After scaling, the two trees (Figure 3.6) have consistent scale and shape, with 0.91 Pearson’s correlation of the branch lengths. 3.4.4 Results on real data Detailed comparative analyses of mammalian sperm methylomes are presented in Chapter 4. Here we show a specific result about inferring methylation states at extant species and ancestral species to highlight the strength of our phylo-epigenetic model. The structure of our proposed model contain links between neighboring sites within the same species and links between orthologous sites through the phylogeny relationship, which allows information about methylation states to be shared between different sites both within and between 48 species. This is an important feature for studying the evolution of mammalian methylome at high resolution, as the majority of CpGs have diverged among different species (see Table 4.2). When identifying HMRs from a single methylome using hidden Markov model, the posterior decoding usually result in huge amount of hypomethylated domains consisting of a single or very few CpG sites. We often filter out such regions using an false discovery rate (FDR) control proce- dure, implicitly assuming they are due to experimental noise. This treatment likely overlooks small but truly hypomethylated regions in the biological sample. Replicates are often used to distinguish true signals from noise. If multiple independent replicates all support the hypomethylation state of the same site, then we might hypothesize that the epigenetic state of this site is important to cellular processes or functions. In phylo-epigenetic studies, different species can be viewed as replicates in a more general sense. A regulatory region in one species that is overlooked by single-methylome analysis because of very few supporting CpG sites can be rescued by incorporating supporting sites in the orthologous regions in other species. Figure 3.7 shows a region from genePLEKHM2 on chromosome 1. This region is orthologous across all 7 species, and doesn’t contain repeat elements or CpG islands. Controlling FDR at 0.01, we did not identify any HMRs from this region in any species. However, looking at methylation levels at individual CpG sites, we find one CpG site with 17x coverage in one replicate and 31x coverage in another replicate and showing very low methylation level in both replicates. This seem to be good evidence that this CpG site is hypomethylated in human sperm. This CpG site is conserved in chimpanzee and gorilla, but not in rhesus. In the orthologous region (dashed box) of the other 3 primate species, there are 3 consecutive CpG sites in each species that have very low methylation levels. If we had required CpG sites to be conserved in all species to be included in the analysis, this region would have been eliminated. Fortunately, our model allows for information sharing both within and across species. As a result, in the final inference of methylation states in both extant and internal species, we are able to identify a Catarrhini-conserved HMR in the intronic region immediately upstream of an exon in the gene PLEKHM2. This gene is involved in Golgi organization and lysosome localization, and is expressed in brain, muscle and testes at 49 a higher level than in other tissues (Lonsdale et al., 2013). Furthermore, CpG sites in this region are methylated in blood methylomes of these 4 primates, and many other somatic methylomes of human, including brain, muscle, lung and ESC. Although it is unclear whether hypomethylation of this region has regulatory significance, methylomes from multiple species offer strong evidence that the hypomethylated state is conserved across human and rhesus, and appears to be sperm- specific. 3.5 Alternative models for epigenome evolution To model epigenome evolution, the independent-site assumption needs to be relaxed because of the well characterized autocorrelation of epigenetic marks, especially when modeling at high- resolution, focusing on individual sites. Introducing dependence between neighboring sites as we described above is a major improvement over independent-site models. It not only captures the biological autocorrelation, also permits effective means of solving the missing data problem due to the substantial divergence of CpG sites between species. The model we proposed for methylome evolution is a Bayesian network, where vertical edges exist between a site in one species and the homologous site in the parent species, and describe the causal relationship between ancestor and descendant methylation states. Inter-site dependence is modeled with directed edges between neighboring sites within the same species. The probability mass function can be factorized into conditional probabilities of a descendant site given its previous site and its ancestral site (Figure 3.8A). Our formulation of the conditional probability distribution offers flexibility at combining the horizontal dependence and vertical inheritance relationships. For example, wheng 0 = g 1 = 0:5, the methylation state evolution processes at different sites are independent, while the states at neighboring sites in a non-root species are still correlated as a result of the correlation of the homologous sites in the root species. When the rows ofF are the initial distribution , in addition to g 0 = g 1 = 0:5, this special case is equivalent to the independent- site model. Potential extensions can be made to the modeling of horizontal processes to capture 50 hg19 chr1: 200 bp 16,047,500 16,048,000 PLEKHM2 Human Hominini Chimp Homininae Gorilla Catarrhini Rhesus Mouse Murinae Rat Boreoeutheria Dog 100 Vert. Cons 4.88 - -4.5 _ 0 - Euarchontoglires methylation level posterior hypermethylation probability HMR CpG site Figure 3.7: Inference of ancestral methylomes reveal small but conserved HMRs 51 a b ab a b ideal phylo-epigenetic model a b context-dependent model (2-tuple) a b phylo-epigenetic model general Markov model ... ... A B C D Figure 3.8: Graphical models for epigenome evolution Main factors in probability mass function factorization. Speciesa is the parent of speciesb in the phylogenetic tree. Circles represent methylation states at individual sites of individual species. Tri- angles represent tree-wise methylation states at individual sites. Vertex setx i contains all vertices at sitei. In the chain graph model,x a ,x b are the vertex sets corresponding to the chain components for speciesa andb, containing all sites in each species. more characteristics of the methylome and its evolution dynamics, which we did not pursue in this study. For example, the horizontal process can be modeled with a continuous time Markov chain to capture the inverse-relationship between pair-wise methylation state correlation and inter- site distance. In addition, different horizontal processes can be assumed for individual non-root species to capture lineage-specific property of local correlation. In the context of genomic evolution, sequence context-dependent mutation rates have been incorporated to extend the standard phylogenetic models. The standard phylogenetic model and many of its extensions can be considered as graphical models. Siepel & Haussler (2004) intro- duced a context-dependent phylogenetic model to allow mutation rates to vary depending on the identity of neighboring nucleotides. This model is equivalent to a Bayesian network where factors in the density function are conditional probabilities of a descendant site given its parent site, pre- vious parent site and its previous site (Figure 3.8B). The conditional distributions are derived from 52 a continuous-time Markov model for dinucleotide evolution. Computing the data likelihood of observations at leaf species of the phylogenetic tree is intractable. Several approximation methods have been applied. For example, Siepel & Haussler (2004) originally ignored the dependencies between ancestral states of neighboring sites and simplified likelihood computation to that of an (k-1)st order Markov chain for sequences in extant species when the substitution process is jointly modeled for k-tuples; Jojic et al. (2004) used structured variational methods to approximate the data-likelihood, and showed that preserving the phylogenetic tree structure at individual sites pro- vides a better approximation. An application of the 2-tuple context-dependent model to epigenome evolution is introduced in section A.3, together with model learning and inference methods that are also based on MCMC sampling. Both of these two Bayesian network models are special cases of a general Markov model for transitions between neighboring sites over tree-wise epigenomic states (Figure 3.8C). The transi- tion probability between tree states can be expressed by the product of local conditional probabil- ities at individual nodes of the phylogenetic tree. Compared with a general Markov model, these two Bayesian network models are expressive of the phylogenetic relationship between species and dramatically reduce the number of parameters involved in the tree-state transition probabilities fromO(2 2jVj ) toO(jVj). Bayesian networks are generative models, which allows for straightfor- ward data simulation and likelihood computation. However, the relationship between neighboring sites in the genome is not causal. Assuming causal relationship between neighboring sites may induce an asymmetry between sequences of states following opposite directions along the genome in descendant species. An undirected graphical model for context-dependent epigenome evolution model is described in Section A.4. Ideally, the epigenome evolution can be modeled with a chain graph (Lauritzen, 1996). Each node in the graph corresponds to the methylation state of a site in a species. Undirected edges exist between neighboring sites within the same species to describe the autocorrelation of methylation states. Directed edges link each site in each internal species to its descendants in the child species, and describe the causal relationship between parent and descendant sites. A generative model for 53 such a chain-graph structure would involve dynamic processes that generate samples iteratively until convergence to some kind of equilibrium. In the context of modeling epigenome evolution, for example, this equilibrium may be measured by the distribution of the number and sizes of HMRs. This iterative dynamic process allows a descendant site to “see” past the immediate neighboring sites and its direct parent site in the chain graph and incorporate information from much wider range of sites in both the parent species and the descendant species. Compared with the generative process for Bayesian network models based on conditional probabilities, a generated chain-graph sample is a more realistic picture of epigenome evolution. However, factorization of a probability function on this graph involves potentials over all complete subsets of the complete graph induced by all nodes in a chain component (Lauritzen, 1996). In contrast with Bayesian network models, even computing complete data likelihood for this chain-graph is intractable because each chain component may inflict state space of size exponential to the number of sites in the genome. 54 If the current nodev =r is the root: n = 1 MB(v n ) =fc n :c2 child(v)g[fv n+1 g Pr(v n =ijb)/ Q c2child(v) Pr(c n jv n =i) n =N MB(v n ) =fc n1 ;c n :c2 child(v)g[fv n1 g Pr(v n =ijb)/ Pr(v n =ijv n1 ) Q c2child(v) p c (c n1 ;i;c n ) 1<n<N MB(v n ) =fc n1 ;c n :c2 child(v)g[fv n1 ;v n+1 g Pr(v n =ijb)/ Pr(v n =ijv n1 ) Pr(v n+1 jv n =i) Q c2child(v) p c (c n1 ;i;c n ) If the current nodev2L is a leaf: n = 1 MB(v n ) =fv n+1 g[fu n ;u n+1 g Pr(v n =ijb)/ Pr(v n =iju n )p v (i;u n+1 ;v n+1 ) n =N MB(v n ) =fv n1 g[fu n g Pr(v n =ijb)/p v (v n1 ;u n ;i) 1<n<N MB(v n ) =fv n1 ;v n+1 g[fu n ;u n+1 g Pr(v n =ijb)/p v (v n1 ;u n ;i)p v (i;u n+1 ;v n+1 ) If the current nodev = 2L[frg is internal: n = 1 MB(v n ) =fc n :c2 child(v)g[fv n+1 g[fu n ;u n+1 g Pr(v n =ijb)/ Pr(v n =iju n )p v (i;u n+1 ;v n+1 ) Q c2child(v) Pr(c n =ijv n =u) n =N MB(v n ) =fc n ;c n1 :c2 child(v)g[fv n1 g[fu n g p b (v n ) =p v n1 p un Q c2child(v) p c n1 p cn Pr(v n =ijb)/p v (v n1 ;u n ;i) Q c2child(v) p c (c n1 ;i;c n ) 1<n<N MB(v n ) =fc n ;c n1 :c2 child(v)g[fv n1 ;v n+1 g[fu n ;u n+1 g Pr(v n =ijb)/p v (v n1 ;u n ;i)p v (i;u n+1 ;v n+1 ) Q c2child(v) p c (c n1 ;i;c n ) Table 3.1: The Markov blanket and probabilities involved in MCMC sampling procedure. These are broken down for each of the 9 separate cases involving combinations offroot, leaf, internalg nodes andffirst, internal, lastg sites. In each expression where it appears, i2f0; 1g is the methylation state ofv n , b2 B(v n ) is the joint state set for MB(v n ), the indicated Markov blanket. In our notation, the nodeu is the parent of nodev. 55 @ ~ Q @ 0 = ~ w 0 0 ~ w 1 1 0 @ ~ Q @f i = ~ w ii f i ~ w i(1i) 1f i , i = 0; 1 @ ~ Q @g i = X v6=r j;k2f0;1g ~ w ijk (v) ( (1) k G ik P k 0 2f0;1g (1) k 0 P (` v ) jk 0 P k 0 2f0;1g G ik 0P (` v ) jk 0 ) ; i = 0; 1 @ ~ Q @ = X v6=r j;k2f0;1g ~ w jk (v) (1) 1k T v P (` v ) jk + X v6=r i;j;k2f0;1g ~ w ijk (v)T v ( (1) 1k P (` v ) jk P k 0 2f0;1g G ik 0(1) 1k 0 P k 0 2f0;1g G ik 0P (` v ) jk 0 ) @ ~ Q @T v = X v6=r j;k2f0;1g ~ w jk (v) 1j ( 1) j (1) 1k P (l v ) jk + X v6=r i;j;k2f0;1g ~ w ijk (v) 1j ( 1) j 8 > < > : (1) 1k P (l v ) jk P k 0 2f0;1g G ik 0(1) 1k 0 P k 0 2f0;1g G ik 0P (` v ) jk 0 9 > = > ; ; whereT v = 1 exp(l v ); v2Vnfrg Table 3.2: Partial derivatives with respect to model parameters required in the maximization step of EM algorithm (Section 3.4.2). 56 Chapter 4 Evolution of mammalian sperm methylomes We generated whole-genome bisulfite sequencing (WGBS) data for sperm cells from gorilla, dog and rat. WGBS methylomes of sperm from human, chimpanzee, rhesus monkey and mouse were produced in previous studies (Molaro et al., 2011; Lu et al., 2015; Hammoud et al., 2014). For each species, the sequencing had a total depth above 10, and covered the entire mappable fraction of the corresponding reference genome. All species exhibited global methylation in sperm, with genome-wide average DNA methy- lation levels ranging between 0.68 and 0.79 (Table 4.1), consistent with previous observations (Molaro et al., 2011; Molaro et al., 2014). The methylation levels at single CpG sites formed a bimodal distribution (Figure 4.1). As has been widely observed, mammalian genomes are globally methylated, with interspersed hypomethylated regions (HMR) showing distinct bound- aries and marking important regulatory elements, such as promoters and enhancers. Variation in the methylation state of these regions, as well as changes in the location of HMR bound- aries, have been shown to correlate with changes in gene expression and may collectively reflect differences in the gene regulatory programs (Hodges et al., 2011; Schlesinger et al., 2013; dos Santos et al., 2015). Therefore, the main units of our analysis are sperm HMRs, rather than individual CpG sites. We identified HMRs from individual methylomes using a hidden Markov model as previously described (Molaro et al., 2011). To study epigenomic evolution across species, we aligned each of the non-human methylomes to the human reference genome at single CpG resolution (methods in section A.6). The genomic composition of the 7-way orthologous genome is shown in Figure 4.2. The orthologous genome contains, on average, 27% (6.7 million) of the total CpG sites in individual species. The majority of the CpG sites have diverged between species, and only 0.4 million CpG sites are conserved 57 Type Species Assembly Tissue Level Depth #HMR HMR size (bp) Mean Median Sperm Human hg19 Sperm 0.734 70.7 93384 1705.2 1205 Chimp panTro4 Sperm 0.683 20.5 71271 1706.2 1189 Gorilla gorGor3 Sperm 0.687 11.8 73650 1693.7 1245 Rhesus rheMac3 Sperm 0.780 22.0 72189 1269.3 915 Mouse mm10 Sperm 0.786 32.0 74681 1617.2 1264 Rat rn5 Sperm 0.746 16.5 77699 1416.6 1178 Dog canFam3 Sperm 0.741 90.2 67310 1303.0 880 Somatic Human hg19 B cell 0.737 11.8 52685 1054.3 701 Chimp panTro4 B cell 0.742 7.4 48813 1017.5 678 Gorilla gorGor3 Whole blood 0.681 21.3 56442 1024.4 703 Rhesus rheMac3 PBMC 0.741 65.1 62529 910.7 592 Mouse mm10 B cell 0.797 42.2 59079 938.0 596 Rat rn5 Left ventricle 0.827 63.5 55575 924.3 688 ESC Human hg19 ESC 0.792 27.4 36349 1156.0 817 Mouse mm10 ESC 0.793 12.0 40480 1292.1 995 Table 4.1: Summary statististics about methylomes Original assembly hg19 7-way orthologous (hg19) Species Assembly Total #CpG Aligned* # CpG # Common # HMR mean size Human hg19 28,299,638 7,393,913 381,845 25,780 1463.0 Chimp panTro4 26,598,660 21,342,779 6,816,494 25,538 1393.4 Gorilla gorGor3 28,451,114 20,080,410 6,725,653 25,438 1350.5 Rhesus rheMac3 25,642,482 21,335,729 7,093,267 22,875 1323.2 Mouse mm10 21,342,779 9,037,130 5,638,326 33,336 1138.8 Rat rn5 24,713,205 10,218,023 6,530,333 36,690 1073.4 Dog canFam3 21,342,779 13,876,401 7,139,271 22,388 1254.8 Table 4.2: Aligning methylomes across all species. The number of HMRs located in the orthologous genome ranges between 22k and 37k, with the average HMR size ranging between 1 kb and 1.5 kb (Table 4.2). 4.1 Lineage-specific features For a global view of the relationships between sperm methylomes, we clustered the sperm methy- lomes along with somatic methylomes based on pair-wise correlation of average methylation levels 58 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 0.00 0.05 0.10 0.15 0.20 0.25 0.30 Methylation level Proportion ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● human chimp gorilla rhesus mouse rat dog Figure 4.1: Distribution of sperm methylation levels Distribution of single CpG methylation levels in each species. CpGs with less than 10 coverage were excluded. Y-axis shows the proportions of CpGs in 25 non-overlapping methylation level intervals. human genome 7-way other repeat intron repeat intron non-repeat exon Figure 4.2: Distribution of genomic context Total size and genomic context composition of the 7-way orthologous is shown in proportion to human genome. 59 mouse (1) mouse (4) mouse (2) mouse (3) rat (1) rat (2) dog (3) dog (1) dog (2) rhesus gorilla human (2) human (1) human (3) human (4) chimp (1) chimp (2) rat rhesus (1) rhesus (2) gorilla chimp human mouse (1) mouse (2) mouse human (1) human (2) 1.0 0.9 0.8 0.7 0.6 Correlation (7-way orthologous 200bp bins) sperm ESC somatic (diverse types) A Correlation mouse (1) mouse (3) mouse (2) mouse (4) rat (1) rat (2) dog (3) dog (1) dog (2) rhesus gorilla human (2) human (1) human (3) human (4) chimp (1) chimp (2) rat mouse mouse (1) mouse (2) human (1) human (2) rhesus (1) rhesus (2) gorilla chimp human 1.0 0.9 0.8 0.7 Promoter mouse (1) mouse (2) mouse (3) mouse (4) rat (1) rat (2) dog (3) dog (1) dog (2) rhesus gorilla human (2) human (1) human (3) human (4) chimp (1) chimp (2) human (1) human (2) rhesus (1) rhesus (2) gorilla chimp human rat mouse mouse (1) mouse (2) 1.0 0.8 0.6 0.4 Repeat mouse (1) mouse (4) mouse (2) mouse (3) rat (1) rat (2) dog (3) dog (1) dog (2) rhesus gorilla human (2) human (1) human (3) human (4) chimp (1) chimp (2) human (1) human (2) rhesus (1) rhesus (2) gorilla chimp human rat mouse mouse (1) mouse (2) 1.0 0.8 0.6 0.4 Other Correlation Correlation B Figure 4.3: Hierarchical clustering of mammalian sperm and somatic methylomes (A) Clustering results on whole orthologous genome. (B) Clustering results separated based on genomic context. in 200bp bins (Figure 4.3). There was a clear separation between sperm and somatic methylomes, indicating that tissue-specific characteristics of sperm DNA methylation are conserved across species. In each species, HMRs were more abundant and larger in sperm than in the somatic cells examined (Figure 4.4, 4.5). They also overlapped retrotransposons and pericentromeric 60 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 40k 50k 60k 70k 80k 90k 1.0 1.2 1.4 1.6 Number of HMRs Mean HMR size (kbp) chimp chimp dog gorilla gorilla human human human mouse mouse mouse rat rat rhesus rhesus Figure 4.4: Sperm HMRs are more abundant and wider than somatic HMRs satellites more frequently in sperm than in ESCs and somatic cells (Table 4.3, Figure 4.6). We previously observed these sperm-specific epigenomic features in human, chimpanzee and mouse (Molaro et al., 2011; Molaro et al., 2014). Our current results demonstrate that these char- acteristics are conserved more broadly across mammals. 4.2 Methylation features are conserved at promoters and divergent distally We divided HMRs of each species located in the 7-way orthologous genome into promoter HMRs and non-promoter HMRs. The total number of promoter HMRs were comparable across species, while the non-promoter HMRs were much more abundant in rodents than other species (Fig- ure 4.7). We identified conserved HMRs by taking the intersection of HMRs from all species, 61 Distance to TSS (kb) −2 0 2 Heart Sperm 4 −4 −2 0 2 B Cell Sperm 4 −4 −2 0 2 ESC Sperm 4 −4 −2 0 2 0.0 0.2 0.4 0.6 0.8 1.0 ESC Sperm 4 −4 −2 0 2 0.0 0.2 0.4 0.6 0.8 1.0 PBMC Sperm 4 −4 Distance to TSS (kb) −2 0 2 0.0 0.2 0.4 0.6 0.8 1.0 MDCK Sperm 4 −4 Human Chimp Rhesus Mouse Rat Dog −2 0 2 Blood Sperm 4 −4 Gorilla Distance to TSS (kb) Methylation level Methylation level Methylation level Figure 4.5: Promoter methylation profiles in sperm and somatic cells which resulted in 15k regions, with an average size of 1:1kbp. These regions were located pre- dominantly at gene promoters, and constitute on average 75% of the hypomethylated promoters in each species (Figure 4.7). The non-promoter HMRs were far less conserved across species, 62 %repeat< 0.4 methylation Species Cell type SINE LINE LTR Other Satellite Human Sperm 6.67% 5.60% 6.51% 36.60% 61.13% Chimp Sperm 5.86% 4.53% 3.14% 5.50% 55.06% Gorilla Sperm 8.25% 6.32% 6.96% 6.88% 60.18% Rhesus Sperm 6.62% 5.33% 5.56% 26.28% 43.80% Mouse Sperm 4.14% 3.77% 4.59% 5.61% 7.03% Rat Sperm 6.78% 6.39% 6.85% 6.95% 26.08% Dog Sperm 5.86% 4.53% 3.14% 40.21% Human B cell 3.25% 5.08% 6.04% 1.70% 13.38% Human ESC (H1) 1.01% 1.36% 1.62% 0.95% 4.13% Human ESC (H9) 1.23% 1.86% 2.16% 1.23% 7.62% Chimp B cell 3.48% 5.49% 6.04% 2.43% 14.71% Gorilla Whole blood 4.79% 6.47% 7.20% 3.47% 11.22% Rhesus PBMC 2.69% 4.82% 4.63% 6.47% 6.84% Mouse B cell 3.38% 2.67% 3.08% 1.87% 4.31% Mouse B cell 3.60% 2.80% 3.14% 2.00% 5.02% Mouse ESC 3.59% 2.23% 2.96% 1.79% 5.54% Rat Left ventricle 1.41% 1.12% 1.37% 1.06% 2.28% Table 4.3: Retrotransposon hypomethylation in sperm 0.0 0.4 0.8 0 100 300 200 1.0 0.6 0.2 Other statellites Frequency 0 10 20 30 0.0 0.4 0.8 1.0 0.6 0.2 Pericentromeric satellite repeats Methylation level (dog sperm) Figure 4.6: Pericentromeric satellites are hypomethylated in dog sperm with conserved HMRs across all seven species constituting 11-31% of the non-promoter HMRs in each species. The example interval in Figure 4.8 shows the conserved presence of an HMR at the promoter in each species. The gene within this interval, EBF2 in human, is expressed in a variety of somatic cells, but has low expression in testis. The promoter HMR size of this gene varied between species, and there were lineage-specific HMR gain and loss events more distal to 63 human chimp gorilla rhesus mouse rat dog 0 5 10 15 20 0 5 10 15 20 Number of HMRs (x1000) conserved in all species promoter non-promoter Figure 4.7: Conserved promoter HMRs and divergent distal HMRs transcription start sites (TSS). These observations are consistent with slower change at promoters, and more rapid evolution of enhancers, as has been observed in a comparative study of histone modifications focused on mammalian liver (Villar et al., 2015). Taking advantage of the precision afforded by WGBS, we compared the sizes of promoter HMRs across species. While the HMR sizes at orthologous gene promoters were similar across species in ESC and somatic cells, in sperm they show substantial variation (Figure 4.9). Rodent sperm promoter HMRs are significantly shorter compared with primates and dog (Wilcoxon rank sum test p < 2:22 10 16 ). The size difference separating rodents from rhesus was 500bp, a quarter of the size of those HMRs in rodents. Within primates, rhesus monkey had significantly shorter HMRs than other species examined (Wilcoxon rank sum testp < 2:22 10 16 ). Similar levels of variation were evident at replicate level between species as well. These sperm promoter HMR size differences indicate a global divergence in the organization of sperm epigenomes, for example via differenttrans-epigenetic mechanisms, during mammalian evolution. 64 EBF2 5 kb human chimp gorilla rhesus mouse rat dog methylation HMR Figure 4.8: Sperm DNA methylation of seven species in an example orthologous region human chimp gorilla rhesus mouse rat dog 1000 2000 TSS 1000 2000 Promoter HMR size (bp) somatic sperm 25% 75% Figure 4.9: Sperm promoter HMR sizes differ between species The median size of regions up and down-stream of the transcription start sites (TSS) are shown for somatic HMRs (orange) and sperm HMRs (blue). Error bars show HMR size quartiles. 65 The cross-species HMR size difference in orthologous regions indicates that an ancestral HMR has evolved to have different boundary locations in different species. Indeed, boundary variation was prevalent at conserved HMRs. For a conserved HMR in human, on average, the core region of hypomethylation shared in all species only constituted 62% of the HMR. This proportion was about 75% for rodents. Boundary variation was also common for HMRs specific to a clade. Altogether, a considerable amount of cross-species epigenomic variation was from adjustments in boundaries of these putative regulatory intervals – in addition to complete gain or loss. Despite prevalent HMR boundary variation, a subset of orthologous HMRs had relatively con- served boundary locations across species. We identified 293 such “ultra-conserved” HMRs across the seven species using a cutoff by the Jaccard index (Methods in section A.6), all of which over- lapped phastCons conserved elements for placental mammals (Siepel et al., 2005), and 95% over- lapped gene promoters (TSS1kb). Relative to genes with promoter HMRs in all seven species, those with ultra-conserved promoter HMRs were enriched in multiple biological processes includ- ing developmental process, signal transduction and regulation of multicellular organismal process (Table 4.4, 4.5, 4.6). 4.3 Methylation loss exceeds methylation gain across species We identified genomic intervals with species-specific gain and loss of methylation in primates and rodents using dog as out-group. For each species, the genomic fraction showing species-specific methylation loss (corresponding to either birth or extension of an HMR) was substantially larger than species-specific gain (Figure 4.10, Table 4.7). This hints at a non-stationary process, and prompted us to ask if global trends and rates of epigenomic change can be extracted from these methylomes. Phylogenetic modeling for DNA sequence evolution has a rich history (Felsenstein, 1981a; Yang, 1994b; Siepel & Haussler, 2004). Approaches from molecular evolution have been applied to DNA methylation changes in tumor growth and development (Siegmund et al., 2009; 66 Biological Processes GO Term Description P.val FDR Enrich. N B n b GO:0048856anatomical structure development 1.16E-15 1.48E-11 2.19 8837 1690 227 95 GO:0032502developmental process 2.46E-14 1.57E-10 1.86 8837 2402 227 115 GO:0048731system development 9.72E-14 4.13E-10 3.93 8837 386 227 39 GO:0007165signal transduction 9.88E-12 3.15E-08 1.88 8837 2011 227 97 GO:0044767single-organism developmental process 2.52E-11 6.41E-08 1.81 8837 2171 227 101 GO:0051239regulation of multicellular organismal process 2.07E-10 4.40E-07 2.11 8837 1289 227 70 GO:0044700single organism signaling 3.76E-09 6.85E-06 3.57 8837 305 227 28 GO:0023052signaling 4.68E-09 7.46E-06 3.54 8837 308 227 28 GO:0007267cell-cell signaling 8.94E-09 1.27E-05 3.65 8837 277 227 26 GO:0007399nervous system development 2.35E-08 2.99E-05 4.4 8837 177 227 20 GO:2000026regulation of multicellular organismal development 3.62E-08 4.19E-05 2.2 8837 903 227 51 GO:0009653anatomical structure morphogenesis 4.95E-08 5.26E-05 2.36 8837 725 227 44 GO:0032501multicellular organismal process 7.54E-08 7.39E-05 1.86 8837 1447 227 69 GO:0097485neuron projection guidance 1.10E-07 1.00E-04 4.98 8837 125 227 16 GO:0007166cell surface receptor signaling pathway 1.36E-07 1.16E-04 2.13 8837 913 227 50 GO:0044707single-multicellular organism process 1.41E-07 1.12E-04 1.96 8837 1170 227 59 GO:0050793regulation of developmental process 3.08E-07 2.31E-04 1.93 8837 1167 227 58 GO:0010646regulation of cell communication 3.25E-07 2.30E-04 1.76 8837 1595 227 72 GO:0007610behavior 4.52E-07 3.04E-04 3.19 8837 293 227 24 GO:0007154cell communication 4.95E-07 3.15E-04 2.78 8837 406 227 29 GO:0044708single-organism behavior 5.09E-07 3.09E-04 3.66 8837 213 227 20 GO:0023051regulation of signaling 5.47E-07 3.17E-04 1.73 8837 1616 227 72 GO:0022603regulation of anatomical structure morphogenesis 5.57E-07 3.09E-04 2.55 8837 504 227 33 GO:0045595regulation of cell differentiation 5.77E-07 3.06E-04 2.17 8837 790 227 44 GO:0007411axon guidance 5.87E-07 2.99E-04 4.71 8837 124 227 15 GO:0051240positive regulation of multicellular organismal process 7.81E-07 3.83E-04 2.25 8837 691 227 40 GO:0051093negative regulation of developmental process 1.05E-06 4.94E-04 2.68 8837 421 227 29 GO:0048869cellular developmental process 1.38E-06 6.27E-04 1.81 8837 1311 227 61 GO:0051241negative regulation of multicellular organismal process 1.95E-06 8.59E-04 2.5 8837 483 227 31 GO:0001501skeletal system development 2.88E-06 1.23E-03 5.19 8837 90 227 12 GO:0044699single-organism process 3.32E-06 1.37E-03 1.23 8837 5359 227 170 GO:0051094positive regulation of developmental process 4.48E-06 1.79E-03 2.25 8837 606 227 35 GO:0098742adhesion via plasma-membrane adhesion molecules 6.40E-06 2.47E-03 4.82 8837 97 227 12 Table 4.4: Enriched Biological Processes associated with ultra-conserved HMRs 67 Molecular Functions GO Term Description P.val FDR Enrich. N B n b GO:0044212 transcription regulatory region DNA binding 4.53E-08 1.60E-04 2.78 8837 476 227 34 GO:0099600 transmembrane receptor activity 4.97E-08 8.81E-05 3.36 8837 301 227 26 GO:0000975 regulatory region DNA binding 5.02E-08 5.93E-05 2.77 8837 478 227 34 GO:0001067 regulatory region nucleic acid binding 5.29E-08 4.68E-05 2.76 8837 479 227 34 GO:0004871 signal transducer activity 6.28E-08 4.45E-05 2.64 8837 530 227 36 GO:0004888 transmembrane signaling receptor activity 1.95E-07 1.15E-04 3.34 8837 280 227 24 GO:0000977 RNA polymerase II regulatory region 3.31E-07 1.67E-04 3.06 8837 331 227 26 sequence-specific DNA binding GO:0001012 RNA polymerase II regulatory region DNA binding 3.51E-07 1.55E-04 3.05 8837 332 227 26 GO:0038023 signaling receptor activity 5.56E-07 2.19E-04 2.98 8837 340 227 26 GO:0000976 transcription regulatory region 1.09E-06 3.85E-04 2.8 8837 375 227 27 sequence-specific DNA binding GO:0001159 core promoter proximal region DNA binding 2.66E-06 8.57E-04 3.42 8837 216 227 19 GO:0004872 receptor activity 3.35E-06 9.89E-04 2.53 8837 446 227 29 GO:0060089 molecular transducer activity 3.35E-06 9.13E-04 2.53 8837 446 227 29 GO:1990837 sequence-specific double-stranded DNA binding 3.41E-06 8.62E-04 2.64 8837 398 227 27 GO:0000978 RNA polymerase II core promoter 5.05E-06 1.19E-03 3.42 8837 205 227 18 proximal region sequence-specific DNA binding GO:0005102 receptor binding 6.52E-06 1.44E-03 2.18 8837 643 227 36 GO:0000987 core promoter proximal region 8.64E-06 1.80E-03 3.29 8837 213 227 18 sequence-specific DNA binding Table 4.5: Enriched Molecular Functions associated with ultra-conserved HMRs 68 Cellular Components GO Term Description P.val FDR Enrich. N B n b GO:0043005 neuron projection 3.47E-09 5.54E-06 3.15 8837 408 227 33 GO:0031226 intrinsic component of plasma membrane 4.46E-08 3.57E-05 2.63 8837 547 227 37 GO:0097458 neuron part 6.82E-08 3.64E-05 2.37 8837 707 227 43 GO:0005887 integral component of plasma membrane 1.48E-07 5.94E-05 2.6 8837 524 227 35 GO:0005886 plasma membrane 3.91E-07 1.25E-04 1.74 8837 1634 227 73 GO:0044459 plasma membrane part 2.92E-06 7.79E-04 1.89 8837 1069 227 52 GO:0042995 cell projection 8.36E-06 1.91E-03 2.05 8837 760 227 40 Table 4.6: Enriched Cellular Components associated with ultra-conserved HMRs 69 0.0 2.5 5.0 7.5 10.0 Human Chimp Gorilla Rhesus Mouse Rat Dog Total size (Mbp) hypomethylation methylation Species specific methylation pattern Figure 4.10: Total region size of species-specific methylation and hypomethylation Total size of species-specific HMRs and non-HMRs in each of the seven species. Size (Mbp) Cut-off Human Chimp Gorilla Rhesus Mouse Rat species- 0.2 0.516 0.792 1.065 1.289 2.301 3.855 specific 0.3 0.61 0.962 1.31 1.488 2.605 4.298 hypo- 0.4 0.736 1.224 1.827 1.886 2.851 4.806 methylation 0.5 1.006 1.78 2.946 2.765 3.027 5.426 species- 0.5 0.032 0.044 0.021 0.179 0.506 0.536 specific 0.6 0.037 0.043 0.023 0.194 0.499 0.498 hyper- 0.7 0.048 0.044 0.027 0.211 0.505 0.473 methylation 0.8 0.092 0.045 0.039 0.286 0.615 0.484 Table 4.7: Sizes of species-specific hypomethylated and hypermethylated regions Hypo or hyper-methylation states are determined by a cut-off for methylation levels in 200bp bins. Capra & Kostka, 2014). However, as we explained above, modeling the process of DNA methyla- tion change across species presents unique challenges. Here, we briefly restate our phylo-epigenetic model for methylome evolution which has been introduced in detail in Chapter 3. The model combines two processes: (1) the “horizontal” process that induces autocorrelation of methylation states along the genome within a species, and (2) the “vertical” inheritance of methylation states from the ancestral methylome (Chapter 3). The vertical process is modeled with a continuous-time Markov process in the usual way (Felsenstein, 1981a). 70 HMR birth a b HMR extension a b HMR conservation a b HMR conservation + birth a b Example single-site calculation hypo-methylated methylated States a b ab within-species Dependence inheritance ab Figure 4.11: Schematic presentation of interdependent-site phyloepigenetic model Example evolution scenarios from an ancestor speciesa to a descendant speciesb. The number of hypomethylated sites inb is the same in all cases, but the probabilities differ in each case under this model. To model the horizontal dependency between the methylation states of two neighboring sites in the same methylome, we introduce a transition probability to model the conditional distribution of a site’s methylation state given the state at the site’s neighbors (Figure 4.11). Combining these two processes, our model is a Bayesian network that describes the probability of a site having a particular methylation state as a function of its ancestral state and the contemporaneous state of its previous neighbor. The inclusion of the horizontal process allows us to treat non-conserved CpG sites as sites with missing data, and distinguish between different types of changes in the methylome (Figure 4.11). This increases the fraction of the 7-way orthologous genome we can model at single-CpG resolution (540 Mbp) and increased contiguity within those intervals. We implemented this modeling strategy and applied it on the sperm methylomes. The esti- mated branch lengths represented the amount of methylome evolution along different lineages as species diverged during evolution (Figure 4.12, Table 4.8). The consensus divergence times are the average time from multiple studies that were based on different sets of species and different type of data, including palaeontological and molecular data (Hedges et al., 2015). Branches within rodents were significantly longer than the primate lineage (Wilcoxon signed-rank testp = 3:9 10 10 in 50 equal-sized regions between human and mouse lineages), indicating faster epigenomic evolu- tion in sperm of rodents than primates. This echoes the 3-fold higher nucleotide substitution rate observed in rodents than humans (Consortium, 2004). The rat sperm methylome exhibited faster 71 Model Input Correlation between models Tree2 Tree3 Tree1: independent-site unrestricted DNA multiple sequence alignment 0.280 0.652 Tree2: independent-site methylome 200bp-bin binary methylation states 0.714 Tree3: inter-dependent site Single-CpG posterior methylation probability Branch lengths Human Chimp Hominini Gorilla Homininae Rhesus Catarrhini Mouse Rat Murinae Euar. Dog Tree1:0.0055 0.0058 0.0018 0.0073 0.0175 0.0311 0.0902 0.0694 0.0752 0.2726 0.0064 0.1768 Tree2:0.0135 0.0237 0.0046 0.0351 0.0151 0.0320 0.0184 0.0434 0.0559 0.0325 0.0131 0.0212 Tree3:0.3824 0.3838 0.0944 0.5013 0.1890 0.7133 0.1100 0.6339 0.8518 0.8951 0.0003 0.9187 Table 4.8: Phylogenetic tree and phylo-epigenetic tree models Euar.: Euarchontoglires 72 human chimp gorilla rhesus mosue rat dog 10 Myr Reference phylogeny human chimp gorilla rhesus mouse rat dog 0.01 Methylome human chimp gorilla rhesus mouse rat dog 0.05 Genome Figure 4.12: Evolution trees for genome and sperm methylome Phylogenetic tree representing consensus divergence time (Hedges et al., 2015), genome evolution and sperm methylome evolution. Unit branch length represents 1 million year, 1 substitution/site and 1 methylation state change/CpG site, respectively. evolution than mouse (Wilcoxon signed-rank testp = 3:9 10 10 ), which coincided with a 5-10% higher nucleotide substitution rate on the rat lineage than the mouse lineage (Consortium, 2004). However, the methylome evolution rates were much higher at the terminal branches than internal branches, suggesting relatively fast turnover of sperm HMRs in mammals. We fit an exponen- tial decay model to the relationship between shared HMR fraction and divergence time between species pairs (Figure 4.13). The estimated half-life of HMRs in the 7-way orthologous genome was 95Myr. This decay rate varied substantially between different genomic contexts, however, with the estimated half-life in promoter at 500Myr, but only 23Myr for distal (over 10kb from TSS) HMRs. Again, these differing decay rates indicate slow promoter evolution and fast enhancer evolution. The inferred methylation states in ancestral species indicate an increasing proportion of hypomethylation in the orthologous genome along each branch of the phylogenetic tree (Fig- ure 4.14). We estimated that HMRs made up 6.2% of the orthologous genome in the last common ancestor of the seven species. The corresponding estimates for the primate and rodent common ancestors were 6.5% and 6.7%, respectively. Among extant species, within primates the fractions were 6.8-8.5% and in the rodents 8.0-8.4% (Figure 4.14). The overall increase in the hypomethy- lated fraction of the orthologous genome was a result of DNA methylation loss exceeding gain along all branches of the tree except dog, with the dog coming closest to balancing the two (Fig- ure 4.14,4.15). To further examine whether the methylome evolutionary process was stationary, we 73 0.00 0.25 0.50 0.75 1.00 0 25 50 75 100 Divergence time (myr) HMR Jaccard index all HMRs distal TSS TSS+/-10k Figure 4.13: HMR divergence by an exponential decay model Empirically determined HMR divergence rates by the fraction of conserved orthologous HMRs between pairs of species as a function of divergence time in million years. compared a model assuming reversible evolutionary process with a general model (i.e. lacking the reversibility assumption) using methylation states of extant species in 200bp bins as observed data (complete observation at extant species). The general model also estimated a process with increas- ing hypomethylation proportion, and fit the data better than the reversible model (Figure 4.16). 4.4 Widening of hypomethylated regions has contributed sig- nificantly to methylation loss Loss of DNA methylation during evolution reflects an increase in either the size or the number of HMRs. For example, the evolutionary birth of an HMR, putatively a regulatory region, may arise within a methylated region in the ancestral species. Widening of ancestral HMRs is also plausible as a driver for methylation loss. With the inferred ancestral methylomes, we identified 74 Boreoeutheria ( LCA) Euarchontoglire Catarrhini Homininae Hominini Human Chimp Gorilla Rhesus Murinae Mouse Rat Dog ● ● ● ● ● ● ● ● ● ● ● ● ● Time (Mya) Fraction of genome hypomethylated (%) 100 80 60 40 20 0 6.5 7.0 7.5 8.0 8.5 Figure 4.14: Evolutionary increase of the fraction of genomic hypomethylation Fraction of orthologous genome hypomethylated in extant and ancestral species, estimated by interdependent-site phylo-epigenetic model. epigenetic conservation and mutation events according to vertical (“parent-child”) methylation state transitions. Then, by comparing neighboring events within each descendant species, we classified methylation loss (HMR gain) events into HMR birth events and HMR extension events. We similarly classified HMR loss events into deaths and contractions (Figure 4.17,4.18). Here, we use “birth” and “death” to describe the rise and decline of a HMR during evolution, in a similar way that studies of transcription factor binding site (TFBS) turnover refer to births and deaths of TFBS during evolution. The HMR birth and death events showed different genomic distribution from the extensions and contractions. Averaged across branches, 52% HMR birth and 42% of death events were located more than 10kb away from gene transcription start sites (TSS). In contrast, 30% of extensions and 17% of contractions were more than 10kb away from TSS (Figure 4.19). The abundance of de novo events in distal genomic regions corresponds with the highly divergent non-promoter HMRs 75 −5 0 5 10 Euarchontoglires Catarrhini Homininae Hominini Human Chimp Gorilla Rhesus Murinae Mouse Rat Dog Region size (Mbp) HMR gain HMR loss Figure 4.15: Sizes of HMR gain and loss ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 20 60 100 140 0 40 80 120 Expected # methylation state transitions between pairs of extant species (x1000) Observed methylation state differences (x1000) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Reversible model General model 40 80 120 20 60 100 140 AIC R = 2.87x10 6 AIC G = 2.76x10 6 human chimp gorilla rhesus mouse rat dog 0.01 Phylo-epigenetic tree (200bp bins, independent site general model) A B Figure 4.16: Comparison of independent-site models indicates non-stationary process (A) Phylo-epigenetic tree estimated by general independent-site phylo-epigenetic model using dis- crete methylation states in 200bp bins. (B) The general model (without the reversibility assump- tion) fits the observed methylation divergence better than assuming reversible state substitution process by Akaike information criterion (AIC). 76 ancestor HMR human chimp mouse rat ancestral rodent ancestral primate segmentation 1 4 5 3 2 6 branch assignment 1 3 6 4 1 3 evolution events fully conserved HMR extension HMR contraction HMR death HMR birth branch Figure 4.17: Schematic annotation of methylome evolutionary event −4 −2 0 2 4 6 Euarchontoglire Catarrhini Homininae Hominini Human Chimp Gorilla Rhesus Murinae Mouse Rat Dog Region size (Mbp) birth contraction death extension Figure 4.18: Total size of methylation evolution events by type and lineage in extant species, while the secondary events contribute to the boundary variation of conserved HMRs in different species. Only a small proportion (less than 10%) of the orthologous genome was hypomethylated. If newly gained HMRs had occurred randomly in the genome during evolution, we would expect very few of them to be located next to ancestral HMRs and appear as HMR extensions. However, 77 0 2 4 6 0 2 4 6 Murinae Mouse Dog Catarrhini Homininae Hominini Human Distance to TSS (log10) Figure 4.19: Secondary mutation events are closer to gene promoters The distribution of methylome evolution events, relative to gene transcription start sites (TSS). along different branches of the evolutionary tree, we observed that between 20-75% of total size of HMR gain came from HMR extension events (Figure 4.18). Compared to an expected random distribution of mutation events in the orthologous genome, we observed over 10-fold enrichment for widening of ancestral HMRs along all branches (Figure 4.20). In contrast, HMR loss events occurred more than expected as HMR deaths (Figure 4.20). This suggests that existing HMRs may facilitate the evolutionary formation of a new HMR in the adjacent region, meanwhile loss of HMR is more likely to occur as methylation through an entire ancestral HMR rather than only a fraction of it. The overall trend of methylation loss during evolution, together with the observed frequency of HMR extensions, led to our hypothesis of progressive HMR extension during mammalian evo- lution: once established, a typical HMR can gradually evolve to span a larger genomic region. We found two forms of evidence in line with this hypothesis. First, the reconstruction of ancestral methylomes provided age estimates for HMRs in extant species. We compared HMR sizes between different age groups, and found positive correlation between the age and the size of HMRs. The progressive expansion appeared faster at promoter HMRs than non-promoter HMRs. In the human sperm methylome, for example, the size difference between young promoter and non-promoter HMRs (after human-gorilla divergence about 9 Mya) was not significant (Figure 4.21). For older 78 Catarrhini 0 2 4 6 0 20 40 60 Homininae 0 2 4 6 0 20 40 Hominini 0 2 4 6 0 20 40 Human 0 2 4 6 0 20 40 Chimp 0 2 4 6 0 20 40 Gorilla 0 2 4 6 0 20 40 Rhesus 0 2 4 6 0 20 40 Murinae 0 2 4 6 0 10 20 Mouse 0 2 4 6 0 10 20 Rat 0 2 4 6 0 10 20 Dog 0 2 4 6 0 20 40 Catarrhini 0 2 4 6 0 2 4 6 8 Homininae 0 2 4 6 0 10 20 30 Hominini 0 2 4 6 0 40 80 Human 0 2 4 6 0 2 4 6 Chimp 0 2 4 6 0.0 1.0 2.0 Gorilla 0 2 4 6 0.0 1.0 Rhesus 0 2 4 6 0.0 0.4 0.8 Murinae 0 2 4 6 0.0 1.0 2.0 Mouse 0 2 4 6 0.0 1.0 2.0 Rat 0 2 4 6 0.0 1.0 2.0 Dog 0 2 4 6 0.0 0.4 0.8 1.2 Distance to conserved HMR (log 10 ) Distance to conserved HMR (log 10 ) Observed/expected HMR gain events HMR loss events Figure 4.20: Enrichment of different types of methylome evolution events Methylome evolution events were inferred using dependent-site phylo-epigenetic model on single- CpG methylation probabilities from extant species. These gain and loss events on each branch are shuffled within genomic regions available for the respective type of changes. The relative abun- dance of events (observed over randomly shuffled) is computed at different distances to inherited HMRs in the descendant species. Close-to-zero distances from conserved HMRs (left-most bin) indicate extension/contraction events. Bigger distances from conserved HMRs indicate birth/death events. HMR groups, in contrast, promoter HMRs were significantly wider than non-promoter HMRs. Similarly, in mouse, both promoter and non-promoter HMRs were wider in older age groups than those in younger age groups, and the size difference between promoter and non-promoter HMRs was the greatest in the oldest age group (Figure 4.21). Second, the level of genomic orthology can serve as a proxy for the average age of HMRs. We expected HMRs in mutually orthologous regions to be older on average than those in species-specific genomic regions. Therefore, we sep- arated HMRs in each species into two groups depending on whether they were located in regions orthologous across all seven species, and further stratified HMRs within each age group by their genomic context. We found that HMRs in the orthologous genome were consistently larger in size within each genomic context group than those in species-specific genomic regions (Figure 3B). An 79 < 2.22e−16 2.97e-4 2.03e-4 2.23e-4 0 1 2 3 4 Boreoeutheria Euarchontoglires Catarrhini Homininae Hominini Human Average HMR size (kb) Human HMRs in orthologous genome < 2.22e−16 6.35e−06 < 2.22e−16 2.77e−16 0 1 2 3 4 Boreoeutheria Euarchontoglires Murinae Mouse Mouse HMRs in orthologous genome nonpromoter promoter nonpromoter promoter Specific to clade Specific to clade Figure 4.21: More conserved sperm HMRs in the orthologous genome are wider Mean size of promoter and non-promoter HMRs in human and mouse, grouped by emergence time in mammalian evolution. Error bars mark standard errors of the mean. Significant Wilcoxon rank sum test p-values are shown. alternative hypothesis states that HMRs may randomly arise in the genome with substantial size variation, but wider HMRs are more likely to survive through evolution. Although this alternative hypothesis may also explain the size-age correlation of HMRs, it offers no explanation to the sub- stantial cross-species size variation of conserved HMRs. Therefore, progressive extension seems to be a more competitive hypothesis. Together, these observations suggest that recently appearing epigenomic features encompass smaller genomic intervals than more ancient features. Of note, such a phenomenon must have occurred independently and may have different rates along parallel evolutionary lineages. 80 Human Chimp Gorilla Rhesus Mouse Rat Dog 0 2 4 6 HMR size (kb) promoter repeat other 7-way orthologous other Genomic context Genomic conservation Figure 4.22: HMR sizes in the orthologous genome and species-specific genome Distribution of HMR sizes in genomic regions with different sequence conservation and genomic context groups. 4.5 Sequence signatures driven by sperm methylome diver- gence Probabilistic inference of ancestral methylation states allowed us to assess the impact of historical germline methylation on CpG enrichment. Sequence with a longer history of germline hypomethy- lation will exhibit less effects of methylation-induced CpG decay. We measured CpG enrichment with the observed-over-expected (o/e) ratio of CpG frequency. To examine the effect of “HMR age” on the underlying DNA sequence, we divided human sperm HMRs into five age groups based on estimated ancestral methylation states. We calculated local CpG o/e ratios from non- repetitive sequences for each age group. Confirming our hypothesis, older HMRs had higher CpG o/e ratio than younger HMRs (Figure 4.23). In addition, older HMRs displayed lower levels of DNA methylation than younger HMRs, suggesting that conserved elements were subject to more stringent epigenetic regulation and maintenance while younger elements were involved in more dynamic interactions. We confirmed that the association between CpG enrichment and HMR age remained after we controlled for DNA methylation level (Figure 4.24). 81 HMR age (Myr) >90 30−90 9−30 7−9 <7 0.0 0.1 0.2 0.3 0.4 0.5 Methylation level *** *** *** *** 0.0 0.5 1.0 CpG o/e ratio LCA Catarrhini Homininae Hominini Human LCA Catarrhini Homininae Hominini Human Figure 4.23: Inferred age of human sperm HMR and CpG enrichment CpG enrichment (observed/expected ratio of CpG occurrences) and regional average DNA methy- lation level in human HMRs grouped by estimated hypomethylation age. CpG o/e ratio (human) *** *** 0.0 0.2 0.4 0.6 0.8 1.0 ** *** * human hominini homininae catarrhini LCA Methylation level range 0~0.05 0.05~0.1 human hominini homininae catarrhini LCA Hypomethylated since Figure 4.24: Human HMR CpG enrichment by HMR age and average methylation level The relationship between germline methylation history and CpG decay through a genomic interval suggests a perspective for understanding the promoter HMR size divergence in Figure 4.8. The substantial number of promoter HMRs that are much wider in human than in mouse may have 82 CpG enrichment species 1 species 2 somatic HMR Scenario 1 species 1 species 2 ancestor Scenario 2 species 2 ancestor species 1 species 1 ancestor species 2 Sperm HMR Figure 4.25: Opposing scenarios for germline promoter HMR size evolution. Two opposite scenarios explaining the evolution path to different promoter HMR sizes in 2 extant species, and expected CpG enrichment schematic profile in the extant species as a result of methy- lation induced CpG-decay. arisen from a widening along the human lineage or a narrowing along the mouse lineage (Fig- ure 4.25). If we consider only these two scenarios, the degree of CpG decay in the corresponding intervals may provide evidence for one or the other. In particular, if the HMRs on average have widened towards human, with the methylation state in mouse more closely matching the common ancestor, then CpGs in the relevant intervals of the mouse genome would decay at the same rate as in background regions. On the other hand, if those HMRs have contracted along the mouse lineage, then the relevant genomic intervals in mouse should retain a higher CpG o/e ratio than the background. The data favors a widening on the human lineage as the main factor contributing to wider promoter HMRs in human sperm than in mouse sperm. We selected gene promoters that were hypomethylated in both human and mouse sperm with a wider HMR in human, and examined the CpG o/e ratio around the outer human HMR boundary, as well as around the inner mouse HMR 83 boundary. In addition, considering the nested and extended HMR structure in the germ line (Fig- ure 4.8, (Molaro et al., 2011)), we also separated the impact of methylation pattern in sperm from that in ESC by focusing on promoters with wider HMRs in sperm than in ESC in both species. In summary, three types of boundaries – human sperm HMR boundary, mouse sperm HMR boundary and common ESC boundary – divided orthologous promoter regions into four sections: (i) com- mon hypermethylated regions furthest away from TSS, (ii) regions hypomethylated only in human sperm, (iii) regions hypomethylated in both human and mouse sperm but not in ESC, and (iv) regions consistently hypomethylated in sperm and ESC of both human and mouse (Figure 4.26A). We examined signatures of CpG decay in both species, crossing the boundaries between these dif- ferent sections. The mouse CpG o/e ratios on opposite sides of positions orthologous to human sperm HMR boundaries (sections i vs ii) showed no significant difference, suggesting that both sides in the mouse genome shared a common methylation history during evolution (Figure 4.26A). Meanwhile, regions corresponding to rodent-lineage-specific sperm HMR loss showed higher CpG o/e ratios than the methylated genomic background level (Wilcoxon rank sum testp = 1:9110 5 , Figure 4.27), confirming that the divergence time between human and mouse (about 90 Myr) was not enough for CpGs in ancestral HMRs to decay to a genomic background level. In contrast, there was a significant drop of human CpG o/e ratios (Wilcoxon rank sum testp = 8:4 10 25 ) cross- ing the mouse sperm HMR boundaries from common HMRs (section iii) to human-only HMRs (section i). Although regions on both sides were hypomethylated in present-day human sperm, the outer side (section i) must have become hypomethylated later than the inner side (section ii) during evolution on the lineage that leads to human. Dog sperm methylome also showed wider promoter HMRs than mouse (Figure 4.8), and we observed similar profiles of CpG o/e ratios in the dog-mouse comparison (Figure 4.26B). These observations suggest that although sperm promoter HMR sizes might have expanded on parallel lineages, the changes might have occurred at different rates, and that ancestral promoter HMR size in sperm was closer to that in rodents, and both dog and primates have evolved wider sperm promoter hypomethylation. 84 CpG o/e ratio (human) −3 0 3 −3 0 3 −3 0 3 0.0 0.2 0.4 0.6 0.8 1.0 CpG o/e ratio (mouse) −3 0 3 −3 0 3 −3 0 3 0.0 0.2 0.4 0.6 0.8 1.0 human mouse Sperm ESC Promoter HMR: Position w.r.t boundaries (x100bp) (i) (ii) (iii) (iv) section CpG o/e ratio (mouse) −3 0 3 −3 0 3 −3 0 3 0.0 0.2 0.4 0.6 0.8 1.0 Position w.r.t. boundaries (x100bp) CpG o/e ratio (dog) −3 0 3 −3 0 3 −3 0 3 0.0 0.2 0.4 0.6 0.8 1.0 dog mouse A B Figure 4.26: CpG enrichment profile at orthologous promoters (A) Human and mouse CpG enrichment profile at hypomethylated orthologous gene promoters where human sperm HMRs are wider than mouse. Profiles show CpG enrichment in 2bp windows by distance to mouse HMR boundaries, and human HMR boundaries. Locally weighted smoothing (LOESS) was used to generate smooth lines around boundaries between sections. (B) Two opposite scenarios explaining the evolution path to different promoter HMR sizes in 2 extant species, and expected CpG enrichment schematic profile in the extant species as a result of methylation induced CpG-decay. p = 1.91e-5 Rodent−specific methyaled regions Methylated background 0.0 0.1 0.2 0.3 0.4 0.5 CpG o/e ratio (mouse) Figure 4.27: Mouse CpG enrichment in rodent-specific vs. conserved methylated regions 85 0.024 < 2.22e−16 −4 −2 0 2 4 6 common human wider mouse wider human specific mouse specific H3K27me3 signal (log 2 ratio vs input) Human round spermatid < 2.22e−16 < 2.22e−16 −4 −2 0 2 4 6 H3K4me3 signal (log 2 ratio vs input) < 2.22e−16 < 2.22e−16 −4 −2 0 2 4 6 Mouse round spermatid 2.21e−13 < 2.22e−16 −4 −2 0 2 4 6 common human wider mouse wider human specific mouse specific human sperm HMRs mouse sperm HMRs types of regions Figure 4.28: Sperm DNA methylation and histone modification Mouse and human-specific sperm HMRs and HMR extensions show species specific H3K4me3 and H3K27me3 enrichment. 4.6 Methylation divergence is associated with histone modifi- cation and sequence divergence DNA methylation states are known to correlate with other epigenomic marks. Although protamines replace many nucleosomes in mature sperm, DNA methylation states of sperm are largely estab- lished in earlier stages in spermatogenesis. To examine whether sperm DNA methylation diver- gence is corroborated by histone modification divergence, we analyzed H3K4me3 and H3K27me3 ChIP-seq data from round spermatids of human and mouse (Lesch et al., 2016). Lineage-specific sperm HMR births in human and mouse were associated with lineage-specific enrichment of H3K4me3 and H3K27me3 (Figure 4.28). In lineage-specific HMR extensions, similarly, DNA hypomethylation is associated with stronger enrichment of H3K4me3 and/or H3K27me3 in round spermatids in a lineage-specific manner. In both human and mouse, promoter HMRs showing species-specific widening in sperm, especially those that also appeared wider in ESC than the 86 Enrichment Promoter HMR Distal HMR Human Mouse Human Mouse E&S S E&S S E&S S E&S S Bivalent 1.51 0.91 2.09 1.39 1.71 1.19 1.93 1.14 H3K4me3-only 5.75 11.91 1.43 0.85 0.14 0.35 0.31 0.14 H3K27me3-only 0.3 0.19 0.53 0.84 1.92 2.32 3.01 3.94 None 2.89 6.57 1.12 1.71 0.19 0.64 0.64 0.69 Total H3K4me3 1.92 1.96 1.75 1.11 1.01 0.81 0.68 0.37 Total H3K27me3 0.78 0.48 0.88 0.96 1.77 1.5 2.53 2.69 Table 4.9: Enrichment of histone modifications in HMRs with lineage-specific HMR extensions E&S: lineage-specific HMR extension in both ESC and Sperm. S: lineage-specific HMR extension only in sperm. other species, were enriched with bivalent histone marks in round spermatid (Table 4.9). The asso- ciation between divergent DNA methylation and divergent histone modification in sperm indicates that sperm DNA methylation is an integral part of the germline epigenome evolution. Divergence of DNA methylation state may arise as a result of underlying sequence divergence that alters the composition of transcription factor binding sites in the genomic region (Stadler et al., 2011). We first compared sequence conservation at lineage-specific HMRs relative to their neighboring regions using phyloP scores (Pollard et al., 2010). HMRs born on the rodent, mouse and rat branches showed substantially stronger sequence conservation than their flanking regions (Figure 4.29A). This observation suggests that lineage-specific HMRs are not necessarily species- specific regulatory elements, and could be conserved regulatory elements that have evolved diver- gent tissue-specific activity between species. To examine this possibility, we compared lineage- specific sperm HMRs with a collection of HMRs pooled from 33 human somatic tissues and cell types in the Roadmap epigenomics project (Kundaje et al., 2015). All groups of lineage-specific sperm HMRs showed a sizable overlap with human somatic HMRs (Table 4.10). For example, 33% of mouse-specific sperm HMRs (25% in base pairs) were also hypomethylated in at least one human somatic methylome. Having overlap with human somatic HMRs considerably increased the sequence conservation level in lineage-specific sperm HMRs (Figure 4.29B). Interestingly, sperm 87 Lineage specific sperm HMR Overlap human somatic HMRs (bp) Total # HMRs Total size (bp) # HMRs Total size (bp) Human 1830 1,135,178 684 294,235 Chimp 1306 610,413 503 180,155 Gorilla 1124 474,457 472 162,022 Rhesus 2117 1,076,302 818 325023 Catarrhini 1835 973,499 1251 572,918 Mouse 8279 5,493,508 2762 1,382,278 Rat 11516 7,698,225 3451 1,653,517 Murinae 5859 3,812,953 2519 1,275,779 Table 4.10: Overlap between lineage-specific sperm HMRs and human somatic HMRs A B ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.2 0.4 0.6 −4kb −2kb HMR +2kb +4kb Placental phyloP ● ● ● ● Human Chimp Gorilla Rhesus ● ● ● ● Catarrhini Mouse Rat Murinae Lineage−specific HMRs ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.2 0.4 0.6 Placental phyloP Overlapping human somatic HMRs −4kb −2kb HMR +2kb +4kb ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.2 0.4 0.6 Not overlapping human somatic HMRs −4kb −2kb HMR +2kb +4kb Figure 4.29: Sequence conservation levels in lineage-specific HMRs (A) Average sequence conservation level (placental mammal phyloP score) with standard error in the lineage-specific sperm HMRs and neighboring regions in 1kb windows. (B) Conservation levels are shown separately for species-specific sperm HMRs overlapping human somatic HMRs and for those not overlapping human somatic HMRs. HMRs specific to primate lineages and specific to the sperm cell type had lower sequence conserva- tion than neighboring regions, while sperm HMRs specific to rodent lineages and not overlapping human somatic HMRs remained at higher sequence conservation level than neighboring regions. It is possible that there are conserved regulatory regions between human and mouse that are not revealed by methylomes of the profiled human cell types. Nevertheless, these observations are con- sistent with exaptation of existing elements as a major source for species to evolve new regulatory interactions in sperm. As lineage-specific HMR births usually have a different level of sequence conservation from neighboring regions, they may be under different selection pressure. Therefore, directly com- paring substitutions in lineage-specific HMR births with neighboring regions may be confounded 88 by their different selection pressure. We took a different approach, and measured the relative sequence substitution rate (RSSR) in an orthologous region between two sister lineages. To assign sequence substitutions to individual branches of the phylogenetic tree, we used the 15- way eutherian mammals Enredo-Pecan-Ortheus (EPO) multiple alignments (Herrero et al., 2016; Yates et al., 2016), which contains inferred ancestral sequences. Considering that the evolution time and neutral sequence substitution rates of different lineages may differ, we compared the RSSR between two parallel lineages in regions of interest against randomly sampled regions from the background orthologous genome. We studied four pairs of sister or parallel lineages, and observed that lineage-specific HMR births were associated with significantly elevated sequence substitution rate on the same lineage (Figure 4.30, Table 4.11, methods in section A.6). After excluding CpG sites, the differences in relative substitution rates remained significant. Clustering of sperm methylomes as well as comparison of evolutionary trees for genomes and sperm methy- lomes have shown a global correlation between genome and epigenome divergence. These obser- vations, further extending the previous results, reveal a local link between sequence substitutions and evolution of new epigenetic features. 4.7 Lineage-specific changes involve common transcription factors and developmental genes DNA hypomethylation reflects cis-regulatory interactions (Bird, 2002; Hodges et al., 2011; dos Santos et al., 2015). We wanted to know which transcription factors were potentially involved in these newly evolved interactions in sperm cells of different species. The transcription factor binding site (TFBS) profiles from ENCODE ChIP-seq experiments (Wang et al., 2013) come from multiple cell types in human and mouse, and include 161 factors in human and 46 factors in mouse. 89 human p = 1.83e-03 chimp p = 7.59e-03 0.0 2.5 5.0 7.5 10.0 0.8 0.9 1.0 Density gorilla p = 3.40e−05 rhesus p = 5.32e-03 0 10 20 30 40 0.22 0.24 0.26 0.28 Density mouse p < 2.2e−16 rat p < 2.2e−16 0 25 50 75 100 0.75 0.80 0.85 0.90 catarrhini p < 2.2e−16 murinae p = 6.45e−09 0 50 100 0.34 0.36 0.38 0.40 0.42 RSSR in lineage specific HMRs RSSR distribution by random sampling 90% CI more in human more in chimp #Sub(human) #Sub(chimp) #Sub(mouse) #Sub(rat) #Sub(gorilla) #Sub(rhesus) #Sub(catarrhini) #Sub(murinae) Figure 4.30: Relative substitution rates in genomic background and lineage-specific HMRs Distribution of relative sequence substitution rate in the orthologous genome between two sister lineages is shown with histogram, and fit with normal distribution (black curve). Yellow area marks 90% confidence interval for mean relative sequence substitution rate (RSSR). Each lollipop marks the observed RSSR in HMRs specific to the indicated lineage. They comprise a representative annotation of active binding sites of the targeted transcription fac- tors in somatic cells, but might fail to capture those binding events exclusive to sperm and its precursors. We looked for transcription factors with enriched binding sites in lineage-specific promoter HMR extensions and lineage-specific non-promoter HMR gains in human and mouse (Figure 4.31, section A.6). At recently evolved lineage-specific non-promoter HMRs, a number of transcrip- tional activators were enriched on both lineages, such as histone acetyltransferase p300 (EP300), activator protein 1 (AP-1) subunits FOS, FOSL1 and JUN, and AP-1 interacting factor MAFK. In addition, chromatin looping factors including CTCF and cohesin components RAD21 and SMC3 were overrepresented on both lineages. Of the CTCF binding sites found in recently evolved lineage-specific non-promoter sperm HMRs, 38% and 39% of the sites were also lineage- specifically hypomethylated in the ESCs of human and mouse. These observations suggest a degree of evolutionary plasticity for these factors in their roles associated with 3D organization 90 Overlap Roadmap HMRs Not overlap Roadmap HMRs Subsitutions on lineage Subsitutions on lineage HMRs specific to lineage Catarrhini Murinae Catarrhini Murinae Catarrhini 3,629 50,270 7,568 17,514 Murinae 7,478 136,695 56,920 151,833 Fisher p-val 4.57E-39 5.01E-22 Exact odds ratio 1.320 1.153 Test 95% CI (1.266, 1.375) (1.120, 1.186) Subsitutions on lineage Subsitutions on lineage HMRs specific to lineage Mouse Rat Mouse Rat Mouse 47,437 53,179 84,033 92,766 Rat 59,077 73,883 116,483 149,953 Fisher p-val 7.42E-39 1.93E-137 Exact odds ratio 1.116 1.166 Test 95% CI (1.097, 1.134) (1.152, 1.180) Subsitutions on lineage Subsitutions on lineage HMRs specific to lineage Gorilla Rhesus Gorilla Rhesus Gorilla 1,536 2,583 1,075 3,692 Rhesus 3,206 6,396 2,240 10,430 Fisher p-val 1.15E-05 6.16E-13 Exact odds ratio 1.186 1.356 Test 95% CI (1.099, 1.281) (1.248, 1.472) Subsitutions on lineage Subsitutions on lineage HMRs specific to lineage Human Chimp Human chimp Human 1,806 1,822 3,217 3,270 Chimp 649 669 885 1,104 Fisher p-val 0.748 7.02E-05 Exact odds ratio 1.022 1.227 Test 95% CI (0.899, 1.161) (1.108, 1.360) Table 4.11: Lineage-specific HMRs contain more substitutions on the same lineage of the genome and long range promoter-enhancer interaction; in light of results presented above, this plasticity seems to be increasing the number of regions involved in these functions. Lineage-specific promoter HMR extensions showed quite different composition of transcrip- tion factor binding sites between human and mouse. On the human lineage, most transcriptional activators overrepresented in non-promoter HMR gains were also overrepresented in promoter HMR extensions. Mouse promoter HMR extensions only shared with human lineage the binding 91 0 20 40 60 80 100 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● BATF CEBPB* CTCF* EP300* ESR1 FOS FOXA1 FOXA2 GATA2 GATA3 IKZF1 JUN JUND* KAP1 MAFF MAFK* POLR2A RAD21* SMC3* STAT3 TEAD4 ZNF143 ZNF217 Bonferroni corrected p value (-log 10 ) 0 50 100 150 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● CEBPB EP300* EZH2 FOS FOXA1 FOXA2 GATA2 GATA3 MAFF MAFK POLR2A* STAT3 −4 −2 0 2 4 0 5 10 15 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● EP300* MAZ MXI1 POLR2A* SIN3A −4 −2 0 2 4 0 20 40 60 80 100 Bonferroni corrected p value (-log 10 ) ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● BHLHE40 CEBPB* CTCF* EP300* ETS1 FOSL1 JUN JUND* MAFK* MYOD1 RAD21* SMC3* ZC3H11A ZNF384 Repressor Chromatin looping Activator/coactivator Human Chimp Gorilla Rhesus Mouse Rat Dog Human Chimp Gorilla Rhesus Mouse Rat Dog Lineage-specific Promoter HMR extension Non-promoter HMR gain Log 2 fold change Log 2 fold change Figure 4.31: Enriched transcription factor binding sites in HMR birth and extension regions Factors enriched in both lineages are marked with “*”. site enrichment of EP300 apart from RNA polymerase II. Since promoter HMR sizes showed con- siderable difference between primates and rodents, likely due to HMR extension in primates, we examined the functional association of the genes that had conserved hypomethylation state with divergent HMR sizes. There were a total of 7114 orthologous protein-coding genes with con- served promoter hypomethylation in the sperm of all 7 species. We identified 1044 genes showing primate-lineage-specific promoter HMR extension. These genes showed enrichment in biological processes such as pattern specification, embryonic morphogenesis (Figure 4.32). It appears that these divergent sperm DNA methylation features at gene promoters are associated with conserved developmental programs. 92 ● ● ● ● ● ● ● Regionalization Multicellular organismal process Developmental process Locomotion Regulation of DNA binding Biological regulation Movement of cell or subcellular component −5 0 5 −5 0 5 Semantic space y Semantic space x log 2 fold 1 2 −9 −8 −7 −6 −5 −4 log 10 p-value Figure 4.32: Biological processes associated with promoter HMR extensions on human lineage About 12% of primate-lineage-specific HMR gains in human sperm were also hypomethylated in human ESC, and as much as 80.5% overlap with known ENCODE transcription factor binding sites, DNase hypersensitive sites (Thurman et al., 2012) or Roadmap somatic HMRs. This suggests that recently evolved sperm HMRs likely can take on roles of regulatory elements in developmental stages beyond germ cell development. Features of the sperm methylomes that appeared on inde- pendent evolutionary lineages may have evolved to support regulation of lineage-specific functions. To examine whether this was the case, we used orthologous protein-coding genes between human and mouse, and associated sperm HMRs containing lineage-specific birth events with the closest gene promoter in respective reference genomes. Despite being lineage-specific, HMR births on these two parallel lineages sometimes occurred in each other’s neighborhood in the orthologous genome (Figure 4.33), and they shared a sizeable list of common associated genes (Jaccard index 93 2 kb VWA5B2 MIR1224 ALG3 2 kb PAX7 2 kb POU6F2 4 kb SOBP Human Chimp Gorilla Rhesus Mouse Rat Dog Human Chimp Gorilla Rhesus Mouse Rat Dog Sperm methylation Sperm HMR Figure 4.33: Example lineage-specific HMR births = 0.44, Fisher exact testp = 1:7 10 238 ). These common genes were enriched for functions in embryonic developmental processes, including mesoderm and ectoderm development and nervous system development (Figure 4.34). Meanwhile, analysis of associated genes that were specific to one species did not yield any enrichment of biological processes compared with all orthologous protein-coding genes. Together, these results suggest that independently evolving features of the sperm epigenome are involved in regulating common genes that are important for embryonic devel- opment. This may be viewed as a form of convergent evolution: regulatory features are emerging independently for the same genes. 94 1617 1010 2072 genes closest to mouse−lineage-specific HMR birth genes closest to human−lineage-specific HMR birth system development (GO:0048731) developmental process (GO:0032502) nervous system development (GO:0007399) mesoderm development (GO:0007498) multicellular organismal process (GO:0032501) single−multicellular organism process (GO:0044707) heart development (GO:0007507) signal transduction (GO:0007165) cell communication (GO:0007154) muscle organ development (GO:0007517) cell differentiation (GO:0030154) intracellular signal transduction (GO:0035556) 0 2 4 6 8 Bonferroni corrected p value (-log 10 ) Figure 4.34: Lineage-specific HMRs on parallel lineages HMR births on the parallel lineages of human and mouses are associated with gene sets with significant overlap. These common genes are enriched in embryonic development processes. 95 Chapter 5 Conclusions Epigenetic regulation is critical to organism development and divergent gene expression activities between species. As a central component of the mammalian epigenome, DNA methylation pat- terns have a wide range of connections with histone modifications, chromatin accessibility and binding of transcription factors. DNA hypomethylation is indicative of regulatory function of the underlying DNA elements in almost all normal cell types. Within the same species, comparative studies of methylomes from different cell types, conditions or disease status have helped identify regulatory elements associated with cellular identity and elucidate their functional roles in gene regulation. Comparing methylomes from different species affords us a glimpse into the evolution- ary history of regulatory elements and can reveal about the stringency and flexibility of elements required by the mammalian gene regulation network. So far, cross-species comparative analyses of methylomes have been limited by the available computational tools. The classic phylogenetic models for molecular sequences are not directly applicable to sequences of epigenetic modifica- tions at same level of resolution. For methylomes, the obstacle mainly comes from the highly divergent CpG sites between different species, which creates a missing data problem with huge proportion of missing data. In addition, epigenetic marks have strong auto-correlation along the genome, which is useful information to incorporate into mathematical models for epigenome evo- lution to better approximate the biological processes as well as to alleviate the burden of missing data. The aim of my research was to build a model framework for studying mammalian epige- nomic evolution, and design computational procedures for model learning and inference. With this computational method, we extended on previous research of mammalian sperm methylomes to uncover more detailed aspects of sperm methylome divergence, including genome-scale trends, 96 high-resolution annotation of evolutionary events, lineage-specific features, relationship with DNA sequence divergence, and trans-generational impact. In Chapter 3, we proposed a probabilistic graphical model for the evolution of epigenomes. The main feature of this model is that it combines the inheritance process that operates along the evolutionary time at individual sites and the auto-correlation relationship that operates along the genome at any evolutionary time point. The inheritance process is modeled with a continuous-time Markov chain, as in classic phylogenetic models (Felsenstein, 1981b). The auto-correlation rela- tionship is modeled with a discrete-time Markov chain, as previously used to model methylation states along the genome of a single methylome (Molaro et al., 2011). We formulated the model as a generative model, a Bayesian network model to be specific, so that simulating data according to the model assumptions is straightforward. Exact learning of the model parameters is computationally challenging when there are missing data. We used a Monte Carlo Expectation Maximization algorithm for approximate learning, where the expectation of the complete data sufficient statistics are approximated with averages of samples generated by Markov chain Monte Carlo conditional on the observed data. We used a Gibbs sampler to update the methylation state of a single site in a single species at a time, conditional on the current methylation states for all sites in its Markov blanket. We showed with simulated datasets that this approximate model inference procedure can reliable recover model parameters when methylation states are known at the leaf species. With estimates of model parameters, the MCMC sampling procedure provides reliable inference of unobserved methylation states. We also showed that uncertainty in observations affect the accuracy of model learning and inference, and that the impact differs for internal species with different distance to extant species where observations are available. In Chapter 4, we applied the proposed phylo-epigenetic model to study the evolutionary history of mammalian sperm methylome represented by seven species. We chose whole genome bisulfite sequencing as the method of profiling DNA methylation because it provides whole-genome cov- erage and unbiased signals from both methylated and unmethylated cytosines. We chose sperm 97 as the cell type of interest because DNA methylation in germ-line cells has important roles in maintaining genomic integrity, passing regulatory information across generations to embryos and shaping genomic sequences through mutational pressure. We generated and collected whole genome bisulfite sequencing data from the sperm genomes of 7 mammalian species – human, chimpanzee, gorilla, rhesus macaque, mouse, rat and dog – to investigate the evolution of germ-line DNA methylation. Several key findings emerged from our analysis. There is a global trend towards expansion of the hypomethylated fraction of the genome on all lineages of the phylogenetic tree, and evidence for progressive expansion of accessible reg- ulatory regions, a feature that evolved independently along different lineages. This expansion shows strong lineage-specific aspects: hypomethylated intervals around transcription start sites have evolved to be substantially wider in primates and dog than in rodents; meanwhile, rodents show a greater trend towards birth of hypomethylated regions, and exhibit a greater rate of diver- gence in DNA methylation, which echoes the faster DNA sequence evolution in rodents compared with primates. Genomic intervals with lineage-specific hypomethylation are associated with com- mon developmental processes, and are often proximal to sets of genes with significant overlap. Analysis of transcription factor binding data indicates that regions of rodent-specific and primate- specific hypomethylation are enriched for binding sites of similar transcription factors. Our evo- lutionary analyses of sperm methylomes across mammalian species, particularly the mode and tempo of change involving hypomethylated regions, lead to significant insights into epigenomic evolution. Our findings about mammalian sperm methylome evolution resulted from successful applica- tion of the proposed phylo-epigenetic model to WGBS methylome data. This model has a wider realm of application. It can be directly applied to other regulatory features that are organized into distinct genomic domains and can be characterized with binary states, such as chromatin accessibil- ity, various histone tail modifications, transcription factors with broad and sequence-non-specific binding sites. In addition, because the phylo-epigenetic model follows the same graphical model structure of phylo-genetic models for DNA sequence evolution, potentially, they can be combined 98 into a single model that jointly describes the evolution of the genome and the epigenome. It is especially beneficial for phylogenetic analysis to incorporate methylome evolution history, as the elevation of substitution rates in CpG context are not uniform either along the genome or along the evolutionary time line. With the accumulating collection of high-resolution and whole-genome profiles of different epigenetic marks in multiple cell types and species, systematic analyses the evolutionary history of epigenomic patterns and regulatory elements with phylo-epigenetic models will expand our knowledge of mammalian gene regulation and evolution. 99 Reference List Aravin AA, Sachidanandam R, Bourc’his D, Schaefer C, Pezic D, Toth KF, Bestor T, Hannon GJ (2008) A pirna pathway primed by individual transposons is linked to de novo dna methylation in mice. Molecularcell 31:785–799. Atsem S, Reichenbach J, Potabattula R, Dittrich M, Nava C, Depienne C, B¨ ohm L, Rost S, Hahn T, Schorsch M et al. (2016) Paternal age effects on sperm FOXK1 and KCNA7 methylation and transmission into the next generation. HumMolGenet 25:4996–5005. Ballestar E, Wolffe AP (2001) Methyl-CpG-binding proteins. European Journal of Biochem- istry 268:1–6. Barau J, Teissandier A, Zamudio N, Roy S, Nalesso V , H´ erault Y , Guillou F, Bourchis D (2016) The DNA methyltransferase DNMT3C protects male germ cells from transposon activity. Sci- ence 354:909–912. Bauernfeind AL, Soderblom EJ, Turner ME, Moseley MA, Ely JJ, Hof PR, Sherwood CC, Wray GA, Babbitt CC (2015) Evolutionary divergence of gene and protein expression in the brains of humans and chimpanzees. GenomeBiolEvol 7:2276–2288. Bestor TH, Ingram VM (1983) Two DNA methyltransferases from murine erythroleukemia cells: purification, sequence specificity, and mode of interaction with DNA.ProceedingsoftheNational AcademyofSciences 80:5559–5563. Bird A (2002) DNA methylation patterns and epigenetic memory. Genes & Develop- ment 16:6–21. Bird AP (1980) DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Res 8:1499–1504. Bird AP, Taggart MH (1980) Variable patterns of total DNA and rDNA methylation in animals. NucleicAcidsResearch 8:1485–1497. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED et al. (2004) Aligning multiple genomic sequences with the threaded blockset aligner. Genomeresearch 14:708–715. 100 Blekhman R, Marioni JC, Zumbo P, Stephens M, Gilad Y (2010) Sex-specific and lineage-specific alternative splicing in primates. Genomeresearch 20:180–189. Blekhman R, Oshlack A, Chabot AE, Smyth GK, Gilad Y (2008) Gene regulation in primates evolves under tissue-specific selection pressures. PLoSgenetics 4:e1000271. Borgel J, Guibert S, Li Y , Chiba H, Sch¨ ubeler D, Sasaki H, Forn´ e T, Weber M (2010) Targets and dynamics of promoter DNA methylation during early mouse development. Nature genet- ics 42:1093–1100. Bourc’his D, Bestor TH (2004) Meiotic catastrophe and retrotransposon reactivation in male germ cells lacking Dnmt3L. Nature 431:96–99. Bourc’his D, Xu GL, Lin CS, Bollman B, Bestor TH (2001) Dnmt3L and the establishment of maternal genomic imprints. Science 294:2536–2539. Bourque G, Leong B, Vega VB, Chen X, Lee YL, Srinivasan KG, Chew JL, Others. (2008) Evolution of the mammalian transcription factor binding repertoire via transposable elements. GenomeResearch 18:1752– 1762. Boyes J, Bird A (1991) DNA methylation inhibits transcription indirectly via a methyl-CpG binding protein. Cell 64:1123–1134. Branco MR, King M, Perez-Garcia V , Bogutz AB, Caley M, Fineberg E, Lefebvre L, Cook SJ, Dean W, Hemberger M et al. (2016) Maternal DNA methylation regulates early trophoblast development. DevCell 36:152–163. Brykczynska U, Hisano M, Erkek S, Ramos L, Oakeley EJ, Roloff TC, Beisel C, Sch¨ ubeler D, Stadler MB, Peters AH (2010) Repressive and active histone methylation mark distinct promoters in human and mouse spermatozoa. NatureStructural&MolecularBiology 17:679–687. Capra JA, Kostka D (2014) Modeling DNA methylation dynamics with approaches from phylo- genetics. Bioinformatics 30:i408–i414. Carmona FJ, Davalos V , Vidal E, Gomez A, Heyn H, Hashimoto Y , Vizoso M, Martinez-Cardus A, Sayols S, Ferreira HJ et al. (2014) A comprehensive DNA methylation profile of epithelial-to- mesenchymal transition. CancerResearch 74:5608–5619. Carrell DT, Hammoud SS (2010) The human sperm epigenome and its potential role in embryonic development. Molecularhumanreproduction 16:37–47. Ch´ edin F, Lieber MR, Hsieh CL (2002) The DNA methyltransferase-like protein DNMT3L stimulates de novo methylation by Dnmt3a. Proceedings of the National Academy of Sci- ences 99:16916–16921. Cheng Y , Ma Z, Kim BH, Wu W, Cayting P, Boyle AP, Sundaram V , Xing X, Dogan N, Li J et al. (2014) Principles of regulatory information conservation between mouse and human. Nature 515:371–375. 101 Cohen NM, Kenigsberg E, Tanay A (2011) Primate CpG islands are maintained by heterogeneous evolutionary regimes involving minimal selection. Cell 145:773–786. Consortium RGSP (2004) Genome sequence of the brown norway rat yields insights into mam- malian evolution. Nature 428:493–521. Cortellino S, Xu J, Sannai M, Moore R, Caretti E, Cigliano A, Le Coz M, Devarajan K, Wessels A, Soprano D et al. (2011) Thymine DNA glycosylase is essential for active DNA demethylation by linked deamination-base excision repair. Cell 146:67–79. Coulondre C, Miller JH, Farabaugh PJ, Gilbert W (1978) Molecular basis of base substitution hotspots in Escherichia coli. Nature 274:775–780. Crouse HV (1960) The controlling element in sex chromosome behavior in Sciara. Genet- ics 45:1429 – 1443. dos Santos CO, Dolzhenko E, Hodges E, Smith AD, Hannon GJ (2015) An epigenetic memory of pregnancy in the mouse mammary gland. CellReports 11:1102–1109. Down TA, Rakyan VK, Turner DJ, Flicek P, Li H, Kulesha E, Graef S, Johnson N, Herrero J, Tomazou EM et al. (2008) A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis. Naturebiotechnology 26:779–785. Eden E, Navon R, Steinfeld I, Lipson D, Yakhini Z (2009) GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMCBioinformatics 10:48. Elango N, Hunt BG, Goodisman MA, Soojin VY (2009) DNA methylation is widespread and associated with differential gene expression in castes of the honeybee, Apis mellifera. Proceed- ingsoftheNationalAcademyofSciences 106:11206–11211. Enard W, Fassbender A, Model F, Adorj´ an P, P¨ a¨ abo S, Olek A (2004) Differences in DNA methylation patterns between humans and chimpanzees. CurrentBiology 14:R148–R149. Enard W, Khaitovich P, Klose J, Z¨ ollner S, Heissig F, Giavalisco P, Nieselt-Struwe K, Muchmore E, Varki A, Ravid R et al. (2002) Intra-and interspecific variation in primate gene expression patterns. Science 296:340–343. ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489:57–74. Felsenstein J (1981a) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journalofmolecularevolution 17:368–376. Felsenstein J (1981b) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journalofmolecularevolution 17:368–376. Felsenstein J (2004) Inferringphylogenies Sinauer Associates, Suderland, Massachusetts. 102 Feng S, Cokus SJ, Zhang X, Chen PY , Bostick M, Goll MG, Hetzel J, Others. (2010) Conservation and divergence of methylation patterning in plants and animals. PNAS 107:8689 – 8694. Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S et al. (2014) Ensembl 2014. NucleicAcidsRes 42:D749–D755. Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, Korlach J, Turner SW (2010) Direct detection of dna methylation during single-molecule, real-time sequencing. Nature methods 7:461–465. Frommer M, McDonald LE, Millar DS, Collis CM, Watt F, Grigg GW, Molloy PL, Paul CL (1992) A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. ProceedingsoftheNationalAcademyofSciences 89:1827–1831. Fukuda K, Ichiyanagi K, Yamada Y , Go Y , Udono T, Wada S, Maeda T, Soejima H, Saitou N, Ito T et al. (2013) Regional DNA methylation differences between humans and chimpanzees are associated with genetic changes, transcriptional divergence and disease genes. Journalofhuman genetics 58:446–454. Gao F, Niu Y , Sun YE, Lu H, Chen Y , Li S, Kang Y , Luo Y , Si C, Yu J et al. (2017) De novo DNA methylation during monkey pre-implantation embryogenesis. Cellresearch 27:526. Gardiner-Garden M, Frommer M (1987) CpG islands in vertebrate genomes. Journal of Molec- ularBiology 196:261–282. Gavery MR, Roberts SB (2010) DNA methylation patterns provide insight into epigenetic regu- lation in the Pacific oyster (Crassostrea gigas). BMCgenomics 11:483. Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ, Khurana E, Rozowsky J, Alexander R et al. (2012) Architecture of the human regulatory network derived from ENCODE data. Nature 489:91–100. Gilks WR, Richardson S, Spiegelhalter D (1995) Markov chain Monte Carlo in practice CRC press. Ginsburg M, Snow M, McLAREN A (1990) Primordial germ cells in the mouse embryo during gastrulation. Development 110:521–528. Guibert S, Forn´ e T, Weber M (2012) Global profiling of DNA methylation erasure in mouse primordial germ cells. Genomeresearch 22:633–641. Guo JU, Su Y , Zhong C, Ming Gl, Song H (2011) Hydroxylation of 5-methylcytosine by TET1 promotes active dna demethylation in the adult brain. Cell 145:423–434. Hackett JA, Sengupta R, Zylicz JJ, Murakami K, Lee C, Down TA, Surani MA (2013) Germline DNA demethylation dynamics and imprint erasure through 5-hydroxymethylcytosine. Sci- ence 339:448–452. 103 Hajkova P, Erhardt S, Lane N, Haaf T, El-Maarri O, Reik W, Walter J, Surani MA (2002) Epige- netic reprogramming in mouse primordial germ cells. Mechanismsofdevelopment 117:15–23. Hammoud SS, Low DH, Yi C, Carrell DT, Guccione E, Cairns BR (2014) Chromatin and tran- scription transitions of mammalian adult germline stem cells and spermatogenesis. Cell Stem Cell 15:239–253. Hata K, Okano M, Lei H, Li E (2002) Dnmt3L cooperates with the Dnmt3 family of de novo DNA methyltransferases to establish maternal imprints in mice. Development 129:1983–1993. Hedges SB, Marin J, Suleski M, Paymer M, Kumar S (2015) Tree of life reveals clock-like speciation and diversification. MolBiolEvol 32:835–845. Heijmans BT, Kremer D, Tobi EW, Boomsma DI, Slagboom PE (2007) Heritable rather than age-related environmental and stochastic factors dominate variation in DNA methylation of the human IGF2/H19 locus. Humanmoleculargenetics 16:547–554. Hernando-Herraez I, Heyn H, Fernandez-Callejo M, Vidal E, Fernandez-Bellon H, Prado- Martinez J, Sharp AJ, Esteller M, Marques-Bonet T (2015) The interplay between DNA methy- lation and sequence divergence in recent human evolution. bioRxiv p. 015966. Herrero J, Muffato M, Beal K, Fitzgerald S, Gordon L, Pignatelli M, Vilella AJ, Searle SM, Amode R, Brent S et al. (2016) Ensembl comparative genomics resources. Database 2016:bav096. Hodges E, Molaro A, Dos Santos CO, Thekkat P, Song Q, Uren PJ, Park J, Butler J, Rafii S, McCombie WR et al. (2011) Directional DNA methylation changes and complex intermedi- ate states accompany lineage specificity in the adult hematopoietic compartment. Molecular Cell 44:17–28. Holliday R, Pugh JE (1975) DNA modification mechanisms and gene activity during develop- ment. Science 187:226–232. Hubisz MJ, Pollard KS, Siepel A (2011) PHAST and RPHAST: phylogenetic analysis with space/time models. BriefinBioinform 12:41–51. Irizarry RA, Ladd-Acosta C, Wen B, Wu Z, Montano C, Onyango P, Cui H, Gabo K, Rongione M, Webster M et al. (2009) The human colon cancer methylome shows similar hypo-and hyper- methylation at conserved tissue-specific CpG island shores. Naturegenetics 41:178–186. Ito S, Shen L, Dai Q, Wu SC, Collins LB, Swenberg JA, He C, Zhang Y (2011) Tet proteins can convert 5-methylcytosine to 5-formylcytosine and 5-carboxylcytosine. Science 333:1300–1303. Jabbari K, Cacci` o S, de Barros JPP, Desgr` es J, Bernardi G (1997) Evolutionary changes in cpg and methylation levels in the genome of vertebrates. Gene 205:109–118. 104 J¨ ahner D, Stuhlmann H, Stewart CL, Harbers K, L¨ ohler J, Simon I, Jaenisch R (1982) De novo methylation and expression of retroviral genomes during mouse embryogenesis. Nature . Johnson MD, Mueller M, Adamowicz-Brice M, Collins MJ, Gellert P, Maratou K, Srivastava PK, Rotival M, Butt S, Game L et al. (2014) Genetic analysis of the cardiac methylome at single nucleotide resolution in a model of human cardiovascular disease. PLoSgenetics 10:e1004813. Johnstone KA, DuBose AJ, Futtner CR, Elmore MD, Brannan CI, Resnick. JL (2006) A human imprinting centre demonstrates conserved acquisition but diverged maintenance of imprinting in a mouse model for Angelman syndrome imprinting defects. Human Molecular Genet- ics 15:393 – 404. Jojic V , Jojic N, Meek C, Geiger D, Siepel A, Haussler D, Heckerman D (2004) Efficient approx- imations for learning phylogenetic HMM models from data. Bioinformatics 20:i161–i168. Jones GL, Haran M, Caffo BS, Neath R (2006) Fixed-width output analysis for Markov chain Monte Carlo. JournaloftheAmericanStatisticalAssociation 101:1537–1547. Jones PA (2012) Functions of DNA methylation: islands, start sites, gene bodies and beyond. NatureReviewsGenetics 13:484–492. Jordan IK, Rogozin IB, Glazko GV , Koonin. EV (2003) Origin of a substantial fraction of human regulatory sequences from transposable elements. TrendsinGenetics TIG 19:68 – 72. Karaman MW, Houck ML, Chemnick LG, Nagpal S, Chawannakul D, Sudano D, Pike BL, Ho VV , Ryder OA, Hacia JG (2003) Comparative analysis of gene-expression patterns in human and African great ape cultured fibroblasts. GenomeRes 13:1619–1630. Keller TE, Han P, Soojin VY (2015) Evolutionary transition of promoter and gene body DNA methylation across invertebrate-vertebrate boundary.Molecularbiologyandevolution p. msv345. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D (2002) The human genome browser at UCSC. Genomeresearch 12:996–1006. Keown CL, Berletch JB, Castanon R, Nery JR, Disteche CM, Ecker JR, Mukamel EA (2017) Allele-specific non-CG DNA methylation marks domains of active chromatin in female mouse brain. ProceedingsoftheNationalAcademyofSciences 114:E2882–E2890. Khaitovich P, Hellmann I, Enard W, Nowick K, Leinweber M, Franz H, Weiss G, Lachmann M, P¨ a¨ abo S (2005) Parallel patterns of evolution in the genomes and transcriptomes of humans and chimpanzees. Science 309:1850–1854. Kieffer-Kwon KR, Tang Z, Mathe E, Qian J, Sung MH, Li G, Resch W, Baek S, Pruett N, Grøntved L et al. (2013) Interactome maps of mouse gene regulatory domains reveal basic prin- ciples of transcriptional regulation. Cell 155:1507–1520. 105 King M, Wilson AC (1975) Evolution at two levels in humans and chimpanzees. Sci- ence 188:107–116. Kriaucionis S, Heintz N (2009) The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain. Science 324:929–930. Ku M, Koche RP, Rheinbay E, Mendenhall EM, Endoh M, Mikkelsen TS, Presser A, Nusbaum C, Xie X, Chi AS et al. (2008) Genomewide analysis of PRC1 and PRC2 occupancy identifies two classes of bivalent domains. PLoSGenet 4:e1000242. Kunarso G, Chia NY , Jeyakani J, Hwang C, Lu X, Chan YS, Ng HH, Bourque G (2010) Transpos- able elements have rewired the core regulatory network of human embryonic stem cells. Nature genetics 42:631–634. Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, Kheradpour P, Zhang Z, Wang J, Ziller MJ et al. (2015) Integrative analysis of 111 reference human epigenomes. Nature 518:317–330. Lane N, Dean W, Erhardt S, Hajkova P, Surani A, Walter J, Reik W (2003) Resistance of IAPs to methylation reprogramming may provide a mechanism for epigenetic inheritance in the mouse. Genesis 35:88–93. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. GenomeBiol 10:R25. Laurent L, Wong E, Li G, Huynh T, Tsirigos A, Ong CT, Low HM, Sung KWK, Rigoutsos I, Loring J et al. (2010) Dynamic changes in the human methylome during differentiation. Genome Research 20:320–331. Lauritzen SL (1996) Graphicalmodels, V ol. 17 Clarendon Press. Law JA, Jacobsen SE (2010) Establishing, maintaining and modifying dna methylation patterns in plants and animals. NatureReviewsGenetics 11:204–220. Lesch BJ, Silber SJ, McCarrey JR, Page DC (2016) Parallel evolution of male germline epigenetic poising and somatic development in animals. NatGenet 48:888–894. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R et al. (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079. Lienert F, Wirbelauer C, Som I, Dean A, Mohn F, Sch¨ ubeler D (2011) Identification of genetic elements that autonomously determine DNA methylation states. Naturegenetics 43:1091–1097. Lio P, Goldman N (1998) Models of molecular evolution and phylogeny. Genome research 8:1233–1244. 106 Lister R, Mukamel EA, Nery JR, Urich M, Puddifoot CA, Johnson ND, Lucero J, Huang Y , Dwork AJ, Schultz MD et al. (2013) Global epigenomic reconfiguration during mammalian brain development. Science 341:1237905. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo QM et al. (2009) Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462:315–322. Lister R, Pelizzola M, Kida YS, Hawkins RD, Nery JR, Hon G, Antosiewicz-Bourget J, O’malley R, Castanon R, Klugman S et al. (2011) Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature 471:68. Liu W, Maraia R, Rubin C, Schmid C (1994) Alu transcrips: Cytoplasmic localisation and regu- lation by DNA methylation. NucleicAcidsRes 22:1087 – 1095. Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N et al. (2013) The genotype-tissue expression (GTEx) project. Naturegenetics 45:580–585. Lu H, Yuan Z, Tan T, Wang J, Zhang J, Luo HJ, Xia Y , Ji W, Gao F (2015) Improved tagmentation- based whole-genome bisulfite sequencing for input DNA from less than 100 mammalian cells. Epigenomics 7:47–56. Lun AT, Smyth GK (2016) csaw: a Bioconductor package for differential binding analysis of ChIP-seq data using sliding windows. NucleicAcidsRes 44:e45. Lynch VJ, Leclerc RD, May G, Wagner. GP (2011) Transposon-mediated rewiring of gene regulatory networks contributed to the evolution of pregnancy in mammals. Nature Genet- ics 43:1154 – 1159. Martin DI, Singer M, Dhahbi J, Mao G, Zhang L, Schroth GP, Pachter L, Boffelli D (2011) Phyloepigenomic comparison of great apes reveals a correlation between somatic and germline methylation states. Genomeresearch 21:2049–2057. Martinowich K, Hattori D, Wu H, Fouse S, He F, Hu Y , Fan G, Sun YE (2003) DNA methylation-related chromatin remodeling in activity-dependent BDNF gene regulation. Sci- ence 302:890–893. Maunakea AK, Chepelev I, Cui K, Zhao. K (2013) Intragenic DNA methylation modulates alter- native splicing by recruiting mecp2 to promote exon recognition. CellResearch 23:1256 – 1269. Meissner A, Gnirke A, Bell GW, Ramsahoye B, Lander ES, Jaenisch R (2005) Reduced represen- tation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic acidsresearch 33:5868–5877. Meissner A, Mikkelsen TS, Gu H, Wernig M, Hanna J, Sivachenko A, Zhang X, Bernstein BE, Nusbaum C, Jaffe DB et al. (2008) Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature 454:766–770. 107 Mi H, Poudel S, Muruganujan A, Casagrande JT, Thomas PD (2016) PANTHER version 10: expanded protein families and functions, and analysis tools. NucleicAcidsRes 44:D336–D342. Mills RE, Bennett EA, Iskow RC, Devine. SE (2007) Which transposable elements are active in the human genome? TrendsinGenetics 23:183 – 191. Molaro A, Falciatori I, Hodges E, Aravin AA, Marran K, Rafii S, McCombie WR, Smith AD, Hannon GJ (2014) Two waves of de novo methylation during mouse germ cell development. Genes&Development 28:1544–1549. Molaro A, Hodges E, Fang F, Song Q, McCombie WR, Hannon GJ, Smith AD (2011) Sperm methylation profiles reveal features of epigenetic inheritance and evolution in primates. Cell 146:1029–1041. Monk M, Adams R, Rinaldi A (1991) Decrease in DNA methylase activity during preimplanta- tion development in the mouse. Development 112:189–192. Moran S, Arribas C, Esteller M (2016) Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics 8:389–399. Murphy WJ, Eizirik E, O’Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, de Jong WW et al. (2001) Resolution of the early placental mammal radiation using bayesian phylogenetics. Science 294:2348–2351. Neri F, Rapelli S, Krepelova A, Incarnato D, Parlato C, Basile G, Maldotti M, Anselmi F, Oliviero S et al. (2017) Intragenic dna methylation prevents spurious transcription initiation. Nature 543:72–77. Odom DT, Dowell RD, Jacobsen ES, Gordon W, Danford TW, MacIsaac KD, Rolfe PA, Con- boy CM, Gifford DK, Fraenkel E (2007) Tissue-specific transcriptional regulation has diverged significantly between human and mouse. NatureGenetics 39:730–732. Okano M, Xie S, Li E (1998) Cloning and characterization of a family of novel mammalian DNA (cytosine-5) methyltransferases. Naturegenetics 19:219–220. Ooi SK, Bestor TH (2008) The colorful history of active DNA demethylation. Cell 133:1145–1148. Pai AA, Bell JT, Marioni JC, Pritchard JK, Gilad Y (2011) A genome-wide study of DNA methylation patterns and gene expression levels in multiple human and chimpanzee tissues. PLoS Genetics 7:e1001316. Paulsen M, El-Maarri O, Engemann S, Str¨ odicke M, Franck O, Davies K, Reinhardt R, Reik W, Walter. J (2000) Sequence conservation and variability of imprinting in the Beckwith Wiedemann syndrome gene cluster in human and mouse. HumanMolecularGenetics 9:1829 – 1841. 108 Peng JC, Valouev A, Swigut T, Zhang J, Zhao Y , Sidow A, Wysocka J (2009) Jarid2/jumonji coordinates control of PRC2 enzymatic activity and target gene occupancy in pluripotent cells. Cell 139:1290–1302. Picelli S, Bj¨ orklund ˚ AK, Reinius B, Sagasser S, Winberg G, Sandberg R (2014) Tn5 transposase and tagmentation procedures for massively scaled sequencing projects. Genome Res 24:2033–2040. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A (2010) Detection of nonneutral substitution rates on mammalian phylogenies. GenomeResearch 20:110–121. Portela A, Esteller M (2010) Epigenetic modifications and human disease. Nature biotechnol- ogy 28:1057–1068. Pupko T, Pe I, Shamir R, Graur D (2000) A fast algorithm for joint reconstruction of ancestral amino acid sequences. MolecularBiologyandEvolution 17:890–896. Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–842. Ram´ ırez F, D¨ undar F, Diehl S, Gr¨ uning BA, Manke T (2014) deepTools: a flexible platform for exploring deep-sequencing data. NucleicAcidsRes 42:W187–W191. Rand AC, Jain M, Eizenga JM, Musselman-Brown A, Olsen HE, Akeson M, Paten B (2017) Map- ping DNA methylation with high-throughput nanopore sequencing. naturemethods 14:411–413. Rando OJ (2012) Combinatorial complexity in chromatin structure and function: revisiting the histone code. Currentopinioningenetics&development 22:148–155. Raney BJ, Dreszer TR, Barber GP, Clawson H, Fujita PA, Wang T, Nguyen N, Paten B, Zweig AS, Karolchik D et al. (2014) Track data hubs enable visualization of user-defined genome-wide annotations on the UCSC Genome Browser. Bioinformatics 30:1003–1005. Reik W, Dean W, Walter J (2001) Epigenetic reprogramming in mammalian development. Sci- ence 293:1089–1093. Riggs AD (1975) X inactivation, differentiation, and DNA methylation. CytogeneticandGenome Research 14:9–25. Roberts GO, Polson NG (1994) On the geometric convergence of the Gibbs sampler. J Roy Stat SocB pp. 377–384. Rosenbloom KR, Armstrong J, Barber GP, Casper J, Clawson H, Diekhans M, Dreszer TR, Fujita PA, Guruvadoo L, Haeussler M et al. (2015) The UCSC genome browser database: 2015 update. NucleicAcidsResearch 43:D670–D681. 109 Rougier N, Bourc’his D, Gomes DM, Niveleau A, Plachot M, P` aldi A, Viegas-P´ equignot E (1998) Chromosome methylation patterns during mammalian preimplantation development. Genes & development 12:2108–2113. Schlesinger F, Smith AD, Gingeras TR, Hannon GJ, Hodges E (2013) De novo DNA demethy- lation and noncoding transcription define active intergenic regulatory elements. Genome Research 23:1601–1614. Seisenberger S, Peat JR, Hore TA, Santos F, Dean W, Reik W (2013) Reprogramming DNA methylation in the mammalian life cycle: building and breaking epigenetic barriers.Philosophical TransactionsoftheRoyalSocietyB:BiologicalSciences 368:20110330. Severin PM, Zou X, Gaub HE, Schulten K (2011) Cytosine methylation alters DNA mechanical properties. Nucleicacidsresearch p. gkr578. Shen JC, Rideout WM, Jones PA (1994) The rate of hydrolytic deamination of 5-methylcytosine in double-stranded DNA. NucleicAcidsResearch 22:972–976. Shirane K, Toh H, Kobayashi H, Miura F, Chiba H, Ito T, Kono T, Sasaki H (2013) Mouse oocyte methylomes at base resolution reveal genome-wide accumulation of non-CpG methylation and role of DNA methyltransferases. PLoSgenetics 9:e1003439. Shukla S, Kavak E, Gregory M, Imashimizu M, Shutinoski B, Kashlev M, Oberdoerffer P, Sand- berg R, Oberdoerffer S (2011) CTCF-promoted RNA polymerase II pausing links DNA methy- lation to splicing. Nature 479:74. Siegmund KD, Marjoram P, Woo YJ, Tavar´ e S, Shibata D (2009) Inferring clonal expansion and cancer stem cell dynamics from DNA methylation patterns in colorectal cancers. ProcNatlAcad Sci 106:4828–4833. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genomeresearch 15:1034–1050. Siepel A, Haussler D (2004) Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. MolBiolEvol 21:468–488. Simmen MW, Leitgeb S, Charlton J, Jones SJ, Harris BR, Clark VH, Bird A (1999) Nonmethy- lated transposable elements and methylated genes in a chordate genome. Science 283:1164–1167. Smallwood SA, Tomizawa Si, Krueger F, Ruf N, Carli N, Segonds-Pichon A, Sato S, Hata K, Andrews SR, Kelsey G (2011) Dynamic CpG island methylation landscape in oocytes and preim- plantation embryos. Naturegenetics 43:811–814. Smith AD, Chung WY , Hodges E, Kendall J, Hannon G, Hicks J, Xuan Z, Zhang MQ (2009) Updates to the RMAP short-read mapping software. Bioinformatics 25:2841–2842. 110 Smith ZD, Chan MM, Mikkelsen TS, Gu H, Gnirke A, Regev A, Meissner A (2012) A unique regulatory phase of DNA methylation in the early mammalian embryo. Nature 484:339–344. Smith ZD, Meissner A (2013) DNA methylation: roles in mammalian development. Nature ReviewsGenetics 14:204–220. Song Q, Decato B, Hong EE, Zhou M, Fang F, Qu J, Garvin T, Kessler M, Zhou J, Smith AD (2013) A reference methylome database and analysis pipeline to facilitate integrative and com- parative epigenomics. PloSOne 8:e81148. Stadler MB, Murr R, Burger L, Ivanek R, Lienert F, Sch¨ oler A, van Nimwegen E, Wirbelauer C, Oakeley EJ, Gaidatzis D et al. (2011) DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature . Stevens M, Cheng JB, Li D, Xie M, Hong C, Maire CL, Ligon KL, Hirst M, Marra MA, Costello JF et al. (2013) Estimating absolute methylation levels at single-CpG resolution from methylation enrichment and restriction enzyme sequencing methods. Genomeresearch 23:1541–1553. Stewart CL, Stuhlmann H, J¨ ahner D, Jaenisch R (1982) De novo methylation, expression, and infectivity of retroviral genomes introduced into embryonal carcinoma cells. Proceedings of the NationalAcademyofSciences 79:4098–4102. Strahl BD, Allis CD (2000) The language of covalent histone modifications. Nature 403:41. Supek F, Boˇ snjak M, Scaronlkunca N, ˇ Smuc T (2011) REVIGO Summarizes and Visualizes Long Lists of Gene Ontology Terms. PLoSOne 6. Suzuki MM, Bird A (2008) DNA methylation landscapes: provocative insights from epige- nomics. NatureReviewsGenetics 9:465–476. Szulwach KE, Li X, Li Y , Song CX, Wu H, Dai Q, Irier H, Upadhyay AK, Gearing M, Levey AI et al. (2011) 5-hmC-mediated epigenetic dynamics during postnatal neurodevelopment and aging. Natureneuroscience 14:1607–1616. Tahiliani M, Koh KP, Shen Y , Pastor WA, Bandukwala H, Brudno Y , Agarwal S, Iyer LM, Liu DR, Aravind L et al. (2009) Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1. Science 324:930–935. Thornburg BG, Gotea V , Makałowski. W (2006) Transposable elements as a significant source of transcription regulating signals. Gene 365:104–110. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC, Stergachis AB, Wang H, Vernot B et al. (2012) The accessible chromatin landscape of the human genome. Nature 489:75–82. Tung J, Barreiro LB, Johnson ZP, Hansen KD, Michopoulos V , Toufexis D, Michelini K, Wilson ME, Gilad Y (2012) Social environment is associated with gene regulatory variation in the rhesus macaque immune system. ProceedingsoftheNationalAcademyofSciences 109:6490–6495. 111 Tweedie S, Charlton J, Clark V , Bird A (1997) Methylation of genomes and genes at the invertebrate-vertebrate boundary. Molecularandcellularbiology 17:1469–1475. Urich MA, Nery JR, Lister R, Schmitz RJ, Ecker JR (2015) MethylC-seq library preparation for base-resolution whole-genome bisulfite sequencing. Natureprotocols 10:475–483. Van de Lagemaat LN, Landry JR, Mager DL, Medstrand. P (2003) Transposable elements in mammals promote regulatory variation and diversification of genes with specialized functions. TrendsinGenetics TIG 19:530 – 536. Varriale A, Bernardi G (2006) DNA methylation and body temperature in fishes. Gene 385:111–121. Villar D, Berthelot C, Aldridge S, Rayner TF, Lukk M, Pignatelli M, Park TJ, Deaville R, Erichsen JT, Jasinska AJ et al. (2015) Enhancer evolution across 20 mammalian species.Cell 160:554–566. Waddington CH (1953) Genetic assimilation of an acquired character. Evolution 7:118–126. Wade PA, Gegonne A, Jones PL, Ballestar E, Aubry F, Wolffe AP (1999) Mi-2 complex couples DNA methylation to chromatin remodelling and histone deacetylation. Naturegenetics 23:62–66. Walsh CP, Chaillet JR, Bestor TH (1998a) Transcription of IAP endogenous retroviruses is con- strained by cytosine methylation. NatGenet 20:116–117. Walsh CP, Chaillet JR, Bestor. TH (1998b) Transcription of IAP endogenous retroviruses is constrained by cytosine methylation. NatureGenetics 20:116 – 117. Wang J, Zhuang J, Iyer S, Lin XY , Greven MC, Kim BH, Moore J, Pierce BG, Dong X, Vir- gil D et al. (2013) Factorbook.org: a Wiki-based database for transcription factor-binding data generated by the ENCODE consortium. NucleicAcidsResearch 41:D171–D176. Wang J, Cao X, Zhang Y , Su. B (2012) Genome-wide DNA methylation analyses in the brain reveal four differentially methylated regions between humans and non-human primates. BMC EvolutionaryBiology 12:144. Wang Q, Gu L, Adey A, Radlwimmer B, Wang W, Hovestadt V , B¨ ahr M, Wolf S, Shendure J, Eils R et al. (2013) Tagmentation-based whole-genome bisulfite sequencing.NatProtoc 8:2022–2032. Watt F, Molloy PL (1988) Cytosine methylation prevents binding to DNA of a hela cell tran- scription factor required for optimal expression of the adenovirus major late promoter. Genes & development 2:1136–1143. Weber M, Davies JJ, Wittig D, Oakeley EJ, Haase M, Lam WL, Schuebeler D (2005) Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Naturegenetics 37:853–862. 112 Weber M, Hellmann I, Stadler MB, Ramos L, P¨ a¨ abo S, Rebhan M, Sch¨ ubeler D (2007) Distri- bution, silencing potential and evolutionary impact of promoter dna methylation in the human genome. Naturegenetics 39:457–466. Wei GC, Tanner MA (1990) A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms.JournaloftheAmericanstatisticalAssociation 85:699–704. Wion D, Casades´ us J (2006) N6-methyl-adenine: an epigenetic signal for DNA–protein interac- tions. NatureReviewsMicrobiology 4:183–192. Woodcock D, Lawler C, Linsenmeyer M, Doherty J, Warren W (1997) Asymmetric methyla- tion in the hypermethylated CpG promoter region of the human L1 retrotransposon. J. Biol. Chem 272:7810 – 7816. Wu H, D’Alessio AC, Ito S, Wang Z, Cui K, Zhao K, Sun YE, Zhang Y (2011) Genome-wide analysis of 5-hydroxymethylcytosine distribution reveals its dual function in transcriptional reg- ulation in mouse embryonic stem cells. Genes&development 25:679–684. Yang Z (1994a) Estimating the pattern of nucleotide substitution. JMolEvol 39:105–111. Yang Z (1994b) Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. JMolEvol 39:306–314. Yates A, Akanni W, Amode MR, Barrell D, Billis K, Carvalho-Silva D, Cummins C, Clapham P, Fitzgerald S, Gil L et al. (2016) Ensembl 2016. NucleicAcidsRes 44:D710–D716. Yu B, Dong X, Gravina S, Kartal ¨ O, Schimmel T, Cohen J, Tortoriello D, Zody R, Hawkins RD, Vijg J (2017) Genome-wide, single-cell dna methylomics reveals increased non-cpg methylation during human oocyte maturation. StemCellReports 9:397–407. Zemach A, McDaniel IE, Silva P, Zilberman D (2010) Genome-wide evolutionary analysis of eukaryotic DNA methylation. Science 328:916–919. Zeng J, Konopka G, Hunt BG, Preuss TM, Geschwind D, Soojin VY (2012) Divergent whole- genome methylation maps of human and chimpanzee brains reveal epigenetic basis of human regulatory evolution. TheAmericanJournalofHumanGenetics 91:455–465. Ziller MJ, M¨ uller F, Liao J, Zhang Y , Gu H, Bock C, Boyle P, Epstein CB, Bernstein BE, Lengauer T et al. (2011) Genomic distribution and inter-sample variation of non-CpG methylation across human cell types. PLoSgenetics 7:e1002389. 113 Appendix A Appendix A.1 Computing derivatives Under the binary-state independent site phylo-epigenetic model, the derivative of the transition rate matrixP (t) with respect to rate is P (t) = @ @ P (t) =T 2 4 1 1 1 1 3 5 ; where as aboveT = 1 exp(t). The derivative of the likelhood with respect to satisfies the recurrence relation @p j (v) @ = X k P (` v ) jk q k (v) +P (` v ) jk X w2child(v) @p k (w) @ q k (v) p k (w) : (A.1) Incorporating the probabilities associated with the root noder, the derivative of the likelihood over the entire tree becomes @L @ = X j2f0;1g j @q j (r) @ : (A.2) We also require derivatives of the likelihood with respect to the transformed lengths of each tree branch. The derivative of the rate matrix with respect to transformed branch lengthP T (t) can be computed as P T (t) = @P (t) @T =Q: 114 LetT v = 1 exp(` v ). Then we have @p j (v) @T v = X k P Tv (` v ) jk q k (v) +P (` v ) jk X w2child(v) @p k (w) @T v q k (v) p k (w) : (A.3) Finally, the derivative of the likelihood with respect to the initial distribution parameter 0 = 1 1 at root noder is @L(;X) @ 0 =q 0 (r)q 1 (r): A.2 MCMC convergence diagnostics We adopt a fixed-width stopping rule for the Markov chain based on a consistent batch means method, i.e. the CBM method referred to by (Jones et al., 2006). In each E-step, lett be the current chain length. Let the batch sizeb t , and batch numbera t be functions oft: b t = j t 1 2 k ;a t =bt=b t c: Let w j be the mean of thej th batch: w j = 1 b t jbt X i=(j1)bt+1 w (i) ; forj = 1;:::;a t : The CBM estimate of 2 w is ^ 2 w;CBM = b t a t 1 at X j=1 ( w j w t ) 2 : We stop sampling the first time that t ? ^ w;CBM p t +p(t)<; for allw2W; 115 wheret ? is the appropriate quantile of Student’s t-distribution witha t 1 degree of freedom, and p(t) = 1ft t min g, where t min is a minimum sampling effort (for example 100). In our implementation, we chose = N=10 4 as the half 95%-confidence interval width cutoff. Then, E ZjX; (t)W is approximated with the sample average of the second half of the chain, which we denote with ~ W . A.3 Two-tuple Bayesian network model Consider two consecutive sites in the genome. The methylation states at these two sites are a random variable x from the state space =f00; 01; 10; 11g where 0 and 1 indicate hypo and hyper-methylation states, respectively. The joint methylation statex changes over time, andx(t) evolves according to a continuous time Markov chain with transition rate matrix Q 44 , along a phylogeneic tree =fV;Eg. Considering the spatial symmetry of the genome and epigenome, we can parameterize the transition rate matrixQ as below with 4 parametersa;b;c;d> 0. Q = 0 B B B B B B B @ 00 01 10 11 00 2a a a 0 01 b (b +c) 0 c 10 b 0 (b +c) c 11 0 d d 2d 1 C C C C C C C A The equilibrium state of this process is/ (bd;ad;ad;ac) for 2-tuples, and s / ((a +b)d;a(c + d)) for single sites. We can further eliminate one parameter by requiring the unit-time mutation rate to be 2 when the process reaches equilibrium, i.e. P i i Q ii = 2. Thus,d =ac=(2a(b+c1)b), and we are left with 3 parameters with constraints:a;b;c> 0 and 2a(b +c 1)b> 0. Let = ( 0 ; 1 ) be the initial distribution of methylation states in the root species, andG 22 be the transition probability matrix of a 2-state discrete-time Markov chain that models the methylome of the root species. Together, the model is represented by =f;;G;Qg. 116 Let (u;v)2E be two species in the phylogenetic tree wherev had evolved fromu for timet. Letu n;n+1 be the 2-tuple methylation state for siten andn + 1 in speciesu. Then the conditional probability ofv n;n+1 givenu n;n+1 is P (` uv ) ij = Pr(v n;n+1 =jju n;n+1 =i) = exp(Qt) ij ;8i;j2 : Now consider siten + 1 in speciesv, we have its conditional distribution given its previous site’s state and the ancestral 2-tuple’s state: p(v) ij;kl = Pr(v n+1 =lju n =i;u n+1 =j;v n =k) = exp(Qt) ij;kl exp(Qt) ij;k0 + exp(Qt) ij;k1 ; (A.4) wherei;j;k;l2f0; 1g. Complete data likelihood LetY be the methylation states of all sites in all species in the phylo- genetic tree. L(;Y ) = r 1 N1 Y n=1 G rnr n+1 Y (u;v)2E P (l uv ) u 1;2 v 1;2 N1 Y n=2 Pr(v n+1 ju n;n+1 ;v n ) (A.5) Log-likelihood is thus l(;Y ) = log r 1 + N1 X n=1 logG rnr n+1 + X (u;v)2E logP (l uv ) u 1;2 v 1;2 + N1 X n=2 log Pr(v n+1 ju n;n+1 ;v n ) = log r 1 + X i;j2f0;1g w ij logG ij + X (u;v)2E X i;j;k;l2f0;1g (v) ij;kl logP (l uv ) ij;kl +w(v) ij;kl logp(v) ij;kl (A.6) The sufficient statistics areW = w ij ;(v) ij;kl ;w(v) ij;kl :v2Vnfrg; i;j;k;l2f0; 1g : 117 Parameter optimization Following the notations by Siepel & Haussler (2004), assume Q is diagonalizable asSS 1 , and let the elements ofS andS 1 be denotedfs ij g andfs 0 ij g Thus exp(Qt) ij = X k s ik e k t s 0 kj and @ @t exp(Qt) ij = X k s ik k e k t s 0 kj : IfP is a rate-matrix parameter. @ @P exp(Qt) =S F S 1 @ @P (Qt)S S 1 where denotes the Hadamard (pointwise) product of two matrices andF =ff ab g is defined as f i;j = 8 > > < > > : t exp( i t) if i = j exp( i t) exp( j t) i j otherwise. EM algorithm for incomplete observation The conditional probability factorization above defines a Bayesian network model with directed edges linking neighboring sites within a genome from one end to the other end, and directed edges from an ancestral site to its direct descendant site as well as the next (but not the previous) descendant site. We note that the top (root) to bottom (leaf) edge directions are a natural reflection of the causal relationships between species in evo- lution. The horizontal directions within each genome is not natural because sites in the genome should have symmetric relationship. 118 Gibbs sampling Consider a siten (1 < n < N) in an internal nodev2VnfL[frgg. The set of neighbors ofv n is denoted byN(v n ), and containsfv n1 ;v n+1 g[fu n1 ;u n1 ;u n+1 : u2 fpa(v);child(v)gg. Given the realized values ofN(v n ), the distribution ofv n is : p(v n jN(v n ))/ n Y i=n1 p(v i+1 ) u i u i+1 ;v i v i+1 Y w2child(v) p(w i+1 ) v i v i+1 ;w i w i+1 ; v n 2f0; 1g and (u;v)2E; wherep(l) ij;kl is the conditional distribution defined in Equation (A.4). Comparison with phylo-epigenetic model The major difference between 2-tuple context- dependent model and our proposed phylo-epigenetic model is the Markov properties assumed by the two models. This is illustrated in Figure A.1 showing the Markov blanket of a typical site under these two graphical models. The phylo-epigenetic model induces an asymmetry in the positioning of sites in the Markov blanket relative to the site of interest. In contrast, the 2-tuple context dependent model induces a symmetric positioning of sites in the Markov blanket relative to the site of interest. The latter may be a more desirable features, as a methylome has no intrinsic directionality and should possess spatial symmetry. However, the Bayesian network version of the 2-tuple context dependent model does not portray perfect symmetry either. Horizontally flipping the assignment of states at sites in the Markov blanket will result in a different conditional distribu- tion for the site of interest, if the assignment of states is not horizontally symmetric. It is possible to construct a graphical model that can express the horizontal symmetry of methylomes by using undirected edges between neighboring sites. One such model is discussed in the next section. A.4 Two-tuple Markov Random Field Let G = (V;E) be the undirected graph representing methylome evolution. Neighboring sites within each species are connected by an edge. Each site in a non-leaf species is connected to its direct descendant with an edge, as well as to the neighbors of the direct descendant in the 119 a b c context-dependent model (2-tuple) a b c a b c phylo-epigenetic model A B Figure A.1: Comparison of Markov blankets For simplicity, assume the phylogenetic tree contains only one lineage of 3 consecutive speciesa, b andc. The site of interest is colored red. Sites in the Markov blanket of the site of interest under each model are colored black. The Markov blankets are highlighted in green. All other sites are colored gray. Given the states at all black sites, the state at the red site is independent of all gray sites. descendant species, if exist. LetCl(G) be the collection of maximal cliques in graphG. Observed dataX, unobserved dataZ. Complete dataX = (Y;Z). We also useZ() to denote the partition function, which should be easy to distinguish from the unobserved data Z. By Hammersley- Clifford’s theorem, the probability distribution satisfying the Markov properties indicated by graph G can factorize according to the maximal cliques in the G. Under our construction of G, all maximal cliques are the 4-cliques consisting of two neighboring sites in an internal species and their direct descendant sites in one child species. The probability distribution over the graph is expressed as p (x) = 1 Z() Y C2Cl(G) C (x c j) E-step E ZjY; (0) logp (Y;Z) = E ZjY; (0) X C2Cl(G) log C (X c j) logZ() = X (u;v)2E X i;j;k;l2f0;1g log uv (ijklj) E ZjY; (0)C uv;ijkl logZ() (A.7) LetW uv;ijkl = E ZjY; (0)C uv;ijkl , which can be estimated through MCMC sampling. 120 M-step The main challenge in M-step is computingZ() and its partial derivatives w.r.t. We can use the idea of importance sampling to approximateZ(), and @ @P . Z() = X X Y C2Cl(G) C (X C j) = X X f(X) f(X) Y C2Cl(G) C (X C j) =E Xf Q C2Cl(G) C (X C j) f(X) 1 N N X n=1 Q C2Cl(G) C (x (n) C j) f(x (n) ) ; wherefX (n) g N n=1 i:i:df: @ @P Z() 1 N N X n=1 Q C2Cl(G) C (x (n) C j) f(x (n) ) X C2Cl(G) 1 C (x (n) C j) @ @P C (x (n) C j) Candidates for distributionf can be the product of marginal distributions, or the MRF distri- bution with some previous parameter (t) . Suppose that we choosef as the posterior distribution p(XjY; (t) ). Z() = X X Y C2Cl(G) C (X C j) = X X f(X) f(X) Y C2Cl(G) C (X C j) =E Xf Q C2Cl(G) C (X C j) f(X) 1 N N X n=1 Q C2Cl(G) C (x (n) C j) f(x (n) ) ; wherefX (n) g N n=1 i:i:df: @ @P Z() 1 N N X n=1 Q C2Cl(G) C (x (n) C j) f(x (n) ) X C2Cl(G) 1 C (x (n) C j) @ @P C (x (n) C j) Candidates for distributionf can be the product of marginal distributions, or the MRF distri- bution with some previous parameter (t) . Suppose that we choosef as the posterior distribution p(XjY; (t) ). 121 f(x) = 1 Z (t) Y C2Cl(G) C (x (n) C j (t) ) = exp X i;j;k;l2f0;1g log uv (ijklj)C uv;ijkl (x) logZ( (t) ) A.5 Intra-species methylome variation Variation between cell types Methylomes of different cell types reflect cell-type-specific regu- lations. Not only the specific genomic locations of HMRs vary between cell types, the overall size and total number of these regulatory regions also have huge variation among cell types. Figure A.2 shows this variation for human ES and somatic cells profiled in multiple studies (Lister et al., 2009; Hodges et al., 2011; Lister et al., 2013; Kundaje et al., 2015). Overall, cell types in early emby- ronic developmental stages have fewer and smaller HMRs than most differentiated fetal and adult tissues. We used human B-cell as a representative somatic cell type for comparison with human sperm. Inter-individual variation of sperm methylomes Within species, individual methylomes are also more variable at repeats. In Figure A.3, 6 mouse individuals from two different strains are compared with each other. The correlation of methylation levels in 200-bp bins reflects difference between different mouse strains, as well as between individuals from the same strain that are considered genetically identical. In Figure A.4, a comparison between the 4 human individuals used in this study shows a similar pattern, with inter-individual variation smaller than the inter- strain variation in mouse. A.6 Supplemental methods Sperm sample collection and library preparation The gorilla sample was obtained from 43 year old western lowland Gorilla at Zoo Atlanta after review and approval by the Zoos scientific 122 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Middle−Frontal−Gyrus−35d Middle−Frontal−Gyrus−2yr Middle−Frontal−Gyrus−25yr Middle−Frontal−Gyrus−5yr CD133HSC HSPC Neutrophil H1 BCell H1−Mesenchymal H1−Mesendoderm H1−NP H9 Adipose Adrenal−gland Bladder Ectoderm Endoderm Esophagus Fetal−Adrenal−gland Fetal−Heart Fetal−Large−intestine Fetal−Spinal−Cord Fetal−Stomach Fetal−Muscle−Leg Fetal−Muscle−Trunk Gastric Heart−Aorta HSC Left−ventrical Liver Lung Macrophage Mesoderm NK Ovary Psoas−muscle Right−atrium Right−ventrical Sigmoid−Colon Small−intestine Spleen Tcell Thymus 900 1000 1100 1200 1300 1400 40000 50000 60000 70000 80000 90000 Total HMR number Mean HMRs size germ layer ● ● ● ● ESC and ESC-derived ectoderm endoderm mesoderm Figure A.2: HMR number and average size in various cell types and tissues. DBA/2J (1) DBA/2J (2) C57Bl/6 (1) C57Bl/6 (2) C57Bl/6 (4) C57Bl/6 (3) DBA/2J (1) DBA/2J (2) C57Bl/6 (1) C57Bl/6 (4) C57Bl/6 (2) C57Bl/6 (3) DBA/2J (1) DBA/2J (2) C57Bl/6 (1) C57Bl/6 (4) C57Bl/6 (2) C57Bl/6 (3) 0.7 0.8 0.9 1 Correlation Color Key Promoter Repeat Other Figure A.3: Correlation of sperm DNA methylation in 6 mouse individuals from two strains. 123 MR2 MR1 HR1 HR2 0.7 0.8 0.9 1 Correlation Color Key MR2 MR1 HR1 HR2 MR2 MR1 HR1 HR2 Promoter Repeat Other Figure A.4: Correlation of sperm DNA methylation in 4 (20-40yr) human individuals. review committee. The sample was collected opportunistically (cage floor after animal mastur- bated) and shipped to San Diego overnight in BWW medium with penicillin/streptomycin. Two bisulfite-converted libraries were constructed as previously described (Molaro et al., 2011). Paired- end sequencing was performed on Illumina HiSeq2000 platform (paired-end 76bp read lengths). Dog sperm samples were from three individuals, each from a different breed: doberman, Labrador retriever and Portuguese water dog. Frozen dog semen samples were collected by Dr. Beckie Williams at Yorba Regional Animal Hospital, Anaheim, CA. Genomic DNA was extracted from each dog semen sample using a user-developed protocol (QIAamp DNA Mini kit and QIAamp DNA Blood Mini kit 11/2007, Isolation of genomic DNA from sperm using the QIAamp DNA Mini kit; protocol 2). Extracted DNA samples were sent to BGI for bisulfite- conversion, library construction and sequencing. Bisulfite sequencing libraries were prepared using standard Illumina protocol. Paired-end sequencing was performed on Illumina HiSeq2000 platform (paired-end 90bp read lengths). Genomic DNA from rat sperm (two biological replicates) was extracted according to previously described methods (Molaro et al., 2011; Molaro et al., 2014). Whole genome bisulfite libraries were constructed using tagmentation-based methods as previously described (Wang et al., 2013) with the following modifications: Transposomes were assembled with Tn5 enzyme purified in house using a plasmid construct kindly provided by the Sandberg lab and according to protocols 124 previously described (Picelli et al., 2014). Tagmentation of 50ng genomic DNA was performed in Tris-DMF buffer (10 mM Tris-HCl pH 7.5, 5 mM MgCl2, 10% DMF) at 55 C for 8 minutes. Reactions were stopped with 0.04% SDS (final concentration) and incubated for 7 minutes at 55 C. Samples were incubated with 2ul 10uM replacement oligo Tn5mC-Repl01 at 45 C for 10 minutes. Gap repair was performed in T4 DNA Ligase Buffer with 10mM ATP (NEB), 0.1mM each dNTPs (final concentration), 1ul T4 DNA ligase (400,000 units/mL, NEB) and 1ul T4 DNA polymerase (3000 units/mL, NEB) and incubated at 37 C for 15 minutes followed by 25 C for 10 minutes. Reactions were heat inactivated at 75 C for 20 minutes. Bisulfite conversion of DNA was performed using the EZ DNA Methylation Lightning kit (Zymo cat # D5030) according to the manufacturers recommendations. After desulphonation and purification of sodium bisulfite treated libraries, samples were amplified for 14-18 cycles with the Kapa HiFi HotStart Uracil+ Ready Mix according to the manufacturers instructions. Oligonucleotide sequences are detailed in (Wang et al., 2013). Each replicate library was barcoded, pooled and sequenced on 3 lanes of an Illumina HiSeq 2500 (paired-end 100bp read lengths). Public WGBS data used in this study include human and chimpanzee sperm (GEO acces- sion GSE49624; Hammoud et al. (2014) and GSE30340; Molaro et al. (2011)), human H1 ESC (GSE16256; Lister et al. (2009)), human H9 ESC (GSE19418; Laurent et al. (2010)), human and chimpanzee B-cell (GSE31971; Hodges et al. (2011)), gorilla whole blood (NCBI SRA acces- sion SRP059313; Hernando-Herraez et al. (2015)), rhesus macaque sperm (GSE60166; Lu et al. (2015)), rhesus macaque PBMC (GSE34129; Tung et al. (2012)), rat left ventricle (Euro- pean Nucleotide Archive accession number ERP002215; Johnson et al. (2014)), mouse sperm (GSE49624; Hammoud et al. (2014)), mouse B-cell (SRP029721; Kieffer-Kwon et al. (2013)), mouse ESC (GSE30206; Stadler et al. (2011)), and the dog kidney epithelial cell line MDCK (GSE48527; Carmona et al. (2014)). Mapping and annotation Reads from whole genome bisulfite sequencing were mapped with rmapbs (Smith et al., 2009) to respective reference genome of each species: hg19, panTro4, 125 gorGor3, rheMac3, mm10, rn5, and canFam3. Subsequent analyses on duplicate removal, bisulfite conversion rates, methylation levels and HMRs were carried out with tools from themethpipe package (Song et al., 2013). Mapping statistics are provided in Table 2.1. We did not observe significant non-CpG methylation in any species. Our analyses were focused on CpG methylation. We used Ensembl gene annotations (Release 75) (Flicek et al., 2014) for species involved in this study. Coordinates of rhesus macaque gene annotation were converted to assembly rheMac3 from rheMac2 withliftOver from the Genome Browser tool suite (Kent et al., 2002). Methylome alignment We produced multiple methylome alignment by mapping non-human CpG positions and methylation levels to the human genome assembly hg19, using the correspond- ing pair-wise genome alignment downloaded from UCSC Genome Browser (Rosenbloom et al., 2015). The percentage of CpGs in individual species in regions aligned to hg19 ranges from 42.3% for mouse to 89.4% for chimpanzee (Table 4.2). We exclude any regions that are more than 1kbp in size and contain no CpGs or have 0 read coverage in at least one species, and consider the result- ing set of genomic regions, covering 0.54 billion bases, as the regions conserved across all seven species. The total number of CpG sites in the conserved regions from different species is quite similar among different species, between 5.6-7.4 million. However, there is considerable sequence divergence among the seven species at CpG sites, and the number of CpG sites conserved across all seven species is about 0.38 million. The conserved regions identified above contained short repeat insertions specific to primate species, as we were using the human reference genome as the primary genome for comparative analyses. Exclusion of divergent methylation patterns that overlapped with primate-specific short repeats interspersed in the orthologous genomic regions did not affect the general trend of accumu- lation of hypomethylated genomic fraction in the primate lineages. In subsequent multi-methylome segmentation results, we removed genomic bins and segments that had substantial (greater than 60%) overlap with primate-specific SINE elements, before calling methylation mutation events. 126 Hierarchical clustering In the hierarchical clustering analysis of sperm methylomes along with ESC and somatic methylomes from multiple mammalian species, sample similarity was measured by correlation of average DNA methylation levels in 200bp genomic bins. Clustering used aver- age linkage. The somatic methylomes are from human, chimpanzee and mouse B-cells, gorilla peripheral whole blood, rhesus macaque PBMC, and rat left ventricle. We also made separate clustering for different genomic context. The promoter group contains bins overlapping annotated human gene promoter (TSS1kbp). Non-promoter bins were included into the repeat group if they overlap with human repetitive elements annotated by the RepeatMasker track in UCSC Genome Browser. The rest of the bins were collected in the group labeled “other.” Orthologous promoter HMR sizes For each non-human species, annotation of protein-coding genes orthologous to human genes were extracted using BioMart from Ensembl release 75. HMRs containing a single transcription start site of orthologous protein coding genes were selected from each somatic and sperm methylome. The somatic methylomes used for this analysis were human B-cell, chimpanzee B-cell, gorilla whole blood, rhesus macaque PBMC, mouse B-cell, rat left ventricle and dog MDCK cell line. The methylome of dog MDCK cell line contains partially methylated domains (PMDs). We only used dog MDCK HMRs located outside of PMDs to mea- sure promoter HMR size. Ultra-conserved HMRs We defined core HMRs as the intersection of sperm HMRs from all seven species. Sperm HMRs from all species that overlapped with a same core HMR were merged into a single region, which we call conserved HMR. The ratio between the core HMR size and the corresponding conserved HMR size is a measurement of the local conservation of DNA methyla- tion pattern. We chose a lower cutoff at 0.7 for definition of ultra-conserved HMR, which resulted in 293 regions. A more stringent cutoff 0.8 resulted in 57 regions. We used GOrilla (Eden et al., 2009) to identify enriched biological processes, molecular functions and cellular components in the target set of genes whose TSS are located within the 293 ultra-conserved HMRs, in contrast 127 with a background gene set, which are genes with TSS hypomethylated in the sperm of all seven species. Enriched GO terms with p-value< 1 10 5 are reported in Tables 4.4, 4.5, 4.6. Species-specific methylation gain and loss We used HMRs from individual methylomes to determine species-specific gain and loss of methylation (Figure 4.10). As an alternative, we also used methylation level cutoffs to determine methylation states. Average methylation levels in each species were measured in 200bp non-overlapping bins along the hg19 genome. Bins with methy- lation level less than a threshold were considered hypomethylated, and hypermethylated otherwise (Table 4.7). Phylogenetic tree from multiple genome alignment We estimated the phylogenetic tree branch lengths under the unrestricted single-nucleotide model (Yang, 1994a), using the R package rphast (Hubisz et al., 2011). Multiple genome alignment of the seven species in this study was extracted from the multiple alignments of 99 vertebrate genomes with human genome (hg19/GRCh37, Feb. 2009) (Blanchette et al., 2004) downloaded from UCSC Genome Browser. Phylogenetic tree was estimated independently from alignments in different autosomes. The results were very stable across chromosomes. The average branch lengths were used as the final branch lengths for the phylogenetic tree for the entire orthologous genome across the seven species. Independent-site phylo-epigenetic model We fit an independent-site phylogenetic model for binary state DNA methylation data with two different types of input. The first type of input are complete methylation states of 200bp bins at extant species. We used a beta-binomial mixture model to estimate the methylation probability at individual CpG sites in each species separately. Then we separated the genome into 200bp non-overlapping bins. We defined the effective orthol- ogous region as the union of 200b bins with at least one CpG observed from each species. From the average CpG methylation probability in each bin in each species, we determined a discrete methylation state using a cutoff at 0.5. The second type of input are methylation probabilities at single CpG sites from all extant species, including all CpG sites with BS-seq coverage in the 128 7-way orthologous genome. In this case, we have incomplete observations at extant species. We used a MCEM algorithm to get the maximum-likelhood estimates of the model parameters. See supplemental methods for details. Model selection between stationary and nonstationary evolution processes We used Akaike information criterion (AIC) to compare two independent site phylogenetic models, with and with- out the assumption of reversible evolutionary process. We discretized average methylation prob- abilities (cutoff 0.5) to determine the methylation states in 200bp bins in the 7-way orthologous genome. Only bins with non-zero CpG coverage in all species are included to form complete observation at extant species. The reversible model was estimated with R packagerphast. The general model (without reversibility assumption) fit the data better according to AIC (Figure 4.16). Types of methylome evolution events We inferred HMRs in ancestral species using the pro- posed dependent-site phylo-epigenetic model. Methylome evolution events – HMR gain in the form of birth and extension and HMR loss in the form of death and contraction – along each branch of the phylogenetic tree were determined by comparison of HMRs in the corresponding parent and child species, and were required to be at least 50bp. To show the total sizes of more specific types of events, i.e. birth/extension and death/contraction, we filtered out events located at the boundaries of the 7-way orthologous genomic fragments. To examine the enrichment ofdenovo (birth or death) and secondary (extension or contraction) methylome evolution events on each phylogenetic tree branch, we considered the conserved HMRs and HMR gains and losses between the parent-child pair. We were interested in the relationship between HMR gains and ancestral HMRs, i.e whether HMR gains prefer to take the form of exten- sion from an ancestral HMR or a newly formed HMR in a methylated background region. We computed the distances between HMR gains to the closest ancestral HMRs. The expected distribu- tion of this distance was formed by shuffling HMR gain events within all methylated regions in the ancestral methylome usingbedtools (Quinlan & Hall, 2010). Similarly, with HMR loss events, 129 we asked whether they are more likely to affect an entire ancestral HMR, i.e. HMR death. We shuffled HMR loss events within ancestral HMRs, and measure the distances of HMR loss events to the closest derived descendant HMRs. The enrichment of different types of HMR gain and loss events was measured with the ratio between the observed and expected distance distribution (Figure 4.20). CpG enrichment at promoter HMRs We created CpG site genomic position maps across assemblies using pair-wise sequence alignments between human (hg19) and mouse (mm10), dog (canFam3) and mouse (mm10), mouse (mm10) and dog (canFam3) downloaded from UCSC Genome Browser. We mapped mouse sperm methylome in its native mm10 assembly to human genome assembly hg19, and called mouse sperm HMRs in the hg19 assembly. We filtered for human and mouse sperm HMRs such that both their intersection and union contain exactly 1 gene transcription start site (TSS). We obtained human promoter HMR extension regions relative to the mouse promoter HMRs with minimum length of 200bp using bedtools. The two boundaries of a such extension region correspond to a human HMR boundary and a mouse HMR boundary. We masked out regions that are not alignable between human and mouse according to their pair- wise sequence alignment. We used a sliding 2bp-windows with 1bp step size around the human HMR boundary of the extended HMR, the mouse HMR boundary contained within a wider human HMR, and around TSS along the human genome to measure GC content and CpG occurrences, and produced a meta plot for CpG enrichment relative to the boundaries and TSS. To make these measurements in the mouse genome, we mapped human methylomes to mouse mm10 genome assembly, and followed the same analysis procedure. We performed the same analysis for dog- mouse comparison. ChIP-seq data analysis Human and mouse ESC histone modification H3K4me3 and H3K27me3 ChIP-seq data were from the ENCODE project (ENCODE Project Consortium, 2012). Human data GEO accession: GSM1000089, GSM769008, and GSM918754. Mouse data GEO 130 accession: GSM1185386, GSM605334, GSM466734, GSM469971, and GSM605333. Human and mouse round spermatid histone modification H3K4me3 and H3K27me3 ChIP-seq data were from the study by (Lesch et al., 2016) (GSE68507). Reads were mapped to respective refer- ence genome (hg19 and mm10) withbowtie2 (Langmead et al., 2009). Duplicated reads were removed using samtools within each sequencing library (Li et al., 2009). Fragment lengths were estimated using thecsaw Bioconductor package (Lun & Smyth, 2016). Enrichment scores were calculated using deepTools (Ram´ ırez et al., 2014) as the log2 ratio of number of frag- ments per 10bp bin between treatment and input after scaling by sequencing depth. The mouse histone mark enrichment scores were then aligned to the human hg19 coordinates according to the pairwise chained and netted alignments between mm10 and hg19 (Rosenbloom et al., 2015). Relative sequence substitution rate We used 15-way eutherian mammals Enredo-Pecan- Ortheus (EPO) multiple alignments corresponding to the Release 75 of Ensembl (Herrero et al., 2016). From the EPO alignment, we extracted the alignment of the 7 species used in this study and the ancestral species in the corresponding phylogeny. Sequence substitutions were annotated to individual lineages by comparing the ancestral sequence and the descendant sequence. We only considered alignment columns that have no insertion or deletion in all 7 species. Relative sequence substitution rate (RSSR) between two sister lineages for a set of genomic regions is defined as the ratio of total number of substitutions in these regions that occurred on one lineage over that on the other lineages. To form the background distribution of RSSR, we randomly sampled regions from the orthologous genome. The sampled region size was chosen to match the smaller size of the total aligned bases in lineage-specific HMRs between the two lineages. We independently sam- pled 5000 times, and created the sampling distribution of RSSR for each pairs of sister lineages. The significance (one-sidedp-value) of observed RSSRs in lineage specific HMRs was calculated using a normal distribution fit to the empirical sampling results. 131 Transcription factor binding site analysis Human transcription factor binding sites (TFBS) used in our analysis were TFBS clusters (V3) from ENCODE data uniformly processed by the ENCODE Analysis Working Group (Gerstein et al., 2012; Wang et al., 2013), downloaded from UCSC Genome Browser ENCODE Analysis Hub (hg19) (Raney et al., 2014). We kept all TFBS that overlapped with 7-way orthologous regions and with human sperm HMRs. For mouse transcription factor binding sites, we downloaded all narrowPeak files forMusmus- culus transcription factor ChIP-seq experiments from ENCODE covering 46 different transcription factors (mm9) (ENCODE Project Consortium, 2012). We also included mouse ESC EZH2 ChIP- seq data from studies by (Ku et al., 2008) (GSE13084) and (Peng et al., 2009) (GSE18776). Bind- ing sites of the same transcription factor profiled in different cell types were pooled and collapsed. To evaluate enrichment of each transcription factor in the HMR gain regions on the path from the last common ancestor of all seven species to human (or mouse), we used fisher exact test on the contingency table offfactor, other factorsgfHMR gain, other HMRg. Fisher exact test p-values were adjusted by Bonferroni correction. Gene ontology analyses of the HMR births on parallel lineages Protein coding genes with TSS located in HMRs in the sperm methylomes of all seven species comprise the background gene list. The subset showing primate-lineage specific promoter HMR extension in human sperm com- prise the target gene list. We used GOrilla (Eden et al., 2009) to identify enriched biological processes at false discovery rate 0.05. We further removed ontology term redundancy, and visu- alized the remaining terms in semantic similarity-based scatter plots usingREViGO (Supek et al., 2011). Primate-lineage-specific HMRs are human HMRs in 7-way orthologous genome that contain HMR birth events annotated to the human lineage since the mouse-human common ancestor. Rodent-lineage-specific HMRs are mouse HMRs in 7-way orthologous genome that contain HMR birth events annotated to the mouse lineage since the mouse-human common ancestor. Over- lapping HMRs between the two lineages were removed. We established gene-HMR association 132 by annotating an HMR to the closest gene transcription start site. The gene-HMR association for mouse HMRs are established according to the mouse reference genome assembly coordinates after converting HMRs from hg19 to mm10 withliftOver tool. The candidate genes for gene- HMR association are orthologous protein coding genes between human and mouse (Ensembl75) that have gene transcription start sites located within the 7-way orthologous genome. The gene orthologs that are associated with both a human sperm HMR and a mouse sperm HMR located in the 7-way orthologous genome comprise the background gene list (8493 genes). From this gene list, we further identified the subset that are associated with primate-lineage-specific HMRs (3082 genes), and the subset that are associated with mouse-lineage-specific HMRs (3689 genes). These two subsets have a significant overlap (2072 genes, Fisher exact testp = 3:26 10 187 ). We used PANTHER Classification System (Mi et al., 2016) to find overrepresented PANTHER GO-Slim biological processes associated with the common genes. 133
Abstract (if available)
Abstract
The mammalian methylome holds regulatory information for cell-specific gene expression. DNA methylation in sperm is among the most important factors influencing the evolution of mammalian genomes. Yet little is known about how this system itself has evolved in the mammalian germline. ❧ Most existing comparative epigenome studies are limited to pair-wise comparisons between species, examination of genome-wide correlations, or studying species-specific changes. Deeper understanding of epigenomic evolution requires identification of conservation and changes of epigenetic states at higher genomic resolution and annotation of epigenetic changes to specific lineages. Although DNA methylation can be profiled at single-CpG resolution, interspecies comparison of methylome is challenged by the fact that most CpG sites have diverged between species. However, DNA methylation, like other epigenetic modifications, displays strong autocorrelation along the genome, which is an important relationship to incorporate into evolutionary models, and is information that can be leveraged to overcome the challenge of divergent sites across species. We introduce a phylo-epigenetic model for methylome evolution, which accommodates the high correlation of methylation states at neighboring CpG sites during evolution, and allows for inference of ancestral methylation states. ❧ To understand the evolution of sperm methylomes, we study the whole-genome single-CpG resolution DNA methylation profiles in sperm from human, chimp, gorilla, rhesus, mouse, rat and dog, with the proposed phylo-epigenetic model. We find that the hypomethylated portion of the mammalian sperm methylome has experienced an evolutionary expansion, driven both by the birth of newly hypomethylated regions and by the extension of ancestral hypomethylated regions. The extensions are over-represented, and contribute to progressive expansion of individual hypomethylated regions during evolution. The changes of methylation states are nonuniformly distributed between genomic contexts and between species: promoter regions exhibit less divergence than other regions, and rodents have accumulated more divergence than primates. The rodent-specific and the primate-specific hypomethylated regions are enriched for binding sites of similar transcription factors, and are associated with genes of similar functions in early embryonic development. ❧ Together, we provide an evolutionary modeling strategy and computational methods for comparative epigenomics. With the proposed approaches, we characterize evolutionary changes in mammalian sperm methylomes at high spatial and temporal resolution. Our analyses reveal the global trend and lineage-specific features of expanding DNA hypomethylation during evolution, and provide insights to how evolution of the epigenome may contribute to species-specific fine-tuning of conserved mammalian developmental programs. Applying this phylo-epigenetic model to more types of epigenetic modifications will shed light on their functional roles, and expand our knowledge about the evolution of the mammalian gene regulation network.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Identification and analysis of shared epigenetic changes in extraembryonic development and tumorigenesis
PDF
Identifying allele-specific DNA methylation in mammalian genomes
PDF
Understanding protein–DNA recognition in the context of DNA methylation
PDF
Functional DNA methylation changes in normal and cancer cells
PDF
Identification of DNA methylation markers in diffuse large B-cell lymphoma
PDF
DNA methylation as a biomarker in human reproductive health and disease
PDF
DNA methylation inhibitors and epigenetic regulation of microRNA expression
PDF
Understanding DNA methylation and nucleosome organization in cancer cells using single molecule sequencing
PDF
Efficient algorithms to map whole genome bisulfite sequencing reads
PDF
An analysis of conservation of methylation
PDF
Whole genome bisulfite sequencing: analytical methods and biological insights
PDF
The kinetic study of engineered MBD domain interactions with methylated DNA: insight into binding of methylated DNA by MBD2b
PDF
DNA methylation and gene expression profiles in Vidaza treated cultured cancer cells
PDF
Effects of chromatin regulators during carcinogenesis
PDF
Long term evolution of gene duplicates in arabidopsis polyploids
PDF
Quantitative modeling of in vivo transcription factor–DNA binding and beyond
PDF
Air pollution, smoking, and multigenerational DNA methylation Signatures: a study of two southern California cohorts
PDF
Natural variation of Arabidopsis thaliana methylome and its impact on genome evolution
PDF
Mapping epigenetic and epistatic components of heritability in natural population
PDF
Differential methylation analysis of colon tissues
Asset Metadata
Creator
Qu, Jianghan
(author)
Core Title
Comparative analysis of DNA methylation in mammals
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
10/19/2017
Defense Date
10/13/2017
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
DNA methylation,Evolution,mammal,OAI-PMH Harvest,phylo-epigenetic model,sperm
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Smith, Andrew D. (
committee chair
), Finch, Caleb (
committee member
), Sun, Fengzhu (
committee member
)
Creator Email
jqu@usc.edu,qujianghan@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-446614
Unique identifier
UC11265086
Identifier
etd-QuJianghan-5843.pdf (filename),usctheses-c40-446614 (legacy record id)
Legacy Identifier
etd-QuJianghan-5843.pdf
Dmrecord
446614
Document Type
Dissertation
Rights
Qu, Jianghan
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
DNA methylation
mammal
phylo-epigenetic model
sperm