Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Whole genome bisulfite sequencing: analytical methods and biological insights
(USC Thesis Other)
Whole genome bisulfite sequencing: analytical methods and biological insights
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
WHOLE GENOME BISULFITE SEQUENCING: ANALYTICAL METHODS AND BIOLOGICAL INSIGHTS by Qiang Song A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS) May 2013 Copyright 2013 Qiang Song Dedication To my mom, dad and sister. ii Acknowledgments I own my gratitude to the many people who have accompanied, supported and helped me during graduate school. They made my life as a Ph.D. student an memorable experience. First and foremost, I want to express my deepest gratitude to my advisor Dr. Andrew Smith. He is knowledgeable, visionary and insightful. He leads me to the fascinatingfieldofepigenetics,andhasfosteredmyvisionandtasteaboutsciences. He has also generously supported me for the past four years. His influences on me extend far beyond my intellectual growth. He is my “Jedi master”. He enlightens me about my ability and motivates me to pursue my goals with determination and confidence. Iwouldalsoliketothankmyqualificationanddissertationcommitteemembers, Dr. Michael Waterman, Dr. Fei Sha, Dr. Bing Ren, Dr. Matthew Dean, Dr. Fengzhu Sun and Dr. Ian Ehrenreich, to whom I have been constantly turned to for inspirations and ideas. Ihavebeenveryfortunatetocollaboratewithseveralawesomescientistsduring mystudies. MygreatgratitudegoestoDr. EmilyHodges,Dr. AntoineMolaroand Dr. GregoryHannon, withwhomIworkedonthespermandbloodcellmethylome projects. I am also grateful to Dr. Fred Rollins, who generously allowed me to use the lung cancer methylome data in this dissertation. I also want to express my iii gratitudetoPeiZhang,myalumnabothfromUniversityofScienceandTechnology of China and University of Southern California, who generated the Arabidopsis methylome data and collaborated with me in the plant methylome project. I would also to thank many friends whom I met at the graduate school. I will cherish their friendship for ever. Dr. Min Xu, Dr. Xiting Yan and Dr. Stanley Shi are the first few people I met when I came to the United States. They helped me start my life in the States by providing guidance and help. We have been friends since then and have interacted a lot, intellectually and personally. I want also to thank Pei Zhang, Mu Zhou, Chao Dai, Quan Chen, Yang Fu, Tianyin Zhou, Dr. Shihua Zhang, Dr. Lin Wan, Jing Zhang, Dr. Wenyuan Li, Qingjiao Li and Zack Fraser. I am always missing those days when we studied together, played together and traveled together. I would also like to thank my lab members. In particular, Dr. Philip Uren, Timothy Daley and Jianghan Qu patiently and carefully proofread the manuscript of this dissertation. Ben Decato, Meng Zhou, Jianghan Qu, Jun Zhao Dr. Fang Fang, Elizabeth Hong helped a lot to develop the software tools to analyze whole genome bisulfite sequencing data. In addition, Iris Dror, Emad Bahrami Samani, Joyce Kao and Dr. Kjong Lehmann have accompanied me in the graduate school and made my life much interesting. I would also like to thank Dr. Hao Wang, Dr. Yan Zhou and Linli Yang from the Department of Psychology. They broaden my knowledge and are interesting people to hang out with. I am also grateful to Dr. Shanshan Xu, Dr. Yueyan Wang, Dr. Yixin Chen and Dr. Yan Zhou. We discussed about statistics and hiked many mountains together. iv Six years are a long time. Many people come into our life and then leave. I may not be able to thank everyone individually in this short acknowledgement, but I will always cherish their friendship and help. v Contents Dedication ii Acknowledgments iii List of Tables viii List of Figures ix Abstract xii 1 Introduction 1 2 Background 6 2.1 A brief survey of DNA methylation biology . . . . . . . . . . . . . . 6 2.2 Modern techniques for profiling methylation . . . . . . . . . . . . . 13 3 Data analysis for high-throughput bisulfite sequencing 15 3.1 Bisulfite sequencing in high-throughput . . . . . . . . . . . . . . . . 15 3.2 Mapping reads from bisulfite sequencing . . . . . . . . . . . . . . . 17 3.2.1 Wildcard mapping . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.2 Considerations for paired-end reads . . . . . . . . . . . . . . 18 3.2.3 Dealing with short fragments . . . . . . . . . . . . . . . . . 19 3.2.4 Bisulfite-specific deadzones . . . . . . . . . . . . . . . . . . . 19 3.3 Identifying distinct molecules . . . . . . . . . . . . . . . . . . . . . 20 3.4 Estimating bisulfite conversion rate . . . . . . . . . . . . . . . . . . 21 3.5 Estimating methylation levels . . . . . . . . . . . . . . . . . . . . . 22 3.5.1 Interpretation of methylation levels and notations . . . . . . 22 3.5.2 Statistical distributions for individual sites . . . . . . . . . . 23 3.5.3 Summarizing methylation level region-wide . . . . . . . . . . 25 3.6 Considerations for hydroxy methylation . . . . . . . . . . . . . . . . 26 vi 4 Patterns of methylation: analyzing individual methylomes 28 4.1 The analysis problem . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2 Vertebrates: hypomethylated regions . . . . . . . . . . . . . . . . . 31 4.2.1 Identifying hypomethylated regions de novo ......... 31 4.2.2 HMRs extending outside annotated CGIs and promoters . . 32 4.2.3 A summary of hypomethylation in mammalian methylomes . 34 4.3 Plant methylomes: focal methylation and multiple mechanisms . . . 37 4.3.1 Issues in identifying plant HyperMRs . . . . . . . . . . . . . 37 4.3.2 Distinct properties of CpG, CHG and CHH HyperMRs . . . 41 4.3.3 Classification of CpG HyperMRs . . . . . . . . . . . . . . . 44 5 Variation of methylation: analysis of multiple methylomes 54 5.1 Dierentially methylated regions between two methylomes . . . . . 56 5.1.1 A three-state HMM to identify DMRs . . . . . . . . . . . . 56 5.1.2 DMRs during hematopoietic stem cell dierentiation . . . . 58 5.2 Dierential methylation in case-control studies . . . . . . . . . . . . 68 5.2.1 Using sliding windows to identify DMRs . . . . . . . . . . . 68 5.2.2 Temperature specific DMRs in plant methylomes . . . . . . 69 5.3 Unsupervised identification of dierentially methylated regions . . . 72 5.3.1 Issues in identifying DMRs in an unsupervised manner . . . 72 5.3.2 An EM method for unsupervised identification of DMRs . . 73 5.4 Methylation variation in terms of hyper-methylated regions . . . . . 78 5.5 DNA methylation and transcription factor binding sites . . . . . . . 83 6 Co-methylation network analysis 85 6.1 Procedures of co-methylation network analysis . . . . . . . . . . . . 85 6.2 Functional gene clusters . . . . . . . . . . . . . . . . . . . . . . . . 87 6.3 Co-spliced exon modules . . . . . . . . . . . . . . . . . . . . . . . . 89 7 Conclusions 93 Reference List 96 A Appendix 103 A.1 Bisulfite-specific deadzones . . . . . . . . . . . . . . . . . . . . . . . 103 A.2 Estimaing parameters of the beta-binomial distribution . . . . . . . 105 A.3 Methylomes of Arabidopsis thaliana ecotypes in Northern Europe . 106 A.4 An approximate algorithm to speed up variable-distribution HMM . 107 A.5 List of datasets used in this study . . . . . . . . . . . . . . . . . . . 108 vii List of Tables 2.1 Methods for profiling DNA methylation that are based on bisulfite conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4.1 HMRs distribution relative to CGIs . . . . . . . . . . . . . . . . . . 32 4.2 Distribution of HMRs relative to gene annotations . . . . . . . . . . 33 4.3 Interdependenceofhypo-andhypermethylationstatusofCpGsites between rossetta leaf sample and endosperm sample . . . . . . . . . 38 4.4 Number and sizes of CpG, CHG and CHH HyperMRs . . . . . . . . 42 4.5 Gene-overlapping CpG HyperMRs restricted to intragenic regions . 42 4.6 Contingency table for testing that sequence motif mutation causes methylation status change . . . . . . . . . . . . . . . . . . . . . . . 49 5.1 DMRs during human blood stem cell dierentiation . . . . . . . . . 60 5.2 DMRs and transcription factor binding sites . . . . . . . . . . . . . 63 6.1 Functional gene clusters identified from promoter co-methylation analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 A.1 List of methylomes used in comparison of HMR number and sizes . 109 viii List of Figures 3.1 WGBS analysis workflow . . . . . . . . . . . . . . . . . . . . . . . . 16 4.1 Typical HMRs in mammalian genomes . . . . . . . . . . . . . . . . 30 4.2 Sperm HMRs defines boundaries of somatic HMRs . . . . . . . . . 35 4.3 HMR number and sizes in normal cell methylomes . . . . . . . . . . 35 4.4 HMR number and sizes in normal cell and cancerous cell methylomes 36 4.5 Typical HyperMRs in Arabidopsis thaliana . . . . . . . . . . . . . . 38 4.6 Length distrubution of hypomethylated CpG sites . . . . . . . . . . 39 4.7 The distribution of CpG, CHG and CHH HyperMRs along chromo- some 4 of A. thaliana . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.8 Genomic annotation of plant HyperMRs . . . . . . . . . . . . . . . 44 4.9 CpG, CHG and CHH HyperMR examples in Arabidopsis thaliana genome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.10 Relationship between CpG, CHG and CHH HyperMRs . . . . . . . 46 4.11 Supporting, variable and gap CpGs within a composite HyperMRs . 47 4.12 Frequency of ACGT motif around supporting, variable, gap and nonHMR CpG sites . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.13 Gap CpGs tend to be hypermethylated if the ACGT motif changes 49 ix 4.14 Classification of composite and simple HyperMRs based on the pro- portion of supporting CpGs . . . . . . . . . . . . . . . . . . . . . . 51 4.15 Genomic annotation of simple and composite HyperMRs in Ara- bidopsis thaliana . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.16 Dual-eect of DNA methylation on gene expression . . . . . . . . . 53 5.1 Distribution of DMRs in promoter regions . . . . . . . . . . . . . . 62 5.2 Intergenic DMR at AML1 binding site . . . . . . . . . . . . . . . . 62 5.3 DMR caused by the birth of a new HMR] . . . . . . . . . . . . . . 64 5.4 DMR caused by partially methylated regions . . . . . . . . . . . . . 65 5.5 DMR caused by the extension of existing HMR boundaries . . . . . 65 5.6 DMRcausedbytransitionfromsharpboundariestogradualbound- aries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.7 HSPCs show intermediate methylation in cell type specific DMRs. Red: B cells; Green: HSPCs; Blue: neutrophils . . . . . . . . . . . . 67 5.8 Temperature eect on genome-wide methylation level . . . . . . . . 70 5.9 Genomic annotation of temperature-specific CHH DMRs . . . . . . 71 5.10 Examples of DMRs among multiple Arabidopsis ecotypes . . . . . . 73 5.11 Example of DMRss among multiple human methylomes . . . . . . . 74 5.12 Illustration of superHMR . . . . . . . . . . . . . . . . . . . . . . . . 78 5.13 FrequencyspectrumofCpGHyperMRsinSwedishArabidopsispop- ulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.14 FrequencyspectrumofHyperMRsoverlappingtransposableelements stratified by transposable element length . . . . . . . . . . . . . . . 80 5.15 FrequencyspectrumofHyperMRsoverlappingproteincodinggenes stratified by gene sizes . . . . . . . . . . . . . . . . . . . . . . . . . 81 x 5.16 FrequencyspectrumofHyperMRsoverlappingproteincodinggenes stratified by gene families . . . . . . . . . . . . . . . . . . . . . . . 82 6.1 Phase transition in promoter co-methylation network . . . . . . . . 88 6.2 Promoter methylation levels in a typical gene cluster in Arabidopsis thaliana . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.3 35 human methylomes used to construct exon co-methylation network 91 6.4 Co-spliced exons tend to be co-methylated . . . . . . . . . . . . . . 92 A.1 Theorectical estimation of the proportion of mappable CpG sites with deadzone analysis . . . . . . . . . . . . . . . . . . . . . . . . . 104 xi Abstract Whole genome bisulfite sequencing is a powerful technique for profiling methy- lation patterns, which provides measurement of methylation levels of individual cytosine sites across the genome. This dissertation presents the computational methods to analyze whole genome bisulfite sequencing data. Besides mapping bisulfite converted reads and estimating methylation levels of individual cytosines, thisdissertationparticularlystudiesepigenomicdomainsmarkedbycharacteristics methylation status. The biological insights from the application of the computa- tional methods to the analysis of mammalian methylomes and plant methylomes are also reported. xii Chapter 1 Introduction GeneticsfocusesonDNAsequence,basedonwhichphenotypictraitsarepassed from parents to osprings. However, increasing evidence suggests the existence of epigenetic mechanisms that cause heritable changes in gene expression and phe- notype without changes of DNA sequence. DNA methylation, the addition of a methyl group to the 5’ location of cytosines (which we often refer to as 5mC), is one of the most studied epigenetic mechanisms. The addition of the methyl group alters the local chromatin environment and accessibility of the underlying DNA sequences to proteins, which subsequently aects gene transcription activi- ties. DNAmethylationisinvolvedinavarietyofbiologicalprocesses. Forexample, celldierentiationisusuallyaccompaniedbycharacteristicchangesinDNAmethy- lation pattern. Cancer development and the aging process also involve changes in DNA methylation. In particular, DNA methylation itself is a driving force in shaping the composition and landscape of DNA sequences. From the perspective of evolution, epigenetic marks together with DNA sequences undergo continuous mutation and selection, which aect the fitness of organisms. DNA methylation has been a subject of intense study for several decades. One important step to understand the role of DNA methylation is to profile DNA methylation pattern across the genome. By comparative study of the methylation patterns in multiple tissues, one may characterize how DNA methylation changes when cells are dierentiating, or under environmental pressure, or transited into cancerous state. Such knowledge will enable us to establish causal link between 1 methylation changes and cell functional changes, and identify targets for cancer treatment. The whole-genome bisulfite sequencing (WGBS) technology, which combines bisulfite treatment of DNA sequences and massively parallel sequenc- ing, is able to give a quantitative measure of the methylation status of individual cytosinesinagenome-widescale. Inanalogywiththeterm“genome”,thegenome- wide methylation pattern is referred to as the “methylome”. A rapidly increasing amountofgenome-widemethylationdatafromWGBSisbecomingavailable. Simi- lartothesituationwhenthefirstseriesofgenomedatacameoutintheearly2000s, theemergenceoftheselargescaleepigenomedatasetscallsfornovelanalyticmeth- ods. In this disseration, we present novel computational methods for analyzing WGBS data and the biological insights we learned through the analysis of multi- ple mammalian and plant methylomes. Typically the development of a particular analysis method is motivated by specific questions we ask of the data. Interwoven with descriptions of computational methods are descriptions of their application to real data sets, many but not all of which have been published. In a few cases, the application of a specific new analysis method has led us to look more deeply at the same biological phenomenon. We feel the narrative of this thesis is best served by addressing these minor deviations when they emerge, as this is the natural way to communicate research that progressed in such a manner. In Chapter 2 we briefly review the biological mechanisms associated with the establishment and maintenance of DNA methylation patterns, and also cover gen- eral information about the functions of DNA methylation. We also explain the basicprinciplesofexperimentaltechniquesusedtoprofileDNAmethylation. Next, 2 in Chapter 3 we describe the technical aspects of analyzing WGBS data, includ- ing mapping bisulfite-treated reads, quality control, and estimation of methylation level of individual cytosines. DNAmethylationexertsregulatoryeectsbyalteringthelocalchromatinenvi- ronment and accessibility to DNA-binding proteins, which often involves methyla- tion changes in what we might term localized “epigenomic regions”. We start with hypomethylated regions (or hypermethylated regions in plant genomes) in indi- vidual methylomes, and then proceed to identify dierentially methylated regions between multiple methyomes, and finally use co-methylation network analysis to identify co-regulatory DNA elements. Most of the mammalian genomes are highly methylated, supposedly to sup- press the activity of transposable elements. However there are certain regions whereDNAmethylationlevelislow, ashasbeenobservedinCpGislandsandpro- moterregions. Suchhypomethylatedregionsarethoughttobereadilyaccessibleto transcription machinery and other DNA-binding proteins, and harbor regulatory elements. We have developed statistical methods to identify such hypomethylated regulatory solely based on methylation data, without reliance on existing genome annotations. Therefore our method gives an unbiased and comprehensive assess- mentofopenchromatinregionscharacterizedbytheabsenceofDNAmethylation. We observed a significant proportion of hypomethylated regions outside of CpG islands and promoters. Some of them are located in annotated intergenic regions and some even overlap with repeat elements. With this knowledge, we are to able to identify previously unknown regulatory elements, which will help interpret the results from the ENCODE project and genome-wide association studies. 3 Themethylationlandscapeinplantgenomesisdierentfrommamallianmethy- lomes. Plant genomes are generally unmethylated with scattered hypermethy- lated regions. Such hypermethylated regions are enriched in repeat sequences and intragenic regions. Interestingly, we observed that hypermethylated regions in intragenic regions usually harbor conserved hypomethylated cytosine sites, while cytosines within transposon HyperMRs are uniformly hypermethylated. The exis- tence of those unmethylated cytosines within gene-body hypermethylated regions seems to allow more flexible regulation of gene expression. By comparing multiple Arabidopsis methylomes, we ascertained that hypermethylated regions are rather stable units, usually established or removed as a whole between dierent ecotypes. We propose that hypermethylated regions may be the basic regulatory units of DNA methylation, and such hypermethylated regions may serve as the foundation for studies of selection pressure on methylation as SNP sites serve as the units for studies of selection on DNA sequence mutation. DNA methylation landscape on one hand can be faithfully transferred across cell divisions and even across generations due to the existence of maintenance enzymes. On the other hand, DNA methylation is subject to changes in response to developmental and/or environmental signals. By comparative studies of multi- plemethylomes,weexpectedtoidentifycertainregionswheremethylationchanges can be linked with particular cell types and/or disease states. There are dierent experimental settings for such comparative studies, we will examine three typi- cal experimental designs and present corresponding methods to identify dieren- tially methylated regions in each case. In our study of blood cell methylomes, we were able identify an extensive set of dierentially methylated regions during hematopoieticstemcelldierentiation,someofthemareconservedbetweenhuman and chimpanzee, demonstrating that the function of methylation in regulating cell 4 dierentiation is a conserved mechanism. Functional studies showed these tissue specificdierentiallymethylatedregionsusuallyoverlapswithconservedtranscrip- tion factor binding sites, and are associated with gene expression. Finally, we will present a framework for analyzing networks of genomic inter- vals that share co-methylation. This framework allows information from multi- ple methylomes to be integrated for anlaysis, using correlated methylation status through pairs of genomic intervals as an indicator of shared functional context. Most biological processes involve orchestrated activities of multiple regulatory ele- ments, which are accompanied by corresponding changes in epigenomic status at the associated loci. For example, the promoters of multiple genes involved in cer- tain pathway are likely to lose or gain methylation in a correlated manner across cell types or conditions, assuming any variance is observed in their methylation levels. Co-methylation network analysis is similar to gene co-expression network analysisandcanmakeuseofexistinganalysismethods. Additionally,DNAmethy- lationprofilingdatagiveadigitizedmeasureofmethylationofindividualcytosines on a genome-wide scale, which allows us to explore co-regulatory network beyond geneannotations. Inthisthesis, wewillreporttwointerestingstudiesbasedonco- methylation network analysis: one dealing with functionally related genes and the other illustrating how co-methylated exons coincide with co-spliced exon modules. These studies demonstrate the potential of co-methylation network analysis, and we expect its broad application as DNA methylation data accumulates. For exam- ple,comparisonofco-methylationnetworkswithlinkage-disequilibriumgraphswill help to elucidate the relationship between genetic mutation and epigenetic varia- tion. Similarly, co-methylation analysis of enhancers and gene promoters will help us establish relationships between distal regulatory elements and promoters with which they interact. 5 Chapter 2 Background 2.1 A brief survey of DNA methylation biology DNA methylation refers to the addition of a methyl group to the 5’ carbon of cytosine nucleotides. In mammals, DNA methylation is observed almost exclu- sively on cytosines in the context of CpG dinucleotides. However, in plants and in embryonic stem cells of humans, cytosines in the CHG and CHH contexts (where H denotes any DNA base other than cytosine) are also found to be methylated (Lister et al., 2008; Cokus et al., 2008; Lister et al., 2009). In mammals, cytosine methylation is established by a de novo methyltransferase (a member of the DNA methyltransferase 3 family, which includes DNMT3a, DNMT3b and DNMT3L) regardless of sequence contexts. The CpG context is special, however, because it is a self-complementary, or symmetric dinucleotide. Methylation in the symmet- ric CpG context is maintained through cell division by the maintenance methyl- transferase, DNA methyltransferase 1 (DNMT1). When a CpG is synthesized, if the template strand was methylated then the particular site is said to be hemi- methylated. DNMT1 follows the DNA polymerase, recognizes the methyl group on the template strand at hemi-methylated sites, and adds a methyl group to the CpG on the newly synthesized strand (Goll & Bestor, 2005). Importantly, since the methyl group constitutes a covalent modification, it is highly stable in the absense of enzymatic activity. 6 The existence of functions like those of DNMT1 and DNMT3 were predicted by Holliday & Pugh (1975) and Riggs (1975) even before the enzymes were dis- covered. The ability of DNMT1 to specifically methylate hemimethylated CpGs produced from DNA replication provides a reliable mechanism to preserve epige- netic information through cell divisions. This process can be viewed as a means of communicating information to daughter cells, for example if some stimulus caused a phenotypic change to the cells (e.g. dierentiation), then the phenotype can be preserved even in the absense of the original stimulus. The replicated methylation pattern may also provide useful information passively as a recording mechanism for cell lineages (Siegmund et al., 2011). Demethylation: Widespread loss of DNA methylation occurs twice during cell reprogramming: in early embryos before implantation and during the develop- ment of primordial germ cells. Reduction of methylation level in localized regions is also prevalent during cell dierentiation. The loss of methylation happens in two ways, passive and active DNA demethylation. These two mechanisms dier in whether replication is necessary to achieve demethylation. If DNA replicates and there is no maintenance enzyme present, then only 50% of the methylation will remain. Whentheprocesscontinuesforseveralcyclesofreplication,overallmethy- lationlevelisdilutedpassively. Activedemethylation, ontheotherhand, hasbeen rathermysterious,andmultiplepathwayshavebeenreported(Bhutanietal.,2011; Wu & Zhang, 2010). In the deamination pathway, methylated cytosines are con- vertedtouracilsbydeaminaseAID/APOBEC,andthenenzymesforbaseexcision repair,possiblyTDGorMBD2,replacethemismatcheduracilswithunmethylated Cs. In the oxidation pathway, the TET family of proteins first oxidize 5mC to 5- hydroxymethylcytosines (5hmC), which are further converted to 5caC or 5hmU 7 (by AID/APOBEC), followed by TGD/SMUG1 in the BER pathway to replace 5caC or 5hmU with unmethylated cytosines. The landscape of methylation along chromosomes: We examined the enzymes involved in establishment, maintenance and removal of DNA methylation, next we will describe how genome-wide hypo- and hypermethylation patterns are estab- lished, specifically, the methylation “spreading” model and epigenetic domain boundary elements that regulate the transition from one region to another. Jones & Takai (2001) proposed two mechanisms responsible for varying methylation patterns. First, some regions are targeted for methylation directed by sequence- specific binding proteins. Second, some regions are inaccessible to DNA methyl- transferasesduetocertainDNA-bindingproteins. Law&Jacobsen(2010)pointed out the mechanism of targeted removal of DNA methylation by active demethy- lation in imprinted regions. Turker (2002) speculated that DNA methylation originates from “methylation centers” and then spreads to adjacent regions until it encounters an opposing force that favors demethylation. The junction where the methylation-spreading forces and the methylation-exclusion forces achieve a dynamic equilibrium forms the boundaries between hypomethylated regions and hypermethylated regions. The main components of this model are “methylation centers”, “methylationspreading”and“domainboundaries”. Previousstudiessug- gested repeat elements such as B1 in mouse, Alu in human and LINE-1 elements may serve as the origins of de novo methylation (Yates et al., 1999; Turker, 2002). The notion of “methylation spreading” is supported by several pieces of empirical evidence. Lindsay & Adams (1996) demonstrated that methylation of a cytosine increasesthelikelihoodofnearbycytosinesbeingmethylatedbothincisandtrans. 8 The methylated cytosine and the nearby unmethylated cytosine may form a tem- porary hemimethylated local structure, which is the favorite substrate of main- tenance DNA methyltransferases. Jahner & Jaenisch (1985) reported that with the insertion of a provirus in the mouse genome, the DNA sequences flanking the provirusbecamemethylatedwithin1kilobase(kb)oftheintegrationsite. Itisnot completely clear whether there are any specific sequence elements residing at the boundaries between hypo- and hypermethylated regions though two early studies showedthatSp1elementsprotectaCpGislandnearthemouseAPRTgenefromde novomethylation(Macleodetal.,1994;Brandeisetal.,1994). Inaddition,Molaro et al. (2011) reported some sequence features around HMR boundaries, including increasedinter-CpGdistancesandenrichmentoftheCACGTGpattern. Thelatter indicates a possible role of E-box elements for regulating methylation spreading. In summary, methylation centers, methylation spreading, domain boundaries and dynamic disequilibrium of methylation-demethylation jointly shape the genome- wide methylation pattern we observed. DNA methylation in development and dierentiation: Early embryos start with a fertilized zygote, and undergo a complex set of reprogramming events to acquire totipotency, which include an almost complete erasure of parental and maternal methylationandre-establishmentofthemethylationpatternsubsequently. During the methylome erasure, the parental genome is probably subject to a rapid active demethylation,whilethematernalgenomelosesitsmethylationgradually,possibly asaresultofpassivedemethylation. However, certainregions, includingimprinted genes, retain their methylation, providing a mechanism for inter-generation epige- netic inheritance. During the transition from blastocyst to post-implementation epiblast, DNA methylation pattern is re-established by DNMT3b (Borgel et al., 2010). In particular, promoters of germline genes and lineage-specific genes are 9 remethylated and the expression of those genes are repressed, which are subse- quently demethylated during cell dierentiation. At the end of this process, the methylation of embryos are established, which is characterized by genome-wide hypermethylationwithsomepromoters, CpGislandsandDNAbindingsitesbeing protected from methylation. Another interesting observation is that embryonic stem cells possess pervasive non-CpG methylation, possibly due to the indiscrim- inate nature of de novo methyltransferases, whose biological implications are yet to be explored. Afterembryonicmethylationreprogramming,primordialgermcellsandembry- onic stem cells of somatic lines undergo dierent modes of modification. The spec- ification of primordial germ cells involves another round of reprogramming events, which retain totipotent potential and more importantly reset sex-specific imprint- ing patterns. During the expansion and migration of primordial germ cells to gen- ital bridges, the majority of the genome loses methylation except certain regions related to imprinting, CpGs islands on the X chromosomes and germline-specific genes. The loss of methylation occurs via conversion of 5mC to 5hmC, catalyzed by the TET family members TET1 and TET2. The conversion of 5mC entails a gradual decline of methylation level in a replication-coupled passive demethy- lation manner. Upon entry to gonals, imprinted regions, some CpG islands in X chromosomes and germline-specific genes become demethylated. However intracis- ternalA-particles(IAPs)andsomepromotersnearbyescapeglobaldemethylation, whichagainprovidetargetsforfutureinvestigationoftransgenerationalepigenetic inheritance. There are not sucient studies about how the methylation pattern is re-established during the development of oocytes and sperm cells. Mature sperm 10 are characterized by a large number of hypermethylated regions, including sub- stantial number of HMRs outside of CpG islands. HMRs in sperm cells seems to delineate the limit of HMRs at the same location in other cell types. Embryonicstemcellsgiverisetoavarietyofspecializedcelltypes,accompanied by the modification of methylation pattern. Since DNA methylation can be rather stably passed on through cell divisions, it is able to preserve cell identities. Cell- type-specific methylation variation has been studied in a variety of tissues; to name a few: fibroblast cells (Lister et al., 2009; Laurent et al., 2010) and blood cells (Hodges et al., 2011; Ji et al., 2010). In contrast to the reprogramming processes in early embryos and primordial germ cells, DNA methylation changes duringcelldierentiationusuallyoccurinlocalizedregions,suchasgenepromoters and transcription factor binding sites, that activates expression of tissue-specific genes, and permanently represses the expression of genes specific to other lineages. Methylation changes, especially near promoter and CpG island (CGI) shores, have beenfoundtostronglycorrelatewithgeneexpression(Irizarryetal.,2009;Hodges et al., 2011). DNA methylation in cancer and disease: We have reviewed the general pattern of DNA methylation and dynamics during normal organism development, which is important to understand basic biological principles. From a practical perspective, the importance of research about DNA methylation lies in the fact that aberrant methylation patterns are associated with several diseases. The studies of can- cer methylomes provide biomarkers for diagnosis and clinical therapy of cancer (Heyn & Esteller, 2012). Compared to normal methylomes, cancer methylomes are characterized by hypermethylation at focal regions and extensive reduction of methylation levels in large scale regions, ranging from thousands to millions of 11 basepairs(Bermanetal.,2012;Hansenetal.,2011). Localhypermethylation, usu- ally observed in gene promoters, was observed long ago and is thought to repress tumor-suppressor genes (Baylin & Jones, 2011). The loss of methylation in large scaleregionswererecentlyobservedthankstoadvancementinprofilingtechniques. Hansen et al. (2011) proposed that loss of methylation in those large scale regions leadstoincreasedepigeneticvariabilitywhichmaycontritetotumorheterogeneity. DNA methylation in plants: DNA methylation is an epigenetic modification found in diverse eukaryotic organisms. Plant methylomes, as exemplified by Arabidop- sis thaliana, show characteristic patterns distinct from that of vertebrate methy- lomes. WhilemethylationispredominantlyobservedinCpGdinucleotidesinmam- malian cells, methylation exists at cytosines in all sequence contexts governed by distinct pathways (Lister et al., 2008; Cokus et al., 2008). Methylation in all sequence contexts is established by DRM1/2, counterpart to mammalian de novo methyltransferases DNMT3a/b, via the RdDM pathway. MET1, homologous to mammalian maintenance methyltransferase DNMT1, perpetrates CpG methyla- tion through cell divisions, and the plant-specific methyltransferase CMT mostly methylates cytosine in the CHG sequence context (Law & Jacobsen, 2010). Active demethylationinplantsarecatalyzedbyDNAglycosylases,includingDME,ROS1 and DML2/3, in combination with the BER pathway. Genome-wide, plants are described as having a “mosaic methylation pattern” (Suzuki & Bird, 2008). Peri- centromeric regions of the plant genome are enriched with methylated cytosines in all sequence contexts, while enchromatic chromosome arms are mostly devoid of methylation, except in discrete regions where hypermethylated cytosines exist in clusters. 12 2.2 Modern techniques for profiling methylation Studies of methylation patterns rely on the advancement of experimental tech- niques for profiling methylation. Based on the underlying mechanisms, known profiling methods are divided into three categories: methylation-sensitive endonu- clease based, methylation anity based, and sodium bisulfite conversion based methods (Laird, 2010). In this thesis, we focus on methods based on bisulfite conversion coupled with sequencing. Sodium bisulfite, a mutagen, can specifically deaminate unmethylated cytosines, which are converted to uracils, while its eect on 5mC are negligible (Wang et al., 1980). Bisulfite treatment essentially trans- lates the DNA methylation signal into a sequence dierence, which can be read out with a variety of sequencing technologies. The initial application of bisulfite conversion to measure methylation with Sanger sequencing is limited to single loci (Susan et al., 1994). WGBS, which combines bisulfite conversion and massively parallel sequencing, is one of most promising breakthroughs in methylation pro- filing techniques, that produces genome-wide methylome at single-base resolution, firstinmodelorganismArabidopsisthaliana(Listeretal.,2008;Cokusetal.,2008) and then in mammalian genomes as large as human (Lister et al., 2009). While WGBS has been applied in several studies, and will be the ultimate choice for methylation profiling, at present the cost of sequencing limits its wide use in research projects involving large number of samples. To reduce the cost of WGBS, several methods have developed to target specific regions of interest. Reduced representation bisulfite sequencing (RRBS) first digests original DNA with methylation-insensitive endonuclease (eg. MspI) and selects DNA fragments 40-220bp long for subsequent studies. RRBS covers about 4.8% of all CpG sites, preferably in CpG-rich regions (Meissner et al., 2008). Two other methods, one 13 Method Principle Scale Reference Sanger BS Selected clones Single locus Susan et al. (1994) RRBS-seq Methyl-sensitive restriction Subset of CpGs Meissner et al. (2008) BSPP-seq Padlock capture Target regions Deng et al. (2009) BS-capture Hybridization capture Target regions Hodges et al. (2009) WGBS-seq Shotgun sequencing Genome-wide Lister et al. (2008) oxBS-seq ı KruO4 oxidization of 5hmC Genome-wide Booth et al. (2012) TAB-seq ı Tet1-assisted BS conversion Genome-wide Yu et al. (2012) Table 2.1: Methods for profiling DNA methylation that are based on bisulfite conversion. ı indicates used to profile hydroxy methylation. usingmicroarraycapture(Hodgesetal.,2009),andtheotherusingpadlockcapture (Deng et al., 2009), may be used to target specific regions of interest. AnotherconcernaboutbisulfiteconversionbasedmethodsisthatstandardBS- seq cannot discriminate 5mC and 5hmC. Therefore the estimation of methylation levels in traditional BS-seq is the combined value of 5mC and 5hmC. Two recently emerged experimental techniques, variants of traditional BS-seq, are able to mea- sure5mCand5hmCrespectively. TheoxBS-seqmethodfirsttreatsDNAmolecules with potassium perruthenate (KruO4), which oxidizes 5hmC to 5-formylcytosine (5fC). unmethylated Cs together with 5fC are converted to uracils upon bisul- fite treatment. Therefore, oxBS-seq measures only the 5mC level (Booth et al., 2012). The TAB-seq, short for Tet-assisted Bisulfite Sequencing, first labels 5- methylcytosines with Glc, which are converted to uracil after subsequent TET oxidization and bisulfite treatment (Yu et al., 2012). As a result, only 5hmc in the original sequences are read out as Cs and both unmethylated Cs and 5- methylcytosines are read out as Ts. In summary, BS-seq measures the combined value of 5mC and 5hmC, oxBS-seq measures the 5mC level and TAB-seq measures 5hmC level. Any combination of two rounds of these techniques enable unbiased measurement of 5mC and 5hmC respectively. The basic principles, profiling scales and their respective first applications are summarized in Table 2.1. 14 Chapter 3 Data analysis for high-throughput bisulfite sequencing This chapter addresses basic technical issues in analyzing whole genome bisul- fitesequencingdata. Theoverallworkflowfrommappingreads,tobiascorrection, to the estimation of methylation level, is shown in Figure 3.1. We will discuss each step in detail below. 3.1 Bisulfite sequencing in high-throughput In a WGBS experiment, DNA molecules are treated with bisulfite sodium, followed by massively parallel sequencing. There are two related yet dierent protocols introduced in two early influential papers applying this technique to genome-wide methylation profiling (Lister et al., 2008; Cokus et al., 2008). The Listeretal.(2008)protocolwasinitiallytermed“Methyl-seq”andtheCokusetal. (2008) protocol adopted the name “BS-seq” in the beginning (see Lister & Ecker 2009 for a comparison of these two protocols). The Lister et al. (2008) protocol hasbecomethede factostandardtoperformbisulfite-sequencingexperiments,and is also referred to as “BS-seq” or “WGBS” by other researchers. In the Lister et al. (2008) protocol, orginal DNA molecules are fragmented using either an enzy- matic approach or sonication. After size selection, DNA fragments are ligated to 15 AA GT G G TA G TTGA C C AT G C G AA GT G TA GT T T C C C C T C C C C C C 5 C T 1 T T T T T 3 AA GT G G TA GT TTGA C C AT GA 3' 5' 3' 5' 5' 5' 3' 3' 3' 5' 5' 3' 3' 3' 3' 5' 3' 5' 5' 3' 3' 5' 5' 3' 3' 3' 5' C C C C C G AA GT G TA TTGA C AT T T C C A A C C AA GT G AA GT G T T C C AT A A T T C G TA GT T T C AA GT G TA GT T T C C AA GT G G TA GT TTGA C C G AA GT G T T C C AT A A C C C C C C C AT C Bisulfite treatment, PCR amplification and sequencing (Section 3.1) GT 3' GA 5' GA 5' A-rich DNA T-rich GA 5' GA 5' T-rich A-rich Wildcard mapping (Section 3.2.1) Clipping paired-end reads (Section 3.2.2) Reverse comp. A-rich T-rich Clipped Read counts Methylation level 5/6 0/3 Read counts and methylation levels (Section 3.5.2) Key steps Illustration Figure 3.1: The major stages in a workflow for basic WGBS data analysis 16 sequencing adapters where all cytosines are methylated. Sodium bisulfite treat- ment of those DNA fragments converts unmethylated cytosines into uracils and leaves methylated cytosines unchanged. Next PCR with primers complementary to the methylated adapters is used to amplify bilsulfite-converted DNA fragments, duringwhichuracilsarereplacedwiththymines. Finallythelibraryissequencedin high-throughput,almostexclusivelyusingtheIllumina/Solexasequencingtechnol- ogy. UsingIlluminasequencing,dependingonthewhethersingle-endorpaired-end sequencing is used, reads may be produced from one strand (single-end) or both strands (paired-end) of the sequence fragment. In paired-end sequencing, the first strand read corresponds to one from the original DNA template, which has had most cytosines converted to thymines as a result of bisulfite conversion. These reads are called T-rich. The second strand read is complementary to the first, and so the second end reads in paired-end sequencing will have adenine in positions that contained a guanine in the original molecules. These reads are called A-rich. 3.2 Mapping reads from bisulfite sequencing Mapping bisulfite-converted reads to the reference genomes is the first step of analyzing WGBS data. However the mapping procedure is more complicated than mapping reads from DNA-seq or RNA-seq due to bisulfite conversion. 3.2.1 Wildcard mapping After bisulfite conversion, unmethylated cytosines in the original DNA frag- ments are converted to thymines, therefore thymines in bisulfite treated reads may originate from thymines or unmethylated cytosines in the originate DNA frag- ments. T’s in the reads should not necessarily cause mismatches when mapped to 17 C’s in the reference genome. There are two approaches to tackle this issue. The three-alphabet approach converts all cytosines in the reference genome and reads intothymines, andrunsmappingalgorithmdevelopedformappingordinaryreads. The wildcard-mapping approaches allows T’s in the reads mapped to either C’s or T’s in reference genome, but C’s in the reads are not allowed to map to T’s in the reference genome. Compared to three-alphabet strategy, wildcard mapping strat- egy retains the complexity of reads and may improve mapping of low-complexity reads. Oneconcernaboutwildcardmappingstrategyisthatreadscontainingmany methylated cytosines have higher complexity, and are therefore more mappable, than reads originating from the same region but with fewer methylated cytosines. To avoid this bias, we require that Cs in reads are not allowed to map over Ts in the reference unless the T is followed by a G. 3.2.2 Considerations for paired-end reads In paired-end sequencing experiments, both T-rich and A-rich strands are sequenced. When mapping A-rich reads, we use AG wildcard mapping, i.e., A’s in the reads can map either A or G in the reference genome, and G’s in the reads cannot map As in the reference genome. We first identify all possible mapping locations for T-mate and A-mate independently, and join the mapping locations of corresponding pairs when they fall in certain distance (the maximum length of DNA fragment) within the correct chromosome and strand orientation. When merging the reads, we also detect the overlapping portion, if any, between the two read, and clip the overlapping part of the read with lower quality. This clipping step fixes the double-counting issue. Only uniquely mapping location of the paired mates are retained. 18 3.2.3 Dealing with short fragments When DNA fragments are short, it is possible to sequence into the sequenc- ing adapters attached to the 3’ end. The retained adapter makes it hard to map those reads. Even worse, including the adapters in mapped reads and subsequent analysis may lead to inaccurate estimation of methylation levels. In studies using next-generation sequencing technique to identify SNPs, the read bases (not just read counts) are used to identify SNPs based on a hypothesis test. Similar to SNP identification, in WGBS, the bases are used to estimate methylation fre- quency. Accidentally retained adapter sequences may cause errors in estimation of methylation levels (Section 3.5). We pre-process raw reads by identifying the sub- sequence atthe 3’ends ofreads that matches exactlywith thesequencing adapter, and remove the subsequence. 3.2.4 Bisulfite-specific deadzones Ambiguously mapped reads are discarded when mapping reads. When two locations in the genome have identical sequences of length greater than or equal to the read length, any reads derived from one of those locations will necessarily be ambiguous and therefore be discarded. We refer to contiguous sets of locations to which no read can map uniquely as “deadzones”. The existence of deadzones means that certain region of the genome cannot be profiled with WGBS. Bisulfite- treated reads are particularly likely to map ambiguously because of the conversion ofcytosinestothyminesincreasesthechanceoftworeadsoriginatedfromdierent locations having the same sequence. While longer read length can increase the proportion of the genome mappable, common read length is slightly above 100bp due to limitation of present sequencing platforms. Here we will give an estimate of the proportion of cytosines accessible with current technologies. Our analysis 19 indicates that over 90% of CpG sites in the human genome and the Arabidopsis genome are mappable with typical read lenghs ranging between 30bp and 120bp (Appendex A.1). Most of those un-mappable CpG sites are located in genomic repeats, such SINEs and LINEs. In practice, there are other factors aecting the proportion of CpG sites mappable. For example, certain regions cannot be readilyamplifiedbyPCR,and“deadzones”arelargerinrealapplications,thanthe theoretical results above because mismatches are allowed during mapping. From our own empirical experiences, we were able to profile 96.8% of CpGs in the A. thaliana genome with 79bp-long reads and 96.0% of CpG sites in human genomes with 101bp reads. 3.3 Identifying distinct molecules In a WGBS experiment, the PCR-amplified sequencing library may contain multiple copies of DNA fragments originated from the same DNA molecule. One ormoreofthosecopiesaresampledandsequencedwithIllumina/Solexasequencer. If multiple copies of those DNA fragments get sequenced, we will observe multi- ple reads mapped exactly to the same genomic location. We call these reads “duplicate reads”(Cokus et al. (2008) used the term “clonal reads”). Duplicate reads should not be confused with ambiguously mapped reads. The former refer to reads mapped to the same genomic location which are likely originate from the same DNA fragments, while the latter refers to reads that are mappable to multiple locations Duplicate reads, present in hundreds of and thousands of copies at some loca- tions, are likely the products of PCR over-amplification of a DNA molecule. If we keep all these duplicate reads, especially when the over-amplified DNA has 20 dierent methylation state from other DNA molecules, the estimation of methy- lation levels of CpG sites covered by these duplicate reads is biased toward the over-amplified DNA molecule. To correct such bias, we randomly keep one copy of those duplicate reads. This correction is conservative because it is possible that reads mapped to the same location may actually originate from distinct DNA molecules, that happen to be fragmented at the same locations. The correction of duplicate reads above ensures that we retain no more than one copy of each distinct DNA molecule from the same biological sample. When a biological sample is prepared and sequenced in multiple runs, and we want to obtainasingleestimateofthatsample,oneshouldpoolallreadsfromthetechnical replicates of the same sample, and then remove duplicate reads. When we want to merge multiple biological samples, the reads from dierent biological samples are simply pooled after duplicate reads in each biological replicate are removed. 3.4 Estimating bisulfite conversion rate Bisulfite sodium converts unmethylated cytosines inDNA fragmentsto uracils, which are later read out as T’s. However depending on experimental condition, bisulfite kit and time of treatment, the conversion may be incomplete. The bisul- fite conversion rate is an important measure for quality control of any bisulfite- treatment based experiments. Estimating the bisulfite conversion rate requires some kind of control set of genomic cytosines believed to be unmethylated. One typical option is to spike in some DNA sequences without any methylation such as Lambda virus DNA. One may also use chloroplast genomes, which has no evidence of methylation, to estimate bisulfite conversion rate when working with plant data. It is also possible 21 to use non-CpG cytosines which are free of methylation in most mammalian cells. Such control DNA sequences are treated with same protocol as applied to the test sample. We then map WGBS reads to the control DNA sequences, and for these cytosines presumed unmethylated, we compute the number of Cs m and the number of thymines u in reads overlapping those positions. The ratio of u to (m+u) gives the estimator of the bisulfite conversion rate. It is worth pointing out that the observed Cs at cytosine sites believed to be unmethylatedmaybeduetoincompletebisulfiteconversion, orotherfactors, such as random sequencing error. The estimation of bisulfite conversion rate above should actually be interpreted as an estimation of the eect of those factors com- bined. 3.5 Estimating methylation levels 3.5.1 Interpretation of methylation levels and notations The measurement of methylation levels from WGBS reflects the composite profileofacellpopulation. Theterm“methylationlevel”hasbeenusedextensively, however the meaning behind this term is rarely carefully considered. For example, howcanthiscompositeprofileberelatedtothemethylationstatusofasinglecell? Since a cell population may be homogeneous or quite heterogeneous, how should we interpret the value of methylation levels in each context? If we are able to examine the methylation status of a particular cytosine in a single cell, its methylation status should be either methylated or unmethylated at a particular time point. The methylation probability, the probability of its being in the methylated state, reflecting the relative strength of methylation and demethylation forces, and is the value we are interested in. In a homogeneous 22 cell population, at that cytosine location, some cells are in methylated state and some in unmethylated state. The proportion of cells in the methylated state in an infinite cell population will be equal to the methylation probability of that cytosine. However with WGBS we are only able to sample limited number of cells, and we use the term “methylation frequency” to refer the estimation of the methylation probability from WGBS. In a heterogeneous population as in almost allrealbiologicalsamples,theestimatedmethylationfrequencywillbeanestimate of the true methylation probability, whose accuracy depends on the proportion of the cell types of interest. In summary, we use “methylation probability” to refer to the real unobserved methylationlevels,and“methylationfrequency”torefertotheobservedestimation of“methylationprobability”basedonlimitedcellinWGBS.Wedenotetheformer value with p and the latter with f thereafter. 3.5.2 Statistical distributions for individual sites Assuming the sequenced reads have been mapped to the reference genome and properlyfiltered,wecountthenumberofmethylatedreads(containingC’s)andthe number of unmethylated reads (containing T’s) at the position of each cytosine site. Since CpG methylation is believed to be symmetric, reads from the both strands are used to give a single count for that CpG site. Wedenotethenumberofmethylatedreadswithmandthenumberofunmethy- lated cytosines with u, based on which estimation and inference about methyla- tion status of that cytosine may be performed. The number of methylated reads m follows a binomial distribution with the parameter p, the maximum-likelihood estimator of its parameter is simply the methylation frequency f (= m/(m+u)). 23 AsmentionedSection3.4,incompletebisulfiteconversionandsequencingerrors may leads to observation of cytosines at the position of an unmethylated cytosine. Giventhenumberofmethylatedreadsmandthenumberofunmethylatedreadsu, howcanwetestthatcytosineisindeedmethylated? Tothisend,weperformaone- tailedbinomialtest,wheretheprobabilitythatamethylatedreadisobservedunder the null hypothesis is estimated in Section 3.4, and the p-value of that cytosine being methylated is computed by summarizing the probability of observing equal to or more than m methylated reads under the null hypothesis. We have treated the estimation and inference of the methylation probability of single cytosines from the frequentist perspective. However those methods ignore the stochastic nature of the observations. Moreover, it is ignorant of the read cov- erageandtheestimatorofthemethylationlevelusuallysuersfromlargevariance in low coverage dataset. We introduce the beta-binomial distribution to model the read numbers. Given the true underlying methylation probability p, the num- ber of methylated reads m follows the binomial distribution Bin(m + u,p).We assume that the true methylation probability p follows a beta prior Beta(–,— ), which accounts for the inter-cytosine variance of methylation probabilities. The compound distribution of m is the following beta-binomial distribution: Pr(m|–,—,m +u)= A m+u m B B(m+–,u +— ) B(–,— ) . (3.1) Given a sequence of observations of the number of methylated reads and the num- ber of unmethylated reads, the procedure to estimate the parameters of the beta- binomial distribution is given in Appendix A.2. The beta-binomial distribution will be used to infer hypo- or hypermethylated regions (Section 4.2.1), and defer- entially methylated regions (Section 5.3). 24 3.5.3 Summarizing methylation level region-wide Besidesmethylationlevelatindividualcytosines,researchersarealsointerested inthemethylationstatusincertainregions,aswillbeshownlaterthatmethylation seems to function regionally. We use the average methylation level weighted by coverage to summarize methylation level in a region. Schultz et al. (2012) shared our opinion after comparing three alternative approaches: fraction of methylated cytosines, mean methylation level and weighted mean methylation level. However they also advocated discarding methylated reads at those sites that aredeemedunmethylatedbyabinomialtest, onthegroundthatthosemethylated reads are probably products of incomplete bisulfite conversion and/or sequencing errors. Wehavereservationsaboutadoptingthisapproachforthefollowingconsid- erations. First, thesignificanceofbinomialtestdependsoncoverage. Thinkofthe following extreme example. A region contains 10 cytosines, each one covered with two reads, one methylated and one unmethylated. By performing binomial test at each site, none of those cytosines are significantly methylated. If we discard all methylated reads according to their suggestion, we will conclude that the regional methylation level is 0, which seems quite unlikely. In this case, it is mostly due to low coverage that we deem those cytosines as “unmethylated”. Second, even if those methylated reads are indeed products of incomplete bisulfite conversion or other random noise, the same sources of noise are present at all sites, yet only those of low levels are being corrected. Because of these considerations, we use weighted mean methylation level without the proposed “corrections”. 25 3.6 Considerations for hydroxy methylation Recall from Section 2.2 that traditional WGBS cannot distinguish 5hmCs from 5mCs, and that two variant of WGBS, oxBS-seq and TAB-seq are developed to profile 5mC and 5hmC respectively, the analytic methods described above mostly apply to the analysis of oxBS-seq data and TAB-seq data. In the mapping stage, reads from both methods simply require CT wildcard mapping. The ensuing step for removing duplicate reads remain the same. Next, we discuss how to estimate 5hmC level for TAB-seq data and oxBS-seq data separately. With TAB-seq data, we count the number of reads containing cytosines m and the number of reads containing thymines u over a cytosine site. The estimation of 5hmC level is the ratio m/(m+u). Since 5hmC levels are not necessarilysymmetricevenintheCpGsequencecontext, weshoulddealwitheach cytosine separately (Yu et al., 2012). The oxBS-seq data production involves one round of standard WGBS, and one roundofoxBS-seq. Wedenotethe5mCprobabilitywithp m , the5hmCprobability with p h , and the combined 5mC and 5hmC probability p. Suppose the number of methylated and unmethylated reads from WGBS at a cytosine is m and u respectively, while the number of methylated and unmethylated reads from oxBS- seq at the same site is m Õ and u Õ . We assume that p≥ Beta(m+–,u +— ) and p m ≥ Beta(m Õ +– Õ ,u Õ +— Õ ). To compute the posterior expectation on the 5hmC level p h , we can compute the posterior expectation for the dierence between the beta-distributed random variates: E(p h )=E(p≠ p m |u,m,u Õ ,m Õ ,p>p m ). 26 Although we have not yet applied this method in practice, it is straightforward andwillyieldmorerobustresultsthannaivemethodsbasedondierencesbetween point estimates. 27 Chapter 4 Patterns of methylation: analyzing individual methylomes In Chapter 3, we have described how to obtain the number of methylated and unmethylated reads, and perform estimation and inference about methylation sta- tus at individual cytosine sites. In this Chapter, we consider a sequence of the observations of individual cytosine sites ordered according to their genomic loca- tions, from which we identify epigenetic domains with characteristic methylation status, specifically hypomethylated regions (HMR) in mammalian genomes and hypermethylated regions (HyperMR) in plant genomes. 4.1 The analysis problem The mammalian genome is mostly methylated, with localized regions of low methylation, shown as “valleys” located in the mostly highly methylated mam- malian genome (Figure 4.1). Such hypomethylated regions are of particular inter- est, because the decreased methylation level reflects the accessibility of the under- lying DNA elements, which have been implied with regulatory roles (Thurman et al., 2012). The methylome as continuous along the genome: The observed methylation status of a region gives important information about the epigenomic state of that region, whichisbelievedtobecontinuouseventhoughDNAmethylationeventshappenat 28 specificsites. Thecontinuityoftheepigenomicphenomenonmaybethoughtinthe context of chromatin accessibility. A hypomethylated region is usually accessible to protein binding proteins, which implies every base in that region, regardless of it being a cytosine or not, shares the properties. Predictably, if an additional CpG siteweretobeinsertedbetweentwoexistingonesandthetwoarehypomethylated then we expect the new one to be in the same state. These observations motivate us to identify such hypomethylated regions as concrete biological units. If we consider the genomic locations as a sequence of time points, the obser- vations at individual CpG sites are essentially sampling at discrete time points from a continuous process. More interestingly, since CpG sites are non-uniformly distributed along the genome, sampling frequencies in CpG-dense and CpG-sparse regions are variable. We are usually able to get more observations in regions of interest, suchasCpGislands(Gardiner-Garden&Frommer,1987)andpromoters. Thinking about methylation as high vs. low: Withinthosehypomethylatedregions, the observed methylation level is not necessarily reduced down to 0, just as the methylation level in the background does not necessarily always reach 1.We accordinglyusethetermhypo-orhypermethylatedregions. Ashasbeendiscussed in Section 3.5.1, the measurements from WGBS reflects the composite profile of a cell population, the non-binary methylation levels possibly suggests dynamic equilibrium of methylation and demethylation forces within those regions, and/or the heterogeneity of the underlying cell population. Motivations for a method to identify HMRs de novo: The WGBS data provide genome-widemeasurementofmethylationlevelsatsinglebaseresolution. Tomake full use of the WGBS data, we sought a method that identifies HMRs empirically from WGBS data alone, which will enable us to examine the methylome landscape 29 5 kb 28850000 28855000 28860000 HMR CD19 1 0 BCell Figure 4.1: Typical HMRs in mammalian genomes in an unbiased way, and discover novel regulatory elements based on their epige- netic state. Another desired feature is to locate epigenetic domain boundaries at high resolution based on the single-base resolution measurement from WGBS. 30 4.2 Vertebrates: hypomethylated regions 4.2.1 Identifying hypomethylated regions de novo The observation data we have are a sequence of read counts at each CpG site ordered by their genomic locations, X = {(m i ,u i )}, where m i and u i denote the number of methylated and unmethylated reads at the i th CpG site respectively. We model the observations at individual CpG sites with the beta-binomial distri- butionintroducedinSection3.5. Weassumethatanobservationisgeneratedfrom either one distribution BetaBin(– 0 ,— 0 ) representing the hypomethylation state, or another distribution BetaBin(– 1 ,— 1 ) representing the hypermethylation state. We denote the parameters (– 0 ,beta 0 ,– 1 ,— 1 ) with . As discussed in Section 4.1, the epigenomic state in a region may be considered as continuous, and adjacent obser- vations are correlated, so we assume the state of adjacent CpG sites follows the Markov property. Let’s use the indicator variable I i to denote whether the i th observation is in the hypomethylated state, the complete data likelihood of our sequence observations X and the indicator variables I is p(X,I|) = n Ÿ i=i a I i≠ 1 I i BetaBin(m i |m i +u i ,– 0 ,— 0 ) I i BetaBin(m i |m i +u i ,– 1 ,— 1 ) 1≠ I i , where a I i≠ 1 I i denote the transition probability from state I i≠ 1 to I i , and for sim- plicity we ignore the term denoting the transition from start state to the first observation. With the model described above, we use Baum-Welch algorithm for model training (Baum et al., 1970). HMRs are identified by posterior decoding: we compute the posterior probability that a CpG site being in the hypomethylated 31 state. Contiguous CpG sites with the above posterior probability greater than 0.5 are joined to obtain HMRs. 4.2.2 HMRs extending outside annotated CGIs and pro- moters We applied our HMR-finding method to WGBS data from hematopoietic stem and pluripotent cells, B cells and neutrophils, and identified between 50000 and 60000 HMRs in each reference methylome. As expected, many of our empiri- cally defined HMRs overlap with CGIs, features believed to be protected from methylation in conventional wisdom; however CGIs alone seem to fall short as a benchmark to define all HMRs with probable regulatory significance, as CGIs account for fewer than half of HMRs in any cell type (Table 4.1). Many of those HMRs outside of CGIs seem functional as supported by evolutionary conservation between chimpanzee and human (Table 4.1). Location within CGIs Outside CGIs Human 24988 29881 Conserved 21776 16960 Table 4.1: The distribution of HSPC HMRs in human relative CGI annotations. The Second row shows the number of HMRs conserved in the same cell type between human and chimpanzee in the same locations Promotersareanotherkindofgenomicfeaturebelievedtobelociofhypomethy- lation, however the majority of found HMRs are outside of annotated promoter regions, which is not surprising since promoters are usually closely related to CGIs (Table4.2). WeinparticularexaminedthesetofintergenicHMRs,definedas10kb at least away from annotated genic regions, which account for about one quarter of all HMRs in each cell type (Table 4.2). Comparisons of orthologous sites in the 32 cell Total Promoter Gene Gene Prox. Intergenic Conserved Inter. BCell 54869 10697 33149 6565 15155 8887 HSPC 63474 11188 37632 7646 18196 8979 Neut 71815 11572 42324 8776 20715 11280 Table 4.2: Distribution of HMRs relative to gene annotations. Column meaning: Total: No. HMR genome-widely; Promoter: No. HMR overlapping promot- ers, defined as 2kb upstream to 2kb downstream of transcription start site (TSS); Gene: No. HMRsinintragenicregionbetweenTSSandtranscriptiontermination site (TTS); Gene Prox.: No. HMRs in gene proximity region defined as 10kb upstream of TSS and 10kb downstream of TTS; Intergenic: No. HMRs in inter- genic regions defined as 10kb away to any gene annotation; Conserved Inter: No. intergenic HMRs conserved between human and chimpanzee corresponding cell types of human and chimpanzee show that half of the identi- fied human intergenic HMRs are conversed between these two species (Table 4.2). Considering the vast size of intergenic regions, species-specific structural variation of genomes, and possible species-specific variation of methylomes, the conserva- tion of intergenic HMRs is strikingly high, implying their regulatory significance. Since hypomethylation usually implies the accessibility to DNA-binding proteins, we searched for enrichment of transcription factor binding sites within intergenic HMRs. Indeed, lineage-specific intergenic HMRs are enriched for binding sites of transcription factors with implied function for blood stem cell dierentiation, such as C/EBP and PU.1 in neutrophile-specific HMRs and EBF and PAX motifs in B cell specific HMRs. For those shared intergenic HMRs between multiple cell types, we observed an striking enrichment of CTCF binding sites. These obser- vations implied important roles of intergenic HMRs in gene expression regulation and genome organization. 33 4.2.3 A summary of hypomethylation in mammalian methylomes Another feature of our de novo HMR-finding method is that it is able to locate HMR boundaries at single-base resolution. The properties of HMRs, such as num- ber of HMRs and HMR sizes, provide a high level summary of methylation pat- tern, which themselves reveal novel biological insights. We identified HMRs in sperm, HSPC, B cells and neutrophil genomes. Sperm genome harbors about 80000 HMRs, the largest number among these four cell types. HSPC, B cells and neutrophil have 63474, 54869 and 71815 HMRs respectively. More interestingly, in those HMRs shared between sperm and blood cells, sperm HMRs usually contain the corresponding blood cell HMRs (Figure 4.2). We extended the HMR analysis to include more methylomes of a variety of human samples, including normal cells, cancerous cells and cultured cells lines (Appendix A.5). Cell types show clear clusters based on the number of HMRs and median size of HMRs (Figure 4.3). In particular, sperm cells stand out as having the largest number of and widest HMRs. These observations suggest that the locations and widths of HMRs in sperm can define the ultimate boundaries of somatic HMRs. Compared to normal cells, cancerous cells and cultured cell lines tend to have more and larger HMRs (Figure 4.4), consistent with previously reported loss of methylation in large par- tially methylated regions. 34 chr16: 28850000 28852000 28854000 28856000 28858000 CD19 1 - 0 _ 1 - 0 _ 1 - 0 _ 1 - 0 _ HSPC Bcell Neut Sperm Figure 4.2: Sperm HMRs defines boundaries of somatic HMRs 30000 50000 70000 90000 600 800 1000 1200 #HMR HMR size ColonNormal BCell CD133 HSPC Neut ColoMucosa CD4T_100yr CD4T_Newborn Sperm PBMC BronchialAirway SmallAirway Cortex_Meth Figure 4.3: HMR number and sizes in normal cell methylomes 35 2e+04 4e+04 6e+04 8e+04 1e+05 1000 3000 5000 7000 #HMR HMR size Normal Cancerous Cell lines Figure 4.4: HMR number and sizes in normal cell and cancerous cell methylomes 36 4.3 Plant methylomes: focal methylation and multiple mechanisms 4.3.1 Issues in identifying plant HyperMRs Motivation for modeling three regional methylation states: In the Arabidopsis thaliana genome, the majority of cytosines are devoid of methylation, except in some discrete regions, hypermethylated cytosines exist in clusters. We call such regions hypermethylated regions (Fig. 4.9). By visual examination, we noticed that certain HyperMRs consist of alternating hypo- and hypermethylated CpGs, and the pattern was conserved among multiple methylomes, as exemplified by the left panel in Figure 4.5. To further investigate this pattern, we labeled a CpG hypomethylated if its methylation frequency is below 0.5, and then extracted continuous clusters of hypomethylated CpGs. The size distribution of those hypomethylated CpG clusters has two modes, one around 50 CpGs represent- ing unmethylated background regions, and the other with just one or a few CpGs representing tiny HypoMRs (Figure 4.6). Is it possible that those unmethylated CpGs within HyperMRs are simply due to random noise? To address this ques- tion, we selected the set of HyperMRs from rossetta leaf data and compared the methylation status of CpGs within those HyperMRs between endosperm (Hsieh et al., 2009) and in rossetta leaves. It turned out that the methylation status of those CpGs are significantly correlated between these two samples 4.3, therefore the unmethylated CpGs scattered in HyperMRs are unlikely due to random noise. Three-state variable-duration HMM:Theaboveobservationsandpreliminaryanal- ysis suggest that cytosines in the Arabidopsis exist in three possible states: the 37 5856-10C 6043-10C 6043-16C 6043-16Cb 6046-10C 6046-16C 8805000 8815000 8820000 8810000 8825000 22665000 22670000 22675000 22680000 22685000 chr5 1kb 1kb meth 1-meth Figure 4.5: Examples of dierent types of HyperMRs. The left panel illustrate an HyperMR consisting of alternating hypo- and hypermethylated CpGs. The right panel shows another HyperMR within which all CpG sites are uniformly hyperme- thylated. Each sample is shown in two tracks, the orange track representing the methylationfrequency,andthegreentrackrepresentingoneminusthemethylation frequency No. CpG sites HYPO in endosperm HYPER in endosperm Sum HYPO in leaves 123895 13224 137109 HYPER in leaves 189676 561144 750820 Sum 313571 574368 887929 Table 4.3: Interdependence of hypo- and hypermethylation status of CpG sites between rossetta leaf sample and endosperm sample. Only CpG sites in HyperMR are used. A CpG site is deemed hypomethylated if its methylation level is below 0.5, and hypermethylation otherwise. Fisher’s exact test for dependence of methy- lation status: p-value < 2.2e≠ 16 hypo-methylated state in the background, the HYPER-methylated state in Hyper- MRsandtheHYPO-methylatedstatescatteredwithinHyperMRs. Thetransition between the hypo and HYPO states is prohibited. In our model, we allow general duration distributions to model the length of genomic regions. In standard HMM, the length of genomic regions follows implicitly the geometric duration, which may not be flexible or realistic. For example, the mode of the geometric distribution is 1, which is quite unrealistic for 38 0 50 100 150 200 0 10000 20000 30000 40000 segment length (CpG number) number of CpGs Figure 4.6: The length distrubution of the hypomethylated CpG clusters meaningful genomic regions. In our model, the duration distribution can be any discrete distribution, which generalizes standard HMM, and is called VDHMM. Before giving the complete data likelihood under the three-state VDHMM, we need to introduce some notations. We denote the sequence of observations of read counts with X = {(m i ,u i )}, and denote the hidden states with Q = {q i }, where q i is the hidden state of the i th observation, which takes value in hypo, HYPER or HYPO. Finally, we denote the locations where the hidden states switch with C = {c 0 ,...,c j ,...,c m }, where c 0 points to 1, and c m points to the number of 39 observations n. We use d q (l) to denote the likelihood of the length l of a sequence in the hidden state q. The formula of the complete data likelihood is p(X,Q,C|) = m Ÿ j=1 1 a qc j≠ 2 qc j≠ 1 d qc j≠ 1 (c j ≠ c j≠ 1 ) c j ≠ 1 Ÿ i=c j≠ 1 BetaBin(m i |m i +u i ,– q i ,— q i ) 2 Issues in implementation of the three-state VDHMM: The training procedure of the three-state VDHMM is similar in spirit to the Baum-Welch algorithm (Baum et al., 1970). We give below the two key formula in the forward algorithm and the backward algorithm. Similar to the notation used in (Rabiner, 1989), we denote the likelihood of the i th observation given its hidden state q i with b q i (X i ). The forward score – i (q), which is the likelihood of the observations X 1:i with the i th observation being the end of a segment in hidden state q, is – i (q)= i≠ 1 ÿ j=1 ÿ s”=q – j (s)a sq d q (i≠ j) i Ÿ k=j+1 b q (X k ). (4.1) The backward score — i (q), which is the likelihood of the observation Y i+1:N given the i th observation being the end of a segment in hidden state q, is — i (q)= N ÿ j=i+1 ÿ s”=q a qs d s (i≠ j) j Ÿ k=i+1 1 b s (X k )— j (s) 2 . (4.2) Wetrainedthethree-stateVDHMMwithamethodsimilartotheBaum-Welch algorithm (Baum et al., 1970). Genome segmentation was done with posterior decoding or Viterbi algorithm (Viterbi, 1967). Because of the variable duration distribution, the time complexity of the above model is quadratic in the term of the number of CpG sites, which is computationally impractical for whole-genome analysis. In actual implementation, we used heuristic methods that either impose 40 imposed a threshold on the maximum length of each region, or assume the tran- sition from one region to another one, or vice versa, occurs in a short region (Appendix A.4). 4.3.2 Distinct properties of CpG, CHG and CHH Hyper- MRs We applied the three-state VDHMM to identify HyperMRs for cytosines in CpG, CHG and CHH contexts separately using the Arabidopsis WGBS dataset (Appendix A.3). Figure 4.9 show the methylation levels and identified HyperMRs in a typical genomic region. CpG,CHGandCHHHyperMRsunderdistinctregulation: InarepresentativeAra- bidopsis WGBS dataset, we identified 15985 CpG HyperMRs, 5337 CHG Hyper- MRs and 9359 CHH HyperMRs. The mean and median sizes of each kind of HyperMRs are shown in Table 4.4. Along each chromosome, CHG and CHH HyperMRs are mostly concentrated in pericentromeric regions and heterochro- matinhubs. WhileCpGHyperMRsarehighlyenrichedinpericentromericregions, they are also present in euchromatic chromosome arms at high proportion (Fig- ure 4.7), which is consistent with previous observations (Cokus et al., 2008). Next we examined the genomic annotation of CpG, CHG and CHH HyperMRs intermsofproteincodinggenes,transposableelementsandintergenicregions. The majority of CHG and CHH HyperMRs overlap transposable elements, consistent withtheirestablishmentbyRdDMpathwayandpossiblerolesinsuppressingtrans- poson mobility (Figure 4.8). Of the 15985 CpG HyperMRs, about 22% overlap transposable elements, while most of the remaining CpG HyperMRs are located in intragenicregions,consistentwiththepreviousreportsofgenebodymethylationin 41 HyperMR Number Mean size (bp) Median size (bp) CpG 15985 2560 1041 CHG 5337 4259 1019 CHH 9359 1687 537 Table 4.4: Number and sizes of CpG, CHG and CHH HyperMRs Start End No. HyperMRs INSIDE DOWNSTREAM 1155 INSIDE INSIDE 9407 UPSTREAM DOWNSTREAM 536 UPSTREAM INSIDE 336 Table 4.5: Most gene-overlapping CpG HyperMRs are restricted to gene bodies, i.e., starting downstream of the TSS and ending upstream of TTS plants (Feng et al., 2010; Zemach et al., 2010). The majority of gene-overlapping CpG HyperMRs show one-to-one correspondence with protein coding genes. In most cases, those CpG HyperMRs are restricted to intragenic regions, i.e, those HyperMRs start downstream of transcription start site (TSS) and end upstream of transcription termination site (TTS) (Table 4.5). Relationship between CpG, CHG and CHH HyperMRs: In the analysis above, we have shown the distinct patterns and genomic annotations of CpG, CHG and CHH HyperMRs. However, as illustrated in Figure 4.9, CpG HyperMRs, CHG HyperMRs and CHH HyperMRs exist together in certain regions (mostly trans- posons), and even more strikingly, they share the same boundaries, which suggests possible inter-connection between the three pathways responsible for methylation establishment and maintenance in some cases. We computed the frequency of all possible of combinations of CpG, CHG and CHH HyperMRs (Figure 4.10). Stan- daloneCpGHyperMRsaccountforalargeproportionofthoseregions,asexpected from previous observations that intragenic methylation happen in and exclusively in the CpG sequence context (Cokus et al., 2008; Lister et al., 2008). However, 42 0.0e+00 5.0e+06 1.0e+07 1.5e+07 0.0 0.2 0.4 0.6 0.8 1.0 chr4 HyperMR coverage CpG CHG CHH Figure4.7: ThedistributionofCpG,CHGandCHHHyperMRsalongchromosome 4 of A. thaliana standalone non-CpG HyperMRs accounts only for a tiny proportion, and most non-CpG HyperMRs co-occur with CpG HyperMRs, which is possibly due to the methyltransferase DRM2, capable of establishing de novo methylation via RdDM pathway, methylates all cytosines regardless of their sequence contexts. 43 CpG CHG CHH 0.0 0.2 0.4 0.6 0.8 1.0 Gene TE Intergenic Other Figure 4.8: Genomic annotation of plant HyperMRs in terms of protein coding genes, tranposons and intergenic regions 4.3.3 Classification of CpG HyperMRs Previously we described the observation of conserved hypomethylated CpG sites interspersed within CpG HyperMRs (Figure 4.5). We looked further into this phenomenon, and found that two types of CpG HyperMRs are distinguishable by the structural properties of their internal DNA methylation pattern. These two structural categories also correspond to fundamental functional dierences in the 44 chr1: TE 2 kb 1190000 1191000 1192000 1193000 1194000 1195000 1196000 CpG CHG CHH AT1G04420.1 AT1G04425.1 Figure 4.9: CpG, CHG and CHH HyperMR examples in Arabidopsis thaliana genome way these two types of HyperMRs associate with the regulation of gene expression and the silencing of transposable elements. Classification of supporting, variable and gap CpGs: We examined the methyla- tion status of individual CpGs within HyperMRs across 200 ecotypes, soliciting evolutionary conversation information to reveal their biological significance. We classified individual CpG sites in HyperMRs into three categories (Figure 4.11). The “supporting” sites, defined as methylated in more than 80% of methylomes with a HyperMR, seem to be methylated as a unit, and are essential to the pres- ence of HyperMRs. The “gap” sites are hypomethylated in more than 80% of lines with HyperMRs, and appear to be protected from methylation. The remaining sites having less conserved methylation status are classified as “variable”. Sequence preference of gap CpGs: We sought to explore the biological implications of gap and supporting CpG sites that show conserved methylation status. We first examined the sequence preferences of dierent classes of CpG sites, namely, CpGs outside of HyperMRs (nonHMR CpG), supporting CpGs, gap CpGs and variable CpGs. For each CpG site, we extracted the tetranucleotides centered around that CpG in the Arabidopsis TAIR10 reference genome. The relative frequency of each of 16 possible tetranucleotides are counted. ACGT motif is significantly enriched 45 000 001 010 011 100 101 110 111 HyperMR combination Frequency 0 2000 4000 6000 8000 Figure 4.10: Relationship between CpG, CHG and CHH HyperMRs. The first digit in the label means the presence of CpG HyperMRs, the second the presence of CHG HyperMRs, the third the presence of CHH HyperMRs. For example, the label 101 means there are CpG and CHH HyperMRs, but no CHG HyperMRs in the group of gap CpGs (enriched 10-fold compared to supporting CpGs and 2.5 times relative to unmethylated CpGs outside of HMRs) (Figure 4.12). This result agreed with previous observations by Lister et al. (2008) and Cokus et al. (2008). If the ACGT motif is essential to protect gap CpGs from being methylated, we would expect to see hypermethylation of gap CpGs in those ecotypes where the 46 0.0 0.6 8815500 8816000 8816500 8817000 0.0 0.6 Positions Mean meth Prop. of sample of Hyper-methylation 1.0 1.0 Figure 4.11: Supporting, variable and gap CpGs within a composite HyperMRs motifismutated. WeobtainedtheSNPdataofcorrespondingecotypesfromLong & Nordborg (2013) and derived the “real” reference genome for each ecotype. At each gap CpG site, we extracted the tetranucleotides from those samples in which that CpG site is within a HyperMR. If the CpG itself is mutated, those samples are discarded. The tetranucleotide is assigned into the HYPER group if the CpG methylation level is above 0.5, or the HYPO group if its methylation levelisbelow 0.2. WefirstpooledthetetranucleotidesfromallgapCpGstogether, and computed the relative frequency of each tetranucleotide in the HYPER and HYPO group respectively. One-tailed Fisher’s exact test based on the contingency Table. 4.6 showed the ACGT motif is significantly enriched in the hypomethylated group(Figure4.13). wealsoperformedsimilartestsatindividualgapCpGs,which 47 supporting variable gap nonHMR 0.00 0.05 0.10 0.15 0.20 0.25 ACGT preference ACGT GCGT ACGC ACGA CCGT ACGG TCGT GCGC CCGG GCGA CCGC TCGC TCGG GCGG CCGA TCGA 0.00 0.05 0.10 0.15 0.20 0.25 Figure 4.12: Frequency of ACGT motif around supporting, variable, gap and nonHMR CpG sites. Top left panel: The frequency of all possible tetranucleotides confirmedthehypothesisthatmutationofACGTmotifleadstohypermethylation of gap CpGs (data not shown). 48 HYPO HYPER #ACGT a c #OTHER b d Table 4.6: Contingency table for testing that ACGT favors hypomethylation. a: the number of ACGTin the HYPO group; b: the number of other tetranucleotides in the HYPO group; c and d: the number of ACGTs and other tetranucleotides in the HYPER group. One-sided Fisher’s exact test is used to compute the p-value that ACGT is enriched in the HYPO group GCGG CCGC CCGG GCGA TCGC GCGC CCGA TCGG CCGT ACGG TCGA GCGT ACGA ACGC TCGT ACGT GCGG CCGC CCGG GCGA TCGC GCGC CCGA TCGG CCGT ACGG TCGA GCGT ACGA ACGC TCGT ACGT Fisher's e values Enriched in Hypo Depleted from Hypo Figure 4.13: Gap CpGs tend to be hypermethylated if the ACGT motif changes The proportion of supporting CpGs distinguishes intragenic and transposon Hyper- MRs: The proportion of supporting CpGs within HyperMRs followed a bimodal distribution (Figure 4.14). Therefore, we subdivided CpG HyperMRs into two structural categories: the “simple” HyperMRs, composed mostly of supporting CpGs, and “composite” HyperMRs, due to their appearance as contiguous blocks 49 ofsupportingCpGspunctuatedbygapCpGs. Wecheckedthegenomicannotation of simple HyperMRs and composite HyperMRs respectively. A clear distinction canbedrawnbetweenthetwostructurallydefinedcategoriesofHyperMRs. About 60% of simple HyperMRs overlap with transposable elements, and only 25% over- lap with protein coding genes. On the other hand, more than 85% of composite HyperMRs are located in intragenic regions of protein coding genes (Figure 4.15). Therefore, these structurally defined HyperMR categories seem to bear significant functional dierences. Regulation of expression by simple and composite HyperMRs: DNA methylation has dual-eect on gene expression regulation, i.e., promoter methylation represses transcription,andgenebodymethylationispositivelycorrelatedwithgeneexpres- sion. This phenomenon suggested the eect of DNA methylation depends on their positionaldistributionregardinggenes, howevernomechanismhasbeenproposed. Can the distinction of simple and composite HyperMRs help explain the dual- eect of methylation on transcription regulation? We computed the correlation of methylation and expression in 200bp windows from 1kb upstream of TSS to 3kb into gene body, and 2kb region centered around TTS (Figure 4.16), which repro- duced the dual-eect observation. Next, we computed the methylation-expression correlation stratified by simple and composite HyperMR regions respectively. The middle panel in Figure 4.16 shows that methylated cytosines in simple HyperMRs are strongly negatively correlated with gene expression in promoter regions, and slightly negatively correlation even in gene bodies. On the other hand, methy- lated cytosines in composite HyperMRs show negligible eect in promoter regions, but strong positive correlation in gene bodies (bottom panel in Figure 4.16). The dual-eect of methylation on transcription regulation seems separated based on whether the methylation occur in simple or in composite HyperMRs. 50 proportion of supporting CpGs Frequency 0.0 0.2 0.4 0.6 0.8 1.0 0 500 1000 1500 2000 Composite HyperMRs Simple Figure 4.14: Classification of composite and simple HyperMRs based on the pro- portion of supporting CpGs Theseobservationsledustoproposethefollowingparadigm: highanduniform methylation,asisobservedthroughsimpleHyperMRs,isnecessarytosilencetran- scription, while in composite HyperMRs, the existence of gap and variable CpGs may facilitate the binding of DNA-binding proteins, which possibly exert more flexible and subtle control of gene expression. One possible interesting direction forfurtherexplorationistheplantbZIPproteins,whichtargetatACGTcis-acting sequence elements in a methylation sensitive way (Foster et al., 1994). 51 Composite Simple 0.0 0.2 0.4 0.6 0.8 1.0 Gene TE Intergenic Other Figure4.15: CompositeHyperMRs preferentially located withinintragenicregions of protein coding genes while simple HyperMRs are preferentially associated with transposable elements and intergenic regions 52 TSS TTS Overall methylation effect Simple HyperMR effect Composite HyperMR effect methylation expression correlation Gene model Figure 4.16: Dual-eect of DNA methylation on gene expression explained by classificationofsimpleandcompositeHyperMRs. Top: overallmethylationeect; Middle: simple HyperMR methylation eect; Bottom: composite HyperMR methylation eect 53 Chapter 5 Variation of methylation: analysis of multiple methylomes Inthischapterweaddressthecriticalissuesofidentifyingchangesordierences of methylation patterns between multiple methylomes. These dierences tend to be in the form of contiguous genomic intervals, and in general indicate genomic regions with specific functions or behaviors that dier between conditions. We call such genomic regions dierentially methylated regions (DMR). Thesimplestcomparisonstudyofmethylomesinvolvestwosamples. Examples include the comparison of methylation from one normal sample and one cancerous sample (Berman et al., 2012), or comparing one cell type with another (Hodges et al., 2011). This simple design involving only two methylomes has been common in recent years because the cost of producing high-quality and high-resolution DNA methylation data has limited the numbers of replicates or individuals typically used. However this pairwise comparison does not provide any information about the intrinsic variation of methylation in a normal biological sample (Hansen et al., 2012). This concern demands particular attention when studying cancer methy- lomes, which are heterogenous and likely exhibit some variation in methylation that is a consequence, rather than a cause, of the cancer phenotype. Case-control experimental designs, which are the natural extension to simple one-to-one com- parison, include multiple replicates or individuals in both the case and control 54 groups, and therefore allow us to take inter-individual variation into account. To overcome the cost limitation of large scale case-control projects, certain projects have resorted to RRBS (Gu et al., 2011); others argue in favor of lower coverage WGBS, and using local smoothing methods in the analysis to compensate for the coverage (Hansen et al., 2012). A third kind of experimental settings also involve multiple samples, but these samplesarenotreadilyseparatedintotwogroups. Forexample,toassessthenatu- ral variation of Arabidopsis methylomes, researchers have obtained methylomes of multiple Arabidopsis ecotypes, which cannot be readily separated into case or con- trolgroups. IdentifyingDMRsinanunsupervisedmannercallsfornovelanalytical methods. The analytical challenges in DMRs involve two aspects. First, if we are pre- sented with a specific genomic interval, possibly very large or as small as a single site,howcanweusetheavailabledatatomakeinferencesaboutdierentialmethy- lationinthatinterval? Moreover, howcanweassignstatisticalsignificancetosuch dierences? Second, the DMR intervals are distributed along the genome but we know neither how many there are, nor their sizes. The search space of possible sets of intervals is huge. Of course, these two aspects are inter-connected. We will present statistical methods for the identification of DMRs in the three aforemen- tioned types of experimental settings, and focus our descriptions around these two key technical aspects. Toward the end of this chapter, we will explore the adaption of concepts and methods in population genetics to studies of epigenetics in a population, and through which we may have a better understanding of variation and selection of epigenetics mechanisms. 55 5.1 Dierentially methylated regions between two methylomes 5.1.1 A three-state HMM to identify DMRs Our method to identify DMRs between two samples extends the method intro- duced by Hodges et al. (2011). The basic framework is as follows: we first assign dierential methylation scores to individual CpG sites, and then we model how these scores change moving along the genome. Quantifying dierential methylation at individual sites: Letusassumewearegiven methylation data sets from two dierent samples, and are interested in a specific cytosine site. In particular, we want the probability that the cytosine methylation forsample1,denotedp 1 ,isgreaterthanthemethylationlevelforthesamecytosine site in sample 2, denoted p 2 . The observed data are the counts m 1 and u 1 of methylatedandunmethylatedobservationsfromreadsforsample1,andm 2 andu 2 which are defined similarly for sample 2. Altham (1969) took a Bayesian approach to the same inference problem, formulating the probability as Pr 1 p 1 >p 2 |m 1 ,u 1 ,m 2 ,u 2 2 = ⁄ 1 0 ( m 1 +u 1 ) ( m 1 )( u 1 ) y m 1 ≠ 1 (1≠ y) u 1 ≠ 1 ◊ ⁄ y 0 ( m 2 +u 2 ) ( m 2 )( u 2 ) x m 2 ≠ 1 (1≠ x) u 2 ≠ 1 dxdy. Bymanipulatingtheintegrals,weobtainaformulathatrevealsthediscretenature of the underlying counts data: m 2 ≠ 1 ÿ k=j 1 m 2 +u 2 ≠ 1 k 21 m 1 +u 1 ≠ 1 m 1 +u 1 ≠ 1≠ k 2 1 m 1 +u 1 +m 2 +u 2 y≠ 2 m 1 +m 2 ≠ 1 2 , with j = max(m 2 ≠ u 1 ,0). (5.1) 56 The formula is symmetric, i.e.,, Pr(p 2 Ø p 1 |m 1 ,u 1 ,m 2 ,u 2 )= 1≠ Pr(p 1 > p 2 |m 1 ,u 1 ,m 2 ,u 2 )givesthecytosinemethylationinsample2isgreaterthanthatof sample 1. This formula provides a single-site measure for dierential methylation that takes into account both the observed frequencies of methylation as well as the amountofdatacontributingtothosefrequencies. Thelargerthevalueis, themore likely it is that this cytosine has lower methylation in sample 2 than in sample 1. We call this statistic the single site dierential methylation score, or di-score for short. The di-score is a probability, ranging from 0 to 1. We compute di-scores for each cytosine site (denoted {X i } for site i), and use a Beta distribution to model their values. Three possible states of dierential methylation: With two methylomes we must model three possible states: (i) no dierential methylation, (ii) methylome 1 has lower methylation, and (iii) methylome 2 has lower methylation. We therefore introduce three states, namely same (s), gain (g) and loss (¸) . For convenience, we use the language of gain and loss of methylation between samples, but remark that in many cases there will be no time dependence between samples. Since a region gaining methylation is unlikely adjacent to a region losing methylation, we forbid the transition between g and l states. The transition matrix for this model is: S W W W W W U p gg p sg 0 1≠ p gg p ss 1≠ p ¸¸ 0 p s¸ p ¸¸ T X X X X X V As pointed out earlier (Section 4.3.1), intervals of interesting methylation fea- tures tend to have size distributions with mode away from 1, and are therefore not well described by the geometric duration distribution implicit in HMMs. To allow greaterflexibilityindescribingthedurationofeachstate,weuseempiricalduration 57 distributionsforthe g and¸states. Theproceduresformodeltrainingandgenome segmentation are similar to the three-state VDHMM described in Section 4.3.1, except the emission distribution of the di-scores is the beta distribution. Some implementation considerations in addressing the time complexity of VDHMM are provided in Appendix A.4. Assessing significance of DMRs: From the three-state HMM model, we obtain a set of candidate DMRs. To assess the statistical significance of these DMRs, we assign empirical p-values to them based on a randomization method. For a DMR coveringthes th CpGtothet th CpG,wecomputetheareaofmethylationdierence between those two samples with the formula: v = t ÿ i=s |f i1 ≠ f i2 |, where f i1 and f i2 represent the methylation frequencies of the i th CpG in sample 1 and sample 2 respectively. Next we randomly sample r genomic regions with the samenumberofcytosines,andcomputetheareaofmethylationdierenceforthese r regions (often in practice r=50,000). We assign p-values to DMRs based on theempiricaldistributionfromthe r randomregions. Giventheempirical p-values of all candidate DMRs, we retain DMRs with given false discovery rate using the method of Benjamini & Hochberg (1995). 5.1.2 DMRsduringhematopoieticstemcelldierentiation WeappliedourDMR-findingmethodtoaWGBSdatasetofhematopoieticstem and progenitor cells (HSPC), B cells, and neutrophil cells (Hodges et al., 2011). These cell types include one self-renewing and multipotent stem cell, along with two derived cell types that have dierentiated along opposite lineages: the B cells 58 are lymphoid and the neutrophils are myeloid. By examining DMRs between all pairs of HSPC, B cells and neutrophils, we are able to probe the DNA methylation changes associated with cell-fate specification. General methylation changes during hematopoietic development: We did pairwise comparisons between the three reference methylomes. There are 36081 DMRs between B cells and HSPCs, and 14223 DMRs between neutrophils and HSPCs (Table 5.1). The number of DMRs in each comparson is consistent with previous observation that HSPCs are skewed toward the myeloid lineage (Ji et al., 2010). WhenHSPCsdierentiateintoeitherBcellsorneutrophils,thereareregionsgain- ingandlosingmethylation, whichsuggestcelldierentiationinvolvesboth de novo methylation and active demethylation. Interestingly, the commitment of HSPC into B cells involves 18335 DMRs gaining methylation compared to 17726 DMRs losing methylation, while the commitment of HSPC into neutrophils includes 6190 DMRs gaining compared and 8033 DMRs losing methylation. The comparison of B cells and neutrophils also confirms that B cells have more DMRs gaining than losing methylation, which suggests that de novo methylation plays an important role during lymphopoeisis. Extensive DMRs beyond CGIs and CGI shores: CpGislands(CGI),areofparticu- lar interest in studying DNA methylation because CGIs, usually hypo-methylated and accessible to DNA-binding proteins, are believed to be of regulatory roles. A previous study reported that most cancer and tissue specific DMRs occur within 2kb from CGIs, termed “CGI shores” (Irizarry et al., 2009). We explored the rela- tionship of our DMRs with CGIs and CGI shores. We obtained a comprehensive list of CGIs generated from a HMM-based approach (Irizarry et al., 2009), and produced CGI shores by extending each CGI by 2kb in both directions. In con- trast to previous observation, only about one third of DMRs we found are located 59 Annotation BCell v HSC HSC v Neut BCell v Neut TOTAL 36081 14223 38068 GAIN 17726 6190 16925 LOSS 18355 8033 21143 CGI-Irizarry-hg18 5174 3479 4217 CGI-Irizarry-hg18-shore 11574 5708 10880 promoter 2165 974 2215 UTR5 4356 1534 4505 gene 22030 8077 23271 exon 5380 2085 5631 intron 19333 6679 20706 UTR3 1695 620 1842 TFBS 13391 3914 14416 Table 5.1: DMRs during human blood stem cell dierentiation within CGIs and CGI shores, while a substantial number of DMRs exist beyond CGI and CGI shores (Table 5.1). One possible explanation is that the Irizarry et al. (2009) study used an array-based method CHARM, whose probes are limited to CpG-rich regions, while our analysis based on WGBS allows more comprehen- sive assessment of the methylome. When we identified hypo-methylated regions (HMR) in individual methylomes, we also observed many HMRs outside of CGIs, which seems functional as supported by evolutionary conservation among human and chimpanzee (Section 4.2.2). Our DMRs analysis is consistent with that obser- vation, suggesting, functional DNA methylation changes occur in more diverse genomic contexts. Promoter DMRs regulate gene expression: DMRs close to genes have been associ- atedwithtranscriptionregulatoryroles,contributingtotissuespecificgeneexpres- sion. We focused on DMRs for comparison between B cells and neutrophils since those DMRs represent a comprehensive set of regions where the methylation pat- tern changes between myeloid and lymphoid lineages. Among the 38068 DMRs between B cells and neutrophils, 23271 overlap with genes and 28017 are located 60 within 10kb from genes. The distribution of DMRs relative to gene annotations is shownin Table 5.1. Weexamined thedistribution ofDMRs inpromoter regionsin more detail by computing the DMRs frequency in 50bp sliding windows from 4kb upstream of TSS to 2kb downstream (Figure 5.1). Promoter regions are enriched with DMRs relative to other part of genes. DMRs are more concentrated down- stream of TSS than upstream of TSS, and the DMR coverage peaks around 1kb downstreamofTSS.Bycomputingthecorrelationbetweendierentialmethylation and dierential expression around promoter region, DMRs downstream of TSS are strongly associated with tissue specific gene expression. DMRs are enriched around transcription factor binding sites: DNA methylation is believed to exert their regulatory eects by aecting protein-binding. We further studied whether tissue-specific DMRs are associated tissue specific transcription factor binding. We obtained the list of conserved transcription factor binding sites predicted using sequence conservation among human, mouse and rate from UCSC Genome Browser (Weirauch & Raney, 2007). Among the 38068 DMRs between B cells and neutrophils, 14416 overlap with conserved TFBS. Next we extracted those TFBSs that are over-represented in DMRs between B cells and neutrophils. Those enriched TFBSs in DMRs are generally related to blood cell dierentiation and immune response (see Table 5.2 for a list of enriched TFBS). For example, AML1 (alternatively called RUNX1) is a transcription factor that regulates the dierentiation of hematopoietic stem cells into mature blood cells. Figure 5.2 shows an intergenic DMRs, conserved between human and chimpanzee around a conserved TFBS of AML1. In summary, cell type specific DMRs are significantly enriched in binding sites of transcription factors important for blood cell dierentiation. 61 0.14 0.16 0.18 0.20 0.22 0.24 Position relative to TSS Frequency covered by DMRs TSS 1000 2000 Figure 5.1: Distribution of DMRs in promoter regions chr7: 1 kb 139574500 139575500 139576500 Human BCell Human Beut Chimp BCell Chimp Neut DMR Conservation Figure 5.2: A DMR located in intergenic region overlaps with a conserved AML1 binding site 62 TFBS #DMR #Total Enrichment p-value CEBPB-02 655 14338 2.20 1.66e-69 AP1-01 588 15233 1.86 1.54e-41 NFKBC 434 10394 2.02 3.63e-38 AML1-01 1286 43334 1.43 4.08e-33 BACH2-01 530 14432 1.77 1.25e-32 NFKAPPAB-01 358 8815 1.96 9.48e-30 CREL-01 362 9027 1.94 4.07e-29 CEBPB-01 534 15132 1.70 4.86e-29 NFE2-01 457 13114 1.68 3.86e-24 LMO2COM-01 393 11366 1.67 1.69e-20 CEBPA-01 467 14295 1.58 1.11e-19 HLF-01 584 19073 1.48 1.40e-18 ISRE-01 464 14758 1.52 6.92e-17 BACH1-01 602 20245 1.44 9.68e-17 AP1C 235 6323 1.79 4.89e-16 ELK1-01 231 6464 1.72 3.99e-14 CEBP-Q2 462 15379 1.45 6.84e-14 OLF1-01 346 11298 1.48 9.44e-12 P53-01 347 11425 1.47 2.20e-11 NF1-Q6 328 10706 1.48 3.01e-11 AREB6-03 294 9696 1.46 7.74e-10 NFKAPPAB65-01 179 5301 1.63 1.44e-09 MYOD-01 321 10993 1.41 5.34e-09 HEN1-02 258 8528 1.46 9.32e-09 MYOD-Q6 272 9165 1.43 1.91e-08 MYB-Q6 378 13535 1.35 2.69e-08 E47-02 407 14882 1.32 7.03e-08 RREB1-01 267 9254 1.39 2.32e-07 MYOGNF1-01 325 11651 1.35 2.55e-07 SREBP1-02 318 11395 1.35 3.26e-07 ARNT-02 232 7886 1.42 3.55e-07 AP1FJ-Q2 230 7808 1.42 3.64e-07 CEBPC 390 14468 1.30 4.91e-07 Table 5.2: Binding sites of transcription factors essential for blood stem cell dier- entiation are enriched within in DMRs. The whole genome have n TFBSs and n Õ of them overlapping DMRs. For a given transcription factor, there are m binding sites and m Õ of them overlapping with DMRs. If DMRs are enriched around TFBS of the given TF, we performed Fisher’s exact test to test m Õ /m > n Õ /n 63 Typical DMR patterns with evolutionary support: The genome browser showing DMRs and methylation levels in individual methylomes illustrates characteristic patterns of how a DMR comes into being. We identified DMRs from the same blood cell types in human and chimpanzee respectively, and used DMRs conserved between human and chimpanzee to study typical DMR patterns. Based on those conserved DMRs, we observed four scenarios that a DMR may emerge. First, an existing HMR disappears or a new HMR comes into being. We call this process the death or birth of HMRs (Figure 5.3). Second, the methylation in one region changes partially (Figure 5.4). Third, the boundaries of an existing HMR shifts which results in the expansion or shrinkage of HMRs (Figure 5.5). Fourth, the location of boundaries of a HMR remain the same but they may change from a sharp transition to a gradual transition. This kind of change suggest changes in the strength of regulation upon those boundaries (Figure 5.6). Directional methylation change during hematopoietic stem cell dierentiation: We have identified a set of cell type specific DMRs that characterize myeloid and lymphoid lineages, and we sought to elucidate how the methylation changes from HSPCs to specialized lineages. In each DMR between B cells and neutrophils, we chr19: 2 kb 1104500 1105500 1106500 1107500 1108500 1109500 SBNO2 Huma BCell Human Neut DMR Chimp BCell Chimp Neut Figure 5.3: DMR caused by the birth of a new HMR] 64 chr19: 2 kb 12869000 12871000 12873000 12875000 SYCE2 GCDH Huma BCell Human Neut DMR Chimp BCell Chimp Neut Figure 5.4: DMR caused by partially methylated regions chr19: 5 kb 38480000 38482000 38484000 38486000 38488000 CEBPA BC114482 Huma BCell Human Neut DMR Chimp BCell Chimp Neut Figure 5.5: DMR caused by the extension of existing HMR boundaries chr19: 2 kb 13133000 13135000 13137000 13138000 Huma BCell Human Neut DMR Chimp BCell Chimp Neut Figure 5.6: DMR caused by transition from sharp boundaries to gradual bound- aries 65 computed the mean methylation levels in B cells, HSPCs and neutrophils respec- tively. In these DMRs, HSPCs show intermediate methylation level between B cells and neutrophils, and the methylation level of HSPCs are closer to neutrophils than to B cells. The same pattern is observed in DMRs during chimpanzee blood cell dierentiation (Figure 5.7). The intermediate methylation of HSCs in DMRs between B cells and neutrophils suggests that HSCs probably maintain an poised methylation status for lineage specification, which are undergone de novo methy- lation or demethylation when HSPCs are committed to certain lineages. There are concerns that the observed intermediate methylation level in HSPCs is due to the heterogeneity of HSPC, i.e., a mixture of myeloid and lymphoid cells. While this possibility cannot be completely ruled out, we observed characteristic HMRs around gene promoters specific to stem cell renewal, suggesting that our HSPC sample indeed has pluripotency. 66 DMR: BCell < Neut DMR: BCell > Neut Mean methylation 0 1.0 Chimp Human 0 1.0 Mean methylation DMR index by mean methylation level in HSPCs Figure 5.7: HSPCs show intermediate methylation in cell type specific DMRs. Red: B cells; Green: HSPCs; Blue: neutrophils 67 5.2 Dierentialmethylationincase-controlstud- ies Asanextensiontotheone-to-onecomparison,case-controlstudiesincludemul- tiple replicates in each group. The samples in the two groups may be matched or unmatched. The optimal number of samples should be determined in the specific context of genome-wide methylation profiling. These experimental design issues remain important open questions. In this Thesis, we will focus on the two aspects in identifying DMRs, i.e., determination of DMR location and boundaries, and testing whether the methylation levels are significantly dierent. 5.2.1 Using sliding windows to identify DMRs Using sliding windows to determine DMRs: OurgeneralstrategytoidentifyDMRs in the case-control experiments is based on sliding windows. We move a sliding window along the genome at certain step size. In each sliding window, we test whether those two groups are dierentially methylated. Finally, individual signifi- cantly dierentiated windows, if located within a certain distance from each other, are merged to form a single DMR. Statistical test of significance of dierential methylation: Given an interval to be tested, we compute the weighted average methylation level in each sample (Sec- tion 3.5). If the samples in the two groups are paired, we use the Wilcoxon signed- rank test (Wilcoxon, 1945). Otherwise, Welch’s t-test is used, because sample variances in the two groups are not necessarily the same (Welch, 1947). After the p-values are computed for all windows, the procedure of Benjamini & Hochberg (1995) is used to select the p-value cuto under given false discovery rate. 68 5.2.2 Temperature specific DMRs in plant methylomes Weappliedourmethodtodetecttemperature-specificDMRsintheArabidopsis methylome grown at 10 ¶ C and 16 ¶ C (see Appendix A.3 for description of this dataset). Before setting out to find specific DMRs, we tested whether the genome- wide methylation levels are dierent between plants grown at 10 ¶ C and 16 ¶ C respectively. In general, plants grown under 16 ¶ C show higher methylation level thanthosegrownunder10 ¶ C(Figure5.8). Welch’st-testindicatesthemethylation dierence between two temperatures is statistically significant for CHH sequence context, butinsignificantCpGandCHGsequencecontexts. Guidedbythisresult, we used our DMR-finding method to search for specific DMRs using cytosines in CHH sequence contexts. We identified 16638 temperature-specific CHH DMRs with mean size 968bp. Genomic annotation shows these CHH DMRs are mainly associated with transposable elements and intragenic regions (Figure 5.9). So far we have not been able to explore the biological significance of these temperature- specific DMRs. 69 10C 16C 0.25 0.30 0.35 0.40 CpG Temperature 10C 16C 0.06 0.07 0.08 0.09 0.10 0.11 0.12 CHG Temperature 10C 16C 0.010 0.015 0.020 0.025 0.030 0.035 0.040 CHH Temperature Figure 5.8: Temperature eect on genome-wide methylation level 70 Gene 4084 TE 10112 Intergenic 2314 Other 128 Figure 5.9: Genomic annotation of temperature-specific CHH DMRs 71 5.3 Unsupervised identification of dierentially methylated regions 5.3.1 Issues in identifying DMRs in an unsupervised man- ner WehavediscussedtheidentificationofDMRswhenthereareonlytwosamples, or when multiple samples are readily assigned case or control labels. Here we addresstheproblemofidentifyingDMRsinmultiplemethylomeswhenthesamples have not been separated into groups a priori. This problem was motivated by the study of natural variation of Arabidopsis thaliana methylomes in northern Europe (Appendix A.3). This dataset includes about two hundred methylomes of Arabidopsis ecotypes collected from various geological locations. In this case, there is no sound reason to group those samples into two categories. However methylation variation has been observed in certain genomic regions, which may be considered as dierentially methylation regions. What is more interesting, samples that share similar methylation status in one region may show dramatic methylation dierence in another (Figure 5.10). Similar scenarios arise when we compare the methylomes of a variety of human cell types. The genome browser snapshot in Figure 5.11 shows the methylation level and HMRs of multiple human methylomes, from which we easily identify several regions with dierential methylation that possibly play regulatory roles in cell dierentiation. However there is no simple grouping scheme applicable to all DMRs. 72 chr1: 5 kb 195000 197000 199000 201000 204000 1062-10C 1063-10C 1074-10C 1137-10C 1254-10C 1363-10C 1367-10C 1585-10C 5828-10C 5832-10C 5856-10C Figure 5.10: Examples of DMRs among multiple Arabidopsis ecotypes BesidessearchingofDMRboundariesandtestingdierentialmethylation,iden- tifying DMRs in the two motivating examples requires additionally determination of the methylation status of each sample on a per-DMR basis. 5.3.2 An EM method for unsupervised identification of DMRs Our general framework to identify DMRs from unsupervised samples is also based on sliding windows as in the case-control experimental setting (Section 5.2). TheinnovationforunsupervisedidentificationofDMRsishowtoinferthegrouping of samples and test whether those samples are dierential methylated with proper grouping. 73 chr19: 10 kb 17080000 17085000 17090000 17095000 17100000 CD4T-100yr CD4T-new PBMC HSPC BCell Neut CD133 HCC1954 MEC FES H9ESC NHFF Figure 5.11: Example of DMRss among multiple human methylomes Observations and notations: Suppose we have m samples, and in the region to be tested there are n cytosine sites. At each site of each sample, we observe the number of methylated reads and the number of unmethylated reads: I 1 (m 11 ,u 11 )(m 12 ,u 12 ) ... (m 1n ,u 1n ) I 2 (m 21 ,u 21 )(m 22 ,u 22 ) ... (m 2n ,u 2n ) . . . . . . . . . . . . . . . I m (m m1 ,u m1 )(m m2 ,u m2 ) ... (m mn ,u mn ). 74 Since we have no prior knowledge about whether a sample is hypo- or hypo- methylated in this region, we use the indicator function I i to indicate that the i th sample belongs to the hyper-methylated group, i.e., I i = Y _] _[ 1 if the i th sample is hyper-methylated, 0 if the i th sample is hypo-methylated. Null hypothesis: The null hypothesis is that this region is not dierentially methy- lated, i.e., H 0 :(I 1 =··· = I m =0)‚ (I 1 =··· = I m =1). We use the beta-binomial distribution to model the number of methylated and unmethylated reads as described in Section 3.5. The likelihood of the observations under the null model is L 0 = m Ÿ s=1 n Ÿ i=1 BetaBin(m si |(m si +u si ),–,— ). (5.2) Computation of the likelihood under the null hypothesis is straightforward. We first get the maximum likelihood estimator of – and — as described in Appendix A.2, and sum up the likelihood of all observations. Alternative model: The alternative hypothesis is that at least one sample has methylation status dierent from the others, i.e., H 1 :1 < q i I i <m. Suppose the observations in the hypo-methylated group follow a beta-binomial distribution with parameters (– 0 ,— 0 ), while observations in the hyper-methylated 75 group are beta-binomial with parameters (– 1 ,— 1 ). The total likelihood under the alternative model is L 1 = m Ÿ s=1 n Ÿ i=1 BetaBin(m si |(m si +u si ),– 1 ,— 1 ) I si BetaBin(m si |(m si +u si ),– 0 ,— 0 ) 1≠ I si . (5.3) The alternative hypothesis is essentially a mixture beta-binomial distributions, with the indicator function I i as latent variables. To compute the likelihood under the alternative model (Equation 5.3), we employ an EM algorithm to estimate the model parameters – 0 , — 0 , – 1 , — 1 , along with the latent I i . Intheexpectationstep,weknowthemaximumlikelihoodestimatesof ˆ – 0 , ˆ — 0 , ˆ – 1 and ˆ — 1 . The expected value ˆ I s of the indicator I s provides an estimate of the probability that the s th sample belongs to the hypermethylated group, and is updated according to the formula ˆ I s = r n i=1 Pr 1 m si |(m si +u si ),– 1 ,— 1 2 r n i=1 Pr 1 m si |(m si +u si ),– 1 ,— 1 )+ r n i=1 Pr 1 m si |(m si +u si ),– 0 ,— 0 2 . In the maximization step, we have the updated values of the indicator function ˆ I s . The parameters – 1 and — 1 are updated with their MLEs from the observations weighted by corresponding ˆ I s . Similarly the parameters – 0 and — 0 are updated with their MLEs from the observations weighted by (1≠ ˆ I s ). The expectation and maximization steps are iterated until the likelihood L 1 converges. Model selection: After we obtain the likelihood L 0 under the null hypothesis, and thelikelihoodL 1 underthealternativemodel,weusealikelihoodratiotestfor⁄ = L 1 /L 0 to determine whether the region shows significant dierential methylation across the methylomes. To compute the p-value for the above statistics, we use the Chi-square approximation. 2ln(⁄ ) follows the chi-square distribution ‰ 2 3 with 76 degree of freedom of 3. The p-value of this region is dierentially methylation under the null hypothesis is the probability Pr(2ln(⁄ )>‰ 2 3 ), where ‰ 2 3 denotes a random variable following the chi-square distribution with degree of freedom 3, and 2ln(⁄ ) is computed from the observed data. 77 Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 SuperHMR 1 1 0 1 0 Frequency: 3/5 Figure 5.12: Illustration of superHMR and how to compute HyperMR frequency at a given superHMR in a population 5.4 Methylation variation in terms of hyper- methylated regions HyperMRs are basic units of methylation regulation: Recall in Section 4.3 we stud- ied HyperMRs in individual A. thaliana methylomes. When we compare multiple methylomes of wild Arabidopsis strains (Appendix A.3), HyperMRs frequently appear or disappear as a whole unit (Figure 5.10). This observation suggests that HyperMRs are the basic units of regulation and selection. To study the dynamics ofHyperMRsinapopulation,weintroducetheconceptofsuperHMR,whichrefers to a region that encompass all HyperMRs in individual samples (Figure 5.12). In a superHMR region, some samples have HyperMRs and some not. Conceptually, the presence or absence of HyperMRs are similar to two alleles. We compute the frequency of samples with a HyperMR in a superHMR, and then obtain frequency spectrumofthepresenceofHyperMRssummarizingallsuperHMRs(Figure5.13). HyperMRs that overlap tranposons: The frequency spectrum of CpG HyperMRs stratified by genes, transposons and intergenic regions is shown Figure 5.13. 78 Gene TE Intergenic Figure 5.13: Frequency spectrum of CpG HyperMRs in Swedish Arabidopsis pop- ulation The frequency spectrum shows a characteristic bimodal distribution, indicat- ing that certain regions have conserved methylation signals, and other regions either newly gain or gradually lose methylation in some ecotypes. TE-overlapping superHMRs are skewed toward the high frequency categories when compared to gene-overlapping superHMRs. The length of transposons does not show strong influence on the frequency spectrum (Figure 5.14). This observation suggests that TEregionstendtohavehighandconservedmethylationlevels,possibletosuppress transposon stably. 79 Figure 5.14: Frequency spectrum of HyperMRs overlapping transposable elements stratified by transposable element length HyperMRs that overlap genes: Gene-overlapping HyperMRs are more variable as exemplified by many loci with low or intermediate population frequency (Fig- ure 5.13). The frequency spectrum of those superHMRs are clearly aected by gene length (Figure 5.15). HyperMRs overlapping long genes (Ø 4 kb) are usually conserved. HyperMRs overlapping with genes of intermediate length show great variability. Unexpectedly HyperMRs overlapping with small genes (Æ 1 kb) are also rather conserved compared to intermediate length genes. To explore of the functional relevance of HyperMR variation, we computed the frequency spectrum of HyperMRs stratified by major gene families (Figure 5.16), which ranges from 80 0.0 0.8 Figure 5.15: Frequency spectrum of HyperMRs overlapping protein coding genes stratified by gene sizes rather stable, such as kinesins family, to quite flexible, such as the ERF tran- scription factor family and the monolignol biosynthesis family. While certain gene families have relative specific roles, such as the involvement of the ERF transcrip- tion factor family in plant defense, most gene families are involved in multiple biological roles. It is not clear what evolutionary advantages DNA methylation variation may contribute. Remarks on selection of HyperMRs: DNA methylation shows extensive variation in natural populations. However how to test selection pressure on methylation variation selection remains unsolved. Population genetics tests are built on the 81 monolignol_biosynthesis ERF_transcription_factor MIKC_MADS_transcription_factor P450 bHLH_transcription_factor cytoplasmic_ribosomal class_III_peroxidase NAC_transcription_factor WRKY_transcription_factor glutathione B3_transcription_factor carbohydrate_esterase nodulin_like MYB_transcription_factor subtilisin_like_serine_proteases glycoside_hydrolase organic_solute_cotransporter C2H2_transcription_factor F_box ion_channel NBS_LRR_active_TNL inorganic_solute_cotransporter monosaccharide_transporter_like Antiporters M_type_MADS_transcription_factor MYB_related_transcription_factor NBS_LRR_active_CNL RLK PP2C_phosphatase glycosyltransferase SNARE_and_SNARE_interacting BTB bZIP_transcription_factor EF_hand DEFL_superfamily ABC_superfamily AtRINGS C3H_transcription_factor U_box eukaryotic_initiation_factor MAPKKK primary_pumps_ATPases kinesins 0.0 0.2 0.4 0.6 0.8 1.0 48 48 25 179 96 131 57 75 41 28 45 69 46 83 51 314 202 50 578 57 79 73 45 66 51 48 44 572 67 270 60 74 53 145 178 123 367 39 41 83 68 72 58 Figure 5.16: Frequency spectrum of HyperMRs overlapping protein coding genes stratified by gene families assumptionofinfinite-sitemutationmodel,whicharenotreadilysatisfiedinstudy- ing selection of HyperMRs. Methylation level is continuous while a SNP site just involves two alleles. Methylation changes in term of HyperMRs while SNP follows a point mutation model. Furthermore methylation changes can go back and forth in a short time scale (Becker et al., 2011). 82 5.5 DNA methylation and transcription factor binding sites The working model in this thesis for the function of DNA methylation is that whenDNAmethylationimpactsgeneexpression,itdoessobyregulatingtranscript factor binding. This model may draw criticism from several perspectives. First, Schubeler (2012) argued against a regulatory role for methylation on enhancer activities, based on the experimental observation that transcription factor binding occurs before methylation changes (Stadler et al., 2011). It is not known what fraction of TF binding can be directly blocked by a methyl group on a cytosine in their binding sites. However, the presence of methylation likely is associated with an unfavorable local environmental for TF binding for other reasons, and therefore in many cases indicates repression of spurious enhancer activation. The experiments of Stadler et al. (2011) were done in vivo, which may have other compensatory mechanisms that remove the methylation (evidenced by presence of 5hmC, intermediate products of active demethylation), when the binding of tran- scription factors to certain enhancers becomes necessary for cell dierentiation. In apreviousexperimentinvitro,probesequencesmethylatedatallcytosinesinhibits the binding of CTCF transcription factor (Hark et al., 2000), which supports the regulatory eects of methylation on transcription factor binding in the absence of compensatory forces. Likely some subset of DNA binding proteins has the capacity to remodel the methylome, while others have behavior that is completely constrained by methylation. 83 Second, the target sequence elements of certain transcription factors may not contain CpG dinucleotides. Does methylation still have any eect on transcrip- tion factor binding in that case? DNA methylation may exert its eect on tran- scription factor binding by directly blocking transcription factor binding. Alter- natively, molecular dynamics simulation studies showed that the addition of the methyl group makes the DNA helix more rigid and changes the width of the major grove, which aects the specificity of DNA binding proteins (Remo Rohs, per- sonal communication). In addition, the crosstalk between DNA methylation and repressive histone modifications causes local formation of heterochromatin (Cedar & Bergman, 2009). These observations suggest that methylation alters local chro- matin environment and indirectly aect the accessibility of the underlying DNA sequences, which we think may be the primary way of methylation exerting eect on transcription factor binding. 84 Chapter 6 Co-methylation network analysis Most biological processes involve orchestrated activities of multiple regulatory elements, which are accompanied by corresponding changes in epigenomic methy- lation levels at the associated loci. Co-methylation network analysis uses cor- related methylation status through pairs of genomic intervals as an indicator of shared functional context. For example, the promoters of multiple genes involved in certain pathway are likely to lose or gain methylation in a correlated manner across cell types or conditions, assuming any variance is observed in their methy- lation levels. Co-methylation network analysis has many aspects similar to gene co-expressionnetworkanalysisandcanmakeuseofexistinganalysismethods(Hor- vath,2011;Huangetal.,2010). Additionally, DNAmethylationprofilingdatagive digitized measure of methylation of individual cytosines in a genome-wide scale, which allows one to explore co-regulatory network beyond known genomic annota- tions. Here I report two proof-of-concept applications of co-methylation network analysis: onedealingwithfunctionallyrelatedgenesandtheotherillustratinghow co-methylated exons coincide with co-spliced exon modules. 6.1 Procedures of co-methylation network anal- ysis Selecting genomic intervals: Depending on the questions of interest, a variety of genomic intervals may be used to construct co-methylation networks. One may 85 select genomic intervals with reference to known genomic features, such as pro- moterregions,exons,transposableelements,andtranscriptionfactorbindingsites. Inaddition,theunprecedentedadvantagesoftheWGBStechnique,whichprovides digitalized measurement of methylation levels of the whole genome at single-base resolution, enables us to examine correlation between any choice of genomic inter- vals as small as single cytosines, independent of genomic annotations. Filtering intervals with informative variation: After the set of intervals have been defined, we compute the regional methylation levels in each sample. Before com- putingthepairwisecorrelationbetweenthoseintervals,itisadvisabletofilterthose intervals exhibiting certain levels of variation across samples. One obvious filter- ing method is to compute the variances of methylation level across samples, and then discard those regions whose variance falls below certain cuto. An additional useful criterion from our own experiences is to retain only those regions, where certain minimum proportion (for example 5%) of samples are highly methylated and certain minimum proportion of samples are hypo-methylated. Computing correlation matrix: Next the Pearson correlation of methylation levels in each pair of genomic regions is computed. The time complexity of a naive method to compute the correlation matrix is quadratic, as we need to compute the correlation between each interval and all the others. The calculations may be prohibitive if there are a large number of intervals involved. While parellel computing in high performance clusters may be employed, novel algorithms like locality-sensitive hashing are also promising. After the correlation matrix is computed, many existing methods for network analysis may be used (Horvath, 2011). In this pilot study, we extract connected subgraphs after removing edges with low weights, and then perform functional analysis on those subgraphs. 86 6.2 Functional gene clusters In the first proof-of-concept study, we constructed a promoter co-methylation network with Arabidopsis methylome data (Appendix A.3), and were able to iden- tify functional gene clusters related to oxygen and ion binding, plant defense and cell wall structure. We started with promoter regions to build a co-methylation network because methylation in promoter regions, especially between 200bp upstream and 800bp downstream of TSS, shows strong negative correlation with gene expression lev- els (Section 4.3.3). After filtering out intervals with low variation, we computed pairwise Pearson correlation matrix. Next we extracted functional gene clusters by removing low-weight edges. To determinetheappropriatecutoofcorrelationcoecient,weenumeratedthecuto from 0.1 to 0.95. At each cuto, we extracted subgraphs and kept only those subgraphs with more than 5 genes. The relationship between the cuto and the number of subgraphs is given in Figure. 6.1, indicating a phase transition at cuto between 0.3 and 0.4. We used the contingent cuto 0.4, and obtained 107 gene clusters. The methylation levels of genes in one of the identified gene clusters are shown in Figure 6.2. To evaluate the biological relevance of those gene clusters identified from pro- moterco-methylationnetwork,weperformedGeneOntologyanalysisforfunctional enrichment in these clusters (Table 6.1). These gene clusters are enriched in three functional categories, including oxygen binding and ion binding, plant defense and apoptosis,andcellwallstructure. Intermsofstructuralclassification,geneswithin these clusters often contain F-box domains and zinc finger domains. 87 0.2 0.4 0.6 0.8 0 50 100 150 correlation coefficient cutoff No. subgraphs Figure 6.1: Phase transition in promoter co-methylation network: relationship between correlation coecient cuto and the number of subgraphs gene cluster ID terms 2 heme, iron, oxidoreductase 7 oxygen binding, iron, cytochrome 450 14 ATP binding 22 iron binding, oxidoreductase 35 ion binding 80 ion transport, transmembrane 91 zinc finger, ion binding 34 apoptosis, cell death, innate immune response 56 apoptosis, cell death, plant defense 71 response to baterium 104 apoptosis, cell death, defense response 107 apoptosis, cell death, immune response 33 chloroplast, biosynthetic process 55 extensin-like region, cell wall structure 38 phosphate metabolic process Table6.1: GOanalysisofgeneclustersidentifiedfromco-methylationofpromoters are involved in three broad categories: oxygen binding and ion binding, plant defense and apoptosis, and cell wall structure 88 AT4G16880 AT4G16890 AT4G16900 AT4G16915 AT4G16920 AT4G16940 AT4G16950 methylation 0 1.0 methylation 0 1.0 methylation 0 1.0 methylation 0 1.0 Sample indexed by mean methylation Figure 6.2: Promoter methylation levels in a typical gene cluster. X-axis: index of samples ordered by mean methylation level; Y-axis: methylation level 6.3 Co-spliced exon modules In the second proof-of-concept study, we constructed an exon co-methylation network using 35 human methylomes (Figure 6.3). The potential involvement of epigenetic mechanisms in regulating alternative splicing has been gaining more interest recently (Luco et al., 2011). The histone modification H3K36me3 is enriched in exons relative to introns, and found at lower levels in alternatively 89 spliced exons (Kolasinska-Zwierz et al., 2009). Hodges et al. (2009) observed that exons show higher methylation level than introns. These studies suggest DNA methylation may, directly or via interaction with histone modification, be involved in alternatively splicing regulation. By comparing co-methylation network for exons with exon co-splicing patterns, we observe that co-spliced exons tend to be co-methylated. For example, in human gene SAMD11, exon 8-11 show a highly correlatedmethylationpatternsacrossmethylomes,andthoseco-methylatedexons tend to co-occur in the same transcript isoforms. (Figure 6.4). In this example we simplyusedtranscriptisoformsannotation. Inthefuture, onemayinferco-spliced exon modules directly from RNA-seq data (Dai et al., 2012), and compare with the exon co-methylation network. 90 FFiPSC1911BMP4 H9ESCLaurent2010 H1ESC H9ESCLister2011 ADSiPSC IMR90iPSCLister2011 FFiPSC1911 FFiPSC197 FFiPSC69 BCell PBMC Neut CD133 HSPC ADSAdipose ADS FESLaurent2010 IMR90Lister2009 NHFFLaurent2010 FF ColonCancer ColonNormal BronchialAirway SmallAirway GM128782 GM128781 GM12878 1650Lung 1650TGFB12dOff16d 1650TGFB4d 1650TGFB12d 1650TGFB8d Calu1Lung 441NSCLC M3Lung Methylomes f hylation study Figure 6.3: 35 human methylomes used to construct exon co-methylation network 91 h work (SAMD11) Exon: 1 Exon: 2 Exon: 3 Exon: 4 Exon: 5 Exon: 6 Exon: 7 Exon: 8 Exon: 9 Exon: 10 Exon: 11 Exon: 12 Exon: 13 Exon: 14 Figure 6.4: Co-spliced exons tend to be co-methylated. Top: heatmap of cor- relation matrix of exon methylation; Below: transcript isoforms of SAMD11 are given 92 Chapter 7 Conclusions Just as next-generation sequencing technologies have been revolutionizing bio- logical research in general, whole genome bisulfite sequencing technique (WGBS) has been revolutionizing the way we think about and carry out studies of DNA methylation. Ithasbeenveryfortunateformetobeabletoworkwiththestate-of- artexperimentaltechnique. Thisdissertationhaspresentedmyworkinaddressing theanalyticalchallengesfromWGBSdataandthebiologicalinsightslearnedfrom the application of those analytical methods. Unprecedented data call for new analytic methods: The magnitude and novelty of WGBS data calls for new analytic methods. The basic processing methods describedinChapter3translatetherawreadstomeaningfulinformationofmethy- lation levels of individual cytosines through wildcard-based read mapping and cor- rection of potential issues inherent in WGBS data. Next, we have elaborated on the concept of epigenomic domains marked by characteristic methylation status, and described methods to identify HMRs in mammalian genomes (Section 4.1), and HyperMRs in plant genomes (Section 4.3.1). In addition, we have systemati- cally treated the topic of comparative methylome studies, and described analytic methods to identify DMRs in three common experimental settings (Chapter 5). New analytic methods lead to novel biological insights: WGBS technology gal- vanizes DNA methylation researchers because the measurement is genome-wide and at single-case resolution. How has our understanding of DNA methylation been deepened with the application of this technique? First, we have found a 93 substantial number of HMRs and DMRs beyond annotated CpG islands and pro- moters, traditional focus of DNA methylation researchers (Section 4.2 and 5.1.2). HMRs and DMRs have been associated with functions in gene expression regula- tion and genome organization, possibly via aecting transcription factor binding (Section 4.2 and 5.1.2). Second, taking advantage of the single-base resolution of WGBS, we were able to determine the boundaries of HMRs and DMRs quite precisely (Section 4.2 and 5.1). The accurate identification of HMR boundaries allows us to discover that the location and widths of germ line HMRs can define the ultimate boundaries of somatic HMRs (Section 4.2.3). The single-base resolu- tion data allow us to examine sometimes quite subtle methylation changes, such as expansion or shrinkage of HMR boundaries, during cell dierentiation (Sec- tion 5.1.2). The most unexpected discovery using single-base resolution data is the classification of simple and composite HyperMRs, which have distinct eects on expression regulation in plants (Section 4.3.3). Finally, two proof-of-concept studies of co-methylation network showed that functionally related genes tend to be co-methylated in their promoter regions, and the co-spliced exon modules coincided with co-methylated exons. These preliminary studies demonstrated the co-methylation analysis as a promosing framwork to explore co-regulated genomic elements (Chapter 6). We believe that the biological discoveries and analytic methods presented in this dissertation should be able to motivate and facilitate further research in the field of epigenetics. There are still lots of exciting questions to pursue. In the short term, one may adapt hidden Markov models to identify DMRs in the case of case-control studies and unsupervised DMR identification. Our meth- odsforthosetwocases,aswellasmostotherstudies,arebasedonslidingwindows. 94 Might this strategy lose subtle epigenetic signals in small scale? Can more sophis- ticated kernel smoothing methods be used to improve their performance? We have discussed the idea that treats the observations at individual CpG sites as non-uniform sampling events at dierent locations from a continuous epigenomic regions. Continuous-time HMM may be worthwhile exploring for modeling that process. In the next few years, more and more methylomes will be generated, including reference methylomes of multiple cell types. We would expect the fruitful applica- tionoftheDMRs-findingmethodsandtheco-methylationnetworkanalysisframe- work described in this dissertation. In particular, the co-methylation network can beextendedtoelucidateco-regulationnetworksofmoregeneralgenomicelements, including distal regulatory elements and genomic regions that have not been anno- tated. Thanks to the high resolution of WGBS data, the co-methylation network may provide useful complementary information to the proximity-based network derived from Hi-C. 95 Reference List AlthamP(1969) ExactBayesiananalysisofa2◊ 2contingencytable, andFisher’s "exact" significance test. Journal of the Royal Statistical Society. Series B (Methodological) 31:261–269. Baum LE, Petrie T, Soules G, Weiss N (1970) A Maximization Technique Occur- ringintheStatisticalAnalysisofProbabilisticFunctionsofMarkovChains. The Annals of Mathematical Statistics 41:164–171. Baylin SB, Jones PA (2011) A decade of exploring the cancer epigenome â biological and translational implications. Nature Reviews Cancer 11:726–734. Becker C, Hagmann J, MÃller J, Koenig D, Stegle O, Borgwardt K, Weigel D (2011) Spontaneous epigenetic variation in the arabidopsis thaliana methylome. Nature 480:245–249. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practi- cal and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological) 57:289–300 ArticleType: primary_article / Full publication date: 1995 / Copyright  1995 Royal Statistical Society. Berman BP, Weisenberger DJ, Aman JF, Hinoue T, Ramjan Z, Liu Y, Noushmehr H, Lange CPE, Dijk CMv, Tollenaar RAEM, Berg DVD, Laird PW (2012) Regions of focal DNA hypermethylation and long-range hypomethylation in col- orectal cancer coincide with nuclear lamina-associated domains. Nature Genet- ics 44:40–46. Bhutani N, Burns D, Blau H (2011) DNA demethylation dynamics. Cell 146:866–872. Booth MJ, Branco MR, Ficz G, Oxley D, Krueger F, Reik W, Balasub- ramanian S (2012) Quantitative sequencing of 5-methylcytosine and 5- hydroxymethylcytosine at single-base resolution. Science 336:934–937. Borgel J, Guibert S, Li Y, Chiba H, SchÃbeler D, Sasaki H, Fornà T, Weber M (2010) Targets and dynamics of promoter DNA methylation during early mouse development. Nature Genetics 42:1093–1100. 96 Brandeis M, Frank D, Keshet I, Siegfried Z, Mendelsohn M, Names A, Temper V, Razin A, Cedar H (1994) Spl elements protect a CpG island from de novo methylation. Nature 371:435–438. Cedar H, Bergman Y (2009) Linking DNA methylation and histone modification: patterns and paradigms. Nature Reviews Genetics 10:295–304. Cokus SJ, Feng S, Zhang X, Chen Z, Merriman B, Haudenschild CD, Prad- han S, Nelson SF, Pellegrini M, Jacobsen SE (2008) Shotgun bisulphite sequencing of the arabidopsis genome reveals DNA methylation patterning. Nature 452:215–219. Dai C, Li W, Liu J, Zhou XJ (2012) Integrating many co-splicing networks to reconstruct splicing regulatory modules. BMC Systems Biology 6:S17 PMID: 23046974. Deng J, Shoemaker R, Xie B, Gore A, LeProust EM, Antosiewicz-Bourget J, Egli D, Maherali N, Park IH, Yu J, Daley GQ, Eggan K, Hochedlinger K, Thom- son J, Wang W, Gao Y, Zhang K (2009) Targeted bisulfite sequencing reveals changes in DNA methylation associated with nuclear reprogramming. Nature Biotechnology 27:353–360. FengS,CokusSJ,ZhangX,ChenP,BostickM,GollMG,HetzelJ,JainJ,Strauss SH, Halpern ME, Ukomadu C, Sadler KC, Pradhan S, Pellegrini M, Jacobsen SE (2010) Conservation and divergence of methylation patterning in plants and animals. Proceedings of the National Academy of Sciences . FosterR,IzawaT,ChuaNH(1994) PlantbZIPproteinsgatheratACGTelements. The FASEB Journal 8:192–200. Gardiner-Garden M, Frommer M (1987) CpG islands in vertebrate genomes. J. Mol. Biol. 196:261–282. Goll MG, Bestor TH (2005) Eukaryotic cytosine methyltransferases. Annual Review of Biochemistry 74:481–514. Gu H, Smith ZD, Bock C, Boyle P, Gnirke A, Meissner A (2011) Preparation of reduced representation bisulfite sequencing libraries for genome-scale DNA methylation profiling. Nature Protocols 6:468–481. Hansen KD, Langmead B, Irizarry RA (2012) BSmooth: from whole genome bisulfite sequencing reads to dierentially methylated regions. Genome Biol- ogy 13:R83 PMID: 23034175. 97 Hansen KD, Timp W, Bravo HC, Sabunciyan S, Langmead B, McDonald OG, Wen B, Wu H, Liu Y, Diep D, Briem E, Zhang K, Irizarry RA, Feinberg AP (2011) Increased methylation variation in epigenetic domains across cancer types. Nature Genetics 43:768–775. Hark AT, Schoenherr CJ, Katz DJ, Ingram RS, Levorse JM, Tilghman SM (2000) CTCF mediates methylation-sensitive enhancer-blocking activity at the H19/Igf2 locus. Nature 405:486–489. Heyn H, Esteller M (2012) DNA methylation profiling in the clinic: applications and challenges. Nature Reviews Genetics 13:679–692. Heyn H, Li N, Ferreira HJ, Moran S, Pisano DG, Gomez A, Diez J, Sanchez-Mut JV, Setien F, Carmona FJ, Puca AA, Sayols S, Pujana MA, Serra-Musach J, Iglesias-Platas I, Formiga F, Fernandez AF, Fraga MF, Heath SC, Valencia A, Gut IG, Wang J, Esteller M (2012) Distinct DNA methylomes of newborns and centenarians. Proceedings of the National Academy of Sciences109:10522–10527. Hodges E, Smith AD, Kendall J, Xuan Z, Ravi K, Rooks M, Zhang MQ, Ye K, Bhattacharjee A, Brizuela L, McCombie WR, Wigler M, Hannon GJ, Hicks JB (2009) High definition profiling of mammalian DNA methylation by array cap- ture and single molecule bisulfite sequencing. Genome Research 19:1593–1605. Hodges E, Molaro A, Santos COD, Thekkat P, Song Q, Uren P, Park J, Butler J, Rafii S, McCombie WR, Smith AD, Hannon GJ (2011) Directional dna methy- lation changes and complex intermediate states accompany lineage specificity in the adult hematopoietic compartment. Cell Mol. Cell . HollidayR,PughJE(1975) Dnamodificationmechanismsandgeneactivityduring development. Science 187:pp. 226–232. Horvath S (2011) Correlation and gene co-expression networks In Weighted Net- work Analysis, pp. 91–121. Springer New York. Hsieh T, Ibarra CA, Silva P, Zemach A, Eshed-Williams L, Fischer RL, Zilber- man D (2009) Genome-Wide demethylation of arabidopsis endosperm. Sci- ence 324:1451–1454. Huang H, Liu CC, Zhou XJ (2010) Bayesian approach to transforming public gene expression repositories into disease diagnosis databases. Proceedings of the National Academy of Sciences of the United States of America 107:6823–6828 PMID: 20360561. Irizarry RA, Ladd-Acosta C, Wen B, Wu Z, Montano C, Onyango P, Cui H, Gabo K,RongioneM,WebsterM,JiH,PotashJB,SabunciyanS,FeinbergAP(2009) 98 The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nature Genetics 41:178–186. Irizarry RA, Wu H, Feinberg AP (2009) A species-generalized probabilistic model- based definition of CpG islands. Mammalian Genome 20:674–680. Jahner D, Jaenisch R (1985) Retrovirus-induced de novo methylation of flanking host sequences correlates with gene inactivity. Nature 315:594–597. JiH,EhrlichLIR,SeitaJ,MurakamiP,DoiA,LindauP,LeeH,AryeeMJ,Irizarry RA, Kim K, Rossi DJ, Inlay MA, Serwold T, Karsunky H, Ho L, Daley GQ, Weissman IL, Feinberg AP (2010) Comprehensive methylome map of lineage commitment from haematopoietic progenitors. Nature 467:338–342. JonesPA,TakaiD(2001) TheroleofDNAmethylationinmammalianepigenetics. Science 293:1068 –1070. Kolasinska-Zwierz P, Down T, Latorre I, Liu T, Liu XS, Ahringer J (2009) Dier- ential chromatin marking of introns and expressed exons by H3K36me3. Nature Genetics 41:376–381. Laird PW (2010) Principles and challenges of genome-wide DNA methylation analysis. Nat Rev Genet 11:191–203. Laurent L, Wong E, Li G, Huynh T, Tsirigos A, Ong CT, Low HM, Sung KWK, RigoutsosI,LoringJ,WeiCL(2010) Dynamicchangesinthehumanmethylome during dierentiation. Genome Research . Law JA, Jacobsen SE (2010) Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nature Reviews Genet- ics 11:204–220. Li Y, Zhu J, Tian G, Li N, Li Q, Ye M, Zheng H, Yu J, Wu H, Sun J, Zhang H, Chen Q, Luo R, Chen M, He Y, Jin X, Zhang Q, Yu C, Zhou G, Sun J, Huang Y, Zheng H, Cao H, Zhou X, Guo S, Hu X, Li X, Kristiansen K, Bolund L, Xu J, Wang W, Yang H, Wang J, Li R, Beck S, Wang J, Zhang X (2010) The DNA methylomeofhumanperipheralbloodmononuclearcells. PLoS Biol8:e1000533. Lindsay H, Adams RL (1996) Spreading of methylation along DNA. Biochemical Journal 320:473–478 PMID: 8973555 PMCID: 1217954. Lister R, Ecker JR (2009) Finding the fifth base: Genome-wide sequencing of cytosine methylation. Genome Research 19:959–966. 99 ListerR,O’MalleyRC,Tonti-FilippiniJ,GregoryBD,BerryCC,MillarAH,Ecker JR (2008) Highly integrated Single-Base resolution maps of the epigenome in arabidopsis. Cell 133:523–536. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo Q, Edsall L, Antosiewicz-Bourget J, Stewart R, Ruotti V, Millar AH, Thomson JA, Ren B, Ecker JR (2009) Human DNA methylomes at base resolution show widespread epigenomic dierences. Nature 462:315–322. Lister R, Pelizzola M, Kida YS, Hawkins RD, Nery JR, Hon G, Antosiewicz- Bourget J, OâèMalley R, Castanon R, Klugman S, Downes M, Yu R, Stew- art R, Ren B, Thomson JA, Evans RM, Ecker JR (2011) Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature 471:68–73. Long Q, Nordborg M (2013) Massive genomic variation and strong selection in swedish arabidopsis thaliana. Nature Submitted 0:0. Luco RF, Allo M, Schor IE, Kornblihtt AR, Misteli T (2011) Epigenetics in alter- native pre-mRNA splicing. Cell 144:16–26. Macleod D, Charlton J, Mullins J, Bird AP (1994) Sp1 sites in the mouse aprt gene promoter are required to prevent methylation of the CpG island. Genes & Development 8:2282 –2292. Meissner A, Mikkelsen TS, Gu H, Wernig M, Hanna J, Sivachenko A, Zhang X, Bernstein BE, Nusbaum C, Jae DB, Gnirke A, Jaenisch R, Lander ES (2008) Genome-scale DNA methylation maps of pluripotent and dierentiated cells. Nature 454:766–770. Molaro A, Hodges E, Fang F, Song Q, McCombie WR, Hannon GJ, Smith AD (2011) Sperm methylation profiles reveal features of epigenetic inheritance and evolution in primates. Cell . Rabiner L (1989) A tutorial on hidden markov models and selected applications inspeech recognition. Proceedings of the IEEE . RiggsAD(1975) Xchromosomeinactivation, dierentiation, anddnamethylation revisited, with a tribute to susumu ohno. Cytogenet Genome Res 99:17–24. Schroeder DI, Lott P, Korf I, LaSalle JM (2011) Large-scale methylation domains mark a functional subset of neuronally expressed genes. Genome Research 21:1583–1591 PMID: 21784875 PMCID: PMC3202276. Schubeler D (2012) Epigenetic islands in a genetic ocean. Science 338:756–757. 100 Schultz MD, Schmitz RJ, Ecker JR (2012) Leveling the playing field for analyses of single-base resolution DNA methylomes. Trends in Genetics 28:583–585. Siegmund KD, Marjoram P, Tavarà S, Shibata D (2011) High DNA methylation pattern intratumoral diversity implies weak selection in many human colorectal cancers. PLoS ONE 6:e21657. Song Q, Smith AD (2011) Identifying dispersed epigenomic domains from ChIP- Seq data. Bioinformatics 27:870–871. Stadler MB, Murr R, Burger L, Ivanek R, Lienert F, SchÃler A, Nimwegen Ev, WirbelauerC,OakeleyEJ,GaidatzisD,TiwariVK,SchÃbelerD(2011) DNA- bindingfactorsshapethemousemethylomeatdistalregulatoryregions. Nature. Steine EJ, Ehrich M, Bell GW, Raj A, Reddy S, van Oudenaarden A, Jaenisch R, Linhart HG (2011) Genes methylated by DNA methyltransferase 3b are similar in mouse intestine and human colon cancer. Journal of Clinical Investi- gation 121:1748–1752. Susan JC, Harrison J, Paul CL, Frommer M (1994) High sensitivity mapping of methylated cytosines. Nucleic Acids Research 22:2990–2997. Suzuki MM, Bird A (2008) DNA methylation landscapes: provocative insights from epigenomics. Nature Reviews Genetics 9:465–476. ThurmanRE,RynesE,HumbertR,VierstraJ,MauranoMT,HaugenE,Sheeld NC, Stergachis AB, Wang H, Vernot B, Garg K, John S, Sandstrom R, Bates D, Boatman L, Canfield TK, Diegel M, Dunn D, Ebersol AK, Frum T, Giste E, Johnson AK, Johnson EM, Kutyavin T, Lajoie B, Lee BK, Lee K, London D, Lotakis D, Neph S, Neri F, Nguyen ED, Qu H, Reynolds AP, Roach V, Safi A, Sanchez ME, Sanyal A, Shafer A, Simon JM, Song L, Vong S, Weaver M, Yan Y, Zhang Z, Zhang Z, Lenhard B, Tewari M, Dorschner MO, Hansen RS, Navas PA, Stamatoyannopoulos G, Iyer VR, Lieb JD, Sunyaev SR, Akey JM, Sabo PJ, Kaul R, Furey TS, Dekker J, Crawford GE, Stamatoyannopoulos JA (2012) The accessible chromatin landscape of the human genome. Nature 489:75–82. Turker MS (2002) Gene silencing in mammalian cells and the spread of DNA methylation. Oncogene 21:5388–5393. Viterbi A (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theor. 13:260–269. Wang RY, Gehrke CW, Ehrlich M (1980) Comparison of bisulfite modifi- cation of 5-methyldeoxycytidine and deoxycytidine residues. Nucleic Acids Research 8:4777–4790 PMID: 7443525 PMCID: PMC324387. 101 Weirauch M, Raney B (2007) Ucsc genome browser: Hmr conserved transcription factor binding sites. WelchBL(1947) TheGeneralizationof‘Student’s’ProblemwhenSeveralDierent Population Variances are Involved. Biometrika 34:28–35. Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics Bul- letin 1:pp. 80–83. Wu SC, Zhang Y (2010) Active DNA demethylation: many roads lead to rome. Nature Reviews Molecular Cell Biology 11:607–620. Yates PA, Burman RW, Mummaneni P, Krussel S, Turker MS (1999) Tandem b1 elements located in a mouse methylation center provide a target for de novo DNA methylation. Journal of Biological Chemistry 274:36357 –36361. Yu M, Hon G, Szulwach K, Song CX, Zhang L, Kim A, Li X, Dai Q, Shen Y, Park B, Min JH, Jin P, Ren B, He C (2012) Base-resolution analysis of 5- hydroxymethylcytosine in the mammalian genome. Cell 149:1368–1380. Zemach A, McDaniel IE, Silva P, Zilberman D (2010) Genome-Wide evolutionary analysis of eukaryotic DNA methylation. Science 328:916–919. Zeng J, Konopka G, Hunt BG, Preuss TM, Geschwind D, Yi SV (2012) Divergent whole-genome methylation maps of human and chimpanzee brains reveal epi- genetic basis of human regulatory evolution. The American Journal of Human Genetics 91:455–465. 102 Appendix A Appendix A.1 Bisulfite-specific deadzones By “deadzones”, we refer to contiguous sets of locations to which no read can map uniquely, since when two locations in the genome have identical sequences of lengthgreaterthanorequaltothereadlength,anyreadsderivedfromoneofthose locations will necessarily be ambiguous and therefore be discarded. The procedure to identify bisulfite-specific deadzones is similar to that of identifying deadzones from ChIP-seq (Song & Smith, 2011), except that we convert all cytosines in the genome sequences to thymines considering the eect of bisulfite conversion. We identifyed bisulfite deadzones for read lengths ranging between 30bp and 120bp for human genome and the Arabidopsis genome respectively. From the set of deadzones, those mappable CpG sites are derived. The estimated proportion of mappable CpGs with respect to dierent read lengths is given in Figure A.1. 103 40 60 80 120 0.80 0.85 0.90 0.95 1.00 Arabidopsis thaliana (TAIR10) read length prop. of mappable CpGs 40 60 80 120 0.80 0.85 0.90 0.95 1.00 Human (hg18) read length prop. of mappable CpGs Figure A.1: Theorectical estimation of the proportion of mappable CpG sites with deadzone analysis 104 A.2 Estimaing parameters of the beta-binomial distribution From a WGBS experiment, we observed a sequence of read counts, X = {(m i ,u i )}, where m i and u i denote the number of methylated and unmethylated reads at the i th CpG site respectively, and i ranges from 1 to n. We use the beta- binomial distribution to model the number of methylated reads m i at each site, whose density function is: Pr(m i |–,—,m i +u i )= A m i +u i m i B B(m i +–,u i +— )/B(–,— ), where B denotes the beta function. To estimate the parameters – and — from the sequence of observations, we worked with the methylation frequencies f i = m i /(m i +u i ) instead. We estimate the parameters as though they were for a Beta distribution, and therefore satisfy  (ˆ – )≠  (ˆ – + ˆ — )= 1 n n ÿ i=1 log(ˆ p i ) and  ( ˆ — )≠  (ˆ – + ˆ — )= 1 n n ÿ i=1 log(1≠ ˆ p i ) where the digamma function:  (x)= d dx log( x). We use an iterative procedure to compute ˆ – and ˆ — . The parameters are initiliazed as ˆ – (0) =  ≠ 1 3 1 n q n i=1 log(ˆ p i ) 4 and ˆ — (0) =  ≠ 1 3 1 n q n i=1 log(1≠ ˆ p i ) 4 . Thisinitializationcorrespondsroughlytotheassumptionof– +— =1,as (1) = 0. 105 At each iteration, these estimates are updated using the formula: ˆ – (k) =  ≠ 1 3 1 n q n i=1 log(ˆ p i )+ (ˆ – (k≠ 1) + ˆ — (k≠ 1) ) 4 and ˆ — (k) =  ≠ 1 3 1 n q n i=1 log(1≠ ˆ p i )+ 1 ˆ – (k≠ 1) + ˆ — (k≠ 1) 2 4 . The inverse of the digamma ( ) function can be calculated very easily by noting that  ≠ 1 (x)= e x + ‘,for 0 Æ ‘ Æ 1 for any relevant values of x. We use a bisection search around e x to evaluate  ≠ 1 , and apply the iterative procedure until convergence criteria are satisfied. A.3 Methylomes of Arabidopsis thaliana eco- types in Northern Europe Some of the analytical mothods preseneted in this Thesis (Section 4.3, 5.2, 5.3 and 5.4) have been motivated and applied to a dataset of Arabidopsis thaliana methylomes, generated by Dr. Marie-Stanislas Remigereau and Pei Zhang in Dr. Magnus Nordborg’s lab, which have not been published at this time point. Here I give a brief description of this dataset. The 208 Arabidopsis thaliana ecotypes were collected in dierent locations across Sweden, and were grown both at 10 ¶ C and 16 ¶ C in environmentally con- trolled chamber conditions. All samples were collected from the full rosette when theplantsattainedthesame8-leafstageofdevelopment(9thleafemerging). Bisul- fite sequencing are performed with 91bp paired-end reads. The generated bisulfite reads were analyzed as described in Chapter 3. Median coverage at CpG sites was 10x, while the CHG and CHH contexts were covered an average of 6x-fold. 106 A.4 An approximate algorithm to speed up variable-distribution HMM Inourmethodstoidentifyhyper-methylatedregionsintheArabidopsisgenome (Section 4.3.1) and to identify deferentially methylated regions between two sam- ples (Section 5.1.1), we used variable-duration HMM (VDHMM) to allow more flexibility in modeling region lengths. However the time complexity for training and decoding of VDHMM is quadratic, which becomes impractical for genome- wide analysis. To deal with this issue, we adopted two implementation tricks. First, we may impose an upper bound on the length of HyperMRs and DMRs, say 100 CpG sites, which is a reasonable assumption for most applications. However in certain applications, such as the identification of large-scale methy- lation pattern during carcinogenesis, the limitation on the domain length seems too stringent. Recall in Section 4.3.1, in the three-state VDHMM for identifying HyperMRs, theforwardscore– i (q), whichisthelikelihoodoftheobservationsX 1:i with the i th observation being the end of a segment in hidden state q, is computed with following formula: – i (q)= i≠ 1 ÿ j=1 ÿ s”=q – j (s)a sq d q (i≠ j) i Ÿ k=j+1 b q (X k ). (A.1) Wedefineachangepointasalocationwherethehiddenstatesequencechanges from one state. The outermost summation in the above formula essentially enu- meratesallpossiblechangepointsbeforecurrentobservationi, whichisveryine- cient since most locations within a region contribute little to the summation. The basic idea of our approximate algorithm is to keep track of a list of locations in their decreasing likelihood being a change point, compute the contribution from 107 those likely change points first, and then stop the summation if the contribution of remaining locations become negilible. To be more specifically, let T be a prior- ity queue whose elements are tuples like (p,j), where j denote the location and p denotestheprobabilityofj beingachangepoint. T Õ isanotherpriorityholdingthe list of possible change points for next round. The Algorithm 1 gives the detailed forward algorithm, whose complexity is roughly linear given the transition from one state to another state occur in a short region. Algorithm 1 Approximate forward algorithm for VDHMM for i=1æ n do {# (i≠ 1) is a new location, always with high priority } – q (i)=0, Push (Œ ,i≠ 1) to T, Set T Õ to empty while T is not empty do {# Choose next most probable change point j} Let j be location value of the top element of T pop T {# Update problity j being a change point} p = q s”=q – j (s)a sq r i k=j+1 b q (X k )d q (DØ i≠ j) Push (p,j) to T Õ {# Add contribution of j being a change point} l = q s”=q – j (s)a sq r i k=j+1 b q (X k )d q (i≠ j) – i (q) += l {# Exit summation if j’s contribution is negilible} if l/– i (q)<‘ then break end if end while T = T Õ end for A.5 List of datasets used in this study 108 Sample No. HMR HMR median Size References BCell 54869 663 Hodges et al. (2011) BronchialAirway 83370 784 Unpublished ı CD133 52931 588 Hodges et al. (2011) CD4T 100yr 45702 709 Heyn et al. (2012) CD4T Newborn 60029 550 Heyn et al. (2012) ColoMucosa 39705 899 Hansen et al. (2011) ColonNormal 60259 926 Berman et al. (2012) Cortex 31228 853 Schroeder et al. (2011) HSPC 63474 619 Hodges et al. (2011) Neut 71815 573 Hodges et al. (2011) PBMC 40532 767 Li et al. (2010) SmallAirway 81471 861 Unpublished ı Sperm 83994 1329 Molaro et al. (2011) 1650Lung 89901 4261 Unpublished ı 1650TGFB12d 89025 4466 Unpublished ı 1650TGFB12dO16d 92903 4134 Unpublished ı 1650TGFB4d 87935 4601 Unpublished ı 1650TGFB8d 88505 4556 Unpublished ı 441NSCLC 73158 3513 Unpublished ı AdenoPolyp 33747 2121 Hansen et al. (2011) Calu1Lung 97476 7324 Unpublished ı ColoCancer 39605 4060 Hansen et al. (2011) ColonCancer 89045 3606 Berman et al. (2012) M3Lung 96239 6363 Unpublished ı PreFrontCortex 42325 749 Zeng et al. (2012) ADS 79962 1478.5 Lister et al. (2011) ADSAdipose 81124 1326 Lister et al. (2011) ADSiPSC 45905 782 Lister et al. (2011) ColonInduced 10897 836 Steine et al. (2011) FESLaurent2010 54736 2377 Laurent et al. (2010) FF 85680 5179 Lister et al. (2011) FFiPSC69 38430 773 Lister et al. (2011) H1ESC 37870 771 Lister et al. (2009) H9ESCLaurent2010 43066 683 Laurent et al. (2010) H9ESCLister2011 39965 699 Lister et al. (2011) IMR90iPSCLister2011 39445 728 Lister et al. (2011) IMR90Lister2009 90636 5406 Lister et al. (2009) NHFFLaurent2010 65688 4223.5 Laurent et al. (2010) SHSY5Y 66160 3046 Schroeder et al. (2011) TableA.1: ListofcelltypesusedinanalysisofHMRnumberandsizes. ı Dr. Fred Rollins generous allowed us to use the methylomes of lung caners in this analysis. We greatly appreciated that 109
Abstract (if available)
Abstract
Whole genome bisulfite sequencing is a powerful technique for profiling methylation patterns, which provides measurement of methylation levels of individual cytosine sites across the genome. This dissertation presents the computational methods to analyze whole genome bisulfite sequencing data. Besides mapping bisulfite converted reads and estimating methylation levels of individual cytosines, this dissertation particularly studies epigenomic domains marked by characteristics methylation status. The biological insights from the application of the computational methods to the analysis of mammalian methylomes and plant methylomes are also reported.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Efficient algorithms to map whole genome bisulfite sequencing reads
PDF
Identifying allele-specific DNA methylation in mammalian genomes
PDF
Comparative analysis of DNA methylation in mammals
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Identification and analysis of shared epigenetic changes in extraembryonic development and tumorigenesis
PDF
Understanding protein–DNA recognition in the context of DNA methylation
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Fast search and clustering for large-scale next generation sequencing data
PDF
Understanding DNA methylation and nucleosome organization in cancer cells using single molecule sequencing
PDF
Big data analytics in metagenomics: integration, representation, management, and visualization
PDF
Developing a robust single cell whole genome bisulfite sequencing protocol to analyse circulating tumor cells
PDF
Understanding the characteristic of single nucleotide variants
PDF
A generalization of the accumulation curve in species sampling and its applications to high‐throughput sequencing
PDF
Mapping epigenetic and epistatic components of heritability in natural population
PDF
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
PDF
Sharpening the edge of tools for microbial diversity analysis
PDF
Integrating high-throughput sequencing data to study gene regulation
PDF
Functional DNA methylation changes in normal and cancer cells
Asset Metadata
Creator
Song, Qiang
(author)
Core Title
Whole genome bisulfite sequencing: analytical methods and biological insights
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
04/01/2013
Defense Date
03/11/2013
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
differentiation,epigenetics,hidden Markov model,methylation,OAI-PMH Harvest,sequencing,Statistics
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Smith, Andrew D. (
committee chair
), Ren, Bing (
committee member
), Sha, Fei (
committee member
), Waterman, Michael S. (
committee member
)
Creator Email
chiangsong@gmail.com,qiang.song@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-229360
Unique identifier
UC11294517
Identifier
usctheses-c3-229360 (legacy record id)
Legacy Identifier
etd-SongQiang-1498.pdf
Dmrecord
229360
Document Type
Dissertation
Rights
Song, Qiang
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
differentiation
epigenetics
hidden Markov model
methylation
sequencing