Identifying allele-specific DNA methylation in mammalian genomes
IDENTIFYING ALLELE-SPECIFIC DNA METHYLATION IN MAMMALIAN GENOMES
by Fang Fang
A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
December 2012
Copyright 2012 Fang Fang

To my parents and especially to my husband Huanying Ge

Acknowledgments

I would like to first express my sincere gratitude to my advisor, Professor Andrew D Smith, for his continuous guidance and encouragement throughout my research and professional development. This thesis would not have been possible without his support, patience, enthusiasm and immense knowledge.
I also gratefully acknowledge my committee members, Professor Simon Tavaré, Professor Matteo Pellegrini and Professor Joseph G Hacia, for their valuable suggestions and generous help on this work. My thanks also go to Professor Michael S Waterman for introducing me to bioinformatics research, and for his patient teaching, generous help and insightful comments.
I would also like to take the opportunity to sincerely thank my labmates: Dr. Philip J Uren, Qiang Song, Timothy Daley, Emad Bahrami Samani, Jianghan Qu, Natalia Rodchenko, Tyler Garvin, Ehsan Behnam, Meng Zhou, Ben Decato, Elizabeth Hong and Muhammad Ali Amer, for their stimulating discussions, for their time and help reading my manuscripts, and for all the enjoyable time we have had together during the past years. I am also grateful to many colleagues in the PhD program of Molecular and Computational Biology for providing wise advice and helpful discussions.
Lastly, I am greatly indebted to my husband Huanying Ge for all the understanding and support he has provided throughout my study and life. I also wish to thank my parents and in-laws for their selfless help as I pursue an academic career.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Allele-specific DNA methylation and genomic imprinting
2.1 Fundamentals of DNA methylation in mammals
2.2 Experimental methods to profile DNA methylation
2.3 Allele-specific DNA methylation
2.4 Genomic imprinting
2.5 Approaches for identifying imprinted genes
2.5.1 Screening for allele-specific gene expression
2.5.2 Predicting imprinted genes with genome sequence features
2.5.3 Screening for allele-specific methylation
2.5.4 Leveraging high-throughput sequencing technologies
2.6 Summary
Chapter 3: Modeling allele-specific DNA methylation
3.1 A simple method for scoring and visualizing ASM
3.2 A probabilistic model for allele-specific DNA methylation
3.2.1 Modeling DNA methylation on a single allele
3.2.2 Modeling DNA methylation on two alleles
3.2.3 Identifying allele-specific DNA methylation by model comparison
3.3 Practical issues in genome-scale applications
3.4 An algorithmic approach to optimize ASM boundaries
3.5 Summary
Chapter 4: Simulation
4.1 The concept of semi-simulated BS-seq data
4.2 Model validation for fixed intervals of ASM
4.3 Estimating false-discovery rate using semi-simulated data
Chapter 5: The genomic landscape of human allele-specific methylation
5.1 Motivation for genome-wide computational analyses
5.2 Technical and biological characteristics of methylomes analyzed
5.3 Allele-specific methylation on the X chromosome, a sanity check
5.4 Genome-wide AMR identification predicts imprinted genes
5.5 Analysis of known imprinting control regions
5.6 Allele-specific methylation and epigenomic reprogramming in iPSCs
5.7 The landscape of allele-specific methylation in human
Chapter 6: The evolution of allele-specific methylation: comparing human and mouse
6.1 Theoretical perspectives relating allele-specific methylation and evolution
6.2 Motivation for comparing allele-specific methylation between human and mouse
6.3 Identification of allele-specific methylation in human and mouse
6.4 Conserved and divergent allele-specific methylation genome-wide
6.5 Allele-specific methylation at shared and species-specific imprinted genes
6.6 What have we learned about the evolution of imprinting?
Chapter 7: Conclusions
Bibliography

List of Tables

3.1 Comparison of model selection methods
4.1 False-positive rate estimated for AMR identification
5.1 Characteristics of human BS-seq data sets
5.2 AMRs common to methylomes of uncultured human cells
5.3 Imprinted clusters and associated AMRs in human cells
6.1 Data characteristics for the mouse methylomes analyzed
6.2 Conserved AMRs between human and mouse
6.3 Summary of ASM associated imprinted genes in human and mouse

List of Figures

2.1 DNA methylation in mammals
2.2 Shotgun bisulfite sequencing (BS-seq)
2.3 Methylation level defined in BS-seq data
2.4 An example for detecting allele-specific methylation
2.5 The H19/Igf2 imprinted cluster
2.6 The Igf2r/Air imprinted cluster
2.7 Reciprocal cross breeding experiments
3.1 Calculating allelic scores based on contingency table
3.2 An example of allelic score at the GNAS locus
3.3 Modeling site-specific DNA methylation
3.4 Merging nearby AMRs
3.5 Flow chart for AMR identification
4.1 Sensitivity of AMR identification
5.1 Identified AMRs on chrX
5.2 Allele-specific methylation identified at the XIST promoter
5.3 Refined AMRs near MEG3
5.4 Regions of allele-specific methylation
5.5 Clustering of human methylomes according to AMRs
5.6 Examples for iPSC reprogramming of AMRs
6.1 Locations of the two conserved AMRs without reported imprinting between human and mouse
6.2 Conserved AMR and imprinting but overall epigenomic divergence
6.3 ASM at the CDKN1C locus in human and mouse
6.4 ASM at the COMMD1 locus in human and mouse
6.5 Mouse-specific ASM around the Igf2r gene

Abstract

Among the most well-known functions of DNA methylation is mediating imprinted gene expression by differentially marking specific regulatory regions on maternal and paternal alleles. Imprinted genes are expressed from one of the two parental alleles in mammals, thereby rendering the organism functionally haploid. Imprinting has been tied to the evolution of placental mammals and defects in imprinting have been associated with human diseases. Although recent advances in genome sequencing have revolutionized the study of DNA methylation, existing methylome data remains largely untapped in the study of imprinting. We present a novel statistical model to describe allele-specific methylation (ASM) in data from high-throughput short-read bisulfite sequencing.
Simulation results indicate that the technical specifications of existing methylome data, such as read length and coverage, are sufficient for full-genome ASM profiling based on our model. Because our method is independent of genotype, it can be applied to identify ASM in the context of genomic imprinting. We used our model to analyze methylomes for a diverse set of human cell types, including cultured and uncultured differentiated cells, embryonic stem cells (ESCs) and induced pluripotent stem cells (iPSCs). Regions of ASM identified most consistently across methylomes are tightly connected with known imprinted genes and precisely delineate the boundaries of several known imprinting control regions. Novel predicted regions of ASM common to multiple cell types frequently mark ncRNA promoters and represent promising starting points for targeted validation. We also compared regions of ASM between uncultured mouse and human cells. Regions with both conserved sequence and conserved ASM status between species show high concordance with known imprinted genes, adding further evidence for the novel prediction of imprinted genes. The skewing of ASM associated imprinted genes in mouse agrees with the parental conflict theory, which hypothesizes that the evolution of genomic imprinting is driven by the differing interests of parental genes in offspring growth. Furthermore, the variation of ASM between species shows that ASM set in the parental germlines plays a more critical role in regulating imprinting than ASM set in somatic cells.
More generally, our model provides the analytical complement to cutting-edge experimental technologies for surveying ASM in specific cell types and across species.

Chapter 1
Introduction

In mammals, a small set of genes is subject to an unusual form of regulation in which they are expressed according to parent-of-origin. This phenomenon is termed genomic imprinting.
These imprinted genes are preferentially expressed from either the maternal or paternal allele regardless of genotype. Because imprinted genes are effectively haploid, and have given up the robustness to deleterious mutations that diploidy provides, any perturbation to their expression easily leads to disease. Aberrant genomic imprinting leads to many well-known imprinting disorders (e.g. Prader-Willi, Angelman, and Beckwith-Wiedemann syndromes).
Although imprinted genes are critical for normal development, the experimental identification of additional imprinted genes has been sporadic since the phenomenon was originally described. Since the first imprinted genes (Igf2r and H19) were identified in 1991, fewer than 100 imprinted genes have been identified. One possible reason is that many of these genes might only be expressed in a tissue-specific manner or during a specific developmental stage. Imprinted genes might also be expressed widely, but could have their imprinted regulation restricted to a particular context.
Even before the mechanisms of imprinting were understood, epigenetics was believed to play an essential role, as some means is required to distinguish the origin of two alleles that may be identical at the DNA sequence level. As a well understood epigenetic modification in mammals, DNA methylation was associated with imprinting early on. Differential DNA methylation marks specific regulatory regions on maternal and paternal alleles. Allele-specific DNA methylation (ASM) coordinating genomic imprinting should logically be acquired during gametogenesis and retained during development. This reasoning has been repeatedly validated, despite our knowledge that the DNA methylation of both paternal and maternal chromosomes is almost completely reprogrammed shortly after fertilization.
A major premise of this thesis is that understanding ASM is a highly effective route to understanding imprinting, how imprinting is established, and how imprinted gene regulation programs are carried out in somatic developmental lineages. Therefore, the precise locations and common features of regions of ASM involved in imprinting must be elucidated. However, although technological advances in DNA methylation profiling (e.g. high-throughput short-read bisulfite sequencing, BS-seq) have achieved single base-pair resolution, it is still not feasible to directly query ASM.
In this thesis, we develop a probabilistic model to describe ASM based on data from BS-seq experiments. The model describes the extent to which the methylation states observed in reads appear to represent two different allele patterns. Based on the model, we can determine whether or not any given genomic interval has ASM. This method provides an essential analytical complement to recently emerged experimental methods for understanding the role of DNA methylation in allele-specific regulation, especially in genomic imprinting.

Summary and outline of the thesis: This thesis mainly describes a computational method to detect ASM based on BS-seq data. Because this method is independent of genotype, it has broad applicability to identify ASM in the context of genomic imprinting. The application of the method to human and mouse methylomes confirms the accuracy of ASM identification at known imprinted genes and also provides predictions of new imprinted genes. Furthermore, the same pipeline is used to compare ASM in genomic imprinting between species, giving an evolutionary overview of ASM associated imprinted genes.
The thesis is organized as follows. Chapter 2 provides an overview of the background of the research and also a review of previous methods for ASM and imprinted gene identification.
Chapter 3 describes the probabilistic models and computational method we have developed to detect ASM from BS-seq data. Chapter 4 presents the results from simulation, giving an estimate of the performance of the method. Chapter 5 describes an application of the method to human methylomes. The regions of ASM consistently identified across methylomes are highly concordant with known imprinted genes, validating the effectiveness of our method and also indicating that the remaining common regions of ASM are likely candidates for imprinting. Chapter 6 compares ASM between human and mouse, reflecting the conservation and development of genomic imprinting in evolution. The final chapter gives a summary of the research and a discussion of future work.

Chapter 2
Allele-specific DNA methylation and genomic imprinting

2.1 Fundamentals of DNA methylation in mammals

DNA methylation refers to the enzymatic addition of a methyl group to the fifth carbon of the base cytosine. It is a major type of epigenetic modification, a term that refers to heritable changes in gene expression without changes to the DNA sequence. Such epigenetic modifications have been shown to be important in the regulation of gene expression and the onset of diseases (Ellis et al., 2009). In mammals, DNA methylation occurs mainly on cytosines of CpG dinucleotides (Figure 2.1), and can be reliably transmitted through mitosis. In a single genome, DNA methylation is established through two waves of reprogramming: during early embryogenesis and during gametogenesis (Jaenisch and Bird, 2003; Reik et al., 2001a). At these two stages, genome-wide erasure and de novo DNA methylation occur, and during division of somatic cells the methylation pattern is copied.
DNA methylation plays important roles in cell differentiation (Latham et al., 2008; Watanabe et al., 2002), embryogenesis (Haaf, 2006; Mayer et al., 2000) and silencing of repetitive elements (Chen et al., 1998) through regulating gene expression (He et al., 2011).
It can directly regulate gene expression by preventing the binding of transcription factors in gene promoters (Tate and Bird, 1993), and the methylated DNA can also change the chromatin structure by recruiting methyl-CpG-binding domain proteins (MBDs) and then other proteins (Hendrich and Bird, 1998; Lewis et al., 1992; Prokhortchouk et al., 2001). The extent of DNA methylation varies in different tissues and at different developmental stages, implying a fundamental but distinct role of DNA methylation for normal development.

2.2 Experimental methods to profile DNA methylation

The key to detecting DNA methylation is to distinguish unmethylated cytosines from methylated cytosines, and there are three general ways to accomplish this: methylation sensitive restriction enzymes, antibodies for methylated cytosines and sodium bisulfite conversion.
Restriction endonucleases can cut genomic DNA at specific recognition sites, and some are sensitive to methylation states. For methylation sensitive restriction enzymes, cleavage can be blocked or impaired when methylation occurs in the recognition site (Hatada et al., 1991; Singer-Sam et al., 1990). For example, the restriction enzyme NotI recognizes the sequence 5’-GCGGCCGC-3’ and cuts it at the first CpG if it is unmethylated (Lindsay and Bird, 1987). Restriction enzymes can only inform about methylation at CpG sites inside the enzyme’s recognition sites.
Antibodies designed for 5-methylcytosine (5mC) have been used in immunoprecipitation-based technologies to enrich a sample for methylated DNA (Weber et al., 2005).
Sodium bisulfite treatment converts unmethylated cytosine to uracil but does not affect methylated cytosines, and after PCR amplification the uracil is replaced with thymine (Frommer et al., 1992; Herman et al., 1996).
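
The bisulfite conversion rule just described can be illustrated with a minimal Python sketch. The function name `bisulfite_convert`, the toy sequence and the chosen methylated position are hypothetical and exist only for illustration; the sketch treats a single strand and assumes complete conversion, which real experiments only approximate.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Return the sequence read out after bisulfite treatment and PCR.

    Unmethylated C is read as T (via uracil); methylated C is read as C.
    `methylated_positions` holds 0-based indices of methylated cytosines.
    Assumption: single strand, 100% conversion efficiency.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(sequence.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")  # unmethylated C -> U -> T after PCR
        else:
            converted.append(base)  # methylated C and all other bases are kept
    return "".join(converted)


# Toy example: only the cytosine at index 2 (the first CpG) is methylated.
original = "TACGATCGCA"
print(bisulfite_convert(original, {2}))  # TACGATTGTA
```

Comparing the converted read with the original sequence shows which cytosines were protected: a retained C indicates methylation, while a C read as T indicates its absence.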
In theory, bisulfite modification can detect methylation at each CpG dinucleotide and also provide information about the absence of methylation at a nucleotide; bisulfite sequencing is therefore considered the “gold standard” for profiling DNA methylation.
Recent advances in DNA sequencing technology coupled with bisulfite treatment (BS-seq) have enabled high-throughput genome-wide analysis of DNA methylation. As shown in Figure 2.2, treatment with sodium bisulfite converts unmethylated cytosines to uracil and then to thymines after PCR amplification. By sequencing the DNA we can directly observe which cytosines are methylated in the original sample. Using high-throughput short read sequencing, reads are mapped to a reference genome and several methylation states can be observed corresponding to the same CpG site, each originating from a distinct molecule in the original cell population. The methylation level of a CpG site is defined as the proportion of methylated nucleotides in reads piling up over that site (Figure 2.3). Thus, BS-seq provides a genome-wide methylation profile (methylome) at base-pair resolution.

Figure 2.1: DNA methylation mainly occurs at CpG dinucleotides in mammals by adding a methyl group to the fifth carbon of base C, and it is symmetric.

Figure 2.2: BS-seq to detect DNA methylation. The bisulfite treatment and the following PCR convert all unmethylated cytosines to thymines, and all methylated cytosines are retained.

2.3 Allele-specific DNA methylation

In diploid organisms, somatic cells have two copies of the genome, with one copy inherited from each parent.
Therefore, each autosomal gene is represented by two copies, or alleles. The two genomes are not equivalent and both are required for normal development (Barton et al., 1984; McGrath and Solter, 1984). Allele-specific DNA methylation (ASM) refers to different methylation states of the two alleles, and is usually related to the regulation of allele-specific gene expression (ASE). Figure 2.4 gives a depiction of how ASM might appear. The two parental alleles are distinguished by a T/A SNP site (though in general the alleles could be genetically identical), and in the region of ASM, most paternal alleles are heavily methylated while the majority of maternal alleles are unmethylated.

Figure 2.3: Definition of methylation level in BS-seq. After mapping the reads to the reference genome, the CpG is covered by 9 reads. Among them, 6 are methylated and 3 are unmethylated, so the methylation level is 6/(6 + 3) = 6/9 = 0.67.

Figure 2.4: Example of ASM detection. Each row of circles corresponds to one clone of bisulfite PCR products. Open and closed circles stand for unmethylated and methylated C residues, respectively. The SNP (T/A) site distinguishes the paternal (SNP T) and maternal (SNP A) alleles. The red rectangle indicates the ASM.

Recent genome-wide surveys have shown that ASM is prevalent in both human and mouse, and also concluded that an appreciable amount of ASM co-occurs with heterozygous SNPs in CpG dinucleotides at one allele (Schalkwyk et al., 2010; Schilling et al., 2009; Shoemaker et al., 2010). Due to the high mutability of CpG dinucleotides, such CpG SNPs are widespread in individual genomes (Li et al., 2009). In turn, such cis-regulated ASM constitutes an important class of the individual variability of the epigenome.
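
The methylation level of Figure 2.3 and the SNP-based allele split of Figure 2.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the read representation (an allele label plus a list of per-CpG methylation calls) and the toy reads are hypothetical, and the sketch relies on a heterozygous SNP to separate the alleles, unlike the genotype-independent model developed later in this thesis.

```python
from collections import defaultdict


def methylation_level(calls):
    """Proportion of methylated observations (True) among all calls."""
    if not calls:
        return None  # no coverage, level undefined
    return sum(calls) / len(calls)


def allele_methylation(reads):
    """Group reads by the SNP base they carry and average their CpG calls.

    `reads` is a list of (snp_base, [bool, ...]) pairs, where each bool is
    the methylation call at one CpG covered by the read (hypothetical format).
    """
    per_allele = defaultdict(list)
    for snp_base, calls in reads:
        per_allele[snp_base].extend(calls)
    return {base: methylation_level(calls) for base, calls in per_allele.items()}


# Figure 2.3: 6 methylated and 3 unmethylated reads over one CpG.
print(round(methylation_level([True] * 6 + [False] * 3), 2))  # 0.67

# Toy reads in the spirit of Figure 2.4: the T allele is mostly methylated,
# the A allele is mostly unmethylated.
toy_reads = [
    ("T", [True, True, True]),
    ("T", [True, True, False]),
    ("A", [False, False, False]),
    ("A", [False, True, False]),
]
print(allele_methylation(toy_reads))
```

A large gap between the two per-allele levels is the kind of signal depicted in Figure 2.4; deciding whether such a gap is statistically meaningful requires a model of the read data and is not attempted here.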
It has been suggested that this sequence polymorphism-dependent ASM plays a critical role in connecting genetic variation to individual phenotypic differences.

2.4 Genomic imprinting

There is a spectrum of ASM in mammals regulating the parent-of-origin-dependent allele-specific expression of imprinted genes (Chandler et al., 1987; Zhang and Tycko, 1992; Zhang et al., 1993). Imprinting is an inheritance process independent of classical Mendelian inheritance. Imprinted alleles are silenced in the parental germ line such that the genes in the newly formed embryo (and placenta) are expressed either only from the allele inherited from the mother (e.g. H19) or only from the allele inherited from the father (e.g. IGF2). To date approximately 100 imprinted genes have been identified in mammals (Weaver et al., 2009).

Human diseases related to imprinting. Although they represent a small subset of the mammalian genome (on the order of 1%), imprinted genes are essential for normal development and their roles in the embryonic growth and neurodevelopment of mammals have been well documented (Davies et al., 2001; Hata et al., 2002; Isles and Wilkinson, 2000). Knock-out studies have grouped the functions of imprinted genes in mouse into three categories: 50% act as embryonic or neonatal growth regulators, 20% function in neurological processes, and the rest currently have no identified biological functions (Renfree et al., 2009). Abnormal imprinting frequently leads to pre- and post-natal growth defects (Reik et al., 2001a) and also relates to the pathogenesis of pediatric and adult cancers (Astuti et al., 2005; Feinberg et al., 2002; Weksberg et al., 2003). The first imprinting disorders described in humans were the Prader-Willi and Angelman syndromes (Henry et al., 1991; Moutou et al., 1992). Both syndromes are associated with loss of chromosome 15q11-13, which contains the paternally expressed SNRPN and NDN genes, as well as the maternally expressed UBE3A gene.
Prader-Willi syndrome (PWS) is associated with extreme eating problems such as hyperphagia and with developmental retardation. PWS is caused by the loss of the paternally inherited copy of the region: the absence of products from paternally expressed imprinted genes in the region results in PWS. Angelman syndrome, a disorder of the nervous system with developmental disabilities including seizures, speech deficits and motor oddities, is caused by the loss of the maternal copy of the region. The expression of UBE3A, which is involved in the ubiquitin pathway, is turned off, leading to the phenotypic symptoms observed in Angelman syndrome.

Evolution of genomic imprinting. Genomic imprinting is believed to have evolved along with the placenta, emerging during the divergence of the monotremes and the therian mammals (marsupials and eutherians) (Hore et al., 2007). Since they can only be expressed from one parental chromosome, imprinted genes do not have the advantage brought by the redundancy of two copies of autosomal genes in diploid organisms. The consequence is decreased robustness to mutation and an increased risk of genetic disease associated with these genes. Therefore, the potential benefits must outweigh the cost of being functionally haploid. The persistence of imprinting throughout mammalian evolution is itself evidence of an evolutionary advantage.
Comparative genomics has provided some clues to the origin and evolution of imprinting (Renfree et al., 2009). It has been proposed that imprinting originated from a host defense mechanism of cells to knock down parasitic transposable elements through DNA methylation (Barlow, 1993).
The benefits brought by imprinted genes can be explained by the so-called “parental conflict hypothesis”, which states that imprinting is a result of the differing interests of each parent in terms of the evolutionary fitness of their genes and serves as a mechanism for allocating maternal resources between the developing fetus and the mother’s survival (Haig and Westoby, 1989; Moore and Haig, 1991). According to this hypothesis, paternally expressed imprinted genes tend to promote growth to gain greater fitness for the offspring, while maternally expressed genes function to reserve resources for other offspring and the mother’s own survival by limiting growth. Chapter 6 contains an expanded discussion of this theory as we compare ASM between human and mouse.
In addition to this widely accepted hypothesis, several other theories suggest either a coadaptive reason for the evolution of genomic imprinting or a machinery for homologous chromosome recognition during meiosis (Barlow, 1993; Pardo-Manuel de Villena et al., 2000; Varmuza and Mann, 1994). Coadaptation proposes that genomic imprinting is favored by selection because it adaptively regulates reproductive behavior and offspring development (Crews, 2008; Wolf and Hager, 2006). The coadaptation theory implies the coordinated expression of imprinted genes in the placenta and the maternal brain. The evolution of imprinting integrates the maternal and offspring genomes for higher offspring fitness. The “ovarian time bomb” theory proposes that imprinting can reduce the risk of ovarian trophoblastic diseases in female mammals (Varmuza and Mann, 1994; Weisstein et al., 2002). Ovarian trophoblastic disease is caused by the development of unfertilized eggs. Thus, mothers who downregulate the expression of growth-enhancing genes in offspring can prevent such harmful reproduction and are selectively favored.
The chromosome pairing theory (Pardo-Manuel de Villena et al., 2000) proposes that different chromatin structures caused by genomic imprinting can facilitate pairing in meiosis and distinguish two homologous chromosomes for DNA repair and recombination. These hypotheses are not mutually exclusive but complementary to each other, suggesting that the evolution of genomic imprinting is coordinated by different selective pressures. To summarize, the evolution of genomic imprinting in placental mammals may have been triggered by independent efforts to silence transposable elements in male and female germ cells. The consequent parent-of-origin regulation of certain genes has been advantageous and therefore favored by natural selection, gradually increasing the number of imprinted genes in mammalian species.

Figure 2.5: Example of an imprinted gene cluster on mouse chr7. The H19/Igf2 imprinted domain contains three imprinted genes: the paternally expressed insulin-like growth factor 2 (Igf2) and insulin 2 (Ins2) and the maternally expressed long ncRNA H19. Open and closed circles stand for unmethylated and methylated C residues, respectively. The paternally methylated imprinting control region (ICR) is located upstream of H19.

Imprinted genes are distributed throughout the human and mouse genomes, and most imprinted genes reside in clusters that span roughly 1 Mb. Each cluster is coordinately regulated through a cis element known as an imprinting control region (ICR), which bears specific epigenetic marks on either the maternal or paternal allele, but not both. Deletion of ICRs results in loss of imprinting of adjacent genes (Fitzpatrick et al., 2002; Mancini-DiNardo et al., 2006; Thorvaldsen et al., 1998). The most common differential epigenetic mark of the ICRs is ASM, and although it is typically correlated with other epigenomic marks, DNA methylation appears to be the information-carrying mark.
For example, the ICR of the imprinted cluster H19/IGF2 has distinctive DNA methylation states on the two alleles (Figure 2.5). Such ASM marks of ICRs (alternatively called gametic differentially methylated regions, gDMRs) in imprinted genes are set during the de novo methylation in germ cell reprogramming that follows genome-wide methylation erasure. All evidence indicates that this ASM, or at least the information it carries, is somehow maintained through the DNA methylation reprogramming that occurs in the earliest stages of embryonic development (Davis et al., 2000). Most known ICRs are maternally methylated; the known exceptions are three paternally methylated ICRs (H19/Igf2, Dlk1/Dio3 and Rasgrf1) (Lucifero et al., 2002; Reik et al., 2001b). The numerical imbalance between maternally and paternally methylated ICRs may be explained by the higher mutation rate in the paternal germ line and the functional dominance of maternally methylated ICRs at the fetal-maternal interface during early embryonic development (Bourc’his and Bestor, 2006; Schulz et al., 2010). The remaining imprinting associated ASM (somatic DMRs) may be established after fertilization (El-Maarri et al., 2001), possibly under the control of nearby ICRs, and is therefore simply a consequence of nearby ASM set in the gametes of the parents.
ICRs generally act over long distances to coordinately control multiple genes in imprinted clusters. There are two silencing mechanisms described for ICRs: the insulator model and the ncRNA silencing model (Pauler and Barlow, 2006). In the first model, the insulator controls access to the common regulatory elements of the imprinted cluster. The most intensively studied example for this model is H19/Igf2 (Figure 2.5). The ICR in the middle functions as an insulator.
The unmethylated maternal allele recruits the insulator protein CTCF, which forms chromosome loops that block the interaction of Igf2 with the downstream enhancer and switch off its expression, so only H19 is expressed from the maternal allele. In contrast, on the paternal allele, the methylated ICR prevents CTCF binding, so Igf2 is expressed through its interaction with the enhancer and H19 is silenced (Bell and Felsenfeld, 2000; Bell et al., 1999; Hark et al., 2000).

In the ncRNA silencing model, the transcription of a long ncRNA mediates the silencing of multiple protein-coding genes bidirectionally in cis (Mancini-DiNardo et al., 2006; Sleutels et al., 2002). The mechanism of ncRNA silencing is well defined in the imprinted Igf2r/Air cluster in mouse (Braidotti et al., 2004). The ICR lies in the second intron of Igf2r and also contains the promoter of a noncoding antisense RNA, Air (Figure 2.6). The unmethylated ICR on the paternal allele allows active expression of the ncRNA Air, which silences all three surrounding protein-coding genes, Igf2r, Slc22a2 and Slc22a3 (Sleutels et al., 2002). In contrast, the methylated ICR on the maternal allele prevents the expression of Air, and the three protein-coding genes are actively expressed. There is still a debate about how the transcription of an ncRNA silences multiple genes in cis bidirectionally (Pauler et al., 2007). Either the ncRNA product or the act of transcription itself could be responsible for gene silencing.

These two silencing mechanisms may operate together at the same cluster. For example, the Kcnq1 locus, which contains the ncRNA Kcnq1ot1 located in an intron of Kcnq1, is structurally similar to the Igf2r/Air locus, and the paternal expression of the ncRNA Kcnq1ot1 has been demonstrated to be essential for silencing the eight surrounding protein-coding genes (Fitzpatrick et al., 2002).
At the same time, the ICR at the promoter of the ncRNA Kcnq1ot1 shows insulator activity and allele-specific binding of CTCF (Fitzpatrick et al., 2007; Kanduri et al., 2002), suggesting the coexistence of the insulator silencing mechanism. Although the two silencing models have been well established, the function of some ICRs is still unclear (e.g. the ICR at the Dlk1/Gtl2 locus).

Figure 2.6: Example of an imprinted gene cluster, Igf2r/Air, for the ncRNA silencing model. The ICR contains the promoter of the long ncRNA Airn and is methylated on the maternal allele. Only the ncRNA Airn is expressed from the unmethylated paternal allele, and all other protein-coding genes are expressed only from the maternal allele. (Figure labels: paternal and maternal alleles; Igf2r, Airn, Slc22a2, Slc22a3; imprinting control region (ICR); expressed allele; silenced allele.)

In summary, ASM relates to precise gene regulation and plays an important role in many biological processes, especially in imprinted gene expression, which is regulated by ASM in ICRs in all known cases. The investigation of ASM is extremely helpful in defining and exploring imprinted domains and their regulatory mechanisms.

2.5 Approaches for identifying imprinted genes

Genomic imprinting is essential for normal development. The identification of imprinted genes and a detailed understanding of their regulation have become increasingly important with the realization that aberrant genomic imprinting contributes to several complex diseases, such as Prader-Willi, Angelman, Beckwith-Wiedemann, Rett, and Silver-Russell syndromes (Feinberg, 2007; Monk, 2010).

2.5.1 Screening for allele-specific gene expression

Much effort has been directed toward locating imprinted genes using expression screen-based approaches (Barlow et al., 1991; Nikaido et al., 2003; Pollard et al., 2008).
One limitation of such approaches is that many imprinted genes may show allele-specific expression only in particular tissues at particular developmental stages (Deltour et al., 1995). Consequently, failure to confirm imprinting in a specific tissue at a specific stage of development does not eliminate the possibility that a different isoform may be imprinted in some other tissue at some other stage of development. Although estimates of imprinted gene prevalence in the human genome vary, they hover around 1%. Consequently, in the absence of any method for prioritizing genes, an average of 100 genes must be examined (perhaps in a broad range of tissues and at many stages of development) before an imprinted gene would be identified. Indeed, experimental identification of human imprinted genes to date has been slow: H19 and Igf2r were the first genes shown to be imprinted (Barlow et al., 1991; Bartolomei et al., 1991; DeChiara et al., 1991), and since their discovery in 1991, only about 60 additional imprinted genes have been identified, with varying degrees of evidence for imprinting (Morcos et al., 2011).

So far, direct observation of mammalian imprinting in living cells and tissues has been carried out most thoroughly in the mouse genome (Gregg et al., 2010a; Wang et al., 2008). These studies employed the gold standard for recognizing imprinting in mice: the non-equivalence of monoallelic expression in reciprocal matings of inbred strains, observed in mouse embryonic brain. A reciprocal cross is a breeding experiment designed to test the role of the parent's sex in a given inheritance pattern. Two crosses are performed: in one cross, a male of strain S1 is mated with a female of strain S2; in the other, a female of S1 is mated with a male of S2 (Figure 2.7). Single nucleotide polymorphisms (SNPs) that differ between S1 and S2 are used to distinguish the two parental alleles in the F1. RNA-seq is used to obtain the gene expression profiles in the F1.
Let $p_1$ denote the proportion of allele S1 in the F1 resulting from the first cross (S1 father, S2 mother) and let $p_2$ be the proportion of allele S1 in the F1 resulting from the second cross (S2 father, S1 mother). If a gene has equal expression from the two parental alleles, we expect both $p_1$ and $p_2$ to equal 0.5 (barring experimental noise). A paternally expressed imprinted gene will have the pattern $p_1 > 0.5$ and $p_2 < 0.5$, whereas a maternally expressed imprinted gene will have the pattern $p_1 < 0.5$ and $p_2 > 0.5$. In the presence of noise, statistical tests can be applied to distinguish these hypotheses (Wang et al., 2008).

Obviously, breeding experiments are unavailable for studying human imprinting. Consequently, demonstration of imprinting requires family-based tissue samples as well as accurate methods to observe differential expression of parental alleles. Another limitation of human studies is access to some of the most informative tissue types: early developing somatic and germ cells (e.g. ESCs and PGCs) and female germ cells. Because of these limitations, our understanding of imprinting in humans is far behind that in mouse.

Figure 2.7: The left part is the first cross, with strain S1 as the father and S2 as the mother, and the right part shows the second cross, with S1 as the mother and S2 as the father. $p_1$ is the proportion of the S1 allele in the F1 from the first cross, and $p_2$ is the proportion of the S1 allele in the F1 from the second cross. For a rough estimate, the relative values of $p_1$ and $p_2$ can be used to screen imprinted gene candidates: 1) $p_1 = p_2 = 0.5$: a gene has equal expression from the two parental alleles; 2) $p_1 > 0.5$, $p_2 < 0.5$: a paternally expressed imprinted gene; 3) $p_1 < 0.5$, $p_2 > 0.5$: a maternally expressed imprinted gene.

2.5.2 Predicting imprinted genes with genome sequence features

The enrichment of certain types of repeated elements (e.g.
LINE-1) and the existence of some sequence characteristics have been reported in imprinted genes (Allen et al., 2003; Ke et al., 2002), raising the possibility of distinguishing between monoallelically and biallelically expressed genes from sequence alone. Based on the DNA sequence features around imprinted genes, machine learning methods were developed to predict the imprinted status of genes (Luedi et al., 2005, 2007). The DNA sequence features used include repeats, CpG islands, recombination hotspots, nucleosome formation potential and trained consensus motifs. In the human model (Luedi et al., 2007), with a small test set (40), they achieved a sensitivity of 85% and a specificity of 79%. Among the predicted imprinted gene candidates (156 in total), they successfully validated two new imprinted genes in human. These methods reveal some specific sequence features of known imprinted genes, but are limited by the small training set and by the fact that genomic imprinting varies across tissues and developmental stages.

2.5.3 Screening for allele-specific methylation

ASM screen-based methods can overcome the effect of temporal and spatial expression patterns because the ICRs are expected to exist through developmental stages preceding the context in which they become active. Different methylation profiling methods have been applied to human uniparental samples and mouse uniparental (i.e. parthenogenetic or androgenetic) embryos to identify parent-of-origin-specific DNA methylation (Choufani et al., 2011; Hayashizaki et al., 1994; Hayward et al., 1998; Kamiya et al., 2000; Peters et al., 1999; Plass et al., 1996; Smith et al., 2003), and many novel imprinted genes and corresponding ICRs have been discovered. However, the use of tissues or embryos of uniparental origin carries the risk of disrupted methylation patterns in such aberrant genome configurations. Shoemaker et al.
(2010) made use of correlations among CpGs on the same allele and adjacent heterozygous SNPs to locate regions of ASM, providing a method for identifying cis-regulated ASM in the genome.

2.5.4 Leveraging high-throughput sequencing technologies

Advances in DNA sequencing technology have been leveraged for high-throughput identification of novel imprinted genes. Recently, based on transcriptome sequencing of mouse brain tissues in a reciprocal cross design, Gregg et al. (2010a,b) detected previously unobserved parent-of-origin allele-specific expression for hundreds of genes. Short-read sequencing has also been applied to profile DNA methylation in mammals at unprecedented resolution. The aforementioned “BS-seq” technology has enabled genome-wide profiling of DNA methylation in mammalian genomes at single-CpG resolution (Lister et al., 2009). Li et al. (2010) produced a methylome from peripheral blood of a single individual and recognized the potential of using such data to profile ASM. They employed a method based on associating heterozygous SNPs with differential methylation, and identified hundreds of ASM regions. Methods such as this, however, can only be applied to data from a single individual for which matching genotypic data are available.

There are two shortcomings of approaches that depend on genotype. First, they can be confounded by ASM that is associated with genotype but which may not have any regulatory effect. The amount of ASM typically associated with genotype is not well understood, but recent reports suggest it is significant (Kerkel et al., 2008). More importantly, because imprinted methylation is not necessarily associated with genotypic variation, these methods will be inherently blind to some portion of ASM.

2.6 Summary

Genomic imprinting likely evolved in mammals as a result of parental battle between sexes to control the maternal resources to the offspring.
Imprinted genes are expressed from only one allele, but the identity of that allele is determined by parent of origin rather than by genetics. Paternally expressed imprinted genes tend to promote growth while maternally expressed imprinted genes suppress growth. Traditionally, allele-specific expression is used as the sign for locating imprinted genes. Because imprinted genes have tissue- and stage-specific expression patterns, these expression screen-based methods cannot provide a complete survey of imprinted genes. Almost without exception, ASM coordinates the expression of imprinted genes by differentially marking specific regulatory regions on the maternal and paternal alleles. ASM is less restricted by spatial and temporal variation than expression-based screening, and is thus a more stable signal for detecting genomic imprinting. Heterozygous SNPs are usually used to separate the two alleles in order to compare their methylation levels. The dependence on heterozygous SNPs demands higher coverage of the data but provides limited output. Therefore, to accompany the advances in methylation profiling techniques, a more comprehensive method to identify ASM is a necessity. We have developed a novel probabilistic model to describe ASM based on data from high-throughput BS-seq experiments (Fang et al., 2012). Our model is independent of genotype, and therefore has broad applicability for identifying ASM in the context of imprinting.

Chapter 3

Modeling allele-specific DNA methylation

This chapter describes a novel computational method to identify ASM based on correlations between adjacent CpGs in BS-seq data. In allelically methylated regions (AMRs), reads covering more than one CpG can be clustered into two distinct classes with different alleles of origin based on their different methylation states. We first use a Fisher's exact test on two consecutive CpGs to visualize the probability of ASM at each CpG site.
Then, to make use of correlations among more CpGs, we design two statistical models that describe a genomic interval under the assumption that it does or does not have ASM, respectively. The genomic interval is predicted to be an AMR if the model for ASM better fits the observed reads. We apply the method to the whole genome using sliding windows, and for each candidate AMR identified with sliding windows, the boundaries can be pinpointed by a dynamic programming method that optimizes the likelihood.

In this chapter, we assume every read has been sequenced after bisulfite treatment and mapped uniquely to the reference genome. Because we are interested in mammalian methylation, we restrict our attention to CpG sites both in the genome and in the reads. Reads not mapping over a CpG are ignored.

Our goal is to identify intervals of the genome where it appears that the two alleles have different methylation patterns; typically in such a case one allele will be highly methylated and the other not. Such allelically methylated regions are called “AMRs.” There are two important kinds of information our model must capture: (1) the set of reads mapping into the interval should appear to represent two distinct methylation patterns, and (2) the subsets of reads corresponding to those two patterns should be in roughly equal proportions, since the alleles themselves are present in equal proportions. One can consider a methylation pattern as analogous to a haplotype, but with a strong stochastic component. Therefore, reads that contain only a single CpG will provide us with relatively little information, and we would like reads to cover as many CpGs as possible. We can then ask whether neighboring CpG sites on the same read tend to share methylation states, and whether other reads cover the same CpG sites but with the alternative shared methylation state.
Our approach is to apply a single-allele model to the data, then apply an allele-specific model, and compare the fit of these models to determine whether the data support ASM.

3.1 A simple method for scoring and visualizing ASM

Before going into the full probabilistic model that is the central technical contribution of this thesis, we first present a simpler method to capture information about correlated methylation states in reads. This simpler method should make the information underlying our more sophisticated method more intuitive. At the same time, we have found this simple method to be helpful in visualizing ASM plotted along a chromosome in a genome browser.

We first examine the ASM status at each CpG based on methylation correlations between two consecutive CpGs (in a fixed direction, i.e. 5' or 3'). Using the observed reads covering both CpGs, we construct a 2×2 contingency table in which the rows give the methylation state (methylated or unmethylated) of each read at one CpG and the columns give its state at the adjacent CpG (e.g. the one on the 3' side), so that each cell counts the reads showing one of the four combinations. The underlying null hypothesis is that, at the designated CpG sites, methylated and unmethylated observations are generated at random with respect to the two alleles, so there is no ASM there. We use Fisher's exact test to calculate a p-value from the contingency table, giving the significance of the observed reads under the null hypothesis of no ASM (Figure 3.1). We then define an allelic score as $1 - p$ for each CpG; the higher the score, the more likely the pair of CpGs is to have ASM.

Figure 3.2 shows the allelic score calculated in mouse hematopoietic stem cells (HSC) around the imprinted GNAS locus, together with the AMRs identified. The peaks of the allelic score correspond well to the AMRs. Therefore, the allelic score provides a complementary visualization of our ASM prediction.
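The allelic score described above can be sketched in a few lines of Python. This is only an illustration, not the implementation used in this work: the function names are hypothetical, and the choice of tail (summing over tables at least as concordant as the observed one) is an assumption, made because it reproduces the one-sided p-value of 0.04 reported for the counts 5, 1, 1, 5 in Figure 3.1.

```python
from math import comb

def fisher_one_sided(mm, mu, um, uu):
    """One-sided Fisher's exact p-value for the 2x2 table [[mm, mu], [um, uu]].

    mm counts reads methylated at both CpGs, mu reads methylated at the first
    CpG only, um at the second only, uu at neither. The tail tested (an
    assumption of this sketch) is an excess of concordant reads.
    """
    row1 = mm + mu                  # reads methylated at the first CpG
    col1 = mm + um                  # reads methylated at the second CpG
    total = mm + mu + um + uu
    denom = comb(total, col1)
    # Hypergeometric probabilities of all tables with at least mm in the top-left cell
    tail = sum(comb(row1, x) * comb(total - row1, col1 - x)
               for x in range(mm, min(row1, col1) + 1))
    return tail / denom

def allelic_score(mm, mu, um, uu):
    """Allelic score 1 - p for a pair of adjacent CpGs."""
    return 1.0 - fisher_one_sided(mm, mu, um, uu)

p = fisher_one_sided(5, 1, 1, 5)
print(round(p, 2))                            # 0.04
print(round(allelic_score(5, 1, 1, 5), 2))    # 0.96
```

Exact integer binomial coefficients are used so that no external statistics library is needed; for deep coverage a log-space computation would be the more usual choice.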
3.2 A probabilistic model for allele-specific DNA methylation

The allelic methylation score in the previous section may miss the correlation information provided by the reads covering more than two CpGs. To fully make use of the correlation information between CpGs in a genomic interval, we design two statistical models based on ASM and no ASM separately. If the ASM model fits the observed reads better, the genomic interval is labeled as an AMR.

Figure 3.1: An example of contingency table construction based on adjacent CpG sites. There are 12 reads covering the two CpGs together. The four cells in the contingency table record the corresponding four combinations of methylation states at the two CpGs. A one-sided p-value is then calculated based on Fisher's exact test. (Table shown in the figure, with CpG1 in rows and CpG2 in columns, m = methylated, u = unmethylated: m/m = 5, m/u = 1, u/m = 1, u/u = 5; $p_{\text{fisher}} = 0.04$.)

Figure 3.2: An example of the allelic score calculated around the imprinted GNAS locus in mouse HSC cells. The orange bars are the allelic scores for all CpGs in the region, and the red blocks are the AMRs called. (Region shown: mouse chr2:174,085,000-174,185,000, mm9; scale bar 50 kb; tracks: genes Nespas and Gnas, HSC AMRs, CGI, allelic_score.)

3.2.1 Modeling DNA methylation on a single allele

In regions without ASM (non-AMRs), we associate with each CpG a single probability indicating the probability that the CpG is methylated in the cells of interest ($\theta_1$, $\theta_2$ and $\theta_5$ in Figure 3.3). We assumed that each individual CpG site was independent. For a genomic interval containing $n$ CpGs, the single-allele model is $\Theta = (\theta_1, \ldots, \theta_n)$, with $\theta_i$ representing the methylation probability at the $i$-th CpG inside the interval. Given a set of reads $R$, the likelihood in the single-allele model within the interval is

\[ L_1(\Theta \mid R) = \Pr(R \mid \Theta) \propto \prod_{i=1}^{n} \theta_i^{m(R,i)} (1 - \theta_i)^{u(R,i)}, \tag{3.1} \]

where $m(R,i)$ and $u(R,i)$ give the numbers of methylated and unmethylated observations from reads mapping over the $i$-th CpG. Given that we have the reads $R$, fitting the component parameters of the model is trivial since each is a binomial.

Figure 3.3: Schematic of modeling site-specific DNA methylation in regions without ASM or with ASM. The two horizontal lines indicate the two parental alleles. The brown circles indicate CpGs. The pink rectangle is the region of ASM. All $\theta$'s are methylation probabilities of the corresponding CpGs. (Figure labels: allele 1 and allele 2; $\theta_1$, $\theta_2$, $\theta_5$ outside the region with ASM; $\theta_{31}$, $\theta_{41}$ on allele 1 and $\theta_{32}$, $\theta_{42}$ on allele 2 inside it.)

3.2.2 Modeling DNA methylation on two alleles

Within regions of ASM (AMRs), we use an allele-specific model that associates two distinct methylation probabilities with each CpG ($\theta_{31}$, $\theta_{32}$, $\theta_{41}$ and $\theta_{42}$ in Figure 3.3). Assuming there are $n$ CpGs in the genomic interval, the two-allele model has the structure $\Theta = (\theta_{11}, \theta_{12}, \ldots, \theta_{n1}, \theta_{n2})$, with $\theta_{i1}$ and $\theta_{i2}$ representing the probability that the $i$-th CpG is methylated on allele 1 and allele 2, respectively. Under this model, reads mapping over the same genomic CpG may have different probabilities of methylation for their CpGs depending on the allele from which they originate. The allele of origin for any read is missing data, and for a given set $R$ of reads we express these missing data as the partition $\eta$, $\eta(R) = \{\eta_1, \eta_2\}$, defined by $R = \eta_1 \cup \eta_2$, where $|R| = |\eta_1| + |\eta_2|$. For any $r \in R$, if $r \in \eta_j$ we say that $r$ originates from allele $j$.

Because we are modeling alleles in the context of data from a diploid cell population, the probability that any read originates from a given allele is 0.5. Thus the likelihood is

\[ L_2(\Theta \mid R, \eta) = \Pr(R \mid \Theta, \eta) \Pr(\eta), \tag{3.2} \]

since the partition $\eta$ is independent of $\Theta$. The probability $\Pr(\eta)$ is effectively a prior on the size of the read partition, which we define relative to the assumption that

\[ |\eta_1| \sim \mathrm{Binomial}(|\eta_1| + |\eta_2|, 0.5), \]

since each allele is present with equal frequency. Therefore,

\[ L_2(\Theta \mid R, \eta) = \binom{|R|}{|\eta_1|} 0.5^{|R|} \prod_{i=1}^{n} \prod_{j=1}^{2} \theta_{ij}^{m(\eta_j,i)} (1 - \theta_{ij})^{u(\eta_j,i)}, \tag{3.3} \]

where $m$ and $u$ are as defined for Equation 3.1.
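As a concrete illustration of Equations 3.1 and 3.3, the following Python sketch evaluates both log-likelihoods at their binomial maximum likelihood estimates, i.e. the observed methylated fraction at each CpG. It is a minimal sketch under stated assumptions rather than the implementation used in this work: the read representation (a dict from CpG index to 1 for methylated or 0 for unmethylated), the function names, and the fact that the partition is supplied by hand are all choices made here for illustration.

```python
from math import lgamma, log

def counts(reads, n):
    """m[i] and u[i]: methylated and unmethylated observations at CpG i."""
    m, u = [0] * n, [0] * n
    for read in reads:
        for i, state in read.items():
            if state:
                m[i] += 1
            else:
                u[i] += 1
    return m, u

def binom_loglik(m, u):
    """Sum over CpGs of m ln(theta) + u ln(1 - theta) at theta = m / (m + u)."""
    total = 0.0
    for mi, ui in zip(m, u):
        if mi:
            total += mi * log(mi / (mi + ui))
        if ui:
            total += ui * log(ui / (mi + ui))
    return total

def loglik_single(reads, n):
    """Log of Equation 3.1 (up to the proportionality constant)."""
    return binom_loglik(*counts(reads, n))

def loglik_pair(part1, part2, n):
    """Log of Equation 3.3 for a given partition of the reads into two alleles."""
    size1, size = len(part1), len(part1) + len(part2)
    # ln of the Binomial(|R|, 0.5) prior on the size of the first subset
    log_prior = (lgamma(size + 1) - lgamma(size1 + 1)
                 - lgamma(size - size1 + 1) + size * log(0.5))
    return (log_prior + binom_loglik(*counts(part1, n))
            + binom_loglik(*counts(part2, n)))

# Hypothetical toy data: two CpGs, three fully methylated and three
# fully unmethylated reads.
methylated = [{0: 1, 1: 1}] * 3
unmethylated = [{0: 0, 1: 0}] * 3
print(loglik_single(methylated + unmethylated, 2))
print(loglik_pair(methylated, unmethylated, 2))
```

On this toy interval the two-allele log-likelihood is the larger of the two, because each subset of reads is explained perfectly by its own allele and only the partition prior contributes.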
The allele of origin for each read is missing data and needs to be estimated. To partition the reads into two classes, multiple clustering methods are available. We chose to fit the allele-specific model using expectation maximization (EM; Dempster et al., 1977), obtaining expectations on membership in $\eta_1$ and $\eta_2$. Such a model-based method provides a robust consensus for the partition of reads.

Expectation maximization for the allele-specific model. When computing the likelihood according to the allele-specific model we require a partition $R = \eta_1 \cup \eta_2$ of the reads, assigning each read to one of two alleles. This is missing information, and we infer the expected partition by assigning indicator variables for the events that individual reads have membership in $\eta_1$ or $\eta_2$. Assuming two alleles, and therefore two methylation probabilities for each CpG, we let $\theta_{i1}$ and $\theta_{i2}$ be the methylation probabilities at CpG $i$ for allele 1 and 2, respectively. The read set $R$ is partitioned into two subsets $\eta_1$ and $\eta_2$ according to the allele of origin for each read. When calculating the likelihood, the methylation probabilities are the parameters $\Theta = \{(\theta_{11}, \theta_{12}), \ldots, (\theta_{n1}, \theta_{n2})\}$. Let $\pi_i$, $i \in \{1, 2\}$, denote the probability that a read comes from allele $i$, so $\pi_1 = \pi_2 = 0.5$. We use the indicator functions $I_1(r_i)$ and $I_2(r_i) = 1 - I_1(r_i)$ for the events that read $r_i$ originated from allele 1 and allele 2, respectively. The complete data likelihood is

\[ L(\Theta \mid R, \eta) = \prod_{i=1}^{m} \prod_{j=1}^{2} \left[ \pi_j \prod_{k=1}^{n} \theta_{kj}^{m(r_i,k)} (1 - \theta_{kj})^{u(r_i,k)} \right]^{I_j(r_i)}, \tag{3.4} \]

where the outer product runs over the $m$ reads, $m(r_i,k)$ and $u(r_i,k)$ are indicators for the methylation state of the read $r_i$ at the $k$-th CpG, and we let $m(r_i,k) = u(r_i,k) = 0$ when the $k$-th CpG is not covered by $r_i$.

The expectation (E) step updates the missing data $\eta$ given the observed data $R$ and the parameters $\Theta$. We define $p_{ji}$ as the probability that a read $r_i$ comes from allele $j$. These $p_{ji}$ are essentially the expected values of membership in the subsets $\eta_1$ and $\eta_2$ of the partition.
Therefore, $p_{ji}$ can be calculated as the ratio of the probability that the read $r_i$ comes from allele $j$ to the sum of the probabilities that the read comes from either allele. At the $n$-th iteration,

\[ p_{ji}^{(n)} = \Pr(I_j(r_i) = 1 \mid R, \Theta) = \frac{\pi_j \prod_{k=1}^{n} \theta_{kj}^{m(r_i,k)} (1 - \theta_{kj})^{u(r_i,k)}}{\sum_{j'=1}^{2} \pi_{j'} \prod_{k=1}^{n} \theta_{kj'}^{m(r_i,k)} (1 - \theta_{kj'})^{u(r_i,k)}} = \frac{\prod_{k=1}^{n} \theta_{kj}^{m(r_i,k)} (1 - \theta_{kj})^{u(r_i,k)}}{\sum_{j'=1}^{2} \prod_{k=1}^{n} \theta_{kj'}^{m(r_i,k)} (1 - \theta_{kj'})^{u(r_i,k)}}, \tag{3.5} \]

where the parameters on the right-hand side are as estimated in iteration $n - 1$.

The maximization (M) step updates the parameters $\Theta$ to maximize the likelihood:

\[ \theta_{k1}^{(n+1)} = \frac{\sum_{i=1}^{m} p_{1i} \, m(r_i,k)}{\sum_{i=1}^{m} p_{1i}}, \tag{3.6} \]

\[ \theta_{k2}^{(n+1)} = \frac{\sum_{i=1}^{m} p_{2i} \, m(r_i,k)}{\sum_{i=1}^{m} p_{2i}}. \tag{3.7} \]

With these EM steps, we can estimate values for all parameters $\Theta = \{(\theta_{11}, \theta_{12}), \ldots, (\theta_{n1}, \theta_{n2})\}$ and the probabilities for each read originating from either allele.

3.2.3 Identifying allele-specific DNA methylation by model comparison

For any fixed interval, we apply the single-allele model (Equation 3.1) and allele-specific model (Equation 3.2) separately and then determine which one better describes the observed reads in the fixed interval. For model selection, we use Bayesian Information Criterion (BIC) (Schwarz, 1978) and likelihood ratio test separately.

When using BIC as a model selection criterion, the single-allele model has one parameter for each of the $n$ CpGs, and the number of observations is equal to $|R|$:

\[ \mathrm{BIC}(\text{single}) = n \ln |R| - 2 \ln L_1(\Theta \mid R). \tag{3.8} \]

For the allele-specific model, there are two parameters for each CpG:

\[ \mathrm{BIC}(\text{pair}) = 2n \ln |R| - 2 \ln L_2(\Theta \mid R, \eta). \tag{3.9} \]

An interval is identified as having allele-specific methylation if and only if $\mathrm{BIC}(\text{pair}) < \mathrm{BIC}(\text{single})$.

When using likelihood ratio test to select model, the test statistic is calculated as:

\[ D = -2 \ln \frac{L_1(\Theta \mid R)}{L_2(\Theta \mid R, \eta)}, \tag{3.10} \]

where $L_1$ is the likelihood value of the single-allele model and $L_2$ is the likelihood value of the allele-specific model.
The distribution of the test statistic $D$ is approximately a chi-squared distribution with degrees of freedom equal to the difference in the number of free parameters of the two models ($df_2 - df_1$). Since the single-allele model has one parameter for each of the $n$ CpGs while the allele-specific model has two,

\[ D \sim \chi^2_n. \tag{3.11} \]

Then, for each fixed interval, a p-value showing the significance of $D$ can be calculated according to the chi-squared distribution. When the p-value is smaller than a preset cutoff, the fixed interval is determined to have allele-specific methylation.

Table 3.1: Comparison between two model selection criteria. LRT stands for likelihood ratio test. The percentage refers to the portion of AMRs identified by LRT that overlap with those identified by BIC.

                      Human                          Mouse
Criterion             #AMRs   Overlapped with BIC    #AMRs   Overlapped with BIC
BIC                   2812                           1693
LRT (cutoff=0.01)     1200    1193 (99.4%)            999     996 (99.7%)
LRT (cutoff=0.05)     1720    1695 (98.6%)           1215    1208 (99.4%)
LRT (cutoff=0.1)      1986    1931 (97.2%)           1335    1318 (98.7%)
LRT (cutoff=0.2)      2352    2194 (93.3%)           1465    1417 (96.7%)

We compare the performance of these two model selection criteria on both human and mouse data. The two datasets both have 101 bp reads and 11× coverage. Table 3.1 shows that the likelihood ratio test is in general a more restrictive selection criterion than BIC. Most of the AMRs found using the likelihood ratio test (> 93%) overlap with those found by BIC. Although the cutoff selection in the likelihood ratio test provides more flexibility, it also brings some uncertainty to the results.

Computing degrees of freedom for our models. The formula for the BIC is

\[ \mathrm{BIC}(k, N) = k \ln N - 2 \ln L, \]

where $k$ is the number of free parameters to be estimated, $N$ is the number of data points observed and $L$ is the maximum likelihood. We assume the methylation probabilities at each CpG are independent, so $k$ is the number of CpGs (or double the number of CpGs in the allele-specific model) in a genomic interval.
However, the methylation probabilities are estimated depending on adjacent CpGs during the EM process for the allele-specific model. Furthermore, a short read from sequencing usually covers only several CpGs of the genomic interval, not all of them. Thus, the use of the number of reads for $N$ does not meet the definition for the BIC.

A similar issue of confounded methylation probabilities exists when we estimate the degrees of freedom in the likelihood ratio test. Based on the assumption of independent methylation probabilities among CpGs, for a genomic interval, the degrees of freedom of the single-allele model ($df_1$) is the number of CpGs and the degrees of freedom of the alternative allele-specific model ($df_2$) is double the number of CpGs. However, the estimation of methylation probabilities in the allele-specific model involves dependency between overlapping reads, and those CpGs covered by the same read are not independent of each other. Therefore, the number of degrees of freedom is not fully determined by the number of CpG sites. In practice, the best way to estimate the significance of the statistic $D = -2 \ln (L_1 / L_2)$ in the likelihood ratio test is probably through Monte Carlo simulation. Assuming the null hypothesis (single-allele model) is true, semi-simulated data can be generated repeatedly for any genomic interval (see Chapter 4). For each simulated data set, $D$ is calculated by maximizing the likelihood under the null (single-allele model) and alternative (allele-specific model) hypotheses. The significance of an observed value of $D$ is then calculated as the proportion of simulations in which the simulated $D$ equals or exceeds the observed value. Although such a simulation at any genomic interval can solve the degrees of freedom problem, it would require a huge amount of time in a genome-wide scan, and optimizing this process is a topic for future research.
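The EM updates (Equations 3.5 to 3.7) and the BIC comparison (Equations 3.8 and 3.9) can be sketched together in Python as follows. This is a hedged illustration, not the implementation used in this work. Several choices are assumptions made here: reads are dicts from CpG index to 1 (methylated) or 0 (unmethylated); the expectations are initialized at random; estimated probabilities are clamped away from 0 and 1; the M-step denominator sums only over reads covering each CpG (which coincides with Equations 3.6 and 3.7 when every read covers every CpG); and the partition used in $L_2$ assigns each read to its more probable allele.

```python
from math import exp, lgamma, log
import random

EPS = 1e-6  # assumption: keep probabilities off 0 and 1 so logs stay finite

def fit(reads, weights, n):
    """Weighted methylated fraction at each of n CpGs (M step)."""
    num, den = [0.0] * n, [0.0] * n
    for w, read in zip(weights, reads):
        for k, state in read.items():
            num[k] += w * state
            den[k] += w          # only reads covering CpG k contribute
    return [min(max(num[k] / den[k], EPS), 1 - EPS) if den[k] > 0 else 0.5
            for k in range(n)]

def read_loglik(read, theta):
    """ln Pr(read | allele) given that allele's methylation probabilities."""
    return sum(log(theta[k]) if s else log(1 - theta[k])
               for k, s in read.items())

def em(reads, n, iters=50, seed=0):
    """Return (theta, p1): per-allele probabilities and Pr(read from allele 1)."""
    rng = random.Random(seed)
    p1 = [rng.random() for _ in reads]   # random initial expectations
    theta = [[0.5] * n, [0.5] * n]
    for _ in range(iters):
        theta = [fit(reads, p1, n), fit(reads, [1 - p for p in p1], n)]
        new_p1 = []
        for read in reads:               # E step; pi_1 = pi_2 = 0.5 cancels
            a, b = read_loglik(read, theta[0]), read_loglik(read, theta[1])
            top = max(a, b)              # subtract the max to avoid overflow
            ea, eb = exp(a - top), exp(b - top)
            new_p1.append(ea / (ea + eb))
        p1 = new_p1
    return theta, p1

def bic_single(reads, n):
    theta = fit(reads, [1.0] * len(reads), n)
    loglik = sum(read_loglik(r, theta) for r in reads)
    return n * log(len(reads)) - 2 * loglik

def bic_pair(reads, n, theta, p1):
    loglik, size1, size = 0.0, 0, len(reads)
    for read, p in zip(reads, p1):
        j = 0 if p >= 0.5 else 1         # hard assignment (an assumption)
        size1 += (j == 0)
        loglik += read_loglik(read, theta[j])
    # ln of the Binomial(|R|, 0.5) prior on the partition size (Equation 3.3)
    loglik += (lgamma(size + 1) - lgamma(size1 + 1)
               - lgamma(size - size1 + 1) + size * log(0.5))
    return 2 * n * log(size) - 2 * loglik

def is_amr(reads, n):
    theta, p1 = em(reads, n)
    return bic_pair(reads, n, theta, p1) < bic_single(reads, n)

# Hypothetical toy intervals with three CpGs and twelve reads each.
allelic = [{0: 1, 1: 1, 2: 1}] * 6 + [{0: 0, 1: 0, 2: 0}] * 6
uniform = [{0: 1, 1: 1, 2: 1}] * 12
print(is_amr(allelic, 3), is_amr(uniform, 3))
```

The same `em` output could feed a likelihood ratio statistic instead of the BIC; the Monte Carlo calibration discussed above would then wrap `em` in a loop over data simulated under the single-allele model.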
3.3 Practical issues in genome-scale applications

We identify AMRs genome-wide by using a fixed-width sliding window (i.e. a fixed number of CpG sites) and determining for each window whether the single-allele or allele-specific model better describes the data. Intervals in close proximity are merged, and we also exclude intervals overlapping LSU rRNA genes from our final analyses, as we suspect problems with their assembly in the reference genome.

Preprocessing BS-seq data. For accurate and efficient application of our model, we take the following preprocessing steps before ASM identification. Bisulfite sequencing data are mapped with RMAPBS (Smith et al., 2009) after removing adaptor sequences. Only one read per mapping location is retained, to eliminate bias from PCR duplicates. Paired-end reads whose two ends map within 1000 bp of each other are merged into a single read, possibly including a spacer consisting of N characters.

Because we are only interested in DNA methylation states, we restrict our attention to CpG sites both in the genome and in the reads, and the sequence alphabet is treated as binary (methylated and unmethylated). Therefore, all reads are converted from genomic coordinates to CpG coordinates, and all non-CpG positions are removed from each read. The characters in the converted reads are C, T and N, indicating methylated, unmethylated, and unknown. Because CpG density varies, in general our reads have variable widths. Only reads with at least one non-N character are retained after this conversion to the methylation state alphabet and CpG coordinates. When processing reads, positions with N are ignored completely.

Issues related to selecting a sliding window size.
In selecting a window size, the two main considerations (other than computational speed) are that (1) the window must be small enough that AMR boundaries are identified with the desired resolution, and (2) the window must be large enough that we can leverage as much information as possible from the overlapping reads. In general, no single window size will optimally identify AMRs through the entire genome, and different data sets will likely benefit most from different window sizes (e.g. based on the average number of CpGs per read and the total amount of data).

To select the window size of 10 CpGs, we tested windows of size 5, 10, 15 and 20 using blood cell methylomes (described in detail in Chapter 5). We examined how these window sizes identify known AMRs in the H19, GNAS, SGCE, SNRPN, KCNQ1, ZIM2 and MEG3 loci. When experimental technology produces longer reads, it is likely that a larger window size will capture a much greater amount of information about how the reads corresponding to the same allele overlap. However, using a larger window size will still blur the boundaries of AMRs, and will potentially cause smaller AMRs to be missed. When a better gold standard training set exists, we will be in a better position to optimize parameters such as the window size.

Rationale for merging nearby AMR fragments. As described, we identified AMRs by applying our model in a sliding window along the chromosomes, and any identified AMR “fragments” were merged if they were within 1 kbp of each other. We applied the method to 5 human blood methylomes: hematopoietic stem and progenitor cells (HSPC), B-cells (BCell), neutrophils (Neut), and CD133+ cord blood cells (CD133) produced by Hodges et al. (2011), and peripheral blood mononuclear cells (PBMC) produced by Li et al. (2010). More details can be found in Section 5.2 of Chapter 5.
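The merging step can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the code used in this work: fragments are taken to be (start, end) base-pair coordinates on a single chromosome, the function name and the example coordinates are hypothetical, and only the 1 kbp cutoff comes from the text.

```python
def merge_fragments(fragments, max_gap=1000):
    """Merge (start, end) intervals on one chromosome separated by <= max_gap bp."""
    merged = []
    for start, end in sorted(fragments):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous AMR fragment: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]

# Hypothetical fragments: the first two are 400 bp apart, the third is 2.6 kbp away.
fragments = [(1000, 1500), (1900, 2400), (5000, 5600)]
print(merge_fragments(fragments))   # [(1000, 2400), (5000, 5600)]
```

Sorting first makes the single pass sufficient; fragments further apart than the cutoff stay separate, whatever the reason for the gap between them.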
[Figure 3.4 tracks: HSPC, Neutrophil, BCell, CD133, PBMC; annotation tracks: H19, AMRs, CGI; region: human chr11:1,968,000-1,995,000 (hg18); scale bar: 10 kb]

Figure 3.4: AMRs for blood methylomes at the H19 locus before and after merging. The black blocks are AMR “fragments” identified initially, and the red blocks are the merged AMRs.

Some motivation for merging nearby AMR fragments can be found in Figure 3.4, which shows the difference between the AMRs before and after merging for the blood methylomes at the ICR of the imprinted H19 locus. In this case, due to fluctuations in coverage through the 5 kbp H19 ICR (which has been experimentally determined), several fragments of AMRs were identified initially. After merging, the intervals covered in the HSPC, BCell, neutrophil and CD133+ cord blood methylomes were very similar and corresponded to the known interval of ASM. Such a method will always fail to join fragments that are farther apart than the cutoff, as illustrated for the PBMC methylome.

Removal of LSU-rRNA genes from predictions

We observed that several of our top identified AMRs (i.e. those most consistent across methylomes) overlapped LSU-rRNA genes. Such a finding would be consistent with reports of dosage compensation, analogous to X chromosome inactivation, for rRNA genes (Schlesinger et al., 2009). However, we also noticed that the number of reads mapping over these regions was generally much higher than in other top identified AMRs. BLASTing several of these against the NCBI nr database revealed that most of them matched only one location in hg18, but matched additional locations in newer assemblies of chromosomes, frequently even newer than those included in hg19. We therefore decided to remove these from our predictions, as we believe they are likely artifacts representing methylation states from multiple genomic intervals superimposed on a single interval.
3.4 An algorithmic approach to optimize ASM boundaries

Our genome-wide AMR identification is based on testing for allele-specific methylation in sliding windows along each chromosome. We also designed an algorithm that does not require a sliding window, allowing us to optimize the boundaries of the identified AMRs and so locate them more precisely. This algorithm is much more computationally expensive, and so is not appropriate for genome-wide application. The method uses scores based on the likelihoods for either one or two alleles, and is equivalent to testing all ways of partitioning a genomic interval into alternating sub-intervals of allele-specific and single-allele methylation. We do not use the BIC or a likelihood ratio test in this method, but instead use a heuristic penalty term equal to a linear function of the number of reads inside the AMR to offset the difference in model complexity between the allele-specific and single-allele models. This is similar to the Akaike information criterion (AIC) (Akaike, 1974). Because of the logarithmic function in the BIC, it could not be computed incrementally in the dynamic programming recurrence presented below. The effect of this different penalty term is increased sensitivity, but also decreased specificity. The method is therefore only suitable for regions where prior information tells us we should find an AMR and our goal is to locate the boundaries of that AMR. The importance of this task is evident from examples, such as the PEG10 (Ono et al., 2001, 2005) and GNAS (Fröhlich et al., 2010; Williamson et al., 2006) promoters, where precise boundaries seem to distinguish allelic states of nearby promoters.

Dynamic programming algorithm for refining AMRs.

Let $L_2(i, j)$ denote the maximum likelihood of the two-allele model using only CpGs $i$ through $j$, as estimated by EM, and let $L_1(i)$ denote the likelihood of the single-allele model computed only for the $i$-th CpG.
For CpG $i$, we use $\mathrm{score}_1(i)$ to indicate the maximum likelihood of the interval $[1, i]$ assuming the $i$-th CpG has single-allele methylation, and $\mathrm{score}_2(i)$ is the maximum likelihood of the interval $[1, i]$ with the $i$-th CpG as the end of an AMR. Assume the size distribution of non-AMRs is a geometric distribution with parameter $\pi$, and the size distribution of AMRs ($f_2$) is arbitrary. Then we use the recurrences

$$\mathrm{score}_2(i) = \max_{1 \le i' < i} \left\{ \log L_2(i', i) + \log f_2(i - i') + \mathrm{score}_1(i') \right\} \tag{3.12}$$

and

$$\mathrm{score}_1(i) = \log L_1(i) + \max \begin{cases} \mathrm{score}_2(i-1) + \log \pi, \\ \mathrm{score}_1(i-1) + \log(1 - \pi) \end{cases} \tag{3.13}$$

to compute the maximal values of likelihoods for partial segmentations of the data up to each $i$. We record $\mathrm{score}_1$ and $\mathrm{score}_2$ for each CpG. The estimated optimal value is found at the $n$-th CpG, and a traceback provides the precise locations of AMRs. In practice we impose a minimum size (10 CpGs) on the AMRs and on the spaces between AMRs.

[Figure 3.5 flow: start to identify AMRs in a DNA sequence; ASM detection (determine ASM based on BIC or LRT using a sliding window); merge nearby windows with ASM and output the final set of AMRs]

Figure 3.5: Flow chart of the AMR identification. Starting from bisulfite sequencing reads, sliding windows are used to scan the DNA sequence. Fixed intervals with ASM are first identified based on BIC or likelihood ratio test (LRT). Nearby intervals with ASM are then merged and the final set of AMRs is output.

The function $f_2$ is described as “arbitrary” above because the value of $\mathrm{score}_2$ cannot be built up incrementally, and each individual value of $\mathrm{score}_2$ must be computed using EM. In this context no one duration distribution will lead to faster computation, so there is no speed benefit to using a geometric distribution for the sizes of AMRs in our scoring function. However, we did not evaluate other distributions and simply used a geometric distribution for $f_2$.
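
The recurrences (3.12) and (3.13) can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work. The callables `log_l1`, `log_l2` and `log_f2` are hypothetical stand-ins for the single-allele log-likelihood, the EM-fitted two-allele log-likelihood and the AMR size distribution. The boundary condition (an empty prefix scores 0) is assumed, the minimum-size constraint of 10 CpGs is omitted, and the parameter name `pi` is illustrative. Indices follow the recurrences as written, so a reported pair `(i_prime, i)` is the argument pair passed to `log_l2`.

```python
import math


def refine_amrs(n, log_l1, log_l2, log_f2, pi):
    """Segment CpGs 1..n into AMR and non-AMR stretches (Eqs. 3.12, 3.13).

    log_l1(i):     single-allele log-likelihood of CpG i (hypothetical callable)
    log_l2(ip, i): two-allele log-likelihood of CpGs ip..i (hypothetical callable)
    log_f2(k):     log-probability of the AMR size term i - ip (hypothetical)
    pi:            geometric parameter for non-AMR sizes, 0 < pi < 1
    Returns (best log-score, list of (ip, i) pairs chosen as AMRs).
    """
    neg_inf = float("-inf")
    score1 = [neg_inf] * (n + 1)
    score2 = [neg_inf] * (n + 1)
    back1 = [1] * (n + 1)  # state (1 or 2) chosen at i - 1 in Eq. 3.13
    back2 = [0] * (n + 1)  # i' chosen in Eq. 3.12
    score1[0] = 0.0        # assumed boundary condition: empty prefix

    for i in range(1, n + 1):
        # Eq. 3.13: CpG i is single-allele.
        stay = score1[i - 1] + math.log(1.0 - pi)
        leave_amr = score2[i - 1] + math.log(pi)
        back1[i] = 2 if leave_amr > stay else 1
        score1[i] = log_l1(i) + max(stay, leave_amr)
        # Eq. 3.12: CpG i is the end of an AMR; every candidate i' needs
        # its own two-allele fit, so nothing is reused between candidates.
        for ip in range(1, i):
            cand = log_l2(ip, i) + log_f2(i - ip) + score1[ip]
            if cand > score2[i]:
                score2[i], back2[i] = cand, ip

    # Traceback from the n-th CpG.
    amrs = []
    state = 2 if score2[n] > score1[n] else 1
    i = n
    while i > 0:
        if state == 2:
            amrs.append((back2[i], i))
            i, state = back2[i], 1
        else:
            state, i = back1[i], i - 1
    amrs.reverse()
    return max(score1[n], score2[n]), amrs
```

The inner loop makes the cost quadratic in the number of CpGs, with one two-allele fit per pair, which is in line with the remark that this procedure is only practical for targeted regions.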
The value of $\pi$ and the corresponding parameter for $f_2$ were set by assuming that the mean AMR size was 100 CpGs, and that the mean inter-AMR distance was 10,000 CpGs.

3.5 Summary

We have provided a solution for ASM identification based solely on BS-seq data. Two statistical models, one for AMRs (Equation 3.2) and one for non-AMRs (Equation 3.1), are designed to fit the observed reads, and intervals better described by the allele-specific model are defined as AMRs. To implement a genome-wide scan for AMRs, we use sliding windows, and each window is an interval to be tested. Nearby intervals with ASM are then merged and output. Figure 3.5 is the overall flow chart of our AMR identification method. For greater accuracy, dynamic programming is used to refine the boundaries of the identified AMRs.

Chapter 4
Simulation

We conducted simulations to evaluate how the performance of our model relates to several critical parameters of the underlying data set. To reflect the performance characteristics for real data sets, we used a strategy we call semi-simulated data, in which methylation states are simulated within actual reads from BS-seq experiments. We tested our method for fixed genomic intervals from different regions with varied CpG densities. Although the sensitivity was lower for CpG-poor regions, our results indicate that technical characteristics of existing public methylomes (i.e. read length and coverage) are sufficient to accurately identify AMRs, and the false-discovery rate is estimated as less than 0.01.

4.1 The concept of semi-simulated BS-seq data

In semi-simulated data, the locations of mapped reads are taken from real data, as are the locations of CpGs within reads and the underlying reference genome. The methylation states inside those reads are determined according to randomly generated allele-specific or single-allele methylation profiles.
For each CpG, we randomly generate two methylation profiles by sampling individual CpG methylation levels as Beta variates skewed towards 0 or 1 (e.g. a Beta distribution with mean 0.75 for one allele and 0.25 for the other, with the variance also controlled). For CpGs designated as non-AMR, the methylation probability of both alleles is set to one of the two profiles, chosen at random. In this way the average methylation level through a region is always roughly 0.5, even for single-allele simulations. Then we assign each read with equal probability to one of the two alleles, and the methylation states of the CpGs within the read are sampled according to the probabilities given by the methylation profile corresponding to that allele. Mimicking bisulfite conversion, all unmethylated read cytosines are converted to thymines.

4.2 Model validation for fixed intervals of ASM

With current methylomes from BS-seq, we expected the variation in coverage along chromosomes to be a critical factor for the performance of our model. In addition, the variation in inter-CpG distance may prevent our method from capturing ASM in regions of low CpG density for a fixed read length. We examined how well our method could identify ASM in a given genomic interval by manipulating three independent variables:

- Mean coverages were {5, 10, 15}, corresponding to current methylomes from BS-seq.
- Read lengths were {50, 100, 150} bases, corresponding roughly to current short read sequencing technologies.
- CpG density distributions took 3 different settings: CGIs defined as in Gardiner-Garden and Frommer (1987), non-CGI promoters defined as 1 kb upstream of a RefSeq TSS but not CGIs, and randomly sampled genomic background with CpG density (O/E) between 0.2 and 0.4.

For each combination of variables, 100 regions were randomly selected from one of the three sets. Each region was then simulated 10 times as an AMR and 10 times as a non-AMR. In total, there were 2000 = 2 × 10 × 100 data points in one simulation.
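
The semi-simulation procedure of Section 4.1 can be sketched as below. This is an illustrative sketch only. The function name `simulate_read_states`, the `concentration` parameter used to control the Beta variance, and the choice to pick the shared non-AMR profile independently at each CpG are assumptions; the text does not specify them. Reads are represented simply as lists of the CpG indices they cover, taken from the real read layout.

```python
import random


def simulate_read_states(reads, n_cpgs, is_amr, rng,
                         mean_hi=0.75, concentration=20.0):
    """Assign simulated methylation states to real read layouts.

    reads:   list of reads, each a list of CpG indices (0-based) it covers
    n_cpgs:  number of CpGs in the region
    is_amr:  True to simulate allele-specific, False for single-allele
    rng:     random.Random instance
    Returns one string per read over the alphabet C (methylated) and
    T (unmethylated, i.e. converted by bisulfite treatment).
    """
    # Beta(a, b) with mean mean_hi and a + b = concentration (assumed form).
    a = mean_hi * concentration
    b = (1.0 - mean_hi) * concentration
    allele_a = [rng.betavariate(a, b) for _ in range(n_cpgs)]  # skewed to 1
    allele_b = [rng.betavariate(b, a) for _ in range(n_cpgs)]  # skewed to 0
    if not is_amr:
        # Both alleles share one profile, picked at random per CpG (assumed),
        # so the regional average stays near 0.5.
        shared = [rng.choice(pair) for pair in zip(allele_a, allele_b)]
        allele_a = allele_b = shared

    simulated = []
    for cpg_indices in reads:
        # Each read comes from either allele with equal probability.
        profile = allele_a if rng.random() < 0.5 else allele_b
        simulated.append("".join(
            "C" if rng.random() < profile[j] else "T" for j in cpg_indices))
    return simulated


# Hypothetical layout: three reads over a region with five CpGs.
layout = [[0, 1, 2], [1, 2, 3, 4], [4]]
print(simulate_read_states(layout, 5, True, random.Random(0)))
```

Because only the states are simulated, coverage and the number of CpGs per read are inherited unchanged from the real data.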
To calculate the variances of specificity and sensitivity, we repeated the simulations 100 times for each variable combination.

Specificity was generally very high (99%) for all simulation parameter combinations, reflecting our conservative model selection criterion (Equations 3.8 and 3.9). In contrast, sensitivity showed greater dependence on properties of the data sets. Sensitivity was higher for regions of higher CpG density, as expected since our model depends on the relationships between CpG states inside a read. As shown in Figure 4.1, inside CGIs sensitivity reached above 95% for all read lengths when the mean coverage was above 10×. Sensitivity reached 70% for intergenic regions but required both 10× coverage and a read length of 100 bp to compensate for the decrease in CpG density. As expected, greater coverage and read length improved accuracy, and the effect of read length is equivalent to that of CpG density. These results indicate that methylomes with read lengths around 100 bp and mean coverage above 10× appear sufficient for our model to accurately identify ASM. These criteria are met by most existing methylomes from BS-seq experiments.

[Figure 4.1 panels: (A) CGIs, (B) non-CGI promoter, (C) non-CGI intergenic; y-axis: sensitivity (0.2 to 1.0); x-axis: coverage (5X, 10X, 15X); series: read lengths 50 bp, 100 bp, 150 bp]

Figure 4.1: Sensitivity of AMR identification based on semi-simulated data. Coverages of 5×, 10× and 15×, and read lengths of 50, 100 and 150 bp were used. CpG densities were controlled by simulating within (A) CGIs, (B) non-CGI promoter regions, and (C) non-CGI intergenic regions.

4.3 Estimating false-discovery rate using semi-simulated data

We used the idea of semi-simulated data to obtain bounds on the false-discovery rate (FDR) for the 5 blood methylomes: hematopoietic stem/progenitor cells (HSPC), B-cells, neutrophils, CD133+ cord blood cells and peripheral blood mononuclear cells (PBMC).
FDR is defined as the expectation of the ratio of false positives to the sum of true positives and false positives. Our procedure was as follows. Using the real data from reads, we randomly shuffled the methylation states corresponding to each CpG site. In other words, the methylation states were collected from all reads mapping over a specific CpG site, and then randomly permuted before being assigned back to those reads. This preserves exactly the likelihood for any interval under our single-allele model. We used chr10, and we did 1000 such random experiments for each of the 5 blood methylomes. This provided a false-positive rate (Type I error rate) that can be used to bound the FDR (Table 4.1). Since in each case in Table 4.1 the number of AMRs identified under our null hypothesis is less than 0.1, we may estimate an upper bound on the FDR as 0.1/x, where x is the number of AMRs identified. In all cases this results in an FDR of less than 0.01.

Table 4.1: False-positive rate estimated on semi-simulated data generated from human chr10 using five blood methylomes from uncultured cell samples. The number of identified AMRs in the actual data and randomized data are presented, along with the corresponding estimate of Type I error.

Cell type            Actual data   Randomized data   Type I error
HSPC                 132           0.038             0.00029
Neutrophil           133           0.009             6.8e-05
B-cell               160           0.008             5.0e-05
CD133+ cord blood    138           0.008             5.8e-05
PBMC                 58            0.035             0.0006

Caveat: The major caveat associated with estimating an FDR in the way we have above has to do with the underlying biology. Cell populations grow as mixtures of clones. DNA methylation has a stochastic component that remains poorly understood. At the same time, any stochastic changes in methylation will be preserved due to the mitotic inheritance of the methylation. Therefore, any real methylome will likely by chance contain intervals that truly represent a mixture of two different methylation profiles, yet these may be associated with absolutely no biological function (according to our current understanding). The best way to ensure that identified AMRs are not spurious, therefore, is to analyze replicate experiments where the cells are grown or purified separately. In the case of the methylomes we have analyzed, each comes from a very different population of cells, and therefore spurious AMRs that arise by chance in one population should be absent from the intersection of the AMR sets.

Chapter 5
The genomic landscape of human allele-specific methylation

5.1 Motivation for genome-wide computational analyses

As explained in Section 2.4, a disruption of normal imprinting can cause delayed development, mental retardation and other medical problems. In general, imprinted genes are more vulnerable to the negative effects of deleterious mutations or other loss-of-function events because they are functionally haploid. Despite the importance of imprinted genes in several diseases, their identification, especially in human, has been slow. Methods like the expression- or methylation-based screens described in Section 2.5 are even more difficult for human genes, since the relevant cell types (often early development or germ cells) for fully understanding imprinting are difficult or impossible to use in experiments. It has also been known for some time that culture conditions lead to epigenomic abnormalities (Allegrucci et al., 2007), and as we will demonstrate, ASM appears to be a very fragile epigenomic phenomenon.

Based on BS-seq data, the method presented in Chapter 3 provides the analytic technology for genome-wide ASM scans without relying on genotypic variation, and is therefore able to detect ASM associated solely with parent-of-origin. However, the majority of ASM identified from a single methylome consists of cis-regulated and individual-specific ASM (Section 2.3).
Since imprinting-associated ASM, especially at germline AMRs, is maintained throughout development and should be conserved across cells and between individuals, pooling data from a diversity of cell types and individuals can eliminate cis-regulated ASM and identify conserved imprinting-associated ASM.

The simulation results presented in Chapter 4 demonstrate that our method accommodates the read length and coverage of most currently available data sets. Therefore, we collected 22 human BS-seq datasets from different cell types and analyzed their ASM separately. The intersection of the identified AMR sets provides a likely candidate set for imprinting-associated ASM. Furthermore, these methylomes represent a range of different cells: uncultured cells, embryonic stem cells (ESCs), induced pluripotent stem cells (iPSCs) and cultured differentiated cells. Comparing differences in ASM among these cells sheds light on how imprinting changes under different conditions. For example, iPSCs are derived from somatic cells and are supposed to be similar to natural pluripotent stem cells like ESCs. iPSCs can also be differentiated in culture into cells of all three germ layers. We have included the iPSCs, the somatic cells from which they are derived, ESCs and the differentiated cells from iPSCs in our data collection, providing a complete system to assess the change of ASM and genomic imprinting during the reprogramming process of iPSCs.

5.2 Technical and biological characteristics of methylomes analyzed

We analyzed 22 publicly available methylomes, including 5 uncultured primary cell types, 8 cultured differentiated cell lines, 4 ESCs and 5 iPSCs from the following studies. Additional details about each of the methylomes can be found in Table 5.1.

Hodges et al. (2011) produced 4 uncultured methylomes from blood cells: hematopoietic stem and progenitor cells (HSPC), B-cells (BCell), neutrophils (Neut), and CD133+ cord blood cells (CD133).
Since the first three samples were pooled from 6 unrelated individuals, ASM caused by genetic variants should not be apparent due to the effect of pooling. The last sample was generated from one individual. Li et al. (2010) produced the other uncultured methylome from peripheral blood mononuclear cells (PBMC) of one individual. These 5 uncultured primary blood cells constitute a standard set for profiling DNA methylation in somatic tissues.

The study of Laurent et al. (2010) produced 3 methylomes: newborn human foreskin fibroblasts (NHFF), H9 ESCs (H9ESCLa) and fibroblasts derived from H9 ESCs (FES). Lister et al. (2009) produced methylomes for IMR90 cells (IMR90r1 and IMR90r2) and H1 ESCs (H1ESCr1 and H1ESCr2), 2 replicates each, which we treat as distinct methylomes. In a separate study, Lister et al. (2011) produced methylomes for 10 cell types. Included among these were H9 ESCs (H9ESCLi), adipose-derived stem cells (ADS), adipocytes differentiated from ADS cells (ADSAdipose) and foreskin fibroblasts (FF). iPSCs derived from ADS (ADSiPSC), IMR90 (IMR90iPSC) and FF (FFiPSC69, FFiPSC197 and FFiPSC1911) cells were also profiled, with FF iPSCs taken at 3 different times, the last of which were also profiled after being differentiated in the presence of bone morphogenic protein 4 (FFiPSC1911BMP4). These datasets contain different ESCs, iPSCs and differentiated cells in culture and form a substantial basis for studying ASM under various conditions.

Table 5.1: Characteristics of human BS-seq data sets.

Methylome                   Reference               Uncultured?  Sex  Meth. %  Coverage  CpGs/read  #AMR  Mean size (bp)  CpGs/AMR
B cell                      Hodges et al. (2011)    yes          F    73.6     10        1.9        2482  580             28.2
Neutrophil                  Hodges et al. (2011)    yes          F    75.7     11        2.1        2114  653             38.5
HSPC                        Hodges et al. (2011)    yes          F    74.7     11        1.8        1999  530             27.0
CD133+ Cord blood (CD133)   Hodges et al. (2011)    yes          M    79.0     7         2          1595  559             27.5
PBMC                        Li et al. (2010)        yes          M    64.3     9         1.6        1026  513             19.6
IMR90 (r1)                  Lister et al. (2009)    no           F    64.2     19        1.5        518   785             35.5
IMR90 (r2)                  Lister et al. (2009)    no           F    64.5     21        1.6        667   801             38.4
H1 ESC (r1)                 Lister et al. (2009)    no           M    80.7     13        1.4        215   782             29.9
H1 ESC (r2)                 Lister et al. (2009)    no           M    78.1     17        1.6        572   728             35.2
H9 ESC (H9ESCLa)            Laurent et al. (2010)   no           F    72.3     15        2          2823  902             52.0
FES                         Laurent et al. (2010)   no           F    69.7     15        1.8        1920  954             55.6
NHFF                        Laurent et al. (2010)   no           M    62.3     15        1.7        1247  751             38.9
H9 ESC (H9ESCLi)            Lister et al. (2011)    no           F    81.2     11        1.7        1218  947             61.8
ADS                         Lister et al. (2011)    no           F    63.5     35        1.8        2912  718             40.8
ADS iPSC                    Lister et al. (2011)    no           F    79.2     43        2.1        1624  713             44.1
ADS Adipose                 Lister et al. (2011)    no           F    64.1     28        2.1        4339  739             38.6
IMR90 iPSC                  Lister et al. (2011)    no           F    82.6     11        1.7        1115  832             55.6
FF                          Lister et al. (2011)    no           M    66.4     19        1.7        1362  673             34.7
FF iPSC 6.9                 Lister et al. (2011)    no           M    80.4     12        1.8        904   801             47.2
FF iPSC 19.7                Lister et al. (2011)    no           M    79.8     12        1.8        946   774             45.2
FF iPSC 19.11               Lister et al. (2011)    no           M    79.7     11        1.8        744   737             47.4
FF iPSC 19.11 BMP4          Lister et al. (2011)    no           M    76.8     20        1.7        2043  595             32.5

“Meth. %” is the methylation level. “CpGs/read” is the number of CpGs per read.

All methylomes have similar read length (100 bp) and coverages are all above 10× except CD133 (7×) and PBMC (9×). Technical details of these data sets can be found in Table 5.1. The AMRs found by our method using a sliding window of 10 CpGs are also summarized in the table. The number of AMRs found in each methylome ranges from 215 to 4339, but most are between 1000 and 2000. The mean size of the AMRs centers around 730 bp.

5.3 Allele-specific methylation on the X chromosome, a sanity check

In female mammals, one of the two X chromosomes is silenced to compensate gene dosage between the sexes. The copy of the X chromosome (chrX) that is inactive has high levels of DNA methylation compared to the active chrX, resulting in many AMRs (Migeon, 1990). In contrast, only a single allele from chrX is represented in the male methylome data.
Comparing the results of our analyses between male and female X chromosomes therefore provides a measure of specificity: AMRs identified on chrX in males are likely false positives. In total, 12 of the analyzed methylomes are female. Although coverage on chrX in males is reduced by half, 3 male methylomes approached 10× coverage on chrX (H1 ESC rep 2, FF and FF iPSC BMP4; Lister et al., 2009, 2011). The locations of identified AMRs on chrX are presented in Figure 5.1. The fraction of AMRs from chrX in female methylomes ranges from 15% to 36% with a mean of 24%. For the three male methylomes tested, the fraction is in the range of 1% to 2%. These results further support our conclusion from simulations that specificity is high in our AMR prediction.

X chromosome inactivation is initiated via the XIST gene, which encodes a lncRNA with random allele-specific expression in female somatic cells. Our analyses identified an AMR at the XIST promoter in each female differentiated methylome (Figure 5.2), but not in any of the ESCs, iPSCs or male methylomes. The loss of ASM marks at the promoter of XIST and reactivation of the silenced chrX in females has been reported in both ESCs and iPSCs (Lyon, 1999; Wutz, 2011). Our results further confirm such unstable X chromosome inactivation states in pluripotent cells.

[Figure 5.1 tracks along chromosome X: HSPC, BCell, Neut, FES, H9ESCLi, H9ESCLa, IMR90iPSC, ADS, ADSAdipose, ADS iPSC, FF, FFiPSC1911BMP4, H1ESCr2, IMR90r2, IMR90r1]

Figure 5.1: Locations of AMRs identified on chrX. All female data (pink) are included. Only male methylomes (blue) with sufficient coverage on chrX are shown, as these have coverage reduced by 50% compared with autosomes. Numbers in brackets indicate references for data sources. See Table 5.1 for information about methylomes.
5.4 Genome-wide AMR identification predicts imprinted genes

Since in vitro culture alters DNA methylation to some degree (Allegrucci et al., 2007), AMRs found in cultured cell lines may only reflect culture-induced perturbations. Therefore, after identifying a full set of AMRs for each of the human methylomes, we emphasized the uncultured blood methylomes in compiling sets of high-confidence AMRs, and generally used the remaining cultured methylomes to provide additional supporting evidence. We found 579 autosomal AMRs that are common to at least 3 of the 5 uncultured methylomes (HSPC, BCell, Neut, CD133 and PBMC), 247 common to at least 4/5 and 81 shared across all 5. Table 5.2 presents the 39 AMRs that are common to all 5 uncultured methylomes and proximal to promoters (within 4 kb of a UCSC KnownGene TSS). Among these, 18 AMRs overlap a known imprinted gene. The high concordance between our predictions and known imprinted genes further validates our model and provides strong support for the remaining predictions as candidate imprinted genes.

[Figure 5.2 tracks: AMRs in HSPC, Neut, BCell, CD133, PBMC, H9ESCLa, FES, NHFF, H1ESCr1, H1ESCr2, IMR90r1, IMR90r2, H9ESCLi, IMR90iPSC, ADS, ADSAdipose, ADSiPSC, FF, FFiPSC69, FFiPSC197, FFiPSC1911, FFiPSC1911BMP4; annotation: XIST; region: human chrX:72,972,635-73,000,000 (hg18); scale bar: 10 kb]

Figure 5.2: Allele-specific methylation identified at the XIST promoter. Consistent with earlier findings, allele-specific methylation is found in exactly those methylomes that are (1) female, and (2) not from ESCs or iPSCs.

The regulatory activity of lncRNAs has been observed for most imprinted clusters (O’Neill, 2005). lncRNAs usually reside on the parental chromosome with the unmethylated ICR, and their expression suppresses the activity of the surrounding imprinted genes on the same chromosome. By contrast, the chromosome carrying the methylated ICR has no active lncRNA but expresses other imprinted genes.
Although the ICR appears to be a positive regulator of lncRNA expression, the regulatory mechanism is not unique (Pauler et al., 2007). Among the 39 common AMRs, 20 mark a lncRNA promoter. Such a frequent overlap of identified AMRs and lncRNA promoters suggests that these AMRs might serve as regulators of lncRNAs in imprinted clusters (Koerner et al., 2009).

The differential methylation marks for ICRs are set in gametogenesis and maintained thereafter. To check the parent-of-origin for the AMRs identified, we computed the methylation level in sperm at each of the identified AMRs using data from a previous study (Molaro et al., 2011). Among the 579 autosomal AMRs common to 3/5 uncultured methylomes, 146 are methylated (<50%) in sperm. As indicated in Table 5.2, among the 39 predicted AMRs common across all uncultured cell types, only 3 are methylated in sperm. Among these is the H19 ICR, which is well known to be methylated on the paternal allele (Tremblay et al., 1995). If we use the methylation level in sperm as an indicator of methylation on the paternal allele, these results point to an asymmetry in the paternal and maternal mechanisms of imprinting DNA methylation. The asymmetric methylation between the parental chromosomes may be achieved by different protection mechanisms against demethylation in germ line cells (Reik et al., 2001b). As discussed in Section 2.4, the higher mutation rate in the paternal germ line and the functional dominance of maternally methylated ICRs at the fetal-maternal interface during early embryonic development may also contribute to the imbalance of parental AMRs identified.

Table 5.2: AMRs common to all uncultured cells that overlap gene promoters.

    Gene symbols                    ncRNA?  CGI?  Meth. in Sperm?  ESC  iPSC  Tot
*   GNAS,NESP55                                                    4    5     22
*   GNAS-AS1,GNAS                                                  4    5     22
*   MESTIT1/MEST,MIR335                                            4    5     22
*   SGCE,PEG10                                                     4    5     22
    NHP2L1                                                         4    5     22
*   ZNF597,NAA60                                                   4    5     22
*   SNRPN,SNURF                                                    4    5     22
*   AMPD3,MIR4485,MTRNR2L8                                         4    5     22
    PMF1-BGLAP                                                     4    5     22
    MIR663B,LOC554226,ANKRD30BL                                    4    5     22
    UNC45B                                                         4    5     22
    LINC00273                                                      4    5     22
*   NAP1L5                                                         4    5     21
*   KCNQ1OT1                                                       3    5     21
    LOC284801,MIR663                                               4    4     21
*   PSIMCT-1                                                       2    5     20
*   H19,MIR675                                                     2    5     20
    TRAPPC9,AX748239                                               3    4     20
    CR590796                                                       4    3     20
*   DIRAS3                                                         2    5     19
    AX748049                                                       2    5     19
    BC028329                                                       3    4     19
    ZNF718,ZNF595                                                  3    5     19
    BC023516                                                       3    5     18
    LOC100130522                                                   2    4     18
*   FANK1                                                          2    5     18
*   GNAS                                                           2    3     18
    VTRNA2-1                                                       3    4     17
    MTRNR2L3                                                       4    1     15
*   BLCAP,NNAT                                                     2    2     14
    LOC728024                                                      0    3     14
    RPS2P32                                                        0    3     14
*   ZIM2,PEG3,MIMT1                                                1    0     12
*   MEG3                                                           0    0     11
    LOC100132167                                                   1    0     11
*   HOXA6,HOXA5,LOC100133311                                       0    0     9
    KIAA0934,DIP2C                                                 0    0     8
    LOC440570,AX747988                                             0    0     5
    LOC100335030                                                   0    0     5

Columns indicate whether the gene is non-coding (ncRNA), the AMR overlaps a CGI promoter (CGI) or is hypermethylated in sperm (Meth. in Sperm). Counts indicate the number of ESC, iPSC, total human methylomes, and chimp methylomes in which the AMR is found. * = known imprinted gene.

5.5 Analysis of known imprinting control regions

There are 65 human genes currently validated as imprinted, and we divided them into 32 imprinted clusters. We asked for what proportion of these clusters we identify an AMR shared between cells, and whether these shared AMRs coincide with experimentally validated AMRs. As can be seen from Table 5.3, 24 of the clusters contain validated AMRs, and in 21 of those cases we correctly identify a known AMR common to 4/5 uncultured cells. For the IGF2R and INPP5F clusters, we only identified AMRs shared between 2 and 3 of the uncultured cells, respectively. Both genes have been reported to be monoallelically expressed in an isoform-specific manner (Gregg et al., 2010a; Xu et al., 1993), suggesting a correlation between sporadic ASM and polymorphic imprinting. The AMPD3 gene has been found to be imprinted in placenta (Schulz et al., 2006), but has no validated AMR to our knowledge. Our algorithm finds an AMR shared across all 22 methylomes, indicating a likely candidate for validation.
To our knowledge, no AMRs have yet been identified for the remaining clusters, and our algorithm fails to predict any AMRs that are shared between methylomes.

Knowledge of the location of true AMRs around several of these imprinted genes allowed us to apply a more intensive analysis to examine them with greater sensitivity and precision. We used the dynamic programming algorithm described in Section 3.4 to optimize the locations of AMR boundaries by evaluating each possible AMR size rather than joining overlapping sliding windows. We refer to AMRs identified with this algorithm as refined AMRs.

Table 5.3: Imprinted clusters and associated AMRs. Chr, Start and Size give the locations of AMRs. The last 6 columns indicate the number of cells in which the AMRs are found.

Cluster         Chr    Start      Size  Known ICR  Uncultured  Cultured  ESC  iPSC  Total  Chimp
GNAS            chr20  56847556   3854             5           17        4    5     22     3
                chr20  56858376   6489             5           17        4    5     22     3
                chr20  56896500   2465             5           13        2    3     18     0
SGCE/PEG10      chr7   94121954   5833             5           17        4    5     22     3
MESTIT1/MEST    chr7   129916795  4123             5           17        4    5     22     3
ZNF597,NAA60    chr16  3432778    1992             5           17        4    5     22     2
SNRPN/SNURF     chr15  22750391   2854             5           17        4    5     22     1
AMPD3           chr11  10484081   4190             5           17        4    5     22     0
NAP1L5          chr4   89836887   1685             5           16        4    5     21     3
KCNQ1OT1        chr11  2675962    3054             5           16        3    5     21     0
PSIMCT-1/HM13   chr20  29598251   1312             5           15        2    5     20     3
KCNK9           chr8   141177200  3463             5           15        3    4     20     3
INS-IGF2-H19    chr11  1973052    8263             5           15        2    5     20     0
DIRAS3          chr1   68288125   2154             5           14        2    5     19     3
ZDBF2           chr2   206835953  1861             5           14        3    4     19     2
FANK1           chr10  127574239  3782             5           13        2    5     18     0
BLCAP/NNAT      chr20  35581688   2995             5           9         2    2     14     2
ZIM2/PEG3       chr19  62040531   4409             5           7         1    0     12     0
DLK1/MEG3       chr14  100346727  914              3           2         0    0     5      0
                chr14  100359992  4913             5           6         0    0     11     3
RB1             chr13  47788276   5673             5           2         0    0     7      3
L3MBTL          chr20  41575637   2216             4           15        3    5     19     3
DDC/GRB10       chr7   50816964   1861             4           15        3    4     19     3
PLAGL1/HYMAI    chr6   144369538  2346             4           14        2    4     18     3
FAM50B          chr6   3793744    2166             4           8         0    1     12     3
TCEB3C          chr18  42801219   584              4           7         1    1     11     0
INPP5F          chr10  121567253  1932             3           15        3    5     18     2
IGF2R           chr6   160346324  1361             2           11        0    4     13     1
DLGAP2          chr8   1636691    686              1           1         1    0     2      0
TP73            chr1   3562326    259              1           0         0    0     1      0
ANKRD11         chr16  87861896   242              1           0         0    0     1      0
DLX5            chr7   96484700   5611             0           2         0    0     2      0
ABCA1           chr9   106664562  311              0           1         0    0     1      0
WT1                                                0           0         0    0     0      0
RBP5                                               0           0         0    0     0      0

The imprinted cluster on chr14 consists of 7 genes controlled by the maternally expressed lncRNA MEG3. The region harbors an AMR at the MEG3 promoter, and another intergenic AMR 15 kb upstream of the MEG3 TSS, both paternally methylated (Rocha et al., 2008), with the upstream AMR shown to act as an ICR (Kagami et al., 2008). Our genome-wide scan found the MEG3 promoter AMR in 11/13 differentiated cells. The boundaries of refined AMRs were identified in each uncultured methylome at nearly the exact same location, covering an interval that is hypomethylated in sperm (Figure 5.3). A refined AMR was identified in each uncultured methylome precisely at the known ICR location, which is methylated in sperm. Interestingly, each of the ESC/iPSC methylomes shows full methylation through the ICR, suggesting possible imprinting defects in these cells.

[Figure 5.3 tracks: HSPC, Neutrophil, BCell, CD133, PBMC, Sperm; annotation tracks: MIR2392, MEG3, AMRs, CGI; region: human chr14:100,330,000-100,410,000 (hg18); scale bar: 50 kb]

Figure 5.3: Refined AMRs near MEG3. One AMR consistently appears in the promoter region of the lncRNA MEG3, and another AMR 15 kb upstream of the MEG3 TSS appears in 6 differentiated cells. The red bars are AMRs, and the green bars are CGIs. Vertical bins indicate the methylation levels at CpG sites in each cell.

Imprinted expression in the GNAS locus is highly complex, with maternally, paternally and biallelically expressed transcripts sharing sets of exons (Williamson et al., 2006). This locus includes four AMRs at alternative promoters (NESP55, GNAS-AS1, XLs, Exon A/B) (Fröhlich et al., 2010). We identified refined AMRs at these locations in all uncultured methylomes (Figure 5.4).
Between different methylomes, boundaries of refined AMRs fluctuated by fewer than 10 CpGs and frequently were identified at identical locations. In each case two separate refined AMRs were identified at the GNAS-AS1 and XLs promoters. The consistent location of the refined AMR boundary between the GNAS-A/B and GNAS-1 TSS, which coincides with the center of a CGI, suggests a strict partition of regulatory sequence between these two transcripts.

Figure 5.4: Regions of allele-specific methylation through (A) the GNAS and (B) SGCE/PEG10 loci. In both examples refined AMRs show highly consistent boundaries across methylomes, and each includes an AMR with a precise boundary inside a CGI, distinguishing the regulatory regions of distinct TSS. Regions shown: (A) chr20:56,839,898-56,911,935 (hg18), scale bar 20 kb, with GNAS-AS1, NESP55, XLAS, GNAS-A/B and GNAS-1; (B) chr7:94,113,000-94,134,800 (hg18), scale bar 5 kb, with SGCE and PEG10. Tracks: HSPC, Neutrophil, BCell, CD133, PBMC, Sperm, AMRs, CGI.

The LTR-derived PEG10 and the adjacent SGCE are part of an imprinted gene cluster on chr7 sharing complete synteny with imprinted orthologs in mouse (Ono et al., 2001, 2005). PEG10 and SGCE are separated by less than 100 bp, are divergently transcribed and have a single CGI overlapping both TSS. Our analyses revealed an AMR at their shared promoter in all 22 methylomes, with a positional bias in the direction of PEG10. As can be seen from Figure 5.4, the refined AMRs in uncultured methylomes have identical boundaries precisely between the PEG10 and SGCE TSS at the center of a CGI, similar to the GNAS-A/B case described above. Each refined AMR is fully contained inside the body of PEG10, consistent with the LTR origin of PEG10, which implies that PEG10 carries internal regulatory elements. This internal PEG10 promoter appears responsible for imprinted regulation of both genes, despite the hypomethylation reaching into SGCE in all methylomes.
One plausible scenario is that regulatory elements within the AMR interact with those nearby in the hypomethylated portion of the CGI to regulate SGCE.

5.6 Allele-specific methylation and epigenomic reprogramming in iPSCs

One of the central questions related to the use of iPSCs, in research or therapeutically, is the degree to which they resemble true ESCs. The landmark study of Lister et al. demonstrated significant reprogramming variability between iPSCs (Lister et al., 2011). Evidence from cloning studies suggests that imprinting might be especially difficult to reprogram (Rideout et al., 2001). We assembled the union of all identified AMRs in all methylomes. For each of these AMRs, we computed average methylation in each methylome. We then clustered the methylomes hierarchically according to correlation of methylation levels through these intervals (Figure 5.5). The iPSCs correlated better with ESCs than with the somatic cells from which they are derived, suggesting that ASM has in general been successfully reprogrammed in these iPSCs.

However, we found several examples where the iPSCs appear to diverge from the ESCs in terms of ASM (Figure 5.6). An AMR was identified at the GNAS-1 promoter in 18 of the methylomes, but this interval was hypomethylated in the ADS iPSCs. Similarly, for the AMR identified at the 3’ promoter of ZNF331, the ADS iPSCs are methylated at 50% and resemble differentiated cells more closely than ESCs or other iPSCs, suggesting failed reprogramming of ADS iPSCs at these locations. It has been proposed that a single imprinted cluster might be sufficient to diagnose iPSC reprogramming in mouse (Stadtfeld et al., 2010). The diversity of ASM we observe between iPSCs and even between ESC lines suggests such diagnosis will be more complex in human.
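The clustering step described in Section 5.6 (mean methylation per AMR in each methylome, then hierarchical clustering by correlation) can be illustrated with a minimal Python sketch. This is not the implementation used in this work: the choice of Pearson correlation, average linkage, the distance 1 - r, the function names and the toy values are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length lists (assumes non-zero variance)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def cluster_methylomes(levels):
    """Average-linkage agglomerative clustering on the distance 1 - r.

    levels maps a methylome name to its mean methylation in each AMR
    (same AMR order for every methylome). Returns the merge history as
    a list of (cluster_a, cluster_b, distance) tuples.
    """
    dist = {frozenset((a, b)): 1.0 - pearson(levels[a], levels[b])
            for a, b in combinations(levels, 2)}

    def linkage(pair):
        # Mean pairwise distance between the members of two clusters.
        a, b = pair
        return mean(dist[frozenset((m, n))] for m in a for n in b)

    clusters = [(name,) for name in levels]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical toy values, not taken from the methylomes analyzed here.
toy = {
    "blood_1": [0.50, 0.48, 0.52, 0.10, 0.90],
    "blood_2": [0.52, 0.47, 0.50, 0.12, 0.88],
    "esc_1":   [0.95, 0.90, 0.50, 0.85, 0.92],
    "ipsc_1":  [0.93, 0.91, 0.52, 0.80, 0.95],
}
merges = cluster_methylomes(toy)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

In this toy input the two blood-like profiles and the two pluripotent-like profiles are each merged first, and the two groups are joined last at a much larger distance, which is the qualitative pattern reported for iPSCs and ESCs above.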
5.7 The landscape of allele-specific methylation in human

We applied our method to publicly available human methylomes and validated its accuracy on real data by comparing ASM identified on female and male X chromosomes. Our most consistent predictions across methylomes showed high concordance with known AMRs associated with imprinted genes. The remaining predictions represent likely candidate ICRs for novel imprinted loci, with several overlapping lncRNA promoters and supported by similar analysis at orthologous regions based on low-coverage methylomes from chimp.

Figure 5.5: Clustering of all human methylomes according to their methylation patterns in all identified AMRs. The numbers in cells indicate the correlation of methylation patterns between two cell types, and a higher number corresponds to a darker color. Basically, three clusters are formed: 1) ESCs/iPSCs; 2) cultured differentiated cells; 3) uncultured cells. Methylomes, in the order shown on both axes: PBMC, BCell, CD133, Neut, HSPC, IMR90r2, IMR90r1, ADSAdipose, ADS, FF, NHFF, FES, H9ESCLi, H1ESCr1, ADSiPSC, FFiPSC1911BMP4, IMR90iPSC, H9ESCLa, H1ESCr2, FFiPSC69, FFiPSC197, FFiPSC1911.

Our top predicted autosomal AMRs show remarkable concordance with known AMRs controlling imprinted gene expression. Among the 39 common to all uncultured methylomes and proximal to annotated promoters, 18 are marking known imprinted genes. It appears as though the AMRs that are already known are also those identified most consistently across methylomes.
Figure 5.6: Examples for iPSC reprogramming of AMRs. The 22 methylomes were divided into four groups: uncultured cells, cultured differentiated cells, iPSCs and ESCs. (A) DMR in GNAS EXON1. Some ESCs and iPSCs lost the allele-specific methylation in this region. (B) DMR in ZNF331. All ESCs were hypermethylated, and 4 out of 5 iPSCs were reprogrammed from ASM to such hypermethylation, the exception being ADSiPSC. Some cultured differentiated cells also had both alleles methylated. (C, D) GDMR and SDMR for MEG3. All ESCs and iPSCs showed hypermethylation in these regions. Some cultured differentiated cells lost the ASM marks as well. Each panel plots mean methylation (0.0 to 1.0) for the methylomes in the order HSPC, Neut, BCell, CD133, PBMC, FFiPSC1911BMP4, FF, NHFF, FES, IMR90r1, IMR90r2, ADS, ADSAdipose, ADSiPSC, IMR90iPSC, FFiPSC69, FFiPSC197, FFiPSC1911, H1ESCr1, H1ESCr2, H9ESCLi, H9ESCLa.

This finding can be interpreted in several ways. One possibility is that a significant portion of the imprinted genes or clusters, possibly more than half, have already been identified. Estimates of the total number of imprinted genes in human hover around 100-200 (Barlow, 1995; Luedi et al., 2007), and many parent-of-origin disease phenotypes have already been explained through known imprinted genes. Among our remaining predictions (several hundred putative AMRs), many could represent weaker ASM signals, possibly without functional relevance, and these are identified with less consistency across data sets because of a lack of sensitivity in our method. Another possibility is that many genes are imprinted with cell-type specificity, and that the known AMRs are biased towards those that can be identified in a greater variety of cell types.
Our top predictions were based on consistency across the available methylomes from uncultured cells, but these all happened to be from blood.

Another important finding to emerge from our analyses is the precision with which AMRs are defined across cell types. The GNAS and SGCE/PEG10 examples illustrate this strong consistency of AMR boundaries: in both of these examples there are pairs of TSS in very close proximity, sharing CGIs, but for which one has ASM and the other does not. Methods like ours that can delineate the boundaries of AMRs will assist future efforts to precisely map the elements inside these regulatory regions.

The total number of identified AMRs varied substantially across the methylomes analyzed. Much of this variation is likely due to variation in coverage and number of CpGs/read, and relates to sensitivity as illustrated in our simulations. A substantial part may also be due to epigenotypic variation between the cells: those from culture will exhibit the associated epigenomic effects. It has been known for some time that permanent cell lines have altered DNA methylation (Doherty et al., 2000; Fernández-Gonzalez et al., 2004), and more recently it was shown that aberrant methylation is correlated with passage number (Meissner et al., 2008). We also observed some examples which might indicate an effect of culture conditions, such as the MEG3 AMRs where cells differentiated in culture show different properties from other somatic cells. Moreover, any methylome data sets derived from single individuals might exhibit the effects of genotype (Gertz et al., 2011).

Chapter 6

The evolution of allele-specific methylation: comparing human and mouse

6.1 Theoretical perspectives relating allele-specific methylation and evolution

As explained in Chapter 2, imprinted genes are functionally haploid, losing the stability of diploidy and increasing exposure to deleterious alleles.
Thus, the evolutionary significance of imprinting is subject to much debate. The benefit of silencing one parental allele must outweigh the loss of robustness associated with the gene being functionally haploid. Regarding the origin of imprinted genes, as mentioned in Chapter 2, the host defense hypothesis proposes that DNA methylation induced to silence foreign invading DNA insertions causes genomic imprinting (Barlow, 1993). This hypothesis has been strongly supported by the findings from comparative studies (Pask et al., 2009; Suzuki et al., 2005). The reason why multiple imprinted genes might be maintained through mammalian evolution, implying that they are somehow adaptive, is a more complicated issue. Many theories have been advanced to explain why imprinting might be adaptive and add fitness to genotypes that bear imprinted genes. Currently no single theory seems to best explain all empirical observations concerning imprinted genes. We must also keep in mind that there need not be any single reason why imprinting (or any trait) is advantageous. The two major hypotheses to explain why imprinting has been maintained as a general regulatory mechanism are the parental conflict theory (also known as the kinship hypothesis) and the coadaptation theory.

Haig’s parental conflict theory is the best-known explanation for the evolution of genomic imprinting (Haig and Westoby, 1989; Moore and Haig, 1991). This theory states that genomic imprinting is a result of the differing interests of each parent in how genes are expressed during the early development of their offspring. Fitness of the paternal genotype is maximized if the offspring can use as much of the maternal resources as possible. As a consequence, paternally regulated gene expression should maximize the growth of the developing fetus.
Fitness of the maternal genotype is maximized by reserving resources to be distributed to other offspring, which for most mammals (and for most of mammalian evolution) are likely to have different fathers. Maternally regulated gene expression should act to suppress the growth of the fetus, both to protect the mother herself and to balance the nutrient allocation either within or between pregnancies. The identification of imprinting in all placental mammals provides support for this theory. The placenta is the key medium through which maternal resources are passed to the fetus. To some degree the fetus acts like a parasite of the mother, and epigenomic information on the paternal alleles has the opportunity to manipulate how maternal resources are distributed. The trophoblast cells can generate several hormones to adjust maternal physiology for fetal benefit in many mammals including human and mouse (Reik et al., 2003). Furthermore, the fetal trophoblast cells can change the maternal vascular system to isolate the maternal effects on the nutrient volume of the blood reaching the placenta. Thus, paternally expressed genes not only determine the fetal demand for nutrients, but also affect the supply of nutrients from the mother through the placenta. In fact, many known imprinted genes function in the placenta in ways that seem to strongly support the parental conflict theory. For example, the paternally expressed insulin-like growth factor II gene (IGF2) promotes the growth of the fetus and placenta (Constância et al., 2002), and the maternally expressed IGF2 receptor (IGF2R) binds to and degrades it (Kornfeld, 1992). Besides the placenta, the parental conflict theory can also be applied to imprinting in the brain, contributing to the behaviors of developing offspring in contexts like postnatal feeding.
For example, a paternally expressed transcript at the GNAS locus, Gnasxl, has been demonstrated to be required for postnatal development, including suckling, blood glucose and energy homeostasis in both human and mouse (Geneviève et al., 2005; Plagge et al., 2004).

Although the parental conflict theory is strongly supported by many observations, this support is not universal. Hurst and McVean (1997) investigated both maternal and paternal uniparental disomies (UPDs) for chromosome 14 in liveborn humans and found prenatal growth retardation in paternal UPDs as well as post-weaning growth effects in maternal UPDs, which seem contradictory to the predictions of the parental conflict theory. These inconsistent phenomena pose a major challenge to the parental conflict hypothesis.

Coadaptation is a relatively new theory, proposing that genomic imprinting coadaptively regulates mammalian embryonic development and reproductive behavior (Barlow, 1993; Pardo-Manuel de Villena et al., 2000; Varmuza and Mann, 1994). Mother-infant coadaptation implies that offspring that have received both good maternal care and nutrition will be well provisioned and will potentially show good mothering behavior as adults, thereby ensuring the maintenance of the gene through generations. Consider the Peg3 gene, which has received much research focus. A targeted mutation in the embryo confirms its regulatory function in both fetal and postnatal development including placental size, fetal growth, suckling and postnatal growth, weaning age and puberty onset (Curley et al., 2004). The same mutation in a mother results in reduced maternal care, reduced maternal food intake during pregnancy and disrupted milk let-down. Such synchronized traits are coadaptively selected at the same gene in mother and offspring. Consequently, offspring with appropriately imprinted Peg3 get enough maternal nurturing and have healthy growth.
They are also expected to show good mothering as adults. The coadaptive selection thus enhances the spread of the gene across generations. It should be noted that the coadaptation theory emphasizes matrilineal control in linking parent-infant expression.

In summary, the parental conflict theory provides a good explanation for the evolution of imprinted genes governing growth and their regulators. The coadaptation theory is a good supplement that explains the imprinted genes connected to both fetal and maternal behaviors.

6.2 Motivation for comparing allele-specific methylation between human and mouse

Table 6.1: Overall data properties of the 4 mouse methylomes. “Meth.%” is for methylation levels. “CpGs/read” is number of CpGs per read.

Cell                                          Sex  Meth.%  Coverage  CpGs/read  #AMR  Mean size  CpGs/AMR
HSC                                           |    72.0    11        1.8        983   586 bp     16.3
Intestinal stem cell (Intestinal1)            ~    77.6    12        1.9        1870  611 bp     15.4
Intestinal 2nd daughter cell (Intestinal2)    ~    73.9    9         1.8        836   659 bp     16.3
Intestinal differentiated cell (Intestinal3)  ~    74.4    12        2.2        1511  648 bp     14.7

Although there are some exceptions (Kalscheuer et al., 1993), imprinted genes show extensive conservation between mammalian species (Barlow, 1995). As with many aspects of genome science, searching for homology is also a standard approach for identifying new imprinted genes (Kim et al., 1997). However, a clear divergence of imprinting has been observed between human and mouse with only 29 conserved imprinted genes (Morison et al., 2005). According to the parental conflict hypothesis, this could be due to an increased parental conflict in mouse. Mice usually carry 5-15 fetuses per pregnancy, and these are usually associated with multiple matings (Baker et al., 1999; Dean et al., 2006). In contrast, humans usually bear singletons, and are more commonly monogamous. Therefore, the paternal epigenotypes in mice that are more able to acquire resources from the mother will directly out-compete other paternal epigenotypes.
Such different levels of parental conflict result in an expansion of imprinting in mouse to adapt to greater intra-brood competition. In addition, the shorter gestational period and greater reproductive capability of females over a lifetime in mouse may also raise the requirement for maintaining placental-specific imprinting (Frost and Moore, 2010). Therefore, comparing imprinted genes between human and mouse provides an attractive way to test the parental conflict theory and investigate the evolution of the complex imprinted clusters.

The change of imprinting regulation may also be reflected by differential methylation in human and mouse sperm. As shown in Chapter 5, the ASM common to multiple cell types is expected to be biologically functional and is very likely to be associated with imprinting. Comparing such common ASM sets between human and mouse provides inter-species conserved ASM, adding extra confidence to the prediction of imprinting around these regions. The inconsistency of these ASM associated imprinted genes contributes to understanding the evolutionary change over the 85 million years since human and mouse diverged.

6.3 Identification of allele-specific methylation in human and mouse

We analyzed 4 mouse methylomes, which are all from uncultured primary cells: hematopoietic stem cells (HSC) produced by Emily Hodges at Cold Spring Harbor Laboratory (unpublished), intestinal stem cells (Intestinal1), intestinal stem cell 2nd daughter cells (Intestinal2) and intestinal differentiated cells (Intestinal3) from Lucas Kaaij at Hubrecht Institute (unpublished). To make the comparison fair, we used 4 uncultured human cell methylomes: hematopoietic stem and progenitor cells, B-cells, neutrophils, and CD133+ cord blood cells (details of these methylomes and our naming convention can be found in Section 5.2). The overall data properties of these 8 methylomes are similar (Table 6.1 and Table 5.1), providing a good basis for comparison.
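The genome-wide scan applied to these methylomes uses a sliding window of 10 CpGs and merges overlapping windows found to have ASM. The merging step can be illustrated with a minimal Python sketch. The per-window ASM call is assumed to be computed elsewhere and is passed in as a list of booleans; the function name, the argument layout and the rule that two windows overlap when they share at least one CpG are assumptions for illustration, not the implementation used in this work.

```python
def merge_asm_windows(cpg_positions, window_has_asm, window_size=10):
    """Merge overlapping ASM-positive windows into AMR intervals.

    cpg_positions: sorted genomic coordinates of CpGs on one chromosome.
    window_has_asm: booleans; entry i is the externally computed ASM call
        for the window covering CpGs i .. i + window_size - 1.
    Returns a list of (start, end) coordinates, one per merged AMR.
    """
    amrs = []
    for i, has_asm in enumerate(window_has_asm):
        if not has_asm:
            continue
        start = cpg_positions[i]
        end = cpg_positions[i + window_size - 1]
        if amrs and start <= amrs[-1][1]:
            # This window shares at least one CpG with the previous AMR: extend it.
            amrs[-1] = (amrs[-1][0], max(amrs[-1][1], end))
        else:
            amrs.append((start, end))
    return amrs


# Hypothetical example: 25 CpGs spaced 10 bp apart give 16 windows of 10 CpGs.
positions = list(range(100, 350, 10))
calls = [True] * 3 + [False] * 10 + [True] * 3
amrs = merge_asm_windows(positions, calls)
print(amrs)  # [(100, 210), (230, 340)]
```

The first three positive windows share CpGs and collapse into one interval, while the last three start beyond the end of that interval and form a second AMR.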
We scanned these 8 methylomes for ASM using the same procedure as applied in Chapter 5, using a sliding window of 10 CpGs and merging overlapping windows found to have ASM. Table 5.1 and Table 6.1 summarize the AMRs identified. It can be seen that males usually have fewer AMRs due to X chromosome inactivation in females, as discussed in Section 5.3. After excluding this sex effect by looking at autosomal AMRs, mouse has more AMRs than human on average (1406 vs. 1337), as expected.

6.4 Conserved and divergent allele-specific methylation genome-wide

As shown in Chapter 5, the common AMRs of uncultured cells are highly correlated with imprinted genes. Thus, we used AMRs common to 3 out of 4 methylomes in both human and mouse for comparison. There are 445 such common AMRs in mouse and 362 in human. First, we checked the conservation between these common AMRs. We used the coordinate conversion tool liftOver from the UCSC Genome Browser utilities to do ortholog mapping between human and mouse, and 17 AMRs have both conserved sequence and ASM status in both species. Among these 17 conserved AMRs, 15 are associated with known imprinted genes in both human and mouse (Table 6.2). The high concordance between the conserved AMRs and known imprinted genes validates the reduction of false positives in imprinting discovery through inter-species comparison. Although the remaining two have not been reported as imprinted genes yet, the conservation both across different cells and between species provides strong evidence that these are, in fact, imprinted.

Figure 6.1 shows the locations of the two conserved AMRs that lack previous evidence of imprinting in human and mouse. The first resides in the second intron of the gene DIP2C/Dip2c, which encodes a member of the disco-interacting protein homolog 2 family. The protein shares strong similarity with a Drosophila protein which interacts with the transcription factor disco and is expressed in the nervous system.
Imprinted expression of DIP2C is consistent with the suggested critical role of imprinted gene expression in the brain (Davies et al., 2001; Isles and Wilkinson, 2000). Allele-specific expression for DIP2C without local SNP dependency in human lymphoblast cell lines has been previously reported (Morcos et al., 2011), but has not been associated with ASM or parent-of-origin effects. The identification of a conserved AMR in DIP2C/Dip2c in both human and mouse primary cells provides strong evidence that this gene is imprinted. Furthermore, the AMR is at the promoter of the gene KIAA0943 in human, constituting a structure typical of an imprinted cluster. A BLAT match of KIAA0943 is also found in mouse at the same place as the AMR.

The second conserved AMR without reported imprinting is located in the second exon of the HLA-DRB1/H2-Eb1 gene, which encodes a class II histocompatibility antigen. This gene plays a central role in the immune system. While not yet reported as imprinted, parent-of-origin effects have been observed for HLA-DRB1 in human (Chao et al., 2010). Thus, observations consistent with imprinting for both HLA-DRB1 and DIP2C have already been made, and our identification of conserved AMRs both across cell types and between species provides strong support for imprinting and also points to the associated control regions. It is noteworthy that the conserved AMR in DIP2C/Dip2c bears different methylation states in male germ cells of human and mouse. In addition, the CpG densities vary substantially in the loci around DIP2C and Dip2c, which might signal evolutionary change at the level of DNA methylation. It is therefore remarkable both that ASM has been retained in this region and that our method is able to detect it.

6.5 Allele-specific methylation at shared and species-specific imprinted genes

To date there are 83 autosomal imprinted genes found in mouse and 65 in human, catalogued at http://www.geneimprint.com/ (Jirtle, 1997).
Our method identified AMRs in 38 mouse autosomal imprinted genes and 30 human autosomal imprinted genes (within 4 kb of the TSS). Some imprinted genes reside in the same cluster and share one AMR, e.g. Peg10 and Sgce. Therefore, we estimate that around 50% of known imprinted genes have proximal ASM, considering the possibility that some other imprinted genes may be associated with distal AMRs. Table 6.3 summarizes these ASM associated autosomal imprinted genes in both species. Among these ASM associated imprinted genes, 26 have orthologous matches between human and mouse and have AMRs identified in both species. We used liftOver to map these AMRs between species. Most AMRs around these homologous imprinted genes have sequence conservation between human and mouse except two: the AMR at the H19 promoter and the AMR at the SNRPN/Snrpn promoter. This observation implies that the maintenance of ASM in genomic imprinting does not require extreme conservation of the local sequence. The gene H19 acts as a regulatory non-coding RNA, which may explain the variability of its sequence in different species (Brannan et al., 1990). It should be mentioned that although the status of imprinting and ASM are conserved, in several cases the boundaries of the identified ASM appear to shift between species. As shown in Figure 6.2, the AMRs of PEG10/SGCE are positioned towards PEG10 in all human cells (without the optimization described in Section 3.4), but in mouse cells the left boundaries of the AMR extend into the Sgce gene. Such differences between the boundaries of AMRs may suggest a degree of evolutionary change in regulatory mechanisms for these two divergently transcribed imprinted genes.

Figure 6.1: Locations of the two conserved AMRs without reported imprinting between human and mouse. Vertical bars indicate methylation levels. (A) Conserved AMR at the second intron in DIP2C/Dip2c. It is also at the promoter of KIAA0943 in human. A BLAT match of KIAA0943 in mouse is indicated. (B) Conserved AMR at the second exon of HLA-DRB1/H2-Eb1. Regions shown: (A) human chr10:508,774-528,774 (hg18) and mouse chr13:9,486,467-9,506,467 (mm9); (B) human chr6:32,647,540-32,667,540 (hg18) and mouse chr17:34,440,812-34,460,812 (mm9). Scale bar 10 kb.

Table 6.2: Seventeen conserved AMRs between human and mouse. * = known imprinted gene.

Location in human              Human genes        Mouse genes        Location in mouse
Chr    Start      Size                                               Chr    Start      Size
chr10  520313     840          KIAA0934           NA                 chr13  9493814    827
chr10  121568044  947          * INPP5F           * Inpp5f           chr7   135831179  1529
chr14  100361660  1956         * MEG3,FP504       * Meg3             chr12  110778652  1969
chr19  62040959   4926         * ZIM2,PEG3        * Usp29,Peg3       chr7   6680018    5132
chr20  29596610   3046         * HM13,PSIMCT-1    * Mcts2            chr2   152512015  1673
chr20  35582968   1833         * BLCAP,NNAT       * Blcap,Nnat       chr2   157385749  1847
chr20  41575818   1261         * L3MBTL           * L3mbtl           chr2   162773339  891
chr20  56846770   4927         * GNASAS,GNAS      * Nespas,Gnas      chr2   174108800  4562
chr20  56858075   6781         * GNAS             * Gnas             chr2   174119706  7105
chr20  56895744   3284         * GNAS             * Gnas             chr2   174151465  3283
chr4   89837390   988          * NAP1L5           * Nap1l5,Herc3     chr6   58856501   925
chr6   32659825   328          HLA-DRB1           H2-Eb1             chr17  34446396   566
chr6   144369295  2410         * PLAGL1,HYMAI     * Plagl1,Zac1      chr10  12810073   2318
chr7   50817294   859          * GRB10            * Grb10            chr11  11925454   1428
chr7   94122338   5923         * PEG10            * Peg10,Sgce       chr6   4696011    5555
chr7   129915862  4929         * MEST             * Mest,Copg2       chr6   30684931   4422

Table 6.3: Summary of ASM associated imprinted genes in human and mouse. All clusters are divided into three categories: genes having imprint and AMR in both species, genes having imprint in both species but AMR only in one species, and genes imprinted only in one species. Columns for each species: Cluster, Imprinting, CGI, Meth. in sperm, AMR. “Meth. in sperm” means methylated in sperm.

Mouse cluster       Human cluster
Blcap/Nnat          BLCAP/NNAT
Gnas                GNAS
Grb10               GRB10
H13/Mcts2           HM13/PSIMCT-1
H19/Igf2            H19/IGF2
Inpp5f              INPP5F_V2
L3mbtl              L3MBTL
Meg3                MEG3
Copg2/Mest          COPG2/MEST
Nap1l5              NAP1L5
Peg10/Sgce          PEG10/SGCE
Peg3/Usp29          PEG3/ZIM2
Plagl1              PLAGL1/HYMAI
Snrpn/Snurf         SNRPN/SNURF

Cdkn1c/Slc22a18     CDKN1C/SLC22A18
Dlk1                DLK1
Tfpi2               TFPI2
Kcnq1/Kcnq1ot1      KCNQ1/KCNQ1OT1
Magel2              MAGEL2

Commd1/U2af1-rs1    COMMD1
Igf2r/Air           IGF2R
Impact              IMPACT
Peg12
Peg13
Rasgrf1             RASGRF1
Slc38a4             SLC38A4
                    DIRAS3

Figure 6.2: An example of conserved AMR and imprinting but having boundary movement between human and mouse at the Peg10/Sgce (PEG10/SGCE) loci. Vertical bins indicate the methylation levels at CpG sites in each cell. Regions shown: human chr7:94,118,573-94,130,573 (hg18) with tracks HSPC, Neutrophil, BCell, CD133, Sperm, AMRs, CGI; mouse chr6:4,692,306-4,704,306 (mm9) with tracks HSC, Intestinal1, Intestinal2, Intestinal3, Sperm. Scale bar 5 kb.

There are 7 genes with conserved imprinting status in both human and mouse but for which we only identified ASM in one species. We identified ASM near Cdkn1c, Slc22a18, Dlk1 and Tfpi2 in mouse but not in human. Conversely, we identified ASM near KCNQ1, KCNQ1OT1 and MAGEL2 in human but not in mouse. The AMR shared by Cdkn1c and Slc22a18 in mouse is known to be a somatically acquired AMR (Lewis et al., 2004). The absence of the AMR in human CDKN1C/SLC22A18 suggests a mechanism independent of somatic ASM to maintain genomic imprinting. The hypomethylation of the region can still be observed in human male germ cells (Figure 6.3). Both of the AMRs near Dlk1 and Tfpi2 are also somatic AMRs (Monk et al., 2008; Rocha et al., 2008), and other mechanisms like histone modifications are also involved in maintaining the imprinting status. The AMR at the promoter of KCNQ1OT1/Kcnq1ot1 is a germline AMR and is expected to be conserved between human and mouse (Monk et al., 2006). However, this AMR is absent in all three mouse intestinal cells, so it was excluded from the common set of AMRs in mouse.
We used some other mouse methylomes to confirm the conserved presence of the AMR at the promoter of Kcnq1ot1 (data not shown). It has been reported that KCNQ1/Kcnq1 (also known as KVLQT1/Kvlqt1) is a tissue-specific imprinted gene (Gould and Pfeifer, 1998; Lee et al., 1997), although biallelic expression has not yet been reported in intestine. The gene plays an important role in intestine by forming a voltage-gated potassium channel with other proteins and should have strong active expression (Warth et al., 2002). The lack of ASM that we predict at Kcnq1 in all intestinal cells may imply a loss of imprinting and resulting biallelic expression. The AMR in the promoter of MAGEL2 has only been reported in a previous uniparental disomy study (Sharp et al., 2010), and this paternally expressed gene is thought to be controlled by the germline AMR in SNURF/SNRPN (Hanel and Wevrick, 2001). Therefore, our finding is the first to suggest a human-specific somatic AMR for MAGEL2. Taken together, we identify some variation in somatic ASM between species, although the variation does not result in a change of imprinting. We postulate that somatic ASM may be dispensable for preserving imprinting and that alternative epigenetic marks such as histone modifications may help maintain the imprints under the regulation of the germline AMR.

The rest of the known imprinted genes are species-specific imprinted genes, which show a very biased distribution: 9 are mouse specific and 1 is human specific, consistent with ideas that there is less benefit to imprinting in human than in mouse. The imprinting mechanisms for Commd1/U2af1-rs1 (Joh et al., 2009) and Igf2r/Air (Sleutels et al., 2002; Stöger et al., 1993) are very similar, depending on antisense transcripts. Both of the antisense transcripts lack homologs in human and accordingly the imprints disappear.
The germline AMR in COMMD1/Commd1 has completely disappeared in human cells, and the methylation status of the region in sperm also changes (Figure 6.4). However, the germline AMR in the second intron of IGF2R/Igf2r remains in two human cells (Neutrophil and CD133), indicating a weakening of the ASM signal. Besides the AMR at the promoter of the antisense transcript, Igf2r harbors another somatic AMR in its promoter as an effector for its imprinting (Rougeulle and Heard, 2002), and this somatic AMR also disappears in the human IGF2R promoter (Figure 6.5). Our data indicate that the absence of the antisense transcripts and the corresponding ASM in human results in the loss of imprinting for these two genes.

Figure 6.3: An example of a disappearing somatic AMR in CDKN1C in human. (Human panel: chr11:2,859,824-2,865,824, hg18; tracks HSPC, Neutrophil, B cell, CD133 and Sperm. Mouse panel: chr7:150,643,044-150,649,044, mm9; tracks HSC, Intestinal1, Intestinal2, Intestinal3 and Sperm. Annotation track: AMRs. Scale bar: 2 kb.)

The gene Impact is imprinted only in the Glires clade (rodents and lagomorphs), and not in other mammals (Okamura et al., 2000). Some sequence features, including tandem repeats (Okamura et al., 2000) and latent CpG dinucleotide periodicity (Okamura et al., 2008), have been identified for the mouse-specific primary AMR. The mouse-specific imprinted genes Peg12 and Peg13 have no human homologs. The gene Rasgrf1 shows tissue-specific imprinted expression in mouse, although it maintains the AMR throughout development and additional marks are required for tissue-specific signals (Dockery et al., 2009). Slc38a4 is another species- and tissue-specific imprinted gene found in mouse (Mizuno et al., 2002; Zaitoun and Khatib, 2006). Here we report for the first time the absence of the AMR in human SLC38A4, suggesting a variation in regulation similar to that of Rasgrf1.
The human-specific imprinted gene DIRAS3 has no ortholog reported in mouse. In summary, species-specific imprinting arises from both genetic and epigenetic variation, and most changes in imprinting come with an alteration of ASM.

Figure 6.4: An example of a disappearing germline AMR in COMMD1 in human. (Human panel: chr2:61,984,307-62,004,307, hg18; tracks HSPC, Neutrophil, B cell, CD133 and Sperm. Mouse panel: chr11:22,864,284-22,884,284, mm9; tracks HSC, Intestinal1, Intestinal2, Intestinal3 and Sperm; genes Commd1 and Zrsr1. Annotation track: AMRs. Scale bar: 10 kb.)

Figure 6.5: AMRs found around the mouse-specific imprinted gene Igf2r. (Human panel: chr6:160,305,121-160,365,121, hg18; tracks HSPC, Neutrophil, B cell, CD133 and Sperm. Mouse panel: chr17:12,907,572-12,967,572, mm9; tracks HSC, Intestinal1, Intestinal2, Intestinal3 and Sperm; genes Igf2r and Air. Annotation tracks: AMRs and CGI. Scale bar: 20 kb.)

6.6 What have we learned about the evolution of imprinting?

Combining experimental approaches with analytic methods such as ours on appropriately selected mammalian species will help to elucidate the enigmatic role of sexual conflict in influencing mammalian evolution. We applied our method to 4 human methylomes and 4 mouse methylomes and revealed the evolutionary conservation and development of ASM and associated imprinted genes genome-wide. Regions that have both conserved sequences and conserved ASM status are extensively associated with known imprinted genes, with only two exceptions. Although there is no reported imprinting for these two “outliers,” their functions and structures are likely to be related to imprinting. Some evidence of parent-of-origin effects has also been observed for these two genes in some tissues. Thus, grouping conserved ASM across cell types and between species provides a promising way to identify new imprinted genes.

Individual variation of ASM confounds the difference between human and mouse. Choosing AMRs that are conserved across cell types in both species eliminates individual specificity and makes the two species more comparable.
The variation of ASM between human and mouse is consistent with expectations based on parental conflict theory: more ASM is found in mouse than in human. These results point to different roles of germline AMRs and somatic AMRs. The former are more important for the maintenance of imprinting through evolution. Possibly the mechanisms for setting imprinted methylation marks in the developing germ cells of the parents are more constrained. Loss of germline AMRs leads to loss of imprinting (e.g. COMMD1/Commd1). The case of IGF2R/Igf2r indicates an intermediate stage in the disappearance of ASM and the associated imprint in human. On the other hand, somatic ASM does not necessarily correlate with the status of imprinting (e.g. CDKN1C/Cdkn1c). Some alternative signals may exist at these loci to assist imprinting.

The boundaries of the AMRs found between SGCE and PEG10 in human cells seem to change in mouse, and this extension is consistent across cells. Although the underlying mechanism of such boundary movement is still unclear, the consistency implies an evolutionarily conserved change of ASM as a regulator of these two imprinted genes.

Our comparison of ASM between human and mouse confirms widespread conservation but also reveals recent adaptation related to ASM in closely related species. Focusing such analysis and comparison on specific genes helps to understand the role of ASM in the regulatory mechanism of genomic imprinting and the evolutionary development of that mechanism.

Chapter 7 Conclusions

As the best understood and most stable epigenetic mark modulating transcription in mammals, DNA methylation is assumed to be the same on both alleles across the majority of the genome. However, recent studies suggest that allele-specific DNA methylation (ASM) is a common feature across the mouse and human genomes (Gibbs et al., 2010; Schilling et al., 2009; Zhang et al., 2010).
Although most ASM is associated with genetic variation in cis and is thus tissue- and individual-specific, a noticeable proportion of ASM is related to parent-of-origin effects, i.e. genomic imprinting. Imprinted genes are expressed from only one allele, inherited either from the mother (e.g. H19) or from the father (e.g. IGF2). These imprinted genes play important roles in normal mammalian development, especially in fetal, placental and neonatal growth. Their functional haploidy makes imprinted genes vulnerable to mutations and perturbation of expression. Not surprisingly, genomic imprinting defects are implicated in a range of embryonic and fetal abnormalities, including cancers.

The connection between imprinting and DNA methylation was uncovered shortly after the first identification of imprinted genes in mammals. ASM can regulate monoallelic gene expression by differentially marking regulatory regions, which are called imprinting control regions (ICRs). The ASM marks in ICRs are established in germline cells and are maintained throughout all somatic cells. Therefore, using ASM to profile imprinted genes can overcome the difficulty in identifying imprinting caused by its tissue- and developmental stage-specific variation, and at the same time shed light on the regulatory mechanisms of imprinted genes.

In this thesis, we present a computational strategy for identifying ASM that accommodates high-throughput bisulfite sequencing technology for DNA methylation profiling. Our method does not rely on any genotype variation and is based solely on base-pair resolution methylome data. We designed two likelihood models representing single-allele methylation and allele-specific methylation separately. For any genomic interval, the model that best fits the observed reads is selected by either the Bayesian information criterion (BIC) or a likelihood ratio test. We profiled ASM genome-wide using a fixed-width sliding window.
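The model-selection step described above can be illustrated with a short Python sketch that compares a single-allele model with a two-allele model on the reads in one window using BIC. This is a minimal sketch under stated assumptions, not the implementation used in this thesis: the read encoding (1 = methylated, 0 = unmethylated, None = CpG not covered), the equal mixing of the two alleles, the EM initialisation, the parameter counts and the use of the number of reads as the sample size in BIC are all assumptions made for illustration, and every function name is hypothetical.

```python
import math

EPS = 1e-6  # keeps probabilities away from 0 and 1 so logs stay finite


def read_loglik(read, theta):
    """Log-probability of one read given per-CpG methylation probabilities."""
    ll = 0.0
    for state, p in zip(read, theta):
        if state is None:  # this read does not cover the CpG
            continue
        ll += math.log(p if state == 1 else 1.0 - p)
    return ll


def weighted_levels(reads, weights, n_sites):
    """Weighted methylation level at each CpG, clipped to [EPS, 1 - EPS]."""
    levels = []
    for i in range(n_sites):
        num = sum(w * r[i] for r, w in zip(reads, weights) if r[i] is not None)
        den = sum(w for r, w in zip(reads, weights) if r[i] is not None)
        level = num / den if den > 0 else 0.5
        levels.append(min(max(level, EPS), 1.0 - EPS))
    return levels


def loglik_single(reads, n_sites):
    """Maximised log-likelihood when all reads share one methylation profile."""
    theta = weighted_levels(reads, [1.0] * len(reads), n_sites)
    return sum(read_loglik(r, theta) for r in reads)


def loglik_two_alleles(reads, n_sites, n_iter=50):
    """EM fit of two profiles, each read coming from either with probability 0.5."""
    # Assumed initialisation: more methylated reads lean towards allele 1.
    resp = []
    for r in reads:
        obs = [s for s in r if s is not None]
        frac = sum(obs) / len(obs) if obs else 0.5
        resp.append(min(max(frac, 0.05), 0.95))
    total = 0.0
    for _ in range(n_iter):
        # M-step: re-estimate both profiles from the soft assignments.
        theta1 = weighted_levels(reads, resp, n_sites)
        theta2 = weighted_levels(reads, [1.0 - g for g in resp], n_sites)
        # E-step: update assignments and accumulate the mixture log-likelihood.
        total, resp = 0.0, []
        for r in reads:
            l1 = math.log(0.5) + read_loglik(r, theta1)
            l2 = math.log(0.5) + read_loglik(r, theta2)
            m = max(l1, l2)
            lse = m + math.log(math.exp(l1 - m) + math.exp(l2 - m))
            total += lse
            resp.append(math.exp(l1 - lse))
    return total


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a better model."""
    return n_params * math.log(n_obs) - 2.0 * loglik


def call_window(reads, n_sites):
    """Label a window by whichever model has the lower BIC."""
    n = len(reads)
    bic_single = bic(loglik_single(reads, n_sites), n_sites, n)
    bic_pair = bic(loglik_two_alleles(reads, n_sites), 2 * n_sites, n)
    return "allele-specific" if bic_pair < bic_single else "single-allele"


# Ten fully methylated and ten fully unmethylated reads over five CpGs.
example_reads = [[1] * 5] * 10 + [[0] * 5] * 10
print(call_window(example_reads, 5))
```

In this toy window the two-allele model explains the reads far better than a single profile at 50% methylation, so its BIC is lower despite having twice as many parameters. A window in which every read carries the same pattern is labelled single-allele, because the extra parameters buy no improvement in likelihood.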
For more accurate boundaries of allelically methylated regions (AMRs), we used a dynamic programming algorithm to optimize the locations of AMRs by evaluating each possible AMR size rather than joining overlapping sliding windows. We validated our method using semi-simulated data in which methylation states were simulated within actual reads from BS-seq experiments. Our results indicate that the technical characteristics of existing public methylomes (i.e. read length and coverage) are sufficient to accurately identify AMRs. In short, our genotype-independent method provides a feasible tool to identify ASM associated with genomic imprinting, which does not depend on the existence of sequence variation in cis.

We applied the method to 22 human methylomes to identify ASM. The total number and mean size of AMRs identified in each methylome varied due to differences in coverage and read length. Such variation also reflects methylation differences across cells and bias introduced by culture. By intersecting AMRs across methylomes, the most consistent predictions of AMRs showed high concordance with known imprinted genes, validating the method and providing candidate ICRs for imprinted loci. Our predictions also show that AMRs associated with ubiquitously imprinted genes are identified most consistently across methylomes, while the AMRs of some tissue- or cell-type-specific imprinted genes are less consistent across methylomes. For the latter case, grouping methylomes from the same tissue or cell type may provide more sensitivity to find the cell-specific ASM signal. For example, we assessed the reprogramming of ASM in iPSCs by comparing them to ESCs and differentiated cells. Although there is some variation, iPSCs generally acquire ASM signals similar to those of ESCs. The boundaries of AMRs, refined locally by the dynamic programming algorithm, are satisfyingly precise.
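The intersection of AMRs across methylomes mentioned above can be sketched in Python as follows. This is only an illustration under stated assumptions: AMRs are taken to be half-open (start, end) intervals on a single chromosome, each methylome's list is assumed to be sorted and non-overlapping, and the function names are hypothetical rather than those of any tool used in this work.

```python
from functools import reduce


def intersect_two(a, b):
    """Intersect two sorted lists of non-overlapping half-open intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:  # the two current intervals overlap
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(interval_sets):
    """Regions covered by an AMR in every methylome of the collection."""
    if not interval_sets:
        return []
    return reduce(intersect_two, interval_sets)


# Hypothetical AMR calls from three methylomes on one chromosome.
methylome_a = [(100, 200), (300, 400)]
methylome_b = [(150, 350)]
methylome_c = [(0, 180), (320, 1000)]
print(intersect_all([methylome_a, methylome_b, methylome_c]))
# → [(150, 180), (320, 350)]
```

Only the parts of an AMR shared by all methylomes survive, which is why the intersection set favours AMRs that are stable across cells and individuals over those specific to one sample.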
Strong conservation of the boundaries across cell types was observed at several loci. This accuracy raises the possibility of pinpointing the regulatory elements inside these conserved AMRs.

Although mouse and human are closely related placental mammals, there is still a high level of discordance in genomic imprinting between them. Such evolutionary change of imprinting can be explained by their different modes of reproduction under the parental conflict hypothesis. The hypothesis states that paternally derived fetal alleles are selected to demand more resources from the mother than maternally derived fetal alleles, and such differential interests of the parental alleles favor the evolution of genomic imprinting. In mouse, a mother usually carries multiple fetuses that may come from more than one father, so the paternally derived alleles in each fetus compete with the other fetuses to obtain as much nutrient from the mother as possible and maximize survival, even at the expense of half-siblings and the mother. Humans, however, commonly carry a singleton in each pregnancy, so there may be less competition and hence less demand for imprinting. We compare ASM-associated imprinted genes between mouse and human. To exclude individual-specific ASM, we use the intersection set of ASM from 4 uncultured-cell methylomes in each species. As expected, there is more ASM in mouse than in human. We find 17 AMRs that maintain both conserved sequence and conserved ASM status in the two species. Among them, 15 are associated with known imprinted genes, implying a high probability of imprinting for the remaining two genes: DIP2C/Dip2c and HLA-DRB1/H2-Eb1. Both genes have previously been reported to show allele-specific expression in specific cells. Our results further highlight the possibility that they are imprinted. Focusing on the known imprinted genes associated with ASM, the majority have conserved ASM and imprinting status between species.
The variation of ASM can be divided into two scenarios: a change of germline AMRs always leads to a change of imprinting, but a change of somatic AMRs does not necessarily relate to imprinting status. Similar to the previously reported distribution of imprinted genes, more ASM-associated imprinted genes are found to be specific to mouse.

The accumulation of bisulfite sequencing data in many different cell types and species enables a more thorough investigation of ASM-associated genomic imprinting across cells and between species. Induced pluripotent stem cells (iPSCs) are artificial stem cells derived from adult somatic cells by manipulating the expression of specific genes. Although many aspects of iPSCs are similar to natural pluripotent stem cells such as embryonic stem cells (ESCs), the full extent of the similarity is still being assessed. Our method provides a flexible way of measuring the success of reprogramming ASM in iPSCs. Figure 5.5 shows a general view of ASM in iPSCs, which are more similar to ESCs than to the differentiated cells from which they are derived. Although the overall reprogramming of ASM in iPSCs is successful, there is still considerable discordance of ASM between iPSCs and ESCs. Forthcoming data promise a more detailed picture of the comparison of ASM between iPSCs and ESCs.

More comparative data from different species provide opportunities for hypothesis testing. For example, the murine imprinted gene Igf2r shows biallelic expression in most human cells. We observe the loss of germline ASM in IGF2R in some human cells and postulate a gradual loss of ASM in the ICR, and of the corresponding imprinting status, since the divergence of mouse and human. More comparative data for the primate lineage would strengthen this hypothesis.

Experimental validation of our predictions of new imprinted genes is important.
Advances in RNA-Seq technology make the detection of allele-specific expression (ASE) much faster and easier. However, the tissue and developmental specificity of imprinting status may reduce the global correlation between ASM and ASE in any one cell type. Our predictions of ASM, especially ASM conserved across cells and between species, provide feasible guidance for experimental design.

Bibliography

Akaike, H. (1974). A new look at the statistical model identification. Automatic Control, IEEE Transactions on, 19(6), 716–723.
Allegrucci, C., Wu, Y., Thurston, A., Denning, C., Priddle, H., et al. (2007). Restriction landmark genome scanning identifies culture-induced DNA methylation instability in the human embryonic stem cell epigenome. Human molecular genetics, 16(10), 1253–1268.
Allen, E., Horvath, S., Tong, F., Kraft, P., Spiteri, E., Riggs, A., and Marahrens, Y. (2003). High concentrations of long interspersed nuclear element sequence distinguish monoallelically expressed genes. PNAS, 100(17), 9940.
Astuti, D., Latif, F., Wagner, K., Gentle, D., Cooper, W., Catchpoole, D., Grundy, R., Ferguson-Smith, A., and Maher, E. (2005). Epigenetic alteration at the DLK1-GTL2 imprinted domain in human neoplasia: analysis of neuroblastoma, phaeochromocytoma and Wilms’ tumour. British journal of cancer, 92(8), 1574–1580.
Baker, R., Makova, K., and Chesser, R. (1999). Microsatellites indicate a high frequency of multiple paternity in Apodemus (Rodentia). Molecular Ecology, 8(1), 107–111.
Barlow, D. (1993). Methylation and imprinting: from host defense to gene regulation? Science, 260(5106), 309.
Barlow, D. (1995). Gametic imprinting in mammals. Science, 270(5242), 1610.
Barlow, D., Stöger, R., Herrmann, B., Saito, K., and Schweifer, N. (1991). The mouse insulin-like growth factor type-2 receptor is imprinted and closely linked to the Tme locus. Nature, 349(6304), 84–87.
Bartolomei, M., Zemel, S., and Tilghman, S. (1991).
Parental imprinting of the mouse H19 gene. Nature, 351(6322), 153–155.
Barton, S., Surani, M., and Norris, M. (1984). Role of paternal and maternal genomes in mouse development. Nature, 311, 374–376.
Bell, A. and Felsenfeld, G. (2000). Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene. Nature, 405(6785), 482–485.
Bell, A., West, A., and Felsenfeld, G. (1999). The protein CTCF is required for the enhancer blocking activity of vertebrate insulators. Cell, 98(3), 387–396.
Bourchis, D. and Bestor, T. (2006). Origins of extreme sexual dimorphism in genomic imprinting. Cytogenetic and genome research, 113(1-4), 36–40.
Braidotti, G., Baubec, T., Pauler, F., Seidl, C., Smrzka, O., Stricker, S., Yotova, I., and Barlow, D. (2004). The Air noncoding RNA: an imprinted cis-silencing transcript. In Cold Spring Harbor symposia on quantitative biology, volume 69, page 55. Wellcome Trust Public Access.
Brannan, C., Dees, E., Ingram, R., and Tilghman, S. (1990). The product of the H19 gene may function as an RNA. Molecular and cellular biology, 10(1), 28–36.
Chandler, L., Ghazi, H., Jones, P., Boukamp, P., and Fusenig, N. (1987). Allele-specific methylation of the human c-Ha-ras-1 gene. Cell, 50(5), 711–717.
Chao, M., Herrera, B., Ramagopalan, S., Deluca, G., Handunetthi, L., Orton, S., Lincoln, M., Sadovnick, A., and Ebers, G. (2010). Parent-of-origin effects at the major histocompatibility complex in multiple sclerosis. Human molecular genetics, 19(18), 3679–3689.
Chen, R., Pettersson, U., Beard, C., Jackson-Grusby, L., and Jaenisch, R. (1998). DNA hypomethylation leads to elevated mutation rates. Nature, 395(6697), 89–93.
Choufani, S., Shapiro, J., Susiarjo, M., Butcher, D., Grafodatskaya, D., et al. (2011). A novel approach identifies new differentially methylated regions (DMRs) associated with imprinted genes. Genome Research, 21(3), 465.
Constância, M., Hemberger, M., Hughes, J., Dean, W., Ferguson-Smith, A., et al. (2002).
Placental-specific IGF-II is a major modulator of placental and fetal growth. Nature, 417(6892), 945–948.
Crews, D. (2008). Epigenetics and its implications for behavioral neuroendocrinology. Frontiers in neuroendocrinology, 29(3), 344–357.
Curley, J., Barton, S., Surani, A., Keverne, E., et al. (2004). Coadaptation in mother and infant regulated by a paternally expressed imprinted gene. Proceedings of the Royal Society of London, Series B: Biological Sciences, 271(1545), 1303–1309.
Davies, W., Isles, A., and Wilkinson, L. (2001). Imprinted genes and mental dysfunction. Annals of medicine, 33(6), 428–436.
Davis, T., Yang, G., McCarrey, J., and Bartolomei, M. (2000). The H19 methylation imprint is erased and re-established differentially on the parental alleles during male germ cell development. Human Molecular Genetics, 9(19), 2885.
Dean, M., Ardlie, K., and Nachman, M. (2006). The frequency of multiple paternity suggests that sperm competition is common in house mice (Mus domesticus). Molecular Ecology, 15(13), 4141–4151.
DeChiara, T., Robertson, E., Efstratiadis, A., et al. (1991). Parental imprinting of the mouse insulin-like growth factor II gene. Cell, 64(4), 849.
Deltour, L., Montagutelli, X., Guenet, J., Jami, J., and Páldi, A. (1995). Tissue-and developmental stage-specific imprinting of the mouse proinsulin gene, Ins2. Developmental Biology, 168(2), 686–688.
Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38.
Dockery, L., Gerfen, J., Harview, C., Rahn-Lee, C., Horton, R., Park, Y., and Davis, T. (2009). Differential methylation persists at the mouse Rasgrf1 DMR in tissues displaying monoallelic and biallelic expression. Epigenetics, 4(4), 241.
Doherty, A., Mann, M., Tremblay, K., Bartolomei, M., and Schultz, R. (2000).
Differential effects of culture on imprinted H19 expression in the preimplantation mouse embryo. Biology of Reproduction, 62(6), 1526.
El-Maarri, O., Buiting, K., Peery, E., Kroisel, P., Balaban, B., Wagner, K., Urman, B., Heyd, J., Lich, C., Brannan, C., et al. (2001). Maternal methylation imprints on human chromosome 15 are established during or after fertilization. Nature Genetics, 27(3), 341–344.
Ellis, L., Atadja, P., and Johnstone, R. (2009). Epigenetics in cancer: targeting chromatin modifications. Molecular cancer therapeutics, 8(6), 1409–1420.
Fang, F., Hodges, E., Molaro, A., Dean, M., Hannon, G., and Smith, A. (2012). Genomic landscape of human allele-specific DNA methylation. PNAS, 109(19), 7332–7337.
Feinberg, A. (2007). An epigenetic approach to cancer etiology. The Cancer Journal, 13(1), 70.
Feinberg, A., Cui, H., and Ohlsson, R. (2002). DNA methylation and genomic imprinting: insights from cancer into epigenetic mechanisms. In Seminars in cancer biology, volume 12, pages 389–398. Elsevier.
Fernández-Gonzalez, R., Moreira, P., Bilbao, A., Jiménez, A., Pérez-Crespo, M., Ramírez, M., De Fonseca, F., Pintado, B., and Gutiérrez-Adán, A. (2004). Long-term effect of in vitro culture of mouse embryos with serum on mRNA expression of imprinting genes, development, and behavior. PNAS, 101(16), 5880.
Fitzpatrick, G., Soloway, P., and Higgins, M. (2002). Regional loss of imprinting and growth deficiency in mice with a targeted deletion of KvDMR1. Nature Genetics, 32(3), 426–431.
Fitzpatrick, G., Pugacheva, E., Shin, J., Abdullaev, Z., Yang, Y., Khatod, K., Lobanenkov, V., and Higgins, M. (2007). Allele-specific binding of CTCF to the multipartite imprinting control region KvDMR1. Molecular and cellular biology, 27(7), 2636–2647.
Fröhlich, L., Mrakovcic, M., Steinborn, R., Chung, U., Bastepe, M., and Jüppner, H. (2010).
Targeted deletion of the Nesp55 DMR defines another Gnas imprinting control region and provides a mouse model of autosomal dominant PHP-Ib. PNAS, 107(20), 9275.
Frommer, M., McDonald, L., Millar, D., Collis, C., Watt, F., Grigg, G., Molloy, P., and Paul, C. (1992). A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. PNAS, 89(5), 1827.
Frost, J. and Moore, G. (2010). The importance of imprinting in the human placenta. PLoS genetics, 6(7), e1001015.
Gardiner-Garden, M. and Frommer, M. (1987). CpG islands in vertebrate genomes. Journal of Molecular Biology, 196(2), 261–282.
Geneviève, D., Sanlaville, D., Faivre, L., Kottler, M., Jambou, M., Gosset, P., Boustani-Samara, D., Pinto, G., Ozilou, C., Abeguilé, G., et al. (2005). Paternal deletion of the GNAS imprinted locus (including Gnasxl) in two girls presenting with severe pre-and post-natal growth retardation and intractable feeding difficulties. European journal of human genetics, 13(9), 1033–1039.
Gertz, J., Varley, K., Reddy, T., Bowling, K., Pauli, F., Parker, S., Kucera, K., Willard, H., and Myers, R. (2011). Analysis of DNA methylation in a three-generation family reveals widespread genetic influence on epigenetic regulation. PLoS Genetics, 7(8), e1002228.
Gibbs, J., Van Der Brug, M., Hernandez, D., Traynor, B., Nalls, M., Lai, S., Arepalli, S., Dillman, A., Rafferty, I., Troncoso, J., et al. (2010). Abundant quantitative trait loci exist for DNA methylation and gene expression in human brain. PLoS genetics, 6(5), e1000952.
Gould, T. and Pfeifer, K. (1998). Imprinting of mouse Kvlqt1 is developmentally regulated. Human molecular genetics, 7(3), 483–487.
Gregg, C., Zhang, J., Weissbourd, B., Luo, S., Schroth, G., Haig, D., and Dulac, C. (2010a). High-resolution analysis of parent-of-origin allelic expression in the mouse brain. Science, 329(5992), 643.
Gregg, C., Zhang, J., Butler, J., Haig, D., and Dulac, C. (2010b).
Sex-specific parent-of-origin allelic expression in the mouse brain. Science, 329(5992), 682.
Haaf, T. (2006). Methylation dynamics in the early mammalian embryo: implications of genome reprogramming defects for development. DNA Methylation: Development, Genetic Disease and Cancer, pages 13–22.
Haig, D. and Westoby, M. (1989). Parent-specific gene expression and the triploid endosperm. The American Naturalist, 134(1), 147–155.
Hanel, M. and Wevrick, R. (2001). Establishment and maintenance of DNA methylation patterns in mouse Ndn: implications for maintenance of imprinting in target genes of the imprinting center. Molecular and cellular biology, 21(7), 2384–2392.
Hark, A., Schoenherr, C., Katz, D., Ingram, R., Levorse, J., and Tilghman, S. (2000). CTCF mediates methylation-sensitive enhancer-blocking activity at the H19/Igf2 locus. Nature, 405(6785), 486–489.
Hata, K., Okano, M., Lei, H., and Li, E. (2002). Dnmt3L cooperates with the Dnmt3 family of de novo DNA methyltransferases to establish maternal imprints in mice. Development, 129(8), 1983.
Hatada, I., Hayashizaki, Y., Hirotsune, S., Komatsubara, H., and Mukai, T. (1991). A genomic scanning method for higher organisms using restriction sites as landmarks. PNAS, 88(21), 9523.
Hayashizaki, Y., Shibata, H., Hirotsune, S., Sugino, H., Okazaki, Y., Sasaki, N., Hirose, K., Imoto, H., Okuizumi, H., Muramatsu, M., et al. (1994). Identification of an imprinted U2af binding protein related sequence on mouse chromosome 11 using the RLGS method. Nature genetics, 6(1), 33–40.
Hayward, B., Kamiya, M., Strain, L., Moran, V., Campbell, R., Hayashizaki, Y., and Bonthron, D. (1998). The human GNAS1 gene is imprinted and encodes distinct paternally and biallelically expressed G proteins. PNAS, 95(17), 10038.
He, X., Chen, T., and Zhu, J. (2011). Regulation and function of DNA methylation in plants and animals. Cell Research, 21(3), 442–465.
Hendrich, B. and Bird, A. (1998).
Identification and characterization of a family of mammalian methyl-CpG binding proteins. Molecular and cellular biology, 18(11), 6538–6547.
Henry, I., Bonaiti-Pellie, C., Chehensse, V., Beldjord, C., Schwartz, C., Utermann, G., and Junien, C. (1991). Uniparental paternal disomy in a genetic cancer-predisposing syndrome. Nature, 351(6328), 665–667.
Herman, J., Graff, J., Myöhänen, S., Nelkin, B., and Baylin, S. (1996). Methylation-specific PCR: a novel PCR assay for methylation status of CpG islands. PNAS, 93(18), 9821.
Hodges, E., Molaro, A., Dos Santos, C., Thekkat, P., Song, Q., Uren, P., Park, J., Butler, J., Rafii, S., McCombie, W., et al. (2011). Directional DNA methylation changes and complex intermediate states accompany lineage specificity in the adult hematopoietic compartment. Molecular Cell.
Hore, T., Rapkins, R., and Graves, J. (2007). Construction and evolution of imprinted loci in mammals. Trends in Genetics, 23(9), 440–448.
Hurst, L. and McVean, G. (1997). Growth effects of uniparental disomies and the conflict theory of genomic imprinting. Trends in Genetics, 13(11), 436–443.
Isles, A. and Wilkinson, L. (2000). Imprinted genes, cognition and behaviour. Trends in cognitive sciences, 4(8), 309–318.
Jaenisch, R. and Bird, A. (2003). Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nature genetics, 33, 245–254.
Jirtle, R. L. (1997). Geneimprint. http://www.geneimprint.com/.
Joh, K., Yatsuki, H., Higashimoto, K., Mukai, T., and Soejima, H. (2009). Antisense transcription occurs at the promoter of a mouse imprinted gene, Commd1, on the repressed paternal allele. Journal of biochemistry, 146(6), 771–774.
Kagami, M., Sekita, Y., Nishimura, G., Irie, M., Kato, F., Okada, M., Yamamori, S., Kishimoto, H., Nakayama, M., Tanaka, Y., et al. (2008). Deletions and epimutations affecting the human 14q32.2 imprinted region in individuals with paternal and maternal upd(14)-like phenotypes.
Nature Genetics, 40(2), 237–242.
Kalscheuer, V., Mariman, E., Schepens, M., Rehder, H., and Ropers, H. (1993). The insulin-like growth factor type-2 receptor gene is imprinted in the mouse but not in humans. Nature genetics, 5(1), 74–78.
Kamiya, M., Judson, H., Okazaki, Y., Kusakabe, M., Muramatsu, M., Takada, S., Takagi, N., Arima, T., Wake, N., Kamimura, K., et al. (2000). The cell cycle control gene ZAC/PLAGL1 is imprinted - a strong candidate gene for transient neonatal diabetes. Human molecular genetics, 9(3), 453–460.
Kanduri, C., Fitzpatrick, G., Mukhopadhyay, R., Kanduri, M., Lobanenkov, V., Higgins, M., and Ohlsson, R. (2002). A differentially methylated imprinting control region within the Kcnq1 locus harbors a methylation-sensitive chromatin insulator. Journal of Biological Chemistry, 277(20), 18106–18110.
Ke, X., Thomas, S., Robinson, D., and Collins, A. (2002). The distinguishing sequence characteristics of mouse imprinted genes. Mammalian genome, 13(11), 639–645.
Kerkel, K., Spadola, A., Yuan, E., Kosek, J., Jiang, L., Hod, E., Li, K., Murty, V., Schupf, N., Vilain, E., et al. (2008). Genomic surveys by methylation-sensitive SNP analysis identify sequence-dependent allele-specific DNA methylation. Nature Genetics, 40(7), 904–908.
Kim, J., Ashworth, L., Branscomb, E., and Stubbs, L. (1997). The human homolog of a mouse-imprinted gene, Peg3, maps to a zinc finger gene-rich region of human chromosome 19q13.4. Genome research, 7(5), 532–540.
Koerner, M., Pauler, F., Huang, R., and Barlow, D. (2009). The function of non-coding RNAs in genomic imprinting. Development, 136(11), 1771.
Kornfeld, S. (1992). Structure and function of the mannose 6-phosphate/insulinlike growth factor II receptors. Annual review of biochemistry, 61(1), 307–330.
Latham, T., Gilbert, N., and Ramsahoye, B. (2008). DNA methylation in mouse embryonic stem cells and development. Cell and tissue research, 331(1), 31–55.
Laurent, L., Wong, E., Li, G., Huynh, T., Tsirigos, A., Ong, C., Low, H., Kin Sung, K., Rigoutsos, I., Loring, J., et al. (2010). Dynamic changes in the human methylome during differentiation. Genome Research, 20(3), 320.
Lee, M., Hu, R., Johnson, L., and Feinberg, A. (1997). Human KVLQT1 gene shows tissue-specific imprinting and encompasses Beckwith-Wiedemann syndrome chromosomal rearrangements. Nature genetics, 15(2), 181–185.
Lewis, A., Mitsuya, K., Umlauf, D., Smith, P., Dean, W., Walter, J., Higgins, M., Feil, R., and Reik, W. (2004). Imprinting on distal chromosome 7 in the placenta involves repressive histone methylation independent of DNA methylation. Nature genetics, 36(12), 1291–1295.
Lewis, J., Meehan, R., Henzel, W., Maurer-Fogy, I., Jeppesen, P., Klein, F., and Bird, A. (1992). Purification, sequence, and cellular localization of a novel chromosomal protein that binds to methylated DNA. Cell, 69(6), 905–914.
Li, J., Gao, Y., Aach, J., Zhang, K., Kryukov, G., Xie, B., Ahlford, A., Yoon, J., Rosenbaum, A., Zaranek, A., et al. (2009). Multiplex padlock targeted sequencing reveals human hypermutable CpG variations. Genome research, 19(9), 1606–1615.
Li, Y., Zhu, J., Tian, G., Li, N., Li, Q., Ye, M., Zheng, H., Yu, J., Wu, H., Sun, J., et al. (2010). The DNA methylome of human peripheral blood mononuclear cells. PLoS Biology, 8(11), e1000533.
Lindsay, S. and Bird, A. (1987). Use of restriction enzymes to detect potential gene sequences in mammalian DNA.
Lister, R., Pelizzola, M., Dowen, R., Hawkins, R., Hon, G., Tonti-Filippini, J., Nery, J., Lee, L., Ye, Z., Ngo, Q., et al. (2009). Human DNA methylomes at base resolution show widespread epigenomic differences. Nature, 462(7271), 315–322.
Lister, R., Pelizzola, M., Kida, Y., Hawkins, R., Nery, J., Hon, G., Antosiewicz-Bourget, J., O'Malley, R., Castanon, R., Klugman, S., et al. (2011). Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature, 471(7336), 68–73.
Lucifero, D., Mertineit, C., Clarke, H., Bestor, T., and Trasler, J. (2002). Methylation dynamics of imprinted genes in mouse germ cells. Genomics, 79(4), 530–538.
Luedi, P., Hartemink, A., and Jirtle, R. (2005). Genome-wide prediction of imprinted murine genes. Genome Research, 15(6), 875.
Luedi, P., Dietrich, F., Weidman, J., Bosko, J., Jirtle, R., and Hartemink, A. (2007). Computational and experimental identification of novel human imprinted genes. Genome Research, 17(12), 1723.
Lyon, M. (1999). X-chromosome inactivation. Current Biology, 9, 235–237.
Mancini-DiNardo, D., Steele, S., Levorse, J., Ingram, R., and Tilghman, S. (2006). Elongation of the Kcnq1ot1 transcript is required for genomic imprinting of neighboring genes. Genes and Development, 20(10), 1268.
Mayer, W., Niveleau, A., Walter, J., Fundele, R., and Haaf, T. (2000). Embryogenesis: demethylation of the zygotic paternal genome. Nature, 403(6769), 501–502.
McGrath, J. and Solter, D. (1984). Completion of mouse embryogenesis requires both the maternal and paternal genomes. Cell, 37(1), 179–183.
Meissner, A., Mikkelsen, T., Gu, H., Wernig, M., Hanna, J., Sivachenko, A., Zhang, X., Bernstein, B., Nusbaum, C., Jaffe, D., et al. (2008). Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature, 454(7205), 766–770.
Migeon, B. (1990). Insights into X chromosome inactivation from studies of species variation, DNA methylation and replication, and vice versa. Genetical Research, 56(2-3), 91–98.
Mizuno, Y., Sotomaru, Y., Katsuzawa, Y., Kono, T., Meguro, M., Oshimura, M., Kawai, J., Tomaru, Y., Kiyosawa, H., Nikaido, I., et al. (2002). Asb4, Ata3, and Dcn are novel imprinted genes identified by high-throughput screening using RIKEN cDNA microarray. Biochemical and biophysical research communications, 290(5), 1499–1505.
Molaro, A., Hodges, E., Fang, F., Song, Q., McCombie, W., Hannon, G., and Smith, A. (2011).
Sperm methylation profiles reveal features of epigenetic inheritance and evolution in primates. Cell, 146(6), 1029–1041. Monk, D. (2010). Deciphering the cancer imprintome. Briefings in Functional Genomics, 9(4), 329. Monk, D., Arnaud, P., Apostolidou, S., Hills, F., Kelsey, G., Stanier, P., Feil, R., and Moore, G. (2006). Limited evolutionary conservation of imprinting in the human placenta. Proc. Natl. Acad. Sci. USA, 103(17), 6623–6628. Monk, D., Wagschal, A., Arnaud, P., M¨ uller, P., Parker-Katiraee, L., Bourchis, D., Scherer, S., Feil, R., Stanier, P., and Moore, G. (2008). Comparative analysis of human chromosome 7q21 and mouse proximal chromosome 6 reveals a placental-specific imprinted gene, TFPI2/Tfpi2, which requires EHMT2 and EED for allelic-silencing. Genome Research, 18(8), 1270. Moore, T. and Haig, D. (1991). Genomic imprinting in mammalian development: a parental tug-of-war. Trends in Genetics, 7(2), 45–49. Morcos, L., Ge, B., Koka, V ., Lam, K., Pokholok, D., Gunderson, K., Montpetit, A., Verlaan, D., and Pastinen, T. (2011). Genome-wide assessment of imprinted expression in human cells. Genome Biology, 12(3), R25. 75 Morison, I., Ramsay, J., and Spencer, H. (2005). A census of mammalian imprinting. Trends in Genetics, 21(8), 457–465. Moutou, C., Junien, C., Henry, I., and Bonaiti-Pellie, C. (1992). Beckwith-Wiedemann syndrome: a demon- stration of the mechanisms responsible for the excess of transmitting females. Journal of medical genetics, 29(4), 217–220. Nikaido, I., Saito, C., Mizuno, Y ., Meguro, M., Bono, H., Kadomura, M., Kono, T., Morris, G., Lyons, P., Oshimura, M., et al. (2003). Discovery of imprinted transcripts in the mouse transcriptome using large-scale expression profiling. Genome Research, 13(6b), 1402. Okamura, K., Hagiwara-Takeuchi, Y ., Li, T., Vu, T., Hirai, M., Hattori, M., Sakaki, Y ., Hoffman, A., and Ito, T. (2000). 
Comparative genome analysis of the mouse imprinted gene impact and its nonimprinted human homolog IMPACT: toward the structural basis for species-specific imprinting. Genome research, 10(12), 1878–1889. Okamura, K., Wintle, R., Scherer, S., et al. (2008). Characterization of the differentially methylated region of the Impact gene that exhibits Glires-specific imprinting. Genome biology, 9(11), R160. O’Neill, M. (2005). The influence of non-coding RNAs on allele-specific gene expression in mammals. Human Molecular Genetics, 14(suppl 1), R113. Ono, R., Kobayashi, S., Wagatsuma, H., Aisaka, K., Kohda, T., Kaneko-Ishino, T., and Ishino, F. (2001). A retrotransposon-derived gene, PEG10, is a novel imprinted gene located on human chromosome 7q21. Genomics, 73(2), 232–237. Ono, R., Nakamura, K., Inoue, K., Naruse, M., Usami, T., Wakisaka-Saito, N., Hino, T., Suzuki-Migishima, R., Ogonuki, N., Miki, H., et al. (2005). Deletion of Peg10, an imprinted gene acquired from a retrotrans- poson, causes early embryonic lethality. Nature Genetics, 38(1), 101–106. Pardo-Manuel de Villena, F., de la Casa-Esper´ on, E., and Sapienza, C. (2000). Natural selection and the function of genome imprinting: beyond the silenced minority. Trends in Genetics, 16(12), 573–579. Pask, A., Papenfuss, A., Ager, E., McColl, K., Speed, T., and Renfree, M. (2009). Analysis of the platypus genome suggests a transposon origin for mammalian imprinting. Genome biology, 10(1), R1. Pauler, F. and Barlow, D. (2006). Imprinting mechanismsit only takes two. Genes & development, 20(10), 1203–1206. Pauler, F., Koerner, M., and Barlow, D. (2007). Silencing by imprinted noncoding RNAs: is transcription the answer? TRENDS in Genetics, 23(6), 284–292. Peters, J., Wroe, S., Wells, C., Miller, H., Bodle, D., Beechey, C., Williamson, C., and Kelsey, G. (1999). A cluster of oppositely imprinted transcripts at the Gnas locus in the distal imprinting region of mouse chromosome 2. Proc. Natl. Acad. Sci. USA, 96(7), 3830. 
Plagge, A., Gordon, E., Dean, W., Boiani, R., Cinti, S., Peters, J., and Kelsey, G. (2004). The imprinted signaling protein XLs is required for postnatal adaptation to feeding. Nature genetics, 36(8), 818–826. 76 Plass, C., Shibata, H., Kalcheva, I., Mullins, L., Kotelevtseva, N., Mullins, J., Kato, R., Sasaki, H., Hirot- sune, S., Okazaki, Y ., et al. (1996). Identification of Grf1 on mouse chromosome 9 as an imprinted gene by RLGS–M. Nature genetics, 14(1), 106–109. Pollard, K., Serre, D., Wang, X., Tao, H., Grundberg, E., Hudson, T., Clark, A., and Frazer, K. (2008). A genome-wide approach to identifying novel-imprinted genes. Human Genetics, 122(6), 625–634. Prokhortchouk, A., Hendrich, B., Jørgensen, H., Ruzov, A., Wilm, M., Georgiev, G., Bird, A., and Prokhortchouk, E. (2001). The p120 catenin partner Kaiso is a DNA methylation-dependent transcrip- tional repressor. Genes & development, 15(13), 1613–1618. Reik, W., Dean, W., and Walter, J. (2001a). Epigenetic reprogramming in mammalian development. Science, 293(5532), 1089–1093. Reik, W., Walter, J., et al. (2001b). Genomic imprinting: parental influence on the genome. Nature Reviews Genetics, 2(1), 21–32. Reik, W., Constˆ ancia, M., Fowden, A., Anderson, N., Dean, W., Ferguson-Smith, A., Tycko, B., and Sibley, C. (2003). Regulation of supply and demand for maternal nutrients in mammals by imprinted genes. The Journal of physiology, 547(1), 35–44. Renfree, M., Hore, T., Shaw, G., Marshall Graves, J., and Pask, A. (2009). Evolution of genomic imprinting: insights from marsupials and monotremes. Annual review of genomics and human genetics, 10, 241–262. Rideout, W., Eggan, K., and Jaenisch, R. (2001). Nuclear cloning and epigenetic reprogramming of the genome. Science, 293(5532), 1093. Rocha, S., Edwards, C., Ito, M., Ogata, T., and Ferguson-Smith, A. (2008). Genomic imprinting at the mammalian Dlk1-Dio3 domain. Trends in Genetics, 24(6), 306–316. Rougeulle, C. and Heard, E. (2002). 
Antisense RNA in imprinting: spreading silence through Air. Trends in Genetics, 18(9), 434–437. Schalkwyk, L., Meaburn, E., Smith, R., Dempster, E., Jeffries, A., Davies, M., Plomin, R., and Mill, J. (2010). Allelic skewing of DNA methylation is widespread across the genome. The American Journal of Human Genetics, 86(2), 196–212. Schilling, E., El Chartouni, C., and Rehli, M. (2009). Allele-specific DNA methylation in mouse strains is mainly determined by cis-acting sequences. Genome research, 19(11), 2028–2035. Schlesinger, S., Selig, S., Bergman, Y ., and Cedar, H. (2009). Allelic inactivation of rDNA loci. Genes & Development, 23(20), 2437. Schulz, R., Menheniott, T., Woodfine, K., Wood, A., Choi, J., and Oakey, R. (2006). Chromosome- wide identification of novel imprinted genes using microarrays and uniparental disomies. Nucleic Acids Research, 34(12), e88. Schulz, R., Proudhon, C., Bestor, T., Woodfine, K., Lin, C., Lin, S., Prissette, M., Oakey, R., and Bourc’his, D. (2010). The parental non-equivalence of imprinting control regions during mammalian development and evolution. PLoS genetics, 6(11), e1001214. 77 Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, pages 461–464. Sharp, A., Migliavacca, E., Dupre, Y ., Stathaki, E., Sailani, M., Baumer, A., Schinzel, A., Mackay, D., Robinson, D., Cobellis, G., et al. (2010). Methylation profiling in individuals with uniparental disomy identifies novel differentially methylated regions on chromosome 15. Genome research, 20(9), 1271– 1278. Shoemaker, R., Deng, J., Wang, W., and Zhang, K. (2010). Allele-specific methylation is prevalent and is contributed by CpG-SNPs in the human genome. Genome Research, 20(7), 883. Singer-Sam, J., Grant, M., LeBon, J., Okuyama, K., Chapman, V ., Monk, M., and Riggs, A. (1990). Use of a HpaII-polymerase chain reaction assay to study DNA methylation in the Pgk-1 CpG island of mouse embryos at the time of X-chromosome inactivation. 
Molecular and cellular biology, 10(9), 4987–4989. Sleutels, F., Zwart, R., and Barlow, D. (2002). The non-coding Air RNA is required for silencing autosomal imprinted genes. Nature, 415(6873), 810–813. Smith, A., Chung, W., Hodges, E., Kendall, J., Hannon, G., Hicks, J., Xuan, Z., and Zhang, M. (2009). Updates to the RMAP short-read mapping software. Bioinformatics, 25(21), 2841–2842. Smith, R., Dean, W., Konfortova, G., and Kelsey, G. (2003). Identification of novel imprinted genes in a genome-wide screen for maternal methylation. Genome Research, 13(4), 558. Stadtfeld, M., Apostolou, E., Akutsu, H., Fukuda, A., Follett, P., Natesan, S., Kono, T., Shioda, T., and Hochedlinger, K. (2010). Aberrant silencing of imprinted genes on chromosome 12qF1 in mouse induced pluripotent stem cells. Nature, 465(7295), 175–181. St¨ oger, R., Kubicka, P., Liu, C., Kafri, T., Razin, A., Cedar, H., and Barlow, D. (1993). Maternal-specific methylation of the imprinted mouse Igf2r locus identifies the expressed locus as carrying the imprinting signal. Cell, 73(1), 61–71. Suzuki, S., Renfree, M., Pask, A., Shaw, G., Kobayashi, S., Kohda, T., Kaneko-Ishino, T., and Ishino, F. (2005). Genomic imprinting of IGF2, p57KIP2 and PEG1/MEST in a marsupial, the tammar wallaby. Mechanisms of development, 122(2), 213–222. Tate, P. and Bird, A. (1993). Effects of DNA methylation on DNA-binding proteins and gene expression. Current opinion in genetics & development, 3(2), 226–231. Thorvaldsen, J., Duran, K., and Bartolomei, M. (1998). Deletion of the H19 differentially methylated domain results in loss of imprinted expression of H19 and Igf2. Genes and Development, 12(23), 3693. Tremblay, K., Saam, J., Ingram, R., Tilghman, S., and Bartolomei, M. (1995). A paternal-specific methyla- tion imprint marks the alleles of the mouse H19 gene. Nature Genetics, 9(4), 407–413. Varmuza, S. and Mann, M. (1994). Genomic imprinting-defusing the ovarian time bomb. Trends in Genetics, 10(4), 118–123. 
Wang, X., Sun, Q., McGrath, S., Mardis, E., Soloway, P., and Clark, A. (2008). Transcriptome-wide identi- fication of novel imprinted genes in neonatal mouse brain. PLoS One, 3(12), e3839. 78 Warth, R., Garcia Alzamora, M., Kim, J., Zdebik, A., Nitschke, R., Bleich, M., Gerlach, U., Barhanin, J., and Kim, S. (2002). The role of KCNQ1/KCNE1 K+ channels in intestine and pancreas: lessons from the KCNE1 knockout mouse. Pfl¨ ugers Archiv European Journal of Physiology, 443(5), 822–828. Watanabe, D., Suetake, I., Tada, T., and Tajima, S. (2002). Stage-and cell-specific expression of Dnmt3a and Dnmt3b during embryogenesis. Mechanisms of development, 118(1), 187–190. Weaver, J., Susiarjo, M., and Bartolomei, M. (2009). Imprinting and epigenetic changes in the early embryo. Mammalian Genome, 20(9), 532–543. Weber, M., Davies, J., Wittig, D., Oakeley, E., Haase, M., Lam, W., and Sch¨ ubeler, D. (2005). Chromosome- wide and promoter-specific analyses identify sites of differential DNA methylation in normal and trans- formed human cells. Nature genetics, 37(8), 853–862. Weisstein, A., Feldman, M., and Spencer, H. (2002). Evolutionary genetic models of the ovarian time bomb hypothesis for the evolution of genomic imprinting. Genetics, 162(1), 425–439. Weksberg, R., Smith, A., Squire, J., and Sadowski, P. (2003). Beckwith–Wiedemann syndrome demon- strates a role for epigenetic control of normal development. Human molecular genetics, 12(suppl 1), R61–R68. Williamson, C., Turner, M., Ball, S., Nottingham, W., Glenister, P., Fray, M., Tymowska-Lalanne, Z., Plagge, A., Powles-Glover, N., Kelsey, G., et al. (2006). Identification of an imprinting control region affecting the expression of all transcripts in the Gnas cluster. Nature Genetics, 38(3), 350–355. Wolf, J. and Hager, R. (2006). A maternal–offspring coadaptation theory for the evolution of genomic imprinting. PLoS biology, 4(12), e380. Wutz, A. (2011). 
Gene silencing in X-chromosome inactivation: advances in understanding facultative heterochromatin formation. Nature Reviews Genetics, 12(8), 542–553. Xu, Y ., Goodyer, C., Deal, C., and Polychronakos, C. (1993). Functional polymorphism in the parental imprinting of the human IGF2R gene. Biochemical and biophysical research communications, 197(2), 747–754. Zaitoun, I. and Khatib, H. (2006). Assessment of genomic imprinting of SLC38A4, NNAT, NAP1L5, and H19 in cattle. BMC genetics, 7(1), 49. Zhang, D., Cheng, L., Badner, J., Chen, C., Chen, Q., Luo, W., Craig, D., Redman, M., Gershon, E., and Liu, C. (2010). Genetic control of individual differences in gene-specific methylation in human brain. The American Journal of Human Genetics, 86(3), 411–419. Zhang, Y . and Tycko, B. (1992). Monoallelic expression of the human H19 gene. Nature Genetics, 1(1), 40–44. Zhang, Y ., Shields, T., Crenshaw, T., Hao, Y ., Moulton, T., and Tycko, B. (1993). Imprinting of human H19: allele-specific CpG methylation, loss of the active allele in Wilms tumor, and potential for somatic allele switching. American Journal of Human Genetics, 53(1), 113. 79
Abstract
Among the best-known functions of DNA methylation is mediating imprinted gene expression by differentially marking specific regulatory regions on maternal and paternal alleles. Imprinted genes are expressed from only one of the two parental alleles in mammals, rendering the organism functionally haploid at those loci. Imprinting has been tied to the evolution of placental mammals, and defects in imprinting have been associated with human diseases. Although recent advances in genome sequencing have revolutionized the study of DNA methylation, existing methylome data remain largely untapped in the study of imprinting. We present a novel statistical model to describe allele-specific methylation (ASM) in data from high-throughput short-read bisulfite sequencing. Simulation results indicate that the technical specifications of existing methylome data, such as read length and coverage, are sufficient for full-genome ASM profiling based on our model. Because our method does not depend on genotype, it can be applied to identify ASM in the context of genomic imprinting.
We used our model to analyze methylomes for a diverse set of human cell types, including cultured and uncultured differentiated cells, embryonic stem cells (ESCs) and induced pluripotent stem cells (iPSCs). Regions of ASM identified most consistently across methylomes are tightly connected with known imprinted genes and precisely delineate the boundaries of several known imprinting control regions. Novel predicted regions of ASM common to multiple cell types frequently mark ncRNA promoters and represent promising starting points for targeted validation.
We also compared regions of ASM between uncultured mouse and human cells. Regions in which both sequence and ASM status are conserved between species show high concordance with known imprinted genes, providing additional evidence for predicting novel imprinted genes. The skew of ASM-associated imprinted genes in mouse agrees with the parental conflict theory, which hypothesizes that the evolution of genomic imprinting is driven by the differing interests of parental genes in offspring growth. Furthermore, the variation in ASM between species shows that ASM established in the parental germlines plays a more critical role in regulating imprinting than ASM established in somatic cells.
More generally, our model provides the analytical complement to cutting-edge experimental technologies for surveying ASM in specific cell types and across species.
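The abstract does not spell out the statistical model, so the following is only an illustrative sketch of how genotype-independent ASM detection can work in principle, not the dissertation's actual method. It assumes that reads over a region are encoded as lists of 1 (methylated), 0 (unmethylated) or None (CpG not covered), that the two alleles contribute reads in equal proportion, that a two-allele mixture is fitted by expectation-maximization, and that it is compared with a single-allele model using the Bayesian information criterion. All function names and parameters here are hypothetical.

```python
import math
import random


def clip(p, eps=1e-3):
    """Keep probabilities away from 0 and 1 so logarithms stay finite."""
    return min(max(p, eps), 1.0 - eps)


def read_loglik(read, probs):
    """Log-probability of one read given per-CpG methylation probabilities."""
    total = 0.0
    for state, p in zip(read, probs):
        if state is not None:  # skip CpGs the read does not cover
            total += math.log(p if state == 1 else 1.0 - p)
    return total


def weighted_levels(reads, weights, n_sites):
    """Weighted methylation level per CpG (0.5 where there is no coverage)."""
    levels = []
    for j in range(n_sites):
        meth = sum(w for r, w in zip(reads, weights) if r[j] == 1)
        cov = sum(w for r, w in zip(reads, weights) if r[j] is not None)
        levels.append(clip(meth / cov) if cov > 0 else 0.5)
    return levels


def fit_single(reads, n_sites):
    """Single-allele model: one methylation level per CpG."""
    probs = weighted_levels(reads, [1.0] * len(reads), n_sites)
    return probs, sum(read_loglik(r, probs) for r in reads)


def fit_two_alleles(reads, n_sites, n_iter=50, seed=0):
    """Two-allele mixture (assumed 50/50) fitted by EM from a random start."""
    rng = random.Random(seed)
    resp = [rng.random() for _ in reads]  # P(read came from allele 1)
    for _ in range(n_iter):
        # M-step: per-allele methylation levels from soft read assignments
        p1 = weighted_levels(reads, resp, n_sites)
        p2 = weighted_levels(reads, [1.0 - w for w in resp], n_sites)
        # E-step: update assignments and accumulate the mixture log-likelihood
        loglik, new_resp = 0.0, []
        for r in reads:
            l1 = math.log(0.5) + read_loglik(r, p1)
            l2 = math.log(0.5) + read_loglik(r, p2)
            top = max(l1, l2)
            log_mix = top + math.log(math.exp(l1 - top) + math.exp(l2 - top))
            loglik += log_mix
            new_resp.append(math.exp(l1 - log_mix))
        resp = new_resp
    return p1, p2, loglik


def bic(loglik, n_params, n_reads):
    """Bayesian information criterion; lower values indicate a better model."""
    return n_params * math.log(n_reads) - 2.0 * loglik


def prefers_two_alleles(reads, n_sites):
    """True if the two-allele model has the lower BIC for this region."""
    _, single_ll = fit_single(reads, n_sites)
    _, _, double_ll = fit_two_alleles(reads, n_sites)
    n = len(reads)
    return bic(double_ll, 2 * n_sites, n) < bic(single_ll, n_sites, n)


if __name__ == "__main__":
    asm_like = [[1] * 5] * 20 + [[0] * 5] * 20  # half the reads fully methylated
    uniform = [[1] * 5] * 40                    # every read fully methylated
    print(prefers_two_alleles(asm_like, 5), prefers_two_alleles(uniform, 5))
```

In this toy input, the first region mixes fully methylated and fully unmethylated reads and is expected to favor the two-allele model, while the second is uniformly methylated and is expected to favor the single-allele model. Because reads are grouped only by their methylation patterns, no SNP or genotype information is needed, which is the property the abstract highlights.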
Asset Metadata
Creator
Fang, Fang (author)
Core Title
Identifying allele-specific DNA methylation in mammalian genomes
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
09/13/2012
Defense Date
08/13/2012
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
allele-specific methylation, bisulfite sequencing, epigenetics, genomic imprinting, imprinting control regions, OAI-PMH Harvest
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Smith, Andrew D. (committee chair), Hacia, Joseph G. (committee member), Pellegrini, Matteo (committee member), Tavare, Simon (committee member)
Creator Email
fang.flora@gmail.com, ffang@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-96610
Unique identifier
UC11288244
Identifier
usctheses-c3-96610 (legacy record id)
Legacy Identifier
etd-FangFang-1196.pdf
Dmrecord
96610
Document Type
Dissertation
Rights
Fang, Fang
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA