Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Detection, classification and functional annotation of mouse L1 retrotransposon promoters
(USC Thesis Other)
Detection, classification and functional annotation of mouse L1 retrotransposon promoters
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
DETECTION, CLASSIFICATION AND FUNCTIONAL ANNOTATION OF MOUSE L1 RETROTRANSPOSON PROMOTERS by Meng Zhou A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS) May 2019 Copyright 2019 Meng Zhou Detection, classification and functional annotation of mouse L1 retrotransposon promoters Meng Zhou Abstract L1Md retrotransposons are the most abundant and active transposable elements in the mouse genome. The promoters of many L1Md retrotransposons are composed of tandem repeats which are called monomers. The number of monomers varies between retrotransposons and thus makes it difficult to locate the exact boundaries and infer the activity of L1Md promoters. Duplication of monomers contributes to the maintenance of L1Md promoter through generations, but the mecha- nism is unclear. Since the current classification of monomers is based on limited data, there is an imperative need for a comprehensive annotation of monomers which is specialized for functional study of L1Md repeats. We developed a pipeline for de novo monomer detection in the whole genome using sequence heuristics. Identified monomers were further classified into subtypes based on their sequence pro- files. We applied this pipeline to genome assemblies of various rodent species and strains. A major monomer subtype was found missing in all but one species, implying potential species-specific enhancement of monomer expansion. The positioning pattern of monomer subtypes within a pro- moter was discovered and analyzed. We showed that the subtype composition of an L1Md pro- moter can be used to infer its transcriptional activity in germ cell development. The monomer detection pipeline presented in this dissertation is independent of the current repeat annotation and is able to detect monomer and annotate subtypes in a de novo manner. Sub- types for all monomer types have been identified using comprehensive data, greatly expanding the spectrum of monomer variants. The analysis of monomer subtype positioning within promoters provides evidence supporting both previously proposed models of L1Md promoter expansion. The ii transcription silencing of L1Md promoters differentiates for promoter types, which implies poten- tial factor-specific regulatory functions instead of a universal retrotransposon repression mecha- nism. iii To my dear grandma, my first teacher. iv Acknowledgments I would like to thank my advisor, Dr. Andrew Smith, for his patient guidance throughout my extended course of pursuing a doctoral degree. He has been motivating me all the time, providing helpful suggestions and directions with foresight. He is a role model scientist, who showed me how to study things in a scientific way in many aspects. Besides he is also a good friend, and it’s always fun talking with him. Thanks to my dissertation committee, Drs. Liang Chen, Matt Dean, Remo Rohs and Min Yu, for their instructful comments for both my qualification proposal and dissertation defense. Their thoughts and discussion are very insightful. I really appreciate Drs. Xiaole Shirley Liu and Yong Zhang for their introduction of computa- tional biology and epigenomics to me. They offered me a chance to have a glimpse at real and cool science and enlightened me to choose this path of studying it. They have always been so supportive to me. I also want to thank my collaborators over these years. Many thanks to Dr. Lucas Kaaij of the Ketting Lab, Dr. Joaquina Delas Vives, Giorgia Battistoni and Dr. Ilaria Falciatori of the Hannon Lab, Dr. Caitlin Howe of the Breton Lab. And of course leaders of those labs, Dr. Ren´ e Ketting, Dr. Greg Hannon and Dr. Carrie Breton. They all have been very kind and provided generous assistance. It was really a pleasure to collaborate with them. Members of the Smith Lab, both past and current, have helped me greatly. We are such a big family and I couldn’t list all of them here due to space limit. I would like to thank all of them for v their help and communication during my study. It was really fun being a member of the Smith Lab. I really enjoyed having them as my labmates. To my many friends in USC, we are a pack. Thank you so much for helping me both in study and life, and I will never forget the joy we had together. Again I couldn’t list all your names, but our friendship will never fade. Finally thanks to all my family, especially my parents who have been always there throughout my life. It is not possible for me to carry on without their care and support. To my wife Su, thank you for everything. You have made me a better man and gave me all the encouragement and motivation I needed. vi Contents Dedication iv Acknowledgments v List of Figures ix List of Tables xi 1 Introduction 1 2 Background 6 2.1 Classification of retrotransposons . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Mechanism and impact of LINE retrotransposition . . . . . . . . . . . . . . . . . 8 2.3 Regulation of retrotransposon activities . . . . . . . . . . . . . . . . . . . . . . . 13 2.4 Annotation of retrotransposons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4.1 RepeatMasker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4.2 De novo repeat identification algorithm . . . . . . . . . . . . . . . . . . . 18 2.5 The special structure of mouse L1 promoters . . . . . . . . . . . . . . . . . . . . . 19 2.6 Objectives of monomer annotation . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3 Computational methods for monomer detection 25 3.1 Detection of monomers using a specialized profile-HMM . . . . . . . . . . . . . . 25 3.1.1 Topology of the specialized profile-HMM . . . . . . . . . . . . . . . . . . 26 3.1.2 Model construction from multiple sequence alignment . . . . . . . . . . . 30 3.1.3 Scoring scheme of the decoding step . . . . . . . . . . . . . . . . . . . . . 34 3.2 Monomer subtype identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2.1 Calculating the Fisher score vector . . . . . . . . . . . . . . . . . . . . . . 37 3.2.2 Clustering-based subtype identification . . . . . . . . . . . . . . . . . . . 38 3.3 Pipeline of L1Md monomer detection . . . . . . . . . . . . . . . . . . . . . . . . 40 4 Classification and functional annotation of L1Md promoters 44 4.1 Results of monomer detection in various genome assemblies . . . . . . . . . . . . 44 4.2 Monomer subtype definition and identification . . . . . . . . . . . . . . . . . . . . 51 4.3 Truncation profiles of monomer types . . . . . . . . . . . . . . . . . . . . . . . . 55 vii 4.4 Analysis of subtype composition in L1Md promoters . . . . . . . . . . . . . . . . 59 4.5 Identification of active L1Md promoters using DNA methylation data . . . . . . . 66 5 Conclusions 74 A Appendix 85 A.1 Implementation details of the specialized profile-HMM . . . . . . . . . . . . . . . 85 A.1.1 Choice of prior distributions . . . . . . . . . . . . . . . . . . . . . . . . . 85 A.1.2 Numerical stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 A.1.3 Data structure and decoding implementation in C++ . . . . . . . . . . . . 87 A.2 Consensus sequences of finalized profile-HMMs . . . . . . . . . . . . . . . . . . . 89 A.3 Statistics of sequencing data used for relative abundance analysis . . . . . . . . . . 89 A.4 Identified subtypes of monomers in the mm10 genome . . . . . . . . . . . . . . . 105 A.5 CAGE-seq analysis and alignment to profile-HMM states . . . . . . . . . . . . . . 105 A.6 Position preference scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 viii List of Figures 2.1 Schematic of the TPRT process. . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Schematic of some possible outcomes of TPRT. . . . . . . . . . . . . . . . . . . . 11 2.3 Length distribution of mouse L1 repeats . . . . . . . . . . . . . . . . . . . . . . . 12 2.4 Activity of L1 repeats and corresponding regulation effects . . . . . . . . . . . . . 15 2.5 Structure of an L1Md repeat with monomers . . . . . . . . . . . . . . . . . . . . . 20 2.6 Hypothesized monomer array expansion models . . . . . . . . . . . . . . . . . . . 22 3.1 Structure of a profile-HMM specialized for monomer detection . . . . . . . . . . . 27 3.2 An example of model construction from multiple sequence alignment . . . . . . . 31 3.3 Dendrogram of randomly sampled Type A monomers. . . . . . . . . . . . . . . . 39 3.4 Workflow of monomer detection pipeline . . . . . . . . . . . . . . . . . . . . . . 43 4.1 Count of detected monomers in various mouse genomes . . . . . . . . . . . . . . . 45 4.2 Fraction of unassembled nucleotides around Type A monomers . . . . . . . . . . . 46 4.3 Distribution of unassembled nucleotides around monomers . . . . . . . . . . . . . 47 4.4 Length distribution of detected Type A monomers . . . . . . . . . . . . . . . . . . 48 4.5 An example of posterior expected state counts showing reads related to the A sub- type defining insertion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.6 Fractions of reads related with L1Md sequences. . . . . . . . . . . . . . . . . . . 50 4.7 tSNE plot of the clustering of A subtypes . . . . . . . . . . . . . . . . . . . . . . 52 4.8 Clustering of A subtypes based on average intra-type edit distance . . . . . . . . . 53 4.9 Clustering of G and T subtypes based on average intra-type edit distance . . . . . . 54 4.10 Truncation profiles and CAGE-seq signals of all three types of monomers . . . . . 56 4.11 Weighted read pileup of CAGE-seq data at various time points during germ cell development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.12 Distribution of monomer counts per L1Md promoter . . . . . . . . . . . . . . . . 61 4.13 Ordering scheme of monomers within an L1Md promoter . . . . . . . . . . . . . . 63 4.14 The three preference modes of monomer positioning . . . . . . . . . . . . . . . . 63 4.15 Hierarchical clustering of A subtypes based on edit distance . . . . . . . . . . . . 65 4.16 Hierarchical clustering of G subtypes based on edit distance . . . . . . . . . . . . 66 4.17 Hierarchical clustering of T subtypes based on edit distance . . . . . . . . . . . . . 67 4.18 Distribution of methylation loss after KO experiments . . . . . . . . . . . . . . . . 69 4.19 Scatter plot of methylation loss by KO of piRNA regulation factors . . . . . . . . . 70 ix 4.20 O/E ratios of monomer subtypes grouped by position preference modes after the Mili KO experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.21 O/E ratios of T subtypes grouped by position preference modes after the Miwi2 KO experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 A.1 Consensus of Type A monomer profile-HMM . . . . . . . . . . . . . . . . . . . . 89 A.2 Consensus of Type G monomer profile-HMM . . . . . . . . . . . . . . . . . . . . 90 A.3 Consensus of Type T monomer profile-HMM . . . . . . . . . . . . . . . . . . . . 90 x List of Tables 4.1 Genomes on which the monomer detection pipeline was applied . . . . . . . . . . 45 4.2 Statistics of detected promoters and repeat family classification provided by Repeat- Masker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.3 Position preference modes of A subtypes . . . . . . . . . . . . . . . . . . . . . . . 64 4.4 Position preference modes of G subtypes . . . . . . . . . . . . . . . . . . . . . . . 64 4.5 Position preference modes of T subtypes . . . . . . . . . . . . . . . . . . . . . . . 64 A.1 Statistics of sequencing data used for relative abundance analysis . . . . . . . . . . 104 A.2 List for subtypes of A monomers . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 A.3 List for subtypes of G monomers . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 A.4 List for subtypes of T monomers . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 A.5 Position preference scores of A subtypes . . . . . . . . . . . . . . . . . . . . . . . 110 A.6 Position preference scores of G subtypes . . . . . . . . . . . . . . . . . . . . . . . 111 A.7 Position preference scores of T subtypes . . . . . . . . . . . . . . . . . . . . . . . 112 xi Chapter 1 Introduction The study of functional genomics started with protein coding regions because intuitively cell func- tions and phenotypes are dictated by the proteins expressed in a cell. However, it was soon realized that the majority of our genome is composed of non-coding DNA. Here are some facts: among the 3 billion base pairs of human genomic DNA, only 2.8% are protein coding, and less than half (45.1%) are genic DNA (Alexander et al., 2010) – not only just a fraction of the genomic DNA encodes protein, but also more than half of the genome comprises non-genic DNA. So the question is: what are those intergenic DNA regions and do they have any function that is essential to cell phenotype? It was realized in the 1960s that mammalian genomes contain substantial amount of non-coding repetitive DNA (Britten & Kohne, 1968). This kind of DNA was first thought to be non-functional and was later called “junk DNA” in the early 1970s (Ohno, 1972). Whether this part of the genome is relevant in functional genomics studies is a vital question, especially at a time when whole genome sequencing was prohibitively expensive. Should people only focus on the 3 percent of the genome in coding regions, and not waste time and resources on the intergenic “junk”? Interestingly, early genome-wide comparative studies have revealed conserved non-coding DNA elements between mammals to a degree that suggests biological function for these regions in association with the fitness of an organism (Bejerano et al., 2004). Effective annotation of such regions quickly became an important issue of relevance to most types of genomics studies. The initial draft of the human genome (Lander et al., 2001) revealed that most intergenic non- coding sequences belong to a specific class of DNA elements: the retrotransposons, which are defined by their mechanism of self-replication. The retrotransposons are also strongly associated with evolutionary expansion of genome sizes, with human as a notable example of expansion. The most ancient retrotransposons date back at least 150 million years in our genome (Lander et al., 1 2001). So it means these DNA elements have shaped our genomic landscape significantly during evolution. It is thus clear that the majority portion of the non-coding intergenic regions can be attributed to those retrotransposons, and they need to be appropriately identified and studied. It was believed that genes are arranged in tandem in the chromosomes, until Barbara McClin- tock discovered that some genes in the maize genome are capable of mobilizing (McClintock, 1950). Those “jumping genes” are now known as transposons. However, it took decades since the discovery of transposons for the scientific community to realize the significance of transposition, with its exceptional abundance in eukaryotic genomes. Further investigation revealed that there are two types of transposons, classified by their pattern of mobilization in the genome. The first type includes those that physically migrate within the genome. The second type can generate and insert new copies into the genome, and are called retrotransposons. Obviously retrotransposons have contributed much more to the total amount of genomic DNA because of their ability to replicate themselves. Retrotransposons have drawn much attention in studies of the non-coding portion of genomes because of their abundance and potential to impact functional within the genome. Because DNA synthesis via self template is involved in the amplification process for retro- transposons, they are fundamentally different from other repetitive DNA elements, such as tandem repeats and satellite repeats. These repeats are incapable of self-replication and are not included in the scope of this dissertation. Therefore in the context of this dissertation, the term “repeat” will often be used as synonymous with retrotransposon, excluding other types of repetitive DNA elements. Although offspring repeats of the same ancestor can share highly similar sequence, the effect of DNA sequence spontaneous mutation has greatly shaped individual repeats independently during the course of evolution. This means the sequences of repeats differ, and the competence of generating new offspring repeats varies depending on the number and location of mutations. This is because different regions of the DNA sequences of a retrotransposon represent different functions, and they can be divided into functional regions as structures of a repeat. The structures of most repeats have been defined and studied, and the common structure of all retrotransposons is the promoter. The promoter is responsible for initiating transcription, which is the first step of 2 all types of retrotransposition. Therefore one of the most important goals of studying repeats is to identify all that are competent in generating new insertions, which includes finding all working repeat promoters. The self-replicating nature of repeats indicates that all the activities necessary for their prolifer- ation are embedded within their own sequence. Thus the classification of repeats is solely based on their sequences, rather their genomic locations. Once the classification of repeats is established, it can be used to infer the function of repeats based on their classified groups, given that repeats shar- ing similar sequence will have similar function. Historically retrotransposons have been studied in a case-by-case manner, due to the limitation of experiment throughput and data availability. In the sequencing era, when genome assembly covers most of those repeats and whole genome assays are available, it becomes imperative to study retrotransposons in a comprehensive and systematic way. Traditionally to summarize a group of repeats with certain sequence similarity, a consensus sequence is used. The consensus sequence is an average of the whole group, but is still a single sequence and can only approximate the mode of the population, without specific information about the variation. Stochastic models are a more natural way to describe the population of individual retrotransposon sequences with measurable variance. Parameterization of sequence features can enable comparison between groups, which will be much more effective for analyzing numerous individual repeats. Functional annotation of repeats requires not only classification based on sequences, but also biological data for direct evidence of repeat activity. DNA methylation is one of the most reliable and effective biological mark for identification of active repeat transcription. It is ubiquitous in the genome, and plays a crucial regulation role in controlling gene expression. Regulation of specific genomic regions can be reflected by DNA methylation footprint in a highly precise and prompt manner. Current sequencing assays are able to distinguish the change of DNA methylation with high sensitivity, and thus it is possible to locate specific differential activities within the genome in the finest resolution. With proper experimental design, people can probe the targets of various gene 3 regulation mechanisms in the entire genome. DNA methylation data, coupled with appropriate computational methodology, can be applied to study the activity of individual repeats. The outline of this dissertation is listed here: In Chapter 2, the background information of retrotransposon is reviewed. First, a brief introduc- tion of repeat classification is presented, and then the mechanism of retrotransposon proliferation introduced. The impacts of retrotransposon activities are also discussed. Then the regulation of retrotransposon activities, especially the epigenetic regulation, is introduced. Next the current annotation tools for retrotransposons are listed with their algorithms explained. Finally, the spe- cial structure of a group of mouse retrotransposons is introduced as the subject of research for this dissertation. In Chapter 3, a framework for identification and classification of the repeat promoters is devel- oped. This approach is based on stochastic modeling of DNA sequences. The structure of the model is then introduced, including the processes for fitting the model on real data and training model parameters. Then the implementation of this framework is explained. The clustering workflow and algorithms involved in monomer subtype identification are introduced. The rea- sons for choice of clustering algorithm are discussed. Finally, the application of the computa- tional method is introduced, and the details for coping with real data are also included. In Chapter 4, the results of repeat promoter annotation and classification are demonstrated and explained. The computational framework is applied to multiple genomes and the results are compared. Then the classification of promoters is generated and compared with the previously documented one. Next the composition of classes within individual repeat promoters is analyzed to verify the hypotheses of repeat promoter maintenance mechanisms. The regulation of repeat promoters are analyzed using public epigenetic data, and currently active repeat promoters are detected and investigated. 4 In Chapter 5, the annotation framework and the findings brought by its application are sum- marized. The achievements of this method is highlighted and the biological implication of the discoveries in this work is explained. 5 Chapter 2 Background 2.1 Classification of retrotransposons There are two classes of transposable elements (TE): retrotransposon (Class I) and DNA transpo- son (Class II) (Wicker et al., 2007). These two classes are defined based on the mechanism of transposition. Class II TEs, which are also known as DNA transposons, mobilize through a “cut and paste” process. In this process, one DNA transposon is excised from the genome and inserted into another genomic location. This process does not alter the copy number of DNA transposons, because they are simply “jumping” inside the genome. Because of this Class II TEs only com- prise around 2% of total nucleotides in the human genome (Smit, 1996). Class I TEs, on the other hand, propagate via a “copy and paste” process. During mobilization of retrotransposons, RNA polymerase is first recruited for transcription of mRNAs. Then the enzymes encoded by retrotrans- poson mRNA will generate a new DNA copy of retrotransposon via reverse transcription, which will be inserted to the genome eventually. The name “retrotransposon” is thus derived from the essential role of reverse transcription during this process. Because retrotransposon mobilization uses RNA as intermediate and the genomic DNA sequence of the original template element is unaffected, the copy number of Class II TEs grows with each successful retrotransposition event. As a consequence, retrotransposons are very abun- dant in genomes. In fact, it is estimated that half of the nucleotides in the human genome originated through this process (Lander et al., 2001), and this fraction could be even higher (de Koning et al., 2011). The following classification of known retrotransposons is largely based on the review of Wicker et al. (2007), in which the authors partition the Class II of retrotransposons to five orders: 6 Long interspersed nuclear elements (LINEs): This is one of the major types of retrotransposons. The most abundant family of LINE is called L1. A complete L1 repeat is about 7kb long, including a 5’ untranslated region (UTR), two open reading frames (ORFs) and a 3’ UTR with poly-A signal (Mathias et al., 1991). The 5’ UTR is typically said to serve as the promoter of an L1, initiating transcription by RNA polymerase II (Swergold, 1990). Short interspersed nuclear elements (SINEs): SINE repeats are very commonly observed in many genomes. The majority of SINEs in the human genome belong to the Alu family. Alu repeats are around 280bp long (Rubin et al., 1980). They were derived from 7SL signal RNAs, and are transcribed by RNA Pol III (Jagadeeswaran et al., 1981). SINE repeats do not have any coding potential and therefore need enzymes encoded by LINE repeats for mobilization (Ohshima et al., 1996). Long terminal repeats (LTRs): LTRs are also known as endogenous retro-virus, and they are highly enriched in a vast majority of genomes. They were derived from exogenous retro-viruses through infection of germ cells (Gifford & Tristem, 2003; Stocking & Kozak, 2008). Due to cer- tain mutations, especially loss of envelope protein coding sequence, exogenous virus could lose the ability to escape the infected cell, while still preserve the function of replication. Eventually this type of mutated virus has converted to endogenous virus once they infected germ cells and became inheritable as part of the host genome. Dictyostelium intermediate repeat sequence (DIRS): This is a type of retrovirus that uses a spe- cial integration mechanism different from LTRs. Penelope-like elements (PLE): This type of repeats encode a telomerase-like reverse transcrip- tase, which is different from the four other orders. As stated above, LINE and LTR are classified as autonomous retrotransposons, because they encode all enzymes that are sufficient for mobilization. In contrast, SINE repeats are non- autonomous because they do not have coding sequence. They rely on proteins coded by LINE repeats for mobilization (Ohshima et al., 1996). These three orders of repeats contribute to the vast 7 majority of active retrotransposition in the mammalian genomes. The retrotransposon family that makes up the most genomic DNA content is the LINE-1 (L1) family of LINE (Adams et al., 1980; V oliva et al., 1984; Hattori et al., 1986), comprising about 17% of the human genome, and 19% of the mouse genome (Lander et al., 2001; Mouse Genome Sequencing Consortium & et al., 2002). Thus repeats of the L1 families in mammalian genomes have been a major focus of retrotransposon related studies. 2.2 Mechanism and impact of LINE retrotransposition Because LTR activity in human is very limited, if any, LINE can be considered as the only active autonomous retrotransposon order in the human genome (Mills et al., 2007). Therefore in this sec- tion, the mobilization of the major LINE family, L1, is reviewed. A canonical complete L1 repeat encodes two proteins, of which both have preference to bind their template mRNA molecule and form a ribonucleoprotein particle (RNP) in the cytoplasm (Hohjoh & Singer, 1997, 1996). This characteristics of L1 mRNA is called bicistronicity. The first ORF of L1 encodes an RNA binding protein (referred to as ORF1p hereafter) with nucleic acid chaperone property (Martin, 2014) and its function has not yet been completely characterized. It is shown that ORF1p is required for transportation of L1 mRNA from cytoplasm back to the nucleus during insertion (Goodier et al., 2007). The protein encoded by the second ORF (ORF2p) is a reverse transcriptase with endonucle- ase function (Feng et al., 1996). This is the most important enzyme for retrotransposition, because it is responsible for creating the insertion site on the genomic DNA, and synthesizing cDNA of the RNA template. Although it seems that the reverse transcription function of ORF2p is sufficient to support the insertion of an L1 mRNA, it has been shown by in vitro assays that both proteins are essential for successful retrotransposition (Martin, 2006). The bicistronicity of proteins encoded by LINE repeats does not prevent the access from other molecules. The most commonly seen exception is the propagation of SINE repeats. As stated in the previous section, SINE repeats do 8 not have coding potential, but their prevailing presence in the human genome proves that their abil- ity of borrowing enzymes from LINE repeats is really successful. Another example is processed pseudogene. These DNA fragments exist in the DNA form of transcripts belonging to non-LINE genes. They lack intron sequences, and have trailing poly-A signal. It has been shown that such pseudogenes are inserted into the genome by the enzymes encoded by LINE repeats (Esnault et al., 2000). A C D B AAAA TTTAA Genomic DNA Poly-A tail Cleave site “AAATT” L1 mRNA AAAA TTTAA Genomic DNA Cleave site “AAATT” L1 mRNA cDNA Antisense of L1 mRNA Sense of L1 mRNA Sense of L1 mRNA generated by DNA repair Antisense of L1 mRNA generated by RT Target site duplication generated by two cleaves Figure 2.1: Schematic of the TPRT process. A: L1 RNP creates the first nick of genomic DNA. B: reverse transcription synthesizes cDNA of the antisense strand. C: L1 DNA on the sense strand is generated via DNA repair mechanism of the host genome. The overhang created by the first nick and its reverse complement are indicated by yellow. D: After insertion, two target site duplications have formed flanking the newly inserted L1 DNA fragment. At the beginning of LINE insertion process, ORF2p creates a nick at the target site. The consensus sequence of the cut site is AAA/TT (actual nick position is indicated by slash), which is mostly reverse complementary to the poly-A tail of LINE mRNA. This will enhance the affinity of mRNA attaching genomic DNA by base pairing. Then ORF2p also creates a second nick on the other strand of genomic DNA, approximately 12 nt downstream from the first cut. Then the poly-A tail of L1 mRNA is annealed to the DNA overhang of poly-T created by the first cut, and the free 3’ hydroxyl group is used to prime the reverse transcription. This process is known as target-primed reverse transcription (TPRT) (Luan et al., 1993). The schematic steps of this process are shown 9 in Figure 2.1. After cDNA synthesis, DNA repair mechanisms will engage to complete the other strand and ligate the nicks. This process has the side effect of generating a target site duplication (TSD) at both ends of the inserted DNA, due to the fact that the nicks created by ORF2p are sticky ends instead of blunt ends. As shown in Figure 2.2, there are a number of possible outcomes resulted by TPRT: The complete sequence of L1 mRNA is inserted into the genome. This scenario is rare due to the instability of TPRT shown later. The 3’ terminus of the mRNA is inserted, generating a 5’ end truncated insertion. This is caused by an early termination of the TPRT process. Since the insertion begins from the poly-A tail of the mRNA, early termination effectively removes the 5’ end of inserted repeats. It is observed that the vast majority of identified L1 insertions in the genome are truncated (Figure 2.3). A study focusing on the 5’ end junctions of L1s suggests a termination model through random microhomology between the target genomic DNA and L1 mRNA (Zingler et al., 2005), which coincides with the high truncation frequency given the fact that the insertion of L1 does not show specific sequence preference (Nell˚ aker et al., 2012). A partially inverted sequence is inserted. During TPRT, the DNA overhang on the opposing strand of the reverse transcription, may also pair with the L1 mRNA and then primes another insertion (Ostertag & Kazazian, 2001). In this scenario, the second insertion starts from the middle of the template mRNA and the cDNA is synthesized and ligated to the complementary strand. Consequently the product of this so-called “twin-priming” process is a partially inverted insertion. Flanking sequences which do not belong to the repeat may also be inserted. Transcription of retrotransposon sometimes does not start or terminate at the repeat boundaries, and thus a elon- gated mRNA is generated. This process is called transduction. Both upstream and downstream regions of an L1 repeat can be transduced, depending on the orientation of transcription and the extent of elongation. 10 AAAA TTTAA Genomic DNA ORF2p ORF1p Poly-A tail Target site “AAATT” 5’ UTR ORF1 ORF2 3’ UTR AAAA 3’ UTR AAAA ORF2 A full-length copy is inserted to the genome. Early termination of reverse- transcription generates a 5’ truncated copy. Alternate priming sometimes generates an inversion. 1 2 3 5’ UTR ORF1 ORF2 3’ UTR AAAA 5’ UTR ORF1 ORF2 3’ UTR AAAA Figure 2.2: Schematic of some possible outcomes of TPRT. The L1 mRNA is indicated as the green arrowed curve, while genomic DNA is indicated as the grey double-stranded arrowed curve. The arrow represents the orientation from 5’-end to 3’-end. Proteins encoded by the L1 mRNA are indicated by the blue and green bubbles, corresponding to ORF1p and ORF2p, respectively. The inserted DNA, including cDNA and its complementary strand, is represented by the blue double- stranded arrow curves. It can be inferred that any inheritable insertions of retrotransposons should occur in germ cells and early embryonic cells. Somatic retrotranspositions might alter phenotypes of the organism and thus increase the risk of reduced fitness (Garcia-Perez et al., 2007; Packer et al., 1993; Branciforte & Martin, 1994; Georgiou et al., 2009; Fadloun et al., 2013). Because of this, retrotransposons have been considered “junk DNA” because they appear almost completely inactive in terms of transcription in somatic cells, and they do not show any sequence conservation or homology with known important proteins (Ohno, 1972). On one hand, this is caused by the 5’ truncation of insertions such that most identified repeats in the genome have lost promoters and are indeed inert. On the other hand, it is hypothesized that retrotransposons are mostly inactive because of purifying selection against repeats active in somatic cells which are deleterious to the genome; it 11 Length (bp) Density 0 2000 4000 6000 8000 0.0000 0.0010 0.0020 Figure 2.3: Length distribution of mouse L1 repeats. Only the ones that are continuous (not disrupted by other repeats) are included. is also hypothesized that some repeats prefer heterochromatin for insertion in order to cause less damage (Pereira, 2004). The purifying selection hypothesis is supported by in vitro assays that failed to identify significant site preference of LINE (Nell˚ aker et al., 2012). Insertions of repeats are common mutagens which disrupt normal gene expression, when occur- ring in a genic region (Cordaux & Batzer, 2009). They can also cause ectopic recombination, when two non-allelic but homologous retrotransposons cross over (Langley et al., 1988). This mutation is severe, usually resulting in deletion of large segments of genomic DNA. Another consequence of retrotransposition is transduction which is mentioned in the previous section. It is observed that such transduction can include extra sequences as long as a few thousand base pairs, and sometimes duplicates a complete protein coding gene or multiple regulatory regions (Pickeral et al., 2000). It is worth noting that promoters of the human-specific L1HS repeats are bidirectional, which means they have the ability to initiate transcription towards both the upstream and downstream (Speek, 2001). Therefore even if one L1HS repeat has lost its coding sequence, it may still initialize dis- ruptive transcriptions. 12 Non-heritable somatic retrotransposition has been observed in many studies (Coufal et al., 2009; Kano et al., 2009; Belancio et al., 2010; Muotri et al., 2005) and has recently gained focus in retrotransposon-related research. As reviewed by Erwin et al. (2014), retrotransposition in neu- ronal stem cells may contribute to genome mosaicism in neurons. Another study of repeat element exaptation found thousands of repeats in the human genome showed elevated sequence conserva- tion, suggesting that some repeats have been adopted as part of the regulatory system (Lowe et al., 2007a). These concepts highlight the importance of studying retrotransposon systematically. Those so-called “genetic fossils” contain valuable information about genome evolution, since their history spans over a hundred million years. 2.3 Regulation of retrotransposon activities Due to the deleterious effects of retrotransposition, as a result of selection the regulation on repeats is almost always suppressing. Below is a list of major mechanisms that can alter repeat activity. Genetic mutation. Retrotransposons are subject to spontaneous mutations, and the mutation rate is observed to be higher than in other genomic regions (Drake et al., 1998). This can be explained by the purifying selection hypothesis that mutations on other genomic regions are directly detrimental to the fitness. Spontaneous mutation is not the only source that shaped retrotransposon sequences. The TPRT process of LINE and SINE might have low fidelity in reverse transcription. High sequence similarity between different repeats can promote the fre- quency of recombination. This is another source of elevated sequence mutation rate within retrotransposons. Epigenetic regulation. This includes DNA methylation and histone modification marks deposited in repeat promoters. Among regulatory epigenetic marks, DNA methylation is uniquely effective in repressing retrotransposition activity. By default DNA methylation will be deposited and maintained in the whole genome, unless a locus is protected from the de novo 13 DNA methyltransferase activity (Singh et al., 2013). This means retrotransposon transcripts are generated from those that have a hypomethylated promoter occupied by transcription factors. Small RNA regulation. There is a pathway that can reduce the transcript level of repeats (Aravin et al., 2007). The key proteins in this pathway belong to the Piwi family. They specifically locate and cleave retrotransposon mRNAs through smRNA pairing. The smRNAs are therefore called Piwi-interacting RNA (piRNA). The piRNAs are generated by processing the transcripts of retrotransposon. Since spontaneous sequence mutations do not target repeats specifically, it is not considered as a regulation pathway here. Nevertheless due to the randomness of mutation, it is also not impossible that spontaneous mutations accumulated in repeats could promote the fitness of the organism. The concept of exaptation suggests that it is possible for the host genome to utilize retrotransposons as regulatory elements (Brosius, 1999; Lowe et al., 2007b; de Souza et al., 2013). The other two types listed above are in effect depending on sequences of repeats, and therefore it is of interest to study the underlying mechanism, such as how active repeats are recognized. The L1 propagation and regulations for its activity are demonstrated in Figure 2.4. DNA methylation is an important epigenetic mark that regulates gene expression. The over- all methylation level in mammalian genome is higher than 50%, and thus low methylation, i.e. hypomethylation, is usually an indicator of active regulation. For protein coding genes, hypomethylated promoter is necessary for active transcription in most cases (Bird, 2002; Suzuki & Bird, 2008). Therefore on the other hand, DNA methylation is being recruited as repressive mark to silence genes, including retrotransposons. DNA methyltransferases are responsible for depositing methylation marks onto cytosines. The two waves of epigenetic reprogramming in developing embryonic cells and germ cells will erase and then re-establish methylation of the entire genome, which potentially make the genomes vulnerable to inheritable retrotranspositions. However, it was observed that some active families of retrotransposons are re-methylated quicker than other ones, consequently narrowing the 14 L1 L1 1. Transcription 2.1. Translation 2.2. Transcripts captured by piRNA regulation piRNA biogenesis 3. RNP or piRNA return to nucleus 4.2. Integration 4.1. piRNA-guided methylation L1 Figure 2.4: Activity of L1 repeats and corresponding regulation effects. The left part of this figure indicates nucleus (orange background) and the right part indicates cytoplasm (blue background). Stages starting from top left: 1. L1 repeats first initiate transcription to generate mRNA. If the promoter is methylated (indicated by purple circles), then transcription will be blocked. In Step 2, there are two possible outcomes: 2.1. L1 mRNA is successfully translated, generating RNP; 2.2. transcript is captured by the piRNA regulation pathway, and subsequently it is processed into piRNA and lost the chance to be translated. In Step 3, processed piRNA and/or RNP are transferred back to the nucleus. Then piRNA will guide de novo methylation machinery to suppress L1 promoters through base-pairing (Step 4.1); for RNP, it can be integrated into the genome via TPRT (Step 4.2). All proteins involved in the piRNA regulation pathway are represented by the yellow circle. window for repeat transcription (Seisenberger et al., 2012). This suggests a de novo methylation mechanism which targets transcriptionally active retrotransposons (Kuramochi-Miyagawa et al., 2008). It is observed that piRNA can guide methylation specifically on retrotransposons during germ cell development (Molaro et al., 2014). Because piRNA is a product of retrotransposon 15 transcripts, it can be inferred that this piRNA-guided de novo methylation is targeting the active repeats. Also because piRNA-based pathways function through base-pairing, the sequences of individual repeats are then the sole determinant for the outcome of regulation. According to the classification system of repeats proposed by Wicker et al. (2007), the fam- ily hierarchy is used to describe a group of repeats that have the same structure and share high sequence similarity. It is then expected that repeats belonging to the same family should face the same regulation, either being recognized by the piRNA pathway, or being able to escape from epigenetic suppression. Genetic mutations that are significant enough to alter the ability of mobi- lization of repeats should be also sufficient for classifying repeats to another family. It implies that, since the epigenetic regulation is targeting only the promoter of a repeat, the promoter sequence should be used for distinguishing repeat activities. 2.4 Annotation of retrotransposons The annotation of retrotransposons includes their genomic locations and classification assignment. The locations can be detected with or without prior knowledge of the sequences, while the for- mer one is more commonly used and more effective. Identifying a group of repeats with known sequence in a reference genome is a homology searching problem. The aim is to locate all occur- rences in the genome using a query sequence, namely distinguishing sequences that are signifi- cantly similar to the query from the non-homologous background. This problem is very similar to the sequence alignment problem, which has been well studied and a number of successful algo- rithms are available. Once repeat occurrences are found, the next step is to classify individual copies to their corresponding taxa. This assignment step is basically based on the label of con- sensus sequences, because as discussed in previous sections, the biological functions of retrotrans- posons are almost solely based on the sequence, and it is expected that repeats belonging to the same taxon have similar function. Therefore classifying repeats that share similar sequence is bio- logically meaningful and useful for other studies. The simplest classification can be generated by 16 associating each occurrence to the corresponding query sequence that has been used for detection; however, studies have shown that classification of repeats under close scrutiny can reveal their evolution history, which suggests more advanced classification algorithm is preferred (Sookdeo et al., 2013; Price et al., 2004). Later in this section, some typical repeat annotation tools will be introduced and their limitations will be discussed. 2.4.1 RepeatMasker RepeatMasker (Smit et al., 2015) is the most commonly used pipeline for identifying known repet- itive elements in a reference genome. It combines a number of highly sensitive homologous sequence searching programs and uses the most comprehensive repeat sequence database, Rep- Base (Jurka et al., 2005), as searching target to detect and classify repeats. The pipeline in brief, is as followed: for each reference sequence in the database, scan the genome for regions with similar sequence; then label each region as a copy of the corresponding repeat family according to the name of consensus. Each consensus sequence in RepBase is usually submitted and named by the authors who first publish the consensus sequence. RepeatMasker has a custom version of the con- sensus database, which has been curated and re-organized for repeat discovery and classification purposes. For example, there are redundant sequences stored in RepBase submitted by different groups for various studies, but in fact they can be represented by one common group of repetitive elements. Since those different versions of consensus do not have identical sequence and name, they will cause ambiguity in repeat classification process, because the repeat family membership is inferred from consensus names. To cope with different characteristics of different types of repeats, there are multiple sub- pipelines integrated in RepeatMasker. For example, TandemRepeatsFinder (Benson, 1999) is used to detect simple and low complexity repeats; two consensus sequences are used to detect subunits of a long and divergent L1 element. Due to the nature of highly divergent sequences even between individual repeats of the same family, some consensus sequences in RepBase do not 17 reflect complete repeat regions. For example, the consensus sequence of an L1 repeat family can be stored as the 5’-end terminus and the 3’-end. This helps increase detection sensitivity, but might also introduce ambiguity in assignment of family. For example the ORF2 of an L1 repeat, which is close to its 3’ end, usually has less sequence divergence than its 5’ UTR; therefore using separated 5’- and 3’-end sequences the pipeline can discover more regions satisfying certain sequence simi- larity cut-off than using complete consensus. However, if the labels of the two ends do not match for a detected repeat, the assignment of classification might cause confusion because repeats are only allowed to belong to one family. For example in previous sections it is discussed that one L1 repeat can have a different type of promoter. In summary, the RepeatMasker repeat detection and classification pipeline relies on known repeat sequences. It is the most commonly used tool to annotate repeats in a given genome assem- bly. However if a reference genome of one newly studied species becomes available, it can be hard to accurately identify all repetitive elements in the genome, because there are potentially unknown repeat families. Thus de novo repeat discovery methods have been developed to fulfill such requirement. 2.4.2 De novo repeat identification algorithm To discover repeats in a newly sequenced genome, especially for organisms that have not been fully studied and lack sufficient prior knowledge of potentially present retrotransposons, the so-called de novo repeat identification methods become more useful. Here the brief workflow of RepeatScout (Price et al., 2005), a widely used de novo repeat annotation tool, is discussed. The algorithm of RepeatScout starts with the most abundant k-mer. This k-mer serves as a seed for a expansion procedure, during which the program greedily tries to include as many nucleotides as possible in the surrounding sequences of eachk-mer occurrence in the genome to construct a consensus. The measurement for determining how many nucleotides to be included depends on the alignment scoring method, which can be adjusted to favoring a broader consensus or a narrower one. Then 18 the next step is to locate regions with sequence similar to this consensus (using RepeatMasker), and then adjustk-mer count table for finding the next most frequent consensus. In summary, the de novo repeat identification algorithm is capable of detecting repetitive elements without prior knowledge, but the following annotation has to use information of known repeats. In the scope of this dissertation, the analysis starts with the annotation generated by RepeatMasker. 2.5 The special structure of mouse L1 promoters The L1 families specific in the lab mouse genome are named as L1Md families, following the species name. The most important feature that distinguishes L1Md repeats from human L1 repeats is the presence of tandem repeats in the 5’ UTR, which are called monomers (Loeb et al., 1986; Goodier et al., 2001). Each repeating unit is about 200 bp long, and the number of monomers per 5’ UTR is variable. Figure 2.5 shows the typical structure of a mouse L1 repeat. Because of this structural difference, the length of the promoter can differ substantially for human and mouse L1 repeats, and even for different L1Md repeats. For a typical human L1 repeat, its 5’ UTR can be considered as the promoter. This region is usually around 1kb long, and it harbors multiple transcription factor binding sites, including SOX2, YY1, RUNX3, and TCF/LEF (Kuwabara et al., 2009; Yang et al., 2003; Tch´ enio et al., 2000; Athanikar et al., 2004). The monomer sequences are the most likely to be the functional promoter of the L1Md repeats, given the fact that they are CpG-rich and contain multiple transcription factor binding sites including Yy1 (DeBerardinis & Kazazian, 1999a) and Ap-1 (Swergold, 1990). The linker region between monomers and ORF1 is approximately 250bp long and is CpG-sparse. It has be illustrated in vivo that this linker region alone is not capable of initiating transcription (Severynse et al., 1991). Since the CpG sites are the substrate of methylation, and most protein coding gene promoters show an elevated CpG density level compared to other non-regulatory regions, it can be inferred that the monomer region is the promoter of an L1 repeat, and the transcriptional activity of one L1 is fully dependent on the monomers it has. 19 5’ UTR ORF1 ORF2 3’ UTR AAAA Monomer Monomer Monomer ... Figure 2.5: Structure of an L1Md repeat with monomers. One mouse L1 repeat has a 5’ UTR, two open reading frames (ORF1 and ORF2), and a 3’ UTR with poly-A signal. The 5’ UTR includes tandem repeat sequences called monomer, with each being around 200bp. Following the monomer region, there is a linker region before the ORF1 coding sequence, which is indicated by the dashed line. L1Md monomers were first classified into two types: A and F (Loeb et al., 1986; Padgett et al., 1988), with the latter being identified as no longer active. Later within the F type, two more types, the G f and T f types (Goodier et al., 2001; Naas et al., 1998), were identified representing the transcriptionally active descendants of the F type. Thus these three types of monomer are known to be the active promoters within the mouse genome. In the seminal work of Schichman et al. (1993), the A type was further divided into six subtypes based on consistent insertions at specific locations. However the subtype definition for G f and T f types has not been established. The classification of L1Md monomers is mainly consistent with that of L1 repeats, which means if a repeat is classified to the L1Md A family, it usually possesses type A monomers. The exceptions are likely to be caused by recombination events, when two L1Md repeats of different families exchanged genetic contents, and then mixed types of monomers can be observed for one L1Md promoter; or the type of monomers completely mismatches the family name of the repeat. The monomer structure has provided obvious advantage for the propagation of L1Md repeats, given the fact that frequent 5’-end truncation occurs for all L1 insertions. If multiple monomers are included in the mRNA, the probability of promoter-depleting truncation could be effectively reduced and thus more functional insertions can be generated. However, it remains enigmatic how L1Md repeats acquired this tandem repeating structure during evolution. Two models of monomer expansion have been proposed by Schichman et al. (1993). Here the two models are listed and briefly introduced. 20 “Step-wise” expansion model (Figure 2.6A). It was observed that given the subtype classifica- tion of type A monomers, within one L1Md promoter a certain subtype contributes to the most of promoter extension. Therefore it was inferred that this subtype of monomer was duplicated via some unknown process and new monomers were inserted upstream of the promoter one at a time. Unequal crossing over model (Figure 2.6B). If recombination occurs between two monomers located in different promoters, one of the resulting repeat might have elongated promoter because the monomers were transferred from the other repeat. This can cause a segmental duplication of multiple monomers at once. Currently neither model has been proven or disproven, and the mechanism which can cause the step-wise expansion has not been fully characterized. The impacts of these two models differ, and the details will be discussed in later chapters. Monomers have raised many questions regarding their effects to L1Md activities in the mouse genome. For example, for an L1Md promoter with multiple monomers, how its activity is deter- mined. It has been shown that multiple monomers in one promoter can increase the transcription activity, in almost a linear manner (DeBerardinis & Kazazian, 1999a). However, it is yet unclear whether one functional monomer is sufficient to drive transcription when embedded into an array of mutated monomers. Monomers also increase the complexity in all areas of L1Md study, such as the annotation of monomer-containing repeats, identification of active L1Md promoters, etc. Due to these characteristics of monomer and the complexity it creates, specialized computational approaches are required for studying L1Md monomers. 2.6 Objectives of monomer annotation Here the objectives of annotating L1Md monomers in the mouse genome are listed. 21 “Step-wise” expansion Unequal crossing over A B Figure 2.6: Hypothesized monomer array expansion models. A: the step-wise expansion model. B: the unequal crossing over model. Genome-wide detection of L1Md monomers. To annotate monomers, the first task is to detect them within the genome. This can be viewed as a homology searching problem, if the sequences is known. However, since monomers form tandem arrays, it is also necessary to delimit individual monomers within the same L1Md promoter accurately. Direct search of monomer sequence using currently available tools is sufficient for identifying all monomers in the mouse genome. However, later objectives require the detection to be highly accurate, especially for the boundaries between individual monomers, in order to annotate the start and end of the periodic sequences in a manner with biological meaning. In addition, the results of 22 posterior decoding given a profile-HMM is required for classification and analysis of monomer sequences, which is not directly accessible using current public tools, e.g. nhmmer. Therefore it is necessary to implement a profile-HMM structure to detect monomer sequences with high accuracy, and develop a workflow based on this profile-HMM framework to identify and classify monomers in the entire genome. Classification into subtypes. It has been shown that Type A monomers have specific subtypes. However this definition of A subtypes was based on limited data and potentially there can be more undiscovered subtypes. Similar for Type G f and T f monomers, it is likely that subtypes exist but a consistent method for identifying subtypes for all three types of monomers is lacking. Phylogenetic annotation for monomer subtype groups. Observations support that there must be some mechanism which allows L1Md repeats to extend their promoters via duplication of monomers. However, the underlying mechanism still remains elusive. Once a comprehensive classification of all monomer subtypes is established, it is possible to infer the phylogenetic relationship of individual monomer subtype groups. This task requires consideration of both sequence divergence between subtypes and the positioning pattern of monomers within the same promoter. The former can be learned by analyzing sequences of individual subtypes, while the latter requires deeper investigation of the composition of monomer subtypes in individual L1Md promoters. Functional annotation of L1Md promoters based on epigenetic data. The varying number of monomers significantly obscures the problem for determining the overall transcription activ- ity of individual L1Md promoters. It is impossible to directly measure the activity, due to the fact that phenotypes of repeats in normal germ cells are mostly repressed by existing repeat- targeting regulations, such as the piRNA pathway. Therefore it is necessary to employ specific experimental interference, for example knockdown experiments, such that the true phenotypes of repeats can be revealed. In addition, it is inefficient to use typical gene expression data, e.g. RNA-seq data, to estimate the transcription level of repeats, because the similarity of sequence is 23 too high and thus associating sequencing reads to specific repeat elements is extremely difficult. Also given the large copy number of L1Md repeats in the mouse genome, it is not effective to use gene expression data which relies very much on correct normalization. Therefore using epi- genetic marks, especially DNA methylation, is the best choice for annotating repeat functions. 24 Chapter 3 Computational methods for monomer detection 3.1 Detection of monomers using a specialized profile-HMM TheTandemRepeatsFinder (Benson, 1999) integrated in the RepeatMasker pipeline is capa- ble of identifying L1Md monomers. However it provides the boundaries of the entire array of monomers instead of individual ones, and thus the locations of individual monomers need to be separately determined. Theoretically individual monomers can be identified using a homolgy searching program. However, in practice monomers are located in tandem, effectively forming an array of periodic sequence, which might confound the identification of precise boundaries. In our application of the state-of-art homology searching tool,nhmmer (Wheeler & Eddy, 2013), we found that the detected monomers are usually located with small gaps (around 3bp) in between each pair of neighbors. We speculated the reason to be the optimizations for improving sensitivity and the profile-HMM topology implemented fornhmmer. This topolgy structure, named as “Plan 7” (Eddy, 2011), has to insert a number of background states in between each occurrence of homology hit. Although background nucleotides are inevitably included, the sensitivity ofnhmmer detection is sufficient for gathering almost all potential locations of monomers. Therefore considering the additional divergence brought by potential monomer subtypes, we decided to develop a specialized profile-HMM for monomer detection, to accurately detect individual monomers. 25 3.1.1 Topology of the specialized profile-HMM Profile hidden Markov models (profile-HMMs) were introduced by Krogh and colleagues (Krogh et al., 1994). In this work the authors demonstrated an application of hidden Markov models (HMMs) to describe the sequence profile of a family of globins. This HMM structure is the skele- ton of the one that is used in the present study for modeling repeat sequences. A modified profile- HMM structure is introduced by Durbin et al. (1998) and it was demonstrated capable of dealing with local alignment requirements, which is very useful for detecting repeat sequences because of the common 5’ end truncation. This structure is adopted in the model described below to fit poten- tially truncated monomer sequences (Fig 3.1). In the model infrastructure shown in the figure, the stateI 0 is associated with emission of background nucleotides, i.e. the non-monomer sequence in the input. The model length, i.e. total number of match states, is denoted asL. Importantly, we allowed transition fromD L toD 1 , thus making it possible to fit it to the entire observed sequence and detect multiple occurrences of the monomers. This transition can bypass the background state I 0 when the model is fitted to contiguous occurrences of monomers. The theories of applying profile-HMM have been extensively explained in Durbin et al. (1998). Here the parts that are the most relevant to this work are introduced, with necessary modifications to cope with the special cases of monomer detection. LetX denote a DNA sequence, andx i denote the nucleotide at positioni. LetY denote a sequence of hidden states, withy i indicating the state corresponding to observed DNA sequence at positioni. For anyi, the statey i can take value from the set of states that can emit a nucleotide: y i 2S =fM 1 ;:::;M L g[fI 0 ;:::;I L1 g; as depicted in Fig 3.1. Note here because the deletion states do not emit, they are not included. Let denote the set of parameters for the profile-HMM, including all transition probabilities and all emission probabilities. The goal of the decoding process is to find the most likely state sequence Y when observing a DNA sequence X. One can find this state sequence by enumerating over 26 M 1 M 2 M i M L I 0 I 1 I 2 I i D 1 D L D 2 D i ... ... ... ... ... ... ... ... Figure 3.1: Structure of a profile-HMM specialized for monomer detection. This topology allows 5’ and 3’ end truncation (local alignment), and cyclic representation of sequences without back- ground sequences. Silent states that don’t have emission probabilities are shaded. Red transition edges indicate the modifications made for monomer detection. all possible Y and pick the one that gives maximum likelihood. But this process is impractical because the number of possibleY increases exponentially with the length of input DNA sequence. In fact, finding the state sequence Y = y 1 y 2 :::y N , such that Pr(Y jX;) is maximized, can be computed using a dynamic programming algorithm within polynomial time. The algorithm is 27 called the forward-backward algorithm. In this algorithm, the forward probability y i (t) is defined as the probability of observing a prefix of the sequence at a certain state: k (i) = Pr(y i =k;x 1 :::x i j) = X l2S l (i1)Pr(y i =kjy i1 =l;)Pr(x i jy i =k;) = X l2S l (i1)t kjl e k (x i ); wheret kjl = Pr(y i = kjy i1 = l;) is the transition probability, ande k (x t ) = Pr(x t jy i = k;) is the emission probability. Note in the profile-HMM structure used in this study, not all states are emitting; in particular the deletion states are silent. Therefore, in the implementation the forward probability has two cases: k (i) = Pr(y i =k;x 1 :::x i j) = 8 > > < > > : P l l (i1)t kjl e k (x i ); k2S P l l (i)t kjl ; otherwise. (3.1) The backward probability is defined as the probability of starting from a state at one position, to observe a suffix of the sequence: k (i) = Pr(x i+1 :::x N jy i =k;); 28 whereN is the length ofX. Similarly, the backward probability has two cases for emitting and non-emitting states: k (i) = Pr(x i+1 :::x N jy i =k;) = X l 8 > > < > > : l (i+1)t ljk e l (x i+1 ); l2S l (i)t ljk ; otherwise. (3.2) According to the definition of forward and backward probabilities, the probability of y i = k is proportional to the product of both by conditioning: Pr(y i =kjx;)/ k (i) k (i): Therefore individualy i can be found by y i = argmax k k (i) k (i); fork2S and fori2f1;:::;Ng: This is called posterior decoding. In the decoding result of a potential promoter region, individual monomers can be found amongY as consecutive runs of non-background states with increasing order of state index. We found there is a special case which requires specific settings when practicing posterior decod- ing with real data. Some types of monomers have small internal duplications (less than 10bp), which will cause a break in the decoding step if monotonic increasing of state order is required. For example, in an optimal decoding result, the states for a region with duplication could be: 20;21;22;23;24;21;22;23;24;25, where the states 21 24 represent the duplicated region. In this case, one monomer should be identified instead of two. Therefore in the implementation of posterior decoding, we allowed the state index to go back at most 10 while maintaining the runs of non-background states to accommodate internal duplications. 29 3.1.2 Model construction from multiple sequence alignment Learning the parameters of a profile-HMM from a multiple sequence alignment (MSA) is the key to sensitivity in subsequent application of the model. Parameters representing the transition and emission distributions for individual states, along with the length L of the consensus sequence, must be extracted from the MSA in non-trivial steps. Note here, the MSA is constructed from individual monomer sequences, which may or may not be truncated, instead of using sequences of entire promoters with multiple monomers. This is for the purpose of achieving an optimal multiple alignment of monomers for better training of emission parameters. The transition probabilities which determine the expected number of monomers per promoter are not essential in this step. In this section, the process of estimating profile-HMM parameters from this type of MSA is discussed. The model construction starts with finding the value ofL from the MSA. Supposen sequences x 1 ;x 2 ; :::;x n are aligned in an MSA with m columns, and gaps are represented by the space symbol “-”. FindingL2 (0;m] is actually selecting columns among this MSA that best represent the sequence profiles. Heuristically, one can select columns in which some threshold number of sequences are not gap: I(columni is chosen) = 8 > > < > > : 1; jfj :x j i 6=gjm 0; otherwise; where x j i represents the symbol of sequence x j at the ith column in the MSA. Usually this is set to 0.5, which means if more than half of the sequences are not gap at a specific column, this column will be marked as homologous and included in the model. This heuristic method, however, does not work well in the case of heavy 5’ truncation of retrotransposons. If more than half of the sequences included in the MSA have 5’ truncation, it is likely for the 5’ end positions to be excluded from the model. This means the heuristic method will need careful selection of sequences for MSA, otherwise the estimation ofL will be affected by the truncation significantly. 30 x 1 CGCGATCGCTG-ATCGCATCGATACGTACGTGACGACGATC x 2 ----ATCGCTG-ATCCCATCGATACGTACGTGTCGACGATC x 3 --------CTG-ATCGCATCGAGACGTA--TGACGACGATC x 4 --CGAT-GCTG-ATCGCATGAGTACATGCGAGACCACGATC x 5 --CGCTCGCTGCGCTCGCTCGA-ACATATGTGACCACGATC x 6 --CGGCTGTTG-ATCG----GATACGTATGTGACGACGATC x 7 CGCGATCGC-G-ATCGCATCGA-ACATATGTGGCGACG--C x 8 --CGATCGCCGCGACGCATCGATACGTA--TGGCGACGATC x 9 CGCGATCGCTG-ATCGCATCA--ACGTA--TGACGACGATC x 10 ------CATTA-------TCAATACGTACG-GACGACGATC Heu ######### ############################# MAP *********** ***************************** Figure 3.2: An example of model construction from multiple sequence alignment. The columns that are chosen into the model is indicated with “#” or “*”, indicating choices made by the heuristic method and the MAP, respectively. An alternative way to estimate L from the MSA is an MAP method, which uses a dynamic programming algorithm to determine which columns to included (Durbin et al., 1998). Before introducing the recursive process, the parameter estimation from MSA should be demonstrated. Assume that the value ofL has been estimated. In other words,L columns in the MSA have been chosen as the corresponding model positions. Then the hidden state path for each sequence can be decided by checking whether or not the symbol in the aligned sequence is a gap. The truncations are also considered after L columns are chosen. The columns before the first chosen column in the MSA will not be included. For example in Fig 3.2, the match states are indicated by “#” with the heuristic method. The third column is thus the first chosen column, and the corresponding nucleotide ofx 1 isx 1 3 =C, which can be considered in the state sequence asy 1 1 =M 1 . 31 This state pathy i of sequencex i is composed of match (M j ), insertion (I j ), and deletion (D j ) states, corresponding to those state types introduced in the previous section. The subscript j of those states, is numbered from 1 toL according to their position relative to the marked columns. The model parameters are then estimated from the counts of those hidden states for all y j . The count of transitions between any two states k and l is obtained by considering all pairs of such transitions in the MSA: T(k;l) = n X i=1 m1 X j=1 I(y i j =k)I(y i j+1 =l); where k;l2S[fD 2 ;:::;D L1 g, because truncation is not included. Similarly, the count for symbolu generated by the emission of statek is E(k;u) = n X i=1 m X j=1 I(y i j =k)I(x i j =u); where k 2 S;u 2 fA;C;G;Tg. And finally the transition and emission probabilities can be estimated by the following equations: ^ t ljk = T(k;l)+t 0 ljk P l T(k;l)+t 0 ljk ^ e k (u) = E(k;u)+e 0 k (u) P v E(k;v)+e 0 k (v) ; where t 0 ljk and e 0 k (u) are the priors, i.e. pseudo-counts, and u;v indicate four types of DNA nucleotides. The choice of priors is attached in Appendix A.1.1. Now the question is how to optimally choose L columns from the MSA with a total of m columns, whenL is not known. The aforementioned MAP method recursively calculates an objec- tive functionS j for each columnj2 [1;m]. This function is defined as the log probability of the optimal model considering the MSA from the beginning up to columnj, with this column being chosen. Based on the above equations for estimation model parameters, the count data only depend 32 on the subscripts of the hidden states iny. More specifically, for a block of the MSA bounded by two chosen columns, the transition and emission counts are independent of columns out of this block. This means the objective functionS j only depends on columnsk2 (i;j), given thatj is the next marked column afteri. Then the recursion can be established only based on those columns. Here the recursion ofS j is shown: S j = max 0ij S i +T ij +M j +I i+1;j1 +: The definitions of each term in this equation is explained in the following. The termT ij is the log posterior probability of observed transitions between marked columni andj given the prior transition probability: T ij =T(y i ;y j )logt 0 y j jy i : The termM j is the analogous log probabilities given a prior emission probability when columnj is chosen: M j =E(y j ;u)loge 0 y j (u): Similarly,I i+1;j1 is the log posterior probabilities given that fromi+1st column to thej1st are insertion. Note here the prior of emission is given as a general match and insert state, respectively, because the subscript ofy cannot be decided yet. The parameter is a regularizing parameter to control the number of columns that are marked. Its Bayesian interpretation is the prior probability of marking all the columns in the MSA result in log scale; therefore it has a non-positive value, with larger ones favoring longerL. The algorithm pseudo-code is shown in Algorithm 1. In the example shown in Fig 3.2, it can be seen that the MAP method tends to included columns that have end truncation, while in the intermediate columns, the MAP method recognizes deletion and insertion same as the heuristic method. 33 Algorithm 1 MAP model construction Initialization 1: S 0 = 0;M M+1 = 0. Recursion 2: forj = 1;:::;M +1 do 3: S j = max 0i<j S i +T ij +M j +I i+1;j1 + 4: j = argmax 0i<j S i +T ij +M j +I i+1;j1 + 5: end for Traceback 6: j = M+1 7: whilej > 0 do 8: j = j 9: end while 3.1.3 Scoring scheme of the decoding step In the previous section, the region of a consecutive run of non-background states in the posterior decoding step is considered as the sequences that are generated by the profile-HMM. The log- likelihoodL(jY) for the input sequenceY can be written in the following form: L(jX) = log X Y Pr(X;Yj) = log X k k (N); whereN =jXj indicates the length of the sequence, and k (N) is the forward probability function considering the entire input sequence. The log-likelihood of an input DNA sequence given the profile-HMM will be negatively correlated to its length, because longer sequence means more terms in the summation, and each term is always negative, due to the probability is bounded to be less than 1. This can cause numerical issue, such as underflow in the implementation, and also reduces the sensitivity of separating the foreground sequences which are homologous to the model from the background noise. A commonly used scoring scheme to avoid such issue is by changing 34 the emission probability included in the forward and backward probability functions to log-odds ratio (Barrett et al., 1997). For example, in Eq 3.1, the forward probability is redefined as: k (i) = 8 > > < > > : P l l (i1)t kjl e k (x i )=e I 0 (x i ); k2S P l l (i)t kjl ; otherwise. (3.3) And similarly the backward probability is: k (i) = X l 8 > > < > > : l (i+1)t ljk e l (x i+1 )=e I 0 (x i+1 ); l2S l (i)t ljk ; otherwise. (3.4) Here the emission probability of the corresponding hidden statex i has been changed to a ratio over the emission of the background stateI 0 . It is shown in Durbin et al. (1998) that this ratio can significantly increase the sensitivity of detection, however with a bias towards longer sequences. It is because the longer the input sequenceY is, the more likely for it to have sporadic significant emission ratios by chance, to boost the forward and backward probabilities. To mitigate this bias, it is necessary to further normalize the likelihood scores by sequence length. Here we define a scoring function to normalize log-likelihood ofX asL (jX) = log P k k (N) by the score of its reverse complement,X RC : s(X) =L (jX)L (jX RC ) (3.5) This works becauseX RC has the same length ofX, and the reverse complementary sequence will have a likelihood value very similar to background noise, if X can align well with the model. Finally, a z-score transform is applied for the scores of all the sequences. Then the z-score of individual s(x i ) can be used for filtering out potential false positives. Since the distribution of transformeds(x i ) has a long tail of extreme values because of the true positives which have very high emission ratios, the filtering is one-sided. In this study z-score> 1 is used to find sequences that are homologous to the consensus sequences represented by a given profile-HMM. 35 3.2 Monomer subtype identification Identifying subtypes given a group of monomers is a problem of finding sensible clusters using sequence information. Once each repeat was inserted into the genome, all its monomers will start accumulating random sequence mutations independently, and thus the distance between each pair of monomers is a function of the time since insertion. Similarly, for monomers belonging to dif- ferent repeats, a distance can also be found by comparing the sequences. Then clustering based on pairwise distances of monomers would have biological meaning, indicating certain relation- ships during evolution. The basic assumption is that, the random sequence mutation accumulated overtime during evolution is not comparable to that generated by rare mutation events which have caused emergence of new monomer subtypes. Thus in a clustering analysis sequences forming a cluster can be regarded to belong to the same subtype. Although there is not a rigorous definition for such events, usually one can consider those significant mutations that are not likely to be caused by spontaneous mutation or errors during reverse-transcription to be subtype-defining. Since the probability of nucleotide substitution is much higher than that of insertion or dele- tion, the latter two were used to referring evolution events when subtypes emerged in the study of Schichman et al. (1993), while substitutions are often used to estimate the age of a specific repeat group. However, it is not a trivial task to define the extent of insertion/deletion as subtype- defining. The implementation of profile-HMM for monomer detection actually helps circumvent- ing this problem when monomers have different lengths. Insertions and deletions will be detected during the posterior decoding step, and their contribution (or disruption) to the clustering will be properly weighted according to the decoding of latent states. In this section, we will discuss sub- type identification for monomers based on the Fisher score vector, which will be shown to have certain advantages over the classical subtype definition. 36 3.2.1 Calculating the Fisher score vector The variable length of sequences causes difficulty in constructing consistent multiple sequence alignment, and most of the useful clustering algorithms cannot be directly applied on DNA sequences. Therefore it is better to implement a method which can map input sequences of variable length to data with a fixed size. The profile-HMM modeling of sequences enables such conver- sion, because the model parameter space is fixed once the number of match states L is decided. The Fisher score vector includes partial derivatives of individual model parameters given the log- likelihood, and it was first applied to a homology searching related study by Jaakkola et al. (2000). In the practice of monomer detection, one does not need to use derivatives of all the parameters, but only to choose the ones with the most distinguishing power. In this study, the derivatives of the emission probabilities of all match states are used, which constitutes a vector of size4L. Here the Fisher score vectorU X of one input monomer sequenceX is defined as: U X =r L(jX) = @L(jX) @e M 1 (A) ; @L(jX) @e M 1 (C) ;:::; @L(jX) @e k (u) ;::: ; (3.6) wherek2fM 1 ;:::;M L g, andu2fA;C;G;Tg. It has been shown in Jaakkola et al. (2000) that the analytic form ofU X is: @L(jX) @e k (u) = E(ujk) e k (u) X v2fA;C;G;Tg E(vjk); (3.7) whereE(ujk) can be calculated from forward-backward probabilities: E(ujk) = N1 X i=1 k (i) k (i) Pr(Xj) I(x i =u): In the analysis of variable number of monomers, it is necessary to normalize the Fisher score vector. Because the expectation of emission is dependent on the sequence lengthN, the variances 37 of individual derivatives are not at the same scale. However normalizing by N for the entire vector is biased, due to the 5’ truncation which caused the expectation of some states to be lower than others. To mitigate this bias, the expected occurrences of states can be used. This expected occurrence can be directly calculated from the forward and backward probabilities based on their definitions: E(kjX;) = N X i=1 Pr(y i =kjX;) = N X i=1 k (i) k (i) Pr(Xj) ; (3.8) which is exactly the last term of Equation 3.7 Now the normalized Fisher score vectorU X can be written as: U X = :::; E(ujk) e k (u) P v2fA;C;G;Tg E(vjk) 1;::: ! : (3.9) Following the construction of the Fisher score vectors, individual monomer sequences at vari- ous lengths can be mapped to the same parameter space, and then clustering analysis can be done to identify subtypes. 3.2.2 Clustering-based subtype identification Once monomer sequences are converted to Fisher score vectors, dimensionality reduction is applied before clustering. Principal component analysis (PCA) is used in this study. For a profile- HMM with L = 200, the size of a Fisher score vector will be 800, according to all emission parameters of the match states. In practice of monomer clustering, we found that applying PCA and keeping the first 100 principal components (PC) is sufficient. In a case of more than 3000 sequences, using the first 100 PCs will reflect more than 60% of the variance. Classical clustering methods, for example k-means or hierarchical clustering, do not work well when applied to monomer sequences. This is because different subtypes of monomers emerged at different evolutionary times have accumulated various amount of spontaneous mutation, which reflects diverged variance within group. The size of each subtype, also diverges greatly due to the 38 vastly different copy number of subtypes. For k-means, it tends to partition the parameter space into equal size areas. This will bias towards monomer subtypes of large sizes, and omit small groups. Also it is difficult to find a value for k which is biologically meaningful. For hierarchical clustering, since the intra-group variance varies between subtypes, one cannot set a flat threshold for cutting the dendrogram of pair-wise distance (Figure 3.3). In addition, an ancient subtype which has distantly related individuals might be divided into different subtrees by other more dense clusters. 0 20 40 60 80 100 Euclidean distance Figure 3.3: Dendrogram of randomly sampled Type A monomers. All Type A monomers were included in the PCA analysis. After PCA 200 monomers were randomly sampled to construct a dendrogram based on hierarchical clustering. We used the HDBSCAN algorithm (Campello et al., 2013) to detect clusters after applying PCA. HDBSCAN is a hierarchical clustering method based on DBSCAN (Ester et al., 1996). The latter is a density-based clustering algorithm, which first identifies closely connected “core” data points and then assign other points to their neighboring core points following certain distance 39 threshold. Two parameters for the minimum number of pointsm pts for finding core points and the distance threshold are required to be specified by the user. HDBSCAN is more flexible at finding core points by dynamically changing for finding cores at different densities. Since HDBSCAN only clusters based on the core points, and labels the remaining points as noise, we applied an additional nearest neighbor clustering to assign a subtype for all monomer sequences. 3.3 Pipeline of L1Md monomer detection In this section, the pipeline of L1Md monomer detection is introduced. The workflow is shown in Figure 3.4. The first step is to learn the model length of profile-HMM for each type of monomer. At the beginning, the entire genome is scanned bynhmmer using a profile-HMM built from one consensus sequence of a monomer. The consensus sequences of each type of monomer are taken from previous publications (Schichman et al., 1993; Goodier et al., 2001; Naas et al., 1998). In the results ofnhmmer scan, entries that have minimum bit score of 60 are kept as potential monomers. These monomers are then extended by 20 bp on both directions, such that adjacent monomers can be merged to recapitulate potential L1 promoters even when small gaps exist. For each promoter region generated by merging monomers, its reverse complementary sequence is used for TandemRepeatsFinder analysis with parameters “2 5 5 80 10 50 500”. These parameters indicate the weight of a sequence match, the penalty of a mismatch, the penalty of an indel, the expected probability of a match (in percentage), the expected probability of an indel (in percentage), the minimum score required for a periodic signal to be reported, and the maximum length of a repeating period, respectively. These parameters are set empirically, and in practice we found that most of the monomers in mouse can be detected with this setup, due to the length of most monomers is around 200 bp and the sequence similarity between monomers is high. The results ofTandemRepeatsFinder require some filtering, because many other non-monomeric tandem repeats could also be found. To exclude those irrelevant results and detect the most likely monomer sequences, we required the length of detected tandem repeats to be within the range of 40 180 – 220 bp, and the minimum copy number per promoter is 2. Although this filter will exclude the monomers detected in many short promoters, the remaining monomers are still sufficient for the construction of an initial profile-HMM, which will be refined in an iterative process. Random sampling of detected tandem repeat sequences is applied such that up to 500 sequences in their original orientation are used to construct a multiple sequence alignment using MUSCLE (Edgar, 2004). An initial profile-HMM is built from this multiple alignment with the model length estimated through the MAP algorithm discussed in Section 3.1.2. This MAP algorithm is tuned such that more columns would be included in the profile-HMM to retain as much information as possible in the initial model, by setting to 0. Posterior decoding based on this initial model is then applied to all promoter sequences for a preliminary detection of monomer locations. The length histogram of detected monomers is summarized, and the median of the top 8 abundant lengths is determined as the proper length of the profile-HMM. This is because during posterior decoding, all potential monomers will be detected, including those with significantly altered lengths due to indel or truncation, and the mean of detected monomer lengths would be biased if those monomers are considered. Monomers having length no more than 20 bp away from the estimated consensus length are randomly sampled such that 500 of such monomers are aligned for training emission parameters iteratively. A consensus sequence is then extracted from the profile-HMM using the most likely emission for each individual match state. This consensus sequence will be used to guide the inference of hidden states from the multiple sequence alignments constructed later for parameter training. The next step is to refine the parameters of the initial model. For each iteration, posterior decoding using the current model parameters is applied for all promoter sequences. After decoding, individual monomers can be identified by selecting the regions of consecutive match states. The among the detected monomers, 500 are randomly sampled from the top 33% highest scoring ones to construct a multiple alignment. This multiple alignment also includes a reference sequence to mark the columns to be chosen. Thus instead of using the MAP algorithm, no inference is needed for individual hidden states because the columns are already marked by the reference sequence, and 41 the length of the profile-HMM will stay unchanged during iterations. This alignment is then used for estimating the model parameters for the next iteration. Because only highly scored sequences are sampled, the training process will converge after a certain number of iterations. The condition for determining convergence is when the Jaccard index of the regions identified as monomers between two iterations becomes higher than 0.95. This pipeline of monomer detection usesnhmmer to reduce the searching space for monomers, because based on our current implementation of profile-HMM, it is impractical to perform a whole- genome posterior decoding. Since the sensitivity of nhmmer is high enough to identify most of the monomers in the mouse genome, we focused more on refining the parameters of profile-HMM trained from monomer sequences to obtain accurate boundaries of monomers. In fact this pipeline can run with alternative input instead ofnhmmer output. For example, one can use the upstream regions of L1 ORF1 sequences identified in the genome as potential L1 promoter regions. It is still feasible forTandemRepeatsFinder to detect periodic sequences in such regions, although the classification of monomers needs further inspection in this case. During the detection of monomers based onnhmmer, it is assumed that all detected monomers belong to the same type, and thus the merging of adjacent monomers would most likely to yield promoters of the same monomer type. This works because the initial detection of monomers using nhmmer has certain specificity to distinguish monomer types. In practice, we found that the detection of A, G f and T f monomers by nhmmer using the published consensus sequences can distinguish these monomer types, and thus it is not required to solve the ambiguity of detection. In sum, the pipeline we developed for monomer detection and profile-HMM construction can identify monomers in the genome de novo. The parameters of profile-HMM are iteratively refined for more accurate detection. This pipeline does not require a huge amount of computational resources to deploy, because the search space for monomers can be narrowed down to most likely L1 promoter regions, by eithernhmmer detection or L1 ORF1 locations. The posterior decoding process following monomer detection is performed for individual promoter regions, and thus can be parallelized to reduce runtime. Later in this dissertation it will be shown that the profile-HMM 42 Construct prole-HMM from monomer consensus Genome-wide search with nhmmer using consensus Generate promoter regions and use TRF to find repeating signal Align repeating signals and construct an initial profile-HMM Compute posterior decoding of promoter regions to detect monomers Construct a profile-HMM from detected monomers Detection converged Output converged profile-HMM and detected monomers No Yes Figure 3.4: Workflow of monomer detection pipeline. constructed for each type of monomer can be used for the classification of subtypes and a number of analyses to study the evolution of monomers. 43 Chapter 4 Classification and functional annotation of L1Md promoters In this chapter, the results of applying the monomer detection pipeline are presented. The main focus is the genome of lab mouse, mm10, while other available mice genomes are also included in the analysis. Then the monomers in the mm10 genome are used for subtype identification. After subtyping, functional annotation of the L1Md promoters are shown using publicly available epigenetic data during mouse sperm development. 4.1 Results of monomer detection in various genome assem- blies We applied our pipeline separately for detection of the three known monomer types: A, G f and T f in the genomes listed in Table 4.1. For convenience these species names are abbreviated using the initials, such as mc for M. caroli, etc. The Type F monomer was not included because it is the common ancestor of Type G f and T f , which would create overlap in the nhmmer detection results. Hereafter we abbreviate G f and T f as Type G and T, respectively. The number of detected monomers in each genome is shown in Figure 4.1. We found that the number of monomers were significantly lower in all other genomes compared to mm10, even for the WSB/EiJ strain, possibly due to difficulty in assembly construction for highly similar monomers. We examined the amount of unassembled nucleotides proximal to detected monomers. In Figure 4.2 it can be seen that indeed there is a substantial fraction of unassembled regions around detected monomers, but the 44 distribution of those unassembled shown in Figure 4.3 are mostly surrounding L1Md promoters instead of overlapping. We speculated that there are more such regions that cover entire L1Md promoters, which could cause missing of entire promoters; while for the detected ones, it is more likely that they are not affected. Species name Strain name Genome assembly name Genome assembly date M. m. domesticus C57BL/6J mm10 2011.12 M. m. domesticus WSB/EiJ mmd 2015.09 M. m. musculus PWK/PhJ mmm 2015.09 M. m. castaneous CAST/EiJ mmc 2015.09 M. spretus SPRET/EiJ ms 2015.09 M. caroli CAROLI/EiJ mc 2015.09 Rattus norvegicus – rn6 2014.07 Table 4.1: Genomes on which the monomer detection pipeline was applied. Except for mm10 and rn6, all other genome assembly names are abbreviated from their corresponding species name, because official names are not applicable. Type A Type G Type T Count (x1000) 0 5 10 15 20 mc ms mmc mmm mmd mm10 Figure 4.1: Count of detected monomers in various mouse genomes. 45 mc ms mmc mmm mmd mm10 Monomer count Fraction unassembled 0 5 10 15 20 25 Count (x1000) 0 0.05 0.1 0.15 0.2 0.25 Fraction unassembled Figure 4.2: Fraction of unassembled nucleotides around Type A monomers. The finalized profile-HMMs for all three types of monomers in the mm10 genome are presented as consensus sequences in Figures A.1, A.2 and A.3. In our detection results, the T consensus has a shift of periodicity compared with the previously reported one. This is probably due to the fact that some T promoters have truncated monomers in the 3’-end region (Goodier et al., 2001). Because currently only Type A monomers have well defined subtypes based on indels (Schichman et al., 1993), we chose to examine the length distribution of Type A monomers detected using our pipeline to investigate if there is any undocumented length polymorphism. The distribution of monomer lengths is shown in Figure 4.4. We found that the range for all six A subtypes (197 – 208 bp) covers monomers in all genomes very well, while lengths in the range of 202 bp to 207 bp are more abundant in genomes other than mm10. The most striking difference is that the 208 bp length is almost completely missing in all other genomes except for mm10. According to the previous A subtype definition (Schichman et al., 1993), the 208 bp monomers are classified as the 46 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Fraction of unassembled mc ms mmc mmm mmd −500bp −250bp 0% 50% 100% +250bp +500bp Figure 4.3: Distribution of unassembled nucleotides around monomers. A-I subtype based on a 7-bp insertion. Since A-I is the youngest and most abundant subtype in mm10, it is unlikely that all other species and strains lack this subtype of monomer. Therefore we verified the existence of A-I in the five mouse species other than lab mouse using raw genomic sequencing reads. The reads used for validating were obtained from the same data set released by the Mouse Genome Project (Keane et al., 2011) for genome assembly construction. In addition, we included two genomic sequencing samples for rat as an outgroup control (Ramdas et al., 2018). We first used nhmmer with the consensus sequence of L1 spa ORF2 (GenBank accession AF016099.1) and the consensus sequence of the Type A monomer reported in Figure A.1 to search for reads related with L1 ORF2 and Type A monomers, respectively. A minimum bit score of 20 was used for filtering for homologous reads. Then for the reads that were detected homologous to the Type A monomer, we used our profile-HMM to detect the 7-bp insertion using posterior state occurrences following 47 180 190 200 210 220 0.00 0.10 0.20 0.30 Length Fraction mc ms mmc mmm mmd mm10 Figure 4.4: Length distribution of detected Type A monomers in various mouse genomes. Only the range within 180 – 220 bp is shown. Most of the detected monomers are within this range. Equation 3.8. If one read contains the 7-bp insertion which defines the 208-bp A subtype, its posterior expected occurrence of match states will show elevated values for states corresponding to the insertion. As shown in Figure 4.5, the read carrying the 7-bp insertion shows a clear increase in states corresponding to the insertion, while the non-related read has a flat signal. The threshold for determining relationship is set to 1.2, and we used this method to filter reads that were detected to be related to Type A monomers to get the number of reads carrying this subtype defining insertion. In Figure 4.6, the percentages of reads related to L1Md repeats in various genomes are shown. The detailed statistics of this analysis is included in Table A.1. As expected, there are very few reads that were detected to be related with Type A monomers in the rat genome, even though the fraction of ORF2-related reads is higher in rat. It can be seen that although the fractions of ORF2-related reads are similar across all Mus species, it differs for Type A monomer and the A-I subtype. A probable cause for this bias is that the sequences used in this analysis are from the 48 0 50 100 150 200 0.0 0.5 1.0 1.5 State position Posterior expected state count 208bp−related non−related Figure 4.5: An example of posterior expected state counts showing reads related to the A subtype defining insertion. Match states from position 97 to 104 correspond to the 7-bp insertion which defines the 208-bp A subtype. The red curve represents the expected state occurrences of a read carrying this insertion, and therefore it was verified to be related to the 208-bp subtype. lab mouse genome. Therefore the number of reads detected to be related to A monomers and potential A-I subtype is a function of evolutionary distance between the corresponding species to lab mouse. However, this result is still sufficient to support that the A-I subtype monomers which are commonly observed in lab mouse also exist in other Mus species, because of the presence of the specific 7-bp insertion. This means this specific subtype existed in the common ancestor of Mus. According to the analysis by Schichman et al. (1993), the 7-bp insertion during the emergence of the A-I subtype occurred after the formation of all other subtypes. Here we verified that in M. caroli this type of insertion exists, indicating that this insertion occurred at least 4 million years ago, before the speciation of M. caroli. 49 ORF2 % of reads 0 2 4 6 8 10 A monomer % of reads 0.00 0.02 0.04 0.06 0.08 0.10 0.12 7−bp insertion % of reads 0.000 0.010 0.020 0.030 rn mc ms mmc mmm mmd Figure 4.6: Fractions of reads related with L1Md sequences. Left: percentage of reads that are homologous to L1Md ORF2 sequences. Middle: percentage of reads that are homologous to L1Md Type A monomers. Right: percentage of reads that have 7-bp insertion which defines a A subtype. Error bars indicateSD. In LINE-1 repeats of rat (name as L1Rn families following the species name Rattus norvegi- cus), tandem repeat structure was also observed in promoter regions (Furano et al., 1988). How- ever, when we applied our pipeline for the rat genome assembly (rn6), no monomer around 200 bp in length like those in Mus species was detected. We made two attempts, one with the same detection setup using a mouse monomer consensus sequence heuristic, and the other one directly starting from detection ofTandemRepeatsFinder in 5’ UTRs located by homologous search of L1 ORF1 sequence. Although the first attempt might be limited by low sequence similarity between potential rat monomers and the ones in mouse, the second should be able to discover tandem repeat structure if there is any in the 5’ UTR of L1Rn repeats. We did observe a small 50 number of mouse monomer-like sequences in the rat genome, which belong to repeats of the Lx family. Because the Lx family is common in rodents (Pascale et al., 1993), this result suggests that monomers around 200 bp were specifically inherited in the Mus lineage. The tandem repeats reported by (Furano et al., 1988) are around 600 bp long. We observed such sequences in only few L1Rn repeats, with copy number no more than 2 per promoter. Considering these facts, it can be inferred that monomers around 200 bp are specifically abundant in Mus species, and they have greatly enhanced the activity of L1Md repeats. 4.2 Monomer subtype definition and identification The classical subtype definition of Type A monomers is problematic for direct application to infer subtypes genome-wide, because it is based on features that are confounded by random indels that arised during evolution. Therefore we developed an unsupervised approach which uses infor- mation independent of sequence length to identify and annotate monomer subtypes in the whole genome. The profile-HMM implemented for this study allows the computation of Fisher score vectors (Jaakkola et al., 2000; Tsuda et al., 2003), which can be used for subtype identification instead of raw DNA sequences. The Fisher score vector given a profile-HMM is a vector of par- tial derivatives of emission parameters of the log-likelihood of the observed sequence. The key advantage of using this approach is that all sequences can be converted to a vector of the same dimension, which automatically solves the problem of altered lengths caused by random indels. To mitigate the effect of truncation and long indels, we defined a set of “core” monomers. This set only consists of monomers with length ranging from 180 bp to 220 bp with the additional requirement of having another monomer both 5’ and 3’. The length range was set empirically based on the length distribution of detected monomers. Monomers not meeting these criteria were excluded from subtype identification as they are likely truncated. We also excluded the Y chro- mosome from analyses related with subtype classification. A principal component analysis (PCA) 51 was first applied on the Fisher score vectors of unique monomers. Then the top 100 principal com- ponents (PC) were chosen for clustering. The top 100 PCs can explain around 60% of the total variance (53.6%, 66.3%, and 57.3% for A, G, and T, respectively). Then subtype identification was done following the procedures described in Section 3.2. The HDBSCAN algorithm used for clus- tering helps in monomer subtype detection because the lineage of monomers includes groups that expanded at various ages, leading to differences in expected intra-group variation. HDBSCAN also designates certain outliers as “noise” so we applied nearest neighbor after HDBSCAN to assign a subtype to each data point that was left out by HDBSCAN. ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −100 −50 0 50 100 −100 −50 0 50 100 tSNE 1 tSNE 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 Figure 4.7: tSNE plot of the clustering of A subtypes. Each dot represents a core monomer sequence used for the identification of subtypes, and the color of each dot is based on the result of HDBSCAN and nearest neighbor assignment. 52 Using this approach, we identified 34, 6 and 32 subtypes for Type A, G, and T, respectively (Tables A.2, A.3, and A.4). The data used for subtype identification of Type A monomers is shown as a tSNE plot in Figure 4.7. The identified subtypes are numbered based on their counts in a decreasing order. In general, the majority of monomers of each subtype share the same length. We were able to distinguish subtypes that have the same consensus length but differ by nucleotide sub- stitutions. We then measured the divergence of sequence within each subtype and the relationship between subtypes using operation-weighted edit distance, with substitutions and spaces weighted as 1 and 2, respectively. 26 34 31 11 8 23 33 7 10 9 13 3 5 18 12 19 29 21 28 32 24 22 20 4 25 17 27 14 15 1 16 30 2 6 <20 26 35 44 >50 Figure 4.8: Clustering of A subtypes based on average intra-type edit distance. 53 4 3 1 5 2 6 10 20 30 40 50 60 Edit distance A B 3 11 13 16 1 24 17 5 7 32 31 19 18 30 9 28 23 6 21 8 15 4 22 10 29 14 12 2 26 27 20 25 10 20 30 40 50 60 Edit distance Figure 4.9: A: clustering of G subtypes based on average intra-type edit distance; B: clustering of T subtypes. In Figure 4.8, clustering for the subtypes of the Type A monomers is shown. Hierarchical clus- tering indicates that Subtype 2, 6 and 30 are outlier groups, which have high variation for both intra- and inter-group comparisons. The consensus lengths and intra-subtype sequence variation of subtypes indicate that our identification method is capable of not only recapitulating the clas- sical A subtype definition based on indel, but also providing further classification to differentiate monomer groups in a finer scale. For example, Subtype 1 corresponds to the majority of the A-I subtype defined by Schichman et al.; however, this A-I subtype was also classified to multiple other subtypes such as Subtype 14, 15, etc. We then analyzed G and T subtypes using the same approach (Figure 4.9). Previous studies have reported consensus sequence lengths of 206 bp and 212 bp for Type G and T monomers, respectively (Goodier et al., 2001; Naas et al., 1998). Among the G subtypes we identified, the 206 bp length is the most abundant among all subtypes, while Subtype 3, 4 and 6 have specific deletions (Table A.3). Clustering based on sequence variance shows that Subtype 3 and 4 have low similarity comparing with other subtypes (Figure 4.9, panel A). For T subtypes, we found that the length of 214 bp is actually more abundant than the previ- ously reported 212 bp (Table A.4). Sequence variation analysis indicates Subtype 3 was the only 54 outlier group. In fact, Subtype 3 of Type T monomers has a dominant length of 204 bp, which is closer to G subtypes. 4.3 Truncation profiles of monomer types The truncation site of the 5’-end monomer in a promoter can be considered as the insertion termi- nation site of the corresponding repeat. If the repeat mRNA was complete and fully inserted, this site corresponds to the transcription start site of the parent copy, since retrotransposon promoters must be internal. Previous low-throughput studies reported that many Type A monomers share a common truncation site, while Type G and T often exhibit truncations close to an YY1 binding site (Shehee et al., 1987; DeBerardinis & Kazazian, 1999b). We investigated the truncation profiles of all three types of monomers across the entire genome to validate whether these truncation features still hold. Following the definition of core monomers, we filtered for those non-core monomers which are located at the very 5’-end tip of individual L1Md promoters, gathering 7425, 2351 and 5103 truncated monomers for Type A, G and T, respectively. Forward-backward algorithm was applied to those truncated monomers and the posterior expected occurrence of each match state was computed following Equation 3.8. We consider the mean of these expected occurrences as the truncation profile of the corresponding monomer type. This approach is more accurate than directly using the length of a truncated monomer to infer the truncation site, because the latter is vulnerable to insertions or deletions within the sequence. The truncation profiles are visualized in Figure 4.10. Note here the plateau of posterior expec- tations did not reach 1. This is because in Equation 3.8, the silent states are not considered, e.g. the internal deletion states. Thus the total expectation is scaled down, but the values are still com- parable for match states. Since the x-axis corresponds to the consensus sequence in the sense orientation, potential truncation sites are reflected by sharp increases in the curve. It can be seen that Type A monomers show a strong common truncation site indicated by a dramatic increase starting from the 70th state till the 100th state. This then indicates the TSS of a Type A monomer 55 A B C 0 50 100 150 200 0.0 0.1 0.2 0.3 0.4 0.5 Type A Match state position Mean expected count 0 50 100 150 200 0.0 0.2 0.4 0.6 0.8 Type G Match state position Mean expected count 0 50 100 150 200 0.0 0.1 0.2 0.3 0.4 0.5 Type T Match state position Mean expected count 0 50 100 150 200 0 20 40 60 80 100 Match state position Sum of read weights Sense Antisense 0 50 100 150 200 −15 −10 −5 0 5 Match state position Sum of read weights Sense Antisense 0 50 100 150 200 −60 −20 0 20 40 Match state position Sum of read weights Sense Antisense Figure 4.10: Truncation profiles and CAGE-seq signals of all three types of monomers. Top panels: truncation profiles estimated by posterior decoding. Bottom panels: aggregate of weighted pileup from mouse testes CAGE-seq data. Antisense reads were visualized as negative values. is located within the same range. It is a more precise estimation of TSS than the previous estima- tion (Shehee et al., 1987), where the authors reported around 1/3 of the total length was commonly truncated. We also observed an intriguing phenomenon around the 100th state. It is covered by a 7-bp duplication of the sequence ACTCGAG, which has been used to define the A-I subtype by Schichman et al. (1993), and it also serves as an AP-1 binding site. Since this 7-bp duplication constitutes the most abundant A subtype (A-I by Schichman et al. (1993) and Subtype 1 in this study, Table A.2), it is tempting to speculate that this duplication has greatly increased the activity of Type A promoters. Type G and T truncation profiles support a common truncation near the YY1 binding site (binding motif GCCATCTT), which is located close to the 170th state in Type G, and the 160th state in Type T. But the slope is generally smaller than that of Type A, indicating a less fixed TSS. 56 To validate the TSS estimation using truncation profiles, we used CAGE-seq data of mouse embryo testes published by the FANTOM5 project (Forrest et al., 2014). Certain ambiguity of mapping was allowed to effectively capture transcription initiation from highly similar monomer sequences (see Section A.5). Nevertheless the pileup of reads was weighted to make sure the count of ambiguously mapping reads are correctly normalized. The 5’-end of each mapped read was aligned to a match state in the corresponding monomer profile-HMM via posterior decoding. The weighted read pileup was summed in aggregate for visualization. In Figure 4.10 (bottom panels), the time point of E15 is shown, which is close to the stage when DNA methylation is almost completely erased genome-wide during germ cell development. It can be seen that the peaks of sense TSS signals align well with sharp increases in the truncation profile curves of all three types of monomers, as well as the known locations of AP-1 or YY1 binding sites. Type G monomers have less concentrated distribution of read pileups, which supports the smaller slope in the corresponding truncation profile curve. As for the other time points during germ cell development shown in Figure 4.11, the distri- butions of TSS signals retain a similar shape for all monomer types. The only exception is Type T monomer, which gains a strong TSS at the position around the 10th match state on E16, and this peak persists till E18, suggesting an alternative TSS for the Type T monomers. It is worth noting that due to mapping ambiguity and our strategy of read weighting, this quantification result of transcription initiation cannot be normalized across different time points. We noticed that Type G and T monomers have strong antisense TSS signals, especially for the peak located around the 50th match state of Type T. This indicates that mouse L1MdG and L1MdT repeats might also have bidirectional promoters similar to those of human L1HS repeats (Speek, 2001; Faulkner et al., 2009). 57 0 50 100 150 200 0 20 40 60 Match state position Sum of read weights Sense Antisense 0 50 100 150 200 −50 0 50 100 150 200 Match state position Sum of read weights Sense Antisense 0 50 100 150 200 −50 0 50 100 150 Match state position Sum of read weights Sense Antisense 0 50 100 150 200 −20 0 20 40 60 Match state position Sum of read weights Sense Antisense E13 E16 E17 E18 E13 E16 E17 E18 E13 E16 E17 E18 A B C 0 50 100 150 200 −25 −15 −5 0 5 10 Match state position Sum of read weights Sense Antisense 0 50 100 150 200 −20 0 10 20 30 Match state position Sum of read weights Sense Antisense 0 50 100 150 200 −20 −10 0 10 20 30 Match state position Sum of read weights Sense Antisense 0 50 100 150 200 −10 −5 0 5 10 Match state position Sum of read weights Sense Antisense 0 50 100 150 200 −60 −20 0 20 40 60 Match state position Sum of read weights Sense Antisense 0 50 100 150 200 −50 0 50 100 150 Match state position Sum of read weights Sense Antisense 0 50 100 150 200 −50 0 50 100 Match state position Sum of read weights Sense Antisense 0 50 100 150 200 −40 −20 0 20 40 Match state position Sum of read weights Sense Antisense Figure 4.11: Weighted read pileup of CAGE-seq data at various time points during germ cell development. A: Pileup of Type A monomers. B: Pileup of Type G monomers. C: Pileup of Type T monomers. 58 4.4 Analysis of subtype composition in L1Md promoters We joined proximal monomers as promoter regions, and the types of individual promoters are classified correspondingly to the type of comprising monomers. The statistics of L1Md promoters in the mm10 genome are included in Table 4.2. It is shown that most of the promoter classification conformed with the classification generated by RepeatMasker based on whole repeats. However, a small number of repeats have mismatched promoter types. The majority of such mismatched types can be attributed to the fact that many L1MdF repeats have been identified to possess a Type G or T promoter. The mean counts of monomers per promoter are 2.7, 2.9 and 3.1 for Type A, G and T, respectively. The corresponding distribution is shown in Figure 4.12. The majority of Type A promoters have two monomers, while for the other two types of promoter the distribution is less concentrated and the mode shifts towards 3. This is because Type G and T promoters can have multiple incomplete monomers located close to the 3’-end of the promoter (Goodier et al., 2001). It is worth mentioning that there are a number of extremely long promoters. The longest promoter is located on chromosome 3, position 6296898 – 6308206 (mm10 reference), which has 50 Type A monomers. However, in the latest version RepeatMasker annotation (using Repeat Library 20140131), it is annotated as part of an L1MdGf I repeat, following the subfamily definition introduced by (Sookdeo et al., 2013). 59 Repeats annotated by RM Type A promoters Type G promoters Type T promoters Family Count Count RM % p. % Count RM % p. % Count RM % p. % L1MdA 40,139 6,385 15.9 87.4 326 0.8 14.6 180 0.4 3.6 L1MdG 5,096 184 3.6 2.5 804 15.8 36.0 245 4.8 4.9 L1MdT 9,287 106 1.1 1.5 28 0.3 1.3 3,392 36.5 67.5 L1MdF 70,753 474 0.7 6.5 1,046 1.5 46.8 1,153 1.6 22.9 Others – 158 – 2.2 32 – 1.4 56 – 1.1 Table 4.2: Statistics of detected promoters and repeat family classification provided by RepeatMasker (RM). For each detected promoter, its overlap with current annotation of RM was searched. The overlaps were listed in this table with percentage of the corresponding repeats (RM %) and percentage of the corresponding promoter type (p. %). Because not all repeats have a promoter, the sum of RM % for each row is not 100. 60 1 2 3 4 5 6 7 8 9 10 >10 Monomers per promoter Count 0 500 1500 2500 Type A Type G Type T Figure 4.12: Distribution of monomer counts per L1Md promoter. We then investigated whether any specific pattern of subtype composition exists within L1 pro- moters. In the following analysis of subtype composition within promoters, all monomers includ- ing the core ones and truncated ones were used. The subtype assignments for non-core monomers were inferred using the same model trained for the core ones based on PCA andk-NN. To con- sistently label monomers in individual promoters, we ordered the positions based on the direction of L1 repeat insertion (Fig 4.13), which is opposite to the direction of transcription. Previously it has been observed that certain A subtypes prefer to locate only at position 1, while other subtypes tend to appear in further upstream position (Schichman et al., 1993). We tested whether our new definition of subtypes for all A, G and T promoters supports this observation. Since 99% of the promoters have no more than 10 monomers, we only considered the positions up to 10. Type A promoters were first analyzed because they do not have truncated internal monomers like Type G and T. We calculated the fraction of individual position orders using all subtypes as the expected 61 distribution. We hypothesized that no subtypes have position preference, then for each subtype the position order should follow this expected distribution (shown in Figure 4.14 as the bars; left panel). Then we computed the fractions of position orders for each subtype and compared them with the expected. A position preference value was calculated for each position order by sub- tracting the background value from the observed one. We summarized three position preference modes from the results of all A subtypes (Table 4.3) (1) intermediate region, (2) terminus region and (3) no preference (Figure 4.14, right panel). The list for preference mode of individual A subtypes is shown in Table 4.3. The first mode is represented by Subtype 1, which has low pref- erence (<0:1) for the first position and tends to locate in upstream regions from the 3’-end of a monomer array. This kind of subtype contributes the most to promoter extension. The second mode, for example Subtype 2, favors location of the 3’-end terminus ( 0:1 preference for position order of 1). The third mode shows no preference value beyond the range of[0:1;0:1) at any posi- tion, such as Subtype 9. The only un-categorized subtype after these criteria is Subtype 30, which has a position preference value greater than 0.1 at the position order of 2 (Table A.5). Subtypes of G and T monomers also follow these three modes (Table A.6 and A.7). However, we had to extend the definition of terminus region mode for these two types to the first 3 position orders to accommodate the truncated monomers. Therefore for G and T subtypes, we changed the condition for the terminus region mode to greater than 0.1 preference value in any of the first three position orders. The preference mode summary of G and T subtypes can be found in Table 4.4 and 4.5. Combining the position preference modes of individual subtypes and their sequence divergence shown in Figure 4.15, 4.17 and 4.17, we found that there is particular relationship between subtypes with similar sequence and their position preference. Excluding the special case of Subtype 30, the clustering result based on sequence also separates most subtypes by their position preference mode, especially for the subtypes with the intermediate region mode (Figure 4.15). For G and T subtypes, this pattern still applies (Figure 4.17 and 4.17). We speculate that this phenomenon is due to the expansion pattern of L1Md promoters during evolution. Our results indicate that the mechanism of monomer array extension functions through inserting monomers to the upstream of current 62 Monomer Monomer ORF1 Monomer 1 2 3 Monomer order w.r.t. the 3’ end of the promoter Promoter region Direction of transcription Direction of insertion Figure 4.13: Ordering scheme of monomers within an L1Md promoter. 1 2 3 4 5 6 7 8 9 10 Order from 3' end Fraction 0.0 0.2 0.4 0.6 0.8 Subtype 1 Subtype 2 Subtype 9 Expected 1 2 3 4 5 6 7 8 9 10 Intermediate region Position preference −0.4 −0.2 0.0 0.2 0.4 Order from 3' end 1 2 3 4 5 6 7 8 9 10 Linker region Position preference −0.4 −0.2 0.0 0.2 0.4 Order from 3' end 1 2 3 4 5 6 7 8 9 10 No preference Position preference −0.4 −0.2 0.0 0.2 0.4 Order from 3' end Categorizing into three preference modes Figure 4.14: The three preference modes of monomer positioning. 63 Preference mode Subtype list Total count Intermediate region 1, 14, 15, 16, 17, 20, 22, 23, 24, 25, 27, 28, 32, 33 4,106 Terminus region 2, 3, 5, 6, 7, 10, 11, 13, 31 4,984 No preference 4, 8, 9, 12, 18, 19, 21, 26, 29, 34 2,108 Table 4.3: Position preference modes of A subtypes. Subtype 30 has an outlier mode and thus it does not belong to any of these three modes. Preference mode Subtype list Total count Intermediate region 2, 6 412 Terminus region 1, 3, 4, 5 1,870 No preference NA 0 Table 4.4: Position preference modes of G subtypes. L1Md promoters, and it uses monomers located in the intermediate region as templates, rather than those located in the terminus region. Thus subtypes observed to have position preference at terminus regions are not likely to contribute to promoter extension in this mechanism. On the other hand, the subtypes showing no preference in positioning distribute uniformly along the positions of L1Md promoters. This suggests an alternative extension mechanism may exist, capable of using all monomers within the promoter as insertion template. Preference mode Subtype list Total count Intermediate region 4, 6, 10, 15, 22, 28, 29 1,799 Terminus region 1, 2, 3, 5, 7, 8, 11, 12, 13, 16, 21, 23, 24, 26, 27 5,594 No preference 9, 14, 17, 18, 19, 20, 25, 30, 31, 32 1,479 Table 4.5: Position preference modes of T subtypes. 64 60 50 40 30 20 10 0 Edit distance 6 2 16 1 15 14 27 17 25 4 20 22 24 32 28 21 29 19 12 18 5 3 13 9 10 7 33 23 8 11 31 34 26 30 Intermediate Terminus No preference Figure 4.15: Hierarchical clustering of A subtypes based on edit distance. The subtypes are colored by their position preference modes. 65 60 50 40 30 20 10 0 Edit distance 6 2 5 1 3 4 Intermediate Terminus Figure 4.16: Hierarchical clustering of G subtypes based on edit distance. The subtypes are colored by their position preference modes. 4.5 Identification of active L1Md promoters using DNA methy- lation data The set of retrotransposition-competent L1 repeats has been well defined in L1Base2 (Penzkofer et al., 2016), in which the competence is characterized based on both promoter and ORF complete- ness. Since L1Base2 uses the L1Md classification generated by RepeatMasker, we are interested 66 60 50 40 30 20 10 0 Edit distance 25 20 27 26 2 12 14 29 10 22 4 15 8 21 6 23 28 9 30 18 19 31 32 7 5 17 24 1 16 13 11 3 Intermediate Terminus No preference Figure 4.17: Hierarchical clustering of T subtypes based on edit distance. The subtypes are colored by their position preference modes. in investigating the transcription potential for activity of individual L1Md promoters based on our classification of monomer subtypes. Because of the high copy number of L1Md repeats in the genome, it is not very effective to quantitatively estimate the transcription levels of L1Md repeats 67 through typical RNA-seq data analysis. Therefore to evaluate the capability of initiating transcrip- tion of individual L1Md promoters, we chose to use promoter DNA methylation as the measure- ment to indirectly estimate the activity of L1Md promoters. As mentioned before, the piRNA reg- ulation pathway is responsible for repressing actively transcribed repeats through piRNA-guided de nove DNA methylation in developing germ cells. We chose public datasets of certain knock- out (KO) experiments related with the piRNA regulation pathway (Molaro et al., 2014; Manakov et al., 2015; Inoue et al., 2017) for identification of active L1Md promoters, by comparing with wild type (WT) methylation. This is because during germ cell development, most of the active L1Md promoters are suppressed by deposition of DNA methylation, and therefore they cannot be directly distinguished from other inert promoters that are routinely methylated together with most intergenic regions. With methylation data after knocking out key factors of the piRNA regulation pathway, active L1Md promoters can be identified by measuring the methylation loss. The Dnmt3L KO data generated by (Inoue et al., 2017) was used as a negative control, because this enzyme is known to regulate retrotransposon methylation during male germ cell development (Hata et al., 2002; Bourc’his & Bestor, 2004; Webster et al., 2005). The other factors, Mili, Miwi2 and Pld6, represent piRNA regulation at different stages of spermatogenesis (Aravin et al., 2008; Watanabe et al., 2011). For each L1Md promoter, we calculated its average methylation level in all KO experiments and WT samples. A minimum coverage filter was applied to only obtain precise levels. We required each promoter region to have at least 3 CpG sites with mapped reads and 10 observations of methylation. This observation is defined by one read covering a CpG site. Then the average methylation level was computed by the weighted average of methylation at individual CpG sites. After filtering, there are in total 4187, 1359 and 2411 promoters of Type A, G and T, respectively. The distributions of the methylation loss can be seen in Figure 4.18, grouped by promoter type. The negative control Dnmt3L KO showed expected methylation loss, despite Type G promoters losing less methylation. Since most of the promoters were hypermethylated in WT samples, the 68 0.0 0.5 1.0 0 1 2 3 4 Methylation loss after KO Density Mili KO Miwi2 KO Pld6 KO Dnmt3L KO −0.5 0.0 0.5 1.0 0 1 2 3 Methylation loss after KO Density 0.0 0.5 1.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Methylation loss after KO Density Type A Type G Type T Figure 4.18: Distribution of methylation loss after KO experiments of key piRNA regulation path- way factors. observed low methylation loss after KO experiments could not have been caused by hypomethyla- tion in WT (Figure 4.19). All three types exhibit diverged responses in the Mili KO comparison. However, following Miwi2 KO only the Type T promoters have methylation loss. For the KO of Pld6, the majority of Type A and T promoters were demethylated, while most of the Type G promoters retained their methylation. We then investigated the diverged response after Mili KO. All promoters were separated to two groups, labeled as positive and negative of methylation loss, based on the threshold at 0.5 methylation difference. To explore whether there are any subtypes highly correlated with positive or negative response, we calculated an observed to expected (O/E) ratio for each of the subtypes in both groups. The expected distribution assumes all subtypes are randomly distributed across all promoters, and the fractions of individual subtypes should be proportional to the counts of subtypes in all promoters passing the coverage threshold. The O/E ratios of Type A subtypes are shown in Figure 4.20 (left panel). The most extreme values belong to Subtype 2 and 6 in the “negative loss” group. We showed (above) that these two subtypes are mostly positioned at the 3’-end of promoters, and their sequences diverged from other subtypes. This suggests that these two subtypes have already lost the function as a promoter, and if one L1Md promoter only has these two subtypes, it is very likely that such a promoter is inert during germ cell development. In 69 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8 Type A WT Spermatocyte Mili KO 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8 Miwi2 HET Miwi2 DKO 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8 WT Spermatogonia Dnmt3L KO 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8 WT Spermatogonia Pld6 KO 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8 Type G WT Spermatocyte Mili KO 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8 Miwi2 HET Miwi2 DKO 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8 WT Spermatogonia Dnmt3L KO 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8 WT Spermatogonia Pld6 KO 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8 Type T WT Spermatocyte Mili KO 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8 Miwi2 HET Miwi2 DKO 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8 WT Spermatogonia Dnmt3L KO 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8 WT Spermatogonia Pld6 KO Figure 4.19: Scatter plot of methylation loss by KO of piRNA regulation factors. 70 general, subtypes that have a preference in intermediate regions are more enriched in the positive group, while terminus region subtypes show the opposite trend. We observed a similar pattern for G and T subtypes after Mili KO (Figure 4.20, middle and right panels). For the Miwi2 KO, we used a lower cut-off for positive methylation loss (loss 0.3), given that the response after KO is not as strong as the other factors. The result of enrichment analysis indicates that the regulation targets of Miwi2 on Type T promoters are similar to those of Mili (Figure 4.21). Because Mili and Miwi2 are active in different stages of spermatogenesis, it is yet unclear whether the overall negative response for Type A and G promoters after Miwi2 KO is due to the limit of available time points of data used in this analysis. It is intriguing that even Type G and T promoters have divergent responses after Miwi2 KO, despite the high sequence similarity between Type G and T monomers. On the other hand, since Mili binds to primary piRNA and Miwi2 binds to secondary piRNA, the differential response after Miwi2 KO might be associated with the increased antisense transcription from Type T promoters shown in Figure 4.10C. 71 30 34 29 26 21 19 18 12 9 8 4 31 13 11 10 7 6 5 3 2 33 32 28 27 25 24 23 22 20 17 16 15 14 1 Subtype O/E ratio A subtypes 0.0 0.5 1.0 1.5 2.0 Positive Negative 5 4 3 1 6 2 Subtype O/E ratio G subtypes 0.0 0.5 1.0 1.5 2.0 32 31 30 25 20 19 18 17 14 9 27 26 24 23 21 16 13 12 11 8 7 5 3 2 1 29 28 22 15 10 6 4 Subtype O/E ratio T subtypes 0.0 1.0 2.0 3.0 Figure 4.20: O/E ratios of monomer subtypes grouped by position preference modes after the Mili KO experiment. 72 32 31 30 25 20 19 18 17 14 9 27 26 24 23 21 16 13 12 11 8 7 5 3 2 1 29 28 22 15 10 6 4 Subtype O/E ratio T subtypes 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Positive Negative Figure 4.21: O/E ratios of T subtypes grouped by position preference modes after the Miwi2 KO experiment. 73 Chapter 5 Conclusions Studying retrotransposons, especially the evolution history of retrotransposons, has benefited greatly from the completion of genome assemblies. This is due to the characteristics of repeat expansion, which preserved all ancestral repeats within the genome as genetic fossils. The phy- logeny of repeats can be comprehensively investigated as long as a highly sensitive retrotransposon detection and annotation method is developed. Thanks to the ongoing advances of repeat annota- tion methodology, e.g. the RepeatMasker pipeline and the seminal work of Sookdeo et al., now people are able to systematically analyze repeats with breadth as much as covering half of the genome, and with depth as specific as the classification up to subfamilies consisting of only hun- dreds of repeats. Yet in this dissertation we presented an attempt to bring the annotation of repeats one step further by specifically annotating the monomers of mouse L1Md promoters. We hope this work can shed light upon the activity and regulation of this important family of repeats, and it can also offer a glimpse to the mechanism of the origin and expansion of L1Md monomers in the mouse genome. The promoters of L1 repeats have been evolving rapidly comparing with the highly conserved ORFs. This is because it is the battlefield of host genome suppression and parasitic retrotransposon expression. The change of either side will be responded by the adaptation of the other. Therefore the sequences of L1 promoters have developed great diversity, requiring classification at finer scale. The implementation of a profile-HMM based annotation framework fulfills this requirement. By fitting a profile-HMM to a group of monomer sequences, it is possible to map discrete sequences of individual monomers to points in a continuous parameter space; following this conversion a number of effective algorithms can be applied to identify clusters which indicate evolutionary relationship. The pipeline presented in this dissertation does not rely on supervision, and thus it 74 saves manual intervention in directly classifying sequences and can be applied on a large number of individuals. The monomer annotation pipeline was applied to various genome assemblies, and the results have shown that only in the mm10 genome were the abundant 208bp Type A monomers observed. Unfortunately this observation is confounded by the unassembled regions of all other mouse species, thus we could not conclude that this supposedly burst of monomer expansion is species- specific. However, despite of the monomer detection results in other genomes, the number of monomers in the lab mouse genome contributes to the surge of L1 insertions, and more are antici- pated to occur in the future generations given the activity of those L1Md promoters. It remains elusive how L1Md repeats obtained tandem repeating monomers at the very begin- ning. The analysis of monomer array expansion patterns supports both previously proposed mod- els. The step-wise model is expected to have contributed more to the duplication of monomers, because the number of monomers following the intermediate preference mode is greater than that of the other two modes. This means there exists a mechanism that can insert monomers to the upstream of current L1Md promoters. It probably takes place during post-insertion DNA repair, when the 3’-end of the cDNA is ligated to the genomic DNA. It is then intriguing to investigate why currently there are only three types of monomers, and why monomer is mouse-specific. The former question aims to understand whether the monomer sequence is an essential requirement for them to be duplicated; in other words, can any arbitrary sequences become tandem-ized via L1Md propagation? The latter question is related with the mouse-specific DNA repair mechanism. Whether there exists any machinery that exclusively duplicates 200bp fragments of DNA is yet to be verified. Unfortunately both questions seem to require experimental validation and thus for now they are beyond the scope of computational methods. We hope in the near future more efforts can be put on the study of monomer expansions and solve this enigma. The regulation of L1Md promoters is an important topic for understanding the mechanisms of epigenetic regulation and germ cell development. Given the fact that L1Md promoters can drive transcription both up- and downstream via transduction, it is better to segregate the annotation of 75 L1Md promoters from that of the complete repeats. Imagine this scenario when an L1Md pro- moter is separated from its downstream ORFs due to some genome rearrangement event, although it has lost the ability to generate new insertions autonomously, this region can still actively initiate transcription. Therefore we used data of a number of KO experiments to identify active L1Md pro- moters for studying potential sequence determinants which might have triggered piRNA-guided de novo methylation. The response of methylation loss after KO experiments diverged by both monomer subtypes and monomer types, which indicates that multiple regulation pathways exist in the mouse genome, and the sensitivity of the regulation is high enough to recognize subtle sequence differences between subtypes. If more experimental data become available, it is possible to fur- ther narrow down the targets of repeat-specific DNA methylation regulation. Once such sequence determinants are known, it is foreseeable for the era of epigenetic manipulation to emerge. 76 Reference List Adams JW, Kaufman RE, Kretschmer PJ, Harrison M, Nienhuis AW (1980) A family of long reiterated DNA sequences, one copy of which is next to the human beta globin gene. Nucleic Acids Research 8:6113–6128. Alexander RP, Fang G, Rozowsky J, Snyder M, Gerstein MB (2010) Annotating non-coding regions of the genome. Nature Publishing Group 11:559–571. Aravin AA, Sachidanandam R, Bourc’his D, Schaefer C, Pezic D, Toth KF, Bestor T, Hannon GJ (2008) A piRNA pathway primed by individual transposons is linked to de novo DNA methylation in mice. Molecular cell 31:785–799. Aravin AA, Sachidanandam R, Girard A, Fejes-Toth K, Hannon GJ (2007) Developmentally reg- ulated pirna clusters implicate mili in transposon control. Science 316:744–747. Athanikar JN, Badge RM, Moran JV (2004) A YY1-binding site is required for accurate human LINE-1 transcription initiation. Nucleic Acids Research 32:3846–3855. Barrett C, Hughey R, Karplus K (1997) Scoring hidden markov models. Bioinformat- ics 13:191–199. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D (2004) Ultra- conserved elements in the human genome. Science 304:1321–1325. Belancio VP, Roy-Engel AM, Pochampally RR, Deininger P (2010) Somatic expression of LINE-1 elements in human tissues. Nucleic Acids Research 38:3909–3922. Benson G (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research 27:573. Bird A (2002) DNA methylation patterns and epigenetic memory. Genes & development 16:6–21. Bourc’his D, Bestor TH (2004) Meiotic catastrophe and retrotransposon reactivation in male germ cells lacking Dnmt3L. Nature 431:96. Branciforte D, Martin SL (1994) Developmental and cell type specificity of LINE-1 expression in mouse testis: implications for transposition. Molecular and Cellular Biology 14:2584–2592. 77 Britten RJ, Kohne DE (1968) Repeated sequences in DNA. Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms. Sci- ence 161:529–540. Brosius J (1999) RNAs from all categories generate retrosequences that may be exapted as novel genes or regulatory elements. Gene 238:115–134. Campello RJ, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density estimates In Pacific-Asia conference on knowledge discovery and data mining, pp. 160–172. Springer. Cordaux R, Batzer MA (2009) The impact of retrotransposons on human genome evolution. Nature Publishing Group 10:691–703. Coufal NG, Garcia-Perez JL, Peng GE, Yeo GW, Mu Y , Lovci MT, Morell M, O’Shea KS, Moran JV , Gage FH (2009) L1 retrotransposition in human neural progenitor cells. Nature 460:1127–1131. de Koning APJ, Gu W, Castoe TA, Batzer MA, Pollock DD (2011) Repetitive Elements May Comprise Over Two-Thirds of the Human Genome. PLOS Genetics 7:e1002384. de Souza FS, Franchini LF, Rubinstein M (2013) Exaptation of transposable elements into novel cis-regulatory elements: is the evidence always strong? Molecular biology and evolu- tion 30:1239–1251. DeBerardinis RJ, Kazazian HH (1999a) Analysis of the promoter from an expanding mouse retro- transposon subfamily. Genomics 56:317–323. DeBerardinis RJ, Kazazian HH (1999b) Analysis of the promoter from an expanding mouse retro- transposon subfamily. Genomics 56:317–323. Drake JW, Charlesworth B, Charlesworth D (1998) Rates of spontaneous mutation. Genetics . Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids Cambridge university press. Eddy SR (2011) Accelerated profile hmm searches. PLoS computational biology 7:e1002195. Edgar RC (2004) Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research 32:1792–1797. Erwin JA, Marchetto MC, Gage FH (2014) Mobile DNA elements in the generation of diversity and complexity in the brain. Nature Publishing Group 15:497–506. Esnault C, Maestre J, Heidmann T (2000) Human LINE retrotransposons generate processed pseudogenes. Nature genetics 24:363. 78 Ester M, Kriegel HP, Sander J, Xu X et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, V ol. 96, pp. 226–231. Fadloun A, Le Gras S, Jost B, Ziegler-Birling C, Takahashi H, Gorab E, Carninci P, Torres-Padilla ME (2013) Chromatin signatures and retrotransposon profiling in mouse embryos reveal regu- lation of LINE-1 by RNA. Nature Publishing Group 20:332–338. Faulkner GJ, Kimura Y , Daub CO, Wani S, Plessy C, Irvine KM, Schroder K, Cloonan N, Steptoe AL, Lassmann T et al. (2009) The regulated retrotransposon transcriptome of mammalian cells. Nature genetics 41:563. Feng Q, Moran JV , Kazazian HH, Boeke JD (1996) Human L1 retrotransposon encodes a con- served endonuclease required for retrotransposition. Cell 87:905–916. Forrest AR, Kawaji H, Rehli M, Baillie JK, De Hoon MJ, Haberle V , Lassmann T, Kulakovskiy IV , Lizio M, Itoh M et al. (2014) A promoter-level mammalian expression atlas. Nature 507:462. Furano A V , Robb SM, Robb FT (1988) The structure of the regulatory region of the rat L1 (L1Rn, long interspersed repeated) DNA family of transposable elements. Nucleic acids research 16:9215–9231. Garcia-Perez JL, Marchetto MCN, Muotri AR, Coufal NG, Gage FH, O’Shea KS, Moran JV (2007) LINE-1 retrotransposition in human embryonic stem cells. Human molecular genet- ics 16:1569–1577. Georgiou I, Noutsopoulos D, Dimitriadou E, Markopoulos G, Apergi A, Lazaros L, Vaxevanoglou T, Pantos K, Syrrou M, Tzavaras T (2009) Retrotransposon RNA expression and evidence for retrotransposition events in human oocytes. Human molecular genetics 18:1221–1228. Gifford R, Tristem M (2003) The evolution, distribution and diversity of endogenous retroviruses. Virus genes 26:291–315. Goodier JL, Ostertag EM, Du K, Kazazian HH (2001) A novel active L1 retrotransposon subfamily in the mouse. Genome research 11:1677–1685. Goodier JL, Zhang L, Vetter MR, Kazazian HH (2007) Line-1 orf1 protein localizes in stress granules with other rna-binding proteins, including components of rna interference rna-induced silencing complex. Molecular and cellular biology 27:6469–6483. Hata K, Okano M, Lei H, Li E (2002) Dnmt3L cooperates with the Dnmt3 family of de novo DNA methyltransferases to establish maternal imprints in mice. Development 129:1983–1993. Hattori M, Kuhara S, Takenaka O, Sakaki Y (1986) L1 family of repetitive DNA sequences in primates may be derived from a sequence encoding a reverse transcriptase-related protein. Nature 321:625–628. 79 Hohjoh H, Singer MF (1996) Cytoplasmic ribonucleoprotein complexes containing human LINE- 1 protein and RNA. The EMBO journal 15:630–639. Hohjoh H, Singer MF (1997) Sequence-specific single-strand RNA binding protein encoded by the human LINE-1 retrotransposon. The EMBO journal 16:6034–6043. Inoue K, Ichiyanagi K, Fukuda K, Glinka M, Sasaki H (2017) Switching of dominant retrotrans- poson silencing strategies from posttranscriptional to transcriptional mechanisms during male germ-cell development in mice. PLoS genetics 13:e1006926. Jaakkola T, Diekhans M, Haussler D (2000) A discriminative framework for detecting remote protein homologies. Journal of computational biology 7:95–114. Jagadeeswaran P, Forget BG, Weissman SM (1981) Short interspersed repetitive DNA elements in eucaryotes: transposable DNA elements generated by reverse transcription of RNA pol III transcripts? Cell 26:141–142. Jurka J, Kapitonov VV , Pavlicek A, Klonowski P, Kohany O, Walichiewicz J (2005) Rep- base Update, a database of eukaryotic repetitive elements. Cytogenetic and genome research 110:462–467. Kano H, Godoy I, Courtney C, Vetter MR, Gerton GL, Ostertag EM, Kazazian HH (2009) L1 retrotransposition occurs mainly in embryogenesis and creates somatic mosaicism. Genes & Development 23:1303–1312. Keane TM, Goodstadt L, Danecek P, White MA, Wong K, Yalcin B, Heger A, Agam A, Slater G, Goodson M et al. (2011) Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477:289. Krogh A, Brown M, Mian IS, Sj¨ olander K, Haussler D (1994) Hidden markov models in computa- tional biology: Applications to protein modeling. Journal of molecular biology 235:1501–1531. Kuramochi-Miyagawa S, Watanabe T, Gotoh K, Totoki Y , Toyoda A, Ikawa M, Asada N, Kojima K, Yamaguchi Y , Ijiri TW, Hata K, Li E, Matsuda Y , Kimura T, Okabe M, Sakaki Y , Sasaki H, Nakano T (2008) DNA methylation of retrotransposon genes is regulated by Piwi family members MILI and MIWI2 in murine fetal testes. Genes & Development 22:908–917. Kuwabara T, Hsieh J, Muotri A, Yeo G, Warashina M, Lie DC, Moore L, Nakashima K, Asashima M, Gage FH (2009) Wnt-mediated activation of NeuroD1 and retro-elements during adult neu- rogenesis. Nature neuroscience 12:1097–1105. Lander ES, et al., International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921. Langley CH, Montgomery E, Hudson R, Kaplan N, Charlesworth B (1988) On the role of unequal exchange in the containment of transposable element copy number. Genetics Research 52:223–235. 80 Li XZ, Roy CK, Dong X, Bolcun-Filas E, Wang J, Han BW, Xu J, Moore MJ, Schimenti JC, Weng Z et al. (2013) An ancient transcription factor initiates the burst of piRNA production during early meiosis in mouse testes. Molecular cell 50:67–81. Loeb D, Padgett R, Hardies S, Shehee WR, Comer M, Edgell M, Hutchison C (1986) The sequence of a large l1md element reveals a tandemly repeated 5’end and several features found in retro- transposons. Molecular and Cellular Biology 6:168–182. Lowe CB, Bejerano G, Haussler D (2007a) Thousands of human mobile element fragments undergo strong purifying selection near developmental genes. Proceedings of the National Academy of Sciences of the United States of America 104:8005–8010. Lowe CB, Bejerano G, Haussler D (2007b) Thousands of human mobile element fragments undergo strong purifying selection near developmental genes. Proceedings of the national academy of sciences 104:8005–8010. Luan DD, Korman MH, Jakubczak JL, Eickbush TH (1993) Reverse transcription of R2Bm RNA is primed by a nick at the chromosomal target site: A mechanism for non-LTR retrotransposition. Cell 72:595–605. Manakov SA, Pezic D, Marinov GK, Pastor WA, Sachidanandam R, Aravin AA (2015) MIWI2 and MILI have differential effects on piRNA biogenesis and DNA methylation. Cell reports 12:1234–1243. Martin SL (2006) The orf1 protein encoded by line-1: structure and function during l1 retrotrans- position. BioMed Research International 2006. Martin SL (2014) Nucleic acid chaperone properties of ORF1p from the non-LTR retrotransposon, LINE-1. RNA Biology 7:706–711. Mathias SL, Scott AF, Kazazian HH, Boeke JD, Gabriel A (1991) Reverse transcriptase encoded by a human transposable element. Science 254:1808–1810. McClintock B (1950) The origin and behavior of mutable loci in maize. Proceedings of the National Academy of Sciences of the United States of America 36:344–355. Mills RE, Bennett EA, Iskow RC, Devine SE (2007) Which transposable elements are active in the human genome? Trends in genetics : TIG 23:183–191. Molaro A, Falciatori I, Hodges E, Aravin AA, Marran K, Rafii S, McCombie WR, Smith AD, Hannon GJ (2014) Two waves of de novo methylation during mouse germ cell development. Genes & development 28:1544–1549. Mouse Genome Sequencing Consortium, et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562. 81 Muotri AR, Chu VT, Marchetto MCN, Deng W, Moran JV , Gage FH (2005) Somatic mosaicism in neuronal precursor cells mediated by L1 retrotransposition. Nature 435:903–910. Naas TP, DeBerardinis RJ, Moran JV , Ostertag EM, Kingsmore SF, Seldin MF, Hayashizaki Y , Martin SL, Kazazian HH (1998) An actively retrotransposing, novel subfamily of mouse L1 elements. The EMBO journal 17:590–597. Nell˚ aker C, Keane TM, Yalcin B, Wong K, Agam A, Belgard TG, Flint J, Adams DJ, Frankel WN, Ponting CP (2012) The genomic landscape shaped by selection on transposable elements across 18 mouse strains. Genome Biology 13:R45. Ohno S (1972) So much “junk” DNA in our genome. Brookhaven symposia in biology 23:366–370. Ohshima K, Hamada M, Terai Y , Okada N (1996) The 3’ends of tRNA-derived short interspersed repetitive elements are derived from the 3’ends of long interspersed repetitive elements. Molec- ular and cellular biology 16:3756–3764. Ostertag EM, Kazazian HH (2001) Twin priming: a proposed mechanism for the creation of inversions in l1 retrotransposition. Genome research 11:2059–2065. Packer AI, Manova K, Bachvarova RF (1993) A Discrete LINE-1 Transcript in Mouse Blastocysts. Developmental Biology 157:281–283. Padgett RW, Hutchison III CA, Edgell MH (1988) The f-type 5 motif of mouse l1 elements: a major class of l1 termini similar to the a-type in organization but unrelated in sequence. Nucleic acids research 16:739–749. Pascale E, Liu C, Valle E, Usdin K, Furano A V (1993) The evolution of long interspersed repeated DNA (L1, LINE 1) as revealed by the analysis of an ancient rodent L1 DNA family. Journal of molecular evolution 36:9–20. Penzkofer T, J¨ ager M, Figlerowicz M, Badge R, Mundlos S, Robinson PN, Zemojtel T (2016) L1Base 2: more retrotransposition-active LINE-1s, more mammalian genomes. Nucleic acids research p. gkw925. Pereira V (2004) Insertion bias and purifying selection of retrotransposons in the arabidopsis thaliana genome. Genome biology 5:R79. Pickeral OK, Makałowski W, Boguski MS, Boeke JD (2000) Frequent human genomic dna trans- duction driven by line-1 retrotransposition. Genome research 10:411–415. Price AL, Eskin E, Pevzner PA (2004) Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome research 14:2245–2252. Price AL, Jones NC, Pevzner PA (2005) De novo identification of repeat families in large genomes. Bioinformatics 21:i351–i358. 82 Ramdas S, Ozel AB, Holl K, Mandell M, Woods LS, Li JZ (2018) Extended regions of suspected mis-assembly in the rat reference genome. bioRxiv p. 332932. Rubin CM, Houck CM, Deininger PL, Friedmann T, Schmid CW (1980) Partial nucleotide sequence of the 300-nucleotide interspersed repeated human DNA sequences. Nature 284:372–374. Schichman SA, Adey NB, Edgell MH, Hutchison III C (1993) L1 A-monomer tandem arrays have expanded during the course of mouse L1 evolution. Molecular biology and evolu- tion 10:552–570. Seisenberger S, Andrews S, Krueger F, Arand J, Walter J, Santos F, Popp C, Thienpont B, Dean W, Reik W (2012) The dynamics of genome-wide dna methylation reprogramming in mouse primordial germ cells. Molecular cell 48:849–862. Severynse DM, Hutchison CA, Edgell MH (1991) Identification of transcriptional regulatory activity within the 5 a-type monomer sequence of the mouse line-1 retroposon. Mammalian Genome 2:41–50. Shehee WR, Chao SF, Loeb DD, Comer MB, Hutchison III CA, Edgell MH (1987) Determination of a functional ancestral sequence and definition of the 5 end of A-type mouse L1 elements. Journal of molecular biology 196:757–767. Singh P, Li AX, Tran DA, Oates N, Kang ER, Wu X, Szab´ o PE (2013) De novo dna methylation in the male germ line occurs by default but is excluded at sites of h3k4 methylation. Cell reports 4:205–219. Smit A, Hubley R, Green P (2015) RepeatMasker Open-4.0. URL http://www.repeatmasker.org . Smit AF (1996) The origin of interspersed repeats in the human genome. Current opinion in genetics & development 6:743–748. Sookdeo A, Hepp CM, McClure MA, Boissinot S (2013) Revisiting the evolution of mouse LINE- 1 in the genomic era. Mobile DNA 4:3. Speek M (2001) Antisense promoter of human L1 retrotransposon drives transcription of adjacent cellular genes. Molecular and cellular biology 21:1973–1985. Stocking C, Kozak CA (2008) Murine endogenous retroviruses. Cellular and Molecular Life Sciences 65:3383–3398. Suzuki MM, Bird A (2008) DNA methylation landscapes: provocative insights from epigenomics. Nature Reviews Genetics 9:465. Swergold GD (1990) Identification, characterization, and cell specificity of a human line-1 pro- moter. Molecular and cellular biology 10:6718–6729. 83 Tch´ enio T, Casella JF, Heidmann T (2000) Members of the SRY family regulate the human LINE retrotransposons. Nucleic Acids Research 28:411–415. Tsuda K, Kawanabe M, M¨ uller KR (2003) Clustering with the Fisher score In Advances in Neural Information Processing Systems, pp. 745–752. V oliva CF, Martin SL, Hutchison CA, Edgell MH (1984) Dispersal process associated with the L1 family of interspersed repetitive DNA sequences. Journal of Molecular Biology 178:795–813. Watanabe T, Chuma S, Yamamoto Y , Kuramochi-Miyagawa S, Totoki Y , Toyoda A, Hoki Y , Fujiyama A, Shibata T, Sado T et al. (2011) MITOPLD is a mitochondrial protein essential for nuage formation and piRNA biogenesis in the mouse germline. Developmental cell 20:364–375. Webster KE, O’Bryan MK, Fletcher S, Crewther PE, Aapola U, Craig J, Harrison DK, Aung H, Phutikanit N, Lyle R et al. (2005) Meiotic and epigenetic defects in Dnmt3L-knockout mouse spermatogenesis. Proceedings of the National Academy of Sciences 102:4068–4073. Wheeler TJ, Eddy SR (2013) nhmmer: DNA homology search with profile HMMs. Bioinformat- ics 29:2487–2489. Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavell A, Leroy P, Morgante M, Panaud O, Paux E, SanMiguel P, Schulman AH (2007) A unified classification system for eukaryotic transposable elements. Nature Reviews Genetics 8:973–982. Yang N, Zhang L, Zhang Y , Kazazian HH (2003) An important role for RUNX3 in human L1 transcription and retrotransposition. Nucleic Acids Research 31:4929–4940. Zingler N, Willhoeft U, Brose HP, Schoder V , Jahns T, Hanschmann KMO, Morrish TA, L¨ ower J, Schumann GG (2005) Analysis of 5 junctions of human line-1 and alu retrotransposons suggests an alternative model for 5-end attachment requiring microhomology-mediated end- joining. Genome research 15:780–789. 84 Appendix A Appendix A.1 Implementation details of the specialized profile-HMM A.1.1 Choice of prior distributions The prior distributions mentioned in Chapter 3 include the pseudocount of transition and emission probabilitiest 0 ljk ande k (u) 0 , respectively, and the initial distribution of states =f 1 ; 2 g. Their values are listed here. The prior emission distribution of all match statesM i is set to fe 0 M i (A);e 0 M i (C);e 0 M i (G);e 0 M i (T)g =f0:414;0:208;0:171;0:206g: These values are estimated from the nucleotide composition of human L1 ORF2 coding sequence. It is a GC-sparse sequence, which intends to mimic the nucleotide composition around monomer sequences. The emission distributions of the background stateI 0 and all internal insertion statesI i are set to fe 0 I 0 (A);e 0 I 0 (C);e 0 I 0 (G);e 0 I 0 (T)g =f0:309;0:313;0:132;0:247g; which is the composition of human chrM, a completely unrelated sequence. 85 The pseudocount of transition probabilities is set to the following values: t 0 MjM = 0:8; t 0 IjM = 0:1; t 0 DjM = 0:1 t 0 MjI = 0:5; t 0 IjI = 0:4; t 0 DjI = 0:1 t 0 MjD = 0:5; t 0 IjD = 0:1; t 0 DjD = 0:4 The 5’-end truncation probabilities are set to a uniform distribution, with the exception of the first position. This intends to favor full-matching, such that the first position being a match state has more weights. Thereforet M i jD 1 = t D L jM i = 0:5=(L1) for alli2f2;:::;Lg, while t M 1 jD 1 = 0:5. The 3’-end truncation probabilities follow uniform distribution for all positions. The initial probability is =f 1 ; 2 g =f0:8;0:2g. A.1.2 Numerical stability The recursions in the forward-backward algorithm include product of a number of probabilities within range[0;1]. Therefore after a number of recursions, the values will be very small, and this will cause the underflow problem in implementation. A commonly used way to avoid underflow problem is the logarithm transformation. By log-transforming individual terms in the calculation, the product will become summation, and the values of probability will become numerically stable. One transformation that is worth mentioning is taking log of a summation. If all the terms in one summation are subject to underflow, the summation cannot be calculated precisely before log- transformation. This can be solved with a conversion of the terms. Assume one needs to calculate 86 the following terma =b+c after log-transformation, withb andc are values that both are so small and causing underflow. The conversion is shown below. log(a) = log(b+c) = log(exp(log(b))+exp(log(c))) = log(b)+log(1+exp(log(c)log(b))): By this conversion the direct summation of two small terms is avoided and the calculation can be done without loss of precision. During implementation, the larger term is always being pulled out from the log-summation asb. And this will makeexp(log(c)log(b)) approach zero very fast to keep the calculation numerically stable. This conversion can be generalized to an arbitrary number of terms. To calculatea =b 1 +b 2 + +b n , it is only necessary to find the maximum term and take it out from the summation: log(a) = log(b 1 +b 2 ++b n ) = log(b )+log(1+exp( X b i 6=b log(b i )log(b )); whereb = max i b i ; for alli. A.1.3 Data structure and decoding implementation in C++ The model parameters that need to be stored in memory during runtime include mainly the tran- sition and emission probabilities, and the recursions of the forward-backward algorithm. For a profile-HMM with lengthL, the number of states is 3L. Currently all the transition probabilities are stored in a vector of vectors structure, mimicking a matrix structure. Both levels of vectors have size3L. This data structure is designed for the purpose of simplifying codes for indexing. In practice not all of the allocated space will be used, because the total number of transitions in the 87 profile-HMM is smaller than 3L 3L. The emission probabilities are stored in a similar struc- ture, with dimension2L4, because only theM andI states have emission distribution. Here the alphabet size is set to 4, representing four types of DNA nucleotides. In practice, if an unassembled nucleotide is encountered, i.e. an “N” in the sequence, an insertion state will be forcibly applied to the inference of its hidden state during decoding. The forward-backward algorithm includes calculation of recursions of two variables k (i) and k (i). The recursions of these two variables are stored in the vector of vectors setup, with dimen- sion (N +1)3L, whereN is the sequence length. The additional row is used for initialization of the recursion: I 0 (0) = 1 D 1 (0) = 2 k (N) = 1 for allk: Here represents the initial distribution of states =f 1 ; 2 g. In Durbin et al. (1998), the authors pointed out that the recursion in the dynamic programming of the forward-backward algorithm needs to be implemented in a certain order. This is because the recursion of non-emitting states depends on the same position. For example, in the recursion of forward probability (Equation 3.1), computing k (i) requires the value of l (i) whenk is a non-emitting state. In the implementation for forward algorithm, for each recursion at positioni, all the emitting states are calculated first, and then the non-emitting states are calculated. Things are reversed for the backward probabilities. Because for each state k, the backward probability at position i is dependent on all backward probabilities of emitting states at positioni+1, and all non-emitting states at the same positioni. Therefore for each recursion, k (i) for all the non-emitting states must be calculated first. 88 GCGCCTGCCCCAATCCAATCGCGCGGAACTTGAGACTGCGGTACATAGGG GTGCCTGCCCCAATCCAATCGCGCGGAACTCGAGACTGCGGTACATAGGG * **************************** ******************* AAGCAGGCTACCCGGGCCTGATCTGGGGCACAAGTCCCTTCCGCTC---- AAGCAGGCTACCCGGGCCTGATCTGGGGCACAAGTCCCTTCCGCTCGACT ********************************************** ---CACTCGAGCCCCGGGCTACCTTGCCAGCAGAGTC--GCCCGACACCC CGAGACTCGAGCCCCGGGCTACCTTGCCAGCAGAGTCTTGCCCAACACCC ********************************* **** ****** GCAAGGGCCCACACAGGACTCCCCACGGGATCCTAAGACCTCTGGTGAGT GCAAGGGTCCACACGGGACTCCCCACGGGACCCTAAGACCTCTGGTGAGT ******* ****** *************** ******************* GGAACACA GGATCACA *** **** 1 51 101 151 201 A_mm10 A-I A_mm10 A-I A_mm10 A-I A_mm10 A-I A_mm10 A-I Figure A.1: Consensus of Type A monomer profile-HMM. A-mm10 represents the model con- structed in this study. The A-I subtype was chosen for comparison. A.2 Consensus sequences of finalized profile-HMMs For the detection of monomers in the mm10 genome, each of the three types: A, G and T, has been modeled to profile-HMM and the parameters were trained. To represent the emission parameters, for each match state of the model, the dominant nucleotide was chosen to constitute the consensus sequence of the corresponding profile-HMM. The sequences are shown in comparison with the classical monomer consensus sequences reported before (Schichman et al., 1993; Goodier et al., 2001; Naas et al., 1998). A.3 Statistics of sequencing data used for relative abundance analysis In this section, the statistics of raw sequencing reads that were used for estimating the relative abundance of L1Md repeats is attached. For each of the five species included in the analysis, the 89 TGGTGAGAGCACAGGGGTCTGCCCGGCCTGAGAGGTTTGTGCCTCAGGCG ---TGAGAGCACGGGGGTCTGCCNGGCCTGAGAGGTTTGTGGCACAGGCG ********* ********** ***************** * ****** CCGGCGGGAGCCTTCTTGGCTCCGGGACTCCGCGGAGGGCAGGCTGCACG CCGGCGGGAGCCTTCTTGGCTCCGGGACTCCGCGGAGGGCAGGCTGCACG ************************************************** GGTGACCGTGTGGAATACAGAG-GCCAGCCGTTTCTGGGACGGGCGAGAG GGTGACCGTGTGGAATACAGAGTGCCAGCCGTTTCTGGGACGGGCGAGAG ********************** *************************** CCACAGAG--CTTCTGAGGCGGCGCCATCTTCGGCTCCAGACAACCGGCC NNNCAGAGNNCTTCTG-GGCGGCGCCATCTTCAGCTCCAGACAGCCGGCC ***** ****** *************** ********** ****** ACCTTCCTG- ACCTTCCTGG ********* 1 51 101 151 201 G_mm10 G G_mm10 G G_mm10 G G_mm10 G G_mm10 G Figure A.2: Consensus of Type G monomer profile-HMM. G-mm10 represents the model con- structed in this study. GGTGCGCCAGAGAACCTGACAGCTTCTGGAACAGGCAGAAGCACAGAGGC -------------------------------------------------- GCTGAGGCAGCACCCTGTGTGGGCCGGGGACAGCCGGCCACCTTCCGGAC -------------------------------------------------C * CAGAGGACAGGTGCCCGCCCGGCTGGGGAGGCGGCCTAAGCCACAGCAGC CAGAGGACAGGTGCCCGCCCGGCTGGGGAGGCGGCCTAAGCCACAGCAGC ************************************************** AGCGGTCGCCATCTTGGTTCCGGGACTCCGCCGAACTTAGGAAATTAGTC AGCGGTCGCCATCTTGG-TCCGGGAC-CCGCCGAACTTAGGAAATTAGTC ***************** ******** *********************** TGAACAGGTGAGAG------------------------------------ TGAACAGGTGAGAGGGTGCGCCAGAGAACCTGACAGCTTCTGGAACAGGC ************** -------------------------------------------------- GGAAGCACAGAGGCGCTGAGGCAGCACCCTGTGTGGGCCGGGGACAGCCG ------------- GCCACCTTCCGGA 1 51 101 151 201 251 301 T_mm10 T T_mm10 T T_mm10 T T_mm10 T T_mm10 T T_mm10 T T_mm10 T Figure A.3: Consensus of Type T monomer profile-HMM. T-mm10 represents the model con- structed in this study. 90 first five sequencing runs released by the Mouse Genome Project were chosen. Because all runs use pair-end layout, in fact there were in total ten files per species involved in the analysis. 91 Species File Total reads ORF2 reads % Monomer reads % 7-bp insertion reads % rn SRR7755593 1 275,518,015 24,105,828 8.7493 1 0.0000 1 0.0000 rn SRR7755593 2 275,517,889 23,911,311 8.6787 7 0.0000 0 0.0000 rn SRR7755594 1 308,093,023 27,430,787 8.9034 8,934 0.0029 45 0.0000 rn SRR7755594 2 308,090,141 27,254,418 8.8462 8,915 0.0029 45 0.0000 mc ERR019941 1 27,906,249 1,637,328 5.8672 20,338 0.0729 1,240 0.0044 mc ERR019941 2 27,906,249 1,533,681 5.4958 17,349 0.0622 1,119 0.0040 mc ERR019942 1 14,215,764 953,148 6.7049 2,244 0.0158 150 0.0011 mc ERR019942 2 14,215,764 906,755 6.3785 1,659 0.0117 130 0.0009 mc ERR019943 1 19,609,614 1,594,498 8.1312 1,128 0.0058 56 0.0003 mc ERR019943 2 19,609,614 1,497,000 7.6340 729 0.0037 51 0.0003 mc ERR019944 1 21,942,342 1,684,607 7.6774 7,140 0.0325 420 0.0019 mc ERR019944 2 21,942,342 1,592,735 7.2587 5,288 0.0241 331 0.0015 mc ERR019945 1 27,950,396 1,699,670 6.0810 24,389 0.0873 1,497 0.0054 mc ERR019945 2 27,950,396 1,650,423 5.9048 21,588 0.0772 1,290 0.0046 mc ERR019946 1 28,816,471 1,754,600 6.0889 25,096 0.0871 1,577 0.0055 mc ERR019946 2 28,816,471 1,702,856 5.9093 22,193 0.0770 1,357 0.0047 mc ERR019947 1 28,650,122 1,744,446 6.0888 25,029 0.0874 1,552 0.0054 mc ERR019947 2 28,650,122 1,693,228 5.9100 22,117 0.0772 1,383 0.0048 92 mc ERR019948 1 37,650,602 2,398,424 6.3702 30,163 0.0801 1,788 0.0047 mc ERR019948 2 37,650,602 2,266,796 6.0206 24,256 0.0644 1,447 0.0038 mc ERR019949 1 37,630,586 2,396,735 6.3691 30,365 0.0807 1,816 0.0048 mc ERR019949 2 37,630,586 2,265,852 6.0213 24,035 0.0639 1,411 0.0037 mc ERR019950 1 36,839,832 2,347,715 6.3728 29,074 0.0789 1,657 0.0045 mc ERR019950 2 36,839,832 2,202,919 5.9797 23,468 0.0637 1,431 0.0039 mc ERR128870 1 117,733,045 7,299,273 6.1999 58,995 0.0501 4,782 0.0041 mc ERR128870 2 117,733,045 7,275,856 6.1800 58,333 0.0495 4,762 0.0040 mc ERR133990 1 182,151,449 11,465,500 6.2945 90,496 0.0497 7,592 0.0042 mc ERR133990 2 182,151,449 11,406,438 6.2621 85,318 0.0468 6,715 0.0037 mc ERR133991 1 181,215,430 11,423,773 6.3040 90,945 0.0502 7,529 0.0042 mc ERR133991 2 181,215,430 11,364,437 6.2712 85,479 0.0472 7,053 0.0039 mc ERR133992 1 182,906,477 11,541,625 6.3101 91,832 0.0502 7,707 0.0042 mc ERR133992 2 182,906,477 11,481,870 6.2775 86,870 0.0475 7,173 0.0039 mc ERR133993 1 180,785,647 11,392,379 6.3016 90,794 0.0502 7,460 0.0041 mc ERR133993 2 180,785,647 11,335,918 6.2704 85,910 0.0475 7,095 0.0039 ms ERR118249 1 32,661,408 2,352,806 7.2036 28,546 0.0874 3,470 0.0106 ms ERR118249 2 32,661,408 2,332,571 7.1417 27,931 0.0855 3,255 0.0100 ms ERR118254 1 29,001,399 2,091,969 7.2133 25,035 0.0863 2,892 0.0100 ms ERR118254 2 29,001,399 2,073,182 7.1486 24,560 0.0847 2,723 0.0094 93 ms ERR118259 1 32,230,309 2,318,722 7.1942 28,152 0.0873 3,339 0.0104 ms ERR118259 2 32,230,309 2,295,812 7.1231 27,208 0.0844 3,010 0.0093 ms ERR118264 1 32,727,502 2,358,382 7.2061 28,477 0.0870 3,373 0.0103 ms ERR118264 2 32,727,502 2,340,318 7.1509 27,791 0.0849 3,176 0.0097 ms ERR167388 1 37,955,613 2,674,473 7.0463 33,176 0.0874 3,922 0.0103 ms ERR167388 2 37,955,613 2,656,972 7.0002 33,438 0.0881 3,949 0.0104 ms ERR175944 1 95,906,142 6,691,115 6.9767 69,921 0.0729 8,326 0.0087 ms ERR175944 2 95,906,142 6,612,934 6.8952 69,074 0.0720 8,095 0.0084 ms ERR175946 1 98,315,210 6,948,782 7.0679 69,918 0.0711 8,138 0.0083 ms ERR175946 2 98,315,210 6,909,621 7.0280 70,265 0.0715 8,121 0.0083 ms ERR175948 1 98,563,005 6,984,067 7.0859 69,786 0.0708 8,240 0.0084 ms ERR175948 2 98,563,005 6,935,932 7.0371 70,181 0.0712 8,081 0.0082 ms ERR175950 1 99,027,631 7,008,635 7.0775 70,175 0.0709 8,117 0.0082 ms ERR175950 2 99,027,631 6,960,934 7.0293 70,674 0.0714 8,193 0.0083 ms ERR175952 1 99,441,785 7,033,618 7.0731 70,818 0.0712 8,036 0.0081 ms ERR175952 2 99,441,785 6,983,907 7.0231 71,015 0.0714 7,957 0.0080 ms ERR181453 1 34,998,648 2,504,488 7.1560 29,903 0.0854 3,562 0.0102 ms ERR181453 2 34,998,648 2,490,602 7.1163 30,197 0.0863 3,501 0.0100 ms ERR181458 1 35,074,904 2,507,431 7.1488 30,243 0.0862 3,567 0.0102 ms ERR181458 2 35,074,904 2,495,268 7.1141 30,364 0.0866 3,692 0.0105 94 ms ERR181463 1 35,410,672 2,535,230 7.1595 29,984 0.0847 3,501 0.0099 ms ERR181463 2 35,410,672 2,522,194 7.1227 30,514 0.0862 3,536 0.0100 mmc ERR089754 1 23,251,214 1,359,247 5.8459 20,497 0.0882 4,464 0.0192 mmc ERR089754 2 23,251,214 1,354,042 5.8235 20,589 0.0886 4,280 0.0184 mmc ERR089762 1 19,914,243 1,158,586 5.8179 17,214 0.0864 3,548 0.0178 mmc ERR089762 2 19,914,243 1,157,458 5.8122 17,585 0.0883 3,636 0.0183 mmc ERR089770 1 19,544,079 1,143,244 5.8496 17,251 0.0883 3,757 0.0192 mmc ERR089770 2 19,544,079 1,140,394 5.8350 17,368 0.0889 3,732 0.0191 mmc ERR089778 1 22,640,707 1,312,934 5.7990 19,625 0.0867 4,026 0.0178 mmc ERR089778 2 22,640,707 1,317,159 5.8177 20,280 0.0896 4,222 0.0186 mmc ERR089786 1 19,891,912 1,156,374 5.8133 17,110 0.0860 3,597 0.0181 mmc ERR089786 2 19,891,912 1,158,006 5.8215 17,646 0.0887 3,711 0.0187 mmc ERR089794 1 23,035,235 1,345,389 5.8406 20,242 0.0879 4,417 0.0192 mmc ERR089794 2 23,035,235 1,343,751 5.8335 20,524 0.0891 4,371 0.0190 mmc ERR089802 1 23,031,873 1,345,495 5.8419 20,339 0.0883 4,450 0.0193 mmc ERR089802 2 23,031,873 1,340,950 5.8221 20,483 0.0889 4,285 0.0186 mmc ERR118270 1 24,088,880 1,401,651 5.8187 21,077 0.0875 4,462 0.0185 mmc ERR118270 2 24,088,880 1,393,722 5.7857 21,115 0.0877 4,399 0.0183 mmc ERR118278 1 20,672,284 1,201,670 5.8130 17,919 0.0867 3,937 0.0190 mmc ERR118278 2 20,672,284 1,196,336 5.7871 18,131 0.0877 3,749 0.0181 95 mmc ERR118286 1 24,039,813 1,397,504 5.8133 20,972 0.0872 4,517 0.0188 mmc ERR118286 2 24,039,813 1,392,056 5.7906 21,109 0.0878 4,345 0.0181 mmc ERR118294 1 23,785,922 1,388,129 5.8359 20,651 0.0868 4,586 0.0193 mmc ERR118294 2 23,785,922 1,380,363 5.8033 21,119 0.0888 4,344 0.0183 mmc ERR118302 1 23,671,103 1,373,339 5.8018 20,475 0.0865 4,428 0.0187 mmc ERR118302 2 23,671,103 1,366,844 5.7743 20,729 0.0876 4,278 0.0181 mmc ERR118310 1 23,795,261 1,383,697 5.8150 20,584 0.0865 4,408 0.0185 mmc ERR118310 2 23,795,261 1,379,156 5.7959 20,889 0.0878 4,315 0.0181 mmc ERR118318 1 23,875,483 1,389,757 5.8209 20,836 0.0873 4,453 0.0187 mmc ERR118318 2 23,875,483 1,384,287 5.7979 20,887 0.0875 4,306 0.0180 mmc ERR118326 1 23,988,871 1,397,095 5.8239 21,083 0.0879 4,529 0.0189 mmc ERR118326 2 23,988,871 1,390,722 5.7974 21,259 0.0886 4,384 0.0183 mmc ERR134268 1 19,028,887 1,128,749 5.9318 13,183 0.0693 2,491 0.0131 mmc ERR134268 2 19,028,887 1,120,321 5.8875 13,157 0.0691 2,393 0.0126 mmc ERR134276 1 18,000,042 1,062,494 5.9027 12,681 0.0704 2,465 0.0137 mmc ERR134276 2 18,000,042 1,061,534 5.8974 12,814 0.0712 2,521 0.0140 mmc ERR134284 1 17,594,261 1,036,627 5.8918 12,682 0.0721 2,565 0.0146 mmc ERR134284 2 17,594,261 1,034,846 5.8817 12,687 0.0721 2,482 0.0141 mmc ERR134292 1 17,886,452 1,055,237 5.8996 12,705 0.0710 2,554 0.0143 mmc ERR134292 2 17,886,452 1,054,254 5.8941 12,890 0.0721 2,558 0.0143 96 mmc ERR134300 1 17,908,866 1,055,564 5.8941 12,450 0.0695 2,393 0.0134 mmc ERR134300 2 17,908,866 1,055,003 5.8910 12,914 0.0721 2,471 0.0138 mmc ERR134308 1 18,005,174 1,060,866 5.8920 12,793 0.0711 2,570 0.0143 mmc ERR134308 2 18,005,174 1,059,501 5.8844 12,973 0.0721 2,532 0.0141 mmc ERR134316 1 18,033,149 1,063,178 5.8957 12,738 0.0706 2,465 0.0137 mmc ERR134316 2 18,033,149 1,062,484 5.8918 12,798 0.0710 2,490 0.0138 mmc ERR137437 1 18,141,724 1,072,401 5.9112 12,869 0.0709 2,539 0.0140 mmc ERR137437 2 18,141,724 1,066,282 5.8775 12,798 0.0705 2,494 0.0137 mmc ERR137445 1 17,774,978 1,050,810 5.9117 12,393 0.0697 2,511 0.0141 mmc ERR137445 2 17,774,978 1,044,204 5.8746 12,585 0.0708 2,406 0.0135 mmc ERR137453 1 18,370,285 1,087,456 5.9196 12,898 0.0702 2,649 0.0144 mmc ERR137453 2 18,370,285 1,081,240 5.8858 12,988 0.0707 2,545 0.0139 mmc ERR137461 1 17,737,029 1,049,251 5.9156 12,402 0.0699 2,459 0.0139 mmc ERR137461 2 17,737,029 1,042,574 5.8780 12,420 0.0700 2,460 0.0139 mmc ERR137469 1 17,926,323 1,060,348 5.9150 12,649 0.0706 2,509 0.0140 mmc ERR137469 2 17,926,323 1,055,529 5.8882 12,743 0.0711 2,457 0.0137 mmc ERR137477 1 17,999,320 1,063,627 5.9093 12,887 0.0716 2,452 0.0136 mmc ERR137477 2 17,999,320 1,059,888 5.8885 13,003 0.0722 2,515 0.0140 mmc ERR137485 1 17,837,966 1,055,310 5.9161 12,587 0.0706 2,528 0.0142 mmc ERR137485 2 17,837,966 1,049,642 5.8843 12,766 0.0716 2,460 0.0138 97 mmc ERR137493 1 17,839,259 1,056,515 5.9224 12,533 0.0703 2,557 0.0143 mmc ERR137493 2 17,839,259 1,051,726 5.8956 12,776 0.0716 2,460 0.0138 mmc ERR148504 1 18,577,185 1,102,062 5.9323 12,870 0.0693 2,540 0.0137 mmc ERR148504 2 18,577,185 1,099,598 5.9191 12,899 0.0694 2,460 0.0132 mmm ERR089751 1 23,394,691 1,433,688 6.1283 23,882 0.1021 5,372 0.0230 mmm ERR089751 2 23,394,691 1,429,399 6.1099 24,195 0.1034 5,327 0.0228 mmm ERR089759 1 20,047,321 1,220,768 6.0894 20,087 0.1002 4,151 0.0207 mmm ERR089759 2 20,047,321 1,219,779 6.0845 20,355 0.1015 4,281 0.0214 mmm ERR089767 1 19,675,745 1,208,108 6.1401 19,982 0.1016 4,596 0.0234 mmm ERR089767 2 19,675,745 1,203,581 6.1171 20,292 0.1031 4,547 0.0231 mmm ERR089775 1 22,794,573 1,388,964 6.0934 22,799 0.1000 4,802 0.0211 mmm ERR089775 2 22,794,573 1,393,768 6.1145 23,026 0.1010 5,219 0.0229 mmm ERR089783 1 20,016,165 1,220,598 6.0981 19,924 0.0995 4,450 0.0222 mmm ERR089783 2 20,016,165 1,221,160 6.1009 20,490 0.1024 4,527 0.0226 mmm ERR089791 1 23,224,177 1,426,499 6.1423 23,446 0.1010 5,349 0.0230 mmm ERR089791 2 23,224,177 1,424,484 6.1336 23,980 0.1033 5,357 0.0231 mmm ERR089799 1 23,141,246 1,420,178 6.1370 23,745 0.1026 5,416 0.0234 mmm ERR089799 2 23,141,246 1,414,027 6.1104 24,171 0.1044 5,408 0.0234 mmm ERR118267 1 24,111,538 1,472,839 6.1084 24,271 0.1007 5,388 0.0223 mmm ERR118267 2 24,111,538 1,463,869 6.0712 24,383 0.1011 5,209 0.0216 98 mmm ERR118275 1 20,661,276 1,261,643 6.1063 20,769 0.1005 4,752 0.0230 mmm ERR118275 2 20,661,276 1,255,518 6.0767 21,044 0.1019 4,500 0.0218 mmm ERR118283 1 24,005,628 1,466,155 6.1075 24,104 0.1004 5,335 0.0222 mmm ERR118283 2 24,005,628 1,458,479 6.0756 24,305 0.1012 5,165 0.0215 mmm ERR118291 1 23,726,235 1,450,101 6.1118 23,671 0.0998 5,358 0.0226 mmm ERR118291 2 23,726,235 1,442,879 6.0814 23,862 0.1006 5,197 0.0219 mmm ERR118299 1 23,618,932 1,439,980 6.0967 23,713 0.1004 5,327 0.0226 mmm ERR118299 2 23,618,932 1,432,814 6.0664 23,991 0.1016 5,176 0.0219 mmm ERR118307 1 23,760,318 1,451,166 6.1075 23,417 0.0986 5,268 0.0222 mmm ERR118307 2 23,760,318 1,445,041 6.0817 23,927 0.1007 5,269 0.0222 mmm ERR118315 1 23,822,399 1,454,831 6.1070 24,061 0.1010 5,395 0.0226 mmm ERR118315 2 23,822,399 1,446,894 6.0737 24,490 0.1028 5,227 0.0219 mmm ERR118323 1 23,899,503 1,461,201 6.1139 24,134 0.1010 5,403 0.0226 mmm ERR118323 2 23,899,503 1,453,885 6.0833 24,580 0.1028 5,322 0.0223 mmm ERR134265 1 19,255,207 1,191,613 6.1885 17,318 0.0899 3,530 0.0183 mmm ERR134265 2 19,255,207 1,179,905 6.1277 16,928 0.0879 3,327 0.0173 mmm ERR134273 1 18,277,004 1,126,021 6.1609 16,334 0.0894 3,512 0.0192 mmm ERR134273 2 18,277,004 1,123,778 6.1486 16,207 0.0887 3,395 0.0186 mmm ERR134281 1 17,845,470 1,098,734 6.1569 15,855 0.0888 3,451 0.0193 mmm ERR134281 2 17,845,470 1,097,161 6.1481 15,966 0.0895 3,339 0.0187 99 mmm ERR134289 1 18,159,732 1,118,565 6.1596 16,135 0.0889 3,457 0.0190 mmm ERR134289 2 18,159,732 1,115,925 6.1451 16,057 0.0884 3,402 0.0187 mmm ERR134297 1 18,180,738 1,119,452 6.1574 16,259 0.0894 3,489 0.0192 mmm ERR134297 2 18,180,738 1,116,200 6.1395 16,132 0.0887 3,365 0.0185 mmm ERR134305 1 18,255,995 1,128,189 6.1798 16,306 0.0893 3,459 0.0189 mmm ERR134305 2 18,255,995 1,124,424 6.1592 16,194 0.0887 3,425 0.0188 mmm ERR134313 1 18,328,351 1,128,946 6.1596 16,287 0.0889 3,496 0.0191 mmm ERR134313 2 18,328,351 1,127,021 6.1491 16,271 0.0888 3,428 0.0187 mmm ERR137434 1 18,509,785 1,143,764 6.1792 16,385 0.0885 3,533 0.0191 mmm ERR137434 2 18,509,785 1,136,305 6.1389 16,206 0.0876 3,339 0.0180 mmm ERR137442 1 18,146,212 1,123,600 6.1919 16,039 0.0884 3,498 0.0193 mmm ERR137442 2 18,146,212 1,116,172 6.1510 16,283 0.0897 3,427 0.0189 mmm ERR137450 1 18,767,642 1,161,378 6.1882 16,590 0.0884 3,543 0.0189 mmm ERR137450 2 18,767,642 1,153,883 6.1483 16,702 0.0890 3,396 0.0181 mmm ERR137458 1 18,112,696 1,119,481 6.1806 16,090 0.0888 3,526 0.0195 mmm ERR137458 2 18,112,696 1,112,149 6.1402 16,061 0.0887 3,324 0.0184 mmm ERR137466 1 18,281,195 1,131,581 6.1899 16,155 0.0884 3,568 0.0195 mmm ERR137466 2 18,281,195 1,125,076 6.1543 16,172 0.0885 3,368 0.0184 mmm ERR137474 1 18,391,048 1,135,335 6.1733 16,535 0.0899 3,476 0.0189 mmm ERR137474 2 18,391,048 1,129,020 6.1390 16,370 0.0890 3,534 0.0192 100 mmm ERR137482 1 18,233,574 1,127,038 6.1811 16,328 0.0895 3,630 0.0199 mmm ERR137482 2 18,233,574 1,119,493 6.1397 16,230 0.0890 3,446 0.0189 mmm ERR137490 1 18,208,639 1,127,906 6.1943 16,253 0.0893 3,465 0.0190 mmm ERR137490 2 18,208,639 1,122,581 6.1651 16,254 0.0893 3,186 0.0175 mmm ERR148501 1 18,453,636 1,147,180 6.2166 16,029 0.0869 3,459 0.0187 mmm ERR148501 2 18,453,636 1,141,214 6.1842 15,924 0.0863 3,217 0.0174 mmd ERR089753 1 23,495,826 1,444,974 6.1499 26,147 0.1113 6,643 0.0283 mmd ERR089753 2 23,495,826 1,438,426 6.1220 26,457 0.1126 6,427 0.0274 mmd ERR089761 1 20,135,192 1,230,552 6.1114 22,100 0.1098 5,034 0.0250 mmd ERR089761 2 20,135,192 1,229,412 6.1058 22,552 0.1120 5,402 0.0268 mmd ERR089769 1 19,755,696 1,214,342 6.1468 21,960 0.1112 5,622 0.0285 mmd ERR089769 2 19,755,696 1,209,858 6.1241 22,267 0.1127 5,413 0.0274 mmd ERR089777 1 22,882,376 1,397,547 6.1075 25,355 0.1108 6,121 0.0267 mmd ERR089777 2 22,882,376 1,400,602 6.1209 26,143 0.1142 6,425 0.0281 mmd ERR089785 1 20,108,137 1,231,077 6.1223 22,252 0.1107 5,418 0.0269 mmd ERR089785 2 20,108,137 1,230,544 6.1196 22,907 0.1139 5,626 0.0280 mmd ERR089793 1 23,294,872 1,430,194 6.1395 25,695 0.1103 6,538 0.0281 mmd ERR089793 2 23,294,872 1,427,724 6.1289 26,518 0.1138 6,506 0.0279 mmd ERR089801 1 23,220,716 1,427,681 6.1483 25,967 0.1118 6,788 0.0292 mmd ERR089801 2 23,220,716 1,422,156 6.1245 26,155 0.1126 6,489 0.0279 101 mmd ERR118269 1 24,284,120 1,484,453 6.1129 26,670 0.1098 6,653 0.0274 mmd ERR118269 2 24,284,120 1,476,162 6.0787 27,054 0.1114 6,548 0.0270 mmd ERR118277 1 20,793,887 1,273,046 6.1222 23,307 0.1121 5,775 0.0278 mmd ERR118277 2 20,793,887 1,266,468 6.0906 23,305 0.1121 5,623 0.0270 mmd ERR118285 1 24,170,370 1,478,624 6.1175 26,628 0.1102 6,543 0.0271 mmd ERR118285 2 24,170,370 1,470,748 6.0849 27,016 0.1118 6,453 0.0267 mmd ERR118293 1 23,938,447 1,467,025 6.1283 26,317 0.1099 6,543 0.0273 mmd ERR118293 2 23,938,447 1,459,174 6.0955 26,600 0.1111 6,411 0.0268 mmd ERR118301 1 23,810,007 1,456,155 6.1157 26,314 0.1105 6,620 0.0278 mmd ERR118301 2 23,810,007 1,448,070 6.0818 26,491 0.1113 6,464 0.0271 mmd ERR118309 1 23,943,729 1,462,804 6.1093 26,578 0.1110 6,612 0.0276 mmd ERR118309 2 23,943,729 1,455,709 6.0797 26,994 0.1127 6,598 0.0276 mmd ERR118317 1 24,005,552 1,469,631 6.1220 26,645 0.1110 6,741 0.0281 mmd ERR118317 2 24,005,552 1,462,839 6.0938 26,795 0.1116 6,454 0.0269 mmd ERR118325 1 24,075,078 1,473,140 6.1189 26,790 0.1113 6,764 0.0281 mmd ERR118325 2 24,075,078 1,466,058 6.0895 26,940 0.1119 6,305 0.0262 mmd ERR134267 1 17,058,574 1,063,639 6.2352 15,838 0.0928 3,660 0.0215 mmd ERR134267 2 17,058,574 1,056,712 6.1946 15,764 0.0924 3,416 0.0200 mmd ERR134275 1 16,154,320 1,004,455 6.2179 14,985 0.0928 3,520 0.0218 mmd ERR134275 2 16,154,320 1,002,873 6.2081 15,097 0.0935 3,529 0.0218 102 mmd ERR134283 1 15,775,356 978,218 6.2009 14,949 0.0948 3,577 0.0227 mmd ERR134283 2 15,775,356 976,936 6.1928 14,992 0.0950 3,526 0.0224 mmd ERR134291 1 16,038,899 997,635 6.2201 14,703 0.0917 3,560 0.0222 mmd ERR134291 2 16,038,899 995,606 6.2074 14,908 0.0929 3,527 0.0220 mmd ERR134299 1 16,065,228 998,406 6.2147 14,950 0.0931 3,549 0.0221 mmd ERR134299 2 16,065,228 997,621 6.2098 15,331 0.0954 3,560 0.0222 mmd ERR134307 1 16,159,491 1,004,546 6.2164 15,109 0.0935 3,577 0.0221 mmd ERR134307 2 16,159,491 1,003,172 6.2079 15,300 0.0947 3,499 0.0217 mmd ERR134315 1 16,163,051 1,005,838 6.2231 15,147 0.0937 3,594 0.0222 mmd ERR134315 2 16,163,051 1,004,398 6.2142 15,347 0.0950 3,620 0.0224 mmd ERR137436 1 16,363,388 1,018,148 6.2221 15,185 0.0928 3,745 0.0229 mmd ERR137436 2 16,363,388 1,012,950 6.1903 15,197 0.0929 3,538 0.0216 mmd ERR137444 1 16,030,680 998,901 6.2312 14,873 0.0928 3,590 0.0224 mmd ERR137444 2 16,030,680 992,173 6.1892 14,962 0.0933 3,531 0.0220 mmd ERR137452 1 16,563,507 1,032,781 6.2353 15,292 0.0923 3,709 0.0224 mmd ERR137452 2 16,563,507 1,027,687 6.2045 15,508 0.0936 3,640 0.0220 mmd ERR137460 1 15,972,075 994,131 6.2242 14,694 0.0920 3,539 0.0222 mmd ERR137460 2 15,972,075 988,878 6.1913 15,166 0.0950 3,527 0.0221 mmd ERR137468 1 16,165,377 1,007,305 6.2312 15,080 0.0933 3,557 0.0220 mmd ERR137468 2 16,165,377 1,002,064 6.1988 15,065 0.0932 3,423 0.0212 103 mmd ERR137476 1 16,227,244 1,011,161 6.2313 14,885 0.0917 3,595 0.0222 mmd ERR137476 2 16,227,244 1,005,966 6.1992 15,138 0.0933 3,508 0.0216 mmd ERR137484 1 16,086,372 1,002,875 6.2343 15,035 0.0935 3,633 0.0226 mmd ERR137484 2 16,086,372 997,165 6.1988 15,089 0.0938 3,554 0.0221 mmd ERR137492 1 16,068,022 1,001,587 6.2334 14,985 0.0933 3,694 0.0230 mmd ERR137492 2 16,068,022 996,900 6.2042 15,130 0.0942 3,594 0.0224 mmd ERR148503 1 16,681,148 1,042,272 6.2482 15,157 0.0909 3,601 0.0216 mmd ERR148503 2 16,681,148 1,038,042 6.2228 15,285 0.0916 3,415 0.0205 Table A.1: Statistics of sequencing data used for relative abundance analysis. 104 A.4 Identified subtypes of monomers in the mm10 genome The statistics of subtypes identified for monomers in the mm10 genome is listed here in Tables A.2, A.3 and A.4. The min points (MP) parameter for the HDBSCAN algorithm is set to 10 for Type A and T, and 5 for Type G monomers. The number of nearest neighbors for subtype inference is set to 3 for all types of monomers. A.5 CAGE-seq analysis and alignment to profile-HMM states Prior to mapping, both 5’-end and 3’-end adaptor sequences were removed from the reads as described by Li et al. (2013). Then the reads were mapped usingbowtie-1.0.0 with end-to-end mode allowing at most 3 mismatches per alignment and “--best”. Mapping ambiguity was set to allowing maximum 100k mappable locations by the “-m” option. Then for every reported alignment of each individual read, its weight was set to the fraction of total mapped locations, such that the sum of all locations of one read is 1. For each mapped read, its 5’-end position was used for aligning to a match state of the monomer profile-HMM. This was done by computing posterior decoding of the entire promoter sequence, and then the match state located at the mapping location was used. Alignments to an insertion state were discarded. A.6 Position preference scores The scores of position preference were computed for all subtypes identified for A, G and T monomers. They are included in Table A.5, A.6 and A.7. For all A, G and T subtypes, the threshold that triggered preference mode categorization is 0:1. The terminus region definition differs for A and the other two types. For A subtypes, the terminus region is defined as the first position, but for G and T, the terminus is defined as the 105 first 3 positions because of truncated monomers have been observed at the 3’-end of some L1Md promoters containing Type G or T monomers. 106 Subtype ID Count of core monomers Count of all monomers Consensus length 1 2,172 4,966 208 2 1,898 3,040 198 3 1,148 2,080 199 4 486 661 199 5 448 575 199 6 359 666 198 7 327 387 199 8 293 599 199 9 278 320 199 10 271 404 199 11 244 268 199 12 219 245 199 13 196 219 199 14 196 338 207 15 194 221 208 16 189 223 208 17 189 269 201 18 186 204 200 19 180 188 199 20 172 354 201 21 164 252 201 22 162 306 201 23 158 183 201 24 153 196 208 25 149 202 201 26 144 148 199 27 131 164 201 28 116 168 201 29 103 103 201 30 99 128 199 31 93 131 199 32 67 70 201 33 58 65 200 34 55 65 199 Table A.2: List for subtypes of A monomers. 107 Subtype ID Count of core monomers Count of all monomers Consensus length 1 1,517 4,919 206 2 360 606 206 3 193 289 201 4 85 447 201 5 75 90 206 6 52 90 204 Table A.3: List for subtypes of G monomers. 108 Subtype ID Count of core monomers Count of all monomers Consensus length 1 1,257 2,871 214 2 886 989 214 3 604 1,147 204 4 530 2,158 212 5 513 1,363 214 6 505 654 212 7 445 980 214 8 427 518 212 9 350 429 212 10 264 305 212 11 233 263 214 12 220 251 214 13 218 248 214 14 179 230 212 15 177 215 212 16 174 180 214 17 170 194 214 18 162 185 212 19 161 298 212 20 158 228 214 21 153 200 212 22 153 178 212 23 144 161 212 24 127 143 214 25 125 204 212 26 103 107 214 27 90 94 214 28 89 115 212 29 81 111 212 30 80 91 213 31 60 67 212 32 34 37 212 Table A.4: List for subtypes of T monomers. 109 Subtpe ID Position order 1 2 3 4 5 6 7 8 9 10 1 -0.375 0.062 0.106 0.076 0.042 0.025 0.013 0.009 0.007 0.005 2 0.232 -0.065 -0.061 -0.037 -0.023 -0.014 -0.007 -0.005 -0.002 -0.001 3 0.268 -0.092 -0.058 -0.041 -0.024 -0.014 -0.009 -0.006 -0.003 -0.002 4 -0.008 0.022 0.004 -0.011 -0.002 -0.002 0.005 0.000 0.000 -0.001 5 0.383 -0.187 -0.074 -0.047 -0.022 -0.016 -0.010 -0.004 -0.001 -0.003 6 0.231 -0.070 -0.049 -0.039 -0.023 -0.011 -0.007 -0.006 -0.004 -0.003 7 0.163 -0.021 -0.055 -0.038 -0.013 -0.010 -0.001 -0.006 -0.004 0.000 8 0.075 -0.004 -0.015 -0.029 -0.011 0.001 0.007 0.000 -0.004 -0.003 9 0.028 0.023 0.008 -0.031 -0.010 -0.002 -0.010 0.001 0.000 0.001 10 0.105 0.022 -0.033 -0.023 -0.020 -0.016 -0.006 -0.003 -0.004 -0.003 11 0.153 0.025 -0.054 -0.040 -0.027 -0.016 -0.010 -0.006 -0.004 -0.003 12 0.089 -0.033 -0.039 -0.024 0.001 -0.003 0.004 0.003 0.001 -0.003 13 0.333 -0.147 -0.066 -0.036 -0.026 -0.016 -0.010 -0.006 -0.004 -0.003 14 -0.350 0.118 0.051 0.071 0.055 0.019 0.005 0.004 0.001 0.002 15 -0.323 0.121 0.042 0.026 0.067 0.040 0.006 0.009 0.001 0.008 16 -0.355 0.120 0.094 0.071 0.027 0.010 0.011 -0.001 -0.004 0.003 17 -0.122 0.072 0.025 0.002 0.022 0.010 -0.005 0.015 -0.004 -0.003 18 0.035 -0.014 -0.010 -0.008 0.001 -0.006 0.012 0.004 0.002 -0.003 19 0.020 -0.040 0.004 0.033 -0.003 0.000 0.007 -0.001 0.002 -0.003 20 -0.123 0.056 0.038 0.019 -0.008 -0.005 0.002 -0.006 -0.004 -0.003 21 -0.061 0.051 0.015 0.005 0.005 -0.016 -0.004 0.000 -0.004 -0.003 22 -0.136 -0.038 0.041 0.006 -0.013 0.021 0.003 0.006 -0.004 -0.003 23 -0.171 0.284 0.000 -0.037 -0.031 -0.010 -0.003 -0.006 -0.004 -0.003 24 -0.303 0.320 -0.029 -0.010 0.015 0.010 -0.003 0.000 -0.004 -0.003 25 -0.190 0.019 0.000 0.038 0.022 0.017 0.017 0.000 0.016 0.004 26 0.065 0.014 -0.024 -0.035 -0.010 0.011 0.004 0.001 -0.004 -0.003 27 -0.229 0.153 0.076 0.005 0.015 -0.001 -0.002 -0.006 -0.004 0.005 28 -0.295 0.055 0.074 0.125 0.038 0.018 -0.001 0.002 0.005 -0.003 29 -0.043 -0.045 -0.010 0.002 -0.002 0.003 0.019 0.023 0.006 0.007 30 0.017 0.104 -0.016 -0.026 -0.031 -0.016 -0.010 -0.006 0.006 -0.003 31 0.202 -0.057 -0.053 -0.056 -0.020 -0.016 -0.010 -0.006 -0.004 -0.003 32 -0.116 -0.035 0.087 0.063 0.014 -0.001 -0.010 -0.006 -0.004 -0.003 33 -0.139 0.116 0.048 0.065 -0.031 -0.016 -0.010 -0.006 -0.004 -0.003 34 0.063 0.062 -0.016 -0.020 -0.031 -0.016 -0.010 -0.006 -0.004 -0.003 Table A.5: Position preference scores of A subtypes. 110 Subtpe ID Position order 1 2 3 4 5 6 7 8 9 10 1 0.106 -0.012 -0.003 -0.049 -0.022 -0.010 -0.005 -0.003 -0.002 0.000 2 -0.241 -0.163 0.008 0.218 0.086 0.041 0.023 0.014 0.008 0.004 3 -0.226 0.402 -0.059 -0.074 -0.032 -0.008 -0.003 0.002 0.002 -0.002 4 -0.093 0.298 -0.067 -0.069 -0.029 -0.011 -0.014 -0.008 -0.004 -0.002 5 -0.126 -0.140 0.287 -0.022 0.001 0.003 0.000 0.005 -0.004 -0.002 6 -0.246 -0.294 -0.066 0.346 0.197 0.054 0.025 -0.008 -0.004 -0.002 Table A.6: Position preference scores of G subtypes. 111 Subtpe ID Position order 1 2 3 4 5 6 7 8 9 10 1 0.057 0.173 -0.115 -0.055 -0.029 -0.016 -0.007 -0.002 -0.002 -0.001 2 0.352 -0.155 -0.092 -0.048 -0.027 -0.011 -0.006 -0.005 -0.003 -0.002 3 0.322 -0.105 -0.099 -0.050 -0.033 -0.018 -0.009 -0.003 -0.003 -0.002 4 -0.194 -0.022 0.054 0.081 0.049 0.023 0.002 0.001 0.001 0.000 5 -0.252 -0.090 0.270 0.073 -0.002 0.003 -0.001 0.001 -0.001 0.000 6 -0.150 0.030 -0.018 0.022 0.044 0.025 0.017 0.007 0.005 0.004 7 -0.245 -0.119 0.279 0.034 0.026 0.013 0.005 0.000 0.000 0.001 8 -0.185 -0.009 0.108 0.041 0.003 0.015 0.013 0.012 0.000 0.003 9 -0.038 -0.024 -0.013 0.029 0.024 0.003 0.009 0.004 0.003 0.001 10 -0.141 -0.042 0.027 0.054 0.049 0.027 0.027 0.003 -0.003 0.002 11 -0.163 0.133 -0.029 0.033 0.031 0.012 -0.006 -0.005 -0.003 -0.002 12 0.238 -0.135 -0.061 -0.028 -0.010 -0.004 0.003 0.000 0.002 -0.002 13 -0.091 0.285 -0.115 -0.055 -0.019 -0.004 -0.002 0.000 0.002 0.003 14 0.056 -0.084 0.003 -0.010 0.026 0.005 0.000 0.001 0.003 -0.002 15 -0.189 -0.025 0.084 0.087 0.032 0.011 0.001 -0.005 0.009 -0.002 16 0.003 0.176 -0.096 -0.043 -0.024 -0.005 -0.005 0.001 -0.003 -0.002 17 0.023 0.099 -0.047 -0.024 -0.012 -0.017 -0.011 -0.005 -0.003 -0.002 18 0.065 -0.021 -0.028 0.004 -0.004 -0.016 -0.011 -0.005 0.004 -0.002 19 -0.044 -0.069 0.041 0.029 0.033 0.008 0.008 0.002 -0.003 -0.002 20 0.025 0.062 -0.037 -0.026 -0.016 0.009 -0.004 -0.005 -0.003 -0.002 21 -0.160 -0.016 0.215 0.002 -0.009 -0.016 -0.004 -0.005 -0.003 -0.002 22 -0.167 -0.029 0.052 0.074 0.011 0.010 0.009 0.002 0.010 0.005 23 -0.045 0.121 -0.018 -0.034 -0.028 0.005 -0.004 -0.005 0.004 0.005 24 0.009 0.221 -0.123 -0.043 -0.026 -0.015 -0.011 -0.005 -0.003 -0.002 25 0.022 -0.034 -0.018 0.013 -0.001 0.009 0.005 0.003 0.005 -0.002 26 0.321 -0.146 -0.093 -0.054 -0.032 -0.013 -0.001 0.005 0.007 0.008 27 0.326 -0.152 -0.081 -0.038 -0.030 -0.023 0.000 0.007 -0.003 -0.002 28 -0.128 0.074 -0.013 0.030 0.026 0.022 -0.011 0.007 -0.003 -0.002 29 -0.169 0.053 0.077 0.053 0.020 -0.010 -0.011 -0.005 -0.003 -0.002 30 -0.030 0.045 0.042 -0.020 -0.016 -0.010 0.002 -0.005 -0.003 -0.002 31 0.087 -0.013 -0.020 -0.033 0.009 -0.006 -0.011 -0.005 -0.003 -0.002 32 -0.065 -0.095 0.065 0.094 0.047 -0.023 -0.011 -0.005 -0.003 -0.002 Table A.7: Position preference scores of T subtypes. 112
Abstract (if available)
Abstract
L1Md retrotransposons are the most abundant and active transposable elements in the mouse genome. The promoters of many L1Md retrotransposons are composed of tandem repeats which are called monomers. The number of monomers varies between retrotransposons and thus makes it difficult to locate the exact boundaries and infer the activity of L1Md promoters. Duplication of monomers contributes to the maintenance of L1Md promoter through generations, but the mechanism is unclear. Since the current classification of monomers is based on limited data, there is an imperative need for a comprehensive annotation of monomers which is specialized for functional study of L1Md repeats. ❧ We developed a pipeline for de novo monomer detection in the whole genome using sequence heuristics. Identified monomers were further classified into subtypes based on their sequence profiles. We applied this pipeline to genome assemblies of various rodent species and strains. A major monomer subtype was found missing in all but one species, implying potential species-specific enhancement of monomer expansion. The positioning pattern of monomer subtypes within a promoter was discovered and analyzed. We showed that the subtype composition of an L1Md promoter can be used to infer its transcriptional activity in germ cell development. ❧ The monomer detection pipeline presented in this dissertation is independent of the current repeat annotation and is able to detect monomer and annotate subtypes in a de novo manner. Subtypes for all monomer types have been identified using comprehensive data, greatly expanding the spectrum of monomer variants. The analysis of monomer subtype positioning within promoters provides evidence supporting both previously proposed models of L1Md promoter expansion. The transcription silencing of L1Md promoters differentiates for promoter types, which implies potential factor-specific regulatory functions instead of a universal retrotransposon repression mechanism.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Comparative genomics of translational regulation
PDF
Computational methods for translation regulation analysis from Ribo-seq data
PDF
Long term evolution of gene duplicates in arabidopsis polyploids
PDF
Mapping epigenetic and epistatic components of heritability in natural population
PDF
Quantitative modeling of in vivo transcription factor–DNA binding and beyond
PDF
Plant genome wide association studies and improvement of the linear mixed model by applying the weighted relationship matrix
PDF
Application of machine learning methods in genomic data analysis
PDF
Innovative sequencing techniques elucidate gene regulatory evolution in Drosophila
PDF
Mapping 3D genome structures: a data driven modeling method for integrated structural analysis
PDF
Understanding protein–DNA recognition in the context of DNA methylation
PDF
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
PDF
Comparative analysis of DNA methylation in mammals
PDF
Efficient algorithms to map whole genome bisulfite sequencing reads
PDF
danbing-tk: a computational genetics framework for studying variable number tandem repeats
PDF
Comparative transcriptomics: connecting the genome to evolution
PDF
Structural and biochemical determinants of APOBEC1 substrate recognition and enzymatic function
PDF
Probing the genetic basis of gene expression variation through Bayesian analysis of allelic imbalance and transcriptome studies of oil palm interspecies hybrids
PDF
Structural and biochemical analyses on substrate specificity and HIV-1 Vif mediated inhibition of human APOBEC3 cytidine deaminases
PDF
Improving the power of GWAS Z-score imputation by leveraging functional data
Asset Metadata
Creator
Zhou, Meng
(author)
Core Title
Detection, classification and functional annotation of mouse L1 retrotransposon promoters
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
01/31/2019
Defense Date
12/10/2018
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
annotation,L1 retrotransposon,mouse L1 monomer,OAI-PMH Harvest
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Smith, Andrew David (
committee chair
), Chen, Liang (
committee member
), Dean, Matthew (
committee member
), Rohs, Remo (
committee member
), Yu, Min (
committee member
)
Creator Email
mengzhou@usc.edu,mzhou.usc@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-116312
Unique identifier
UC11675642
Identifier
etd-ZhouMeng-7035.pdf (filename),usctheses-c89-116312 (legacy record id)
Legacy Identifier
etd-ZhouMeng-7035.pdf
Dmrecord
116312
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Zhou, Meng
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
annotation
L1 retrotransposon
mouse L1 monomer