Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Probabilistic modeling and data integration to examine RNA-protein interactions
(USC Thesis Other)
Probabilistic modeling and data integration to examine RNA-protein interactions
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
PROBABILISTIC MODELING AND DATA INTEGRATION TO EXAMINE RNA-PROTEIN INTERACTIONS by Emad Bahrami-Samani A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS) May 2015 Copyright 2015 Emad Bahrami-Samani To my parents, for their unconditional love and support, and for teaching me to be passionate and persistant. ii Acknowledgments I would like to thank my advisor, Dr. Andrew Smith, for his patient guidance, encour- agement and support. I am extremely grateful to him for continual counsel and our regular meetings, which were essential for my progress. I would also like to express my extreme grattiude to Dr. Philip Uren from whom I learned and continue to learn so much, Dr. Luiz Penalva for all the guidance and advice he has been giving me during my studies and Dr. Fengzhu Sun for serving on my super- visory committee and for his guidance and valuable feedback. iii Contents Dedication ii Acknowledgments iii List of Figures vi 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Background 6 2.1 Co- and post-transcriptional events in high throughput . . . . . . . . . . 6 2.2 RNA-binding proteins; the key players . . . . . . . . . . . . . . . . . . 8 2.2.1 Regulating transcript abundance . . . . . . . . . . . . . . . . . 9 2.2.2 Alternative splicing . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.3 Alternative poly-adenylation . . . . . . . . . . . . . . . . . . . 12 2.2.4 Stability and decay . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.5 Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3 Profiling RNA-binding protein activities . . . . . . . . . . . . . . . . . 22 2.3.1 Experimental methods . . . . . . . . . . . . . . . . . . . . . . 22 2.3.2 Finding targets and binding sites of RNA-binding proteins . . . 25 2.3.3 Characterizing and understanding RBP specificity . . . . . . . . 29 3 Examining RNA-protein interaction data in high-throughput 33 3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2 CLIP-seq data encodes RBP-specific sequence and structure signals . . 34 3.2.1 Sequence specificity of RBP binding sites . . . . . . . . . . . . 36 3.2.2 Calculating secondary structure of an RNA molecule . . . . . . 37 3.2.3 Structural specificity of RBP binding sites . . . . . . . . . . . . 40 3.2.4 Statistical tests for the influence of secondary structure . . . . . 41 3.3 CLIP-seq diagnostic events follow RBP- and technology- specific patterns 44 iv 3.3.1 HITS-CLIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.3.2 PAR-CLIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.3.3 iCLIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.3.4 Distribution of diagnostic events . . . . . . . . . . . . . . . . . 48 4 Motif discovery in RNA-protein interaction data 50 4.1 Formalization of sequence motif discovery . . . . . . . . . . . . . . . . 50 4.2 Simultaneously modeling secondary structure and binding sites . . . . . 54 4.3 Modeling diagnostic events . . . . . . . . . . . . . . . . . . . . . . . . 56 4.4 Results and comparison with other methods . . . . . . . . . . . . . . . 59 4.4.1 Generation of simulated data . . . . . . . . . . . . . . . . . . . 59 4.4.2 Evaluating motifs recovered from simulated data . . . . . . . . 61 4.4.3 Evaluating results on CLIP-seq data . . . . . . . . . . . . . . . 63 4.4.4 New insights on sequence and structure specificity of RBPs . . 72 5 Conclusion 74 Reference List 76 A Detailed derivation of the EM algorithm 102 A.1 Sequence-only motif discovery - ZOOPS Model . . . . . . . . . . . . . 103 A.1.1 Expectation-maximization – E-step . . . . . . . . . . . . . . . 105 A.1.2 Expectation-maximization – M-step . . . . . . . . . . . . . . . 107 A.2 Sequence and structure motif discovery . . . . . . . . . . . . . . . . . 107 A.2.1 Expectation-maximization – E-step . . . . . . . . . . . . . . . 108 A.2.2 Expectation-maximization – M-step . . . . . . . . . . . . . . . 110 A.3 Sequence and diagnostic events motif discovery . . . . . . . . . . . . . 110 A.3.1 Expectation Maximization – E-step . . . . . . . . . . . . . . . 113 A.3.2 Expectation Maximization – M-step . . . . . . . . . . . . . . . 114 B Determining the secondary structure of an RNA molecule 117 B.1 Computational approaches . . . . . . . . . . . . . . . . . . . . . . . . 117 B.1.1 Minimum free energy structure . . . . . . . . . . . . . . . . . 117 B.1.2 The ensemble of RNA structures . . . . . . . . . . . . . . . . . 118 B.2 Emerging experimental approaches . . . . . . . . . . . . . . . . . . . . 120 v List of Figures 2.1 Summary of experiments and methods . . . . . . . . . . . . . . . . . . 8 2.2 Overview of ribosomal profiling (RP) experiments . . . . . . . . . . . . 19 2.3 Read profiles of untreated and harringtonine-treated RP data . . . . . . 20 2.4 Overview of CLIP experiments . . . . . . . . . . . . . . . . . . . . . . 24 3.1 Sequence specificity of RBPs . . . . . . . . . . . . . . . . . . . . . . . 38 3.2 Structural specificity of RBPs . . . . . . . . . . . . . . . . . . . . . . . 42 3.3 Sequence and structure in the context of binding versus non-binding . . 44 3.4 Examples of structural specificity of hexamers in the context of binding versus non-binding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.5 Distribution of diagnostic events . . . . . . . . . . . . . . . . . . . . . 47 4.1 Results on simulated data . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.2 Results on CLIP-seq data - 1 . . . . . . . . . . . . . . . . . . . . . . . 65 4.3 Results on CLIP-seq data - 2 . . . . . . . . . . . . . . . . . . . . . . . 67 4.4 Results on CLIP-seq data - 3 . . . . . . . . . . . . . . . . . . . . . . . 68 4.5 Results on CLIP-seq data - 4 . . . . . . . . . . . . . . . . . . . . . . . 69 4.6 Results on CLIP-seq data - 5 . . . . . . . . . . . . . . . . . . . . . . . 70 4.7 Results on CLIP-seq data - 6 . . . . . . . . . . . . . . . . . . . . . . . 71 4.8 Sequence- and structure-specificity in RBP binding sites . . . . . . . . 73 vi Chapter 1 Introduction 1.1 Motivation Phenotypic and physiological differences among cell types and tissues are largely the result of differential gene expression patterns. Consequently, gene expression reg- ulation is a fundamental process that enables the dynamics of cellular systems and their responses to external and internal stimuli [26]. Transcriptional regulation of gene expression, mechanisms that can turn on and off transcription, have been studied inten- sively, but for years we underestimated the diversity of the RNA world [47]. Co- and post-transcriptional regulation of gene expression encompasses a multifaceted and inter- connected group of events including RNA processing, translation and decay. It is now clear that these processes are major modulators of phenotype. They play important roles in practically all biological processes, and have been implicated in many human diseases [153, 154, 164]. The list of regulators, which often participate in multiple processes, is long; more than 1,000 putative human RNA-binding proteins (RBPs) and thousands of non-coding RNAs have been identified [66, 59]. In terms of technology, one of the prime movers of recent and continuing genome- wide insights into post-transcriptional regulation is CLIP-seq (cross-linking with immunoprecipitation followed by high-throughput sequencing), which allows a high- resolution investigation of the binding sites for a given RBP. Three variants of CLIP- seq have been developed: high-throughput sequencing CLIP, termed HITS-CLIP [143]; 1 Photoactivatable-Ribonucleoside-Enhanced CLIP, termed PAR-CLIP [85]; and individ- ual nucleotide resolution CLIP-seq, termed iCLIP [119]. All the CLIP-seq variants allows target mRNAs to be identified and give us a high enough resolution to localize binding sites to within a small window, about 50 nucleotides [259, 218]. Beyond this, statistical models are needed to exactly localize the binding site and to characterize the binding specificity of RBPs. Even with iCLIP, which inherently gives single-nucleotide resolution for the cross-link location, back- ground noise and sequencing artifacts mean that not every locus identified by the assay is a bona fide binding site, nor will the localization and characterization always be per- fect, as cross-linking biases are well-known [218, 64]. The main question is then how to localize and characterize binding sites of an RBP using CLIP-seq data? In efforts to answer this question much of the analysis method- ology has been borrowed from the field of transcription factors, for example from the analysis of sequences identified in ChIP-seq experiments. Binding sites for RBPs though tend to be shorter [259], coupled with the abundance of highly similar non-binding sites and the proclivity towards low sequence specificity in some RBPs present difficulties for such approaches. Another source of complexity is a much larger space of parameters involved in RNA-protein interactions, such as RNA secondary structures [259]. Despite mount- ing evidence that RNA secondary structure plays a role in RBP binding site selection, the contribution of information from sequence and structure for recognizing the binding site by an RBP remains unexplored and a few motif discovery tools consider using this information [140, 110]. 2 Overall, a general model for incorporating sequence and structure, and exploiting all sources of information in a unified model seems necessary to overcome the com- plexity of simultaneous binding site characterization and localization in RNA-protein interaction data. 1.2 Contributions The research represented in this thesis is divided into two parts. In the first part we investigated different properties of RNA-protein interaction data in high-throughput for possible useful sources of information that can help us overcome the problems that motivated us to carry out this work. In this regard: We demonstrate that RNA structure has critical roles in binding process of RNA binding proteins. In this regard, we devised statistical tests to investigate the effects of sequence and/or structure on the binding affinities. These tests show a strong correlation between particular kinds of secondary structure and the bind- ing sites of RNA binding proteins. Furthermore, we show evidence that the sec- ondary structure of particular motifs can be completely different in target versus non-target molecules. This statistical test, reveals that the assumption of inde- pendence between sequence and structure for RBP binding site localization and characterization is over-simplifying the problem. The CLIP-seq experimental procedure causes sequencing reads to exhibit charac- teristic substitutions and deletions relative to the reference genome at the cross- link location [77, 116, 218, 85]. We call these substitutions and deletions diag- nostic events (DEs). We investigated the distribution of diagnostic events around binding sites and observed that diagnostic events are often enriched at a partic- ular distance from the consensus match, and follow a characteristic distribution. 3 We proposed a model that uses the distribution of diagnostic events for a more accurate motif discovery. In the second part and after proving the utility of extra information for motif discovery, we proposed a new algorithm for simultaneous motif characterization and binding site localization from CLIP-seq data that uses a model specifically designed for CLIP-seq derived RNA binding sites. Our method is the first motif-discovery algorithm that is able to exploit all of the additional information inherent in CLIP-seq data of any variety. Using our method, we demonstrate that this extra information has utility in suc- cessfully recovering RBP motifs in CLIP-seq data by modeling sequence, sec- ondary structure and technology-specific cross-linking events. Using our method, we show that motifs enriched in CLIP-seq datasets represent a trade-off between sequence and structure specificity, suggesting that RBPs with highly specific sequence motifs require less structural constraints on their binding sites to achieve their specificity. 1.3 Outline In chapter 2, we give a detailed background on the biology of co- and post- transcrip- tional regulation of gene expression. We explain the different types of regulatory mech- anisms and give examples of the involvement of RBPs in these processes. We place specific emphasis on the analysis of the resulting data, in particular the computational tools and resources available, as well as looking towards future challenges that remain to be addressed. Then we establish the importance of localizing and characterizing the RBP binding sites, and explain in details all the existing approaches to these problems. 4 In chapter 3, first we introduce our RBP data base, which is a collection of the pub- licly available CLIP data sets. Then we give some preliminary analysis on read data base to show different properties of the data, on which our method is based. We will explain the statistical tests we devised to show the importance of sequence and structure, and establish our null and alternate hypotheses. Furthermore, we investigate the proper- ties of the common sequencing approaches for genome-wide RBP analysis (HITS-CLIP, PAR-CLIP and iCLIP). We give detailed information on differences between them and the properties such as mutations and deletions. We show the need to do post-processing on the results of these experiments [257, 44, 119]. In chapter 4 first we define the general problem of motif discovery and describe the formulation of the problem. Then we give detailed explanation of our model. Finally we illustrate the results of applying our method on simulated and public data. We compare our method to the other available methods and conclude by details of new biological insights into the specificity of RBP binding. 5 Chapter 2 Background Co- and post-transcriptional regulation of gene expression is complex and multi-faceted, spanning the complete RNA lifecycle from genesis to decay. High-throughput profil- ing of the constituent events and processes is achieved through a range of technologies that continue to expand and evolve. Fully leveraging the resulting data is non-trivial, and requires the use of computational methods and tools carefully crafted for specific data sources and often intended to probe particular biological processes. Drawing upon databases of information pre-compiled by other researchers can further elevate analy- ses. Within this chapter, we describe the major co- and post-transcriptional events in the RNA lifecycle that are amenable to high-throughput profiling. Dissecting co- and post-transcriptional regulatory events at the genomic level poses numerous challenges in terms of methods and computational analyses. 2.1 Co- and post-transcriptional events in high through- put Controlled and coordinated binding of one or more RNA binding proteins or miRNAs is the key mechanism that drives co- and post-transcriptional regulation of gene expres- sion. These processes are often complex, inter-related, and dynamic in terms of their timing. Efforts to understand them at global scale therefore require multiple lines of 6 investigation, and necessitate a range of computational methods to interpret the resul- tant data. However, computational methods to support the state-of-the-art technologies have yet to reach the level of maturity seen, for example, in the transcriptomic field. In contrast to transcriptional regulation of gene expression, where some consensus has been reached in terms of methods and analysis pipelines [70, 246, 176, 70], RNA biologists continue to use a range of different experimental and analysis approaches. For example, although still used, RIP-chip and RIP-seq have been mostly replaced by a plethora of different cross-linking methods such as cross-linking and analysis of cDNAs (CRAC) [77] and CLIP (Cross-linking and Immuno-Precipitation) approaches, i.e. HITS-CLIP, PAR-CLIP and iCLIP [228, 143, 119, 85]. All methods have their pros and cons and, due to their technical differences and biases, deliver slightly differ- ent datasets [230]. When comparing datasets, it is hard to say why one method but not the others captured a particular binding site. We clearly need to conduct more exten- sive comparative analyses coupled with functional assays to better understand what each method is producing. An understanding of the idiosyncrasies of each technology used in the lab and how they relate to analysis methods is essential. They will give us the means to improve computational tools and include filters that at the end will deliver the highest number of functional RBP sites with a minimum number of false positives. At a higher level, the need for effective integration of disparate data sources in the study of co- and post-transcriptional regulation is particularly pronounced. Assigning function to RBP binding can be a complex task due to the polyvalent nature of these proteins. For example, binding of a given RBP to 3’UTRs (untranslated regions) could affect mRNA decay, translation or interfere with poly(A) site selection; multiple angles of analysis are necessary, but data integration is nontrivial. There is need to central- ize all co- and post-transcriptional datasets and develop tools to allow cross-platform comparisons. 7 Figure 2.1: Summary of post-transcriptional regulation processes and correspond- ing computational methods. Protein Regulators Assays RNA-seq Ribosome profiling RNA-related biological processes RIP- and CLIP-seq RBPs Process-focused Computational tools, methods and databases Regulator-focused Computational tools, methods and databases pre-mRNA Translation RBPs Ribosomes DNA Ribosome release rate Translation rate Translation efficiency Poly-A Machine Learning methods i.e. SVM PACdb PolyA_DB Splicing MATS DiffSplice SplicingCompass Cuffdiff 2 DEXSeq DSGseq Split-read mappers STAR GSNAP MapSplice TopHat CLIP-based RBP-specific verification i.e. MISO PALMapper mRNA abundance, stability and decay RBPs miRNA RNA Cuffdiff edgeR DESeq mRNAstab Dynamic Transcriptome Analysis (DTA) Methods for estimating mRNA decay based on genomic run-on experiements, which quantify abundance over time RBPs Piranha RIP-seeker AS-peak dCLIP MEME MDScan AlignACE DME MEMERIS RNAContext CMfinder Dynalign RBPDB CLIPZ RBPmap Peak-calleres Motif finders DBs miRNAs miRanda TargetScan PicTar doRiNA miRBase TarBase ExprTarget ExprTargetDB StarMir miRNA target Databases identifiers Figure 2.1 summarizes the relation between the major experimental high-throughput assays with both the stages and regulators of the RNA lifecycle they inform on. In the next sections, we cover different high-throughput approaches used in RNA biology, tailoring the discussion to the computational methods available and challenges in terms of development and data integration. 2.2 RNA-binding proteins; the key players Binding of a given RBP to a target transcript can produce a variety of outcomes, both promoting and repressing events; for instance increasing or decreasing translation or mRNA decay, promoting or repressing exon skipping or the usage of a distal poly A site. A variety of genomics methods are necessary to link binding to function. For instance proteomics studies have been combined with RIP-chip and CLIP experiments to identify functional RBP binding sites, e.g. to characterize the translation regulators such as HuR 8 [131], Msi1 [235], IGF2BP1-3, QKI and PUM2 [196], or the splicing regulator RBM20 [155]. Other methods combine the analysis of miRNAs with proteome, transcriptome or translatome profiling, e.g. for miR-124 [88], mir-223 and others [203, 13, 219]. The analysis of this data (and integration with data from binding assays) brings a new set of computational challenges that we discuss in the remaining sections. 2.2.1 Regulating transcript abundance Quantifying gene expression is a well-studied problem in computational genomics. Expression profiling is now largely performed by RNA-seq. Read counts are the main source of information to calculate a genes expression profile, though they must be cor- rectly normalized to obtain meaningful information. There are primarily two concerns during normalization, which arise from transcript length and sequencing depth. The former is the result of RNA fragmentation during library construction in which longer transcripts naturally generate more reads than shorter transcripts even if they have sim- ilar abundance. Sequencing depth refers to the variability in the total number of reads sequenced and mapped in each run, which causes variations across samples [70]. To account for these issues, the reads per kilobase of transcript per million mapped reads (RPKM) metric was introduced by Mortazavi et al. [167] to normalize a transcript’s read count by both its length and the total number of mapped reads in the sample [70]. With paired-end data, to avoid counting reads that fall into mapped fragments twice, a similar measure called reads per kilobase of transcript per million mapped fragments (FPKM) was developed [226]. However, Wagner et al. [237] showed evidence that RPKM is not suitable for comparison between samples and proposed a new measure called transcript per million (TPM) for this purpose [237]. For a comprehensive review on normalization methods for transcript abundance, refer to Dillies et al. [48]. 9 Often the goal of analyses is to compare expression between conditions and identify transcripts whose concentration changed. Methods such as Cuffdiff [226], edgeR [191] and DESeq [8] are frequently used. Cuffdiff [226] is based on beta negative binomial model and estimates the variance of RNA-seq data by t-like statistics from FPKM val- ues. edgeR [191] is based on an over-dispersed Poisson model in order to explain the variation in the read count data. The evaluation of differences across transcripts, are estimated using Empirical Bayes method. DESeq [8] uses a negative binomial for esti- mation of variability in read count data. Differential expression analysis for RNA-seq is a widely explored area; for a comprehensive survey refer to [172, 70]. RNA-binding proteins have the capacity to directly regulate mRNA levels. However, many studies observe substantial changes in transcript abundance upon knockdown or knockout of RBPs, but find a surprisingly small overlap with the set of RBP targets identified by binding assays [171]. This discrepancy is most likely due to a large number of indirect effects. As a result, the question of whether data from binding assays can be effectively married with mRNA expression data remains open. 2.2.2 Alternative splicing The “one gene, one enzyme” hypothesis postulated by Beadle and Tatum [22] is no longer valid; we know that the number of human genes is much smaller than the number of expressed proteins [232]. This discrepancy can be explained by several levels of gene regulation, co- and post-transcriptional modifications, especially alternative splic- ing [28]. More than 90% of human genes are alternatively spliced, with a role in many physiological functions [239]. Alternative splicing, coupled to nonsense-mediated decay (NMD), can also directly regulate gene expression by producing unstable tran- scripts that contain premature stop codons [202, 107]. Splicing-related changes in gene expression can be triggered in response to stress and other environmental signals [4], 10 and are increasingly recognized as a participant in many diseases [255, 221, 207, 142]. Cancer-related studies have revealed specific changes in alternative splicing patterns that can be used for diagnosis [71] and therapy [86]. Many mathematical models, algorithms and statistical methods have been developed and employed to explore alternative splicing. The goal of these methods is generally to identify and quantify the abundance of individual transcripts [224, 109], or more commonly, to profile changes in splicing either at the full transcript level or at the level of individual splice sites and exons [9, 243, 206, 98, 11]. The latter task is called differential splicing analysis. An example of such an analysis would be to calculate exon inclusion from exon-junction arrays, microarrays or RNA-seq data, and then compare the values between samples or conditions to infer occurrences of different alternative splicing events. Although some approaches to either problem may employ a reference dataset of exons or splice junctions and only considers splicing events with known splice junctions, a frequent goal is to identify novel splicing events with previously unknown donor and acceptor sites. Addressing this challenge relies heavily on split-read mappers, which are able to map reads containing previously unknown splice junctions a task that regular short-read mappers generally fail with, as the read is not derived from a single contiguous region of genomic sequence, nor one that can easily be constructed in silico [128, 225, 49, 99, 90, 245, 240, 7, 91, 249, 12]. Several excellent reviews of computational methods for splicing and alternative splicing analysis already exist; for a detailed review of methods and databases refer to Hooper et al. [93], and the EURASNET website, respectively. Despite much work in this field, it remains challenging to link the observed changes in splicing regulation with their regulators, such as RNA-binding proteins. In the case of RBPs, one approach is to profile cells with a regulator of interest either silenced or deleted, and compare against the wild-type. RIP and CLIP have been used to match observed changes in splicing 11 to the putative binding sites identified, as has been done for example for Nova [143], hnRNP proteins (namely, hnRNP C [119], H1 [109], L [101], A1, A2, A2B1, F, M, U [100]), TDP43 [222], Fox [254, 260], PTB [148, 251], Mbnl1 [53], TIA1, and TIAL1 [245]. However, the analysis is generally ad-hoc; no effective computational tools yet exist for linking functional assays such as RNA-seq with binding assays such as RIP- or CLIP-seq. One main reason for this problem is that observing binding activities of an RBPs according to RIP or CLIP experiments is not an evidence of direct binding. 2.2.3 Alternative poly-adenylation Poly-adenylation is the addition of a stretch of adenosine nucleotides to the end of RNA molecules. This polyA tail aids nuclear export and translation, and protects the transcript from degradation. The point at which the RNA is cleaved and the tail is added can vary a mechanism known as alternative polyadenylation (APA). APA can result in mRNAs with differences in coding sequence and 3UTR, contributing to altered regulation, function, stability, localization, and translational efficiency [56]. Although alternative polyA sites, that are situated between coding exons, can lead to isoforms encoding different proteins [94], more often APA events result in shorter 3UTRs which lack sequences that are targets of microRNAs and RNA-binding proteins [210]. The earliest examples of APA were described in the mRNAs of IgM and DHFR [6, 204]. Subsequently, EST databases and microarray analyses allowed the identification of several other APA sites [23, 72]. Recent RNAseq methods have enormously improved our understanding of APA [81]. Genomic studies have shown that APA is a widespread phenomenon in metazoan genomes. For example, about 70% of mammalian genes and about 50% of the genes in flies and worms are subjected to APA [84, 210, 46]. This mechanism is known to 12 regulate a range of biological processes, often associated with development, cellular dif- ferentiation and proliferation. Shortened 3 UTRs due to alternative poly-adenylation are associated with increased pluripotency and cell proliferation [168, 195], and relaxation of microRNA repression of oncogenes [160]. Computational methods for the prediction of alternative polyadenylation are mainly based on the Direct RNA Sequencing (DRS) technology [178], in which RNA molecules are sequenced without prior conversion to cDNA or the need for biasing ligation or amplification steps [177]. This method was employed to develop a map of over 1 million polyA sites in major cancers and tumor cell lines [134, 144]. An alternative method, PolyA-seq, allows for the high-throughput sequencing of the 3 ends of polyadenylated transcripts, and has been used to obtain a global map of polyadeny- lation sites in human, rhesus, dog, mouse, and rat [46]. Purely computational methods for predicting the locations of polyA signals also exist, such as the classification-based method polyA-predict, which was used to construct a database of predicted sites [39]. Other databases of polyA sites include PACdb [29] and PolyA DB [132]. 2.2.4 Stability and decay Regulation of mRNA stability and decay Another major contributor to expression regulation is mRNA degradation which has also been linked to several diseases [54]. Two major regulatory routes control mRNA decay: quality control mechanisms eliminate the production of aberrant protein products while another group of mechanisms influence mRNA life time with the main purpose of controlling protein abundance. A prevalent example of degradation for quality control is Nonsense Mediated Decay (NMD), which eliminates mRNAs that prematurely terminate translation [202]. It can 13 be regulated in multiple ways, such as relative concentration and phosphorylation of NMD factors and miRNAs a detailed review is provided by Kervestin et al. [113]. Another important mechanism is the ARE-mediated mRNA decay. It is predicted that 9% of the human transcriptome contains ARE elements in the 3UTR; these are characteristic short AU rich or U-rich sequences [16]. ARE-containing mRNAs have been implicated in important physiological functions as well as diseases and tumorigen- esis [27]. Several RBPs like TTP, BRF1, KSRP and AUF1 interact with ARE-sequences and help recruit degradative enzymes. Another group of RBPs, which include the highly studied HuR, binds ARE elements and increase their stability [231]. These ARE binding proteins have their activities modulated by cell signaling, phosphorylation and cellular localization [19, 80]. For a comprehensive review on mRNA decay see [198]. Transcriptome-wide profiling and computational tools Transcriptome-wide analysis of mRNA decay generally relies on time-series data in which mRNA levels are measured at different time points [244]. For example, data from genomic run-on experiments is used by the computational tool mRNAStab to determine mRNA stability by calculating mRNA half-lives [5]. D¨ olken et al. [50] developed a pioneering approach to separate total cellular RNA into newly transcribed and preexist- ing RNA upon metabolic labeling. Other methods are based on Dynamic Transcriptome Analysis (DTA) [163] to calculate mRNA half-lives [200]. From a functional per- spective, the influence of RNA sequence and structural elements on mRNA stability and other post-transcriptional regulatory mechanisms has been the subject of recent studies [47]. For instance, TEISER [76] is a computational framework to calculate the corre- lation between the presence or absence of sequence and structural motifs with exper- imentally determined mRNA stability. MIST-Seq (Measurement of Isoform-Specific Turnover using Sequencing) is another recently introduced method designed to estimate 14 the decay rate of a population of RNAs accurately [83]. Its application revealed that even minor differences in sequence composition could lead to large changes in decay rates between isoforms, highlighting the functional effect of particular 3 UTR elements on mRNA stability. Similar studies have been carried out in yeast, comparing mRNA isoform half-lives across different isoforms of particular genes and inferring biological functions for particular sequence elements [73]. Micro-RNA biogenesis and function in mRNA decay Over the last decade though, probably the most heavily studied mechanism for regulat- ing mRNA levels has been through micro-RNAs (miRNAs). Micro-RNAs regulate gene expression by base-pairing with complementary sequences in mRNAs [20]. To accom- plish this, miRNAs rely on an Argonaute protein to form a complex, called the RNA- induced silencing complex (RISC) that facilitates the binding of miRNAs to mRNAs, and their gene silencing function. However, the actual mediators of gene silencing are members of the GW182 protein family, which regulate all downstream steps in gene silencing [261, 183, 38, 78, 87, 181]. Watson-Crick base-pairing between the miRNA and target mRNA determines the specificity of the complex, while the Argonaute protein exerts the gene regulatory function [213]. A given miRNA can have hundreds of targets and a given gene can be regulated by multiple miRNAs. A more comprehensive review on the mechanisms of miRNA gene regulation is presented elsewhere [103]. The end result of miRNA-mediated gene regulation is reduced protein output from the cognate mRNA [13]. The most successful methods to date for computational identification of miRNA binding sites have been miRanda [57], TargetScan [136], and PicTar [122]. miRanda uses a dynamic programming algorithm to search for complementarity matches between miRNAs and 3’ UTRs. For each match, it estimates the stability of interaction using 15 thermodynamic calculation of the complex free energy and calculates a conservation score with closely related species [57]. Validations have shown this approach to be highly successful. TargetScan [136] takes a similar approach based on the ther- modynamics of RNA-RNA interactions and comparative sequence analysis to predict miRNA targets conserved between species. The algorithm in PicTar is based on Ahab [186, 199], which is a probabilistic algorithm for the identification of combinations of transcription factor binding sites [136] and identifies common targets of microRNAs in eight vertebrate genomes. Several research groups have developed databases of miRNA target sites. ExprTar- getDB [69] is a database obtained using an integrative approach combining the results form TargetScan, miRanda, and PicTar. Other databases include miRBase [121], the repository for miRNA gene set annotations and TarBase [234], which is a collection of miRNA gene interactions coupled with experimental observations for any listed inter- action. STarMir [190] is a web-server that predicts miRNA binding sites and computes several other features of the targets such as consensus sequence, thermodynamic and target structure to calculate a measure of confidence for each predicted site. Micro-RNAs act in concert with RBPs. Some databases leverage this for greater accuracy. For instance, Starbase [252], which uses CLIP experiments to compile a set of computationally predicted miRNA target sites for several species. They also filter false positive miRNA target sites, which can be used for the detection of false negative binding sites absent from current prediction sets. Another database employing CLIP- seq data is doRiNA [9], which uses PicTar [122], and offers the advantage of easy visualization via the UCSC genome browser. Target prediction algorithms for miRNAs that rely on a trusted set of miRNA target sites can greatly benefit from such a feature [47]. 16 2.2.5 Translation Translation and its role in biological processes Translation regulation plays an important role in many biological processes [21, 212]. It accounts for up to 30% of variation in protein expression in both yeast [151] and mammalian cells [236]. Certain cell types are even more reliant on post-transcriptional regulation than others. Examples include blood platelets, which lack nuclei, so their cel- lular responses must be modulated post-transcriptionally, and the final stages of sperm development, where transcription is silenced [24, 30]. Translation regulation is also essential in development. During early embryogenesis it controls embryonic axis, body patterning and cell fate, as transcription is largely quiescent at this stage [125]. Since translation reacts faster than transcription, it often forms the basis for rapid responses to environmental changes [212]. Due to its important role in cellular biology, translation is also recognized as a nexus susceptible to disruption in diseases. For example, abnormal translation is now a recog- nized characteristic of tumor cells and a potential target for therapy [25]. Elevated levels of the translation initiation factor elF4E have been found in many cancer cell lines and tumors, and over-expression in rodent cells results in malignancies [130]. Close to 60% of the mRNAs classified as proto-oncogenes have atypical 5’UTRs with complex struc- ture and high GC content, hindering ribosome binding [120]. There are implications for understanding cancer treatment as well. Radiotherapy is the preferred approach for many tumor types. Genome-wide analyses of irradiated cells revealed that the number of genes with translation affected by radiation is close to 10-fold greater than those with altered transcription [151]. 17 Methods and challenges Genome-scale knowledge of translation regulation has lagged behind that of transcrip- tion, despite its central role. Integrative analysis of RNA-seq and shotgun proteomics and comparison of protein to mRNA concentrations is one approach to estimate transla- tion efficiency [201]. However, this approach is limited, for example by the number of genes covered by proteomics analysis and ignorance of protein degradation. More direct approaches use ribosome binding to mRNAs as a proxy of translation efficiency. For decades polysome profiling has been used to study translation regulation. This method is based on separation of mRNAs that are heavily loaded with ribosome from free mRNAs using ultracentrifugation on sucrose gradients. Coupling polysomal profiling and microarrays or RNA-seq enable translation studies to enter the world of genomics [108, 263]. In recent years, the field has experienced a dramatic boost with the advent of ribosome profiling [104]. Ribosome profiling Ribosome profiling (RP) is a relatively new method that promises to provide researchers with quantitative information about the relative number and locations of ribosomes bound to RNA [104]. In the RP method, ribosome-protected mRNA fragments are sequenced deeply. Figure 2.2 demonstrates the detailed steps in this protocol. RP can be used for examining translational control in a range of settings, from basic mecha- nistic investigations to studies of disease and drug treatments [96, 126]. It provides an excellent tool to investigate, discover and catalog translational products present in a cell type at single-nucleotide resolution. Despite its challenging protocol, the RP technology is now more and more used, and computational analysis tools are under development. Currently, the number and position of reads is used to estimate ribosome binding. A fun- damental contribution of RP has been the identification of open reading frames (ORFs). 18 Figure 2.2: Overview of ribosomal profiling (RP) experiments. (B) Detailed steps in the ArtSeq protocol for ribosomal profiling. The protocol starts with cell fragmen- tation; the resulting cell extract is submitted to nuclease digestion, which will generate ribosome-protected RNA fragments. Ribosome-RNA complexes are purified using gel filtration columns (SV400 samples) or sucrose cushion (sucrose samples), followed by RNA extraction and elimination of ribosomal RNAs (rRNA). rRNA-depleted samples are submitted to electrophoresis, and ribosome-protected fragments (about 35 nt long) are eluted from gel. These RNAs are used as templates for library preparation and sequencing. Add cycloheximide Lyse cells Nuclease Digestion (A) Purify RPFs Library preparation Sequencing Cell Lysis DNase/RNase Size Exclusion Column purification Ribo-Zero RNA Isolation / PAGE purify RPFs 3’ End Repair 3’ Adaptor ligation / Adaptor removal Reverse Transcription cDNA PAGE Circularization PCR (PAGE) (B) PAGE purified RPFs HO PO 4 HO OH App X + X 3’ End-repair 3’ adaptor Ligation & excess adaptor digestion Reverse Transcription PAGE Purify Circligase dU PCR dU 3’ Adaptor 5’ Adaptor p p dU X RPFs = Ribosome Protected Fragments An ORF is a segment of an mRNA, bounded by a translation initiation site (TIS) and translation termination site (TTS), which causes formation of the elongation-competent 80S ribosome complex [106]. Identifying ORFs is one of the classical analysis prob- lems of computational genomics [214]. Hidden Markov Models (HMMs) have been used to identify ORFs for more than 20 years [123], and have done exceptionally well due to their flexibility and the natural sequential dependence within ORFs. The most sophisticated ORF-predicting HMMs were developed in the context of determining the complete gene structure (promoter, exon, intron, etc.) [32]. However, factors such as transcripts with multiple ORFs, internal ribosome entry sites, leaky translation, ribosome shunting, and near-cognate start codons make the 19 Figure 2.3: Read profiles of untreated and harringtonine-treated RP data. The genic ORF and two uORFs in the Nanog transcript are shown. Start codons are highlighted, and the offset of the 5’ end of reads is indicated. Nanog 50 bp ...GGCTCACTTCCTTCTGACTTC......AACTCTTCTTTCTATGATCTTTCCTTCTAGTTT... ...CCATCACACTGACATGAGTGTGG... treated 341 0 control 108 1 13 bp 13 bp 13 bp purely computational identification of ORFs problematic, as evidenced by the discovery of many novel ORFs by RP studies [65, 105, 133]. Despite the success of RP, no public tools are available to date. Many studies simply assume known ORFs [82]. Those that predict them rely on read patterns in ribosome profiling data from samples treated with elongation inhibitors, which cause ribosome arrest at the TIS (Figure 2.3). Ingolia et al. [105] employed a classification approach to provide genome-wide maps of protein synthesis. Lee et al. [133] defined a measure based on the number of reads at each posi- tion and the total number of reads on the same transcript in their data to identify peaks of ribosome activities and therefore obtain a global map of translation initiation sites in mammalian cells. Fritsch et al. [65] employed a neural network method for genome- wide identification of novel upstream ORFs in human. Stern-Ginossar et al. [215] used a method similar to Ingolia et al. [105] to discover diverse short reading frames in human Cytomegalovirus. Clear read patterns denote both TIS and stop codons in untreated samples too, but have not so far been leveraged to improve our definition of ORFs. Another complication is that elongation-inhibited samples only approximately iden- tify the TIS, since the start of the reads marking the protected fragment is offset from the A-site by about 12 nucleotides generally [40, 189, 133]. The TIS is then determined 20 by searching for a sequence (codon) nearby, which requires an existing model and pre- cludes unbiased TIS characterization. Existing methods for detecting the read-pattern indicative of TIS in RP data have been trained on known exemplars [65, 105], which may not always be available and biases towards sites similar to those already known. Identification of ORFs also opens up the possibility of finding and characterizing regulatory reading frames. Many mRNAs contain ORFs upstream of the genic ORF, called uORFs, which also engage ribosomes [18, 106]. Whether uORFs produce viable proteins with any function remains open, though the fact that they regulate translation of their downstream genic counterparts is now well established through several recent studies [18, 33, 92, 106, 166]. RP analysis provides several measures of translation regulation. It reports the number of mRNAs bound by ribosomes compared to unbound mRNAs (occupancy), it reports the total number of ribosomes per mRNA (density), and the ribosome position at nucleotide resolution. While these data are insufficient to calcu- late actual rates of translation, they serve as a detailed proxy of translation efficiency per gene. Mass-spectrometry based approaches have recently provided methods to measure actual translation rates [95], but in contrast to RP, these methods only cover a fraction of the human genome. To the best of our knowledge no comparison of RP and actual protein expression levels exists to-date. Via the clever use of time-series data and drug treatments that inhibit translation ini- tiation, RP can also provide insights into translation elongation speed using so-called run-off experiments [105]. Following treatment, ribosomes inside active ORFs will move away from the TIS leaving a depleted region, where RP reads are only observed at the noise level. In addition, we can also define the unaffected region, where ribosomes still exist, and the depleting region, where some intermediate fraction of messages have been depleted of ribosomes (i.e. stochastic variation in speed between molecules with 21 the same ORF). Analysis of the position and lengths of these regions after specific treat- ment times provides estimates of elongation speed. Despite the successes of RP, there are a number of outstanding computational chal- lenges. One major challenge is correctly adjusting for ribosome pausing. Protein syn- thesis by ribosomes takes place at non-uniform speeds between ORFs, and also with varying speeds within an ORF; one extreme is pausing [189, 233, 205]. Metrics aimed at measuring translation levels must therefore be adjusted to remove the influence of stalled ribosomes. These might be stalled pre-initiation complexes, ribosomes paused during elongation or awaiting release upon termination. Because these ribosomes are not actively translating, they do not contribute to protein levels. Previous studies either ignore the pausing phenomenon, or assume important pausing happens near TIS and stop sites, discarding all reads falling within a fixed distance to these. This discards information, alters the effective size of the region when normalizing, and cannot be done for short coding sequences. 2.3 Profiling RNA-binding protein activities 2.3.1 Experimental methods RNA binding proteins are, next to non-coding RNAs, the central drivers of co- and post- transcriptional regulation, and can have hundreds to thousands of target mRNAs thanks to flexibility in their binding specificity. En masse identification of in vivo binding has become possible only within the last decade, first with RIP and then with CLIP. They were developed by the Keene and Darnell labs respectively [220, 228]. They both con- sist of immuno-precipitation approaches where RNPs containing the RBP of choice are isolated and associated mRNAs are subsequently purified and identified. Quantification 22 of the resultant RNA, was originally carried out using micro-arrays or Sanger sequenc- ing, but is now more commonly performed using next- and second-generation deep sequencing. When RIP was established, there were some concerns regarding the possi- bility of re-assortment of RNPs during the IP process. This issue was essentially raised by a study from the Steitz lab [162], in which a very simplistic analysis was conducted. To the best of our knowledge, similar claims have not been reported by other scientists using RIP. In fact, RIP was used successfully in cell systems and organisms to gener- ate cell type specific gene expression profiles and no problems of cross-contamination between cell types have been reported [193, 180, 51]. We focus on the analysis of data from these high-throughput assays, termed RIP-seq and CLIP-seq. While CLIP-seq is more frequently used, RIP-seq continues to be used, especially if there are limitations in terms of antibodies, or the amount and type of tissue. Figure 2.4 illustrates different steps and major differences between different types of CLIP experiments and Table 2.1 presents a comparison of different CLIP experiments according to various properties. 23 Figure 2.4: Overview of CLIP experiments. UV 365nm PAR-CLIP U 4s U 4s U 4s Cell lysis - RNA digestion - Immunoprecipitation UV 254nm HITS-CLIP UV 254nm iCLIP 3’ Adapter ligation - SDS Page - Proteinase K digestion Proteinase K leaves polypeptide at the crosslink location 5’ Adapter ligation Reverse transcription and Circularization Primer: two cleavable adapter regions and barcode ( ) 5’ adapter 5’ adapter 3’ adapter 3’ adapter 3’ adapter Reverse transcription U G U A cDNA Transition cDNA or Read-through Deletion cDNA cDNA PCR High throughput sequencing Linearization & PCR High throughput sequencing C G Mutation Deletion Truncation or Read-through 24 HITS-CLIP PAR-CLIP iCLIP Cross link UV (254nm) UV (365nm) UV (254nm) CL diagnostic deletion T! C mutation 5’ truncation of read Resolution single nucleotide * single nucleotide * single nucleotide Strategy digestion + co-IP digestion + co-IP digestion + co-IP Input cells/tissues cells cells/tissues Table 2.1: Differences between CLIP experiments for RBP-protein interactions. CL: cross-link. *special data processing required. Recently, “reversed CLIP” assays have been developed in which mRNAs are extracted; the binding sites and identities of bound proteins are determined by RNA- seq and proteomics, respectively [17, 36]. These studies have revealed the enormous extent of the protein-RNA interaction landscape. In a more recent study Tombe et al. [223] developed a high-throughput sequencing - RNA affinity profiling (HiTS-RAP) assay that employs high-throughput sequencing to measure RNA aptamer affinities in large scale by quantifying the binding of fluorescently labeled protein to millions of RNAs anchored to sequenced cDNA templates. This is an extension of high-throughput sequencing fluorescent ligand interaction profiling (HiTS-FLIP) protocol [174] that was previously developed to image and analyze the binding of fluorescently labeled proteins to DNA clusters for direct quantitative measurement of protein-DNA binding affinity. 2.3.2 Finding targets and binding sites of RNA-binding proteins RIP and CLIP aim to answer two closely related questions: which transcripts are bound by an RBP, and where. The key distinction lies in resolution. Generally, RIP-seq does not involve digestion of bound RNA fragments, and provides transcript-level resolution, enriching reads in bound RNAs but not necessarily with positional information. In con- trast, CLIP-seq allows for much higher resolution. From a technical perspective though, identifying targets at the full transcript level and finding binding sites at the resolution of ten or twenty nucleotides are essentially the same problem: we search for genomic 25 regions which are enriched for reads. This process is referred to as peak-calling, and forms the basis for any downstream analyses. Peak-calling follows read-mapping (align- ment of short sequenced reads to the reference genome), which we will not address as it has been covered fully elsewhere [61]. Peak-calling assumes that some loci will receive reads, but not all of these represent true binding sites. There are a number of possi- ble reasons for this, including transient or non-specific interactions [169, 131, 10, 58], cross-linking biases (modest uridine preference caused by UV cross-linking in HITS- CLIP and iCLIP) [218], re-association after cell lysis [162] (the artifactual RNA-protein complexes formed in cell lysate, depending on lysis conditions, generally only a prob- lem with RIP-seq), and background cross-linking (background caused by random UV cross-linking of RNAs to proteins that are not the RBP of interest) [64]. However, it is expected that such false-positive loci will generally accumulate few reads. There is generally no specific way of defining such binding activities and different groups use different measure. For instance Friedersdorf et al. [64] performed an experimental method to define background cross-linking in PAR-CLIP data. Freeberg et al. [63] calculated the cross-link score (CLS) for each T in the genome, where CLS is the ratio of CLIP reads containing one or two T-to-C conversion events to the number of mRNA- seq reads and associate low CLS values to transient binding [63]. Similar methods can be used to define cross-lining biases and background cross-linking. Peak-calling aims to differentiate these loci from those that represent targeted binding of the RBP, i.e. are true-positives. This differentiation is particularly important in RIP, where the lack of cross-linking and RNAse treatment results in much higher background signal. Although CLIP has a high-degree of accuracy that cannot be achieved by RIP, it exhibits both cross-link biases and background cross-linking. In addition, due to inefficiency of UV cross-linking [117] it is not clear what proportion of binding activities is really cap- tured by cross-linking. Nonetheless, even with these problems CLIP has proven to be 26 useful for identifying mRNA targets of RBPs. However, due to the above-mentioned problems rendering careful separation of signal from noise essential [162, 218, 64]. The simplest peak-calling scheme considers only the number of reads mapped to a locus. The exact read-count threshold to use must be calibrated for each dataset, since sequencing depth varies. A major challenge is selecting an appropriate resolution. Reads are counted into bins tiled along the genome. If bin size is too small, it is difficult to distinguish the underlying distribution of the read counts in peaks from the background. If the bin size is too large, resolution suffers. Most methods defer the decision to the analyst, although there are some attempts to automate selection of resolution, such as RIP-Seeker [141]. Further, one must consider the statistical distribution of the read counts. Uren et al. [230] demonstrated that read-counts are Poisson over-dispersed in CLIP-seq datasets. An appropriate model to capture their distribution is thus the negative binomial. When only a single sample is analyzed, loci with zero-counts are not considered, and in this case it is better to use a zero-truncated negative binomial, which appropriately adjusts for the missing zero counts. These distributions were used as the basis for the Piranha peak-caller [230]. In addition, other methods proposed HMM for modeling and ana- lyzing CLIP-seq data, such as dCLIP [242] and MiCLIP [241]. At the first step, dCLIP normalizes CLIP-seq data across datasets and subsequently employs an HMM to detect common or different RBP-binding regions across conditions [242]. MiCLIP uses two rounds of HMM to first infer enriched vs. non-enriched regions and then to distinguish binding sites of RBPs vs. non-binding sites within those enriched regions [241]. Additional information beyond read-counts can be used to improve peak calling. One example is transcript abundance. The number of reads mapping to a given genomic locus will be proportional to the binding strength of the RBP to that site, but also the abundance of the RNA. Abundant RNAs will take a greater slice of the sequencing pie, 27 leaving less abundant RNAs, even if strongly bound, starving for coverage. Piranha was developed to account for this sequencing inequality, allowing the significance threshold, at which a locus is considered a true interaction, to vary as a function of RNA abundance, measured by RNA-seq [230]. AS-peak [124] is another peak caller, tailored specifically to RIP-seq data, that considers transcript abundance. Other markers of true RBP-RNA interactions are modifications in nucleotide reads as a result of UV cross-linking, coined cross-link induced mutation sites (CIMS). In HITS-CLIP, CIMS are deletions at the cross-linked nucleotide [257], while in PAR-CLIP, reads exhibit T-to-C nucleotide conversions due to incorporation of 4SU photoactivatable-ribonucleoside into transcripts [85]. Not only are these changes useful in distinguishing true from false interactions, but they have also been used to improve localization. Without considering CIMS, only iCLIP achieves single-nucleotide resolu- tion. Zhang & Darnell proposed a systematic method based on CIMS for the analysis of HITS-CLIP, elevating HITS-CLIP to single nucleotide resolution, and allowing exact localization of the cross-link location [257]. They applied their genome-wide analysis to Nova and Ago HITS-CLIP data, identifying CIMS deletions in8% of mRNA tags mapped to Nova targets. Corcoran et al. [44] proposed a method for PAR-CLIP data, based on the characteristic conversion. They allow a read to contain up to two mis- matches restricted to T-to-C conversions during the mapping. At each genomic locus, they calculate the likelihood of T to C conversion and use this to predict interaction sites. To date, most analyses employing CLIP- and RIP-seq have been restricted to iden- tifying targets and binding sites under single conditions. Moving forward, comparative analyses will become more important, and a few studies have already taken steps in this direction [149, 242, 241, 256, 250]. Firstly Tenenbaum et al. [220] used RIP-chip to determine dynamic changes in mRNA targets during neuronal differentiation. More- over, Mukherjee et al. [170] employed Gaussian Mixture Modeling to RIP-seq data 28 with probabilistic LOD scores and background quantification of each mRNA target to quantify dynamic changes in mRNA targets during T cell activation. However, compu- tational tools to facilitate comparative peak-calling are few. To date, only Piranha and dCLIP provide support for identifying differential binding [242, 241, 230]. Most tools for identifying interaction sites are stand-alone programs intended to run on a local machine. There are some online tools that can be used for CLIP data analysis, for example PIPE-CLIP [37] and pyCRAC [247], both of which run on the web-based Galaxy [75] platform. 2.3.3 Characterizing and understanding RBP specificity Nucleic acid binding proteins interact with their substrate (DNA or RNA) and partici- pate in biochemical reactions that lead to specific cellular functions [146]. In the case of RNA, these interactions happen between a subset of residues in the protein (the RNA binding domains, or RBDs) and a subset of nucleotides within the RNA (the binding sites). Certain nucleotide sequences present high affinity for the protein’s RBDs, caus- ing the protein to bind to these locations with high frequency. These patterns are called motifs, and observing these patterns in a genomic location is called a motif occurrence. Motifs can be characterized by both sequence and structural elements and show tremen- dous variation amongst RBPs, even between members of the same RBP family [188]. Until the early 2000s, characterization of binding sites was mostly restricted to indi- vidual studies involving a particular RBP and one target gene/binding motif. Such studies include a variety of assays from mutagenesis and binding shifts to more elab- orate analyses involving 3D structures of RBP bound to RNA [74]. One exception, SELEX experiments, combined a recombinant RBP and large pools of short random RNA sequences. After several rounds of selection, a consensus motif is defined based on the sequence of RNA fragments preferentially bound [118, 227]. RNAcompete 29 is another in vitro method that is much less expensive than SELEX due to a smaller designed pool of RNA oligo-nucleotides [187]. Finding statistically enriched motifs in biological sequences is one of the most well studied problems in computational biology. The inherent variability in the motif sequence for RBPs renders methods based on exact matches of little use [217]. More flexible models have been proposed, the most well established being the position weight matrix, constructed by counting occurrences of each type of nucleotide at each position in the motif [35, 129, 145, 146]. Methods employing this representation can generally be divided into two groups, 1) exhaustive enumeration methods, which are based on enu- merating possible motifs then progressively narrowing the search to the neighborhood of highest scoring motifs and 2) probabilistic models, which construct the motif model and find the occurrences of the motif simultaneously in an iterative manner [2]. Much of the extensive body of work on motif discovery is due to the attention paid to tran- scription factors and the need to understand transcriptional regulation through protein- DNA interactions. MEME [14], MDScan [147], AlignACE [192] and DME [211] are just a handful of the highly successful methods. The interested reader is encouraged to pursue one or more of the extensive reviews written on the details of these methods [45, 97, 71, 185, 216]. In comparison, motif finding in RNA brings its own unique set of challenges that must be considered. Early applications of motif-finding algorithms optimized for transcription factor binding sites to finding regulatory regions in RNA, especially RBP binding sites, encountered a number of challenges, chief amongst which are the shorter length of RBP motifs [3, 259] and the role of RNA secondary structure in binding site recognition [139]. An early approach for modeling RNA structure involves covariance models (CMs) [55, 253]. CMs deliver both a sequence alignment and a consensus structure for a set of RBP-bound RNA sequences. Training a CM constructs a model from a set of sequences, 30 which in turn can be used for aligning new sequences in an integrative approach. Other methods, such as Dynalign, a software for simultaneous sequence and structural align- ment of RNA molecules using dynamic programing [157], evolutionary methods [60], and text indexing approaches [159] have been used for sequence and structural motif discovery for RNAs. However, evolutionary and text indexing methods are very limited in terms of the range of RNA secondary structures that they can discover, while CM and Dynalign are computationally expensive. MEMERIS [89] was proposed for RNA binding site characterization, and it takes both sequence and structure into account. MEMERIS calculates the probability of RNA regions to be single-stranded, and uses these values as prior knowledge to guide the search for the motif. RNAcontext [110] is another approach for RNA binding site char- acterization and motif discovery that takes both sequence and structure of the RNA into account. The model developed in this program has a much simpler representation than MEMERIS: a position weight matrix for describing the motif sequence and an addi- tional vector to describe the structural context of each nucleotide in the motif. RNA- context performs well, both in vitro and in vivo, in terms of recovering experimentally validated motifs. However, both MEMERIS and RNAcontext suffer from the assump- tion that RNA sequence and structure are independent. In addition, MEMERIS takes only single stranded regions into account, which is a limiting factor for RBPs that bind double-stranded RNA. More recently, a new method called GraphProt was proposed as a machine-learning framework for learning models of RBP binding preferences from different types of high-throughput experimental data. GraphProt in essence is a super- vised learning algorithm that builds a model using positive and negative sets of binding sites and then scans the genome to find instances of binding sites based on sequence and structure profiles [158]. For Identification of miRNA-RISC complex target sites, hand- ful of studies has done CLIP experiment for transcriptome-wide mapping of miRNA 31 targets, which have proven to be quite useful [42, 44, 165, 85]. In addition, the com- putational methods take advantage of predictive features of the binding regions, most notably sequence characteristics of the seed region, phylogenic conservation of binding sites and secondary structure accessibility of the target [262, 79, 111]. Several databases of RNA-protein interaction sites have been developed. RBPDB [43] contains a collection of experimental motifs of RNA-binding sites from human, mouse, fly and worm. This database includes RBP binding sites derived from in vitro methods, motifs in position weight matrix format, and sets of sequences of binding sites obtained from immunoprecipitation experiments in vivo. CLIPZ [114] is a database of binding sites that are constructed from CLIP data for a limited number of proteins. However, users can upload their short read sequences from CLIP, small RNA sequenc- ing, and mRNA sequencing experiments for analysis [114]. RBPmap is a webserver for prediction of RBP binding sites. Users can input their sequences and motif in the form of a consensus sequence or position weight matrix or select from a large database of experimentally validated motifs. The algorithm then searches sequences for the motif, compares matches to the embedded background model, calculates a weighted rank for all the positions, and outputs a summary of all predicted binding sites [179]. 32 Chapter 3 Examining RNA-protein interaction data in high-throughput In this chapter we present results that explore the relationship between RNA sequence, RNA structure and the activity of RBPs at their binding sites. Our goal here is to gain sufficient insight to guide our development of approaches to analyze RNA-protein inter- actions. Since we have no exact information about the locations of binding sites, we will attempt to analyze these relationships indirectly. 3.1 Datasets To investigate the properties of CLIP-seq, the sequence and secondary structure pref- erence of a range of RBPs, and to evaluate our proposed method we collected a set of public data derived from 20 studies, covering iCLIP [218, 119, 231, 245, 222], HITS-CLIP [258, 238, 254, 109, 100, 138, 41, 251, 230, 135, 182], and PAR-CLIP [85, 208, 131, 169]. This collection constitutes data profiling 40 RBPs (36 human and 4 mouse). A complete description is provided in Table 3.1. Sequence data was mapped to hg19 and mm9 using Novoalign (Novocraft, http://www.novocraft.com). Definitions of genes, exons and UTRs were taken from RefSeq [184]. CLIP variants give us a snapshot of what binding affinities look like. Genomic regions with a lot of mapped reads are more likely to be a stronger binding sites and 33 therefore having stronger characteristics of the binding sites. The first step of the analy- sis is to find the locations with unexpectedly large read counts in a set of reads mapped to the reference genome. We used Piranha software [230] to obtain a list of high confi- dence binding sites. This method assumes that sites with low number of mapped reads are more likely to be noise and modeled in the background. It fits a zero-truncated neg- ative binomial distribution to the read counts data. In this way, we can assign ap-value for each location with the mapped reads and fix the p-values for multiple hypothesis testing and obtain a correctedp-value for each site. After peak finding we take the significant locations (p-value < 0:05) and take the top one thousand targets as the list of high confidence targets. Since the number of mapped reads is highly variable we change thep-value threshold for different data sets to be able to have as many high confidence targets that is needed for analysis up to one thousand locations. 3.2 CLIP-seq data encodes RBP-specific sequence and structure signals It is well known that RNA-protein interactions can require specific RNA sequence at target sites, Pumilio proteins for instance [74]. Interestingly, it is also well known that certain RNA-binding proteins have very low or no sequence specificity, for instance a group of heterogeneous nuclear ribonucleoproteins (hnRNP A1, A2, B1, B2, C1 and C2) [154, 1]. However, even for those RBPs for which sequence specificity has already been demonstrated, there is no guarantee that sites identified by RIP-seq or the CLIP variants will behave like the well-characterized functional sites. One possibility is that RBPs bind a large number of sites, some with high-frequency occupancy, and others only transiently. In such cases it is plausible that only a subset of RBPs interacting 34 RBP Technology Cell Citation Ago HITS-CLIP HeLa [41] Agof1..4g PAR-CLIP HEK293 [85] IGF2BPf1..3g PAR-CLIP HEK293 [85] PUM2 PAR-CLIP HEK293 [85] QKI PAR-CLIP HEK293 [85] TNRC6fA..Cg PAR-CLIP HEK293 [85] hnRNPh1 HITS-CLIP HEK293 [109] hnRNPa2b1 HITS-CLIP HEK293 [100] hnRNPa1 HITS-CLIP HEK293 [100] hnRNPff,m,ug HITS-CLIP HEK293 [100] Lin28 HITS-CLIP HEK293, hESC [138] MOV10 PAR-CLIP HEK293 [208] Ago2, HuR HITS-CLIP HEK293 [116] Ago2, HuR PAR-CLIP HEK293 [116] hnRNPC iCLIP HeLa [119] HuR PAR-CLIP HeLa [131] HuR PAR-CLIP HEK293 [169] HuR iCLIP HeLa [231] TIA1, TIAL1 iCLIP HeLa [245] PTB HITS-CLIP HeLa [251] TDP43 iCLIP SH-SY5Y [222] Ago2 HITS-CLIP HEK293 [230] Ago2 HITS-CLIP mESC [135] TDP43 HITS-CLIP Mouse brain [182] Nova HITS-CLIP Mouse brain [258] Nova iCLIP Mouse brain [218] Mbnl1 HITS-CLIP Mouse brain, C2C12, heart, muscle [238] hnRNPh1 iCLIP HeLa Private Musashi iCLIP U251 Private Table 3.1: List of CLIP and RIP data sets in our database with RNAs at any given instance are carrying out a regulatory program. The rest may simply be rapidly associating and dissociating. Moreover, it remains unclear whether what proportion of sites identified in CLIP or RIP are actual binding sites (regardless of whether they have any functional importance). We begin by examining the degree to which target RNAs identified in CLIP variants show biased sequence composition. 35 Without knowing precise binding sites, this broader analysis will provide some clue as to how much specificity can be extracted from the CLIP data. 3.2.1 Sequence specificity of RBP binding sites In order to investigate the sequence specificity of different RBPs, we used the CLIP-seq experiments described in Section 3.1 for sequence analysis. We counted the number of times each 3-mer occurs in the sequences. CLIP experiments are claimed to be so accurate that the sequence motif becomes apparent with just a simple analysis of the set of candidate targets. Based on these counts, a z-score was computed as a measure of enrichment for the hexamer in each dataset. Figure 3.1(A) shows thez-scores of differ- ent trimers in the collection of CLIP experiments. As can be seen from Figure 3.1(A), the z-scores for TTT and AAA are highly variable across data sets. This kind of dis- persion of their distribution suggests the abundance of these particular motifs in a lot of data sets. We know that trimers such as the poly-T and poly-A are highly abundant in vertebrate 3’ UTRs. It is also possible that these sequence biases are, in fact, experi- mental artifacts related to cross-linking bias or some other aspects of the experimental procedures. Thez-scores, presented in Figure 3.1(A), reveal much sequence preference associ- ated with the target sequences. The concentration of higher z-scores for some of the trimers in the tail of the distributions suggests that some of the trimers are enriched in some of the data sets and can be used for target motif analysis. Therefore, in the next step we stratified the enrichment of the motifs by RBPs and used a subset of data sets from the CLIP-seq database described in Section 3.1, to take RBPs that are well- studied and we have a fairly good understanding of what the target motif looks like. In all cases, trimers that are known to be preferred by certain RBPs show enrichment 36 in target sequences identified in experiments for those RBPs. The enriched trimers in Figure 3.1(B) that are shown in red are as follows: hnRNPc and HuR: TTT trimer hnRNPh1: the G rich trimers including A (GGG, GGA, GAG, AGG) Nova: YCAY motifs (TCA, CAT, CCA, CAC) PTB: CT-rich motifs (CTC, TCT, TTC, CTT, TTT) PUM2: substrings of the TGTA(A/T)ATA TDP-43: GT-rich motifs (GTG,TGT) TIA1 and TIAL1: TA-rich motifs (TAT, ATA, ATT, TTA, TTA, AAT, TAA, TTT) Overall these results indicate a role for sequence that can be quite strong, but that also has the potential to mislead analyses based solely on sequence. 3.2.2 Calculating secondary structure of an RNA molecule Several approaches can be used to understand the role of secondary structure and accu- rate calculation of the structural states of bases in the RNA. Here we start by stating general definitions of the structure followed by detailed explanations of how to accu- rately obtain structural information of a certain set of sequences. Definition 3.2.1 (RNA Primary Sequence). Let =fA;C;G;Ug be the alphabet of RNA sequences and be all finite sequences of symbols (bases) in . StringS2 = f;A;C;G;U;AA;AC;AG;AU;:::g is called the primary sequence of the RNA. For the sequenceR =r 1 r n , letjRj =n be the number of symbols appearing inR, which is the length ofR. 37 −2 0 2 4 6 8 10 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 TTT AAA GTG,TGT TCA,CAT TCT TAT,ATA NCG,CGN The rest ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 2 4 6 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Z-score (A) (B) Density z−score −2 0 2 4 6 8 10 −2 0 2 4 6 8 10 −2 0 2 4 6 8 10 −2 0 2 4 6 8 10 −2 0 2 4 6 8 10 −2 0 2 4 6 8 10 −2 0 2 4 6 8 10 Density Density Density hnRNPc hnRNPh1 HuR Nova PTB PUM2 QKI TDP43 TIA1 TIAL1 Data sets z-score Figure 3.1: Sequence specificity of RBPs. (A) Distribution of trimer z-scores for all datasets shows both high variance trmers (e.g. TTT, AAA) and low variance trimers (e.g. TCA, CAT); low dispersion trimers provide more discriminative power in motif search. Trimers with high z-scores in many datasets (e.g. TTT) likely represents abun- dant background sequences and cross-linking biases. (B) Z-scores for all trimers in selected datasets; red points represent trimers that are part of the expected sequence motif for that RBP, and show a trend towards higher z-scores. Definition 3.2.2 (Canonical base pair). The hydrogen bonds between nitrogen bases in an RNA sequence occur in two forms: Watson-Crick form (C-G and A-U) and Wobble base pairs (G-U). The latter form may often happen in RNA due to energy preferences of the molecule. These three types of pairings are referred to as canonical base pairs. Definition 3.2.3 (RNA Secondary Structure). A secondary structure for an RNA sequence, R, is a specification of which bases are paired in the folded sequence. We use (i;j) to represent the base pair formed by the i th base and the j th base, so Sf(i;j) : 1i<jng is an RNA secondary structure if it satisfies the following conditions: 8(i;j)2S, (i;j) is a canonical base pair 8(i;j); (i 0 ;j 0 )2S; i =i 0 () j =j 0 8(i;j)2S;ji> minimum loop size 38 where, minimum loop size is the minimum number of unpaired bases in a between a base pair, due to energy preferences of the RNA molecule and is usually set to 3. Furthermore, a secondary structure S, is a pseudoknot free secondary structure if and only if, 8(i;j); (i 0 ;j 0 )2S; i<i 0 () j <i 0 orj 0 <j : Here we only consider the structures that do not contain pseudoknots. Let R be an RNA sequences of interest, wherejRj = L. Now, let T denote the secondary structure ofs as T =fT ij : 1i<jLg; where T ij = 1 if the i th and j th bases form a pair and T ij = 0 if they do not. This description assumes an exact pairing state (paired or unpaired) can be assigned to each nucleotide in each sequence. One way to do this is to compute the minimum free energy structure for each sequence. However, the minimum free energy structure can be mis- leading. In reality, there is an ensemble of folds that a given transcript may adopt, and each one has a particular probability of occurring. Rather than using point estimates of the pairing state of nucleotides, we elected to instead use base-pair probabilities to represent structure. The changes to the above description to facilitate this are trivial: t i = i1 X j=1 T ji + n X j=i+1 T ij ; where T ij = Pr(basei is paired with basej): We calculate these base-pair probabilities using McCaskill’s algorithm. We used the implementation from the RNA Vienna package [150], modifying its interface to allow 39 its efficient use and embedding in our package. For detailed description of how we calculate RNA secondary structure, please refer to Appendix B. 3.2.3 Structural specificity of RBP binding sites In order to establish the correct hypotheses about the role of structure in RNA binding process, we have to know what are the RBPs’ structural characteristics [194]. This can give us some high level information to what types of RNA secondary structure these RBPs are capable of binding to. The number of different binding domains are shown in Table 2.1. The numbers are obtained from the latest version of InterPro database [102]. The following accession numbers have been used to retrieve the data: RRM (RNA recognition motif): IPR000504 KH (K homology domain): IPR004087 CSD (Cold shock protein): IPR011129 DS-RBD (Double-stranded RNA binding): IPR001159 ZnF (Zinc finger, CCCH-type): IPR000571 DEAD/DEAH box (DNA/RNA helicase, DEAD/DEAH box type, N-terminal): IPR011545 PUF (Pumilio RNA-binding repeat): IPR001313 PAZ (Argonaute/Dicer protein): IPR003100 LSM (Like-Sm domain): IPR010920 In addition to sequence specificity, RBPs are well known to prefer particular sec- ondary structural contexts. For example, the very common RRM motif binds only to 40 Domain Homo Mus Arabidopsis Drosophila Caenorhabditis sapiens musculus thaliana melanogaster elegans RRM 1033 783 601 369 175 KH 176 122 67 95 49 CSD 33 33 4 8 5 DS-RBD 113 92 30 81 18 ZnF (Zinc finger) 176 176 97 101 46 DEAD/DEAH box 411 218 150 138 73 PPR * 8 - 450 - - RGG box * 152 - 56 - - PUF 23 21 25 3 12 PAZ 27 21 20 61 36 LSM 66 49 75 35 19 Table 3.2: The total number of putative RNA-binding proteins containing each specified RNA-binding domain in five different eukaryotes (*from [209]). single-stranded sites in RNAs[154]. These binding domains often have sequence pref- erence, but will ignore sites if the sequence occurs in a double-stranded conformation. We sought to investigate the overall role of secondary structure in RNA-protein inter- actions independent of the specific sequence context. We focused only on the RRM- containing RBPs, as their secondary-structural preference is well understood. Using the partition function method explained in Section B.1, we calculated base pairing probabilities for a window around the top one thousand targets of the RRM RBPs. After calculating the base pairing probability for all the sequences in the high confidence target set for each RBP, we take the average for all the locations in the win- dow and obtain the highest single stranded probability in that window (Figure 3.2). 3.2.4 Statistical tests for the influence of secondary structure By now we have established the sequence and structural specificity of RBPs. However, even though there seems to be a relationship between the binding sites and secondary structure of binding locations supported by the high-throughput data, this does not mean 41 0.50 0.60 0.70 0.80 0 10 20 30 40 50 60 0.45 0.55 0.65 0.75 −100 −50 0 50 100 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Single stranded probability Index Frequency Single stranded probability (A) (B) hnRNPc Figure 3.2: Structural specificity of RBPs. (A) Histogram of maximum single- stranded RNA probability within plus or minus 5nt from CLIP sites in all datasets from RRM-containing RBPs shows enrichment for single-stranded probabilities. (B) Single- stranded RNA probability calculated from all identified CLIP sites in an iCLIP dataset for hnRNPc, shown relative to the site location. that secondary structure has an effect on binding. In other words, if we assume that structure can be determined by the sequence, there is possibility that RBPs recognize the sequence that happen to have a particular secondary structure. In this section, we want to explore the three-way relationship between Sequence, structure and the binding affinities. This relationship can be tested statistically. We devise a statistical test as follows: our null hypothesis is that there is no correla- tion between the binding affinity and the mRNA secondary structure. More specifically, if secondary structure can be determined solely by sequence, the context of binding or non-binding will not change the structure of this sequence. Therefore, we can conclude that there is no correlation between the binding affinities and the secondary structure. On the other hand the alternative hypothesis is that, there is a correlation between the binding affinity and the mRNA secondary structure, meaning that secondary structure of a particular sequence can be different in the context of binding versus non-binding. First we define target and non-target sets for each CLIP experiment on a particular RBP. For each CLIP-seq dataset we define a set of target 3’UTRs by binning CLIP-seq reads in 1nt bins (iCLIP) or 20nt bins (PAR-CLIP, HITS-CLIP), and retained only those bins that could be uniquely assigned to a single transcript. For each 3’UTR we found 42 the bin with the largest number of reads. We then ranked 3’ UTRs by the count of reads in the bin with the most reads, and selected the top 1000 3’ UTRs as our target set for that RBP of interest that the CLIP experiment was carried out on. The non-target set is simply any 3’ UTR regions (as defined by RefSeq) that are not contained in the target set. Secondly, we define paired and unpaired states for hexamer occurrences, meaning an occurrence of a hexamer is called single stranded if and only if the average base pairing probability over k nucleotides is less than 0.5, and double stranded otherwise. Now, for each hexamer, we obtained the set of occurrences in the whole region of all the 3’UTRs. This comprises our reference sets of occurrences for each hexamer. Then for each hexamer, we calculated the number of times it appears as double or single stranded in these reference sets. Let’s call these numbers R + and R respectively. Then we calculate the same numbers, this time not for all the 3’UTRs but only for target set of a particular RBP in a CLIP experiment. For each hexamer and each CLIP experiment, let the counts of double and single stranded occurrences in target 3’UTRs beE + andE . For each hexamer and each RBP, the counts of double and single stranded occurrences of that hexamer in non-target 3’UTRs are defined asC + =R + E + andC =R E . Now for each hexamer we can produce a contingency table, with the number of times a hexamer in 3’UTRs tends to be in paired or unpaired states (C + andC ), and the number of times this hexamer in relation to binding to protein is in paired or unpaired states (E + andE ). The odds-ratio from this contingency table represents the tendency of the hexamer to appear as either single- or double-stranded in target 3’ UTRs, as opposed to non-target 3’ UTRs. Significance is determined via Fisher’s exact test, which gives us a p-value on the odds-ratio for each hexamer. To establish what the p-value distribution is under the null hypothesis, we do the same analysis, but select the target set randomly, rather than using CLIP-seq data. We plotted the distribution of p-values in both cases (CLIP derived targets, and random targets), and noted a strong enrichment 43 p−value −0.5 0.0 0.5 0 2 4 6 8 10 log(odds ratio) −log(p−value) TDP-43 (B) (C) −0.5 0.0 0.5 0 2 4 6 8 10 14 12 log(odds ratio) −log(p−value) HuR GTGTGT TGTGTG TTTTTT 0 50 100 150 200 250 0.00 0.25 0.50 0.75 1.00 Count ● ● ● ● ● ● ● ● ● ● ● 0.005 0.025 0.045 Count 50 150 250 ● ● ● ● ● ● ● ● ● ● ● ● ● ● p-value 0.015 0.035 0.055 100 200 (A) (D) condition Random Targets Figure 3.3: Sequence and structure in the context of binding versus non-binding. (A) Histogram of p-values from Fisher’s exact tests on all hexamers in all datasets, testing for significant difference in ratio of single- and double-stranded occurences of the hexamer in target and non-target 3’ UTRs for CLIP-identified targets (red) and randomly chosen targets (blue). (B) Enlarged portion of panel A showing enrichment for signficant p-values in CLIP targets over randomly selected targets. (C,D) V olcano plots, showing the significance of Fisher’s exact test on TDP-43 and HuR iCLIP data sets, the expected motifs tend to have a single stranded in the context of binding. The positive log odds ratio shows the tendency to have single stranded binding sites when observed in 3’UTRs of the targets. for highly significant p-values was present with CLIP-derived targets, but not random targets. 3.3 CLIP-seq diagnostic events follow RBP- and technology- specific patterns We performed some preliminary analysis to understand the nature of diagnostic events in different experiments. The relative prevalence of each type of event varies by tech- nology. HITS-CLIP and iCLIP have similar mapping rate, however only 8 to 20 percent of the mapped reads in HITS-CLIP contain a deletion at the cross-link location whereas 44 Counts: (0,10] (10,20] (20,30] (30,40] (40,50] (50,Inf] 0 0 0.4 -0.4 0.4 -0.4 -0.8 10 5 0 10 5 0 PUM2 TIA1 −log(p−value) log(odds ratio) log(odds ratio) −log(p−value) Figure 3.4: Examples of structural specificity of hexamers in the context of bind- ing versus non-binding. Heatmap plot for the number of hexamers having differ- ent p-values and log(odds-ratio) resulted from the Fisher’s exact tests for PUM2 and TIA1. The sign of log(odds-ratio) for hexamers with highest log (p-value) corresponds to the expected motif’s tendency to have double stranded and single stranded structure in PUM2 and TIA1 data sets, respectively. in iCLIP around 99 percent of the mapped reads are truncated at the cross-link location, amplified and sequenced and therefore contain diagnostic events. In PAR-CLIP, a higher percentage of reads contain diagnostic events than HITS-CLIP, see Figure 3.5A. By def- inition, all iCLIP reads that contain diagnostic events have only one. While this need not necessarily be the case for HITS-CLIP and PAR-CLIP, we observed that it generally is: see Figure 3.5B. 3.3.1 HITS-CLIP Since the use of probabilistic models is highly dependent on having enough data to pro- cess, we wanted to calculate the number of reads in which at least one diagnostic event is present. Figure Figure 3.5A shows the percentage of deletions in mapped reads in HITS-CLIP data. As shown in this figure the number of reads with at least one diagnos- tic events is highly variable even among the replicates, and depends on quality of the data among other factors. However, it is apparent that occurrence of these diagnostic 45 events can enhance the accuracy and performance of the model. Furthermore, we calcu- lated the number of diagnostic events per read to understand the extend of occurrence of these diagnostic events in different experiments. Figure Figure 3.5B shows the propor- tion of reads having one, two or more deletions. As shown in this figure, the majority of the reads have exactly one deletion. 3.3.2 PAR-CLIP We calculated the number of reads in PAR-CLIP experiments that have diagnostic events in them. Figure Figure 3.5B shows the percentage of the reads with at least one T! C mutation. Then we calculated the number of T! C mutations per read. Figure 4.1(B) shows the proportion of reads having one, two or more mutations. As shown in this figure, the majority of the reads have exactly one mutation. 3.3.3 iCLIP The diagnostic event in iCLIP is the truncation of the synthesized strand during the first cycle of PCR. The cross-link causes processing by the polymerase to terminate precisely at the nucleotide that is cross-linked, so each DNA sequence fragment will have one end precisely at the cross-link point. the stuff below is correct If single-end sequencing is used, then by random chance the read will correspond to either the positive or negative strand. If the read corresponds to the positive strand, then the 5’ mapping location of the read will be precisely at the nucleotide that was cross-linked. The strand can be determined by the strand of the gene in the reference genome. If the opposite strand is sequenced, then the expected distance between the mapping location of the read and the cross-link point is the expected fragment length. 46 H-CLIP P-CLIP iCLIP ● ● ● ● ● ● 1 2 3 Number of diagnostic events (B) (A) prop. of mapped reads with ≥ 1 DE ● ● ● ● Ago Ago2 hnRNPa1 hnRNPa2b1 hnRNPf hnRNPh1 hnRNPm hnRNPu Lin28 Mbnl1 Nova PTB TDP43 prop. of reads mapped prop. of mapped reads with ≥ 1 DE Ago1 Ago2 Ago3 Ago4 HuR IGF2BP1 IGF2BP2 IGF2BP3 Mov10 PUM2 QKI TNRC6A TNRC6B TNRC6C 1 0 1 0 proportion HITS-CLIP PAR-CLIP (C) HuR hnRNP H1 MBNL1 TDP43 0.0 0 10 -10 0 10 -10 0 10 -10 0 10 -10 position relative to consensus occurrence 0.1 0.2 0.3 0.4 density Figure 3.5: Distribution of diagnostic events. (A) Percentage of reads mapped, and mapped reads with diagnostic events for HITS-CLIP, PAR-CLIP and iCLIP (B) Count- ing the number of diagnostic events in the reads that have at least one diagnostic events shows that most of the reads have exactly one diagnostic events. (C) The density of diagnostic events relative to occurrences of an expected consensus sequence in the top 5,000 CLIP-identified target sequences for HuR, hnRNP H1, MBNL1 and TDP43 One complication with iCLIP diagnostic events that is not present in HITS-CLIP or PAR-CLIP is that each read will appear to contain a diagnostic event, since each read has a 5’ end. How does this impact the way we use diagnostic events in iCLIP? In 1% of iCLIP reads, we detected a deletion, indicating that reverse-transcriptase read through the cross-link location. Hence, the read is not truncated at the cross-link location and does not have a diagnostic event. In the remaining 99% of the reads, we did not find any deletions, which can either mean that reverse-transcriptase has read through the cross-link location with no deletion or it has been halted at the cross-link location. Only in the latter case does the read in fact contain the diagnostic event. Sugimoto et al. [218], using previously published Nova HITS-CLIP and mRNA-seq, estimated that 82% of the reads are truncated at the cross-link location. More formally, they estimated 47 f, the proportion of read-through cDNAs in the total Nova iCLIP library, to be 18% according to the following formula f = p(iCLIP)p(BG) p(RT)p(BG) ; (3.1) where p(iCLIP) is the proportion of cDNAs with deletions in the first 25 nucleotides for Nova iCLIP data, p(RT) is the proportion of cDNAs with deletions in the first 25 nucleotides for read-through cDNAs from Nova CLIP data andp(BG) is the proportion of cDNA with deletions in the first 25 nucleotides of mRNA-Seq cDNAs, which was used to estimate the background occurrence of deletions. Thus, they estimated that 82% of cDNAs were lost in CLIP cDNA cloning protocol due to truncations. However, there is no way to distinguish between the truncated reads and the ones that have read-through without any deletion, using a single iCLIP data set. For simplicity, we considered all of the reads without deletions to have a diagnostic event at the truncation site. 3.3.4 Distribution of diagnostic events To investigate the distribution of diagnostic events around binding sites, we searched for occurrences of simple consensus sequences, defined from literature (see Table 3.3), around CLIP sites identified for each RBP in our data. We observed that diagnostic events are often enriched at a particular distance from the consensus match, and follow a characteristic distribution; we noted that the distance from the start of the consensus match to the most likely position of a diagnostic event varied by RBP however (Fig- ure 3.5C). 48 RBP Consensus citation HuR UUUUU [187] PTB CUCUCU [175] QKI2 ACUAA [67] Nova YCAY [31, 52] TIA1 UUUUA [245, 62] TIAL1 UUUUA [245] PUM2 UGUAUAUA [68] TDP43 UGUGU [127] hnRNP C UUUUU [256] hnRNP H GGGA [34] IGF2BP1,2,3 CATH [85, 196] Table 3.3: Previously described consensus sequences 49 Chapter 4 Motif discovery in RNA-protein interaction data 4.1 Formalization of sequence motif discovery In our effort to characterize the binding specificity of RBPs, we build upon the origi- nal model developed for characterizing transcription factor binding sites that has been studied extensively [217, 129, 35, 145, 146, 14, 192, 147, 211]. We start with basic mathematical definitions and describe the motif model for characterizing the sequence specificity. The natural progression from this is to modify this simple model of the iden- tified sites in an effort to characterize the binding specificity of the RBPs. Aside from informing the function of the RBP, this allows putative binding sites to be identified in future in a purely computational fashion. As it happens, these two threads are inher- ently interwoven, and this duality has been exploited in the past to concurrently localize and characterize binding sites in sequence data, most notably in the well known MEME algorithm. 50 Definition 4.1.1 (Motif Discovery). Motif discovery is the following computation prob- lem: LetS =fS 1 ;S 2 ;:::;S n g be a set of unaligned RNA sequences over the alphabet =fA;C;G;Ug, where S 1 =S 1;1 S 1;2 S 1;3 :::S 1;L 1 S 2 =S 2;1 S 2;2 S 2;3 :::S 2;L 2 S 3 =S 3;1 S 3;2 S 3;3 :::S 3;L 3 . . . S N =S N;1 S N;2 S N;3 :::S N;L N : LetI =f(i;l) : 1iN; 1lL i g denote the collection of all indices in the data set. We want to find the setY I, whereY =f(1;Y 1 ); (2;Y 2 );:::; (N;Y N )g, andY i is the starting position of the motif in thei th sequence. Given S, the goal of model-based methods is twofold: (1) finding instances of a recurring pattern, called motif, in the data, (2) estimate the parameters of the model that best describes the motif and the background. We model the sequence of motif occurrences using a position weight matrix (product multinomial distribution),M and background distributionf. More specifically,M = (M j ) w j=1 , where for anyb2 , we have M j (b) = Pr b appears at positionj of the motif ; and f(b) = Pr b appears in background : The one-occurrence-per-sequence (OOPS) model requires that each sequence inS con- tain an occurrence of a width w motif, and for each S i 2 S we associate the set 51 X i =fX i1 ;:::;X i;`w+1 g with X ij indicating the event that the motif occurrence in S i begins at positionj. The defining condition of the OOPS model is that for anyi, `w+1 X j=1 X ij = 1: (4.1) The complete data likelihood for the OOPS model, considering sequence only, is Pr(S;Xj) = n Y i=1 Pr(S i ;X i j) = n Y i=1 Pr(X i j) Pr(S i jX i ; ) = n Y i=1 1 `w+1 Pr(S i jX i ; ); (4.2) with Pr(X i j) taking a uniform distribution over possible positions of the motif. For anyS i 2S, Pr(S i jX i ; ) = `w+1 Y j=1 Pr(S i jX ij = 1; ) X ij (4.3) and Pr(S i jX ij = 1; ) = ` Y l=1 M lj+1 (S il ) I jw (l) f(S il ) (1I jw (l)) ; (4.4) where S il is the l th base in S i . This is the well-known product multinomial distri- bution [129]. Now we generalize the complete data likelihood, for the zero-or-one- occurrence-per-sequence (ZOOPS) model. Under the ZOOPS model, each sequence contains either zero or one occurrences of the RBP binding motif; we treat the presence and location of the motif as latent data. LetX ij = 1 if the motif occurrence for sequence i is at position j, and X ij = 0 otherwise. For notational convenience, we also define O i = 1 if there is a motif occurrence in sequencei andO i = 0 otherwise. Notice that O is completely specified byX, sinceO i = P mw+1 j=1 X ij . The interpretation ofX ij is otherwise unchanged. To account for the ZOOPS assumption, we introduce the model parameter , which is the probability that a sequence contains the motif. Throughout, 52 we will refer to the model as . Hence, our basic model, which we will extend later for structure and diagnostic events, is =fM;f; g. SincejOj is not constrained by the model, we require a prior distribution onjOj and describe it with a binomial distribution having parameter . Pr(S;X;Oj) = Pr(S;XjO; ) Pr(Oj ) = Pr(Oj ) n Y i=1 Pr(S i ;X i j;O i ); (4.5) with Pr(Oj ) = n jOj jOj (1 ) njOj ; (4.6) where is the probability that a sequence is in the foreground (contains a motif), and O i is the indicator for the event thatS i is in the foreground. Therefore, we have Pr(S i ;X i j;O i ) = Pr(S i ;X i j;O i = 1) O i Pr(S i ;X i j;O i = 0) 1O i = ( 1 `w + 1 ) O i `w+1 Y j=1 Pr(S i jX ij = 1; ) X ij ( ` Y l=1 f(S il )) 1O i ; again assuming a uniform distribution over motif occurrence locations inS i whenO i = 1. In the following sections we explain how we leverage two additional pieces of infor- mation for a practical and more accurate motif discovery in high-throughput RNA- protein interactions data. 53 4.2 Simultaneously modeling secondary structure and binding sites Our model of sequence and structure builds upon the widely used classic mixture model introduced by Lawrence and Reilley [129] with many notable extensions [146], pre- sented in the previous section. LetS =fS 1 ;S 2 ;:::;S n g be a set of unaligned RNA sequences over the alphabet =fA;C;G;Ug, obtained from a CLIP-seq experiment. Without loss of generality, to make the notations simpler we assume a fixed length,m, for all the sequences. LetT represent the secondary structure of the sequences inS, such thatT ij = 1 if nucleotide j in sequencei is paired, andT ij = 0 otherwise. The structure of an RNA molecule is completely determined by its primary sequence; in describing our model and algorithms then, we will assume thatT is known wheneverS is also known. However, in reality, there is an ensemble of folds that a given sequence may adopt, and each one has a particular probability of occurring. Therefore, we cannot obtain the true value forT and in fact secondary structure of sequences can be viewed as as missing data in our model. However, calculating the pairing probabilities using McCaskill’s algorithm gives us the probability that a base is being paired or not, which in turn can be interpreted as the expectation, sinceT is a matrix of indicators. We augment the usual motif model M = (M k ) w k=1 and background model f to account for secondary structure: f and theM k are multinomial distributions overfA, C, G, Ugfpaired, unpairedg.From a theoretical standpoint, we consider secondary structure to be an inherent property of the sequences, so accounting for structure does not introduce any new observed data. We will describe how our model works assuming structure is fully determined by the sequence; by this we mean that, given a sequenceS i , the base-pairing state (paired or unpaired) of each nucleotide within that sequence can 54 be exactly determined. At the end of this subsection we will discuss the practicalities of determining the structure of the sequences. In order to account for secondary structure, we augmentM andf as follows: M k (b;) = Pr b appears at positionk in the motif with pairing state ; f(b;) = Pr b appears in the background with pairing state ; where2f0; 1g, either paired or unpaired. Then Pr(S;T;XjM;f) = n Y i=1 Pr(S i ;T i ;X i jM;f) = Pr(XjM;f;O) Pr(OjM;f) n Y i=1 Pr(S i fX i g;TfX i gjM;X;O i ) Pr(S i fX i g;T i fX i gjf;X;O i ); whereO i = 1 if sequencei contains a motif occurrence and 0 otherwise. When deal- ing only with sequence and structure, the prior Pr(XjM;f;O) is uniform, and can be disregarded. For any sequences having structural indicatorst, Pr(s;tjM) = w Y k=1 M k (s k ;t k ) and Pr(s;tjf) = jsj Y k=1 f(s k ;t k ): 55 Note that we can consider the form of the complete-data likelihood unchanged from Equation 4.5 with changes in the equations for computing Pr(S i jX ij = 1; ) and Pr(S i jO i = 0; ), i.e. Pr(S i jX ij = 1; ) = m Y l=1 U Y b=A 1 Y =0 ( blj ) bli (4.7) and Pr(S i jO i = 0; ) = m Y l=1 U Y b=A 1 Y =0 f(b;) bli ; (4.8) where blj = 8 < : M lj+1 (b;) ifjlj +w 1; f(b;) otherwise: and bli = 1 if basel in sequencei isb and has pairing state, and bli = 0 otherwise. 4.3 Modeling diagnostic events Next we make an additional augmentation to the model to account for cross-linking. We assume each sequence contains one cross-link location. We estimate the probabilities of where the cross linking is located using diagnostic events as follows: Pr(C i =ljO i = 1;D) = D ij + P m j 0 =1 D ij 0 + ; 56 whereD ij is the count of diagnostic events at locationj in sequencei (we treatD as a fixed model parameter), C i is the location of the cross-link in sequencei, and is a pseudo-count to avoid zero-counts. We bring information about diagnostic events in via the prior on motif occurrences, Pr(XjM;f;O;D;g), whereg =fg 1 ;g 2 g models the distance between cross-link site and motif occurrences, as Pr(XjO;D;g) = n Y i=1 mw+1 Y j=1 Pr(X ij = 1jO i = 1;D;g) X ij ; where Pr(X ij = 1jO i = 1;D;g) = m X l=1 Pr(C i =ljO i = 1;D) g 1 (1g 1 ) jl(j+g 2 )j K : Note that Pr(XjM;f;O;D;g) = Pr(XjO;D;g), since the prior has no dependence on the sequence/structure parameters of the model. We decided to model the relation- ship between diagnostic events and binding site positions using a geometric distribution, whereg =fg 1 ;g 2 g has probabilityg 1 and location parameterg 2 to indicate the typical offset between the cross-link site and the start of the binding site. The reason for using a geometric distribution is primarily based on observing what seems to be an exponen- tial decay in the density of diagnostic events moving away from what we believe to be true binding sites in many of the higher quality data sets (Figure 3.5C). This also makes sense conceptually in iCLIP, since the rate parameter can be viewed as the probability that the truncation will mistakenly occur one base too soon or too late. Our exploration of CLIP-seq data highlighted the fact that cross-linking is often quite noisy. To balance the impact of CLIP-seq cross-link events against sequence and struc- ture information, we include the parameterK – a tuning parameter that modulates the impact of diagnostic events on the algorithm. Low values (near 0) reduce the impact of 57 diagnostic events, while higher values increase it. For all results reported here, we fixed this at 1.1, which is the default in our implementation. In practice, we set this heuristi- cally using an exhaustive search of possible values and picking the one that maximized the number of motifs recovered from our collection of CLIP-seq data. To make sure that we are not over- fitting this parameter, we performed 1000 bootstrap samples of our data collection and a value of 1.1 proved to be the optimal value. However, this parameter can be adjusted by users if they wish to do so. Using our selection of current publicly available datasets, the default value seems to be optimal, however as CLIP experiment improves the users might feel the need to increase this value for their data sets. This is a somewhat conservative value, but users can increase this if they feel their data is of higher quality and as the CLIP-seq assay continues to improve. Our goal in characterizing binding specificity from the data is to find estimates for the model parametersM andf. The traditional formulation, without secondary structure indicators or diagnostic event locations, treats the sites X as missing data, and uses a method like EM or Gibbs Sampling. Our model has two important properties that facilitate using an EM algorithm to estimate the model parameters: We assume the values of the structure indicators T are not dependent on the model parameters. We also have a method to compute expected values for T : we can use the partition function algorithm due to McCaskill [161], as explained in Appendix B. Hence, in using EM to estimateM andf, we need only recompute expectations for the values ofX at each iteration, whileT remains static. We also remark here that in theory we could re-estimate values for T using a restricted partition function algorithm [156], but doing so would be computationally pro- hibitive. 58 The contribution of diagnostic events, given the motif locations is independent of the sequencesS and secondary structure indicatorsT , as well as the model param- eters M and f. This makes it easy to decompose the estimation procedure into steps involvingS andT and other steps involvingD. This has intuitive appeal: if every observed diagnostic event corresponded to a functional binding event, and if every functional binding event resulted in a diagnostic event, then we should be able to estimate the binding site indicatorsX using data fromD alone, only involving the parametersg. 4.4 Results and comparison with other methods 4.4.1 Generation of simulated data Position weight matrix / motif For each simulated dataset, we generated a random position weight matrix from which motif occurrences were drawn. For all simulations, the PWMs had length six, and an average information content of 0.5 bits per column (within a tolerance of 0.005 bits). The target average information content was achieved using a rejection sampling approach. Sequence data We first took the set of all human 3’ UTRs from RefSeq and collapsed these into non- overlapping genomic regions. We retained all regions of length greater than 50bp. For each simulation, we selected 500 of these regions randomly, and for each region ran- domly selected a 50bp sub-sequence. We also randomly generate a PWM. For each of 59 the 500 sequences, we generated a motif occurrence from the PWM and randomly place this into the sequence. When imposing structure on a motif occurrence, we select a subsequence either upstream or downstream of the motif occurrence (with equal probability if the occur- rence is sufficiently far from the edge of the sequence, or specifically one or the other if it is close) such that (1) the selected subsequence has length 10, and (2) the subsequence does not overlap the motif, but is no more than 15 bases from the motif. We then place the reverse complement of this subsequence on the opposite side of the motif such that it also does not overlap the motif, but is no more than 15 bases from the selected subse- quence. In this way, there is high probability that the local sequence will give rise to a computational fold with a stem loop encompassing the motif occurrence. Diagnostic events We consider diagnostic events to have a distance from motif occurrences that follows a geometric distribution. We also consider that this distance distribution may be centered on a position that is offset from the start of the occurrence. We call this offset. This is also the generative model we use to simulate diagnostic events. For each simulated dataset, we randomly generate a value for (random uniform distribution on8 to +8 relative to the motif occurrence start), and a value for the probability parameter of the geometric distribution (also random uniform in range 0 to 1). For each diagnostic event we place into the dataset, we simulate its location relative to the selected motif location by drawing a distance from the geometric distribution specified by p, and adding to this. In addition, we consider that some fraction of diagnostic events will be noise. For all simulations performed in preparing this manuscript, we fixed this fraction at 20%. For noise events, we place the diagnostic event at a position selected uniformly at random from all loci in the sequence. The number of diagnostic events that were 60 placed into each sequence was drawn from the empirical distribution of diagnostic event counts in real CLIP-Seq dataset as follows: we randomly selected a CLIP-Seq dataset and counted the number of diagnostic events that were present in this dataset within the 50bp region that the sequence was drawn from. 4.4.2 Evaluating motifs recovered from simulated data To determine how close to the planted motif a recovered motif was, we calculated the KullbackLeibler divergence (KL-D) of the recovered motif from the planted motif. Rather than reporting raw KL-divergence values though, which are difficult to interpret, we instead report fractions of the number of simulated datasets on which the motif was successfully recovered. To determine whether a motif was recovered or not, we estab- lished a threshold KL-D value below which we would consider the motif to be recov- ered, and above which we would consider that the motif was not recovered. To find this threshold, we calculated the KL-divergence between the simulated position weight matrix and the set of all position weight matrices recovered by Zagros from simulated datasets – this allowed us to estimate the background distribution of KL-divergence val- ues. Using this background distribution, we calculated ap-value for the motif recovered for each simulated datasets. If thep-value was below 0:05 we considered that the motif had been recovered, otherwise we considered it not-recovered. For equal proportions of the datasets, we fixed the secondary structure of the occur- rence to be single-stranded in 100% of the occurrences, 90% of the occurrences and so on down to 0% of the occurrences (we call this the structure-fraction of the dataset). We also simulated diagnostic events with distance geometrically distributed around some offset from the motif occurrence. This offset is fixed for each dataset, but varies uni- formly at random in all dataset on the range +/- 8nt. Similar to the structure, for equal 61 Figure 4.1: Use of RNA structure and diagnostic events improves motif-discovery performance on simulated data and outperforms other methods. (A) Proportion of recovered motifs as a function of the fraction of planted motif occurrences that are forced to adopt a specific RNA secondary structure (in this case, ssRNA). Zagros was run with either just sequence data, or both the sequences and the structure (base-pair probabilities). For each fraction of occurrences with planted motifs, we simulated 500 random datasets. Error-bars are 95% confidence interval from 1000 bootstrap samples. (B) As in panel A, but as a function of the fraction of planted motif occurrences that have one or more diagnostic events. Zagros was run either with just sequence data, or sequence and diagnostic events data. (C) Comparison with other methods. (C) 0.00 0.25 0.50 0.75 1.00 fraction of recovered motifs DME MDSCAN MEME Zagros 0 10 20 30 50 40 60 70 80 90100 1.00 0.75 0.50 0.25 0.00 fraction of recovered matrices percentage of motif occurrences with one or more diagnostic events using DEs not using DEs (B) 0 10 20 30 50 40 60 70 80 90100 percentage of motif occurrences with forced ssRNA structure fraction of recovered matrices using RNA structure not using RNA structure (A) 1.00 0.75 0.50 0.25 0.00 proportions of the datasets, we planted these diagnostic events for 100% of the motif occurrences, 90% of the occurrences and so on down to 0% of the occurrences. For each simulated dataset we ran Zagros in four different ways: with just the simu- lated sequences; with the sequences and base-pair probabilities (RNA secondary struc- ture); with the sequences and diagnostic event locations; and with the sequences, struc- ture, and diagnostic events. Although Zagros can be run in these four different modes, when we refer simply to Zagros, we mean the version using all three of sequence, struc- ture and DEs. In each case, we determined whether the motif recovered by Zagros was a match to the planted one, and calculated the fraction of datasets for which each method was able to recover the planted motif. As the fraction of motif occurrences with fixed structure increases, the ability of Zagros to recover the motif if structural information is provided also increases, while it remains stationary if only sequence information is pro- vided (see Figure 4.1A). Similar, although less dramatic, improvements in performance 62 are observed as the fraction of motif occurrences with diagnostic events increases (Fig- ure 4.1B). It is possible to increase the impact of diagnostic events by increasing the value of the parameterK. Although this results in stronger performance on simulated data, we elected not to do this as it has an adverse impact on the algorithm’s performance with CLIP-seq derived data, which has higher levels of more heterogeneous cross-link noise. Figure 4.1C shows the accuracy of Zagros (using sequence, structure and diagnostic events) compared to three well-known motif discovery tools (MEME [14], DME [211] and MDSCAN [147]). We ran all of these methods on the same simulated datasets described above. As shown in Figure 4.1C, using the extra information of structure and diagnostic events allows Zagros to clearly achieve the best performance of the methods tested. The stark contrast is due to the low information content of the motif, making discovery by sequence alone highly challenging. The bars show the fraction of motifs recovered by each program; error bars indicate the standard deviation. 4.4.3 Evaluating results on CLIP-seq data We compared the performance of Zagros to DME, MDSCAN and MEME on a set of datasets derived from a range of CLIP-seq experiments comprised of multiple replicates. We selected a subset of 19 RBPs from this data for which a sequence preference had previously been reported (excluding miRNA-associated RBPs, such as Ago2; list of selected datasets and details of sequence selection provided in section 3.1). We report results for those replicates where at least one of the tested methods was able to recover the expected motif. When one of the algorithms designed for transcription factor binding sites finds the reverse compliment of a motif, we count that as a success, since they could easily be modified to be strand specific. Figure 4.2 shows twelve example datasets and the top-scoring motif recovered for each dataset; the previously described binding 63 site for each is also shown. Note that in all the examples, the match to the previously described site is obtained by Zagros using all three of sequence, structure and diagnostic events. 64 Figure 4.2: Use of RNA structure and diagnostic events improves recovery of expected motifs from CLIP-seq data. Top-scoring motifs recovered by Zagros on twelve example CLIP-seq datasets. For each dataset we show the motif recovered by DME, MDSCAN and MEME in addition to each version of Zagros. Pum2 TIA1 IGF2BP1 RBP hnRNPc HuR PTB TIAL1 ZAGROS IGF2BP2 IGF2BP3 QKI YCAY UUUUUA CAU UUUUUA U-rich motif U-rich motif UCUU UG-rich motif previously reported consensus CAUH UGUAUA UUAAC CAU Nova DME MD-SCAN MEME sequence only sequence and Structure sequence and DEs sequence, structure and DEs (A) (B) (C) (D) (E) (F) (G) (H) (I) (J) (K) (L) TDP-43 65 Although the other tested programs sometimes recover the expected motif, Zagros achieves the most consistent recovery. In some cases, it is clear that just the addition of structure is sufficient, such as with HuR. In other cases, such as for the IGF2BP proteins and hnRNPC, Zagros achieves a close match to the expected motif only through the use of both structure and diagnostic events. In general, the combination of both extra pieces of information gives the best result. Figures 4.3-4.7 show the result of running DME, MD-SCAN and MEME along with Zagros in all four different modes on all the CLIP data sets we have. We present results for all replicates, though we eliminate those for which no method was able to recover the expected motif. As shown in these figures, the superiority of Zagros using sequence, structure and diagnostic events is obvious. It is worth noting that there are differences in the results recovered by MEME and Zagros using sequence-only information. This is to be expected. Despite the fact that they use the same model and both employ expectation maximization, in practice there will be differences in implementation details, not to mention the accumulation of heuris- tics and optimizations in MEME as a result of its long history. Furthermore, there is noticeable variation in the results obtained by all of the pro- grams on datasets for the same RBP arising from different labs. This is again expected. CLIP-seq is a very challenging assay to perform, and the skill of the technician is instru- mental in achieving a good outcome. It is not surprising then that variability in data quality also results from this. Some datasets are so clean as to allow identification of the motif almost by visual inspection, while others are extremely challenging and even with advanced statistical methods it might not be possible to extract the expected motif above all others. A range of intermediate possibilities also exist. Add to this minor differences in the laboratory procedures used, and even different cell types in some cases, and it is not at all surprising that results are variable. 66 Figure 4.3: Top-scoring motifs recovered by Zagros on IGF2BPf1..3g CLIP-seq datasets For each dataset we show the motif recovered by DME, MD-SCAN and MEME in addition to each version of Zagros. RBP DME MD-SCAN MEME sequence and DEs sequence, structure and DEs sequence and Structure sequence only previously reported consensus ZAGROS IGF2BP1 IGF2BP2 IGF2BP3 IGF2BP1 IGF2BP1 IGF2BP1 IGF2BP1 IGF2BP2 IGF2BP2 IGF2BP2 IGF2BP2 IGF2BP3 IGF2BP3 CAU CAUH CAU CAU CAU CAU CAU CAU CAU CAU CAU CAUH CAUH 67 Figure 4.4: Top-scoring motifs recovered by Zagros on PUM2, QKI, hnRNPc and HuR CLIP-seq datasets. For each dataset we show the motif recovered by DME, MD-SCAN and MEME in addition to each version of Zagros. RBP DME MD-SCAN MEME sequence and DEs sequence, structure and DEs sequence and Structure sequence only previously reported consensus ZAGROS HuR U-rich motif U-rich motif UGUAUA UUAAC U-rich motif U-rich motif UGUAUA UUAAC UUAAC UUAAC U-rich motif U-rich motif U-rich motif Pum2 Pum2 QKI QKI QKI QKI hnRNPc hnRNPc hnRNPc HuR HuR HuR 68 Figure 4.5: Top-scoring motifs recovered by Zagros on TDP-43 CLIP-seq datasets. For each dataset we show the motif recovered by DME, MD-SCAN and MEME in addition to each version of Zagros. RBP DME MD-SCAN MEME sequence and DEs sequence, structure and DEs sequence and Structure sequence only previously reported consensus ZAGROS TDP-43 UG-rich motif TDP-43 UG-rich motif TDP-43 UG-rich motif TDP-43 UG-rich motif TDP-43 UG-rich motif TDP-43 UG-rich motif TDP-43 UG-rich motif TDP-43 UG-rich motif TDP-43 UG-rich motif TDP-43 UG-rich motif TDP-43 UG-rich motif TDP-43 UG-rich motif TDP-43 UG-rich motif TDP-43 UG-rich motif 69 Figure 4.6: Top-scoring motifs recovered by Zagros on TIA1, TIAL1 and PTB CLIP- seq datasets. For each dataset we show the motif recovered by DME, MD-SCAN and MEME in addition to each version of Zagros. RBP DME MD-SCAN MEME sequence and DEs sequence, structure and DEs sequence and Structure sequence only previously reported consensus ZAGROS TIA1 PTB TIAL1 UUUUUA UUUUUA UCUU TIA1 PTB TIAL1 UUUUUA UUUUUA UCUU TIA1 PTB TIAL1 UUUUUA UUUUUA UCUU PTB UCUU 70 Figure 4.7: Top-scoring motifs recovered by Zagros on Nova CLIP-seq datasets. For each dataset we show the motif recovered by DME, MD-SCAN and MEME in addition to each version of Zagros. RBP DME MD-SCAN MEME sequence and DEs sequence, structure and DEs sequence and Structure sequence only previously reported consensus ZAGROS YCAY Nova YCAY Nova YCAY Nova YCAY Nova YCAY Nova YCAY Nova YCAY Nova YCAY Nova YCAY Nova YCAY Nova YCAY Nova YCAY Nova 71 4.4.4 New insights on sequence and structure specificity of RBPs We applied Zagros to investigate the relationship between structure and sequence speci- ficity of RBP binding. We determined the sequence specificity of the motif reported by Zagros as the frequency with which the consensus sequence from the motif occurred in either human or mouse exons, depending on the organism for which the CLIP experi- ment was conducted. Consensus sequences that occur more frequently are considered to have low-specificity, while those that occurred relatively rarely we considered to have high-specificity. We then ordered RBPs by their mean specificity and plotted the mean base-pair probabilities recovered by Zagros for each (Figure 4.8). There was a negative correlation (Spearman’s correlation coefficient: -0.36) between sequence specificity and mean base-pair probability. In their seminal work, Schneider et al. [197] found that binding sites tend to contain enough information for them to be recognized, but not more than is required. RBP binding sites tend to be short and degenerate, leading to generally low sequence-based information. It is reasonable to assume then that other features, such as secondary structure, contribute the necessary information to overcome the deficit and allow the sites to be recognized. Importantly though, the work of Schnei- der et al. indicates that (owing to random drift), the additional information contributed will not be more than is required for recognition. Our results are supportive of this posi- tion, as we observe that motifs with more informative sequence are less structured, and vice versa. 72 Figure 4.8: Use of RNA structure and diagnostic events reveals a potential link between sequence- and structure-specificity in RBP binding sites. The average base- pair probability of the recovered motifs for RRM-containing RBPs. The x-axis is sorted by the average specificity of the sequence component of the recovered motifs; those motifs with high sequence specificity (i.e. for which matches to the sequence component of their recovered motif are rare) on the left, and those with low specificity (i.e. for which matches to their sequence are common) are on the right. QKI TDP43 TIA1 PTB IGF2BP3 IGF2BP2 TIAL1 hnRNPf Nova hnRNPu Mbnl1 hnRNPa1 HuR hnRNPc hnRNPh MOV10 PUM2 hnRNPa2b1 0.5 0.4 0.3 0.2 0.1 base-pair-probability IGF2BP1 Lin28 hnRNPm ● ● ● ● ● ● ● ● ● ● 73 Chapter 5 Conclusion Once relegated to the status of a simple messenger, our understanding of the role of the RNA molecule is currently undergoing something of a renaissance. Post-transcriptional regulation of genes is an adaptation to increase diversity of gene products. Processes such as alternative splicing or RNA editing fall into this category. RNA binding proteins are the key players in these regulation [164] and the interactions they have not only with mRNA, but a range of other RNA species as well [154]. This nexus of regulation, which affects very real phenotypical manifestations, including diseases such as cancer, is of growing interest to researchers [115]. The research presented in this dissertation was based on the problem of pattern discovery in high-throughput RNA- protein interaction data. The sequence-based approaches have shown to be unsuccessful in this area and the main reasons are abun- dant repeats and motifs in the genome, the experiment biases and generally too little information to be able to successfully characterize the binding sites. Our model lever- ages two additional pieces of information, secondary structure and experiment-specific events to accurately identify and characterize the RBP binding sites. We have evidence to show that RNA secondary structure is an important factor in the process of binding that is not trivially calculated from the sequence and the experiment specific events are footprints of RBP binding that if correctly modeled can yield a great insight into the process of binding. Using our method, we were able to uncover previously unknown facts about the nature of RNA-protein interactions. 74 Moving forward, new biological questions will be asked. Questions aimed at expanding our understanding of the interactions between regulators, regulatory net- works, the timing of events, and how perturbation of the cellular state affects them. These questions will drive the next generation of computational methods. One key issue will be the development of tools that effectively handle multi-factorial experimen- tal designs, with multiple replicates, and are able to leverage the additional statistical information they bring. First studies exist which combine several of these large-scale approaches. For example, to distinguish functional from non-functional RBP binding sites, proteomics studies have been combined with RIP-chip and CLIP experiments to characterize the translation regulators. Other efforts combine the analysis of miRNAs with proteome, transcriptome or translatome profiling. As more and more studies on multi- dimensional approaches arise, we need computational methods to integrate and analyze these data. These approaches will help to drive the consolidation of information about co- and post-transcriptional gene regulation into more holistic and comprehensive models. 75 Reference List [1] Norzehan Abdul-Manan and Kenneth R Williams. hnRNP A1 binds promiscu- ously to oligoribonucleotides: utilization of random and homo-oligonucleotides to discriminate sequence from base-specific binding. Nucleic acids research, 24(20):4063–4070, 1996. [2] Sharifah Lailee Syed Abdullah, Naimah Mohd Hussin, Hazaruddin Harun, and Noor Elaiza Abd Khalid. Comparative study of random-PSO and Linear-PSO algorithms. In Computer & Information Science (ICCIS), 2012 International Conference on, volume 1, pages 409–413. IEEE, 2012. [3] Martin Akerman, Hilda David-Eden, Ron Y Pinter, and Yael Mandel-Gutfreund. A computational approach for genome-wide mapping of splicing factor binding sites. Genome Biol, 10(3):R30, 2009. [4] GS Ali and ASN the. Regulation of alternative splicing of pre-mRNAs by stresses. Nuclear pre-mRNA Processing in Plants, pages 257–275, 2008. [5] Andrei Alic, Jos´ e E P´ erez-Ort´ ın, Joaqu´ ın Moreno, and Vicente Arnau. mRNAStaba web application for mRNA stability analysis. Bioinformatics, 29(6):813–814, 2013. [6] Frederick W Alt, Alfred LM Bothwell, Michael Knapp, Edward Siden, Eliza- beth Mather, Marian Koshland, and David Baltimore. Synthesis of secreted and membrane-bound immunoglobulin mu heavy chains is directed by mRNAs that differ at their 3 ends. Cell, 20(2):293–301, 1980. [7] Adam Ameur, Anna Wetterbom, Lars Feuk, and Ulf Gyllensten. Global and unbi- ased detection of splice junctions from RNA-seq data. Genome Biol, 11(3):R34, 2010. [8] Simon Anders and Wolfgang Huber. Differential expression analysis for sequence count data. Genome biol, 11(10):R106, 2010. [9] Simon Anders, Alejandro Reyes, and Wolfgang Huber. Detecting differential usage of exons from RNA-seq data. Genome research, 22(10):2008–2017, 2012. 76 [10] Manuel Ascano, Neelanjan Mukherjee, Pradeep Bandaru, Jason B Miller, Jef- frey D Nusbaum, David L Corcoran, Christine Langlois, Mathias Munschauer, Scott Dewell, Markus Hafner, et al. FMRP targets distinct mRNA sequence ele- ments to regulate protein expression. Nature, 2012. [11] Moritz Aschoff, Agnes Hotz-Wagenblatt, Karl-Heinz Glatting, Matthias Fischer, Roland Eils, and Rainer K¨ onig. SplicingCompass: differential splicing detection using RNA-Seq data. Bioinformatics, page btt101, 2013. [12] Kin Fai Au, Hui Jiang, Lan Lin, Yi Xing, and Wing Hung Wong. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic acids research, 38(14):4570–4578, 2010. [13] Daehyun Baek, Judit Vill´ en, Chanseok Shin, Fernando D Camargo, Steven P Gygi, and David P Bartel. The impact of microRNAs on protein output. Nature, 455(7209):64–71, 2008. [14] Timothy L Bailey and Charles Elkan. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine learning, 21(1-2):51–80, 1995. [15] TL Bailey and C Elkan. The value of prior knowledge in discovering motifs with MEME. In Proceedings/... International Conference on Intelligent Systems for Molecular Biology; ISMB. International Conference on Intelligent Systems for Molecular Biology, volume 3, page 21, 1995. [16] Tala Bakheet, Bryan RG Williams, and Khalid SA Khabar. ARED 3.0: the large and diverse AU-rich transcriptome. Nucleic acids research, 34(suppl 1):D111– D114, 2006. [17] Alexander G Baltz, Mathias Munschauer, Bj¨ orn Schwanh¨ ausser, Alexandra Vasile, Yasuhiro Murakawa, Markus Schueler, Noah Youngs, Duncan Penfold- Brown, Kevin Drew, Miha Milek, et al. The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts. Molecular cell, 46(5):674–690, 2012. [18] Cristina Barbosa, Isabel Peixeiro, and Lu´ ısa Rom˜ ao. Gene expression regu- lation by upstream open reading frames and human disease. PLoS genetics, 9(8):e1003529, 2013. [19] Carine Barreau, Luc Paillard, Agn` es M´ ereau, and H Beverley Osborne. Mam- malian CELF/Bruno-like RNA-binding proteins: molecular characteristics and biological functions. Biochimie, 88(5):515–525, 2006. 77 [20] David P Bartel. MicroRNAs: genomics, biogenesis, mechanism, and function. cell, 116(2):281–297, 2004. [21] Ariel A Bazzini, Miler T Lee, and Antonio J Giraldez. Ribosome profiling shows that miR-430 reduces translation before causing mRNA decay in zebrafish. Sci- ence, 336(6078):233–237, 2012. [22] George W Beadle and Edward L Tatum. Genetic control of biochemical reactions in Neurospora. Proceedings of the National Academy of Sciences of the United States of America, 27(11):499, 1941. [23] Emmanuel Beaudoing, Susan Freier, Jacqueline R Wyatt, Jean-Michel Claverie, and Daniel Gautheret. Patterns of variant polyadenylation signal usage in human genes. Genome research, 10(7):1001–1010, 2000. [24] Anilkumar Bettegowda and Miles F Wilkinson. Transcription and post- transcriptional regulation of spermatogenesis. Philosophical Transactions of the Royal Society B: Biological Sciences, 365(1546):1637–1651, 2010. [25] B Bilanges and D Stokoe. Mechanisms of translational deregulation in human tumors and therapeutic intervention strategies. Oncogene, 26(41):5973–5990, 2007. [26] Mariano Bizzarri, Alessandro Palombo, and Alessandra Cucina. Theoretical aspects of systems biology. Progress in biophysics and molecular biology, 112(1):33–43, 2013. [27] Sarah E Brennan, Yuki Kuwano, Nadim Alkharouf, Perry J Blackshear, Myriam Gorospe, and Gerald M Wilson. The mRNA-destabilizing protein tristetrapro- lin is suppressed in many cancers, altering tumorigenic phenotypes and patient prognosis. Cancer research, 69(12):5168–5176, 2009. [28] David Brett, Heike Pospisil, Juan Valc´ arcel, Jens Reich, and Peer Bork. Alterna- tive splicing and genome complexity. Nature genetics, 30(1):29–30, 2002. [29] J Michael Brockman, Priyam Singh, Donglin Liu, Sean Quinlan, Jesse Salisbury, and Joel H Graber. PACdb: PolyA cleavage site and 3-UTR database. Bioinfor- matics, 21(18):3691–3693, 2005. [30] G Thomas Brown and Thomas M McIntyre. Lipopolysaccharide signaling with- out a nucleus: Kinase cascades stimulate platelet shedding of proinflamma- tory IL-1–rich microparticles. The Journal of Immunology, 186(9):5489–5496, 2011. 78 [31] RJ Buckanovich and RB Darnell. The neuronal RNA binding protein Nova-1 recognizes specific RNA targets in vitro and in vivo. Mol Cell Biol, 17:3194– 3201, 1997. [32] Chris Burge and Samuel Karlin. Prediction of complete gene structures in human genomic DNA. Journal of molecular biology, 268(1):78–94, 1997. [33] Sarah E Calvo, David J Pagliarini, and Vamsi K Mootha. Upstream open reading frames cause widespread reduction of protein expression and are poly- morphic among humans. Proceedings of the National Academy of Sciences, 106(18):7507–7512, 2009. [34] Massimo Caputi and Alan M. Zahler. Determination of the RNA binding speci- ficity of the heterogeneous nuclear ribonucleoprotein (hnRNP) H/H’/F/2H9 fam- ily. Journal of Biological Chemistry, 276(47):43850–43859, 2001. [35] Lon R Cardon and Gary D Stormo. Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA frag- ments. Journal of molecular biology, 223(1):159–170, 1992. [36] Alfredo Castello, Bernd Fischer, Katrin Eichelbaum, Rastislav Horos, Benedikt M Beckmann, Claudia Strein, Norman E Davey, David T Humphreys, Thomas Preiss, Lars M Steinmetz, et al. Insights into RNA biology from an atlas of mammalian mRNA-binding proteins. Cell, 149(6):1393–1406, 2012. [37] Beibei Chen, Jonghyun Yun, Min Soo Kim, Joshua T Mendell, and Yang Xie. PIPE-CLIP: a comprehensive online tool for CLIP-seq data analysis. Genome Biol, 15(1):R18, 2014. [38] Thimmaiah P Chendrimada, Richard I Gregory, Easwari Kumaraswamy, Jessica Norman, Neil Cooch, Kazuko Nishikura, and Ramin Shiekhattar. TRBP recruits the Dicer complex to Ago2 for microRNA processing and gene silencing. Nature, 436(7051):740–744, 2005. [39] Yiming Cheng, Robert M Miura, and Bin Tian. Prediction of mRNA polyadeny- lation sites by support vector machine. Bioinformatics, 22(19):2320–2325, 2006. [40] Guo-Liang Chew, Andrea Pauli, John L Rinn, Aviv Regev, Alexander F Schier, and Eivind Valen. Ribosome profiling reveals resemblance between long non- coding RNAs and 5 leaders of coding RNAs. Development, 140(13):2828–2834, 2013. [41] Sung Wook Chi, Julie B. Zang, Aldo Mele, and Robert B. Darnell. Arg- onaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature, 460(7254):479–486, 07 2009. 79 [42] Chih-Hung Chou, Feng-Mao Lin, Min-Te Chou, Sheng-Da Hsu, Tzu-Hao Chang, Shun-Long Weng, Sirjana Shrestha, Chiung-Chih Hsiao, Jui-Hung Hung, and Hsien-Da Huang. A computational approach for identifying microRNA-target interactions using high-throughput CLIP and PAR-CLIP sequencing. BMC genomics, 14(Suppl 1):S2, 2013. [43] Kate B Cook, Hilal Kazan, Khalid Zuberi, Quaid Morris, and Timothy R Hughes. RBPDB: a database of RNA-binding specificities. Nucleic acids research, 39(suppl 1):D301–D308, 2011. [44] David L Corcoran, Stoyan Georgiev, Neelanjan Mukherjee, Eva Gottwein, Rebecca L Skalsky, Jack D Keene, Uwe Ohler, et al. PARalyzer: definition of RNA binding sites from PAR-CLIP short-read sequence data. Genome Biol, 12(8):R79, 2011. [45] Modan K Das and Ho-Kwok Dai. A survey of DNA motif finding algorithms. BMC bioinformatics, 8(Suppl 7):S21, 2007. [46] Adnan Derti, Philip Garrett-Engele, Kenzie D MacIsaac, Richard C Stevens, Shreedharan Sriram, Ronghua Chen, Carol A Rohl, Jason M Johnson, and Tomas Babak. A quantitative atlas of polyadenylation in five mammals. Genome research, 22(6):1173–1183, 2012. [47] Christoph Dieterich and Peter F. Stadler. Computational biology of RNA interac- tions. Wiley Interdisciplinary Reviews: RNA, 4(1):107–120, 2013. [48] Marie-Agn` es Dillies, Andrea Rau, Julie Aubert, Christelle Hennequet-Antier, Marine Jeanmougin, Nicolas Servant, C´ eline Keime, Guillemette Marot, David Castel, Jordi Estelle, et al. A comprehensive evaluation of normalization meth- ods for Illumina high-throughput RNA sequencing data analysis. Briefings in bioinformatics, 14(6):671–683, 2013. [49] Alexander Dobin, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, 2013. [50] Lars D¨ olken, Zsolt Ruzsics, Bernd R¨ adle, Caroline C Friedel, Ralf Zimmer, J¨ org Mages, Reinhard Hoffmann, Paul Dickinson, Thorsten Forster, Peter Ghazal, et al. High-resolution gene expression profiling for simultaneous kinetic parame- ter analysis of RNA synthesis and decay. Rna, 14(9):1959–1972, 2008. [51] Joseph P Doyle, Joseph D Dougherty, Myriam Heiman, Eric F Schmidt, Tanya R Stevens, Guojun Ma, Sujata Bupp, Prerana Shrestha, Rajiv D Shah, Martin L Doughty, et al. Application of a translational profiling approach for the compar- ative analysis of CNS cell types. Cell, 135(4):749–762, 2008. 80 [52] B. Kate Dredge and Robert B. Darnell. Nova regulates gabaa receptor 2 alterna- tive splicing via a distal downstream ucau-rich intronic splicing enhancer. Molec- ular and Cellular Biology, 23(13):4687–4700, 2003. [53] Hongqing Du, Melissa S Cline, Robert J Osborne, Daniel L Tuttle, Tyson A Clark, John Paul Donohue, Megan P Hall, Lily Shiue, Maurice S Swanson, Charles A Thornton, et al. Aberrant alternative splicing and extracellular matrix gene expression in mouse models of myotonic dystrophy. Nature structural & molecular biology, 17(2):187–193, 2010. [54] Ralf Eberhardt, Devanand Anantham, Felix Herth, David Feller-Kopman, and Armin Ernst. Electromagnetic navigation diagnostic bronchoscopy in peripheral lung lesions. CHEST Journal, 131(6):1800–1805, 2007. [55] Sean R Eddy and Richard Durbin. RNA sequence analysis using covariance models. Nucleic acids research, 22(11):2079–2088, 1994. [56] Ran Elkon, Alejandro P Ugalde, and Reuven Agami. Alternative cleavage and polyadenylation: extent, regulation and function. Nature Reviews Genetics, 14(7):496–506, 2013. [57] Anton J Enright, Bino John, Ulrike Gaul, Thomas Tuschl, Chris Sander, Debora S Marks, et al. MicroRNA targets in Drosophila. Genome biology, 5(1):R1–R1, 2004. [58] Florian Erhard, Lars D¨ olken, and Ralf Zimmer. RIP-chip enrichment analysis. Bioinformatics, 29(1):77–83, 2013. [59] Manel Esteller. Non-coding RNAs in human disease. Nature Reviews Genetics, 12(12):861–874, 2011. [60] Gary B Fogel, V William Porto, Dana G Weekes, David B Fogel, Richard H Griffey, John A McNeil, Elena Lesnik, David J Ecker, and Rangarajan Sampath. Discovery of RNA structural elements using evolutionary computation. Nucleic Acids Research, 30(23):5310–5317, 2002. [61] Nuno A Fonseca, Johan Rung, Alvis Brazma, and John C Marioni. Tools for mapping high-throughput sequencing data. Bioinformatics, page bts605, 2012. [62] P Forch, O Puig, C Martinez, B Seraphin, and J Valcarcel. The splicing regulator TIA-1 interacts with U1-C to promote U1 snRNP recruitment to 5’ splice sites. EMBO J, 21:6882–6892, 2002. [63] Mallory A Freeberg, Ting Han, James J Moresco, Andy Kong, Yu-Cheng Yang, Zhi J Lu, John R Yates, and John K Kim. Pervasive and dynamic protein binding 81 sites of the mRNA transcriptome in Saccharomyces cerevisiae. Genome Biol, 14:R13, 2013. [64] Matthew B Friedersdorf and Jack D Keene. Advancing the functional utility of PAR-CLIP by quantifying background binding to mRNAs and lncRNAs. Genome Biol, 15:R2, 2014. [65] Claudia Fritsch, Alexander Herrmann, Michael Nothnagel, Karol Szafranski, Klaus Huse, Frank Schumann, Stefan Schreiber, Matthias Platzer, Michael Krawczak, Jochen Hampe, et al. Genome-wide search for novel human uORFs and N-terminal protein extensions using ribosomal footprinting. Genome research, 22(11):2208–2218, 2012. [66] Pedro AF Galante, Devraj Sandhu, Raquel de Sousa Abreu, Michael Gradassi, Natanja Slager, Christine V ogel, Sandro Jose de Souza, and Luiz OF Penalva. A comprehensive in silico expression analysis of RNA binding proteins in normal and tumor tissue. RNA biology, 6(4):426–433, 2009. [67] Andre Galarneau and Stephane Richard. Target RNA motif and target mrnas of the quaking star protein. Nat Struct Mol Biol, 12(8):691–698, 08 2005. [68] Alessia Galgano, Michael Forrer, Lukasz Jaskiewicz, Alexander Kanitz, Mihaela Zavolan, and Andr´ e P. Gerber. Comparative analysis of mrna targets for human puf-family proteins suggests extensive interaction with the mirna regulatory sys- tem. PLoS ONE, 3(9):e3164, 09 2008. [69] Eric R Gamazon, Hae-Kyung Im, Shiwei Duan, Yves A Lussier, Nancy J Cox, M Eileen Dolan, and Wei Zhang. Exprtarget: an integrative approach to predict- ing human microRNA targets. PLoS One, 5(10):e13534, 2010. [70] Manuel Garber, Manfred G Grabherr, Mitchell Guttman, and Cole Trapnell. Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods, 8(6):469–477, 2011. [71] Paul J Gardina, Tyson A Clark, Brian Shimada, Michelle K Staples, Qing Yang, James Veitch, Anthony Schweitzer, Tarif Awad, Charles Sugnet, Suzanne Dee, et al. Alternative splicing and differential gene expression in colon cancer detected by a whole genome exon array. BMC genomics, 7(1):325, 2006. [72] Daniel Gautheret, Olivier Poirot, Fabrice Lopez, St´ ephane Audic, and Jean- Michel Claverie. Alternate polyadenylation in human mRNAs: a large-scale analysis by EST clustering. Genome Research, 8(5):524–530, 1998. [73] Joseph V Geisberg, Zarmik Moqtaderi, Xiaochun Fan, Fatih Ozsolak, and Kevin Struhl. Global analysis of mRNA isoform half-lives reveals stabilizing and desta- bilizing elements in yeast. Cell, 156(4):812–824, 2014. 82 [74] Tina Glisovic, Jennifer L Bachorik, Jeongsik Yong, and Gideon Dreyfuss. RNA-binding proteins and post-transcriptional gene regulation. FEBS letters, 582(14):1977–1986, 2008. [75] Jeremy Goecks, Anton Nekrutenko, James Taylor, et al. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol, 11(8):R86, 2010. [76] Hani Goodarzi, Hamed S Najafabadi, Panos Oikonomou, Todd M Greco, Lisa Fish, Reza Salavati, Ileana M Cristea, and Saeed Tavazoie. Systematic discov- ery of structural elements governing stability of mammalian messenger RNAs. Nature, 485(7397):264–268, 2012. [77] Sander Granneman, Grzegorz Kudla, Elisabeth Petfalski, and David Tollervey. Identification of protein binding sites on U3 snoRNA and pre-rRNA by UV cross- linking and high-throughput analysis of cDNAs. Proceedings of the National Academy of Sciences, 106(24):9613–9618, 2009. [78] Richard I Gregory, Thimmaiah P Chendrimada, Neil Cooch, and Ramin Shiekhat- tar. Human RISC couples microRNA biogenesis and posttranscriptional gene silencing. Cell, 123(4):631–640, 2005. [79] Andrew Grimson, Kyle Kai-How Farh, Wendy K Johnston, Philip Garrett- Engele, Lee P Lim, and David P Bartel. MicroRNA targeting specificity in mam- mals: determinants beyond seed pairing. Molecular cell, 27(1):91–105, 2007. [80] Andreas R Gruber, J¨ org Fallmann, Franz Kratochvill, Pavel Kovarik, and Ivo L Hofacker. AREsite: a database for the comprehensive investigation of AU-rich elements. Nucleic acids research, 39(suppl 1):D66–D69, 2011. [81] Andreas R Gruber, Georges Martin, Walter Keller, and Mihaela Zavolan. Means to an end: mechanisms of alternative polyadenylation of messenger RNA precur- sors. Wiley Interdisciplinary Reviews: RNA, 5(2):183–196, 2014. [82] Huili Guo, Nicholas T Ingolia, Jonathan S Weissman, and David P Bartel. Mam- malian microRNAs predominantly act to decrease target mRNA levels. Nature, 466(7308):835–840, 2010. [83] Ishaan Gupta, Sandra Clauder-M¨ unster, Bernd Klaus, Aino I J¨ arvelin, Raeka S Aiyar, Vladimir Benes, Stefan Wilkening, Wolfgang Huber, Vicent Pelechano, and Lars M Steinmetz. Alternative polyadenylation diversifies post- transcriptional regulation by selective RNA–protein interactions. Molecular sys- tems biology, 10(2), 2014. 83 [84] Simon Haenni, Zhe Ji, Mainul Hoque, Nigel Rust, Helen Sharpe, Ralf Eberhard, Cathy Browne, Michael O Hengartner, Jane Mellor, Bin Tian, et al. Analysis of C. elegans intestinal gene expression and polyadenylation by fluorescence-activated nuclei sorting and 3-end-seq. Nucleic acids research, 40(13):6304–6318, 2012. [85] Markus Hafner, Markus Landthaler, Lukas Burger, Mohsen Khorshid, Jean Hausser, Philipp Berninger, Andrea Rothballer, Manuel Ascano Jr, Anna- Carina Jungkamp, Mathias Munschauer, et al. Transcriptome-wide identifica- tion of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell, 141(1):129–141, 2010. [86] Masatoshi Hagiwara. Alternative splicing: a new drug target of the post- genome era. Biochimica et Biophysica Acta (BBA)-Proteins and Proteomics, 1754(1):324–331, 2005. [87] Scott M Hammond, Sabrina Boettcher, Amy A Caudy, Ryuji Kobayashi, and Gregory J Hannon. Argonaute2, a link between genetic and biochemical analyses of RNAi. Science, 293(5532):1146–1150, 2001. [88] David G Hendrickson, Daniel J Hogan, Heather L McCullough, Jason W Myers, Daniel Herschlag, James E Ferrell, and Patrick O Brown. Concordant regula- tion of translation and mRNA abundance for hundreds of targets of a human microRNA. PLoS biology, 7(11):e1000238, 2009. [89] Michael Hiller, Rainer Pudimat, Anke Busch, and Rolf Backofen. Using RNA secondary structures to guide sequence motif finding towards single-stranded regions. Nucleic acids research, 34(17):e117–e117, 2006. [90] Steve Hoffmann, Christian Otto, Gero Doose, Andrea Tanzer, David Langen- berger, Sabina Christ, Manfred Kunz, Lesca Holdt, Daniel Teupser, J¨ org Hack- erm¨ uller, et al. A multi-split mapping algorithm for circular RNA, splicing, trans- splicing, and fusion detection. Genome biology, 15(2):R34, 2014. [91] Steve Hoffmann, Christian Otto, Stefan Kurtz, Cynthia M Sharma, Philipp Khaitovich, J¨ org V ogel, Peter F Stadler, and J¨ org Hackerm¨ uller. Fast mapping of short sequences with mismatches, insertions and deletions using index struc- tures. PLoS computational biology, 5(9):e1000502, 2009. [92] Heather M Hood, Daniel E Neafsey, James Galagan, and Matthew S Sachs. Evo- lutionary roles of upstream open reading frames in mediating gene regulation in fungi. Annual review of microbiology, 63:385–409, 2009. [93] Joan E Hooper. A survey of software for genome-wide discovery of differential splicing in RNA-Seq data. Gene, 36:37, 2014. 84 [94] Mainul Hoque, Zhe Ji, Dinghai Zheng, Wenting Luo, Wencheng Li, Bei You, Ji Yeon Park, Ghassan Yehia, and Bin Tian. Analysis of alternative cleavage and polyadenylation by 3 [prime] region extraction and deep sequencing. Nature methods, 10(2):133–139, 2013. [95] Andrew JM Howden, Vincent Geoghegan, Kristin Katsch, Georgios Efstathiou, Bhaskar Bhushan, Omar Boutureira, Benjamin Thomas, David C Trudgian, Benedikt M Kessler, Daniela C Dieterich, et al. QuaNCAT: quantitating pro- teome dynamics in primary cells. Nature methods, 10(4):343–346, 2013. [96] Andrew C Hsieh, Yi Liu, Merritt P Edlind, Nicholas T Ingolia, Matthew R Janes, Annie Sher, Evan Y Shi, Craig R Stumpf, Carly Christensen, Michael J Bonham, et al. The translational landscape of mTOR signalling steers cancer initiation and metastasis. Nature, 485(7396):55–61, 2012. [97] Jianjun Hu, Bin Li, and Daisuke Kihara. Limitations and potentials of current motif discovery algorithms. Nucleic acids research, 33(15):4899–4913, 2005. [98] Yin Hu, Yan Huang, Ying Du, Christian F Orellana, Darshan Singh, Amy R Johnson, Ana¨ ıs Monroy, Pei-Fen Kuan, Scott M Hammond, Liza Makowski, et al. DiffSplice: the genome-wide detection of differential splicing events with RNA- seq. Nucleic acids research, 41(2):e39–e39, 2013. [99] Songbo Huang, Jinbo Zhang, Ruiqiang Li, Wenqian Zhang, Zengquan He, Tak- Wah Lam, Zhiyu Peng, and Siu-Ming Yiu. SOAPsplice: genome-wide ab initio detection of splice junctions from RNA-Seq data. Frontiers in Genetics, 2:46, 2011. [100] Stephanie C Huelga, Anthony Q Vu, Justin D Arnold, Tiffany Y Liang, Patrick P Liu, Bernice Y Yan, John Paul Donohue, Lily Shiue, Shawn Hoon, Sydney Bren- ner, et al. Integrative genome-wide analysis reveals cooperative regulation of alternative splicing by hnRNP proteins. Cell reports, 1(2):167–178, 2012. [101] Lee-Hsueh Hung, Monika Heiner, Jingyi Hui, Silke Schreiner, Vladimir Benes, and Albrecht Bindereif. Diverse roles of hnRNP L in mammalian mRNA pro- cessing: a combined microarray and RNAi analysis. Rna, 14(2):284–296, 2008. [102] Sarah Hunter, Philip Jones, Alex Mitchell, Rolf Apweiler, Teresa K Attwood, Alex Bateman, Thomas Bernard, David Binns, Peer Bork, Sarah Burge, et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic acids research, 40(D1):D306–D312, 2012. [103] Eric Huntzinger and Elisa Izaurralde. Gene silencing by microRNAs: contri- butions of translational repression and mRNA decay. Nature Reviews Genetics, 12(2):99–110, 2011. 85 [104] Nicholas T Ingolia, Sina Ghaemmaghami, John RS Newman, and Jonathan S Weissman. Genome-wide analysis in vivo of translation with nucleotide resolu- tion using ribosome profiling. science, 324(5924):218–223, 2009. [105] Nicholas T Ingolia, Liana F Lareau, and Jonathan S Weissman. Ribosome pro- filing of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell, 147(4):789–802, 2011. [106] Richard J Jackson, Christopher UT Hellen, and Tatyana V Pestova. The mecha- nism of eukaryotic translation initiation and principles of its regulation. Nature reviews Molecular cell biology, 11(2):113–127, 2010. [107] Maria Kalyna, Craig G Simpson, Naeem H Syed, Dominika Lewandowska, Yamile Marquez, Branislav Kusenda, Jacqueline Marshall, John Fuller, Linda Cardle, Jim McNicol, et al. Alternative splicing and nonsense-mediated decay modulate expression of important regulatory genes in Arabidopsis. Nucleic acids research, 40(6):2454–2469, 2012. [108] Katannya Kapeli and Gene W Yeo. Genome-wide approaches to dissect the roles of RNA binding proteins in translational control: implications for neurological diseases. Frontiers in neuroscience, 6:144–144, 2011. [109] Yarden Katz, Eric T Wang, Edoardo M Airoldi, and Christopher B Burge. Analy- sis and design of RNA sequencing experiments for identifying isoform regulation. Nature methods, 7(12):1009–1015, 2010. [110] Hilal Kazan, Debashish Ray, Esther T Chan, Timothy R Hughes, and Quaid Morris. RNAcontext: a new method for learning the sequence and structure binding preferences of RNA-binding proteins. PLoS computational biology, 6(7):e1000832, 2010. [111] Michael Kertesz, Nicola Iovino, Ulrich Unnerstall, Ulrike Gaul, and Eran Segal. The role of site accessibility in microRNA target recognition. Nature genetics, 39(10):1278–1284, 2007. [112] Michael Kertesz, Yue Wan, Elad Mazor, John L Rinn, Robert C Nutter, Howard Y Chang, and Eran Segal. Genome-wide measurement of RNA secondary structure in yeast. Nature, 467(7311):103–107, 2010. [113] Stephanie Kervestin and Allan Jacobson. NMD: a multifaceted response to premature translational termination. Nature Reviews Molecular Cell Biology, 13(11):700–712, 2012. [114] Mohsen Khorshid, Christoph Rodak, and Mihaela Zavolan. CLIPZ: a database and analysis environment for experimentally determined binding sites of RNA- binding proteins. Nucleic acids research, 39(Database issue):D245–52, 2011. 86 [115] Mee Young Kim, Jung Hur, and Sunjoo Jeong. Emerging roles of RNA and RNA- binding protein network in cancer cells. BMB Reports, 42(3):125–130, 2009. [116] Shivendra Kishore, Lukasz Jaskiewicz, Lukas Burger, Jean Hausser, Mohsen Khorshid, and Mihaela Zavolan. A quantitative analysis of CLIP methods for identifying binding sites of RNA-binding proteins. Nat Meth, 8(7):559–564, 07 2011. [117] Daniel M Klass, Marion Scheibe, Falk Butter, Gregory J Hogan, Matthias Mann, and Patrick O Brown. Quantitative proteomic analysis reveals concurrent RNA– protein interactions and identifies new RNA-binding proteins in Saccharomyces cerevisiae. Genome research, 23(6):1028–1038, 2013. [118] Stefanie J Klug and Michael Famulok. All you wanted to know about SELEX. Molecular biology reports, 20(2):97–107, 1994. [119] Julian K¨ onig, Kathi Zarnack, Gregor Rot, Tomaˇ z Curk, Melis Kayikci, Blaˇ z Zupan, Daniel J Turner, Nicholas M Luscombe, and Jernej Ule. iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nature structural & molecular biology, 17(7):909–915, 2010. [120] Marilyn Kozak, M Evans, Paul D Gardner, Idhaliz Flores, Thomas M Mari- ano, Sidney Pestka, Anne Phelps, Hartmut Wohlrab, Michitaka Zushi, Komakazu Gomi, et al. Structural features in eukaryotic mRNAs that mod. Biological Chem- istry, 266(30), 1991. [121] Ana Kozomara and Sam Griffiths-Jones. miRBase: integrating microRNA anno- tation and deep-sequencing data. Nucleic Acids Research, 39(suppl 1):D152– D157, 2011. [122] Azra Krek, Dominic Gr¨ un, Matthew N Poy, Rachel Wolf, Lauren Rosenberg, Eric J Epstein, Philip MacMenamin, Isabelle da Piedade, Kristin C Gunsalus, Markus Stoffel, et al. Combinatorial microRNA target predictions. Nature genet- ics, 37(5):495–500, 2005. [123] Anders Krogh, Michael Brown, I Saira Mian, Kimmen Sj¨ olander, and David Haussler. Hidden Markov models in computational biology: Applications to pro- tein modeling. Journal of molecular biology, 235(5):1501–1531, 1994. [124] Alper Kucukural, Hakan ¨ Ozadam, Guramrit Singh, Melissa J Moore, and Can Cenik. ASPeak: an abundance sensitive peak detection algorithm for RIP-Seq. Bioinformatics, 29(19):2485–2486, 2013. [125] Scott Kuersten and Elizabeth B Goodwin. The power of the 3 UTR: translational control and development. Nature Reviews Genetics, 4(8):626–637, 2003. 87 [126] Scott Kuersten, Agnes Radek, Christine V ogel, and Luiz OF Penalva. Transla- tion regulation gets its omics moment. Wiley Interdisciplinary Reviews: RNA, 4(6):617–630, 2013. [127] Pan-Hsien Kuo, Chien-Hao Chiang, Yi-Ting Wang, Lyudmila G Doudeva, and Hanna S Yuan. The crystal structure of TDP-43 RRM1-DNA complex reveals the specific recognition for UG-and TG-rich nucleic acids. Nucleic acids research, 42(7):4712–4722, 2014. [128] Ben Langmead, Cole Trapnell, Mihai Pop, Steven L Salzberg, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 10(3):R25, 2009. [129] Charles E Lawrence and Andrew A Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Structure, Function, and Bioinformatics, 7(1):41–51, 1990. [130] Anthoula Lazaris-Karatzas, Kathleen S Montine, and Nahum Sonenberg. Malig- nant transformation by a eukaryotic initiation factor subunit that binds to mRNA 5’cap. Nature, 345(6275):544–547, 1990. [131] Svetlana Lebedeva, Marvin Jens, Kathrin Theil, Bj¨ orn Schwanh¨ ausser, Matthias Selbach, Markus Landthaler, and Nikolaus Rajewsky. Transcriptome-wide anal- ysis of regulatory interactions of the RNA-binding protein HuR. Molecular cell, 43(3):340–352, 2011. [132] Ju Youn Lee, Ijen Yeh, Ji Yeon Park, and Bin Tian. PolyA DB 2: mRNA polyadenylation sites in vertebrate genes. Nucleic acids research, 35(suppl 1):D165–D168, 2007. [133] Sooncheol Lee, Botao Liu, Soohyun Lee, Sheng-Xiong Huang, Ben Shen, and Shu-Bing Qian. Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proceedings of the National Academy of Sciences, 109(37):E2424–E2432, 2012. [134] Antonio Lembo, Ferdinando Di Cunto, and Paolo Provero. Shortening of 3 UTRs correlates with poor prognosis in breast and lung cancer. PLoS One, 7(2):e31129, 2012. [135] Anthony K L Leung, Amanda G Young, Arjun Bhutkar, Grace X Zheng, Andrew D Bosson, Cydney B Nielsen, and Phillip A Sharp. Genome-wide identi- fication of Ago2 binding sites from mouse embryonic stem cells with and without mature microRNAs. Nat Struct Mol Biol, 18(2):237–244, 2011. 88 [136] Benjamin P Lewis, Christopher B Burge, and David P Bartel. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. cell, 120(1):15–20, 2005. [137] Fan Li, Qi Zheng, Paul Ryvkin, Isabelle Dragomir, Yaanik Desai, Subhadra Aiyer, Otto Valladares, Jamie Yang, Shelly Bambina, Leah R Sabin, et al. Global analysis of RNA secondary structure in two metazoans. Cell Reports, 1(1):69–82, 2012. [138] Fan Li, Qi Zheng, Lee E Vandivier, Matthew R Willmann, Ying Chen, and Brian D Gregory. Regulatory Impact of RNA Secondary Structure across the Arabidopsis Transcriptome. The Plant Cell Online, 24(11):4346–4359, 2012. [139] Xiao Li, Hilal Kazan, Howard D Lipshitz, and Quaid D Morris. Finding the target sites of RNA-binding proteins. Wiley Interdisciplinary Reviews: RNA, 5(1):111– 130, 2014. [140] Xiao Li, Gerald Quon, Howard D. Lipshitz, and Quaid Morris. Predicting in vivo binding sites of RNA-binding proteins using mRNA secondary structure. RNA, 16(6):1096–1107, 2010. [141] Yue Li, Dorothy Yanling Zhao, Jack F Greenblatt, and Zhaolei Zhang. RIPSeeker: a statistical package for identifying protein-associated transcripts from RIP-seq experiments. Nucleic acids research, 41(8):e94–e94, 2013. [142] YZ Liang, TY Fang, HG Xu, and ZQ Zhuo. Expression of CD44v6 and Livin in gastric cancer tissue. Chin Med J (Engl), 125(17):3161–3165, 2012. [143] Donny D Licatalosi, Aldo Mele, John J Fak, Jernej Ule, Melis Kayikci, Sung Wook Chi, Tyson A Clark, Anthony C Schweitzer, John E Blume, Xun- ing Wang, et al. HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature, 456(7221):464–469, 2008. [144] Yuefeng Lin, Zhihua Li, Fatih Ozsolak, Sang Woo Kim, Gustavo Arango-Argoty, Teresa T Liu, Scott A Tenenbaum, Timothy Bailey, A Paula Monaghan, Patrice M Milos, et al. An in-depth map of polyadenylation sites in cancer. Nucleic acids research, 40(17):8460–8471, 2012. [145] Jun S Liu. The collapsed Gibbs sampler in Bayesian computations with applica- tions to a gene regulation problem. Journal of the American Statistical Associa- tion, 89(427):958–966, 1994. [146] Jun S Liu, Andrew F Neuwald, and Charles E Lawrence. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. Journal of the American Statistical Association, 90(432):1156–1170, 1995. 89 [147] X Shirley Liu, Douglas L Brutlag, and Jun S Liu. An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature biotechnology, 20(8):835–839, 2002. [148] Miriam Llorian, Schraga Schwartz, Tyson A Clark, Dror Hollander, Lit-Yeen Tan, Rachel Spellman, Adele Gordon, Anthony C Schweitzer, Pierre de la Grange, Gil Ast, et al. Position-dependent alternative splicing activity revealed by global profiling of alternative splicing events regulated by PTB. Nature structural & molecular biology, 17(99):1114–1123, 2010. [149] Gabriel B Loeb, Aly A Khan, David Canner, Joseph B Hiatt, Jay Shendure, Robert B Darnell, Christina S Leslie, and Alexander Y Rudensky. Transcriptome- wide miR-155 binding map reveals widespread noncanonical microRNA target- ing. Molecular cell, 48(5):760–770, 2012. [150] Ronny Lorenz, Stephan H Bernhart, Christian H¨ oner zu Siederdissen, Hakim Tafer, Christoph Flamm, Peter F Stadler, and Ivo L Hofacker. Viennarna package 2.0. Algorithms for Molecular Biology, 6(1):26, 2011. [151] Peng Lu, Christine V ogel, Rong Wang, Xin Yao, and Edward M Marcotte. Abso- lute protein expression profiling estimates the relative contributions of transcrip- tional and translational regulation. Nature biotechnology, 25(1):117–124, 2006. [152] Julius B. Lucks, Stefanie A. Mortimer, Cole Trapnell, Shujun Luo, Sharon Avi- ran, Gary P. Schroth, Lior Pachter, Jennifer A. Doudna, and Adam P. Arkin. Mul- tiplexed RNA structure characterization with selective 2’-hydroxyl acylation ana- lyzed by primer extension sequencing (SHAPE-Seq). Proceedings of the National Academy of Sciences, 108(27):11063–11068, 2011. [153] Kiven E. Lukong, Kai wei Chang, Edouard W. Khandjian, and St´ ephane Richard. RNA-binding proteins in human genetic disease. Trends in Genetics, 24(8):416 – 425, 2008. [154] Bradley M. Lunde, Claire Moore, and Gabriele Varani. RNA-binding proteins: modular design for efficient function. Nat Rev Mol Cell Biol, 8(6):479–490, 2007. [155] Henrike Maatz, Marvin Jens, Martin Liss, Sebastian Schafer, Matthias Heinig, Marieluise Kirchner, Eleonora Adami, Carola Rintisch, Vita Dauksaite, Michael H Radke, et al. RNA-binding protein RBM20 represses splicing to orchestrate cardiac pre-mRNA processing. The Journal of clinical investigation, 124(8):3419, 2014. [156] David H. Mathews. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA, 10(8):1178–1190, 2004. 90 [157] David H Mathews and Douglas H Turner. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. Journal of molecular biology, 317(2):191–203, 2002. [158] Daniel Maticzka, Sita J Lange, Fabrizio Costa, and Rolf Backofen. Graph- Prot: modeling binding preferences of RNA-binding proteins. Genome Biol, 15(1):R17, 2014. [159] Giancarlo Mauri and Giulio Pavesi. Pattern discovery in RNA secondary structure using affix trees. In Combinatorial Pattern Matching, pages 278–294. Springer, 2003. [160] Christine Mayr and David P Bartel. Widespread shortening of 3 UTRs by alter- native cleavage and polyadenylation activates oncogenes in cancer cells. Cell, 138(4):673–684, 2009. [161] John S McCaskill. The equilibrium partition function and base pair binding prob- abilities for RNA secondary structure. Biopolymers, 29(6-7):1105–1119, 1990. [162] Stavroula Mili and Joan A Steitz. Evidence for reassociation of RNA-binding proteins after cell lysis: implications for the interpretation of immunoprecipita- tion analyses. Rna, 10(11):1692–1694, 2004. [163] Christian Miller, Bj¨ orn Schwalb, Kerstin Maier, Daniel Schulz, Sebastian D¨ umcke, Benedikt Zacher, Andreas Mayer, Jasmin Sydow, Lisa Marcinowski, Lars D¨ olken, et al. Dynamic transcriptome analysis measures rates of mRNA synthesis and decay in yeast. Molecular systems biology, 7(1), 2011. [164] Melissa J Moore. From birth to death: the complex lives of eukaryotic mRNAs. Science, 309(5740):1514–8, 2005. [165] Michael J Moore, Chaolin Zhang, Emily Conn Gantman, Aldo Mele, Jen- nifer C Darnell, and Robert B Darnell. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nature protocols, 9(2):263–293, 2014. [166] David R Morris and Adam P Geballe. Upstream open reading frames as regula- tors of mRNA translation. Molecular and Cellular Biology, 20(23):8635–8642, 2000. [167] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7):621–628, 2008. 91 [168] Alisa A Mueller, Tom H Cheung, and Thomas A Rando. All’s well that ends well: alternative polyadenylation and its implications for stem cell biology. Current opinion in cell biology, 25(2):222–232, 2013. [169] Neelanjan Mukherjee, David L Corcoran, Jeffrey D Nusbaum, David W Reid, Stoyan Georgiev, Markus Hafner, Manuel Ascano Jr, Thomas Tuschl, Uwe Ohler, and Jack D Keene. Integrative regulatory mapping indicates that the RNA- binding protein HuR couples pre-mRNA processing and mRNA stability. Molec- ular cell, 43(3):327–339, 2011. [170] Neelanjan Mukherjee, Patrick J Lager, Matthew B Friedersdorf, Marshall A Thompson, and Jack D Keene. Coordinated posttranscriptional mRNA popula- tion dynamics during T-cell activation. Molecular systems biology, 5(1):288–288, 2009. [171] Tadashi Nakaya, Panagiotis Alexiou, Manolis Maragkakis, Alexandra Chang, and Zissimos Mourelatos. FUS regulates genes coding for RNA-binding proteins in neurons by binding to their highly conserved introns. Rna, 19(4):498–509, 2013. [172] Intawat Nookaew, Marta Papini, Natapol Pornputtpong, Gionata Scalcinati, Linn Fagerberg, Matthias Uhl´ en, and Jens Nielsen. A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cere- visiae. Nucleic Acids Research, 40(20):10084–10097, 2012. [173] Ruth Nussinov and Ann B Jacobson. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proceedings of the National Academy of Sci- ences, 77(11):6309–6313, 1980. [174] Razvan Nutiu, Robin C Friedman, Shujun Luo, Irina Khrebtukova, David Silva, Robin Li, Lu Zhang, Gary P Schroth, and Christopher B Burge. Direct measure- ment of DNA affinity landscapes on a high-throughput sequencing instrument. Nature biotechnology, 29(7):659–664, 2011. [175] Florian C. Oberstrass, Sigrid D. Auweter, Mich` ele Erat, Yann Hargous, Anke Henning, Philipp Wenter, Luc Reymond, Batoul Amir-Ahmady, Stefan Pitsch, Douglas L. Black, and Fr´ ed´ eric H.-T. Allain. Structure of ptb bound to rna: Spe- cific binding and implications for splicing regulation. Science, 309(5743):2054– 2057, 2005. [176] Alicia Oshlack, Mark D Robinson, Matthew D Young, et al. From RNA-seq reads to differential expression results. Genome biol, 11(12):220, 2010. 92 [177] Fatih Ozsolak, Philipp Kapranov, Sylvain Foissac, Sang Woo Kim, Elane Fishile- vich, A Paula Monaghan, Bino John, and Patrice M Milos. Comprehen- sive polyadenylation site maps in yeast and human reveal pervasive alternative polyadenylation. Cell, 143(6):1018–1029, 2010. [178] Fatih Ozsolak and Patrice M Milos. RNA sequencing: advances, challenges and opportunities. Nature reviews genetics, 12(2):87–98, 2010. [179] Inbal Paz, Idit Kosti, Manuel Ares, Melissa Cline, and Yael Mandel-Gutfreund. RBPmap: a web server for mapping binding sites of RNA-binding proteins. Nucleic Acids Research, 42(W1):W361–W367, 2014. [180] Luiz OF Penalva, Michael D Burdick, Simon M Lin, Hedwig Sutterluety, and Jack D Keene. RNA-binding proteins to assess gene expression states of co- cultivated cells in response to tumor cells. Molecular cancer, 3(1):24, 2004. [181] Janina Pfaff, Janosch Hennig, Franz Herzog, Ruedi Aebersold, Michael Sat- tler, Dierk Niessing, and Gunter Meister. Structural features of Argonaute– GW182 protein interactions. Proceedings of the National Academy of Sciences, 110(40):E3770–E3779, 2013. [182] Magdalini Polymenidou, Clotilde Lagier-Tourenne, Kasey R Hutt, Stephanie C Huelga, Jacqueline Moran, Tiffany Y Liang, Shuo-Chien Ling, Eveline Sun, Edward Wancewicz, Curt Mazur, Holly Kordasiewicz, Yalda Sedaghat, John Paul Donohue, Lily Shiue, C Frank Bennett, Gene W Yeo, and Don W Cleveland. Long pre-mRNA depletion and RNA missplicing contribute to neuronal vulnera- bility from loss of TDP-43. Nat Neurosci, 14(4):459–468, 04 2011. [183] Patrick Provost, David Dishart, Johanne Doucet, David Frendewey, Bengt Samuelsson, and Olof R˚ admark. Ribonuclease activity and RNA binding of recombinant human Dicer. The EMBO Journal, 21(21):5864–5874, 2002. [184] Kim D. Pruitt, Garth R. Brown, Susan M. Hiatt, Franc ¸oise Thibaud-Nissen, Alexander Astashyn, Olga Ermolaeva, Catherine M. Farrell, Jennifer Hart, Melissa J. Landrum, Kelly M. McGarvey, Michael R. Murphy, Nuala A. O’Leary, Shashikant Pujar, Bhanu Rajput, Sanjida H. Rangwala, Lillian D. Riddick, Andrei Shkeda, Hanzhen Sun, Pamela Tamez, Raymond E. Tully, Craig Wallin, David Webb, Janet Weber, Wendy Wu, Michael DiCuccio, Paul Kitts, Donna R. Maglott, Terence D. Murphy, and James M. Ostell. RefSeq: an update on mam- malian reference sequences. Nucleic Acids Research, 42(D1):D756–D763, 2014. [185] Ping Qiu. Recent advances in computational promoter analysis in understanding the transcriptional regulatory network. Biochemical and biophysical research communications, 309(3):495–501, 2003. 93 [186] Nikolaus Rajewsky, Massimo Vergassola, Ulrike Gaul, and Eric D Siggia. Com- putational detection of genomic cis-regulatory modules applied to body pattern- ing in the early Drosophila embryo. BMC bioinformatics, 3(1):30, 2002. [187] Debashish Ray, Hilal Kazan, Esther T Chan, Lourdes Pe˜ na Castillo, Sidharth Chaudhry, Shaheynoor Talukder, Benjamin J Blencowe, Quaid Morris, and Tim- othy R Hughes. Rapid and systematic analysis of the RNA recognition specifici- ties of RNA-binding proteins. Nature biotechnology, 27(7):667–670, 2009. [188] Debashish Ray, Hilal Kazan, Kate B Cook, Matthew T Weirauch, Hamed S Najafabadi, Xiao Li, Serge Gueroussov, Mihai Albu, Hong Zheng, Ally Yang, et al. A compendium of RNA-binding motifs for decoding gene regulation. Nature, 499(7457):172–177, 2013. [189] David W Reid and Christopher V Nicchitta. Genome-scale ribosome foot- printing identifies a primary role for endoplasmic reticulum-bound ribosomes in the translation of the mRNA transcriptome. Journal of Biological Chemistry, 10(1074):M111, 2011. [190] William Rennie, Chaochun Liu, C Steven Carmack, Adam Wolenc, Shaveta Kanoria, Jun Lu, Dang Long, and Ye Ding. STarMir: a web server for prediction of microRNA binding sites. Nucleic acids research, 42(W1):W114–W118, 2014. [191] Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edgeR: a Bio- conductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, 2010. [192] Frederick P Roth, Jason D Hughes, Preston W Estep, and George M Church. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nature biotechnology, 16(10):939–945, 1998. [193] Peter J Roy, Joshua M Stuart, Jim Lund, and Stuart K Kim. Chromoso- mal clustering of muscle-expressed genes in Caenorhabditis elegans. Nature, 418(6901):975–979, 2002. [194] Emad Bahrami Samani, M Mehdi Homayounpour, and Hong Gu. A Novel Hybrid GMM/SVM Architecture for Protein Secondary Structure Prediction. In Appli- cations of Fuzzy Sets Theory, pages 491–496. Springer, 2007. [195] Rickard Sandberg, Joel R Neilson, Arup Sarma, Phillip A Sharp, and Christo- pher B Burge. Proliferating cells express mRNAs with shortened 3’untranslated regions and fewer microRNA target sites. Science, 320(5883):1643–1647, 2008. 94 [196] Marion Scheibe, Falk Butter, Markus Hafner, Thomas Tuschl, and Matthias Mann. Quantitative mass spectrometry and PAR-CLIP to identify RNA-protein interactions. Nucleic acids research, 40(19):9897–9902, 2012. [197] Thomas D Schneider, Gary D Stormo, Larry Gold, and Andrzej Ehrenfeucht. Information content of binding sites on nucleotide sequences. Journal of molec- ular biology, 188(3):415–431, 1986. [198] Daniel R Schoenberg and Lynne E Maquat. Regulation of cytoplasmic mRNA decay. Nature Reviews Genetics, 13(4):246–259, 2012. [199] Mark D Schroeder, Michael Pearce, John Fak, HongQing Fan, Ulrich Unnerstall, Eldon Emberly, Nikolaus Rajewsky, Eric D Siggia, and Ulrike Gaul. Transcrip- tional control in the segmentation gene network of Drosophila. PLoS biology, 2(9):e271, 2004. [200] Bj¨ orn Schwalb, Daniel Schulz, Mai Sun, Benedikt Zacher, Sebastian D¨ umcke, Dietmar E Martin, Patrick Cramer, and Achim Tresch. Measurement of genome- wide RNA synthesis and decay rates with Dynamic Transcriptome Analysis (DTA). Bioinformatics, 28(6):884–885, 2012. [201] Bj¨ orn Schwanh¨ ausser, Dorothea Busse, Na Li, Gunnar Dittmar, Johannes Schuchhardt, Jana Wolf, Wei Chen, and Matthias Selbach. Global quantification of mammalian gene expression control. Nature, 473(7347):337–342, 2011. [202] Christoph Schweingruber, Simone C Rufener, David Z¨ und, Akio Yamashita, and Oliver M¨ uhlemann. Nonsense-mediated mRNA decaymechanisms of substrate mRNA recognition and degradation in mammalian cells. Biochimica et Biophys- ica Acta (BBA)-Gene Regulatory Mechanisms, 1829(6):612–623, 2013. [203] Matthias Selbach, Bj¨ orn Schwanh¨ ausser, Nadine Thierfelder, Zhuo Fang, Raya Khanin, and Nikolaus Rajewsky. Widespread changes in protein synthesis induced by microRNAs. Nature, 455(7209):58–63, 2008. [204] David R Setzer, Michael McGrogan, Jack H Nunberg, and Robert T Schimke. Size heterogeneity in the 3 end of dihydrofolate reductase messenger RNAs in mouse cells. Cell, 22(2):361–370, 1980. [205] Reut Shalgi, Jessica A Hurt, Irina Krykbaeva, Mikko Taipale, Susan Lindquist, and Christopher B Burge. Widespread regulation of translation by elongation pausing in heat shock. Molecular cell, 49(3):439–452, 2013. [206] Shihao Shen, Juw Won Park, Jian Huang, Kimberly A Dittmar, Zhi-xiang Lu, Qing Zhou, Russ P Carstens, and Yi Xing. MATS: a Bayesian framework for flexible detection of differential alternative splicing from RNA-Seq data. Nucleic Acids Research, 40(8):e61–e61, 2012. 95 [207] Jun Shi, Zhou Zhou, Wen Di, and Ningli Li. Correlation of CD44v6 expression with ovarian cancer progression and recurrence. BMC cancer, 13(1):182–182, 2013. [208] Cem Sievers, Tommy Schlumpf, Ritwick Sawarkar, Federico Comoglio, and Renato Paro. Mixture models and wavelet transforms reveal high confidence RNA-protein interaction sites in MOV10 PAR-CLIP data. Nucleic Acids Research, 40(20):e160–e160, 2012. [209] Ian M Silverman, Fan Li, and Brian D Gregory. Review: Genomic era analyses of RNA secondary structure and RNA-binding proteins reveal their significance to post-transcriptional regulation in plants. Plant Science, 205(206):55–62, 2013. [210] Peter Smibert, Pedro Miura, Jakub O Westholm, Sol Shenker, Gemma May, Michael O Duff, Dayu Zhang, Brian D Eads, Joe Carlson, James B Brown, et al. Global Patterns of Tissue-Specific Alternative Polyadenylation in Drosophila. Cell reports, 1(3):277–289, 2012. [211] Andrew D Smith, Pavel Sumazin, and Michael Q Zhang. Identifying tissue- selective transcription factor binding sites in vertebrate promoters. Proceedings of the National Academy of Sciences of the United States of America, 102(5):1560– 1565, 2005. [212] Nahum Sonenberg and Alan G Hinnebusch. New modes of translational control in development, behavior, and disease. Molecular cell, 28(5):721–729, 2007. [213] Ji-Joon Song, Stephanie K Smith, Gregory J Hannon, and Leemor Joshua-Tor. Crystal structure of Argonaute and its implications for RISC slicer activity. sci- ence, 305(5689):1434–1437, 2004. [214] Rodger Staden. Measurements of the effects that coding for a protein has on a DNA sequence and their use for finding genes. Nucleic acids research, 12(1Part2):551–567, 1984. [215] Noam Stern-Ginossar, Ben Weisburd, Annette Michalski, Vu Thuy Khanh Le, Marco Y Hein, Sheng-Xiong Huang, Ming Ma, Ben Shen, Shu-Bing Qian, Hart- mut Hengel, et al. Decoding human cytomegalovirus. Science, 338(6110):1088– 1093, 2012. [216] Gary D Stormo. DNA binding sites: representation and discovery. Bioinformat- ics, 16(1):16–23, 2000. [217] Gary D Stormo, Thomas D Schneider, Larry Gold, and Andrzej Ehrenfeucht. Use of the Perceptronalgorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Research, 10(9):2997–3011, 1982. 96 [218] Yoichiro Sugimoto, Julian K¨ onig, Shobbir Hussain, Blaˇ z Zupan, Tomaˇ z Curk, Michaela Frye, and Jernej Ule. Analysis of CLIP and iCLIP methods for nucleotide-resolution studies of protein-RNA interactions. Genome Biol, 13(8):R67, 2012. [219] Saleh Tamim, Dat T V o, Philip J Uren, Mei Qiao, Eckart Bindewald, Woj- ciech K Kasprzak, Bruce A Shapiro, Helder I Nakaya, Suzanne C Burns, Patri- cia R Araujo, et al. Genomic Analyses Reveal Broad Impact of miR-137 on Genes Associated with Malignant Transformation and Neuronal Differentiation in Glioblastoma Cells. PloS one, 9(1):e85591, 2014. [220] Scott A Tenenbaum, Craig C Carson, Patrick J Lager, and Jack D Keene. Identi- fying mRNA subsets in messenger ribonucleoprotein complexes by using cDNA arrays. Proceedings of the National Academy of Sciences, 97(26):14085–14090, 2000. [221] Matilde Todaro, Miriam Gaggianesi, Veronica Catalano, Antonina Benfante, Flora Iovino, Mauro Biffoni, Tiziana Apuzzo, Isabella Sperduti, Silvia V olpe, Gianfranco Cocorullo, et al. CD44v6 Is a Marker of Constitutive and Repro- grammed Cancer Stem Cells Driving Colon Cancer Metastasis. Cell stem cell, 14(3):342–356, 2014. [222] James R Tollervey, Tomaˇ z Curk, Boris Rogelj, Michael Briese, Matteo Cereda, Melis Kayikci, Julian K¨ onig, Tibor Hortob´ agyi, Agnes L Nishimura, Vera ˇ Zupunski, et al. Characterizing the RNA targets and position-dependent splic- ing regulation by TDP-43. Nature neuroscience, 14(4):452–458, 2011. [223] Jacob M Tome, Abdullah Ozer, John M Pagano, Dan Gheba, Gary P Schroth, and John T Lis. Comprehensive analysis of RNA-protein interactions by high- throughput sequencing-RNA affinity profiling. Nature methods, 11(6):683–688, 2014. [224] Cole Trapnell, David G Hendrickson, Martin Sauvageau, Loyal Goff, John L Rinn, and Lior Pachter. Differential analysis of gene regulation at transcript res- olution with RNA-seq. Nature biotechnology, 31(1):46–53, 2013. [225] Cole Trapnell, Lior Pachter, and Steven L Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25(9):1105–1111, 2009. [226] Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Mar- ijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and iso- form switching during cell differentiation. Nature biotechnology, 28(5):511–515, 2010. 97 [227] Craig Tuerk. Using the SELEX combinatorial chemistry process to find high affinity nucleic acid ligands to target molecules. PCR Cloning Protocols, pages 219–230, 1997. [228] Jernej Ule, Kirk B Jensen, Matteo Ruggiu, Aldo Mele, Aljaˇ z Ule, and Robert B Darnell. CLIP identifies Nova-regulated RNA networks in the brain. Science, 302(5648):1212–1215, 2003. [229] Jason G Underwood, Andrew V Uzilov, Sol Katzman, Courtney S Onodera, Jacob E Mainzer, David H Mathews, Todd M Lowe, Sofie R Salama, and David Haussler. FragSeq: transcriptome-wide RNA structure probing using high- throughput sequencing. Nature methods, 7(12):995–1001, 2010. [230] Philip J Uren, Emad Bahrami-Samani, Suzanne C Burns, Mei Qiao, Fedor V Karginov, Emily Hodges, Gregory J Hannon, Jeremy R Sanford, Luiz OF Penalva, and Andrew D Smith. Site identification in high-throughput RNA– protein interaction data. Bioinformatics, 28(23):3013–3020, 2012. [231] Philip J Uren, Suzanne C Burns, Jianhua Ruan, Kusum K Singh, Andrew D Smith, and Luiz OF Penalva. Genomic analyses of the RNA-binding protein Hu antigen R (HuR) identify a complex network of target genes and novel character- istics of its binding sites. Journal of Biological Chemistry, 286(43):37063–37066, 2011. [232] H´ ector H Valdivia. One Gene, Many Proteins Alternative Splicing of the Ryan- odine Receptor Gene Adds Novel Functions to an Already Complex Channel Protein. Circulation research, 100(6):761–763, 2007. [233] Stanislas Varenne, Jean Buc, Roland Lloubes, and Claude Lazdunski. Translation is a non-uniform process: effect of tRNA availability on the rate of elongation of nascent polypeptide chains. Journal of molecular biology, 180(3):549–576, 1984. [234] Thanasis Vergoulis, Ioannis S Vlachos, Panagiotis Alexiou, George Georgakilas, Manolis Maragkakis, Martin Reczko, Stefanos Gerangelos, Nectarios Koziris, Theodore Dalamagas, and Artemis G Hatzigeorgiou. TarBase 6.0: capturing the exponential growth of miRNA targets with experimental support. Nucleic acids research, 40(D1):D222–D229, 2012. [235] Dat T V o, Dharmalingam Subramaniam, Marc Remke, Tarea L Burton, Philip J Uren, Jonathan A Gelfond, Raquel de Sousa Abreu, Suzanne C Burns, Mei Qiao, Uthra Suresh, et al. The RNA-binding protein Musashi1 affects medulloblas- toma growth via a network of cancer-related genes and is an indicator of poor prognosis. The American journal of pathology, 181(5):1762–1772, 2012. 98 [236] Christine V ogel, Raquel de Sousa Abreu, Daijin Ko, Shu-Yun Le, Bruce A Shapiro, Suzanne C Burns, Devraj Sandhu, Daniel R Boutz, Edward M Marcotte, and Luiz O Penalva. Sequence signatures and mRNA concentration can explain two-thirds of protein abundance variation in a human cell line. Molecular systems biology, 6(1):400–400, 2010. [237] G¨ unter P Wagner, Koryu Kin, and Vincent J Lynch. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in Biosciences, 131(4):281–285, 2012. [238] Eric T Wang, Neal AL Cody, Sonali Jog, Michela Biancolella, Thomas T Wang, Daniel J Treacy, Shujun Luo, Gary P Schroth, David E Housman, Sita the, et al. Transcriptome-wide regulation of pre-mRNA splicing and mRNA localization by muscleblind proteins. Cell, 150(4):710–724, 2012. [239] Eric T Wang, Rickard Sandberg, Shujun Luo, Irina Khrebtukova, Lu Zhang, Christine Mayr, Stephen F Kingsmore, Gary P Schroth, and Christopher B Burge. Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221):470–476, 2008. [240] Kai Wang, Darshan Singh, Zheng Zeng, Stephen J Coleman, Yan Huang, Gleb L Savich, Xiaping He, Piotr Mieczkowski, Sara A Grimm, Charles M Perou, et al. MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Research, 38(18):e178–e178, 2010. [241] Tao Wang, Beibei Chen, MinSoo Kim, Yang Xie, and Guanghua Xiao. A Model-Based Approach to Identify Binding Sites in CLIP-Seq Data. PloS one, 9(4):e93248, 2014. [242] Tao Wang, Yang Xie, and Guanghua Xiao. dCLIP: a computational approach for comparative CLIP-seq analyses. Genome Biol, 15(1):R11, 2014. [243] Weichen Wang, Zhiyi Qin, Zhixing Feng, Xi Wang, and Xuegong Zhang. Iden- tifying differentially spliced genes from two groups of RNA-seq samples. Gene, 518(1):164–170, 2013. [244] Yulei Wang, Chih Long Liu, John D Storey, Robert J Tibshirani, Daniel Her- schlag, and Patrick O Brown. Precision and functional specificity in mRNA decay. Proceedings of the National Academy of Sciences, 99(9):5860–5865, 2002. [245] Zhen Wang, Melis Kayikci, Michael Briese, Kathi Zarnack, Nicholas M Lus- combe, Gregor Rot, Blaˇ z Zupan, Tomaˇ z Curk, and Jernej Ule. iCLIP predicts the dual splicing effects of TIA-RNA interactions. PLoS biology, 8(10):e1000530, 2010. 99 [246] Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009. [247] Shaun Webb, Ralph D Hector, Grzegorz Kudla, and Sander Granneman. PAR- CLIP data indicate that Nrd1-Nab3-dependent transcription termination regulates expression of hundreds of protein coding genes in yeast. Genome Biol, 15(1):R8, 2014. [248] Eric Westhof and Pascale Romby. The RNA structurome: high-throughput prob- ing. Nature methods, 7(12):965–967, 2010. [249] Thomas D Wu and Serban Nacu. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics, 26(7):873–881, 2010. [250] Yuanchao Xue, Kunfu Ouyang, Jie Huang, Yu Zhou, Hong Ouyang, Hairi Li, Gang Wang, Qijia Wu, Chaoliang Wei, Yanzhen Bi, et al. Direct conversion of fibroblasts to neurons by reprogramming PTB-regulated microRNA circuits. Cell, 152(1):82–96, 2013. [251] Yuanchao Xue, Yu Zhou, Tongbin Wu, Tuo Zhu, Xiong Ji, Young-Soo Kwon, Chao Zhang, Gene Yeo, Douglas L Black, Hui Sun, et al. Genome-wide analysis of PTB-RNA interactions reveals a strategy used by the general splicing repressor to modulate exon inclusion or skipping. Molecular cell, 36(6):996–1006, 2009. [252] Jian-Hua Yang, Jun-Hao Li, Peng Shao, Hui Zhou, Yue-Qin Chen, and Liang-Hu Qu. starBase: a database for exploring microRNA–mRNA interaction maps from Argonaute CLIP-Seq and Degradome-Seq data. Nucleic acids research, 39(suppl 1):D202–D209, 2011. [253] Zizhen Yao, Zasha Weinberg, and Walter L Ruzzo. CMfindera covariance model based RNA motif finding algorithm. Bioinformatics, 22(4):445–452, 2006. [254] Gene W Yeo, Nicole G Coufal, Tiffany Y Liang, Grace E Peng, Xiang-Dong Fu, and Fred H Gage. An RNA code for the FOX2 splicing regulator revealed by mapping RNA-protein interactions in stem cells. Nature structural & molecular biology, 16(2):130–137, 2009. [255] Pei Yu, Ling Zhou, Weifeng Ke, and Ke Li. Clinical significance of pAKT and CD44v6 overexpression with breast cancer. Journal of cancer research and clin- ical oncology, 136(8):1283–1292, 2010. [256] Kathi Zarnack, Julian K¨ onig, Mojca Tajnik, Inigo Martincorena, Sebastian Eustermann, Isabelle St´ evant, Alejandro Reyes, Simon Anders, Nicholas M Lus- combe, and Jernej Ule. Direct competition between hnrnp c and u2af65 protects the transcriptome from the exonization of¡ i¿ alu¡/i¿ elements. Cell, 152(3):453– 466, 2013. 100 [257] Chaolin Zhang and Robert B Darnell. Mapping in vivo protein-RNA interac- tions at single-nucleotide resolution from HITS-CLIP data. Nature biotechnol- ogy, 29(7):607–614, 2011. [258] Chaolin Zhang, Maria A. Frias, Aldo Mele, Matteo Ruggiu, Taesun Eom, Christina B. Marney, Huidong Wang, Donny D. Licatalosi, John J. Fak, and Robert B. Darnell. Integrative modeling defines the Nova splicing-regulatory network and its combinatorial controls. Science, 329(5990):439–443, 2010. [259] Chaolin Zhang, Kuang-Yung Lee, Maurice S Swanson, and Robert B Dar- nell. Prediction of clustered RNA-binding protein motif sites in the mammalian genome. Nucleic acids research, 41(14):6793–6807, 2013. [260] Chaolin Zhang, Zuo Zhang, John Castle, Shuying Sun, Jason Johnson, Adrian R Krainer, and Michael Q Zhang. Defining the regulatory network of the tissue- specific splicing factors Fox-1 and Fox-2. Genes & development, 22(18):2550– 2563, 2008. [261] Haidi Zhang, Fabrice A Kolb, Vincent Brondani, Eric Billy, and Witold Fil- ipowicz. Human Dicer preferentially cleaves dsRNAs at their termini without a requirement for ATP. The EMBO journal, 21(21):5875–5885, 2002. [262] Hao Zheng, Rongguo Fu, Jin-Tao Wang, Qinyou Liu, Haibin Chen, and Shi- Wen Jiang. Advances in the techniques for the prediction of microRNA targets. International journal of molecular sciences, 14(4):8179–8187, 2013. [263] Qin Zong, Mich` el Schummer, Leroy Hood, and David R Morris. Messenger RNA translation state: the second dimension of high-throughput expression screening. Proceedings of the National Academy of Sciences, 96(19):10632–10636, 1999. [264] Michael Zuker and David Sankoff. RNA secondary structures and their predic- tion. Bulletin of Mathematical Biology, 46(4):591–621, 1984. 101 Appendix A Detailed derivation of the EM algorithm Here we present the complete formalization of our method. The data available for the solution of the motif-finding problem is the primary sequences, the secondary structure of those sequences, and the location of the cross-link induced read artifacts, which we call diagnostic events. Given that the primary sequence will always be used, the optional use of the other two pieces of information gives us four different ways our algorithm can be run. The first of these uses only sequence information; the second uses sequence information and pre-computed base-pair probabilities informing us about the structure of the sequences; the third uses sequence information and the locations of the diagnostic events; and finally, the fourth way of running the algorithm uses all of sequence, struc- ture and diagnostic events information. Since we present results for each of these four approaches, we will fully describe here both the model and how the algorithm works in each case. Throughout, we will assume the length of the RBP binding motif is fixed at w nucleotides. Further, we assume that any given sequence may either contain an occurrence of the motif, or it might not – this is the so-called zero-or-one-occurrence- per-sequence assumption, or ZOOPS. Much literature exists on the problem of motif discovery, which we will attempt not to rehash here; for further background and details please refer to [129, 146, 15]. 102 A.1 Sequence-only motif discovery - ZOOPS Model Considering only the sequence, the model parameter set is defined as =fM;f; g. The complete-data likelihood for this model is Pr(S;Xj) = n Y i=1 Pr(S i ;X i j) = n Y i=1 Pr(S i ;X i j;O i ) Pr(O i j) = Pr(Oj) n Y i=1 Pr(S i ;X i jO i ; ) = Pr(Oj) n Y i=1 Pr(S i jO i ;X i ; ) Pr(X i jO i ; ); = Pr(Oj) Pr(XjO; ) n Y i=1 Pr(S i jO i = 0; ) (1O i ) mw+1 Y j=1 Pr(S i jX ij = 1; ) X ij : (A.1) Here, we have two priors, Pr(Oj) and Pr(XjO; ). The first is the prior probability of a motif occurrence, given that the model is known, while the second is the prior proba- bility of the occurrence indicators, given knowledge of the model and which sequences contain the motif occurrences. In the context of the sequence-only model, these two priors may be calculated as Pr(Oj) = Pr(Oj ) = q (1 ) nq ; where we defineq to be the number of sequences which contain an occurrence of the motif, i.e.q = P n i=1 O i , and Pr(XjO; ) = n Y i=1 mw+1 Y j=1 1 mw + 1 X ij : 103 The complete-data likelihood also contains expressions for the probability of observing a sequence, given the location of the motif in the sequence and the model are known, i.e. Pr(S i jX ij = 1; ), as well as the special case where no motif occurrence exists in the sequence, i.e. Pr(S i jO i = 0; ). Again, in the context of the sequence-only model, we may calculate these as Pr(S i jX ij = 1; ) = m Y l=1 U Y b=A ( blj ) bli ; and Pr(S i jO i = 0; ) = m Y l=1 U Y b=A f(b) bli : Here we introduce the indicator variable, which is defined such that bli = 1 if base b appears at position l of sequence i and, bli = 0 otherwise. Finally, we have b;l;j , which is the foreground PWM (i.e.M) whenl falls within the motif occurrence starting at positionj, or the background modelf if it does not. More formally, blj = 8 < : M lj+1 (b) ifjlj +w 1; f(b) otherwise: We employ the expectation maximization (EM) algorithm to estimate parameters for our model, . The algorithm iterates between an expectation step, in which the expected value of the log-likelihood is calculated using current estimates of the parameters and a maximization step, where the model parameters are optimized to find the maximum of this function. 104 A.1.1 Expectation-maximization – E-step The log of the complete-data likelihood is easily derived from Equation A.1 as log`(jS;X) = n X i=1 mw+1 X j=1 X ij log 1 mw + 1 +q log (nq) log(1 ) + n X i=1 (1 P mw+1 j=1 X ij ) m X l=1 U X b=A bli logf(b) + n X i=1 mw+1 X j=1 X ij m X l=1 U X b=A bli log bl 0: (A.2) Let us define Q(; (t) ) as the expected value of the log-likelihood function with respect to X, given our current model estimate, which we will represent as (t) , and our observed dataS. Hence, Q(; (t) ) = E Xj (t) ;S (log`(jS;X)) = n X i=1 mw+1 X j=1 E Xj (t) ;S (X ij ) log 1 mw + 1 +E Xj (t) ;S (q) log (nE Xj (t) ;S (q)) log(1 ) + n X i=1 (1 P mw+1 j=1 E Xj (t) ;S (X ij )) m X l=1 U X b=A bli logf(b) + n X i=1 mw+1 X j=1 E Xj (t) ;S (X ij ) m X l=1 U X b=A bli log bl 0: (A.3) Note here that we can replace the expected value ofq with the explicit summation, i.e. E Xj (t) ;S (q) = n X i=1 mw+1 X j=1 E Xj (t) ;S (X ij ): 105 Notice also that, sinceX ij are Bernoulli random variables,E Xj (t) ;S (X ij ) = Pr(X ij = 1jS i ; (t) ), which we may compute as Pr(X ij = 1jS i ; (t) ) = Pr(S i jX ij = 1; (t) ) Pr(X ij = 1j (t) ) Pr(S i j (t) ;O i = 0) Pr(O i = 0j (t) ) + mw+1 X j 0 =1 Pr(S i jX ij 0 = 1; (t) ) Pr(X ij 0 = 1j (t) ) : Here, the prior probability on occurrence at any given position can easily be calculated by noting that Pr(X ij = 1j (t) ) = Pr(X ij = 1jO i = 1; (t) ) Pr(O i = 1j (t) ) = (t) mw + 1 : The prior on non-occurrence of the motif is also straight-forward: Pr(O i = 0j (t) ) = 1 (t) . The remaining two probabilities on observing the sequence, given either where the motif occurs, or the special case of non-occurrence, have the same form as in the complete-data likelihood, i.e. Pr(S i jX ij = 1; (t) ) = m Y l=1 U Y b=A ( (t) blj ) bli ; and Pr(S i jO i = 0; (t) ) = m Y l=1 U Y b=A f (t) (b) bli : The same can be said for the definition of (t) blj , which is analogously defined as (t) blj = 8 < : M (t) lj+1 (b) ifjlj +w 1; f (t) (b) otherwise: 106 A.1.2 Expectation-maximization – M-step Maximizing for (t+1) with respect toQ(; (t) ) is strightforward. ForM, the MLE is ^ M k (b) = 1 n n X i=1 mw+1 X j=1 (Ifs i;j+k1 =bg) Pr(X ij = 1j (t) ); (A.4) whereIfs i;j =bg is the indicator function, equal to 1 when sequences i contains baseb at positionj, and 0 otherwise. The MLE estimate forf is defined analogously as ^ f(b) = 1 n(mw) n X i=1 m X l=1 mw+1 X j=1 (Ifs i;l =bg)(Ifl= 2[j;j+w1]g) Pr(X ij = 1j (t) ): (A.5) Finally, the MLE for the ZOOPS model parameter, , is ^ = P n i=1 P mw+1 j=1 Pr(X ij = 1j (t) ) n : (A.6) A.2 Sequence and structure motif discovery In order to account for secondary structure, we augmentM andf as follows: M k (b;) = Pr b appears at positionk in the motif with pairing state ; f(b;) = Pr b appears in the background with pairing state ; 107 where2f0; 1g, either paired or unpaired. The form of the complete-data likelihood is unchanged from Equation A.1; all that is required is changes in the equations for computing Pr(S i jX ij = 1; ) and Pr(S i jO i = 0; ), i.e. Pr(S i jX ij = 1; ) = m Y l=1 U Y b=A 1 Y =0 ( blj ) bli Pr(S i jO i = 0; ) = m Y l=1 U Y b=A 1 Y =0 f(b;) bli ; (A.7) where blj = 8 < : M lj+1 (b;) ifjlj +w 1; f(b;) otherwise: and bli = 1 if basel in sequencei isb and has pairing state, and bli = 0 otherwise. A.2.1 Expectation-maximization – E-step The log-likelihood is largely unchanged from Equation A.2; i.e. log`(jS;X) = n X i=1 mw+1 X j=1 X ij log 1 mw + 1 +q log + (nq) log(1 ) + n X i=1 (1 P mw+1 j=1 X ij ) m X l=1 U X b=A 1 X =0 bli logf(b;) + n X i=1 mw+1 X j=1 X ij m X l=1 U X b=A 1 X =0 bli log blj : (A.8) 108 As such,Q(; (t) ) is also similar, and is defined as Q(; (t) ) = E Xj (t) ;S (log`(jS;X)) (A.9) = n X i=1 mw+1 X j=1 E Xj (t) ;S (X ij ) log 1 mw + 1 +E Xj (t) ;S (q) log (t) + (nE Xj (t) ;S (q)) log(1 (t) ) + + n X i=1 (1 P mw+1 j=1 E Xj (t) ;S (X ij )) m X l=1 U X b=A 1 X =0 bli logf (t) (b;) + n X i=1 mw+1 X j=1 E Xj (t) ;S (X ij ) m X l=1 U X b=A 1 X =0 bli log (t) blj ; (A.10) where, just as previously, E Xj (t) ;S (q) = P n i=1 P mw+1 j=1 E Xj (t) ;S (X ij ), and E Xj (t) ;S (X ij ) = Pr(X ij = 1jS i ; (t) ). Further, the same form applies for the cal- culation of the occurrence probabilities, so Pr(X ij = 1jS i ; (t) ) = Pr(S i jX ij = 1; (t) ) Pr(X ij = 1j (t) ) Pr(S i j (t) ;O i = 0) Pr(O i = 0j (t) ) + mw+1 X j 0 =1 Pr(S i jX ij 0 = 1; (t) ) Pr(X ij 0 = 1j (t) ) : Here we amend the sequence components of the likelihood to account for structure by introducing, hence Pr(S i jX ij = 1; (t) ) = m Y l=1 U Y b=A 1 Y =0 ( (t) blj ) bli ; Pr(S i jO i = 0; (t) ) = m Y l=1 U Y b=A 1 Y =0 f (t) (b) bli ; (A.11) 109 where (t) blj = 8 < : M (t) lj+1 (b;) ifjlj +w 1; f (t) (b;) otherwise: (A.12) The prior probabilities on occurrence of the motif, i.e. Pr(X ij = 1j (t) ) and Pr(O i = 0j (t) ) are unchanged from section A.1.1. A.2.2 Expectation-maximization – M-step Maximizing the ZOOPS model parameter is unchanged from section A.1.2. Only minor changes are needed to the maximization step forM andf: ^ M k (b;) = 1 n n X i=1 mw+1 X j=1 bji Pr(X ij = 1j (t) ); (A.13) and ^ f(b;) = 1 n(mw) n X i=1 m X l=1 mw+1 X j=1 bli (Ifl= 2[j;j+w1]g) Pr(X ij = 1j (t) ): (A.14) A.3 Sequence and diagnostic events motif discovery CLIP-seq uses UV light to induce cross-links between RBPs and RNA molecules at the point of interaction. A high density of reads at a given genomic locus indicates that locus is likely to be a binding site. Individual reads can contain artifacts that localize the cross-link location to a single nucleotide (we call these artifacts ”diagnostic events”; see the main manuscript for details of how we extract these from reads). In bringing the locations of diagnostic events into our algorithm to assist in locating RBP binding sites, we make some simplifying assumptions: 110 each sequence comes from a distinct RNA transcript, and every one of these tran- scripts was cross-linked to the RBP of interest. if an RNA transcript is cross-linked, it will be cross-linked at exactly one location. all sequences have equal affinity for the RBP, given that they contain an occur- rence of the motif (put another way, we do not assume some sequences with motif occurrence are more informative than others). LetC i be the (unknown) location of the cross-link for thei th sequence. We assume the distances of the motif occurrences from the cross-link locations (i.e.jC i j +g 2 j, given thatX ij = 1) follow a geometric distribution parameterised byg 1 . Here,g 2 is an offset from the start of the motif occurrence to account for our observation, from CLIP-seq data, that the most likely cross-link location is not always at the motif start location. It is worth pointing out that we do not treat C as latent data. We include the number of diagnostic events in each sequence i at each position j as a fixed parameter of our model,D ij . Hence, the expanded model is defined as follows: =fM;f; ;g 1 ;g 2 ;Dg. We bring the influence of cross-linking information into our algorithm via the prior on motif occurrence locations, Pr(XjO; ). As such, the form of the complete-data likelihood for the model remains unchanged from Equation A.1. The only difference is in computing the prior onX, which can be decomposed as Pr(XjO; ) = n Y i=1 Pr(X i jO i ; ) = n Y i=1 Pr(X i jO i = 1; ) O i Pr(X i jO i = 0; ) 1O i = n Y i=1 mw+1 Y j=1 Pr(X ij = 1jO i = 1; ) X ij : (A.15) 111 Notice here that Pr(X i jO i = 0; ) = 1. As noted above, we assume that the distance of motif occurrences from the cross-link location follows a geometric distribution; we integrate over all possible cross-link locations, hence Pr(X ij = 1jO i = 1; ) = m X l=1 Pr(C i =ljO i = 1; ) g 1 (1g 1 ) jl(j+g 2 )j K : Here, K is a tuning parameter that modulates the impact of diagnostic events on the algorithm. Low values (near 0) reduce the impact of diagnostic events, while higher values increase it. In practice, we set this heuristically using an exhaustive search of possible values and picking the one that maximized the number of motifs recovered from our collection of CLIP-seq data. To make sure that we are not over-fitting this parameter, we performed 1000 bootstrap samples of our data collection and a value of 1:1 proved to be the optimal value. However, this parameter can be adjusted by users if they wish to do so. Using our selection of current publicly available datasets, the default value seems to be optimal, however as CLIP experiment improves the users might feel the need to increase this value for their data sets. The probability of the cross-link location for sequencei being at positionl is estimated from the diagnostic events, so Pr(C i =ljO i = 1; ) = D ij + P m j 0 =1 (D ij 0 +) ; (A.16) or put less formally, the fraction of diagnostic events observed at that location. Here is a pseudo-count to avoid division by 0 in any sequences with no diagnostic events – in practice we set = 1. 112 A.3.1 Expectation Maximization – E-step The log-likelihood of the expanded model is easily arrived at by replacing the prior on X in Equation A.2 with the one given by Equation A.15, hence log`(jS;X) = n X i=1 mw+1 X j=1 X ij log m X l=1 Pr(C i =ljO i = 1; (t) ) g 1 (1g 1 ) jl(j+g 2 )j K ! +q log + (nq) log(1 ) + n X i=1 (1 P mw+1 j=1 X ij ) m X l=1 U X b=A bli logf(b) + n X i=1 mw+1 X j=1 X ij m X l=1 U X b=A bli log bl 0; (A.17) and so Q(; (t) ) = E Xj (t) ;S (log`(jS;X)) (A.18) = n X i=1 mw+1 X j=1 Pr(X ij = 1j (t) ;S i ) log m X l=1 Pr(C i =ljO i = 1; (t) ) g 1 (1g 1 ) jl(j+g 2 )j K ! +E Xj (t) ;S (q) log + (nE Xj (t) ;S (q)) log(1 ) + n X i=1 (1 P mw+1 j=1 E Xj (t) ;S (X ij )) m X l=1 U X b=A bli logf(b) + n X i=1 mw+1 X j=1 E Xj (t) ;S (X ij ) m X l=1 U X b=A bli log bl 0; (A.19) where Pr(X ij = 1j (t) ;S i ) = 113 Pr(S i jX ij = 1; (t) ) Pr(X ij = 1j (t) ) Pr(S i jO (t) i = 0; (t) ) Pr(O (t) i = 0j (t) ) + P mw+1 j 0 =1 Pr(S i jX ij 0 = 1; (t) ) Pr(X ij 0 = 1j (t) ) : Here Pr(S i jX ij = 1; (t) ), Pr(S i jO i = 0; (t) ) and Pr(O (t) i = 0j (t) ) are computed exactly as in the sequence-only method. The only change is that Pr(X ij = 1j (t) ) = m X l=1 Pr(C i=l ) h g (t) 1 (1g (t) 1 ) jl(j+g (t) 2 )j i K : (A.20) A.3.2 Expectation Maximization – M-step Maximization of the sequence component of the model, i.e. M, follows the same pro- cedure described in section A.1.2. With respect to diagnostic events, there are two free parameters of the model that need to be maximized: g 1 and g 2 . There is no closed form maximum likelihood estimator for either of these parameters. To maximizeg 2 , we perform an exhaustive search of all possible values in the range8g 2 8. We max- imizeg 1 by using the Newton-Raphson algorithm to find the root of the first derivative ofQ with respect tog 1 . This is a numerical approach that iteratively refines an initial estimate ofg 1 ,g (0) 1 as follows: g (t) 1 =g (t1) 1 Q 0 ( (t1) ; (t2) ) Q 00 ( (t1) ; (t2) ) : (A.21) 114 Where (t) = f ^ M; ^ f; ^ ;g (t) 1 ; ^ g 2 ;Dg. This process continues until Q( (t) ; (t1) ) Q( (t1) ; (t2) )<, where is some fixed precision threshold. Here Q 0 (; (t) ) = @Q(; (t) ) @g 1 = n X i mw+1 X j " Pr(X ij = 1j (t) ;S i ) m X l=1 Pr(C i =ljO i = 1; (t) )[Kg K1 1 (1g 1 ) Kjl(j+g 2 )j g K 1 (Kjl (j +g 2 )j)(1g 1 ) Kjl(j+g 2 )j1 ] m X l=1 Pr(C i =ljO i = 1; (t) )g K 1 (1g 1 ) Kjl(j+g 2 )j # = K n X i mw+1 X j " Pr(X ij = 1j (t) ;S i ) 1 g 1 m X l=1 Pr(C i =ljO i = 1; (t) )(jl (j +g 2 )j)(1g 1 ) Kjl(j+g 2 )j1 m X l=1 Pr(C i =ljO i = 1; (t) )(1g 1 ) Kjl(j+g 2 )j !# ; (A.22) and Q 00 (; (t) ) = @ 2 Q(; (t) ) @(g 1 ) 2 = K n X i mw+1 X j " Pr(X ij = 1j (t) ;S i ) 1 (g 1 ) 2 D @N @g 1 N @D @g 1 ( P m l=1 Pr(C i =ljO i = 1; (t) )(1g 1 ) Kjl(j+g 2 )j ) 2 !# ; 115 where N = m X l=1 Pr(C i =ljO i = 1; (t) )(jl (j +g 2 )j)(1g 1 ) Kjl(j+g 2 )j1 D = m X l=1 Pr(C i =ljO i = 1; (t) )(1g 1 ) Kjl(j+g 2 )j @N @g 1 = K m X l=1 Pr(C i =ljO i = 1; (t) )(jl (j +g 2 )j)(jl (j +g 2 )j 1)(1g 1 ) Kjl(j+g 2 )j2 @D @g 1 = K m X l=1 Pr(C i =ljO i = 1; (t) )(jl (j +g 2 )j)(1g 1 ) Kjl(j+g 2 )j1 : In practise, we found that Zagros was relatively insensitive to the choice ofg 1 , but opti- mizing this parameter increases the asymptotic runtime. As a trade-off, we computed the optimal value ofg 1 as described above for all of the CLIP-seq datasets in our collec- tion and found the mode of this distribution. By default, Zagros uses this value forg 1 and does not optimize the parameter, however we still provide an option for the user to maximizeg 1 via Newton-Raphson if they wish. 116 Appendix B Determining the secondary structure of an RNA molecule B.1 Computational approaches Computational methods have long been used for understanding the RNA secondary structure. These methods can generally be divided into two groups, methods that cal- culate the minimum free energy structure and the ones that consider the ensemble of structures for the RNA to calculate the base pairing probabilities. In this section we give detailed explanation of both methods. B.1.1 Minimum free energy structure One of the first approaches to the problem of RNA secondary structure prediction was presented by Nussinov [173]. In this approach the idea is simply to maximize the num- ber of paired bases. Although the idea has the correct direction, it is not realistic. In order to get the correct biological structure of the RNA, one should take different kinds of substructures into account. Definition B.1.1 (Stacking substructures). A secondary structure of an RNA is mainly composed of two types substructures: stems and loops. A stem, is contiguous stacked base pairs, and a loop is an unpaired region of the molecule. There are four types of loops: hairpin loops (loops at the end of a stem), bulge loops (single stranded bases 117 occurring within a stem), internal loops (single stranded bases interrupting both sides of a stem), and multi-branched loops (loops from which three or more stems comes out). The idea was first developed by Zuker [264]. In this approach, a structure with the minimum free energy is obtained based on the substructures; stems have stabilizing effects (negative free energy) and loops have destabilizing effects (positive free energy). Definition B.1.2 (k:loop). The formation of base pairs results in a looped region R[i+1..j-1], where we call (i,j) as the closing base pair. A “k:loop” consists ofu unpaired bases andk 1 base pairs. Therefore, loops can be classified to: stems (k = 2;u = 0), hairpins (k = 1), bulges and interior loops (k = 2;u > 0) with contiguous unpaired bases in bulge loops different from interior loops that unpaired bases are not contiguous, and multi-branched loops (k> 2). B.1.2 The ensemble of RNA structures Minimum free energy structure can not always reflect what really occurs in nature. Therefore, methods have been developed to improve our understanding of RNA sec- ondary structure by considering the ensemble of RNA secondary structures and to cal- culate the base pairing probabilities. These methods rely on calculating the partition function using a particular energy model [161]. Partition Function (PF) Input: An RNA sequenceR. Question: The partition function matricesQ;Q b andQ m of the sequenceR. In the standard energy model disallowing pseudoknots, we denote free energy of a loop,L, byG L . The free energy of a specified secondary structures, is G s = X L2s G L ; (B.1) 118 as previously described by. The partition functionQ is the weighted sum over all possi- ble configurations of secondary structures: Q = X s2S e Gs=RT : (B.2) We define the problem of computing the partition function as follows: In order to calculate the partition function of a sequence efficiently, we introduce 3 matrices: Q: partition function over all secondary structures whereQ i;j holds this value for subsequenceR[i::j]. Q b : whereQ b i;j is the partition function value forR[i::j] constrained to have (i;j) paired. Q m : whereQ m i;j is the partition function value forR[i::j] under the constraint that R[i::j] is part of a multiloop that has at least one component. Before deriving the recurrences, energy parameters of the energy model should be introduced. The energy of an empty subsequence is G empty i;j = 0: (B.3) The energy of a hairpin loop closed by base pair (i;j) is G hairpin i;j : (B.4) The energy of an interior loop which is formed by closing base pair (i;j) and interior base pair (g;h) is given by: G interior i;g;h;j : (B.5) 119 The energy of the bulge loops and stems are calculated as a special case of interior loops where in bulges (g =i + 1 orh =j 1) and in stems (g =i + 1 andh =j 1) G multi = 1 + 2 B + 3 U: (B.6) B.2 Emerging experimental approaches In recent years, there have been some attempts to use high-throughput sequencing tech- nologies to directly assay RNA secondary structure. These transcriptome-wide methods are based on structure-specific chemical modification or nuclease digestions followed by high-throughput sequencing to provide comprehensive maps of RNA secondary struc- ture [209]. These approaches shed light on many aspects of post-transcriptional regulatory pro- cesses that are regulated by RNA secondary structure. For example, generally around the start and stop codons of Arabidopsis, yeast, C. Elegans, and Drosophila mRNAs have been shown to be unpaired regions by structural profiles resulted from these exper- iments, which are also supported by computational predictions based on free energy- based structure modeling [112, 137, 209]. We believe that such studies has the potential to uncover RNAs with structures or substructures that may be functional and can discover underlying mechanisms of many post-transcriptional regulations that depend on the structure of the proteins. Here we briefly explain the methodology in each of these experiments. SHAPE-seq Developed by [152], this is in vitro technique, which uses selective 2’- hydroxyl acylation and the resulting fragments are sequenced by primer extension. This method is based on the fact that chemicals such as NMIA or 1M7 binds to unpaired nucleotides of an RNA molecule and blocks reverse transcription. By 120 sequencing the primer extension reactions we can determine the structures of the RNA(s) of interest. This method outputs only single-stranded sequences and the throughput is limited by primer extension [209]. PARS Developed by [112], this is an in vitro technique, which make use of two ribonu- cleases RNase S1 and V1. RNase V1, cuts 3’ of double-stranded RNA, while RNase S1 cuts 3’ of single-stranded RNA so we can capture the double-stranded and single-stranded RNA populations. The advantage of this technique over SHAPE-seq is that it sequences both single and double stranded sequences. FragSeq [229] developed this in vitro technique, which uses the single-stranded ribonu- clease P1 and two background controls to sequence unpaired RNA nucleotides. ss/ds RNA-seq [137] introduced this in vitro technique. It is a ribonuclease-based method that degrades single-stranded RNAs using RNase1 enzyme to capture for paired RNA nucleotides and nuclease RNase V1 to select for single-stranded RNA population. These molecules are then sequenced and analyzed. It is a genome- wide technique that can capture both single and double stranded RNA molecules. These methods have been validated and shown to be effective and efficient to under- stand the RNA secondary structure on a global scale in the genome [248]. In addi- tion, there has been some studies on whole-genome analysis of RNA secondary struc- ture in metazoans [137]. Gregory et al. in their paper, develop a thorough anal- ysis of mRNA secondary-structure correlation between animals. They used a high- throughput, sequencing-based assay to identify the paired and unpaired components of the Drosophila melanogaster and Caenorhabditis elegans transcriptomes. In their studies they have found significant correlation between ssRNAs and dsRNAs and specific epigenetic modifications in animals. They have also shown that mRNA sec- ondary structure surrounding miRNA binding sites is strikingly distinct in Drosophila 121 and C. elegans. They suggest that target mRNA recognition and/or regulation by miR- NAs is significantly different in various animals. 122
Abstract (if available)
Abstract
High-throughput protein‐RNA interaction data generated by CLIP‐seq has provided an unprecedented depth of access to the activities of RNA‐binding proteins (RBPs), the key players in co‐ and post‐transcriptional regulation of gene expression. Motif discovery forms part of the necessary follow‐up data analysis for CLIP‐seq, both to refine the exact locations of RBP binding sites, and to characterize them. The specific properties of RBP binding sites, and the CLIP‐seq methods, provide additional information not usually present in the classic motif discovery problem: the binding site structure, and cross‐linking induced events in reads. We show that CLIP‐seq data contains clear secondary structure signals, as well as technology‐ and RBP‐specific cross‐link signals. We introduce Zagros, a motif discovery algorithm specifically designed to leverage this information and explore its impact on the quality of recovered motifs. Our results indicate that using both secondary structure and cross‐link modifications can greatly improve motif discovery on CLIP‐seq data. Further, the motifs we recover provide insight into the balance between sequence‐ and structure‐specificity struck by RBP binding.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Deciphering protein-nucleic acid interactions with artificial intelligence
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Isoform quantification and splicing regulation analysis in RNA-seq studies
PDF
Integrating high-throughput sequencing data to study gene regulation
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
Mapping genetic variants for nonsense-mediated mRNA decay regulation across human tissues
PDF
Computational methods for translation regulation analysis from Ribo-seq data
PDF
Data-driven learning for dynamical systems in biology
PDF
Sharpening the edge of tools for microbial diversity analysis
PDF
Site-directed spin labeling studies of sequence-dependent DNA shape and protein-DNA recognition
PDF
Quantitative modeling of in vivo transcription factor–DNA binding and beyond
PDF
Improved methods for the quantification of transcription factor binding using SELEX-seq
PDF
Malignant cell fraction prediction using deep learning: from point estimate to uncertainty quantification
PDF
Understanding protein–DNA recognition in the context of DNA methylation
PDF
Computer modeling of protein-peptide interface solvation
PDF
Computational analysis of DNA replication timing and fork dynamics in S. cerevisiae
PDF
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Discovery of mature microRNA sequences within the protein- coding regions of global HIV-1 genomes: Predictions of novel mechanisms for viral infection and pathogenicity
Asset Metadata
Creator
Bahrami‐Samani, Emad (author)
Core Title
Probabilistic modeling and data integration to examine RNA-protein interactions
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
04/24/2015
Defense Date
03/20/2015
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
CLIP‐seq,motif discovery,OAI-PMH Harvest,probabilistic models,RNA‐binding proteins
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Smith, Andrew D. (
committee chair
), Mak, Chiho (
committee member
), Sun, Fengzhu Z. (
committee member
)
Creator Email
bahramis@usc.edu,emad.b.samani@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-560478
Unique identifier
UC11300462
Identifier
etd-BahramiSam-3397.pdf (filename),usctheses-c3-560478 (legacy record id)
Legacy Identifier
etd-BahramiSam-3397.pdf
Dmrecord
560478
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Bahrami‐Samani, Emad
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
CLIP‐seq
motif discovery
probabilistic models
RNA‐binding proteins