Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Innovative sequencing techniques elucidate gene regulatory evolution in Drosophila
(USC Thesis Other)
Innovative sequencing techniques elucidate gene regulatory evolution in Drosophila
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
INNOV ATIVE SEQUENCING TECHNIQUES ELUCIDATE GENE REGULATORY EVOLUTION IN DROSOPHILA by Bradley Jay Main A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (MOLECULAR BIOLOGY) May 2013 Copyright 2013 Bradley Jay Main Table of Contents List of Figures iv List of Tables vii Dedication ix Acknowledgements x Abstract xi Chapter 1: Introduction 1 1.0.1 Gene Expression Studies . . . . . . . . . . . . . . . . . . . 1 1.0.2 Introduction to Current Methods and Technology . . . . . . 4 1.0.3 Overview of Chapters . . . . . . . . . . . . . . . . . . . . . 6 Chapter 2: Allele-specific expression assays using Solexa 7 2.1 New Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.1 Digital ASE assay . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.2 Cis- and trans-regulatory variation within species . . . . . . 16 2.2.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3.1 Fly strains and genotypes . . . . . . . . . . . . . . . . . . . 19 2.3.2 Gene selection . . . . . . . . . . . . . . . . . . . . . . . . 20 2.3.3 Extraction of nucleic acids and cDNA synthesis . . . . . . . 21 2.3.4 PCR and purification . . . . . . . . . . . . . . . . . . . . . 22 2.3.5 Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . 22 Chapter 3: Transcription Start Site Evolution inDrosophila 25 3.1 New Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.1.1 5 0 Anchored Reads . . . . . . . . . . . . . . . . . . . . . . 27 3.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 30 ii 3.2.1 Analysis of 5 0 anchored reads . . . . . . . . . . . . . . . . . 30 3.2.2 TSS Conservation Between Drosophila Species . . . . . . . 32 3.2.3 Conservation of Promoter-Specific Expression and TSS Peak Shape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3.1 Fly strains . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3.2 CAGE library preparation . . . . . . . . . . . . . . . . . . 43 3.3.3 Mapping CAGE Reads . . . . . . . . . . . . . . . . . . . . 44 3.3.4 Connecting Orthologs . . . . . . . . . . . . . . . . . . . . . 44 3.3.5 TSS Peak Calling . . . . . . . . . . . . . . . . . . . . . . . 44 3.3.6 TSS Peak Shape Quantification . . . . . . . . . . . . . . . . 45 Chapter 4: Conclusion 46 4.0.7 Summary of Contributions . . . . . . . . . . . . . . . . . . 46 4.0.8 Future Directions . . . . . . . . . . . . . . . . . . . . . . . 48 4.0.9 Final Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 49 Bibliography 51 .1 Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 .1.1 Allele-Specific Expression Assays Using Solexa - Supple- mental Material . . . . . . . . . . . . . . . . . . . . . . . . 60 .2 Appendix B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 .2.1 Allele-specific expression using Solexa detailed protocol 63 .3 Appendix C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 .3.1 TSS Evolution Supplemental Materials . . . . . . . . . . . 66 .4 Appendix D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 .4.1 5 0 Anchored Paired-End Read Protocol . . . . . . . . . . . . 68 iii List of Figures 2.1 ASE using Solexa. Summary of sample preparation: (a) The for- ward primer is designed to anneal 1 bp upstream of the known SNP. The 5’ tail contains a unique barcode and adapter sequences neces- sary for hybridization to the Solexa sequencing platform. (b) Model of target enrichment and allele-specific expression represented as sequencing coverage per allele. . . . . . . . . . . . . . . . . . . . . 9 2.2 Count-based ASE assays. Allele-specific expression represented as sequencing reads per W-allele (black) and tester allele (white). The sequencing coverage per allele is shown on each bar and the cor- rected RNA (*) is represented as a proportion of each allele. The y-axis is the proportion of the difference (allele1- allele2)/(allele1 + allele2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Nucleic acid samples. The three nucleic acid sample types used in this study: the parental mix, F1 heterozygote, and introgression lines. (a) Representation of all chromosomes for each genetic back- ground assayed. Red represents the W line and black represents the tester line. (b) Zoom-in of chromosome 3 for each genotype. All genes analyzed are between the genetic markers scarlet (st) and ebony (e). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Verification of ASE assays using Solexa. Each data point represents one of three replicate dilutions analyzed at each step of the dilution series (9:1, 8:2, 7:3, 5:5, 3:7, 2:8, 1:9). The distribution of sequenc- ing reads within each gene is demonstrated as follows: mean +/- SE (n = 21). DSX = 11.3 +/- 3.8, CG2604 = 2,478.3 +/- 288.8, CG10824 = 2,756.4 +/- 333.5, CG11459 = 11,578.6 +/- 2515.1. . . . 14 iv 2.5 Intraspecific regulatory variation due to cis and trans. Expression bias is the proportion of the difference between alleles: (allele1- allele2)/(allele1 + allele2). The parental mix is plotted on the x-axis and the F1 heterozygote is plotted on the y-axis. A 1:1 relationship would indicate 100% cis and a slope of zero would indicate 100% trans. A 1:1 line is shown for reference. The vast majority of the expression differences between the W lines and the tester line are attributed to cis, for the four genes and six lines tested (regression coefficient = 0.91+/- 0.13). Color code: line 84 - blue, line 181 - red, line 147 - purple, line 192 - orange, and line 68 - green. . . . . . 18 3.1 This diagram shows sequence turnover as color changes and TSS turnover as movement of TSS location. Note that expression and location are largely maintained during sequence turnover. TSS turnover may be a gradual process with instances of intermediate or incom- plete turnover prior to death or divergence of the redundant TSS. . . 26 3.2 Solid lines represent RNA and dashed lines indicate cDNA. Black lines are biological sequences and red and blue represent Illumina adapters 1 and 2, respectively. There are 12 standard Illumina bar- coding Indexes (see the # symbol) that can be incorporated during PCR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.3 This diagram shows the number of overlapping mel orthologs found in each species. For example, 1285 mel TSS had an identified ortholog in each species and another 103 mel orthologs were only found in sim. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4 These histograms show that mel TSS locations are largely conserved in sim (A), sec (B), and pse (C). Distances were calculated with respect to mel and the coding region. Thus, positive numbers are downstream from mel and negative values are upstream from mel. 35 3.5 Here are detailed examples of TSS variation between species. (A) There is a mel-specific, downstream TSS in CG6409, associated with the broadening of the upstream TSS. This may be an inter- mediate TSS turnover event. pse is not aligned in this region (B) This figure shows a sec-specific deletion (red square) at the TSS associated with nearly a complete loss of activity in sec. Also, pre- dominant pse activity appears to have shifted 20bp upstream. . . . . 36 v 3.6 This figure shows the relatedness of each species and the number of TSS changes in each species with respect to mel (N=2849). *When TSS turnover events were shared by sim and sec, we assumed it occurred prior to speciation of sim and sec. **When mel TSS were diverged between all species pairs, we considered it to be a mel- specific turnover event. . . . . . . . . . . . . . . . . . . . . . . . . 37 3.7 We examined sequence conservation and mutations within 2kb of mel TSS (N=2849). (A) The trend of SNPs (average in 100bp slid- ing window) around all mel TSS in sim (black) and sec (red). (B) Similar to A, but with indels. (C) Average sequence conservation (PhastCons 15 insect species) score around TSS. Note the dip in conservation. (D) We examined the frequency of segregating sites around TSS in emphmel population data. Note the spike in sequence diversity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.8 In the Alcohol Dehydrogenase gene (ADH), we observe significant differences in PSE between species. For example, mel has biased expression of the downstream promoter, like sec. However, pse is restricted to the upstream alternative promoter and sim is restricted to the downstream promoter. . . . . . . . . . . . . . . . . . . . . . 40 1 Optimal window size. We estimated promoter-specific expression and TSS peak shape using a fixed window size. Optimizing the win- dow size is important in order to increase our precision in detecting broad and sharp peaks. . . . . . . . . . . . . . . . . . . . . . . . . 66 2 Entropy distribution. Shown is the frequency distribution of TSS peak entropy estimates for each species. The vast majority of the TSS peaks are high entropy values, indicating that most TSS peaks are broad. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 vi List of Tables 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Solexa Single Lane Sequencing Coverage Per Gene. Coverage changes with different alignment criteria around the SNP. *CG4120 (Cyp21c) was thrown out of analysis because of non-specific amplification on the 2nd (Cyp4ac) and on the 3rd Chromosome (Cyp21c genes). . . . 15 3.1 Method reproducibility . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2 Mapping results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3 Accuracy of TSS estimates . . . . . . . . . . . . . . . . . . . . . . 31 3.4 Mapping percentages of mel TSS to each species. . . . . . . . . . . 33 3.5 Conservation of promoter-specific expression between species . . . 41 3.6 Conservation of TSS peak shape between species . . . . . . . . . . 42 1 Solexa Single Lane Sequencing Coverage Per Gene . . . . . . . . . 60 2 Sequencing Error and Position Effect. Values are the proportion of non-SNP base pairs at different positions in the sequencing read as the barcode length varied. * Note: A/G SNP’s both have a consis- tent bias for miscalling C’s over T’s (data not shown) and a higher error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3 Estimation of the mean binomial sampling, technical, and biological variance (s 2 ) for each gene and type of assay. . . . . . . . . . . . . 61 vii 4 Correction of RNA using ASE in Het DNA. This normalization method was used to calculate the variance in technical and biologi- cal replicates, compared to the binomial distribution. Due perhaps to a difference in size of the two flies in mix or some bias in the assay, the ratios d1/(d1+d2) and d3/(d3+d4) may not both approximate 0.5. We assume that this bias, whatever its reason, is the same for DNA and RNA. Therefore, the observed RNA counts are a product of the DNA counts, which incorporate this bias, and some unobserved expression value, which is what we care about. We are interested in whether the ratios e1/(e1+e2) and e3/(e3+e4) are significantly dif- ferent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 viii Dedication I dedicate this work to my family. ix Acknowledgements My PhD research has been intense because funding was never a concern, amazing post- docs were always around (like Ryan Bickel and Maren Friesen), I traveled regularly for research and conferences, and Sergey refuses to micromanage students. Basically, I was always to blame for my failures and successes. My sincerest thanks to Sergey for guiding me through this process and providing rich soil for me to grow. Special thanks to Lauren McIntyre and Rita Graze who have been awesome collaborators throughout. Perhaps the next most influential person was Maren Friesen who has been a contin- ual source of scientific and professional inspiration and encouragement. I would not be in this position without her. Early on, I gleaned some serious molecular skills and developed character (i.e. patience, work ethic, and self-awareness) while working in the trenches (wet lab) alongside my comrade Joe Dunham. The second half of my PhD started when I chose the red pill (Matrix movie reference) and Matt Dean introduced me to Python programming. Shortly after, I moved to the fourth floor to focus entirely on data analysis and regularly harassing Andrew Smith (our local computational BA). I would be remiss if I did not mention the intellectual and philosophical contributions of Brad Foley and the immeasurable contributions of my top-notch undergraduate assis- tants: Hyo-sik Jang and Eugenia Chang. Honestly, I only deserve half of this PhD. The other half belongs to my beautiful and smart wife Jo, who has faithfully been my sounding board, advisor, and encourager throughout. x Abstract Connecting genetic variants with phenotypic differences is a fundamental aim in bio- logical science. A functional genetic mutation can affect the phenotype by altering the protein sequence of a gene or by perturbing gene expression patterns. The rela- tive contribution of coding versus regulatory variation on phenotypic diversity remains debatable, but the importance of gene expression differences in adaptation, speciation, and disease susceptibility cannot be overstated. In this thesis, I elucidate regulatory evolution in Drosophila using two new high-throughput sequencing library preparation techniques described herein. The first technique PCR amplifies known polymorphic regions of the genome and sequences them with high-throughput sequencing to pre- cisely assay allele-specific expression levels. Allele-specific expression estimates in parents and hybrids can be used to estimate variation at the gene of interest (cis) or else- where in the genome (trans). To validate our method, we measured the allelic bias in a dilution series of homozygous parental alleles and found high correlations between measured and expected values (r>0.9, p< 0.001). We applied this method to a set of 5 genes in a D. simulans parental mix, F1 and introgression and found that for these genes the majority of expression divergence can be explained by cis-regulatory variation and non-additive interactions were not detected. Differences in alternative promoter usage or movement of promoter elements and their associated transcription start site (TSS) may contribute to cis-regulatory variation xi between species. However, TSS evolution remains largely undescribed in Drosophila, likely due to limited annotations in non-melanogaster species. Here, we introduce a second new method that anchors sequencing output to the 5‘ end of mRNA. Using this method, we called TSS locations in four Drosophila species, including: D. melanogaster, D. simulans, D. sechellia, and D. pseudoobscura. For verification, we compared our results in D. melanogaster with known annotations, published 5‘ RACE data, and with RNAseq from the same mRNA pool. Then, we paired 2849 D. melanogaster TSS with its closest equivalent TSS in each species (likely to be its true ortholog) using the avail- able multiple sequence alignments. Most of the D. melanogaster TSS were successfully paired with an ortholog in each species (83%, 86%, and 55% for D. simulans, D. sechel- lia, and D. pseudoobscura, respectively). Based on the number and distribution of reads mapped at each TSS, we also estimated promoter-specific expression and TSS peak shape, respectively. Among paired TSS orthologs, the location and promoter activity were largely conserved. TSS location appears important as promoter-specific expres- sion and TSS peak shape were more frequently divergent among TSS that had moved. Unpaired TSS were surprisingly common in D. pseudoobscura. An increased mutation rate upstream of TSS might explain this pattern. We found an enrichment of ribosomal protein genes among diverged TSS, suggesting that TSS evolution is not uniform across the genome. xii Chapter 1 Introduction 1.0.1 Gene Expression Studies Characterizing the genetic basis for phenotypic diversity remains a major aim in evolu- tionary biology [1]. This is important for understanding the cause of genetic diseases, developing personalized medicine, improving agriculture, and understanding fundamen- tal evolutionary processes like adaptation and speciation. A functional genetic vari- ant may alter the protein sequence or gene expression pattern of a given gene, but the relative importance of each remains debatable [2, 3, 4]. Protein changes (i.e. non- synonymous mutations) are identifiable from the DNA sequence alone and thus, can be quickly assessed. However, characterizing regulatory variants remains challenging because functional regulatory elements are still being discovered, and even when muta- tions occur in known elements, the impact on expression levels can be complex and non-additive [5]. Challenges aside, the significance of gene regulatory variation in evo- lution has been appreciated since the 1970’s, when King and Wilson highlighted the dearth of protein coding differences underlying extensive phenotypic diversity between human and chimps [6]. More recently, direct connections have been made between adaptive phenotypes and regulatory variants. For example, pigmentation patterns in North American Drosophila species [7] and lactose tolerance in humans [8]. At a given gene, expression levels may vary between individuals due to mutations at the gene of interest (cis) or elsewhere in the genome (trans). For example, a cis-acting mutation might occur in the promoter or enhancer sequence of a gene, whereas a non- synonymous mutation that perturbs DNA binding of a transcription factor would act in 1 trans. Mutations in cis are physically linked to the gene of interest and thus, influence expression levels in an allele-specific manner. On the other hand, trans-variation affects both parental alleles equally, barring co-evolution of cis and trans variants (discussed later)[9]. Trans-variation may also impact many genes (e.g. downstream targets in a network), while cis-variation is commonly limited to the gene of interest. Thus, trans- mutations are commonly pleiotropic (i.e. influence more than one distinct trait) and thereby, are more likely to be deleterious [2]. Of course, the magnitude and dominance associated with each individual mutation will affect the likelihood of fixation. The rela- tive contribution of variation in cis and trans underlying the total expression differences within and between species is indicative of the relative selection pressures on cis and trans and is informative for how gene expression evolves. By analogy, does the race (i.e. evolution) favor the tortoise (cis) or the hare (trans)? To approximate the relative contributions of cis- and trans-regulatory variation, a previous study by Wittkopp et al. compared total expression differences between two Drosophila species (D. melanogaster and D. simulans) and allele-specific expres- sion (ASE) differences between parental alleles in their interspecific F1 hybrid [10]. Using this ingenious approach, variation in cis can be estimated from the allelic bias in F1 hybrids, where the trans environment is shared equally between parental alleles. Then, trans variation can be inferred from the portion of overall expression differences between the parents that are not explained by cis. Typically, regulatory variants are point mutations that directly affect transcription factor binding and/or transcript initiation, but variation in the environment, gene copy number, transcript stability, epigenetic marks, and small RNAs may also contribute to expression differences. Using Wittkopp et al.’s approach, each source of variation will be partitioned into cis and trans depending on whether they act in an allele-specific manner (cis) or not (trans). Gene copy number is a notable exception and could be confounded in both cis and trans depending on linkage 2 between gene copies and the ability to differentiate expression from each copy. Wittkopp et al. found that the majority of observed variation in expression between Drosophila species was due to cis-differences [10]. To date, the most popular view on the relative importance of cis and trans on the evolution of gene expression is that cis dominates between species and trans dominates within [11]. However, reports in the literature are far from a consensus. Studies within species in yeast [12], Drosophila [13, 14], and mouse have reported that roughly 70% of variation is due to trans. Between D. melanogaster and D. simulans, Wittkopp et al. reported that nearly all of the selected genes were divergent in cis, while roughly half were variable in trans. McManus et al. recently reported that more genes vary in trans than in cis between D. melanogaster and D. sechellia [15]. In an aneuploid mouse cell line with human chromosome 21, human expression patterns were recapitulated in the mouse trans environment, suggesting cis is important [16]. Perhaps, the most telling paper is a mouse study in which trans explained 65% of variation when detection thresh- olds were low, but cis explained nearly all expression differences when the significance threshold was much stricter [17]. They concluded that cis-variation is easier to detect. As a result, reports of cis and trans can vary widely due to differences in sample size and the precision of the technology used. The biological theory behind the preferential accumulation of cis or trans over evolutionary time is that cis-mutations are preferably fixed due to their higher likelihood to be additive [18] and weaker pleiotropic effects [19]. Theoretically, estimates of cis and trans can be complicated by the presence of non- additive, or epistatic, interactions between the two. For example, if the combination of an upregulating cis-allele and an upregulating trans-allele results in downregulated gene expression at the gene of interest, that is indicative of a cis-by-trans, non-additive interaction. Epistasis can involve many other types of interactions [20], but we focused 3 on cases where one trans environment/factor preferentially affects the expression of one parental allele over another (cis-by-trans interactions). To uncover potential cis- by-trans interactions, we created introgression lines by selectively backcrossing (using mutant markers) a 12MB portion of the 3rd chromosome of each D. simulans genotype into the genetic background of a single tester line. To assess cis-by-trans interactions we compared the allelic bias in F1 hybrids with the allelic bias in the introgression hybrid (partial hybrid), where the majority of the genome is homozygous for the tester line. In this comparison, cis differences are the same for all genes tested, but the trans envi- ronment is different between the full and partial hybrids. Thus, significant changes in allelic bias between the full and partial hybrids would indicate cis-by-trans interactions. 1.0.2 Introduction to Current Methods and Technology Major technological advances have been made in recent years to confront the challenges inherent in allele-specific gene expression studies. The first ASE assays were performed using RT-PCR with custom primers and radioactive dNTP’s [21]. Traditional Sanger sequencing has also been used to estimate relative allelic abundance based on relative fluorescent peak heights in the sequence output [22]. However, these approaches require five or more replicates and are low throughput. Next, pyrosequencing, a sequencing-by- synthesis approach, was introduced that has proven very effective at estimating ASE [23, 10]. Still, pyrosequencing is relatively low-throughput, so the introduction of high-throughput microarrays with reproducible and transcriptome-wide gene expres- sion estimates was well received [24]. That said, in a collaborative study, we show that using microarrays to decipher ASE levels in hybrids requires highly sophisticated array designs and statistical normalization methods [25]. Next, high-throughput sequencing was introduced, which unlike microarrays, does not require prior knowledge of the ref- erence sequence and ASE can be assayed from the relative abundance of allele-specific 4 nucleotide variants in the sequencing output. New issues have emerged like allele- specific alignment bias and allele-specific amplification bias, but these have been largely resolved computationally [26]. Additionally, sequencing of genomic DNA in parallel with RNA seems to correct allele-specific bias by using any allelic bias observed in the DNA (expected to be one-to-one) to normalize the allelic bias estimated from the RNA. Initially, obtaining sufficient RNAseq coverage to estimate ASE was costly. In a typical hybrid, most of the reads align to regions that are not polymorphic and thus, these reads are not useful for ASE estimates. Thus, I developed a targeted approach that specifically amplified known polymorphic regions in transcripts. Using this approach, all reads are useful for ASE assays. To multiplex, we barcoded each sample and pooled over 300 assays in a single solexa run, including technical and biological replicates ([27]; see chapter 2). This is an attractive alternative to typical targeted techniques used for ASE assays like pyrosequencing [10]. Our paper, among others, also showed that high-throughput technologies tolerate considerable customization of the protocol. As technology has advanced, the cost per sequencing read has been reduced dramat- ically, making ASE assays with RNAseq on a population scale more feasible. RNAseq is advantageous because it facilitates the discovery and relative quantification of unex- pected transcriptional intricacies like alternative splicing isoforms [28] and small RNAs [29]. That said, certain elements like downstream alternative transcription start sites (TSS) remain hard to resolve with RNAseq due to overlapping transcripts. This is a critical drawback because TSS are difficult to identify based on sequence motifs alone. Furthermore, alternative promoter usage is common in higher organisms (e.g. humans) and has been shown to be involved in disease [30, 31]. Thus, biological assays which specifically target and estimate TSS location and promoter activity are valuable. Pre- viously, techniques like FLcDNA [32] were used, but they were low throughput. More high-throughput approaches have been introduced, like cap analysis for gene expression 5 (CAGE), but some drawbacks remain. For example, most CAGE methods produce short tags, single-end reads, and may require semi-suppressive PCR to enrich for the desired product. For my research, I developed a new library preparation technique that anchors all sequencing reads to the extreme 5’ end of mRNA and yields long, paired-end reads without the need for semi-suppressive PCR (see chapter 3). This approach facilitates the identification of novel TSS and promoters and can be used to estimate promoter-specific expression (PSE), without the aforementioned drawbacks. 1.0.3 Overview of Chapters In chapter 2, I introduce a novel targeted sequencing method and use it to demonstrate that regulatory variation within a sample population of D. simulans is largely explained by cis. Thus, cis variation is an important source of genetic diversity that can be acted upon by natural selection. One component of cis is the core promoter and its associated TSS. Is cis variation occurring in enhancer regions only, or is the core promoter also varying? In Chapter 3, I introduce another targeted sequencing method and use it to locate TSS and characterize promoter activity in four Drosophila species. Although TSS are largely conserved between species, among cases of TSS movement, promoter activity is more likely divergent. Furthermore, I report a trend of elevated mutation rate upstream of TSS, likely due to the open state of chromatin at active genes. 6 Chapter 2 Allele-specific expression assays using Solexa Genotype-phenotype mapping is a fundamental aim of biological science. This is crit- ical for many goals such as understanding of how genetic architecture shapes pheno- typic variation and adaptation [33, 34, 35], and more specific aims such as decipher- ing how genetic variation in humans may affect response to treatment [36, 37]. Many genetic variants resulting in phenotypic differences are mediated through changes in gene expression. Thus, analyzing gene expression allows us to better understand geno- typic variation. Variation in gene expression can be due to polymorphisms both at the gene locus (cis) and in other genes that influence its expression (trans), as well as the non-additive interactions between the two (cis-by-trans) [38]. Furthermore, epigenetic mechanisms [39], chromatin conformation [40], copy number variation [41, 42] and microRNA [43] all play important roles in the transcription of a given gene. By parti- tioning regulatory variation into cis, trans, and cis-by-trans, we can identify their respec- tive contributions to changes in gene expression and potentially how expression levels evolve within the genomes of complex organisms [11, 44]. Allele-specific expression (ASE) studies have introduced a creative method to uncover the respective contributions of cis- and trans-regulatory variation [45, 46, 10, 47, 45]. First, total expression differences are measured from a pooled sample of two homozygous lines. Then, cis-regulatory variation is estimated from the allelic imbal- ance (unequal expression of alleles) in the corresponding F1 heterozygote, where each 7 allele is regulated by the same trans-factors [22]. Trans can then be inferred from the total expression differences that are not explained by cis. Of course, these inferences about cis- and trans-regulatory variation can be complicated if cis-elements and trans- factors interact non-additively [45, 48, 49]. Allelic imbalance in non-imprinted genes has been shown to be common in mice, maize and humans [22, 50, 51]. Also, a few studies have investigated cis- and trans-regulatory variation within and between species of Drosophila. Measuring ASE, Wittkopp et al. reported that cis-regulatory variation plays a predominant role in divergent gene expression between D. melanogaster and D. simulans [10]. Allele-specific expression has been measured using various targeted approaches including reverse transcription-PCR (asRT-PCR; [21]), and pyrosequencing [10, 23]. Genome-wide approaches have also been used including serial analysis of gene expres- sion (SAGE) [52, 53], massively parallel signature sequence (MPSS)[54, 55], and microarray-based methods [51, 56]. Here, we introduce a simple approach to ASE assays that combines a targeted approach to gene expression assays with the power of high-throughput sequencing. In brief, transcripts of interest (containing a known SNP) are PCR enriched and barcoded to enable large-scale multiplexing. Using this approach, we sequence only regions of interest and allele-specific read counts are used to estimate ASE for large numbers of samples using a single lane of a Solexa flowcell (Figure 2.1 and 2.2). To demonstrate its application, we investigated variation in gene expression in a set of five Drosophila simulans genes. Using a common tester line, we measured ASE in an equal mix of homozygotes (parental mix), a heterozygote, and an introgression (Figure 2.3). By analyzing changes in ASE under these different regulatory conditions, we elucidate the respective contributions of cis, trans, and cis-by-trans interactions on variation in gene expression. Furthermore, we tested for within-species variation in cis 8 and trans by comparing trends in ASE between six highly inbred lines collected from a single population. Figure 2.1: ASE using Solexa. Summary of sample preparation: (a) The forward primer is designed to anneal 1 bp upstream of the known SNP. The 5’ tail contains a unique barcode and adapter sequences necessary for hybridization to the Solexa sequencing platform. (b) Model of target enrichment and allele-specific expression represented as sequencing coverage per allele. 9 Figure 2.2: Count-based ASE assays. Allele-specific expression represented as sequenc- ing reads per W-allele (black) and tester allele (white). The sequencing coverage per allele is shown on each bar and the corrected RNA (*) is represented as a proportion of each allele. The y-axis is the proportion of the difference (allele1- allele2)/(allele1 + allele2). 2.1 New Method 2.1.1 Digital ASE assay In this study, we introduce an accurate approach to allele-specific expression assays that relies on read counts generated from Solexa sequencing. For each gene of interest, a single nucleotide polymorphism (SNP) within the mRNA transcript was identified that differs between the lines of interest and our tester line. We then designed a PCR 10 Figure 2.3: Nucleic acid samples. The three nucleic acid sample types used in this study: the parental mix, F1 heterozygote, and introgression lines. (a) Representation of all chromosomes for each genetic background assayed. Red represents the W line and black represents the tester line. (b) Zoom-in of chromosome 3 for each genotype. All genes analyzed are between the genetic markers scarlet (st) and ebony (e). primer that annealed immediately flanking the 5’ end of the SNP and another primer that annealed 200-300 base pairs downstream (Figure 2.1a). In the primer design, we included adapter sequences, provided by Illumina, as 5 tails to allow these PCR products to be sequenced on the Solexa platform without additional steps. When we planned on analyzing more than one sample per gene, a unique forward primer was ordered for each sample that contained the common elements described above, plus a unique one to three base pair barcode sequence in the 5’ tail that will allow for individual sample identification (Figure 2.1a). We used a touchdown PCR cycling program to enrich each sample for our target region and then purified the amplicons using gel extraction. We 11 then pooled the purified samples in large numbers (in our case, 300) and sequenced them in parallel on a single lane of a Solexa flow cell. The resulting reads were analyzed by assigning each read to a specific gene based on homology and sample based on the unique barcode (see methods). We can then compare the number of reads containing each SNP to have a digital representation of ASE (Figure 2.1b and 2.2). In all comparisons, both alleles were amplified in the same reaction and thus utilized the same reagents. As a result, each allele should theoretically maintain the same relative abundance throughout amplification. However, this may not be the case if small differ- ences in primer hybridization or polymerase efficiency exist between alleles. We can control for this error in amplification by analyzing heterozygous DNA samples, where the actual allele frequency is 50:50, and then correcting each RNA sample by the differ- ence observed in the heterozygote ([10] and Table 4 in Appendix A). DNA samples from each mixed sample were also analyzed in order to correct for both allele-specific ampli- fication error and differences in cell number between the individuals extracted from each parental line [10]. We report allelic imbalance as: the proportion of reads that are dif- ferentially expressed ((allele1 allele2)/(allele1 + allele2)). This approach allows us to easily estimate the binomial sampling error. If there is no difference between alleles, the bias is zero, while the bias is positive if the first allele is favored and negative if the second allele is favored. To verify the accuracy of Solexa and the normalization procedure, we created three replicate dilution series using genomic DNA from the tester line and an experimental line (W line 84). Then we estimated the allelic bias at each step of the dilution (in multi- plex) using a separate sequencing lane. All four genes demonstrated a strong correlation (r>0.9 p<0.001) between the expected and observed allelic bias (Figure 2.4). The gene 12 dsx had a relatively low correlation, because there was very limited sequencing cover- age for all samples within this gene. Thus, the Solexa sequencing output can be used to accurately measure the relative abundance of alleles in a given sample. By analyzing technical variation and correlation coefficients in the verification experiment, it appears that coverage on the order of a few hundred reads is sufficient to yield reproducible results (Figure 2.4 and Additional file 3). Thus, increased bio- logical replication should be favored over sequencing coverage beyond a few hundred reads (see table 3). In this study, coverage varied widely between genes (see table 2.1 and 2.2), while coverage of individual assays within each gene was similar (see table 3). This is most likely due to variation in initial transcript abundance, variation in amplifica- tion efficiency between genes and the resulting uneven pooling of DNA between genes. Therefore, we suggest that if the pooling of DNA is done carefully based on the concen- tration of each purified sample or PCR amplicon band intensity, the sequencing coverage can be more evenly distributed. This will also increase the number of high-quality ASE assays that can be performed on a single lane of Solexa. The cost per base of sequence using Solexa is very low, so for large-scale projects, the preparation costs such as barcoded primers become the limiting economic factor. To address these concerns, we suggest one of the following approaches depending on the type of project: For studies where many assays are performed with a few select genes, barcode costs can be significantly reduced if paired-end sequencing reads are combined with multiplicative barcoding. For example, instead of ordering 900 barcoded adapters for a given gene, we can create 900 unique barcode combinations using only 30 bar- coded forward primers and 30 barcoded reverse primers. For studies involving large gene sets, barcoded adapter sequences can be added to typical gene-specific anneal- ing primers before PCR using single-strand ligation. Using this approach, barcoded adapter sequences can be used in multiple experiments. We have shown that Solexa can 13 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 dsx Dilution Series Sample Estimate 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 CG2604 Dilution Series Sample Estimate 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 CG11459 Dilution Series Sample Estimate 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 CG10824 Dilution Series Sample Estimate r = .85 r = .95 r = .98 r = .97 Figure 2.4: Verification of ASE assays using Solexa. Each data point represents one of three replicate dilutions analyzed at each step of the dilution series (9:1, 8:2, 7:3, 5:5, 3:7, 2:8, 1:9). The distribution of sequencing reads within each gene is demonstrated as follows: mean +/- SE (n = 21). DSX = 11.3 +/- 3.8, CG2604 = 2,478.3 +/- 288.8, CG10824 = 2,756.4 +/- 333.5, CG11459 = 11,578.6 +/- 2515.1. effectively estimate ASE using this targeted approach and we mention these additional modifications to allow easier adaptation for future researchers. 14 Required perfect alignment of 5bp pre-SNP and 3bp post-SNP Gene Reads % CG11459 2,088,931 35% CG2604 1,434,025 24% DSX 963,692 16% CG10824 107,315 2% *CG4120 1,065,503 18% Unaligned 340,501 5% Total reads 5,999,967 100% Table 2.1: Required perfect alignment of 5bp pre-SNP and 8bp post-SNP Gene Reads % CG11459 2,043,861 34% CG2604 1,401,941 23% DSX 924,548 16% CG10824 103,581 2% *CG4120 9,109 0.20% Unalignable 1,516,927 25% Total reads 5,999,967 100% Table 2.2: Solexa Single Lane Sequencing Coverage Per Gene. Coverage changes with different alignment criteria around the SNP. *CG4120 (Cyp21c) was thrown out of anal- ysis because of non-specific amplification on the 2nd (Cyp4ac) and on the 3rd Chromo- some (Cyp21c genes). 2.2 Results and discussion 2.2.1 Error analysis To understand the error in quantifying ASE, we looked at sources of variation at multi- ple levels. First, we estimated sequencing error at the SNP position by identifying the proportion of reads with an incorrect base-call. Sequencing error was well below 2% for most of the genes. Furthermore, this error rate did not seem to change when the position of the SNP within the read shifted due to barcode lengths changing from one 15 to three base pairs (see table 2). Second, we estimated the binomial sampling error and found that this error quickly becomes negligible with the high coverage obtained with this method (see table 3). Third, we assessed the error introduced by the method, such as polymerase and reverse transcription error by comparing allelic imbalance between technical replicates (cDNA templates created separately from the same RNA pool). This technical variation was considerably more than binomial sampling once coverage was over a few hundred reads (see table 3). And finally, to analyze overall variation in expression estimates, including dynamic gene expression in vivo, we compared ASE estimates between biological replicates (separately collected material from the same genetic cross). The biological variation was greater than technical variation in this study, but an accurate assessment of the relative magnitude of technical and biological variance is beyond the scope of this study and thus, both should be considered when designing future experiments (see table 3). 2.2.2 Cis- andtrans-regulatory variation within species We used six highly inbred lines of D. simulans from Winters, CA (W Lines) and a scar- let ebony (st e) tester line in this study. For each W line, we compared expression to the tester line in a parental mix (i.e. an equal mix of homozygous tester and W line flies), the related F1 heterozygote, and a corresponding introgression (see methods for details) (Figure 2.3). This experimental design allows us to assess intraspecific regulatory vari- ation in cis, trans and cis-by-trans. To do this, we estimated the overall expression differences in four genes in the aforementioned parental mix. Then, we determined cis-regulatory variation from the allelic imbalance present in the related F1 heterozy- gote, where trans-factors affect each allele equally. We can then infer trans from the overall expression difference that is not explained by cis. In the four genes analyzed, 16 cis-regulatory variation explains the majority of the overall divergence in gene expres- sion between the tester line and the W lines (Figure 2.5, regression coefficient = 0.91 +/- 0.13, [11]). It should be noted that this is a small gene set and most of these genes have been shown to be variable within-species (see methods). Thus, we do not con- sider this result to be a reflection of the genome as a whole. One explanation for the small contribution of trans in these genes is that trans-variation has an increased poten- tial for pleiotropic effects, some of which may be deleterious and removed by purifying selection in the W lines tested. Gene expression is a result of cis-regulatory elements interacting with trans-regulatory proteins. If there is variation in both cis-elements and trans-proteins, there is the potential for non-additive interactions (cis-by-trans) [57]. To test for cis-by-trans, we compared allelic imbalance in the heterozygote and the intro- gression within each W line (Figure 2.3). If all genes act additively, allelic imbalance in the heterozygote and introgression should be equal because the cis-regulatory elements for each allele and the trans-factors within each genotype are identical. If allelic imbal- ance between these genotypes is not equal, that difference is due to cis-by-trans. We lacked the statistical power to individually detect these cis-by-trans interactions, but we were able to test for a systematic shift in expression between these genotypes across all genes and W lines. We hypothesized that if the cis-elements and trans-factors within the tester line and the W lines co-evolved, non-additive interactions would be relatively common and therefore we might detect significant cis-by-trans effects across all genes and lines. To test for this, we analyzed the relationship between heterozygous and intro- gression ASE values for the 6 W lines and 4 genes using a linear model. The regression coefficient is not significantly different from one (p= 0.5) and thus, there is no system- atic cis-by-trans interactions detected within our sample population and selected genes. This result, together with recent findings seems to be indicating that at least within Drosophila, epistasis may be rare or is small in magnitude [45]. This may also be due 17 −1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0 expression bias between line and tester (line/tester) expression bias of alleles in F1 hybrids (line F1 /tester F1 ) ● ● CG10824 CG11459 CG2604 dsx ● ● ● ● ● Figure 2.5: Intraspecific regulatory variation due to cis and trans. Expression bias is the proportion of the difference between alleles: (allele1- allele2)/(allele1 + allele2). The parental mix is plotted on the x-axis and the F1 heterozygote is plotted on the y-axis. A 1:1 relationship would indicate 100% cis and a slope of zero would indicate 100% trans. A 1:1 line is shown for reference. The vast majority of the expression differences between the W lines and the tester line are attributed to cis, for the four genes and six lines tested (regression coefficient = 0.91+/- 0.13). Color code: line 84 - blue, line 181 - red, line 147 - purple, line 192 - orange, and line 68 - green. 18 to the lack of population structure in Drosophila, which would hinder the co-evolution of divergent cis-elements and trans-factors. To compliment our previous estimates of cis and trans between the tester line and the W lines, we measured variation in cis between W lines in order to give additional insight into the type and magnitude of regulatory variation that may be segregating in natural populations. To identify this variation we tested for a significant effect of a given W line on the overall estimate of cis and trans. Using this approach we were unable to detect significant variation between lines (p = 0.31). We did however find evidence for variation between the W lines and the tester line suggesting that genetic variation may be more common between populations (Figure 2.2). 2.2.3 Conclusions We have presented a new method that uses Solexa to accurately measure ASE for hun- dreds of samples using as little as one lane of a Solexa flowcell. This will be a valuable technique for analyzing a few genes for many individuals or under many conditions, measuring a large selection of genes for a few individuals, and verifying ASE estimates from genome-wide expression assays. 2.3 Methods 2.3.1 Fly strains and genotypes D. simulans lines used in this study were collected from the Wolfskill orchard in Winters, CA and inbred for a minimum of 20 generations to create highly inbred lines (W lines) [30]. A D. simulans tester line containing the visible mutations scarlet (st) and ebony (e) (st ((chr3L: 15833418-15836165), e (chr3R: 4397293-4404648)) was obtained from 19 the Drosophila stock center at UCSD (stock number 14021-0251.041) and inbred for over 20 generations. We analyzed ASE in a parental mix, heterozygote, and introgres- sion for each W line against the common tester line (lines: 84, 181, 147, 201, 68, 192) (Figure 2.3). To create the parental mix, homozygous male flies from a given W line were homogenized together with homozygous male flies from the tester line. The het- erozygous genotype was the F1 male progeny from a cross between a given W line male and two virgin tester line females. And finally, to construct introgression lines, F1 heterozygous females were created from an initial cross between a given W line female and st e male. Then they were backcrossed to the st e line. After each gen- eration, wild-type females were selected (heterozygous genotype at both markers) for subsequent backcrossing. This process was repeated for a minimum of 20 generations to create an introgression line. Thus the introgression line is heterozygous for approx- imately 12 Mb between the genetic markers st and e on the third chromosome, while the remainder of the genome is homozygous for the st e line (Figure 2.3). Two replicate introgression lines were created in parallel for each W line (introgression A and B) to control for expression differences due to different introgression cutoff points flanking the visible markers. Reciprocal crosses were not analyzed, but the effects of genomic imprinting, the only type of parent-of-origin effects that would impact ASE, are likely rare or do not exist in adult Drosophila [58]. All flies were reared at sin25 C with a 12h:12h light cycle. 2.3.2 Gene selection We chose a set of genes within the introgressed region on chromosome three for further analysis (Figure 2.3b). Genes were chosen based on current interest in the lab (dsx) and previous analyses showing them to vary within species (Cyp21c, CG10824, CG2604) [59] or between species (CG11459) [10]. We resequenced most of the coding region 20 of each gene and then designed our gene-specific primers based on the location of a single nucleotide polymorphism (SNP) between the W lines and the st e tester line. The Cyp21c primers non-specifically amplified a homologous gene (Cyp4ac3), thus it was not analyzed further. 2.3.3 Extraction of nucleic acids and cDNA synthesis We extracted DNA and RNA in parallel from 14 whole-body adult male flies for each sample using a modified protocol for Promega’s SV Total RNA Extraction Kit [10]. Experimental samples included a parental mix and a heterozygous DNA sample, along with cDNA from the parental mix, F1, and introgression A and B (see above) for each W line. We included replicate biological samples (separately collected material from the same genetic cross) for all experimental samples involving W line 181 and 84. Also, we included technical replicate (cDNA templates created separately from the same RNA pool) samples for all biological samples involving W lines 84 (Additional file 1). To extract nucleic acid from the parental mix, we homogenized seven homozygous male flies from a given W line together with seven homozygous male flies from the st e line. The heterozygous and the introgression genotypes were extracted from 14 male progeny. Before extraction, adult male flies (three to five days old) were snap-frozen in liquid nitrogen in the morning and stored at -70 C until extraction. Briefly, we passed the supernatant from fly homogenate through an affinity column optimized for bind- ing DNA. Then the flow-through was run through another column optimized to bind all RNA. We then treated RNA columns with DNAse, followed by subsequent wash- ing steps and elution. Using Applied Biosystem’s (AB) High-Capacity cDNA Reverse Transcription Kit we immediately synthesized cDNA from the RNA samples using AB’s standard protocols with RNAse inhibitor. Following extraction, we stored all cDNA and DNA samples at -70 C until further preparation. 21 2.3.4 PCR and purification PCR primers for the ASE assay were designed such that one annealed immediately flanking the five prime (5’) end of the SNP to be sequenced and the other, 200-300 base pairs downstream (3’). Then we modified each primer design with a custom 5’ tail consisting of Solexa adapter sequences. Furthermore, to allow for multiplexing, a sample-specific barcode (one to three base pairs) was added to the design between the adapter and the annealing primer (Figure 2.1 and Additional file 2). All PCR reactions were performed using Finnzymes Phusion High-Fidelity DNA Polymerase and a touchdown cycling program stepping down 1 each cycle from 70 C to 60 C, followed by an additional 25 cycles at 60 C annealing. Samples were run on a 2 percent agarose gel for amplicon size confirmation and agarose gel purification. PCR products were gel purified using Qiagen’s Qiaex II Gel extraction kit and standard protocols. Following purification, a portion of each of the 300 PCR enriched samples was pooled together, ethanol precipitated and then resuspended to a concentration of 10nM. The pooled sample was sequenced at Cornell’s core sequencing center using a custom primer. 2.3.5 Data analysis Using a Perl script, the 6 million reads generated from a single lane of a Solexa flow cell were assigned to one of our five genes based on the first eight base pairs (allowing for one mismatch) of the annealing primer. Within each gene pool, the reads were then separated into 60 unique ASE assays using the one to three base pair sample-specific barcodes (Figure 2.1). We then checked for a known sequence surrounding the SNP, including five base pairs 5 of the SNP and eight base pairs 3 of the SNP. If these 13 base pairs (5bp pre-SNP + 8bp post-SNP) could not be identified (allowing for a single 22 mismatch) the reads were discarded. We then scored each read for the nucleotide in the SNP position to determine the allele-specific read count (Additional file 2). To determine the respective contributions of cis- and trans-regulatory variation, we compared ASE in the parental mix flies and the heterozygous flies. ASE differences in the parental mix reflect total variation in gene expression, including cis and trans. In contrast, trans is shared between alleles in the heterozygote, thus only cis contributes to ASE differences in this genotype. As a result, trans can be inferred from the percent of the total variation estimated in the parental mix (cis+trans) that is not explained by the variation estimated in the heterozygote (cis). To perform this test this we used the model:Y ij = +BX ij +e ij where for the ith W line and the jth gene, Y is the esti- mated difference between alleles in the heterozygote (cis) and X is the total difference in expression estimated from the parental mix. Then, we estimated trans-regulatory varia- tion from the percent of the overall expression differences found in the parental mix that are not explained by this model (Figure 2.5). Missing data points are due to lack of a SNP within a given gene and W line. To identify cis-by-trans interactions (i.e. non-additive effects), we tested whether the allelic bias in heterozygotes was significantly different from the allelic bias in intro- gressions. To examine this effect, we fit the model: Y ij = +BZ ij +e ij where for the i th W line and for the j th gene, Y is the estimated difference between alleles for the heterozygote and Z is the estimated difference between alleles for the introgression line. The cis-regulatory elements for each allele in the heterozygote are identical to the cis-regulatory elements in the introgression, hence the allelic bias cannot vary due to cis between these genotypes. Furthermore, although the trans-factors change between genotypes (heterozygous W/st e to homozygous st e) the trans-factors are shared equally between alleles within each genotype and thus, differences in trans cannot contribute to 23 allelic bias. As a result, any change in allelic bias between these genotypes must be due to cis-by-trans interactions. To test for natural variation (i.e. variation between W lines) in the relative contribu- tion of cis and trans, we fit the modelY ij =+B 1 X ij +B 2 L i +e ij where for thei th line and thej th gene, Y is the average bias between alleles for the parental mix, and X is the average bias between alleles for the heterozygote. We then tested for the effect of line. To verify the accuracy of ASE assays using Solexa, we used a separate lane of a flow cell to analyze three replicate dilution series (1:9, 2:8, 3:7, 5:5, 7:3, 8:2, 9:1). These dilutions were created using genomic DNA extracted separately from homozygous W line 84 and the homozygous st e tester line (Figure 2.4). Each of the four genes analyzed in this study were tested for the ability to accurately detect known allele frequencies. Heterozygous DNA from a cross between these lines was also analyzed in parallel to correct for allele-specific amplification bias ([10] and Additional file 3). To correct for differences in starting concentration of the homozygous DNA used for the dilution series, we corrected the expected values by the average allelic bias found in the 1:1 mix [10]. Correlations between the known dilution and the measured ASE using Solexa were estimated separately for each gene using Pearsons correlation coefficient [59]. 24 Chapter 3 Transcription Start Site Evolution in Drosophila At a given gene, a functionally redundant promoter and associated transcription start site (TSS) may emerge (”birth”) via random mutations, and like gene duplication events, one copy may experience relaxed selection [60]. Ultimately, one copy ”dies” or evolves a new function. Thus, a new TSS may arise or a TSS may move, or turnover, via this birth and death process (see figure 3.1). Conservation of TSS locations has mostly been explored between human and mouse, where TSS turnover (20%) [61] is reported to be less common than transcription factor binding site (TFBS) turnover (32-40%) [62]. At such evolutionary distances, TSS movement is sometimes difficult to inter- pret. Thus, assessing TSS conservation would be ideal in Drosophila, where multi- ple species genome alignments are available. However, TSS turnover has only been assessed for select genes [63], including gene duplicates [64], due to a lack of genome- wide TSS annotations in non-D. melanogaster species. For reference, TFBS locations are highly conserved between Drosophila species even though sequence turnover is common [65]. Understanding TSS evolution is important because variants for TSS usage have been linked to important phenotypic differences [1], including susceptibil- ity to disease [30, 31]. Furthermore, promoter and TSS differences may account for important cis-regulatory differences [10] between Drosophila species. In this study, we developed a new method that identifies TSS and employed it to explore trends in TSS evolution in Drosophila. This concise method yields long (e.g. 25 TSS Turnover Incomplete TSS Turnover Figure 1. A Diagram of TSS Turnover Sequence Turnover = Seq. changes = Low Exp. = High Exp. Figure 3.1: This diagram shows sequence turnover as color changes and TSS turnover as movement of TSS location. Note that expression and location are largely maintained during sequence turnover. TSS turnover may be a gradual process with instances of intermediate or incomplete turnover prior to death or divergence of the redundant TSS. 76bp), paired-end reads that are anchored to the 5 0 end of mRNA. After mapping to each respective genome, we estimated and removed background noise using RNAseq reads (as a null model) generated from the same mRNA pool. Using this approach, we called TSS in four Drosophila species, including: D. melanogaster (mel), D. simulans (sim), D. sechellia (sec), and D. pseudoobscura (pse). Then, we paired each mel TSS with its putative ortholog in each species (based on proximity) using the available mutliple sequence alignment. From these results, we estimated overall and species-specific TSS differences, with respect to mel. For example, TSS that are conserved in all but one species were considered species-specific TSS differences. Additionally, if a TSS variant was shared between sim and sec alone, we assumed it occurred prior to the split of these sister species. The relative abundance and distribution of coverage (TSS peak shape) at each TSS was highly reproducible between biological and technical replicates and expression estimates were comparable to RNAseq results generated from the same mRNA pool. Thus, we also estimated promoter-specific expression (PSE) and TSS 26 peak shape for each species and compared conservation of promoter activity with the conservation of TSS location. 3.1 New Method 3.1.1 5 0 Anchored Reads Several molecular techniques can be used to locate TSS including: cap analysis for gene expression (CAGE) [66] and its various versions [67, 68, 69], 5 0 rapid amplifica- tion of cDNA ends (RACE) [70], robust analysis of 5 0 transcript ends (5 0 RATE) [71], and FLcDNA assays [72]. The original CAGE protocol involves the concatenation of short 5 0 sequence tags (14-20bps), followed by traditional Sanger sequencing [66]. More recently, CAGE and similar 5 0 targeting methods have been adapted to high-throughput sequencing [73, 69, 67, 68]. One major difference between the available methods is the approach used to target 5 0 ends of full length transcripts. For example, some methods rely on the removal of the 5 0 CAP structure with tobacco acid pyrophosphatase (TAP), others use the CAP structure to perform template-switching, and 5 0 CAPs can also be biotinyled and isolated with streptavidin beads. There are benefits and drawbacks to each method currently available, but we wanted a simple and straight-forward approach without specific limitations like short reads (tags) [69, 74, 75], single-end reads [75] which hinder mapping, and added sampling bias from a required semisuppressive PCR step [67, 76]. Thus, we generated 5 0 anchored reads using a concise TAP-based protocol that employs standard Illumina adapters and barcode indexes and is free of the afore- mentioned drawbacks. We extracted total RNA from whole body, adult female flies from each Drosophila species. We purified mRNA and ligated an RNA adapter oligo to the 5 0 end of each mRNA molecule. We chemically fragmented the ligated mRNA using RNA fragmentation reagent (Ambion) and generated single-stranded cDNA with 27 Table 3.1: Method reproducibility Species Comparison r i p-value N ii sim 5 0 Anchored Technical reps 0.91 <0.000001 759628 mau 0 iii 5 0 Anchored Biological reps 0.87 <0.000001 2471760 mel RNAseq Technical reps 0.86 <0.000001 132600 mel RNAseq vs. 5 0 Anchored(FPKM) 0.60 <0.0001 NA i Pearson correlation coefficient, ii Sample size. iii mau 0 was used for reproducibility estimates only. reverse transcriptase and random hexamers, followed by RNAse H treatment. We added a primer complementary to the 5 0 ligated adapter sequence and performed one primer extension step at 72 C with Taq polymerase to yield double-stranded fragments of all 5 0 ends (see figure .2.1). This primer has a 5 0 amine group to prevent concatenation and subsequent ligation. Taq adds an A-overhang in a template independent fashion [77], thus we can bypass the typical blunt-end repair and cleanup step and immediately ligate standard Illumina adapters in a strand-specific orientation. Standard Illumina indexing barcodes were then added during PCR enrichment of each sample. We sequenced the 5 0 enriched fragments on an Illumina Genome Analyzer II (see supplemental file for a detailed protocol). One potential concern with using TAP treatment and RNA ligation is that RNA secondary structure might bias ligation. However, our results were highly reproducible and expression estimates were similar with RNAseq results generated from the same mRNA pool (see table 3.1). Also, we expect any bias in our sampling of TSS to be largely consistent between species, enabling us to compare TSS among species. Unlike typical 5 0 RACE protocols, phosphatase treatment (e.g. using CIP) is optional in our protocol due to a required mRNA purification step, which leaves only trace amounts of uncapped product (e.g. rRNA). It is possible that fragmented or non-full-length mRNA are present and could result in a false positive TSS read. However, assuming break 28 + TAP Treatment Adapter 1 + ssLigation p m7Gppp AAAAAAAA AAAAAAAA + RNA Fragmentation p AAAAAAAA + cDNA Synthesis AAAAAAAA + RNAseH Treatment + Primer Extension AAAAAAAA TTTTTT TTTTTT Figure 2. Taq-ex CAGE Protocol + Adapter Ligation + PCR/Barcode Addition primer A T A T # Final Product Figure 3.2: Solid lines represent RNA and dashed lines indicate cDNA. Black lines are biological sequences and red and blue represent Illumina adapters 1 and 2, respectively. There are 12 standard Illumina barcoding Indexes (see the # symbol) that can be incor- porated during PCR. 29 points or truncated sites are random, these signals would be removed as background. As with any 5 0 CAP targeting method, results may include reads from transcripts that have broken and been recapped [78]. Genome sequencing technology has advanced at an incredible rate making genome annotations a limiting factor in comparative genomics. This issue highlights the impor- tance of methods like the one described here. Our method generates long, paired-end reads for improved mapping and the removal of PCR duplicates. Also, this concise approach is amenable to projects with many samples, and only requires 1-5ug of total RNA, which is less than other methods that, like ours, do not rely on semisuppressive PCR [66, 69, 75]. We compared our results with RNAseq generated from the same mRNA pool and found highly correlated gene expression estimates (see table 3.1). We believe this approach will be valuable option for exploring TSS changes, improving genome annotations, and more generally, for gene expression studies where comparing PSE is important. 3.2 Results and Discussion 3.2.1 Analysis of 5 0 anchored reads We mapped the 76bp, paired-end reads to each respective reference genome of each species using BWA [79] (allowing 4% MM). We required that both reads in a pair mapped unambiguously, which allowed us to remove a substantial amount of PCR dupli- cates (see table 3.2). The 5 0 mapping position of the first read in each pair represents the TSS. We observed a prominent enrichment of reads mapped near known TSS in mel, but there was apparent background noise, likely due to residual RNAseq reads. Thus, we modeled the error in the method using RNAseq reads generated from the same mRNA pool. We chose a threshold that removed 99% of positions identified from RNAseq 30 Table 3.2: Mapping results Species Method Total RNA Reads i Mapped % Filtered ii % P mel CAGE 5ug 858863 372550 43% 166096 19% P mel CAGE 10ug 119426 58556 49% NA NA sim CAGE 5ug 1180578 823516 70% 320263 27% sec CAGE 5ug 1003687 695863 69% 287079 29% mau 0 ii CAGE 5ug 627904 286214 46% NA NA mau 0 CAGE 5ug 1201745 488569 41% NA NA pse CAGE 5ug 1161039 822188 71% 323942 28% mel RNAseq 5ug 1209363 706386 58% 1232707 43% P mel RNAseq 10ug 1664075 1004598 60% NA NA i Reads after removing PCR duplicates. ii =D. mauritiana. P= Pooled samples. Table 3.3: Accuracy of TSS estimates Enrichment Category TSS/Reads % TSS at 5’ END(+=200bp) 2453 85% Reads at 5’ END(+=200bp) 83237 84% TSS a 5’ END(+=200bp) 2453 85% TSS in CDS regions 180 6% Reads in CDS regions 5673 6% TSS in other 240 8% Reads in other 10284 10% (10 reads within 50bp, see methods). Next, we identified a single representative TSS position centered in each cluster of TSS reads (TSS peak) to compare between species (see methods). The vast majority of representative TSS positions (85%) and total reads mapped (84%) occurred within 200bps of annotated 5 0 transcript ends in mel (see table 3.3). A previous CAGE study, found a similar percent of reads that overlapped known TSS regions (86%) [80]. We examined the non-5 0 end TSS peaks further, to understand if these are likely new TSS or error. The non-5 0 end TSS peaks included 301 that mapped to chromosome U and Uextra, 6% mapped within coding regions, 10% mapped to intergenic regions. 31 Additionally, 7 TSS peaks were identified at 5’ ends of protein-coding genes in the mitochondrial genome (e.g. CoI-III and ND3). Here, we focused on PolII transcripts and well-annotated regions of genome, thus TSS from chromosome U, Uextra and the mitochondrial genome were not analyzed further. Unexpected TSS locations are not unique to our TSS study [61, 80], and may represent previously unannotated promoters, RNA breaks followed by recapping events [78], or background noise from the method. In summary, we compared 2849 high confidence D. melanogaster TSS with putative orthologs in D. simulans, D. sechellia, and D. pseudoobscura. 3.2.2 TSS Conservation Between Drosophila Species The birth of a new (alternative) TSS and TSS movement may contribute to regula- tory variation, but the degree to which relative TSS locations are conserved among Drosophila is unknown. To assess TSS conservation in Drosophila, we paired each mel TSS with its ortholog in sim, sec, and pse. We accomplished this by converting the mel TSS position to an equivalent position in each species using the whole-genome pairwise alignment available from the UCSC genome browser and a liftover utility (see methods). The most likely ortholog was then chosen based on proximity. Transcrip- tion initiates from a range of local positions at a given promoter, typically spanning less than 50bps [80]. Thus, we required that TSS orthologs be separated by more than 50bp before we considered it evidence for TSS movement. Intuitively, we expect an increased likelihood of incorrectly pairing TSS (e.g. with a promoter at a nearby gene) between species as distance between putative orthologs increases. To limit this, we only paired orthologs within 500bp. Using this approach, we identified orthologous positions for nearly all of the mel TSS in the sister species sim (96%) and sec (98%, see table 3.4). We found much fewer orthologous positions in the more diverged pse (71%). 32 Table 3.4: Mapping percentages of mel TSS to each species. Species Mapped % Unpaired i % Paired % sim 2730 96 374 13 2356 86 sec 2799 98 359 13 2440 87 pse 2036 71 469 16 1567 77 Species Paired<50bp ii % Paired>50bp % sim 2242 95 114 5 sec 2332 96 108 4 pse 1385 88 182 12 i Unpaired mel TSS do not include unmapped. ii r<50bp are correlations limited to orthologs within 50bp. In total, we successfully paired 1,285 mel TSS with orthologs in all three other species, 1,106 in at least two species, and an additional 296 mel TSS were identified in only one species (see figure 3.3). We plotted the distribution of distances between all TSS orthologs for each species to elucidate any trends in TSS movement. This resulted in a sharp distribution of distances centered at zero bp (relative to mel) between orthol- ogous TSS for each species (see figure 3.4). Furthermore, only 4% of mel TSS between sister species and 9% between pse, were separated by more than 50bp. Thus, over 90% of the paired TSS locations were conserved in each Drosophila species, including the more distantly related pse. We expect these estimates of TSS conservation to be an overestimate because this trend excludes orthologs seperated by more than 500bps, movement less than 50bp, TSS in unaligned regions, and TSS that were expressed below detection. For example, 13-16% of mel TSS locations had identifiable orthologous posi- tions, but lacked an orthologous TSS within 500bp in a given species. Unaligned regions prevented the identification of another 4, 2, and 29% of mel TSS regions in sim, sec, and pse, respectively (see table 3.4). Many of these cases likely involve substantial evo- lutionary events. For example, we observed a sec-specific deletion at a promoter (see 33 sim pse sec 1285 52 141 138 92 876 103 Figure 3. Identified mel Orthologs Figure 3.3: This diagram shows the number of overlapping mel orthologs found in each species. For example, 1285 mel TSS had an identified ortholog in each species and another 103 mel orthologs were only found in sim. figure 3.5). Improved multiple species alignments, exhaustive sequencing coverage, combinations of sequencing approaches, and tests between more intermediate species (e.g. D. yakuba and D. erecta) may further help track highly diverged orthologous sequences and TSS movement between Drosophila species. We hypothesized that TSS divergence accounts for important cis-regulatory differ- ences between species. Thus, we identified cases of species-specific TSS movement. We assumed that cases of TSS divergence in both sim and sec exist due to a mutation prior to their speciation event (see figure 3.6). Likewise, if a mel TSS was diverged in all species, based on parsimony, we consid- ered it mel-specific TSS movement. Like our overall results, we report cases of species- specific TSS differences in three distinct categories of promoter divergence: paired but separated by more than 50bp, mapped but unpaired, and unmapped. As expected, we 34 pse Figure 3.4: These histograms show that mel TSS locations are largely conserved in sim (A), sec (B), and pse (C). Distances were calculated with respect to mel and the coding region. Thus, positive numbers are downstream from mel and negative values are upstream from mel. observed similar species-specific differences for sim and sec, due to their equal diver- gence time from mel (see figure 3.6). However, unmapped TSS were more common in sim (65 vs. 18), likely due to differences in genome quality between species. The pse-specific TSS differences were highest in all categories as expected due to its higher divergence time. Overall, we observed highly conserved TSS locations and associated promoter activ- ity between each species. Interestingly, the minority of paired TSS orthologs that have moved beyond 50bp have more differences in expression and TSS peak shape. We con- clude that by moving relative positions, TSS are associated with a new set of enhancers 35 chr2L: 10,306,900 10,306,950 10,307,000 10,307,050 10,307,100 10,307,150 CG4968 chr3L: 10,874,950 10,875,000 10,875,050 10,875,100 10,875,150 CG6409 A B mel 80 0 sim 80 0 sec 80 0 5` RACE (mel) 0 45 5` RACE (mel) mel sim sec pse 30 0 30 0 30 0 30 0 36 0 DELETION Figure 3.5: Here are detailed examples of TSS variation between species. (A) There is a mel-specific, downstream TSS in CG6409, associated with the broadening of the upstream TSS. This may be an intermediate TSS turnover event. pse is not aligned in this region (B) This figure shows a sec-specific deletion (red square) at the TSS associated with nearly a complete loss of activity in sec. Also, predominant pse activity appears to have shifted 20bp upstream. that result in new expression patterns. Some mel TSS positions do not align between species, indicating considerable sequence divergence at these regions. This may occur 36 !!" N = 2849 D. melanogaster TSS! Species-Specific TSS Differences Relative to D. melanogaster! pse pse pse pse pse pse pse pse pse pse pse pse pse pse Figure 3.6: This figure shows the relatedness of each species and the number of TSS changes in each species with respect to mel (N=2849). *When TSS turnover events were shared by sim and sec, we assumed it occurred prior to speciation of sim and sec. **When mel TSS were diverged between all species pairs, we considered it to be a mel-specific turnover event. from movement of the entire TSS followed by degeneration of the now non-functional promoter element, or a duplication event that prevents unambiguous alignment. We believe the relative proportion of mel TSS that do not map between species is an infor- mative estimate of sequence divergence at TSS. To understand why TSS divergence in not uniform across the genome, we checked for enrichment of gene ontology categories among genes with TSS differences. Interestingly, we found a significant enrichment of ribosomal protein genes for each species (p=1.4E-16). Perhaps TSS movement and sequence turnover have hindered the discovery of functional regulatory elements for these genes [81]. Considering that promoters are essential regulatory elements, we were surprised that 29% of mel TSS positions did not map to pse. Positive selection on distinct elements or an increased mutation rate in TSS proximal DNA might explain this elevated sequence divergence. To test this, we explored patterns of local sequence conservation flanking the 37 TSS. Interestingly, there is a prominent increase in sequence divergence upstream of the TSS among insects (see figure 3.7). If positive selection has occurred between species, we might expect increased divergence between species and reduced sequence variation within species. Thus, we examined mutation frequencies within a population of mel. We observed a trend of increased variation (elevated frequency of segregating mutations) upstream of the TSS that coincided with reduced sequence conservation between species (see figure 3.7d). This suggests that increased mutation rate upstream of TSS might be the source. We hypothesize that the open chromatin state at actively expressed genes could result in higher mutations in these regions [82]. 3.2.3 Conservation of Promoter-Specific Expression and TSS Peak Shape Expression profiles from our 5 0 anchored reads were highly reproducible between tech- nical and biological replicates (r 0.91, p<0.0001 and r 0.87, p<0.0001, respec- tively; see table 3.1) and expression estimates (FPKM) were highly correlated with standard RNAseq derived from the same RNA pool (Pearson correlation, r 0.60, p<0.0001). Thus, we did not detect substantial bias from the method. We estimated promoter-specific expression (PSE) from the total number of reads that map within 50bps of each representative TSS position. We normalized each PSE estimate by divid- ing by the total number of reads mapped in each sample. To illustrate the type of data we generated and the value of estimating PSE, we describe species-specific usage of two alternative promoters of alcohol dehydrogenase (ADH), a highly studied gene impor- tant in ethanol preference and tolerance in fermenting food [83, 84]. These alternative promoters are known to be tissue-specific and temporally regulated [85]. The proximal TSS is favored in mel, sim, and sec to varying degrees, but pse exclusively uses the distal TSS in adult females (see figure 3.8). Interspecific differences in expression from these 38 Figure 7. Mutations Around TSS SNPs per Position Indels per Position sim sec sim sec 0.4 0.5 0.6 0.3 0.2 Average PhastCons Sequence Conservation (insects) Position Relative to TSS A B C 0 500 1000 1500 -500 -1000 -1500 0 500 1000 1500 -500 -1000 -1500 0 500 1000 1500 -500 -1000 -1500 450 400 150 250 350 200 300 35 30 25 20 15 10 5 0 500 1000 1500 -500 -1000 -1500 0.03 0.04 0.05 SNP Frequency Polymorphic sites within population D Figure 3.7: We examined sequence conservation and mutations within 2kb of mel TSS (N=2849). (A) The trend of SNPs (average in 100bp sliding window) around all mel TSS in sim (black) and sec (red). (B) Similar to A, but with indels. (C) Average sequence conservation (PhastCons 15 insect species) score around TSS. Note the dip in conser- vation. (D) We examined the frequency of segregating sites around TSS in emphmel population data. Note the spike in sequence diversity. 39 chr2L: 14,615,500 14,616,000 14,616,500 14,617,000 14,617,500 pse 30 - 0 _ sec 30 - 0 _ sim 30 - 0 _ mel 30 - 0 _ 5`RACE 323 - 1 _ ADH distal proximal Figure 3.8: In the Alcohol Dehydrogenase gene (ADH), we observe significant differ- ences in PSE between species. For example, mel has biased expression of the down- stream promoter, like sec. However, pse is restricted to the upstream alternative pro- moter and sim is restricted to the downstream promoter. alternative promoters has been observed previously, but an adaptive link has not been reported [86]. Highly significant correlations in PSE were found between Drosophila species (Pearson correlation, r 0.63, p<0.0001; r 0.66, p<0.0001; and r 0.54, p<0.0001 for sim, sec, and pse, respectively; see table 3.5). TSS movement may introduce an entirely new suite of regulatory elements associated with each promoter. Thus, we hypothesized that PSE differences will increase between TSS that have moved. To test this, we compared correlations in PSE between TSS with conserved locations and TSS that have moved beyond 50bp. For conserved TSS, correlations in PSE increased relative to the overall estimates (Pearson correlation, r 0.65, p<0.0001; r 0.68, p<0.0001; and r 0.57, p<0.0001 for sim, sec, and pse, respectively; see table 3.5). 40 Table 3.5: Conservation of promoter-specific expression between species Species r i N ii p-value sim 0.63 2381 9.1e-261 sec 0.66 2448 7.9e-301 pse 0.54 1585 8.9e-120 Species r<50bp iii N p-value r>50bp N p-value sim 0.65 2266 1.9e-268 0.17 115 0.07 sec 0.68 2340 4.3e-315 0.26 108 0.006 pse 0.57 1400 5.4e-119 0.22 185 0.002 i r = Pearson correlation coefficient, ii N = Sample size, iii r<50bp are correlations lim- ited to orthologs within 50bp. However, correlations in PSE break down substantially between TSS pairs that were sep- arated by more than 50bps (Pearson correlation, r 0.17, p=0.07; r 0.26, p=0.006; and r 0.22, p=0.002 for sim, sec, and pse, respectively; see table 3.5). This result highlights the importance of the relative position of TSS on maintaining gene expres- sion levels. Contrary to many 0 textbook 0 descriptions of transcription initiation in eukaryotes, transcripts start from a range of local positions at a given promoter. This results in a fre- quency distribution of initiation events that varies between promoters from ”broad” to ”sharp” [87, 88, 89]. These ”broad” and ”sharp” structures are reproducible and largely conserved between mouse and human [89] and between life stages of Drosophila [80]. Comparisons between Drosophila species have only been made on select genes [63, 64]. In this study, we roughly quantify the structure at each promoter using an entropy-based model (see methods) and compare estimates between orthologs. Using this approach, we report that TSS initiation distributions are highly correlated between all species (Pear- son correlation, r 0.75, p<0.0001; r 0.76, p<0.0001; and r 0.66, p<0.0001 for sim, sec, and pse, respectively; see table 3.6). Roughly, sharp peaks remain sharp 41 Table 3.6: Conservation of TSS peak shape between species Species r N p-value sim 0.75 2381 <0.00001 sec 0.76 2448 <0.00001 pse 0.66 1585 6.1e-202 Species r<50bp N p-value r>50bp N p-value sim 0.77 2266 <0.00001 0.06 115 0.54 sec 0.79 2340 <0.00001 0.30 108 0.002 pse 0.70 1400 4.2e-209 0.11 185 0.15 peaks between species (and vice versa). Similar to PSE, correlations between TSS dis- tributions increase when TSS location is conserved (r 0.77, p<0.0001, r 0.79, p<0.0001, r 0.70, p<0.0001 for sim, sec, and pse respectively) and decrease when TSS are separated by more than 50bp (r 0.06, p=0.56, r 0.30, p=0.002, r 0.11, p=0.15 for sim, sec, and pse respectively; see table 3.6). Thus, the relative location of TSS is important for maintaining several aspects of promoter activity. A previous study highlighted a species-specific change in TSS peak shape [63]. In support of this, we observed several cases of divergence in TSS peak shape from sharp to broad (with respect to the inferred ancestral state) coinciding with the birth of a new TSS (see figure 3.5A). 42 3.3 Methods 3.3.1 Fly strains D. melanogaster: DGRP line hybrid of 303 and 313. Bloomington stock: 25176 and 25180, respectively. D. simulans: st e line UCSD stock: 14021 0251.034 D. pseudoob- scura: UCSD stock: 14011 0121.32 D. sechellia: Robertson Line 1 D. mauritiana: w140 (used in reproducibility estimates). 3.3.2 CAGE library preparation 20 adult (3-5 days old) female Drosophila were snap-frozen in liquid nitrogen and stored at -80 C until further processing. Total RNA was then extracted using Zymoresearch duet Kit. We prepared CAGE libraries using 1,5, and 10ug of total RNA from adult females for each of the five Drosophila species (figure .2.1). First, we use Dynabeads to isolate mRNA by polyA selection. Then, we treat the mRNA with TAP (tobacco acid pyrophosphatase), which removes a pyrophosphate from the 5 0 CAP structure and exposes a 5 0 phosphate to ligation with an RNA oligo (containing a Solexa adapter sequence). Next, the adapter ligated mRNA is chemically fragmented and then con- verted to single-stranded cDNA. 5 0 fragments are then enriched by adding a primer that is homologous to the RNA adapter oligo and perform one round of polymerase exten- sion using Taq DNA Polymerase (Thermus aquaticus). Taq was specifically chosen because it adds an A at the end of each fragment in a template-independent fashion [90]. Thus, after Taq extension (hence the name: Taq-ex CAGE), all 5 0 fragments are double-stranded with an A overhang on one end. From here, the standard RNAseq pro- tocol is followed including adapter ligation, size selection, and index addition with PCR enrichment. A detailed protocol is available in supplemental materials. 43 3.3.3 Mapping CAGE Reads Sequencing on GAII with 76bp reads resulted in over 1 million paired-end reads per species library (see table 3.2). For a given species, the raw paired-end reads were mapped to their own genome (see table 3.4) using BWA version 0.5.9-r16. We mapped D. simulans reads to droSim1, D. sechellia reads to droSec1, D. pseudoobscura reads to dp4, and D. mauritiana reads to a DroMau draft genome [91] (used for reproducibility estimates only). We allowed 4 % MM and only accepted uniquely mapping reads. Next, PCR duplicates were removed using picards Mark Duplicate Reads (version 1.56.0). 3.3.4 Connecting Orthologs We created an interval file with each TSS peak position found in D. melanogaster and converted these chromosomal positions to coordinates in the D. simulans (droSim1), D. sechellia (droSec1), and D. pseudoobscura (dp4) genomes using a LiftOver utility (Con- vert Genome Coordinates (Version 1.0.3) available on the web-based galaxy platform [92, 93, 94]. From the D. melanogaster position mapped to D. simulans (for example), we paired our representative TSS with the closest TSS ortholog in D. simulans within 500bps. 3.3.5 TSS Peak Calling TSS were identified from the paired-end CAGE data from the 5 0 mapping position of the first read in each pair. Active promoters are represented as a frequency distribution of initiation events along the genome. We filtered for promoter regions with a minimum of 10 reads mapped within 50bps; similar to [80]. At each promoter, we identified the average position used, based on the local frequency distribution of initiation events. After rounding down and removing minor positions (positions with less mapped reads), 44 this average position was used as a representative TSS position for comparisons between orthologs. 3.3.6 TSS Peak Shape Quantification To quantify peak shape we calculated entropy in a 65bp window centered on each TSS. For some positions within the window, there may be zero reads mapped resulting in a probability of 0, underestimating of the true probability. The approach used by Hoskins et al. 2011 excluded these positions from the analysis, which resulted in variable effec- tive window sizes for each TSS distribution. When the window sizes are different, entropy estimates will be biased. Thus, we incorporated Laplace 0 s rule to account for cases of zero hits before calculating entropy. This approach adds a uniform probability across the entire window before estimating entropy. We added one mapped TSS read to each position, including positions with TSS mapped before calculating entropy. Here is the equation for our entropy (E) estimate at each TSS: E = P 65 i=0 p i log 2 p i . It should be noted that entropy estimates for bimodal peaks or other higher order peak shapes may be overestimated using this method. 45 Chapter 4 Conclusion In this chapter, I summarize the major contributions made by my research and propose some future directions. 4.0.7 Summary of Contributions The major aim of my research has been to develop new molecular techniques that over- come the drawbacks of current methods and to employ them to elucidate how gene expression patterns evolve within and between Drosophila species. My specific contri- butions are listed as follows. Allele-Specific Expression Assays Using Solexa (Method) By barcoding samples, we were able to estimate ASE for hundreds of samples in one sequencing lane. This is an attractive alternative to pyrosequencing and can be used in large-scale or for verification purposes by ”spiking-in” a few samples with other libraries on an existing sequencing lane. Interspecific differences likely accumulate by selection on standing cis- variation within populations A fundamental question is whether gene expression profiles diverge between species via selection on standing variation or new mutations. Also, if interspe- cific differences are due to the accumulation of many small-effect mutations in cis or few large-effect changes in trans. We found that most regulatory variation observed was due to differences in cis within a sample population of D. simulans, 46 perhaps due to the increased likelihood of deleterious pleiotropic (trans) effects. This suggests that the accumulation of regulatory variation between species is likely due to selection on standing variation within populations. Our study was also one of the first to test for cis-by-trans interactions within species. We did not detect significant cis-by-trans interactions; potentially due to limited statistical power. 5‘ Anchored Reads (Method) Within the context of cis-variation, the contribution of changes in promoters and their associated TSS remains underexplored. To assess TSS evolution between Drosophila species, we developed a new method that anchors sequencing out- put to the 5‘ end of mRNA. Our 5‘ anchored reads technique offers a relatively simple protocol (less than half the steps of CAGE methods) and has the impor- tant addition of paired-end, strand-specific, and long reads. This method is also broadly applicable for estimating promoter-specific expression and characterizing TSS peak shape. TSS Evolution inDrosophila Using our new method, we identified TSS in four Drosophila species and com- pared relative locations of orthologs. Overall, TSS orthologs are largely con- served. This study is the first transcriptome-wide assessment of TSS evolution in Drosophila, likely because TSS annotations are not yet publicly available. TSS Divergence is not Uniform Genome-wide We found that ribosomal protein genes were more likely to have diverged TSS. This suggests that purifying selection on these cis-elements is likely not uniform across the genome. 47 Increased Mutation Rate Upstream of TSS We observed a pattern of increased sequence divergence upstream of TSS between insect species. This could be due to relaxed selection or an increased mutation rate associated with the upstream TSS region. We found a pattern of elevated segregat- ing sites corresponding with the upstream TSS regions within a population of D. melanogaster. This is evidence that there is an increased mutation rate upstream of TSS, perhaps due to the open state of chromatin at active genes. 4.0.8 Future Directions There are several open issues related to my research that need further study. More tests forcis-by-trans interactions Our current research is following up with a genome-wide assessment of cis-by- trans using a custom allele-specific microarray. This will give a clearer picture of the relative contributions of epistatic interactions in these studies. Expanded Multiple Sequence Alignments TSS turnover and the consequence of elevated mutation rates upstream of TSS have not been explored enough at all. We need improved multiple species align- ments and TSS annotations in many more species to better understand and track the process of TSS evolution in Drosophila. More Association Studies Using ASE Assays Once coding mutations have been explored and ruled out, I believe ASE is under- used for identifying the genetic basis for phenotypic differences. Improved Sequencing Technology Third generation sequencing technology is substantially higher throughput and 48 yields longer reads with less bias because it uses unamplified starting material. Thus, alignment is vastly improved (especially in repetitive regions) and down- stream splicing isoforms can be associated with TSS usage. This technology has only recently become readily available, but will likely substantially improve the precision of gene annotations and expression studies. 4.0.9 Final Remarks In this dissertation I have developed two novel molecular techniques and employed them to elucidate patterns of cis-regulatory evolution in Drosophila. Using these broadly applicable techniques I have answered several important questions related to gene regulatory evolution: Do gene expression differences evolve through mutations in cis or trans? We assessed cis and trans variation for five genes within a sample population of D. simulans and found the majority of variation was explained by cis. However, transcriptome-wide estimates are needed to make confirm this trend. Are TSS fixed entities in the genome? If TSS move, is promoter-specific expression still conserved? We compared 2,849 TSS between four Drosophila species and found that for the TSS with identified orthologs, most had conserved locations and activity. How- ever, among divergent TSS, promoter-specific expression was also more likely to be different. Thus, location appears important for maintaining gene expression levels. We observed a pattern of elevated sequence divergence upstream of TSS between species. What could be driving this prominent pattern? 49 We saw a similar pattern of increased segregating sites upstream of TSS within a population D. melanogaster. These results, together with the open state of chro- matin in these regions, suggest that an increased mutation rate may underly high sequence divergence upstream of TSS. 50 Bibliography [1] G. A. Wray, “Transcriptional regulation and the evolution of development.,” Int. J. Dev. Biol., vol. 47, no. 7-8, pp. 675–684, 2003. [2] S. B. Carroll, “Evo-devo and an expanding evolutionary synthesis: a genetic theory of morphological evolution.,” Cell, vol. 134, pp. 25–36, July 2008. [3] H. Hoekstra, “The locus of evolution: evo devo and the genetics of adaptation,” Evolution, 2007. [4] D. L. Stern and V . Orgogozo, “The loci of evolution: how predictable is genetic evolution?,” Evolution, vol. 62, no. 9, pp. 2155–2177, 2008. [5] J. C. Kwasnieski, I. Mogno, C. A. Myers, J. C. Corbo, and B. A. Cohen, “Complex effects of nucleotide variants in a mammalian cis-regulatory element.,” Proceed- ings of the National Academy of Sciences, vol. 109, pp. 19498–19503, Nov. 2012. [6] M. C. King and A. C. Wilson, “Evolution at two levels in humans and chim- panzees,” Science (New York, NY), vol. 188, pp. 107–116, Apr. 1975. [7] P. J. Wittkopp, E. E. Stewart, L. L. Arnold, A. H. Neidert, B. K. Haerum, E. M. Thompson, S. Akhras, G. Smith-Winberry, and L. Shefner, “Intraspecific Poly- morphism to Interspecific Divergence: Genetics of Pigmentation in Drosophila,” Science (New York, NY), vol. 326, pp. 540–544, Oct. 2009. [8] J. Majewski and T. Pastinen, “The study of eQTL variations by RNA-seq: from SNPs to phenotypes,” Trends in Genetics, 2011. [9] D. Kuo, K. Licon, and S. Bandyopadhyay, “Coevolution within a transcriptional network by compensatory trans and cis mutations,” Genome . . . , 2010. [10] P. J. Wittkopp, B. K. Haerum, and A. G. Clark, “Evolutionary changes in cis and trans gene regulation,” Nature, vol. 430, pp. 85–88, July 2004. 51 [11] P. J. Wittkopp, B. K. Haerum, and A. G. Clark, “Regulatory changes underly- ing expression differences within and between Drosophila species,” Nat Genet., vol. 40, pp. 346–350, Mar. 2008. [12] G. Yvert, R. B. Brem, J. Whittle, J. M. Akey, E. Foss, E. N. Smith, R. Mackelprang, and L. Kruglyak, “Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors.,” Nature genetics, vol. 35, pp. 57–64, Sept. 2003. [13] B. Lemos, L. O. Araripe, P. Fontanillas, and D. L. Hartl, “Dominance and the evolutionary accumulation of cis- and trans-effects on gene expression.,” Proc Natl Acad Sci USA, vol. 105, pp. 14471–14476, Sept. 2008. [14] H.-Y . Wang, Y . Fu, M. S. McPeek, X. Lu, S. Nuzhdin, A. Xu, J. Lu, M.-L. Wu, and C.-I. Wu, “Complex genetic interactions underlying expression differences between Drosophila races: analysis of chromosome substitutions,” Proc Natl Acad Sci USA, vol. 105, pp. 6362–6367, Apr. 2008. [15] C. J. McManus, J. D. Coolon, M. O’Duff, J. Eipper-Mains, B. R. Graveley, and P. J. Wittkopp, “Regulatory divergence in Drosophila revealed by mRNA-seq,” Genome Res, Mar. 2010. [16] M. D. Wilson, N. L. Barbosa-Morais, D. Schmidt, C. M. Conboy, L. Vanes, V . L. J. Tybulewicz, E. M. C. Fisher, S. Tavare, and D. T. Odom, “Species-Specific Tran- scription in Mice Carrying Human Chromosome 21,” Science (New York, NY), vol. 322, pp. 434–438, Oct. 2008. [17] N. Hubner, C. A. Wallace, H. Zimdahl, E. Petretto, H. Schulz, F. Maciver, M. Mueller, O. Hummel, J. Monti, V . Zidek, A. Musilova, V . Kren, H. Causton, L. Game, G. Born, S. Schmidt, A. M¨ uller, S. A. Cook, T. W. Kurtz, J. Whittaker, M. Pravenec, and T. J. Aitman, “Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease,” Nature genetics, vol. 37, pp. 243–253, Feb. 2005. [18] M. Kimura, “On the Probability of Fixation of Mutant Genes in a Population,” Genetics, 1962. [19] B. Prud’homme, N. Gompel, and S. B. Carroll, “Emerging principles of regulatory evolution,” Proc Natl Acad Sci USA, vol. 104 Suppl 1, pp. 8605–8612, May 2007. [20] P. C. Phillips, “Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems,” Nat Rev Genet, vol. 9, p. 855, Nov. 2008. [21] J. Singer-Sam and C. Gao, “Quantitative RT-PCR-Based Analysis of Allele- Specific Gene Expression,” Methods in molecular biology (Clifton, NJ), vol. 181, pp. 145–152, 2001. 52 [22] C. R. Cowles, J. N. Hirschhorn, D. Altshuler, and E. S. Lander, “Detection of regulatory variation in mouse genes,” Nat Genet., vol. 32, pp. 432–437, Nov. 2002. [23] A. Ahmadian, B. Gharizadeh, A. C. Gustafsson, F. Sterky, P. Nyr´ en, M. Uhl´ en, and J. Lundeberg, “Single-nucleotide polymorphism analysis by pyrosequencing,” Analytical biochemistry, vol. 280, pp. 103–110, Apr. 2000. [24] A. Schulze and J. Downward, “Navigating gene expression using microarrays–a technology review.,” Nature cell biology, vol. 3, pp. E190–5, July 2001. [25] R. M. Graze, L. M. McIntyre, B. J. Main, M. L. Wayne, and S. V . Nuzhdin, “Reg- ulatory divergence in Drosophila melanogaster and D. simulans, a genomewide analysis of allele-specific expression.,” Genetics, vol. 183, pp. 547–61, 1SI–21SI, Oct. 2009. [26] R. V . Satya, N. Zavaljevski, and J. Reifman, “A new strategy to reduce allelic bias in RNA-Seq readmapping.,” Nucleic acids research, vol. 40, p. e127, Sept. 2012. [27] B. J. Main, R. D. Bickel, L. M. McIntyre, R. M. Graze, P. P. Calabrese, and S. V . Nuzhdin, “Allele-specific expression assays using Solexa.,” BMC genomics, vol. 10, p. 422, 2009. [28] P. L. Chang, J. P. Dunham, S. V . Nuzhdin, and M. N. Arbeitman, “Somatic sex- specific transcriptome differences in Drosophila revealed by whole transcriptome sequencing.,” BMC genomics, vol. 12, p. 364, 2011. [29] P. Landgraf, M. Rusu, R. Sheridan, A. Sewer, N. Iovino, A. Aravin, S. Pfeffer, A. Rice, A. O. Kamphorst, M. Landthaler, C. Lin, N. D. Socci, L. Hermida, V . Fulci, S. Chiaretti, R. Foa, J. Schliwka, U. Fuchs, A. Novosel, R.-U. Muller, B. Schermer, U. Bissels, J. Inman, Q. Phan, M. Chien, D. B. Weir, R. Choksi, G. De Vita, D. Frezzetti, H.-I. Trompeter, V . Hornung, G. Teng, G. Hartmann, M. Palkovits, R. Di Lauro, P. Wernet, G. Macino, C. E. Rogler, J. W. Nagle, J. Ju, F. N. Papavasiliou, T. Benzing, P. Lichter, W. Tam, M. J. Brownstein, A. Bosio, A. Borkhardt, J. J. Russo, C. Sander, M. Zavolan, and T. Tuschl, “A mammalian microRNA expression atlas based on small RNA library sequencing.,” Cell, vol. 129, pp. 1401–1414, June 2007. [30] C. Bonilla, R. K. Panguluri, L. Taliaferro-Smith, G. Argyropoulos, G. Chen, A. A. Adeyemo, A. Amoah, S. Owusu, J. Acheampong, K. Agyenim-Boateng, B. A. Eghan, J. Oli, G. Okafor, F. Abbiyesuku, T. Johnson, T. Rufus, O. Fasanmade, Y . Chen, F. S. Collins, G. M. Dunston, C. Rotimi, and R. A. Kittles, “Agouti- related protein promoter variant associated with leanness and decreased risk for diabetes in West Africans.,” Int J Obes (Lond) (2005), vol. 30, pp. 715–721, Apr. 2006. 53 [31] H. D. Shin, C. Winkler, J. C. Stephens, J. Bream, H. Young, J. J. Goedert, T. R. O’Brien, D. Vlahov, S. Buchbinder, J. Giorgi, C. Rinaldo, S. Donfield, A. Willoughby, S. J. O’Brien, and M. W. Smith, “Genetic restriction of HIV-1 pathogenesis to AIDS by promoter alleles of IL10.,” Proc Natl Acad Sci USA, vol. 97, pp. 14467–14472, Dec. 2000. [32] Y . Suzuki, K. Yoshitomo-Nakagawa, K. Maruyama, A. Suyama, and S. Sugano, “Construction and characterization of a full length-enriched and a 5-end-enriched cDNA library,” Gene, vol. 200, pp. 149–156, Oct. 1997. [33] R. R. H. Anholt and T. F. C. Mackay, “Quantitative genetic analyses of complex behaviours in Drosophila,” Nat Rev Genet, vol. 5, pp. 838–849, Nov. 2004. [34] H. Ellegren and B. C. Sheldon, “Genetic basis of fitness differences in natural populations,” Nature, vol. 452, pp. 169–175, Mar. 2008. [35] S. T. Harbison, M. A. Carbone, J. F. Ayroles, E. A. Stone, R. F. Lyman, and T. F. C. Mackay, “Co-regulated transcriptional networks contribute to natural genetic vari- ation in Drosophila sleep,” Nat Genet., vol. 41, pp. 371–375, Mar. 2009. [36] V . Nardi, T. Raz, X. Cao, C. J. Wu, R. M. Stone, J. Cortes, M. W. N. Deininger, G. Church, J. Zhu, and G. Q. Daley, “Quantitative monitoring by polymerase colony assay of known mutations resistant to ABL kinase inhibitors,” Oncogene, vol. 27, pp. 775–782, Jan. 2008. [37] D. Altshuler, M. J. Daly, and E. S. Lander, “Genetic mapping in human disease,” Science (New York, NY), vol. 322, pp. 881–888, Nov. 2008. [38] M. V . Rockman and L. Kruglyak, “Genetics of global gene expression,” Nat Rev Genet, vol. 7, pp. 862–872, Nov. 2006. [39] R. Jaenisch and A. Bird, “Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals,” Nat Genet., vol. 33 Suppl, pp. 245–254, Mar. 2003. [40] D. R. Higgs, D. Vernimmen, J. Hughes, and R. Gibbons, “Using genomics to study how chromatin influences gene expression,” Annual review of genomics and human genetics, vol. 8, pp. 299–325, 2007. [41] C. Henrichsen, N. Vinckenbosch, S. Z¨ ollner, E. Chaignat, S. Pradervand, F. Sch¨ utz, M. Ruedi, H. Kaessmann, and A. Reymond, “Segmental copy number variation shapes tissue transcriptomes,” Nat Genet., Mar. 2009. [42] P. Cahan, Y . Li, M. Izumi, and T. Graubert, “The impact of copy number variation on local gene expression in mouse hematopoietic stem and progenitor cells,” Nat Genet., Mar. 2009. 54 [43] D. P. Bartel, “MicroRNAs: genomics, biogenesis, mechanism, and function,” Cell, vol. 116, pp. 281–297, Jan. 2004. [44] J. A. Stamatoyannopoulos, “The genomics of gene expression,” Genomics, vol. 84, pp. 449–457, Sept. 2004. [45] P. J. Wittkopp, B. K. Haerum, and A. G. Clark, “Independent effects of cis- and trans-regulatory variation on gene expression in Drosophila melanogaster,” Genet- ics, vol. 178, pp. 1831–1835, Mar. 2008. [46] H. Yan, W. Yuan, V . E. Velculescu, B. V ogelstein, and K. W. Kinzler, “Allelic variation in human gene expression,” Science (New York, NY), vol. 297, p. 1143, Aug. 2002. [47] J. C. Knight, “Allele-specific gene expression uncovered,” Trends in genetics : TIG, vol. 20, pp. 113–116, Mar. 2004. [48] O. Carlborg and C. S. Haley, “Epistasis: too often neglected in complex trait stud- ies?,” Nat Rev Genet, vol. 5, pp. 618–625, Aug. 2004. [49] G. Gibson and B. Weir, “The quantitative genetics of transcription,” Trends in genetics : TIG, vol. 21, pp. 616–623, Oct. 2005. [50] M. Guo, M. A. Rupe, C. Zinselmeier, J. Habben, B. A. Bowen, and O. S. Smith, “Allelic variation of gene expression in maize hybrids,” The Plant cell, vol. 16, pp. 1707–1716, July 2004. [51] H. Lo, Z. Wang, Y . Hu, H. Yang, S. Gere, and K. Buetow, “Allelic variation in gene expression is common in the human genome,” Genome Res, 2003. [52] V . E. Velculescu, L. Zhang, B. V ogelstein, and K. W. Kinzler, “Serial analysis of gene expression,” Science (New York, NY), vol. 270, pp. 484–487, Oct. 1995. [53] C.-L. Wei, P. Ng, K. P. Chiu, C. H. Wong, C. C. Ang, L. Lipovich, E. T. Liu, and Y . Ruan, “5’ Long serial analysis of gene expression (LongSAGE) and 3’ LongSAGE for transcriptome characterization and genome annotation,” Proc Natl Acad Sci USA, vol. 101, pp. 11701–11706, Aug. 2004. [54] S. Brenner, M. Johnson, J. Bridgham, G. Golda, D. H. Lloyd, D. Johnson, S. Luo, S. McCurdy, M. Foy, M. Ewan, R. Roth, D. George, S. Eletr, G. Albrecht, E. Ver- maas, S. R. Williams, K. Moon, T. Burcham, M. Pallas, R. B. DuBridge, J. Kirch- ner, K. Fearon, J. Mao, and K. Corcoran, “Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays,” Nature biotechnol- ogy, vol. 18, pp. 630–634, June 2000. 55 [55] M. Guo, S. Yang, M. Rupe, B. Hu, D. R. Bickel, L. Arthur, and O. Smith, “Genome-wide allele-specific expression analysis using Massively Parallel Signa- ture Sequencing (MPSS) reveals cis- and trans-effects on gene expression in maize hybrid meristem tissue,” Plant molecular biology, vol. 66, pp. 551–563, Mar. 2008. [56] D. Serre, S. Gurd, B. Ge, R. Sladek, D. Sinnett, E. Harmsen, M. Bibikova, E. Chudin, D. L. Barker, T. Dickinson, J.-B. Fan, T. J. Hudson, and E. T. Dermitza- kis, “Differential Allelic Expression in the Human Genome: A Robust Approach To Identify Genetic and Epigenetic Cis-Acting Mechanisms Regulating Gene Expression,” PLoS Genetics, vol. 4, p. e1000006, Feb. 2008. [57] T. F. Mackay, “The genetic architecture of quantitative traits,” Annual review of genetics, vol. 35, pp. 303–339, 2001. [58] P. Wittkopp, B. Haerum, and A. Clark, “Parent-of-origin effects on mRNA expres- sion in Drosophila melanogaster not caused by genomic . . . ,” Genetics, 2006. [59] J. Neter and E. Maynes, “On the Appropriateness of the Correlation Coefficient with a 0, 1 Dependent Variable,” Journal of the American Statistical Association, vol. 65, no. 330, pp. 501–509, 1970. [60] S. OHNO, Evolution by gene duplication. London: George Alien & Unwin Ltd. Berlin, Heidelberg and New York: Springer-Verlag., 1970. [61] M. C. Frith, J. Ponjavic, D. Fredman, C. Kai, J. Kawai, P. Carninci, Y . Hayashizaki, and A. Sandelin, “Evolutionary turnover of mammalian transcription start sites,” Genome Res, vol. 16, pp. 713–722, June 2006. [62] E. T. Dermitzakis and A. G. Clark, “Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover.,” Mol Biol Evol, vol. 19, pp. 1114–1121, July 2002. [63] M. Sorourian, “Turnover and lineage-specific broadening of the transcription start site in a testis-specific retrogene,” Fly, vol. 4, pp. 3–11, Jan. 2010. [64] C. Park and K. Makova, “Coding region structural heterogeneity and turnover of transcription start sites contribute to divergence in expression between duplicate genes.,” Genome Biol, vol. 10, no. 1, p. R10, 2009. [65] R. K. Bradley, X.-y. Li, C. Trapnell, S. Davidson, L. Pachter, H. C. Chu, L. A. Tonkin, M. D. Biggin, and M. B. Eisen, “Binding site turnover produces perva- sive quantitative changes in transcription factor binding between closely related Drosophila species.,” PLoS Biol, vol. 8, p. e1000343, Mar. 2010. 56 [66] T. Shiraki, S. Kondo, S. Katayama, K. Waki, T. Kasukawa, H. Kawaji, R. Kodz- ius, A. Watahiki, M. Nakamura, T. Arakawa, S. Fukuda, D. Sasaki, A. Podhajska, M. Harbers, J. Kawai, P. Carninci, and Y . Hayashizaki, “Cap analysis gene expres- sion for high-throughput analysis of transcriptional starting point and identification of promoter usage.,” Proc Natl Acad Sci USA, vol. 100, pp. 15776–15781, Dec. 2003. [67] C. Plessy, N. Bertin, H. Takahashi, R. Simone, M. Salimullah, T. Lassmann, M. Vitezic, J. Severin, S. Olivarius, D. Lazarevic, N. Hornig, V . Orlando, I. Bell, H. Gao, J. Dumais, P. Kapranov, H. Wang, C. A. Davis, T. R. Gingeras, J. Kawai, C. O. Daub, Y . Hayashizaki, S. Gustincich, and P. Carninci, “Linking promoters to functional transcripts in small samples with nanoCAGE and CAGEscan.,” Nat Meth, vol. 7, pp. 528–534, July 2010. [68] M. Kanamori-Katayama, M. Itoh, H. Kawaji, T. Lassmann, S. Katayama, M. Kojima, N. Bertin, A. Kaiho, N. Ninomiya, C. O. Daub, P. Carninci, A. R. R. Forrest, and Y . Hayashizaki, “Unamplified cap analysis of gene expression on a single-molecule sequencer,” Genome Res, vol. 21, pp. 1150–1159, July 2011. [69] T. Ni, D. L. Corcoran, E. A. Rach, S. Song, E. P. Spana, Y . Gao, U. Ohler, and J. Zhu, “A paired-end sequencing strategy to map the complex landscape of tran- scription initiation,” Nat Meth, vol. 7, pp. 521–527, July 2010. [70] R. J. Harvey and M. G. Darlison, “Random-primed cDNA synthesis facilitates the isolation of multiple 5’-cDNA ends by RACE.,” Nucleic acids research, vol. 19, p. 4002, July 1991. [71] M. Gowda, H. Li, and G. Wang, “Robust analysis of 5-transcript ends: a high- throughput protocol for characterization of sequence diversity of transcription start sites.,” Nat. Prot., vol. 2, no. 7, pp. 1622–1632, 2007. [72] Y . Suzuki, K. Yoshitomo-Nakagawa, K. Maruyama, A. Suyama, and S. Sugano, “Construction and characterization of a full length-enriched and a 5’-end-enriched cDNA library.,” Gene, vol. 200, pp. 149–156, Oct. 1997. [73] A. Sandelin, P. Carninci, B. Lenhard, J. Ponjavic, Y . Hayashizaki, and D. A. Hume, “Mammalian RNA polymerase II core promoters: insights from genome-wide studies,” Nat Rev Genet, vol. 8, pp. 424–436, June 2007. [74] M. Harbers and P. Carninci, “Tag-based approaches for transcriptome research and genome annotation,” Nat Meth, vol. 2, pp. 495–502, July 2005. [75] R. Kodzius, M. Kojima, H. Nishiyori, M. Nakamura, S. Fukuda, M. Tagami, D. Sasaki, K. Imamura, C. Kai, M. Harbers, Y . Hayashizaki, and P. Carninci, 57 “CAGE: cap analysis of gene expression,” Nat Meth, vol. 3, pp. 211–222, Mar. 2006. [76] M. Salimullah, M. Sakai, S. Mizuho, C. Plessy, and P. Carninci, “NanoCAGE: a high-resolution technique to discover and interrogate cell transcriptomes.,” Cold Spring Harb Protoc, vol. 2011, p. pdb.prot5559, Jan. 2011. [77] J. M. Clark, “Novel non-templated nucleotide addition reactions catalyzed by pro- caryotic and eucaryotic DNA polymerases.,” Nucleic Acids Res, vol. 16, pp. 9677– 9686, Oct. 1988. [78] ENCODE, “Post-transcriptional processing generates a diversity of 5’-modified long and short RNAs,” Nature, vol. 457, pp. 1028–1032, Feb. 2009. [79] H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows- Wheeler transform.,” Bioinformatics, vol. 25, pp. 1754–1760, July 2009. [80] R. A. Hoskins, J. M. Landolin, J. B. Brown, J. E. Sandler, H. Takahashi, T. Lass- mann, C. Yu, B. W. Booth, D. Zhang, K. H. Wan, L. Yang, N. Boley, J. Andrews, T. C. Kaufman, B. R. Graveley, P. J. Bickel, P. Carninci, J. W. Carlson, and S. E. Celniker, “Genome-wide analysis of promoter architecture in Drosophila melanogaster.,” Genome Res, vol. 21, pp. 182–192, Feb. 2011. [81] H. Hu and X. Li, “Transcriptional regulation in eukaryotic ribosomal protein genes,” Genomics, vol. 90, pp. 421–423, Oct. 2007. [82] R. R. Hudson, M. Kreitman, and M. Aguad´ e, “A test of neutral molecular evolution based on nucleotide data.,” Genetics, vol. 116, pp. 153–159, May 1987. [83] Y . Malherbe, A. Kamping, W. van Delden, and L. van de Zande, “ADH enzyme activity and Adh gene expression in Drosophila melanogaster lines differentially selected for increased alcohol tolerance.,” J Evol Biol, vol. 18, pp. 811–819, July 2005. [84] M. Ogueta, O. Cibik, R. Eltrop, A. Schneider, and H. Scholz, “The influ- ence of Adh function on ethanol preference and tolerance in adult Drosophila melanogaster.,” Chem. Senses, vol. 35, pp. 813–822, Nov. 2010. [85] J. W. Posakony, J. A. Fischer, and T. Maniatis, “Identification of DNA sequences required for the regulation of Drosophila alcohol dehydrogenase gene expression.,” Cold Spring Harbor Symp. Quant. Biol, vol. 50, pp. 515–520, 1985. [86] M. Papaceit, D. Orengo, and E. Juan, “Sequences upstream of the homologous cis- elements of the Adh adult enhancer of Drosophila are required for maximal levels of Adh gene transcription in adults of Scaptodrosophila lebanonensis.,” Genetics, vol. 167, pp. 289–299, May 2004. 58 [87] W. Nickel and M. Seedorf, “Unconventional mechanisms of protein transport to the cell surface of eukaryotic cells,” Annual Review of Cell and Developmental Biology, 2008. [88] E. A. Rach, H.-Y . Yuan, W. H. Majoros, P. Tomancak, and U. Ohler, “Motif compo- sition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome.,” Genome Biol, vol. 10, no. 7, p. R73, 2009. [89] P. Carninci, A. Sandelin, B. Lenhard, S. Katayama, K. Shimokawa, J. Ponjavic, C. A. M. Semple, M. S. Taylor, P. G. Engstrom, M. C. Frith, A. R. R. For- rest, W. B. Alkema, S. L. Tan, C. Plessy, R. Kodzius, T. Ravasi, T. Kasukawa, S. Fukuda, M. Kanamori-Katayama, Y . Kitazume, H. Kawaji, C. Kai, M. Naka- mura, H. Konno, K. Nakano, S. Mottagui-Tabar, P. Arner, A. Chesi, S. Gustincich, F. Persichetti, H. Suzuki, S. M. Grimmond, C. A. Wells, V . Orlando, C. Wahlest- edt, E. T. Liu, M. Harbers, J. Kawai, V . B. Bajic, D. A. Hume, and Y . Hayashizaki, “Genome-wide analysis of mammalian promoter architecture and evolution,” Nat Genet., vol. 38, pp. 626–635, June 2006. [90] R. M. Clark, T. N. Wagler, P. Quijada, and J. Doebley, “A distant upstream enhancer at the maize domestication gene tb1 has pleiotropic effects on plant and inflorescent architecture,” Nat Genet., vol. 38, pp. 594–597, Apr. 2006. [91] V . Nolte, R. V . Pandey, R. Kofler, and C. Schl¨ otterer, “Genome-wide patterns of natural variation reveal strong selective sweeps and ongoing genomic conflict in Drosophila mauritiana.,” Genome Res, Oct. 2012. [92] J. Goecks, A. Nekrutenko, J. Taylor, and Galaxy Team, “Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.,” Genome Biol, vol. 11, no. 8, p. R86, 2010. [93] D. Blankenberg, G. V on Kuster, N. Coraor, G. Ananda, R. Lazarus, M. Mangan, A. Nekrutenko, and J. Taylor, “Galaxy: a web-based genome analysis tool for experimentalists,” Curr Protoc Mol Biol, vol. Chapter 19, pp. Unit 19.10.1–21, Jan. 2010. [94] B. Giardine, C. Riemer, R. Hardison, R. Burhans, L. Elnitski, P. Shah, Y . Zhang, D. Blankenberg, I. Albert, J. Taylor, W. Miller, W. J. Kent, and A. Nekrutenko, “Galaxy: a platform for interactive large-scale genome analysis,” Genome Res, vol. 15, pp. 1451–1455, Oct. 2005. 59 .1 Appendix A .1.1 Allele-Specific Expression Assays Using Solexa - Supplemental Material Coverage with relaxed alignment criteria around the SNP (5bp Pre-SNP and 3bp’s post-SNP) Line Reads % CG11459 2,088,931 35% CG2604 1,434,025 24% DSX 963,692 16% CG10824 107,315 2% CG4120 a 1,065,503 18% unable to align 340,501 5% Total 5,999,967 100% Coverage with increased alignment criteria around the SNP (5bp Pre-SNP and 8bp’s post-SNP) Line Reads % CG11459 2043861 34% CG2604 1401941 23% DSX 924548 16% CG10824 103581 2% CG4120 b 9109 0.20% unable to align 1516927 25% Total 5999967 100% Table 1: Solexa Single Lane Sequencing Coverage Per Gene a CG4120 (Cyp21c) was thrown out of analysis because of non-specific amplification on the 2nd (Cyp4ac) and on the 3rd Chromosome (Cyp21c genes). b See footnote a. 60 Position in read *CG10824 CG2604 *DSX CG11459 (A/G SNP) (C/T SNP) (A/G SNP) (C/T SNP) 20 0.0079 0.002 0.0228 N/A 21 0.0089 0.002 0.0236 N/A 22 0.0052 0.002 0.0117 0.0008 23 N/A N/A N/A 0.0005 24 N/A N/A N/A 0.0008 Table 2: Sequencing Error and Position Effect. Values are the proportion of non-SNP base pairs at different positions in the sequencing read as the barcode length varied. * Note: A/G SNP’s both have a consistent bias for miscalling C’s over T’s (data not shown) and a higher error. Gene Type Reads Sampling (s 2 ) Tech. (s 2 ) Bio. (s 2 ) CG10824 parental mix 2,173.50 0.00064 No data 0.0117 CG10824 Heterozygote 165.2 0.00755 0.0034 0.0023 CG10824 introgression 903.8 0.00242 0.0044 0.0045 CG11459 parental mix 28,964.30 0.00003 No data 0.0033 CG11459 Heterozygote 29,867.80 0.00004 0.0308 0.0155 CG11459 introgression 20,913.10 0.00011 0.0021 0.0293 CG2604 parental mix 23,445.80 0.00011 No data 0.0008 CG2604 Heterozygote 26,801.50 0.0001 0.0002 0.0033 CG2604 introgression 22,834.10 0.00008 0.0002 0.0058 dsx parental mix 11,903.50 0.00027 No data 0.0284 dsx Heterozygote 13,466.30 0.00018 0.0002 0.0016 dsx introgression 17,058.70 0.00017 0.0042 0.0032 0.0057 0.0091 Table 3: Estimation of the mean binomial sampling, technical, and biological variance (s 2 ) for each gene and type of assay. 61 DNA F1 Mix Allele 1 d1 d3 Allele 2 d2 d4 RNA F1 Mix Allele 1 r1 = d1 x e1 d3 = d3 x e3 Allele 2 r2 = d2 x e2 r4 = d4 x e4 Unobserved expression F1 Mix Allele 1 e1 e3 Allele 2 e2 e4 Table 4: Correction of RNA using ASE in Het DNA. This normalization method was used to calculate the variance in technical and biological replicates, compared to the binomial distribution. Due perhaps to a difference in size of the two flies in mix or some bias in the assay, the ratios d1/(d1+d2) and d3/(d3+d4) may not both approximate 0.5. We assume that this bias, whatever its reason, is the same for DNA and RNA. Therefore, the observed RNA counts are a product of the DNA counts, which incorporate this bias, and some unobserved expression value, which is what we care about. We are interested in whether the ratios e1/(e1+e2) and e3/(e3+e4) are significantly different. 62 .2 Appendix B .2.1 Allele-specific expression using Solexa detailed protocol 1) Identify a SNP between two parental lines. 2) Design gene specific 18-20bp annealing primers as follows: forward primer flanking the 5’ end of the SNP such that the base immediately following the 3 end of the primer is the SNP, the second 200-300bp’s downstream from the SNP. 3) Check primer design and verify that no additional SNP’s occur within the gene specific annealing primers. 4) Create barcodes for each unique sample (we used 1-3 bp’s). 5) Obtain Solexa adapter sequences A1 and A2 available upon request from Illumina. 6) Identify Solexa Adapter 1 sequence (A-1) (note: be sure to include the additional adapter sequence added during PCR enrichment step. The A1 adapter alone will not be sufficient). Solexa Adaptor 2 sequence should be as given. (Figure .2.1) Design forward primers with modified 5’ tails, starting 5’ to 3’ as follows: Solexa Adapter-1 sequence, barcode, forward primer from step 2. 7) Design reverse primers with modified 5’ tails, starting 5’ to 3’ as follows: Solexa Adapter-2 sequence, reverse primer from step 2. 8) Design a custom Solexa sequencing primer to anneal immediately flanking the barcode sequence on the primer from step 7 (Figure .2.1). 9) Extraction of Nucleic Acids: - DNA and RNA were isolated from 14 individual flies using a protocol modified by Wittkopp (2004) for Promega’s SV Total RNA Extraction Kit. In brief, RNA was isolated as described with DNAse treatment of column-bound RNA prior to elution. The DNA, normally discarded with the DNA affinity column, was washed and eluted from the column using water. Nucleic acid quantification was carried out using a 63 Nanodrop. All samples were stored at -70 C. 10) cDNA Synthesis: -cDNA was synthesized from total RNA within 24 hours of isolation using the High- Capacity cDNA Reverse Transcription Kit with RNAse Inhibitor (Applied Biosystems) according to the manufacturer’s protocol. All cDNA samples were stored at -70C. 11) PCR/Purification: - PCR was performed using Phusion High-Fidelity DNA Polymerase (Finnzymes Inc.) in 25 uL reactions with gene-specific, modified Solexa A1 and A2 primer pairs using standard reaction conditions and touchdown PCR in which the annealing temperature was decreased by 0.5 C every cycle, from 68 C to 60 C, at which 25 cycles were carried out. 20ul from each PCR reaction was run on a 2% super fine agarose gel (Amresco) for amplification confirmation and gel purification. PCR products were gel purified using Qiaex II Gel Extraction Kit (Qiagen) and eluted in 20 uL Qiagen EB. Pooled samples for sequencing were generated by combining 5ul of each of the 300 assays (5 genes x 60 samples/gene), ethanol precipitated and resuspended to 10nM in water (1% tris-buffered). The concentration of the pooled samples was determined using a NanoDrop. 20ul of 100uM custom sequencing primer and 20ul of 10nM of the pooled sample were sent to the Cornell Life Sciences Core Laboratories Center for Illumina/Solexa DNA sequencing. 64 65 .3 Appendix C .3.1 TSS Evolution Supplemental Materials Estimating TSS peak structure: Entropy (E) estimate at each TSS: E = P 65 i=0 p i log 2 p i . Choosing an optimal window size for entropy estimates: To maximize our ability to differentiate different peak shapes, we chose a window size that maximized the variance (st.d.). We found that at 32 (actual window of 65bps), the std was at its maximum (as seen in figure). Figure 1: Optimal window size. We estimated promoter-specific expression and TSS peak shape using a fixed window size. Optimizing the window size is important in order to increase our precision in detecting broad and sharp peaks. 66 Hoskins et al. used entropy to estimate TSS peak shape. However, they employed a variable window size in their estimates. The variable window size will add bias when comparing between species. Thus, we adopted a fixed window approach. Laplace’s Rule: There are many cases within a window that received zero TSS hits. This would result in an underestimate of the true probability of this position gaining a TSS hit with more sequencing. Thus, we added a 1, uniformly across the window, including positions with more than one read. Then we calculated the entropy for each window including these pseudocounts. !"# $%! sec dpse &'()*+,- &'()*+,- &'()*+,- &'()*+,- ./++#"!"'(0#-1%2/)"-34-5%$()%6/(%*'-*7-8..-&'()*+,-&$(%!0("$ Figure 2: Entropy distribution. Shown is the frequency distribution of TSS peak entropy estimates for each species. The vast majority of the TSS peaks are high entropy values, indicating that most TSS peaks are broad. 67 .4 Appendix D .4.1 5 0 Anchored Paired-End Read Protocol ! ! ! ! ! ! ! "#$%&'($)(*!+,!-.+#/$!0#123!45/#)(!67896-4:!;#3<! :=31>)3!?/3>&!6@-!".>;(!2'1(!A/&,A5($/&9%1((!%/1!A&>$32<! .6@-!B2/&>3#/$!"%/1!/A3#.>&!1(2'&32C!23>13!D#35!E'F!3/3>&!6@-<!! G'>$3#%,!3/3>&!6@-!'2#$F!35(!G'+#3H!! ! ! ! ! !!!!!!!!!!!! 4>&)'&>3(!I,$>+(>*2!$((*(*J!KLL'7MNE'F!?/3>&!6@-!"!#$%&'(! ! !!!!!! 6(2'2A($*!I,$>+(>*2!#$!)*+!#$%&'(!O#$*#$F!O'%%(1H! ! ! !!!!!!!!!!!!!!!!! -*P'23!3/3>&!6@-!3/!I,$>+(>*!Q/&'.(!D#35!I:R4!D>3(1H!! ! ! !!!!!!!!!!!!! B$)'+>3(!S!TEU4!%/1!K!.#$H!3/!*#21'A3!K$*>1,!231')3'1(H99V!B4:H!! ! ! !!!!!!!!!!!!! -**!6@-!3/!*,$>+(>*2H!8#=!99V!1/3>3(!%/1!E!.#$H!S!6?H! ! ! ! ! !!!!!!!!!!!!! W>25!KX!D#35!W>25#$F!+'%%(1!O! ! ! ! ! ! ! ! !!!!!!!!!!!!! -**!YL'7!YL.8!?1#29Z4&H!8#=H!B$)'+>3(!S![LU4!%/1!K!.#$H! ! ! !!!!!!!!!!!!! G'#);&,!1(./Q(!2'A(1$>3>$3!/Q(1!.>F$(3! ?-R!?1(>3!"\,/'!.>,!#$)&'*(!>!9?-R!)/$31/&!5(1(<! YL'7!.6@-!"%1/.!YL'F!3/3>&!6@-<C!K'7!?-R!+'%%(1!"YLX<C!]'7!?-R"KL!^C!K'7! 6@-2(!#$5#+!">.+#/$<C!K'7!ZK_!! ,-.!/(01%23! -04&561(!>3!`NU4!%/1!Y!5/'1!!!!!!"a3/1(!S!9[LU4<!!! 6@-!4&(>$!! :&'3(!#$!YYHE'7! 6@-!-*>A3(1!7#F>3#/$!! -4-4^4^^^444^-4-4b-4b4^4^^44b-^4\^!!"KEL'8!23/);C!YLL'8!>&#c'/3<! `L'7!6(>)3#/$!!"2>Q(!E'7!?-Rd*!.6@-!%/1!./1(!3(232!&>3(1<! YL'7!?-Rd*!.6@-C!K'7!6@-!>*>A3(1!YLL'8!"KLLA./&<C!`'7!&#F>2(!+'%%(1!"YLX<\C! YL'7!!3]!6@-!&#F>2(!"EL^<C!K'7!"[L^<!6@-2(!#$5#+#3/1C!`'7!ZK_!3/!`L'7H!8#=!F($3&,! D#35!A#A(33(H! !704&561(!8!9:;<!=$>!)!?$&>! 68 !"#$%&'(&)*+,-.&/#(.0&1+&$-0203-(.&3314&56&768&9:;8&;<6=:&>?@A&BC+(1#'(0&;8DE& F7;&G$#)%-(1#1'+(&B0--&/113HII"""4#%J'+(4C+%I1-C/*'JI03-CI03KL?MN43.O4+*.E& ;..&>4>2P&QNR&G$#)%-(1#1'+(&<2OO-$&B1+1#*&,+*2%-&(+"&>>2PE& S(C2J#1-&T&?N@A&O+$&M&%'(4& ;..&>4>2P&01+3&J2OO-$&#(.&3*#C-&+(&'C-4& F7;&A*-#(&BUHQ&F7;&J'(.'()&J2OO-$&H&0#%3*-E& !"#$%&'(&QQ2P&9U6& F-,-$0-&8$#(0C$'31'+(& & Q4& ;00-%J*-&1/-&O+**+"'()&$-#C1'+(&'(&#&UNN2P&1/'(&"#**&DAF&12J-&&& & & & F#(.+%&9-V#%-$0&B>2)IWPX&S(,'1$+)-(X&YMLQZN[NQQE&&&&Q&WP&&& & & & G$#)%-(1-.&%F7;&BQNN()IWPE&&&&&&&& & &&&&&&&&&&&QN&WP& & U4& S(C2J#1-&1/-&0#%3*-&'(&#&DAF&1/-$%+C\C*-$&]^_A&O+$&^&%'(21-0X&#(.&& & & 1/-(&3*#C-&1/-&12J-&+(&'C-4& & >4& `'V&1/-&O+**+"'()&'(&+$.-$X&%#a-&QNb&-V1$#&$-#)-(1&O+$&%2*1'3*-&& & & 0#%3*-0H& & & & & & & & & & & & ^c&O'$01&01$#(.&J2OO-$&BS(,'1$+)-(X&YQLN]M[NQME& & M2P& & & & QNN%`&588&BS(,'1$+)-(X&YQLN]M[NQME& & & UWP&&&&&&&&&&&&&&& & & & .78D&%'V&BQN%`E& & & & & & QWP& & & & F7;0-&'(/'J4&BMNdIWPE&B;%J'+(E&&&&& &&&&&&&&&&&&&&&&&&&&&&&& N4^WP&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& & & & 9U6& & & & & & & & N4^2P& & M4& ;..&LWP&+O&%'V12$-&1+&1/-&12J-X&%'V&"-**X&#(.&/-#1&1/-&0#%3*-&U^_A&& & & '(&#&1/-$%+C\C*-$&O+$&U&%'(21-04& & ^4& ;..&Q&WP&e23-$eC$'31&SS&BUNNdI&WPX&S(,'1$+)-(X&YQLN]M[NQME&1+&1/-&& & & 0#%3*-X&#(.&'(C2J#1-&1/-&0#%3*-&'(&#&1/-$%+C\C*-$&"'1/&O+**+"'()&& & & 3$+)$#%H& & & & & & & & & & & & e1-3&Q&& U^_A& QN&%'(&& & & & & & & & e1-3&U&& MU_A& ^N&%'(&& & & & & & & & e1-3&>&& ?N_A& Q^&%'(&& & & & & & & & e1-3&M&& M&_A&&&&&9+*.& & &''()#*(+,&-%(.(/0123$456%17(81'(99:(01;#<8$%(=(>?@A(B54(CD(E31F(( & & & & 69 ! "#$%&'!()*+,-./,!0.,123+,4+5!67!*8+!2.69&97$:9+:!;<=!:9/*/4/2>0?@3A$@3A! ;9.B+9C,/9B&2>! 4DEF! 042+&,>! ! ! ! ! ! G@3H! ! !!!!!!!! !!!!!!!!! "#$%FI!0GJ!"9++,!A&-*+9!A.)>! ! ! ! KL3H! ! ! !!!!!!!!! KM$ENG$FDF;%(=!:9.B+9!0KL3A>! ! ! ! @3H! ! ! !!!!! ! NG#! ! ! ! ! ! ! ! GO3H! ! !!!!!!!!!!! ! %#%FH!! ! ! ! ! ! ! @LL3H!!!!!!!!!!!!!!!!!! KM$ENG$FDF;%(=!:9.B+9P! KM$!QKFBA<RQF<!F<%!<%%!%<<!<%F!<F<!"F<!"<%!<%%!<<"!F%<!%!$SM! TKU<!KPLL!VTKU<!LPSLW!RLU<!LPXKW!YGU<!GPLLZ!!"W!YGU<!YPLL[!XU<!X+\+9! ]7B/!DEF!<2+&,Q</,4+,*9&*/9! (23*+!.,!@@!3H?! F5&:*/9!H.^&*./,!0(:.4+,*9+!_&-*$H.,`!DEF!H.^&*./,!a.*->!bbb&5&:*+9!-8/325!6+!@LP@! 9&*./?!! ! @?! ;9+:&9+!*8+!1/22/c.,^!9+&4*./,!B.)P! ! ! ! (23*+5!DEF! ! ! ! @LdH!!!! ! ! ! ! ! @LJ!_&-*$H.,`!H.^&*./,!e311+9! @?KdH! ! ! ! ! ! ! F%;!0@LBA>! ! ! ! @?KdH! ! ! ! ! ! ! F5&:*/9!/2.^/!B.)!0XT3A>! ! !!!@dH!! ! ! ! ! ! ! _&-*$H.,`!DEF!H.^&-+! ! !!!@dH! ! G?! f,436&*+!&*!GKg<!1/9!@K!B.,3*+-?! ! S?! N+&*!.,&4*.\&*+!&*!YLg<!1/9!@K!B.,3*+-! hf](!h(H(<%f#E!/1!4DEF!%+B:2&*+-!! ! @?! ;9+:&9+!&!@?Ki!&^&9/-+!^+2!0K3H!-&1+\.+cQ@LLBH!&^&9!-/23*./,>! ! G?! #$%&'())*!H/&5!K3H!/1!@LL6:!DEF!2&55+9!*8+,!2+&\+!&*!2+&-*!G!c+22-! ! ! 6+*c++,!-&B:2+-!*/!:9+\+,*!49/--!4/,*&B.,&*./,?!0&55!&22!-&B:2+>! ! S?! =3,!^+2!&*!@GLj!1/9!GLB.,!c.*8!@`6!/9!@LL6:!2&55+9! ! X?! 042+&,!-391&4+->!<3*!/3*!^+2!-2.4+!&*!GLL$KLL6:!/9!SLL6:!kQ$!GK6:!!! "+2!;39.1.4&*./,!0]7B/9+-+&948!"+2!DEF!=+4/\+97!a.*>! +,-./0,-/1./"23#4/ &-5.,/67/"89)/:1.,;!! ! ! 70 !"#$%&'()*+,&-$./$!0'(/(,1$)234$5,+678-,9$ $ :;$ <,-$06$!"#$+89-,'$+(=>$+8?,$:@A$,=-'8$',8B,&-$/.'$+07-(67,$98+67,9C$ $ $ D=$!*09(.&$EF$G0//,'$ $ $ :@$H7$ $ $ I07-(J<-1$!"#$6'(+,'$:;@$KLD$HIM$ :$H7$ $ $ I07-(67,=(&B$!"#$6'(+,'$L;@$K@;D$HIM$:$H7$$NNN#%"O#2$G4#"O2%$P<%2NNN$ $ $ !"#$6'(+,'$(&1,=$:$KLD$HIM$ $ :$H7$ $ $ 135!$K:@+IM$$ $ $ :;LD$H7$ $ $ !*09(.&$6.7Q+,'89,$ $ $ @;D$H7$ $ $ <8+67,$$ $ $ $ RD;LD$H7$ $ $ 5O54S$$ $ $ $ D@$0S$ $ TUV"$:C@@$WTUV"$@C:@X$YDV"$@CR@X$ZLV"$@CR@[$!"#X$ZLV"$DC@@>$\V"$\,],'$ $ $ $ $ $ $ ^,7$%=-'8)-(.&$8&1$%&'()*+,&-$".&/('+8-(.&$ $ :;$#0&$98+67,$.0-$.&$8$:;DA$B,7;$S,8],$8-$7,89-$:$78&,$(&$$_,-`,,&$98+67,9$.'$ $ 7811,';$"0-$.0-$/.'$9,a0,&)(&B;$ $ L;$^,7$,=-'8)-$`(-*$bQ+.$?(-;$$%&'()*+)",&-)$.)/),0"1'2((+)3,) 466,&1(=$ 4%*56+&7%(6'*8(9) :@@$HI$9-.)?9$./$,8)*$.7(B.$`,',$6',68',1$09(&B$5%$6E$U;$$2(70-,$-.$866'.6'(8-,$ ).&),&-'8-(.&$`(-*$ELO$ :8;<'(=9)) I07-(67,=(&B$8186-,'$.7(B.$L$ Dc$!d^45"^^44^4^"4"4"^5"5$ I07-(J<-1$8186-,'$.7(B.$:$ Dc$4"4"5"555"""54"4"^4"^"5"55""^45"5$ $ 5.$6',68',$\T$HI$8186-,'C$ I07-(67,=(&B$8186-,'$.7(B.$L$K:@@$HIM$$ \T$H7$ I07-(J<-1$8186-,'$.7(B.$:$K:@@$HIM$$ $ \T$H7$ 38"7$KDIM$$ $ $ $ $ L$H7$ YD°"$/.'$D$+(&9>$\D°"$/.'$D$+(&9>$RZ°"$/.'$:@$+(&9>$LD°"$/.'$:@$+(&9$ $ $ $ $ $ $ $ $ $ 71 !"#$!%&'(%)$ !"#$%&'$()*+,)*-%./-)012)3456!7) 58)99:;9:9+;;+;9++9++;9;9:+:9+9+:+:::+++:9+9+;9+;+:+::++;9:+<:) !"#$%=#/>%?@)*+,)*-%./-)412)32156!7) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))58)*+*,"+**,*++"9;9+;:;:;+:+::++;9:+<:) ) $ -(./(01&02$!%&'(%)$304$,4356(%$7%&(0636&80$ A?(/>),/B()'/C"/?D%?@)*-%./-)30226!7) 58);9:+;;99;9;+9+9+;:+:;99+:++9;:+9+) ) ) ) 9E9*:F,) )30127)58)99:;9:9+;;+;9++9++;9;9:+:9+9+:+:::+++:9+9+;9+;+:+::++;9:+<:) ) ) ) ) ) )))))58)9+9+:+:::+++:9+9+;9+;+:+)::++;9:+:) ) ) ) ) ) ) )))))))))))))))))))))))))))!"!#"$"$+;9;99;;+:9; * ) ) ) ) ) )))34127)))58);:;9+:;;9;::+9;9+;:;:;+:+::++;9:+<:) ) ) ) ) ) ))))))))))))))))GAHIF,) ))))+99;+9;99;9+;;+9:9+;9;9:+;:;9:*+*,"+**,*++9") ) 9E9*:F,) ) J) ) ) ) ) )))))))))))))))))))KLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL) ) ) ) ) ) ) AHEFM)'FNOFH+AH;)*,A!F,) $ $ -8:(;3$<04(;&02$!%&'(%)$=>?6'@$ $ *+,)*-%./-)A?(/>)0) +99;+9;99;9+;;+9:9+;9;9:+;:;9:;:;9+:;;9;::<+) *+,)*-%./-)A?(/>)4) +99;+9;99;9+;;+9:9+;9;9:9+9:+;;:;9+:;;9;::<+) *+,)*-%./-)A?(/>)P) +99;+9;99;9+;;+9:9+;9;9:;++:99;:;9+:;;9;::<+) *+,)*-%./-)A?(/>)Q) +99;+9;99;9+;;+9:9+;9;9::;;:+9;:;9+:;;9;::<+) *+,)*-%./-)A?(/>)5) ) +99;+9;99;9+;;+9:9+;9;9:+9+:;:;:;9+:;;9;::<+) *+,)*-%./-)A?(/>)R) +99;+9;99;9+;;+9:9+;9;9:9::;;+;:;9+:;;9;::<+) *+,)*-%./-)A?(/>)S) +99;+9;99;9+;;+9:9+;9;9:;9:+:;;:;9+:;;9;::<+) *+,)*-%./-)A?(/>)T) +99;+9;99;9+;;+9:9+;9;9::+99;:;:;9+:;;9;::<+) *+,)*-%./-)A?(/>)U) +99;+9;99;9+;;+9:9+;9;9:+:;9:+;:;9+:;;9;::<+) *+,)*-%./-)A?(/>)02) +99;+9;99;9+;;+9:9+;9;9:99;+:9;:;9+:;;9;::<+) *+,)*-%./-)A?(/>)00) +99;+9;99;9+;;+9:9+;9;9:;:9;++;:;9+:;;9;::<+) *+,)*-%./-)A?(/>)04) +99;+9;99;9+;;+9:9+;9;9::9+99;;:;9+:;;9;::<+) 72
Abstract (if available)
Abstract
Connecting genetic variants with phenotypic differences is a fundamental aim in bio- logical science. A functional genetic mutation can affect the phenotype by altering the protein sequence of a gene or by perturbing gene expression patterns. The rela- tive contribution of coding versus regulatory variation on phenotypic diversity remains debatable, but the importance of gene expression differences in adaptation, speciation, and disease susceptibility cannot be overstated. In this thesis, I elucidate regulatory evolution in Drosophila using two new high-throughput sequencing library preparation techniques described herein. The first technique PCR amplifies known polymorphic regions of the genome and sequences them with high-throughput sequencing to pre- cisely assay allele-specific expression levels. Allele-specific expression estimates in parents and hybrids can be used to estimate variation at the gene of interest (cis) or else- where in the genome (trans). To validate our method, we measured the allelic bias in a dilution series of homozygous parental alleles and found high correlations between measured and expected values (r>0.9, p < 0.001). We applied this method to a set of 5 genes in a D. simulans parental mix, F1 and introgression and found that for these genes the majority of expression divergence can be explained by cis-regulatory variation and non-additive interactions were not detected. ❧ Differences in alternative promoter usage or movement of promoter elements and their associated transcription start site (TSS) may contribute to cis-regulatory variation between species. However, TSS evolution remains largely undescribed in Drosophila, likely due to limited annotations in non-melanogaster species. Here, we introduce a second new method that anchors sequencing output to the 5‘ end of mRNA. Using this method, we called TSS locations in four Drosophila species, including: D. melanogaster, D. simulans, D. sechellia, and D. pseudoobscura. For verification, we compared our results in D. melanogaster with known annotations, published 5‘ RACE data, and with RNAseq from the same mRNA pool. Then, we paired 2849 D. melanogaster TSS with its closest equivalent TSS in each species (likely to be its true ortholog) using the avail- able multiple sequence alignments. Most of the D. melanogaster TSS were successfully paired with an ortholog in each species (83%, 86%, and 55% for D. simulans, D. sechel- lia, and D. pseudoobscura, respectively). Based on the number and distribution of reads mapped at each TSS, we also estimated promoter-specific expression and TSS peak shape, respectively. Among paired TSS orthologs, the location and promoter activity were largely conserved. TSS location appears important as promoter-specific expres- sion and TSS peak shape were more frequently divergent among TSS that had moved. Unpaired TSS were surprisingly common in D. pseudoobscura. An increased mutation rate upstream of TSS might explain this pattern. We found an enrichment of ribosomal protein genes among diverged TSS, suggesting that TSS evolution is not uniform across the genome.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
The evolution of gene regulatory networks
PDF
Long term evolution of gene duplicates in arabidopsis polyploids
PDF
Robustness and stochasticity in Drosophila development
PDF
Identifying allele-specific DNA methylation in mammalian genomes
PDF
Probing the genetic basis of gene expression variation through Bayesian analysis of allelic imbalance and transcriptome studies of oil palm interspecies hybrids
PDF
Orthogonal shared basis factorization: cross-species gene expression analysis using a common expression subspace
PDF
Computational and experimental approaches for the identification of genes and gene networks in the Drosophila sex-determination hierarchy
PDF
Biological interactions on the behavioral, genomic, and ecological scale: investigating patterns in Drosophila melanogaster of the southeast United States and Caribbean islands
PDF
Chorion gene amplification in Drosophila melanogaster
PDF
Natural variation of Arabidopsis thaliana methylome and its impact on genome evolution
PDF
Behavioral choice assays and alcohol preference in Drosophila melanogaster
PDF
Characterization of Drosophila longevity and fecundity regulating genes
PDF
The use of alignment-free statistics for the evolutionary study of study of 5' cis-regulatory sequences
PDF
Genetic engineering of fungi to enhance the production and elucidate the biosynthesis of bioactive secondary metabolites
PDF
Evolutionary genomic analysis in heterogeneous populations of non-model and model organisms
PDF
Understanding DNA methylation and nucleosome organization in cancer cells using single molecule sequencing
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Comparative analysis of DNA methylation in mammals
PDF
Comparative transcriptomics: connecting the genome to evolution
PDF
Phylogeography, reproductive isolation, and the evolution of sex determination mechanisms in the copepod Tigriopus californicus
Asset Metadata
Creator
Main, Bradley Jay
(author)
Core Title
Innovative sequencing techniques elucidate gene regulatory evolution in Drosophila
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Molecular Biology
Publication Date
03/28/2015
Defense Date
02/28/2013
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
allele-specific expression,Drosophila,gene regulatory evolution,high-throughput sequencing,OAI-PMH Harvest,transcription start site,TSS
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Nuzhdin, Sergey V. (
committee chair
), Conti, David V. (
committee member
), Dean, Matthew D. (
committee member
), Smith, Andrew D. (
committee member
)
Creator Email
bmain1@mac.com,bradmain@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-228097
Unique identifier
UC11295020
Identifier
usctheses-c3-228097 (legacy record id)
Legacy Identifier
etd-MainBradle-1491-0.pdf
Dmrecord
228097
Document Type
Dissertation
Rights
Main, Bradley Jay
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
allele-specific expression
Drosophila
gene regulatory evolution
high-throughput sequencing
transcription start site
TSS