INTEGRATED GENOMIC & EPIGENOMIC ANALYSES OF GLIOBLASTOMA MULTIFORME: METHODS DEVELOPMENT AND APPLICATION

by Houtan Noushmehr

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (GENETIC, MOLECULAR, AND CELLULAR BIOLOGY)

May 2011

Copyright 2011 Houtan Noushmehr

Epigraph

"Nothing in biology makes sense except in the light of evolution."
Theodosius Dobzhansky, American Biology Teacher, March 1973 (35:125-129)

Dedication

I dedicate this dissertation to my loving wife, Ana Valeria Castro Noushmehr, and to my little "sho-sho-zinho", Livia Castro Noushmehr. Without their unconditional support and love, the body of work presented in this dissertation would not have been possible.

Acknowledgments

I would like to thank a number of people who, over the years, have generously provided support, expertise, patience and guidance to me during and before this study.

To Dr. Peter W. Laird, my sincere appreciation for your guidance and generous support, and for allowing me the opportunity to take on one of the biggest challenges of my life. Thank you also for your constructive criticism, useful suggestions and encouragement throughout my research. You have inspired me to pursue wonderful things. To Dr. Daniel J. Weisenberger for getting me over the hurdles and pushing me along when I needed it most. To Dr. Benjamin P. Berman for giving me my initial shot at working with data analysis and providing me with insightful and enriching views on interpreting the data by way of color. To Drs. Toshi Hinoue & Simeen Malik for giving me a nice perspective on how a hard-core molecular biologist/non-bioinformatician/non-programmer can pick up the necessary skills and do what is needed to analyze large data. To my fellow classmates in the PIBBS program and to future Drs. Timothy J. Triche Jr.
& Hui Shen who have been extraordinary in their support and in keeping our office space fun and exciting. To Ite A. Laird-Offringa for believing in me and giving me the opportunity to pursue what I love to do. To my previous USC mentor, Dr. Wange Lu, for also believing in me and giving me an opportunity to understand what a young investigator must go through to be successful.

A special appreciation is extended to Dr. Kristina Obom, director of The Johns Hopkins Bioinformatics Masters Program (www.bioinformatics.jhu.edu), for being a constant source of support and guidance during my Bioinformatics Masters Degree program and for encouraging me to pursue my Ph.D. In addition, a special appreciation is extended to Dr. Benjamin Feldman from the National Institutes of Health/National Human Genome Research Institute (www.genome.gov/11007795, Principal Investigator at the NHGRI) for giving me the unique and enriching opportunity to complete a successful "Intramural Research Training Award" (IRTA) Fellowship and for never giving up on me.

I would also like to extend a special thank you to each member of my Ph.D. Dissertation committee, Drs. Gerhard A. Coetzee, Timothy J. Triche Sr., and Jonathan D. Buckley, who have been extremely generous with their time and guidance.

I would also like to thank all the research members of The Cancer Genome Atlas (TCGA); without their contributions and dedication during the first few years of the pilot phase, my dissertation work would not have materialized to the level presented here. I would therefore like to acknowledge each TCGA member; please see Appendix A (page 129) for a complete list of names and affiliations.

To all the wonderful staff and researchers at the USC Epigenome Center, thank you for making the latter years enjoyable.
This dissertation was written entirely in LaTeX2e; I would therefore like to acknowledge the efforts made by countless users (in some cases anonymous users) around the world who have made LaTeX2e free, accessible and robust to use (www.latex-project.org).

Table of Contents

Epigraph
Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Epigenetics
    1.1.1 The Normal Role of Epigenetics
      1.1.1.1 Histone Modifications
      1.1.1.2 DNA Methylation
    1.1.2 DNA Methylation in Cancer
  1.2 Thesis Outline
Chapter 2: Process, Handle and Analyze Infinium Human Methylation27K Data
  2.1 Introduction
    2.1.1 "What is Bioinformatics?"
    2.1.2 Open Source Tool
      2.1.2.1 R
      2.1.2.2 Bioconductor
    2.1.3 Commercial Tool - GenomeStudio
      2.1.3.1 The Illumina Infinium Platform for assaying DNA methylation
      2.1.3.2 Commercial Software to Process Data: GenomeStudio
    2.1.4 Summary and Recommendations
  2.2 RAPiD.pro Script for Handling and Processing Illumina Infinium DNA Methylation Arrays
    2.2.1 From Scanner Output to R: Data Import Protocol
    2.2.2 Multiscan Calibration Protocol
    2.2.3 Quality Control Plots
    2.2.4 Module to Incorporate Normalization Methods
  2.3 RAPiD.ana Script for Analyzing Illumina Infinium DNA Methylation Arrays
    2.3.1 Feature Selection
    2.3.2 Unsupervised Hierarchical Clustering
    2.3.3 Detection of Differentially Methylated CpG Probes between Groups of Samples
  2.4 RAPiD.int Script for Integrating Illumina Infinium DNA Methylation and Gene Expression Arrays
  2.5 Summary and Conclusion
Chapter 3: Identification of the Epigenetic Subgroups of Glioblastoma
  3.1 Abstract
  3.2 Introduction
    3.2.1 The Cancer Genome Atlas Network (TCGA)
  3.3 Material and Methods
    3.3.1 List of Available Data Types
    3.3.2 GBM and Control Samples
    3.3.3 Methylation Assays
    3.3.4 Unsupervised Consensus and Hierarchical Clustering of DNA Methylation and Gene Expression Data Sets
  3.4 Results
    3.4.1 Identification of a Distinct DNA Methylation Subgroup Within GBM
  3.5 Summary and Discussion of Results
Chapter 4: Clinical & Molecular Characterization of Epigenetic Subgroups of Glioblastoma
  4.1 Introduction
  4.2 Material and Methods
    4.2.1 Integrative TCGA Data Platforms
      4.2.1.1 Gene Expression and miRNA Expression
      4.2.1.2 Mutation Data
    4.2.2 Wilcoxon Rank Sum Test and Difference & Fold Change for Differential DNA Methylation and Differential Gene Expression in TCGA GBM
    4.2.3 Binomial Test for Genomic Clustering of CIMP Loci
    4.2.4 G-CIMP Validation Using MethyLight Technology
    4.2.5 Pathway Analysis and Meta-Analyses
    4.2.6 Statistical Analysis
  4.3 Results
    4.3.1 Clinical Characterization of G-CIMP Tumors
    4.3.2 Characterization of G-CIMP Tumors within Gene Expression Clusters
    4.3.3 IDH1 Sequence Alterations in G-CIMP Tumors
    4.3.4 Copy Number Variation (CNV) in Proneural G-CIMP Tumors
    4.3.5 Identification of DNA Methylation and Transcriptome Expression Changes in Proneural G-CIMP Tumors
    4.3.6 Validation of G-CIMP in GBM and Incidence in Low-grade Gliomas
    4.3.7 Stability of G-CIMP at Recurrence
  4.4 Summary and Discussion
Chapter 5: G-CIMP Genomic Signatures: in silico Analysis
  5.1 Introduction
  5.2 Material and Methods
  5.3 Results and Discussion
    5.3.1 Distinct Genomic Features Correlate with Hypermethylated CpG Sites
    5.3.2 Distinct DNA Sequence Tightly Correlates with Hypermethylated CpG Sites
  5.4 Future Studies
Chapter 6: Perspective and Synthesis
  6.1 Introduction
  6.2 Perspective and Future Outlook
  6.3 Challenges associated with NGS
  6.4 Interpreting and understanding the data
Bibliography
Appendix A (TCGA Author List)
Appendix B (Pipeline R Code)

List of Tables

1.1 Summary of the Histone Code
3.1 Infinium vs GoldenGate sample identification by clusters
4.1 Somatic Mutation associated with G-CIMP
4.2 G-CIMP and IDH1 mutation status in primary, secondary and recurrent GBMs
4.3 Germline mutation and LOH associated with G-CIMP
4.4 The top most differentially hypermethylated and downregulated genes in proneural G-CIMP positive tumors
4.5 GSEA Pathway Analysis: DNA Methylation Target Enrichment in G-CIMP vs. Proneural Non G-CIMP
4.6 Non-Conserved Transcription Factor binding site prediction
4.7 Summary of G-CIMP and IDH1 mutation status in the 360 glioma validation panel

List of Figures

1.1 Central Dogma of molecular biology
1.2 Histone modifications and their function
1.3 The epigenome shapes the physical structure of the genome
1.4 DNA Methylation
1.5 Normal vs Cancer epigenetic difference
2.1 Informatic Skills: System Design & Implementation
2.2 Number of contributed packages included in each of the Bioconductor releases
2.3 Infinium Methylation assay scheme
2.4 GenomeStudio MM Module Display
2.5 A set of third-party tools available for a broad range of genetic analysis applications
2.6 RAPiD Flowchart
2.7 Illumina Infinium Chip layout
2.8 Example plots for the Red channel before and after Multiscan calibration
2.9 Multiscan Setting Optimization Scatter Plot
2.10 Summary plot to illustrate samples which fail detection p-value cut-off of 10%
2.11 Before and After normalization of the raw intensities for Methylation and Unmethylation by batch
2.12 Before and After normalization of the raw β-values by batch
2.13 Standard Deviation across all samples
2.14 Two-way clustering of samples using the identified SD cut-offs
2.15 Multiple Adjustment Comparison
2.16 Volcano Plot
3.1 TCGA Data Portal website
3.2 Data types available for each TCGA sample
3.3 United States Map depicting all institutes involved with TCGA
3.4 Infinium Methylation assay scheme
3.5 Distribution of all 27K probes on the Infinium platform distance to TSS
3.6 Clustering of TCGA GBM tumors and control samples identifies a CpG Island Methylator Phenotype (G-CIMP)
3.7 Unsupervised clustering of 238 TCGA Glioblastoma Multiforme identifies a CpG Island Methylator Phenotype (G-CIMP)
3.8 Venn Diagram and Correlation of probes between GoldenGate and Infinium
3.9 Consensus Clustering Statistical Results
4.1 Characterization of G-CIMP tumors as a unique subtype of GBMs within the proneural gene expression subgroup
4.2 G-CIMP association with Gene Expression Clusters and Kaplan-Meier survival curves of G-CIMP-positive tumors
4.3 Somatic Mutation, Germline Mutation, LOH analysis of Proneural G-CIMP-positive tumors and summary of G-CIMP and IDH1 mutation status in the 360 glioma validation panel
4.4 Significant regions of copy number variation in a subset of G-CIMP genome
4.5 Significant regions of copy number variation in all G-CIMP genome
4.6 Volcano plots of DNA methylation and Gene Expression Analysis
4.7 Volcano plot for 534 microRNA expression levels
4.8 Starburst Plot for comparison of transcriptome versus epigenetic differences between proneural G-CIMP-positive and G-CIMP-negative tumors
4.9 Gene Expression Summary Statistics and Overlap between Agilent and Affymetrix
4.10 NextBio illustration of correlation between two different biosets
4.11 Meta-analyses
4.12 G-CIMP prevalence in grade II, III and IV gliomas using MethyLight
4.13 Recurrence of G-CIMP over time
5.1 CpG probes selected for genomic association analysis
5.2 Summary results of all available genomic features
5.3 Identified de novo motifs
5.4 Genomic Plots of de novo motifs

Abstract

Promoter DNA hypermethylation is generally associated with transcriptional gene silencing. Studies show that silencing critical tumor suppressor genes may contribute to tumorigenesis. While the guidelines that govern methylation patterns at promoter CpG islands during the pathogenesis of individual cancers are still unclear, it is widely known that certain genes carry a higher frequency of DNA methylation in select tumors whereas other genes are methylated across most types of tumors.

Using open source tools, a pipeline and a set of analytical tools were created to integrate multiple high-throughput datasets (genomic and epigenomic) in order to identify and understand the genes that are epigenetically modulated and biologically silenced in Glioblastoma multiforme, a highly aggressive form of brain tumor. The ultimate goal of these integrative analyses is to bring the identified marks to the point where they can be developed as targets for diagnostic methods and treatment strategies.

This thesis is divided into three main sections. The first section presents an overall understanding of epigenetics and its role in human cancer, specifically Glioblastoma multiforme (Chapter 1).
The second section focuses primarily on the development and utilization of specific open source tools and analytical scripts to assist in analyzing and integrating multiple high-throughput datasets derived from multiple samples (Chapter 2). Finally, the third and largest section of this thesis deals specifically with harnessing the tools created in Chapter 2 in order to fully describe and understand the molecular and clinical features associated with Glioblastoma multiforme (Chapters 3-5). The following summarizes the overall findings associated with Chapters 3-5.

We profiled promoter DNA methylation alterations in 272 glioblastoma tumors from The Cancer Genome Atlas (TCGA) project. Unsupervised analyses of these data revealed a novel molecular subgroup of samples (24/272, 8.8%) with highly concordant gene promoter methylation, including a large number of tumor-specific hypermethylated loci (3,098 loci), indicating the existence of a glioma-CpG Island Methylator Phenotype (G-CIMP).

We characterized these G-CIMP-positive samples by integrating available TCGA data consisting of clinical features, DNA sequence alterations (somatic mutations and copy number variations), and transcriptome expression. G-CIMP-positive patients are younger at the time of diagnosis (median age = 36) and display significantly improved outcome (median survival = 150 weeks) compared to G-CIMP-negative patients. G-CIMP-positive tumors are predominantly of the proneural expression subtype (21/24, 87.5%) and are tightly associated with IDH1 somatic mutations.

Next, we analyzed copy number variation differences in proneural G-CIMP-positive tumors and identified 2,875 genes that had significant copy number changes, as well as a reduction of chromosome 7 gain and chromosome 10 loss. Interestingly, we observed significant chromosome gains in 10p15.3-p11.21 and 8q23.1-q24 in proneural G-CIMP-positive tumors compared to proneural G-CIMP-negative tumors.
We also identified significant differences in both DNA methylation (1,550 genes) and gene expression (1,575 genes), and the integration of these two experimental results identified a total of 300 genes with significant DNA hypermethylation and gene expression changes in proneural G-CIMP-positive tumors.

Gene ontology analyses showed G-CIMP-specific down-regulation of genes associated with the mesenchyme subtype, tumor invasion and the extracellular matrix as the most significant terms. Genes with roles in transcriptional silencing and chromatin structure modifications showed increased gene expression in proneural G-CIMP-positive tumors. Meta-analysis identified significant overlap with genes down-regulated in low- and intermediate-grade glioma compared to Glioblastoma multiforme in a variety of previously published datasets. In addition, de novo DNA motif analysis identified a distinct motif (5'-CCCCTGGGG-3') significantly associated with G-CIMP-specific loci. This motif revealed a strong homology with a well-known palindromic motif that was found to be highly associated with EBF1, a known epigenetic modulator which promotes demethylation and chromatin remodeling in B lymphocytes.

In summary, we profiled promoter DNA methylation alterations in GBM tumors, and identified and characterized a unique subtype of human glioma tumors that is highly associated with several distinct clinical and molecular features.

Chapter 1

Introduction

During the late 1800s, the Swiss biochemist Friedrich Miescher discovered "nuclein", or deoxyribonucleic acid (DNA). Over half a century later, the significance of DNA was further demonstrated by the work of James Watson, Francis Crick, Maurice Wilkins and Rosalind Franklin ([Watson and Crick, 1953], as reviewed by [Dahm, 2007]). Since then, many advances have been made in the field of genomics research, contributing to the scientific community's deeper understanding of DNA.
In June of 2000 and then later in 2003, the Human Genome Project announced the results of a 13-year, $3-billion project to sequence the first human genome. This revealed that the human genome contains an estimated 3,000,000,000 nucleotide bases and approximately 20,000 genes on 23 pairs of chromosomes (footnote 1) [International Human Genome Sequencing Consortium, 2001, Venter et al., 2001]. Since then, nearly 270 billion nucleotide bases have been deposited in the sequence read archive (www.ncbi.nlm.nih.gov/sra) and the cost of sequencing has dropped exponentially from approximately $US10,000 (10 years ago) to about $US1.00 per million base pairs of sequence [Hayden, 2010].

Many scientists had high expectations for the enormous amount of information contained in the human genome of every cell. However, the "hoped-for revolution" (footnote 2) against the root causes of many common human diseases (leading to improved diagnosis, better treatments and reliable cures for intractable illnesses like many forms of cancer) has yet to be fully delivered ten years later [Hayden, 2010]. In 2010, with the advent of new and improved molecular technologies such as microarrays and affordable whole-genome sequencing, scientists are finally making huge leaps in novel discoveries and are rapidly gaining insight into the complexity of the human genome and the cancer genome [Bozic et al., 2010, Campbell et al., 2010, Kelly et al., 2010, Gu et al., 2010].

Footnote 1: The instructions contained in the DNA could easily be stored in about 3 gigabytes of computer storage space, and the DNA in one of our cells could easily measure approximately 2 meters in length if unwrapped and stretched from end to end.

Footnote 2: "With this profound new knowledge, humankind is on the verge of gaining immense, new power to heal...It will revolutionize the diagnosis, prevention and treatment of most, if not all, human diseases." President Bill Clinton, June 26, 2000. East Room of the White House. www.genome.gov/10001356

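The storage and length estimates in the footnote above can be reproduced with back-of-envelope arithmetic. The following Python sketch is illustrative only: the rise of 0.34 nm per base pair (a commonly quoted value for B-form DNA), the choice of one byte or two bits per base, the assumption of two genome copies per cell, and all function names are assumptions introduced here, not values or code taken from this thesis.

```python
# Back-of-envelope check of the genome storage and length estimates.
# All constants and names are illustrative assumptions.

HAPLOID_BASES = 3_000_000_000   # estimated bases in one genome copy
RISE_NM_PER_BP = 0.34           # assumed rise per base pair (B-form DNA)


def storage_gigabytes(n_bases, bits_per_base):
    """Storage needed for n_bases, in gigabytes (10**9 bytes)."""
    return n_bases * bits_per_base / 8 / 1e9


def dna_length_meters(n_bases, copies=2, rise_nm=RISE_NM_PER_BP):
    """End-to-end length of `copies` genome copies, in meters."""
    return n_bases * copies * rise_nm * 1e-9


if __name__ == "__main__":
    # One byte per base (plain text) versus a compact 2-bit encoding.
    print(storage_gigabytes(HAPLOID_BASES, bits_per_base=8))
    print(storage_gigabytes(HAPLOID_BASES, bits_per_base=2))
    # A diploid cell carries two copies of the genome.
    print(dna_length_meters(HAPLOID_BASES, copies=2))
```

Under these assumptions, storing one base per byte gives roughly 3 gigabytes, a two-bit encoding needs about a quarter of that, and two genome copies stretch to roughly 2 meters, consistent with the figures quoted in the footnote.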
Within the context of Glioblastoma multiforme (GBM) and acute myeloid leukemia (AML), our ability to glimpse into the cancer genome is allowing us to offer new potential targets for therapy [Cancer Genome Atlas Research Network, 2008, Verhaak et al., 2010, Shannon and Armstrong, 2010]. For example, recent whole-genome sequencing of a patient who died from AML revealed a novel truncated DNA methyltransferase (DNMT3a) protein. An analysis of more than 200 AML patients revealed that 22.1% of them harbored this mutation. Patients with a DNMT3a mutation were also found among the "intermediate risk" group and had a significantly poorer prognosis [Ley et al., 2010]. In addition, recent advances in the field of epigenetics have also provided insight into the complexity of the cancer epigenome associated with the progression of cancer [Noushmehr et al., 2010]. Therefore, whole-genome and epigenome studies may help scientists understand the root cause of cancer and, eventually, help to improve the treatment of these non-communicable diseases [Cancer Genome Atlas Research Network, 2008, Hayden, 2010, International Cancer Genome Consortium, 2010].

This chapter will introduce the role of epigenetics in shaping the human genome and will illustrate how epigenetics can potentially regulate the expression pattern of genes. In addition, this chapter will include an introduction to the aberrant epigenetic changes associated with human cancers, which will provide the background and ideas for the material presented in Chapters 2-5.

1.1 Epigenetics

Francis Crick proposed the central dogma in 1958 and later re-stated it in 1970 [Crick, 1958, Crick, 1970]. The central dogma describes how information is passed from DNA to protein (Figure 1.1). The question "What distinguishes each cell, tissue, and organ in our body?"
arises from the discoveries that 1) each cell of every type in an organism contains approximately the same DNA sequence and 2) the genes carry the blueprints to make the proteins in every living cell. Somehow, the cell must be able to access its own combination of genes in order to maintain its identity. During differentiation and development, the selective silencing and activation of certain genes determine each cell's fate. This process is mediated in part by the tissue-specific expression of transcription factors, proteins that modulate the function of other proteins, as well as by the distinct epigenome that characterizes each cell of a tissue or organ. Therefore, the importance of gene silencing governed by epigenetics should not be underestimated.

Figure 1.1: Central Dogma of molecular biology. Information is passed from DNA to RNA to protein, which eventually provides the biological function (figure adapted from Van Helden J. et al. (2003)).

1.1.1 The Normal Role of Epigenetics

The prefix "epi-" in epigenetics means "on," "above," or "in addition to" genes. Epigenetics refers to all modifications to genes other than specific changes in the DNA sequence itself (footnotes 3 and 4) [Jones and Baylin, 2007, Sharma et al., 2010]. These epigenetic modifications include biochemical tagging; the addition of methyl (-CH3) groups on CpG sites (dinucleotides consisting of cytosine and guanine); the modification of histones; chromosome organization; nucleosome remodeling; and non-coding RNAs including microRNAs (miRNAs) [Sharma et al., 2010]. In this context, we can imagine the role of epigenetics as providing the potential for select genes to be turned "on" or "off."

Within the nucleus, DNA compaction is facilitated by its incorporation into chromatin. Incorporation occurs through association with specific proteins. These proteins, known as histones, provide the foundations for the nucleosome, a DNA packaging unit. Each nucleosome contains about 146 bp of DNA wrapped around histones.
The histone core contains two copies each of H2A, H2B, H3, and H4 [Barski et al., 2007, Steger et al., 2008, Sharma et al., 2010]. The chromatin conformation is labile during various cellular processes, such as the cell cycle, transcription or DNA damage [Kelly et al., 2010]. In particular, during gene activation, the transcription factors compete with chromatin packaging proteins and with changes in the local chromatin structure in order to gain access to the underlying DNA sequence and read the genetic information accurately.

Footnote 3: "DNA is just a tape carrying information, and a tape is no good without a player. Epigenetics is about the tape player" Bryan Turner (Birmingham, UK)

Footnote 4: "I would take a picture of a computer and say that the hard disk is like DNA, and then the programmes are like the epigenome. You can access certain information from the hard disk using the programmes on the computer. But there are certain password protected areas and those which are open. I would say we're trying to understand why there are passwords for certain regions and why other regions are open." Jörn Walter (Saarland, Germany)

1.1.1.1 Histone Modifications

Accumulated evidence shows that the chromatin architecture of gene promoter regions and other cis-regulatory elements, such as enhancers and insulators, can strongly regulate gene transcription. This chromatin environment might be altered by DNA methylation, post-translational modifications of histone proteins, histone variants, or nucleosome positioning [Yang et al., 2011]. Lysine (K) and arginine (R) are the predominant amino acids enriched in histones. Because of their strong positive charge, these amino acids enable the histone proteins to bind and neutralize the negatively charged DNA. This tight binding between DNA and histones essentially reduces the accessibility of DNA to transcription factors (TFs).
Covalent and non-covalent modifications to the histones, on the other hand, can facilitate the DNA's accessibility to the TFs. A number of different histone modifications occur in the highly accessible N-terminal histone tails. These tails extend outward from the core of the nucleosome and are modified at specific amino acids such as arginine, lysine and serine. Figure 1.2 illustrates some of these histone modifications: acetylation, methylation, phosphorylation, ubiquitination, polyADP-ribosylation and glycosylation [Barski et al., 2007, Steger et al., 2008, Zhang and Reinberg, 2001].

Since histone modifications can have a direct impact on the induction or repression of transcription through the alteration of the local chromatin architecture, the "histone code" (summarized in Table 1.1) was proposed [Li et al., 2007, Yang et al., 2011]. In addition, the modifications can influence the transcriptional process through the recruitment of trans-acting factors that recognize specific histone modifications that constitute epigenetic marks [Barski et al., 2007, Steger et al., 2008, Bongiorni et al., 2009]. These modifications generate diverse histone structures and alter chromatin conformation (see Figure 1.3).

Figure 1.2: Histone modifications and their function. Four histone proteins (H3, H4, H2A, H2B) make up the nucleosome; specific amino acid residues that are acetylated, methylated or phosphorylated are shown. Figure adapted from Zhang and Reinberg (2001) [Zhang and Reinberg, 2001].

Acetylated H3K9 constitutes open chromatin to facilitate transcription activation; on the other hand, H3K9me3 recruits silencing proteins to assemble highly compacted chromosome regions. In addition to post-translational modifications of histone proteins, the nucleosomes that contain both H2A.Z and H3.3 are unstable compared with the nucleosomes that contain only the canonical histone proteins [Jin and Felsenfeld, 2007]. These components of nucleosomes might play a critical role in establishing nucleosome-free regions at active gene promoters and thus may influence gene expression.

Type of modification   H3K4         H3K9         H3K14        H3K27
mono-methylation       activation   activation   -            activation
di-methylation         -            repression   -            repression
tri-methylation        activation   repression   -            repression
acetylation            -            activation   activation   -

Type of modification   H4K20        H2BK5        H3K79
mono-methylation       activation   activation   activation
di-methylation         -            -            activation
tri-methylation        -            repression   activation, repression
acetylation            -            -            -

Table 1.1: Summary of the Histone Code. Each histone tail modification is associated with a type of gene regulation, activation or repression. [Barski et al., 2007, Steger et al., 2008]

Figure 1.3: The epigenome shapes the physical structure of the genome. Changes to the structure of the chromatin and DNA have the potential to influence gene expression: if the chromatin is condensed (top), the transcription factors cannot access the DNA and, therefore, the expression of the gene is repressed or silenced. Conversely, if the chromatin is decondensed (bottom), the transcription factors can access the DNA and the expression of the gene is activated.

1.1.1.2 DNA Methylation

In addition to playing a key role in modulating the chromatin structure via histone modification, epigenetics is also associated with biochemical modification of the DNA sequence itself.
In mammalian cells, DNA methylation occurs most frequently at the number 5 carbon of the cytosine of a CpG dinucleotide (Figure 1.4) [Sharma et al., 2010]. DNA methylation patterns are established during development and differentiation.

Figure 1.4: DNA Methylation
DNA methylation is mediated by DNA methyltransferases, known as DNMTs.

The CpG dinucleotide is present in roughly only 1% of the human genome. CpG dinucleotides are preferentially located at the 5' end of genes and they occupy approximately 60% of human gene promoters [Sharma et al., 2010]. These dinucleotide sequences are often found in clusters called CpG islands [Jones and Baylin, 2007, Takai and Jones, 2002]. CpG islands do not have an operational definition but, rather, are characterized by a higher than expected GC content (≥55%). Within an island, the CpG dinucleotide is relatively enriched, with an observed versus expected ratio of ≥0.65 over a distance >500 base pairs [Takai and Jones, 2002]. This stringent definition excludes many intergenic CpG-rich areas such as those associated with long terminal repeats (LTRs), Alus, and other repetitive elements. According to this definition, approximately 40% of human genes are associated with these elements [Takai and Jones, 2002].

The methylation of CpG dinucleotides can switch off gene expression directly by preventing transcription factors from binding to promoter regions. The methylation of CpG dinucleotides can also act indirectly, by serving as a binding site for methylcytosine binding proteins that act and interact as co-repressors and/or histone deacetylases. Like histone modifications, DNA methylation is a mechanism that has the potential to affect the transcriptional control of genes, conferring an additional layer of gene regulation [Sharma et al., 2010, Jones and Baylin, 2007].
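The CpG island criteria quoted above (GC content, observed versus expected CpG ratio, and length) can be checked for a single sequence with a few lines of R. The following is a minimal sketch, not part of the thesis code: the function names `cpg_stats` and `is_cpg_island` are hypothetical, and the observed/expected ratio is assumed to be computed as (number of CpG × sequence length) / (number of C × number of G), a commonly used convention that the text above does not spell out.

```r
# Hypothetical helper: summary statistics for one DNA sequence (a single string).
cpg_stats <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  n     <- length(bases)
  n_c   <- sum(bases == "C")
  n_g   <- sum(bases == "G")
  # A CpG is a C at position i immediately followed by a G at position i + 1.
  n_cpg <- if (n > 1) sum(bases[-n] == "C" & bases[-1] == "G") else 0
  # Assumed convention: obs/exp = (#CpG * N) / (#C * #G).
  obs_exp <- if (n_c > 0 && n_g > 0) (n_cpg * n) / (n_c * n_g) else NA_real_
  list(length = n, gc_content = (n_c + n_g) / n, obs_exp = obs_exp)
}

# Hypothetical helper: apply the thresholds quoted in the text
# (GC content >= 55%, obs/exp >= 0.65, length > 500 bp).
is_cpg_island <- function(seq, min_len = 500, min_gc = 0.55, min_oe = 0.65) {
  st <- cpg_stats(seq)
  !is.na(st$obs_exp) &&
    st$length > min_len &&
    st$gc_content >= min_gc &&
    st$obs_exp >= min_oe
}

# Toy sequences (artificial, for illustration only).
cg_rich <- paste(rep("CG", 300), collapse = "")   # 600 bp, all CpG
at_rich <- paste(rep("AT", 300), collapse = "")   # 600 bp, no C or G
is_cpg_island(cg_rich)
is_cpg_island(at_rich)
```

A genome-wide scan would apply such a test over sliding windows; the sketch only evaluates one candidate region at a time.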
In humans, the process of DNA methylation is carried out by three known enzymes: 1) DNA methyltransferase 1 (DNMT1), 2) DNA methyltransferase 3a (DNMT3a), and 3) DNA methyltransferase 3b (DNMT3b) [Sharma et al., 2010, Jones and Baylin, 2007]. DNMT3a and DNMT3b are the de novo methyltransferases that provide the early pattern of DNA methylation during development. DNMT1 is the proposed maintenance methyltransferase that is responsible for copying DNA methylation patterns to the daughter strands during DNA replication, preferentially methylating hemimethylated DNA [Sharma et al., 2010, Jones and Baylin, 2007]. DNA methylation is fundamentally important in a number of normal processes in the mammalian system. For example, it plays a central role in imprinting, X-chromosome inactivation, heterochromatin maintenance, developmental controls and (as discussed above) tissue-specific gene expression [Sharma et al., 2010].

Interestingly, a recent whole-genome sequencing study revealed that 22.1% of patients with acute myeloid leukemia (AML) tumors harbored a somatic mutation associated with DNMT3a [Ley et al., 2010]. The findings alluded to a significant biological association between a genetic mutation and an AML patient's outcome, since the majority of the patients classified as "intermediate risk" were tightly associated with this mutation even though the researchers observed no global effect on DNA methylation [Ley et al., 2010].

The crosstalk between DNA methylation, histone modification and nucleosome position, together with the biophysical properties of the underlying DNA sequence, creates an interacting network that regulates gene expression.

1.1.2 DNA Methylation in Cancer

Hyper- and hypo-DNA methylation changes to the normal epigenomic profile can alter the spatial arrangement of chromatin. In turn, this alteration of the spatial arrangement of the chromatin may directly affect the expression of genes (e.g.
silencing) after cell division (see Figure 1.3) [Pogribny and Beland, 2009, Eden et al., 2003]. The failure to maintain the heritable epigenetic marks can lead to a number of different disease states such as:

• imprinting disorders (Beckwith-Wiedemann syndrome, Prader-Willi syndrome, and transient neonatal diabetes mellitus)
• repeat-instability diseases (Fragile X syndrome, Facioscapulohumeral muscular dystrophy)
• defects of the methylation machinery (systemic lupus erythematosus; immunodeficiency, centromeric instability and facial anomalies syndrome)

It is also known that DNA methylation patterns become substantially altered in cancer and that they may contribute to the malignant phenotype associated with the process of carcinogenesis [Sharma et al., 2010, Jones and Baylin, 2007]. For example, if a CpG site within the promoter of a tumor suppressor gene or DNA repair gene is hypermethylated, it may contribute to the progression of the cancer (Figure 1.5) [Jones and Laird, 1999]. In addition to site-specific CpG island promoter hypermethylation, a cancer epigenome is marked by genome-wide hypomethylation [Rodriguez et al., 2006]. While the underlying mechanisms that initiate these global changes are still under investigation, recent studies indicate that some changes occur very early in cancer development and may contribute to cancer initiation [Feinberg et al., 2006, Hinoue et al., 2009, Sharma et al., 2010].

Figure 1.5: Normal vs Cancer epigenetic difference
A typical promoter region is depicted illustrating two states of DNA methylation, hypomethylation and hypermethylation. Aberrant DNA methylation changes can lead a specific gene to be turned off. Figure adapted from Taylor S.M. 2006 [Taylor, 2006], www.cellscience.com/reviews7/Cancer_DNA_methylation.html

Since it is almost impossible to target and change the specific DNA sequences of the genes, it is more feasible to change the regulation of a gene's expression.
Studies on epigenetic changes in a wide range of cancers are offering new therapeutic avenues for exploration [Cancer Genome Atlas Research Network, 2008, Verhaak et al., 2010, Noushmehr et al., 2010, Ley et al., 2010]. A handful of FDA-approved drugs are now available that can "turn on" genes that have been inappropriately "turned off" by modifying their epigenetic landscape [Jones and Baylin, 2007]. For example, 5-AZA-CdR is a demethylating agent currently used to treat patients with myelodysplastic syndromes [Sharma et al., 2010, Jones and Baylin, 2007].

1.2 Thesis Outline

The following points summarize the work developed in this dissertation:

1. The utilization and development of statistical methods in order to handle and process the large amount of molecular data generated for The Cancer Genome Atlas (TCGA).

2. A methods development (Chapter 2) for the integration of clinical and available genomic data derived from many different public repositories such as the Gene Expression Omnibus (GEO) and TCGA. Chapter 2 (page 14) discusses the development and utilization of statistical methods in handling, processing, and analyzing large high-throughput data generated from array technologies such as the Illumina Infinium Human Methylation27K platform. Since the Illumina Infinium Human Methylation27K platform is used to assay DNA methylation levels between disease and normal conditions, the tools generated and discussed in Chapter 2 can be applied to better characterize and understand the epigenetic alterations (aberrant DNA methylation) associated with cancer treatments.

3. A comprehensive integrative analysis of epigenomic and genomic datasets in order to characterize and understand a specific cancer, Glioblastoma multiforme (GBM). Using the most comprehensive DNA methylation profiling available for GBM, Chapter 3 (page 45) describes the identification of a unique subtype of gliomas, which we termed the glioma-CpG Island Methylator Phenotype (G-CIMP). Chapter 4 (page 66) describes our effort to characterize G-CIMP by integrating all the publicly available data provided by TCGA (cancergenome.nih.gov) and GEO in order to examine the specific molecular changes associated with G-CIMP. In addition, several key clinical features (age, treatment and survival) are also integrated into the analysis to provide a complete molecular and clinical characterization of GBM.

4. Chapter 5 (page 102) presents preliminary in silico results that expand the molecular findings of G-CIMP. The findings reveal several distinct genomic features and show that DNA sequence is tightly associated with CpG sites that are prone to aberrant DNA methylation.

5. Finally, Chapter 6 (page 111) concludes the thesis with a summary and future perspective of the work. In addition, it discusses several observations that resulted from working with these large datasets.

Chapter 2
Process, Handle and Analyze Infinium Human Methylation27K Data

2.1 Introduction

2.1.1 "What is Bioinformatics?"

As described in Chapter 1 (page 3), the DNA that comprises the genes encodes RNA. RNA, in turn, produces the proteins that regulate all of the biological processes within the organism. While we have an understanding of how genes are regulated, we do not fully understand the functions of the thousands of proteins that are produced [Wu et al., 2009]. In addition, we do not understand why and how certain transcription machinery goes awry in specific diseases [Feinberg et al., 2006, Jones and Baylin, 2007, Sharma et al., 2010]. Advances in molecular technologies are beginning to shed light on this issue [International Cancer Genome Consortium, 2010, Griffith et al., 2006, Fishel et al., 2007, Chan et al., 2008, de Magalhaes et al., 2009, Wirapati et al., 2008, Miller and Stamatoyannopoulos, 2010]. However, the bottleneck appears to be shifting from the production of the data to the development of methods to harvest and understand the complex data.
In some cases, a researcher may have to deal with billions of data points at one time. As the volume of information grows, the need to integrate data efficiently from many areas is becoming a bioinformatics "problem."

Bioinformatics is an integrative effort among different scientific disciplines, such as computer science, biostatistics, and biomedicine [Goodman et al. 2010]. Bioinformaticians generally focus on the development, design, and evaluation of approaches for processing and analyzing biological information with the ultimate aim of answering biological questions (Figure 2.1) [Goodman et al. 2010]. Many general approaches developed by information scientists are applied to bioinformatics, such as databases, knowledge representation languages, search algorithms and communication protocols. In addition, bioinformatics researchers can develop highly specialized methods. For example, they can develop algorithms that detect the presence of genes in nucleotide sequences, create a hidden Markov model to extrapolate ChIP-seq regions, or apply a new statistical model to normalize high-throughput data.

Figure 2.1: Informatic Skills: System Design & Implementation
A bioinformatician's skill set requires an understanding of how to organize, design, develop and evaluate a set of tools that can assist in answering a biological question. This also includes understanding and interpreting the results generated by the tool.

A wide array of software tools is freely and commercially available to the analyst. The freely available software tools include: the Linux operating system, BioPerl (a collection of Perl modules for bioinformatics), R (a programming language and statistical software), Bioconductor (a project geared towards biological information tools based on R), and other OpenSource tools. Commercial software tools include GenomeStudio (bundled with service contracts for Illumina's platforms), Matlab, Partek, GeneSpring, SAS/JMP Genomics and many others.
This chapter focuses on the use and development of OpenSource methods that can be shared among many users. In particular, I focus on the applications of R/Bioconductor and investigate new strategies that support aspects of bioinformatics-driven research. The strategies advanced in this chapter were conceived out of needs that originated in the context of high quality robust data generated for multiple samples across multiple batches. In order to describe the set of methods that were developed, I introduce two OpenSource tools (R and Bioconductor) and one commercial tool (GenomeStudio). In the following sections, the pros and cons of each tool are described, as well as recommendations regarding when to use them for the development of methods and the analysis of large datasets.

2.1.2 Open Source Tool

2.1.2.1 R

R is a software package that is freely downloadable for Windows, Unix, and Linux (www.r-project.org/) [Becker et al., 1988, Chambers and Hastie, 1992, Chambers, 1998, R Development Core, 2009]. Mirror copies of its large repository are hosted in many different countries, allowing open access from around the world. R provides the language, tools, and environment in one convenient package and, since it is flexible and highly customizable, it lends itself to a set of excellent graphical tools. In addition, its flexibility in data handling and its modeling capabilities make R an ideal environment for bioinformaticians interested in "Exploratory Data Analysis" [R Development Core, 2009]. The functions and the results of analyses are stored as objects or containers, which allows users to easily modify functions and build statistical models. Despite its strengths, R is not particularly efficient in executing a large number of "for loops" (iteration of the same code) compared to compiled languages such as C/C++. This is due, in part, to R's inability to harness multiple CPU cores.
Fortunately, there is a workaround, and several teams have developed robust packages that allow R to use multiple cores [R Development Core, 2009]. The most problematic part of R is its steep learning curve compared to "point and click" software. Users who are not trained in data abstraction (or structure) and who do not have a background in computer programming may find R cumbersome and burdensome to use compared to the utility offered by commercially available Graphical User Interface (GUI) software. However, despite the disadvantages of learning a new abstract tool, highly interested users will overcome the R learning curve and quickly excel in data analysis.

2.1.2.2 Bioconductor

Bioconductor (www.bioconductor.org) is a community-driven project that develops innovative software tools for use in computational biology. Familiarity with R is necessary for Bioconductor users since all of its tools and packages are built on R. The strength of Bioconductor lies in its huge repository of available packages. These packages provide a flexible interactive tool for carrying out a number of different computational tasks. There are currently 268 active developers and 372 contributed packages in Bioconductor's development repository (Figure 2.2) [Gentleman et al., 2004]. The project also maintains 501 annotation data packages that aid in the analysis of data from microarray and sequencing experiments [Gentleman et al., 2004]. These packages support the analysis of DNA microarray, sequence, flow, SNP, and other data.

Figure 2.2: Number of contributed packages included in each of the Bioconductor releases
A line plot showing the incremental increase in the total number of packages deposited into Bioconductor after each new release. Releases occur twice per year. Data source taken from [Gentleman et al., 2004].
Since many of the packages are reflections of current ideas and the source code is freely available to the public, most can be improved. Although users must rely on the expertise of the scientific community, and may be unable to modify existing code themselves, the most significant advantage of using R/Bioconductor for data analyses is that code is shared freely and developed collaboratively across the world. In this case, hundreds of heads are better than one!

2.1.3 Commercial Tool - GenomeStudio

Before describing the GenomeStudio software, I will provide a brief background on the Illumina Infinium Human Methylation27K platform and the methodology by which the DNA methylation levels are assayed. This background information will help explain the efforts made in developing the script described on page 24.

2.1.3.1 The Illumina Infinium Platform for assaying DNA methylation

The Illumina Infinium Human Methylation27K platform, which interrogates 27,578 CpG dinucleotides spanning the promoter regions of 14,495 Consensus Coding DNA Sequence (CCDS) genes and 110 microRNA gene promoters, is currently used by many investigators [Bibikova et al., 2009, Noushmehr et al., 2010, Bock et al., 2010], as reviewed in [Laird, 2010]. This platform is highly suitable for detailing the biological role of DNA methylation in both normal and diseased cells and for DNA methylation marker discovery [Bibikova et al., 2009, Noushmehr et al., 2010]. Illumina recently announced a more comprehensive Infinium platform that can interrogate more than 450K CpG dinucleotides covering a wide array of different genomic attributes (e.g. enhancers, gene bodies, CpG island shores, CpG sites outside of CpG islands, non-CpG methylated sites identified in human stem cells, and miRNA promoter regions).
Despite the availability of a more comprehensive platform that interrogates additional CpG dinucleotides, the Human Methylation27K platform is described here with the understanding that the developed tools are applicable to the new platform.

After the bisulfite conversion of DNA, methylated cytosines remain unchanged while unmethylated cytosines are converted to uracil (Figure 2.3) [Bibikova et al., 2009]. Two bead types represent each CpG locus on the platform. One bead type (U) represents a probe that is designed to match the unmethylated state. The second bead type (M) matches the methylated state. If the locus of interest is unmethylated, it matches perfectly with the U probe, enabling single-base extension and detection. The unmethylated locus, however, has a single-base mismatch to the M probe. This inhibits extension and results in a low signal on the array (Figure 2.3A). If the CpG locus of interest is methylated, the reverse occurs: the M bead type will display a signal and the U bead type will show a low signal on the array (Figure 2.3B). If the locus has an intermediate methylation state, both probes will match the target site and will be extended.

Figure 2.3: Infinium Methylation assay scheme
Figure taken from [Bibikova et al., 2009]. (A) The locus of interest is unmethylated. (B) The locus of interest is methylated.

The methylation status of the CpG site is determined by the β value calculation (see Equation 2.1 on page 20). The methylation status at an interrogated locus i is then summarized as

\[ \beta_i = \frac{M_i}{M_i + U_i} \qquad (2.1) \]

which represents the proportion of the total mean fluorescence intensity within a probe pair due to methylated cytosines. This proportion is naturally modeled by a two-parameter beta distribution defined on the interval (0, 1) as reviewed in [Laird, 2010].
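Equation 2.1 translates directly into vectorized R. The sketch below is illustrative only: the function name `beta_value` and the intensity values are hypothetical, and returning `NA` when both bead types report zero signal is an assumption made here to avoid division by zero, not a rule stated in the text.

```r
# Hypothetical helper implementing Equation 2.1 for vectors (or matrices)
# of mean methylated (M) and unmethylated (U) bead intensities.
beta_value <- function(M, U) {
  total <- M + U
  # Assumption: a probe pair with no signal at all is reported as NA.
  ifelse(total > 0, M / total, NA_real_)
}

# Illustrative intensities for three loci:
# mostly methylated, mostly unmethylated, and intermediate.
M <- c(9000,  500, 4000)
U <- c(1000, 9500, 4000)
beta_value(M, U)
```

Because the result is a proportion, it always falls between 0 (signal only from the U bead type) and 1 (signal only from the M bead type).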
2.1.3.2 Commercial Software to Process Data: GenomeStudio

GenomeStudio is the commercial software of choice for many users of Illumina platforms [Bibikova et al., 2006, Bibikova et al., 2009]. Built primarily around a graphical user interface (GUI), GenomeStudio contains several modules that assist users in visualizing and analyzing their data. These modules are unique for each of Illumina's available platforms (Single Nucleotide Polymorphism, Gene Expression, and DNA Methylation). The Methylation Module is used exclusively for the visualization and analysis of the Illumina Infinium Human Methylation27K platform (Figure 2.4). This module supports the analysis of both Infinium and GoldenGate methylation assay data collected by the iScan system or the BeadXpress Reader. The module calculates methylation levels, analyzes differential methylation levels between experimental groups, and allows users to view CpG island methylation statuses across the genome with the built-in Illumina Genome Browser and Chromosome Browser.

Although the GenomeStudio software is sufficient when working with a small number of samples, its limitation is evident when working with more than 10 samples at one time. The software is built on a heavy GUI structure, which overloads the system's memory and thus limits the computer resources available to process large amounts of data. One way to circumvent this problem is to increase the computer's resources (CPU and memory) or to use third-party software to analyze the data. Illumina recommends third-party applications to users analyzing some of their popular platforms (www.illumina.com/software/genome_analyzer_software.ilmn#third_party_tools) but does not currently recommend a third-party application for their Methylation Module (Figure 2.5).

Potential problems may arise when the user of proprietary software wants to make use of other analytical tools and statistics.
For example, if a user of GenomeStudio wants to perform a Wilcoxon rank-sum test (or a Kruskal-Wallis test, a non-parametric analogue of ANOVA) instead of a Student's t-test (or parametric ANOVA) to identify a significant difference between two groups, the user must first request the feature and then wait patiently until the software developers update the tool. In addition, in the rush to get the software out on time, bugs may exist in the code or in the methods used to compute even a simple statistical test. We identified one such bug when we were investigating the utility of creating a pipeline tool: GenomeStudio used an incorrect methodology to calculate the detection p-value (see detailed discussion on page 31). Instead of using a modified detection p-value based on the design of the Methylation Module platform, the developers implemented the Gene Expression module version, which averages the control probes across multiple color channels. Fortunately, Illumina has since eliminated the bug in its updated version of GenomeStudio. However, at the time of this study, the only way to circumvent the bug was to develop the code and the methods to calculate the detection p-value correctly.

Figure 2.4: GenomeStudio MM Module Display
A screen shot of the GenomeStudio MM Module. Adapted from www.illumina.com

Figure 2.5: A set of third-party tools available for a broad range of genetic analysis applications
Third-party tools are available for a broad range of genetic analysis applications including: sequence alignment, SNP calling, indel detection, sequencing informatics workflow and data management, whole-genome association, copy number variation analysis, gene expression analysis, eQTL analysis, multi-assay data integration, and biological pathway and network analysis. Purple dots indicate the applications with which each tool is most compatible.
Adapted from www.illumina.com

2.1.4 Summary and Recommendations

In summary, users interested in using either novel or existing statistical methods should harness commercially available tools (e.g. Partek or SAS-JMP), analyze a limited number of samples at a time, or develop their own custom scripts. The last option gives the user more control in implementing novel statistical models, the ability to harness existing statistical models available through Bioconductor, and the ability to analyze a larger number of samples at once. In addition, developing methods in an OpenSource environment also allows the user to examine multiple datasets from other existing platforms, including non-Illumina platforms, in a single environment [Gentleman et al., 2004, Du et al., 2008]. The wide range of freely accessible packages in Bioconductor that are exclusively devoted to pipeline analysis (see page 17) demonstrates that the use of statistical methodology can improve both the accuracy and the precision of bottom-line results [Gentleman et al., 2004, Du et al., 2008], in contrast to ad hoc procedures introduced by software engineers and manufacturers of the technology.

In the following section, I will describe how one can use the available Bioconductor packages to properly process, handle and analyze data from Illumina's Infinium Human Methylation27K technology. This will allow an analyst to interrogate data from multiple platforms and from thousands of samples in order to understand the molecular dynamics associated with a particular study such as The Cancer Genome Atlas (see page 49 for a detailed description of the TCGA project).
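To illustrate the flexibility of custom scripts discussed in this section, the sketch below runs both a non-parametric and a parametric two-group comparison using functions from base R (`wilcox.test` and `t.test`). The wrapper name `compare_groups` and the β values are hypothetical and serve only as an example; note that `t.test` applies the Welch correction by default rather than the classical Student's t-test.

```r
# Hypothetical wrapper: compare one CpG locus between two sample groups
# with a Wilcoxon rank-sum test and a (Welch) t-test.
compare_groups <- function(x, y) {
  list(
    wilcoxon_p = wilcox.test(x, y)$p.value,
    t_test_p   = t.test(x, y)$p.value
  )
}

# Illustrative beta values for a single locus (not real data).
tumor  <- c(0.81, 0.77, 0.90, 0.85, 0.79, 0.88)
normal <- c(0.12, 0.20, 0.15, 0.09, 0.18, 0.22)

res <- compare_groups(tumor, normal)
res
```

For more than two groups, `kruskal.test` and `aov` play the corresponding roles, so switching between tests requires changing a single function call rather than waiting for a vendor update.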
2.2 RAPiD.pro Script for Handling and Processing Illumina Infinium DNA Methylation Arrays

Previous studies have culminated in meta-analyses of multiple high-throughput data sets [Griffith et al., 2006, Fishel et al., 2007, Chan et al., 2008, de Magalhaes et al., 2009, Wirapati et al., 2008, Miller and Stamatoyannopoulos, 2010], but researchers have not always learned as much as they hoped to learn. This is primarily due to the following issues:

• small sample sizes
• variable sample quality
• poor clinical information
• batch effects
• looking at just one piece of the puzzle
• poor experimental design

With any study, these items are essential to evaluate and control. The script we provide will alleviate, in theory, the issues of variable sample quality and batch effects by providing this information to the end-user prior to downstream high-level analysis. The set of packages/scripts used to analyze the data from the Illumina platform is called "RAPiD", which stands for "Rapid Automated Pipeline DNA methylation." [1] A schematic of the proposed workflow for the RAPiD packages is presented in Figure 2.6. The packages are divided into three major segments: 1) RAPiD.pro (for processing of the data, page 25), 2) RAPiD.ana (higher level analyses of the processed data, page 37) and 3) RAPiD.int (integration with other high-throughput datasets, page 42). RAPiD uses Bioconductor classes, which allows more experienced users to easily invoke other R/Bioconductor components. In particular, the exprSet object is used for storing the raw and processed intensities, and it is strictly associated with the phenoData object used to document the experimental design (e.g. sample manifest).

2.2.1 From Scanner Output to R: Data Import Protocol

The first part of the RAPiD.pro code is designed to extract all of the raw intensity readouts from the Illumina BeadArray directory.
Since a chip contains 12 samples (organized as shown in Figure 2.7), each directory is organized by a 10-digit Illumina Chip ID (XXXXXXXXXX). Within each chip directory, the 12 samples are listed and organized by Chip ID XXXXXXXXXX followed by "_" and a position letter (A, B, C, D, E, F, G, H, I, J, K, or L). Each sample contains a set of files (*.xml, *.idat, *.locs, *.tif, and *.txt), which are further subdivided by red and green channels. An example of the working directory layout is contained in Listing 2.1 on page 27.

[1] The name for this package was coined by Dr. Simeen Malik, Ph.D., a current bioinformatic Post-Doc Fellow in Dr. Laird's USC Epigenome Center Laboratory.

Figure 2.6: RAPiD Flowchart
RAPiD is divided into three independent components: .pro (processing), .ana (analysis), and .int (integration).

Figure 2.7: Illumina Infinium Chip layout
Each chip design contains 12 samples or arrays, in which each array contains 60,000 beads. Each sample is identified by a 10-digit Chip ID (e.g. 1309982003) followed by a letter (A-L) depending on the position on the chip. For example, the far left sample depicted in this representative chip by Illumina is identified as 1309982003_A and the far right sample is identified as 1309982003_L. Figure adapted from the Illumina manual [Bibikova et al., 2009]

    /home/directory
    /home/directory/illm_data
    /home/directory/illm_data/XXXXXXXXXX/
    /home/directory/illm_data/XXXXXXXXXX/XXXXXXXXXX_A.csv
    /home/directory/illm_data/XXXXXXXXXX/XXXXXXXXXX_A_Grn.idat
    /home/directory/illm_data/XXXXXXXXXX/XXXXXXXXXX_A_Grn.locs
    /home/directory/illm_data/XXXXXXXXXX/XXXXXXXXXX_A_Grn.tif
    /home/directory/illm_data/XXXXXXXXXX/XXXXXXXXXX_A_Grn.xml
    /home/directory/illm_data/XXXXXXXXXX/XXXXXXXXXX_A_Red.idat
    /home/directory/illm_data/XXXXXXXXXX/XXXXXXXXXX_A_Red.locs
    /home/directory/illm_data/XXXXXXXXXX/XXXXXXXXXX_A_Red.tif
    /home/directory/illm_data/XXXXXXXXXX/XXXXXXXXXX_A_Red.xml
    /home/directory/illm_data/XXXXXXXXXX/XXXXXXXXXX_A.txt

Listing 2.1: Illumina Infinium Folder Structure illustrating only Sample A within Chip XXXXXXXXXX

Each associated file is stored by Illumina's scanner. To access the files, GenomeStudio requires the user to provide the exact location and path to "*.idat". However, "*.csv" contains all the summary bead-level information required to assemble the β-value and the detection p-value for each sample assayed. The file contains the raw intensities for each bead, the number of beads assayed, and the standard error calculated. The other associated files are accessible, and a user can extract information from the image file or the .idat file since the path to the directory is fixed.

In order to capture the information presented in this directory workflow, RAPiD.pro requires the directory path as well as static probe ID information. This information is entered as described in Listing 2.2 on page 28 and is invoked by running the R command described in Listing 2.3.

    #
    # beadlevel input
    #
    probe.id <- read.csv("./infinium_probe_id.csv");
    probe.color <- read.csv("./probe_color.csv");
    control.probes <- read.table("./control_probes.txt", sep="\t", header=TRUE);
    datPath <- c("./")
    pdata.m <- read.table(paste(".", "sample.info.txt", sep="/"), header=TRUE)

Listing 2.2: Directory Path

    #
    # Example R code to run Rapid.pro and import data
    #
    data <- rapid.pro(datPath="path/to/directory")

Listing 2.3: R command which runs rapid.pro function

As described in Appendix B (lines 58-223, page 135), RAPiD.pro will extract all of the samples from the "datPath" and will then collect and store the data in an exprSet, a useful container that works readily with commonly used Bioconductor packages. This R object contains a list of the data files of all the intensities from the red and green probes, the p-value calculation (see Equation 2.2 on page 33), the probe manifest and the sample manifest. The sample manifest is a file created by the end-user and stored as a tab-delimited file that describes each sample's phenotype. This file is stored in R as "pdata.m" (see Listing 2.2 on page 28). A unique "one-to-one" mapping is required in order to enable the samples to match the appropriate data.

2.2.2 Multiscan Calibration Protocol

When performing across-platform analysis (where Infinium beadchips are scanned at varying intervals), technical variance can sometimes overwhelm the downstream high-level analysis and mask the potential biological significance that a researcher wishes to analyze [Laird, 2010, Leek et al., 2010]. For example, the scanner setting, also known as the photomultiplier tube (PMT) setting, is set by the technician prior to a scan. The PMT setting is an important control as it sets the scanner's threshold when measuring the raw intensities output by the machine. As the maximum measured intensity is 65,536 (2^16) for any given scanning machine, varying the PMT can produce pronounced variations in analyses of the same sample [Bengtsson et al., 2004, Bengtsson and Hossjer, 2006]. Controlling these technical variations will be instrumental for downstream high-level data analysis and for the subsequent validation of experiments.
Overcoming the hurdles of using scanners to measure the raw intensities in Infinium DNA Methylation production is similar to what many researchers experienced when they processed the data generated for gene expression by using one- or two-channel chip arrays. Researchers later showed that a channel-specific bias introduced by the scanner (most likely by its detector parts) existed in these studies [Bengtsson et al., 2004, Bengtsson and Hossjer, 2006]. Subsequently, a scan protocol and a model that allow researchers to estimate the biases and calibrate the observed signals were developed [Bengtsson et al., 2004, Bengtsson and Hossjer, 2006]. Using this information, we implemented a set of codes for properly calibrating the Illumina scanner prior to data generation. As described in the RAPiD Flowchart (Figure 2.6, page 26), a step is invoked by the user to test and recalibrate the scanner. In order to utilize this feature, a chip is scanned at two or more different scan settings (Listing 2.4). By running "multiscanPlot(data)" in R, a series of plots will be generated in a working directory of the user's choice. The plots will show the density distribution of the raw intensities for both the red and green channels (Figure 2.8).

Figure 2.8: Example plots for the Red channel before and after Multiscan calibration
The left column of plots is pre-multiscan calibration while the right column is post-multiscan calibration. For each situation, the raw, MvA and density distributions are plotted.

    #
    # Example R code to run Rapid.pro and import data
    #
    calibrateMultiscan(Y.m, constraint=Constraint,
                       satSignal=satSignal);
    data <- rapid.pro(datPath="path/to/directory", multiscan=TRUE)
    multiscanPlot(data);

Listing 2.4: With the multiscan argument set to TRUE, the calibration method generates a series of plots to help the user identify the ideal scan setting.
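The reason for comparing several scan settings can be illustrated with a small simulation. The sketch below is not part of RAPiD: the function `saturation_fraction` is hypothetical, the intensities are simulated rather than measured, and it assumes, for illustration only, that the scan setting acts as a simple multiplicative gain and that the detector ceiling is 2^16 - 1.

```r
# Hypothetical helper: fraction of probes at or above the detector ceiling,
# computed per column (one column per scan setting).
saturation_fraction <- function(intensity, sat_signal = 2^16 - 1) {
  colMeans(intensity >= sat_signal)
}

set.seed(1)
# Simulated "true" signal for 1000 probes (not real data).
true_signal <- rexp(1000, rate = 1 / 8000)

# Assumption: each scan setting scales the signal, which is then clipped
# at the detector ceiling.
settings <- c(0.6, 1.0, 1.25, 1.5, 2.0)
scans <- sapply(settings, function(g) pmin(true_signal * g, 2^16 - 1))
colnames(scans) <- paste0("PMT_", settings)

sat <- saturation_fraction(scans)
sat
```

Under these assumptions the saturated fraction can only grow as the setting increases, which is why a setting is sought that uses as much of the dynamic range as possible while keeping this fraction close to zero.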
In the following example, two chips with 12 samples each were rescanned at the following settings: 0.6, 1.0, 1.25, 1.5 and 2.0. The raw and log10-transformed intensities for each chip and its samples are compared for each scan setting in Figure 2.9. Ideally, by comparing the scans to each other, the user will find a scan setting at which all the probes fall within the full dynamic range (0-64K). A plateau near the high end indicates saturation, and the respective data are therefore considered unreliable. In this example, a scan setting of 1.0 was determined to be optimal.

Figure 2.9: Multiscan Setting Optimization Scatter Plot. Each scan setting is plotted and compared for one sample ("D") on one chip ("4447820191"). Each color channel is individually identified by color. The density distributions are shown along the diagonal.

2.2.3 Quality Control Plots

During the development of the pipeline, it became clear that the initial pre-processing of the Infinium data by the GenomeStudio software was done incorrectly. This included the calculation of the detection p-value (initially described on page 22) and the scanner setting (described on page 29). The detection p-value is used to assess whether the assayed CpG locus is significantly above background levels, which gives confidence in analyzing a specific CpG for methylation. GenomeStudio provides a detection p-value for each interrogated locus. However, the statistical test calculated by GenomeStudio assumes that the negative control probes for both the red and green channels are the same. Since each color channel is experimentally distinct, each channel should be calculated independently. In our method, the p-value uses negative control signals from the background model that characterize the chance that the target sequence signal was distinguishable from the negative controls. This is done separately for the red and green channels.
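The per-channel idea described above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the pipeline's code: the negative-control and probe intensities are simulated, the object names are hypothetical, and the upper tail of a standard normal distribution is used as the one-sided test.

```r
set.seed(42)

# Simulated negative controls: each channel has its own background level.
neg_controls <- list(red   = rnorm(16, mean = 300, sd = 40),
                     green = rnorm(16, mean = 150, sd = 25))

# Simulated probe intensities (hypothetical values) for three probes per channel.
probe_intensity <- list(red   = c(faint = 320, mid = 900, bright = 5000),
                        green = c(faint = 160, mid = 700, bright = 4000))

# Detection p-value per channel, using only that channel's negative controls.
detection_p <- lapply(names(neg_controls), function(ch) {
  z <- (probe_intensity[[ch]] - mean(neg_controls[[ch]])) / sd(neg_controls[[ch]])
  pnorm(z, lower.tail = FALSE)   # one-sided upper-tail probability
})
names(detection_p) <- names(neg_controls)
print(detection_p)
```

Because each channel is compared only with its own background, a probe that looks bright against the green controls is never judged against the higher red background, and vice versa.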
For the Infinium methylation platform, the negative controls are modeled by a normal distribution. The detection p-value for the probe with intensity $I_p$ is calculated as:

\[ 1 - Z\left(\frac{|I_p - \mu_{neg}|}{\sigma_{neg}}\right) \tag{2.2} \]

where $\mu_{neg}$ is the average of the negative controls, $\sigma_{neg}$ is the standard deviation of the signals, and $Z$ is the one-sided tail probability of the standard normal distribution. Since the sample size in some cases could be smaller than six probes, a Z test is preferred over a Welch t-test. The smaller detection p-value is taken as the final detection p-value for the probe. Beta values with detection p-values greater than 0.05 are not significantly different from the background and are subsequently masked as "NA" (see Listing 2.5 on page 33 for the implementation, which is described in the following section).

    for (rg in channel) {
      for (mu in c("M", "U")) {
        z_score[[rg]][[mu]] <- (mubc[[mu]] - ctr_avg1) / ctr_sd1;
        # calculating z score for each probe per sample
        pValue[[rg]][[mu]] <- apply(z_score[[rg]][[mu]], c(1, 2),
                                    pnorm, lower.tail=FALSE);
      }
    }

Listing 2.5: R code to calculate the detection p-value for each channel (red & green) separately

After the raw scanned data are imported into R and the detection p-value is calculated as described above, the first quality control (QC) plots are generated and stored in a folder defined on lines 43-53 (see Appendix B, page 135 and Listing 2.6).

    #
    # Path to QC plots
    #
    figForce <- FALSE;
    # for the calibration plots for each sample and color
    figPath.c <- Arguments$getWritablePath(file.path(".",
                     "results.calibration", fsep="/"));
    # for all other figures
    figPath <- Arguments$getWritablePath(file.path(".",
                   "results", fsep="/"));
    verbose <- Arguments$getVerbose(5, timestamp=TRUE);

Listing 2.6: Path to Quality Control Figures

Each probe per sample has a p-value associated with it and, if this p-value is greater than or equal to 0.05 (5%) or any other chosen cutoff (e.g. 0.01, 1%), an "NA" is recorded in lieu of a β-value. The total number of NAs is then divided by the total number of probes on the array (see Equation 2.3).

\[ \frac{\text{Sum of all NAs}}{\text{Size of array (e.g. 27,578)}} \tag{2.3} \]

This ratio is then used to determine whether a sample reached a predetermined threshold set in the code by the user. In this case, we set the threshold to 10%. This means that, if a sample contains more than 10% insignificant probes, the script will flag the sample and report it to the end-user. A graph is also generated and organized by plate, chip and sample position to allow the user to evaluate whether a series of samples failed. The pattern may indicate a failed chip or just one failed sample, which may suggest a failed protocol. An example of this type of graph is presented in Figure 2.10, where the samples are arranged by chip and plate. Samples generally fail because of bad chips, poor laboratory handling, an improper scan setting, poor quality reagents or a combination of these. When a sample is flagged by RAPiD.pro, the user needs to investigate and determine the source empirically. The motivation behind these QC plots is to point the user to the source and lead him/her to a quick solution.

Figure 2.10: Summary plot illustrating samples that fail the detection p-value cutoff of 10%. Each dot represents one sample, organized along the x-axis by plate and chip. Each box represents a plate, and each light gray vertical line marks the boundary between chips (12 samples). The y-axis indicates the total number of probes for this particular platform (27,578). Samples shown in red failed the 10% cutoff.
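The flagging rule described above (mask non-significant probes, then compare the NA fraction per sample with a threshold) can be sketched as follows. The matrices are simulated and the sample names are hypothetical; the 0.05 cutoff and 10% threshold are the values quoted in the text.

```r
set.seed(7)
n_probes <- 1000
samples <- paste0("S", 1:4)                       # hypothetical sample names

# Simulated detection p-values: all well below the cutoff ...
pvals <- matrix(runif(n_probes * 4, min = 0, max = 0.04), nrow = n_probes,
                dimnames = list(NULL, samples))
# ... except sample S3, where 30% of probes are indistinguishable from background.
pvals[1:300, "S3"] <- 0.5

# Simulated beta values, masked wherever the detection p-value is >= 0.05.
beta <- matrix(runif(n_probes * 4), nrow = n_probes, dimnames = dimnames(pvals))
beta[pvals >= 0.05] <- NA

# Equation 2.3: sum of NAs divided by array size, per sample.
na_fraction <- colMeans(is.na(beta))
flagged <- names(na_fraction)[na_fraction > 0.10]
print(na_fraction)
print(flagged)
```

Only S3 exceeds the 10% threshold here, so it is the only sample that would be reported to the end-user.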
2.2.4 Module to Incorporate Normalization Methods

Once the data is properly processed and ported into an R object, the appropriate sample and probe manifests are associated with this object as described above. The next step is to decide whether the data should be normalized. Additional plots are available in RAPiD.pro to evaluate and explore the datasets. These include boxplots, scatter plots, MvA plots, and higher-level evaluation protocols such as Principal Component Analysis (PCA). If the data is assumed to be homogeneous (as is the case with TCGA-type datasets for a particular cancer type), several approaches are available to normalize the data. However, not all methods designed for gene expression data can be applied to DNA methylation datasets [Laird, 2010], since β-values follow a beta distribution confined to the interval (0,1) rather than lying on an unbounded scale. Therefore, although we do not recommend a particular method here, we describe the method that was used in this study.

This method uses the median or mean across batches and then models it by linear regression. The raw intensities of the red and green channels are treated independently. After the mean or median intensities are calculated for each batch, a linear model (y = mx + b) is fitted between batches, and the expected new intensity is derived from the fit. Let $y_{ijk}$ be the pre-normalization intensity of locus $i$ and sample $j$ from batch $k$, and let $b_{ijk}$ be the y-intercept fit value and $m_{ijk}$ be the slope fit value determined by the linear regression between batches $k$. Consequently, $x_{ijk}$ becomes the new calculated intensity after normalization (see Equation 2.4):

\[ x_{ijk} = \frac{y_{ijk} - b_{ijk}}{m_{ijk}} \tag{2.4} \]

Once the new normalized intensities are calculated for each sample by the respective batch, the new β-value is calculated. Next, a set of plots is generated to evaluate the performance of the normalization.
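As a rough illustration of Equation 2.4, the sketch below normalizes one simulated batch against another in base R. It is a sketch under assumptions, not the code in Appendix B: the data are simulated, one batch is arbitrarily treated as the reference, and per-locus means are used as the batch summary.

```r
set.seed(3)
n_loci <- 500
truth <- rexp(n_loci, rate = 1 / 3000)     # simulated per-locus signal

# Five samples per batch; batch 2 carries an artificial scale and offset shift.
batch1 <- sapply(1:5, function(j) truth + rnorm(n_loci, sd = 100))
batch2 <- sapply(1:5, function(j) 200 + 1.5 * truth + rnorm(n_loci, sd = 100))

ref_mean    <- rowMeans(batch1)            # per-locus mean of the reference batch
batch2_mean <- rowMeans(batch2)            # per-locus mean of the batch to correct

fit <- lm(batch2_mean ~ ref_mean)          # y = m x + b between batches
b <- unname(coef(fit)[1])                  # fitted y-intercept
m <- unname(coef(fit)[2])                  # fitted slope

batch2_norm <- (batch2 - b) / m            # x = (y - b) / m for every sample

# After normalization the batch means should line up with the reference.
refit <- lm(rowMeans(batch2_norm) ~ ref_mean)
print(round(coef(refit), 6))
```

Refitting after the correction returns an intercept of about 0 and a slope of about 1, which is the behavior the before-and-after plots are meant to confirm visually.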
The evaluation plots show the raw intensity values (Figure 2.11) and boxplots of the new β-value per sample, color-coded by batch (Figure 2.12). Each figure shows the data before and after normalization. At this point, users can evaluate the results and decide to use either data matrix (normalized or non-normalized) for their downstream higher-level analyses, as described in the following section.

Figure 2.11: Before and after normalization of the raw intensities for Methylation and Unmethylation by batch. Each locus is plotted by its log2-transformed methylated and unmethylated intensities, averaged across the respective batch (batch 1-3; 10; 16). Each batch is color coded and plots are divided into post-normalization and pre-normalization.

Figure 2.12: Before and after normalization of the raw β-values by batch. Summary boxplots and density plots of each sample's β-value are plotted before and after normalization. Each sample is color coded by batch association as determined by the sample manifest.

2.3 RAPiD.ana Script for Analyzing Illumina Infinium DNA Methylation Arrays

After the data is processed and stored as an R object, it is ready for downstream "high-level" data analysis by the bioinformatician. This includes feature selection, hierarchical clustering, genomic mapping, and integration with other Bioconductor packages, including the Broad GenePattern tool set for Consensus Clustering [Monti et al., 2003]. Since this level of analysis requires study-centric follow-up and evaluation, the methods and their output are described here rather than prescribed as a recommended set of methods. The idea is that the end-user will have the freedom to invoke any preferred method, since it is assumed that the user has full access to Bioconductor packages. The methods described below are probably the most popular choices during the "Exploratory Data Analysis" phase.
These methods are used extensively in Chapters 3 and 4.

2.3.1 Feature Selection

For each dataset, a feature selection of the most variant probes is initially required to explore the data. To use this method, the user sets the "sd" argument to TRUE, as in Listing 2.7. This generates a distribution plot that allows the user to determine the optimal cutoff. In this case, 0.1 was chosen as the optimal cutoff because the distribution appears bimodal (Figure 2.13).

    #
    # Selecting features by Standard Deviation
    #
    data <- rapid.ana(dat, sd=TRUE)

Listing 2.7: Feature Selection

Figure 2.13: Standard Deviation across all samples. A histogram of the SD of each probe calculated across all samples.

2.3.2 Unsupervised Hierarchical Clustering

After a select number of probes are identified and extracted, the call in Listing 2.8 plots all the samples in a two-way hierarchical clustering (Figure 2.14). The clustering method invoked by the user can be either Consensus Clustering (described in detail in Chapter 3) or any of the preferred methods developed and deposited into Bioconductor. Again, the assumption here is that the user can use any clustering package available through Bioconductor and choose the method best suited to their study.

    #
    # Clustering
    #
    data <- rapid.ana(dat, sd=TRUE, cutoff=0.1,
                      cluster=TRUE, method="ConsensusCluster")

Listing 2.8: Unsupervised Clustering

2.3.3 Detection of Differentially Methylated CpG Probes between Groups of Samples

In addition to the "exploratory data analysis" described above, the tools can also identify differentially methylated CpG probes between two or more groups of samples. The user can either use a predefined set of samples per group (defined in the "phenoData") or use the groups identified by the unsupervised analysis described above.
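The exploratory steps above (SD-based feature selection followed by unsupervised clustering) can be sketched in base R, with the resulting cluster labels serving as the kind of grouping that a differential analysis could take as input. This is a simplified stand-in: it uses plain average-linkage hierarchical clustering rather than Consensus Clustering, and the β-values, probe names and sample names are simulated.

```r
set.seed(5)
n_probes <- 300
n_samples <- 12

# Simulated beta values: mostly low methylation with little variation.
beta <- matrix(rbeta(n_probes * n_samples, 2, 30), nrow = n_probes,
               dimnames = list(paste0("cg", seq_len(n_probes)),
                               paste0("S", seq_len(n_samples))))
# Thirty probes are hypermethylated in the last six samples (simulated subgroup).
beta[1:30, 7:12] <- rbeta(30 * 6, 8, 2)

# Feature selection: keep probes whose SD across samples exceeds the cutoff.
probe_sd <- apply(beta, 1, sd)
selected <- beta[probe_sd > 0.1, , drop = FALSE]

# Unsupervised clustering of samples on the selected probes.
tree <- hclust(dist(t(selected)), method = "average")
groups <- cutree(tree, k = 2)
print(table(groups))
```

In practice the cutoff would be read off the SD histogram (as in Figure 2.13) rather than fixed in advance.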
Listing 2.9 describes how to invoke RAPiD.ana to identify and plot the most differentially methylated CpG probes. Any statistical test can be used, as long as the sample classes are determined in advance. In addition, a multiple testing correction is available by setting the argument "multitest" to TRUE and invoking the method of choice (Figure 2.15). A volcano plot is then generated to display the identified CpG probes, where the y-axis is the significance and the x-axis is the beta value difference between two experimental groups (Figure 2.16).

Figure 2.14: Two-way clustering of samples using the identified SD cutoffs. Hierarchical clustering of 99 samples using an SD cutoff of 0.1. Samples are labeled by batch and methylation levels range from 0 (blue) to 1 (red).

    #
    # Wilcox Rank Sum Test
    #
    data <- rapid.ana(dat, test=Wilcox, sample1=c(1:10), sample2=c(11:20))

Listing 2.9: Identifying differentially methylated CpG probes

Figure 2.15: Multiple Adjustment Comparison. Plot showing the performance of the different multiple testing corrections available through Bioconductor. The y-axis shows the sorted adjusted p-values and the x-axis shows the number of rejected hypotheses. The legend key indicates each test compared to the raw p-value.

2.4 RAPiD.int Script for Integrating Illumina Infinium DNA Methylation and Gene Expression Arrays

In addition to the methods described in the RAPiD.ana package, users can also take the DNA methylation data and integrate it with other Illumina or non-Illumina platforms. These platforms could be gene expression, copy number, or miRNA expression platforms, or a combination of these. Using the tools and methods described for RAPiD.ana, gene expression platforms can also be analyzed in order to identify differentially expressed genes. The R code in Listing 2.9 on page 41 is an example of a method of choice.
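For readers without access to RAPiD.ana, the same kind of two-group comparison can be approximated with base R alone. The sketch below is an assumption-laden illustration: the β-values are simulated, the Benjamini-Hochberg method is picked arbitrarily as the multiple testing correction, and the variable names are hypothetical.

```r
set.seed(11)
n_probes <- 200
group1 <- 1:10     # columns of the first sample group
group2 <- 11:20    # columns of the second sample group

# Simulated beta values; the first ten probes are hypermethylated in group 2.
beta <- matrix(rbeta(n_probes * 20, 2, 5), nrow = n_probes)
beta[1:10, group2] <- rbeta(10 * 10, 8, 2)

# Wilcoxon rank-sum test per probe, then multiple testing correction.
p_raw <- apply(beta, 1, function(b) wilcox.test(b[group1], b[group2])$p.value)
p_adj <- p.adjust(p_raw, method = "BH")

# Coordinates of a volcano plot: effect size versus significance.
delta_beta <- rowMeans(beta[, group2]) - rowMeans(beta[, group1])
volcano <- data.frame(delta_beta = delta_beta, neg_log10_p = -log10(p_adj))
print(head(volcano[order(p_adj), ], 5))
```

Plotting `neg_log10_p` against `delta_beta` gives the volcano layout described in the text; for an expression platform the x-axis would be a fold change instead of a beta value difference.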
Once all of the genes and probes are analyzed and the associated p-value and fold or beta value change are calculated, a "starburst" plot can be generated. An example of a "starburst" plot is described in Chapter 4 on page 66 (the code is described in Appendix B). That chapter provides a detailed perspective on the types of higher-level analysis one can perform to integrate multiple molecular datasets.

Figure 2.16: Volcano Plot. Volcano plot of all CpG loci analyzed. The beta value difference in DNA methylation between the two groups is plotted on the x-axis, and the p-value for an FDR-corrected Wilcoxon rank-sum test of differences (-1 multiplied by log10 scale) is plotted on the y-axis. Probes that are significantly different between the two subtypes are colored in red.

2.5 Summary and Conclusion

The tools developed and described in this chapter were conceived out of needs that originated in the context of high quality, robust data generated for multiple samples across multiple batches. The primary focus was on the utility and application of open-source tools to properly handle, process and analyze the molecular datasets generated by Illumina. The code that was organized and developed is attached to this thesis as Appendix B. In addition, this chapter provided a series of recommendations to users interested in using GenomeStudio or R/Bioconductor. Although both tools are advantageous for different users and different studies, R/Bioconductor allowed me to use the full gamut of statistical packages while correctly handling and processing large datasets simultaneously. Chapters 3 (page 45) and 4 (page 66) contain examples of how one can use the codes in "RAPiD.pro," "RAPiD.ana," and "RAPiD.int" to properly handle, process, and fully integrate large DNA methylation-centric datasets with other publicly available datasets (e.g. GEO or TCGA).
Specifically, the analysis focuses on the integration of a number of molecular data types available for a common primary brain tumor, Glioblastoma multiforme (GBM), within the context of TCGA [Noushmehr et al., 2010].

Chapter 3

Identification of the Epigenetic Subgroups of Glioblastoma

The material in this chapter has been published, see [Noushmehr et al., 2010].

3.1 Abstract

The Cancer Genome Atlas (TCGA) project profiled promoter DNA methylation alterations in 272 Glioblastoma tumors. The unsupervised analyses of these data revealed a novel molecular subgroup of samples with highly concordant gene promoter methylation including a large number of tumor-specific hypermethylated loci, indicating the existence of a glioma-CpG Island Methylator Phenotype (G-CIMP). We characterized these G-CIMP-positive samples by integrating available TCGA data consisting of clinical features, DNA sequence alterations (somatic mutation and copy number variations), and transcriptome expression. These G-CIMP-positive patients were found to be younger at the time of diagnosis (median age = 36 years) and to display significantly improved outcomes (median survival = 150 weeks) compared to G-CIMP-negative patients. G-CIMP-positive tumors are predominantly of the proneural expression subtype (21/24) and are strongly associated with IDH1 somatic mutations (18/18). In order to explore this finding further, we tested an independent set of 100 gliomas (WHO grades II, III and IV) for G-CIMP status (using MethyLight) and IDH1 mutation. Among 48 IDH1-mutant tumors, 35 were G-CIMP-positive. However, only 3/52 cases without an IDH1 mutation were G-CIMP-positive, validating the tight association of G-CIMP with IDH1 mutation.

Next, we analyzed copy number variation differences in proneural G-CIMP-positive tumors. We identified 2,875 genes that had significant copy number changes and reductions of chromosome 7 gain and chromosome 10 loss.
Interestingly, we observed significant chromosome gains in 10p15.3-p11.21 and 8q23.1-q24 in proneural G-CIMP-positive compared to proneural G-CIMP-negative. We also identified significant differences in both DNA methylation (1,550 genes) and gene expression (1,575 genes). The integration of these two experimental results identified 300 genes with significant DNA hypermethylation and gene expression changes in proneural G-CIMP-positive tumors. Gene ontology analyses showed a G-CIMP-specific down-regulation of genes associated with the mesenchyme subtype, tumor invasion and the extracellular matrix as the most significant terms. Genes with roles in transcriptional silencing and chromatin structure modifications showed increased gene expression in proneural G-CIMP-positive tumors. Meta-analysis identified a significant overlap with down-regulated genes in low-intermediate grade glioma compared to Glioblastoma multiforme (GBM) in a variety of previously published datasets.

Taken together, these molecular findings suggest that G-CIMP-positive tumors have epigenetically-related gene expression differences that are more consistent with low-grade gliomas and high-grade tumors with favorable prognoses. In summary, we profiled promoter DNA methylation alterations in GBM tumors and we identified and characterized a unique subtype of human glioma tumors that are highly associated with several different clinical and molecular features.

3.2 Introduction

There is currently great interest in characterizing and compiling the genome and transcriptome changes in human GBM tumors to identify aberrantly functioning molecular pathways and tumor subtypes. The Cancer Genome Atlas (TCGA) pilot project identified genetic changes of primary DNA sequence and copy number, DNA methylation, gene expression and patient clinical information for a set of GBM tumors [Cancer Genome Atlas Research Network, 2008].
TCGA also reaffirmed genetic alterations in TP53, PTEN, EGFR, RB1, NF1, ERBB2, PIK3R1, and PIK3CA mutations and detected an increased frequency of NF1 mutations in GBM patients [Cancer Genome Atlas Research Network, 2008]. Recent DNA sequencing analyses of primary GBM tumors using a more comprehensive approach [Parsons et al., 2008] also identified novel somatic mutations in isocitrate dehydrogenase 1 (IDH1) that occur in 12% of all GBM patients. IDH1 mutations have only been detected at the arginine residue in codon 132, with the most common change being the R132H mutation [Parsons et al., 2008, Yan et al., 2009], which results in a novel gain of enzyme function in directly catalyzing alpha-ketoglutarate to R(-)-2-hydroxyglutarate [Dang et al., 2009]. IDH1 mutations are enriched in secondary GBM cases and younger individuals, and are coincident with increased patient survival [Balss et al., 2008, Hartmann et al., 2009, Yan et al., 2009]. Higher IDH1 mutation rates are seen in grade II and III astrocytomas and oligodendrogliomas [Balss et al., 2008, Bleeker et al., 2009, Hartmann et al., 2009, Yan et al., 2009], suggesting that IDH1 mutations generally occur in the progressive form of glioma, rather than in de novo GBM. Mutations in the related IDH2 gene are of lower frequency and generally non-overlapping with tumors containing IDH1 mutations [Hartmann et al., 2009, Yan et al., 2009].

Cancer-specific DNA methylation changes are hallmarks of human cancers, in which global DNA hypomethylation is often seen concomitantly with hypermethylation of CpG islands [Jones and Baylin, 2007]. Promoter CpG island hypermethylation generally results in transcriptional silencing of the associated gene [Jones and Baylin, 2007]. CpG island hypermethylation events have also been shown to serve as biomarkers in human cancers, for early detection in blood and other bodily fluids, for prognosis or prediction of response to therapy and to monitor cancer recurrence [Laird, 2003].
A CpG island methylator phenotype (CIMP) was first characterized in human colorectal cancer by Toyota and colleagues [Toyota et al., 1999] as cancer-specific CpG island hypermethylation of a subset of genes in a subset of tumors. We confirmed and further characterized colorectal CIMP using MethyLight technology [Weisenberger et al., 2006]. Colorectal CIMP is characterized by tumors in the proximal colon, a tight association with BRAF mutations, and microsatellite instability caused by MLH1 promoter hypermethylation and transcriptional silencing [Weisenberger et al., 2006].

DNA methylation alterations have been widely reported in human gliomas, and there have been several reports of promoter-associated CpG island hypermethylation in human GBM and other glioma subtypes [Kim et al., 2006, Martinez et al., 2009, Martinez et al., 2007, Nagarajan and Costello, 2009, Stone et al., 2004, Tepel et al., 2008, Uhlmann et al., 2003]. Several studies have noted differences between primary and secondary GBMs with respect to epigenetic changes. Overall, secondary GBMs have a higher frequency of promoter methylation than primary GBMs [Ohgaki and Kleihues, 2007]. In particular, promoter methylation of RB1 was found to be approximately three times more common in secondary GBM [Nakamura et al., 2001].

Hypermethylation of the MGMT promoter-associated CpG island has been shown in a large percentage of GBM patients [Esteller et al., 2000, Esteller et al., 1999, Hegi et al., 2005, Hegi et al., 2008, Herman and Baylin, 2003]. MGMT encodes an O6-methylguanine methyltransferase, which removes alkyl groups from the O-6 position of guanine. GBM patients with MGMT hypermethylation showed sensitivity to alkylating agents such as temozolomide, with an accompanying improved outcome [Esteller et al., 2000, Esteller et al., 1999, Hegi et al., 2005, Hegi et al., 2008, Herman and Baylin, 2003].
However, initial promoter methylation of MGMT, in conjunction with temozolomide treatment, may result in selective pressure to lose mismatch repair function, resulting in aggressive recurrent tumors with a hypermutator phenotype [Cahill et al., 2007, Hegi et al., 2005, Silber et al., 1999, Cancer Genome Atlas Research Network, 2008].

Chapters 3 and 4 report the DNA methylation analysis and the integration of multiple high-throughput data generated using 272 GBM tumors collected by TCGA. This is then extended to lower grade tumors.

3.2.1 The Cancer Genome Atlas Network (TCGA)

The Cancer Genome Atlas (TCGA) is a comprehensive program in cancer genomics that is jointly supported and managed by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) of the US National Institutes of Health [Cancer Genome Atlas Research Network, 2008]. Starting with US$110M in 2005-2006, TCGA began a pilot study that focused on characterizing three cancer projects: 1) Glioblastoma multiforme, 2) Serous cystadenocarcinoma of the ovary, and 3) Lung squamous carcinoma. The project recently expanded to produce comprehensive genomic data sets for at least 20-25 other cancers in the next five to seven years (cancergenome.nih.gov). The primary objective and goal of TCGA is to identify the molecular and genomic basis of cancer by coordinating efforts among hundreds of scientists around the nation. By bringing together many of the best scientists around the globe, the hope is to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing. Eventually, the aim is to improve the ability to diagnose, treat, and prevent cancer through a better understanding of the molecular basis of this disease.
TCGA is divided among the five distinct organizations profiled below: the Cancer Genome Characterization Centers (CGCC), the Genome Sequencing Centers (GSC), the Genome Data Analysis Centers (GDAC), the Data Coordinating Center (DCC), and the Biospecimen Core Resource (BCR) (see Figure 3.3 on page 54).

The Cancer Genome Characterization Centers (CGCC): The NCI funds six CGCCs. These include the Broad Institute; Harvard; the University of North Carolina; the University of Southern California; Baylor College of Medicine; and the British Columbia Cancer Centre.

Genome Sequencing Centers (GSC): The NCI and NHGRI co-fund three GSCs. These include the Broad Institute, Washington University and Baylor College of Medicine. All three of these sequencing centers have shifted from Sanger sequencing to next-generation sequencing (NGS), although a variety of NGS technologies are being implemented simultaneously.

Genome Data Analysis Centers (GDAC): The NCI funds seven GDACs. Three of them are called GDAC-As, and these centers are responsible for the integration of data across all characterization and sequencing centers. The GDAC-As include the Broad Institute, the University of North Carolina and the Lawrence Berkeley National Laboratory. Four of the GDACs are called GDAC-Bs and are responsible for the biological interpretation of TCGA data. The GDAC-Bs include the University of California at Santa Cruz, the MD Anderson Cancer Center, the Memorial Sloan Kettering Cancer Center and the Institute for Systems Biology. All seven GDACs work together to develop an analysis pipeline for automated data analysis.

Data Coordinating Center (DCC): The Data Coordinating Center is the central repository for TCGA data. It is also responsible for the quality control of data entering the TCGA database. The DCC also maintains the TCGA Data Portal, where users access TCGA data. This work is performed under contract by bioinformatics scientists and developers from SRA International, Inc.
Biospecimen Core Resource (BCR): The NCI currently funds two BCRs. They are the Nationwide Children's Hospital and the International Genome Consortium. These two centers are responsible for verifying the quality and quantity of all tissue shipped by tissue source sites, the isolation of DNA and RNA from the tissue samples, the quality control of these biomolecules and the shipment of samples to the GSCs and CGCCs.

The molecular assays utilized by TCGA include the sequencing of targeted genes, Copy Number arrays (Agilent 244K), SNP arrays (Affy 6/500K, Illumina 550K BeadArray), Expression arrays (Affy U133+2, Agilent 44K), Exon arrays (Affy), DNA methylation arrays (Illumina) and micro RNA (miRNA) arrays (Agilent). The datasets produced by TCGA are then made publicly available. The data can be downloaded as flat files directly from the Data Portal (Figure 3.1) or by open or controlled access File Transfer Protocol (FTP) (Open Access: ftp://ftp1.nci.nih.gov/tcga; Controlled Access: sftp:caftps.nci.nih.gov). Controlled Access FTP contains all the information located in the Open Access FTP with the addition of sequence, SNP, exon, and clinical data, which are controlled due to privacy concerns. Tumor samples are sent out to the characterization centers in batches. The data ("raw" and processed) are grouped by batch into gzipped tarballs, which can be as large as 10GB in size. Many of the datasets produced for a particular cancer type exceed "billions of data points."

3.3 Material and Methods

3.3.1 List of Available Data Types

TCGA provides two sets of data types for each qualified sample (Figure 3.2): clinical and molecular features. For the clinical features, TCGA provides age at diagnosis, survival dates after diagnosis, whether a patient was treated, type of treatment, size of tumor, gender, race, and any other relevant clinical feature that will aid in the analysis.
For each TCGA tumor sample, DNA and RNA are isolated, processed for quality control and sent to several different cancer genome characterization centers (CGCC) across the United States (Figure 3.3).

Figure 3.1: TCGA Data Portal website. The top panel illustrates the Data Access matrix and the links to open and controlled access. The bottom panel illustrates the features one can select in order to download the data of interest.

Figure 3.2: Data types available for each TCGA sample. There exist two types of data for each TCGA sample: Clinical and Molecular data types.

3.3.2 GBM and Control Samples.

Genomic DNAs from TCGA GBM tumors were isolated by the TCGA Biospecimen Core Resource (BCR) and delivered to USC as previously described [Cancer Genome Atlas Research Network, 2008]. One sample (TCGA-06-0178) with a confirmed IDH1 mutation was removed from our analyses, since it became clear that an incorrect tissue type had been shipped for the DNA methylation analysis. Four brain genomic DNA samples from apparently healthy individuals were included as controls. All subjects (patients and healthy individuals) signed informed consent forms allowing their tissue to be used for research studies. The study was approved by the IRB committees associated with USC, JHU and MD Anderson. Genomic DNA methylated in vitro with M.SssI methylase or whole genome amplified (WGA), as positive and negative controls for DNA methylation, respectively, were also included. Genomic DNA samples (1 μg each) were bisulfite converted using the Zymo EZ96 DNA methylation kit (Zymo Research, Orange, CA; cat # D5004) according to the manufacturer's instructions. Bisulfite-converted DNA was eluted in an 18 μl volume, and 3 μl was then removed for post-bisulfite quality control tests as described previously [Campan et al., 2009]. All TCGA GBM samples passed bisulfite conversion quality control and were subsequently processed in the Illumina GoldenGate and/or Infinium DNA methylation platforms.

Figure 3.3: United States map depicting all institutes involved with TCGA. Each center is marked on the map of the United States of America. The only institute not depicted is the University of Southern California Epigenome Center, which is jointly collaborating with The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins University. TCGA map last updated in 2010 (cancergenome.nih.gov).

We began our study using DNA methylation data from 273 TCGA GBM patients. Of these, we generated GoldenGate data on 245 samples and Infinium data on 91 tumors, and there were 63 samples that were assayed on both platforms. Of the 244 samples that were assayed on the GoldenGate DNA methylation platform, six samples failed post-GoldenGate Cancer Panel I (OMA-002) QC evaluations and the data for these samples were omitted. The DNA methylation data from the custom GoldenGate array (OMA-003), designed specifically for the analysis of TCGA GBM tumors [Cancer Genome Atlas Research Network, 2008], were obtained for all 244 samples. Unsupervised consensus clustering was performed on 238 samples for which both GoldenGate (OMA-002 & OMA-003) DNA methylation data were available, effectively merging the two platforms (Figure 3.7A-B). A Venn diagram showing the samples assayed in the three platforms is described in Figure 3.8C. Consensus clustering analysis of 238 samples assayed with the GoldenGate OMA-002 platform and the 244 samples assayed with the GoldenGate OMA-003 platform are not shown.

3.3.3 Methylation Assays.

The GoldenGate assays were performed according to the manufacturer and as described previously [Bibikova et al., 2006].
The GoldenGate methylation assays survey the DNA methylation of up to 1,536 CpG sites: a total of 1,505 CpGs spanning 807 unique gene loci are interrogated in the OMA-002 probe set [Bibikova et al., 2006], and 1,498 CpGs spanning the same number of unique gene regions are investigated in the OMA-003 probe set [Cancer Genome Atlas Research Network, 2008]. For the GoldenGate methylation protocol, background-corrected methylated (Cy5) and unmethylated (Cy3) signal intensities for each CpG site were obtained using the Illumina BeadStudio software. DNA methylation data, presented as beta values, were calculated as the ratio of the methylated signal intensity to the sum of the methylated and unmethylated intensities at each locus, as described previously [Cancer Genome Atlas Research Network, 2008].

The detection p-values are computed using negative control signals from the background model and characterize the chance that the target sequence signal was distinguishable from the negative controls. For the Infinium methylation platform, the negative controls are modeled by a normal distribution. The detection p-value for a probe with intensity $I_p$ is calculated as previously described in equation 2.2 on page 33. In this formula, $\mu_{neg}$ is the average of the negative control signals, $\sigma_{neg}$ is their standard deviation, and Z is the one-sided tail probability of the standard normal distribution. For each probe, the detection p-values are calculated separately for the methylated and unmethylated probe signal intensities according to the corresponding red and green channel, and the smaller detection p-value was taken as the final detection p-value for the probe. Beta values with detection p-values greater than 0.05 were considered not significantly different from background and were masked as "NA" in the Level 2 and Level 3 data packages. All Infinium data were packaged and deposited onto the TCGA Data Portal web site (tcga-data.nci.nih.gov/tcga/findArchives.htm).
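The beta value, detection p-value and masking steps described above can be sketched in a few lines of Python. Equation 2.2 is not reproduced in this section, so the sketch assumes it has the usual one-sided normal form p = 1 − Φ((I_p − μ_neg)/σ_neg); that form, the use of the sample standard deviation, and all function and variable names are assumptions made for illustration, not the processing code used for the TCGA data packages.

```python
import math

def detection_p_value(intensity, neg_controls):
    """One-sided normal tail probability for one probe signal (assumed form of eq. 2.2)."""
    n = len(neg_controls)
    mu_neg = sum(neg_controls) / n
    # Sample standard deviation of the negative-control signals (an assumption)
    sigma_neg = math.sqrt(sum((x - mu_neg) ** 2 for x in neg_controls) / (n - 1))
    z = (intensity - mu_neg) / sigma_neg
    # Upper tail of the standard normal distribution: 1 - Phi(z)
    return 0.5 * math.erfc(z / math.sqrt(2))

def masked_beta(m, u, neg_m_channel, neg_u_channel, alpha=0.05):
    """Beta value M / (M + U), or None (standing in for "NA") if the probe is not detected."""
    p_m = detection_p_value(m, neg_m_channel)
    p_u = detection_p_value(u, neg_u_channel)
    # The smaller of the two channel-specific p-values is the probe's final p-value
    if min(p_m, p_u) > alpha:
        return None
    return m / (m + u)

# Hypothetical negative-control intensities, used here for both channels
neg = [100, 120, 80, 110, 90]
print(masked_beta(5000, 1000, neg, neg))  # strong signal: a beta value is returned
print(masked_beta(105, 95, neg, neg))     # signal near background: masked
```

Representing "NA" as `None` keeps masked values out of any later arithmetic unless they are handled explicitly.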
For the Infinium protocol, bisulfite-converted DNA is whole genome amplified and enzymatically fragmented. The bisulfite-converted, WGA-DNA samples are purified and hybridized to the BeadChip arrays, in which bisulfite-converted DNA molecules anneal to locus-specific DNA oligomers that are bound to individual bead types. Each CpG locus can hybridize to methylated (CpG) or unmethylated (TpG) oligo bead types. Allele-specific primer annealing is followed by single-base extension using labeled nucleotides. Both unmethylated and methylated bead types for a specific CpG locus incorporate the same labeled nucleotide, as determined by the base immediately preceding the cytosine being interrogated by the assay, and subsequently will be detected in a single channel. Each BeadChip, containing 12 subarrays, is fluorescently stained after extension and scanned, and the intensities of the methylated (M) and unmethylated (U) bead types for each CpG locus across all samples are measured. Mean non-background corrected M and U signal intensities for each locus were extracted from Illumina BeadStudio (or GenomeStudio) software. The beta value DNA methylation scores for each sample and locus were calculated as described in equation 3.1 on page 57. Detection p-values were calculated using the Z-score formula shown in equation 2.2 on page 33.

The Illumina Infinium HumanMethylation27K platform, which interrogates 27,578 CpG dinucleotides spanning promoter regions of 14,495 Consensus Coding DNA Sequence (CCDS) genes and 110 microRNA gene promoters, is currently the most comprehensive platform [Bibikova et al., 2009, Hinoue et al., 2009]. A newer Infinium platform was recently announced that can interrogate more than 450K CpG dinucleotides, casting a wide net across many different genomic attributes (e.g. enhancers, gene bodies, CpG island shores, etc.).
After bisulfite conversion of DNA, methylated cytosines remain unchanged, while unmethylated cytosines are converted to uracil (Figure 3.4) [Bibikova et al., 2009]. The methylation status at an interrogated locus i is then summarized as

$$\frac{M_i}{M_i + U_i} \tag{3.1}$$

which represents the proportion of the total mean fluorescence intensity within a probe pair due to methylated cytosines. This proportion is naturally modelled by a two-parameter beta distribution defined on the interval (0, 1) [Laird, 2010].

Figure 3.4: Infinium Methylation assay scheme. Figure taken from [Bibikova et al., 2009]. Nonmethylated cytosines (C) are converted to uracil (U) when treated with bisulfite, while methylated cytosines remain unchanged. Each CpG locus is represented by two bead types. One bead type (U) presents a probe that is designed to match the unmethylated site; the second bead type (M) matches the methylated state. (A) On the left side of this figure, the locus of interest is unmethylated. It matches perfectly with the U probe, enabling single-base extension and detection. The unmethylated locus has a single-base mismatch to the M probe, inhibiting extension and resulting in a low signal on the array. (B) If the CpG locus of interest is methylated, the reverse occurs: the M bead type will display a signal, and the U bead type will show a low signal on the array. If the locus has an intermediate methylation state, both probes will match the target site and will be extended. Methylation status of the CpG site is determined by the β value calculation (see equation 3.1 on page 57), which is the ratio of the fluorescent signal from the methylated probe to the total locus intensity.

Figure 3.5: Distribution of all 27K probes on the Infinium platform by distance to TSS. Histogram showing the distance to the TSS for all 27K probes.

The Infinium methylation assays were performed according to the manufacturer's instructions.
The assay generates DNA methylation data for 27,578 CpG dinucleotides spanning 14,473 well-annotated, unique gene promoter and/or 5′ gene regions (from -1,500 to +1,500 from the transcription start site, Figure 3.5). The assay information is available at www.illumina.com and the probe information is available on the TCGA Data Portal web site. Data from 91 TCGA GBM samples (Batches 1, 2, 3 and 10) were included in this analysis. Batches 1-3 (63 samples) were run on both Infinium and GoldenGate, whereas batches 4-8 (182 samples) were analyzed exclusively on GoldenGate and batch 10 (28 samples) was analyzed exclusively on Infinium. All data were packaged and deposited onto the TCGA Data Portal web site (tcga.cancer.gov/dataportal).

Infinium data archived versions used in this analysis include: jhu-usc.edu_GBM.HumanMethylation27.2.0.0 (batch 1), jhu-usc.edu_GBM.HumanMethylation27.3.0.0 (batch 2), jhu-usc.edu_GBM.HumanMethylation27.4.0.0 (batch 3), jhu-usc.edu_GBM.HumanMethylation27.1.1.0 (batch 10).

GoldenGate OMA002 data archived versions used in this analysis include: jhu-usc.edu_GBM.IlluminaDNAMethylation_OMA002_CPI.1.2.0, jhu-usc.edu_GBM.IlluminaDNAMethylation_OMA002_CPI.2.3.0, jhu-usc.edu_GBM.IlluminaDNAMethylation_OMA002_CPI.3.0.0, jhu-usc.edu_GBM.IlluminaDNAMethylation_OMA002_CPI.4.0.0, jhu-usc.edu_GBM.IlluminaDNAMethylation_OMA002_CPI.5.0.0, jhu-usc.edu_GBM.IlluminaDNAMethylation_OMA002_CPI.6.0.0.

GoldenGate OMA003 data archived versions used in this analysis include: jhu-usc.edu_GBM.IlluminaDNAMethylation_OMA003_CPI.1.0.0, jhu-usc.edu_GBM.IlluminaDNAMethylation_OMA003_CPI.2.0.0, jhu-usc.edu_GBM.IlluminaDNAMethylation_OMA003_CPI.3.0.0, jhu-usc.edu_GBM.IlluminaDNAMethylation_OMA003_CPI.4.0.0, jhu-usc.edu_GBM.IlluminaDNAMethylation_OMA003_CPI.5.0.0.

3.3.4 Unsupervised Consensus and Hierarchical Clustering of DNA Methylation and Gene Expression Data Sets.
Probes for each platform were filtered by removing those targeting the X and Y chromosomes, those containing a single nucleotide polymorphism (SNP) within five basepairs of the targeted CpG site, and probes containing repeat element sequences ≥ 10 basepairs. We next retained the most variably methylated probes (standard deviation > 0.20) across the tumor set in each DNA methylation platform. These final data matrices were used for unsupervised consensus/hierarchical clustering analyses. Consensus Cluster (GenePattern v.3.2.0, build 8571) was used to determine the statistical significance of clustering [Reich et al., 2006]. Evaluation of sample clustering and stability of clusters was performed by consensus clustering with K clusters (K = 2, 3, 4, ..., 12) using 1000 iterations with random restart and the Euclidean distance metric for sample ordering. The resulting consensus matrix was imported into R software (version 2.9.2) (www.r-project.org). Using the R package "gplots", the data were visualized as color-coded heat maps (blue to red). Consensus Cluster assignments for each sample are summarized in Figure 4.1A (see page 89) and Table 3.1 (see page 61). For hierarchical clustering, the R function "hclust" was used in combination with the "heatmap.2" function using the "average linkage" algorithm.

3.4 Results

3.4.1 Identification of a Distinct DNA Methylation Subgroup Within GBM

We determined DNA methylation profiles in a discovery set of 272 TCGA GBM samples. At the start of this study, we relied on the Illumina GoldenGate platform, using both the standard Cancer Panel I and a custom-designed array [Cancer Genome Atlas Research Network, 2008] (Figure 3.7A-C), but migrated to the more comprehensive Infinium platform (Figures 3.6A-B and 3.7A-B) as it became available. DNA methylation measurements were highly correlated for CpG dinucleotides shared between the two platforms (Pearson's r = 0.94, Figure 3.8D).
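The cross-platform comparison summarized by Pearson's r can be illustrated with a minimal Python sketch: for each probe shared by the two platforms, the mean beta value across the common samples is computed per platform, and the two vectors of means are then correlated. The probe names and beta values below are made up for illustration, and the helper functions are hypothetical; this is not the code used to produce Figure 3.8D.

```python
import math

def mean_ignoring_missing(values):
    """Mean of the observed beta values; None marks a masked ("NA") measurement."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy beta values: probe -> one value per sample common to both platforms
goldengate = {"probe_a": [0.10, 0.20, None], "probe_b": [0.80, 0.90, 0.85], "probe_c": [0.40, 0.50, 0.45]}
infinium = {"probe_a": [0.15, 0.10, 0.20], "probe_b": [0.75, 0.95, 0.90], "probe_c": [0.55, 0.50, 0.35]}

shared = sorted(set(goldengate) & set(infinium))
gg_means = [mean_ignoring_missing(goldengate[p]) for p in shared]
inf_means = [mean_ignoring_missing(infinium[p]) for p in shared]
print(pearson_r(gg_means, inf_means))
```

Averaging across samples before correlating, as in the figure legend, compares the platforms probe by probe rather than sample by sample.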
Both platforms interrogate a sampling of about two CpG dinucleotides per gene. Although this implies that non-representative CpGs may be assessed for some promoters, it is likely that representative results will be obtained for most gene promoters, given the very high degree of locally correlated DNA methylation behavior [Eckhardt et al., 2006]. We selected the most variant probes on each platform and performed consensus clustering to identify GBM subgroups [Monti et al., 2003] (see chapter 2 for a detailed description of the method). We identified three DNA methylation clusters using either the GoldenGate or Infinium data, with 97% concordance (61/63) in cluster membership calls for samples run on both platforms (Table 3.1).

Infinium vs GoldenGate Clusters
         INF.1  INF.2  INF.3   UNK  Total
GG.1         8      0      0    11     19
GG.2         0     21      1    78    100
GG.3         0      1     32    87    120
UNK          4      9     15     5     33
Total       12     31     48   181    272

Table 3.1: Infinium vs GoldenGate sample identification by clusters. GG stands for GoldenGate; INF stands for Infinium. The number following either GG or INF is the cluster number identified by Consensus Cluster. UNK stands for unknown status of samples not processed with either GG or INF. Counts at the intersection of a GG cluster and an INF cluster (bold in the original table) are the samples run on both platforms (n = 63).

Cluster 1 formed a particularly tight cluster on both platforms with a highly characteristic DNA methylation profile (Figures 3.6, 3.7, 3.9F-L), reminiscent of the CpG Island Methylator Phenotype described in colorectal cancer [Toyota et al., 1999, Weisenberger et al., 2006]. CIMP in colorectal cancer is characterized by correlated cancer-specific CpG island hypermethylation of a subset of genes in a subset of tumors, and not just a stochastic increase in the frequency of generic CpG island methylation across the genome [Toyota et al., 1999, Weisenberger et al., 2006]. Cluster 1 GBM samples show similar concerted methylation changes at a subset of loci.
We therefore designated cluster 1 tumors as having a glioma CpG Island Methylator Phenotype (G-CIMP). Combining Infinium and GoldenGate data, 24 of 272 TCGA GBM samples (8.8%) were identified as the G-CIMP subtype.

Figure 3.6: Clustering of TCGA GBM tumors and control samples identifies a CpG Island Methylator Phenotype (G-CIMP). Unsupervised consensus clustering was performed using the 1,503 Infinium DNA methylation probes whose DNA methylation beta values varied the most across the 91 TCGA GBM samples. DNA methylation clusters are distinguished with a color code at the top of the panel: red, consensus cluster 1 (n = 12 tumors); blue, consensus cluster 2 (n = 31 tumors); green, consensus cluster 3 (n = 48 tumors). Each sample within each DNA methylation cluster is color-labeled as described in the key for its gene expression cluster membership (Proneural, Neural, Classical and Mesenchymal). The somatic mutation status of five genes (EGFR, IDH1, NF1, PTEN, and TP53) is indicated by the black squares; the gray squares indicate the absence of mutations in the sample and the white squares indicate that the gene was not screened in the specific sample. G-CIMP-positive samples are labeled at the bottom of the matrix. A) Consensus matrix produced by k-means clustering (K = 3). The samples are listed in the same order on the x and y axes. Consensus index values range from 0 to 1, 0 being highly dissimilar and 1 being highly similar. B) One-dimensional hierarchical clustering of the same 1,503 most variant probes, with retention of the same sample order as in Figure 3.6A. Each row represents a probe; each column represents a sample. The level of DNA methylation (beta value) for each probe, in each sample, is represented using a color scale as shown in the legend; white indicates missing data. M.SssI-treated DNA (n = 2), WGA-DNA (n = 2) and normal brain (n = 4) samples are included in the heatmap but did not contribute to the unsupervised clustering. The probes in the eight control samples are listed in the same order as the y axis of the GBM sample heatmap.

3.5 Summary and Discussion of Results

Glioblastoma multiforme (GBM) is a highly aggressive form of brain tumor, with a median survival of just over one year. The Cancer Genome Atlas (TCGA) project aims to characterize cancer genomes to identify means of improving cancer prevention, detection, and therapy. Using TCGA data, we identified a subset of GBM tumors with characteristic promoter DNA methylation alterations, referred to as a glioma CpG Island Methylator Phenotype (G-CIMP).

Figure 3.7: Unsupervised clustering of 238 TCGA Glioblastoma Multiforme samples identifies a CpG Island Methylator Phenotype (G-CIMP). Unsupervised consensus clustering was performed on all 238 samples for which DNA methylation data were available on both GoldenGate probe sets, using the 370 probes (approximately 10% of the total probes on the array) whose DNA methylation beta values varied the most across the sample set (standard deviation > 0.2). A) Consensus matrix produced by k-means clustering (K = 3). The samples are listed in the same order on the x and y axes. The intensity of the red color of the square for each sample combination corresponds to the frequency with which the samples cluster together in the iterations (iterations = 1000) of dataset perturbation. Consensus index values range from 0 to 1, 1 being highly correlated and 0 being highly uncorrelated. DNA methylation clusters are distinguished with a color code at the top of the matrix: red, consensus cluster 1 (n = 19); blue, consensus cluster 2 (n = 100); green, consensus cluster 3 (n = 119). Each sample within each DNA methylation cluster is color-labeled as described in the figure legend for known gene expression clusters (Proneural, Neural, Classical and Mesenchymal) and for the mutation status of five genes (EGFR, IDH1, NF1, PTEN, and TP53) previously reported to be mutated in GBMs. The last row of color bands indicates the source batch of each sample as indicated by TCGA. G-CIMP samples are distinguished as indicated at the bottom of the matrix. B) Unsupervised hierarchical clustering of 238 TCGA GBM samples using the 370 most variant probes across the samples. Each row represents a probe, while each column represents a sample. The level of DNA methylation of each probe, in each sample, is represented using a rainbow color scale as shown in the legend; white indicates missing data. DNA methylation beta values range from 0 to 1, with zero indicative of very low DNA methylation and one indicative of high DNA methylation levels. Samples are listed in the same order as in Figure 3.7A and each sample is labeled as indicated by the legend and reported in Figure 3.7A.

Figure 3.8: Venn Diagram and Correlation of probes between GoldenGate and Infinium. Venn diagram showing the interrelationship of samples/platform and genes/platform. C) Venn diagram of all available TCGA GBM samples that were analyzed for DNA methylation using three different platforms: OMA-002, OMA-003, and Infinium. In parentheses under each sample count is the number of G-CIMP-positive samples identified, as confirmed by consensus clustering. D) Scatter plot for 189 DNA methylation probes for which the genomic coordinate matches on both platforms (Infinium and GoldenGate (OMA-002 + OMA-003)). The mean beta value was calculated across the 63 TCGA GBM samples common to the two platforms (Figure 3.7C). The y-axis represents the mean Infinium beta value and the x-axis represents the mean GoldenGate beta value. The Pearson's correlation coefficient was calculated: r = 0.94.

Figure 3.9: Consensus Clustering Statistical Results. E-H) Statistical summary output from Consensus Clustering for 91 TCGA GBM samples using Infinium. I-L) Statistical summary output from Consensus Clustering for 238 TCGA GBM samples using the GoldenGate DNA methylation platform.
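The consensus index plotted in Figures 3.6A and 3.7A can be illustrated with a small Python sketch. Following the general definition of consensus clustering [Monti et al., 2003], the index for a pair of samples is the fraction of resampling iterations containing both samples in which the two were assigned to the same cluster. The cluster labels below are made up, and the function is a hypothetical stand-in for illustration, not the GenePattern implementation.

```python
from itertools import combinations

def consensus_matrix(samples, runs):
    """Consensus index per sample pair.

    runs: one dict per resampling iteration, mapping each sample drawn in
    that iteration to its cluster label. Pairs never drawn together get None.
    """
    pairs = list(combinations(samples, 2))
    together = dict.fromkeys(pairs, 0)    # iterations in which the pair co-clustered
    co_sampled = dict.fromkeys(pairs, 0)  # iterations in which both samples were drawn
    for labels in runs:
        for a, b in pairs:
            if a in labels and b in labels:
                co_sampled[(a, b)] += 1
                if labels[a] == labels[b]:
                    together[(a, b)] += 1
    return {
        pair: (together[pair] / co_sampled[pair] if co_sampled[pair] else None)
        for pair in pairs
    }

# Three toy iterations; sample "s3" was not drawn in the second one
runs = [
    {"s1": 0, "s2": 0, "s3": 1},
    {"s1": 1, "s2": 1},
    {"s1": 0, "s2": 1, "s3": 1},
]
for pair, index in consensus_matrix(["s1", "s2", "s3"], runs).items():
    print(pair, index)
```

Cluster labels only need to be consistent within an iteration, since the index depends on whether two samples share a label, not on which label it is.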
Chapter 4

Clinical & Molecular Characterization of Epigenetic Subgroups of Glioblastoma

The material in this chapter has been published [Noushmehr et al., 2010].

4.1 Introduction

As discussed in Chapter 3 (page 45), glioblastoma multiforme (GBM) is a highly aggressive form of brain tumor, with a patient median survival of just over one year. The Cancer Genome Atlas (TCGA) project aims to characterize cancer genomes to identify means of improving cancer prevention, detection, and therapy. Using TCGA molecular data, we identified a subset of GBM tumors with characteristic promoter DNA methylation alterations, referred to as a glioma CpG Island Methylator Phenotype (G-CIMP). G-CIMP tumors are a subset of the Proneural subtype, and patients with G-CIMP tumors are younger at diagnosis and display improved survival times. In this chapter, we characterize and identify the unique molecular features associated with G-CIMP by integrating multiple data types available through the TCGA. These molecular data types include high-throughput gene expression (Agilent, Affymetrix), copy number alterations (Agilent), sequencing of targeted genes of interest, DNA methylation of a select number of interrogated CpG sites (27,578) across the entire human genome, and other data types available in the public repository.

4.2 Material and Methods

4.2.1 Integrative TCGA Data Platforms

While ancillary data (expression, mutation, copy number) were available for additional tumor samples, we only included those samples for which DNA methylation profiling (either GoldenGate or Infinium) was available. Since the Agilent gene expression platform contained a greater number of genes for which DNA methylation data were available, we limited our primary analysis to the Agilent gene expression data set. Where appropriate, we confirmed results using the Affymetrix gene expression data.
4.2.1.1 Gene Expression and miRNA Expression

Level 3 gene expression data from the Agilent whole genome 44K probe set and the Agilent 8 x 15K Human miRNA-specific microarray data were obtained from the TCGA Data Portal website. Gene expression data were Lowess normalized for each sample. Custom Perl scripts were created to collect and process the individual text files for each sample in order to populate a working data matrix for further downstream data analyses.

4.2.1.2 Mutation Data

DNA sequence data for 833 genes contained in the Phase I and II GBM gene sets, which included IDH1, were collected. All DNA sequence data were generated using PCR amplification for each gene and sequencing using Sanger chemistry on the ABI 3730 DNA sequencing platform, and were subsequently validated as described previously [Cancer Genome Atlas Research Network, 2008].

4.2.2 Wilcoxon Rank Sum Test and Difference & Fold Change for Differential DNA Methylation and Differential Gene Expression in TCGA GBM

Each DNA methylation platform (GoldenGate OMA-002, GoldenGate OMA-003 and Infinium) was independently analyzed. A non-parametric approach was used to determine probes/genes that are differentially DNA methylated or differentially expressed between two groups of interest. The Wilcoxon rank sum test [Troyanskaya et al., 2002] was applied to obtain a raw p-value, that is, the probability of a test statistic at least as extreme as the one observed between the two groups under the null hypothesis of no differential DNA methylation (when analyzing DNA methylation) or of no differential expression (when analyzing gene expression). The Benjamini-Hochberg multiple testing correction was applied to adjust the resulting p-values, controlling the False Discovery Rate (FDR) at 0.05. Probes with an adjusted p-value below 0.05 were considered significantly differentially methylated between the two sets of tumors.
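The Benjamini-Hochberg adjustment mentioned above can be written out in a few lines. The following Python sketch implements the standard step-up procedure on a list of raw p-values; the example p-values are made up, and the function is an illustration of the procedure rather than the R code used in this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values, one per probe
adjusted = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adjusted) if q < 0.05]
print(adjusted)
print(significant)
```

Each adjusted value is the smallest FDR level at which that probe would be called significant, so thresholding the adjusted values at 0.05 corresponds to the cutoff described in the text.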
The beta value difference between the two groups was computed by first calculating the mean beta value across each group and then calculating the difference between the mean beta values for each probe. Genes with multiple probes were collapsed down to a primary probe. The primary probe for each gene was chosen as the one located closest to the -100 bp position in the promoter relative to the transcription start site; this location should be in a key region of the promoter to correlate with expression changes. Differentially methylated probes are then defined as primary probes with an FDR < 0.05. In addition to the methods described above, gene expression data were further analyzed with the Significance Analysis of Microarrays (SAM) R package [Tusher et al., 2001] to estimate the FDR. The SAM statistic was computed using 1000 permutations, and the Δ-value cutoff was selected for which the 90th percentile FDR < 5%. Genes below this Δ-value cutoff were considered statistically significant. Expression fold change was obtained by calculating the log2 ratios of intensities as described in the following equation:

$$\log_2\left(\frac{\text{G-CIMP}^{+}}{\text{G-CIMP}^{-}}\right) \tag{4.1}$$

4.2.3 Binomial Test for Genomic Clustering of CIMP Loci

To test whether G-CIMP locus probes tend to cluster within the genome, the distance between adjacent G-CIMP probes was measured in terms of the number of intervening non-G-CIMP probes, and the distribution of these distances was compared to a binomial distribution, Bin(n, p), where n is the number of G-CIMP probes and p is the proportion of probes within G-CIMP loci. Since the Infinium array is designed to include an average of two probes for each gene, and we wished to test for clustering of adjacent genes but not of probes within a single promoter or CpG island, we eliminated any probe that followed within 2 kb of the previous probe.

4.2.4 G-CIMP Validation Using MethyLight Technology

Sections were cut and deparaffinized, and DNA was isolated using a commercially available kit (Epicentre, Madison, WI).
Samples were converted with bisulfite (using a kit from Zymo Research, Orange, CA), and then amplified by the fluorescence-based, MethyLight real-time PCR strategy as described previously [Eads et al., 2000, Eads et al., 1999]. Primers and probes used for validation are as follows:

ANKRD43, forward primer: TCGTCGGTATCGAGTAGCGG, reverse primer: CGATACTAAACTTCCTACAAAAACACGAC, 5′ modification: 6FAM, probe: AATACGCAACTCCGAACTACTAAACCGCTTC, Quencher: BHQ-1;
HFE, forward primer: TTTTTGATGTTTTTGTAGATCGCG, reverse primer: CGCGCCCCTAATTCGC, 5′ modification: 6FAM, probe: CGAACTCACGCAACAAACGCCCCTA, Quencher: BHQ-1;
MAL, forward primer: GTTCGGTGTAGGATTTTAGCGTC, reverse primer: ATCTACAATAAAAAATAAAACCGACCG, 5′ modification: 6FAM, probe: CGACCGCCGACCCCTTCCG, Quencher: BHQ-1;
DOCK5, forward primer: CGGTTCGCGGAGTTTAGC, reverse primer: AACTACTACAACTCCTCGAACTCCG, 5′ modification: 6FAM, probe: CAAACGCTTCCGCCATATTCCGCC, Quencher: BHQ-1;
LGALS3, forward primer: GCGGAGTTTCGTGGGTTTCG, reverse primer: AATAACCAAACTACGACTCGTCACC, 5′ modification: 6FAM, probe: CCGCAAAACGCAAACGACGAAAATACGACG, Quencher: MGBNFQ;
FAS-1, forward primer: AGGAACGTTTCGGGATAGGAA, reverse primer: CAACTTAACCTACGCGCGAAT, 5′ modification: 6FAM, probe: TGTGTAACGAATTTTG, Quencher: MGBNFQ;
FAS-2, forward primer: GGGTAGGAGGTCGGTTTTCG, reverse primer: TTCGTTACACAAATAAACATTCCTATCC, 5′ modification: 6FAM, probe: TGAGTATGTTAGTTATTGTAGGAAC, Quencher: MGBNFQ;
RHOF, forward primer: GTCGTAGTCGTCGTCGTTTACG, reverse primer: GCTACGAACTCCGAACAATAAATACC, 5′ modification: 6FAM, probe: AAACCCTAACCCAAACCGCCGCCC, Quencher: MGBNFQ;
COL2A1, forward primer: TCTAACAATTATAAACTCCAACCACCAA, reverse primer: GGGAAGATGGGATAGAAGGGAATAT, 5′ modification: 6FAM, probe: CCTTCATTCTAACCCAATACCTATCCCACCTCTAAA, Quencher: TAMRA.

Primers were tested on commercially available methylated and unmethylated DNA converted with bisulfite to assure PCR specificity. To increase sensitivity, a pre-amplification step of 10 cycles was performed prior to real-time PCR.
We determined the DNA methylation levels of each gene and sample by calculating delta Ct values of each G-CIMP gene relative to the COL2A1 reference gene, using an ABI 7900 Sequence Detection System (Perkin-Elmer, Foster City, CA) or a Bio-Rad Chromo 4 Continuous Fluorescence Detector.

4.2.5 Pathway Analysis and Meta-Analyses

Parallel searches of the Molecular Signatures Database (MSigDB database v2.5) were used to identify statistical enrichment (Fisher's exact test, odds ratio and log odds ratio) for a priori defined sets of genes: conserved transcription factor (TF) binding site promoter predictions, predicted targets for specific miRNAs, and cancer-related and other modules consisting of sets of genes previously reported in high-throughput data production [Subramanian et al., 2005]. Identification of enrichment of Gene Ontology (GO) terms and chromosome clustering was performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID) on-line software package (david.abcc.ncifcrf.gov/) [Dennis et al., 2003, Huang et al., 2009]. Highly significant genes by gene expression and/or DNA methylation were compared to the background gene list for each platform to discover enriched pathway and gene function terms. Meta-analyses to identify overlapping and associated genes in publicly available data sets were performed using the NextBio™ (Cupertino, CA) on-line search engine (www.nextbio.com/, accessed October 5, 2009) as previously described [Sung et al., 2009].

4.2.6 Statistical Analysis

All statistical tests were done using R software (R version 2.9.2, 2009-08-24) and packages in Bioconductor [Gentleman et al., 2004], except as noted. A non-parametric Pearson χ² test with Yates' continuity correction and odds ratio was used to assess the significance of association of various covariates with DNA methylation clusters and gene expression clusters.
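For a 2 × 2 table, the Pearson χ² test with Yates' continuity correction and the accompanying odds ratio reduce to closed-form expressions. The Python sketch below is illustrative only: the counts are hypothetical, the function names are made up, and the analyses in this chapter were run in R rather than with this code.

```python
import math

def yates_chi_square(a, b, c, d):
    """Chi-square statistic with Yates' correction and its p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column needs at least one observation")
    # Yates' correction: shrink |ad - bc| by n/2, never below zero
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * corrected ** 2 / margins
    # Upper-tail probability of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c); None when the denominator is zero."""
    if b * c == 0:
        return None
    return (a * d) / (b * c)

# Hypothetical counts: rows = covariate present/absent, columns = cluster A/cluster B
chi2, p = yates_chi_square(10, 20, 30, 40)
print(chi2, p, odds_ratio(10, 20, 30, 40))
```

The closed-form p-value applies only to 2 × 2 tables (one degree of freedom); larger tables would need a general χ² distribution function.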
Fisher's exact test was performed for covariates in which sample counts were fewer than five and/or when samples were not available. For pathway and motif enrichment analysis, binomial and Fisher's exact tests were performed. All survival analyses were standard, and the statistical significance of the separation between the Kaplan-Meier (K-M) curves was evaluated using a log-rank (Mantel-Haenszel) test under the assumption of proportional hazards in the two groups being tested. Multiple-predictor comparisons were evaluated through Cox proportional-hazard regression. All other data management and summary statistics analyses were performed using custom Perl scripts and the R package "ggplot2" (had.co.nz/ggplot2/), respectively.

4.3 Results

4.3.1 Clinical Characterization of G-CIMP Tumors

We further characterized G-CIMP tumors by reviewing the available clinical covariates for each patient. Although patients with proneural GBM tumors are slightly younger (median age, 56 years) than all other non-proneural GBM patients (median age, 57.5 years), this difference was not statistically significant (p = 0.07). However, patients with G-CIMP tumors were significantly younger at the time of diagnosis compared to patients diagnosed with non-G-CIMP proneural tumors (median ages of 36 and 59 years, respectively; p < 0.0001; Figure 4.1C). The overall survival for patients of the proneural subtype was not significantly improved compared to other gene expression subtypes (Figure 4.1D), but significant survival differences were seen for groups defined by DNA methylation status (Figure 4.1E, 4.1F, Figure 4.2B). We observed significantly better survival for proneural G-CIMP-positive patients (median survival of 150 weeks) than for proneural G-CIMP-negative patients (median survival of 42 weeks) or all other non-proneural GBM patients (median survival of 54 weeks). G-CIMP status remained a significant predictor of improved patient survival (p = 0.0165) in Cox multivariate analysis after adjusting for patient age, recurrent vs. non-recurrent tumor status and secondary GBM versus primary GBM status.

4.3.2 Characterization of G-CIMP Tumors within Gene Expression Clusters

Four gene expression subtypes (Proneural, Neural, Classical and Mesenchymal) have been previously identified and characterized using TCGA GBM samples [Verhaak et al., 2010]. We compared the DNA methylation consensus cluster assignments for each sample to their gene expression cluster assignments (Figure 3.6, Figure 4.1A, Figure 3.7A-B, Table 3.1). The G-CIMP sample cluster is highly enriched for proneural GBM tumors, while DNA methylation clusters 2 and 3 are moderately enriched for the classical and mesenchymal expression groups, respectively. Of the 24 G-CIMP tumors, 21 (87.5%) were classified within the proneural expression group. These G-CIMP tumors represent 30% (21/71) of all proneural GBM tumors, suggesting that G-CIMP tumors represent a distinct subset of proneural GBM tumors (Figure 4.1A and Figure 4.2A, Table 3.1). The few non-proneural G-CIMP tumors belong to the neural (2/24 tumors, 8.3%) and mesenchymal (1/24 tumors, 4.2%) gene expression groups.

In order to obtain an integrated view of the relationships between G-CIMP status and gene expression differences, we performed pairwise comparisons between members of different molecular subgroups (Figure 4.1B). We calculated the mean Euclidean distance in both DNA methylation and expression for each possible pairwise combination of the five different subtypes: G-CIMP-positive proneural, G-CIMP-negative proneural, classical, mesenchymal and neural tumors. We observed high dissimilarity for the GP, GN, GC, and GM pairs (Figure 4.1B), supporting the hypothesis that G-CIMP-positive tumors are a unique molecular subgroup of GBM tumors, and more specifically that G-CIMP status provides further refinement of the proneural subset.
Indeed, among the proneural tumors, the G-CIMP-positive tumors are distinctly dissimilar to the mesenchymal tumors (GM pair), while the G-CIMP-negative proneural tumors are relatively similar to mesenchymal tumors (PM pair). We focused downstream analyses on comparisons between G-CIMP-positive and G-CIMP-negative tumors within the proneural subset, to avoid misidentifying proneural features as G-CIMP-associated features.

4.3.3 IDH1 Sequence Alterations in G-CIMP Tumors

A total of 833 genes were analyzed for somatic mutations within 218 TCGA GBM samples. We identified nine genes showing significantly elevated somatic mutation frequencies in proneural G-CIMP-positive tumors compared with proneural G-CIMP-negative tumors: DST, EIF2AK4, EPHB4, FGFR4, IDH1, LEMD3, MAPK7, TNFRSF10A and TRPM3 (p < 0.05, Fisher's Exact Test, Table 4.1, Figure 4.3A).

SOMATIC MUTATIONS
GENE         chi.pvalue   fisher.exact   ODDS RATIO
IDH1         2.20E-16     6.59E-09       1.60E+02
DST          9.42E-03     6.17E-03       1.03E+01
EIF2AK4      2.32E-03     2.27E-02       NA
EPHB4        6.03E-06     2.27E-02       NA
FGFR4        6.03E-06     2.27E-02       NA
LEMD3        2.32E-03     2.27E-02       NA
MAPK7        1.05E-04     2.27E-02       NA
TNFRSF10A    6.58E-04     2.27E-02       NA
TRPM3        1.05E-04     2.27E-02       NA

Table 4.1: Somatic mutations associated with G-CIMP. Calculated chi-square and Fisher's exact test p-values are reported, as well as the associated odds ratios. Listed are the nine genes with somatic mutations significantly associated with G-CIMP.

IDH1 somatic mutations, recently identified primarily in secondary GBM tumors [Balss et al., 2008, Parsons et al., 2008, Yan et al., 2009], were found to be very tightly associated with G-CIMP in our data set (Table 4.2): IDH1 mutations were observed in 18 of 23 (78%) G-CIMP-positive tumors, whereas all 184 G-CIMP-negative tumors were IDH1-wildtype (p < 2.2 × 10^-16). Patients in the five discordant cases of G-CIMP-positive, IDH1-wildtype tumors are not significantly different in age compared to patients with G-CIMP-positive, IDH1-mutant tumors (median ages of 34 and 37 years, respectively; p = 0.873).
However, the five discordant cases of G-CIMP-positive, IDH1-wildtype tumors are significantly younger at the time of diagnosis compared to patients with G-CIMP-negative, IDH1-wildtype tumors (median ages of 37 and 59 years respectively; p < 0.008). Interestingly, two of the five patients each survived more than five years after diagnosis. We did not observe any IDH2 mutations in the TCGA data set. Tumors displaying both G-CIMP-positive status and IDH1 mutation occurred at low frequency in primary GBM, but were enriched in the set of 16 recurrent (treated) tumors, and to an even greater degree in the set of four secondary GBM (Table 4.2).

Table 4.2: G-CIMP and IDH1 mutation status in primary, secondary and recurrent GBMs. G-CIMP and IDH1 mutation status are compared for all analyzed GBM tumors (p < 2.2 × 10^-16), primary tumors, recurrent tumors, and secondary tumors. P value is calculated from Fisher's exact test.

                                  G-CIMP GBMs
Tumor set           IDH1          Neg    Pos    Total
All Tumors          Wild-type     184    5      189
                    Mutant        0      18     18
                    Total         184    23     207
Primary Tumors      Wild-type     171    4      175
                    Mutant        0      12     12
                    Total         171    16     187
Recurrent Tumors    Wild-type     12     0      12
                    Mutant        0      4      4
                    Total         12     4      16
Secondary Tumors    Wild-type     1      1      2
                    Mutant        0      2      2
                    Total         1      3      4

Mutations in DST (Dystonin or bullous pemphigoid antigen 1) were also elevated in proneural G-CIMP-positive tumors (p < 0.006, Fisher's Exact Test, odds ratio = 10.02). DST encodes a plakin protein which is involved in connecting cytoskeletal elements, and contains domains that bind microtubules and actin [Sonnenberg and Liem, 2007]. In addition, three samples that did not show IDH1 mutation were shown to have mutation in DST (2/3) or LEMD3 (1/3).

We also analyzed 833 genes for germline mutation and loss of heterozygosity (LOH) within 218 TCGA GBM samples. We identified nine genes with germline mutations and six genes with loss of heterozygosity that showed significant association with proneural G-CIMP-positive tumors (Figure 4.3B-C, Table 4.3).
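The association between G-CIMP and IDH1 status in the "All Tumors" block of Table 4.2 can be checked with a two-sided Fisher's exact test. The sketch below is a plain standard-library illustration, not the original analysis code; the function name and the small tolerance used when comparing probabilities are assumptions made for this example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # The relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


# Rows: IDH1 wild-type, IDH1 mutant. Columns: G-CIMP-negative, G-CIMP-positive.
p_all_tumors = fisher_exact_two_sided(184, 5, 0, 18)
```

With these counts the p-value falls far below the 2.2 × 10^-16 reporting floor quoted in the table caption.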
Genes with significant germline mutation differences in proneural G-CIMP tumors were ASCL2, CDH19, CHAT, ConsReg2496, DGKZ, KCNV2, LOC389458, PIP5K2A and SOX13. Only germline mutations in KCNV2 were under-represented in G-CIMP-positive tumors, while the remaining genes showed an increase in germline mutations in G-CIMP-positive tumors. Genes with significant LOH differences in proneural G-CIMP tumors were BDNF, CDH23, DMRT2, DPP6, SLC1A2 and SMARCA2. CDH23 mutations were significantly under-represented in G-CIMP-positive tumors (p < 0.004, Fisher's Exact Test, odds ratio = 0.5), while the other five genes had an over-representation of mutations in G-CIMP-positive tumors.

Table 4.3: Germline mutation and LOH associated with G-CIMP. Chi-square and Fisher's exact test p-values are reported, as well as the associated odds ratios. Listed are the nine genes with significant germline mutation differences and the six genes with significant LOH differences associated with G-CIMP.

GERMLINE MUTATIONS
GENE          chi.pvalue   fisher.exact   ODDS RATIO
LOC389458     5.2E-05      5.9E-03        NA
PIP5K2A       5.4E-03      5.9E-03        NA
KCNV2         1.1E-02      6.6E-03        0.0E+00
SOX13         1.5E-02      1.4E-02        4.7E+00
DGKZ          4.2E-02      2.0E-02        7.9E+00
ASCL2         6.0E-06      2.3E-02        NA
ConsReg2496   2.5E-02      2.3E-02        NA
CHAT          7.7E-07      2.4E-02        1.2E+01
CDH19         7.7E-02      4.3E-02        3.9E+00

LOH
GENE          chi.pvalue   fisher.exact   ODDS RATIO
CDH23         8.9E-03      4.4E-03        1.4E-01
DMRT2         1.6E-02      6.2E-03        1.0E+01
SLC1A2        1.1E-02      2.3E-02        NA
BDNF          2.2E-03      2.4E-02        1.2E+01
SMARCA2       3.9E-02      3.3E-02        3.8E+00
DPP6          8.7E-03      4.2E-02        5.1E+00

4.3.4 Copy Number Variation (CNV) in Proneural G-CIMP Tumors

In order to elucidate critical alterations within proneural G-CIMP-positive tumors, we analyzed gene-centric copy number variation data. Previously described DNA copy number data for 61 TCGA GBM samples were obtained from MSKCC: cbio.mskcc.org/cancergenomics/gbm/cna. Each sample per locus contained a copy number call value which was recently described in [Taylor et al., 2008].
Using this gene-centric copy number file, we calculated significant differences in copy number alterations between proneural G-CIMP-positive (n=18) and proneural G-CIMP-negative (n=43) tumor samples using the Cochran-Armitage test for trend. The Benjamini-Hochberg multiple testing correction was applied to adjust the resulting p-values while controlling the False Discovery Rate (FDR) below 0.05. Genes with an adjusted p-value below 0.05 were considered significantly different in copy number between the two sets of tumors.

We identified significant copy number differences in 2,875 genes between proneural G-CIMP-positive and G-CIMP-negative tumors (Figure 4.4A). Although chromosome 7 amplifications are a hallmark of aggressive GBM tumors [Cancer Genome Atlas Research Network, 2008], copy number variation along chromosome 7 was reduced in proneural G-CIMP-positive tumors. Gains in chromosomes 8q23.1-q24.3 and 10p15.3-p11.21 were identified (Figure 4.4B and 4.4C). The 8q24 region contains the MYC oncogene, is rich in sequence variants and was previously shown as a risk factor for several human cancers [Amundadottir et al., 2006, Freedman et al., 2006, Haiman et al., 2007a, Haiman et al., 2007b, Schumacher et al., 2007, Shete et al., 2009, Visakorpi et al., 1995, Yeager et al., 2007]. Accompanying the gains at chromosome 10p in proneural G-CIMP-positive tumors, we also detected deletions of the same chromosome arm in G-CIMP-negative tumors (Figure 4.4C). Similar copy number variation results were obtained when comparing all G-CIMP-positive to G-CIMP-negative samples (Figure 4.5). These findings point to G-CIMP-positive tumors as having a distinct profile of copy number variation when compared to G-CIMP-negative tumors.
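The Benjamini-Hochberg adjustment applied to the per-gene trend-test p-values can be sketched as below. This is a generic standard-library illustration of the procedure, not the code used in this study; the function name and the four example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each sorted p-value p_(k) is scaled by m / k, and a running minimum
    taken from the largest rank downwards keeps the result monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]
```

Genes whose adjusted value falls below 0.05 would be reported as significantly different in copy number, mirroring the threshold described above.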
4.3.5 Identification of DNA Methylation and Transcriptome Expression Changes in Proneural G-CIMP Tumors

To better understand CpG island hypermethylation in glioblastoma, we investigated the differentially methylated CpG sites of these samples (see Figure 3.6B on page 62). Among 3,153 CpG sites that were differentially methylated between proneural G-CIMP-positive and proneural G-CIMP-negative tumors, 3,098 (98%) were hypermethylated (Figure 4.6A). In total, there were 1,550 unique genes, of which 1,520 were hypermethylated and 30 were hypomethylated within their promoter regions. We ranked our probe list by decreasing adjusted p-values and increasing beta-value difference in order to identify the top most differentially hypermethylated CpG probes within proneural G-CIMP+ tumors (Table S3).

The Agilent transcriptome data were used to detect genes showing both differential expression and G-CIMP DNA methylation in G-CIMP-positive and G-CIMP-negative proneural samples. Gene expression values were adjusted for regional copy number changes, as described below. We used the copy number data to normalize the gene expression data among TCGA GBM samples. We back-transformed the log2 expression data and then, using the call numbers defined in [Taylor et al., 2008], normalized the gene expression data using the following corrections: a call of -2 was converted to 0 copies, a call of -1 was converted to 1 copy, a call of zero was converted to two copies, a call of +1 was converted to four copies, and a call of +2 was converted to 8 copies. For each sample and locus with a call of -2, the intensity was reduced to background levels of gene expression. Gene expression values for data points with a call of -1 were multiplied by two. Gene expression values for data points with a call of zero were unchanged. Gene expression values for data points with a call of +1 were divided by two. Finally, gene expression values for data points with a call of +2 were divided by four.
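The copy number correction described above can be sketched for a single sample and locus as follows. This is an illustrative reading of the rules in the text, not the original normalization script: the function name is hypothetical, and the `background` parameter (a linear-scale background intensity) is an assumed placeholder, since the text does not state how the background level was chosen.

```python
import math

# Multiplicative correction per copy number call: expression is rescaled
# to the diploid state (2 copies / assumed copies for that call).
SCALE_BY_CALL = {-1: 2.0, 0: 1.0, 1: 0.5, 2: 0.25}


def normalize_expression(log2_expr, cn_call, background=1.0):
    """Copy-number-correct one log2 expression value.

    cn_call is the discrete call (-2, -1, 0, +1 or +2). A call of -2 sets
    the intensity to the background level (hypothetical default of 1.0).
    """
    linear = 2.0 ** log2_expr              # back-transform from log2
    if cn_call == -2:
        corrected = background             # homozygous deletion
    else:
        corrected = linear * SCALE_BY_CALL[cn_call]
    return math.log2(corrected)            # return to the log2 scale


# Example: a log2 intensity of 3.0 under each possible call.
examples = {call: normalize_expression(3.0, call) for call in (-2, -1, 0, 1, 2)}
```

On the log2 scale the correction amounts to adding 1 for a call of -1, subtracting 1 for a call of +1, and subtracting 2 for a call of +2.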
These normalized gene expression values were log2-transformed for subsequent data analysis, see equation 4.1 on page 69.

A total of 1,030 genes were significantly down-regulated and 654 genes were significantly up-regulated among proneural G-CIMP-positive tumors (Figure 4.6B). We mapped 927 of the 1,030 down-regulated genes and identified 24 functional clusters enriched in proneural G-CIMP-positive tumors. The differentially down-regulated gene set was highly enriched for polysaccharide, heparin and glycosaminoglycan binding, collagen, thrombospondin and cell morphogenesis (p < 2.2 × 10^-4). We next categorized the genes that are up-regulated in proneural G-CIMP tumors. We mapped 542 of 654 genes to a gene reference list. We identified 16 functional groups with significant enrichment in proneural G-CIMP-positive tumors. The significantly up-regulated gene set was highly enriched among functional categories involved in regulation of transcription, nucleic acid synthesis, metabolic processes, and cadherin-based cell adhesion (p < 6.2 × 10^-4). Zinc finger transcription factors were also found to be highly enriched in genes significantly up-regulated in expression (p = 3.1 × 10^-8). Similar findings were obtained when a permutation analysis was performed and when Affymetrix gene expression data were used (data not shown).

In addition to known canonical genes, the Infinium DNA methylation platform also interrogates CpG sites within promoters of known micro RNA (miRNA) sequences. Nine Infinium CpG probes associated with five miRNAs (MIR330, MIR170, MIR196B, MIR565, MIR221) had significant methylation changes in G-CIMP-positive tumors. We also identified 20 miRNAs that showed significant differences in their gene expression between proneural G-CIMP-positive and proneural G-CIMP-negative tumors (Figure 4.7, Table S3). Seven miRNAs were down-regulated, while 13 others were up-regulated in G-CIMP-positive tumors.
MIR221 and MIR222 were the most significantly down-regulated in G-CIMP-positive tumors, and MIR222 was the only miRNA that was also significantly hypermethylated. Both MIR222 and MIR221 were shown to promote metastasis in a variety of cancers [Liu et al., 2009, Schaefer et al., 2009], and promote growth of human GBM cell lines [Gillies and Lorimer, 2007].

Integration of the normalized gene expression and DNA methylation gene lists identified a total of 300 genes with both significant DNA hypermethylation and gene expression changes in G-CIMP-positive tumors compared to G-CIMP-negative tumors within the proneural subset. Of these, 263 were significantly down-regulated and hypermethylated within proneural G-CIMP+ tumors (Figure 4.8C, lower right quadrant). To validate these differentially expressed and methylated genes, we replicated the analysis using an alternate expression platform (Affymetrix), and derived consistent results (Figure 4.9). Among the top ranked genes were FABP5, PDPN, CHI3L1 and LGALS3 (Table 4.4), which were identified in an independent analysis to be highly prognostic in GBM with higher expression associated with worse outcome [Colman et al., 2010]. Gene ontology analyses showed G-CIMP-specific down-regulation of genes associated with the mesenchyme subtype, tumor invasion and the extracellular matrix as the most significant terms (Table 4.5). Genes with roles in transcriptional silencing, chromatin structure modifications and activation of cellular metabolic processes showed increased gene expression in proneural G-CIMP+ tumors. Additional genes differentially expressed in proneural G-CIMP+ samples are provided in XXX (discuss RBP1 and G0S2 - rbp1/gnmt1/g0s2 genomic coordinates (chapter 4)).

Table 4.4: The top most differentially hypermethylated and down-regulated genes in proneural G-CIMP-positive tumors. Genes are sorted by decreasing Gene Expression log2 ratios. Beta value difference indicates differences in mean of beta values (DNA methylation values) between proneural G-CIMP-positive and proneural G-CIMP-negative. Fold Change is the log2 ratio of the means of proneural G-CIMP-positive and proneural G-CIMP-negative normalized expression intensities (see equation 4.1 on page 69).

             DNA Methylation            Gene Expression
Gene Name    P-Value     β value diff.  P-Value     Fold Change
G0S2         2.37E-07    0.76           2.12E-13    -3.92
RBP1         2.37E-07    0.84           1.07E-14    -3.9
FABP5        2.30E-05    0.3            1.39E-12    -3.53
CA3          4.50E-06    0.43           2.06E-06    -2.82
RARRES2      4.74E-07    0.63           6.25E-10    -2.69
OCIAD2       2.06E-04    0.32           1.23E-08    -2.64
CBR1         3.30E-05    0.37           3.77E-09    -2.45
PDPN         3.87E-03    0.23           1.47E-07    -2.43
LGALS3       2.37E-07    0.72           7.97E-09    -2.42
CTHRC1       3.50E-04    0.45           1.44E-07    -2.34
CCNA1        4.66E-03    0.29           2.95E-05    -2.14
ARMC3        1.76E-03    0.31           7.76E-05    -2.13
CHST6        5.74E-04    0.22           8.70E-06    -2.11
C11orf63     2.37E-07    0.64           7.95E-11    -2.05
GJB2         2.16E-03    0.24           2.94E-07    -2.04
KIAA0746     1.66E-06    0.58           2.97E-07    -1.94
MOSC2        2.37E-07    0.66           6.85E-12    -1.91
CHI3L1       7.11E-06    0.13           4.11E-06    -1.9

Table 4.5: GSEA Pathway Analysis: DNA Methylation Target Enrichment in G-CIMP vs. Proneural Non G-CIMP

To extend these findings, the differentially silenced genes were subjected to a NextBio (www.nextbio.com) meta-analysis to identify data sets that were significantly associated with our list of 263 hypermethylated and down-regulated genes. There was an overlap with down-regulated genes in low- and intermediate-grade glioma compared to GBM in a variety of previously published datasets [Ducray et al., 2008, Liang et al., 2005, Sun et al., 2006] (Figure 4.10). The overlap of the 263-gene set with each of these additional datasets was unlikely to be due to chance (all analyses p < 0.00001).

To further characterize this gene set, we tested the survival association of these gene expression values in a collection of Affymetrix profiling data from published and publicly available sources on which clinical annotation was available.
This dataset included cohorts from the Rembrandt set [Madhavan et al., 2009] as well as other sources, and did not include TCGA data (since TCGA data were used to derive the gene list). In this combined dataset, the expression of the 263-gene set was significantly associated with patient outcome (Figure 4.11). Together, these findings suggest that G-CIMP+ GBM tumors have epigenetically-related gene expression differences which are more consistent with low-grade gliomas as well as high-grade tumors with favorable prognosis.

We also classified 29 of the 30 significantly hypomethylated genes in G-CIMP-positive tumors, and found enrichment of genes involved in cytokine activity and receptor binding (Fold enrichment = 3.4-11.7, p = 0.033, Fisher's Exact Test). Next, we used the Molecular Signature Database (www.broadinstitute.org/gsea/msigdb/) to find enrichment of known transcription factor motifs within G-CIMP genes, and identified enrichment (p < 0.05, Fisher's Exact Test, Fold enrichment 1.3-1.96) for TCF3, MYOGENIN, MYOD, AP1/JUN and TCF8 transcription factor binding sites. Closer inspection of MYOD and TCF8 binding sites revealed a common motif, 5'-NNCACCTGNY-3' (Table 4.6).

Table 4.6: Non-Conserved Transcription Factor binding site prediction

4.3.6 Validation of G-CIMP in GBM and Incidence in Low-grade Gliomas

To validate the existence of G-CIMP loci and better characterize the frequency of G-CIMP in gliomas, MethyLight was used to assay the DNA methylation levels in eight G-CIMP gene regions in the tumor samples: seven hypermethylated loci (ANKRD43, HFE, MAL, LGALS3, FAS-1, FAS-2, and RHO-F) and one hypomethylated locus, DOCK5. These eight markers were evaluated in paraffin-embedded tissues from 20 TCGA samples of known G-CIMP status (10 G-CIMP-positive and 10 G-CIMP-negative).
We observed perfect concordance between G-CIMP calls on the array platforms and those made with the MethyLight markers, providing validation of the technical performance of the platforms and of the diagnostic marker panel. These 20 samples were excluded from the validation set described below. A sample was considered G-CIMP positive if at least six genes displayed a combination of DOCK5 DNA hypomethylation and/or hypermethylation of the remaining genes in the panel. Using these criteria, we tested an independent set of non-TCGA GBM samples for G-CIMP status. Sixteen of 208 tumors (7.6%) were found to be G-CIMP-positive (Figure 4.12A), very similar to the findings in TCGA data. To further expand these observations, we determined the IDH1 mutation status for an independent set of 100 gliomas (WHO grades II, III and IV). Among 48 IDH1-mutant tumors, 35 (72.9%) were G-CIMP-positive. However, only 3/52 cases (5.8%) without an IDH1 mutation were G-CIMP-positive (odds ratio = 42; 95% confidence interval (CI), 11-244; Table 4.7D), validating the tight association of G-CIMP and IDH1 mutation.

Based on the association of G-CIMP status with features of the progressive, rather than the de novo GBM pathway, we hypothesized that G-CIMP status was more common in the low- and intermediate-grade gliomas. We extended this analysis by evaluating 60 grade II and 92 grade III gliomas for G-CIMP DNA methylation using the eight-gene MethyLight panel. Compared to GBM, grade II tumors showed an approximately 10-fold increase in G-CIMP-positive tumors, while grade III tumors had an intermediate proportion of tumors that were G-CIMP-positive (Figure 4.12A, Table 4.7E). When low- and intermediate-grade gliomas were separated by histologic type, G-CIMP positivity appeared to be approximately twice as common in oligodendrogliomas (52/56, 93%) as compared to astrocytomas (43/95, 45%).
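The eight-marker calling rule stated above (at least six markers showing the G-CIMP pattern) can be written as a short function. This is a sketch under stated assumptions: the input is taken to be a per-marker boolean methylation call, because the MethyLight cutoff used to decide whether a locus counts as methylated is not given in this section, and the function and constant names are hypothetical.

```python
HYPERMETHYLATED_MARKERS = ("ANKRD43", "HFE", "MAL", "LGALS3", "FAS-1", "FAS-2", "RHO-F")
HYPOMETHYLATED_MARKER = "DOCK5"


def is_gcimp_positive(methylated, min_markers=6):
    """Call G-CIMP status from per-marker methylation calls.

    methylated maps each marker name to True (methylated) or False.
    A marker counts towards G-CIMP if it is one of the seven
    hypermethylated loci and is methylated, or if it is DOCK5 and is
    unmethylated.
    """
    hits = sum(1 for marker in HYPERMETHYLATED_MARKERS if methylated[marker])
    if not methylated[HYPOMETHYLATED_MARKER]:
        hits += 1
    return hits >= min_markers


# Hypothetical sample: five hypermethylated loci plus DOCK5 hypomethylation.
sample = {m: True for m in HYPERMETHYLATED_MARKERS[:5]}
sample.update({m: False for m in HYPERMETHYLATED_MARKERS[5:]})
sample[HYPOMETHYLATED_MARKER] = False
call = is_gcimp_positive(sample)
```

Keeping the threshold as a parameter makes it easy to examine how sensitive the reported G-CIMP frequencies are to the six-of-eight cutoff.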
G-CIMP positive status correlated with improved patient survival within each WHO-recognized grade of diffuse glioma, indicating that the G-CIMP status was prognostic for glioma patient survival (p < 0.032, Figure 4.12B). G-CIMP status was an independent predictor (p < 0.01) of survival after adjustment for patient age and tumor grade (Table 4.7F). Together, these findings show that G-CIMP is a prevalent molecular signature in low grade gliomas and confers improved survival in these tumors.

Table 4.7: Summary of G-CIMP and IDH1 mutation status in the 360 glioma validation panel. D) Number of G-CIMP-positive and G-CIMP-negative tumors in the validation set, stratified by tumor grade. E) IDH1 mutation status of 100 tumors of the validation panel, stratified by G-CIMP status. F) G-CIMP status as an independent predictor of survival after correction for patient age and tumor grade.

4.3.7 Stability of G-CIMP at Recurrence

Since epigenetic events can be dynamic processes, we examined whether G-CIMP status was a stable event in glioma or whether it was subject to change over the course of the disease. To test this, we obtained a set of samples from 15 patients who received a second surgical procedure following tumor recurrence, with time intervals of up to eight years between initial and second surgical procedures. We used the eight-gene MethyLight panel to determine their G-CIMP status and found that eight samples were G-CIMP-positive, while seven were G-CIMP-negative. Interestingly, among the G-CIMP-positive cases, 8/8 (100%) recurrent samples retained their G-CIMP-positive status. Similarly, among seven G-CIMP-negative cases, all seven remained G-CIMP-negative at recurrence, indicating stability of the G-CIMP phenotype over time (Figure 4.13).

4.4 Summary and Discussion

In this report, we identified and characterized a distinct molecular subgroup in human gliomas.
Analysis of epigenetic changes from TCGA samples identified the existence of a proportion of GBM tumors with highly concordant DNA methylation of a subset of loci, indicative of a CpG island methylator phenotype (G-CIMP). G-CIMP-positive samples were associated with secondary or recurrent (treated) tumors and tightly associated with IDH1 mutation. G-CIMP tumors also showed a relative lack of copy number variation commonly observed in GBM, including EGFR amplification, chromosome 7 gain and chromosome 10 loss. Interestingly, G-CIMP tumors displayed copy-number alterations which were also shown in gliomas with IDH1 mutations in a recent report [Sanson et al., 2009]. Integration of the DNA methylation data with gene expression data showed that G-CIMP-positive tumors represent a subset of proneural tumors. G-CIMP-positive tumors showed a favorable prognosis within GBMs as a whole and also within the proneural subset, consistent with prior reports for IDH1 mutant tumors [Parsons et al., 2008, Yan et al., 2009]. Interestingly, of the five discordant cases of G-CIMP-positive, IDH1-wildtype tumors, two patients survived more than five years after diagnosis, suggesting that G-CIMP-positive status may confer favorable outcome independent of IDH1 mutation status. However, studies with many more discordant cases will be needed to carefully dissect the effects of G-CIMP status versus IDH1 mutation on survival. To a large extent, the improved prognosis conferred by proneural tumors [Phillips et al., 2006] can be accounted for by the G-CIMP-positive subset. These findings indicate that G-CIMP could be used to further refine the expression-defined groups into an additional subtype with clinical implications.

G-CIMP is highly associated with IDH1 mutation across all glioma tumor grades, and the prevalence of both decreases with increasing tumor grade. Tumor grade is defined by morphology only, and therefore can be heterogeneous with respect to molecular subtypes.
Within grade IV/glioblastoma tumors are a subset of patients who tend to be younger and have a relatively favorable prognosis. It is only through molecular characterization using markers such as IDH1 and G-CIMP status that one could prospectively identify such patients. Conversely, these markers could also be used to identify patients with low- and intermediate-grade gliomas who may exhibit unfavorable outcome relative to tumor grade.

In the non-TCGA independent validation set examined in this study, an IDH1 mutation was detected in 40/43 (93%) low- and intermediate-grade gliomas, but only 7/57 (13%) of primary GBMs. Similarly, we detected nearly 10-fold more G-CIMP-positive gliomas in grade II tumors as compared to grade IV GBMs. The improved survival of G-CIMP gliomas at all tumor grades suggests that there are molecular features within G-CIMP gliomas that encourage a less aggressive tumor phenotype. Consistent with this, we identified G-CIMP-specific DNA methylation changes within a broad panel of genes whose expression was significantly associated with patient outcome. We observed that this large subset of differentially silenced genes were involved in specific functional categories, including markers of mesenchyme, tumor invasion and extracellular matrix. This concept builds upon our prior finding of a mesenchymal subgroup of glioma which shows poor prognosis [Phillips et al., 2006]. According to this model, a lack of methylation of these genes in G-CIMP-negative tumors would result in a relative increase in expression of these genes, which in turn would promote tumor progression and/or lack of response to currently available treatment modalities. A comparison of the G-CIMP gene list with prior gene expression analyses (meta-analyses) suggests that G-CIMP-positive tumors may be less aggressive due to silencing of key mesenchymal genes.
We found that a minority of genes with significant promoter hypermethylation showed a concomitant significant decrease in associated gene expression (293/1520, 19%). This is consistent with previous reports, in which we found similarly low frequencies of inversely correlated promoter hypermethylation and gene expression [Houshdaran et al., 2010, Pike et al., 2008]. The lack of an inverse relationship between promoter hypermethylation and gene expression for most genes may be attributed to several scenarios, including the lack of appropriate transcription factors for some unmethylated genes and the use of alternative promoters for some genes with methylated promoters. Epigenetics controls expression potential, rather than expression state. Dissecting the gene expression and DNA methylation alterations of G-CIMP tumors among lower grade gliomas will be helpful to better understand the roles of a mutant IDH1 and G-CIMP DNA methylation on tumor grade and patient survival.

The highly concerted nature of G-CIMP methylation suggests that this phenomenon may be caused by a defect in a trans-acting factor normally involved in the protection of a defined subset of CpG island promoters from encroaching DNA methylation. Loss of function of this factor would result in widespread concerted DNA methylation changes. We propose that transcriptional silencing of some CIMP genes may provide a favorable context for the acquisition of specific genetic lesions. Indeed, we have recently found that IGFBP7 is silenced by promoter hypermethylation in BRAF-mutant CIMP+ colorectal tumors [Hinoue et al., 2009]. Oncogene-induced senescence by mutant BRAF is known to be mediated by IGFBP7 [Wajapeyee et al., 2008]. Hence, CIMP-mediated inactivation of IGFBP7 provides a suitable environment for the acquisition of BRAF mutation.
The tight concordance of G-CIMP status with IDH1 mutation in GBM tumors is very reminiscent of colorectal CIMP, in which DNA hypermethylation is strongly associated with BRAF mutation [Weisenberger et al., 2006]. We hypothesize that the transcriptional silencing of as yet unknown G-CIMP targets may provide an advantageous environment for the acquisition of IDH1 mutation.

In our integrative analysis of G-CIMP tumors, we observed up-regulation of genes functionally related to cellular metabolic processes and positive regulation of macromolecules. This expression profile may reflect a metabolic adjustment to the proliferative state of the tumor, in conjunction with the gain-of-function IDH1 mutation [Dang et al., 2009]. Such a metabolic adjustment may be consistent with Warburg's observation that proliferating normal and tumor cells require both biomass and energy production, and convert glucose primarily to lactate, regardless of oxygen levels, while non-proliferating differentiated cells emphasize efficient energy production (reviewed in [Vander Heiden et al., 2009]).

In summary, our data indicate that G-CIMP status stratifies gliomas into two distinct subgroups with different molecular and clinical phenotypes. These molecular classifications have implications for differential therapeutic strategies for glioma patients. Further observation and characterization of molecular subsets of glioma will likely provide additional information enabling insights into the development and progression of glioma, and may lead to targeted drug treatment for patients with these tumors.

Figure 4.1: Characterization of G-CIMP tumors as a unique subtype of GBMs within the proneural gene expression subgroup. A) Integration of the samples within each DNA methylation and gene expression cluster. Samples are primarily categorized by their gene expression subtype: P, proneural; N, neural; C, classical; M, mesenchymal.
The number and percent of tumors within each DNA methylation cluster (red, cluster 1 (G-CIMP); blue, cluster 2; green, cluster 3) are indicated for each gene expression subtype. B) Scatter plot of pairwise comparison of the gene expression and DNA methylation clusters as identified in Figure 3.6 and Figure 3.7. Identical two-letter codes represent self-comparisons, while mixed two-letter codes represent the pairwise correlation between gene expression and DNA methylation. Axes are reversed to illustrate increasing similarity. C) GBM patient age distribution at time of diagnosis within each gene expression cluster. Samples are divided by gene expression clusters as identified along the top of each jitter plot, and further subdivided by G-CIMP status within each expression subgroup. G-CIMP-positive samples are indicated as red data points and G-CIMP-negative samples are indicated as black data points. Median age at diagnosis is indicated for each subgroup by a horizontal solid black line. D-F) Kaplan-Meier survival curves for GBM methylation and gene expression subtypes. In each plot, the percent probability of survival is plotted versus time since diagnosis in weeks. All samples with survival data greater than five years were censored. D) Kaplan-Meier survival curves among the four GBM expression subtypes. Proneural tumors are in blue, Neural tumors are in green, Classical tumors are in red, and Mesenchymal tumors are in gold. E) Kaplan-Meier survival curves between the three DNA methylation clusters. Cluster 1 tumors are in red, cluster 2 tumors are in blue and cluster 3 tumors are in green. F) Kaplan-Meier survival curves between proneural G-CIMP-positive, proneural G-CIMP-negative and all non-proneural GBM tumors. Proneural G-CIMP-positive tumors are in red, proneural G-CIMP-negative tumors are in blue and all non-proneural GBM tumors are in black. See also Figure 3.7.
Figure 4.2: G-CIMP association with Gene Expression Clusters and Kaplan-Meier survival curves of G-CIMP-positive tumors. A) G-CIMP sample distribution within gene expression clusters identifies an association with proneural GBM tumors. The stacked bar plot shows the sample distribution data for each gene expression cluster (Proneural, Neural, Classical and Mesenchymal). The samples within each bar are divided into two groups based on classification as G-CIMP-positive (in red) or G-CIMP-negative (in black). The exact number of samples, stratified by G-CIMP status, is listed below each bar. B) Kaplan-Meier survival plots of GBM tumors. The y-axis indicates percent probability of survival and the x-axis indicates the time since diagnosis (weeks). The log-rank p-value is indicated on each plot, and all samples with survival data greater than five years are censored. The vertical dashed black line indicates five-year survival time. The survival curves for samples are stratified by their G-CIMP status.

Figure 4.3: Somatic Mutation, Germline Mutation, LOH analysis of Proneural G-CIMP-positive tumors and summary of G-CIMP and IDH1 mutation status in the 360 glioma validation panel. A) All significantly associated somatic mutations identified within the G-CIMP proneural subtype (p < 0.05, Fisher's Exact Test) are primarily listed in order of decreasing percentage in proneural G-CIMP-positive tumors and secondarily sorted by increasing percentage in non-proneural tumors. The filled stacked bar plot shows the sample distribution data for samples within the proneural expression subgroup compared to non-proneural tumors. Samples within the proneural subgroup are further divided by their G-CIMP status. B-C) The samples within each bar are divided into two groups based on identification of known somatic mutation (in black) and wild-type (in gray). The percent of total samples is plotted along the y-axis.
D) Number of G-CIMP-positive and G-CIMP-negative tumors in the validation set, stratified by tumor grade. E) IDH1 mutation status of 100 tumors of the validation panel, stratified by G-CIMP status. F) G-CIMP status as an independent predictor of survival after correction for patient age and tumor grade.

Figure 4.4: Significant regions of copy number variation in a subset of G-CIMP genome. Copy number variation for 23,748 loci (across 22 autosomes and plotted in genomic coordinates along the x-axis) was analyzed using 61 proneural TCGA GBM tumors. Homozygous deletion is indicated in dark blue, hemizygous deletion in light blue, neutral/no change in white, gain in light red and high-level amplification in dark red. A) Copy number variation between proneural G-CIMP-positive and G-CIMP-negative tumors. The Cochran-Armitage test for trend, percent total amplification/deletion and raw copy number values are listed. The log10(FDR-adjusted P value) between G-CIMP-positive and G-CIMP-negative proneurals is plotted along the y-axis in the "adjusted P-value pos vs. neg" panel. In this panel, red vertical lines indicate significance. Gene regions in 8q23.1-q24.3 and 10p15.2-11.21 are identified by asterisks and are highlighted in panels B and C, respectively.

Figure 4.5: Significant regions of copy number variation in all G-CIMP genome. Copy number variation for 23,748 loci was analyzed using 214 TCGA GBM tumors. Homozygous deletion is indicated in dark blue, hemizygous deletion in light blue, neutral/no change in white, gain in light red and high-level amplification in dark red. Copy number variation in 22 autosomes of G-CIMP-positive and G-CIMP-negative tumors. Each analyzed locus is plotted in genomic coordinates (chromosome numbers indicated above each grey box and centromere as white, grey vertical lines represent gene deserts) at the top of the figure. The Cochran-Armitage test for trend, percent total amplification/deletion and raw copy number values are listed.
The log10(p-value) between G-CIMP-positive and G-CIMP-negative tumors is plotted along the y-axis in the "P-value pos vs. neg" panel. In this panel, red vertical lines indicate loci with significant p-values (p < 0.05).

Figure 4.6: Volcano plots of DNA methylation and Gene Expression Analysis. A) Volcano plot of all CpG loci analyzed for G-CIMP association. The beta value difference in DNA methylation between the proneural G-CIMP-positive and proneural G-CIMP-negative tumors is plotted on the x-axis, and the p-value for an FDR-corrected Wilcoxon signed-rank test of differences between the proneural G-CIMP-positive and proneural G-CIMP-negative tumors (on a -log10 scale) is plotted on the y-axis. Probes that are significantly different between the two subtypes are colored in red. B) Volcano plot for all genes analyzed on the Agilent gene expression platform.

Figure 4.7: Volcano plot for 534 microRNA expression levels. The -log10(FDR-corrected Wilcoxon p-value) is plotted versus the log2(mean fold change (G-CIMP-positive vs. G-CIMP-negative)). The 13 up-regulated and seven down-regulated miRNA transcripts are highlighted in red, as shown in the upper two quadrants. The horizontal dashed black line indicates an adjusted p-value (adjusted.pvalue) of 0.05. MIR221 is boxed to indicate that it is both differentially methylated and differentially expressed in G-CIMP-positive tumors.

Figure 4.8: Starburst Plot for comparison of transcriptome versus epigenetic differences between proneural G-CIMP-positive and G-CIMP-negative tumors. Starburst plot for comparison of TCGA Infinium DNA methylation and Agilent gene expression data normalized by copy number information for 11,984 unique genes. Log10(FDR-adjusted P value) is plotted for DNA methylation (x-axis) and gene expression (y-axis) for each gene. If a mean DNA methylation β-value or mean gene expression value is higher (difference greater than zero) in G-CIMP-positive tumors, log10(FDR-adjusted P value) is multiplied by -1, providing positive values.
The dashed black lines indicate an FDR-adjusted P value of 0.05. Data points in red indicate genes that are significantly up- or down-regulated in their gene expression levels and significantly hypo- or hypermethylated in proneural G-CIMP-positive tumors. Data points in green indicate genes that are significantly down-regulated in their gene expression levels and hypermethylated in proneural G-CIMP-positive tumors compared to proneural G-CIMP-negative tumors.

Figure 4.9: Gene Expression Summary Statistics and Overlap between Agilent and Affymetrix. A) Pie chart showing the breakdown of the number of genes at different FDR (adjusted.pvalue) cutoffs as analyzed using the Agilent Gene Expression Platform (not normalized by copy number). The inset shows the number of differentially up- and down-regulated genes. B) Pie chart showing the breakdown of the number of genes at different FDR (adjusted.pvalue) cutoffs as analyzed using the Affymetrix Gene Expression Platform (not normalized by copy number). The inset shows the number of differentially up- and down-regulated genes. C) Pie chart of the genes that overlap between Affymetrix and Agilent and are differentially expressed (adjusted.pvalue < 0.05). D) Stacked bar chart showing the number of differentially expressed and methylated genes using either Affymetrix or Agilent. The total percentage is indicated on the y-axis. Genes that are significantly down-regulated are in green and genes that are significantly up-regulated in expression are in red.

Figure 4.10: NextBio illustration of correlation between two different biosets. Bioset #1 is the list of 292 genes that were observed to be the most differentially hypermethylated (adjusted.pvalue < 0.05) and differentially up- and down-regulated (adjusted.pvalue < 0.05) based on analysis of the TCGA data. Bioset #2 contains data from [Liang et al., 2005] (panels A and B) and [Ducray et al., 2008] (panel C). The associated Venn diagram and summary statistics illustrate the significantly overlapping genes between the two biosets.
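The sign convention described in the caption of Figure 4.8 can be sketched in a few lines. This is only an illustrative sketch, not the code used to produce the figure: the function names, the argument names and the 0.05 cutoff default are assumptions introduced here, and the inputs are assumed to be FDR-adjusted p-values together with the mean difference (G-CIMP-positive minus G-CIMP-negative) for each gene.

```python
import math


def signed_log10_fdr(fdr, mean_difference):
    """Return -log10(fdr) when the value is higher in G-CIMP-positive
    tumors (mean_difference > 0) and +log10(fdr) otherwise, so that the
    sign of the score encodes the direction of the change."""
    if not 0.0 < fdr <= 1.0:
        raise ValueError("fdr must be in the interval (0, 1]")
    magnitude = -math.log10(fdr)  # non-negative for any valid p-value
    return magnitude if mean_difference > 0 else -magnitude


def starburst_point(meth_fdr, meth_diff, expr_fdr, expr_diff):
    """(x, y) coordinates of one gene: DNA methylation on x, expression on y."""
    return (signed_log10_fdr(meth_fdr, meth_diff),
            signed_log10_fdr(expr_fdr, expr_diff))


def is_silencing_candidate(x, y, cutoff=0.05):
    """True for genes that are significantly hypermethylated and
    significantly down-regulated (the green points in the caption)."""
    threshold = -math.log10(cutoff)
    return x > threshold and y < -threshold


if __name__ == "__main__":
    # Hypothetical gene: hypermethylated (FDR 1e-4) and down-regulated (FDR 1e-3).
    x, y = starburst_point(1e-4, 0.3, 1e-3, -1.2)
    print(x, y, is_silencing_candidate(x, y))
```

With this convention, the dashed lines at an FDR-adjusted P value of 0.05 fall at plus and minus -log10(0.05) on each axis.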
Figure 4.11: Meta-analyses. Meta-analysis of a panel of 67 G-CIMP loci with gene expression. A set of 67 G-CIMP DNA methylation targets was found to be correlated with gene expression. These were examined in an independent data set from non-TCGA samples. A metagene score was determined based on a composite of the 67-gene panel, and samples were divided into two groups based on their median metagene score. The patient survival proportion is plotted versus the survival time (in weeks) for each group. The dashed line indicates a low metagene score, while the solid line indicates a high metagene score.

Figure 4.12: G-CIMP prevalence in grade II, III and IV gliomas using MethyLight. A) Methylation profiling of gliomas shows an association of CIMP status with tumor grade. Eight markers were tested for G-CIMP DNA methylation in 360 tumor samples. Each marker was coded as red if methylated and green if unmethylated. One of these markers (DOCK5) is unmethylated in CIMP, while the remaining seven markers show G-CIMP-specific hypermethylation. G-CIMP-positive status was assigned if 6 of the 8 genes had G-CIMP-defining hyper- or hypomethylation. G-CIMP-positive status is indicated with a black line (right side of panel), and a grey line indicates non-G-CIMP. Samples with an identified IDH1 mutation are indicated with a black line and samples with no known IDH1 mutation with a grey line. A white line indicates unknown IDH1 status. B) Association of G-CIMP status with patient outcome, stratified by tumor grade. G-CIMP-positive cases are indicated by dashed lines and G-CIMP-negative cases by solid lines in each Kaplan-Meier survival curve.

Figure 4.13: Recurrence of G-CIMP over time. Stability of G-CIMP over time in glioma patients. Fifteen samples from newly diagnosed tumors were tested for G-CIMP positivity using the eight-marker MethyLight panel.
Eight tumors were classified as G-CIMP-positive (upper left panel), and seven tumors were classified as G-CIMP-negative (non-G-CIMP, lower left panel). Samples from a second procedure, ranging from 2-9 years after the initial resection, were also evaluated for the G-CIMP-positive cases (upper right panel), as well as for the non-G-CIMP cases (lower right panel). Each marker was coded as red if methylated and green if unmethylated.

Chapter 5
G-CIMP Genomic Signatures: in silico Analysis

5.1 Introduction

One of the intriguing questions surrounding aberrant DNA methylation associated with cancer relates to how cancer cells establish their characteristic epigenomic patterns. Clearly, there exists some type of DNA sequence specificity associated with the activities of the different methyltransferases (see page 9 for a description of DNA methyltransferases (DNMTs)). Only certain promoter-associated CpG islands are aberrantly hypermethylated, as is the case in CIMP in glioma (G-CIMP). As examined in Chapter 4, the highly concerted nature of the G-CIMP methylation phenotype could be explained by three hypotheses:

1. A subset of CpG islands in cancer cells has a different susceptibility to de novo DNA methylation;
2. Specific somatic mutations, such as IDH1 (see Table 4.1 on page 74), are necessary but not sufficient for the DNA methylation of some CpG islands;
3. Other factors that are normally involved in the protection of a defined subset of CpG island promoters from encroaching DNA methylation are either deregulated or missing during pathogenesis.

The last hypothesis is the most tantalizing of the three because we can potentially test it using current in silico tools. The transcriptional silencing of yet unknown trans-acting factors that target methylated G-CIMP CpG islands may provide an advantageous environment for the acquisition of IDH1 mutation.
Thus, identifying unknown trans-acting factors may identify certain epigenetic pathways inherent in the aberrant DNA methylation associated with tumor progression and development.

In this chapter, preliminary results are presented that expand the molecular findings of G-CIMP. These findings may provide clues to the biological mechanism by which glioma cells become hypermethylated at specific CpG islands and could potentially provide novel targets for therapeutics development.

5.2 Material and Methods

Using selected FASTA sequences of interest, de novo and known motif discovery was performed using Hypergeometric Optimization of Motif EnRichment (HOMER, script v2.6 (10-22-10)), an algorithm previously described [Heinz et al., 2010]. Briefly, HOMER uses two sets of sequences supplied by the user and determines the enrichment of known or de novo motifs. Each successive iteration of HOMER screens all possible x-mer candidate sequences for their significance using a simple mismatch model. The significance of association is assessed by comparing motif occurrence in the target sequences with that in the background using the hypergeometric distribution. HOMER uses publicly available databases from both TRANSFAC and JASPAR to identify known motif enrichment. Both databases contain experimentally derived transcription factor motifs, including putative motifs derived from in silico analysis.

G-CIMP-specific, G-CIMP-non-specific, GBM-specific and background genomic sequences were extracted using Galaxy in order to determine putative de novo DNA sequences [Goecks et al., 2010]. Only known CpG islands overlapping the respective CpG sites were used. These sets of sequences were divided into "target" and "background" sets for each iteration of the algorithm (HOMER perl script "findMotifs.pl").
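As a concrete illustration of the hypergeometric enrichment test mentioned above, the following minimal sketch computes the probability of observing at least a given number of motif-containing sequences in the "target" set by chance. It is not HOMER's implementation: the function name, the argument names and the toy counts are assumptions made for illustration, and the population is taken to be the target and background sequences combined.

```python
from math import comb


def hypergeom_enrichment_p(total, total_with_motif, targets, targets_with_motif):
    """Upper-tail hypergeometric probability P(X >= targets_with_motif).

    total              -- number of sequences in target + background
    total_with_motif   -- how many of those contain the motif
    targets            -- number of target sequences
    targets_with_motif -- how many target sequences contain the motif
    """
    upper = min(targets, total_with_motif)
    denominator = comb(total, targets)
    tail = sum(
        comb(total_with_motif, k) * comb(total - total_with_motif, targets - k)
        for k in range(targets_with_motif, upper + 1)
    )
    return tail / denominator


if __name__ == "__main__":
    # Toy numbers: 10 sequences, 4 carry the motif, 3 are targets, 2 targets carry it.
    print(hypergeom_enrichment_p(10, 4, 3, 2))
```

A small p-value means the motif occurs in the target set more often than expected from its overall frequency. A production tool would typically work in log space for numerical stability.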
Motifs of lengths 6, 7, 8, 10, 11, and 12 bp were identified separately for enrichment in a "target" set (compared to a "background" set) using the cumulative hypergeometric distribution, which provides an enrichment score. To increase the sensitivity of the method, up to two mismatches were allowed in each sequence, and the "target" and "background" sequences were selectively weighted to equalize the distributions of CpG content in both sets. Raw outputs from HOMER are exported as an html page and are stored as pdf files. The HOMER perl script "annotatePeaks.pl" [Heinz et al., 2010], R software (R version 2.12.1, 2010-12-16, [R Development Core, 2009]) and the "ggplot2" package were used to generate genomic distribution plots of each identified motif.

The selection of CpG probes was accomplished by comparing the mean values across distinct DNA methylation clusters of GBMs compared to normal human brains (Figure 3.6 on page 62). In addition, a Wilcoxon rank-sum test was performed to stratify each class of GBMs further by their identifying CpG probes.

5.3 Results and Discussion

5.3.1 Distinct Genomic Features Correlate with Hypermethylated CpG Sites

Based on both supervised and unsupervised analysis, a feature selection step classified CpG probes by their association with G-CIMP status or GBM status. The selected classes of CpGs were plotted as heatmaps, and each class of probes was summarized by comparing the overlapping CpG sites with known genomic features such as embryonic stem cell histone marks (Figure 5.1).

In embryonic stem cells, polycomb repressive complexes are used to keep transcriptional control pathways poised and ready for either differentiation or development.
These complexes can be associated with either active or repressive histone marks and, in some regions of the genome, they contain both the active and repressive marks, also known as bivalent domains [Mikkelsen et al., 2008] (see page 5 for a review of different histone modifications). Using chromosomal location, all 27,578 CpG probes on the Infinium Human Methylation27K were mapped and marked as overlapping one or more of these histone marks. If a CpG did not overlap a histone mark, it was marked "none." Using this information, cancer-specific hypermethylated CpG probes in GBM were found to be highly enriched for bivalent domains (active (H3K4me3) and repressive (H3K27me3) marks) (77% compared to 13%) and were found to be depleted of H3K4me3 as compared to background or non-target probes (5% compared to 19%, see Figure 5.1).

Figure 5.1: CpG probes selected for genomic association analysis. Three classes of cancer-specific methylation are identified and represented as a heatmap. Rows represent individual CpG sites and columns represent samples. The samples are ordered exactly as represented in Figure 3.6 on page 62. Each class of CpGs is summarized for known embryonic stem cell histone marks: H3K27me3, H3K4me3, bivalent (H3K4me3+H3K27me3) or none.

The observed enrichment of the embryonic stem cell histone marks among all classes of GBM was also noted as a general characteristic of cancer-specific DNA hypermethylation compared to non-targets [Widschwendter et al., 2007]. The association of embryonic stem cell polycomb targets with aberrant DNA methylation suggests that the first steps of tumorigenesis may be epigenetic rather than somatic mutations [Widschwendter et al., 2007].

In addition, we mapped all of the available genomic features that are publicly available through the UCSC genome browser.
These features include known transcription factor motifs provided by the ENCODE project [ENCODE Project Consortium, 2007], known and predicted Transcription Start Sites (TSS), repeat elements (LINE, SINE, and microsatellites) and other well-known polycomb group proteins (SUZ12, RING1b). The cancer-specific DNA methylation CpG classes were found to be significantly enriched for known CpG islands. Interestingly, these CpG classes were significantly depleted of nearly all ENCODE transcription factors compared to background or non-target probes (Figure 5.2). These observations suggest that CpG probes that are aberrantly DNA methylated in glioma and are devoid of transcription factors are generally less protected targets for aberrant DNA hypermethylation.

Figure 5.2: Summary results of all available genomic features. Each row represents summary statistics for one of the four classes (G-CIMP+, GBM, G-CIMP- and Background). Boxes 1-3 depict repeat element overlap; boxes 4-8 represent TSS, Takai-Jones CpG island, Gardiner-Garden CGI and Irizarry CGI; boxes 9-12 represent histone and polycomb marks as well as lamin attachment domains; boxes 13-66 depict ENCODE transcription factor overlap.

5.3.2 Distinct DNA Sequence Tightly Correlates with Hypermethylated CpG Sites

De novo motif analysis using HOMER revealed distinct motifs for each cancer-specific DNA methylation CpG class (Figure 5.3). Plotting the frequency of occurrence for each motif (+/- 10,000 bp from the nearest TSS overlapping CpG islands) shows each identified motif to be significantly enriched at or near the TSS of these genes (Figure 5.4) compared to the background probes. Interestingly, cross-comparing the G-CIMP+ associated DNA motif (TCCCNNGGGA) with the putative binding sites of all known transcription factors identified one well-characterized transcription factor, known as the Early B-cell-like factor 1 (EBF1).
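To make the motif concrete, the short sketch below checks that TCCCNNGGGA is palindromic (identical to its own reverse complement) and scans a sequence for exact occurrences, treating N as a wildcard. It is a simplified illustration only: the function names and the example sequence are hypothetical, and unlike the HOMER analysis described in the Methods it allows no mismatches.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T, N."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_palindromic(motif):
    """True if the motif reads the same on both DNA strands."""
    return motif.upper() == reverse_complement(motif)


def find_motif(sequence, motif="TCCCNNGGGA"):
    """Return 0-based start positions where the motif matches exactly;
    an N in the motif matches any base in the sequence."""
    sequence = sequence.upper()
    motif = motif.upper()
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        if all(m == "N" or m == b for m, b in zip(motif, window)):
            hits.append(start)
    return hits


if __name__ == "__main__":
    print(is_palindromic("TCCCNNGGGA"))
    print(find_motif("AATCCCAGGGGATT"))  # made-up example sequence
```

Because the motif is palindromic, a match on one strand implies a match at the same position on the opposite strand, so only one strand needs to be scanned.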
Fascinatingly, the genomic profile for this known motif is similar to the de novo motif identified in this study (see Figures 5.3 and 5.4).

Figure 5.3: Identified de novo motifs

EBF1 was previously reported to promote demethylation and chromatin remodeling in B lymphocytes [Maier et al., 2004]. The genome-wide analysis of EBF1 by ChIP-seq in pro-B cells, coupled with expression analysis, revealed specific target genes for activation, and the EBF1 marks correlated with the appearance of histone H3K4me2 modifications [Treiber et al., 2004]. Additional reports have identified a functional role for the EBF family of transcription factors as putative tumor suppressors in a number of different cancers including GBM, reviewed in [Liao, 2009]. Taken together, the de novo motif analysis of a distinct set of DNA sequences associated with a well-characterized CIMP in glioma suggests the presence of a putative protein involved in promoting aberrant DNA hypermethylation and chromatin remodeling.

Figure 5.4: Genomic Plots of de novo motifs

5.4 Future Studies

In this study, we reported the identity of a putative transcription factor (EBF1) that recognizes a specific palindromic DNA sequence (5'-TCCCNNGGGA-3'). EBF1's binding motif was found to be tightly associated with G-CIMP+ targets and, although the in silico results presented in this chapter are preliminary, the findings point to a potential biological mechanism by which G-CIMP+ cells become hypermethylated at specific CpG islands. In-depth in vitro and in vivo studies are needed to validate and elucidate the mechanistic role of EBF1 in the development of DNA hypermethylation associated with the G-CIMP+ phenotype. Below is the description of a potential in vitro experiment that could confirm the biological function of EBF1 in G-CIMP.
Since EBF1 plays an important role in demethylation, one hypothesis is that EBF1 is deregulated or mutated in G-CIMP (resulting in a loss or gain of function) and that the overexpression of a functional copy may restore the epigenomic profile to the normal state. Targeted DNA sequencing of EBF1 in G-CIMP could reveal whether this gene is mutated. If the gene is not mutated, this would suggest that the protein is deregulated by some unknown downstream or upstream mechanism.

In order to decipher the mechanistic role of EBF1 in G-CIMP+, cultured primary cells derived from G-CIMP+ patients are required. To our knowledge, no known cell line shares the G-CIMP+ phenotype, although this does not imply that none exists. To identify potential cell lines, a number of cultured or primary cell lines should be profiled for G-CIMP+ status using the 8-marker panel described in Chapter 4. If a G-CIMP+ cell line exists, then molecular experiments would involve placing the coding sequence of EBF1 downstream of a ubiquitous promoter and introducing the construct into cultured glioma cells. This would result in the overexpression of EBF1.

With such a cellular model, we could then profile the genomic targets of EBF1 by ChIP-seq and correlate them with whole-genome expression analysis (RNA-seq). In addition, treating cells with 5-AZA-CdR, a potent drug that substantially reduces DNA methylation levels across the entire genome, could also assist in identifying putative genomic regions that are selective targets for EBF1. This type of study would aid the understanding of the influence of EBF1 on DNA methylation in glioma cells and would potentially provide novel targets for therapeutics development for Glioblastoma.
Chapter 6
Perspective and Synthesis

6.1 Introduction

The main objective of this thesis is to utilize and fully integrate the information associated with multiple high-throughput datasets across many different platforms in order to characterize the molecular biology of Glioblastoma multiforme (GBM). By understanding the molecular biology of GBM, we can identify epigenetically modulated regions of the genome that may be involved in the pathogenesis of GBM, thereby allowing researchers to develop targets for diagnostic and treatment strategies. In addition, this study, like many other high-throughput studies, can generate interesting hypotheses. These hypotheses can be tested by using and developing in silico tools or by designing in vitro/in vivo experiments, as discussed in Chapter 5 (page 102).

In this chapter, the aim is to provide a general perspective on the future of "-omic" research and to explore the challenges and benefits of embarking on the use of next-generation sequencing to explore the next frontier of cancer biology.

6.2 Perspective and Future Outlook

Technologies to assist in detecting different types of alterations associated with cancer have been developed and applied over the past decade. These studies have primarily used platforms that profile copy number changes at relatively high resolution, or direct sequencing of candidate genes or specific classes of genes, and have identified novel associations with specific cancers: ovarian, melanoma, lung and colon cancers [Nanjundan et al. 2007, Garraway et al. 2005, Bass et al. 2009, Firestein et al. 2008, Levine et al. 2005, Davies et al. 2002, Mosse et al. 2008]. In addition, aberrant DNA methylation patterns underscore the important role of DNA methylation in the progression of cancer transformation [Sharma et al., 2010, Jones and Baylin, 2007]. The last two decades of basic research have revealed many novel mechanisms of cancer epigenetics.
Our understanding of cancer pathogenesis and our knowledge of therapeutic and diagnostic strategies have increased. The research on DNA methylation and the increased awareness among clinicians of the use of DNA methylation as a tool to detect and treat cancer have significantly influenced the field of epigenetics and will likely elicit more interest in epigenomics for years to come. Certainly, a greater understanding of the role of epigenetics as well as genomics in the development of cancer will have a profound, positive impact on both the mortality and morbidity of these diseases in the future.

Many of these types of studies alluded to the important need to fully characterize the different types of somatic genome alterations in different subsets of cancer. Molecular profiling of the cancer genome will require a diverse set of technologies as well as the expertise to handle both the data production and the data analysis. Even though conventional microarrays have broadened our understanding of cancer biology, many current array platforms cover only a reduced representation of the entire genome and thereby capture only a small snapshot of the dynamic nature associated with cancer.

Certainly, the next wave of global profiling studies will include the utilization of next-generation sequencing (NGS) to profile the full spectrum of "-omic" (genomic, transcriptomic and epigenomic) data associated with human cancers. Sequencing-based methods offer the potential to explore the entire genome without the coverage bias that is generally inherent in most array platforms (reviewed in [Laird, 2010]). Of course, just like microarrays, NGS will bring its own set of technical biases and issues and will undoubtedly inspire creative and insightful statisticians and bioinformaticians to develop novel methods to process and handle data.
Eventually, these enriching whole-genome studies will provide an integrative view of aberrant DNA methylation, genomic alterations, somatic mutations and other regional hotspots of interest, such as nucleosome positioning, associated with cancer.

6.3 Challenges associated with NGS

Sanger-based capillary sequencing technologies offered an enormous opportunity for cancer biologists to study the dynamic genome. Unfortunately, this technology was limited to either characterizing a small number of genes across large numbers of samples or studying all coding genes in a reduced number of samples [Greenman et al. 2007, Dalgliesh et al. 2010, Kan et al. 2010, Sjoblom et al. 2006, Wood et al. 2007, S Jones et al. 2008, Parsons et al. 2008]. NGS technology, however, enables the complete sequencing of entire genomes in a relatively short time and at a more reasonable cost than previous iterations. As of today, a single sequencing run using a HiSeq 2000 sequencer can generate more than 200 gigabases of sequence data in 8 days. These types of data can be generated at high resolution and offer accurate measurements to identify somatic copy number alterations and nucleotide substitutions. They can also provide a view of genomic structural variation, such as a global assessment of chromosomal rearrangements. Using cDNA instead of genomic DNA, one can perform transcriptomic profiling by way of RNA-seq, which can lead to novel discoveries of splice variants and aberrant transcripts, as well as the identification of long or short RNAs, which can reveal new insights into the regulation of gene transcription and RNA processing [Maher et al. 2009a,b, Berger et al. 2010, Palanisamy et al. 2010].

For the most part, the current challenges associated with NGS are infrastructure related [Bernstein et al., 2007, Bock and Lengauer, 2008, Laird, 2010].
As the data production exceeds the "Exabyte" space for a particular study, as described in a recent Nature news article [Ledford, 2010], we will require tools that can fully utilize all of the resources available in computer clusters and run many parallel jobs with ease and with little downtime for the analyst. Computer clusters will undoubtedly be required, which will increase the overall cost of performing such studies since researchers will need to purchase the hardware as well as hire personnel to manage and maintain the machines.

6.4 Interpreting and understanding the data

The other major challenge associated with NGS will be interpreting and understanding the data generated. Currently, many available tools (both open source and commercial) allow researchers to work with large-scale epigenomics and genomics data. These tools include dynamic sequencing data visualizers such as the Integrative Genomics Viewer (IGV) [Robinson et al., 2011] and the ever-improving NextBio™ genomic viewer (Cupertino, CA, www.nextbio.com, [Kupershmidt et al., 2010]). The IGV and NextBio™ tools are optimized to work with large datasets such as the TCGA [Cancer Genome Atlas Research Network, 2008], 1000 Genomes (www.1000genomes.org) and ENCODE (www.genome.gov/10005107) projects, as well as to allow for seamless interaction with collaborators over the Internet [Robinson et al., 2011, Kupershmidt et al., 2010].

Other tools to handle the influx of genomic data analysis include web-based tools like Galaxy [Goecks et al., 2010]. This tool provides an intuitive and simple way to perform standard manipulations that would otherwise require an in-depth knowledge of perl or of similar scripting languages like R. It also provides a user-friendly workflow, allowing one to manage and organize the data in a clear and coherent fashion.
Since Galaxy is tightly integrated with both UCSC's genome database and Ensembl's database, many different analyses could be developed and used in one single platform.

Important considerations when analyzing, interpreting and understanding cancer genomic data include, among others, a good understanding of quality control (QC) of data, accurate estimation of the ratio between signal (true biology) and noise (technical artifacts) associated with large data sets, and reproducible approaches to complex genomic analyses. The latter can be underestimated, since achieving sufficient power with a relatively small sample size is an important hurdle to address. Fortunately, many of these issues can be overcome by creating a standard operating procedure that uses standard technical methods to process and produce the data in a consistent and reproducible manner. In addition, normalization steps can be implemented to remove any experimental artifacts that may negatively impact the data quality. Statistical tools exist that can better inform the analyst whether technical noise, such as batch effects, is driving the data: e.g., use SAM or ANOVA to look for genes that are differentially expressed between batches.

Finally, once the data are produced and interpreted by skilled bio-analysts, the next major challenge will be to test the resulting hypotheses in the wet laboratory. Eventually, it will take good, old-fashioned molecular biology techniques and skilled molecular biologists months (or even years) to test and analyze each hypothesis generated from in silico analysis.

Bibliography

[Amundadottir et al., 2006] Amundadottir, L. T., Sulem, P., Gudmundsson, J., Helgason, A., Baker, A., Agnarsson, B. A., Sigurdsson, A., Benediktsdottir, K. R., Cazier, J. B., Sainz, J., Jakobsdottir, M., Kostic, J., Magnusdottir, D. N., Ghosh, S., Agnarsson, K., Birgisdottir, B., Le Roux, L., Olafsdottir, A., Blondal, T., Andresdottir, M., Gretarsdottir, O. S., Bergthorsson, J.
T., Gudbjartsson, D., Gylfason, A., Thorleifsson, G., Manolescu, A., Kristjansson, K., Geirsson, G., Isaksson, H., Douglas, J., Johansson, J. E., Balter, K., Wiklund, F., Montie, J. E., Yu, X., Suarez, B. K., Ober, C., Cooney, K. A., Gronberg, H., Catalona, W. J., Einarsson, G. V., Barkardottir, R. B., Gulcher, J. R., Kong, A., Thorsteinsdottir, U., and Stefansson, K. (2006). A common variant associated with prostate cancer in european and african populations. Nat Genet, 38(6):652-8.
[Balss et al., 2008] Balss, J., Meyer, J., Mueller, W., Korshunov, A., Hartmann, C., and von Deimling, A. (2008). Analysis of the idh1 codon 132 mutation in brain tumors. Acta Neuropathol., 116(6):597-602.
[Barski et al., 2007] Barski, A., Cuddapah, S., Cui, K., Roh, T. Y., Schones, D. E., Wang, Z., Wei, G., Chepelev, I., and Zhao, K. (2007). High-resolution profiling of histone methylations in the human genome. Cell, 129(4):823-37.
[Becker et al., 1988] Becker, R. A., Chambers, J. M., and Wilks, A. R. (1988). The new s language: A programming environment for data analysis and statistics. Wadsworth, 4(4):737-38.
[Bengtsson and Hossjer, 2006] Bengtsson, H. and Hossjer, O. (2006). Methodological study of affine transformations of gene expression data with proposed robust non-parametric multidimensional normalization method. BMC Bioinformatics, 5(7):100.
[Bengtsson et al., 2004] Bengtsson, H., Jonsson, G., and Vallon-Christersson, J. (2004). Calibration and assessment of channel-specific biases in microarray data with extended dynamical range. BMC Bioinformatics, 5(1):177.
[Bernstein et al., 2007] Bernstein, B. E., Meissner, A., and Lander, E. S. (2007). The mammalian epigenome. Cell, 128(1):669-81.
[Bibikova et al., 2009] Bibikova, M., Le, J., Barnes, B., Saedinia-Melnyk, S., Zhou, L., Shen, R., and Gunderson, K. L. (2009). Genome-wide dna methylation profiling using infinium assay. Epigenomics, 1(1):177-200.
[Bibikova et al., 2006] Bibikova, M., Lin, Z., Zhou, L., Chudin, E., Garcia, E.
W., Wu, B., Doucet, D., Thomas, N. J., Wang, Y., Vollmer, E., Goldmann, T., Seifart, C., Jiang, W., Barker, D. L., Chee, M. S., Floros, J., and Fan, J. B. (2006). High-throughput dna methylation profiling using universal bead arrays. Genome Res., 16(3):383-93.
[Bleeker et al., 2009] Bleeker, F. E., Lamba, S., Leenstra, S., Troost, D., Hulsebos, T., Vandertop, W. P., Frattini, M., Molinari, F., Knowles, M., Cerrato, A., Rodolfo, M., Scarpa, A., Felicioni, L., Buttitta, F., Malatesta, S., Marchetti, A., and Bardelli, A. (2009). Idh1 mutations at residue p.r132 (idh1(r132)) occur frequently in high-grade gliomas but not in other solid tumors. Hum. Mutat., 30(1):7-11.
[Bock et al., 2010] Bock, C., Gu, H., Jager, N., Tomazou, E., Gnirke, A., Stunnenberg, H. G., and Meissner, A. (2010). Quantitative comparison of genome-wide dna methylation mapping technologies. Nature Biotechnology, 28(10):1106-1114.
[Bock and Lengauer, 2008] Bock, C. and Lengauer, T. (2008). Computational epigenetics. Bioinformatics, 24(10):1-10.
[Bongiorni et al., 2009] Bongiorni, S., Pugnali, M., Volpi, S., Bizzaro, D., Singh, P., and Prantera, G. (2009). Epigenetic marks for chromosome imprinting during spermatogenesis in coccids. Chromosoma, 118:501-512.
[Bozic et al., 2010] Bozic, I., Antal, T., Ohtsuki, H., Carter, H., Kim, D., Chen, S., Karchin, R., Kinzler, K. W., Vogelstein, B., and Nowak, M. A. (2010). Accumulation of driver and passenger mutations during tumor progression. Proceedings of the National Academy of Sciences of the United States of America, 107(43):18545-50.
[Cahill et al., 2007] Cahill, D. P., Levine, K. K., Betensky, R. A., Codd, P. J., Romany, C. A., Reavie, L. B., Batchelor, T. T., Futreal, P. A., Stratton, M. R., Curry, W. T., Iafrate, A. J., and Louis, D. N. (2007). Loss of the mismatch repair protein msh6 in human glioblastomas is associated with tumor progression during temozolomide treatment. Clin Cancer Res, 13(7):2038-2045.
[Campan et al., 2009] Campan, M., Weisenberger, D. J., Trinh, B., and Laird, P. W. (2009). MethyLight. Methods Mol. Biol., 507:325–37.
[Campbell et al., 2010] Campbell, P. J., Yachida, S., Mudie, L. J., et al. (2010). The patterns and dynamics of genomic instability in metastatic pancreatic cancer. Nature, 467(7310):1109–13.
[Cancer Genome Atlas Research Network, 2008] The Cancer Genome Atlas Research Network (2008). Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455(7216):1061–8.
[Chambers, 1998] Chambers, J. M. (1998). Programming with data: A guide to the S language. Springer-Verlag, 171(4356):737–38.
[Chambers and Hastie, 1992] Chambers, J. M. and Hastie, T. (1992). Statistical models in S. Wadsworth, 171(4356):737–38.
[Chan et al., 2008] Chan, S. K., Griffith, O. L., Tai, I. T., and Jones, S. J. M. (2008). Meta-analysis of colorectal cancer gene expression profiling studies identifies consistently reported candidate biomarkers. Cancer Epidemiol Biomarkers Prev, 17(1):543–52.
[Colman et al., 2010] Colman, H., Zhang, L., Sulman, E., McDonald, J., Shooshtari, N., Rivera, A., Popoff, S., Nutt, C., Louis, D., Cairncross, J., Gilbert, M., Phillips, H., Mehta, M., Chakravarti, A., Pelloski, C., Bhat, K., Feuerstein, B., Jenkins, R., and Aldape, K. (2010). A multigene predictor of outcome in glioblastoma. Neuro-Oncology, 12(1):49–57.
[Crick, 1958] Crick, F. H. C. (1958). On protein synthesis. Symp. Soc. Exp. Biol., XII:139–63.
[Crick, 1970] Crick, F. H. C. (1970). Central dogma of molecular biology. Nature, 227:561–63.
[Dahm, 2007] Dahm, R. (2007). Discovering DNA: Friedrich Miescher and the early years of nucleic acid research. Human Genetics, 122(6):565–581.
[Dang et al., 2009] Dang, L., White, D., Gross, S., Bennett, B., Bittinger, M., Driggers, E., Fantin, V., Jang, H., Jin, S., Keenan, M., Marks, K., Prins, R., Ward, P., Yen, K., Liau, L., Rabinowitz, J., Cantley, L., Thompson, C., Vander Heiden, M., and Su, S. (2009). Cancer-associated IDH1 mutations produce 2-hydroxyglutarate. Nature, 462(7274):739–744.
[de Magalhaes et al., 2009] de Magalhaes, J. P., Curado, J., and Church, G. M. (2009). Meta-analysis of age-related gene expression profiles identifies common signatures of aging. Bioinformatics, 25(1):875–81.
[Dennis et al., 2003] Dennis, G., Jr., Sherman, B. T., Hosack, D. A., Yang, J., Gao, W., Lane, H. C., and Lempicki, R. A. (2003). DAVID: Database for annotation, visualization, and integrated discovery. Genome Biol., 4(5):P3.
[Du et al., 2008] Du, P., Kibbe, W. A., and Lin, S. M. (2008). lumi: a pipeline for processing Illumina microarray. Bioinformatics, 24(13):1547–1548.
[Ducray et al., 2008] Ducray, F., Idbaih, A., de Reynies, A., Bieche, I., Thillet, J., Mokhtari, K., Lair, S., Marie, Y., Paris, S., Vidaud, M., Hoang-Xuan, K., Delattre, O., Delattre, J. Y., and Sanson, M. (2008). Anaplastic oligodendrogliomas with 1p19q codeletion have a proneural gene expression profile. Mol. Cancer, 7:41.
[Eads et al., 2000] Eads, C. A., Danenberg, K. D., Kawakami, K., Saltz, L. B., Blake, C., Shibata, D., Danenberg, P. V., and Laird, P. W. (2000). MethyLight: a high-throughput assay to measure DNA methylation. Nucleic Acids Res., 28(8):E32.
[Eads et al., 1999] Eads, C. A., Danenberg, K. D., Kawakami, K., Saltz, L. B., Danenberg, P. V., and Laird, P. W. (1999). CpG island hypermethylation in human colorectal tumors is not associated with DNA methyltransferase overexpression. Cancer Res., 59(10):2302–6.
[Eckhardt et al., 2006] Eckhardt, F., Lewin, J., Cortese, R., Rakyan, V., Attwood, J., Burger, M., Burton, J., Cox, T., Davies, R., Down, T., Haefliger, C., Horton, R., Howe, K., Jackson, D., Kunde, J., Koenig, C., Liddle, J., Niblett, D., Otto, T., Pettett, R., Seemann, S., Thompson, C., West, T., Rogers, J., Olek, A., Berlin, K., and Beck, S. (2006). DNA methylation profiling of human chromosomes 6, 20 and 22. Nat. Genet., 38(12):1378–1385.
[Eden et al., 2003] Eden, A., Gaudet, F., Waghmare, A., and Jaenisch, R. (2003). Chromosomal instability and tumors promoted by DNA hypomethylation. Science, 300(5618):455.
[ENCODE Project Consortium, 2007] The ENCODE Project Consortium (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816.
[Esteller et al., 2000] Esteller, M., Garcia-Foncillas, J., Andion, E., Goodman, S. N., Hidalgo, O. F., Vanaclocha, V., Baylin, S. B., and Herman, J. G. (2000). Inactivation of the DNA-repair gene MGMT and the clinical response of gliomas to alkylating agents. N. Engl. J. Med., 343(19):1350–4.
[Esteller et al., 1999] Esteller, M., Hamilton, S. R., Burger, P. C., Baylin, S. B., and Herman, J. G. (1999). Inactivation of the DNA repair gene O6-methylguanine-DNA methyltransferase by promoter hypermethylation is a common event in primary human neoplasia. Cancer Res., 59(4):793–7.
[Feinberg et al., 2006] Feinberg, A. P., Ohlsson, R., and Henikoff, S. (2006). The epigenetic progenitor origin of human cancer. Nature Reviews Genetics, 7(1):21–33.
[Fishel et al., 2007] Fishel, I., Kaufman, A., and Ruppin, E. (2007). Meta-analysis of gene expression data: a predictor-based approach. Bioinformatics, 23(1):1599–1606.
[Freedman et al., 2006] Freedman, M. L., Haiman, C. A., Patterson, N., McDonald, G. J., Tandon, A., Waliszewska, A., Penney, K., Steen, R. G., Ardlie, K., John, E. M., Oakley-Girvan, I., Whittemore, A. S., Cooney, K. A., Ingles, S.
A., Altshuler, D., Henderson, B. E., and Reich, D. (2006). Admixture mapping identifies 8q24 as a prostate cancer risk locus in African-American men. Proc Natl Acad Sci U S A, 103(38):14068–73.
[Gentleman et al., 2004] Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J. Y., and Zhang, J. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5(10):R80.
[Gillies and Lorimer, 2007] Gillies, J. K. and Lorimer, I. A. (2007). Regulation of p27Kip1 by miRNA 221/222 in glioblastoma. Cell Cycle, 6(16):2005–9.
[Goecks et al., 2010] Goecks, J., Nekrutenko, A., Taylor, J., and The Galaxy Team (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology, 11(8):R86.
[Griffith et al., 2006] Griffith, O. L., Melck, A., Jones, S. J. M., and Wiseman, S. M. (2006). Meta-analysis and meta-review of thyroid cancer gene expression profiling studies identifies important diagnostic biomarkers. J Clin Oncol, 24(1):5043–51.
[Gu et al., 2010] Gu, H., Bock, C., Mikkelsen, T. S., Jager, N., Smith, Z. D., Tomazou, E., Gnirke, A., Lander, E. S., and Meissner, A. (2010). Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nature Methods, 7(2):133–136.
[Haiman et al., 2007a] Haiman, C. A., Le Marchand, L., Yamamato, J., Stram, D. O., Sheng, X., Kolonel, L. N., Wu, A. H., Reich, D., and Henderson, B. E. (2007a). A common genetic risk factor for colorectal and prostate cancer. Nat Genet, 39(8):954–6.
[Haiman et al., 2007b] Haiman, C. A., Patterson, N., Freedman, M. L., Myers, S. R., Pike, M. C., Waliszewska, A., Neubauer, J., Tandon, A., Schirmer, C., McDonald, G.
J., Greenway, S. C., Stram, D. O., Le Marchand, L., Kolonel, L. N., Frasco, M., Wong, D., Pooler, L. C., Ardlie, K., Oakley-Girvan, I., Whittemore, A. S., Cooney, K. A., John, E. M., Ingles, S. A., Altshuler, D., Henderson, B. E., and Reich, D. (2007b). Multiple regions within 8q24 independently affect risk for prostate cancer. Nat. Genet., 39(5):638–44.
[Hartmann et al., 2009] Hartmann, C., Meyer, J., Balss, J., Capper, D., Mueller, W., Christians, A., Felsberg, J., Wolter, M., Mawrin, C., Wick, W., Weller, M., Herold-Mende, C., Unterberg, A., Jeuken, J. W., Wesseling, P., Reifenberger, G., and von Deimling, A. (2009). Type and frequency of IDH1 and IDH2 mutations are related to astrocytic and oligodendroglial differentiation and age: a study of 1,010 diffuse gliomas. Acta Neuropathol., 118(4):469–74.
[Hayden, 2010] Hayden, E. C. (2010). Human genome at ten: life is complicated. Nature, 464(6822):664–667.
[Hegi et al., 2005] Hegi, M. E., Diserens, A. C., Gorlia, T., Hamou, M. F., de Tribolet, N., et al. (2005). MGMT gene silencing and benefit from temozolomide in glioblastoma. N. Engl. J. Med., 352(10):997–1003.
[Hegi et al., 2008] Hegi, M. E., Liu, L., Herman, J. G., et al. (2008). Correlation of O6-methylguanine methyltransferase (MGMT) promoter methylation with clinical outcomes in glioblastoma and clinical strategies to modulate MGMT activity. J. Clin. Oncol., 26(25):4189–99.
[Heinz et al., 2010] Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y. C., Laslo, P., Cheng, J. X., Murre, C., Singh, H., and Glass, C. K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular Cell, 38(4):576–589.
[Herman and Baylin, 2003] Herman, J. G. and Baylin, S. B. (2003). Gene silencing in cancer in association with promoter hypermethylation. N. Engl. J. Med., 349(21):2042–54.
[Hinoue et al., 2009] Hinoue, T., Weisenberger, D., Pan, F., Campan, M., Kim, M., Young, J., Whitehall, V., Leggett, B., and Laird, P. (2009). Analysis of the association between CIMP and BRAFV600E in colorectal cancer by DNA methylation profiling. PLoS ONE, 4(12):e8357.
[Houshdaran et al., 2010] Houshdaran, S., Hawley, S., Palmer, C., Campan, M., Olsen, M., Ventura, A., Knudsen, B., Drescher, C., Urban, N., Brown, P., and Laird, P. (2010). DNA methylation profiles of ovarian epithelial carcinoma tumors and cell lines. PLoS ONE, 5(2):e9359.
[Huang et al., 2009] Huang, D., Sherman, B., and Lempicki, R. (2009). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc., 4:44–57.
[International Cancer Genome Consortium, 2010] The International Cancer Genome Consortium (2010). International network of cancer genome projects. Nature, 464(7291):993–998.
[International Human Genome Sequencing Consortium, 2001] The International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921.
[Jin and Felsenfeld, 2007] Jin, C. and Felsenfeld, G. (2007). Nucleosome stability mediated by histone variants H3.3 and H2A.Z. Genes Dev., 21(12):1519–1529.
[Jones and Baylin, 2007] Jones, P. A. and Baylin, S. B. (2007). The epigenomics of cancer. Cell, 128(4):683–92.
[Jones and Laird, 1999] Jones, P. A. and Laird, P. W. (1999). Cancer-epigenetics comes of age. Nature Genetics, 21(2):163–167.
[Kelly et al., 2010] Kelly, T. K., Miranda, T. B., Liang, G., Berman, B. P., Lin, J. C., Tanay, A., and Jones, P. A. (2010). H2A.Z maintenance during mitosis reveals nucleosome shifting on mitotically silenced genes. Molecular Cell, 39(6):901–911.
[Kim et al., 2006] Kim, T. Y., Zhong, S., Fields, C. R., Kim, J. H., and Robertson, K. D. (2006). Epigenomic profiling reveals novel and frequent targets of aberrant DNA methylation-mediated silencing in malignant glioma.
Cancer Res., 66(15):7490–501.
[Kupershmidt et al., 2010] Kupershmidt, I., Su, Q. J., Grewal, A., Sundaresh, S., Halperin, I., Flynn, J., Shekar, M., Wang, H., Park, J., Cui, W., Wall, G. D., Wisotzkey, R., Alag, S., Akhtari, S., and Ronaghi, M. (2010). Ontology-based meta-analysis of global collections of high-throughput public data. PLoS ONE, 5(9):e13066.
[Laird, 2003] Laird, P. W. (2003). The power and the promise of DNA methylation markers. Nat. Rev. Cancer, 3(4):253–66.
[Laird, 2010] Laird, P. W. (2010). Principles and challenges of genome-wide DNA methylation analysis. Nature Reviews Genetics, 11(3):191–203.
[Ledford, 2010] Ledford, H. (2010). The cancer genome challenge. Nature, 464(12):972–974.
[Leek et al., 2010] Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., Geman, D., Baggerly, K., and Irizarry, R. A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11(10):733–739.
[Ley et al., 2010] Ley, T., Ding, L., Walter, M., et al. (2010). DNMT3A mutations in acute myeloid leukemia. New England Journal of Medicine, 363(43):2424–2433.
[Li et al., 2007] Li, B., Carey, M., and Workman, J. L. (2007). The role of chromatin during transcription. Cell, 128(4):707–719.
[Liang et al., 2005] Liang, Y., Diehn, M., Watson, N., Bollen, A. W., Aldape, K. D., Nicholas, M. K., Lamborn, K. R., Berger, M. S., Botstein, D., Brown, P. O., and Israel, M. A. (2005). Gene expression profiling reveals molecularly and clinically distinct subtypes of glioblastoma multiforme. Proc. Natl. Acad. Sci. U.S.A., 102(16):5814–9.
[Liao, 2009] Liao, D. (2009). Emerging roles of the EBF family of transcription factors in tumor suppression. Molecular Cancer Research, 7(12):1069–1077.
[Liu et al., 2009] Liu, X., Yu, J., Jiang, L., Wang, A., Shi, F., Ye, H., and Zhou, X. (2009).
MicroRNA-222 regulates cell invasion by targeting matrix metalloproteinase 1 (MMP1) and manganese superoxide dismutase 2 (SOD2) in tongue squamous cell carcinoma cell lines. Cancer Genomics Proteomics, 6(3):131–9.
[Madhavan et al., 2009] Madhavan, S., Zenklusen, J. C., Kotliarov, Y., Sahni, H., Fine, H. A., and Buetow, K. (2009). Rembrandt: helping personalized medicine become a reality through integrative translational research. Mol. Cancer Res., 7(2):157–67.
[Maier et al., 2004] Maier, H., Ostraat, R., Gao, H., Fields, S., and Hagman, J. (2004). Early B cell factor cooperates with RUNX1 and mediates epigenetic changes associated with mb-1 transcription. Nature Immunology, 5(10):1069–1077.
[Martinez et al., 2009] Martinez, R., Martin-Subero, J. I., Rohde, V., Kirsch, M., Alaminos, M., Fernandez, A. F., Ropero, S., Schackert, G., and Esteller, M. (2009). A microarray-based DNA methylation study of glioblastoma multiforme. Epigenetics, 4(4):255–64.
[Martinez et al., 2007] Martinez, R., Schackert, G., and Esteller, M. (2007). Hypermethylation of the proapoptotic gene TMS1/ASC: prognostic importance in glioblastoma multiforme. J. Neurooncol., 82(2):133–9.
[Mikkelsen et al., 2008] Mikkelsen, T. S., Hanna, J., Zhang, X., Ku, M., and Wernig, M. (2008). Dissecting direct reprogramming through integrative genomic analysis. Nature, 454(8):49–55.
[Miller and Stamatoyannopoulos, 2010] Miller, B. and Stamatoyannopoulos, J. A. (2010). Integrative meta-analysis of differential gene expression in acute myeloid leukemia. PLoS ONE, 5(1):e9466.
[Monti et al., 2003] Monti, S., Tamayo, P., Mesirov, J., and Golub, T. (2003). Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52:91–118.
[Nagarajan and Costello, 2009] Nagarajan, R. P. and Costello, J. F. (2009). Epigenetic mechanisms in glioblastoma multiforme. Semin. Cancer Biol., 19(3):188–97.
[Nakamura et al., 2001] Nakamura, M., Yonekawa, Y., Kleihues, P., and Ohgaki, H. (2001). Promoter hypermethylation of the RB1 gene in glioblastomas. Lab Invest., 81(1):77–82.
[Noushmehr et al., 2010] Noushmehr, H., Weisenberger, D. J., Diefes, K., Phillips, H. S., Pujara, K., Berman, B. P., Pan, F., Pelloski, C. E., Sulman, E. P., Bhat, K. P., Verhaak, R. G., Hoadley, K. A., Hayes, D. N., Perou, C. M., Schmidt, H. K., Ding, L., Wilson, R. K., Van Den Berg, D., Shen, H., Bengtsson, H., Neuvial, P., Cope, L. M., Buckley, J., Herman, J. G., Baylin, S. B., Laird, P. W., Aldape, K., and The Cancer Genome Atlas Research Network (2010). Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma. Cancer Cell, 17(5):510–22.
[Ohgaki and Kleihues, 2007] Ohgaki, H. and Kleihues, P. (2007). Genetic pathways to primary and secondary glioblastoma. Am J Pathol., 170(5):1445–1453.
[Parsons et al., 2008] Parsons, D. W., Jones, S., Zhang, X., Lin, J. C., Leary, R. J., Angenendt, P., Mankoo, P., Carter, H., Siu, I. M., Gallia, G. L., Olivi, A., McLendon, R., Rasheed, B. A., Keir, S., Nikolskaya, T., Nikolsky, Y., Busam, D. A., Tekleab, H., Diaz, L. A., Jr., Hartigan, J., Smith, D. R., Strausberg, R. L., Marie, S. K., Shinjo, S. M., Yan, H., Riggins, G. J., Bigner, D. D., Karchin, R., Papadopoulos, N., Parmigiani, G., Vogelstein, B., Velculescu, V. E., and Kinzler, K. W. (2008). An integrated genomic analysis of human glioblastoma multiforme. Science, 321(5897):1807–12.
[Phillips et al., 2006] Phillips, H. S., Kharbanda, S., Chen, R., Forrest, W. F., Soriano, R. H., Wu, T. D., Misra, A., Nigro, J. M., Colman, H., Soroceanu, L., Williams, P. M., Modrusan, Z., Feuerstein, B. G., and Aldape, K. (2006). Molecular subclasses of high-grade glioma predict prognosis, delineate a pattern of disease progression, and resemble stages in neurogenesis. Cancer Cell, 9(3):157–73.
[Pike et al., 2008] Pike, B., Greiner, T., Wang, X., Weisenburger, D., Hsu, Y., Renaud, G., Wolfsberg, T., Kim, M., Weisenberger, D., Siegmund, K., Ye, W., Groshen, S., Mehrian-Shai, R., Delabie, J., Chan, W., Laird, P., and Hacia, J. (2008). DNA methylation profiles in diffuse large B-cell lymphoma and their relationship to gene expression status. Leukemia, 22(5):1035–1043.
[Pogribny and Beland, 2009] Pogribny, I. and Beland, F. (2009). DNA hypomethylation in the origin and pathogenesis of human diseases. Cellular and Molecular Life Sciences, 66:2249–2261.
[R Development Core, 2009] R Development Core Team (2009). R: A language and environment for statistical computing. Technical Report 3-900051-07-0, R Foundation for Statistical Computing.
[Reich et al., 2006] Reich, M., Liefeld, T., Gould, J., Lerner, J., Tamayo, P., and Mesirov, J. P. (2006). GenePattern 2.0. Nat. Genet., 38(5):500–1.
[Robinson et al., 2011] Robinson, J. T., Thorvaldsdottir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., and Mesirov, J. P. (2011). Integrative genomics viewer. Nature Biotechnology, 29(1):24–26.
[Rodriguez et al., 2006] Rodriguez, J., Frigola, J., Vendrell, E., Risques, R., Fraga, M. F., Morales, C., Moreno, V., Esteller, M., Capella, G., Ribas, M., and Peinado, M. A. (2006). Chromosomal instability correlates with genome-wide DNA demethylation in human primary colorectal cancers. Cancer Research, 66(17):8462–9468.
[Sanson et al., 2009] Sanson, M., Marie, Y., Paris, S., Idbaih, A., Laffaire, J., Ducray, F., El Hallani, S., Boisselier, B., Mokhtari, K., Hoang-Xuan, K., and Delattre, J. (2009). Isocitrate dehydrogenase 1 codon 132 mutation is an important prognostic biomarker in gliomas. J. Clin. Oncol., 27(25):4150–4154.
[Schaefer et al., 2009] Schaefer, A., Jung, M., Mollenkopf, H. J., Wagner, I., Stephan, C., Jentzmik, F., Miller, K., Lein, M., Kristiansen, G., and Jung, K. (2009). Diagnostic and prognostic implications of microRNA profiling in prostate carcinoma.
Int. J. Cancer.
[Schumacher et al., 2007] Schumacher, F. R., Feigelson, H. S., Cox, D. G., Haiman, C. A., Albanes, D., Buring, J., Calle, E. E., Chanock, S. J., Colditz, G. A., Diver, W. R., Dunning, A. M., Freedman, M. L., Gaziano, J. M., Giovannucci, E., Hankinson, S. E., Hayes, R. B., Henderson, B. E., Hoover, R. N., Kaaks, R., Key, T., Kolonel, L. N., Kraft, P., Le Marchand, L., Ma, J., Pike, M. C., Riboli, E., et al. (2007). A common 8q24 variant in prostate and breast cancer from a large nested case-control study. Cancer Res., 67(7):2951–6.
[Shannon and Armstrong, 2010] Shannon, K. and Armstrong, S. A. (2010). Genetics, epigenetics, and leukemia. New England Journal of Medicine, 363(43):2460–2461.
[Sharma et al., 2010] Sharma, S., Kelly, T. K., and Jones, P. A. (2010). Epigenetics in cancer. Carcinogenesis, 31(1):27–36.
[Shete et al., 2009] Shete, S., Hosking, F. J., Robertson, L. B., Dobbins, S. E., Sanson, M., Malmer, B., Simon, M., Marie, Y., Boisselier, B., Delattre, J. Y., Hoang-Xuan, K., El Hallani, S., Idbaih, A., Zelenika, D., Andersson, U., Henriksson, R., Bergenheim, A. T., Feychting, M., Lonn, S., Ahlbom, A., Schramm, J., Linnebank, M., Hemminki, K., Kumar, R., Hepworth, S. J., Price, A., Armstrong, G., Liu, Y., Gu, X., Yu, R., Lau, C., Schoemaker, M., Muir, K., Swerdlow, A., Lathrop, M., Bondy, M., and Houlston, R. S. (2009). Genome-wide association study identifies five susceptibility loci for glioma. Nat Genet, 41(8):899–904.
[Silber et al., 1999] Silber, J., Blank, A., Bobola, M., Ghatan, S., Kolstoe, D., and Berger, M. (1999). O6-methylguanine-DNA methyltransferase-deficient phenotype in human gliomas: frequency and time to tumor progression after alkylating agent-based chemotherapy. Clin Cancer Res, 5(4):807–814.
[Sonnenberg and Liem, 2007] Sonnenberg, A. and Liem, R. K. (2007). Plakins in development and disease. Exp. Cell Res., 313(10):2189–203.
[Steger et al., 2008] Steger, D. J., Lefterova, M. I., Ying, L., Stonestrom, A.
J., Schupp, M., Zhuo, D., Vakoc, A. L., Kim, J. E., Chen, J., Lazar, M. A., Blobel, G. A., and Vakoc, C. R. (2008). DOT1L/KMT4 recruitment and H3K79 methylation are ubiquitously coupled with gene transcription in mammalian cells. Mol Cell Biol, 28(8):2825–39.
[Stone et al., 2004] Stone, A. R., Bobo, W., Brat, D. J., Devi, N. S., Van Meir, E. G., and Vertino, P. M. (2004). Aberrant methylation and down-regulation of TMS1/ASC in human glioblastoma. Am. J. Pathol., 165(4):1151–61.
[Subramanian et al., 2005] Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S., and Mesirov, J. P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A., 102(43):15545–50.
[Sun et al., 2006] Sun, L., Hui, A. M., Su, Q., Vortmeyer, A., Kotliarov, Y., Pastorino, S., Passaniti, A., Menon, J., Walling, J., Bailey, R., Rosenblum, M., Mikkelsen, T., and Fine, H. A. (2006). Neuronal and glioma-derived stem cell factor induces angiogenesis within the brain. Cancer Cell, 9(4):287–300.
[Sung et al., 2009] Sung, W. K., Lu, Y., Lee, C. W., Zhang, D., Ronaghi, M., and Lee, C. G. (2009). Deregulated direct targets of the hepatitis B virus (HBV) protein, HBx, identified through chromatin immunoprecipitation and expression microarray profiling. J. Biol. Chem., 284(33):21941–54.
[Takai and Jones, 2002] Takai, D. and Jones, P. A. (2002). Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proceedings of the National Academy of Sciences of the United States of America, 99(6):3740–3745.
[Taylor et al., 2008] Taylor, B. S., Barretina, J., Socci, N. D., Decarolis, P., Ladanyi, M., Meyerson, M., Singer, S., and Sander, C. (2008). Functional copy-number alterations in cancer. PLoS ONE, 3(9):e3179.
[Taylor, 2006] Taylor, S. M. (2006). p53 and deregulation of DNA methylation in cancer. Cellscience Reviews, 2(3).
[Tepel et al., 2008] Tepel, M., Roerig, P., Wolter, M., Gutmann, D. H., Perry, A., Reifenberger, G., and Riemenschneider, M. J. (2008). Frequent promoter hypermethylation and transcriptional downregulation of the NDRG2 gene at 14q11.2 in primary glioblastoma. Int. J. Cancer, 123(9):2080–6.
[Toyota et al., 1999] Toyota, M., Ahuja, N., Ohe-Toyota, M., Herman, J. G., Baylin, S. B., and Issa, J. P. (1999). CpG island methylator phenotype in colorectal cancer. Proc. Natl. Acad. Sci. U.S.A., 96(15):8681–6.
[Treiber et al., 2004] Treiber, N., Treiber, T., Zocher, G., and Grosschedl, R. (2004). Structure of an EBF1 DNA complex reveals unusual DNA recognition and structural homology with Rel proteins. Genes and Development, 5(10):1069–1077.
[Troyanskaya et al., 2002] Troyanskaya, O. G., Garber, M. E., Brown, P. O., Botstein, D., and Altman, R. B. (2002). Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics, 18(11):1454–61.
[Tusher et al., 2001] Tusher, V., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A., 98(9):5116–5121.
[Uhlmann et al., 2003] Uhlmann, K., Rohde, K., Zeller, C., Szymas, J., Vogel, S., Marczinek, K., Thiel, G., Nurnberg, P., and Laird, P. W. (2003). Distinct methylation profiles of glioma subtypes. Int. J. Cancer, 106(1):52–9.
[Vander Heiden et al., 2009] Vander Heiden, M. G., Cantley, L. C., and Thompson, C. B. (2009). Understanding the Warburg effect: the metabolic requirements of cell proliferation. Science, 324(5930):1029–33.
[Venter et al., 2001] Venter, J. C., Adams, M. D., Myers, E. W., Li, P. L., Mural, R. J., et al. (2001). The sequence of the human genome. Science, 291(5507):1304–1351.
[Verhaak et al., 2010] Verhaak, R., Hoadley, K., Purdom, E., Wang, V., Qi, Y., Wilkerson, M., Miller, C., Ding, L., Golub, T., Mesirov, J., Alexe, G., Lawrence, M., O'Kelly, M., Tamayo, P., Weir, B., Gabriel, S., Winckler, W., et al. (2010). An integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR and NF1. Cancer Cell, 17(1):98–110.
[Visakorpi et al., 1995] Visakorpi, T., Kallioniemi, A. H., Syvanen, A. C., Hyytinen, E. R., Karhu, R., Tammela, T., Isola, J. J., and Kallioniemi, O. P. (1995). Genetic changes in primary and recurrent prostate cancer by comparative genomic hybridization. Cancer Res, 55(2):342–7.
[Wajapeyee et al., 2008] Wajapeyee, N., Serra, R., Zhu, X., Mahalingam, M., and Green, M. (2008). Oncogenic BRAF induces senescence and apoptosis through pathways mediated by the secreted protein IGFBP7. Cell, 132(3):363–374.
[Watson and Crick, 1953] Watson, J. D. and Crick, F. H. C. (1953). Molecular structure of deoxypentose nucleic acids. Nature, 171(4356):737–38.
[Weisenberger et al., 2006] Weisenberger, D. J., Siegmund, K. D., Campan, M., Young, J., Long, T. I., Faasse, M. A., Kang, G. H., Widschwendter, M., Weener, D., Buchanan, D., Koh, H., Simms, L., Barker, M., Leggett, B., Levine, J., Kim, M., French, A. J., Thibodeau, S. N., Jass, J., Haile, R., and Laird, P. W. (2006). CpG island methylator phenotype underlies sporadic microsatellite instability and is tightly associated with BRAF mutation in colorectal cancer. Nat. Genet., 38(7):787–93.
[Widschwendter et al., 2007] Widschwendter, M., Fiegl, H., Egle, D., Mueller-Holzner, E., Spizzo, G., Marth, C., Weisenberger, D. J., Campan, M., Young, J., Jacobs, I., and Laird, P. W. (2007).
Epigenetic stem cell signature in cancer. Nat. Genet., 39(2):157–8.
[Wirapati et al., 2008] Wirapati, P., Sotiriou, C., Kunkel, S., Farmer, P., Pradervand, S., Haibe-Kains, B., Desmedt, C., Ignatiadis, M., Sengstag, T., Schutz, F., Goldstein, D. R., Piccart, M., and Delorenzi, M. (2008). Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Res, 10(1):R65.
[Wu et al., 2009] Wu, C., Orozco, C., Boyer, J., Leglise, M., Goodale, J., Batalov, S., Hodge, C., Haase, J., Janes, J., Huss, J., and Su, A. (2009). BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biology, 10(11):R130.
[Yan et al., 2009] Yan, H., Parsons, D. W., Jin, G., McLendon, R., Rasheed, B. A., Yuan, W., Kos, I., Batinic-Haberle, I., Jones, S., Riggins, G. J., Friedman, H., Friedman, A., Reardon, D., Herndon, J., Kinzler, K. W., Velculescu, V. E., Vogelstein, B., and Bigner, D. D. (2009). IDH1 and IDH2 mutations in gliomas. N. Engl. J. Med., 360(8):765–73.
[Yang et al., 2011] Yang, X., Noushmehr, H., Gaining, G., and Jones, P. A. (2011). Coordinated chromatin remodeling induced by demethylation requires SRCAP mediated H2A.Z exchange. Manuscript in preparation, XXX(XX):XXX.
[Yeager et al., 2007] Yeager, M., Orr, N., Hayes, R. B., Jacobs, K. B., Kraft, P., Wacholder, S., Minichiello, M. J., Fearnhead, P., Yu, K., Chatterjee, N., Wang, Z., Welch, R., Staats, B. J., Calle, E. E., Feigelson, H. S., Thun, M. J., Rodriguez, C., Albanes, D., Virtamo, J., Weinstein, S., Schumacher, F. R., Giovannucci, E., Willett, W. C., Cancel-Tassin, G., Cussenot, O., Valeri, A., Andriole, G. L., Gelmann, E. P., Tucker, M., Gerhard, D. S., Fraumeni, J. F., Jr., Hoover, R., Hunter, D. J., Chanock, S. J., and Thomas, G. (2007). Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat Genet, 39(5):645–9.
[Zhang and Reinberg, 2001] Zhang, Y. and Reinberg, D. (2001). Transcription regulation by histone methylation: interplay between different covalent modifications of the core histone tails. Genes and Development, 15(18):2343–60.

Appendix A (TCGA Author List)

The Cancer Genome Atlas Research Network (TCGA Research Network); Current as of April 2010

Tissue source sites:
Duke University Medical School: Roger McLendon 1, Allan Friedman 2 & Darrell Bigner 1
Emory University: Erwin G. Van Meir 3,4,5, Daniel J. Brat 5,6, Gena M. Mastrogianakis 3 & Jeffrey J. Olson 3,4,5
Henry Ford Hospital: Tom Mikkelsen 7 & Norman Lehman 8
MD Anderson Cancer Center: Ken Aldape 9, W. K. Alfred Yung 10 & Oliver Bogler 11
University of California San Francisco: Scott VandenBerg 12, Mitchel Berger 13 & Michael Prados 13

Genome sequencing centers:
Baylor College of Medicine: Donna Muzny 14, Margaret Morgan 14, Steve Scherer 14, Aniko Sabo 14, Lynn Nazareth 14, Lora Lewis 14, Otis Hall 14, Yiming Zhu 14, Yanru Ren 14, Omar Alvi 14, Jiqiang Yao 14, Alicia Hawes 14, Shalini Jhangiani 14, Gerald Fowler 14, Anthony San Lucas 14, Christie Kovar 14, Andrew Cree 14, Huyen Dinh 14, Jireh Santibanez 14, Vandita Joshi 14, Manuel L. Gonzalez-Garay 14, Christopher A. Miller 14,15, Aleksandar Milosavljevic 14,15,16, Larry Donehower 17, David A. Wheeler 14 & Richard A. Gibbs 14
Broad Institute of MIT and Harvard: Kristian Cibulskis 18, Carrie Sougnez 18, Tim Fennell 18, Scott Mahan 18, Jane Wilkinson 18, Liuda Ziaugra 18, Robert Onofrio 18, Toby Bloom 18, Rob Nicol 18, Kristin Ardlie 18, Jennifer Baldwin 18, Stacey Gabriel 18 & Eric S. Lander 18,19,20
Washington University in St Louis: Li Ding 21, Robert S. Fulton 21, Michael D. McLellan 21, John Wallis 21, David E. Larson 21, Xiaoqi Shi 21, Rachel Abbott 21, Lucinda Fulton 21, Ken Chen 21, Daniel C. Koboldt 21, Michael C. Wendl 21, Rick Meyer 21, Yuzhu Tang 21, Ling Lin 21, John R.
Osborne 21, Brian H. Dunford-Shore 21, Tracie L. Miner 21, Kim Delehaunty 21, Chris Markovic 21, Gary Swift 21, William Courtney 21, Craig Pohl 21, Scott Abbott 21, Amy Hawkins 21, Shin Leong 21, Carrie Haipek 21, Heather Schmidt 21, Maddy Wiechert 21, Tammi Vickery 21, Sacha Scott 21, David J. Dooling 21, Asif Chinwalla 21, George M. Weinstock 21, Elaine R. Mardis 21 & Richard K. Wilson 21

Cancer genome characterization centres:
Broad Institute/Dana-Farber Cancer Institute: Gad Getz 18, Wendy Winckler 18,22,23, Roel G. W. Verhaak 18,22,23, Michael S. Lawrence 18, Michael O'Kelly 18, Jim Robinson 18, Gabriele Alexe 18, Rameen Beroukhim 18,22,23, Scott Carter 18, Derek Chiang 18,22, Josh Gould 18, Supriya Gupta 18, Josh Korn 18, Craig Mermel 18,22, Jill Mesirov 18, Stefano Monti 18, Huy Nguyen 18, Melissa Parkin 18, Michael Reich 18, Nicolas Stransky 18, Barbara A. Weir 18,22,23, Levi Garraway 18,22,23, Todd Golub 18,22,23 & Matthew Meyerson 18,22,23
Harvard Medical School/Dana-Farber Cancer Institute: Lynda Chin 22,24,25, Alexei Protopopov 24, Jianhua Zhang 24, Ilana Perna 24, Sandy Aronson 26, Narayan Sathiamoorthy 26, Georgia Ren 24, Sachet Shukla 24, W. Ruprecht Wiedemeyer 22, Hyunsoo Kim 26, Sek Won Kong 27,28, Yonghong Xiao 24, Isaac S. Kohane 26,27,29, Jon Seidman 30, Peter J. Park 26,27,29 & Raju Kucherlapati 26
Johns Hopkins/University of Southern California: Peter W. Laird 31, Leslie Cope 32, James G. Herman 33, Daniel J. Weisenberger 31, Fei Pan 31, David Van Den Berg 31, Houtan Noushmehr 31, Hui Shen 31, Leander Van Neste 34, Joo Mi Yi 33, Kornel E. Schuebel 33 & Stephen B. Baylin 33
HudsonAlpha Institute/Stanford University: Devin M. Absher 35, Jun Z. Li 36, Audrey Southwick 37, Shannon Brady 37, Amita Aggarwal 37, Tisha Chung 37, Gavin Sherlock 37, James D. Brooks 38 & Richard M. Myers 35
Lawrence Berkeley National Laboratory: Paul T.
Spellman 39 , Elizabeth Purdom 40 , Lakshmi R. Jakkula 39 , Anna V. Lapuk 39 , Henry Marr 39 , Shan- non Dorton 39 , Yoon Gi Choi 41 , Ju Han 39 , Amrita Ray 39 , Victoria Wang 40 , Steen Durinck 39 , Mark Robinson 42 , Nicholas J. Wang 39 , Karen Vranizan 41 , Vivian Peng 41 , Eric Van Name 41 , Gerald V. Fontenay 39 , John Ngai 41 , John G. Conboy 39 , Bahram Parvin 39 , Heidi S. Feiler 39 , Terence P. Speed 40;42 & Joe W. Gray 39 Memorial Sloan-Kettering Cancer Center: Cameron Brennan 43 , Nicholas D. Socci 44 , Adam Olshen 45 , Barry S. Taylor 44;46 , Alex Lash 44 , Nikolaus Schultz 44 , Boris Reva 44 , Yevgeniy Antipin 44 , Alexey Stukalov 44 , Benjamin Gross 44 , Ethan Cerami 44 , Wei Qing Wang 44 , Li-Xuan Qin 45 , Venkatraman E. Seshan 45 , Liliana Villafania 47 , Magali Cavatore 47 , Laetitia Borsu 48 , Agnes Viale 47 , William Gerald 48 , Chris Sander 44 & Marc Ladanyi 48 University of North Carolina, Chapel Hill: Charles M. Perou 49;50 , D. Neil Hayes 51 , Michael D. Topal 50;52 , Katherine A. Hoadley 49 , Yuan Qi 51 , Sai Balu 52 , Yan Shi 52 & Junyuan Wu 52 Biospecimen Core Resource: 131 Robert Penny 53 , Michael Bittner 54 , Troy Shelton 53 , Elizabeth Lenkiewicz 53 , Scott Morris 53 , Debbie Beasley 53 & Sheri Sanders 53 Data Coordinating Center: Ari Kahn 55 , Robert Sfeir 55 , Jessica Chen 55 , David Nassau 55 , Larry Feng 55 , Erin Hickey 55 , Jinghui Zhang 56 & John N. Weinstein 57 Project teams: National Cancer Institute: Anna Barker 58 , Daniela S. Gerhard 58 , Joseph Vockley 58 , Carolyn Compton 58 , Jim Vaught 58 , Peter Fielding 58 , Martin L. Ferguson 59 , Carl Schaefer 56 , Subhashree Madhavan 56 & Kenneth H. Buetow 56 National Human Genome Research Institute: Francis Collins 60 , Peter Good 60 , Mark Guyer 60 , Brad Ozenberger 60 , Jane Peterson 60 & Elizabeth Thomson 60 Aliations: 1 Department of Pathology, and 2 Department of Surgery, Duke Univer- sity Medical Center, Durham, North Carolina 27710, USA. 
3 Department of Neurosurgery, 4 Department of Hematology and Medical Oncology, 5 Winship Cancer Institute, and 6 Department of Pathology and Laboratory Medicine, Emory University School of Medicine, Atlanta, Georgia 30322, USA.
7 Department of Neurological Surgery, 8 Department of Pathology, Henry Ford Hospital, Detroit, Michigan 48202, USA.
9 Department of Pathology, 10 Department of Neuro-Oncology, and 11 Department of Neurosurgery, University of Texas M.D. Anderson Cancer Center, Houston, Texas 77030, USA.
12 Department of Pathology, and 13 Department of Neurosurgery, University of California San Francisco, San Francisco, California 94143, USA.
14 Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA.
15 Graduate Program in Structural and Computational Biology and Molecular Biophysics, 16 Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA.
17 Department of Molecular Virology and Microbiology, Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA.
18 The Eli and Edythe L. Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, Massachusetts 02142, USA.
19 Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02142, USA.
20 Department of Systems Biology, Harvard University, Boston, Massachusetts 02115, USA.
21 The Genome Center at Washington University, Department of Genetics, Washington University School of Medicine, St Louis, Missouri 63108, USA.
22 Department of Medical Oncology, 23 Center for Cancer Genome Discovery, and 24 Belfer Institute for Applied Cancer Science, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA.
25 Department of Dermatology, Harvard Medical School, Boston, Massachusetts 02115, USA.
26 Harvard Medical School-Partners HealthCare Center for Genetics and Genomics, Boston, Massachusetts 02115, USA.
27 Informatics Program, and 28 Department of Cardiology, Children's Hospital, Boston, Massachusetts 02115, USA.
29 Center for Biomedical Informatics, and 30 Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA.
31 USC Epigenome Center, University of Southern California, Los Angeles, California 90089, USA.
32 Biometry and Clinical Trials Division, and 33 Cancer Biology Division, The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins University, Baltimore, Maryland 21231, USA.
34 Department of Molecular Biotechnology, Faculty of Bioscience and Engineering, Ghent University, Ghent B-9000, Belgium.
35 HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA.
36 Department of Human Genetics, University of Michigan, Ann Arbor, Michigan 48109, USA.
37 Department of Genetics, and 38 Department of Urology, Stanford University School of Medicine, Stanford, California 94305, USA.
39 Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA.
40 Department of Statistics, and 41 Department of Molecular and Cellular Biology, University of California at Berkeley, Berkeley, California 95720, USA.
42 Walter and Eliza Hall Institute, Parkville, Victoria 3052, Australia.
43 Department of Neurosurgery, and 44 Computational Biology Center, Memorial Sloan-Kettering Cancer Center, New York, New York 10065, USA.
45 Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, New York 10065, USA.
46 Department of Physiology and Biophysics, Weill Cornell Graduate School of Medical Sciences, New York, New York 10065, USA.
47 Genomics Core Laboratory, Memorial Sloan-Kettering Cancer Center, New York, New York 10065, USA.
48 Department of Pathology, Human Oncology and Pathogenesis Program, Memorial Sloan-Kettering Cancer Center, New York, New York 10065, USA.
49 Department of Genetics, 50 Department of Pathology and Laboratory Medicine, and 51 Department of Internal Medicine, Division of Medical Oncology, Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA.
52 Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA.
53 International Genomics Consortium, Phoenix, Arizona 85004, USA.
54 Computational Biology Division, Translational Genomics Research Institute, Phoenix, Arizona 85004, USA.
55 SRA International, Fairfax, Virginia 22033, USA.
56 Center For Biomedical Informatics and Information Technology, National Cancer Institute, Rockville, Maryland 20852, USA.
57 Department of Bioinformatics and Computational Biology, M.D. Anderson Cancer Center, Houston, Texas 77030, USA.
58 National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
59 MLF Consulting, Arlington, Massachusetts 02474, USA.
60 National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.

Appendix B (Pipeline R Code)

Pipeline R code developed for handling and processing DNA methylation Infinium data. See Chapter 2, starting on page 14, for a detailed description of this code.

Listing 1: Pipeline R Code

############################################
# Author: Houtan Noushmehr
# Contact: hnoushme@usc.edu; houtana@gmail.com; +1.310.570.2DNA
# Institute: University of Southern California Epigenome Center
#            (http://epigenome.usc.edu/)
# Date creation: Jun 7, 2010
# Project Name: Pipeline import of Infinium Data
# 1. this script will import csv files or beadlevel
#    summary for all samples for a given number of CHIPs.
#    currently we need to move all csv files for
#    each of the scan settings into a directory for
#    each scan (e.g. 1.0, 1.25, 1.5)
#    eventually this will be scripted within genius.
# 2a.
#     calibrate imported data using multiscan calibration.
#     Data is stored before and after calibration into datRG and datRGc.
# 2b. Calibrate control probes between scans. Calculating
#     detection pvalues (zscore method) using calibrated negative controls.
# 2c. Plots are generated for reports
# 3. Using information from 2b, decide how to proceed
#    with normalization
# 4. Scripts used to plot pre/post normalization
#
# required packages:
#  o R.utils
#  o aroma.light
#  o matrixStats
#  o mSet
############################################
library("aroma.light");
library("R.utils");
#library("aroma.core");
library("matrixStats");
#
# beadlevel input
#
probe.id <- read.csv("./infinium_probe_id.csv");
probe.color <- read.csv("./probe_color.csv");
control.probes <- read.table("./control_probes.txt", sep="\t", header=T);
datPath <- c("./")
pdata.m <- read.table(paste(".", "sample.info.txt", sep="/"), header=T)
figForce <- FALSE;
#for the calibration plots for each sample and color
figPath.c <- Arguments$getWritablePath(file.path(".",
  "results.calibration", fsep="/"));
#for all other figures
figPath <- Arguments$getWritablePath(file.path(".",
  "results", fsep="/"));
verbose <- Arguments$getVerbose(-5, timestamp=TRUE);

#arguments, functions used
#Data Classifications
#
#
# 1) loading beadlevel data directly from scanner.
#    at moment, need to move csv files to appropriate
#    folders for each scan
#
#
# Functions to load all scans from beadlevel data
#
#arguments used
#controls
control.probes[,3] <- as.factor(control.probes[,3]);
control.probes <- control.probes[,-1];
qq.neg.rg <- list();
m.rg.id <- c("Mean.RED", "Mean.GRN");
#probes/colors
probeid.color <- merge(probe.id, probe.color, by.x="TargetID",
  by.y="TargetID");
probe.red <- subset(probeid.color, Color_Channel=="Red");
probe.grn <- subset(probeid.color, Color_Channel=="Grn");
probe <- list(red=list(M=NULL, U=NULL),
  green=list(M=NULL, U=NULL))
probe[["green"]][["M"]] <- probe.grn[,c(1,4)];
probe[["green"]][["M"]][,2] <- as.factor(probe[["green"]][["M"]][,2]);
probe[["green"]][["U"]] <- probe.grn[,c(1,3)];
probe[["green"]][["U"]][,2] <- as.factor(probe[["green"]][["U"]][,2]);
probe[["red"]][["M"]] <- probe.red[,c(1,4)];
probe[["red"]][["M"]][,2] <- as.factor(probe[["red"]][["M"]][,2]);
probe[["red"]][["U"]] <- probe.red[,c(1,3)];
probe[["red"]][["U"]][,2] <- as.factor(probe[["red"]][["U"]][,2]);
datRG <- list( #before calibration
  scan1=list(red=list(M=NULL, U=NULL), green=list(M=NULL, U=NULL)),
  scan2=list(red=list(M=NULL, U=NULL), green=list(M=NULL, U=NULL)),
  scan3=list(red=list(M=NULL, U=NULL), green=list(M=NULL, U=NULL))
)
datControls <- list( #before calibration
  scan1=list(red=NULL, green=NULL),
  scan2=list(red=NULL, green=NULL),
  scan3=list(red=NULL, green=NULL)
)
datRGi <- datRG;
datControls.i <- datControls;
channel <- c("red", "green");
m.u <- c("M", "U");
scans <- list.files(datPath, full=TRUE);
#arguments used
#begin load of data
for (a in seq(along=scans)) { #each scan settings within the directory
  datPath1 <- c(scans[a]);
  b <- list.files(datPath1);
  ll <- strsplit(b, "\\.");
  ll <- sapply(ll, "[", 1);
  for (x in seq(along=b)) {
    enter(verbose, "Scan #: ", a, "; Extracting sample #: ",
      x, " out of ", length(b));
    temp <- read.csv(paste(datPath1, b[x], sep="/"),
      header=TRUE);
    if (x==1) { #this is set as a check to add in the first set
                # of columns and subsequent adds will only
                # add necessary intensities
      for (cc in channel) {
        for (m_m in m.u) {
          if (cc=="red") {
            idcol <- c("Illumicode", "Mean.RED");
          } else {
            idcol <- c("Illumicode", "Mean.GRN");
          }
          if (m_m=="M") {
            a.b <- c("ProbeID_B");
            kk <- c(".METH");
          } else {
            a.b <- c("ProbeID_A");
            kk <- c(".UNMETH");
          }
          datRG[[a]][[cc]][[m_m]] <- merge(probe[[cc]][[m_m]],
            temp[,c(idcol)], by.x=a.b, by.y="Illumicode");
          datRG[[a]][[cc]][[m_m]] <- as.data.frame(
            datRG[[a]][[cc]][[m_m]]);
          nn <- dim(datRG[[a]][[cc]][[m_m]])[2];
          dimnames(datRG[[a]][[cc]][[m_m]])[[2]][nn] <- c(paste(ll[x],
            kk, sep=""));
          dimnames(datRG[[a]][[cc]][[m_m]])[[1]] <-
            datRG[[a]][[cc]][[m_m]][,"TargetID"]
        } #end mm
        #extract controls for each sample
        if (cc=="red") {
          m.rg.id1 <- m.rg.id[1];
        } else {
          m.rg.id1 <- m.rg.id[2];
        }
        qq <- merge(control.probes, temp, by.x="ProbeID",
          by.y="Illumicode");
        qq.neg <- qq[qq[,2]=="NEGATIVE",];
        datControls[[a]][[cc]] <- qq.neg[,c("ProbeID",
          "TargetID", m.rg.id1)];
        datControls[[a]][[cc]] <- as.data.frame(
          datControls[[a]][[cc]]);
        nnc <- dim(datControls[[a]][[cc]])[2];
        dimnames(datControls[[a]][[cc]])[[2]][nnc] <- c(
          ll[x]);
        dimnames(datControls[[a]][[cc]])[[1]] <-
          datControls[[a]][[cc]][,"ProbeID"]
      } #end cc
    } else {
      for (cc in channel) {
        for (m_m in m.u) {
          if (cc=="red") {
            idcol <- c("Illumicode", "Mean.RED");
          } else {
            idcol <- c("Illumicode", "Mean.GRN");
          }
          if (m_m=="M") {
            a.b <- c("ProbeID_B");
            kk <- c(".METH");
          } else {
            a.b <- c("ProbeID_A");
            kk <- c(".UNMETH");
          }
          yy <- merge(probe[[cc]][[m_m]], temp[,c(idcol)],
            by.x=a.b, by.y="Illumicode");
          datRG[[a]][[cc]][[m_m]] <- merge(datRG[[a]][[cc]][[m_m]],
            yy[,c(2:3)], by.x="TargetID", by.y="TargetID");
          #need to create a merge function.
          #I suspect the probe orders might be screwed up.
          nn <- dim(datRG[[a]][[cc]][[m_m]])[2];
          dimnames(datRG[[a]][[cc]][[m_m]])[[2]][c(nn)] <- c(
            paste(ll[x], kk, sep=""));
          dimnames(datRG[[a]][[cc]][[m_m]])[[1]] <-
            datRG[[a]][[cc]][[m_m]][,"TargetID"]
        } #end mm
        #extract controls for each sample
        if (cc=="red") {
          m.rg.id1 <- m.rg.id[1];
        } else {
          m.rg.id1 <- m.rg.id[2];
        }
        qq <- merge(control.probes, temp, by.x="ProbeID",
          by.y="Illumicode");
        qq.neg <- qq[qq[,2]=="NEGATIVE",];
        datControls[[a]][[cc]] <- merge(datControls[[a]][[cc]],
          qq.neg[,c("ProbeID", m.rg.id1)], by.x="ProbeID",
          by.y="ProbeID");
        nnc <- dim(datControls[[a]][[cc]])[2];
        dimnames(datControls[[a]][[cc]])[[2]][nnc] <- c(ll[x]);
        dimnames(datControls[[a]][[cc]])[[1]] <-
          datControls[[a]][[cc]][,"ProbeID"]
      } #end cc
    } # if of idx
    verbose && exit(verbose);
  } # end x
} #end a
summary(datRG);
summary(datControls);
#end load of data

#
#
# 2.a) multiScan Calibration Function begins
#
#
# calibrateMultiscan() applied to datRG object created when loading
# in data from csv files (see above).
# post calibration is stored in datRGc
#
#arguments used
fitList <- list();
#create list to store calibrated data.
datRGc <- list(red=NULL, green=NULL);
datRGc.p <- datRGc
#list to store samples of interest after calibration
datRGci <- list(red=NULL, green=NULL);
#create list to store calibrated data.
datControls.c <- list(red=NULL, green=NULL);
#list to store samples of interest after calibration
datControls.ci <- list(red=NULL, green=NULL);
YY.controls <- list();
YY.m <- list();
YY.u <- list();
names <- list();
fitList.m <- list();
fitList.u <- list();
channel <- c("red", "green");
nbrOfScans <- length(datRG);
nbrOfArrays <- 6; #always 12
nbrOfSamples <- dim(datRG[[1]][[1]][[1]])[2]; #subtract 2 because you
                                              #want to remove the first
                                              #two columns added in when
                                              #loading the data from
                                              #the csv
Constraint <- c("max", "diagonal")[1];
satSignal <- 2^16-1;
plot_yn <- c("yes", "no")[1];
#arguments used
#begin calibration
for (rg in channel) { #for each color channels
  for (samp in 1:(nbrOfSamples-2)) {
    enter(verbose, "[[", rg, "]] Calibrating sample: ", samp,
      " out of ", (nbrOfSamples-2)); #subtract 2 because
                                     #first two columns
                                     #are not samples
    for (sc in seq(along=(datRG))) {
      datPath1 <- c(scans[1]);
      b <- list.files(datPath1);
      ll <- strsplit(b, "\\.");
      chipName <- sapply(ll, "[", 1);
      YY.m[[sc]] <- datRG[[1]][[rg]][["M"]][,c("TargetID",
        paste(chipName[samp], ".METH", sep=""))];
      YY.u[[sc]] <- datRG[[1]][[rg]][["U"]][,c(
        "TargetID", paste(chipName[samp], ".UNMETH", sep=""))];
      YY.controls[[sc]] <- datControls[[1]][[rg]][,
        chipName[samp]];
      names[[sc]] <- chipName[samp];
    } #end sc (each scan setting)
    #since we are using three scans, this will be hard coded.
    #If number of scans changed, need to change
    #this code.
    #(next 3 lines)
    #sanity check
    stopifnot(identical(names[[1]], names[[2]]));
    stopifnot(identical(names[[1]], names[[3]]));
    stopifnot(identical(names[[2]], names[[3]]));
    Y.m <- cbind(YY.m[[1]][,2], YY.m[[2]][,2], YY.m[[3]][,2]);
    dimnames(Y.m)[[2]] <- c("pmt1.0", "pmt1.0", "pmt1.0");
    dimnames(Y.m)[[1]] <- paste(YY.m[[1]][,1], ".METH", sep="");
    Y.u <- cbind(YY.u[[1]][,2], YY.u[[2]][,2], YY.u[[3]][,2]);
    dimnames(Y.u)[[2]] <- c("pmt1.0", "pmt1.0", "pmt1.0");
    dimnames(Y.u)[[1]] <- paste(YY.u[[1]][,1], ".UNMETH", sep="");
    Y.controls <- cbind(YY.controls[[1]], YY.controls[[2]],
      YY.controls[[3]]);
    colnames(Y.controls) <- c("pmt1.0", "pmt1.0", "pmt1.0");
    #for plotting
    #Y.mc <- calibrateMultiscan(Y.m, average=NULL,
    #  constraint=Constraint, satSignal=satSignal);
    #will keep each new modified scan in a separate column for plotting
    #Y.uc <- calibrateMultiscan(Y.u, average=NULL,
    #  constraint=Constraint, satSignal=satSignal);
    #will keep each new modified scan in a separate column for plotting
    #Yc <- rbind(Y.mc, Y.uc);
    Y <- rbind(Y.m, Y.u);
    #storing fit/offset values for each sample
    #fitList.m[[samp]] <- attr(Y.mc, "modelFit");
    #names(fitList.m[[samp]]) <- c(chipName[samp]);
    #fitList.u[[samp]] <- attr(Y.uc, "modelFit");
    #names(fitList.u[[samp]]) <- c(chipName[samp]);

    #collapsing multiple scans into one by taking the
    #median value for each probe per sample
    #Y.mc.1 <- calibrateMultiscan(Y.m, constraint=Constraint,
    #  satSignal=satSignal);
    #will keep each new modified scan in a separate
    #column for plotting
    #Y.uc.1 <- calibrateMultiscan(Y.u, constraint=Constraint,
    #  satSignal=satSignal);
    #will keep each new modified scan in a separate
    #column for plotting
    Y.mc.1 <- as.data.frame(Y.m[,1]);
    dimnames(Y.mc.1)[[2]] <- chipName[samp]
    Y.uc.1 <- as.data.frame(Y.u[,1]);
    dimnames(Y.uc.1)[[2]] <- chipName[samp]
    Yc.1 <- rbind(Y.mc.1, Y.uc.1);
    Yc.1 <- as.data.frame(Yc.1)
    #Y.controls.1 <- calibrateMultiscan(Y.controls,
    #  constraint=Constraint, satSignal=satSignal);
    #will keep each new modified scan in a separate column for plotting
    Y.controls.1 <- as.data.frame(Y.controls[,1])
    Y.controls.1 <- as.data.frame(Y.controls.1)
    if (samp==1) {
      datRGc[[rg]] <- Yc.1;
      datRGc[[rg]] <- as.data.frame(datRGc[[rg]]);
      nn <- samp + 1; #ad hoc to force data to accept colnames
                      #change on the first column.
                      #need to remove first column when
                      #done. (see below, last line of
                      #code after for loop done.
      datRGc[[rg]][,nn] <- Yc.1;
      #change name of first column because second column
      #'nn' will be replaced in subsequent samples.
      dimnames(datRGc[[rg]])[[2]][samp] <- c(chipName[samp]);
      datControls.c[[rg]] <- Y.controls.1;
      datControls.c[[rg]] <- as.data.frame(datControls.c[[rg]]);
      nnc <- samp + 1; #ad hoc to force data to accept colnames
                       #change on the first column. need to
                       #remove first column when done.
                       #(see below, last line of code after
                       #for loop done.
      datControls.c[[rg]][,nnc] <- Y.controls.1;
      dimnames(datControls.c[[rg]])[[2]][samp] <- c(chipName[samp]);
      #change name of first column because second column 'nn'
      #will be replaced in subsequent samples.
    } else {
      datRGc[[rg]][,samp] <- Yc.1;
      dimnames(datRGc[[rg]])[[2]][samp] <- c(chipName[samp]);
      datControls.c[[rg]][,samp] <- Y.controls.1;
      dimnames(datControls.c[[rg]])[[2]][samp] <- c(chipName[samp]);
    } #end if
    #
    # plot? type 'yes' or 'no'
    #
    if (plot_yn=="yes") { #decide if plots are necessary
      #Y1 <- calibrateMultiscan(Y, average=NULL,
      #  constraint=Constraint, satSignal=satSignal);
      #Calibrate again to make fit for both M/U for plotting xy
      Y1 <- Y
      #fit <- attr(Y1, "modelFit");
      #isSat <- rowAnys(is.na(Yc));
      #print(summary(isSat));
      # Scale factors before and after calibration
      # scale0 <- colMedians(Y[!isSat,], na.rm=TRUE);
      # scale1 <- colMedians(Yc[!isSat,], na.rm=TRUE);
      #
      # # Rescale just to fit in the same plot
      # scale <- max(scale0/scale1);
      # scale <- satSignal/max(Yc[Y < satSignal],
      #   na.rm=TRUE);
      # Yc <- scale*Yc;
      figName <- sprintf("%s,%s,multiscan 1.0 repeat",
        rg, chipName[samp]);
      filename <- sprintf("%s.png", figName);
      pathname <- filePath(figPath.c, filename);
      if (!figForce && isFile(pathname)) {
        next;
      }
      width <- 640;
      devNew(png, pathname, width=width, height=1.3*width);
      layout(matrix(1:1, ncol=1, byrow=FALSE));
      par(mar=c(3,4,1,1)+0.1, mgp=c(1.8,0.8,0));
      lim <- c(0, satSignal);
      llim <- log2(lim+1);
      # Plot log2(x) densities
      plotDensity(log2(Y1), xlim=llim, lwd=2);
      legend("topleft", colnames(Y1), col=1:nbrOfScans, lwd=2)
      #added in legend for density plots
      stext(side=3, pos=0, sprintf("Chip: %s", chipName[samp]));
      #stext(side=3, pos=1, names(YList)[pp]);
      box(col=rg, lwd=3);
      # } # for (pp ...)
      devDone();
    } # for (plot yn ...)
    verbose && exit(verbose);
    #print(fitList.m[[samp]][[adiag]]);
    #print(fitList.m[[samp]][[b]]);
    #print(fitList.u[[samp]][[adiag]]);
    #print(fitList.u[[samp]][[b]]);
  } #end of samples
  if (plot_yn=="yes") {
    print(paste("Done with MULTICALIBRATION for the ", rg,
      " channel. You can check folder for plots: ", figPath.c));
  } else {
    print(paste("Done with MULTICALIBRATION for the ", rg,
      " channel."));
  } #ifelse
} # for (rg ..), each color (red and green).
print(paste(paste("Done calibrating both RED (M/U) & GREEN (M/U) for ",
  (nbrOfSamples-2)/12, " chips or ", c(nbrOfSamples-2),
  " # of samples.")));
summary(datRGc);
summary(datControls.c);
#end calibration

#
#
# 2.b) calculate detection pvalues using calibrated controls
#      (16 probes)
#
#
#arguments used
p.val <- c(0.01, 0.05)[2]
z_score <- list(red=list(M=NULL, U=NULL), green=list(M=NULL, U=NULL))
#pNorm <- list(red=list(M=NULL, U=NULL), green=list(M=NULL, U=NULL))
pValue <- list(red=list(M=NULL, U=NULL), green=list(M=NULL, U=NULL))
i <- list(M=list(red=1, green=1), U=list(red=1, green=1))
mubc <- list(M=NULL, U=NULL)
#arguments used
#detection pvalue script
for (rg in channel) {
  enter(verbose, "[[", rg, "]] Setting up Controls and Analytical data");
  dim.r <- dim(datRGc[[rg]])[1];
  dim.r1 <- dim.r/2;
  dim.r2 <- dim.r1+1;
  mubc[["M"]] <- datRGc[[rg]][1:dim.r1,];
  #extracting out meths
  mubc[["U"]] <- datRGc[[rg]][dim.r2:dim.r,];
  #extracting out unmeths
  verbose && exit(verbose);

  enter(verbose, "[[", rg, "]] Averaging MS, Us and Controls");
  #AB <- (mb+mc)/2; #taking average meth and unmeth
  ctr_avg <- apply(datControls.c[[rg]], 2, mean, na.rm=T);
  #average controls per sample
  ctr_sd <- apply(datControls.c[[rg]], 2, sd, na.rm=T);
  #sd controls per sample
  ctr_avg1 <- matrix(ctr_avg, dim.r1, dim(datRGc[[rg]])[2],
    byrow=TRUE);
  ctr_sd1 <- matrix(ctr_sd, dim.r1, dim(datRGc[[rg]])[2],
    byrow=TRUE);
  verbose && exit(verbose);

  for (mu in c("M", "U")) {
    enter(verbose, "[[", rg, "]] ", mu,
      ": Calculating ZScores and PValues");
    z_score[[rg]][[mu]] <- (mubc[[mu]] - ctr_avg1)/ctr_sd1;
    #calculating z score for each probe per sample
    pValue[[rg]][[mu]] <- apply(z_score[[rg]][[mu]], c(1,2),
      pnorm, lower.tail=FALSE);
    #calculating pnormal distribution for each probe
    #per sample
    # pValue[[rg]][[mu]] <- 1-pNorm[[rg]][[mu]];
    #calculating pvalues for each probe per sample
    verbose && exit(verbose);
  }
  enter(verbose, "[[", rg, "]] Masking (NA) values with high pvalues");
  dat <- datRGc
  pv <- pValue
  color <- rg
  dim.rp <- dim(dat[[color]])[1];
  dim.rp1 <- dim.rp/2;
  dim.rp2 <- dim.rp1+1;
  dat.m <- (dat[[color]][1:dim.rp1,])
  #extracting out meths
  dat.u <- (dat[[color]][dim.rp2:dim.rp,])
  #extracting out unmeths
  pv[[color]][["M"]][pv[[color]][["M"]] > p.val] <- NA #METH
  pv[[color]][["U"]][pv[[color]][["U"]] > p.val] <- NA #UNMETH

  #sanity create two random matrix for m and u.
  #if they are identical, recreate a new random matrix.
  #chances of this happening twice for the same position is
  #very unlikely.
  dat.m1 <- matrix(rnorm(1:dim(dat.m)[1]*dim(dat.m)[2]),
    nrow=dim(dat.m)[1], ncol=dim(dat.m)[2], byrow=TRUE)
  dat.m2 <- matrix(rnorm(1:dim(dat.m)[1]*dim(dat.m)[2]),
    nrow=dim(dat.m)[1], ncol=dim(dat.m)[2], byrow=FALSE)
  if (sum(apply(dat.m1==dat.m2, 2, sum) > 0) >= 1) {
    dat.m1 <- matrix(rnorm(1:dim(dat.m)[1]*dim(dat.m)[2]),
      nrow=dim(dat.m)[1], ncol=dim(dat.m)[2], byrow=TRUE)
    dat.m2 <- matrix(rnorm(1:dim(dat.m)[1]*dim(dat.m)[2]),
      nrow=dim(dat.m)[1], ncol=dim(dat.m)[2], byrow=FALSE)
    i[["M"]][[color]] <- i[["M"]][[color]]+1
  }
  if (sum(apply(dat.m1==dat.m2, 2, sum) > 0) >= 1) {
    dat.m1 <- matrix(rnorm(1:dim(dat.m)[1]*dim(dat.m)[2]),
      nrow=dim(dat.m)[1], ncol=dim(dat.m)[2], byrow=TRUE)
    dat.m2 <- matrix(rnorm(1:dim(dat.m)[1]*dim(dat.m)[2]),
      nrow=dim(dat.m)[1], ncol=dim(dat.m)[2], byrow=FALSE)
    i[["M"]][[color]] <- i[["M"]][[color]]+1
  }
  dat.u1 <- matrix(rnorm(1:dim(dat.u)[1]*dim(dat.u)[2]),
    nrow=dim(dat.u)[1], ncol=dim(dat.u)[2], byrow=FALSE)
  dat.u2 <- matrix(rnorm(1:dim(dat.u)[1]*dim(dat.u)[2]),
    nrow=dim(dat.u)[1], ncol=dim(dat.u)[2], byrow=TRUE)
  if (sum(apply(dat.m1==dat.m2, 2, sum) > 0) >= 1) {
    dat.u1 <- matrix(rnorm(1:dim(dat.u)[1]*dim(dat.u)[2]),
      nrow=dim(dat.u)[1], ncol=dim(dat.u)[2], byrow=FALSE)
    dat.u2 <- matrix(rnorm(1:dim(dat.u)[1]*dim(dat.u)[2]),
      nrow=dim(dat.u)[1], ncol=dim(dat.u)[2], byrow=TRUE)
    i[["U"]][[color]] <- i[["U"]][[color]]+1
  }
  if (sum(apply(dat.m1==dat.m2, 2, sum) > 0) >= 1) {
    dat.u1 <- matrix(rnorm(1:dim(dat.u)[1]*dim(dat.u)[2]),
      nrow=dim(dat.u)[1], ncol=dim(dat.u)[2], byrow=FALSE)
    dat.u2 <- matrix(rnorm(1:dim(dat.u)[1]*dim(dat.u)[2]),
      nrow=dim(dat.u)[1], ncol=dim(dat.u)[2], byrow=TRUE)
    i[["U"]][[color]] <- i[["U"]][[color]]+1
  }

  dat.m1[is.na(pv[[color]][["M"]])] <- 75000 #setting all
  #insignificant probes to 75000 (higher than true saturation)
  dat.m2[is.na(pv[[color]][["U"]])] <- 75000 #setting all
  #insignificant probes to 75000 (higher than true saturation)
  dat.u1[is.na(pv[[color]][["M"]])] <- 75000 #setting all
  #insignificant probes to 75000 (higher than true saturation)
  dat.u2[is.na(pv[[color]][["U"]])] <- 75000 #setting all
  #insignificant probes to 75000 (higher than true saturation)

  dat.m3 <- dat.m1==dat.m2
  dat.m3[dat.m3==TRUE] <- NA
  dat.m3[dat.m3==FALSE] <- 0
  dat.m <- dat.m+dat.m3

  dat.u3 <- dat.u1==dat.u2
  dat.u3[dat.u3==TRUE] <- NA
  dat.u3[dat.u3==FALSE] <- 0
  dat.u <- dat.u+dat.u3

  datRGc.p[[rg]] <- rbind(dat.m, dat.u)
  #datRGc.p[[rg]] <- getPvalue(datRGc, pValue, color=rg);
  verbose && exit(verbose);
} #end rg
#detection pvalue script
summary(datRGc)
summary(datRGc.p)
i

#
#
# 2.c) PLOTTING PRE AND POST CALIBRATION METHOD OF EVALUATION
#
#
datRGc.original <- datRGc #store before masking
datRGc <- datRGc.p #creates masking of data file.
                   #This is the working file.
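# Illustrative sketch only, not part of the original pipeline: a compact,
# vectorized form of the z-score detection p-value computed in step 2.b.
# Assumptions: `signal` is a numeric probes x samples matrix, `neg` is a
# numeric negative-controls x samples matrix with the same column order,
# and the function name detectionPvalueSketch is hypothetical.
detectionPvalueSketch <- function(signal, neg) {
  stopifnot(ncol(signal) == ncol(neg));
  ctrAvg <- colMeans(neg, na.rm=TRUE);      # per-sample control mean
  ctrSd <- apply(neg, 2, sd, na.rm=TRUE);   # per-sample control sd
  # subtract each sample's control mean, then divide by its control sd
  z <- sweep(sweep(signal, 2, ctrAvg, "-"), 2, ctrSd, "/");
  pnorm(z, lower.tail=FALSE);               # one-sided upper-tail p-value
}
# A probe would then be treated as undetected in a sample when its value
# in the returned matrix exceeds the chosen cutoff (p.val above).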
#arguments, functions used
beta.RAWi <- list( #pre
  scan1=list(
    red=list(M=NULL, U=NULL, BETA.V=NULL),
    green=list(M=NULL, U=NULL, BETA.V=NULL)),
  scan2=list(
    red=list(M=NULL, U=NULL, BETA.V=NULL),
    green=list(M=NULL, U=NULL, BETA.V=NULL)),
  scan3=list(
    red=list(M=NULL, U=NULL, BETA.V=NULL),
    green=list(M=NULL, U=NULL, BETA.V=NULL)));

beta.RAWci <- list( #post
  red=list(M=NULL, U=NULL, BETA.V=NULL),
  green=list(M=NULL, U=NULL, BETA.V=NULL));

getBeta <- function(dat, color) {
  dat[[color]][["M"]][dat[[color]][["M"]] < 0] <- 0
  dat[[color]][["U"]][dat[[color]][["U"]] < 0] <- 0
  dat[[color]][["M"]]/(dat[[color]][["M"]] + dat[[color]][["U"]])
}

##
# PRE CALIBRATION PLOTS:
# calculating betavalues and plotting each sample for
# each scan settings before calibration
##
#arguments used
set <- c("1.0", "1.25", "1.5");
v <- 1
#arguments used

#begin preC plots
#for (v in seq(along=datRGi)) { #for each scan
enter(verbose, "Plotting Scan setting");
figName <- sprintf("2 rep PREpvalue");
filename <- sprintf("%s.png", figName);
pathname <- filePath(figPath, filename);
# if (!figForce && isFile(pathname)) {
#   next;
# }
width <- 640;
devNew(png, pathname, width=width, height=1.3*width);
layout(matrix(1:4, ncol=2, byrow=TRUE));
par(mar=c(3,4,1,1)+0.1, mgp=c(1.8,0.8,0));
for (rg in channel) {
  beta.RAWi[[v]][[rg]][["M"]] <- datRG[[v]][[rg]][["M"]];
  beta.RAWi[[v]][[rg]][["U"]] <- datRG[[v]][[rg]][["U"]];
  beta.RAWi[[v]][[rg]][["BETA.V"]] <- getBeta(beta.RAWi[[v]], color=rg);
  #boxplots
  boxplot(beta.RAWi[[v]][[rg]][["BETA.V"]][,3:8], pch=".", col=rg, ylab="beta value");
  stext(side=3, pos=0, paste("PMT:", set[v], "PREpvalue"));
  box(col=rg, lwd=3);
  #density plots
  plotDensity(beta.RAWi[[v]][[rg]][["BETA.V"]][,3:8]);
  stext(side=3, pos=0, paste("PMT:", set[v], "PREpvalue"));
  box(col=rg, lwd=3);
} #end rg
devDone();
verbose && exit(verbose);
#} #end v
#end preC plots

##
# POST CALIBRATION PLOTS:
# plotting and calculating betavalues for post calibration
# using samples of interest (datRGci)
##
#begin extraction
for (rg in channel) {
  #includes normals, tumors and controls (WGA/SssI) if dealing with TCGA
  datRGci[[rg]] <- datRGc[[rg]];
}
summary(datRGci);
#end extraction

#begin plotting postC
for (rg in channel) {
  dim.r <- dim(datRGci[[rg]])[1];
  dim.r2 <- dim.r/2;
  beta.RAWci[[rg]][["M"]] <- datRGci[[rg]][1:dim.r2,];
  dim.r2 <- dim.r2+1;
  beta.RAWci[[rg]][["U"]] <- datRGci[[rg]][dim.r2:dim.r,];
  beta.RAWci[[rg]][["BETA.V"]] <- getBeta(beta.RAWci, color=rg);
}

enter(verbose, "Plotting data");
figName <- sprintf("2 rep Postpvalue");
filename <- sprintf("%s.png", figName);
pathname <- filePath(figPath, filename);
# if (!figForce && isFile(pathname)) {
#   next;
# }
width <- 640;
devNew(png, pathname, width=width, height=1.3*width);
layout(matrix(1:4, ncol=2, byrow=TRUE));
par(mar=c(3,4,1,1)+0.1, mgp=c(1.8,0.8,0));

for (rg in channel) {
  #boxplots
  boxplot(beta.RAWci[[rg]][["BETA.V"]], pch=".", col=rg, ylab="beta value");
  stext(side=3, pos=0, paste("POSTpvalue"));
  box(col=rg, lwd=3);
  #density plots
  plotDensity(beta.RAWci[[rg]][["BETA.V"]]);
  stext(side=3, pos=0, paste("POSTpvalue"));
  box(col=rg, lwd=3);
}
devDone();
verbose && exit(verbose);
#end plotting postC

# Level 1 and Level 2 data (Pre and Post Calibration):
# pre stores all 3 scans individually
summary(beta.RAWi); #M/U for each red/green + betavalue calculations
summary(datRGi); #all samples of interest
summary(datRG); #all samples
summary(datControls); #all control probes for each scan, not calibrated.

# post calibrated from 3 scans
# 27,578xN (N=number of samples of interest, e.g. tumors, normals, controls).
summary(beta.RAWci); #M/U for each red/green + betavalue calculations
summary(datRGci); #all samples of interest
summary(datRGc); #all samples after masking insignificant probes
summary(datRGc.original) #all samples before masking insignificant probes
summary(datControls.c); #calibrated controls for all samples

#clean up rownames, strip out .METH created during
#loading of data to streamline and separate out meth/unmeth.
for (rg in channel) {
  #betaV
  ll <- dimnames(beta.RAWci[[rg]][["BETA.V"]])[[1]];
  ll <- strsplit(ll, "\\.");
  ll <- sapply(ll, "[", 1);
  dimnames(beta.RAWci[[rg]][["BETA.V"]])[[1]] <- ll;
  #M
  ll <- dimnames(beta.RAWci[[rg]][["M"]])[[1]];
  ll <- strsplit(ll, "\\.");
  ll <- sapply(ll, "[", 1);
  dimnames(beta.RAWci[[rg]][["M"]])[[1]] <- ll;
  #U
  ll <- dimnames(beta.RAWci[[rg]][["U"]])[[1]];
  ll <- strsplit(ll, "\\.");
  ll <- sapply(ll, "[", 1);
  dimnames(beta.RAWci[[rg]][["U"]])[[1]] <- ll;
}

dna.m <- rbind(beta.RAWci[["red"]][["BETA.V"]],
               beta.RAWci[["green"]][["BETA.V"]])
#dropouts
drop.outs <- as.data.frame(round((apply(is.na(dna.m), 2, sum)/27578)*100, digits=0))
drop.outs$SampleID <- as.character(dimnames(drop.outs)[[1]])
ll <- as.character(dimnames(drop.outs)[[1]])
ll <- strsplit(ll, "\\_");
ll <- sapply(ll, "[", 1);
drop.outs$ChipID <- ll
dimnames(drop.outs)[[2]] <- c("PercentDropOut", "SampleID", "ChipID")
qplot(drop.outs[,"SampleID"], drop.outs[,"PercentDropOut"],
      data=drop.outs, colour=PercentDropOut>5) +
  geom_point(size=2) +
  facet_wrap(~ ChipID) +
  scale_colour_manual(values=c("FALSE"="BLACK", "TRUE"="RED"),
                      "Greater than\n5%?") +
  scale_y_continuous("DROPOUTS", limits=c(0,100)) +
  #scale_x_manual("Sample ID") +
  geom_hline(yintercept=5, colour="black", size=0.5, linetype=2)

#
############################################
# HISTORY:
# Dec 21, 2009
# o Added information for BATCH 20 (GBM, see multiple scans gbm.txt file).
# Jul 29, 2009
# o Removed Multiscan Calibration after Discussion with Peter during
#   ECBM meeting. No practical value in using the three scan settings.
#   Now, just import and process the 1.0 settings for all data.
# Jul 28, 2009
# o Added in affine Normalization.
# o Added in plot scripts to evaluate pre/post normalization.
# o Added in legends to calibration density plots
# Jul 27, 2009
# o corrected rownames in loading and calibration. Before,
#   rownames was not tracking.
# o Added controls. Now control probes are stored as the data
#   is loaded in and it's calibrated between the 3 scans.
# o Moved section 2.c to 2.b and 2.b to 2.c.
#   This is done after calibration so the detection pvalue is calculated
#   prior to any downstream analysis, plotting etc.
# o Detection pvalue formula added to calibrated control
#   probes and calibrated data.
# Jul 24, 2009
# o Removed 'X' from ann files and cleaned up code so it's
#   more robust and can fit any new data or experiments., hn
# o Added in function for calculating detection pvalues using
#   calibrated controls (section: 2.c), hn
# o Store control probes during loading of data., hn
# Jul 23, 2009
# o Reorganized code to fit pipeline flow. Moved plotting
#   codes for both pre/post calibration together., hn
# o Removed restructure code, hn.
# o Bug fix for multicalibration. sample order was screwed
#   up by the list of 4D arrays. rewrote entire code to use
#   'datRG' instead. (datRG == beadlevel data
#   imported from scanner), hn.
# Jun 7, 2009
# o Created. by houtan
#
#############################################
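As a reading aid for the listing above, the arithmetic performed by `getBeta` and by the drop-out summary can be written compactly. This is a sketch inferred from the code, not a formula stated by the author; the indices i (probe) and j (sample) are notation introduced here for illustration, and 27578 is the probe count hard-coded in the listing.

```latex
% Beta value per probe i and sample j: negative intensities are clipped to 0 first.
\beta_{ij} = \frac{\max(M_{ij}, 0)}{\max(M_{ij}, 0) + \max(U_{ij}, 0)}

% Percentage of masked (NA) beta values per sample j, rounded to a whole number.
\mathrm{PercentDropOut}_j =
  \operatorname{round}\!\left( 100 \cdot
  \frac{\#\{\, i : \beta_{ij} \text{ is NA} \,\}}{27578} \right)
```

When both clipped intensities are zero, the ratio is 0/0, which R evaluates to NaN; since `is.na(NaN)` is TRUE, such probes would be counted as drop-outs alongside the probes masked by the detection p-value step.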
Abstract
Promoter DNA hypermethylation is generally associated with transcriptional gene silencing. Studies show that silencing critical tumor suppressor genes may contribute to tumorigenesis. While the rules that govern methylation patterns at promoter CpG islands during the pathogenesis of individual cancers are still unclear, it is widely known that certain genes are methylated at a higher frequency in select tumors, whereas other genes are methylated across most tumor types.
Asset Metadata
Creator
Noushmehr, Houtan (author)
Core Title
Integrated genomic & epigenomic analyses of glioblastoma multiforme: Methods development and application
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Genetic, Molecular and Cellular Biology
Publication Date
04/25/2011
Defense Date
03/30/2011
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
bioinformatics,CIMP,DNA methylation,glioma,methods,OAI-PMH Harvest,pipeline
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Laird, Peter W. (committee chair), Buckley, Jonathan (committee member), Coetzee, Gerhard A. (committee member), Triche, Timothy J. (committee member)
Creator Email
hnoushme@usc.edu,houtana@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m3768
Unique identifier
UC1423835
Identifier
etd-Noushmehr-3840 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-449458 (legacy record id),usctheses-m3768 (legacy record id)
Legacy Identifier
etd-Noushmehr-3840.pdf
Dmrecord
449458
Document Type
Dissertation
Rights
Noushmehr, Houtan
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu