SPARSE REPRESENTATION MODELS AND APPLICATIONS TO BIOINFORMATICS

by

Roger Pique-Regi

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2009

Copyright 2009 Roger Pique-Regi

Dedication

To my parents Merce and Joan Ramon, and my sister Mariona.

Acknowledgements

I would like to thank my advisor Prof. Antonio Ortega and Dr. Shahab Asgharzadeh for their guidance and support of my research. They have always been available and enthusiastic to discuss my research, open and patient in listening to my ideas, and they provided suggestions and criticism which greatly improved the quality of my work. My regular meetings with them have been very enjoyable and helped me improve my communication and research skills. I owe a lot of my knowledge to their experience. Dr. Shahab Asgharzadeh had the patience to teach me the fundamentals of biology and oncology, and Prof. A. Ortega transmitted to me his knowledge of signal processing. They have given me the tools necessary to propose the models I used in my research.

I would also like to thank Prof. Bart Kosko for his interest in my research and for forming part of my dissertation committee, and Prof. Paul Marjoram and Prof. Keith Jenkins for their participation in my qualifying examination. It is indeed a great privilege to have their valuable comments and feedback on my work. I also want to acknowledge my collaborators in many of my research projects: Prof. D. C. Thomas, Prof. Paul Marjoram, Dr. Corina Sthir, Prof. Kim Siegmund, Lingyan Shen, Jordi Monso-Varona, Dr. Richard Sposto, Dr. Diana Abdueva and J. R. Gonzalez. I also want to thank all my professors, who have given me the tools which are the foundation of my research.

Thanks also to all my friends for the happy moments we shared and for cheering me up when I was having a hard time: Jose Ramon, Mahesh, Mona, Yvonne, Jonathan, Julia, Ivona, Vicky, Stavros, Victor, Andreu, Jordi, Selina, Quimi, Anna and Yesim. I am especially grateful to my housemate and friend Cintia for her encouragement and support. I am also indebted to all the people in the signal and image processing group, and especially Ivy, Zihong, Polin, InSuk, Sphinx, Carlos, JaeHoon, Hye-Yeon, Ngai Man, Hiusheng, Talya and Gloria. I also feel sorry for not having been able to be closer to my old friends Carlos, Joan, Juan, Jordi and Miguel in Barcelona.

Finally, I would like to express my deepest gratitude to my parents Joan Ramon and Merce and my sister Mariona for their unconditional love. You have always supported me in pursuing my dreams and you encouraged me during my hardest times. I also feel indebted to the rest of my family, and very especially to my grandparents Dolores, Senen and Rosita. It is always a regret not to have been able to spend much time with you during these years. In the course of completing my studies I met Nadia, who filled my heart with happiness. I cannot imagine how lonely I would feel without you. I am starting a new journey with you and I am very excited to pursue new dreams together.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
    1.1 Significance and scope of the research
    1.2 The Human Genome
    1.3 Microarray technology
        1.3.1 Issues on analyzing microarray data
    1.4 Research contributions

Chapter 2: Sparse Linear Discriminant Analysis of gene expression microarrays
    2.1 Introduction
    2.2 Background
        2.2.1 Linear Discriminant Analysis
        2.2.2 Feature subset selection (FSS) approaches
        2.2.3 Diagonal Linear Discriminant Analysis (DLDA)
        2.2.4 Model selection and evaluation with cross-validation
    2.3 SeqDLDA – Sequential Diagonal Linear Discriminant Analysis
    2.4 BDLDA – Block Diagonal Linear Discriminant Analysis
        2.4.1 Block Diagonal LDA – BDLDA
        2.4.2 Greedy Feature and Model Selection for Block Diagonal LDA
            2.4.2.1 Feature addition scoring procedure
            2.4.2.2 Model selection with cross-validation
        2.4.3 Relationship with other LDA methods and applications
        2.4.4 Repeated feature subset selection BDLDA (Rep-BDLDA)
    2.5 Results
        2.5.1 Application of DLDA to neuroblastoma
        2.5.2 Evaluation of SeqDLDA in four microarray datasets
        2.5.3 Simulation results with SeqBDLDA
        2.5.4 Results with Rep-BDLDA on microarray data
    2.6 Conclusions

Chapter 3: Sparse representation and Bayesian detection of genome copy number alterations from microarray data
    3.1 Introduction
    3.2 PWC vector representation of genomic data
        3.2.1 Properties of the PWC representation
    3.3 Formulation of the breakpoint detection problem
    3.4 Sparse Bayesian Learning (SBL)
        3.4.1 Implementation of SBL to find copy number alterations
    3.5 Breakpoint ranking by Backward Elimination
        3.5.1 The role of the T parameter in BE ranking
    3.6 Segment Alteration Detection
    3.7 GADA approach to CNA detection
    3.8 Simulation Results
        3.8.1 Simulated CGH data and evaluation metrics
        3.8.2 GADA approach compared to greedy search methods
        3.8.3 GADA approach compared to other CNA detection methods
    3.9 Evaluation with microarray data
        3.9.1 Neuroblastoma genomic data from array platforms
        3.9.2 Computational speed in commercial microarray platforms
        3.9.3 Comparison of neuroblastoma CNA detection using different array platforms
    3.10 Conclusions

Chapter 4: Bayesian detection of recurrent copy number alterations across multiple array samples
    4.1 Introduction
    4.2 N-GADA for multiple samples with shared breakpoints
        4.2.1 Sparse Bayesian Learning for multiple samples with shared breakpoints
        4.2.2 Backward Elimination for multiple samples with shared breakpoints
    4.3 Results
    4.4 Conclusions

Chapter 5: Joint estimation of copy number variation and reference intensities on multiple DNA arrays using GADA
    5.1 Introduction
    5.2 GADA model with separate median normalization (GADA-SMN)
    5.3 GADA model with joint reference normalization (GADA-JRN)
        5.3.1 Fitting the model with the EM algorithm
        5.3.2 GADA-JRN model with a scale parameter for the bias
    5.4 Backward Elimination
    5.5 Performance metrics and evaluation methods
    5.6 Results
        5.6.1 Simulation datasets
        5.6.2 Affymetrix SNP 6.0 data set description and normalization
        5.6.3 Results with simulated data
        5.6.4 Results with Affymetrix microarray data
        5.6.5 Simulation results with a scale effect
        5.6.6 Scale effect on the Affymetrix data
        5.6.7 Impact of the batch effects on the Affymetrix dataset
    5.7 Discussion
    5.8 Conclusions
Chapter 6: Bayesian hierarchical modeling of means and covariances of gene expression data within families
    6.1 Introduction
    6.2 Methods
        6.2.1 Statistical model
        6.2.2 Subjects, genotypes, and phenotypes
    6.3 Results
    6.4 Discussion

Chapter 7: Conclusions

Appendix A: The role of the parameter a in SBL
Appendix B: Backward Elimination algorithm properties
Appendix C: Adjustment of the SBL and BE parameters in GADA
    C.1 Experiments adjusting a and T in GADA
    C.2 Strategy to adjust a and T in GADA
    C.3 Sensitivity to the adjustments of a and T

Bibliography

List of Tables
    1.1 Different types of microarrays
    2.1 Sequential generation of candidate covariance matrix models for LDA
    2.2 Average cross-validation error, number of selected genes and standard deviation (SD)
    2.3 Average error rate (standard deviation) for microarray data
    3.1 Relationship between signal processing methods for overcomplete expansions and methods in statistics for variable selection in multiple regression
    3.2 Simulated datasets categorized on the number of breakpoints and segment lengths
    3.3 Possible outcomes for each candidate breakpoint position
    3.4 Sensitivity and FDR dependence on datasets of different complexity
    3.5 Average analysis time (seconds) for Affymetrix and Illumina microarrays
    3.6 Significant copy number alterations found in four neuroblastoma cell lines
    3.7 Copy number breakpoints found on all platforms (SK-N-BE2 and SMS-KAN)
    3.8 Copy number breakpoints found on all platforms (LAN-6, CHLA-20)
    3.9 Additional copy number breakpoints found by at least two platforms
    3.10 Differences in copy number breakpoint placing between chips
    5.1 Consistency on HapMap trios
    5.2 Comparison on HapMap trio consistency FTCR
    6.1 Top ranking associations
    6.2 Linkage of residual gene expression variation after association

List of Figures
    1.1 Gene expression in a cell
    1.2 SNP allele transmission and recombination
    1.3 Affymetrix microarray design
    1.4 Affymetrix gene expression microarray assay
    2.1 Graphical interpretation of the DLDA model for two features
    2.2 Hard and soft thresholding functions
    2.3 Two possible scenarios for selecting correlated features in a DLDA model
    2.4 Linear discriminant scenario with two correlated features
    2.5 Classification error plot showing the percentage of neuroblastoma patients misclassified using DLDA models of different size
    2.6 SeqBDLDA classification performance, Toeplitz covariance matrix
    2.7 SeqBDLDA classification performance, block covariance matrix
    3.1 Copy number graphical observation model
    3.2 Step vector in the PWC basis representation
    3.3 SBL and BP sparseness metrics compared to the desired l0 norm
    3.4 PROC operational curves for GADA vs. greedy search methods
    3.5 Median sensitivity and FDR for detecting known copy number changes
    3.6 PROC operational curves for sensitivity vs. false discovery rate
    3.7 Inferred copy numbers from neuroblastoma cell lines
    4.1 PROC operational curves for 1-GADA vs. M-GADA
    4.2 Visual representation of the detected CNA using different algorithms and settings
    5.1 Copy number detection block diagrams
    5.2 Illustration of the observation model
    5.3 Simulation model with measurement bias of only one type
    5.4 Variability of the copy number estimates if the set of reference samples changes
    5.5 Consistency of the copy number estimates on HapMap trios if the set of reference samples changes
    5.6 Consistency within HapMap trios using a different sparseness setting
    5.7 Section of chromosome 17 that contains an already known CNV
    5.8 Example of a complex copy number section of Chr. 17 within a HapMap trio
    5.9 Computational time required to fit the models GADA-JRN and GADA-SMN
    5.10 Simulation model with measurement bias with different amplitudes
    5.11 Consistency within HapMap trios using a different sparseness setting
    5.12 Consistency within HapMap trios when two plates (CEU and YRI) are analyzed separately or together using the GADA-SMN and GADA-JRN algorithms
    5.13 Pairwise Spearman's correlations between different signals
    5.14 Simulation model with a batch effect
    6.1 Directed acyclic graph for the analysis model
    6.2 Gene expression x genotype associations and residual linkage summary
    6.3 Potential master regulatory region around rs916482 SNP
    6.4 Gene Ontology on potential master regulatory region
    A.1 Plot of the SBL marginal prior distribution on a single weight
    C.1 Breakpoint set concordances and sparseness in different situations
    C.2 Sensitivity to different choices of a on the PR operational curves

Abstract

Microarrays and new sequencing techniques offer a high-throughput platform to study the whole genome, with the unprecedented capability of measuring millions of genomic features in a single assay. This massively parallel measurement power has an enormous potential for research in Biology and Medicine, with the ultimate objective of identifying and learning the biological processes occurring in different organisms and diseases. Existing model learning methods are severely limited by the reduced number of samples that are usually available compared to the number of measurements.

We propose that sparse signal representations can model these biological signals, and we develop the analytical tools to accurately extract the relevant information. We exploit the underlying sparseness of the biological model to overcome some of the problems associated with analyzing these massive datasets. This work demonstrates the potential benefits of this approach by studying three different problems involving microarray data.

The first problem concerns the supervised design, with a limited amount of training samples, of a classification procedure to predict tumor progression. We propose a greedy search strategy to select a sparse feature subset with a block diagonal covariance matrix structure to build a linear discriminant analysis model for tumor prognosis. The second problem deals with the detection of copy number alterations. We develop a maximally sparse representation for these copy number alterations, and a sparse Bayesian learning approach is optimized to detect these alterations from noisy microarray observations. The third problem involves finding genetic determinants of gene expression. In this case, we propose a linear regression model with a sparse Bayesian prior on the large matrix of regression coefficients relating genome alterations to transcription levels.

Chapter 1

Introduction

1.1 Significance and scope of the research

In recent years we have witnessed a tremendous advance in technologies to extract biological data from living organisms. Automated DNA sequencing has made it possible to obtain a reference sequence for the human genome (Human Genome Project - HGP [16]) and many other organisms. From these DNA sequences of 3 billion base pairs (the AGCT genetic code), one can identify which portions are genes that are transcribed to RNA to produce a protein.
It is estimated that there are about 25,000 human genes, and these are used as a blueprint to build about 100,000 different proteins, which are responsible for running the biological functions of human cells. Although 99.9% of the genome is identical among all humans, small differences in the form of Single Nucleotide Polymorphisms (SNPs), inversions, and copy number alterations (CNAs, e.g., deletions and duplications) give rise to the rich variability between individuals. New technologies such as microarrays provide the means to measure gene expression activity (RNA arrays) and genomic alterations (SNP and aCGH arrays) with millions of probes along the genome [81]. Other techniques have been developed to study protein levels, protein structure, DNA methylation and many other biological processes occurring in the cell and in living organisms.

These advances have contributed to the discovery of key new findings in Biology: new genes, new functionalities, alternative gene splicing, gene silencing, copy number polymorphisms and many others. These discoveries have also tremendously affected other related fields. In medicine [15,35], they have led to new molecular diagnostic procedures, unveiled the underlying biological processes of some diseases, and guided the development of new drugs. Despite technical advances, severe noise degradation of the measurements due to cross-hybridization and other biological effects poses a challenge when trying to extract reliable information from these large sets of data. Accurate and computationally efficient methods are essential for detecting genetic differences and genomic alterations, and for identifying which genes are active or corregulated.

The interdisciplinary field of bioinformatics has been growing quickly and has led to the development of increasingly efficient and reliable tools for the analysis of very large biological datasets. The key features of these datasets are: i) a small number n of samples (n ~ hundreds), ii) a large number p of measurements (p ~ millions), and iii) sparseness of the underlying biological models. For example, a microarray can interrogate millions of positions along the genome, with dozens of probes for all the known 25,000 genes of the human genome, but only a very reduced set (about 10 to 100) will be active at a given time in a given tissue. Gene activity is also regulated through networks with a sparse number of interactions. Alterations along the genome are also a rare event, and only a very small subset have a role in explaining differences in gene activity. In conclusion, though the number of measurements is very large, the number of underlying variables that are involved in regulating the biological process is quite small.

This research is one of the first to apply recent advances in sparse signal representations and machine learning to develop new bioinformatics tools. Sparse signal representations seek to minimize the number of elements, chosen from a dictionary of signals, necessary to approximate (e.g., as a linear combination) special classes of signals (e.g., smooth, piecewise constant, splines, etc.). Large efforts have been devoted to designing the most adequate bases (e.g., wavelets) or overcomplete collections of bases to build sparse representations for different classes of signals (e.g., images, speech and audio, econometric, biomedical) [19,50,62,102].
Fast and accurate optimization algorithms for finding the optimal sparse representations have been proposed (e.g., matching pursuit [61,70], basis pursuit [13] and sparse Bayesian learning [97,104]). The theoretical properties of these methods are still under study, but important results have been obtained [20,98] for applications such as signal denoising [50] and compressive sensing [12,48,99]. New classification techniques have been developed, such as support vector machines and regularized linear discriminants, with similar optimization algorithms used to search for the sparsest models (least number of features) with the highest prediction accuracy. Examples of my research include developing sparse representations for detecting genome alterations and models for classifying cancer tumors. These methods exploit the sparseness of the underlying biological models to overcome the problem of extracting meaningful and reliable predictions from a dataset with a very large number of measurements compared to the number of observed samples, i.e., $p \gg n$.

1.2 The Human Genome

This section provides a very short introduction to the human genome and is not intended to provide a detailed description. The objective of the next two sections is to provide a very basic understanding of the biological processes and experimental techniques that are covered in this dissertation. A more detailed and precise description of the underlying biology can be found in molecular biology textbooks [2].

All the (somatic) cells of the body contain a full set of chromosomes with identical genes, i.e., exactly the same DNA sequence, which is known as the genome. In 2003, the Human Genome Project completed the reference DNA sequence for the human genome, which can be browsed along with annotation information from many different sources such as Ensembl [45] (http://www.ensembl.org), the University of California Santa Cruz (UCSC) browser [52] (http://genome.ucsc.edu/), and the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/). With the completed sequence, efforts are now centered on finding and analyzing polymorphisms, i.e., common alterations of the reference sequence, and on finding and studying gene expression in the pieces of the genome that carry out basic functions in the organism.

A gene is a section of the genome that encodes the information necessary to produce functional products, proteins, that accomplish some function in the cell or the organism. A gene is said to be activated if it is being read and generates the protein that it encodes, in a process illustrated in Figure 1.1. In any given cell at any given time, only a small subset of all the ~25,000 genes that compose the genome are active, i.e., they are being "expressed". The expression levels of a gene also change dynamically and are regulated by a complex network of pathways in which other genes produce the RNA and proteins (transcription factors) that regulate other genes. The proteins and RNA can also interact with each other, assembling into protein complexes, reshaping their structure, and silencing the mRNA (miRNA and siRNA). Gene expression microarrays provide the technological means to measure transcription levels of hundreds of thousands of DNA fragments that are expressed.

Figure 1.1: Gene expression in a cell (permission for use: (c) Transgene S.A.). The DNA sequence of the gene (1) is copied into RNA sequences in a process called transcription. These sequences are then spliced and assembled into mRNA (2).
This mRNA serves as a blueprint to build a protein (3) in a process called translation.

The DNA sequence of the genome of any two given individuals is 99.9% identical. Differences in the sequences are called mutations, and may or may not affect some of the many observable traits that distinguish two persons. The most frequent of these sequence alterations are called polymorphisms, i.e., common variations of the reference sequence that are also found in a large number of other individuals (see Figure 1.2 for an example). Each polymorphism is located at a particular position of the genome, a locus, and the possible sequence variants are called alleles. Since for most of our genome we have two copies of the DNA, the combination of the alleles of each copy defines the genotype. The biological process of meiosis directs how the DNA material is replicated and the alleles transmitted to the offspring. Alleles that are close together in a chromosome are transmitted as a block, a haplotype, and those that are in different chromosomes (or far away in the same chromosome) are independently transmitted. This haplotype structure is a consequence of recombination events during meiosis in which the parental DNA copies are crossed over. Since genotypes of proximal loci are linked together, we can locate the position of rare events (e.g., disease traits) by association to genotypes of markers whose positions are known. Genotyping microarrays are able to genotype thousands (now millions) of Single Nucleotide Polymorphism (SNP) markers scattered along the genome. These experiments can be used to locate and characterize this human genetic variation.

    Human genome:     … AGCAAA(T/A)GC…CAG(G/C)TAGCT …
    Dad's genotype:   … AGCAAA A GC…CAG G TAGCT …
                      … AGCAAA T GC…CAG C TAGCT …
    Mom's genotype:   … AGCAAA A GC…CAG C TAGCT …
                      … AGCAAA T GC…CAG G TAGCT …
    Son2 Non-Recomb:  … AGCAAA A GC…CAG G TAGCT … (from dad)
                      … AGCAAA A GC…CAG C TAGCT … (from mom)
    Son1 Recombinant: … AGCAAA T GC…CAG G TAGCT … (from dad)
                      … AGCAAA T GC…CAG G TAGCT … (from mom)

Figure 1.2: SNP allele transmission and recombination. A Single Nucleotide Polymorphism (SNP) is a type of genomic alteration that affects only one nucleotide of the DNA sequence and has only two possible states, i.e., alleles. Each descendant gets one copy of the autosomal genome from each of their parents, and the combined alleles define his genotype. Alleles of proximal SNPs are linked together unless there is a recombination event.

1.3 Microarray technology

Microarrays are a new type of biological assay with very high throughput capabilities, i.e., the ability to perform a very large number of measurements in each experiment. This technology exploits the ability of RNA and DNA to bind specifically to, or hybridize to, the complementary DNA template from which it originated. The basic design consists of a solid support onto which relatively short DNA sequences (probes) from thousands of different sections of the genome are immobilized at fixed locations; see Figure 1.3. The DNA or RNA is extracted from the sample to study, fragmented, and fluorescently tagged; then, these tagged fragments preferentially bind to target probes with the exact complementary DNA; and finally, the array is washed of unattached fragments and the fluorescent intensity of each probe is measured.

Figure 1.3: Affymetrix microarray design. (Image courtesy of Affymetrix.
Permission for use: (c) Affymetrix 2007.)

Different types of microarrays have been developed to measure genome features such as gene expression, genotypes, or copy number (see Table 1.1). The basic approach is the same for all of them; the only change comes from the target DNA sequences that are on the chip and the type of genetic material that is extracted. A gene expression array assay is depicted in Figure 1.4. In this case, the material extracted from the cell to study is RNA, and the microarray contains probes targeting RNA transcripts. The hybridization intensity of each probe gives a measure of the gene transcription level. In SNP genotyping arrays, DNA is the material extracted from the cells; it is then cut in known places by restriction enzymes and fragmented into small pieces that contain only one SNP. The array now contains probes with the complementary DNA sequences targeting the two possible SNP alleles (see SNP in Figure 1.2). The hybridization levels associated with each allele can be used to infer the genotype or the copy number.

Table 1.1: Different types of microarrays

Type                     | Measurement                                                                            | Application
SNP                      | Hybridization intensities of fragmented DNA to the two different SNP allele variants   | Genotyping. Association studies. Drug response. Genome variation. Copy number alterations.
aCGH                     | Hybridization intensities to large sections of DNA                                     | Copy number alterations.
Expression (exon arrays) | Hybridization of fluorescently tagged RNA transcripts to its complementary coding DNA fragments | Gene expression. Gene regulation and function. Molecular profiling. (Alternative gene splicing.)

1.3.1 Issues on analyzing microarray data

The high throughput capabilities of these microarrays have an enormous potential for research in Biology and Medicine. For example, microarray gene expression studies have been performed to find which genes are active (being transcribed) in different types of tissues, in different cell conditions, or in response to an administered drug or treatment. Other applications in medicine include the development of new diagnosis and risk assessment procedures. Similarly, genotyping arrays have been used to define groups at high risk, to find a causal genetic locus for a disease, and to study differences in treatment response.

Figure 1.4: Affymetrix gene expression microarray assay. The microarray assay can be divided into three steps: a) sample preparation, b) hybridization, and c) washing and scanning. In the first step a), the mRNA is extracted from the cells to study and converted to cDNA (reverse transcription); the cDNA is cut into small fragments of ~26 bases in length, and these fragments are labeled with fluorescent tags. Then, in the hybridization step b), the fragments in the previous preparation specifically bind to the complementary short segments of DNA that are specially selected and immobilized at specific locations on the array. Finally, in c), the array is washed and scanned, giving a fluorescent intensity reading. Locations corresponding to expressed genes should have higher intensity than those corresponding to non-expressed genes. (Image courtesy of Affymetrix. Permission for use: (c) Affymetrix 2007.)

The large volume of data generated by microarrays poses new issues that require the development of new data analysis methods that are computationally efficient and statistically reliable. These issues are essentially three: i) normalization, ii) a large number of variables, and iii) a small number of samples.

Normalization.
The measurements obtained from microarrays are not perfect observations. Sample extraction and hybridization processes are affected by a large number of biological factors. Some of these factors can be modeled and corrected by appropriate normalization procedures. However, there may always be unknown effects, such as cross-hybridization, RNA degradation, or other sources of experimental error that are out of our control. In any case, it is important to define normalization procedures that make the microarray measures comparable across different samples, and that transform the data such that we can safely assume an appropriate distribution (e.g., Gaussian).

Large number of variables. Continuous improvements in array technology and sequencing are increasing the number of measurements, which are currently on the order of millions. Gene expression arrays and new exon arrays contain hundreds of thousands of probes targeting all the genes and their exon transcripts. The newest genotyping arrays cover nearly a million SNPs along the genome. However, it is expected that only a very small (sparse) subset of these variables will typically be related to the research questions that we are trying to answer with the microarray experiment (e.g., those given as examples at the beginning of this section). Thus, efficient as well as reliable statistical methods are required for screening these large sets of variables to build models that are likely to be biologically meaningful.

Small number of samples. In contrast to the number of variables, the number of samples is typically very small (~100). This requires the statistical procedures used for classification evaluation and for testing genes for association to be very powerful, i.e., efficient in the number of samples. In classification, splitting the available samples into independent sets for model fitting, selection, and evaluation is very inefficient, so computationally intensive algorithms such as cross-validation and bootstrapping approaches have to be used. Additionally, permutation-based tests are used to assess the proportion of genes that could be falsely deemed significant just by chance (i.e., the False Discovery Rate, FDR), due to the large number of noisy predictive variables being evaluated. In conclusion, the large number of variables together with the reduced number of samples makes the problem of robustly estimating the underlying biological models very challenging.

In order to overcome these challenges, prior knowledge about these biological models should be exploited. There exist a large number of databases that gather relevant information on a large number of genes, gene regulation properties, pathways, haplotype structures, and other potentially useful information that is continuously updated as ongoing research progresses. Currently, methods for combining these different sources of information are very limited, and only very simple models are being used due to the limited number of samples that are available to fit the models. However, in the foreseeable future it is very likely that more complex algorithms will exploit all this prior knowledge that is being gathered.

1.4 Research contributions

The research contributions of this work have been grouped into three major parts, each one dealing with a different problem related to microarray analysis. The nexus among the three of them is that in all three cases linear models with sparseness constraints are adopted.
In other words, the solution to the problem consists of finding a sparse linear combination that depends on only a small subset of all microarray probes. The adoption of sparse linear models is biologically supported by the underlying assumption that any basic cellular process is controlled by a very small portion of the genome.

In Chapter 2 we study the design and evaluation of molecular classifiers that are based on linear discriminant analysis (LDA) of gene expression microarrays. Different options to select the genes and to place sparseness constraints on LDA are studied. We start by reviewing DLDA, a widely used method in which correlations are completely ignored (i.e., assumed to be 0), and we discuss the application of DLDA to analyze gene expression profiles for the prognosis of tumor progression in neuroblastomas [6]. Then, we consider a new gene selection algorithm, SeqDLDA [78], which under the DLDA model selects the genes that are better modeled by the non-correlation assumption. Afterwards, SeqBDLDA [77] proposes an embedded approach in which the genes and a block diagonal covariance structure are jointly selected to fit a linear discriminant model. LDA and the developed feature selection procedures provide a flexible framework for microarray gene expression analysis, in which models with different degrees of complexity can be adopted depending on the amount of available training samples or prior knowledge.

In Chapters 3, 4 and 5 we tackle the problem of detecting genome copy number alterations (CNAs) using microarray data [75]. Studying copy number alterations is important to understand the role of natural copy number variations in human genomes. It is also essential in understanding cancer cells, where genomic instability leads to large abnormalities in genome copy numbers. The hybridization intensities from SNP probes in genotyping arrays, or from specially chosen probes in aCGH, are correlated with the underlying number of DNA copies of their corresponding genome regions, but are severely degraded by noise. In our approach, we model the genome copy number as a piecewise constant (PWC) vector for which a sparse representation is formulated. Then, sparse Bayesian learning (SBL) is optimized for the proposed PWC representation and used to detect CNA breakpoints. Moreover, a backward elimination (BE) procedure is used to rank the inferred breakpoints, and a cut-off point can be adjusted to control the false discovery rate (FDR). The performance of our algorithm is evaluated using both simulated and real genome datasets and compared to other existing techniques. Our approach achieves the highest accuracy and lowest FDR, as compared to the state of the art, while improving computational speed by several orders of magnitude.

In Chapter 6, we describe a novel hierarchical Bayesian model for the influence of constitutional genotypes from a linkage scan on the expression of a large number of genes. This work can be considered one of the initial steps to find genetic determinants of gene expression at a genome-wide scale. The proposed model comprises linear regression models for the means in relation to genotypes, and for the covariances between pairs of related individuals in relation to their identity-by-descent estimates. The matrices of regression coefficients for all possible pairs of SNPs by all possible expressed genes are sparse, and modeled as a mixture of null values and a normal distribution of non-null values.
The approach appears to be a promising way to address the huge multiple comparisons problem of relating genome-wide genotype-by-expression data.

Chapter 2

Sparse Linear Discriminant Analysis of gene expression microarrays

2.1 Introduction

Gene expression microarrays can measure the expression values of thousands of genes in the same experiment. A large number of fundamental biological and medical research questions involve identifying which genes and mechanisms are responsible for determining, for example, tumor progression, blood pressure or drug response. In a supervised learning approach we are given a training group of samples for which the outcome of the variable of interest is known. These examples can thus be used to fit a discriminant function to predict the outcome. In microarray experiments the challenge is that the number of samples ($n \sim 100$) is very small compared to the number of features/probes ($p > 10000$). These discriminant methods are required to (i) generalize well, i.e., have a low prediction error for future samples; (ii) be sparse, i.e., be based on a small subset of features; and (iii) have a low False Discovery Rate, i.e., have a large proportion of truly relevant features that can also be confirmed by other studies.

Linear discriminant analysis (LDA) [22,40] is a well known supervised classification technique in which the discriminant functions are linear combinations of the features that obtain a good separation between classes. If we consider a classification problem with two classes, the degree of separation can be measured by taking the squared distance between the centroids divided by the variance. In the context of microarrays, we are faced with a very large number of features and very few training samples. Thus, the sample covariance matrix is singular and does not give a reliable estimate of the true covariance matrix to be used to fit the linear discriminant [105]. Two of the most popular techniques in microarray classification, diagonal linear discriminant analysis (DLDA) [23] and nearest shrunken centroids (NSC) [96], are based on LDA and solve this problem by imposing a diagonal structure on the covariance matrix and using only a small subset of the topmost discriminative features to build the classifier.

In this chapter we review DLDA and apply the method to build a prediction model for the prognosis of neuroblastoma tumor progression (published as part of a medical study for neuroblastoma in [6]) that incorporates a novel selection and evaluation method for DLDA based on nested cross-validation. This method is useful for deciding which genes to include in the DLDA model, and for accurately estimating the prediction error for future samples. The proposed model selection and evaluation methods have been positively received and have been suggested as useful tools for medical studies [92].

Afterwards, we introduce novel alternative strategies for constructing the linear discriminant function. First, SeqDLDA (proposed in [78]) in Section 2.3 keeps the DLDA model but proposes a wrapper feature subset selection (FSS) approach that does not ignore correlations. In SeqDLDA, one gene is sequentially added and the linear discriminant is recomputed using the DLDA model (i.e., a diagonal covariance matrix). Classical DLDA instead adds the gene with the highest t-test score without checking the resulting model. In contrast, SeqDLDA will find the one gene that best improves class separation after recomputing the model.
SeqBDLDA (proposed in [78]) in Section 2.4 extends the DLDA model to consider a block diagonally structured covariance matrix. In this case we adopt a novel embedded FSS approach, in which each feature is added sequentially to the model, either as an independent block or inside one of the previously existing blocks of the covariance matrix. At each step, the best feature and block model are decided by measuring the class separation after computing the resulting linear discriminant of that feature set and its corresponding block diagonal covariance matrix. This is the first time that such a joint design of the model with the feature selection has been proposed in the context of LDA.

In order to reduce the complexity of exploring a large number of BDLDA models, an optimized repeated FSS (RFSS) search strategy (developed in collaboration with Lingyan Shen [91]) is also proposed. These new contributions show considerable improvement in prediction accuracy both in simulated and real datasets, especially in the cases where more training samples are available (Section 2.5). Additionally, the more complex block diagonal modeling could become even more promising in the future, since it could be used to exploit prior knowledge on gene corregulation; i.e., the block structure could be estimated from previous experiments or genome databases.

2.2 Background

2.2.1 Linear Discriminant Analysis

In statistical pattern recognition problems, Bayes decision techniques provide optimal classification performance as long as the distribution of the samples is known [22]. In many practical cases, these distributions are not known and they must be learned from training data. In the context of microarrays, simple linear models are preferred because of the limited number of samples available relative to the number of features.

Linear discriminant analysis (LDA) is a widely used technique for sample classification [22,40]. For two classes, LDA is defined by a linear discriminant function $g(x)$:

\[
g(x) = w^t x - b
\begin{cases}
> 0 \Rightarrow \text{Class A} \\
< 0 \Rightarrow \text{Class B}
\end{cases}
\tag{2.1}
\]

where $x$ is the vector containing the gene expression of the sample to classify, and $w$ is a vector of weights orthogonal to the hyperplane that, together with the scalar $b$, defines the decision boundary $g(x) = 0$ that discriminates between the two classes to separate. If the samples are normally distributed with known means and variances, $f_A(x) \sim \mathcal{N}(m_A, K)$ and $f_B(x) \sim \mathcal{N}(m_B, K)$, the optimal Bayes maximum a posteriori classifier is given by the LDA discriminant function with:

\[
w = K^{-1} d, \qquad d = m_A - m_B
\tag{2.2}
\]
\[
b = \frac{w^t (m_A + m_B)}{2} - \ln\frac{\pi_A}{\pi_B}
\tag{2.3}
\]

where $\pi_A$ and $\pi_B$ are the prior probabilities of each class, and $d$ is the vector that links the two class centroids.

In a more general sense this solution is also reasonable (but not necessarily optimal) if the distributions are symmetrically bell-shaped (like a normal), because it gives the direction in which the between-class variance relative to the within-class variance, $J_K(w)$, is maximized; i.e.,

\[
J_K(w) = \frac{(d^t w)^2}{w^t K w}
\tag{2.4}
\]
\[
w = \arg\max_w J_K(w) = K^{-1} d
\tag{2.5}
\]

In practical applications, the mean vectors and the covariance matrix are not known and have to be estimated from training data, typically using maximum likelihood (ML) estimators. However, choosing the ML estimators is only asymptotically optimal [30], i.e., $\hat{w} \to w$ only as the number of training samples grows. Since in our case the number of features $p$ is larger than $n$, this convergence will never happen.
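As an illustration of (2.1)-(2.3) and of why the plug-in ML estimate breaks down, here is a minimal numpy sketch (not from the dissertation; all values are made up for the example). It builds the discriminant from known means and covariance, and then shows that a $p \times p$ sample covariance computed from $n < p$ samples is rank-deficient and hence not invertible:

```python
import numpy as np

def lda_discriminant(m_a, m_b, K, pi_a=0.5, pi_b=0.5):
    """LDA weights and bias from class means and covariance, eqs. (2.2)-(2.3)."""
    d = m_a - m_b                                   # vector linking the class centroids
    w = np.linalg.solve(K, d)                       # w = K^{-1} d
    b = w @ (m_a + m_b) / 2.0 - np.log(pi_a / pi_b)
    return w, b

rng = np.random.default_rng(0)
p, n = 50, 20                                       # more features than samples
m_a, m_b = np.ones(p), -np.ones(p)
w, b = lda_discriminant(m_a, m_b, np.eye(p))
x = rng.normal(loc=m_a)                             # a sample drawn around the class-A centroid
print("Class A" if w @ x - b > 0 else "Class B")    # decision rule (2.1)

# With n < p the sample covariance has rank at most n-1 < p, so it is
# singular and K^{-1}d cannot be computed reliably (discussed next).
K_hat = np.cov(rng.normal(size=(n, p)), rowvar=False)
print(np.linalg.matrix_rank(K_hat), "<", p)
```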
Indeed, in this scenario the regular ML estimates are very misleading, because i) the estimates are unreliable, and ii) the sample covariance matrix is singular. A $p \times p$ sample covariance matrix $\hat{K}$ has rank at most $n-2$. Thus, the null space of this matrix has dimension at least $p-n+2$, giving the false impression that the natural variation is 0 in this subspace. Any spurious difference between the estimated class mean vectors in this subspace will then be falsely regarded as very significant.

When $n$ and $p$ are comparable, different authors [30,34,39,41] have proposed a regularized solution to the problem obtained by assuming some structure in the covariance matrix (e.g., a diagonally dominant covariance matrix). This has the advantage of reducing the effective number of parameters that need to be estimated, and the risk of obtaining a sample covariance matrix that is singular. However, when $n$ is much smaller than $p$, regularization alone is not enough to achieve reliable classification, and it is necessary to further simplify the model by discarding features, i.e., by selecting a reduced feature set.

Feature selection is in fact almost always needed in the context of microarray genomic classification, where $p$ is on the order of tens of thousands of genes while $n$ corresponds to a few hundred tissue samples. Taking cancer as an example, it is typically expected that only a few genes will be associated with the disease. Thus, feature (i.e., gene) selection serves the dual purpose of i) reducing the effect of a small training set on classification performance, and ii) identifying substantial genes that are more likely to be associated with the disease.

2.2.2 Feature subset selection (FSS) approaches

There are three major approaches to classifier design and feature subset selection (FSS) [36]; namely, (i) filter, (ii) wrapper, and (iii) embedded. In filter approaches, features are first ranked using a statistical score, such as a t-test. Then the classifier is built by selecting the highest ranking features. This is the most popular method in microarray classification problems, due primarily to its simplicity (see Section 2.2.3). Note, however, that it completely ignores interactions among genes.

In wrapper approaches [53], a classifier is constructed with different candidate feature subsets, the performance is measured (using, for example, cross-validation), and finally the feature subset that achieves the maximum performance is chosen. This is a combinatorial optimization problem: a full search would be very complex, requiring $2^p$ different evaluations, and prone to overfitting. For this reason, only greedy search strategies using different heuristics are feasible. In the context of microarrays and LDA, wrapper approaches have been proposed using full [105] or diagonal [10,78] covariance matrices and different search strategies (see Section 2.3).

Finally, embedded approaches [58] consider the classifier design and the FSS jointly. This is in contrast to the wrapper approaches, which consider the classifier as a black box that induces a prediction rule once the feature subset is chosen. Guyon et al. proposed an embedded approach [37] for Support Vector Machines. To the best of our knowledge, in the context of LDA we have been the first to propose an embedded design approach [77]; see Section 2.4.

2.2.3 Diagonal Linear Discriminant Analysis (DLDA)

The diagonal linear discriminant analysis (DLDA) model is simply the LDA model with the covariance matrix constrained to be diagonal, combined with a filter FSS approach.
This approach, formally introduced in [23], is only optimal when the features are uncorrelated multivariate normal. However, with limited training data the DLDA models are more reliably trained and may achieve a better prediction accuracy than an LDA with an unreliable full covariance matrix.

Under the diagonal assumption, each variable in the DLDA model can be treated independently, since in (2.2) we have that $w_i = d_i / \sigma_i^2$ where $\sigma_i^2 = K_{i,i}$. Thus, the linear discriminant (2.1) can be rewritten as a linear combination of univariate classifiers:

\[
g(x^*) = \log\frac{\pi_A}{\pi_B} + \sum_{i=1}^{p} \underbrace{\frac{\hat{m}^A_i - \hat{m}^B_i}{\hat{\sigma}_i}}_{\hat{h}_i \ \text{(weight)}} \; \underbrace{\frac{x^*_i - \hat{m}_i}{\hat{\sigma}_i}}_{\hat{v}_i \ \text{(vote)}}
\begin{cases}
\geq 0 \Rightarrow x^* \in \text{Class A} \\
< 0 \Rightarrow x^* \in \text{Class B}
\end{cases}
\tag{2.6}
\]

where the right term represents a vote, $\hat{v}_i$, a single-feature discriminant that scores how likely it is that a new sample belongs to class A rather than class B; and the left term is a weight, $\hat{h}_i$, that scores how good feature $i$ is at discriminating the two classes. It is easy to see that $\hat{h}_i$ is proportional to the t-statistic for the difference of two population means. This is illustrated by Figure 2.1 for a two-dimensional example.

Figure 2.1: Graphical interpretation of the DLDA model (2.6) for two features. Each axis represents a feature with the corresponding univariate distributions. Feature $x_1$ has a larger separation between class centroids, compared to the underlying noise, than feature $x_2$. In DLDA we can classify a new sample $x^*$ by a simple linear combination of single-feature discriminants $(x^*_1, x^*_2, \ldots)$ weighed by the appropriate weights (2.6).

In this diagonal model, most of the genes will probably be irrelevant for the classification. Low scoring genes can be seen as adding noise to (2.6), so it makes sense to remove the terms with the lowest $\hat{h}_i$. This can be seen as a filter approach to feature selection (Section 2.2.2), since genes are first ranked using the statistical score, and then the discriminant function is built by selecting the highest ranking genes. The size of the model, i.e., the number of genes, is usually determined by cross-validation; see Section 2.2.4.

The nearest shrunken centroid (NSC) [96] and the weighed covariate (WC) approaches in [33] are tightly related to this DLDA procedure [23]. DLDA [23] and WC [33] can be seen to remove features by hard-thresholding $\hat{h}_i$ (Figure 2.2 a). In NSC instead, the $\hat{h}_i$ scores are continuously shrunk to 0 before they are completely removed, i.e., soft-thresholding (Figure 2.2 b). In NSC [96] the scores $\hat{h}_i$ are also made more robust: a dampening term is added to the standard deviation, $\hat{\sigma}'_i = \hat{\sigma}_i + \mathrm{median}(\hat{\sigma}_i)$, to protect against very small $\hat{\sigma}_i$ occurring by chance. In our experience, hard-thresholding usually gives a better prediction with smaller feature sets.

Figure 2.2: Hard (a) and soft (b) thresholding functions. Both methods set small values ($< \tau = 1$) to zero, but (b) shrinks all the values by $\tau$ while (a) leaves the values above the threshold ($> \tau$) untouched.

Finally, the size of the model (i.e., the number of top scoring genes to add to the DLDA or NSC model) is specified as the one that maximizes the classifier prediction accuracy, which can be estimated by cross-validation (Section 2.2.4).
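The scoring and thresholding just described can be sketched in a few lines of numpy. This is an illustrative reimplementation, not the thesis code; it assumes a pooled within-class standard deviation, the NSC-style median dampening term, and the centroid midpoint as the overall mean $\hat{m}_i$ of (2.6):

```python
import numpy as np

def dlda_fit(Xa, Xb):
    """Per-feature DLDA weights h_i and vote parameters, as in eq. (2.6).

    Xa, Xb: (n_a, p) and (n_b, p) training matrices for classes A and B.
    """
    m_a, m_b = Xa.mean(axis=0), Xb.mean(axis=0)
    na, nb = len(Xa), len(Xb)
    pooled_var = ((na - 1) * Xa.var(axis=0, ddof=1) +
                  (nb - 1) * Xb.var(axis=0, ddof=1)) / (na + nb - 2)
    sigma = np.sqrt(pooled_var)
    sigma = sigma + np.median(sigma)        # dampening against tiny sigmas by chance
    h = (m_a - m_b) / sigma                 # feature weights h_i (t-like scores)
    return h, (m_a + m_b) / 2.0, sigma

def hard_threshold(h, tau):
    return np.where(np.abs(h) > tau, h, 0.0)                 # Figure 2.2(a): keep or kill

def soft_threshold(h, tau):
    return np.sign(h) * np.maximum(np.abs(h) - tau, 0.0)     # Figure 2.2(b): NSC-style shrinkage

def dlda_classify(x, h, m, sigma, log_prior_ratio=0.0):
    """Sign of g(x) in eq. (2.6): >= 0 -> class A, < 0 -> class B."""
    return log_prior_ratio + h @ ((x - m) / sigma)
```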
2.2.4 Model selection and evaluation with cross-validation

Before introducing the new extensions to LDA in the following sections, this background section concludes by reviewing the cross-validation techniques that are used to evaluate the classification models and to guide their construction. The objective of the prediction accuracy evaluation is to assess the performance on future new samples not included in the training set. The prediction error computed on the training set, the training error, is a rather optimistic estimate of the generalization error [40]. For this reason the classifier has to be evaluated using an independent set of samples known as the test set.

In microarray classification problems the available data is very limited and we cannot afford to reserve a large number of samples solely for testing. In these situations, cross-validation (CV) provides an efficient method of iteratively splitting the available samples between a training set and an independent testing set. For example, in 10-fold CV, the training samples are randomly partitioned into 10 segments (balanced to preserve the class proportions); then, for each of the 10 segments, we train the model with the other nine and use the one reserved for testing to evaluate the performance. The advantage of using cross-validation over reserving a large fraction of samples for an independent testing set is that larger training sets can generate more reliable and more accurate models.

In this chapter we have adopted a 100X10-fold CV (100 repetitions of 10-fold CV) to select the size of the LDA model, i.e., the number of genes P in DLDA and SeqDLDA, and the covariance structure in SeqBDLDA. The model is chosen as the one that minimizes the average cross-validation error. Some authors point out that reporting the minimizing cross-validation error as the future error rate for new samples could again introduce a bias, the selection bias, and that a second external round of CV is necessary [5]. To further ensure that no selection bias is introduced, we use a nested cross-validation procedure as suggested in [5]. First, an internal 100X10-fold CV strategy is used to select the LDA model. Then, an external leave-one-out CV (LOOCV) is used to give an unbiased estimate of the model performance. The evaluation, model/gene selection, and training are performed as described in Algorithm 1.

Algorithm 1 Nested cross-validation
 1: (Leave-one-out external cross-validation loop)
 2: for sample i = 1...N do
 3:   Leave sample i out for the external test set
 4:   Use the remaining N-1 samples to form the external training set
 5:   (Repetitions of internal cross-validation loop)
 6:   for repetition r = 1...R do
 7:     The external training set is randomly partitioned into N segments
 8:     (Internal 10-fold cross-validation loop)
 9:     for segment j = 1...N do
10:       Leave segment j out for the internal test set
11:       Use the remaining N-1 segments for the internal training set
12:       for LDA model m = 1...M do
13:         Find the features to include and fit the model using only the internal training set
14:         Test the model on the internal test set and count the errors
15:       end for
16:     end for
17:   end for
18:   Find the model m* that minimizes the average internal cross-validation error
19:   Find the features and fit the model m* using the complete external training set
20:   Test the final external model on the left-out sample i
21: end for
22: Report the external error rate as the expected error rate for future samples
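A compact Python rendering of this skeleton may make the three nested loops easier to follow. This is an illustrative sketch, not the thesis code: `fit` and `test_error` are hypothetical callables supplied by the user, the folds are plain random partitions (the class-balancing of the folds is omitted for brevity), and 10 folds stand in for the N segments of the listing:

```python
import numpy as np

def nested_cv(X, y, models, fit, test_error, n_reps=100, n_folds=10, seed=0):
    """Skeleton of Algorithm 1 (hypothetical helpers, not the thesis code).

    models: list of candidate model sizes/structures.
    fit(X, y, m) -> classifier; test_error(clf, X, y) -> number of errors.
    The outer LOOCV estimates the error of the model chosen by the inner
    repeated 10-fold CV, so the held-out sample never influences feature
    or model selection (no selection bias).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    outer_errors = 0
    for i in range(n):                                   # external LOOCV loop
        tr = np.delete(np.arange(n), i)
        inner_err = np.zeros(len(models))
        for _ in range(n_reps):                          # repetitions of inner CV
            folds = np.array_split(rng.permutation(tr), n_folds)
            for k in range(n_folds):                     # internal 10-fold loop
                te_idx = folds[k]
                tr_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
                for mi, m in enumerate(models):
                    clf = fit(X[tr_idx], y[tr_idx], m)   # select features + fit
                    inner_err[mi] += test_error(clf, X[te_idx], y[te_idx])
        best = models[int(np.argmin(inner_err))]         # model minimizing inner CV error
        clf = fit(X[tr], y[tr], best)                    # refit on full external training set
        outer_errors += test_error(clf, X[[i]], y[[i]])  # test on the held-out sample
    return outer_errors / n                              # external (unbiased) error rate
```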
In summary, the nested cross-validation approach provides an efficient reuse of samples to perform three tasks that would otherwise require three independent sets of samples: i) a training set to estimate the models, ii) a validation set to choose the best model, and iii) a testing set to evaluate the final performance of the validated model. If the number of samples is very limited, which is the case in most microarray studies, cross-validation has the advantage that more samples can be dedicated to building the model. In the nested cross-validation approach, the selection bias is avoided by never using the samples reserved for testing in the inner loops. In the context of microarray studies, we are the first to employ this nested cross-validation strategy together with the LDA-based models that we discuss in this chapter. In a tumor risk prognosis study [6] using the DLDA model (see results in Section 2.5.1), this evaluation approach has been encouraged and received positive reviews [92].

2.3 SeqDLDA – Sequential Diagonal Linear Discriminant Analysis

In the background section (Section 2.2) we mentioned that for microarray applications we are usually forced to make assumptions about the covariance matrix structure, because there is usually not enough data to accurately estimate all pairwise correlations. Under the DLDA model (Section 2.2.3), if gene correlations are not zero (which can happen often), selecting all the top-scoring features while ignoring their correlations is not the best strategy. As illustrated intuitively in Figure 2.3, we could instead select the features whose correlations are better suited to the DLDA model. This is the idea behind sequential DLDA (SeqDLDA), initially proposed in [78] and described in this section.

Figure 2.3: Two possible scenarios for selecting correlated features in a DLDA model. Feature $x_1$ is more discriminative than $x_2$, i.e., $H(x_1) > H(x_2)$, and $H(x_2) > H(x_3)$. DLDA with a filter FSS approach (a) would choose $x_1$ and $x_2$, while if we check the resulting discriminant $H(g(x))$, the SeqDLDA approach (DLDA with wrapper FSS) in (b) achieves a better class separation: $H(g_{\mathrm{SeqDLDA}}(x)) > H(g_{\mathrm{FilterDLDA}}(x))$.

The discriminant function in SeqDLDA is the same as in Filter-DLDA [23] (Section 2.2.3):

\[
g_l(x) = \log\frac{\pi_A}{\pi_B} + \sum_{i \in \mathcal{S}_l} H(x_i)\, \frac{x_i - \hat{\mu}(x_i)}{\hat{\sigma}(x_i) + \hat{\sigma}_0}
\tag{2.7}
\]
\[
H(x) = \frac{\hat{\mu}_A(x) - \hat{\mu}_B(x)}{\hat{\sigma}(x) + \hat{\sigma}_0}, \qquad
\hat{\sigma}_0 = \operatorname{median}_{i=1..p}\big(\hat{\sigma}(x_i)\big)
\tag{2.8}
\]

where $H(x)$ in (2.8) measures class separation. In relation to (2.6) we have that $H(x_i) = \hat{h}_i$, $\hat{\mu}_A(x_i) = \hat{m}^A_i$, $\hat{\mu}_B(x_i) = \hat{m}^B_i$, and $\hat{\sigma}_i = \hat{\sigma}(x_i) + \hat{\sigma}_0$. The additional term $\hat{\sigma}_0$ protects against an unusually low $\sigma$ produced by chance and makes the score $H(x)$ more robust (this strategy is also used in NSC [96]).

The essential difference between SeqDLDA and Filter-DLDA is how we choose the set of features $\mathcal{S}$. Starting from an empty set of features $\mathcal{S}_0 = \emptyset$, at every iteration $l$ we add the one gene $j$ that most increases $H(g_l(x))$ to the set of selected features, $\mathcal{S}_l = \mathcal{S}_{l-1} \cup \{j\}$. The SeqDLDA approach can be seen as a wrapper FSS approach [53] (Section 2.2.2). Instead of measuring $H(g(x))$ for all possible combinations of features, we use a greedy search described in [53] as Forward-Selection/Hill-Climbing. In contrast to SeqDLDA, regular filter DLDA adds the gene with the highest score $H(x_i)$ without checking the resulting model. The number of computations is much higher in SeqDLDA because at each iteration we have to evaluate the resulting discriminants of all possible candidates to add to the model.
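The greedy loop can be sketched as follows. This is an illustrative numpy reimplementation under simplifying assumptions, not the thesis code: the overall (rather than pooled per-class) standard deviation is used per feature, and $H(\cdot)$ applied to the discriminant scores uses a pooled within-class standard deviation:

```python
import numpy as np

def seq_dlda_select(X, y, n_genes):
    """Greedy forward selection for SeqDLDA (sketch of eqs. (2.7)-(2.8)).

    X: (n, p) expression matrix; y: binary labels (0 = class A, 1 = class B).
    At each step, every remaining gene is tentatively added, the diagonal-model
    discriminant is recomputed, and the gene maximizing the class separation
    H(g(x)) of the resulting discriminant scores is kept.
    """
    A, B = X[y == 0], X[y == 1]
    sigma = X.std(axis=0, ddof=1)
    sigma0 = np.median(sigma)                          # robustness term of eq. (2.8)
    h = (A.mean(0) - B.mean(0)) / (sigma + sigma0)     # per-gene scores H(x_i)
    votes = (X - X.mean(0)) / (sigma + sigma0)         # standardized features

    def separation(g):                                 # H(.) of the discriminant scores
        ga, gb = g[y == 0], g[y == 1]
        pooled = np.sqrt((ga.var(ddof=1) + gb.var(ddof=1)) / 2.0)
        return abs(ga.mean() - gb.mean()) / (pooled + 1e-12)

    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_genes):
        # recompute the full discriminant for every candidate gene
        scores = [separation(votes[:, selected + [j]] @ h[selected + [j]])
                  for j in remaining]
        selected.append(remaining.pop(int(np.argmax(scores))))
    return selected
```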
The advantage is that, in situations such as those depicted in Figure 2.3, SeqDLDA can choose the features whose correlation structure works better for the DLDA model. An approach similar to SeqDLDA was also proposed in [10], but using a regular t-test, which reduces the robustness of the model and of the exploratory search, resulting in a much lower performance. Section 2.5.2 presents the results that evaluate SeqDLDA compared to other LDA approaches for microarray applications.

In SeqDLDA, maximizing $H(g_l(x))$ is also equivalent to maximizing $J_{\hat{K}}(w)$ in (2.4), where the vector $w$ is calculated using only the diagonal part of $K$ in (2.5); i.e., $w_i = d_i/(\sigma_i + \sigma_0)$. In the following section we will extend this approach to consider other choices for the covariance matrix using a similar greedy search strategy.

2.4 BDLDA – Block Diagonal Linear Discriminant Analysis

The DLDA models covered in the previous sections ignore the correlation structure. The difference between filter-DLDA (Section 2.2.3) and SeqDLDA (Section 2.3) is in the feature selection: SeqDLDA checks which selected features give better results with the DLDA model. Model selection and feature selection are usually considered two separate tasks. For example, in a Linear Discriminant Analysis (LDA) setting, a modeling assumption is typically made first (e.g., a full or a diagonal covariance matrix can be chosen), and then, with this model, the feature subset providing the best prediction performance is selected. If limited training data is available, then the number of parameters of a model that can be reliably estimated will also be limited. In the context of LDA, model selection basically entails simplifying the covariance matrix by setting some of its components to zero. This leads to different block diagonal matrix structures (e.g., full/diagonal), which involve different sets of features and require different parameters to be estimated.

In this section, we argue that LDA feature and parameter selection should be done jointly, and we propose a greedy algorithm, SeqBDLDA (initially presented in [77]), for the joint selection of features and of a block diagonal structure for the covariance matrix. To the best of our knowledge, this is the first time such a joint design has been proposed in the context of LDA. In more recent work, shrunken centroids regularized discriminant analysis (SCRDA) [34] proposes what can also be considered a joint feature/model selection for LDA. SCRDA includes two shrinkage parameters that are adjusted by cross-validation: the first parameter regularizes the covariance matrix by shrinking the off-diagonal terms towards zero; the second parameter performs variable selection by shrinking each feature weight of the discriminant (as in NSC, see Section 2.2.3). In contrast to SCRDA, which only offers a trade-off between a full and a diagonal covariance matrix, the BDLDA framework introduced in this section considers a search across a large collection of different block diagonal options.

The choice of a block diagonal structure is motivated by microarray classification problems, where we have a very large number of features (i.e., genes) that are expected to be corregulated in small blocks (groups of corregulated genes). Figure 2.4 illustrates a scenario with two features which could be modeled as one BDLDA block. In the context of gene expression analysis, this can be the case for two corregulated genes $x_1$ and $x_2$, where a very small change in $x_2$ triggers a larger change in $x_1$.
The microarray measurement noise makes it very difficult to detect changes in x_2, but taking into account the correlation with x_1 results in a more discriminative LDA model.

Figure 2.4: Linear discriminant scenario with two correlated features. Feature x_1 is more predictive than feature x_2 since it has better separation between classes. If the correlation between x_2 and x_1 is ignored, the resulting DLDA classifier assigns practically all the weight to x_1, so x_2 will be removed from the model if we use a filter (Section 2.2.3) or wrapper (Section 2.3) approach. If we do not ignore the correlations, the resulting LDA model is more powerful in separating the two classes.

Table 2.1: Sequential generation of candidate covariance matrix models for LDA. [Diagram of candidate block diagonal structures, organized by number of parameters (horizontal axis) and number of features (vertical axis).] Starting with an empty list, we add one feature at a time (namely, the one that maximizes a statistical score) using two possible operations: (i) block expansion (solid lines), where a new feature is added to an existing block, grouping it with already chosen features in the correlation structure; (ii) independent feature addition (dashed lines), where a feature is added ignoring correlations (i.e., independent of existing blocks of variables in the correlation structure). The best among all these models is selected using cross-validation.

2.4.1 Block Diagonal LDA – BDLDA

Model selection, in the LDA context, is essentially a choice of a structure for the covariance matrix. Thus a simple method would be to perform feature selection for both a diagonal and a full covariance matrix structure and select the better of the two. In a diagonal model, the number of parameters to estimate is l = p, while in a full matrix l = \frac{1}{2}p(p+1). In this section, we propose to further increase the number of available models by including a whole range of block diagonal matrix structures, as shown in Table 2.1.

Applying the bias/variance trade-off principle [39] in this setting implies that the more parameters we estimate, the less bias we will have, but at the cost of increasing the variance. For this reason, LDA performance is limited primarily by the number of parameters to estimate (rather than by the number of features). We use this insight to develop novel, efficient techniques that embed feature and model selection, based on searching for the best feature set and covariance model for a given number of parameters. Thus, for a given number of parameters, more features can be used with a diagonal covariance model than with a full covariance matrix, but correlation among features will be completely ignored. For uncorrelated features this model will perform best, but there might be correlations present that could be exploited to get better performance with fewer features.

Exploring all possible feature subsets and possible block diagonal structures is not feasible. Thus, we propose a sequential greedy algorithm, SeqBDLDA, for finding at the same time a feature subset and a block diagonal structure.

2.4.2 Greedy Feature and Model Selection for Block Diagonal LDA

Our proposed greedy algorithm for feature and model selection (see Algorithm 2) adds features to the model sequentially, one at a time. The process starts by selecting the best feature measured with the J score of (2.4). Then, at each stage, we have two options: i) adding one more feature, independent of all previously selected features, thus leading to a new block in the block-diagonal structure, and ii) growing the current block in the matrix structure by adding one more feature to it.
These two options are marked with dashed and solid lines, respectively, in Table 2.1 and can be applied alternately to produce feature subsets with different block diagonal covariance structures. In both operations the current set of features, A, is "inherited" from the parent node. In order to determine the best new feature for a given structure we use the scoring procedure discussed in Section 2.4.2.1. After obtaining one feature subset A_m for each of the models in Table 2.1, we are interested in finding which model is the most reliable when the number of parameters is limited. To do so we use leave-one-out cross-validation (see Section 2.4.2.2).

Algorithm 2 Greedy feature subset and model construction
1: Create first model with best feature: i = argmax_{j∈S} d_j/σ_j
2: for all models m in Table 2.1 do
3:   A ← feature set of the parent node
4:   j* ← AddFeature(A, m)   ⊲ Find the best feature to add
5:   A_m ← A ∪ {j*}
6:   ε_m ← EvaluateModel(A_m, m)   ⊲ Using cross-validation
7: end for
8: l ← number of parameters
9: m* ← argmin_{m:|m|=l} ε_m   ⊲ Find the best model with l parameters (lowest cross-validation error ε_m)

2.4.2.1 Feature addition scoring procedure

Assume that we have already chosen a subset of features A, with sample covariance matrix \hat{K}_A and difference of sample means \hat{d}_A. Then, from (2.2), the LDA classifier with a model m is constructed using the following weights:

w_A = \hat{K}_{A,m}^{-1} \hat{d}_A,   (2.9)

where \hat{K}_{A,m} is obtained from \hat{K}_A by zeroing out those terms that are zero in model m (see examples in Table 2.1). Then, using (2.5), the best new feature to add to the model, j ∈ A^C (where A^C is the complement of A in the original feature set), will be:

j^* = \arg\max_{j \in A^C} \frac{(\hat{d}_{A_j}^t w_{A_j})^2}{w_{A_j}^t \hat{K}_{A_j} w_{A_j}}, \qquad A_j = A \cup \{j\}   (2.10)

In our greedy procedure, the new feature is always added in the lower right corner of the matrix, either as an independent block (i.e., ignoring correlations) or by increasing the size of the lower right block by one. In finding the best feature, significant computational savings can be achieved by exploiting the block structure of the matrix in (2.9), and the fact that only certain blocks in the vectors and matrices in (2.10) change with j.

2.4.2.2 Model selection with cross-validation

Since we used the J score (2.4) to guide the search for the feature subset, we cannot use it to decide which model to select. It is a biased estimate of classifier performance that can be used to compare alternative models with the same number of parameters and features, but it does not provide a reliable way to compare models with different structures. We use leave-one-out cross-validation (Section 2.2.4) to estimate the probability of error of a classifier without bias. In leave-one-out cross-validation, one sample is left out and we train with the remaining n−1 samples; the sample that was left out is then classified. The entire training procedure is repeated n times, once for each sample, and the error rate ε_m is estimated as the total number of misclassified samples divided by n. In our case, if the number of parameters is limited to l, we select the model in column l of Table 2.1 with the lowest cross-validation error.
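The scoring step of Section 2.4.2.1 reduces to solving one small linear system per candidate. A minimal sketch, where the binary mask standing in for model m and all names are illustrative:

import numpy as np

def j_score(K_hat, d_hat, mask):
    """J score of (2.10) for a candidate feature set under model m.

    K_hat: sample covariance of the candidate set A_j = A + {j}
    d_hat: difference of the class sample means on A_j
    mask : binary matrix with ones where model m keeps a covariance
           term and zeros elsewhere (a block diagonal pattern)
    """
    K_m = K_hat * mask                         # zero out excluded terms
    w = np.linalg.solve(K_m, d_hat)            # LDA weights, eq. (2.9)
    return (d_hat @ w) ** 2 / (w @ K_hat @ w)  # eq. (2.10)

The candidate j with the largest score is added, either by extending the mask's lower right block (block expansion) or by appending a 1x1 block (independent feature addition).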
2.4.3 Relationship with other LDA methods and applications

Table 2.1 contains several models that have been proposed in the literature: models "grown" by following only solid lines correspond to "full matrix" LDA with forward feature selection (SeqLDA, [105]). Alternatively, models grown by following only dashed lines correspond to forward selection using the Diagonal LDA model (SeqDLDA, Section 2.3). Thus both "full matrix" LDA and SeqDLDA are part of the space of solutions being searched. Note also that if some a priori knowledge were available about the structure of the covariance matrix, it could be exploited to reduce the complexity of the search by removing some of the paths in Table 2.1 from consideration. For example, if it is believed that features will tend to be correlated in small groups, it is very easy to set limits on the maximum size of the blocks to be explored by our algorithm.

2.4.4 Repeated feature subset selection BDLDA (Rep-BDLDA)

Using cross-validation for model selection (Section 2.4.2.2) in BDLDA is very time consuming, and thus not an appropriate algorithm for gene expression data, which has a large number of features; moreover, the number of possible models grows exponentially. As an alternative, we can exploit some underlying knowledge of the data to reduce the BDLDA parameter search space. This is the idea behind the repeated FSS (RFSS) search strategy discussed in this section. [Footnote 7: This work was done in collaboration with L. Shen; more details can be found in [91].]

The RFSS search method consists of repeating the model construction and feature selection N times. At each repetition, only a predefined maximum number of features, MaxFeature, is permitted, with the features selected during previous iterations removed from the set of candidate features. Finally, the N models are combined by concatenating the N mean vectors and block-diagonally concatenating the N covariance matrices, as sketched below. The feature sets in the N models are disjoint and uncorrelated. The model construction is performed N times, or stops when there are not enough candidate features. This heuristic search enables the algorithm to find more discriminating features without being influenced by previously selected models. The resulting approach, Rep-BDLDA, is useful for gene expression data, because genes belonging to the same pathway tend to have sparse correlations, but also because more than one gene in the same pathway could independently lead to the same outcome we are trying to predict.
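A schematic of this repeat-and-combine loop, where fit_bdlda is a hypothetical stand-in for one run of the greedy search of Algorithm 2 (returning selected column indices, the two class mean vectors, and the fitted covariance block); all names are illustrative:

import numpy as np
from scipy.linalg import block_diag

def rep_bdlda(X, y, n_rep=5, max_features=20):
    """Repeat BDLDA model construction N times on disjoint feature sets,
    then combine the N sub-models into one block-diagonal model."""
    remaining = list(range(X.shape[1]))
    feats, means_a, means_b, blocks = [], [], [], []
    for _ in range(n_rep):
        if len(remaining) < max_features:
            break  # not enough candidate features left
        # fit_bdlda: hypothetical single BDLDA construction (Algorithm 2)
        sel, mu_a, mu_b, K = fit_bdlda(X[:, remaining], y, max_features)
        sel = [remaining[j] for j in sel]
        feats += sel
        means_a.append(mu_a)
        means_b.append(mu_b)
        blocks.append(K)
        remaining = [j for j in remaining if j not in sel]
    # concatenate means; block-diagonally concatenate covariance blocks
    return feats, np.concatenate(means_a), np.concatenate(means_b), block_diag(*blocks)

Because previously selected genes are removed before each repetition, the resulting N feature sets cannot overlap, which is what allows the covariance blocks to be stacked block-diagonally.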
2.5 Results

Several simulated and real microarray datasets are used to evaluate the linear discriminant methods discussed in this chapter: DLDA (Section 2.2.3), SeqDLDA (Section 2.3) and SeqBDLDA (Section 2.4). The microarray data used in this section correspond to publicly available datasets from medical research in four different cancer classification studies (leukemia [33], colon [4], prostate [93] and neuroblastoma [6]). Cross-validation (Section 2.2.4) is used to properly fit and evaluate the models. In simulated datasets, where the underlying distributions are known, the true classification accuracy can be obtained either analytically or numerically (with Monte Carlo simulation).

2.5.1 Application of DLDA to neuroblastoma

The development of the classification techniques proposed in this chapter is part of a medical study of neuroblastoma tumors led by Dr. Asgharzadeh at Childrens Hospital Los Angeles. This section describes the dataset that was collected and shows how to apply and evaluate the DLDA model (as was done in [6]). Some of the evaluation techniques used here will also be used in the following sections to evaluate the other LDA-based techniques presented in this chapter.

Metastatic neuroblastomas lacking amplification of the MYCN proto-oncogene vary in their clinical behavior. Those diagnosed before one year of age are least aggressive and those diagnosed after two years are most aggressive. Age, however, is not always correlated with survival, and it is hypothesized that molecular classification of tumors at diagnosis using gene expression profiling would improve prediction of outcome.

Gene expression profiles of 102 untreated primary tumors from patients with stage 4 MYCN non-amplified neuroblastoma diagnosed at <12, 12−24, and >24 months of age were determined using Affymetrix microarrays. A supervised method using DLDA (Section 2.2.3) was devised to build a multi-gene model for predicting outcome.

Figure 2.5: Classification error plot showing the percentage of neuroblastoma patients misclassified using DLDA models of different size (x-axis). Red line: mean error rate generated from 100 repetitions of 10-fold cross-validation using the training set. Blue box plots: lower and upper bounds of the boxes represent the 25th and 75th percentiles of the cross-validation errors for a given gene model; whiskers represent data within 1.5 times the inter-quartile range; outliers are shown as red crosses. The first minimum of the curve occurred at 55 genes.

The average cross-validation error rate was first minimized for models that used probes for 55 genes (52 unique genes), as shown in Figure 2.5. As a validation strategy, we performed nested leave-one-out cross-validation of the entire process, including the model selection. The leave-one-out cross-validation error-rate estimate from our nested algorithm in predicting progression status was 15.7% with a permutation standard error of ±3.95% for all patients, and 17.9% ± 5.09% for those diagnosed after 12 months of age. Without cross-validation, the training error rate of 9.8% for 55 genes using all 102 tumors is probably an underestimate and reflects the need for cross-validation [5]. To evaluate whether the selection of the multi-gene model that minimizes the classification error rate occurred by chance, we permuted our samples by randomly mislabeling them with respect to their progression status and created 1000 sample sets. The error rates generated from multi-gene models in these permuted sample sets were found to be higher than that generated from our correctly labeled sample set (P < 0.01).

The gene expression signatures of tumors obtained at diagnosis from patients with clinically indistinguishable high-risk, metastatic neuroblastomas identify subgroups with quite different outcomes. Accurate identification of these subgroups with gene expression profiles will probably facilitate the development, implementation, and analysis of clinical trials aimed at improving outcome. The list of candidate genes can also be used as candidate targets for future medical studies. The following sections analyze, on this and other microarray datasets, whether the other techniques introduced in this chapter can obtain better performance in terms of prediction accuracy or in capturing the correlations between the genes.

2.5.2 Evaluation of SeqDLDA in four microarray datasets

The proposed SeqDLDA algorithm (Section 2.3) has been evaluated using 100 runs of 10-fold cross-validation on the 2-class datasets shown in Table 2.2. The leukemia [33] (n=72, p=7129), colon [4] (n=22, p=2000) and prostate [93] (n=102, p=6033) datasets are publicly available and widely used in other studies. The neuroblastoma dataset (n=102, p=44298) has been described in Section 2.5.1. In all cases, the gene expression has been normalized by clipping values lower than 1 and taking a log-transform.
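In code, this normalization is a one-liner (log base 2 is an assumption here; the text only specifies a log-transform):

import numpy as np

# Clip raw expression values below 1, then log-transform (Section 2.5.2).
# X_raw: samples x genes intensity matrix.
X = np.log2(np.maximum(X_raw, 1.0))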
Using the same evaluation methods, the proposed SeqDLDA approach has been compared to DLDA [23], NSC [96], GP-DLDA [10], ULDA [106] and linear SVM.

Table 2.2: Average cross-validation error, number of selected genes, and standard deviation (SD). Entries read: error%, #genes (SD); p denotes that the classifier uses all p genes.

           Leukemia             Colon                Prostate            Neuroblastoma
Seq-DLDA   4.11%, 180 (1.32%)   12.06%, 50 (1.87%)   5.53%, 26 (0.90%)   13.87%, 70 (2.41%)
GP-DLDA    3.82%, 18 (0.77%)    13.08%, 16 (1.76%)   6.44%, 20 (0.70%)   15.77%, 35 (1.61%)
DLDA       3.38%, 7 (1.30%)     12.40%, 3 (1.44%)    6.99%, 2 (0.33%)    16.91%, 55 (1.54%)
NSC        4.18%, 70 (0.80%)    10.31%, 20 (1.02%)   7.65%, 6 (0.42%)    17.98%, 70 (1.67%)
ULDA       3.39%, p (0.747%)    15.19%, p (2.72%)    8.53%, p (1.10%)    13.42%, p (1.55%)
Lin-SVM    2.61%, p (0.57%)     15.39%, p (2.17%)    8.01%, p (1.14%)    14.13%, p (1.45%)

In the studied datasets SeqDLDA obtains results very close to the best approach, and the best results for the prostate and neuroblastoma datasets. Additionally, SeqDLDA performs gene selection, unlike ULDA and SVM, whose classifiers use the whole set of genes. Gene selection is crucial in order to identify genomic targets that may explain the disease. Classical DLDA filtering approaches [23,96] provide similar results in the absence of gene correlations, as does GP-DLDA [10] in the absence of inter-pair correlations. For example, Nearest Shrunken Centroid (NSC) obtains the best results in the colon dataset. However, correlation among genes is generally present, and the SeqDLDA method allows us to choose genes that may have a lower score (under a diagonal correlation assumption) but can be shown to provide better classification performance when combined with the already selected genes. Additionally, we have also noticed that the improvement in performance over DLDA is more noticeable when a larger number of training samples is available. In the neuroblastoma dataset, the average misclassification rate of DLDA (16.91%) is significantly reduced to 13.87% using SeqDLDA.

2.5.3 Simulation results with SeqBDLDA

SeqBDLDA has first been extensively analyzed with artificial data, for two basic reasons. First, this allows us to control the covariance matrix and to evaluate the ability of the algorithm to select a model close to the actual one. Second, evaluation is simplified, since for a given LDA-trained model we can exactly compute the probability of error without having to estimate it.

The training data is generated by drawing n samples from the distributions f_A(x) ∼ N(m_A, K) and f_B(x) ∼ N(m_B, K). The two basic generating parameters are K and d = m_A − m_B. We have experimented with several covariance matrix structures and randomly permuted the features, so that in general two contiguous features are not necessarily correlated. In the experiments presented here, d was fixed so that the SNR of the features decreases exponentially with parameter γ:

\frac{d_j}{\sigma_j} = e^{-\gamma j}, \qquad \sigma_j^2 = [K]_{jj}   (2.11)

The number of features that is optimal for the classifier will usually be between approximately 1/γ and 4/γ, increasing with the sample size n and decreasing with p. When n and p are constant, if γ is small, a large number of features will be required for the classifier and a diagonal matrix model will be preferred over a full matrix one.

After training the weight vector w, we can compute the exact probability of error

P_{e|w} = 1 - \Phi\left(\frac{1}{2}\sqrt{\frac{J_K(w)}{1 + 1/n}}\right), \qquad J_K(w) = \frac{(d^t w)^2}{w^t K w}   (2.12)

where Φ(x) is the standard normal cumulative distribution function and the factor 1 + 1/n takes into account the cost of estimating the b parameter in (2.3). This is possible because we know the underlying distribution that generates the samples we want to classify. Using Monte Carlo simulation, we repeat the training and evaluation T times and estimate the average P_e as:

\hat{P}_e = \frac{1}{T}\sum_{t=1}^{T} P_{e|\hat{w}_t}   (2.13)
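A minimal sketch of this simulation pipeline: data generation per (2.11), the exact error (2.12) of a trained weight vector, and the Monte Carlo average (2.13). Drawing n samples per class and using symmetric means ±d/2 are illustrative assumptions (the text only fixes d = m_A − m_B), and train is a hypothetical placeholder for any of the LDA fitting procedures:

import numpy as np
from scipy.stats import norm

def simulate(K, n, gamma, rng):
    """Two Gaussian classes N(m_A, K), N(m_B, K) with SNR per (2.11)."""
    p = K.shape[0]
    sigma = np.sqrt(np.diag(K))
    d = sigma * np.exp(-gamma * np.arange(1, p + 1))  # d_j/sigma_j = e^{-gamma j}
    Xa = rng.multivariate_normal(d / 2, K, size=n)    # class A (n per class, assumed)
    Xb = rng.multivariate_normal(-d / 2, K, size=n)   # class B
    return Xa, Xb, d                                  # (feature permutation omitted)

def exact_error(w, d, K, n):
    """Exact misclassification probability of w, eq. (2.12)."""
    J = (d @ w) ** 2 / (w @ K @ w)
    return 1.0 - norm.cdf(0.5 * np.sqrt(J / (1.0 + 1.0 / n)))

def monte_carlo(K, n, gamma, T=100, seed=0):
    """Average P_e over T independent trainings, eq. (2.13)."""
    rng = np.random.default_rng(seed)
    p_hat = 0.0
    for _ in range(T):
        Xa, Xb, d = simulate(K, n, gamma, rng)
        w = train(Xa, Xb)  # placeholder: SeqLDA / SeqDLDA / SeqBDLDA fit
        p_hat += exact_error(w, d, K, n)
    return p_hat / T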
These results are reported for our proposed algorithm (SeqBDLDA) along with the two related wrapper methods, SeqDLDA and SeqLDA (see Section 2.4.3). Finally, 95% confidence intervals assess the statistical significance of our findings.

The first simulation example uses a symmetric Toeplitz covariance matrix. A symmetric Toeplitz matrix arises from AR processes, in which contiguous features are locally correlated. This is exploited by several classification algorithms [30], which will, however, fail if the features are permuted. Our proposed algorithm avoids this problem since it is invariant to feature permutation. This comes indirectly from our original design assumption that no prior knowledge exists about correlation between features.

In our experiments, the more diagonally dominant the matrix is, the better the diagonal model will be. If the training data is limited, the full-matrix approach quickly fails as we increase the number of parameters. Figure 2.6 illustrates this with the following covariance matrix:

K = \begin{pmatrix} \frac{1}{4} & -\frac{1}{8} & \frac{1}{10} & 0 & \cdots \\ -\frac{1}{8} & \frac{1}{4} & -\frac{1}{8} & \frac{1}{10} & \ddots \\ \frac{1}{10} & -\frac{1}{8} & \frac{1}{4} & -\frac{1}{8} & \ddots \\ 0 & \frac{1}{10} & -\frac{1}{8} & \frac{1}{4} & \ddots \\ \vdots & \ddots & \ddots & \ddots & \ddots \end{pmatrix}   (2.14)

In general, the SeqBDLDA algorithm achieves very good performance for a Toeplitz covariance matrix. Although a Toeplitz matrix is not strictly block diagonal, it has a sparse set of non-zero correlations. Thus, even when the underlying feature covariances cannot be arranged in blocks for any permutation, SeqBDLDA still does a good job of reducing the number of parameters that have to be estimated to approximate the underlying Toeplitz structure.

Figure 2.6: SeqBDLDA classification performance, Toeplitz covariance matrix. [Plot of misclassification error rate (%) versus number of parameters for SeqLDA, SeqDLDA and SeqBDLDA, with the corresponding n = ∞ bounds.] p = 200, n = 120, K as in (2.14), γ = 0.2. Solid and dotted lines represent the mean \hat{P}_e and its 95% confidence interval over 100 trainings. In this example SeqBDLDA (in green) outperforms both SeqLDA (red) and SeqDLDA (blue).

The second experiment tests the algorithm with block diagonal matrices. Figure 2.7 shows the results for the following covariance matrix structure:

K = \begin{pmatrix} A & 0 & 0 & 0 \\ 0 & B & 0 & 0 \\ 0 & 0 & C & 0 \\ 0 & 0 & 0 & D \end{pmatrix}   (2.15)

A = \begin{pmatrix} \frac{1}{2} & \frac{1}{3} & 0 \\ \frac{1}{3} & \frac{1}{2} & -\frac{1}{2} \\ 0 & -\frac{1}{2} & \frac{1}{2} \end{pmatrix}, \quad B = \begin{pmatrix} \frac{1}{2} & 0 & 0 \\ 0 & \frac{1}{2} & 0 \\ 0 & 0 & \frac{1}{2} \end{pmatrix}, \quad C = \begin{pmatrix} \frac{1}{2} & 0 & -\frac{1}{3} & \frac{1}{2} \\ 0 & \frac{1}{2} & 0 & 0 \\ -\frac{1}{3} & 0 & \frac{1}{2} & 0 \\ \frac{1}{2} & 0 & 0 & \frac{1}{2} \end{pmatrix}, \quad D = \begin{pmatrix} \frac{1}{2} & & 0 \\ & \ddots & \\ 0 & & \frac{1}{2} \end{pmatrix}

Figure 2.7: SeqBDLDA classification performance, block covariance matrix. [Plot of misclassification error rate (%) versus number of parameters for SeqLDA, SeqDLDA and SeqBDLDA at n = 60 and n = 120, with the corresponding n = ∞ bounds.] p = 200, n = 60 (thin line) and 120 (thick line), K as in (2.15), γ = 0.1. Solid and dotted lines represent the mean and its 95% confidence interval for P_e over 100 trainings. This example illustrates that, depending on the number of samples available, SeqBDLDA (in green) can choose an adequate block diagonal structure that outperforms both SeqLDA and SeqDLDA.
Figure 2.7 shows that when training data is very limited, e.g., n = 60, a diagonal structure (SeqDLDA) outperforms a full matrix approach (SeqLDA), while as n increases the full matrix approach becomes better. Our approach can default to SeqLDA or SeqDLDA for some numbers of parameters, but it is also capable of choosing other intermediate block-diagonal alternatives. This is why, most of the time, SeqBDLDA can outperform SeqLDA and SeqDLDA in a block matrix scenario.

2.5.4 Results with Rep-BDLDA on microarray data

On microarray datasets, the SeqBDLDA algorithm evaluated in the previous section is very slow, and we need mechanisms to avoid using cross-validation to select the model, as well as heuristics for reducing the search for the best block structure. This can be achieved by repeated FSS (RFSS) over small sets of BDLDA models, as explained in Section 2.4.4 (more details and results in [91]).

The Rep-BDLDA algorithm is tested on the colon, prostate and neuroblastoma datasets. Rep-BDLDA was used with N = 5 repetitions of BDLDA models with maximum block size MaxGrow = 3 and MaxFeature = 20 features. The average 50x10-fold cross-validation error rates and standard deviations are shown in Table 2.3. For SCRDA [34], the cross-validation error rate corresponds to the minimum after adjusting two regularization parameters.

In all three real datasets, BDLDA has the lowest error rates. Among them, neuroblastoma, with more than 40,000 features, is considered the most challenging. Our algorithm reduced the error rate by more than 2% compared to the second best algorithm, SeqDLDA (Section 2.5.2).

Table 2.3: Average error rate (standard deviation) for microarray data

           Colon            Prostate        Neuroblastoma
RepBDLDA   10.06% (1.15%)   5.21% (0.85%)   10.61% (1.29%)
SeqDLDA    12.06% (1.87%)   5.53% (0.9%)    13.87% (2.41%)
NSC        10.31% (1.02%)   7.65% (0.42%)   17.98% (1.67%)
SCRDA      11.41% (1.69%)   5.41% (0.89%)   14.22% (1.39%)

2.6 Conclusions

This chapter covered the design and evaluation of classifiers based on linear discriminant analysis (LDA) for microarray applications. The design of accurate classifiers is challenging due to the limited number of training samples compared to the large number of genes in microarray studies. Under these conditions, the estimation of the parameters of an LDA model (the covariance matrix and the class centroids) is not robust. However, the underlying biological models tend to be sparse in the sense that: i) very few genes are normally relevant for the outcome the classifier is trying to predict, and ii) genes are correlated in relatively small groups. Several modeling and feature selection approaches have been proposed within the LDA framework to exploit this underlying sparseness.

Starting from DLDA models that ignore gene correlations, we investigated several search approaches (filter DLDA and the wrapper SeqDLDA) for selecting a subset of genes. This was then expanded to BDLDA models in which we search for a feature subset along with a block structure that models the interactions between genes. In other words, we solve the problem of fitting an LDA model by searching for the combination of genes and gene correlations that gives the best discrimination between classes.

Depending on the number of training samples available, we can search for a smaller or larger model in terms of the number of parameters to estimate. The appropriate size of the model can be determined by cross-validation. The embedded FSS search tries to give the best block structure and parameter/feature selection for a model of a given size.
Results on simulated and real microarray data demonstrate that the proposed SeqDLDA, SeqBDLDA and especially RepBDLDA offer very competitive performance compared to other state-of-the-art approaches such as NSC, SCRDA, GP-DLDA and SVM.

There is a steady increase in the knowledge about gene regulation, and more and more microarray experiments are deposited into large publicly available databases every day. In future work, all this prior knowledge could be used in the FSS and BDLDA framework presented here. For example, groups of coregulated genes could be proposed to belong to the same block in BDLDA, or gene pathways could be used to guide the FSS search.

Chapter 3
Sparse representation and Bayesian detection of genome copy number alterations from microarray data

Genomic instability in cancer leads to abnormal genome copy number alterations (CNA) that are associated with the development and behavior of tumors. Advances in microarray technology have allowed for greater resolution in the detection of DNA copy number changes (amplifications or deletions) across the genome. However, the increase in the number of measured signals and the accompanying noise from the array probes present a challenge for accurate and fast identification of the breakpoints that define CNA. In this chapter we propose a novel detection technique that exploits the use of piece-wise constant (PWC) vectors to represent genome copy number and sparse Bayesian learning (SBL) to detect CNA breakpoints. [Footnote 1: Part of the work presented in this chapter has been published in [75,79].] First, a compact linear algebra representation for the genome copy number is developed from normalized probe intensities. Second, SBL is applied and optimized to infer the locations where copy number changes occur. Third, a backward elimination (BE) procedure is used to rank the inferred breakpoints, so that a cut-off point can be efficiently adjusted in this procedure to control the false discovery rate (FDR). The performance of our algorithm is evaluated using simulated and real genome datasets and compared to other existing techniques. Our approach achieves the highest accuracy and lowest FDR while improving computational speed by several orders of magnitude.

3.1 Introduction

Copy number alterations involving deletion or replication of entire chromosomes or chromosomal regions are known to occur in numerous genetic disorders (e.g., Down's syndrome, Klinefelter's syndrome), while replications of multiple chromosomes leading to states of hyperploidy are well known in cancer biology [3]. Similarly, regional CNA have been demonstrated in tumors, and linked to the development of aggressive behavior. Examples include loss of the RB tumor suppressor in retinoblastoma or MYCN proto-oncogene amplification in neuroblastoma. Recently, a large number of copy number variants (CNVs) have also been described in the human genome [82] and found across large numbers of individuals. These recurrent copy number changes, CNVs, tend to be much smaller in size and will be considered in more detail in Chapters 4 and 5. Array-based technologies use genetic material as sensors or probes to estimate copy number for the intended genomic regions. The resolution for detection of CNA depends on the number and type of probes placed on these arrays. Comparative genomic hybridization (CGH, [51]) is one of the earlier array platforms; it uses large insert DNA fragments (kilobases) as probes for measuring DNA copy number.
These probes, typically numbering in the order of hundreds of thousands to millions, allow co-hybridization to take place between a fluorescently tagged genome of interest and a normal reference genome. The relative intensity at a given probe is directly proportional to the copy number of that region. More recently, platforms using short oligonucleotide probes (≤60 bases), which allow placement of hundreds of thousands of probes on an array, have become more widely used [80]. The probes are hybridized only to a tagged genome of interest, and the intensities are usually compared to those of a reference set of arrays of normal genomes. The majority of these arrays use oligonucleotides that also probe for regions with genotype polymorphisms, thus providing both copy number and genotype information [43,71]. The increase in probe density poses computational challenges to accurately and efficiently assess DNA copy number and identify altered regions.

Several algorithms have been proposed to detect CNA [11,29,42–44,60,64,68,69,73,80,107]. Most of these algorithms rely on a fundamental characteristic, namely, that a genome is composed of relatively long segments, DNA sequences, that are present in a constant number of copies. The genomic segments can be represented by m probes mapping to a specific position on the genome having c_m copies. The copy numbers c_m can be ordered and arranged as vectors c that have two key characteristics: i) they are piecewise constant (PWC), with a very small number of breakpoints relative to the number of probes, and ii) they have discrete values (DIS) (i.e., copy numbers can only be 0, 1, 2, 3, ...). However, these properties cannot be directly observed in the log-intensities y_m measured with microarrays, due to contamination by biological and technical noise; thus a widely used model is:

y_m = x_m + \epsilon_m   (3.1)

where x_m represents the average log intensity, and \epsilon_m is an additive zero-mean white random process (see Figure 3.1).

Figure 3.1: Graphical representation of the observation model (3.1) using a chromosome section with 2 alterations as an example (simulated data). [Plot of log_2 hybridization intensity versus probes (m = 1, ..., i, ..., M) ordered in chromosomal position.] The underlying mean hybridization intensity x_m ∝ log_2(c_m) is piece-wise constant (PWC), with breakpoints I = {i_1, i_2, i_3, i_4} that mark the starting probe of each segment and amplitudes a = (a_0, a_1, a_2, a_3, a_4) that depend on the underlying number of copies (DIS). The observed probe hybridization intensities y_m = x_m + ε_m do not follow this expected behavior due to degradation by hybridization noise ε_m.

Most techniques exploit the assumption that x_m ∝ log_2(c_m) and that the properties PWC and DIS, as introduced above, are met. For example, one of the first and simplest techniques to exploit PWC consisted of applying a smoothing filter followed by a threshold [43,80]. This has been improved upon by more specialized techniques such as wavelets [42], segmentation [60,69,73], or penalized least-squares [44]. Additionally, hidden Markov models (HMM) [29,64,68,107] and Bayesian methods [11] exploit both PWC and DIS by assuming that each observation y_m comes from a probe in a particular hidden copy number state c_m to be inferred. Exploiting DIS can be difficult in cases where that state is rare (observed a small number of times), or in cases of specimens containing a heterogeneous population of cells with respect to DNA copy numbers.
This heterogeneous copy number state typically occurs in tumor samples, where x_m = log_2(\bar{c}_m) would correspond to the average copy number in the mixture.

Among all the previous methods, circular binary segmentation (CBS) [69] was found to be one of the most accurate methods for CNA detection by two independent comparative studies [57,103], but it was also one of the slowest. These studies used synthetic datasets where the CNA occur at known positions, the probes are uniformly spaced, and the hybridization noise is generated according to a white Gaussian distribution. More recently, new approaches [25,85,90] have extended previously proposed methods in order to target specific scenarios not considered by the CBS approach, e.g., the presence of outliers [90], non-uniform probe spacing [85], and chromosomes with a reduced number of probes and non-uniform variance [25]. In our case we focus on the default conditions and metrics proposed by [103], under which our results show that these new algorithms do not give better accuracy than CBS and are slower. The performance of these algorithms under different conditions that may arise on specific microarray platforms is discussed in Chapter 5. Recently, the computational performance of the CBS algorithm improved significantly with a new approximate version [101] with no significant loss of accuracy. However, the run-times of this new version and of the other new algorithms are still very high, especially when applied to the new high density array platforms.

In contrast, we propose a novel modeling of genomic data using PWC vectors that can be efficiently exploited to build algorithms for CNA detection with very significant gains in computational speed. We also propose a new approach, which we call the genome alteration detection algorithm (GADA), for CNA detection from array data; it combines the sparse Bayesian learning (SBL) technique introduced by [97] and a backward elimination (BE) procedure that can efficiently adjust the accuracy trade-off between sensitivity and FDR. We evaluate our algorithm using the simulated array-CGH dataset proposed by [103], where the underlying positions of copy number changes are known and can be used as a benchmark to compare the accuracies of different algorithms. We also evaluate the performance of three algorithms [25,85,90] that appeared after the comparative study of [103], and the newer CBS implementation [101]. Using that benchmark dataset, our GADA approach obtained one of the best accuracies and the best performance in terms of computational speed, followed by CBS. Additionally, we compare the results of our algorithm and CBS on data generated from several array types from two commercial manufacturers (Affymetrix and Illumina) using DNA from four different neuroblastoma cell lines. Our results indicate that our algorithm can analyze data efficiently from high density platforms and provide accuracy similar to or better than that of state-of-the-art algorithms, but with reduced computation costs. On the new large array platforms, our algorithm is two orders of magnitude faster than one of the most accurate algorithms available to date, i.e., CBS [69].

This chapter is organized into two major parts. The first part (Sections 3.2 through 3.7) introduces the PWC models and the SBL and BE algorithms that compose the GADA approach, and also discusses the theoretical properties of these models and algorithms.
The second part evaluates the proposed GADA approach on both simulated (Section 3.8) and real microarray data (Section 3.9) and presents the final conclusions.

3.2 PWC vector representation of Genomic Data

One of the major contributions of this work is the development of a compact description of the copy number along the chromosome using PWC vectors (green signal in Figure 3.1). Using simple linear algebra, any PWC vector x with K breakpoints I = {i_1, ..., i_K} can be compactly represented by a linear combination of K step vectors f_i (each with a single breakpoint i in I; see Figure 3.2) plus a constant vector f_0:

f_i(m) = \begin{cases} -\sqrt{\frac{M-i}{iM}} & m \le i \\ \sqrt{\frac{i}{M(M-i)}} & m > i \end{cases}   (3.2)

f_0(m) = \frac{1}{\sqrt{M}}   (3.3)

Figure 3.2: Step vector f_i with a breakpoint between probes i and i+1, as defined in (3.2). [Plot of f_i(m) over probes m = 1, ..., i, ..., M ordered in chromosomal position.] Notice that the step vectors have been normalized to have unit norm, \sum_{m=1}^{M} (f_i(m))^2 = 1, and zero average for i > 0, \sum_{m=1}^{M} f_i(m) = 0.

Therefore, in matrix notation, we can write this linear combination as:

x = Fw   (3.4)

where the columns of F are the step vectors (F = [f_0, f_1, ..., f_{M-1}]) and w is a sparse vector, i.e., there are only K+1 non-zero components. Equivalently, we can remove the components of w that are zero and write:

x = F_I w_I   (3.5)

where F_I = [f_0, f_{i_1}, ..., f_{i_K}] and w_I = [w_0, w_{i_1}, ..., w_{i_K}]. This representation has three very important properties, which are proved in Section 3.2.1. First, the columns of F form a basis that can be used to represent any arbitrary vector. Second, it has a nested structure: for each additional breakpoint i that the PWC vector may contain, we only require one additional weight w_i to be non-zero. Third, any arbitrary PWC vector with exactly K breakpoints can be represented with K+1 non-zero components, which is proved to be the minimum possible amount, i.e., maximal sparseness. To the best of our knowledge, we are the first to explicitly propose this representation in the context of genome copy number variations [79] and to exploit its properties to develop a highly accurate and efficient detection technique, detailed in the following sections.
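A small numerical illustration of this representation, constructing F from (3.2)–(3.3) and checking the maximal-sparseness property; the example vector and the zero threshold are arbitrary choices:

import numpy as np

def pwc_basis(M):
    """PWC basis F = [f_0, f_1, ..., f_{M-1}] from (3.2)-(3.3)."""
    F = np.empty((M, M))
    F[:, 0] = 1.0 / np.sqrt(M)                   # constant vector f_0
    for i in range(1, M):
        F[:i, i] = -np.sqrt((M - i) / (i * M))   # entries m <= i
        F[i:, i] = np.sqrt(i / (M * (M - i)))    # entries m >  i
    return F

# A PWC vector with K breakpoints has exactly K+1 non-zero weights:
M = 10
x = np.array([2., 2., 2., 5., 5., 5., 5., 1., 1., 1.])  # steps after probes 3 and 7
F = pwc_basis(M)
w = np.linalg.solve(F, x)                 # w = F^{-1} x
print(np.flatnonzero(np.abs(w) > 1e-9))   # -> [0 3 7]: constant term + 2 breakpoints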
3.2.1 Properties of the PWC representation

This representation was initially inspired by the concept of wavelet footprints [21], where the more general case of piece-wise polynomial signals is considered from a wavelet analysis perspective. The maximally sparse representation for PWC signals demonstrated with wavelet footprints is reformulated here using standard linear algebra and extended to arbitrary vector lengths. This also allows us to establish a correspondence between sets of breakpoints and a nested structure of vector subspaces, which we use here to demonstrate the representation properties.

Mathematically, a PWC vector x can be completely characterized by its change locations (i.e., breakpoints) and the constant values of the regions in between (i.e., segment amplitudes):

Definition 1 (PWC vector). A piece-wise constant vector x = (x_1, ..., x_M)^t is characterized by an ordered set of K discontinuity locations I = {i_1 < i_2 < ... < i_K} ⊂ {1, ..., M−1} (i_k denotes the beginning position of segment k) and a vector with the corresponding K+1 segment amplitudes a = (a_0, ..., a_K)^t. Thus:

x^t = (\underset{1}{a_0}, \ldots, a_0, \underset{i_1}{a_1}, \ldots, a_{k-1}, \underset{i_k}{a_k}, a_k, \ldots, \underset{M}{a_K})   (3.6)

where the position of each marked entry is indicated underneath.

With this definition, it is easy to show that the breakpoint sets I induce the following vector subspace properties:

Lemma 1 (PWC vector subspaces). Let S_I be the set of all PWC vectors x that have breakpoint locations contained in I and segment amplitudes a ∈ R^{K+1}. Then, we have that:
i) S_I is a subspace of R^M of dimension K+1.
ii) S_{I_1} is a subspace of S_{I_2} if and only if I_1 ⊂ I_2.

Proof. It is clear that i) holds since, first, for any x_1, x_2 with breakpoints in I but different amplitudes a_1 and a_2, the sum x_3 = x_1 + x_2 may remove existing breakpoints but never creates a breakpoint outside I; thus x_3 ∈ S_I, because it will always have breakpoints contained in the same I, with a_3 = a_1 + a_2. Second, for any x_1 ∈ S_I and for all α, x_4 = αx_1 will also have breakpoints contained in I, with a_4 = αa_1, and thus will belong to S_I. Furthermore, when I is fixed, the x and a vector spaces are isomorphic, and hence S_I has dimension K+1; thus, ii) readily follows from i).

Part ii) of the lemma is equivalent to saying that any PWC vector x ∈ S_I can be represented as a linear combination of step vectors in S_{\{k\}}, k = i_1, ..., i_K. With this principle in mind, we now introduce a basis for PWC signal representation that has some desirable properties.

Theorem 1 (PWC basis). Define a matrix F = [f_0, f_1, ..., f_{M-1}], with columns f_i defined as in (3.2). Then, we have the following properties:
i) (Complete basis): The columns of F are a basis for R^M, i.e., for any x ∈ R^M there exists a unique w such that x = Fw.
ii) (Nested structure): The columns of F_I = [f_0, f_{i_1}, ..., f_{i_K}], a "nest" of F, are a basis for the vector subspace S_I formed by the PWC vectors with breakpoints at I = {i_1 < i_2 < ... < i_K}.
iii) (Maximal sparseness): Any PWC vector x ∈ S_I can be written as x = Fw, where w has at most |I|+1 non-zero entries, which is the minimal amount possible (maximal sparseness). Moreover, if the non-zero weights are w_I = [w_0, w_{i_1}, ..., w_{i_K}], we can write x = F_I w_I, where the subscript I denotes that only the columns of F (resp. components of w) at the positions corresponding to the indices in I are included.

Proof. In order to better understand the previous theorem, we give a proof that constructs the PWC basis using the nested structure. First, if x is a constant vector, i.e., it has no discontinuities (I = ∅), the dimension of S_0 = S_{I=∅} is one, and it can be spanned by the constant vector f_0. Then, for k = 1, ..., M−1, the vector spaces S_k = S_{I={k}} of PWC vectors with a single discontinuity between k and k+1 can be spanned by adding the element f_k, a step vector with a breakpoint at that position. Moreover, the set of vectors now forms a complete basis: from Lemma 1.ii), any x ∈ S_I can be represented by linearly combining {f_0, f_{i_1}, ..., f_{i_K}}. This basis construction proves ii) in the theorem, as well as i) when I = {1, ..., M}.

Alternatively, we can prove that i) holds by simply checking that F is a square invertible matrix. The rows of its inverse F^{-1}, which form the dual basis, are:

F^{-1} = [\tilde{f}_0, \ldots, \tilde{f}_{M-1}]^t   (3.7)

\tilde{f}_0 = \frac{1}{\sqrt{M}}\mathbf{1}_M, \qquad \tilde{f}_k(m) = \begin{cases} -\sqrt{\frac{k(M-k)}{M}} & m = k-1 \\ \sqrt{\frac{k(M-k)}{M}} & m = k \\ 0 & \text{otherwise} \end{cases}

The most appealing property for our specific application is iii), since copy number vectors will have very few breakpoints (K << M), which makes w a sparse representation. We can prove iii) by the following argument. First, we cannot have fewer than K+1 non-zero elements, because this is the minimum required to form a basis for S_I. Then, for all m ∉ I, we have that x_m − x_{m−1} = 0, and using the dual basis, w = F^{-1}x, we have that for all m > 0, w_m = 0 if and only if x_m − x_{m−1} = 0.
Thus, there are exactly K+1 non-zero elements, which is indeed the minimum (so the representation is maximally sparse).

3.3 Formulation of the breakpoint detection problem

The compact representation developed in the previous section can be used to facilitate estimating x from a degraded observation y generated as in model (3.1):

y = x + \epsilon = Fw + \epsilon,   (3.8)

where x has been replaced by its representation in terms of the basis vectors, Fw. Since the number of copy number changes is very small compared to the number of probes, K << M, x = Fw has a sparse representation in the F basis, while the noise ε is not sparse in this representation. Under this scenario, the problem is formulated as that of finding \hat{x} = F\hat{w} closest to the observed y, subject to having only K non-zero components of \hat{w}:

\hat{w}: \min_w e(Fw, y) \quad \text{s.t.} \quad s(w) = K.   (3.9)

Different measures of closeness e(·) and sparseness s(·) can be used. For closeness, we will use the least squares error measure, since it is the most widely used for approximation and will facilitate comparison among algorithms, although it may be sensitive to outliers. For measuring sparseness we are especially interested in the l_0 norm (i.e., the number of w_m ≠ 0), which best models the biological property that K << M without imposing any restriction on the specific values of w_m.

Then, the optimization with these measures can be rewritten as follows:

\hat{w} = \arg\min_w \|y - Fw\|^2 + \lambda\|w\|_0   (3.10)

where the l_p norm and the l_0 pseudo-norm are defined as:

\|w\|_p = \sum_{m=1}^{M} |w_m|^p, \qquad \|w\|_{p \to 0} = \sum_{m=1}^{M} I(w_m \ne 0)   (3.11)

and λ > 0 is a trade-off parameter between goodness of fit and sparseness.

Finding a solution to the problem of (3.10) would require solving O(KM^2) least squares problems. This approach is intractable for the chromosome lengths M and numbers of discontinuities K that are typical in our application. There exist several popular sub-optimal approaches, both in the signal processing and in the statistics literature (see Table 3.1), that use a greedy search strategy or convex relaxation.

Table 3.1: Relationship between signal processing methods for overcomplete expansions and methods in statistics for variable selection in multiple regression

                     Signal Processing                   Statistics
Greedy methods:
  MP-FS              Matching Pursuit [61]               Forward Selection [88]
  OMP                Orthogonal Matching Pursuit [70]
Relaxation methods:
  MoF-Ridge          Method of Frames [13]               Ridge regression [40]
  BP-Lasso           Basis Pursuit [13]                  Lasso [40]

Methods are paired when a similar version of equation (3.9) is solved (i.e., when the same metrics are chosen). Note, however, that there will be differences in how λ is adjusted, and in the size or types of design matrices F that are used.

The first class of strategies, greedy methods, consists of reducing the search space of all 2^M possible variable (breakpoint) subsets by assuming that the best set of K_1 variables (breakpoints) will often be a subset of the best set of K_2 variables, for K_2 > K_1. If this assumption is correct, the set of best predictors can be constructed sequentially as in MP-FS, where we start by selecting the vector (regressor) with the largest projection (largest F-score), and keep adding the vector that most reduces the energy of the residual. This strategy is only optimal when F is orthogonal, or nearly optimal [20] when the coherence of F (C = \max_{k \ne j} \langle f_k, f_j \rangle) is small and the signal is "sufficiently" sparse (i.e., \|w\|_0 is small). It is important to note that this result cannot be applied to our case, since the coherence of F approaches 1, i.e., the set of vectors considered here is highly coherent.
The second class of strategies is based on replacing the l_0 sparseness measure \|w\|_0 with some other l_p measure, such that more efficient optimization methods (such as linear programming, projection or gradient methods) can be used. For example, for p = 2 (i.e., \|w\|_2), we would have a ridge regression in which the two squared norms can easily be combined, resulting in \hat{w}_{ridge} = (F^t F + \lambda I)^{-1} F^t y. However, \hat{w}_{ridge} is not sparse at all in the l_0 sense, and thus we would be interested in using a p as small as possible. The l_1 norm is often used because it is the smallest one for which the constraints form a convex set, and thus convex optimization or linear programming can be used to solve the problem. This is the strategy behind basis pursuit [13] and lasso [40], for which there exists a result similar to that for MP [20], showing that if the coherence is small then minimizing under l_1 is equivalent to minimizing under l_0. Therefore, when F is highly coherent, as in our case, these techniques lead to sub-optimal performance and a new approach is needed.

In conclusion, the performance of the methods in Table 3.1 is severely limited by the high collinearity between the columns of F [20] (the inner product is almost 1, the maximum). On the other hand, sparse Bayesian learning (see next section), applied to the specific problem of CNA detection [79], can successfully exploit the collinearity structure of the PWC representation.

3.4 Sparse Bayesian Learning (SBL)

The optimization problem defined in (3.10) can be formulated from a Bayesian estimation point of view, as was done by [104] for the case where F is an arbitrary matrix, and solved using sparse Bayesian learning (SBL) [97], an empirical Bayes approach. Following [104], the problem in (3.10) can be cast as a maximum a posteriori (MAP) estimate:

\hat{w}_{MAP} = \arg\max_w p(w|y) = \arg\max_w p(y|w)\,p(w) = \arg\min_w -\log p(y|w) - \log p(w)   (3.12)

where the observation model p(y|w) specifies the goodness-of-fit measure e(·), and the prior distribution for the weights p(w) specifies the sparseness measure s(·) in (3.9).

In SBL [97], the observation model is assumed normal:

p(y|w) \sim N(Fw, \sigma^2 I)   (3.13)

which leads to the mean square error as the measure of fit in (3.10). The limitations of this measure are basically the same as for all least-squares-based methods. We will discuss robustness to extreme outliers in Section 3.8.1 and Chapter 4.

The sparseness measure in the lasso/BP and Ridge-Tikhonov methods presented in Section 3.3 is equivalent to using the Laplacian and the normal distributions for the prior p(w), respectively. In contrast, the prior distribution for the weights in SBL is specified as a hierarchical prior:

p(w|\alpha) = \prod_{m=1}^{M-1} N(w_m | 0, \alpha_m^{-1}),   (3.14)

where α is a vector of hyperparameters that are distributed according to a gamma distribution:

p(\alpha) = \prod_{m=1}^{M-1} \Gamma(\alpha_m | a, b).   (3.15)

The choice of p(w) in convex relaxation methods is more relevant in terms of enforcing sparseness (better approximating the l_0 norm) in a computationally efficient manner than in terms of incorporating some existing prior knowledge on the actual p(w) distribution. In this regard, the SBL prior has three useful features. First, given the hyperparameters α, the conditional posterior weight distribution (3.16) is normal:

p(w|y, \alpha, \sigma^2) = N(w|\mu, \Sigma)   (3.16)

\Sigma = (\sigma^{-2} F^t F + \mathrm{diag}(\alpha))^{-1}   (3.17)

\mu = \sigma^{-2} \Sigma F^t y   (3.18)
Second, by treating the weightsw as hidden variables, the maximum likelihood esti- mation for the hyperparametersα can be obtained by the EM algorithm [97]. Thus, for each iteration l until convergence we would alternate: EStep: E w|y,α (l) ,σ 2 w 2 m =Σ mm +μ 2 m (3.19) MStep: ˆ α (l+1) m = 1+2a Σmm+μ 2 m +2b (3.20) Finally, although this hierarchical prior does not appear to encourage sparseness, it has been demonstrated that indeed it has very good sparseness properties [97,104]. This behavior can be unveiled by finding the marginal “effective” prior, p(w), which is is an i.i.d. t-distribution with 2a degrees of freedom and a scale parameter of p a/b (see Appendix A). When b→0 and a is small, this distribution peaks very sharply at 0, and has very thick flat tails that decay at (1+2a) rate in log-scale: logp(w) → b→0 C(a)+(1+2a) M−1 X m=0 log|w m | (3.21) Thus, as shown in Figure 3.3, with this prior we obtain a sparseness cost that more closely approximates the desired l 0 norm. In other words, this prior forces a very large number of weights to be 0 while the non-zero weights are free to take any value (in Figure 3.3 the sparseness penalty is almost constant for any r > 0), which matches well our underlying biological knowledge for copy number changes. 66 Although, the model contains several hyperparameters, the α and σ parameters are estimated from the data while b is set to 0 (uninformative prior). Thus, sparseness is adjusted solely by the a parameter (Section 3.4.1 and Appendix A) −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 0 1 2 3 4 5 6 K = 1 K = 2 K = 3 K = 4 K = 5 r Sparseness(w) Sparseness(w): ||w|| 0 ||w|| 1 (BP − Laplacian) ∝ − Σ m=1 M log|w m | (SBL) Figure 3.3: SBL and l 1 sparseness metrics compared to the desired l 0 norm (dotted line). Each curve represents the sparseness metric for an arbitrary vector w with only K =1,...,5non-zerocoefficientscoefficientsatanyposition. Allthenon-zeroweightsare given the same magnitude r for different values of r on the x-axis. Ideally, we would like the sparseness metric to be inversely proportional to the the l 0 norm, which will be equal to the number of non-zero components (K) regardless of the value of the components themselves (i.e., r). Note that the SBL metric approximates better the l 0 norm, while l 1 norm deviates significantly from this ideal behavior. 3.4.1 Implementation of SBL to find copy number alterations TothebestofourknowledgethisisthefirsttimethatSBLhasbeenemployedtofindcopy number alterations, and more specifically with the PWC representation that we propose, where F has a very special structure. One of our contributions [79] is the observation that SBL can function well in our situation where significant collinearity exists, unlike other standard methods in Section 3.3. 67 Additionally, SBL computational performance can be optimized for our PWC repre- sentation by exploiting the nested structure property. Direct computation of equations (3.17) and (3.20) for an arbitraryF would require O(M 3 ) operations [97,104]. However, for our particularF in (3.2),H I =G −1 I = F t I F I −1 is, for all possibleI, a symmetric tridiagonal matrix, with main diagonal h 0 (j)= (M−i j )i j M (i j+1 −i j−1 ) (i j+1 −i j )(i j −i j−1 ) (3.22) and upper/lower diagonal h 1 (j)= p (M−i j )i j (M−i j+1 )i j+1 M(i j+1 −i j ) (3.23) This structure can be used to efficiently compute Σ mm and μ m for each EM step (3.19) in O(M) steps (see lines 9-14 in Algorithm 3 ). 
Additionally, SBL computational performance can be optimized for our PWC representation by exploiting the nested structure property. Direct computation of equations (3.17) and (3.20) for an arbitrary F would require O(M^3) operations [97,104]. However, for our particular F in (3.2), H_I = G_I^{-1} = (F_I^t F_I)^{-1} is, for all possible I, a symmetric tridiagonal matrix, with main diagonal

h_0(j) = \frac{(M - i_j)\, i_j}{M} \cdot \frac{i_{j+1} - i_{j-1}}{(i_{j+1} - i_j)(i_j - i_{j-1})}   (3.22)

and upper/lower diagonal

h_1(j) = \frac{\sqrt{(M - i_j)\, i_j\, (M - i_{j+1})\, i_{j+1}}}{M (i_{j+1} - i_j)}   (3.23)

This structure can be used to efficiently compute Σ_mm and μ_m for each EM step (3.19) in O(M) operations (see lines 9-14 of Algorithm 3).

Additional computational savings are achieved through the removal of columns of F that correspond to breakpoints whose weights w are very likely to become 0 (lines 15-19 of Algorithm 3). This column removal strategy was used by [97] for the general F case but, when combined with the tridiagonal structure exploited here, each EM step is solved more rapidly; the complexity is O(|I|), so the speed increases as the number of remaining breakpoints |I| decreases.

Algorithm 3 Sparse Bayesian Learning (SBL) for PWC
Input: y, a, σ²
1: ȳ ← (1/M) Σ_{m=1}^{M} y_m
2: y ← y − ȳ   ⊲ Removing ȳ allows us to remove f_0 from F
3: α ← 0_M   ⊲ Initialize to a vector containing M zeros
4: I ← {1, ..., M−1}   ⊲ Initially every possible location can be a breakpoint
5: [h_0, h_1] ← G_I^{-1}   ⊲ The inverse is symmetric tridiagonal with h_i its i-th diagonal
6: w_0 ← F^{-1} y   ⊲ F^{-1} is bidiagonal
7: z ← F^t y   ⊲ Computed by solving the tridiagonal system G^{-1} z = w_0
8: repeat
9:   [t_0, t_1] ← T = σ² G_I^{-1} Λ + I   ⊲ G_I^{-1} = [h_0, h_1] and Λ = diag(α)
10:  w ← solve the tridiagonal system T w = w_0   ⊲ E step
11:  Obtain the diagonal of Σ = σ² T^{-1} G_I^{-1}   ⊲ Only the tridiagonal band of T^{-1} is needed
12:  for j = 1 ... |I| do
13:    α_j ← (1 + 2a)/(w_j² + Σ_jj)   ⊲ M step
14:  end for
     (Optional reduction by removing very unlikely breakpoints:)
15:  if ∃ i ∈ I : α_i > τ = 10^8 then
16:    I ← {i ∈ I : α_i ≤ τ}
17:    [h_0, h_1] ← G_I^{-1}   ⊲ Recompute only the terms that change
18:    w_0 ← G_I^{-1} z(I)   ⊲ G_I^{-1} is tridiagonal
19:  end if
20: until w has converged (‖w_old − w_new‖ ≤ ε)
Output: w_I, I

In our implementation, σ² is estimated from the data. The parameter σ² in previous work [97,104] is usually estimated jointly by the EM algorithm. However, since each chromosome in the genome is analyzed independently, and σ² is assumed to be the same for all chromosomes, it is more robust to estimate σ² for the entire genome before applying the EM algorithm to each chromosome. In this chapter, σ² is estimated as

\hat{\sigma}^2 = \frac{1}{2M} \sum_{m=2}^{M} (y_m - y_{m-1})^2   (3.24)

in which the difference y_m − y_{m−1} removes the baseline PWC component and is distributed as N(0, 2σ²) except at the breakpoints, which can be removed from the sum by replacing the mean with a trimmed mean. Similar estimates have also been widely employed in signal denoising approaches [21]. In the context of multiple samples, as will be discussed in Chapters 4 and 5, probe-specific variance terms could also be used.

Finally, the EM algorithm is guaranteed to improve the solution after each step and will always converge [104], but it may converge to a local minimum instead of the global minimum. However, these local minima are indeed always sparse (see Theorem 2 in [104]). The degree of sparseness in the SBL algorithm is controlled by the parameter a, as can be seen from (3.21) and Figure A.1, whereby an increase in a causes a sharper peak at zero with faster tail decay and leads to a sparser solution. The a parameter also controls the convergence rate of the EM algorithm, with larger a leading to faster convergence. However, larger values of a are not always desirable and can lead to suboptimal placement of breakpoints because of rapid convergence of the EM algorithm to a local minimum. The EM local minimum problem can be corrected by checking the statistical evidence for each breakpoint after obtaining a set of breakpoints at an appropriate a level. The statistical significance test can be performed by the backward elimination procedure described in the next section, which also allows more flexibility in setting the final desired degree of sparseness.
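A minimal sketch of the estimator (3.24) with the trimmed mean mentioned above (the trim fraction is an arbitrary choice):

import numpy as np
from scipy.stats import trim_mean

def estimate_sigma2(y, trim=0.05):
    """Genome-wide noise variance, eq. (3.24): first differences cancel
    the PWC baseline, and a trimmed mean discards the few squared
    differences inflated by true breakpoints (and outliers)."""
    d2 = np.diff(y) ** 2   # (y_m - y_{m-1})^2 ~ 2*sigma^2 away from breakpoints
    return 0.5 * trim_mean(d2, trim)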
3.5 Breakpoint ranking by Backward Elimination

Not all breakpoints found by SBL have the same statistical significance, since noise may make areas without any underlying alteration appear similar to areas corresponding to actual alterations. Some breakpoints mark the separation between two long segments (i.e., such that each segment includes many probes) and are such that the difference between the estimated amplitudes of the two segments is large. Such breakpoints are more likely to correspond to true underlying changes in copy number, and therefore will have a higher statistical score t_j = |\hat{w}_j| / \sqrt{\sigma^2 h_0(j)} (see Appendix B). This score depends on the two contiguous breakpoints, and thus significance scores change every time a breakpoint is removed (i.e., two segments are merged).

Instead of testing all the possible breakpoint combinations (i.e., segmentations), we have adopted a sub-optimal backward elimination (BE) strategy, in which we recursively eliminate the breakpoint with the lowest statistical evidence t_j. Although the procedure is suboptimal, since we may eliminate breakpoints that would be more significant at a later stage, it is much less sensitive to this issue than forward selection [53]. The BE procedure can be stopped when all the remaining breakpoints have t_j higher than a specified T, the BE critical value. Moreover, with I_K being the breakpoint set obtained from SBL, the procedure creates a sequence of nested subsets I_1 ⊂ I_2 ⊂ ... ⊂ I_K, which are obtained backwards, such that successive subsets differ in only one discontinuity: this directly provides a breakpoint ranking. This ranking r is obtained efficiently by Algorithm 4 in O(|I|), where we exploit the fact that removing one discontinuity at a time only affects the two neighboring breakpoints (lines 9 and 12).

Algorithm 4 starts with the set of breakpoints I (given by the SBL algorithm), the original array observations y, and the noise variance estimate σ². Lines 2 and 3 find the breakpoint weights w_I (i.e., the projection coefficients of y onto S_I), exploiting the structure of the PWC representation instead of using equation (B.1) directly. In the loop (lines 5-17) we sequentially remove the breakpoint with the smallest statistical score t_j (line 6) until all breakpoints have been removed. When we remove a breakpoint j*, we do not need to compute all the weights again, but only those of its left (line 9) and right (line 12) neighbors.
Algorithm 4 Breakpoint Ranking by Backward Elimination
Input: y, I, σ²
 1: Compute H_I, i.e., [h_0, h_1], using (3.22) and (3.23)
 2: z ← F^t y                          ⊲ Computed by solving the bidiagonal system (F^t)^{−1} z = y
 3: w_I ← H_I z(I)                     ⊲ H_I is tridiagonal
 4: Compute scores t, with t_j = w_I(j) / sqrt(σ² h_0(j))
 5: for k = |I|, ..., 1 do
 6:   j* ← argmin_{i_j ∈ I} |t_j|      ⊲ Find the least significant breakpoint for removal
 7:   r_k ← (i_{j*}, t_{j*})           ⊲ Give the breakpoint the k-th rank
 8:   if j* > 1 then                   ⊲ Recompute the left breakpoint
 9:     w_I(j*−1) ← w_I(j*−1) + sqrt( ((M−i_{j*−1}) i_{j*−1}) / ((M−i_{j*}) i_{j*}) ) · ((i_{j*+1} − i_{j*}) / (i_{j*+1} − i_{j*−1})) · w_I(j*)
10:   end if
11:   if j* < |I| then                 ⊲ Recompute the right breakpoint
12:     w_I(j*+1) ← w_I(j*+1) + sqrt( ((M−i_{j*+1}) i_{j*+1}) / ((M−i_{j*}) i_{j*}) ) · ((i_{j*} − i_{j*−1}) / (i_{j*+1} − i_{j*−1})) · w_I(j*)
13:   end if
14:   I ← I − {i_{j*}}                 ⊲ Remove the breakpoint from the set
15:   w_I ← w_I(I)                     ⊲ Remove the j* component
16:   Recompute h_0 and t for the new I    ⊲ Only j*−1 and j*+1 change (3.22)
17: end for
Output: r

Finally, with the ranking of breakpoints r, we can adjust the final breakpoint list to any critical value of T with no additional computational cost. This provides great flexibility in adjusting the final breakpoint set. The expected false discovery rate (FDR) is monotonically decreasing with T; thus we can obtain a list of breakpoints with lower FDR by increasing the threshold T.
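The essence of the ranking can be written compactly if one gives up the O(|I|) neighbor updates of Algorithm 4. The Python/NumPy sketch below refits the flanking segment means at every elimination, so it runs in roughly O(|I|²·M); the score is expressed in an equivalent two-segment form, namely the jump between the flanking segment means divided by its standard error under i.i.d. Gaussian noise. The function name and this naive refitting strategy are illustrative assumptions, not the thesis implementation.

    import numpy as np

    def be_rank(y, bkps, sigma2):
        # bkps is a sorted list of breakpoint indices: a breakpoint at i
        # separates probes [..., i-1] and [i, ...]. Returns a list of
        # (breakpoint, t_at_removal) in elimination order; keeping the
        # breakpoints with |t| >= T reproduces the nested subsets
        # I_1 c I_2 c ... c I_K described above.
        M, I, ranking = len(y), sorted(bkps), []
        while I:
            edges = [0] + I + [M]
            means = np.array([y[a:b].mean() for a, b in zip(edges[:-1], edges[1:])])
            n = np.diff(edges)
            # |jump| between flanking segments over its standard error
            t = np.abs(np.diff(means)) / np.sqrt(sigma2 * (1.0 / n[:-1] + 1.0 / n[1:]))
            j = int(np.argmin(t))
            ranking.append((I[j], float(t[j])))
            del I[j]            # merge the two flanking segments and rescore
        return ranking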
3.5.1 The role of the T parameter in BE ranking

The ranking provided by the backward elimination procedure, Algorithm 4, can be used to quickly return breakpoint sets with different degrees of sparseness that contain the breakpoints with the strongest evidence. This is done by cutting the ranking r at a specified threshold T, such that all the remaining breakpoints have |t_j| ≥ T. Both true positives and false positives decrease with an increasing level of sparseness (i.e., higher T); but if P(|t_j| ≥ T | w_j = 0) < P(|t_j| ≥ T | w_j ≠ 0), the expected proportion of false breakpoints in the returned set (i.e., the false discovery rate, FDR) will be monotonically decreasing with T. This condition holds for Gaussian noise, but it will also hold for other symmetric, bell-shaped noise distributions.

Additionally, we can associate a p-value with any particular value of t, or a significance cutoff α = P(|t_j| ≥ T | w_j = 0) with any T, if we assume the noise is normal, using (B.9). If the noise is not Gaussian, the p-value will still be a good approximation for breakpoints with large flanking segments (i.e., when the two neighboring breakpoints are far apart), since t converges to a normal distribution under the null hypothesis (for any noise with zero mean, finite variance and small correlation). Alternatively, for small segments, we could estimate the p-value by a resampling method (e.g., bootstrap [24]) or replace the t score by a non-parametric rank-sum test.

It is important to notice that the aforementioned p-value is associated with a single breakpoint in one of the many possible segmentations. Thus, it does not take into account all the possible segmentations that are effectively tested during the algorithm, i.e., the multiple hypothesis testing or multiple comparison problem. Commonly used tools for this problem (Bonferroni, Benjamini-Hochberg [8] and Benjamini-Yekutieli [9]) are not recommended here because they do not take into account the special correlation structure that exists between the t scores of overlapping or neighboring segmentations, and the independence between t scores separated by one breakpoint or more. Solving the multiple testing problem in this scenario, in the sense of being able to provide a T that controls the FDR below some bound, is still an open problem. However, since the FDR is monotonically decreasing with T, we can adjust T to achieve a particular degree of sparseness, and then estimate the FDR that corresponds to that T, either using results from multiple samples or replicates, or by a resampling procedure.

3.6 Segment Alteration Detection

The SBL and BE procedures are segmentation approaches that make no assumptions about the amplitude of the reconstructed segments. The objective is to provide a nearly optimal set of amplitudes and breakpoint positions that best fits the hybridization intensities observed on the array, as described in (3.10). Once the breakpoints are fixed, in order to achieve the minimal residual error (RSS), the amplitude corresponding to each segment is given by the average hybridization level of all the probes that fall inside that segment. Because of this model, segments that correspond to the same underlying copy number state may be given different reconstruction amplitudes. Thus, an additional step has to be performed to classify these segments into a copy number (0, 1, 2, 3, 4, ...) or an alteration status (Non-Altered, Gain or Loss).

There are two popular alternatives for performing this additional step, since it is also required by other segmentation procedures such as DNAcopy [69] and CGHseg [73]. The first alternative, also used in smoothing and thresholding methods [43,80], assumes or estimates a baseline Non-Altered mean hybridization level and classifies all the segments whose average amplitude is significantly above (below) that level as Gain (Loss); otherwise Non-Altered is assigned (see the sketch at the end of this section). The second alternative is the MergeLevels algorithm [103], which reduces the number of different reconstruction amplitudes by recursively merging those that are the least significantly different. The final, smaller set of levels may then be associated with copy number states (0, 1, 2, 3, 4, ...).

Other CNA/CNV detection approaches, especially those based on HMMs, automatically incorporate a classification into the different states of a hidden variable associated with each probe. However, as we discussed in Section 3.1, this may not be a good model when the number of hidden states that has been assumed does not match the true number of underlying mean hybridization levels. This is especially likely to occur when analyzing tumor samples that represent mixtures of cells with different copy number states, because cancer genomes are inherently unstable and heterogeneous [31].
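The following Python sketch illustrates the first alternative above. The baseline level, the significance level, and the use of a plain z-test on each segment mean are assumptions of this illustration; MergeLevels and the exact thesis pipeline are not reproduced here.

    import numpy as np
    from scipy.stats import norm

    def classify_segments(y, edges, sigma2, baseline=0.0, alpha=0.01):
        # edges = [0, b_1, ..., b_K, M] gives segment boundaries; each
        # segment amplitude is the mean of its probes and is compared to
        # the baseline with a two-sided z-test at level alpha.
        zcrit = norm.ppf(1.0 - alpha / 2.0)
        labels = []
        for a, b in zip(edges[:-1], edges[1:]):
            z = (y[a:b].mean() - baseline) / np.sqrt(sigma2 / (b - a))
            labels.append('Gain' if z > zcrit
                          else 'Loss' if z < -zcrit
                          else 'Non-Altered')
        return labels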
3.7 GADA approach to CNA detection

We now summarize our algorithm for detecting CNA, which we call GADA, and which consists of two steps. First, we apply SBL, which provides a set of breakpoints with a specified initial level of sparseness controlled by the prior hyperparameter a. Then, the second step ranks the breakpoints provided by SBL using the BE procedure, where the critical value T is used to establish the final degree of desired sparseness. The combination of these two approaches provides greater accuracy and flexibility.

First, the combination provides greater accuracy because each step minimizes the impact of the assumptions made by the other. SBL provides a better search strategy because effective removal of breakpoints is accomplished over several EM iterations. However, the breakpoint set detected by SBL may still include some spurious breakpoints (see Section 3.4.1). These ‘false’ breakpoints are then removed using the BE procedure (Section 3.5). The BE approach is greedy and fast, and it benefits from starting from the smaller set of breakpoints provided by SBL, since fewer errors accumulate with a smaller set (see Appendix B).

Second, it provides greater flexibility in adjusting the final breakpoint set. Both a and T can adjust sparseness in an equivalent way. We have shown that breakpoints obtained with higher sparseness settings in SBL (i.e., larger a values) tend to be subsets of those obtained with lower sparseness settings when evaluated using the same T value in BE (see Appendix C.1). Moreover, adjusting T can be done at no additional computational cost. Thus, SBL is used with a small a, which gives a high initial sensitivity, and BE adjusts the final level of FDR.

The foreseeable usage of the GADA approach by a practitioner detecting CNA (or CNV) would start by analyzing a large collection of microarray samples with a small initial a. This a can be obtained by analyzing a small subset of samples and/or chromosomes. However, we have found by analyzing simulated and real datasets on platforms ranging from 50K to 550K probes that a = 0.2 is small enough to give the necessary initial level of sensitivity (see Appendix C for more information). Following the analysis of samples with SBL, the user can adjust T to obtain the final breakpoint set. A significance value α = P(|t| > T | w = 0) can be computed if the array noise is considered normal (t ∼ N(0,1)), or estimated using a resampling procedure. The procedures that are typically used to control the FDR, mentioned in Section 3.5.1, are not recommended for adjusting T because they do not take into account the dependence structure among the breakpoints. However, if replicate samples are available, the FDR can be estimated at a given T.

3.8 Simulation Results

3.8.1 Simulated CGH Data and evaluation metrics

The datasets used to compare the algorithms' rates of accuracy (sensitivity and FDR) are those proposed by [103]. To further assess these metrics for CNA occurring in genomes with differing complexities, we generated six additional simulated datasets containing 200 genomes each, with 20 chromosomes. All datasets were generated in Matlab, forming chromosomes of length 200 probes and sampling the CNA from the same empirical distribution used by [103], but were categorized by the number and length of CNA (see Table 3.2; a sketch of this generation procedure is given after the table). These categories include: 1) no breakpoints, 2) only one breakpoint at any position (uniformly distributed), 3-6) generated as in [103] but categorized by the number of breakpoints and the length of the altered segments.

Table 3.2: Simulated datasets categorized by the number of breakpoints and segment lengths

  Category   Number of breaks   Segment lengths
  1)         0                  —
  2)         1                  Any length
  3)         Few (2 to 4)       Large (10-150)
  4)         Few (2 to 4)       Small (1-9)
  5)         Many (5 to 10)     Large (10-150)
  6)         Many (5 to 10)     Small (1-9)

All experiments consist of 200 samples with 20 chromosomes containing 200 probes. Each row represents a set of samples with different genomic complexity.
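For concreteness, here is a minimal Python/NumPy sketch of how one chromosome of such a dataset could be generated. The segment-amplitude distribution of [103] is replaced by a simple placeholder (uniform jump heights), so the noise level, the amplitude choices, and all names are assumptions of this illustration.

    import numpy as np

    def simulate_chromosome(M=200, n_alt=2, seg_len=(10, 150), sigma=0.15, seed=0):
        # One chromosome: piecewise-constant ground truth plus white
        # Gaussian noise. Each alteration contributes two breakpoints
        # (its start and its end positions).
        rng = np.random.default_rng(seed)
        x, bkps = np.zeros(M), []
        for _ in range(n_alt):
            length = int(rng.integers(seg_len[0], seg_len[1] + 1))
            start = int(rng.integers(0, M - length))
            x[start:start + length] += rng.uniform(0.5, 1.0) * rng.choice([-1.0, 1.0])
            bkps += [start, start + length]
        y = x + rng.normal(0.0, sigma, size=M)
        return y, sorted(b for b in set(bkps) if 0 < b < M)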
Table 3.3 shows the definitions of the accuracy metrics used in the analyses of simulated data (sensitivity and FDR). A breakpoint is claimed to have been detected correctly only if it is placed within a distance of δ probes from the true breakpoint. In evaluating the performance of the algorithms, an algorithm is considered to perform better than another if 1) its FDR is smaller with the same sensitivity, 2) its sensitivity is higher with the same FDR, or 3) it has both a lower FDR and a higher sensitivity. All other cases are considered uninformative (e.g., similar FDR and sensitivity, or discordant FDR and sensitivity). For each sample in a given simulated dataset, the performance (FDR and sensitivity) of the algorithms was measured. The proportion of times that an algorithm performed better was obtained using only the informative cases. The two-sample test for binomial proportions (or McNemar's test) was then used to assess differences in the performance of the algorithms.

Concordance between algorithms was measured as |A∩B|/|A∪B| [56], where A and B are the breakpoint sets returned by each algorithm. Breakpoints belong to the intersection (i.e., are considered to be the same) if they are separated by less than δ = 2 probes.

Table 3.3: Possible outcomes for each candidate breakpoint position

  Breakpoint    Not detected   Detected
  Present       FN             TP
  Not present   TN             FP

Performance metrics: Sensitivity or Recall = E[TP/(FN+TP)]; FDR or 1−Precision = E[FP/(FP+TP)].

Note: A True Positive (TP) only occurs if the breakpoint that has been detected is within a distance of δ probes from a true breakpoint. If more than one breakpoint is detected within this vicinity, only the closest one is considered a TP and the remainder are False Positives (FP). The true breakpoint positions that are not detected are False Negatives (FN). The regions without a breakpoint where no breakpoints have been detected are True Negatives (TN). If the array has M probes, M−1 is the number of candidate breakpoints (i.e., M−1 = TP+FP+TN+FN). The numbers of breakpoints falling in each of these categories are random numbers obtained for each simulated sample; thus expected values can be obtained for the False Discovery Rate (FDR = 1−Precision) and the Sensitivity (Recall) by taking the average over all the simulated samples.
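The δ-window matching rule of Table 3.3 can be made precise with a small scoring routine. The Python sketch below matches each true breakpoint to its closest unclaimed detection within δ probes, as described in the note above; the function names and the greedy matching order are assumptions of this illustration.

    import numpy as np

    def score_breakpoints(true_bkps, detected, delta=2):
        # Returns (sensitivity, fdr) for one chromosome under the delta rule.
        true_bkps, detected = sorted(true_bkps), sorted(detected)
        claimed, tp = set(), 0
        for t in true_bkps:
            # among detections within delta of t, take the closest unclaimed one
            cands = [d for d in detected if abs(d - t) <= delta and d not in claimed]
            if cands:
                claimed.add(min(cands, key=lambda d: abs(d - t)))
                tp += 1
        fp = len(detected) - tp
        fn = len(true_bkps) - tp
        sens = tp / (tp + fn) if (tp + fn) else float('nan')
        fdr = fp / (tp + fp) if (tp + fp) else 0.0
        return sens, fdr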
3.8.2 GADA approach compared to greedy search methods

In this section we compare the GADA approach to the other popular methods in Table 3.1 that could be used with our PWC formulation. Compared to GADA, most of these existing methods are severely limited by the high collinearity/coherence between the columns of F (see Section 3.3). Using the performance evaluation procedure described in Section 3.8.3 [103], the precision-recall operating curves (PROC) were generated for each approach. The sensitivity and FDR for detected CNA in the simulated CGH dataset are obtained at each operating point (Figure 3.4). SBL had the best-performing PROC curve compared to the other approaches for all given values of δ (data shown only for δ = 2).

Figure 3.4: PROC operational curves for sensitivity vs. false discovery rate in detecting real copy number changes within a δ = 2 sample precision window in the dataset introduced by [103]. (Curves shown for MP, MMP, OMP, BP and SBL.)

3.8.3 GADA approach compared to other CNA detection methods

We evaluated the performance of the proposed algorithm and compared the results with other published algorithms that are publicly available, including CBS [69,101], SWARRAY [54], HMM [29], RHMM [90], PL [25], RJaCGH [85], and GLAD [46]. We employed a simulated array-CGH dataset with known CNA positions [103], where the accuracy in detecting breakpoints was measured in terms of sensitivity and FDR as defined in Section 3.8.1.

The performance in terms of accuracy for all the analyzed algorithms (using their respective default parameters) is reported in Figure 3.5. Three of the methods, CBS, HMM and GLAD, were previously analyzed by [103], and the results are identical to those reported previously. The faster new CBS [101] was also evaluated, with results matching those from the previous implementation [69]. For RJaCGH, due to the long computational running time of the algorithm (>1 day), the segmentation results were obtained directly from the authors and then evaluated with the same metrics. GADA, CBS, RJaCGH and RHMM are the most accurate algorithms in terms of both sensitivity and FDR, while the remaining algorithms clearly show poorer accuracy in both metrics. Among these top four algorithms, considering the times required to analyze the entire dataset, GADA (48 seconds) is the fastest, followed by CBS (625 seconds), RHMM (41 minutes) and RJaCGH (>1 day).

Figure 3.5: Median sensitivity and FDR for detecting known copy number changes within a probe window of length δ (δ = 0-3). (Panels: GADA, CBS (DNAcopy), RJaCGH, RHMM, PL, SWARRAY (GEMCA), HMM and GLAD; axes: sensitivity and FDR vs. δ.) The results are obtained using the default parameter settings of each algorithm (in GADA this is a = 0.2 and T = 4). The median and the interquartile range (IQR) are taken across the 500 samples.

In Figure 3.6, the parameters that control the trade-off between sensitivity and FDR are adjusted in GADA and CBS to generate the precision versus recall operating curves (PROC). The single operating points generated by the RJaCGH and RHMM algorithms (using their default parameters) are also shown for comparison. The results show no significant differences in performance among these four algorithms.

Figure 3.6: PROC operational curves for the mean sensitivity vs. FDR in detecting real copy number changes within a δ = 2 sample precision window in the dataset introduced by [103] (averages taken across the 500 samples). RJaCGH and RHMM results are obtained using the default parameter settings and provide a single point. CBS operating points are obtained by varying α, while GADA operating points were obtained by varying the T parameter with the default a = 0.2.

The GADA results presented in this section are also not sensitive to different choices of the a parameter. Figure C.2 shows that essentially the same results as in Figure 3.6 are obtained for a range of a parameters. As discussed in Section 3.7, GADA is a two-step procedure controlled by two parameters, a and T. Setting a higher a simply makes the PROC curve shorter (i.e., it starts further to the left and to the bottom), since all the breakpoints that would be removed by BE are instead eliminated in the SBL step. It should also be noted that RJaCGH, RHMM and PL are reported to have better accuracy than CBS in situations different from the ones evaluated by the employed dataset, which may include: non-uniform probe spacing, chromosomes with a reduced number of probes, non-uniform variance, and the presence of outliers. In Chapters 4 and 5 we will study the impact of some of these situations on GADA performance, as well as possible extensions to GADA in order to handle them.
In what follows we focus on comparing GADA to CBS, the baseline algorithm that most of the recent approaches use for comparison. The newer algorithms are not included in this analysis, as they do not show significant improvements over CBS using the standard evaluation methods designed by [103] and have considerably slower running times.

In Section 3.8.1 we observed that the majority of the genomes simulated by [103] have few breakpoints with large altered regions. To further assess the performance of GADA and CBS on genomes with complex patterns of CNA, typical of those observed in tumors, we generated six additional simulated datasets. These datasets contained varying complexity of CNA and were derived using the same procedure proposed by [103]. The datasets included both 'quiet' genomes (0-1 breakpoints) and complex genomes involving few or multiple breakpoints resulting in small or large CNA regions. The performances of GADA and CBS on these six datasets are provided in Table 3.4. Both algorithms work well for finding a small number of discontinuities within large segments, but there is significant evidence that GADA outperforms CBS in the complex cases. However, the magnitude of the overall differences in sensitivity and FDR between GADA and CBS is relatively small (<3%), and the main advantage of our approach is its flexibility and computational speed when analyzing large density arrays.

Table 3.4: Sensitivity and FDR dependence on datasets of different complexity

  Compl.   FDR %                  Sensitivity %             Both terms
  Categ.   GADA   CBS   (<,>)     GADA    CBS    (>,<)      (GADA,CBS)   p-val.
  1)       0.00   0.00  (19,54)   —       —      —          —            —
  2)       5.00   5.00  (76,62)   95.00   95.00  (43,41)    (74,52)      0.025
  3)       4.04   5.61  (91,70)   95.96   97.83  (24,95)    (60,69)      0.21
  4)       3.85   3.48  (84,77)   80.39   77.78  (129,30)   (94,34)      6E-8
  5)       2.97   5.56  (162,30)  95.28   96.23  (50,92)    (100,30)     4E-10
  6)       2.15   2.84  (119,62)  77.23   76.07  (155,38)   (114,20)     2E-16

For each of the simulated datasets of different complexity in Table 3.2 we compare the performance of the GADA and CBS algorithms. In all cases, the GADA algorithm is set to T = 4.0 and CBS to α = 0.01, since this provides comparable performance points on the PROC curves and allows comparison across cases. The median sensitivity and False Discovery Rate (%) in breakpoint detection within δ = 2 probes are evaluated. The FDR and sensitivity of GADA and CBS are also compared for each sample in a given dataset, and the number of times FDR and sensitivity are smaller or larger (<,>) between the two algorithms is reported. The value pairs in parentheses are arranged so that a left value higher than the right one indicates better performance of GADA. The rightmost column counts the number of times GADA performs better than CBS, either in terms of smaller FDR or higher sensitivity (a p-value is computed as described in Section 3.8.1).

3.9 Evaluation with microarray data

3.9.1 Neuroblastoma Genomic Data from Array Platforms

Four neuroblastoma cell lines, two with known MYCN oncogene amplification (SK-N-BE2, SMS-KAN) and two lacking MYCN amplification (LAN-6, CHLA-20), were grown in RPMI medium with 10% FCS to confluence prior to extraction of DNA using STAT60 (Tel-Test, Inc.). The same stock of DNA was used to perform whole-genome analysis for CNA using Affymetrix SNP arrays 50K Xba, 250K Sty and 250K Nsp, and the Illumina GoldenGate 550K SNP array, based on their respective protocols.
The raw data obtained from the Affymetrix platform arrays were normalized using routines employed in the Copy Number Analysis Tool version 3.0 (www.affymetrix.com), where log2 ratios of the probe intensities are calculated after fitting a regression model generated from a normal set of diploid samples. The Illumina platform data were normalized and summarized using the BeadStudio Genotype analysis software, and the log-R-ratio data were exported for further analyses. Data from 60 NCI cell lines generated using Affymetrix 50K Hind and 50K Xba [31] were also used to assess the computational speed of the algorithm (GEO repository accession number: GSE2520).

3.9.2 Computational speed in commercial microarray platforms

We recorded the time required to analyze, using GADA and CBS, copy number data generated on Affymetrix or Illumina platforms from the neuroblastoma cell lines and NCI cell lines. The results are summarized in Table 3.5. The GADA algorithm was on average 100 times faster than the latest implementation of CBS. The GADA algorithm provides an additional advantage by identifying the breakpoints corresponding to all the operating points of the PROC curve within the time frames shown in Table 3.5. This allows real-time control of the final adjustment of the representation of CNA regions corresponding to different choices of the critical value T, with no additional computational time. In contrast, in the current implementation of CBS, the entire procedure needs to be repeated in order to obtain sets of breakpoints at different values of the α parameter.

Table 3.5: Average analysis time (seconds) for Affymetrix and Illumina microarrays

          50K     100K    250K     500K      Illumina
  GADA    1.5     2.98    7.10     15.95     20.49
  CBS     197.7   444.9   597.72   1262.40   2665

Average time required to analyze the data, in seconds per chip (only the time spent by the detection algorithm is counted). The 100K and 500K columns correspond to the analysis of the combination of the two 50K (Hind/Xba) and the two 250K (Nsp/Sty) chips, respectively.

The computational complexity of SBL has been greatly optimized by exploiting the properties of the PWC representation, as described in Section 3.4.1. The EM algorithm converges very quickly, and each EM step is solved in a linear number of operations, O(M), resulting in an overall running time that, as confirmed in Table 3.5, increases linearly with the array size M. In contrast, the computational complexity of CBS is composed of two parts: i) the circular binary segmentation optimization, with O(M²) operations; and ii) the hybrid permutation test [101] that decides whether or not to proceed with the recursive segmentation, with O(MP) operations (P is the number of permutations). The hybrid permutation test in CBS has been improved compared to the previous implementation [69], which required O(M²P) operations; however, the overall complexity is still limited by i), the circular segmentation, taking O(M²) operations.
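The linear running time rests on the fact that every linear system appearing in the EM step (e.g., T w = w_0 on line 10 of Algorithm 3) is tridiagonal. As an illustration, a banded solver runs in O(M); the small Python/SciPy wrapper below shows this for a symmetric tridiagonal system described by its main and first off-diagonals (the function name and interface are assumptions of this sketch, not the thesis code).

    import numpy as np
    from scipy.linalg import solve_banded

    def solve_tridiagonal(main, off, rhs):
        # Solve A x = rhs for a symmetric tridiagonal A with main diagonal
        # `main` (length M) and first off-diagonal `off` (length M-1).
        # solve_banded stores A in matrix diagonal ordered form: row 0 is
        # the superdiagonal (left-padded), row 1 the diagonal, row 2 the
        # subdiagonal (right-padded). The banded factorization is O(M).
        M = len(main)
        ab = np.zeros((3, M))
        ab[0, 1:] = off
        ab[1, :] = main
        ab[2, :-1] = off
        return solve_banded((1, 1), ab, rhs)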
3.9.3 Comparison of neuroblastoma CNA detection using different array platforms

The DNA from the two neuroblastoma cell lines with (SK-N-BE2, SMS-KAN) and the two without (CHLA-20, LAN-6) MYCN oncogene amplification was analyzed for DNA copy number alterations. Three Affymetrix genotyping arrays (50K Xba, 250K Nsp, 250K Sty) and Illumina's HumanHap550 genotyping BeadChip were used to generate the copy number data. A total of 105 breakpoints were identified on at least two of the platforms using the SBL algorithm and were used for further analysis (Tables 3.6, 3.7 and 3.8). Figure 3.7 shows the graphical output of the algorithm on representative chromosomes where significant CNA are known to be associated with neuroblastoma.

Of the 105 breakpoints identified, 68 (65%) were identified on all platforms using GADA (Tables 3.7 and 3.8). The lowest density platform, Xba, detected 78 (75%) of the 105 breakpoints, while the highest density platforms detected all (100%) of the breakpoints. The detected alterations include the correct identification of the MYCN oncogene in the two cell lines with known MYCN amplification status, and other common alterations found in the neuroblastoma genome: loss of the proximal region of 1p, gain of 17q, and loss of the distal region of 11q. Although SK-N-BE2 showed a copy number of two for chromosome 1p (Figure 3.7), genotype information revealed loss of heterozygosity (LOH) in this region (i.e., uniparental disomy - data not shown), with the gain of 1q not reflecting any significant change in the rate of heterozygosity. There was also no gain of 17q in this cell line, but there was loss of 17p and LOH for this region. Finally, we compared GADA and CBS detection performance on this real dataset. The concordance rate between GADA and CBS for breakpoints that were detected by at least two platforms was 93% (array-specific concordances: Xba 97%, Nsp 90%, Sty 98%, Nsp+Sty 90%, Illumina 95%). There was also no significant difference between CBS and GADA in the distribution of distances for concordant breakpoints identified across the array platforms (Table 3.10).

3.10 Conclusions

We have introduced a new representation for genome copy number data and methodologies to detect CNA. The proposed PWC representation provides very useful properties such as sparseness, embeddedness, and computational efficiency. This representation was exploited using a novel combination of two algorithms: the first is based on sparse Bayesian learning (SBL), and the second is a stepwise backward elimination (BE) procedure. The combination of these approaches results in an accurate and fast methodology, which we call GADA, to detect CNA. To the best of our knowledge, this is the first report that applies SBL to detect copy number changes or to estimate PWC representations in any application.

In simulated datasets, the GADA approach obtained the best performance in accurately detecting CNA when compared to other approaches. We have also demonstrated its applicability to two different commercial microarray platforms (Affymetrix and Illumina). The fast computational speeds obtained in analyzing these large arrays should allow further development of our algorithm for analyzing large cohorts of samples.

Although the inclusion of allele-specific copy number data has not been addressed in this work, the Bayesian framework in our algorithm could be extended to include genotype data to improve the placement of breakpoint positions. The genotype data and population heterozygosity frequencies could be used to jointly estimate loss of heterozygosity and allele-specific copy number alterations. The advantage of such an approach is evident in our analyzed data of tumor cell lines with copy-neutral LOH of chromosome 1p.

The performance of the proposed GADA approach has been studied and evaluated assuming that the hybridization noise is additive white Gaussian [103]. However, real microarray probe hybridization intensities may be affected by a wide range of platform-specific effects like regional trends, non-uniform variance and outliers.
Normalization of the microarray probe intensities can correct or minimize the impact of some of these effects in a pre-processing step, to ensure that the data closely follows the model. Additionally, there exist several statistical tests (e.g., the White test, the Breusch-Pagan test or the Kolmogorov-Smirnov test) that could be performed on the residuals of the resulting segmentations to check for the presence of effects ignored by the model. Chapter 5 will evaluate the impact of these different possible departures from the assumed model on the accuracy of GADA, and consider how these departures could be incorporated into the Bayesian approach described here.

Figure 3.7: Inferred copy numbers for neuroblastoma cell lines SK-N-BE2, SMS-KAN, LAN-6 and CHLA-20. Cell lines were analyzed using Affymetrix's genotyping arrays 50K Xba, 250K Nsp and 250K Sty and Illumina's HumanHap550 genotyping BeadChip. The output of our software GADA (SBL) used the critical value T = 4.8 and is compared to DNAcopy (CBS) with α = 0.01. T was adjusted to the point where an increase in T removed concordant CNA between samples and platforms, and a decrease in T did not provide additional concordant CNA regions. Blue color tones indicate loss of genetic material, and red color tones amplification.

Table 3.6: Significant copy number alterations found in four neuroblastoma cell lines

Chr.   SK-N-BE2 / SMS-KAN / LAN-6 / CHLA-20
  1:  −(pEnd–p13.3)  −(pEnd–p36.12)  +(p21.3–qEnd)  +(p21.1–qEnd)  +(p21.1–qEnd)
  2:  +(pEnd–p21)  +(pEnd–p24.1)  +(pEnd–p22.1)  ++(pEnd–p16.1)  ++p24.3 MYCN  ++p24.3 MYCN  +(p16.1–q31.1)  ++p24.1  +q35  −q22.1  −q23.3  +(q32.2–q37.2)
  3:  −(pEnd–p14.2)  −(pEnd–p14.3)  +(p12.1–p11.1)
  4:  −(p16.1–p15.33)  −(q12–q22.1)  −p24 KLHL5  −q22.3  +(q35.1–q35.2)  +(q34.1–qEnd)
  5:  −(q35.3–qEnd)  +q11.2
  6:  −(q12–p16.3)  −(q22.31–qEnd)  −−q26 PARK
  7:  +(pEnd–p15.1)  −q21.1 AHR  +7  −(p14.3–q11.21)  −(p11.21–q11.22)  −q33 SEC8L1
  8:  −(pEnd–p12)  −(q22.1–q23.3)  +q21.3  −−q24.23  +(q22.2–q24.1)
  9:  −p24.2 GLIS3  −(p23–p21.2)  −−p21.3 MTAP  −p13.3 RECK
 10:  +(pEnd–p11.23)  −q22.3 PTPRE
 11:  +(q13.1–qEnd)  −(q14.2–q23.3)  +(q13.4–q25)  −q14.1  −−(q25–qEnd)  +q22.1 CNTN5
 12:  +(q23.3–q24.33)  +12  ++q24.33
 13:  −q31.1
 14:  −(q23.2–qEnd)  +0(q31.3) TTC8
 15:
 16:  +16q  +(pEnd–p13.3) LEP
 17:  −−p11.2 EPN2  −(pEnd–q11.2)  −(pEnd–q11.2)  +17  +(q21.2–qEnd)
 18:  −18  +(p11.23–qEnd)
 19:  −(q13.2–q13.33)  +19p
 20:  −p13
 21:
 22:
  X:  X  XX  X  XX

Table listing the most significant copy number alterations (T = 5) that have been found on at least two of the platforms analyzed (Xba, Nsp, Sty, Nsp+Sty).
Table 3.7: Copy number breakpoints found on all platforms (SK-N-BE2 and SMS-KAN)

Columns: Cell-line name; Chr & Cytoband; GADA positions [BP] (Xba, Nsp, Sty, Nsp+Sty, Illumina); CBS positions [BP] (Xba, Nsp, Sty, Nsp+Sty, Illumina).

  SK-N-BE2  1p21.3    97045920 96895983 97183094 96895983 96808701    96602215 96564172 95918556 96486927 96808701
  SK-N-BE2  2p24.3    15977810 15978001 15977810 15978001 15979864    15977810 15978001 15977810 15978001 15979864
  SK-N-BE2  2p24.3    16419609 16453092 16462002 16462002 16463522    16419609 16453092 16462002 16462002 16465097
  SK-N-BE2  2p21      48447814 47722197 48071628 47629563 47840828    48447814 47728339 48071628 47738916 47840828
  SK-N-BE2  3p14.2    61381385 61301823 61423021 61159361 61312730    61227509 61301823 61137444 61241447 61312730
  SK-N-BE2  8q24.23   137748993 137757306 137735555 137747078 137747933    137748993 137746403 137735555 137746403 137747933
  SK-N-BE2  8q24.23   137892295 137955330 137924208 137924208 137919630    137892295 137931617 137924208 137931617 137919630
  SK-N-BE2  11q13.1   64310154 64977325 65339642 65010150 65335248    64310154 64977325 65339642 65339642 65193464
  SK-N-BE2  17q11.2   28109086 28263828 28164827 28283675 28247634    28109086 28263828 28164827 28266739 28214976
  SK-N-BE2  20p13     2406160 2773972 3036010 3036010 2987115    2406160 2773972 3036010 3036010 2987115
  SMS-KAN   1p13.3    108157301 107888773 108203626 108218567 108177825    108157301 108208349 108186644 108178227 108177825
  SMS-KAN   2p24.3    15721907 15868241 15853157 15868241 15869663    15721907 15868241 15853157 15868241 15869663
  SMS-KAN   2p24.3    16419609 16576876 16551640 16576876 16578889    16419609 16576876 16551640 16576876 16578889
  SMS-KAN   2p24.1    21887032 21974989 22013932 22013932 21992435    21887032 21974989 22013932 22013932 21992435
  SMS-KAN   2p24.1    22466771 22475341 22470539 22475341 22475673    22466771 22475341 22470539 22475341 22475673
  SMS-KAN   3p12.1    83843045 84210511 84386931 83873910 84165323    85157384 84137907 84386931 83960187 84165323
  SMS-KAN   3p11.1    88163152 90346746 97369003 90346746 90472437    88163152 90346746 96620438 90346746 90472437
  SMS-KAN   4q12      55018889 55049094 55058835 55049094 55040244    55152302 55049094 55056941 55049094 55040244
  SMS-KAN   4q22.2    94894653 94792613 94948871 94957181 94937901    94894653 94833508 94919018 94872717 94940338
  SMS-KAN   10p11.23  30297356 30435753 30721148 30685964 30559838    30297356 30552253 30721148 30721148 30551022
  SMS-KAN   10q26.2   129582283 129682011 129741611 129741611 129689191 129582283 129686948 129693110 129689191
  SMS-KAN   10q26.2   130102560 130079710 130148252 130148252 130172807 130095140 130168048 130148252 130172807
  SMS-KAN   11q14.2   85287454 85435237 85383124 85383124 85381622    85287454 85367689 85383124 85383124 85381622
  SMS-KAN   11q23.3   117735474 117791205 118235879 118235879 117802601    117735474 117791205 117823148 117823148 117802601
  SMS-KAN   19q13.2   45796700 45571967 45879279 45879279 45910451    45796700 45571967 45879279 45879279 45910451
  SMS-KAN   19q13.33  57641156 57395071 57400799 57395071 57374324    57641156 57395071 57322747 57395071 57374324

Table listing the locations of the copy number breakpoints detected by GADA and CBS on the neuroblastoma cell lines SK-N-BE2 and SMS-KAN that have also been found on all array platforms (Xba, Nsp, Sty, Nsp+Sty, Illumina).
Table 3.8: Copy number breakpoints found on all platforms (LAN-6, CHLA-20)

Columns: Cell-line name; Chr & Cytoband; GADA positions [BP] (Xba, Nsp, Sty, Nsp+Sty, Illumina); CBS positions [BP] (Xba, Nsp, Sty, Nsp+Sty, Illumina).

  LAN-6    1p36.33   21940846 22438012 22476248 22476248 22480144    22445846 22438012 22476248 22476248 22480144
  LAN-6    1q21.1    142661525 143607676 143798352 143887291 144106312    142661525 143607676 143798352 143798352 144106312
  LAN-6    2p22.1    42656869 41405461 33909326 41340562 41358977    42656869 41405461 41335143 41340562 41358977
  LAN-6    2q36      218185198 218577787 218754682 218756027 218628755    218395997 218577787 218754682 218754682 218628755
  LAN-6    2q36      220406369 220398375 220629809 220629809 220449867    220406369 220534849 220475305 220480390 220449867
  LAN-6    3p14.3    57324905 57032227 57238434 57238434 57215650    57324905 57032227 57238434 57238434 57215650
  LAN-6    6q12      68022441 68353881 68154792 68095015 68197504    68022441 68095015 67263660 68095015 68197504
  LAN-6    6p16.3    106344198 105090334 105177437 105110313 105135994    105275051 105090334 105110313 105110313 105112419
  LAN-6    6q22.31   123295546 123710384 123731764 123710384 123710826    123295546 123710384 123605395 123710384 123710826
  LAN-6    7p15.1    27784099 24892359 24256367 24892359 24919337    27784099 31129325 25064458 24892359 24919337
  LAN-6    7p14.3    31126902 31129325 31135758 31135758 31145274    31126902 31129325 31135758 31135758 31145274
  LAN-6    7q11.21   62171000 62343621 63050316 63050316 62336389    62171000 62910986 62283233 62934268 62336389
  LAN-6    7q11.21   63993279 63940919 63137057 63940919 63963370    63993279 63940919 65748163 63940919 63963370
  LAN-6    7q11.22   69847573 69835520 68907809 69835520 69831960    69738283 69835520 69807385 69835520 69831960
  LAN-6    8p12      39331612 37835460 37192319 38093651 37869447    37458281 37835460 38093651 38093651 37869447
  LAN-6    8q22.1    93948610 95445117 95274323 95445117 95414304    96630283 95445117 95372632 93495648 95414304
  LAN-6    8q23.3    116469762 116142633 116478611 116140376 116480735    116519714 116472717 116471437 116472717 116480735
  LAN-6    9p24.2    3394434 3579281 3316679 3579281 3585674    3394434 3579281 3316679 3579281 3585674
  LAN-6    9p24.2    4947650 4635935 4685068 4648449 4647040    4947650 4630212 4624827 4624827 4647040
  LAN-6    9p23      12741741 12643846 13524659 12649691 12706172    12741741 12643846 12164190 12754534 12716962
  LAN-6    9p21.3    21451790 21460464 21460997 21460997 21468318    21451790 21460464 21460997 21460997 21484643
  LAN-6    9p21.3    22185820 22158464 22197037 22197037 22197037    22404973 22158464 22197037 22197037 22197037
  LAN-6    9p21.2    28417657 28860162 28929272 28820009 28857478    28765609 28853202 28742971 28820009 28844830
  LAN-6    11q13.4   71592372 71549242 71591974 71591974 71607855    71592372 71549242 71591974 71591974 71634231
  LAN-6    12q23.3   105993065 106086197 106095939 106095939 106074551    105993065 106086197 106232150 106095939 106074551
  LAN-6    14q23.1   60468351 60341891 60393691 60343301 60386017    60468351 60436455 60393691 60393691 60386017
  LAN-6    17q11.2   25755541 24749129 24833230 24836351 24865310    25755541 24836351 24833230 24836351 24865310
  LAN-6    17q21.2   36391251 35264341 36690164 35852756 35294289    36095487 35264341 36556209 36393487 35294289
  LAN-6    17q22     50618719 51862430 51475956 51862430 51856911    50618719 51862430 51475956 51862430 51855630
  CHLA-20  1q21.1    120089986 143607676 120928505 143887291 143328536    142661525 142756696 143798352 143657867 14802010
  CHLA-20  2p16.1    68885099 57662314 55736176 57577846 57662175    57629406 57662314 58097954 57625311 57583694
  CHLA-20  2q31.1    174726584 174705263 174793287 174717018 174730921    174726584 174705263 174717018 174717018 174730921
  CHLA-20  2q32.2    178651035 178581699 179377095 178574520 178576047    178214924 178574520 179028748 178574520 178576047
  CHLA-20  4p16.1    5722378 5842107 5913372 5842107 5844271    5722378 5842107 5913372 5842107 5842107
  CHLA-20  4p15.33   12551237 12300938 12392275 12321237 12316278    12551237 12300938 12391799 12321237 12300938
  CHLA-20  4q34.1    175185953 174895669 175097390 174895669 174897540    175185953 174895669 174890618 174895669 174895669
  CHLA-20  8q21.3    87234127 87640754 87583206 87583206 87594384    87858742 87640368 87583206 87583206 87594384
  CHLA-20  8q21.3    90473708 90386811 90407394 90407394 90366715    90138115 90372039 90407394 90376886 90366715
  CHLA-20  8q22.2    100574413 99817691 99940387 99940387 99638911    99803560 99817691 99940387 99940387 99638911
  CHLA-20  8q24.1    127917518 128903451 128922998 128922998 128913903    127917518 128903451 128922998 128922998 128913903
  CHLA-20  18p11.23  8617957 8510927 8200848 8510927 8392640    8617957 8393289 8309751 8448484 8392640
  CHLA-20  19q12     33171613 32761177 23876259 32690406 24095263    33171613 32761177 24165666 32690406 24053526

Table listing the locations of the copy number breakpoints detected by GADA and CBS on the neuroblastoma cell lines LAN-6 and CHLA-20 that have also been found on all array platforms (Xba, Nsp, Sty, Nsp+Sty, Illumina).

Table 3.9: Additional copy number breakpoints found by at least two platforms

Columns: Cell-line name; Chr & Cytoband; GADA positions [BP] followed by CBS positions [BP] (platform order Xba, Nsp, Sty, Nsp+Sty, Illumina; positions are listed only for the platforms on which the breakpoint was found).

  SMS-KAN  2q22.1    141962960 142006622 142006622 141996840 141962960 141991258 141991258 141996840
  SMS-KAN  2q22.1    142284086 142419855 142419855 142418322 142284086 142419855 142386408 142418322
  SMS-KAN  2q23.3    152928225 152959722 152945591 152947104 152928225 152959722
  SMS-KAN  2q23.3    153172826 153233532 153356905 153182040 153172826 153231102
  SMS-KAN  4q22.3    98081595 97971534 97971534 97946136 98081595 97971534 97971534 97946136
  SMS-KAN  4q22.3    98389453 98794422 98492192 98515047 98389453 98489669 98489669 98515047
  LAN-6    4q35.1    187001741 187043679 187043679 187037031 187001741 187248120 186997226 187037031
  LAN-6    4q35.2    189693227 189400592 189537964 189715209 189693227 189400592 189693227 189715209
  LAN-6    5q35.3    178389593 178435944 178439675 178388353 178389593 178435944 178439675 178388353
  LAN-6    6q26      162768919 162672040 162783990 162769931 162768919 162770133 162770133 162769931
  LAN-6    6q26      163042210 162863051 163042210 163069487 163073363 162948280 163042210 163069487
  LAN-6    7p21.1    17042942 17048506 17048506 17079506 17079506
  LAN-6    7p21.1    17194089 17293268 17293268 17187461 17194647
  LAN-6    9p13.3    35932406 35934224 35923323 35923323
  LAN-6    9p13.3    36095264 36117196 36036596 36036596
  LAN-6    11q25     134393784 134408260 134410991 134410991
  LAN-6    12q24.32  125117158 125475975 125319087 125257058 124609076 125131448 125319087
  LAN-6    12q24.33  127723245 127722879 127723245 127723245 127723245 127835197 127723245 127723245
  LAN-6    13q31.1   82988642 82996585 82996585
  LAN-6    13q31.1   83045936 83063672 83055928
  LAN-6    14q31.3   88300117 88381272 88443831 88310402 88300117 88381272 88443831
  LAN-6    14q31.3   88499809 88623396 88625132 88647502 88499809 88623396 88634883
  LAN-6    16p13.3   5755300 5775884 5775884 5679682 5669239 5775884 5802165 6023611
  LAN-6    16q23.3   82340991 82342624 82374996 86364648 82340991 82342624 82502789
  LAN-6    17        19109505 19117656 19120783 19117656 19120783
  LAN-6    17        19175068 19175068 19145456 19175068 19145456
  CHLA-20  2q37.2    236702079 236768287 236768287 236738515 236702079 236844697 236768287 236765691
  CHLA-20  4p14      38741486 38741486 38758076 38741486 38741486 38752396
  CHLA-20  4p14      38996761 38996761 39068335 38996761 38996761 39006763
  CHLA-20  5q11.2    50870182 50824363 50850389 50824363 50824363 50850389
  CHLA-20  5q11.2    51529692 51532772 51532772 51529692 51532772 51532772
  CHLA-20  7q33      132875721 132875721 132884795 132875721 132875721 132875949
  CHLA-20  7q33      133004505 133004505 132996066 133004505 133004505 132996066
  CHLA-20  11q14.1   78683236 78694008 78694008 78691521 78694008 78694008 78691521
  CHLA-20  11q14.1   78805883 78801531 78801531 78814667 78801531 78801531 78818346
  CHLA-20  11q22.1   99371373 99366936 99366936 99378927 99371373 99240362 99366485 99378927
  CHLA-20  11q22.1   100243669 100322922 100322922 100299086 100243669 100319819 100322922 100299086

Table listing the locations of the copy number breakpoints detected by GADA and CBS on the four neuroblastoma cell lines (SK-N-BE2, SMS-KAN, LAN-6, CHLA-20) that have also been found on at least two of the array platforms analyzed (Xba, Nsp, Sty, Nsp+Sty, Illumina).

Table 3.10: Differences in copy number breakpoint placement between chips

  Chips compared                       # cases   MAD [BP]           K-S p-value
                                                 GADA     CBS       GADA larger   CBS larger
  i)   min(|Xba−Sty|, |Xba−Nsp|)       59        95670    93132     0.54          0.76
  ii)  |Nsp−Sty|                       61        88024    71265     0.35          1.0
  iii) |(Nsp&Sty)−Illumina|            91        22784    21388     0.58          0.95

For the confirmed breakpoints, excluding those near the centromere, we computed the median absolute difference in breakpoint location between chips (units in base pairs [BP]), and the p-value associated with the Kolmogorov-Smirnov test for the hypothesis that the differences are stochastically larger (i.e., less accurate) for one algorithm vs. the other. The chips compared are: i) the Xba to the Sty and Nsp chips separately, ii) the Nsp to the Sty chips, and iii) the combined Nsp and Sty chips to the Illumina chips. No significant differences in accuracy were found between chips of similar size in number of probes.

Chapter 4
Bayesian detection of recurrent copy number alterations across multiple array samples

Copy number changes (CNAs or CNVs) affecting small portions of chromosomes are difficult to identify. Advances in microarray technology now allow very high resolution scans of large cohorts of samples, but at the price of severe noise degradation. Our proposed genome alteration detection algorithm (GADA) has been shown to be a highly accurate and efficient approach for analyzing a single array sample. In this chapter, the sparse Bayesian learning (SBL) used in GADA is extended to model CNA on multiple samples that share breakpoint positions but may have different magnitudes of alteration.¹ Our model is especially well suited to analyzing sample replicates, i.e., multiple arrays from the same specimen. Our results show that replicates greatly improve the accuracy and robustness of detection. In some cases, a single replicate sample offers an accuracy equivalent to a 2-fold increase in the signal-to-noise ratio, while reducing by up to 50% the detection of false CNA caused by outliers. The computational cost of the algorithm is essentially linear, O(NM), in the number of microarray probes, M, and samples, N. In conclusion, the multiple-sample GADA (N-GADA) presented here appears to be a promising tool for finely locating small CNAs that are shared across multiple samples.

¹ Parts of this work have been presented in [74].
4.1 Introduction

Recent advances in microarray technology, enabling high resolution genomic scans of large cohorts of individuals, have revealed the presence of short copy number variations (CNVs) that are repeated across normal genomes (i.e., polymorphic CNAs) [82], constituting a completely new source of unstudied natural genetic variation. Small alterations are the most difficult to detect and the ones most likely to lead to false detections, because of severe noise degradation. A joint analysis of many samples would undoubtedly increase the performance in detecting small CNAs, but nearly all currently available algorithms only analyze one sample at a time.

In Chapter 3 (see also [75,79]) we developed a copy number detection approach called GADA (genome alteration detection algorithm) that achieved excellent performance in single-sample CNA detection. Compared to other state-of-the-art methods, using standard evaluation datasets and benchmarks [103], GADA obtained the highest accuracy and was at least 100 times faster. GADA is based on a compact linear algebra representation of the array probe intensities as a piece-wise constant (PWC) vector and makes use of a two-step detection approach. In the first step, sparse Bayesian learning (SBL [97,104]) identifies all potentially interesting breakpoints that delimit the CNA. The second step uses a backward elimination (BE) procedure to statistically rank the identified breakpoints, allowing flexible control of the false discovery rate (FDR).

In this chapter we extend GADA to detect CNA across multiple samples (N-GADA). The method is especially suited to detecting CNAs from sample replicates, since the underlying breakpoint locations should be the same, but the mean magnitudes of the array probe measurements may be different. These differences may be due to sample contamination, different initial DNA concentrations, or other uncontrolled effects that cannot be corrected. Compared to the large number of algorithms proposed for single-sample CNA analysis, there are very few approaches dealing with the multiple-sample problem [17,60,84,89]. Two of them [17,84] are post-processing techniques that refine the results obtained by a given single-sample algorithm and do not propose a joint model. The other two approaches [60,89] propose models that only encourage overlap among CNAs across samples.

In contrast, our approach is unique in the sense that it encourages recurrent breakpoint positions. More precisely, the SBL hierarchical prior is modified in this chapter to encourage the selection of breakpoints delimiting CNA at exactly the same positions across the samples under analysis. We hypothesize that this may be a more powerful model when there is underlying evidence that the alterations start and end at recurrent positions, as is the case for sample replicates and possibly for CNA polymorphisms. In order to evaluate N-GADA we used simulation and real datasets of pairs of replicate samples with the same underlying copy number profile. The results of the new approach show that replicates greatly improve the accuracy and robustness of detection while maintaining very good computational efficiency.

This chapter is structured as follows. The extended N-GADA approach and its implementation are presented in Section 4.2. Section 4.3 is devoted to presenting the results, and conclusions are discussed in Section 4.4.

4.2 N-GADA for multiple samples with shared breakpoints

In this section we extend the GADA approach so that it can handle multiple samples.
First, we extend the SBL hierarchical prior to model sparse breakpoints occurring at similar locations across multiple samples. Then, we describe how to efficiently fit the resulting model using an expectation maximization (EM) procedure [66]. Finally, we detail the new multiple-sample implementation of the BE procedure to control the false discovery rate (FDR).

4.2.1 Sparse Bayesian Learning for multiple samples with shared breakpoints

As was shown in Section 3.4, CNA detection can be formulated using SBL as the problem of finding the maximum a posteriori (MAP) estimate:

    ŵ_MAP = argmax_w p(w|y) = argmax_w p(y|w) p(w) = argmin_w [ −log p(y|w) − log p(w) ]    (4.1)

where the observation model p(y|w) specifies a goodness-of-fit measure and the prior distribution for the weights p(w) specifies the sparseness constraints. Here, we extend our previously proposed model in Section 3.4 to multiple samples. Assuming that the noise is normal and independent across probes m and samples n, for a given underlying CNA profile for each sample, x^n = F w^n, the observation model is:

    p(y^1, ..., y^N | w^1, ..., w^N) = ∏_{n=1}^{N} N(F w^n, σ_n² I)    (4.2)

and the prior distribution for the weights is specified as a hierarchical prior:

    p(w^1, ..., w^N | α) = ∏_{n=1}^{N} ∏_{m=1}^{M−1} N(w^n_m | 0, α_m^{−1})    (4.3)

where α is a vector of hyperparameters that are distributed according to a gamma distribution:

    p(α) = ∏_{m=1}^{M−1} Γ(α_m | a, b).    (4.4)

Notice that here the α hyperparameters are shared across the multiple samples. This is in contrast to the application of SBL in 1-GADA, which implies that a different set of α hyperparameters is used for each sample. The role of the hyperparameter α_m is to control the likelihood of the presence of a breakpoint at a particular position of the genome, but without imposing any restriction on the actual magnitude of the breakpoint w^n_m and its corresponding CNA.

The mathematical procedures to fit this multiple-sample model and to infer the CNA breakpoints are basically the same as in 1-GADA (Chapter 3). We also use the EM algorithm, exploiting the conjugacy properties between the gamma and normal distributions, as well as the properties of our PWC representation (i.e., the matrix structure of F). The E-step is the same as before, but repeated for each of the samples; i.e., finding the posterior distribution given the hyperparameters and the observation:

    p(w^n | y^n, α, σ_n²) = N(w^n | μ^n, Σ^n)    (4.5)
    Σ^n = (σ_n^{−2} F^t F + diag(α))^{−1}    (4.6)
    μ^n = σ_n^{−2} Σ^n F^t y^n    (4.7)

The M-step, on the other hand, takes all the samples into account in computing the α hyperparameters:

    α̂_m = (2a + N) / ( Σ_n [ Σ^n_mm + (μ^n_m)² ] + 2b )    (4.8)

The EM algorithm requires very few iterations to converge in our experiments, and all the operations required in each iteration can be performed in a linear number of steps, O(NM). This is clear for the M-step, and we already demonstrated in Chapter 3 that the operations required to compute μ (4.7) and the diagonal of Σ (4.6) take O(M) per sample, since we can exploit the fact that (F^t F)^{−1} is a tridiagonal matrix.
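To make the E/M alternation concrete, here is a small dense Python/NumPy sketch of one EM pass for the shared-α model of (4.5)-(4.8). For readability it inverts the full posterior covariance, which costs O(M³), instead of using the O(M) tridiagonal recursions of Chapter 3; the data layout and all names are assumptions of this illustration.

    import numpy as np

    def em_step(Y, F, sigma2, alpha, a=0.2, b=1e-6):
        # Y: (N, M) probe intensities; F: (M, M-1) PWC dictionary;
        # sigma2: (N,) per-sample noise variances; alpha: (M-1,) shared.
        # Returns (mu, alpha_new) with mu of shape (N, M-1).
        N, M = Y.shape
        FtF = F.T @ F
        mu = np.zeros((N, M - 1))
        ssum = np.zeros(M - 1)       # accumulates Sigma^n_mm + (mu^n_m)^2
        for n in range(N):           # E-step: one posterior per sample, (4.5)-(4.7)
            Sigma = np.linalg.inv(FtF / sigma2[n] + np.diag(alpha))
            mu[n] = Sigma @ (F.T @ Y[n]) / sigma2[n]
            ssum += np.diag(Sigma) + mu[n] ** 2
        alpha_new = (2 * a + N) / (ssum + 2 * b)   # shared M-step, (4.8)
        return mu, alpha_new

Because the M-step pools the evidence Σ^n_mm + (μ^n_m)² over all N samples, a breakpoint weakly supported in each individual sample can still keep its shared α_m small if the support recurs across samples.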
4.2.2 Backward Elimination for multiple samples with shared breakpoints

In Chapter 3, the statistical significances of the breakpoints returned by SBL were ranked by a simple BE procedure using a standard linear regression model. Here, this is done within the SBL algorithm, but taking into account the statistical evidence observed across multiple samples. For a single sample, both approaches are essentially equivalent; but the new approach can better exploit the information gathered by SBL about the multiple samples (i.e., the α hyperparameters).

In this new BE procedure, after the SBL has converged for the first time to a set of breakpoints with high sensitivity, each breakpoint is statistically scored as

    t_m = sqrt( Σ_n (μ^n_m)² / Σ^n_mm )    (4.9)

The square of this score can be seen as the sum of the individual squared scores (Section 3.5) of each individual sample n at position m. Following Appendix B, the t_m² score also represents the total increase in the residual sum of squares (RSS), across all samples, after removing breakpoint m. The lowest-scoring breakpoint is recursively eliminated from the model by setting w_m = 0 and repeating the EM algorithm described in Section 4.2.1. Eliminating the lowest t_m increases the sparseness by removing the breakpoint that minimally increases the RSS, i.e., the one most likely to be a false positive (noise). Repeating the EM to re-estimate the hyperparameters α_m appears to lead to better results with the model proposed in this chapter, because the α_m are shared across samples, whereas it makes no difference when the α_m differ across samples (Chapters 3 and 5). Finally, as in Section 3.5, the sensitivity vs. FDR trade-off is controlled by stopping the BE procedure when all the remaining breakpoints have a score higher than a critical value T.
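As a small illustration of (4.9), the shared score can be computed directly from the per-sample posterior moments produced by the E-step sketch in the previous section (names are again assumptions of this illustration):

    import numpy as np

    def shared_scores(mu, sigma_diag):
        # mu, sigma_diag: (N, M-1) arrays of posterior means mu^n_m and
        # posterior variances Sigma^n_mm. t[m] pools the evidence across
        # the N samples; its square is the sum of the per-sample squared
        # t scores, matching eq. (4.9).
        return np.sqrt(np.sum(mu ** 2 / sigma_diag, axis=0))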
4.3 Results

In this section we evaluate the proposed N-GADA algorithm for the case where N = 2 replicates are available, but the results extend to other N. We employed the same artificial dataset used in Section 3.8.1, which consists of 500 samples of 20 chromosomes with 100 probes, where the underlying CNA are known and the noise is i.i.d. Gaussian. We generated the sample replicates using the same ground truth but with an independent new noise realization ε ∼ N(0, σ²I), with uniformly distributed noise power σ ∼ U(0.1, 0.2) and tissue mixture p ∼ U(0.3, 0.7) parameters. This kind of simulation [103] may not reflect all possible scenarios, but constitutes the most widely used method for quantitative evaluation.

These 2×500 samples are used to compare the performance of N-GADA to two other alternatives (Figure 4.1). The algorithms that combine both samples, i.e., 2-GADA and naive averaging, greatly improve the accuracy of breakpoint detection in comparison to the case in which no replicates are available (1-GADA). Roughly, a sample replicate is equivalent to a two-fold increase in the signal-to-noise ratio of a single sample. The results obtained by naive averaging are slightly worse than those of the 2-GADA approach, because the former assumes that the breakpoints and the segment reconstruction levels are the same, while in the latter only the breakpoints are the same. In this simulation dataset, the reconstruction levels of each sample in a pair change depending on the tissue mixture parameter p.

Figure 4.1: PROC operational curves for the mean sensitivity vs. FDR in detecting real copy number changes at their exact location. The black curve consists of applying 1-GADA to each of the two samples independently. The red curve combines the two samples by a weighted average into a single sample, which is analyzed by 1-GADA. The blue curve is the proposed N-GADA approach. The benchmark metrics, sensitivity and FDR, are the same as originally defined in [103] in terms of CNA breakpoint detection.

In order to further assess the performance in terms of robustness, we randomly introduced single-probe outliers (extreme values) in only one of the samples of each pair in a simulation dataset. Ideally, we would like to avoid false detections that are only supported by one of the samples. The single-sample algorithm and the one based on sample averaging cannot distinguish these outliers, and nearly all of them will cause false detections. On the other hand, 2-GADA reduces the false detections caused by these outliers by about 50%.

Our results on real data are also in accordance with the findings obtained using simulation data. Figure 4.2 shows a visual representation of some of the CNA detected at 3 different FDR operating points (T settings) for a pair of replicate samples (S1, S2) analyzed with the Affymetrix 500K platform. The CNA found are very short segments because the samples are from a healthy human subject (NA01416). We can observe the higher sensitivity of the 2-GADA approach on the deletion in q35; the CNA is retained at a higher significance setting, T = 7, while it is removed by the single-sample approaches. This higher sensitivity can also be achieved by the sample averaging procedure, but this naive combination may cause more spurious false CNA (see 3rd column, T = 4). On the other hand, the 2-GADA approach is more robust since it retains the information on the origin of each observation. This can also be seen for an S2 outlier in q21.13 (T = 5); 2-GADA eliminates this false alteration, since it is not supported by S1, one of the two samples, while in naive averaging this outlier causes a false detection. In terms of computational speed, the 2-GADA approach is very competitive, with computational complexity linear in the number of probes M and samples N.

4.4 Conclusions

This chapter presented a novel approach, N-GADA, to solve the problem of finding CNA with breakpoints at recurrent positions across multiple samples. N-GADA extends the single-sample algorithm GADA presented in Chapter 3 using a Bayesian hierarchical prior for the breakpoints that is shared across all the samples. Simulation and real data results show that the proposed approach achieves higher accuracy and robustness to outliers when sample replicates are available. The resulting approach retains a linear complexity in the number of samples and probes. Thus, the approach can be considered a promising tool for discovering small alterations that are recurrent across many samples.

In this chapter, the noise is considered independent across samples, but microarrays can also share a significant amount of noisy artifacts across samples. These array artifacts would look similar to true recurrent CNVs if not corrected. Chapter 5 will propose a method for correcting shared artifacts when multiple samples are available. This new method can be used in combination with the shared breakpoint method proposed in this chapter.

Figure 4.2: Visual representation of the detected CNA using different algorithms and settings (columns) on two replicates (S1 and S2) of a normal human sample (NA01416) analyzed using the Affymetrix 500K (Nsp) platform. Columns are divided into three sections, each representing a different threshold T used for CNA detection.
In each group, the first two columns correspond to the independent analysis of S1 and S2 using 1-GADA, the third column is the result of applying 1-GADA to the S1 and S2 weighted average, and the last two columns in each group (4th and 5th) are the outputs corresponding to S1 and S2 resulting from the 2-GADA joint analysis. For each claimed CNA, red tones represent amplification and blue tones loss of genetic material.

Chapter 5

Joint estimation of copy number variation and reference intensities on multiple DNA arrays using GADA

The complexity of a large number of recently discovered copy number polymorphisms is much higher than initially thought, thus making it more difficult to detect them in the presence of significant measurement noise. In this scenario, separate normalization and segmentation is prone to lead to many false detections of changes in copy number. New approaches capable of jointly modeling the copy number and the non-copy-number (noise) hybridization effects across multiple samples will potentially lead to more accurate results. In this chapter, the genome alteration detection analysis (GADA) approach introduced previously is extended to a multiple-sample model. The copy number component is independent for each sample and uses a sparse Bayesian prior, while the reference hybridization level is not necessarily sparse but is identical on all samples. The EM algorithm used to fit the model iteratively determines whether the observed hybridization levels are more likely due to a copy number variation or to a shared hybridization bias. The new proposed approach is compared to the currently used strategy of separate normalization followed by independent segmentation of each array. Real microarray data obtained from HapMap samples are randomly partitioned to create different reference sets. Using the new approach, copy number and reference intensity estimates are significantly less variable when the reference set changes, and a higher consistency of the copy numbers detected within HapMap family trios is obtained. Finally, the running time to fit the model grows only linearly in the number of samples and probes.

5.1 Introduction

The DNA copy number alterations (CNAs) introduced in Chapter 3 were focused mainly on the relatively large chromosomal aberrations that are commonly seen in cancer tumor cells. In recent years, as the resolution of DNA microarrays increased, it became apparent that changes in DNA copy number are also widespread across normal genomes. This natural DNA copy number variation (CNV) can be seen in many places across the genome of healthy individuals, but the altered DNA segments are much shorter in size compared to those generally seen in cancer (CNAs) [26–28,47,82]. In this chapter we extend the results of Chapter 3 to the detection of copy number variations (CNVs).

Currently, some known CNV regions (CNVRs) catalogued in the database of genome variants (DGV) [47] tend to be large and not very clearly delimited regions, and they also miss smaller CNVs (∼10 Kbases) due to the lower resolution of the technologies used in the past [65,72]. Moreover, the copy number structure of some of the most highly polymorphic regions has a much higher complexity than initially thought [72]. In order to understand the role of CNVs as a genetic determinant we still require a better characterization and delimitation of these polymorphisms along the entire genome using higher resolution arrays and more accurate detection algorithms.

There are several array platforms that can be used to measure CNV.
In this work, we focus on the latest high-density array platforms that use millions of short oligonucleotide probes distributed along the genome. The high resolution of these arrays is particularly well suited to detecting new short CNVs and to more accurately delimiting the position and structure of known CNVs. Additionally, some of these probes are also used to target the possible allelic variants of a single nucleotide polymorphism (SNP), making it possible to obtain with the same experiment the SNP and CNV genetic profile of a sample. Commercially available platforms with these characteristics include the Affymetrix SNP 6.0 and Illumina Human 1M microarrays.

The basic premise for copy number detection is that the hybridization intensities of probes falling under a CNV region will have higher (or lower) values than those expected on a non-copy-number-variant (non-CNV) region. However, probe hybridization intensities also depend on other non-copy-number-related events, which can be regarded as experimental noise that makes CNV detection a challenging problem. Correct estimation of the reference hybridization intensity expected on a non-CNV probe is essential for the noise to have zero mean.

Existing CNV detection algorithms assume that the measurements are unbiased (i.e., the noise distribution is centered at zero) and require a separate pre-processing step for normalization and reference extraction (Figure 5.1 A). Different normalization approaches have been proposed, including CNAT [43], dChip [107], CNAG [68], GEMCA [54], BeadStudio [71], CRMA [7] and ITALICS [83]. Examples of non-copy-number sources of variation that are targeted by some of these procedures include allele cross-hybridization (GEMCA, BeadStudio and CRMA) and, in Affymetrix chips, fragment length and GC content effects (CNAG, GEMCA, CRMA and ITALICS). These and other methods proposed by [18,63] can remove probe hybridization biases that have spatial correlation (i.e., "wave-like") or correlation with the GC content or fragment length. While the spatially correlated portion of the bias can be removed by these pre-processing methods (Global Effects Normalization in Figure 5.1 A), there is still a probe-specific bias, due to each probe's own binding affinity, that needs to be corrected. This is usually done by taking a robust average (trimmed mean, median, clustering) across a set of reference samples. However, if the aim is to identify CNVs that are frequent in the population, applying median normalization to a set of reference samples, where the non-CNV regions are not known a priori, would lead to biased results. This has already been pointed out as potentially problematic by several authors [7,54,83], and there is no clearly defined methodology currently available to establish a good reference in regions of the genome with complex CNV patterns.

In this chapter, we propose a new model for joint estimation of CNVs and the reference hybridization intensity associated with the non-CNV state. This new model extends our previous work on the genome alteration detection analysis (GADA) approach (Chapter 3, [75,79]), which achieved excellent accuracy on normalized samples. Specifically, we incorporate in the model a vector parameter for the reference intensities in addition to the CNV component. The copy number component, as in our previous work, is represented by a sparse Bayesian learning (SBL) prior, which favors estimates where each sample has a small number of CNV regions, but is uninformative with regard to the position and magnitude of these regions.
Extending this representation, our proposed hybridization 111 Figure 5.1: Copy number detection block diagrams: A) the typical workflow used to analyzecopynumberwithseparatepre-processing. B)thenewproposedworkflowusinga jointestimationmodelforcopynumbervariationsandtheprobehybridizationintensities. reference intensity has a flat prior, but the effects are shared among all samples. The EM algorithm is used to fit the model, and simultaneously estimates the copy number and the reference parameters in a given set of observed samples. The new approach is evaluated in both simulated datasets and microarray data ob- tained from a pool of HapMap specimens [38] using the Affymetrix SNP 6.0 array plat- form. We compared the new GADA with joint reference normalization (GADA-JRN), withthecurrentlyusedapproachofusingthemediantocomputethereferencehybridiza- tion(GADAwithseparatemediannormalization,GADA-SMN)and,withtheAffymetrix Genotyping Console GTC3.0.1. with GC correction. The presented results demonstrate that the detected CNV are significantly more consistent within the HapMap family trios. 112 CNV detected by the new approach are also less variable if we change the set of samples used to create the reference. Finally, the computation time to fit the new model GADA- JRN is very competitive, since the resulting algorithm (as was the case for GADA-SMN) has complexity that grows linearly with the number of samples and probes. 5.2 GADAmodelwithseparatemediannormalization(GADA- SMN) Inchapters3and4weassumedthattherewasnoremainingbiasafterlogratioextraction (LRE); i.e., ˜ y m =x m +ǫ m , (5.1) In contrast, this chapter assumes that for a collection of microarray experiments the following holds true: y mn =x mn +r m +ǫ mn (5.2) wherey mn representsthelog 2 ofthehybridizationintensityobservedbyprobemonarray n; x mn represents change in hybridization intensity due to altered copy number, r m is the reference probe hybridization intensity expected for a non-CNV state, and ǫ mn is a zero-mean array noise. In other words, comparing the models in (5.2) and (5.1), if the reference r m were known, we could move it to the left side and ˜ y mn =y mn −r m would become the log-ratio 113 intensities with only two remaining variations: the CNV effect x mn , and the zero-mean hybridization noise ǫ mn . Once this effect has been removed, the n in the notation can be dropped because each sample n can be analyzed separately. In general, the probe hybridization bias is not known and it has to be estimated and corrected. The typical approach to estimate this bias is to use the median or some other robust estimate of the mean and then perform copy number analysis independently in each sample. Throughout this chapter, GADA-SMN refers to the resulting approach of combining a separate median normalization followed by GADA (as in Chapter 3). 5.3 GADAmodelwithjointreferencenormalization(GADA- JRN) Thenewproposedmodeldoesnotassumethattheprobehybridizationbiasforeachprobe r m in (5.2) has been removed, it is instead estimated jointly with the copy number from a large number samples. In vector form, the observation model for the log-hybridization of each probe on sample n can be rewritten as: y n =x n +r+ǫ n (5.3) Therearetwobasicpremisesthatallowthejointestimationofthereferencerandthecopy number x n component. First, the copy number component x n on each sample is piece- wise constant with a small number of breakpoints K and can be efficiently represented withasx n =Fw n . 
Second, the probe hybridization bias (or non-CNV reference intensity) r is not necessarily PWC, but it is exactly the same across multiple arrays. Our model can also be extended to cases in which the amplitude of r may change, i.e., y_mn = x_mn + ρ_n r_m + ε_mn (see Section 5.3.2).

The copy number component is modeled using an independent sparse Bayesian learning (SBL) hierarchical prior for each breakpoint m and sample n:

p(w_1, \ldots, w_N \mid \alpha_1, \ldots, \alpha_N) = \prod_{n=1}^{N} \prod_{m=1}^{M-1} \mathcal{N}(w_{mn} \mid 0, \alpha_{mn}^{-1})    (5.4)

p(\alpha_1, \ldots, \alpha_N) = \prod_{n=1}^{N} \prod_{m=1}^{M-1} \Gamma(\alpha_{mn} \mid a, b)    (5.5)

The properties of the SBL prior are detailed in [75,97,104] and also in Chapter 3. Setting b = 0 and a small encourages a sparse number of breakpoints, but is uninformative with respect to the position and magnitude of the corresponding CNV regions. Compared to the single-sample model (N = 1), this model includes an SBL prior that is independent for each sample n, which means that independent CNV locations can be chosen across samples. The only parameter that is shared is the hyperparameter a, which controls the expected degree of sparseness (number of CNVs) in each sample. The noise ε, as in Chapter 3, is assumed normal, p(ε_n) ∼ N(0, σ²_n I), with a different variance parameter (with a flat prior) for each array experiment. Finally, the new r component is modeled using an uninformative flat prior.

5.3.1 Fitting the model with the EM algorithm

The model parameters are estimated by finding the maximum a posteriori (MAP) solution using an evidence maximization procedure similar to the one introduced in Chapter 3. The EM algorithm starts by setting α_m and r_m to zero; then it iterates the following E and M steps:

E step:
\Sigma^n = \left( \sigma_n^{-2} F'F + \mathrm{diag}(\alpha^n) \right)^{-1}    (5.6)
\mu^n = \sigma_n^{-2} \Sigma^n F' (y_n - r)    (5.7)
\hat{x}_n = F \mu^n    (5.8)

M step:
\hat{\alpha}^n_m = \frac{1 + 2a}{\Sigma^n_{mm} + (\mu^n_m)^2 + 2b}    (5.9)
\hat{\sigma}^2_n = \frac{1}{M} \left( \| y_n - \hat{x}_n - r \|^2 + \sigma^2_n \sum_m \left( 1 - \Sigma^n_{mm} \alpha^n_m \right) \right)    (5.10)
r = \frac{1}{N} \sum_n \left( y_n - \hat{x}_n \right)    (5.11)

where P(w_n \mid y_n, \alpha^n, r, \sigma^2_n) = \mathcal{N}(w_n \mid \mu^n, \Sigma^n), exploiting the conjugacy between the gamma and normal distributions. The same notation as in Chapter 3 is used, and the superscripts and subscripts n and m are added to identify the parameters that correspond to each sample and probe, respectively. For example, Σ^n_mm refers to the diagonal terms of the posterior covariance matrix of the breakpoint weights w_n for sample n. Convergence of the model is reached within very few EM iterations, and all required operations in each iteration can be performed in a linear number of steps, O(MN), by exploiting the properties of our PWC representation (i.e., the matrix structure of F; see Chapter 3 for details).

In relation to the previous GADA model [75] of Chapter 3, if the new r component were modeled by a fixed point estimate (e.g., the median across samples), then the entire model would be completely equivalent to processing each sample independently using GADA (i.e., GADA-SMN). This can also be seen in the EM steps: if r is a fixed point estimate, then (5.6)-(5.10) can be updated separately for each n. Therefore, in GADA-JRN the CNVs are placed independently in each sample and the coupling is only through the estimation of r. During the EM algorithm, most of the weight parameters ŵ_m = μ_m are driven to 0 to fulfill the sparseness constraints imposed by the hierarchical prior. Upon convergence to zero, the corresponding weight parameter and hyperparameter can be eliminated from the model, and the EM algorithm can continue with a model of reduced dimensions. This elimination makes the algorithm run faster, since it has to update fewer parameters (i.e., only those that are non-zero).
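The following Python/NumPy sketch implements one possible reading of the iteration (5.6)-(5.11), using dense linear algebra for clarity; it is not the O(MN) C implementation, which exploits the structure of F, nor does it prune the weights that converge to zero. The function name and default hyperparameter values are illustrative:

    import numpy as np

    def em_gada_jrn(Y, F, a=0.2, b=0.0, n_iter=50):
        # Y: M x N log-intensities; F: M x (M-1) PWC breakpoint basis.
        # a, b: SBL sparseness hyperparameters (illustrative defaults).
        M, N = Y.shape
        alpha = np.full((F.shape[1], N), 1e-3)   # per-sample hyperparameters
        sigma2 = np.ones(N)                      # per-array noise variances
        r = np.zeros(M)                          # shared reference intensities
        X = np.zeros_like(Y)                     # copy number components
        FtF = F.T @ F
        for _ in range(n_iter):
            for n in range(N):
                # E step (5.6)-(5.8)
                Sigma = np.linalg.inv(FtF / sigma2[n] + np.diag(alpha[:, n]))
                mu = Sigma @ F.T @ (Y[:, n] - r) / sigma2[n]
                X[:, n] = F @ mu
                # M step (5.9)-(5.10)
                d = np.diag(Sigma)
                resid = Y[:, n] - X[:, n] - r
                sigma2[n] = (resid @ resid
                             + sigma2[n] * np.sum(1 - d * alpha[:, n])) / M
                alpha[:, n] = (1 + 2 * a) / (d + mu**2 + 2 * b)
            # M step (5.11): the shared reference couples all samples
            r = np.mean(Y - X, axis=1)
        return X, r, alpha, sigma2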
5.3.2 GADA-JRN model with a scale parameter for the bias

The main model proposed in this chapter assumes that the probe hybridization bias of each probe, r_m in (5.2), has exactly the same magnitude across samples. However, we can adapt the model to include a different amplitude term ρ_n for each sample. In vector form, the observation model for the log-hybridization of each probe on sample n can be rewritten as:

y_n = x_n + \rho_n r + \epsilon_n    (5.12)

The same two basic premises that allowed the joint estimation of the reference r and the copy number component x_n also extend here with the inclusion of the ρ_n term. Notice that the probe hybridization bias (or non-CNV reference intensity) r is exactly the same across multiple arrays, but the addition of ρ_n allows a change in the amplitude and sign of r. The copy number component and the noise are modeled exactly as in Section 5.3, and the new ρ_n parameters are given an uninformative flat prior. The new resulting EM algorithm is:

E step:
\Sigma^n = \left( \sigma_n^{-2} F'F + \mathrm{diag}(\alpha^n) \right)^{-1}    (5.13)
\mu^n = \sigma_n^{-2} \Sigma^n F' (y_n - \rho_n r)    (5.14)
\hat{x}_n = F \mu^n    (5.15)

M step:
\hat{\alpha}^n_m = \frac{1 + 2a}{\Sigma^n_{mm} + (\mu^n_m)^2 + 2b}    (5.16)
\hat{\sigma}^2_n = \frac{1}{M} \left( \| y_n - \hat{x}_n - \rho_n r \|^2 + \sigma^2_n \sum_m \left( 1 - \Sigma^n_{mm} \alpha^n_m \right) \right)    (5.17)
\rho_n = \frac{ \sum_m y_{mn} r_m }{ \sum_m r_m^2 }    (5.18)
r = \frac{ \sum_n \rho_n (y_n - \hat{x}_n) }{ \sum_n \rho_n^2 }    (5.19)

Note that fixing ρ_n = 1 recovers the same algorithm as before.

5.4 Backward Elimination

The sensitivity vs. false discovery rate (FDR) trade-off of our model is controlled by the hyperparameter a. For higher values of a, a sparser solution with fewer CNVs is obtained, reducing both the FDR and the sensitivity (see Appendix A). In order to efficiently explore solutions with different levels of sparseness without having to run the algorithm all over again, the same backward elimination (BE) strategy described in Section 3.5 is employed. Separately for each sample, the breakpoints with the lowest score t_m are recursively removed:

t_m = \sqrt{ \frac{ \mu_m^2 }{ \Sigma_{mm} } }    (5.20)

The sensitivity vs. FDR trade-off is controlled by stopping the procedure when all the remaining breakpoints have a score higher than a given critical value T. As the algorithm continues to remove all the breakpoints, it keeps track of the score T at which each particular breakpoint is eliminated, so the breakpoints can be rapidly adjusted to any desired level based on their rank. The final result is reported as a set of segment breakpoints and amplitudes that represent the copy number variations. The parameter T is physically more informative than the parameter a because it can be interpreted as the standard error the user is willing to tolerate to call a CNV significant.
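A minimal sketch of this BE loop follows (Python; names are illustrative, and the posterior re-estimation after each removal, which Section 3.5 prescribes, is abstracted into an optional callback):

    import numpy as np

    def backward_elimination(mu, sigma_diag, T, refit=None):
        # Remove the lowest-scoring breakpoint (Eq. 5.20) until every
        # surviving breakpoint has a score above the critical value T.
        active = np.ones(mu.size, dtype=bool)
        while active.any():
            t = np.abs(mu) / np.sqrt(sigma_diag)        # Eq. (5.20)
            idx = np.flatnonzero(active)
            weakest = idx[np.argmin(t[idx])]
            if t[weakest] > T:                          # all survivors significant
                break
            active[weakest] = False
            if refit is not None:                       # re-estimate posteriors
                mu, sigma_diag = refit(active)
        return active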
5.5 Performance metrics and evaluation methods

In order to compare the performance of the different approaches for combining normalization of the non-CNV probe reference intensities and copy number detection, the following methods and performance metrics are introduced. The 270 samples of the Affymetrix dataset (Section 5.6.2) are randomly partitioned into reference sets of different sizes (10, 20, 30, 45, 70, 90 and 135 samples). In each partition, a given sample will have been grouped with a different set of samples. For example, if we create 10 random partitions into 9 groups of size 30, each sample will have been grouped randomly with 29 different samples of the remaining 269.

The Affymetrix dataset used in this analysis was processed under ideal conditions (i.e., processed on the same day using three plates; personal communication). Thus, we expect the copy number estimates x̂^n_g to be very similar regardless of the subgroup chosen as the reference (see the relevant results in Section 5.6.7). Although significant changes in r̂_g are not observed in this dataset, it is noteworthy that r̂_g will likely change under varying laboratory conditions or across different batches. Section 5.6.7 addresses possible methods to analyze samples from different batches. The variances of the non-copy-number and copy number estimates (V_r and V_x) across different subgroups g can be used to assess the performance, with smaller variance indicating better performance:

V_r = \mathrm{median}_m \left[ \frac{1}{G} \sum_{g=1}^{G} \left( \hat{r}_g - \bar{r} \right)^2 \right]    (5.21)

V_x = \mathrm{median}_n\, \mathrm{median}_m \left[ \frac{1}{G} \sum_{g=1}^{G} \left( \hat{x}_{|r_g} - \bar{x}_{|r} \right)^2 \right]    (5.22)

where the bar denotes the average of the corresponding estimate across the G subgroups. Since this dataset contains 180 samples that are related in family trios, we also propose an additional measure of trio consistency. Identified CNVs in each HapMap trio are classified for each probe as in Table 5.1. Then, the failed trio consistency rate (FTCR) metric is defined as the ratio of inconsistent CNV probes in a trio among all identified CNV probes (except those considered uninformative):

FTCR = \frac{I}{C + I}    (5.23)

A smaller FTCR value indicates that copy numbers within a family trio are more consistent. This measure ignores less frequent scenarios; e.g., if both mother and father have one chromosome with a CNV gain and the other chromosome without variation, it will still be possible (25% of the time) for the son not to inherit the CNV. In order to assess the validity of the FTCR measure, the FTCR scores of true trios are compared to those obtained from randomly grouping unrelated samples into trios.

Table 5.1: Consistency on HapMap trios

                 Father and Mother pairs
  Offspring   (G,G)  (G,L)  (G,N)  (L,L)  (L,N)  (N,N)
  Gain          C      C      C      I      I      I
  Loss          I      C      I      C      C      I
  Neutral       I      –      C      I      C      –

For each variation detected as (G)ain, (L)oss or (N)eutral (no variation detected), there are 18 possible CNV outcomes for each trio. Each outcome is labeled with 'C', 'I' or '–', indicating whether it is consistent, inconsistent or uninformative.
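The classification rules of Table 5.1 and the FTCR of Eq. (5.23) reduce to a few lines of logic; the following Python sketch (names illustrative) encodes them:

    def trio_call(father, mother, offspring):
        # Classify one probe of one trio following Table 5.1; inputs are
        # 'G' (gain), 'L' (loss) or 'N' (neutral). Returns 'C' (consistent),
        # 'I' (inconsistent) or '-' (uninformative).
        parents = {father, mother}
        if offspring == 'G':
            return 'C' if 'G' in parents else 'I'
        if offspring == 'L':
            return 'C' if 'L' in parents else 'I'
        # Offspring neutral: (G,L) and (N,N) parent pairs are uninformative.
        if parents == {'G', 'L'} or parents == {'N'}:
            return '-'
        return 'C' if 'N' in parents else 'I'

    def ftcr(calls):
        # Eq. (5.23): FTCR = I / (C + I), ignoring uninformative probes.
        c, i = calls.count('C'), calls.count('I')
        return i / (c + i)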
5.6 Results

All the samples used to evaluate the methods described in this chapter are publicly available as the Affymetrix SNP 6.0 Dataset [1]. The implementation of the new GADA method is publicly available at http://biron.usc.edu/~piquereg/GADA. The algorithm has been implemented in C and tested using Matlab.

5.6.1 Simulation datasets

An artificial dataset in which the underlying model is known (i.e., copy number, array artifacts and noise) was created to assess the proposed approach under specific scenarios. Using the model in (5.2), we created an artificial microarray dataset with N = 20 samples and one single chromosome with M = 10000 probes. We generated a copy number profile x_mn containing two CNV regions (CNVRs) of different types (Figure 5.2 A). The first one is a region with aligned CNVs (breakpoints occurring at the same position across samples). The second is a region with non-aligned CNVs. The non-CNV portion of the genome has an expected log₂-ratio x_mn = 0. The first variation is chosen to represent a loss, x_mn = −1 (copy number 1), and the second variation is chosen to have copy number 3 (x_mn = log₂(3) − 1 = 0.58). The noise introduced in the dataset is white i.i.d. Gaussian, ε_mn ∼ N(0,1). The major difference of this simulation model compared to others [57,103] is the inclusion of the probe hybridization bias effect r_m = 0.5 sin(2π 0.001 m) + N(0, 0.25), with two main components: i) a sinusoidal wave with spatial correlation similar to what has been observed in some actual array experiments [63], and ii) a noise component without spatial correlation simulating each probe's specific affinity (a sketch of this generative model follows the list below). The proposed methods have also been evaluated with other simulation scenarios, which include:

1. "Wave"-only bias shared by all the samples, r_m = 0.5 sin(2π 0.001 m) (Figure 5.3, top).

2. "Non-Wave" probe-specific bias shared by all the samples, r_m ∼ N(0,1) (Figure 5.3, bottom).

3. "Non-Wave+Wave" bias with a different amplitude in each sample (Figure 5.10).

4. "Non-Wave" bias different in two subgroups (batch effect): r¹_m ∼ N(0,1) independent of r²_m ∼ N(0,1) (Figure 5.14).
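Returning to the main scenario, the following Python sketch generates data of this form under the model (5.2); the CNVR widths, placements and the choice of carrier samples are illustrative assumptions, since they are not fully specified in the text:

    import numpy as np

    rng = np.random.default_rng(0)
    M, N = 10000, 20

    # Copy number component: CNVR-1 is an aligned loss (x = -1), CNVR-2 a
    # non-aligned gain (x = log2(3) - 1); carrier samples chosen arbitrarily.
    x = np.zeros((M, N))
    x[2400:2600, :10] = -1.0
    for n in range(10, 20):
        start = rng.integers(7300, 7600)          # breakpoints shift per sample
        x[start:start + 200, n] = np.log2(3) - 1

    # Probe hybridization bias: spatially correlated wave plus an
    # uncorrelated probe-specific component (Section 5.6.1).
    m = np.arange(M)
    r = 0.5 * np.sin(2 * np.pi * 0.001 * m) + rng.normal(0, 0.25, M)

    eps = rng.normal(0, 1.0, (M, N))              # white i.i.d. noise
    Y = x + r[:, None] + eps                      # observation model (5.2)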
5.6.2 Affymetrix SNP 6.0 dataset description and normalization

The Affymetrix dataset contains 270 samples from the International HapMap Project, consisting of 30 CEPH trios (CEU), 30 Yoruban trios (YRI), 45 unrelated Han Chinese samples (CHB) and 45 unrelated Japanese samples (JPT). The Affymetrix Genome-Wide Human SNP Array 6.0 integrates about 1.9 million probesets (931,946 SNPs, 946,000 CN).

The preprocessing software packages available for the Affymetrix SNP Array 6.0 platform include the Affymetrix Genotyping Console 3.0 (GTC3), with a normalization similar to CNAG [68]; dChip, with invariant-set normalization [86]; and Aroma.Affymetrix, which implements CRMA [7]. The CRMA method was chosen, since it has been shown to be more robust and accurate than other methods [7]. The Aroma.Affymetrix package performs the following 4 correction steps: 1) allelic crosstalk calibration (ACC) for SNP probes; 2) probe level modeling (PLM), which gives a single signal for the SNP probes; 3) fragment length normalization (FLN), which corrects the differences in the PCR reaction due to the length and GC content of the fragment; and finally, 4) log ratio extraction (LRE), which calculates the log hybridization intensity relative to the expected diploid signal reference intensity using the median. In Figure 5.1 A, the steps ACC, PLM and FLN are grouped together as global effects normalization, and LRE is the probe bias correction.

In this chapter, we argue that while steps 1-3 may be performed safely during pre-processing, finding the non-CNV reference intensity is problematic in regions rich in copy number polymorphisms. In normal samples, most of the probes (> 90%) fall in regions without CNV, which implies that the normalization model parameters that have a global effect on all probes can be safely estimated using robust strategies such as those employed by CRMA [7]. On the other hand, in any given region of the genome containing a highly polymorphic CNV, it is not known a priori which samples do not have a CNV. Ideally, this CNV effect should be removed before estimating the probe hybridization intensity associated with the non-CNV state. In this work, we extract the corrected probe intensities after steps 1-3 and use our new proposed model (GADA-JRN) to jointly estimate the reference and the copy number component (Figure 5.1 B). Results from the Affymetrix GTC software with GC correction are also obtained for comparison.

5.6.3 Results with simulated data

The artificially generated data (Section 5.6.1) illustrate a scenario in which there are two relatively high-frequency CNVs (Figure 5.2 A) with a bias on the hybridization measurement from the array experiment (Figure 5.2 B). If the probe hybridization bias r is not removed from the data, the results are contaminated with a large number of spurious segments not associated with true CNVs (Figure 5.2 C). While other approaches [7,18,63] can correct the "smooth-wave" (GC-correlated) part of the bias (see next section), GADA-JRN can also correct the non-smooth (uncorrelated) probe-specific bias.

Figure 5.2: Illustration of the observation model. Colors represent the observed hybridization intensities and the relative copy number change (blue - loss '-1', red - gain '+1', green - neutral '0'). A) The true underlying CNV component with two CNV regions (CNVR-1 around m=2500 and CNVR-2 around m=7500). B) Simulated array hybridization intensities degraded by noise ε_n and a systematic measurement bias r. C) Copy number profile using GADA on non-normalized data. D) Data after subtraction of the reference estimated by separate median pre-processing (SMN). E) Copy number profile using GADA with separate median normalization (GADA-SMN). F) Copy number profile estimated using GADA with joint reference normalization (GADA-JRN).

The currently used approach of separate pre-processing is based on estimating r_m as the median across a set of reference samples (here the simulated samples themselves) before extracting the CNVs. This can eliminate r_m in the areas of the genome without CNVs (x_mn = 0 regions), but it is problematic in regions containing a large amount of CNV across samples (Figure 5.2 D and E). The new joint reference normalization approach (GADA-JRN, Figure 5.2 F) can correctly delimit the CNVs on the samples carrying a CNV in region CNVR-2, while separate median normalization (GADA-SMN, Figure 5.2 E) incorrectly reports CNVs on samples n = 1,...,10. Additionally, GADA-JRN can better detect the small CNV in CNVR-1, for which separate median normalization tends to make the amplitude of the variation smaller and thus more difficult to detect. These results are not affected if r_m has different types of wave, as illustrated with more examples in Figure 5.3.
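To see why the median reference is biased in such a region, consider a tiny numeric sketch (Python; the numbers are made up for illustration and are not taken from the simulation itself). At a probe inside CNVR-2 where half of the samples carry the gain, the median lands between the two states, shrinking the carriers' amplitude and pushing the non-carriers below zero, which is the artifact visible in Figure 5.2 E:

    import numpy as np

    rng = np.random.default_rng(1)
    # 10 of 20 samples carry a gain of +0.58 at this probe; true bias r = 0.3.
    y = 0.3 + np.r_[np.full(10, 0.58), np.zeros(10)] + rng.normal(0, 0.05, 20)
    r_median = np.median(y)     # lands near 0.59, far from the true 0.3
    print(y - r_median)         # carriers ~ +0.29, non-carriers ~ -0.29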
Figure 5.3: Simulation model with a measurement bias of only one type: top, r_m = 0.5 sin(2π 0.001 m); bottom, r_m ∼ N(0, 1.0). The colors represent the observed hybridization intensity and the copy number (blue - loss, red - gain, green - neutral). A) The true underlying CNV component with two CNV regions (CNVR-1 around m=2500 and CNVR-2 around m=7500). B) Simulated array hybridization intensities degraded by noise ε_n and a systematic measurement bias r. C) Copy number profile using GADA on non-normalized data. D) Copy number profile using GADA with separate median normalization (GADA-SMN). E) and F) Copy number profile estimated using GADA-JRN with or without a scaling parameter ρ_n for the bias.

5.6.4 Results with Affymetrix microarray data

The hybridization intensities are obtained after applying the ACC and FLN corrections from Aroma.Affymetrix to the 270 HapMap samples analyzed with Affymetrix SNP 6.0 arrays. This pre-processing step appropriately scales and centers the data, removing the spatially correlated part of the bias. We next compare GADA-JRN and GADA-SMN employing the evaluation methods introduced in Section 5.5.

Using the randomly chosen reference sets of different sizes, we evaluated the variability V_r and V_x of the reference intensities and the copy number estimates. Figure 5.4 shows that, as the number of samples in the reference set increases, the variability of the CNV estimates V_x decreases. More importantly, we can see that GADA-JRN achieves considerably better performance than GADA-SMN. In some cases GADA-JRN requires about half the samples to obtain estimates of accuracy similar to those achieved with GADA-SMN. In terms of the variance of the reference intensity estimates, V_r, GADA-JRN also achieves significantly smaller values (p < 1E−7, Kolmogorov-Smirnov test; data not shown).

This improvement in performance can also be observed using the trio consistency measure (FTCR) described in Section 5.5. In Figure 5.5, the trio consistency improves (i.e., FTCR decreases) with the size of the reference set, and GADA-JRN also achieves significantly better consistency. The results are also similar under changes of the sparseness parameters a and T that set the trade-off between sensitivity and FDR.
Figure 5.6 illustrates, for one of the reference sets (90 CEU samples), the consistency obtained for different settings of the parameter T, which controls the backward elimination (BE) step.

Figure 5.4: Variability of the copy number estimates when the set of reference samples changes.

Figure 5.5: Consistency of the copy number estimates on HapMap trios when the set of reference samples changes.

The FTCR measure improves (decreases) with increasing T, since the number of detected false CNVs (i.e., the FDR) decreases; but for higher values of T (e.g., T > 8) true CNVs may also fail to be detected in the offspring (i.e., lower sensitivity), and thus the FTCR stops decreasing. The FTCR obtained on randomly formed trios of unrelated samples confirms that this metric is actually capturing the increase in shared CNVs within a family (p < 0.01) and can be used to compare different normalization and copy number detection approaches. In Table 5.2, GADA-JRN obtains a better FTCR (16.7%) than GADA-SMN (19.5%) and GTC (19.45%) when using 90 reference samples. On a larger reference set of 180 samples, GADA-SMN (FTCR = 18.3%) and GTC (FTCR = 17.31%) improve, but GADA-JRN still retains a better FTCR (16.5%). Overall, the new approach is especially indicated for small reference sets and for regions with highly polymorphic CNVs. Figures 5.7 and 5.8 show the copy number estimates obtained on an already known highly polymorphic region of chromosome 17. The predicted gains and losses of GADA-JRN are retained when the reference set of 90 (CEU) samples is enlarged to include 180 (CEU+YRI) samples. On the other hand, GADA-SMN is less consistent in delimiting the CNV boundaries.

Table 5.2: Comparison on HapMap trio consistency

                       Number of Reference Samples
               90 CEU                  180 CEU+YRI
               FTCR     # CNVs         FTCR     # CNVs
  GADA-JRN     16.7%    85 (T=10)      16.5%    127 (T=9.0)
  GADA-SMN     19.2%    94 (T=9.0)     18.3%    125 (T=8.0)
  GTC          18.45%   86             17.31%   127

Table with the Failed Trio Consistency Rate (FTCR) and the median number of CNVs per sample for three different algorithms: GADA-JRN, GADA-SMN and GTC. The threshold values for GADA-JRN and GADA-SMN were chosen to obtain approximately equal numbers of CNVs among the algorithms.

Figure 5.6: Consistency within HapMap trios for different sparseness settings T. The dashed and solid lines correspond to a 90 (CEU) and a 180 (CEU+YRI) sample reference set, respectively. The cloud of points shows the FTCR values obtained from 100 randomly formed trios. The FTCR values of GADA-JRN (blue) are smaller than those of GADA-SMN (green).
Figure 5.7: Section of chromosome 17 that contains an already known CNV. Each row corresponds to one of the 90 CEU HapMap samples, grouped in trios (father, son/daughter, mother) delimited by horizontal dotted lines. To the left of the thick vertical line are shown the CNVs estimated using GADA with separate median normalization, using reference sets of 90 and 180 samples. On the right, the copy numbers estimated using GADA with joint reference normalization show a higher consistency when the reference set is changed, as well as within HapMap trios.

Figure 5.8: Example of a complex copy number section of Chr. 17 within a HapMap trio (top row, father; center row, son/daughter; bottom row, mother). The left column shows the original observed array values and a smoothed version indicating that there is a hybridization bias correlated across the samples. The center column shows the median-normalized intensities and the detected altered segments in red (GADA-SMN). The right column shows the intensities corrected with the reference estimated by GADA-JRN and the corresponding detected segments in red. This result visually depicts that the detected alterations are more consistent within trios with the new GADA-JRN model.

Finally, the computational time required for fitting the model is longer for the new approach, but it still retains a very competitive linear complexity in the number of probes and samples (see Figure 5.9).

Figure 5.9: Computational time required to fit the models GADA-JRN and GADA-SMN.
The time required to fit the model is linear in the number of samples for both approaches. Execution times required to process the models are measured on the same machine.

5.6.5 Simulation results with a scale effect

In order to evaluate the model that includes a scale parameter ρ_n, a new simulated dataset is created. The same underlying copy number profile x_n is used (Figure 5.10 A), but the wave r that is added to each sample is multiplied by a random amplitude ρ_n ∼ U(−1, 1) (uniformly distributed between -1.0 and 1.0). The bias wave used in this experiment is r_m = 0.5 sin(2π 0.001 m) + N(0, 0.25) and has two main ingredients: i) a sinusoidal wave with spatial correlation, and ii) a noise component without spatial correlation. The resulting observed array intensities y_mn have three components (Figure 5.10 B): a) the copy number component, which is piecewise constant and independent for each sample; b) the wave, which is correlated across samples n and possibly across probes m; and c) the noise, uncorrelated across samples n and probes m, with variance 0.2. The new joint reference normalization approach (GADA-JRN, Figure 5.10 F) can correctly delimit the CNVs on the samples and extract the probe hybridization bias and its magnitude from each sample.

Figure 5.10: Simulation model with measurement bias of different amplitudes: top, ρ_n ∼ U(0.7, 1.0); bottom, ρ_n ∼ U(−1.0, 1.0). The colors represent the observed hybridization intensity and the copy number (blue - loss, red - gain, green - neutral). A) The true underlying CNV component with two CNV regions (CNVR-1 around m=2500 and CNVR-2 around m=7500). B) Simulated array hybridization intensities degraded by noise ε_n and a systematic measurement bias r. C) Copy number profile using GADA on non-normalized data. D) Copy number profile using GADA with separate median normalization (GADA-SMN). E) and F) Copy number profile estimated using GADA-JRN with or without a scaling parameter ρ_n for the bias.

5.6.6 Scale effect on the Affymetrix data

The same pre-processed data as in Section 5.6.4 are used to evaluate the new model with a bias amplitude parameter.
In real data, a change in the amplitude of the bias is likely to be a consequence of differences in the initial amounts of DNA material, as was shown by [18]. The Aroma.Affymetrix pre-processing can make the amplitudes ρ_n the same across samples by applying the appropriate scaling and correction to each of the Sty and Nsp fragments. For this reason, the results do not differ much whether or not this new parameter is added to the model (Figure 5.11).

Figure 5.11: Consistency within HapMap trios for different sparseness settings T. Using 180 HapMap samples (CEU+YRI), the GADA-JRN algorithm does not show a significant improvement when a bias amplitude parameter is added, on this set of samples after Aroma.Affymetrix normalization.

5.6.7 Impact of batch effects on the Affymetrix dataset

All the samples in the Affymetrix dataset were generated in the same laboratory at the same time and analyzed on three plates, with each ethnic group (CEU, YRI, JPT+CHB) analyzed on a separate plate. The batch effects that would be expected if the plates were analyzed on separate days or in different labs would be much larger than what is actually seen in these samples; see the simulated example in Figure 5.14. For this reason, we do not observe any significant difference in FTCR with the new algorithm whether the two batches are analyzed separately or combined together (GADA-JRN in Figure 5.12). The performance of separate median normalization applied independently to each plate is worse because the median is not robust in a small set of samples (especially in this dataset, where each plate contains a different ethnic group).

If the subgroups (or batches) are not known a priori, the residuals after running GADA can be useful to discover subgroups of samples that share a common artifact. In Figure 5.13 B, two blocks can be visually appreciated that do not appear in Figure 5.13 C. The average magnitude of these correlations is small (< 0.05) compared to the variability within each block (< 0.25) because the shift in r is not substantially large. Fewer than 0.36% of the probes have a shift in r_m larger than a true copy number change (|r^CEU_m − r^YRI_m| > 0.585 = log₂(3) − 1). This remaining probe hybridization bias, if spatially uncorrelated, will be regarded as additional white noise in our model (a higher noise variance estimate) and is unlikely to cause false detections, but rather a decreased sensitivity to true changes.

Figure 5.12: Consistency within HapMap trios when two plates (CEU and YRI) are analyzed separately or together using the GADA-SMN and GADA-JRN algorithms. In terms of FTCR, we obtain only a small improvement when each batch is separated using GADA-JRN, but a decreased performance for the median, which requires more than 90 samples to become a robust estimator.
Figure 5.13: Pairwise Spearman correlations between different signals: A) the original microarray observed intensities; B) the residual intensities after processing CEU+YRI together with GADA-JRN; and C) the residual intensities after processing the CEU and YRI groups separately with GADA-JRN. Samples 1 to 90 correspond to the CEU ethnic group and 91 to 180 to the YRI ethnic group.

These shifts are likely to be considerably larger if the batches were analyzed on different days or in different labs. In that case, we would analyze each batch separately with GADA-JRN (see Figure 5.14 for a simulation result).

Figure 5.14: Simulation model with a batch effect: for n = 1,...,20, r^n_m ∼ N(0, 1.0), independent of r^n_m ∼ N(0, 1.0) for n = 21,...,40. The colors represent the observed hybridization intensity and the copy number (blue - loss, red - gain, green - neutral). A) The true underlying CNV component with two CNV regions (CNVR-1 around m=2500 and CNVR-2 around m=7500). B) Simulated array hybridization intensities degraded by noise ε_n and a systematic measurement bias r. C) Copy number profile using GADA on non-normalized data. D) and E) Copy number profiles using GADA-SMN and GADA-JRN on all the samples together. G) and H) Copy number profiles using GADA-SMN and GADA-JRN on each batch separately. F) Pairwise correlation between the residuals of GADA-JRN after processing the batches together or I) separately.

5.7 Discussion

The application of the proposed GADA-JRN is not limited to Affymetrix SNP arrays; it can also be applied to other platforms such as Illumina BeadArrays or NimbleGen aCGH. In Illumina BeadStudio, the probe hybridization intensities R (obtained after allele crosstalk correction, ACC) can be extracted instead of the LRR values. The extraction of the LRR values uses a clustering approach to compute the expected R values of non-CNV regions [71], which has drawbacks similar to those of separate median normalization. In aCGH, the reference DNA from a single sample or a pool of samples is used as a reference, but these log-ratio intensities may still contain a remaining "wavy" artifact [63] that our proposed approach could eliminate.

ITALICS [83] is an iterative approach that alternates two separate steps of copy number detection and normalization.
The iterative concept is similar to the EM algorithm employed to fit the GADA-JRN model in this chapter, but there are two fundamental differences. First, ITALICS operates on a single sample and only includes a small set of parameters correcting for global array effects such as fragment length and GC content; ITALICS assumes that the reference non-CNV probe intensity is fixed and extracted from a separate reference set of samples. Second, the copy number extraction and the normalization model are in practice iterated only twice and are not integrated under a unifying probabilistic model as in GADA-JRN. In contrast, GADA-JRN first proposes a multiple-sample probability model, which includes parameters for the CNV component and the reference non-CNV probe hybridization intensity at every position of the genome; then an iterative approach to fit the model is derived using the EM algorithm, which guarantees that the parameters converge.

GEMCA [54], only available for the Affymetrix 500K platform, approaches the problem by finding the reference after copy number detection. First, the copy number is estimated on the differences between all possible pairs of samples. CNV regions are defined as those identified in a certain number of pairs. Then the largest subgroup of samples with no relative variations (found using a maximum clique algorithm) is used to establish the reference set for that particular region of the genome. The Canary algorithm in Birdsuite [55] uses a Gaussian mixture model (GMM) to identify a posteriori the copy number variation state of already delimited regions of copy number variation. In both GEMCA and Canary, the underlying assumption that CNVs are tightly aligned across samples and do not overlap with other possible CNVs represents a challenge in dealing with complex polymorphic CNV regions of the genome (e.g., Figure 5.7).

5.8 Conclusions

In this chapter we introduced a new method in which the reference probe hybridization intensity is jointly estimated with the copy number component. This type of method is essential for the characterization of CNV in the normal population using the latest array technologies, in which the underlying genome copy number variations are not known a priori. The currently used approach of separate pre-processing using the median to estimate the hybridization intensities may fail to accurately detect highly polymorphic regions of the genome. The new proposed method extends the previous GADA model by introducing a new vector of parameters that models a common hybridization bias shared across many arrays. Results demonstrate a significantly better performance with the new extended GADA model while maintaining the attractive linear computational complexity in the number of probes and samples.

GADA-JRN may also prove to be a flexible algorithm in a variety of other applications. For example, samples processed by different labs or on different days may not necessarily have the same non-CNV reference hybridization intensity even if the platform is exactly the same. GADA-JRN could be applied to assess whether this effect has changed between different experimental batches. Moreover, one or more control samples replicated across experimental batches could be used to better characterize and correct the probe hybridization noise.

In addition to the probe hybridization reference, other parameters can also be introduced into the GADA framework to model other known sources of variation.
For example: allowing changes in the scale of the bias; using an independent bias correction for each SNP allele; and including probe-specific noise variances (a heteroscedastic model). The residuals of the model can be used to assess the impact of these other sources of variation on the results. In particular, correlation in the model residuals can be used to discover hidden batch effects indicating the need for subgroup analysis. Finally, the GADA-JRN model could be used in combination with the N-GADA model [74] (Chapter 4) for modeling breakpoints of CNVs across multiple samples.

Chapter 6

Bayesian hierarchical modeling of means and covariances of gene expression data within families

The previous chapters proposed methods for extracting and analyzing only one kind of genomic information. Chapter 2 covered gene expression, where microarrays are used to quantify the transcription activity of each gene in the genome. Chapters 3, 4 and 5 covered the analysis of DNA copy number changes using genotyping arrays. In this chapter we develop a new method to identify variations in the DNA (SNP genotypes) that influence gene expression levels.

We propose a novel hierarchical Bayesian model for the influence of constitutional genotypes from a linkage scan on the expression of a large number of genes. This work was presented at the Genetic Analysis Workshop GAW-15 and was also selected for publication [76], and it can be considered a first step towards finding genetic determinants of gene expression at a genome-wide scale.

Results on Chr. 11 replicate an already known association between a DNA variation (SNP) and a gene expression level. The approach appears to be a promising way to address the huge multiple comparisons problem in relating genome-wide genotype-by-expression data.

This chapter is organized as follows. First, we provide the background on the problem and the approaches that exist, and we motivate the need for new methods (Section 6.1). The Methods section then presents our model, the statistical approaches used to fit it, and the data that were used. Sections 6.3 and 6.4 present and discuss the results obtained.

6.1 Introduction

The two major types of genomic information, i) DNA variation and ii) gene expression, have usually been studied separately. An extensive literature on the analysis of gene expression data has evolved over the last five years, and since the advent of ultra-high-volume genotyping platforms, genome-wide association and linkage scans using SNPs have also become feasible.

The multiple comparisons problem is central to the analysis of either type of high-volume data. In 2001, Jansen [49] proposed combining the analysis of the two technologies (SNPs and gene expression arrays) in what he called "genetical genomics" to provide insight into the genetic regulation of gene expression. However, only quite recently have attempts been made to relate the two technologies, first by Morley et al. [67] in a linkage scan for 3,554 expressed genes in relation to 2,756 autosomal SNP markers, and subsequently by the same group [14] in a genome-wide association scan of 27 of the expressed genes with the highest linkage in the first study, in relation to more than 770,000 SNPs. (See also Schadt et al. [87] and Stranger et al. [94] for similar analyses.) Independently, Tsalenko [100] proposed a biclustering method to visualize SNPs and the transcripts they regulate, using an approach that is more visual than statistical.
The multiple comparisons problem in such analyses (2.7 billion comparisons in the association analysis) dwarfs those arising in genome-wide linkage or association analyses of single traits or in supervised cluster analyses of expression data in relation to single outcomes.

Therefore, there is a need to develop new statistical methods to analyze all transcripts and genotypes together. Here, we describe a novel hierarchical Bayesian approach to the analysis of all possible pairs of associations and linkages between expressed genes and SNP markers. We demonstrate the results for chromosome 11 and argue that the method can be extended to cover the entire genome and transcriptome.

The proposed model comprises linear regression models for the means in relation to genotypes and for the covariances between pairs of related individuals in relation to their identity-by-descent (IBD) estimates. The matrices of regression coefficients for all possible pairs of SNPs by all possible expressed genes are in turn modeled as a mixture of null values and a normal distribution of non-null values, with probabilities and means given by a third-level model of SNP and trait random effects and a spatial regression on the distance between the SNP and the expressed gene. The latter provides a way of testing for cis and trans effects, depending on whether the alteration affects the gene in which it sits or some other gene.

The method was applied to data on 116 SNPs and 189 genes on chromosome 11, for which Morley et al. [67] had previously reported linkage. We were able to confirm the association of the expression of HSD17B12 with a SNP in the same region reported by Morley et al., and we also detected a SNP that appeared to affect the expression of many genes on this chromosome. The results are promising, and the approach could be extended to cover a genome-wide gene expression and genotyping scan.

6.2 Methods

6.2.1 Statistical model

Let Y^n_ij denote the expression of gene n in member j of family i, and let G^m_ij be the corresponding SNP genotype at marker m at location x_m. For the means and covariances of the expression traits, we adopted a generalized estimating equations model of the form used by Thomas et al. [95]:

E[Y^n_{ij}] \equiv \mu^n_{ij} = \alpha^n_0 + \sum_{m=1}^{M} A_{nm} G^m_{ij}    (6.1)

E[C^n_{ijk}] \equiv \chi^n_{ijk} = \beta^n_0 + B_n Z_{ijk}(X_n)    (6.2)

where C^n_ijk = (Y^n_ij − μ^n_ij)(Y^n_ik − μ^n_ik), and Z_ijk(x) is the estimated E(IBD_ijk(x) | G_i) at chromosomal location x for pairs (j, k) from nuclear family i, based on the complete multilocus marker data. X_n is a latent variable for the location of the unobserved causal locus linked to expression trait n. For j = k, V(Y^n_ij) = χ^n, so (6.2) also models the gene expression variance.

In (6.1), the regression coefficients A_nm are modeled as mixtures of null values, with probabilities 1 − π_nm, and a normal distribution of non-null values with means α_nm expressed in terms of row and column effects:

A_{nm} \sim (1 - \pi_{nm})\,\delta(0) + \pi_{nm}\,\mathcal{N}(\alpha_{nm}, \sigma^2)    (6.3)

where

\alpha_{nm} = \gamma^A_0 + \gamma^A_1 I(x_m \in R_n) + e^A_m + h^A_n    (6.4)
\mathrm{logit}(\pi_{nm}) = \gamma^P_0 + \gamma^P_1 I(x_m \in R_n) + e^P_m + h^P_n    (6.5)

The parameter γ_1 distinguishes between cis and trans effects; a cis interaction occurs when the chromosomal location x_m of SNP m is within the interval R_n, the alignment region of the gene expression probe n. The random effects e and h are distributed as

(e^A_m, e^P_m) \sim \mathcal{N}_2(0, T)    (6.6)
(h^A_n, h^P_n) \sim \mathcal{N}_2(0, W)    (6.7)

and the γs, T and W have uninformative normal and Wishart priors.
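To make the mixture concrete, here is a minimal Python sketch of one prior draw of the coefficient matrix A, following (6.3)-(6.5); all names and array shapes are illustrative (cis is the N x M indicator I(x_m ∈ R_n), e['A'] and e['P'] are 1 x M marker effects, and h['A'] and h['P'] are N x 1 trait effects, so the sums broadcast), and this is not the author's Matlab code:

    import numpy as np

    def draw_A(gamma_A, gamma_P, e, h, cis, sigma, rng):
        # Row/column effect models for the slab mean and the mixing weight:
        alpha = gamma_A[0] + gamma_A[1] * cis + e['A'] + h['A']      # Eq. (6.4)
        logit_pi = gamma_P[0] + gamma_P[1] * cis + e['P'] + h['P']   # Eq. (6.5)
        pi = 1.0 / (1.0 + np.exp(-logit_pi))
        # Spike-and-slab draw: zero with probability 1 - pi, else normal:
        nonnull = rng.random(pi.shape) < pi
        return np.where(nonnull, rng.normal(alpha, sigma), 0.0)      # Eq. (6.3)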
The regression coefficients B_n in the covariance model (6.2) are handled similarly, except that we assume each trait has at most one region linked to it. (This is not essential to the method, as Eq. (6.2) could be extended to a summation over multiple independent linkage regions, but it would not make sense to offer all marker locations simultaneously, since the IBD variables are highly correlated from one location to the next.) Thus, we assume

B_n \sim N(\gamma^B_0 + \gamma^B_1 I(X_n \in R_n), \tau^2)    (6.8)

and pick a uniform prior for X_n; to simplify the calculations, we restrict X_n to the observed marker locations x_m and compute IBD probabilities only at these locations. X_n thus has a discrete distribution with prior masses inversely proportional to the local marker density, here estimated simply as |x_{m+1} - x_{m-1}|.

The full model is represented in the directed acyclic graph (DAG) shown in Figure 6.1.

Figure 6.1: Directed acyclic graph for the analysis model. Squares represent observed data, circles represent parameters or latent variables, triangles represent deterministic nodes.

We fitted the model using a Markov chain Monte Carlo (MCMC) approach, implemented in Matlab. Updates of all parameters except the location parameters X_n involve standard Gibbs sampling from their respective full conditional distributions, e.g., [\alpha^n_0 | Y^n, G, A], [\beta^n_0 | C^n, Z, B], [A_{nm} | Y^n, G^m, \alpha^n_0, \pi_{nm}, \alpha_{nm}, \sigma^2], etc. The updates of the X_n are based on a Metropolis-Hastings procedure with a random walk proposal. The sequence was started 10 times from several initial points chosen from an overdispersed prior around rough estimates. Half of the initial samples are discarded and the second half is kept. The number of kept samples, L = 4000, is chosen to be large enough so that for all parameters of interest the variance between sequences V_B is comparable to that within sequences V_W, i.e., \hat{R} < 1.10:

\hat{R} = \sqrt{ \frac{L-1}{L} + \frac{1}{L} \frac{V_B}{V_W} }    (6.9)

The rationale behind this convergence monitoring procedure is described and justified in [32].
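As an illustration of the convergence criterion (6.9), the following is a minimal Python sketch (illustrative only; the original code was Matlab) that computes this quantity for one scalar parameter from the kept samples of several chains. It assumes V_B and V_W in (6.9) follow the usual Gelman-Rubin definitions:

import numpy as np

def r_hat(chains):
    """Potential scale reduction (6.9) for one scalar parameter.

    chains: array of shape (n_chains, L) holding the kept (second-half) samples."""
    chains = np.asarray(chains, dtype=float)
    n_chains, L = chains.shape
    # Between-sequence variance V_B (L times the variance of the chain means)
    v_b = L * chains.mean(axis=1).var(ddof=1)
    # Within-sequence variance V_W (mean of the per-chain sample variances)
    v_w = chains.var(axis=1, ddof=1).mean()
    return np.sqrt((L - 1) / L + v_b / (L * v_w))

A parameter is deemed to have converged when this value falls below the 1.10 threshold used above.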
6.2.2 Subjects, genotypes, and phenotypes

In order to keep the computation at a manageable level, we restricted this analysis to the SNP genotypes and expressed genes on chromosome 11, as previous analyses by Morley et al. [67] had found evidence of linkages both in cis and in trans on this chromosome. The final dataset thus had 116 SNPs and 189 expressed genes. IBD status was estimated from the complete two-generation pedigrees (excluding grandparents) by a program written based on the Lander-Green algorithm [59]. All 378 sibpairs (110 individuals) from the available 14 families were included in the phenotype analysis.

6.3 Results

After convergence has been reached, the number of nonzero regression coefficients A_{nm} in (6.1) is very small. This is because in the mixture model employed in (6.3), a large number of the probabilities are close to 0, as shown in Figure 6.2. Figure 6.2 also shows, as expected, that each gene expression phenotype is explained by relatively few genotypes that have a role in regulating its expression. Table 6.1 lists, for the best predicted phenotypes, the SNPs included most frequently in the model. Significantly, the top ranking phenotype, HSD17B12 (217869_at), associated with SNP rs1453389, is the same as the one reported by Cheung et al. [14] as associated with another SNP in the same region (not included in the GAW dataset).

Figure 6.3 shows that some SNPs in chromosome 11, especially rs916482, are significantly associated with more phenotypes than others. These SNPs are possibly within a master regulatory region of gene expression. The list of gene ontology terms that were over-represented in the list of its associated genes involved mostly metabolic functions (Figure 6.4).

The covariance model (6.2) results are summarized in the right panel of Figure 6.2, and the strongest linkage peaks are listed in Table 6.2. This linkage is for the remaining variation not explained by the association/means model (6.1), and the peaks would correspond to unseen genotypes that are in linkage disequilibrium (LD) with a marker that was not used in the association model. This explains in part why the linkage results are less compelling than the association ones. However, for those phenotypes for which significant linkage was found, the expression covariance increased with the IBD status, especially for 208964_s_at.

Table 6.1: Top ranking associations

Phenotype  Probe        R^2   P(R^2>0)  Top SNPs used in the prediction (average π_nm)
HSD17B12   217869_at    0.25  0.988     *rs1453389 (1.00), rs916482 (1.00), rs1425151 (0.40), rs509628 (0.28)
C11orf10   218213_s_at  0.12  0.986     rs916482 (1.00)
AMPD3      207992_s_at  0.19  0.985     *rs2029463 (0.81), rs948215 (0.80), rs1157659 (0.21), rs1491846 (0.17)
FEZ1       203562_at    0.12  0.984     rs2029463 (1.00), rs2155076 (0.20), *rs948215 (0.11)
ADM        202912_at    0.11  0.982     rs916482 (1.00)
STIP1      213330_s_at  0.11  0.981     rs916482 (0.99), rs1319730 (0.33)
DDB1       208619_at    0.15  0.978     rs1530966 (0.91), rs597345 (0.54), rs1499511 (0.10)
FADS1      208964_s_at  0.14  0.974     rs1216592 (0.85), rs1605026 (0.38), rs591804 (0.35)
TPP1       200743_s_at  0.13  0.970     rs916482 (0.94), rs1157659 (0.14), *rs902215 (0.14)
RBM14      204178_s_at  0.10  0.966     rs916482 (0.98), rs674237 (0.10)
HMBS       203040_s_at  0.13  0.963     rs86392 (0.49), rs916482 (0.47), *rs1319730 (0.44), rs1945906 (0.20)
PPME1      49077_at     0.11  0.958     rs916482 (0.82), rs2155001 (0.16)
CD44       204490_s_at  0.12  0.957     rs702738 (0.34), rs916482 (0.28), rs1319730 (0.28), *rs1453390 (0.17)
NRGN       204081_at    0.10  0.946     rs2029463 (0.93), rs961746 (0.16), rs509628 (0.15)
NDUFS8     203190_at    0.11  0.944     rs86392 (0.68), rs1319730 (0.33), rs1945906 (0.32)
PSMD13     201232_s_at  0.09  0.923     rs916482 (0.91), rs1319730 (0.12)

Phenotypes ranked by most significant coefficient of determination, with some of their top associated SNPs ranked by average π_nm (shown in parentheses). (*) indicates a cis-acting interaction, defined as the SNP being within 10 Mb of the phenotype probe alignment.

Table 6.2: Linkage of residual gene expression variation after association

Phenotype    R^2    P(R^2>0)  Samples  Mode[X_n]
208964_s_at  0.014  0.912     1749     rs2226844
202223_at    0.009  0.908     1526     rs1453390
220964_s_at  0.007  0.862     1412     rs647837
201432_at    0.005  0.821     1567     rs931811
201477_s_at  0.008  0.802     1492     rs1941817
204178_s_at  0.004  0.800     1548     rs2155076
202076_at    0.004  0.772     1668     rs681267
206067_s_at  0.002  0.749     1535     rs2029463
210364_at    0.003  0.718     1586     rs470719
203675_at    0.006  0.706     1659     rs1216592
205412_at    0.001  0.683     1585     rs470982
202418_at    0.001  0.679     1444     rs470719

Phenotypes ranked by most significant coefficient of determination in the covariance model, together with the mode of the posterior distribution of their locus position X_n. The coefficient of determination and its significance are calculated from samples drawn (from 10% of the mass) around the mode.

6.4 Discussion

We have introduced a novel hierarchical Bayes model for genetic control of gene expression.
Our approach to dealing with the multiple comparisons problem is to represent the matrices of all possible SNP-by-expressed-gene association or linkage coefficients in terms of row and column random effects, along with a spatial regression on the distance between the two. Although this allows inference on specific pairs, we have greater interest in the variances of the row and column effects, which reflect systematic tendencies for SNPs to affect variable numbers of phenotypes and for phenotypes to be differentially expressed. Our mixture model also supports the possibility that the vast majority of such associations or linkages would be truly null, and allows separate estimation of both the probability and magnitude of non-null tests. So far we have not imposed any relationship between the parameters of the association (means) and linkage (covariance) models, but one might contemplate using the broad regions where linkage is seen for a particular phenotype as a prior for testing single-SNP associations with that phenotype.

The strongest gene-expression SNP association reported by Cheung et al. on chromosome 11 also appeared in our results as the most significant association, but with a SNP close to theirs (their reported SNP was not included in the dataset). We also found evidence of at least one SNP that appears to be linked to a large number of expressed genes, suggesting the existence of master regulatory genes in that region.

We chose to restrict these analyses to a subset of genes and SNPs on a single chromosome to test the feasibility of the method. In principle, the approach could be extended to a genome-wide scale, since the computation time increases linearly with m, n, sample size, and the number of MCMC samples. Generating 4000 MCMC samples required 6 hours on a 2.2 GHz single-processor machine. However, one outstanding methodological challenge that would have to be addressed before the approach could be applied to dense SNP associations is how to deal with the multicollinearity problem. This is because very close SNPs will exhibit highly correlated genotypes (strong LD). For this reason, we chose to restrict this analysis to only a subset of SNPs that were not strongly correlated (in strong LD) with each other.

Figure 6.2: Gene expression by genotype associations and residual linkage summary. (Left) Image describing the mean value of the association parameters π_nm between the gene expression phenotypes (rows) and the SNP genotypes (columns). The matrix shows that the interactions are very sparse (dark spots), meaning that phenotypes are controlled by a small number of SNPs, with no apparent concentration along the cis region delimited by blue lines. However, there exist some SNPs (columns) that seem to be correlated with a large set of phenotypes, potentially indicating a master regulatory region. (Right) Image describing the posterior probability Pr(X_n = m | Data) of the linkage locus after removing the association effect from the covariance.
Figure 6.3: Potential master regulatory region around the rs916482 SNP. The bottom plot (S_mn) is the cross-section of column 84 of Figure 6.2, describing the association between all phenotypes in chromosome 11 and SNP m = rs916482. The top plot (D_mn) shows the magnitude and sign of the dependence on the genotype. This SNP has a large number of associated phenotypes, providing a strong indication of a master regulatory region.

Figure 6.4: Gene Ontology (GO) on the potential master regulatory region. Overrepresented GO terms among the phenotypes associated with the SNP rs916482, analyzed using FatiGO (http://www.fatigo.org/).

Chapter 7

Conclusions

The research presented in this dissertation has been developed with the primary application of providing new Bioinformatics tools for reliable and efficient processing of very large biological datasets. These datasets can measure millions of biological variables degraded by noise, but the underlying sparseness of these biological systems can be exploited to extract the relevant information.

In Chapter 2, a novel embedded FSS framework was developed that builds a block diagonal linear discriminant model (BDLDA). BDLDA was shown to be capable of identifying gene interactions and improving classification accuracy in simulations. The algorithm is based on a greedy search that recursively seeks to add either a new feature or an interaction term to the model. Feature selection is very important not only for reducing the number of parameters but also for identifying relevant genes or genetic pathways (groups of genes that are co-regulated) which are consistent with the underlying biology of the disease (e.g., inflammatory genes, DNA repair, cell division).

As higher density microarrays and direct RNA sequencing become available, novel analytical techniques will be required for finding differences in gene expression (and exon splicing) in different kinds of tissues, organisms or tumors. In this context, sparse signal representations can be used to group microarray probes that belong to the same exon and characterize which of the different exon combinations (splicing variants) are active in a given sample. A methodology similar to the one employed to detect copy number alterations (Chapter 3) can be used to extract these splicing features. These clean features can then be used to build models using the discriminant analysis tools (DLDA and BDLDA) we already developed in Chapter 2. As more knowledge of the underlying biology becomes publicly available and genetic pathways are better understood, the BDLDA model search strategies can be adapted to look for patterns of coexpression along these pathways.

Copy number alterations can also be represented by a sparse PWC representation. In the GADA approach, introduced in Chapter 3, sparse Bayesian learning (SBL) is used in combination with the PWC representation.
When compared to other popularly used copy number detection approaches, GADA achieves one of the highest detection accuracies while improving computational speed by several orders of magnitude, especially on very large arrays. Other segmentation approaches with accuracy comparable to GADA, such as CBS, have a quadratic computational cost. The GADA software and source code is publicly available (http://biron.usc.edu/~piquereg/GADA) and has been downloaded and used by a large number of institutions including the Sanger Institute, The Center for Applied Genomics (TCAG), and the Center for Genomic Regulation (CRG), among others.

The GADA approach has also been extended to scenarios where multiple samples are available and a joint analysis helps achieve higher detection accuracy (Chapters 4 and 5). The first of these scenarios, Chapter 4, models breakpoint locations that are shared across samples (examples include sample replicates or samples with common ancestry). The second scenario assumes that the breakpoints are not necessarily located at the same position across samples, but that there is a systematic perturbation (or measurement bias) on the probes that can be estimated and thereby removed. We demonstrated in Chapter 5 that joint extraction of a reference hybridization intensity and the copy number component of a large cohort of samples has better performance in terms of accuracy and robustness than using the median across the samples.

The shared breakpoint model can be integrated with the joint reference normalization model into a complete multiple sample model. These improved models and methods are especially indicated for experimental designs in which a set of well known reference samples is replicated across batches. These methods will help to bring the intensity values together and reduce the experimental variability across different experimental batches or laboratories. The GADA approach could also be extended to find PWC discriminatory regions along the genome to separate groups of samples in a supervised fashion. Copy number alterations associated with a target disease should also exhibit a piecewise constant behavior in the statistical scores used to measure association. GADA can exploit this PWC behavior to accurately detect regions associated with some disease condition.

In the near future we expect that the number of technologies available to extract genetic measurements will only continue to increase. If the current trend in microarray technology continues, the number of measurements doubles every few months. New and cheaper sequencing techniques (Solexa and 454 sequencing) are becoming available, enabling a new set of experiments that generate millions of reads that can map to any position on the genome (3 billion base pairs). Chromatin Immunoprecipitation (ChIP) studies have been developed to detect the organization of the DNA around nucleosomes in the nucleus and DNA binding proteins. Large scale projects launched in the last three years (e.g., the HapMap project, the cancer genome atlas (TCGA), the 1000 genomes project) are generating large amounts of data that is becoming publicly available. These datasets will shed light on the characteristics of the cancer genome and the natural variation present in healthy human cells, but will require developing new and more efficient analysis techniques. GADA or similar methods could be used to detect signals of interest in data obtained by the new generation sequencing techniques that are becoming available.
The knowledge of how to measure gene expression (mRNA) and genetic alterations (DNA) will make it possible to answer more complex biological questions. We will be able to use and extend the precise systems biology knowledge of small isolated biological processes to the genome-wide scale data obtained by high-throughput experiments on cells in living organisms. Sparse signal processing representations will be useful to interconnect these systems biology models in a large sparse network that will span the entire genome. For example, pharmacogenetics studies will measure the impact of a drug on gene expression for diseases that require better treatments and reduced undesired secondary effects. Other than drugs, new experiments will focus on the DNA alterations that are induced by viruses, which may lead to the discovery of new preventable agents of cancer.

We still know very little about which parts of the genome play a role in regulating rates of gene expression, mRNA degradation, or translation. Our ability to interpret the regulatory "language" of DNA sequences is extremely limited. Future research should aim to provide much better annotation of the regulatory elements in the human genome, and ultimately, to work towards detailed models of the combinatorial nature of gene regulation. These models and corresponding validations would contribute to a much deeper understanding of gene regulation mechanisms and the functional impact of genetic variation. The research presented in this thesis provides a starting point to develop analytical methods to jointly analyze the impact of genetic alterations together with gene expression levels to pinpoint important genome regions for gene regulation.

Appendix A

The role of the parameter a in SBL

The parameter a controls the shape of the prior distribution over the weights p(w) specified by the hierarchical prior defined by p(w|\alpha) (3.14) and p(\alpha) (3.15). Following [97], the \alpha hyperparameters can be integrated out to find the marginal "effective" prior p(w):

p(w) = \int p(w|\alpha) p(\alpha) \, d\alpha = \prod_{m=1}^{M-1} \int p(w_m|\alpha_m) p(\alpha_m) \, d\alpha_m = \prod_{m=1}^{M-1} p(w_m)    (A.1)

where p(w_m) is:

p(w_m) = \int p(w_m|\alpha_m) p(\alpha_m) \, d\alpha_m = \frac{\Gamma(a+1/2)}{\Gamma(a)\sqrt{2\pi a}} \sqrt{\frac{a}{b}} \left( 1 + \frac{w_m^2}{2b} \right)^{-(\frac{1}{2}+a)}    (A.2)

a t-distribution with 2a degrees of freedom and a scale parameter of \sqrt{b/a}. When b \to 0 and a is small, this distribution peaks very sharply at 0 and has very thick flat tails, as shown in Figure A.1.

Figure A.1: Plot of the SBL marginal prior distribution on a single weight for different choices of the hyperparameter a. To make the plot we approximated b \to 0 by b = 1E-80; this is conceptually similar to approximating a delta distribution by a normal distribution with \sigma = 1E-80.

As justified in Sections 3.3 and 3.4, the log of the prior distribution p(w) gives us the sparseness cost measure (i.e., the penalty for having coefficients different from 0):

\log p(w) = C(a,b) - \left(\frac{1}{2}+a\right) \sum_{m=1}^{M-1} \log\left( 1 + \frac{w_m^2}{2b} \right)

and we are interested in the case when b \to 0, which gives (3.21), repeated here for easier reference:

\log p(w) \xrightarrow{b \to 0} C(a) - (1+2a) \sum_{m=1}^{M-1} \log |w_m|

After providing the details of the derivation of the SBL sparseness cost, the second objective of this appendix is to provide more discussion of its properties.
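Before discussing those properties, note that the curves in Figure A.1 can be reproduced with a short Python sketch like the following (illustrative only; matplotlib is assumed available), which evaluates the effective log-prior of (A.2) up to an additive constant:

import numpy as np
import matplotlib.pyplot as plt

def log_prior(w, a, b=1e-80):
    """Unnormalized log of the marginal prior (A.2) on a single weight:
    log p(w) = const - (a + 1/2) * log(1 + w**2 / (2*b))."""
    return -(a + 0.5) * np.log1p(w ** 2 / (2 * b))

w = np.linspace(-10, 10, 2001)
for a in (1e-4, 1e-2, 0.2, 0.5, 1.0):
    plt.plot(w, log_prior(w, a), label=f"a = {a}")
plt.xlabel("w")
plt.ylabel("log p(w) (unnormalized)")
plt.legend()
plt.show()

For large |w| the curves are essentially flat, at a depth roughly (1+2a) times that of the a close to 0 curve, which is exactly the behavior discussed next.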
This sparseness cost is depicted for a single nonzero weight and several a in Figure A.1, and for multiple nonzero weights and a single a in Figure 3.3. The approximately flat tails make this sparseness cost a good approximation of the l_0 norm, and much more desirable than the l_1 norm (i.e., Laplacian prior).

Considering specifically a in (3.21) and Figure A.1, we can see that the sparseness penalty is proportional to (1+2a). For example, in Figure A.1 (left), for a = 1 we get a penalty of around 300 for large coefficients as compared to 100 when a is close to 0, i.e., (1+2a) times higher. Therefore, we can increase the sparseness by increasing a; this takes mass away from the tails and puts it on the "delta" (point mass at 0) by decreasing the rate of tail decay. In Figure A.1 (right), the tail decay rate is about (1+2a) on the natural logarithmic scale.

The parameter a also has an impact on the convergence rate of the EM algorithm, i.e., the speed of the SBL algorithm. In our experiments, for higher sparseness settings (fewer breakpoints and larger a), the algorithm converges faster than for smaller a. This is also supported by the following argument. The \alpha_m^{-1} parameters either converge to 0 (breakpoint discarded) or to a finite point (breakpoint accepted). The EM algorithm rate of convergence is governed by the maximum eigenvalue of the Jacobian matrix of the EM mapping defined in (3.20) [66]. In that situation, 1/(1+2a) would pull out of the derivative of \alpha_m^{-1} in (3.20), thus speeding up convergence since the maximum eigenvalue is divided by (1+2a).

In conclusion, the parameter a controls both the sparseness in the SBL algorithm and the speed of the algorithm. An increase of a leads to a sparser result, fewer breakpoints, and faster convergence.

Appendix B

Backward Elimination algorithm properties

The backward elimination (BE) procedure used in GADA (Chapters 3, 4 and 5) could be used alone, without the SBL step, for CNA detection. It is based on considering our PWC model (3.8) as a classical variable regression selection problem, y \sim Fw, where the regressors w_i with less impact on the residual are sequentially removed one by one. To the best of our knowledge, this simple procedure has never been proposed as a standalone technique for CNA detection before. This is a greedy approach, which is suboptimal since we may eliminate breakpoints that could be more significant at a later stage. Since errors can be added by each greedy decision, this algorithm tends to be more reliable when the number of regressors (i.e., candidate breakpoints) is smaller. Compared to forward selection (FS), BE has been seen to perform better in situations where, as in our case, the columns of F have a high degree of collinearity [53]. Furthermore, the structure of F, the design matrix, can be exploited to efficiently find and remove each breakpoint and produce a ranking list as detailed in Algorithm 4.
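Algorithm 4 itself is not reproduced in this transcript. As a rough illustration of the BE loop, the following Python sketch removes breakpoints using the statistic t_j derived below in (B.9)-(B.10). It is simplified: it recomputes all segment means at each pass instead of using the constant-time local updates of Algorithm 4, and all names are assumptions for the example:

import numpy as np

def backward_eliminate(y, breakpoints, sigma, T):
    """Greedy backward elimination of PWC breakpoints.

    y: 1-D numpy array of probe intensities; breakpoints: sorted candidate
    breakpoint indices (a breakpoint at i separates y[:i+1] from y[i+1:]);
    sigma: noise standard deviation; T: threshold on |t_j| from (B.10)."""
    bp = list(breakpoints)
    while bp:
        edges = [0] + [b + 1 for b in bp] + [len(y)]
        t = []
        for j in range(len(bp)):
            left = y[edges[j]:edges[j + 1]]
            right = y[edges[j + 1]:edges[j + 2]]
            # Difference of adjacent segment means, standardized as in (B.10)
            diff = right.mean() - left.mean()
            t.append(diff / (sigma * np.sqrt(1.0 / len(left) + 1.0 / len(right))))
        j_min = int(np.argmin(np.abs(t)))
        if abs(t[j_min]) > T:
            break  # every remaining breakpoint is significant at threshold T
        bp.pop(j_min)  # remove the least significant breakpoint
    return bp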
Using standard linear regression, for a given fixed breakpoint set I, the least squares estimate for the breakpoint weights w_I is found by solving the normal equations:

F_I^t y = F_I^t F_I \hat{w}_I, \qquad \hat{w}_I = (F_I^t F_I)^{-1} F_I^t y    (B.1)

which gives the orthogonal projection \hat{x}_I of the vector y on S_I as:

\hat{x}_I = F_I \hat{w}_I    (B.2)

\hat{x}_I = F_I (F_I^t F_I)^{-1} F_I^t y    (B.3)

and the residual sum of squares RSS_I, or norm of the error, is:

RSS_I = \| y - \hat{x}_I \|^2 = \| y - F_I \hat{w}_I \|^2    (B.4)

All these operations can be solved efficiently by noticing again that H_I = (F_I^t F_I)^{-1} is a symmetric tridiagonal matrix, with main diagonal h_0 (3.22) and first off-diagonals h_1 (3.23) (see lines 2 and 3 of Algorithm 4).

The criterion to decide which breakpoint to remove can be seen in three different but equivalent ways.

First, we might consider removing the breakpoint which increases RSS_I the least. If we denote by RSS_j the residual sum of squares after removing i_j from I (RSS_{I-j}), then the increase in RSS is:

RSS_j - RSS_I = \hat{w}_I(j)^2 / h_0(j)    (B.5)

Furthermore, when the noise is normal N(0, \sigma^2 I),

F_j = \frac{RSS_j - RSS_I}{RSS_I / (M-K)}    (B.6)

is distributed as an F_{1,M-K} Fisher-Snedecor distribution (M is the number of candidate breakpoints, and K = |I| the number of breakpoints in the model). If \sigma^2 is known, or M >> K, then RSS_I/(M-K) \to \sigma^2 and F_{1,\infty} \sim \chi^2_1; thus

t_j^2 = \frac{RSS_j - RSS_I}{\sigma^2} = \frac{\hat{w}_I(j)^2}{\sigma^2 h_0(j)}    (B.7)

is distributed as a \chi^2_1 distribution.

Second, assume that the noise is normal N(0, \sigma^2 I) and \sigma^2 is known. Then the least squares estimate \hat{w}_I is also normally distributed:

\hat{w}_I \sim N(w_I, \sigma^2 H_I)    (B.8)

Therefore, under the hypothesis that w_I(j) = 0,

t_j \equiv \frac{\hat{w}_I(j)}{\sqrt{\sigma^2 h_0(j)}} \sim N(0,1)    (B.9)

Third, developing what t_j represents in terms of y and \sigma^2 by performing all the operations in (B.1), we can see that:

t_j = \frac{ \frac{1}{i_{j+1}-i_j} \sum_{m=i_j+1}^{i_{j+1}} y_m \; - \; \frac{1}{i_j-i_{j-1}} \sum_{m=i_{j-1}+1}^{i_j} y_m }{ \sigma \sqrt{ \frac{1}{i_{j+1}-i_j} + \frac{1}{i_j-i_{j-1}} } }    (B.10)

which can be interpreted as the difference between the sample means of the right and left segments around breakpoint i_j, divided by the square root of the variance of that difference. Even if the noise is not normal but has a finite variance \sigma^2, (B.10) tells us that as the size of the segments increases, under the null hypothesis of no difference, t_j will converge to N(0,1) because of the central limit theorem.

Recalculation of the weights after each removal can be done efficiently with very few (a constant number of) operations using the weights already calculated (see lines 9, 12 and 16 of Algorithm 4). Thus the overall complexity to rank a breakpoint set I is linear in the size of the set, O(|I|).

Appendix C

Adjustment of the SBL and BE parameters in GADA

Both the SBL and the BE procedures could be used independently to estimate copy number changes. However, the best results and flexibility are obtained with the combination of these two algorithms that was discussed in Section 3.7.

The objectives of this appendix are: 1) to show that SBL and BE produce breakpoint sets that are subsets of those obtained with higher sparseness settings (higher T or a), and can produce equivalent breakpoint sets; 2) to propose a strategy for efficient parameter adjustment in the most general case; and 3) to evaluate the effectiveness of this strategy on the simulated dataset [103].

The experiments consist of drawing simulated chromosomes of different lengths M (M = 100, 200, 500, 1000 and 2000 probes per chromosome) in the following conditions:
i. Simulation of the null hypothesis (no breakpoints) using normal noise with different levels of variance.

ii. Simulation of normal copy number variations (few breakpoints and short segments) with real noise obtained by randomly sampling segments of data of size M from a pool of normal (diploid genome) CEPH cell line samples analyzed on the Affymetrix 250K Nsp array platform.

iii. Simulation of cancer copy number variations, by sampling random chunks of data of size M from the cancer samples analyzed in Section 3.9.3.

iv. Evaluation on the simulated dataset analyzed in Section 3.8.3 (only M = 100).

For i. to iii. we simulated L = 10000 chromosomes; for the last case iv., all the L = 500 x 20 = 2000 chromosomes of size M = 100 were used. Each sample, i.e. chromosome, was analyzed with different options for a and T, and the returned breakpoint sets were evaluated using different metrics. The sparseness of each set was computed as the number of returned breakpoints divided by the size of the chromosome, i.e. |I|/M; \lambda denotes the average sparseness across all samples. When comparing two breakpoint sets A and B obtained for the same sample but with different parameter settings, we denote by A \cap B the set of common breakpoints, which in our case includes all breakpoints in A such that there exists a breakpoint in B less than \delta probes away (if there are two breakpoints in A closer than \delta to a breakpoint in B, then only the closest one is assigned to the intersection). We then computed the averages of the following metrics [56] along the L simulated samples:

P(A = B) = \frac{|A \cap B|}{|A \cup B|}    (C.1)

P(A \subset B) = \frac{|A \cap B|}{|A|}    (C.2)

which represent the proportion of breakpoints that are the same in both sets, i.e., concordance (C.1); and the proportion of breakpoints in A that are also in B, i.e., inclusiveness of A in B (C.2).
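A minimal Python sketch of these set-comparison metrics could look as follows. It is simplified: it ignores the closest-breakpoint tie rule described above, and the helper names are assumptions for the example:

import numpy as np

def match_within(A, B, delta=2):
    """Boolean mask: breakpoints of A having a breakpoint of B within delta probes."""
    A, B = np.asarray(A), np.asarray(B)
    if A.size == 0 or B.size == 0:
        return np.zeros(A.size, dtype=bool)
    return np.array([np.abs(B - a).min() <= delta for a in A])

def concordance(A, B, delta=2):
    """P(A = B) of (C.1): matched breakpoints over the size of the union."""
    n_match = match_within(A, B, delta).sum()
    n_union = len(A) + len(B) - n_match
    return n_match / n_union if n_union else 1.0

def inclusiveness(A, B, delta=2):
    """P(A in B) of (C.2): fraction of breakpoints of A matched in B."""
    m = match_within(A, B, delta)
    return m.mean() if m.size else 1.0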
then the probability of detection is higher and the high concordance is reached for smaller values of T than in the (i.) case (compare A and C , and B and C, on Figure C.1). For (iii.) case (datanotshown)theconcordanceisevenhighersincecancersamplescontainmoreCNA. ThesizeofthechromosomeM alsohassomeimpactontheconvergence; onchromosomes with larger M high concordance is reached at a higher T, but for M > 2000 it does not movefurthermoretotheright. Additionally, ourresultsoncase(i.) areexactlythesame for different noise power σ 2 because both a and T have already been corrected by ˆ σ 2 . C.2 Strategy to adjust a and T in GADA Adjusting sparseness with T can be done at no additional computational cost, while adjusting a requires to run the EM algorithm again. Thus, a good strategy is to select a small a for SBL, i.e., one that provides an initial breakpoint set that reduces most of the unlikely breakpoints but still ensures a high sensitivity. The first step is sufficiently sensitive so as not to miss any breakpoints that would require us to switch back to a lower a. Then, the final degree of sparseness will be adjusted with T. From the previous experiment in concordance between sets (Figure C.1), we can see thatagoodsensitivitymeansthatwedonotremoveanythingthatwouldnotberemoved with a lower a at the same T. The worst case, i.e. requiring a higher T for the same 173 concordance, is where there is nothing to be detected (Figure C.1 A and B); or when the true copy number alterations are very short and small (the hardest to be detected). Moreover, dense arrays (higher M) will be more sensitive because CNA will be sam- pled with more probes and will produce statistically larger t (compare panels A and C to B and D in Figure C.1). Thus, small arrays will be those requiring the smallest T to be highly sensitive. Even very small arrays with 100 probes per chromosome, T = 4 provides enough initial sensitivity. Thus, we find that a=0.2 should be small enough in general, and is the value that we have used in all the results on Section 3.8. In Chapter 5 and 6 we increase a up to 0.5 because the size of the arrays M is much larger and we can use a higher T with similar initial sensitivity. A practical approach to ensure that the parameter a is well chosen for a particular setting ofT on real data is the following. Assuming thata=0.2 (or any other choice) we can check that is small enough for a particular T of interest by rerunning the algorithm with a lower a, e.g. a = a/2, and checking if the set of breakpoints returned for that particular T and different a’s are essentially the same (e.g., >95% concordant). C.3 Sensitivity to the adjustments of a and T We will use the simulation case (iv.), to evaluate the impact of the parameter setting strategy described in the previous section in terms of accuracy. This is the same dataset astheoneusedinSection3.8.3andby[103],wheretheunderlyingbreakpointsareknown, so we can exactly evaluate the FDR and the sensitivity for different choices of T and a. 174 InFigureC.2,curvescorrespondingtodifferentahavedifferentstartingpointinterms of sensitivity and FDR, but as T increases we decrease the FDR and similar operational points in terms of sensitivity and FDR are reached compared to those obtained from different a. 
The proposed a = 0.2 in Section C.2 offers an initial sensitivity and FDR such that all the remaining points in the curve are reached adjusting only T, providing all the levels of sensitivity or FDR that we might be interested in using without having to switch to another a. Compared to CBS, we are able to obtain a wider margin of operating points of the PROCcurve. Moreover,independentlyoftheinitialawealwayshaveapointwithsimilar or better average performance either in terms of FDR or sensitivity. 175 A 0 1 1.5 2 2.4 2.8 3 3.2 3.4 3.6 3.8 4 4.2 4.4 4.6 4.8 5 5.2 5.4 5.6 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 T threshold —— P(I a =I 0.01 ) ; − − − P(I a ⊂ I 0.01 ) i) Normal noise, no breakpoints, M=2000 0 1 2 3 4 5 6 7 8 −7 −6 −5 −4 −3 −2 −1 0 − log 10 ( α ) = − log 10 ( P( |t|>T | w=0 ) ) log 10 ( λ ) Sparseness a = 0.01 a = 0.05 a = 0.1 a = 0.2 a = 0.5 a = 0.8 B 0 1 1.5 2 2.4 2.8 3 3.2 3.4 3.6 3.8 4 4.2 4.4 4.6 4.8 5 5.2 5.4 5.6 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 T threshold —— P(I a =I 0.01 ) ; − − − P(I a ⊂ I 0.01 ) i) Normal noise, no breakpoints, M=500 0 1 2 3 4 5 6 7 8 −7 −6 −5 −4 −3 −2 −1 0 − log 10 ( α ) = − log 10 ( P( |t|>T | w=0 ) ) log 10 ( λ ) Sparseness a = 0.01 a = 0.05 a = 0.1 a = 0.2 a = 0.5 a = 0.8 C 0 1 1.5 2 2.4 2.8 3 3.2 3.4 3.6 3.8 4 4.2 4.4 4.6 4.8 5 5.2 5.4 5.6 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 T threshold —— P(I a =I 0.01 ) ; − − − P(I a ⊂ I 0.01 ) ii) Real noise (NSP platform), CEPH cell−lines CNA, M=1000 0 1 2 3 4 5 6 7 8 −3 −2.5 −2 −1.5 −1 −0.5 − log 10 ( α ) = − log 10 ( P( |t|>T | w=0 ) ) log 10 ( λ ) Sparseness a = 0.01 a = 0.05 a = 0.1 a = 0.2 a = 0.5 D 0 1 1.5 2 2.4 2.8 3 3.2 3.4 3.6 3.8 4 4.2 4.4 4.6 4.8 5 5.2 5.4 5.6 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 T threshold —— P(I a =I 0.01 ) ; − − − P(I a ⊂ I 0.01 ) ii) Real noise (NSP platform), CEPH cell−lines CNA, M=500 0 1 2 3 4 5 6 7 8 −3 −2.5 −2 −1.5 −1 −0.5 − log 10 ( α ) = − log 10 ( P( |t|>T | w=0 ) ) log 10 ( λ ) Sparseness a = 0.01 a = 0.05 a = 0.1 a = 0.2 a = 0.5 Figure C.1: The four panels (A,B,C and D) represent a different experimental dataset, with the results of applying different settings of a and T parameters. Each color corre- sponds to different setting of a = 0.01,0.05,0.1,0.2,0.5, and the x-axis increasing values of T or its associated significance level log 10 (α). On the top plot we have represented the inclusiveness P (I a ⊂I a=0.01 ) (dashed line); and the concordance P (I a =I a=0.01 ). The concordant breakpoints are defined within a window of δ = 2 probes. The bottom plot represents the sparseness which on A and B also represent specificity because there are no underlying breakpoints. A and B use the normal noise simulation described in i. with chromosome lengths of M = 500 and M = 2000 (different noise levels σ 2 generate ex- actly the same curves); and C and D use the simulation described in ii. with chromosome lengths of M =500 and M =1000. 176 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False Discovery Rate (1−Precision) Sensitivity (Recall) SBL a=0.05 SBL a=0.1 SBL a=0.2 SBL a=0.5 SBL a=0.8 SBL a=1.0 CBS Figure C.2: PR operational curves for sensitivity vs. false discovery rate in detecting copy number changes within δ = 2 probe window. Each line corresponds to SBL+BE with different starting breakpoint sets (a = 0.05,0.1,0.2,0.5,0.8,1.0,1.5) and varying T (T increases as we traverse the curve from right to left, i.e. FDR decreases). 
Figure C.2: PR operational curves for sensitivity vs. false discovery rate in detecting copy number changes within a \delta = 2 probe window. Each line corresponds to SBL+BE with a different starting breakpoint set (a = 0.05, 0.1, 0.2, 0.5, 0.8, 1.0, 1.5) and varying T (T increases as we traverse each curve from right to left, i.e. FDR decreases). The light green curve represents the operating points obtained by CBS with different \alpha = 1E-4, 0.001, 0.002, 0.005, 0.01, 0.05.

Bibliography

[1] Genome-wide human SNP array 6.0 sample data set. Affymetrix, 2007. http://www.affymetrix.com/support/technical/sample_data/genomewide_snp6_data.affx.

[2] B. Alberts. Molecular biology of the cell. Garland Science, New York, 5th edition, 2008.

[3] D. G. Albertson, C. Collins, F. McCormick, and J. W. Gray. Chromosome aberrations in solid tumors. Nat Genet, 34(4):369–76, 2003.

[4] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A, 96(12):6745–50, 1999.

[5] C. Ambroise and G. J. McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci U S A, 99(10):6562–6, 2002.

[6] S. Asgharzadeh, R. Pique-Regi, R. Sposto, H. Wang, Y. Yang, H. Shimada, K. Matthay, J. Buckley, A. Ortega, and R. C. Seeger. Prognostic significance of gene expression profiles of metastatic neuroblastomas lacking MYCN gene amplification. J Natl Cancer Inst, 98(17):1193–203, 2006.

[7] H. Bengtsson, R. Irizarry, B. Carvalho, and T. P. Speed. Estimation and assessment of raw copy numbers at the single locus level. Bioinformatics, 24(6):759–767, 2008.

[8] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 57:289–300, 1995.

[9] Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29:1165–1188, 2001.

[10] T. Bo and I. Jonassen. New feature subset selection procedures for classification of expression profiles. Genome Biol, 3(4):RESEARCH0017, 2002.

[11] P. Broet and S. Richardson. Detection of gene copy number changes in CGH microarrays using a spatially correlated mixture model. Bioinformatics, 22(8):911–8, 2006.

[12] E. J. Candès and J. K. Romberg. Robust signal recovery from incomplete observations. In ICIP, pages 1281–1284, 2006.

[13] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.

[14] V. G. Cheung, R. S. Spielman, K. G. Ewens, T. M. Weber, M. Morley, and J. T. Burdick. Mapping determinants of human gene expression by regional and genome-wide association. Nature, 437(7063):1365–9, 2005.

[15] F. S. Collins, E. D. Green, A. E. Guttmacher, and M. S. Guyer. A vision for the future of genomics research. Nature, 422(6934):835–847, 2003.

[16] The International Human Genome Mapping Consortium. A physical map of the human genome. Nature, 409(6822):934–941, 2001.

[17] S. J. Diskin, T. Eck, J. Greshock, Y. P. Mosse, T. Naylor, C. J. Stoeckert, Jr., B. L. Weber, J. M. Maris, and G. R. Grant. STAC: A method for testing the significance of DNA copy number aberrations across multiple array-CGH experiments. Genome Res, 16(9):1149–58, 2006.

[18] S. J. Diskin, M. Li, C. Hou, S. Yang, J. Glessner, H. Hakonarson, M. Bucan, J. M. Maris, and K. Wang. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res, 36(19):e126, 2008.

[19] M. N. Do and M. Vetterli. The contourlet transform: an efficient directional multiresolution image representation. IEEE Transactions on Image Processing, 14(12):2091–2106, 2005.
[20] D. Donoho, M. Elad, and V. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Information Theory, 52(1):6–18, Jan. 2006.

[21] P. Dragotti and M. Vetterli. Wavelet footprints: Theory, algorithms, and applications. IEEE Trans. Signal Processing, 51(5):1306–1323, May 2002.

[22] R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley and Sons, 2001. ISBN 0-471-05669-3.

[23] S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77–87, 2002.

[24] B. Efron and R. Tibshirani. An introduction to the bootstrap. Monographs on statistics and applied probability. Chapman Hall, New York, 1993.

[25] D. A. Engler, G. Mohapatra, D. N. Louis, and R. A. Betensky. A pseudolikelihood approach for simultaneous analysis of array comparative genomic hybridizations. Biostatistics, 7(3):399–421, 2006.

[26] L. Feuk, A. R. Carson, and S. W. Scherer. Structural variation in the human genome. Nat Rev Genet, 7(2):85–97, 2006.

[27] D. Fredman, S. J. White, S. Potter, E. E. Eichler, J. T. Den Dunnen, and A. J. Brookes. Complex SNP-related sequence variation in segmental genome duplications. Nat Genet, 36(8):861–6, 2004.

[28] J. L. Freeman, G. H. Perry, L. Feuk, R. Redon, S. A. McCarroll, D. M. Altshuler, H. Aburatani, K. W. Jones, C. Tyler-Smith, M. E. Hurles, N. P. Carter, S. W. Scherer, and C. Lee. Copy number variation: new insights in genome diversity. Genome Res, 16(8):949–61, 2006.

[29] J. Fridlyand, A. M. Snijders, D. Pinkel, D. G. Albertson, and A. N. Jain. Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis, 90(1):132–153, 2004.

[30] J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84:165–175, 1989.

[31] L. A. Garraway, H. R. Widlund, M. A. Rubin, G. Getz, A. J. Berger, S. Ramaswamy, R. Beroukhim, D. A. Milner, S. R. Granter, J. Du, C. Lee, S. N. Wagner, C. Li, T. R. Golub, D. L. Rimm, M. L. Meyerson, D. E. Fisher, and W. R. Sellers. Integrative genomic analyses identify MITF as a lineage survival oncogene amplified in malignant melanoma. Nature, 436(7047):117–22, 2005.

[32] A. Gelman, J. B. Carlin, H. Stern, and D. B. Rubin. Bayesian data analysis. Chapman and Hall/CRC, New York, 2nd edition, 2004.

[33] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–7, 1999.

[34] Y. Guo, T. Hastie, and R. Tibshirani. Regularized linear discriminant analysis and its application in microarrays. Biostatistics, 8(1):86–100, 2007.

[35] A. E. Guttmacher and F. S. Collins. Genomic medicine - a primer. N Engl J Med, 347(19):1512–20, 2002.

[36] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

[37] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Mach. Learn., 46(1-3):389–422, 2002.

[38] HapMap Consortium. The International HapMap Project. Nature, 426(6968):789–96, 2003. http://www.hapmap.org.

[39] T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. Annals of Statistics, 23:73–102, 1995.

[40] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, July 2001.
[41] J. P. Hoffbeck and D. A. Landgrebe. Covariance matrix estimation and classification with limited training data. IEEE Trans. Pattern Anal. Mach. Intell., 18(7):763–767, 1996.

[42] L. Hsu, S. G. Self, D. Grove, T. Randolph, K. Wang, J. J. Delrow, L. Loo, and P. Porter. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics, 6(2):211–26, 2005.

[43] J. Huang, W. Wei, J. Zhang, G. Liu, G. R. Bignell, M. R. Stratton, P. A. Futreal, R. Wooster, K. W. Jones, and M. H. Shapero. Whole genome DNA copy number changes identified by high density oligonucleotide arrays. Hum Genomics, 1(4):287–99, 2004.

[44] T. Huang, B. Wu, P. Lizardi, and H. Zhao. Detection of DNA copy number alterations using penalized least squares regression. Bioinformatics, 21(20):3811–7, 2005.

[45] T. J. P. Hubbard, B. L. Aken, K. Beal, B. Ballester, M. Caccamo, Y. Chen, L. Clarke, G. Coates, F. Cunningham, T. Cutts, T. Down, S. C. Dyer, S. Fitzgerald, J. Fernandez-Banet, S. Graf, S. Haider, M. Hammond, J. Herrero, R. Holland, K. Howe, K. Howe, N. Johnson, A. Kahari, D. Keefe, F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, C. Melsopp, K. Megy, P. Meidl, B. Ouverdin, A. Parker, A. Prlic, S. Rice, D. Rios, M. Schuster, I. Sealy, J. Severin, G. Slater, D. Smedley, G. Spudich, S. Trevanion, A. Vilella, J. Vogel, S. White, M. Wood, T. Cox, V. Curwen, R. Durbin, X. M. Fernandez-Suarez, P. Flicek, A. Kasprzyk, G. Proctor, S. Searle, J. Smith, A. Ureta-Vidal, and E. Birney. Ensembl 2007. Nucl. Acids Res., page gkl996, 2006.

[46] P. Hupe, N. Stransky, J.-P. Thiery, F. Radvanyi, and E. Barillot. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20(18):3413–3422, 2004.

[47] A. J. Iafrate, L. Feuk, M. N. Rivera, M. L. Listewnik, P. K. Donahoe, Y. Qi, S. W. Scherer, and C. Lee. Detection of large-scale variation in the human genome. Nat Genet, 36(9):949–51, 2004.

[48] Special issue: Sensing, sampling, and compression. IEEE Signal Processing Magazine, 25(2), 2008.

[49] R. C. Jansen and J. P. Nap. Genetical genomics: the added value from segregation. Trends Genet, 17(7):388–91, 2001.

[50] J.-L. Starck, E. J. Candès, and D. L. Donoho. The curvelet transform for image denoising. IEEE Transactions on Image Processing, 11(6):670–684, 2002.

[51] A. Kallioniemi, O. P. Kallioniemi, D. Sudar, D. Rutovitz, J. W. Gray, F. Waldman, and D. Pinkel. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science, 258(5083):818–21, 1992.

[52] W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, et al. The Human Genome Browser at UCSC. Genome Res., 12(6):996–1006, 2002.

[53] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.

[54] D. Komura, F. Shen, S. Ishikawa, K. R. Fitch, G. Liu, S. Ihara, H. Nakamura, M. E. Hurles, C. Lee, S. W. Scherer, K. W. Jones, M. H. Shapero, J. Huang, and H. Aburatani. Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Res, 16(12):1575–84, 2006.

[55] J. M. Korn, F. G. Kuruvilla, S. A. McCarroll, A. Wysoker, J. Nemesh, S. Cawley, E. Hubbell, J. Veitch, P. J. Collins, K. Darvishi, C. Lee, M. M. Nizzari, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet, 40(10):1253–60, 2008.
[56] B. Kosko. Probable equivalence, superpower sets, and superconditionals. International Journal of Intelligent Systems, 19(12):1151–1171, 2004.

[57] W. R. Lai, M. D. Johnson, R. Kucherlapati, and P. J. Park. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics, 21(19):3763–70, 2005.

[58] T. Lal, O. Chapelle, J. Weston, and A. Elisseeff. Embedded methods, chapter 6. Springer, 2005.

[59] E. S. Lander and P. Green. Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci U S A, 84(8):2363–7, 1987.

[60] D. Lipson, Y. Aumann, A. Ben-Dor, N. Linial, and Z. Yakhini. Efficient calculation of interval scores for DNA copy number data analysis. J Comput Biol, 13(2):215–28, 2006.

[61] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Processing, 41(12):3397–3415, 1993.

[62] S. G. Mallat. A wavelet tour of signal processing. Academic Press, San Diego, Calif.; London, 2nd edition, 1999.

[63] J. Marioni, N. Thorne, A. Valsesia, T. Fitzgerald, R. Redon, T. D. Andrews, B. Stranger, E. Dermitzakis, N. Carter, S. Tavare, and M. Hurles. Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biology, 8(10):R228, 2007.

[64] J. C. Marioni, N. P. Thorne, and S. Tavare. BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics, 22(9):1144–6, 2006.

[65] S. A. McCarroll, F. G. Kuruvilla, J. M. Korn, S. Cawley, J. Nemesh, A. Wysoker, M. H. Shapero, P. I. de Bakker, J. B. Maller, A. Kirby, A. L. Elliott, M. Parkin, E. Hubbell, T. Webster, R. Mei, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet, 40(10):1166–74, 2008.

[66] G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley, 1997.

[67] M. Morley, C. M. Molony, T. M. Weber, J. L. Devlin, K. G. Ewens, R. S. Spielman, and V. G. Cheung. Genetic analysis of genome-wide variation in human gene expression. Nature, 430(7001):743–7, 2004.

[68] Y. Nannya, M. Sanada, K. Nakazaki, N. Hosoya, L. Wang, A. Hangaishi, M. Kurokawa, S. Chiba, D. K. Bailey, G. C. Kennedy, and S. Ogawa. A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res, 65(14):6071–9, 2005.

[69] A. B. Olshen, E. S. Venkatraman, R. Lucito, and M. Wigler. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5(4):557–72, 2004.

[70] Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, Nov. 1993.

[71] D. A. Peiffer, J. M. Le, F. J. Steemers, W. Chang, F. Garcia, K. Haden, J. Li, C. A. Shaw, J. Belmont, S. W. Cheung, R. M. Shen, D. L. Barker, and K. L. Gunderson. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res, 16(9):1136–48, 2006.

[72] G. H. Perry, A. Ben-Dor, A. Tsalenko, N. Sampas, L. Rodriguez-Revenga, C. W. Tran, A. Scheffer, I. Steinfeld, P. Tsang, N. A. Yamada, H. S. Park, J.-I. Kim, J.-S. Seo, Z. Yakhini, S. Laderman, et al. The fine-scale and complex architecture of human copy-number variation. American Journal of Human Genetics, 82(3):685–695, 2008.

[73] F. Picard, S. Robin, M. Lavielle, C. Vaisse, and J. J. Daudin. A statistical approach for array CGH data analysis. BMC Bioinformatics, 6:27, 2005.
[74] R. Pique-Regi, J. Monso-Varona, A. Ortega, and S. Asgharzadeh. Bayesian detection of recurrent copy number alterations across multiple array samples. In Genomic Signal Processing and Statistics, 2008. GENSiPS 2008. IEEE International Workshop on, pages 1–4, 2008.

[75] R. Pique-Regi, J. Monso-Varona, A. Ortega, R. C. Seeger, T. J. Triche, and S. Asgharzadeh. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics, 24(3):309–18, 2008.

[76] R. Pique-Regi, J. Morrison, and D. C. Thomas. Bayesian hierarchical modelling of means and covariances of gene expression data within families. In BMC Genetic Analysis Workshop Proceedings, Nov. 2006.

[77] R. Pique-Regi and A. Ortega. Block diagonal linear discriminant analysis with sequential embedded feature selection. In Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, volume 5, 2006.

[78] R. Pique-Regi, A. Ortega, and S. Asgharzadeh. Sequential diagonal linear discriminant analysis (SeqDLDA) for microarray classification and gene identification. In Computational Systems and Bioinformatics, Aug. 2005.

[79] R. Pique-Regi, E. S. Tsau, A. Ortega, R. C. Seeger, and S. Asgharzadeh. Wavelet footprints and sparse Bayesian learning for DNA copy number change analysis. In Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, April 2007.

[80] J. R. Pollack, C. M. Perou, A. A. Alizadeh, M. B. Eisen, A. Pergamenschikov, C. F. Williams, S. S. Jeffrey, D. Botstein, and P. O. Brown. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat Genet, 23(1):41–6, 1999.

[81] J. Quackenbush. Computational analysis of microarray data. Nat Rev Genet, 2(6):418–27, 2001.

[82] R. Redon, S. Ishikawa, K. R. Fitch, L. Feuk, G. H. Perry, T. D. Andrews, H. Fiegler, M. H. Shapero, A. R. Carson, W. Chen, E. K. Cho, S. Dallaire, J. L. Freeman, J. R. Gonzalez, M. Gratacos, et al. Global variation in copy number in the human genome. Nature, 444(7118):444–54, 2006.

[83] G. Rigaill, P. Hupe, A. Almeida, P. La Rosa, J.-P. Meyniel, C. Decraene, and E. Barillot. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics, 24(6):768–774, 2008.

[84] C. Rouveirol, N. Stransky, P. Hupe, P. L. Rosa, E. Viara, E. Barillot, and F. Radvanyi. Computation of recurrent minimal genomic alterations from array-CGH data. Bioinformatics, 22(7):849–56, 2006.

[85] O. M. Rueda and R. Diaz-Uriarte. Flexible and accurate detection of genomic copy-number changes from aCGH. PLoS Comput Biol, 3(6):e122, 2007.

[86] E. E. Schadt, C. Li, B. Ellis, and W. H. Wong. Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J Cell Biochem Suppl, Suppl 37:120–5, 2001.

[87] E. E. Schadt, S. A. Monks, T. A. Drake, A. J. Lusis, N. Che, V. Colinayo, T. G. Ruff, S. B. Milligan, J. R. Lamb, G. Cavet, P. S. Linsley, M. Mao, R. B. Stoughton, and S. H. Friend. Genetics of gene expression surveyed in maize, mouse and man. Nature, 422(6929):297–302, 2003.

[88] G. A. F. Seber and A. J. Lee. Linear Regression Analysis. John Wiley, New York, second edition, 2003.

[89] S. P. Shah, W. L. Lam, R. T. Ng, and K. P. Murphy. Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics, 23(13):i450–8, 2007.

[90] S. P. Shah, X. Xuan, R. J. DeLeeuw, M. Khojasteh, W. L. Lam, R. Ng, and K. P. Murphy. Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics, 22(14):e431–9, 2006.
[91] L. Sheng, R. Pique-Regi, S. Asgharzadeh, and A. Ortega. Microarray classification using block diagonal linear discriminant analysis with embedded feature selection. In Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, April 2009.

[92] R. Simon. Development and evaluation of therapeutically relevant predictive classifiers using gene expression profiling. J. Natl. Cancer Inst., 98(17):1169–1171, 2006.

[93] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D'Amico, J. P. Richie, E. S. Lander, M. Loda, P. W. Kantoff, T. R. Golub, and W. R. Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203–9, 2002.

[94] B. E. Stranger, M. S. Forrest, A. G. Clark, M. J. Minichiello, S. Deutsch, R. Lyle, S. Hunt, B. Kahl, S. E. Antonarakis, S. Tavare, P. Deloukas, and E. T. Dermitzakis. Genome-wide associations of gene expression variation in humans. PLoS Genet, 1(6):e78, 2005.

[95] D. C. Thomas, D. Qian, W. J. Gauderman, K. Siegmund, and J. L. Morrison. A generalized estimating equations approach to linkage analysis in sibships in relation to multiple markers and exposure factors. Genet Epidemiol, 17 Suppl 1:S737–42, 1999.

[96] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A, 99(10):6567–72, 2002.

[97] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1:211–244, 2001.

[98] J. A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Trans. Information Theory, 50(10):2231–2242, 2004.

[99] Y. Tsaig and D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52:1289–1306, 2006.

[100] A. Tsalenko, R. Sharan, H. Edvardsen, V. Kristensen, A. L. Borresen-Dale, A. Ben-Dor, and Z. Yakhini. Analysis of SNP-expression association matrices. Proc IEEE Comput Syst Bioinform Conf, pages 135–43, 2005.

[101] E. S. Venkatraman and A. B. Olshen. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics, 2007.

[102] M. Vetterli and J. Kovacevic. Wavelets and subband coding. Prentice Hall PTR, Englewood Cliffs, N.J., 1995.

[103] H. Willenbrock and J. Fridlyand. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics, 21(22):4084–91, 2005.

[104] D. Wipf and B. Rao. Sparse Bayesian learning for basis selection. IEEE Trans. Signal Processing, 52(8):2153–2164, Aug. 2004.

[105] M. Xiong, W. Li, J. Zhao, L. Jin, and E. Boerwinkle. Feature (gene) selection in gene expression-based tumor classification. Mol Genet Metab, 73(3):239–47, 2001.

[106] J. Ye, T. Li, T. Xiong, and R. Janardan. Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM Trans Comput Biol Bioinform, 1(4):181–90, 2004.

[107] X. Zhao, C. Li, J. G. Paez, K. Chin, P. A. Janne, T. H. Chen, L. Girard, J. Minna, D. Christiani, C. Leo, J. W. Gray, W. R. Sellers, and M. Meyerson. An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res, 64(9):3060–71, 2004.
Abstract
Microarrays and new sequencing techniques offer a high-throughput platform to study the whole genome, with the unprecedented capability of measuring millions of genomic features in a single assay. This massively parallel measurement capability holds enormous potential for research in biology and medicine, with the ultimate objective of identifying and characterizing the biological processes at work in different organisms and diseases. Existing model-learning methods, however, are severely limited by the small number of samples typically available relative to the number of measurements.
Conceptually similar
Novel variations of sparse representation techniques with applications
Mapping sparse matrix scientific applications onto FPGA-augmented reconfigurable supercomputers
Robust representation and recognition of actions in video
Investigating statistical modeling approaches for reservoir characterization in waterfloods from rates fluctuations
Novel algorithms for large scale supervised and one class learning
Quantitative modeling of emotion perception in music
The extraction and complexity limits of graphical models for linear codes
Knowledge-driven representations of physiological signals: developing measurable indices of non-observable behavior
Parallel implementations of the discrete wavelet transform and hyperspectral data compression on reconfigurable platforms: approach, methodology and practical considerations
Graph-based models and transforms for signal/data processing with applications to video coding
Enriching spoken language processing: representation and modeling of suprasegmental events
Prediction modeling and statistical analysis of amino acid substitutions
Word, sentence and knowledge graph embedding techniques: theory and performance evaluation
Theoretical models of voltage controlled oscillators and the effects of non-linearity
Lifting transforms on graphs: theory and applications
Dynamical representation learning for multiscale brain activity
Modeling the learner's attention and learning goals using Bayesian network
Interactive rapid part-based 3d modeling from a single image and its applications
Model, identification & analysis of complex stochastic systems: applications in stochastic partial differential equations and multiscale mechanics
Modeling and predicting with spatial-temporal social networks
Asset Metadata
Creator: Pique-Regi, Roger (author)
Core Title: Sparse representation models and applications to bioinformatics
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 08/05/2009
Defense Date: 01/30/2009
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: bioinformatics, copy number, linear discriminant analysis, OAI-PMH Harvest, sparse Bayesian learning, sparse models, sparse representations
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Ortega, Antonio (committee chair), Asgharzadeh, Shahab (committee member), Kosko, Bart (committee member)
Creator Email: piquereg@usc.edu, rpique@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m2484
Unique Identifier: UC1149290
Identifier: etd-Regi-2698 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-184415 (legacy record id), usctheses-m2484 (legacy record id)
Legacy Identifier: etd-Regi-2698.pdf
Dmrecord: 184415
Document Type: Dissertation
Rights: Pique-Regi, Roger
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name: Libraries, University of Southern California
Repository Location: Los Angeles, California
Repository Email: cisadmin@lib.usc.edu