Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Integrative analysis of gene expression and phenotype data
(USC Thesis Other)
Integrative analysis of gene expression and phenotype data
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
INTEGRATIVE ANALYSIS OF GENE EXPRESSION AND PHENOTYPE DATA by Min Xu A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Ful¯llment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY) August 2009 Copyright 2009 Min Xu Dedication To Luge and Shiyou ii Acknowledgements I would ¯rst like to express my appreciation to my adviser, Professor Xianghong Jasmine Zhou, for her insight, guidance, and support. Over the last few years, her encouragement and commitment have been inspiring, and I am honored to take my place as an alumnus ofherlab. IalsowanttothankProfessorGarethJames,ProfessorFengzhuSun,Professor MichaelWatermanandProfessorDonaldArnoldforservingineithermyPh.D.candidacy or dissertation committees. I want to thank Dr. Wenyuan Li, Professor Gareth James, Professor Louxin Zhang, Dr. Mike Mehan, Qiang Song, Dr. Juan Nunez-Iglesias and Dr. Ming-Chih Kao for their cooperation, as well as Dr. Chao Cheng for manuscript reviews. My thanks also go to all the graduate students and postdoctoral fellows from our Computational Biology Program for their friendship during my stay here. I wish them great success in their studies and future careers. I wish to extend my gratitude to other colleagues and friends, including Professor Frank Alber, Dr. Jianjun Hu, Dr. Jim Liu, Dr. Xiting Yan, Dr. Huanying Ge, Dr. Kangyu Zhang, Dr. Zhidong Tu, Dr. Xiaotu Ma, Dr. Quansong Ruan, Dr. Li Wang, Dr. Weihong Xu, and Dr. Shihua Zhang, for their help during the course of my studies. iii Finally, I especially thank my wife, Luge Yang, for her dedication to our family and my father, Shiyou Wu, for his encouragement and support. This dissertation is dedicated to them. iv Table of Contents Dedication ii Acknowledgements iii List Of Tables vii List Of Figures viii Abstract ix Chapter 1: Introduction 1 1.1 Integrative analysis of gene expression and phenotype association . . . . . 1 1.2 Gene expression and phenotype data . . . . . . . . . . . . . . . . . . . . . 1 1.3 Integrative analysis of gene-phenotype association. . . . . . . . . . . . . . 5 1.4 Summary of our work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4.1 Automated multi-dimensional phenotype pro¯ling . . . . . . . . . 6 1.4.2 Stable re¯nement of discriminative gene clusters . . . . . . . . . . 7 1.4.3 Disease-speci¯c pathway identi¯cation . . . . . . . . . . . . . . . . 8 Chapter 2: Automated Multi-dimensional Phenotypic Pro¯ling 10 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Overview of method . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.2 An illustrative case: reconstructing the temporal order of the yeast log-to-stationary growth transition . . . . . . . . . . . . . . . . . . 15 2.2.3 Large-scale prediction of phenotype pro¯les . . . . . . . . . . . . . 17 2.2.4 Multi-dimensional pro¯ling of complex phenotypes . . . . . . . . . 19 2.2.5 Discovery of hidden confounding factors in microarray studies . . . 23 2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.4.1 Predicting phenotype pro¯les by constrained optimization . . . . . 26 2.4.2 Data collection and processing . . . . . . . . . . . . . . . . . . . . 29 2.4.3 Automatic processing of phenotype annotations . . . . . . . . . . . 29 2.4.4 Measuring phenotype annotation similarity . . . . . . . . . . . . . 30 2.4.5 Calculation of TF-IDF vectors of sample groups . . . . . . . . . . 31 v Chapter 3: A Stable Iterative Method for Re¯ning Discriminative Gene Clusters 33 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2.1 Simulated datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.2.2 Leukemia dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.2.3 The performance analysis on other real datasets . . . . . . . . . . 43 3.2.4 Comparingthesamplephenotypeclassi¯cationperformancetoother studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.4.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.4.2 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . . 56 Chapter 4: An integrative approach to characterize disease-speci¯c pathways and their coordination 58 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.2 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.2.1 Network properties of the cancer di®erential co-expression network 62 4.2.2 Identi¯cation of pathway modules speci¯c to cancer or cancer sub- types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2.3 Network modules in the breast cancer cluster . . . . . . . . . . . . 69 4.2.3.1 AtumorsuppressornetworkrelatedtoPDGFsuperfamily signaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.2.3.2 A network module related to in°ammatory response . . . 74 4.2.4 Identi¯cation of pathway coordination beyond co-expression . . . . 75 4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.4.2 Select di®erentially co-expressed gene pairs . . . . . . . . . . . . . 80 4.4.3 Identify conditionally activated network module candidates . . . . 81 4.4.4 Geneontologyfunctionandtranscriptionfactorenrichmentofmod- ules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.4.5 Identi¯cation of transcription factor binding . . . . . . . . . . . . . 82 References 82 vi List Of Tables 3.1 Signi¯cantly enriched biological processes . . . . . . . . . . . . . . . . . . 43 3.2 The performance of the algorithm for di®erent sample phenotype classi¯- cation studies.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.3 Comparison of our algorithm with others. . . . . . . . . . . . . . . . . . . 49 vii List Of Figures 1.1 Microarray technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1 The principles of PhenoPro¯ler. . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Consistency between predicted and true phenotype pro¯les in GDS18. . . 16 2.3 Multi-dimensional pro¯ling of the dataset GDS843. . . . . . . . . . . . . . 21 3.1 The performance analysis on simulated datasets. . . . . . . . . . . . . . . 38 3.2 Theanalysisofthethree-foldcrossvalidationperformanceofthealgorithm on the Leukemia dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.3 Average proportion of genes of the four biological processes during the re¯nement iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.4 The algorithm of selecting discriminative clusters.. . . . . . . . . . . . . . 54 3.5 Multiple objective optimization procedure for cluster elimination. . . . . . 56 4.1 Overview of network module ¯nding procedure. . . . . . . . . . . . . . . . 60 4.2 Node degree distribution on the di®erential co-expression networks. . . . . 64 4.3 The linear relationship between network size S and cluster diameter D. . 68 4.4 General cancer and solid tumor network modules. . . . . . . . . . . . . . . 70 4.5 Coordinated breast tumor network modules. . . . . . . . . . . . . . . . . . 72 viii Abstract The linking genotype to phenotype is the fundamental aim of modern genetics. We focus on study of links between gene expression data and phenotype data through integrative analysis. In this work, we propose three approaches. 1) The inherent complexity of phenotypes makes high-throughput phenotype pro¯l- ing a very di±cult and laborious process. We propose a method of automated multi- dimensional pro¯ling which uses gene expression similarity. Large-scale analysis of more than500geneexpressiondatasetsshowthatourmethodcanproviderobustpro¯lingthat reveals di®erent phenotypic aspects of samples. This pro¯ling technique is also capable of interpolation and extrapolation beyond the phenotype information given in training data. As such, it can be used in many applications, including facilitating experimental design and detecting confounding factors. 2) Phenotype association analysis problems are complicated by small sample size and high dimensionality. Consequently, phenotype-associated gene subsets obtained from trainingdataareverysensitivetoselectionoftrainingsamples, andtheconstructedsam- ple phenotype classi¯ers tend to have poor generalization properties. To eliminate these obstacles, we propose a novel approach that generates sequences of increasingly discrim- inative gene cluster combinations. Our experiments on both simulated and real datasets ix show that the resulting cluster combinations are robust to perturbation in training sam- ples and have good sample phenotype classi¯cation performance. 3)Manycomplexphenotypes,suchascancer,aretheproductofnotonlygeneexpres- sion, but also gene interaction. We propose an integrative approach to ¯nd gene network modules that activate under di®erent phenotype conditions and apply this approach to the study of cancer. Using our method on multiple cancer gene expression datasets, we discovered cancer subtype-speci¯c network modules, as well as the ways in which these modules coordinate. In particular, we detected a breast-cancer speci¯c tumor suppressor network module with a hub gene, PDGFRL, which may play an important role in this module. x Chapter 1 Introduction 1.1 Integrative analysis of gene expression and phenotype association With the advancement of the high-throughput genotyping technologies, large amounts of genotype data have been accumulated in recent years. Such data include gene expres- sion, single nucleotide polymorphism, copy number variation, DNA mythelation, histone acetylation, alternative splicing, and proteomics. Since the fundamental problem in mor- den genetics is the linking of genotype to phenotype. we are particularly interested in the associations between gene expression and phenotype in this dissertation. 1.2 Gene expression and phenotype data WiththecompletionoftheHumanGenomeProject, biologyresearchisenteringthepost genome era. Although biologists have collected a vast amount of DNA sequence data, thedetailsthatwouldexplainhowthesesequencesfunctionstillremainlargelyunknown. Genomes of even the simplest organisms are very complex. Key questions still confront 1 biologists, including 1) the functional roles of di®erent genes and the cellular processes in which they participate; 2) how genes are regulated, how genes and gene products interact, and how to identify interaction networks; and, ¯nally, 3) how gene expression levels di®er in various cell types and states, as well as how gene expression is changed by various disease pathologies and their treatments. Biology used to be a data-poor science. Nowadays, however, many of these questions can be addressed with the advent of more advanced techniques, such as gene expression. Therefore, investigators are now able to transform vast amounts of biological information into useful data. This makes it possible to study gene function globally, and a new ¯eld, functional genomics, emerges. Speci¯cally, functional genomics refers to the development and ap- plication of global (genome-wide or system-wide) experimental approaches to assess gene function by making use of the information and reagents provided by structural genomics. Gene function may be divided into gene expression, which refers to the process of tran- scribing a gene's DNA sequence into the RNA that serves as a template for protein production, and the level of gene expression, which indicates how active a gene is in cer- tain tissues, at certain times, or under certain experimental conditions. Therefore, the monitoring of gene function can provide an overall picture of the genes being studied, in- cludingtheirexpressionlevelandtheactivitiesofthecorrespondingproteinundercertain conditions. To accomplish this, functional genomics provides techniques characterized by high-throughputorlarge-scaleexperimentalmethodologiescombinedwithstatisticaland computational analysis of the results. Among the several methods that have been developed to understand the behavior of genes, microarray technology is particularly important. It is used to simultaneously 2 monitor the expression levels of genes by several orders of magnitude, and many steps are involved in this technology. First, complementary DNA (cDNA) molecules or oligos are printed onto slides as spots. Then, two kinds of dye-labeled samples, i.e., sample and control,arehybridized. Finally,thehybridizationisscannedandstoredasimages(Figure 1.1). Usingasuitableimageprocessingalgorithm, theseimagesarethenquanti¯edintoa setofexpressionvaluesrepresentingtheintensityofspots. Usually,thedyeintensitymay bebiasedbyfactors,suchasphysicalproperty,experimentalvariabilityinprobecoupling and processing procedures, and scanner settings. To minimize the undesirable e®ects caused by this biased dye intensity, normalization is done to balance dye intensities and make expression value comparable across experiments. Here the term comparable means that the di®erence of any measured expression value of a gene between two experiments should re°ect the di®erence of its true expression levels. A phenotype is any observable characteristic of an organism. Phenotypes result from the activities of an organism's genes as well as the in°uence of environmental factors and possible interactions between the two. Phenotypes are often described in natural languagesintheliterature. Theycansometimesbemeasuredascontinuousorcategorical values. While it has thus far been very di±cult to automatically extract phenotype information from the literature, the Uni¯ed Medical Language System (UMLS) has been developed in recent years to provide facilities for natural language processing. As such, UMLS is a compendium of many controlled vocabularies in the biomedical sciences. It provides a mapping structure among these vocabularies and thus gives researchers the ability to translate among the various terminology systems. It is also a comprehensive thesaurus and ontology of biomedical concepts and consists of the following components: 3 Figure 1.1 A diagram illustrating how gene expression is obtained from microarray experiments. Figure obtained from http://www.mun.ca 4 1) a Metathesaurus, the core database of the UMLS, which is a collection of concepts and terms from the various controlled vocabularies and their relationships; 2) a Semantic Network which is a set of categories and relationships that are used to classify and relate the entries in the Metathesaurus; and 3) a SPECIALIST Lexicon which is a database of lexicographicinformationforuseinnaturallanguageprocessing. Anumberofsupporting software tools are also provided for this system. 1.3 Integrative analysis of gene-phenotype association In recent years, the accumulation of gene expression data has been a rapid process. For example, the two largest public gene expression repositories, Gene Expression Ominibus (GEO)andArrayExpress,alreadycontainmorethan500,000samples,measuringgenome wide gene expression levels of various organisms under various phenotypic conditions. Basically, these gene expression data repositories contain two types of information: gene expression and sample phenotype information. Expression information based on microarrays are measured in continuous values. On the other hand, sample phenotype information is often described in natural language, or it can be manually structured and recorded into a database. We develop methods of integrative analysis which combine these two types of information obtained from di®erent experimental sources, therefore enhancing con¯dence and providing a better understanding of the association between genotype and phenotype. 5 1.4 Summary of our work As indicated above, we propose three approaches for the above analysis. 1) We propose theconceptandamethodofAutomatedMulti-dimensionalPhenotypicPro¯lingbyusing genotype information as a bridge. 2) We propose a method that can robustly search for combination of gene sets that provides true discrimination and good prediction of sample phenotype classes. 3) We propose an integrative approach to characterize disease-speci¯c pathwaysandtheircoordinationusinggenecoexpressionnetworkanalysis. The¯rstwork primarily focus on phenotype prediction. Both the second and third work above involve studying di®erent aspects of gene phenotype associations. The second work focus on selectingphenotypeassociatedcombinationsofgenesets,whilethethirdworkparticularly focus on genetic interaction mechanism that is associated to phenotype di®erence. 1.4.1 Automated multi-dimensional phenotype pro¯ling Phenotypes are complex and di±cult to quantify in a high-throughput fashion. The lack ofcomprehensivephenotypedatacanpreventordistortgenotype-phenotypemapping. In Chapter 2 we describe PhenoPro¯ler, a novel computational method that enables in sili- con phenotype pro¯ling. Drawing on the principle that similar gene expression patterns are likely to be associated with similar phenotype patterns, PhenoPro¯ler supplements the missing quantitative phenotype information for a given microarray dataset based on other well-characterized microarray datasets. We applied our method to 587 human microarray datasets covering more than 14,000 samples, and con¯rmed that the predicted phenotype pro¯les are highly consistent with 6 true phenotype descriptions. PhenoPro¯ler o®ers several unique capabilities: (1) au- tomated, multi-dimensional phenotype pro¯ling, facilitating the analysis and treatment design of complex diseases; (2) the extrapolation of phenotype pro¯les beyond provided classes; and (3) the detection of confounding phenotype factors that would otherwise bias biological inferences. Finally, as no direct comparisons are made between gene ex- pression values from di®erent datasets, the method can utilize the entire body of cross- platform microarray data. This work has produced a compendium of novel phenotype pro¯les for the NCBI GEO datasets, which can facilitate an unbiased understanding of transcriptome-phenome mapping. By increasing the variaty and the quality of pheno- types to be pro¯led, the continued accumulation of microarray data will further increase the power of PhenoPro¯ler. 1.4.2 Stable re¯nement of discriminative gene clusters Gene expression data are often used to identify the phenotype associated genes that are di®erentially expressed between samples of two phenotype classes. On the other hand, since microarray datasets contain only a small number of samples and a large number of genes, it is usually desirable to identify small gene subsets with distinct patterns between sample phenotype classes. Such gene subsets are highly discriminative in phenotype clas- si¯cation because of their tightly coupling features. Unfortunately, gene subsets obtained from training data are often very sensitive to the selection of training samples, and the classi¯ers constructed from such gene subsets usually tend to have poor generalization properties on the test samples as a result of the over¯tting problem. 7 In Chapter 3, we propose a novel approach [XZZ08] to correct these problems by combining supervised and unsupervised learning techniques to generate increasingly dis- criminative gene cluster combinations in an iterative manner. Compared with existing approaches, our experiments on both simulated and real datasets show that our method can produce a series of gene cluster combinations that are robust to training sample per- turbation and have good sample phenotype classi¯cation performance. This backward approach for re¯ning a series of highly discriminative gene cluster combinations for the purpose of sample phenotype classi¯cation has proven to be very consistent and stable when applied to various types of training data. In addition, these combinations can be further used to study the underlying genetic interactions that lead to phenotype di®er- ence,andextendedtointegrategeneclustersobtainedfrommultipledatasetsconsistingof similar kinds of phenotype classi¯cation studies. Thus, the con¯dence that these clusters are truly discriminative can be further enhanced. 1.4.3 Disease-speci¯c pathway identi¯cation The most common application of microarray technology in disease research is to identify genes di®erentially expressed in disease versus normal tissues. However, in the case of complex diseases, it is well known that, phenotypes are determined by the underlying structure of genetic networks, not by individual genes. Thus in many situations, it is the interaction of many genes that plays an important role in causing phenotypic variations. In Chapter 4, using cancer as an example, we develop a graph-based method [XKNI + 08] tointegratemultiplemicroarraydatasetstodiscoverdisease-relatedco-expressionnetwork modules. Weproposeanunsupervisedmethodthattakesintoaccountbothco-expression 8 dynamics and network topological information to simultaneously infer network modules and phenotype conditions in which they are activated or de-activated. Usingourmethod, wehavediscoverednetworkmodulesspeci¯ctocancerorsubtypes of cancers. Many of these modules are consistent with or supported by their functional annotations or their previously known involvement in cancer. In particular, we identi- ¯ed a module that is predominately activated in breast cancer and is involved in tumor suppression. While individual components of this module have been suggested to be as- sociated with tumor suppression, their coordinated function has never been elucidated. Here by adopting a network perspective, we have identi¯ed their interrelationships and, particularly, a hub gene PDGFRL that may play an important role in this tumor sup- pressornetwork. Consequently, byusinganetwork-basedapproach, ourmethodprovides new insights into the complex cellular mechanisms that characterize cancer and cancer subtypes. By incorporating co-expression dynamics information, our approach not only extracts more functionally homogeneous modules than those based solely on network topology, but also reveals pathway coordination beyond co-expression. 9 Chapter 2 Automated Multi-dimensional Phenotypic Pro¯ling 2.1 Introduction The fundamental aim of modern genetics is linking genotype to phenotype. With the rapid accumulation of genomics data, the lack of phenotype data has become the bot- tleneck of this process [FS03]. Phenotyping, especially for human subjects, is a labo- rious process [LL07]. Moreover, researchers often gloss over the complexity of human phenotypes by reporting only those traits speci¯cally relevant to their studies. For ex- ample, a given dataset may provide survival information but not the patients' ages. In- ferences derived from such data could be biased or even invalidated by undocumented or poorly documented phenotypic traits. Furthermore, most available phenotype character- izations are qualitative (categorical) rather than quantitative (continuous). This practice is problematic for two reasons: the boundaries between categories are often vague or ar- bitrary [OHB08], and any phenotypic information distinguishing data within a category is lost. 10 In this paper, we address the above issues by developing PhenoPro¯ler, a computa- tional framework for predicting the quantitative phenotype information missing from a genomic dataset. In particular, this method associates each sample of a given dataset with the relative intensity of a speci¯c phenotype trait. The quantitative measures of samples across the whole dataset is referred to as a phenotype pro¯le (PP). Examples includethebodyweightsofindividuals, degreesofmalignancyintumorsamples, andthe quantitative responses of patients to drug treatments. The principle of PhenoPro¯ler is that similar genomic patterns are likely to be asso- ciated with similar phenotypic patterns [BvD04]. Thus, we can supplement the (incom- plete) phenotypic information in a given genomics dataset with traits recorded in other well-characterized datasets. In particular, we focus on the vast accumulated microarray data. The NCBI Gene Expression Omnibus (GEO) [BTW + 07], for example, currently contains more than 2000 human microarray datasets that systematically document the transcriptome basis of phenotypes as diverse as heart diseases, mental illness, infectious diseases, and a variety of cancers. The intuition behind our method is as follows. Given a training dataset with known sample description of a phenotype P, for each gene we can derive an association between itsexpressionpro¯leandthisphenotypeP. Wedenotethosegenesassignaturegenes,the expressions of which are strongly associated with the phenotype in the training dataset. Given a new microarray dataset that is known to be related to the phenotype P, but the phenotype description of its individual samples are unknown, we aim to estimate the PP by constructing a sample pro¯le as a real-valued vector that is most similar to the 11 expression pro¯les of those signature genes in the new dataset. Figure 2.1 illustrates this approach. Since the information we borrow from the training dataset is only the association betweenthe gene expressions and sample phenotypes, wedo not directly compare the ex- pression values between the training and prediction datasets, thus bypassing the data in- compatibility problem between cross-platform and cross-laboratory microarray datasets. PhenoPro¯ler can therefore use as many as possible microarray datasets in the public repositories in the training stage, and construct a new dataset's pro¯les for a wide range of phenotypes. We applied our method to 587 human microarray datasets, covering more than four- teen thousand microarray samples. The predicted phenotype pro¯les were highly con- sistent with known phenotype descriptions. We showed that PhenoPro¯ler can robustly providemulti-dimensionalcharacterizationofthephenotypesmissingfromadataset,and can facilitate the discovery of confounding factors for the transcriptome-phenotype map- ping. The comprehensive phenotypic data generated by this method will vastly increase the value of published and forthcoming genomics data. 2.2 Results 2.2.1 Overview of method As illustrated in Figure 2.1, PhenoPro¯ler consists of two steps. (1) Given a microarray dataset D 1 whose samples have known descriptions of the phenotype P, for each gene i 12 Figure 2.1 The principles of PhenoPro¯ler. Each row of an expression matrix corre- sponds to a gene, and each column corresponds to a sample. The magnitude of gene expression is indicated by a color scale running from green (low) to red (high). From the training dataset (left), we obtain for each gene i a coe±cient w i describing the degree of association between its expression level and the phenotype. Here, genes 1 and 2 have high positive coe±cients. Gene 3 shows no clear association with the phenotype, so its coe±cient is close to zero. Gene 4 has a high negative coe±cient. Therefore, genes 1,2 and 4 are signature genes of the phenotype. Given a new dataset, for each sample we aim to estimate the relative intensity of the association between this sample and the phenotype such that the derived intensity values of all sample are most similar to the ex- pression values of signature genes. We term such a sample pro¯le as a phenotype pro¯le. The phenotype pro¯le vector is depicted by the grey cells, where brighter colors indicate estimated tighter association with the phenotype. After sorting the samples based on the phenotypepro¯le, itcanbeseen thattheexpression pro¯lesofgenes 1and2 arestrongly correlated with phenotype pro¯le, while that of gene 4 is anti-correlated. 13 we calculate a coe±cient w i that indicates the degree of association between its expres- sion pro¯le and the phenotype P. The appropriate way to calculate w i depends on the structure of the phenotype descriptions. If the phenotype description is binary, i.e. the dataset compares two phenotype groups, we can simply use the two-sample t-statistic as w i . If the phenotype description is continuous, we can use the correlation between the phenotype values and the gene expression values as w i . Genes with a large magnitude of w i are termed the signature genes of the phenotype P. (2) In order to predict the phenotype pro¯le (PP) of a new microarray dataset D 2 , we use the following constrained optimization approach. For a dataset with m individual samples, the PP is de¯ned as a normalized, real-valued vector of m values (p 1 ;p 2 ;:::;p m ) T (denoted p). We also de¯ne the normalized expression vector of the gene i as (e i1 ;e i2 ;:::;e im ) T , denoted e i ; and we denote the expression matrix containing all such expression vectors as E. Given a set of coe±cients w i determined from the training data, the objective is to ¯nd a pro¯lep that minimizes the weighted least-squares di®erence min P i jw i j P j (sgn(w i )e ij ¡p j ) 2 . (The function sgn is +1 or¡1 depending on the sign of its argument; we want the phenotype pro¯le p to be close to e i when w i is positive, and close to¡e i when w i is negative.) A series of matrix computations (see the Methods section) yields the optimal solution ^ p, which is essentially the normalized weighted (byw i ) sum of gene expression values across samples. To assess whether the predicted pro¯le ^ p captures the expression trend of those sig- nature genes, we calculate an association score ^ c, de¯ned as the Pearson's correlation 14 between the coe±cientsw andE^ p (detailed explanation in Methods). To assess the sta- tistical signi¯cance of ^ p, we compare the ^ c to those calculated using the same expression matrix E and 1000 random permutations of coe±cients w. 2.2.2 An illustrative case: reconstructing the temporal order of the yeast log-to-stationary growth transition As an illustrative example, consider the two microarray datasets GDS18 and GDS283. Bothstudythelog-to-stationarygrowthtransitionofyeast,butwithdi®erentmicroarray platforms. Both datasets measure gene expression starting at the logarithmic phase and extending through the stationary phase. Here our goal is to predict changes in yeast phenotype from log to stationary transition, responding to the depletion of nutrients. Naturally, the temporal order of the samples serves as a good means of validating the prediction. Usingonedatasetfortraining,wecomputedtheSpearman'srankcorrelationbetween individualgeneexpressionpro¯lesandthetemporalorderofthesamples. Thesestatistics areusedasthetrainingcoe±cients. Intheotherdataset,wethenpredictedthephenotype responsepro¯lebasedongeneexpressions,hopingtorecoverthecorrecttemporalorderof samples. In both cases, the predicted PP was highly consistent with the actual sequence of samples (Spearman's rank correlation were 0.83 and 0.79, depending on which dataset was used for training). Figure 2.2 shows the predicted and the original sample order of dataset GDS18. Two subgroups are visible in the predicted pro¯le, accurately re°ecting the logarithmic and stationary phases. The sole exception is at the transition between the two phases. 15 −1.5 −1.0 −0.5 0.0 0.5 1.0 2.0 2.5 3.0 3.5 4.0 Predicted Phenotype Profile Original Temporal Order Figure2.2 Consistencybetweenpredictedandtruephenotypepro¯lesinGDS18. Using GDS283asthetrainingdataset,thepredictedphenotypepro¯leofGDS18closelymatches the original temporal order of the samples. The original temporal order is measured as the logarithm (base 10) of minutes. 16 Intriguingly, experiment GDS283 stopped taking measurements only 17 hours after theyeastenteredthestationaryphase. ExperimentGDS18,ontheotherhand,continued measurements for another 12 days. It is remarkable that a phenotype signature derived from GDS283 can accurately sort the phenotype progression of GDS18. This result demonstrates that the essential physiological changes occurring within and between the logarithmic and stationary growth phases can be extrapolated as well as interpolated. 2.2.3 Large-scale prediction of phenotype pro¯les To test the general applicability of our method, we performed a large-scale analysis of 587 human microarray datasets (see the Methods section for details on data collection and processing). Datasets containing at least two disjoint sample groups, representing a phenotype and its baseline (P and P 0 ), each with at least 10 samples, were selected as training datasets (D 1 ). If a dataset contains n sample groups, we can generate ¡ n 2 ¢ distinct training datasets. Since the phenotype values in these datasets are categorical, we use the two-sample t-statistic as coe±cients w. By setting the threshold p-value for the predicted PP to 0.001 and the association score ^ c¸0:25, a total of 37,852 novel PPs were associated to the 587 datasets. To validate the method, for each training dataset D 1 we also need a testing dataset D 2 that contains the sample descriptions on exactly the same phenotypes P and P 0 . Among all 587 datasets, we only identi¯ed 4 training-testing dataset pairs meeting this criterion, in which each of the testing datasets also contains two sample groups of P and P 0 . To assess whether the predicted phenotype pro¯le is consistent with the known distribution of phenotypes P and P 0 in the testing dataset, we used the Wilcoxon rank 17 sum test. Speci¯cally, in the test dataset, the two sample groups' (P and P 0 ) predicted phenotype values are compared using Wilcoxon rank sum test. A small Wilcoxon p- valueindicatesthatthereisasigni¯cantdi®erencebetweenthedistributionsofpredicted phenotype values for the two groups, therefore the predicted pro¯le is consistent with known phenotype information. Among the 4 training-testing pairs, all predicted PPs were highly consistent with the known phenotype groups (Wilcoxon p-value <10 ¡4 ). In order to obtain a general assessment using more validation data, we relaxed the requirement that the description of the testing dataset exactly matches the training data phenotypes. In fact, if the phenotypes of a given dataset were even moderately similar to the training phenotype, the predicted pro¯les were found to agree well with known phenotype groups in the testing dataset. This implies a strong interdepedence among related phenotypes. We quantify the similarity between the training and testing phe- notype with two measures: (1) the percentage ° of Uni¯ed Medical Language System (UMLS) concepts of the merged sample group descriptions ofD 1 shared with the dataset descriptionofD 2 ;and(2)thesimilaritybetweenthedescriptionsofcorrespondingsample groups in D 1 and D 2 , denoted as s. The latter is de¯ned as the cosine of the angle be- tween two term frequency-inverse document frequency (tf-idf) vectors of mapped UMLS terms (See the Methods section for details). Following these measurements, we iden- ti¯ed 32 training-testing dataset pairs with similarity thresholds s > 0:4 and ° > 0:6. Among these, 81% of predicted phenotype pro¯les were consistent with prior phenotype descriptions (Wilcoxion test p-value < 0:05). This result highlights the e®ectiveness of our method in exploiting the interdependence of similar phenotypes. 18 We further studied the robustness of our method against the size of the training dataset. We randomly selected a training dataset and a test dataset, and then calculated the correlation between the PP constructed with the original training dataset and that with certain amount of training samples randomly removed. Repeating this test 10,000 times with 10% (and 20%) sample removal produced an average correlation of 0.98 (and 0.95) between the resulting PPs and those without any samples removed. Even for those training datasets with a small size of 10 samples in each of the two phenotype groups, the obtained PP correlations were still greater than 0.9 for both 10% and 20% sample removal, demonstrating the robustness of our method. 2.2.4 Multi-dimensional pro¯ling of complex phenotypes As previously mentioned, a total of 37,852 PPs were derived and assigned to the 587 datasets. On average, each dataset is assigned 65 PPs. In some cases, related train- ing datasets generated highly correlated PPs, further enhancing our con¯dence in the prediction. Two examples are described below. Dataset GDS2855 studies various forms of muscular dystrophy. Three training sets (GDS609, 610, and 612) generated highly correlated PPs (average correlation 0.88) for GDS2855. All three training datasets describe the di®erence between Duchenne mus- cular dystrophy and normal muscle tissues, although they were measured with di®erent platform technologies. Furthermore, all 3 predicted PPs were highly consistent with the original sample description of GDS2855 (Wilcoxion test p-value <10 ¡6 ). Dataset GDS1962 studies gliomas of di®erent grades, and was assigned four highly correlated PPs (average correlation 0.9) by datasets GDS1975, 1976, 1815, and 1816. 19 All four training datasets focused on comparing grade III and grade IV glioma samples. Remarkably,thepredictedPPsnotonlydidagoodjobofseparatinggradeIIIfromgrade IV samples in the testing dataset, but also separated grade II from grade III samples. In addition,theseparationsfollowedtheorderoftumorgrades. Thisexampleshowsthatour method captures the essential di®erence between high- and low-grade tumors, and thus can be extrapolated to tumors of grades beyond those represented in the training data. This ability to extrapolate from the training dataset represents a signi¯cant advantage over traditional classi¯cation methods. A testing dataset is often (78% of cases) assigned multiple uncorrelated PPs (correla- tion < 0.1) describing di®erent properties of a complex phenotype. For example, dataset GDS843 contains 49 samples comparing patients with abnormal karyotypes to patients withnormalkaryotypestostudyadultacutemyeloidleukemia(AML).Thesampleswere collected from peripheral blood or bone marrow. Its predicted phenotypes include three uncorrelated pro¯les (see Figure 2.3), which are detailed below. 1. Training dataset GDS842 also studied abnormal versus normal karyotypes in adult AML patients. The derived phenotype pro¯le is consistent with known sample description of this phenotype in GDS843 (Wilcoxon p-value 0.04), thus validating our method. 2. Training dataset GDS2118 compared individuals with refractory anemia to normal individuals. The PP trained by this comparison is highly correlated (correlation > 0.9) with two other PPs that also come from training datasets that studied re- fractory anemia. In fact, the recently proposed WHO classi¯cation of hematologic 20 3 TUDIESOFADULTACUT EM Y ELOIDLEUK EMIA , EUK EMIA ! NEMIA $ R UG2 ESP ONSE 2 ESP ONSIV ET OIMA TINIBV S UNR ESP ONSIV E 0 R OFILE 2 EFR AC T OR YANEMIAV S NOR MAL 3 AMPLES 0 R OFILE . OR MALK AR Y OT YP EV S AB ER R AN TK AR Y OT YP E 3 AMPLES 0 R OFILE 3 AMPLES | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | & !.# ' & !.#$ & !.# ! 04+ -!0+ 4)% LCN2 CBS FLT3 Figure2.3 Multi-dimensionalpro¯lingofthedatasetGDS843thatcomparespatientsof adult AML with abnormal karyotypes to those with normal karyotypes. Three di®erent training datasets produced mutually uncorrelated phenotype pro¯les (black curves) that could be assigned to GDS843. The three most signi¯cantly correlated expression pro¯les of genes known to be related to the respectively pro¯led phenotypes are superposed in the three primary colors. For clarity, the samples are ordered according to the ¯rst PP, andtheexpressionpro¯lesofnegativelycorrelatedexpressionpro¯leshavebeenreversed. 21 malignancies merged the disease refractory anemia with excess blasts in transfor- mation (RAEB-T) into the category AML. However, this new disease classi¯cation is controversial. Although RAEB-T and AML share similar clinical parameters, a study pointed out that their biological bases are di®erent (e.g. RAEB-T is distin- guished from AML by a signi¯cant increase in apoptosis), and it suggested that RAEB-T should be regarded as a distinct disease entity [ABO + 00]. Therefore, the derived PP may uncover hidden patient information and possibly help to dif- ferentiate RAEB-T from AML, which could further lead to improved treatment design. 3. Training set GDS1221 studies the patient response to the drug Imatinib. Imatinib was designed to treat chronic myeloid leukemia by reducing the tyrosine kinase activity of the well-known bcr-abl fusion gene. Our phenotype pro¯le could there- fore be used to identify patients that would be more likely to respond to Imatinib treatment. In summary, while the ¯rst PP serves as an internal validation of the method, the other two PPs provide novel insights into the pathologic and therapeutic properties of samplephenotypesinthedatasetGDS843. Thespeci¯cphenotypepropertiesrepresented by the above PPs can be further con¯rmed by examining genes whose expressions are signi¯cantly correlated (p-value < 0.001) with these pro¯les. For example, the AML PP (number 1 above), has 10 signi¯cantly correlated genes that are known to be associated with the UMLS concept \Leukemia, Myelocytic, Acute." A particularly interesting gene is FLT3. A study suggested that in patients with karyotype alterations, a recipricol 22 translocation was not su±cient to cause acute promyelocytic leukemia, and that an ad- ditional mutation in FTL3 may be required [DLCBPS + 08]. For the refractory anemia activationpro¯le,thereare12signi¯cantlycorrelatedgenesknowntobeinvolvedin\Ane- mia." These include three Fanconi Anemia genes (FANCA, FANCD2, FANCG) as well asTGFB1, whichmaya®ecttheprogressionofrefractoryanemiaspeci¯cally [BBG + 05]. AmongthegenescorrelatedwiththeImatinibresponsepro¯le,threearetyrosinekinases, which is consistent with the target of Imatinib [DD03]. This example demonstrates the power and speci¯city of our multi-dimensional phenotype pro¯ling approach by utilizing the large number of microarray datasets. 2.2.5 Discovery of hidden confounding factors in microarray studies Due to the scarcity of phenotype information in many microarray datasets, confounding phenotype variables may not be well documented. Thus, caution should be exercised in deriving inferences from microarray datasets. The following cases provide representative scenarios. DatasetGDS1673examinesnormallungtissuefrom23donors,includingsmokingand non-smokingindividuals. Interestingly, wefoundthatapredictedPPtrainedonmalevs. female skeletal muscle samples (dataset GDS914) was able to separate the smoking and non-smoking samples of GDS1673 (Wilcoxon p-value 0.0002). After obtaining additional phenotype information on the GDS1673 subjects, it turns out that among non-smokers, whichmadeupalmosttwo-thirdsofthesample,femalesoutnumberedmalesbymorethan two to one, while among smokers the numbers of the two genders were approximately equal. Thus,simplycomparingtheexpressionpro¯lesofthesmokingversusnon-smoking 23 groups would not derive the signature of smoking, but rather the mixed signatures of smoking and gender. As another example, the goal of the GDS1887 study was to build a prognosis model for rectal cancer cells responding to radio therapy. According to the GEO annotation, its 46 samples had been separated into training and test groups for model construction and validation. Surprisingly, we found that the original training and test samples from this study could be well separated (average Wilcoxon p-value 0.001) by four highly correlated PPs (average correlation 0.95). All of those four PPs come from training datasets that compare the cancer to normal tissue or that compare cancers of di®erent malignancies. Thisstronglysuggestthatthereweresystematicdi®erencesincancermalignancybetween the training and testing samples, even though they were supposed to be generated by random partition. Any such sampling bias would negatively impact the accuracy of the prognosis model. Of course, sampling bias can often be traced to the very limited availability of phe- notype data in the ¯rst place. Our compendium of predicted phenotype pro¯les (http://zhoulab.usc.edu/PhenoPro¯ler) provides a comprehensive description of a large proportion of the datasets in the GEO database. This knowledge can facilitate an un- biased understanding of the transcriptome-phenome mapping. It can also serve as the starting point for the identi¯cation of molecular mechanisms shared by di®erent diseases and phenotypes. 24 2.3 Discussion Phenotypes are complex, and di±cult to quantify in a high-throughput fashion. The lack of comprehensive phenotype data can prevent or distort genotype-phenotype mapping. This paper describes a novel approach to perform in silicon phenotype pro¯ling. Our method provides numerous advantages which we outline here. (1) For most datasets we were able to predict multiple phenotype pro¯les, which could help researchers to reveal di®erent aspects of complex diseases and facilitate treatment design. (2) We can provide a quantitative phenotype description of the sample characteristics. Although \categor- ical" phenotype description is prevalent, in reality phenotypes constitute a continuous spectrum. (3) Our method can extrapolate the pro¯ling to classes beyond those repre- sented in the training data, as illustrated in the glioma case study. This is an advantage over traditional classi¯cation methods. (4) PhenoPro¯ler avoids direct comparison of gene expression values from di®erent datasets, and thus can utilize almost all available microarray data regardless of platform or laboratory. In contrast, traditional regression methods cannot be directly applied to microarray datasets from di®erent platforms. The continued accumulation of microarray data will further increase the power of PhenoPro¯ler intwoaspects: thevarietyofphenotypestobepro¯led,andthecon¯dence of its predictions. The latter bene¯t derives from having several mutually correlated PPs from similar datasets. The principles of our method can be easily applied to other types of genomics data (e.g. proteomics or metabolomics) as they become increasingly available. The present work focuses on linear gene-phenotype associations, but more complex relationships can be devised depending on the data characteristics. 25 Our univariate method for selecting the weights from the training sample is only one ofmanypossibleapproaches. Forexample, onecouldconsiderconstructingweightsusing a multivariate procedure that takes into account correlations among the gene expression levels, such as Fisher's linear discriminant procedure (i.e Discriminant Function Analysis for two groups). However, such an approach requires estimating a covariance matrix for the gene expressions which is not practical given that there are thousands of genes and a limited number of samples (typically on the order of 10) per dataset. In fact Fan and Fan [FF08] prove that, when the dimension of the feature space is high, a univariate two-sample t-test procedure, similar to our approach, is often superior to a multivariate method. Alternatively, when the important phenotype information can be characterized usingasmallnumberoflinearcombinationsofthegenes, dimensionreductiontechniques likeNonnegativeMatrixFactorization [BTGM04,TSE + 07]mayalsoproducemeaningful phenotype predictions. 2.4 Methods 2.4.1 Predicting phenotype pro¯les by constrained optimization From a training microarray dataset, we derive a vector w = (w 1 ;w 2 ;:::;w n ) T that con- tains the gene-phenotype association coe±cients of n genes. Given a new dataset with n genes and m samples, and the normalized gene expression matrix E = (e ij ) n£m , we aim toobtaintheoptimalPhenotypePro¯le(PP) ofthemsamples,wherePPisanormalized, 26 real-valued vector p = (p 1 ;p 2 ;:::;p m ) T that show high similarity to the expression pro- ¯les of those genes that have high magnitude of gene-phenotype association coe±cients w (termed signature genes) Q 1 : min p2R m d(E;w;p)= X i jw i j X j (sgn(w i )e ij ¡p j ) 2 subject to X j p j =0 1 m¡1 X j p 2 j =1 Let b = (b 1 ;b 2 ;:::;b m ) T be the weighted sum of gene expression values for each sample, b j = P i w i e ij . The following theorem provides the solution to the minimization problem. Theorem: The solution ^ p to problem Q 1 is a vector that is the normalized form of b. That is, ^ p= b¡b ¾(b) where b and ¾(b) are the mean and standard deviation of b respectively. Proof: By expanding the function d, we have d(E;w;p)= X i jw i j X j e 2 ij + X i jw i j X j p 2 j ¡2 X i w i X j e ij p j 27 Since p is normalized and E and w are ¯xed, the ¯rst two terms are ¯xed. So the minimization problem Q 1 can be simpli¯ed to an equivalent maximization problem: Q 2 : max p2R m c(E;w;p)=w T Ep subject to p T 1=0 p T p=m¡1 where 1 is the vector whose elements are all 1. Let b=E T w. Let the Lagrangian function for Q 2 be L(p;¸ 1 ;¸ 2 )=b T p+¸ 1 p T 1+¸ 2 (p T p¡(m¡1)) where ¸ 1 and ¸ 2 are Lagrangian multipliers. According to the Karush-Kuhn-Tuker con- ditions [BSS93] (as the functions b T p, p T 1, and p T p are all convex), the solution to Equation 2.1 contains the global optimum of Q 2 , rL(p;¸ 1 ;¸ 2 )=0 (2.1) Equation2.1resultsintwosolutions: p=§ b¡b ¾(b) . SinceQ 2 isamaximizationproblem, it is easy to show that the solution of Q 2 is ^ p= b¡b ¾(b) , and so is the solution of Q 1 . ^ p is regarded as the PP of the new dataset because among all vectors inR n , ^ p is the one that most resembles the normalized expression pro¯les of the signature genes that were de¯ned by the training data. 28 We calculate an association score ^ c as the Pearson correlation between w and E^ p. The score is derived from the maximization problem Q 2 . E^ p provides the association between expression pro¯les and the predicted phenotype pro¯le in the testing dataset. Thus, higher ^ c indicates higher consistency of gene-phenotype associations derived from the training and testing datasets. 2.4.2 Data collection and processing We collected 587 human microarray datasets, each containing at least 5 samples, from the NCBI Gene Expression Omnibus (GEO) [BTW + 07]. For data generated with the A®ymetrix platforms, we increased any values less than 10 to 10 and performed a log transform of the gene expression values. For genes with multiple probesets present, the expression values of those probesets were averaged. We then normalized each dataset by converting the expression values of each gene to Z-scores (zero mean and unit variance). The 587 datasets yielded 537 training datasets. A training dataset contains two disjoint sample groups, each of which contains at least 10 samples, representing a phenotype and its baseline. When performing PP prediction, we discard those training-testing dataset pairs that share fewer than 100 genes in common. 2.4.3 Automatic processing of phenotype annotations In GEO, a dataset is usually annotated by a short description paragraph; a sample group is annotated by a word or a short phrase; and a sample is usually annotated by one sentence. To systematically categorize the phenotype information associated with each microarray dataset, we used the Uni¯ed Medical Language System (UMLS) [Bod,BK06]. 29 We mapped the dataset description, sample group descriptions, and sample descriptions onto UMLS concepts via the MetaMap Transfer program [Aro01]. To reduce noise we focused on disease-related concepts, including the MeSH vocabulary and the semantic types \Pathologic Function", \Injury or Poisoning", \Anatomical Abnormality", \Body Part, Organ, or Organ Component", \Tissue", and \Cell". In general, the higher a concept is on the UMLS hierarchy, the broader is the concept. Disease concepts at the ¯ne granularity level may be associated with more clinical signi¯cance. In order to infer higher-orderlinksbetweendatasets,alloftheancestorconceptsofmappedconceptswere included. 2.4.4 Measuring phenotype annotation similarity We measure phenotype similarity between sample groups (the two groups of a training dataset and the two groups of a testing dataset) by the following procedure. 1) For each group, we map its title and member sample descriptions onto UMLS concepts. 2) We then construct a term frequency-inverse document frequency (tf-idf) vector [SB88] for each sample group. 3) Suppose that U 11 and U 12 are two tf-idf vectors corresponding to the two sample groups in the training dataset, and U 21 and U 22 are tf-idf vectors correspondingtothetwosamplegroupsinthetestingdataset. Thesimilaritybetweenthe sample groups is then calculated as max(< U 11 ;U 21 > + < U 12 ;U 22 >;< U 11 ;U 22 > + <U 12 ;U 21 >), where <a;b > denotes the cosine similarity [LKS + 07] of two vectors a and b, calculated as a normalized dot product < a;b >= a¢b kakkbk . Essentially, this measure identi¯es the best match between the sample groups in the training dataset and 30 testingdatasetwhiletakingintoaccountthepossibilitythatthegroupscouldbematched in reverse order. 2.4.5 Calculation of TF-IDF vectors of sample groups Notethatasamplegroupiscalled\subset"inNCBIGEO.Annotationofasamplegroup includes UMLS concepts, mapped from both (1) its text title and (2) all its samples' text description. Inourexperiment,eachsamplegroupisviewedasdocumentandeachUMLS concept is viewed as a term. Thus, given 587 human microarray datasets, we collected 4065 documents (sample groups) and 1277 terms (UMLS concepts). In NCBI GEO annotation of sample groups, its title and each samples' description share almost the same set of UMLS concepts. Our practical experiments showed that, in suchcomplicatedannotationenvironment,thefrequencyofaUMLSconceptsinasample group does not indicate the importance of the UMLS concept. So we used only 0 or 1 to represent the term frequency of each UMLS concept. Then IDF of each term (UMLS concept) is counted based on the binary TF vector of each document (sample group). Details of how to get TF-IDF is presented in the following formulas: TF term, document = 8 > > < > > : 1; if this term occurs in the document; 0; otherwise. (2.2) IDF term = log µ number of all documents number of documents this term occurs ¶ (2.3) TFIDF term, document = TF term, document £IDF term (2.4) 31 Inpracticalexperiments,wenoticedthatalargeportionofUMLSconceptsareshared by two sample groups in the same dataset (e.g., U 11 and U 12 from the training dataset). Ideally, we expect that annotation of two sample groups from the same dataset should re°ect their phenotype di®erences, not commonness. Thus, we mask by zero in TF-IDF vectors, the common UMLS concepts shared by the same-dataset sample groups, e.g, U 11 and U 12 (or U 21 and U 22 from testing dataset), when computing annotation similairty. 32 Chapter 3 A Stable Iterative Method for Re¯ning Discriminative Gene Clusters 3.1 Introduction Microarray has become an important tool for identifying genes that discriminate sample phenotypeclassesbecauseofitspowerofmonitoringtheexpressionlevelsofthousandsof genes in a single experiment. Finding discriminative genes with gene expression data is actually the feature selection problem in classi¯cation theory. From the machine learning point of view, it is critical since the gene expression datasets usually contain a small number of experiments (called samples) and a large number of genes (called features) in each experiment. The selected highly discriminative genes after ¯ltering out those non-representative genes which may dilute the pattern in classi¯cation computation can be further studied for the investigation on the biological mechanisms that are responsible for sample phenotype class distinction. A number of e®orts have been put in searching e®ective gene selection methods (For example[GST + 99,GWBV02,XFZ01,AM02,FSMJ03,ZLS + 06]). Duetothesmall-sample 33 size and high-dimension properties of the sample phenotype classi¯cation problem, it is notdi±cultto¯ndafeaturesubsetthatcanperfectlydiscriminateallthesamples[MR03]. In fact, theoretical study in [Cov65] showed that even for the non-informative, randomly generated dataset, the expected size of a feature subset that can linearly discriminate all the n samples is just (n+1)=2. In microarray data analysis, there can be a large number of highly discriminative subsets containing only a couple of genes; and each individual gene in such a subset is not necessarily highly discriminative. For example, we observed by exhaustive search that there are as many as 10,173 perfect 3-gene subsets for sample phenotype classi¯cation with the weighted voting method proposed by Golub et al and with their proposed training-test split [GST + 99]; and these gene subsets cover 3,337 genes (93.4% of all the 3,571 genes in the datasets after preprocessing). This observation suggests that a method of ¯nding a highly discriminative compact gene subset is not enough. Thevariabilityofthesubsetsfoundbysuchamethodlikelyhindersthediscovery of real interaction among the genes given that the method is usually sensitive to both the choice of samples and noise in the microarray data. The fundamental limit and challenges mentioned above motivates us to design more robustmethodsbytakingintoaccounttheexpressionsimilarityinformationamonggenes. In this paper, we identify a series of discriminative gene clusters by running clustering and feature selection processes iteratively, where the centroids of the clusters are used to form predictors. This work also shows that the predictor constructed in this way is more stableandlesssensitivetothechoiceoftrainingsamples. Becausebiologicalfunctionsare usually resulted from collective behavior and coordinated expression of a group of genes 34 rather than that of an individual gene, genes grouped according to their co-expression pattern may be more powerful in revealing gene regulation mechanisms. Our approach to generate discriminative gene clusters is a combination of supervised and unsupervised technique In recent years, Jornsten and Yu [JY03] and Dettling and Buhlmann [DB02] proposed similar combination approaches. However, there are major di®erences between their methods and our method. We use a multivariate approach for cluster selection, while Dettling and Buhlmann [DB02] employed a univariate approach, whichassumestheindependenceofthecontributionofclusterstosamplephenotypeclas- si¯cation. Althoughsuchhypothesisreducescomputationalcomplexityforlargedatasets, theaccuracyiscompromisedsincethecomplexbiologicalinteractionamonggeneclusters isnotproperlyre°ected. Weexploitamultivariateapproachinthecontentofgeneexpres- sion analysis since it accounts for the joint contribution of clusters to sample phenotype classi¯cation. It is known that such complex phenotype like cancer is resulted from not expression of individual genes but rather the underlying genetic networks. Thus a mul- tivariate approach that emphasize the combinatorial e®ect of variables is more suitable to the exploration of the interaction among variables. Our method di®ers from [JY03] in the following two aspects: Although both works adopt multivariate approach, ¯rst of all, in their information-based approach, clustering and cluster selection are done si- multaneously, resulting in a set of clusters optimizing the Minimum Description Length. In comparison, our computation-oriented approach is a re¯ning process where cluster- ing and cluster selection are performed alternatively in each iteration step with better and better results. Secondly, the clusters generated with Jornsten and Yu's approach include both active and inactive ones. Here, active clusters are those whose centroids are 35 relevant to sample phenotype classi¯cation, and inactive ones are not. Our method is essentially a backward approach [AM02]. It iteratively eliminates the less active clusters and re-clusters the remaining genes in the active clusters, reducing the negative in°uence of non-discriminative clusters on the sample phenotype classi¯cation. Ourprogramoutputsaseriesofclustersetsthathaveincreasingdiscriminationpower for training samples without losing prediction power on the test samples, as indicated in our experimental results. It achieved similar or better prediction accuracy than the known methods aforementioned for most of the tested datasets in our validation process. More importantly, our test shows that the centroids of the output clusters using di®erent sets of training samples are stable and consistently achieve signi¯cant proximity to the global optimal gene clusters obtained by using all the samples. Another advantage of our method is that it provides researchers with °exibility to decide which cluster set should be chosen for their purpose. 3.2 Results Weimplementedthealgorithm(describedintheMethodssection)asMATLABfunctions. It runs on a PC with the Windows operating system. The SVM program written by Cawleywasdownloadedfromthewebsitehttp://theoval.sys.uea.ac.uk/gcc/svm/toolbox. In this section, we present the detailed test results on both simulated and Leukemia AML/ALL datasets [GST + 99]. We also have tested our method on other real datasets and compared the performance of our algorithm with those reported in the previous literature. The details of the performance measures are described in the Method section. 36 3.2.1 Simulated datasets Wegenerated100simulatedbinaryclassi¯cationdatasetsusingasimplestochasticmodel. Each simulated dataset contains 100 samples evenly split into two sample phenotype classes. Both training and test samples contain 25 samples in each class. Each dataset contains of 400 genes evenly divided into four gene clusters. Two of the four clusters are relevant to sample phenotype classi¯cation and these two discriminative clusters C 1 and C 2 contribute to sample phenotype classi¯cation independently. Their centroids Â(C 1 ) and Â(C 2 ) are generated according to the sample phenotype class labels. Each component of Â(C 1 ) in a position is generated according to normal distributions N(1, 0.5) or N(-1, 0.5) depending on whether the corresponding sample is in class 1 or class -1, while each component of Â(C 2 ) generated according to N(-1, 0.5) if the sample is in class 1 and N(1, 0.5) otherwise. Similarly, the centroids of the non-discriminative clusters C 3 and C 4 are generated according to the normal distribution N(1, 1) and N(- 1,1) regardless of the samples' class. For each i=1, 2, 3, 4, the expression values of a gene in the cluster C i are generated according to the multivariate normal distribution N(Â(C i );d i =4), where d i =min j6=i d(Â(C i );Â(c j )). We run our algorithm with the input gene set contains all the 400 genes for each of the 100 simulated datasets. The performance results are summarized in Figure 3.1. We observed that the classi¯cation performance of the generated clusters keeps increasing as the iteration process goes. The average classi¯cation accuracy of these tests jumps from 0.756 up to 0.848 (Figure 3.1a); and the classi¯cation accuracy µ test on training samples goes up from 0.720 to 0.984 (Figure 3.1b). 37 Figure 3.1 The performance analysis on simulated datasets. The solid lines indicate the average values; and dotted lines indicate one standard deviation from the averages. The X-axis represents the number of genes in S i . Note that, when the generating process goes, the number of genes in S i decreases. (a) The classi¯cation accuracy µ test on the test samples. (b) The classi¯cation accuracy µ train on the training samples. (c) The percentage ½ sim (i) of truly discriminative genes in S i ; (d) The p-value ½ S (i) based on ^ ±(A i ). 38 We also observed that more and more truly discriminative genes are identi¯ed in the activeclustersasthealgorithmproceeds. Sincethegenesinthediscriminativeclustersare knownineachsimulateddataset,wecomputedtheratio½ sim (i)= jS i \(C 1 [C 2 )j jS i j ofthetruly discriminative genes over all the genes in for each iteration i. The active clusters output ½ sim (i), just before the algorithm terminates is about 0.778 (Figure 3.1c). Recall that, at each iteration i, the algorithm generates ·=50 active gene clusters since the number of training samples n r = 50 for each simulated dataset. We found that at each iteration i, the centroids of two active clusters are very close to Â(C 1 ) and Â(C 2 ), the centriods of the discriminative clusters in the model. This is re°ected by the indistinguishably small p-value ½ S (i) calculated based on ¹ d(A i ;¢ 0 ). Here ¹ d(A i ;¢ 0 ) is the 'average' Euclidean distance of centroids between an active cluster in A i and its closest cluster in ¢ 0 = fC 1 ;C 2 g. In the same time, the centroids of active clusters become more and more distinguish- able from each other, increasingly close to the average pairwise distance of all 400 genes, and such trend can also be re°ected by the increasing p-value ½ S (I) from 0.228 up to 0.476 (Figure 3.1d), calculated based on ^ ±(A i ), the average Euclidean distance between the centroids of active clusters in A i . Meanwhile, the Silhouette width ¹ !(A i ) of active clusters in A i increases from 0.826 to 0.980. 3.2.2 Leukemia dataset Leukemia AML/ALL dataset [GST + 99] contains the expression values of 6,817 human genes in 47 acute lymphoblastic leukemia (ALL) and 25 acute myeloid leukemia (AML) tissue samples. After performing the threshold ¯ltering and logarithmic transformation 39 procedure, we obtained a reduced dataset with only 3,571 genes. Here, we validate our algorithmbyusingthree-foldcrossvalidation. Ineachrun,werandomlyselectedtwothird of the samples as the training samples and the rest as the testing samples. The samples ofdi®erentphenotype classesare keptproportional in thetraining and testsamples. The resulting dataset was further normalized by rescaling the variance of expression values of each gene to 1 in the training samples, and then applying the same rescaling factor to the expression values of that gene in the test samples. We conducted the three-fold cross validation for 100 times. To reduce computational cost, we restrict our algorithm on small portions of discriminative genes. In each run, the algorithm starts with the input gene set consisting of the 357 genes (10% of all the 3,571 genes) that are highly correlated with the training samples' phenotype classi¯cation in terms of the correlation metric proposed in [GST + 99]. Figure3.2summarizesthevaluesofthedi®erentperformanceindicators. Theaverage classi¯cation accuracy µ train on the training samples ranges from 0.994 up to 1 (Figure 3.2b); and the average classi¯cation accuracy µ test on the test samples increase slightly from 0.966 to 0.972 (Figure 3.2a). These results show that the centroids of the clusters generated in di®erent iteration steps discriminate the training samples better and better without signi¯cant decrease of its generalization ability. Fortheevaluationofouralgorithm, wesearchedforperfect3-genesubsets, whichcan be used to perfectly classify all 72 samples using the weighted voting classi¯er trained on all the samples. This search resulted in 9,722 perfect subsets. We selected 48 (roughly equal to n r ) genes g i (1·i· 48) with highest occurrence frequency to form the cluster set ¢ 0 1 =ffg i gj1·i·48g for comparison with the clusters generated by our algorithm. 40 Figure 3.2 The analysis of the three-fold cross validation performance of the algorithm on the Leukemia dataset. The dotted lines indicate the performance values in individual tests. The solid lines indicate the average values; and the dotted lines indicate one standard deviation from the averages. The X-axis represents the number of genes in S i . (a) The classi¯cation accuracy µ test on the test samples. (b) The classi¯cation accuracy µ train on the training samples. (c) The p-values ½ S (i) based on ¹ d(A i ;¢ 0 1). (d) The p- values ½ S (i) based on ¹ d(A i ;¢ 0 2). (e) The p-value ½ all (i) based on ^ ±(A i ). (f) The average Silhouette width ¹ !(A i ) of active clusters in A i . 41 We also evaluate our algorithm using another cluster set ¢ 0 2 , the ¯nal set of active clusters generated by our algorithm with S 0 as the input gene subset and with all the 72 samples as the training samples, where S 0 is the set of the 357 genes (10% of all the 3,571genes)thatarehighlycorrelatedwiththeAML/ALLsampleclassesintermsofthe correlation metric proposed in [GST + 99]. Probably because of the selection sensitivity of the correlation metric of [GST + 99] resulting from small sample size, the gene sets that are selected according to di®erent training-test splits do not have many genes in common. In all the 100 validation exper- iments, only 120 genes appearing in every input gene set . This number is quite small compared with 1,071, the number of the genes appearing in some input gene sets (each of size357). Bycontrast,thecentroidsofclustersinthesetgeneratedineachofiterationsof ouralgorithmindi®erentrunsaresigni¯cantlysimilartotheselecteddiscriminativegenes in ¢ 0 1 and ¢ 0 2 at most iteration steps. This is re°ected by the very small p-values ½ s (i) computed based on ¹ d(A i ;¢ 0 1 ) and ¹ d(A i ;¢ 0 2 ), which range from 4:11£10 ¡2 to 6:12£10 ¡3 (Figure 3.2c) and from 5:62£10 ¡3 to 2:38£10 ¡3 (Figure 3.2d) respectively. The above observation strongly suggests the stability associated with discriminative clusters rather than with individual discriminative genes. Such stability is one of the main advantages of our method. We further studied the biological function of genes in the active clusters using Gene Ontology (GO), focusing on the biological processes located at the ¯fth level of the GO hierarchy. For the set of all genes from active clusters in A i , we ¯nd its enriched biolog- ical processes by calculating the hyper-geometric p-value, then convert the p-value into a log score s by s = ¡log 10 (p). Table 3.1 gives the top four biological processes that 42 Biological process Average score In°ammatory response 2.72 Regulation of catalytic activity 2.19 Response to wounding 1.98 Positive regulation of metabolic process 1.95 Table 3.1 Signi¯cantly enriched biological processes are most signi¯cantly enriched in the active clusters in the ¯nal iteration, in terms of the score averaged from the 100 validation experiments. All four processes are frequently associatedwithleukemia. Inaddition, weinspectedthechangeofproportionofthegenes of the four processes in the active clusters during re¯nement iterations. The proportions are also averaged over the 100 validation experiments. Figure 3.3 shows that when the active clusters contain less than two third genes in input gene set S, the average gene proportions of all four processes monotonically increase until the last iteration. Such convergence strongly suggests that our method can indeed re¯ne clusters into biologi- cally meaningful ones. Interestingly, processes of in°ammatory response and response to wounding showed very similar convergence patterns. In fact, these two processes are closely related. The same holds for biological processes of regulation of catalytic activity and positive regulation of metabolic process. 3.2.3 The performance analysis on other real datasets Besides the above dataset, we also tested our algorithm on seven other datasets. The descriptions of these datasets are as follows. Altogether, we derive 12 sample phenotype classi¯cation studies from the 8 datasets. 43 Figure3.3 Averageproportionofgenesofthefourbiologicalprocessesduringthere¯ne- ment iterations. The solid black, blue, red, and green lines corresponding to the ordered processes in Table 3.1 from top to bottom. 44 1. ALL T/B Cell dataset. The 47 ALL samples in Leukemia ALL/AML dataset in [GST + 99] are further divided into 39 T-cell samples and 9 B-cell samples. 2. Breast cancer dataset [WBD + 01]. This dataset comprises 7,129 genes and 49 sam- ples being divided into two sample phenotype classes according to their estrogen receptor (ER) responses: 25 for ER positive and 24 for ER negative. 3. Carcinoma dataset [NASL01]. It contains the expression levels of about 6600 genes in 18 tumor and 18 normal tissues. 4. Colon dataset [ABN + 99]. It consists of the expression levels of 6,500 human genes in 40 tumor and 22 normal tissues. 5. Di®use Large B-Cell Lymphoma (DLBCL) dataset [SRT + 02]. It consists of expres- sion values of 6,817 genes in 58 DLBCL and 19 Follicular Lymphoma. 6. The Melanoma dataset [BMC + 00]. The dataset consists of the expression ratios of 6,971 human genes in 12 Unclustered Cutaneous Melanomas and 19 Cutaneous Melanomas samples. 7. Prostate dataset [SFR + 02]. The dataset consists of the expression levels of 52 prostate and 50 normal samples of 6,744 human genes. 8. The Small, round blue cell tumors (SRBCT) dataset [KWR + 01]. It consists of 88 samples divided into ¯ve sample phenotype classes: neuroblastoma (NB) (18 sam- ples), rhabdomyosarcoma (RMS) (25 samples), Burkitt lymphomas (BL) (11 sam- ples), theEwingfamilyoftumors(EWS) (29samples)andothers(5samples). The 45 dataset has 2,308 genes. Since we consider the binary sample phenotype classi¯ca- tion problem, we derived four binary sample phenotype classi¯cation datasets from this dataset using the one-against-all rule: SRBCT-NB, SRBCT-RMS, SRBCT-BL SRBCT-EWS. RFERENCES We preprocessed each dataset by applying ¯ltering and logarithm transformation if necessary. For each sample phenotype classi¯cation study, we run our algorithm 100 times by choosing random training-test splits in the same way as the Leukemia dataset described in the last subsection. The performance of our method is summarized in Table 3.2. In the table, there are two columns for each performance measure, indicating the average values of the corresponding measures at the ¯rst and last iteration step of our algorithm. Because the exhaustive search of the most frequent globally optimal genes for constructing ¢ 0 1 is time-consuming, we only compare the active clusters with ¢ 0 2 constructed as follows: 1) we apply our algorithm on all samples in the dataset and 2) use the active clusters of the last iteration as ¢ 0 2 . The classi¯cation accuracy µ test on the test samples shows that among 9 of 12 sample phenotype classi¯cation studies, the prediction performance of active clusters in A j in- creases slightly from the start to the end of each execution, which are highlighted in the table. The value of µ test for the remaining three studies (Breast, Colon and Carcinoma) decrease slightly. The above observations indicate that for all datasets we tested, there is no signi¯cant decrease in the generalization ability of the active clusters A i in obtained 46 Datasets µtest ½S(i) based on ¹ d(Ai;¢ 0 2 ) ½ all (i) based on ¹ ±(Ai) ¹ !(Ai) Leukemia ALL T/B cell 0.97 0.977 1.72E-02 8.28E-03 0.088 0.331 0.406 0.973 Breast 0.843 0.842 1.33E-02 8.36E-03 0.142 0.421 0.351 0.974 Carcinoma 0.983 0.981 2.96E-02 3.20E-02 0.194 0.252 0.382 0.966 Colon 0.814 0.806 2.43E-02 2.06E-02 0.75 0.755 0.673 0.978 DLBCL 0.896 0.929 8.75E-02 1.99E-02 0.441 0.514 0.716 0.982 Melanoma 0.913 0.921 1.71E-02 2.25E-02 0.129 0.463 0.272 0.957 Prostate 0.889 0.916 4.79E-02 2.27E-02 0.495 0.541 0.68 0.987 SRBCT- BL 1 1 3.63E-04 7.52E-05 0.314 0.322 0.682 0.984 SRBCT- EWS 0.956 0.986 5.06E-04 9.17E-05 0.297 0.408 0.634 0.984 SRBCT- NB 0.989 0.996 2.99E-04 6.42E-05 0.321 0.436 0.665 0.986 SRBCT- RMS 0.974 0.98 4.82E-04 8.18E-05 0.304 0.347 0.63 0.989 Lukemia AML / ALL 0.966 0.972 5.62E-03 2.38E-03 0.212 0.398 0.627 0.98 Table3.2 Theperformanceofthealgorithmfordi®erentsamplephenotypeclassi¯cation studies. in each iteration step. The classi¯cation performance µ train on the training samples in- creasesinallofthe12studies,whichindicatesthattheseparationofthetrainingsamples improves for all studies. All the 100 input gene sets S vary a lot in di®erent runs for each study. There are only 1.1% to 5.1% of all the genes appearing in all the 100 input gene sets S, while at least 23.8% to 51.7% genes appear in some input gene sets. By contrast, the centroids of clusters in SA i generated by our algorithm at each iteration step i are stably close to the optimal centroids of clusters in ¢ 0 2 as re°ected by the p-values ½ s (i) ranging from 2:99£10 ¡4 to8:75£10 ¡2 atthe¯rstiterationstepandthoserangingfrom6:42£10 ¡5 to 3:20£10 ¡2 in the last iteration step. The consistent closeness of the clusters generated in di®erent repeats can also be re°ected in the standard deviation of ½ s (i), which are limited from 0.32 to 0.96 times of the absolute values of in the ¯rst iteration step and 0.24 to 1.37 times at the last iteration step. 47 During the generation process, the p-values ½ all (i) of average pairwise distance ^ ±(A i ) amongcentroidsofclustersinA i keepsincreasingforall12studies(rangingfrom0.088to 0.750 at the ¯rst iteration step and from 0.252 to 0.755 in the last step), and the average Silhouette width of active clusters ¹ !(A i ) keeps increasing for all the 12 studies (ranging from 0.230 to 0.698 at the ¯rst iteration step and from 0.964 to 0.989 in the last iteration step). This indicates that the clusters in A i are more and more distinct in general. In summary, our test shows that on real microarray datasets, our algorithm is able to generate clusters that separate the training samples with increasing prediction accu- racy and closeness to known optimal clusters. Such discriminative cluster re¯nement is consistent with what we have observed on simulated datasets. 3.2.4 Comparing the sample phenotype classi¯cation performance to other studies In this section, we compare the cross validation performance of our method with pre- vious works reported in [JY03,DB02,SRT + 02,JSR02]. For the purpose of comparison, we converted the classi¯cation performance from the classi¯cation accuracy µ test into the error rate. Table 3.3 summarizes the comparison of our algorithm (of both binary and multi-sample phenotype class versions) with others by the cross validation error rates. It is di±cult to make direct comparisons with other approaches in the literature, because the speci¯c data sets or data preparation are not always available. However, the perfor- mances our method is in general comparable to others. In the comparison, the DLBCL and Carcinoma datasets are validated using leave-one-out validation; and the remaining datasets are validated using three-fold cross validation. 48 Datasets Our algorithm Dettling and Buhlmann (2002) Jornsten and Yu (2003) Shipp et al. (2002) Lukemia AML/ALL 3.43 - 2.57 6.58 - 2.71 Leukemia three classes 13.8 - 9.3 12.6 Breast 16.14 - 14.11 3.00 - 0.75 Carcinoma 5.6 - 0.0 Colon 19.41 - 18.23 23.35 - 15.95 13.6 DLBCL 8.7 - 7.4 7.8 Prostate 11.09 - 8.36 16.47 - 6.91 SRBCT multi class 5.92 - 4.27 5.76 - 0.43 Table 3.3 Comparison of our algorithm with others. Dettling and Buhlmann [DB02] reported the error rate of their algorithm for di®erent datasets. Theyemployednearestneighborsandaggregatedtreesastheclassi¯ersintheir three-fold cross validation test. For the leukemia AML / ALL dataset, our algorithm seemstoachieveaslightlylowererrorratethantheirs. IntheColonandProstatedatasets, the error rate of our algorithm lies between that of theirs. For the Breast dataset, the error rate is signi¯cantly higher than that of Dettling and Buhlmann's. However, we obtained the performance using all the original 49 samples. The error rate in each test ranges from 7.89 to 6.90. According to [WBD + 01], at least 7 out of the 49 samples are inherently erroneous. In comparison, Dettling and Buhlmann [DB02] used the 38 good samples selected by [WBD + 01], and the error rate ranges from 1.14 to 0.10. The 38 samples used in Dettling and Buhlmann [DB02] consists none of the above 7 erroneous samples. Thus, we believe that the performances of ours and Dettling and Buhlmann's are still comparable for the Breast dataset. For the DLBCL dataset, the leave-one-out performance of Shipp et al. [SRT + 02] is in our performance range. For Carcinoma dataset, Jaeger et a. [JSR02] achieved perfect leave-one-out performance, and our best performance can match theirs. For the Colon 49 dataset, both ours and Dettling and Buhlmann's error rate are higher than Jornsten and Yu's. We also test the performance of the multiple-sample phenotype class version of our method against other methods. For the Leukemia three-class dataset, our method is comparable to Jornsten and Yu's method. However, for the SRBCT multi class dataset, ouralgorithmseemshadaslightlyhighererrorratethanthatofDettlingandBuhlmann's. 3.3 Conclusions Due to the small-sample-high-dimension nature of the microarray dataset, it is not di±- cult to ¯nd highly discriminative gene subsets of small size. However, if a gene selection process is unstable with the choice of training samples, the biological signi¯cance of the resultinggenesubsetsisoftennotguaranteed. Inthispaper, insteadof¯ndingindividual discriminative genes or gene subsets, we propose a novel backward approach for gener- ating a series of highly discriminative gene clusters. Compared to selection of individual discriminative genes, genes grouped in these clusters are more stable when subject to change of training samples. Therefore they could provide more convincing support to gene interactions that are associated with the sample phenotype classes. In future, we will work with biologists to study the biomedical implication of these clusters. Regarding to the sample phenotype classi¯cation performance, the gene clusters pro- ducedbyourapproachcangenerallyachievegoodcrossvalidationperformancecompared totheexistingmethodsformostofdatasetswetested. Moreimportantly, ourtestexper- imentsshowthatregardlessofthechoiceoftrainingsamples, thecentroidsoftheclusters 50 generatedarestableandsigni¯cantlyclosetotheknownoptimalgeneclustersfoundusing all the samples. All these indicate that our approach is promising. However, the current version of our algorithm is time-consuming. In future, the computational e±ciency will be investigated. On the other hand, we used K-means algorithm, a typical partitioning based clustering method to seek a certain number of clusters that minimize the sum of squared distances between each gene and its centroid. The drawback for K-means is the subjective speci¯cation of input parameters such as the number of clusters and initial centroid locations. For unknown microarray datasets, such information is unavailable. Furthermore, di®erent input parameters may result in signi¯cantly di®erent clustering results. K-means can only converge to local optima, rather than the global optimum. In order to address these problems associated with K-means clustering. We plan to apply a novel clustering method based on Random Matrix Theory (RMT) [ZW08] which is completely objective and do not require the speci¯cation of cluster number and initial centroid locations. RMT method avoids being trapped into local optima. Furthermore, most previous clustering methods including K-means and hierarchical clustering parti- tion members into non-overlapping groups. The RMT method allows the same genes in multiplegroupstore°ectthefactthatasinglegenemaycontributetomultiplebiological pathways. In order to test the discriminative power of a certain gene cluster, additional criteria establishedbystatisticalanalysisshouldalsobeconductedtoidentifyandremoveinactive cluster. For example, gene expression pattern observed in the active clusters should be less likely to appear in the control set. Chi Square test might be used to test the signi¯cance. Some data normalization technique may be considered in the preprocessing 51 step to improve the data quality. Furthermore, more suitable backward feature selection method needs to be exploited so that the gene clustering and cluster selection processes can be integrated better. Our approach provides a °exible framework that allows us to test the performance of various computing modules in a various ways of combinations. Our method can not only be easily extended to multi-sample phenotype class predic- tion, but also can be easily extended to integrate gene clusters obtained from multiple datasets that contain sample phenotype classi¯cation study of same type. Such integra- tion can be obtained by counting the occurrence frequency of a gene in similar clusters obtainedfromdi®erentdatasets. Insuchway,bypullingevidencesfrommultipledatasets from di®erent experimental sources, genes in such clusters of high occurrence frequency would have higher con¯dence to form truly discriminative gene sets. 3.4 Methods 3.4.1 Algorithm In this subsection, we present our backward approach for generating discriminative gene clusters. The method is executed in a repetitive manner. In each pass, the method ¯rst groups genes into clusters that may indicate functional categories [THC + 99]. It then ranks the generated clusters and eliminates those clusters that are less discriminative so thatthere-clusteringofremaininggenescangeneratemoduleswithbetterresolutionand stronger association with the sample phenotype classes. In the clustering stage, we use the K-means method to group the genes into a constant number of active clusters. 52 In the elimination stage, we use a backward feature selection method. This stage involves cluster validation and evaluation of the discriminative ability of active clusters. Tovalidateclusters,weusetheSilhouettewidth[KR90]tomeasuretheirvalidity. Assume the input genes are partitioned into p clusters C i ;C 2 ;:::;C p . Given a gene g, let ¹ w g be the average Euclidian distance between g and another gene within the same cluster, and let ¹ b gJ the average Euclidian distance between and a gene g in a di®erent cluster C J . Then the Silhouette width ! g of g is de¯ned as ! g = min J ( ¹ b gJ )¡¹ wg max(min J ( ¹ b gJ );¹ wg) , and the Silhouette width of a cluster is de¯ned as the average Silhouette width of all its members. It is easy to see that the Silhouette width of a cluster fall within the range from -1 to 1. A good cluster should have a high Silhouette width. To measure the discriminative ability of an active cluster, we adopt the idea of SVM- RFE method in [GWBV02]. Support Vector Machine (SVM) is a binary-class prediction method originated from statistical learning theory [Vap00]. A linear SVM ¯rst ¯nds a decision hyperplane y =w T x+b that maximizes the separation between samples of two classes;andthenitdoesclasspredictionaccordingtotherelativelocationofanewsample with respect to the hyperplane in the feature space. Note that the weight vectorw found by the linear SVM indicates the relative importance of the genes for the classi¯cation. Here, we iteratively train a linear SVM and eliminate a gene cluster based on an overall evaluationonboththeweightandtheSilhouettewidthinsteadofdiscardingsinglegenein theoriginalSVM-RFEmethod. Suchsystematicapproachmakestheeliminationprocess to better re°ect the underlying biological meaning. Our method is summarized into the algorithm in Figure 3.4. In the algorithm, ¢ denotes the set of inactive gene clusters; A i denotes the set of active clusters at each 53 iterationi;S i denotesthesetofgenesunderconsiderationatthebeginningoftheiteration i; · denotes the number of clusters partitioned at each iteration step. For simplicity, we set · to be n r , the number of training samples. Figure 3.4 The algorithm of selecting discriminative clusters. It is often di±cult to determine how many clusters the genes should be grouped into for microarray datasets, which usually have complex expression patterns. The algorithm outputs · = n r , the number of the training samples, active clusters in each iteration. This is because the expected size of a feature subset that can linearly discriminate all the samples is only (n r +1)=2 [MR03]. Note that if the feature number is too small, the clustering will lose its resolution. RecallthattheK-meansclusteringmethodstartswithaninitialpartitionofthegenes. In order to make it more deterministic in Step 2, we ¯rst select · genes as follows: Find 54 a furthest gene pair and form an initial gene set G, and then iteratively ¯nd a gene with largest average Euclidean distance from the genes in G and add it into G untiljGj = ·. We then partition all the genes into clusters by merging each gene with its nearest gene in G. The calculation of the Silhouette width of each cluster in A i takes all the clusters in both sets A i and into account ¢. At the ith iteration, the algorithm groups all the genes in the set S i into · clusters, forming the cluster set A i , and then insert the least active cluster into the inactive cluster set ¢ in Step 5 as follows. Therearetwoimportantfactorstoevaluateinordertodeterminewhichclustershould be removed from S i and added into ¢. The ¯rst factor is the cluster's Silhouette width. Another factor is the cluster's discriminative ability in terms of its weight determined by the linear SVM constructed in Step 4. Here, we would like to eliminate a least discrim- inative cluster whose centroid is su±ciently representative of the expression pattern of thecluster(measuredbytheSilhouettewidth). Inotherwords, weeliminateasetofwell clustered genes whose expression patterns have little contribution to classi¯cation. On the other hand, those not well clustered genes will be re-clustered at later iterations. Since the above two factors are not always consistent, we adopt a multiple objective optimizationtechniqueappearingin[Gen]to¯ndanicetradeo®betweenthesetwofactor and such multiple objective method is shown in Figure 3.5. Finally,wecanextendthealgorithmtothemultiple-classcasebyadoptingthepopular one-against-all approach. In this approach, given a training test split, both training and test samples of a dataset of K > 2 classes are transferred into K binary classi¯cation problems, each corresponding to classify samples from one class against samples from all 55 Figure 3.5 Multiple objective optimization procedure for cluster elimination. remaining classes. Then our algorithm executed on the K problems results in K series of active cluster sets A j;i ;j = 1;:::;K. Then classi¯ers are constructed using K £ · clusters from the K active cluster sets A 1;i 1 ;:::;A K;i K by selecting i 1 ;:::;i K such that jS i 1 j;:::;jS i K j are roughly identical. Given the centroids of the above K£· clusters, a multi-class linear SVM is trained using training samples and tested on test samples. 3.4.2 Performance measures We validate our method in terms of its classi¯cation performance and clustering perfor- mance. The classi¯cation performance is determined by the classi¯cation accuracy on training or testing samples. We use the SVM as the classi¯er to evaluate the generated geneclusters. Classi¯cationaccuracyµ test onthetestsamplesisde¯nedasthepercentage of the correctly classi¯ed samples. However, we de¯ne classi¯cation accuracy µ train on training samples as the average accuracy of the 10-fold cross validation on the training samples as suggested in [AM02] for less biased estimation of classi¯cation performance. The clustering quality is measured in terms of the density of the clusters, as well as the distinction between clusters and the closeness of the clusters to some reference clus- ters. They are measured respectively by (a) the average Silhouette width ¹ !(A i ) of active 56 clusters in A i produced in iteration i, (b) the average Euclidean distance ¹ ±(A i ) between the centroids of active clusters in A i and (c) the 'average' Euclidean distance ¹ d(A i ;¢ 0 ) of centroids between an active cluster in A i and its closest cluster in a reference cluster set ¢ 0 (the construction can be found in the Result Section). To be more precise, assume A i =fC 1 ;:::;C · g and ¢ 0 =fD 1 ;:::;D · g. First, from 1 to ·, ¯nd recursively C 0 l 2 A i and D 0 l 2 ¢ 0 such that d(x(C 0 l );x(D 0 l )) = min C 0 l 2A i ¡A 00 ;D 00 l 2¢ 0 ¡¢ 00 l d(x(C 00 l );x(D 00 l )), where x() denotes the centeroid of a cluster, d(;) the Euclidean distance between two vectors, A 00 =fC 0 1 ;:::;C 0 l¡1 g and ¢ 00 l =fD 0 1 ;:::;D 0 l¡1 g. Then, the 'average' Euclidean distance ¹ d(A i ;¢ 0 ) is de¯ned as ¹ d(A i ;¢ 0 )= 1 · P i·l·· d(x(C 0 l );x(D 0 l )). We measure the statistical signi¯cance of average distances in both case (b) and (c) at each iteration i against the pairwise distances of all genes in the input gene set S in terms of the p-value ½ S (i), and against all the genes in the dataset in terms of the p-value ½ a ll(i). In each case, the p-values are calculated according to the empirical distribution (null distribution) of the pairwise distance of genes randomly sampled in the whole dataset. 57 Chapter 4 An integrative approach to characterize disease-speci¯c pathways and their coordination 4.1 Introduction The recent development of microarray technology has signi¯cantly facilitated the iden- ti¯cation of disease-related genes [SC03,CCS + 02,GST + 99,LMA06]. However, many complex disease phenotypes are caused not by individual genes, but by the coordinated e®ect of many genes. Insight into the structure and coordination of disease-related path- waysiscrucialtounderstandingthepathophysiologyofcomplexdiseases. However,ithas proved di±cult to infer pathways from microarray data by deriving modules of multiple related genes, rather than individual genes. The major challenges are: (1) Genes in- volved in a pathway may exhibit complex expression relationships beyond co-expression, which may be overlooked by standard microarray analysis methods such as clustering [ZKH + 05]. (2)Pathwaysaredynamicandthecurrentstaticannotationofpathwaysmay not serve as a good template. In fact, pathways are manual dissections of the underlying dynamic gene regulatory network. Under di®erent conditions, di®erent segments of the 58 ensemble network will be activated, leading to condition-speci¯c activation of pathways [YMH + 07]. In this study, by integrating many microarray datasets we propose a novel method to simultaneously infer pathways and disease/phenotypic conditions under which the path- waysareactivated. Theidenti¯edpathwaysmaycomprisegeneswithcomplexexpression relationships beyond co-expression. Due to the existence of a large amount of cancer microarray data, we used cancer as our case study. We collected a series of microarray datasetsmeasuringdi®erenttypesofcancers,andaseriesofdatasetsmeasuringothercel- lular/physiological conditions. We ¯rst construct a di®erential co-expression network, in which each node represents a gene and each edge indicates a gene pair that is frequently co-expressed in cancer datasets but not in non-cancer datasets. We then dissect the networks into cancer-subtype speci¯c network modules by considering (1) co-expression dynamics and (2) network topology. Figure 4.1a illustrates the conceptual pipeline of our method. To measure co-expression dynamics, we use second-order expression similarity, which we proposed previously [ZKH + 05]. Brie°y, if we de¯ne ¯rst-order expression similarity as the expression similarity of two genes from one dataset, then second-order similarity measures whether two gene pairs simultaneously exhibit either high or low expression similarity across multiple datasets. In general, high ¯rst-order similarity suggests the ex- istence of a functional link between two genes, and clustering based on the second-order similarity captures multiple functional links always activated and deactivated under sim- ilar conditions. Such functional links are likely to comprise a functional module. Inter- estingly, genes in a second-order cluster may not always have high ¯rst-order similarity 59 Phenotype- specific module Connectivity dissection Co-expression dynamics dissection Connected network modules A hierarchy of second order clusters Differential co- expression Network 32 cancer and 23 non-cancer datasets Network ranking using the scaling model Selection of differential co-expressed gene pairs a b c d e f g i k f h a — b b— c c — d a — d e —f a — c e — h e — g i— k g— f e — c b— d c — h e — h e — b Gene pairs Cancer datasets A B C Figure 4.1 Overview of analysis procedure. (A) Flow chart of the analysis pipeline. (B) Schematic illustration of the concept of second-order similarity. It is obvious that the overall expression similarity between the two gene pairs (genes 1 and 2 versus genes 3 and4)isnotsigni¯cantlyhigh, buttheir¯rst-orderexpressioncorrelationpro¯lesexhibit high second-order similarity. (C) Schematic illustration of the dissection of di®erential co-expression networks into network modules based on the co-expression dynamics and network connectivity. In the heat map, every column corresponds to a dataset and every row corresponds to a gene pair. Red, black, green and grey corresponds to positive, low, negative and missing correlations, respectively. By hierarchical clustering, the gene pairs fall into two major second-order clusters. The 9 gene pairs in the green cluster comprisethreeconnectednetworkcomponents,whereasthe6genepairsintheredcluster give rise to three connected components. Furthermore, by considering the second order cluster of a higher level of the hierarchy, which consists of both green and red clusters, the six networks are united to form two connected networks, re°ecting the hierarchical modularity of cancer co-expression networks. 60 (see an example in Figure 4.1b); therefore, second-order analysis allows us to identify functional modules that are inaccessible to co-expression analysis. Givenmultiplegenepairssharinghighsecond-ordersimilarity,wefurtherdividethem into network components based on their connectivity on the di®erential co-expression graph (see an example in Figure 4.1c). We observe that genes within a connected networkcomponentaremorelikelytoparticipateinthesamespeci¯cpathwaythanthose betweendi®erentcomponents,which,inturn,arelikelytobeinvolvedindi®erentrelevant pathways. This may reveal high-order cross-pathway coordination. In fact, hierarchical clustering of di®erentially co-expressed gene pairs based on their second-order similarity results in a hierarchical modularity in terms of relevance of functional links. We designed a linear scaling model to select modules by considering both module size (number of edges) and within-module second-order similarity. Then, given selected modules, we can furtherinferdatasets(phenotypicconditions)inwhichamoduleisactivated,i.e. inwhich genes in the module coordinate. Applying our methods to 32 cancer-related microarray datasets, and 23 non-cancer related datasets, we derived 162 second-order clusters consisting of 224 network modules, activated either in cancer or in speci¯c cancer subtypes. In particular, we identi¯ed a breast cancer speci¯c network module that involved in tumor suppression via platelet- derived growth factor (PDGF)-like signaling, more importantly, a hub gene PDGFRL that may play an important role in this tumor suppressor module. 61 4.2 Results 4.2.1 Network properties of the cancer di®erential co-expression network We curated 32 human microarray datasets (1,764 expression pro¯les in total) measuring cancers of 12 tissues, and 23 datasets (1,158 expression pro¯les) not related to cancer (e.g. normal tissues, chronic granulomatous disease, Huntington's disease, in°ammatory response). We ¯rst identify gene pairs which consistently demonstrate higher correlation incancerversusnon-cancerdatasetsbasedonarobustcorrelationestimator, thenormal- ized Percentage Bend correlation (for details see Methods). In following sections, if not speci¯ed, the term correlation will by default refer to the normalized Percentage Bend correlation. Thesecriteriaresultin6,035genepairscovering1,967genes. The6,035gene pairs, each representing a potential conditional functional link, can be represented as a di®erential co-expression network. In this network, each gene is represented as a node and each di®erential co-expression relationship is represented as an edge. Ithasbeenreportedthatco-expressionnetworksfollowascale-freenodedegreedistri- bution [JMRWK04]. Weobservedthatthedi®erentialco-expressionnetworkalsofollows such a topology, where only a small number of nodes act as \highly connected hubs" (see node degree distribution in Figure 4.2). This indicates that most gene-gene co-expression relationships di®ering between cancer and other phenotypes are associated with only a few \hub" genes. Such hub genes exhibit a high degree of coordination with many other genes in neoplastic states, and are therefore likely to play important roles in carcino- genesis and cancer progression. In fact, most hub genes fall into two main functional 62 categories: 1) core processes of neoplastic states such as cell division and chromosome or- ganization; or 2) dynamic interactions between cancer cells and their microenvironment such as angiogenesis, immune response, and cell adhesion . For those hub genes with unknown functions, we can predict their cancer-related functions based on their neighbor genes. For example, the 16 out of the 33 interacting partners of the ADP-ribosylation factor-like 6 interacting protein (ARL6IP) are involved in cell division (hypergeometric test p-value 1:6£10 ¡24 ). Thus, ARL6IP is likely to be involved in cell proliferation, con- sistent with its initial characterization as an interaction partner of the Ras superfamily member ARL6 [PBG + 00]. As another example, while micro¯brillar-associated protein 2 (MFAP2) has long been known to bind to various components of the elastic extracellular matrix [JSOP01], it has not been clear whether it serves more than a mechanical func- tion. We found that 6 out of its 24 neighbor genes are involved in cell adhesion (p-value 7:7£10 ¡5 ). In fact, a recent study found that MFAP2 binds to a neighbor gene Notch1 and activates it [MLH + 06]. In the di®erential networks, we selected the top 112 hub genes that have degrees ¸ 22, such that they together account for 30% of the total degrees on the di®erential co-expression networks. Many of those genes fall into two main functional categories: 1) core processes of neoplastic states such as cell division and chromosome organization (36 genes); or 2) dynamic interactions between cancer cells and their microenvironment such as angiogenesis, immune response, and cell adhesion (24 genes). Genes falling into category (1) include TUBB (participates in microtubule-based movement; node degree 70), CKS1B (cell cycle; node degree 43), KPNA2 (regulation of DNA recombination; node degree 44), RRM1 (DNA replication; node degree 41), MAD2L1 (cell cycle; node 63 Figure 4.2 Node degree distribution on the di®erential co-expression networks. 64 degree49),andSOX2(establishment/maintenanceofchromatinarchitecture;nodedegree 30). Genes in category (2) include MAP2K7 (response to proin°ammatory cytokines; node degree 50), TYROBP (cellular defense response; node degree 37), CD4 (immune responses; node degree 45). They are all reported to play a role in cancer pathogenesis and progression. In addition, hub genes in the category (1) behave as hub genes across most of datasets, while those in the category (2) tend to be in hub genes only in solid tumor datasets. 4.2.2 Identi¯cation of pathway modules speci¯c to cancer or cancer subtypes The di®erential co-expression network provides a summary of co-expression links fre- quently active across all types of cancers. However, it does not provide clues as to which set of links tend to be simultaneously active and inactive under which types of cancer. That is, the edges of a di®erential co-expression network may not be active in the same subset of datasets. In fact, the largest connected component of the di®erential co-expression network contains 5944 edges, which comprises 98% of all the edges in the network. Thus,basedonconnectivityalonewecannotbreakthenetworkintofunctionally coherent and cancer-subtype speci¯c modules. To dissect the networks, we integrate two types of information, the co-expression dynamics and the network connectivity, to extract cancer-subtype speci¯c network mod- ules. First, we employ the second-order clustering approach to utilize the co-expression dynamics information. This includes two steps: (i) for any two genes connected with 65 an edge in the di®erential co-expression network, we calculate the expression correla- tion in each of the 32 cancer microarray datasets and store it in a vector, termed the ¯rst-order expression correlation pro¯le of the genes; (ii) we then perform hierarchical clustering of all the gene pairs based on the Euclidean distance between the ¯rst-order expression correlation pro¯les. Unlike commonly used clustering approaches, the unit of the second-order clustering is a gene pair instead of a gene, and the distance between units is computed based on the ¯rst-order expression correlation pro¯les instead of the original gene expression pro¯les, hence the term \second-order" clustering [ZKH + 05]. Since each edge represents a frequently occurring co-expression relationship in multiple cancer datasets, it likely represents a functional link. If a cluster of gene pairs follows the same co-expression pattern across multiple cancer datasets, it represents a module of functional links being turned on or o® simultaneously across di®erent cancer phenotypes. Givenasecond-orderclusterofgenepairs,wefurtheridentifyconnectednetworkcom- ponentsamongthem. Wesuggestthatasetofgenepairsismorelikelytobefunctionally related if they form a connected component. We found that the disjoint network mod- ules of same second-order cluster generally fall into di®erent functional categories. From the 162 second-order clusters, we measure the functional similarity of genes within each connect network, and between networks of the same cluster using number of shared Gene Ontology functions. A t-test of the distinction in functional similarities gives p-value 1:6£10 ¡8 on our selected network modules shows that the genes tend to be signi¯cantly more functionally coherent within a network than between networks. Thus each network is more likely to represent a pathway. But in addition, these networks are in the same 66 second-order cluster, this indicates that these networks are activated together in certain phenotypes. Given a second-order hierarchical clustering tree, we traverse the tree bottom up to retrieve connected network components. In general, the size of a connected component (S, the number of edges) decreases with the second-order diameter (D), de¯ned as the largest pairwise second-order distance. We found that S and D show a linear scaling relationship in a logarithm scale (Figure 4.3). We are especially interested in outliers - network components small in D but large in S, which represent tightly clustered network modules relative to their size. We de¯ne the modularity score ¸ of a subnetwork using a linear scaling model ¸ = ®log 2 (S)¡log 2 (D)¡¯, where ® and ¯ are estimated using linear ¯tting. With our data, we obtain ® = 0:13 and ¯ = 2:2. We select the top 60% of networks (S ¸ 4) ranked by ¸ scores, removing those networks having D · 0:34, and merging heavily overlapping networks. This procedure resulted in 162 second-order clusterscomprising224connectednetworkmodules, withsizerangingfrom4to64edges. 175 (78%) modules are statistically signi¯cantly functionally homogenous based on the GeneOntology Biological Process annotation (hypergeometric test p-value < 0.01). The most predominant functional categories are cell cycle, cell division, cell proliferation, responsetostress,immuneresponseandcelladhesionconsistentwithknownpathological mechanisms of cancer. One main feature of our approach is that it can simultaneously discover network modules and the types of cancer in which the modules are activated. Figure 4.4a shows a module that is activated in most of the cancer datasets. The genes of the module are mostly involved in cell division and genetic stability, representing a cell proliferation 67 Figure4.3 ThelinearrelationshipbetweennetworksizeS andclusterdiameterD. Each dot in the ¯gure corresponds to a connected network of size¸ 4. The line is ¯tted using networks with size¸ 10. 68 signature, a key feature of cancer. Figure 4.4c shows a network module which tends to be activated in only solid tumors. The genes of the module are mostly involved in cell adhesion and organogenesis, which is speci¯c to solid tumor versus blood cancers or neoplastic cell lines. In the next section, we will detail two network modules which are activated predominately in breast cancer data sets. 4.2.3 Network modules in the breast cancer cluster Ouranalysisresultedinasecond-orderclustercontainingtwoconnectednetworkmodules that tend to be more active in all seven breast tumor datasets relative to the rest of the datasets. The average correlation of these modules in breast tumor and other cancer datasets are 0.49 and 0.23 respectively (the t-test of co-expressions between breast tumor and the rest of cancer datasets gives a p-value of 1:56£10 ¡95 ). 4.2.3.1 A tumor suppressor network related to PDGF superfamily signaling The module in Figure 4.5a contains 52 genes. Most of them are extracellular or mem- brane proteins, and 23 genes have previously been found to be involved in breast cancer. Following are examples of genes in the breast tumor suppressor modules that known to related to breast cancer : connective tissue growth factor (CTGF) is over-expressed in breastcancercelllineswithincreasedmetastaticactivity[KSS + 03]; Fibulin1(FBLN1)is implicatedinimmuneresponseagainstbreastcancer[PAF + 03], andoneofitssplicevari- ants is over-expressed in breast [BMM + 05]; the cysteine-rich secreted protein (SPARC) playsacrucialroleintumordevelopmentinbreastcancer[WDJB + 05];GAS1isalsofound 69 A B C Figure 4.4 General cancer and solid tumor network modules. (A) A network module thattendtoactivateinmostofcancerdatasets,consisting24genesand28edges. Average correlation across all data sets is 0.42. Most of genes in the module are related to cell division and genetic stability.(B) Another network module that is activated in most of cancerdatasets,consisting9genesand9edges. Themoduleislocatedinthesamesecond order cluster as the one in ¯gure 2a. Its average correlation across all datasets is 0.39. Most of genes in the module are related to nucleobase, nucleoside, nucleotide and nucleic acid metabolism. (C) Left : a network that tends to be activated only in solid tumor datasets. Right, the co-expression heatmap of the edges across datasets. Six datasets are not shown in the heatmap due to lack of valid co-expression estimations. Average correlation in solid tumor datasets and other datasets are 0.61 and 0.17, respectively. 70 to be induced in apoptotic mammary gland cells [SBB + 05]; Cysteine-rich angiogenic in- ducer 61 (CYR61) is involved in the proliferation, cell survival, and Taxol resistance of breastcancer[MVM + 04]; ¯nally, inthecase ofahubgeneLRP1(lowdensitylipoprotein receptor-related protein 1), the T allele of the C766T polymorphism is associated with an increased risk of breast cancer development [BJ · Z + 03]. Amongthe52genes, 16areinvolvedincelladhesion(p-value7:4£10 ¡11 ),and14are involved cell-cell signalling (p-value 7:9£10 ¡6 ), suggesting a role in tumor invasiveness of the module [RLS + 04]. General cancer and solid tumor network modules Most interestingly, we found one main function of this network module appears to be tumor suppression via the inhibi- tion of PDGF superfamily signaling. A hub gene with high degree of this module is the gene PDGF receptor-like (PDGFRL) (degree 11). While its precise biological function is not known, PDGFRL encodes a 375aa product with signi¯cant sequence similarity to the extracellular domain of PDGFR. Indeed, mutations in PDGFRL have been found in individual cancer samples [KSU + 97,LOT + 99,ALG + 02,KLK + 03]. PDGFRL is lo- cated in chromosome 8p22-8p21.3, where multiple studies have suggested the existence of a putative breast cancer tumor suppressor gene [YKL + 96,SWF + 00,RASB + 03]. Re- cently, an in-depth study of the region using microcell-mediated chromosome transfer found that indeed PDGFRL expression is decreased in the majority of breast cancer cells [SKW + 06]. Many genes in this network module have been found to be involved in PDGF superfamily signaling. For example, Cysteine-rich protein (SPARC) binds to PDGF-AB and PDGF-BB dimers, inhibiting the binding of these growth factors to their cell surface receptors [RLIA + 92] and inhibiting PDGF-induced vascular smooth muscle 71 A B Figure 4.5 Coordinated breast tumor network modules. (A) A breast tumor network module that involved in PDGF signalling. Genes in round shape have promoter regions that are predicted to be bound by transcription factor ZNF148. (B) A breast tumor net- work module related to in°ammatory response, located in the same second-order cluster as the one in Figure 3a. Genes in round shape have promoter regions that are predicted to be bound by transcription factor POU2F1. 72 proliferation [MFK + 02]. Also, connective tissue growth factor (CTGF) has structural similarities with PDGF [KMK + 05]. Recently, two new PDGF ligands were discovered that have a N-terminal complement subcomponent C1R/C1S, UEGf, BMP1 (CUB) do- main [UK05]. Interestingly both C1R and C1S are members of this network module. Also, LRP1 is a physiological modulator of the PDGF signaling pathway [TMAH05]. In total, we found direct Pubmed literature support for 19 out of the 52 member genes to be involvedin (or related to) the PDGF-superfamily signalling. Furthermore, it is known that zinc ¯nger protein (ZNF148) binds to the PDGFR [DBSS + 05] gene promoter. We screened the promoter regions of the 52 member genes, and found that the binding sites of ZNF148 are signi¯cantly enriched (hypergeometric p-value 0.016). In addition, it has been reported that PDGFR positively regulates collagen production (for exam- ple [CRH + 07,ZBC + 06]). Our module showed the breast tumor speci¯c co-expression between PDGFRL and collagens COL3A1, COL5A2 and COL6A3. Many of the above evidences support the hypothesis that PDGFRL has an agonistic function to PDGFR signaling. Although abundant evidence suggests the individual involvement of many of these genes in tumor suppression, their coordinated function has never been elucidated. Here by adopting a network perspective, we have identi¯ed their interrelationships and a gene (PDGFRL) that may play a central role in this tumor suppressor network. 73 4.2.3.2 A network module related to in°ammatory response Another identi¯ed breast cancer-speci¯c network module (Figure 4.5b) may be involved in the coordination of the in°ammatory response to cancer pathology. This network con- sists of 5 genes: tumor necrosis factor receptor superfamily member 1B (TNFRSF1B); vascularcelladhesionmolecule1(VCAM1);leukocyte-associatedimmunoglobulin-likere- ceptor1(LAIR1);andCathepsinL1(CSTL).Theyarearrangedaroundthemacrophage- associated antigen (CD163). Several lines of evidence have implicated these genes indi- vidually in breast cancer, for example: increased plasma levels of VCAM1 is associated withadvancedbreastcancer [SGC + 06]; geneticvariationinTNFRSF1Bmaypredictthe late onset of breast carcinoma, and relapse and death for patients with breast carcinoma [MBBAC05]; ¯nally,thebreastcancercelllineexhibitingthehighestinvitroinvasiveness also expressed the highest amount of CTSL1 splice variant L-A3 [CKSL06]. Most of the genes of this module are related to tumor necrosis factor (TNF), an in°ammatory cytokine. It has been reported that activation of rat CD163 on peritoneal macrophages induces the production of pro-in°ammatory mediators including TNF [PFD + 06]; TNFdirectlyinteractswithTNFRSF1B [MERS96]andisamediatorofTNF function in the mouse ovary [GRP + 07]. Finally, transcription of VCAM1 in endothelial cells can be induced by TNF [NRT + 95]. A transcription factor, octamer-binding transcription factor-1 (POU2F1, also known as Oct-1) has been predicted to bind promoter region of VCAM1, LAIR1 and CTSL (hypergeometric p-value 0.012). The binding of POU2F1 to VCAM1 promoter is indeed 74 supported by the literature [SRJ + 98,SOG + 03,OHB + 02]. Also, POU2F1 has been found to bind SP3 [LLC + 03], which is reported to activate the transcription of CSTL [SR04]. The tight and coordinated expression of the genes in this network module reveals an induced in°ammatory response that may be important in breast tumor progression [Cou02]. 4.2.4 Identi¯cation of pathway coordination beyond co-expression A major advantage of second-order clustering is that it can identify functionally related genes beyond co-expression, as illustrated in Figure 4.1b. We elaborate on this point in this section. To allow readers to easily assess the magnitude of the correlation, only in this section we use the Pearson correlation to measure the co-expression level. Based on the de¯nition of second-order clustering, connected network components within the same second-order cluster show coordinated activities, which implies their functional relevance. In the example of the two modules in the general cancer cluster in Figure 4.4a and 2b, each module may play di®erent roles in the regulation of speci¯c biological processes { cell division and nucleic acid metabolism, respectively; the latter is clearly required for cell division. Given the fact that these processes belong to the same second-order cluster, they may represent facets of the same underlying neoplastic process. However, member genes of the two modules exhibit distinct expression patterns: The average Pearson correlation between genes of the two modules is only 0.13. As another example, the two breast cancer modules described in last section are also related. Indeed, the collaboration of PDGF signalling and TNF have long been known to be required for tissue repairing [OMF04], and their abnormal expression play important 75 but partially de¯ned roles in breast tumor development and progression [CMI + 06]. On the other hand, member genes of the two modules exhibit relatively distinct expression patterns. For example, two hub genes of modules, PDGFRL and CD163, also show very weak expression similarities across all seven breast cancer datasets: the average Pearson correlation between these two genes is only 0.24. Besides the above examples, we found coordinated modules within the majority of identi¯ed second-order clusters. Overall, from the total 162 second-order clusters, 25% give rise to more than one connected network module. To estimate the amount of cross module co-expression within second-order clusters, for each second order cluster, we ¯rst determined the active cancer datasets, in which the average Pearson correlation of gene pairs in the cluster is greater than 0.5. For 72% of those module pairs within the same second-order cluster, the average gene pairwise Pearson correlation between modules in theactivedatasetsislessthan0.5(normalizedPercentageBendcorrelationapproximately < 0.35), and for 30% module pairs the cross-module average gene pairwise Pearson cor- relation is less than 0.3 (normalized Percentage Bend correlation approximately < 0.19). Furthermore, even genes in the same network module are not necessarily highly co- expressedwhenthemoduleisactive, despitetheirhighdegreeoffunctionalhomogeneity, as discussed previously. We found that in 32 of 224 (14 %) modules, the average pairwise Pearson correlation of any two genes in the module is < 0.5 in the corresponding active datasets. Such modules could therefore easily be overlooked by traditional clustering methods. 76 4.3 Discussion Therapidaccumulationofmicroarraydataprovidesunprecedentedopportunitiestostudy the molecular mechanisms underlying disease pathogenesis and progression. Although many studies utilized multiple microarray datasets to derive consistent lists of genes speci¯c for (subtypes of) cancer [RBR + 02,RYS + 04,KCYCGK + 04], little attention has been paid to derive genetic networks characterizing di®erent types of cancer. Segal et al. [SFKR04] used prede¯ned biologically meaningful gene sets including known biological pathways, and have successfully identi¯ed activated or repressed biological modules in a wide variety of neoplastic conditions. The approach, however, relies on the knowledge of pre-de¯ned biological modules and has limited use in the discovery of novel association betweengenes. ArecentstudybyChoietal. [CYYK05]comparedthetwoco-expression networks summarized from 10 tumor and normal datasets, respectively, and have identi- ¯ed functional di®erences between normal growth and cancer in terms of gene coexpres- sion changes in broad areas of physiology. However, due to the multifaceted nature of cancer,interactionsinsuchaderivedsummarynetworkmaynotbesimultaneouslyactive in individual datasets, i.e. speci¯c cancer conditions. In this study, we propose an unsupervised method that integrates both co-expression dynamics and network topology information to characterize cancer (subtype) speci¯c network modules. The identi¯ed modules, such as modules activated across all can- cer subtypes or only in solid tumors, are novel, but consistent with known molecular mechanisms. Importantly, we have discovered a potential tumor suppressor network par- ticularly active in breast tumors, and provide compelling evidence that the hub gene 77 PDGFRL is a true tumor suppressor gene. Compared to commonly used di®erential or co-expression analysis, our approach has the following advantages: (1) our unsupervised approach simultaneously discovers network modules and the conditions (e.g. cancer sub- types) in which they are activated, thus providing new insights into the complex cellular mechanisms that characterize cancer and cancer subtypes. (2) Compared to existing approaches [CYYK05,LCL + 06,CBGB04,SKFW03,BTS + 00] which can only identify densely connected network modules based solely on network topology information, our approachincorporatingco-expressiondynamicsinformation(second-ordersimilarity)can extract more diverse types of modules regardless of network density. It is known that many biological pathways do not necessarily form densely connected modules. (3) Our approachcanrevealcoordinationofpathwaysbeyondco-expression. (4)Ourmethodcan be applied to any types of molecular networks beyond co-expression network, when data of multiple networks under di®erent conditions are available. In the current framework, the selection of biologically meaningful modules still need certain amount of manual intervention from biology expert. We are looking for more systematicwaysformoduleselection,especiallybyputtingtheframeworkintothecontext of network statistics to improve the robustness. For example, although the scaling model we constructed is based on direct observation of the distribution of network properties S and D, their log-linear relationship suggests that it should be straightforward to make use of the exponential random graph models that have been used in recent years to study statistical aspects of networks [CSW05]. In essence, such models linearly combine network properties and assign the probability of observed networks as the exponential of such linear combinations. Integrating this work with the exponential family probabilistic 78 models may provide both better estimation of the coe±cients in the linear scaling model and more accurate selection of network modules via hypothesis testing. We intend to explore this direction in future work. The choice of datasets depends on the research question. Ideally there should be a balanced and su±cient sampling of di®erent phenotypes (in particular di®erent tissues for this cancer study). Particularly, a paired Wilcoxon test between cancer and normal samples of the same tissue would signi¯cantly eliminate the amount of tissue-speci¯c co- expressions. However, due to the limited amount and the heterogeneity of existing data it is currently impractical to achieve this goal. Using a weighted sampling scheme could potentially bypass the imbalance e®ects. We aim to investigate this strategy, and use statistical models of the correlation values to determine the weighting factors. 4.4 Methods 4.4.1 Datasets Wecurated32cancerand23non-cancerhumangeneexpressiondatasetsmainlyfromthe Stanford Microarray Database (SMD) and Gene Expression Omnibus (GEO) databases, each containing more than 15 microarrays, on either A®ymetrix or cDNA platforms. In eachdataset,iftherearemultipleprobesthatcorrespondtothesamegene,wechoosethe one that contains the least amount of missing values. For datasets containing absolute expression measurements, we convert all values · 10 to 10, then perform a base 2 log transform. Estimation of Pairwise gene co-expression We used Percentage Bend Correla- tion [Wil05] (with ¯ = 0:05) to obtain a robust correlation estimate. Percentage Bend 79 Correlation ¯rst detects outliers in expression values of each gene then reduces the e®ects of those outliers in the correlation calculation. Only gene pairs with a large number of validsamplesm¸15areusedtocalculatecorrelation. Tomakethecorrelationestimates comparable across di®erent datasets of variable sample sizes and among di®erent gene pairs of di®erent amount of missing expression measures within the same dataset, we performed Fisher's z-transform [And] to reduce sample size e®ect. Given a correlation estimate r and sample size n, the Fisher's Z scores (divided by its theoretical standard deviation) is calculated as z = p n¡3 2 log( 1+r 1¡r ), which theoretically has an asymptotically standard normal distribution. Note that sample sizen maybe di®erentfrom gene pair to gene pair due to missing values, and from dataset to dataset. In reality, we observed that the distributions of the z-score are still di®erent from dataset to dataset: we therefore normalizedz-scorestoenforcethestandardnormaldistribution. Afterthat,standardized correlations r 0 are obtained by inverting the z-score with a ¯xed n of 30. 4.4.2 Select di®erentially co-expressed gene pairs We de¯ne a gene pair to be di®erentially co-expressed between cancer and non-cancer if it satis¯es the following two criteria: (1) their expression correlation in cancer datasets is su±ciently strong (can be either positively or negatively high). This is done by setting threshold for average summed square of correlations in cancer datasets. i.e, q 1 c P k (r (k) i ) 2 < 0:35 for the gene pair i, where there are c valid correlation estima- tions (c = 32 if there are no missing values) and k is dataset index corresponding to all valid correlation estimations; and (2) the Wilcoxon ranksum test of correlations between cancer and non-cancer datasets gives a p-value· 0.01. 80 4.4.3 Identify conditionally activated network module candidates We hierarchically clustered the di®erentially co-expressed gene pairs based on their ex- pression correlation pro¯les using the CLUSTER program [dHINM04] with complete linkage and Euclidean distance. The Euclidean distance is averaged to provide a simple estimationgiventheexistenceofmissingcorrelations(duetothemissingvalueproblem). d i;j = q 1 c P k (r (k) i ¡r (k) j ) where r (k) i and r (k) j are the correlations of gene pairs i and j in the dataset k, respectively, and c is number of valid correlations. In the hierarchical tree, each leaf node represents a gene pair, and each inner node corresponds to a second-order cluster of gene pairs (edges) which may comprise zero, one, or more connected network components. In cases where the size of the di®erential network is too big to be processed usinghierarchicalclustering(HC),thegenepairswere¯rstseparatedusingk-meansclus- teringthenprocessedthesmallerclustersseparatelybyhierarchicalclustering. Asforour experience,thebiologicallymeaningfulmodulesnormallycontainlessthanafewhundred edges, thus k-means clustering will keep most of modules intact. 4.4.4 Gene ontology function and transcription factor enrichment of modules The functional enrichment analysis is done by the hypergeometric test on genes. We selected 419 Gene Ontology (GO) functions (i.e. biological process terms) which are 4 levels below the root in the GO hierarchy. Each gene may be directly or indirectly associated with some of these functions. A set of genes will be considered to have a enriched function when (1) the functional homogeneity modeled by the hypergeometric 81 distribution [WHD + 02]issigni¯cantatasigni¯cancelevel0.01and(2)thereareatleast 2 genes in the set are associated with the function. 4.4.5 Identi¯cation of transcription factor binding 10kb upstream sequences for each gene were obtained from NCBI Gene database. Af- ter applying RepeatMasker [SHG04], we used the MATCH program of TRANSFAC [MFG + 03] (version 9.2) to scan the sequences for the presence of transcription factor binding sites based on position weight matrices. We used vertebrate-speci¯c matrices, and chose cut-o®s to minimize the sum of false positives and false negatives. We kept only the top 3,000 hits per matrix, sorted by the matrix similarity score. Altogether we obtained 349,178 predicted transcription factor target relationships of 180 transcription factors. A hypergeometric test was performed for each network module to search for over-represented transcription factor binding sites. 82 References [ABN + 99] U. Alon, N. Barkai, DA Notterman, K. Gish, S. Ybarra, D. Mack, and AJ Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, 1999. [ABO + 00] M. Albitar, M. Beran, S. O'Brien, H. Kantarjian, E. Frieriech, M. Keat- ing, and E. Estey. Di®erences between refractory anemia with excess blasts in transformation and acute myeloid leukemia. Blood, 96(1):372, 2000. [ALG + 02] Q. An, Y. Liu, Y. Gao, J. Huang, X. Fong, L. Liu, D. Zhang, J. Zhang, and S. Cheng. Deletion of tumor suppressor genes in Chinese non-small cell lung cancer. Cancer letters, 184(2):189, 2002. [AM02] C. Ambroise and G.J. McLachlan. Selection bias in gene extraction on thebasisofmicroarraygene-expressiondata. ProceedingsoftheNational Academy of Sciences, 99(10):6562, 2002. [And] TW Anderson. An Introduction to Multivariate Statistical Analysis. 2003. [Aro01] A.R. Aronson. E®ective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp, 17(21):17{ 21, 2001. [BBG + 05] A. Balog, Z. Borb¶ enyi, Z. Gyulai, L. Molnar, and Y. Mandi. Clinical Importance of Transforming Growth Factor-b but Not of Tumor Necro- sis Factor-a Gene Polymorphisms in Patients with the Myelodysplastic Syndrome Belonging to the Refractory Anemia Subtype. Pathobiology, 72(3):165, 2005. [BJ · Z + 03] P. Bene· s, M. Jurajda, J. · Zaloud¶ ³k, L. Izakovi· cov¶ a-Holl¶ a, and J. V¶ acha. C766T low-density lipoprotein receptor-related protein 1 (LRP1) gene polymorphism and susceptibility to breast cancer. Breast Cancer Res, 5(3):R77{R81, 2003. [BK06] A.J. Butte and I.S. Kohane. Creation and implications of a phenome- genome network. Nature biotechnology, 24:55{62, 2006. 83 [BMC + 00] M. Bittner, P. Meltzer, Y. Chen, Y. Jiang, E. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor, et al. Molecular classi¯cation of cutaneous malignant melanoma by gene expression pro- ¯ling. Nature, 406(6795):536{540, 2000. [BMM + 05] A.Bardin, F.Moll, R.Margueron, C.Delfour, MLChu, T.Maudelonde, V.Cavailles, andP.Pujol. Transcriptionalandposttranscriptionalregu- lationof¯bulin-1byestrogensleadstodi®erentialinductionofmessenger ribonucleic acid variants in ovarian and breast cancer cells. Endocrinol- ogy, 146(2):760{768, 2005. [Bod] O. Bodenreider. The Uni¯ed Medical Language System (UMLS): inte- grating biomedical terminology. Nucleic Acids Research, 32(90001):267{ 270. [BSS93] Mokhtar S. Bazaraa, Hanif D. Sherali, and C. M. Shetty. Nonlinear programming : theory and algorithms. Wiley, 2. ed edition, 1993. [BTGM04] J.P. Brunet, P. Tamayo, T.R. Golub, and J.P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences, 101(12):4164{4169, 2004. [BTS + 00] A.J.Butte,P.Tamayo,D.Slonim,T.R.Golub,andI.S.Kohane. Discov- ering functional relationships between RNA expression and chemother- apeutic susceptibility using relevance networks. Proceedings of the Na- tional Academy of Sciences, 97(22):12182, 2000. [BTW + 07] T. Barrett, D.B. Troup, S.E. Wilhite, P. Ledoux, D. Rudnev, C. Evan- gelista, I.F. Kim, A. Soboleva, M. Tomashevsky, and R. Edgar. NCBI GEO: mining tens of millions of expression pro¯les{database and tools update. Nucleic Acids Research, 35(Database issue):D760, 2007. [BvD04] HG Brunner and MA van Driel. From syndrome families to functional genomics. Nat Rev Genet, 5(7):545{51, 2004. [CBGB04] S.L. Carter, C.M. Brechbuhler, M. Gri±n, and A.T. Bond. Gene co- expression network topology provides a framework for molecular charac- terization of cellular state, 2004. [CCS + 02] X. Chen, S.T. Cheung, S. So, S.T. Fan, C. Barry, J. Higgins, K.M. Lai, J. Ji, S. Dudoit, I.O.L. Ng, et al. Gene expression patterns in human liver cancers. Gene Expression Patterns, 13(6):1929{1939, 2002. [CKSL06] S. Caserman, S. Kenig, B.F. Sloane, and T.T. Lah. Cathepsin L splice variants in human breast cell lines. Biological Chemistry, 387(5):629, 2006. 84 [CMI + 06] E.Chala,C.Manes,H.Iliades,G.Skaragkas,D.Mouratidou,andE.Ka- pantais. Insulin resistance, growth factors and cytokine levels in over- weight women with breast cancer before and after chemotherapy. Hor- mones, 5(2):137{146, 2006. [Cou02] Z Coussens, L.and Werb. In°ammation and cancer. Nature, 420(6917):860{867, 2002. [Cov65] T.M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE transactions on electronic computers, 14(3):326{334, 1965. [CRH + 07] NI Chaudhary, GJ Roth, F. Hilberg, J. Muller-Quernheim, A. Prasse, G. Zissel, A. Schnapp, and JE Park. Inhibition of PDGF, VEGF and FGF signalling attenuates ¯brosis. European Respiratory Journal, 29(5):976, 2007. [CSW05] P.J. Carrington, J. Scott, and S. Wasserman. Models and methods in social network analysis. Cambridge University Press, 2005. [CYYK05] J.K. Choi, U. Yu, O.J. Yoo, and S. Kim. Di®erential coexpression anal- ysis using microarray data and its application to human cancer. Bioin- formatics, 21(24):4348{4355, 2005. [DB02] M. Dettling and P. Buhlmann. Supervised clustering of genes. Genome Biology, 3(12):1{0069, 2002. [DBSS + 05] C. De Bustos, A. Smits, B. Stromberg, VP Collins, M. Nister, and G.A¯nk. APDGFRApromoterpolymorphism,whichdisruptsthebind- ing of ZNF148, is associated with primitive neuroectodermal tumours and ependymomas, 2005. [DD03] M.W.N. Deininger and B.J. Druker. Speci¯c Targeted Therapy of Chronic Myelogenous Leukemia with Imatinib. Pharmacological Re- views, 55(3):401{423, 2003. [dHINM04] M.J.L. de Hoon, S. Imoto, J. Nolan, and S. Miyano. Open source clus- tering software. Bioinformatics, 20(9):1453, 2004. [DLCBPS + 08] M. De Lourdes Chau®aille, D. Borri, R. Proto-Siqueira, ES Moreira, and FL Alberto. Acute promyelocytic leukemia with t (15; 17): fre- quency of additional clonal chromosome abnormalities and FLT3 muta- tions. Leukemia & lymphoma, 49(12):2387, 2008. [FF08] J. Fan and Y. Fan. High dimensional classi¯cation using features an- nealed independence rules. Annals of statistics, 36(6):2605, 2008. [FS03] N. Freimer and C. Sabatti. The Human Phenome Project. Nature Ge- netics, 34(1):15{21, 2003. 85 [FSMJ03] C. Furlanello, M. Sera¯ni, S. Merler, and G. Jurman. Entropy-based gene ranking without selection bias for the predictive classi¯cation of microarray data. BMC bioinformatics, 4(1):54, 2003. [Gen] G.Z.M. Gen. Evolutionary Computation on Multicriteria Production Process Planning Problem. In Proceedings of the... IEEE Conference on Evolutionary Computation, volume 14, page 419. IEEE. [GRP + 07] C.R. Greenfeld, K.F. Roby, M.E. Pepling, J.K. Babus, P.F. Terranova, and J.A. Flaws. Tumor necrosis factor (TNF) receptor type 2 is an important mediator of TNF alpha function in the mouse ovary. Biology of reproduction, 76(2):224, 2007. [GST + 99] TR Golub, DK Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, JPMesirov,H.Coller,MLLoh,JRDowning,MACaligiuri,etal. Molec- ular classi¯cation of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531, 1999. [GWBV02] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classi¯cation using support vector machines. Machine learning, 46(1):389{422, 2002. [JMRWK04] I.K. Jordan, L. Marino-Ramirez, Y.I. Wolf, and E.V. Koonin. Conserva- tion and coevolution in the scale-free human gene coexpression network. Molecular biology and evolution, 21(11):2058{2070, 2004. [JSOP01] MP Jacob, M. Sauvage, and M. Osborne-Pellegrin. Regulation of elastin synthesis. Journal de la Soci¶ et¶ e de biologie, 195(2):131, 2001. [JSR02] J. JÄ ager, R. Sengupta, and W.L. Ruzzo. Improved gene selection for classi¯cation of microarrays. Biocomputing 2003, page 53, 2002. [JY03] R.JornstenandB.Yu. Simultaneousgeneclusteringandsubsetselection for sample classi¯cation via MDL, 2003. [KCYCGK + 04] J.KyoonChoi,J.YoungChoi,D.GhonKim,D.WookChoi,B.YeoKim, K. Ho Lee, Y. Il Yeom, H. Sook Yoo, O. Joon Yoo, and S. Kim. Integra- tive analysis of multiple gene expression pro¯les applied to liver cancer study. FEBS letters, 565(1-3):93{100, 2004. [KLK + 03] Y.S. Kahng, Y.S. Lee, B.K. Kim, W.S. Park, J.Y. Lee, and C.S. Kang. Lossofheterozygosityofchromosome8pand11pinthedysplasticnodule and hepatocellular carcinoma. Journal of Gastroenterology & Hepatol- ogy, 18(4):430, 2003. [KMK + 05] S. Kanazawa, T. Miyake, T. Kakinuma, K. Tanemoto, T. Tsunoda, and K.Kikuchi. Theexpressionofplatelet-derivedgrowthfactorandconnec- tivetissuegrowthfactorindi®erenttypesofabdominalaorticaneurysms. The Journal of cardiovascular surgery, 46(3):271, 2005. 86 [KR90] L. Kaufman and P.J. Rousseeuw. Finding groups in data: an introduc- tion to cluster analysis. New York, 1990. [KSS + 03] Y. Kang, P.M. Siegel, W. Shu, M. Drobnjak, S.M. Kakonen, C. Cord¶ on- Cardo, T.A. Guise, and J. Massagu¶ e. A multigenic program mediating breast cancer metastasis to bone. Cancer Cell, 3(6):537{549, 2003. [KSU + 97] A. Komiya, H. Suzuki, T. Ueda, S. Aida, N. Ito, T. Shiraishi, R. Yatani, M. Emi, K. Yasuda, J. Shimazaki, et al. PRLTS gene alterations in human prostate cancer. Cancer Science, 88(4):389{393, 1997. [KWR + 01] J. Khan, J.S. Wei, M. Ringn¶ er, L.H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C.R. Antonescu, C. Peterson, et al. Classi¯ca- tion and diagnostic prediction of cancers using gene expression pro¯ling and arti¯cial neural networks. Nature medicine, 7(6):673{679, 2001. [LCL + 06] C.C. Liu, W.S.E. Chen, C.C. Lin, H.C. Liu, H.Y. Chen, P.C. Yang, P.C. Chang, and J.J.W. Chen. Topology-based cancer classi¯cation and related pathway mining using microarray data. Nucleic acids research, 34(14):4069, 2006. [LKS + 07] K. Lage, E.O. Karlberg, Z.M. Storling, P. ¶ I. ¶ Olason, A.G. Pedersen, O. Rigina, A.M. Hinsby, Z. TÄ umer, F. Pociot, N. Tommerup, et al. A humanphenome-interactomenetworkofproteincomplexesimplicatedin genetic disorders. Nature Biotechnology, 25:309{316, 2007. [LL07] Y.A. Lussier and Y. Liu. Computational Approaches to Phenotyping: High-Throughput Phenomics. Proceedings of the American Thoraic So- ciety, 4(1):18, 2007. [LLC + 03] M. Liu, J.L. Leibowitz, D.A. Clark, M. Mendicino, Q. Ning, J.W. Ding, C. D'Abreo, L. Fung, P.A. Marsden, and G.A. Levy. Gene transcription of fgl2 in endothelial cells is controlled by Ets-1 and Oct-1 and requires the presence of both Sp1 and Sp3. European Journal of Biochemistry, 270(10):2274{2286, 2003. [LMA06] Z. Liu, K. Maas, and T.M. Aune. Identi¯cation of gene expression sig- natures in autoimmune disease without the in°uence of familial resem- blance. Human Molecular Genetics, 15(3):501{509, 2006. [LOT + 99] F. Lerebours, S. Olschwang, B. Thuille, A. Schmitz, P. Fouchet, B. Buecher, N. Martinet, F. Galateau, and G. Thomas. Fine deletion mapping of chromosome 8p in non-small-cell lung carcinoma. Interna- tional Journal of Cancer, 81(6), 1999. [MBBAC05] S. Mestiri, N. Bouaouina, S. Ben Ahmed, and L. Chouchane. A func- tional polymorphism of the tumor necrosis factor receptor-II gene as- sociated with the survival and relapse prediction of breast carcinoma. Cytokine, 30(4):182{187, 2005. 87 [MERS96] A.E. Medvedev, T. Espevik, G. Ranges, and A. Sundan. Distinct roles of the two tumor necrosis factor (TNF) receptors in modulating TNF and lymphotoxin alpha e®ects. Journal of Biological Chemistry, 271(16):9778, 1996. [MFG + 03] V. Matys, E. Fricke, R. Ge®ers, E. Gossling, M. Haubrock, R. Hehl, K.Hornischer, D.Karas, AEKel, OVKel-Margoulis, etal. TRANSFAC (R): transcriptional regulation, from patterns to pro¯les. Nucleic Acids Research, 31(1):374, 2003. [MFK + 02] K. Motamed, S.E. Funk, H. Koyama, R. Ross, E.W. Raines, and E.H. Sage. Inhibition of PDGF-stimulated and matrix-mediated proliferation of human vascular smooth muscle cells by SPARC is independent of changes in cell shape or cyclin-dependent kinase inhibitors. Journal of cellular biochemistry, 84(4), 2002. [MLH + 06] A. Miyamoto, R. Lau, P.W. Hein, J.M. Shipley, and G. Weinmaster. Micro¯brillar proteins MAGP-1 and MAGP-2 induce Notch1 extracel- lular domain dissociation and receptor activation. Journal of Biological Chemistry, 281(15):10089, 2006. [MR03] Xu M. and Setiono R. Gene selection for cancer classi¯cation using a hybrid of univariate and multivariate feature selection methods. Applied Genomics and Proteomics, 2:79{91, 2003. [MVM + 04] J.A. Menendez, L. Vellon, I. Mehmi, P.K. Teng, D.W. Griggs, and R. Lupu. A novel CYR61-triggered \CYR61-®v¯3 integrin loop" regu- lates breast cancer cell survival and chemosensitivity through activation of ERK1/ERK2 MAPK signaling pathway. Oncogene, 24(5):761{779, 2004. [NASL01] D.A. Notterman, U. Alon, A.J. Sierk, and A.J. Levine. Transcriptional geneexpressionpro¯lesofcolorectaladenoma,adenocarcinoma,andnor- mal tissue examined by oligonucleotide arrays, 2001. [NRT + 95] AS Neish, MA Read, D. Thanos, R. Pine, T. Maniatis, and T. Collins. Endothelialinterferonregulatoryfactor1cooperateswithNF-kappaBas atranscriptionalactivatorofvascularcelladhesionmolecule1. Molecular and Cellular Biology, 15(5):2558{2569, 1995. [OHB + 02] M. Ortego, A.G. Hern¶ andez, C. Bustos, L.M. Blanco-Colio, M.A. Hern¶ andez-Presa, J. Tu~ n¶ on, and J. Egido. 3-Hydroxy-3-methylglutaryl coenzyme A reductase inhibitors increase the binding activity and nu- clear level of Oct-1 in mononuclear cells. European journal of pharma- cology, 448(2-3):113{121, 2002. [OHB08] M.Oti,M.A.Huynen,andH.G.Brunner. Phenomeconnections. Trends in Genetics, 24(3):103{106, 2008. 88 [OMF04] E. Ottaviani, D. Malagoli, and A. Franchini. Invertebrate humoral fac- tors: cytokines as mediators of cell survival. Invertebrate Cytokines and the Phylogeny of Immunity: Facts and Paradoxes, page 1, 2004. [PAF + 03] S.M. Pupa, S.W. Argraves, S. Forti, P. Casalini, V. Berno, R. Agresti, P. Aiello, A. Invernizzi, P. Baldassari, W. Otwal, et al. Immunologi- cal and pathobiological roles of ¯bulin-1 in breast cancer. Oncogene, 23(12):2153{2160, 2003. [PBG + 00] M. Pettersson, M. Bessonova, H.F. Gu, L.C. Groop, and J.I. JÄ onsson. Characterization, chromosomal localization, and expres- sion during hematopoietic di®erentiation of the gene encoding Arl6ip, ADP-ribosylation-like factor-6 interacting protein (ARL6). Genomics, 68(3):351{354, 2000. [PFD + 06] M.M.J. Pol°iet, B.O. Fabriek, W.P. DaniÄ els, C.D. Dijkstra, and T.K. van den Berg. The rat macrophage scavenger receptor CD163: expres- sion,regulationandroleinin°ammatorymediatorproduction. Immuno- biology, 211(6-8):419{425, 2006. [RASB + 03] K. Rennstam, M. Ahlstedt-Soini, B. Baldetorp, P.O. Bendahl, A. Borg, R.Karhu,M.Tanner,M.Tirkkonen,andJ.Isola. Patternsofchromoso- mal imbalances de¯nes subgroups of breast cancer with distinct clinical features and prognosis. A study of 305 tumors by comparative genomic hybridization, 2003. [RBR + 02] D.R. Rhodes, T.R. Barrette, M.A. Rubin, D. Ghosh, and A.M. Chin- naiyan. Meta-Analysis of Microarrays Interstudy Validation of Gene Expression Pro¯les Reveals Pathway Dysregulation in Prostate Cancer 1, 2002. [RLIA + 92] EW Raines, TF Lane, ML Iruela-Arispe, R. Ross, and EH Sage. The extracellular glycoprotein SPARC interacts with platelet-derived growth factor (PDGF)-AB and-BB and inhibits the binding of PDGF to its receptors. Proceedings of the National Academy of Sciences, 89(4):1281{ 1285, 1992. [RLS + 04] J.S. Ross, G.P. Linette, J. Stec, E. Clark, M. Ayers, N. Leschly, W.F. Symmans, G.N. Hortobagyi, and L. Pusztai. Breast cancer biomarkers and molecular medicine: part II. erm, 4(2):169{188, 2004. [RYS + 04] D.R. Rhodes, J. Yu, K. Shanker, N. Deshpande, R. Varambally, D. Ghosh, T. Barrette, A. Pandey, and A.M. Chinnaiyan. Large-scale meta-analysis of cancer microarray data identi¯es common transcrip- tional pro¯les of neoplastic transformation and progression. Proceedings of the National Academy of Sciences, 101(25):9309{9314, 2004. 89 [SB88] G.SaltonandC.Buckley. Term-weightingapproachesinautomatictext retrieval. Information Processing and Management: an International Journal, 24(5):513{523, 1988. [SBB + 05] MB Seol, JJ Bong, M. Baik, et al. Expression pro¯les of apoptosis genes in mammary epithelial cells. Molecules and Cells, 2005. [SC03] N.SevenetandO.Cussenot. DNAmicroarraysinclinicalpractice: past, present, and future. Clinical and experimental medicine, 3(1):1, 2003. [SFKR04] E.Segal, N.Friedman, D.Koller, andA.Regev. Amodulemapshowing conditional activity of expression modules in cancer. Nature genetics, 36:1090{1098, 2004. [SFR + 02] D. Singh, P.G. Febbo, K. Ross, D.G. Jackson, J. Manola, C. Ladd, P. Tamayo, A.A. Renshaw, A.V. D'Amico, J.P. Richie, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer cell, 1(2):203{209, 2002. [SGC + 06] HC Silva, F. Garcao, EC Coutinho, CF De Oliveira, and FJ Regateiro. Soluble VCAM-1 and E-selectin in breast cancer: relationship with staging and with the detection of circulating cancer cells. Neoplasma, 53(6):538{543, 2006. [SHG04] AFA Smit, R. Hubley, and P. Green. RepeatMasker Open-3.0. Institute for Systems Biology. http://www. repeatmasker. org (January 10, 2007), 2004. [SKFW03] R. Steuer, J. Kurths, O. Fiehn, and W. Weckwerth. Observing and interpreting correlations in metabolomic networks, 2003. [SKW + 06] S. Seitz, E. Korsching, J. Weimer, A. Jacobsen, N. Arnold, A. Meindl, W. Arnold, D. Gustavus, C. Klebig, I. Petersen, et al. Genetic back- ground of di®erent cancer cell lines in°uences the gene set involved in chromosome8mediatedbreasttumorsuppression. Genes,Chromosomes and Cancer, 45(6), 2006. [SOG + 03] A.J.Syder,J.D.Oh,J.L.Guruge,D.O'Donnell,M.Karlsson,J.C.Mills, B.M. Bjorkholm, and J.I. Gordon. The impact of parietal cells on Heli- cobacter pylori tropism and host pathology: an analysis using gnotobi- otic normal and transgenic mice. Proceedings of the National Academy of Sciences, 100(6):3467{3472, 2003. [SR04] V. Sriraman and J.A.S. Richards. Cathepsin L gene expression and pro- moter activation in rodent granulosa cells. Endocrinology, 145(2):582{ 591, 2004. 90 [SRJ + 98] J.L. Schwachtgen, J.E. Remacle, N. Janel, R. Brys, D. Huylebroeck, D. Meyer, and D. Kerbiriou-Nabias. Oct-1 is involved in the transcrip- tional repression of the von Willebrand factor gene promoter. Blood, 92(4):1247, 1998. [SRT + 02] M.A. Shipp, K.N. Ross, P. Tamayo, A.P. Weng, J.L. Kutok, R.C.T. Aguiar,M.Gaasenbeek,M.Angelo,M.Reich,G.S.Pinkus,etal. Di®use large B-cell lymphoma outcome prediction by gene-expression pro¯ling and supervised machine learning. Nature medicine, 8(1):68{74, 2002. [SWF + 00] S. Seitz, S. Werner, J. Fischer, A. Nothnagel, PM Schlag, and S. Scher- neck. Re¯ned deletion mapping in sporadic breast cancer at chromoso- malregion8p12-p21andassociationwithclinicopathologicalparameters. European Journal of Cancer, 36(12):1507{1513, 2000. [THC + 99] S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho, and G.M. Church. Systematicdeterminationofgeneticnetworkarchitecture. Nature genet- ics, 22:281{285, 1999. [TMAH05] Y. Takayama, P. May, R.G.W. Anderson, and J. Herz. Low den- sity lipoprotein receptor-related protein 1 (LRP1) controls endocytosis and c-CBL-mediated ubiquitination of the platelet-derived growth fac- tor receptor fbetag(PDGFR fbetag). Journal of Biological Chemistry, 280(18):18504, 2005. [TSE + 07] P.Tamayo,D.Scanfeld,B.L.Ebert,M.A.Gillette,C.W.M.Roberts,and J.P.Mesirov. Metageneprojectionforcross-platform, cross-specieschar- acterization of global transcriptional states. Proceedings of the National Academy of Sciences, 104(14):5959, 2007. [UK05] C.V. Ustach and H.R.C. Kim. Platelet-derived growth factor D is ac- tivated by urokinase plasminogen activator in prostate carcinoma cells. Molecular and Cellular Biology, 25(14):6279{6288, 2005. [Vap00] V.N. Vapnik. The nature of statistical learning theory. springer, 2000. [WBD + 01] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J.A. Olson Jr, J.R. Marks, and J.R. Nevins. Predicting the clinical status of human breast cancer by using gene expression pro¯les. Proceedings of the National Academy of Sciences, 98(20):11462, 2001. [WDJB + 05] G. Watkins, A. Douglas-Jones, R. Bryce, R. E Mansel, and W.G. Jiang. Increased levels of SPARC (osteonectin) in human breast cancer tissues and its association with clinical outcomes. Prostaglandins, Leukotrienes & Essential Fatty Acids, 72(4):267{272, 2005. [WHD + 02] L.F. Wu, T.R. Hughes, A.P. Davierwala, M.D. Robinson, R. Stoughton, and S.J. Altschuler. Large-scale prediction of Saccharomyces cerevisiae genefunctionusingoverlappingtranscriptionalclusters. nature genetics, 31:255{265, 2002. 91 [Wil05] R.R. Wilcox. Introduction to robust estimation and hypothesis testing. Academic Press, 2005. [XFZ01] M. Xiong, X. Fang, and J. Zhao. Biomarker identi¯cation by feature wrappers, 2001. [XKNI + 08] M.Xu, M.C.Kao, J.Nunez-Iglesias, J.Nevins, M.West, andX.J.Zhou. An integrative approach to characterize disease-speci¯c pathways and their coordination: a case study in cancer. BMC genomics, 9(Suppl 1):S12, 2008. [XZZ08] M. Xu, M. Zhu, and L. Zhang. A stable iterative method for re¯ning discriminative gene clusters. BMC genomics, 9(Suppl 2):S18, 2008. [YKL + 96] M.L. Yaremko, C. Kutza, J. Lyzak, R. Mick, W.M. Recant, and C.A. Westbrook. Lossofheterozygosityfromtheshortarmofchromosome8is associatedwithinvasivebehaviorinbreastcancer. Genes, Chromosomes and Cancer, 16(3), 1996. [YMH + 07] X. Yan, M.R. Mehan, Y. Huang, M.S. Waterman, P.S. Yu, and X.J. Zhou. A graph-based approach to systematically reconstruct human transcriptional regulatory modules. Bioinformatics, 23(13):i577, 2007. [ZBC + 06] P. Zymek, M. Bujak, K. Chatila, A. Cieslak, G. Thakker, M.L. Entman, and N.G. Frangogiannis. The role of platelet-derived growth factor sig- naling in healing myocardial infarcts. Journal of the American College of Cardiology, 48(11):2315{2323, 2006. [ZKH + 05] X.J. Zhou, M.C.J. Kao, H. Huang, A. Wong, J. Nunez-Iglesias, M. Primig, O.M. Aparicio, C.E. Finch, T.E. Morgan, and W.H. Wong. Functional annotation and network reconstruction through cross-platform integration of microarray data. Nature Biotechnology, 23(2):238{243, 2005. [ZLS + 06] X. Zhang, X. Lu, Q. Shi, X. Xu, H.E. Leung, L.N. Harris, J.D. Iglehart, A. Miron, J.S. Liu, and W.H. Wong. Recursive SVM feature selection and sample classi¯cation for mass-spectrometry and microarray data. BMC bioinformatics, 7(1):197, 2006. [ZW08] M. Zhu and Q. Wu. A Parallel Computing Approach to Decipher Tran- scriptionNetworkforLarge-scaleMicroarrayDatasets. BMC Genomics, 9(Suppl 1):S5, 2008. 92
Abstract (if available)
Abstract
The linking genotype to phenotype is the fundamental aim of modern genetics. We focus on study of links between gene expression data and phenotype data through integrative analysis. In this work, we propose three approaches.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Between genes and phenotypes: an integrative network-based Monte Carlo method for the prediction of human-gene phenotype associations
PDF
Genome-wide association study of factors influencing gene expression variation and pleiotropy
PDF
Statistical modeling of sequence and gene expression data to infer gene regulatory networks
PDF
Integrative analysis of gene expression and phenotype data
PDF
Enhancing phenotype prediction through integrative analysis of heterogeneous microbiome studies
PDF
Prioritizing phenotype-associated functional modules and sub-networks from high throughout screening results
PDF
New analysis methods for microarray data
PDF
Integrative approach of biological systems analysis on regulatory pathways, modules, protein essentialities, and disease gene classification
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Integrating high-throughput sequencing data to study gene regulation
PDF
Genomic, regulatory and functional dynamics of the duplication process
PDF
Orthogonal shared basis factorization: cross-species gene expression analysis using a common expression subspace
PDF
Statistical analysis of microarray data and functional genomics of yeast ageing
PDF
Automatic tracking of protein vesicles
PDF
Exploring the genetic basis of complex traits
PDF
Control of endogenous gene expression in the mouse
PDF
Analysis of high-density oligonucleotide gene expression data for dissecting aging pathways
PDF
Computational and experimental approaches for the identification of genes and gene networks in the Drosophila sex-determination hierarchy
PDF
Detecting and understanding differentiation of microarray expression data
PDF
Real-time tracking and analysis of Drosophila behavior and gene expression
Asset Metadata
Creator
Xu, Min
(author)
Core Title
Integrative analysis of gene expression and phenotype data
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology
Publication Date
07/29/2009
Defense Date
04/24/2009
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
association analysis,gene expression analysis,gene network analysis,machine learning,OAI-PMH Harvest,phenotype
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Zhou, Xianghong Jasmine (
committee chair
), James, Gareth (
committee member
), Sun, Fengzhu Z. (
committee member
)
Creator Email
mxu@usc.edu,xumin100@hotmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m2416
Unique identifier
UC1145639
Identifier
etd-Xu-2978 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-405031 (legacy record id),usctheses-m2416 (legacy record id)
Legacy Identifier
etd-Xu-2978.pdf
Dmrecord
405031
Document Type
Dissertation
Rights
Xu, Min
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu
Tags
association analysis
gene expression analysis
gene network analysis
machine learning
phenotype