Genome-wide association study of factors influencing gene expression variation and pleiotropy
GENOME-WIDE ASSOCIATION STUDY OF FACTORS INFLUENCING GENE EXPRESSION VARIATION AND PLEIOTROPY
by
Linqi Zhou

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)

December 2009

Copyright 2009 Linqi Zhou

Dedication

This dissertation is lovingly dedicated to my parents, whose inspiration and motivation have guided me through the many paths of life, and my husband, whose love and support are always with me.

Acknowledgements

I would like to extend my heartfelt gratitude to the following people:

Dr. Fengzhu Sun, my advisor, for his guidance, inspiration, understanding and motivation during my years in his lab. Without his expert advice and guidance this work would not have been possible.

My committee, Dr. Michael S Waterman, Dr. Liang Chen, and Dr. Larry Goldstein, for their good advice and for teaching me useful knowledge that benefits my research and life.

My lab mates, fellow graduate students and friends, for all of their support, company and encouragement, especially Dr. Xiaotu Ma, for his kind help in my research.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction of Saccharomyces cerevisiae
1.1 Introduction
1.2 Overview of dissertation work
Chapter 2: The effects of protein interactions, gene essentiality and regulatory regions on expression variation
2.1 Introduction
2.2 Results and Discussion
2.3 Materials and Methods
Chapter 3: Chromatin regulation and gene centrality are essential for controlling fitness pleiotropy in yeast
3.1 Introduction
3.2 Results and Discussion
3.3 Materials and Methods
Chapter 4: Conclusions
Chapter 5: Future work
References
Appendix A: Supplemental materials for Chapter 2
A1. Analysis using MIPS protein interaction data
A2. Analysis using DIP protein interaction data
A3. Analysis using BioGrid protein interaction data
A4. Analysis using the average of expression variation across four expression data sets including Ca_Na_exposure, Chemostat, Environmental Stress and Oxidative Stress
A5. Analysis using variability of gene expression across more than 1,500 conditions from a combined data set
A6. Correlation between independent variables in each model with different protein interaction data
Appendix B: Supplemental materials for Chapter 3
B1. Analysis using Brown et al. [2006] data and DIP protein interaction data
B2. Analysis results using Brown et al. [2006] data and BioGrid protein interaction data
B3. Analysis using Parson et al. [2006] data and MIPS protein interaction data
B4. Analysis using Parson et al. [2006] data and DIP protein interaction data
B5. Analysis results using Parson et al. [2006] data and BioGrid protein interaction data
B6. Analysis results using Hillenmeyer et al. [2008] data and MIPS protein interaction data
B7. Analysis results using Hillenmeyer et al. [2008] data and DIP protein interaction data
B8. Analysis results using Hillenmeyer et al. [2008] data and BioGrid protein interaction data
B9. The correlation among three phenotypic profiles

List of Tables

Table 2.1 The relationship between gene expression variation and the number of cis-elements
Table 2.2 Analysis of four factors and their interactions affecting expression variation using stepwise selection with AIC
Table 2.3 The effect of two factors on expression variation stratified by the presence/absence of TATA box
Table 2.4 The effect of toxicity degree on expression variation stratified by the set of environmental stress response (ESR)
Table 3.1 Correlation between fitness pleiotropy and each measurement when expression variation is either controlled or not
Table 3.2 Partial Spearman’s correlation between fitness pleiotropy and expression variation when each measurement is controlled
Table 3.3 Partial Spearman’s correlation between fitness pleiotropy and PPI degree, CC or CRE
Table A2.1 Analysis of four factors and their interactions affecting expression variation using stepwise selection with AIC
Table A3.1 Analysis of four factors and their interactions affecting expression variation using stepwise selection with AIC
Table A4.1 Analysis of four factors and their interactions affecting expression variation using stepwise selection with AIC
Table A4.2 The effects of two factors on expression variation stratified by the presence/absence of TATA box
Table A4.3 The effects of toxicity degree on expression variation stratified by the set of environmental stress response (ESR)
Table A5.1 Analysis of four factors and their interactions affecting expression variation using stepwise selection with AIC
Table A5.2 The effects of two factors on expression variation stratified by the presence/absence of TATA box
Table A5.3 The effects of toxicity degree on expression variation stratified by the set of environmental stress response (ESR)
Table A6.1 The pairwise Spearman’s correlation between independent variables in each model with different protein interaction data
Table B1.1 Spearman’s correlation between fitness pleiotropy and PPI, CC when gene expression variation is either controlled or not
Table B1.2 Partial Spearman’s correlation between fitness pleiotropy and gene expression variation when protein interaction degree, CC or CRE is controlled
Table B1.3 Partial Spearman’s correlation between fitness pleiotropy and protein interaction degree, CC or CRE
Table B2.1 Spearman’s correlation between fitness pleiotropy and PPI degree, CC when gene expression variation is either controlled or not
Table B2.2 Partial Spearman’s correlation between fitness pleiotropy and gene expression variation when PPI degree, CC or CRE is controlled
Table B2.3 Partial Spearman’s correlation between fitness pleiotropy and PPI degree, CC or CRE
Table B3.1 Correlation between fitness pleiotropy and each measurement when expression variation is either controlled or not
Table B3.2 Partial Spearman’s correlation between fitness pleiotropy and expression variation when each measurement is controlled
Table B3.3 Partial Spearman’s correlation between fitness pleiotropy and PPI degree, CC or CRE
Table B4.1 Spearman’s correlation between fitness pleiotropy and PPI, CC when gene expression variation is either controlled or not
Table B4.2 Partial Spearman’s correlation between fitness pleiotropy and gene expression variation when protein interaction degree, CC or CRE is controlled
Table B4.3 Partial Spearman’s correlation between fitness pleiotropy and protein interaction degree, CC or CRE
Table B5.1 Spearman’s correlation between fitness pleiotropy and PPI degree, CC when gene expression variation is either controlled or not
Table B5.2 Partial Spearman’s correlation between fitness pleiotropy and gene expression variation when PPI degree, CC or CRE is controlled
Table B5.3 Partial Spearman’s correlation between fitness pleiotropy and PPI degree, CC or CRE
Table B6.1 Correlation between fitness pleiotropy and each measurement when expression variation is either controlled or not
Table B6.2 Partial Spearman’s correlation between fitness pleiotropy and expression variation when each measurement is controlled
Table B6.3 Partial Spearman’s correlation between fitness pleiotropy and PPI degree, CC or CRE
Table B7.1 Spearman’s correlation between fitness pleiotropy and PPI, CC when gene expression variation is either controlled or not
Table B7.2 Partial Spearman’s correlation between fitness pleiotropy and gene expression variation when protein interaction degree, CC or CRE is controlled
Table B7.3 Partial Spearman’s correlation between fitness pleiotropy and protein interaction degree, CC or CRE
Table B8.1 Spearman’s correlation between fitness pleiotropy and PPI degree, CC when gene expression variation is either controlled or not
Table B8.2 Partial Spearman’s correlation between fitness pleiotropy and gene expression variation when PPI degree, CC or CRE is controlled
Table B8.3 Partial Spearman’s correlation between fitness pleiotropy and PPI degree, CC or CRE
Table B9.1 The Spearman’s rank correlation among three phenotypic profiles used for our analysis

List of Figures

Figure 1.1 The structure of yeast cell
Figure 1.2 The reproduction modes of S. cerevisiae
Figure 1.3 The deletion project in yeast
Figure 1.4 Y2H to detect pair-wise protein interaction
Figure 1.5 Genome-wide location analysis of binding sites for TFs
Figure 1.6 The composition of nucleosome
Figure 2.1 Gene expression variation is negatively correlated with protein interaction degree
Figure 2.2 The effect of essentiality, toxicity degree, and protein interaction degree on gene expression variation
Figure 2.3 The effect of TATA box, number of TFs, and toxicity degree on gene expression variation
Figure 3.1 The relationship between fitness pleiotropy and PPI degree (A) and between CC and PPI degree (B)
Figure 3.2 The relationship between fitness pleiotropy and measurements
Figure 3.3 The relationship between fitness pleiotropy and measurements
Figure 3.4 Fitness pleiotropy is negatively associated with chromatin regulatory effect (CRE) (ρ=-0.172, p<2.2e-16)
Figure A1.1 Gene expression variation is negatively correlated with protein interaction degree
Figure A1.2 The effect of essentiality, toxicity degree, and protein interaction degree on gene expression variation
Figure A1.3 The effect of TATA box, number of TFs, and toxicity degree on gene expression variation
Figure A1.4 Gene expression variation is negatively correlated with protein interaction degree
Figure A1.5 The effect of essentiality, toxicity degree, and protein interaction degree on gene expression variation
Figure A1.6 The effect of TATA box, number of TFs, and toxicity degree on gene expression variation
Figure A1.7 Gene expression variation is negatively correlated with protein interaction degree
Figure A1.8 The effect of essentiality, toxicity degree, and protein interaction degree on gene expression variation
Figure A1.9 The effect of TATA box, number of TFs, and toxicity degree on gene expression variation
Figure A2.1 Gene expression variation is negatively correlated with protein interaction degree
Figure A2.2 The effect of toxicity degree and protein interaction degree on gene expression variation
Figure A2.3 Gene expression variation is negatively correlated with protein interaction degree
Figure A2.4 The effect of toxicity degree and protein interaction degree on gene expression variation
Figure A2.5 Gene expression variation is negatively correlated with protein interaction degree
Figure A2.6 The effect of toxicity degree and protein interaction degree on gene expression variation
Figure A2.7 Gene expression variation is negatively correlated with protein interaction degree
Figure A2.8 The effect of toxicity degree and protein interaction degree on gene expression variation
Figure A3.1 Gene expression variation is negatively correlated with protein interaction degree
Figure A3.2 The effect of toxicity degree and protein interaction degree on gene expression variation
Figure A3.3 Gene expression variation is negatively correlated with protein interaction degree
Figure A3.4 The effect of toxicity degree and protein interaction degree on gene expression variation
Figure A3.5 Gene expression variation is negatively correlated with protein interaction degree
Figure A3.6 The effect of toxicity degree and protein interaction degree on gene expression variation
Figure A3.7 Gene expression variation is negatively correlated with protein interaction degree
Figure A3.8 The effect of toxicity degree and protein interaction degree on gene expression variation
Figure B1.1 The relationship between fitness pleiotropy and protein interaction degree (A) and between CC and protein interaction degree (B)
Figure B1.2 Boxplot of fitness pleiotropy for different groups of proteins classified according to protein interaction degree and CC
Figure B2.1 The relationship between fitness pleiotropy and PPI degree (A) and between CC and PPI degree (B)
Figure B2.2 Boxplot of fitness pleiotropy for different groups of proteins classified according to PPI degree and CC
Figure B3.1 The fitness pleiotropy is positively correlated with protein physical interaction (PPI) degree
Figure B3.2 The relationship between fitness pleiotropy and measurements
Figure B3.3 The relationship between fitness pleiotropy and measurements
Figure B3.4 Fitness pleiotropy is negatively associated with chromatin regulatory effect (CRE)
Figure B4.1 The relationship between fitness pleiotropy and PPI
Figure B4.2 Boxplot of fitness pleiotropy for different groups of proteins classified according to protein interaction degree and CC
Figure B5.1 The relationship between fitness pleiotropy and PPI degree
Figure B5.2 Boxplot of fitness pleiotropy for different groups of proteins classified according to PPI degree and CC
Figure B6.1 The fitness pleiotropy is positively correlated with protein physical interaction (PPI) degree
Figure B6.2 The relationship between fitness pleiotropy and measurements
Figure B6.3 The relationship between fitness pleiotropy and measurements
Figure B6.4 Fitness pleiotropy is negatively associated with chromatin regulatory effect (CRE)
Figure B7.1 The relationship between fitness pleiotropy and PPI
Figure B7.2 Boxplot of fitness pleiotropy for different groups of proteins classified according to protein interaction degree and CC
Figure B8.1 The relationship between fitness pleiotropy and PPI
Figure B8.2 Boxplot of fitness pleiotropy for different groups of proteins classified according to PPI degree and CC

Abstract

Genes show different expression variation and growth fitness when responding to various environmental conditions. Many studies have focused on identifying the factors that influence such differences, especially in comparisons of essential genes versus viable genes.
Nonetheless, obtaining a more complete understanding of the factors, and of the interactions among them, that influence expression variation and fitness pleiotropy (growth defect upon gene deletion) remains a challenge. In this dissertation, a systematic analysis of the factors that affect expression variation and pleiotropy, as well as their inter-relationship, has been conducted using S. cerevisiae as a model. For S. cerevisiae, high-throughput technologies have produced many genome-wide data sets, such as protein interaction, regulatory network and gene deletion data. With such data available, it was found that the TATA box and the number of TFs (transcription factors) are the most important factors influencing expression variation among four factors: protein interaction degree, toxicity degree (the number of DNA-damaging conditions under which the growth rate of the yeast deletion strain is significantly affected), TATA box, and the number of TFs. In addition, the number of TFs regulating a gene was found to be an important factor influencing expression variation for both TATA-containing and non-TATA-containing genes, but with different association strengths. Moreover, expression variation was significantly negatively correlated with toxicity degree only for TATA-containing genes.

Fitness pleiotropy is a measurement of a gene’s importance to fitness. Two measurements were found to significantly affect a gene’s fitness pleiotropy: 1) whether the gene product is central in a protein interaction network, and 2) whether the gene is more likely to be chromatin regulated (defined as CRE). Although a significant negative association between fitness pleiotropy and gene expression variation was identified, the study showed that CRE could be considered the key underlying latent variable that controls both fitness pleiotropy and expression variation, resulting in their correlation.
These findings highlight the significance of both gene regulation and protein interaction networks in influencing gene expression variation and fitness pleiotropy. Moreover, the observation that distinct mechanisms may influence gene expression variation in TATA-containing and non-TATA-containing genes provides new insights into the mechanisms that underlie the evolution of gene expression.

Chapter 1: Introduction of Saccharomyces cerevisiae

1.1 Introduction

Yeasts are unicellular fungi that exist widely in the natural environment. Among several hundred yeast species, S. cerevisiae has been extensively studied in biology. It serves as a good model for eukaryotes because it shares structural and functional features with higher eukaryotes. The basic structure of S. cerevisiae includes the cell envelope, cytoplasm, nucleus, endoplasmic reticulum, Golgi apparatus, mitochondrion, etc. (Figure 1.1). Studies of S. cerevisiae provide a large amount of information relevant to higher eukaryotes; for example, a previous study on telomeric DNA in S. cerevisiae suggested that tumor formation in mammals is associated with telomere length [Walker 1998]. Many biological processes are common to yeast and higher eukaryotic organisms, such as replication, cell division and protein transport.

Figure 1.1. The structure of yeast cell (http://home.earthlink.net/%7Eggda/yeast_cell_final_resample.jpg).

The haploid genome of S. cerevisiae is about 12 Mb in size, with 16 chromosomes. The genome is completely sequenced and encodes more than 6000 ORFs. Moreover, many other genome-wide data sets, including protein interaction, transcription regulation and gene deletion data, are available for S. cerevisiae.

1.1.1 The growth of S. cerevisiae

Yeast cells can survive and grow in two forms: haploid and diploid. The rapid growth and the existence of both haploid and diploid life cycles make S. cerevisiae an ideal organism for genetic study. S. cerevisiae reproduces in both asexual and sexual modes. Both haploid and diploid cells can reproduce by budding, an asexual reproduction process (Figure 1.2A). Budding is the most common mode in S. cerevisiae. When the mother cell reaches a certain size, the budding process begins. The process includes the G1 phase (pre-synthesis), the S phase (DNA synthesis) and the M phase (mitosis). Once mitosis is finished, the daughter bud detaches from the mother cell. A daughter bud may form at any position on the surface of the mother cell.

Figure 1.2. The reproduction modes of S. cerevisiae. A) Asexual reproduction: the budding of yeast cells (https://eee.uci.edu/clients/bjbecker/NatureandArtifice/yeastbudb.jpg). B) Sexual reproduction: mating of yeast cells (http://upload.wikimedia.org/wikipedia/commons/archive/c/c3/20070124104751!Budding_yeast_Lifecycle.png).

S. cerevisiae can also reproduce sexually (Figure 1.2B). Haploid cells have one of two mating types, a and α. Haploid cells of opposite mating types can mate to form diploid cells. A cell of mating type a secretes a-factor pheromone, which binds to a receptor on α cells, and a cell of mating type α secretes α-factor pheromone, which binds to a receptor on a cells. This signal transduction initiates the mating process, which forms a diploid from two haploids of opposite mating types. Only under stress conditions, such as nitrogen starvation, are diploid cells induced to undergo sporulation, i.e., meiosis, to produce four haploid spores (two a and two α).

1.1.2 The deletion project in S. cerevisiae

Deletions of almost all genes in the genome have been generated for phenotypic analysis [http://www-sequence.stanford.edu/group/yeast_deletion_project/project_desc.html]. Phenotypic analysis of such deletion strains provides a useful tool for analyzing all genes in parallel and hence saves time and labor compared with traditional methods.

A PCR-based method is used to generate deletions (Figure 1.3A).
There are two rounds of PCR in this gene deletion process. In the first round, upstream (UPTAG) and downstream (DOWNTAG) primers, which contain unique tags (TAG1 and TAG2) for each ORF, are attached to the KanMX4 gene by PCR. The resulting PCR products carry KanMX4 as a selectable marker and are used for homologous recombination. TAG1 and TAG2 (called barcodes in some papers) are 20 bp sequences unique to each ORF. The second round of PCR extends the homology to the regions upstream and downstream of the target ORF for homologous recombination. The resulting PCR product containing the KanMX4 gene replaces the ORF through homologous recombination, so that each ORF is associated with unique tags (TAG1 and TAG2). All deletion strains are then pooled, grown together and hybridized to an oligonucleotide array containing the tag complements for further analysis.

To confirm whether a gene deletion has been generated successfully, PCR with primers A-KanB and D-KanC is conducted. If PCR products of the correct size are seen by gel electrophoresis (Figure 1.3B), the deletion is successful. For haploid deletions in particular, there is an additional check that the wild-type PCR products obtained with primers A-B and C-D are missing.

Figure 1.3. The deletion project in yeast (http://www-sequence.stanford.edu/group/yeast_deletion_project/project_desc.html). A) A PCR-based deletion method is used to generate deletions. B) Confirmation of success of deletion.

For the deletion project in S. cerevisiae, four different types of deletion strains were generated for each ORF: two haploid strains of mating types a and α, and two diploid strains, one homozygous and one heterozygous for the deletion locus. The homozygous diploid is obtained by mating two haploid strains of opposite mating types, while the heterozygous diploid is constructed by the transformation described above. One haploid is obtained by sporulation, while the other haploid is also constructed by the transformation described above.

More than 6,000 genes in S. cerevisiae are classified into essential genes and non-essential genes [Giaever et al, 2002]. Essential genes are those whose deletion strains can survive only as heterozygous diploids. They are identified by the 2 viable : 2 dead segregation of tetrads produced from the heterozygous diploid. Therefore, deletions of essential genes are available only as heterozygous diploids. In contrast, deletions of non-essential genes are available as heterozygous diploids, homozygous diploids and haploids of both mating types.

1.1.3 Protein interaction

Protein interactions are classified into physical and genetic interactions. A genetic interaction exists between two proteins if mutations in their genes cause a synthetic sick or lethal (SSL) interaction. Protein-protein physical interactions are key biological events in a living cell: proteins interact with each other transiently or permanently to perform certain functions. For example, signaling events often require protein-protein interactions, and several proteins may form a protein complex, such as a transcription regulatory complex, to perform a certain function. This dissertation focuses on physical interactions.

High-throughput techniques, including yeast two-hybrid systems (Y2H) and mass spectrometry (MS), have generated large amounts of physical interaction data. Y2H is a method for detecting pair-wise protein physical interactions. As shown in Figure 1.4 (www.clontech.com), the two tested proteins are fused to the GAL4 DNA-binding domain and activation domain, respectively. If the two proteins interact with each other, a reporter gene such as lacZ is expressed, and its product produces a blue colony through the metabolism of X-gal. MS is used to determine the components of a sample. A biological sample is purified by a technique such as affinity chromatography, then proteolysed and analyzed by MS.
It is thought that high-throughput protein interaction data contain many false positive and false negative interactions and are not as reliable as literature data and data collected from carefully performed small-scale experiments. Therefore, assessing the reliability of different interaction data sets, i.e., the fraction of real interactions among the observed protein interactions, is very important. Deane et al [2002] presented the expression profile reliability (EPR) index to assess the reliability of protein interaction data sets. For any given protein interaction data set, the distribution of distances between the expression levels of interacting pairs was created. Based on the model that the observed distribution is a mixture of the distribution for true interacting pairs and the distribution for true non-interacting pairs, they found that the DIP core data set has high reliability. In addition, Deng et al [2003] showed that the MIPS physical interaction data have the largest mean gene expression correlation coefficient, according to the distribution of gene expression correlation coefficients between pairs of interacting proteins. They also developed a maximum likelihood method based on the distributions of the correlation coefficients of gene expression profiles for interacting protein pairs and predicted that DIP data is more reliable.

Figure 1.4 Y2H to detect pair-wise protein interaction (www.clontech.com).

In this dissertation work, three protein interaction data sets with different reliability were used to show the consistency of the findings: the Munich Information Center for Protein Sequence (MIPS) [Mewes et al, 2004], the Database of Interacting Proteins (DIP) [Salwinski et al, 2004] and the General Repository for Interaction Datasets (BioGrid) [Stark et al, 2006]. MIPS [Mewes et al, 2004] is a database of protein sequence-related information.
In yeast, more than 10,000 protein-protein interaction records, including physical and genetic interactions, are collected from large-scale experiments and the literature. DIP core [Salwinski et al, 2004] contains binary protein-protein interactions obtained by merging PVM (paralogous verification method), INT (interactions determined by one or more small-scale experiments) and EC2 (interactions determined by at least two independent experiments). Several tests have been used to assess the reliability of the DIP core data set, e.g., evaluating the experimental methods and analyzing the reliability of individual interactions using the PVM method (an interaction is likely to be true if paralogs of the interacting pair also interact). BioGrid [Stark et al, 2006] contains high-throughput and literature-curated interactions in yeast.

Using pair-wise protein interactions, a protein interaction network can be built. Each node in the network represents a gene or its protein product, and an edge represents a binary interaction between two proteins. There are several measurements in this network: protein physical interaction degree (PPI), betweenness (BW), and clustering coefficient (CC). Protein physical interaction degree refers to the number of interacting partners of a given protein. The betweenness (BW) represents the frequency with which a node is on the shortest path between any two other nodes [Freeman, 1978]. Intuitively, in a communication network, a node with high betweenness has high potential to control communication among other nodes. The quantitative measurement is as follows [Freeman, 1978]. For any given node $n_k$ in the network,

$$BW(n_k) = \sum_i \sum_j p_{ij}(n_k), \quad i \neq j \neq k,$$

where $p_{ij}(n_k)$ is the probability that node $n_k$ falls on a shortest path between two other nodes $n_i$ and $n_j$.

The clustering coefficient measures the connectivity among a node and its first-order neighbors.
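As a minimal sketch, the three network measurements can be computed on a small toy network with the Python standard library alone. Everything in the block is illustrative and not taken from the dissertation: the five-node network, the function names, and two stated assumptions, namely that each unordered pair of nodes is counted once in the betweenness sum and that the clustering coefficient (formally defined in the text that follows) is set to 0 for nodes with fewer than two neighbors.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def shortest_path_counts(graph, source):
    """BFS from source: distance to, and number of shortest paths to, each node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(graph):
    """BW(k): sum over unordered pairs (i, j) of the fraction of shortest
    i-j paths that pass through k (assumption: each pair counted once)."""
    info = {n: shortest_path_counts(graph, n) for n in graph}
    bw = {n: 0.0 for n in graph}
    for i, j in combinations(sorted(graph), 2):
        dist_i, sigma_i = info[i]
        if j not in dist_i:
            continue  # i and j are not connected
        for k in graph:
            if k in (i, j) or k not in dist_i:
                continue
            dist_k, sigma_k = info[k]
            if j in dist_k and dist_i[k] + dist_k[j] == dist_i[j]:
                bw[k] += sigma_i[k] * sigma_k[j] / sigma_i[j]
    return bw


def clustering_coefficient(graph, node):
    """CC = 2e / (K(K-1)); assumed 0.0 when the node has fewer than 2 neighbours."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    e = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2.0 * e / (k * (k - 1))


# Hypothetical toy network: a triangle A-B-C with a tail C-D-E.
toy = build_graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])
degree = {n: len(nbrs) for n, nbrs in toy.items()}
bw = betweenness(toy)
cc = {n: clustering_coefficient(toy, n) for n in toy}
```

In this toy network, node C has the highest degree and betweenness but a low clustering coefficient, because only one of the three possible edges among its neighbors exists.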
Suppose that a node $n_v$ has $K_v$ neighbours. Then

$$CC(n_v) = \frac{e_v}{K_v (K_v - 1)/2},$$

where $e_v$ is the actual number of edges that exist among its first-order neighbors and $K_v(K_v-1)/2$ is the maximum possible number of edges that could exist among them [Watts & Strogatz, 1998].

1.1.4 Regulatory pathway

The regulation of transcription plays an important role in the correct temporal and spatial expression of genes in eukaryotes. It mainly includes regulation by chromatin modifiers and transcription factors (TFs). Chromatin modifiers (CRs) work upstream of or interact with transcription factors to regulate gene transcription. The cis-elements are bound by transcription factors, i.e., sequence-specific DNA-binding proteins, to activate or repress gene expression. Transcription factors usually have a DNA-binding domain and an activation/repression domain.

Ren et al [2000] developed a genome-wide location analysis technique to find the binding sites (cis-elements) of transcription factors in the genome (Figure 1.5). First, a Myc-epitope-coding sequence is integrated into the gene encoding the transcription factor, so that the transcription factor is labeled with the epitope. A genome-wide location analysis is then performed to identify the binding sites of this transcription factor. If a cis-element is bound by the transcription factor in vivo, the bound protein is crosslinked to the DNA by formaldehyde. The cells are then harvested and lysed. Because the TF is epitope-tagged, the crosslinked transcription factor and DNA fragment are immunoprecipitated by an anti-Myc antibody. The precipitated material is separated into transcription factor protein and DNA fragments by reversing the crosslinks. The resulting DNA sample is enriched for transcription factor-bound DNA fragments. This enriched DNA sample and an unenriched DNA sample are labeled with different fluorescent dyes and hybridized to a microarray representing the intergenic regions of the yeast genome.
Using this genome-wide ChIP-chip analysis, Harbison et al [2004] identified the occupancy of 203 transcription factors in yeast. MacIsaac et al [2006] reanalyzed the ChIP-chip data from Harbison et al [2004] and developed two motif-finding algorithms based on conservation information to find motifs for transcription factors in yeast. They used these motifs to identify different sets of binding sites (cis-elements) for transcription factors under different binding p-value thresholds in location analysis and different conservation criteria across yeast species.

Figure 1.5 Genome-wide location analysis of binding sites for TFs [Ren et al, 2000].

Hu et al [2007] generated transcriptional response profiles by deleting 263 transcription factors. From the differential expression of genes after the deletion of TFs, they constructed an unrefined regulatory network. Using a regulatory epistasis approach, they removed indirect regulation and reconstructed a refined regulatory network. The main differences between this functional regulatory network and the binding regulatory network are: 1) A transcription factor may regulate a target indirectly through regulatory cascades, in which case location analysis might not detect it; 2) Transcription factor binding may be a separate step from regulatory control, so location analysis might detect some false positive interactions between transcription factors and target genes. In this dissertation work, the functional regulatory network constructed by Hu et al [2007] is used. In the regulatory network, there are two measures: in degree (the number of transcription factors that target a gene) and out degree (the number of target genes that a transcription factor regulates).

In eukaryotes, chromosomes are packaged into chromatin, which is composed of DNA, RNA and protein. The primary proteins involved are histones. There are usually four core histones, i.e., H3, H4, H2A and H2B. The core histones contain highly conserved N-terminal tails and a globular domain.
The N-terminal tails are the important modification sites for gene regulation. Such histones are wrapped by DNA fragments to form basic chromatin unit-nucleosome (Figure 1.6). Such structure of chromatin could make some DNA sequence such as cis-element inaccessible and hence repress the gene expression. In this case, some chromatin modifiers could function in remodeling the nucleosome structure to make cis-element to be accessible. On the other hand, the folding of DNA by nucleosome structure may benefit the gene expression if distanced regulatory elements are brought near to each other. 12 Figure 1.6 The composition of nuleosome (http://www.abcam.com/ps/CMS/Images/Phil-Carp-Fig-1-Nucleosoma- .jpg). Chromatin modifiers influence chromatin structure by forming a chromatin structure that is needed for transcription factor activity. Chromatin modifiers may depend on ATP or act independently of ATP [Steinfeld et al, 2007]. The ATP-dependent chromatin modifiers include the histone acetyltransferases (HATs) and the histone deacetylases (HDACs). Such chromatin modifiers usually work as a complex containing an ATPase subunit. With ATPase, chromatin modifiers could use energy from ATP hydrolysis to change histone-DNA interaction or the position of nucleosome. For example, the negative charge produced by acetylation of lysines in N-terminal tail of histone may neutralize the positive charge so that it may weaken electrostatic interaction between histone and DNAs and interaction between neighbor nuclosomes. Studies have shown that the pattern of acetylation on promoters of genes is affected by individual activators or repressors and transcriptional activation is not necessarily associated with increased acetylation [Deckert and Struhl, 2001]. Other chromatin modifiers that are independent of 13 ATP modify the histone by adding methyl group, phosphate, etc. Many chromatin modifiers are found to be evolutionarily conserved. 
1.1.5 TATA box

The TATA box was the first regulatory element identified in eukaryotes. About 20% of yeast genes were found to be TATA-containing [Basehoar et al, 2004] according to the following criteria:
1) Location: the TATA box lies between -200 and -50 upstream of a gene, as previously identified TATA boxes tend to be within this region;
2) Conservation: the TATA box is conserved in four yeast species, as important motifs tend to be fixed;
3) Length: the TATA box is 8 bp long, as the TATA binding protein (TBP) binds to 8 bp of DNA, whereas previously reported consensus sequences for a TATA box are about 5-6 bp in length. By testing all combinations of 8 bp sequences, the consensus sequence for a TATA box was found to be TA(A/T)(A/T)NNNN;
4) Sensitivity: the expression of TATA-containing genes tends to be sensitive to TBP DNA-binding mutants.
One set of TATA-containing genes was classified by combining the first three criteria, and another set was classified solely from the fourth. In this way, genes are grouped into TATA-containing genes and non-TATA-containing genes. TATA-containing genes tend to be found in subtelomeric regions and are enriched in stress-related genes, while non-TATA-containing genes tend to be house-keeping genes, such as genes involved in protein synthesis and cell growth [Basehoar et al, 2004]. It was found that TATA-containing genes are repressed by nucleosomes [Basehoar et al, 2004]. Chromatin keeps TATA boxes inaccessible until regulators modify the chromatin structure. Hence TATA-containing genes are highly regulated by chromatin regulators. Two distinct regulatory mechanisms operate in TATA-containing genes and non-TATA-containing genes [Basehoar et al, 2004]. TBP binds to the promoters of TATA-containing and non-TATA-containing genes through interactions with different coactivator complexes (SAGA for TATA-containing genes and TFIID for non-TATA-containing genes). RNA polymerase and other factors are then recruited for gene expression.
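The location and consensus criteria above can be sketched as a simple sequence scan. This is only an illustration under stated assumptions: the function name is hypothetical, positions are counted from the last base of the supplied upstream sequence (position -1), the whole 8 bp motif is required to lie inside the window, and the conservation and TBP-sensitivity criteria are not modelled. The consensus is taken as written in the text, TA(A/T)(A/T)NNNN.

```python
import re

# Consensus as written in the text: TA(A/T)(A/T)NNNN, 8 bp in total.
TATA_CONSENSUS = re.compile(r"TA[AT][AT][ACGT]{4}")


def find_tata_boxes(upstream, window=(-200, -50)):
    """Return (start_position, motif) pairs for consensus matches in the window.

    upstream is the sequence immediately 5' of the gene, written 5'->3',
    so its last base is position -1 (an assumed coordinate convention).
    Matches are non-overlapping, which is enough for a presence/absence call.
    """
    seq = upstream.upper()
    n = len(seq)
    hits = []
    for match in TATA_CONSENSUS.finditer(seq):
        start = match.start() - n  # negative coordinate of the first motif base
        end = start + 7            # coordinate of the last motif base
        if window[0] <= start and end <= window[1]:
            hits.append((start, match.group(0)))
    return hits


# Hypothetical 200 bp upstream sequences.
inside = "G" * 100 + "TATAAAGC" + "G" * 92   # motif starts at -100
outside = "G" * 162 + "TATAAAGC" + "G" * 30  # motif starts at -38, too close
print(find_tata_boxes(inside))
print(find_tata_boxes(outside))
```

A gene would be called TATA-containing by this sketch whenever the returned list is non-empty.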
TATA-containing genes are positively regulated by TBP regulator-SAGA and negatively regulated by TBP regulator-Mot1 and Bur6. With aid of SAGA and a TATA box, TATA- containing genes tend to be highly regulated by chromatin regulators, TBP regulators and be stress-inducible. In contrast, non-TATA-containing genes are more likely to recruit TFIID coactivator complexes and less regulated. 1.2 Overview of dissertation work 1.2.1 The effects of protein interactions, gene essentiality and regulatory regions on expression variation Gene expression variation is a hot research topic that has been studied in different organisms at different levels. With the development of current technique such as DNA microarray, thousands of individual gene expression level in an organism could be measured at the same time and a faster speed than ever. A DNA microarray is composed of probes that are DNA fragments and complementary to the sample sequences. Using 15 DNA microarray, lots of profiles that measured gene expression level or gene expression change have been generated. Identifying factors affecting gene expression variation is a challenging problem in genetics. Previous studies have shown that the presence of TATA box [Landry et al, 2007; Tirosh et al, 2006], the number of cis-regulatory elements [Landry et al, 2007], gene essentiality [Tirosh and Barkai, 2008; Choi et al., 2007], and protein interactions [Lemos et al, 2004; 2005] significantly affect gene expression variation. Nonetheless, the need to obtain a more complete understanding of such factors and how their interactions influence gene expression variation remains a challenge. In terms of gene essentiality, we found that essential genes tend to have lower gene expression variation than non-essential genes, which is consistent with [Tirosh and Barkai, 2008; Choi et al., 2007]. Furthermore, we studied the trend of gene expression variation detailed in non-essential genes. 
The growth rates of yeast cells under several DNA-damaging conditions have been studied and a gene’s toxicity degree is defined as the number of such conditions that the growth rate of the yeast deletion strain is significantly affected. Since toxicity degree reflects a gene’s importance to cell survival under DNA-damaging conditions, we expect that it is negatively associated with gene expression variation. Mutations in transcription factors (TF) regulating a gene affect the gene’s expression and thus we study the relationship between gene expression variation and the number of TFs regulating a gene. The number of TFs regulating a gene is collected from TFs knock-out experiments and is different from the number of cis-elements for a given gene. Both the 16 number of TFs and the number of cis-elements have been investigated to compare their influence on gene expression variation. Most importantly we study how these factors interact with each other influencing gene expression variation. Using yeast as a model system, we evaluated the effects of four separate factors and their interactions on gene expression variation: protein interaction degree, toxicity degree, number of TFs, and the presence of TATA box. 1.2.2 Chromatin regulation and gene centrality are essential for controlling fitness pleiotropy in yeast There are a wide range of phenotypes that are due to loss-of-function or null mutations. Two remaining questions include 1) what are the functions of a gene’s product that underlie the importance of a gene to fitness? and 2) how does the interaction of protein functions contribute to fitness? Previously, the functions of gene products that distinguish essential from nonessential genes were characterized. However, the functions of products of non-essential genes that contribute to fitness remain minimally understood. Currently the fitness (growth rate) profiles of the S. 
cerevisiae deletion strains under various culture conditions are useful for studying the function of non-essential genes. Using such data, we investigated which factors are associated with a gene’s fitness pleiotropy. Fitness pleiotropy is a measurement of the gene’s importance to fitness, defined as the number of growth conditions that significantly affect fitness when the gene is deleted. Jeong et al. [2001] found that essential genes tend to encode products that have a large number of physical interaction partners. We therefore ask whether centrality in a protein interaction network might influence fitness pleiotropy. We measure centrality by interaction degree, betweenness and clustering coefficient. The effects of the different centrality measurements on fitness pleiotropy, and their interactions, are studied in detail in the dissertation work. Since chromatin regulation is one way in which an organism responds to internal and external stimuli and might affect gene expression, we ask whether a gene’s fitness pleiotropy is influenced by 1) whether the gene product functions in chromatin regulation, and 2) whether the gene is likely to be chromatin regulated.

Chapter 2: The effects of protein interactions, gene essentiality and regulatory regions on expression variation

2.1 Introduction

2.1.1 Previous work

Gene expression variation has been studied on three different levels: single cells across a common environment [Newman et al, 2006], within one species across a variety of different environments [Nelson et al, 2004; Walther et al, 2007], and across different species/strains, which is often referred to as evolutionary variation [Lemos et al, 2004, 2005; Landry et al, 2007; Tirosh et al, 2006; Tirosh and Barkai, 2008]. Those studies have focused on individual factors affecting gene expression variation. Newman et al. [2006] developed an experimental technique to quantify protein concentrations at the single-cell level in response to differing environmental conditions.
Their approach uses high-throughput flow cytometry and a library of green fluorescent protein (GFP)-tagged yeast strains, in which each protein is expressed as a carboxy-terminal GFP fusion and measured by flow cytometry. They found that chromosomal distance to other genes and mRNA-half life is associated with expression noise. They also showed that proteins that respond to environmental changes tend to have high noise while those involved in protein synthesis tend to have low noise. However, they did not find a relationship between protein expression noise and protein-protein interactions. Recently, 19 using a more complete interaction dataset, Batada et al. [2006] found that protein expression variation is negatively correlated with interaction degree when protein abundance was controlled using the data in Newman et al. [2006]. This relationship continues to hold within the viable genes. Nelson et al. [2004] and Walther et al. [2007] focus on the expression variation within one species. Nelson et al. [2004] studied the relationship between the number of tissues or body parts (expression variation), where the gene is expressed, and intergenic distance in C. elegans and D. melanogaster. The hypothesis is that the genes expressed in a greater number of tissues or body parts tend to require a greater number of regulatory elements to regulate its expression and more regulatory elements will also occupy more physical spacing in the genome. They defined intergenic distance in two ways. One is the sum of upstream and downstream distance to the nearest neighboring genes. Another is calculated based on 11-gene window as regulatory information may not only be contained in the boundaries of a gene itself. They found that gene expression variation increases in relation to the intergenic distances in both definitions. In addition, they showed that genes with complex functions tend to have larger intergenic distance than house keeping genes. Walther et al. 
[2007] hypothesized that genes that are differentially expressed under more environmental stimuli are expected to have more distinct cis- regulatory elements in upstream regions than are genes that are differentially expressed under few environmental stimuli. They count number of 93 cis-regulatory elements in upstream region up to a length of 3,000 nucleotides and found a positive correlation between the frequency of a gene’s differential expression and the number of cis- regulatory elements of that gene in A. thaliana. 20 There are a few studies on factors influencing gene expression across different species/strains (evolutionary variation) [Lemos et al, 2004,2005; Landry et al, 2007; Tirosh et al, 2006; Tirosh and Barkai, 2008]. Lemos et al. [2004, 2005] studied the effect of protein-protein interactions and protein length on evolutionary variation (variation among strains in a species). They found that evolutionary variation is negatively correlated with protein-protein interactions in S. cerevisiae or Drosophila melanogaster [Lemos et al., 2004] and negatively correlated with protein length in Drosophila melanogaster [Lemos et al., 2005]. These studies highlighted the importance of protein interactions and gene regulatory regions on gene expression variation. Landry et al. [2007] performed a mutation-accumulation experiment and studied factors influencing evolutionary variation, i.e., four strains in S. cerevisiae. They found that genes with high evolutionary variation have larger number of cis-element binding sites and larger number of others genes that affect the expression of the given gene than those has low evolutionary variation. However, the positive correlation between evolutionary variation and size of cis-element binding sites could be fully explained by the presence/absence of the TATA box in the promoter region. Using genome-wide interspecies/interstrain expression data, Tirosh et al. 
[2006, 2008] also found that the interspecies variation of gene expression is significantly correlated with the presence/absence of the TATA box in the promoter region.

2.1.2 Our goal

In order to integrate such different data sources in a way that collectively identifies and interprets the key factors affecting gene expression variation, we conducted studies of proteomic and genomic factors that marginally and collectively influence gene expression variation across different perturbation conditions within one species: yeast. We study genetic factors affecting gene expression variation within one species across many different environmental conditions. Broadly, the genetic factors affecting gene expression primarily include the binding of regulatory proteins to cis-elements upstream of the gene, as well as physical and genetic interactions with other genes. With the availability of many gene expression profiles, protein interaction networks, and gene regulatory networks, it is now possible to study how gene expression variation is associated with both network features and genomic factors. In the case of protein interaction networks, interaction degree, i.e., the number of interacting partners of a given protein, is one of many factors. The presence or absence of a TATA box and the number of transcription factors (TFs) regulating a gene are examples of genomic factors influencing gene expression variation.

2.1.3 Our hypothesis

Protein interactions play an important role in gene expression variation. Protein-protein interactions are key biological events in a living cell, and proteins in a cell interact with each other to perform certain functions. High-throughput technologies, including yeast two-hybrid systems and mass spectrometry, have generated a large amount of protein interaction data in yeast.
Computational methods have also been developed to study the reliability of the observed interactions [Deane et al, 2002; Deng et al, 2003] and to build reliable protein interaction networks. These efforts have resulted in the development of several protein interaction databases, albeit with differing degrees of reliability, including MIPS [Mewes et al, 2004], DIP [Salwinski et al, 2004] and BioGrid [Stark et al, 2006]. From an evolutionary point of view, the expression profiles of neighboring genes of a target gene in a protein interaction network may put some constraints on the target gene’s expression. Thus, in a protein interaction network, the interacting partners of a specific protein can affect the corresponding gene’s expression. Therefore, protein physical interaction degree, i.e., the number of interacting partners of a given protein, can significantly affect gene expression variation. In the present study, we show that gene expression variation decreases with protein interaction degree and that protein interaction degree accounts for 1-2% of the expression variation in model organism yeast, a result consistent with previous studies [Lemos et al, 2004; 2005]. Another key factor affecting gene expression variation is gene essentiality. Genes can be classified into essential and non-essential genes based on the fitness phenotype of the yeast cell when the gene is deleted under normal growth conditions [Giaever et al, 2002]. Essential genes are those that, when deleted, will render the yeast cell non-viable. Non- essential genes can be further classified into no-phenotype and toxicity-modulating genes based on the fitness phenotype of yeast cell when the gene is deleted under the conditions of four DNA-damaging treatments [Said et al, 2004]. 
Specifically, we define a gene’s toxicity modulation degree as the number of DNA-damaging treatments significantly 23 affecting the deletion strain’s fitness (toxicity modulation degree = 0 (no phenotype), 1, 2, 3, and 4). The higher the toxicity modulation degree, the more important the gene is in relation to cell survival. Therefore, toxicity degree gives a quantitative measurement of a gene’s importance to yeast cell survival. We measure a gene’s functional importance in relation to cell survival by the essentiality of the essential genes and the toxicity modulation degree of non-essential genes. Since the expression of genes important for cell survival are generally stable under many different stimuli and cannot fluctuate extensively, we hypothesize and show that expression variation of essential genes is lower than that of non-essential genes and decreases with toxicity degree within non- essential genes. The number of cis-elements has been shown to be positively associated with gene expression variation [Landry et al, 2007]. The number of cis-elements is usually approximated using computational approaches and many contain false positive and negative predictions. Theoretically, a given gene's expression pattern can become increasingly complex with the increasing number of transcription factors that regulate this gene, either directly or indirectly. In this study, we hypothesize that the number of TFs is a significant predictor of expression variation and show that the number of TFs regulating a gene (hereinafter referred to as ‘number of TFs’) accounts for 8-14% of its expression variation, much higher than that can be explained by the number of cis- elements (0.3-1.7%). This implies the importance of indirect trans-effect on expression variation. 24 The TATA box is a conserved element in the eukaryotic promoter region and is usually bound by TATA-binding proteins. 
The presence of TATA box has been shown to be one of the most important factors contributing to gene expression variation [Landry et al, 2007; Tirosh et al, 2006]. Further analysis of the individual genomic and proteomic factors affecting gene expression indicates that there might be two distinct mechanisms that specifically influence gene expression variation of TATA-containing and non- TATA-containing genes. Most importantly, we show that significant negative correlation between expression variation and toxicity degree is only present for TATA-containing genes and that toxicity degree accounts for 1.3-2.6% of the expression variation. In contrast, the relationship between expression variation and toxicity degree is absent for non-TATA-containing genes. The fact that TATA-containing genes are enriched in stress-related genes [Basehoar et al, 2004] may explain this difference. Although the number of TFs is significantly positively correlated with expression variation for both TATA- and non-TATA-containing genes, the association strength is higher for non- TATA containing genes than for TATA-containing genes. These results imply that the mechanism influencing TATA-containing gene expression variation is much more complicated than that in non-TATA-containing genes. For example, TATA-containing genes were found more likely to be epigenetic regulated [Basehoar et al, 2004; Choi and Kim 2008]. Thus, this study gives a more complete analysis of factors and their interaction affecting gene expression variation than may be found in previous studies. 25 2.2 Results and Discussion We present our results based on the MIPS protein physical interaction data [Mewes et al, 2004] and the yeast gene expression profiles under 40 Ca and Na exposure conditions [Yoshimoto et al, 2002]. 
The results based on three interaction datasets (MIPS [Mewes et al, 2004], DIP [Salwinski et al, 2004], and BioGrid [Stark et al, 2006]) and four other gene expression datasets (chemostat (nutritional stress) [Saldanha et al, 2004], environmental stress [Gasch et al, 2000], oxidative stress [Shapira et al, 2004], and a combined gene expression dataset over more than 1,500 conditions [Tirosh et al, 2006]) are given in the Appendix A. We study the expression data individually in order to minimize the variation among different laboratories. By doing so, we can also confirm whether the results based on different gene expression data are consistent. Consistency of results using a variety of different datasets adds confidence to the conclusions. In this manuscript, we use genes and proteins interchangeably. We declare statistical significance if a p-value is less than 0.05 without adjusting for multiple comparisons. In this study, we conducted an exploratory study of factors affecting gene expression variation. As in many epidemiological studies, we did not adjust p-values for multiple comparisons. Therefore, some of our findings need to be further tested in other datasets. 2.2.1 Gene expression variation versus protein interaction degree We measured gene expression variation by the logarithm of the variance of the expression levels of each gene across the 40 Ca and Na exposure conditions [Yoshimoto 26 et al, 2002]. We then studied the relationship between the expression variation and protein interaction degree using the LOWESS function in R [http://www.r-project.org/] to fit the data. On these bases, it was obvious that gene expression variation decreases with the degree of protein physical interaction (Figure 2.1A). The decreasing trend is especially significant when the interaction degree is relatively low (≤ 20). In contrast, when the interaction degree is greater than 40, the decreasing trend is not as obvious. 
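As a minimal sketch of the variation measure defined above (the logarithm of the variance of a gene's expression levels across conditions), the following assumes base-10 logarithms and the sample variance, neither of which is specified in the text; the function name and the toy profiles are hypothetical.

```python
import math
import statistics


def expression_variation(levels):
    """Log of the variance of one gene's expression levels across conditions."""
    variance = statistics.variance(levels)  # sample variance (assumed choice)
    if variance <= 0:
        raise ValueError("variance must be positive to take its logarithm")
    return math.log10(variance)  # base 10 is an assumption


# Hypothetical expression profiles: gene name -> levels across conditions.
profiles = {
    "geneA": [0.0, 1.0, 2.0],
    "geneB": [0.1, 0.2, 0.1, 0.2],
}
variation = {gene: expression_variation(v) for gene, v in profiles.items()}
print(variation)
```

A smoothed trend such as the LOWESS fit mentioned in the text would then be computed on these per-gene values against interaction degree.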
Biologically, this may be explained by the fact that gene expression variation stabilizes when the interaction degree is above a given threshold. Another potential explanation is that interactions between proteins of high degrees are simply less reliable [Bader et al, 2004]. The large number of less reliable interactions for proteins with high degrees can skew the true relationship between gene expression variation and interaction degrees. Since the true underlying mechanism of this phenomenon is not clear, we limited our further analysis to proteins with an interaction degree of no more than 20. Protein physical interaction degree has been found to be negatively associated with gene expression variation for single cells [Batada et al, 2006] and evolutionary expression variation [Lemos et al, 2004; 2005] between different strains/species. Our result on the relationship between gene expression variation and interacting degree within S. cerevisiae is consistent with their findings. 27 Figure 2.1. Gene expression variation is negatively correlated with protein interaction degree. The x-axis represents protein physical interaction degree, and the y-axis represents gene expression variation. A) The LOWESS fit to the gene expression variation. B) Bar-plot of the expression variation of all the genes with a given protein interaction degree together with the linear regression fit to the gene expression variation in relation to the interaction degree. The linear coefficient β = -0.0302, R 2 = 1.41%, and p-value = 9.704e-14. The red dots are the mean expression variation of the genes given the protein physical interaction (PPI) degree. The bar represents the standard deviation of the gene expression variation given PPI degree. To keep the same scale for gene expression variation across the figures, the range of the y-axis is -2.5 to 0.5. 
Accordingly, we then used linear regression to fit the expression variation for proteins with a maximal physical interaction degree of 20: $v = \alpha + \beta d$, where $v$ is the gene expression variation, $d$ is the interaction degree, and $\alpha$ and $\beta$ are parameters. The fitted line and the corresponding bar-plot for the expression variation are shown in Figure 2.1B. The gene expression variation is significantly negatively correlated with the protein interaction degree (≤ 20) ($R^2$ = 1.41%, β = -0.0302, p-value = 9.704e-14). The negative correlation between expression variation and interaction degree implies that proteins with high interaction degrees do not tolerate extensive expression variation and that such proteins need more precise control of gene expression for an organism to function normally.

2.2.2 Gene expression variation versus essentiality, toxicity modulation, and interaction degrees

As noted above, we divided genes into two classes: essential and non-essential genes. Essential genes are less likely to be perturbed than non-essential genes, as significant perturbations of essential genes will, for example, render the yeast cell non-viable. We further classified the non-essential genes into five groups (no phenotype, 0, and toxicity-modulating proteins with degrees 1, 2, 3, and 4, respectively) according to the cell’s fitness phenotype changes under four DNA-damaging agents (the methylating agent methyl methanesulfonate (MMS), the bulky alkylating agent 4-nitroquinoline-N-oxide (4NQO), the oxidizing agent tert-butyl hydroperoxide (t-BuOOH), and 254-nm UV radiation) when the non-essential genes are knocked out [Said et al, 2004]. Since toxicity degree reflects the functional importance of genes in relation to cell survival under several DNA-damaging perturbations, we expected that gene expression variation would decrease as the toxicity degree increases. Since the number of genes with toxicity degree 4 was small (n = 32), we combined them with the group having toxicity degree 3.
We referred to the essential genes as the group with toxicity degree 4. Figure 2.2A shows the bar-plot and the linear regression fit of the gene expression variation with respect to toxicity degree. Indeed, a significant negative association between gene expression variation and toxicity degree was observed (R 2 =0.75%, β = -0.0629, p-value = 4.73e-08). In the study of the 29 relationship between gene essentiality and evolutionary expression variation conducted by Tirosh and Barkai [2008] and Choi et al. [2007], they found that essential genes tend to have lower variation than non-essential genes. This is consistent with our result which demonstrates that the variation in essential genes (toxicity degree = 4) is lower than that of non-essential genes (combine the proteins with toxicity degree = 0, 1, 2, 3) (p-value = 0.027). However, our study differs from these two studies in that we further classified the genes according to their toxicity degree and found negative association between gene expression and toxicity degree. Figure 2.2. The effect of essentiality, toxicity degree, and protein interaction degree on gene expression variation. A) Bar-plot of the expression variation of all the genes with a given toxicity degree together with the linear regression fit to the expression variation of the genes in relation to the toxicity degree. The linear coefficient β = -0.0629, R 2 = 0.75%, and the p-value = 4.73e-08. B) The mean expression variation and the linear regression fit to the expression variation with respect to PPI degree for non-essential genes stratified according to toxicity degree and for the essential genes. The β values are -0.0172, -0.0230, -0.0304, -0.0164 and -0.0460 for toxicity degree 0, 1, 2, and 3, and the essential genes, respectively. The corresponding p-values are 0.0313, 0.0141, 0.0011, 0.2248 and 0.0013, respectively. R 2 is 0.31%, 0.75%, 2.75%, 0.8% and 4.41%, respectively. The labels are the same as those in Figure 2.1. 
30 We observed a positive correlation between interaction degree and toxicity degree (data not shown) and, therefore, asked whether the observed negative correlation between gene expression variation and interaction degree is, conversely, caused by the positive correlation between interaction degree and toxicity degree. We consequently studied the relationship between gene expression variation and protein interaction degree within gene groups stratified according to their toxicity degrees (Figure 2.2B). Using Ca and Na exposure gene expression data [Yoshimoto et al, 2002], we found a significant decreasing trend of gene expression variation with respect to interaction degree in all the strata except for the one with toxicity degree 3. The corresponding (R 2 , β, p-value) are (0.31%, -0.017, 0.03), (0.75%, -0.023, 0.014), (2.75%, -0.030, 0.001), (0.8%, -0.016, 0.2248), and (4.41%, -0.046, 0.001) for toxicity degrees 0, 1, 2, 3 and 4, respectively. The fraction of expression variation explained by the protein interaction degree seems to increase as the toxicity degree increases. In our analyses, both toxicity and protein interaction degrees are negatively associated with gene expression variation. Hence, the more important a gene is to the survival of the yeast cell, the less variation there is in its expression levels across many different conditions. Similarly, the higher the interaction degree of a gene, the more stability is observed in its expression levels. Biologically, a gene is important to the cell’s survival since it participates in many important biological processes. Any perturbation of this gene’s expression will likely cause deleterious effect to the corresponding biological process and thus renders the cell non-viable. An evolutionary consequence of this hypothesis is that genes important to cell survival appear to have robust expression levels. 
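The stratified analysis in this section can be sketched as an ordinary least-squares fit of expression variation on interaction degree within each toxicity stratum. The function names, the record layout, and the toy values are assumptions for illustration; the sketch reports the intercept, slope, and R² only, and p-values (which would require a t-distribution) are omitted.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Least-squares fit of y = alpha + beta * x; returns (alpha, beta, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return alpha, beta, r_squared


def fit_by_stratum(records, max_degree=20):
    """records: iterable of (toxicity_degree, interaction_degree, variation).

    Genes with interaction degree above max_degree are excluded, following
    the restriction to degree <= 20 used in the text.
    """
    groups = defaultdict(list)
    for stratum, degree, variation in records:
        if degree <= max_degree:
            groups[stratum].append((degree, variation))
    return {
        stratum: fit_line([d for d, _ in pts], [v for _, v in pts])
        for stratum, pts in sorted(groups.items())
    }


# Hypothetical toy records; stratum 4 stands for the essential genes.
records = [
    (0, 1, -0.1), (0, 2, -0.2), (0, 3, -0.3), (0, 30, 5.0),
    (4, 2, -1.0), (4, 4, -1.2), (4, 6, -1.4),
]
fits = fit_by_stratum(records)
for stratum, (alpha, beta, r2) in fits.items():
    print(stratum, round(alpha, 4), round(beta, 4), round(r2, 4))
```

Comparing the slopes and R² values across strata is what indicates whether the degree effect persists after controlling for toxicity degree.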
31 2.2.3 Expression variation versus gene regulatory regions: TATA box, number of TFs, and toxicity degree Previous studies have established the relationship between gene expression variation and the regulatory regions, including the presence/absence of TATA box [Landry et al, 2007; Tirosh et al, 2006], the length of intergenic regions [Nelson et al, 2004], and the number of cis-regulatory elements [Walther et al, 2007]. We therefore asked if the observed relationship between gene expression variation and toxicity degree are the same for TATA-containing genes and non-TATA-containing genes. To accomplish this goal, we first stratified the yeast genes based on the presence/absence of TATA boxes and reanalyzed the relationship between gene expression variation and toxicity degree. Consistent with previous findings [Landry et al, 2007; Tirosh et al, 2006], it is clear that the gene expression variation of TATA-containing genes is much higher than that of non- TATA-containing genes for each fixed toxicity degree (p-value < 2.2e-16). Significant negative association between gene expression variation and toxicity degree was observed for the TATA-containing group (Figure 2.3A) (R 2 = 2.59%, β = -0.1674, p-value = 7.174e-06). However, the association between expression variation and toxicity degree for the non-TATA-containing group is only marginally significant (R 2 = 0.13%, β = - 0.02338, p-value = 0.0413). Consistent with the other two gene expression datasets, chemostat [Saldanha et al, 2004] and environmental stress [Gasch et al, 2000], as well as the combined gene expression data from [Tirosh et al, 2006] (see Appendix A: Table A5.2), such a highly significant negative relationship between gene expression variation and toxicity degree could be observed in TATA-containing genes, but not for the non- TATA-containing genes. Thus, the effect of toxicity degree on gene expression variation 32 is different for TATA-containing genes versus non-TATA-containing genes. 
The relatively small R² value between gene expression variation and toxicity degree alone may be explained by the absence of association between them within the non-TATA-containing genes.

Figure 2.3. The effect of TATA box, number of TFs, and toxicity degree on gene expression variation. A) The relationship between expression variation and toxicity degree stratified by the presence/absence of the TATA box (R² = 2.59%, β = -0.1674, p-value = 7.174e-06 for the TATA-containing gene set; R² = 0.13%, β = -0.0234, p-value = 0.0413 for the non-TATA-containing gene set). B) The relationship between expression variation and the number of TFs up to 25 (R² = 8.28%, β = 0.0654, p-value < 2.2e-16). The labels are the same as those in Figure 2.1.

A previous study found that the number of cis-element motifs contributes to gene expression variation in A. thaliana [Walther et al, 2007], but Landry et al. [2007] found that the number of cis-elements only marginally affects gene expression variation in yeast. Here we studied two factors, the number of cis-elements and the number of TFs, in relation to gene expression variation. We first analyzed the relationship between gene expression variation and the number of cis-elements [MacIsaac et al, 2006] using the linear model. Similar to MacIsaac et al. [2006], we defined cis-elements according to binding probability (p<0.0001) and classified them according to their conservation in two other yeast species. The results are presented in Table 2.1 for the different gene expression data. For the Ca and Na exposure data [Yoshimoto et al, 2002], the highest R² is 1.33% (p-value = 4.45e-08), obtained when cis-elements that are conserved in at least one species are used. We then studied the relationship between gene expression variation and the number of TFs that influence target gene expression, based on the gene regulatory network developed in Hu et al. [2007].
A highly significant positive correlation between gene expression variation and the number of TFs was observed (R² = 8.28%, β = 0.0654, p-value < 2.2e-16) (Figure 2.3B). The fraction of variation explained by the number of TFs (R² = 8.28%) is much higher than that explained by the number of cis-elements, indicating that the number of TFs is a better predictor of gene expression variation than the number of cis-elements. This result implies the importance of trans-effects for gene expression variation.

Table 2.1: The relationship between gene expression variation and the number of cis-elements.

Cis-elements are identified with binding p<0.0001 and conservation in at least 2 other yeast species
Gene expression dataset | Linear regression β | p value | R²
Ca_Na exposure | 0.0589 | 1.17e-06 | 1.25%
Chemostat | 0.0188 | 0.0067 | 0.39%
Environmental Stress | 0.0507 | 5.48e-07 | 1.33%
Oxidative Stress | 0.0025 | 0.725 | 0.01%

Cis-elements are identified with binding p<0.0001 and conservation in at least 1 other yeast species
Gene expression dataset | Linear regression β | p value | R²
Ca_Na exposure | 0.0539 | 4.45e-08 | 1.33%
Chemostat | 0.0161 | 0.0046 | 0.36%
Environmental Stress | 0.0516 | 3.56e-10 | 1.74%
Oxidative Stress | 0.0013 | 0.826 | 0

Cis-elements are identified with binding p<0.0001 and no conservation criteria
Gene expression dataset | Linear regression β | p value | R²
Ca_Na exposure | 0.0393 | 1.05e-07 | 1.02%
Chemostat | 0.0151 | 0.0004 | 0.45%
Environmental Stress | 0.0346 | 1.68e-08 | 1.15%
Oxidative Stress | 0.0028 | 0.526 | 0.01%

Cis-elements are identified with three different criteria according to their conservation in two other species. β is the linear coefficient in the linear model, the p-value is for the test of the null hypothesis β = 0 against the alternative β ≠ 0, and R² is the fraction of variation explained by the number of cis-elements.

2.2.4 Overall analysis of factors affecting gene expression variation

As enumerated above, we have identified several factors influencing gene expression variation in S. cerevisiae.
In addition to the presence/absence of a TATA box identified in previous studies [Landry et al, 2007; Tirosh et al, 2006], we found that gene expression variation decreases as both the protein interaction degree and the toxicity degree increase. These findings are consistent with other studies of expression variation in single cells [Batada et al, 2006] and of evolutionary expression variation across different strains/species [Lemos et al, 2004; 2005; Choi et al, 2007]. We also found that gene expression variation increases as the number of TFs or the number of cis-elements increases, and that the number of TFs regulating a gene is a much better predictor of expression variation than the number of cis-elements. We therefore studied the contribution of each factor, and of the interactions between factors, to gene expression variation while taking the other factors into account, using model selection based on the Akaike Information Criterion (AIC) [Akaike 1973]. We retained the model with the smallest AIC. Table 2.2 lists the factors included in the final linear model, selected stepwise by AIC, using the MIPS interaction data [Mewes et al, 2004] and the four expression profiles. We then studied the effect of the selected factors on expression variation using linear regression; the corresponding p-values and R² values are given in Table 2.2. The results showed that protein interaction degree explained less than 1% of the variation when adjusted for the three other main factors and two interaction terms. Except for the oxidative stress gene expression dataset [Shapira et al, 2004], the consistently selected model contained four main factors (protein interaction degree, toxicity degree, the number of TFs, and TATA box) and two interaction terms, i.e., the interaction between TATA and toxicity degree and the interaction between TATA and the number of TFs.
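The AIC-based comparison described above can be illustrated with a minimal sketch. It assumes the least-squares form of the criterion given in the Methods, AIC = 2K + n ln(SSE/n), and compares a few candidate models by that score; the candidate labels, residual sums of squares, and gene count are hypothetical numbers chosen for illustration, not values from this study. A full stepwise procedure would repeat such comparisons while adding or dropping one term at a time.

```python
import math

def aic_least_squares(sse, n, k):
    """AIC for a least-squares fit: 2K + n * ln(SSE / n)."""
    return 2 * k + n * math.log(sse / n)

# Hypothetical candidates: (label, residual sum of squares, number of parameters).
candidates = [
    ("x3", 520.0, 2),
    ("x3 + x4", 480.0, 3),
    ("x3 + x4 + x3*x4", 479.8, 4),
]
n_genes = 1000  # hypothetical number of observations

scores = {label: aic_least_squares(sse, n_genes, k) for label, sse, k in candidates}
best = min(scores, key=scores.get)  # the model with the smallest AIC is retained

for label, score in scores.items():
    print(f"{label}: AIC = {score:.2f}")
print("retained model:", best)
```

In this toy comparison the extra interaction term lowers the residual sum of squares only slightly, so the 2K penalty outweighs the gain and the simpler two-factor model is retained.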
The two interaction terms are statistically significant across three expression datasets, Ca and Na exposure [Yoshimoto et al, 2002], chemostat [Saldanha et al, 2004], and environmental stress [Gasch et al, 2000] (p-value < 0.05), and explain 0.4% - 2.7% of the variation.

Table 2.2: Analysis of four factors and their interactions affecting expression variation using stepwise selection with AIC.

Each cell gives: model, p value, R².
variable | Ca_Na_exposure | Chemostat | Environmental Stress | Oxidative Stress
x1 | √ 0.4252 0.05% | √ 0.1582 0.16% | √ 0.0647 0.28% | √ 0.1075 0.21%
x2 | √ 0.3721 0.06% | √ 0.0635 0.28% | √ 0.0861 0.24% | 0.8729 0.002%
x3 | √ 1.16E-20 6.76% | √ 4.59E-09 2.73% | √ < 2e-16 13.42% | √ 7.30E-09 2.65%
x4 | √ 3.22E-12 3.83% | √ 7.96E-10 3.00% | √ < 2e-16 6.92% | √ 0.8841 0.002%
x2*x4 | √ 0.0226 0.41% | √ 0.0036 0.68% | √ 0.0108 0.53% |
x3*x4 | √ 0.0039 0.67% | √ 0.0185 0.45% | √ 6.2e-09 2.71% |
R² model | 16.36% | 12.73% | 22.39% | 4.43%
x1*x2: no entries
x1*x3: no entries
x1*x4 (entries for three of the four datasets): √ 0.0581 0.29%; √ 0.0100 0.53%; √ 0.0348 0.36%
x2*x3 (entry for one dataset): √ 0.0422 0.33%

The four main factors are protein interaction degree (x1), toxicity degree (x2; essential genes are treated as having toxicity degree 4), number of TFs (x3), and the presence of a TATA box (x4: 1 for TATA-containing genes, 0 for non-TATA-containing genes). The protein interaction data used in this analysis are from the MIPS dataset. “√” indicates inclusion in the final linear model. The multiple linear regression is based on the final linear model for each dataset. The p-value is for the test of the null hypothesis β = 0 against the alternative β ≠ 0. R² is the variation explained by the model and by each independent variable, respectively.

Because we found that the TATA box interacts with toxicity degree and with the number of TFs in influencing gene expression variation, we reanalyzed the contributions of toxicity degree and number of TFs to gene expression variation, stratified by the presence/absence of a TATA box, using the linear model.
The protein physical interaction degree was not included in this analysis because no interaction between TATA box and physical interaction degree was detected. Table 2.3 shows the different results for the TATA and non-TATA gene sets. First, the number of TFs is the most significant factor for gene expression variation in both TATA- and non-TATA-containing genes, with p-values less than 5e-6 for all the datasets. However, the association strength between the number of TFs and gene expression variation, measured by the coefficient β, is about 1.5-fold higher for the non-TATA-containing genes than for the TATA-containing genes. Accordingly, the p-values for the non-TATA-containing genes are about three orders of magnitude smaller than the corresponding p-values for the TATA-containing gene set. We noted that TATA-containing genes tend to have a higher number of TFs than non-TATA-containing genes (p-value = 1.184e-10), but this cannot explain our observation that the association strength between expression variation and the number of TFs within the TATA-containing genes is lower than that within the non-TATA-containing genes. One possible explanation is that the presence of a TATA box weakens the effect of the number of TFs on gene expression during evolution. Second, the toxicity degree is a significant contributor to gene expression variation only within the TATA-containing genes in the Ca and Na exposure [Yoshimoto et al, 2002] and chemostat [Saldanha et al, 2004] gene expression datasets; it is not a significant contributor to gene expression variation in the non-TATA-containing gene set. Within the environmental stress gene expression dataset [Gasch et al, 2000], the effect of toxicity degree on gene expression variation was not statistically significant. However, we did observe a decreasing trend of gene expression variation with respect to toxicity degree within the TATA-containing genes (see Appendix A, Figure A1.5).

Table 2.3.
The effect of two factors on expression variation stratified by the presence/absence of TATA box.

Each cell gives: β, p-value, R².

TATA dataset
variable | Ca_Na_exposure | Chemostat | Environmental Stress | Oxidative Stress
x2 | -0.1436, 0.0024, 1.95% | -0.0764, 0.0034, 1.81% | -0.0510, 0.156, 0.43% | 0.0081, 0.778, 0.02%
x3 | 0.0312, 5.4e-10, 7.90% | 0.0168, 1.5e-09, 7.50% | 0.0283, 2.6e-13, 10.79% | 0.0138, 4.8e-06, 4.37%
R² (model) | 9.47% | 8.97% | 11.09% | 4.39%

Non-TATA dataset
variable | Ca_Na_exposure | Chemostat | Environmental Stress | Oxidative Stress
x2 | -0.0167, 0.402, 0.06% | -0.0026, 0.8189, 0.005% | 0.0258, 0.086, 0.26% | 0.0038, 0.747, 0.009%
x3 | 0.0502, <2e-16, 8.70% | 0.0300, <2e-16, 9.56% | 0.0468, <2e-16, 12.6% | 0.0181, 4.4e-10, 3.40%
R² (model) | 8.85% | 9.62% | 12.67% | 3.39%

A linear model that includes the toxicity degree (x2) and the number of TFs (x3) is built for each of the four gene expression datasets. R² is the variation explained by each independent factor and by the model, respectively. β is the linear coefficient in the linear model, and the p-value is for the test of the null hypothesis β = 0 against the alternative β ≠ 0.

The toxicity degree of a gene measures the tolerance of the yeast cell to different external stress conditions when the gene is knocked out. Therefore, it might be expected that the different relationship between gene expression variation and toxicity degree for TATA-containing genes and non-TATA-containing genes is due to the enrichment of stress-related genes among TATA-containing genes, as found in [Basehoar et al, 2004]. To test this hypothesis, we used a set of genes related to the Environmental Stress Response (ESR) [Gasch et al, 2000]. If the hypothesis is true, we would expect a stronger association between expression variation and toxicity degree within the ESR genes than within the non-ESR genes. However, our data show that the β values are similar for the two groups of genes (Table 2.4).
Thus we cannot explain the interaction between TATA and toxicity degree by the enrichment of ESR genes among TATA-containing genes. The biological mechanisms underlying the observed interaction are not clear and need to be studied further.

Table 2.4. The effect of toxicity degree on expression variation stratified by membership in the environmental stress response (ESR) gene set.

Each cell gives: β, p-value.
Gene group | Ca_Na_exposure | Chemostat | Environmental Stress | Oxidative Stress
ESR | -0.0710, 0.0053 | -0.0314, 0.0532 | -0.0176, 0.3640 | 0.0026, 0.8708
Non-ESR | -0.0848, 2.09e-13 | -0.0354, 1.11e-07 | -0.0579, 1.28e-11 | -0.0091, 0.2086

A linear model is built for each of the four gene expression datasets. β is the linear coefficient in the linear model, and the p-value is for the test of the null hypothesis β = 0 against the alternative β ≠ 0.

We also performed the same analysis for the average gene expression variation across the four expression datasets (Ca and Na exposure [Yoshimoto et al, 2002], chemostat [Saldanha et al, 2004], environmental stress [Gasch et al, 2000], and oxidative stress [Shapira et al, 2004]) and for the combined gene expression data of Landry et al. [2006]; the results are presented in Appendix A. The same conclusions are obtained, indicating the robustness of our results. Previous studies showed that TATA- and non-TATA-containing genes might recruit different coactivator complexes for gene expression [Basehoar et al, 2004]. TATA-containing genes were also found to be subject to greater nucleosomal regulation than non-TATA-containing genes [Basehoar et al, 2004]. Basehoar et al. [2004] suggested that two distinct regulatory mechanisms may be present at TATA-containing and TATA-less promoters. The results in Table 2.3 support their findings.

The results based on the oxidative stress gene expression dataset [Shapira et al, 2004] are not consistent with the results based on the other three gene expression datasets.
This observation may be due to the relatively small gene expression variation in this dataset. For example, the range of the variance of the expression levels within the oxidative stress dataset, (0.07, 5.34), is much smaller than the corresponding ranges, (0.02, 10.59), (0.17, 9.18), and (0.09, 11.07), for the Ca and Na exposure [Yoshimoto et al, 2002], chemostat [Saldanha et al, 2004], and environmental stress [Gasch et al, 2000] datasets, respectively.

We also studied the contributing factors for gene expression variation using the DIP [Salwinski et al, 2004] and BioGrid [Stark et al, 2006] interaction data, and the results are given in Appendix A. The conclusions are similar to those based on the MIPS interaction data [Mewes et al, 2004]. The consistency of the results across different combinations of protein interaction datasets and gene expression profiles shows the robustness of our conclusions. However, the fraction of gene expression variation explained by all factors is less than 25%. One possible explanation is that the measurements of gene expression changes and of the other factors, including the toxicity degree and interaction degree, are still very noisy. We expect that the true R² would be higher than that observed in this study.

2.3 Materials and Methods

In order to study factors affecting gene expression variation, we collected data on gene expression profiles, protein physical interactions, gene regulatory networks, essentiality and toxicity resistance. Details of these data are given below.

2.3.1 Gene expression profiles

A large number of gene expression studies are available. In this study, we chose gene expression studies containing at least 40 conditions.
These datasets include yeast gene expression profiles under 40 Ca and Na exposure conditions [Yoshimoto et al, 2002], 100 chemostat (i.e., nutritional stress) conditions [Saldanha et al, 2004], 156 environmental stress conditions [Gasch et al, 2000], and 70 oxidative stress conditions [Shapira et al, 2004]. These data were analyzed separately to ensure that between-laboratory variation was minimized. A combined gene expression profile under more than 1,500 conditions was collected by Tirosh et al. [2006]. The responsiveness of each gene across these conditions, as calculated by Tirosh et al. [2006], was used in our analysis as expression variation.

2.3.2 Protein interaction data

We downloaded yeast protein interaction data from three different data sources. The MIPS (Munich Information Center for Protein Sequences) [Mewes et al, 2004] dataset (version: PPI_18052006.tab) contains 11,124 protein physical interactions involving 4,404 proteins. The DIP core interaction dataset [Salwinski et al, 2004] (version: ScereCR20070107) contains 5,738 protein interactions involving 2,161 proteins. The DIP core interactions were assessed by a number of quality tests and are considered highly reliable [Deane et al, 2002]. The BioGrid [Stark et al, 2006] dataset (version 2.0.34) contains 59,317 protein physical interactions involving 5,054 proteins.

2.3.3 Essential and toxicity modulating genes

Large-scale gene deletion studies have identified about 17-20% of the genes as essential for yeast cell survival under normal conditions [Giaever et al, 2002]. Even within the class of non-essential genes, not all genes are equally important to cell survival. Further studies classified the non-essential genes based on the cell’s fitness phenotypes under four different DNA damage perturbations when a gene is knocked out [Said et al, 2004]. The toxicity modulating genes were defined as those significantly affecting the cell’s fitness phenotype when knocked out.
We defined the toxicity degree of a gene as the number of perturbations that significantly affected the deletion strain’s fitness. Essential genes were downloaded from the SGD website [Winzeler et al, 1999], and the toxicity degrees of non-essential genes were calculated from the data of Said et al. [2004].

2.3.4 Gene Regulatory Network

Studies have shown that gene expression variation is positively correlated with the number of cis-regulatory elements and the length of the intergenic region in several organisms. Since cis-elements control the expression of genes through interaction with TFs, it is interesting to study whether the number of TFs regulating a gene has an effect on gene expression variation. The mapping of cis-elements to genes was obtained using the motif discovery algorithms PhyloCon and Converge, with binding p-value less than 0.001 and conservation in at least 0, 1 or 2 other yeast species [MacIsaac et al, 2006]. The mapping of TFs to genes was obtained from Hu et al. [2007].

2.3.5 TATA-containing genes

A TATA box is a DNA sequence (cis-element) found in the promoter region of many eukaryotic genes. The TATA consensus sequence was identified as TATA(A/T)A(A/T)(A/G) [Basehoar et al, 2004]. The TATA box has been identified as a very important factor for gene expression variation. The assignment of TATA boxes to yeast genes was downloaded from [Basehoar et al, 2004]. Of 6278 genes, 1090 were predicted to have a TATA box. Our analysis used these 1090 genes as TATA-containing genes and the other genes as non-TATA-containing genes. (We note that 607 genes are not classified in [Basehoar et al, 2004], and the results are essentially the same when these genes are not considered (data not shown).)

2.3.6 Statistical Analysis

Gene expression variation was measured by the logarithm of the variance of the gene expression levels under various conditions. The distribution of the variance was not normal.
In addition, the standard deviations of the distributions of the variance conditional on the independent variables (protein physical interaction degree, toxicity degree, TATA box, number of TFs) differed widely, making the linear model for the variance invalid. To avoid these problems, we measured the gene expression variation by the logarithm of the variance. The resulting distributions seem to fit the conditions for the linear model. Hence, in our study, we used a linear model to study the relationship between the expression variation and each factor. In the study of the relationship between gene expression variation and interaction degree, we first used the LOWESS function in R [http://www.r-project.org/] to fit the data. An approximately linear relationship between gene expression variation and interaction degree was observed when the interaction degree was less than 20. We then used linear regression to fit the data up to interaction degree 20:

v = α + βd

where v is the gene expression variation, d is the interaction degree, and α and β are parameters. We tested the statistical significance of the relationship between gene expression variation and interaction degree based on this linear regression model.

Before the joint analysis of expression variation with respect to the four factors (protein interaction degree, toxicity degree, number of TFs, and TATA box), we tested whether the four factors are highly correlated. Their correlation matrix is given in Appendix A (Table A6.1). All the correlation coefficients are smaller than 0.3, indicating that the factors are not highly correlated. Although it might be more computationally reasonable to first find the principal components of these factors and then analyze the data using linear regression, the final result would be difficult to interpret. Since these factors are not highly correlated, we treat them as independent factors in our joint analysis.
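The single-factor analysis described above can be sketched as follows: expression variation is the logarithm of the variance of a gene's expression levels, and a line v = α + βd is fitted by least squares to genes with interaction degree up to 20. This is a minimal illustration, not the code used in this study; the toy genes and their expression levels are hypothetical, the natural logarithm is an assumption (the base is not stated), and the significance test on β is omitted.

```python
import math
from statistics import variance

def expression_variation(levels):
    """Log (natural log assumed) of the sample variance of one gene's expression levels."""
    return math.log(variance(levels))

def simple_ols(x, y):
    """Least-squares fit of y = alpha + beta * x; returns (alpha, beta, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return alpha, beta, r_squared

# Hypothetical toy data: (interaction degree, expression levels across conditions).
genes = [
    (1, [0.0, 2.0, -2.0, 1.0]),
    (5, [0.0, 1.0, -1.0, 0.5]),
    (12, [0.0, 0.5, -0.5, 0.2]),
    (30, [0.0, 0.1, -0.1, 0.0]),  # degree above the cutoff, excluded from the fit
]
MAX_DEGREE = 20  # the fit is restricted to interaction degrees up to 20

kept = [(d, expression_variation(levels)) for d, levels in genes if d <= MAX_DEGREE]
degrees = [d for d, _ in kept]
variations = [v for _, v in kept]
alpha, beta, r2 = simple_ols(degrees, variations)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, R^2 = {r2:.3f}")
```

In the toy data the variance shrinks as the degree grows, so the fitted slope β is negative, mirroring the direction of the trend reported in the Results.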
In the overall analysis, we first used stepwise selection to find the model that gives the smallest AIC (Akaike information criterion), AIC = 2K + n ln(SSE/n), where K is the number of parameters in the model, n is the number of observations, and SSE is the residual sum of squares [Burnham and Anderson, 2002]. We then used linear regression to analyze the relationship between gene expression variation and the retained factors and interactions. The corresponding p-values and R² values are reported in Table 2.2.

Chapter 3: Chromatin regulation and gene centrality are essential for controlling fitness pleiotropy in yeast

3.1 Introduction

3.1.1 Previous work

Mutations in individual genes or in a combination of genes can have varying effects on phenotype. To study this further, a collection of S. cerevisiae strains was generated, each carrying a deletion of a single gene, such that every gene in the genome is deleted in one strain [Giaever et al, 2002]. The effects of these mutations on viability, when each strain was grown in rich medium, identified a set of essential genes comprising about 20% of all the genes [Giaever et al, 2002]. Essential genes are required for cell viability, while the other genes are nonessential. Several studies have examined the functions of gene products that distinguish essential from nonessential genes [Jeong et al., 2001; Batada et al, 2006; Coulomb et al, 2005; Hakes et al, 2005; Hakes et al 2008; Chen and Xu, 2004]. Jeong et al. [2001] found that essential genes tend to encode products that have a large number of physical interaction partners. This finding has been challenged [Batada et al, 2006; Coulomb et al, 2005; Hakes et al, 2005; Hakes et al 2008], although the relationship appears to be conserved across phyla [Chen and Xu, 2004]. Batada et al.
[2006] showed that essential genes tend to have higher connectivity than non-essential genes in most interaction datasets, but the strength of the relationship differs depending on the protein interaction data used. For example, with the Y2H dataset, the relationship is weak and not significant (p=0.056) at the α = 0.001 level. Coulomb et al [2005] and Hakes et al [2005; 2008] argued that, because essential genes receive more attention than non-essential genes and therefore tend to have more documented interaction partners, small-scale studies are likely to inflate the interaction degree of essential genes. Using the Y2H protein interaction dataset, only a weak positive correlation is observed between gene essentiality and protein connectivity [Coulomb et al, 2005].

3.1.2 Our goal and hypothesis

The observation that ~80% of genes are not essential for viability suggested that they contribute to optimum fitness in response to different growth conditions. To study the functions of non-essential genes, growth rates (fitness) of the S. cerevisiae deletion strains were examined in various culture conditions [Brown et al, 2006; Parsons et al, 2006; Hillenmeyer et al, 2008]. One of the objectives of these studies was to group genes with similar fitness profiles in order to provide insight into gene function. With these datasets, a gene’s importance to survival can be measured by fitness pleiotropy. A gene’s fitness pleiotropy is defined as the number of conditions under which the fitness of the corresponding S. cerevisiae deletion strain is significantly reduced [He and Zhang, 2006]. Fitness pleiotropy is a quantitative measurement of the importance of a gene’s function to the organism’s relative fitness. The more important a gene is to fitness, the higher the fitness pleiotropy. In other words, for the deletion strain of a given gene, the number of conditions under which the growth rate is significantly reduced is counted, and this count is the gene’s fitness pleiotropy.
Thus, if the gene is important for growth, the gene will have a high fitness pleiotropy measure. Previously it was shown that the fitness pleiotropy of a gene is positively associated with the number of biological processes in which the gene’s product functions, as well as with the number of protein interaction partners of the gene product [He and Zhang, 2006]. A positive association between the fitness pleiotropy of transcription factors (TFs) and the number of a TF’s target genes was also found [He and Zhang, 2006]. However, that association was not statistically significant (p-value = 0.22).

Here, the fitness data from the S. cerevisiae deletion strains from the previous studies [Brown et al, 2006; Parsons et al, 2006; Hillenmeyer et al, 2008] were re-examined to determine whether other centrality measures, in addition to protein physical interaction (PPI) degree, are associated with fitness pleiotropy. We consider two additional centrality measures: 1) betweenness (BW; the fraction of shortest paths between any two proteins that pass through the given protein in a protein interaction network [Freeman, 1979]) and 2) the clustering coefficient (CC; for a given protein, the ratio of the number of edges between its first-order neighbors to the number of all possible edges between those neighbors [Watts and Strogatz, 1998]). The clustering coefficient of a protein in an interaction network quantifies the potential for the protein and its interaction partners to form a complex. It has previously been shown that proteins within complexes are more likely to be essential [Batada et al, 2006]. Thus we consider three measures, PPI degree, BW and CC, whereas the previous study considered only one (PPI degree; [He and Zhang, 2006]). Our results show that both PPI degree and CC are strongly associated with fitness pleiotropy and that the association between BW and pleiotropy can be explained by the association between PPI degree and pleiotropy.
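To make the definitions of PPI degree and clustering coefficient concrete, the sketch below computes both for a small undirected network. The protein names and edges are hypothetical, and assigning CC = 0 to proteins with fewer than two interaction partners is an assumed convention rather than something stated in the text.

```python
from itertools import combinations

def build_network(edges):
    """Build an undirected neighbor map from (protein, protein) pairs, ignoring self-loops."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors

def ppi_degree(neighbors, protein):
    """Number of distinct interaction partners of a protein."""
    return len(neighbors.get(protein, ()))

def clustering_coefficient(neighbors, protein):
    """Edges among a protein's first-order neighbors divided by all possible such edges."""
    partners = neighbors.get(protein, set())
    k = len(partners)
    if k < 2:
        return 0.0  # assumed convention when no neighbor pair exists
    linked = sum(1 for u, v in combinations(partners, 2) if v in neighbors[u])
    return linked / (k * (k - 1) / 2)

# Hypothetical toy network.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
network = build_network(edges)

for protein in sorted(network):
    degree = ppi_degree(network, protein)
    cc = clustering_coefficient(network, protein)
    print(f"{protein}: degree = {degree}, CC = {cc:.3f}")
```

In this toy network, protein A has three partners of which only one pair (B and C) interacts, so its CC is 1/3, while B, whose two partners interact with each other, has CC = 1.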
Additionally, we re-examined whether the number of target genes of a TF is associated with fitness pleiotropy and show that the association is statistically significant. We also determined the effect of chromatin regulation on fitness pleiotropy in two ways. First, we examined the fitness pleiotropy of genes that encode chromatin regulatory factors, which likely influence transcription by altering chromatin structure. Second, we examined the epigenetic regulatory effect for every gene, here defined as the chromatin regulation effect (CRE); the CRE of a gene is the mean absolute change in the gene’s expression level when chromatin regulators are mutated, as was done previously [Choi and Kim, 2008]. We find that CRE is strongly associated with fitness pleiotropy.

In summary, the following work will demonstrate that 1) gene centrality, particularly in relation to the protein interaction network, as measured by PPI degree and clustering coefficient (CC), and 2) chromatin regulation, as measured by the chromatin regulation effect (CRE), underlie fitness pleiotropy in S. cerevisiae.

3.2 Results and Discussion

Three phenotypic profiles were used to define fitness pleiotropy. In the first experiment, a quantitative profile for 4,277 mutant diploid strains, each homozygous for a deletion of a nonessential gene, was examined under 51 growth conditions [Brown et al, 2006]. In the second experiment, a quantitative profile of 4,111 mutant haploid strains, each with a deletion of a nonessential gene, was examined under 82 growth conditions [Parsons et al, 2006]. In the third experiment, a quantitative profile for 4,742 mutant strains, each homozygous for a deletion of a nonessential gene, was examined under 418 conditions, and a quantitative profile for 4,956 mutant strains, each heterozygous for a deletion of a nonessential gene, was examined under 726 conditions [Hillenmeyer et al, 2008].
The results using the phenotypic profile from Brown et al. [2006] are presented below, while those based on the phenotypic profiles from Parsons et al. [2006] and from Hillenmeyer et al [2008] are found in Appendix B. The results based on the phenotypic profiles of heterozygous deletions [Hillenmeyer et al, 2008] are not shown, since statistical significance is weak or absent for some relationships. Moreover, we found that the correlation between the fitness pleiotropy of homozygous deletions and that of heterozygous deletions [Hillenmeyer et al, 2008] under 119 unique conditions was very low. A likely biological explanation is that the protein products are still present in the heterozygous deletion strains.

To ensure that our results do not depend on the particular interaction dataset used, we studied three interaction data sources: MIPS [Mewes et al, 2004], DIP [Salwinski et al, 2004], and BioGrid [Stark et al, 2006]. In the main text, we present only the results regarding protein interaction degree obtained with the MIPS dataset [Mewes et al, 2004]; the results obtained with the DIP [Salwinski et al, 2004] and BioGrid [Stark et al, 2006] datasets are found in Appendix B.

3.2.1 The relationship between fitness pleiotropy and gene product centrality as measured within the protein interaction network: protein interaction degree, betweenness, and clustering coefficient

The physical interactions between proteins form a protein interaction network. In this network, each protein is a node, and each physical interaction between two proteins is an edge. The physical protein interaction degree (PPI degree) is defined as the number of interaction partners of each protein. Since protein interactions play a central role in protein function, proteins with high PPI degree may be involved in more biological processes. Thus, we expect that genes that encode such proteins will have high fitness pleiotropy.
As shown in Figure 3.1A, as PPI degree increases, the fitness pleiotropy of the gene also increases (ρ= 0.232, p<2.2e-16). This result is consistent with the findings of He and Zhang [2006], who found a relatively weak, yet significant, positive association between fitness pleiotropy and PPI degree (ρ= 0.19, p<e-6) using a different dataset. The positive association between fitness pleiotropy and PPI degree indicates that when a gene with a high PPI degree is deleted, the functions of the many proteins that interact with its product are likely to be affected, resulting in changes in overall fitness under different growth conditions. Hence, the importance of a gene with respect to fitness increases with the gene product’s PPI degree. The findings are also consistent with previous results showing that the essential genes, which have the highest fitness pleiotropy, tend to have products with higher physical interaction degrees (in our dataset, p = 1.4e-4) [Jeong et al, 2001; Yu et al, 2007].

Figure 3.1. The relationship between fitness pleiotropy and PPI degree (A) and between CC and PPI degree (B). A) Fitness pleiotropy is positively correlated with protein physical interaction (PPI) degree. Spearman’s rank correlation is used to measure the relationship between fitness pleiotropy and PPI degree (ρ= 0.232, p<2.2e-16). The red dots are the mean fitness pleiotropy of the genes with a given PPI degree. For visualization, the blue line represents the linear regression. Note that fewer than 1% of proteins have a PPI degree higher than 50 (data not shown). B) Scatter plot of the relationship between clustering coefficient and PPI degree. The Spearman correlation coefficient ρ is 0.643 (p< 2.2e-16).

In this study, a gene’s product is considered central (gene centrality) based on a high PPI degree and two other measures: betweenness (BW) and clustering coefficient (CC).
First, the BW of a target protein is the fraction of shortest paths between any pair of proteins that pass through the target protein. It thus measures how frequently the target protein is used when a signal is transmitted between two proteins. Yu et al. [2007] showed that PPI degree is a better predictor of protein essentiality than BW in a protein interaction network, although the probability of a protein being essential increases with BW. Here we examined whether the fitness pleiotropy of a non-essential gene increases with BW. Fitness pleiotropy is significantly positively associated with BW (ρ=0.178, p<2e-16). PPI degree and BW are also highly correlated, with a Spearman correlation of ρ=0.893 in our dataset. These findings suggest that the correlation between fitness pleiotropy and BW may simply reflect the correlation between fitness pleiotropy and PPI degree. To determine whether this is true, we examined the partial correlation between fitness pleiotropy and PPI degree with BW controlled: ρ(fitness pleiotropy, PPI degree | BW) = 0.169 (p=1.7e-20). When PPI degree is controlled, the partial correlation between fitness pleiotropy and BW is -0.077 (p=2.6e-05), whose absolute value is much smaller than that of the partial correlation between fitness pleiotropy and PPI degree when BW is controlled. Note that the sign of ρ(fitness pleiotropy, BW | PPI degree) is the reverse of the sign of ρ(fitness pleiotropy, BW). These results indicate that PPI degree is a better predictor of fitness pleiotropy than BW, because the partial correlation between fitness pleiotropy and BW is minimal when PPI degree is controlled. This finding is consistent with the result of Yu et al. [2007] that PPI degree is a better predictor of essentiality than BW. Therefore, we will not consider BW in the studies presented below.

Second, the clustering coefficient (CC) for the non-essential genes was examined.
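As an aside on the BW analysis above, the reported partial correlations can be roughly checked from the three pairwise Spearman coefficients alone, using the first-order partial correlation formula given in Section 3.3.6. The sketch below is illustrative only: it plugs in the rounded coefficients quoted in the text, and the function and variable names are invented for this example.

```python
from math import sqrt


def first_order_partial(r_xy, r_xz, r_yz):
    """Partial correlation between x and y after controlling for z."""
    return (r_xy - r_xz * r_yz) / sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Pairwise Spearman coefficients as quoted (rounded) in the text above.
R_FP_PPI = 0.232  # fitness pleiotropy vs PPI degree
R_FP_BW = 0.178   # fitness pleiotropy vs BW
R_PPI_BW = 0.893  # PPI degree vs BW

# Fitness pleiotropy vs PPI degree, controlling for BW.
fp_ppi_given_bw = first_order_partial(R_FP_PPI, R_FP_BW, R_PPI_BW)
# Fitness pleiotropy vs BW, controlling for PPI degree.
fp_bw_given_ppi = first_order_partial(R_FP_BW, R_FP_PPI, R_PPI_BW)

print(round(fp_ppi_given_bw, 3), round(fp_bw_given_ppi, 3))
```

With these rounded inputs the two values come out at roughly 0.165 and -0.067, close to but not identical with the reported 0.169 and -0.077. A small gap of this kind would be expected from rounding of the inputs and from any difference in the set of genes entering each pairwise correlation; the sign reversal for BW is reproduced.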
CC quantifies the potential for the target protein and its neighbors to form a complex. Since proteins within complexes are more likely to be essential [Batada et al, 2006], we also hypothesized that fitness pleiotropy for non-essential genes increases with CC. This is supported by the positive correlation between fitness pleiotropy and CC (ρ=0.243, p<2.2e-16). Although there is also a high correlation between PPI degree and CC (ρ=0.643, p<2.2e-16, Figure 3.1B), this correlation is not as strong as the correlation between PPI degree and BW (ρ=0.893). To determine how PPI degree and CC interact to influence fitness pleiotropy, the genes were divided into four groups based on PPI degree and CC: low PPI degree, low CC (LL); high PPI degree, low CC (HL); low PPI degree, high CC (LH); and high PPI degree, high CC (HH). Proteins with a PPI degree of at most 3 (70.37% of the genes) and proteins with a PPI degree of at least 6 (18.36% of the genes) were defined as low and high PPI degree proteins, respectively. Similarly, proteins with a CC of 0 (76.48% of the genes) and those with a CC of at least 0.4 (4.96% of the genes) were classified as low CC and high CC, respectively. Only about 2.07% of nonessential gene products fall in the group with high PPI degree and high CC, whereas most nonessential gene products belong to the group with low PPI degree and low CC. Figure 3.2A gives the box plot of fitness pleiotropy within each group. The results indicate that genes with products of high PPI degree and high CC tend to have the highest fitness pleiotropy. Similar results were obtained when other thresholds were used to partition the proteins into four groups (data not shown).

Figure 3.2. The relationship between fitness pleiotropy and measurements.
A) Fitness pleiotropy for four different groups of proteins classified according to PPI degree and CC: LL (PPI degree <=3, CC<=0); LH (PPI degree <=3, CC>=0.4); HL (PPI degree >=6, CC<=0); HH (PPI degree >=6, CC>=0.4). P-values are given for testing the hypothesis that the median fitness pleiotropy in LL, LH, and HL, respectively, is lower than that in the HH group. The value of n in the box is the number of genes in each group. B) Fitness pleiotropy is positively associated with the number of target genes that each TF regulates (ρ=0.355, p=4.0e-08). Note that fewer than 0.5% of proteins have an out-degree higher than 100 (data not shown).

One explanation for this phenomenon is that proteins with high PPI degree and high CC tend to form complexes that frequently underlie important biological processes, and thus are important for fitness. Inspection of the data identifies genes whose products function in such complexes. For example, COG7 (PPI=8, CC=0.43 and fitness pleiotropy=6) encodes a component of the cytosolic Golgi tethering complex that mediates fusion of transport vesicles to Golgi compartments [http://www.yeastgenome.org/]. Another example is CDC10 (PPI=8, CC=0.5 and fitness pleiotropy=7), which encodes a component of the septin ring of the mother-bud neck that is required for cytokinesis [http://www.yeastgenome.org/]. The studies of gene centrality presented here indicate that fitness pleiotropy in nonessential genes increases with PPI degree, BW or CC. PPI degree is a better predictor than BW, and PPI degree and CC jointly influence fitness pleiotropy.

3.2.2 The influence of transcription factors, chromatin regulators, and chromatin regulation effect on fitness pleiotropy

Phenotypic changes are also associated with changes in gene expression levels.
Hence, genes with products that influence gene expression might also be associated with fitness pleiotropy, such as genes that encode transcription factors (TFs) or chromatin regulators (CRs) that underlie epigenetic gene regulation. Epigenetic gene regulation refers to modification of chromatin by CRs, such as methylation or acetylation of histone proteins, a component of chromatin. Given that chromatin modification usually affects TF binding and gene regulation, we hypothesized that both TFs and CRs are important contributors to fitness pleiotropy. To compare the contributions of TFs and CRs to fitness pleiotropy, the influence of both gene and chromatin regulatory networks on fitness pleiotropy was examined. First, transcription factors were examined in a gene regulatory network, in which the nodes are genes and directed edges indicate regulatory relationships. We used the gene regulatory network constructed in [Hu et al, 2007]. In such a network, there are two types of degree, in-degree and out-degree. The in-degree of a gene is the number of TFs that regulate the gene. The out-degree of a TF is the number of genes that the TF regulates. When a TF is deleted, the genes regulated by the TF will be affected. Thus, if the out-degree of a TF is high, many genes will be affected when the TF is deleted, which will consequently increase fitness pleiotropy. Therefore, we expect fitness pleiotropy to increase with out-degree, but not with in-degree. As shown in Figure 3.2B, fitness pleiotropy is significantly positively associated with the out-degree of TFs (ρ=0.355, p=4.0e-08). There is no significant association between fitness pleiotropy and in-degree in the gene regulatory network. This is expected, as in-degree only indicates how many TFs control the gene and is not related to the gene's effect on other genes and thus on overall fitness.
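The two degree measures defined above can be computed from a list of directed TF-to-gene edges, as in the following minimal sketch. The edge list and the names TF1, TF2 and geneA to geneC are hypothetical and are not taken from the network of Hu et al. [2007].

```python
from collections import defaultdict


def regulatory_degrees(edges):
    """Return (out_degree, in_degree) for directed (tf, gene) edges.

    out_degree[tf] is the number of distinct genes the TF regulates;
    in_degree[gene] is the number of distinct TFs regulating the gene.
    """
    targets_of = defaultdict(set)
    regulators_of = defaultdict(set)
    for tf, gene in edges:
        targets_of[tf].add(gene)
        regulators_of[gene].add(tf)
    out_degree = {tf: len(genes) for tf, genes in targets_of.items()}
    in_degree = {gene: len(tfs) for gene, tfs in regulators_of.items()}
    return out_degree, in_degree


# Hypothetical toy network; the repeated (TF1, geneA) edge is counted once,
# and TF2 regulating TF1 shows that a TF can also have an in-degree.
toy_edges = [
    ("TF1", "geneA"), ("TF1", "geneB"), ("TF1", "geneC"),
    ("TF2", "geneA"), ("TF2", "TF1"), ("TF1", "geneA"),
]
out_degree, in_degree = regulatory_degrees(toy_edges)
```

In this toy example TF1 has the larger out-degree, so under the reasoning above its deletion would be expected to affect more genes than the deletion of TF2.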
This result supports the observation of He and Zhang [2006] that fitness pleiotropy is positively associated with the out-degree of TFs, although the association was not significant in their study. We next investigated the CRs that underlie chromatin modification, such as histone acetylation / methylation, ubiquitylation / deubiquitylation and phosphorylation. Given that chromatin modification has a strong impact on gene expression, we predicted that CR genes will have high fitness pleiotropy. To test this, 65 genes that encode chromatin regulators were identified from a previous study [Steinfeld et al, 2007], and the mean fitness pleiotropy of CR genes was found to be 2.282. This is significantly higher than the mean fitness pleiotropy of non-CR genes (1.149) (p = 3.7e-5, Figure 3.3A). These results demonstrate the importance of genes that encode chromatin regulators, relative to other genes, with respect to the organism's fitness.

Figure 3.3. The relationship between fitness pleiotropy and measurements. A) Fitness pleiotropy for CRs and non-CRs. The line in the box indicates the median value. The upper edge of the box indicates the 75th percentile, and the lower edge indicates the 25th percentile. The ends of the vertical line indicate the minimum and the maximum values, and the points outside the ends of the vertical line are outliers. The p-value is given for testing the hypothesis that the median fitness pleiotropy for CRs is higher than that for non-CRs, using the non-parametric Wilcoxon rank sum test. The value of n in the box is the number of genes in each group. B) Fitness pleiotropy is significantly negatively associated with gene expression variation (ρ = -0.151, p<2.2e-16).

We next tested another key hypothesis of this study: that genes that are likely to be chromatin regulated are associated with low fitness pleiotropy. We used the following approach to measure the potential for a gene to be chromatin regulated.
We used a gene expression compendium of global gene expression profiles in 116 different S. cerevisiae strains with mutated CR genes, based on the data in [Steinfeld et al, 2007]. The potential for a gene to be CR-regulated was quantified by the chromatin regulation effect (CRE), which is defined as the mean absolute value of the logarithm of the gene expression changes across the 116 perturbations, as was previously done [Choi and Kim, 2008]. The CRE measures the likelihood that a gene is epigenetically regulated: as CRE increases, the likelihood that the gene is epigenetically regulated also increases. It has been shown that CRE is significantly positively associated with gene expression variation, due to trans-regulation [Choi and Kim, 2008]. Here, fitness pleiotropy is negatively associated with gene expression variation, suggesting that genes that show high expression variation across the experiments tend to be less important for fitness (see Figure 3.3B). Therefore, it is possible that CRE will also be negatively associated with fitness pleiotropy. We studied the relationship between a gene's CRE and its fitness pleiotropy and found that they are indeed significantly negatively associated (ρ=-0.172, p<2.2e-16, Figure 3.4). Thus, genes that display large expression changes when chromatin regulators are mutated tend to have low fitness pleiotropy.

Figure 3.4. Fitness pleiotropy is negatively associated with chromatin regulation effect (CRE) (ρ=-0.172, p<2.2e-16). The labels are the same as those in Figure 3.1.

This suggests that genes with high CRE might function under specific conditions and therefore have low fitness pleiotropy in the conditions assayed. The epigenetic regulation of such genes is hypothesized to control when gene expression is induced, suggesting that these genes might only be required in certain growth conditions.
As a result, the deletion of such genes would result in defective growth only under specific conditions, and the genes would have low fitness pleiotropy. To assess this hypothesis, the dataset was further examined to identify genes with low fitness pleiotropy that are also chromatin modified. Indeed, PHO5 (fitness pleiotropy = 0) encodes an acid phosphatase in budding yeast and is induced under phosphate starvation but repressed under high-phosphate conditions. The promoter of PHO5 is protected by four positioned nucleosomes under high-phosphate conditions [http://www.yeastgenome.org/], and PHO5 activation is epigenetically regulated at intermediate phosphate concentrations [Dhasarathy and Kladde, 2005]. Another example is SSA3 (fitness pleiotropy = 0), which encodes a member of the heat shock protein 70 (HSP70) family. The expression of SSA3 is induced after the diauxic shift or upon heat shock [http://www.yeastgenome.org/]. A previous study showed a significant increase in H4 acetylation at the promoter of SSA3 upon heat shock [Deckert and Struhl, 2001]. Therefore, genes that are epigenetically regulated and have products that function under specific conditions do show low fitness pleiotropy. Given that TF out-degree is positively associated with fitness pleiotropy (see above), the relative contributions of out-degree and CRE to fitness pleiotropy were examined using partial correlation analysis. The analysis was restricted to TFs, as the large number of non-TFs may confound it. The results showed that ρ(fitness pleiotropy, CRE | out-degree) = -0.300 (p=1.3e-05) and ρ(fitness pleiotropy, out-degree | CRE) = 0.311 (p=5.9e-06). The absolute values of the two partial correlations are similar, indicating that CRE and out-degree contribute to fitness pleiotropy with similar strength.
However, the two partial correlations have different signs, indicating that fitness pleiotropy is still negatively associated with CRE when out-degree is controlled and still positively associated with out-degree when CRE is controlled. Given that the number of TFs is small, only CRE will be examined in the following analysis.

3.2.3 Joint effect of chromatin regulation and TATA-box on fitness pleiotropy

The TATA-box is a conserved cis-DNA-element found in the eukaryotic promoter region. Genes are divided into TATA-containing genes and non-TATA-containing genes based on the presence of a TATA-box in the promoter region [Basehoar et al, 2004]. The TATA-box has been found to be the most important DNA motif for predicting gene expression variation, with TATA-containing genes having significantly higher expression variation than non-TATA-containing genes [Landry et al, 2007; Tirosh et al, 2006]. In sharp contrast, TATA-containing genes have lower mean fitness pleiotropy (0.850) than non-TATA-containing genes (1.237), and the difference is highly significant (p=8.7e-08). In other words, when TATA-containing genes are deleted, low fitness pleiotropy is observed, suggesting that these mutations have a less deleterious effect on the organism. Furthermore, the presence/absence of the TATA-box has been shown to be highly associated with CRE [Choi and Kim, 2008]. Therefore, the effect of the TATA-box on fitness pleiotropy, as indicated above, could be explained by CRE if the association between fitness pleiotropy and TATA-box disappears when we control for CRE. To test this, partial correlation was used to measure the strength of association between fitness pleiotropy and CRE (or TATA-box) after controlling for TATA-box (or CRE). The results showed that ρ(fitness pleiotropy, CRE | TATA-box) = -0.148 (p=8.9e-18) and ρ(fitness pleiotropy, TATA-box | CRE) = -0.027 (p=0.127; TATA-containing genes are coded as 1 and non-TATA-containing genes as 0).
This indicates that the relationship between fitness pleiotropy and the presence of the TATA motif could be explained by the negative association between fitness pleiotropy and CRE. Also, TATA-containing genes make up only about 20% of all yeast genes; therefore, we will not consider the presence of the TATA-box further.

3.2.4 The influence of gene expression variation on the relationship between fitness pleiotropy and PPI degree, CC and CRE

Many of the gene measurements identified in this study as influencing fitness pleiotropy coincide with those influencing gene expression variation, such as PPI degree, presence/absence of the TATA-box, and CRE [Choi and Kim, 2008; Landry et al, 2007; Tirosh et al, 2006]. Therefore, a natural question is whether fitness pleiotropy can be completely explained by gene expression variation. If it can, a direct relationship between gene expression variation and fitness pleiotropy could be inferred. Accordingly, the gene expression variation data from a previous study [Tirosh et al, 2006] were examined to determine whether there is a relationship between fitness pleiotropy and gene expression variation. As shown in the scatter plot in Figure 3.3B, there is indeed a statistically significant correlation between fitness pleiotropy and gene expression variation (ρ=-0.151, p<2.2e-16), but the absolute correlation coefficient is relatively low, indicating that expression variation may explain only a small fraction of fitness pleiotropy. Genes with fitness pleiotropy of at least 4 (top 11% of all the genes) and gene expression variation of at least 2970 (top 10% of the genes) were selected as a set with high fitness pleiotropy and high expression variation (0.4% of the data).
Interestingly, we found that this set was enriched with genes that encode ion transporters (P value = 0.00019, as indicated by FunSpec [Robinson et al, 2002]), especially heavy metal ion transporters, including the iron transporter genes ftr1, fet3 and ctr1. Given that iron plays a vital role in many important processes, such as electron transfer, oxygen transport, and DNA synthesis, deletion of an ion transporter gene is very likely to affect fitness. In S. cerevisiae, the iron level is primarily mediated by a plasma membrane iron transport system that includes the products encoded by ftr1 and fet3. Additionally, it was found that expression of the genes that encode the iron transporters is regulated according to the iron need of the cell [Radisky and Kaplan, 1999; Felice et al, 2005]. Therefore, some genes with high gene expression variation also tend to have high fitness pleiotropy. Genes (15% of the data) with low fitness pleiotropy (equal to 0) and low expression variation (no greater than 800, bottom 22% of the genes) were also identified. It should be noted that 60% of the genes in this set encode proteins of unknown biological function. The set also included genes such as pex7, pex10, pex4, pex6, and pex15, which encode products involved in peroxisome organization and biogenesis, especially proteins involved in importing other proteins into the peroxisomal matrix [http://www.yeastgenome.org/]. These genes show low expression variation, perhaps because their expression is not influenced by environmental conditions. The low fitness pleiotropy (i.e., 0) suggests that a defect in the biological process that these genes underlie might not affect cell growth significantly. These findings also suggest that the negative correlation between gene expression variation and fitness pleiotropy is not strong and does not describe some groups of genes.
The partial correlations between fitness pleiotropy and PPI degree, CC, and CRE were examined with gene expression variation controlled. The results are given in Table 3.1. For comparison, we also give the correlations between fitness pleiotropy and PPI degree, CC, and CRE when gene expression variation is not controlled. The partial correlation coefficients between fitness pleiotropy and PPI degree, CC and CRE when gene expression variation is controlled are all similar to the corresponding correlations without controlling for gene expression variation, indicating that these measurements contribute to fitness pleiotropy independently of expression variation.

Table 3.1. Correlation between fitness pleiotropy and each measurement when expression variation is either controlled or not.

Measurement | Expression variation | ρ | p-value
PPI degree | not controlled | 0.232 | <2.2e-16
PPI degree | controlled | 0.225 | <2.2e-16
CC | not controlled | 0.243 | <2.2e-16
CC | controlled | 0.229 | <2.2e-16
CRE | not controlled | -0.172 | <2.2e-16
CRE | controlled | -0.112 | 2.6e-10

When gene expression variation is controlled, ρ is the partial Spearman's correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for gene expression variation, i.e., that the relationship between fitness pleiotropy and each measurement is explained by gene expression variation.

Based on this result, we next asked what biological mechanism underlies the correlation between fitness pleiotropy and expression variation. To answer this question, we studied the partial correlation between fitness pleiotropy and gene expression variation when PPI degree, CC, or CRE is controlled, respectively (see Table 3.2).
When CRE is controlled, fitness pleiotropy and gene expression variation are no longer associated, indicating that CRE plays a key role in both fitness pleiotropy and gene expression variation. Thus, CRE can be considered the key underlying latent variable that controls both fitness pleiotropy and expression variation, resulting in their correlation; fitness pleiotropy and gene expression variation are no longer significantly associated once CRE is controlled.

Table 3.2. Partial Spearman's correlation between fitness pleiotropy and expression variation when each measurement is controlled.

Partial correlation | ρ | p-value
ρ(fitness pleiotropy, expression variation | PPI degree) | -0.144 | 3.4e-14
ρ(fitness pleiotropy, expression variation | CC) | -0.143 | 5.5e-14
ρ(fitness pleiotropy, expression variation | CRE) | -0.020 | 0.2662

ρ is the partial Spearman's correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and gene expression variation after controlling for PPI degree, CC or CRE, i.e., that the relationship between fitness pleiotropy and gene expression variation is explained by PPI degree, CC or CRE.

3.2.5 Joint analysis of PPI degree, CC and CRE on fitness pleiotropy

These findings indicate that the measures significantly associated with fitness pleiotropy for the nonessential S. cerevisiae genes are PPI degree, CC, and CRE. Fitness pleiotropy increases with PPI degree and CC, while it decreases with CRE. We also found that, although the presence of the TATA-box influences fitness pleiotropy, this phenomenon can be explained by the high CRE of TATA-containing genes: fitness pleiotropy is no longer associated with the TATA-box once CRE is controlled. Based on these findings, the next logical step is to determine whether these measurements collectively explain fitness pleiotropy among all of the nonessential genes.
To this end, the partial correlation between fitness pleiotropy and each of PPI degree, CC and CRE was examined with the other two measures controlled (Table 3.3). The results show that both CRE and gene centrality (measured by PPI degree and CC) play important roles in influencing fitness pleiotropy. The results are consistent across the other combinations of protein interaction data and fitness profiles examined (see Appendix B).

Table 3.3. Partial Spearman's correlation between fitness pleiotropy and PPI degree, CC or CRE.

Partial correlation | ρ | p-value
ρ(fitness pleiotropy, CRE | PPI, CC) | -0.153 | 3.5e-14
ρ(fitness pleiotropy, PPI | CRE, CC) | 0.100 | 7.5e-07
ρ(fitness pleiotropy, CC | CRE, PPI) | 0.114 | 2.0e-08

For example, the partial Spearman's correlation between fitness pleiotropy and PPI degree refers to Spearman's correlation after controlling for CC and CRE. ρ is the partial Spearman's correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for the two other measurements, i.e., that the measurement is not significantly associated with fitness pleiotropy in this joint analysis.

This study provides a systematic analysis of the properties of genes and their products that influence fitness pleiotropy, for all of the nonessential genes in S. cerevisiae. Within the concepts of gene centrality and chromatin regulation, the important measures identified are PPI degree, CC, and CRE. The inter-relationship between these gene centrality measures and regulation by CRs was also examined with respect to expression variation and fitness pleiotropy. The findings indicate that gene centrality, as measured by PPI degree and clustering coefficient, and the potential for a gene to be chromatin regulated, as measured by CRE, significantly affect the corresponding gene's fitness pleiotropy.
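The joint analysis summarized in Table 3.3 controls for two measurements at once. A minimal sketch of how such a second-order partial Spearman correlation can be computed from raw values is given below. It is illustrative only: the study's analysis was run in SAS, the toy values are invented and do not reproduce Table 3.3, and the recursive formulation (apply the first-order formula repeatedly, removing one control variable at a time) is one standard way to obtain higher-order partial correlations.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def partial_spearman(x, y, controls):
    """Partial Spearman correlation of x and y given a list of controls."""
    if not controls:
        return spearman(x, y)
    z, rest = controls[0], controls[1:]
    r_xy = partial_spearman(x, y, rest)
    r_xz = partial_spearman(x, z, rest)
    r_yz = partial_spearman(y, z, rest)
    return (r_xy - r_xz * r_yz) / sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Hypothetical values for eight genes (not data from this study).
fitness_pleiotropy = [0, 1, 0, 3, 2, 5, 1, 4]
ppi = [1, 4, 2, 6, 3, 9, 2, 5]
cc = [0.0, 0.0, 0.1, 0.4, 0.0, 0.5, 0.2, 0.3]
cre = [0.9, 0.5, 0.3, 0.2, 0.6, 0.4, 0.7, 0.1]

rho_cre = partial_spearman(fitness_pleiotropy, cre, [ppi, cc])
rho_cre_swapped = partial_spearman(fitness_pleiotropy, cre, [cc, ppi])
```

The result should not depend on the order in which the two control variables are removed, which gives a simple sanity check on the recursion.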
The results based on three independent gene deletion experiments, which examined fitness in 51, 82 and 418 conditions, respectively, are consistent. This consistency suggests that the conclusions are likely to apply to many other conditions. However, this study has several limitations. Both the protein interaction network and the gene regulatory network are incomplete and contain false positive and false negative errors. To study the effect of the incompleteness of the protein interaction network, we carried out the same analyses using the other two protein interaction data sets, DIP [Salwinski et al, 2004] and BioGrid [Stark et al, 2006], and the results are qualitatively the same (see Appendix B). We used the largest gene regulatory network currently available. How our results will change when more complete regulatory network data become available is a question for future studies.

3.3 Materials and Methods

3.3.1 Phenotypic Profiles

Three fitness profiles of S. cerevisiae deletion strains, which measured the changes in growth rate when nonessential genes were deleted under various conditions, were used [Brown et al, 2006; Parsons et al, 2006; Hillenmeyer et al, 2008]. In the main text, the quantitative profile for yeast homozygous deletion strains, with each of 4277 genes deleted under 51 conditions, was used [Brown et al, 2006]. When duplicate measures of growth rate for strains with the same deleted gene were available, the average change in growth rate was used in our analysis. A total of 10 genes have duplicate measures, and the results are essentially the same if these genes are removed from the analysis (data not shown). The refined data were normalized under each condition to a standard normal distribution.
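The averaging of duplicate measures and the per-condition normalization just described can be sketched as follows. This is a hedged illustration rather than the procedure actually used: the gene names and values are invented, and "normalized to a standard normal distribution" is interpreted here as z-scoring each condition (subtracting the mean and dividing by the population standard deviation), which is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev


def average_duplicates(records):
    """Average repeated measurements of the same deleted gene.

    records is a list of (gene, [growth rate change per condition]).
    """
    grouped = defaultdict(list)
    for gene, changes in records:
        grouped[gene].append(changes)
    return {
        gene: [mean(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


def standardize_by_condition(profile):
    """Z-score each condition (column) across genes.

    Assumption: the population standard deviation is used as the scale.
    """
    genes = sorted(profile)
    n_conditions = len(profile[genes[0]])
    z = {gene: [] for gene in genes}
    for j in range(n_conditions):
        column = [profile[gene][j] for gene in genes]
        mu, sd = mean(column), pstdev(column)
        for gene in genes:
            z[gene].append((profile[gene][j] - mu) / sd)
    return z


# Hypothetical toy profile: geneA was measured twice, under two conditions.
records = [
    ("geneA", [-0.2, 0.1]),
    ("geneA", [-0.4, 0.3]),
    ("geneB", [0.0, -0.5]),
    ("geneC", [-1.5, 0.2]),
]
averaged = average_duplicates(records)
z_scores = standardize_by_condition(averaged)
```

After this step every condition has mean 0 and unit variance across genes, so a fixed cutoff expressed in standard deviations means the same thing in every condition.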
To reduce the biological dependency between these 51 conditions, the conditions were classified into 31 groups based on their different effects on the phenotype using two-way clustering [Brown et al, 2006]. The conditions in the same group have similar phenotypic profiles, as measured by Pearson's correlation coefficient by Brown et al. [2006]. The 31 groups are as follows: AAPO, H2O2; Alk.5g, Alk.15g; Bleo, HygB; Cis1, Cis4, Oxa; CPTa, CPTc; ActD, Dox; Gal.5g, Gal.15g; AntA, GlyE; Ida, TPZ; Mech, MMC; Min.5g, Min.15g; NaCl.5g, NaCl.15g; Nys.5g, Nys.15g; Sorb.5g, Sorb.15g; Trp, Thr, Lys, SC; UVB, UVC, IR; and each of the remaining conditions as its own group. A deletion strain with a normalized growth rate change of less than -2 (i.e., more than 2 standard deviations below the mean) is defined as having a significant growth defect under the specific condition. A deletion strain has a growth defect under a group of conditions if it shows a growth defect under at least one of the conditions in the group. The other two fitness profiles contain the growth rates of yeast haploid deletion strains of 4111 nonessential genes under 82 conditions [Parsons et al, 2006] and the growth rates of yeast homozygous deletion strains of 4742 nonessential genes under 418 conditions [Hillenmeyer et al, 2008]; the details are given in Appendix B. The fitness pleiotropy measures based on the three phenotypic profiles are strongly correlated (see Table B9.1 in Appendix B).

3.3.2 Protein Interaction Networks

The yeast protein interaction data were downloaded from three different data sources: MIPS [Mewes et al, 2004], DIP [Salwinski et al, 2004], and BioGrid [Stark et al, 2006]. The MIPS (Munich Information Center for Protein Sequences) [Mewes et al, 2004] dataset (version: PPI_18052006.tab) contains 11,124 protein physical interactions involving 4,404 proteins. The DIP core interaction dataset [Salwinski et al, 2004] (version: ScereCR20070107) contains 5,738 protein interactions involving 2,161 proteins.
The DIP core interactions were assessed by a number of quality tests and are considered highly reliable [Deane et al, 2002]. The BioGrid [Stark et al, 2006] dataset (version 2.0.34) contains 59,317 protein physical interactions involving 5,054 proteins. Previous studies have shown that the MIPS interaction dataset has relatively high reliability compared to other data sources [Deng et al, 2003]. Therefore, we concentrated on the results based on MIPS. The results based on DIP and BioGrid are presented in Appendix B. For a given protein interaction dataset, the protein physical interaction (PPI) degree was calculated. The betweenness (BW) and the clustering coefficient (CC) were calculated using the software Pajek 1.20 [Batagelj and Mrvar, 1998]. Pajek is a software package for large network analysis and visualization.

3.3.3 Regulatory Network

Transcription factors (TFs) influence the expression of downstream genes. Hu et al. [2007] constructed a regulatory network using 263 TF knockout profiles. We used a directed edge from a TF to a gene if the expression of the gene was significantly changed when the TF was knocked out. Note that edges in this regulatory network may reflect indirect relationships, not necessarily direct regulation. The out-degree of a TF is the number of genes that the TF regulates in this network, while the in-degree of a gene is the number of TFs regulating it in this network.

3.3.4 Expression Compendium of Chromatin Regulators

To study the effects of chromatin regulation on fitness pleiotropy, we used the previously assembled expression compendium of chromatin regulators [Steinfeld et al, 2007]. We removed the expression data under perturbations of TATA binding protein (TBP), histone proteins (H3 and H4), and proteins with unknown chromatin regulation activities, as well as comparative perturbations, because they do not represent perturbations of chromatin regulators.
Finally, we obtained a reduced dataset of expression profiles for 116 perturbations of chromatin modifiers: histone methyltransferases, acetyltransferases and deacetylases, silencing factors, ubiquitinating and deubiquitinating enzymes, and ATPases. We further checked the percentage of missing values for each gene across the 116 perturbations. If a gene had more than 10% (i.e., 12 or more) missing values, we excluded it from the final refined data. We normalized the refined data under each perturbation to a standard normal distribution and calculated the chromatin regulation effect (CRE) as the average of the absolute value of the logarithm of the gene expression changes across the 116 perturbations, following Choi and Kim [2008].

3.3.5 TATA-containing genes

A TATA box is a DNA sequence motif (cis-element) found in the promoter region of many eukaryotic genes. The TATA consensus sequence was identified as TATA(A/T)A(A/T)(A/G) [Basehoar et al, 2004]. The assignment of yeast genes as TATA-containing or not was downloaded from [Basehoar et al, 2004].

3.3.6 Statistical Analysis

In our dataset, fitness pleiotropy is a discrete response variable. To measure the relationship between fitness pleiotropy and each measurement, we used the non-parametric Spearman's rank correlation with the corresponding statistical significance test, since the assumptions of parametric methods, such as linear regression or ordinal logistic regression, are not satisfied. Spearman's rank correlation detects monotonic association between two variables, and its corresponding test makes no distributional assumptions about the variables. In the joint analysis, the non-parametric Spearman partial correlation and the corresponding significance test are used to determine which measurement is most important in influencing fitness pleiotropy. We also used Spearman partial correlation to assess the relative importance of the measurements influencing fitness pleiotropy.
For example, to determine whether measurement y or z has the stronger association with x, we compare the values of $\rho_{xy|z}$ and $\rho_{xz|y}$; the larger value indicates the stronger association. Here $\rho_{xy|z}$ denotes the partial correlation between x and y after controlling for z. The first-order partial correlation is defined as:
\[ \rho_{xy|z} = \frac{\rho_{xy} - \rho_{xz}\,\rho_{yz}}{\sqrt{(1 - \rho_{xz}^{2})(1 - \rho_{yz}^{2})}} \]
where $\rho_{xy}$ is the correlation between x and y. The second-order partial correlation is defined as:
\[ \rho_{xy|z_1 z_2} = \frac{\rho_{xy|z_1} - \rho_{xz_2|z_1}\,\rho_{yz_2|z_1}}{\sqrt{(1 - \rho_{xz_2|z_1}^{2})(1 - \rho_{yz_2|z_1}^{2})}} \]
where $\rho_{xy|z_1}$ is the partial correlation between x and y after controlling for $z_1$. The analysis was carried out in SAS 9.0 (http://www.sas.com/technologies/bi/appdev/base/).
To visualize the relationship between fitness pleiotropy and each measurement, we used linear regression to fit the data in the plot:
\[ v = \alpha + \beta d \]
where v is fitness pleiotropy, d is the measurement value, and $\alpha$ and $\beta$ are parameters.
We also used box plots for visualization in our studies. These show the difference in the distribution of each variable. The line in the box indicates the median value. The upper edge of the box indicates the 75th percentile, and the lower edge indicates the 25th percentile. The ends of the vertical line indicate the minimum and maximum of the non-outlying values, and the points beyond the ends of the vertical line are outliers. In addition, we used the non-parametric Wilcoxon rank sum test [Wilcoxon, 1945] to compare the medians of two distributions. The test in our study is a one-sided test, based on the alternative hypothesis that variable A takes higher (or lower) values than variable B.
Chapter 4: Conclusions
We implemented a system-wide analysis of proteomic and genomic factors affecting gene expression variation and pleiotropy.
Among four different factors (protein interaction degree, toxicity degree, TATA box, and the number of TFs), the TATA box and the number of TFs are the most important factors influencing gene expression variation. Results showed that 1) gene expression variation is negatively correlated with the protein interaction degree in the protein interaction network, 2) essential genes tend to have less expression variation than non-essential genes, and gene expression variation decreases with toxicity degree, 3) TATA-containing genes have higher gene expression variation than non-TATA-containing genes, and 4) the number of TFs regulating a gene is the most important factor influencing gene expression variation (R² = 8-14%). Although it is intuitive that the number of TFs regulating a gene should have a significant effect on the gene’s expression variation, the magnitude of its influence has not, to the best of our knowledge, been studied in large-scale expression datasets. Our findings demonstrated that gene regulation is a main factor affecting gene expression variation. Protein interaction degree and toxicity degree do not account for as much variation as the number of TFs and the TATA box. In our overall analysis, we also found that the interactions of the TATA box with toxicity degree and with the number of TFs influence expression variation. Further analysis stratified by TATA box indicated that TATA-containing genes and non-TATA-containing genes behave differently in relation to the toxicity degree and the number of TFs. The effect of the number of TFs on expression variation within the TATA-containing genes is lower than that for the non-TATA-containing genes. On the other hand, toxicity degree is associated with expression variation within the TATA-containing genes only. These findings suggest that the regulatory mechanism might be more complicated for TATA-containing genes than for non-TATA-containing genes.
We also performed a systematic analysis of gene properties that influence fitness pleiotropy in S. cerevisiae, including measurements of protein centrality and the potential for being chromatin regulated. The analyses indicated that two measurements mainly and collectively influence fitness pleiotropy in S. cerevisiae: gene centrality and chromatin regulation. Gene centrality refers to the centrality of the encoded protein in a protein interaction network. The deletion of genes with high centrality results in high fitness pleiotropy. In contrast, genes that are more likely to be epigenetically regulated tend to have low fitness pleiotropy. Genes with products that function in epigenetic chromatin regulation, genes that are regulated by chromatin modification, and genes with products that play a central role in a protein interaction network affect an organism’s fitness. These findings highlight the significance of both epigenetic gene regulation and protein interaction networks in influencing the fitness phenotype.
Chapter 5: Future work
With the availability of genome-wide data, including protein interaction and gene regulation data, we could study the potential effects of different factors influencing pleiotropy in yeast. However, this study has several limitations. Both the protein interaction network and the gene regulatory network are incomplete and contain false positive and false negative errors. For example, there is much debate regarding the reliability of protein interactions identified from high-throughput experiments or collected from the literature (i.e., small-scale experiments). Some have suggested that high-throughput experiments, such as MS analysis of protein complexes, may be biased toward interactions among highly expressed proteins [Batada et al, 2006].
For data collected from small-scale experiments, some have argued that they are well validated, while others have argued that essential genes tend to receive more attention and hence have more reported interaction partners [Coulomb et al, 2005]. To study the effect of the incompleteness of the protein interaction network, we performed analyses using three different protein interaction datasets: MIPS [Mewes et al, 2004], DIP [Salwinski et al, 2004] and BioGrid [Stark et al, 2006], and the results are qualitatively the same. Also, we used the largest gene regulatory network currently available. However, how our results will change when more complete and reliable protein interaction and regulatory networks become available is a question for future studies.
Because of the large amount of fitness phenotype data available in yeast, we considered only one phenotype, the growth rate. It will be interesting to see whether the results we obtained in the fitness pleiotropy study hold for other phenotypes, such as cell shape and cell size.
The dissertation work used yeast as the model because genome-wide data on many different features have been generated in yeast. Extending this work to higher eukaryotes, especially humans, would be more beneficial to public health, and could be useful for the study of human disease genes. Studies have shown that the local environment of genes in the protein interaction network and the regulatory network plays a role in disease development [Tu et al, 2006; Jaenisch and Bird, 2003]. Human essential genes are generally highly expressed in most tissues and highly conserved [Tu et al, 2006]. When they malfunction, the individuals carrying the mutations will not be able to survive. Therefore, human essential genes are not observed to be associated with human diseases. On the other hand, malfunction of neutral genes may not affect the phenotype of the individuals.
Between the human essential and neutral genes lie the genes whose mutation changes the phenotype of the individual. The three groups of genes, essential, disease-related, and neutral, have distinct genomic and proteomic features [Tu et al, 2006; Goh et al, 2007; Feldman et al, 2008]. For example, it was found that genes with intermediate centrality in the interaction network are more likely to be related to human inherited diseases, and that the severity of disease is associated with the centrality of the genes [Tu et al, 2006]. These findings are similar to our result that fitness pleiotropy is positively associated with PPI degree in yeast. The studies in yeast also suggest a role for epigenetic regulation in influencing fitness pleiotropy. Would epigenetic regulation affect the probability that a gene is disease-related, i.e., the impact of environment on disease susceptibility? In fact, there is much evidence to support this, for example, the effects of diet on long-term diseases such as cancer [Jaenisch and Bird, 2003]. Matouk and Marsden [2008] showed that epigenetic pathways control vascular endothelial gene expression in disease.
There are several datasets of protein interactions in human, such as HPRD [Peri et al, 2003] and BioGrid [Stark et al, 2006]. The HPRD network contains more than 9,000 proteins with more than 30,000 pair-wise interactions collected from the literature and is thought to be reliable. Therefore, as in previous work [Tu et al, 2006], although the human protein interaction network is not complete, the study of the relationship between protein interaction and disease genes might show a potential trend. However, the gene regulatory network and the epigenetic regulation network are not well understood in human. More data are needed to evaluate the significance of regulatory networks for human diseases. Such studies would help us better understand, and possibly classify, human disease genes.
With the availability of more data on different features, including protein interaction and gene regulation, it will also become possible to study the factors influencing gene expression variation in human. Expression variation in human could be measured within or among different populations. Several groups [Stranger et al, 2007; Monks et al, 2004; Cheung et al, 2005] have measured gene expression profiles of individuals. Research on gene expression variation in human may provide insight into the function of genes that have high or low variation and its relationship with disease genes. However, since human is a complex organism, one difficulty is that some gene expression and protein interactions are tissue-specific. Alternative splicing is another issue that needs to be carefully considered in future studies in humans.
References
Akaike H: Information theory and an extension of the maximum likelihood principle. In Proceedings of Second International Symposium on Information Theory. B. N. Petrov, and F. Csaki, Eds.: Akademiai Kiado, Budapest; 1973: 267-281.
Bader JS, Chaudhuri A, Rothberg JM, Chant J: Gaining confidence in high-throughput protein interaction networks. Nat Biotechnol. 2004, 22:78-85.
Basehoar AD, Zanton SJ, Pugh BF: Identification and distinct regulation of yeast TATA box-containing genes. Cell 2004, 116: 699-709
Batada NN, Hurst LD, Tyers M: Evolutionary and physiological importance of hub proteins. PLoS Comput Biol. 2006, 2(7): e88.
Batada NN, Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hurst LD, Tyers M: Stratus not altocumulus: a new view of the yeast protein interaction network. PLoS Biol. 2006, 4(10):e317.
Batagelj V. and Mrvar A.: Pajek - Program for Large Network Analysis. Connections 1998, 21: 47-57
Brown JA, et al: Global analysis of gene function in yeast by quantitative phenotypic profiling. Mol Syst Biol.
2006, 2: 2006.0001
Burnham KP, Anderson DR: Model selection and multimodel inference: a practical information-theoretic approach, 2nd ed. Springer-Verlag; 2002.
Chen Y, Xu D.: Understanding protein dispensability through machine-learning analysis of high-throughput data. Bioinformatics. 2005, 21: 575-81.
Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT.: Mapping determinants of human gene expression by regional and genome-wide association. Nature. 2005, 437:1365-9.
Choi JK, Kim SC, Seo J, Kim S, Bhak J: Impact of transcriptional properties on essentiality and evolutionary rate. Genetics. 2007, 175:199-206.
Choi JK, Kim YJ.: Epigenetic regulation and the variability of gene expression. Nat Genet. 2008, 40: 141-7
Coulomb S, Bauer M, Bernard D, Marsolier-Kergoat MC.: Gene essentiality and the topology of protein interaction networks. Proc Biol Sci. 2005, 272: 1721-5.
Deane CM, Salwinski L, Xenarios I, Eisenberg D: Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 2002, 1: 349-56
Deckert J, Struhl K.: Histone acetylation at promoters is differentially affected by specific activators and repressors. Mol Cell Biol. 2001, 21: 2726-35.
Deng MH, Sun FZ, Chen T: Assessment of the reliability of protein-protein interactions and protein function prediction. Pac Symp Biocomput 2003, 197-206
Dhasarathy A, Kladde MP.: Promoter occupancy is a major determinant of chromatin remodeling enzyme requirements. Mol Cell Biol. 2005, 25: 2698-707
Feldman I, Rzhetsky A, Vitkup D.: Network properties of genes harboring inherited disease mutations. Proc Natl Acad Sci U S A. 2008, 105(11):4323-8.
Felice MR, et al: Post-transcriptional regulation of the yeast high affinity iron transport system. J Biol Chem. 2005, 280: 22181-90.
Freeman, L. C.: Centrality in social networks: Conceptual clarification. Social Networks 1979, 1: 215-239.
Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 2000, 11:4241-57.
Giaever G, et al: Functional profiling of the Saccharomyces cerevisiae genome. Nature 2002, 418: 387-91.
Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL.: The human disease network. Proc Natl Acad Sci U S A. 2007, 104(21):8685-90.
Hakes L, Robertson DL, Oliver SG.: Effect of dataset selection on the topological interpretation of protein interaction networks. BMC Genomics. 2005, 6: 131.
Hakes L, Pinney JW, Robertson DL, Lovell SC.: Protein-protein interaction networks and biology--what's the connection? Nat Biotechnol. 2008, 26: 69-72.
He X, Zhang J.: Toward a molecular understanding of pleiotropy. Genetics. 2006, 173: 1885-91.
Hillenmeyer ME, et al: The chemical genomic portrait of yeast: uncovering a phenotype for all genes. Science. 2008, 320: 362-5.
Hu Z, Killion PJ, Iyer VR.: Genetic reconstruction of a functional transcriptional regulatory network. Nat Genet. 2007, 39: 683-7.
Jaenisch R, Bird A.: Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nat Genet. 2003, 33 Suppl: 245-54.
Jeong H, Mason SP, Barabási AL, Oltvai ZN: Lethality and centrality in protein networks. Nature 2001, 411: 41-2.
Landry CR, Lemos B, Rifkin SA, Dickinson WJ, Hartl DL: Genetic properties influencing the evolvability of gene expression. Science 2007, 317: 118-21
Lemos B, Meiklejohn CD, Hartl DL: Regulatory evolution across the protein interaction network. Nat Genet. 2004, 36:1059-60.
Lemos B, Bettencourt BR, Meiklejohn CD, Hartl DL: Evolution of proteins and gene expression levels are coupled in Drosophila and are independently associated with mRNA abundance, protein length, and number of protein-protein interactions. Mol Biol Evol. 2005, 22:1345-54.
MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, Fraenkel E.: An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 2006, 7:113.
Matouk CC, Marsden PA.: Epigenetic regulation of vascular endothelial gene expression. Circ Res. 2008, 102:873-87
Mewes HW, et al: MIPS: analysis and annotation of proteins from whole genomes. Nucl Acid Res 2004, 32(Database issue):D41-44
Monks SA, Leonardson A, Zhu H, Cundiff P, Pietrusiak P, Edwards S, Phillips JW, Sachs A, Schadt EE.: Genetic inheritance of gene expression in human cell lines. Am J Hum Genet. 2004, 75:1094-105.
Nelson CE, Hersh BM, Carroll SB: The regulatory content of intergenic DNA shapes genome architecture. Genome Biol 2004, 5:R25.
Newman JR, Ghaemmaghami S, Ihmels J, Breslow DK, Noble M, DeRisi JL, Weissman JS: Single-cell proteomic analysis of S. cerevisiae reveals the architecture of biological noise. Nature. 2006, 441:840-6.
Parsons AB, et al: Exploring the mode-of-action of bioactive compounds by chemical-genetic profiling in yeast. Cell. 2006, 126: 611-25
Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TKB, Gronborg M, Ibarrola N, Deshpande N, Shanker K, Shivashankar HN, Rashmi BP, Ramya MA, Zhao ZX, Chandrika KN, Padma N, Harsha HC et al: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003, 13:2363-2371
Radisky D, Kaplan J.: Regulation of transition metal transport across the yeast plasma membrane. J Biol Chem. 1999, 274: 4481-4.
Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA.: Genome-wide location and function of DNA binding proteins. Science. 2000, 290: 2306-9.
Robinson MD, Grigull J, Mohammad N, Hughes TR.: FunSpec: a web-based cluster interpreter for yeast. BMC Bioinformatics. 2002, 3: 35.
Said MR, Begley TJ, Oppenheim AV, Lauffenburger DA, Samson LD: Global network analysis of phenotypic effects: protein networks and toxicity modulation in Saccharomyces cerevisiae. Proc Natl Acad Sci U S A. 2004, 101:18006-11.
Saldanha AJ, Brauer MJ, Botstein D: Nutritional homeostasis in batch and steady-state culture of yeast. Mol Biol Cell 2004, 15:4089-104.
Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D.: The Database of Interacting Proteins: 2004 update. Nucl Acid Res 2004, 32(Database issue):D449-451
Shapira M, Segal E, Botstein D: Disruption of yeast forkhead-associated cell cycle transcription by oxidative stress. Mol Biol Cell 2004, 15:5659-69.
Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M.: BioGRID: a general repository for interaction datasets. Nucl Acids Res. 2006, 34(Database issue):D535-9.
Steinfeld I, Shamir R, Kupiec M.: A genome-wide analysis in Saccharomyces cerevisiae demonstrates the influence of chromatin modifiers on transcription. Nat Genet. 2007, 39: 303-9
Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, Beazley C, Ingle CE, Dunning M, Flicek P, Koller D, Montgomery S, Tavaré S, Deloukas P, Dermitzakis ET.: Population genomics of human gene expression. Nat Genet. 2007, 39:1217-24.
Tirosh I, Weinberger A, Carmi M, Barkai N: A genetic signature of interspecies variations in gene expression. Nat Genet 2006, 38: 830-4
Tirosh I, Barkai N: Evolution of gene sequence and gene expression are not correlated in yeast. Trends Genet. 2008, 24:109-13.
Tu Z, Wang L, Xu M, Zhou X, Chen T, Sun F.: Further understanding human disease genes by comparing with housekeeping genes and other genes. BMC Genomics. 2006, 7:31
Walker Graeme M.: Yeast physiology and biotechnology. J. Wiley & Sons: 1998
Watts DJ, Strogatz SH.: Collective dynamics of 'small-world' networks. Nature.
1998, 393: 440-2
Walther D, Brunnemann R, Selbig J: The regulatory code for transcriptional response diversity and its relation to genome structural properties in A. thaliana. PLoS Genet 2007, 3: e11.
Wilcoxon F: Individual Comparisons by Ranking Methods. Biometrics 1945, 1:80-83
Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, Andre B, Bangham R, Benito R, Boeke JD, Bussey H, Chu AM, Connelly C, Davis K, Dietrich F, Dow SW, El Bakkoury M, Foury F, Friend SH, Gentalen E, Giaever G, Hegemann JH, Jones T, Laub M, Liao H, Liebundguth N, Lockhart DJ, Lucau-Danila A, Lussier M, M'Rabet N, Menard P, Mittmann M, Pai C, Rebischung C, Revuelta JL, Riles L, Roberts CJ, Ross-MacDonald P, Scherens B, Snyder M, Sookhai-Mahadeo S, Storms RK, Véronneau S, Voet M, Volckaert G, Ward TR, Wysocki R, Yen GS, Yu K, Zimmermann K, Philippsen P, Johnston M, Davis RW.: Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 1999, 285:901-6.
Yoshimoto H, Saltsman K, Gasch AP, Li HX, Ogawa N, Botstein D, Brown PO, Cyert MS: Genome-wide analysis of gene expression regulated by the calcineurin/Crz1p signaling pathway in Saccharomyces cerevisiae. J Biol Chem 2002, 277:31079-88.
Yu H, Kim PM, Sprecher E, Trifonov V, Gerstein M.: The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Comput Biol. 2007, 3: e59.
Appendix A: Supplemental materials for Chapter 2
A1. Analysis using MIPS protein interaction data
Chemostat (Nutritional Stress)
Figure A1.1. Gene expression variation is negatively correlated with protein interaction degree. The x-axis represents protein physical interaction degree, and the y-axis represents gene expression variation. A) The LOWESS fit to the gene expression variation.
B) Bar-plot of the expression variation of all the genes with a given protein interaction degree together with the linear regression fit to the gene expression variation in relation to the interaction degree. The linear coefficient β = -0.0145, R² = 1.05%, and the p-value = 1.279e-10. The red dots are the mean expression variation of the genes given the protein physical interaction (PPI) degree. The bar represents the standard deviation of gene expression variation given PPI degree.
Figure A1.2. The effect of essentiality, toxicity degree, and protein interaction degree on gene expression variation. A) Bar-plot of the expression variation of all the genes with a given toxicity degree together with the linear regression fit to the expression variation of the genes in relation to the toxicity degree. The linear coefficient β = -0.0275, R² = 0.46%, and the p-value = 1.632e-05. B) The mean expression variation and the linear regression fit to the expression variation with respect to PPI degree for non-essential genes stratified according to toxicity degree and for the essential genes. The β values are -0.0108, -0.0089, -0.0133, -0.0109, and -0.0183 for toxicity degree 0, 1, 2, and 3, and the essential genes, respectively. The corresponding p-values are 0.0120, 0.0914, 0.0113, 0.1355, and 0.0163, respectively. R² is 0.42%, 0.36%, 1.65%, 1.2% and 2.47%, respectively. The labels are the same as those in Figure A1.1.
Figure A1.3. The effect of TATA box, number of TFs, and toxicity degree on gene expression variation. A) The relationship between expression variation and toxicity degree stratified by the presence/absence of the TATA box (R² = 2.33%, β = -0.0860, p-value = 2.104e-05 for the TATA-containing group; R² = 0.05%, β = -0.0085, p-value = 0.1906 for the non-TATA-containing group); B) The relationship between expression variation and the number of TFs up to 25 (R² = 8.16%, β = 0.0364, p-value < 2.2e-16). The labels are the same as those in Figure A1.1.
Environmental Stress
Figure A1.4. Gene expression variation is negatively correlated with protein interaction degree. The x-axis represents protein physical interaction degree, and the y-axis represents gene expression variation. A) The LOWESS fit to the gene expression variation. B) Bar-plot of the expression variation of all the genes with a given protein interaction degree together with the linear regression fit to the gene expression variation in relation to the interaction degree. The linear coefficient β = -0.0286, R² = 1.96%, and the p-value = 2.2e-16. The red dots are the mean expression variation of the genes given the protein physical interaction (PPI) degree. The bar represents the standard deviation of gene expression variation given PPI degree.
Figure A1.5. The effect of essentiality, toxicity degree, and protein interaction degree on gene expression variation. A) Bar-plot of the expression variation of all the genes with a given toxicity degree together with the linear regression fit to the expression variation of the genes in relation to the toxicity degree. The linear coefficient β = -0.0314, R² = 0.3%, and the p-value = 0.0006. B) The mean expression variation and the linear regression fit to the expression variation with respect to PPI degree for non-essential genes stratified according to toxicity degree and for the essential genes. The β values are -0.0175, -0.0277, -0.0253, -0.0143, and -0.0289 for toxicity degree 0, 1, 2, and 3, and the essential genes, respectively. The corresponding p-values are 0.0049, 0.0002, 0.0007, 0.1553, and 0.0162, respectively. R² is 0.53%, 1.7%, 2.93%, 1.09% and 2.48%, respectively. The labels are the same as those in Figure A1.4.
Figure A1.6. The effect of TATA box, number of TFs, and toxicity degree on gene expression variation.
A) The relationship between expression variation and toxicity degree stratified by the presence/absence of the TATA box (R² = 1.23%, β = -0.0936, p-value = 0.0020 for the TATA-containing group; R² = 0, β = 0.0007, p-value = 0.9329 for the non-TATA-containing group); B) The relationship between expression variation and the number of TFs up to 25 (R² = 14.22%, β = 0.0687, p-value < 2.2e-16). The labels are the same as those in Figure A1.4.
Oxidative Stress
Figure A1.7. Gene expression variation is negatively correlated with protein interaction degree. The x-axis represents protein physical interaction degree, and the y-axis represents gene expression variation. A) The LOWESS fit to the gene expression variation. B) Bar-plot of the expression variation of all the genes with a given protein interaction degree together with the linear regression fit to the gene expression variation in relation to the interaction degree. The linear coefficient β = -0.0128, R² = 0.72%, and the p-value = 9.543e-08. The red dots are the mean expression variation of the genes given the protein physical interaction (PPI) degree. The bar represents the standard deviation of the gene expression variation given PPI degree.
Figure A1.8. The effect of essentiality, toxicity degree, and protein interaction degree on gene expression variation. A) Bar-plot of the expression variation of all the genes with a given toxicity degree together with the linear regression fit to the expression variation of the genes in relation to the toxicity degree. The linear coefficient β = -0.0016, R² = 0, and the p-value = 0.8162. B) The mean expression variation and the linear regression fit to the expression variation with respect to PPI degree for non-essential genes stratified according to toxicity degree and for the essential genes. The β values are -0.0075, -0.0043, -0.0147, -0.0109, and -0.0217 for toxicity degree 0, 1, 2, and 3, and the essential genes, respectively.
The corresponding p-values are 0.0942, 0.4425, 0.0114, 0.1821, and 0.0186, respectively. R² is 0.19%, 0.07%, 1.65%, 0.96% and 2.38%, respectively. The labels are the same as those in Figure A1.7.
Figure A1.9. The effect of TATA box, number of TFs, and toxicity degree on gene expression variation. A) The relationship between expression variation and toxicity degree stratified by the presence/absence of the TATA box (R² = 0.01%, β = -0.0054, p-value = 0.8048 for the TATA-containing group; R² = 0.02%, β = 0.0050, p-value = 0.4737 for the non-TATA-containing group); B) The relationship between expression variation and the number of TFs up to 25 (R² = 2.97%, β = 0.0226, p-value < 2.2e-16). The labels are the same as those in Figure A1.7.
A2. Analysis using DIP protein interaction data
Ca and Na exposure
Figure A2.1. Gene expression variation is negatively correlated with protein interaction degree. The x-axis represents protein physical interaction degree, and the y-axis represents gene expression variation. A) The LOWESS fit to the gene expression variation. B) Bar-plot of the expression variation of all the genes with a given protein interaction degree together with the linear regression fit to the gene expression variation in relation to the interaction degree. The linear coefficient β = -0.0140, R² = 0.39%, and the p-value = 0.0041. The red dots are the mean expression variation of the genes given the protein physical interaction (PPI) degree. The bar represents the standard deviation of gene expression variation given PPI degree.
Figure A2.2. The effect of toxicity degree and protein interaction degree on gene expression variation. The mean expression variation and the linear regression fit to the expression variation of all the genes with respect to protein interaction degree for non-essential genes stratified according to toxicity degree and for the essential genes.
The β values are 0.0007, -0.0229, -0.0209, -0.0046, and -0.0263 for toxicity degree 0, 1, 2, and 3, and the essential genes, respectively. The corresponding p-values are 0.9555, 0.0547, 0.1321, 0.7783, and 0.1085, respectively. R² is 0, 0.95%, 0.96%, 0.06% and 1.37%, respectively. The labels are the same as those in Figure A2.1.
Chemostat (nutritional stress)
Figure A2.3. Gene expression variation is negatively correlated with protein interaction degree. The x-axis represents protein physical interaction degree, and the y-axis represents gene expression variation. A) The LOWESS fit to the gene expression variation. B) Bar-plot of the expression variation of all the genes with a given protein interaction degree together with the linear regression fit to the gene expression variation in relation to the interaction degree. The linear coefficient β = -0.0140, R² = 1.25%, and the p-value = 2.458e-07. The red dots are the mean expression variation of the genes given the protein physical interaction (PPI) degree. The bar represents the standard deviation of gene expression variation given PPI degree.
Figure A2.4. The effect of toxicity degree and protein interaction degree on gene expression variation. The mean expression variation and the linear regression fit to the expression variation of all the genes with respect to protein interaction degree for non-essential genes stratified according to toxicity degree and for the essential genes. The β values are -0.0187, -0.0060, -0.0013, -0.0235, and -0.0108 for toxicity degree 0, 1, 2, and 3, and the essential genes, respectively. The corresponding p-values are 0.0045, 0.3927, 0.8726, 0.0173, and 0.1775, respectively. R² is 1.38%, 0.19%, 0.01%, 4% and 0.97%, respectively. The labels are the same as those in Figure A2.1.
Environmental Stress
Figure A2.5. Gene expression variation is negatively correlated with protein interaction degree.
The x-axis represents protein physical interaction degree, and the y-axis represents gene expression variation. A) The LOWESS fit to the gene expression variation. B) Bar-plot of the expression variation of all the genes with a given protein interaction degree together with the linear regression fit to the gene expression variation in relation to the interaction degree. The linear coefficient β = -0.0124, R² = 0.45%, and the p-value = 0.0023. The red dots are the mean expression variation of the genes given the protein physical interaction (PPI) degree. The bar represents the standard deviation of gene expression variation given PPI degree.
Figure A2.6. The effect of toxicity degree and protein interaction degree on gene expression variation. The mean expression variation and the linear regression fit to the expression variation of all the genes with respect to protein interaction degree for non-essential genes stratified according to toxicity degree and for the essential genes. The β values are 0.0039, -0.0131, 0.0030, -0.0337, and -0.0273 for toxicity degree 0, 1, 2, and 3, and the essential genes, respectively. The corresponding p-values are 0.6852, 0.1935, 0.7883, 0.0065, and 0.0507, respectively. R² is 0.03%, 0.44%, 0.03%, 5.25% and 2.03%, respectively. The labels are the same as those in Figure A2.1.
Oxidative Stress
Figure A2.7. Gene expression variation is negatively correlated with protein interaction degree. The x-axis represents protein physical interaction degree, and the y-axis represents the gene expression variation. A) The LOWESS fit to the gene expression variation. B) Bar-plot of the expression variation of all the genes with a given protein interaction degree together with the linear regression fit to the gene expression variation in relation to the interaction degree. The linear coefficient β = -0.0080, R² = 0.35%, and the p-value = 0.0064.
The red dots are the mean expression variation of the genes given the protein physical interaction (PPI) degree. The bar represents the standard deviation of the gene expression variation given PPI degree.
Figure A2.8. The effect of toxicity degree and protein interaction degree on gene expression variation. The mean expression variation and the linear regression fit to the expression variation of all the genes with respect to protein interaction degree for non-essential genes stratified according to toxicity degree and for the essential genes. The β values are 0.0061, -0.0023, -0.0145, -0.0150, and -0.0210 for toxicity degree 0, 1, 2, and 3, and the essential genes, respectively. The corresponding p-values are 0.3508, 0.7418, 0.1146, 0.1534, and 0.0446, respectively. R² is 0.15%, 0.03%, 1.05%, 1.47% and 2.14%, respectively. The labels are the same as those in Figure A2.1.
Table A2.1: Analysis of four factors and their interactions affecting expression variation using stepwise selection with AIC.
Data sets: Ca_Na_exposure, Chemostat, Environmental Stress, Oxidative Stress (for each data set: model, p value, R²)
variable
x1       √ 0.8892 0.003%    √ 0.0598 0.61%    √ 0.1349 0.38%
x2       √ 0.6756 0.03%    √ 0.5487 0.06%    √ 0.0059 0.13%
x3       √ 1.8e-09 6.02%    √ 3.3e-9 5.86%    √ < 2e-16 14.49%    √ 0.0181 0.95%
x4       √ 6.3e-07 4.17%    √ 7.5e-05 2.67%    √ 6.0e-12 7.82%    √ 0.8492 0.006%
x1*x2
x1*x3    √ 0.0478 0.67%
x1*x4    √ 0.0364 0.75%
x2*x3
x2*x4    √ 0.0286 0.82%    √ 0.0122 1.08%    √ 0.0259 0.85%
x3*x4    √ 0.1199 0.41%    √ 0.0736 0.55%    √ 0.0002 2.37%
R² model    17.42%    12.83%    27.84%    3.45%
The four factors include protein interaction degree (x1), toxicity degree (x2; essential genes are treated as having toxicity degree 4), number of TFs (x3), and the presence of TATA box (x4: 1 for TATA-containing genes, 0 for non-TATA-containing genes). The protein interaction data used in this analysis is based on the DIP data set. The column marked with “√” indicates inclusion in the final linear model.
The multiple linear regression is based on the final linear model. The p-value tests the null hypothesis β = 0 against the alternative β ≠ 0. R² is the variation explained by the model and by each independent variable, respectively.
A3. Analysis using BioGrid protein interaction data
Ca and Na exposure
Figure A3.1. Gene expression variation is negatively correlated with protein interaction degree. The x-axis represents protein physical interaction degree, and the y-axis represents gene expression variation. A) The LOWESS fit to the gene expression variation. B) Bar-plot of the expression variation of all the genes with a given protein interaction degree together with the linear regression fit to the gene expression variation in relation to the interaction degree. The linear coefficient β = -0.0133, R² = 1.36% and the p-value = 8.621e-13. The red dots are the mean expression variation of the genes given the protein physical interaction (PPI) degree. The bar represents the standard deviation of gene expression variation given PPI degree.
Figure A3.2. The effect of toxicity degree and protein interaction degree on gene expression variation. The mean expression variation and the linear regression fit to the expression variation of all the genes with respect to protein interaction degree for non-essential genes stratified according to toxicity degree and for the essential genes. The β values are -0.0106, -0.0136, -0.0192, -0.0118, and -0.0111 for toxicity degree 0, 1, 2, and 3, and the essential genes, respectively. The corresponding p-values are 0.0025, 0.0011, 0.0003, 0.2632, and 0.0925, respectively. R² is 0.69%, 1.26%, 4.04%, 1.23% and 1.42%, respectively. The labels are the same as those in Figure A3.1.
Chemostat (nutritional stress)
Figure A3.3. Gene expression variation is negatively correlated with protein interaction degree. The x-axis represents protein physical interaction degree, and the y-axis represents gene expression variation.
A) The LOWESS fit to the gene expression variation. B) Bar-plot of the expression variation of all the genes with a given protein interaction degree together with the linear regression fit to the gene expression variation in relation to the interaction degree. The linear coefficient β = -0.0081, R² = 1.59% and the p-value = 8.964e-15. The red dots are the mean expression variation of the genes given the protein physical interaction (PPI) degree. The bar represents the standard deviation of gene expression variation given PPI degree.
Figure A3.4. The effect of toxicity degree and protein interaction degree on gene expression variation. The mean expression variation and the linear regression fit to the expression variation of all the genes with respect to protein interaction degree for non-essential genes stratified according to toxicity degree and for the essential genes. The β values are -0.0062, -0.0057, -0.0076, -0.0110, and -0.0025 for toxicity degree 0, 1, 2, and 3, and the essential genes, respectively. The corresponding p-values are 0.0013, 0.0171, 0.0147, 0.0394, and 0.5151, respectively. R² is 0.77%, 0.68%, 1.86%, 4.1% and 0.21%, respectively. The labels are the same as those in Figure A3.1.
Environmental Stress
Figure A3.5. Gene expression variation is negatively correlated with protein interaction degree. The x-axis represents protein physical interaction degree, and the y-axis represents gene expression variation. A) The LOWESS fit to the gene expression variation. B) Bar-plot of the expression variation of all the genes with a given protein interaction degree together with the linear regression fit to the gene expression variation in relation to the interaction degree. The linear coefficient β = -0.0083, R² = 0.86% and the p-value = 1.618e-08. The red dots are the mean expression variation of the genes given the protein physical interaction (PPI) degree. The bar represents the standard deviation of gene expression variation given PPI degree.
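The quantities reported in these captions (the slope β, R², the p-value, and the per-degree mean and standard deviation drawn as red dots and bars) can be illustrated with a short Python sketch. This is not the analysis code of the dissertation: the function names and the toy data are hypothetical, the sample standard deviation is an assumed choice, and the p-value uses a normal approximation to the usual t-test, which is only reasonable for large numbers of genes.

```python
import math
from collections import defaultdict
from statistics import NormalDist, fmean, stdev


def simple_linear_fit(x, y):
    """Least-squares fit y = a + beta * x; returns (beta, a, r2, p)."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    a = my - beta * mx
    # Residual sum of squares and fraction of variance explained.
    sse = sum((yi - (a + beta * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - sse / syy
    # Standard error of the slope; two-sided p-value for H0: beta = 0.
    # ASSUMPTION: normal approximation instead of the exact t distribution.
    se = math.sqrt(sse / (n - 2) / sxx)
    if se == 0.0:
        return beta, a, r2, 0.0
    z = beta / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return beta, a, r2, p


def mean_sd_by_degree(degree, variation):
    """Mean and sample SD of expression variation for each interaction degree."""
    groups = defaultdict(list)
    for d, v in zip(degree, variation):
        groups[d].append(v)
    return {
        d: (fmean(vs), stdev(vs) if len(vs) > 1 else 0.0)
        for d, vs in sorted(groups.items())
    }
```

Calling `simple_linear_fit(ppi_degree, expression_variation)` on per-gene lists gives the slope and R² of the kind quoted in the captions, and `mean_sd_by_degree` gives the values plotted as dots and bars.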
Figure A3.6. The effect of toxicity degree and protein interaction degree on gene expression variation. The mean expression variation and the linear regression fit to the expression variation of all the genes with respect to protein interaction degree for non-essential genes stratified according to toxicity degree and for the essential genes. The β values are -0.0055, -0.0082, -0.0135, -0.0135, and 0.0001 for toxicity degree 0, 1, 2, and 3, and the essential genes, respectively. The corresponding p-values are 0.0425, 0.0141, 0.0023, 0.0724, and 0.9834, respectively. R² is 0.31%, 0.72%, 2.92%, 3.13% and 0, respectively. The labels are the same as those in Figure A3.1.
Oxidative Stress
Figure A3.7. Gene expression variation is negatively correlated with protein interaction degree. The x-axis represents protein physical interaction degree, and the y-axis represents the gene expression variation. A) The LOWESS fit to the gene expression variation. B) Bar-plot of the expression variation of all the genes with a given protein interaction degree together with the linear regression fit to the gene expression variation in relation to the interaction degree. The linear coefficient β = -0.0048, R² = 0.53% and the p-value = 8.107e-06. The red dots are the mean expression variation of the genes given the protein physical interaction (PPI) degree. The bar represents the standard deviation of the gene expression variation given PPI degree.
Figure A3.8. The effect of toxicity degree and protein interaction degree on gene expression variation. The mean expression variation and the linear regression fit to the expression variation of all the genes with respect to protein interaction degree for non-essential genes stratified according to toxicity degree and for the essential genes. The β values are -0.0034, -0.0046, -0.0101, -0.0025, and -0.0014 for toxicity degree 0, 1, 2, and 3, and the essential genes, respectively.
The corresponding p-values are 0.0750, 0.0610, 0.0048, 0.6781, and 0.7446, respectively. R² is 0.24%, 0.42%, 2.49%, 0.17% and 0.05%, respectively. The labels are the same as those in Figure A3.1.
Table A3.1: Analysis of four factors and their interactions affecting expression variation using stepwise selection with AIC.
Data sets: Ca_Na_exposure, Chemostat, Environmental Stress, Oxidative Stress (for each data set: model, p value, R²)
variable
x1       √ 0.0050 0.72%    √ 0.0002 1.29%    √ 0.2220 0.14%    √ 0.0154 0.53%
x2       √ 0.1319 0.21%    √ 0.3483 0.08%    √ 0.8175 0.005%
x3       √ 4.7e-16 5.87%    √ 2.0e-5 1.65%    √ < 2e-16 12.72%    √ 9.6e-08 2.56%
x4       √ 9.3e-14 4.96%    √ 1.9e-9 3.25%    √ 1.1e-12 4.55%    √ 0.0003 1.19%
x1*x2
x1*x3    √ 0.0377 0.39%
x1*x4    √ 0.0281 0.44%
x2*x3
x2*x4    √ 0.0152 0.54%    √ 0.0131 0.56%    √ 0.0229 0.47%
x3*x4    √ 0.0143 0.55%    √ 0.0154 0.54%    √ 2.2e-07 2.43%
R² model    18.62%    13.78%    26.23%    5.57%
The four factors include protein interaction degree (x1), toxicity degree (x2; essential genes are treated as having toxicity degree 4), number of TFs (x3), and the presence of TATA box (x4: 1 for TATA-containing genes, 0 for non-TATA-containing genes). The protein interaction data used in this analysis is based on the BioGrid data set. The column marked with “√” indicates inclusion in the final linear model. The multiple linear regression is based on the final linear model. The p-value tests the null hypothesis β = 0 against the alternative β ≠ 0. R² is the variation explained by the model and by each independent variable, respectively.
A4. Analysis using the average of expression variation across four expression data sets including Ca_Na_exposure, Chemostat, Environmental Stress and Oxidative Stress.
Table A4.1: Analysis of four factors and their interactions affecting expression variation using stepwise selection with AIC.
Protein interaction data sets: MIPS, DIP, BioGrid (for each data set: model, p value, R²)
variable  annotation
X1  Protein physical interaction degree    √ 0.0166 0.46%    √ 0.0852 0.51%    √ 0.0004 1.13%
X2  Toxicity degree (essential genes are treated as having toxicity degree 4)    √ 0.8370 0.003%    √ 0.0579 0.62%    √ 0.2705 0.11%
X3  #TFs    √ < 2e-16 10.16%    √ 1.3e-14 9.7%    √ 1.8e-11 4.07%
X4  1 for TATA-containing genes, 0 for non-TATA-containing genes    √ < 2e-16 5.93%    √ 3.4e-10 6.56%    √ < 2e-16 7.02%
x1*x2
x1*x3    √ 0.0910 0.26%
x1*x4
x2*x3
x2*x4    √ 0.0026 0.73%    √ 0.0025 1.56%    √ 0.0027 0.82%
x3*x4    √ 0.0005 0.97%    √ 0.0360 0.75%    √ 0.0045 0.74%
Total variation explained by the model    R² = 22.49%    R² = 24.679%    R² = 25.84%
The four factors include protein interaction degree, toxicity degree, number of TFs, and the presence of TATA box. The protein interaction data used in this analysis is based on the MIPS, DIP and BioGrid data sets. Gene expression variation is measured as the average of expression variation across four expression data sets including Ca_Na_exposure, Chemostat, Environmental Stress and Oxidative Stress. The column marked with “√” indicates inclusion in the final linear model. The multiple linear regression is based on the final linear model. The p-value tests the null hypothesis β = 0 against the alternative β ≠ 0. R² is the variation explained by the model and by each independent variable, respectively.
Table A4.2. The effects of two factors on expression variation stratified by the presence/absence of TATA box.
TATA data set
variables   annotation                                                                  β         p value    R²
x1          Toxicity degree (essential genes are treated as having toxicity degree 4)   -0.0744   0.0035     1.8%
x2          #TFs                                                                        0.0230    < 2e-16    13.79%
R²          Total variation explained by the model                                                           15.03%
Non-TATA data set
variables   annotation                                                                  β         p value    R²
x1          Toxicity degree (essential genes are treated as having toxicity degree 4)   0.0018    0.8588     0.003%
x2          #TFs                                                                        0.0362    < 2e-16    16.25%
R²          Total variation explained by the model                                                           16.28%
The linear model that includes toxicity degree and the number of TFs is built using the average of expression variation across four expression data sets including Ca_Na_exposure, Chemostat, Environmental Stress and Oxidative Stress. R² is the variation explained by the model and by each independent variable, respectively. β is the linear coefficient in the linear model, and the p-value tests the null hypothesis β = 0 against the alternative β ≠ 0.
Table A4.3. The effects of toxicity degree on expression variation stratified by the set of environmental stress response (ESR).
Gene group   β         p-value
ESR          -0.0327   0.0330
Non-ESR      -0.0425   2.20e-13
The linear model is built using the average of expression variation across four expression data sets including Ca_Na_exposure, Chemostat, Environmental Stress and Oxidative Stress. R² is the variation explained by the model. β is the linear coefficient in the linear model, and the p-value tests the null hypothesis β = 0 against the alternative β ≠ 0.
A5. Analysis using variability of gene expression across more than 1,500 conditions from a combined data set.
Table A5.1. Analysis of four factors and their interactions affecting expression variation using stepwise selection with AIC.
Protein interaction data sets: MIPS, DIP, BioGrid (for each data set: model, p-value, R²)
variable  annotation
X1  Protein physical interaction degree    √ 0.0027 0.76%    √ 0.453 0.73%    √ 0.1309 0.22%
X2  Toxicity degree (essential genes are treated as having toxicity degree 4)    √ 0.1692 0.16%    √ 0.9462 0.001%    √ 0.0126 0.061%
X3  #TFs    √ < 2e-16 14.52%    √ 8.2e-13 16.61%    √ < 2e-16 12.74%
X4  1 for TATA-containing genes, 0 for non-TATA-containing genes    √ < 2e-16 12.72%    √ < 2e-16 8.92%    √ < 2e-16 12.76%
x1*x2
x1*x3
x1*x4    √ 0.0916 0.52%
x2*x3
x2*x4    √ 0.0012 0.88%    √ 0.0019 1.75%    √ 0.0045 0.78%
x3*x4    √ 0.0015 0.85%    √ 0.0029 1.61%    √ 0.0038 0.82%
Total variation explained by the model    R² = 37.53%    R² = 40.99%    R² = 38.65%
The four factors include protein interaction degree, toxicity degree, number of TFs, and the presence of TATA box. The protein interaction data used in this analysis is based on the MIPS, DIP and BioGrid data sets. Gene expression variation is the variability of gene expression across more than 1,500 conditions using a combined data set. The column marked with “√” indicates inclusion in the final linear model. The multiple linear regression is based on the final linear model. The p-value tests the null hypothesis β = 0 against the alternative β ≠ 0. R² is the variation explained by the model and by each independent variable, respectively.
Table A5.2. The effects of two factors on expression variation stratified by the presence/absence of TATA box.
TATA data set
variables   annotation                                                                  β         p value     R²
x1          Toxicity degree (essential genes are treated as having toxicity degree 4)   -0.1060   9.63e-05    3.55%
x2          #TFs                                                                        0.0274    < 2e-16     18.76%
R²          Total variation explained by the model                                                            21%
Non-TATA data set
variables   annotation                                                                  β         p value     R²
x1          Toxicity degree (essential genes are treated as having toxicity degree 4)   -0.0193   0.1012      0.24%
x2          #TFs                                                                        0.0398    < 2e-16     15.17%
R²          Total variation explained by the model                                                            15.56%
The linear model that includes toxicity degree and the number of TFs is built using gene expression variability across more than 1,500 conditions on a combined data set. R² is the variation explained by the model and by each independent variable, respectively. β is the linear coefficient in the linear model, and the p-value tests the null hypothesis β = 0 against the alternative β ≠ 0.
Table A5.3. The effects of toxicity degree on expression variation stratified by the set of environmental stress response (ESR).
Gene group   β         p-value
ESR          -0.0428   0.0085
Non-ESR      -0.0687   <2e-16
The linear model is built using gene expression variability across more than 1,500 conditions on a combined data set. R² is the variation explained by the model. β is the linear coefficient in the linear model, and the p-value tests the null hypothesis β = 0 against the alternative β ≠ 0.
A6. Correlation between independent variables in each model with different protein interaction data.
Table A6.1. The pairwise Spearman’s correlation between independent variables in each model with different protein interaction data.
                                       Toxicity degree        # TFs                  TATA
                                       ρ          p value     ρ          p value     ρ          p value
Protein interaction degree (MIPS)      0.1558     <0.0001     -0.0773    0.0042      -0.0823    0.0023
Toxicity degree                                               -0.0697    0.0098      -0.1574    <0.0001
# TFs                                                                                0.2221     <0.0001
                                       Toxicity degree        # TFs                  TATA
                                       ρ          p value     ρ          p value     ρ          p value
Protein interaction degree (DIP)       0.1530     0.0001      -0.0797    0.0482      -0.0973    0.0158
Toxicity degree                                               -0.1557    0.0001      -0.1913    <0.0001
# TFs                                                                                0.2879     <0.0001
                                       Toxicity degree        # TFs                  TATA
                                       ρ          p value     ρ          p value     ρ          p value
Protein interaction degree (BioGrid)   0.2455     <0.0001     0.0529     0.0427      -0.1541    0.0001
Toxicity degree                                               -0.0466    0.0747      -0.1650    <0.0001
# TFs                                                                                0.2237     <0.0001
The ρ value is Spearman’s correlation coefficient, and the p-value tests the null hypothesis ρ = 0 against the alternative ρ ≠ 0.
Appendix B: Supplemental materials for Chapter 3
B1. Analysis using Brown et al. [2006] data and DIP protein interaction data
Figure B1.1. The relationship between fitness pleiotropy and protein interaction degree (A) and between CC and protein interaction degree (B). A) The fitness pleiotropy is positively correlated with protein interaction degree. The Spearman’s rank correlation is used to measure the relationship between fitness pleiotropy and protein interaction degree (ρ = 0.094, p = 0.0012). The dots are the mean fitness pleiotropy of the genes, given protein interaction degree. Note that only about 1% of proteins have a protein interaction degree higher than 0 (data not shown). For visualization, the line represents linear regression. B) The scatter plot of the relationship between clustering coefficient and protein interaction degree. The Spearman correlation coefficient ρ is 0.539 (p < 2.2e-16).
Figure B1.2. Boxplot of fitness pleiotropy for different groups of proteins classified according to protein interaction degree and CC: LL (protein interaction degree <=3, CC<=0); LH (protein interaction degree <=3, CC>=0.4); HL (protein interaction degree >=6, CC<=0); HH (protein interaction degree >=6, CC>=0.4).
P-values are given to test the hypothesis that the median fitness pleiotropy in LL, LH, and HL is lower than that in the HH group, respectively. The upper edge of the box indicates the 75th percentile, and the lower edge indicates the 25th percentile. The ends of the vertical line indicate the minimum and the maximum values, and the points outside the ends of the vertical line are outliers. The value of n in the box is the number of genes for each group.
Table B1.1. Spearman’s correlation between fitness pleiotropy and PPI, CC when gene expression variation is either controlled or not.
measurement                  expression variation   ρ       p value
Protein interaction degree   not controlled         0.094   0.001
Protein interaction degree   controlled             0.092   0.002
CC                           not controlled         0.095   0.001
CC                           controlled             0.062   0.039
When gene expression variation is controlled, ρ is the partial Spearman’s correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for gene expression variation, i.e., that the relationship between fitness pleiotropy and each measurement is explained by gene expression variation.
Table B1.2. Partial Spearman’s correlation between fitness pleiotropy and gene expression variation when protein interaction degree, CC or CRE is controlled.
                                                                             ρ        p value
ρ(fitness pleiotropy, expression variation | protein interaction degree)     -0.197   1.5e-11
ρ(fitness pleiotropy, expression variation | CC)                             -0.191   6.3e-11
ρ is Spearman’s correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and gene expression variation after controlling for protein interaction degree, CC or CRE.
Table B1.3. Partial Spearman’s correlation between fitness pleiotropy and protein interaction degree, CC or CRE.
                                                                       ρ        p-value
ρ(fitness pleiotropy, CRE | protein interaction degree, CC)            -0.188   1.0e-09
ρ(fitness pleiotropy, protein interaction degree | CRE, CC)            0.052    0.094
ρ(fitness pleiotropy, CC | CRE, protein interaction degree)            0.032    0.301
Partial Spearman’s correlation between fitness pleiotropy and protein interaction degree means Spearman’s correlation after controlling for CC and CRE. ρ is Spearman’s correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for the other two measurements.
B2. Analysis results using Brown et al. [2006] data and BioGrid protein interaction data
Figure B2.1. The relationship between fitness pleiotropy and PPI degree (A) and between CC and PPI degree (B). A) The fitness pleiotropy is positively correlated with protein physical interaction (PPI) degree. The Spearman’s rank correlation is used to measure the relationship between fitness pleiotropy and PPI degree (ρ = 0.298, p < 2.2e-16). The dots are the mean fitness pleiotropy of the genes, given PPI degree. Note that only about 1% of proteins have a PPI degree higher than 150 (data not shown). For visualization, the line represents linear regression. B) The scatter plot of the relationship between clustering coefficient and PPI degree. The Spearman correlation coefficient ρ is 0.462 (p < 2.2e-16).
Figure B2.2. Boxplot of fitness pleiotropy for different groups of proteins classified according to PPI degree and CC: LL (PPI degree <=3, CC<=0); LH (PPI degree <=3, CC>=0.4); HL (PPI degree >=6, CC<=0); HH (PPI degree >=6, CC>=0.4). P-values are given to test the hypothesis that the median fitness pleiotropy in LL, LH, and HL is lower than that in the HH group, respectively. The upper edge of the box indicates the 75th percentile, and the lower edge indicates the 25th percentile.
The ends of the vertical line indicate the minimum and the maximum values, and the points outside the ends of the vertical line are outliers. The value of n in the box is the number of genes for each group.
Table B2.1. Spearman’s correlation between fitness pleiotropy and PPI degree, CC when gene expression variation is either controlled or not.
measurement   expression variation   ρ       p value
PPI degree    not controlled         0.298   < 2.2e-16
PPI degree    controlled             0.292   < 2.2e-16
CC            not controlled         0.168   < 2.2e-16
CC            controlled             0.145   2.3e-16
When gene expression variation is controlled, ρ is the partial Spearman’s correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for gene expression variation, i.e., that the relationship between fitness pleiotropy and each measurement is explained by gene expression variation.
Table B2.2. Partial Spearman’s correlation between fitness pleiotropy and gene expression variation when PPI degree, CC or CRE is controlled.
                                                             ρ        p value
ρ(fitness pleiotropy, expression variation | PPI degree)     -0.119   1.4e-11
ρ(fitness pleiotropy, expression variation | CC)             -0.128   4.0e-13
ρ is Spearman’s correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and gene expression variation after controlling for PPI degree, CC or CRE.
Table B2.3. Partial Spearman’s correlation between fitness pleiotropy and PPI degree, CC or CRE.
                                          ρ        p-value
ρ(fitness pleiotropy, CRE | PPI, CC)      -0.126   1.2e-11
ρ(fitness pleiotropy, PPI | CRE, CC)      0.239    2.8e-3
ρ(fitness pleiotropy, CC | CRE, PPI)      -0.006   0.7626
Partial Spearman’s correlation between fitness pleiotropy and PPI degree means Spearman’s correlation after controlling for CC and CRE.
ρ is Spearman’s correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for the other two measurements.
B3. Analysis using Parsons et al. [2006] data and MIPS protein interaction data
The Parsons et al. (2006) dataset contains the growth rate change profile for mutant strains with each of 4111 nonessential genes deleted under 82 conditions. The data were normalized to a standard normal distribution for each condition. To exclude any biological dependencies among these 82 conditions, the conditions were classified into 65 groups based on their different effects on the phenotype (Parsons et al., 2006). The 65 groups are as follows: Latrunculin B, Cytochalasin A; Staurosporine, Caspofungin; Nystatin, Amphotericin; Clotrimazole, Fluconazole; Radicicol, Geldanamycin; Benomyl, Nocodazole; Haloperidol, Fenpropimorph, Dyclonine; Mitomycin C, MMS, Camptothecin, Cisplatin, Hydroxyurea; Amiodarone, Tamoxifen; Extract 00-192, Extract 00-132, 192A4-Stichloroside, CG4-Theopalauamide; Alamethicin, Papuamide B; and each remaining condition as its own group. The fitness pleiotropy is defined in the main text.
Figure B3.1. The fitness pleiotropy is positively correlated with protein physical interaction (PPI) degree. The Spearman’s rank correlation is used to measure the relationship between fitness pleiotropy and PPI degree (ρ = 0.175, p < 2.2e-16). The dots are the mean fitness pleiotropy of the genes, given the PPI degree. For visualization, the line represents linear regression.
Figure B3.2. The relationship between fitness pleiotropy and measurements. A) Fitness pleiotropy for four different groups of proteins classified according to PPI degree and CC: LL (PPI degree <=3, CC<=0); LH (PPI degree <=3, CC>=0.4); HL (PPI degree >=6, CC<=0); HH (PPI degree >=6, CC>=0.4).
P-values are given to test the hypothesis that the median fitness pleiotropy in LL, LH, and HL is lower than that in the HH group, respectively. The value of n in the box is the number of genes for each group. B) Fitness pleiotropy is positively associated with the number of targeted genes that each TF regulates (ρ = 0.217, p = 0.0005).
Figure B3.3. The relationship between fitness pleiotropy and measurements. A) Fitness pleiotropy for CRs and non-CRs. The line in the box indicates the median value. The upper edge of the box indicates the 75th percentile, and the lower edge indicates the 25th percentile. The ends of the vertical line indicate the minimum and the maximum values, and the points outside the ends of the vertical line are outliers. P-values are given to test the hypothesis that the median fitness pleiotropy for CRs is higher than that for non-CRs using the non-parametric Wilcoxon rank sum test. The value of n in the box is the number of genes for each group. B) Fitness pleiotropy is significantly negatively associated with gene expression variation (ρ = -0.163, p < 2.2e-16).
Figure B3.4. Fitness pleiotropy is negatively associated with chromatin regulatory effect (CRE) (ρ = -0.217, p < 2.2e-16). The labels are the same as those in Figure B3.1.
Table B3.1. Correlation between fitness pleiotropy and each measurement when expression variation is either controlled or not.
Measurement   expression variation   ρ        p value
PPI degree    not controlled         0.175    <2.2e-16
PPI degree    controlled             0.169    <2.2e-16
CC            not controlled         0.183    <2.2e-16
CC            controlled             0.165    <2.2e-16
CRE           not controlled         -0.217   <2.2e-16
CRE           controlled             -0.130   6.1e-14
When gene expression variation is controlled, ρ is the partial Spearman’s correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for gene expression variation, i.e., that the relationship between fitness pleiotropy and each measurement is explained by gene expression variation.
Table B3.2. Partial Spearman’s correlation between fitness pleiotropy and expression variation when each measurement is controlled.
                                                             ρ        p value
ρ(fitness pleiotropy, expression variation | PPI degree)     -0.156   7.1e-17
ρ(fitness pleiotropy, expression variation | CC)             -0.155   8.4e-17
ρ(fitness pleiotropy, expression variation | CRE)            -0.023   0.193
ρ is Spearman’s correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and gene expression variation after controlling for PPI degree, CC or CRE, i.e., that the relationship between fitness pleiotropy and gene expression variation is explained by PPI degree, CC or CRE.
Table B3.3. Partial Spearman’s correlation between fitness pleiotropy and PPI degree, CC or CRE.
                                          ρ        p-value
ρ(fitness pleiotropy, CRE | PPI, CC)      -0.173   1.1e-18
ρ(fitness pleiotropy, PPI | CRE, CC)      0.060    0.0025
ρ(fitness pleiotropy, CC | CRE, PPI)      0.079    6.7e-05
Partial Spearman’s correlation between fitness pleiotropy and PPI degree refers to Spearman’s correlation after controlling for CC and CRE.
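As an illustration of the partial Spearman’s correlation reported in Tables B3.1 to B3.3, a minimal Python sketch follows. It assumes the standard recursive partial-correlation formula applied to rank correlations; the dissertation does not state which implementation was used, the function names are hypothetical, and p-values are not computed here.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def partial_spearman(x, y, controls=()):
    """Spearman's rho between x and y after controlling for each variable
    in `controls` (ASSUMPTION: recursive first-order formula on ranks)."""
    if not controls:
        return spearman(x, y)
    z, rest = controls[-1], controls[:-1]
    r_xy = partial_spearman(x, y, rest)
    r_xz = partial_spearman(x, z, rest)
    r_yz = partial_spearman(y, z, rest)
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denom if denom > 0.0 else float("nan")
```

For example, `partial_spearman(pleiotropy, ppi_degree, (cre, cc))` would correspond to ρ(fitness pleiotropy, PPI | CRE, CC), where the argument names stand for per-gene lists.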
ρ is Spearman’s correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for the other two measurements, i.e., that such a measurement is not significantly associated with fitness pleiotropy in this joint analysis.
B4. Analysis using Parsons et al. [2006] data and DIP protein interaction data
Figure B4.1. The relationship between fitness pleiotropy and PPI. The fitness pleiotropy is positively correlated with protein interaction degree. The Spearman’s rank correlation is used to measure the relationship between fitness pleiotropy and protein interaction degree (ρ = 0.126, p = 3.9e-06). The dots are the mean fitness pleiotropy of the genes, given protein interaction degree. Note that only about 1% of proteins have a protein interaction degree higher than 20 (data not shown). For visualization, the line represents linear regression.
Figure B4.2. Boxplot of fitness pleiotropy for different groups of proteins classified according to protein interaction degree and CC: LL (protein interaction degree <=3, CC<=0); LH (protein interaction degree <=3, CC>=0.4); HL (protein interaction degree >=6, CC<=0); HH (protein interaction degree >=6, CC>=0.4). P-values are given to test the hypothesis that the median fitness pleiotropy in LL, LH, and HL is lower than that in the HH group, respectively. The upper edge of the box indicates the 75th percentile, and the lower edge indicates the 25th percentile. The ends of the vertical line indicate the minimum and the maximum values, and the points outside the ends of the vertical line are outliers. The value of n in the box is the number of genes for each group.
Table B4.1. Spearman’s correlation between fitness pleiotropy and PPI, CC when gene expression variation is either controlled or not.
measurement                  expression variation   ρ       p value
Protein interaction degree   not controlled         0.126   3.9e-06
Protein interaction degree   controlled             0.109   9.0e-05
CC                           not controlled         0.092   0.0008
CC                           controlled             0.063   0.0244
When gene expression variation is controlled, ρ is the partial Spearman’s correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for gene expression variation, i.e., that the relationship between fitness pleiotropy and each measurement is explained by gene expression variation.
Table B4.2. Partial Spearman’s correlation between fitness pleiotropy and gene expression variation when protein interaction degree, CC or CRE is controlled.
                                                                             ρ        p value
ρ(fitness pleiotropy, expression variation | protein interaction degree)     -0.168   1.3e-09
ρ(fitness pleiotropy, expression variation | CC)                             -0.166   2.1e-09
ρ is Spearman’s correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and gene expression variation after controlling for protein interaction degree, CC or CRE.
Table B4.3. Partial Spearman’s correlation between fitness pleiotropy and protein interaction degree, CC or CRE.
                                                                       ρ        p-value
ρ(fitness pleiotropy, CRE | protein interaction degree, CC)            -0.203   2.2e-12
ρ(fitness pleiotropy, protein interaction degree | CRE, CC)            0.084    0.0042
ρ(fitness pleiotropy, CC | CRE, protein interaction degree)            0.002    0.9517
Partial Spearman’s correlation between fitness pleiotropy and protein interaction degree means Spearman’s correlation after controlling for CC and CRE. ρ is Spearman’s correlation coefficient, and the p-value tests the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for the other two measurements.
B5.
Analysis results using Parson et al [2006] data and BioGrid protein interaction data Figure B5.1. The relationship between fitness pleiotropy and PPI degree. The fitness pleiotropy is positively correlated with protein physical interaction (PPI) degree. The Spearman’s rank correlation is used to measure the relationship between fitness pleiotropy and PPI degree (ρ= 0.229, p< 2.2e-16). The dots are the mean fitness pleiotropy of the genes, given PPI degree. Note that only about 1% of protein has PPI degree higher than 150 (data not shown). For visualization, the line represents linear regression. Figure B5.2. Boxplot of fitness pleiotropy for different groups of proteins classified according to PPI degree and CC: LL (PPI degree <=3, CC<=0); LH (PPI degree <=3, CC>=0.4); HL (PPI degree >=6, CC<=0); HH (PPI degree >=6, CC>=0.4). P-values are given to test the hypothesis that the median fitness pleiotropy in LL, LH, and HL is lower than that in the HH group, respectively. The upper edge of the box indicates the 75 th percentile, and the lower edge indicates the 25 th percentile. The ends of the vertical line indicate the minimum and the maximum values, and the points outside the ends of the vertical line are outliers. The value of n in the box is the number of genes for each group. 123 Table B5.1. Spearman’s correlation between fitness pleiotropy and PPI degree, CC when gene expression variation is either controlled or not. 
Measurement    Expression variation   ρ        p-value
PPI degree     not controlled         0.229    < 2.2e-16
PPI degree     controlled             0.219    < 2.2e-16
CC             not controlled         0.133    7.4e-16
CC             controlled             0.110    9.7e-11
When gene expression variation is controlled, ρ is the partial Spearman’s correlation coefficient and the p-value is based on a test of the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for gene expression variation, i.e., that the relationship between fitness pleiotropy and each measurement is explained by gene expression variation.
Table B5.2. Partial Spearman’s correlation between fitness pleiotropy and gene expression variation when PPI degree, CC or CRE is controlled.
Partial correlation                                          ρ         p-value
ρ(fitness pleiotropy, expression variation | PPI degree)    -0.142    4.4e-17
ρ(fitness pleiotropy, expression variation | CC)            -0.147    2.7e-18
ρ is Spearman’s correlation coefficient, and the p-value is based on a test of the null hypothesis that there is no relationship between fitness pleiotropy and gene expression variation after controlling for PPI degree, CC or CRE.
Table B5.3. Partial Spearman’s correlation between fitness pleiotropy and PPI degree, CC or CRE.
Partial correlation                      ρ         p-value
ρ(fitness pleiotropy, CRE | PPI, CC)    -0.167    6.2e-21
ρ(fitness pleiotropy, PPI | CRE, CC)     0.171    7.0e-22
ρ(fitness pleiotropy, CC | CRE, PPI)     0.017    0.3599
The partial Spearman’s correlation between fitness pleiotropy and PPI degree is the Spearman’s correlation after controlling for CC and CRE. ρ is Spearman’s correlation coefficient, and the p-value is based on a test of the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for the other two measurements.
B6. Analysis results using Hillenmeyer et al. [2008] data and MIPS protein interaction data
The Hillenmeyer et al. (2008) dataset contains growth rate change profiles, measured under 418 conditions, for mutant strains carrying homozygous deletions of each of 4742 nonessential genes. The conditions include compounds tested at different concentrations and collection time points, temperature changes, and different pH values. These 418 conditions were classified into 158 unique conditions, each a unique compound, temperature change treatment, or pH treatment. Fitness pleiotropy is defined in the main text.
Figure B6.1. Fitness pleiotropy is positively correlated with protein physical interaction (PPI) degree. Spearman’s rank correlation is used to measure the relationship between fitness pleiotropy and PPI degree (ρ = 0.222, p < 2.2e-16). The dots are the mean fitness pleiotropy of the genes at a given PPI degree. For visualization, the line represents the linear regression.
Figure B6.2. The relationship between fitness pleiotropy and measurements. A) Fitness pleiotropy for four different groups of proteins classified according to PPI degree and CC: LL (PPI degree <= 3, CC <= 0); LH (PPI degree <= 3, CC >= 0.4); HL (PPI degree >= 6, CC <= 0); HH (PPI degree >= 6, CC >= 0.4). P-values are given to test the hypothesis that the median fitness pleiotropy in LL, LH, and HL is lower than that in the HH group, respectively. The value of n in the box is the number of genes for each group. B) Fitness pleiotropy is positively associated with the number of target genes that each TF regulates (ρ = 0.296, p = 1.6e-06).
Figure B6.3. The relationship between fitness pleiotropy and measurements. A) Fitness pleiotropy for CRs and non-CRs. The line in the box indicates the median value. The upper edge of the box indicates the 75th percentile, and the lower edge indicates the 25th percentile. The ends of the vertical line indicate the minimum and the maximum values, and the points outside the ends of the vertical line are outliers. P-values are given to test the hypothesis that the median fitness pleiotropy for CRs is higher than that for non-CRs, using the non-parametric Wilcoxon rank sum test. The value of n in the box is the number of genes for each group. B) Fitness pleiotropy is significantly negatively associated with gene expression variation (ρ = -0.173, p < 2.2e-16).
Figure B6.4. Fitness pleiotropy is negatively associated with chromatin regulatory effect (CRE) (ρ = -0.213, p < 2.2e-16). The labels are the same as those in Figure B6.1.
Table B6.1. Correlation between fitness pleiotropy and each measurement when expression variation is either controlled or not.
Measurement    Expression variation   ρ         p-value
PPI degree     not controlled          0.222    < 2.2e-16
PPI degree     controlled              0.211    < 2.2e-16
CC             not controlled          0.234    < 2.2e-16
CC             controlled              0.216    < 2.2e-16
CRE            not controlled         -0.213    < 2.2e-16
CRE            controlled             -0.145    < 2.2e-16
When gene expression variation is controlled, ρ is the partial Spearman’s correlation coefficient and the p-value is based on a test of the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for gene expression variation, i.e., that the relationship between fitness pleiotropy and each measurement is explained by gene expression variation.
Table B6.2. Partial Spearman’s correlation between fitness pleiotropy and expression variation when each measurement is controlled.
Partial correlation                                          ρ         p-value
ρ(fitness pleiotropy, expression variation | PPI degree)    -0.158    < 2.2e-16
ρ(fitness pleiotropy, expression variation | CC)            -0.157    < 2.2e-16
ρ(fitness pleiotropy, expression variation | CRE)           -0.019    0.252
ρ is Spearman’s correlation coefficient, and the p-value is based on a test of the null hypothesis that there is no relationship between fitness pleiotropy and gene expression variation after controlling for PPI degree, CC or CRE, i.e., that the relationship between fitness pleiotropy and gene expression variation is explained by PPI degree, CC or CRE.
Table B6.3. Partial Spearman’s correlation between fitness pleiotropy and PPI degree, CC or CRE.
Partial correlation                      ρ         p-value
ρ(fitness pleiotropy, CRE | PPI, CC)    -0.152    < 2.2e-16
ρ(fitness pleiotropy, PPI | CRE, CC)     0.097    6.5e-08
ρ(fitness pleiotropy, CC | CRE, PPI)     0.107    2.9e-09
The partial Spearman’s correlation between fitness pleiotropy and PPI degree is the Spearman’s correlation after controlling for CC and CRE. ρ is Spearman’s correlation coefficient, and the p-value is based on a test of the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for the two other measurements, i.e., that the measurement is not significantly associated with fitness pleiotropy in this joint analysis.
B7. Analysis results using Hillenmeyer et al. [2008] data and DIP protein interaction data
Figure B7.1. The relationship between fitness pleiotropy and PPI. Fitness pleiotropy is positively correlated with protein interaction degree. Spearman’s rank correlation is used to measure the relationship between fitness pleiotropy and protein interaction degree (ρ = 0.104, p = 9.9e-05). The dots are the mean fitness pleiotropy of the genes at a given protein interaction degree. Note that only about 1% of proteins have a protein interaction degree higher than 20 (data not shown). For visualization, the line represents the linear regression.
Figure B7.2. Boxplot of fitness pleiotropy for different groups of proteins classified according to protein interaction degree and CC: LL (protein interaction degree <= 3, CC <= 0); LH (protein interaction degree <= 3, CC >= 0.4); HL (protein interaction degree >= 6, CC <= 0); HH (protein interaction degree >= 6, CC >= 0.4). P-values are given to test the hypothesis that the median fitness pleiotropy in LL, LH, and HL is lower than that in the HH group, respectively. The upper edge of the box indicates the 75th percentile, and the lower edge indicates the 25th percentile. The ends of the vertical line indicate the minimum and the maximum values, and the points outside the ends of the vertical line are outliers. The value of n in the box is the number of genes for each group.
Table B7.1. Spearman’s correlation between fitness pleiotropy and protein interaction degree or CC when gene expression variation is either controlled or not.
Measurement                   Expression variation   ρ        p-value
Protein interaction degree    not controlled         0.104    9.9e-05
Protein interaction degree    controlled             0.102    1.8e-04
CC                            not controlled         0.081    0.002
CC                            controlled             0.046    0.091
When gene expression variation is controlled, ρ is the partial Spearman’s correlation coefficient and the p-value is based on a test of the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for gene expression variation, i.e., that the relationship between fitness pleiotropy and each measurement is explained by gene expression variation.
Table B7.2. Partial Spearman’s correlation between fitness pleiotropy and gene expression variation when protein interaction degree, CC or CRE is controlled.
Partial correlation                                                          ρ         p-value
ρ(fitness pleiotropy, expression variation | protein interaction degree)    -0.217    4.8e-16
ρ(fitness pleiotropy, expression variation | CC)                            -0.215    1.0e-15
ρ is Spearman’s correlation coefficient, and the p-value is based on a test of the null hypothesis that there is no relationship between fitness pleiotropy and gene expression variation after controlling for protein interaction degree, CC or CRE.
Table B7.3. Partial Spearman’s correlation between fitness pleiotropy and protein interaction degree, CC or CRE.
Partial correlation                                              ρ         p-value
ρ(fitness pleiotropy, CRE | protein interaction degree, CC)     -0.225    1.1e-15
ρ(fitness pleiotropy, protein interaction degree | CRE, CC)      0.080    0.005
ρ(fitness pleiotropy, CC | CRE, protein interaction degree)     -0.005    0.862
The partial Spearman’s correlation between fitness pleiotropy and protein interaction degree is the Spearman’s correlation after controlling for CC and CRE. ρ is Spearman’s correlation coefficient, and the p-value is based on a test of the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for the other two measurements.
B8. Analysis results using Hillenmeyer et al. [2008] data and BioGrid protein interaction data
Figure B8.1. The relationship between fitness pleiotropy and PPI. Fitness pleiotropy is positively correlated with protein interaction degree. Spearman’s rank correlation is used to measure the relationship between fitness pleiotropy and protein interaction degree (ρ = 0.311, p < 2.2e-16). The dots are the mean fitness pleiotropy of the genes at a given protein interaction degree. Note that only about 1% of proteins have a protein interaction degree higher than 20 (data not shown). For visualization, the line represents the linear regression.
Figure B8.2. Boxplot of fitness pleiotropy for different groups of proteins classified according to PPI degree and CC: LL (PPI degree <= 3, CC <= 0); LH (PPI degree <= 3, CC >= 0.4); HL (PPI degree >= 6, CC <= 0); HH (PPI degree >= 6, CC >= 0.4). P-values are given to test the hypothesis that the median fitness pleiotropy in LL, LH, and HL is lower than that in the HH group, respectively. The upper edge of the box indicates the 75th percentile, and the lower edge indicates the 25th percentile. The ends of the vertical line indicate the minimum and the maximum values, and the points outside the ends of the vertical line are outliers. The value of n in the box is the number of genes for each group.
Table B8.1. Spearman’s correlation between fitness pleiotropy and PPI degree or CC when gene expression variation is either controlled or not.
Measurement    Expression variation   ρ        p-value
PPI degree     not controlled         0.311    < 2.2e-16
PPI degree     controlled             0.304    < 2.2e-16
CC             not controlled         0.168    < 2.2e-16
CC             controlled             0.148    < 2.2e-16
When gene expression variation is controlled, ρ is the partial Spearman’s correlation coefficient and the p-value is based on a test of the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for gene expression variation, i.e., that the relationship between fitness pleiotropy and each measurement is explained by gene expression variation.
Table B8.2. Partial Spearman’s correlation between fitness pleiotropy and gene expression variation when PPI degree, CC or CRE is controlled.
Partial correlation                                          ρ         p-value
ρ(fitness pleiotropy, expression variation | PPI degree)    -0.148    < 2.2e-16
ρ(fitness pleiotropy, expression variation | CC)            -0.155    < 2.2e-16
ρ is Spearman’s correlation coefficient, and the p-value is based on a test of the null hypothesis that there is no relationship between fitness pleiotropy and gene expression variation after controlling for PPI degree, CC or CRE.
Table B8.3. Partial Spearman’s correlation between fitness pleiotropy and PPI degree, CC or CRE.
Partial correlation                      ρ         p-value
ρ(fitness pleiotropy, CRE | PPI, CC)    -0.168    < 2.2e-16
ρ(fitness pleiotropy, PPI | CRE, CC)     0.251    < 2.2e-16
ρ(fitness pleiotropy, CC | CRE, PPI)    -0.005    0.775
The partial Spearman’s correlation between fitness pleiotropy and PPI degree is the Spearman’s correlation after controlling for CC and CRE. ρ is Spearman’s correlation coefficient, and the p-value is based on a test of the null hypothesis that there is no relationship between fitness pleiotropy and each measurement after controlling for the other two measurements.
B9. The correlation among three phenotypic profiles
Table B9.1. Spearman’s rank correlation among the three phenotypic profiles used for our analysis.
                            Brown et al. [2006]    Parson et al. [2006]     Hillenmeyer et al. [2008]
Brown et al. [2006]         -                      ρ = 0.38, p < 2e-16      ρ = 0.66, p < 2e-16
Parson et al. [2006]        -                      -                        ρ = 0.39, p < 2e-16
Hillenmeyer et al. [2008]   -                      -                        -
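The partial Spearman’s correlations reported in the Appendix B tables can be illustrated with a short sketch. The transcript does not say which software or formula was used, so the following is only one common approach, stated as an assumption: rank-transform every variable, regress the ranks of the two variables of interest on the ranks of the controlled variables, and correlate the residuals. The function name `partial_spearman` and the t-based p-value with n - 2 - k degrees of freedom are choices made for this illustration, not details taken from the dissertation.

```python
import numpy as np
from scipy import stats


def partial_spearman(x, y, controls):
    """Partial Spearman correlation of x and y given one or more controls.

    Hypothetical helper for illustration only (not from the dissertation).
    `controls` is a 1-D array (one control) or an (n, k) array (k controls).
    Returns (rho, two-sided p-value).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = np.asarray(controls, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    n, k = z.shape

    # Rank-transform every variable (ties receive average ranks).
    rx = stats.rankdata(x)
    ry = stats.rankdata(y)
    rz = np.column_stack([stats.rankdata(z[:, j]) for j in range(k)])

    # Remove the linear effect of the control ranks from the ranks of x and y.
    design = np.column_stack([np.ones(n), rz])
    res_x = rx - design @ np.linalg.lstsq(design, rx, rcond=None)[0]
    res_y = ry - design @ np.linalg.lstsq(design, ry, rcond=None)[0]

    # Pearson correlation of the residuals is the partial rank correlation.
    rho = float(np.corrcoef(res_x, res_y)[0, 1])

    # Approximate p-value from a t distribution (an assumption of this sketch).
    dof = n - 2 - k
    if abs(rho) >= 1.0:
        return rho, 0.0
    t_stat = rho * np.sqrt(dof / (1.0 - rho ** 2))
    p_value = 2.0 * stats.t.sf(abs(t_stat), dof)
    return rho, float(p_value)
```

Under these assumptions, a row such as ρ(fitness pleiotropy, CC | CRE, PPI) would correspond to a call like `partial_spearman(pleiotropy, cc, np.column_stack([cre, ppi]))`, where the four arrays are hypothetical per-gene vectors aligned on the same genes.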
Abstract
Genes differ in their expression variation and growth fitness in response to various environmental conditions. Many studies have focused on identifying the factors that influence such differences, especially in comparisons of essential and viable genes. Nonetheless, obtaining a more complete understanding of the factors, and the interactions among them, that influence expression variation and fitness pleiotropy (growth defect upon gene deletion) remains a challenge.
Asset Metadata
Creator
Zhou, Linqi (author)
Core Title
Genome-wide association study of factors influencing gene expression variation and pleiotropy
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology
Publication Date
10/21/2009
Defense Date
05/29/2009
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
gene expression variation,OAI-PMH Harvest,pleiotropy
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Sun, Fengzhu (committee chair), Chen, Liang (committee member), Goldstein, Larry M. (committee member)
Creator Email
linqizho@usc.edu,zhoulinqi@hotmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m2680
Unique identifier
UC1163394
Identifier
etd-zhou-3210 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-267958 (legacy record id),usctheses-m2680 (legacy record id)
Legacy Identifier
etd-zhou-3210.pdf
Dmrecord
267958
Document Type
Dissertation
Rights
Zhou, Linqi
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu