Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Integrative approach of biological systems analysis on regulatory pathways, modules, protein essentialities, and disease gene classification
(USC Thesis Other)
Integrative approach of biological systems analysis on regulatory pathways, modules, protein essentialities, and disease gene classification
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
INTEGRATIVE APPROACH OF BIOLOGICAL SYSTEMS ANALYSIS ON
REGULATORY PATHWAYS, MODULES, PROTEIN ESSENTIALITIES,
AND DISEASE GENE CLASSIFICATION
by
Zhidong Tu
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
December 2006
Copyright 2006 Zhidong Tu
ii
Dedication
to
My parents
Whose selfless love and support always guide my way.
iii
Acknowledgements
First and foremost I would like to thank my advisor Dr. Fengzhu Sun, for his great guidance,
support and encouragement. There are so many things that I learnt from him, especially, how
to be a serious scientist to explore the unknown scientific world. It is him who taught me to
face difficulties always with a positive attitude and never give up easily. All these are such a
great fortune of my life that will definitely effect my future career.
I also want to thank Dr. Ting Chen, my co-advisor who also gave me great support. He is a
person full of sparkling ideas which inspired me to be creative in my own research. I owe great
thanks to all the other members in my committee. Drs. Xianghong Jasmine Zhou and Michelle
N. Arbeitman gave me many great insightful advices, not only on my research, but also on my
career development. A special thank to Dr. David Kempe for kindly being my outside member
and the constructive discussions that we had.
There are many other people without whom this dissertation would likely not exist. A few of
them are Minghua Deng, Huiying Yang, Xiaotu Ma, Min Xu, and Matt Lebo. To those people
whose names are not listed here, I greatly appreciate their support.
Finally, I would like to thank my parents and my fiancée Li Wang. It is only because of their
forever loves so could I fully devote myself to the research and be brave when facing those
challenges.
iv
Table of Contents
Dedication
ii
Acknowledgements iii
List of Tables vii
List of Figures viii
Abstract x
Chapter I Introduction to systems biology
1.1 Introduction
1.1.1 Breakthrough in high throughput biotechnologies
1.1.2 Challenges in computational biology
1.2 Summary of our work
1.2.1 Causal gene inference and pathway identification
1.2.2 Transcriptional modules and their dynamics
1.2.3 Protein essentiality
1.2.4 Disease gene classification
1.3 Conclusions
1
2
2
8
11
12
14
14
15
16
Chapter II Causal gene inference and pathway identification
2.1 Introduction
2.2 Methods
2.2.1 Basic Assumptions
2.2.2 Searching the network
2.2.3 Select subset of conditions
2.2.4 Significant measurement
2.3 Results
2.3.1 Data collection
2.3.2 Testing with knock-out data
2.3.3 eQTL mapping
2.3.4 causal gene inference
2.4 Discussion
2.5 Acknowledgement
17
18
22
23
23
28
30
30
30
31
33
34
39
41
Chapter III Transcriptional modules and their dynamics
3.1 Introduction
42
43
v
3.2 Results
3.2.1 Gene clustering on eQTL
3.2.2 Genes in the same cluster share common function, have
highly correlated expression profiles, and are enriched on
common TFs
3.2.3 Genes in cluster have loose connections
3.2.4 Shaping forces of transcriptional modules
3.3 Discussion
3.4 Conclusions
3.5 Materials and methods
3.5.1 eQTL mapping
3.5.2 eQTL clustering algorithm
3.5.3 Protein physical interaction network
3.6 Acknowledgement
45
45
47
50
51
53
54
55
55
55
56
56
Chapter IV Protein essentiality prediction
4.1 Introduction
4.2 Methods
4.2.1 Diffusion-Kernel-based logistic regression model for
protein
function prediction
4.2.2 Integrating and selecting other information and model
validation
4.3 Results
4.3.1 Data
4.3.2 Feature analysis
4.3.3 Protein-protein interactions and network topology
4.3.4 Protein functions based on biological processes
4.3.5 Protein domains
4.3.6 Protein cellular localization
4.3.7 Protein conservation and complexes
4.3.8 Prediction for protein essentiality
4.3.9 Application of the computational model
4.4 Discussion
57
58
59
59
60
61
61
63
64
66
68
70
71
72
75
77
Chapter V Disease gene classification
5.1 Background
5.2 Results
5.2.1 Comparison on the evolutionary features
5.2.2 Cross-species comparison of gene deletion
phenotypes
5.2.3 Comparison on other features
80
81
83
83
89
91
vi
5.3 Discussion
5.4 Conclusions
5.5 Methods
5.5.1 Compiling lists of disease gene and UEHGs lists
5.5.2 Collection of gene features
5.6 Acknowledgements
95
99
99
99
100
102
Chapter VI Conclusions
6.1 Future work
6.2 Future of the future
103
104
106
Appendix A 118
Appendix B 121
Appendix C 125
vii
List of Tables
2.1 Pathway-wise increase of the correlation when appropriate subset of
conditions is selected
36
2.2 Function annotation of the proteins involved in the G1/S phase transition. 39
3.1 Function analysis of each cluster. 48
3.2 Interactions within each module. 51
4.1 A forward stepwise selection approach is used to select the features which
increase the prediction accuracy.
73
5.1 Comparison of evolutionary rate among three groups of genes. 85
5.2 Cross-species comparison of gene essentiality between human and S.
cerevisiae.
90
5.3 Comparison of the length of various parts of UEHGs, disease genes, and all
other genes.
94
B1 Function categories which cover over 100 proteins. 121
B2 Location categories which cover at least 100 proteins. 122
B3 Prediction of the essentialities of unknown yeast proteins. 123
B4 List of proteins which have high posterior probability of being essential
but are non-essential in the SGD phenotype data set.
123
viii
List of Figures
2.1 A conceptual gene regulatory pathway. 20
2.2 Overview of our multi-step procedure for causal gene identification and
gene regulatory pathway inference.
22
2.3 The flow diagram of the stochastic searching algorithm. 25
2.4 The global view of the inferred regulatory network. 35
2.5 An example of the inferred causal gene and the pathway. 37
2.6 Inferred pathways related to G1/S phase transition. 38
3.1 The chromosomal distribution of eQTLs. 46
3.2 The p-values for the tests of the hypothesis that expression levels of
genes within each modules are more correlated than random sets.
49
4.1 Upper: the fraction of essential proteins vs. the degree of proteins.
Down: the approximate density curves for the fraction of essential
proteins among the neighbours of 378 essential proteins and that of 423
non-essential proteins.
65
4.2 The fractions of essential and nonessential genes in each of the 24
function categories considered.
67
4.3 The fraction of essential proteins vs. the number of domains. 69
4.4 The fraction of essential proteins in different cellular locations. 70
4.5 Upper: the fraction of essential proteins in all, conserved and
non-conserved proteins. Down: the fraction of essential proteins in all,
protein within complexes, and proteins not in complexes.
72
4.6 Compare the prediction performance of the three methods, MRF,
SVM , and the new method.
75
5.1 Distribution of Ka, Ks and Ka/Ks. 86
5.2 Codon conservation of the three gene groups. The conservation score of
amino acids of the three groups of genes are compared.
88
5.3 Comparison of gene essentiality between human and C.elegans. 91
5.4 Distribution of protein physical interaction degrees, UEHGs, disease 92
ix
genes, and other genes are shown in three different colors in the
histogram.
5.5 Function annotations of genes in the three groups. 93
5.6 Correlation of disease onset age with Ka/Ks. 98
B1 The number of proteins with given number of domains. 124
C1 Cumulative distribution function curve of Ka, Ks and Ka/Ks for
UEHGs, disease and all other genes.
129
x
Abstract
The rapid development in high throughput biotechnologies in the last decade has
fundamentally changed the way of studying biological systems. Today, whole genome
sequences are either available or can be obtained within weeks. Genome wide expression
levels can be easily measured by microarray analyses. Tens of thousands of protein-protein
interactions have been identified in several species, by either yeast two hybrid assays or by
mass spectrometry. Transcription factor-DNA binding interactions can also be identified by
high throughput techniques, such as by chromatin immunoprecipitation followed by whole
genome chip analyses. Many other new technologies are emerging, such as high throughput
cell imaging, protein arrays. All these technologies, together with traditional methods, provide
us an unprecedented opportunity to explore biological systems in a holistic way.
Computational biologists are confronted with great opportunity but also significant challenges
in analyzing these data sets to extract biologically meaningful knowledge. Since no single
technique can provide all the information needed for understanding the system, integrative
analysis has turned out to be an essential approach to unravel how the systems work.
We demonstrate how we pursued such integrative approaches by presenting the results of
several independent but related problems. Specifically, we studied (i) causal gene inference
and regulatory pathway identification, (ii) transcriptional module identification and its
dynamics, (iii) essential protein prediction, and (iv) human disease gene classification. For the
first two problems, we treated the system as a network of interactions and regulations and
based our analysis mainly on this network. For the third problem, we used both network
xi
properties and several other features. For the last problem, we mainly relied on feature
comparison.
We want to show from these analyses that the integrative approach can help us to better utilize
data, to design more realistic models, and to achieve what would otherwise be impossible by
any single data source. We anticipate that as increasingly more large scale data sets become
available, our understanding of the systems will be significantly deepened.
1
Chapter I
Introduction to systems biology
2
1.1.1
1.1 Introduction
A new trend in biological research is to study organisms at a systems level. Different from
conventional approaches, which generally focused on a single gene or protein, now we are
able to study multiple genes/proteins simultaneously. There are many advantages of studying a
set of genes at the same time. As every functional gene and its products play their roles by
interacting with other molecules, it is essential to consider the whole system in which the gene
is involved for a complete understanding of its function. This is especially important for the
research on human diseases, which may have multiple contributing factors that are responsible
for the disease phenotype. For a thorough understanding of the disease mechanisms, it is
necessary to consider these factors together.
Breakthrough in high throughput biotechnologies
The dominant propelling force for this new trend in biological research is the development of a
series of new technologies, especially, high throughput technologies. Among all of these new
advances in high throughput technologies, the one which initiated the wave of systems biology
is high throughput DNA sequencing, which is still under rapid development. For example, the
new development in nano technology could enable us to sequence the whole human genome in
just several hours (Lagerqvist, et al., 2006). When this technology was widely applied to
sequencing the full genome of specific organisms, such as human in the late 90’s, it also
brought an urgent demand for computational methods to analyze the generated sequence data.
For example, due to the limitation of the sequencing technology, the draft data were reads of
short DNA segments of several hundred base pairs instead of the whole genome. Therefore,
reliable computational algorithms were needed to assemble these short fragments into a
3
complete genome sequence. Substantial research on sequence data analyses had been
performed and directly led to the formation of a new branch of science, computational biology.
With the development of both sequencing technology and the mathematical foundation
provided by computational biology, whole genome sequences of more than a hundred species
are now available and the number continues to grow.
This large amount of sequence information has given immeasurable help to biological
researchers in multiple aspects of their research. For example, a biologist can easily check his
newly identified DNA sequence against this huge DNA sequence database, to see if the
sequence has been previously identified. If there is a record, it is likely that additional
information is available for this sequence provided by other scientists. If the sequence has not
been found, then he can try to align the sequences to the genomes of other species. If there are
homologous sequences, the corresponding gene is likely to be functional and further studies
can be taken. As a second example, whole genome sequences of multiple species can also help
to annotate the genome. Because non-coding DNA sequences are supposed to undergo neutral
evolution and diverge much faster than coding sequences, it is much easier to identify a true
gene from the background by comparing multiple genomes. As our last example, in order to
study whole genome rearrangement events, it is essential to have multiple whole genome
sequences. There are many other uses of the DNA sequence information and we will give
additional examples on this in later parts of this dissertation.
There is no doubt that without the high throughput DNA sequencing technology, the whole
biological research community would be severely affected. Many projects would be slowed
down or even infeasible to perform. However, the information that DNA sequences provided
alone is far from sufficient for biological researches to understand the complex systems. For
4
example, with only DNA sequence information, it is hard (if not impossible) to tell how well
the gene will be expressed, it will also be hard (if not impossible) to identify the physical
interactions that a given gene product could have. Fortunately, new technologies are always
developed to meet our needs. Many other high throughput technologies have been developed
along with the DNA sequencing techniques and we’ll describe a few of them which are most
important and relevant to our research.
Among all other high throughput technologies, DNA microarray technology is probably the
best recognized one. Using microarray technology, we can measure the expression levels
(mRNA level) of tens of thousands of genes simultaneously. Compared with DNA sequencing,
which generates relatively static results, the microarray technology reveals the dynamic aspect
of the biological system. The most frequent use of this technology is to monitor genes that
change their expression levels under various conditions, such as at different stages of
development, or under different physical environmental conditions. For example, people
observed that hundreds of genes’ expression levels could undertake significant changes by a
single perturbation such as environmental temperature change. What’s more, investigators
found that for many diseases, such as cancer, the transcriptome displays significant differences
between healthy people and patients. These findings brought great interests in exploring the
underlying mechanisms that are responsible for the differential expression patterns. It is an
important task as it could help us to understand how certain disease states develop. To analyze
microarray data, computational biologists are required again since microarray data are in the
format of large numerical vectors (hundreds or thousands of genes each has a numerical
expression level or log ratio of the expression level) or matrix (multiple conditions) which
5
cannot be interpreted by the naked eye. We’ll discuss some of the computational issues of
microarray analyses in the next section.
Gene expression is under complicated control mechanisms, and transcription factors (TFs) are
believed to play a predominant role in activating or repressing gene expression. (Although it is
becoming clear that several other mechanisms are also contributing to the transcriptional
regulation (Slack, 2006), it is not clear how significant their roles are in the overall regulation
and we don’t further pursue that direction in this manuscript.) It is natural to study TFs when
studying gene expression regulation. It has been well studied that for a TF to regulate its target
genes, it needs to bind to the upstream DNA regulatory regions of the target genes (usually
termed regulatory motif). For a specific TF, the motif sequence has a specific pattern(s). It is
by this pattern(s) that a TF can recognize the corresponding target genes. Computationally
identifying the motif sequence turns out to be very challenging. This is because the upstream
regions that could contain motif sequences can be 10 kbps long or even longer, and usually
many patterns will appear significant in these regions. However, the development of
chromatin immunoprecipitation chip (ChIP-chip) technology seems to provide an
experimental solution for identifying the target genes of TFs (Lee, et al., 2002; Harbison, et al.,
2004). Using this technology, we can identify the DNA segments that a TF binds. By figuring
out which gene’s upstream region contains these DNA segments, we can easily figure out the
target genes of the TF.
With the knowledge of TF-target gene relationship, our understanding of gene regulatory
program has been significantly improved. For example, investigators have found that if a set
of genes display similar expression patterns, these same genes will frequently be under the
regulation of a common set of TFs. Using the same information, we also know that the
6
transcriptional modules form a hierarchical structure (Tanay, et al., 2004; Leyfer and Weng,
2005). Clearly, knowing the TF-target gene regulation is fundamentally important for
revealing the complex structures of how the transcriptome is regulated. However, having the
TF-DNA binding information is not sufficient for a complete understanding of the regulatory
program. Many TF can change their activities by post-translational regulation, e.g.,
phosphorylation or protein interactions. Therefore, to understand the whole system, we need
to know more than TF-DNA binding information. We’ll revisit this issue in the later part.
Another very important high throughput technology is protein-protein interaction assays. Uetz
et al. were among the first to use high-throughput technology to explore genome wide
protein-protein interaction (Uetz, et al., 2000). The technology that was used is yeast
two-hybrid analysis where fusion proteins were created to have either a transcription factor
activation domain or DNA-binding domain. When the two proteins fused to the TF domains
are interacting with each other, these two TF domains will become physically close so that the
dimer can bind to the regulatory region of the reporter gene and activate its expression. Using
the same technologies, Ito et al. reported a different set of protein-protein interaction in yeast a
year later (Ito, et al., 2001). With the data generated by these experiments, great interest arose
in studying the network properties. For example, a classic finding was made by Jeong et al.,
where they found that protein essentiality is positively correlated with the degree of protein
interactions (Jeong, et al., 2001). Jeong’s finding is important in the sense that it showed that
by looking at very simple network physical properties (e.g., the degree of protein-protein
interaction), much more can be known about biological systems. In other words, some of the
mysteries of a biological system can be revealed by very simple rules. Jeong et al. also found
out that the degree distribution of metabolic networks in multiple organisms follows a power
7
law (Jeong, et al., 2000). In such networks, the average distance between any two nodes is
much shorter than in an Erdös random network. This is because such a network’s topology is
dominated by a few highly connected nodes (hubs) which link the rest of the less connected
nodes to the system (Jeong, et al., 2000). Besides the degree distribution of the protein
interaction network, people also discovered that there exist simple topological structures
which appeared much more frequently than would otherwise be expected by random (Milo, et
al., 2002). Such simple topological structure profiles (called network motifs) have been
successfully used to distinguish different types of networks (Yeger-Lotem, et al., 2004) and
some of them have been shown to possess biological functions.
Lastly, we’ll discuss a newly developed high throughput technology for detecting proteome
wide phosphorylation events which could provide very important information for studying
gene expression regulation. Protein phosphorylation is estimated to affect 30% of the
proteome and is a major regulatory mechanism that controls many basic cellular processes
(Ficarro, et al., 2002; Ptacek, et al., 2005). In 2001, Zhu et al. designed the first proteome chips
by covalently attaching purified proteins to microwells in silicone elastomer sheets placed on
top of microscope slides (Zhu, et al., 2000). In 2005, Ptacek et al. used this technology to
perform the first global analysis of protein phosphorylation in yeast (Ptacek, et al., 2005). In
this experiment, about 4,400 proteins were spotted on the array. The proteome arrays were
then incubated with kinase and isotope labeled ATP. After incubation, the proteome array was
washed and exposed to film. By reading the film, Ptacek easily identified which protein was
phosphorylated by the kinase. Using this technology, more than 4,000 phosphorylation events
were identified and many of them were novel findings.
8
1.1.2
There are many other new technologies developed recently. High throughput and automation
are the two key features that bring them to the center stage. Since these high throughput
technologies usually generate large amounts of raw data, they give computational biologists
great opportunities to develop methodology to interpret them. For example, one common
drawback associated with these high throughput technologies is the high error rate. New
computational methods are required to overcome this problem and to extract meaningful
biological knowledge. We’ll discuss this in the following section.
Challenges in computational biology
Different technologies generate different formats of data, and this poses different problems to
computational biologists. There is not one computational method which can solve all the
problems. Therefore, we need to describe the computational problems and the solutions for
each specific technology.
Although the problems we’ll discuss are different, There is a common issue associated with all
the high throughput technologies we described so far, i.e., noise. None of the above
technologies are perfect and errors consistently exist, although to different extents. For
example, DNA sequencing data is quite accurate compared with other technologies. A recent
study estimated that the single nucleotide error rate for coding DNA is 0.10% (SE 0.012%) in
Genebank and 0.22% (SE 0.051%) for intronic DNA sequences (Wesche, et al., 2004).
Although the error rate is quite small, the total number of single nucleotide errors could still be
very large due to the large number of sequences in Genebank. For certain studies, such as
single nucleotide polymorphisms (SNPs), this error rate is intolerable. This is because SNPs
usually happen at very low frequency (from 1/500bps to 1/1,000bps). Therefore, for these
9
studies, repeated sequencing is required for reliable sequence results. On the other hand, errors
generated by microarray technologies are much more serious. We’ll use cDNA microarray as
an example. In cDNA microarray, the most common error is “red-green bias due to differences
between the labeling efficiencies and scanning properties of the two fluors complicated
perhaps by the use of different scanner settings. Other biases may arise from variation between
spatial positions on a slide or between slides. Positions on a slide may vary because of
differences between the print-tips on the array printer, variation over the course of the
print-run or non-uniformity in the hybridization. Differences between arrays may arise from
differences in print quality or from differences in ambient conditions when the plates were
processed. It is necessary to normalize the intensities before any subsequent analysis is carried
out” (Smyth, et al., 2003). To perform any high level analysis of microarray data (such as
regulatory network analysis), it is essential that we obtain reliable and clean data to begin the
study. Many studies are focused on minimizing the systemic errors in microarray experiments.
Besides the error issues, the fundamental computational biological problem is to derive
biologically meaningful knowledge out of the data. For sequence analysis, a few examples of
such problems are to identify sequences that are coding for either protein or RNA (gene
finding), or to find out sequences that are conserved across multiple species (multiple
sequence alignment), or to build the phylogenetic relationships among multiple sequences.
Many statistical models were built to solve the above problems, such as Hidden Markov
Models (HMM), Gibbs Sampler, Expectation Maximization (EM), and Maximum Likelihood
Estimation (MLE). By applying these methods, computational biologists can not only point
out which region is most likely to 1) be a gene, 2) be conserved among multiple species, and 3)
10
the phylogenetic tree that best fits the input sequences, they can also estimate how significant
these inferences are by generating either P-values, or log likelihood ratios.
For the microarray experiment, a typical problem is to identify differentially expressed genes
under certain perturbations (Baggerly, et al., 2001). Many statistical models have been
developed and research in this area is quite active (Tusher, et al., 2001; Reiner, et al., 2003; Jie
Chen, 2006). Another very important question on microarray analyses is to figure out gene
expression regulation mechanisms. Among the computational research efforts on this problem,
the Bayesian network approach attracted most attention (Friedman, et al., 2000) due to its
great power in modeling very complex systems. However, limitations of it soon became
known. First, it requires large amounts of data to train the network, which is usually not
available. Second, finding the optimum network structure has turned out to be an NP-hard
problem, i.e., no best solution can be found efficiently when the network is big. Third,
although the inferred network could fit the data, it may not directly reveal the underlying
regulatory mechanisms. All these issues require investigators to broaden their minds and not
only focus on gene expression profiles to interpret gene regulation.
For protein-protein interactions, there are quite a few research directions and we’ll briefly
describe two of them. First, computational scientists are working on inferring protein
functions based on protein interactions. The basic underlying rationale is that proteins tend to
interact with proteins having similar functions. From simple neighbor counting to the more
advance Markov Random Field model, protein interactions turned out to be informative in
function prediction (Hishigaki, et al., 2001; Deng, et al., 2003). Second, protein-protein
interaction data can also be used to identify protein complex components (Chu, et al., 2006).
This is because components in a protein complex will interact with other complex components,
11
1.2
and therefore form a dense sub-network, if we think of proteins as nodes and their interactions
as edges. By searching for dense sub-networks in the global protein-protein interaction
network, we can identify protein complexes and their components. To identify the largest
clique in a network is an NP-hard question and the protein interaction network is rather sparse
and many edges are potentially missed, we have to rely on heuristic solutions to find large (not
the largest) and almost complete sub-networks.
The above described problems are all based on single data source, in the sense that data
generated by a single high throughput technology are considered for the whole analysis. This
approach is effective when our focus is simple and straightforward and there is no need for
considering other data sources (such as protein network topology analysis only requires
protein interaction information). However, for the purpose of understanding the system, the
importance of integrating multiple sources of data becomes apparent. There are two major
advantages to integrative approaches. First, the added information will give us more power in
modeling and interpreting the system. Second, the error rate of a single data source can be
reduced by integrating different data sources. We’ll demonstrate these two points in our later
chapters. In the next section, we’ll briefly describe the work that we have done and each study
will be fully described in later chapters.
Summary of our work
The main theme of our work is to utilize different sources of information that are available to
develop new integrative approaches to help us understand the biological systems. We
demonstrate how we pursued this by presenting several independent, but also related projects
that we have worked on. Specifically, we studied the following four problems, (i) causal gene
12
1.2.1
inference and regulatory pathway identification, (ii) transcriptional module identification and
its dynamics, (iii) essential protein prediction, and (iv) human disease gene classification. For
the first two problems, we treated the system as a network of interactions and regulations and
we mainly worked on the network. For the third problem, we relied on both network properties
and several other features. For the last problem, we mainly used feature comparison. We’ll
give a brief summary of each work.
Causal gene inference and pathway identification
By comparing gene expression profiles of different individuals, scientists have observed
expression variation for thousands of genes in yeast, mouse and human (Brem, et al., 2002;
Morley, et al., 2004; Turk, et al., 2004). Additionally, investigators found that a large fraction
of such variation was heritable. Therefore, certain genetic factors must exist which are
responsible for the expression variation. If we treat the expression level as traditional
quantitative trait (typical quantitative traits for human could be body height, body weight, etc.),
then we can apply the classic linkage analysis method to identify the chromosomal regions
which can explain the expressional quantitative trait (eQT). Here, we will give a very brief
description of the basic concept of linkage analysis. For a thorough understanding of the topic,
readers can refer to any genetics book, e.g., Terwilliger JD, Ott J. “Handbook of human
genetic linkage” (1994). In linkage analysis, parental mating will create recombination events
and will be inherited in the offspring. We genotype a set of markers (a marker is a short DNA
fragment with known chromosomal location and usually shows much diversity in the
population). Due to recombination events, the combinations of markers are “randomized” in
offspring. By testing which marker significantly co-segregates with the quantitative trait, we
can identify the chromosomal regions (represented by markers) which are “linked” with the
13
quantitative trait. By linkage analysis, we see that expression variation can often be linked to
certain chromosomal regions (called eQTLs). Inferring the causal genes for the expression
variation is of great importance but rather challenging as the linked region generally contains
multiple genes. Even when a single candidate gene is proposed, the underlying biological
mechanism by which the regulation is enforced remains unknown. Novel approaches are
needed to both infer the causal genes and generate hypotheses on the underlying regulatory
mechanisms.
In this work, we propose a new approach which aims at achieving the above objectives by
integrating genotype information, gene expression, protein-protein interaction, protein
phosphorylation, and transcription factor (TF)-DNA binding information. We use
protein-protein interaction, protein phosphorylation and TF-DNA binding information to build
our gene network. We use linkage analysis to identify the eQTLs for the expression variation.
As we assume that the causal gene (the responsible gene for the expression variation) resides
in the eQTLs, our objective is to figure out which gene in the eQTLs is the real cause. A
network based stochastic algorithm is designed to infer the causal genes and identify the
underlying regulatory pathways.
We quantitatively verified our method by a test using data generated by yeast knock-out
experiments. When certain genes are knocked out, this will result in perturbing expression of a
set of genes. This is a similar situation to the naturally occurring expression variation.
However, in this case, we know exactly the cause for the variation, i.e., the knocked out gene.
So, it is good for testing purpose. We simply simulate an “eQTL region” around the knocked
out gene and test if we can identify which gene has been knocked out. Over 40% of inferred
causal genes are correct, which is significantly better than 10% by random guessing. We then
14
1.2.2
1.2.3
applied our method to a recent genome-wide expression variation study in yeast. We show that
our method can correctly identify the causal genes and effectively output experimentally
verified pathways. New potential gene regulatory pathways are generated and presented as a
global network.
Transcriptional modules and their dynamics
Extensive studies have shown that transcriptome has modular structures and many researchers
claim that understanding the modularity of transcriptome regulation could be the first step
towards understanding the transcriptional regulation system. We developed a new approach of
deriving transcriptional modules using gene expression profiles and linkage analysis.
Different from most of previous approaches, we don’t need information on TF-DNA binding
information, which is usually not complete and rather condition dependent. We show that most
derived modules are coherent in protein function, highly correlated at expression levels, and
enriched for specific transcription factors. However, rather loose within module connections
are observed when these modules are projected onto the protein physical interaction network.
The feature of loose connections within the interaction network is clearly different from what
has been observed in the heat shock response module. As the original experiments were
performed on two normal yeast strains, our modules are mainly composed of naturally
diverged genes. Therefore, we hypothesize that transcriptome can undertake rather dynamic
changes if important interactions are kept intact. We also collected other supporting data and
conclude that transcription modules are highly plastic and robust.
Protein essentiality
Knowing which protein is essential for an organism is an important problem as essential
15
1.2.4
proteins could very likely cause severe disease when mutated. To figure out which protein is
essential will require understanding of the relationship between protein essentiality and other
genomic information. To achieve this, we performed a comprehensive analysis on many
potential contributing factors for protein essentiality including protein function classification
based on Gene Ontology (GO), protein-protein interactions, protein conservation, domains,
cellular locations, and protein complexes. Most of these factors were found to correlate with
protein essentiality. We developed a new kernel-based probabilistic model to predict protein
essentiality by selecting and integrating these information sources. The ROC (Receiver
Operating Characteristic) score can reach 0.87 using this model. Our findings give us a better
understanding of the contributing factors for the protein essentiality. The prediction method
can be applied to other organisms to predict essential proteins.
Disease gene classification
Several studies have compared various features of heritable disease genes with other so called
non-disease genes, but they have yielded conflicting results. A potential problem in those
studies is that the non-disease genes contained a large number of essential genes -- genes
which are indispensable for humans to survive and reproduce. Since a functional disruption of
an essential gene has fatal consequences, it is more reasonable to regard essential genes as
extremely severe “disease” genes. Here we perform a comparative study on the features of
human essential, disease, and other genes.
In the absence of a set of well defined human essential genes, we consider a set of 1,789
ubiquitously expressed human genes (UEHGs), also known as housekeeping genes, as an
approximation. We demonstrate that UEHGs are very likely to contain a large proportion of
16
1.3
essential genes. We show that the UEHGs, disease genes and other genes are different in their
evolutionary conservation rates, DNA coding lengths, gene functions, etc. Our findings
systematically confirm that disease genes have an intermediate essentiality which is less than
housekeeping genes but greater than other human genes. The human genome may contain
thousands of essential genes having features which differ significantly from disease and other
genes. We propose to classify them as a unique group for comparisons of disease genes with
non-disease genes. This new way of classification and comparison enables us to have a
clearer understanding of disease genes.
Conclusions
There are a plethora of approaches which integrate multiple different data sources to study
biological systems, and our methods are just tiny drops of water in the great ocean. By no
means do we claim that we fully know how to study the system nor how well we understand it.
We want to show from these studies that the integrative approach can indeed help us better
utilize data, design more realistic system models, and achieve what would otherwise be
impossible by a single data source.
It will be our greatest pleasure if by reading our following chapters, that our colleagues can get
some hints on how to pursue their own interest in systems biology. In one sentence to conclude
this chapter, we want to say that “the future is systems biology”.
17
Chapter II
Causal gene inference and pathway identification
18
2.1 Introduction
Gene expression variation has been observed in human, yeast and other organisms (Brem, et
al., 2002; Morley, et al., 2004; Turk, et al., 2004). By linkage analysis, the variation of gene
expression can often be explained by the variation of DNA sequences on chromosomes. Great
interest has arisen in finding the causal genes and the mechanisms which account for the
expression variation (Friedman, et al., 2000; Brem, et al., 2002; Yvert, et al., 2003; Bing and
Hoeschele, 2005; Li, et al., 2005; Schadt, et al., 2005; Li, et al., 2006).
In these studies, the expression level is treated as a quantitative trait and the genetic loci linked
to the trait are usually termed as eQTL (expression quantitative trait loci). Determining the
eQTL, however, doesn’t answer which genes are the causal genes for the expression variation
since tens or even more genes can be contained in the eQTL. Although further fine mapping
will reduce the confidence interval of the eQTL, it is both time consuming and laborious.
Other factors, such as high linkage disequilibrium could make fine mapping less powerful.
Even when the number of candidate genes is reduced to a manageable number, the underlying
mechanism by which the regulation is enforced remains unknown.
Inferring the causal genes is challenging but very important in disease studies (Schadt, et al.,
2005). A simple method based on expression correlation was proposed but without solid
verification, and a large amount of potentially useful information was ignored (Bing and
Hoeschele, 2005). Gene expression regulation is traditionally divided into cis-regulation and
trans-regulation. If the eQTL is close to the target gene itself (cis-regulation), then the DNA
variation is most likely to effect the transcription regulatory regions of the gene, such as the
promoter, enhancer, etc. In trans-regulation, the eQTL is far away from the target gene. In this
19
case, the causal genes can be transcription factors (TFs) which regulate the target gene or
genes which affect the activity of the TFs. We are particularly interested in trans-regulation as
cis-regulation is relatively trivial to identify. For eQTL containing a TF which binds to the
promoter region of the target gene, the TF is a good candidate for the causal gene. However, in
many cases, the eQTLs do not contain any TFs (Brem, et al., 2002; Yvert, et al., 2003). An
alternative factor therefore must exist and need to be identified. One possible mechanism by
which this alternative factor regulates the target gene expression is by regulating the TFs.
Through protein-protein interaction and other mechanisms such as phosphorylation, the signal
is conveyed from this alternative factor to the TF and eventually alters the expression of the
target gene. Such (signaling) pathways have been widely found in multiple biological
processes and are considered to be one of the most fundamental gene expression regulatory
mechanisms in biological systems (Ogawa, et al., 2000; Yoshimoto, et al., 2002; Dodge-Kafka,
et al., 2005).
“Pathway” is a frequently used term in recent publications, but with rather different meanings
(Steffen, et al., 2002; Nelander, et al., 2005). We define a pathway as a set of both directionally
and un-directionally connected proteins which contains at least one TF at one end. (We don’t
distinguish between a gene and its protein product and they are used interchangeably
throughout the manuscript.) Both the proteins involved and the topologies are considered as
pathway components. Slightly different from (Steffen, et al., 2002), we don’t require the
pathway to be strictly linear (i.e., we allow network structures) to make more realistic
modeling. A conceptual pathway is shown in Figure 2.1.
Fig. 2.1. A conceptual gene regulatory pathway. (a). Genes involved in the pathway are
shown as circles (A,B,C,D,T1 and T2). B represents a kinase which activates downstream
protein C by phosphorylation. T1 and T2 are transcription factors and T1 positively
regulates T2’s expression. T2 binds to the promoter region of the target gene and activates
its expression. Edges without arrow indicate protein-protein interactions and edges with
arrow imply the transcriptional regulations or phosphorylations. (b) Gene B on the
pathway is deactivated. The expression of the target gene is down-regulated as a
consequence. We don’t require pathways to be strictly linear so that indispensable
components of the pathway (e.g., D) can be included.
Although related to the research on “transcription regulatory network” which aims at inferring
the regulatory relationship among transcription factors (Friedman, et al., 2000; Basso, et al.,
2005; Rogers and Girolami, 2005; Xing and van der Laan, 2005), our work is quite different.
On the surface, the difference appears as we consider the whole gene network (protein-protein
interaction, protein phosphorylation, and TF-DNA binding) instead of only TFs. Down to the
detail, our interests are not in identifying neighbor-to-neighbor regulations. Instead, we are
identifying pathways linking causal genes and target genes to explain the regulatory
relationships between them. In Figure 2.1, a link between gene A and gene B doesn’t indicate
that A regulates B’s expression, which is usually the case for transcription regulatory network
20
21
inference. Here, it stands for that protein A affects the expression of the target gene by
interacting with protein B. Despite these differences between the pathway and “transcription
regulatory network”, connections do exist as the transcription regulation could be part of the
pathway (or even the full pathway in some cases). We’ll give more details of the differences
and connections between our method and previous approaches in the Methods section.
Several approaches have been proposed to systematically identify function modules, pathways
and motifs in the biological system (Giaever, et al., 2002; Ideker, et al., 2002; Yeger-Lotem, et
al., 2004; Qi, et al., 2005). Algorithms are designed specifically for pathway identification
(Ficarro, et al., 2002; Steffen, et al., 2002). Although these algorithms can successfully find
known pathways, huge numbers of other “pathways” are also generated. The high false
positive rate significantly limits its application to solving real biological problems. Another
approach proposed by Yeang et al. is very successful when it is applied to a manually selected
sub-network. However, as the algorithm requires large amounts of perturbation data, it is
much less competent when applied to genome-wide analysis (Yeang, et al., 2004; Yeang, et al.,
2005). Rather than finding all the “possible” pathways, we try to locate the functioning ones
which can be revealed by analyzing experimental data.
We first designed a test to verify our method quantitatively. The Rosetta compendium data set
(Hughes, et al., 2000) was used for this purpose which interrogated expression profiles of 276
deletion mutants. We show that over 40% of the inferred causal genes are correct, which is
more than 4 times better compared with 10% by random guessing. We then applied our
method to a recent genome wide expression variation study in yeast (Brem, et al., 2005). We
demonstrate that experimentally verified causal genes and pathways can be correctly inferred,
and we also propose new potential pathways.
22
2.2 Methods
An overview of our multi-step procedure is shown in Figure 2.2. For a target gene, the
procedure identifies the eQTL by linkage analysis using expression profile and genotype
information. This generates a list of genes which contains the real causal gene for the target
gene expression variation. The gene network is compiled by integrating protein-protein
physical interactions, protein phosphorylations and TF-DNA binding information. As the
kernel of the whole process, we designed a network based stochastic inference algorithm to
identify the most likely causal genes in the eQTL and the underlying pathway.
Fig. 2.2. Overview of our multi-step procedure for causal gene identification and gene
regulatory pathway inference.
23
2.2.1 Basic assumptions
Given the target gene, the list of candidate causal genes, gene expression profiles and the
network, we infer the most likely causal genes and the underlying pathways. Two assumptions
are made. First, since our focus is on trans-regulation, we assume that the causal gene
regulates the target gene by affecting the activities of the TF(s) for the target gene through a
pathway. This assumption holds for most known cellular pathways. Although other regulatory
mechanisms do exist, we don’t explicitly consider them in this study. Second, we assume that
the activities of genes on the pathway correlate with the target gene’s expression. The idea is
illustrated by Figure 2.1(b). When a gene on the pathway is deactivated (e.g., knocked out), the
expression of the target gene is either down-regulated if the deactivated gene has a positive
effect or up-regulated otherwise. Since the activity of gene product is hard to measure directly,
we use gene’s expression level to approximate it. Clearly, this approximation could be violated
as protein activity is also regulated by post-translational regulations such as phosphorylation.
However, such an approximation is still widely in use and certain successes have been
reported (Stuart, et al., 2003). We will revisit this issue in the discussion section. Zien et al.
found that genes on the same pathway had higher “synchrony” in their expressions and this
supports our second assumption (Zien, et al., 2000).
2.2.2 Searching the network
Based on the assumptions described above, the problem can be rephrased as finding the
pathway which starts from the causal gene and ends at the TFs regulating the target gene so
that the expression of the genes on the pathway are correlated with the target gene. We
designed a network based stochastic backward searching algorithm to solve the problem. The
stochastic model is chosen over deterministic ones due to two reasons. First, the biological
system itself can be modeled as a stochastic network with various interactions occurring with
different probabilities. It is natural to design an algorithm which acknowledges the
uncertainties in the system. Second, deterministic algorithms require the pathway length to be
determined in advance, and the length cannot be too large due to high computation complexity.
They also require the pathway be strictly linear (Ficarro, et al., 2002; Steffen, et al., 2002). All
these issues can be avoided with a stochastic algorithm.
The basic idea of our algorithm can be described as follows. We start from a TF and initiate a
“walk” by following edges in the network. Decisions on what edges to take depend on gene
expression profile in a non-deterministic fashion. Genes in the eQTL will be visited with
different frequencies. The genes with higher frequencies are more likely to be the causal genes
and the most frequently traveled paths are regarded as the underlying regulatory pathways. We
formalize the algorithm as follows.
For a target gene , the set of transcription factors binding to it are denoted as ,
and the candidate causal genes in the eQTL regions are denoted as . The
gene network is represented as a graph , in which the protein-protein interactions are
represented as undirected edges while protein phosphorylation and TF-DNA bindings are
represented as directed edges. For each , we start a stochastic search procedure as
shown in Figure 2.3.
t g 1 ( ,..., ) tgn Tt t =
1 ( ,..., ) tm gc c Cg g =
G
t k g t T ∈
24
k
We denote all the neighbors of a particular gene in the gene network as , so
that , where represents a directed edge from to a . Starting from ,
we estimate for each the likelihood that is the cause for the expression
() Nei ⋅
() ba bNeia e ∈⇔ ∈G ba e b k t
() igNeit ∈ i g
Fig. 2.3. The flow diagram of the stochastic searching algorithm.
variation of the target gene . Based on our second assumption, we estimate such a causal
effect by the absolute value of the Pearson correlation coefficient of the expressions between
and , denoted as . Intuitively, a gene showing strong expression correlation with
the target gene has a higher probability of being involved in the pathway. However, as not all
the genes on the pathway necessarily correlate with the target gene due to other
post-translational regulation mechanisms, we give non-correlated genes a residual chance for
being on the regulatory pathway by defining the causal effect of with respect to as
, where is the residual causal effect a non-correlated
gene could have upon .
t g
i g t g (, ) | it gg ρ |
ε 1
i g t g
( , ) max{| ( , ) |, } it it gg gg ξ ρ = 0 ε <<
t g
25
26
j
We denote a path as , where are nodes in the graph and cycles are not
allowed in the path, i.e., for any on the path. To ensure paths are acyclic, a set
is introduced which contains only unvisited genes. We stochastically select
and transit from to . The transition probability is determined by equation (2.1). Based
on this transition probability, unvisited neighbor genes with greater causal effect will have a
higher chance of being visited next. The chosen gene will be removed from thereafter.
01 ( , ,..., ) z Pg g g 01 , ,..., z gg g
ig g ≠ ,i j gg
U () k iNeit g ∈ ∩ U
k t i g
U
()
(, )
Pr{ | , ( ) }
(, )
sk
it
ik i k
s t
gNeit
gg
gtg Neit
gg
ξ
ξ
∈∩
∈∩=
∑
U
U
. (2.1)
After we arrive at , the same procedure is repeated. We select based on a
similar transition probability described by equation (2.2).
i g () i i gNeig
′
∈ ∩ U
()
(, )
Pr{ | , ( ) }
(, )
si
it
iii i
s t
gNeig
gg
ggg Neig
gg
ξ
ξ
∈∩
′
′′
∈∩=
∑
U
U
. (2.2)
By noticing that we always calculate the causal effect of a gene with respect to , it is clear
that our procedure is different from most transcription regulatory network inference algorithm.
In our procedure, the objective is not to identify the relationship between connected genes (i.e.,
and ), but to find connected genes which are likely to be the cause for the expression
variation of the target gene .
i g t g
i g i g
′
t g
The above procedure stops when it reaches any gene or when it enters a dead end
(i.e., ). We also set an upper bound for the total transitions allowed to ensure a
stop. The upper bound is chosen to be unrealistically high for any known pathway and is
t i g g C ∈
() i Nei g∩U=∅
different from the path length in those deterministic pathway finding algorithms. Suppose we
stop at after one round of the procedure, the path can be written as .
The causal effect of on through can be calculated by (2.3). Here, we
assume that the causal effects of nodes on the pathway are independent from each other. This
assumption may not hold in reality. However, considering interactions among genes on the
pathway will make the problem too complex and we do not consider them in this study.
t c g g C ∈ ( ,..., ,..., ) ki c Pt g g
c g t g ( ,..., ,..., ) ki c Pt g g
( , , ( ,..., )) ( , ) ... ( , ) ck k c k t c t pg t Pt g t g g g ξξ =⋅⋅
. (2.3)
As equation (2.3) measures the causal effect of on with respect to a specific potential
pathway, the general causal effect of considering the whole gene network can be estimated
by equation (2.4), where denotes all the paths starting from and ending at .
c g t g
c g
c
k
g
t
P k t c g
( , ) ( , , ( ,..., ,..., ))
gc
tk
ck ck k i c
P
pg t pg t Pt g g =
∑
. (2.4)
To calculate , each gene is associated with a counter to record the
times it is been visited. We iterate the whole procedure N times and is set to be large
enough so that (2.5) can be approximated, where denotes the visit times for .
(, ) ck pg t i g ∈G ( ) kt i V g
N
( ) kt c Vg t c g g C ∈
. (2.5)
lim ( )/ ( , )
ktc ck
N
Vg N pgt
→+∞
=
If the target gene has more than one TF, we assign each TF a weight based on their causal
effect on the target gene and linearly combine them as shown by (2.6). The probability that
is the causal gene in the eQTL considering all the TFs for the target gene is estimated by
(2.7).
c g t g
27
1
1
(, ) ( )
()
(, )
k
m
kt t c
k
Tc
m
kt
k
tgV g
Vg
tg
ξ
ξ
=
=
=
∑
∑
. (2.6)
n () : ( )
(, )
()
Pr( )
() ( , )
st st
c
c
k
Tc
k
c
s
Ts s
k
gCg sgCg k
pg t
Vg
g
Vg pg t
∈∈
==
∑
∑ ∑∑
. (2.7)
Since we assume There is only one causal gene in each eQTL, the gene with the largest
posterior probability is reported as the cause as shown by (2.8).
n argmaxPr( )
sgt
c
gC
g
∗
∈
= sg
. (2.8)
To identify the underlying pathway, we start from and trace backwards. We find from
the gene with the largest visit count and move to that gene (not stochastically). We
repeat until we arrive at . By this way, we find the most probable pathway which links
and . The linear pathway generated by this approach is mainly for simplicity consideration.
As indicated by (2.4), there could be multiple paths (flows) connecting and , and all of
them contribute to the causal effect of .
c g
∗
1
( ) c Nei g
− ∗
k t c g
∗
k t
c g
∗
k t
c g
∗
2.2.3 Select subset of conditions
It is well known that TFs only actively regulate their target genes under specific conditions
(Ihmels, et al., 2002; Stuart, et al., 2003; Harbison, et al., 2004). It will therefore be beneficial
to infer the pathway under these conditions to exclude the noise introduced by non-relevant
conditions. To achieve this goal, we implemented two different methods.
28
First, we follow the signature algorithm developed by Ihmels to select an appropriate subset of
conditions (Ihmels, et al., 2002). Suppose the expression levels of gene are measured
under conditions in the original data set, denoted as . A condition m is
selected if it satisfies equation (2.9), where
t g
M
1
,..., t t
M
g g O O
t g O is the average expression level and is the
standard deviation. We empirically choose equal to 1 to ensure both sufficient variation and
a high enough number of included conditions.
t g σ
τ
t
t
t
m
g
g
g
OO
τ
σ
−
>
. (2.9)
The subset of conditions is then used to calculate the causal effect of on . Conditional on
the selected conditions, we search the gene network to find the pathways as described in 2.2.2.
i g t g
As our second method, we designed a sampling scheme. Suppose conditions are
sampled without replacement and denoted as . We re-calculate the correlation coefficient
using conditions covered by . is considered a valid choice of subset if ,
where is a pre-determined threshold for correlation. To make the selection robust and not
sensitive to one sample, we repeat the sampling multiple times until we obtain valid subsets
of conditions. The r subsets of conditions ( matrix) is then used to calculate the
expected causal effect of on using equation (2.10) and all the previous equations
concerning need to be updated accordingly.
( ll M <)
u s
u s u s (, ) u tk gt ξτ ′ > s
τ′
r
r l ×
i g t g
(, ) i t gg ξ
1
1
(, ) ( , )
u
r
ik i k
u
gt gt
r
ξ ξ
=
=
∑
s s
. (2.10)
It is obvious that the first method is computationally efficient compared to the second one.
29
However, this method can be heavily affected by outliers and conditions cannot be “tuned” for
specific TFs. Although the second method is much more time consuming and could fail either
because such conditions do not exist or due to the extremely large sample space, it is generally
more robust and will be much less affected by outliers.
2.2.4 Significance measurement
It is essential to test the reliability of the inferences by the above approaches. Erroneous
inferences could be caused purely by false positive interactions in the gene network and the
noisy expression data. Therefore, for which satisfies (2.8), the significance of the findings
needs to be evaluated. As the main source of error comes from the network topology and gene
expression, we permute the gene network while preserving the degree of each node similar to
(Milo, et al., 2002). Since the network topology is randomized, the pathway-wise expressions
are permuted accordingly too. For each permutation, we perform the same procedure and
obtain one . By repeating the permutation many times and ordering all the ,
we can calculate an empirical P-value for the . P-values less than 0.05 are considered
as significant and those will be reported as valid findings.
c g
∗
( ) T c Vg
∗
′
() T c Vg
∗
′
( ) Tc Vg
∗
s c g
∗
30
2.3 Results
2.3.1 Data collection
The Rosetta compendium data (Hughes, et al., 2000) is used to verify our method. 276 genes
were deleted and each deletion mutant’s expression profile was measured using microarrays.
To build the gene network, the protein-protein interaction data was obtained from a previous
compiled set by (Steffen, et al., 2002), combined with protein physical interactions deposited
31
in MIPS (Munich Information center for Protein Sequences). TF-DNA binding data was
obtained from (Harbison, et al., 2004) where 203 TFs were tested for their binding profiles in
yeast. We chose P<0.001 as the threshold for positive binding as used by the original authors.
The genome wide phosphorylation information was obtained from (Ptacek, et al., 2005) which
identified over 4,000 phosphorylation events. After compiling all three types of interactions,
the gene network covered 4,744 genes and contained more than 10,500 edges. Our main
experiment is based on a recent genome-wide study on expression variation by crossing two
yeast strains (Brem, et al., 2002; Yvert, et al., 2003; Brem, et al., 2005). 112 segregants were
individually genotyped at 2,956 marker positions and 6,228 gene expressions were measured
for each segregant. Since both the genotype and expression of each gene are known, these data
are excellent for this study.
2.3.2 Testing with knock-out data
In order to quantitatively measure the performance of our method, we designed a test using
Rosetta compendium knock-out data. The main reason to use the knock-out data set is because
we know which gene was deleted. As the true cause for the gene expression variation is known,
we are able to test the accuracy of the inferences. Although the original experiments are not
related to eQTL mapping, we can easily define regions around the deleted genes so the
problem will be the same as what we are trying to solve. The major steps of the test are
described as follows.
(1) For a deletion mutation experiment, we identify the genes whose expressions are
significantly perturbed. We further identify the common TFs for these significantly perturbed
genes. Only genes regulated by the common TFs are considered as valid target genes and are
32
used for the later inferences.
(2) We simulate an eQTL region around the deleted gene to let the pretended eQTL contain 10
genes. These 10 genes (the deleted gene and the surrounding 9 genes) are positioned
consecutively on the chromosome. The position of the deleted gene is randomly set to be from
1 to 10.
(3) We pretend that the real causal gene (in this case, the deleted gene) is unknown and try to
identify it from the ten genes.
(4) The overall prediction accuracy is calculated. As random guessing will give 10% correct
identifications in expectation, higher accuracy is expected if the method actually works.
For our method to work, it is essential to ensure that the “differentially” expressed genes are
really caused by deletion mutation instead of by noise. Obviously, a target gene whose
expression is perturbed by random events won’t lead us to any meaningful findings regardless
of the method. Hughes et al. designed a gene-specific error model to compensate for the
differences in the variation of transcript (Hughes, et al., 2000). Based on this error model,
more than half of the deletion mutation experiments didn’t show significant changes in
expression profiles and are excluded from our test. 118 knock out experiments contain at least
2 genes with 3 fold changes with P-value less than 0.01. The number of perturbed genes varied
significantly among these experiments (from 2 to several hundred). We further required that
the target genes should share common TFs. By only considering genes that could be clustered
by common TFs, we are more confident in believing that their expression variation is caused
by the knockout instead of by chance.
We developed a simple voting scheme to consider multiple perturbed genes for each knockout
experiment. The genes obtaining the most votes are reported as causal genes. 17/36 valid
predictions are correct (exactly match the deleted gene) using the first condition selection
method and 16/35 are correct using the second condition sampling method. The accuracy rates
(47% and 46%) are more than 4 times better than what would be expected by random guessing.
When the eQTL region is set to contain 20 genes, 15 out of 48 predictions are correct by the
first method and 15 out of 44 are correct using the second method. The accuracy rates (31%
and 34%) are more than six times better compared with random guessing (5%). The correct
prediction only decreases by two/one when the number of genes in eQTL doubles. This
indicates that our method is quite robust and relatively insensitive to the number of genes in
eQTL (details are provided in the Appendix A). The good performance suggests that our
approach can indeed extract useful information from multiple data sources and generate valid
hypothesis. We then applied our method to a recent genome-wide study on expression
variation in yeast where the causal genes are generally not known (Brem, et al., 2005).
2.3.3 eQTL mapping
We performed Wilcoxon ranksum tests to examine the association between each
gene’s expression level and each marker as in (Brem, et al., 2002; Bing and Hoeschele, 2005).
We only considered genes whose expression variation could be significantly linked to exactly
one locus on yeast genome (P-value<10-5) and the false discovery rate (FDR) was estimated
to be 0.005 using methods from (Storey and Tibshirani, 2003). This gave us a list of 1,226
genes. Based on these genes, we performed bootstrap to infer the 95% confidence interval
similarly as (Bing and Hoeschele, 2005). A small fraction of genes failed to generate valid
confidence intervals and were excluded for further consideration. Finally, we obtained a list of
6,228 2,956 ×
33
1,085 genes. The length of the confidence intervals ranges from 781bps to 141Kbps, the mean
and the standard deviation are 35Kbps and 28Kbps, respectively. The number of genes within
each interval ranges from 1 to 62, and the mean and standard deviation are 16.8 and 15.3,
respectively.
2.3.4 Causal gene inference
For the genes in the above list, we applied our algorithm to infer the causal gene in each eQTL.
To identify the TF that really involves in the pathway, we require that TF displays a strong
correlation with the target gene based on our sub-condition sampling scheme. Clearly, this
criterion may be too strong for some TFs and still not sufficient for others. However, the
assumption that stronger correlation implies higher probability of regulation could be valid in
general. For the 1,085 genes with valid eQTL regions, 585 genes have in total 1,403 highly
correlated TFs ( >0.5). For these 585 genes, we inferred the causal genes and
measured the significance for them. As described in 2.3, two methods were used to select
appropriate subset of conditions. These two methods generated quite similar outputs and we
only present the results generated by the second method. 239 inferences have P-value <0.05
and they are reported in supplementary files. The underlying pathways were inferred and are
shown in Figure 2.4, drawn by Cytoscape (Shannon, et al., 2003).
|( , )| ik gt ρ
34
Fig. 2.4. The global view of the inferred regulatory network. List of all the pathways is
provided in the supporting materials.
Here, we describe two examples which are well supported by experimental data and previous
studies. As the first example, the target gene is PRP39, a component of RNA splicing factor
U1 small nuclear ribonucleoprotein polypeptide (Winzeler, et al., 1999). There is no report on
its expression regulatory mechanism by SGD (Saccharomyces Genome Database) (Cherry,
et al., 1997). Based on the linkage analysis, variation of PRP39’s expression can be
significantly linked to a locus on chromosome VIII (P-value is 1e-7.3). The 95% confidence
35
36
interval contains three genes (NEM1, GPA1 and MRS11). From chromatin (Ch)
immunoprecipitation (IP) experiments, two TFs (DIG1 and STE12) bind to the promoter
region of PRP39 gene. For each TF, we first sample a subset of conditions so the TF correlates
with PRP39 in their expressions. Conditional on these conditions (Table 2.1), we
stochastically search paths which connect TF with the candidate causal genes so that the nodes
on the path show significant correlation with PRP39. Two TFs report the same gene (GPA1) as
the causal gene as it has the highest probability (0.975) among the three genes with P-value
<0.05 by permutation test. The pathways identified by each TF are also consistent. There are
quite a few genes (e.g., FAR1, FUS1, etc.) having the same inferred causal gene and pathway.
Many of them are known to be involved in pheromone signaling pathway (Giaever, et al.,
2002). By comparing the pathway we found (Figure 2.5) with the known pheromone pathway,
large fraction of proteins are matched and arranged in a correct order. To further verify that
GPA1 was indeed the cause for the downstream gene expression variation, Yvert et al.
performed experiment by making a point mutation on GPA1 in one of the yeast strain and
observed that those downstream genes displayed altered expression levels as expected (Yvert,
et al., 2003). Here, we show our method can correctly infer the right causal gene and derive the
underlying pathway without any prior knowledge of the corresponding pathway.
Table 2.1. Pathway-wise increase of the correlation when appropriate subset of conditions is
selected.
Target Gene (PRP39)
Genes on
the path
No sub-condition
sampling
*Condition on >0.4 *Condition on >0.5 *Condition on
>0.6
DIG1 0.27 0.45 0.52 0.62
FUS3 0.50 0.55 0.55 0.60
FAR1 0.49 0.60 0.62 0.64
STE4 0.37 0.52 0.53 0.53
GPA1 0.53 0.61 0.61 0.62
*When a specific condition is set, we require the TF, in this case DIG1 will have a correlation
coefficient with the target gene (PRP39) at least or above the threshold under the selected conditions.
In Table 2.1, we list the primary nodes on the above pathway and their expression correlations
with the target gene. We show both the correlations calculated without selecting a subset of
conditions and correlations calculated with subset conditions sampled. The conditions are
sampled based on different thresholds and it is clear that as the thresholds increase, the
correlations increase accordingly in a pathway-wise manner. This supports the validity of the
pathway from a different aspect.
Figure. 2.5. An example of the inferred causal gene and the pathway. Edges without
arrows are protein-protein interactions. Edges with arrows represent phosphorylation or
TF-DNA binding. The causal gene is correctly inferred in this example and proteins
involved in the pathway are highlighted in colors. Only nodes been visited more
frequently than GPA1 and have at least two interactions with primary pathway nodes are
shown.
We take G1/S phase transition pathway as our second example. In this example, we identified
a group of genes which were reported to be regulated by CDC28. CDC28 is a catalytic subunit
of the main cell cycle cyclin-dependent kinase (CDK). The pathways form a complex network
37
and are shown in Figure 2.6. We list the Gene Ontology annotations of the proteins involved in
the pathways in Table 2.2. It is clear that most genes we inferred are indeed related to the
mitotic cell cycle and most interactions and regulations are supported by previous studies.
Figure. 2.6. Inferred pathways related to G1/S phase transition. The arrow only
represents the direction of the causal relationship and doesn’t stand for phosphorylation or
TF-DNA interaction.
Deriving complicated networks such as the one shown in Figure 2.6 could take years by
biologists using traditional biology experimental methods. Here, we show that it can be easily
obtained by computational methods by integrating multiple sources of high-throughput data.
Although computational approach cannot replace the traditional experiments, they do generate
valid and testable hypothesis which can help biologists to be more productive.
38
39
Table 2.2. Function annotation of the proteins involved in the G1/S phase transition.
Descriptions are obtained from GO and SGD database or from referred studies.
Gene Function
Causal gene
CDC28
Cyclin-dependent protein kinase, involved in G1/S and G2/M transition
Pathway genes
PHO2
Involved in phosphate metabolism, could interact with CDC28 (Giaever, et
al., 2002)
SWI5
G1 specific TF, bind cooperatively to HO promoter with PHO2 (Bhoite and
Stillman, 1998)
CLN2
G1 cyclin, activates CDC28 to promote the G1/S phase transition
SWI4/SWI6/MBP1
Form complexes, regulate transcription at the G1/S transition
CAK1
Kinase, activates CDC28
RNH202
Ribonuclease H activity
FKH1
Regulates the cell cycle and pseudohyphal growth
Target genes
NIS1
Possibly involved in a mitotic signaling network
FAA3
Long chain fatty acyl-CoA synthetase
YPL158C
Regulated by SWI5 (Doolin, et al., 2001), unknown function
BUD9
Involved in bud-site selection
PSA1
Cell wall biosynthesis, required for cell wall structure
DSE1/DSE2
Cell wall organization and biogenesis
SIL1
Molecular function unknown
2.4 Discussion
We developed a novel approach to estimate the probability for genes in the eQTL of being the
causal gene for the target gene expression variation. We show that the causal gene inference
problem can be combined with pathway finding problem to achieve a unified solution.
Traditionally, genetic studies can only locate a region on chromosome which likely contains
the causal gene. Our approach digs deep into the biological system to explore the underlying
mechanism. We model biological system as a stochastic network of interactions and
regulations, the causal effect is explained by the pathways which link the genes in the eQTL
and the TFs which potentially regulate the target genes. The eQTL information plays an
important role in significantly reducing the number of possible pathways need be considered
while pathway identification ultimately helps to answer which gene is the causal gene.
40
Our methods rely on the network built on protein-protein interaction, TF-DNA binding, and
protein phosphorylation. The advantage of this is that the generated pathways can have direct
experimental supports. However, none of the above data is either complete or completely
accurate (von Mering, et al., 2002; Deng, et al., 2003). Therefore, important pathways may be
missed due to incompleteness of the data and causal genes may be erroneously inferred if it is
derived from the inaccurate part of the data. Although we expect to see more abundant and
accurate data available in the future, a robust method minimally affected by the data
imperfectness is always desired. Compared with deterministic approaches, our method has an
inherent stochastic component which makes it resistant to some errors. We intend to further
test the robustness of our methods in future work.
As described in the method section, we assume that genes on the pathway will have higher
expression correlation with the target genes. This is clearly true for the pheromone pathway
which we presented as an example. Moreover, our quantitative test on yeast knock-out
experiments indicates this assumption holds for many cases. However, as the biological
system is very sophisticated, we don’t expect such simple assumption holds for all the cases.
Much deeper understanding of the biological regulatory mechanism is needed for a more
realistic modeling and we’ll improve that in the future.
Gene expression level changes are found common to many diseases such as cancers (Bals and
Jany, 2001; van 't Veer, et al., 2002; Hauser, et al., 2003). Therefore, it will be very interesting
to explore on extending our methods to disease causal gene identification (Schadt, et al., 2005).
Once we identify genes whose expression change significantly between healthy individuals
and patients, our approach can be applied to find the genes responsible for these changes.
Although the findings may at large be hypothesis by itself, it will significantly improve our
41
2.5
understanding of the complex disease scenario by providing a global view of the whole
system.
Acknowledgements
We thank Drs. Rachel B. Brem and Leonid Kruglyak for kindly providing us the yeast
genotype data. This work was inspired by collaboration with Huiying Yang from Medical
Genetics Institute at Cedars-Sinai. We also thank Xianghong Jasmine Zhou for her
constructive suggestions. We are very grateful to the researchers who performed all the
experiments to generate the data that were used by this study. We apologize to those whose
works are not cited due to page limit. This research was supported by NIH/NSF joint
mathematical biology initiative DMS-0241102 and by NIH P50 HG 002790.
42
Chapter III
Transcriptional modules and their dynamics
43
3.1 Introduction
Modularity is observed on various scales in all life systems (Ravasz, et al., 2002; Wolf and
Arkin, 2003), including transcriptome (Ihmels, et al., 2002). Understanding how
transcriptional modules are structured and how they perform their functions are fundamentally
important. As increasing number of genome wide expression profiles have been accumulated,
together with the rapid development of several other high-throughput technologies, steady
progresses have been made on revealing the structure of the transcriptome at systems level.
Recently, a great amount of works have been carried out to study transcriptional modules (Lau,
et al., 2001; Ihmels, et al., 2002; Steffen, et al., 2002; Bar-Joseph, et al., 2003; Stuart, et al.,
2003; Tanay, et al., 2004; Leyfer and Weng, 2005; Campillos, et al., 2006; Lemmens, et al.,
2006). The definition of modules varies significantly in all these studies and some are informal
or conceptual. However, for a quantitative analysis of modules, we desire a strict and clear
definition. Here, we define a transcriptional module as a set of genes regulated by a common
transcription factor (TF) or a set of transcription factors (TFs) which participate in a common
regulatory pathway. By this definition, genes regulated by different TFs can belong to one
module and show correlation in their expressions when the regulatory pathway is turned on.
On the other hand, highly correlated expression does not guarantee that genes are in the same
module if different pathways are involved. Identifying transcriptional module is non-trivial
and complicated for the following reasons. First, TFs and their target genes have
many-to-many mappings and such mappings are condition dependent. Second, TFs form a
complex network by regulating each other, and even for the most well studied organisms such
as yeast (Lau, et al., 2001; Harbison, et al., 2004), the regulatory networks are far from being
44
fully understood. Third, modules are arranged hierarchically and can overlap with each other,
thus making clear module identification somehow intractable. Regardless of these difficulties,
significant progresses have been achieved owing to recent rapid development of several
high-throughput technologies such as DNA arrays, ChIP-chip, etc. Although these
technologies are imperfect and suffer from either high false positive rates or high false
negative rates, integrative approaches based on these technologies have been demonstrated
promising for module identification (Tanay, et al., 2004; Lemmens, et al., 2006).
In this paper, we present a new approach of deriving transcriptional module which relies on
gene expression profiling and linkage analysis. Gene expression variation has been observed
in human, yeast and other organisms (Brem, et al., 2002; Morley, et al., 2004; Turk, et al.,
2004). By linkage analysis, the variation of gene expression can often be mapped to certain
chromosomal regions and these regions are normally termed as expressional quantitative trait
locus (eQTLs). It is commonly observed that cluster of genes can link to a common
chromosomal region. In such cases, DNA mutation in the linked region causes expression
level changes of a group of genes. Since the upstream perturbation in the DNA is unique,
differentially expressed genes are believed to participate in the same regulatory pathway. We
based our studies on a recent genome-wide expression variation research in yeast (Brem, et al.,
2002; Yvert, et al., 2003; Brem and Kruglyak, 2005). We demonstrate that modules can be
derived without the need of identifying TFs or the regulatory program among TFs by
clustering sets of genes based on their eQTLs. We show that the identified modules are
enriched for particular geneontology functions, have highly correlated gene expression
profiles based on independent microarray experiments, and are enriched for particular
transcription factors based on ChIP-chip data.
45
Since proteins function by interacting with other molecules, we interpret gene expression
regulation as a mechanism of either endorsing or preventing certain interactions. Therefore, to
our belief, for a holistic understanding of the transcriptome modularity, protein physical
interaction network should be considered. We mapped the above derived modules onto protein
physical interaction network, and observed that, although each module does show functional
homogeneity, they rarely appear to have significant within-module interactions. We then
performed similar analysis on well known module (heat shock module), and the results are
significantly different. Since our derived modules represent naturally diverged transcriptional
regulations, we conclude that transcriptional modules are highly plastic and robust based on
these findings together with several other evidences.
3.2 Results
3.2.1 Gene clustering on eQTL
Currently, the most frequently used methods of deriving transcriptional modules are based on
gene expression, such as co-expression clustering, or TF-DNA binding profile, or integrative
approach using both information (West, et al., 2001; Tanay, et al., 2004; Leyfer and Weng,
2005; Lemmens, et al., 2006). Here we show that eQTL mapping followed by clustering can
naturally generate transcriptional modules with no requirement on knowing what TFs are
involved. We used the data generated from a recent genome-wide study on expression
variation by crossing two yeast strains (Brem, et al., 2002; Yvert, et al., 2003; Brem, et al.,
2005). 112 segregants were individually genotyped at 2,956 marker positions and 6,228 gene
expression levels were measured for each segregant. As both genotype and expression profiles
are known, we can locate eQTLs for each gene by linkage analysis. As shown in Figure 3.1, it
is clear that multiple eQTLs tend to cluster together. A single chromosomal mutation leads to
Figure 3.1. The chromosomal distribution of eQTLs. For each gene, linkage analysis
was performed to obtain the eQTL. Genes are sorted based on the physical positions of the
left end of eQTLs and are displayed as stacks for each chromosome. Each eQTL is
represented by three points on the same horizontal level (left, middle and right).
changes of a group of genes’ expression levels which suggests that regulatory pathways often
affect multiple targets (Ficarro, et al., 2002). There are advantages of clustering eQTLs than
clustering expression profiles. First, gene expression levels measured by microarray are very
noisy. Therefore highly correlated genes can often happen by chance if the threshold is set low.
Compared with microarray technology, genotype data are much more reliable. Second,
although genes in eQTL clusters are normally highly correlated in expression, highly
correlated genes can be mapped to different eQTLs. This can be easily understood by
considering two highly correlated genes. If the genotype configurations are different, these
two genes may link to rather different eQTLs. Thus, it is clear that including the genotype
46
47
information can help filter out those highly correlated genes happened by chance. The details
of clustering eQTLs are described in Methods. 14 clusters containing more than 900 genes are
derived. Clusters containing less than 9 genes are excluded from analysis as significant
statistics usually require large sample size. The list of genes in each cluster is provided in
supplementary materials.
3.2.2 Genes in the same cluster share common functions, have
highly correlated expression profiles, and are enriched of common
TFs
We rely on Gene Ontology (GO) annotation (Ashburner, et al., 2000) to study the function
composition of each module (only biological process is considered). We test if certain GO
functions are enriched within each module. To assess this, we used GO term finder (Boyle, et
al., 2004) which is available online at Saccharomyces Genome Database (SGD). GO term
finder takes a list of gene names as input and outputs GO function terms which are
over-represented in the gene list with p-values indicating the corresponding statistical
significance. We listed the results in Table 3.1. It is clear that all the derived modules have at
least one dominant function while most of them have more than one function. Although GO
annotations are incomplete and may be inaccurate as well, the significant non-random
association of genes within each module suggests that the clustering approach based on eQTLs
is meaningful.
As we assume that genes in a module are regulated by the same TF or same set of TFs, their
expression profiles should be more correlated compared to random selected genes. However,
we cannot test this using Brem’s data as expression information has been used by our
algorithm for linkage analysis of eQTLs. Therefore, we collected three independent yeast
48
Cluster
ID
Number
of genes
GO function Number
of gene
with the
function
p-value Second GO function Number
of gene
with the
function
p-value
1 9 peroxisome
organization and
biogenesis
4 9.13E-8 lipid metabolism 2 9.6E-2
2 21 Translation 8 2.1E-4 Mismatch repair 2 8.2E-4
3 134 ribosome biogenesis 33 5.5E-21 cytokinesis 9 1.4E-5
4 105 amino acid
biosynthesis
30 1.9E-30 positive regulation of
transcription
6 1.4E-4
5 21 conjugation with
cellular fusion
6 4.5E-7 regulation of cellular
metabolism
6 1.4E-4
6 19 transcription from
RNA polymerase II
promoter
4 6.3E-3 Cell wall organization
and biogenesis
3 3.7E-3
7 14 pyrimidine base
biosynthesis
3 1.6E-06 gluconeogenesis 2 6.8E-4
8 44 conjugation 12 1.4E-12 cytokinesis,
completion of
separation
2 3.3E-4
9 83 sterol biosynthesis 18 1.2E-25 electron transport 6 5.7E-7
10 20 serine family amino
acid catabolism |
3 3.7E-7 response to copper
ion
2 6.4E-5
11 226 protein metabolism 112 6.9E-23 transport 34 1.2E-6
12 10 Protein modification 3 3.7E-02 not found - -
13 186 cellular carbohydrate
metabolism
23 4.9E-09 response to
desiccation
4 3.6E-15
14 20 Oxidative
phosphorylation
10 1.8E-17 energy derivation by
oxidation of organic
compounds
4 9.5E-05
Table 3.1. Function analysis of each cluster. The number of genes contained in each module is listed
in column 2. The most significant GO term for the module, number of genes possessing the function
and the p-values are listed in column 3-5. The genes with the most significant function from the
modules are then removed and the secondary most significant GO term is identified, the
corresponding number of genes and p-values are listed in column 6-7. The GO terms with the most
significant p-values or with the greatest coverage if p-values are similar are listed.
expression profiles to test if genes in our modules are more correlated in their expression
levels. The three expression sets are yeast compendium knockout (Hughes, et al., 2000), stress
response (Yoshimoto, et al., 2002), and cell cycle (Cho, et al., 1998; Spellman, et al., 1998). As
illustrated in Figure 3.2, 10 out of 14 clusters show significantly higher correlation in their
expressions than random gene sets for at least two profiles (cut-off at 0.01). Only cluster 1, 5,
7 and 12 do not show such an increase in expression correlation.
Figure 3.2. The p-values for the tests of the hypothesis that expression levels of genes
within each module are more correlated than random gene sets. The color scale represents
the logarithm of the p-value. Three sets of gene expression profiles, CC (cell cycle), KO
(yeast knock out), and SR (stress response), are used to calculate the correlations.
To further test the obtained modules, we studied the transcription factor enrichment for each
module. The TF-DNA binding profile was based on results by (Harbison, et al., 2004). For
most modules (10/14), we found at least one TF being enriched (cut-off at 0.01). Only cluster
1, 7, 11, and 12 do not show TF enrichment. Interestingly, these modules have a large overlap
with the previous 4 clusters which do not show significant high expression correlation. For
several modules, up to 8 TFs appeared significantly enriched for a module.
49
50
3.2.3 Genes in cluster have loose connections
Since modules obtained by clustering eQTLs display strong functional coherence, it is natural
to expect that within-module interactions will be significant, as it is generally believed that
proteins interact with each other to perform similar functions. Several works have applied
such assumption to protein function prediction, and pathway inference, etc. (Deng, et al., 2003;
Stuart, et al., 2003; Vazquez, et al., 2003). To test if such assumption can be extended to our
derived modules, we build the protein interaction network using data from several resources
(see Methods) and map modules onto the network. Surprisingly, for almost all the modules,
none of them show significantly over-represented within module interactions (Table 3.2). To
ensure that our observation is not an error generated from the network compilation or the
mapping procedure, we compare our results with well studied modules in yeast. Heat shock
response and several other stress responses have been studied for years (Yoshimoto, et al.,
2002; Hahn, et al., 2004; Zanton and Pugh, 2004). The module regulated by heat shock
transcription factor (HSF) has been analyzed and we use it for our comparison. We obtained
the list of genes regulated by HSF from two sources (Hahn, et al., 2004; Harbison, et al., 2004).
Both studies report >200 genes bound by HSF. We mapped these ~200 genes onto the protein
interaction network and found significantly over-represented within module interactions
(Table 3.2). This result supports the essentiality of HSF1. There are other examples supporting
that proteins within the same regulatory module may have significant interactions among them.
For example, studies have shown that ribosomal complex proteins are highly correlated in
expression and are regulated in the same pathway (Giaever, et al., 2002; Jansen, et al., 2002;
Santangelo, 2006). Another recent study shows that sub-network formed by signal-induced
genes are significantly different from random networks, and in particular exhibit a high
51
connectivity (Nacher, et al., 2006). Clearly, the modules we derived do not belong to such well
internally connected module category.
Cluster ID Number of genes Number of Interactions Avg. random interactions p-value
1 8/9 0/28 0.02 1.0
2 14/21 0/91 0.08 1.0
3 119/134 8/7021 6.4 0.32
4 76/105 6/2850 2.6 0.09
5 15/21 2/105 0.1 0.007
6 17/19 2/136 0.1 0.01
7 11/14 0/55 0.05 1.0
8 33/44 0/528 0.49 1.0
9 63/83 5/1953 1.8 0.06
10 12/20 0/66 0.06 1.0
11 153/226 11/11628 10.6 0.45
12 9/10 0/36 0.03 1.0
13 133/186 9/8778 8.0 0.39
14 14/20 1/91 0.08 0.07
Heat shock 175/220 75/15225 13.9 <10
-4
Table 3.2. Interactions within each module. Genes in each module are mapped onto the protein
interaction network and within module interactions are identified. In the second column, the first
number represents the number of genes successfully mapped to the protein interaction network. The
second number represents the total number of genes in the module. In the third column, the first
number represents the number of interactions observed, while the second number represents the
possible number of interactions. The last two columns list the average interactions based on
permutation and the p-value for the observed interactions. None of the 14 modules can be regarded as
having significantly over represented within module interactions. On the other hand, the well studied
modules such as heat shock response module does show significant within module interactions.
3.2.4 Shaping forces of transcriptional modules
Why do our derived modules show rather different topological properties compared with heat
shock response module? There are several possible explanations. First, as the original
experiment compared two normal yeast strains (Brem, et al., 2002), our modules are
composed mainly of genes whose expression diverged naturally. It is understandable that such
expression variation may barely touch on essential protein interactions. If gene expression
52
variation by chance interrupts functionally essential protein interactions, one of the yeast
strains will be under severe negative selection as a consequence. However, as the two yeast
strains are normal in their fitness, this is unlikely to be the case. Therefore, our findings
suggest that the transcriptional module can evolve freely as long as the changes of expression
don’t interrupt important protein interactions. Second, by simple reasoning, it could be a
disadvantage for a module to contain proteins having large number of essential interactions
(we refer to this as centralized module). In the case of a centralized module, such as heat shock
response module, a mutation in the regulator could disable the whole module, which could be
devastating to the entire system (e.g., knockout HSF1 is lethal to yeast). Then why should heat
shock response module exist in yeast? One possible reason is that heat shock response is
extremely time critical. When the environmental temperature changes abruptly, yeast has to
reprogram its expression in a fast and highly coordinated manner for survive. Clearly, a
centralized regulatory structure best fits such requirement so that proteins which are essential
for heat shock response can be co-expressed and perform their function together. Here, we can
see how module centralization and decentralization can affect the structure of modules.
Naturally occurred mutations in both TFs and the motif regions of target genes will diminish
the association between TFs and their targets and decentralize modules, while the need for
highly coordinated expression of certain biological processes will demand a centralized
module structure and what we observed in yeast transcriptome is a balance of these two forces.
To further investigate which force takes the dominant role in shaping the transcriptome, we
tested a different set of modules derived by another study using multiple sources of
information (Lemmens, et al., 2006), only 2/63 modules show significant over represented
within module connections (cutoff at 0.01). There are other evidences suggesting that
decentralized modules are more common than centralized modules. For example, studies on
53
transcriptional regulation of yeast complex proteins showed that only a small fraction of the
known multi-protein complexes seems to have subset of their components co-regulated on the
transcriptional level (Simonis, et al., 2004). All these evidences suggest that for most modules,
it is advantageous to have decentralized structure which is beneficial for the robustness of the
system. The simple fact that yeast strains are normal with more than 1,500 genes different in
their expression levels indicates that the transcriptome is plastic and the system is very robust
in tolerating such high plasticity and we believe that decentralized transcriptional module
structure plays a significant role for the robustness.
3.3 Discussion
We are at the beginning stage towards a deeper and more complete understanding of
transcriptome. The main difficulties come from the complex intertwined hierarchical structure
of the regulatory network, the combinatorial regulatory pattern of multiple TFs and the
conditional specific activation/repression of regulatory network. The robustness and plasticity
of the transcription modules imply that not all the regulations have equal importance. As
revealed by systematic knock out experiment (Giaever, et al., 2002), yeast can survive when
about 90%(>180) of known TFs are knocked out individually, including TFs regulating large
number of genes with significant within module interactions such as FKH2, YAP6, etc. This
indicates that large proportion of TF-target gene regulations is not essential when yeast is
grown in rich media. A recent study on the co-evolution of transcription factors and their
targets points out that TFs which activate their targets can lose their targets more frequently
compared with repressors (Hershberg and Margalit, 2006). All these studies imply that many
positive regulations of TFs are nonessential (possibly due to basal gene expression) although
they could be advantageous. However, the existence of these nonessential regulations could be
54
the key buffering mechanism for the flexibility of the regulatory systems, so that rather
dynamic re-programming of the system can be allowed, and potential beneficial protein
interactions can be established and positively selected. Only robust transcriptional system can
tolerate certain disadvantageous regulatory codes which could be necessary for the system to
jump out of a local maximum.
3.4 Conclusions
We introduced a new approach of obtaining transcriptional module using gene expression
profile and linkage analysis. We studied the internal interactions within modules and found
rather loose connections. Based on several other evidences, we conclude that transcriptional
modules are rather plastic and favoring a decentralized structure. As oligo-nucleotide array
technology measuring genome wide human gene expression is already available (Morley, et
al., 2004), and the technology for re-sequencing human genome with under $1,000 is not far
away (Mardis, 2006), it is not difficult to envision that our proposed approach can be applied
to the analysis of human transcriptome in the near future. From lessons learned in yeast, we
anticipate that most modules obtained from healthy individuals will have loose within module
connections too. We also want to points out that for disease gene discovery using expression
profiles, it is more relevant to identify disease related differentially expressed genes rather
than the differentially expressed genes in general sense, as the transcriptome can be so
dynamic for its robustness and most regulation codes could just be “non-coding”.
55
3.5 Material and methods
3.5.1 eQTL mapping
We performed 6,228 ×2,956 Wilcoxon ranksum tests to examine the association between each
gene’s expression level and each marker as in (Brem, et al., 2002; Bing and Hoeschele, 2005).
We only considered genes whose expression variation could be significantly linked to exactly
one locus along the yeast genome (P-value<10
-5
) and the false discovery rate (FDR) was
estimated to be 0.005 using methods from (Storey and Tibshirani, 2003). This gave us a list of
1,226 genes. Based on these genes, we performed bootstrap to infer the 95% confidence
interval for the linked regions similarly as (Bing and Hoeschele, 2005). A small fraction of
genes failed to generate valid confidence intervals and were excluded for further consideration.
Finally, we obtained a list of 1,085 genes. The length of the confidence intervals ranges from
781bps to 141Kbps, the mean and the standard deviation are 35Kbps and 28Kbps, respectively.
The number of genes within each interval ranges from 1 to 62, and the mean and standard
deviation are 16.8 and 15.3, respectively.
3.5.2 eQTL clustering algorithm
1. Initiate a cluster by randomly select one gene and leave all other genes outside of the cluster.
2. Find the center point of the eQTL range of the current cluster, and choose from the remaining
genes whose eQTL is the closet to the cluster center point and also overlaps with the cluster.
If the distance is less than a preset threshold (50Kbps as determined empirically), move this
gene from outside of the cluster to the cluster.
3. Repeat until no genes can be added to the cluster and go to step one to start a new cluster.
As such heuristic approach is affected by initial point selection, minor updating is performed
56
by removing outliers and combining certain clusters. The full list of genes in each cluster is
available in the supplementary materials.
3.5.3 Protein physical interaction network
We obtained protein interaction data from (Steffen, et al., 2002; Ptacek, et al., 2005) combined
with protein physical interactions deposited in MIPS (Munich Information center for Protein
Sequences). The gene network covered 4,744 genes and contained more than 10,500 edges. To
randomize the network, we randomly select two edges, and swap the edges so the degree of
each node won’t change.
3.6 Acknowledgements
We thank Drs. Rachel B. Brem and Leonid Kruglyak for kindly providing us the yeast
genotype data. We are very grateful to the researchers who performed all the experiments to
generate the data that were used by this study. We apologize to those whose works are not cited
due to page limit. This research was supported by NIH/NSF joint mathematical biology
initiative DMS-0241102 and by NIH P50 HG 002790.
57
Chapter IV
Protein essentiality prediction
58
4.1 Introduction
Whole genome inactivation of individual genes makes it possible to globally study the
phenotypic effect of all the genes (Winzeler, et al., 1999; Tong, et al., 2001; Giaever, et al.,
2002). For yeast, about 20% of all the genes have been shown to be essential for the growth in
rich glucose medium. Essential genes are those when mutated will render the cell inviable. It
has been shown that a gene's essentiality correlates with many contributing factors including
protein function and protein connectivity (Jeong, et al., 2001; Giaever, et al., 2002; Jeong, et
al., 2003). Essential genes are more likely to have homologs in other organisms and are less
likely to be duplicated genes (Giaever, et al., 2002).
Understanding the relationship between protein essentiality and other genomic information is
important and challenging. Biological experiments, such as gene knock-out, have been carried
out to study gene essentiality (Winzeler, et al., 1999; Tong, et al., 2001; Giaever, et al., 2002).
However, these experiments are not only time consuming and laborious, but also impossible to
perform on species like the human being. Several computational methods for predicting the
protein essentiality have been proposed and appeared to be quite successful on yeast. Jeong et
al. (2003) studied the correlation of protein connectivity and variation of gene expression with
protein essentiality and designed a simple ranking model for prediction. Chen and Xu (2005)
explored more factors and used advanced machine learning technologies and achieved greater
prediction accuracy. In this research, we further study several other genomic factors which
may correlate with protein essentiality. Besides the well accepted factors, such as protein
function, protein-protein interactions, protein sequence conservation, we also study protein
domains, protein locations, and protein complex components. Our study represents the most
complete list of factors potentially contributing to gene essentiality up to date. In addition, we
use a novel probabilistic model to predict the lethal/viable gene deletion phenotype based on
these contributing factors. This new model combines the idea of Markov Random Field (MRF)
based approach of (Deng, et al., 2003) and the support vector machine (SVM) of (Lanckriet, et
al., 2004) for protein function prediction. The new approach outperforms the MRF and has
similar performance as the SVM approach (Chen and Xu, 2005). Compared with SVM which
put burden on users for selecting the right kernel and searching for optimal parameters, our
new approach is much simpler. By combining various sources of genomic information, very
high prediction accuracy can be achieved. Based on such a probabilistic model, the deletion
phenotype of a gene can be estimated when no direct experimental results are available.
4.2 Methods
Suppose a genome has N proteins P1,…, PN and the essentiality of m proteins are known (m ≤N).
We denote this set of known proteins by T. Our objective is to assign deletion phenotypes to
the unknown proteins based on the essentiality of the known proteins, the interaction networks,
and other features.
4.2.1 Diffusion-Kernel-Based Logistic Regression Model for
Protein Function Prediction
We use the diffusion kernel as defined by (Kondor and Lafferty, 2002) to extract information
from the protein interaction network. The diffusion kernel K is generated by taking the
exponential of a generator matrix H.
(4.1)
, (, ) { }
H
NN ij Kij e
τ
× =
59
Where
60
j
fo
for i ↔
r ij =
1
(, )
0
NN i Hij d ×
⎧
⎪
⎪
⎪
⎪
⎪
=−
⎨
⎪
⎪
⎪
⎪
⎪ ⎩
otherwise
Let Xi = 1 if the i-th protein is essential and Xi = 0 if it is non-essential. A logistic regression
model based on the above diffusion kernel is
:1, : 0,
Pr( 1| )
log ( , ) ( , )
1Pr( 1| )
jj
i
i
jX j i jX j i
Xnetwork
Kij Kij
Xnetwork
αβ β + −
=≠ = ≠
=
=+ +
−=
∑ ∑ (4.2)
We use the network consisting of the known proteins to estimate the parameters. In order to
compare the performance of this new model with SVM, we obtain the GIST (Ver2.1.1)
implementation of SVM programmed by Noble and Pavlidis
(http://microarray.cpmc.columbia.edu/gist/). The parameter τ for the diffusion kernel is
optimized separately for the SVM ( τ = 2) and the logistic regression model ( τ = 0.5).
4.2.2 Integrating and selecting other information and model
validation
A feature can be simple if it contains only one component like conservation, either the protein
is conserved or not conserved. To incorporate simple feature, we add one indicator term to the
above logistic regression model.
For complex feature like protein function, we follow the idea of the kernel method. For
example, we consider 24 functions labelled 0 to 23. Each protein corresponds to a 24
dimensional (0, 1) vector with the k-th component being 1 if the protein has the k-th function.
Let and
s
i
f
s
j
f denote the vectors for protein i and protein j with respect to feature s. The dot
product of and
s
i
f
s
j
f measures the similarity between protein i and protein j. Suppose that we
have u simple features and v complex features. We consider the following integrated model.
1, 0,
1, 0,
::
1: :
Pr( 1 | , )
log ( , ) ( , )
1Pr( 1| , )
jj
k
jj
i
i
jX j i jX j i
u
f ss ss
ks s
iij
kjXji jXj
Xnetworkfeatures
Kij Kij
X network features
Iff
αβ β
λθ θ
==
−
==
+−
≠≠
+
=≠
=
=+ +
−=
⎛
⎜
⎜
++ ⋅+ ⋅
⎜
⎝
∑∑
∑∑ ∑
1
v
s=
⎞
⎟
⎟
⎟
⎟
⎜ ⎟ ⎜
⎠
∑ ij
i
ff
≠
(4.3)
We use 5-fold cross validation to evaluate the prediction accuracy. The method randomly
selects 20% proteins and assumes that their deletion phenotypes are unknown. Then we
predict the essentiality of these “unknown” proteins with our model. The average ROC score
is calculated to measure the prediction accuracy.
To determine the relative importance of each feature, we adopt a forward stepwise selection
strategy. In each round of selection, we choose the feature that has the largest gain in accuracy.
We stop the process if no further improvement of the accuracy can be achieved or no feature is
left.
4.3 Results
4.3.1 Data
The core protein interaction data of yeast are obtained from the Database of Interacting
Proteins (DIP) (http://dip.doe-mbi.ucla.edu, version ScereCR20050102). DIP database is
manually curated and “represents current best approximation for yeast protein interaction”
(Wuchty, et al., 2003). The yeast core interaction data contains interactions from small scale
experiments and interactions confirmed by multiple experiments and paralogous verification
61
62
method (PVM) validation (Deane, et al., 2002). The core interaction data set contains 6,361
interactions involving 2,607 proteins (excluding self-interactions). The protein phenotype data
are obtained from the Saccharomyces Genome Database (SGD)
(ftp://genome-ftp.stanford.edu/pub/yeast/data download/literature curation/, version 2005 3
05). Protein complex data are obtained from MIPS (Munich Information Center for Protein
Sequences). We obtain the list of conserved yeast proteins from (Stuart, et al., 2003). Protein
domains were obtained from Pfam (ftp://ftp.genetics.wustl.edu/pub/pfam/). The
SwissPfam(ver7.5) defines the mapping between proteins' SWISS-PROT/TrEMBL accession
numbers and Pfam domains. Finally, we obtain the yeast protein function annotation from
Gene Ontology (Consortium, 2001). GO is a set of structured vocabularies organized in a
rooted directed acyclic graph (DAG), describing attributes of gene products (proteins or RNAs)
in three categories of “cellular component”, “molecular function”, and “biological process”.
Biological processes are most closely related to essentiality and are incorporated to our
computational model for essentiality. Generally, a gene is annotated by one or multiple GO
nodes along the DAG. The nodes at higher levels correspond to more abstract functional
description. If a gene is annotated with a GO node, we say that this node as well as its parents
covers this specific gene. Thus, the nodes at the higher levels cover more genes. Similar to the
definition used in (Zhou, et al., 2002), a GO node is referred as informative if it covers more
than 100 genes, and none of its child nodes covers the same number of genes. The terminal
informative nodes are defined as the informative nodes such that none of their descendants are
informative. The closer a node is to the root, the more abstract the corresponding function is.
By our definition, we divide all the functions into 24 function categories corresponding to the
terminal informative nodes, including the one for unknown functions. A list of the function
63
categories can be found in the supplementary materials. The GO Ontology and the yeast
annotation data were downloaded on Mar 4th, 2005.
4.3.2 Feature analysis
Yeast is so far the most extensively studied eukaryotes. Several large scale gene deletion
experiments have been performed on yeast (Winzeler, et al., 1999; Tong, et al., 2001; Giaever,
et al., 2002) and the deletion phenotypes for most genes under normal conditions are known.
Many other genomic data have also been collected on yeast, such as genome sequences,
protein-protein interactions, protein functions, biochemical and signaling pathways, gene
expressions etc. The almost complete deletion phenotype data set and many other information
altogether give us an excellent opportunity to systematically analyze how the lethal/viable
phenotypes are related to other genomic information. For the data sources used in this study,
see the Materials and Methods section.
Many factors can potentially affect a protein's essentiality. We first consider the functions of
the protein. Some biological processes, such as DNA replication, mRNA expression,
metabolism, etc., are indispensable for a cell to survive. When proteins having these functions
are deleted, the cell collapses if there are no alternative ways to compensate these functions.
Proteins exhibit their functions by interacting with other proteins. This leads us to consider
protein interactions as well. It is generally believed that proteins perform functions through
their domain structures (Deng, et al., 2002). Thus, we also consider the relationship between
domain structure and essentiality. In this paper, we first study the relationship between gene
essentiality and protein function, protein interaction, and protein domain structure. The
analysis is then extended to include other information such as protein conservation, whether
64
the protein belongs to a complex, and protein cellular localizations. The reason for studying
these factors will be explained in the following sections.
4.3.3 protein-protein interactions and network topology
Yeast protein-protein interaction network has been extensively studied recently (Uetz, et al.,
2000; Ito, et al., 2001; Maslov and Sneppen, 2002; Rives and Galitski, 2003; Han, et al., 2004).
Some amazing features of the network are revealed. Jeong et al. (2001) first noted the
correlation between gene essentiality and the centrality in the protein network. They showed a
positive correlation between the connectivity and essentiality of proteins.
We obtained the core protein-protein interaction data of yeast from the Database of Interaction
Proteins (DIP). The core interaction data set contains the most reliable interactions with a total
of 6,361 interactions involving 2,607 proteins. In order to study the relationship between
protein-protein interaction and protein essentiality, we obtained yeast phenotype data from
SGD (Saccharomyces Genome Database, http://www.yeastgenome.org). This data set
provides deletion phenotype information for 2,581 proteins which are covered by the core
interaction data set with 26 proteins having no essentiality information. Figure 4.1(upper)
shows the fraction of essential proteins as a function of the degree of proteins in the network. It
is clear that as the degree of a protein increases, the probability for this protein being essential
also increases. This result is consistent with the result of Jeong et al. (2001).
In addition to the degree of a protein, we discover that the essentialities of its interaction
partners are also relevant. If a protein is surrounded by a large number of essential proteins, it
is more likely to be essential. For each protein with at least 5 neighbors, we calculate the
fraction of essential neighbors. Then we compare the distribution of the fraction of essential
Figure 4.1. Upper: the fraction of essential proteins vs. the degree of proteins. Down:
The approximate density curves for the fraction of essential proteins among the neighbors
of 378 essential proteins (solid curve) and that of 423 non-essential proteins (dash curve).
The protein interaction network is based on DIP core yeast interaction data set containing
6,361 interactions.
neighbors between essential (378 proteins) and non-essential proteins (423 proteins). The
result is shown in Figure 4.1 (down). The fraction of essential proteins among neighbors of
65
66
essential proteins is stochastically larger than that of non-essential proteins. The mode of the
distribution for essential proteins is about 0.66 and the mode for non-essential proteins is about
0.19.
4.3.4 Protein functions based on biological processes
Proteins can have different functions and some functions are indispensable for the survival of
the organism. Yu et al. (2004) showed that the probability of being essential is an increasing
function of the number of functions the protein has. The function classifications from the
Munich Information Center (MIPS) (Mewes, et al., 2002) were used in their study. The
number of functions can be different if investigators change the classification of functions, e.g.
several function categories can be merged into one function category and thus, their results
depend on how the functions are classified. Gene Ontology (GO)
(http://www.geneontology.org/) describes gene products (proteins or RNA) based on three
principles: “Cellular component”, “Molecular function”, and “Biological process”. Since
biological processes are most related to protein essentiality, we use it to define protein
functions. In this study, we examine the distribution of essential/non-essential proteins within
each function category. GO has a Directed Acyclic Graph (DAG) structure and can not be
directly used for function classification.
We adopt the same method as in (Deng, et al., 2004) to group related functions into cattegories.
We classify the 2,607 proteins that we considered in the yeast core interaction set into 24
function categories such that each category contains at least 100 proteins in the core set. (See
Materials and Methods section and supplementary table for details). We obtain the yeast
function annotation from GO and calculate the fractions of essential vs. non-essential proteins
in these function categories. The results are shown in Figure 4.2.
Figure 4.2: The fractions of essential and nonessential genes in each of the 24 function
categories considered. The average fraction of the essential proteins in the yeast genome is
around 20%.
The fraction of essential proteins in different function categories varies significantly. It can be
as high as 86% in one function category and as low as 7% in another. Function categories 5
(rRNA processing) and 13 (mRNA processing) are the top two categories in which essential
proteins are significantly enriched (p-values are 1.1E-53 and 1.5E-32, respectively. The
calculation is based on a binomial model). Both functions are related to RNA processing and
are necessary for the stability of the cell. Function category 0 which covers all the
un-annotated proteins contains significantly small number of essential proteins (p-value =
1.4E-10). This observation may just indicate that biologists are more inclined to study
essential proteins.
67
68
4.3.5 Protein Domains
The amino acid sequence of a protein is fundamentally important for the protein's phenotype.
A protein's sequence determines its secondary and tertiary structure and thus, determines its
interaction partners and its biological functions. However, it is not clear how to extract
important features necessary for protein essentiality. Because protein domains are conserved
regions of peptide sequences with relatively independent tertiary structures, they provide
important features for understanding protein essentiality. We use Pfam domains as the source
of domain information. The SwissPfam (ver7.5) (ftp://ftp.genetics.wustl.edu/pub/pfam/)
defines the mapping between proteins' SWISS-PROT/TrEMBL accession numbers and Pfam
domains. Most proteins have less than 3 domains while a few of them can have more than 10
domains (See Figure B1 in Appendix B). The fraction of essential proteins as a function of the
number of domains of the proteins is given in Figure 4.3. As the number of domains of a
protein increases, the probability of the protein being essential also increases. Since the
number of domains of a protein can be an indicator of the protein's complexity, it is reasonable
that the more complex a protein structure is (potentially having more functions), the more
likely that the protein is essential. We also observe that some domains are tightly linked to
protein essentiality. For example, ten proteins in yeast have the Cpn-60 TCP1 domain. Among
these ten proteins (YLR259C, YJR064W, YJL111W, YJL014W, YJL008C, YIL142W,
YFR019W, YDR212W, YDR88W, YDL143W), 9 of them are essential. The only exception
is YFR019W which is a 1-phosphatidylinositol-3-phosphate 5-kinase. The vacuolar
membrane kinase generates phosphatidylinositol (3,5)P2, which is involved in vacuolar
sorting and homeostasis. All the other 9 proteins having the domain are either subunit of the
cytosolic chaperonin Cct ring complex or subunit of the T-complex. All the nine proteins are
involved in protein folding. It is almost certain that the Cpn-60 TCP1 domain plays a
significant role in protein folding because the protein YJL111W has only this domain. Thus,
domain information can be important in determining protein essentiality.
Figure 4.3: The fraction of essential proteins vs. the number of domains. The dash line is
the linear regression line for the data.
4.3.6 Protein Cellular Localization
The biological function of a protein is related to its cellular localization. For example, it is
unlikely for a cellular membrane protein to be involved in amino acid metabolism. Based on
this reasoning, a protein's localization may shed some light on protein essentiality. Similar to
what we have done with protein function for biological processes, we divide cellular locations
into categories and study the fraction of essential proteins in each category. The result is
shown in Figure 4.4. Essential genes are significantly enriched at nucleolus (category 5).
69
Among 145 proteins at nucleolus, 108 of them are essential (p-value = 5.0E-43). This is
reasonable since nucleolus is the place where the genetic information is stored and expressed.
On the other hand, essential proteins are significantly under-represented for localization
unknown proteins, only 4% of them are essential. The similar observations have also been
observed based on a large scale localization data of (Huh, et al., 2003).
Figure 4.4: The fraction of essential proteins in different cellular locations.
4.3.7 Protein Conservation and Complexes
A well conserved gene usually has important biological functions so that it is maintained
during millions years of evolution. Therefore, we expect that conserved genes are more likely
to be essential. A set of conserved genes was defined by comparing four different organisms:
Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces
cerevisiae (Stuart, et al., 2003). The method for identifying the conserved genes is to find
70
71
those genes with the highest reciprocal BLAST score. Based on this set of conserved proteins,
we analyze the correlation of protein conservation and protein essentiality. A positive
correlation is identified (Figure 4.5(upper)). There are a total of 1,784 conserved yeast genes
and 628 of them are essential. The fraction of essential genes among the conserved genes is
significantly higher than that for all the genes, 35% vs 21% (p-value = 1.54E-43). A protein
complex is a relatively stable multi-protein structure. Several groups have used different
biological techniques to identify protein complexes (Gavin, et al., 2002; Ho, et al., 2002).
Because protein complexes usually have more complicated structures and functions than
single-unit proteins, it is difficult to compensate their functions if they are disabled. Therefore,
if a protein is a component of a complex, it may have a high chance of being essential. In order
to obtain reliable complex data, we take the intersection of two protein complex data, one by
(Ho, et al., 2002) and the other by (Gavin, et al., 2002). The list contains 655 proteins which
are components of protein complexes and 304 of them are essential. Again the fraction of
essential proteins in protein complex components (46%) is significantly higher than the
average (Figure 4.5(down) )(p-value = 1.31E- 47).
Figure 4.5: Upper: the fraction of essential proteins in all, conserved and non-conserved
proteins. Down: the fraction of essential proteins for all, proteins within complexes, and
proteins not in complexes.
4.3.8 Prediction for Protein Essentiality
Next we develop a computational model to “predict” protein deletion phenotypes. For the data
set we are considering, there are only 26 yeast proteins with unknown deletion phenotypes and
it is relatively easy to experimentally check these proteins' essentiality. However, for many
other species, a greater proportion of their proteome have unknown deletion phenotypes and
72
an accurate essentiality prediction method is needed. We use a kernel based logistic regression
model to achieve this objective. The main purpose is to evaluate the prediction accuracy of the
computational model based on various data sources. If the prediction accuracy is high, we can
use this model to predict protein phenotypes in other organisms.
Table 4.1: A forward stepwise selection approach is used to select the features which increase the
prediction accuracy. In each round of selection, the feature which gives the greatest improvement of
accuracy is kept for the next round of selection.
Previously, Markov random field (MRF) model has been successfully applied in protein
function prediction (Deng, et al., 2003; Deng, et al., 2004). A more recent paper (Lanckriet, et
al., 2004) showed that the support vector machine (SVM) based on diffusion kernel can give a
better prediction accuracy. One of the most significant differences between these two methods
is how the protein-protein interaction information is treated. For the MRF, it considers the
direct neighbors of the protein in question. Although this is intuitive and can be effective for
proteins with large number of neighbors (e.g. > 4), it may not effectively handle the remote
73
74
neighbors which are reachable only by following two or more links. In contrast, Lanckriet et al.
(2004) used the diffusion kernel to incorporate the protein interaction information from the
network. The diffusion kernel establishes similarities among the nodes of a network based on a
random walk on the network (Kondor and Lafferty, 2002), thus a remote node is also
considered but with a smaller weight. Using the diffusion kernel based on the protein
interaction network, the SVM can fit the training set almost perfectly by optimizing the margin
of the separating hyper plane. However, we find that the number of supporting vectors is
usually very large, i.e., more than 90% of the training proteins are classified as supporting
vectors. To overcome this issue, we combined the ideas behind the MRF model and the SVM
model. This new model contains much less parameters than SVM and uses the protein
interaction information more efficiently than the MRF model. Comparisons of this new model
with the MRF and SVM are performed and the mean ROC score for the SVM is 0.812 while
the new model can achieve a mean ROC score of 0.831 based on 5-fold cross validation. The
Wilcoxon test shows that the difference between the results of these two methods is significant.
The optimal diffusion kernel coefficient is used for each method. The MRF can achieve 0.804
(Figure 4.6).
We then test if the prediction accuracy can be improved by incorporating other feature
information. Since we have more than one feature available, a forward selection is performed
to figure out the relative importance of each feature and the result is listed in table 4.1. The
results show that four features are useful, namely, protein-protein interaction, protein function
annotation, protein conservation, and protein localization. We use these four features in our
final model to predict proteins' essentiality with unknown deletion phenotype. Although the
other features like protein domain and protein complex information do improve the prediction
Figure 4.6: Compare the prediction performance of the three methods, MRF, SVM, and
the new method. The new method has the highest ROC score (0.831) and the SVM has a
slightly lower ROC score (0.812) and the MRF model has the lowest ROC score (0.804).
accuracy when no other information is available, the performance drops if they are mixed with
the above four features. We also run several permutations to test if our model consistently
performs well and an average ROC (Receiver Operating Characteristic) score of 0.866 is
obtained with our integrated model. Details of the integrated model can be found in Materials
and Methods section.
4.3.9 Applications of the Computational Model
We apply the computational model to predict the essentiality for unknown proteins, analyzing
proteins with high posterior probability of being essential and predicting protein essentiality in
human.
75
76
We first predict the deletion phenotype of 26 unknown yeast proteins and our results indicate
that three proteins, YGL100W (nuclear pore complex subunit), YHR072W-A (H/ACA-box
snoRNPs component), and YPR102C (component of the large (60S) ribosomal subunit) have
a high posterior probability (0.72, 0.94, and 0.66, respectively) of being essential. The full list
of posterior probabilities of being essential for all the unknown genes are given in
supplementary table B3.
Secondly, we analyze those proteins of which the posterior probability of being essential is
high (> 0.9) while they are annotated as being viable in the database. There are total of 7 such
proteins out of 116 proteins with high posterior probability. We generated a list of these
proteins and they are provided in supplementary table B4. One representative example is
shown here. Based on the information provided by SGD, LSM7 interacts with 16 proteins and
11 of them are essential. It functions in nuclear mRNA splicing via spliceosome. It is a
complex component and also conserved. The posterior probability of being essential is 0.93.
The SGD database annotates the null mutant as viable. But the null mutated yeast will grow
slowly at 23deg and 30deg, and LSM7 is required for growth at 37deg (information without a
citation in SGD). Thus, this protein is essential under certain conditions. Since the essentiality
of proteins is condition dependent and cannot always be described as either essential or
non-essential, the probability we present here gives a more suitable estimation of protein
essentiality in such situation.
Thirdly, we apply the computational model to predict essential proteins in other organisms,
e.g., human. If an essential yeast gene is conserved in human, its homologue in human can be
essential if the function of its protein product has not significantly changed. We can use all the
yeast essential genes to perform BLAST searches (http://www.ncbi.nlm.nih.gov/BLAST/).
77
However, as the majority of yeast gene deletion phenotype data were obtained through large
scale experiments which usually contain a significant number of false positive and false
negative errors, we want to filter out those less reliable essential proteins before we perform
the BLAST search. We select the subset of essential yeast proteins (109) which have high
posterior probability (> 0.9) of being essential and run BLAST against the human protein
database obtained from the EMBL-EBI database (http://www.ebi.ac.uk/Databases/). 102 of
such yeast proteins have homologues in the human genome (BLAST e-score smaller than
1E-10). The full list is available as online supplementary materials. One example we show
here is human ATP-dependent RNA helicase DP97 (Swissprot accession # Q8TDD1). Its
corresponding yeast homolog is YDL031W which also has the ATP-dependent RNA helicase
activity (Venema et al., 1999). Since both the proteins have the same function and the deletion
of YDL031W is lethal to yeast, the deletion/mutation phenotype of the corresponding human
gene is likely to cause some serious problems to human if not death.
4.4 Discussion
We perform an extensive search for factors potentially contributing to protein essentiality.
Consistent with previous studies, we find significant correlations between protein essentiality
and protein conservation, also between protein essentiality and the degree of a protein in the
protein interaction network. In addition, we find several other important factors for protein
essentiality. First, we find that essential proteins are significantly enriched in several
functional categories based on GO, such as ribosome biogenesis, mRNA processing, and
rRNA metabolism, all of which are related to RNA processing. Previous investigators only
found positive correlations between protein essentiality and the number of functions of a
protein based on MIPS (Yu, et al., 2004). Second, we show significant correlation between
78
protein essentiality and the fraction of essential proteins among its interaction neighbors based
on the core interaction data set of DIP. Third, protein domain information is also related to
protein essentiality. In this study, we provide a most complete survey of genomic factors
contributing to protein essentiality. Some of the factors are found to be correlated with protein
essentiality for the first time. The other novelty of the study is that we design a new
probabilistic method combining protein interaction network, protein function classification,
protein conservation, and other protein features to predict protein essentiality. We also use a
stepwise approach to select the most useful features for high accuracy protein essentiality
prediction. Our results clearly show that the kernel-based logistic regression model can
accurately predict protein essentiality and it outperforms the former MRF model. This model
has a similar accuracy as compared with SVM but is much easier to implement given the
simplicity of the logistic regression model. Combining other information such as protein
function and protein conservation can significantly improve the prediction accuracy.
There are several potential applications for this study. First, since most of the proteins’
essentialities are known for yeast, this enables us to evaluate the prediction capability of our
model using this nearly complete data set. Second, by building a computational model for
protein essentiality prediction, we can use information such as protein interactions, protein
functions, etc. to predict proteins' essentiality when it is impractical to study them
experimentally in certain organisms like human.
There are several limitations of this study. We treat the core interaction data set of DIP as the
true interaction data. Although the core interaction data are the most reliable, the false
negative rate is high. The effect of the incompleteness of the interaction data on the prediction
79
accuracy is not clear. We use all the interaction data in DIP for protein essentiality prediction
and the prediction accuracy is lower than what is presented in this paper (data not shown).
If a trustworthy estimate of the reliability of each protein-protein interaction can be obtained, it
will be useful for the prediction. The other issue is the dependence among the contributing
factors for protein essentiality. In this study, in order to reduce the number of parameters to be
estimated, we assume that all the factors contribute independently to protein essentiality.
The effect of this assumption needs to be further studied. Despite the limitations, the 5-fold
cross validation results indicate that the current model can give reliable and highly accurate
protein essentiality predictions.
Web site references
http://www.yeastgenome.org/; SGDTM is a scientific database of the molecular biology and
genetics of the yeast Saccharomyces cerevisiae.
ftp://genome-ftp.stanford.edu/pub/yeast/data download/literature curation/; FTP site of SGD.
ftp://ftp.genetics.wustl.edu/pub/pfam/; SwissPfam
http://mips.gsf.de/; Munich information center for protein sequences.
http://www.geneontology.org/; Gene Ontology Consortium.
http://dip.doe-mbi.ucla.edu; Databases of interacting proteins.
http://www.ebi.ac.uk/Databases/; European Bioinformatics Institute (EBI).
http://microarray.cpmc.columbia.edu/gist/; Gist Program website.
http://www.ncbi.nlm.nih.gov/BLAST/; BLAST program is download from NCBI website.
80
Chapter V
Disease gene classification
81
5.1 Background
Identification of novel genes associated with human diseases is among the most critical tasks
in medical research. Towards this goal, various features have been compared between
heritable disease genes and non-disease genes (Bortoluzzi, et al., 2003; Yvert, et al., 2003;
Lopez-Bigas and Ouzounis, 2004; Chen, et al., 2005). Although most findings were consistent
with each other, a few conflicting results showed up. For example, Smith et al. found that
disease genes evolved with higher nonsynonymous/synonymous substitution rate ratios
(Ka/Ks) than non-disease genes (Yvert, et al., 2003), but Huang et al. found no such significant
differences (Chen, et al., 2005). One common problem with these studies is that human
essential genes were ignored and simply grouped together with other non-disease genes.
Essential genes are genes whose functions are necessary for the organism to survive and
reproduce. Since the disruption of essential genes’ function will cause fatal consequences, they
should be regarded as the most severe “disease” genes. Therefore, comparing disease genes to
a mixture of essential and non-disease genes will reduce the clarity of the signals of the
disease-related features and may even lead to erroneous findings. Thus, it is beneficial to
separate human essential genes from other non-disease genes before comparisons are made.
Thousands of genes have been identified as essential genes in multiple model organisms, such
as Saccharomyces cerevisiae, Caenorhabditis elegans, and Mus musculus (Giaever, et al.,
2002; Blake, et al., 2003; Sonnichsen, et al., 2005). Although it is almost certain that the
human genome also contains hundreds to thousands of essential genes, it is impractical to
experimentally determine them as in S. cerevisiae or C.elegans. The absence of a set of
82
well-defined human essential genes poses a challenge on studying them and urges for
alternative solutions.
The human genome has an extremely complex tissue expression profile. Some genes are
expressed only in certain tissues during specific times, while others are constitutively and
ubiquitously expressed (Warrington, et al., 2000; Butte, et al., 2001). For the latter genes, they
are presumed to be necessary for the most fundamental cellular physiological processes and
are referred as housekeeping genes (Butte, et al., 2001). Housekeeping genes have been
studied by many researchers and some interesting observations have been reported. For
example, Zhang and Li found that housekeeping genes evolved more slowly than
tissue-specific genes (Han, et al., 2004). Eisenberg and Levanon found that housekeeping
genes were compact in their coding lengths, which could be the result of higher selective
pressure (Deane, et al., 2002). Based on the unique properties of the ubiquitously expressed
human genes (UEHGs), we believe that they are suitable candidates for essential genes.
Although this hypothesis is intuitive and sounds reasonable, serious efforts are required to
collect supportive evidence on a systematic level.
In this study, we consider a set of 1,789 ubiquitously expressed human genes (UEHGs) as an
approximation for essential genes. We demonstrate that UEHGs are very likely to contain a
large proportion of essential genes and thus can approximate human essential genes. By
performing a three-way feature comparison of UEHGs (presumed essential genes), disease
genes, and the rest of human genes (referred as other genes), we show that they are different in
many aspects such as the evolutionary conservation rates, DNA coding lengths, gene functions,
etc.
83
5.2 Results
Instead of dividing the human genome into disease vs. non-disease genes, we choose a
three-way classification, namely, UEHGs (presumed “essential”), disease, and other genes.
We first validate that the set of UEHGs contains a large fraction of essential genes. Then by
comparing the three groups of genes, we see how the disease genes can be distinguished from
essential and other genes. If UEHGs really contain much greater fraction of essential genes
than non-UEHGs (i.e. disease and other genes), we expect to observe the followings. First, as
essential genes are functionally extremely important, the selective pressure on them are much
higher than on non-essential genes, thus UEHGs should have a slower evolutionary rate than
both disease and other genes (Hirsh and Fraser, 2001; Lanckriet, et al., 2004). Second, since
most Mendelian diseases are caused by deleterious amino acid substitutions, if we study the
conservation at amino acid level, we expect to see different patterns for UEHGs, disease and
other genes. Third, when UEHGs are mapped to another species, the homologous genes
should more likely be essential in that species if the species is evolutionarily close to humans.
Fourth, since essential proteins usually tend to be hub proteins (highly connected) in the
protein-protein interaction network (Jeong, et al., 2001), UEHGs should have a higher average
physical interaction degree than non-UEHGs. Fifth, the functions of UEHGs should be
fundamentally important. To verify these hypotheses, we compile the lists of UEHGs,
disease genes and other human genes. We then collect various features and compare those
selected features among the three gene classifications.
5.2.1 Comparison on the evolutionary features
We first compare the Ka, Ks and the ratio (Ka/Ks) based on the three-way classification of the
84
human genome. The Ka, Ks and Ka/Ks are derived from both human-rat and human-mouse
orthologous pairs. The results obtained from human-mouse orthologous pairs indicate that
UEHGs have the smallest Ka, Ks and Ka/Ks ratio in the three groups (P-values for UEHGs vs.
disease are 3.4E-39, 6.3E-15, and 1.7E-38; P-values for UEHGs vs. others are 5.3E-64,
9.0E-27, and 1.3E-57, respectively for Ka, Ks and Ka/Ks), and disease genes have lower
evolutionary rates than other genes (P-values are 9.1E-5, 5.5E-4 and 2.6E-4 for Ka, Ks and
Ka/Ks, respectively) (Fig 5.1). By various statistical measurements, UEHGs consistently
stand out as the slowest evolved gene group and the difference between UEHGs and the other
two groups is greater than the difference between disease genes and other genes (Table
5.1).The results are similar when human-rat orthologous pairs are used to calculate Ka and Ks,
only the P-values are slightly less significant. Again, disease genes evolve at slower rates than
other genes with significant differences in Ka, Ks and Ka/Ks (P-values are 0.008, 0.052 and
0.026, respectively). UEHGs evolve at the slowest rates and the differences in Ka, Ks and
Ka/Ks are strongly significant (P-values for UEHGs vs. disease are 7.0E-32, 3.2E-13, and
2.7E-28; P-values for UEHGs vs. others are 1.1E-49, 4.8E-23, and 1.2E-44, respectively, for
Ka, Ks and Ka/Ks). Here, our results are different from the findings of (Yvert, et al., 2003) and
(Chen, et al., 2005), and as described before, their results are not consistent with each other
either. We think this is partly due to the effect of mixing essential genes with non-disease
genes in the previous studies. Another reason is that different groups used different set of
disease and non-disease genes (Smith et al. studied 387 disease and 2,024 non disease genes,
Huang et al. studied 1,112 disease genes and more than 10,000 non-disease genes and our
dataset covered >1,700 disease genes and >12,000 non-disease genes, see supplementary
materials for a detailed comparison). Our results indicate that UEHGs are under the strongest
selection pressure. Disease genes evolve at an intermediate rate which is slower than other
85
genes, but faster than UEHGs. Our experiments consider much larger gene sets and a
significant number of essential genes are separated out from non-disease genes. Therefore, our
results may better reflect the relationship of disease genes with other genes. The slightly
different results calculated from rat and mouse are most likely due to the different evolutionary
rates between mouse and rat after their divergence around 16 million years ago (Bourque, et al.,
2004). We also confirm the positive correlation between Ka and Ks in all the three gene groups,
as observed in previous studies (Wolfe and Sharp, 1993; Ohta and Ina, 1995; Makalowski and
Boguski, 1998; Zhu, et al., 2000; Castresana, 2002). For example, the Ka and Ks of disease
genes have a correlation coefficient of 0.45 (P-value is 1.9E-86 by t-test). Although a
synonymous mutation has no apparent effect on protein sequence, it may affect the DNA
structure, mRNA structure, and biological processes such as transcription and RNA splicing
(Miller and Kumar, 2001). The small average Ks of UEHGs indicates that the synonymous
sites of UEHGs are also under stronger selection than the other two groups, which further
suggests the functional importance of UEHGs.
Table 5.1. Comparison of evolutionary rate among three groups of genes
UEHGs Disease genes Other genes
Mean
(SEM)
Median Range Mean
(SEM)
Median Range Mean
(SEM)
Median Range
Ka 0.053
(0.0015)
0.033 0-0.48 0.080
(0.0017)
0.061 0-0.70 0.092
(8.0E-4)
0.066 0-1.37
Ks 0.58
(0.0047)
0.57 0.076-2
.0
0.63
(0.0043)
0.61 0.14-2.4 0.66
(0.0022)
0.62 0.041-4
.2
Ka/Ks 0.088
(0.0023)
0.059 0-0.59 0.12
(0.0023)
0.10 0-0.70 0.13
(0.001)
0.11 0-1.2
Ka, the number of non-synonymous substitutions per non-synonymous sites. Ks, the number of
synonymous substitutions per synonymous site. Ka and Ks values are calculated based on
human-mouse orthologous pairs as described in the main text. The first column in each group is
the mean followed by standard error of the mean (SEM).
(a)
(b)
Figure 5.1. Distribution of Ka, Ks and Ka/Ks. (a) The cumulative density of Ka, Ks
and Ka/Ks derived from human-mouse orthologous pairs. Ka, the number of
non-synonymous substitutions per non-synonymous sites. Ks, the number of synonymous
substitutions per synonymous site, and the Ka/Ks ratio. Three groups of human genes are
represented in different colors and the number of genes in each group is listed right to the
line symbols. (b) The box plots are drawn based on the same data. For each category, the
central box depicts the middle 50% of the data between the 25th and 75th percentile, and
the enclosed red horizontal line represents the median value of the distribution. Extreme
values are indicated by solid blue dots that occur outside the main bodies of data.
86
87
As Ka and Ks are summary statistics of the nucleotide substitution rate of a gene, and most
Mendelian diseases are caused by amino acid substitutions in coding regions, it will be more
informative to study the pattern of conservation for individual nucleotides or amino acids. We
obtain the conservation score of specific amino acids based on a large-scale multiple-sequence
alignment of 8 species performed recently by the UCSC research group (Siepel, et al., 2005).
The conservation scores were derived from a two-state phylo-Hidden Markov Model and can
be interpreted as probabilities of each base being from a conserved hidden state. We collect a
list of more than 6,000 disease mutation sites and about 1,900 polymorphism (neutral)
mutation sites from SwissProt. We compare the conservation of these two types of sites with
each other and also with the background (The background is obtained by considering the
conservation score of all the amino acids in coding regions). As shown in Fig 5.2(a,b),
polymorphism mutation sites are significantly biased towards less conserved sites while
disease mutation sites are significantly biased towards more conserved sites (p-values < 10-5
in both cases). This is consistent with the findings of Miller and Kumar although they focused
only on seven disease genes (Miller and Kumar, 2001). Then we compare the UEHGs with
disease genes and results are shown in Fig 5.2(c). UEHGs are more conserved than disease
genes but the conservation score of disease mutation sites are greater than those for UEHGs.
Since we don’t have information on “essential sites”, we are unable to directly compare the
“essential sites” with disease sites. Instead, we think that one possible mechanism
distinguishing essential genes from disease genes is that essential genes contain a larger
fraction of highly conserved sites (with the underlying assumption that highly conserved sites
(a)
0.00.2 0.40.6 0.81.0
Conservation Score
0
20
40
60
80
Percentage
Disease gene all sites
Disease mutation sites
(b)
0.0 0.2 0.4 0.6 0.8 1.0
Conservation Score
0
20
40
60
80
Percentage
Background (all sites)
Polymorphic mutation sites
(c)
0.00.2 0.40.6 0.81.0
Conservation score
0
20
40
60
80
Percentage
UEHGs all sites
Disease mutation sites
Disease gene all sites
(d)
0.0 0.2 0.4 0.6 0.8 1.0
Fraction of higly conserved region
0.0
0.5
1.0
1.5
2.0
Density
UEHGs
disease genes
other genes
Figure 5.2. Codon conservation of the three gene groups. The conservation score of
amino acids of the three groups of genes are compared. (a) The distribution of disease
causing mutation sites’ conservation score is plotted in the solid line. The dotted line is
drawn based on the conservation scores of all the sites in the coding region (i.e., the
distribution of the conservation score when sites are randomly chosen). (b) The
distribution of polymorphism mutation sites’ conservation score vs. the random
distribution as in (a). (c) The distribution of conservation score for UEHGs (black line),
disease gene (broken blue line) and disease causing mutation sites (red broken line). (d)
The distribution of the fraction of the highly conserved regions (Cons. Score>0.9). Each
human gene group is represented in a different color.
correspond to functionally important loci). Thus, the chance that a random mutation will cause
a severe phenotype will be much higher for essential genes than for disease genes. We select
conservation score 0.9 as the cut-off value and define sites with conservation scores above that
as highly conserved sites. The cut-off is chosen based on the distribution of the conservation
scores of disease mutation sites. Different cut-off values were tested and results are similar.
88
89
We calculate the fraction of highly conserved sites in the coding region and show the
distribution of this fraction for UEHGs, disease genes and other genes in Figure 5.2(d). It is
clear that UEHGs contains a much higher fraction of highly conserved sites than the other two
groups, while there is no significant difference between disease genes and other genes
(p-value=0.32).
5.2.2 Cross-species comparison of gene deletion phenotypes
Many human diseases are studied by experimenting on model organisms such as mouse. The
underneath rationale is that the homologous genes have similar functions if the two species are
evolutionarily close. Similarly, the essentiality of human genes can be tested in an
evolutionarily close species. This approach may not work for all genes, due to differences
between species, however, the closer the two species are, the higher accuracy it can achieve.
Unfortunately, the knowledge on gene essentiality in other high animals is still very limited.
For example, the number of mouse genes with known mutation phenotypes is around 10% of
the genome (~ 2,800 in Mouse Genome Informatics database) and they are heavily biased
towards the homologs of human disease genes (results not shown). By now, only S. cerevisiae
and C. elegans have been explored for gene essentialities on the whole genome scale (Giaever,
et al., 2002; Sonnichsen, et al., 2005). Here we compare the human genome with them,
although they are not favoured for the rather far evolutionarily distances between them and
humans. Human genes are mapped onto yeast and worm based on homologous relationship.
For UEHGs, disease genes and other genes, we examine the fraction of genes which are
homologous and essential in the mapped species (Table 5.2 and Fig 5.3). The first column of
Table 5.2 shows that UEHGs contain the largest fraction of genes which have homologous
counterparts in yeast. Disease genes have smaller fraction than UEHGs but greater than other
90
Table 5.2. Cross-species comparison of gene essentiality between human and S. cerevisiae.
Total Essential Non-essential Unknown
Homologs of UEHGs 384
(20.2% of UEHGs)*
138 (35.9% )
(7.3% of UEHGs)*
242(63.0%) 4(1.1%)
Homologs of disease
genes
196
(9.8% of disease genes)*
51 (26.0%)
(2.5% of disease genes)*
142(72.5%) 3(1.5%)
Homologs of other
genes
1005
(3.5% of other genes)*
379 (37.7%)
(1.3% of other genes)*
618(61.5%) 8(0.8%)
No homologs 4641 505 (10.9%) 3427(73.8%) 709(15.3%)
Total yeast genes 6179 1058 (17.1%) 4397(71.2%) 724(11.7%)
Yeast genes are mapped to three human gene groups by homologous mapping. The first three rows
are for each human gene group and the fourth row is for yeast genes without human homologous
genes. The last row summarizes all the yeast genes. The first column lists the number of homologous
genes found for each human gene group and the fractions with respect to the number of human genes
in the corresponding group (show in brackets marked with *). The following columns list the number
of yeast essential, non-essential, and unknown genes for each human gene group. The fractions with
respect to the number of homologs found in each group are shown in brackets. In the second column,
the fraction of homologous essential genes over the human genes in each group is given in the bracket
with *.
genes. The same order can be observed in C. elegans as shown in Fig 3. Since yeast and C.
elegans are evolutionary distant from human, the results support that UEHGs contain a greater
fraction of functionally important genes than disease and other genes. The second column of
Table 5.2 shows the fraction of homologous genes in each three groups which are also
essential in yeast. Again, we see that UEHGs have the highest fraction, followed by disease
genes, and the other genes. However, as shown in the same column, conditional on
homologous genes, the fraction of essential genes for each group does not strictly follow the
same order (i.e., UEHGs>disease genes>other genes). This is more prominent in yeast than in
C. elegans. Therefore the phenomenon is very likely caused by the larger evolutionary
distance between yeast and human. Or we can say that given a gene is highly conserved (such
as a human-yeast homologous gene), the essentiality of the gene is no longer strongly linked to
the group that the gene belongs to. In addition, the results suggest that the highly conserved
genes contain large fraction of essential genes too.
Figure 5.3. Comparison of gene essentiality between human and C. elegans. Human
genome is divided to three groups as described in the main text and 20,488 C.elegans
genes are mapped to each group based on homology. The essentiality of C. elegans gene is
obtained from RNAi-interference experiment as described in the main text. Different
phenotypes are represented by different colors and the number of the homologs in each
group is listed. The fraction of human genes with C. elegans homologs is shown under the
group name.
5.2.3 Comparison on other features
It has been noticed for several years that, in the protein interaction network, essential proteins
91
tend to interact with more proteins in model organisms (Jeong, et al., 2001). Peri et al.
manually collected more than 27,000 interactions involving about 18,000 human proteins
(Peri, et al., 2003). Although this data set is still quite sparse, its accuracy is assumed to be
high. The distribution of protein physical interaction degree of the UEHGs, disease and all
other genes are shown in Figure 5.4. The average degree±standard error for UEHGs, disease
and all other genes are 11.0±0.6, 8.4±0.4 and 6.2±0.1, respectively. We only consider proteins
with at least one interaction, since 0 degree could mean either no interaction or absence of data
and we are unable to make the distinction. We acknowledge that the data collection could be
0%
10%
20%
30%
40%
50%
60%
70%
1-5 6-10 11-
15
16-
20
21-
25
26-
30
31-
35
36-
40
41-
45
46-
50
51-
55
56-
60
61-
65
66-
70
71-
75
>76
Protein-protein interaction degree
Fraction
UEHGs(762)
Disease(1168)
Others(4063)
Figure 5.4. Distribution of protein physical interaction degrees. UEHGs, disease genes,
and other genes are shown in three different colors in the histogram. It can be seen that as
the interaction degree increases, the fraction of UEHGs also increases. For the summary
statistics, see main text. The number of genes with at least one interaction in HPRD is
listed for each gene group.
92
biased towards human disease genes. However, as the set of UEHGs are defined purely based
on gene expression from an independent source, it is unlikely to have a heavy bias towards
UEHGs. Thus, the high interaction degrees of UEHGs can be regarded as a supporting
evidence for their essentiality.
Figure 5.5 Function annotations of genes in the three groups. GO categories are
described by the row labels and columns are the three classes of genes. A color scheme
(scale shown on the right) is used to display the significance level of over-representation
(numbered as negative logarithm of the P-value, upper half of the scale) or
under-representation (numbered as logarithm of the P-value, lower half of the scale) for
certain gene group and function category. Hyper-geometric distribution is used for the
calculation of the P-value.
93
94
We also investigate the function annotation of the three groups of genes. As shown in Figure
5.5, UEHGs, disease genes and all other genes have distinct function distributions. UEHGs are
enriched in protein biosynthesis and several other fundamentally important physiological
processes, while disease genes are more relevant to sensing and responding to
internal/external signals, which are advanced mechanisms for the fine tuning of certain
biological processes.
Table 5.3. Comparsion of the length of various parts of UEHGs, disease genes, and all other
genes.
UEHGs(n=1400) Disease(n=1773) Others(n=10304) P-value
Coding sequence length
1501±38
1109
2205±73
1537
1849±15
1459
3.2E-09
Total exon length
2545±48
2136
3250±78
2557
2752±18
2343
1.6E-06
Number of exons
10.7±0.2
8
13.5±0.3
10
9.9±0.1
7
9.6E-07
Total intron length
35698±1558
15588
60376±2836
23528
54881±1139
18540
0.012
5’ UTR length
546±22
238
582±21
251
560±8
245
0.55
3’ UTR length
559±21
247
569±21
243
575±8
254
0.86
For each row the first line gives the average value ± s.e.m, and the second line gives the median.
UTR stands for untranslated region. Pair-wise rank sum tests are performed and only the largest
P-values are listed in the last column.
Finally, we look at the relationship between gene’s conservation and the onset age of disease.
Different diseases exhibit their symptoms at different ages. Some diseases develop as early as
in utero while some only present in elders. People usually think genes associated with early
onset diseases are under higher selection pressure than those associated with late onset
95
diseases. If this is correct, since essential genes are critical, their evolutionary rates should be
similar to those early onset disease genes rather than the late onset ones. To verify this, we
divide disease genes based on their onset age as (Jimenez-Sanchez, et al., 2001) and compare
the Ka/Ks ratio. Figure 5.6 shows that evolutionary rates tend to increase when the onset age
becomes larger. The correlation coefficient between the onset age and the Ka/Ks ratio is
positive with P-value of 0.02 based on weighted least squares regression (Huh, et al., 2003).
(Weighted least square is used here since different age groups contain unequal number of
genes with non-constant variances, by introducing weight to the regression, such effects can
be reduced.) Base on visual inspection, Fig. 5.6 also suggests that UEHGs have similar Ka/Ks
ratios as those for genes responsible for diseases in uterus and other genes have similar Ka/Ks
ratios with genes associated with late onset diseases. However, as the regression is performed
on the group number rather than the actual disease onset age (the original linear relationship
among different disease onset ages could be distorted to some extent), and the P-value just
passes a less stringent cut-off (i.e., 0.05), more data and further analysis are needed to draw a
more confident conclusion from above results.
5.3 Discussion
All the results above support that UEHGs by themselves form a distinct group other than
disease genes. The results also endorse that UEHGs may contain a large proportion of
functionally essential genes. Although we try to show that UEHGs are good candidates for
human essential genes, we have no intention to claim that they are the only or the best gene set
for representing human essential genes. Because a gene needs to be ubiquitously expressed to
be considered an UEHG, low expressed or somehow tissue specific expressed essential genes
will be excluded. Also, since the tissue samples were collected mainly from adult individuals,
96
genes which are essential for early stage development may be missed too. As revealed by the
cross-species comparison, UEHGs may have failed to cover many essential genes and those
genes are still classified as other genes. We study a different set of genes by considering
genes that are conserved across yeast, C.elegans and human. The results indicate that they may
contain a large fraction of essential genes too (results not shown). However, as pointed out by
(Chervitz, et al., 1998), such set may miss many human essential genes which don’t have
homologs in yeast and C.elegans. In contrast, UEHGs is a more unbiased sample from all
essential genes. A combination of UEHGs and conserved genes might generate a more
complete set of candidates for human essential genes. We also realize that the set of disease
genes in our study are mainly genes associated with Mendelian diseases, while complex
disease genes are under-represented.
Different from previous studies on human housekeeping gene, we define the UEHGs as genes
expressed in “almost all” (not “exactly all”) the tissues that are examined. Due to the
fluctuation of gene expression and the error in the gene expression measurement, as more
tissues being examined, fewer genes will be observed as expressed in all the tissues. We relax
the criteria to allow missing expression in a small fraction of tissues so that the size of UEHGs
is less sensitive to the number of tissues being examined. Also a different cutoff value of
expression level was adopted. In order to verify that our results are not sensitive to specific
criteria used to define UEHGs, we prepare another UEHGs set defined as genes expressed at
more than 300 standard units in all the 79 tissues. This leads to 2,038 genes being grouped as
UEHGs, and 1,509 genes are contained in the original set of 1,789 genes. The evolutionary
rates are compared among the new set of UEHGs, disease genes and other genes. The results
are almost identical as before except for the slight changes in P-values (see details in
97
supplementary materials). This indicates that our findings are not sensitive to the criteria for
defining the UEHGs.
Previous studies have shown that house-keeping genes have shorter coding length (Deane, et
al., 2002) while disease genes usually have longer coding length (Lopez-Bigas and Ouzounis,
2004). We confirm these findings in a three-way comparison (Table 5.3). Since UEHGs are
required to be expressed in all the tissues constitutively, it is beneficial to have the intron and
untranslated regions shorter than other genes. But it is unclear why disease genes are generally
longer than other genes. One possible explanation is that the functions of many disease genes
can only be performed by proteins with certain lengths. For example, some ion channel
proteins (e.g. cystic fibrosis transmembrane conductance regulator, CFTR) need to span
through the membrane multiple times to form the pore structure, a task which can not be
fulfilled by a short protein. Further studies are needed to explore how general such cases are.
We also want to point out that, as shown in Fig 5.6, there are no sharp dividing lines among
essential genes, disease genes and other genes. Some diseases are simply lethal and the
associated genes are essential genes by the definition. Some diseases have much less severe
effects and it is hard to distinguish them from true non-disease genes. Thus, the gene
essentiality might be better described by a continuous spectrum rather than by artificially
divided groups. Even more complicated situations arise when different mutation forms are
considered. Since different mutations usually lead to phenotypes of different severities
(Hamosh, et al., 2002; Ng and Henikoff, 2002), a disease gene could be either a non-essential
gene or essential gene but with non-lethal mutation form. Thus, any simple grouping of human
Figure 5.6. Correlation of disease onset age with Ka/Ks. The correlation of disease onset
age with Ka/Ks. Disease genes are divided into 5 groups based on disease onset age. The
weighted linear regression is applied to disease genes (group 2 to 5) and is shown as the
dotted line. The coefficient for onset age is +0.0086 and P-value is 0.02, derived from the
regression. UEHGs and other genes are plotted on the two sides of the diseases genes for
visual comparison. The standard deviation is indicated by the short horizontal bar and
mean is denoted by the solid circle. The large variation in each group hints for other
confounding factors which also affect Ka/Ks.
genome may lack the power for accurately illustration of the complex scenario associated with
human disease genes.
98
99
5.4 Conclusions
Our studies suggest that human essential genes are a unique group of genes and should not
simply be ignored and classified with non-disease genes for the studies on disease genes. We
also show that disease genes have several properties residing between essential and other
genes. We notice that gene essentiality might better be described in a continuous spectrum
instead of being assigned a class label. Nevertheless, the simplicity of the three-way
classification is good for the purpose of this research since comparisons can be performed
easily.
Extensive knowledge on human essential genes can be critical for the understanding of human
diseases. It has been shown that essential genes may have direct association with diseases such
as cancer (Pickeral, et al., 2000; Thomas, et al., 2003). Studying human essential genes might
also provide key clues for questions such as how human beings evolved. However, limited
attentions have been paid to them and very little systematic studies have been done. We
showed how the picture of disease genes gets clearer when we explicitly consider the essential
genes. We believe the updated global picture of disease genes will enable us to better identify
them in the future (Adie, et al., 2005).
5.5 Methods
5.5.1 Compiling lists of disease genes and UEHGs list
The list of disease genes were obtained from OMIM (Hamosh, et al., 2002). 3,962 records
were listed in the morbidmap (Jun 6, 2005) and entries with known sequence (OMIM ID
marked with *), with known sequence and phenotype (OMIM ID marked with #), and with
100
phenotype description, molecular basis known (OMIM ID marked with +) were retained for
this study. A total of 2,012 genes with unique OMIM Ids were finally collected as human
disease genes.
Ubiquitously expressed genes were obtained from the result of a recent large scale microarray
experiment on human gene expression patterns by (Deng, et al., 2003). A total of 33,698 genes
sampled from 79 tissues were interrogated in their experiments. The overall gene expression
level was 776.5 standard Affymetrix average difference units, and genes with expression level
greater than 550 standard units in at least 73/79 tissues were selected as UEHGs (a
conservative estimation on the percentage of essential genes in the human genome is about
10%, thus the standards were set so that roughly 2,000 genes would be classified as UEHGs).
A total of 1,789 such genes were collected. The set of UEHGs has a small overlap with disease
genes as 176 genes belong to both classes. The full list of UEHGs can be found in Additional
file 2.
5.5.2 Collection of gene features
The mouse and rat homologs and corresponding synonymous substitution rate (Ks),
nonsynomymous substitution rate (Ka) of totally 15,726 human genes were downloaded from
NCBI HomoloGene (Wheeler, et al., 2004). To prevent possible contamination by paralogous
genes, we only considered one-to-one mapped orthologous pairs. To test the statistical
significance of the difference of Ka, Ks and Ka/Ks distributions among the three groups,
Kolmogorov-Smirnov test was used to calculate the p-value as in (Chen, et al., 2005) so that
direct comparisons could be possible. Nucleotide conservation scores were downloaded from
UCSC Genome Browser website (http://hgdownload.cse.ucsc.edu/downloads.html#human).
101
Human sequence variation information was obtained from Swiss-Prot protein knowledgebase
(http://us.expasy.org/sprot/sp-docu.html). The original amino acid positions were mapped to
nucleotide positions on the corresponding chromosome to obtain the conservation score. To
study the correlation of the onset age of a disease with its conservation, we obtained the onset
ages of over 900 genes from (Jimenez-Sanchez, et al., 2001). Weighted least square regression
is used to find the correlations between disease onset ages and Ka/Ks ratios (Huh, et al., 2003).
Yeast genes were collected from NCBI Entrez Gene Database
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene) and were divided into four groups:
UEHG homologs, disease gene homologs, other human gene homologs, and genes without
human homologs. The homologies were obtained from NCBI HomoloGene as described
above. The yeast gene deletion phenotype data were downloaded from Saccharomyces
Genome Database (Cherry, et al., 1998). Similarly, genes in C. elegans were collected from
NCBI Entrez Gene Database and were divided into four groups. RNAi phenotypes of
C.elegans genes were retrieved from WormBase (Chen and Xu, 2005). The RNAi phenotypes
were divided into four categories: lethal (including both embryonic and larval lethal), wild
type, sick (phenotypes other than the above two), and unknown. For genes annotated with
more than one phenotype, the most severe one (assuming lethal>sick>wild) was chosen as
their phenotypes.
The degrees of genes in the protein physical interaction network were retrieved from the
Human Protein Reference Database (HPRD) (Peri, et al., 2003). To compare the function
distribution of the genes in different categories, we used Gene Ontology (GO) Biological
Process for protein function annotation. Gene Ontology annotations of 12,715 human genes
were downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/) and the classifications
102
based on biological processes were used. Similar to (Zhou, et al., 2002), a GO node is referred
as informative if it covers more than 500 genes, and none of its descendant nodes cover that
many genes. 25 GO informative nodes were defined according to the criterion. To test whether
UEHGs, disease genes or other genes were over/under represented in each of the 25 function
categories, we used hyper-geometric distribution to calculate the p-value.
Gene length information was retrieved from UCSC genome table browser (Karolchik, et al.,
2004). All the genes were first mapped to their refSeq IDs for length information retrieval. To
assess the significance of the difference in the length of genes in different categories,
Wilcoxon rank sum test was used to calculate the p-value.
In the process of collecting various features, some genes were not annotated in certain
databases. We limited our comparisons to genes with information. The number of genes
included for each comparison can be found in the corresponding tables or figures. For more
information on the method and materials, see Additional file 1.
5.6 Acknowledgements
We thank all the reviewers for the great suggestions on revision. We thank Andrew Su and
John Hogenesch for answering the question regarding to the expression cutoff value for
UEHGs. We also thank Kim Fechtel for providing their classification of genes so we can
compare the two gene sets. We thank Matthew Lebo for proofreading the manuscript. This
research is supported by NIH/NSF joint mathematical biology initiative DMS-0241102 and by
NIH P50 HG 002790.
103
Chapter VI
Conclusions
104
6.1 Future work
The studies we described in previous chapters were based on our very limited understanding
of the system. Clearly, biological systems are complicated and what we did is rather tiny.
Many problems need further exploration.
Several issues remain unsolved for causal gene inference and pathway identification. First,
we completely relied on known interactions to build gene network. However, the known
interactions may cover only a small fraction of the total interactions in the real gene network.
This partly explains why we had a rather low coverage in the causal gene inference. Also,
not all the known interactions are accurate. Erroneous predictions are expected if the
inferences were based on the unreliable data. Clearly, our approach needs improvement so it
can be more robust to incomplete and inaccurate input data. Second, we only considered
genes that linked to a single locus, but many genes are linked to two or more loci. It will be
of great interest to identify these polygenetic factors using similar approaches. Third, gene
expression has been demonstrated to strongly correlate with disease phenotype. For example,
investigators have successfully used gene expression to distinguish subtypes of cancers and
to predict clinical cancer status (Bhattacharjee, et al., 2001; West, et al., 2001). Thus, it will
be meaningful to explore the possibility of applying our method to identifying disease causal
genes by inferring the causal genes for gene expression variation. Some efforts have been
made on this (Schadt, et al., 2005), and apparently much more research is needed. Although
big challenges are ahead when working on diseases in higher organisms, it is no doubt that
exciting times lie ahead.
105
For the transcriptional module analysis, substantial additional work is required to prove our
hypothesis on how the transcriptional modules evolved. Although we claimed that
transcriptional modules could be dynamic, a quantitative study was not yet performed. The
quantitative study is important because only by doing so can we quantitatively compare
modules and numerically measure the difference. Lastly, we only focused on yeast to study
transcriptional modules. It will be helpful to extend our studies to multiple species and make
comparisons. We believe many interesting results will come out in this new direction.
For protein essentiality prediction, one challenge is to estimate protein essentiality from de
novo. In our work, a large fraction of proteins’ essentialities were known. However, the
situation is completely different for human system where very few proteins’ essentialities are
established. Therefore, our model needs modification so that neighbor proteins’ essentialities
are no longer required for the prediction. As we mentioned in Chapter IV , certain sets of data
can shed some lights on protein’s essentiality, e.g., the conservation score of a protein is
highly correlated with the same protein’s essentiality. Clearly, further exploration for
informative features and improvement on the predictive model will be the two keys for the
success of this work.
As we pointed out when we performed human disease gene classification, there is no clear
dividing line between different gene classes. Instead, it may be beneficial to report a
continuous measurement of gene’s essentiality rather than showing a binary classification.
To achieve this objective, an integrative scoring mechanism is needed to allow a
comprehensive consideration of multiple features of a gene and its product to generate an
unbiased estimation on their essentialities. Secondly, we treated all the disease genes as a
single group and made no discriminations between different disease types. Thus, our
106
conclusions are general to all the disease genes. But we know that different types of disease
genes can be quite different and it may be necessary to further divide these disease genes
into sub-groups based on disease types if we are only interested in genes of a specific disease
type. Better prediction performance may be achieved and more specific understanding of
disease sub-types can be gained by this kind of study.
6.2 Future of the future
We claimed that the future is systems biology and we want to discuss its future. Specifically,
there are a few questions that we want to address in this final section.
First question, can we finally fully understand how the system works? Our answer is a firm
YES. If we are allowed to go back for twenty years, we would be at the time when DNA
sequencing technology was just getting mature and popular. At that stage, not many people
would confidently believe that the whole human genome could be fully sequenced in the near
future. However, twenty years later, we have not only done that, but also sequenced many
other species such as mouse, fly, worm, etc. What’s even better, we are not just sequencing
one human individual (or a few individuals as original human genome sequencing project
planned), but we are going to achieve the ability of re-sequencing any individual for under
1,000 dollars (Mardis, 2006). The today’s situation for systems biology is quite similar. There
are already many new technologies available, although they still need improvements. Given
sufficient time, we are confident that the system can be fully understood.
Second question, what’s the next thing in systems biology? Answers could be various. For
example, 1K$ whole human genome sequencing gives great hope for personalized medicine.
107
This will help us to know what kind of disease we are predisposed to. Precaution can be made
in advance and early detection and treatment will be much feasible. Many other research areas
in systems biology could also be revolutionary. No matter what they are, we are certain that
they are very likely to change our daily life.
Last question, what do we do for the post-systems era, assuming we finally fully understand
the biological system? It may sound too early to raise this question now as we are still at a very
early stage of systems biology. But we should know that to understand the system is never the
whole story. It is neither too early to ask what we should do next. There is a greater aim of
systems biology and that is to use what we know in the systems biology to bring benefits to the
whole human beings. If we know how the system works, modifications would be possible to
make it work better. Clearly, it’s always of a priority to focus on the problems which have
greater potential of improving people’s health and lives.
We are in the best time of systems biology as so many interesting problems are unsolved and
great opportunities are everywhere. Although challenges do exist, difficulties are real, the
future is undoubtedly bright. Finally, we wish that our work could encourage some brave
thoughts or inspire some novel ideas to the systems biology research. Best luck to those who
are devoted to the exploration of the biological systems.
108
References
Adie, E., Adams, R., Evans, K., Porteous, D. and Pickard, B. (2005) Speeding disease gene discovery
by sequence based candidate prioritization. BMC Bioinformatics, 6, 55.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K.,
Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese,
J.C., Richardson, J.E., Ringwald, M., Rubin, G.M. and Sherlock, G. (2000) Gene Ontology: tool for
the unification of biology. Nat Genet, 25, 25-29.
Baggerly, K.A., Coombes, K.R., Hess, K.R., Stivers, D.N., Abruzzo, L.V. and Zhang, W. (2001)
Identifying Differentially Expressed Genes in cDNA Microarray Experiments. Journal of
Computational Biology, 8, 639-659.
Bals, R. and Jany, B. (2001) Identification of disease genes by expression profiling. Eur Respir J, 18,
882-889.
Bar-Joseph, Z., Gerber, G.K., Lee, T.I., Rinaldi, N.J., Yoo, J.Y ., Robert, F., Gordon, D.B., Fraenkel, E.,
Jaakkola, T.S., Young, R.A. and Gifford, D.K. (2003) Computational discovery of gene modules and
regulatory networks. Nat Biotech, 21, 1337-1342.
Basso, K., Margolin, A.A., Stolovitzky, G., Klein, U., Dalla-Favera, R. and Califano, A. (2005)
Reverse engineering of regulatory networks in human B cells. Nat Genet, 37, 382-390.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J.,
Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E.,
Golub, T.R., Sugarbaker, D.J. and Meyerson, M. (2001) Classification of human lung carcinomas by
mRNA expression profiling reveals distinct adenocarcinoma subclasses. PNAS, 98, 13790-13795.
Bhoite, L.T. and Stillman, D.J. (1998) Residues in the Swi5 Zinc Finger Protein That Mediate
Cooperative DNA Binding with the Pho2 Homeodomain Protein. Mol. Cell. Biol., 18, 6436-6446.
Bing, N. and Hoeschele, I. (2005) Genetical genomics analysis of a yeast segregant population for
transcription network inference. Genetics, 170, 533-542.
Blake, J.A., Richardson, J.E., Bult, C.J., Kadin, J.A. and Eppig, J.T. (2003) MGD: the Mouse Genome
Database. Nucl. Acids Res., 31, 193-195.
Bortoluzzi, S., Romualdi, C., Bisognin, A. and Danieli, G.A. (2003) Disease genes and intracellular
protein networks. Physiol. Genomics, 15, 223-227.
Bourque, G., Pevzner, P.A. and Tesler, G. (2004) Reconstructing the Genomic Architecture of
Ancestral Mammals: Lessons From Human, Mouse, and Rat Genomes. Genome Res., 14, 507-516.
Boyle, E.I., Weng, S., Gollub, J., Jin, H., Botstein, D., Cherry, J.M. and Sherlock, G. (2004)
GO::TermFinder--open source software for accessing Gene Ontology information and finding
significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics, 20,
3710-3715.
Brem, R.B. and Kruglyak, L. (2005) The landscape of genetic complexity across 5,700 gene
expression traits in yeast. PNAS, 102, 1572-1577.
109
Brem, R.B., Storey, J.D., Whittle, J. and Kruglyak, L. (2005) Genetic interactions between
polymorphisms that affect gene expression in yeast. Nature, 436, 701-703.
Brem, R.B., Yvert, G., Clinton, R. and Kruglyak, L. (2002) Genetic dissection of transcriptional
regulation in budding yeast. Science, 296, 752-755.
Butte, A.J., Dzau, V.J. and Glueck, S.B. (2001) Further defining housekeeping, or "maintenance,"
genes Focus on "A compendium of gene expression in normal human tissues". Physiol. Genomics, 7,
95-96.
Campillos, M., von Mering, C., Jensen, L.J. and Bork, P. (2006) Identification and analysis of
evolutionarily cohesive functional modules in protein networks. Genome Res., 16, 374-382.
Castresana, J. (2002) Estimation of genetic distances from human and mouse introns. Genome Biology,
3, R28.
Chen, K.-C., Wang, T.-Y ., Tseng, H.-H., Huang, C.-Y .F. and Kao, C.-Y . (2005) A stochastic differential
equation model for quantifying transcriptional regulatory network in Saccharomyces cerevisiae.
Bioinformatics, 21, 2883-2890.
Chen, Y. and Xu, D. (2005) Understanding protein dispensability through machine-learning analysis
of high-throughput data. Bioinformatics, 21, 575-581.
Cherry, J.M., Adler, C., Ball, C., Chervitz, S.A., Dwight, S.S., Hester, E.T., Jia, Y., Juvik, G., Roe, T.,
Schroeder, M., Weng, S. and Botstein, D. (1998) SGD: Saccharomyces Genome Database. Nucl. Acids
Res., 26, 73-79.
Cherry, J.M., Ball, C., Weng, S., Juvik, G., Schmidt, R., Adler, C., Dunn, B., Dwight, S., Riles, L.,
Mortimer, R.K. and Botstein, D. (1997) Genetic and physical maps of Saccharomyces cerevisiae.
Nature, 387, 67-73.
Chervitz, S.A., Aravind, L., Sherlock, G., Ball, C.A., Koonin, E.V., Dwight, S.S., Harris, M.A.,
Dolinski, K., Mohr, S., Smith, T., Weng, S., Cherry, J.M. and Botstein, D. (1998) Comparison of the
Complete Protein Sets of Worm and Yeast: Orthology and Divergence. Science, 282, 2022-2028.
Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G.,
Gabrielian, A.E., Landsman, D., Lockhart, D.J. and Davis, R.W. (1998) A Genome-Wide
Transcriptional Analysis of the Mitotic Cell Cycle. Molecular Cell, 2, 65-73.
Chu, W., Ghahramani, Z., Krause, R. and Wild, D.L. (2006) Identifying protein complexes in
hihg-throughput protein interaction screens using an infinite latent feature model. Pacific Symposium
on Biocomputing, 2006.
Consortium, T.G.O. (2001) Creating the Gene Ontology Resource: Design and Implementation.
Genome Res., 11, 1425-1433.
Deane, C.M., Salwinski, L., Xenarios, I. and Eisenberg, D. (2002) Protein Interactions: Two Methods
for Assessment of the Reliability of High Throughput Observations. Mol Cell Proteomics, 1, 349-356.
Deng, M., Chen, T. and Sun, F. (2004) An Integrated Probabilistic Model for Functional Prediction of
Proteins. Journal of Computational Biology, 11, 463-475.
110
Deng, M., Mehta, S., Sun, F. and Chen, T. (2002) Inferring Domain-Domain Interactions From
Protein-Protein Interactions. Genome Res., 12, 1540-1548.
Deng, M., Sun, F. and Ting, C. (2003) Assessment of the reliability of protein-protein interaction and
protein function prediction. Pac Symp Biocomput, 2003, 140-151.
Deng, M., Tu, Z., Sun, F. and Chen, T. (2004) Mapping gene ontology to proteins based on
protein-protein interaction data. Bioinformatics, 20, 895-902.
Deng, M., Zhang, K., Mehta, S., Chen, T. and Sun, F. (2003) Prediction of Protein Function Using
Protein-Protein Interaction Data. Journal of Computational Biology, 10, 947-960.
Deng, M., Zhang, K., Mehta, S., Chen, T. and Sun, F. (2003) Prediction of Protein Function Using
Protein–Protein Interaction Data. Journal of Computational Biology, 10, 947-960.
Dodge-Kafka, K.L., Soughayer, J., Pare, G.C., Carlisle Michel, J.J., Langeberg, L.K., Kapiloff, M.S.
and Scott, J.D. (2005) The protein kinase A anchoring protein mAKAP coordinates two integrated
cAMP effector pathways. Nature, 437, 574-578.
Doolin, M.-T., Johnson, A.L., Johnston, L.H. and Butler, G. (2001) Overlapping and distinct roles of
the duplicated yeast transcription factors Ace2p and Swi5p. Molecular Microbiology, 40, 422-432.
Ficarro, S.B., McCleland, M.L., Stukenberg, P.T., Burke, D.J., Ross, M.M., Shabanowitz, J., Hunt,
D.F. and White, F.M. (2002) Phosphoproteome analysis by mass spectrometry and its application to
Saccharomyces cerevisiae. Nat Biotech, 20, 301-305.
Friedman, N., Linial, M., Nachman, I. and Pe'er, D. (2000) Using Bayesian Networks to Analyze
Expression Data. Journal of Computational Biology, 7, 601-620.
Gavin, A.-C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J.M.,
Michon, A.-M., Cruciat, C.-M., Remor, M., Hofert, C., Schelder, M., Brajenovic, M., Ruffner, H.,
Merino, A., Klein, K., Hudak, M., Dickson, D., Rudi, T., Gnau, V., Bauch, A., Bastuck, S., Huhse, B.,
Leutwein, C., Heurtier, M.-A., Copley, R.R., Edelmann, A., Querfurth, E., Rybin, V., Drewes, G.,
Raida, M., Bouwmeester, T., Bork, P., Seraphin, B., Kuster, B., Neubauer, G. and Superti-Furga, G.
(2002) Functional organization of the yeast proteome by systematic analysis of protein complexes.
Nature, 415, 141-147.
Giaever, G., Chu, A.M., Ni, L., Connelly, C., Riles, L., Veronneau, S., Dow, S., Lucau-Danila, A.,
Anderson, K., Andre, B., Arkin, A.P., Astromoff, A., El Bakkoury, M., Bangham, R., Benito, R.,
Brachat, S., Campanaro, S., Curtiss, M., Davis, K., Deutschbauer, A., Entian, K.-D., Flaherty, P.,
Foury, F., Garfinkel, D.J., Gerstein, M., Gotte, D., Guldener, U., Hegemann, J.H., Hempel, S., Herman,
Z., Jaramillo, D.F., Kelly, D.E., Kelly, S.L., Kotter, P., LaBonte, D., Lamb, D.C., Lan, N., Liang, H.,
Liao, H., Liu, L., Luo, C., Lussier, M., Mao, R., Menard, P., Ooi, S.L., Revuelta, J.L., Roberts, C.J.,
Rose, M., Ross-Macdonald, P., Scherens, B., Schimmack, G., Shafer, B., Shoemaker, D.D.,
Sookhai-Mahadeo, S., Storms, R.K., Strathern, J.N., Valle, G., Voet, M., Volckaert, G., Wang, C.-y.,
Ward, T.R., Wilhelmy, J., Winzeler, E.A., Yang, Y ., Yen, G., Youngman, E., Yu, K., Bussey, H., Boeke,
J.D., Snyder, M., Philippsen, P., Davis, R.W. and Johnston, M. (2002) Functional profiling of the
Saccharomyces cerevisiae genome. Nature, 418, 387-391.
Hahn, J.-S., Hu, Z., Thiele, D.J. and Iyer, V .R. (2004) Genome-Wide Analysis of the Biology of Stress
Responses through Heat Shock Transcription Factor. Mol. Cell. Biol., 24, 5249-5256.
111
Hamosh, A., Scott, A.F., Amberger, J., Bocchini, C., Valle, D. and McKusick, V.A. (2002) Online
Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucl.
Acids Res., 30, 52-55.
Han, J.-D.J., Bertin, N., Hao, T., Goldberg, D.S., Berriz, G.F., Zhang, L.V., Dupuy, D., Walhout,
A.J.M., Cusick, M.E., Roth, F.P. and Vidal, M. (2004) Evidence for dynamically organized modularity
in the yeast protein-protein interaction network. Nature, 430, 88-93.
Harbison, C.T., Gordon, D.B., Lee, T.I., Rinaldi, N.J., Macisaac, K.D., Danford, T.W., Hannett, N.M.,
Tagne, J.-B., Reynolds, D.B., Yoo, J., Jennings, E.G., Zeitlinger, J., Pokholok, D.K., Kellis, M., Rolfe,
P.A., Takusagawa, K.T., Lander, E.S., Gifford, D.K., Fraenkel, E. and Young, R.A. (2004)
Transcriptional regulatory code of a eukaryotic genome. Nature, 431, 99-104.
Harbison, C.T., Gordon, D.B., Lee, T.I., Rinaldi, N.J., Macisaac, K.D., Danford, T.W., Hannett, N.M.,
Tagne, J.B., Reynolds, D.B., Yoo, J., Jennings, E.G., Zeitlinger, J., Pokholok, D.K., Kellis, M., Rolfe,
P.A., Takusagawa, K.T., Lander, E.S., Gifford, D.K., Fraenkel, E. and Young, R.A. (2004)
Transcriptional regulatory code of a eukaryotic genome. Nature, 431, 99-104.
Hauser, M.A., Li, Y.-J., Takeuchi, S., Walters, R., Noureddine, M., Maready, M., Darden, T., Hulette,
C., Martin, E., Hauser, E., Xu, H., Schmechel, D., Stenger, J.E., Dietrich, F. and Vance, J. (2003)
Genomic convergence: identifying candidate genes for Parkinson's disease by combining serial
analysis of gene expression and genetic linkage. Hum. Mol. Genet., 12, 671-677.
Hershberg, R. and Margalit, H. (2006) Co-evolution of transcription factors and their targets depends
on mode of regulation. Genome Biology, 7, R62.
Hirsh, A.E. and Fraser, H.B. (2001) Protein dispensability and rate of evolution. Nature, 411,
1046-1049.
Hishigaki, H., Nakai, K., Ono, T., Tanigami, A. and Takagi, T. (2001) Assessment of prediction
accuracy of protein function from protein-protein interaction data. Yeast, 18, 523-531.
Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.-L., Millar, A., Taylor, P., Bennett,
K., Boutilier, K., Yang, L., Wolting, C., Donaldson, I., Schandorff, S., Shewnarane, J., V o, M., Taggart,
J., Goudreault, M., Muskat, B., Alfarano, C., Dewar, D., Lin, Z., Michalickova, K., Willems, A.R.,
Sassi, H., Nielsen, P.A., Rasmussen, K.J., Andersen, J.R., Johansen, L.E., Hansen, L.H., Jespersen, H.,
Podtelejnikov, A., Nielsen, E., Crawford, J., Poulsen, V., Sorensen, B.D., Matthiesen, J., Hendrickson,
R.C., Gleeson, F., Pawson, T., Moran, M.F., Durocher, D., Mann, M., Hogue, C.W.V., Figeys, D. and
Tyers, M. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass
spectrometry. Nature, 415, 180-183.
Hughes, T.R., Marton, M.J., Jones, A.R., Roberts, C.J., Stoughton, R., Armour, C.D., Bennett, H.A.,
Coffey, E., Dai, H., He, Y .D., Kidd, M.J., King, A.M., Meyer, M.R., Slade, D., Lum, P.Y ., Stepaniants,
S.B., Shoemaker, D.D., Gachotte, D., Chakraburtty, K., Simon, J., Bard, M. and Friend, S.H. (2000)
Functional discovery via a compendium of expression profiles. Cell, 102, 109-126.
Huh, W.-K., Falvo, J.V., Gerke, L.C., Carroll, A.S., Howson, R.W., Weissman, J.S. and O'Shea, E.K.
(2003) Global analysis of protein localization in budding yeast. Nature, 425, 686-691.
Ideker, T., Ozier, O., Schwikowski, B. and Siegel, A.F. (2002) Discovering regulatory and signalling
circuits in molecular interaction networks. Bioinformatics, 18, S233-240.
112
Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y. and Barkai, N. (2002) Revealing modular
organization in the yeast transcriptional network. Nat Genet, 31, 370-377.
Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M. and Sakaki, Y. (2001) A comprehensive
two-hybrid analysis to explore the yeast protein interactome. PNAS, 98, 4569-4574.
Jansen, R., Greenbaum, D. and Gerstein, M. (2002) Relating Whole-Genome Expression Data with
Protein-Protein Interactions. Genome Res., 12, 37-46.
Jeong, H., Mason, S.P., Barabasi, A.L. and Oltvai, Z.N. (2001) Lethality and centrality in protein
networks. Nature, 411, 41-42.
Jeong, H., Oltvai, Z.N. and Barabasi, A.L. (2003) Prediction of Protein Essentiality Based on
Genomic Data. Complexus, 1, 19-28.
Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N. and Barabasi, A.L. (2000) The large-scale organization
of metabolic networks. Nature, 407, 651-654.
Jie Chen, S.K.S. (2006) A Bayesian determination of threshold for identifying differentially expressed
genes in microarray experiments. Statistics in Medicine, 25, 3174-3189.
Jimenez-Sanchez, G., Childs, B. and Valle, D. (2001) Human disease genes. Nature, 409, 853-855.
Karolchik, D., Hinrichs, A.S., Furey, T.S., Roskin, K.M., Sugnet, C.W., Haussler, D. and Kent, W.J.
(2004) The UCSC Table Browser data retrieval tool. Nucl. Acids Res., 32, D493-496.
Kondor, R.I. and Lafferty, J.D. (2002) Diffusion Kernels on Graphs and Other Discrete Input Spaces.
Proceedings of the Nineteenth International Conference on Machine Learning. Morgan Kaufmann
Publishers Inc.
Lagerqvist, J., Zwolak, M. and DiVentra, M. (2006) Fast DNA Sequencing via Transverse Electronic
Transport. Nano Lett., 6, 779-782.
Lanckriet, G.R.G., Deng, M., Cristianini, N., Jordan, M. and Noble, W.S. (2004) Kernel-based data
fusion and its application to protein function in yeast. Proceedings of the Pacific Symposium on
Biocomputing, 300-311.
Lau, N.C., Lim, L.P., Weinstein, E.G. and Bartel, D.P. (2001) An Abundant Class of Tiny RNAs with
Probable Regulatory Roles in Caenorhabditis elegans. Science, 294, 858-862.
Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison,
C.T., Thompson, C.M., Simon, I., Zeitlinger, J., Jennings, E.G., Murray, H.L., Gordon, D.B., Ren, B.,
Wyrick, J.J., Tagne, J.-B., Volkert, T.L., Fraenkel, E., Gifford, D.K. and Young, R.A. (2002)
Transcriptional Regulatory Networks in Saccharomyces cerevisiae. Science, 298, 799-804.
Lemmens, K., Dhollander, T., De Bie, T., Monsieurs, P., Engelen, K., Smets, B., Winderickx, J., De
Moor, B. and Marchal, K. (2006) Inferring transcriptional modules from ChIP-chip, motif and
microarray data. Genome Biology, 7, R37.
Leyfer, D. and Weng, Z. (2005) Genome-wide decoding of hierarchical modular structure of
transcriptional regulation by cis-element and expression clustering. Bioinformatics, 21, ii197-203.
113
Li, H., Chen, H., Bao, L., Manly, K.F., Chesler, E.J., Lu, L., Wang, J., Zhou, M., Williams, R.W. and
Cui, Y. (2006) Integrative genetic analysis of transcription modules: towards filling the gap between
genetic loci and inherited traits. Hum. Mol. Genet., 15, 481-492.
Li, H., Lu, L., Manly, K.F., Chesler, E.J., Bao, L., Wang, J., Zhou, M., Williams, R.W. and Cui, Y.
(2005) Inferring gene transcriptional modulatory relations: a genetical genomics approach. Hum. Mol.
Genet., 14, 1119-1125.
Lopez-Bigas, N. and Ouzounis, C.A. (2004) Genome-wide identification of genes likely to be
involved in human genetic disease. Nucl. Acids Res., 32, 3108-3114.
Makalowski, W. and Boguski, M.S. (1998) Evolutionary parameters of the transcribed mammalian
genome: An analysis of 2,820 orthologous rodent and human sequences. PNAS, 95, 9407-9412.
Mardis, E. (2006) Anticipating the $1,000 genome. Genome Biology, 7, 112.
Maslov, S. and Sneppen, K. (2002) Specificity and Stability in Topology of Protein Networks. Science,
296, 910-913.
Mewes, H.W., Frishman, D., Guldener, U., Mannhaupt, G., Mayer, K., Mokrejs, M., Morgenstern, B.,
Munsterkotter, M., Rudd, S. and Weil, B. (2002) MIPS: a database for genomes and protein sequences.
Nucl. Acids Res., 30, 31-34.
Miller, M.P. and Kumar, S. (2001) Understanding human disease mutations through the use of
interspecific genetic variation. Hum. Mol. Genet., 10, 2319-2328.
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. (2002) Network Motifs:
Simple Building Blocks of Complex Networks. Science, 298, 824-827.
Morley, M., Molony, C.M., Weber, T.M., Devlin, J.L., Ewens, K.G., Spielman, R.S. and Cheung, V.G.
(2004) Genetic analysis of genome-wide variation in human gene expression. Nature, 430, 743-747.
Nacher, J.C., Schwartz, J.-M., Kanehisa, M. and Akutsu, T. (2006) Identification of metabolic units
induced by environmental signals. Bioinformatics, 22, e375-383.
Nelander, S., Larsson, E., Kristiansson, E., Mansson, R., Nerman, O., Sigvardsson, M., Mostad, P. and
Lindahl, P. (2005) Predictive screening for regulators of conserved functional gene modules (gene
batteries) in mammals. BMC Genomics, 6, 68.
Ng, P.C. and Henikoff, S. (2002) Accounting for Human Polymorphisms Predicted to Affect Protein
Function. Genome Res., 12, 436-446.
Ogawa, N., DeRisi, J. and Brown, P.O. (2000) New components of a system for phosphate
accumulation and polyphosphate metabolism in Saccharomyces cerevisiae revealed by genomic
expression analysis. Mol Biol Cell, 11, 4309-4321.
Ohta, T. and Ina, Y. (1995) Variation in synonymous substitution rates among mammalian genes and
the correlation between synonymous and nonsynonymous divergences. Journal of Molecular
Evolution, 41, 717-720.
114
Peri, S., Navarro, J.D., Amanchy, R., Kristiansen, T.Z., Jonnalagadda, C.K., Surendranath, V.,
Niranjan, V., Muthusamy, B., Gandhi, T.K.B., Gronborg, M., Ibarrola, N., Deshpande, N., Shanker, K.,
Shivashankar, H.N., Rashmi, B.P., Ramya, M.A., Zhao, Z., Chandrika, K.N., Padma, N., Harsha, H.C.,
Yatish, A.J., Kavitha, M.P., Menezes, M., Choudhury, D.R., Suresh, S., Ghosh, N., Saravana, R.,
Chandran, S., Krishna, S., Joy, M., Anand, S.K., Madavan, V., Joseph, A., Wong, G.W., Schiemann,
W.P., Constantinescu, S.N., Huang, L., Khosravi-Far, R., Steen, H., Tewari, M., Ghaffari, S., Blobe,
G.C., Dang, C.V., Garcia, J.G.N., Pevsner, J., Jensen, O.N., Roepstorff, P., Deshpande, K.S.,
Chinnaiyan, A.M., Hamosh, A., Chakravarti, A. and Pandey, A. (2003) Development of Human
Protein Reference Database as an Initial Platform for Approaching Systems Biology in Humans.
Genome Res., 13, 2363-2371.
Pickeral, O.K., Li, J.Z., Barrow, I., Boguski, M.S., Makalowski, W. and Zhang, J. (2000) Classical
oncogenes and tumor suppressor genes: a comparative genomics perspective. Neoplasia, 2, 280-286.
Ptacek, J., Devgan, G., Michaud, G., Zhu, H., Zhu, X., Fasolo, J., Guo, H., Jona, G., Breitkreutz, A.,
Sopko, R., McCartney, R.R., Schmidt, M.C., Rachidi, N., Lee, S.-J., Mah, A.S., Meng, L., Stark,
M.J.R., Stern, D.F., De Virgilio, C., Tyers, M., Andrews, B., Gerstein, M., Schweitzer, B., Predki, P.F.
and Snyder, M. (2005) Global analysis of protein phosphorylation in yeast. Nature, 438, 679-684.
Qi, Y., Ye, P. and Bader, J. (2005) Genetic interaction motif finding by expectation maximization - a
novel statistical model for inferring gene modules from synthetic lethality. BMC Bioinformatics, 6,
288.
Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N. and Barabasi, A.L. (2002) Hierarchical
Organization of Modularity in Metabolic Networks. Science, 297, 1551-1555.
Reiner, A., Yekutieli, D. and Benjamini, Y. (2003) Identifying differentially expressed genes using
false discovery rate controlling procedures. Bioinformatics, 19, 368-375.
Rives, A.W. and Galitski, T. (2003) Modular organization of cellular networks. PNAS, 100,
1128-1133.
Rogers, S. and Girolami, M. (2005) A Bayesian regression approach to the inference of regulatory
networks from gene expression data. Bioinformatics, 21, 3131-3137.
Santangelo, G.M. (2006) Glucose Signaling in Saccharomyces cerevisiae. Microbiol. Mol. Biol. Rev.,
70, 253-282.
Schadt, E.E., Lamb, J., Yang, X., Zhu, J., Edwards, S., Guhathakurta, D., Sieberts, S.K., Monks, S.,
Reitman, M., Zhang, C., Lum, P.Y ., Leonardson, A., Thieringer, R., Metzger, J.M., Yang, L., Castle, J.,
Zhu, H., Kash, S.F., Drake, T.A., Sachs, A. and Lusis, A.J. (2005) An integrative genomics approach
to infer causal associations between gene expression and disease. Nat Genet, 37, 710-717.
Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B.
and Ideker, T. (2003) Cytoscape: A Software Environment for Integrated Models of Biomolecular
Interaction Networks. Genome Res., 13, 2498-2504.
Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A.S., Hou, M., Rosenbloom, K., Clawson, H., Spieth,
J., Hillier, L.W., Richards, S., Weinstock, G.M., Wilson, R.K., Gibbs, R.A., Kent, W.J., Miller, W. and
Haussler, D. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.
Genome Res., 15, 1034-1050.
115
Simonis, N., van Helden, J., Cohen, G. and Wodak, S. (2004) Transcriptional regulation of protein
complexes in yeast. Genome Biology, 5, R33.
Slack, F. (2006) Regulatory RNAs and the demise of 'junk' DNA. Genome Biology, 7, 328.
Smyth, G., Yang, Y . and Speed, T. (2003) Statistical issues in cDNA microarray data analysis. Methods
Mol Biol, 224, 111-136.
Sonnichsen, B., Koski, L.B., Walsh, A., Marschall, P., Neumann, B., Brehm, M., Alleaume, A.M.,
Artelt, J., Bettencourt, P., Cassin, E., Hewitson, M., Holz, C., Khan, M., Lazik, S., Martin, C.,
Nitzsche, B., Ruer, M., Stamford, J., Winzi, M., Heinkel, R., Roder, M., Finell, J., Hantsch, H., Jones,
S.J.M., Jones, M., Piano, F., Gunsalus, K.C., Oegema, K., Gonczy, P., Coulson, A., Hyman, A.A. and
Echeverri, C.J. (2005) Full-genome RNAi profiling of early embryogenesis in Caenorhabditis elegans.
Nature, 434, 462-469.
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein,
D. and Futcher, B. (1998) Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast
Saccharomyces cerevisiae by Microarray Hybridization. Mol. Biol. Cell, 9, 3273-3297.
Steffen, M., Petti, A., Aach, J., D'Haeseleer, P. and Church, G. (2002) Automated modelling of signal
transduction networks. BMC Bioinformatics, 3, 34.
Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. PNAS, 100,
9440-9445.
Stuart, J.M., Segal, E., Koller, D. and Kim, S.K. (2003) A Gene-Coexpression Network for Global
Discovery of Conserved Genetic Modules. Science, 302, 249-255.
Tanay, A., Sharan, R., Kupiec, M. and Shamir, R. (2004) Revealing modularity and organization in the
yeast molecular network by integrated analysis of highly heterogeneous genomewide data. PNAS, 101,
2981-2986.
Thomas, M.A., Weston, B., Joseph, M., Wu, W., Nekrutenko, A. and Tonellato, P.J. (2003)
Evolutionary Dynamics of Oncogenes and Tumor Suppressor Genes: Higher Intensities of Purifying
Selection than Other Genes. Mol Biol Evol, 20, 964-968.
Tong, A.H.Y., Evangelista, M., Parsons, A.B., Xu, H., Bader, G.D., Page, N., Robinson, M.,
Raghibizadeh, S., Hogue, C.W.V., Bussey, H., Andrews, B., Tyers, M. and Boone, C. (2001)
Systematic Genetic Analysis with Ordered Arrays of Yeast Deletion Mutants. Science, 294,
2364-2368.
Turk, R., t Hoen, P., Sterrenburg, E., de Menezes, R., de Meijer, E., Boer, J., van Ommen, G.-J. and
den Dunnen, J. (2004) Gene expression variation between mouse inbred strains. BMC Genomics, 5,
57.
Tusher, V.G., Tibshirani, R. and Chu, G. (2001) Significance analysis of microarrays applied to the
ionizing radiation response. PNAS, 98, 5116-5121.
Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V.,
Srinivasan, M., Pochart, P., Qureshi-Emili, A., Li, Y., Godwin, B., Conover, D., Kalbfleisch, T.,
Vijayadamodar, G., Yang, M., Johnston, M., Fields, S. and Rothberg, J.M. (2000) A comprehensive
analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403, 623-627.
116
van 't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y .D., Hart, A.A.M., Mao, M., Peterse, H.L., van der
Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S.,
Bernards, R. and Friend, S.H. (2002) Gene expression profiling predicts clinical outcome of breast
cancer. Nature, 415, 530-536.
Vazquez, A., Flammini, A., Maritan, A. and Vespignani, A. (2003) Global protein function prediction
from protein-protein interaction networks. Nat Biotech, 21, 697-700.
von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S. and Bork, P. (2002)
Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417, 399-403.
Warrington, J.A., Nair, A., Mahadevappa, M. and Tsyganskaya, M. (2000) Comparison of human
adult and fetal expression and identification of 535 housekeeping/maintenance genes. Physiol.
Genomics, 2, 143-147.
Wesche, P.L., Gaffney, D.J. and Keightley, P.D. (2004) DNA sequence error rates in Genbank records
estimated using the mouse genome as a reference. DNA Sequence, 15, 362-364.
West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Olson, J.A., Jr.,
Marks, J.R. and Nevins, J.R. (2001) Predicting the clinical status of human breast cancer by using
gene expression profiles. PNAS, 98, 11462-11467.
Wheeler, D.L., Church, D.M., Edgar, R., Federhen, S., Helmberg, W., Madden, T.L., Pontius, J.U.,
Schuler, G.D., Schriml, L.M., Sequeira, E., Suzek, T.O., Tatusova, T.A. and Wagner, L. (2004)
Database resources of the National Center for Biotechnology Information: update. Nucl. Acids Res.,
32, D35-40.
Winzeler, E.A., Shoemaker, D.D., Astromoff, A., Liang, H., Anderson, K., Andre, B., Bangham, R.,
Benito, R., Boeke, J.D., Bussey, H., Chu, A.M., Connelly, C., Davis, K., Dietrich, F., Dow, S.W., El
Bakkoury, M., Foury, F., ccedil, oise, Friend, S.H., Gentalen, E., Giaever, G., Hegemann, J.H., Jones,
T., Laub, M., Liao, H., Liebundguth, N., Lockhart, D.J., Lucau-Danila, A., Lussier, M., M'Rabet, N.,
Menard, P., Mittmann, M., Pai, C., Rebischung, C., Revuelta, J.L., Riles, L., Roberts, C.J.,
Ross-MacDonald, P., Scherens, B., Snyder, M., Sookhai-Mahadeo, S., Storms, R.K., eacute, ronneau,
S., Voet, M., Volckaert, G., Ward, T.R., Wysocki, R., Yen, G.S., Yu, K., Zimmermann, K., Philippsen,
P., Johnston, M. and Davis, R.W. (1999) Functional Characterization of the Saccharomyces cerevisiae
Genome by Gene Deletion and Parallel Analysis. Science, 285, 901-906.
Wolf, D.M. and Arkin, A.P. (2003) Motifs, modules and games in bacteria. Current Opinion in
Microbiology, 6, 125-134.
Wolfe, K.H. and Sharp, P.M. (1993) Mammalian gene evolution: Nucleotide sequence divergence
between mouse and rat. Journal of Molecular Evolution, 37, 441-456.
Wuchty, S., Oltvai, Z.N. and Barabasi, A.L. (2003) Evolutionary conservation of motif constituents in
the yeast protein interaction network. Nat Genet, 35, 176-179.
Xing, B. and van der Laan, M.J. (2005) A causal inference approach for constructing transcriptional
regulatory networks. Bioinformatics, 21, 4007-4013.
Yeang, C.-H., Ideker, T. and Jaakkola, T. (2004) Physical Network Models. Journal of Computational
Biology, 11, 243-262.
117
Yeang, C.-H., Mak, H.C., McCuine, S., Workman, C., Jaakkola, T. and Ideker, T. (2005) Validation
and refinement of gene-regulatory pathways on a network of physical interactions. Genome Biology, 6,
R62.
Yeger-Lotem, E., Sattath, S., Kashtan, N., Itzkovitz, S., Milo, R., Pinter, R.Y., Alon, U. and Margalit,
H. (2004) Network motifs in integrated cellular networks of transcription-regulation and
protein-protein interaction. PNAS, 101, 5934-5939.
Yoshimoto, H., Saltsman, K., Gasch, A.P., Li, H.X., Ogawa, N., Botstein, D., Brown, P.O. and Cyert,
M.S. (2002) Genome-wide analysis of gene expression regulated by the calcineurin/Crz1p signaling
pathway in Saccharomyces cerevisiae. J Biol Chem, 277, 31079-31088.
Yu, H., Greenbaum, D., Xin Lu, H., Zhu, X. and Gerstein, M. (2004) Genomic analysis of essentiality
within protein networks. Trends in Genetics, 20, 227-231.
Yvert, G., Brem, R.B., Whittle, J., Akey, J.M., Foss, E., Smith, E.N., Mackelprang, R. and Kruglyak, L.
(2003) Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription
factors. Nat Genet, 35, 57-64.
Zanton, S.J. and Pugh, B.F. (2004) Changes in genomewide occupancy of core transcriptional
regulators during heat stress. PNAS, 101, 16843-16848.
Zhou, X., Kao, M.-C.J. and Wong, W.H. (2002) From the Cover: Transitive functional annotation by
shortest-path analysis of gene expression data. PNAS, 99, 12783-12788.
Zhu, H., Klemic, J.F., Chang, S., Bertone, P., Casamayor, A., Klemic, K.G., Smith, D., Gerstein, M.,
Reed, M.A. and Snyder, M. (2000) Analysis of yeast protein kinases using protein chips. Nat Genet,
26, 283-289.
Zien, A., Kuffner, R., Zimmer, R., Lengauer and Thomas (2000) Analysis of gene expression data
with pathway scores. Proc Int Conf Intell Syst Mol Biol, 8, 407-417.
Appendix A
In this section, details are provided on the methods and results which are omitted in chapter II.
1. Selecting sub-set of conditions
We explored two methods to select appropriate subset of conditions as described in the
Method section 2.2.2 in the main text. Both of them require some threshold to be set. The first
method followed the signature algorithm developed by Ihmels to select appropriate subset of
conditions (Ihmels, et al., 2002). Suppose the expression levels of gene are measured under
conditions in the original data set, denoted as . A condition is selected if it
satisfies equation (A1.1), where
t g
M
1
,..., t t
M
g g O O m
t g O is the average expression level and is the standard
deviation.
t g σ
t
t
t
m
g
g
g
OO
τ
σ
−
> (A1.1)
Decision need to be made on the value τ . If we set very big, the gain is that genes under these
conditions are more likely being perturbed. However, this will reduce the total number of
conditions that can be included for calculation of correlation. As we know that correlations
calculated on small number of conditions are usually less reliable. To balance these two
considerations, we empirically choose . For most genes, this will allow >30 conditions
satisfying the equation (A1.1).
τ
1 τ =
118
Our second approach is based on sampling. First issue is to determine how many conditions
should be sampled. Although an appealing choice could be using the first method to estimate
how many conditions are significantly perturbed and sample that many number of conditions,
we decide to use a simple constant size, i.e, half of the total conditions. Obviously this is could
be too many for some genes and too few for other genes, but it works well for most cases.
Second issue with such sampling is that the sample space is extremely large, i.e., the
combination of conditions grows exponentially. Our current computation power won’t allow
us to perform exhaustive sampling, but we found that for many TF-target genes (as reported in
the main text 585 genes with 1,403 TFs, with ( ), we can very fast find combination of
conditions. For other genes, which we failed to obtain such subset of conditions in affordable
time, it could be due to either no such subset actually exists or the probability to sample it is far
way too small. We don’t consider these genes in our second method..
0.5 τ′ =
2. Testing on knockout data
There are in total 118 knockout experiments which generated significant perturbation to the
genome expression profile. Since we need first identify genes whose expression is perturbed
by knock-out instead of by chance, we used the results from (Hughes, et al., 2000) which
consider explicitly an noise model. However, even by doing that, we found some genes
appears not share any common TF with other perturbed genes in the same knock-out
experiment. As our model relies on accurate inference of the TFs of the target genes, we
further filter out genes which don’t share any TFs with other perturbed genes in the same
knockout experiment. After this filtering, we simulated eQTL region for each knocked-out
gene so that 10/20 genes were contained in each eQTL. The list of these 118 genes and genes
in the simulated eQTL can be downloaded from
119
120
http://www.cmb.usc.edu/people/ztu/ismb2006/. The prediction results are also downloadable.
See subsection 4 below for the descriptions on downloadable files.
3. Pathway inference
The algorithm of our pathway inference is well described in the main text and we don’t repeat
it here. For each inference, we repeat the “random walk” 10000 times, and we set up an upper
bound for allowed steps to be 15. In general, most paths can be identified by 2 to 5 steps and
valid genes normally get several hundred to more than a thousand hits. In order to reduce
computation burden, we only consider genes with maximum hits above 100 and performed
permutation test on them. The valid genes and inferred pathways are provided.
121
Appendix B
Cat. # GO ID Description # of Genes
0 0000004 biological process unknown 266
1 0007165 signal transduction 121
2 0007067 mitosis 112
3 0000074 regulation of cell cycle 106
4 0000902 cellular morphogenesis 104
5 0006364 rRNA processing 125
6 0016568 chromatin modification 131
7 0007010 cytoskeleton organization and biogenesis 231
8 0006412 protein biosynthesis 168
9 0006508 proteolysis and peptidolysis 105
10 0006464 protein modification 244
11 0006091 generation of precursor metabolites and energy 106
12 0006357 regulation of transcription from Pol II promoter 129
13 0006397 mRNA processing 109
14 0019752 carboxylic acid metabolism 108
15 0016310 phosphorylation 111
16 0048193 Golgi vesicle transport 103
17 0006605 protein targeting 156
18 0045045 secretory pathway 145
19 0000003 reproduction 127
20 0005975 carbohydrate metabolism 108
21 0009892 negative regulation of metabolism 105
22 0006974 response to DNA damage stimulus 114
23 0009628 response to abiotic stimulus 119
Table B1: Function categories which cover at least 100 proteins.
122
Cat. # GO ID Description # of Genes
0 0000228 nuclear chromosome 114
1 0005829 cytosol 169
2 0005783 endoplasmic reticulum 135
3 0005739 mitochondrion 199
4 0005856 cytoskeleton 165
5 0005730 nucleolus 145
6 0005667 transcription factor complex 106
7 0030529 ribonucleoprotein complex 211
8 0012505 endomembrane system 122
9 0031090 organelle membrane 198
10 0008372 cellular component unknown 145
Table B2: Location categories which cover at least 100 proteins.
123
Name Posterior probability Prediction
YAR064W 0.03 viable
YAR073W 0.08 viable
YCL066W 0.17 viable
YCR096C 0.03 viable
YDL246C 0.09 viable
YDL247W 0.03 viable
YER089C 0.18 viable
YFL066C 0.17 viable
YGL100W 0.72 inviable
YGR296W 0.07 viable
YHL050C 0.03 vialbe
YHR072W-A 0.94 inviable
YHR149C 0.03 viable
YHR215W 0.02 viable
YIL122W 0.04 viable
YIL177C 0.03 viable
YIR040C 0.19 viable
YJR027W 0.06 viable
YJR028W 0.06 viable
YJR159W 0.20 viable
YKL224C 0.02 viable
YMR323W 0.03 viable
YNL086W 0.03 viable
YOR388C 0.05 viable
YPL249C-A 0.29 viable
YPR102C 0.66 inviable
Table B3: Prediction of the essentialites of unknown yeast proteins.
Name
YDR364C YDR378C YFR036W
YGL240W YKL009W YNL147W
YOR308C
Table B4: List of proteins which have high posterior probability of being
essential but are non-essential in the SGD phenotype data set.
Figure B1. The number of proteins with given number of domains. A total of 4741
proteins are included.
124
125
Appendix C
In this section, details are provided on the methods and results which are omitted in chapter V .
Compile disease gene list
Disease genes are obtained from OMIM (Online Mendelian Inheritance in Man,
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM). There are 3,962 records in the
morbidmap (Jun 6, 2005) and we restrict to entries with known sequence (OMIM ID marked
with *), with known sequence and phenotype (OMIM ID marked with #) and with phenotype
description, molecular basis known (OMIM ID marked with +). After filtering, we obtain a list
of 2,012 genes with unique OMIM IDs.
Compile ubiquitously-expressed genes list
Ubiquitously expressed genes are obtained from the gene expression experiment results by Su
1
.
The data are publicly available for download at http://symatlas.gnf.org. The overall expression
level is 776.5 standard Affymetrix average difference units, and we choose genes with
expression level greater than 550 standard units in at least 73/79 tissues, these genes are
treated as ubiquitously-expressed genes. A total of 1,789 such genes are collected and the list
is online. As different criteria will lead to different set of output, we first estimate how many
essential genes the human genome could contain. A conservative estimation is roughly 10% of
human genome are essential genes, thus the desired number of UEHGs will be 2,000 to 3,000.
We choose number close to 2,000 for two reasons, firstly we prefer a small false positive rate
126
and secondly we want to have a roughly equal number of essential genes and disease genes for
easier comparison.
Mouse homolog and calculation of Ka, Ks and Ka/Ks
The mouse homologs and corresponding Ka, Ks of totally 15,726 human genes are
pre-calculated and downloaded from NCBI HomoloGene
(ftp://ftp.ncbi.nih.gov/pub/HomoloGene/build41.1/). For those Human genes with more than
one mouse homologs, we choose the one with the smallest Ka value (175 such cases). Among
the 15,726 human genes, 1,641 are successfully mapped to UEHGs list, 1,771 are mapped to
disease gene list, and 12,494 are categorized as others. To test the statistical significance of the
difference of Ka, Ks and Ka/Ks distributions among the three groups, a Kolmogorov-Smirnov
test is used to calculate the p-value.
Conservation of nucleotide and amino acid
To study the conservation at the nucleotide/amino acid level for different genes, we use the
results of a recent research based on 8 species multiple-sequence alignment
2
. Conservation
scores are downloaded from UCSC Genome Browser website
(http://hgdownload.cse.ucsc.edu/downloads.html#human). We only focus on the conservation
of coding regions. To calculate the conservation score for an amino acid, we take the average
of the conservation scores of the three nucleotides which correspond to the codon for the
amino acid. We also retrieve the human sequence variation information from Swiss-Prot
protein knowledgebase (http://us.expasy.org/sprot/sp-docu.html). The original amino acid
127
positions are mapped to nucleotide positions on the corresponding chromosome to obtain the
conservation score.
Length of different parts of genes
To compare the length of various parts of UEHGs, disease genes, and all other genes, we
obtain the length data from UCSC genome table browser (http://genome.ucsc.edu/)
3
. Genes
are first mapped to refSeq IDs, 1,783 out of 2,012 disease OMIM IDs can be mapped to refSeq
IDs and we can retrieve 1,773 length information for them. 1,423 out of 1,789 Unigene IDs
can be successfully mapped and 1,400 can obtain length information. We also compile a list of
23,833 refSeq IDs which don’t overlap with above two sets and 10,304 of them can retrieve
the length information.
Human protein-protein interaction
The protein physical interaction degree information is obtained from Human Protein
Reference Database (HPRD)
4
. XML files are parsed using java program to obtain the protein
symbol and interaction information. Disease genes’ OMIM IDs are mapped to Human Gene
Nomenclature (HGNC) official symbols and 1,723 can be linked in HPRD. 1,280 UEHGs can
be mapped in HPRD and all the other proteins having official symbols (9,481) are put together
as the third group. Due to the large fraction of proteins have no records of any interaction, we
decide to use only those proteins have at least one interaction for the analysis.
C.elegans homolog and RNAi Phenotype
After using the same method as above to identify human homologs and map UEG and disease
128
genes, we divide the totally 20,448 C.elegans genes collected by NCBI into four groups: 695
UEG homologs, 480 human disease gene homologs, 2669 non-UEG/disease gene homologs
and 16688 other genes that don’t have human homologs. By the WormMart tool of
WormBase (http://www.wormbase.org/), RNAi phenotypes of C.elegans genes are retrieved
using WormBase gene sequence name. We then map WormBase sequence name into NCBI
gene based on Entrez Gene ID. We divide the RNAi phenotype into four categories: lethal
(include both embryonic and larval lethal), wild type, sick (known phenotypes other than the
above two), and unknown. For genes with more than one phenotype, we choose the most
severe one, assuming lethal>sick>wild in their severity.
Yeast homolog and phenotype
We divide 6,179 yeast genes collected by NCBI into four groups: 384 UEG homologs, 196
human disease gene homologs, 1,005 non-UEG/disease gene homologs and 4,641 other genes
that don’t have human homologs. The yeast gene deletion phenotype data are downloaded
from Saccharomyces Genome Database (SGD, ftp://genome-ftp.stanford.edu/). We consider
three phenotypes: lethal, nonlethal and unknown.
Human gene function annotation
Gene Ontology annotations of 12,715 human genes are downloaded from NCBI
(ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/) on June 7
th
and we consider only the biological part
of the annotation. For these genes, 1331 can be mapped to UEG, 1741 can be mapped to
disease genes and 9,814 are put together as others. Similar to the definition used in Zhou et al.,
a GO node is referred as informative if it covers more than 500 genes, and none of its child
nodes covers the such many genes. We divide all the biological processes into 25 function
categories according to the informative nodes. To test whether UEHGs, disease genes or other
genes are over/under represented in each of the 25 function categories, we use
hyper-geometric distribution to calculate the p-value. In the figure, colors are used to present
the logarithm of the p-value.
Fig C1. Cumulative distribution function curve of Ka, Ks and Ka/Ks for UEHGs,
disease and all other genes. The Ka and Ks are calculated based on the human-rat
homologous pairs. (a) Ka, the number of non-synonymous substitutions per
non-synonymous sites. (b) Ks, the number of synonymous substitutions per synonymous
site, and (c) The Ka/Ks ratio.
129
Abstract (if available)
Abstract
The rapid development in high throughput biotechnologies in the last decade has fundamentally changed the way of studying biological systems. Today, whole genome sequences are either available or can be obtained within weeks. Genome wide expression levels can be easily measured by microarray analyses. Tens of thousands of protein-protein interactions have been identified in several species, by either yeast two hybrid assays or by mass spectrometry. Transcription factor-DNA binding interactions can also be identified by high throughput techniques, such as by chromatin immunoprecipitation followed by whole genome chip analyses. Many other new technologies are emerging, such as high throughput cell imaging, protein arrays. All these technologies, together with traditional methods, provide us an unprecedented opportunity to explore biological systems in a holistic way. Computational biologists are confronted with great opportunity but also significant challenges in analyzing these data sets to extract biologically meaningful knowledge. Since no single technique can provide all the information needed for understanding the system, integrative analysis has turned out to be an essential approach to unravel how the systems work.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Genomic mapping: a statistical and algorithmic analysis of the optical mapping system
PDF
Prioritizing phenotype-associated functional modules and sub-networks from high throughout screening results
PDF
Computational and experimental approaches for the identification of genes and gene networks in the Drosophila sex-determination hierarchy
PDF
Statistical modeling of sequence and gene expression data to infer gene regulatory networks
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
The evolution of gene regulatory networks
PDF
Developing statistical and algorithmic methods for shotgun metagenomics and time series analysis
PDF
Big data analytics in metagenomics: integration, representation, management, and visualization
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
PDF
Probabilistic modeling and data integration to examine RNA-protein interactions
PDF
Fast search and clustering for large-scale next generation sequencing data
PDF
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
PDF
Structure – dynamics – function analysis of class A GPCRs and exploration of chemical space using integrative computational approaches
PDF
Computational analysis of genome architecture
PDF
Novel computational methods of disease gene and variant discovery, parallelization and applications
PDF
Evaluating the effects of testing framework and annotation updates on gene ontology enrichment analysis
PDF
Exploring the application and usage of whole genome chromosome conformation capture
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Comparative analysis of DNA methylation in mammals
Asset Metadata
Creator
Tu, Zhidong
(author)
Core Title
Integrative approach of biological systems analysis on regulatory pathways, modules, protein essentialities, and disease gene classification
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology
Publication Date
11/15/2006
Defense Date
10/27/2006
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
computational algorithm,data integration,OAI-PMH Harvest,Systems Biology
Language
English
Advisor
Sun, Fengzhu (
committee chair
), Arbeitman, Michelle N. (
committee member
), Chen, Ting (
committee member
), Kempe, David (
committee member
)
Creator Email
ztu@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m149
Unique identifier
UC1436171
Identifier
etd-Tu-20061115 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-27328 (legacy record id),usctheses-m149 (legacy record id)
Legacy Identifier
etd-Tu-20061115.pdf
Dmrecord
27328
Document Type
Dissertation
Rights
Tu, Zhidong
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu
Tags
computational algorithm
data integration