Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Prioritizing phenotype-associated functional modules and sub-networks from high throughout screening results
(USC Thesis Other)
Prioritizing phenotype-associated functional modules and sub-networks from high throughout screening results
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
PRIORITIZING PHENOTYPE-ASSOCIATED FUNCTIONAL MODULES AND
SUB-NETWORKS FROM HIGH THROUGHOUT SCREENING RESULTS
by
Li Wang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY)
August 2009
Copyright 2009 Li Wang
ii
Acknowledgements
First and foremost I would like to thank my advisor Dr. Fengzhu Sun, for his great
guidance, support and encouragement. I especially appreciate his great patience and trust
even after my many failures in the beginning years. It is him who taught me how to do
research from the very beginning, and how to become a serious scientist. His humble and
caring attitudes towards colleagues and students also influence me a lot. All these are
such a great fortune of my life that will definitely affect my future career.
I also want to thank Dr. Ting Chen, who helped initiate the project of “prioritizing
functional modules mediating genetic perturbations and their phenotypic effects”. He
gave me many great insightful advices, not only on my research, but also on my career
development. I owe great thanks to Drs. Xianghong Jasmine Zhou, Michelle N.
Arbeitman, Liang Chen, and C. C. Jay Kuo for kindly serving in my committee and for
giving many the constructive comments and suggests during my Ph.D. study.
There are many other people without whom this dissertation would likely not exist. A
few of them are Robert Gentleman, Xiaotu Ma, Rui Jiang, Min Xu, Tony Chang and Matt
Lebo. To those people whose names are not listed here, I greatly appreciate their support.
Finally, I would like to thank my parents, my sister and my husband Zhidong Tu. It is
only because of their forever loves so I could fully devote myself to the research and be
brave when facing those challenges.
iii
Table of Contents
Acknowledgements ii
List of Tables v
List of Figures vi
Abstract viii
Chapter 1 Introduction 1
1.1. From network to modularity
1.1.1. Network biology
1.1.2. Modular biology
1
1
4
1.2. Indentifying phenotype-associated sub-networks and modules
1.2.1. High throughput screenings
1.2.2. Problems in high throughput screenings
1.2.3. Network or module-based integrative analysis of screening
results
7
7
10
12
1.3. Summary of our work
1.3.1. Prioritizing functional modules mediating genetic
perturbations and their phenotypic effects
1.3.2. Prioritizing reliable hits from RNAi screens
14
14
16
Chapter 2 Prioritizing functional modules mediating genetic
perturbations and their phenotypic effects
18
2.1 Introduction
18
2.2 Results and Discussion
2.2.1 Prioritizing lethal modules in S. cerevisiae
2.2.1.1 Superiority of the BN model indicated by analysis of
lethality of ortholog genes
2.2.1.2 Superiority of the BN model revealed by
cross-validation
2.2.2 Prioritizing GO biological processes causally implicated in
human cancer
2.2.2.1 Superiority of the BN model indicated in the GO
hierarchical structure
2.2.2.2 Consistency of the BN model results with the literature
2.2.2.3 Superiority of the BN model revealed by
cross-validation
2.2.2.4 Consistency of the results prioritized by the BN model
in different datasets
2.2.3 Simulation study
22
22
22
31
35
37
41
44
46
50
iv
2.3 Materials and Methods
2.3.1 Bayesian network model
2.3.2 Inference of module lethality given gene lethality
2.3.3 Prediction of gene lethality through cross-validation
2.3.4 Local Bayesian model
2.3.5 Data sources
54
54
58
55
59
59
2.4 Conclusion
61
2.5 Acknowledgements
63
Chapter 3 Prioritizing reliable hits from RNAi screens 64
3.1 Introduction 64
3.2 Results
3.2.1 RNAi hits have higher network connectivity than random
chance hits
3.2.2 Identifying the best performing NePhe scoring system
3.2.3 Case studies: Hedgehog (Hh) and Wnt signaling pathways
3.2.3.1 Comparing NePhe scoring system with experimental
validation
3.2.3.2 Identifying FNs from nonhit set using NePhe scoring
system
3.2.4 Interpretation of RNAi phenotypes at module level
67
67
71
79
79
86
89
3.3 Materials and Methods
3.3.1 Network attributes, P-values and Randomization
3.3.2 Network-based similarity
3.3.3 NePhe scoring system
3.3.4 Rank-based test
3.3.5 Mutant phenotype similarity
93
93
94
96
98
100
3.4 Discussion and Conclusions 101
3.5 Acknowledgements 103
Chapter 4 Future work 104
4.1. Uncertain causality between gene and phenotype 104
4.2. The specificity issue with RNAi screens 105
4.3. Epistatic effects 106
4.4. Integrating modular and network representation 106
References 108
Appendices 121
Appendix A: Supplementary materials for Chapter 2 121
Appendix B: Supplementary materials for Chapter 3 125
v
List of Tables
Table 2.1: The 27 GO CAN-processes prioritized by the Bayesian
network model or the HG enrichment test based on cancer genes from
“cancer-gene census” database.
40
Table 2.2: Top GO CAN-processes ranked by the Bayesian network
model or the HG enrichment test based on cancer genes from systematic
sequencing of colorectal and breast cancer genomes.
49
Table 3.1: The network attributes and corresponding P-values for the 24
sub-networks constructed from RNAi hits.
71
Table 3.2: Each cell represents the overlap between KEGG pathway
genes (rows) and all, or subsets of, hits/nonhits in the corresponding RNAi
screen (columns).
82
Table A1: 94 curated lethal protein complexes identified by the BN
model.
123
Table B1: The network attributes and corresponding P-values for the 24
sub-networks constructed from RNAi hits.
126
Table B2: The Spearman’s rank correlation coefficients of the NePhe
score before and after adding different levels of noise into the hit set.
130
Table B3: The Spearman’s rank correlation coefficients of the NePhe
score before and after removing different fractions of hits from the hit set.
131
Table B4: The 12 methods for computing NePhe score. 133
Table B5: The network attributes and corresponding P-values for the 24
sub-networks constructed from RNAi hits after removing those that affect
cell growth and viability.
134
vi
List of Figures
Figure 1.1: Different types of molecules and interactions involved in the
information flow from DNA to cellular physiology.
4
Figure 1.2: Illustration of concept of overlapping communities. 6
Figure 1.3: An example of hierarchical organization of modularity. 7
Figure 1.4: Approaches of genome-wide RNAi screenings in different
organisms
9
Figure 1.5: Genes and proteins harboring variation causing the same
disease phenotype tend to form directly (physically) connected clusters.
13
Figure 2.1: Genes in S. cerevisiae are classified into four groups
according to their lethality and the lethality of protein complexes to
which they belong.
30
Figure 2.2: The “ortholog lethal ratio” for “NLGLC” and “NLGNLC”
when different cutoffs are used to identify lethal protein complexes.
31
Figure 2.3: The ROC curve, AUC and pAUC.2 of 100-fold
cross-validation in predicting lethality of genes in S. cerevisiae using
curated protein complexes (a), curated and HTP protein complexes (b)
and GO biological process (c).
32
Figure 2.4: The distribution of p-values for the enrichment of cancer
genes for GO BP nodes.
39
Figure 2.5: Pathways of Global Genome Nuclear Excision Repair
(GG-NER) and Transcription Coupled Nuclear Excision Repair
(TC-NER).
42
Figure 2.6: The ROC curve, AUC and pAUC.2 of 100-fold
cross-validation in predicting cancer genes using GO biological process.
45
Figure 2.7: simulation results. 53
Figure 2.8: An example of the Bayesian network. 57
vii
Figure 3.1: The flowchart for the rank-based test. 74
Figure 3.2: The overall performance of different methods in identifying
FNs (a) and FPs (b) in the rank-based test and the screen-specific
performance of different methods in identifying FNs (c) and FPs (d) in
the rank-based test.
76
Figure 3.3: The distributions of RR of FNs among nonhits (a) and RR of
FPs among hits (b) for each screen in the rank-based test.
78
Figure 3.4: The reproducibility rate of hits in the validation screen
within each interval of the RR by NePhe score for Hh (a) and Wnt (b)
signaling pathway.
80
Figure 3.5: The proportion of OT-related hits within each interval of
the RR by NePhe score for Hh (a) and Wnt (b) signaling pathway.
83
Figure 3.6: The reproducibility rate for OT-related and OT-unrelated
hits (a), for low-ranked and high-ranked OT-related hits by NePhe score
(b) and for low-ranked and high-ranked OT-unrelated hits by NePhe
scores (c) for Hh signaling pathway.
85
Figure 3.7: Distributions of the mutant phenotype similarities 88
Figure 3.8: A sub-network associated with the Wnt signaling pathway.
Red: hits of RNAi screening.
92
Figure A1: Genes in S. cerevisiae are classified into four groups
according to their lethality and the lethality of protein complexes to
which they belong
121
Figure A2: The 27 GO CAN-processes prioritized by the Bayesian
network model (yellow) and their offspring and ancestor nodes (blue).
122
Figure B1: Screen-specific performance of different methods in
identifying FPs in rank-based test.
128
Figure B2: The reproducibility rate for OT-related and OT-unrelated
hits (a), for low-ranked and high-ranked OT-related hits by NePhe score
(b) and for low-ranked and high-ranked OT-unrelated hits by NePhe
scores (c) for Wnt signaling pathway.
132
viii
Abstract
How to interpret the nature of biological processes, which when perturbed cause certain
phenotype changes, such as human disease, is a major challenge. The completion of
sequencing the genomics of many model organisms has made “reverse genetic
approaches” efficient and comprehensive ways to identify the causal genes for a given
phenotype under investigation. For instance, genome-wide knockout strains are now
available for S. cerevisiae, and diverse high throughput RNAi knockdown experiments
have been performed for multiple higher organisms. Although very useful, these high
throughput screening approaches are associated with two main problems: 1) the
underlying biology, i.e., how genetic perturbation leads to the change of phenotypes in
the complex of biological systems is unclear; 2) the screening results could be very noisy
with high false positive and false negative rates.
As genomic data from different sources accumulates, integrating screening results with
other genomic information, particularly in forms of functional modules and networks,
might help solve the above problems. Motivated by this, we developed two novel
integrative computational approaches to address problems with high throughput
screening results. In our first study, we represented the biological system as a set of
predefined functional modules, and developed a Bayesian Network strategy to identify
functional modules that are causally implicated in the phenotype under investigation. In
the second study, we represented the biological system as a network of functional or
physically associated proteins. We developed a network-based scoring system to filter out
noise in RNAi screens.
ix
Our studies demonstrate that the modular and network information can be effectively
utilized to address potential problems with high throughput screening results and to
achieve what would otherwise be impossible from the screening results alone.
1
Chapter 1 Introduction
1.1. From network to modularity
The central dogma of molecular biology is that nucleotide sequences are transcribed into
pieces of message RNA (mRNA), which in turn, are translated into amino-acid sequences
(Figure 1.1). The sequence of amino acids directs the folding of a protein into its native
three-dimensional shape which determines the functional properties of the protein.
Individual proteins carry out their function in complex networks of interacting
macromolecules, the coordinated behavior of which determines the physiological
properties of the living cell [1].
1.1.1. Network biology
A key aim of post-genomic biomedical research is to systematically catalogue all the
molecules and their interactions within a living cell [2]. In this section, we give a brief
introduction of four types of molecular networks that have been widely studied in
biomedical research community and the diverse experimental and computational
techniques developed to build these networks.
DNA sequence and co-evolutionary network
The complete sequencing of multiple organisms has enabled direct genomic comparison
across different species and subspecies. Previous studies have shown that genes that take
2
part in the same complex, pathway, or process tend to show a nonzero correlation of their
evolutionary histories [3]. In other words, co-evolutionary network built by inferring
co-evolutionary relationships between genes could help reveal and predict physical and
functional associations. Many computational strategies have been developed to capture
such co-evolutionary relationship and to incorporate such information into automated
function prediction [4, 5].
mRNA expression and transcriptional regulatory network
Transcription regulation involves synthesizing and/or degrading mRNA in response to the
need of a biological system [6]. Understanding its mechanisms has been a central topic in
biology for many decades because of its importance as a fundamental biological process.
Transcript levels at the genome-wide scale can be measured with DNA microarray
technology or "tag based" technologies like serial analysis of gene expression (SAGE).
Many computational approaches have been developed to infer transcriptional regulatory
networks directly from whole genome expression data [7-9]. Moreover, integrative
approaches [10-12] can be used to incorporate other types of genomic data, e.g., DNA
sequence data and protein–DNA binding data, in modeling of transcription regulation.
Proteins and physical interaction network
Many functions of proteins are mediated through their physical interaction with other
proteins. Two types of high throughput techniques are widely used in detecting
protein-protein interactions, i.e., yeast two hybrid system [13] and affinity purification
3
followed by mass spectrometry [14]. The former detects direct binary interactions, and
the latter detects protein complex membership. Recently, a variety of protein arrays have
been designed to detect specific types of interactions [15], such as, protein-DNA,
protein-drug, receptor-ligand, enzyme-substrate.
Gene perturbation and genetic interaction network
Probably all hereditary diseases in humans are genetically complex, resulting from the
combination of mutations in multiple different genes [16]. Consequently, both
experimental and computational techniques have been developed or under development
in model organisms to capture genetic interactions, particularly, synthetic lethal or sick
interactions, in which the combination of mutations in two genes causes cell death or
reduced fitness, respectively [17-21]. In contrast to other types of interactions which
generally correlate with each other, genetic interactions have shown to be mostly
orthogonal to physical interactions. For example, genetic interactions have been shown to
act mostly between pathways instead of within pathways [22].
4
Figure 1.1: Different types of molecules and interactions involved in the information
flow from DNA to cellular physiology. This graph is from [1].
1.1.2. Modular biology
One of the important discoveries that come out of the above molecular network analysis
is modularity in biological system. In general, modularity refers to a group of physically
or functionally linked molecules (nodes) that work together to achieve a (relatively)
distinct function. In a network representation, a module (or cluster) appears as a highly
interconnected group of nodes [2]. As different networks could represent different types
of molecules and their interactions, the modules derived from these networks hence could
5
have distinct meanings. For example, while densely connected sub-networks identified in
protein-protein interaction networks may represent protein complexes in biological
system, those identified in co-expression networks are likely to represent gene targets of
the same TF(s). Multiple computational methods have been introduced to identify
modules in various networks [23-27]. Moreover, different functional genomics data can
be integrated with the topology in module identification [28].
Analysis of modularity in biological systems has revealed two important features shared
by most biological modules: the overlapping community structure and hierarchical
organization. We discuss the two features in more details as follows.
Overlapping community structure
Most real networks are characterized by overlapping communities. This can be illustrated
by the numerous communities that each of us belongs to, including those related to our
scientific activities or personal life (school, hobby, family) and so on, as illustrated in
Figure 1.2 [29]. In biological systems, an example is that a large fraction of proteins
belong to several protein complexes simultaneously. Han et al. [17] found that proteins
with many physical interactions fall into two distinct classes: party hubs and date hubs.
Party hubs tend to be located within densely connected modules, while date hubs more
often act as bridges between different modules. Thus, the existence of date hubs in the
biological system also partially explains the observed overlapping community structure.
6
Figure 1.2: Illustration of concept of overlapping communities. This figure is from [30].
Hierarchical organization of modularity
The idea of hierarchical organization of modularity arises from the dilemma about the
scale-free nature versus the modularity of the molecular network [29]. On the one hand,
molecular networks possess high clustering coefficients [31], a property that is suggestive
of a modular organization. It implies that the molecular network potentially comprises
several
densely interconnected functional modules of varying sizes that
are connected by
few inter-module links. Under such modular organization, most nodes would have
approximately the same number of links. On the other hand, several recent publications
indicate that molecular networks in diverse eukaryotic species all have the scale-free
nature [32, 33], which seems to contrast with the modularity organization. The
coexistence of clustering and hubs indicates that topological modules are not independent,
7
but combined to form a hierarchical network. Figure 1.3 shows an example of such a
hierarchical network. This network is simultaneously scale-free and has a high clustering.
Figure 1.3: An example of hierarchical organization of modularity. The network is made
of many small, highly integrated 4-node modules that are assembled into larger 16-node
modules, each of which combines in a hierarchical fashion into even larger 64-node
modules. This figure is from [29].
1.2. Indentifying phenotype-associated sub-networks and
modules
1.2.1. High throughput screenings
Genome-wide loss-of-function screening
How to interpret the nature of biological processes, which when perturbed cause certain
phenotypes, such as human disease, is a major challenge. The completion of sequencing
8
of many model organisms has made “reverse genetic approaches” [34] efficient and
comprehensive ways to identify causal genes for a given phenotype under investigation.
“Reverse genetics” is an approach to discover the function of a gene that proceeds in the
opposite direction of so called “forward genetic screens” of classical genetics. Simply put,
while forward genetics seeks to find the genetic basis of a phenotype or trait, reverse
genetics seeks to find the possible phenotypes that may derive from a specific genetic
sequence obtained by DNA sequencing. For instance, genome-wide single gene knockout
strains are now available for S. cerevisiae, and multiple screenings have been carried out
to investigate phenotypes of these knockout strains under varied conditions [35, 36].
More recently, diverse high throughput RNAi knockdown experiments have been
performed, or are under development, for higher organisms including C. elegans [37], D.
melanogaster [38] and mammals [39, 40]. RNAi is an endogenous cellular process by
which messenger RNAs are targeted for degradation by double-stranded (ds) RNA of
identical sequence, leading to gene silencing. Initially used to knock down the function of
individual genes of interest, the technology was harnessed in several organisms on a
global scale with the production of RNAi libraries to silence most of the genes in their
genomes, allowing genome-wide loss-of-function screening (Figure 1.4) [41].
9
Figure 1.4: Approaches of genome-wide RNAi screenings in different organisms. This
figure is from [41]
Large scale screening for genome alterations
Besides the above externally introduced perturbations, naturally existing genome
alterations, such as DNA sequence changes, copy number aberrations, chromosomal
rearrangements and modification in DNA methylation, can now be surveyed at a
genome-wide scale owing to the continuing improvement of genomic technologies. It has
been long hoped that the advent of genome-wide association (GWA) studies would
10
provide genetic basis of many of these common causes of human morbidity and mortality.
The availability of dense genotyping chips which contain sets of hundreds of thousands of
single nucleotide polymorphisms (SNPs) that provide good coverage of much of the
human genome, has for the first time made GWA studies for thousands of cases and
controls technically and financially feasible. For example, the Wellcome Trust Case
Control Consortium (WTCCC) has recently completed GWA studies of 2,000 cases and
3,000 shared controls for 7 complex human diseases of major public health importance
[42].
Human cancer is thought to be caused by the accumulation of mutations in oncogenes
and
tumor suppressor genes. With the determination
of the human genome sequence and
improvements in sequencing
and bioinformatic technologies, systematic analyses of
genetic
alterations in human cancers have become possible [43]. More comprehensively,
the Cancer Genome Atlas aims to catalog and discover major cancer-causing genome
alterations in large cohorts of human tumors through integrated multi-dimensional
analyses [44].
1.2.2. Problems in high throughput screenings
Experimental Noise
Similar to many other high-throughput technologies, experimental noise remains to be a
big concern for most of the above mentioned high-throughput screenings. For example,
RNAi screening results are often noisy and with potentially high rates of false positives
11
and false negatives. On the one hand, genes may not always be effectively knocked down
and will consequently be missed by the screening. On the other hand, owing to the
tolerance for mismatches and gaps in base-pairing with targets, small interfering RNA
(siRNA) could possibly target up to hundreds of sequences [45, 46], which are often
termed as off-target effects (OTEs). Furthermore, designing a high-throughput RNAi
screen involves many levels of decision-making, such as the type and concentration of
RNAi reagents, the readout options, and the methodologies and criteria used for hit
selections, each of which could affect the quality of the final results [47].
Biological relevance
Besides the above technical noise issue, the biological relevance of the screening outputs
to the phenotype of interests usually requires further nontrivial assessments. For example,
due to linkage disequilibrium, a phenotypic variation could be linked to a genome region
that contains up to hundreds to genes. Thus, it is unclear which gene or genes are the
causal factors of the observed phenotypic changes. Another example is how to tell a
detected somatic mutation in a cancer genome is a driver mutation or a passenger
mutation. Driver mutations are
causally involved in the neoplastic process and are
positively
selected for during tumorigenesis. Passenger mutations provide
no positive or
negative selective advantage to the tumor but
are retained by chance during repeated
rounds of cell division
and clonal expansion [43].
12
Underlying biological mechanism
Compared to the direct genotype-phenotype correlation observed in the screening
experiments, what is less obvious is how genetic perturbation leads to the change of
phenotypes in the complex of biological systems. To be more specific, it is known that
proteins perform functions in the cellular system by interacting with other proteins, and
that the same protein could perform different functions by interacting with different
subsets of proteins or associating with different modules under different conditions. Thus,
it is not clear directly from the screening output that which interactions, pathways or
modules, the genetic perturbations have lead to the observed phenotypic changes.
1.2.3. Network or module-based integrative analysis of
screening results
As diverse genomic datasets accumulate, integrating screening results with other genomic
information, especially in the form of networks or modules, becomes one feasible
approach to solve the above mentioned problems. Indeed, more and more evidence have
indicated the modular nature of complex phenotypes, such as human diseases. Complex
traits are often associated with a group of functionally or physically linked molecules
instead of a single molecule. For example, Figure 1.5 shows that genes and proteins
harboring variation causing the same disease phenotype tend to form directly (physically)
connected clusters [48].
13
Motivated by these discoveries, we have developed two integrative approaches to identify
modules and sub-networks associated with a phenotype of interest, respectively. The
general purpose of our studies is to address some of the above mentioned problems in
high throughput screenings. In the next section, we’ll briefly introduce the two methods
and results of their applications. Each of the two studies will then be fully described in
later chapters.
Figure 1.5: Genes and proteins harboring variation causing the same disease phenotype
tend to form directly (physically) connected clusters. Physical-interaction gene clusters
associated with 38 disease phenotypes are shown separately for each disease (left) and
within the whole molecular interaction network. Note that several genes within the
network are known to affect multiple phenotypes (network nodes with multicolor stripes).
This figure is from [48].
14
1.3. Summary of our work
In our first study, we represented the biological system as a set of predefined functional
modules. We developed a Bayesian Network strategy which takes gene-phenotype
associations detected in experiments as inputs and which outputs a subset of modules that
are causally implicated in the phenotype under investigation. Our method is particularly
designed to accommodate the aforementioned two features of functional modules in
biological system, i.e., overlapping structure and hierarchical organization. In the second
study, we represented the biological system as a network of functional or physically
associated proteins. We developed a network-based scoring system to rank hits or nonhits
detected in RNAi screens. Our method is aimed to address the noise issue in RNAi
screening results, and to help reveal the underlying biological mechanisms.
1.3.1. Prioritizing functional modules mediating genetic
perturbations and their phenotypic effects
How variation in DNA leads to variation in phenotypes is a question open to
interpretation. One possibility is that genetic variations lead to perturbation of certain
functional modules, such as protein complexes and biological processes, which in turn,
causes the observed phenotypes. To test the plausibility of this hypothesis, we aim to
prioritize causal functional modules mediating genetic perturbations and their phenotypic
effects. To accomplish this, we will contrast the results of two statistical gene testing
methods, one involving a traditional local strategy, specifically, the hypergeometric
15
enrichment test, and one involving our novel global strategy, termed Bayesian network.
To accommodate the overlapping structure and hierarchical organization of functional
modules in biological systems, we have developed a global strategy to isolate the
functional modules that are most likely those mediating genetic perturbations and their
phenotypic effects. We will call these causal functional modules. These causal modules
are selected from a large number of other overlapping modules whose relationship with
the phenotypes very likely results from the sharing of gene members with causal modules.
To test the effectiveness of the global method, we take gene lethality in S. cerevisiae and
human cancer as two examples. In doing so, we provide multiple sources of evidence
indicating that the causal modules derived by the global strategy method are more
accurate than those identified by the local strategy. Namely, the hypergeometric
enrichment test, which ignores the interrelationships among modules. In the course of
our analyses of the modules related to lethality in yeast, we discovered that, across
different species, lethality is more conserved at the module level than at the gene level.
Also, our analysis of the modules related to human cancer resulted in the identification of
several potentially “new” cancer-related biological processes, specifically, cytoskeleton
anchoring and lipid transport.
We present a global strategy for module-based decoding of phenotypic variation caused
by genetic perturbation. Both the computational strategy and the findings that result from
this effort might have broad applications in the study of human diseases.
16
1.3.2. Prioritizing reliable hits from RNAi screens
The recently developed RNA interference (RNAi) technology has created an
unprecedented opportunity which allows the function of individual genes in whole
organisms or cell lines to be interrogated at genome-wide scale. However, multiple issues,
such as off-target effects or low efficacies in knocking down certain genes, have
produced RNAi screening results that are often noisy and that potentially yield both high
rates of false positives and false negatives. Therefore, integrating RNAi screening results
with other information, such as protein-protein interaction (PPI), may help to address
these issues.
By analyzing 24 genome-wide RNAi screens interrogating various biological processes
in Drosophila, we found that RNAi positive hits were significantly more connected to
each other when analyzed within a protein-protein interaction network, as opposed to
random cases, for nearly all screens. Based on this finding, we developed a
network-based approach to identify false positives (FPs) and false negatives (FNs) in
these screening results. This approach relied on a scoring function, which we termed
NePhe, to integrate information obtained from both PPI network and RNAi screening
results. Using a novel rank-based test, we compared the performance of different NePhe
scoring functions and found that diffusion kernel-based methods generally outperformed
others, such as direct neighbor-based methods. Using two genome-wide RNAi screens as
examples, we validated our approach extensively from multiple aspects. We prioritized
hits in the original screens that were more likely to be reproduced by the validation
17
screen and recovered potential FNs whose involvements in the biological process were
suggested by previous knowledge and mutant phenotypes. Finally, we demonstrated that
the NePhe scoring system helped to biologically interpret RNAi results at the module
level.
By comprehensively analyzing multiple genome-wide RNAi screens, we conclude that
network information can be effectively integrated with RNAi results to produce
suggestive FPs and FNs, and to bring biological insight to the screening results.
18
Chapter 2 Prioritizing functional modules mediating genetic
perturbations and their phenotypic effects
2.1 Introduction
How to interpret the nature of biological processes, which, when perturbed, cause certain
phenotypes, such as human disease, is a major challenge. The completion of sequencing
of many model organisms has made “reverse genetic approaches” [34] efficient and
comprehensive ways to identify causal genes for a given phenotype under investigation.
For instance, genome-wide knockout strains are now available for S. cerevisiae [35, 36],
and diverse high throughput RNAi knockdown experiments have been performed, or are
under development, for higher organisms including C. elegans [37], D. melanogaster [38]
and mammals [39, 40] .
Compared to the direct genotype-phenotype correlation observed in the above
experiments, what is less obvious is how genetic perturbation leads to the change of
phenotypes in the complex of biological systems. That is, we might perceive the cell or
organism as a dynamic system composed of interacting functional modules which are
defined as discrete entities whose functions are separable from those of other modules
[49]. For example, protein complexes and pathways are two types of functional modules.
Using this concept as a basis for hypothesis, it is tempting to conclude that it is the
perturbation of individual genes, which leads to the perturbation of certain functional
modules and that this, in turn, causes the observed phenotype. Previous studies have
19
reported this type of module-based interpretation of phenotypic effects [50-52]. For
example, Hart and his colleagues showed the distribution of gene essentiality among
protein complexes in S. cerevisiae and suggested that essentiality is the product of protein
complexes rather than individual genes [53]. Other studies have made use of the modular
nature of phenotypes to predict unknown causal genes [54]. In a recent study, Lage and
his colleagues mapped diverse human diseases to their corresponding protein complexes
and used such mapping to prioritize unknown disease genes within linkage intervals of
association studies [55].
Despite these successful studies, the task of computationally inferring the functional
modules which mediate genetic perturbations and their phenotypic effects might not be as
easy as it appears. On the one hand, different modules could share common components.
On the other hand, modules are believed to be hierarchically organized in biological
systems [2] such that smaller modules combine to form larger modules, as shown in Gene
Ontology (GO) annotations [56]. All these overlapping structures among modules make it
difficult to accurately identify causal modules, the term we will use in this paper to
indicate functional modules which mediate genetic perturbations and their phenotypic
effects. To be more specific, since the protein products of a single gene could be
associated with multiple modules, the phenotypic effects observed by perturbation of that
gene could be attributed to the perturbation of any one of these modules, or their subsets.
In other words, some modules, which are otherwise independent of a phenotype, but
share members with actual causal modules of the phenotype, could be mistakenly
prioritized as causal modules when traditional strategies, such as the hypergeometric (HG)
20
enrichment test, are applied. This results from the fact that HG associates a module to
the phenotype based merely on the phenotypic effects of its own components. In this
paper, we refer to methods with the above characteristics as local strategies. We are
therefore motivated to develop a global strategy, specifically, a Bayesian network (BN)
model [57], to distinguish modules that are most likely to be actual causal modules from
the other overlapping modules that are likely to be independent of the phenotype. We
refer to this strategy as global since, in contrast to local strategy, it associates a module
with a given phenotype based not only on its own components, but also on its
overlapping structure with other modules. We applied the BN model to prioritize casual
modules for two phenotypes: lethality in S. cerevisiae and human cancer. In both cases, as
summarized below, we provide evidence indicating that the causal modules prioritized by
the BN model are more accurate than those prioritized by such local strategies as the HG
enrichment test. With lethality and human cancers as two illustrating examples, we aim
to provide a general framework for module-based decoding of phenotypic variation
caused by genetic perturbation, which could be applied to the understanding of diverse
phenotypes in various organisms.
In the first case, we take gene lethality data observed from a genome-wide gene deletion
study in S. cerevisiae [35]. Using the BN model, we then prioritize causal modules for
which perturbation is the underlying cause of the inviable phenotype observed. For
simplicity, we termed them as lethal modules, i.e., lethal protein complexes or lethal
biological processes. First, analysis of lethality of ortholog genes indicates that the BN
model is superior to the HG enrichment test in distinguishing lethal protein complexes
21
from nonlethal protein complexes. Moreover, in the course of the above analysis, we
found that lethality is more conserved at the module level than at the gene level. Second,
the module lethality inferred from the BN model is superior to the results obtained by the
local strategy in predicting unknown lethal genes as evaluated through cross validation.
In the second case, we applied our strategy to the study of human cancer. Human cancer
is believed to be caused by the accumulation of mutations in cancer genes, e.g.,
oncogenes and tumor suppressor genes. It has been suggested that a limited number of
biological pathways might include most cancer genes [58]. Based on cancer genes
documented in “cancer-gene census” [59], we prioritized GO biological processes
causally implicated in cancers (CAN-processes). First, as indicated by their positions in
the GO hierarchical structure and the conditional HG enrichment test, those GO BP nodes
prioritized by the BN model are more likely to represent actual CAN-processes than
those obtained by the HG enrichment test. Second, the results obtained from
implementing the BN model are more consistent with previous knowledge of
cancer-related processes than results obtained through the HG enrichment test. Third,
similar to the case of lethality, the CAN-processes inferred from the BN model is superior
to the results obtained by the local strategy in predicting unknown cancer genes as
evaluated by cross validation. Forth, by comparing the CAN-processes prioritized in
“cancer-gene census” to a recent set of cancer genes identified through systematic
sequencing [43], we show that the results of our BN model, in contrast to the
conditional HG enrichment test, are more consistent, even when different datasets of
cancer genes are used. We also discuss the reasons that plausibly underlie the discrepancy
22
of the results from the two datasets and identify and describe several potentially “new”
CAN-processes identified in the recent set of cancer genes, specifically, cytoskeleton
anchoring and lipid transport.
2.2 Results and Discussion
2.2.1 Prioritizing lethal modules in S. cerevisiae
We prioritized lethal modules from the gene lethality data in S. cerevisiae obtained from a
genome-wide gene deletion study[35] (see Materials and Methods). We provided
evidence from two aspects indicating that the lethal modules prioritized by the BN model
are more accurate than those prioritized by either the HG enrichment test or the local
Bayesian model.
2.2.1.1 Superiority of the BN model indicated by analysis of lethality
of ortholog genes
Compared with the HG enrichment test, our analysis of lethality of ortholog genes in the
context of protein complexes indicates that the BN model is superior in distinguishing
lethal from nonlethal protein complexes. It is difficult to directly measure the accuracy of
the prioritized lethal protein complexes without a direct benchmark for lethal and
nonlethal protein complexes. However, we expect that genes involved in lethal protein
complexes will show some characteristics that distinguish them from genes that do not
possess such characteristics. These characteristics could therefore serve as indicators of
23
lethality of protein complexes and, hence, could be used to measure the quality of the
prioritized data. Here, we consider one such potential characteristic, as described below.
We can categorize nonlethal genes into two classes according to the lethality of protein
complexes in which they participate. For simplicity, we refer to NonLethal Genes whose
protein products have been involved in certain Lethal protein Complexes as NLGLC, and
we refer to NonLethal Genes whose protein products have Not been involved in any
Lethal protein Complexes as NLGNLC. A key computational measurement we use is
termed “ortholog lethal ratio,” which refers to the proportion of genes in species A,
specifically S. cerevisiae, whose ortholog genes in species B, specifically C. elegans, are
lethal. Thus, we hypothesize that NLGLC has a higher “ortholog lethal ratio” than
NLGNLC. An intuitive argument supporting this theory is that, in order for those
NLGNLC in S. cerevisiae to evolve into lethal genes in C. elegans, they must undergo
some extra evolutionary events which associate their protein products with certain lethal
modules, which would be a prerequisite for genes showing inviable phenotype when
perturbed under a module-based explanation of lethality. On the other hand, since
NLGLC by definition already meet this requirement, and assuming module lethality and
composition are relatively conserved across species, it might be easier for them to evolve
into lethal genes in C. elegans, for instance, by losing their functional backup within
lethal modules. Here, we only focus on nonlethal genes, either NLGLC or NLGNLC, but
not lethal genes because, according to the module-based explanation of gene lethality, all
lethal genes must have been involved in certain lethal modules, and there is no such
classification in the case of non-lethal genes. Nevertheless, in the following analysis, we
24
also categorized lethal genes into two classes in a manner similar to nonlethal genes,
namely, Lethal Genes whose protein products have been involved in certain Lethal
protein Complexes (referred to as LGLC for simplicity) and Lethal Genes whose protein
products have Not been involved in any Lethal protein Complexes (referred to as
LGNLC for simplicity). It should be noted that such classification is simply for the
purpose of elucidation. Not all lethal modules are included in our dataset. Thus, the
existence of LGNLC that have not been associated with any lethal modules in our dataset
largely results from data incompleteness.
Since we are able to distinguish lethal from non-lethal protein complexes based on the
“ortholog lethal ratio” of their associated nonlethal genes, we could expect that a list of
protein complexes with a higher enrichment of lethal protein complexes will show a
higher “ortholog lethal ratio” of nonlethal genes than otherwise. We therefore carried out
the following analysis to compare the capacity of HG enrichment test with the BN model
in distinguishing lethal from nonlethal protein complexes. To determine the lethality of
protein complexes, we first employed the HG enrichment test to evaluate the enrichment
of lethal genes in 390 curated protein complexes in S. cerevisiae. More specifically, we
assume each protein complex as a random sample from a set of 5,916 genes, 1,105 of
which are lethal. We called a complex with Nc genes and Lc lethal genes, lethal if the
probability of having at least Lc lethal genes out of Nc genes is less than 0.05 based on
the hypergeometric distribution. We obtained a total of 149 lethal protein complexes in
this way. We then classified genes into four groups according to their gene lethality and
the lethality of protein complexes in which they participate: LGLC, LGNLC, NLGLC
25
and NLGNLC. To estimate the “ortholog lethal ratio” for each group of genes, we
calculated the proportion of genes whose orthologs in C. elegans are lethal among all the
genes whose orthologs in C. elegans exist with known lethality (see Methods for details
of gene lethality data in C. elegans). As shown in Figure 2.1(a), there appears to be no
significant difference between NLGLC and NLGNLC derived in this way (lower left and
right cells, respectively), as indicated by the “ortholog lethal ratio” (p-value of chi-square
test between the two groups >0.1). However, as discussed in the introduction part, the
above HG method might over-estimates the number of lethal complexes by including
“overlapping protein complexes” whose enrichment of lethal genes would most likely
result from the sharing of gene members with actual lethal protein complexes. Thus, we
then used the BN model to filter out those “overlapping protein complexes”. Out of the
above 149 protein complexes with the HG p-value <0.05, we filtered out 55 protein
complexes whose probability of being lethal, as derived from the BN model, was <0.7
and treated them as nonlethal protein complexes. In this case, the “ortholog lethal ratio”
is significantly higher for NLGLC than for NLGNLC after filtering out the “overlapping
protein complexes” (lower left and right cells, respectively, of Figure 2.1(b); p-value of
chi-square test between the two groups <0.05). It has to be mentioned that those protein
complexes that are not significantly enriched with lethal genes (p-value of HG
enrichment test >0.05) are not considered as candidate lethal protein complexes in the BN
model to speed up the algorithm, since those HG insignificant complexes are of less
practical use and could add a substantial amount of computational burden to the BN
model, particularly when GO biological processes are considered in later analysis. Other
26
preprocessing strategies to speed up the algorithm might work as well, for instance,
removing protein complexes with the number of lethal genes less than a threshold.
Based on the results of the above analysis, we conclude that the BN model is superior to
the HG enrichment test in distinguishing lethal protein complexes from nonlethal protein
complexes as evidenced by the following four findings. First, as indicated by the
“ortholog lethal ratio,” those “overlapping protein complexes” filtered out by the BN
model are very likely to be nonlethal protein complexes. To be more specific, the
“ortholog lethal ratio” for nonlethal genes only involved in the “overlapping protein
complexes” was not found to be significantly different (20%) from that of NLGNLC
before filtering (39.4%, lower right cell in Figure 2.1(a)). However, it was found to be
significantly lower than that of NLGLC after filtering (63.6%, lower left cell in Figure
2.1(b); p-value of chi-square test <0.05). In other words, by successfully filtering out
these “overlapping protein complexes,” the resulting list of lethal protein complexes
becomes more enriched when quantified by the “ortholog lethal ratio.” Second, in the
absence of the BN model, it is unlikely that those “overlapping protein complexes” could
have been effectively filtered out by the HG enrichment test, even by setting a more
stringent p-value cutoff, since, based on the Wilcoxon rank-sum test, there is no
significant difference between the HG p-value of those “overlapping protein complexes”
filtered out by the BN model and the HG p-value of the remaining lethal protein
complexes. Third, the coverage of lethal genes by lethal protein complexes remains
similar, both before and after filtering out “overlapping protein complexes.” Because the
“overlapping protein complexes” filtered out by the BN model are those sharing lethal
27
gene members with the remaining lethal protein complexes, it can be seen from the data
in Figure 2.1 that the number of distinct lethal genes covered by lethal protein complexes
after filtering (140+92, the upper left cell in Figure 2.1(b)) is only marginally smaller
than before filtering (142+96, the upper left cell in Figure 2.1(a)). If, however, a more
stringent cutoff p-value is set for the HG enrichment test, the coverage of lethal genes by
lethal protein complexes will be dramatically decreased (data not shown). Fourth, even
when the coverage of lethal genes is not considered, the BN model still performs better
than the HG enrichment test in distinguishing lethal protein complexes from nonlethal
protein complexes as measured by the “ortholog lethal ratio” of nonlethal genes. Figure
2.2 shows the “ortholog lethal ratio” for NLGLC and NLGNLC (lower left and right cells,
respectively, in Figure 2.1(a) or (b)) when different thresholds for either the p-value of
the HG enrichment test or the probability of being lethal protein complexes derived from
BN model are used. Compared to the HG enrichment test, it can clearly be seen that the
“ortholog lethal ratio” shows more striking difference between NLGLC and NLGNLC
when the BN model is used.
The analysis of the lethality of ortholog genes in the context of protein complexes also
reveals that lethality is more conserved at the module level than at the gene level. In other
words, compared with the lethality of a gene itself, the lethality of the protein complexes
in which that gene participates appears to be a more relevant predictor for the lethality of
its orthologs in other organisms. It can be seen that both LGLC and NLGLC show a
similar “ortholog lethal ratio” (the upper and lower left cells in Figure 2.1(b)), which is
significantly higher than that of NLGNLC (the lower right cell in Figure 2.1(b)). It should
28
be noted that similar pattern could be observed when the “ortholog lethal ratio” is
calculated based on essential genes in D. melanogaster instead of C. elegans (Figure A1
in Appendix A). This indicates that our observations here are not restricted to one dataset
or one species. Since the genome-wide whole organism screening is not available for D.
melanogaster, the gene lethality in D. melanogaster is defined based on cell-based RNAi
screening[60]. The ortholog lethal ratio might be underestimated in this way because
genes that are lethal to the whole organism might not display any phenotype when tested
in certain types of cells. It may be recalled from our discussion above that LGNLC (upper
right cell in Figure 2.1(b)) may theoretically belong to some other lethal modules, thus
showing a high “ortholog lethal ratio” comparable to LGLC.
Our finding that lethality is more conserved at the module level than at the gene level has
several important implications. First, it could serve as a piece of evolutionary evidence
supporting the modular nature of lethality. Second, to supplement traditional
gene-based mapping, it suggests that a module-based mapping strategy might be
employed in transferring phenotypic knowledge across species where it is the phenotypic
effects of the associated modules, rather than the phenotypic effects of individual genes,
that are believed to be conserved across species. For example, we want to predict
ortholog lethality in C.elegans from lethality data in Yeast. According to the traditional
sequence-similarity mapping, the orthologs of LGLC and NLGLC are predicted as lethal
and nonlethal, respectively. However, according to our analysis (Figure 2.1(b)), NLGLC
show similar “ortholog lethal ratio” to that of LGLC. Thus, it might be useful to predict
the orthologs of NLGLC as lethal instead of nonlethal. By doing so, more lethal genes
29
can be predicted, but the accuracy (defined as the fraction of true lethal genes among all
the predicted lethal genes) remains similar, which is around 60% in the case of C.elegans.
Analysis of the proportion of lethal genes in each of the 94 curated lethal protein
complexes identified by the BN model reveals a high modularity of lethality. As shown in
Table A1 in Appendix A, about 63.8% (60 out of 94) of them have all their members
being lethal. All except for one of them have more than half of their members being lethal.
In addition, the proportion of lethal genes in a lethal complex appears to differ based on
their functions. For example, as listed in Table A1, lethal complexes related to chromatin
remodeling, such as, RSC complex and INO80 complex, or protein transport and
translocation, such as, mitochondrial outer membrane translocase complex, Nuclear pore
complex, and ER protein-translocation complex, have a relatively low proportion of
lethal genes. The relative low proportion of lethal genes indicates functional redundancy
within those complexes. For example, the nuclear pore complex (NPC) has the principal
function of regulating the high throughput of nucleocytoplasmic transport in a highly
selective manner[61]. The fact that over half the total mass of FG domains could be
deleted without loss of viability or the NPC's normal permeability barrier suggests the
existence of multiple translocation pathways and partial redundancy among them[62].
30
Figure 2.1: Genes in S. cerevisiae are classified into four groups according to their
lethality and the lethality of protein complexes to which they belong. Within each group,
the pie chart represents the distribution of genes with respect to the lethality of their
orthologs in C. elegans. In (a), the lethal protein complexes were identified using HG
enrichment test (p-value <0.05), and in (b), “overlapping protein complexes” (the
probability of being lethal inferred by the BN model <0.7) were filtered out from those
identified in (a).
31
Figure 2.2: The “ortholog lethal ratio” for “NLGLC” and “NLGNLC” when different
cutoffs are used to identify lethal protein complexes: when more stringent cutoff of
p-value (<0.05) of the HG enrichment test is used to identify lethal protein complexes
(blue), or different cutoff of the probability of being lethal inferred by the BN model is
used to filter out “overlapping protein complexes.”
2.2.1.2 Superiority of the BN model revealed by cross-validation
The module lethality inferred from the BN model is superior to the results obtained by
the local strategy in predicting unknown lethal genes as evaluated by cross validation. As
mentioned before, one of the applications of identifying causal modules is the prediction
of unknown causal genes. However, for gene lethality in S. cerevisiae, this is not
necessary since the lethality of almost all the genes is known. Nonetheless, S. cerevisiae
does provide a good system of evaluating prediction accuracy of gene lethality through
32
cross validation. More to the point of our study, if, by such evaluation, we assume that
more accurate prediction of gene lethality is a consequence of more accurate inference of
module lethality, then prediction accuracy of the former could reflect prediction accuracy
of the latter.
To evaluate prediction accuracy of gene lethality through cross-validation, we randomly
chose part of the gene lethality data (training data) as a known to estimate module
lethality. The estimation results were then used to infer the probability of being lethal for
the remaining genes (testing data) (see Materials and Methods for details). In the step
where the lethality of each candidate module is inferred, we employed the BN model as
Figure 2.3: The ROC curve, AUC and pAUC.2 of 100-fold cross-validation in predicting
lethality of genes in S. cerevisiae using curated protein complexes (a), curated and HTP
protein complexes (b) and GO biological process (c). BN represents the Bayesian
network model, and LM represents the local Bayesian model.
33
Figure 2.3: Continued.
Figure 2.3: Continued.
34
our global strategy and the local Bayesian model (LM) as our local strategy with the
purpose of comparing how the results of these two methods could affect prediction
accuracy of gene lethality. The LM model differs from the BN model only in that only
the sub network for a candidate module is considered as if none of its components
participates in other modules (see Materials and Methods for details). In this sense, the
probability of being lethal for each protein complex inferred by LM is similar to the
p-value of the HG enrichment test in prioritizing lethal protein complexes. In this case,
we chose to compare the BN model with the LM model instead of the HG enrichment test.
Compared with the p-value derived from the HG enrichment test, the output of the LM
model is more like the BN model and, therefore, easier to infer gene lethality. We used
receiver operating curve (ROC) curve[63] and the area under ROC curve (AUC) of
100-fold cross validation as measurements of the prediction accuracy of unknown lethal
genes. We calculated both standard AUC and partial AUC[64] at a false positive rate of
0.2 (denoted as pAUC.2). Because the BN model is primarily designed to remove
potential false positives that are overestimated by the HG/LM method, we are
predominantly concerned with the prediction accuracy of our models at low false positive
rates[65], which are preferred in practice. The results are shown in Figure 2.3. When the
candidate modules consist of only curated protein complexes, the pAUC.2 of our BN
model increases by 8.5% compared with that of the LM (Figure 2.3(a)). The relatively
smaller improvement in this case might be a result of the fact that the AUC is already
very high with curated protein complex data. As a matter of fact, when the HTP protein
complex data are added to the candidate modules, the pAUC.2 increases by 17.9%, which
35
is more visible (Figure 2.3(b)). The pAUC.2 increases by 46.9% when GO BPs are
considered as candidate modules (Figure 2.3(c)). Since the BN model is designed to
accommodate the overlapping structures among functional modules, such a striking
improvement is consistent with the more complicated overlapping structures among GO
BPs. Our simulation results (Section 2.2.3) also show that the amount of improvement of
the BN model over the HG method in identifying causal modules increases as the degree
of overlap among modules increases (Figure 2.7). Since both methods perform similarly
at high false positive rates, the average improvement over the whole range of false
positive rates is relatively small. The standard AUC increases by 1%, 2.4% and 7.6%,
respectively. Therefore, our results show that the module lethality inferred by the BN
model is superior to the results obtained by the LM model in predicting unknown lethal
genes. Overall, therefore, to the extent that prediction accuracy of gene lethality reflects
prediction accuracy of module lethality, our results also indicate that the lethal modules
identified by the BN model are more accurate than those identified by the local strategy.
2.2.2 Prioritizing GO biological processes causally implicated
in human cancer
In order to show how the Bayesian network model could be applied to more complicated
phenotypes, such as human diseases, we prioritized GO biological processes (BPs) that
are causally implicated in human cancers (CAN-processes) based on cancer genes
documented in “cancer-gene census,” a curated cancer gene database assembled from
previous studies [59]. Compared with protein complexes, biological processes (BPs) are
36
more conceptually defined modules whose interrelationships appear to be more
complicated. For example, the GO BPs [56] are organized into a directed acyclic
structure, where children nodes representing BPs with more specific definition are
pointed into parent nodes representing BPs with broader definition. Such a hierarchical
organization makes it possible to investigate the biological system with varied specificity,
but also brings in some difficulties. For example, if one GO BP node is enriched for lethal
genes based on the HG enrichment test, it is very likely that many of its offspring nodes
and ancestor nodes are also enriched, as well as some nodes that share members with it.
However, since our BN model is a global strategy sensitive to the interrelationship among
modules, it might be more useful than the HG enrichment test (local strategy) in
distinguishing GO BP nodes which are most likely to represent actual CAN-processes
from those whose enrichment of cancer genes is more peripheral, either from sharing
members with them or being their ancestor or offspring nodes. For simplicity, we refer to
the latter as “overlapping GO BP nodes.” Using measurement parameters similar to those
of our gene lethality model, only GO BP nodes with a HG enrichment test P-value<0.05
are treated as candidate modules in the Bayesian network model, and the same empirical
cutoff was used to filter out “overlapping GO BP nodes.” Table 2.1 lists the resulting GO
BP nodes and the same number of GO BP nodes prioritized by the HG enrichment test.
Our results show that the GO BP nodes identified by the Bayesian network model are
likely to be better representatives of CAN-processes than those identified by the HG
enrichment test in three different respects.
37
2.2.2.1 Superiority of the BN model indicated in the GO hierarchical
structure
As indicated by their positions in the GO hierarchical structure and the conditional HG
enrichment test, those GO BP nodes prioritized by the BN model are more likely to
represent actual CAN-processes than those obtained by the HG enrichment test. We
plotted the 27 BP nodes prioritized by the BN model (as listed in Table 2.1) together with
all their offspring and ancestor nodes in the directed acyclic structure (see Figure A2 in
Appendix A). It can be seen that most of the nodes in this sub-graph are significantly
enriched with cancer genes (The node size in Figure A2 corresponds to the minus log
p-value of the HG enrichment test). As noted above, if one GO node is enriched with
cancer genes, many of its ancestor and offspring nodes will also become enriched. The
results shown in Figure A2 are, therefore, consistent with this observation. It can also be
seen that most GO BP nodes prioritized by the HG enrichment test (23 out of 27 GO BP
nodes as listed in Table 2.1) are also within this sub-graph. However, while most of the
27 GO BP nodes prioritized by the BN model are close to the leaf nodes, those prioritized
by the HG enrichment test are close to the root.
Since most GO BP nodes prioritized by the HG enrichment test are close to the root node,
it is suspected that the enrichment of cancer genes for most of them might actually result
from being ancestor nodes of actual CAN-processes. As a matter of fact, the enrichment
of cancer genes for 63.0% of these nodes (17 out of 27) becomes insignificant (p-value of
the HG enrichment test>0.05) conditional on at least one of its child nodes [66]. In order
38
to calculate the p-value of the HG enrichment test of node A conditional on node B, we
removed genes included in node B from node A and calculated the p-value of the
enrichment of cancer genes for the remaining genes in node A. As a comparison, since
the 27 GO BP nodes prioritized by the BN model are close to the leaf nodes, their
enrichment of cancer genes is less likely to result from being ancestor nodes of actual
CAN-processes. As a matter of fact, out of 16 nodes that are not leaf nodes, only 12.5%
(2 out of 16) become insignificant conditional on at least one of their child nodes.
Moreover, for the 2 nodes that become insignificant conditional on their child nodes,
none of their child nodes is significantly enriched with cancer genes (p-value of the HG
enrichment test>0.05). In this sense, their child nodes are not better representatives of
actual CAN processes than the 2 nodes themselves.
On the other hand, although most GO BP nodes prioritized by the BN model are of
smaller size and close to the leaf nodes, their enrichment of cancer genes is less likely to
result from being the offspring nodes of actual CAN-processes. This means that only a
few of their ancestor nodes will remain significantly enriched conditional on the 27 GO
BP nodes prioritized by the BN model. In order to demonstrate this, for each parent node
of the 27 GO BP nodes prioritized by the BN model, we calculated the p-value of the HG
enrichment test conditional on the 27 nodes. Only 6.8% (3 out of 44) of their parent
nodes were conditionally significant (p-value<0.05). We then extended such conditional
HG enrichment test to all 649 GO BP nodes that are enriched with cancer genes (p-value
of the HG enrichment test<0.05). The distribution of the original p-values of the HG
enrichment test and the p-values of the HG enrichment test conditional on the 27 GO BP
39
nodes are shown in Figure 2.4. It can be seen that most GO BP nodes become
insignificant conditional on the 27 CAN-processes prioritized by the BN model
(p-value>0.05), only 13 with p-value < 0.001 and none with p-value < 1e-5. It can also be
seen in Figure 2.4 that the number of significantly enriched GO BP nodes conditional on
the 27 CAN-processes is significantly smaller than the number of significantly enriched
GO BP nodes conditional on the same number of randomly selected GO BP nodes with
similar size.
Figure 2.4: The distribution of p-values for the enrichment of cancer genes for GO BP
nodes, by the HG enrichment test, the HG enrichment test conditional on the 27
CAN-processes prioritized by the BN model, and the HG enrichment test conditional on
the same number of randomly sampled GO BP nodes with similar size.
40
GO CAN-processes prioritized by the BN model
Total #
Cancer #
GO CAN-processes prioritized by the HG
enrichment test
Total #
Cancer #
GO:0006366 transcription from RNA polymerase
II promoter
541 52 GO:0050794 regulation of cellular process 3958 205
GO:0045737 positive regulation of
cyclin-dependent protein kinase activity
3 3 GO:0050789 regulation of biological
process
4256 209
GO:0045786 negative regulation of progression
through cell cycle
203 41 GO:0065007 biological regulation 4648 217
GO:0007169 transmembrane receptor protein
tyrosine kinase signaling pathway
168 23 GO:0043283 biopolymer metabolic
process
5095 226
GO:0048268 clathrin cage assembly 4 2 GO:0000074 regulation of progression
through cell cycle
325 53
GO:0000718 nucleotide-excision repair, DNA
damage removal
21 7 GO:0051726 regulation of cell cycle 329 53
GO:0002903 negative regulation of B cell
apoptosis
2 2 GO:0019219 regulation of nucleobase,
nucleoside, nucleotide and nucleic acid
metabolic process
2501 145
GO:0015014 heparan sulfate proteoglycan
biosynthetic process, polysaccharide chain
biosynthetic process
3 2 GO:0031323 regulation of cellular
metabolic process
2703 151
GO:0010225 response to UV-C 2 2 GO:0006350 transcription 2540 145
GO:0006310 DNA recombination 92 13 GO:0019222 regulation of metabolic
process
2832 154
GO:0016571 histone methylation 6 2 GO:0006139 nucleobase, nucleoside,
nucleotide and nucleic acid metabolic
process
3771 181
GO:0060070 Wnt receptor signaling pathway
through beta-catenin
5 2 GO:0045449 regulation of transcription 2448 140
GO:0016573 histone acetylation 10 4 GO:0006351 transcription,
DNA-dependent
2360 136
GO:0045429 positive regulation of nitric oxide
biosynthetic process
5 3 GO:0006355 regulation of transcription,
DNA-dependent
2302 134
GO:0006298 mismatch repair 31 7 GO:0045786 negative regulation of
progression through cell cycle
203 41
GO:0009168 purine ribonucleoside
monophosphate biosynthetic process
15 2 GO:0032774 RNA biosynthetic process 2364 136
GO:0010332 response to gamma radiation 3 2 GO:0043170 macromolecule metabolic
process
6647 244
GO:0045661 regulation of myoblast differentiation 6 2 GO:0022402 cell cycle process 606 61
GO:0030101 natural killer cell activation 15 3 GO:0016070 RNA metabolic process 2896 143
GO:0046902 regulation of mitochondrial
membrane permeability
5 2 GO:0007049 cell cycle 761 67
GO:0051353 positive regulation of oxidoreductase
activity
5 2 GO:0048523 negative regulation of
cellular process
917 73
GO:0051898 negative regulation of protein kinase
B signaling cascade
2 2 GO:0048519 negative regulation of
biological process
958 73
GO:0000910 cytokinesis 28 4 GO:0044238 primary metabolic process 7595 254
GO:0000075 cell cycle checkpoint 58 14 GO:0048522 positive regulation of
cellular process
754 63
GO:0001952 regulation of cell-matrix adhesion 9 6 GO:0006366 transcription from RNA
polymerase II promoter
541 52
GO:0042593 glucose homeostasis 11 2 GO:0048518 positive regulation of
biological process
840 65
GO:0014065 phosphoinositide 3-kinase cascade 5 3 GO:0009719 response to endogenous
stimulus
400 44
Median # 6 3 Median # 2364 136
Table 2.1: The 27 GO CAN-processes prioritized by the Bayesian network model or the
HG enrichment test based on cancer genes from “cancer-gene census” database.
41
2.2.2.2 Consistency of the BN model results with the literature
The results obtained from implementing the BN model are more consistent with previous
knowledge of cancer- related processes than the results obtained through HG enrichment
test results. As shown in Table 2.1, a variety of well-known cancer-related processes have
been prioritized by the BN model. They include those directly related to cell cycle, e.g.,
positive regulation of cyclin-dependent protein kinase activity and cell cycle checkpoint,
and those canonical signaling pathways regulating cell birth and death [58], e.g.,
transmembrane receptor protein tyrosine kinase signaling pathway, Wnt receptor
signaling pathway through beta-catenin, phosphoinositide 3-kinase cascade and protein
kinase B signaling cascade. They also include biological processes responsible for the
maintenance of genome stability [67], e.g., nucleotide-excision repair, DNA damage
removal and mismatch repair, or epigenetic modification [68], e.g., histone methylation
and histone acetylation. The associations of some prioritized CAN-processes with cancers
might be less apparent, but the literature has indicated their involvement with more
well-known CAN-processes. For example, the role of clathrin cage assembly in cancer
generation might be related to its function in controlling EGF receptor signaling
through clathrin-mediated endocytosis [69]. Another example is regulation of
mitochondrial membrane permeability, whose role in apoptosis has been shown before
[70]. On the other hand, the CAN-processes prioritized by the HG model might be too
generally defined to be associated with cancers. As shown in Table 2.1, most of the
CAN-processes prioritized by the HG enrichment test are >2000 in size, which renders
them less informative.
42
Figure 2.5: Pathways of Global Genome Nuclear Excision Repair (GG-NER) and
Transcription Coupled Nuclear Excision Repair (TC-NER). Cancer genes involved in the
two sub-pathways as documented in “cancer gene census” are marked by the red star. In
GG-NER, damage, such as ultraviolet-induced cyclobutane pyrimidine dimers (CPD) or
6-4 photoproducts (6-4 PP), is recognized by proteins, including the XPE (DDB2) and
XPC gene products. In TC-NER, the lesion appears to block the progress of RNA
polymerase II in a process involving the CSA and CSB gene products. Following initial
damage recognition, the two sub-pathways converge. The XPB (ERCC3) and XPD
(ERCC2) helicases unwind the region surrounding the lesion, along with the XPA and
XPG (ERCC5) gene products and replication protein A (RPA). (The graph was obtained
from KEGG PATHWAY database [71], and only part of it is shown here.)
43
Previous knowledge also indicates that some of the “overlapping GO BPs” filtered out by
the BN model might be independent of cancer. Importantly, in the absence of such global
approach, these “overlapping GO BPs” are less distinguishable from actual
CAN-processes based on the HG enrichment test. One example is nuclear excision repair
(NER), which can be categorized into two classes: Global Genome NER (GG-NER)
and Transcription Coupled NER (TC-NER) [72]. The two sub-pathways differ in the sets
of proteins involved in the distortion and recognition of the DNA damage, but converge
after that (see Figure 2.5). Out of a total of 21 genes involved in GG-NER based on GO
annotations, 7 have been documented as cancer genes in “cancer-gene census.” Similarly,
3 out of 6 genes involved in TC-NER have been documented as cancer genes. Based on
the HG enrichment test, both GG-NER and TC-NER are significantly enriched with
cancer genes, along with their parent node NER with p-value of 2e-07, 2e-04 and 2e-06,
respectively. However, under the Bayesian network model, only GG-NER was prioritized
among the top list, while TC-NER and NER are filtered out as “overlapping GO BPs.”
When we take a close look at the exact position of those cancer genes in the two
sub-pathways, it can be seen that all 3 cancer genes involved in TC-NER, i.e., XPB
(ERCC3), XPD (ERCC2) and XPG (ERCC5), function after the two sub-pathways
converge. None of the genes involved in the initial damage recognition, which is specific
to TC-NER, e.g., CSA (ERCC8) and CSB (ERCC6), has yet been documented as cancer
genes in “cancer-gene census.” On the other hand, a number of genes specific to
GG-NER, e.g., XPE (DDB2) and XPC, have been documented as cancer genes.
44
Therefore, it is speculated that TC-NER itself might not be a CAN-process. Such
hypothesis has been supported by previous studies. For example, it has been shown that
skin cancer is not a feature of pure Cockayne syndrome, a disease which could be caused
by the defects in CSA or CSB [73]. Since, as described above, both CSA and CSB are
specific to TC-NER, such observation indicates that pure perturbation of TC-NER might
not cause cancer. A more comprehensive survey regarding the relationship between
GG-NER and TC-NER can be found in [72]. Nevertheless, since our knowledge of
cancer genes is far from complete, the case about the role of TC-NER in cancers remains
to be elucidated. In this regard, it might be more precise to treat those “overlapping
modules” filtered out by the BN model as those cases where further investigation and
justification are needed.
2.2.2.3 Superiority of the BN model revealed by cross-validation
The CAN-processes inferred from the BN model is superior to the results obtained by the
local strategy in predicting unknown cancer genes as evaluated by cross validation.
Similar to the case of lethality, we employed cross validation to compare the BN model
and the LM model in predicting cancer genes in cancer-gene census. We measured both
the standard AUC and pAUC.2 as before. The results shown in Figure 2.6 are consistent
with the results in lethality. The improvement of the BN model over the LM model is
more significant at low false positive rate. The pAUC.2 increases by 12.7%, and the
standard AUC increases by 3%. Compared with the case of lethality, the improvement
here is smaller (pAUC.2 increases by 46.9% when GO BPs are used in the case of
45
lethality). The reasons are that our knowledge of cancer genes is far from complete, that
the proportion of cancer genes in the CAN-processes is much lower than the proportion
of lethal genes in lethal complexes, and that human genes are not as well annotated as
yeast genes. For example, more than 50% of human genes (more than 40% of cancer
genes) are only annotated with most general GO BPs (the size of GO BP >100). For those
genes, it is unlikely for any method to make an accurate prediction.
Figure 2.6: The ROC curve, AUC and pAUC.2 of 100-fold cross-validation in predicting
cancer genes using GO biological process. BN represents the Bayesian network model,
and LM represents the local Bayesian model.
46
2.2.2.4 Consistency of the results prioritized by the BN model in
different datasets
Comparison of CAN-processes prioritized in different cancer gene datasets shows that the
BN model results are more consistent with each other than the HG enrichment test results.
In order to show the consistency of CAN-processes prioritized in different cancer gene
datasets, a second group of cancer genes is considered. These cancer genes were
identified recently through systematic sequencing of colorectal and breast cancer
genomes for somatic mutations [43] and are referred to as Wood’s dataset (see Methods
for details). The same process and cutoff were used as before to generate the top
CAN-processes list by the BN model. These CAN-processes together with the same
number of top CAN-processes ranked by the HG enrichment test are shown in Table 2.2.
Between the 1137 and 973 genes involved in the two sets of CAN-processes prioritized
by the BN model in the two datasets respectively, a total of 101 are common to both. The
overlap is statistically significant as measured by the Hypergeometric p-value for
overrepresentation at 0.002. On the contrary, when the HG enrichment test was used,
genes involved in the CAN-processes prioritized in Wood’s data are significantly
underrepresented when compared to those involved in the CAN-processes prioritized in
“cancer gene census” (Hypergeometric p-value for underrepresentation is 1.9e-37).
Therefore, the BN model results are more consistent with each other than the HG
enrichment test results when different datasets are used.
47
Although statistically significant, the overlap between the two sets of CAN-processes
prioritized based on the two cancer gene datasets respectively by the BN model is only 5%
(intersection/union of genes). Since the two datasets of cancer genes differ in many
respects, such small overlap could reflect the different focus of the two datasets.
Particularly, since the “cancer-gene census” is assembled from previous studies and the
Wood’s dataset is derived from a recent study with new techniques, the small overlap
could indicate the discovery of potentially “new” CAN-processes by the Wood’s
experiment. To be more specific, while the majority of cancer genes in “cancer-gene
census” were detected from liquid tumors such as leukemias and lymphomas, those in
Wood’s dataset were identified from the colorectal and breast tumors that are normally
solid tumors. In solid tumors, precursor cells must become mobile and invasive in order
to become malignant. Consequently, some “new” CAN-processes might be “overlooked”
because their disruption is not required in liquid tumors since their precursor cells are
already mobile and invasive [58]. Alternatively, but not exclusively, some CAN-processes
might be newly discovered as a consequence of the advantage of systematic sequencing
strategy of cancer genomes, which could be more unbiased and comprehensive than
traditional cloning techniques. It is also possible that some “new” CAN-biological
processes were prioritized simply because of the incomplete annotation of GO, and those
cancer genes involved in the “new” CAN-processes might later be found to function in
some “old” CAN-processes.
Cytoskeletal anchoring and lipid transport are two examples of such potentially “new”
CAN-processes prioritized in the Wood’s dataset by the BN model. None of the genes
48
associated with the two GO BPs has been documented as cancer genes in “cancer gene
census.” Their potential roles in tumorigenesis or metastasis are discussed as following.
These findings might give insight into further study and treatment of human cancers.
Cytoskeletal anchoring: The involvement of cytoskeletal anchoring in cancer
development is not unexpected, especially considering that it functions as a direct or
indirect connection between two groups of cancer-related molecules, e.g., transmembrane
or membrane-associated proteins and cytoskeletal filaments, both of which are actively
involved in signaling transduction, cell-cell adhesion, and other cancer-related biological
processes. For example, FLNB and TLN1 are two cytoskeletal anchoring genes detected
in Wood’s dataset. Both of them have been detected to interact with integrins [74], a
family of transmembrane receptor proteins whose key roles in tumor growth and
metastasis have been explored with a long history[75]. Thus, it is speculated that
malfunction of FLNB or TLN1 could contribute to cancer development by disrupting or
improperly activating function of integrins or integrin-related signaling pathways. A
similar example is SHANK1. SHANK1 has been observed to interact with Somatostatin
receptor 2[76], which was shown to be able to sensitize human cancer cells to death by
ligand-induced apoptosis[77].
Lipid transport: Compared to cytoskeletal anchoring, the roles of lipid transport in
cancers are more complicated. On the one hand, in rapidly dividing cancer cells, the
availability
of cholesterol is essential for proliferation and progression
of the cancer[78].
On the other hand, lipid transport might directly or indirectly coordinate with many
signaling pathways that control cell birth and death. For example, given that both
49
high-density and low-density lipoprotein receptors (or receptor-related proteins) were
found to regulate proliferation or survival of cancer cells[79, 80], it is not surprising to
find HDLBP and SORL1 in Wood’s dataset. The former is a high-density lipoprotein
binding protein, and the latter has been detected to interact with a low-density lipoprotein
receptor-related protein-associated protein 1 (LRPAP1)[81]. Moreover, a number of lipid
transporters like ATP-binding cassette (ABC) transporters have been implicated in tumor
cell resistance to anticancer therapy[82]. ABCA1 in Wood’s dataset might be one such
example[83].
GO CAN-processes prioritized by the BN
model
Total
gene #
Cancer
gene#
GO CAN-processes prioritized by
the HG enrichment test
Total
gene #
Cancer
gene#
GO:0007016 cytoskeletal anchoring 10 3 GO:0007155 cell adhesion 689 34
GO:0030198 extracellular matrix
organization and biogenesis
27 6 GO:0022610 biological adhesion 689 34
GO:0007185 transmembrane receptor protein
tyrosine phosphatase signaling pathway
6 2 GO:0016043 cellular component
organization and biogenesis
2325 66
GO:0007155 cell adhesion 689 34 GO:0030198 extracellular matrix
organization and biogenesis
27 6
GO:0007605 sensory perception of sound 116 9 GO:0048856 anatomical structure
development
1537 47
GO:0051318 G1 phase 18 2 GO:0007275 multicellular
organismal development
1797 52
GO:0006869 lipid transport 100 6 GO:0048731 system development 1196 38
GO:0009112 nucleobase metabolic process 16 2 GO:0007519 striated muscle
development
62 7
GO:0045661 regulation of myoblast
differentiation
6 2 GO:0007605 sensory perception of
sound
116 9
GO:0042593 glucose homeostasis 11 2 GO:0050954 sensory perception of
mechanical stimulus
116 9
GO:0043534 blood vessel endothelial cell
migration
6 2 GO:0016337 cell-cell adhesion 239 13
GO:0007183 SMAD protein complex
assembly
4 2 GO:0032501 multicellular
organismal process
3128 73
GO:0060070 Wnt receptor signaling pathway
through beta-catenin
5 2 GO:0007167 enzyme linked
receptor protein signaling pathway
245 13
Median # 11 2 Median # 689 34
Table 2.2: top GO CAN-processes ranked by the Bayesian network model or the HG
enrichment test based on cancer genes from systematic sequencing of colorectal and
breast cancer genomes.
50
2.2.3 Simulation study
The purpose of this simulation study is to quantify the sensitivity of the BN and HG
method to two factors which could affect their performance in identifying causal modules
among overlapping sets. In the context of lethal protein complexes prediction, the two
factors considered in the study are (1) the degree of overlap among protein complexes, (2)
the proportion of lethal genes in a lethal complex.
The parameters we used to simulate protein complexes, complex lethality and gene
lethality are from the 390 curated protein complexes or the 94 curated lethal protein
complexes identified by the BN model. Below are details about the simulation study.
(a) Simulate protein complexes. From a pool of 1376 genes (the total number of genes
associated with the 390 curated protein complexes), we simulate protein complexes.
Each simulated protein complex is a random sample of the 1376 genes. The complex size
follows the distribution of that of the 390 curated protein complexes.
(b) Simulate protein complex lethality. We randomly assign 24% of the protein complexes
simulated in (a) as lethal protein complexes, and the rest as nonlethal protein complexes.
24% is the proportion of lethal protein complexes (94 out of 390) among the curated
protein complexes identified by the BN model.
(c) Simulate gene lethality. The lethality of each gene is set to nonlethal at the beginning. For
each lethal protein complex simulated in (b), we randomly sampled a proportion of its
members, and set the lethality of their corresponding genes as lethal. follows the
51
distribution of , where denotes the proportion of lethal genes among the 94 curated
lethal protein complexes identified by the BN model.
In order to test the algorithms’ sensitivity to different degrees of overlap, we altered the
number of simulated protein complexes in (a) from 100 to 1000 with an interval of
100. As increases, we would expect higher overlap among the simulated protein
complexes. We further quantified the degree of overlap by measuring the proportion of
genes that are involved in more than one complex. In order to test the algorithms’
sensitivity to different proportions of lethal genes in a lethal protein complex, we altered
the distribution of from the distribution of to that of 1/2 and 1/4 .For each
given and given distribution of , we repeated step (a) to (c) for 100 times to obtain
100 simulated datasets.
For each simulated dataset, we applied the BN method to infer the complex lethality
given the gene lethality and complex membership simulated above. We measured the
prediction accuracy by both the standard AUC and pAUC.2 as introduced in the
manuscript. As a comparison, we also applied the HG test to the simulated dataset. Figure
2.7 gives the box plot of AUC and pAUC.2 for both methods under different degrees of
overlap and different distributions of .
As shown in Figure 2.7, over the range of the two factors tested here, the BN model
consistently performs better than the HG method. It can be seen that as the degree of
overlap increases or the proportion of lethal genes in a lethal complex decreases, the
performance of both methods decrease. The amount of improvement of the BN model
52
over the HG method also varies with the two factors. When the proportion of lethal genes
in a lethal complex is relatively high, for example, when the distribution of equals to
that of , the amount of improvement of the BN model over the HG method increases
steadily as the degree of overlap increases. But when the proportion of lethal genes in a
lethal complex is relatively low, the amount of improvement of the BN model over the
HG method remains relatively small for the whole range of degree of overlap tested here.
The amount of improvement even starts to decrease slightly as the degree of overlap
further increases.
53
Figure 2.7: simulation results. The performance of the BN model and the HG method in
identifying lethal complexes given different degrees of overlap among protein complexes
and different distributions of the proportion of lethal genes in a lethal complex.
54
2.3 Materials and Methods
2.3.1 Bayesian network model
In this section, we explain the Bayesian network model we used to prioritize functional
modules mediating genetic variation and its phenotypic effects. For illustration, we take
single gene deletion, lethality and protein complexes as examples of the genetic
perturbation, the phenotypic effect and the candidate functional modules, respectively.
Figure 2.8 gives an example of the Bayesian network and the details are given below.
: a set of genes whose lethality is known. In Figure 2.8,
,
…
.
: a set of known protein complexes. In Figure 2.8,
,
,
.
|
,
: is the association matrix between genes and
complexes.
1 if protein complex
contains protein product of gene
, and
0 otherwise. In Figure 2.8,
!
"
#
1 0 0
1 1 0
0 1 0
0 1 1
0 0 1
%
&
'
.
(
)*
|
+: is the lethality of each gene which is observed from genome-wide
knockout experiments. *
1 if deletion of gene
leads to inviable phenotype and
otherwise. In Figure 2.8, (
)*
,
, *
-
… *
.
+.
(
*
|
: is the unknown lethality of each protein complex. *
1 if
inactivation of protein complex
leads to inviable phenotype, and 0 otherwise. We
refer to those protein complexes with *
1 as lethal protein complexes, which are
55
the causal protein complexes mediating the gene knockout and observed inviable
phenotype. In Figure 2.8, (
)*
,
, *
-
, *
/
+.
According to the module-based explanation of lethality, the lethality observed in gene
deletion is determined by the lethality of the associated protein complex(es); thus, as
shown in Figure 2.8, there is an edge pointing from *
and *
, where
1.
As discussed in the introduction, inference of (
from the observed (
becomes
difficult when the protein products of gene
participate in more than one protein
complex. For example, in Figure1, protein products of
participate in both protein
complexes
and
. If *
-
1, it is possible that *
,
1 or *
-
1 or both. In
principle, such a problem could be solved by carrying out complex-specific protein
inactivation experiments, that is, by designing a complex-specific antibody which
“knocks down” protein products of gene
in protein complex
while keeping
protein products of gene
in protein complex
unchanged, and vice versa. The
inviable or viable phenotypes observed in such complex-specific protein inactivation
experiments could reflect the lethality of protein complexes more directly. Motivated by
this, we introduce a set of hidden variables which denote the outcome of
complex-specific protein inactivation experiments and function as bridges between (
and (
.
(
*
|
,
,
1 : are the hidden lethality data for each
complex-specific protein inactivation experiment. *
1 if inactivation of protein
products of gene
that participates in protein complex
(while keeping other
56
protein complexes active) leads to inviable phenotype, and 0 otherwise. In Figure 2.8,
(
)*
,
,
, *
-
,
, *
-
-
, *
/
-
, *
0
-
, *
0
/
, *
.
/
+.
The following explains the causality relationship among the three sets of variables and
derives the conditional probability tables (CPT) for the Bayesian network.
In reality, when gene
is knocked out from the genome, all of its protein products are
knocked out and hence inactivated. Thus, obviously, deletion of gene
will lead to
inviable phenotype if one complex-specific inactivation of its protein products leads to
inviable phenotype. However, even when no such complex-specific inactivation leads to
inviable phenotype, there is still a possibility that the deletion of gene
will lead to
inviable phenotype. One possibility is that there are agonistic interactions among protein
products of gene
that participate in different protein complexes such that
simultaneous inactivation of them, which is the case in deletion of gene
, leads to
inviable phenotype. Other possible explanations may include protein products of gene
possibly acting in certain lethal modules that are not included in our dataset of
candidate modules. For simplicity, we refer to all those other events or effects that cause
inviable phenotype, besides those reflected by complex-specific protein inactivation, as
“other effects” in our paper and assign them probability
12345
. According to the above
arguments, the CPT for each *
(
can be derived as follows:
6*
17 *
7
,
18 9
1 :; ∑ *
,=
>
?
@
A 1
12345
BCDEFG:HE
I (2.1)
57
Not all members in a complex are equally important. Although individual inactivation of
some members of a protein complex will cause that protein complex to “break down” or
lose its function, individual inactivation of others might not. Functional redundancy
within protein complexes might be one of many possible reasons. Thus, even for a lethal
protein complex, inactivation of certain members might not lead to inviable phenotype.
To accommodate such phenomena, we assign a lethal probability to each protein complex
, denoted as
J423KJ
, representing the probability that individual inactivation of its
members will lead to inviable phenotype. Needless to say, the lethality probability will be
zero for nonessential protein complexes. Thus, CPT for each *
*
can be
expressed as:
6*
17*
8 L
J423KJ
*
1
0 *
0
I (2.2)
Figure 2.8: An example of the Bayesian network. In this network, lethality of deletion of
gene
, denoted as *
, is determined by lethality of the complex-specific inactivation of
its protein products, denoted as *
, which in turn is determined by lethality of
inactivation of protein complex
,denoted as *
.
58
2.3.2 Inference of module lethality given gene lethality
Our goal is to infer each variable in (
in the above Bayesian network, given the
observed values in (
and the known network structure. Since there are other unknown
variables (
and unknown parameters )
J423KJ
M
+ (the parameter
12345
is set
to be a fixed value for simplicity), we employed an expectation-maximization (EM)
strategy with details given below.
Initiate each
J423KJ
N O
J423KJ
N O
PQR
,
.
E-step: Estimate each variable in the set of (
and (
by Gibbs sampling [84],
denoted as (
S
PTR
*
U
N
O
PTR
|
,
and (
S
PTR
*
N
S
PTR
|
.
M-step: calculate the maximum likelihood estimation of each parameter
J423KJ
as the
following:
J423KJ
N O
PTR
∑
*
U
N
O
PTR
∑
*
N
S
PTR
V
2.3.3 Prediction of gene lethality through cross-validation
In the prediction of gene lethality through cross-validation, we randomly chose part of the
gene lethality data (training data) as known to estimate the parameters )
J423KJ
M
+
of the BN model and to infer the probability of being lethal for each module
P*
1R7
by the above EM strategy. The estimation results were used to
infer the probability of being lethal for the remaining genes (testing data) by the
following formula:
59
W*
1X 1 Y ∏ 61 Y
J423KJ
P*
1R8
,=
>
?
@
(2.3)
2.3.4 Local Bayesian model
As a comparison to the BN model, we also employed the local Bayesian model (LM) to
infer the module lethality given gene lethality. The LM model differs from the BN model
only in that only the sub network for a candidate module is considered as if none of its
components participates in other modules (shown by dashed line in Figure 2.8). Thus,
Formula (2.1) in the BN model has been simplified to the following formula:
6*
17*
,
,
18 [
1 :; *
1
12345
BCDEFG:HE
I P2.4R
Formula (2.2) remains the same for the LM model.
2.3.5 Data sources
We first applied our Bayesian network model to the gene lethality data in S. cerevisiae.
The gene lethality data in S. cerevisiae were obtained from a genome-wide gene deletion
study[35], where, out of a total of 5,916 genes deleted, 18.7% (1,105) are essential for
growth on rich glucose medium. In order to analyze the conservation mode of gene
lethality across different species, the lethality data in C. elegans and D. melanogaster
were obtained from two genome-wide RNA interference experiments [37, 60],
respectively. The ortholog mapping data were downloaded from the Inparanoid database
[85].
60
Protein complexes were treated as one type of functional modules in the case of gene
lethality. Both curated and HTP protein complex data were used in our analysis. The
manually curated protein complex dataset was downloaded from the ScISIC dataset in
the Bioconductor [86] R package ScISI (version 1.10.0). A detailed description can be
found from [87]. It consists of protein complexes derived from small scale experiments
that have been curated by Gene Ontology (GO) or MIPS, and other manually curated
protein complexes obtained from IntAct [88]. In total, the dataset consists of 390 protein
complexes, including 582 lethal genes and 794 nonlethal genes. The high throughput
(HTP) protein complex data were obtained from a recent large-scale AP-MS experiment
[15]. It consists of 491 protein complexes, including 577 lethal genes and 828 nonlethal
genes.
For more broadly defined functional modules, we took GO biological processes [56]. The
data were downloaded from the YEASTGO dataset in the Bioconductor [86] annotation
package YEAST (version 2.0.1) and processed with the package GOstats [66]. It was
further processed such that, if one gene belonged to a GO node, it would be included in
all of its ancestor nodes. In total, the dataset contains 2200 GO biological processes,
including 1031 lethal genes and 4223 nonlethal genes.
We then applied our strategy to prioritize biological processes causally implicated in
human cancers. We downloaded and retrieved GO annotations for human genes from
NCBI website [89]. The data were further processed in a manner similar to that for S.
cerevisiae. In total, there are 14371 human genes involved in 4644 biological processes.
61
Two datasets of cancer genes were considered. The first dataset was downloaded from the
“cancer-gene census” database, a curated cancer gene database assembled from previous
studies [59]. The second dataset was obtained from systematic sequencing of colorectal
and breast cancer genomes for somatic mutations [43]. In this dataset, the somatic
mutations found in cancers
were classified into either "drivers" or "passengers" [90]
according to authors’ criteria. Driver mutations are
causally involved in the neoplastic
process and are positively
selected during tumorigenesis. Passenger mutations provide
no
positive or negative selective advantage to the tumor, but they
are retained by chance
during repeated rounds of cell division
and clonal expansion. In the second dataset, only
candidate cancer genes that are most likely to be drivers according to authors’ criteria [43]
are considered in our analysis. After mapping to the GO biological processes, there are a
total of 331 and 225 cancer genes in the two datasets, respectively.
2.4 Conclusion
In this paper, we attempt to decode phenotypic effects caused by genetic perturbation
through known functional modules. By decoding gene lethality through protein
complexes and investigating the conservation of gene lethality across different species in
the context of lethal and nonlethal protein complexes, we provide evolutionary evidence
supporting the modular nature of lethality. Based on human cancer genes, we prioritized
many biological processes causally implicated in cancers, which are consistent with
previous knowledge. We also identified some “new” biological processes whose roles in
cancer development are less well understood: cytoskeletal anchoring and lipid transport.
62
Motivated by the overlapping structure of functional modules in biological systems, we
provide a global strategy to distinguish functional modules that are most likely to be
actual causal modules from a large number of other “overlapping modules” whose only
relatedness with the phenotypes most likely results from the sharing of gene members
with the causal modules. Local strategies, such as the HG enrichment test, ignores the
overlapping structures among modules and is thus less effective in distinguishing actual
causal modules from the “overlapping modules.” In contrast, the BN model filters out
“overlapping modules,” which generates a more accurate list of causal modules.
Compared to either the HG enrichment test, or the LM model, in the case of prediction of
gene lethality, results consistently showed that the modules prioritized by the BN model
are better representatives of the actual causal modules, even though it can never be
ascertained whether or not the modules prioritized by the global strategy are, indeed, true
causal modules in the absence of any direct biological benchmark.
In summary, our results indicate that modularity, which is believed by investigators to be
a true property of biological systems, can be applied to the interpretation of phenotypic
variations from perturbations in genetic variation. This might shed light on the study of
more complicated phenotypes, such as human disease. With proper modeling, the
global strategy could potentially be applied to a variety of fields. For example, it might be
interesting to see how it helps identify differentially expressed gene sets in microarray
data analysis. More importantly, a module-based prediction strategy will benefit the
study of human diseases by transferring phenotypic data learned from other organisms to
human beings. This has significant implications for the treatment of human cancers.
63
2.5 Acknowledgements
We thank Zhidong Tu for discussion and comments. This work was partly supported by
National Institutes of Health (NIH)/National Science Foundation Joint Mathematical
Biology Initiative grant DMS-0241102 and NIH grant P50 HG 002790.
64
Chapter 3 Prioritizing reliable hits from RNAi screens
3.1 Introduction
In the past few years, many groups have successfully conducted multiple genome-wide
RNA interference (RNAi) screenings in C. elegans, D. melanogaster and mammals,
using either whole animals or cell lines to investigate a full array of biological processes
at the systems level [91-98]. Compared with classical genetic screens, such as
transposon-mediated mutagenesis and somatic clonal analysis [99-101], RNAi
technology is revolutionary in that it allows investigators to quickly interrogate the
phenotypeic changes that occur upon reducing individual gene function at the genome
scale [41]. However, similar to many other high-throughput technologies, RNAi screens
are not completely flawless. On the one hand, genes may not always be effectively
knocked down and will consequently be missed by the screening. We refer to these genes
as false negatives (FNs). On the other hand, owing to the tolerance for mismatches and
gaps in base-paring with targets, small interfering RNA (siRNA) could possibly target up
to hundreds of sequences [45, 46], which are often termed as off-target effects (OTEs).
Such OTEs are believed to be the main reason for false positives (FPs) in RNAi screens.
The use of long double-stranded RNAs (dsRNAs) in Drosophila has been proposed as a
means of reducing the occurrence of OTEs [47]. However, two groups reported that
OTEs mediated by short homology stretches within long dsRNAs were prevalent in
Drosophila, and that therefore the effectiveness of dsRNAs for reducing OTEs needs
further investigation [102, 103]. Furthermore, OTEs and low efficacies in knocking down
65
certain genes are not the only sources for FNs and FPs associated with RNAi screens. As
a matter of fact, designing a high-throughput RNAi screen involves many levels of
decision-making, such as the type and concentration of RNAi reagents, the readout
options, and the methodologies and criteria used for hit selections, each of which could
affect the quality of the final results [47]. For example, it has been shown that the
adoption of a better analytic method for hit selection may help reduce the rate of FPs
and FNs [104, 105].
Both computational and experimental efforts have been made to identify errors in RNAi
screens. For example, Ma et al [102] and Kulkarni et al [103] suggested that dsRNAs
which contained >= 19-nucleotide(nt) perfect matches to unintended targets or had
simple tandem repeats of the tri-nucleotide CAN (N represents any base) might cause
OTEs and thus contribute significantly to FPs. Consequently, sequence-based
computational analysis can be used to predict potential FPs in RNAi screens. However,
such prediction is not applicable to identifying FNs. Moreover, DasGupta and colleagues
found that there was a lack of strict correlation between the sequence match of 19 nts and
FPs, and they suggested that the "FP results" obtained from dsRNAs that were predicted
to have OTEs based on sequence analysis should not be blindly treated as artifacts
without further tests [106]. In their study, to experimentally distinguish true positives
(TPs) from FPs, they rescreened hits identified in the original screen using multiple,
independent “off-target (OT)-free” dsRNAs. However, such experimental validation has
its own drawbacks. First, since not all dsRNAs are effective in knocking down the target
genes, failure in validating the original positive hits is insufficient for validating FPs. In
66
fact, they showed that some known regulators of the pathway under investigation were
actually missed by the validation screens [106]. Second, since our knowledge of the
mechanisms involved in OTEs is still developing, the successful validation of RNAi hits
by so-called “OT-free” dsRNAs might actually be the result of unknown OTEs. Third,
validation screens are usually conducted only on the positive hits from primary screens,
and FNs cannot be recovered without additional effort.
As diverse genomic data accumulate, integrating RNAi screening results with other
genomic information, particularly those represented in the form of networks, may help in
identifying FPs and FNs. Network-based analysis has been widely applied to solving
many biological problems. For example, methods have been developed using
protein-protein interaction networks to predict unknown disease genes [55, 107, 108], or
to diagnose disease subtypes [52]. A common principle adopted by most of these
network-based studies is “guilt by association”, i.e., nearby genes in the network are
more likely to possess similar functions, or will lead to similar phenotypic changes, when
perturbed. Here, we test whether this principle holds for RNAi hits, and if it does, we
intend to apply it to addressing the noise issue associated with RNAi screens. We also
anticipate that network analysis may help to reveal the underlying mechanisms that link
the perturbed genes with the observed phenotype changes, which may not be directly
obtainable from the raw screening data. Specifically, by perceiving the cell or organism
as a dynamic system composed of interacting functional modules which are defined as
discrete entities whose functions are separable from those of other modules [49], the
network information can help us to identify the underlying module structure.
67
Here we present a comprehensive network analysis using 24 published genome-wide
RNAi screens in Drosophila. We first verify the “guilt by association” principle by
showing that RNAi hits are significantly more connected than random cases. We then
develop a network-based RNAi phenotype scoring method termed NePhe to integrate
information from both network topology and RNAi screening results. We demonstrate the
effectiveness of NePhe scores in identifying putative FPs and FNs by a novel rank-based
test and two case studies. We show how the network information can help identify the
underlying modules as formed by the refined hits that potentially explain the RNAi
phenotype changes as observed by the screen experiments. Finally, we discuss limitations
of our approach and potential follow-up studies.
3.2 Results
3.2.1 RNAi hits have higher network connectivity than random
chance hits
The Drosophila protein-protein interaction (PPI) network was built from PPIs in the
STRING database [109]. STRING is a comprehensive PPI database, and the PPIs are
experimentally derived or predicted by comparative genomics and text mining. In total,
10,297 proteins and 248,355 interactions were used to construct our network. Only
proteins within this network were considered throughout our analyses. Hereinafter, we do
not make explicit distinction between genes and their protein products. 24 genome-wide
RNAi screening results were downloaded from the flyRNAi database [110]. For each
68
screen, we collected the set of genes that were observed to cause changes in the
phenotype under investigation. We call these genes hits, and all the remaining genes
nonhits for that screen. For each screen, a sub-network was constructed exclusively upon
hits for that screen and the interactions among them. In order to evaluate network
connectivity of these 24 sub-networks, we measured three network attributes, i.e., number
of edges, size of the largest component and number of isolated nodes. We calculated
two P-values for each attribute in each sub-network by either randomizing nodes or edges
(see Materials and Methods for details). Table 3.1 lists the number of edges and P-values
for each screen. In total, 20 out of 24 sub-networks have a significantly greater number of
edges compared to randomized networks (both P-values < 0.005), supporting higher
network connectivity. Similar, but slightly less significant, results were obtained for the
other two network attributes (Table B1). Therefore, our results indicate that the principle
of “guilt by association” is valid and applicable for RNAi hits.
Although for most of the 24 screens, hits are significantly more connected than random
cases, the degree of connectivity varies considerably among screens as reflected by the
wide range of P-values. Several factors could account for this variance. For example, the
STRING database may contain relatively more complete PPIs for some screened
biological processes than others; therefore, some screening hits may appear to be more
connected. Another possible factor could be the different accuracy for generating the
screening results by different experimental protocols. For instance, as shown in Table 3.1,
the sub-network constructed from “viral replication” [111] screening hits is among the
most significantly connected, while the sub-network constructed from “nuclear import of
69
Smads” [112] is among the least connected. Although the readouts were measured by
immuno-fluorescence staining followed by automated microscopy for both screens,
different accuracies could exist. In “viral replication”, knockdowns of true participants
were expected to cause a reduced number of cells compared to negative controls and thus
presumably be easier to measure compared with the “nuclear import of Smads” process.
In this case, the knockdowns were expected to cause diffused distribution of Mad in
cytoplasm compared to the negative controls where Mad predominantly localized in
nucleus, making it difficult to measure the phenotype change accurately and leading, in
turn, to a higher error rate and lower connectivity. Furthermore, the criteria used for hit
selection varied dramatically from screen to screen. For instance, as listed in Table 3.1,
“Store-operated Ca2+ entry” [113] and “Ca(2+) influx” [114] are presumably two related
biological processes. However, the two screens differ dramatically with regard to the
number of hits and their associated P-values. The screen for “store-operated Ca2+ entry”
measured the dsRNA effects by percentage inhibition and used a relatively lenient cutoff
to obtain a large number of hits for further validation (1,122 hits). The screen for “Ca(2+)
influx” calculated z-score for each dsRNA and used a relatively stringent cutoff of -3 to
obtain a small number of hits (65 hits). The two hit sets overlap by 25 genes (Fisher
Exact P = 5×10
-9
), suggesting that these two screens are significantly related, although
very different in hit counts. Also screens are different as some kept the basic cell
metabolism hits while some removed them. For instance, the sub-network associated with
“JAK/STAT signaling” [94] appears to be less connected than that of the “Hh signaling
pathway” [115]. What may partially account for this is the fact that the former removed
70
ribosomal proteins, as well as proteins involved in RNA processing and translation during
the curation process, while the latter did not. Finally, some of the hit sets listed in Table
3.1 were obtained directly from primary screenings, and some were filtered with
additional validation assays. Although the 24 screening results studied here were curated
from an assembly of experiments that varied in multiple aspects, the comprehensive
study we performed here demonstrates that the higher network connectivity associated
with RNAi hits and the applicability of our NePhe scoring system, as shown here and in
later paragraphs, hold in general and are not restricted to a particular screening result.
71
RNAi screen [Ref] #hits #edges P-value1 P-value2
Store-operated Ca2+ entry [113] 1,122 4,281 2e-05 3e-98
ERK signaling [116] 982 7,050 2e-71 <1e-229
Nuclear import of Smads [112] 683 1,321 0.07 3e-06
Protein secretion and Golgi
organization [117]
645 6,597 <1e-229 <1e-229
Hh signaling pathway [115] 306 3,214 <1e-229 <1e-229
Bacterial infection [118] 286 2,803 <1e-229 <1e-229
Growth and viability [93] 281 1,871 <1e-229 <1e-229
Wnt signaling pathway [95] 167 368 5e-51 1e-92
Light-dependent CRY degradation [51] 131 197 1e-28 9e-105
Neural outgrowth genes [119] 128 414 7e-146 3e-145
Chlamydia infection [120] 126 107 2e-07 2e-17
Regulators of NFAT [96] 121 29 0.7 0.002
Multipolar divisions [121] 115 62 0.005 9e-12
Viral replication [111] 104 2,069 <1e-229 <1e-229
JAK/STAT signaling [94] 104 85 4e-09 1e-21
Mycobacterial infection [122] 76 176 1e-129 8e-117
Transcript-specific mRNA export [123] 65 146 6e-141 2e-100
Ca(2+) influx [114] 65 137 3e-123 9e-54
Muscle assembly and maintenance
[124]
39 21 5e-11 5e-08
Caspase activation [125] 37 4 0.4 0.2
Mitochondrial and Peroxisomal Fission
[126]
22 2 0.3 0.1
Histone pre-mRNA processing [127] 17 4 2e-04 2e-05
E2F repression [128] 15 7 2e-13 1e-39
Orai proteins [129] 15 9 1e-21 1e-30
Table 3.1: The network attributes and corresponding P-values for the 24 sub-networks
constructed from RNAi hits. Results are sorted based on the number of hits in descending
order. The P-value1 and P-value2 are calculated by two different randomization strategies,
i.e., 1) randomizing nodes and 2) randomizing edges (with fixed node degree) (see
Materials and Methods for details).
3.2.2 Identifying the best performing NePhe scoring system
Given the above observations, we then tried to apply the "guilt by association" principle
to address the issue of FPs and FNs associated with RNAi screens. In general, we believe
72
that if a gene has tight connection with many hit genes, then it is likely to be a TP in the
case of a hit or an FN in the case of a nonhit, and vice versa. One computational problem
that arises is the need to quantify the distance or similarity between a pair of genes in the
context of network. Several different measurements have been proposed in previous
studies to address similar problems [107, 108, 130]. We consider four of them: direct
neighbor, shortest path, diffusion kernel [131] and association analysis-based
transformation [130]. In addition, we also needed to quantify the overall similarity
between a gene and its neighbors, or in an extreme case, all the remaining genes in the
network. In this analysis, we considered three different summation formulas to calculate
the overall similarity (see Materials and Methods for details). Thus, in total, we compared
twelve different scoring functions, i.e., combinations of four pair-wise similarity
measurements and three summation formulas (see Table B4 for details). We call these
scoring functions Network RNAi Phenotype (NePhe) scoring functions, since we
integrate both the network topology and RNAi screen data to derive the NePhe scores.
Since full annotations for true positive hits or true negative hits are not available for most
RNAi screens, it is not possible to directly compare the performance of each scoring
method. To overcome this difficulty, we designed a rank-based test to indirectly estimate
the relative performance of different scoring functions (see Materials and Methods for
details). Although the NePhe scoring functions differ in how they define pair-wise
similarity and how to summarize the similarity across all neighbors, the common scenario
is that a gene would receive a higher score if a greater number of its neighbors are hits.
Therefore, under the principle of “guilt by association”, an FP should be more likely to
73
receive a lower score compared to TPs. In contrast, an FN should be more likely to
receive a higher score compared to TNs. Based on this reasoning, the rank-based test
works as indicated in the following description. We assume that all hits in the original
RNAi screens are TPs and all nonhits are TNs. One hit is placed into the nonhit set as if it
were a nonhit (simulated FN). We then rank all nonhits, including the simulated FN,
using different scoring methods (see Figure 3.1). Similarly, one nonhit is added to the hit
set as though it were a hit (simulated FP), and we then rank all hits, including the
simulated FP, using different scoring methods. We repeat the above procedure for each hit
and nonhit for each screen. We evaluate the performance of each scoring method based
on two quantities: the relative rank (RR) of simulated FNs among nonhits and the RR of
simulated FPs among hits. For each scoring function, we calculate the group means of the
two quantities for each screen, and the overall performance of each method is determined
by the grand means of the two quantities from all 24 screens (see Material and Methods
for details). Theoretically, for an optimal scoring method, the RR of FNs should be
close to 1 (ranked at the top), and the RR of FPs should be close to zero (ranked at the
bottom). In reality, however, because not all hits in the original hit sets are TPs, we do
not expect every FN simulated in this way to be ranked high among all negatives, as not
all negatives are TNs either. Similarly, we do not expect every simulated FP to receive a
low rank among all positives. However, as long as the original hit sets are significantly
enriched for TPs (which we believe to be true for most screens), the rank-based test
should reflect the relative performance of each method.
74
Figure 3.1: The flowchart for the rank-based test. We put one hit into the nonhit set as if
it were a nonhit (simulated FN). We then ranked all nonhits, including the simulated FN,
using different scoring methods. Presumably, a good scoring system can rank the “FN”
higher, while a bad scoring system cannot.
Figure 3.2 shows the performance of different NePhe scoring methods estimated by the
rank-based test. First, as indicated by the grand mean of FPs and FNs, all the
network-based scoring methods perform much better than random chance (which is
expected to be 0.5) in ranking FPs and FNs for all 24 screens (Figure 3.2(a) and (b),
respectively). It is of note that the error bars, which represent the standard deviations, are
considerably larger in Figure 3.2(a) than in Figure 3.2(b). This is expected since there are
75
fewer hits than nonhits for all the screens, hence fewer simulated FNs than FPs. Second,
when considering the overall performance of the four similarity measurements, diffusion
kernel performs the best, followed by association analysis-based transformation, shortest
path, and then direct neighbor. This result is consistent with previous studies [107, 108]
and supports the superiority of global measurements (e.g., diffusion kernel) over local
measurements (e.g., direct neighbor). When considering the three summation functions,
formula 3 performs slightly better than the other two. However, the summation formulas
have less influence on the overall performance compared to similarity measurements. The
better performing formula 3 endorses the calculation of a gene’s similarity to other genes
by putting different weights to hits and nonhits (see Materials and Methods). Third, when
we compare the NePhe scoring system with the GeneRank algorithm [132], we find that
GeneRank is not as powerful as our models in recovering FPs (Figure 3.2(b)), but is
comparable in prioritizing FNs (Figure 3.2(a)). We optimized the parameter d in the
GeneRank algorithm by varying it from 0.1 to 0.9 with 0.1 intervals, and selected the one
with the best performance to compare with our best performed models. Finally, there is
no consistently best performing NePhe scoring function, and the best function is
somehow screen-specific. Figure 3.2(c) and Figure 3.2(d) show the screen-specific
performance of different methods for two screens that are used for case studies in later
sections (see Figure B1 for all the 24 screens). It should be noted that the relative
performance of different methods varies considerably for each screen. One possible
reason for this is that different RNAi hit sub-networks may have distinct characteristics.
For example, hit sub-networks of smaller size appear to favor the diffusion kernel method
76
over the direct neighbor method (see Appendix B for details). Since the rank-based test
can tell us which method is the best for a particular screen, we simply use the method
with the best performance throughout our analyses with the following two case studies.
Figure 3.2: The overall performance of different methods in identifying FNs (a) and FPs
(b) in the rank-based test and the screen-specific performance of different methods in
identifying FNs (c) and FPs (d) in the rank-based test. The error bars represent the
estimated standard deviations for the corresponding quantities. The DN, SP, DK and AT
represent the four different network similarity measurements, i.e., direct neighbor,
shortest path, diffusion kernel and association analysis-based transformation, respectively.
Index 1, 2 and 3 represent the three different summarizing formulas, respectively (see
Table B4 for details). GR represents the GeneRank algorithm.
77
To further quantify the RR of simulated FNs among nonhits and the RR of simulated FPs
among hits, we show their distributions for each screen in Figure 3.3. As seen in Figure
3.3(a), 14/24 screens have their majority (>50%) of simulated FNs ranked above 0.8
among all negatives, or 9/24 screens, if considering a higher threshold of 0.9. Similarly,
15/24 screens have their majority of simulated FPs ranked below 0.3 among all positives,
or 7/24 screens, if considering a lower threshold of 0.1 (Figure 3.3(b)). Furthermore, we
find that the results shown in Figure 3.3 correlate well with results in Table 3.1, i.e., the
effectiveness of the NePhe scoring system for a particular screen largely correlates with
the degree of connectivity of the sub-network derived from that screen. The two related
screens, Ca2+ influx and store-operated Ca2+ entry, received quite different ranks among
all the 24 screens based on RR of FNs and RR of FPs. Since the screen for Ca2+ influx
used a more stringent cutoff to call hits, the number of hits is much smaller (65 hits)
compared to the screen for store-operated Ca2+ entry (1,122 hits) and is presumably,
therefore, of better quality. The better rank it received based on the NePhe scoring system
suggests that the output of our approach is reasonably dependent on the quality of its
input. For the two screens that we use for later case studies, i.e., “Hh signaling pathway”
and “Wnt signaling pathway”, the performance of NePhe scoring is intermediate among
all 24 screens.
78
Figure 3.3: The distributions of RR of FNs among nonhits (a) and RR of FPs among hits
(b) for each screen in the rank-based test. The RR was computed by the best performing
scoring method for that screen according to the rank-based test. The notation for each
method is the same as in Figure 3.2.
79
3.2.3 Case studies: Hedgehog (Hh) and Wnt signaling pathways
In this section, we study RNAi screens interrogating Hh [115] and Wnt signaling
pathways [95]. Because all the original hits had been rescreened by an independent
collection of dsRNA to assess FP rates in a follow-up study [106], we chose these two
particular screens. Thus, for these two particular RNAi screens, we can use this validation
screen as an independent experimentally derived reference set to estimate the
performance of the NePhe scoring system in identifying FPs.
3.2.3.1 Comparing NePhe scoring system with experimental validation
We compare the NePhe scoring system with experimental validation from the following
four aspects.
First, NePhe scores correlate with experimental validation results. We ranked hits in the
original screen of Hh signaling pathway by their NePhe scores and put them into bins.
Within each interval, we calculated the proportion of hits confirmed by experimental
validation, termed as the reproducibility rate. As shown in Figure 3.4(a), the
reproducibility rates positively correlate with the NePhe scores. Statistical tests show that
the ranks of all reproducible hits are significantly higher than those of irreproducible hits
(P-value = 4e-14 by Wilcoxon rank-sum test). A similar, but weaker, trend can be seen for
the Wnt signaling pathway (P-value = 0.04 by Wilcoxon rank-sum test) (Figure 3.4(b)).
The decrease of reproducibility rates is most visible for RR between 0 and 0.1, but less
apparent for other intervals. One possible reason is that the original hit set for the Wnt
80
signaling pathway was already quite accurate. In fact, the overall reproducibility rate for
the Wnt signaling pathway is about 74%, higher than that for the Hh signaling pathway
(56%). As the validation experiment is also based on RNAi technology, it has its own
FP/FN issues and may fail to validate an already very accurate hit set. Here, the 74%
reproducibility rate is comparable to the average reproducibility rate observed for the
validation screen when the same collection of dsRNA was used to self-validate, which
clearly demonstrates a limitation of experimental validation [116].
Figure 3.4: The reproducibility rate of hits in the validation screen within each interval
of the RR by NePhe score for Hh (a) and Wnt (b) signaling pathway.
81
Second, the NePhe scoring system can prioritize known regulators of Hh/Wnt pathways
that failed to be confirmed by experimental validation. As discussed in the introduction,
validation experiments are based on RNAi and can have their own FN issues. Using Hh
and Wnt signaling pathways as examples, we find that the validation experiment indeed
failed to validate some of the known pathway regulators. As shown in Table 3.2, the
KEGG [133] pathways contain 25 and 65 genes for Hh and Wnt signaling pathways,
respectively (we regard them as true positives). The original screens identified 9 and 9 of
these known regulators, respectively (Table 3.2). However, the validation experiment
only confirmed 6 out of 9 for the Hh signaling pathway and 8 out of 9 for the Wnt
signaling pathway. Those unconfirmed regulators may be suggestive FNs of validation
screens, although they could also be missed by the validation screens for a multitude of
reasons unrelated to the FN/FP rate (see the discussion). On the other hand, the NePhe
scoring system seems to successfully capture all the known regulators in the original hit
sets. By calculating the NePhe scores for all the original hits (306 and 167 for Hh and
Wnt signaling pathways, respectively) and choosing the top ranked hits of the same size
as the experimentally validated hit sets (171 out of 306 and 123 out of 167), we see that
all the original hits contained in KEGG pathways are kept in these NePhe top-ranked hit
sets (Table 3.2). Therefore, compared with the validation experiments, the NePhe scoring
system is better at keeping hits that are known regulators, while some of these hits are
missed by validation experiments.
82
KEGG
pathway
(#genes)
All
hits
Experimentally
validated hits
Top-ranked
hits
All
nonhits
Top-ranked
nonhits
(RR > 0.9)
Top-ranked
nonhits
(RR > 0.8)
Hh
signaling
pathway
(23 )
9 6 9 14 7 10
Wnt
signaling
pathway
(65)
9 8 9 56 32 48
Table 3.2: Each cell represents the overlap between KEGG pathway genes (rows) and all,
or subsets of, hits/nonhits in the corresponding RNAi screen (columns). The hits/nonhits
were ranked by NePhe score. The top-ranked hits (column 3) are of the same size as the
experimentally validated hits (column 2).
Third, NePhe scores correlate with sequence-based OTE prediction for FPs. As discussed
in the introduction, OTEs mediated by homologous sequences or CAN repeats are
believed to be a main reason for RNAi screen FPs. We classified screening hits into two
categories, i.e., off-target (OT)-related and OT-unrelated, using sequence-based OTE
prediction similar to DasGupta et al. [106] (see Appendix B for details). Figure 3.5(a)
shows the proportion of OT-related hits for each NePhe score interval. It can be seen that
there is a strong negative correlation between the proportion of OT-related hits and their
RR for both the Hh signaling pathway and Wnt signaling pathway. Statistical tests show
that the rank of OT-related hits is significantly lower than that of OT-unrelated hits (P =
4e-13 for the Hh signaling pathway, and P = 9e-4 for the Wnt signaling pathway by
Wilcoxon rank-sum test). Therefore, FPs predicted from the NePhe score correlate well
with those predicted using sequence-based OTE prediction.
83
Figure 3.5: The proportion of OT-related hits within each interval of the RR by NePhe
score for Hh (a) and Wnt (b) signaling pathway.
Fourth, the NePhe score can further refine the sequence-based OTE prediction for FPs.
DasGupta and colleagues pointed out that there was a lack of strict correlation between
predicted OT-related hits and FPs as confirmed by validation experiments [106]. Figure
84
3.6(a) shows the reproducibility rates for OT-related and OT-unrelated hits based on the
validation experiment for the Hh signaling pathway. It is clear that OT-related hits have a
lower reproducibility rate, indicating that sequence-based OTE prediction is generally
informative. However, 31.6% of OT-related hits were in fact reproduced in the validation
screen, indicating that FPs predicted by sequence analysis could actually contain a
considerable proportion of TPs. Similarly, TPs predicted by sequence analysis could in
fact be contaminated by a considerable proportion of FPs. Using the NePhe scores, we
further divided OT-related and OT-unrelated groups into high-ranked (e.g., RR of NePhe
score >=0.4) and low-ranked subgroups. Here, the cutoff of 0.4 is somehow arbitrary, but
we get similar results using different cutoffs from a reasonable interval (data not shown).
We compute the reproducibility rate for the four subgroups separately. The results are
plotted in Figure 3.6(b) and (c), and it can be seen that the high-ranked subgroup has a
much larger reproducibility rate than the low-ranked subgroup for both OT-related and
OT-unrelated hits. Statistical tests further confirm that the rank of reproducible hits is
significantly higher than that of irreproducible hits for OT-related and OT-unrelated hits
(Wilcoxon rank-sum test P = 0.04 for the former and 1e-9 for the latter). A similar, but
less significant, pattern is also observed for the Wnt signaling pathway (see Figure B2 in
Appendix B). Therefore, the NePhe scoring system can be used to further identify TPs
from predicted FPs (OT-related hits) or FPs from predicted TPs (OT-unrelated hits).
85
Figure 3.6: The reproducibility rate for OT-related and OT-unrelated hits (a), for
low-ranked and high-ranked OT-related hits by NePhe score (b) and for low-ranked and
high-ranked OT-unrelated hits by NePhe scores (c) for Hh signaling pathway.
86
3.2.3.2 Identifying FNs from nonhit set using NePhe scoring system
Compared to experimental validation, the value of the NePhe scoring system becomes
clearer when we consider recovering FNs that are missed by the original screens. This is
because in practice, most experimental validations focus only on primary hits, as does the
sequence-based OTE prediction. In this subsection, we provide evidence indicating that
top-ranked nonhits by the NePhe scoring system are enriched for genes that are relevant
to the pathway under investigation, while these nonhits are putative FNs missed by the
original screens.
First, top-ranked nonhits are enriched for known regulators of the Hh and Wnt signaling
pathways. Table 3.2 lists the numbers of top-ranked nonhits that are known KEGG
pathway genes. For each pathway, 14 out of 56 KEGG pathway genes were missed by the
original screens and thus reported as nonhits. By ranking all the nonhits from original
screens based on their NePhe scores, we see that the top 10% nonhits were able to
capture 50% (Hh) and 57.1% (Wnt) of these missed KEGG pathway genes (P-values are
1e-4 and 2e-18 by Fisher Exact Test). The top 20% nonhits are able to capture 71.4% and
85.7% of these missed KEGG pathway genes (P-values are 4e-5 and 4e-26 by Fisher
Exact Test).
Second, top-ranked nonhits display mutant phenotypes similar to the mutant phenotypes
for known Hh or Wnt pathway genes. Here we assume that genes belonging to the same
pathway tend to show similar phenotypes when mutated. Because we have a set of known
regulators from KEGG, we can compare the mutant phenotypes to estimate how likely it
87
is that an unknown gene belongs to the same pathway. We retrieved allele phenotype data
for Drosophila genes from FlyMine [43] and used these data for mutant phenotypes.
There were 1,901 genes that had at least one allele phenotype, and we only considered
them in the following analysis. We calculated the mutant phenotype similarity between
each nonhit and known regulators (see Materials and Methods for details).The
distributions of the similarities for nonhits with different NePhe scores are shown in
Figure 3.7 in blue bars. We also computed the similarity between each hit and known
regulators (orange bar in Figure 3.7), and the phenotype similarity among known
regulators (red bar in Figure 3.7). First, as expected, the within pathway mutant
phenotype similarity is the highest for both Hh and Wnt KEGG pathways, which
supported our assumption. Second, there is a significant positive correlation between the
RR of nonhits and their mutant phenotype similarities to known regulators (P-value
<2e-16 by Spearman’s correlation test; estimated rhos are 0.28 and 0.31 for the Hg and
Wnt pathways, respectively). The strong correlation observed here indicates that the
NePhe scoring system indeed correctly ranked putative FNs to the top of nonhit genes.
Third, with regard to mutant phenotype similarity to known regulators, there is no
significant difference between top-ranked nonhits (rank>0.9) and hits (P-value > 0.1 by
Wilcoxon rank-sum test). In other words, these top-ranked nonhits are comparable to hits
when mutant phenotype similarity to known regulators is considered. The failure of
RNAi screens to detect these putative FNs might result from ineffective knockdown by
siRNA. Or, it could also result from the fact that the RNAi screens were carried in cell
lines and thus unable to capture certain regulators with detectable mutant phenotypes
88
only at the tissue or organism level. In any case, the NePhe scoring system can be used to
identify putative FNs that are not identifiable by experiment alone.
Figure 3.7: Distributions of the mutant phenotype similarities between nonhits within
each interval of the RR by NePhe score and known regulators (blue bars), between hits
and known regulators (orange bar) and among known regulators themselves (red bar) for
Hh signaling pathway (a) and Wnt signaling pathway (b). Only genes with at least one
allele phenotype are considered here.
89
3.2.4 Interpretation of RNAi phenotypes at module level
In this section, we use the Wnt signaling pathway as an example to show how the NePhe
scoring system can bring biological insights to the screening results and can help to
interpret RNAi phenotypes at the module level. Based on the RNAi screening results and
NePhe scores, we constructed a high-confidence Wnt signaling pathway-related
sub-network. The sub-network was built by hits in the original screen, top-ranked nonhits
by NePhe score and high-confidence interactions in the STRING network (confidence
score >0.9 [109]). We included the top 300 nonhits to the sub-network for two reasons: 1)
the RR is high, >0.97, and 2) its accuracy is most likely comparable to the original
screening hit set using the KEGG Wnt signaling pathway as reference because both sets
contain a similar proportion of Wnt KEGG pathway genes (~5%). Figure 3.8 shows the
largest connected component of this sub-network consisting of 209 genes in total, among
which 51 are hits (red), 158 are top-ranked nonhits (white), 24 are members in the KEGG
Wnt signaling pathway (green boundary), and 41 are supported by literature for their
association with the Wnt signaling pathway (square). We refer hereafter to those genes
within the KEGG Wnt signaling pathway as canonical participants. 15 out of the 24
canonical participants shown in Figure 3.8 are among the top-ranked nonhits (e.g., dsh,
dally), which further confirms the effectiveness of our computational strategy in
recovering putative FNs of RNAi screens. What might be more interesting in Figure 3.8
is that, with the network information and the nonhits recovered by NePhe scores, hits
detected in the RNAi screen appear to be clustered into several hypothetical modules.
90
These modular structures may help us to dissect the potential roles of module genes,
including the non-canonical participants, in the Wnt signaling pathway.
It is of note that most of the canonical participants of Wnt signaling pathway shown in
Figure 3.6 are clustered together in a single module (module I of Figure 3.6), so are most
of the literature-supported genes (square nodes). This suggests that traditional strategies
in identifying participants of Wnt signaling pathway might be biased on certain modules,
most likely the core modules [134]. Among the remaining modules that are mainly
composed of non-canonical participants, module II is related to transcription factor TFIID
complex, including Tbp, e(y)1, Taf1, Taf6 and etc. It is not surprising to observe TFIID
complex as regulators of Wnt signaling pathway, since binding of TFIID to DNA is
necessary for transcription initiation for most RNA polymerase II promoters. For example,
previous study of the promoter region of mouse frizzled related protein (a family of
secreted proteins involved in the Wnt signaling pathway) identified putative
transcriptional factor binding sites for TFIID and other transcription factors [135].
Module III is associated with PcG protein complex, a chromatin-associated multi-protein
complex containing Polycomb Group proteins, including Z, Pc, ph-p, ph-d and etc. The
involvement of PcG complex in Wnt signaling pathway has also been suggested by
previous studies. For example, genome-wide binding profiles of Polycomb Group
proteins showed that PcG proteins preferentially bound to developmental genes, many of
which encode transcriptional regulators and key components of signal transduction
pathways, including Wingless, Hedgehog, Notch and Delta [136]. Module IV contains
many proteins known to be involved in a diverse of signaling pathways, for instance, Pnt
91
of torso signaling pathway [137], aop of Ras signaling pathway [138], pvr of VEGF
receptor signaling pathway [139] and vn of EGF signaling pathway [140]. Their
involvement as regulators for Wnt signaling pathway suggests that the complex
cross-talking among different signaling pathways could be much more prevailing than
previously thought.
92
Figure 3.8: A sub-network associated with the Wnt signaling pathway. Red: hits of RNAi
screening. White: top-ranking nonhits by NePhe score. Green boundary: genes within
KEGG Wnt signaling pathway. Square: genes supported by literature for their association
with the Wnt pathway. Module I: canonical participants-related. Module II:
transcription factor TFIID complex-related. Module III: PcG protein complex-related.
Module IV: other signaling pathways-related. The visualization was generated using
Cytoscape [141]
93
Therefore, from the most general regulators, such as TFIID complex, to less general
regulators that are preferentially involved in controlling signaling pathways (e.g., PcG
complex), to participants of other signaling pathways that enable cross-talkings, and to
the most specific regulators like the canonical participants of Wnt signaling pathway, the
network approach we developed just reveals a fine modular view of the Wnt signaling
pathway that could be of great interest to biologists for further validation, and such views
are not directly derivable from raw RNAi screen data.
3.3 Materials and Methods
3.3.1 Network attributes, P-values and Randomization
24 RNAi screens were considered in total, and each screen was analyzed independently.
For each screen, a sub-network was constructed by including only hits for that screen and
PPIs among them as obtained from STRING database. We calculated three network
attributes, i.e., number of edges, size of the largest component and number of isolated
nodes. In order to obtain P-values for each network attribute, we constructed randomized
sub-networks for each of the 24 RNAi screens. From these randomized sub-networks, we
obtained null distribution of each network attribute for that particular screen. The P-value
was then computed by one-side testing assuming these attributes under null hypothesis
were normally distributed, while mean and variance were estimated from these
randomized sub-networks. The assumption of normal distribution has been verified and
adopted in a previous study in yeast [142]. We employed two different strategies in
94
generating randomized sub-networks from the whole PPI network. In the first strategy,
we randomly assigned a gene in the whole network as a hit while keeping the total
number of hits the same as seen from the real screen. Then we derived a sub-network
associated with the randomized hits from the whole network. In the second strategy, we
first randomized the edges of whole network while keeping the number of interactions for
each node fixed (implemented by R graph package). From the randomized whole
network, we then derived a sub-network associated with the original hits. For each
strategy, we generated 1,000 randomized sub-networks.
3.3.2 Network-based similarity
The network-based similarity
ij
S between gene i and j were measured in the following
four ways.
Direct neighbor:
ij ij
S A =
(3.4.1)
where A is the adjacency matrix, so that
1 if interacts with
0 otherwise
ij
i j
A
=
.
Shortest path:
( )
exp
ij ij
S sp = −
(3.4.2)
where
ij
sp denotes the shortest path between i and j in the network. We used the
95
exponential function exp( ) y x = − to transform the gene-gene distance to gene-gene
similarity.
Association analysis-based transformation:
#
max( , )
ij
i j
common neighbors of i and j
S
d d
= (3.4.3)
where
i
d denotes the degree of node iin the network. It is known that PPI networks are
both incomplete and inaccurate. One way to handle this problem is to transform the
original interaction graph to new graphs by removing spurious edges, adding biologically
valid ones, and assigning reliability scores to the edges constituting the final network
[130]. The motivation for the above transformation method is that proteins sharing many
neighbors are more similar, and the significance of this similarity depends on the number
of neighbors that each gene has.
Diffusion kernel:
ij ij
S K =
(3.4.4)
where ( ) exp K L β = , L A D = − , and A form the adjacency matrix of the interaction
network and D is a diagonal matrix containing the nodes' degrees. The diffusion kernel
can be seen as a random walk consisting of transitions to each one of the current node's
neighbors with probability of β [143].
ij
K can be regarded as a sum of the probabilities
96
over all paths from ito j . In this study, we explored two different values forβ , i.e., 0.1
and 0.01, and chose the one that gave better performance in the rank-based test.
3.3.3 NePhe scoring system
The NePhe score of gene j
for screen k
is calculated by the following three formulas:
,
1
kj ij ki
i G i j
NePhe S I
∈ ≠
=
∑
(3.4.5)
,
,
2
ij ki
i G i j
kj
ij
i G i j
S I
NePhe
S
∈ ≠
∈ ≠
=
∑
∑
(3.4.6)
, ,
3 (1 )
kj k ij ki k ij ki
i G i j i G i j
NePhe S I S I α β
∈ ≠ ∈ ≠
= − −
∑ ∑
(3.4.7)
where G denotes all the genes in the network,
ij
S denotes the network similarity between
gene
i and j , and
ki
I denotes the observation of gene i in RNAi screen k . 1
ki
I =
if
gene i is a hit for screen k , and 0 otherwise.
Intuitively, the more similar a gene is to hits according to the network-based similarity
(Formula 3.1.1-3.1.4), the more likely it is a TP/FN. Thus, a straightforward scoring
function would summarize the similarity of gene j to all hits of the screen k (Formula
3.2.1). However, the similarity of a gene to nonhits may also affect its likelihood of being
a TP/FN, i.e., higher similarity to nonhits may indicate lower possibility of being a
97
TP/FN. Motivated by this, we devised Formula 3.2.2 which divides the similarity of
gene j to all hits by its similarity to all the genes (both hits and nonhits). In order to
distinguish the different contributions of hits and nonhits to the final score, Formula 3.2.3
combines the similarity of gene j to all hits and all nonhits with different weights, i.e.,
k
α and
k
β .
In order to determine the parameters
k
α and
k
β in Formula 3.2.3, we used the following
linear regression model:
, ,
(1 )
kj k k ij ki k ij ki
i G i j i G i j
I S I S I γ α β
∈ ≠ ∈ ≠
= + − −
∑ ∑ (3.4.8)
Given observations for all genes
{ }
|
kj
I j G ∈ in screen k , the above linear regression
optimizes the coefficients
k
γ ,
k
α and
k
β so that the model, which differs from the
NePhe score (Formula 3.2.3) by a constant
k
γ , would predict actual observations with
the least square errors. Since
kj
I is a Boolean variable, it might be counterintuitive to use
linear regression instead of logistic regression in the above model. However, based on our
results (data not shown), linear regression performs better than logistic regression in the
rank-based test. The superiority of linear regression over logistic regression is especially
obvious when screens with smaller hit size are considered. In fact, previous studies
have successfully applied linear regression to classification problems [144, 145].
98
Different from previous studies[132, 146], the network-based similarity of a gene to itself
is not considered in any of the above three formulas (3.2.1-3.2.3). In other words, the
NePhe scoring functions do not consider a gene’s own RNAi phenotype in the screen.
This is particularly useful when we need to predict a gene's knockdown phenotype type
which has not been previously screened by RNAi.
3.3.4 Rank-based test
To systematically evaluate the performance of the above different scoring methods in
identifying FNs and FPs in the RNAi screening results, we designed a novel rank-based
test.
We simulated FNs by adding one hit gene into the nonhit set of screen k each time.
Specifically, for a gene
{ }
| 1
ki
i i I ∈ = (
ki
I specifies if gene iis a hit for screen k ), we
set 0
ki
I = and considered it as an FN. We updated the NePhe score according to Formula
3.2.1-3.2.3 for each gene with current setting of
ki
I . Particularly, to mimic real situations,
we also updated the parameters of
k
α and
k
β in Formula 3.2.3 by linear regression with
current setting of
ki
I . We then ranked all genes in the nonhit set by NePhe score, which
includes all the original nonhits and the simulated FN gene i. We denote the RR of the
simulated FN gene i as
ki
FNR ( (0,1]
ki
FNR ∈ ). We set
ki
I back to 1 afterwards. We
repeated the above procedure for each gene
{ }
| 1
ki
i i I ∈ = . The screen-specific
performance of a scoring method in identifying FNs for screen k is defined as the
99
group mean of
ki
FNR :
{ }
{ }
| 1
| 1
ki
ki
i I
ki
k
i
F
FNR
i I
NR
∈ =
⋅ =
=
∑
(3.4.9)
The overall performance of a scoring method in identifying FNs for all the 24 screens is
defined as the grand mean of
ki
FNR , which is the mean of the group mean of
ki
FNR .
k
k E
FNR
FNR
E
⋅
∈
⋅⋅ =
∑
(3.4.10)
where E denotes the set of all 24 screens. If a phenotypic score performs well in
recovering FNs, we would expect k FNR ⋅ and FNR⋅⋅ close to 1.
Similarly, we simulated FPs by adding one nonhit gene into the hit set of screen k each
time. Specifically, for a gene
{ }
| 0
ki
i i I ∈ = , we set 1
ki
I = and considered it as an FN.
We updated the NePhe score for each gene with current setting of
ki
I . We then ranked all
genes in the hit set by NePhe score, which includes all the original hits and the simulated
FP gene i. We denote the RR of the simulated FN gene ias
ki
FPR ( (0,1]
ki
FPR ∈ ). We
set
ki
I back to 0 afterwards. We repeated the above procedure for each gene
{ }
| 0
ki
i i I ∈ = . The screen-specific performance of a scoring method in identifying FPs
for screen k is defined as the group mean of
ki
FPR :
100
{ }
{ }
| 0
| 0
ki
ki
i I
ki
k
i
F
FPR
i I
PR
∈ =
⋅ =
=
∑
(3.4.11)
The overall performance of a scoring method in identifying FNs for all 24 screens is
defined as the grand mean of
ki
FPR , which is the mean of the group mean of
ki
FPR .
k
k E
FPR
FPR
E
⋅
∈
⋅⋅ =
∑
(3.4.12)
where E denotes the set of all 24 screens. If a phenotypic score performs well in
recovering FNs, we would expect k FPR ⋅
and FPR⋅⋅ close to 0.
3.3.5 Mutant phenotype similarity
We downloaded allele phenotype data for Drosophila from FlyMine [43]. In specific, for
each gene, we obtained a list of distinct terms describing the phenotypes that had been
manifested by at least one of its alleles, including the tissue type, cell type, and
developmental stage. After mapping to the PPI network, we obtained 1,901 genes
associated with at least one of the total 2,884 distinct phenotype terms. We represented
the phenotype profile for each gene as a Boolean vector of length 2,884, specifying
whether it was associated with each of the 2,884 phenotype terms. The mutant phenotype
similarity between two genes was then calculated by the cosine of the angle between their
phenotype vectors as in the previous study [36]:
101
cos
X Y
X Y
θ =
g
(3.5.1)
The mutant phenotype similarity of a gene to KEGG Hh (Wnt) signaling pathway was
measured by the average phenotype similarity of the gene to all the members in the
pathway.
3.4 Discussion and Conclusions
We carried out by far the most comprehensive network-based analyses on multiple
genome-wide RNAi screens in Drosophila. We showed that RNAi screen hits were
generally more connected in the PPI network than random cases. We developed a NePhe
scoring system to identify both FPs and FNs in RNAi screening results. We demonstrated
the power of such scoring system by a novel rank-based test and two case studies. We
provided our NePhe score results for both hits and nonhits of the 24 whole-genome RNAi
screens to provide a foundation for follow-up studies. We also showed that these NePhe
scores are reasonably robust to the random noise in the initial hit sets (see the Appendix B
for details). We implemented our strategy to compute NePhe scores in an R package
(www-rcf.usc.edu/~fsun) so that our approach could be used by the whole research
community in the future.
With that said, our approach does have several limitations. First, the NePhe scoring
system relies on the relative completeness and accuracy of PPI information. It is not
applicable to hits or nonhits whose target genes are missing from the interaction network.
102
Likewise, genes present in the network, but which have poorly characterized interactions,
are likely to yield inaccurate results. Second, the list of putative FNs and FPs created by
our approach are only suggestive and should not be regarded as definitive. Although hits
with lower NePhe scores will most likely validate at lower rates than those with higher
scores, the exact fraction of them representing true FNs and FPs cannot be known
without actually validating the data. For example, even though a gene is relevant to a
pathway under investigation, its knockdown does not need to have an effect for a
multitude of reasons, such as the paralogue issue and the alternative/branching of
pathway issue. Thus, some putative FNs suggested by our approach may not necessarily
be FNs. In this regard, scientists should use our approach to assess the general robustness
of their screen rather than use it as a substitute for experimental validation. Third, our
approach does not address the "specificity" issue with RNAi screens. A common
phenomenon observed by many screens is that genes related to basic cell metabolism
(e.g., ribosomal proteins, proteasome components, polymerases, or splicing factors) are
often reported as hits. These genes usually receive high NePhe scores since they are well
connected in the network. Although they are likely to be true hits, they may not be
relevant to the questions being asked of each screen. Some of these effects could be offset
by cross comparing multiple screening results. For example, we can remove from each
screen those genes that participate in cell growth and viability [93] and perform the
NePhe calculation afterwards (see Table B5 for the network connectivity of each RNAi
screen after the removal). However, such strategy will miss bona fide components that
may have pleiotropic effects [41, 134].
103
In summary, we present a novel network-based strategy that can potentially address the
FN and FP issue associated with RNAi screens. Follow-up experimental validations of
our results are extremely valuable for further quantifying the results of our approach.
Moreover, given the increasing popularity of RNAi techniques and rapidly accumulating
protein-protein interaction data in multiple model organisms, including human, the
applicability of our approach to other species is very promising.
3.5 Acknowledgements
This work was partly supported by National Institutes of Health (NIH)/National Science
Foundation Joint Mathematical Biology Initiative grant DMS-0241102 and NIH grant
P50 HG 002790.
104
Chapter 4 Future work
4.1. Uncertain causality between gene and phenotype
Our current implementation of the BN model to prioritize phenotype-associated
functional modules assumes the relationship between individual genes and a phenotype is
known and such relationship is modeled as a simple Boolean variable. For example, we
know if a gene is lethal or not in yeast, and represent a gene's lethality as 1 or 0. However,
for most complex traits, we have no explicit knowledge of their causal genetic factors,
and our approach will not be applicable in these cases. Fortunately, diverse genomic
studies have provided multiple evidences that can be used to estimate the gene-phenotype
association. For example, genome wide association studies can provide the association of
each SNP with a particular disease. Expression analysis can provide differentially
expressed genes between case and control samples. Multiple genomic evidences can also
be integrated to give a single score which represents the strength of the association
between a gene and a complex trait. Rather than a Boolean variable, the estimated
gene-phenotype associations by the above methods are usually represented in a
continuous variable. Thus, in the future, we could extend our BN model to accommodate
such continuous association value. By doing so, we may be able to prioritize gene
modules underlying complex traits without knowing the exact causality between gene
and phenotype.
105
4.2. The specificity issue with RNAi screens
Our approach to prioritize reliable hits in RNAi screens does not address the "specificity"
issue with RNAi screens. A common phenomenon observed by many screens is that
genes related to basic cell metabolism (e.g., ribosomal proteins, proteasome components,
polymerases, or splicing factors) are often reported as hits. These genes usually receive
high NePhe scores since they are well connected in the network. Although they are likely
to be true hits, they may not be relevant to the questions being asked of each screen.
Some of these effects could be offset by cross comparing multiple screening results. For
example, we can remove from each screen those genes that participate in cell growth and
viability [93] and perform the NePhe calculation afterwards. We can also identify and
remove hits that are repeatedly occurring in multiple screens since they are suspicious of
nonspecific effects. However, such simple filtering strategies will miss bona fide
components that may have pleiotropic effects [41, 134].
Network-based integrative analysis might help identify non-specific hits more effectively
than the above simple filtering strategies. The intuition is as follows. If the occurrence of
a hit in multiple screen results is caused by pleiotropic effects, that is, its corresponding
gene has multiple functions, we may expect that it interacts with different subsets of hits
in different screen results. But if the multiple occurrence of a hit is due to nonspecific
effect, that is, the function of its corresponding gene is universally important, we may
expect the hit interacts with the similar set of hits across multiple screenings. We may
make use of the above argument to refine our scoring system by integrating network
106
information and multiple screening results. Such scoring system may help not only filter
out noise but also nonspecific hits in RNAi screening results.
4.3. Epistatic effects
In our studies, we mainly focused on the single gene loss-of-function screening results.
However, the fact that the majority of knockout strains in yeast failed to display any
detectable phenotype indicates the limitation of such single gene screens. In fact, it is
believed that most genes affect complex traits, such as human diseases, through genetic
interactions with other genes. To address the above limitations, multiple large-scale
double gene loss-of-function screenings have been carried out in different organisms to
identify genetic interactions or epistatic effects, where double mutants cause phenotypic
effects that are not observed by single mutant. Thus, it is interesting and important to
extend our methods to accommodate such double gene screening results. Moreover, we
may want to integrate results from both single and double gene loss-of-function
screenings in the future.
4.4. Integrating modular and network representation
In our first study, the biological systems are depicted by hierarchical organized
predefined modules, such as GO annotations. Such structuralized representation of the
systems provides us with detailed information about the molecule components of each
functional module and how different functional units are related to each other. However,
the drawbacks are that the binary interaction information among components of a module
107
is unknown and that usually the predefined module only contains a subset of all
components due to incomplete knowledge. More commonly, the genomic information is
represented in forms of networks as in our second study. Such network representation is
easier to be directly derived from results of diverse genomic technologies and thus can be
very comprehensive. In addition, it can give direct pair-wise relationships between
molecules. However, in this case, the modular organization of the system is unclear.
Thus, it is very attempting to integrate both modular and network representations of
biological systems. For example, we can superimpose the modular information onto the
network. By doing so, we are able to provide module components with detailed binary
information. But more importantly, we could expand known module with putative new
members based on the network topology or even generate new hypothetical modules.
Under such mixed representation of biological systems, we may want to develop new
mathematical models and scoring systems to identify functional modules and
sub-networks underlying a particular phenotype.
108
References
118. Agaisse H, Burrack LS, Philips JA, Rubin EJ, Perrimon N, Higgins DE:
Genome-Wide RNAi Screen for Host Factors Required for Intracellular
Bacterial Infection. Science 2005, 309(5738):1248-1251.
34. Alonso JM, Ecker JR: Moving forward in reverse: genetic technologies to
enable genome-wide phenomic screens in Arabidopsis. Nat Rev Genet 2006,
7(7):524-536.
14. Andreas Bauer BK: Affinity purification-mass spectrometry. European Journal
of Biochemistry 2003, 270(4):570-578.
56. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP,
Dolinski K, Dwight SS, Eppig JT et al: Gene Ontology: tool for the unification
of biology. Nat Genet 2000, 25(1):25-29.
16. Badano JL, Katsanis N: Beyond Mendel: an evolving view of human genetic
disease transmission. Nat Rev Genet 2002, 3(10):779-789.
94. Baeg GH, Zhou R, Perrimon N: Genome-wide RNAi analysis of JAK/STAT
signaling components in Drosophila. Genes Dev 2005, 19:1861 - 1870.
124. Bai J, Binari R, Ni J-Q, Vijayakanthan M, Li H-S, Perrimon N: RNA
interference screening in Drosophila primary cells for genes involved in
muscle assembly and maintenance. Development 2008, 135(8):1439-1449.
2. Barabasi A-L, Oltvai ZN: Network biology: understanding the cell's functional
organization. Nat Rev Genet 2004, 5(2):101-113.
117. Bard F, Casano L, Mallabiabarrena A, Wallace E, Saito K, Kitayama H, Guizzunti
G, Hu Y , Wendler F, DasGupta R et al: Functional genomics reveals genes
involved in protein secretion and Golgi organization. Nature 2006,
439(7076):604-607.
4. Barker D, Pagel M: Predicting Functional Gene Links from
Phylogenetic-Statistical Analyses of Whole Genomes. PLoS Comput Biol 2005,
1(1):e3.
12. Beer MA, Tavazoie S: Predicting Gene Expression from Sequence. 2004,
117(2):185-198.
109
40. Berns K, Hijmans EM, Mullenders J, Brummelkamp TR, Velds A, Heimerikx M,
Kerkhoven RM, Madiredjo M, Nijkamp W, Weigelt B et al: A large-scale RNAi
screen in human cells identifies new components of the p53 pathway. Nature
2004, 428(6981):431-437.
45. Birmingham A, Anderson EM, Reynolds A, Ilsley-Tyree D, Leake D, Fedorov Y ,
Baskerville S, Maksimova E, Robinson K, Karpilow J et al: 3[prime] UTR seed
matches, but not overall identity, are associated with RNAi off-targets. Nat
Meth 2006, 3(3):199-204.
41. Boutros M, Ahringer J: The art and design of genetic screens: RNA
interference. Nat Rev Genet 2008, advanced online publication.
60. Boutros M, Kiger AA, Armknecht S, Kerr K, Hild M, Koch B, Haas SA,
Consortium HFA, Paro R, Perrimon N: Genome-Wide RNAi Analysis of
Growth and Viability in Drosophila Cells. Science 2004, 303(5659):832-835.
93. Boutros M, Kiger AA, Armknecht S, Kerr K, Hild M, Koch B, Haas SA,
Consortium HF, Paro R, Perrimon N: Genome-wide RNAi analysis of growth
and viability in Drosophila cells. Science 2004, 303:832 - 835.
18. Byrne A, Weirauch M, Wong V , Koeva M, Dixon S, Stuart J, Roy P: A global
analysis of genetic interactions in Caenorhabditis elegans. Journal of Biology
2007, 6(3):8.
79. Cao WM, Murao K, Imachi H, Yu X, Abe H, Yamauchi A, Niimi M, Miyauchi A,
Wong NCW, Ishida T: A Mutant High-Density Lipoprotein Receptor Inhibits
Proliferation of Human Breast Cancer Cells. Cancer Res 2004,
64(4):1515-1521.
84. Casella G, George E: Explaining the Gibbs Sampler. The American Statistician
1992, 46(3):167-174.
138. Chang HC, Solomon NM, Wassarman DA, Karim FD, Therrien M, Rubin GM,
Wolff T: phyllopod functions in the fate determination of a subset of
photoreceptors in drosophila. Cell 1995, 80(3):463-472.
139. Cho NK, Keyes L, Johnson E, Heller J, Ryner L, Karim F, Krasnow MA:
Developmental Control of Blood Cell Migration by the Drosophila VEGF
Pathway. 2002, 108(6):865-876.
111. Cherry S, Kunte A, Wang H, Coyne C, Rawson RB, Perrimon N: COPI Activity
Coupled with Fatty Acid Biosynthesis Is Required for Viral Replication. PLoS
Pathogens 2006, 2(10):e102.
110
52. Chuang H-Y , Lee E, Liu Y-T, Lee D, Ideker T: Network-based classification of
breast cancer metastasis. Mol Syst Biol 2007, 3.
44. Comprehensive genomic characterization defines human glioblastoma genes
and core pathways. Nature 2008, 455(7216):1061-1068.
3. Cordero OX, Snel B, Hogeweg P: Coevolution of gene families in prokaryotes.
Genome Research 2008, 18(3):462-468.
64. Daigle B, Altman R: M-BISON: Microarray-based integration of data sources
using networks. BMC Bioinformatics 2008, 9(1):214.
95. DasGupta R, Kaykas A, Moon RT, Perrimon N: Functional genomic analysis of
the Wnt-wingless signaling pathway. Science 2005, 308:826 - 833.
106. DasGupta R, Nybakken K, Booker M, Mathey-Prevot B, Gonsalves F,
Changkakoty B, Perrimon N: A case study of the reproducibility of
transcriptional reporter cell-based RNAi screens in Drosophila. Genome
Biology 2007, 8(9):R203.
120. Derré I, Pypaert M, Dautry-Varsat A, Agaisse H: RNAi Screen in Drosophila
Cells Reveals the Involvement of the Tom Complex in Chlamydia Infection.
PLoS Pathogens 2007, 3(10):e155.
36. Dudley AM, Janse DM, Tanay A, Shamir R, Church GM: A global view of
pleiotropy and phenotypically derived gene function in yeast. Mol Syst Biol
2005, 1.
38. Echeverri CJ, Perrimon N: High-throughput RNAi screening in cultured cells:
a user's guide. Nat Rev Genet 2006, 7(5):373-384.
47. Echeverri CJ, Perrimon N: High-throughput RNAi screening in cultured cells:
a user's guide. Nat Rev Genet 2006, 7:373 - 384.
68. Esteller M: Cancer epigenomics: DNA methylomes and histone-modification
maps. Nat Rev Genet 2007, 8(4):286-298.
66. Falcon S, Gentleman R: Using GOstats to test gene lists for GO term
association. Bioinformatics 2007, 23(2):257-258.
123. Farny NG, Hurt JA, Silver PA: Definition of global and transcript-specific
mRNA export pathways in metazoans. Genes & Development 2008,
22(1):66-78.
111
48. Feldman I, Rzhetsky A, Vitkup D: Network properties of genes harboring
inherited disease mutations. Proceedings of the National Academy of Sciences
2008, 105(11):4323-4328.
110. Flockhart I, Booker M, Kiger A, Boutros M, Armknecht S, Ramadan N,
Richardson K, Xu A, Perrimon N, Mathey-Prevot B: FlyRNAi: the Drosophila
RNAi screening center database. Nucl Acids Res 2006, 34(suppl_1):D489-494.
37. Fraser AG, Kamath RS, Zipperlen P, Martinez-Campos M, Sohrmann M,
Ahringer J: Functional genomic analysis of C. elegans chromosome I by
systematic RNA interference. Nature 2000, 408(6810):325-330.
72. Friedberg EC: How nucleotide excision repair protects against cancer. Nat Rev
Cancer 2001, 1(1):22-33.
73. Friedberg EC: Cockayne syndrome--a primary defect in DNA repair,
transcription, both or neither? Bioessays 1996(Sep;18(9)):731-738.
116. Friedman A, Perrimon N: A functional RNAi screen for regulators of receptor
tyrosine kinase and ERK signalling. Nature 2006, 444(7116):230-234.
134. Friedman A, Perrimon N: Genetic Screening for Signal Transduction in the
Era of Network Biology. Cell 2007, 128(2):225-231.
57. Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze
expression data. In: Proceedings of the fourth annual international conference on
Computational molecular biology. Tokyo, Japan: ACM; 2000.
59. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N,
Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004,
4(3):177-183.
5. Gabaldon T, Huynen MA: Lineage-specific gene loss following mitochondrial
endosymbiosis and its potential for function prediction in eukaryotes.
Bioinformatics 2005, 21(suppl_2):ii144-150.
126. Gandre-Babbe S, van der Bliek AM: The Novel Tail-anchored Membrane
Protein Mff Controls Mitochondrial and Peroxisomal Fission in Mammalian
Cells. Mol Biol Cell 2008, 19(6):2402-2412.
130. Gaurav P, Michael S, Rohit G, Tushar G, Vipin K: Association analysis-based
transformations for protein interaction networks: a function prediction case
study. In: Proceedings of the 13th ACM SIGKDD international conference on
Knowledge discovery and data mining. San Jose, California, USA: ACM; 2007.
112
42. Genome-wide association study of 14,000 cases of seven common diseases and
3,000 shared controls. Nature 2007, 447(7145):661-678.
86. Gentleman R, Carey V , Bates D, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier
L, Ge Y , Gentry J et al: Bioconductor: open software development for
computational biology and bioinformatics. Genome Biology 2004, 5(10):R80.
35. Giaever G, Chu AM, Ni L, Connelly C, Riles L, Veronneau S, Dow S,
Lucau-Danila A, Anderson K, Andre B et al: Functional profiling of the
Saccharomyces cerevisiae genome. Nature 2002, 418(6896):387-391.
33. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y , Hao YL, Ooi CE,
Godwin B, Vitols E et al: A Protein Interaction Map of Drosophila
melanogaster. Science 2003, 302(5651):1727-1736.
24. Girvan M, Newman MEJ: Community structure in social and biological
networks. Proceedings of the National Academy of Sciences of the United States
of America 2002, 99(12):7821-7826.
78. Gospodarowicz D, Lui GM, Gonzalez R: High-Density Lipoproteins and the
Proliferation of Human Tumor Cells Maintained on Extracellular
Matrix-coated Dishes and Exposed to Defined Medium. Cancer Res 1982,
42(9):3704-3713.
90. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H,
Teague J, Butler A, Stevens C et al: Patterns of somatic mutation in human
cancer genomes. Nature 2007, 446(7132):153-158.
77. Guillermet J, Saint-Laurent N, Rochaix P, Cuvillier O, Levade T, Schally A V ,
Pradayrol L, Buscail L, Susini C, Bousquet C: Somatostatin receptor subtype 2
sensitizes human pancreatic cancer cells to death ligand-induced apoptosis.
Proceedings of the National Academy of Sciences 2003, 100(1):155-160.
96. Gwack Y , Sharma S, Nardone J, Tanasa B, Iuga A, Srikanth S, Okamura H,
Bolton D, Feske S, Hogan PG: A genome-wide Drosophila RNAi screen
identifies DYRK-family kinases as regulators of NFAT. Nature 2006, 441:646
- 650.
129. Gwack Y , Srikanth S, Feske S, Cruz-Guilloty F, Oh-hora M, Neems DS, Hogan
PG, Rao A: Biochemical and Functional Characterization of Orai Proteins. J
Biol Chem 2007, 282(22):16232-16243.
113
17. Han J-DJ, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV , Dupuy D,
Walhout AJM, Cusick ME, Roth FP et al: Evidence for dynamically organized
modularity in the yeast protein-protein interaction network. Nature 2004,
430(6995):88-93.
53. Hart GT, Lee I, Marcotte E: A high-accuracy consensus map of yeast protein
complexes reveals modular nature of gene essentiality. BMC Bioinformatics
2007, 8(1):236.
49. Hartwell LH, Hopfield JJ, Leibler S, Murray AW: From molecular to modular
cell biology. Nature.
67. Helleday T, Petermann E, Lundin C, Hodgson B, Sharma RA: DNA repair
pathways as targets for cancer therapy. Nat Rev Cancer 2008, 8(3):193-204.
20. Hillenmeyer ME, Fung E, Wildenhain J, Pierce SE, Hoon S, Lee W, Proctor M,
St.Onge RP, Tyers M, Koller D et al: The Chemical Genomic Portrait of Yeast:
Uncovering a Phenotype for All Genes. Science 2008, 320(5874):362-365.
25. Holme P, Huss M, Jeong H: Subnetwork hierarchies of biochemical pathways.
Bioinformatics 2003, 19(4):532-538.
145. Huang X, Pan W: Linear regression and two-class classification with gene
expression data. Bioinformatics 2003, 19(16):2072-2078.
7. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y , Barkai N: Revealing
modular organization in the yeast transcriptional network. Nat Genet 2002,
31(4):370-377.
70. Isenberg JS, Klaunig JE: Role of the Mitochondrial Membrane Permeability
Transition (MPT) in Rotenone-Induced Apoptosis in Liver Cells. Toxicol Sci
2000, 53(2):340-351.
46. Jackson AL, Bartz SR, Schelter J, Kobayashi SV , Burchard J, Mao M, Li B, Cavet
G, Linsley PS: Expression profiling reveals off-target gene regulation by
RNAi. Nat Biotech 2003, 21(6):635-637.
81. Jacobsen L, Madsen P, Moestrup SK, Lund AH, Tommerup N, Nykjar A,
Sottrup-Jensen L, Gliemann J, Petersen CM: Molecular Characterization of a
Novel Human Hybrid-type Receptor That Binds the alpha 2-Macroglobulin
Receptor-associated Protein. J Biol Chem 1996, 271(49):31379-31383.
100. Jorgensen EM, Mango SE: The art and design of genetic screens:
Caenorhabditis elegans. Nat Rev Genet 2002, 3(5):356-369.
114
21. Jasnos L, Korona R: Epistatic buffering of fitness loss in yeast double deletion
strains. Nat Genet 2007, 39(4):550-554.
32. Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein
networks. Nature 2001, 411(6833):41-42.
133. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T,
Kawashima S, Okuda S, Tokimatsu T et al: KEGG for linking genomes to life
and the environment. Nucl Acids Res 2008, 36(suppl_1):D480-484.
71. Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes.
Nucl Acids Res 2000, 28(1):27-30.
22. Kelley R, Ideker T: Systematic interpretation of genetic interactions using
protein networks. Nat Biotech 2005, 23(5):561-566.
88. Kerrien S, Alam-Faruque Y , Aranda B, Bancarz I, Bridge A, Derow C, Dimmer E,
Feuermann M, Friedrichsen A, Huntley R et al: IntAct--open source resource
for molecular interaction data. Nucl Acids Res 2007, 35(suppl_1):D561-565.
92. Kiger AA, Baum B, Jones S, Jones MR, Coulson A, Echeverri C, Perrimon N: A
functional genomic analysis of cell morphology using RNA interference. J
Biol 2003, 2:27.
97. Kim JK, Gabel HW, Kamath RS, Tewari M, Pasquinelli A, Rual J-F, Kennedy S,
Dybbs M, Bertin N, Kaplan JM et al: Functional Genomic Analysis of RNA
Interference in C. elegans. Science 2005, 308(5725):1164-1167.
108. Köhler S, Bauer S, Horn D, Robinson PN: Walking the Interactome for
Prioritization of Candidate Disease Genes. The American Journal of Human
Genetics 2008, 82(4):949-958.
131. Kondor RI, and Lafferty, J.D. : Diffusion kernels on graphs and other discrete
input spaces. . Proceedings of the Nineteenth International Conference on
Machine Learning 2002:315–322.
103. Kulkarni MM, Booker M, Silver SJ, Friedman A, Hong P, Perrimon N,
Mathey-Prevot B: Evidence of off-target effects associated with long dsRNAs
in Drosophila melanogaster cell-based assays. Nat Methods 2006, 3:833 - 838.
121. Kwon M, Godinho SA, Chandhok NS, Ganem NJ, Azioune A, Thery M, Pellman
D: Mechanisms to suppress multipolar divisions in cancer cells with extra
centrosomes. Genes & Development 2008, 22(16):2189-2203.
115
55. Lage K, Karlberg EO, Storling ZM, Olason PI, Pedersen AG, Rigina O, Hinsby
AM, Tumer Z, Pociot F, Tommerup N et al: A human phenome-interactome
network of protein complexes implicated in genetic disorders. Nat Biotech
2007, 25(3):309-316.
1. Lim R, Aebi U, Fahrenkrog B: Towards reconciling structure and function in
the nuclear pore complex. Histochemistry and Cell Biology 2008,
129(2):105-116.
82. Lockhart AC, Tirona RG, Kim RB: Pharmacogenetics of ATP-binding Cassette
Transporters in Cancer and Chemotherapy. Mol Cancer Ther 2003,
2(7):685-698.
128. Lu J, Ruhf M-L, Perrimon N, Leder P: A genome-wide RNA interference screen
identifies putative chromatin regulators essential for E2F repression.
Proceedings of the National Academy of Sciences 2007, 104(22):9381-9386.
91. Lum L, Yao S, Mozer B, Rovescalli A, V on Kessler D, Nirenberg M, Beachy PA:
Identification of Hedgehog pathway components by RNAi in Drosophila
cultured cells. Science 2003, 299:2039 - 2045.
43. Lyne R, Smith R, Rutherford K, Wakeling M, Varley A, Guillier F, Janssens H, Ji
W, McLaren P, North P et al: FlyMine: an integrated database for Drosophila
and Anopheles genomics. Genome Biology 2007, 8(7):R129.
146. Ma X, Lee H, Wang L, Sun F: CGI: a new approach for prioritizing genes by
combining gene expression and protein-protein interaction data.
Bioinformatics 2007, 23(2):215-221.
102. Ma Y , Creanga A, Lum L, Beachy PA: Prevalence of off-target effects in
Drosophila RNA interference screens. Nature 2006, 443:359 - 363.
15. MacBeath G, Schreiber SL: Printing Proteins as Microarrays for
High-Throughput Function Determination. Science 2000,
289(5485):1760-1763.
63. Metz CE: Basic principles of ROC analysis. Semin Nucl Med 1978, 8:283-298.
75. Mizejewski GJ: Role of Integrins in Cancer: Survey of Expression Patterns.
Proc Soc Exp Biol Med 1999, 222(2):124-138.
80. Montel V , Gaultier A, Lester RD, Campana WM, Gonias SL: The Low-Density
Lipoprotein Receptor Related Protein Regulates Cancer Cell Survival and
Metastasis Development. Cancer Res 2007, 67(20):9817-9824.
116
132. Morrison J, Breitling R, Higham D, Gilbert D: GeneRank: Using search engine
technology for the analysis of microarray experiments. BMC Bioinformatics
2005, 6(1):233.
115. Nybakken K, V okes SA, Lin TY , McMahon AP, Perrimon N: A genome-wide
RNA interference screen in Drosophila melanogaster cells for new
components of the Hh signaling pathway. Nat Genet 2005, 37:1323 - 1332.
65. Qin X, Ahn S, Speed T, Rubin G: Global analyses of mRNA translational
control during early Drosophila embryogenesis. Genome Biology 2007,
8(4):R63.
50. Oti M, Brunner HG: The modular nature of genetic diseases. Clinical Genetics
2007, 71(1):1-11.
39. Paddison PJ, Silva JM, Conklin DS, Schlabach M, Li M, Aruleba S, Balija V ,
O'Shaughnessy A, Gnoj L, Scobie K et al: A resource for large-scale
RNA-interference-based screens in mammals. Nature 2004,
428(6981):427-431.
19. Pagel P, Wong P, Frishman D: A Domain Interaction Map Based on
Phylogenetic Profiling. Journal of Molecular Biology 2004, 344(5):1331-1346.
30. Palla G, Derenyi I, Farkas I, Vicsek T: Uncovering the overlapping community
structure of complex networks in nature and society. Nature 2005,
435(7043):814-818.
101. Patton EE, Zon LI: The art and design of genetic screens: zebrafish. Nat Rev
Genet 2001, 2(12):956-966.
122. Philips JA, Rubin EJ, Perrimon N: Drosophila RNAi Screen Reveals CD36
Family Member Required for Mycobacterial Infection. Science 2005,
309(5738):1251-1253.
29. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical
Organization of Modularity in Metabolic Networks. Science 2002,
297(5586):1551-1555.
85. Remm M, Storm CEV , Sonnhammer ELL: Automatic clustering of orthologs
and in-paralogs from pairwise species comparisons. Journal of Molecular
Biology 2001, 314(5):1041-1052.
26. Rives AW, Galitski T: Modular organization of cellular networks. Proceedings
of the National Academy of Sciences of the United States of America 2003,
100(3):1128-1133.
117
142. Said MR, Begley TJ, Oppenheim A V , Lauffenburger DA, Samson LD: Global
network analysis of phenotypic effects: Protein networks and toxicity
modulation in Saccharomyces cerevisiae. Proceedings of the National Academy
of Sciences of the United States of America 2004, 101(52):18006-18011.
51. Sathyanarayanan S, Zheng X, Kumar S, Chen C-H, Chen D, Hay B, Sehgal A:
Identification of novel genes involved in light-dependent CRY degradation
through a genome-wide RNAi screen. Genes & Development 2008,
22(11):1522-1533.
140. Schnepp B, Grumbling G, Donaldson T, Simcox A: Vein is a novel component in
the Drosophila epidermal growth factor receptor pathway with similarity to
the neuregulins. Genes & Development 1996, 10(18):2302-2313.
87. Scholtens D, Chiang T, Huber W, Gentleman R: Estimating node degree in
bait-prey graphs. Bioinformatics 2008, 24(2):218-224.
9. Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N:
Module networks: identifying regulatory modules and their
condition-specific regulators from gene expression data. Nat Genet 2003,
34(2):166-176.
11. Segal E, Yelensky R, Koller D: Genome-wide discovery of transcriptional
modules from DNA sequence and gene expression. Bioinformatics 2003,
19(suppl_1):i273-282.
119. Sepp KJ, Hong P, Lizarraga SB, Liu JS, Mejia LA, Walsh CA, Perrimon N:
Identification of Neural Outgrowth Genes using Genome-Wide RNAi. PLoS
Genetics 2008, 4(7):e1000111.
141. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N,
Schwikowski B, Ideker T: Cytoscape: A Software Environment for Integrated
Models of Biomolecular Interaction Networks. Genome Research 2003,
13(11):2498-2504.
74. Shoeman RL, Hartig R, Hauses C, Traub P: ORGANIZATION OF FOCAL
ADHESION PLAQUES IS DISRUPTED BY ACTION OF THE HIV-1
PROTEASE. Cell Biology International 2002, 26(6):529-539.
98. Silva JM, Marran K, Parker JS, Silva J, Golding M, Schlabach MR, Elledge SJ,
Hannon GJ, Chang K: Profiling Essential Genes in Human Mammary Cells by
Multiplex RNAi Screening. Science 2008, 319(5863):617-620.
118
23. Snel B, Bork P, Huynen MA: The identification of functional modules from the
genomic association of genes. Proceedings of the National Academy of Sciences
of the United States of America 2002, 99(9):5890-5895.
27. Spirin V , Mirny LA: Protein complexes and functional modules in molecular
networks. Proceedings of the National Academy of Sciences of the United States
of America 2003, 100(21):12123-12128.
99. St Johnston D: The art and design of genetic screens: Drosophila
melanogaster. Nat Rev Genet 2002, 3(3):176-188.
62. Strawn LA, Shen T, Shulga N, Goldfarb DS, Wente SR: Minimal nuclear pore
complexes define FG repeat domains essential for transport. Nat Cell Biol
2004, 6(3):197-206.
137. Strecker TR, Yip ML, Lipshitz HD: Zygotic genes that mediate torso receptor
tyrosine kinase functions in the Drosophila melanogaster embryo.
Proceedings of the National Academy of Sciences of the United States of America
1991, 88(13):5824-5828.
8. Stuart JM, Segal E, Koller D, Kim SK: A Gene-Coexpression Network for
Global Discovery of Conserved Genetic Modules. Science 2003,
302(5643):249-255.
6. Sun N, Carroll RJ, Zhao H: Bayesian error analysis model for reconstructing
transcriptional regulatory networks. 2006, 103(21):7988-7993.
136. Tolhuis B, Muijrers I, de Wit E, Teunissen H, Talhout W, van Steensel B, van
Lohuizen M: Genome-wide profiling of PRC1 and PRC2 Polycomb chromatin
binding in Drosophila melanogaster. Nat Genet 2006, 38(6):694-699.
28. Tornow S, Mewes HW: Functional modules by relating protein interaction
networks and gene expression. Nucl Acids Res 2003, 31(21):6283-6289.
1. Tyson JJ, Chen K, Novak B: Network dynamics and cell physiology. Nat Rev
Mol Cell Biol 2001, 2(12):908-916.
69. Vieira A V , Lamaze C, Schmid SL: Control of EGF Receptor Signaling by
Clathrin-Mediated Endocytosis. Science 1996, 274(5295):2086-2089.
113. Vig M, Peinelt C, Beck A, Koomoa DL, Rabah D, Koblan-Huberson M, Kraft S,
Turner H, Fleig A, Penner R et al: CRACM1 Is a Plasma Membrane Protein
Essential for Store-Operated Ca2+ Entry. Science 2006, 312(5777):1220-1223.
119
58. V ogelstein B, Kinzler KW: Cancer genes and the pathways they control. Nat
Med 2004, 10(8):789-799.
109. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N,
Huynen MA, Bork P: STRING: known and predicted protein-protein
associations, integrated and transferred across organisms. Nucl Acids Res
2005, 33(suppl_1):D433-437.
127. Wagner EJ, Burch BD, Godfrey AC, Salzler HR, Duronio RJ, Marzluff WF: A
Genome-wide RNA Interference Screen Reveals that Variant Histones Are
Necessary for Replication-Dependent Histone Pre-mRNA Processing. 2007,
28(4):692-699.
10. Wang W, Cherry JM, Botstein D, Li H: A systematic approach to
reconstructing transcription networks in Saccharomycescerevisiae.
Proceedings of the National Academy of Sciences of the United States of America
2002, 99(26):16893-16898.
31. Watts DJ, Strogatz SH: Collective dynamics of /`small-world/' networks.
Nature 1998, 393(6684):440-442.
147. Wilson RJ, Goodman JL, Strelets VB, The FlyBase C: FlyBase: integration and
improvements to query tools. Nucl Acids Res 2008, 36(suppl_1):D588-593.
135. Wong VKW, Yam JWP, Hsiao WLW: Cloning and Characterization of the
Promoter Region of the Mouse Frizzled-Related Protein 4 Gene. Biological
Chemistry 2003, 384(8):1147-1154.
144. Wu B: Differential gene expression detection and sample classification using
penalized linear regression models. Bioinformatics 2006, 22(4):472-476.
54. Wu X, Jiang R, Zhang M, Li S: Network-based global inference of human
disease genes. Mol Syst Biol 2008, 4.
112. Xu L, Yao X, Chen X, Lu P, Zhang B, Ip YT: Msk is required for nuclear
import of TGF-{beta}/BMP-activated Smads. J Cell Biol 2007,
178(6):981-994.
125. Yi CH, Sogah DK, Boyce M, Degterev A, Christofferson DE, Yuan J: A
genome-wide RNAi screen reveals multiple regulators of caspase activation. J
Cell Biol 2007, 179(4):619-626.
120
13. Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J,
Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N et al: High-Quality Binary
Protein Interaction Map of the Yeast Interactome Network. Science 2008,
322(5898):104-110.
83. Zhang B, Groffen J, Heisterkamp N: Resistance to farnesyltransferase
inhibitors in Bcr/Abl-positive lymphoblastic leukemia by increased
expression of a novel ABC transporter homolog ATP11a. Blood 2005,
106(4):1355-1361.
114. Zhang SL, Yeromin AV , Zhang XHF, Yu Y , Safrina O, Penna A, Roos J,
Stauderman KA, Cahalan MD: Genome-wide RNAi screen of Ca2+ influx
identifies genes that regulate Ca2+ release-activated Ca2+ channel activity.
Proceedings of the National Academy of Sciences 2006, 103(24):9357-9362.
104. Zhang XD, Ferrer M, Espeseth AS, Marine SD, Stec EM, Crackower MA, Holder
DJ, Heyse JF, Strulovici B: The Use of Strictly Standardized Mean Difference
for Hit Selection in Primary RNA Interference High-Throughput Screening
Experiments. J Biomol Screen 2007, 12(4):497-509.
105. Zhang XD, Kuan PF, Ferrer M, Shu X, Liu YC, Gates AT, Kunapuli P, Stec EM,
Xu M, Marine SD et al: Hit selection with false discovery rate control in
genome-scale RNAi screens. Nucl Acids Res 2008, 36(14):4667-4679.
76. Zitzer H, Honck H-H, Bachner D, Richter D, Kreienkamp H-J: Somatostatin
Receptor Interacting Protein Defines a Novel Family of Multidomain
Proteins Present in Human and Rodent Brain. J Biol Chem 1999,
274(46):32997-33001.
121
Appendices
Appendix A: Supplementary materials for Chapter 2
In this section, details are provided on the methods and results which are omitted in
chapter II.
Involved in Lethal
Complexes
Not involved in
Lethal Complexes
Lethal
Genes
Non-
lethal
Genes
genes whose orthologs in D. melanogaster are related to cell growth and viability
genes whose orthologs in D. melanogaster are related to cell growth and viability
Figure A1: Genes in S. cerevisiae are classified into four groups according to their
lethality and the lethality of protein complexes to which they belong. Within each group,
the pie chart represents the distribution of genes with respect to the lethality of their
orthologs in D. melanogaster. The lethal protein complexes were identified as in Table
1(b).
122
Figure A2: The 27 GO CAN-processes prioritized by the Bayesian network model
(yellow) and their offspring and ancestor nodes (blue). The nodes with red circles
represent 23 out of 27 GO CAN-processes prioritized by the HG enrichment test. The
size of the nodes is proportional to the minus log p-value of the HG enrichment test for
the cancer genes. Those nodes with size zero are insignificant nodes by HG enrichment
test (P-value>0.05).
123
Complex ID Description Total Lethal
GO:0031261 DNA replication preinitiation complex 21 21
GO:0005666 DNA-directed RNA polymerase III complex 17 17
GO:0005656 pre-replicative complex 15 15
GO:0000172 ribonuclease MRP complex 10 10
GO:0005849 mRNA cleavage factor complex 9 9
GO:0000177 cytoplasmic exosome (RNase complex) 9 9
GO:0005675 holo TFIIH complex 9 9
GO:0005655 nucleolar ribonuclease P complex 9 9
MIPS-130 Chaperonine containing T-complex TRiC (TCP RING Complex) 8 8
GO:0043614 multi-eIF complex 8 8
GO:0000145 exocyst 8 8
GO:0030915 Smc5-Smc6 complex 7 7
GO:0019774 proteasome core complex, beta-subunit complex (sensu Eukaryota) 7 7
GO:0042729 DASH complex 7 7
EBI-1250459 6 6
GO:0000127 transcription factor TFIIIC complex 6 6
GO:0000506 glycosylphosphatidylinositol-N-acetylglucosaminyltransferase
(GPI-GnT) complex
5 5
GO:0000799 nuclear condensin complex 5 5
GO:0042765 GPI-anchor transamidase complex 5 5
EBI-1208823 5 5
GO:0000120 RNA polymerase I transcription factor complex 5 5
MIPS-445.10 SCF-CDC4 complex 5 5
GO:0000214 tRNA-intron endonuclease complex 4 4
GO:0000818 nuclear MIS12/MIND type complex 4 4
MIPS-290.20.10 Tim22p-complex 4 4
EBI-1209054 4 4
MIPS-310.10 NSP1 complex 4 4
EBI-852570 4 4
GO:0031518 CBF3 complex 4 4
GO:0000928 gamma-tubulin small complex, spindle pole body 3 3
GO:0005662 DNA replication factor A complex 3 3
GO:0000126 transcription factor TFIIIB complex 3 3
MIPS-180.30 Geranylgeranyltransferase II (GGTase II) 3 3
EBI-1248655 2 2
GO:0005672 transcription factor TFIIA complex 2 2
GO:0005673 transcription factor TFIIE complex 2 2
GO:0005785 signal recognition particle receptor complex 2 2
GO:0005835 fatty acid synthase complex 2 2
GO:0005953 CAAX-protein geranylgeranyltransferase complex 2 2
GO:0009328 phenylalanine-tRNA ligase complex 2 2
GO:0017059 serine C-palmitoyltransferase complex 2 2
GO:0017087 mitochondrial processing peptidase complex 2 2
GO:0018444 translation release factor complex 2 2
GO:0031501 mannosyltransferase complex 2 2
Table A1: 94 curated lethal protein complexes identified by the BN model. The complex
ID starting with “GO”, “MIPS” and “EBI” represents GO, MIPS and Intact ID,
respectively. Total # denotes the total number of genes whose lethality is known. Lethal #
denotes the number of genes that are lethal.
124
Table A1: Continued.
GO:0031510 SUMO activating enzyme complex 2 2
GO:0031515 tRNA (m1A) methyltransferase complex 2 2
GO:0042272 nuclear RNA export factor complex 2 2
GO:0043541 UDP-N-acetylglucosamine transferase complex 2 2
GO:0030684 preribosome 2 2
MIPS-260.40 NSF-SNAP complex 2 2
MIPS-440.30.10.20 Prp9p/Prp11p/Prp21p complex 2 2
MIPS-500.10.110 eIF4E/eIF4G/Pab1p complex 2 2
MIPS-510.190.190 NC2 complex 2 2
GO:0032777 Piccolo NuA4 histone acetyltransferase complex 2 2
GO:0032116 cohesin loading complex 2 2
GO:0035101 FACT complex 2 2
GO:0030690 Noc1p-Noc2p complex 2 2
GO:0030691 Noc2p-Noc3p complex 2 2
MIPS-310.20 NSP1-NUP82 complex 2 2
MIPS-290.20.20 Tim17p-complex 2 2
GO:0005669 transcription factor TFIID complex 14 13
MIPS-410.35 Replication complex 20 18
GO:0005732 small nucleolar ribonucleoprotein complex 19 17
GO:0046540 U4/U6 x U5 tri-snRNP complex 25 22
GO:0008540 proteasome regulatory particle, base subcomplex (sensu Eukaryota) 8 7
GO:0005847 mRNA cleavage and polyadenylation specificity factor complex 15 13
GO:0019773 proteasome core complex, alpha-subunit complex (sensu Eukaryota) 7 6
EBI-1250245 7 6
MIPS-270.10 Inner Kinetochor Protein Complex 7 6
GO:0005665 DNA-directed RNA polymerase II, core complex 12 10
GO:0005744 mitochondrial inner membrane presequence translocase complex 6 5
GO:0008541 proteasome regulatory particle, lid subcomplex (sensu Eukaryota) 10 8
GO:0030687 nucleolar preribosome, large subunit precursor 5 4
GO:0005851 eukaryotic translation initiation factor 2B complex 5 4
GO:0032040 small subunit processome 37 29
GO:0030127 COPII vesicle coat 8 6
GO:0031105 septin complex 4 3
GO:0030869 RENT complex 4 3
GO:0005681 spliceosome 29 21
GO:0005736 DNA-directed RNA polymerase I complex 14 10
GO:0001405 presequence translocase-associated import motor 7 5
GO:0016592 Srb-mediator complex 7 5
GO:0005852 eukaryotic translation initiation factor 3 complex 7 5
GO:0030008 TRAPP complex 10 7
MIPS-60 Anaphase promoting complex (APC) 10 7
GO:0000243 commitment complex 11 7
GO:0016586 RSC complex 16 10
GO:0031932 TORC 2 complex 7 4
GO:0005742 mitochondrial outer membrane translocase complex 7 4
GO:0008250 oligosaccharyl transferase complex 9 5
MIPS-520.10 ER protein-translocation complex (Sec complex) 9 5
GO:0031011 INO80 complex 9 5
MIPS-310 Nuclear pore complex (NPC) 22 12
125
Appendix B: Supplementary materials for Chapter 3
In this section, details are provided on the methods and results which are omitted in
chapter III.
Network connectivity of RNAi hits
Table B1 shows the size of the largest component, the number of isolated nodes and their
corresponding P-values for each sub-network derived from each of the 24 RNAi screens.
It can be seen that the majority of the 24 sub-networks are characterized by significantly
larger connected component and fewer isolated nodes, which indicate higher network
connectivity, as compared to random cases (P-values < 0.05). But these two network
attributes may not be as powerful as the number of edges in revealing the network
connectivity. Since when the number of hits is large enough, e.g., > 300, most hits can be
connected in a large connected component even within randomized network, the P-values
are hence less significant or not significant for these sub-networks.
126
RNAi screen #hits SLC
†
P-value1 P-value2 #isolate
d nodes
P-value1 P-value2
Store-operated Ca2+
entry
1,122 773 6e-05 0.6 326 5e-04 0.1
ERK signaling 982 763 1e-16 1 207 1e-17 1
Nuclear import of Smads 683 377 0.02 1 254 0.004 0.01
Protein secretion and
Golgi organization
645 513 3e-16 1 114 8e-26 0.8
Hh signaling pathway 306 230 4e-14 0.8 72 8e-21 0.5
Bacterial infection 286 251 2e-22 0.002 33 2e-35 5e-04
Growth and viability 281 204 8e-14 0.9 72 5e-18 0.6
Wnt signaling pathway 167 101 5e-17 5e-04 61 6e-11 8e-05
Light-dependent CRY
degradation
131 39 3e-05 0.4 82 0.03 0.2
Neural outgrowth genes 128 105 2e-54 9e-06 23 5e-23 5e-05
Chlamydia infection 126 55 7e-13 0.004 65 7e-05 4e-04
Regulators of NFAT 121 9 0.6 0.5 83 0.2 2e-04
Multipolar divisions 115 31 3e-05 0.02 63 3e-04 4e-06
Viral replication 104 92 6e-83 0.1 9 5e-31 0.008
JAK/STAT signaling 104 51 6e-23 4e-05 53 1e-05 1e-04
Mycobacterial infection 76 61 2e-99 5e-07 13 7e-22 6e-11
Transcript-specific
mRNA export
65 48 2e-86 6e-05 17 3e-17 2e-06
Ca(2+) influx 65 50 2e-94 3e-04 13 8e-21 1e-05
Muscle assembly and
maintenance
39 16 6e-28 1e-08 19 4e-07 8e-04
Caspase activation 37 5 0.02 0.01 32 0.4 0.4
Mitochondrial and
Peroxisomal Fission
22 3 0.08 0.03 19 0.3 0.2
Histone pre-mRNA
processing
17 3 0.02 0.009 12 0.005 0.001
E2F repression 15 7 1e-17 1e-43 6 2e-10 2e-20
Orai proteins 15 4 4e-05 7e-06 4 4e-15 1e-15
† Size of the largest component.
Table B1: The network attributes and corresponding P-values for the 24 sub-networks
constructed from RNAi hits. Results are presented in the same order as in Table 1. The
P-value1 and P-value2 are calculated in a similar way to that in Table 1.
Screen-specific Performance of different NePhe scoring functions
We compared the performance of different NePhe scoring functions in each of the 24
RNAi screens. Here we particularly focused on their performance in identifying FPs
127
rather than FNs, since the estimation of the former is far more reliable (with smaller error
bars) than the latter in the rank-based tests. Figure B1 shows the screen-specific
performance of different methods. It has to be reminded that in the rank-based test, the
lower the simulated FPs are ranked among hits, the better the scoring function performs.
It can be seen in Figure B1 that the diffusion kernel (DK) based methods (red) generally
outperform others. In addition, DK appears to show superiority especially for those
screens whose hit sets display certain characteristics. For example, DK shows superiority
particularly for hit set of smaller size. DK is the best performed scoring function for 8 out
of the 9 screens with hit size <100. This observation is consistent with previous findings
that DK significantly outperforms other similarity measurements in prioritizing human
disease genes [107, 108], since the number of known causal genes for a disease is usually
within this range. Among those hit sets of larger size, DK shows obvious advantages for
relatively “difficult” cases. For instance, DK is the best performed method for “nuclear
import of Smads” and “regulators of NFAT”. Based on the results of Table 1, the network
connectivity of hits for both screens is among the lowest. In contrast, for those
well-connected hit sets, such as “viral replication”, simple measurements like direct
neighbor can achieve the best performance.
128
Figure B1: Screen-specific performance of different methods in identifying FPs in
rank-based test. The notations are the same as in Figure 1. The 24 screens are placed in
the same order as in Table 1 (by row), i.e., decreasing order based on the hit size.
129
Robustness of NePhe scores to random noise in the hit set
We calculated NePhe scores for each of the 24 screens using the best method for that
screen according to the rank-based test. We want to know how robust the calculated
NePhe scores are to the random noise in the initial hit set. In order to achieve this, we
simulated noise in the hit set and then compared the NePhe scores before and after the
simulation. More specifically, for each screen, we randomly added a number of nonhits
(e.g., 10% of the original hit number) into the original hit set and recalculated the NePhe
scores. We then computed Spearman’s correlation coefficients for NePhe scores of
original hits before and after adding the noise. We repeated the above procedure 20 times
and obtained the mean and standard deviation (Table B2). Similarly, we randomly
removed a number of hits from the initial hit set and then compared the NePhe scores for
the remaining hits before and after the removal (Table B3). As shown in both Table B2
and Table B3, the NePhe scores of the hits appear to be reasonably robust to noise in the
initial hit set. Most of the correlation coefficients remain above 0.95, even after adding 50%
noise or removing 50% of the original hits. The NePhe scores tend to be less robust for
those screens whose sub-networks show lower connectivity (Table 1). Comparing
results in Table B2 and S5, NePhe scores seem to be more robust to the random noise
added to the hit set than to the random deletion of original hits. Similar results can be
observed for NePhe scores of the nonhits (data not shown).
130
RNAi screen Add 10% Add 20% Add 20% Add 40% Add 50%
Store-operated Ca2+ entry 0.992(0.002) 0.985(0.003) 0.98(0.004) 0.975(0.004) 0.966(0.009)
ERK signalling 0.999(3e-04) 0.998(4e-04) 0.997(5e-04) 0.996(8e-04) 0.995(0.001)
Nuclear import of Smads 0.97(0.01) 0.943(0.01) 0.918(0.02) 0.89(0.02) 0.867(0.02)
Protein secretion and Golgi
organization 0.994(0.003) 0.989(0.002) 0.981(0.004) 0.977(0.004) 0.973(0.004)
Hh signaling pathway 0.999(5e-04) 0.997(0.001) 0.996(0.001) 0.995(0.002) 0.994(0.001)
Bacterial infection 1(2e-04) 0.998(3e-04) 0.998(4e-04) 0.997(6e-04) 0.996(8e-04)
Growth and viability 0.998(0.001) 0.997(0.002) 0.995(0.005) 0.993(0.004) 0.987(0.01)
Wnt signaling pathway 0.994(0.004) 0.989(0.006) 0.983(0.008) 0.977(0.007) 0.97(0.01)
Light-dependent CRY
degradation 0.978(0.02) 0.95(0.03) 0.937(0.03) 0.92(0.03) 0.913(0.03)
Neural outgrowth genes 0.997(0.001) 0.993(0.002) 0.99(0.003) 0.988(0.004) 0.985(0.004)
Chlamydia infection 0.992(0.005) 0.983(0.007) 0.975(0.01) 0.973(0.01) 0.957(0.02)
Regulators of NFAT 0.98(0.01) 0.951(0.02) 0.931(0.03) 0.91(0.03) 0.894(0.05)
Multipolar divisions 0.987(0.006) 0.978(0.009) 0.966(0.01) 0.962(0.01) 0.943(0.02)
Viral replication 1(6e-04) 1(3e-04) 1(6e-04) 0.999(7e-04) 0.999(7e-04)
JAK/STAT signaling 0.993(0.002) 0.987(0.005) 0.985(0.003) 0.974(0.02) 0.976(0.007)
Mycobacterial infection 0.996(0.006) 0.991(0.007) 0.982(0.01) 0.972(0.03) 0.976(0.02)
Transcript-specific mRNA
export 0.988(0.01) 0.978(0.02) 0.97(0.03) 0.966(0.03) 0.956(0.04)
Ca(2+) influx 0.998(0.002) 0.995(0.004) 0.991(0.005) 0.99(0.004) 0.987(0.006)
Muscle assembly and
maintenance 0.993(0.01) 0.975(0.04) 0.95(0.07) 0.955(0.05) 0.94(0.05)
Caspase activation 0.959(0.05) 0.93(0.07) 0.926(0.06) 0.892(0.07) 0.854(0.1)
Mitochondrial and Peroxisomal
Fission 0.997(0.01) 0.991(0.01) 0.978(0.05) 0.981(0.02) 0.963(0.03)
Histone pre-mRNA processing 0.966(0.04) 0.925(0.07) 0.897(0.1) 0.88(0.1) 0.91(0.07)
E2F repression 0.998(0.003) 0.992(0.02) 0.991(0.02) 0.998(0.003) 0.992(0.008)
Orai proteins 1(0.001) 0.999(0.002) 0.994(0.01) 0.982(0.04) 0.984(0.04)
Table B2: The Spearman’s rank correlation coefficients of the NePhe score before and
after adding different levels of noise into the hit set. The figure in each cell represents the
mean value of the calculated correlation coefficients in 20 simulations. The figure in the
parenthesis represents the standard deviation.
131
RNAi screen Remove 10% Remove
20%
Remove 20% Remove
40%
Remove
50%
Store-operated Ca2+ entry 0.991(0.002) 0.983(0.003) 0.972(0.006) 0.96(0.007) 0.951(0.01)
ERK signalling 0.999(2e-04) 0.997(0.001) 0.994(8e-04) 0.99(0.004) 0.986(0.005)
Nuclear import of Smads 0.958(0.01) 0.926(0.02) 0.856(0.09) 0.835(0.03) 0.749(0.09)
Protein secretion and Golgi
organization 0.991(0.003) 0.98(0.003) 0.964(0.007) 0.948(0.01) 0.926(0.02)
Hh signaling pathway 0.998(6e-04) 0.996(0.001) 0.994(0.001) 0.99(0.003) 0.988(0.004)
Bacterial infection 0.998(4e-04) 0.995(9e-04) 0.991(0.002) 0.987(0.003) 0.983(0.005)
Growth and viability 0.997(8e-04) 0.996(0.002) 0.99(0.006) 0.988(0.004) 0.984(0.01)
Wnt signaling pathway 0.986(0.005) 0.979(0.01) 0.965(0.01) 0.947(0.02) 0.927(0.02)
Light-dependent CRY
degradation 0.984(0.01) 0.953(0.03) 0.942(0.03) 0.921(0.05) 0.882(0.04)
Neural outgrowth genes 0.99(0.004) 0.977(0.006) 0.963(0.007) 0.945(0.01) 0.925(0.02)
Chlamydia infection 0.985(0.01) 0.975(0.01) 0.958(0.02) 0.944(0.03) 0.873(0.2)
Regulators of NFAT 0.924(0.03) 0.838(0.04) 0.782(0.08) 0.706(0.08) 0.626(0.08)
Multipolar divisions 0.983(0.007) 0.962(0.01) 0.938(0.02) 0.89(0.05) 0.824(0.2)
Viral replication 0.998(7e-04) 0.996(0.001) 0.994(0.002) 0.99(0.003) 0.987(0.005)
JAK/STAT signaling 0.989(0.004) 0.978(0.007) 0.968(0.008) 0.933(0.03) 0.913(0.05)
Mycobacterial infection 0.943(0.04) 0.93(0.04) 0.872(0.07) 0.825(0.08) 0.735(0.1)
Transcript-specific mRNA
export 0.966(0.03) 0.95(0.02) 0.891(0.06) 0.832(0.07) 0.83(0.1)
Ca(2+) influx 0.981(0.009) 0.958(0.01) 0.943(0.02) 0.914(0.02) 0.863(0.06)
Muscle assembly and
maintenance 0.974(0.05) 0.938(0.06) 0.9(0.1) 0.86(0.1) 0.81(0.1)
Caspase activation 0.954(0.03) 0.897(0.06) 0.862(0.08) 0.779(0.1) 0.796(0.1)
Mitochondrial and Peroxisomal
Fission 0.994(0.02) 0.982(0.04) 0.954(0.06) 0.943(0.06) 0.91(0.1)
Histone pre-mRNA processing 0.862(0.3) 0.81(0.4) 0.583(0.5) 0.6(0.5) 0.413(0.5)
E2F repression 0.947(0.07) 0.95(0.06) 0.898(0.1) 0.832(0.1) 0.65(0.3)
Orai proteins 0.841(0.2) 0.776(0.2) 0.543(0.3) 0.547(0.3) 0.255(0.4)
Table B3: The Spearman’s rank correlation coefficients of the NePhe score before and
after removing different fractions of hits from the hit set. The figure in each cell
represents the mean value of the calculated correlation coefficients in 20 simulations. The
figure in the parenthesis represents the standard deviation.
Sequence-based OTE prediction of FPs
For each dsRNA used to knockdown hits in the original screen, we retrieved the number
of 19 nucleotide exact overlaps it has with other genes from flyRNAi database [147].
From the same place, we also obtained the information about whether there are CAN
repeats in the dsRNA. Using similar criteria as in the study of DasGupta et al [106], we
132
labeled a hit as off target (OT)-related if its corresponding dsRNA shares greater than 5
possible 19 nt exact overlaps with other genes, or there are CAN repeats in the dsRNA,
and labeled it as OT-unrelated if neither the two criteria are satisfied. Those OT-related
hits were predicted to be enriched with FPs [103, 148].
Figure B2: The reproducibility rate for OT-related and OT-unrelated hits (a), for
low-ranked and high-ranked OT-related hits by NePhe score (b) and for low-ranked and
high-ranked OT-unrelated hits by NePhe scores (c) for Wnt signaling pathway.
133
Method Network-based similarity Scoring formula
DN1 Direct neighbor (Formula 3.1.1) Formula 3.2.1
DN2 Direct neighbor (Formula 3.1.1) Formula 3.2.2
DN3 Direct neighbor (Formula 3.1.1) Formula 3.2.3
SP1 Shortest path (Formula 3.1.2) Formula 3.2.1
SP2 Shortest path (Formula 3.1.2) Formula 3.2.2
SP3 Shortest path (Formula 3.1.2) Formula 3.2.3
AT1 Association analysis-based transformation (Formula 3.1.3) Formula 3.2.1
AT2 Association analysis-based transformation (Formula 3.1.3) Formula 3.2.2
AT3 Association analysis-based transformation (Formula 3.1.3) Formula 3.2.3
DK1 Diffusion kernel (Formula 3.1.4) Formula 3.2.1
DK2 Diffusion kernel (Formula 3.1.4) Formula 3.2.2
DK3 Diffusion kernel (Formula 3.1.4) Formula 3.2.3
Table B4: The 12 methods for computing NePhe score.
134
RNAi screen #hits #edges P-value1 P-value2
Store-operated Ca2+ entry 997 2248 0.6 7.00E-14
ERK signaling 862 4151 3.00E-29 3.00E-154
Nuclear import of Smads 662 1163 0.2 2.00E-05
Protein secretion and Golgi
organization 421 2212 3.00E-126 <1e-229
Hh signaling pathway 188 672 3.00E-133 4.00E-229
Bacterial infection 190 727 3.00E-158 1.00E-195
Growth and viability 0 NA NA NA
Wnt signaling pathway 127 254 1.00E-61 6.00E-80
Light-dependent CRY degradation 118 65 0.004 1.00E-14
Neural outgrowth genes 113 310 6.00E-131 7.00E-117
Chlamydia infection 118 90 1.00E-06 6.00E-14
Regulators of NFAT 119 29 0.6 0.002
Multipolar divisions 109 57 0.004 1.00E-14
Viral replication 55 300 <1e-229 <1e-229
JAK/STAT signaling 86 69 1.00E-10 4.00E-18
Mycobacterial infection 66 165 4.00E-185 5.00E-145
Transcript-specific mRNA export 44 43 2.00E-35 2.00E-31
Ca(2+) influx 51 87 1.00E-100 2.00E-38
Muscle assembly and maintenance 33 7 0.02 0.006
Caspase activation 37 4 0.4 0.2
Mitochondrial and Peroxisomal Fission 22 2 0.3 0.1
Histone pre-mRNA processing 16 4 8.00E-05 2.00E-07
E2F repression 13 6 4.00E-14 3.00E-38
Orai proteins 12 6 2.00E-17 3.00E-23
Table B5: The network attributes and corresponding P-values for the 24 sub-networks
constructed from RNAi hits after removing those that affect cell growth and viability.
Results are presented in the same order as in Table 1. The P-value1 and P-value2 are
calculated in a similar way to that in Table 1.
Abstract (if available)
Abstract
How to interpret the nature of biological processes, which when perturbed cause certain phenotype changes, such as human disease, is a major challenge. The completion of sequencing the genomics of many model organisms has made “reverse genetic approaches” efficient and comprehensive ways to identify the causal genes for a given phenotype under investigation. For instance, genome-wide knockout strains are now available for S. cerevisiae, and diverse high throughput RNAi knockdown experiments have been performed for multiple higher organisms. Although very useful, these high throughput screening approaches are associated with two main problems: 1) the underlying biology, i.e., how genetic perturbation leads to the change of phenotypes in the complex of biological systems is unclear
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Integrative approach of biological systems analysis on regulatory pathways, modules, protein essentialities, and disease gene classification
PDF
Computational and experimental approaches for the identification of genes and gene networks in the Drosophila sex-determination hierarchy
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Predicting functional consequences of SNPs: insights from translation elongation, molecular phenotypes, and pathways
PDF
Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
Asset Metadata
Creator
Wang, Li
(author)
Core Title
Prioritizing phenotype-associated functional modules and sub-networks from high throughout screening results
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology
Publication Date
08/03/2009
Defense Date
03/13/2009
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
high throughput screenings,modules,network,OAI-PMH Harvest,phenotype
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Sun, Fengzhu (
committee chair
), Arbeitman, Michelle N. (
committee member
), Chen, Liang (
committee member
), Kuo, C.-C. Jay (
committee member
)
Creator Email
tujiaojiao1981@gmail.com,wang7@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m2450
Unique identifier
UC162476
Identifier
etd-Wang-2952 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-179319 (legacy record id),usctheses-m2450 (legacy record id)
Legacy Identifier
etd-Wang-2952.pdf
Dmrecord
179319
Document Type
Dissertation
Rights
Wang, Li
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu
Tags
high throughput screenings
modules
phenotype