Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Integrating high-throughput sequencing data to study gene regulation
(USC Thesis Other)
Integrating high-throughput sequencing data to study gene regulation
PDF
Download
Share
Open document
Flip pages
Copy asset link
Request this asset
Request accessible transcript
Transcript (if available)
Content
INTEGRATING HIGH-THROUGHPUT SEQUENCING DATA TO
STUDY GENE REGULATION
by
Chao Dai
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
December 2013
Copyright 2013 Chao Dai
Dedication
To my parents Shuming and Chunyi
ii
Acknowledgments
First and foremost, I want to express my deepest appreciation to my advisor Professor
Xianghong Jasmine Zhou, who gave me tremendous support and encouragement during
my PhD study. She leads me to the fascinating field of gene regulation, and broads my
vision of computation biology and many other perspectives. I would also like to thank my
dissertation committee members, Professor Frank Alber, for his insightful comments for
spatial genome organization project; Professor Michael Waterman for his detailed disser-
tation editing; and Professor Yan Liu, for her comments for Tensor framework.
Special thanks go to my close collaborators, Dr. Wenyuan Li and Dr. Harianto Tjong,
who shared their research expertise with me and significantly enriched my knowledge of
network-based optimization techniques and spatial genome organization modelling.
I would also like to thank my friends, they are Dr. Stanley Shi, Dr. Shihua Zhang, Dr.
Fang Fang, Dr. Li Wang, Xiao Lei, Dr. Qiang Song, Dr. Yi Shi, Dr. Min Xu, Qingjiao Li,
Jianghan Qu, Quan Chen, Jing Zhang, and Yang Fu. I learn a lot from them, and memories
of our friendship will be always in my heart.
iii
Contents
Dedication ii
Acknowledgments iii
Abstract xiii
Chapter 1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Summary of our work . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Integrating Many Co-splicing Networks to Reconstruct Splicing
Regulatory Modules . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Identify Frequent Coupled Modules to Study Transcription and
Splicing Coupling . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.3 Functional Implications of Spatial Genome Organization . . . . . 5
Chapter 2 Integrating Many Co-splicing Networks to Reconstruct Splicing
Regulatory Modules 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Statement of the problem . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
iv
2.3.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2 Vector norm selection . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.3 Non-convex optimization . . . . . . . . . . . . . . . . . . . . . . 15
2.3.3.1 Concave Duality . . . . . . . . . . . . . . . . . . . . . 16
2.3.3.2 Optimization scheme . . . . . . . . . . . . . . . . . . . 17
2.4 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Cassette exons identification . . . . . . . . . . . . . . . . . . . . 18
2.4.2 RNA-Seq datasets selection and processing . . . . . . . . . . . . 18
2.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.1 Frequent co-splicing clusters are likely to represent functional mod-
ules, splicing modules, transcriptional modules, and protein com-
plexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.1.1 Functional analysis . . . . . . . . . . . . . . . . . . . . 20
2.5.1.2 Splicing regulatory analysis . . . . . . . . . . . . . . . 21
2.5.1.3 Transcriptional and epigenomic analysis . . . . . . . . 23
2.5.1.4 Protein complex analysis . . . . . . . . . . . . . . . . . 24
2.5.2 Co-splicing clusters reveal novel functions that are not identified
by co-expression clusters . . . . . . . . . . . . . . . . . . . . . . 25
2.5.3 Exons can dynamically participate in different pathways upon dif-
ferent co-splicing mechanisms . . . . . . . . . . . . . . . . . . . 27
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Chapter 3 Identify Frequent Coupled Modules to Study Transcription and
Splicing Coupling 30
v
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Statement of the problem . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 Continuous optimization . . . . . . . . . . . . . . . . . . . . . . 38
3.3.3 Optimization algorithm and pattern extraction . . . . . . . . . . . 40
3.4 RNA-seq data preprocessing . . . . . . . . . . . . . . . . . . . . . 43
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.1 Frequent coupled clusters are likely to represent functional mod-
ules and protein complexes . . . . . . . . . . . . . . . . . . . . . 44
3.5.2 Frequent coupled clusters are likely to represent transcription and
splicing modules . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5.3 Exploring the mechanisms of the transcription and splicing coupling 50
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Chapter 4 Functional Implications of Spatial Genome Organization 54
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.1.1 Topological domains are functionally coupled with epigenetic and
transcriptional regulation . . . . . . . . . . . . . . . . . . . . . . 54
4.1.2 Association of promoters-enhancers interactions and chromosome
organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1.3 Association of transcription factories and chromosome organization 58
4.1.4 Association of cell type specific gene expression and cell type spe-
cific chromatin interactions . . . . . . . . . . . . . . . . . . . . . 60
vi
4.1.5 Our contributions to interpret the coupling between spatial genome
organization and genome function . . . . . . . . . . . . . . . . . 61
4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.1 Population-based modeling of chromosome conformation capture
data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.2 Frequent network pattern mining on a population of spatial genome
structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 Functional significance of spatial genome organization . . . . . . . 65
4.3.1 Classify chromosomal domains based on epigenetic marks . . . . 65
4.3.2 Active chromosomal regions frequently interact with each other . 71
4.3.3 Module frequency and module activity are positively correlated . . 73
4.4 Transcription factories are coupled with spatial genome organization 75
4.4.1 RNAPII signal and module frequency are positively correlated in
transcription factories . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.2 Gene pairs in transcription factories have shorter 3D distance . . . 78
4.4.3 Inter-chromosomal transcription factories stability and gene expres-
sion are positively correlated . . . . . . . . . . . . . . . . . . . . 81
4.4.4 Inter-chromosomal transcription factories stability and transcrip-
tion factor signal are correlated . . . . . . . . . . . . . . . . . . . 83
4.4.5 Transcription factors cooperate in transcription factories . . . . . 86
4.4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.5 Transcriptional regulation and epigenetic regulation are closely asso-
ciated in transcription factories . . . . . . . . . . . . . . . . . . . . 92
vii
4.5.1 Module frequency and module activity are positively correlated in
inter-chromosomal transcription factories . . . . . . . . . . . . . 93
4.5.2 Inter-chromosomal transcription factories stability and epigenetic
mark signal are correlated . . . . . . . . . . . . . . . . . . . . . 94
4.5.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.6 Centromeric influence plays an important role to shape spatial genome
organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.6.1 Hub domains are positioned closer to centromeres, lie closer to
nuclear interior and have higher gene density . . . . . . . . . . . 99
4.6.2 Centromeric influence and the coupling of centromeric influence
and transcription factories . . . . . . . . . . . . . . . . . . . . . 101
4.6.3 Inter-chromosomal interactions demarcate chromosomes into nuclear
interior and nuclear periphery . . . . . . . . . . . . . . . . . . . 105
4.6.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Reference List 114
List of Tables
4.1 Overlapping percentages of 46 transcription factors and RNAPII . . . . . 84
viii
List of Figures
2.1 Illustration of the 3
rd
-order tensor representation . . . . . . . . . . . . . . 11
2.2 Frequent co-splicing cluster enrichment fold change increases with its recur-
rence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 Illustration of frequent coupled cluster . . . . . . . . . . . . . . . . . . . 34
3.2 Illustration of the coupling degree between a gene set and an exon set . . 42
3.3 Frequent coupled clusters are likely to represent functional modules . . . 45
3.4 Frequent coupled clusters are likely to represent protein complexes . . . . 47
3.5 Frequent coupled clusters are likely to represent transcription modules . . 49
3.6 Frequent coupled clusters are likely to represent splicing modules . . . . . 49
4.1 Scheme to identify frequent dense cluster . . . . . . . . . . . . . . . . . 66
4.2 The number of TCC reads in centromeric regions is significantly small . . 67
4.3 Hierarchical clustering of non-centromeric domains . . . . . . . . . . . . 68
4.4 Gene density and centromeric distance comparisons among different domain
categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Module size distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.6 Module activity correlates with chromosome number . . . . . . . . . . . 70
4.7 Distribution of adjacent domain distance in intra-chromosomal modules . 70
ix
4.8 Active chromosomal regions frequently interact with each other . . . . . . 72
4.9 Module frequency and module activity are positively correlated . . . . . . 74
4.10 RNAPII enriched modules have higher gene expression . . . . . . . . . . 77
4.11 Correlation between RNAPII and gene expression . . . . . . . . . . . . . 77
4.12 Correlation between RNAPII signal and module frequency . . . . . . . . 79
4.13 Boxplot of module frequency between RNAPII enriched and not RNAPII
enriched . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.14 Gene pairs in transcription factories have shorter 3D distance . . . . . . . 80
4.15 Gene expression and module frequency are only positively correlated in
inter-chromosomal transcription factories . . . . . . . . . . . . . . . . . 82
4.16 3D distance between domain pair and gene expression is negatively corre-
lated . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.17 Homologous chromosome regions have shorter 3D distances in transcrip-
tion factories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.18 Inter-chromosomal transcription factories stability and activator signal are
positively correlated . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.19 Inter-chromosomal transcription factories stability and repressor signal are
negatively correlated . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.20 Heat map of 46 transcription factors in transcription factories . . . . . . . 89
4.21 The number of recruited activators and RNAPII signal are positively cor-
related in inter-chromosomal transcription factories . . . . . . . . . . . . 90
4.22 The number of recruited activators and inter-chromosomal transcription
factory stability are positively correlated . . . . . . . . . . . . . . . . . . 90
x
4.23 Boxplot of gene expression in transcription factories and non transcription
factories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.24 Transcription factories have higher activity and gene density . . . . . . . 95
4.25 Module frequency and module activity are positively correlated in inter-
chromosomal transcription factories . . . . . . . . . . . . . . . . . . . . 95
4.26 Correlation between module activity and intra-chromosomal module fre-
quency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.27 Active mark and inter-chromosomal transcription factory stability are pos-
itively correlated . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.28 Repressive mark and inter-chromosomal transcription factory stability are
negatively correlated . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.29 Hub domains are positioned closer to centromeres and have higher gene
density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.30 Hub domains are positioned closer to nuclear interior . . . . . . . . . . . 100
4.31 Modules containing hub domains have higher gene density and activity . . 101
4.32 Modules containing hub domains have higher RNAPII and gene expression 102
4.33 Modules containing hub domains are positioned closer to centromeres and
nuclear interior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.34 Correlation between module centromeric distance and module frequency . 103
4.35 Modules closer to centromeres have higher RNAPII . . . . . . . . . . . . 105
4.36 Inter-chromosomal module centromeric distance are significantly smaller
than random chance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.37 Chromatin contact heat map among inter-chromosomal modules . . . . . 107
4.38 Chromosomes positioned in nuclear interior have higher gene density . . 108
xi
4.39 Module centromeric distance and radial position are positively correlated . 109
4.40 Negative correlations between module interior distance and module gene
density/activity/RNAPII/gene expression . . . . . . . . . . . . . . . . . . 110
xii
Abstract
High-throughput sequencing is a powerful technique for gene regulation study, which
can provide information about isoform expression as well as transcription factors / epige-
netic factors binding signal measurements. Chromosome conformation capture followed
by high-throughput sequencing were recently applied to study spatial genome organiza-
tion and shed light on functional associations between genome organization and gene reg-
ulation. By integrating genome wide high-throughput sequencing data, this dissertation
investigates gene regulation in three different layers: co-splicing; coupling between tran-
scription and splicing; and functional implications of spatial genome organization.
In the exon co-splicing project, we designed a tensor-based approach to identify co-
spliced exon clusters that frequently appear in multiple RNA-seq datasets. We found that
co-splicing clusters can reveal novel functional groups which cannot be identified by co-
expression clusters, and they can grant new insights of functions associated with post-
transcriptional regulation. Our results also demonstrated that the same exons can dynam-
ically participate in different pathways depending on different conditions and different
other exons that are co-spliced.
In the transcription and splicing coupling project, by using tensor-based approach to
identify frequent coupled clusters on two layered co-expression networks and co-splicing
xiii
networks, we demonstrated that protein-protein interactions between transcription fac-
tors and splicing factors could potentially serve as important mediators for transcription-
splicing coupling.
In spatial genome organization and genome function association project, we applied
our tensor-based integrative network pattern mining method on a population of genome
structures modeled from chromosome conformation capture data, and integrated public
epigenetic marks and transcription factors ChIP-seq and RNA-seq data, to systematically
investigate functional implications of spatial genome organization. Our results strongly
demonstrated that transcriptional regulation and epigenetic regulation are functionally
coupled with spatial genome organization, and centromeric influence plays an important
role to shape spatial genome organization.
xiv
Chapter 1
Introduction
1.1 Background
Gene regulation study is a fundamental and fascinating research field for both molec-
ular and computational biologists. In 1995, Microarray technology was firstly applied to
quantify gene expression [1], and ChIP-chip technology was developed in 2000 to inves-
tigate transcription factor binding sites [2]. Both technologies significantly expanded our
knowledge about gene regulation: we learned that co-expressed genes tend to be func-
tionally associated [3]; in general, co-expressed genes have common binding motifs in
upstream promoter regions, potentially bound by the same transcription factor(s) for gene
co-regulation [4]; alternative splicing contributes to cell type specific gene expression [5];
differentially expressed genes between normal samples and disease samples can be used to
identify drug targets [6], and many more [7–9]. In 2002, Chromosome Conformation Cap-
ture (3C) technology was developed by Job Dekker to study interaction between any two
genomic loci [10], which built a functional link between spatial genome organization and
gene regulation. Since then, gene regulation study opened a new chapter: enhancers can be
brought into proximity with target genes over long linear distance by spatial organization
of chromosomes [11–13] to activate gene expression; cell type specific gene expression
can be explained by cell type specific chromatin interactions [14], and many more discov-
eries [15, 16]. Although much knowledge was discovered, gene regulation studies were
1
only focused on certain genomic loci or a few genes, because of limited and low resolu-
tion of 3C data.
In 2007, high-throughput DNA sequencing technology became a game changer for
gene regulation studies [17, 18]. Gene probes were not needed any more, novel genes and
transcripts were discovered [19]; gene expression and transcription factors binding sig-
nal can be measured with higher accuracy by RNA-seq [20] and ChIP-seq [21]; genome
wide chromatin interactions were constructed with much higher resolution by paired end
reads [22], and many more [23–26]. Transcription, splicing, and spatial genome organi-
zation studies coupled with next generation sequencing extensively expanded our under-
standing about gene regulation.
Being able to access to public biological data is always the key issue for computational
biologists. NCBI GEO (Gene Expression Omnibus) [27], NCBI SRA (Sequence Read
Archive) [28] and Encode (Encyclopedia of DNA Elements) [29] are three big central
consortia to share gene regulation experimental data. NCBI GEO is a data source of
microarray gene expression data; NCBI SRA mainly collects RNA-seq data generated
from high-throughput sequencing; and Encode provides a database for ChIP-seq data,
which mainly aims to identify genome wide transcription factors / histone marks binding
signal.
In this dissertation, we integrated many human RNA-seq and ChIP-seq data to inves-
tigate gene regulation in three different layers: exons co-splicing; coupling between tran-
scription and splicing, and association between spatial genome organization and genome
function.
2
1.2 Summary of our work
In Chapter 2, we integrated 38 human exon co-splicing networks to reconstruct splic-
ing regulatory modules. In Chapter 3, we integrated 38 human gene co-expression net-
works and exon co-splicing networks together, to investigate functional couplings between
transcriptional machinery and splicing machinery, and explore biological mechanisms for
transcription and splicing coupling. In Chapter 4, we systematically examined functional
implications of spatial genome organization, and investigated the driving force for spatial
genome organization.
1.2.1 Integrating Many Co-splicing Networks to Reconstruct Splicing
Regulatory Modules
Alternative splicing is a ubiquitous gene regulatory mechanism that dramatically
increases the complexity of the proteome. However, the mechanism for regulating alter-
native splicing is poorly understood, and study of coordinated splicing regulation has been
limited to individual cases [30, 31]. To study genome-wide splicing regulation, we inte-
grated 38 human RNA-seq datasets to identify splicing module, which we defined as a set
of cassette exons co-regulated by the same splicing factors. We designed a tensor-based
approach to identify co-splicing clusters that appear frequently across multiple conditions,
thus very likely to represent splicing modules - the units in the splicing regulatory net-
work. We applied our tensor-based method [32] to the 38 co-splicing networks derived
from human RNA-seq datasets and identified an atlas of frequent co-splicing clusters. We
demonstrated that these identified clusters represent potential splicing modules by validat-
ing using four biological knowledge databases. The likelihood that a frequent co-splicing
3
cluster is biologically meaningful increases with its recurrence across multiple datasets,
highlighting the importance of our integrative approach. Co-splicing clusters revealed
novel functional groups which cannot be identified by co-expression clusters, and they
can give new insights into functions associated with post-transcriptional regulation; and
the same exons can dynamically participate in different pathways depending on different
conditions and different other exons that are co-spliced.
1.2.2 Identify Frequent Coupled Modules to Study Transcription and
Splicing Coupling
Current network analysis methods all focus on one or multiple networks of the same
type. However, cells are organized by multi-layer networks, e.g. transcriptional regulatory
networks, splicing regulatory networks, and protein-protein interaction networks, which
interact and influence each other. Elucidating coupling mechanisms among those different
types of networks is essential to understand functions and mechanisms of cellular activi-
ties. In this work, we developed the first computational method for pattern mining across
many two-layered graphs, with the two layers representing different types of coupled bio-
logical networks. We formulated the problem of identifying frequent coupled clusters
between the two layers of networks into a tensor-based computation problem, and pro-
posed an efficient solution. We applied our method on 38 two-layered co-transcription
and co-splicing networks, derived from 38 RNA-seq data sets. With the identified atlas of
coupled transcription-splicing modules, we demonstrated that transcription-splicing cou-
pling are potentially mediated by transcription factors - splicing factors interactions.
4
1.2.3 Functional Implications of Spatial Genome Organization
In the last decade, the development of the chromosome conformation capture (3C) and
related technologies followed by high-throughput sequencing has significantly expanded
our understanding of spatial genome structures. Results have shown that chromosomes
adopt highly organized structures and occupy preferential locations in nucleus [33–36],
and accumulated evidence has demonstrated that chromosome organization is highly cor-
related with genome function [37–41]. In order to fully understand gene regulation, it
is important to decipher associations between spatial genome organization and transcrip-
tional/epigenetic regulation.
Chromosome interactions uncovered by 3C and its variants only describe the aver-
age conformation in a population of cells [39, 42]. Considering cell-to-cell heterogeneity,
averaged chromosome conformation cannot adequately reflect the wide range of structural
features relevant to a population of cells. To address the challenge of representing highly
variable genome structures, we modeled a large population of three-dimensional genome
structures from tethered conformation capture (TCC) chromatin interaction data [43],
which represent a spectrum of all possible chromosome configurations. Then we applied
our tensor-based integrative network pattern mining method on the population of genome
structures to identify frequently contacted chromatin regions. After obtaining frequently
co-localized chromatin regions, we integrated public epigenetic marks and transcription
factors ChIP-seq and RNA-seq data to systematically investigate the following important
biological questions:
1. What are the functional associations between spatial genome organization and tran-
scriptional regulation?
5
2. How are transcriptional regulation and epigenetic regulation functionally coupled
in spatial genome organization?
3. What is the major driving force to shape spatial genome organization?
4. What is the relationship between gene positioning in nucleus and gene expression?
6
Chapter 2
Integrating Many Co-splicing Networks
to Reconstruct Splicing Regulatory
Modules
Alternative splicing is a ubiquitous gene regulatory mechanism that dramatically
increases the complexity of the proteome. However, the mechanism for regulating alter-
native splicing is poorly understood, and study of coordinated splicing regulation has been
limited to individual cases. To study genome-wide splicing regulation, we integrated many
human RNA-seq data sets to identify splicing module, which we defined as a set of cas-
sette exons co-regulated by the same splicing factors. We have designed a tensor-based
approach to identify co-splicing clusters that appear frequently across multiple conditions,
and thus very likely to represent splicing modules - the units in the splicing regulatory
network. We applied our tensor-based method to the 38 co-splicing networks derived
from human RNA-seq data sets and identified an atlas of frequent co-splicing clusters.
We demonstrated that these identified clusters represent potential splicing modules by val-
idating against four biological knowledge databases. The likelihood that a frequent co-
splicing cluster is biologically meaningful increased with its recurrence across multiple
data sets, highlighting the importance of the integrative approach. Co-splicing clusters
revealed novel functional groups which cannot be identified by co-expression clusters.
7
They can grant new insights into functions associated with post-transcriptional regulation,
and the same exons can dynamically participate in different pathways depending on dif-
ferent conditions and different other exons that are co-spliced [44].
2.1 Introduction
Alternative splicing provides an important means for generating proteomic diversity.
Recent estimates indicate that nearly 95% of human multi-exon genes are alternatively
spliced [45]. The mechanisms for regulating alternative splicing are still poorly under-
stood, and splicing outcomes are influenced by many factors such as splicing factors, cis-
regulatory elements locating at flanking regions of cassette exons, and RNA secondary
structures [46, 47]. A fundamental task of alternative splicing research is to decipher
splicing code and understand the mechanism of how an exon is alternatively spliced in
a tissue-specific manner.
A central concept in transcription regulation is the transcription module, defined as a
set of genes that are co-regulated by the same transcription factor(s). Analogously, such
coordinated regulation also occurs at the splicing level [48–50]. For example, the splicing
factor Nova regulates exon splicing of a set of genes that shape the synapse [50]. However,
the study of such coordinated splicing regulation has thus far been limited to individual
cases [49–53]. In this work, we defined a splicing module as a set of exons that are co-
regulated by the same splicing factors. The exons in a splicing module can belong to
different genes, but they exhibit correlated splicing patterns (in terms of being included
or excluded in their respective transcripts) across different conditions, thus form an exon
8
co-splicing cluster. Studying splicing modules paves the way to systematically investigate
splicing factors regulation, and the biological functions those co-spliced clusters may have.
The recent development of RNA-seq technology provides a revolutionary tool to study
alternative splicing. From each RNA-seq dataset, we can derive not only the expression
levels of genes, but also those of exons and transcripts (i.e., splicing isoforms). Given an
RNA-seq dataset containing a set of samples (from different biological conditions), we can
calculate the inclusion rate of each exon in every sample as the ratio between its expression
level and that of the host gene. In this study we only consider cassette exons, which are
the most common case in alternative splicing events. Henceforth, the term “exon” always
means “cassette exon”. After obtaining inclusion rate profiles of cassette exons, an exon
splicing network can be constructed to study splicing regulation. A recent study provided a
nice example of studying splicing regulatory relationships using a network of exon-exon,
exon-gene, and gene-gene links using exon array data [54]. Here, we constructed from
each RNA-seq dataset a weighted co-splicing network where the nodes represent exons
and the edge weights are correlations between the inclusion rates of two exons across all
samples in the dataset. While directly comparing the inclusion rates for the same exon in
different datasets where the results could be biased by platforms and protocols, the corre-
lations between inclusion rates for a given exon pair are comparable across datasets. From
a series of RNA-seq datasets, we can therefore derive a series of co-splicing networks,
which can be subjected to comparative network analysis and provide an effective way to
integrate a large number of RNA-seq experiments conducted in different laboratories and
using different technology platforms.
In our study, a heavy subgraph in a weighted co-splicing network represents a set of
exons that are highly correlated in their inclusion rate profiles, i.e., they are co-spliced.
9
Exons which frequently form a heavy subgraph in multiple datasets are likely to be reg-
ulated by the same splicing factors, and thus form a splicing module. We call such pat-
terns frequent co-splicing clusters (FSC). Due to the enhanced signal to noise separation,
frequent clusters are more robust and are more likely to be regulated by the same splic-
ing factors (thus more likely to represent splicing modules) than those heavy subgraphs
derived from a single dataset.
2.2 Statement of the problem
Our goal is to identify co-spliced exon clusters that frequently occur across multi-
ple weighted co-splicing networks, and we adopt our recently developed tensor-based
approach to achieve this goal [55]. A co-splicing network of n nodes (exons) can be
represented as annn adjacency matrixA, where elementa
ij
is the weight of the edge
between nodes i and j. This weight represents the Pearson correlation between the two
exons’ inclusion rate profiles. Givenm co-splicing networks with the samen nodes but
different edge weights, we can represent the whole system as a 3
rd
-order tensor (or 3-
dimensional array) of sizennm (Figure 2.1). An elementa
ijk
of the tensor is the
weight of the edge between nodesi andj in thek
th
network. A co-splicing cluster appears
as a heavy subgraph in the co-splicing network, which in turn corresponds to a heavy
region in the adjacency matrix. A frequent co-splicing cluster is one that appears in multi-
ple datasets, and appears as a heavy region of the tensor. Thus, the problem of identifying
frequent co-splicing clusters can intuitively be formulated as the problem of identifying
heavy sub-tensors in a tensor.
10
Figure 2.1: Illustration of the 3
rd
-order tensor representation of a collection of networks:
A collection of co-splicing networks can be “stacked” into a third-order tensor such that
each slice represents the adjacency matrix of one network. The weights of edges in the co-
splicing networks and their corresponding entries in the tensor are color-coded according
to the scale to the right of the figure. After reordering the tensor by the exon and network
membership vectors, a frequent co-splicing cluster (colored in red) emerges in the top-left
corner. It is composed of exonsA;B;C;D which are heavily interconnected in networks
1; 2; 3.
Givenm co-splicing networks, with each network has then exons with different edge
weight representing the strength of two exons that are co-spliced, our aim is to identify a
subset of exons that are heavily and frequently connected in a subset of networks. Mathe-
matically, a frequent co-splicing cluster (FSC) in the tensorA = (a
ijk
)
nnm
can be rep-
resented by two membership vectors: (i) the exon membership vectorx = (x
1
;:::;x
n
)
T
,
where x
i
= 1 if exon belongs to the cluster and otherwise; and (ii) the network mem-
bership vector y = (y
1
;:::;y
m
)
T
, where y
j
= 1 if the exons of the cluster are heavily
connected in network and otherwise. The summed weight of all edges in the FSC is
H
A
(x;y) =
1
2
n
X
i=1
n
X
j=1
m
X
k=1
a
ijk
x
i
x
j
y
k
: (2.1)
11
Please note that weighta
ijk
is counted only whenx
i
=x
j
=y
k
= 1, indicating exoni
andj are co-spliced in networkk. We have two constraints for this optimization problem:
the size of co-splicing cluster (the number of exons it contains) is at least K1, and the
frequency of the co-splicing cluster is at leastK2. That is, we look for exon membership
vector x = (x
1
;:::;x
n
)
T
and network membership y = (y
1
;:::;y
m
)
T
which jointly
maximizeH
A
(x;y) with constraints
P
n
i=1
x
i
K
1
and
P
m
j=1
y
j
K
2
.
2.3 Methodology
2.3.1 Problem formulation
The problem of discovering a frequent co-splicing cluster can be intuitively formulated
as a discrete combinatorial optimization problem: for any given exon numberK
1
and net-
work number K
2
, iterate all the possible exons and networks and look for the heaviest.
However, this is a 0-1 Integer programming problem, which is NP hard, and not solvable
even for a small number of exons and datasets. Therefore, the naive discrete optimization
method is not applicable. Instead, by relaxing exon membership and network member-
ship vector to be continuous (for example, vector norm) rather than binary, we can have
continuous objectives and constraints, and gain access to a wealth of established optimiza-
tion methods to solve our problem. There are many examples to reformulate a discrete
graph discovery problem as a continuous optimization problem, such as using a Hopfield
neural network to solve the traveling salesman problem [56] and applying the Motzkin-
Straus theorem to the clique-finding problem [57]. Moreover, when a graph-based pattern
mining problem is transformed into a continuous optimization problem, it is easier to
12
incorporate constraints representing prior knowledge. Finally, advanced continuous opti-
mization techniques require very few ad hoc parameters, in contrast with most heuristic
graph combinatorial algorithms.
After relaxing integer constraints to continuous constraints, the problem is reformu-
lated as looking for non-negative real vectors x;y that jointly maximizeH
A
(x;y). The
optimization problem is formally expressed as follows:
max
x2R
n
+
;y2R
m
+
H
A
(x;y)
subject tof(x) = 1 and g(y) = 1
(2.2)
where R
+
is a non-negative real space, andf(x) andg(y) are vector norms. Basically,
larger element in x indicates that it has stronger co-splicing relationships with its neigh-
bors; and larger element iny indicates that co-splicing signal is stronger in the correspond-
ing network. After solving Eq. (2.2), users can easily identify the top-ranking exons (after
sorting each exon byx) and top-ranking networks (after sorting each network byy) con-
tributing to the objective function. After rearranging the networks in this manner, the FSC
with the largest density occupies a corner of the 3D tensor. We can then mask all edges in
the heaviest FSC with zeros, and optimize Eq. (2.2) again to search for the next FSC.
2.3.2 Vector norm selection
A vector norm is defined askxk
p
= (
P
n
i=1
jx
i
j
p
)
1=p
, wherep> 0, which is also called
“L
p
-vector norm”. When p 1, L
p
is a convex function; otherwise when p < 1, L
p
is non-convex function. The commonly used vector norm is L
2
, which corresponds to
Euclidean distance; when p = 1, it leads to regression shrinkage, and L
1
norm can be
used in feature selection; in extreme case, whenp = 0,kxk
0
=jfi : x
i
6= 0gj , which
13
corresponds to sparse learning; whenp =1,kxk
1
= max
16i6n
(jx
i
j), usually applied in
optimization with bound constraints. Clearly, the choice of vector norms in Eq. (2.2) for
exon membership and network membership vectors has a big impact on the outcome of
optimization. In general, the closerp is to zero, the sparser the solution it is, that is, fewer
components of the optimized vectors are significantly different from zero. In contrast, as
p increases, the solution grows smoother, and more components tend to have none zero
values that are close to each other.
In our problem setup, generally splicing regulation mechanism on target exons is spe-
cific, and we do not expect that FSC contains a large number of exons, indicating exon
membership vector solution should be sparse, which can be achieved with a mixed norm
f(x) =kxk
0
+ (1)kxk
1
(0<< 1). The normL
0
favors sparsity while the norm
L
1
encourages smoothness in the non-zero components of x. In practice, we use f(x)
with another mixed normf(x) = kxk
p
+ (1)kxk
2
(0 < < 1) wherep < 1. For
network membership vector, we want the exon cluster to appear in as many networks as
possible, so the network membership values should be non-zero and close to each other.
This is the typical outcome of optimization using the normL
1
. In practice, we approxi-
mateL
1
withkyk
q
, whereq > 1. Therefore, the regularized vector norms and are fully
specified as follows,
f(x) =kxk
p
+ (1)kxk
2
and g(y) =kyk
q
(2.3)
We performed simulations to determine suitable values for the parametersp,, andq,
by applying the tensor method to collections of random weighted networks. We randomly
placed FSCs of varying size, recurrence, and density in a subset of the random networks.
14
We then tried different combinations ofp,, andq, and adopted the combination (p = 0:8,
= 0:2, andq = 10) that led to the discovery of the most FSCs.
Therefore, we have
f(x) = 0:2kxk
0:8
+ 0:8kxk
2
and g(y) =kyk
10
(2.4)
Please note thatf(x) is a non-convex function andg(y) is a convex function.
2.3.3 Non-convex optimization
Since f(x) is non-convex, our tensor method requires an optimization protocol that
can deal with non-convex constraints. The quality of the optimum discovered for a non-
convex problem depends heavily on the numerical procedure. Standard numerical tech-
niques such as gradient descent converge to a local minimum of the solution space, and
different procedures often find different local minimal. Thus, it is important to find a the-
oretically justified numerical procedure. We use an advanced framework known as multi-
stage convex relaxation, which has good numerical properties for non-convex optimization
problems [58]. In this framework, concave duality is used to construct a sequence of con-
vex relaxations that give increasingly accurate approximations to the original non-convex
problem. We approximated the sparse constraint function f(x) by the convex function
f
v
(x) = v
T
h(x)f
h
(v), wherev is dual variable,h(x) is a convex functionh(x) =x
2
andf
h
(v) is the concave dual of the functionf
v
(x). The vectorv contains coefficients that
will be automatically generated during the optimization process. After each optimization,
the new coefficient vectorv yields a convex function that more closely approximates the
15
original non-convex functionf(x). The details of non-convex optimization are explained
as follows.
2.3.3.1 Concave Duality
Our goal is to regularize non-convex function f(x) by using concave duality. Let
h(x) :R
n
!
R
n
be a vector function with being the convex hull of its range. It may
not be a one-to-one map. However, we assume that there exists a functionf
h
(u) defined
on
such thatf(x) =f
h
(h(x)) holds.
Assuming that we can findh so that the functionf
h
(u) is a concave function ofu on
, we can rewrite the regularization functionf(x) as:
f(x) = inf
v2R
n
v
T
h(x)f
h
(v)
(2.5)
using concave duality (Page 308 in [59]), wheref
h
(v) is the conjugate function ofv.
Basically, by concave duality, we can map the non-convex set to a convex set through
h(x) : R
n
!
R
n
. In the convex set, the minimum value of original non-convex
functionf(x) of is achieved at
^ v =r
u
f
h
(u)j
u=h(x)
(2.6)
Equation (2.6) tells us that during non-convex regularization procedure, we can keep
updating dual variablev to give increasing accurate approximations to original non-convex
functionf(x).
16
2.3.3.2 Optimization scheme
In our case, non-convex function isf(x) =kxk
p
+ (1)kxk
2
(0<< 1), given
mapping convex function ash(x) =
jx
1
j
h
;:::;jx
n
j
h
, solution in (2.6) for dual variable
is given by
^ v
i
=
h
(
X
j
jx
j
j
p
)
1
p
1
jx
i
j
ph
+
1
h
(
X
j
x
2
j
)
1
2
1
jx
i
j
2h
(2.7)
Given initial values x
(0)
2 R
+n
and y
(0)
2 R
+m
, the optimization scheme to solve
(2.2) is to repeat the following updates until convergence:
Updatex:x
i
h
x
i
P
j;k
a
ijk
x
j
y
k
^ v
i
i1
h
, thenx is normalized byx
i
x
i
jjxjjp+(1)jjxjj
2
.
Updatey:y
k
(y
k
P
i;j
a
ijk
x
i
x
j
)
1
q
, theny is normalized byy
k
y
k
jjyjjq
.
Updatev:v
i
h
(
P
j
jx
j
j
p
)
1
p
1
jx
i
j
ph
+
1
h
(
P
j
x
2
j
)
1
2
1
jx
i
j
2h
Once the membership vectors (solution of (2.2)) are solved, the frequent co-splicing
clusters can be intuitively obtained by including those exons and networks with large mem-
bership values. However, any given solution can result in multiple overlapping patterns
whose density is greater than a specified threshold. Here, density is defined as the average
weight of all edges in the pattern. To identify the most representative pattern, we first
ranked exons and networks in decreasing order of their membership values in x and y.
Then we extracted two representative patterns that satisfy the density threshold: the pattern
that occurs in the most networks while having at least the minimum number of top-ranking
exons (5), and the pattern with the largest number of top-ranking exons while appearing
in at least the minimum number of top-ranking networks(3). Both patterns are included
as co-splicing clusters in our results. After discovering a pattern, we can mask its edges
17
in those networks where it occurs (replacing those elements of the tensor with zeroes) and
optimize (2.2) again to search for the next frequent co-splicing cluster.
2.4 Data preprocessing
2.4.1 Cassette exons identification
We obtained human cassette exons from hg19 UCSC gene annotations downloaded
from UCSC database. For every gene, first we removed exons in 5’ UTR and 3’ UTR
of each transcript, and then checked the location of each remaining exon to determine
whether it is contained in all the known transcripts of a gene model. If an exon is not con-
tained in all the transcripts, the exon is a cassette exon. We applied this method to all the
UCSC genes, and identify 22,310 cassette exons in human genome. It is worth noting that
some cassette exons are always contained in the same transcripts, which means they are
always co-spliced as long as the transcripts are expressed. Since this trivial case dominates
density calculation in co-splicing clusters, and produce biased result, we combined those
cassette exons that are always included in the same transcripts as a cluster, and used one
cassette exon to represent the whole cassette exon cluster.
2.4.2 RNA-Seq datasets selection and processing
From NCBI’s Sequence Read Archive (SRA) we selected all human RNA-seq datasets,
each of which contains at least six samples (the minimum for robust correlation estima-
tion). The 38 datasets that met these criteria by January 30 2011 were used for the fol-
lowing analysis. For each dataset, we used the Tophat [60] tool to map short reads to
18
the hg19 reference genome and applied the transcript assembly tool Cufflinks [61] to esti-
mate expressions for all transcripts with known UCSC transcription annotations [62]. We
calculated the inclusion rate of each exon in every sample as the ratio between its expres-
sion (i.e., sum of FPKM over all transcripts that cover the exon) and the expression of
the host gene (i.e., sum of FPKM over all transcripts of the gene). It is worth noting that
in RNA-seq experiments, a gene expression with low FPKM is usually not precisely esti-
mated because the number of reads mapped to the gene is quite small. In order to work
with reasonably accurate estimates of exon inclusion rates, as pointed out by [63], we
calculated inclusion rates only for those exons whose host genes’ expressions are above
80
th
percentile across at least 6 samples. Throughout this study, we only considered the
genes containing cassette exons whose inclusion rate profiles met the above criterion. This
resulted in 16,024 exons covering 9,532 genes. Based on these profiles, we constructed
an exon co-splicing network from each RNA-seq dataset by using Pearson’s correlation
between exons’ inclusion rate profiles.
2.5 Results
We applied our method to 38 RNA-seq datasets generated under various experimental
conditions. Adopting the empirical criteria of “density”>0.4 and cluster size>5 exons,
we identified 7,194/3,104/1,422/594 co-splicing clusters with recurrences>3/4/5/6.
19
2.5.1 Frequent co-splicing clusters are likely to represent functional
modules, splicing modules, transcriptional modules, and pro-
tein complexes
To assess the biological significance of the identified patterns, we evaluated the extent
to which these exon clusters represent functional modules, splicing modules, transcrip-
tional regulatory modules, and protein complexes.
2.5.1.1 Functional analysis
We evaluated the functional homogeneity of the host genes in an exon cluster using
Gene Ontology (GO) annotations. To ensure the specificity of GO terms, we filtered out
general GO terms associated with>300 genes. If the host genes of exons in a cluster are
statistically enriched in a GO term withp-value<1E-4 (based on the hypergeometric test),
we declared the exon cluster to be functionally homogeneous. We found that 23.3% of
clusters appearing in>3 datasets are functionally homogenous, compared to only 6.0%
of randomly generated clusters with the same sizes. An enrichment fold ratio of 3.9
between real and random patterns demonstrated the functional relevance of the identified
patterns. Furthermore, the enrichment fold ratio increased with increasing density and
recurrence of FSCs (shown in Figure 2.2A). For example, when the FSCs were required
to recur in at least 5 datasets, their enrichment fold ratio compared to random patterns
increased to 4.4, confirming the benefits of the integrative analysis of multiple RNA-seq
datasets in improving the quality of detected patterns. Functionally homogenous clusters
covered a wide range of post-transcriptional associated GO terms, such as “RNA splic-
ing”, “ribonucleoprotein binding”, “heterogeneous nuclear ribonucleoprotein complex”,
20
Figure 2.2: Four types of databases are used: (A) Gene Ontology for functional enrich-
ment, (B) SpliceAid2 database for splicing enrichment, (C) ENCODE database for tran-
scriptional and epigenetic enrichment, and (D) CORUM database for protein complex
enrichment. Thex-axis is network recurrence andy-axis is enrichment fold ratio, calcu-
lated by dividing the percentage of enriched clusters by the percentage of enriched random
clusters.
“negative regulation of transcription from RNA polymerase II promoter”, and “cellular
protein localization”.
2.5.1.2 Splicing regulatory analysis
The exons in our identified co-splicing clusters have highly correlated inclusion rate
profiles across different experimental conditions, and clusters meeting this criterion are
likely to consist of exons co-regulated by the same splicing factors. It was observed that
splicing factors can affect alternative splicing by interacting with cis-regulatory elements
in a position-dependent manner [64]. We collected the experimental RNA target motifs
(2220 RNA binding sites) of 62 splicing factors from the SpliceAid2 database [65]. To
identify possible splicing factors associated with a co-splicing cluster, for each exon of a
co-splicing cluster we obtained the internal exon region and its 50bp flanking intron region
that are enriched in the motifs of those 62 splicing factors by performing BLAST search
(E-score<0.001). If the exons of a cluster are highly enriched in the targets of a splicing
factor, we consider the cluster to be ”splicing homogeneous”. Although the collection of
21
known splicing motifs is very limited, at thep-value<0.05 level (based on the hypergeo-
metric test), we still observed that 4.9% clusters with>5 exons and>6 recurrences are
splicing homogenous, compared to 1.6% of randomly generated patterns with the same
size distribution. Performing the same analysis for less frequent clusters, we found that
as the recurrence increases the enrichment fold ratio also increases (Figure 2.2B). The
five most frequently enriched splicing factors are hnRNP E2, 9G8, hnRNP U, SRp75 and
SRp30c. hnRNP E2 and hnRNP U both belong to heterogeneous nuclear ribonucleoprotein
family, which generally suppress splicing through binding to exonic splicing silencer [46].
Studies showed that hnRNP E2 can repress exon usage when present at high levels in
vitro [66], and hnRNP U bind pre-mRNA as well as nuclear mRNA and play an important
role in processing and transport of mRNA [67]. 9G8, SRp75 and SRp30c all belong to the
SR family of splicing regulators. The 9G8 protein excluding other SR factors can rescue
the splicing activity of a 9G8-depleted nuclear extract, indicating 9G8 plays a crucial role
in splicing [68]. SRp75 is present in messenger ribonucleoproteins in both cycling and dif-
ferentiated cells, and shuttles between nucleus and cytoplasm, implicating its widespread
roles in splicing regulation [69]. SRp30c can function as a repressor of 3’ splice site utiliza-
tion and SRp30c-CE9 interaction may contribute to the control of hnRNP A1 alternative
splicing [70].
We found that certain splicing factors tend to co-bind to the cis-regulatory regions
of exons in a co-splicing cluster, suggesting the combinatorial regulation of those splic-
ing factors. Trans-acting SR proteins Tra2 and SRp30c were simultaneously enriched in
18 clusters (with recurrence>3), whose major functions are related to post-transcriptional
regulation, such as “ribonucleoprotein binding” (p-value=2.11E-5), “nuclear mRNA splic-
ing, via spliceosome” (p-value=7.66E-5), “RNA export from nucleus” (p-value=4.81E-5),
22
and “translational initiation” (p-value=2.48E-5). Current study has shown that there is a
cooperative interaction between Tra2 and SRp30c in exonic splicing enhancer depen-
dent GnRG pre-mRNA splicing [71]. Splicing regulators SRp20, SRp30c and SRp75
were simultaneously enriched in 2 clusters (with recurrence3), which are also asso-
ciated with post-transcriptional regulation. For example, the following are statistically
significant: “RNA splicing” (p-value=3.25E-6), “translation initiation factor activity” (p-
value=7.42E-5), and “eukaryotic translation initiation factor 3 complex” (p-value=2.17E-
4). Our results indicated that combinatorial splicing regulation can potentially occur in
post-transcriptional processes.
2.5.1.3 Transcriptional and epigenomic analysis
To evaluate how co-splicing is affected by transcriptional regulation, we used 191
ChIP-seq profiles generated by the Encyclopedia of DNA Elements (ENCODE) consor-
tium [72]. This dataset includes the genome-wide binding of 40 transcription factors (TF),
9 histone modification marks, and 3 other markers (DNase, FAIRE, and DNA methyla-
tion) on 25 different cell lines. If the host genes of an exon cluster are highly enriched in
the targets of any regulatory factor, we consider the cluster to be “transcription homoge-
nous”. At the significance level p-value <0.01, 74.9% clusters with recurrences3 are
transcription homogenous, compared to only 21.2% of randomly generated clusters of the
same size. As expected, the enrichment fold ratio increases with recurrence (Figure 2.2C).
The four most frequently enriched regulatory factors were TAF8, GABP, FOS and NFYB.
TAF8 is a subunit of transcription initiation factor TFIID, and is required for accurate and
regulated initiation by RNA polymerase II [73]. As an ETS transcription factor, GABP
plays a key role in regulating genes which are intimately involved in cell cycle control,
23
protein synthesis and cellular metabolism [74]. FOS can dimerise with c-Jun to form AP-
1 transcription factor, which upregulates transcription of a wide range of genes involved in
proliferation and differentiation to defense against invasion and cell damage [75]. NFYB
is a subunit of an ubiquitous heteromeric transcription factor NF-Y, which regulates 30%
of mammalian promoters [76].
2.5.1.4 Protein complex analysis
We evaluated the extent to which host genes of our identified exon clusters are pro-
tein complexes by using the Comprehensive Resource of Mammalian protein complexes
database (CORUM, September 2009 version) [77]. At the significance level p-value
<0.05, 18.1% of co-splicing clusters with recurrences3 were enriched in genes belong-
ing to a protein complex, versus only 0.7% of randomly generated clusters of the same
size. The enrichment fold ratio for protein complexes increased with the cluster recur-
rence (Figure 2.2D). The five most frequently enriched protein complexes were “Spliceo-
some”, “CCT micro-complex”, “large Drosha complex”, “Nop56p-associated pre-rRNA
complex”, and “C complex spliceosome”. At least 1/3 of subunits in the enriched complex
“large Drosha complex” contain proteins associated with splicing function, especially het-
erogeneous nuclear ribonucleoproteins such as HNRNPH1, HNRNPM, HNRNPU, HNRN-
PUL1 and HNRNPDL [77].
24
2.5.2 Co-splicing clusters reveal novel functions that are not identi-
fied by co-expression clusters
Previous study showed that genes that are co-regulated transcriptionally do not nec-
essarily overlap with those that are co-spliced [78]. Therefore, the identification of co-
splicing clusters can potentially reveal functionally related genes that could not be dis-
covered from transcription analysis. In order to identify novel functions associated with
co-splicing but not co-expression, we constructed a gene co-expression network from
each RNA-seq dataset. The nodes of these networks represent genes, and the edges are
weighted by Pearson’s correlation between two gene expression profiles. We then applied
our tensor-based pattern mining algorithm to identify frequent co-expression clusters in the
38 co-expression networks. The same functional enrichment analysis described above for
co-splicing clusters was performed on the resulting co-expression clusters. Results showed
that 98.8% of co-splicing clusters with recurrences 3 had low expression correlations
(average correlation 6 0.2). We found many post-transcriptional regulation associated
functions are enriched in co-splicing clusters but not in co-expression clusters, including
“maintenance of protein location”, “regulation of protein catabolic process”, “cytoplas-
mic sequestering of protein”, “regulation of intracellular protein transport”, “regulation of
ubiquitin-protein ligase activity”, “ribonucleoprotein complex assembly”, “RNA splicing,
via transesterification reactions”, and “RNA export from nucleus”.
For example, one co-splicing cluster contained seven host genes: HNRNPUL1,
HNRNPC, DHX9, BAT1, PSMA5, RAD23 and RPS9. This cluster cannot be found
from co-expression data, since the expression profiles of the host genes had low cor-
relation. However, this set of host genes was enriched with several splicing associated
25
functions, including “RNA splicing” (p-value=1.89E-6) and “RNA helicase activity” (p-
value=4.68E-5). Out of seven host genes, HNRNPUL1 and HNRNPC belong to the het-
erogeneous nuclear ribonucleoprotein family, which generally suppress splicing through
binding to an exonic splicing silencer [46]. DHX9, known as RNA helicase A, is a highly
conserved DEAD-box protein that activates transcription, modulates RNA splicing, binds
the nuclear pore complex and involves in spliceosome assembly [79,80]. Previous research
illustrated that DHX9 mediates association of CBP and RNA polymerase II [81], and cur-
rent study further shows that DHX9 interacts with post-transcriptional control element
RNA in the nucleus and cytoplasm to facilitate efficient translation [79]. Interestingly,
HNRNPC and DHX9 are tightly functionally associated: silencing of DHX9 seriously dis-
turbed the nuclear distribution of the hnRNP C protein [82]. As an essential splicing factor,
BAT1 also belongs to the DEAD-box protein family, and plays an important role in mRNA
export from the nucleus to the cytoplasm, supported by recent experimental evidence that
knocked down BAT1 induces spliced mRNA, as well as total polyA RNA accumulating in
nuclear speckle domains, which is not exported to the cytoplasm [83]. Clearly, co-splicing
clusters can provide complementary information on functionally related gene groups in
addition to co-trancriptional clusters. In particular, co-splicing clusters can grant new
insights into functions associated with post-transcriptional regulation.
26
2.5.3 Exons can dynamically participate in different pathways upon
different co-splicing mechanisms
Skipping or including a cassette exon can change the functions of a protein by
deleting or inserting a protein domain, and protein isoforms resulting from alter-
native splicing may participate in different pathways. In our results, we observed
that 70.3%/52.3%/38.3%/27.1% of exons are members of at least two clusters
(recurrence3/4/5/6) with different functions. For example, exon8 of the gene Rela
appeared in four co-splicing clusters with recurrence3, which are enriched with the
following distinct functions respectively: “ER-associated protein catabolic process” (p-
value=2.20E-5), “response to extracellular stimulus” (p-value=3.80E-5), “regulation of
gene-specific transcription” (p-value=8.89E-5), and “positive regulation of intracellular
protein kinase cascade” (p-value=2.49E-5). Rela encodes the transcription factor p65,
which is an important subunit of the NF-B complex that affects several hundred genes
by NF-B signaling. Recent research has identified several alternative splice variants of
Rela, e.g. p654, p6542 and p6543. In fact, p654 arises by the use of an alterna-
tive splice site located 30 nucleotides into exon8, and p6543 was identified as a splice
variant lacking exon7 and exon8 [84]. These facts are consistent with our finding that
exon8 is dynamically included in multiple co-splicing clusters. As another example, exon2
of the gene EIF5 appears in three co-splicing clusters with recurrence3, which are
enriched with following distinct functions respectively: “RNA splicing” (p-value=6.27E-
5), “mRNA polyadenylation” (p-value=1.57E-5), and “regulation of translational initia-
tion” (p-value=8.18E-5). As a translation initiation factor, EIF5 plays critical roles for
the accurate recognition of the correct start codon during translation initiation [85]. Our
27
result suggested that except for translation initiation regulation, EIF5 may also involve
in post-transcriptional regulation, such as RNA splicing and mRNA polyadenylation by
dynamically including exon2 in multiple co-splicing clusters. These examples demon-
strated that exons can contribute to different functions of proteins depending on different
splicing regulatory mechanisms.
2.6 Conclusions
Splicing codes are determined by a combination of many factors, such as cis-regulatory
elements and trans-acting factors. If certain exons share the same splicing code, they may
form a splicing module: a unit in the splicing regulatory network. Therefore, identifying
co-splicing clusters and investigating their cis-regulatory elements and associated trans-
acting factors can serve as an important step to decipher the splicing code. Our tensor-
based approach can identify co-spliced exon clusters that frequently appear in multiple
RNA-seq datasets. The exons in a frequent co-splicing cluster can belong to different
genes, but are very likely to be co-regulated by the same splicing factors, thus forming
a splicing module. We demonstrated that the identified clusters represent biologically
meaningful modules, by validating against four biological knowledge databases. In all
four types of enrichment results the likelihood that a co-splicing cluster is biologically
meaningful increases with its recurrence. This consistent behavior highlights the impor-
tance of the integrative approach. We also showed that the co-splicing clusters can reveal
novel functional related genes that cannot be identified by co-expression clusters, and that
the same exons can dynamically participate in different pathways depending on differ-
ent conditions and different other exons that are co-spliced. The NCBI Sequence Reader
28
Achieve database currently stores 8099 RNA-seq profiles, and this number is expected to
dramatically increase in the near future. We expect to apply our approach to the rapidly
accumulating RNA-seq data of multiple organisms, and to identify a large number of splic-
ing modules and their associated phenotype conditions. This analysis can serve as a first
step towards the reconstruction of tissue and disease specific splicing regulatory networks.
29
Chapter 3
Identify Frequent Coupled Modules to
Study Transcription and Splicing
Coupling
Current network analysis methods all focus on one or multiple networks of the same
type. However, cells are organized by multi-layer networks, e.g. transcriptional regulatory
networks, splicing regulatory networks, protein-protein interaction networks, which inter-
act and influence each other. Elucidating the coupling mechanisms among different types
of networks is essential in better understanding cell functions. In this work, we developed
the first computational method for pattern mining across many two-layered graphs, with
the two layers representing coupled biological networks of different types. We formulated
the problem of identifying frequent coupled clusters between the two layers of networks
into a tensor-based computation framework, and proposed an efficient solution to solve
the problem. We applied the method on 38 two-layered co-transcription and co-splicing
networks, derived from 38 human RNA-seq datasets. With the identified atlas of coupled
transcription-splicing modules, we explored to what extent, for which cellular functions,
and by what mechanisms transcription-splicing coupling takes place [86].
30
3.1 Introduction
The recent development of high-throughput technologies provides numerous opportu-
nities to systematically characterize diverse biological networks. “Network Biology” is
an emerging field [87]. Thus far, most computational methods focused on the analysis
of one or more of the same type of biological networks, e.g. protein-protein interaction
networks, transcriptional regulatory networks, or metabolic networks, each representing a
single layer of organization in the complex cellular system. In reality, these multiple levels
of organization interact and influence each other, harboring sophisticated coupling mech-
anisms that are essential in maintaining the function and robustness of cells. However, to
the best of our knowledge, no computational methods are applied to analyze the coupling
between different types of biological networks. Yet, more and more experimental stud-
ies simultaneously profile biological systems at multiple levels. For example, the Cancer
Genome Atlas (TCGA) project generates multi-dimensional maps of key genomic changes
(e.g. SNP, DNA methylation, gene expression, and microRNA expression) for the same
set of tumor samples [88]. The NCI60 project profiled 60 human cancer cell lines in terms
of drug responses [89–91], gene expression [92], protein expression [93], and miRNA
expression [94]. The most abundant multi-level profiling data is RNA-seq data, and every
RNA-seq profile provides information at both the transcription level and the splicing level.
How to utilize such multi-dimensional data to study the coupling between different layers
of cellular networks is a challenge in computational biology. In this work, we developed
a novel algorithm to mine coupled network modules that occur frequently across a series
of two-layered networks. We used the rapidly accumulating RNA-seq data as the testing
system to study the coupling between co-expression and co-splicing networks.
31
Transcription and splicing regulate gene expression in a coordinated manner [95–97].
Recent reports observed that splicing couples with transcription [95, 97, 98], occurring in
intimate association with the elongating RNA polymerase II [99]. Multiple splicing reg-
ulators were shown to be linked to the transcription machinery via protein-protein inter-
actions [96, 98, 100, 101]. Whereas most studies focused on how transcription impacts
splicing, [100] reported the first evidence that splicing factors also affect transcription.
Although study of transcription-splicing coupling has emerged as an active research
area [102–105], the function, scope, and mechanism of the coupling has not been sys-
tematically studied yet. It is not clear to what extent, by what mechanisms, and for which
cellular functions, transcription-splicing coupling takes place.
From a RNA-seq dataset, we can obtain not only the expression levels of genes to
construct a gene co-expression network, but also those of exons from which an exon co-
splicing network can be constructed. In a gene co-expression network, nodes represent
genes and edge weights represent correlations between the co-expression of two genes
across all samples. In an exon co-splicing network, nodes represent exons and the edge
weights represent correlations between the inclusion rates of two exons across all samples.
These two kinds of networks are “coupled”, in the sense that each gene in the gene network
contains several exons in the exon network. To study coupling mechanisms, we propose
to identify coupled transcription-splicing modules in a series of paired gene co-expression
and exon co-splicing networks, with each pair being derived from a RNA dataset. The
concept of our approach is illustrated in Figure 3.1. A set of co-expressed genes (i.e.,
densely interconnected in the gene co-expression networks) is likely to be co-regulated by
the same transcription factor and thus may represent a transcription module. Similarly,
a set of co-spliced exons (i.e., densely interconnected in the exon co-splicing networks)
32
is likely to be co-spliced by the same splicing factor and thus may represent a splicing
module. When genes of a transcription module contain their exons or at least a subset of
their exons to form a splicing module, it is likely to indicate the transcription and splicing
coupling takes place, and thus forms a so-called “coupled module”. In particular, if the
coupled cluster (a co-expressed gene cluster coupled with a co-spliced exon cluster) recur-
rently appears across multiple paired gene and exon networks, then the enhanced signal to
noise ratio would point to a higher likelihood for this frequent coupled cluster (abbrevi-
ated as FCC) to represent a functional coupled module than would those coupled clusters
derived from a single dataset.
3.2 Statement of the problem
To identify FCCs from a large collection of many paired weighted co-expression and
co-splicing networks, we proposed a computational method based on the tensor model. A
tensor is a multi-dimensional array and a matrix can be thought as a 2
nd
-order tensor. Given
L RNA-seq datasets, a collection ofL gene co-expression networks with the sameN gene
nodes can be represented as a 3
rd
-order tensorG = (g
ijk
)
NNL
. Each element g
ijk
is
the weight of the edge between genesi andj ink
th
RNA-seq dataset. Correspondingly, a
collection ofL exon co-splicing networks with the sameM exon nodes can be modeled
as the tensorE = (e
ijk
)
MML
.
As shown in Figure 3.1, an FCC can be described with a tensor model as follows:
its gene cluster and exon cluster correspond to a heavy region of the tensorG andE,
respectively, which can be called as the heavy subtensor. Thus, the FCC can be found
by reordering the tensorsG andE simultaneously such that the heaviest subtensors move
33
Splicing
Machinery
Exon co-splicing
networks
Transcriptional
Machinery
Gene co-expression
networks
1,2,3
A,B,
C,D
1
2
3
A
B
C
D Dataset 1 Dataset 2 Dataset 3
Dataset 1 Dataset 2 Dataset 3
1
2
3
1
2
3
A
B
C
D
A
B
C
D
1
2
3
A
B
C
D
Exon tensor E
Gene tensor G
Figure 3.1: Illustration of the frequent coupled cluster (FCC). In a collection of three
paired gene co-expression and exon co-splicing networks, a subset of genesf1,2,3g are
densely interconnected and their exonsfA,B,C,Dg are also densely interconnected. These
subsets form an FCC that represents the coupled transcription-splicing module. The gene
and exon clusters intuitively corresponds to the heavy subtensors inG andE.
34
toward the top-left corner, while their constituent genes and exons keep “belong-to” rela-
tionships. The subtensor in the top-left corner can then be expanded outwards from the
left-top corner until the FCC reaches its optimal size.
3.3 Methodology
3.3.1 Problem formulation
A naive approach of identifying an FCC is to first identify gene dense subgraphs which
frequently occur across multiple co-expression networks, then to identify a subset of those
member genes whose exons (or subset of exons) form dense subgraphs frequently occur-
ring in the pair-matched co-splicing networks. However, because a subgraph of a dense
subgraph may not be dense, we cannot simultaneously guarantee the heaviness of such
derived gene and exon subgraphs, therefore the naive approach is not applicable. In this
work, we proposed a novel computational method to identify FCCs.
GivenL RNA-seq datasets from which we constructedL gene co-expression networks
and their paired L exon co-splicing networks, the whole system can be represented as
two 3
rd
-order tensors,G = (g
ijk
)
NNL
,E = (e
ijk
)
MML
and a binary relation matrix
between N genes and M exons, R = (r
ij
)
NM
. Each element g
ijk
(e
ijk
) in the tensor
is the non-negative weight of the edge between genes (exons)i andj in thek
th
gene co-
expression (exon co-splicing) network. Note that g
iik
= 0 (e
iik
= 0) and g
ijk
= g
jik
(e
ijk
=e
jik
) for anyi;j;k, because all networks are assumed to be undirected and without
cycles. As each exon belongs to only one gene and a gene has at least one exon, the
relation matrix R has special characteristics: (1) r
ij
= 1 when gene i contains exon j,
35
r
ij
= 0 otherwise; and (2) each column vector has only one non-zero and the rest are zero.
Therefore,R is very sparse, with exactlyM non-zeros.
An FCC is defined as follows: “a set of genesG that are frequently co-expressed in
a set of datasets D (forming a dense subgraph in multiple gene networks), and a set of
their exonsE that are co-spliced in the same set of datasets (forming a dense subgraph in
multiple exon networks)”. We formulate the problem of identifying an FCC as follows,
Definition of an FCC.
An FCC consists of a set of genes G, a set of exons E and a set of datasets D that
satisfy the following two criteria:
Heavy Subgraph Criterion: Genes ofG are densely connected to each other in each
dataset ofD (i.e., active dataset); and exons ofE are densely connected to each
other in each of the same set of datasets.
Relation Criterion: Each gene ofG contains> 1 exons ofE, whereas each exon of
E is contained by6 1 genes ofG.
Any FCC can be described by three membership vectors: (i) the gene membership
vector x = (x
1
;:::;x
N
)
T
, wherex
i
= 1 if genei belongs to the gene setG andx
i
= 0
otherwise; (ii) the exon membership vector y = (y
1
;:::;y
M
)
T
, wherey
j
= 1 if exonj
belongs to the exon setE andy
j
= 0 otherwise; and (iii) the dataset membership vector
w = (w
1
;:::;w
L
)
T
, wherew
k
= 1 if the cluster appears in the datasetk (we term such a
dataset as an “active dataset” for the cluster) andw
k
= 0 otherwise.
Using these three membership vectors, the “dense subgraph criterion” defined above
can be formulated by maximizing the “density” functions of the gene and exon sub-
graphs in the active datasets D. The “density” functions of gene subgraph H
G
(x;w) =
36
P
N
i=1
P
N
j=1
P
L
k=1
g
ijk
x
i
x
j
w
k
is the summed weight of all edges of the gene sub-
graph in its active datasets. The “density” functions of exon subgraph H
E
(y;w) =
P
M
i=1
P
M
j=1
P
L
k=1
e
ijk
y
i
y
j
w
k
is the summed weight of all edges of the exon subgraph in
its active datasetsD. Only the weights of edgesx
i
= x
j
= 1 (y
i
= y
j
= 1) are counted
inH
G
(H
E
). Therefore,H
G
(x;w) measures the “density” of a gene subgraphG, defined
by the gene membership vector x and active datasets defined by the dataset membership
vectorw. This is similarly applied toH
E
(y;w).
The relation criterion between genes and exons can be formulated by using the idea of
the linear assignment problem formulation in operations research [106]. Let the relation
variablesZ = (z
ij
)
NM
indicate the matching between genes and exons, wherez
ij
= 1 if
genei contains exonj and both belong to the FCC and 0 otherwise. Then relation criterion
can be formulated by maximizing the objective functionO
R
(Z) =
P
r
ij
6=0
z
ij
r
ij
x
i
y
j
with
the constraint
P
M
j=1
z
ij
6K
2
for alli = 1;:::;N, whereK
2
is the number of exons in the
cluster). Therefore, maximizingO
R
(Z) is a process of finding genes and exons that have
one-to-many relationships and simultaneously have large associated weightsx
i
andy
j
.
With the above formulations of each individual criterion, identifying an FCC can be
formulated as a discrete combinatorial optimization problem: among all FCCs of fixed
size (K
1
member genes,K
2
member exons, andK
3
member datasets), we look for the pat-
tern that can achieve the maximum of the combined objective function: O(x;y;w;Z) =
H
G
(x;w)+H
E
(y;w)+O
R
(x;y;Z), where;> 0 are constant weights of individual
objective function.
37
3.3.2 Continuous optimization
The problem formulated is an integer programming problem of looking for the binary
membership vectorsx,y,w and the binary matrixZ that jointly maximize the combined
objective function under the constraints
P
N
i=1
x
i
= K
1
,
P
M
j=1
y
j
= K
2
, and
P
L
k=1
w
k
=
K
3
. However, there are several major drawbacks to this discrete formulation. The first
is parameter dependence: as withK-densest subgraph problems, the size parametersK
1
,
K
2
, andK
3
are difficult for users to provide and control. The second is high computational
complexity: the task is NP-hard, and therefore is not feasible even for a small number of
datasets.
To address those two drawbacks, we formulated a continuous optimization problem
with the same objective by relaxing integer constraints to continuous constraints. That is,
we looked for non-negative real vectorsx,y,w and the non-negative real sparse matrixZ
that jointly maximizeO(x;y;w;Z). This optimization problem is formally formulated as
max
x;y;w2R
+ O(x;y;w;Z) =H
G
(x;w) +H
E
(y;w) +O
R
(x;y;Z)
subject to
8
>
>
>
>
>
>
>
<
>
>
>
>
>
>
>
:
Constraint I: kxk
f
= 1; 16f < 2
Constraint II: kyk
g
= 1; 16g< 2
Constraint III: kwk
h
= 1; h> 2
Constraint IV:
P
M
j=1
z
ij
6 1; i = 1; 2;:::;N
(3.1)
whereR
+
is a non-negative real space of vectors or matrix, andkxk
p
= (
P
i
jx
i
j
p
)
1=p
is
the L
p
norm of the vector x. The L
p
= 1 (1 6 p < 2) norm constraint of genes and
exons is a well-known convex sparse coding scheme [107, 108] that can lead to the sparse
38
solution where only a few elements of the gene (exon) membership vector are significantly
different from zero and therefore their corresponding genes (exons) can be selected into the
coupled cluster. On the contrary, theL
h
= 1 (h > 2) norm constraints of datasets gives
a “smooth” solution where the elements of the optimized vector w are approximately
equal and therefore lead to the identified coupled cluster occurring in as many datasets
as possible. z
ij
measures coupling strength between gene i and exon j. The problem
formulation is similar to the generalization of the recurrent heavy subgraph (RHS) problem
defined in [107]. However, requiring mapping relationships between gene subgraph and
exon subgraph by forcing a relation criterion makes it more complicated than the RHS
problem. Such nature of simultaneously analyzing two different tensors makes it distinct
from the problem defined in the single tensor.
Eq. (3.1) defines a tensor-based optimization formulation for the problem of identi-
fying FCCs. By solving Eq. (3.1), users can easily identify the top-ranking genes/exons
(after sorting elements x/y in non-increasing order) and the top-ranking datasets (after
sorting elements ofw in non-increasing order) contributing to the objective function. After
rearranging the tensors in this manner, the optimum FCC occupies the corner of the 3D
tensorG andE. We then mask the edges contained in this cluster in both gene and exon
networks with zeros and optimize Eq. (3.1) again to search for the next coupled cluster.
The constant and are weights for different sub-objective functions. To fully exploit
such capability, we collect all FCCs discovered in all combinations
1
of different and,
1
In practice, we used the following combinations: is any of the six predefined values
f0:01;0:05;0:1;1;10;50g and is any of the five predefined valuesf10;20;50;100;500g.
39
then remove duplicates. Details of the optimization procedure and pattern extraction pro-
cess are presented in the next section. Our problem formulation requires three parameters
f,g andh which can be fixed through simulation study.
We performed simulation studies to determine suitable values for the parameters f,
g, andh by applying our tensor method to collections of random weighted networks. In
subsets of these networks, we randomly placed FCCs of varying size, occurrence, and
heaviness. We then tested different combinations off,g, andh, and adopted the combi-
nation (f =g = 1,h = 10) that led to the discovery of the most FCCs.
3.3.3 Optimization algorithm and pattern extraction
We derived an iterative algorithm to maximize the objective function with the con-
straints defined in Eq.(3.1). This algorithm repeatedly applies the four updating steps until
the objective function converges to a fixed point, yielding the solution to Eq.(3.1). Each
step of this procedure individually updates x (y or w or Z) while fixing the other three
variables.
40
Updating scheme for solving Eq. (3.1).
Step 1: x
i
x
i
P
j;k
g
ijk
x
j
w
k
+[(ZR)y]
i
P
i;j;k
g
ijk
x
i
x
j
w
k
+x
T
(ZR)y
!
1=f
i = 1;:::;N
Step 2: y
j
y
j
P
i;k
e
ijk
y
i
w
k
+[(ZR)
T
x]
j
P
i;j;k
e
ijk
y
i
y
j
w
k
+x
T
(ZR)y
!
1=g
j = 1;:::;M
Step 3: w
k
w
k
P
i;j
g
ijk
x
i
x
j
+
P
i;j
e
ijk
y
i
y
j
P
i;j;k
g
ijk
x
i
x
j
w
k
+
P
i;j;k
e
ijk
y
i
y
j
w
k
!
1=h
k = 1;:::;L
Step 4: z
ij
z
ij
[R (xy
T
)]
ij
[R (xy
T
)]
T
i
Z
i
for alli;j whose [R]
ij
6= 0
Note: [x]
i
is thei
th
element of the vectorx. [X]
ij
is the element of the matrixX at thei
th
row and the
j
th
column,[X]
i
denotes thei
th
row vector of the matrixX, andZ =XY is an element-wise matrix
multiplication wherez
ij
=x
ij
y
ij
.
As the relation matrix R is extremely sparse (with only M non-zeros), all element-
wise matrix multiplication operations involvingR andZ can be implemented efficiently
by only treating z
ij
whose corresponding r
ij
6= 0 and simply assigning the rest z
ij
= 0
whose correspondingr
ij
= 0. Thus, the complexity of the algorithm is linear with respect
to the total number of edges in two tensorsG andE.
According to the definition of three aforementioned membership vectorsx,y, andw,
FCCs can be intuitively obtained by including those top-ranking genes, exons, and datasets
with large membership values. In this work, we empirically define an FCC extracted from
these top-ranking genes, exons and datasets as a triplet (G
0
;E
0
;D
0
), i.e., a set of> 5 genes
G
0
, a set of> 5 exons E
0
and a set of> 2 active datasets D
0
satisfying: (1) each gene
subgraph formed byG
0
and exon subgraph formed byE
0
in each active dataset ofD
0
has an
average edge weight (or so-called “density”)> a predefined threshold (i.e.,> 0:4) and (2)
41
the fraction of host genes ofE
0
that are included inG
0
, so-calledcoverage
gene
(G
0
;E
0
) =
jG
0
\hostgene(E
0
)j
jG
0
j
> a predefined threshold (i.e.,> 0:7) and the fraction of exons ofG
0
which
are included inE
0
, so-calledcoverage
exon
(G
0
;E
0
) =
jE
0
\exon(G
0
)j
jE
0
j
> a predefined threshold
(i.e.,> 0:7). The first criterion gives a certain degree of how dense and frequent the cluster
should be, and the second criterion gives a measure of what degree of “coupling” should
exist between a gene set and an exon set. Two examples of coupled gene set and exon
set withcoverage
gene
> 0:6 andcoverage
exon
> 0:6 were shown in Figure 3.2. Based
on this definition, we can generate one or more FCCs that are highly overlapping with
each other from combinations of different sizes of top-ranking genes, exons, and datasets
which satisfy above two criteria. Therefore, we call a group of FCCs derived from the
same membership vectorsx,y, andw asafamilyofFCCs.
1
2
3
A
B
C
D
1
2
3
A
B
C
D
(A)
Exon Subgraph Gene Subgraph Exon Subgraph Gene Subgraph
• Gene subgraph density: 1
• Exon subgraph density: 1
• Gene Coverage: 100% = 3/3
• Exon Coverage: 100% = 4/4
• Gene subgraph density: 1
• Exon subgraph density: 0.83=5/6
• Gene Coverage: 67% = 2/3=
• Exon Coverage: 75% = 3/4=
(B)
|{1,3} ∩ {1,2,3}|
|{1,2,3}|
|{A,B,D} ∩ {A,B,C,D}|
|{A,B,C,D}|
Figure 3.2: Illustration of the coupling degree between a gene set and an exon set.
Blue edges between genes and exons indicate that there are relationships between them.
Red edges within genes or exons indicate that they are co-expressed or co-spliced. (A)
Perfect coupling; (B) Coupling withcoverage
gene
> 0:6 andcoverage
exon
> 0:6.
42
3.4 RNA-seq data preprocessing
From the Sequence Read Archive (SRA) of NCBI, we selected all human RNA-seq
datasets, each of which contains at least six samples (the minimum for robust correlation
estimation). This results in a total of 38 datasets. For each dataset, we used the Tophat [60]
tool to map short reads to the hg18 reference genome and applied the transcript assembly
tool Cufflinks [61] to estimate expressions for all transcripts with known UCSC transcript
annotations [62]. We calculated the inclusion rate of each cassette exon in every sample
as the ratio between its expression (i.e., sum of FPKM over all transcripts that cover the
exon) and the expression of the host gene (i.e., sum of FPKM over all transcripts of the
gene). Then for each dataset, we constructed a weighted gene co-expression network, in
which nodes represent genes and edges are weighted by expression correlations between
two genes, and a weighted exon co-splicing network, in which nodes represent cassette
exons and edge weights represent correlations between the inclusion rates of two cassette
exons. We then performed network normalization to make the edge weights comparable
across datasets and used non-uniform edge sampling in all networks for fast computation.
For detailed methodology, please refer to [86].
3.5 Results
After we applied our method to 38 paired gene co-expression networks and exon co-
splicing networks derived from RNA-seq datasets, we identified 8,667 FCC families con-
taining the total 43,580 FCCs. Each FCC contains > 5 member genes, > 5 member
exons, appears in> 2 RNA-seq datasets, has a “density”> 0:4, “coverage
gene
> 0:7”
and “coverage
exon
> 0:7”. The average gene/exon size of these patterns are 12.61/12.64
43
and the average recurrence is 2.04. To assess statistical significance of the identified FCCs,
we applied our method to 38 paired random networks (each of which is generated from
one of the 38 paired weighted networks by the edge randomization method
2
) to identify
FCCs with the same properties of size, heaviness, and coupling coverage. We repeated
this process 10 times, and each time only 0 3 FCC families (average 0:9 families) were
identified. This is an extremely low value compared to 8,667 FCC families discovered in
real data that indicates the significance of our results.
To assess the biological significance of the identified FCCs, we evaluated the extent
to which these FCCs represent functional modules, protein complexes, transcriptional and
splicing modules. Since FCCs within the same family are highly overlapped, we treated a
family of FCCs as a unit in these following analyses. For example, we denoted a family of
FCC to be functionally homogenous if any of its containing FCCs is significantly enriched
in genes of the certain functional category.
3.5.1 Frequent coupled clusters are likely to represent functional
modules and protein complexes
We evaluated the functional homogeneity of the merged genes and host genes of exons
in a frequent coupling clusters using specific Gene Ontology (GO) terms [110]. To ensure
the specificity of GO terms, we filtered out those general terms associated with > 300
genes. If the gene set of any FCCs within a family is significantly enriched in a GO
term with a q-value < 0:05 (after false discovery rate multiple testing correction), we
2
Given a real weighted network, the random weighted network is generated by a random redistribution
of the actual weights on the random unweighted network with the same number of edges which can be
generated by the widely used degree-preserving randomization procedure [109].
44
declared this family as functionally homogeneous. According to this definition, 50.4% of
the FCC families were functionally homogenous. In an ensemble of randomly generated
FCC families with the same size distribution as our patterns, only 5.7% of the families were
functionally homogenous. This enrichment fold ratio of 8.8 between real and random pat-
terns demonstrated the strong functional relevance of the identified patterns. Furthermore,
the enrichment fold ratio increases dramatically with increasing density and recurrence
of FCCs (shown in Figure 3.3). For example, when the FCCs are required to recur in
at least 4 datasets, their enrichment fold ratio compared to random patterns increased to
48.0, highlighting the benefit of integrative analysis of multiple RNA-seq datasets. When
the FCCs are required to have density at least 0.6, their enrichment fold ratio raises to 65.0
compared with lower heaviness threshold.
minimum active datasets minimum heaviness
(A) (B)
Fold ratio
Fold ratio
0
10
20
30
40
50
2 3 4
0
10
20
30
40
50
60
70
0.4 0.5 0.6
Figure 3.3: FCCs are highly likely to represent functional modules. The enrichment
fold ratio, calculated by dividing the percentage of functionally homogenous FCC families
by the percentage of functionally homogenous random patterns, increases significantly
with the recurrence and denisty of the FCCs. (A) The minimum number of active datasets
in which FCCs recur is varied while fixing the minimum density as 0.4; (B) The minimum
density of FCCs is varied while fixing the minimum number of active datasets as 2.
45
Functional modules represented by FCCs covered a large number of categories related
to post-transcriptional processing, such as “regulation of mRNA stability,” “posttranscrip-
tional regulation of gene expression,” “RNA polyadenylation,” “RNA splicing,” “mRNA
processing,” “RNA transport,” “establishment of RNA localization,” “mRNA 3’-end pro-
cessing,” “nucleic acid transport” and “RNA transport,” etc. During expression of protein-
coding genes, pre-mRNAs are transcribed in the nucleus and undergo several process-
ing steps, including capping, splicing, 3’-end processing and polyadenylation, prior to
be transported to the cytoplasm for translation. The coupling between those individual
steps is believed to enable the proofreading and better regulation of gene expression [111].
Other functional categories over-represented by FCCs include “nucleosome organiza-
tion,” linking to the recent evidence that the transcription elongation, an important step
in transcription-splicing coupling, can be regulated by nucleosome [112].
Similar results were obtained by using the Comprehensive Resource of Mammalian
protein complexes (CORUM) database [113] to assess the association between FCCs and
known protein complexes. The merged sets of genes and exon-host-genes of 78.6% of
FCC families are significantly enriched in genes from the same complexes with aq-value
< 0:05, compared to only 5.05% of randomly generated patterns with the same size dis-
tribution. The enrichment fold ratio dramatically increased with increasing heaviness and
recurrence of the FCCs (Figure 3.4).
Protein complexes represented by FCCs generally fall into the following categories:
(1) Splicing regulation, e.g. SNW1 complex, CDC5L complex, C complex spliceosome;
(2) Transcription regulation, e.g. SNW1 complex, TLE1 complex, H2AX complex; (3)
microRNA processing, e.g. Large Drosha complex, DGCR8 multiprotein complex; (4)
46
minimum active datasets minimum heaviness
(A) (B)
Fold ratio
Fold ratio
0
100
200
300
400
500
600
700
2 3 4
0
40
80
120
160
0.4 0.5 0.6
Figure 3.4: FCCs are highly likely to represent protein complexes. The enrichment fold
ratio increases significantly with the recurrence and density of the FCCs. (A) The mini-
mum number of active datasets in which FCCs recur is varied while fixing the minimum
density as 0.4; (B) The minimum density of FCCs is varied while fixing the minimum
number of active datasets as 2.
DNA or chromatin processing, e.g. DNA-PK-Ku-eIF2-NF90-NF45 complex, Emerin com-
plex; (5) Translation, e.g. eif complex, Nop56p-associated pre-rRNA complex; and (6)
Protein folding, e.g. CCT micro-complex. Interestingly, the functions of all of those
complexes are associated with the central-dogma-process. Some of those complexes have
been experimentally confirmed to couple splicing and transcription. For example, the
SNW1 complex contains components of the spliceosome, and it is known to couple vita-
min D receptor-mediated transcription and RNA splicing [114]. Since there has been little
research on such coupling, we expect that more protein complexes will be validated to be
involved in functional coupling between transcription and splicing.
47
3.5.2 Frequent coupled clusters are likely to represent transcription
and splicing modules
Because the gene set of an FCC are co-expressed in multiple datasets generated under
different conditions, it is likely to represent a transcription module. To assess this possi-
bility, we used the 191 ChIP-seq profiles generated by the Encyclopedia of DNA Elements
(ENCODE) consortium [72]. If the gene set of an FCC is found to be highly enriched in
the targets for a regulatory factor, then this factor is likely to actively regulate genes in this
FCC, and we declare the family containing this FCC to be “transcriptionally homogenous”.
According to this definition, 61.2% of FCC families are transcriptionally homogenous with
an enrichmentq-value< 0:05, compared to 8.9% of the randomly generated patterns with
the same size distribution. As shown in Figure 3.5, with the increasing density and recur-
rence of FCCs, the enrichment fold ratio between FCCs and random patterns dramatically
increases. The five most frequently enriched regulators were YY1, E2F4, c-Myc, MAX, and
TAF, all of which have been implicated in cancer pathogenesis or progression [115–119].
Since the exon set of an FCC has highly correlated inclusion rate profiles across differ-
ent experimental conditions, they are likely to be co-regulated by the same splicing factors.
We collected the experimental RNA target motifs (2,220 RNA binding sites) of 62 splic-
ing factors from the SpliceAid2 database [65]. For each exon in an FCC, we obtained the
internal exon region and its 100bp flanking intron region which are enriched in the motifs
of those 62 splicing proteins by performing BLAST search (E-score<0.05). If the exon
set of any FCC within a family was significantly (q-value< 0:05) enriched in the targets
of a splicing factor, we consider this family to be “splicing homogenous”. Although the
collection of known splicing motifs is very limited, we still observed that 8.9% of the FCC
families are splicing homogenous, compared to 5.1% of randomly generated patterns with
48
minimum active datasets minimum heaviness
(A) (B)
Fold ratio
Fold ratio
0
10
20
30
40
2 3 4
0
10
20
30
40
0.4 0.5 0.6
Figure 3.5: FCCs are highly likely to represent transcription modules. The enrich-
ment fold ratio increases significantly with the recurrence and density of the FCCs. (A)
The minimum number of active datasets in which FCCs recur is varied while fixing the
minimum density as 0.4; (B) The minimum density of FCCs is varied while fixing the
minimum number of active datasets as 2.
the same size distribution. Again, as shown in Figure 3.6, with the increasing density
and recurrence of the FCCs, the enrichment fold ratio increases as well. The five most
frequently enriched splicing regulators are PSF, hnRNP-D, hnRNP-C1, SLM-2, and HuB.
minimum active datasets minimum heaviness
(A) (B)
fold ratio
fold ratio
0
1
2
3
4
5
2 3 4
0
1
2
3
4
0.4 0.5 0.6
Figure 3.6: FCCs are likely to represent splicing modules. The enrichment fold ratio
increases significantly with the recurrence and denisty of the FCCs. (A) The minimum
number of active datasets in which FCCs recur is varied while fixing the minimum density
as 0.4; (B) The minimum density of FCCs is varied while fixing the minimum number of
active datasets as 2.
49
3.5.3 Exploring the mechanisms of the transcription and splicing cou-
pling
Recently, individual case studies suggested that the strong connection between tran-
scription and splicing might be due to (a) the coupled recruitment of transcription and
splicing factors and (b) the kinetic elongation control by RNA Pol II [97]. However,
global analyses of these hypotheses have not been conducted. Our catalogues of cou-
pled transcription-splicing modules provide a unique opportunity to systematically assess
the mechanism of (a). In particular, we evaluated the hypothesis that the functionally
coupled recruitment of both types of factors could be mediated by the direct or indirect
protein-protein interactions. This hypothesis is supported by the following biological evi-
dence: (1) the presence of an intron or simply a 5’ splice site immediately downstream
from a promoter, i.e. the short distance between the splicing and transcription factors,
greatly enhances transcription both in mammalian and yeast genes [97]; and (2) the human
spliceosome contains at least 30 proteins with known or putative roles involved in tran-
scription, indicating direct interactions between the transcription and splicing [120, 121].
We used protein-protein interaction (PPI) repository BioGRID [122] to examine the
PPIs between enriched transcription and splicing factors in coupled module families. We
consider not only the direct PPIs between two factors, but also the indirect interactions
through a mediator protein (one-step interaction), because experiments showed that tran-
scription and splicing factors could be recruited by the same proteins [123,124]. In order to
have a broad coverage of transcription factors, we used 109 human transcription factors in
ENCODE [72] and JASPAR database [125], and 10,278 DNA binding motifs were down-
loaded from [126]. We performed BLAST search (E-score<0.05) in 1000bp regions of the
50
transcription start sites of all genes, and then used the hyper-geometric test (q-value<0.05)
to identify enriched transcription factors for genes in an FCC. The putative splicing factors
for each FCC were identified based on the description in Section 3.5.2. The total number
of PPIs between transcription and splicing factors within the same FCC families (includ-
ing direct and one-hop interactions) was 105, compared to only 14.8 in random families
with the same size of gene and exon sets, with the fold ratio 7.1. Our results supported that
PPI mediated association between transcription factors and splicing factors are probably
an important mechanism of transcription-splicing coupling.
An interesting example is the FCC family associated with the splicing regulator PSF
and the transcription regulator TAT-SF1. PSF is a part of the human spliceosome and
essential in RNA splicing [127, 128]. TAT-SF1 functions as a general transcription fac-
tor playing a role in the process of transcriptional elongation [129]. Based on the
kinetic model that transcription rate affects splicing outcomes during transcription elonga-
tion [97], previous studies suggested that TAT-SF1 couple with PSF for co-transcriptional
splicing. In fact, the spliceosome has been shown to be associated with the transcription
elongation complexes [130], and TAT-SF1 was also characterized to be part of the spliceo-
some [121]. We found that TAT-SF1 and PSF both interact with the protein WBP4, a
general spliceosomal protein that may play a role in cross-intron bridging of U1 and U2
snRNPs [131]. Thus, we hypothesized that the elongation factor TAT-SF1 may function-
ally couple with PSF through the mediator WBP4 in spliceosome.
As another example, the co-spliced exons of an FCC family were enriched in the RNA
binding motif of the splicing factor hnRNP D, and the co-expressed genes of the same FCC
family is over-represented by the DNA binding motif of the transcription factor TDP43.
hnRNP D belongs to heterogeneous nuclear ribonucleoprotein family, which generally
51
suppresses splicing through binding to an exonic splicing silencer [132]. TDP43 was orig-
inally identified as a transcriptional repressor that binds to chromosomally integrated TAR
DNA and represses HIV-1 transcription [133]. Interestingly, TDP43 was later shown to
also have splicing regulatory functions, wherein the recruitment of TDP43 inhibits exon
recognition [134]. We suspect that TDP43 may couple with hnRNP D through hnRNP A1,
because there is a direct protein-protein interaction between hnRNP A1 and hnRNP
D/TDP43 [122], whereas the TDP43-hnRNP A1 interaction is suggested to be mediated
by C-terminal region of TDP43 [135]. Furthermore, experiments have shown that TDP43,
hnRNP D and hnRNP A1 are all shuttled between nucleus and cytoplasm [136,137]. Thus,
it is very possible that TDP43 and hnRNP couple in the transcription-splicing process via
hnRNP A1.
3.6 Conclusions
In this work, we developed a novel tensor-based approach to identify frequent cou-
pled modules in many two-layered networks. This is the first computational method for
pattern mining across multi-layered networks. Advancement of genomic technologies has
resulted in a rapid accumulation of multi-dimensional genomic data that simultaneously
profile molecular activities at different levels of the same biological samples, providing
unique opportunities and challenges to study the complex coupling mechanisms among
multiple layers of biological networks. By using an effective tensor-based problem formu-
lation, our approach can be easily extended to mineK-layered (K > 2) networks and can
conveniently incorporate additional pattern constraints.
52
We applied our method to study the coupling between transcription and splicing, an
area that recently attracted much attention of experimental biologists. Our comprehensive
catalogue of coupled transcription-splicing modules provides a unique resource to globally
examine the scope, functions, and mechanisms of the coupling. Although the current
collection of transcription and splicing factor binding information is limited, we were
able to observe the strongly coupled transcription-splicing modules mainly participate in
processes that need strict and precise regulation, e.g. genetic information transfer from
DNAs to proteins. We also showed that functional coupling between transcription and
splicing are potentially mediated by transcription factor(s) - splicing factor(s) interactions.
With the continued growth of RNA-seq data repositories and regulatory information, we
expect that our method, using only public data, can provide more interesting hypotheses
for further targeted experimental studies.
53
Chapter 4
Functional Implications of Spatial
Genome Organization
4.1 Introduction
In the last decade, the development of chromosome conformation capture (3C) and
related technologies followed by high-throughput sequencing have significantly expanded
our understanding of spatial organization of the genome. Results have shown that chromo-
somes are highly organized, occupying preferred locations in the nucleus [33–36]. Further-
more, accumulating evidence indicated that chromosome organization is highly correlated
with chromatin states and transcriptional regulation [37–41]. In order to fully understand
genome function, it is important to decipher associations between spatial genome orga-
nization and transcriptional/epigenetic regulation. In the following sections, we review
some key lines of evidence for these associations.
4.1.1 Topological domains are functionally coupled with epigenetic
and transcriptional regulation
Hi-C and its variants based on all-versus-all chromatin interactions provide a powerful
tool to investigate genome organization and function. In the first genome-wide chromatin
54
interactions study using Hi-C technology [22], it was observed that genome organiza-
tion is characterized by the spatial segregation of open and closed chromatin to form two
types of genomic compartments, where one compartment is associated with more accessi-
ble, actively transcribed chromatin than the other. Independent studies on Drosophila and
mammalian genomes [43,138–140] have confirmed the existence of two nuclear compart-
ments, and further extended the compartment to “physical domain”, “topological domain”,
or “topological associated domain (TAD)”. A higher resolution chromatin interaction map
was constructed for the Drosophila genome [140], and it was reported that the whole
genome can be partitioned into physical domains: chromatin interactions are frequent
within the domains but much less between domains. Particularly, physical domains can be
classified by chromatin states, and domain boundaries are enriched with insulator binding
sites. Furthermore, it was shown that genome organization can be demarcated by inter-
acting domains: contacts between inactive domains are generally confined to their chro-
mosomal territory and show low probability for inter-chromosomal contacts, while active
domains can reach out their chromosomal territories to form inter-chromosomal contacts
with other active cluster domains at low specificity.
In a parallel study of chromatin interactions in human and mouse genomes [138], it
was demonstrated that topological domain structures are stable across different cell types,
while the internal regions within each domain can be dynamic, potentially due to cell
type specific regulatory events, which was also supported in [141]. Nora and collabora-
tors examined the detailed spatial organization of a 4.5 Mb region containing a mouse
transcript Xist, and showed that coordinately regulated loci are located in the same TAD.
This result suggests that the domain potentially serves as a genomic block to integrate a
cis-regulating network, where common cis-regulatory elements are shared [139]. In FISH
55
experiments, it has also been shown that pairs of loci in the same TAD are closer than
pairs of loci from different TADs. In another work, genome-wide TADs were derived
from chromatin interactions in the mouse genome during pro-B cell developments, and it
was observed that genes located in the same TAD have a greater tendency to be coordi-
nately expressed than genes located in different TADs [142]. A recent study of integrated
mouse cis-regulatory sequences [143] reported that enhancer-promoter units uncovered by
epigenetic marks significantly overlap with TADs, providing further evidence that TADs
are likely to function in coordinating gene expression. Therefore, several lines of evidence
suggest that TADs are associated with epigenetic regulation, and that physical clustering
within TADs potentially plays an important role in gene co-regulation.
4.1.2 Association of promoters-enhancers interactions and chromo-
some organization
A fundamental question in transcriptional regulation study is how genes and regulatory
elements are organized and coordinated. It has long been hypothesized that enhancers can
be brought into the proximity of target genes by the spatial organization of chromosomes,
even when separated by long linear distance. Before 3C and its variants were applied to
capture the spatial organization of chromosomes, enhancers were commonly assigned to
the nearest gene in linear distance, which may not be accurate. According to [143], the
correlation between enhancer signal and promoter signal based on the nearest gene model
is only slightly higher than that between randomly paired promoter and enhancer, and 34%
of enhancer/promoter pairs in the nearest gene model were even negatively correlated. 3C
and related technologies provide a systematic and more precise way to investigate the
56
communication between promoters and enhancers via DNA loops. One well characterized
example is that-globin genes are physically associated with hypersensitive sites in locus
control region [14]. Many other long-range looping events have been identified, such as
-globin locus [11], T(H)2 regulatory region [12], and Abdominal-B locus [13].
In a large-scale investigation, Sanyal and collaborators used 5C method (3C-carbon
copy) to map interactions between promoters and distal elements in ENCODE pilot project
regions [144]. They reported that loop interactions are significantly located in distal ele-
ments that are enriched with insulator CTCF, actively transcribed, and enhancer chromatin
states, but depleted for the repressed chromatin state. It was estimated that only 7% of
looping interactions are with the nearest gene, further indicating that linear distance is not
accurate for enhancer assignment. Interestingly, the long-range interaction landscape was
shown as asymmetric, with interactions of enhancers peaking around 120kb upstream of
the promoters, which may reflect a potential preference of DNA looping to enhancers that
are located in intergenic non-transcribed regions.
Shen and collaborators [143] further systematically examined the correlations between
promoter-enhancer interactions and topological domains across the entire mouse genome,
and discovered several interesting results. They found that the mouse genome was parti-
tioned into enhancer-promoter units (EPUs) based on highly correlated of enhancer sig-
nal and promoter signal, and that interactions between enhancer-promoter pairs within
the same EPUs occurred more frequently than interactions between enhancer-promoter
pairs with the same genomic distance but across different EPUs or by random chance.
They also showed that EPUs in the mouse genome are highly conservative with putative
human enhancer-promoter pairs, suggesting that EPUs may serve as functional domains
57
for gene coordination. Another interesting result from the above study is that tissue-
specific CTCF binding sites significantly overlap with enhancers, whereas ubiquitous
CTCF binding sites significantly overlap with promoters, suggesting that a fraction of
the CTCF binding sites may function through promoters and enhancers. Besides CTCF-
mediated promoter-enhancer interactions reported in [15, 145, 146], it has been reported
that in the case of-globin locus, multiple protein factors are also required to form chro-
matin loops to mediate promoter-enhancer interactions, including GATA1, FOG1, EKLF
and Nli/Ldb1 [147–149]. However it is unknown how protein factors cooperate to mediate
enhancer-promoter chromatin loops.
4.1.3 Association of transcription factories and chromosome organi-
zation
Besides the fact that long range interactions can bring promoters and enhancers in
proximity for transcriptional regulation, it has been observed that functionally associ-
ated genes can be brought to a certain subnuclear compartment by chromosomal inter-
actions, such as tRNA genes in yeast [16], silenced Hox genes in Drosophila [150], and
Myc / Igh in mouse [151]. This raises a possibility that co-expressed genes are orga-
nized into transcription factories containing RNA polymerase II and other components for
transcription [37,38,152]. Using RNAP II associated Paired-End-Tag sequencing technol-
ogy (ChIA-PET), Li and collaborators reported that most genes with promoter-promoter
interactions are active and transcribed cooperatively, and that promoters in transcription
factories can cooperate to regulate the activity of other promoters with which they inter-
act [141]. It was demonstrated that in fission yeast genome, highly transcribed genes,
58
co-regulated genes during cell cycle, and genes from particular gene ontology groups tend
to co-localize in spatial genome structure [153], supporting the idea that functionally asso-
ciated genes can be organized into transcription factories for co-regulation.
It was also reported that genes can be dynamically recruited to transcription facto-
ries, facilitating transcriptional control [151]. Schoenfelder and collaborators showed that
mouse globin genes associate with hundreds of other transcribed genes, and Klf1-regulated
genes preferentially cluster at a limited number of transcription factories containing high
levels of Klf1, suggesting that individual factories could become specialized hotspots for
the transcription of a preferential network of genes [154]. Interestingly, in erythroid
cells lacking Klf1, Klf1-regulated genes showed reduced association with transcription
factories, and chromosomal associations between Klf1-regulated genes were specifically
disrupted. This implies that the specific transcription factor-enriched microenvironment
may influence the establishment of three-dimensional positioning of active genes and
their chromosomal associations in the nucleus [154]. Recently, Ben-Elazar and collab-
orators applied an interpolation and embedding based method to systematically examine
whether co-regulated genes are co-localized from Saccharomyces cerevisiae Hi-C data.
They reported that co-regulated genes are preferentially clustered in space, after control-
ling for effects of linear co-localization along the genome. This phenomenon was signif-
icant for 64 out of 117 transcription factors, and transcription factors with spatially co-
localized targets were more strongly expressed than those whose targets are not spatially
clustered [155].
59
4.1.4 Association of cell type specific gene expression and cell type
specific chromatin interactions
It has been proposed that cell type specific gene expression can be potentially explained
by cell specific chromatin interactions and conformations [156]. Although topological
domains are reported to be quite similar in different cell types, a fraction of chromosome
regions show distinct pattern of chromatin contacts in different cell types, largely associ-
ated with cell type specific gene expression [138]. It has been reported that B cell devel-
opmental regulator Ebf1 locus is strongly associated with nuclear lamina in pre-pro-B
cells but relocate away from the lamina in committed pro-B cells [142], indicating devel-
opmental regulator repositioning and accompanying chromosome conformation changes
may induce cell type specific gene expression. It has been proposed that E2A may play
important roles in forming cell-type specific chromatin interactions, based on evidence
that DNA loops enriched with E2A differed extensively in the pre-pro-B cell stage and
the pro-B cell stage. Furthermore, E2A is largely overlapped by enhancer marker p300,
supporting the idea that E2A may act as a major participant for developmentally regulated
chromatin interactions [142]. In order to fully understand genome function, it is impor-
tant to decipher the relationship between genome conformation changes and the epigenetic
alterations of the genome in different cell identities and fates. Such information is neces-
sary to form a comprehensive picture of the spatial and functional organization of genome
structure [40].
60
4.1.5 Our contributions to interpret the coupling between spatial
genome organization and genome function
3C and related technologies significantly expand our understanding of genome func-
tion from linear regulatory elements into functional 3D regulatory network. Although
many novel discoveries have emerged in this exciting field over the last decade, there
remain many open questions. For example, it is currently unknown whether transcriptional
regulation is a cause or consequence of genome organization, whether gene positions in
the 3D nuclear space play a role in their regulation, and what is the major driving force to
shape spatial genome organization.
It is important to point out that the chromosome interactions uncovered by Hi-C and its
variants describe the average conformation in a population of cells [39, 42]. Considering
cell-to-cell heterogeneity, averaged chromosome conformation cannot adequately repre-
sent the wide range of structural features relevant to a population of cells. To address the
challenge of representing highly variable genome structures, we constructed a large popu-
lation of three-dimensional genome structures from tethered conformation capture (TCC)
chromatin interaction data [43]. The simulated structures therefore represent a spectrum
of all possible chromosome configurations that are consistent with the TCC data. We
applied our tensor-based integrative network pattern mining algorithm to the population
of genome structures, in order to identify frequently contacted chromatin regions. After
determining which chromatin regions are frequently co-localized, we integrated public
epigenetic marks and transcription factors ChIP-seq and RNA-seq data [29] to systemati-
cally investigate the following important biological questions:
61
1. What are the functional associations between spatial genome organization and tran-
scriptional regulation?
2. How are transcriptional regulation and epigenetic regulation functionally coupled
in spatial genome organization?
3. What is the major driving force to shape spatial genome organization?
4. What is the relationship between gene positioning in nucleus and gene expression?
4.2 Methods
4.2.1 Population-based modeling of chromosome conformation cap-
ture data
Chromatin contacts are observed with a wide range of frequencies in Hi-C experi-
ments, suggesting that many potential chromatin contacts are present in only a fraction of
cells. As pointed out in [39], the frequency of co-localization is rather low (less than 10%
of cells at a given time), reflecting the fact that chromosome conformation is dynamic and
highly variable among different cells. In other words, the contacts in Hi-C data do not
describe just one structure, but the average number contacts over numerous genome struc-
tures in different cells. Therefore, a population of genome structures should be generated
in which the resulting variety of structures is statistically consistent with the experimental
data. We formalize this task as an optimization problem with three main components: (i)
a three-dimensional representation of the chromosomes modeled at an appropriate level
of resolution; (ii) an objective function quantifying structure population’s agreement with
62
the data; and (iii) a method for optimizing the objective function to yield a population of
genome structures.
To identify consecutive genomic regions that share similar contact profiles, we applied
constrained clustering using Pearsons correlation between the regions contact profiles as
a similarity measure. Optimizing the clustering cutoff divided the linear genome into 428
domains, where contact frequencies are high within each domain, but much lower between
domains. The resulting domain-based contact frequency map was highly correlated with
the original frequency map, confirming that the characteristic long-range contact patterns
were preserved.
After dividing the human genome into domains, our goal is to generate a population
of spatial genome structures, whose configurations are statistically consistent with the
tethered conformation capture (TCC) chromatin interaction data. That is, the 856 (4282)
domains of the diploid human genome are packed into the nucleus differently for each
structure in the population, but contact frequencies across the whole population generate
a profile similar to the TCC data. A total of 10000 genome structures were generated
in the following way: starting from random positions, the positions of all 856 domains
were simultaneously optimized to a score of zero, indicating that no restraint violations
remained, here restraints ensure that all spheres are positioned within the nuclear volume.
For detailed methodology, please refer to [43].
To examine the consistency of the structure population with the original experimental
data, we calculated its domain contact frequency map and compared it with the TCC data.
The two were highly correlated with an average Pearsons correlation of 0.94, confirm-
ing the excellent agreement between contact frequencies in the structure population and
experiments.
63
4.2.2 Frequent network pattern mining on a population of spatial
genome structures
After inferring a population of spatial genome structures from TCC data, our goal is
to identify a cluster of chromatin regions that are frequently co-localized in 3D space.
Diploid human chromosomes are represented as 856 domains, including twin-domains
from homologous chromosomes. Each domain in three-dimensional space is represented
as coordinate P
i
= (x
i
;y
i
;z
i
) where 1i856. We constructed a chromatin interaction
network to describe co-localization relationships among the domains. Each domain is
represented as a node and two nodes are connected if the corresponding domains are close
in 3D space (the ratio of domain centroid distance and domain radius sum is equal or less
than 2). Discovering which domains are frequently co-localized can be formulated as a
network pattern mining problem: the goal is to identify sets of nodes that are densely
connected in a large number of networks.
Our input data is a collection of 10000 randomized chromatin networks, where each
network contains 856 domains. Every domain (node) in these networks has two labels:
(L1) the genomic location of the domain, including chromosome number, starting posi-
tion and ending position; and (L2) which chromosome copy the domain is from (because
humans have a diploid genome). Thus, twin-domains have the same L1 label but dif-
ferent L2 labels. Since we are mainly interested in identifying co-localization among
domains from different genomic regions, we merge twin-domains into a single represen-
tative domain. That is, instead of identifying dense subgraphs in the original networks, we
work on a population of contracted network with 428 merged domains. After identifying
frequent dense subgraphs in the contracted networks, we use these subgraphs as candidates
64
to recover densely connected subgraphs in the original networks. Using this procedure, all
the frequent dense subgraphs in the original networks can be efficiently captured, based
on an important property between the contracted network and original network: a dense
subgraph in the original network is also dense in the contracted network. The frequent
network pattern mining framework is illustrated in Figure 4.1.
We applied our tensor-based integrative network mining method [32] to the contracted
networks, following by recover procedure in the original networks, and identified 2123
frequent dense subgraphs, which are referred to as modules in the following analysis.
Each module captures a set of frequent chromatin interactions, and the frequency of a
module indicates the number of structures in which these chromatins are co-localized in
the population of 10000 spatial genome structures. The minimum module frequency was
set to 100.
4.3 Functional significance of spatial genome organiza-
tion
4.3.1 Classify chromosomal domains based on epigenetic marks
The human genome was partitioned into 428 domains by chromatin interactions from
the tethered conformation capture (TCC) data, where contact frequency was high within
domains, but much lower between domains. Since human centromeric regions contain
highly conserved repetitive DNA sequences(Grady, Ratliff et al. 1992), the number of
reads mapped to centromeric regions was significantly lower than the number of reads
mapped to the other regions (Figure 4.2, p-value<2.2e-16). In order to rule out effects of
65
Figure 4.1: Illustration of the algorithm to identify frequent dense cluster. (A) Each
node in a chromatin interaction network (CIN) represents a domain and each edge between
two nodes represents an interaction between those two domains. In this example, 4 CINs
with 12 nodes are constructed. Twin nodes are indicated by A/B pairing. Step 1: All the
twin nodes were merged in each network to yield a contracted graph. Step 2: Tensor-
based method was applied to identify the dense subgraph that frequently occurs across
many networks. Step 3: Dense subgraphs were recovered from the original networks. (B)
Illustration of the tensor-based algorithm in Step 2. A collection of contracted chromatin
interaction networks can be ”stacked” into a third-order tensor that each slice represents
the adjacency matrix of one network.
few reads mapping to centromeric regions, we classified the 428 domains as centromeric
and non-centromeric, and only examined chromatin contacts among non-centromeric
domains. Centromeric domains are defined as those lying within or overlapping a 10
Mb region around any centromere (5 Mb on each side). A total of 63 (14.7%) domains
were categorized as centromeric.
66
Figure 4.2: The number of TCC reads in 10 Mb centromeric / background regions
The following analysis only examines chromatin contacts among the remaining 365
(85.3%) non-centromeric domains. Those were further categorized as ”active”, ”inac-
tive, or ”others” based on hierarchical clustering of ChIP-seq signals of 13 epige-
netic/transcriptional marks (Figure 4.3). The active domains are enriched with active chro-
matin marks such as H3K4me3, H3K27ac, and DNase. The inactive domains are enriched
with repressive chromatin marks such as H3K27me3 and H3K9me3. Domains with a
mixture signal of active and repressive chromatin signals were categorized as ”other”. In
summary, out of 365 non-centromeric domains, there were 139 active domains, 155 inac-
tive domains, and 71 other domains. We found that active domains had the highest gene
density, inactive domains had the lowest gene density, and other domains had gene density
falling in between (Figure 4.4a). The linear distances of the domains to their centromeres
are shown in Figure 4.4b.
We applied our tensor-based network mining framework to the 365 non-centromeric
domains, and identified 2123 modules, which covered all the 365 domains. The majority
67
Figure 4.3: Hierarchical clustering of 365 non-centromeric domains based on 13 epige-
netic/transcriptional marks
68
Figure 4.4: (A) Gene density comparisons among four domain categories (B) Distance to
centromeres comparisons among four domain categories
Figure 4.5: (A) Distribution of module size (B) Distribution of hub domain size
69
Figure 4.6: (A) Module number decreases with chromosome number (B) Modules with
more chromosomes are more active
Figure 4.7: Average distance of adjacent domains in intra-chromosomal modules
70
of modules contained a small number of domains (5), but a few contained as many as 19
domains (Figure 4.5a). Each domain belongs to at least three modules, and the most rep-
resented domain belongs to 265 modules(Figure 4.5b), indicating different chromosomal
regions have different potential to contact with other chromosomal regions.
Out of the 2123 modules, 767 (36.1%) contained only domains from the same chro-
mosome, and those modules captured frequently contacted intra-chromosomal interac-
tions. The other 1356 (63.9%) modules contained domains from different chromosomes,
and those modules captured frequently contacted inter-chromosomal interactions. The
majority of modules span five or fewer chromosomes (Figure 4.6a). As expected, intra-
chromosomal contacts were mainly formed by chromosomal regions with close linear dis-
tance, whereas certain nuclear compartments can be brought into spatial proximity by
DNA loops even they are far away in linear space (Figure 4.7).
4.3.2 Active chromosomal regions frequently interact with each other
A module represents a cluster of nuclear compartments that are co-localized in a large
proportion of the modeled genome structures, and therefore are potentially coupled with
genome functions. Firstly, we examined which types of chromosomal regions tend to
interact in a module: are the modules mainly formed by active domains, inactive domains
or a mixture of the two? A module is categorized as active/inactive/other if at least 75%
of its domains are active/inactive/other; otherwise it is categorized as mixed. In our 2123
modules, the numbers of active /inactive /others /mixed modules are 629 /253 /14 /1227,
respectively. We used a permutation test to examine the significance of these proportions.
We randomly generated 1000 random module sets, and each set had 2123 random mod-
ules. For fair comparisons, each random module satisfied the following two constraints:
71
Figure 4.8: Comparisons of module category proportions with random modules
(1) the number of chromosomes in a random module is the same as the real module; (2)
the relative distance of domains in a random module is the same as that in a real module.
We found that the average numbers of active /inactive /others /mixed modules among ran-
domly assembled modules were 233 /344 /30 /1515 respectively. Therefore, the number
of active modules was significantly higher than would be expected from random chance;
and the numbers of inactive and mixed modules were significantly lower. There was no
significant difference between the expected and actual number of modules classified as
”other” (Figure 4.8). This analysis indicates that the spatial genome organization is not
random: active chromosomal regions frequently interact with each other; whereas inactive
chromosomal regions do not. Furthermore, contacts between active chromosomal regions
and inactive chromosomal regions are very rare.
72
The spatial organization of a genome is influenced by both intra- and inter- chromo-
somal interactions. It has been reported that contacts between inactive regions are gen-
erally confined in chromosomal territory, whereas active regions tend to reach out their
chromosome territory to form inter-chromosome contacts [140]. This is consistent with
our finding that modules intermingling more chromosomes tend to contain more active
domains (Figure 4.6B). The fold changes between active and inactive modules were 0.41
and 4.24 for intra- and inter- chromosome modules respectively. Our results indicate that a
relatively large fraction of intra-chromosome interactions are formed by contacts between
inactive regions, which results a compact nuclear subcompartments, that may correspond
to gene repression. Inter-chromosome interactions are more likely to be formed by con-
tacts between active regions, and those looped out active chromosome regions may repre-
sent active chromatin hotspot and associate with gene activation.
4.3.3 Module frequency and module activity are positively correlated
Our population structure modeling method can uncover frequently contacted chromo-
somal regions in a large population of genome structures. Module frequency represents
the number of times a module occurs in the entire population of simulated structures,
so high frequency modules are more likely to represent a stable configuration of a sub-
nuclear compartment, and may serve as a basic building block of genome function. It
should be noted that module frequency is largely dependent on whether captured contacts
represent intra-chromosome interactions or inter-chromosome interactions. As expected,
intra-chromosome interactions were much more frequent, because nuclear compartments
are more likely to contact their nearest neighbors inside of a chromosome territory. Out of
73
Figure 4.9: (A) Module frequency and module activity are positively correlated in inter-
chromosomal modules (B) Module frequency and module activity is positively correlated
in intra-chromosomal modules
10000 modeled chromosome structures, the average frequency of intra-chromosome mod-
ules was 2661, while that of inter-chromosome modules was 418. Therefore, intra- and
inter- chromosome modules should be treated separately in any frequency-based analysis.
We have shown that inter-chromosomal modules are more likely to be active. Next
we examine the relationship between module activity and module frequency: do modules
with high activity tend to be more frequent (stable) in ensemble of modeled structures? In
this analysis, module activity is defined as the percentage of active domains in the module.
We found that among both intra-chromosomal and inter-chromosomal modules, module
frequency increased as module activity (Figure 4.9). Positively correlated module activ-
ity and module frequency indicates that co-localization of active nuclear compartments
potentially play important roles to shape both intra- and inter-chromosomal organizations.
74
4.4 Transcription factories are coupled with spatial
genome organization
It has long been hypothesized that spatial genome organization and transcriptional
regulation are tightly coupled, although whether transcription regulation is the cause or
consequence of genome organization has yet to be determined [37, 40, 157, 158]. Cer-
tain transcription factors are reported to mediate co-localization of co-regulated genes in
specialized foci of concentrated RNA polymerase II [154, 159], where transcription fac-
tories were formed for potential optimization of efficient and coordinated transcriptional
regulation. But to our knowledge, no research has yet investigated the coupling between
spatial genome organization and transcriptional regulation on the genome scale. It has
been shown that transcription factors are enriched in a limited number of transcription fac-
tories, where associated genes are highly expressed [154,157]. It has also been shown that
chromosomal interactions are dissociated when transcription is inhibited [154, 160]. We
expect that modules enriched with RNA polymerase II (RNAPII) could encompass tran-
scription factories. In order to examine the coupling between spatial genome organization
and transcriptional regulation, we collected ChIP-seq for 47 genome-wide transcription
factors, and RNA-seq data for the human lymphoblastoid cell line (Gm12878) from the
Encode database.
We measure transcription factor (TF) signal in a module as the average signal of a
given TF over all domains in the module. To test whether a TF is enriched in a module, we
generated 1000 random modules using the same procedure as described in section 4.3.2.
If the TF signal of the real module falls in the top 5% percentile of randomly generated
75
modules, the TF is considered enriched in the module; otherwise the TF is not enriched in
the module.
We are interested in answering the following questions to investigate the coupling
between transcription factories and 3D genome organization:
1. What is the relationship between RNAPII signal and transcription factories stability?
2. How are transcription factors recruited in transcription factories, and what is the
relationship between transcription factor signal and transcription factories stability?
4.4.1 RNAPII signal and module frequency are positively correlated
in transcription factories
We found that out of 2123 modules, 567 were enriched with RNAPII, and 493 (86.9%)
of these were inter-chromosomal modules. Genes in RNAPII enriched modules had sig-
nificantly higher expression compared to genes in modules not enriched with RNAPII
(Figure 4.10, p-value<2.2e-16), in agreement with reports that transcribed genes are posi-
tioned at RNAPII foci while non-transcribed genes are positioned away from RNAPII
foci [160]. The correlations between RNAPII and gene expression for inter- / intra- chro-
mosomal modules are shown in Figure 4.11.
Next we examined the relationship between RNAPII signal and module frequency for
RNAPII-enriched and not RNAPII-enriched inter-chromosomal modules separately (Fig-
ure 4.12). As the figure shows, for RNAPII enriched inter-chromosomal modules, module
frequency increased as RNAPII signal is higher (Figure 4.12A), indicating that nuclear
76
Figure 4.10: RNAPII enriched modules have higher gene expression
Figure 4.11: Correlation between RNAPII and gene expression
77
compartments are more stable with higher RNAPII signal; while for non transcription fac-
tories, RNAPII signal is weak and there is no clear correlation between RNAPII signal and
module frequency.
We also examined the relationship between RNAPII signal and module frequency for
RNAPII-enriched and not RNAPII-enriched intra-chromosomal modules (Figure 4.12B).
For the same RNAPII signal level, RNAPII-unenriched modules actually had higher fre-
quency than RNAPII-enriched modules, suggesting that RNAPII may not be associated
with intra-chromosomal organization.
Finally, we compared the frequency of all RNAPII-enriched and RNAPII-unenriched
modules. We observed that among inter-chromosomal modules, RNAPII-enriched
modules have higher frequency (Figure 4.13A, p-value=3.5e-7); while among intra-
chromosomal modules, module frequency was similar regardless whether RNAPII is
enriched or not (Figure 4.13B, p-value=0.7).
4.4.2 Gene pairs in transcription factories have shorter 3D distance
RNAPII-enriched modules may represent nuclear subcompartments containing co-
regulated genes in close proximity. It has been reported that Hbb loci are largely
dissociated with RNAPII foci after knocking down RNAPII in the heat-shock-treated
nuclei [160]. We therefore propose that if transcriptional regulation is coupled with spa-
tial genome organization, genes in a transcription factory should be closer to each other
in spatial distance than genes not in transcription factories, when comparing genes with
similar linear distances. To test this hypothesis, we compared the 3D distances of gene
pairs inside and outside of transcription factories, for pairs with similar 1D distances. The
results are shown in Figure 4.14. We observe that for different ranges of 1D distance
78
Figure 4.12: (A) Module frequency and RNAPII signal is positively correlated in RNAPII
enriched inter-chromosomal modules (B) Correlation between RNAPII signal and module
frequency in intra-chromosomal modules
between gene pairs, conditional on the similar 1D distance, the 3D distances between
gene pairs in transcription factories are always significantly shorter than in modules not
enriched with RNAPII, indicating that transcriptional regulation is functionally coupled
with spatial genome organization.
We also compared the 3D distances between gene pairs from RNAPII-enriched mod-
ules with high and low activities, but did not find any significant differences. This result
indicates that genes in transcription factories, either associated with activation or repres-
sion, are both spatially close, potentially facilitated by transcription regulation.
79
Figure 4.13: (A) RNAPII-enriched modules have higher frequency than not RNAPII-
enriched modules in inter-chromosomal modules (B) RNAPII-enriched modules have sim-
ilar frequency as not RNAPII-enriched modules in intra-chromosomal modules
Figure 4.14: Gene pairs in transcription factories have shorter 3D distance conditional on
the same 1D distance
80
Taken together, these results demonstrate that transcription factories are located in
RNAPII-enriched nuclear compartments, and that higher RNAPII signal stabilizes tran-
scription factories in inter-chromosomal modules. Genes in transcription factories are
closer to each other in space, probably to facilitate gene regulation.
4.4.3 Inter-chromosomal transcription factories stability and gene
expression are positively correlated
We have shown that genes in transcription factories have higher expression (Fig-
ure 4.10). This section describes our aim to further examine the relationship between
transcription factory stability and gene expression. Our hypothesis is that transcription
factories with higher frequency are probably more stable, which may facilitate the recruit-
ment of activators and the transcription of genes. In modules not associated with transcrip-
tion factories, however, the frequency depends largely on nuclear spatial constraints. Thus,
our population of simulated 3D structures includes some frequent modules that represent
transcription factories, and some that just represent frequent associations of domains with
strong constraints on their spatial location. Among inter-chromosomal transcription fac-
tories, gene expression increases with module frequency (Figure 4.15A), but there is no
such correlation among modules not associated with transcription factories (Figure 4.15B).
These results are in agreement with our hypothesis.
Furthermore, we examined the relationship between the 3D distance between domain
pairs and gene expression. We observed that whether or not a pair of domain pairs was part
of a transcription factory, as their 3D distance increased, the gene expression associated
81
Figure 4.15: (A) In inter-chromosomal transcription factories, gene expression is posi-
tively correlated with module frequency (B) In inter-chromosomal modules not enriched
with RNAPII, there is no correlation between gene expression and module frequency
Figure 4.16: 3D distance between domain pair and gene expression is negatively correlated
82
with the domains decreased (Figure 4.16). Our result indicates that on the domain level,
pairs with closer spatial distance usually have higher gene expression.
It has been reported that homologous chromosomes only pair when transcriptionally
active [161]. We observed that homologous chromosome regions in transcription factories
have shorter 3D distances than those not located in transcription factories (Figure 4.17).
Figure 4.17: Homologous chromosome regions have shorter 3D distances in transcription
factories compared with non transcription factories
4.4.4 Inter-chromosomal transcription factories stability and tran-
scription factor signal are correlated
In order to identify which TFs are most likely to be recruited into transcription factories
and what is the relationship between TFs signal and transcription factories stability, we
define the overlap of a given TF with RNAPII as the percentage of modules enriched with
both TF and RNAPII, relative to the number of modules only enriched with TF. These
percentages are shown in Table 4.1. Taking transcription factor Max as an example, we
83
find that 90% of Max foci overlapped with RNAPII, and that 54% of modules enriched
with RNAPII are also enriched with Max.
We examined 46 examined TFs, and found that 24 strongly overlapped (overlapping
percentage0.75) with RNAPII-enriched modules. We propose that these TFs play impor-
tant roles in reinforcing the stability of transcription factories. If the hypothesis is correct,
we expect that among these modules, transcription factory stability is positively corre-
lated with TF signal, reflecting that when more TF is recruited into a transcription factory,
the conformation of the transcription factory is more stable. On the other hand, among
modules not enriched with RNAPII, we expect that subnuclear compartments are largely
unrelated to transcription machinery, so no such positive correlations should exist.
Table 4.1: Overlapping percentages of 46 transcription fac-
tors and RNAPII
TF TF percentage RNAPII percentage
Bhlhe40c 0.95 0.68
Maz 0.94 0.74
Sin3 0.93 0.66
Chd2 0.92 0.81
Ebf1 0.91 0.46
Usf2 0.9 0.72
Mxi1 0.9 0.53
textbfMax 0.9 0.54
Nrf1 0.89 0.61
Cfos 0.88 0.63
Whip 0.88 0.74
Ctcf 0.87 0.86
Chd1 0.86 0.44
Znf143 0.86 0.58
Stat1 0.86 0.66
Continued on next page
84
Table 4.1 – continued from previous page
TF TF percentage RNAPII percentage
Nfyb 0.85 0.59
Corest 0.85 0.35
Rad21 0.84 0.44
P300b 0.84 0.31
Tr4 0.83 0.78
Smc3 0.82 0.37
Brca1 0.81 0.66
Tblr1 0.79 0.28
Rfx5 0.76 0.26
Znf384 0.73 0.21
Erra 0.72 0.23
E2f4 0.71 0.2
Elk1 0.7 0.17
Mafk 0.65 0.19
Tbp 0.65 0.22
Nfkb 0.64 0.13
Pol3 0.54 0.09
Nfe2 0.48 0.07
Stat3 0.3 0.04
Cdp 0.23 0.02
Jund 0.23 0.02
Yy1 0.19 0.02
Zzz3 0.19 0.02
Irf3 0.18 0.03
Gcn5 0.18 0.02
Srebp1 0.17 0.02
Srebp2 0.16 0.02
Ikzf1 0.15 0.02
Spt20 0.15 0.02
Nfya 0.1 0.01
Znf274 0.07 0.02
We found that in all the RNAPII-enriched inter-chromosomal modules, one group of
TFs was positively correlated with transcription factory stability; furthermore, the positive
85
correlations disappeared in modules not enriched with RNAPII. Other TFs showed neg-
ative correlation with transcription factory stability, which also disappeared in modules
not enriched with RNAPII. We then examined correlations between TF signal and gene
expression in transcription factories, and found two consistent relationships: the TFs that
were positively correlated with transcription factory stability, were also positively corre-
lated with gene expression; and the TFs that were negatively correlated with transcription
factory stability, were also negatively correlated with gene expression. We define the
transcription factors that were positively correlated with gene expression in transcription
factories to be activators, and those that were negatively correlated with gene expression as
repressors. We use one activator (Maz) and one repressor (Spt20) to demonstrate the corre-
lations between TF signal and inter-chromosomal module frequency in RNAPII-enriched
and not RNAPII-enriched modules. In inter-chromosomal transcription factories, the Maz
signal and module frequency were positively correlated (Figure 4.18A); whereas the Spt20
signal and module frequency were negatively correlated (Figure 4.19A).
Our results indicate that transcription factory stability is dependent on both activa-
tor recruitment and repressor diffusion: the loss of activator signal in transcription fac-
tories leads to disruptions of inter-chromosomal interactions, where targeted genes have
lower expression. The loss of repressor signal results in increased associations of inter-
chromosomal interactions in transcription factories, where targeted genes have higher
expression.
4.4.5 Transcription factors cooperate in transcription factories
In cases where a group of transcription factors are simultaneously enriched in the same
transcription factories, we can examine co-regulation of transcription factors in the spatial
86
Figure 4.18: (A) Activator Maz is positively correlated with inter-chromosomal transcrip-
tion factory stability (B) Activator Maz is not correlated with inter-chromosomal module
frequency if RNAPII is not enriched
genome. The correlation matrix of transcription factor enrichments in transcription facto-
ries is shown in Figure 4.20. We observe three well-separated transcription factor groups
in the heat map matrix: strong activators, weak activators, and repressors. This result indi-
cates that transcription factors cooperate to regulate gene expression in 3D space: in tran-
scription factories, once a few bound activators / repressors come together, the increased
local concentration of binding sites might makes it more likely that other co-operative
factors are recruited [152].
We then examined the number of enriched activators / repressors in transcription fac-
tories and other modules. We found that the average number of enriched activators in
transcription factories was significantly higher (15.2 VS 1.0), and the number of enriched
87
repressors was similar in the two categories (0.3 VS 0.5). The observation that transcrip-
tion factors are mainly enriched in transcription factories can explain the specific binding
of transcription factors: the majority of genomic regions with transcription factor binding
motifs are not functional binding sites. A potential reason is that such regions are not
located in transcription factories, so transcription factors are not recruited there.
Furthermore, we examined the relationship between the number of enriched activa-
tors in transcription factories and RNAPII signal, and we found that more activators are
recruited in inter-chromosomal transcription factories with stronger RNAPII signals. This
trend disappears for intra-chromosomal transcription factories (Figure 4.21). Our results
indicate that activators potentially reinforce with RNAPII to stabilize inter-chromosomal
Figure 4.19: (A) Repressor Spt20 is negatively correlated with inter-chromosomal tran-
scription factory stability (B) Repressor Spt20 is not correlated with inter-chromosomal
module frequency if RNAPII is not enriched
88
Figure 4.20: Co-operations of transcription factors in transcription factories
transcription factories, and we observed that inter-chromosomal module frequency was
positively correlated with the number of recruited activators (Figure 4.22).
Transcription factories enriched with activators should have higher gene expression
than those enriched with repressors; and non transcription factories enriched with activa-
tors should also have higher gene expression than those enriched with repressors, which
are in agreement with our hypothesis (Figure 4.23). Interestingly, transcription facto-
ries enriched with repressors had higher gene expression than non transcription factories
enriched with activators, indicating that genes loaded in transcription factories had higher
expression in general.
89
Figure 4.21: (A) In inter-chromosomal transcription factories, the number of recruited
activators is positively correlated with RNAPII signal (B) There is no such correlation in
intra-chromosomal transcription factories
Figure 4.22: Correlation between the number of recruited activators and inter-
chromosomal RNAPII-enriched / not RNAPII-enriched modules
90
Figure 4.23: Gene expression in transcription factories and non transcription factories
4.4.6 Conclusions
In summary, our results strongly imply that transcription factories are coupled with
spatial genome organization, and support the hypothesis proposed in [152, 154] that pref-
erential associations in transcription factories may substantially affect higher order chro-
mosome conformations. On the other hand, our results also indicate that transcriptional
regulation must be considered in the context of the spatial genome organization. Chro-
mosomal clustering of functionally related genes might allow for better coordination of
their expression, and functional compartmentalization of the genome might be intrinsic
in the spatial compartmentalization of the nucleus. The spatial genome organization also
potentially couple with epigenetic regulation, which we are going to investigate next.
91
4.5 Transcriptional regulation and epigenetic regulation
are closely associated in transcription factories
The crosstalk between transcriptional regulation and epigenetic regulation has been
of great interest. Epigenetic regulation consists mainly of DNA methylation and histone
modifications, both of which are closely associated with gene expression. In general,
hypermethylation of promoter regions correlates with low or no transcription, perhaps by
blocking the promoters at which activating transcription factors bind [162]. Histone modi-
fications also play an important role in the establishment and maintenance of gene expres-
sion. For example, promoters enriched with H3K4me2 / H3K4me3 are activated; promot-
ers enriched with H3K9me3 / H3K27me3 are repressed, and H3K4me1 is an enhancer
histone mark.
The spatial genome structures revealed by Hi-C technologies show that well-
demarcated topological domains are highly correlated with epigenetic organization. Inac-
tive domains are condensed and confined to their chromosomal territories, whereas active
domains reach out of their chromosomal territory to form remote chromosomal con-
tacts [140]. It has also been observed that topological domains are partitioned by sharp
boundaries enriched with CTCF and other chromatin insulator-binding proteins [41, 138].
These connections between topological domains and epigenetic domains have begun to
unveil the associations between transcriptional regulation and epigenetic regulation. For
example, a recent study showed that the mouse genome can be partitioned into enhancer-
promoter units [143]. However, crosstalk between transcriptional regulation and epige-
netic regulation are still largely unknown in spatial genome organization.
92
In order to study this problem, we collected 11 genome-wide epigenetic marks (ChIP-
seq) in the human lymphoblastoid cell line (Gm12878) from the Encode database. The
11 marks are Dnase, H2az, H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me3,
H3K27ac, H3K27me3, H3K79me2, and H4K20me1. We measured epigenetic mark sig-
nal in a module as the average mark signal over all domains in the module.
Specifically, we are interested in the following two questions:
1. What is the relationship between module activity and module frequency in tran-
scription factories?
2. What is the relationship between epigenetic mark signal and module frequency?
4.5.1 Module frequency and module activity are positively correlated
in inter-chromosomal transcription factories
Firstly we compared module activity in transcription factories (RNAPII-enriched mod-
ules) and other modules (recall that module activity is calculated as the percentage of
active domains in the module). As expected, module activity in transcription factories
was significantly higher (Figure 4.24A), indicating that RNAPII-enriched modules were
associated with open chromatin regions. We also observed that gene density was higher in
transcription factories (Figure 4.24B), which makes biological sense. From an evolution
point of view, most genes should be mainly located at transcription factories, awaits to be
transcribed and later translated to execute their functions.
Secondly, we examined the correlation between module frequency and module activity.
We found that among inter-chromosomal transcription factories, module frequency was
positively correlated with module activity; whereas among inter-chromosomal modules
93
not enriched with RNAPII, there was no correlation (Figure 4.25). This result indicates
that transcriptional regulation and epigenetic regulation are closely associated in inter-
chromosomal transcription factories.
Thirdly, we examined the correlation between module frequency and module activity
in intra-chromosomal modules. We found that module frequency is positively correlated
with module activity in both transcription factories and other modules (Figure 4.26). It is
worth noting that for intra-chromosomal modules, with the same activity level, the average
frequency of module not enriched with RNAPII was even higher than that of transcription
factories. This fact suggests that transcriptional regulation and epigenetic regulation are
largely unrelated in intra- chromosomal modules. These observations are consistent with
our prior results that RNAPII signal is positively correlated with inter-chromosomal tran-
scription factory frequency but not with intra-chromosomal transcription factory frequency
(Figure 4.12).
4.5.2 Inter-chromosomal transcription factories stability and epige-
netic mark signal are correlated
We have shown that module frequency and module activity are positively correlated
in inter-chromosomal transcription factories. Next we examine whether the signals of
individual epigenetic marks also correlate with inter-chromosomal transcription factory
stability. We found that for active epigenetic marks, the signal was positively correlated
with the frequency of transcription factories, but no such correlation exists for non tran-
scription factories. Figure 4.27 shows an example of this relationship using the active
mark H3K9ac. On the other hand, for repressive epigenetic marks, the signal is negatively
94
Figure 4.24: (A) RNAPII enriched modules have higher activity (B) RNAPII enriched
modules have higher gene density
Figure 4.25: Module frequency is positively correlated with module activity in inter-
chromosomal transcription factories
95
Figure 4.26: Correlation between module activity and intra-chromosomal module fre-
quency
correlated with module frequency in transcription factories, and no such correlation exists
for non transcription factories. Figure 4.28 shows an example of this relationship using
the repressive mark H3K27me3.
4.5.3 Conclusions
In summary, our results indicate that transcriptional regulation and epigenetic regula-
tion are closely associated in inter-chromosomal transcription factories. Active epigenetic
marks are abundant in nuclear environments enriched with RNAPII to form active tran-
scription hubs, where transcriptional activators are efficiently recruited to facilitate gene
activation.
96
Figure 4.27: Active mark H3K9ac is positively correlated with inter-chromosomal tran-
scription factory stability
4.6 Centromeric influence plays an important role to
shape spatial genome organization
It has long been of great interest to understand how the genome is organized in space.
So far, Hi-C and its variants are by far the most effective and efficient technologies to
investigate spatial genome organization. It has been reported that in the human genome,
on the megabase scale, chromatin conformation can be modeled as a fractal globule that
enables maximally dense packing while preserving the ability to easily fold and unfold
any genomic locus [22]. However, the spatial organization of the human genome has not
yet been investigated on larger scales (>10 Mb). We have demonstrated that transcription
factories are closely coupled with spatial genome organization, but it is worth to point
97
out that transcriptome only accounts for less than 5% of the whole human genome, and
RNAPII-enriched transcription factories only account for 26.7% of all modules. We expect
that besides RNAPII-enriched nuclear subcompartments that are functionally intrinsic in
nucleus, there must be another factor(s), that can explain the remaining majority of mod-
ules, and play essential roles in shaping spatial genome organization.
In order to track down these influential factors, we identified a set of domains that occur
in most modules, which potentially serve as chromatin interaction hubs for spatial genome
organization. Examining the properties of these hub domains allowed us to uncover the
strong influence of centromeric regions in shaping the 3D genome organization. Interest-
ingly, transcription factories were closely associated with centromeric influence.
Figure 4.28: Repressive mark H3K27me3 is negatively correlated with inter-chromosomal
transcription factory stability
98
4.6.1 Hub domains are positioned closer to centromeres, lie closer to
nuclear interior and have higher gene density
The total number of domains in our 2123 modules was 10941, where domains contain-
ing in different modules are counted multiple times. On average, a domain was contained
in about 5.2 modules. The fact a domain occurs in multiple modules indicates that a certain
genomic region can contact different partners in different cell populations. We sorted the
domains by their occurrences in modules, and are especially interested in characterizing
hub domains that occur in many modules (the top 10% of the module occurrence). We
found that hub domains are located closer to centromeres (Figure 4.29A), and have higher
gene density (Figure 4.29B). We also compared the radial positions of hub domains and
specific domains (the bottom 10% of the module occurrence), and found that hub domains
were positioned significantly more interior in nuclei than specific domains (Figure 4.30,
p-value<2.2e-16). Furthermore, hub domains were more likely to be active, and the fold
change of active hub and inactive hub was 1.61; whereas specific domains were more
likely to be inactive, and the fold change of active and inactive was 0.75.
Out of the 2123 modules, 1409 contained at least one hub domain, and the remain-
ing 714 modules did not contain any hub domains. We compared these two module
groups in terms of gene density, activity, RNAPII signal, gene expression, distance to cen-
tromeres, and distance to the nuclear center. We found that the modules containing hub
domains had higher gene density (Figure 4.31A, p-value<2.2e-16), activity (Figure 4.31B,
p-value<2.2e-16), RNAPII signal (Figure 4.32A, p-value<2.2e-16), and gene expression
(Figure 4.32B, p-value<2.2e-16). They were also positioned closer to centromeres (Fig-
ure 4.33A, p-value=1.0e-12), and more interior in nuclei (Figure 4.33B, p-value<2.2e-16).
99
Figure 4.29: (A) Hub domains are located closer to centromeres (B) Hub domains have
higher gene density
Figure 4.30: Hub domains are positioned more interior in nuclei than specific domains
100
Figure 4.31: (A) Modules containing hub domains have higher gene density (B) Modules
containing hub domains have higher activity
4.6.2 Centromeric influence and the coupling of centromeric influ-
ence and transcription factories
From the previous section, it is clear that modules containing at least one hub domains
have distinct features compared to modules containing no hub domains. Given our finding
that hub domains tend to be located close to the centromeres (Figure 4.29A), we speculate
that centromeric influence plays an important role to shape spatial genome organization.
In order to test our hypothesis, firstly, we examined the relationship between module
centromeric distance and module frequency. Module centromeric distance is calculated as
the average over each domain’s centroid to the centromere in the corresponding chromo-
some. Correlation plots for inter-chromosomal modules and intra-chromosomal modules
101
Figure 4.32: (A) Modules containing hub domains have higher RNAPII (B) Modules con-
taining hub domains have higher gene expression
Figure 4.33: (A) Modules containing hub domains are located closer to centromeres (B)
Modules containing hub domains are positioned more interior in nuclei
102
Figure 4.34: (A) Correlation between centromeric distance and module frequency for
inter-chromosomal modules (B) Correlation between centromeric distance and module
frequency for intra-chromosomal modules
are shown in Figure 4.34A and Figure 4.34B respectively. We observe that for both inter-
and intra- chromosomal modules, as centromeric distance decreases, module frequency
increases. This correlation indicates that more stabilized spatial genome organization is
potentially achieved by stronger centromeric influence. Interestingly, module frequency
does not strictly decrease as centromeric distance increases. Modules with very long cen-
tromeric distances tend to have larger frequencies than moderate centromeric distance,
which may indicate that telomeric influence also play a role to affect spatial genome orga-
nization at chromosome ends.
103
We previously demonstrated that in inter-chromosomal transcription factories, mod-
ules with higher RNAPII signals had higher frequencies (Figure 4.12A), given that mod-
ule frequency also correlates with centromeric distance (Figure 4.34A), we speculate that
centromeric influence and transcription factories are potentially largely coupled.
First, we examined the relationship between module centromeric distance and RNAPII
signal in all modules (Figure 4.35A). We found that nuclear subcompartments that are co-
localized closer to centromeres have higher RNAPII signals, indicating that the spatial
organization of the genome is largely coupled with transcription regulation. On the other
hand, at the domain level, RNAPII signal and domain centromeric distance are not cor-
related (Figure 4.35B), indicating that transcriptional regulation is coupled with spatial
genome organization but not with linear genome organization. We speculate that because
of centromeric influence, domains closer to centromeres from different chromosomes are
stably co-localized, and these stabilized nuclear subcompartments tend to form transcrip-
tion factories.
If centromeric influence really drives spatial genome organization, we expect that even
modules not enriched with RNAPII (accounting for transcriptional effects on 3D genome
organization), should be closer to centromeres than random modules, which is in agree-
ment with our hypothesis (Figure 4.36A). It is no surprise that among inter- chromosomal
modules, transcription factories were significantly closer to centromeres than non tran-
scription factories (p-value<2.2e-16). We also examined centromeric influence on intra-
chromosomal genome organization by checking whether there were any significant differ-
ences in terms of centromeric distances between three intra- chromosomal module groups:
RNAPII enriched, not RNAPII enriched, and random modules. We observed no signifi-
cant differences between the centromeric distances of the three groups (Figure 4.36B).
104
Figure 4.35: (A) Modules closer to centromeres have higher RNAPII (B) RNAPII and
domain centromeric distance is not correlated
This result indicates that intra-chromosomal organization is largely independent of cen-
tromeric influence.
4.6.3 Inter-chromosomal interactions demarcate chromosomes into
nuclear interior and nuclear periphery
Based on inter-chromosomal contact probabilities, it has been reported that the small,
gene-rich chromosomes (chromosomes 16, 17, 19, 20, 21, and 22) preferentially interact
with each other and co-localize in the center of the nucleus [22]. We can also examine
inter-chromosomal interactions based on module level, rather than relying on raw inter-
chromosomal contact probabilities. We expect to obtain more accurate results since mod-
ules are able to capture frequently co-localized chromosomal regions.
105
Figure 4.36: (A) Inter-chromosomal modules centromeric distance comparison (B) Intra-
chromosomal modules centromeric distance comparison
First, within each module, we counted all the pairwise interactions between differ-
ent chromosomes. For example, if one module contains domain(s) from chromosome
1 and domain(s) from chromosome 2, interactions between chromosome 1 and chromo-
some 2 are counted once. We iterated all the 1356 inter-chromosomal modules, and inter-
chromosomal interactions heat map is shown in Figure 4.37.
From Figure 4.37, we observe that chromosomes can be divided into two groups.
Group 1 recapitulated the previously identified chromosome 16, 17, 19, 20, 21, and 22,
as well as one new chromosome 11. Interestingly, chromosome 17 potentially served as
a driving hub for the nuclear interior inter-chromosomal co-localization, since chromo-
some 17 was the one that most intensively interacted with the other 6 chromosomes. The
remaining 16 chromosomes interacted with each other moderately, and are localized in the
106
Figure 4.37: Nuclei are divided into nuclear interior and nuclear periphery by inter-
chromosomal interactions
nuclear periphery. We compared the gene densities of the two groups, and confirmed that
the nuclear interior chromosomes are gene rich, while nuclear peripheral chromosomes
are gene poor (Figure 4.38, Wilcox p-value=1.8E-9).
107
Figure 4.38: Nuclear interior chromosomes have higher gene density
It has been reported that in budding yeast, telomeres are organized in clusters enriched
at the nuclear periphery, which is critical for telomere silencing [163]. A recent study
showed that human telomeres are tethered to the nuclear envelope during postmitotic
nuclear assembly, suggesting dynamic re-localization of human telomeres during the cell
cycle [164]. We speculate that inter-chromosomal interacting contacts in the nuclear
periphery are mediated by telomeres.
In order to investigate module locations in the nucleus, we examined the relationship
between module centromeric distance and module nuclear interior distance, and found
that they were positively correlated (Figure 4.39). Our results indicate that centromeric
chromosomal clustering tends to occur in the nuclear interior. Here module nuclear inte-
rior distance is calculated as the average mass distance to the nuclear interior center of
corresponding chromosomes, over all domains in the module.
108
Figure 4.39: Centromeric distance and radial position are positively correlated for all mod-
ules
It has been reported that the radial nuclear positions of genes are correlated with their
transcriptional status, active genes preferentially located close to the nuclear interior, and
inactive genes close to nuclear periphery [165]. We further investigated in a large scale
subnuclear compartments (module level), whether gene expression is negatively correlated
with module nuclear interior distance. We found that as module nuclear interior distance
increases, module gene expression decreases (Figure 4.40A). Note that the gene expression
in our modules does not strictly decrease toward the periphery. One potential reason is that
the nuclear envelope has both euchromatin and heterochromatin states, so certain genes
tethered to the nuclear periphery can be activated [166].
109
Figure 4.40: (A) Module gene expression is negatively correlated with module nuclear
interior distance (B) Module RNAPII signal is negatively correlated with module nuclear
interior distance (C) Module activity is negatively correlated with module nuclear interior
distance (D) Module gene density is negatively correlated with module nuclear interior
distance
110
In order to investigate the effects of nuclear subcompartment spatial localization on
transcription activity, chromatin states and gene distribution, we examined the relation-
ships between module nuclear interior distance and module RNAPII signal, module activ-
ity, and module gene density. It turns out that they are all negatively correlated (Fig-
ure 4.40B, Figure 4.40C, Figure 4.40D). This result strongly demonstrates that spatial
genome organization, transcriptional regulation and epigenetic regulation are closely asso-
ciated. In subnuclear compartments closer to the nuclear interior, more genes are located
there, chromatins are more active, and genes have higher expression. Similar relationships
between spatial genome organization, transcription pattern and chromatin state have been
observed in the yeast genome [166, 167], indicating that the functional implications of
spatial genome organization may well be conserved throughout evolution.
We speculate that the human genome is displaced toward the nuclear envelope, and
that for the majority of genes, transcription fates are largely predetermined by distance to
the nuclear interior. Transcription products from one transcription machine could serve
as trans-acting factors to regulate cis-regulatory elements in nearby transcription machine.
Such an intriguing mechanism could be potentially achieved by intrinsic epigenetic chro-
matin states embedded in the spatial genome organization.
Although we expect the majority of genes to have predetermined transcription fates,
transcriptional regulation can be highly flexible and dynamic. It has been reported
that during cell differentiation, certain transcriptionally repressed loci located at the
nuclear periphery can relocate toward the interior of the nucleus to become highly tran-
scribed [166]. Such rapid and directed trajectories through the nucleus to certain sub-
nuclear compartments upon transcriptional activation could be driven by actin/myosin,
although the detailed molecular mechanisms for this directed chromosomal repositioning
111
are unknown [168, 169]. It is worth pointing out that gene repositioning needs energy, so
it is more biologically efficient if the majority of genes can be regulated in certain con-
fined subnuclear compartments. We speculate that except for transcriptional regulation,
certain post-transcriptional regulation (e.g. RNA splicing) can also occur in transcription
factories, in agreement with experimentally validated phenomena that transcription and
splicing are largely coupled [97, 170].
4.6.4 Conclusions
Currently, the functional implications and driving forces of spatial genome organiza-
tion are two of the most important questions in genome studies. We believe that cen-
tromeric influence is the decisive factor which can help answer both questions. The
functional link between spatial genome organization and genome function is mediated
by centromeric influence. Nuclear subcompartments positioned closer to nuclear interior
have higher gene density and are more transcriptionally active. We speculate that tran-
scriptional regulation takes advantage of the pre-determined spatial genome organization,
where transcription factories are positioned closer to centromeres and largely immobilized.
Genes that need to be transcribed move to nearby transcription factories. General tran-
scription factors that includes TATA-binding protein (TBP), TFIIA, ITIIB, TFIIF, TFIIE,
and TFIIH, potentially use ATP hydrolysis as a source of energy to open and push the
DNA to the active site of RNAPII enriched in transcription factories [171]. For genes
that need to be co-regulated (they can have long linear distance or even located in differ-
ent chromosomes), it is transcriptionally much more efficient if they are positioned in the
same transcription factories, where transcription factors work collaboratively for gene co-
regulation. Although centromeric influence may pre-determine and stabilize transcription
112
factories, transcriptional regulation may also alter local spatial genome organization for
dynamic gene expression regulation. This may occur by dynamically recruiting and diffus-
ing transcription factors, associated with epigenetic regulation through histone-modifying
enzyme and /or chromatin remodeling complex.
We have demonstrated that centromeric influence is potentially a major driving force of
spatial genome organization, and we now raise the following questions. What are the main
features of centromeric regions and how did centromeres evolve? It has been reported that
transposable elements (TEs) have significantly larger abundance at the centromeric and
pericentric regions in a wide range of phylogenetic species [172]. TEs comprise 46% of
the human genome, and their mobile, abundant and versatile nature makes them as valu-
able source for evolutionary adaptation. An intriguing hypothesis was proposed that the
evolution of centromere-specific DNA is accompanied by the preservation of the over-
all structural integrity of the TEs, and that TEs might provide the evolutionary templates
on which the tandemly repeated satellite units that typify the centromeric and pericentric
heterochromatic domains of most organisms have evolved [172].
113
Reference List
[1] Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene
expression patterns with a complementary DNA microarray. Science 1995,
270(5235):467–470.
[2] Lieb JD, Liu X, Botstein D, Brown PO: Promoter-specific binding of Rap1
revealed by genome-wide maps of protein–DNA association. Nature genetics
2001, 28(4):327–334.
[3] Hu H, Yan X, Huang Y , Han J, Zhou XJ: Mining coherent dense subgraphs
across massive biological networks for functional discovery. Bioinformatics
2005, 21(suppl 1):i213–i221.
[4] Das M, Dai HK: A survey of DNA motif finding algorithms. BMC bioinformatics
2007, 8(Suppl 7):S21.
[5] Johnson JM, Castle J, Garrett-Engele P, Kan Z, Loerch PM, Armour CD, San-
tos R, Schadt EE, Stoughton R, Shoemaker DD: Genome-wide survey of human
alternative pre-mRNA splicing with exon junction microarrays. Science 2003,
302(5653):2141–2144.
[6] Liang P, Pardee AB: Analysing differential gene expression in cancer. Nature
Reviews Cancer 2003, 3(11):869–876.
[7] Liu JS, Neuwald AF, Lawrence CE: Bayesian models for multiple local sequence
alignment and Gibbs sampling strategies. Journal of the American Statistical
Association 1995, 90(432):1156–1170.
[8] Smyth GK: Limma: linear models for microarray data. In Bioinformatics and
computational biology solutions using R and Bioconductor, Springer 2005:397–
420.
114
[9] Bolstad BM, Irizarry RA,
˚
Astrand M, Speed TP: A comparison of normalization
methods for high density oligonucleotide array data based on variance and
bias. Bioinformatics 2003, 19(2):185–193.
[10] Dekker J, Rippe K, Dekker M, Kleckner N: Capturing chromosome conforma-
tion. science 2002, 295(5558):1306–1311.
[11] Vernimmen D, De Gobbi M, Sloane-Stanley JA, Wood WG, Higgs DR: Long-
range chromosomal interactions regulate the timing of the transition between
poised and active gene expression. EMBO J 2007, 26(8):2041–51.
[12] Spilianakis CG, Lalioti MD, Town T, Lee GR, Flavell RA: Interchromosomal
associations between alternatively expressed loci. Nature 2005, 435(7042):637–
45.
[13] Lin Q, Chen Q, Lin L, Smith S, Zhou J: Promoter targeting sequence mediates
enhancer interference in the Drosophila embryo. Proc Natl Acad Sci U S A 2007,
104(9):3237–42.
[14] Tolhuis B, Palstra RJ, Splinter E, Grosveld F, de Laat W: Looping and interac-
tion between hypersensitive sites in the active beta-globin locus. Mol Cell 2002,
10(6):1453–65.
[15] Yoon YS, Jeong S, Rong Q, Park KY , Chung JH, Pfeifer K: Analysis of the
H19ICR insulator. Mol Cell Biol 2007, 27(9):3499–510.
[16] Thompson M, Haeusler RA, Good PD, Engelke DR: Nucleolar clustering of dis-
persed tRNA genes. Science 2003, 302(5649):1399–401.
[17] Hall N: Advanced sequencing technologies and their wider impact in microbi-
ology. Journal of Experimental Biology 2007, 210(9):1518–1525.
[18] Pettersson E, Lundeberg J, Ahmadian A, et al.: Generations of sequencing tech-
nologies. Genomics 2009, 93(2):105–111.
[19] Roberts A, Pimentel H, Trapnell C, Pachter L: Identification of novel transcripts
in annotated genomes using RNA-Seq. Bioinformatics 2011, 27(17):2325–2329.
[20] Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quanti-
fying mammalian transcriptomes by RNA-Seq. Nature methods 2008, 5(7):621–
628.
115
[21] Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo
protein-DNA interactions. Science Signaling 2007, 316(5830):1497.
[22] Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling
A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Ben-
der MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander ES,
Dekker J: Comprehensive mapping of long-range interactions reveals folding
principles of the human genome. Science 2009, 326(5950):289–93.
[23] Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an assess-
ment of technical reproducibility and comparison with gene expression arrays.
Genome research 2008, 18(9):1509–1517.
[24] Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with
RNA-Seq. Bioinformatics 2009, 25(9):1105–1111.
[25] Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg
SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq
reveals unannotated transcripts and isoform switching during cell differentia-
tion. Nature biotechnology 2010, 28(5):511–515.
[26] Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M,
Borodina T, Soldatov A, Parkhomchuk D, et al.: A global view of gene activity and
alternative splicing by deep sequencing of the human transcriptome. Science
2008, 321(5891):956–960.
[27] Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF,
Soboleva A, Tomashevsky M, Marshall KA, et al.: NCBI GEO: archive for
high-throughput functional genomic data. Nucleic acids research 2009, 37(suppl
1):D885–D890.
[28] Leinonen R, Sugawara H, Shumway M: The sequence read archive. Nucleic acids
research 2011, 39(suppl 1):D19–D21.
[29] Rosenbloom KR, Dreszer TR, Pheasant M, Barber GP, Meyer LR, Pohl A, Raney
BJ, Wang T, Hinrichs AS, Zweig AS, et al.: ENCODE whole-genome data in the
UCSC Genome Browser. Nucleic acids research 2010, 38(suppl 1):D620–D625.
[30] Hedley ML, Maniatis T: Sex-specific splicing and polyadenylation of dsx pre-
mRNA requires a sequence that binds specifically to tra-2 protein in vitro. Cell
1991, 65(4):579–586.
116
[31] Jernej Ule A, Ule JS, Williams A, Hu JS, Cline M, Wang H, Clark T, Fraser C,
Ruggiu M, Zeeberg BR, et al.: Nova regulates brain-specific splicing to shape
the synapse. Nature genetics 2005, 37(8):844–852.
[32] Li W, Liu CC, Zhang T, Li H, Waterman MS, Zhou XJ: Integrative analysis of
many weighted co-expression networks using tensor computation. PLoS Com-
put Biol 2011, 7(6):e1001106.
[33] Bolzer A, Kreth G, Solovei I, Koehler D, Saracoglu K, Fauth C, Muller S, Eils R,
Cremer C, Speicher MR, Cremer T: Three-dimensional maps of all chromosomes
in human male fibroblast nuclei and prometaphase rosettes. PLoS Biol 2005,
3(5):e157.
[34] Cremer T, Cremer M: Chromosome territories. Cold Spring Harb Perspect Biol
2010, 2(3):a003889.
[35] Duan Z, Andronescu M, Schutz K, McIlwain S, Kim YJ, Lee C, Shendure J, Fields
S, Blau CA, Noble WS: A three-dimensional model of the yeast genome. Nature
2010, 465(7296):363–7.
[36] Tjong H, Gong K, Chen L, Alber F: Physical tethering and volume exclusion
determine higher-order genome organization in budding yeast. Genome Res
2012, 22(7):1295–305.
[37] Fraser P, Bickmore W: Nuclear organization of the genome and the potential for
gene regulation. Nature 2007, 447(7143):413–7.
[38] Sexton T, Schober H, Fraser P, Gasser SM: Gene regulation through nuclear
organization. Nat Struct Mol Biol 2007, 14(11):1049–55.
[39] Dekker J: Gene regulation in the third dimension. Science 2008,
319(5871):1793–4.
[40] Schoenfelder S, Clay I, Fraser P: The transcriptional interactome: gene expres-
sion in 3D. Curr Opin Genet Dev 2010, 20(2):127–33.
[41] Hou C, Li L, Qin ZS, Corces VG: Gene density, transcription, and insulators
contribute to the partition of the Drosophila genome into physical domains.
Mol Cell 2012, 48(3):471–84.
[42] de Wit E, de Laat W: A decade of 3C technologies: insights into nuclear organi-
zation. Genes Dev 2012, 26:11–24.
117
[43] Kalhor R, Tjong H, Jayathilaka N, Alber F, Chen L: Genome architectures
revealed by tethered chromosome conformation capture and population-based
modeling. Nat Biotechnol 2012, 30:90–8.
[44] Dai C, Li W, Liu J, Zhou XJ: Integrating many co-splicing networks to recon-
struct splicing regulatory modules. BMC systems biology 2012, 6(Suppl 1):S17.
[45] Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF,
Schroth GP, Burge CB: Alternative isoform regulation in human tissue tran-
scriptomes. Nature 2008, 456(7221):470–476.
[46] Matlin AJ, Clark F, Smith CWJ: Understanding alternative splicing: towards a
cellular code. Nat Rev Mol Cell Biol 2005, 6(5):386–398.
[47] Chen M, Manley JL: Mechanisms of alternative splicing regulation: insights
from molecular and genomics approaches. Nat Rev Mol Cell Biol 2009,
10(11):741–754.
[48] Nagoshi R, McKeown M, Burtis K, Belote J, Baker B: The control of alterna-
tive splicing at genes regulating sexual differentiation in D. melanogaster. Cell
1988, 53(2):229–236.
[49] Hedley M, Maniatis T: Sex-specific splicing and polyadenylation of dsx pre-
mRNA requires a sequence that binds specifically to tra-2 protein in vitro. Cell
1991, 65(4):579–586.
[50] Jernej Ule A, Ule J, Alan Williams J, Melissa Cline H, Tyson Clark C, Matteo Rug-
giu B, David Kane J, John Blume R: Nova regulates brain-specific splicing to
shape the synapse. Nature Genetics 2005, 37(8):844–852.
[51] Hentze M, Kuhn L: Molecular control of vertebrate iron metabolism: mRNA-
based regulatory circuits operated by iron, nitric oxide, and oxidative stress.
Proceedings of the National Academy of Sciences of the United States of America
1996, 93(16):8175–82.
[52] Zhang C, Zhang Z, Castle J, Sun S, Johnson J, Krainer A, Zhang M: Defining
the regulatory network of the tissue-specific splicing factors Fox-1 and Fox-2.
Genes & Development 2008, 22(18):2550–2563.
[53] Moore M, Wang Q, Kennedy C, Silver P: An Alternative Splicing Network Links
Cell-Cycle Control to Apoptosis. Cell 2010, 142(4):625–636.
118
[54] Chen L, Zheng S: Studying alternative splicing regulatory networks through
partial correlation analysis. Genome biology 2009, 10:R3.
[55] Li W, Liu CC, Zhang T, Li H, Waterman MS, Zhou XJ: Integrative Analysis of
Many Weighted Co-Expression Networks Using Tensor Computation. PLoS
Comput Biol 2011, 7(6):e1001106.
[56] Hopfield JJ: Neural networks and physical systems with emergent collective
computational abilities. Proc Natl Acad Sci USA.79(8):2554–2558.
[57] Motzkin TS, Straus EG: Maxima for Graphs and a New Proof of a Theorem of
Tur´ an. Canadian Journal of Mathematics 1965, 17(4):533–540.
[58] Zhang T: Analysis of Multi-stage Convex Relaxation for Sparse Regularization.
J Mach Learn Res 2010, 11:1081–1107.
[59] Rockafellar RT: Convex Analysis. Princeton, NJ: Princeton University Press 1970.
[60] Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with
RNA-Seq. Bioinformatics 2009, 25(9):1105 –1111.
[61] Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg
SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq
reveals unannotated transcripts and isoform switching during cell differentia-
tion. Nature Biotechnology 2010, 28(5):511–515. [PMID: 20436464].
[62] Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M,
Barber GP, Clawson H, Coelho A, Diekhans M, Dreszer TR, Giardine BM, Harte
RA, Hillman-Jackson J, Hsu F, Kirkup V , Kuhn RM, Learned K, Li CH, Meyer
LR, Pohl A, Raney BJ, Rosenbloom KR, Smith KE, Haussler D, Kent WJ: The
UCSC Genome Browser database: update 2011. Nucleic Acids Research 2011,
39(Database issue):D876–882.
[63] Jiang H, Wong WH: Statistical inferences for isoform expression in RNA-Seq.
Bioinformatics 2009, 25(8):1026–1032. [PMID: 19244387].
[64] Licatalosi DD, Mele A, Fak JJ, Ule J, Kayikci M, Chi SW, Clark TA, Schweitzer
AC, Blume JE, Wang X, Darnell JC, Darnell RB: HITS-CLIP yields genome-wide
insights into brain alternative RNA processing. Nature 2008, 456(7221):464–
469,[http://dx.doi.org/10.1038/nature07488].
119
[65] Piva F, Giulietti M, Burini AB, Principato G: SpliceAid 2: A database of human
splicing factors expression data and RNA target motifs. Human Mutation 2011.
[66] Rothrock CR, House AE, Lynch KW: HnRNP L represses exon splicing via a
regulated exonic splicing silencer. The EMBO Journal 2005, 24(15):2792–2802.
[67] Spraggon L, Dudnakova T, Slight J, Lustig-Yariv O, Cotterell J, Hastie N, Miles
C: hnRNP-U directly interacts with WT1 and modulates WT1 transcriptional
activation. Oncogene 2007, 26(10):1484–1491.
[68] Cavaloc Y , Popielarz M, Fuchs JP, Gattoni R, Stvenin J: Characterization and
cloning of the human splicing factor 9G8: a novel 35 kDa factor of the ser-
ine/arginine protein family. The EMBO Journal 1994, 13(11):2639–2649.
[69] Ank M, Morales L, Henry I, Beyer A, Neugebauer KM: Global analysis reveals
SRp20- and SRp75-specific mRNPs in cycling and neural cells. Nature Struc-
tural & Molecular Biology 2010, 17(8):962–970.
[70] Simard MJ, Chabot B: SRp30c is a repressor of 3’ splice site utilization. Molec-
ular and Cellular Biology 2002, 22(12):4001–4010.
[71] Park E, Han J, Son GH, Lee MS, Chung S, Park SH, Park K, Lee KH, Choi S, Seong
JY , Kim K: Cooperative actions of Tra2 with 9G8 and SRp30c in the RNA
splicing of the gonadotropin-releasing hormone gene transcript. The Journal of
Biological Chemistry 2006, 281:401–409.
[72] Thomas DJ, Rosenbloom KR, Clawson H, Hinrichs AS, Trumbower H,
Raney BJ, Karolchik D, Barber GP, Harte RA, Hillman-Jackson J, Kuhn
RM, Rhead BL, Smith KE, Thakkapallayil A, Zweig AS, Haussler D,
Kent WJ: The ENCODE Project at UC Santa Cruz. Nucleic Acids Res
2007, 35(Database issue):D663–D667,[http://www.ncbi.nlm.nih.gov/
pubmed/17166863]. [PMID: 17166863].
[73] Horikoshi M, Bertuccioli C, Takada R, Wang J, Yamamoto T, Roeder RG: Tran-
scription factor TFIID induces DNA bending upon binding to the TATA ele-
ment. Proceedings of the National Academy of Sciences of the United States of
America 1992, 89(3):1060–1064.
[74] Rosmarin AG, Resendes KK, Yang Z, McMillan JN, Fleming SL: GA-binding
protein transcription factor: a review of GABP as an integrator of intracellular
signaling and protein-protein interactions. Blood Cells, Molecules & Diseases
2004, 32:143–154.
120
[75] Kovcs KJ: c-Fos as a transcription factor: a stressful (re)view from a functional
map. Neurochemistry International 1998, 33(4):287–297.
[76] Fossati A, Dolfini D, Donati G, Mantovani R: NF-Y recruits Ash2L to impart
H3K4 trimethylation on CCAAT promoters. PloS One 2011, 6(3):e17220.
[77] Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G,
Frishman G, Montrone C, Mewes H: CORUM: the comprehensive resource
of mammalian protein complexes–2009. Nucl. Acids Res. 2010, 38(Database
issue):D497–D501. [PMID: 19884131].
[78] Pan Q, Shai O, Misquitta C, Zhang W, Saltzman AL, Mohammad N, Babak T, Siu
H, Hughes TR, Morris QD, Frey BJ, Blencowe BJ: Revealing global regulatory
features of mammalian alternative splicing using a quantitative microarray
platform. Molecular Cell 2004, 16(6):929–941. [PMID: 15610736].
[79] Hartman TR, Qian S, Bolinger C, Fernandez S, Schoenberg DR, Boris-Lawrie K:
RNA helicase A is necessary for translation of selected messenger RNAs. Nature
Structural & Molecular Biology 2006, 13(6):509–516.
[80] Schmid SR, Linder P: D-E-A-D protein family of putative RNA helicases. Molec-
ular Microbiology 1992, 6(3):283–291.
[81] Nakajima T, Uchida C, Anderson SF, Lee CG, Hurwitz J, Parvin JD, Montminy
M: RNA helicase A mediates association of CBP with RNA polymerase II. Cell
1997, 90(6):1107–1112.
[82] Zhang S, Schlott B, Grlach M, Grosse F: DNA-dependent protein kinase (DNA-
PK) phosphorylates nuclear DNA helicase II/RNA helicase A and hnRNP pro-
teins in an RNA-dependent manner. Nucleic Acids Research 2004, 32:1–10.
[83] Dias AP, Dufu K, Lei H, Reed R: A role for TREX components in the release
of spliced mRNA from nuclear speckle domains. Nature Communications 2010,
1:97.
[84] Leeman JR, Gilmore TD: Alternative splicing in the NF-kappaB signaling path-
way. Gene 2008, 423(2):97–107.
[85] Conte MR, Kelly G, Babon J, Sanfelice D, Youell J, Smerdon SJ, Proud CG: Struc-
ture of the eukaryotic initiation factor (eIF) 5 reveals a fold common to several
translation factors. Biochemistry 2006, 45(14):4550–4558.
121
[86] Li W, Dai C, Liu C, Zhou X: Algorithm to Identify Frequent Coupled Modules
from Two-Layered Network Series: Application to Study Transcription and
Splicing Coupling. Journal of Computational Biology 2012, 19(6):710–730.
[87] Barabasi A, Oltvai Z: Network biology: understanding the cell’s functional
organization. Nature Reviews Genetics 2004, 5(2):101–113.
[88] Cancer Genome Atlas Research Network: Comprehensive genomic character-
ization defines human glioblastoma genes and core pathways. Nature 2008,
455(7216):1061–1068.
[89] Weinstein JN, Myers TG, O’Connor PM, Friend SH, Fornace J A J, Kohn KW, Fojo
T, Bates SE, Rubinstein LV , Anderson NL, Buolamwini JK, van Osdol WW, Monks
AP, Scudiero DA, Sausville EA, Zaharevitz DW, Bunow B, Viswanadhan VN, John-
son GS, Wittes RE, Paull KD: An information-intensive approach to the molecu-
lar pharmacology of cancer. Science (New York, N.Y.) 1997, 275(5298):343–349.
[90] Bussey KJ, Chin K, Lababidi S, Reimers M, Reinhold WC, Kuo W, Gwadry F, Ajay,
Kouros-Mehr H, Fridlyand J, Jain A, Collins C, Nishizuka S, Tonon G, Roschke A,
Gehlhaus K, Kirsch I, Scudiero DA, Gray JW, Weinstein JN: Integrating data on
DNA copy number with gene expression levels and drug sensitivities in the
NCI-60 cell line panel. Molecular Cancer Therapeutics 2006, 5(4):853–867.
[91] Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L, Kohn KW, Reinhold
WC, Myers TG, Andrews DT, Scudiero DA, Eisen MB, Sausville EA, Pommier Y ,
Botstein D, Brown PO, Weinstein JN: A gene expression database for the molec-
ular pharmacology of cancer. Nature Genetics 2000, 24(3):236–244.
[92] Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee
JK, Reinhold WO, Weinstein JN, Mesirov JP, Lander ES, Golub TR: Chemosen-
sitivity prediction by transcriptional profiling. Proceedings of the National
Academy of Sciences of the United States of America 2001, 98(19):10787–10792.
[93] Shankavaram UT, Reinhold WC, Nishizuka S, Major S, Morita D, Chary KK,
Reimers MA, Scherf U, Kahn A, Dolginow D, Cossman J, Kaldjian EP, Scudiero
DA, Petricoin E, Liotta L, Lee JK, Weinstein JN: Transcript and protein expres-
sion profiles of the NCI-60 cancer cell panel: an integromic microarray study.
Molecular Cancer Therapeutics 2007, 6(3):820–832.
122
[94] Blower PE, Verducci JS, Lin S, Zhou J, Chung J, Dai Z, Liu C, Reinhold W,
Lorenzi PL, Kaldjian EP, Croce CM, Weinstein JN, Sadee W: MicroRNA expres-
sion profiles for the NCI-60 cancer cell panel. Molecular Cancer Therapeutics
2007, 6(5):1483–1491.
[95] Rosonina E, Blencowe B: Gene expression: The close coupling of transcription
and splicing. Current Biology 2002, 12(9):319–321.
[96] Fong Y , Zhou Q: Stimulatory effect of splicing factors on transcriptional elon-
gation. Nature 2001, 414(6866):929–933.
[97] Kornblihtt A, de la Mata M, Fededa J, Munoz M, Nogues G: Multiple links
between transcription and splicing. RNA 2004, 10(10):1489–1498.
[98] Pandit S, Wang D, Fu X: Functional integration of transcriptional and RNA
processing machineries. Current Opinion in Cell Biology 2008, 20(3):260–265.
[99] C´ eaceres JF, Kornblihtt AR: Alternative splicing: multiple control mechanisms
and involvement in human disease. Trends in Genetics: TIG 2002, 18(4):186–
193.
[100] Lin S, Coutinho-Mansfield G, Wang D, Pandit S, Fu X: The splicing factor SC35
has an active role in transcriptional elongation. Nature Structural & Molecular
Biology 2008, 15(8):819–826.
[101] Br` es V , Gomes N, Pickle L, Jones K: A human splicing factor, SKIP, associates
with P-TEFb and enhances transcription elongation by HIV-1 Tat. Genes &
Development 2005, 19(10):1211.
[102] Perales R, Bentley D: “Co-transcriptionality”: The Transcription Elongation
Complex as a Nexus for Nuclear Transactions. Molecular Cell 2009, 36(2):178–
191.
[103] Mu˜ noz MJ, Santangelo MSP, Paronetto MP, de la Mata M, Pelisch F, Boireau S,
Glover-Cutter K, Ben-Dov C, Blaustein M, Lozano JJ, Bird G, Bentley D, Bertrand
E, Kornblihtt AR: DNA damage regulates alternative splicing through inhibition
of RNA polymerase II elongation. Cell 2009, 137(4):708–720.
[104] Mapendano CK, Lykke-Andersen S, Kjems J, Bertrand E, Jensen TH: Crosstalk
between mRNA 3’ end processing and transcription initiation. Molecular Cell
2010, 40(3):410–422.
123
[105] Barboric M, Lenasi T, Chen H, Johansen EB, Guo S, Peterlin BM: 7SK snRNP/P-
TEFb couples transcription elongation with alternative splicing and is essential
for vertebrate development. Proceedings of the National Academy of Sciences of
the United States of America 2009, 106(19):7798–7803.
[106] Burkard R, Dell’Amico M, Martello S: Assignment Problems. SIAM 2009.
[107] Li W, Liu CC, Zhang T, Li H, Waterman MS, Zhou XJ: Integrative Analysis of
Many Weighted Co-Expression Networks Using Tensor Computation. PLoS
Comput Biol 2011, 7(6):e1001106.
[108] Ding C, Zhang Y , Li T, Holbrook SR: Biclustering Protein Complex Interactions
with a Biclique Finding Algorithm. In Proceedings of the 6th International Con-
ference on Data Mining, Hong Kong, China 2006.
[109] Maslov S, Sneppen K: Specificity and Stability in Topology of Protein Networks.
Science 2002, 296(5569):910–913.
[110] Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K,
Dwight S, Eppig J, et al.: Gene ontology: tool for the unification of biology. The
Gene Ontology Consortium. Nat Genet 2000, 25:25–9.
[111] Reed R: Coupling transcription, splicing and mRNA export. Current Opinion in
Cell Biology 2003, 15(3):326–331.
[112] Lindstrom D, Squazzo S, Muster N, Burckin T, Wachter K, Emigh C, McCleery
J, Yates III J, Hartzog G: Dual roles for Spt5 in pre-mRNA processing and
transcription elongation revealed by identification of Spt5-associated proteins.
Molecular and cellular biology 2003, 23(4):1368.
[113] Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G,
Frishman G, Montrone C, Mewes H: CORUM: the comprehensive resource
of mammalian protein complexes–2009. Nucl. Acids Res. 2010, 38(Database
issue):D497–D501.
[114] Zhang C, Dowd D, Staal A, Gu C, Lian J, Van Wijnen A, Stein G, MacDon-
ald P: Nuclear coactivator-62 kDa/Ski-interacting protein is a nuclear matrix-
associated coactivator that may couple vitamin D receptor-mediated transcrip-
tion and RNA splicing. Journal of Biological Chemistry 2003, 278(37):35325.
124
[115] Gordon S, Akopyan G, Garban H, Bonavida B: Transcription factor YY1: struc-
ture, function, and therapeutic implications in cancer biology. Oncogene 2005,
25(8):1125–1142.
[116] Nevins J: The Rb/E2F pathway and cancer. Human molecular genetics 2001,
10(7):699.
[117] Little C, Nau M, Carney D, Gazdar A, Minna J: Amplification and expression of
the c-myc oncogene in human lung cancer cell lines 1983.
[118] Nair S, Burley S: X-Ray Structures of Myc-Max and Mad-Max Recognizing
DNA: Molecular Bases of Regulation by Proto-Oncogenic Transcription Fac-
tors. Cell 2003, 112(2):193–205.
[119] Suzuki H, Igarashi S, Nojima M, Maruyama R, Yamamoto E, Kai M, Akashi H,
Watanabe Y , Yamamoto H, Sasaki Y , et al.: IGFBP7 is a p53-responsive gene
specifically silenced in colorectal cancer with CpG island methylator pheno-
type. Carcinogenesis 2010, 31(3):342.
[120] Zhou Z, Licklider L, Gygi S, Reed R: Comprehensive proteomic analysis of the
human spliceosome. Nature 2002, 419(6903):182–185.
[121] Rappsilber J, Ryder U, Lamond A, Mann M: Large-scale proteomic analysis of
the human spliceosome. Genome research 2002, 12(8):1231.
[122] Stark C, Breitkreutz B, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID:
a general repository for interaction datasets. Nucleic acids research 2006,
34(suppl 1):D535.
[123] Nayler O, Str¨ atling W, Bourquin J, Stagljar I, Lindemann L, Jasper H, Hartmann
A, Fackelmayer F, Ullrich A, Stamm S: SAF-B protein couples transcription
and pre-mRNA splicing to SAR/MAR elements. Nucleic acids research 1998,
26(15):3542.
[124] Blencowe B, Bowman J, McCracken S, Rosonina E: SR-related proteins and the
processing of messenger RNA precursors. Biochemistry and cell biology 1999,
77(4):277–291.
[125] Sandelin A, Alkema W, Engstrm P, Wasserman WW, Lenhard B: JASPAR:
an open-access database for eukaryotic transcription factor binding profiles.
Nucleic Acids Research 2004, 32(Database issue):D91–94.
125
[126] Tabach Y , Brosh R, Buganim Y , Reiner A, Zuk O, Yitzhaky A, Koudritsky M, Rot-
ter V , Domany E: Wide-scale analysis of human functional transcription factor
binding reveals a strong bias towards the transcription start site. PloS One
2007, 2(8):e807.
[127] Peng R, Hawkins I, Link A, Patton J: The splicing factor PSF is part of a large
complex that assembles in the absence of pre-mRNA and contains all five
snRNPs. RNA biology 2006, 3(2):69.
[128] Garcia-Jurado G, Llanes D, Moreno A, Soria B, Tejedo J: Monoclonal Antibody
That Recognizes a Domain on Heterogeneous Nuclear Ribonucleoprotein K
and PTB-Associated Splicing Factor. Hybridoma 2011, 30:53–59.
[129] Zhou Q, Sharp P: Tat-SF1: cofactor for stimulation of transcriptional elonga-
tion by HIV-1 Tat. Science 1996, 274(5287):605.
[130] Kameoka S, Duque P, Konarska M: p54nrb associates with the 5’ splice
site within large transcription/splicing complexes. The EMBO Journal 2004,
23(8):1782–1791.
[131] Bedford M, Reed R, Leder P: WW domain-mediated interactions reveal a
spliceosome-associated protein that binds a third class of proline-rich motif:
the proline glycine and methionine-rich motif. Proceedings of the National
Academy of Sciences 1998, 95(18):10602.
[132] Matlin A, Clark F, Smith C: Understanding alternative splicing: towards a cel-
lular code. Nature Reviews Molecular Cell Biology 2005, 6(5):386–398.
[133] Ou S, Wu F, Harrich D, Garcia-Martinez L, Gaynor R: Cloning and character-
ization of a novel cellular protein, TDP-43, that binds to human immunod-
eficiency virus type 1 TAR DNA sequence motifs. Journal of virology 1995,
69(6):3584.
[134] Buratti E, Baralle F: Characterization and Functional Implications of the RNA
Binding Properties of Nuclear Factor TDP-43, a Novel Splicing Regulator of
CFTR Exon 9. Journal of Biological Chemistry 2001, 276(39):36337.
[135] Ayala Y , Laura De Conti S, V´ azquez A, Romano M, D’Ambrogio A, Tollervey J,
Ule J, Baralle M, Buratti E, Baralle F: TDP-43 regulates its mRNA levels through
a negative feedback loop. The EMBO Journal 2010.
126
[136] Loflin P, Chen C, Shyu A: Unraveling a cytoplasmic role for hnRNP D in the in
vivo mRNA destabilization directed by the AU-rich element. Genes & develop-
ment 1999, 13(14):1884.
[137] Ayala Y , Zago P, D’Ambrogio A, Xu Y , Petrucelli L, Buratti E, Baralle F: Struc-
tural determinants of the cellular localization and shuttling of TDP-43. Journal
of cell science 2008, 121(22):3778.
[138] Dixon JR, Selvaraj S, Yue F, Kim A, Li Y , Shen Y , Hu M, Liu JS, Ren B: Topo-
logical domains in mammalian genomes identified by analysis of chromatin
interactions. Nature 2012, 485(7398):376–80.
[139] Nora EP, Lajoie BR, Schulz EG, Giorgetti L, Okamoto I, Servant N, Piolot T, van
Berkum NL, Meisig J, Sedat J, Gribnau J, Barillot E, Bluthgen N, Dekker J, Heard
E: Spatial partitioning of the regulatory landscape of the X-inactivation centre.
Nature 2012, 485(7398):381–5.
[140] Sexton T, Yaffe E, Kenigsberg E, Bantignies F, Leblanc B, Hoichman M, Parrinello
H, Tanay A, Cavalli G: Three-dimensional folding and functional organization
principles of the Drosophila genome. Cell 2012, 148(3):458–72.
[141] Li G, Ruan X, Auerbach RK, Sandhu KS, Zheng M, Wang P, Poh HM, Goh Y , Lim
J, Zhang J, Sim HS, Peh SQ, Mulawadi FH, Ong CT, Orlov YL, Hong S, Zhang Z,
Landt S, Raha D, Euskirchen G, Wei CL, Ge W, Wang H, Davis C, Fisher-Aylor
KI, Mortazavi A, Gerstein M, Gingeras T, Wold B, Sun Y , Fullwood MJ, Cheung
E, Liu E, Sung WK, Snyder M, Ruan Y: Extensive promoter-centered chromatin
interactions provide a topological basis for transcription regulation. Cell 2012,
148(1-2):84–98.
[142] Lin YC, Benner C, Mansson R, Heinz S, Miyazaki K, Miyazaki M, Chandra V ,
Bossen C, Glass CK, Murre C: Global changes in the nuclear positioning of genes
and intra- and interdomain genomic interactions that orchestrate B cell fate.
Nat Immunol 2012, 13(12):1196–204.
[143] Shen Y , Yue F, McCleary DF, Ye Z, Edsall L, Kuan S, Wagner U, Dixon J, Lee
L, Lobanenkov VV , Ren B: A map of the cis-regulatory sequences in the mouse
genome. Nature 2012, 488(7409):116–20.
[144] Sanyal A, Lajoie BR, Jain G, Dekker J: The long-range interaction landscape of
gene promoters. Nature 2012, 489(7414):109–13.
127
[145] Handoko L, Xu H, Li G, Ngan CY , Chew E, Schnapp M, Lee CW, Ye C, Ping
JL, Mulawadi F, Wong E, Sheng J, Zhang Y , Poh T, Chan CS, Kunarso G, Sha-
hab A, Bourque G, Cacheux-Rataboul V , Sung WK, Ruan Y , Wei CL: CTCF-
mediated functional chromatin interactome in pluripotent cells. Nat Genet
2011, 43(7):630–8.
[146] Junier I, Dale RK, Hou C, Kepes F, Dean A: CTCF-mediated transcriptional reg-
ulation through cell type-specific chromosome organization in the beta-globin
locus. Nucleic Acids Res 2012, 40(16):7718–27.
[147] Drissen R, Palstra RJ, Gillemans N, Splinter E, Grosveld F, Philipsen S, de Laat W:
The active spatial organization of the beta-globin locus requires the transcrip-
tion factor EKLF. Genes Dev 2004, 18(20):2485–90.
[148] Vakoc CR, Letting DL, Gheldof N, Sawado T, Bender MA, Groudine M, Weiss
MJ, Dekker J, Blobel GA: Proximity among distant regulatory elements at the
beta-globin locus requires GATA-1 and FOG-1. Mol Cell 2005, 17(3):453–62.
[149] Song SH, Hou C, Dean A: A positive role for NLI/Ldb1 in long-range beta-
globin locus control region function. Mol Cell 2007, 28(5):810–22.
[150] Bantignies F, Roure V , Comet I, Leblanc B, Schuettengruber B, Bonnet J, Tixier
V , Mas A, Cavalli G: Polycomb-dependent regulatory contacts between distant
Hox loci in Drosophila. Cell 2011, 144(2):214–26.
[151] Osborne CS, Chakalova L, Brown KE, Carter D, Horton A, Debrand E, Goyenechea
B, Mitchell JA, Lopes S, Reik W, Fraser P: Active genes dynamically colocalize
to shared sites of ongoing transcription. Nat Genet 2004, 36(10):1065–71.
[152] Cook PR: A model for all genomes: the role of transcription factories. J Mol
Biol 2010, 395:1–10.
[153] Tanizawa H, Iwasaki O, Tanaka A, Capizzi JR, Wickramasinghe P, Lee M, Fu
Z, Noma K: Mapping of long-range associations throughout the fission yeast
genome reveals global genome organization linked to transcriptional regula-
tion. Nucleic Acids Res 2010, 38(22):8164–77.
[154] Schoenfelder S, Sexton T, Chakalova L, Cope NF, Horton A, Andrews S, Kurukuti
S, Mitchell JA, Umlauf D, Dimitrova DS, Eskiw CH, Luo Y , Wei CL, Ruan Y ,
Bieker JJ, Fraser P: Preferential associations between co-regulated genes reveal
a transcriptional interactome in erythroid cells. Nat Genet 2010, 42:53–61.
128
[155] Ben-Elazar S, Yakhini Z, Yanai I: Spatial localization of co-regulated genes
exceeds genomic gene clustering in the Saccharomyces cerevisiae genome.
Nucleic Acids Res 2013.
[156] Kagey MH, Newman JJ, Bilodeau S, Zhan Y , Orlando DA, van Berkum NL,
Ebmeier CC, Goossens J, Rahl PB, Levine SS, Taatjes DJ, Dekker J, Young RA:
Mediator and cohesin connect gene expression and chromatin architecture.
Nature 2010, 467(7314):430–5.
[157] Xu M, Cook PR: Similar active genes cluster in specialized transcription facto-
ries. J Cell Biol 2008, 181(4):615–23.
[158] Hou C, Corces VG: Throwing transcription for a loop: expression of the genome
in the 3D nucleus. Chromosoma 2012, 121(2):107–16.
[159] Brickner DG, Ahmed S, Meldi L, Thompson A, Light W, Young M, Hickman TL,
Chu F, Fabre E, Brickner JH: Transcription factor binding to a DNA zip code
controls interchromosomal clustering at the nuclear periphery. Dev Cell 2012,
22(6):1234–46.
[160] Mitchell JA, Fraser P: Transcription factories are nuclear subcompartments
that remain in the absence of transcription. Genes Dev 2008, 22:20–5.
[161] Xu M, Cook PR: The role of specialized transcription factories in chromosome
pairing. Biochim Biophys Acta 2008, 1783(11):2155–60.
[162] Berdasco M, Alcazar R, Garcia-Ortiz MV , Ballestar E, Fernandez AF, Roldan-
Arjona T, Tiburcio AF, Altabella T, Buisine N, Quesneville H, Baudry A, Lepiniec
L, Alaminos M, Rodriguez R, Lloyd A, Colot V , Bender J, Canal MJ, Esteller M,
Fraga MF: Promoter DNA hypermethylation and gene repression in undiffer-
entiated Arabidopsis cells. PLoS One 2008, 3(10):e3306.
[163] Hediger F, Neumann FR, Van Houwe G, Dubrana K, Gasser SM: Live imaging of
telomeres: yKu and Sir proteins define redundant telomere-anchoring path-
ways in yeast. Curr Biol 2002, 12(24):2076–89.
[164] Crabbe L, Cesare AJ, Kasuboski JM, Fitzpatrick JA, Karlseder J: Human telomeres
are tethered to the nuclear envelope during postmitotic nuclear assembly. Cell
Rep 2012, 2(6):1521–9.
129
[165] Kress C, Ballester M, Devinoy E, Rijnkels M: Epigenetic modifications in 3D:
nuclear organization of the differentiating mammary epithelial cell. J Mam-
mary Gland Biol Neoplasia 2010, 15:73–83.
[166] Ruault M, Dubarry M, Taddei A: Re-positioning genes to the nuclear envelope in
mammalian cells: impact on transcription. Trends Genet 2008, 24(11):574–81.
[167] Brickner JH: Transcriptional memory at the nuclear periphery. Curr Opin Cell
Biol 2009, 21:127–33.
[168] Chuang CH, Carpenter AE, Fuchsova B, Johnson T, de Lanerolle P, Belmont AS:
Long-range directional movement of an interphase chromosome site. Curr Biol
2006, 16(8):825–31.
[169] Dundr M, Ospina JK, Sung MH, John S, Upender M, Ried T, Hager GL, Matera
AG: Actin-dependent intranuclear repositioning of an active gene locus in vivo.
J Cell Biol 2007, 179(6):1095–103.
[170] Brown SJ, Stoilov P, Xing Y: Chromatin and epigenetic regulation of pre-mRNA
processing. Hum Mol Genet 2012, 21(R1):R90–6.
[171] He Y , Fang J, Taatjes DJ, Nogales E: Structural visualization of key steps in
human transcription initiation. Nature 2013.
[172] Wong LH, Choo KH: Evolutionary dynamics of transposable elements at the
centromere. Trends Genet 2004, 20(12):611–6.
130
Asset Metadata
Creator
Dai, Chao (author)
Core Title
Integrating high-throughput sequencing data to study gene regulation
Contributor
Electronically uploaded by the author
(provenance)
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
10/23/2015
Defense Date
06/24/2013
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
ChIP-seq,gene regulation,OAI-PMH Harvest,RNA-seq,spatial genome organization
Format
application/pdf
(imt)
Language
English
Advisor
Zhou, Xianghong Jasmine (
committee chair
), Alber, Frank (
committee member
), Liu, Yan (
committee member
), Waterman, Michael S. (
committee member
)
Creator Email
chaodai@usc.edu,daichao07@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-340117
Unique identifier
UC11295220
Identifier
etd-DaiChao-2113.pdf (filename),usctheses-c3-340117 (legacy record id)
Legacy Identifier
etd-DaiChao-2113.pdf
Dmrecord
340117
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Dai, Chao
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Abstract (if available)
Abstract
High-throughput sequencing is a powerful technique for gene regulation study, which can provide information about isoform expression as well as transcription factors / epigenetic factors binding signal measurements. Chromosome conformation capture followed by high-throughput sequencing were recently applied to study spatial genome organization and shed light on functional associations between genome organization and gene regulation. By integrating genome wide high-throughput sequencing data, this dissertation investigates gene regulation in three different layers: co-splicing
Tags
ChIP-seq
gene regulation
RNA-seq
spatial genome organization
Linked assets
University of Southern California Dissertations and Theses