Close
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
3D modeling of eukaryotic genomes
(USC Thesis Other)
3D modeling of eukaryotic genomes
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
3D Modeling of Eukaryotic Genomes
!
Ph.D. Dissertation
Submitted by
Ke Gong
Computational Biology & Bioinformatics Ph.D. Program, Department of Biological
Sciences
University of Southern California
December 2015
Ph.D. Dissertation Committee
Dr. Frank Alber (Chair)
Dr. Aiichiro Nakano
Dr. Remo Rohs
ii
Acknowledgement:
I would like to acknowledge everyone who has helped me during my doctoral studies.
Above all, I would like to gratefully acknowledge the guidance, everlasting support and
tremendous encouragement of my doctoral advisor, Dr. Frank Alber. I appreciate all
your time, ideas, efforts and funding to my researches in the past six years. And thank
you for leading me to the genome-modeling world to explore the beauty of biological
science.
I would also like to express my gratitude to Dr. Remo Rohs, Dr. Aiichiro Nakano as my
thesis committee. Thank you for guiding my academic growth and development, and for
your constant support in my Ph.D pursuing process.
My appreciations also extend to Dr. Xianghong Jasmine Zhou, for your insightful
comments and suggestions on my projects; Dr. Michael S. Waterman, for your valuable
advice and encouraging words in the past few years.
I also want to thank everyone that has helped me in my Ph.D life, including: Dr. Harianto
Tjong, Haochen Li, Zachary Frazier, Long Pei, Dr. Tianyin Zhou, Dr. Wenyuan Li, Chao
Dai, Qingjiao Li, Nan Hua, Dr. Cheyu Lee, Lin Yang.
iii
Last but not least, I thank my wife Qian Tang, for always being by my side, loving,
supporting, and believing in me in the past few years. Marring you is the greatest
achievement I have ever done in my life!
iv
Table of Contents
Acknowledgement: ......................................................................................................... ii
Abstract ........................................................................................................................... x
Chapter 1. Background: Chromatin Conformation Capture and Genome
Structure Modeling Method ........................................................................................... 1
1.1 Chromatin Conformation Capture experiment can identify contact frequency
between genomic regions ...................................................................................................... 1
1.2 Reconstruction of 3D genome structures ................................................................... 4
1.2.1 Consensus structure modeling method ..................................................................... 5
1.2.2 Population-based structure modeling method ........................................................... 6
Chapter 2. The Budding Yeast Genome Modeling Reveals Structural Principle of
Genomic Organization ................................................................................................... 7
2.1 Introduction .................................................................................................................... 7
2.2 Methodology ................................................................................................................... 8
2.3 Results .......................................................................................................................... 12
2.3.1 Chromosome territories as a result of constrained random encounters .................. 12
2.3.2 Genome-wide chromosome contact patterns .......................................................... 14
2.3.3 Intrachromosomal locus–locus interactions ............................................................. 17
2.3.4 Interchromosomal locus-locus contacts .................................................................. 19
2.3.5 Gene localizations ................................................................................................... 20
2.3.6 Pairwise telomere distances .................................................................................... 22
2.3.7 Telomere clusters .................................................................................................... 24
2.3.8 Co-localization of functionally related loci ................................................................ 24
2.4 Summary ....................................................................................................................... 27
v
Chapter 3. Comparative Genomic Structural Study of the Fission Yeast and the
Budding Yeast .............................................................................................................. 27
3.1 Introduction .................................................................................................................. 27
3.2 Materials and Methods ................................................................................................ 30
3.3 Results and Discussion .............................................................................................. 38
3.3.1 Fission yeast genome structure ............................................................................... 38
3.3.2 Genome structure comparison between fission and budding yeast ........................ 44
3.4 Conclusions ................................................................................................................. 55
Chapter 4. Hi-Corrector: a Fast, Scalable and Memory-Efficient Package for
Normalizing Large-Scale Hi-C Data ............................................................................ 57
4.1 Introduction .................................................................................................................. 57
4.2 Algorithms and implementation ................................................................................. 59
4.3 Results .......................................................................................................................... 61
4.4 Conclusion ................................................................................................................... 63
Chapter 5. A Fast and Robust Method for Identifying Genomic Topological
Associated Domains .................................................................................................... 64
5.1 Introduction .................................................................................................................. 64
5.2 Methods ........................................................................................................................ 66
5.2.1 Generating binSignal ............................................................................................... 67
5.2.2 Optimal approximation of binsignal and detect local minimums .............................. 69
5.2.3 Statistical filtering of false positives ......................................................................... 70
5.3 Results .......................................................................................................................... 71
5.3.1 Algorithm performance ............................................................................................ 71
5.3.2 Correlation based comparison with the reported TDs in previous research ............ 74
5.3.3 Epigenetic characteristics of Topological Domains ................................................. 75
vi
5.3.4 Our unique boundaries share the similar epigenetic characteristic with common
boundaries in two methods ................................................................................................. 78
5.4 Discussion and Conclusion ........................................................................................ 81
Chapter 6. High Resolution Human Genome Structures with Population-based
Modeling Approach ...................................................................................................... 83
6.1 Introduction .................................................................................................................. 83
6.2 Methods and Materials ................................................................................................ 87
6.2.1 Pre-processing of Hi-C Contact Data ...................................................................... 88
6.2.2 Contact Probability Generation ................................................................................ 88
6.2.3 Probabilistic model and problem formulation of the structure population ................ 89
6.2.4 Distance threshold for restraint assignment ............................................................ 94
6.2.5 Methods improvement for TAD resolution models .................................................. 95
6.3 Results and Discussion .............................................................................................. 99
6.3.1 Overview of the new population based modeling methods ..................................... 99
6.3.2 Estimation of redundant constraints through A/M steps improve the obtained
structures .......................................................................................................................... 102
6.3.3 Validation of models with TCC contact profile ....................................................... 103
6.3.4 Validation of model with known structure information ........................................... 106
6.3.5 Structural properties of chromosomes with different chromatin state .................... 109
6.3.6 Radial position affects the replication timing ......................................................... 112
6.3.7 Replicate analysis to show the robustness of the method ..................................... 113
References .................................................................................................................. 115
vii
List of Figures
Figure 1-1! The overview of Hi-C ,TCC and in situ Hi-C method .................................... 4!
Figure 2-1! Population-based analysis of the budding yeast genome organization. ..... 10!
Figure 2-2! Chromosome locations of the budding yeast genome ............................... 13!
Figure 2-3! Chromosome and gene loci interactions. ................................................... 16!
Figure 2-4! Chromosome folding patterns .................................................................... 18!
Figure 2-5! Gene territories. .......................................................................................... 21!
Figure 2-6! Median telomere-telomere distances in the structure population. .............. 23!
Figure 2-7! Spatial clustering of replication origins and tRNA gene loci. ...................... 25!
Figure 3-1! Fission yeast genome structures calculation. ............................................. 32!
Figure 3-2! Assessment of the structure population with independent experimental data
not included as the input information in the calculations. ........................................ 42!
Figure 3-3! Chromosome and gene territories analysis of fission yeast and budding
yeast 47!
Figure 3-4! Interaction specificity and gene localization comparison between fission
yeast and budding yeast. ........................................................................................ 49!
Figure 3-5! Functional related genes tend to be spatially clustered in both fission yeast
and budding yeast. .................................................................................................. 52!
Figure 5-1! Definition of binSignal. ................................................................................ 68!
Figure 5-2! Enrichment plots of five epigenetic marks for two mouse cells. ................. 73!
viii
Figure 5-3! PCC and wPCC boxplots of TDs generated by different methods on cortex
and hESC cell line. .................................................................................................. 77!
Figure 5-4! Boundary analysis of different method ....................................................... 79!
Figure 5-5 ! Epigenetic profiles near domain boundaries for two mouse cells. ............. 81!
Figure 6-1! Workflow of the modeling process from data preprocessing to modeling
procedure ................................................................................................................ 87!
Figure 6-2! Schematic of population-based genome structure modeling ...................... 91!
Figure 6-3! The estimation of redundant constraints helps to get better structures. ... 103!
Figure 6-4! The output contact property matches with the input TCC contact
probability matrix. .................................................................................................. 105!
Figure 6-5! Structural feature analysis of the population. ........................................... 107!
Figure 6-6! Epigenetics of different chromatin states .................................................. 110!
Figure 6-7! Location preference of different chromatin states. ................................... 111!
Figure 6-8! Replication timing with radial positions and chromatin states .................. 113!
Figure 6-9! Method robustness analyses .................................................................... 114!
ix
List of Tables
Table 2-1! Functional forms of the restraints in the scoring function .............................. 9!
Table 3-1! Geometric constraints and their functional forms. ....................................... 34!
Table 3-2! Constraints applied for different models. ..................................................... 36!
Table 3-3! Nuclear accessibility for orthologous genes in the fission yeast and the
budding yeast. ......................................................................................................... 48!
Table 3-4! Clustering properties of genes in 51 GO categories. ................................... 55!
Table 4-1! Running time of three algorithms on 10K and 20K bp resolution Hi-C data 62!
Table 5-1! Comparison of our method and Dixon’s method. ........................................ 73!
x
Abstract
The 3D structural organization of the genome plays a key role in nuclear
functions such as gene expression and DNA replication. Thanks to the recent
development of genome-wide 3-C based proximity ligation assays (Hi-C and its variant
TCC), close chromatin contacts can now be identified at unprecedented resolution.
Although these methods have provided new insight into genome organization, it is
important to remember that they measure only the relative frequencies of chromosome
interactions averaged over a large population of cells. However, individual 3D genome
structures vary dramatically from cell to cell even within an isogenic sample. This
structural variability poses a great challenge to the interpretation of population-averaged
contact intensities obtained from Hi-C experiments. To explicitly model the variability
between cells, we proposed a concept of population-based Hi-C structure modeling
method by generating an ensemble of 3D genome structures consistent with the data.
For small eukaryotic genomes like budding yeast and fission yeast, we
developed a random encounter model to generate yeast genome structure populations
without imposing any specific contact constraints but only known nuclear landmarks.
The structure patterns from the model population reproduced the Hi-C experiment, FISH
observations and other experiments. The random encounter model revealed the
genome spatial organizing principle of both budding yeast and fission yeast genomes.
Even through the two yeasts have different chromosome length and gene locations, the
two yeasts revealed a high similarity in terms of genes 3D clustering and expression
relationship. This result suggested a strong stability in gene spatial localization through
evolution.
xi
Hi-C contact data in large eukaryotic genomes contains more structural features
than small eukaryotic genomes. To facilitate the processing and detecting of structure
features, we have implemented a memory efficient bias-removal application for data
preprocessing, and designed a new contact pattern detection algorithm to identify
important structure units, called topological domains.
To model large eukaryotic genomes like human and mouse, due to the
complexity in the structures, we developed a genome structure modeling approach by
incorporating the Hi-C data in a population-based manner for the first time.
Chromosome location and loci distance features predicted from the structure population
are in a remarkably agreement with FISH experiments. An improved modeling method
is developed by introducing an iterative-based assignment-modeling (AM) algorithm.
This new model can help us study the genomes at a resolution 5 times higher than the
previous human genome model (~100kb-1Mb resolution). The obtained structure
populations show many important structural features, including the relationship between
structures and replication activities and the formation of function sub-compartments in
the nucleus.
In summary, my work on the eukaryotic genome modeling reveals genome
organization principles, and provides a new direction to study the relationship between
genome structures and gene functions.
1
Chapter 1. Background: Chromatin Conformation
Capture and Genome Structure Modeling Method
It is widely accepted that eukaryotic genome structures are not randomly
organized in the nucleus. The intra-nuclear positions of genes are highly correlated with
a variety of nuclear activities, including replication and transcription. For instance, in
budding yeast and fission yeast, heterochromatic regions such as telomeres and silent
mating-type loci are silenced by anchoring them to the nuclear envelope (NE),
presumably through heterochromatin protein factors, while centromeres are clustered in
a distinct region of the nucleus [1-10]. For more complex genomes, such as mouse and
human, previous work has described a compartmentalized nuclear architecture based
on chromosome territories [11-14].
1.1 Chromatin Conformation Capture experiment can identify
contact frequency between genomic regions
Several experimental methods have been developed to elucidate information
about the 3D genome structure. Fluorescence in situ hybridization (FISH) is designed to
analyze specific DNA sequence locations in the nucleus [15-18]. Recently, researchers
have developed chromosome conformation capture methods (3C, 4C, 5C, Hi-C and
TCC), which give insight to the physical interaction map between genomic regions at a
relatively high resolution [9, 19-24]. Among those methods, Hi-C and TCC are focusing
on genome wide interaction features rather than local interaction features.
2
The high-throughput approach, Hi-C, can be used to collect the spatially close
genomic regions from a pool of cells[22]. The experiment uses formaldehyde to cross-
link DNA to associated proteins, and bind proteins together when they are close to each
other (Fig 1-1A). This step will result in spatially fixed DNA-protein-protein-DNA
structures, which allows capturing of those DNA segments that are close in 3D space.
Restriction enzymes are used to cut DNA into small fragments in the fixed nucleus. .
Re-ligation of DNA is then performed, and newly formed DNA-ligation sites are then
labeled, specifically extracted and sheared to generate a sequencing library which will
capture those DNA segment that are originally close in the cells. TCC method is very
similar to Hi-C in the experimental design to detect the genomic region interactions
[24](Fig 1-1B). However, to decrease the amount of random ligations between DNA
fragments in solution, TCC performs the ligation on solid substrates, while the ligation is
done in the solution in Hi-C experiment. This step has been shown to increase the
signal-to-noise ratio significantly and facilitate the study of intra- and inter- chromosomal
interactions with higher accuracy. In situ Hi-C method is based on the Hi-C method to
detect the genomic interactions when the nucleus remains intact[25]. The procedure
involves using formaldehyde to crosslink the cell without breaking the nucleus envelope
(Fig 1-1C).
Due to inherent biases in the nature of the chromosomal conformation capture
methods, a bias removal procedure is necessary to process the contact data, which is
referred to “normalization method”. GC content, fragment length and also mappability
will need to be corrected to obtain more accurate contact frequencies between genomic
regions [26]. Imakaev proposed an iterative correction method based on an assumption
3
that each bin of genomic regions has the same “visibility”, called equal visibility model
[27]. This assumption will further result in the row sum of a contact frequency matrix to
be the same for all bins. In a recent paper, Rao et al. proposed to use another matrix
balancing method to remove the biases, called KR normalization [25].
4
Figure 1-1 The overview of Hi-C ,TCC and in situ Hi-C method
(A) The experimental procedure of Hi-C method. Figure is originated from Lieberman et
al[22] (B) The experimental procedure of TCC method. Figure is originated from Kalhor
et al[24] (C) The experimental procedure of in situ Hi-C method. Figure is originated from
Rao et al[25]
1.2 Reconstruction of 3D genome structures
The Chromosome Conformation Capture data describes probability of a pairwise
interaction between two genomic regions in a population of cells. However, the 2D
pairwise contact map itself lacks a representation of the higher order chromatin
structures in three-dimensional space. Features, like chromatin clustering of multiple
5
loci and nuclear location preferences, can only be revealed in the three-dimensional
structures. However, reconstruction of genome structures from 2D contact map is not
an easy task. The nature of Hi-C/TCC experiment can only represent population-
averaged contact information over a pool of cells rather than a single structure. Also the
conformation capture data cannot distinguish between the interactions formed by the
two alleles in a diploid genome. To reconstruct the 3D genome structures, two different
modeling approaches have been proposed to generate 3D chromatin structures
reproducing the contact frequency data. The consensus modeling approaches try to
construct a single or multiple structures each of these minimizing the difference
between the genome structures and a structural interpretation of the entire experimental
contact map. The population-based approaches assume that the individual structures in
a population can be heterogeneous and highly variable, and that the overall contacts in
all structures of a population cumulatively reproduce the contact data generated from a
population of cells in the experiment..
1.2.1 Consensus structure modeling method
Duan presented the first 3D genome conformation with Hi-C data incorporated
[23]. In his study of budding yeast genome structure, contact information is first
converted into distance restraints assuming 130bp of packed chromatin has length of
1nm. All the constraints including some other known biological facts are integrated into
a set of equations (i.e. restraints), which define an optimization problem. However, it is
impossible to satisfy all restraints simultaneously in a single structure and the goal of
the optimization is to obtain an optimum structure violating as few restraints as possible.
6
In the study of the fission yeast genome, Tanizawa applied a similar methodology to
model a single optimized 3D genome structure [9]. The MDS[28] and ShRec3D[29]
method are proposed as a more generalized method to optimize the structures based
on a distance matrix between points.
Two other methods have been proposed based on a probabilistic model rather
than a distance model. MCMC5C method is based on MCMC (Markov Chain Monte
Carlo) sampling by moving each point (genomic region) within a certain distance based
on a probability inferred from the contact frequency data [30]. BACH is designed using a
more complex probabilistic method to reconstruct the chromatin structures from the
contact frequency data, using Bayesian inference [31].
However, the major limitation of these methods is that the generated consensus
structures do not represent single instances of actual genome structures and cannot
capture the variable nature of long-range and trans chromatin interactions in different
structural states. Further underlining this problem, no single 3D model from these
approaches can simultaneously satisfy all the physical contacts measured by the Hi-C
experiments.
1.2.2 Population-based structure modeling method
The first population-based modeling method was presented by our lab [24, 32].
Compared to the consensus structure modeling method, population-based method is
improved in two areas. 1) Following observations in experiments, a heterogeneous
population of structures is generated that together obey the derived constraints, which
allows exploration of the dynamic and variable nature of genome structure and the
detection of genome structural states in a population of cells. 2) With HiC or TCC
7
contact data incorporated, the method can be applied to large and complex genomes.
For example, in the human lymphoblastoid genome structure modeling, we identified
adjacent genome regions that are similar in their contact behavior and are therefore
represented as structural domains. This step will significantly reduce the number of
simulated beads, compared to a model with fixed sized beads, and accelerate the
simulations in the case of large genomes [24].
Different from a constrained population-based model, MonteGrappa is a method
proposed in Giorgetti et al[33] using a probabilistic model. The method generated a
population of structures from a series of MCMC sampled structures. However, this
method is now only been applied to short range of interactions (regions of only ~1Mb)
and it is not possible to model entire genomes. The sampling could take way much
longer time for a genome-wide application.
Chapter 2. The Budding Yeast Genome Modeling
Reveals Structural Principle of Genomic Organization
2.1 Introduction
In budding yeast, centromeres are attached to the spindle pole body (SPB)
through microtubules while telomeres prefer to locate at the nuclear periphery [1, 2, 34,
35]. The rDNA region, which produces ribosomal RNA, is confined into a nucleolus with
other protein complexes, which is located on the opposite side of the nucleus from the
SPB [36-38]. These observations are also called nuclear landmarks for budding yeast.
8
In addition, FISH experiments have suggested a “Rabl-like” chromosome conformation,
which refers to the back-folding of sub-centromeric chromosome regions so that
chromosome arms appear juxtaposed [19, 35, 38, 39].
The first Hi-C experiment for the budding yeast genome was carried out in 2009
[23]. The genome-wide interaction matrix is highly organized with cross-shaped intra-
chromosomal interaction patterns for each chromosome.
To fairly assess the principles of chromosome folding and the possible role of
molecular interactions in establishing nuclear order, we must first examine the genome
structure that arises when chromosomes are tethered but otherwise randomly
configured in the confinement of the nuclear environment. Nobody to date has
determined if entirely random configurations of tethered chromosomes are sufficient to
reproduce in a statistical manner all the available quantitative data about the yeast
genome organization and gene loci interactions, including the highly structured contact
frequency maps from genome-wide conformation capture experiments [19, 23], the
distribution of gene territories from fluorescence imaging [40-42], and the clustering of
replication start sites and tRNA genes.
In order to study the structural principles of budding yeast, we perform genome
structure modeling of the entire budding yeast genome. The model can be considered a
random encounter model with the guidance of only nuclear landmarks.
2.2 Methodology
We model the genome structures as random chromosome configurations that are
subject to the following constraints: (1) All chromosomes are confined in the nucleus; (2)
9
all the centromeres are attached to the SPB through microtubules; (3) all the telomeres
are located near the nuclear periphery; and (4) the nucleolus is inaccessible to
chromosomes, except for those regions containing rDNA repeats.
For our random-encounter model, the nuclear architecture is described by the
nuclear envelope (NE), the spindle pool body (SPB), the nucleolus, and the 16
individual chromosomes in the haploid yeast genome. The positions of the NE, SPB,
and nucleolus remain fixed, while the configurations of the chromosomes are optimized
(Fig 2-1).
The 16 chromosomes are described as flexible chromatin fibers, which in turn
are represented as chains of connecting spheres with a radius of 15 nm. The
compaction ratio is set to six nucleosomes per 11 nm of length. This figure agrees with
other experiments, which have measured the compaction ratio to be between 1.2 and
11 nucleosomes per 11 nm of length [43, 44]. Thus, each bead accommodates 3.2 kb of
genome sequence. The 12 Mb yeast genome is represented by a total of 3779 beads.
Changing the compaction ratio will slightly change the total number of beads but will not
affect the outcome of the calculations at the resolution of our analysis.
Table 2-1 Functional forms of the restraints in the scoring function
10
The nuclear radius is set to 1 micron, as suggested by experiments [15, 45]. The
relative position and size of the SPB and nucleolus are taken from imaging experiments
[15]. The SPB and nucleolus are located at opposite ends of the nucleus, while a central
axis connects the centers of the SPB, nucleus, and nucleolus (Fig 2-1).
Figure 2-1 Population-based analysis of the budding yeast genome
organization.
To analyze structural features of the genome, we defined an optimization problem with
three main components. (Top panels) A structural representation of chromosomes as
flexible chromatin fibers (center), a structural representation of the nuclear architecture
(left), and the scoring function quantifying the genome structure’s accordance with
nuclear landmark constraints (right). (Bottom panels) An optimization and sampling
method, which minimizes the scoring function to generate a population of genome
structures that entirely satisfies all landmark constraints.
The scoring function is defined as a sum of spatial restraints and quantifies the
degree of consistency between the structure and the imposed landmark constraints
11
derived from experimental information. To optimize the structure, the scoring function is
minimized to a score of zero. The scoring function is written as
where !!
!
∈ℜ
!
is the coordinate vector of bead i, and N is the total number of beads in
a model. !,!,!, and ! are subsets of specific beads in the chromosome chains that
share certain properties. More specifically, ! is the set of beads assigned to the last
bead of every chain, ! is the set of beads assigned to centromeres, ! is the set of
beads assigned to telomeres, and ! is the set of all beads flanking rDNA repeat regions.
The optimization is performed using a combination of simulated annealing
molecular dynamics and the conjugate gradient methods implemented in the Integrated
Modeling Platform (IMP; http://www.integrativemodeling.org) [46, 47]. An individual
optimization starts with an entirely random bead configuration, followed by an initial
optimization of the structure. Next, we apply simulated annealing protocols to optimize
the genome configuration. Finally, conjugate gradient optimization ensures that all
constraints are satisfied, leading to a structure with score zero. Many independent
optimizations are carried out to generate a population of 200,000 genome structures
with a total score of zero, hence consistent with all input data. A comparison between
the frequency maps of two independently calculated populations, each with 100,000
structures, showed that our genome structure population is highly reproducible
S r
i
,..,r
N
( )
= U
ch
(r
i
,r
i+1
)
i=1,i∉α
N−1
∑
+ U
exc
(r
i
,r
j
)
j>i
N
∑
i=1
N−1
∑
+ U
nuc
(r
i
)
i=1
N
∑
+ U
cen
i∈β
∑
f (r
i
)+
U
tel
i∈γ
∑
(r
i
)+ U
inu
(r
i
)
i∈δ
∑
+ U
onu
(r
i
)
i∉δ
∑
=0
12
(Pearson’s correlation between the contact frequency maps of the two populations is
0.999).
2.3 Results
2.3.1 Chromosome territories as a result of constrained random
encounters
We first ask to what extent the landmark constraints lead to preferred
chromosome locations. We calculate the probability that each chromosome occupies
any given region of the nucleus (i.e., the localization probability density [LPD] of a
chromosome) (Fig. 2-2B) Based on the LPD, it is evident that all the chromosomes have
preferred regions. Smaller chromosomes reside preferentially around the central axis,
near the SPB. Interestingly medium-sized chromosomes are more likely to reside away
from the central axis (e.g., chromosome 8), while for large chromosomes (e.g.,
chromosome 4, whose size is 1.5 Mb), the LPD is highest in the central region of the
nucleus again along the central axis.
13
Figure 2-2 Chromosome locations of the budding yeast genome
(A) A sample of 40 chromosome configurations from the structure population for the
small chromosome 1 (left panel, blue chains), the large chromosome 4 (middle panel,
green chains) and the medium sized chromosome 8 (right panel, grey chains). The
chromosomes are depicted in the nucleus with the SPB in pink, the nucleolus in dark
blue and the NE in light blue. (B) (Top panels) Chromosome localization probability
densities (LPD) of chromosomes 1 (left), 4 (middle) and 8 (right panel) plotted along the
two principle axis r and z. (Lower left panel) Reference frame for projecting the positions
of chromosome points onto the two principle axis, namely the projection along the
central axis z (connecting SPB, nuclear center and nucleolus), and the radial distance r
axis indicating the absolute distance of a point from the central axis. (C) The LPD of
chromosome 4 resulting from the “single chromosome population”. The chromosome is
subject to all landmark constraints but structures are generated without the presence of
other chromosomes in the nucleus. The density distribution is significantly different from
the situation when all chromosomes are present (see panel B). (D) Excluded volume
effect. The difference map between the LPDs of chromosome 4 from the structure
population when all chromosomes are present (B, middle panel) and the single
chromosome population as defined in (C).
14
We then ask what factors are responsible for the chromosomes’ preferred
locations. For each chromosome, we calculate a new structure population for a nucleus
containing only a single chromosome but otherwise constrained in a manner identical to
the full simulation (i.e., the single chromosome population) (Fig. 2-2C). Comparing the
two structure populations reveals great differences for each chromosome location (Fig.
2-2D). For example, in the full simulation, large chromosomes reside substantially
farther from the SPB region toward the nucleolus than would be expected based on
chromosome tethering alone. The differences are caused by a volume exclusion effect:
Because of tethering, the chromosomes must compete for the limited space around the
SPB. Smaller chromosomes are naturally more restricted to regions closer to the SPB,
which in turn tends to exclude parts of larger chromosomes from these regions. For
smaller chromosomes, the opposite effect is observed; in the full simulation, they exhibit
an increased probability density around the SPB. Importantly, due to the volume
exclusion effect, the preferred location of a chromosome is not defined by tethering
alone but also depends on the total number and lengths of all other chromosomes in the
nucleus.
2.3.2 Genome-wide chromosome contact patterns
Next, we measure how often any two chromosome chains come into contact with
each other over the entire structure population. Interestingly, most chromosomes show
distinct preferences for interacting with certain others. For instance, chromosome 1 has
a significantly higher chance of interacting with chromosomes 3 and 6 than with any
other chromosome. Its interactions with the large chromosomes 4, 7, and 12 are
substantially depleted. Strikingly, almost identical chromosome interaction preferences
15
are observed in an independent genome-wide chromosome conformation capture
experiment (Fig. 2-3A [23]). Pearson’s correlation between the chromosome-pair
contact frequencies in our structure population and those detected in the experiment is
0.94 (P < 10
-15
). In the random control, the contact frequencies do not display any
significant chromosome pair contact preferences (Pearson’s correlation between
experimental data and the random control is 0.57)
16
Figure 2-3 Chromosome and gene loci interactions.
(A) Histogram of the normalized contact frequencies of chromosome 1 with other
chromosomes in the structure population (black bars), chromatin conformation capture
experiment [23](grey bars) and the random control population (white bars). (B)
Comparison of chromosome-arm – chromosome-arm contact frequencies from the
structure population and experiment [23]. (C-D) Contact frequency heat maps for
chromosome arm contacts in the structure population (C) and experiment [23] (D). (E)
Heat maps of the genome-wide contact frequencies between loci at 32 kb resolution
determined from the structure population, and (F) in chromosome conformation capture
experiment [23] (Color code ranges from white for low to red for high values).
Centromere positions are marked by small dark lines. The row-based average
Pearson’s correlation between the two heat maps is 0.94 (all p-values < 10
-6
). The
largest differences between both heat maps involve interactions to the small arm of
chromosome 12, which is not surprising because it contains all of the rDNA genes
located in the nucleolus, which are not explicitly treated in our simulation.
Next, we compare contact frequencies for all possible pairings of the 32
chromosome arms. It is evident that some pairs of chromosome arms have a greater
propensity to interact than others. In particular, chromosome arms with < 500 kb
(chromosomes 1, 3, 5, 6, 8, and 9) are more likely to interact with each other than
longer arms. For instance, the short arm of chromosome 1R is almost eight times more
likely to interact with the short arm of chromosome 3L than with the long arm of 4R. Also
these observations are in almost complete agreement with the conformation capture
experiments (Pearson’s correlation coefficient of 0.93, P < 10
-15
) (Fig. 2-3C,D [23]).
17
Finally, when chromatin contacts are analyzed at a resolution of 32 kb, the
contact frequency heat map of the structure population shows highly organized cross-
shaped patterns. Also these patterns are in excellent agreement with those observed in
the conformation capture experiment [23]. The two contact frequency maps are again
strongly correlated, with an average row-based Pearson’s coefficient of 0.94 (all P-
values <10
-6
). In contrast, the contact frequency map generated from the random
control lacks the cross-shaped patterns. We now analyze the intra and inter-
chromosomal locus-pair interaction patterns in more detail.
2.3.3 Intrachromosomal locus–locus interactions
Intrachromosomal contact patterns in the structure population and experiments
can be divided into three regions (Fig. 2-4A). Contact frequencies are enriched between
regions in the same chromosome arm, as expected for a constrained random polymer
chain (blocks C in Fig. 2-4A). In general, contact frequencies between regions within the
same chromosome arm increase with decreasing sequence separation, which is shown
by the strong diagonal in the contact frequency maps (Fig. 2-4A). However, regions
located close to the centromere behave very differently. Contacts between
subcentromeric regions on opposite sides of the centromere are clearly enriched in
frequency, even with increasing chain distance, as can be seen along the line
perpendicular to the main diagonal of the contact frequency map (block b in Fig. 2-4A).
Moreover, contact frequencies between subcentromeric regions and regions from the
bulk of both chromosome arms are very low (block a in Fig. 2-4A).
18
Figure 2-4 Chromosome folding patterns
(A) heatmaps showing intra-chromosomal contact frequencies for chromosome 4
obtained from conformation capture experiment [23] (top left), structure population (top
right), random control (bottom left), and single chromosome population (bottom right).
The latter is derived from a structure population for a nucleus containing only
chromosome 4, constrained in a manner identical to the full simulation. Heat maps of
the experiment and the structure population show characteristic folding patterns
reminiscent of the back-folding of subcentromeric regions onto themselves. While the
heat maps of the random control and the single chromosome population lack the
characteristic pattern. (B) Scheme showing the particular back-folding of the regions
adjacent to both sides of the centromere for several chromosome configurations.
Similar contact patterns have been reported in 3C conformation capture
experiments and have been explained by a particular Rabl-like style of chromosome
folding [19]. The hypothesis is that regions on opposite sides of the centromere are
folded toward each other, possibly indicating the existence of a biochemical attraction
between loci (Fig. 2-4B). However, the strong agreement between the experimental
contact frequency maps and our structure population demonstrates that such contact
patterns are not necessarily caused by specific biochemically mediated interactions
19
between subcentromeric regions. An equally possible explanation is that they represent
purely random encounters of constrained chromosome chains.
It remains to be determined which factors are most responsible for the folding. In
the ‘‘single chromosome population,’’ the cross-shaped intrachromosomal contact
pattern is lost; the contact frequency map is similar to the random control (Fig. 2-4A,
bottom panels). Therefore the particular folding pattern illustrated in Figure 4B is caused
by a volume exclusion effect as a result of the presence of all 16 chromosomes. The
competition among all centromere-tethered strands for the limited space around the
SPB naturally leads to the style of folding described by experiments, and this folding is
the proximate cause of the enriched contact frequencies between centromeric regions
and the observed shielding of these regions from chromosome arm interactions.
2.3.4 Interchromosomal locus-locus contacts
The interchromosomal contact frequencies in the structure population are
correlated with those observed in experiments, with an average Pearson’s correlation of
0.54, which is highly significant (P < 10
-15
) (Fig. 2-3E,F). In contrast, the Pearson’s
correlation between the random control and experiments is close to nil, and the
distinctive contact patterns in the experimental data are completely absent in the
random control.
To examine the effect of limited sampling on the accuracy of chromosomal
contact patterns, we compared our initial contact frequency map to maps generated
from randomly sampling different proportions of these contacts. In contrast, to
intrachromosomal contacts, the correlations between interchromosomal contact
patterns are greatly affected by limited sampling. At a sampling rate of 0.1%, we find
20
that the Pearson’s correlation between the two interchromosomal contact maps (even
when assuming an ideal model) cannot exceed 0.5. Similar correlation values are
observed in the Hi-C experiment when two interchromosomal contact maps are
compared that are generated by using two different restriction enzymes [26]. In our
analysis, the observed correlation value of 0.54 corresponds to a sampling rate of 0.2%,
which is also the order of magnitude that is expected for the experiment [23]. Thus, the
observed correlation coefficient of 0.54 represents a remarkably good agreement
between the interchromosomal contact patterns, given that the experimental and
computational samplings are finite and cannot be exhaustive.
2.3.5 Gene localizations
We now focus on the nuclear locations of individual gene loci. The locations of
eight genes have been determined by large-scale fluorescence imaging experiments
[15]. These locations are measured with respect to the two principal axes of the nucleus
(Fig. 2-5A). We determined the two-dimensional (2D) density distributions of the same
gene loci in our structure population, allowing for a direct comparison with fluorescence
experiments [15](Fig. 2-5A). The density distribution functions agree well with
experiments, in that each locus occupies a well-defined territory. The volumes and
shapes of these territories strongly resemble those observed in experiments [15]. For
instance, genes GAL2, HMO1, and SNR17A are located near the nucleolus in the
structure population, as seen in the experiment. Interestingly, the structure population
places SNR17B (no experiments available) and SNR17A in similar positions near the
nucleolus, despite the fact that these genes are located on different chromosomes. Both
of these genes are involved in ribosome biogenesis and code the snoRNA U3. Also in
21
agreement with experiments, the distribution patterns of the functionally related genes
RPS5 and RPS20 are quite different. For instance, RPS5 positions are significantly
more diffuse.
In order to compare quantitatively the relative positions of these eight genes, we
measure their median distance along the central axis in the 2D density maps obtained
from experimental data and in the structure population. These positions are in excellent
agreement (Pearson’s correlation is 0.95, P < 10
-3
) (Fig. 2-5B).
Figure 2-5 Gene territories.
(A) Projected localization probability densities for the positions of 8 gene loci in the
structure population. The probability densities are determined with respect to two-
principle axis of the nuclear architecture (top right panel). The z-axis connects the SPB
with the origin at the nuclear center and the nucleolus. The radial axis r defines the
distance of a point from the central axis (top right panel). The lower half of the projected
localization density plot is mirrored from the top half for visual convenience. (B) Median
22
gene loci position along the z-axis calculated from the projected probability localization
densities in (a) and from the experiment [15]. The two are highly correlated with R
2
=0.9.
To allow for comparison with experiment the z-axis distance of a gene locus is
normalized relative to the SPB-gene distance.
2.3.6 Pairwise telomere distances
It is well known that telomeres are not positioned randomly on the nuclear
periphery [1, 38, 48]. Fluorescence imaging has revealed that the distance between any
two sub- telomeres increases gradually with the arm lengths of their chromosomes [48].
For a given sub-telomere, this relationship is linear. In the structure population, we
observe a very similar behavior. More specifically, after applying a change point
analysis, we find that the distance between subtelomere pairs as a function of arm
length is divided into two linear regimes (Fig. 2-6). For chromosome arms with lengths
up to 360 kb, the distances observed in our structure population increase with a
relatively steep slope. Above 360 kb, the slope decreases significantly. This behavior is
entirely consistent with experiments, and the change in slope has been explained as
follows [48]. For small arms, the accessible position of a subtelomere at the NE is
entirely restricted by the arm length. Hence, the median distance between two
subtelomeres increases rapidly with their accessible areas. However, at a certain arm
length, the subtelomere is able to reach all points on the NE. Further increases in arm
length do not dramatically increase the median subtelomere distance.
23
Figure 2-6 Median telomere-telomere distances in the structure population.
The median distances between a telomere of a reference chromosome arm and all
other telomeres are plotted for reference chromosome arms (A) 6R, (B) 10R, (C) 7R,
and (D) 4R. The vertical dash line marks the slope change point at 95% confidence
interval shown by the shaded area.
Interestingly, the change in slope occurs at a slightly different arm length in
experiments (310 kb) [48] compared to our structure population (356–440 kb). However,
the incompleteness of the experimental data can explain some of this difference. If we
only include those chromosomes that are also analyzed in the experiment, the change
in slope in our simulation shifts to 309–327 kb in remarkably good agreement with the
experiment.
Reference: 10R (309 kb)
Reference: 7L (497 kb) Reference: 4R (1082 kb)
Arm length (L in kb) Arm length (L in kb)
Arm length (L in kb) Arm length (L in kb)
Median pair distance (d in nm) Median pair distance (d in nm)
A
C
B
D
200 400 600 800 1000 1200
Reference: 6R (122 kb)
400 600 800 1000 1200 1400
356
●
●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
● ●
●
●
●
6L
2L
9L
4L
7L
15L
10L
13L
1L
14R
16L
14L
2R
7R
13R
15R
12R
4R
12L
3R
11R
10R
5R
8R
16R
8L
3L
9R
1R
5L
d = 0.21L+1010, L > 356 kb
R
2
= 0.804
d = 1.55L+441, L < 356 kb
R
2
= 0.953
11L
200 400 600 800 1000 1200
400 600 800 1000 1200 1400
356
●
●
●
● ●
●
●
●
● ● ●
●●
●
●
●
●
●●
● ●
●
●●● ● ● ●
●
●
10L
2L
9L
4L
7L
15L
11L
13L
1L,6L
14L
16L
14R
3R
11R
2R
7R
13R
15R
12R
4R
5R
8R
16R
8L
3L
6R
9R
1R
d = 0.128L+1150, L > 356 kb
R
2
= 0.813
5L
12L
d = 0.785L+844, L < 356 kb
R
2
= 0.905
d = 0.017L+1250, L > 356 kb
R
2
= 0.924
d = 0.326L+1160, L < 356 kb
R
2
= 0.803
d = 0.288L+1180, R
2
= 0.935
4R
12R 15R
13R
7R
8R
2R
16L
4L
14L
1L
3L,
6R
200 400 600 800 1000 1200
400 600 800 1000 1200 1400
356
● ●
●
● ● ●
●
●
● ●
●
●●
●
●
●
●
●
●
●
● ●
●
●●
● ●
● ●
●
9R
8L
1L,
6L
16R
5R
15L
9L
13L
2L
3R
11L
11R
10R
14R
12L
10L
5L
d = 0.095L+1250, L > 356 kb
R
2
= 0.867
d = 0.421L+1080, L < 356 kb
R
2
= 0.849
200 400 600 800 1000 1200
400 600 800 1000 1200 1400
356
● ●● ● ●●
●
●
●
●
●
●●
● ●
●
●
●
●
● ●●
● ●●
● ●
●
●
●
4L
6L
6R
1R
9R
15L
9L
2L
11L
8R 5R 10R
3R
14R
12L
5L
8L
3L
1L
11R
13L
2R
16L
7L
10L
14L
7R
13R
16R 15R
12R
24
2.3.7 Telomere clusters
To identify subtelomere clusters in the structure population, for each subtelomere
we calculated the fraction of structures with at least one other subtelomere within 250
nm. In agreement with another experiment [48], such small subtelomere distances are
infrequent (1%–3%) for most chromosomes. However, a few co-locations are observed
more frequently between relatively short chromosome arms, namely, 1R:1L, 6R:6L,
1R:9R, and 3L:3R. These pairings occur in 12%-20% of the population, also in
agreement with experiments [38, 48]. For example, the pairs 3R:3L and 6R:6L were
recently reported to form significant but transient interactions, leading to the formation of
chromosome loops [38, 48].In general, we find that as the length difference between
chromosome arms grows more pronounced, the probability of their telomeres being co-
localized decreases. Interestingly, the volume exclusion effect has a pronounced effect
on the co-location frequency of telomeres. For small and also large chromosomes, the
volume exclusion effect increases significantly the co-location frequency, while for
medium-sized chromosomes the opposite is observed by decreasing the co-location
frequency. For instance, the fraction of co-located telomeres increases by almost 20%
for the small chromosome 6 upon the presence of all other chromosomes in the
nucleus, while it decreases by 60% for the medium-sized chromosome 8.
2.3.8 Co-localization of functionally related loci
Next, we investigate whether functionally related gene loci are co-localized in the
structure population. First, we compare the 3D spatial distributions of early and late
replication start sites in the structure population. These sites are distributed across all
25
chromosomes (Fig. 2-7, right panels). Experimental evidence exists that early
replication sites are spatially clustered during interphase [23, 49].
Figure 2-7 Spatial clustering of replication origins and tRNA gene loci.
The histograms show the distribution of the mean pair distance ratio between a set of
specific sites (e.g. early replication sites) and all sites in the structures of the population.
The histograms are generated as follows: For a given structure in the population the
mean pair distance between a set of specific loci (e.g. all early replication origins) is
calculated. This distance is divided by the mean pair distance of all sites in the same
structure. The distribution of the distance ratio is then obtained from all structures in the
population. If the distribution is centered on 1 (vertical dashed line) the selected sites
behave similarly to a random sample of all sites. If the distribution is shifted towards
26
smaller values their pair distances are smaller than would be expected from the
background control. If the distribution is shifted to lower values, the selected sites are
more distant from each other than would be expected from the random control
background. (A) Distribution of distance ratios of early (red) and late (green) replication
start sites as determined by three independent experiments using the CDR (Clb5
Dependent Region), Rad53 checkpoint-mediated regulation, and GINS complex as
identifiers. For the latter case, category 1 sites start replication earlier than category 2
sites. (Right panel) The positions of each site in the chromosome sequence. The
number of early and late-firing sites labeled with CDR, Rad3, and GINS are 77 and 123,
101 and 99, and 169 and 135, respectively. (B) Distribution of distance ratios for all 275
tRNAs. For all sets of sites in (A) and (B) the shift of the mean pair distances is highly
statistical significant.
In each structure of the population, we calculate the mean pairwise distance
between all early replication sites. The frequency distribution derived from these mean
pairwise distances is compared to a distribution chosen from randomly selected sites in
the genome. We observe significant spatial clustering of the early replication sites (Fig.
2-7A), in the sense that their mean pairwise distances are significantly less than would
be expected from randomly selected sites (Stouffer’s Z-transform tests z-scores < -
160;). This observation holds for all three sets of early replication origins identified in the
literature [50-52]. Remarkably, for late replication sites we see the opposite effect: a
statistically significant increase in the mean pairwise distances between late replication
sites compared with the background. It appears that, on average, early replication start
sites are closer to the centromere on the chromosome sequence compared with the late
start sites (all P-values <10e-5 for the three data sets).
We also analyzed the spatial positions of all tRNA gene loci in the genome,
which have been observed to cluster in experiments [23, 53]. Again, we observe a
statistically significant decrease in the pairwise distances between tRNA loci (Fig. 2-7B)
compared with randomly picked loci.
27
Our observations clearly indicate that the chromosomal locations of these
specific sites are not randomly distributed over the genome; they are positioned in such
a way that early replication sites have a higher probability of being co-localized when
the chromosome chains behave as random polymer chains that are subject to nuclear
landmark constraints.
2.4 Summary
In summary we have shown that tethering of heterochromatic regions to nuclear
landmarks and random encounters of chromosomes in the confined nuclear volume
explain the higher order organization of the budding yeast genome. We have
quantitatively characterized the contact patterns and nuclear territories that emerge
when chromosomes are allowed to behave as constrained but otherwise randomly
configured flexible polymer chains in the nucleus and found remarkable agreement of
our models with all available experimental evidence of yeast genome organization.
Chapter 3. Comparative Genomic Structural Study of
the Fission Yeast and the Budding Yeast
3.1 Introduction
The 3D structural organization of the genome plays a key role in the correct
execution of nuclear functions, such as gene expression regulation[54-57], and DNA
replication[58, 59]. In yeast, centromeres remain attached to the spindle pole body
28
(SPB) during interphase, and telomeres are typically anchored to nuclear envelope
(NE)[7, 60-63]. For budding yeast, chromosomes were described of having a “Rabl-like”
orientation and genes are located in defined nuclear territories[15, 35, 38, 39].
Moreover, the rDNA containing nucleolus is located in a well-defined position relative to
the SPB[8, 36, 64]. More recently, conformation capture experiments (Hi-C methods)
[22, 24, 65] provided a detailed view of the genome-wide chromatin interaction patterns
in budding yeast (Saccharomyces cerevisiae) and fission yeast (Schizosaccharomyces
pombe). The Hi-C experiments revealed large differences between the two yeast types
[9, 23, 66, 67]. Budding yeast showed highly structured contact maps with distinct
cross-shaped patterns for intra-and inter-chromosomal interactions. In comparison, the
Hi-C maps of fission yeast show only weakly structured patterns, and, with the
exception of centromere and telomere interactions, are dominated by local chromosome
chain contacts. Several studies correlated Hi-C interaction patterns with functional
features [9, 23, 68-70]. For instance, some co-regulated genes in fission yeast form
frequent interactions, and are assumed to be spatially clustered even though they are
substantially separated in the genome sequence [9].
We and others recently discovered that in budding yeast, entirely random
configurations of tethered chromosomes are sufficient to reproduce in a statistical
manner many data about the budding yeast genome organization[32, 71-74], including
gene loci interactions from genome-wide Hi-C experiments [23], the distribution of gene
territories from fluorescence imaging [15],and the clustering of functionally related
genes such as replication start sites as well as tRNA genes [23, 49, 53]
29
Here, we focus on fission yeast (Schizosaccharomyces pombe) and investigate
the role of geometric constraints on its genome organization for the given gene order on
each chromosome. To fairly assess the factors responsible for genome structure-
function correlations, we must first examine the genome structure that arises when
chromosomes are tethered to nuclear landmarks but otherwise randomly configured in
the confinement of the nuclear environment. Like budding yeast, the fission yeast
centromeres are attached to the SPB during interphase through microtubule
interactions. The fission yeast telomeres are anchored to the NE and the SPB is located
at opposite sides from the nucleolus, which contains rDNA genes [7, 8]. All these factors
exert geometric constraints on the chromosome fibers. Although the genomes of
budding and fission yeast are almost of equal size (~12Mb), the total number and length
of the chromosomes are largely different. Fission yeast has only 3 chromosomes
compared to the 16 in budding yeast and they are significantly larger [75]. Due to these
changes, the impact of geometric constraints on the chromosome conformations is
different between the two yeasts, and it is unknown if random encounters of constraint
chromosomes alone could explain the observed fission yeast genome structure and
structure-function correlations.
Here, we calculated a large population of fission yeast genome structures in
which chromosomes are constrained by geometric constraints but otherwise randomly
configured in the nucleus. We quantitatively characterized the resulting chromatin
contact patterns, nuclear territories of gene loci and chromosomes, and also analyzed
structure function correlations including the co-locations of co-expressed and
functionally related genes. Our findings demonstrate that purely random configurations
30
of flexible chromosome chains, combined with the locations of genes on the
chromosomes can reproduce a wide range of experimental data, including chromatin
interaction patterns and locations from Hi-C and FISH experiments, such as the spatial
clustering of tRNA and 5sRNA genes, as well as clustering of co-expressed genes.
Although fission and budding yeast genomes share similar principles of genome
organization, a comparative structure analysis revealed dramatic differences in the
resulting structure populations. For instance, almost all chromatin regions in fission
yeast can access wide areas in the nuclear volume, whereas in budding yeast only a
small fraction of loci show a similar behavior. Therefore, in this model gene territories in
fission yeast are generally more diffused than those in budding yeast and nuclear
locations of orthologous genes can be quite different between them. However, despite
the structural differences, the clustering behavior of related genes is quite similar in
fission and budding yeast, even though the genes’ nuclear locations differ. Moreover,
our analysis also provides insights on the contribution of individual constraint types in
establishing functional relevant interaction patterns. For instance, centromere clustering
is particularly important in establishing inter-chromosomal clustering of tRNA genes in
fission yeast. In summary, our findings demonstrate that some described structure-
function correlations can be explained as a consequence of random chromatin collisions
driven by a few geometric constraints (mainly centromere-SPB and telomere-NE
tethering), combined with the specific gene locations in the chromosome sequence.
3.2 Materials and Methods
In our model, the yeast nuclear architecture is defined by the nuclear envelope
(NE), the spindle pole body (SPB), the nucleolus, and the 3 chromosomes for the
31
haploid fission yeast genome (Figure 3-1A). The positions of the NE, SPB and
nucleolus remain constant while the configurations of the chromosomes are optimized.
32
Figure 3-1 Fission yeast genome structures calculation.
(A) Schematic view of fission yeast nuclear architecture and imposed geometric
constraints. Centromeres are located within a sphere volume of radius 300 nm to
ensure that they are close to the SPB. Telomeres are anchored to the NE and can
freely move on the NE surface. rDNA genes are constrained to be on the nucleolar
surface (right side). All non-rDNA genes are prevented to enter the nucleolus. All
chromosomes are confined in a nucleus of radius 0.71 micron. (B) Snapshot of a
genome structure illustrating the packing of the chromosomes in the nuclear volume.
Different chromosome chains are depicted in different colors. The nucleolus volume is
shown in silver. The spindle pole body is shown as a light green cylinder opposite to
nucleolus. (C,D,E) Heatmaps of the genome-wide contact frequencies of the fission
yeast from calculated structure populations (C), from experiment (D), and from a
random control model with no geometric constraints applied (E). The resolution of the
heatmaps is 96 kb per bin (Text S1). The color code ranges from white to red to
represent frequencies from low to high. The telomere-telomere interactions are
highlighted in a zoom-in box (F) Pearson correlation between contact frequency heat
maps from experiment and structure populations (Text S1). Correlation values are
shown for intra-chromosomal and inter-chromosomal interactions separately. The
experimental heat map is compared to several different structure populations generated
with different amount of geometric constraints. Values for models C, R, N indicate
structure populations that were generated only with centromeric constraints (C), rDNA
constraints (R) and telomere anchoring constraints (N), respectively (Methods).
Chromosome Representation
Each chromosome is represented by a flexible chromatin fiber as a chain of
connected spheres as described previously [32]. Each bead with radius of 15 nm
represents 3.2 kb of genome sequence, which corresponds to a fiber compaction ratio
of 6 nucleosomes per 11 nm fiber length [32, 43]. The entire genome contains 3930
beads.
Nuclear Architecture
The nuclear radius (r=710 nm) and the position of the SBP and nucleolus are
based on data from imaging experiments[7, 9]. The SPB and nucleolus are positioned
at opposite ends of the nucleus (Figure 3-1B).
33
Scoring Function
The geometrical constraints are expressed in a single scoring function, which is
defined as a sum of spatial restraints and quantifies the degree of consistency between
the structure and the constraints. To optimize an individual structure, the scoring
function is minimized to a score of ~zero therefore entirely satisfying the restraints by a
small residual. The scoring function is defined as
Here, is the 3D coordinate vector of each bead i; N is the total number of
beads in a model and α, β, γ, and δ represent different subsets of beads with specific
genomic features. α is the set of all terminal beads of the chromosomes; β is the set of
all centromeric beads; γ is the set of beads representing telomeres, and δ are all
beads representing rDNA regions (detailed explanation in Table 3-1). All the restraints
are expressed as pseudo potential energy terms u described as follows:
Chromatin chain restraints U
ch
restricts consecutive beads in a chromosome
chain to be within a distance of 30nm. (Table 3-1).
Chromatin chain excluded volume restraints U
exc
prevents the overlap between
any two beads in the genome (Table 3-1).
S r
i
,..,r
N
( )
= U
ch
(r
i
,r
i+1
)
i=1,i∉α
N−1
∑
+ U
exc
(r
i
,r
j
)
j>i
N
∑
i=1
N−1
∑
+ U
nuc
(r
i
)
i=1
N
∑
+ U
cen
i∈β
∑
(r
i
)+ U
tel
i∈γ
∑
(r
i
)+ U
inu
(r
i
)
i∈δ
∑
+ U
onu
(r
i
)
i∉δ
∑
= 0
i
∈ r
3
R
34
Nuclear envelope restraints U
nuc
ensures that all beads reside inside the nucleus
with a radius of R
nuc
=710nm (Table 3-1).
Chromatin persistence length. A harmonic potential restraint is imposed to
reproduce the desired chain stiffness during the optimization process.
, for i, i+1, and i+2 on the same chain.
This restraint is only imposed during the calculation of gradient forces in the
optimization process and is not considered when calculating the final score (see score
function). With a force constant of k
angle
=0.2 kcal/mol, the obtained chromatin chains
have a persistence length as expected for a chromatin fiber, which is assumed to
behave similarly as in budding yeast study, for which experimental estimates exist [32].
Table 3-1 Geometric constraints and their functional forms.
All constraints are expressed as harmonic functions, namely (u
h
), harmonic upper (u
ub
)
and lower bounds (u
lb
) [32]. Based on different properties, we define several subsets of
beads. is the set of all end beads for every chromosome chain, is the set of beads
representing centromeric regions, is the set of beads representing telomeric regions,
and is the set of beads which represent the start and end of rDNA regions.
U
angle
=
1
2
k
angle
1− arccos r
i+1
− r
i
( )
i r
i+2
− r
i+1
( ) ( ) ( )
i=1
N−2
∑
35
Restraint Type
Functional
form
µ d(nm) bead i k
Non-Specific Geometric Restraints
U
n
Beads inside
nucleus restraint
u
ub
(0,0,0) 710 all beads
1
U
ch
Chromatin chain
bond restraint
u
h
r
i+1
30
1
U
exc
Excluded volume
restraint
u
lb
r
j
30
1
Specific Geometric Restraints
U
cen
Centromere
localization
restraint (C)
u
ub
(-560,0,0) 300
1
U
onu
rDNA restraint 1
outside nucleolus
(R)
u
ub
(-1065,0,0) 1278
1
U
inu
rDNA restraint 2
inside nucleolus
(R)
u
lb
(-1065,0,0) 1278
1
U
tel-ne
Telomere to NE
restraint (N)
u
lb
(0,0,0) 710
1
Geometric Constraints to Nuclear Landmarks.
(C) Centromere localization restraint U
cen
. All centromeres are clustered at the
SPB through interactions with microtubules of length ~300 nm. Therefore centromeres
are constrained to be located at the SPB (centromere constraints C) by restricting the
central bead of the centromere region to be located in a sphere of radius 300 nm
(Figure 3-1). Based fluorescence imaging this volume is located on the central axis of
the nucleus, close to the NE (Table 3-1) [7].
{1.. 1}, 1 iN i α = − + ∉
{1.. 1}, iN ji = − >
∀i ∈β
∀i ∉δ
∀i ∈δ
∀i ∈γ
36
(N) Telomere localization restraint U
tel-NE
. All telomeres are anchored to the NE
and can freely move on the NE surface as suggested by experimental evidence (Table
3-1) [7].
(R1) Nucleolus localization restraint U
inu
. The ribosomal DNA (rDNA) is located
next to the telomeric regions on chromosome 3. Experiments showed that the rDNA
regions occupy the nucleolus[7]. We don’t explicitly resolve the structures of rDNA
regions in this model. Instead, we anchor the two beads representing the rDNA start
and end regions to the surface of nucleolus as previously described for budding
yeast[32]. The two anchor points can freely move on the nucleolus surface (Table 3-1).
(R2) Nucleolus excluded volume restraint U
onu
. All chromosomal regions other
than rDNA repeats are excluded from the nucleolus (Table 3-1).
To study the influence of each of the geometric constraint types on the genome
organization, we generated several structure populations using different combinations of
geometric constraints (Table 3-2, Control, C, N, R, CNR).
Table 3-2 Constraints applied for different models.
In total we generate 5 different models with different combination of specific geometric
restraints (Table 3-1). A check mark indicates that a model contains the corresponding
set of constraints.
37
Model Name Centromere
Restraints (C)
Telomere to NE
Restraints (N)
rDNA Restraints
(R)
CNR ✓ ✓ ✓
Control
C ✓
N ✓
R ✓
Structure optimization
The scoring function is optimized by using a combination of simulated annealing
molecular dynamics and the conjugate gradient methods implemented in the Integrated
Modeling Platform [32, 47, 76]. An individual optimization run starts with an entirely
random bead configuration, followed by an initial optimization of the structure. Then, we
apply simulated annealing protocols to entirely equilibrate the genome configuration.
Finally, conjugate gradient optimization ensures that all constraints are satisfied, leading
to a structure with score ~zero. Many independent optimizations are carried out to
generate a population of at least 100,000 independently calculated genome structures
with a total score of ~zero. This population represents a spectrum of genome structures
consistent with the input constraints. To test the effect of different constraint types, we
generated a total of 4 structure populations with different geometric constraints and 1
structure population without imposing any geometric constraints (random control model)
(Table 3-2).
38
3.3 Results and Discussion
3.3.1 Fission yeast genome structure
To represent highly variable genome structures, we constructed a large
population of 3D genome structures, which represent a spectrum of all possible
chromosome configurations. To explore the chromosome conformational space,
100,000 independent simulations were performed, each time starting with a random
genome configuration. After structure optimization and equilibration, each of the
independently calculated 100,000 structures satisfies all the imposed geometric
constraints. The optimized structures are hereafter referred to as the “structure
population” (‘CNR Model’ in Table 3-2). To investigate the role of the different geometric
constraints types, we also generated 3 additional structure populations each containing
geometric constraints of only one specific type (‘C Model’, ’N Model’, ’R Model’ in Table
2), and 1 structure population generated without imposing any geometric constraints
(‘Random control Model’ in Table 3-2).
We first discuss the structure population generated with the complete set of
geometric constraints (CNR model in Table 3-2). From the structure population we
quantitatively characterized structural features and compared these with experimental
data. Specifically we determined the chromosome and loci contact patterns, locus-locus
distances, as well as nuclear territories of genes and chromosomes.
Comparison of contact frequencies to Hi-C experiments.
We first compared the contact frequencies calculated from our structure
population with those from Hi-C experiments[9, 27]. The contact frequency between two
chromatin regions is defined as the fraction of all structures containing a contact
39
between the corresponding chain beads. The contact frequencies reproduced well
those from the Hi-C experiments, with a Pearson correlation of 0.91 (Figure 3-1C, CNR
model, Table 3-2). Also a visual inspection confirms the agreement between model and
experiment (Figure 3-1C,D).
The intra-chromosomal contact frequency map is dominated by high intensity
values along the diagonal, with a sharp decay of frequencies for contact pairs with
increasing sequence distance. Long-range intra-chromosomal contacts of a locus
appear almost uniformly distributed, similar to an unconstrained polymer. Indeed, the
structure population generated without any geometric constraints (‘Control Model’,
Table 3-2) still showed a high Pearson correlation of 0.91 for intra-chromosomal
contacts due to the dominant diagonal elements in the contact frequency map (‘Control
Model’ in Figure 3-1F). The only intra-chromosomal features not recovered in the
random control are weak interactions between telomeres (Figure 3-1D,E). This
observation may in part be explained by the relatively large size of the chromosomes.
With increasing chromosome size the geometric constraints (i.e., centromere clustering
and telomere-NE anchoring) are less restrictive on the conformational flexibility for most
of the chromosome regions (i.e., the central regions in a chromosome arm) and
therefore have less influence on the intra-chromosomal contact behavior. Therefore
regions that are distant in sequence to centromeres or telomeres behave similarly to a
random unconstrained polymer. A complete different contact behavior is seen in
budding yeast, where the characteristic cross-shaped intra-chromosomal interaction
patterns can only be reproduced when geometric constraints are imposed, partly as a
result of the smaller chromosome arm lengths [32]. We previously showed that these
40
interaction patterns are mainly a result of crowding effects when centromeres of all 16
chromosomes compete for the limited space around the spindle pole body. Because
fission yeast has only three chromosomes a cross-shaped interaction pattern is not
observed in model and experiment even though the same type of geometric constraints
are imposed.
When analyzing inter-chromosomal interactions a different picture emerges. A
random chain model without geometric constraints does not reproduce any of the
experimental contact frequencies (Pearson correlation r
P
~ 0)(‘Control Model’ in Figure
3-1F). Instead, a structure population with geometric constraints reproduced the
experimental inter-chromosomal interaction patterns with a Pearson correlation of ~0.40
(‘CNR Model’ in Figure 3-1F). The model captures naturally the two key interaction
patterns observed in experiment, namely centromere-centromere interactions resulting
from centromere clustering, and also an increased telomere-telomere interaction
frequency, even though no constraints between telomeres were imposed (Figure 3-
1C,D). Also in budding yeast, significant interactions are observed between telomeres,
in particular for the smaller chromosomes [32].
Among all the individual constraint types, centromere clustering leads to the
largest increase in correlation between modeled and experimental contact frequency
maps (‘C Model’ Figure 3-1F). This effect is mainly due to the enhanced contact
frequencies between centromeres and a slight decrease in contact frequencies between
centromeres and other regions on the chromosome, which is only faintly visible in the
experimental heat maps, while it is more pronounced in the modeled structure
populations. Other constraint types have only minor impact. For instance, constraining
41
ribosomal genes to the nucleolus does not significantly improve the match between
modeled and experimental interactions (Pearson Correlation: 0.03 for ‘R Model’ in
Figure 3-1D).
We now investigate if other known structural features are also observed in the
genome structures.
Distances between loci
After establishing the agreement of chromatin contact frequencies in model and
experiment we now focus on 3D genome structural features. In independent 3D FISH
experiments [9] the average distances of 18 gene pairs were measured [9]. The
average loci distances in the structure population were in excellent agreement with
experiment (R
2
=0.77, Figure 3-2A), even though no information about the FISH
distances was used when generating the models. Also the distribution of the distances
in the structure population agreed well with those from a set of FISH experiments
(Figure 3-2B) (Pearson correlation of 0.93), indicating that the random encounter of
constrained chromosomes can reproduce the data.
42
Figure 3-2 Assessment of the structure population with independent
experimental data not included as the input information in the calculations.
(A) Mean 3D distances for 18 pairs of loci calculated from structure population and
determined by independent 3D FISH experiments (R
2
=0.77)[9]. The mean distance
between two loci is measured as the average distance between the two corresponding
chromosome beads in all structures of the population. (B) Histogram of the distance
distribution between locus chr2 (3094994bp to 3116383bp) and locus chr3 (1404306bp
to 1441994bp) in the structure population and FISH experiments [9]. The correlation
between the two histograms is calculated as the correlation of the pairwise frequency
values between experiment and structure population, r
C
=0.93 (p-value<1E-8). (C)
Spatial clustering of Pol-III transcribed genes (p-value < 1E-16 for both tRNA and
5srRNA). The histograms show the distribution of the mean pair distance ratio between
a set of specific sites (e.g. Pol-III transcribed sites tRNA sites or 5sRNA sites) and all
sites in the structures of the population. The distance ratio histograms are generated as
follows: For a given structure in the population the mean pair distance between a set of
specific loci (e.g. all early replication origins) is calculated. This distance is divided by
the mean pair distance of all sites in the same structure to get a distance ratio. The
distribution of the distance ratio is then obtained from all structures in the population.
The vertical line represents the expected distance ratio if genes are randomly
distributed. (D) Combined localization probability density (LPD) plot for the 2D
43
distribution of all tRNA sites in fission yeast from our structure population (Text S1). The
density is represented by the color ranging from blue to red. The plot shows that tRNA
genes have the highest density close to SPB region. (E) Enrichment for chromatin -
Man1 protein binding. (Left panel) Enrichment of Man1-binding signal from DamID
experiments in the 100 beads that show the shortest average distance to the NE in the
structure population. (Right panel) Man1-binding enrichment of randomly selected
domains. The results show significantly higher Man1 enrichment in beads closest to NE
compared to randomly selected beads (p-value<1E-6, Cohen’d = 0.66).
Nuclear localization of PolIII transcribed genes (tRNA, 5sRNA)
FISH experiments showed Pol III-transcribed genes (such as tRNA and 5srRNA)
to be spatially clustered, preferentially at centromeric regions [77-79]. Therefore, we
calculated the average pairwise 3D distances between all tRNA genes in the structure
population and compared the resulting distance distributions with those from randomly
selected genome sites. The average 3D distances between tRNA genes were
significantly smaller than randomly selected loci (Figure 3-2C,D). Similarly, also 5srRNA
were spatially clustered in the structure population (Figure 3-2C). To eliminate the bias
of having sites clustered in genomic sequence, we also calculated the average
distances between only those Pol III-transcribed gene pairs that were located on
different chromosomes. Even for this reduced set we still obtained significant 3D spatial
clustering. When comparing tRNA clustering between structure populations generated
with different constraint types, it becomes evident that tRNA clustering is mainly driven
by centromere constraints. Pol III-transcribed genes were significantly closer to the SPB
compared to randomly selected sites. Our results indicate that geometric constraints
alone (i.e. in particular centromere clustering around the SPB) will increase the
probability for Pol III-transcribed genes to be in spatial proximity to each other in 3D
space, even if these genes are located on different chromosomes.
44
Gene loci-NE distances
DamID experiments reveal the probability of a locus to be close to the NE by
measuring its binding propensity to the lamina-like NE protein Man1 [80]. To test if the
loci-NE distances in our models agree with DamID experiments, we calculated the
average distance of each locus to the NE. The 10% loci with the shortest NE distances
in our structure population were selected as loci with the highest likelihood to be
positioned close to the NE. We then calculated the enrichment of Man1 binding loci in
this set and compared it to a set of randomly selected loci. The set of loci detected to be
closest to the NE had a significantly higher MAN1 binding signal enrichment compared
to randomly selected sets of loci (p-value < 1E-4) (Figure 3-2E).
3.3.2 Genome structure comparison between fission and budding
yeast
We showed that a structure population with constrained but otherwise random
chromosome chains combined with the natural gene positioning on chromosomes
reproduced many known features of the fission yeast genome organization. We
previously showed a similar result for budding yeast[32]. The chromosome
organizations between the two yeasts are largely different. We now compare the
structure populations of the fission yeast genome with the one previously generated for
the budding yeast [32].
Chromosome Territories
Chromosome territories were analyzed by calculating the combined location
probability density (LPD) of all regions in a chromosome. In fission yeast, the LPD of the
large chromosomes 1 and 2 are almost uniformly distributed over the entire nucleus
45
with only slight increase of the LPD closer to the SPB (for chromosome 2), and a slight
increase of LPD at the nucleolus for chromosome 1 (Figure 3-3A). Only chromosome 3
showed a larger variance in LPD with some increased values at the SPB and the
nucleolus, due its smaller size and constraining of the rDNA genes to the nucleolus
(Figure 3-3A). Moreover, almost all chromatin regions of all three chromosomes can
freely access the entire nuclear volume (Figure 3B top panel). For instance, ~92% of all
chromatin regions in the fission yeast genome can access at least 80% of the nuclear
volume (Figure 3-3B). In budding yeast all the chromosome territories showed
substantially larger LPD variations with distinct maxima, at different nuclear locations
(Figure 3-3A) [32]. Only a relatively small fraction (~32%) of all chromosome regions
can access at least 80% of the nuclear volume (Figure 3-3B bottom panel). These large
differences are mainly due to the substantially smaller but also more variable length of
the chromosome arms in budding yeast.
46
47
Figure 3-3 Chromosome and gene territories analysis of fission yeast and
budding yeast
(A) Chromosome localization probability density (LPD) plots for fission yeast (top panel)
and selected chromosomes in budding yeast for comparison (lower panel)[32]. The
chromosomes are ordered by their size from largest (left) to smallest chromosomes
(right). (B) Comparison of the nucleus accessibility of genomic regions between fission
yeast and budding yeast (Text S1). The higher the accessibility, the more space it can
explore the nucleus. The red dots in red represent the centromeric locations. (C) Gene
localization probability density (LPD) plots for four genes in fission yeast. Their
orthologous genes were also analyzed in the budding yeast genome models [32].
We then analyzed how much a chromosome’s LPD varies if it is calculated from
a structure population with or without the remaining chromosomes in the nucleus. In
fission yeast the location and extension of a chromosome’s LPD is not affected by the
presence of all the other chromosomes. In contrast, in budding yeast a chromosome’s
LPD dramatically changes with the presence of all other chromosomes in the nucleus
[32].
Gene Territories
To analyze the spatial localization of individual genes we determined the nuclear
territories of four genes, for which nuclear territories of their orthologous genes have
been previously determined in budding yeast [15, 32]. The LPD of these genes reveal
preferred locations in the fission yeast nucleus as seen by the LPD maxima of these
genes (Figure 3-3C). However, the gene territories are significantly more diffuse in
fission than in budding yeast. For instance, the gene RPS20 can generally access
almost 99% of the nuclear volume while the orthologous gene in budding yeast is
substantially more restricted and can access only 29 % of the nuclear volume (Table 3-
48
3). Also, the actual gene locations can be quite different in the two yeast species. For
instance, the most dramatic difference among the four genes is seen for the gene
RPS20, which is located towards the nucleolus in fission yeast and shows a very large
gene territory (Figure 3-3C). However, in budding yeast the gene territory of the
orthologous gene is quite focused and located close to the spindle pole body[32].
Table 3-3 Nuclear accessibility for orthologous genes in the fission yeast and
the budding yeast.
The orthologous of URA3 in budding yeast is URA4 in fission yeast.
GAL1 RPS5 RPS20 URA4/URA3
Fission Yeast 84% 92% 64% 99%
Budding Yeast 35% 67% 42% 29%
Interaction specificity
We also analyzed the interaction specificity of chromatin regions by defining an
‘interaction entropy’ value for each locus, which measures the preference of a locus to
interact with specific loci at increased frequencies. If a locus forms specific interactions
its interaction entropy will be low, whereas it will be high if a locus forms interactions to
many other loci at similar contact frequencies. Both, in Hi-C experiment and structure
population, fission yeast showed significantly larger entropy values (p-value<1E-6) than
budding yeast (Figure 3-4A). This result confirms that fission yeast chromatin
interactions show lower interaction specificity, and are substantially more variable than
49
those in the budding yeast, which indicates that the fission yeast genome organization
is less structured than that of budding yeast[32].
Figure 3-4 Interaction specificity and gene localization comparison between
fission yeast and budding yeast.
(A) Chromatin interaction specificity analysis. The interaction specificity of a locus is
estimated by defining an entropy value, which measures the interaction preference of
genomic region with others. The larger the entropy value is, the less specific are its
interactions with other genomic regions (Text S1). Loci in fission yeast show
significantly larger entropy values and therefore form more random interactions than in
budding yeast, in both Hi-C experiments and structure population (both p-value<1E-16).
(B) The distribution of the distance ratio for lowly expressed genes and highly
expressed genes. Lowly expressed genes are significantly dispersed than randomly
selected loci (dashed vertical line), with a p-value<1E-16 and Cohen’s d is 0.72. The
highly expressed genes are significantly clustered with a p-value<1E-16 and Cohen’s d
is 0.33. (C) Combined gene localization probability density (LPD) plots for highly
expressed genes and lowly expressed genes in the two yeasts
50
Genome structure and gene expression
In mammalian cells transcriptionally active genes can be co-localized to nuclear
sites referred to as transcription factories [57, 81, 82]. Data from Hi-C experiments
indicated a co-location of co-transcribed genes also in fission yeast [9]. We now
investigate whether genomic regions that contain highly expressed genes have a
tendency to be co-localized also in our structure populations, even though no
constraints were imposed between them. We defined two sets of genes, one containing
the top 100 ranked genes based on their expression levels in G1 phase and one set
containing the bottom 100 ranked genes [9, 83]. For both fission and budding yeast, the
average 3D distances between highly expressed genes in the structure populations are
significantly smaller than those of the lowest expressed genes (Figure 3-4B). When
plotting the combined LPD for all the genes in each set it is evident that for both yeast
types highly expressed genes are localized towards the nuclear interior, while lowly
expressed genes reside towards the outer regions close to the NE and SPB (Figure 3-
4C). This finding confirms experimental observations about the preferred location of
highly and lowly expressed genes[80], and demonstrates that differences in the nuclear
locations between highly and lowly expressed gene sets must be pre-disposed by their
sequence positions along the chromosome arms. Indeed, when comparing the
distribution of sequence distances of the two gene sets to their respective centromeres
and telomeres it becomes evident that the highly expressed genes have a significantly
lower genomic distance to centromeres, which resulted in clustering of these genes in
3D space when centromere constraints are imposed (p-value=0.05). The lowly
expressed genes have significantly smaller genomic distances to the telomeres,
51
compared to randomly selected loci, which results in a location preferentially close to
NE when telomere-NE constrains are imposed (p-value=0.003).
Nuclear localizations of functionally related genes
In budding yeast, functionally related genes tend to be co-localized [59, 68]. We
next study, if such gene co-localization can also be reproduced in our structure
populations of fission and budding yeast even though no constraints were imposed
between functionally related genes. Genetic interaction (GI) experiments provided a
large number of gene pairs with related functions [84-86]. Based on these experiments,
we selected 2 sets of gene pairs, one with functional correlated and one with functional
uncorrelated gene pairs. Interestingly, for both, fission and budding yeast, functionally
related genes have shorter averaged 3D distances in the structure population than
functionally unrelated genes (Figure 3-5A). This observation indicates that random
chromosome conformations subject to a few geometric constrains may in part explain
these structure-function correlations.
52
Figure 3-5 Functional related genes tend to be spatially clustered in both fission
yeast and budding yeast.
(A) Spatial clustering of functional related genes in fission and budding yeast. The
histogram shows the distribution of the mean pair distance ratio between a set of
functionally related genes, as defined by genetic interaction experiments (Text S1) and
all the sites in the structures population. The histograms are generated as described in
Figure 2D. Genes with low functional correlation score are less clustered compared to
related genes with high functional correlation. The Cohen’d for highly functional
correlated gene pairs is 1.34 and 3.85 for the fission yeast and the budding yeast (p-
value<1E-16 for both cases). The Cohen’d for lowly functional correlated genes is 0.13
53
and 2.17 for the fission yeast and the budding yeast, respectively (p-value<1E-16 for
both cases). (B) Histograms of the distributions of the mean pair distance ratio between
the set of early replication origins and all the sites in the structure population. A
corresponding histogram is shown also for late replication origins. Early replication
origins are spatially clustered in fission yeast (p-value<1E-16, Cohen’d=0.59), while late
replication site show statistically significant larger average distances than randomly
selected sites (p-value<1E-16, Cohen’d=0.4) (C) Comparison of the clustering of genes
in the same GO categories between fission yeast and budding yeast. The test is based
on 51 GO categories, which contain sufficient amount of genes in both yeast types.
Plotted is the difference D
GO
- D
random
between the average pairwise 3D distances of the
genes in a GO category (D
GO
) and the average pairwise 3D distances between
randomly selected gene sites (D
random
). If the difference D
random
-D
GO
is larger than 0,
genes in the GO category is defined as clustered (p-value<-1E16), and If the difference
is smaller than 0, genes in the GO category are considered to be ”dispersed” (p-value<-
1E16) . Each point representing a GO category is colored by their functional categories,
such as cellular component, biological process and molecular function. (D) The
clustering of functional categories is highly significant. Shown is a selection of GO
categories under the term molecular functions. The dashed line indicates a p-value of
1E-16. The –log(p-value) is trimmed at maximally p-value=1E-20. The numbers on the
left of the figure represent GO categories as labeled in Table S1. For all the genes in
the class “Molecular Function”, 10 of 10 GO categories are significantly clustered in
budding yeast, while genes in 7 out of the 10 same GO categories are significantly
clustered in fission yeast.
Moreover, we also observed that early replication origins in fission yeast are
spatially clustered in our structure population. The average pairwise 3D distances of
early replication origins are significantly smaller compared to a set of randomly selected
sites (Figure 3-5B) [87]. In contrast, the averaged pairwise 3D distances of late
replication sites are significantly larger compared to randomly selected loci, indicating
that late replication origins are more dispersed in 3D space. We previously found the
same behavior in budding yeast, even though the two yeast types have different
locations of their replication origins and very different chromosome organizations[32].
Association among genes in same ontology groups
Finally, we analyzed the co-locations of genes classified in the same gene
54
ontology group (GO) (http://www.pombase.org and http://www.yeastgenome.org/). To
have sufficient sampling for our analysis, we selected GO classes containing at least 50
and up to 500 genes, which resulted in the selection of 51 GO terms for each yeast
species. For each set of genes in a GO category, we calculated the average pairwise
3D distance in the structure population. We then compared these data with those of
randomly selected set of loci (Figure 3-5C). If a gene set had a significantly smaller
average 3D distance than a set of randomly selected loci (based on a p-value<1E-16),
we considered these genes to be “clustered”. Genes were considered to be “dispersed”
if their average distance was significantly larger than randomly selected loci (based on a
p-value<1E-16). Interestingly, genes in the same GO category tend to be clustered for
both yeast types. In fission yeast 36 out of 51 GO categories show significant gene
clustering, while 41 out of the 51 GO categories show significant gene clustering in the
budding yeasts (Table 3-4). This observation agrees with our finding that functional
related genes are more clustered in 3D space. Our structure population also reveals a
similarity in the clustering property for the same GO categories in both yeast species.
Among the 51 GO categories, 38 GO categories (>74%), showed identical gene
clustering behavior in both yeast species (32 clustered and 6 dispersed GO categories)
(Figure 3-5D, Table 3-4). Therefore, the clustering properties must be pre-disposed in
the sequence position of these genes. Indeed, when calculating the average sequence
distances of genes, it is evident that genes in most GO categories are already clustered
at sequence level for both yeast species.
55
Table 3-4 Clustering properties of genes in 51 GO categories.
Fission Yeast
Budding
Yeast
Clustered Random Dispersed
Clustered 32 7 2
Random 2 0 0
Dispersed 2 0 6
3.4 Conclusions
Here, we studied the genome organization of fission yeast and characterized the
chromatin contact patterns, and nuclear territories of chromosomes and gene loci,
which emerge when chromosomes are allowed to behave as constrained but otherwise
randomly configured flexible polymer chains. This model is sufficient to explain in a
statistical manner many experimentally determined distinctive features of the fission
yeast genome organization, such as Hi-C contact patterns, including the enhanced
interaction frequencies between telomeres; and the co-location of some co-expressed
genes and co-locations of functionally related genes, including early replication start
sites, tRNA genes, and 5sRNA genes. Our findings demonstrate that some structure-
function correlations can be explained as a consequence of random chromatin collisions
driven by a few geometric constraints combined with the natural gene positioning on the
chromosomes. Distinguishing such “driver” interactions from “passenger” interactions is
key in understanding the principles of spatial genome organization and genome
structure-function correlations.
56
We also performed a comparative genome structure analysis between fission
and budding yeast, for which similar organization principles have been described
previously[32]. Despite similar organizing principles, large differences exist between
fission and budding yeast genome structures. In fission yeast large fractions of the
chromosomes can almost freely access the entire nuclear volume and gene territories
are diffuse. In contrast, the budding yeast genome is substantially more structured with
more focused gene territories and most chromosome regions can only access a
restricted region of the nucleus. Moreover, in budding yeast the inter-chromosomal
interaction patterns are highly structured leading to cross-shaped patterns in the contact
map. These cross-shaped interaction patterns are due to an exclusion volume effect
when centromeres of all the 16 chromosomes compete for the limited space around the
SPB[32]. Because in fission yeast only 3 chromosomes cluster at the SPB, such
interaction patterns are not observed despite imposing identical geometric constraints.
Moreover, due to the substantially longer chromosome arms combined with a smaller
nuclear radius, locations of gene loci are more disperse across the nucleus. Therefore,
in fission yeast specific inter-chromosomal interactions are mainly restricted to regions
directly adjacent to centromeric and telomeric regions.
Despite the differences in genome organization, many functional similarities
prevail. For instance, in both yeast types the average 3D distances between highly
expressed genes are significantly smaller than those of the lowest expressed genes.
The observed co-localization between functionally related genes in fission yeast is
mainly due to clustered gene locations along the chromosome sequence and the fact
that they are enriched towards either the centromeres or telomeres.
57
Finally, our work also highlights that experimental data on fission yeast is
consistent with a population of genome structures that can significantly vary between
them. Such an observation cautions against using structure models based on
ensemble-averaged experimental data. Such models may not be able to capture
accurately many of the structural properties of the genome.
Chapter 4. Hi-Corrector: a Fast, Scalable and
Memory-Efficient Package for Normalizing Large-
Scale Hi-C Data
4.1 Introduction
The recent development of genome-wide proximity ligation assays such as Hi-C
[22] and its variant TCC [24] has significantly facilitated the study of spatial genome
organization. The raw chromatin interaction data obtained by Hi-C methods can have
both technical and biological biases[27]. Therefore, correcting biases in the Hi-C data is
an important preprocessing step. Among several recently developed methods [26], the
iterative correction (abbreviated as IC) algorithm [27] has been used most widely by
recent studies [28, 88-90] due to its conceptual simplicity, parameter-free algorithm and
ability to account for unknown biases, although its assumption of the equal visibility
across all loci may require further exploration. Mathematically, the IC algorithm is a
matrix scaling or balancing method that transforms a symmetric matrix into one that is
58
doubly stochastic, meaning that the row and column sums of the matrix are equal to 1.
However, the Hi-C chromatin interaction matrix is of the massive size O(N
2
), where N is
the number of genomic regions. Thus, it requires expensive computing resources such
as large memory and long computation time. This is especially problematic for high-
resolution data at the kilobase level or beyond [89, 91]. For example, at the resolution of
10K base pairs per region, the human genome has 303 640 regions and the matrix of
the Hi-C data occupies about 343 GB of memory, which cannot be loaded into any
common desktop computer even in the compressed format. Most scaling algorithms in
the matrix computation field [92, 93] suffer from this scalability issue, because their main
focus is improving the convergence rate and numerical stability. Therefore, there is high
demand for a fast and scalable IC algorithm that can work with massive Hi-C data
matrices on common computing resources.
Here we propose a set of scalable algorithms (adapted from the original IC
algorithm) to meet this need. Both sequential and parallel versions were implemented in
the standard and efficient C language, which allows for precise memory control. The
sequential implementation is memory efficient and can run on any single computer with
limited memory, even for Hi-C datasets of large size. It is designed to overcome the
memory limit by loading a portion of data into the memory at each time, so requires
some extra time for file reading. The parallel implementation is both memory efficient
and fast. It can run on one of the most popular parallel computing resources: a
computer cluster (i.e. a distributed-memory computing environment). In this
environment, a set of general-purpose processors or computers can be interconnected
to share resources, and each computer retains its local and limited memory. The
59
parallel algorithm is designed with very low communication overhead among computing
nodes, so that it runs faster on clusters with more computers. Although the Hi-C
analysis pipeline, ICE [27], implements the IC algorithm, it works only on a single
computer and cannot utilize as many computing resources as possible to speed up the
computation.
Very few parallel matrix scaling or balancing algorithms have been developed
prior to this work. However, none of them are suitable for the bias correction task of Hi-
C data. Zenios and Iu parallelized the matrix balancing algorithm in 1990 for a shared-
memory computer, which cannot address the memory shortage problem. Amestoy
designed a complicated data distribution strategy based on the partitions of non-zero
elements. Their method is not applicable to the raw Hi-C contact map, which contains a
high proportion of nonzero elements. We performed experiments on high-performance
computing resources and clusters with different numbers of nodes and memory
capacities. The results showed that this package could meet the strong demand for
normalizing massive Hi-C data given limited computing resources.
4.2 Algorithms and implementation
Given an observed chromosome contact frequency matrix Ο= (Ο
!"
)
!×!
over N
genomic regions, the IC method eliminates biases so that all genomic regions have
equal visibility. To make this algorithm memory-efficient, we designed a strategy of
breaking the matrix O into K equal partitions of complete rows and loading only one
partition into memory at any given time. Therefore, the memory requirement can be very
low when K is large. This strategy adapts the IC algorithm by adding two steps: (i)
60
loading the kth matrix partition Ο
!
into memory and (ii) updating this partition with the
last updated bias vector b. The new Memory Efficient Sequential algorithm (called IC-
MES) works even for the extreme case of K=N, where only one row is loaded each time.
ICMES is memory efficient, but it is still a sequential algorithm that runs on a single
computing machine. Therefore, it may be too slow when the machine has small
memory. To normalize large Hi-C matrices in a short time, we also designed a fast,
scalable and Memory-Efficient Parallel algorithm (called IC-MEP) that can maximally
exploit the parallelism of the normalization problem and make use of many commonly
available computing resources. In essence, the normalization problem is a data divisible
task: a series of operations that can independently work on separate partitions of the
data. This problem is perfectly suited to the data-parallel model in a distributed-memory
computing environment such as a computer cluster, which consists of K independent
processors (or nodes) that are loosely or tightly connected in high-speed networks and
have limited local memory. We employed the manager–worker parallel programming
paradigm. The manager task partitions the data into K blocks, then initiates K worker
tasks in different nodes; each worker task processes a single data block. The manager
coordinates all workers and synchronizes their calculations with updated bias vectors.
The IC-MEP algorithm has very little network messaging overhead, because no
communication exists between workers. Therefore, it is computationally efficient.
Furthermore, in order for each worker to run its task on limited memory, we also used
the memory-saving strategy of the IC-MES algorithm. That is, each worker further
partitions its assigned data block into a set of sub-blocks and loads only one sub-block
into memory at any given time. Theoretically, the IC-MEP algorithm can work on any
61
number of processors with any local memory capacity. Details of these three algorithms
and their flowchart figures are provided in the Supplementary materials. We used ANSI
C to implement the two sequential algorithms IC and IC-MES, because of its maximum
control and memory efficiency. We implemented the parallel algorithm IC-MEP using
the popular message-passing interface, which is a highly standardized and portable
environment designed for solving large-scale scientific problems on distributed memory
systems.
4.3 Results
We compared three algorithms (IC, IC-MES and IC-MEP) on the TCC/Hi-C data
of two human cell types: GM12878 and hESC. The whole genome is partitioned into the
equal-size regions (or bins); the bin size is the main indicator of Hi-C data resolution.
The results are listed in Table 1. In the experiment with 20K bp resolution data, the
basic IC algorithm requires a minimum memory of 86 GB. The algorithm ICMES can run
with just 4 GB memory (a common memory configuration in office computers) and
complete the same work in reasonable time (within 4 h). IC-MEP can dramatically
speed up the computation using more processors (about 6 min with 48 processors),
while using only 1 GB of memory in each processor. For the 10K bp data, none of HPC
computer nodes (with 128 GB memory limit) can load the full matrix (about 343 GB) for
the basic IC algorithm. But IC-MES and IC-MEP can use 2 GB memory to quickly get
the results (even in half hour using 48 processors). Details are provided in the
Supplemental materials. 4 Conclusion With the rapidly increasing resolution of Hi-C
datasets, the size of the chromatin contact map will soon exceed the memory capacity
62
of general computers. We developed Hi-Corrector, a scalable and memory-efficient
package for bias removal in HiC data. HiCorrector can run on any single computer or a
computer cluster with limited memory size to complete the task. We performed
experiments on high-resolution HiC data from two human cell types to show that the
package can process very large data sets in reasonable time using the single
processor, and in very short time with multiple processors. The experiments further
demonstrate the scalability of our package with the observation shown in Table 1 that
the more processors used, the faster it is. Therefore, Hi-Corrector is a timely resource
addressing the challenge of normalizing high-resolution Hi-C data.
Table 4-1 Running time of three algorithms on 10K and 20K bp resolution Hi-C
data
63
All algorithms were terminated after 10 iterations for the purpose of performance
comparison, since each iteration has almost the same running time. ‘Memory’ includes
only the memory allocated for computation in each processor, not system overhead.
The elapsed time format is “hours : minutes : seconds”.
4.4 Conclusion
With the rapidly increasing resolution of Hi-C datasets, the size of the chromatin
contact map will soon exceed the memory capacity of general computers. We
developed Hi-Corrector, a scalable and memory-efficient package for bias removal in
HiC data. Hi- Corrector can run on any single computer or a computer cluster with
limited memory size to complete the task. We performed experiments on high-resolution
HiC data from two human cell types to show that the package can process very large
data sets in reason- able time using the single processor, and in very short time with
multiple processors. The experiments further demonstrate the scalability of our package
with the observation shown in Table 4-1 that the more processors used, the faster it is.
Therefore, Hi-Corrector is a timely resource addressing the challenge of normalizing
high-resolution Hi-C data.
64
Chapter 5. A Fast and Robust Method for Identifying
Genomic Topological Associated Domains
5.1 Introduction
Chromatins serve as the physical carrier of genetic and epigenetic information.
Recent eukaryote genome studies indicate that the chromatin conformation plays an
important role in many nuclear processes, including gene expression, epigenetic
organization, and DNA replication [56, 94, 95]. Although our understanding of the spatial
organization of the chromatin is still very limited, recent development of the genome-
wide chromosome conformation capture techniques (HiC/TCC) [22, 24, 95] holds great
promises in rendering the insights into genome structures and the associated impact on
nuclear functions. One of the interesting observations from the HiC/TCC data of human,
mouse, and Drosophila genomes is that those genomes are linearly partitioned below
the mega base length scale into well demarcated physical domains with strong internal
connectivity and limited external interactions. Such physical domains, also termed
“Topological domains (TD)”, correlate strongly with histone modification enrichments,
active gene density, lamina interactions, replication timing, nucleotide and repetitive
element composition [56, 95, 96]. Evidences suggest that topological domains are
largely conserved across cell types and even species[56].
Several methods have been developed to identify topological domains. The first
relevant concept, physical domain, was proposed by Sexton[95], who also devised a
probabilistic approach to identify such domains. First, distance-scaling factors were
inferred from global genomic trend using an iterative maximum likelihood algorithm.
65
These factors were subsequently compared to contact intensities around the Hi-C
matrix diagonal to systematically identify physical domains as contiguous chromosomal
regions that are flanked by distance-scaling peaks[95]. Dixon et al.[56] defined
topological domains and proposed an identification method based on directionality index
(DI). DI quantifies the degree of upstream or downstream interaction bias for a genomic
region. The value of DI change considerably at the periphery of the topological domains.
A Hidden Markov model (HMM) is then used to identify the locations of topological
domains from DIs. Hou et al. [96] developed a Bayesian probability model based
method assuming that the number of paired-end tags linking two loci follows a Poisson
distribution. To estimate the location of the boundary points, a Markov chain Monte
Carlo (MCMC) strategy was adopted.
The existing methods introduced above identify topological domains from
different point of views and are all effective in terms of downstream biological analyses.
However, most of methods require extensive computational power, pre-processing step,
or too many human interference via parameter tuning which is hard without fully
understanding the methods. This can be major bottleneck for further investigation about
genome structure. In addition, there are significant inconsistencies among results
generated by different methods, or even by the same methods using different input
parameters. In fact, those domains inconsistently identified across methods/runs are
mostly active domains, because active domains are less compact and more likely to
form inter-domain contacts with other active domains. Such inconsistencies could be
caused by multiple reasons: (1) the heuristic nature of the algorithms, especially ones
that incorporate probabilistic/likelihood/HMM; (2) the noises in the data; and probably
66
the most importantly, (3) the ambiguity of domains from Hi-C data. When the signal of
domain boundary is not sufficiently strong, alternative segmentation could lead to
different topological domains that are all possible to exist. Such ambiguity is hard to
resolve by using the Hi-C data alone. Other data, e.g. histone marks and insulator
binding, may weigh in to precisely determine the boundaries of topological domain,
although such signals do not always exist in the domain boundaries.
5.2 Methods
In a proposed new approach to detect TDs, we focused on three major goals as
follows.
1. How to detect domains with minimum introduced parameters?
2. How to reduce false positive and false negative TDs?
3. How to reduce the computational cost?
With these goals in mind, we proposed an efficient and accurate method to
detect TDs with only single parameter. The input data is a Hi-C interaction matrix, where
entries are contact frequencies between any two segments (bins) within a chromosome.
Our method has three steps: (1) compute the sum of interaction frequencies in the
matrix for each bin to its neighbor within a small window, referred to as binSignal (2)
approximate optimal curve of the binSignal along bins in the matrix and detect the local
minimums from the curve (3) filter out false positive local minimums using p-value under
binomial distribution assumption. In the first step, for each bin we assess its local
interaction intensity, resulting in an interaction intensity curve along the chromosome.
For this curve, we find approximately optimal curve and subsequently detect local
minimums that shall represent the boundaries of topological domains. In the last step,
67
we filtered out false positive local minimums with p-values and selected most domain-
boundary-like regions. We further elaborate our method as the following subsections.
5.2.1 Generating binSignal
TD boundary can be defined as a region in between two adjacent distinguished
TDs. In general, those regions show lower contact frequency between its up-stream and
down-stream regions than contact frequency within TD. Under this rationale, we first
computed the average frequency of contacts between its upstream and downstream
regions for each bin. Let i denote the bin index. For each bin i, we introduced a
parameter representing a window of length 2w that covers its upstream region U
i
= [i-w-
1, i-w, …, i ], and its downstream region D
i
= [ i+1, i+2, …, i+w ]. The average contact
frequency between U
i
and D
i
, denoted !"#$"%#&' ! , which is calculated as follows:
!!!!!!!!!!!!!!!!"#$"%#&' ! = !
1
!
!
!"#$.!"#$ !
!
! ,!
!
!
!
!!!
!
!!!
!!!!!!!!!!(1)
where cont.freq denotes the contact frequency between two bins. Intuitively,
!"#$"%#&' ! represents the average contact frequency of bin I, which is the diamond
area in Figure 5-1A. Examing the !"#$"%#&' ! for all bins along a chromosome,
obviously, !"#$"%#&' ! shall be high for those bins located close to the centers of the
topological domains, and !"#$"%#&' ! shall reach local minimum at the TD boundaries.
In the following sub-section, we present our approach to find approximately optimal
curve of !"#$"%#&'!without any parameters and detect the local minimums from the
fitting curve.
68
Figure 5-1 Definition of binSignal.
(A) binSignal can be represented as contact frequency of up-stream and down-stream
of bin. Contact frequencies(cont.freq) are minimized in boundaries and maximized in
center of domain. (B) We detect turning points in binSignal applying optimal polygon
determination algorithm. In each line segment represented by two turning points, we
determined local minimum bins satisfying conditions mentioned in the text. (C) Using
our proposed method in this paper, we compute binSignal, local minimums, and p-
values. Local minimums in binSignal and bins with significant p-value are highly
69
consistent each other. In addition, those two bins well point to boundary-like regions in
Hi-C Heatmap.
5.2.2 Optimal approximation of binsignal and detect local minimums
Given the !"#$"%#&'!curve along the chromosome, we used a mathematical fitting
approach to find approximately optimal curve. Kumar Ray et al. proposed a parameter-
free linear time algorithm to fit digital curves with optimal polygon (Ray,B. and Ray,K.,
1993,1994). It determines the longest possible line segments with the minimum sum of
errors, which is maximizing objective function (F
j
). The L1 norm is used to measure the
closeness of a polygon to a digital curve. The original algorithm can be rewritten as the
following pseudo code.
Applying this shape determination algorithm to our series data, !"#$"%#&', we fit
the approximated curve to given !"#$"%#&'!in linear time. In practice, each end of the
line segments constitutes a turning point. The local minima of binSignal curve are at
bins with the smallest contact frequencies compared to the neighboring left and right
bins. Therefore, we defined turning points satisfying the following conditions as the local
minima.
• Turning points whose derivative change from negative to positive.
• Points with smallest contact frequency in segmented line.
As shown in Fig 5-1B, detected local minima by the proposed approach well captured
boundary-like bins, as well as avoiding noise-like local minima.
70
5.2.3 Statistical filtering of false positives
In order to ensure the accuracy of detected boundary, we introduced a statistical
method for detected local minima. Topological domains can be defined as continuous
genomic regions with high contact frequency and the contact frequency is higher than
outside. That is, contact frequency demarcating two adjacent domains (i.e. inter-domain
interactions) are relatively lower than within contact frequency (i.e. intra-domain
interactions) in two adjacent domains. Keeping this rationale in mind, we performed a
hypothesis test to examine whether our local minimums showed significant intra- and
inter- domain contact frequency difference. Our null hypothesis is that inter-domain
contact frequency (i.e. cont.freq (U
i
, D
i
)) is equal to intra-domain contact frequency (i.e.
cont.freq (U
i
, U
i
) and cont.freq(D
i
, D
i
,)). We computed p-value by wilcox rank sum test
with directional alternative hypothesis. Note that we used z-score of cont.freq (A, B)
normalized by all cont.freq (A, B) with the same genomic distance since the contact
frequency highly depends on the genomic distance. Finally, we filtered out local minima
with p-value larger than significance level (p-value<0.05).
Based on the computed p-value, we checked whether two adjacent domains are
well distinguished by detected local. As shown in Fig 5-1B, almost all local minima are
highly consistent with the regions with p-value< 0.05 and, in practice, small proportion of
local minimums was discarded.
71
5.3 Results
5.3.1 Algorithm performance
Due to the lack of ground truth, source codes and data availability, we compared
our TDs with Dixon’s TD which is widely accepted. We performed experiments with two
mouse cells (embryonic stem cell and cortex cell) and human cell lines (hESC and
IMR90) in 40kb from Dixon’s experiment. Our program is written in R (CRAN) script.
When tested on Intel Xeon 3.3GHz CPU with 10GB RAM, the total running time is
around 1~2 minutes to process the whole genome. In comparison with several hours for
Dixon et al.[56], our method is extremely efficient. The only parameter in the program,
w, denoting the window length for computing the binSignals, was set to 5. Our program
generated ~6000 and ~4700 TDs of hESC and IMR90, respectively (see Table 5-1).
Consistent with the results from Dixon et al., we found that the average size of TDs in
hESC was slightly smaller than that in IMR90, i.e. ~450 Kb versus ~600 Kb. We
identified roughly twice as many TDs per cell type as those generated from Dixon et
al.'s program. As a result, our TD size is around half of that of Dixons’ (Table 5-1). In
addition, TD boundaries detected by our method captured well boundary-like regions
more sensitively and almost boundaries in Dixons’ were covered by our TDs. While
Dixon’s boundaries have a tendency to indicate between large size of domains, our
domains are able to detect the boundaries in between small size of TDs, as well as
large size of TDs. For newly introduced TD boundaries, we tested if those boundaries
have similar characteristic with commonly detected ones in later section.
72
73
Figure 5-2 Enrichment plots of five epigenetic marks for two mouse cells.
Barrier-like elements and promoter marks have a tendency to locate close to domain
boundaries. H3K4me1 show slight depletion near boundary and H3K27ac have different
pattern across cell types.
Table 5-1 Comparison of our method and Dixon’s method.
We identified twice as many TDs per cell type as Dixon's method did, and our TDs are
smaller than Dixon’s.!
74
5.3.2 Correlation based comparison with the reported TDs in previous
research
An important characteristic of TDs is that the contact frequency profiles of the
bins within a TD should be more similar to each other than to bins outside of TD. We
hence propose to use the average Pearson correlation coefficient (PCC) of the contact
profiles of bin pairs within the TDs as a TD quality evaluation score. Moreover, since
topological domains are local features of chromatin organization, we therefore propose
a weighted Pearson correlation coefficient (wPCC) as an alternative measurement in
addition to the PCC, where each bin i’s contribution is weighted by the background
contact frequency w
i
for two bins at a distance between the bin i and the center bin of
the TD.
For two contact profiles x and y of 2 bins within a TD, the weighted correlation
coefficient is calculated according to (2).
75
!"## !,!;! =
!"# !,!;!
!"# !,!;! !"# !,!;!
(2)
where
!"# !,!;! =
!
!
(!
!
!!(!;!))(!
!
!!(!;!))
!
! !
!
(3)
and
! !;! = !
!
!
!
! !
!
! !
. (4)
Our consistently TDs achieved the highest PCC and wPCC measures, and with
the lowest variances as well. This evaluation indicates that our method identifies more
significant TDs in terms of the similarity of their contact profiles.
5.3.3 Epigenetic characteristics of Topological Domains
In order to check the functional characteristics of the TDs, we computed the
enrichment of barrier-like element (CTCF), promoter related elements and histone
modification (RNA Polymerase II and H3K4me3), and cell-specific enhancer histone
modification (H3K4me1 and H3K27ac) inside the TDs for mouse cortex and mouse
embryonic stem cells(mESCs). All these epigenetic data are collected from[97]. We
gathered all peaks of five epigenetic marker with p-value<0.05 using MACS14 [98]and
examined the relationship between their relative frequency and their location in TD. As
shown the histogram in Figure 5-3, CTCF-enriched regions are located close to domain
boundary and its abundance decreases towards the center. This indicates that CTCF
can act as insulator to block the two adjacent regions [99]. Similar with CTCF, promoter
related elements, such as Polymerase II and H3K4me3, also show the higher peak near
76
boundaries than domain inside in both cell types. This infers that the gene transcription
start site(TSS) are mainly located at the domain boundary, and transcription starts from
domain boundary towards inside. In addition, we observed that cell specific enhancer
marker, H3K4me1, slightly depleted as it close to domain boundary in both cell type. On
the other hand, interestingly, we observed different patterns for H3K27ac across cell
types: slight peak in boundary of mESC and slight depletion in boundary of cortex cell.
All these observations are highly consistent with the results in previous studies [56, 95,
96] and supportive to the notion that functional domains are highly related to physical
domain structure.
77
Figure 5-3 PCC and wPCC boxplots of TDs generated by different methods on
cortex and hESC cell line.
For each replication and combined data, our method shows higher correlation
coefficient than Dixon’s method in both measurements
78
5.3.4 Our unique boundaries share the similar epigenetic
characteristic with common boundaries in two methods
As shown in Table 5-1, our method identified around twice more domains
compared to Dixon’s method. We questioned, for domain boundaries identified in our
method, but not in Dixon’s method, whether those boundaries are considered as the
true boundaries. As mapping one set of domains to the other set of domains, we
defined the boundary detected in both methods as common boundaries and the
boundary detected in only one method as unique boundary. First, we built a n by m
matching table, where n and m are equal to the number of domains in method A and
method B, respectively. And, we set the entries as 0 or 1. For example, as shown in
Figure 5a , if i-th domains in method A, A
i
, are covered by four domains by method B,
such as B
j
, B
j+1
, B
j+2
, and B
j+3
, we computed the overlap score for all possible
combinations with B
j
to B
j+1
and filled the entries as 1 when the combination has the
highest overlap score. In Fig 5-(a), the overlap score can be represented as
!"#$%&' !" = !
!−!
!−!
Using matching table and overlap score, we classified common and unique boundaries
into two different methods. As shown in Fig 5-4B, we identified 800~1,000 unique
boundaries in our method, 30~150 in Dixon’s unique boundaries, and 1,600~3,000
common boundaries for each cell type. Then, we examined the enrichment patterns of
four strong epigenetic profiles, such as CTCF, PolII, H3K4me3, and H3K4me1, at
domain structures for common and unique boundaries of two mouse cells. We inferred
that if the unique domain boundaries show similar epigenetic pattern with common
boundary, the unique boundary also can be called as domain boundary. In Fig 5-5, Blue
79
and red curves indicate the common boundary and unique boundary, respectively. As
shown in Fig 5-5, unique and common in our method are similar patterns each other
with high correlation.
Figure 5-4 Boundary analysis of different method
(a) Matching Table Scheme. We compute overlap score and best matching TDs in
other method, as mapping TDs by one method to TDs by the other method. With
80
overlap score and matching table, we identified common boundaries and unique
boundaries for each method. (b) As same as comparison with Hi-C heatmap, Dixon and
our method share high proportion of boundaries.
81
Figure 5-5 Epigenetic profiles near domain boundaries for two mouse cells.
Black curves indicates epigenetic characteristic for the total boundaries in our method
and right(Dixon’s) show highly consistent patterns. Comparing unique boundaries(red
curves in both sides) with common boundaries(blue curves in both sides), our
method(left plots) shows higher correlation than Dixon’s method(right plots) in all
epigenetic profiles.
5.4 Discussion and Conclusion
We have presented a robust and efficient method in identifying chromatin
topological associated domains (TDs). Compared to previous approaches, our method
is not only computationally efficient, but also depends only on one single parameter.
The only parameter in our method, w, is related to the expected minimum size of TD.
This parameter is positively correlated with average domain size and, obviously, has
negative correlation with the number of TDs. In addition, our approach generates
consistent domain set in different window size. First, we fixed a domain set by larger
window size and mapped the other domain set by smaller size of window. Then, we
calculated overlaps score (Figure 5) of two domain sets derived by different window
setting, such as (w=5 vs w=10), (w=5 vs w=15), (w=5 vs w=20), and (w=10 vs w=20).
As shown in Figure, the overlap score of each pair show high overlap score(>90%).
This examination infers that the domain set by larger window size can be represented
by combination of the other domain set by smaller window size, as well as domains are
hierarchically structured.
Although it is still questionable to determine the best parameter like many other
studies, due to the lack of ground truth about real-domain structure and resolution
restriction, we set our window size w to 5 based on our own experience. We computed
82
the difference between intra- and inter-domain interactions for all identified domains as
changing window size from 3 to 20. The difference between intra- and inter-domain
interactions shows peak when the window size is 5 to 7. This measurement is
reasonable since intra-domain interactions are stronger than inter-domain interactions
from domain definition. As considering the minimum domain size in Dixon’s method
(~200KB) and the contact frequency difference based measurement in above, we
heuristically set our window as 5. Under this parameter setting, we identified around
twice more domains than Dixon et al.[56], where too many single-bin domains are
identified. Furthermore, we showed that the number of domains and our identified
domains are reasonable through comparison of unique and common domain
boundaries.
Domains identified by our method support the previous claim that TD boundary is
highly related to CTCF, promoter regions. Interestingly, housekeeping genes are more
abundance around the center of TD than at the boundary. Moreover, we also observed
that TDs are conserved across cell types where there can be more than 50% overlap. In
addition, as comparing genome structure of two different cell types, we confirmed that
housekeeping genes are significantly enriched inside common TDs, which infers that
topological conservation across cell types are highly related to housekeeping genes.
83
Chapter 6. High Resolution Human Genome
Structures with Population-based Modeling Approach
6.1 Introduction
The 3D structural organization of the genome plays a key role in nuclear
functions such as gene expression and DNA replication[55, 100, 101]. Thanks to the
recent development of genome-wide 3-C based proximity ligation assays (Hi-C[19, 22]
and its variant TCC[24]), close chromatin contacts can now be identified at increasing
resolution[23, 25, 56, 88, 89, 91, 95, 102], providing new insight into genome
organization. These methods measure the relative frequencies of chromosome
interactions averaged over a large population of cells. However, individual 3D genome
structures can vary dramatically from cell to cell even within an isogenic sample,
especially with respect to long-range interactions[103-105]. This structural variability
poses a great challenge to the interpretation of ensemble-averaged Hi-C data [24, 33,
106-108] and prevents the direct observation of cooperative interactions, which are
chromatin interactions likely to co-occur in the same cell. This analysis problem is
particularly pronounced for long-range (cis) and inter-chromosomal (trans) interactions,
which are generally observed at relatively low frequencies and are therefore present
only in a small subset of individual cells at any given time[91, 101, 103]. Despite their
low frequencies, long-range and inter-chromosome interaction patterns are not random
noise. In fact, these interactions are more informative than short-range interactions in
determining the global genome architectures in cells, and are often functionally relevant
84
– interactions between transcriptionally active regions are often inter-chromosomal in
nature[24]. Due to their variable nature, long-range interactions can be part of
alternative structural conformations (i.e. structural states), which makes their
interpretation in form of consensus structures impossible and inferring which of these
long-range interactions co-occur in the same cell remains a major challenge.
These challenges cannot be easily overcome even by the new single-cell Hi-C
technology[103], because it currently detects only a very small fraction of chromatin
interactions in a cell. Also, one might need to profile many thousands of cells before the
data cover a statistically representative spectrum of genome structures. It is therefore
highly beneficial to develop methods that use ensemble-averaged Hi-C data to infer
cooperative long-range chromatin interactions, which in turn would allow reconstruction
of a set of genome structures that accurately captures a genome’s structural variability.
Most current structure modeling approaches relate Hi-C contact frequencies to
distances, assuming that a lower contact frequency corresponds to a larger distance
between loci in 3D space[23, 29-31, 88, 109-111]. A few representative structures are
then selected that best match the inferred distances. The major limitation of these
methods is that the generated consensus structures do not represent single instances
of actual genome structures and cannot capture the variable nature of long-range and
trans chromatin interactions in different structural states. Further underlining this
problem, no single 3D model from these approaches can simultaneously satisfy all the
physical contacts measured by the Hi-C experiments.
85
We recently pointed out this problem and proposed the concept of population-
based genome structure generation to explicitly model the genome structure variability
between cells from Hi-C data[24]. Note that to determine the true population of genome
structures underlying the Hi-C data would require knowing which exact chromatin
contacts are present in each cell. The Hi-C data cannot provide this information, but it is
possible to approximate the underlying 3D genome structure population given additional
information. Here, we show that embedding the genome in 3D space enables such an
approximation by facilitating the inference of likely cooperative interactions. In 3D space
the presence of some chromatin contacts induce structural changes that may make
some additional contacts in the same structure more probable, while other contacts less
likely. Moreover, in a single structure, each chromatin region can form only a limited
number of interactions and is confined to the nucleus. These constraints and
considerations effectively restrict the conformational freedom of the chromosomes, and
permit us to infer likely cooperativity between subsets of the observed chromatin
interactions, which in turn helps deconvoluting the Hi-C data into a set of plausible
structural states.
Based on the above rationale, we have developed a probabilistic framework that
infers cooperativity between interactions from Hi-C data to generate a population of
individual 3D genome structures. We determine the genome structure population by
maximizing the likelihood function for observing the Hi-C data. Because the problem
does not have a closed-form solution, numerical routines are needed to approximate the
solution. We propose an iterative procedure to maximize local approximations of the
likelihood function, which produces a population of genome structures whose chromatin
86
domain contacts are statistically consistent with the Hi-C data. The result is the best
approximation of the underlying true population of genome structures, given the
available data. Our method distinguishes between interactions involving two
chromosome homologues and therefore is capable of generating structure populations
for entire diploid genomes, which also allows direct assessment of our findings with
image analysis techniques. Further, because the generated population contains many
different structural states, it can accommodate all the observed chromatin interactions,
including those that would be mutually exclusive in a single structure. Our method is
sufficiently flexible to integrate additional experimental information from various data
sources into the log-likelihood function in the future. Finally, our method is applicable at
various levels of resolution.
As a case study, we tested our method on human lymphoblastoid cells using
data from our TCC experiments. We generated a population of 3D structures that
correctly predicts many features of the lymphoblastoid genome known from imaging
experiments, including the distributions of inter-chromosomal distances between gene
loci as well as the preferred nuclear locations of the chromosomes. The obtained
structure populations show many important structural features, including the relationship
between structures and replication activities and the formation of function sub-
compartments in the nucleus.
87
6.2 Methods and Materials
We apply our method on the TCC data of human lymphoblastoid cell GM12878
to generate TAD level genome structure population. The flow charts in Fig 6-1 outline
our method. In the following we describe the data processing and 3D modeling
methods.
Figure 6-1 Workflow of the modeling process from data preprocessing to
modeling procedure
88
The blocks in the left panel represents the detailed of sub-procedures in the main work
flow.
6.2.1 Pre-processing of Hi-C Contact Data
Due to the existence of a variety of biases in the Hi-C/TCC experiment which
influence the ligation efficiency (including GC content and variable fragment lengths), a
data normalization step is necessary to remove these influences. Here we followed a
protocol developed by Imakaev et al and generated a normalized contact frequency
matrix at 100 kb resolution from the TCC data[27]. Bins with very low sequencing
coverage are considered as low-confidence regions and were discarded in the
normalization process. After obtaining the normalized contact frequency matrix, we then
applied the topological associated domain (TADs) calling method to detect the size and
location for TADs and non-TAD regions, such as topological boundaries and
unorganized chromatin regions[56]. In total 2320 TADs and non-TAD regions have been
identified for GM12878 cell, with an average size of ~1.3Mb. These TADs and non-TAD
regions will be further represented as spheres in our structure modeling.
6.2.2 Contact Probability Generation
We first convert the contact frequency to the contact probability value for
interactions between domain pairs. By definition, TADs are consecutive regions with
relatively high interaction frequencies within themselves and much lower frequency of
contacts between regions in neighboring TADs. There are cases at the TAD-level
resolution contact frequency matrix that show very loose interaction patterns between
neighboring TADs, which suggests a low chance for those consecutive genomic regions
89
to form contact in 3D space. In contrast to our previous approach[24], consecutive
TADs in our current model do not necessarily form contacts between them in 100% of
the structure population. To accommodate such a property, we now adapt a different
strategy for the parameter f
max
in comparison to our previous approach, which
represented the contact frequency value for observing a 100% probability to form
contacts in the population. The f
max
value is calculated as the average of all contact
frequencies of all possible contacts of regions within TADs. A contact probability matrix,
denoted as A= (a
ij
)
N×N
, can be calculated through the formula a
ij
=
f
ij
f
max
describing the
probability of contact between regions i and j, where f
ij
anda
ij
represents their contact
frequency and probability values, respectively.
Due to cooperativity of contacts or other random encounter effects, TAD spheres
may be close to each other even if they are not constraint in the model causing
overestimation of contacts. Those contacts we referred to as reduntant contacts.
6.2.3 Probabilistic model and problem formulation of the structure
population
Following our previous approach [24], we optimize a large population of
structures that will be used to investigate spatial organization of the genome. The entire
procedure can be treated as an optimization problem for a maximum likelihood function.
For a given contact probability matrix A, we aim to reconstruct all 3D structures X in a
population of M genome models each containing 2N homologous regions. For
reconstructing M structures of a population from the observed averaged TCC data A,
we would need the detailed information about which spheres have contact in which cells
90
in the population. Because information in A is far from completeness, due to (i) its
population-averaged nature without contact information at the single cell level and (ii) its
lack of the homologous information for all contacts. Therefore, we introduced a latent
variable W for complementing every single cell’s information and homologous
information missed by A. It is a binary-valued 3rd-order tensor specifying the contacts of
homologous genomic regions in each structure. We can jointly approximate the
structure population X and the contact tensor W by maximizing the log-likelihood of the
probability
logL(X |A,W)= logP A,WX
( )
.
91
Figure 6-2 Schematic of population-based genome structure modeling
(A) A population of M genome structures is constructed, in which the formation of
contacts between chromosome domains over all structures is statistically consistent
with the contact probability matrix A, derived from TCC experiments. We formulate this
problem as a maximum likelihood estimation problem. Because the TCC data A are
incomplete, we introduce the “contact indicator tensor” W, a binary 3rd-order tensor that
can complete the missing contact information in A. That is, W specifies which domain
contacts exist in which structures of the population and also distinguishes between
contacts from homologous chromosome copies. The structure population X and contact
92
tensor W are approximated by maximizing the log-likelihood
P A,WX
( )
=P AW
( )
P WX
( )
. Also shown is the “projected contact indicator tensor”
W= (w
IJm
)
N×N× M
, which is derived from W by projecting its diploid genome
representation (with 2N domains with homologous domain copies) to its haploid
representation (with N domains without homologous domain distinction). (Note that
capital letter indices I and J relate to domains without distinguishing between two
homologous copies, while lower letter indices i, i’ and j, j’ distinguish between two
copies.) (B) The maximum likelihood optimization is achieved through a step-wise
iterative process, where we gradually increase the optimization hardness by gradually
adding contacts of the matrix
A
θ
= a
IJ
θ
( )
N×N
A
!
= (a
!
IJ
)
!×!
, (where
a
IJ
θ
=
a
IJ
0
⎧
⎨
⎪
⎩
⎪
if a
IJ
≥θ
otherwise
) with decreasing contact probability threshold
θa
θ
IJ
=
a
IJ
ifa
IJ
≥θ
0 otherwise
. The idea is to generate a structure population
ˆ
X
θ
and contact
indicator tensor
ˆ
W
θ
that at first reproduce only the most frequent interactions (a
IJ
≥ θ),
so that interactions with contact probabilities lower than a certain value θ are ignored
(for example, we can start with θ = 1). Then
1 ˆ
θ
X is used as the initial model of the next
round of optimization for
2
θ
A , which includes all domain contacts with lower contact
probabilities (i.e., using only
2
θ ≥
IJ
a and
21
θθ < ). This in turn leads to the refined
structure population
2 ˆ
θ
X . We repeat this process, each time adding more domain
contacts to the input data (
A
θ
with lower θ ), until
A
θ
is almost close to A. When θ is
close to zero,
ˆ
X
θ
reproduces all the contacts in A. This process generates a structure
population that is consistent with the TCC data.
We design an iterative procedure to maximize the log-likelihood function that consists of
two steps (Fig 6-2):
• Assignment step (A-step): Given the current estimated model X
(k)
, estimate the
latent variable W
(k+1)
by maximizing the log-likelihood over all possible values of
W.
W
(k+1)
= argmax
W
logP A,WX
( ) { }
, given X =X
(k)
• Modeling step (M-step): Given the current estimated latent variable W
(k+1)
, find
the model X
(k+1)
that maximizes the log-likelihood function.
93
X
(k+1)
= argmax
X
logP A,WX
( ) { }
, given W=W
(k+1)
We use a stepwise optimization strategy to gradually increase the optimization
hardness (Fig 6-2) while introducing cooperative chromatin interaction. The idea is to
begin by estimating a structure population, which at first reproduces only the most
frequent interactions (a
IJ
≥ θ), so that interactions with contact probabilities lower than a
certain value θ are ignored. Then, using this structure population as the initial condition,
we add contacts with lower probabilities and perform another round of optimization. In
other words, the contacts in A are added gradually to the structure population X and
tensor W. The A/M-steps are repeated for several rounds and at different probability
level (θ) until a stop criterion is achieved.
In the A-step, we use an efficient heuristic strategy to estimate W by using
information from the structure population generated in the previous M-step. We assume
that assignments of a given chromatin contact across the contact indicator tensor W are
more likely realized in those genome structures in which the corresponding chromatin
domains are already closer in 3D space. In particular, for each potential contact
between domains I and J, we determine a cutoff activation distance
d
IJ
act
based on the
distribution of all distances for this pair in all structures of the model population
(Methods and Supplementary Fig. 1). The cutoff distance is defined such that the
probability
P(d
IJ
≤ d
IJ
act
) equals to
a
IJ
and is used to estimate the contact indicators.
In the M-step, maximizing logP A,WX
( )
logP A,W X) can be reduced to maximize
only logP WX
( )
maxlogP(W|X) , because A and W are known and
P A,WX
( )
=P AW
( )
P WX
( )
. We use simulated annealing dynamics and conjugate
94
gradient optimizations to generate a population of 3D genome structures X for which all
the chromatin contacts in the contact indicator tensor W are physically realized in the
genome structures
max log P W
m
X
m
( )
indicating that the likelihoods of all contacts in the
structure population are maximized to approximately one.
Each individual structure can be independently optimized in parallel. To efficiently
optimize an individual structure, we employed simulated annealing dynamics and
conjugate gradient optimizations. The former is a structure modeling approach that can
efficiently arrive at a stable state minimizing constraints violations; while the latter can
adjust local structures in order to reach the optimum. Both are implemented in the
Integrated Modeling Platform (IMP, http://www.integrativemodeling.org/)[46, 76].
6.2.4 Distance threshold for restraint assignment
One problem we face is to distribute contact for pair (I,J) in a portion of structure
population according to the input aIJ, that is to pick which structures shall (I,J) be
restrained. We designed an efficient heuristic, i.e., a distance threshold method, to
approximate the solution. We assume that the assignments of the contact indicator
tensor W are more likely realized in those genome structures in which the
corresponding chromatin domains are already closer in 3D space.
In short, for each pair (I,J) we calculate their distances in population and then rank the
2M distances in increasing order. The distance threshold,
d
IJ
act
, is determined as the
distance value at the 2a
IJ
M
th
and the (I,J) pairs in population whose distance are less
than the threshold shall be restrained forming contact.
95
6.2.5 Methods improvement for TAD resolution models
Our current modeling method is mainly based on the approach described
earlier[24]. However, slight modifications need to be introduced in this paper as we deal
with higher resolution model that consists of roughly 4 times smaller and 4 times as
many more domains.
Below we discuss several modifications that have improved the initial method.
Estimating the effective probability
We revisit the structure population from our previous coarser model (~5Mb
resolution) and compare every pairwise contact probability from the model and
experiment (input data). The analysis shows that the linear regression line has a slope
of 1.2, which indicates a 20% overestimation of the contacts formed. Due to
cooperativity of contacts or other random encounter effect, TAD spheres may be close
to each other causing overestimation of contacts. Those contacts we referred to as non-
constrained contacts. When it comes to a TAD resolution model at a resolution ~1Mb,
the total number of constraints will be one order of magnitude more than those in lower
resolution models, and the overestimation of the constraints will further make the
optimization even harder to perform. The compactness of local structures, which can
only be revealed through high-resolution model, will be strongly affected by the number
of constraints applied. Thus, it is extremely important to reduce the overestimation of
contacts to better interpret the TCC data.
96
The reason of the over-estimation in our low-resolution modeling method is
caused by the absence of imposing the non-contact items. In the probably formula for
the modeling method, we have the structure maximization part represented as
w
ijm
logP(w
ijm
=1d
ijm
)+(1−w
ijm
)logP(w
ijm
= 0d
ijm
)
⎡
⎣
⎤
⎦ m=1
M
∑
j=i+1
2N
∑
i=1
2N−1
∑
. However, due to the
large number of non-contact constraints suggested by the model and the limitation in
computing power, we do not force the non-contact constraints in optimization runs when
w
ijm
=0 , which means that bead i and j in structure m may still have a probability to form
contacts. Those contacts without applying constraints will further result in an
overestimation of the modeled probability matrix.
To correct the effect by the overestimation of contacts in high-resolution structure
modeling, we use a heuristic method to infer the minimal possible constraints towards
optimizing N
constrained−contacts
+ N
nonconstrained−contacts
= N
expected−contacts
in the final structure
population. Unfortunately, the number of non-constrained contacts cannot be obtained
in the structure before we optimize a structure population. However, we can use the
information from the structure population generated in the previous iteration to estimate
the number of constraints in the current structure population, under the step-wise and
iterative optimization scheme.
Let assume the non-constrained contacts are detected by chance within a distance
threshold, which are caused by a random encounter of contacts nearby. Consider any
portion a
ij
of step: k structure population that forms contacts (and we wish to have
according to the input data A) can be divided into two parts, a portion that are under
restraint P
ij
and a complementary portion (1-P
ij
) that is under random encounter Q. We
then express the relation as follows.
97
a
IJ
= P
IJ
step:k
+ (1− P
IJ
step:k
)Q
(1)
We aim to solve for the effective restraint activation probability at the current
step,
P
IJ
step:k
=
a
IJ
− Q
1− Q
(2)
Then Q can be estimated by rewriting equation (1) and derived from data in the
structure population in the previous iteration (step: k-1),
a
IJ
'
= P
IJ
step:k−1
+ (1− P
IJ
step:k−1
)Q
(3)
,
where
a
IJ
'
denotes the fraction in the previous population that domains I and J are
detected in contacts. Rearranging Eq. (3) we obtain Q that is needed to solve Eq. (2),
Q=
a
IJ
'
− P
IJ
step:k−1
1− P
IJ
step:k−1
(4)
.
Note that
P
ij
step:k−1
is known from the previous A-step. In the current method, now
P
IJ
step:k
replaces
a
IJ
that was used to determine distance threshold
d
IJ
act
in our previous
method[24]. The performance of this improvement will be further explained in results.
98
Upper bound distance constraints for consecutive domains
In the previous approach, we assumed that consecutive domains form contacts
in 100% of population. Here we drop this assumption, acknowledging the fact that
consecutive TADs may be loosely connected as a result of much less frequent contact
between than within TADs. We apply a distance constraint on every consecutive pair to
approximate the contact probability determined by the input data. We consider a contact
is formed when the soft spheres of TADs overlap, i.e. the center-to-center distance
between pair (i,J) is d
ij
< 2(r
i
+ r
j
). We determine an upper bound distance between the
pair as a function of probability a
IJ
. We assume that the probability of TAD J is in
contact with TAD I is proportional to the ratio of interaction volume and a spherical
volume made by their separation distance (D), P
ij
(r
i
,r
j
,D)=
V(2r
i
+ 2r
j
)−V(r
i
+r
j
)
V(D)−V(r
i
+r
j
)
, where
V(r)=
4π
3
r
3
. Rearranging the equation yields the upper bound distance,
D
ij
(P
ij
,r
i
,r
j
)=
7
P
ij
+1
⎡
⎣
⎢
⎤
⎦
⎥
1
3
(r
i
+r
j
) .
We check that the final population modeled with this modification shows a good
agreement with the experiment.
Technical details
We also modify our dynamic simulation technique in the M-step to better take
advantage of existing local conformations during the optimization process. In contrast to
our previous approach, now we use coordinates of genome structures generated in the
previous iteration as starting configuration to reduce the search space of local optima
99
for the next M-step. However, such starting structure may be unfavorable when more
constraints are added. To handle this problem, we first expand the nuclear volume and
then gradually shrink it (e.g. setting nuclear radius from 1.2 to 0.8 Rnuc with interval of
0.1Rnuc) while performing simulated annealing dynamics. Our experience shows that
this strategy helps to reach an optimum conformation quickly.
The lack of constraints at early A/M steps usually cause extended conformation
of chromosomes. While constraints are added gradually, the more compact homolog
copy will have higher chance to get newly added constraints. As a result, the extended
copy will have a high chance to stay extended even in the later stage. To handle this
problem, we introduce a bounding spherical volume for every chromosome to mimic
chromosome territory. The radius of the bounding sphere is proportional to the
chromosome length. This spherical territory constraint is only applied at early stage of
A/M optimization. This strategy helps both homologues to have similar distribution of
contact constraints during optimization.
6.3 Results and Discussion
6.3.1 Overview of the new population based modeling methods
The modeling procedure consists of data preprocessing, contact probability
matrix generation and the simulation procedure. For the data pre-processing step, we
generated the normalized contact frequencies matrix at bin level resolution using the
ICE method, and then identified topological domains based on the obtained matrix.
We need to calculate the contact probability matrix at TAD-level from the bin-
level contact frequencies matrix to guide the population-based simulation. However,
100
different from the assumption in our previous approach, which assumes that
consecutive domains always have 100% contact probability, the new population based
modeling method assumes that the contact probability between different consecutive
TADs are different. This assumption accommodates the fact that TADs may be
separated by topological boundaries, and some consecutive TADs are close to each
other while others are not. In the new method, we only assume that bins inside TADs
always in proximity to each other, and then we can derive the contact probability
between each TAD pairs accordingly (Methods and Materials).
The modeling approach is a probabilistic framework to generate a large number
of 3D genome structures (i.e. the structure population) whose chromatin domain
contacts are statistically consistent with the input contact probability matrix (Figure 6-
2A). Our structure population represents a de-convolution of the ensemble-averaged
TCC data into a population of individual structures and represents the most likely
approximation of the true structure population given all the available data. We
formulated the structure optimization problem as a maximum likelihood estimation
problem and designed an iterative optimization algorithm with a series of optimization
strategies for efficient and scalable model estimation.
The entire simulation can be treated as an optimization problem for a maximum
likelihood function. For a given contact probability matrix A, we aim to reconstruct all 3D
structures X in the population of M models each contains 2N homologous regions. For
reconstructing M structures of a population from the observed TCC data A, we need the
detailed information of which sphere pairs have contact in which cells for supplementing
the observed average data A, because A is far from completeness, due to (i) its
101
population-averaged nature without contact information at the single cell level and (ii) its
lack of the homologous information for all contacts. Therefore, we introduced a latent
variable W for complementing every single cell’s information and homologous
information missed by A. It is a binary-valued 3rd-order tensor specifying the contacts of
homologous genomic regions in each structure. We can jointly approximate the
structure population X and the contact tensor W by maximizing the log-likelihood of the
probability
logL(X |A,W)= logP A,WX
( )
.
The maximization is performed through an iterative optimization framework
(Figure 6-2B). We iteratively perform the following two steps by, (1) fixing X to calculate
W and then (2) fixing W to calculate X to optimize the likelihood. In our macro domain
modeling method, we applied constraints based on W into the simulation, which will
generate X with an overestimation of resulting contacts. This overestimation of total
contacts actually results in a structure population more compact than expected. The
reason of the overestimation is caused by cooperativity, which means that if we applied
a contact constraint of two TADs, which will also bring its nearby TADs close. This
cooperativity will give us more contacts than our targeted numbers. To fix this issue, we
change the step (1) from calculate contact tensor W to a direct constraints tensor !
!
,
this !
!
will take care of the cooperativity issue which will generate a better structure
population in step (2).
The other reason for this modification is that in the TAD-level model we expect
~25 times more constraints applied to the simulation compared to our macro-domain
model. We will need to exclude those redundant constraints, represented as !
!
, and
only applied !
!
to make the final contact tensor W (Methods and Materials).
102
6.3.2 Estimation of redundant constraints through A/M steps improve
the obtained structures
We first analyze if the estimation of redundant constraints helps to reduce the
violations of constraints through the A/M steps. A distance ratio of a constraint is
defined as the distance of two TADs divided by the sum of the radius. If the distance
ratio is larger than 1.05, we consider the constraint to be violated, otherwise the
constraint is satisfied. We define a violation of constraint if the distance between two
TAD surfaces is larger than 1.05 times the sum of the radius. At A
5%,
as we process
more and more A/M steps, the violation of total constraints decrease dramatically for
each structure. The average violation percentage drops from 5% to 0.25% after 11
iterative steps (Figure 6-3A), which gives us a final population satisfying the stopping
criteria. For those violations, we also investigate the deviation of the violations from its
constrained distance. The histogram shows that the distance ratios of violations are
mainly within 1.05 to 1.2, with a percentile value of 99% at 1.2(Figure 6-3B).
As more and more the A/M steps processed, we will expect a convergence of the
redundancy of each constraint, which also reflects a convergence of number of the
applied constraints due to formula X. For each step, we can calculate a redundant value
for each constraint to get a redundancy matrix. We then calculate the standard deviation
of the difference between two redundancy matrixes in the consecutive A/M step. It
shows that the standard deviation of redundancy matrix is getting smaller as more A/M
steps goes, which shows a convergence effect of the algorithm (Figure 6-3D).
103
Figure 6-3 The estimation of redundant constraints helps to get better
structures.
(A) Comparison the total constraints applied using new method with the macro domain
modeling method through iterative A/M steps. The distributions of constraints applied for
each step are represented in the boxplot. The dash line represents the total number of
constraints of macro domain modeling method. (B) The total number of violation drops
through iterative steps. (C) Correlation of the contact probability matrix generated from
the structure population with input matrix increases through the iterative steps. (D) The
total number redundant constraints tend to converge through the iterative A/M steps.
6.3.3 Validation of models with TCC contact profile
As more and more A/M step performed, the contact heatmap generated from the
model are more and more similar to the experimental result. The correlation between
104
the model and experiment contact frequency matrix is increasing with each iterative
step in the optimization, especially for the inter-chromosomal correlation, which
increases from 0.47 to 0.92 (Figure 6-3C).
After 11 A/M steps, we obtained a final structure population at A5%. We compare
the Hi-C contact profile generated from the final structure population at TAD level with
the experiment in detail. The result shows that the contact frequency map generated
from our structures agree remarkably well with the TCC data, with a Pearson’s
correlation as 0.93 (intra-chromosomal correlation 0.97, inter-chromosomal correlation
0.92). Visually, the interaction features are reproduced nicely with what is found in
experiments (Figure 6-4B). The distribution plot also shows a strong matching between
experiments and our models. The fitted line the TCC frequency with median of predicted
frequency is y=0.991x+0.017 with an adjusted R square value as 0.983 (Figure 6-4C).
We observe that some of the predicted probability stays higher than TCC probability,
especially when the TCC probabilities are low. The reason might due to a coopertivity of
TADs stay close to bring some neighbor region close. For example, sometime a high
interaction frequency will always bring it’s neighbor regions close. We also compare the
ICP value for each chromosome for both model and TCC result to see if we have a nice
fitting for each chromosome. The result matches in a strong agreement with a
correlation value 0.94(Figure 6-4D, p-value<1E-16).
105
Figure 6-4 The output contact property matches with the input TCC contact
probability matrix.
(A) The output contact probability matrix for chr1 generated using macro domain
modeling method. (B) Comparison of output contact probability matrix with the input
TCC contact probability matrix. (C) Scatter plot shows that the contact probability matrix
generated from our structure population matches remarkably well with the TCC input.
(D) Comparison of intra/inter ratio for TCC and models for each chromosome. The
Pearson correlation is 0.77. (E) Alpha value calculation for model. (F) Comparison of
alpha value for each chromosome.
106
6.3.4 Validation of model with known structure information
We next compare the structure features gathered from our model with known
experiments. A radial position can be calculated for each chromosome in the model as
the relative radius to the nucleus center, which reveals a preferred location of
chromosomes. Surprisingly, the preferred radial position of each chromosome agree
well with the FISH experiment, with a Pearson correlation of 0.68 (Figure 6-5A). For
those chromosomes that are less active, for instance chromosome 19, it has shown to
have a larger radial position as chromosome 18. This chromosomal location preference
has been revealed by experiments as well. The matching here shows that even we just
have the chromosome contact information, we can successfully reproduce the
chromosome location preference result.
107
Figure 6-5 Structural feature analysis of the population.
(A) Comparison of average radio distance for each chromosome for structure
populations and FISH experiment. The linear fitted R-square is 0.6. (B) Radio distance
of each TAD in chromosome 2. Two regions are identified to be inner nucleus. One
region is centromere region and the other one are the fusion site of the centromere
regions. (C) The radio distance of each TAD in chromosome 4. The p-arm is more inner
comparing to q-arm. (D) The radial distance of TADs identified by Pol2 signal. Pol2
enriched TADs has a significantly smaller radial distance than all TADs, which Pol2
108
depleted TADs has a significantly larger radial distance than all TADs. (E) The
surrounding environment analysis of TADs based on Pol2 signals. If the ratio is larger
than 0, which means there is more active TADs nearby than inactive TADs, and vice
versa.
We also studied the radial position for individual TADs. Similar to what we found
in our coarse model, most centromeres have small radial positions, which indicate a
pointing-inwards location. We also observe a small radial position at 40-50 Mb
downstream from the centromere on q-arm of chromosome 2 in our TAD model (Figure
6-5B). The telomeric region at q-arm Chromosome 8 and the telomeric regions at p-arm
of chromosome 11 show a more inner average location in comparison to other regions
in the same chromosome arm. This observation may explain the location of known
interactions of MYC and -globins.
Interestingly, in our TAD model, the p-arm of chromosome 4 has shown a much
larger radial distance to nuclear center comparing to the q-arm (Figure 4C). This feature
is in good agreement with DamID observations, which indicate that the p-arm prefers to
stay closer to NE compared to p-arm. This observation cannot be detected in our
previous coarse model without enforcing specific constraints to the chromosomal arms.
This result may suggest that the Hi-C contact frequency information actually contains
much more information than just pairwise contacts.
We next investigate if TADs in the model are related to the macro domain
definition in the previous coarse model. We measure the distance distribution between
two TADs within and across the macro domains, and the result shows that TADs within
a macro domain has a significant smaller pairwise distances comparing to those across
109
macro domains. This observations bridge the relationship between the definition of
macro domain and TADs, that macro domains is a structure unit clustered by TADs.
6.3.5 Structural properties of chromosomes with different chromatin
state
It has been shown through several FISH experiments, that in human genomes,
inactive regions are more likely to be distributed close to the nuclear envelope, while the
active regions prefer towards the inner nuclear regions. We first label each TAD based
on the enrichment of RNA pol2 binding sites, which can be divided into two categories,
pol2 enriched regions and pol2 poor regions, and generate the average radial positions
of TADs to compare with the randomly selected regions. The analysis shows that pol2
enriched TADs (A1, A2) have a much smaller average radial position than randomly
selected TADs (p-value < 1E-16), while pol2 poor TADs show a significant tendency to
be located close the NE (p-value < 1E-16) (Figure 6-5D). This result matches well with
the FISH experiments that active regions tends to be inner nucleus comparing to
inactive genomic regions.
In the recent published paper, Rao et al [25] have further divided chromatin
regions into 6 states, including active states (labeled as A1, A2) or inactive chromatin
states (labeled as B1,B2, B3,B4) (Fig 6-6). The Lamin binding data from other celltype
suggested a different preference of radial positions for these different states,[25]
however it still remains unknown about the states location in Lymphoblastoid cell. To
address the question about the location preference for different state, we calculate the
average radial position for the six states. Since peri-centromeric regions are included in
B2 states, we further split B2 into two different sets, peri-centromeric regions (PRC) and
110
regular B2 regions, to see the location preference more specifically. The results show a
significant difference in the location preference for different chromatin states.
Figure 6-6 Epigenetics of different chromatin states
This figure is obtained from Rao et al.[25] .
We can see that comparing to all other states, the PRC regions in B2 states have
smaller radial positions (Fig 6-7A). In our previous macro domain model, we have
shown centromere clustering drives the radial position of each chromosome, and the
clustering prefers to be inner nucleus. This high resolution TAD further confirms location
preference of centromeres.
For active domains, A1 states is the more inner comparing to A2 states, and also
all the inactive chromatins (B1,B2,B3,B4). This suggested that the most active regions
have a preferred location towards the inner nucleus. This trend is also the same as the
lamina binding data suggested from other cell type(Fig 6-7).
For all the inactive domains, B2 and B3 have a much larger radial position value
than B1 and B4 states (Fig 6-7). Based on the epigenetics states of B2 and B3, we can
111
see that B2 and B3 are considered as inert states depleting with active epigenetic
markers, while B1 and B4 states are kind of active comparing to B2 and B3. Also, if we
comparing B3 and B2, we will see that B3 is more inert. For the radial position we can
see that B1 and B4 has similar radial position, which B2 and B3 are much closer to
nuclear envelopes. More importantly, the distance to nuclear envelopes is proportional
to their inertness. This observation echoes our observation about pol2 enrichment
analysis in Fig 6-5.
Figure 6-7 Location preference of different chromatin states.
112
(A) The radial position preference of different chromatin states, while B2 is splited into
two sets, the peri-centromeric regions(PRC) and the regular B2 regions (B) The
Location probability density plot of different states. PRC regions are excluded from B2
states.
6.3.6 Radial position affects the replication timing
We next investigate if our model suggests a relationship between replication
timing and radial position. Figure 6-8A shows that replication happened at the radial
position around 0.4-0.5, which is mainly A1 state chromatin according to Figure 6-7A.
Then the replication is been passed on in both directions to more inner and outer. The
most inner regions are mostly pericentromeric regions based on Figure 6-7A, which are
replicated lastly.
If we plot the relationship between replication timing and the chromatin states, we
can also see that A1 states are replicated earliest and then followed by A2 states. B1
and B4 states are replicated next. B2 and B3 are replicated last. This observation
suggested a transitive property of replication that A1 state happened earliest and then
other region are triggered based on the radial position.
113
Figure 6-8 Replication timing with radial positions and chromatin states
(A) The relationship between replication timing and radial position of each TADs. The
higher the replication timing value is, the earlier the replication happened. (B) The
relationship between replication timing and chromatin states
6.3.7 Replicate analysis to show the robustness of the method
We have shown previously that this TAD level modeling method can generate high
quality structure to reproduce the Hi-C contact patterns and also structure features
observed through other experiment. To further test the robustness of the method, we
114
generate an independent population of structures as a replicate to test the
reproducibility of structures. The contact frequency map generate from the replicate
match almost perfectly with the structure population, with a slope of 0.994 and a R-
square of ~0.998(Figure 6-9A). We also compare the radial position generated by the
replicate with our original population to test the structure feature. It also shows that the
average radial position of each TAD are in a significant agreement with the original
population(Figure 6-9B).
Figure 6-9 Method robustness analyses
(A) Comparison of contact frequency matrix between two replicate population. (B)
Comparison of average radial position of each TAD between two replicate population.
115
References
1. Gotta, M., et al., The clustering of telomeres and colocalization with Rap1, Sir3,
and Sir4 proteins in wild-type Saccharomyces cerevisiae. J Cell Biol, 1996.
134(6): p. 1349-63.
2. Hediger, F., et al., Live imaging of telomeres: yKu and Sir proteins define
redundant telomere-anchoring pathways in yeast. Curr Biol, 2002. 12(24): p.
2076-89.
3. Taddei, A., H. Schober, and S.M. Gasser, The budding yeast nucleus. Cold
Spring Harb Perspect Biol, 2010. 2(8): p. a000612.
4. Mekhail, K. and D. Moazed, The nuclear envelope in genome organization,
expression and stability. Nat Rev Mol Cell Biol, 2010. 11(5): p. 317-28.
5. Horigome, C., et al., Ribosome biogenesis factors bind a nuclear envelope SUN
domain protein to cluster yeast telomeres. EMBO J, 2011. 30(18): p. 3799-811.
6. Cam, H.P., et al., Comprehensive analysis of heterochromatin- and RNAi-
mediated epigenetic control of the fission yeast genome. Nat Genet, 2005. 37(8):
p. 809-19.
7. Funabiki, H., et al., Cell cycle-dependent specific positioning and clustering of
centromeres and telomeres in fission yeast. J Cell Biol, 1993. 121(5): p. 961-76.
8. Leger-Silvestre, I., et al., Structural and functional analysis of the nucleolus of the
fission yeast Schizosaccharomyces pombe. Eur J Cell Biol, 1997. 72(1): p. 13-
23.
9. Tanizawa, H., et al., Mapping of long-range associations throughout the fission
yeast genome reveals global genome organization linked to transcriptional
regulation. Nucleic Acids Res, 2010. 38(22): p. 8164-77.
10. Hall, I.M., K. Noma, and S.I. Grewal, RNA interference machinery regulates
chromosome dynamics during mitosis and meiosis in fission yeast. Proc Natl
Acad Sci U S A, 2003. 100(1): p. 193-8.
11. Boyle, S., et al., The spatial organization of human chromosomes within the
nuclei of normal and emerin-mutant cells. Hum Mol Genet, 2001. 10(3): p. 211-9.
12. Cremer, M., et al., Non-random radial higher-order chromatin arrangements in
nuclei of diploid human cells. Chromosome Res, 2001. 9(7): p. 541-67.
13. Cremer, T. and C. Cremer, Chromosome territories, nuclear architecture and
gene regulation in mammalian cells. Nat Rev Genet, 2001. 2(4): p. 292-301.
14. Meaburn, K.J. and T. Misteli, Cell biology: chromosome territories. Nature, 2007.
445(7126): p. 379-781.
15. Berger, A.B., et al., High-resolution statistical mapping reveals gene territories in
live yeast. Nat Methods, 2008. 5(12): p. 1031-7.
16. Heun, P., et al., Chromosome dynamics in the yeast interphase nucleus.
Science, 2001. 294(5549): p. 2181-6.
17. Csink, A.K. and S. Henikoff, Large-scale chromosomal movements during
interphase progression in Drosophila. J Cell Biol, 1998. 143(1): p. 13-22.
18. Ferguson, M. and D.C. Ward, Cell cycle dependent chromosomal movement in
pre-mitotic human T-lymphocyte nuclei. Chromosoma, 1992. 101(9): p. 557-65.
116
19. Dekker, J., et al., Capturing chromosome conformation. Science, 2002.
295(5558): p. 1306-11.
20. Simonis, M., et al., Nuclear organization of active and inactive chromatin
domains uncovered by chromosome conformation capture-on-chip (4C). Nat
Genet, 2006. 38(11): p. 1348-54.
21. Zhao, Z., et al., Circular chromosome conformation capture (4C) uncovers
extensive networks of epigenetically regulated intra- and interchromosomal
interactions. Nat Genet, 2006. 38(11): p. 1341-7.
22. Lieberman-Aiden, E., et al., Comprehensive mapping of long-range interactions
reveals folding principles of the human genome. Science, 2009. 326(5950): p.
289-93.
23. Duan, Z., et al., A three-dimensional model of the yeast genome. Nature, 2010.
465(7296): p. 363-7.
24. Kalhor, R., et al., Genome architectures revealed by tethered chromosome
conformation capture and population-based modeling. Nat Biotechnol, 2012.
30(1): p. 90-8.
25. Rao, S.S., et al., A 3D map of the human genome at kilobase resolution reveals
principles of chromatin looping. Cell, 2014. 159(7): p. 1665-80.
26. Yaffe, E. and A. Tanay, Probabilistic modeling of Hi-C contact maps eliminates
systematic biases to characterize global chromosomal architecture. Nat Genet,
2011. 43(11): p. 1059-65.
27. Imakaev, M., et al., Iterative correction of Hi-C data reveals hallmarks of
chromosome organization. Nat Methods, 2012. 9(10): p. 999-1003.
28. Varoquaux, N., et al., A statistical approach for inferring the 3D structure of the
genome. Bioinformatics, 2014. 30(12): p. i26-33.
29. Lesne, A., et al., 3D genome reconstruction from chromosomal contacts. Nat
Methods, 2014. 11(11): p. 1141-3.
30. Rousseau, M., et al., Three-dimensional modeling of chromatin structure from
interaction frequency data using Markov chain Monte Carlo sampling. BMC
Bioinformatics, 2011. 12: p. 414.
31. Hu, M., et al., Bayesian inference of spatial organizations of chromosomes. PLoS
Comput Biol, 2013. 9(1): p. e1002893.
32. Tjong, H., et al., Physical tethering and volume exclusion determine higher-order
genome organization in budding yeast. Genome Res, 2012. 22(7): p. 1295-305.
33. Giorgetti, L., et al., Predictive polymer modeling reveals coupled fluctuations in
chromosome conformation and transcription. Cell, 2014. 157(4): p. 950-63.
34. O'Toole, E.T., M. Winey, and J.R. McIntosh, High-voltage electron tomography of
spindle pole bodies and early mitotic spindles in the yeast Saccharomyces
cerevisiae. Mol Biol Cell, 1999. 10(6): p. 2017-31.
35. Jin, Q.W., J. Fuchs, and J. Loidl, Centromere clustering is a major determinant of
yeast interphase nuclear organization. J Cell Sci, 2000. 113 ( Pt 11): p. 1903-12.
36. Yang, C.H., et al., Higher order structure is present in the yeast nucleus:
autoantibody probes demonstrate that the nucleolus lies opposite the spindle
pole body. Chromosoma, 1989. 98(2): p. 123-8.
117
37. Dvorkin, N., M.W. Clark, and B.A. Hamkalo, Ultrastructural localization of nucleic
acid sequences in Saccharomyces cerevisiae nucleoli. Chromosoma, 1991.
100(8): p. 519-23.
38. Bystricky, K., et al., Chromosome looping in yeast: telomere pairing and
coordinated movement reflect anchoring efficiency and territorial organization. J
Cell Biol, 2005. 168(3): p. 375-87.
39. Schober, H., et al., Controlled exchange of chromosomal arms reveals principles
driving telomere interactions in yeast. Genome Res, 2008. 18(2): p. 261-71.
40. Therizols, P., et al., Chromosome arm length and nuclear constraints determine
the dynamic relationship of yeast subtelomeres. Proc. Natl. Acad. Sci. U.S.A.,
2010. 107(5): p. 2025-30.
41. Zimmer, C. and E. Fabre, Principles of chromosomal organization: lessons from
yeast. J. Cell Biol., 2011. 192(5): p. 723-33.
42. Berger, A.B., et al., High-resolution statistical mapping reveals gene territories in
live yeast. Nat. Methods, 2008. 5(12): p. 1031-7.
43. Bystricky, K., et al., Long-range compaction and flexibility of interphase
chromatin in budding yeast analyzed by high-resolution imaging techniques. Proc
Natl Acad Sci U S A, 2004. 101(47): p. 16495-500.
44. Dekker, J., Mapping in vivo chromatin interactions in yeast suggests an extended
chromatin fiber with regional variation in compaction. J Biol Chem, 2008.
283(50): p. 34532-40.
45. Meister, P., et al., Visualizing yeast chromosomes and nuclear architecture.
Methods Enzymol, 2010. 470: p. 535-67.
46. Alber, F., et al., The molecular architecture of the nuclear pore complex. Nature,
2007. 450(7170): p. 695-701.
47. Russel, D., et al., Putting the pieces together: integrative modeling platform
software for structure determination of macromolecular assemblies. PLoS Biol,
2012. 10(1): p. e1001244.
48. Therizols, P., et al., Chromosome arm length and nuclear constraints determine
the dynamic relationship of yeast subtelomeres. Proc Natl Acad Sci U S A, 2010.
107(5): p. 2025-30.
49. Di Rienzi, S.C., et al., Fragile genomic sites are associated with origins of
replication. Genome Biol Evol, 2009. 1: p. 350-63.
50. Feng, W., et al., Genomic mapping of single-stranded DNA in hydroxyurea-
challenged yeasts identifies origins of replication. Nat Cell Biol, 2006. 8(2): p.
148-55.
51. McCune, H.J., et al., The temporal program of chromosome replication:
genomewide replication in clb5{Delta} Saccharomyces cerevisiae. Genetics,
2008. 180(4): p. 1833-47.
52. Sekedat, M.D., et al., GINS motion reveals replication fork progression is
remarkably uniform throughout the yeast genome. Mol Syst Biol, 2010. 6: p. 353.
53. Thompson, M., et al., Nucleolar clustering of dispersed tRNA genes. Science,
2003. 302(5649): p. 1399-401.
54. Misteli, T., Beyond the sequence: cellular organization of genome function. Cell,
2007. 128(4): p. 787-800.
118
55. Takizawa, T., K.J. Meaburn, and T. Misteli, The meaning of gene positioning.
Cell, 2008. 135(1): p. 9-13.
56. Dixon, J.R., et al., Topological domains in mammalian genomes identified by
analysis of chromatin interactions. Nature, 2012. 485(7398): p. 376-80.
57. Papantonis, A. and P.R. Cook, Transcription factories: genome organization and
gene regulation. Chem Rev, 2013. 113(11): p. 8683-705.
58. Cook, P.R., The organization of replication and transcription. Science, 1999.
284(5421): p. 1790-5.
59. Kitamura, E., J.J. Blow, and T.U. Tanaka, Live-cell imaging reveals replication of
individual replicons in eukaryotic replication factories. Cell, 2006. 125(7): p. 1297-
308.
60. Gasser, S.M., Visualizing chromatin dynamics in interphase nuclei. Science,
2002. 296(5572): p. 1412-6.
61. Taddei, A., et al., The functional importance of telomere clustering: global
changes in gene expression result from SIR factor dispersion. Genome Res,
2009. 19(4): p. 611-25.
62. Taddei, A. and S.M. Gasser, Structure and function in the budding yeast nucleus.
Genetics, 2012. 192(1): p. 107-29.
63. Zimmer, C. and E. Fabre, Principles of chromosomal organization: lessons from
yeast. J Cell Biol, 2011. 192(5): p. 723-33.
64. Leger-Silvestre, I., et al., Functional compartmentalization of the nucleus in the
budding yeast Saccharomyces cerevisiae. Chromosoma, 1999. 108(2): p. 103-
13.
65. Handoko, L., et al., CTCF-mediated functional chromatin interactome in
pluripotent cells. Nat Genet, 2011. 43(7): p. 630-8.
66. Mizuguchi, T., et al., Cohesin-dependent globules and heterochromatin shape 3D
genome architecture in S. pombe. Nature, 2014.
67. Grand, R.S., et al., Chromosome conformation maps in fission yeast reveal cell
cycle dependent sub nuclear structure. Nucleic Acids Res, 2014. 42(20): p.
12585-99.
68. Dai, Z. and X. Dai, Nuclear colocalization of transcription factor target genes
strengthens coregulation in yeast. Nucleic Acids Res, 2012. 40(1): p. 27-36.
69. Ben-Elazar, S., Z. Yakhini, and I. Yanai, Spatial localization of co-regulated
genes exceeds genomic gene clustering in the Saccharomyces cerevisiae
genome. Nucleic Acids Res, 2013. 41(4): p. 2191-201.
70. Homouz, D. and A.S. Kudlicki, The 3D organization of the yeast genome
correlates with co-expression and reflects functional relations between genes.
PLoS One, 2013. 8(1): p. e54699.
71. Wong, H., et al., A predictive computational model of the dynamic 3D interphase
yeast nucleus. Curr Biol, 2012. 22(20): p. 1881-90.
72. Tokuda, N., T.P. Terada, and M. Sasai, Dynamical modeling of three-
dimensional genome organization in interphase budding yeast. Biophys J, 2012.
102(2): p. 296-304.
73. Gehlen, L.R., et al., Chromosome positioning and the clustering of functionally
related loci in yeast is driven by chromosomal interactions. Nucleus, 2012. 3(4):
p. 370-83.
119
74. Avsaroglu, B., et al., Effect of chromosome tethering on nuclear organization in
yeast. PLoS One, 2014. 9(7): p. e102474.
75. Wood, V., et al., The genome sequence of Schizosaccharomyces pombe.
Nature, 2002. 415(6874): p. 871-80.
76. Alber, F., et al., Determining the architectures of macromolecular assemblies.
Nature, 2007. 450(7170): p. 683-94.
77. Noma, K., et al., A role for TFIIIC transcription factor complex in genome
organization. Cell, 2006. 125(5): p. 859-72.
78. Iwasaki, O., et al., Centromeric localization of dispersed Pol III genes in fission
yeast. Mol Biol Cell, 2010. 21(2): p. 254-65.
79. Iwasaki, O. and K. Noma, Global genome organization mediated by RNA
polymerase III-transcribed genes in fission yeast. Gene, 2012. 493(2): p. 195-
200.
80. Steglich, B., et al., The inner nuclear membrane proteins Man1 and Ima1 link to
two different types of chromatin at the nuclear periphery in S. pombe. Nucleus,
2012. 3(1): p. 77-87.
81. Osborne, C.S., et al., Active genes dynamically colocalize to shared sites of
ongoing transcription. Nat Genet, 2004. 36(10): p. 1065-71.
82. Xu, M. and P.R. Cook, Similar active genes cluster in specialized transcription
factories. J Cell Biol, 2008. 181(4): p. 615-23.
83. Cho, R.J., et al., A genome-wide transcriptional analysis of the mitotic cell cycle.
Mol Cell, 1998. 2(1): p. 65-73.
84. Tong, A.H., et al., Global mapping of the yeast genetic interaction network.
Science, 2004. 303(5659): p. 808-13.
85. Baryshnikova, A., et al., Quantitative analysis of fitness and genetic interactions
in yeast on a genome scale. Nat Methods, 2010. 7(12): p. 1017-24.
86. Frost, A., et al., Functional repurposing revealed by comparing S. pombe and S.
cerevisiae genetic interactions. Cell, 2012. 149(6): p. 1339-52.
87. Hayashi, M., et al., Genome-wide localization of pre-RC sites and identification of
replication origins in fission yeast. EMBO J, 2007. 26(5): p. 1327-39.
88. Ay, F., et al., Three-dimensional modeling of the P. falciparum genome during
the erythrocytic cycle reveals a strong connection between genome architecture
and gene expression. Genome Res, 2014. 24(6): p. 974-88.
89. Le, T.B., et al., High-resolution mapping of the spatial organization of a bacterial
chromosome. Science, 2013. 342(6159): p. 731-4.
90. Naumova, N., et al., Organization of the mitotic chromosome. Science, 2013.
342(6161): p. 948-53.
91. Jin, F., et al., A high-resolution map of the three-dimensional chromatin
interactome in human cells. Nature, 2013. 503(7475): p. 290-4.
92. Knight, P.A., The Sinkhorn-Knopp algorithm: Convergence and applications.
Siam Journal on Matrix Analysis and Applications, 2008. 30(1): p. 261-275.
93. Knight, P.A. and D. Ruiz, A fast algorithm for matrix balancing. Ima Journal of
Numerical Analysis, 2013. 33(3): p. 1029-1047.
94. Sexton, T., et al., Gene regulation through nuclear organization. Nat Struct Mol
Biol, 2007. 14(11): p. 1049-55.
120
95. Sexton, T., et al., Three-dimensional folding and functional organization
principles of the Drosophila genome. Cell, 2012. 148(3): p. 458-72.
96. Hou, W.R., et al., cDNA, genomic sequence cloning and overexpression of
ribosomal protein gene L9 (rpL9) of the giant panda (Ailuropoda melanoleuca).
Genet Mol Res, 2011. 10(3): p. 1576-88.
97. Shen, Y., et al., A map of the cis-regulatory sequences in the mouse genome.
Nature, 2012. 488(7409): p. 116-20.
98. Zhang, Y., et al., Model-based analysis of ChIP-Seq (MACS). Genome Biol,
2008. 9(9): p. R137.
99. Van Bortle, K. and V.G. Corces, The role of chromatin insulators in nuclear
architecture and genome function. Curr Opin Genet Dev, 2013. 23(2): p. 212-8.
100. Bickmore, W.A. and B. van Steensel, Genome architecture: domain organization
of interphase chromosomes. Cell, 2013. 152(6): p. 1270-84.
101. Gibcus, J.H. and J. Dekker, The hierarchy of the 3D genome. Mol Cell, 2013.
49(5): p. 773-82.
102. Hou, C., et al., Gene density, transcription, and insulators contribute to the
partition of the Drosophila genome into physical domains. Mol Cell, 2012. 48(3):
p. 471-84.
103. Nagano, T., et al., Single-cell Hi-C reveals cell-to-cell variability in chromosome
structure. Nature, 2013. 502(7469): p. 59-64.
104. Kind, J., et al., Single-cell dynamics of genome-nuclear lamina interactions. Cell,
2013. 153(1): p. 178-92.
105. Misteli, T., The Cell Biology of Genomes: Bringing the Double Helix to Life. Cell,
2013. 152(6): p. 1209-1212.
106. Junier, I., et al., CTCF-mediated transcriptional regulation through cell type-
specific chromosome organization in the beta-globin locus. Nucleic Acids Res,
2012. 40(16): p. 7718-27.
107. Meluzzi, D. and G. Arya, Recovering ensembles of chromatin conformations from
contact probabilities. Nucleic Acids Res, 2013. 41(1): p. 63-75.
108. Barbieri, M., et al., Complexity of chromatin folding is captured by the strings and
binders switch model. Proc Natl Acad Sci U S A, 2012. 109(40): p. 16173-8.
109. Fraser, J., et al., Computing chromosome conformation. Methods Mol Biol, 2010.
674: p. 251-68.
110. Bau, D., et al., The three-dimensional folding of the alpha-globin gene domain
reveals formation of chromatin globules. Nat Struct Mol Biol, 2010. 18(1): p. 107-
14.
111. Bau, D. and M.A. Marti-Renom, Structure determination of genomic domains by
satisfaction of spatial restraints. Chromosome Res, 2011. 19(1): p. 25-35.
!
Abstract (if available)
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Mapping 3D genome structures: a data driven modeling method for integrated structural analysis
PDF
Understanding the 3D genome organization in topological domain level
PDF
Exploring three-dimensional organization of the genome by mapping chromatin contacts and population modeling
PDF
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
PDF
Quantitative modeling of in vivo transcription factor–DNA binding and beyond
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Exploring the application and usage of whole genome chromosome conformation capture
PDF
Profiling transcription factor-DNA binding specificity
PDF
Machine learning of DNA shape and spatial geometry
PDF
Feature-preserving simplification and sketch-based creation of 3D models
PDF
3D inference and registration with application to retinal and facial image analysis
PDF
Application of machine learning methods in genomic data analysis
PDF
3D deep learning for perception and modeling
PDF
Forkhead transcription factors regulate replication origin firing through dimerization and cell cycle-dependent chromatin binding in S. cerevisiae
PDF
Object detection and recognition from 3D point clouds
PDF
Data-driven 3D hair digitization
PDF
Computational analysis of the spatial and temporal organization of the cellular environment
PDF
Machine learning methods for 2D/3D shape retrieval and classification
PDF
Understanding protein–DNA recognition in the context of DNA methylation
PDF
Structure – dynamics – function analysis of class A GPCRs and exploration of chemical space using integrative computational approaches
Asset Metadata
Creator
Gong, Ke
(author)
Core Title
3D modeling of eukaryotic genomes
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
09/14/2017
Defense Date
06/01/2015
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
3D conformation,budding yeast,chromatin conformation,eukaryotic genomes,fission yeast,genome modeling,Hi-C,OAI-PMH Harvest,structure modeling,topological domain
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Alber, Frank (
committee chair
), Nakano, Aiichiro (
committee member
), Rohs, Remo (
committee member
)
Creator Email
jameshgkr61@gmail.com,kegong@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-181098
Unique identifier
UC11273275
Identifier
etd-GongKe-3905.pdf (filename),usctheses-c40-181098 (legacy record id)
Legacy Identifier
etd-GongKe-3905.pdf
Dmrecord
181098
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Gong, Ke
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
3D conformation
budding yeast
chromatin conformation
eukaryotic genomes
fission yeast
genome modeling
Hi-C
structure modeling
topological domain