Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Mapping 3D genome structures: a data driven modeling method for integrated structural analysis
(USC Thesis Other)
Mapping 3D genome structures: a data driven modeling method for integrated structural analysis
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Mapping 3D Genome Structures: a Data Driven
Modeling Method for Integrated Structural
Analysis
Nan Hua
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
August 2019
i
Abstract
Background
The completion of the human genome DNA sequence was one of the most important
accomplishments in biology. However, the genome has levels of complexity that go far beyond its
linear sequence. One of the next grand challenges is to determine the detailed 3D folding
architecture of the linear genomic DNA, and to reveal the functional implications of this level of
organization. This 3D organization greatly influences gene functions, such as the expression or
silencing of genes. For instances, gene expression is initiated by physical interactions of promoter
and enhancer regions that often are fare apart in linear DNA sequence
1
. Recent evidence shows
that clustering of chromatin near nuclear bodies, such as nucleoli, speckles and lamina are
correlated with gene silencing and replication timing
2
. Misfolding and spatial rearrangements of
chromosomes frequently occur in cancer cells
3
. For a better understanding of nuclear functions
and its impact in disease it is therefore necessary to study the 3D folding of the genome and the
organization.
Data about the nuclear organization can now be generated from wide range of genomics and
imaging experimental methods, each with their own strengths and limitations. For instance,
chromosome conformation capture experiments, such as like 3C, 4C, 5C, Hi-C and ChIA-PET
produce rich information about the probability of pairwise chromatin interactions on a genome-
wide scale. On the other side, imaging approaches like DNA FISH and live-cell super-resolution
microscopy can provide direct views of the organization and dynamics of chromatin, however
typically only for a few gene loci at the same time.
Task
These data provide unique opportunities to generate 3D structural representations of genomes to
study detailed chromatin folding and partitioning of chromatin into subnuclear compartments,
which will facilitate an in-depth analysis of spatial chromatin patterns and their functional
ii
relevance. However, this task is challenging due to large size and diploidy of genomes and the
stochastic nature of chromosome conformations. My PhD focuses on developing approaches to
unite genomics with microscopic data by determining 3D structures of entire human genomes
from genomics data and developing tools for structure-function mapping to extract the underlying
principles and mechanisms of genome-wide chromosome organization. My goal is to acquire a
more thorough insight about the basic principles governing the spatial organization of genomes,
specifically the partitioning of genomic regions into subnuclear compartments and its impact on
nuclear function.
iii
Acknowledgement
First and foremost, I would express my deepest gratitude and sincerest acknowledgment to my
advisor, Professor Frank Alber. During my research and study, he gave me a lot of guidance and
support, both in academics and life. His generosity of sharing thoughts and knowledge, his
patience to give supervision, his attitude to seek perfection is, will always be, the lighthouse to
me for all my time.
I would like to thank Professor Lin Chen, Professor Vsevolod Katritch and Professor Remo Rohs
for serving in my Ph.D. dissertation committees. I would thank their comments and suggestions
for me to revise and refine my research work.
In addition, I am very grateful to my previous colleague in the lab, Dr. Ke Gong and Dr. Harianto
Tjong for leading me to start learning and developing population-based modeling methods, which
serves the foundation of my research work. I also want to thank my colleague, Dr. Guido Polles
for his excellent work of developing IGM along with me and made it possible for high resolution
modeling. My thanks also go to all lab mates for their support and cooperation in the past five
years, Dr. Asli Yildirim, Dr. Long Pei, Dr. Haochen Li, Dr. Qingjiao Li, Dr. Hanjun Shin. Jitin Singla,
Yuxiang Zhan.
I would also like to extend my appreciation to Professor Xianghong Jasmine Zhou for your
insightful suggestions on my projects; Dr. Jiang Xu and Dr. Yi Kou in Professor Lin Chen’s lab for
working very closely with me.
I wish to express my gratitude to my friends and classmates in our department, who has helped
me in my Ph.D. life: Dr. Beibei Xin, Dr. Tsu-pei Chiu, Dr. Meng Zhou, Dr. Wenzheng Li, Jinsen Li,
Weili Wang, Kujing Tang, Nilkanth Patel, Saket Choudhary, Rishvanth Prabakar, Zhu Liu, Yingfei
Wang and many others.
Finally, I want to thank my family, my parents Sujuan Zhang and Feng Hua, for the everlasting
iv
support; my wife Mengyuan Celia Xi for her love and for being the most important part of my life.
v
Table of Contents
Abstract .................................................................................................................................................... i
Background .......................................................................................................................................... i
Task ...................................................................................................................................................... i
Acknowledgement ...................................................................................................................................iii
Table of Contents..................................................................................................................................... v
List of Figures ........................................................................................................................................ viii
List of Tables .......................................................................................................................................... ix
Chapter 1 Introduction ...................................................................................................................... 1
1.1 Experimental methods for capturing genome organization .............................................................. 4
1.1.1 Frequencies of pairwise chromatin-chromatin interactions ................................................. 6
1.1.2 Single cell pairwise chromatin-chromatin interactions ........................................................ 7
1.1.3 Multivariate chromatin interactions from ligation free methods ........................................... 9
1.1.4 Probing chromatin proximity to nuclear bodies ................................................................ 10
1.1.5 Imaging methods for mapping the spatio-temporal organization of the nucleus................ 11
1.2 Computational Methods ............................................................................................................... 13
1.2.1 Consensus structure methods......................................................................................... 16
1.2.2 Resampling methods ...................................................................................................... 16
1.2.3 Population-based deconvolution methods ....................................................................... 18
Chapter 2 Population-based genome structure modeling using Hi-C data ........................................ 20
2.1 Computational representation of genome structure ....................................................................... 22
2.2 Probabilistic formulation of population structure modeling ............................................................. 22
2.3 Iterative refinement of restraint assignment .................................................................................. 23
2.4 PGS software pipeline for population-based genome structure modeling ...................................... 25
Chapter 3 Extending population-based genome structure modeling through comprehensive data
integration…………………………………………………………………………………………………... .......... 31
3.1 Mathematical formulation of comprehensive data integration through population-based genome
structure modeling ............................................................................................................................. 33
3.1.1 Univariate data ............................................................................................................... 33
vi
3.1.2 Bivariate data ................................................................................................................. 33
3.1.3 Multivariate data ............................................................................................................. 33
3.1.4 Formulation .................................................................................................................... 34
3.2 Software design and implementation ............................................................................................ 36
Chapter 4 Studying spatial organization of whole genomes through structure modeling ................... 39
4.1 Case study: Comparative analysis of liver and cardiac chromatin structure ................................... 40
4.1.1 Data sources and structure modeling .............................................................................. 40
4.1.2 Different compartmentalization status in Heart and Liver nucleus .................................... 41
4.2 Case study: Structural organization of subnuclear chromatin compartments in human cells .......... 44
4.2.1 Pre-processing of in situ Hi-C data .................................................................................. 46
4.2.2 Modeling representation and parameters ........................................................................ 46
4.2.3 Model structures reach a remarkable agreement with input matrix .................................. 48
4.2.4 Assessment of models and cross-validation of experimental data ................................... 50
4.2.5 Results: The spatial organization of the human genome and its structure-function
correlations .................................................................................................................................... 59
4.2.6 Structure-based TAD calling............................................................................................ 85
4.2.7 Discussion ...................................................................................................................... 88
4.3 Case study: Comparative analysis of human H1-ESC and HFFc6 nucleus structures. .................. 90
4.3.1 Generating high resolution structure models using Hi-C and Lamina-DamID ................... 90
4.3.2 Common spatial compartment localization is observed in comparative analysis .............. 92
4.3.3 TAD variations detected using structure-based TAD calling method................................. 95
Chapter 5 Conclusion ..................................................................................................................... 97
References ............................................................................................................................................ 99
Appendix ............................................................................................................................................. 108
A.1 Usage of PGS software ........................................................................................................ 108
A.1.1 Installation .................................................................................................................... 108
A.1.2 Equipment setup........................................................................................................... 109
A.1.3 Procedures ................................................................................................................... 110
A.1.4 Troubleshooting ............................................................................................................ 114
A.1.5 Timing .......................................................................................................................... 115
vii
A.1.6 Anticipated results ........................................................................................................ 116
A.2 Structure archiving and dissemination .................................................................................. 117
A.3 Additional Methods ............................................................................................................... 118
A.3.1 Fmaxation .................................................................................................................... 118
A.3.2 Building domain-level probability matrix ........................................................................ 119
A.3.3 Probabilistic population structure modeling ................................................................... 119
A.3.4 Iterative refinement of restraint assignment ................................................................... 120
A.3.5 Simulated annealing for optimizing structure population ................................................ 122
A.3.6 Wasserstein distance for comparing cumulative distributions......................................... 123
A.3.7 Assigning TADs with cell replication phase .................................................................... 123
A.3.8 Assigning TADs with subcompartment states ................................................................ 123
A.3.9 Shell State Probability ................................................................................................... 124
A.3.10 2D localization distribution ............................................................................................ 124
A.3.11 TAD interaction network ................................................................................................ 124
A.3.12 Maximal cliques enrichment (MCE) ............................................................................... 125
A.3.13 Predicting TSA-seq data using A1 subgraphs ................................................................ 125
A.3.14 Detect Dip TADs based on Radial Position .................................................................... 125
A.3.15 TAD neighborhood ........................................................................................................ 126
A.3.16 NAD related dip-TADs and NAD-Dominated Dip-clusters .............................................. 126
viii
List of Figures
Figure 1. Hierarchical organization of genome structure .......................................................................... 3
Figure 2. Schematic diagrams of the TAD level population-based modeling framework.......................... 21
Figure 3. PGS software workflows......................................................................................................... 27
Figure 4. PGS setup. ............................................................................................................................ 28
Figure 5. Examples of PGS outputs. ..................................................................................................... 29
Figure 6. The overall scheme of data integration into population-based genome modeling framework. .. 31
Figure 7. Overall design of IGM............................................................................................................. 37
Figure 8. Web-based IGM interface and visualization. ........................................................................... 38
Figure 9. Properties of 3D models of heart and liver genomes. .............................................................. 41
Figure 10. Comparison of input matrix and output models. .................................................................... 49
Figure 11. Cross-validation test using 50% in situ Hi-C data .................................................................. 51
Figure 12. Experimental validation of structure population using Lamina-DamID data. ........................... 53
Figure 13. Experimental validation of TAD structure population using FISH data.................................... 55
Figure 14. Locational preferences for different chromatin regions. ......................................................... 60
Figure 15. Structure-based chromatin interaction network analysis of subcompartment states. .............. 64
Figure 16. Spatial properties of subgraphs in chromatin interaction network. ......................................... 69
Figure 17. The spatial compartmentalization of human lymphoblastoid cell............................................ 75
Figure 18. Dip TADs and their properties. .............................................................................................. 79
Figure 19. Frequent clusters identified among Dip-TADs. ...................................................................... 83
Figure 20. Structure-based TAD calling. ................................................................................................ 86
Figure 21. Assessments of H1-ESC and HFFc6 models. ....................................................................... 91
Figure 22. Comparative analysis of chromatin organization in different SPIN states. .............................. 94
Figure 23. Structure-based TAD calling comparison. ............................................................................. 95
ix
List of Tables
Table 1. Different Experimental methods that provide information on 3D genome organization. ............... 5
Table 2. Comparison of data-driven 3D genome modeling methods. ...................................................... 15
Table 3. Number of TADs for different states. ......................................................................................... 62
Table 4. Population average of spatial and graph measurements in TAD resolution model. .................... 77
Table 5. Population average of spatial and graph measurements in 200kb resolution model. ................. 77
Table 6. Summary of frequent dip-clusters ............................................................................................. 82
1
Chapter 1 Introduction
The three-dimensional organization of chromosomes and their locations in the nucleus greatly
influences the functions of genes, including expression, silencing and replication. The relative
locations of tens of thousands of genes and millions of potential regulatory elements play an
important role orchestrating nuclear functions. For example, promoter-enhancer interactions and
other regulatory interactions often act over considerable sequence distances in the kb to Mb range
or even between different chromosomes
4,5
. Also, the locations of genes and regulatory elements
with respect to nuclear bodies, such as nucleoli, nuclear speckles, and lamina often influence
their functions. Chromatin at the nuclear lamina is more likely to be silenced, while genes close
to nuclear speckles tend to be highly expressed
5,6
. For a better understanding of nuclear functions,
it is therefore necessary to study the spatial folding patterns of chromosomes and the spatial
organization of nuclear bodies in the nuclear context. Recent studies point to a hierarchical
chromatin organization with compartments of chromatin in related functional states
7
, topologically
associating domains (TADs)
8
, sub-domains
9
, and chromatin loops
10
(Figure 1). Chromosome
conformation capture experiments revealed that chromatin is spatially compartmentalized into at
least two major subcompartments, one associated with active, more open chromatin and one with
inactive more compacted chromatin
7
. The transcriptional repressor CTCF, cohesin and other
architectural proteins mediate chromatin interactions by stabilizing chromatin loops of about
100kb sequence lengths, as well as the formation of chromatin topological associated domains
10–
14
.
Acquiring spatial information for a chromatin polymer counting billions of base pairs is a daunting
task, especially when considering genome dynamics, which leads to structural variations from cell
to cell in an isogenic sample. Recently, there has been a rapid development of new imaging and
genomics technologies to shed light on genome organization in greater detail. New ensemble as
well as single cell genomics technologies probe chromatin interactions at unprecedented
resolution
15
. High-throughput imaging approaches localize the probabilistic distributions of
2
chromatin probes and live cell and super-resolution imaging visualize loci and sub nuclear
structures as well as trace the dynamics of chromatin loop formation
5
. To maximize synergy
between different approaches and further enhance the field a concerted effort is needed to
coordinate the mapping of the dynamic nucleome organization.
Recently the National Institute of Health (NIH) formed the 4DN Nucleome consortium
(https://www.4dnucleome.org/) to tackle this difficult mapping problem by coordinating efforts
between many research groups
5
. The goal of the consortium is to develop a set of methods for
mapping the structures and dynamics of the genome as well as produce integrated approaches
to generate a first draft of a model of the 4D Nucleome. A central piece of these efforts is a joint
analysis project in which many research groups pledged to produce a variety of different data on
a selected set of tier 1 cell lines, comprising H1-ES human embryonic cells, hTert-HFF humane
foreskin fibroblasts as well as IMR90 lung fibroblasts and GM12878 B-lymphocytes. Producing a
wide array of data from many complementary experimental methods will facilitate benchmarking
of the various experimental technologies and will lay the foundation for integrating the various
data to generate quantitative models of the nuclear organization for these cell types.
3
Figure 1. Hierarchical organization of genome structure
An important aspect of this project is therefore to establish computational tools that can integrate
the wealth of information from various imaging and omics experiments to map three-dimensional
(3D) structures compatible with the data. There are several benefits in calculating 3D structures
from omics and imaging data. First, 3D structures are a natural way of integrating data types from
complementary experimental methods. All data from omics and imaging experiments originate
from 3D structures, often from a large population of cells. Therefore, when using an appropriate
representation of experimental errors and uncertainties, one should be able to relate all data to
an ensemble of representative 3D genome structures. Second, generating structures that are
simultaneously consistent with different data sources allows a cross-validation of the data types.
For example, one can directly assess 3D models generated from Hi-C data with spatial
information from imaging experiments and analyze disagreement between the data. Third, 3D
4
structures provide added value by revealing additional features not visible in the original input
data set. For example, an ensemble of genome structures generated from Hi-C data may reveal
specific higher order (i.e. multivariate) chromatin clusters that are observed in a statistically
significant subsets of cells, even though the ensemble-based Hi-C data provides only binary
interactions
16
.
To achieve these goals, it is essential to transform all experimental information into an accurate
representation of chromatin structures. However, characterizing 3D genome structures at a
meaningful resolution is a challenging task. Genome structures can substantially vary between
single cells and are dynamic in nature. A probabilistic description is thus needed, surpassing
traditional structural modeling that assumes of a single equilibrium structure, or a small number
of metastable structures.
Here in my PhD thesis proposal, I first summarize various experimental methods that are used in
the 4DN joint analysis project to shed light on aspects of the genome structural organization. Then
I will provide an overview of data-driven computational methods that use experimental data to
generate structural models consistent with it. Then I will specifically discuss methods developed
in our lab to simulate whole genome architecture, and effort for comprehensive data integration,
which can reveal the folded states of complete genomes, and consider the structural
heterogeneity across different cells. Finally, using one sample of lymphoblastoid cell line
GM12878, I will present the final modeling result at both TAD and 200kb resolution along with the
findings draw from structure models. The ultimate goal for structural modeling is to provide a
better understanding of how the chromosomal folding state relates to gene regulation and to
provide insights on mechanisms and driving forces of chromatin folding.
1.1 Experimental methods for capturing genome organization
To date, many experimental techniques are available to probe the spatial organization of the
genome (Table 1). We categorize these techniques according to the type of spatial information
they provide for genome modeling. For instance, data may provide information about a pairwise
5
chromatin contact, multivariate chromatin contacts, probability distributions of locations or
distances between loci, or the probability of specific contacts between chromatin regions or
chromatin regions and nuclear bodies. When performing 3D structure modeling, the information
class decides how the data is transformed into spatial restraints. In the following sections, we
summarize some of the experimental methods according to this classification scheme.
Table 1. Different Experimental methods that provide information on 3D genome
organization.
In addition, a distinction is made between single-cell and ensemble methods. Single-cell
experiments provide data specific to a single nucleus. The advantage of having a direct
Ensemble Methods Single Cell Methods
Omics Methods Imaging Methods
Hi-C
7
GAM
17
Hi-C
18
3D DNA FISH
19
HIPMap
20
In situ Hi-C
10
Chia-PET
21
Combinatorial
Hi-C
22
3D RNA FISH
19
CASFISH
23–26
TCC
27
Micro-C
28
SPRITE
29
Seq FIS
30
MERFISH
31,32
DNAse-Hi-C
33
Lamina
DamID
34
Lamina
DamID
35
Transmission EM
36
ChromEM/EMT
37
Capture Hi-C
38
TRIP
39
cryoSXT
40
cryoET
COLA
41
3C
42
Super-resolution methods:
3DSIM, STORM, PALM,
STED
43–48
LLS
49
4C
50
5C
51
LLS-PAINT
52
6
observation of a single physical structure comes with the downside that the variability of structures
across different nuclei must be analyzed by repeated experimental sampling. The throughput can
be a limiting factor to capture relatively rare yet functionally relevant events. In addition, low data
coverage per cell may also be challenging. Ensemble methods, on the other side, provide
aggregated data from a whole cell sample in a single experiment. Sample averages are
statistically more accurate, but all the single cell information is lost, such as multivariate chromatin
interactions in individual cells. Also, rare but functionally relevant features of individual cells may
not be detectable in ensemble averaged data.
1.1.1 Frequencies of pairwise chromatin-chromatin interactions
Chromosome Conformation Capture (3C) methods
15,42
, and their variants such as in situ Hi-C
10
,
TCC
27
, Micro-C
53
and DNAse-Hi-C
33
as well as Chromatin Interaction Analysis by Paired-End
Tag Sequencing (ChiaPET)
21,54
probe the relative frequency of millions of binary chromatin
interactions in a population of cells (Table 1) by averaging chromosome conformations from
millions of nuclei. Most chromosome conformation capture protocols follow some general steps.
First, an isogenic sample of cells is fixed by formaldehyde cross-linking. Second, the genetic
material is digested either through a restriction enzyme (in Hi-C, in situ Hi-C, TCC) or through
DNAse (DNAse-Hi-C) or micrococcal nuclease (Micro-C). Then re-ligation of proximal DNA
fragments is induced. Due the cross-linking step, many genomic regions close in 3D space will
likely remain in spatial proximity even after the digestion step. This, in turn, will favor their ligation
in the following step, regardless of their relative position in sequence. Finally, newly formed
ligation sites are extracted followed by pair-end sequencing of the extracted fragments. Alignment
of the sequenced fragments to their position in the reference genome determines the contact
frequency of chromatin regions that are in spatial proximity in the initial cell population. A variety
of methods build on a similar schema, with different protocols or technologies applied at each
step.
3C based methods map the proximity of chromatin without distinguishing the proteins that mediate
7
these interactions. In contrast, ChiaPET
21,54
relies on an immuno-precipitation step, yielding
chromatin contact frequency maps for interactions mediated by only a specific protein of interest.
For instance, ChiaPET has been applied to detect chromatin interactions mediated by
architectural proteins CTCF, cohesin or RNA polymerase II
21,54,55
.
Capture Hi-C is a variant of the Hi-C assay for probing interactions of specific genomic loci
38,56
.
Because sequencing is focused only on a subset of genomic loci, it typically achieves deeper
sequencing and thus higher resolution than all-against-all genome-wide Hi-C assays. For
example, capture Hi-C was used to examine the long-range regulatory interactions of tens of
thousands of promoters in human cell types
38,56
. By enhancing the signal provided by specific
proteins or genomic loci, these methods can potentially enrich the signal of rare but functionally
important interactions and therefore can convey additional functional information.
All ensemble-based chromosome conformation capture assays produce the relative frequency of
specific interactions that are observed in a population of cells. Typically the pair-end sequenced
reads are binned at a given resolution (>~1kb) and the resulting contact frequencies are
normalized to correct systematic biases
57,58
. When used for 3D structural modeling, Hi-C contact
frequencies have been used in several different ways. For example, Tjong et al. transformed
contact frequencies to probabilities of observing a given chromatin contact in a cell population
and used these probabilities to generate models which reproduce the frequency of direct contacts
observed in the cell population
59
. Several additional approaches incorporate significant Hi-C
contacts in form of direct contact constraints
60
, whereas other approaches
61–65
relate contact
frequencies to averaged distances between loci, either based on empirical curves obtained from
imaging assays or on polymer physics considerations.
1.1.2 Single cell pairwise chromatin-chromatin interactions
Single-cell Hi-C methods
66–69
perform the Hi-C assay to individual nuclei and therefore enable the
detection of pairwise DNA contacts present in the same cell. Because the experiment is performed
on genetic material of only a single cell, only a fraction of all contacts can be detected. Despite
8
the challenges, however, recent improvements in conformation capture protocols detect more
than one million contacts per cell to be detected
68
. The method has also been recently applied to
thousands of individual cells and revealed insights into chromosome structural changes during
cell cycle
69
.
A second experimental strategy, combinatorial single cell Hi-C, is based on multiplex
barcoding
70
. This method utilizes genetic “barcodes”, i.e. short DNA sequences uniquely
associated to each cell. Although applied to a large population of cells, barcodes allow to map
pairwise chromatin interactions to individual cells. After genome digestion, but before cell lysis,
the sample is separated in multiple wells. A short barcode sequence, different for each well, is
then added to the exposed DNA fragments. The whole sample is then mixed and separated again
into multiple wells, and another random short sequence is added to the DNA in each well. This
process is repeated multiple times, extending the barcode sequences, effectively creating a
genetic identifier for each pair-end sequencing read that is unique to each cell. The probability for
two different cells to acquire the exact same barcode by random chance drops exponentially with
the number of iterations, and a small number of iterations is usually sufficient to ensure the
uniqueness of the barcode per cell.
A large number of pairwise chromatin contacts can strongly constrain the configurational space,
allowing to obtain a “snapshot” of the genome configuration for the given cell. Because these data
are typically only a small fraction of all chromatin interactions modeling efforts often are
augmented with additional information, for example from ensemble Hi-C experiments. In addition,
as in ensemble Hi-C it may not be possible in most cases to differentiate chromatin regions from
the two homologues chromosome copies. In the paper of Nagano et al.
66
modeling was performed
for the single X chromosome of male cells. In a more recent paper, Carstens et al. adapted a
Bayesian structure determination framework to include information about average chromosome
extensions from Fluorescent in situ hybridization (FISH) data in the modeling process. In the same
framework, they proposed a way to model diploid structures from data which does not separate
9
between chromosome copies
71
. In the study by Stevens et al., the modeling of whole cells was
facilitated by the use of haploid embryonic mouse cells, effectively removing the degeneracy
introduced by multiple chromosome copies
67
. In their work, structures of 8 cells at 100kb
resolution were produced and analyzed. Interestingly, the resulting structures are robust even
when only partial data is used. The model analysis confirmed that structural features like the
compartmentalization of active and inactive chromatin are systematically found in each cell. At
the same time, however, loops and TADs appear to be stochastic in nature and, although present
in a population, they are subject to wide variations on a cell-to-cell basis.
1.1.3 Multivariate chromatin interactions from ligation free methods
In addition to ligation-based methods, such as Hi-C and its variants which provide a way to map
pairwise contacts between genomic regions, ligation free methods allow the detection of “higher
order” interactions, i.e. the simultaneous co-localization of several chromatin regions in a given
cell. One such method is Genome Architecture Mapping (GAM)
17
, which sequences the genetic
material of thin slices of individual fixed nuclei, obtaining a list of chromatin regions co-localized
in the two-dimensional slab of a single cell. Using a large number of cell slices, mathematical
modeling then allows to determine the probability of pairwise and multivariate co-localization of
chromatin regions in individual nuclei. GAM revealed more pronounced long-range and inter-
chromosomal chromatin interaction patterns in comparison to Hi-C data
17
. GAM data can be used
in 3D structure modeling by incorporating spatial constraints for multivariate chromatin co-
localizations. For instance, population-based modeling methods can be extended to reproduce
the experimentally detected multivariate co-localization probabilities in a population of genome
models.
A second method, Split-Pool Recognition of Interactions by Tag Extension (SPRITE), is based on
split-pool barcoding of RNA and DNA
29
. After cross-linking, medium-sized particles of genetic
material (i.e., clusters) are separated by sonication. The clusters are then uniquely barcoded by
an iterative multiplex procedure before sequencing. The method can therefore provide data on
10
hundreds of thousands of multivariate chromatin co-localizations in a population of cells. In
structure modeling, SPRITE can provide single cell multivariate contact constraints to ensure that
specific chromatin clusters are present in individual cells of the population.
1.1.4 Probing chromatin proximity to nuclear bodies
Several experimental methods map the proximity of chromatin to nuclear bodies, such as the
lamina located at the nuclear envelope (NE) or speckles, which are nuclear domains enriched in
pre-mRNA splicing factors.
The proximity of chromatin to the nuclear envelope has been probed with Lamina DNA adenine
methyltransferase identification (DamID) experiments
72
by the van Steensel group, both as
ensemble and single cell assays
6,34,73,74
. Essentially, the method determines a molecular contact
frequency of DNA to a Dam-methylase fusion proteins. For instance, lamina-B fusion protein
probes DNA proximity to the nuclear envelope
74
, whereas other proteins could measure DNA
proximity to other nuclear compartments. Lamina-DamID data has been probed for a population
of cells, in the form of a lamina binding “propensity” profiles, or, more recently, even for single
cells
73
.
Another method is Tyramide Signal Amplification–sequencing (TSA–seq), which has recently
been introduced by the laboratory of Andrew Belmont to probe the proximity of DNA loci to nuclear
Speckles
5
.
In 3D structure modeling, lamina-DamID data can be used to constraint specific loci to regions
close to the NE. A recent study integrated lamina-DamID with Hi-C data to produce a population
of genome structures that reproduces both, chromatin-chromatin contact probabilities as well as
the probabilities of loci to be in spatial proximity to the NE
75
. Another study used lamina-DamID
and Hi-C data in a resampling approach to ensure that lamina associated domains (LADs) have
a higher probability to be located at the nuclear periphery
60
. Modeling highlighted differences in
the positioning of specific domains in HeLa cells carrying mutations associated to known
pathologies.
11
1.1.5 Imaging methods for mapping the spatio-temporal organization of the
nucleus
Various imaging technologies visualize the structural organization and dynamics of the nucleus
at spatio-temporal resolution. Microscopy tools are important in providing direct spatial
relationships of genomic loci and nuclear bodies. Because imaging does not suffer from
mappability or sequencing biases they provide important complementary information for cross-
validating findings from genomics and proteomics mapping methods.
Fluorescent in situ hybridization (FISH) uses oligonucleotide probes or RNA-mediated recruitment
of fluorescently labelled dCas9
25
in fixed cells to probe spatial distances and interactions of
genomic loci. FISH has provided key insights into spatial organization of chromosomes and
spatial relationships between genomic loci
76
. For instance, FISH experiments have revealed the
formation of chromosome territories
77,78
, the sub-compartmentalization of chromosomes into
functionally distinct regions
76
and have been used to analyze spatial proximities between specific
loci. Recently a multiplex FISH method traced the detailed structural folding patterns of entire
chromosomes in a single cell at 1Mb resolution
79
. These data were generated by imaging a
multitude of probes simultaneously through multiplex imaging. Inherently, single-cell methods
generally suffer from limited throughput to describe sufficient statistical variations of spatial
features. HIPMap is a method for large-scale automatic analysis from high-throughput FISH
imaging, which allows a detail description of the probability densities of pairwise and multivariate
distances between genomic loci as well as radial positioning of genomic loci in a population of
tens of thousands of cells
20,80
.
Probability densities of multivariate distances from high-throughput imaging can be incorporated
as spatial constraints in modeling efforts. For example, FISH imaging has been used to constrain
average spatial distances in 3D structure modeling of chromosomes
81
. Ideally, one would
generate populations of genome models that reproduce the observed probability densities.
Some imaging methods do not require cross-linking or cell fixation, and therefore allow
12
observations in living cells. Life-cell imaging approaches, such as CRISPR–dCas9 FISH can
provide detailed information about the dynamic behavior of chromatin regions. CRISPR–dCas9
FISH in live cells
23,24,26,82–85
is capable to map the dynamic behavior of chromatin regions in real
time. In addition, several other super resolution live-cell imaging methods, such as 3DSIM, PALM,
STORM, STED can be used to probe the dynamics of loci
43–48
.
Cryo soft X-ray tomography (SXT) reveals details of the nuclear ultrastructure, such as the size
and shape of the nucleus, as well as its spatial compartments such as euchromatic and
heterochromatic chromatin clusters, the location and sizes of nuclear bodies. SXT is an imaging
technique for visualizing cells and their interior structures in 3D without the need for chemical
fixation, staining or dehydration, and consequently leads to images of the entire nucleus in its
native state. An SXT tomogram is reconstructed from a series of 2D projection images. Contrast
in SXT images is generated by variations of the linear absorption coefficient (LAC) for different
nuclear regions and functional chromatin subcompartments (i.e., eu- and hetero-chromatin
regions), membranes, and nuclear bodies such as the nucleolus, which can be easily detected in
SXT reconstructions
40
. For instance, SXT allowed detailed segmentations of nuclear
compartments, detection of pericentromeric chromosome clusters
40,59,86
, and even the
reconstruction of the detailed conformation of an X-chromosome in eight cells
87
Chrom-EMT (CEMT) visualizes the chromatin ultrastructure across multiple scales by combining
electron microscopy tomography with a labeling method (chromEM) to enhance the contrast of
DNA. In this method, a fluorescent DNA-binding dye is able to photo-oxidize and polymerize
diaminobenzenidine (DAB), which in turns binds OsO 4 to allow detection of DNA and chromatin
ultrastructures in electron tomograms of a nucleus of glutaraldehyde-fixed cells. This labeling
makes it possible to detect chromatin folding patterns at nucleosome resolution inside a nucleus.
Chrom-EMT images revealed that chromatin is a disordered 5–to 24-nm-diameter granular chain
that is packed together at different concentrations densities in the nucleus
37
.
Cryo-electron tomography (CET) generates 3D reconstructions of cells in hydrated, close to
13
native states at molecular resolutions
88,89
. Recent advances in focused ion beam (FIB) technology
make it possible to generate thin slices of mammalian cells that allow imaging of nuclear
organization and folding patterns of chromatin in the nucleus. However, tracing of the chromatin
fiber and reconstruction of individual complexes in the crowded nuclear environment are
challenging due to the relatively low contrast and signal to noise ratio and distortions due to the
missing wedge
88
. However, new imaging technologies and advances in automation will make it
possible to detect both structures and spatial positions of multiple large macromolecular
complexes in individual cells at molecular resolution
90
.
After summarizing various experimental methods that provide data about the spatial genome
organization, we discuss in the following sections data-driven computational methods that use
experimental data to generate 3D genome structures consistent with it. In the last section of this
review, we discuss methods that can integrate complementary sets of data and discuss recent
examples.
1.2 Computational Methods
There are a number of approaches for data–driven genome structure modeling
91,92
(Table 2).
These methods use experimental data to generate genome or chromosome structures consistent
with it. Approaches can be classified into three major groups based on the type of modeling,
(Table 1): 1) Consensus models, which represent the data by a single “averaged” structure; 2)
Resampling methods, which explore conformational variability of the structures; and 3) Population
based modeling, which perform a deconvolution of the ensemble data
92
(see a comparison in
Table 2). In addition, models can differ in the structural granularity of chromatin representation,
which can be described as i) DNA fiber and chromatin fiber models ranging from 10 base-pair to
100 kb resolution; and ii) chromosome or genome models represented by positions of the
chromatin domains at <~1Mb resolution (i.e., Topological Associated Domains, henceforth TAD),
or larger macro-domains and subcompartments at >~5Mb resolution. Computational methods
also differ in how the data is interpreted and translated into spatial constraints. So far, most
14
modeling approaches use almost exclusively data from chromosome conformation capture
experiments (e.g. Hi-C data). For example, some methods interpret the contact frequency from
ensemble Hi-C experiments into spatial distances between the corresponding loci, whereas other
methods use contact constraints to enforce actual contacts between loci in a subset of structures
in an ensemble so that the total number of contacts agrees with the relative contact frequency in
the experiment (Table 2). We first discuss the different types of computational approaches.
Method Distance
or Contact
based
Genome
coverage
Species Model type
Duan et al.
61
Distance Whole Genome Budding
Yeast
Consensus
BACH
64
Distance All chromosomes Mouse Consensus
AutoChrom3D
93
Distance 500-800kb Human Consensus
ChromSDE
63
Distance Chromosome 13 Mouse,
Human
Consensus
PASTIS
62
Distance All chromosomes Mouse Consensus
ShRec3D
94
Distance 30Mb,
Chromosome 1
Human Consensus
HAS
95
Distance All chromosomes Human Consensus
3D-GENOME
96
Distance All chromosomes Human Consensus
MCMC5C
65
Distance 142kb(5C), Human Resampling
15
Table 2. Comparison of data-driven 3D genome modeling methods.
88.4Mb(Chr14,
HiC)
TADbit
97
Distance 500kb(5C) Human Resampling
Junier et al
98
Contact 1.2Mb (chr11) Human Resampling
Gehlen et al
99
Contact Whole Genome Budding
Yeast
Resampling
Meluzzi and Arya
100
Contact 135-270kb Pseudo Resampling
Trieu and Cheng
101
Contact All chromosomes Human Resampling
InfMod3DGen
81
Distance All chromosomes Yeast Resampling
MOGEN
102
Contact Whole Genome Human Resampling
Di Stefano et al
103
Contact Whole genome Human Resampling
Chrom3D
60
Contact Whole genome Human Resampling
Kalhor et al
27
Contact Whole genome Human Population
Giorgetti et al
104
Contact 780kb Human Population
MiChroM
105,106
Contact All chromosomes Mouse Population
PGS
59
Contact Whole genome Human Population
16
1.2.1 Consensus structure methods
Consensus models generate a single structure, which represents a “best fit” of the data. Often
this data is generated from a population of cells, for example ensemble Hi-C
63,93
. In this case,
ensemble Hi-C contact frequencies are typically mapped to spatial distances, assuming an
inverse relationship between contact frequencies and distances. A single 3D structure is then
generated that minimizes the residual errors between modeled and expected distances by either
optimizing a scoring function
61–63,93,94
, a likelihood function through Bayesian inference
64
, or
solving a generalized linear model
95
. Consensus models have been applied to whole genomes
61
,
individual chromosomes
62,64,95,96,107
and chromatin globules
93,94
. A consensus model is typically
fast to generate and summarizes averaged structural features from ensemble data. However,
chromosome conformations vary between cells and a single consensus structure cannot convey
aspects of variability and dynamics and therefore may not represent instances of individual
observable structures.
1.2.2 Resampling methods
Resampling methods calculate an ensemble of structures by independent optimizations of the
same scoring function derived from the data. These methods introduce aspects of conformational
variability because of the presence of multiple minima, unconstrained degrees of freedom or
thermodynamic fluctuations.
For instance, in a study by Junier et al., a thermodynamic ensemble of configurations of the
human beta globin locus were produced by enforcing CTCF loops; the strength of the interaction
was tuned to reproduce the observed 3C contacts distributions
98
. Meluzzi and Arya
100
proposed
a polymer model in which 3C contact probabilities can be reproduced by a thermodynamic
ensemble of configurations in an opportunely shaped energy landscape. Other methods extended
the frequency-to-distance mapping widely used in consensus modeling by allowing statistical
deviations. For instance, MCMC5C uses Markov chain Monte Carlo method to generate a
representative sample from the posterior probability distribution over the space of structures from
17
5C (Chromosome Conformation Capture Carbon Copy) data
65
. Another instance is
InfMod3Dgen
81
, which uses a polymer chain representation and a Bayesian framework in which
the conformational energy from polymer physics is used as prior information to model infer an
ensemble of chromatin structures based on Hi-C data.
A caveat in the resampling procedure comes from the fact that ensemble datasets can include
conflicting data from mutually exclusive chromatin conformations. If the complete data is
considered simultaneously in a single structure, there can be inconsistencies between data and
models (i.e., restraints violations). As a consequence, some methods use only a subset of the
data, for instance most significant contact and non-contact information. TADbit is a package which
includes a resampling modeling method
97,108
. The procedure performs many independent
optimizations from random starting configurations using the Integrative Modeling Platform (IMP)
109,110
by restraining the distance between two loci if their interaction frequency z-score exceeds a
cutoff (or keeping them apart if the frequency is below a lower cutoff). This protocol has been
applied to model multiple systems, on different scales, including the alpha globulin locus in K562
and lymphoblastoid cells
111
, the structure of the bacterial genome of Caulobacter Crescentus
112
and genomic domains in Drosophila melanogaster
108
.
Similarly, in another software tool called MOGEN, high and low valued frequency pairs in HiC
maps are considered as contacts and non-contacts
101,102
. In another study, Gehlen et al. randomly
selected a subset from a pool of frequently observed interactions to explore the effect of specific
genomic contacts in the yeast genome
99
.
Di Stefano et al.
103
selected significant intra-chromosome interactions in human lung fibroblast
and stem cells to restrain the structural space of full diploid genomes at very high resolution (~3kb).
The structures are optimized using Molecular Dynamics and reproduce the average nuclear
positioning of active and lamina associated genomic regions.
Finally, in a modeling framework developed by Paulsen et al. (Chrom3D), only statistically
significant pair-wise interactions between TADs are considered in addition to significant contacts
18
of TADs to the NE derived from lamina-DamID data. It simulates the positions of topologically-
associated domains (TADs) relative to each other and to the nuclear periphery by using a Monte
Carlo optimization of a loss-score function
60
.
1.2.3 Population-based deconvolution methods
Population-based deconvolution approaches generate a population of structures, in which the
accumulated contacts over all structures reproduce the probability of contacts in the ensemble
Hi-C data rather than each structure individually
27,104,106
. As noted in the previous section,
enforcing the full set of detected chromatin contacts may lead to violated restraints. Effectively,
population-based methods distribute all expected Hi-C chromatin contacts across different
structures in a population. As a result, they allow structures in different conformational states,
where each state could contain only a coherent subset of chromatin contacts. For example,
Giorgetti et al. developed a physical polymer model which uses an iterative Monte Carlo scheme
to generate a thermodynamic ensemble of chromatin conformations
104
, and reproduced the 5C
data well for a region spanning 780 kb. T o achieve this goal, specific chromatin contact interaction
potentials were optimized to mimic interactions statistically favoring or disfavoring the co-
localization of chromatin regions according to the chromosome conformation capture data. The
method was used to explore the repertoire of chromatin conformations within TADs.
In another approach, Zhang and Wolynes used the maximum entropy principle and molecular
dynamics to model an ensemble of chromosome conformations consistent with a pseudo-
Boltzmann distribution for an effective energy landscape that reproduces the experimentally
measured pairwise contact frequencies from Hi-C data
106
. The approach was applied to model a
population of mouse chromosomes with good agreement with experiment.
Additionally, an ensemble of structures for the C. crescentus bacterial genome were recently
generated by combining restraint and population-based modeling approaches
113
. In this study, a
multi-scale modeling protocol was performed guided by weak distance restraints derived from
available Hi-C data
114
and resulting models were then further reweighted in order to match the Hi-
19
C data better considering the cell-to-cell variability. The approach resulted in an ensemble of
highly structurally variable C. crescentus chromosome models that is in agreement with the Hi-C
data.
20
Chapter 2 Population-based genome structure
modeling using Hi-C data
Previously our lab has spent a lot of effort in order to understand the structure features of genome
organization. The first step we took is to use ensemble Hi-C data to explicitly model the genome
structure variability between cells by population-based deconvolution calculation
27,59
. The
approach incorporates the stochastic nature of chromosome conformations and considers that
the contact profiles from Hi-C experiments originate from a variety of individual genome structures.
This approach allows detailed insights of different chromatin structure states, which in turn can
reveal driving forces in the spatial genome organization. We formalized a new improved
probabilistic framework, which de-convolves the ensemble Hi-C data into a population of 10,000
genome structures in such a way so that the accumulated chromatin contacts across the entire
population are statistically consistent with the input Hi-C data (Figure 2). The method performs a
structure-based deconvolution of ensemble Hi-C data and generates a large population of distinct
diploid 3D genome structures by maximizing the likelihood of observing the data. This is achieved
by performing an iterative and step-wise optimization of both, the assignment and satisfaction of
chromatin contact constraints in the ensemble of genome configurations. Each iteration involves
two steps: the constraint assignment (A-step) and the genome structure optimization (M-step).
The latter uses a combination of the simulated annealing and conjugate gradient methods applied
to each structure of the population
109,115,116
.
The method performs a structure-based deconvolution of ensemble Hi-C data and generates a
large population of distinct diploid 3D genome structures by maximizing the likelihood of observing
the data. This is achieved by performing an iterative and step-wise optimization of both, the
assignment and satisfaction of chromatin contact constraints in the ensemble of genome
configurations. Each iteration involves two steps: the constraint assignment (A-step) and the
genome structure optimization (M-step). The latter uses a combination of the simulated annealing
and conjugate gradient methods applied to each structure of the population
109,115,116
. The
21
optimization hardness is increased in a step-wise manner by gradually adding more contact
constraints during the iterative optimization process. The optimized genome structures can
typically reproduce almost all the contacts from Hi-C experiments and avoid unphysical structures
from simultaneous enforcement of conflicting data in the same structure.
Figure 2. Schematic diagrams of the TAD level population-based modeling
framework. We generate a population of structures where each structure is
optimized according to an individual contact matrix deconvoluted from the
ensemble Hi-C. The strategy is to gradually increase the number of contacts in
every model by using a stepwise iterative approach. We applied an iterative
approach to optimize the structures using an Assignment/Modeling framework.
22
2.1 Computational representation of genome structure
To correctly model the whole genome organization in the cell nucleus we need to describe the
DNA molecules using a computational representation. To represent the DNA molecule, we
established a bead-on-string model to describe the chromatin fiber using coarse grained model,
where each bead represents a chromatin segmentation. Each segmentation can be varying in
size from kilo-base to several mega-base in genomic sequence. The volume of representative
beads is scaled according to the actual genome length and adjacent beads in sequence are
restrained so that they are close to each other in space. Thus, each cell nucleus 𝑖𝑖 is represented
as a set of coordinates 𝑿𝑿 𝑖𝑖 , 𝑗𝑗 , with bead 𝑗𝑗 corresponds to genomic segments.
To incorporate the stochastic nature of chromosome conformations we aim to build a set of
nucleus structures of size 𝑴𝑴 to represent the ensemble collection of cell nucleus in real
experiment. We then describe our model structures 𝑿𝑿 = { 𝑿𝑿 1
, 𝑿𝑿 2
, ⋯ , 𝑋𝑋 𝑀𝑀 }, where the m-th structure
𝑿𝑿 𝑚𝑚 is a set of 3D vectors representing the center coordinates of 2N genomic segment spheres
𝑋𝑋 𝑚𝑚 = { 𝑋𝑋 ⃗
𝑖𝑖 𝑚𝑚 : 𝑋𝑋 ⃗
𝑖𝑖 𝑚𝑚 ∈ ℜ
3
, 𝑖𝑖 = 1,2, ⋯ ,2 𝑁𝑁 }. The 2 here represents the two homologous copies in diploid
genome.
2.2 Probabilistic formulation of population structure modeling
The entire simulation can be treated as an optimization problem for a maximum likelihood function.
For a given contact probability matrix 𝑨𝑨 = ( 𝑨𝑨 𝒊𝒊𝒊𝒊
)
𝑵𝑵 × 𝑵𝑵 , where each entry is the probability that a
direct contact between domains 𝑖𝑖 and 𝑗𝑗 exist in a structure of the population, we aim to
reconstruct all 3D structures 𝑿𝑿 in the population of 𝑴𝑴 models each contains 2N homologous
regions. For reconstructing 𝑴𝑴 structures of a population from the observed Hi-C probability
matrix 𝑨𝑨 , we need the detailed information of which sphere pairs have contact in which cells for
supplementing the observed average data 𝑨𝑨 , because 𝑨𝑨 is far from completeness, due to (i) its
population-averaged nature without contact information at the single cell level and (ii) its lack of
the homologous information for all contacts. Therefore, we introduced a latent variable 𝑾𝑾 =
23
( 𝑾𝑾 𝒊𝒊𝒊𝒊 𝒊𝒊 )
𝟐𝟐 𝑵𝑵 × 𝟐𝟐 𝑵𝑵 × 𝑴𝑴 for complementing every single cell’s information and homologous information
missed by 𝑨𝑨 . It is a binary-valued 3rd-order tensor specifying the contacts of homologous
genomic regions in each structure. We can jointly approximate the structure population 𝑿𝑿 and
the contact tensor 𝑾𝑾 by maximizing the log-likelihood of the probability.
log 𝑃𝑃 ( 𝐗𝐗 | 𝐀𝐀 , 𝐖𝐖 ) = log 𝑃𝑃 ( 𝐀𝐀 , 𝐖𝐖 | 𝐗𝐗 )
The maximum likelihood optimization is achieved through a step-wise iterative process, where we
gradually increase the optimization hardness by gradually adding contacts of the matrix 𝐀𝐀 𝜃𝜃 =
� 𝐴𝐴 𝑖𝑖 𝑗𝑗 𝜃𝜃 �
𝑁𝑁 × 𝑁𝑁 with decreasing contact probability threshold θ. The iterative procedure involves two
steps by, (1) Assignment step (A-step): Given the estimated structures X at step k, estimate W by:
𝑊𝑊 ( 𝑘𝑘 + 1)
= argmax
𝑊𝑊 {log 𝑝𝑝 ( 𝐴𝐴 , 𝑊𝑊 | 𝑋𝑋 )} , 𝐺𝐺𝑖𝑖 𝐺𝐺 𝐺𝐺 𝐺𝐺 𝑋𝑋 = 𝑋𝑋 ( 𝑘𝑘 )
And then (2) Modeling step (M-step): Given the estimated W to generate model X at step k+1:
𝑋𝑋 ( 𝑘𝑘 + 1)
= argmax
𝑋𝑋 {log 𝑝𝑝 ( 𝐴𝐴 , 𝑊𝑊 | 𝑋𝑋 )} , 𝐺𝐺𝑖𝑖 𝐺𝐺 𝐺𝐺 𝐺𝐺 𝑊𝑊 = 𝑊𝑊 ( 𝑘𝑘 + 1)
However, because only contact information is included into the structure modeling (in contrast to
non-contact information), it is possible that previously assigned contacts conflict with the new
ones, which leads to a compact, unphysical structure that cannot be easily relaxed. Here we
propose the improved framework with a new step named Iterative Refinement (Figure 2). This
step improves the assignment of contacts in 𝑾𝑾 using the knowledge from previous assignments
and output matrix 𝑪𝑪 . The correction then estimates the portion of contacts that can be relocated
for the next step. Our improved method leads to better and faster optimization suitable for higher
resolution genome structure than the predecessors.
2.3 Iterative refinement of restraint assignment
Here we introduce new methodological innovation in our optimization scheme, iterative refinement,
24
which allows us to generate genome structures at higher resolution and with improved accuracy
in comparison to our previous approach. The probabilistic framework for deconvolving ensemble
data is a data driven method. This method only considers restraints which are contact information,
and non-contact information are not included, therefore an excess amount of wrongly assigned
contacts can be introduced during the optimization process and thus drastically increase the
optimization time for resolving them.
Due to cooperative contacts and random encounters of domains, domain spheres might get close
to each other even if they are not readily assigned restraints because of already imposed contacts.
It is likely that during the optimization the models contain too many contacts. If we would impose
non-contacts, we would have no problem with overrepresented contacts. We tinker with this issue
by modify the restraint assignment step, where we distribute domain contacts (i,j) in a subset of
structures in the population based on the input aij.
To minimize excess contacts, we use a heuristic method to infer the minimal possible number of
to correct constraints required in each iteration for optimizing a structure population consistent
with the Hi-C data. We assume that the contact frequency for domain pair (i,j) are a consequence
of two parts, constrained contacts and non-constrained excess contacts:
𝑁𝑁 𝑖𝑖 𝑗𝑗 𝑒𝑒𝑒𝑒 𝑒𝑒 𝑒𝑒𝑒𝑒 𝑒𝑒 𝑒𝑒𝑒𝑒
= 𝑁𝑁 𝑖𝑖 𝑗𝑗 𝑒𝑒 𝑐𝑐𝑐𝑐𝑐𝑐 𝑒𝑒 𝑐𝑐 𝑐𝑐 𝑖𝑖 𝑐𝑐 𝑒𝑒𝑒𝑒 + 𝑁𝑁 𝑖𝑖 𝑗𝑗 𝑐𝑐𝑐𝑐𝑐𝑐 − 𝑒𝑒 𝑐𝑐𝑐𝑐 𝑐𝑐 𝑒𝑒 𝑐𝑐 𝑐𝑐 𝑖𝑖 𝑐𝑐 𝑒𝑒𝑒𝑒
Where 𝑁𝑁 𝑖𝑖 𝑗𝑗 𝑒𝑒𝑒𝑒 𝑒𝑒 𝑒𝑒𝑒𝑒 𝑒𝑒 𝑒𝑒𝑒𝑒
is the frequency of pair (i,j) to form contact and 𝑁𝑁 𝑖𝑖 𝑗𝑗 𝑒𝑒 𝑐𝑐𝑐𝑐𝑐𝑐 𝑒𝑒 𝑐𝑐 𝑐𝑐 𝑖𝑖 𝑐𝑐 𝑒𝑒𝑒𝑒 , 𝑁𝑁 𝑖𝑖 𝑗𝑗 𝑐𝑐𝑐𝑐𝑐𝑐 − 𝑒𝑒 𝑐𝑐𝑐𝑐𝑐𝑐 𝑒𝑒 𝑐𝑐 𝑐𝑐 𝑖𝑖 𝑐𝑐 𝑒𝑒𝑒𝑒
are the frequency of constrained contacts and non-constrained contacts. Applying this idea, we
assume the contact probability matrix A is defined by assigning contacts in a matrix 𝐀𝐀 �
= ( 𝐴𝐴 ̂
𝑖𝑖 𝑗𝑗 )
𝑁𝑁 × 𝑁𝑁 ,
where 𝜀𝜀 𝑖𝑖 𝑗𝑗 are non-constrained excess contacts:
𝐴𝐴 𝑖𝑖 𝑗𝑗 = 𝐴𝐴 ̂
𝑖𝑖 𝑗𝑗 + 𝜀𝜀 𝑖𝑖 𝑗𝑗
Unfortunately, the number of non-constrained/excess contacts cannot be inferred before we
optimize a structure population. However, in a step-wise and iterative optimization scheme we
can use the structure population generated in the previous iteration to approximate this number.
25
We use Q ij to denote the proportion of non-constrained/excess contacts that are formed due to
random and cooperative encounters as a result of other already imposed contacts. If we use the
contact indicator matrix C that has the same dimension as W such that each element Cijm is the
indicator of contact in model structure Xm for domain pair (i,j), and k for iteration step, then from
the previous equation at the step:k-1 we have:
� 𝑪𝑪 𝑖𝑖 𝑗𝑗𝑚𝑚
( 𝑘𝑘 − 1)
𝑀𝑀 𝑚𝑚 = 1
= � 𝑾𝑾 𝑖𝑖 𝑗𝑗𝑚𝑚
( 𝑘𝑘 − 1)
𝑀𝑀 𝑚𝑚 = 1
+ � 𝑀𝑀 − � 𝑾𝑾 𝑖𝑖 𝑗𝑗𝑚𝑚
( 𝑘𝑘 − 1)
𝑀𝑀 𝑚𝑚 − 1
� 𝑄𝑄 𝑖𝑖 𝑗𝑗
We can solve Q ij by:
𝑄𝑄 𝑖𝑖 𝑗𝑗 =
∑ 𝑪𝑪 𝑖𝑖 𝑗𝑗𝑚𝑚
( 𝑘𝑘 − 1)
𝑀𝑀 𝑚𝑚 = 1
− ∑ 𝑾𝑾 𝑖𝑖 𝑗𝑗𝑚𝑚
( 𝑘𝑘 − 1)
𝑀𝑀 𝑚𝑚 = 1
𝑀𝑀 − ∑ 𝑾𝑾 𝑖𝑖 𝑗𝑗𝑚𝑚
( 𝑘𝑘 − 1)
𝑀𝑀 𝑚𝑚 = 1
Once we get 𝑄𝑄 𝑖𝑖 𝑗𝑗 we can have:
𝑨𝑨 𝑖𝑖 𝑗𝑗 = 𝑨𝑨 �
𝑖𝑖 𝑗𝑗 + 𝜀𝜀 𝑖𝑖 𝑗𝑗 = 𝑨𝑨 �
𝑖𝑖 𝑗𝑗 + �1 − 𝑨𝑨 �
𝑖𝑖 𝑗𝑗 � 𝑄𝑄 𝑖𝑖 𝑗𝑗
𝑨𝑨 �
𝑖𝑖 𝑗𝑗 = ( 𝑨𝑨 𝑖𝑖 𝑗𝑗 − 𝑄𝑄 𝑖𝑖 𝑗𝑗 ) (1 − 𝑄𝑄 𝑖𝑖 𝑗𝑗 � )
𝑨𝑨 �
𝑖𝑖 𝑗𝑗 , is the estimated proportion of constraints to be assigned to 𝑾𝑾 and the activation distance
𝑑𝑑 𝑖𝑖 𝑗𝑗 𝑐𝑐 𝑒𝑒𝑒𝑒
are updated according to 𝑨𝑨 �
. Then for each existing structure in structure population 𝑿𝑿 ,
assign a harmonic restraint according to corresponding 𝑾𝑾 .
2.4 PGS software pipeline for population-based genome structure
modeling
Based on the mathematical formulation of the process, we developed the PGS (Population-based
Genome Structure) software pipeline (https://www.github.com/alberlab/pgs). The PGS modeling
package takes two inputs: an experimental Hi-C contact frequency map, and a segmentation of
26
the genome sequence into chromatin domains (e.g. TADs). PGS generates a population of 3D
genome structures where each domain is represented as a sphere, and the distribution of physical
contacts between domain spheres across the population reproduces the Hi-C experiment. The
software automatically generates an analysis of the structure population, including a description
of the model quality based on its contact probability agreement with experiments and various
structural genome features, including the radial nuclear positions of individual chromatin domains.
The software also comes with a set of additional tools to facilitate user customized analysis, such
as a tool for exporting PDB structures for visualization. The individual genome structures also
contain a wealth of information and can be used to detect higher-order structural patterns of
chromatin regions. As gold standard assessment, it is necessary to compare defined structural
features from the structure population with independent experiments not included as input
information when generating the models, for example, such information may be distances
between specific loci from 3D FISH experiments
59
, contact frequencies between chromatin and
the nuclear envelope from lamina DamID experiments
75
or spatial features extracted from soft X-
ray tomography experiments
59
.
The PGS package generates a large number of genome structures, which constitute an optimized
structure population consistent with the input data. The complexity of this computational problem
originates also from the large scale of the input data (high-resolution, genome-wide Hi-C contact
frequencies), which must be processed to generate constraints on the structure population. To
meet this computational challenge, PGS has been designed to run in a high-performance
computing (HPC) environment, such as Sun grid engine (SGE) or T orque. We have also designed
PGS to work on a laptop or personal computer, but this application should only be used to
generate a small population of structures (around 100 for testing purposes). PGS is implemented
as a single Python software package for ease of installation and use. We wrapped the source
code in pyflow (https://github.com/Illumina/pyflow), a lightweight parallel task engine developed
by Illumina, which runs the whole complex simulation process through a single command without
any intermediate human intervention. Note that while the original pyflow library only supports local
27
computers and SGEs, we developed a modified version of pyflow, called pyflow-alabmod
(https://github.com/shanjunUSC/pyflow-alabmod) allowing PGS to run in a HPC environment with
PBS (Portable Batch System) script. In addition to PGS, users must install the independent
modeling software IMP (version 2.4 or above), which can be downloaded from
https://integrativemodeling.org/. Users should also install Python 2 (version 2.7 or higher) and
its libraries, including numpy, matplotlib, pandas, h5py, seaborn, and scipy (web addresses
indicated in the Materials section). To provide flexibility, we divided the whole workflow into three
independent, consecutive stages (Figure. 3):
Figure 3. PGS software workflows. Building the input matrix, modeling and
optimizing structure population with A/M cycles, and basic analysis from the final
structure population.
1) Producing a domain-domain contact probability matrix from the input Hi-C data. (Step 2, matrix
28
building)
2) Generating the optimized structure population. (Step 2, modeling step)
3) Producing a basic analysis summary for the resulting structure population. (Step 3-5)
Users who already have a domain-domain contact probability matrix can skip Step 2, matrix
building via the graphical user interface (GUI) (Figure 4a) by selecting TAD-TAD Prob option. By
default, PGS takes a raw (Hi-C) contact matrix as the input (Figure 4b). In any case, even if the
user skips this matrix building step, they must provide a text file containing the chromosome
segmentations (i.e., the domain or TAD definitions; Figure. 4c). The required file formats are
described in the Materials section.
Figure 4. PGS setup. (a) GUI to help users generate configuration files. (b) An
example showing the format of an acceptable contact frequency matrix file. (c)
An example showing the format of an acceptable TAD file.
29
PGS comes with a GUI to help new users generate the input configuration file (a json file). For an
experienced user, it is straightforward to directly modify the input configuration file. This file
contains the location of the raw Hi-C matrix file, the location of the chromatin segmentation or
TAD definition file, modeling parameters, and system parameters. The first component normalizes
the raw Hi-C contact map using KR-normalization
57,117
and generates a TAD-level contact
probability matrix. The second component generates an optimized population of a given number
of genome structures through the iterative A-step and M-step cycles. The third component
produces a report on the quality of the optimization, as well as basic structural analyses such as
contact frequency heat maps and the average nuclear radial position of each TAD (Figure. 5).
Also, PGS has online documentation where detailed installation and usage can be found
(http://pgs.readthedocs.io/en/latest/).
Figure 5. Examples of PGS outputs. (a) Structure population. (b) Histogram of
violated constraints. The maximum number of violated restraints is defined in the
“violation cutoff” configuration setting (see Procedures, Step 1). (c) Heat map of
contact probabilities from the final structure population. The color scheme is from
white (0) to red (1). (d) Density scatter plots comparing all pairwise domain contact
probabilities from the structure population and the input Hi-C data. The Pearson’s
30
correlation coefficient (PCC) of the comparison is indicated. Histograms of the
contact probabilities are shown along the sides of the plot. (e) The average radial
position of domains along a chromosome. PGS generates these plots for every
chromosome.
31
Chapter 3 Extending population-based genome
structure modeling through comprehensive data
integration
Figure 6. The overall scheme of data integration into population-based genome
modeling framework. The framework is highly flexible so that it can include
pairwise interactions, chromatin-nuclear body interactions, and high-order
interactions at single cell level as well as imaging data.
Most models of genome structures (including my studies described above) have relied on just
one data source, such as Hi-C, even though a single experimental method cannot typically
capture all aspects of the spatial genome organization. However, integrating data from a wide
range of technologies, each with complementary strengths and limitations will greatly increase
the accuracy, resolution and coverage of genome structure models. Moreover, integrative models
32
will offer a way to cross-validate the consistency of data obtained from complementary
technologies, for example from imaging and genomic technologies. However, it remains a major
challenge to develop hybrid methods that can systematically integrate data from many different
technologies to generate structural maps of the genome. Population-based deconvolution
methods provide an ideal framework for comprehensive data integration.
To achieve this one must develop methods that can accurately relate experimental information
from genomic and imaging methods into accurate representations of chromatin structures that
also capture the dynamic nature of the genome.
We are now solving this problem partially by integrating data from a subset of all existing methods
(Figure 6). For methods measuring frequencies of pairwise chromatin-chromatin interactions like
3C methods (Hi-C/MicroC/TCC), contact frequencies can be transformed to probabilities of
observing a given chromatin contact in a cell population. For methods measuring contacts to
nuclear bodies like Lamin-B DamID, which measures the chromatin proximity to nuclear envelop,
data can also be transformed to contact probabilities between chromatin and its target. There are
also methods measuring the high-order chromatin proximity information like SPRITE
29
, which
uses split-pool barcoding followed by sequencing to map both DNA and RNA co-localization
clusters and GAM
17
, which uses laser microdissection on frozen cell to extract nuclear profiles.
This kind of information can be integrated by adding single cell multivariate contacts constraints
to ensure that specific chromatin clusters are present in individual cells of the population. Data
generated by multiplex FISH can also be converted to spatial restraints in similar ways.
Together with colleges in our lab, I am now expanding my work to develop a new platform called
Integrated Genome Modeling (IGM) which is intended to be able to handle different genome types
(diploid, haploid, phased, modified), different nucleus shapes and different data sources. One
would need to make minimum changes to support new dataset and to deploy parallel computing
on different platform. Currently the platform supports integrated modeling of genome with Hi-C,
DamID, FISH and SPRITE dataset.
33
3.1 Mathematical formulation of comprehensive data integration
through population-based genome structure modeling
Here we describe a generalized framework that allows comprehensive integration of many data
types. To formalize the process, we can generally divide data into univariate, bivariate and
multivariate data types.
3.1.1 Univariate data
Univariate data describe features that depend on only a single chromosome region, such as the
probability of a chromosome region to be in proximity to the lamina at the NE, which is derived
from lamina-DamID data. Typically these data can be represented by a probability vector 𝑈𝑈 =
�( 𝑢𝑢 𝐼𝐼 | 𝐼𝐼 = 1,2, . . , 𝑁𝑁 ) � with elements describing the probability that chromosome regions I=1,2,…,N
share a given feature (e.g. the probability of a chromosome region I to be in proximity with the
lamina at the NE). N is the number of all chromosomal regions in the genome (i.e. the total
number of TAD domains or 100kb bins).
3.1.2 Bivariate data
Bivariate data describe features that depend on two chromatin regions, for example the probability
of interaction between two chromosomal regions. Typically, these data are represented by a
probability matrix 𝚳𝚳 = � 𝑚𝑚 𝐼𝐼𝐼𝐼
� 𝐼𝐼 , 𝐽𝐽 = 1,2, . . , 𝑁𝑁 � ; for example, a contact probability matrix derived from
ensemble Hi-C data. Each element mIJ is the probability that a given contact between
chromosomal regions I and J exists in an ensemble of cells.
3.1.3 Multivariate data
Multivariate data describe higher order relationships between multiple chromosome regions, for
example, the probability of observing several different chromosome regions to be in spatial
proximity to each other. These data are represented by third order or higher order tensors 𝚻𝚻 =
� 𝑡𝑡 𝐼𝐼𝐼𝐼𝐼𝐼
� 𝐼𝐼 , 𝐽𝐽 , 𝐾𝐾 = 1,2, . . , 𝑁𝑁 �. The general goal is to generate a population of genome structures ( 𝐗𝐗 =
34
� 𝐗𝐗 1
, 𝐗𝐗 2, ⋯,
𝐗𝐗 𝑆𝑆 � as a set of S diploid genome structures) that are statistically consistent with all
available univariate (U), bivariate ( 𝑴𝑴) and multivariate ( 𝑻𝑻 ) data probabilities.
3.1.4 Formulation
We formulate this genome structure modeling problem as a maximization of the likelihood
𝑃𝑃 ( 𝑈𝑈 , 𝚳𝚳 , 𝚻𝚻 | 𝐗𝐗 ). With known 𝑈𝑈 , 𝚳𝚳 and 𝚻𝚻 , we can calculate the structure population 𝐗𝐗 such that
the likelihood 𝑃𝑃 ( 𝑈𝑈 , 𝚳𝚳 , 𝚻𝚻 | 𝐗𝐗 ) is maximized. However, in a diploid genome, each domain has two
homologous copies and in most cases the data expressed in the probability vectors U, matrices
M and tensors T do not distinguish between homologous chromosome copies (i.e., the data is
based on unphased data). Moreover, U , 𝚳𝚳 and 𝚻𝚻 are probabilities typically derived from
ensemble experiments (for example, ensemble Hi-C experiments), therefore they cannot reveal
which of the individual features co-exist in the same 3D structure.
To represent information derived from individual cells, we introduce latent variables (V,W,R),
which contain the information missing from ensemble data, namely which of the observed features
belong to each of the S structures in the model population and also which homologous
chromosome copies are involved. For example, for univariate lamina-DamID data, the latent
variable is a binary matrix, 𝐕𝐕 = ( 𝐺𝐺 𝑖𝑖 𝑐𝑐 )
2 𝑁𝑁 × 𝑆𝑆 , which specifies which domain is located near the NE
in each structure s of the population and also distinguishes between the two homologous TAD
copies ( 𝐺𝐺 𝑖𝑖 𝑐𝑐 = 1 indicates that TAD i is located near the NE in structure s; 𝐺𝐺 𝑖𝑖 𝑐𝑐 = 0 otherwise). For
bivariate data we introduce the latent variables 𝐖𝐖 = � 𝑤𝑤 𝑖𝑖 𝑗𝑗𝑐𝑐
�
2 𝑁𝑁 × 2 𝑁𝑁 × 𝑆𝑆 , which is a binary, 3rd-order
tensor. For example, in case of Hi-C data, we define 𝑾𝑾 as the “contact indicator tensor”, which
contains the information about which specific domain contacts belong to each of the S structures
in the model population and also which homologous chromosome copies are involved ( 𝑤𝑤 𝑖𝑖 𝑗𝑗𝑐𝑐
= 1
indicates a contact between domain spheres i and j in structure s; 𝑤𝑤 𝑖𝑖 𝑗𝑗𝑐𝑐
= 0 otherwise). W is a
detailed expansion of M into a diploid, single-structure representation of the data. The structure
population X is consistent with W. Therefore, the dependence relationship between these three
variables is given as 𝐗𝐗 → 𝐖𝐖 → 𝐌𝐌 . Similarly, the dependence relationship between X, V and U is
35
given as 𝐗𝐗 → 𝐕𝐕 → 𝑈𝑈 , because X is the structure population consistent with V and V is a detailed
expansion of U at a diploid and single-structure representation of the data. Similarly, we define a
latent variable R to define the missing information in the multivariate data type T.
In addition, we can also consider additional information such as the nuclear volume or excluded
volume constraints applicable to all domains that describe the non-overlapping volume of each
domain.
Thus, the data integration optimization problem is expressed as:
𝐗𝐗 �
= arg max
𝐗𝐗 , 𝐕𝐕 , 𝐖𝐖 , 𝐑𝐑 log 𝑃𝑃 ( 𝑈𝑈 , 𝚳𝚳 , 𝚻𝚻 , 𝐕𝐕 , 𝐖𝐖 , 𝐑𝐑 ⌈ 𝐗𝐗 )
subject to �
nuclear volume constraint
excluded volume constraint
The log likelihood can be expanded as
log 𝑃𝑃 ( 𝑈𝑈 , 𝚳𝚳 , 𝚻𝚻 , 𝐕𝐕 , 𝐖𝐖 , 𝐑𝐑 ⌈ 𝐗𝐗 ) = log 𝑃𝑃 ( 𝑈𝑈 , 𝚳𝚳 , 𝚻𝚻 , ⌈ 𝐕𝐕 , 𝐖𝐖 , 𝐑𝐑 ) 𝑃𝑃 ( 𝐕𝐕 , 𝐖𝐖 , 𝐑𝐑 ⌈ 𝐗𝐗 )
= log P( 𝐓𝐓 | 𝐑𝐑 ) 𝑃𝑃 ( 𝚳𝚳 | 𝐖𝐖 )P( 𝑈𝑈 | 𝐕𝐕 )P( 𝐕𝐕 , 𝐖𝐖 , 𝐑𝐑 ⌈ 𝐗𝐗 )
Because there is no closed form of solution to the problem, we have developed a variant of the
Expectation-maximization (EM) method to iteratively optimize this log likelihood
59,75
. Each
iteration consists of two steps:
• Assignment step (A-step): Given the current model 𝐗𝐗 ( 𝑒𝑒 )
at the iteration step t,
estimate the latent variables 𝐑𝐑 ( 𝑒𝑒 + 1)
, 𝐖𝐖 ( 𝑒𝑒 + 1)
and 𝐕𝐕 ( 𝑒𝑒 + 1)
by maximizing the log-
likelihood over all possible values of R, W, and V.
• Modeling step (M-step): Given the current estimated latent variables 𝐑𝐑 ( 𝑒𝑒 + 1)
,
𝐖𝐖 ( 𝑒𝑒 + 1)
and 𝐕𝐕 ( 𝑒𝑒 + 1)
, find the model 𝐗𝐗 ( 𝑒𝑒 + 1)
that maximizes the log-likelihood
function.
The detailed implementation of the A-step and M-step are described in detail in
59,75,118
. The step-
wise optimization strategy also includes a gradual increase of the optimization hardness by adding
36
contact constraints gradually with decreasing contact probability threshold.
3.2 Software design and implementation
In order to make the new integrated platform work we agreed on some design guidelines. The
first is the flexibility. The program is able to handle different genome types (diploid, haploid,
phased, modified), and different nucleus shapes. Also, the program is able to handle different data
sources to be called “Integrated”. The second is that the program is easy to use/deploy. One
would need to make minimum changes to support new data and different genome. One would
need to make minimum changes to deploy parallel computing on different platform. Under that
principle we design the workflow of IGM as follows: (Figure 7)
Layer 1: Spatial restraints
This layer handles all restraint that can be translated from data. The spatial restraints are
described as a set of points and scoring functions.
Layer 2: Genome model representation
This layer handles the model itself as well as the spatial restraints from layer 1. The model
represents the whole genome structure where spheres are representing genomic domains. Force
fields of excluded volume are applied.
Layer 3. Optimization Kernel
Given the representation and force fields, we apply molecular dynamics to optimize the structure.
This involves the public available MD solver like IMP and lammps. Currently we only apply
lammps solver.
Layer 4. The distributed system controller.
Considering the large amount of calculations involved in the whole workflow, we designed a map-
reduce system on the High-Performance Computing Cluster. We distribute each individual
37
optimization task and also evaluation task into a collection of computing nodes in order to utilize
the distributed computing power. Currently this system is built upon ipyparallel. We are planning
to include other controller types such as dask or the task scheduler of HPC itself.
Figure 7. Overall design of IGM.
We also designed a user interface for easy manipulation of parameters and visualization. Here
we show a screen shot of the visualization web-based interface. (Figure 8)
38
Figure 8. Web-based IGM interface and visualization.
39
Chapter 4 Studying spatial organization of whole
genomes through structure modeling
Studies of the nuclear architecture are essential for a better understanding of genome function.
Accurate partitioning of the genome into subnuclear compartments is required for gene
expression regulation, activation of developmental programs and maintaining cellular identity
during differentiation. Promoter-enhancer interactions involve spatial contacts often over
considerable sequence distances and are often enhanced by being organized within the same
topological associated domains (TADs), which favor interactions within and minimizes interactions
between chromatin domains
4,119,120
. The domain borders are often demarcated by convergent
CTCF binding sites, which leads to a domain formation through a loop extrusion process. Other
architectural proteins and mechanisms are likely to stabilize specific domain borders and some
domain formation is driven by the segregation of chromatin into cohesive functional compartments
rather than being demarcated by CTCF binding sites. T opological domains are generally dynamic
structures and the formation of contact loops at CTCF sites increases the retention time for a
domain border at the specific site, which is then more likely to be observed in the averaged signal
of ensemble Hi-C data.
Combining the new and improved methods described above, which allows us to simulate entire
diploid genome structures at TAD – level (~1Mb), and higher level (10-200kb) of resolution. By
using 3D structure models, we can gain a better understanding of the genome architecture inside
the nucleus. 3D models also facilitate cross-validation of independent data types and bridges
genomics technologies with microscopy. Structure models also offer a new direction to call TADs
from different kind of data sources. Here we present a couple of case studies using the population-
based modeling results to reveal the overall architecture of entire genomes.
40
4.1 Case study: Comparative analysis of liver and cardiac chromatin
structure
While we understand transcriptome changes across multiple organs and disease states, a major
gap in our basic understanding of organ function is how genome architecture varies between cells
and how this relates to gene expression. To address these gaps in knowledge, we collaborate
with the laboratory of Thomas Vondriska at UCLA to investigate chromatin structural differences
between heart and liver, and how they relate to tissue-specific gene expression programs. In this
case study, using 3D chromatin structure at TAD resolution, we show that more interchromosomal
interactions exist at organ-specific genes, and that about half of such interactions bridge distinct
compartments within both cardiac and liver nuclei. Intriguingly, these models reveal a discordant
overall nuclear compaction strategy between heart and liver, with the former packaging genes in
the transcriptionally active compartment A preferentially towards the center of the nucleus and the
latter exhibiting no preferential nuclear arrangement of A versus B compartments. T aken together,
our data suggest that intra- and interchromosomal chromatin architecture plays a role in
orchestrating tissue-specific gene expression in distinct organs.
4.1.1 Data sources and structure modeling
Hi-C datasets from this study were downloaded from NCBI GEO: Cardiac data corresponding to
our previous work were downloaded using accession number GSE96693 (Control_HiC)
121
. Liver
data were downloaded using accession number GSE104129 (Hi-C reps1-5)
122
. This dataset
comes from wild-type C57BL/6J mice whose hepatocyte nuclei were isolated via homogenization
and then crosslinked in 1%formaldehyde in PBS and quenched in glycine (125mM final
concentration) for Hi-C. Libraries for both heart and liver were constructed using MboI as the
restriction endonuclease.
To generate 3D models of topologically associating domains, we first ran TopDom
123
on ICE-
normalized 100 kb matrices to generate a list of TADs in cardiac and liver Hi-C data, using
41
window.size = 3 as a parameter. We then used PGS software
118
to generate 10,000 3D models
of the genome (autosomes + chrX), using default parameters, the 100 kb matrices, and TAD calls
as inputs. The contact probabilities between TADs in the resulting population of genome
structures are statistically consistent with the contact probability matrix from Hi-C experiments.
A/B compartmentalization was calculated on 5 kb resolution contact matrices. For both heart and
liver data, TADs were designated as being in compartment A or B based on the majority
compartmentalization status of the 5 kb that lie within each TAD.
4.1.2 Different compartmentalization status in Heart and Liver nucleus
3D genome structure models reveal distinct chromosomal structures within liver or cardiac
epigenomes, allowing comparison of one chromosome to another, enabling comparison of the
individual chromosomes between organs.
Figure 9. Properties of 3D models of heart and liver genomes. (A) The nucleus is
divided into 5 concentric shells, with shell 1 in the interior and shell 5 at the
periphery. In heart nuclei (left), TADs in compartment A are more likely to be found
toward the interior of the nucleus and TADs in compartment B toward the periphery.
In contrast, the liver Hi-C data (right) show that TADs in compartment A have a
higher probability of being found toward the periphery and those in compartment
B within one of the inner shells. Error bars indicate standard deviation of
observations for 10,000 structures. (B) Across distinct regions of chr11 (left) and
chr14 (right), radial positions of TADs between organs (heart in red, liver in blue)
differ. Positions of heart- (red dots) and liver-specific genes (blue dots)
42
superimposed. The y-axis shows average position (0 is the center of the nucleus,
1 indicates nuclear periphery), while the x-axis shows chromosome position in
megabases. (C) In 3D cardiac nuclear models (left), on average 57.8% of queried
A-compartment TADs form interchromosomal interactions (within 500 nm) with
regions in compartment A, while 60.7% of queried B-compartment TADs form
interchromosomal interactions with regions in compartment B. Contrastingly, liver
models show that 53.9% of queried A-compartment TADs are within 500 nm of
regions in compartment B, while 62.4% of queried B-compartment TADs interact
with the same compartment in a different chromosome. (D) Similar analysis as in
(A), but with TADs that have heart- (red bars) or liver-specific (blue bars) genes.
In cardiac nuclei (left), both heart- and liver-specific genes tend to associate with
the nuclear center, while in liver nuclei the trend is the opposite. Error bars indicate
standard deviation of observations for 10,000 structures.
To investigate the distribution of chromatin features in the nucleus, we divided the nuclear volume
into 5 concentric shells in such a way that each shell contains 20% of total number of TADs per
structure. Based on their radial positions, all TADs in each of the 10,000 genome structures are
then partitioned into the 5 shells. We then measured the probability for a TAD in a given
subcompartment (A/B) to be localized in each of the concentric shells (Figure 9A). We observe
striking differences in the internal organization of the compartments. In heart cells, chromatin in
compartment A shows the highest localization probability in the most inner shells (shell 1 in Figure
9A), and the probability gradually decreases toward the outer regions (shell 5 in Figure 9A, top
left panel). This observation is consistent with previous observations that showed highly
transcribed genes to be localized toward the interior regions of the nucleus
124
. Compartment B
shows the opposite behavior, with the highest localization probability for the outermost shell
(Figure 9A, lower left panel), consistent with the location of heterochromatin and lamina
associated domains at the nuclear envelope
6,125
. In contrast, liver cells show a different spatial
organization in the models. Compartment A is more evenly distributed with the highest localization
probability at the outermost shell, while compartment B shows a slight decrease in localization
probability toward the most outer shell.
T o gain a quantitative understanding of interchromosomal TAD-TAD colocalization, we studied the
compartment composition at the interchromosomal boundaries. At each TAD position, we
43
determined all TADs that are localized within a distance of 500 nm and are part of a different
chromosome. We then determined the percentage of A/B compartment found in this group of
inter-chromosomal TAD neighbors. The heart genome shows a high preference for TADs in the
same chromatin compartment across chromosome boundaries, indicating a high level of
compartmentalization across chromosome borders. In liver cells, we observe a different behavior.
While TADs in subcompartment B also show a high preference to be in proximity to TADs in same
state, TADs of state A do not show a preference for the same state, showcasing the different
global organization of the genome in liver nuclei.
We also calculated the average radial position of each TAD with respect to the nuclear center
(Figure 9B). When plotting the average radial positions for each TAD across a chromosome we
observe distinct regional differences with well-defined local minima and maxima (Figure 9B).
TADs corresponding to minima are on average more interior located than directly neighboring
regions in the same chromosome. These radial position profiles are markedly different for the
same chromosomes in the two tissues. The correlation between the radial position profiles is very
low, and in some regions even anti-correlated (Figure 9B). These distinctions are further
illustrated when examining the likelihood of regions from the same compartment to interact with
each other (Figure 9C).
Finally, we examined the localization of chromatin from a gene centric view, determining the
relative positioning of heart and liver specific genes in the different nuclei (Figure 9D). In
agreement with the observations from Figure 9A, this gene centric analysis revealed a preference
of interior localization of genes in cardiac nuclei and the antithetical behavior in liver nuclei. In
summary, our structure-based calculations support the notion that, on a TAD scale (hundreds of
kilobases), there are major structural differences in the global 3D structural organization of liver
and heart genomes.
The majority of chromatin conformation capture studies that have emerged the past few years
have focused exclusively on intrachromosomal interactions. The adult cardiac myocyte, which
44
does not divide, is an interesting test case to explore the role of interchromosomal contact
surfaces in genome function—principally, although not exclusively, via gene regulation. A liver Hi-
C dataset of comparable sequencing depth afforded the opportunity to explore contrasting
features of such interactions, should they exist, within the same genome housed in separate cells’
nuclei. Both epigenomes exhibited similar levels of interchromosomal interactions and in both
cases, they were enriched in genes associated with the function of that cell type. Combining these
interactions with 3D renderings of genomes in heart and liver provided a unique opportunity to
investigate differences in chromosome folding and nuclear organization. Several observations
emerged: liver and heart cells not only package their genomes differently, but they appear to obey
distinct general principles of organization, wherein heart genomes preferentially localize
compartment A regions toward the center and compartment B regions toward the periphery,
whereas liver cells do not exhibit this behavior. Future studies will investigate whether
interchromosomal interaction surfaces participate in such behaviors as cell proliferation, whether
they change with age or are dependent on developmental state, and what non-DNA molecules
inhabit the surfaces of interchromosomal apposition, presumably orchestrating the reproducible
formation of these structures.
4.2 Case study: Structural organization of subnuclear chromatin
compartments in human cells
Accurate partitioning of the genome into subnuclear compartments is required for gene
expression regulation, activation of developmental programs and maintaining cellular identity
during differentiation. Although spatial partitioning of the genome is a key requirement for correct
genome functions, the specification of functional chromatin classes is not clearly defined and
subject to interpretation of the data and arbitrary choices of parameters. Genomics experiments
define active (A) and inactive (B) functional subcompartments from similarities in genome-wide
chromatin interactions, while microscopy shows the spatial segregation of chromatin in relation to
their nuclear positioning and the formation of nuclear bodies such as nuclear speckles, nucleoli,
45
and lamina associated sequestered chromatin. However, microscopy cannot map the chromatin
identity on a genome-wide scale. No method exists to date that can easily bridge these views and
provide a comprehensive picture to allow a structure-based analysis of the functional partitioning
of chromatin into spatially distinct subcompartments. However, microscopy cannot reveal the
identity of chromatin at nuclear bodies on a genome-wide scale. Here, we aim to unite genomics
with microscopic data to acquire a more thorough insight about the basic principles governing the
partition and sequestering of genomic regions into nuclear bodies, which drive the spatial
organization of genomes and directly impact chromatin function. For instance, chromatin located
at the nuclear lamina is more likely to be silenced, while genes close to nuclear speckles tend to
be highly expressed
6
. Our population-based genome structure modeling approach can be a
powerful tool to derive new insights into the mechanisms of sequestering genomic regions into
functional cohesive units and therefore provide details for deciphering the basic principles of
functional genome organization. Our method links the subcompartment description from
ensemble genomics data and the 3D spatial interpretation of subcompartments in single cells
from microcopy.
We generated genome structure populations for human lymphoblastoid cell line GM12878. Our
3D structures represent the genome at TAD and 200kb resolution and explain structural details
over a range of scales from the partitioning of chromosome territories and segregation of
functional subcompartments down to the structural folding and partition of topological domains.
The generated structures agree remarkably well with independent experimental data about
structural and epigenetic features of the genome. Moreover, several features have been revealed
for the first time through a structure population at TAD and 200kb level. We analyzed in detail the
spatial organization of subnuclear chromatin compartments. Depending on the functional states,
these chromatin subcompartments show a distinct difference in their 3D organization, segregate
into distinct chromatin bodies, which differ between subcompartments in size, connectivity and
nuclear locations. We also identified chromatin bodies associated to nucleoli and nuclear speckles,
which allowed an in-depth analysis of chromatin composition, nuclear locations and spatial
46
associations with other subcompartments.
4.2.1 Pre-processing of in situ Hi-C data
We used in situ Hi-C datasets for the human lymphoblastoid cell line GM12878. The raw contact
map is downloaded from Gene Expression Omnibus (GEO) under the accession number
GSE63525. Following the protocol by Imakaev et al.
57
bins with the lowest 2% sequence coverage
and were discarded during the normalization process. For data normalization, we adopted the
same KR normalization method used in
10
, leading to a normalized contact frequency matrix 𝑭𝑭 =
( 𝒇𝒇 𝒊𝒊𝒊𝒊
)
𝑲𝑲 × 𝑲𝑲 at 100 kb resolution with K = 30376 bins.
TAD resolution model: We then applied the topological associated domain (TADs) calling method
to detect the size and location of TADs and non-TAD regions, such as topological boundaries and
unorganized chromatin regions. In total 2320 TADs and non-TAD regions have been identified for
GM12878 cell, with an average size of ~1.2Mb. These TADs and non-TAD regions will be further
represented as spheres in our structure modeling. We denote N=2320 as the number of
topological associated domains. We then generated probability matrix at bin and TAD level as our
further input of our algorithm.
200kb resolution model: In addition, we used the same in situ Hi-C datasets at 20kb resolution
following the same preprocessing pipeline described above to generate a normalized contact
frequency matrix. Then generate the probability matrix at 200kb level with size of N = 15166 bins
as our modeling input.
4.2.2 Modeling representation and parameters
4.2.2.1 Genome Representation
The nuclear architecture of the GM12878 cells consists of the nuclear envelope (NE) represented
as a sphere with radius R= 5 micron, and two copies of 23 individual chromosomes.
At TAD level, the genomic sequence is partitioned into a total of 2320 topological domains (TADs),
for a total of 4640 domains for the diploid genome. Each TAD domain 𝑖𝑖 is represented by a
47
sphere particle. A domain sphere is characterized by two radii: (1) its hard (excluded volume)
radius 𝑟𝑟 𝑖𝑖 , which is estimated from the DNA sequence length and the nuclear occupancy of the
genome; and (2) its soft (contact) radius which is twice the hard radius 𝑟𝑟 𝑖𝑖 𝑆𝑆 = 2 × 𝑟𝑟 𝑖𝑖 . A contact
between two spheres is defined as an overlap between the spheres’ soft radii. Therefore, a contact
between two domains is defined when the center-to-center distance between the domains is
smaller than the sum of 2 times of the radius 𝑑𝑑 𝑖𝑖 𝑗𝑗 ≤ 2( 𝑟𝑟 𝑖𝑖 + 𝑟𝑟 𝑗𝑗 ). This two-radius model allows for
the possibility that chromatin can partially loop out of its bulk domain region to form contacts, while
establishing a minimum genome occupancy in the nucleus. According to experimental data, the
combined hard-core spheres of all domains occupy around 20% of the nuclear volume in
agreement with previous estimates
59,118
. The diploid genome comprises a total of 4640 domains.
At 200kb level, the genomic sequence is segmented into a total of 15166 bins with each bin
spanning a 200kb genomic region. Similarly, each region is represented by a sphere particle with
same radius 𝑟𝑟 𝑖𝑖 . Contact is also defined as the same as described above. The total diploid genome
comprises 30332 chromatin spheres.
4.2.2.2 Domain-level probability matrix and outlier removal
The TAD domain and 200kb contact probability matrix 𝐴𝐴 = ( 𝑎𝑎 𝑖𝑖 𝑗𝑗 )
𝑁𝑁 × 𝑁𝑁 is calculated from the 100 kb
and 20kb bin level Hi-C contact probabilities respectively, following a procedure previously
described
118
:
𝑎𝑎 𝑖𝑖 𝑗𝑗 = 𝑚𝑚 𝐺𝐺𝑎𝑎𝐺𝐺 ( 𝑡𝑡 𝑡𝑡 𝑝𝑝 10% < { 𝑝𝑝 𝛼𝛼𝛼𝛼
: 𝛼𝛼 ∈ 𝑏𝑏 ( 𝑖𝑖 ), 𝛽𝛽 ∈ 𝑏𝑏 ( 𝑗𝑗 )} > )
where 𝑝𝑝 𝛼𝛼𝛼𝛼
is the 100kb (for TAD model) or 20kb (for 200kb model) bin level contact probability
between bin 𝛼𝛼 and 𝛽𝛽 with 𝑏𝑏 ( 𝑖𝑖 ) as the set of all bins in matrix 𝑷𝑷 that belong to segment 𝑖𝑖 . The
bin-level contact probability matrix 𝑷𝑷 = ( 𝑝𝑝 𝑖𝑖 𝑗𝑗 )
𝐼𝐼 × 𝐼𝐼 is calculated as 𝑝𝑝 𝑖𝑖 𝑗𝑗 = min (
𝑓𝑓 𝑖𝑖𝑖𝑖
𝑓𝑓 𝑚𝑚 𝑚𝑚 𝑚𝑚 , 1), where 𝑭𝑭 =
( 𝑓𝑓 𝑖𝑖 𝑗𝑗 )
𝐼𝐼 × 𝐼𝐼 is the normalized Hi-C contact frequency matrix. Following our established method,
Fmax is defined by the total average contact frequencies within a genomic segment, and also
considerations about chromosome territory occupancy from polymer modeling, namely an
48
expectation of an average minimum number of nearest genomic sphere neighbors in the segment
bead proximity (for example, random packing of domains in the nucleus of 10 micron diameter for
random polymer model will lead to an average number of 24 nearest bead neighbors within a
volume that encompasses twice the sum of bead radii). We also tested varying fmax across a
wider range, which will not affect the outcome of our results.
Outlier contacts with extremely higher values than all surrounding contacts are identified by
{ 𝑝𝑝 : 𝑝𝑝 > 𝜇𝜇 + 2 𝐼𝐼 𝑄𝑄 𝐼𝐼 𝑡𝑡 𝑟𝑟 𝑝𝑝 < 𝜇𝜇 − 2 𝐼𝐼 𝑄𝑄 𝐼𝐼 } , where 𝑝𝑝 ∈ { 𝑝𝑝 𝛼𝛼𝛼𝛼
: 𝛼𝛼 ∈ 𝑏𝑏 ( 𝑖𝑖 ), 𝛽𝛽 ∈ 𝑏𝑏 ( 𝑗𝑗 )} and 𝜇𝜇 = 𝑚𝑚 𝐺𝐺𝑎𝑎 𝐺𝐺 { 𝑝𝑝 𝛼𝛼𝛼𝛼
: 𝛼𝛼 ∈
𝑏𝑏 ( 𝑖𝑖 ), 𝛽𝛽 ∈ 𝑏𝑏 ( 𝑗𝑗 )}, with IQR as the interquartile range of { 𝑝𝑝 𝛼𝛼𝛼𝛼
}. These outliers are excluded from
the calculation.
4.2.3 Model structures reach a remarkable agreement with input matrix
At TAD resolution, the genome structure population converged at θ=0.7% after 11 A/M iterations.
The genome-wide Hi-C contact profile generated from the final structure population agrees
remarkably well with the one from the Hi-C experiment, with a Pearson’s correlation of 0.98. Also,
the intra-chromosomal and inter-chromosomal heat maps show excellent agreement with a
Pearson’s correlation 0.97, and 0.92. Visually, the interaction features in the experiment are nicely
reproduced in the models (Figure 10A&B). High Pearson’s correlation values between Hi-C heat
maps can often be dominated by contributions of the diagonal. To better assess the agreement,
we also plot the scatter plot distribution, which also shows a strong matching between
experiments and our models (Figure 10C).
49
Figure 10. Comparison of input matrix and output models.
(A)The Hi-C contact probability matrix of chromosome 2 (left panel) and genome-
wide (right panel). The zoomed in heatmap shows the matrix of region 24Mb-
61.3Mb. (B) The contact probability matrix of chromosome 2 (left panel) and
genome-wide (right panel) calculated from our structure population. (C) Density
plot comparing the contact probabilities generated from Hi-C data and the structure
population (Pearson’s r = 0.98). (D) Histograms of restraint-violation ratio from the
structure population. For a pair of constrained domains, violation ratio is defined
as the ratio of the real distance over the expected distance. Violation ratio less
than 1.05 is considered as satisfied and is not displayed in the histograms (almost
all of restraints fall in this category).
50
Constraints satisfaction. A quantitative measure of agreement between experiment and models
are the number of violated restraints that is the number and violation scores of contacts that
cannot be enforced. With increasing number of iterative refinement, A/M steps, the total violation
scores and number of violated restraints decreases dramatically. The optimization converges
when more than 99.995% of all input restraints are satisfied and all corresponding contacts are
physically present in the models. On average about 23,300 restraints are imposed in each of
10,000 structures. We next investigate the deviation of contact distances for the 0.005% of
restraints that are violated in the models. To quantify the level of violation we calculate the relative
distance ratio, defined as the ratio between the distance between the two constrained domains in
the model and the expected target distance that is needed to satisfy the contact restraint. The
histogram of distance ratios shows that the distances of the violated restraints are very close to
the target distances. 99% of the violated constraints show distance ratios ranging from 1.05 to
1.25 (Figure 10D).
For models at 200kb resolution, the population of 10,000 genome structures converged at θ=0.8%.
The genome-wide Hi-C contact profile generated from the structure population agrees very well
with the one from the Hi-C experiment, with a Pearson’s correlation of ~ 0.99. On average each
structure is assigned with 236,000 restraints and 99.94% of input all restraints are entirely
satisfied.
4.2.4 Assessment of models and cross-validation of experimental data
4.2.4.1 Prediction of Hi-C contact frequencies from models generated by sparse data set
We validate the accuracy of genome structures using a re-sampling procedure by evaluating a
model generated with a sparse data set and assessed how well the resulting genome structures
can predict the missing data. To test the robustness of our modeling, we performed a cross-
validation using as input information a data set containing only 50% of randomly chosen Hi-C data
points from the H1-ESC 200kb probability matrix. We predicted the missing Hi-C contact
frequencies from models generated with the sparse Hi-C map with high agreement to the
51
experimental values (Pearson correlation, r=0.93) (Figure 11 upper panel).
To validate the accuracy of the genome models further, it is necessary to assess structural
features of the genome with other independent experimental data not used when generating the
structures.
Figure 11. Cross-validation test using 50% in situ Hi-C data
(A) Matrix of chromosome 2 showing the 50% dataset vs. the matrix generated
from 10,000 models. The upper triangle of the matrix shows the generated contact
probability matrix from models and the lower triangle of matrix shows the 50%
randomly chosen matrix. (B) The comparison of original dataset with the predicted
contact probability matrix from models. (C) The scatter plot showing the correlation
of the removed 50% part of the matrix and its corresponding predicted values from
models. The Pearson correlation is 0.93 for the values from experiment and the
predicted. Linear regression R is 0.87 with P-value < 2.2e-16.
4.2.4.2 Contact probability of chromatin with the nuclear envelope
Different chromatin regions show a preference to be located at specific locations in the nucleus,
as shown by their average radial positions. Even within the same chromosome some regions
show an increased probability to localize towards the nuclear periphery, while others localize more
often towards the interior of the nucleus. To assess the predictive power of our structures, we
compared these findings with independent data from lamina-DamID experiments, which map the
probability for chromosomal regions to be in proximity to lamina at the nuclear envelope
6
. For
52
comparison we used data from 118 single-cell lamina-DamID experiments for the cell line KBM7
73
.
In the TAD resolution model the predicted radial positions of domains agree very well with the
contact frequencies from DamID experiments (Figure 12A, B&C). TADs predicted with relatively
small average radial positions (<0.6) and therefore localized more often towards the nuclear
interior, show relatively low contact frequencies with the nuclear lamina in experiment. 80.5% of
all TADs with average radial positions smaller than 0.6, have lamina contact frequencies below
20% (Figure 12A). In contrast, TADs with average radial positions of 0.6 and higher, show a
sharp increase in the experimentally observed Lamina-DamID contact frequencies. 55% of the
TADs with radial positions larger than 0.6 show lamina contact frequencies above 20% (Figure
12A). To assess further the preferred positions of TADs in the nucleus, we divided the nuclear
volume into 5 concentric shells S={S 1,S2,S3,S4,S5}. Each shell Sn is defined by a set of lower and
upper radial bounds. In each genome structure, boundaries are set so that each shell contains
20% of the total number of TADs. Based on their radial positions, all TADs in each genome
structure are then partitioned into the 5 shells. We then measure the average lamina contact
frequency for domains found in each shell. Domains associated with the most outer shell clearly
show substantially higher lamina contact frequencies than domains of the inner shells (Figure
12B). Indeed, the lamina contact frequency increases gradually from the inner shell to the outer
shells (Figure 12B).
53
Figure 12. Experimental validation of structure population using Lamina-DamID
data.
(A) Density scatter plot for TAD radial position and lamina contact frequency (CF;
DamID data from Kinds et al. 2015). TADs with relatively small average radial
positions (<0.6), show relatively low contact frequencies with the nuclear lamina in
experiment in the experiments. Around 80.5% of the TADs have lamina CF < 20%.
In contrast, for TADs that are more often located at the outer nuclear regions
(average radial positions of 0.6 and or higher) show a sharp increase in the
experimentally observed Lamina-DamID contact frequencies, where 55% of the
TADs have lamina CF > 20%. (B) We divide the whole nuclear volume into 5
concentric shells with equal number of TADs based on their radial positions. Then
we calculate the average lamina CF for TADs in different shells in each individual
structure. Error bars represent the standard deviation across 10,000 structures. A
clear trend can be observed that the inner the shell is, the lower the lamina CF will
be. (C) Average radial position of TADs in chromosome 3 are plotted with lamina
CF. Valleys in the radial position plots match remarkably with low lamina contact
frequency (CF) regions (red dash lines). (D) The Lamina-DamID signal for
54
chromosome 1 in H1-ESC cells from Bas van Steensel lab, and the predicted
contact probabilities with nuclear envelop from 200kb resolution models. High
contact probability peaks are reproduced nicely. (E) The plot for Pearson
correlation between experimental Lamina-DamID signal and probabilities from
200kb structure models. Centromeres are excluded from comparison.
Next, we analyzed the variability of radial positions for TADs within a chromosome. When plotting
the average radial positions for each TAD across a chromosome we observe distinct regional
differences with well-defined local minima and maxima. TADs corresponding to minima are on
average more interior located than directly neighboring regions in the same chromosome.
Interestingly these minima correlate well with positions of large gaps in lamina contact frequencies
from experiment (Figure 12C), confirming our predictions.
The 200kb models also show excellent agreement between predicted and experimental lamina
DamID signal. Peaks of high contact probability with the NE coincide with peaks in the
experimental lamina DamID signal, while depleted regions in the experimental Lamina-DamID
signal are nicely reproduced also in the models (Figure 12D). The Pearson correlation between
the genome-wide experimental DamID signals and predicted values from the 10,000 models is
0.6. Next, we assess our models with imaging data from 3D Fluorescent in situ Hybridization
(FISH) experiments.
4.2.4.3 Inter-chromosomal co-location frequencies.
We now assess structural features observed in our TAD structure population with 3D FISH
experiments. We first analyze inter-chromosomal co-location patterns that we measured for 6
chromatin loci
27,59
. The co-location frequency is defined as the fraction of cells that show two
specific loci at a measured distance smaller than 1 μm (Figure 13A). We compare four inter-
chromosomal pairs of loci, namely between a locus on chromosome 19 and four other gene loci
on chromosome 11, which have no known functional connection
27
. In our models, two loci pairs
(H0-H1, H0-H2 in Figure 13A) were in spatial proximity substantially more frequently than the
other two pairs (H0-L1, H0-L2 in Figure 13A), which is in very good agreement with the FISH
55
experiments. The linear regression model of the co-location frequencies shows an excellent
agreement with a regression coefficient R = 0.9987 with F-statistic p-value = 6.7*10^-4 (Figure
13A), which is a substantially better agreement than our previous low-resolution model at lower
~4 Mb resolution.
Figure 13. Experimental validation of TAD structure population using FISH data.
(A) Comparison of frequency of the interchromosomal loci co-localization between
observed occurrence in FISH experiments and in the structure population. The co-
localization events between a locus on chromosome 19 (labeled H0; an active
56
domain) and any of the 4 loci on chromosome 11 (two inactive domains L1 & L2,
and two active ones H1 & H2) were counted. The structure population proposed in
this paper reproduced the frequencies very well. (B) Cumulative probability of
minimum distances between three interchromosomal loci in nuclei observed by
FISH experiments (left panel) and structure population (middle panel). The label
indicates the chromosome. The inset in the middle panel illustrates the distance
measured between 3 loci in a diploid nucleus (colored accordingly): the minimum
triplet distance t = (d1+d2+d3)/3 is defined as the smallest of all 8 possible loci
combinations from the two homologue chromosomes for each of the three loci. t is
then used to plot the cumulative probabilities. The label indicates the chromosome;
“cen 1-9-21” indicates peri-centromeres of chromosomes 1, 9, and 21; “far 1-9-21”
indicates 3 loci far from centromeres, i.e. chromosome 1: 179,836,951-
179,993,708 and chromosome 9: 110,383,457-110,589,787 and chromosome 21:
31,137,227-31,348,263. Linear regression (right panel) compares the
Wasserstein distance (WD) for every pair of cumulative frequencies from
experiment and model. (C) Comparison of 3D location of three distant genomic
loci on chromosome 6. A snapshot of nucleus with three different probes at far-
separating loci on chromosome 6 (left panel). A comparison of relative radial
position of three loci (middle panel). And a comparison of pairwise distance of three
loci. Similar finds are reproduced very well in our structure models.
4.2.4.4 Distributions of multivariate distances between chromatin regions
Next, we analyze higher order spatial relationships between sets of three loci on three different
chromosomes and determine their multivariate distance distributions. Specifically, we define the
minimum triplet distance t=(d1+d2+d3)/3 between the three loci and determine a cumulative
distribution of the triplet distances t from all structures in the population (Figure 13B). A cumulative
distribution of triplet distances describes the fraction of cells showing a minimal triplet distance t
that is smaller than a given value. In total, we analyzed 4 different triplet combinations on 9
different chromosomes. We have previously shown that chromosome specific higher order
centromere clusters play an important role in the global genome organization of lymphoblastoid
cells
59
. Therefore, we have selected loci in pericentromeric regions to probe the formation of
higher order centromere clusters and loci far from pericentromeric regions as control. Based on
our models we have selected three peri-centromeric chromatin clusters observed at different
57
probabilities in the population.
We discovered a cluster between pericentromeric regions of chromosomes 1, 9 and 21 (i.e., cen
1-9-21) at relatively high probability. For example, for the selected loci in the pericentromeric
regions of chromosomes 1, 9 and 21 (i.e., cen 1-9-21) 9.81% of structures have a triplet distance
t smaller than 1.5 micron. In contrast, the pericentromeric cluster of chromosomes 7, 10, and 12
(cen 7-10-12) is observed at a medium range probability. Only 4.14% of structures have a triplet
distance t within 1.5 micron. The pericentromeric cluster cen 2-3-6 is found with a relatively low
probability and only 0.28% of structures have a triplet distance t below 1.5 micron.
The cumulative probabilities of triplet distances were measured by FISH experiments for all four
inter-chromosomal triplet loci (cen 1-9-21, cen 2-3-6, cen 1-9-21 and far-1-9-21) (Figure 13B left
panel) and compared with the results predicted by our structure population (Figure 13B middle
panel). The model and experiment agree very well. Cen-1-9-21 shows consistently the highest
cumulative probabilities in the structure population in comparison to all other measured loci
combinations. In agreement with our predictions, triplet cen-2-3-6 shows the lowest cumulative
probabilities of triplet distances.
Next, we quantitatively compared the differences in triplet distance distributions between models
and experiment. We perform a pairwise comparison between cumulative triplet distance
distributions using the Wasserstein distance (WD), which measures the difference of area under
the curve between two the probability distributions. We calculated the Wasserstein distance for
each of the six possible pairs of cumulative triplet distance distributions. The right panel in Figure
3E plots the corresponding WD for each pair of triplet distributions in the models versus those in
the experimental results. Each point in the plot are colored by 2 corresponding triplet combinations.
For instance, the upper right point is representing the WD between triplet cen 1-9-21 and cen 2-
3-6. In the scatter plot, all the WD from experiment and model lie on a straight line, with a
regression coefficient R = 0.9382 with F-statistic p-value = 1.4*10^-3 (Figure 13B right panel).
The high correlation between the WDs shows a very good agreement between model and
58
experiment. The linear regression model shows also substantially better agreement than our
previous lower resolution model. Our new structure population out-performs compared to our old
method, giving a more accurate prediction of centromere clustering frequencies (Figure 13B).
4.2.4.5 Distribution of intra-chromosomal pairwise distances between chromatin regions.
We now analyze the conformation of a chromosome by measuring pairwise distance distributions
and distributions of the radial positions of loci within the same chromosome. We found that for
chromosome 6 some genomic loci that are separated at far genomic distances are closer in 3D
spatial distance than those separated at shorter genomic distance. This behavior is consistent
with a back-looping phenomenon. We choose 3 representative loci on chromosome 6 as FISH
probes to validate the trend of this back looping (Figure 3F). Locus P1 (chr6 306,712 – 532,406)
is located on the p-arm of the chromosome, while both locus P2 (chr6 130,076,074 – 130,287,007),
and locus O (chr6 87,193,201 – 87,351,481) are located at the q-arm of chromosome 6 (Figure
13C left panel).
We also measured the distance for each pair of loci in FISH data and from models. Even though
the genomic distance between loci P1 and P2 is substantially larger that the sequence distance
between loci O and P1, the 3D distance between loci P1-P2 is significantly smaller than the
distance between loci P1-O in our models (Figure 13C right panel). The closest distance among
all pairs of loci is P2-O (p-values < 5.0e-5, one-sided Wilcoxon tests). This is consistent with a
backfolding event that brings loci P1 and P2 closer together than P1 –O.
These findings are well reproduced in experiment. Indeed, distances between O-P1 measured in
3D FISH are significantly shorter than distances between P1-P2, while distances between loci O-
P2 are closest among all three pairs of loci (Figure 13C right panel) (p-values < 2.2e-16, one-
sided Wilcoxon test). Also, individual FISH images often show a triangular pattern for the locations
of the 3 loci (Figure 13C left).
59
4.2.4.6 Distribution of radial positions of chromatin regions
We extracted 3D locations for the same loci from the models and found their spatial relationships
matched very well with the experiment data.
Next, we analyzed the radial positions of each locus in FISH experiments. Interestingly locus O
tends to be located further outside towards the nuclear envelop in comparison to P1 and P2,
which can partially explain the relatively smaller distance between P1-P2 despite their large
sequence distance. Among the three loci P1 tends to be the most inner one. These observations
are nicely captured in the model population (Figure 13C middle).
4.2.5 Results: The spatial organization of the human genome and its structure-
function correlations
4.2.5.1 Relationship between genome structure and DNA replication timing
The predicted average radial positions of chromatin are highly reproducible in replicate
calculations and differ substantially between individual domains with a range from 0.3 to 0.9.
These differences in radial positions are likely a result of functional segregation between active
and inactive chromatin and known to be relevant to their function. For instance, early replication
timing is linked to chromatin preferentially located in the interior regions of the nucleus. Topological
associated domains act as stable units of replication timing regulation
2
and therefore we used
Repli-seq data to classify each chromatin domain into one of six DNA synthesis phases of cell
division (i.e., G1b, S1, S2, S3, S4, G2, see methods). We then calculated the nuclear localization
probabilities for chromatin domains in each replication class from our models (Figure 14A).
Chromatin domains with the earliest replication timing in G1b phase show significantly smaller
radial positions in our structures than domains in any other phases (p-value < 2.2e-16) (Figure
14A). Interestingly, we observe a gradual increase in the average radial positions with progressing
replication timing phases, confirming experimental observations that replication tends to start in
the more central regions of the nucleus and gradually expands towards the periphery. This result
is also confirmed when plotting the replication-timing wave signal against the average radial
60
positions
2
. Domains with the smaller radial positions (average radial position range from 0.4 to
0.5) show the highest replication-time wave signal, while those with the largest radial positions
show the lowest wave signal (Figure 14B).
Figure 14. Locational preferences for different chromatin regions.
61
(A) Average radial position of TADs in different replication phase. (B) Correlation
of replication timing signal and TAD spatial radial position. Replication starts from
regions in the inner shell and spread towards both the center and periphery. (C)
Boxplot of average radial position of TADs in different subcompartment states. (D)
We divide the whole nuclear volume into 5 concentric shells with equal number of
TADs based on their radial positions. We compare the conditional state probability
𝑃𝑃 ( 𝑆𝑆 ℎ 𝐺𝐺 𝑒𝑒𝑒𝑒 | 𝑆𝑆 𝑡𝑡 𝑎𝑎 𝑡𝑡 𝐺𝐺 ) of all 5 subcompartment state TADs. (E) The distribution of state-
specific TADs within in a random 2D slice of the nucleus. The distributions are
normalized against the background TAD distribution. (F) Example 3D distributions
of different states of chromatin from 200kb resolution model shown in contour plot.
4.2.5.2 Structural organization and partitioning of chromatin subcompartments
We now analyze the formation of functional subcompartments. In a study by Rao et al., chromatin
in lymphoblastoid cells were classified into 6 functional subcompartments, two in a
transcriptionally active state (A1, A2) and three in a transcriptionally inactive state (B1, B2, B3).
An additional subcompartment (B4) is present only in chromosome 19. Chromatin in each
subcompartment shows distinct enrichment for different epigenetic factors: Chromatin in
subcompartment A1 has substantially higher enrichment at promoter and enhancer regions for
active epigenetic markers like H3K27ac, H3K4me3, H3K9ac, H3K79me2 in comparison to A2
chromatin
10
. A1 shows relatively high GC content and contains highly expressed genes, while A2
chromatin is enriched in longer genes. B1 is enriched in H3K27me3 and associated with
Polycomb-related facultative heterochromatin, while B2 is enriched in pericentromeric
heterochromatin and Nucleoli Associated Domains (NADs). B3 chromatin is depleted in NADs
and enriched in lamina associated domains (LADs)
10
. In our model at TAD resolution, a chromatin
domain is assigned to one of the six subcompartments if at least 50% of its constituent chromatin
(Table 3) were assigned to a single subcompartment class (See Additional Methods), while
remaining domains were not assigned to any class (i.e, class N/A). 98% of all TADs (2190 TADs)
were assigned to a single subcompartment, whereas only 39 TADs remained unassigned. At
200kb resolution 89% of autosome domains were assigned to a single subcompartment while
only 10% of 200kb chromatin regions were unassigned.
62
We observe distinct differences in the nuclear positions of subcompartments. The A1
subcompartment shows the lowest average radial position, while the B3 subcompartment is most
peripheral (Figure 14CD). We then partitioned the nucleus into 5 radial shells, such that the
number of domains is equally divided between the shells. Based on 10,000 structures, we then
calculate the probability for a chromatin region of a given subcompartment to be located in any of
the five shells (Figure 12B See methods).
State A1 A2 B1 B2 B3 B4 NA ChrX
Number of TADs 457
(19.7%)
568
(24.5%)
297
(12.8%)
306
(13.2%)
553
(23.8%)
9
(0.4%)
39
(1.7%)
91
(3.9%)
Number of dip-
TADs
45
(39.1%)
34
(29.6%)
11
(9.6%)
11
(9.6%)
3
(2.6%)
0
(0%)
5
(4.3%)
6
(5.2%)
Table 3. Number of TADs for different states.
Although subcompartments A1 and A2 are both in the active state, they show substantial
differences in their shell probability profiles. The probability of A1 chromatin is highest in the most
inner shell (Figure 14D) and is gradually decreasing towards the outermost shell. In contrast, A2
is distributed more equally throughout the nucleus, with a sizeable probability for regions close to
the nuclear envelope (NE) (Figure 14D), The largest probabilities are observed for intermediate
shells 3 and 4. These differences are further illustrated when plotting 2D histograms of chromatin
distributions through the central plane of the nucleus (Additional Methods): Whereas A1 shows a
single maximum towards the nuclear center, A2 chromatin shows ring shaped maxima at an
intermediate range and slightly declining values towards both the central and outer nuclear
regions. The increased probability towards the interior regions agrees with the higher
63
transcriptional activity of A1 chromatin in comparison to A2
126
. We also observe striking
differences for subcompartments in the inactive states. B3 chromatin shows the largest average
radial position of all subcompartments (Figure 14DEF). The highest probability is at the outermost
shell and gradually decreases towards the nuclear interior (Figure 14D). The 2D distribution
shows ring shaped maxima at the nuclear periphery. Indeed, based on lamina DamID
experiments 79.9% of B3 chromatin domains consist of lamina-associated domains (LADs), which
localize at the nuclear periphery. In our models, a TAD domain in the B3 subcompartment has a
2.6-fold higher probability to be located in the outer most shell than a TAD in the B1
subcompartment. In our 200kb resolution model we see the same trend and the probability of
finding B3 chromatin in the outer most shell is 7.2-fold higher than chromatin of same genomic
length in B1 subcompartment state. Moreover, almost 56% of all the B3 TAD domains and 60%
of B3 regions in 200kb resolution in an individual structure are occupying the two outer most shells
in the nucleus at any given time. Interestingly, chromatin in the B1 subcompartment shows an
entirely opposing localization behavior. B1 shows the highest shell probability at the innermost
shell, with a gradual decrease towards the outer shells - a localization profile reminiscent to that
of the A1 subcompartment. On average about 50% of B1 chromatin per structure are located in
the two innermost shells and the 2D localization distribution show a maximum towards the inner
regions of the nucleus. The B2 subcompartment is more evenly distributed across the entire
nuclear space, but with an enrichment and slight increase in localization probability in the central
regions of the nucleus. This increase may be a result of sub-centromere and nucleolus related
clusters (discussed in later sections).
64
Figure 15. Structure-based chromatin interaction network analysis of
subcompartment states.
65
We convert the TADs from 3D space (A) into undirected graph which we call as
TAD interaction network (TIN)(B). We color the TADs as the neighborhood
connectivity which is the average contacts of neighbor TADs. We then apply
Markov Clustering Algorithm (MCL) to find highly connected clusters in single
structures(C), which we call them as “Interaction subgraphs”. TADs in Interaction
subgraphs are colored based on their chromosome id. (D) Examples of the 3D
contour plot showing the spatial location and size of top 50 largest subgraphs. We
refer these subgraphs as “chromatin bodies” as a result of spatial sub-
compartmentalization. (E) Examples of 3D contour plot of chromatin bodies in
different states. A1 and B1 bodies are often close to each other in space, while A2
and B3 are often attached. The right 2 figures show the spatial arrangement of
different subcompartment chromatin bodies. Panel (ABC) are drawn from the
same one structure example from 10,000 TAD resolution model population. 3D
contour plots in (DE) are drawn from same structure from 10,000 200kb resolution
model population.
To study the level of fragmentation of subcompartments into individual spatial units we represent
each genome structure as a chromatin interaction network (TIN), where each domain is a node
and an edge between two nodes is drawn if the corresponding spheres are in direct physical
contact in the 3D structure (See additional methods). A TIN is constructed separately for each
subcompartment, and for each of the 10,000 structures in the population (Figure 15AB). The
single cell network organization distinguishes the different subcompartments. The A1 networks
show significantly higher node connectivity/average degree in comparison to all other chromatin
networks (Figure 16A) (Wilcoxon test p-value < 2.2e-16, both TAD and 200kb resolution). A1 also
shows the highest number of maximal cliques (Figure 16B) Wilcoxon test p-value < 2.2e-16, both
TAD and 200kb resolution), which are defined as fully connected subnetworks. The number of
maximal cliques in a subcompartment network describes its tendency to form highly connected
chromatin clusters in single cells. These findings point to a substantial spatial segregation of A1
chromatin into relatively large cohesive groups of nuclear bodies containing chromatin
interactions preferentially formed to other A1 regions and with relatively little mixing with other
subcompartments. We further probe the spatial fragmentation of the A1 subcompartment by
applying the Markov Clustering Method (MCL) to localize the number, size and spatial positions
66
of highly connected subgraphs in each single cell TIN network
127
(Figure 15C). On average each
structure contains about 253 connected subgraphs in TAD TINs and 763 subgraphs in 200kb TINs.
Given the fact that Markov clustering methods would give small clusters if some nodes have
relatively low connectivity, we are conducting all analysis based on top 75% largest subgraphs.
Under that criteria any TAD subgraph smaller than 3 nodes or any 200kb subgraphs smaller than
7 nodes (1.4 Mb) are filtered. Figures 13E shows examples of the size and nuclear positions of
chromatin in connected subgraphs, which are visualized by a contour plot of their density maps.
A1 networks show the largest subgraphs among all subcompartments (Figure 16D). The average
size for the 75% largest subgraphs is 30Mb (~25 TAD domains) in TAD resolution, in comparison
to only 20Mb (~17 nodes) for the A2 subcompartment. The chance of a TAD in the A1
subcompartment to be part of a highly connected subgraph with more than 50 nodes is about five
times (4.6) larger than a TAD in the A2 subcompartment (Figure 16C right) and almost 23 times
larger than a TAD in the B1 subcompartment. Interestingly, A1 chromatin that is often part of larger
subgraphs shows higher transcriptional activity than A1 chromatin that is often part of smaller
subgraphs (Figure 16H), shown by the average Gro-seq signal in top 25% largest (big) subgraphs
and top 25% smallest (small) subgraphs. This observation points to a functional relevance of
subgraph size and connectivity. To test this hypothesis further we quantified the Gro-seq signal
for chromatin with respect to the spatial distance to the cluster centers (Figure 16I). Specifically,
we determined the nuclear locations of each cluster center, divided all chromatin into increasing
concentric shells from the cluster center and averaged the nascent RNA expression from GRO-
Seq experiments for chromatin in each shell. Note that this measure only relies on the geometric
position of a subgraph center. The Gro-seq signal is calculated from all the chromatin in a shell
irrespective of their subcompartment assignment. Indeed, for A1 subgraph locations, the
transcriptional signal is highest towards the subgraph centers and declines with increasing
distance from the center, which describes a transcriptional active domain in the nucleus (Figure
16I). At A2 subgraph locations the signal is substantially weaker consistent with overall smaller
subgraphs and lower transcriptional activity (Figure 16H). Overall, only A1 subcompartments
67
have extended cohesive chromatin bodies with high levels of RNA expression (Figure 16I).
Interestingly, subgraphs (>= 3 nodes) in the A1 subcompartment are most often composed from
multiple chromosomes: Almost 70% of large A1 subgraphs are formed inter-chromosomally from
at least 2 chromosomes and on average are combined from regions of 3 chromosomes, in
comparison to ~2 chromosomes for large subgraphs in the A2 subcompartment. The difference
is even more obvious where subgraphs (>=7 nodes) are on average formed by more than 4
chromosomes, in comparison to 3 chromosomes in A2 subcompartment.
Overall, A1 subgraphs are hubs of the highest transcriptional activity, are formed from multiple
chromosomes and located towards the interior regions of the nucleus. These attributes are
reminiscent to chromatin at nuclear Speckles, also known as inter-chromosomal granule clusters,
which are nuclear bodies enriched for splicing factors, pre-mRNAs and small nuclear
ribonucleoproteins (snRNPs). Nuclear speckles behave as dynamic phase-separated bodies
128–
130
and have been proposed to act as storage sites for RNA-processing components
131
and
transcriptional hubs for some active genes
132–134
. Nuclear speckles are excluded from the nuclear
periphery and enriched toward the nuclear center
135
. The Belmont laboratory recently developed
TSA-seq, a genome-wide mapping method for estimating cytological distances of chromosome
loci relative to particular nuclear compartments. The method uses tyramide signal amplification
(TSA) based on an antibody-coupled peroxidase enzyme (HRP), to generate biotin-tyramide free-
radicals; these free radicals then diffuse and react to label DNA uniformly across the genome,
which in turn can be isolated and sequenced. To label DNA associated with nuclear speckles they
used HRP-coupled antibodies against the protein SON, which is a highly specific marker for
nuclear speckles. SON is only present in speckles at high concentrations, and therefore, SON-
TSA produces a gradient of diffusible free radicals at the speckle location for distance-dependent
biotin labelling of DNA. The diffusion and steady state concentration of tyramide free-radicals after
the TSA reaction depends on the distance to the speckle and can be modeled as an exponential
decay function. To test if A1 subgraph clusters are consistent with chromatin at nuclear speckles
we simulate the TSA-seq signal from the models by assuming that the geometric centers of the
68
subgraphs are speckle locations and are the point source of SON-TSA-produced tyramide free-
radical diffusion and compare the resulting genome-wide TSA-seq profiles to the experimental
data. Indeed, we observe that the locations of A1 subgraph centers predict remarkably well the
TSA-seq data and hence the positions of nuclear speckles. Indeed, the resulting genome-wide
TSA-seq profiles show a good correlation with the experimental data (Spearman correlation 0.64,
p-value < 2.2e-16) (Figure 16J). The simulated TSA-seq data is a direct measure for the quality
of genome-wide chromatin positioning in the nucleus and their positions relative to nuclear
speckles, irrespective of their subcompartment assignments. For instance, Figure 16K shows the
simulated and experimental TSA-seq profile for chromosome X. Chromosome X does not contain
any chromatin of the A1 subcompartment and the good agreement between the simulated and
experimental TSA-seq profile is solely due to the positioning of chromosome-X regions relative to
nuclear speckles. To assess our findings further, we divided the chromatin around a subgraph
center into concentric shells and averaged the experimental TSA-seq signal for chromatin in each
shell, irrespective of its subcompartment assignment. We observe dramatically enhanced
experimental TSA-seq signal for chromatin most often located in the A1 subgraphs’ centers with
a decaying signal towards the periphery of the subgraph. Despite being in a transcriptionally
active chromatin state, A2 subgraphs do not show this specific behavior: A2 subgraphs do not
have increased TSA-seq signal at the cluster centers and hence do not show any decay of TSA-
seq signal with increasing distance from the cluster centers. Indeed, concentric shells around A2
subgraph centers are depleted of TSA-seq signal. These observations indicate that A1 subgraphs
are nuclear bodies likely connected to nuclear speckles, while A2 subgraphs, which are
transcriptionally active do not share any spatial similarities with nuclear speckles. Overall, the
single cell A2 network organization is remarkably different from that of A1. A2 networks show
significantly lower neighborhood connectivity and average edge degree than A1 networks and
overall have the lowest maximal clique enrichment among networks of all subcompartments. Also,
A2 networks have a significantly larger number of smaller subgraphs, many of those formed
dominantly by a higher fraction of intra-chromosomal interactions rather than inter-chromosomally
69
as in A1. On average a TAD in A2 has an almost 5-fold lower probability to be in a very large
subgraph with more than 50 nodes in comparison to A1 TADs (Figure 16C). On average a A2
subgraph has a size of 17 nodes (30 Mb) in TAD resolution and 33 (8 Mb) nodes in 200kb
resolution. These observations show that the A2 subcompartment is substantially more
fragmented into a larger number of smaller nuclear bodies in comparison to A1.
Figure 16. Spatial properties of subgraphs in chromatin interaction network.
70
(A) The comparison of average neighborhood connectivity within each chromatin
state across all structures in the population. Average neighborhood connectivity of
a node is defined as the average number of edge connections for all its neighboring
nodes. (B) The number of maximal cliques enrichment of different
subcompartment states. (C) The probability of a node to be in certain size of the
subgraph. The area under the curve of the shaded area is plotted in the right panel,
which represents the probability of a TAD to be in a big subgraph that is larger than
50. (D) A violin plot showing the distribution of subgraph sizes larger than 3 (25%
percentile) in TAD resolution models. (E) Distribution of the number of subgraphs
larger than 3 in each structure. (F) The average fraction of inter-chromosomal
edges with in any subgraph in different states. Error bar represents the standard
deviation in a population of 10,000 TAD structures. Grey dashed line shows the
average fraction of interchromosomal edges considering all subgraphs. (G) The
subcompartment states composition of TADs in the neighborhood (500nm range)
of different state-TADs. (H) The average Gro-Seq signal in big (25% largest) and
small (25% smallest) subgraphs in different states. A1 state shows the most
difference in big and small subgraphs. Error bar represents the standard deviation
in 10,000 TAD structures. (I) Average Gro-seq and TSA-seq signal for chromatin
with respect to the spatial distance to the spatial cluster centers. Nuclear locations
of each cluster center are determined and chromatin is divided into increasing
concentric shells from the cluster center. The nascent RNA expression from GRO-
Seq experiments and tyramide signal amplification signal (TSA-seq) are calculated
for chromatin in each shell. (J) 2D density plot showing the relationship between
raw TSA-seq signal and predicted TSA-seq signal. (K) A genome track view of
TSA-seq signal and predicted signal of first 47Mb of chromosome X p-arm.
Moreover, A2 bodies are dominated by cis interactions and their formation is dominated by
chromatin regions from the same chromosome. The fraction of interchromosomal edges in a A2
subgraph (>= 3TADs) is almost 1.3-fold lower than in those subgraphs in A1 state (1.6-fold in
200kb resolution). These features are consistent with the observed overall wider range of nuclear
locations for A2 chromatin bodies with almost similar probabilities to be located across inner and
sub-outer shells (Figure 14D). As described before, A2 bodies are sites of transcriptional activity
with an increased the Gro-seq expression signal for A2 subgraphs towards A2 subgraph centers.
However, the Gro-seq expression is substantially lower than those of A1.
Our observations paint a clear picture about the structural differences between the A1 and A2
subcompartments in nuclear location and network organization. The observed higher subgraph
71
connectivity of A1 entails a higher functional homogeneity and more uniform/cohesive
composition (homogeneous/uniform) of A1 chromatin bodies, and together with their larger sizes
and higher gene density may favor enhanced transcriptional bursting of genes in A1 bodies at
nuclear speckles.
Although the B1 subcompartment is transcriptionally inactive, it shares a very similar localization
profile to the transcriptionally active A1 subcompartment and preferentially localizes towards the
interior shells of the nucleus (Figure 12C). However, the single cell network organization of B1 is
significantly different to A1 and most other subcompartments. B1 has the lowest neighborhood
connectivity and together with A2 the lowest maximal cliques enrichment (MCE) among all
subcompartments (Wilcoxon test p-value < 2.2e-16) (Figure 14AB). Subsequently, the B1
network is highly fragmented into a large number of relatively small subgraph clusters (Figure
14D) and only 2% of B1 TADs forms subgraph larger than 50 (Figure 14C). On average, each
cell contains around 213 subgraph clusters larger than 3 nodes and B1 TAD forms around 42
(19.7%) TAD subgraphs which is higher than A1 TADs (Figure 14E). Similar to A1, B1 subgraphs
are often formed by trans interactions. 35.5% percent of all edges in a B1 subgraph are
interchromosomal, which is a 2.2-fold higher than B3 and 1.3-fold larger than A2
subcompartments.
Overall B1 is fragmented into many but relatively small chromatin bodies. The relatively low clique
enrichment points to a state that is less cohesive in its composition than for instance A1. To
analyze the spatial composition of B1 chromatin bodies further we calculated the chromatin state
enrichment in the surrounding neighborhood of each chromatin domain in a given state
(neighborhood enrichment). For each domain we selected a surrounding spherical volume with
radius of 500nm from the domain center and calculated the fraction of chromatin in a given
subcompartment in this neighborhood region (Figure 12E). We observe that B1 chromatin has a
high neighborhood association with the A1 subcompartment, A1 chromatin is significantly more
often found in the surroundings of B1 bodies than expected by random chance and is almost as
72
likely present in the immediate neighborhood as B1 chromatin itself (Figure 13E). At further
inspection, it is apparent that the chromatin bodies formed by the B1 subcompartment
preferentially associate and even surround/cover A1 chromatin bodies, which are likely
associated with nuclear speckles. Based on the spatial association of B1 with A1 chromatin bodies
we would expect an enhanced TSA-seq signal for B1 in comparison to all other inactive states or
A2. Indeed, B1 chromatin shows enriched TSA-seq signal, which confirms the spatial proximity
of B1 with A1 subgraphs (Figure 14H). We then divided the chromatin around B1 subgraph
centers into concentric shells and averaged the experimental TSA-seq as well as Gro-Seq signal
for chromatin in each shell. The low Gro-seq signal across concentric shells clearly shows the
repressive state of B1 chromatin bodies (Figure 14H), while at the same time the increased TSA-
seq signal shows its spatial association with nuclear speckles. However, the TSA-seq signal is
significantly lower than A1 subgraphs and in contrast to A1 remains constant over a wider range
of distances from the subgraph center. B1 chromatin is enriched in H3K27me3 methylation and
associated with polycomb repressive complexes (PRCs), which are transcriptional repressors
responsible for cell development, cell fate specification and cell-identity maintenance. It is
therefore likely that B1 chromatin bodies in our models correspond to PcG bodies. It is interesting
that the transcriptionally repressed B1 chromatin bodies are localized in proximity to A1
associated nuclear speckles. During cell development activation of PcG repressed genes will lead
to activation of developmental pathways. The proximity to nuclear speckles will ensure a fast and
strong expression response upon gene activation. On the other hand, the proximity to nuclear
speckles demands a tightly regulated repressive state, which is especially critical for stem cells
so that their potential to differentiate is tightly controlled to prevent unwanted activation of cell fate
genes.
The B3 subcompartment is mainly composed of heterochromatic regions, specifically, it contains
lamina-associated domains. In contrast to B1, the B3 subcompartment forms relatively large
subgraphs (Figure 14D). Average size of B3 chromatin domain subgraphs is about 26 TAD nodes,
which is 2-fold higher than B1 state subgraphs (1.8-fold higher in 200kb resolution). In contrast to
73
B1 and also A1, (the latter also forms relatively large subgraphs), B3 chromatin bodies are mostly
composed of intra-chromosomal contacts (Figure 16F): The fraction of inter-chromosomal edges
in subgraphs is at least 2.1-fold smaller than those of A1 or B1 subcompartments. The fact that
B3 chromatin forms almost exclusively large subcompartment clusters, which are mostly formed
by local interactions might be an intrinsic property of heterochromatin compaction and also can
be explained by the preferred location of B3 subcompartment at the nuclear periphery, which
exposes the chromatin to lamina at the NE rather than other chromosome regions. The spatial
composition of B3 chromatin bodies are highly homogenous as shown by the analysis of the
surrounding neighborhood of each B3 domain. B3 shows the highest neighborhood composition
within the 500nm range, compared to any subcompartment. (Figure 16G). Overall, B3 forms large
cohesive chromatin bodies mainly localized at the nuclear periphery. Interestingly, although B3 is
highly homogenous we noticed that the A2 subcompartment shows a relatively high association
to be in spatial proximity to the B3 subcompartment. In other words, A2 bodies are more often in
the neighborhood of B3 bodies than any other subcompartment. This observation raises an
interesting question, namely if there are two different mechanisms to silence A1 and A2 chromatin
during cell development and upon functional modulation of cells. A1 shows preferred association
to B1 polycomb repressed chromatin, while A2 may be preferentially silenced by association and
possible tethering to B3-lamina associated domains. However, this assumption is speculative at
this point.
Our models reveal an organizational principle that links conformations of individual chromosomes
with emerging subcompartment partitions at the nuclear scale (Figure 17A). In many
chromosomes, we observe a similar characteristic folding path, leading to a polarized
chromosome orientation, which bridges B3 chromatin bodies at the outset of the nucleus with
A1/B1 chromatin bodies located towards the nuclear center (Figure 17B). Equivalent
arrangements in many chromosomes then constitute multi-layered functional phases, which
aggregate chromatin of similar functional types from different chromosomes into subcompartment
bodies. Chromatin bodies that are formed mostly intra-chromosomal are located towards the outer
74
regions of the nucleus (B3 and A2), while chromatin bodies formed from multiple chromosomes
aggregate towards the central regions of the nucleus, which are zones of enriched trans
interactions. Typically, we observe A1 and B1 bodies located in the central nuclear zone with the
highest exposure to interchromosomal chromosome contacts (Figure 17C). This finding
demonstrates that chromosome organization goes beyond a simplistic previously suggested two-
layer model, in which individual chromosomes are compartmentalized into two subunits of active
and inactive chromatin. Chromatin bodies of the B2 subcompartment form clusters of
subcentromeric regions and nucleoli associated clusters and play a crucial role by forming site-
specific chromatin clusters. For instance, we observe the highest frequencies for specific
multivariate interactions for B2 chromatin bodies, while all other interchromosomal bodies are
formed with less sequence specific nature showing more similar probabilities of interactions
between all regions in the same subcompartment.
75
Figure 17. The spatial compartmentalization of human lymphoblastoid cell.
(A) An example view of spatial organization of subcompartment-state chromatin
from the equator of nucleus. A1 chromatins occupy the most inner region of
nucleus, covered by B1 chromatin. A2 contacts closely with B3 and separates A1,
B1 state chromatin, while B3 chromatin stay closely to the nuclear envelop. B2
contains region that are highly preferring to form inter-chromosomal clusters and
are associated with nucleoli. (B) An example showing the folding path of
chromosome 12. High Gro-Seq signal regions are show in 3D as red density
clouds. Active p-telomere of chr12 starts from the center of this high transcription
76
activity zone and folds the inactive region in q-arm towards the envelop, and folds
back in to the center region. (C) The folding structure of chromosome 6 and
chromosome 12 are shown here as an example of chromosome-chromosome
interactions bridged with Dip region. A1 chromatin forms large connected
chromatin bodies which resembles speckle-like properties. Dip locations, show in
red beads, bridge more interchromosomal contacts, potentially help proper folding
of the whole chromosome.
A1 A2 B1 B2 B3
Average radial position 0.54 0.63 0.57 0.60 0.68
Average neighborhood
connectivity
12.60 7.94 6.45 9.23 9.77
Maximal cliques enrichment 1.92 1.06 1.15 1.50 1.35
Average size of subgraphs
> 3
25.7 16.9 13.0 20.4 26.0
Average # of subgraphs > 3
34.9
(16.4%)
64.7
(30.4%)
41.9
(19.7%)
29.2
(13.7%)
42.3
(19.8%)
Average # of chromosome
in subgraphs > 3
3.18 2.56 2.99 3.04 2.32
Average fraction of inter-
chromosome edges in
33.2% 26.3% 35.5% 26.5% 16.4%
77
subgraphs > 3
Table 4. Population average of spatial and graph measurements in TAD resolution
model.
A1 A2 B1 B2 B3
Average radial position 0.52 0.64 0.57 0.68 0.76
Average neighborhood
connectivity
25.97 12.15 12.87 13.55 15.63
Maximal cliques enrichment 5.98 1.63 2.22 2.16 1.70
Average size of subgraphs
> 7
70.9 33.0 33.3 37.7 59.0
Average # of subgraphs > 7
53.9
(9.7%)
159.2
(28.6%)
91.6
(16.5%)
109.8
(19.7%)
141.8
(25.5%)
Average # of chromosome
in subgraphs > 7
4.37 3.06 3.70 2.38 2.25
Average fraction of inter-
chromosome edges in
subgraphs > 7
41.5% 25.5% 35.3% 14.3% 9.6%
Table 5. Population average of spatial and graph measurements in 200kb
resolution model.
4.2.5.3 Hot-spots of multivariate interchromosomal interactions
As shown earlier, the average radial locations of some chromatin regions are substantially lower
(towards the inner regions of the nucleus) in comparison to their immediate neighbors in the same
78
chromosome. This behavior leads to a characteristic local minimum (i.e., a “dip”) when plotting
the average radial positions of each region in a chromosome. By using a local minimum detection
algorithm, we identify 115 such TADs and denote them as “dip-TADs” (Figure 18A). These dips
are highly reproducible in TAD and 200kb resolution models and we validated these dip-TADs
with experimental lamina contact frequencies (CF)
73
. Strikingly, almost no lamina-chromatin
contacts are found for chromatin regions around the center of these dip-TADs, confirming their
more central locations away from the nuclear lamina (Figure 18B upper left). Dip-TADs also
show distinct epigenetic features. They are substantially enriched in markers of actively
transcribed genes, for instance, H3K4me and H3K9ac, which indicates active transcription
opposed to poised activation. Markers of inactive genes such as H3K27me3 are depleted at dip-
TADs (Figure 18B). Overall almost 60% of dip-TADs are part of the active subcompartments.
TADs of the A1 subcompartment are more often dip-TADs than would be expected by random
chance (39.1% of dip-TADs are in A1 state while 19.7% of TADs are in A1 state genome-wide).
Domains in the A2 subcompartment are observed at the expected probability (29.6% in dip-TADs
and 24.5% of all TADs in the genome). TADs of the B1 and B3 subcompartment are significantly
depleted in dip-TADs. Interestingly, about 9.6% of dip-TADs are part of the B2 subcompartment,
even though the domains of the inactive subcompartments are usually underrepresented (Table
6).
79
Figure 18. Dip TADs and their properties.
(A) Dip-TADs on chromosome 2 and 4 are shown in red dots. Dip-TADs were
detected by a local minimum detection algorithm using the TAD average radial
position (Shown in blue line). (B Upper-left) Average lamina-contact frequency (CF)
(Kind et al 2015) along the 5Mb upstream and downstream dip regions. Regions
80
around 2Mb from the dip center show depletion of contact with nuclear lamina,
confirming our models that dip-TADs point inwards to the nucleus center. (B other)
Several histone transcriptionally active markers and RNA Pol-II ChIP-seq signal
around dip-TADs indicate enrichment at dip center. Orange lines are local
polynomial regression for actual data points (circles). Grey lines are the
background signal levels. (D) Number of surrounding TADs within 500 nm
neighborhood for dip and non-dip TADs in different subcompartment states. ‘***’
indicates the one side Wilcox-test significance. (D) The number of long-range dip-
TADs in the 3D neighborhood for each TAD (the self-chromosome is excluded;
neighborhood TADs are within 500nm distance away). Red points represent the
location of dip TADs, showing dip-TADs have more interaction with other dip-TADs.
The number of long-range dip-TADs is normalized by the bead (TAD) surface area
to reduce the bias introduced from TAD’s size. (E) The Inter-chromosomal portion
(ICP) of each TAD along the genome. Red points represent the location of dip
TADs, showing dip-TADs have more inter-chromosome contacts. (F) The average
subgraph size of each TAD in 10,000 genome structures. Red points represent the
location of dip TADs showing dips in large A1 subgraphs.
To study the interaction preferences of dip-TADs, we then examined the spatial neighborhood of
each TAD in 3D space. For every dip-TAD in each structure, we determine the local neighborhood
as the set of all TADs within a distance of 500 nm. Interestingly, dip-TADs have a significantly
larger number of TADs in their local neighborhood than non-dip TADs, which are TADs that are
separated by at least 3 TADs from a dip TAD (p-value=2.113e-14 Wilcox-test). These findings can
also be observed from comparison of dip-TADs within subcompartment states A1, A2, B1 and B3
(Figure 18C, p-values < 2.2e-16, one-side Wilcox-test). For instance, A1 dip-TADs on average
have 113.4 TADs within a distance of 500nm, compare to 99.9 TADs in A1 non-dip-TADs. Indeed,
the fraction of inter-chromosomal contacts (i.e. the inter-chromosomal contact portion (ICP)) of
dip-TADs is significantly higher in comparison to non-dip TADs in the same chromosome (except
chr1 and 19) and also in comparison to the immediate TAD neighbors in sequence (Figure 18E).
More interestingly, although dip-TADs are generally far apart in sequence or are located on
different chromosomes, they show a tendency to cluster with each other in 3D space. Dip-TADs
have more dip TADs in the local neighborhood per 1µm
2
surface area of dip-TADs (Figure 18D).
On average each dip-TAD have 12.4 other dip-TADs in the local neighborhood per surface area
81
(3.3-fold higher than non-dip TADs), even though dip-TADs constitute only 5% of all TADs. These
findings can be further confirmed by their enrichment in big subcompartment subgraphs described
above. On average 24% of dip-TADs are found in subgraphs larger than 50 TADs, which is 2.4-
fold higher than non-dip TADs. Even among domains in the A1 subcompartment, A1 dip-TADs
show a 1.7-fold enrichment in large subgraphs compared to all other non-dip A1 TADs (Figure
18F).
Therefore, dip-TADs are not only enriched in inter-chromosomal interactions but are also
significantly enriched for interactions with other dip-TADs. These observations indicate that dip-
TADs act as hubs for interchromosomal interactions and the formation of dip-TAD clusters and
therefore dip-TADs might play a specific role in the global structural organization of the genome.
Dominant state Number of
clusters
Average
cluster
frequency
Average number of
chromosomes in a
cluster
The fraction of inter-
chromosomal
clusters
A1 52 5.32% 1.69 44.2%
A2 45 4.54% 1.07 6.7%
B1 8 2.46% 1.38 37.5%
B2 50 9.51% 4.14 100%
B3 0 - - -
NA 42 6.61% 1.48 31.0%
Subcentromere 52 (88% B2) 9.43% 4.02 100%
Subtelomere 24 (71% A1) 4.16% 2.38 79.2%
82
NAD 64 (78% B2) 9.50% 3.67 92.2%
Table 6. Summary of frequent dip-clusters
4.2.5.4 Frequent multivariate chromatin clusters and their functional relevance
We found that dip-TADs tend to collocate in space. We now explore if specific combinations of
some dip-TADs form frequently occurring multivariate interactions, i.e., higher order spatial
clusters. To detect frequently occurring clusters we applied our dense frequent cluster mining
approach as described previously
16
. We identified 197 dip-clusters that are found in at least 1%
(i.e. 100 structures) of the 10,000-structure population. Among those, 92 are inter-chromosomal
clusters, that is they contain dip-TADs from at least two different chromosomes (Figure 18C). This
number is relatively large, given the relatively lower contact frequencies that are generally
observed for inter-chromosomal contacts.
We first classify the 197 clusters according to their subcompartment composition. A cluster is
associated to a subcompartment if at least 50% of all cluster members are of the same state. We
found 52 clusters of the A1 subcompartment, 45 clusters of A2, 8 of B1 and 50 clusters of the B2
subcompartment (Table 6). The highest average cluster frequencies are observed for the B2 state.
Even though only 9.6% of dips are of B2 state, they form 50 (25.4%) out of the 197 frequent
clusters. 8 of the top 10 clusters with the highest frequencies are from the B2 subcompartment.
All frequent B2 clusters are inter-chromosomal in nature and formed from multiple chromosomes.
The largest clusters comprise up to 8 chromosomes. We recently described the role of
chromosomes-specific centromere clusters as a driving force in interchromosomal organization
in GM12878 cells
59
.
To better understand the role of centromeric and telomeric regions in high-frequent cluster
formation, we define clusters as sub-centromeric or sub-telomeric, if at least 50% of the cluster
members are formed from chromatin regions located within 5Mb from the centromere or telomere
position, respectively. 52 out of all 192 clusters are sub-centromeric and are mainly (88%)
83
associated with the B2 subcompartment (Table 6). This high fraction indicates that centromere
clusters are indeed the major factor in high frequency inter-chromosomal dip-TADs clusters. In
essence, we found 56.5% of all inter-chromosomal clusters are sub-centromeric (Table 6). In
contrast, sub-telomeric clusters are often composed of A1-dominant clusters but a high propensity
to involve multiple chromosomes, which agrees with the observation from image studies
136
.
In general, clusters of the A1 subcompartment show lower frequencies (5.3%±4.6%) in
comparison to those of the B2 state (9.5%±11.1%), which indicates that A1 cluster formation is
less driven by loci specific multivariate interactions, but rather by factors that are generally shared
by chromatin of the same subcompartment, reminiscent more of a phase transition model.
Figure 19. Frequent clusters identified among Dip-TADs.
84
(A) All dip cluster frequency grouped by different cluster major type. We classified
all Dip-TADs into 3 categories - “Centromere”, “Telomere” and “Domain”, where
“Domain” refers to TADs that are neither “Centromere” nor “Telomere”. We define
“major type” as 75% of the cluster member dips are the same type. Centromere
clusters are the most frequent clusters among the population. (B) Inter
chromosomal dip cluster frequency shows a majority of centromere cluster. Cluster
major type is defined the same as (A). (C) A total of 197 dip-clusters that exist in
at least 1% among 10,000 structure population. 92 dip-clusters are inter-
chromosomal clusters. (D) All 197 dip-cluster contacts shown in a circos plot. Inter-
chromosomal contacts are shown in blue lines, intra-chromosomal contacts are
shown in red lines. The chromosome bands are shown in the outer layer with
centromere marked as red. Nucleolus-associated chromatin domains (NADs)
(Nemeth et al. 2010) are shown in the inner circular band as blue. The nucleolar-
to-genomic ratios (Koningsbruggen et al. 2010) are shown in the middle circular
band, with black showing the region of ratio lower than 2 and red showing the
region of ratio higher than 2. (E) The fraction of NAD-Dominated clusters in
different cluster frequency range. We define NAD-Dominated clusters as those
have at least 50% of cluster member TADs are NAD related. (F) Two example dip
clusters. Left one shows the highest frequent cluster involves the human chr4 p-
telomere, chr9 centromere and chr20 centromere. Right one shows another high
frequent all-inter-chromosomal cluster which involves the centromere of chr4,
chr20 and chr21.
Finally, we noticed that many high frequent dip-clusters can be associated with the formation of
nucleoli, which are large nuclear compartments of ribosome biogenesis and ribosomal DNA
transcription. Nucleoli are formed by DNA in nucleolar organizing regions, which have been
identified through staining studies on the short arms of human acrocentric chromosomes 13, 14,
15, 21, 22. Moreover, high throughput sequencing has revealed human nucleolar-associated
domains (NADs), which are involved in nucleolus formation
137,138
.We noticed that chromatin in
high-frequency clusters, especially inter-chromosomal clusters, often overlap with the location of
NADs (Figure 19D). Around 60% of all 197 frequent clusters contain at least one NAD-related
TAD (i.e., a TAD with at least one NAD region in its 3 nearest neighbors TADs). We then define a
NAD dominant cluster if at least 50% of its cluster members are NAD related TADs (Figure 19E).
We identified 64 NAD-dominated cluster, which constitutes mainly by B2 clusters (Figure 19E).
Clusters with the highest frequencies (> 15% in the population) are exclusively NAD-dominated
85
clusters. Therefore, NAD-dominated frequent clusters are likely to be constituents of peri-
nucleolus chromatin. Interestingly, we observe large differences in frequencies of nucleoli
associated clusters, which indicates that propensity to form nucleoli differs between NAD regions.
Frequencies differ largely, with some specific NAD combinations showing very high frequencies
while others show only low frequencies or are never observed. This observation points to NAD
specific interactions patterns in the formation of nucleoli, which may differ between cell types. For
example, specific NAD clusters are often observed for NADs of chromosomes 4, 9, 14, 20, 21,
and 22, while NADs detected on all other chromosomes (including NADs on chromosomes 13
and 15) are rarely involved in frequent cluster formation. Figure 19F shows the top 2 most
frequent clusters, all of which associated with nucleoli. Interestingly, the highest frequent cluster
involves NADs located close to the p-telomere of chromosome 4, and the centromeres of
chromosome 9 and 20. The second highest frequent inter-chromosomal cluster involves the
centromeres of chromosomes 4, 20 and 21. These high frequency clusters both link the p-arm
terminal regions of chromosome 4 , with clusters of NORs on telomere and centromere dips,
which the phenomena that the p-arm of chromosome 4 is located on average substantially
more interior than the q-arm of chromosome 4, which has been experimentally described
138,139
.
In summary, we observe more than 190 frequent clusters of dip-TADs, which occur with a
frequency of at least 1%. This number is large, given the generally low probability of especially
inter-chromosomal interactions. The cluster with the highest frequencies can be associated with
distinct nuclear bodies, namely centromere clusters, telomer clusters and the formation of
chromosome-specific nucleoli.
4.2.6 Structure-based TAD calling
Our 3D genome structures at 200kb resolution are able to detect topological domains by
identifying increased local compaction of the chromatin fiber, which resemble structural domains.
The volume occupied by a continuous chromosome region provides information about the
compaction of chromatin into structural domains.
86
To estimate the local chromatin compaction across the entire genome in each individual cell we
used a sliding window approach and calculated for each consecutive genomic region the running
radius of gyration (RG) for a continuous 1Mb chromatin region that is centered at the target locus
(+/-500kb up and downstream of the target locus). High RG values correspond to more extended
shapes of the chromatin fiber, whereas a low RG value means that the local structure for the 1Mb
region is more compacted and condensed. Interestingly, we observe pronounced peaks and
valleys in the corresponding RG profiles of single cells. The regions between two peaks
correspond to domain-like structure in the models (Figure 20 AB). To test if these structure-
derived peaks agree with TAD borders from ensemble Hi-C maps, we used several TAD calling
programs, namely TopDom
123
, HiCseg
140
, InsulationScore
141
and TADbit
108
, to detect TADs from
the ensemble Hi-C map. We find a good agreement between the RG peaks in our structures and
TAD boundaries detected by TAD callers (Figure 20D).
Figure 20. Structure-based TAD calling.
87
(A) A 6-Mb example showing the TAD calling result on chromosome 4. The 6 Mb
region on chromosome 4 starts at 80Mb position. In situ Hi-C data are plotted as
contact matrix, and the RG peak frequencies are plotted on the top. In total 5 TADs
are called using TopDom. TAD borders are shown using red-dashed-line. High RG
peak frequency loci are also shown using grey-dashed-line which potentially be
the TAD border identified using structure. (B) 2 example structure showing the
potential folding of chromatin region shown in (A). (C) Average RG peak for loci
around a TAD border. TAD borders are detected using TopDom software on 100kb
raw contact matrix. Around 50% of structures show RG peak in the immediate
neighbor of an RG peak and around 70% structures have RG peak within 400kb
range. (D) Boxplot showing the frequencies to find RG peaks in region contains
TAD border. For example, for all 200kb region contains a TAD border, 53.4%
containing a border from only one of the 4 methods, with a probability of 14% to
find an RG peak. For a 200kb region that is identified as TAD border in all 4
methods have a probability of 22% to find an RG peak.
For instance, at positions of TAD borders predicted by T opDom, we observe a clear maximum for
the probability of observing a peak in the RG profile. In contrast, when selecting random TAD
borders instead the probability for finding RG peaks in the structures shows a flat distribution
function with no maximum. Interestingly, since we predict TAD borders at single cell level we can
now analyze the cell-to-cell variability of TAD formation in the population of structures. On average,
about 20% of structures show an RG peak at the exact TAD border, and about 50% of structures
show an RG peak in the immediate neighboring region (Figure 20C). These findings agree well
with a recent paper from the Xiaowei Zhuang laboratory, which used STORM super-resolution
imaging to show that TAD-like and sub-TAD like chromatin organizations are present in cells in
individual cells with large cell-to-cell variations
142
. Usually a TAD border is observed in 10-20% of
the images. In addition to TopDom, we also assessed our findings with three additional
aforementioned TAD calling algorithms, TADbit, HiCseg and InsulationScore. All four TAD calling
algorithms show similar results. The frequency of RG peak lies at TAD boundaries ranges from
18%-19%, and the frequency to find RG peak within +/- 200kb of TAD boundaries ranges from
44%-50% in these 4 TAD-calling methods. Ensemble TAD calling methods generally show
variable result depending on the data set and quality of the data. Overall only around 8.4% of TAD
88
borders are detected in all 4 methods, while 53.4% are observed in only one of the four methods
ensembles (Figure 20D). Interestingly, the frequency for observing an RG peak in the structure
population increases when the corresponding TAD border is detected in more than one calling
method. For instance, the probability of observing an RG peak at the position of a TAD border is
1.6 times larger when the border is detected in all four methods (P=22%) in comparison to a
border that is detected in only one out three methods (P=14%).
We also detect a link between the local structure organization of a chromatin region (i.e. into a
topological domains) and its propensity to form long-range and trans interactions. For instance,
the top 10% chromatin regions with the highest RG peak frequencies (and therefore highest
probability to form stable TAD borders) show significantly larger amount of long-range (> 1Mb)
and trans chromatin interactions than the 10% chromatin regions that show the lowest RG peak
frequencies. Interestingly, this observation holds true also at the single cell level. A specific TAD
border is typically present only in a fraction of cells. We now analyze changes in the chromatin
interaction behavior between cells that contain a specific TAD border and those in which the same
border disappears, and the former border region becomes part of a larger domain structure. We
find that the amount of long-range and inter-chromosomal interactions at the border region is
increased on average by a factor of 1.3 in those structures that contain the TAD border in
comparison to those that lack the border (One side pairwise T-test p-value < 2.2e-16). This
observation shows a direct link between changes in the local structure organization of TAD
domains and its long-range interactions and global structural features.
4.2.7 Discussion
The study on the structural organization of chromatin subcompartments using purely
computational model, on the first time, gives us key findings of spatial compartmentalization
(Figure 17). A1 is the most active state in lymphoblastoid cell, which resides the most inner part
of the nucleus. A1 and B1 states show an intricate spatial relationship. We found B1 state to be a
very special state since chromatin in B1 state likes to stay in place very similar to A1, but forming
89
smaller, less compact compartments. Our observations demonstrate that the B1 network is highly
fragmented into a larger number of subcompartment segments, which are spatially associated to
the A1 compartment. Interestingly, chromatin in B1 state is associated with polycomb-group
proteins, which can remodel chromatin such that epigenetic silencing of genes takes place.
Therefore, genes silenced by polycomb proteins appear to be in similar nuclear positions as the
highly transcriptionally active A1 state. This observation might have relationship with the nature
of B1 state, that they are more or less repressed in lymphoblastoid cell and with the potential to
be turned into active. A2 chromatin regions are observed to be spread out and form fewer compact
compartments. B3 state contains mostly lamina-associated domains. They stay preferentially
close to nucleus envelop and form mostly intra-chromosome contacts. The spatial profile of B3
state chromatin reveals the organization pattern for heterochromatin where they form large,
compact compartments with fewer chromosome involved. Interestingly, although B3 is highly
homogenous we noticed that the A2 subcompartment shows a relatively high association to be in
spatial proximity to the B3 subcompartment. In other words, A2 bodies are more often about B3
bodies than any other subcompartment. This observation poses two potential mechanisms to
silence A1 and A2 chromatin during cell development and upon functional modulation of cells. A1
shows preferred association to B1 polycomb repressed chromatin, while A2 may be preferentially
silenced by association and possible tethering to B3-lamina associated domains. Chromatin
regions in B2 state seen to have evenly distributed but have an enrichment in the center. More
detailed analysis makes us believe that this observation is a result of centromere clusters, or
nucleolus associated domain clusters.
We also find 115 dip-TADs that are residing inwards the nucleus center. These regions turn out
to be highly contacting, gene rich domains. These are functional active domains and bear
important role in nucleolus forming and thus stabilizing the whole genome structure. Figure 20C
shows one typical Dip-Dip interaction that brings 2 active A1 chromatins together to form a big
actively transcribed chromatin body which resembles the speckle properties (Figure 17C). We
observed the presence of higher-order chromatin clusters in our models. However, not all the
90
NOR regions participate equally likely in nucleoli clusters and specific combinations of NORs are
found more often in clusters than others. It remains to be seen what factors are responsible for
chromosome-specific nature of nucleoli clustering.
4.3 Case study: Comparative analysis of human H1-ESC and HFFc6
nucleus structures.
The 4D Nucleome joint analysis project has been initiated by 4D Nucleome consortium to
accelerate the in-depth analysis of the underlying role of genome structure in regulatory
processes and analysis of the differences in genome structure organization between different cell
types. This effort is only possible in a collaborative effort. Our lab is part of center initiative for
mapping the 3D structure of the genome as part of the 4D Nucleome consortium. In April 2017
about 10 research labs that are part of the 4DN consortium have agreed to generate data using
a variety of genomics and imaging techniques on H1-hESC and HFFc6 cell lines. One of the
major challenges is to integrate the different complementary types of datasets obtained in this
project. This will require integrative modeling methods of chromatin folding to determine the
structures and their dynamics consistent with both genomics and imaging data. Here I show
results based on structure analysis through integrated genome modeling (IGM).
4.3.1 Generating high resolution structure models using Hi-C and Lamina-DamID
We used in situ Hi-C data generated by Job Dekker lab and Lamina-DamID data generated by
Bas van Steensel lab. The data are downloaded from4DN data portal
https://data.4dnucleume.org/. To generate 3D models, we processed Hi-C data using the
aforementioned pipeline in Section 4.2 into 200kb resolution probability matrix. The DamID signal
(the ratio of observed reads in the experiment and in the control) is first filtered by removing in a
supervised way its baseline, and then transformed to contact probabilities. The amount of
genomic material contacting the envelope in each structure is a free parameter in this procedure,
91
and we set it to 𝜌𝜌 𝑒𝑒 𝑐𝑐 𝑒𝑒 = 12%. We note that the sum of the probability profile ∑ 𝑝𝑝 𝑖𝑖 𝑒𝑒 𝑐𝑐 𝑒𝑒 𝑖𝑖 should add
up to 𝜌𝜌 𝑒𝑒 𝑐𝑐 𝑒𝑒 ⋅ 𝑁𝑁 where 𝑁𝑁 is the number of beads. We set a linear mapping between the signal
profile 𝑆𝑆 𝑖𝑖 and the probability profile by setting 𝑝𝑝 𝑖𝑖 =
𝑐𝑐 𝑖𝑖 1
𝑁𝑁 ∑ 𝑐𝑐 𝑖𝑖 𝑖𝑖 𝜌𝜌 𝑒𝑒 𝑐𝑐 𝑒𝑒 .
While the nucleus of H1-ESC can be approximately seen as a sphere with radius 𝐼𝐼 = 5000 𝐺𝐺 𝑚𝑚 ,
HFFc6 is a fibroblast cell line with elongated shape. We model this cell nucleus using an ellipsoid
cell nucleus with three axes as 𝐼𝐼 𝑐𝑐 = 7840.0, 𝐼𝐼 𝑏𝑏 = 6470.0, 𝐼𝐼 𝑒𝑒 = 2450.0. After 13 A/M steps we
generated 10,000 200kb resolution models for both H1-ESC and HFFc6 cell line, both at 𝜃𝜃 𝐻𝐻 𝑖𝑖 𝐻𝐻 =
0.001 and 𝜃𝜃 𝐷𝐷 𝑐𝑐 𝑚𝑚 𝐼𝐼 𝐷𝐷 = 0.2. On average each structure has ~200,000 restraints including ~4,000
envelop contact restraints. At the final step > 99.996% of the restraints are satisfied, leaving 4e-5
of the total restraints are violated.
Figure 21. Assessments of H1-ESC and HFFc6 models.
Here we show the resulting structures reproducing the input experimental data.
Chromosome 2 is chosen to show the correlation between input data and the
signal simulated on 10,000 structures in 200kb resolution.
The overall correlation of Lamina DamID signal is ~ 0.94 for both models and the correlation are
all above 0.9 for all autosomes. Pearson correlation of Hi-C data is > 0.98 between input
92
probability matrix and the simulated contact probability matrix (Figure 21).
4.3.2 Common spatial compartment localization is observed in comparative
analysis
The Belmont laboratory recently developed TSA-seq, a genome-wide mapping method for
estimating cytological distances of chromosome loci relative to particular nuclear compartments.
The method uses tyramide signal amplification (TSA) based on an antibody-coupled peroxidase
enzyme (HRP), to generate biotin-tyramide free-radicals; these free radicals then diffuse and
react to label DNA uniformly across the genome, which in turn can be isolated and sequenced.
To label DNA associated with nuclear speckles they used HRP-coupled antibodies against the
protein SON, which is a highly specific marker for nuclear speckles. The diffusion and steady state
concentration of tyramide free-radicals after the TSA reaction depends on the distance to the
speckle and can be modeled as an exponential decay function. Jiang Ma’s laboratory has been
working closely with Belmont lab to develop probabilistic graphical model for estimating joint
probabilities using HiC, lamina DamID, TSA-seq data. The method classifies the chromatin to
several SPIN states: speckle, interior active, interior repressed, near lamina and lamina
associated. As part of 4DN consortium, both labs produced data for human stem cell H1-ESC and
fibroblast cell HFFc6.
Based on their radial positions, all 200kb beads in a model are partitioned into the 5 shells such
that beads are equally divided. Boundaries are set so that each shell contains 20% of the beads.
We then use 10,000 structures to estimate the “conditional state probability” P (s n|X) for a bead
to be located in a shell sn given the state X. The sum of state probabilities for all the shells for a
given state is 1.
Surprisingly even the cell types are different, we observe the same spatial compartmentalization
as we seen in lymphoblastoid models. We see that the lamina-associated regions have the
highest conditional state probability in the most outer shell s5 and the lowest conditional class
probability in the inner most shell s1. The probability gradually decreases from the most outer to
93
the most inner shell (Figure 22A). Interestingly, TADs in the near lamina class have the highest
peaks in the second outmost shell s4 in H1 cells and the most outer shell in HFF cells. Which
might be an impact of bigger surface area in HFF nucleus.
Chromatin regions in the ‘interior active’ class show the highest conditional class probability in the
2
nd
inner shell. Strikingly, the speckle-associated class beads are mainly located in the most
central shell s1 for both H1 and HFF, which conforms our find that A1 state in GM12878 forms
speckle-like nuclear bodies. Interior repressed region seems to be very similar to B1 state in
GM12878 cell and also show a preference to stay inner (Figure 22).
94
Figure 22. Comparative analysis of chromatin organization in different SPIN states.
(A) Spatial localization of chromatin of different SPIN states. Cell volume is
partitioned into the 5 concentric shells such that beads are equally divided.
Conditional shell probability for a bead to be found in a shell is plotted in each
barplot. (B) An example of the chromosomes with speckle state folding towards
the center. Contour shows the density of all speckle state beads in the nucleus. (C)
95
Several examples of chromosomes showing a layered folding structure, where
lamina associated chromatin (purple) stays in the most outer side close to nuclear
envelop, and speckle state chromatin folding inwards. Beads are color coded the
same as in (A).
4.3.3 TAD variations detected using structure-based TAD calling method
TAD calling methods cannot be applied to Lamina-DamID data however we can make it possible
by integrated modeling. We used DamID data along with Hi-C data and we applied the
aforementioned structure-based TAD calling method. Not only this method can detect different
TAD folding in different cells, but also able to detect the structure variety of TAD in different single
structures in the same cell type. Here I show a few preliminary comparison results using this
method.
Figure 23. Structure-based TAD calling comparison.
(A) The same 1.4Mb region showing the different average radius of gyration peaks.
A TAD border is found at 40.2 Mb region which is show as red dashed line. H1-
ESC and GM12878 showed the same TAD border here but in HFFc6 the two TADs
merged into one. The raw contact matrix is shown on the right where black boxes
representing the TAD segmentation. (B) TAD structure variation found in
chromosome 2 from 32.4Mb-36.8Mb.
96
For example, by applying the structure-based TAD calling method on 3 cell types an RG peak can
be found in the middle of the 1.4Mb region starting from 38.8Mb to 41Mb (Figure 23A). But this
peak disappears in HFFc6 cell line. The corresponding contact matrix conforms this finding that
2 TADs can be clearly seen but HFFc6 have 2 TADs merged into one. In addition the structure-
based method can also be used to detect TAD variations within cell type. An example of
chromosome 2 show (Figure 23B) that one TAD can also split into 2 in the population of cells.
This region is bounded by 2 TAD borders and normal TAD calling method like TopDom
123
defines
this region as a TAD. The structure RG peak frequency is 30% and 28% on the left and right TAD
border. However, there is still a probability of ~24% to find an RG peak in the middle of the TAD.
Structure #1 and structure #2 shows the border is non-exist but structure #3 shows a clear TAD
border where the whole TAD splits into 2 independent TADs.
97
Chapter 5 Conclusion
After determining the complete DNA sequence of the human genome and subsequent mapping
of most genes and potential regulatory elements, we are now in a position to study the genome
in 3D. The last years have seen an enhanced interest in the spatial organization of genomes.
Technological advances have revealed new insights and increased understanding of the genome
folding processes and their functional relevance. A joint and coordinated effort of many research
laboratories will facilitate the challenging task of producing quantitative models of the dynamic
nucleome. Such models may help to study the role of genome structure in cell differentiation and
disease; may allow a better understanding of the mechanistic principles of chromatin folding and
may facilitate the detection of the molecular machineries that shape variations of genome
organization in different cells.
Here in this thesis, I presented a thorough review of the recent developed techniques to study the
nuclear organization. Together these studies allow us moving from a one-dimensional
representation of the genome as a long DNA sequence to a spatially and dynamically organized
three-dimensional structure of the living and functional genome inside cells. Facing data from a
variety of different sources, building a 3D model became the most obvious and natural way to
interpret. Since 2010, many computational methods have been developed and helped the
community to explore the true folding of eukaryotic chromatins. Our population-based method
stands out as a systematic way of simulating the stochastic nature of chromatin folding and
provided an efficient software to generate such an ensemble of whole genome structure models.
Now it has also been expanded to integrated modeling which accepts a variety of different data
sources including Hi-C, Lamina-DamID, SPRIT, FISH etc.
Now through my thesis work, these modeling methods have been proven to be valuable to help
people understand the human nucleus architecture. By using these methods, we observed
striking genomic folding difference between liver and heart cell. Models built on human
lymphoblastoid cells show more detailed features on genome structures those we have never
98
seen before. Our models have high predictive value. They reproduce remarkably well many
known structural properties of the human lymphoblastoid cell genome, even though these were
not included as input constraints and are not readily observable in the raw data. We showed
lamina contact profiles could be reproduced by only using in situ Hi-C information, which suggests
that the gene-poor region can be automatically excluded onto the nuclear periphery. We also have
much stronger prediction power of spatial distance and position on particular genomic loci, which
match the FISH measurement surprisingly well.
Without doubt the 3D genome modeling provides us vast opportunity to explore the genome
organization features. A joint and coordinated effort of many research laboratories will facilitate
the challenging task of producing quantitative models of the dynamic nucleome. Such models
may help to study the role of genome structure in cell differentiation and disease; may allow a
better understanding of the mechanistic principles of chromatin folding and may facilitate the
detection of the molecular machineries that shape variations of genome organization in different
cells. Now with the release of data contributed by 4DN consortium, integrated genome modeling
method will be playing the key role in revealing the unknowns in fundamental organization of cell
nucleus.
99
References
1. Furlan-Magaril, M., Várnai, C., Nagano, T. & Fraser, P . 3D genome architecture from populations to
single cells. Curr. Opin. Genet. Dev. 31, 36–41 (2015).
2. Pope, B. D. et al. Topologically associating domains are stable units of replication-timing regulation.
Nature 515, 402–405 (2014).
3. Roix, J. J., McQueen, P. G., Munson, P. J., Parada, L. A. & Misteli, T. Spatial proximity of
translocation-prone gene loci in human lymphomas. Nat. Genet. 34, 287–291 (2003).
4. Tolhuis, B., Palstra, R. J., Splinter, E., Grosveld, F. & De Laat, W. Looping and interaction between
hypersensitive sites in the active β-globin locus. Mol. Cell 10, 1453–1465 (2002).
5. Dekker, J. et al. The 4D nucleome project. Nature 549, 219–226 (2017).
6. Guelen, L. et al. Domain organization of human chromosomes revealed by mapping of nuclear
lamina interactions. Nature 453, 948–51 (2008).
7. Lieberman-aiden, E. et al. Comprehensive Mapping of Long-Range Interactions Revelas Folding
Principles of the Human Genome. Science 326, 289–294 (2009).
8. Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin
interactions. Nature 485, 376–380 (2012).
9. Shen, Y. et al. A map of the cis-regulatory sequences in the mouse genome. Nature 488, 116–120
(2012).
10. Rao, S. S. P. et al. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of
Chromatin Looping. Cell 159, 1665–1680 (2014).
11. Fudenberg, G. et al. Formation of Chromosomal Domains by Loop Extrusion. Cell Rep. 15, 2038–
2049 (2016).
12. Sanborn, A. L. et al. Chromatin extrusion explains key features of loop and domain formation in wild-
type and engineered genomes. Proc. Natl. Acad. Sci. 112, E6456–E6465 (2015).
13. de Wit, E. et al. CTCF Binding Polarity Determines Chromatin Looping. Mol. Cell 60, 676–684 (2015).
14. Schwarzer, W. et al. Two independent modes of chromatin organization revealed by cohesin
removal. Nature 551, 51–56 (2017).
15. Denker, A. & De Laat, W. The second decade of 3C technologies: Detailed insights into nuclear
organization. Genes and Development 30, 1357–1382 (2016).
16. Dai, C. et al. Mining 3D genome structure populations identifies major factors governing the stability
of regulatory communities. Nat. Commun. 7, 11549 (2016).
17. Beagrie, R. A. et al. Complex multi-enhancer contacts captured by genome architecture mapping.
Nature 543, 519–524 (2017).
100
18. Nagano, T. et al. Single-cell Hi-C for genome-wide detection of chromatin interactions that occur
simultaneously in a single cell. Nat. Protoc. 10, 1986–2003 (2015).
19. Strongin, D. E., Groudine, M. & Politz, J. C. R. Nucleolar tethering mediates pairing between the
IgH and Myc loci. Nucleus 5, 474–481 (2014).
20. Shachar, S., Pegoraro, G. & Misteli, T. HIPMap: A High-Throughput Imaging Method for Mapping
Spatial Gene Positions. Cold Spring Harb. Symp. Quant. Biol. 80, 73–81 (2015).
21. Fullwood, M. J. et al. An oestrogen-receptor-α-bound human chromatin interactome. Nature 462,
58–64 (2009).
22. Cusanovich, D. a et al. Multiplex single-cell profiling of chromatin accessibility by combinatorial
cellular indexing. Science (80-. ). 348, 910–914 (2015).
23. Ma, H. et al. Multiplexed labeling of genomic loci with dCas9 and engineered sgRNAs using
CRISPRainbow. Nat. Biotechnol. 34, 528–530 (2016).
24. Chen, B. et al. Dynamic imaging of genomic loci in living human cells by an optimized CRISPR/Cas
system. Cell 155, 1479–1491 (2013).
25. Deng, W., Shi, X., Tjian, R., Lionnet, T. & Singer, R. H. CASFISH: CRISPR/Cas9-mediated in situ
labeling of genomic loci in fixed cells. Proc. Natl. Acad. Sci. 112, 11870–11875 (2015).
26. Ma, H. et al. Multicolor CRISPR labeling of chromosomal loci in human cells. Proc. Natl. Acad. Sci.
112, 3002–3007 (2015).
27. Kalhor, R., Tjong, H., Jayathilaka, N., Alber, F. & Chen, L. Genome architectures revealed by
tethered chromosome conformation capture and population-based modeling. Nat. Biotechnol. 30,
90–98 (2012).
28. Hsieh, T. H. S., Fudenberg, G., Goloborodko, A. & Rando, O. J. Micro-C XL: Assaying chromosome
conformation from the nucleosome to the entire genome. Nat. Methods 13, 1009–1011 (2016).
29. Quinodoz, S. A. et al. Higher-order inter-chromosomal hubs shape 3-dimensional genome
organization in the nucleus. bioRxiv 02215, 219683 (2017).
30. Lubeck, E., Coskun, A. F., Zhiyentayev, T., Ahmad, M. & Cai, L. Single-cell in situ RNA profiling by
sequential hybridization. Nature Methods 11, 360–361 (2014).
31. Ma, H., Reyes-Gutierrez, P. & Pederson, T. Visualization of repetitive DNA sequences in human
chromosomes with transcription activator-like effectors. Proc. Natl. Acad. Sci. 110, 21048–21053
(2013).
32. Miyanari, Y ., Ziegler-Birling, C. & Torres-Padilla, M. E. Live visualization of chromatin dynamics with
fluorescent TALEs. Nat. Struct. Mol. Biol. 20, 1321–1324 (2013).
33. Ramani, V. et al. Mapping 3D genome architecture through in situ DNase Hi-C. Nat. Protoc. 11,
2104–2121 (2016).
101
34. Van Steensel, B. & Henikoff, S. Identification of in vivo DNA targets of chromatin proteins using
tethered Dam methyltransferase. Nat. Biotechnol. 18, 424–428 (2000).
35. Kind, J. et al. Single-Cell Dynamics of Genome-Nuclear Lamina Interactions. Cell 153, 178–192
(2013).
36. Phan, S. et al. 3D reconstruction of biological structures: automated procedures for alignment and
reconstruction of multiple tilt series in electron tomography. Adv. Struct. Chem. Imaging 2, 8 (2017).
37. Ou, H. D. et al. ChromEMT: Visualizing 3D chromatin structure and compaction in interphase and
mitotic cells. Science (80-. ). 357, eaag0025 (2017).
38. Mifsud, B. et al. Mapping long-range promoter contacts in human cells with high-resolution capture
Hi-C. Nat. Genet. 47, 598–606 (2015).
39. Akhtar, W. et al. Using TRIP for genome-wide position effect analysis in cultured cells. Nat. Protoc.
9, 1255–1281 (2014).
40. Do, M., Isaacson, S. A., McDermott, G., Le Gros, M. A. & Larabell, C. A. Imaging and characterizing
cells using tomography. Archives of Biochemistry and Biophysics 581, 111–121 (2015).
41. Darrow, E. M. et al. Deletion of DXZ4 on the human inactive X chromosome alters higher-order
genome architecture. Proc. Natl. Acad. Sci. U. S. A. 113, E4504-12 (2016).
42. Dekker, J. Capturing Chromosome Conformation. Science (80-. ). 295, 1306–1311 (2002).
43. Müller, P. et al. COMBO-FISH enables high precision localization microscopy as a prerequisite for
nanostructure analysis of genome loci. Int. J. Mol. Sci. 11, 4094–4105 (2010).
44. Weiland, Y., Lemmer, P. & Cremer, C. Combining FISH with localisation microscopy: Super-
resolution imaging of nuclear genome nanostructures. Chromosom. Res. 19, 5–23 (2011).
45. Doksani, Y., Wu, J. Y., De Lange, T. & Zhuang, X. XSuper-resolution fluorescence imaging of
telomeres reveals TRF2-dependent T-loop formation. Cell 155, 345–356 (2013).
46. Beliveau, B. J. et al. Single-molecule super-resolution imaging of chromosomes and in situ
haplotype visualization using Oligopaint FISH probes. Nat. Commun. 6, 7147 (2015).
47. Boettiger, A. N. et al. Super-resolution imaging reveals distinct chromatin folding for different
epigenetic states. Nature 529, 418–422 (2016).
48. Nozaki, T. et al. Dynamic Organization of Chromatin Domains Revealed by Super-Resolution Live-
Cell Imaging. Mol. Cell 67, 282-293.e7 (2017).
49. Chen, B. C. et al. Lattice light-sheet microscopy: Imaging molecules to embryos at high
spatiotemporal resolution. Science (80-. ). 346, (2014).
50. Van De Werken, H. J. G. et al. 4C technology: Protocols and data analysis. Methods Enzymol. 513,
89–112 (2012).
51. Dostie, J. & Dekker, J. Mapping networks of physical interactions between genomic elements using
102
5C technology. Nat. Protoc. 2, 988–1002 (2007).
52. Legant, W. R. et al. High-density three-dimensional localization microscopy across large volumes.
Nat. Methods 13, 359–365 (2016).
53. Hsieh, T. H. S. et al. Mapping Nucleosome Resolution Chromosome Folding in Yeast by Micro-C.
Cell 162, 108–119 (2015).
54. Zhang, J. et al. ChIA-PET analysis of transcriptional chromatin interactions. Methods 58, 289–299
(2012).
55. Tang, Z. et al. CTCF-Mediated Human 3D Genome Architecture Reveals Chromatin Topology for
Transcription. Cell 163, 1611–1627 (2015).
56. Schoenfelder, S. et al. Polycomb repressive complex PRC1 spatially constrains the mouse
embryonic stem cell genome. Nat. Genet. 47, 1179–1186 (2015).
57. Imakaev, M. et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization.
Nat. Methods 9, 999–1003 (2012).
58. Yaffe, E. & Tanay, A. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to
characterize global chromosomal architecture. Nat. Genet. 43, 1059–1065 (2011).
59. Tjong, H. et al. Population-based 3D genome structure analysis reveals driving forces in spatial
genome organization. Proc. Natl. Acad. Sci. 113, E1663–E1672 (2016).
60. Paulsen, J. et al. Chrom3D: Three-dimensional genome modeling from Hi-C and nuclear lamin-
genome contacts. Genome Biol. 18, 21 (2017).
61. Duan, Z. et al. A three-dimensional model of the yeast genome. Nature 465, 363–367 (2010).
62. Varoquaux, N., Ay, F., Noble, W. S. & Vert, J. P. A statistical approach for inferring the 3D structure
of the genome. in Bioinformatics 30, i26–i33 (2014).
63. Zhang, Z., Li, G., Toh, K.-C. & Sung, W.-K. 3D Chromosome Modeling with Semi-Definite
Programming and Hi-C Data. J. Comput. Biol. 20, 831–846 (2013).
64. Hu, M. et al. Bayesian Inference of Spatial Organizations of Chromosomes. PLoS Comput. Biol. 9,
e1002893 (2013).
65. Rousseau, M., Fraser, J., Ferraiuolo, M. a, Dostie, J. & Blanchette, M. Three-dimensional modeling
of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling.
BMC Bioinformatics 12, 414 (2011).
66. Nagano, T . et al. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature 502,
59–64 (2013).
67. Stevens, T. J. et al. 3D structures of individual mammalian genomes studied by single-cell Hi-C.
Nature 544, 59–64 (2017).
68. Flyamer, I. M. et al. Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-to-
103
zygote transition. Nature 544, 110–114 (2017).
69. Nagano, T. et al. Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature
547, 61–67 (2017).
70. Ramani, V. et al. Massively multiplex single-cell Hi-C. Nat. Methods 14, 263–266 (2017).
71. Carstens, S., Nilges, M. & Habeck, M. Inferential Structure Determination of Chromosomes from
Single-Cell Hi-C Data. PLOS Comput. Biol. 2016, 1–44 (2016).
72. Vogel, M. J., Peric-Hupkes, D. & van Steensel, B. Detection of in vivo protein - DNA interactions
using DamID in mammalian cells. Nat. Protoc. 2, 1467–1478 (2007).
73. Kind, J. et al. Genome-wide Maps of Nuclear Lamina Interactions in Single Human Cells. Cell 163,
134–147 (2015).
74. Yáñez-Cuna, J. O. & van Steensel, B. Genome–nuclear lamina interactions: from cell populations
to single cells. Current Opinion in Genetics and Development 43, 67–72 (2017).
75. Li, Q. et al. The three-dimensional genome organization of Drosophila melanogaster through data
integration. Genome Biol. 18, 145 (2017).
76. Boyle, S., Rodesch, M. J., Halvensleben, H. A., Jeddeloh, J. A. & Bickmore, W. A. Fluorescence in
situ hybridization with high-complexity repeat-free oligonucleotide probes generated by massively
parallel synthesis. Chromosom. Res. 19, 901–909 (2011).
77. Cremer, T. & Cremer, M. Chromosome Territories. Cold Spring Harb. Perspect. Biol. 2, a003889–
a003889 (2010).
78. Bolzer, A. et al. Three-dimensional maps of all chromosomes in human male fibroblast nuclei and
prometaphase rosettes. PLoS Biol. 3, 0826–0842 (2005).
79. Wang, S. et al. Spatial organization of chromatin domains and compartments in single chromosomes.
Science (80-. ). 353, 598–602 (2016).
80. Shachar, S., Voss, T. C., Pegoraro, G., Sciascia, N. & Misteli, T. Identification of Gene Positioning
Factors Using High-Throughput Imaging Mapping. Cell 162, 911–923 (2015).
81. Wang, S., Xu, J. & Zeng, J. Inferential modeling of 3D chromatin structure. Nucleic Acids Res. 43,
1–12 (2015).
82. Takei, Y ., Shah, S., Harvey, S., Qi, L. S. & Cai, L. Multiplexed Dynamic Imaging of Genomic Loci by
Combined CRISPR Imaging and DNA Sequential FISH. Biophys. J. 112, 1773–1776 (2017).
83. Fu, Y. et al. CRISPR-dCas9 and sgRNA scaffolds enable dual-colour live imaging of satellite
sequences and repeat-enriched individual loci. Nat. Commun. 7, 11707 (2016).
84. Anton, T., Bultmann, S., Leonhardt, H. & Markaki, Y. Visualization of specific DNA sequences in
living mouse embryonic stem cells with a programmable fluorescent CRISPR/Cas system. Nucleus
5, 163–172 (2014).
104
85. Guan, J., Liu, H., Shi, X., Feng, S. & Huang, B. Tracking Multiple Genomic Elements Using
Correlative CRISPR Imaging and Sequential DNA FISH. Biophys. J. 112, 1077–1084 (2017).
86. Le Gros, M. A. et al. Soft x-ray tomography reveals gradual chromatin compaction and
reorganization during neurogenesis in vivo. Cell Rep. 17, 2125–2136 (2016).
87. Smith, E. A. et al. Quantitatively imaging chromosomes by correlated cryo-fluorescence and soft x-
ray tomographies. Biophys. J. 107, 1988–1996 (2014).
88. Mahamid, J. et al. Visualizing the molecular sociology at the HeLa cell nuclear periphery. Science
(80-. ). 351, 969–972 (2016).
89. Oikonomou, C. M. & Jensen, G. J. The development of cryo-EM and how it has advanced
microbiology. Nature Microbiology (2017). doi:10.1038/s41564-017-0073-7
90. Frazier, Z., Xu, M. & Alber, F . TomoMiner and TomoMinerCloud: A Software Platform for Large-Scale
Subtomogram Structural Analysis. Structure 25, 951-961.e2 (2017).
91. Imakaev, M. V ., Fudenberg, G. & Mirny, L. A. Modeling chromosomes: Beyond pretty pictures. FEBS
Letters 589, 3031–3036 (2015).
92. Serra, F. et al. Restraint-based three-dimensional modeling of genomes and genomic domains.
FEBS Lett. 589, 2987–2995 (2015).
93. Peng, C. et al. The sequencing bias relaxed characteristics of Hi-C derived data and implications
for chromatin 3D modeling. Nucleic Acids Res. 41, e183–e183 (2013).
94. Lesne, A., Riposo, J., Roger, P., Cournac, A. & Mozziconacci, J. 3D genome reconstruction from
chromosomal contacts. Nat. Methods 11, 1141–1143 (2014).
95. Zou, C., Zhang, Y. & Ouyang, Z. HSA: Integrating multi-track Hi-C data for genome-scale
reconstruction of 3D chromatin structure. Genome Biol. 17, 1–14 (2016).
96. Szalaj, P. et al. 3D-GNOME: an integrated web service for structural modeling of the 3D genome.
Nucleic Acids Res. 44, W288–W293 (2016).
97. Baù, D. & Marti-Renom, M. a. Genome structure determination via 3C-based data integration by the
Integrative Modeling Platform. Methods 58, 300–306 (2012).
98. Junier, I., Dale, R. K., Hou, C., Képès, F. & Dean, A. CTCF-mediated transcriptional regulation
through cell type-specific chromosome organization in the β-globin locus. Nucleic Acids Res. 40,
7718–7727 (2012).
99. Gehlen, L. R. et al. Chromosome positioning and the clustering of functionally related loci in yeast
is driven by chromosomal interactions. Nucleus 3, 370–383 (2012).
100. Meluzzi, D. & Arya, G. Recovering ensembles of chromatin conformations from contact probabilities.
Nucleic Acids Res. 41, 63–75 (2013).
101. Trieu, T. & Cheng, J. Large-scale reconstruction of 3D structures of human chromosomes from
105
chromosomal contact data. Nucleic Acids Res. 42, e52 (2014).
102. Trieu, T. & Cheng, J. MOGEN: A tool for reconstructing 3D models of genomes from chromosomal
conformation capturing data. Bioinformatics 32, 1286–1292 (2016).
103. Di Stefano, M., Paulsen, J., Lien, T. G., Hovig, E. & Micheletti, C. Hi-C-constrained physical models
of human chromosomes recover functionally-related properties of genome organization. Sci. Rep.
6, 35985 (2016).
104. Giorgetti, L. et al. Predictive polymer modeling reveals coupled fluctuations in chromosome
conformation and transcription. Cell 157, 950–963 (2014).
105. Di Pierro, M., Zhang, B., Aiden, E. L., Wolynes, P. G. & Onuchic, J. N. Transferable model for
chromosome architecture. Proc. Natl. Acad. Sci. 113, 12168–12173 (2016).
106. Zhang, B. & Wolynes, P. G. Topology, structures, and energy landscapes of human chromosomes.
Proc. Natl. Acad. Sci. 112, 6062–6067 (2015).
107. Zhang, Y. et al. Spatial organization of the mouse genome and its role in recurrent chromosomal
translocations. Cell 148, 908–921 (2012).
108. Serra, F. et al. Automatic analysis and 3D-modelling of Hi-C data using TADbit reveals structural
features of the fly chromatin colors. PLoS Comput. Biol. 13, e1005665 (2017).
109. Russel, D. et al. Putting the pieces together: Integrative modeling platform software for structure
determination of macromolecular assemblies. PLoS Biol. 10, (2012).
110. Alber, F. et al. Determining the architectures of macromolecular assemblies. Nature 450, 683–694
(2007).
111. Baù, D. et al. The three-dimensional folding of the α-globin gene domain reveals formation of
chromatin globules. Nat. Struct. Mol. Biol. 18, 107–115 (2011).
112. Umbarger, M. A. et al. The three-dimensional architecture of a bacterial genome and its alteration
by genetic perturbation. Mol. Cell 44, 252–264 (2011).
113. Yildirim, A. & Feig, M. High-resolution 3D models of Caulobacter crescentus chromosome reveal
genome structural variability and organization. Nucleic Acids Res. 46, 3937–3952 (2018).
114. Le, T. B. K., Imakaev, M. V, Mirny, L. a & Laub, M. T. High-Resolution Mapping of the Spatial
Organization of a Bacterial Chromosome. Science (80-. ). 342, 731–734 (2013).
115. Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. Optimization by Simulated Annealing. Science (80-. ).
220, 671–680 (1983).
116. Borgia, G., Coyle, B. & Zwiers, P. B. EVOLUTION OF COLORFUL DISPLAY. Evolution (N. Y). 61,
708–712 (2007).
117. Knight, P. A. & Ruiz, D. A fast algorithm for matrix balancing. IMA J. Numer. Anal. 33, 1029–1047
(2013).
106
118. Hua, N. et al. Producing genome structure populations with the dynamic and automated PGS
software. Nat. Protoc. 13, 915–926 (2018).
119. Spilianakis, C. G., Lalioti, M. D., Town, T., Lee, G. R. & Flavell, R. A. Interchromosomal associations
between alternatively expressed loci. Nature 435, 637–645 (2005).
120. Schoenfelder, S. et al. Preferential associations between co-regulated genes reveal a transcriptional
interactome in erythroid cells. Nat. Genet. 42, 53–61 (2010).
121. Rosa-Garrido, M. et al. High-resolution mapping of chromatin conformation in cardiac myocytes
reveals structural remodeling of the epigenome in heart failure. Circulation (2017).
doi:10.1161/CIRCULATIONAHA.117.029430
122. Kim, Y. H. et al. Rev-erba dynamically modulates chromatin looping to control circadian gene
transcription. Science (80-. ). (2018). doi:10.1126/science.aao6891
123. Shin, H. et al. TopDom: An efficient and deterministic method for identifying topological domains in
genomes. Nucleic Acids Res. (2015). doi:10.1093/nar/gkv1505
124. Therizols, P . et al. Chromatin decondensation is sufficient to alter nuclear organization in embryonic
stem cells. Science (80-. ). (2014). doi:10.1126/science.1259587
125. Reddy, K. L., Zullo, J. M., Bertolino, E. & Singh, H. Transcriptional repression mediated by
repositioning of genes to the nuclear lamina. Nature (2008). doi:10.1038/nature06727
126. Shopland, L. S. et al. Folding and organization of a contiguous chromosome region according to the
gene distribution pattern in primary genomic sequence. J. Cell Biol. 174, 27–38 (2006).
127. Enright, A. J., Van Dongen, S. & Ouzounis, C. A. An efficient algorithm for large-scale detection of
protein families. Nucleic Acids Res. 30, 1575–84 (2002).
128. Brangwynne, C. P ., Mitchison, T. J. & Hyman, A. A. Active liquid-like behavior of nucleoli determines
their size and shape in Xenopus laevis oocytes. Proc. Natl. Acad. Sci. (2011).
doi:10.1073/pnas.1017150108
129. Zhu, L. & Brangwynne, C. P. Nuclear bodies: The emerging biophysics of nucleoplasmic phases.
Current Opinion in Cell Biology (2015). doi:10.1016/j.ceb.2015.04.003
130. Galganski, L., Urbanek, M. O. & Krzyzosiak, W. J. Nuclear speckles: Molecular organization,
biological function and role in disease. Nucleic Acids Res. 45, 10350–10368 (2017).
131. Spector, D. L. & Lamond, A. I. Nuclear speckles. Cold Spring Harb. Perspect. Biol. 1–13 (2016).
doi:10.1101/cshperspect.a000646
132. Xing, Y., Johnson, C. V., Moen, P. T., McNeil, J. A. & Lawrence, J. B. Nonrandom gene organization:
Structural arrangements of specific pre-mRNA transcription and splicing with SC-35 domains. J. Cell
Biol. (1995). doi:10.1083/jcb.131.6.1635
133. Shopland, L. S., Johnson, C. V., Byron, M., McNeil, J. & Lawrence, J. B. Clustering of multiple
specific genes and gene-rich R-bands around SC-35 domains: Evidence for local euchromatic
107
neighborhoods. J. Cell Biol. (2003). doi:10.1083/jcb.200303131
134. Hall, L. L., Smith, K. P., Byron, M. & Lawrence, J. B. Molecular anatomy of a speckle. in Anatomical
Record - Part A Discoveries in Molecular, Cellular, and Evolutionary Biology (2006).
doi:10.1002/ar.a.20336
135. Carter, K. C. et al. A three-dimensional view of precursor messenger RNA metabolism within the
mammalian nucleus. Science (80-. ). (1993). doi:10.1126/science.8446902
136. Weierich, C. et al. Three-dimensional arrangements of centromeres and telomeres in nuclei of
human and murine lymphocytes. Chromosom. Res. (2003). doi:10.1023/A:1025016828544
137. Németh, A. et al. Initial genomics of the human nucleolus. PLoS Genet. 6, e1000889 (2010).
138. van Koningsbruggen, S. et al. High-resolution whole-genome sequencing reveals that specific
chromatin domains from most human chromosomes associate with nucleoli. Mol. Biol. Cell 21,
3735–48 (2010).
139. Tam, R., Smith, K. P . & Lawrence, J. B. The 4q subtelomere harboring the FSHD locus is specifically
anchored with peripheral heterochromatin unlike most human telomeres. J. Cell Biol. 167, 269–279
(2004).
140. Lévy-Leduc, C., Delattre, M., Mary-Huard, T. & Robin, S. Two-dimensional segmentation for
analyzing Hi-C data. in Bioinformatics (2014). doi:10.1093/bioinformatics/btu443
141. Crane, E. et al. Condensin-driven remodelling of X chromosome topology during dosage
compensation. Nature 523, 240–244 (2015).
142. Bintu, B. et al. Super-resolution chromatin tracing reveals domains and cooperative interactions in
single cells. Science (80-. ). (2018). doi:10.1126/science.aau1783
108
Appendix
A.1 Usage of PGS software
A.1.1 Installation
A. Download (or “git clone”) PGS from https://www.github.com/alberlab/PGS.
PGS flow is a Python package which runs on Linux and Mac OS X systems.
Python can be downloaded from http://www.python.org.
The dependencies are as follows:
- Numpy (http://www.numpy.org/)
- Scipy (http://www.scipy.org/)
- Matplotlib (http://matplotlib.org/)
- Pandas (http://pandas.pydata.org/)
- H5py (http://www.h5py.org/)
- Seaborn (http://seaborn.pydata.org/)
B. Download IMP (Integrative Modeling Package) version 2.4 or later from:
https://integrativemodeling.org/.
C. Prepare the experimental data. Depending on options chosen by the user during
configuration, PGS can take different kinds of input files.
Option 1 (raw + TAD definition). The user provides a raw contact frequency matrix
(uniformly binned) and TAD index information. PGS generates a TAD-TAD contact
probability matrix from the raw data and automatically proceeds to the modeling
component. This option requires two input files:
File 1: Genome-wide chromatin-chromatin interaction matrix, where each of the N
rows describes one bin of the Hi-C data. This text file can be gzip or bzip compressed.
It is formatted as follow (see Fig. 3b).
109
• No header
• Column 1: chromosome name (e.g. Chr1, Chr2, ..., ChrX)
• Column 2: start genomic position of the Hi-C bin (0-based)
• Column 3: end genomic position of the Hi-C bin (1-based)
• Columns 4 to N+3: contact vector of the bin with all other bins (i.e. contact
matrix) (integers)
File 2: Chromosome segmentation file, where each row defines one topological
associated domain (Fig. 3c). This text file has the BED file format:
• No header
• Column 1: chromosome name (e.g. Chr1, Chr2, ..., ChrX)
• Column 2: start genomic positions of TAD (0-based)
• Column 3: end genomic positions of TAD (1-based)
• Column 4: flag for the kind of TAD (“domain”, “gap”, “CEN”)
Option 2 (TAD-TAD probabilities + TAD information). In this case, the user has
already prepared a TAD-TAD contact probability matrix and must also provide the
TAD definitions in a file. The two input files have the same formats as files 1 and
2 in Option 1. The bins in the first file represent TADs and the matrix elements
must be probability values between 0 and 1.
Option 3 (hdf5 prob). The user provides a TAD-TAD contact probability matrix that
was generated by PGS. This option is useful for producing independent structure
populations from a different random initialization of the structures, or for testing
different model parameters using the same input data.
A.1.2 Equipment setup
We recommend following the installation instructions from our online documentation
(http://pgs.readthedocs.io/en/latest/quickstart.html). The easiest way to install PGS is to
use a conda package manager. Both Anaconda (https://www.continuum.io/downloads)
and the minimal package Miniconda (http://conda.pydata.org/miniconda.html) are
110
suitable for managing all the required packages, including IMP. Once the PGS package
has been downloaded along with all the dependencies mentioned above, set up the
package using the following command.
$ python setup.py install
The script “setup.py” is located in the PGS directory. To confirm that PGS is installed properly,
users can execute the following shell command.
$ cd test
$ sh runPgs_workflow_test.sh
This process should take less than two minutes on any current computing workstation.
A.1.3 Procedures
Step 1: Generate the configuration file and execution script.
A user can either modify the prepared configuration file and execution script or use the graphical
user interface (GUI) called PGS-Helper (requires Java) to generate these files.
A. Using PGS-Helper (if Java is installed).
$ java –jar PGSHelper.jar
The command will display a GUI (Fig. 3a) prompting the user to enter the needed
information. Most of the fields are pre-populated, so the user can just review and
modify them if necessary. There are only 4 blank fields that the user must complete
(described in points i to iii below). In the following, we describe the fields displayed
in the GUI. When all of the settings are correct, the user clicks the “Generate”
button at the bottom of the GUI. They can then review the usage in the bottom box,
and click “Confirm” to generate the configuration file (input_config.json) and
executable file (runPGS.sh)
i. Working Directory
111
This is the directory where the output of the GUI (the executable script
runPGS.sh and the configuration file input_config.json), the log files
(pyflow.data directory), and the results of the 3D genome modeling will be
stored.
ii. PGS Source Directory
This is the PGS installation directory, which contains pgs.py.
iii. Input
• Select one of the three options to specify which types of input files are to
be used (see Equipment for details) and specify the file locations.
• Genome. The current version of PGS supports recent human and mouse
genomes: hg19, hg38, mm9, and mm10. PGS automatically generates the
diploid autosome and X chromosome representations.
• Resolution: the bin resolution (integer number of base pairs) of the raw Hi-
C matrix.
iv. Modeling Parameters
• Num of structures: the number of structures in the population to optimize
(default = 1,000). We recommend increasing this value to at least 10,000
for a final sampling).
• Violation cutoff: the maximum proportion of violated constraints. A smaller
value will generally result in better agreement with the input data (default =
0.05).
• Theta list: a decreasing series of values in the range 1 ≤ theta < 0 . Each
theta is a contact probability threshold, determining which contacts are
used in the optimization. PGS progresses through all the values in this list,
gradually including more and more Hi-C contacts in the optimization
(default = 1, 0.2, 0.1, 0.05, 0.02, 0.01).
• Max iteration: the maximum number of A/M cycles for each value of theta
(default = 10).
• Nucleus Radius: the radius of the nucleus in nanometers. A typical human
nucleus has a radius of 5000 nm (default = 5000).
• Genome occupancy: the ratio between the genome-wide chromosomal
volume and the total volume of the nucleus (default = 0.2).
v. System Parameters
112
• Default core: the default number of computing cores to use for each job.
Light jobs, such as the modeling step (M-step), do not require more than
one CPU (default = 1).
• Default mem MB: the memory limit for each job in megabytes (default =
1,500).
• Max core: the maximum number of computing cores to allocate for a heavy
job, such as building the matrix or calculating pairwise distance
distributions (default = 8).
• Max mem MB: the memory allocation limit for a heavy job (default = 64,000
MB).
vi. Command Setup
• Run mode: the user’s computing platform. This can be local (e.g. a
personal workstation), SGE (Sun Grid Engine), or Torque.
• Core limit: specify the maximum number of cores to allocate. (This setting
is valid for all three run modes. In local mode, set this value to the cores of
the computer.)
• Mem limit: specify the limit of total memory usage in MB.
• Optional argument list: additional unix-style command line arguments
(user specific) for all job submissions. The GUI provides a template
allowing the user to recognize and supply missing values (e.g. in [‘-
q’,’[qname]’,’-l’,’walltime=hh:mm:ss’] replace qname with
the user’s HPC queue name, and hh:mm:ss with hours:minutes:seconds.).
vii. Click the “Generate” button at the bottom. The user then can review the usage
in the bottom box, and confirm to generate the configuration
(input_config.json) and executable files (runPGS.sh).
B. Check and modify the configuration and executable files directly.
In case users do not have Java installed to run the PGS Helper program, the
package also provides examples of the configuration and executable files. Users
can open these text files under the pgs/test directory, and modify them as
needed.
i. input_config.json
{"source_dir" : "[Directory name where pgs socurce is]",
113
“input" : {
"contact_map_file_hdf5" : "[Contact map file]",
"TAD_file" : "[ TAD file, .bed format]"
“resolution” : “[Resolution of input contact_map_file] e,g. 100000”
“genome” : “[Genome version], e.g. hg19”
},
"output_dir" : "[Output Directory to store the results], e.g. $PROJECT_DIR/result",
“modeling_parameters" : {
"theta_list" : [Theta list] e.g, ["1", "0.2", "0.1","0.05","0.02","0.01"],
"num_of_structures" : [Number of structure to generate] e.g. 10000,
"max_iter_per_theta" : [Max Iterations per job] e.g. 10,
"violation_cutoff" : [Violation Cutoff ] e.g. 0.005
"chr_occupancy" : [Chromosome Occupancy ] e.g. 0.2
"nucleus_radius" : [Nucleus Radius ] e.g. 5000.0
},
"system" : {
"max_core" : [Maximum number of cores in a single node], e.g. 8,
"max_memMB" : [Maximum size of mem(MB) in a single node] e.g. 64000,
"default_core" : [Default number of cores], e.g. 1,
"default_memMB" : [Default size of mem(MB)] e.g. 1500
}
}
ii. runPGS.sh
python $PGS_DIRECTORY/pgs.py
--input_config $PROJECT_DIR/input_config.json
--run_mode [running platform]
114
--nCores 300
--memMb 800000
--pyflow_dir $PROJECT_DIR
--schedulerArgList ["-q","[qname]","-l","walltime=100:00:00"]
Step 2: Run PGS.
After the configuration file and execution script are generated by step 1, the user can execute
PGS with the following command.
$ sh runPgs.sh
A.1.4 Troubleshooting
Troubleshooting advice can be found in Table 7.
Table 7: Troubleshooting
Step Problem Possible reason Solution
1
Java is installed, but the GUI of
PGS Helper does not appear
X11 for graphical
display is not turned
on
Log in again to your
HPC with the “ssh –
X” option
2
The terminal where PGS was
executed closed, so the PGS
process was stopped
Accidentally closed,
system shut-down,
or broken node.
Rerun PGS, using
the same command
as before
3
PGS stops with [ERROR]
messages containing “… failed
sub-workflow classname:
The resolution is set
incorrectly, or the
input matrix format
Fix the resolution
parameter in
input-
115
‘BuildTADMapFlow’ …” and
“IndexError: … is out of bounds …”
is wrong. config.json, and
check the input file
format.
4
PGS stops with [ERROR]
messages containing “… failed
sub-workflow classname:
‘BuildTADMapFlow’ …”, “… using
non-integer …”, and “originHist =
…”
The raw input matrix
contains a non-
integer
Check and fix the
matrix
5
PGS stops while running the A/M-
cycles
Computing cluster
problems
Try to request more
than 10 GB memory
for the main PGS
program
A.1.5 Timing
The configuration of PGS should take only about 1 minute.
We have designed PGS to automatically and dynamically run a series of processes or steps. If
there are failures on a running job, for example because a node is down, the network is busy, or
there is a disk I/O failure, PGS tries to resubmit the failed job two more times before aborting.
The total run time for PGS can vary widely depending on available computing resources, data
size, and modeling complexity. The first task is to build the input matrix, which takes about 1
minute or less for input options 2 and 3. If the user selects input option 1, this task takes from
several minutes to several hours depending on the size of the matrix. For instance, it takes about
one minute to process a 2 Mb resolution Hi-C matrix, but 14 hours to build the ~2300x2300 contact
116
probability matrix from a 100 kb resolution Hi-C matrix (these times are on a single ~2.8 GHz
CPU). The second task is to optimize the structure population by running A/M cycles (iterations
of the A-step and M-step). This process starts immediately after the input matrix is generated,
with PGS submitting many simultaneous jobs on a computing cluster. The typical time required to
finish one M-step optimization for a single genome structure with ~2x2300 TAD domains is about
45~90 minutes (at ~1 Mb resolution). If the user asked for a population of 2,000 structures, and
allocates 500 CPUs to the task, then PGS will run the first 500 jobs simultaneously. The remaining
1500 jobs are queued and sent one by one to CPUs on the cluster as they become available.
PGS waits until the M-step is complete for all structures before it submits the A-step jobs. In this
example, the A-step calculation takes about 5-30 minutes. Thus, a single A/M cycle for a
population of 2000 structures at ~1 Mb resolution could take about 3 hours. The length of the
theta list and number of iterations per theta value will also affect the timing, as multipliers of the
A/M cycle time. The expected total time is about equal to the number of theta parameters plus 5
to 10 (based on our experience) times the A/M cycle time. Since PGS decides on the fly (based
on the violation cutoff parameter) whether to continue iterating the A/M cycle or move to the next
theta level, we cannot provide a more accurate prediction of the timing. The run time also depends
on the quality of the data set. Noisy or inconsistent data are likely to produce artifacts that are
hard to optimize and hence require more A/M cycles.
A.1.6 Anticipated results
The main output of PGS is a structure population. All results are stored under the result
directory. In this version, PGS writes to four subdirectories:
i. probMat: contains the input contact probability matrix (in hdf5 binary format) if option 1
or 2 is selected.
ii. actDist: contains intermediate files generated by the A-step, which are used in the
subsequent M-step.
117
iii. structure: contains the genome structure information during optimization, saved in hdf5
binary files (with .hms file extension). One file corresponds to one structure and contains
a history of optimization snapshots for the different theta parameters. The smallest theta,
with the last iteration step (alphabetically ordered, i.e. the last snapshot) is the final model.
We refer to the whole set of final models as the structure population (Fig 4a). Users then
read TAD coordinates from these structure files and perform further analysis that relates
to their research. We have provided a library of tools on the PGS public repository to help
users easily analyze the structure population (for further details, refer to the PGS
documentation page at http://pgs.readthedocs.io/en/latest/tools.html).
iv. report: contains some basic analysis: heat maps of contact probability matrices, radial
positions of TADs, and the quality of optimization (Figs. 4b-e). PGS writes the average
nuclear radial position for every TAD in the file radialPlot_summary.txt. Users can
also find a summary of the violation portion that reflects the overall quality agreement
between experiment data (input of PGS) and the structure population (output of PGS).
A.2 Structure archiving and dissemination
A crucial aspect in my work is to be able to make genome structure models available to the public.
However, currently no venue exists to store, curate and distribute multi-scale chromosomal
structures to the public. PDB-Dev is a prototype deposition and archiving system for protein
structural models obtained through integrative/hybrid methods. We want to adapt the PDB-Dev
framework for multi-scale chromatin models so that we are able to deposit our integrative genome
populations into PDB-dev. These models are not at atomic resolution and are derived by
integrating information from multiple sources. PDB is already moving traditional atomic structures
to mmCIF format. The laboratory of Prof. Andrej Sali is leading the support of integrative/hybrid
models by means of an mmCIF extension dictionary ihm-extension.dic
(https://github.com/ihmwg/IHM-dictionary). My goal is to develop a standard data format and
quality control mechanism that can serve as standards for the database deposition and release
118
of genome structures. It would be essential to have a solid data standard so that our lab as well
as any other lab can release, and deposit verified genome modeling structures to the public.
A.3 Additional Methods
A.3.1 Fmaxation
In order to model the contact frequencies between 2 genomic regions, we describe the whole
genome by domain-level contact probability matrix. (We might refer to it as TAD matrix). Where
each bin in the matrix represents a domain and each value corresponds to the contact
probabilities ( 𝑝𝑝 ∈ [0,1]) for the 2 related domains. The parameter 𝑓𝑓 𝑚𝑚 𝑐𝑐 𝑒𝑒 or fmax is chosen to
represent the contact frequency value of 100% probability to form contacts. It can be regarded as
a normalization factor to convert contact frequency matrix to contact probability matrix.
We simply call the process for transforming bin-level contact matrix to bin-level probability as
“fmaxation”. The bin-level contact probability matrix 𝑷𝑷 = ( 𝑝𝑝 𝑖𝑖 𝑗𝑗 )
𝐼𝐼 × 𝐼𝐼 is calculated as 𝑝𝑝 𝑖𝑖 𝑗𝑗 =
min (
𝑓𝑓 𝑖𝑖𝑖𝑖
𝑓𝑓 𝑚𝑚 𝑚𝑚 𝑚𝑚 , 1), where 𝑭𝑭 = ( 𝑓𝑓 𝑖𝑖 𝑗𝑗 )
𝐼𝐼 × 𝐼𝐼 is the normalized contact frequency matrix.
In our whole genome models, we represent each chromatin domain as beads, each bead 𝑖𝑖 is a
sphere particle with a radius 𝑟𝑟 𝑖𝑖 . The contact event for each 2 beads are defined as the center-to-
center distance is smaller than the sum of 2 times of the radius 𝑑𝑑 𝑖𝑖 𝑗𝑗 ≤ 2( 𝑟𝑟 𝑖𝑖 + 𝑟𝑟 𝑗𝑗 ). Then the volume
of the bead itself is:
𝑉𝑉 𝑏𝑏 𝑒𝑒𝑐𝑐 𝑒𝑒 =
4
3
𝜋𝜋 𝑟𝑟 3
In order to choose the right fmax we assume each bead are equal in radius size, and the spatial
occupancy of beads is 20%. In this way we expect every bead has 24 contacts on average:
𝑉𝑉 𝑐𝑐𝑒𝑒𝑖𝑖 𝑛𝑛 ℎ 𝑏𝑏 𝑐𝑐 𝑐𝑐𝑐𝑐
=
4
3
𝜋𝜋 (5 𝑟𝑟 )
3
× 20% − 𝑉𝑉 𝑏𝑏 𝑒𝑒𝑐𝑐 𝑒𝑒 = (125 × 0.2 − 1) 𝑉𝑉 𝑏𝑏 𝑒𝑒𝑐𝑐 𝑒𝑒 = 24 𝑉𝑉 𝑏𝑏 𝑒𝑒𝑐𝑐 𝑒𝑒
So we choose fmax and build domain-level probability matrix such that the mean row sum of the
domain-level matrix is close to 24. This can be done by repeatedly adjust fmax and do the matrix
119
building process iteratively until converge.
A.3.2 Building domain-level probability matrix
The contact between 2 large domains is defined by the closest part of each other. We define a
mapping 𝑏𝑏 ( 𝑖𝑖 ) as the set of all bins in matrix 𝑷𝑷 that belong to TAD 𝑖𝑖 . Then the domain-level
matrix 𝐴𝐴 = ( 𝑎𝑎 𝑖𝑖 𝑗𝑗 )
𝑁𝑁 × 𝑁𝑁 can be calculated as:
𝑎𝑎 𝑖𝑖 𝑗𝑗 = 𝑚𝑚 𝐺𝐺𝑎𝑎𝐺𝐺 ( 𝑡𝑡 𝑡𝑡 𝑝𝑝 10% < { 𝑝𝑝 𝛼𝛼𝛼𝛼
: 𝛼𝛼 ∈ 𝑏𝑏 ( 𝑖𝑖 ), 𝛽𝛽 ∈ 𝑏𝑏 ( 𝑗𝑗 )} > )
In the case that some contacts are extremely higher than the surrounding contacts, these contacts
are identified as outliers by { 𝑝𝑝 : 𝑝𝑝 > 𝜇𝜇 + 2 𝐼𝐼 𝑄𝑄 𝐼𝐼 𝑡𝑡 𝑟𝑟 𝑝𝑝 < 𝜇𝜇 − 2 𝐼𝐼 𝑄𝑄 𝐼𝐼 } , where 𝑝𝑝 ∈ { 𝑝𝑝 𝛼𝛼𝛼𝛼
: 𝛼𝛼 ∈ 𝑏𝑏 ( 𝑖𝑖 ), 𝛽𝛽 ∈
𝑏𝑏 ( 𝑗𝑗 )} and 𝜇𝜇 = 𝑚𝑚 𝐺𝐺𝑎𝑎 𝐺𝐺 { 𝑝𝑝 𝛼𝛼𝛼𝛼
: 𝛼𝛼 ∈ 𝑏𝑏 ( 𝑖𝑖 ), 𝛽𝛽 ∈ 𝑏𝑏 ( 𝑗𝑗 )}. The IQR refers to the interquartile range of { 𝑝𝑝 𝛼𝛼𝛼𝛼
}.
These outliers are excluded from the calculation.
A.3.3 Probabilistic population structure modeling
We use the previous developed probabilistic framework to de-convolve the ensemble Hi-C data
into a large population of individual genome structures of entire diploid genomes that are
statistically consistent with the contacts observed in the ensemble Hi-C data
59,118
. The genome
structure population is determined by maximizing the likelihood function for observing the Hi-C
data. Briefly given contact probability matrix 𝑨𝑨 , we aim to reconstruct all 3D structures 𝑿𝑿 =
{ 𝑿𝑿 𝟏𝟏 , 𝑿𝑿 𝟐𝟐 … 𝑿𝑿 𝑴𝑴 } in the population of 𝑀𝑀 models each containing 2N genomic regions (i.e. TADs)
for the diploid genome, and 𝑋𝑋 𝑖𝑖 𝑚𝑚 ∈ ℜ
3
, 𝑖𝑖 = 1. .2 𝑁𝑁 . We introduced a latent indicator variable 𝑾𝑾 =
( 𝒘𝒘 𝒊𝒊𝒊𝒊 𝒊𝒊 )
𝟐𝟐 𝑵𝑵 × 𝟐𝟐 𝑵𝑵 × 𝑴𝑴 for complementing every single structure’s information and homologous
information missed by 𝑨𝑨 . It is a binary-valued 3rd-order tensor specifying the contacts of
homologous genomic regions in each structure such that ∑ 𝑾𝑾 𝑚𝑚 𝑀𝑀 𝑚𝑚 = 1
𝑀𝑀 ⁄ = 𝑨𝑨 . We can jointly
approximate the structure population 𝑿𝑿 and the contact tensor 𝑾𝑾 by maximizing the log-
likelihood of the probability:
120
log 𝑃𝑃 ( 𝐗𝐗 | 𝐀𝐀 , 𝐖𝐖 ) = log 𝑃𝑃 ( 𝐀𝐀 , 𝐖𝐖 | 𝐗𝐗 )
A stepwise optimization strategy to gradually increase the optimization hardness beginning by
estimating a structure population that reproduces at first only the most frequent interactions
( 𝑨𝑨 𝜃𝜃 | 𝑎𝑎 𝑖𝑖 𝑗𝑗 ≥ 𝜃𝜃 , 𝑎𝑎 𝑖𝑖 𝑗𝑗 ∈ 𝑨𝑨 𝜃𝜃 ), so that interactions with contact probabilities lower than a certain value θ
are ignored. Then we gradually add contacts with lower probabilities and perform additional
rounds of optimization. The A/M-steps are repeated for several rounds and at different probability
level ( θ1, θ2, … θn) until a stop criterion is achieved.
A.3.4 Iterative refinement of restraint assignment
Here we introduce new methodological innovation in our optimization scheme, iterative refinement,
which allows us to generate genome structures at higher resolution and with improved accuracy
in comparison to our previous approach. The probabilistic framework for deconvolving ensemble
data is a data driven method. This method only considers restraints which are contact information,
and non-contact information are not included, therefore an excess amount of wrongly assigned
contacts can be introduced during the optimization process and thus drastically increase the
optimization time for resolving them.
Due to cooperative contacts and random encounters of domains, domain spheres might get close
to each other even if they are not readily assigned restraints as a result of already imposed
contacts. It is likely that during the optimization the models contain too many contacts. If we would
impose non-contacts, we would have no problem with overrepresented contacts. We tinker with
this issue by modify the restraint assignment step, where we distribute domain contacts ( 𝑖𝑖 , 𝑗𝑗 ) in
a subset of structures in the population based on the input 𝑎𝑎 𝑖𝑖 𝑗𝑗 .
To minimize excess contacts, we use a heuristic method to infer the minimal possible number of
to correct constraints required in each iteration for optimizing a structure population consistent
with the Hi-C data. We assume that the contact frequency for domain pair ( 𝑖𝑖 , 𝑗𝑗 ) are a
consequence of two parts, constrained contacts and non-constrained excess contacts:
121
𝑁𝑁 𝑖𝑖 𝑗𝑗 𝑒𝑒𝑒𝑒 𝑒𝑒 𝑒𝑒𝑒𝑒 𝑒𝑒 𝑒𝑒𝑒𝑒
= 𝑁𝑁 𝑖𝑖 𝑗𝑗 𝑒𝑒 𝑐𝑐𝑐𝑐𝑐𝑐 𝑒𝑒 𝑐𝑐 𝑐𝑐 𝑖𝑖 𝑐𝑐 𝑒𝑒𝑒𝑒 + 𝑁𝑁 𝑖𝑖 𝑗𝑗 𝑐𝑐𝑐𝑐𝑐𝑐 − 𝑒𝑒 𝑐𝑐𝑐𝑐 𝑐𝑐 𝑒𝑒 𝑐𝑐 𝑐𝑐 𝑖𝑖 𝑐𝑐 𝑒𝑒𝑒𝑒
Where 𝑁𝑁 𝑖𝑖 𝑗𝑗 𝑒𝑒𝑒𝑒 𝑒𝑒 𝑒𝑒𝑒𝑒 𝑒𝑒 𝑒𝑒𝑒𝑒
is the frequency of pair ( 𝑖𝑖 , 𝑗𝑗 ) to form contact and 𝑁𝑁 𝑖𝑖 𝑗𝑗 𝑒𝑒 𝑐𝑐𝑐𝑐𝑐𝑐 𝑒𝑒 𝑐𝑐 𝑐𝑐 𝑖𝑖 𝑐𝑐 𝑒𝑒 𝑒𝑒 ,
𝑁𝑁 𝑖𝑖 𝑗𝑗 𝑐𝑐𝑐𝑐𝑐𝑐 − 𝑒𝑒 𝑐𝑐𝑐𝑐𝑐𝑐 𝑒𝑒 𝑐𝑐 𝑐𝑐 𝑖𝑖 𝑐𝑐 𝑒𝑒𝑒𝑒 are the frequency of constrained contacts and non-constrained contacts.
Applying this idea, we assume the contact probability matrix A is defined by assigning contacts in
a matrix 𝐀𝐀 �
= ( 𝐴𝐴 ̂
𝑖𝑖 𝑗𝑗 )
𝑁𝑁 × 𝑁𝑁 , where 𝜀𝜀 𝑖𝑖 𝑗𝑗 are non-constrained excess contacts:
𝐴𝐴 𝑖𝑖 𝑗𝑗 = 𝐴𝐴 ̂
𝑖𝑖 𝑗𝑗 + 𝜀𝜀 𝑖𝑖 𝑗𝑗
Unfortunately, the number of non-restrained/excess contacts cannot be inferred before we
optimize a structure population. However, in a step-wise and iterative optimization scheme we
can use the structure population generated in the previous iteration to approximate this number.
We use 𝑄𝑄 𝑖𝑖 𝑗𝑗 to denote the proportion of non-restrained/excess contacts that are formed due to
random and cooperative encounters as a result of other already imposed contacts. If we use the
contact indicator matrix 𝑪𝑪 that has the same dimension as 𝑾𝑾 such that each element 𝐶𝐶 𝑖𝑖 𝑗𝑗𝑚𝑚
is
the indicator of contact in model structure 𝑿𝑿 𝒊𝒊 for domain pair ( 𝑖𝑖 , 𝑗𝑗 ), and 𝑘𝑘 for iteration step, then
from the previous equation at the step:k-1 we have:
� 𝑪𝑪 𝑖𝑖 𝑗𝑗𝑚𝑚
( 𝑘𝑘 − 1)
𝑀𝑀 𝑚𝑚 = 1
= � 𝑾𝑾 𝑖𝑖 𝑗𝑗𝑚𝑚
( 𝑘𝑘 − 1)
𝑀𝑀 𝑚𝑚 = 1
+ � 𝑀𝑀 − � 𝑾𝑾 𝑖𝑖 𝑗𝑗𝑚𝑚
( 𝑘𝑘 − 1)
𝑀𝑀 𝑚𝑚 − 1
� 𝑄𝑄 𝑖𝑖 𝑗𝑗
We can solve 𝑄𝑄 𝑖𝑖 𝑗𝑗 by:
𝑄𝑄 𝑖𝑖 𝑗𝑗 =
∑ 𝑪𝑪 𝑖𝑖 𝑗𝑗𝑚𝑚
( 𝑘𝑘 − 1)
𝑀𝑀 𝑚𝑚 = 1
− ∑ 𝑾𝑾 𝑖𝑖 𝑗𝑗𝑚𝑚
( 𝑘𝑘 − 1)
𝑀𝑀 𝑚𝑚 = 1
𝑀𝑀 − ∑ 𝑾𝑾 𝑖𝑖 𝑗𝑗𝑚𝑚
( 𝑘𝑘 − 1)
𝑀𝑀 𝑚𝑚 = 1
Once we get 𝑄𝑄 𝑖𝑖 𝑗𝑗 we can have:
𝑨𝑨 𝑖𝑖 𝑗𝑗 = 𝑨𝑨 �
𝑖𝑖 𝑗𝑗 + 𝜀𝜀 𝑖𝑖 𝑗𝑗 = 𝑨𝑨 �
𝑖𝑖 𝑗𝑗 + �1 − 𝑨𝑨 �
𝑖𝑖 𝑗𝑗 � 𝑄𝑄 𝑖𝑖 𝑗𝑗
𝑨𝑨 �
𝑖𝑖 𝑗𝑗 = ( 𝑨𝑨 𝑖𝑖 𝑗𝑗 − 𝑄𝑄 𝑖𝑖 𝑗𝑗 ) (1 − 𝑄𝑄 𝑖𝑖 𝑗𝑗 � )
122
𝑨𝑨 �
𝑖𝑖 𝑗𝑗 , is the estimated proportion of restraints to be assigned to 𝑾𝑾 and the activation distance 𝑑𝑑 𝑖𝑖 𝑗𝑗 𝑐𝑐 𝑒𝑒𝑒𝑒
are updated according to 𝑨𝑨 �
. Then for each existing structure in structure population 𝑿𝑿 , assign a
harmonic restraint according to corresponding 𝑾𝑾 .
A.3.5 Simulated annealing for optimizing structure population
Once we have the restraint assignment activation distance 𝑑𝑑 𝑖𝑖 𝑗𝑗 𝑐𝑐 𝑒𝑒𝑒𝑒
for every pair ( 𝑖𝑖 , 𝑗𝑗 ), we calculate
most likely structure population 𝑿𝑿 by simulated anneling through molecular dynamics. The
scoring (energy) function of structure population can be written as a sum of sub-scoreing function
of individual structures:
𝐹𝐹 = � 𝑓𝑓 𝑚𝑚 𝑀𝑀 𝑚𝑚 = 1
And for individual structures:
𝑓𝑓 𝑚𝑚 =
1
2
� max( ‖ 𝑋𝑋 𝑖𝑖 𝑚𝑚 ‖
2
− 𝐼𝐼 , 0)
2
2 𝑁𝑁 𝑖𝑖 = 1
+
1
2
� min ( � 𝑋𝑋 𝑖𝑖 𝑚𝑚 − 𝑋𝑋 𝑗𝑗 𝑚𝑚 �
2
− � 𝑟𝑟 𝑖𝑖 + 𝑟𝑟 𝑗𝑗 �, 0)
2
2 𝑁𝑁 𝑖𝑖 , 𝑗𝑗 = 1
𝑖𝑖 > 𝑗𝑗 +
1
2
� max ( � 𝑋𝑋 𝑖𝑖 𝑚𝑚 − 𝑋𝑋 𝑗𝑗 𝑚𝑚 �
2
− 𝑑𝑑 𝑒𝑒 𝑐𝑐 𝑐𝑐 𝑐𝑐 𝑒𝑒𝑒𝑒 𝑐𝑐 𝑒𝑒 𝑖𝑖 𝑒𝑒 𝑒𝑒 , 0)
2
2 𝑁𝑁 𝑖𝑖 , 𝑗𝑗 = 1
𝑖𝑖 = 𝑗𝑗 − 1; 𝑖𝑖 , 𝑗𝑗 𝑖𝑖 𝑐𝑐 𝑐𝑐𝑐𝑐 𝑚𝑚 𝑒𝑒 𝑒𝑒 ℎ 𝑐𝑐𝑐𝑐 𝑚𝑚 +
1
2
� max ( � 𝑋𝑋 𝑖𝑖 𝑚𝑚 − 𝑋𝑋 𝑗𝑗 𝑚𝑚 �
2
− 2 � 𝑟𝑟 𝑖𝑖 + 𝑟𝑟 𝑗𝑗 �, 0)
2
2 𝑁𝑁 𝑖𝑖 , 𝑗𝑗 = 1; 𝑖𝑖 > 𝑗𝑗 𝑖𝑖 , 𝑗𝑗 ∈{ 𝑖𝑖 , 𝑗𝑗 : 𝑒𝑒 𝑖𝑖𝑖𝑖
≤ 𝑒𝑒 𝑖𝑖𝑖𝑖
𝑚𝑚 𝑎𝑎 𝑎𝑎 }
Where the first term of represents the nuclear volume restraints where 𝐼𝐼 is the nuclear radius.
The second term represents the excluded volume restraints that every bead pair ( 𝑖𝑖 , 𝑗𝑗 ) cannot
overlap with each other. The third term represents the consecutive domain restrains for those
beads in the same chromosome. The forth term represents the Hi-C contact restraints where a
harmonic upper bond restraint is assigned for the bead pair ( 𝑖𝑖 , 𝑗𝑗 ) whose distance 𝑑𝑑 𝑖𝑖 𝑗𝑗 ≤ 𝑑𝑑 𝑖𝑖 𝑗𝑗 𝑐𝑐 𝑒𝑒𝑒𝑒
.
123
A.3.6 Wasserstein distance for comparing cumulative distributions
For any give distribution 𝑓𝑓 ( 𝑥𝑥 ) and 𝑔𝑔 ( 𝑥𝑥 ) , we calculate wasserstein distance for the two
distributions as:
𝑊𝑊 𝑊𝑊 = inf � | 𝑓𝑓 ( 𝑥𝑥 ) − 𝑔𝑔 ( 𝑥𝑥 )| 𝑑𝑑 𝑥𝑥
For discrete case as we compare cumulative distributions of co-localization in FISH experiments
and models we compute Wasserstein distance as:
𝑊𝑊 𝑊𝑊 = � | 𝑓𝑓 ( 𝑥𝑥 ) − 𝑔𝑔 ( 𝑥𝑥 )| Δ 𝑥𝑥
Which measures the difference of area under the curve between the two probability distributions.
A.3.7 Assigning TADs with cell replication phase
We extracted Repli-seq signal from
2
for cell line GM12878. The replication timing data reflects
the phase for replication event to happen in G1b, S1, S2, S3, S4 and G2. The data present as the
normalized percentage of tag densities for all cell phase fractions. We then calculate the average
normalized percentage of tag densities for each TAD and assign the TAD with the cell phase with
the highest average percentage. For example, if a TAD shows the average normalized percentage
of tag density for G1b, S1, S2, S3, S4 and G2 as 10%, 45%, 20%, 10%, 5%, 0%, then the TAD is
assigned with S1 phase.
A.3.8 Assigning TADs with subcompartment states
In Rao et al 2014, six nuclear subcompartments are defined according to intra-chromosome
contact profiles using Hidden Markov Model. The definition of subcompartment states are
retrieved from
10
. However these states are assigned on 100kb level. For each TAD segmentation
of chromatin, we calculated the proportion of each subcompartment state. The state of the TAD
is then set to the majority of its constituent state (50% or more). If there is no majority constituent
state, then we assign the TAD as NA. For example, for one TAD 75% of the region is A1 state,
then we define the TAD state as A1. If there is no one state that occupies 50% or more of the TAD
124
region, the TAD state will be undefined and will not be included in the further analysis.
A.3.9 Shell State Probability
To map the preferred positions of TADs in the nucleus, we divide the nuclear volume into 5
concentric shells 𝑆𝑆 = { 𝑆𝑆 1
, 𝑆𝑆 2
, 𝑆𝑆 3
, 𝑆𝑆 4
, 𝑆𝑆 5
} for each individual structure. Each shell 𝑆𝑆 𝑐𝑐 is defined by
a set of lower and upper bounds for the radial positions. Boundaries are set so that each shell
contains 20% of the total number of TADs. In other words, based on their radial positions, all TADs
in a genome structure are partitioned into the 5 shells such that the number of domains is equally
divided between the shells. Then for each structure we estimate the shell state probability
𝑃𝑃 ( 𝑆𝑆 𝑐𝑐 | 𝑋𝑋 ): the probability of a TAD for a given state 𝑋𝑋 to be located in any of the five-shell volume.
We estimate this value by dividing the number of the state-specfic TADs in each shell volume by
the total number of this state-specfic TADs. In this way ∑ 𝑃𝑃 ( 𝑆𝑆 𝑐𝑐 | 𝑋𝑋 )
𝑐𝑐 = 1.
A.3.10 2D localization distribution
The distribution is calculated by selecting 10 randomly oriented great circle planes through the
center of the nucleus for each of the 10,000 structures. If a TAD intersects a plane its 2D location
in the plane is considered. We then align all 100,000 planes and average the 2D locations for all
detected TADs of a given state. These state distributions are then normalized against the 2D
distribution of all TADs in the nucleus. This procedure leads to a 2D histogram for the distribution
of the nuclear positions for TADs of a given chromatin state.
A.3.11 TAD interaction network
A TAD interaction network (TIN) is a graph represents the contact profile of a given set of TADs
in one structure model. Each vertex of TIN represents a TAD. Each edge is drawn if the TAD-TAD
contact presents, which is defined if spatial distance of the pair of TADs 𝑑𝑑 𝑖𝑖 𝑗𝑗 is smaller or equal
to the 2 times sum of the radius of corresponding beads: 𝑑𝑑 𝑖𝑖 𝑗𝑗 ≤ 2( 𝑟𝑟 𝑖𝑖 + 𝑟𝑟 𝑗𝑗 ). Then we can construct
a corresponding TAD interaction network for each set of subcompartment state TADs in each
structure.
125
A.3.12 Maximal cliques enrichment (MCE)
Clique is a subset of nodes in a network that all nodes are adjacent to each other, in other words
it is a subnetwork that is fully connected. The maximal clique refers to the clique that cannot be
enlarged. The number of maximal cliques in a network reflects whether the network tends to
cluster or not.
For each network (TIN) we calculated the number of maximal cliques. Then randomly shuffle the
states for each TAD and regenerated 10 TINs for each state as controls. The number of maximal
cliques enrichment (MCE) is then calculated using number of maximal cliques of states divide by
the mean of 10 random controls. High maximal cliques enrichment (MCE) shows formation of a
tight structural subcompartment with high connectivity between TADs of the same state and a
high spatial compactness.
A.3.13 Predicting TSA-seq data using A1 subgraphs
Firstly we only use A1 subgraph those are larger than the median size. The center of mass is
calculated for each spatial subgraph and for each chromatin bead 𝑏𝑏 𝑖𝑖 , the signal is calculated
using:
𝑠𝑠 𝑖𝑖 𝑔𝑔𝐺𝐺 𝑎𝑎 𝑒𝑒 𝑖𝑖 =
1
𝑀𝑀 � 𝐺𝐺 − 4 ‖ 𝑏𝑏 𝑖𝑖 − 𝐴𝐴 1
𝑎𝑎 𝑐𝑐 𝑐𝑐 𝑎𝑎 𝑐𝑐𝑐𝑐
‖
𝑀𝑀 𝑚𝑚 = 1
Where 𝑀𝑀 is the number of structures in the population. Here we use 10,000. Then the genome
wide signals for each chromatin bead are normalized by the genome-wide mean value. The final
predicted signal is calculated by taking the natural logarithm of the normalized values.
A.3.14 Detect Dip TADs based on Radial Position
By investigating the radial positioning of each TAD in our modeling results, some TADs tend to
reside way inner compared to the neighborhood TADs. We define those inner TADs as “dips”
according to their quadratic-like characteristic in radial position curves. In order to detect these
dips, we formulize this task as an inverse of peak detection problem: (1) A dip TAD must be the
126
local minima in terms of radial position. In other words, the radial position must be smaller than
its immediate neighbors. (2) Any two dips can’t be too close to each other. (3) A valid dip should
be inner enough compared to several indirect neighbors.
Under these assumptions we designed a dip-detection algorithm. For each chromosome radial
position values, firstly identify all local minima TADs. Then sort all TADs according to the radial
position. Keep removing all high dips until there is only one dip in any 10 adjacent TADs. At last
we validate all dips those satisfy this formula:
∀ 𝑖𝑖 ∈ { 𝑇𝑇 𝐴𝐴𝑊𝑊 < 𝑖𝑖 > 𝑖𝑖 𝑠𝑠 𝑎𝑎 𝑑𝑑 𝑖𝑖 𝑝𝑝 }:
𝐼𝐼 𝑖𝑖 − 𝑚𝑚 𝑖𝑖𝐺𝐺 � 𝐼𝐼 𝑗𝑗 � 𝑖𝑖 − 𝑟𝑟 ≤ 𝑗𝑗 ≤ 𝑖𝑖 + 𝑟𝑟 �
𝑚𝑚 𝑎𝑎𝑥𝑥 � 𝐼𝐼 𝑗𝑗 � 𝑖𝑖 − 𝑟𝑟 ≤ 𝑗𝑗 ≤ 𝑖𝑖 + 𝑟𝑟 � − 𝑚𝑚 𝑖𝑖𝐺𝐺 � 𝐼𝐼 𝑗𝑗 � 𝑖𝑖 − 𝑟𝑟 ≤ 𝑗𝑗 ≤ 𝑖𝑖 + 𝑟𝑟 �
< 𝑇𝑇
where 𝐼𝐼 𝑖𝑖 is the radial position for TAD 𝑖𝑖 and 𝑟𝑟 is the range for indirect neighbors, 𝑇𝑇 is some
threshold to quantify the innerness of the dip candidate. We choose 𝑟𝑟 = 5 and 𝑇𝑇 = 0.05 in our
algorithm.
A.3.15 TAD neighborhood
For each TAD 𝑖𝑖 we define the TAD neighborhood { 𝑗𝑗 } if the center-to-center distances of other
TADs { 𝑗𝑗 } to itself are smaller than 500nm, which can be expressed as a set 𝑁𝑁 𝐺𝐺 𝑖𝑖 = { 𝑗𝑗 : 𝑗𝑗 ≠ 𝑖𝑖 , 𝑑𝑑 𝑖𝑖 𝑗𝑗 <
500}. The neighborhood composition of state 𝑋𝑋 shows how frequent in the population that a TAD
of a state 𝑌𝑌 to appear in the neighborhood of any TAD in state 𝑋𝑋 , which is the set
{ 𝑁𝑁 𝐺𝐺 𝑖𝑖 : 𝑖𝑖 𝑖𝑖 𝑠𝑠 𝑠𝑠 𝑡𝑡 𝑎𝑎 𝑡𝑡 𝐺𝐺 𝑋𝑋 }.
We can also define neighborhood state if the majority of a TAD neighborhood appears to be one
particular state. Otherwise the neighborhood state is defined as “Mix” state. For instance, a
neighborhood is defined as being in state A1 if more than 50% of TADs in the neighborhood are
of state A1, otherwise it is a “Mix” neighborhood.
A.3.16 NAD related dip-TADs and NAD-Dominated Dip-clusters
We get the NAD definition from
137
and uplifted to genome assembly hg19. For any Dip-TAD, we
127
define NAD related Dip if there is an NAD that is within 3Mb apart. Then we define NAD-
Dominated clusters if the majority (>50%) of the Dip-cluster member TADs are NAD related Dips.
Abstract (if available)
Abstract
Background: The completion of the human genome DNA sequence was one of the most important accomplishments in biology. However, the genome has levels of complexity that go far beyond its linear sequence. One of the next grand challenges is to determine the detailed 3D folding architecture of the linear genomic DNA, and to reveal the functional implications of this level of organization. This 3D organization greatly influences gene functions, such as the expression or silencing of genes. For instances, gene expression is initiated by physical interactions of promoter and enhancer regions that often are fare apart in linear DNA sequence. Recent evidence shows that clustering of chromatin near nuclear bodies, such as nucleoli, speckles, and lamina are correlated with gene silencing and replication timing. Misfolding and spatial rearrangements of chromosomes frequently occur in cancer cells. For a better understanding of nuclear functions and its impact on disease, it is, therefore, necessary to study the 3D folding of the genome and the organization. ❧ Data about the nuclear organization can now be generated from a wide range of genomics and imaging experimental methods, each with their own strengths and limitations. For instance, chromosome conformation capture experiments, such as 3C, 4C, 5C, Hi-C, and ChIA-PET produce rich information about the probability of pairwise chromatin interactions on a genome-wide scale. On the other side, imaging approaches like DNA FISH and live-cell super-resolution microscopy can provide direct views of the organization and dynamics of chromatin, however typically only for a few gene loci at the same time. ❧ Task: These data provide unique opportunities to generate 3D structural representations of genomes to study detailed chromatin folding and partitioning of chromatin into subnuclear compartments, which will facilitate an in-depth analysis of spatial chromatin patterns and their functional relevance. However, this task is challenging due to the large size and diploidy of genomes and the stochastic nature of chromosome conformations. My PhD focuses on developing approaches to unite genomics with microscopic data by determining 3D structures of entire human genomes from genomics data and developing tools for structure-function mapping to extract the underlying principles and mechanisms of genome-wide chromosome organization. My goal is to acquire a more thorough insight into the basic principles governing the spatial organization of genomes, specifically the partitioning of genomic regions into subnuclear compartments and its impact on nuclear function.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
3D modeling of eukaryotic genomes
PDF
Understanding the 3D genome organization in topological domain level
PDF
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Exploring the application and usage of whole genome chromosome conformation capture
PDF
Exploring three-dimensional organization of the genome by mapping chromatin contacts and population modeling
PDF
Computational analysis of genome architecture
PDF
Machine learning of DNA shape and spatial geometry
PDF
Quantitative modeling of in vivo transcription factor–DNA binding and beyond
PDF
New tools for whole-genome analysis of DNA replication timing and fork elongation in saccharomyces cerevisiae
PDF
Data-driven approaches to studying protein-DNA interactions from a structural point of view
PDF
C. elegans topoisomerase II regulates chromatin architecture and DNA damage for germline genome activation
PDF
Forkhead transcription factors regulate replication origin firing through dimerization and cell cycle-dependent chromatin binding in S. cerevisiae
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
The function of Rpd3 in balancing the replicaton initiation of different genomic regions
PDF
Structural and biochemical studies of large T antigen: the SV40 replicative helicase
PDF
Understanding DNA methylation and nucleosome organization in cancer cells using single molecule sequencing
PDF
Computational methods for translation regulation analysis from Ribo-seq data
PDF
3D vessel mapping techniques for retina and brain as an early imaging biomarker for small vessel diseases
PDF
Genome engineering of filamentous fungi for efficient novel molecule production
Asset Metadata
Creator
Hua, Nan
(author)
Core Title
Mapping 3D genome structures: a data driven modeling method for integrated structural analysis
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
07/29/2019
Defense Date
06/20/2019
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
3D modeling,data integration,DNA,GM12878,H1-ESC,HFFc6,Hi-C,human genome,integrated analysis,nuclear organization,nuclear subcompartments,OAI-PMH Harvest,simulated annealing,software design,structural biology,TAD,topologically associated domain
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Alber, Frank (
committee chair
), Chen, Lin (
committee member
), Katritch, Vsevolod (
committee member
), Rohs, Remo (
committee member
)
Creator Email
huanan1991@gmail.com,nhua@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-197691
Unique identifier
UC11663112
Identifier
etd-HuaNan-7662.pdf (filename),usctheses-c89-197691 (legacy record id)
Legacy Identifier
etd-HuaNan-7662.pdf
Dmrecord
197691
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Hua, Nan
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
3D modeling
data integration
DNA
GM12878
H1-ESC
HFFc6
Hi-C
human genome
integrated analysis
nuclear organization
nuclear subcompartments
simulated annealing
software design
structural biology
TAD
topologically associated domain