Exploring the Application and Usage of Whole Genome Chromosome
Conformation Capture
by
Haochen Li
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(MOLECULAR BIOLOGY)
August 2017
Table of Contents
Abstract
Chapter 1 Introduction
1.1 Hierarchical structure of the eukaryotic DNA genome
1.1.1 Primary DNA structures revealed by X-ray crystallography and NMR
1.1.2 Chromatin structure and the 30-nanometer fiber
1.1.3 Microscopy observations into eukaryotic genome organization
1.2 Capture genome 3D structure by the “C” methods
1.2.1 3C technology: toward three-dimensional genomics
1.2.2 Chromosome conformation capture-on-chip (4C) technology
1.2.3 Chromosome conformation capture carbon copy (5C) technology
1.2.4 Hi-C: the whole-genome chromosome conformation capture technology
1.3 Challenges in interpreting chromosome capture experiments
Chapter 2 Mapping Virus-Host Whole Genome Interactions by Tethered Chromosome Conformation Capture
2.1 Introduction
2.2 Methods
2.2.1 Cell culture and TCC experiments
2.2.2 Preprocessing TCC sequencing output
2.2.3 Splitting test for filtering low coverage bins of TCC observed contact (OC) matrix and choosing optimal resolution to visualize VGI contact profile
2.2.4 Bias correction of TCC datasets
2.2.5 Identification of VGI enriched genomic regions by hypothesis testing
2.3 Results and Discussions
2.3.1 Genome-wide interactions between adenovirus and host genome
2.3.2 Identification of viral DNA enriched regions on the host genome
2.3.3 Viral DNA enriched regions co-localize in 3D space prior to infection
2.3.4 Viral DNA preferentially interacts with active host chromatin regions and chromatin regions marked by histone modification H3K27me3
2.3.5 Relations between viral DNA contacts and DNA replication
2.3.6 Remodeling of epigenetic features in the VGI enriched host genomic regions
2.4 Conclusions
Chapter 3 Hi-C Data Normalization by Restriction End Sequencing
3.1 Introduction
3.2 Methods
3.2.1 Prepare restriction end sequencing library for Hi-C data normalization
3.2.2 Mapping RES sequencing output and extract restriction end alignments
3.2.3 Retrieve RES reads from Hi-C sequencing library
3.2.4 Normalize Hi-C contact matrix by RES sequencing data
3.3 Results and Discussions
3.3.1 RES normalization by Hi-C dangling end sequences reproduces normalization result by RES control experiment
3.3.2 Differences between RES normalization and genome-wide matrix balancing normalization
3.3.3 Comparison of Hi-C data reproducibility between RES normalization and genome-wide matrix balancing normalization
3.3.4 RES performance in high resolution Hi-C contact matrix
3.4 Conclusions
References
Abstract
Chromosomes from both eukaryotes and prokaryotes not only convey information through their
linear DNA sequence but also contribute to the regulation of a number of DNA-related metabolic
processes through their three-dimensional arrangements. Molecular biology techniques, such as
Chromosome Conformation Capture (3C) and the 3C-based methods, allow us to investigate the
three-dimensional organizations of the genome in high resolution and high throughput. In this
dissertation, we explored whole-genome chromosome conformation capture (Hi-C) techniques,
developing novel Hi-C applications to study virus-host genome interactions and a more reliable
Hi-C data normalization method, in order to realize the potential of Hi-C data for deciphering
3D genome structures.
Viruses have evolved a variety of mechanisms to interact with host cells for their adaptive
benefits, including subverting host immune responses and hijacking host DNA
replication/transcription machineries. Although interactions between viral and host proteins have
been studied extensively, little is known about how the viral genome may interact with the host
genome and how such interactions could affect the activities of both the virus and the host cell.
Since the three-dimensional organization of a genome can have significant impact on genomic
activities such as transcription and replication, we hypothesize that such structure-based
regulation of genomic functions also applies to viral genomes depending on their association
with host genomic regions and their spatial locations inside the nucleus. Here, we used Tethered
Chromosome Conformation Capture (TCC) to investigate viral-host genome interactions
between the adenovirus and human lung fibroblast cells. We found viral-host genome
interactions were enriched in certain active chromatin regions and chromatin domains marked by
H3K27me3. The contacts by viral DNA seem to impact the structure and function of the host
genome, leading to remodeling of the fibroblast epigenome. Our study represents the first
comprehensive analysis of viral-host interactions at the genome structure level, revealing
unexpectedly specific virus-host genome interactions. The non-random nature of such
interactions indicates a deliberate but poorly understood mechanism for targeting of host DNA
by foreign genomes.
Extracting biologically meaningful information about chromosomal interactions obtained with
high-throughput sequencing chromosome conformation capture (Hi-C) experiments requires the
elimination of systematic biases. So far, data normalization has been performed by computational
methods that correct biases either by matrix balancing or by probabilistic modeling of explicit
bias factors. Here, we present a control-based data normalization pipeline that relies on
sequencing and alignment of restriction-end DNA fragments, obtained either by performing a
Restriction End Sequencing (RES) control experiment or by retrieving RES reads from the actual
Hi-C sequencing library. We validate RES normalization with both sources of RES
control and show that our normalization is robust and reproducible. This pipeline works in
particular for Hi-C based methods following protocols with biotin label attachments at restriction
ends after the chromatin enzyme digestion step. We compare the results of RES normalization
and whole genome matrix balancing normalization and observe major differences at regions
close to the telomeres and centromeres. Furthermore, we show that the RES normalization
technique results in higher data reproducibility than matrix balancing normalization in terms of
overall cis and trans one-dimensional contact coverage. Lastly, we discuss the potential
limitations and technical challenges of the RES normalization technique.
Chapter 1 Introduction
1.1 Hierarchical structure of the eukaryotic DNA genome
Genetic information in mammalian cells is stored in double-helical fiber molecules known as
DNA. The total length of the DNA fiber far exceeds the diameter of the nucleus into
which the DNA is compacted and packed, which makes understanding the spatial organization
of the genome very challenging. Based on current research, genome architecture is organized
hierarchically: from DNA double helices to nucleosomes; from 10 nm fibers to 30 nm
fibers; from hetero/euchromatin to chromosome territories. Revealing structures at different
scales requires different technologies: from X-ray crystallography, to electron microscopy, to
light/fluorescence microscopy and FISH.
1.1.1 Primary DNA structures revealed by X-ray crystallography and NMR
Because of its central importance to life, DNA has been the subject of ever more intensive research.
Research interest in the structure of DNA culminated in Watson and Crick’s double-helical
structure [1], a model that was based on evidence obtained by Franklin, Chargaff, and others
[2,3]. The double-helix was only the first in a line of discoveries concerning the three-
dimensional structure of DNA molecules; the structures of A-DNA and Z-DNA were
characterized later [4,5], as were the less abundant triple-stranded and quadruple-stranded
structures of DNA [6–8].
The focus of the above structural studies is the DNA molecule itself. Inside eukaryotic cells,
however, DNA operates as a part of a protein-DNA-RNA conglomerate known as “chromatin”.
While it is the DNA molecule alone that stores the genetic information, maintenance,
transcription, and propagation of this information are orchestrated and regulated in the larger
context of chromatin. As a matter of course, chromatin has also been subject to intense structural
interest.
The total chromatin in a eukaryotic nucleus comprises a limited number of chromosomes. Each
chromosome contains a single very long molecule of DNA, the average length of which, in
human cells, for instance, is more than four centimeters (∼1.7 inches). As a consequence,
chromatin structure is much more complicated than the structure of DNA itself and can be
studied at various levels. In particular, chromatin structure can be studied at the finest level of
DNA and protein structures, or at the coarsest level of spatial arrangement of chromosomes, or
any other level in between.
1.1.2 Chromatin structure and the 30-nanometer fiber
The most fundamental features of chromatin behavior are determined by local interactions
between its proteins and its DNA in the context of relatively small protein-DNA complexes.
These complexes are the only level of chromatin structure for which reliable atomic resolution
structures are available. The most abundant protein components of chromatin are histones, which
package DNA into structural units known as nucleosomes. Nucleosomes represent the most basic
unit of chromatin and their structure has been elucidated at an atomic resolution [9]: they consist
of ~146 base pairs (bp) of DNA wrapped in 1.65 left-handed superhelical turns around an
octameric core of histones.
Histones not only play a central role in packaging DNA, they also regulate its accessibility and
function. The nucleosome structure clearly shows that, when DNA is wrapped around the core
histones, the amino termini of the eight histone molecules that compose the core are accessible;
other proteins can interact with and modify these histone “tails.” Various histone modifications,
including monomethylation, dimethylation, trimethylation, acetylation, phosphorylation,
ubiquitination, ADP-ribosylation, and SUMOylation on several residues in the histone tails have
been identified [10]. While some specific modifications remain poorly understood, many can be
characterized as either activating or repressing gene expression.
It is believed that some histone modifications exert their effect by causing condensation or
decondensation of the nucleosomes, resulting in respectively lower and higher accessibility of
DNA [11,12]. These modifications, and histone tails in general, also serve as recruiting platforms
for various protein complexes.
Nucleosomes are not the only protein-DNA complexes in chromatin. Many other proteins,
including transcription factors and other DNA binding proteins, can be considered a part of
chromatin as well. These proteins are necessary for all functions of chromatin, including proper
expression of genes, identification of damaged DNA, and replication of the genetic material. The
structures of these proteins have been extensively studied, most often in complex with DNA [13–17].
Individual nucleosomes are connected to each other by a piece of linker DNA that varies in
length, but is typically within the range of 10 to 80 bp [18]. This linear array of nucleosomes
separated by a thin filament has a “beads on a string” appearance under the microscope
and is also referred to as the 10-nm fiber based on its diameter [19]. At the next level of
organization, this 10-nm fiber is believed to fold into a fiber of ∼30 nanometers in diameter
[18,20,21]. Experimental evidence suggests that both the 10-nm and the 30-nm forms of the fiber
inhabit the nucleus at the same time [22].
While an atomic-resolution structure of the 10-nm fiber can be directly inferred from those of
the nucleosome and free DNA, the exact structure of the 30-nm fiber is unknown and subject to
some controversy [23–25]. Two competing models for this structure remain popular [23,26]. One
is the solenoid model (one-start helix model), in which the linker DNA that connects adjacent
nucleosomes is bent between them such that they follow a superhelical path with about 6-8
nucleosomes per turn [24,26,27]; the other model is the zig-zag model (two-start helix model), in
which adjacent nucleosomes are connected by a straight linker DNA and show a zig-zag
arrangement such that each nucleosome in the fiber binds to its second neighboring nucleosome
[24,25,28,29]. These models are not mutually exclusive; it is possible that different regions of
chromatin assume one or the other structure at different times. In in vitro and in silico
experiments, the dominance of either the solenoid or the zig-zag form is driven by the total
concentration of bivalent and monovalent salts [23,26].
1.1.3 Microscopy observations into eukaryotic genome organization
Chromosomes occupy their own territory in the cell nucleus [30,31] and adopt a preferential
radial position within the nucleus, with large chromosomes found more often at the nuclear
periphery and small chromosomes found more often in the interior. The spatial separation of
chromosomes in the nucleus is not absolute: chromosomal intermingling takes place at the periphery of the
territories [32]. A further division can be found within the chromosome territory, with gene-poor
and gene-rich regions spatially separated [33,34].
The segregation of active and inactive chromatin inside the nucleus raises the possibility that
nuclear positioning affects gene activity. This idea is supported by DNA fluorescence in situ
hybridization (FISH) observations that certain genes (e.g., HoxB and uPA) loop out of their
chromosome territory upon activation [35,36]. This is likely driven by regulatory DNA
sequences, such as enhancers and locus control regions (LCRs) [37]. In addition, a correlation
between expression status and positioning relative to the nuclear periphery and peri-centromeric
heterochromatin has been observed for some genes (most notably the immunoglobulin and
beta-globin loci) [38,39]. In these instances, the silent genes are found to be closer to these nuclear
landmarks than the active ones [40]. Again, evidence exists that this positioning can be
controlled by transcription factors binding to regulatory DNA sequences [41].
The proximity of a gene relative to other genes has also been associated with its regulation. This
suggestion came from FISH experiments showing that a number of erythroid-specific genes that
are located far apart on a chromosome specifically colocalize with each other when they are
actively transcribed [42]. While this may suggest that genes have a relatively unconstrained
ability to move and search for preferred neighboring genes, live-cell imaging studies suggest
otherwise. Tagging loci with arrays of bacterial operator sequences and expressing the cognate
DNA-binding protein fused to a fluorescent protein allow for the spatial and temporal tracking of
genes. This has revealed that genomic loci generally show constrained motion within a small
subvolume of the mammalian cell nucleus [43–45]. Targeting specific factors to these arrays can
sometimes—but not always—induce repositioning, as was seen for lamin-associated proteins
that can direct loci to the nuclear periphery [46–48]. This, again, sometimes—but not always—
leads to reduced transcriptional output of genes surrounding the arrays on the linear chromosome.
The power of FISH and other microscopy methods lies in their ability to do single-cell analyses
of gene positioning [49,50]. However, on a genomic and cell population scale, they are limited in
throughput and resolution. It is therefore unclear whether they uncover general principles of
nuclear organization or the peculiarities of individual genes.
Genomics methods are now available for investigating nuclear organization. DamID [51] is one
such method that, when directed to proteins of the nuclear lamina, provides maps of chromatin
associated with the nuclear periphery [52]. In mice and in human cell lines, these lamina-
associated domains (LADs) are megabase-sized regions from across the genome. They are
generally gene-poor, transcriptionally inactive, and late replicating [53,54]. Interestingly, genes
that are activated or poised for transcription can dissociate from the nuclear lamina [53].
Besides DamID, a series of other genomics approaches has been developed that measures DNA–
DNA contact probabilities. These strategies enable the highly detailed uncovering of
chromosome topology.
1.2 Capture genome 3D structure by the “C” methods
Chromosome conformation capture (3C), introduced by Job Dekker and colleagues [55], provides a new
rationale for studying genome architecture. The underlying basis of these “C” methods is to
preserve spatial proximity (a.k.a. interactions) between chromatin loci using formaldehyde
crosslinking and to quantify interaction frequencies between loci of interest as a measure of their
spatial proximity. The approach is based entirely on molecular biology, and interactions were
initially quantified by PCR. Depending on the quantification approach that is employed, conformation
capture methods differ in scope and in the number of interactions that they are able to quantify.
1.2.1 3C technology: toward three-dimensional genomics
The seminal study by Dekker et al. [55] describing the 3C method has sparked the development
of a large number of 3C-derived genomics methods. To understand their potential and limitations,
we first look at the principles and applications of the underlying 3C technology.
The initial step in 3C and 3C-derived methods is to establish a representation of the 3D
organization of the DNA. To this end, the chromatin is fixed using a fixative agent, most often
formaldehyde [55]. Next, the fixed chromatin is cut with a restriction enzyme recognizing 6 base
pairs (bp), such as HindIII [55], BglII [56], BamHI [57], and EcoRI [58], or with more frequent
cutters, such as AciI [59] and DpnII [57,60]. In the subsequent step, the sticky ends of the
cross-linked DNA fragments are religated under dilute conditions to promote intramolecular ligations
(i.e., between cross-linked fragments). DNA fragments that are far apart on the linear template,
but colocalize in space, can, in this way, be ligated to each other. A template is thereby created
that is, in effect, a one-dimensional (1D) cast of the 3D nuclear structure.
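As an illustrative aside, the digestion step can be mimicked in silico. The minimal Python sketch below cuts a sequence at every HindIII (AAGCTT, a 6-bp site) or DpnII (GATC, a 4-bp site) recognition sequence. The function names and the toy sequence are hypothetical and are not part of any protocol described here. Under the simplifying assumption of random sequence with equal base frequencies, an n-bp site occurs about once every 4^n bp, i.e., roughly every 4,096 bp for a 6-bp cutter and every 256 bp for a 4-bp cutter, which is why the more frequent cutters yield smaller fragments.

```python
import re

# Recognition sequences and cut offsets for two enzymes named in the text.
# HindIII cuts A^AGCTT (offset 1); DpnII cuts ^GATC (offset 0).
SITES = {"HindIII": ("AAGCTT", 1), "DpnII": ("GATC", 0)}


def digest(seq, enzyme):
    """Return the fragments obtained by cutting seq at every recognition site."""
    site, offset = SITES[enzyme]
    # A lookahead pattern reports every occurrence, including overlapping ones.
    cuts = [m.start() + offset for m in re.finditer(f"(?={site})", seq.upper())]
    bounds = [0] + cuts + [len(seq)]
    # Drop empty fragments (e.g., when a site sits at the very start).
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def expected_fragment_length(site_len):
    """Mean site spacing in random sequence with equal base frequencies."""
    return 4 ** site_len


toy = "CCAAGCTTGGGATCAA"  # hypothetical toy sequence
print(digest(toy, "HindIII"))  # ['CCA', 'AGCTTGGGATCAA']
print(digest(toy, "DpnII"))    # ['CCAAGCTTGG', 'GATCAA']
```

Real genomes are not random sequence, so actual fragment-size distributions deviate from the 4^n expectation.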
The way to establish the 3D conformation of a locus or chromosome is to measure the number of
ligation events between nonneighboring sites. In 3C, this is done by semiquantitative [55] or
quantitative [61,62] PCR amplification of selected ligation junctions. For this, primers are
designed near and toward the ends of all restriction fragments of interest. By comparing the
amplification efficiency of different primer combinations, a matrix of ligation frequencies is
established whose entries serve as proxies for pairwise interaction frequencies.
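The construction of such a matrix can be sketched as follows. This is a minimal illustration, not the procedure of the cited studies. It assumes that the PCR signal of each primer pair is divided by the signal the same pair gives on a control template in which all ligation products are equally represented, so that differences in primer efficiency cancel out. The function name and the input layout (square arrays indexed by restriction fragment) are hypothetical.

```python
import numpy as np


def ligation_frequency_matrix(sample_signal, control_signal):
    """Relative ligation frequency for each primer pair.

    Both inputs are square arrays in which entry (i, j) is the PCR signal
    for the primer pair on fragments i and j. Pairs with no control signal
    cannot be normalized and are reported as NaN.
    """
    sample = np.asarray(sample_signal, dtype=float)
    control = np.asarray(control_signal, dtype=float)
    freq = np.full(sample.shape, np.nan)
    measurable = control > 0
    freq[measurable] = sample[measurable] / control[measurable]
    return freq


# Hypothetical two-fragment example.
print(ligation_frequency_matrix([[0, 4], [4, 0]], [[1, 2], [2, 1]]))
```

Reporting unmeasurable pairs as NaN rather than zero keeps a missing measurement from being mistaken for an absent interaction.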
In the original study by Dekker et al. [55], this matrix was used to determine the average 3D
conformation of yeast chromosome III, showing that it forms a contorted ring. The method was
then adapted for the mammalian system and used to demonstrate that chromatin loops exist in
vivo between regulatory DNA elements and their target genes. This was originally done in
studies investigating the beta-globin locus, where the upstream LCR was shown to physically
interact with the active globin genes, thereby looping out the intervening 30–50 kb of chromatin
fiber [56]. The term active chromatin hub (ACH) was introduced to describe such spatial
clustering of genic sequences with surrounding regulatory sequences [56]. The composition of
the beta-globin ACH was demonstrated to dynamically follow the transcriptional changes that
accompany development and differentiation [63], and transcription factors were found to drive or
stabilize its formation [62,64,65]. 3C technology also demonstrated looping between regulatory
sequences and genes at other loci, including the H19–Igf2 locus [66], the TH2 interleukin locus [67], and
the alpha-globin locus [68].
With 3C, it is also possible to pick up enhancers that were previously unknown to regulate a gene.
A survey of the spatial environment of the CFTR gene—which, when mutated, can cause cystic
fibrosis—identified a number of cell type-specific interactions. Some of these sequences showed
synergistic enhancer activity in a reporter assay, suggesting that they may activate CFTR
expression [69].
Enhancer activity on gene expression can be blocked by insulator sequences [70]. Insulator
sequences are bound by proteins such as CTCF in mammals [71] or Su(Hw) in flies [72]. 3C
technology has been used to demonstrate that the function of certain insulators is dependent on
spatial organization. In mammals, CTCF sites form chromatin loops by contacting each other in
the beta-globin locus [62] and H19/Igf2 locus [73]. CTCF recruits additional factors to its
binding sites, such as cohesin [74,75] and TAF3 [76], which may facilitate DNA loop
formation [77]. In Drosophila, a single gypsy insulator element was shown to block the
repressive effects of an upstream Polycomb response element (PRE) on a downstream reporter
gene. Interestingly, two gypsy insulator elements located between the repressive element (PRE) and the
transgene rescued the PRE-mediated gene silencing by forming a loop and bringing the PRE
spatially closer to the gene [60].
In addition to enhancer and insulator loops, a number of recent studies have pointed to the
existence of loops between the start and end of a gene. 3C experiments in mouse liver cells
showed that ribosomal DNA promoters have an increased propensity to interact with terminator
sequences and that these loops are associated with increased rDNA expression [78]. Promoter–
terminator looping has been observed on RNA polymerase II transcribed sequences. In yeast,
loops form on genes when they are either active or poised, but not when they are repressed [79].
In human cells, loop formation between the two long terminal repeats (LTRs) of the HIV
provirus has been shown to be dependent on gene expression [80]. It has also been proposed that
gene loops have a role in transcriptional memory: Gene loops form after an initial round of (slow)
gene activation and are essential for subsequent fast reactivation [81]. Although promoter–
terminator looping is generally associated with increased gene activity, for the BRCA1 gene, the
looped conformation is actually associated with lower expression compared with the nonlooped
conformation [57]. In this case, the gene loop is thought to confer repression on the locus.
Various technical issues need to be considered when interpreting 3C data. For example, it is
important to realize that any two sequences nearby on the linear chromosome are, by definition,
close in space, and therefore sequences over hundreds of kilobases frequently crosslink and
ligate to the anchor, independently of the chromatin’s 3D conformation. Thus, in order to
appreciate loops visualized by 3C-based technologies, one needs to find the anchor interacting
with a distant sequence more frequently than with intervening sequences. Therefore, 3C methods
intrinsically rely on quantitative rather than qualitative measurements. 3C technology uses PCR
for the quantitative detection of a given ligation junction. The importance of this assessment is
underscored by the following consideration: At most alleles, cross-linking will result in larger
chromatin aggregates with many DNA fragments together (“hairballs”), within which all DNA
ends compete with each other for ligation to the anchor fragment. Even a frequent and stable
enhancer–promoter interaction will therefore only occasionally result in the corresponding
ligation junction. Combined with the fact that every anchor fragment is only present twice in a
diploid cell and that a single cell therefore contributes at most two ligation junctions of
interest, this implies that 3C PCR requires faithful and quantitative amplification of very rare
ligation junctions from many genome equivalents. For this and other reasons, (semi)quantitative
3C PCR is notoriously difficult and requires strict controls and careful experimental design and
data interpretation [82–84].
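The requirement that a loop show the anchor contacting a distant sequence more often than the intervening sequences can be illustrated with a small sketch. The criterion used here is an assumption made for illustration only, not a published peak-calling method: a fragment is flagged when its ligation frequency exceeds, by a chosen fold factor, the highest frequency among the few fragments lying immediately between it and the anchor. The function name, the `window` and `fold` parameters, and the numbers are all hypothetical.

```python
def candidate_loops(distances, freqs, window=2, fold=1.5):
    """Return distances of fragments that stand out above intervening ones.

    distances: distance of each fragment from the anchor (e.g., in kb).
    freqs: ligation frequency of each fragment with the anchor.
    A fragment is flagged if its frequency is more than `fold` times the
    maximum frequency of the `window` next-closest fragments to the anchor.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    hits = []
    for rank in range(window, len(order)):
        i = order[rank]
        intervening = [freqs[order[r]] for r in range(rank - window, rank)]
        if freqs[i] > fold * max(intervening):
            hits.append(distances[i])
    return hits


# Frequencies decay with distance, except for a bump at 50 kb.
print(candidate_loops([10, 20, 30, 40, 50, 60], [100, 50, 25, 12, 40, 6]))  # [50]
```

In the toy profile, the fragments at 10 and 20 kb have the highest frequencies but are not flagged, because proximity alone explains them. Only the fragment at 50 kb rises above its intervening neighbors.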
Despite its inherent difficulties, 3C has been instrumental in delineating chromatin loops
between sites relatively close on the linear chromosome template. However, for sites separated
by distances of more than a few hundred kilobases, specific ligation products become too
infrequent to be accurately quantified by 3C PCR. The advent of genome-scale methods such as
microarrays and high-throughput sequencing has enabled the development of more unbiased
methods that offer a solution for assessing the relative abundance of such long-range DNA–DNA
contacts.
1.2.2 Chromosome conformation capture-on-chip (4C) technology
4C originally combined 3C technology with microarrays to analyze the contacts of a selected
genomic site (or “viewpoint”) with all of the genomic fragments that are represented on the
array [33]. 4C-seq refers to the same strategy, but uses next-generation sequencing (NGS)
instead of microarrays to analyze contacting sequences [85]. 4C is also an acronym for circular
chromosome conformation capture [73], which uses a slightly different protocol. 4C is known as
a “one versus all” strategy because, in it, a single viewpoint is defined, and the genome is
screened for sequences that contact this selected site [86].
The practical steps involved in 4C technology are explained in detail in Simonis et al. [33,84]. In
brief, in 4C technology, the ligated 3C template is processed with a second round of DNA
digestion and ligation to create small DNA circles (some of which contain the 3C ligation
junctions). Using viewpoint-specific primers, inverse PCR specifically amplifies all sequences
contacting this chromosomal site. They can then be analyzed by microarrays or, nowadays, by
NGS methods. The latter is cheaper, provides higher resolution, enables more accurate
quantification of DNA interaction frequencies, and has a larger dynamic range.
4C technology was first applied to investigate the DNA interaction profiles of a tissue-specific
gene (beta-globin) embedded in an inactive chromosomal region and a house-keeping gene
(Rad23a) present in an active gene-rich region [33]. Rad23a made contacts with active regions
on its own chromosome and on other chromosomes. The profile of contacts was largely
conserved between two tissues. The erythroid-specific beta-globin locus, on the other hand, had a
profoundly different profile of interactions, depending on its expression status. In erythroid cells,
the locus contacts other active regions, whereas in fetal brains, where the locus is inactive,
inactive regions are contacted. As such, 4C demonstrated the separation between active and
inactive chromatin at a much higher resolution than appreciable by microscopy. It also showed
that the activity of a single gene locus can have a marked effect on its steady-state nuclear
position.
4C studies have also been used to address whether or not it is the act of transcription itself that
dictates the position of a locus. A modified 4C approach to study the DNA interactions of the
active beta-globin and alpha-globin loci led to the conclusion that coregulated genes
preferentially meet at dedicated transcription sites in the nucleus [87]. A possible interpretation
of these results is that genes dynamically move to specific nuclear locations for their
transcription, rather than the transcription machinery moving to the genes. However, results from
other 4C studies are at odds with this interpretation. Blocking elongation of RNA polymerase
with alpha-amanitin or DRB did not seem to affect the contacts of active genes [58], suggesting
that chromosome conformation is not dependent on ongoing transcription. The reverse (i.e.,
activation of a large number of genes) also does not dramatically affect nuclear organization.
This was concluded from a series of 4C experiments that addressed how chromosome
conformation is affected by a stimulation of the glucocorticoid receptor (GR) [88]. The cellular
response is very rapid and affects hundreds of genes [89]. The effect on nuclear organization,
however, was found to be modest and to mostly consist of the expansion of existing interacting
regions, rather than a massive reorganization of chromosomes. No large-scale movement of GR-
responsive genes was observed, not even toward each other [88].
The relative stability of chromosome conformation in a given cell type was confirmed by a study
that assessed the influence of an ectopic enhancer on 3D genome interactions [90]. A human
beta-globin LCR ectopically placed in a cluster of mouse housekeeping genes was originally
found by FISH to more often position this genomic region outside its chromosome territory [91].
4C technology was then applied to determine where the LCR drags the cluster to. No new
contacts were observed; rather, some of the pre-existing interchromosomal contacts were formed
more frequently. Interestingly, this happened specifically at genes controlled by GATA-1 and
EKLF, two transcription factors that also bind to the LCR [90]. Collectively, these data suggest
that chromosomal context dictates the nuclear subvolume that can be sampled by a genomic
element (gene and enhancer). Within this pre-determined space, the element may find preferred
interaction partners (such as genes controlled by shared transcription factors).
The link between structure and transcription was further elucidated in a 4C study that focused on
dosage compensation of the mammalian X chromosome. Using an allele-specific 4C strategy, it
was shown that the inactive and active X chromosomes adopt distinct topologies. The noncoding
RNA Xist drives X inactivation, but is not required to maintain gene silencing [92]. Intriguingly,
upon conditional deletion of Xist, the inactive X chromosome did change its conformation to
adopt a structure similar to the active X chromosome, despite the continued suppression of
silenced genes on the inactive X chromosome. This indicates that the conformation of
chromosomes is not critically dependent on the expression of the genes they contain.
4C studies in Drosophila further uncovered the specification of preferred nuclear environments.
Genes repressed by Polycomb group (PcG) proteins were found to preferentially cluster [93,94],
and contacts were mostly restricted to the same chromosome arm. When the 4C analysis was
repeated on a chromosome arm that harbors an inversion across the centromere, specific contacts
were only observed on the newly fused chromosome arm [93]. The clustering of PcG-repressed
genes may also be of functional importance: Mutations in one target locus were found to
(slightly) weaken the repression of an interacting PcG target locus [94].
4C technology is the preferred strategy to assess the DNA contact profile of individual genomic
sites. As may be obvious from the above examples, 4C is currently limited to the description of
long-range contacts with larger regions elsewhere on the chromosome (in cis) or on other
chromosomes (in trans). Local interactions—for example, between a gene and its enhancer 50 kb
away—are not yet readily picked up due to a lack of resolution. Most 4C strategies use
restriction enzymes with a 6-nucleotide (nt) recognition sequence that cut, on average, once
every few kilobases, creating fragments that are much larger than the average regulatory
sequences (which are often not larger than several hundred base pairs). Increased resolution was
obtained when a frequent cutter was used that recognizes 4 bp and theoretically cuts every 256
bp. Using this enzyme, a known contact between a regulatory sequence and a gene of the alpha-
globin cluster was picked up [95]. Whether or not this strategy is robust enough to also identify
de novo regulatory interactions remains to be seen. One disadvantage of this published strategy
is that no further processing of the very large DNA circles was included, which hampers efficient
PCR amplification and may explain the apparent low data complexity. Clearly, though, 4C is the
natural scheme to also pick up local interactions between enhancers, promoters, and other
regulatory sequences. Further improvements on the technique are expected that will better allow
robust screening for these important local interactions.
1.2.3 Chromosome conformation capture carbon copy (5C) technology
5C can be described as a ‘‘many versus many’’ technology. It allows concurrent determination
of interactions between multiple sequences [96]. In 5C, the 3C template is hybridized to a mix of
oligonucleotides, each of which partially overlaps a different restriction site in the genomic
region of interest. Pairs of oligonucleotides that correspond to interacting fragments are
juxtaposed on the 3C template and can be ligated together. Since all 5C oligos carry one of two
universal sequences at their 5′ ends, all ligation products can subsequently be amplified
simultaneously in a multiplex PCR reaction. Readout of these junctions occurs either on a
microarray or by high-throughput sequencing.
The resolution of the technique is determined by the spacing between neighboring
oligonucleotides on the linear chromosome template. It can never reach the resolution of 4C and
Hi-C (discussed later), as not every unique end of a restriction fragment will allow the design of
a 5C oligonucleotide. On the other hand, and different from 4C, 5C provides a matrix of
interaction frequencies for many pairs of sites: This puts contacts between given DNA sites in
the context of those between other pairs of sites. 5C and Hi-C are therefore used to reconstruct
the (average) 3D conformation of larger genomic regions.
5C technology has been applied to the human beta-globin locus [96] (Dostie et al. 2006), the
human alpha-globin locus [97] (Bau et al. 2011), the human HOXA–D gene clusters [76,98,99],
and the human X chromosome inactivation center [100]. At the globin loci, interactions between
regulatory sequences and genes previously identified by 3C technology were readily picked up,
showing that this technology is a medium-throughput alternative to 3C for the identification of
enhancer–promoter interactions. It is not widely used for this purpose, though, likely because 3C
is considered easier in design and data analysis. However, its increase in throughput and reduced
bias in PCR amplification efficiency between pairs of sites theoretically make 5C superior to 3C.
5C not only identifies interactions between specific pairs of sites, but also builds a matrix of
contact frequencies across entire genomic regions and, as such, is the method of choice to start
reconstructing their conformations. Invariably, the data show that when active, gene clusters
adopt a looped but compact topology that facilitates local contacts between genomic elements;
e.g., between regulatory sequences and genes. When the chromatin organization of the HOXA
locus was assayed in two cell types expressing either the 5′ HOXA genes or the 3′
HOXA genes, preferred long-range interactions were only observed within the active parts of the
gene cluster, showing diametrically opposed HOXA interaction patterns between the two cell
types [76]. In vivo, the spatiotemporal expression of Hox genes is collinear with their
chromosomal order in the gene clusters. It was also shown that different contact domains form
along the developing body axis, each separating the different genes that are active and inactive at
the corresponding position of the anterior–posterior axis [101]. In conclusion, 5C technology is
the method of choice for understanding DNA contacts between specific sites in the context of
other contacts made in a genomic region.
1.2.4 Hi-C: the whole-genome chromosome conformation capture technology
NGS methods have also led to the development of a number of ‘‘all versus all’’ methods, among
which the first was the Hi-C method [102]. In Hi-C, the procedure for creating a 3C template is
slightly adjusted. Before ligation, the restriction ends are filled in with biotin-labeled nucleotides.
Following a blunt end ligation, DNA is purified and sheared, and a biotin pull-down is
performed to ensure that only ligation junctions are selected for further analysis. Reads are
mapped back to the genome, and when a pair is found on two different restriction fragments, this
is scored as an interaction between these two fragments. From this, a matrix of ligation
frequencies between all fragments in the genome can be constructed. A variation [103] of the Hi-
C method uses the 4C strategy to further digest with a secondary restriction enzyme and ligate to
trim the 3C template to small circles, but is followed by a third round of digestion that uses the
primary 3C restriction enzyme. The ends are filled in with a biotinylated adapter containing an
EcoP15I restriction site, which is cut 25–27 nt away. Biotin pull-down and paired-end
sequencing enable the detection of ligation junctions.
The resolution of the published Hi-C experiments in mammals, based on ~10 million paired-end
reads, was ~1 Mb [102]. Because of the quadratic nature of ‘‘all versus all’’ data, an increase in
resolution by 10-fold requires a 100-fold increase in sequence depth. The spatial separation of
active and inactive regions, previously observed in 4C experiments [33], was confirmed in a
genome-wide manner by the Hi-C data. Gene-dense, presumably active regions cluster with
other gene-dense regions. Conversely, gene-poor, inactive regions cluster with other gene-poor
regions. Furthermore, the Hi-C data revealed that nuclear organization was quite constant
between two different cell lines (K562 and GM06990), hinting at a core organization present in
most cell lines.
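The quadratic relation between resolution and sequencing depth stated above can be made concrete with a small calculation. The following is a minimal sketch; the function name and the assumption that the required depth scales exactly with the square of the fold change in resolution are illustrative simplifications rather than part of the published analysis.

```python
def reads_needed(base_reads, base_resolution_bp, target_resolution_bp):
    """Scale read depth with the square of the fold change in resolution.

    Assumes the number of matrix entries, and hence the reads needed to
    fill them, grows quadratically as the bin size shrinks.
    """
    fold = base_resolution_bp / target_resolution_bp
    return base_reads * fold ** 2


# ~10 million paired-end reads at ~1 Mb resolution, rescaled to 100 kb bins
print(reads_needed(10e6, 1_000_000, 100_000))
```

Under this assumption, moving from ~1 Mb to 100 kb bins raises the requirement from ~10 million to ~1 billion read pairs.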
The two microenvironments that were picked up by the Hi-C method were different with respect
to a number of features: The open chromatin compartment was enriched for genes, active histone
marks, and DNase I hypersensitivity sites. On the other hand, the closed chromatin compartment
was enriched for inactive histone marks, such as H3K27me3, and depleted for genes and DNase
I hypersensitivity sites.
In both baker’s and fission yeast, ‘‘all versus all’’ genome-wide chromosome conformation
experiments were able to reach kilobase resolution owing to their much smaller genome sizes
(both ~12.4 Mb) and increased sequence depth. For Saccharomyces cerevisiae, the previously
observed Rabl configuration [104–106] was confirmed, with the data showing clustering of the
centromeres and clustering of the telomeres [103]. As can be predicted from this structure, the
short arms of chromosomes showed more frequent interactions. This was also observed in high-
throughput in vivo imaging experiments that measured the nuclear position of genomic regions
relative to nuclear landmarks [107]. The Hi-C data also showed that tRNA genes dispersed
throughout the genome come together in two distinct nuclear clusters: one nucleolar cluster [108],
and the other cluster interacting with the centromere. In both S. cerevisiae and
Schizosaccharomyces pombe, the Hi-C data indicate that the chromosomes are organized in
chromosomal territories. However, different conclusions were drawn with respect to the
clustering of functionally related genes. In S. cerevisiae, apart from tRNA genes, no other groups
of genes were found significantly clustered [103]. However, in S. pombe, cell cycle genes and
other functionally related genes were claimed to come together [109]. Whether these are species-
specific differences or are caused by different analysis methods requires further study.
More recently, as sequencing costs have fallen, Hi-C studies have been performed extensively on
mammalian cells, revealing several major genome structures, including macro-domains
[110]; A and B spatial compartments (the active and inactive
microenvironments discussed above) [102,110]; topologically associating domains (TADs) [111,112]; and various
types of loops, which may be mediated by architectural proteins such as CCCTC-binding factor
(CTCF) or by specific transcription factors [112]. Some loops are important for bringing together
enhancers and gene promoters, while others define TAD boundaries [112,113]. Because of their
high sequencing depth and high resolution, Hi-C data provide the most comprehensive view
of the mammalian genome landscape so far.
1.3 Challenges in interpreting chromosome capture experiments
The resolution of all 3C-based methods is limited by the choice of the first restriction enzyme.
For a six-cutter like HindIII, there are ~800,000 HindIII sites in the mouse genome, and the
average resolution throughout the genome will be ~4 kb. However, this is only theoretical, as a
number of factors can influence this resolution. Local distribution of restriction sites can vary
between different genomic regions, resulting in different resolutions at different genomic
locations. An additional factor that influences the results is the presence of repeats in the genome.
For 3C, this is a relatively minor problem because one can be quite flexible in the selection of
primers for PCR. For sequencing-based methods, this can be more challenging, especially in 4C-
seq and 5C, which rely on the sequence directly adjacent to the restriction site. However, this can
be partially circumvented by increasing the length of the sequencing reads, which gives higher
mapping specificity.
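The theoretical cutting frequencies quoted in this chapter (every 256 bp for a four-cutter, roughly every 4 kb for a six-cutter) follow from the length of the recognition sequence. The sketch below assumes equal base frequencies and an unambiguous recognition site; the function names are hypothetical, and the cut position is simplified to the start of each site.

```python
def expected_spacing_bp(site_length):
    """Expected distance between sites for a recognition sequence of the
    given length, assuming the four bases are equally frequent."""
    return 4 ** site_length


def fragment_lengths(sequence, site):
    """Fragment lengths after cutting at every occurrence of `site`.

    The cut is placed at the first base of the site, which is a
    simplification of the real cleavage position.
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start)
        start = sequence.find(site, start + 1)
    edges = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(edges, edges[1:]) if b > a]


print(expected_spacing_bp(6))  # six-cutter such as HindIII
print(expected_spacing_bp(4))  # four-cutter
print(fragment_lengths("CCAAGCTTGGGGAAGCTTC", "AAGCTT"))
```

Real genomes deviate from this expectation because base composition and the local distribution of sites are not uniform, which is why the resolution varies between genomic regions.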
A key characteristic of 3C, 4C, 5C, and Hi-C experiments is the very high capture probability
between neighboring fragments, in keeping with their close spatial proximity. Moving further
away from a given fragment leads to an exponential decrease of the capture probability until it
reaches a baseline level [82,102,110]. The rapid decline in contact probability makes it so that
specific ligation junctions between two given sites far apart on the chromosome, or on different
chromosomes, will be rare. As discussed, this makes 3C (‘‘one versus one’’) unsuitable for the
analysis of long-range contacts. As for the higher-order genomics versions of 3C, windowed
approaches are necessary for the analysis of long-range chromatin contacts. Indeed, for far cis
and trans DNA contacts, 4C and Hi-C data sets are not reproducible at the single-fragment
resolution, but are highly reproducible over genomic windows. In such approaches, when a long-
range interaction within or between chromosomes is described, this is often a statistical
definition, meaning that two (multi-fragment) regions have a higher probability for making
contacts compared with other regions at a similar distance on the same chromosome or elsewhere
on other chromosomes. For a more thorough statistical definition of 4C or Hi-C interactions, we
refer to Lieberman-Aiden et al. [102], Splinter et al. [85], Yaffe and Tanay [114], and Imakaev et
al. [115].
Knowing which regions of the genome are contacted by a gene of interest (4C) or interpreting
the overall chromosomal conformation (Hi-C) is only interesting when some form of integrative
genomics analysis is performed. However, because long-range contacts are generally formed
between genomic regions measuring in the range of 100 kb to 1 Mb, there is often more than
three orders of magnitude difference between the scale of the genomic interaction data and
genomic data such as CpG methylation, ChIP, DNase I hypersensitivity, and expression data
[116]. This scale difference is further complicated by the fact that across larger chromosomal
domains, many genomic variables correlate. For example, C/G nucleotide density correlates with
gene density, which in turn correlates with the density of SINE repeats, which shows an inverse
correlation with LINE repeat density. Furthermore, regions of high gene density also show a high
density of transcription factor-binding sites, DNase I hypersensitivity sites, and certain post-
translational histone modifications [117]. Although these are only correlations, it is important to
keep these potential confounding factors in mind when formulating hypotheses with regard to the
underlying features of nuclear organization.
Proper statistical analysis is especially important when dealing with gene clusters. Since spatial
interactions are often formed between larger regions covering multiple genes, an interacting
region may spuriously overlap with one or more genes having the same function. Most tools for
scoring enrichment of functional annotation, such as Database for Annotation, Visualization, and
Integrated Discovery (DAVID) [118] or FatiGO [119], assume independent sampling, a
prerequisite that is clearly not satisfied in studies of long-distance chromatin interactions. This
limitation can be overcome by performing nonstandard statistical procedures, such as circular
permutation of the gene order along the chromosome [120] or the collapsing of gene clusters to
single observations in the statistical analysis.
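To illustrate the circular permutation idea mentioned above, the following is a minimal sketch rather than the procedure of [120]. It assumes genes are represented as a boolean annotation vector in chromosomal order and that the interacting region is given as a set of gene indices; the function name and parameters are hypothetical. Rotating the annotation along the chromosome keeps clustered genes together while breaking their association with the region.

```python
import numpy as np


def circular_permutation_pvalue(is_annotated, region, n_perm=1000, seed=0):
    """Empirical p-value for enrichment of an annotation inside a region.

    is_annotated: boolean array over genes in chromosomal order.
    region: indices of the genes that fall in the interacting region.
    """
    labels = np.asarray(is_annotated, dtype=bool)
    region = np.asarray(region)
    observed = labels[region].sum()
    rng = np.random.default_rng(seed)
    # Non-zero rotations only, so the original arrangement is excluded.
    shifts = rng.integers(1, labels.size, size=n_perm)
    null = np.array([np.roll(labels, s)[region].sum() for s in shifts])
    # Add-one correction keeps the estimate strictly above zero.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)


# Toy example: a cluster of five annotated genes that lies in the region
labels = np.zeros(200, dtype=bool)
labels[:5] = True
print(circular_permutation_pvalue(labels, np.arange(5)))
```

Because the null distribution is built from rotations, a cluster of related genes counts as one co-located unit instead of several independent hits.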
A further issue to consider is the number of contacts a given gene appears to have. In 4C, a
single locus can be engaged in tens or hundreds of contacts (depending on the threshold applied
to define contacts). These contacts are collected from many cells and will not all be present in the
same cell. Likely, the large number of contacts reflects cell-to-cell differences in genome
topology. It was previously argued that upon exit from mitosis, each chromosome probably
adopts one of a limited number of energetically favored conformations that will position a given
gene next to a few other genes [33].
This is especially important to keep in mind when considering 3D modeling based on
chromosome capture data. 3D chromatin models based on 5C and Hi-C data offer a tantalizing
first step toward visualizing abstract interaction frequencies [102,103,109], but need to be
interpreted with caution [121]. First, microscopy has shown that genome topology changes over
time and is different between cells. The dynamics of chromatin structure and cell-to-cell
variation are not appreciable by 3C-based methods, and it cannot be determined whether two
different interactions of A with B and C occur simultaneously or sequentially and/or whether
they are mutually exclusive. 3C and derivatives only provide steady-state conformations
measured across a population of cells. The resulting average 3D genome models are therefore
not likely to be found exactly as such in any given cell at a given time. This is different from
protein-folding models, as these molecules generally form stable 3D configurations. Recently,
population-based 3D modeling methods were developed for structural simulation and mining of
Hi-C data, allowing deeper analysis of the individual genome structures represented by the
ensemble Hi-C data [110,120,122]. Ultimately, DNA contacts need to be studied in single cells
on single alleles to understand how one interaction influences the other.
Chapter 2 Mapping Virus-Host Whole Genome
Interactions by Tethered Chromosome Conformation
Capture
2.1 Introduction
Upon infection, viruses deliver viral proteins and genetic material into host cells, either
embracing or confronting the host cellular machinery. A variety of virally evolved adaptive
mechanisms lead to dynamic virus-host interactions at every stage of the viral life cycle, all
intended to increase the likelihood of generating progeny virions. A key aspect of viral
regulation is the transcription of viral genes and the replication of viral genomes. Many RNA and
DNA viruses introduce DNA into the host cell nucleus. Retrovirus and lentivirus RNA genomes
also enter the nucleus as double-stranded DNA, which is synthesized by virion-associated reverse
transcriptase in the cytoplasm. The genome of DNA viruses is packaged at very high molecular
density with specialized viral packaging proteins to form a capsid virion particle. Upon entering
the cell nucleus, virion particles interact with host activating factors to decondense the
incoming viral DNA and either initiate viral transcription and replication or establish a stable
intermediate for latent infection or long-term integration [123,124].
After nuclear entry, the viral genomes distribute inside the nucleus through a non-random but
poorly understood process. For DNA viruses, fluorescence in situ hybridization (FISH) imaging
analyses show nuclear deposition of viral DNA and subsequent viral gene transcription at the
periphery of Nuclear Domain 10s (ND10s), which are also known as PML nuclear bodies (NBs)
and PML oncogenic domains (PODs) [125,126]. ND10 is a highly organized nuclear structure
that accumulates interferon-upregulated proteins, implicating ND10s as sites of a nuclear
defense mechanism. ND10 subcompartments often reside adjacent to nuclear speckles, a nuclear
domain enriched in pre-mRNA splicing factors. DNA viruses initiate their transcription at
ND10s juxtaposed to nuclear speckles. After the expression of viral immediate early gene
products, ND10s subsequently become dispersed. The dynamic co-localizations of DNA virus
genomes and host nuclear domains at defined stages of viral life cycle strongly suggest a
functional role of viral-host genomic interactions. However, because of the limited resolution of
FISH and its low throughput, the molecular and chromatin compositions of these nuclear
domains and their interactions with viral DNA are not fully characterized. Little is known about
whether the virus genome associates with specific host chromosomal regions.
Extensive studies suggest that nuclear organization, including the proximity of genomic loci and
their positions inside the nucleus, plays important roles in regulating transcription and DNA
replication [127–129]. In addition to early image-based analyses, recently developed
chromosome conformation capture (3C) techniques are able to capture chromosomal interactions
at fine resolution and in high-throughput (Hi-C) when coupled with next generation sequencing
[55,102]. These studies reveal a hierarchical organization of chromosome structures—including
contact domains [112], topologically associating domains [130], macro-domains [110], and
chromatin compartments A/B [102] — that can be functionally related to gene expression,
chromatin modifications, DNA replication timing, etc. These studies raise the possibility that
viruses may have to adapt to or alter the host nuclear environment and genome structure to
regulate the viral life cycle. To reveal whether virus-host genome interactions are specific, we
used a 3C/Hi-C-based approach to investigate virus-host genome interactions. However,
solution-based 3C methods such as Hi-C suffer from high background noise [82,110,131,132],
such that rare DNA interactions, including interchromosomal contacts as well as virus-host
genome interactions to be studied here, may be difficult to detect. Alternative approaches, such
as the solid phase chromosome conformation capture technique, also known as TCC [110], and
the recently developed in situ Hi-C and single-cell Hi-C [112,133], minimize the noise levels
originating from random DNA ligations and could be adopted to investigate virus-host genome
interactions. In this study, we used TCC and the adenovirus/fibroblast model to explore virus-
host genome interactions and their potential functional implications.
The adenovirus is a non-enveloped virus containing a linear double stranded DNA genome. It is
the first human virus that was found to cause tumors in hamsters [134]. This tumor-promoting
activity led to extensive studies of the molecular and cell biology of adenovirus infection [135].
Among the human adenoviruses, serotype 5 (Ad5) is an extensively characterized strain, which
has a ~36 kb linear dsDNA genome and encodes ~39 viral genes [136].
Here we identified genome-wide adenovirus-host genome interactions and observed that the
interactions are not randomly distributed across the host genome. Host-viral genome interactions
preferentially occur in specific regions of the host genome that are associated with open and
early replicating chromatin. At the same time, these regions are also enriched with specific
histone modifications, such as H3K27me3, which is not normally associated with active
chromatin but is critical in regulating cell differentiation [38]. Interestingly, the viral DNA
attachment sites coincide with locations undergoing early host epigenome remodeling upon viral
infection. We found a correlation between the observed frequencies of a host-viral DNA
interaction and the extent of acetylation of H3K18 and H3K9 (in the presence of RB and p300),
which are part of the observed epigenetic remodeling of the host genome upon adenoviral
infection [25-28]. This observation raises the possibility that the locations of viral DNA in the
host genome may be mechanistically linked to the dramatic epigenetic changes in the host
genome upon viral infection.
2.2 Methods
2.2.1 Cell culture and TCC experiments.
IMR90 cells were grown under standard culturing conditions (DMEM, 10% FBS, 1X
penicillin/streptomycin, 5% CO₂, and 37 °C). Once the cultures reached 100% confluency, they
were maintained for another 12 hours to ensure that the cells were arrested. Cells were infected
with small e1a adenovirus [137] (MOI=200) in regular media with FBS reduced to 2% and
maintained at 37 °C with 5% CO₂. Mock-infected cells received the new media with no virus.
At 6 and 24 hours post-infection (pi) and 6 and 24 hours mock-infection (mi), the old media
was aspirated and 20 million cells were treated with fresh medium (DMEM, 2% FBS, and
penicillin/streptomycin) and 1% formaldehyde. TCC experiments were performed as
previously described [110].
2.2.2 Preprocessing TCC sequencing output
Sequencing libraries prepared from TCC experiments have unique features because of a number
of special enzyme treatments. In this section, preprocessing of the raw sequencing reads prior to
alignment is described.
2.2.2.1 Adjusting for sequencing lariats
TCC experiments use exonuclease III to remove biotin from free DNA ends before pulling down
DNA ligation junctions. This step may produce a significant amount of single-stranded DNA,
which can potentially form hairpin structures during preparation of the sequencing library. During
DNA end repair, the 5′ overhang of a hairpin structure is filled in, which causes
the initial sequence of read 2 to be identical to that of read 1 and eventually leads to a
significant number of alignment failures. We call this type of error a “sequencing lariat”;
it is unique to TCC experiments. To attenuate the effect of these errors, we ran a program
that searches for identical initial sequences in read 1 and read 2 and trims the identical sequence
from read 2 before alignment. This trimming step significantly increased the
percentage of alignable reads.
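The lariat trimming step can be sketched as follows. This is an illustrative outline and not the program used in this work: the function name is hypothetical, and the `min_overlap` threshold is an assumption added to avoid trimming short prefixes that match by chance.

```python
def trim_lariat(read1, read2, min_overlap=10):
    """Return read 2 with the leading run it shares with read 1 removed.

    min_overlap is a hypothetical guard: shared prefixes shorter than
    this are treated as chance matches and left untouched.
    """
    shared = 0
    for base1, base2 in zip(read1, read2):
        if base1 != base2:
            break
        shared += 1
    return read2[shared:] if shared >= min_overlap else read2


# Read 2 starts with the same 12 bases as read 1, so they are trimmed
print(trim_lariat("ACGTACGTACGTTTTT", "ACGTACGTACGTGGGGCC"))
```

A production version would also need to keep the base quality string in step with the trimmed sequence.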
2.2.2.2 Filtering of ligation junctions
Informative sequencing results are the ligation products of different DNA fragments resulting
from restriction enzyme digestion. Depending on the sites of DNA shearing, sequencing reads
may extend across the ligation junction between two DNA fragments. Such chimeric reads may fail
to align to the genome. To improve the alignment efficiency,
we ran a program that scans for the HindIII ligation junction (expected to be “AAGCTAGCTT” after
end filling-in and blunt-end ligation, allowing one base of mutation or deletion to account for
restriction enzyme star activity) and removes all bases 3′ of the midpoint of the junction.
This filtering step also increased the proportion of successfully aligned reads.
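The junction filtering can be illustrated with a short sketch. It is not the program used here: the function name is hypothetical, and for brevity it tolerates a single substitution only, leaving out the single-base deletion case described in the text.

```python
JUNCTION = "AAGCTAGCTT"  # filled-in, blunt-ligated HindIII junction


def trim_at_junction(read, junction=JUNCTION, max_mismatch=1):
    """Cut the read at the midpoint of the first junction-like window.

    Only substitutions are tolerated in this sketch; deletions inside
    the junction are not handled.
    """
    width = len(junction)
    for start in range(len(read) - width + 1):
        window = read[start:start + width]
        mismatches = sum(a != b for a, b in zip(window, junction))
        if mismatches <= max_mismatch:
            # Keep everything up to and including the first half of the junction.
            return read[:start + width // 2]
    return read


print(trim_at_junction("GGGCCAAGCTAGCTTTTTAA"))
```

Reads without a junction-like window are returned unchanged, so the function can be applied to every read before alignment.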
Raw sequencing reads that passed the previous two preprocessing steps were aligned to the hg19 and
adenovirus 5 reference genomes with bowtie-1.0.0, allowing a maximum of three mismatches.
2.2.2.3 Removing non-informative pairs
Two types of sequencing read pairs that do not contain any information about the spatial
organization of the genomes and viral-host genome interactions can be readily identified. The
first type consists of pairs that are a result of PCR amplification of a single DNA molecule (PCR
duplication). These pairs can make a contact appear more frequent when it has only been
amplified more efficiently in the PCR. A group of read pairs that are a result of PCR duplication
align to identical positions on both ends. All but one pair in such groups were removed from the
catalogue.
The second type consists of pairs that originate from DNA molecules that do not include a
ligation junction, yet they bind to the streptavidin-coated beads either non-specifically or as a
result of incomplete exonuclease action in removing terminal biotins. After sequencing, these
molecules result in pairs that align just 300-700 bp apart to opposite strands of the reference
sequence of the genome (the size range depends on the size range used during gel-extraction).
All such pairs were removed from the catalogue of each dataset.
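The two filters above can be sketched as a single pass over the aligned pairs. The record layout (`Pair`), the function name, and the use of fixed 300 and 700 bp limits are assumptions made for illustration; in practice the limits follow the size range used during gel extraction.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def filter_pairs(pairs, min_gap=300, max_gap=700):
    """Drop PCR duplicates and pairs that lack a ligation junction."""
    seen = set()
    kept = []
    for pair in pairs:
        if pair in seen:
            continue  # identical alignment at both ends: PCR duplicate
        seen.add(pair)
        same_chrom = pair.chrom1 == pair.chrom2
        opposite = pair.strand1 != pair.strand2
        gap = abs(pair.pos2 - pair.pos1)
        if same_chrom and opposite and min_gap <= gap <= max_gap:
            continue  # contiguous genomic fragment without a junction
        kept.append(pair)
    return kept


example = [
    Pair("chr1", 100, "+", "chr2", 500, "-"),
    Pair("chr1", 100, "+", "chr2", 500, "-"),     # duplicate of the first
    Pair("chr1", 1000, "+", "chr1", 1400, "-"),   # 400 bp apart, opposite strands
    Pair("chr1", 1000, "+", "chr1", 90000, "-"),  # long-range contact
]
print(len(filter_pairs(example)))
```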
2.2.3 Splitting test for filtering low coverage bins of TCC observed contact
(OC) matrix and choosing optimal resolution to visualize VGI contact profile
Differences in chromatin accessibility between genomic locations lead to differences in sequencing
depth and coverage, which require special consideration in most sequencing-based
genomic analyses [138]. For TCC and Hi-C studies, because 2D contact
matrices are prone to low coverage, sequencing depth is one of the key considerations. The values
of low-coverage matrix entries will not converge during iterative normalization [115], which
leads to spurious contact frequencies (Figure S4).
2.2.3.1 Splitting test for TCC OC matrix
To define matrix bins of low coverage, we performed a “splitting test” on the TCC OC matrix,
which divides the data matrix into two matrices by randomly dividing the unique read count
of each entry (with value n) following a binomial distribution (x ~ binom(n, p=0.5)). The pair of
matrices is used to test whether contact profiles at the given bin resolution are reproducible.
Several pairs of matrices were generated by independent splitting tests of the original data.
Splitting tests were performed on matrices from all four TCC datasets.
Intrachromosomal contacts are not random and maintain specific patterns in interphase cells
[102,110,139]. Therefore, the Pearson’s correlation between the corresponding pairs of
intrachromosomal contact profiles in both split matrices should be close to +1 if the bin size is
adequately chosen with respect to the available sequencing coverage. Splitting tests showed that
at 500kb bin resolution, the pairwise Pearson’s correlation converges to +1 above the 1st to 2nd
percentile. Bins with correlations in the lowest 1% to 2% were removed from the
matrices.
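The splitting test can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the matrix is assumed to be a symmetric array of integer counts, and only the upper triangle is split so that both halves stay symmetric.

```python
import numpy as np


def split_symmetric(matrix, rng):
    """Split each count n into x ~ Binom(n, 0.5) and n - x.

    Only the upper triangle (including the diagonal) is sampled, then
    mirrored, so both output matrices remain symmetric.
    """
    upper = np.triu(matrix)
    first = rng.binomial(upper, 0.5)
    first = first + np.triu(first, 1).T
    return first, matrix - first


def row_correlations(a, b):
    """Pearson correlation between matching rows (contact profiles)."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    denom = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    with np.errstate(divide="ignore", invalid="ignore"):
        return (a * b).sum(axis=1) / denom


# Toy matrix with a distance-dependent decay of contact counts
rng = np.random.default_rng(0)
idx = np.arange(50)
counts = np.rint(1000 / (1 + np.abs(idx[:, None] - idx[None, :]))).astype(np.int64)
half_a, half_b = split_symmetric(counts, rng)
corr = row_correlations(half_a, half_b)
low_bins = np.where(corr <= np.nanpercentile(corr, 2))[0]  # candidates for removal
print(corr.min(), low_bins)
```

Repeating the split with different random seeds gives the several independent pairs of matrices described above.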
2.2.3.2 Splitting test for VGI contact vector
To determine the optimal resolution at which to present the viral-host genome interactions (VGIs), we
performed the same binomial splitting on the VGI contact vector entries and compared the
pairwise correlations at a series of different resolutions. Each VGI contact vector was split
independently 20 times to test the consistency between the splitting tests. Similar to
interchromosomal interactions, VGIs are more uniformly distributed than intrachromosomal
interactions. Therefore, the pairwise Pearson’s correlation of the VGI splitting tests does not converge to 1
at any resolution in either of the two datasets (6 hours pi and 24 hours pi). We therefore
chose a resolution of 500kb, at which the pairwise correlations are generally high and
start to converge (Figure S1).
2.2.4 Bias correction of TCC datasets
2.2.4.1 Iterative correction of TCC OC matrix
TCC OC matrices of the four datasets are normalized through the iterative correction (ICE)
method to remove experimental biases [115]. As described in the ICE method, the diagonal of
each OC matrix was removed. Also, all intrachromosomal interactions with genomic distances
less than 20kb were removed, which removes potential self-looping contacts and dangling ends
artifacts between consecutive bins. OC matrix bins with low sequencing coverage (as defined by
splitting tests) were removed resulting in reduced matrix dimension. At 500kb resolution, each
TCC OC matrix was defined as a $K \times K$ matrix, $\mathbf{C} = (c_{ij})_{K \times K}$, in which the entry $c_{ij}$ is equal to
the number of observed CIs between bins $i$ and $j$, where $K$ is the matrix dimension after
removing low coverage bins and $c_{ij} = 0$ for $i = j$. To obtain the contact frequency (CF) matrix
$\mathbf{F}_n = (f_{ij,n})_{K \times K}$ through the ICE method, the matrix was normalized for $n$ iterations as follows:

$$f'_{ij,l} = \frac{f_{ij,l-1}}{\sum_{k=1}^{K} f_{ik,l-1} \, \sum_{k=1}^{K} f_{kj,l-1}}$$

with $f_{ij,0} = c_{ij}$ and

$$f_{ij,l} = f'_{ij,l} \, \frac{\sum_{i=1}^{K} \sum_{j=1}^{K} f_{ij,l-1}}{\sum_{i=1}^{K} \sum_{j=1}^{K} f'_{ij,l}} \quad \text{for } l = 1, 2, \ldots, n.$$

Therefore, the bias vector is $\mathbf{V}_n = (v_{n,i})_K$ where

$$v_{n,i} = M \prod_{l=1}^{n} \sum_{k=1}^{K} f_{ik,l-1}$$

with the scaling factor

$$M = \sqrt{\frac{\prod_{l=1}^{n} \sum_{i=1}^{K} \sum_{j=1}^{K} f'_{ij,l}}{\prod_{l=1}^{n} \sum_{i=1}^{K} \sum_{j=1}^{K} f_{ij,l-1}}}$$

so that

$$f_{ij,n} = \frac{f_{ij,0}}{v_{n,i} \, v_{n,j}} = \frac{c_{ij}}{v_{n,i} \, v_{n,j}}.$$

The OC matrix for each dataset was
normalized for 10 iterations when the matrix entries started to converge.
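The iteration above can be sketched in a few lines. This is a minimal illustration, assuming a symmetric count matrix in which every bin has a nonzero row sum (low coverage bins already removed); `ice_normalize` and the toy matrix are hypothetical names and values introduced here, not the implementation used in this study. The bias vector is accumulated so that each final entry equals the raw count divided by the product of the two bin biases.

```python
import math


def ice_normalize(counts, n_iter=10):
    """Iteratively correct a symmetric count matrix.

    Returns (cf, bias) with cf[i][j] == counts[i][j] / (bias[i] * bias[j]).
    Assumes every row of `counts` has a nonzero sum.
    """
    k = len(counts)
    f = [[float(x) for x in row] for row in counts]
    bias = [1.0] * k
    for _ in range(n_iter):
        row_sums = [sum(row) for row in f]
        total_before = sum(row_sums)
        # Divide each entry by the product of its row sum and column sum
        # (equal to the row sum of bin j because the matrix is symmetric).
        f_prime = [
            [f[i][j] / (row_sums[i] * row_sums[j]) for j in range(k)]
            for i in range(k)
        ]
        total_after = sum(sum(row) for row in f_prime)
        # Rescale so the total contact count is preserved.
        scale = total_before / total_after
        f = [[x * scale for x in row] for row in f_prime]
        # Spread the rescaling evenly over the two bias factors.
        root = math.sqrt(scale)
        bias = [b * s / root for b, s in zip(bias, row_sums)]
    return f, bias


# Hypothetical 4-bin toy matrix with a zero diagonal.
toy_counts = [
    [0, 10, 4, 2],
    [10, 0, 6, 3],
    [4, 6, 0, 8],
    [2, 3, 8, 0],
]
cf, bias = ice_normalize(toy_counts, n_iter=10)
```

The square root in the bias update mirrors the scaling factor $M$: the per-iteration rescaling multiplies each entry once, so it must be shared between the two bias factors $v_{n,i}$ and $v_{n,j}$.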
2.2.4.2 Normalization of VGI
To obtain contact frequency of VGI vector, $\mathbf{A} = (a_i)_K$, at 500kb resolution, the observed VGI
count vector $\mathbf{O} = (o_i)_K$ was normalized by its corresponding bias vector $\mathbf{V}_n$ so that $a_i = o_i / v_i$.
The bias vector $\mathbf{V}_n$ was also scaled to $\sum_{i=1}^{K} v_i = 1$ to serve as the probability vector in identifying
VGI enriched regions on the host genome.
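As a small worked sketch of this step, the snippet below divides an observed VGI count vector by a bias vector and rescales the same bias vector to sum to 1. The function names and the four-bin toy values are hypothetical and only serve to illustrate the two operations.

```python
def normalize_vgi(observed, bias):
    """Contact frequency a_i = o_i / v_i for each bin."""
    return [o / v for o, v in zip(observed, bias)]


def to_probability_vector(bias):
    """Rescale the bias vector so that its entries sum to 1."""
    total = sum(bias)
    return [v / total for v in bias]


# Hypothetical toy values for four bins.
observed_vgi = [30, 10, 0, 20]
bias_vector = [2.0, 1.0, 0.5, 1.5]

vgi_frequency = normalize_vgi(observed_vgi, bias_vector)
null_probabilities = to_probability_vector(bias_vector)
```

A bin with a large bias (high visibility) is thus down-weighted in the contact frequency vector and receives a proportionally larger share of the expected counts under the null model.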
2.2.5 Identification of VGI enriched genomic regions by hypothesis testing
A multinomial null model assumption was used for the VGI contact distribution on genomic
locations with $\mathrm{Mult}_k(n, \mathbf{p})$, for which $k$ is the number of binning windows ($k$ = 5,644), $n$ is the total
number of observed VGIs for the 6 hours pi and 24 hours pi sample, and $\mathbf{p}$ is the binning
windows’ probability vector $(p_1, p_2, \ldots, p_k)$, which is proportional to the OC matrix bias vector
generated by the ICE normalization. Bins with significantly higher number of VGIs than
expected by the random null model distribution were defined as VGI enriched with a false
discovery rate (FDR) of less than $10^{-4}$. The FDR was calculated by applying the same statistics to
two randomly generated null models given different p-value thresholds.
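One way to picture this test is sketched below. Under a multinomial model, the count in a single bin follows a binomial distribution with parameters $n$ and $p_i$, so an upper-tail binomial p-value can be computed per bin. The FDR estimate shown here (average number of bins passing the threshold in simulated null vectors, divided by the number passing in the observed data) is an assumption about how such a comparison could be set up, not a description of the exact procedure used in this study; all function names and toy values are hypothetical.

```python
import math
import random
from collections import Counter


def binom_upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p), summed term by term in log space."""
    if x <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for j in range(x, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(j + 1)
                    - math.lgamma(n - j + 1) + j * log_p + (n - j) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def enriched_bins(counts, probs, threshold):
    """Indices of bins whose upper-tail p-value is below the threshold."""
    n = sum(counts)
    return [i for i, (o, p) in enumerate(zip(counts, probs))
            if binom_upper_tail(o, n, p) < threshold]


def sample_null(n, probs, rng):
    """Draw one count vector from Mult_k(n, p)."""
    hits = Counter(rng.choices(range(len(probs)), weights=probs, k=n))
    return [hits[i] for i in range(len(probs))]


def empirical_fdr(counts, probs, threshold, n_null=20, seed=0):
    """Assumed FDR estimate: mean null discoveries / observed discoveries."""
    rng = random.Random(seed)
    n = sum(counts)
    observed_hits = len(enriched_bins(counts, probs, threshold))
    if observed_hits == 0:
        return 0.0
    null_hits = sum(
        len(enriched_bins(sample_null(n, probs, rng), probs, threshold))
        for _ in range(n_null)
    ) / n_null
    return min(1.0, null_hits / observed_hits)


# Hypothetical toy data: five equally visible bins, one clearly enriched.
toy_counts = [200, 195, 205, 300, 100]
toy_probs = [0.2] * 5
toy_enriched = enriched_bins(toy_counts, toy_probs, threshold=1e-4)
toy_fdr = empirical_fdr(toy_counts, toy_probs, threshold=1e-4)
```

In this toy example only the bin with 300 counts lies far enough above its expectation of 200 to pass the threshold. The term-by-term tail sum is adequate for small toy inputs; at genome scale a library routine for the binomial survival function would be the more practical choice.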
2.3 Results and Discussions
2.3.1 Genome-wide interactions between adenovirus and host genome
We used TCC to map genome-wide chromatin interactions between the adenovirus and host
genomes. We used the small E1A (e1a) adenovirus 5 construct (dl1500) that mainly expresses
the small e1a oncoprotein [140]. The ‘early region 1a’ (E1A) of the human Ad5 genome
generates two splice variants, the smaller of which (i.e., small e1a) is responsible for activating
early viral genes’ expression and the virus-induced host cell transformation. The e1a oncoprotein
can drive contact-inhibited human fibroblast cells into S phase in part by displacing the RB cell
cycle checkpoint protein from the E2F master transcription factors [141] without causing host
nucleus destruction in the lytic infection cycle. Extensive studies have demonstrated that e1a is
crucial for remodeling the host epigenome leading to oncogenic transformation of the cell
[137,142–144]. Thus the dl1500 construct serves as a good model for our studies. TCC
experiments were performed on human primary lung fibroblasts (IMR90) with and without
dl1500 adenovirus infection for different durations at 6 and 24 hours post infection (p.i.). For the
two datasets with virus infection, we preprocessed (Methods) the paired-end sequencing reads and
aligned them to both human (hg19) and adenovirus 5 (Ad5) genomes. Paired-end reads aligned to
hg19 on one end and to Ad5 on the other end were defined as virus-host genome interactions
(VGIs). 156,666 VGIs and 460,988 VGIs were identified at 6 and 24 hours p.i., respectively.
Paired-end reads with both ends aligned to hg19 are host chromosomal interactions (CIs). For
intrachromosomal interactions, only paired-end reads whose two ends were separated by a sequence
distance of at least 20 kb were considered. The absolute numbers of reads for each interaction
type (CIs and VGIs) in all four datasets are shown in Table 2.1. VGIs constitute 0.12% and 0.43%
of all detected interactions in the 6 and 24 hours p.i. data sets, respectively (Figure 2.1a).
The amount of adenovirus DNA in the nucleus of an infected cell can be calculated from the size
of the adenovirus 5 genome (~36 kb) multiplied by the multiplicity of infection (MOI), which is
the ratio of the infectious adenovirus particles to the number of targeted fibroblast cells. An MOI
of 200 was used for all virus infection experiments to ensure near complete infection of all cells
in culture. The total amount of DNA consists of the diploid female human genome (IMR90) and
the viral DNA. The expected DNA capturing rate, defined as the ratio of the viral DNA to the
total amount of nuclear DNA, is 0.14%, which is similar in value to the total fraction of VGI
among all detected DNA interactions in the 6 hours p.i. TCC dataset (Figure 2.1a). This value
indicates the TCC method is robust in capturing virus-host genome interactions.
To analyze the genome-wide distribution of VGIs, the host genome was tiled into binning
windows at different resolutions. Given the current sequencing depth, 500 kb binning windows
were chosen as the optimal resolution to represent VGIs (Methods 2.2.3.2 and Figure 2.2).
Chromosomal interactions (CIs) were also binned into 500
kb windows to build the Observed genome-wide raw Contact count (OC) matrices.
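The binning step can be illustrated with a short sketch. It assigns each read end to a genome-wide 500 kb bin index and counts contacts per bin pair, skipping intrachromosomal pairs closer than 20 kb as described above. The chromosome names and sizes, the read pairs, and the function names are hypothetical placeholders rather than the pipeline used in this study.

```python
from collections import Counter

BIN_SIZE = 500_000


def build_offsets(chrom_sizes, bin_size=BIN_SIZE):
    """Map each chromosome to the genome-wide index of its first bin."""
    offsets, next_bin = {}, 0
    for chrom, size in chrom_sizes.items():
        offsets[chrom] = next_bin
        next_bin += -(-size // bin_size)  # ceiling division
    return offsets, next_bin


def bin_index(chrom, pos, offsets, bin_size=BIN_SIZE):
    """Genome-wide bin index of a 0-based position."""
    return offsets[chrom] + pos // bin_size


def count_contacts(pairs, offsets, min_cis_distance=20_000):
    """Count contacts per unordered bin pair (i <= j)."""
    counts = Counter()
    for (chrom1, pos1), (chrom2, pos2) in pairs:
        # Drop short-range intrachromosomal pairs (below 20 kb).
        if chrom1 == chrom2 and abs(pos1 - pos2) < min_cis_distance:
            continue
        i = bin_index(chrom1, pos1, offsets)
        j = bin_index(chrom2, pos2, offsets)
        counts[(min(i, j), max(i, j))] += 1
    return counts


# Hypothetical two-chromosome genome and four read pairs.
toy_sizes = {"chrA": 1_200_000, "chrB": 700_000}
toy_offsets, toy_n_bins = build_offsets(toy_sizes)
toy_pairs = [
    (("chrA", 100_000), ("chrA", 110_000)),    # 10 kb apart, dropped
    (("chrA", 100_000), ("chrA", 900_000)),    # bins 0 and 1
    (("chrA", 950_000), ("chrA", 50_000)),     # bins 1 and 0
    (("chrB", 600_000), ("chrA", 1_100_000)),  # bins 4 and 2
]
toy_contacts = count_contacts(toy_pairs, toy_offsets)
```

Storing only the pairs with $i \le j$ keeps the representation compact; the full symmetric OC matrix can be filled from these counts when needed.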
For 3C-based experiments, the number of contacts observed for different genomic regions may
be biased by differences in DNA sequence and chromatin features, such as restriction enzyme
digestion efficiency, sequence mappability, GC content, etc [114,145]. Based on the “equal
visibility” assumption, the genome-wide OC matrices of the four datasets were normalized by
balancing the marginal sum of the matrices through the iterative correction (ICE) method to
obtain the contact frequency (CF) matrices [115] (See Methods for the details of data processing).
To accurately account for the VGI contact frequency, the observed raw counts of VGI were
normalized by the bias vector generated from ICE normalization method (Methods 2.2.4.1). VGI
contact frequencies at 6 and 24 hours p.i. were therefore represented by two contact profile
vectors with dimension of 5,689 bins spanning the entire hg19 genome. The Pearson’s
correlation coefficient between the two vectors is 0.768 indicating that VGI positions are
consistent between the two samples and are not randomly distributed. Genome-wide VGIs
together with their contact frequencies are displayed in Figure 2.1b, which shows the non-uniform
distribution of VGIs in the human genome for both datasets at 6 and 24 hours p.i.
2.3.2 Identification of viral DNA enriched regions on the host genome
The enrichment of VGIs at a genomic region was determined by hypothesis testing with a null
model assumption of multinomial distribution (Figure 2.1c; Methods 2.2.5). As a result, 222 and
731 bins were identified as viral DNA enriched bins for the 6 and 24 hours p.i. samples,
respectively. 173 (79%) of the VGI enriched bins in the 6 hours p.i. data set were also found in
the 24 hours p.i. sample (Figure 2.1d). Since viral DNA does not replicate at early infection
stages [142], the 222 viral DNA enriched bins identified at 6 hours p.i. are likely the primary
genomic locations where adenovirus DNA interacts with the host genome as it is delivered into
the host cell nucleus.
Figure 2.1 VGI distributions and enriched regions on host genome. a) Portion of TCC contacts
containing adenovirus 5 (Ad5) DNA (including VGIs and viral-viral genome interactions). Input
represents the portion of adenovirus DNA among the total DNA material in the nucleus, which is
calculated by Size_Ad5*MOI (200 for the two virus infection TCC experiments in this study) /
(Size_diploid_hg19+Size_Ad5*MOI). b) VGI contact frequency distributions on different
chromosomes at 6 hours and 24 hours pi. VGI contact frequency vectors were normalized by the
ICE bias vectors (Methods). Only statistically VGI enriched host genomic regions were plotted
with lines (line gray scale represents VGI contact frequency) on the circos plots. c) Distributions
of VGI observed contacts in the 500kb resolution bins at 6 hours and 24 hours pi. Each dashed line
represents the multinomial null model distribution assuming VGIs are randomly distributed
across the host genome. d) Venn diagram of VGI enriched bins identified at different times of
infection using the null models given FDR < $10^{-4}$.
Table 2.1. Summary of the numbers of sequencing reads at each step of TCC data processing.
Library | 6hr pi | 24hr pi | 6hr mi | 24hr mi
# of raw reads | 148,308,467 | 135,150,648 | 153,161,215 | 150,913,552
Reference genome | hg19 + Ad5 | hg19 + Ad5 | hg19 | hg19
# of paired reads (with PCR duplicates) | 84,541,933 | 89,428,739 | 99,949,706 | 86,634,846
# of paired reads (without PCR duplicates) | 82,249,104 | 88,363,338 | 97,550,807 | 85,357,688
Intrachromosomal interactions (<20kb) | 18,389,476 | 28,633,987 | 28,634,834 | 23,379,726
Intrachromosomal interactions (>20kb) | 31,081,701 | 42,355,855 | 44,737,331 | 39,304,942
Interchromosomal interactions | 32,769,404 | 17,368,986 | 24,172,287 | 22,668,268
Viral-host genome interactions | 156,666 | 460,988 | NA | NA
Figure 2.2 Splitting test of viral-host genome interactions at different bin resolutions. VGIs from
6 hours pi and 24 hours pi datasets were indexed at different resolutions ranging from 50kb to
5mb. VGI contact vectors at each resolution were randomly split 20 times using a binomial
distribution (Methods 2.2.3.2), and the Pearson’s correlation between the two split vectors
was calculated for each splitting event. The error bars at different resolutions
represent the standard deviation of the pairwise correlations among the 20 splits.
[Figure 2.2 plot. x-axis: Resolutions (50kb, 100kb, 200kb, 500kb, 1mb, 2mb, 5mb). y-axis: Pearson's Correlation Coefficient (0.0 to 1.0). Series: 24 hours infection, 6 hours infection.]
2.3.3 Viral DNA enriched regions co-localize in 3D space prior to infection
Previous FISH imaging analyses showed that the adenovirus genome is deposited to specific
nuclear domains (ND10s) at the early stages of viral infection [125]; and that ND10s segregated
equally and had symmetric positions in daughter nuclei after cell division [146], which indicates
viral DNA genome deposition follows certain chromosomal bases. Moreover, ND10s are
frequently situated next to DNA replication foci and specific gene loci [126]. These observations
suggest that the viral genome prefers certain larger host chromosomal locations or
subcompartments within the nucleus.
To investigate the existence of large chromosomal subcompartments involved in viral-host
genome interactions, we tested if the 222 chromatin regions enriched with VGIs at the 6 hours p.i.
are clustered in 3D space before the viral infection. The 222 bins can form up to 1507 bin pairs
of intrachromosomal interactions in the 6 hours mock-infection (m.i.) TCC dataset (Figure 2.3a).
Within a chromosome, genomic regions generally tend to interact more frequently when they are
separated by shorter sequence distances. Therefore, we performed a sequence distance restricted
sampling test to rule out bias with respect to sequence proximity on the interaction frequency.
For each of the 1507 intrachromosomal VGI bin pairs, a pair of bins was randomly selected
from the same chromosome with identical sequence distance separation to the VGI pair. The
distance restricted sampling procedure was repeated 10,000 times, generating 10,000 random
bin pairs of intrachromosomal interactions. In this analysis, VGI bins had a statistically
significantly higher probability (p-value less than $10^{-4}$) to colocalize with each other than
other bins with the same sequence separation, as is evident from the distributions
of pairwise contact frequencies for VGI regions and the random control (Figure 2.3b).
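The idea of distance restricted sampling can be sketched as follows for a single chromosome. For every VGI bin pair, a random pair with the same bin separation is drawn, and the mean contact frequency of the matched random pairs is compared with the observed mean. How the draws are aggregated into a p-value here (one matched set per draw, with an empirical tail fraction) is an assumption made for illustration, and the toy matrix, bin indices, and function names are hypothetical.

```python
import itertools
import random


def matched_random_mean(cf, vgi_pairs, rng):
    """Mean contact frequency of random pairs with matching bin separations."""
    k = len(cf)
    total = 0.0
    for i, j in vgi_pairs:
        distance = j - i
        start = rng.randrange(k - distance)  # keeps start + distance in range
        total += cf[start][start + distance]
    return total / len(vgi_pairs)


def colocalization_test(cf, vgi_bins, n_draws=1000, seed=0):
    """Return the observed mean and an empirical one-sided p-value."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(sorted(vgi_bins), 2))
    observed = sum(cf[i][j] for i, j in pairs) / len(pairs)
    hits = sum(
        matched_random_mean(cf, pairs, rng) >= observed for _ in range(n_draws)
    )
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_draws + 1)


# Hypothetical toy chromosome: contact frequency decays with distance,
# and contacts among the three "VGI" bins are boosted five-fold.
K = 12
toy_vgi_bins = [1, 5, 9]
toy_cf = [[0.0] * K for _ in range(K)]
for i in range(K):
    for j in range(K):
        if i != j:
            boost = 5.0 if i in toy_vgi_bins and j in toy_vgi_bins else 1.0
            toy_cf[i][j] = boost / abs(i - j)

toy_observed, toy_p_value = colocalization_test(toy_cf, toy_vgi_bins)
```

Matching on bin separation removes the distance-decay effect from the comparison, so a small p-value reflects contacts beyond what sequence proximity alone would produce.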
Intrachromosomal long-range interactions define chromosome subcompartments, such as
compartment A and compartment B regions. Chromosome regions within each compartment
tend to co-localize with other regions in the same compartment [12]. On one hand, out of the
5,689 500-kb bins in the genome, 2483 (43.6%) are classified as compartment A regions in the 6
hours p.i. TCC dataset. On the other hand, out of the 222 500kb bins enriched with VGIs at 6
hours p.i., 148 (66.7%) are classified as compartment A regions. Therefore, the observed VGI
colocalization could be a reflection of its compartment A preference. To investigate if the VGI
colocalization is influenced by compartment A/B segregation, we repeated the distance restricted
sampling test for the 148 VGI bins in compartment A in comparison to all the 2483 bins in
compartment A. In this analysis as well, the p-value was less than 0.005, which indicates that viral
genome colocalization is not solely driven by its compartment A preference.
Interestingly, our analysis indicates that VGI bins can be clustered according to their contact
frequency profiles into 2 to 3 larger groups (i.e. subcompartments) per chromosome, so that VGI
regions within each domain showed substantially increased interaction frequencies in
comparison to interactions to VGI regions in another domain. These spatial subcompartments are
visually evident when plotting the contact frequency heatmaps formed by only VGI bins (Figure
2.3a). These subcompartments could be candidates for nuclear domains associated with early
viral infection, such as the previously identified ND10s [125].
2.3.4 Viral DNA preferentially interacts with active host chromatin regions
and chromatin regions marked by histone modification H3K27me3
The genome-wide chromosome conformation capture data of IMR90 is able to classify the host
genome into at least two distinct compartments, A and B, based on the TCC interaction patterns
[102,110]. In addition to the spatial distinction between different compartments, compartment A
regions are also characterized by active chromatin features including an early replication timing
profile, while compartment B regions are recognized as “inactive” regions [110]. Figure 2.4a
shows the assignment of compartments A and B for chromosome 2. For the entire genome, VGI
frequencies for regions of compartment A were significantly higher than those in chromatin
regions of compartment B (Student’s t-test p-value less than $10^{-15}$) (Figure 2.4b). Among the
222 VGI enriched bins at 6 hours p.i., 148 (66.7%) belong to compartment A, while for the
whole genome 2483 (43.6%) out of 5689 bins are part of compartment A. VGI frequency
distributions for regions in compartments A and B in all chromosomes are detailed in Table 2.2.
Figure 2.3 Spatial co-localization of VGI enriched chromatin regions. a) VGI enriched bins (6
hours pi) are selected from the 6 hour mock-infection TCC chromosomal contact frequency
matrix. The selected sub-matrix shows a number of spatial clusters for different chromosomes.
The sub-matrix for chromosome 21 is not shown in the figure, since only one VGI enriched bin is
identified in the chromosome. Similarly, the two VGI enriched bins for chromosome 22 are
located consecutively on the chromosome; therefore, the contact frequency between these two bins is
substantially high. b) The distribution of intrachromosomal contact frequencies of the selected
sub-matrices is shown on the left side (in red). The distribution of contact frequencies generated by the
distance restricted sampling method is shown on the right side (in blue).
Figure 2.4 Adenovirus genome interacts with active euchromatin regions and early replicated
genomic regions. a) Chromosome 2 cytogenetic features and compartment A/B assignments are
shown within the schematic chromosome. 6 hours pi VGI contact frequency profile of
chromosome 2 is shown on top and colored according to the A/B compartment membership of
the corresponding chromatin region. Cytogenetic features, compartment A/B assignment, and
VGI contact profile are all shown at 500kb resolutions. b) Box plot of VGI contact frequencies
for chromatin regions of different cytogenetic categories at 6 hours pi. c) Box plot of VGI
contact frequencies for chromatin of compartment A and B bins at 6 hours pi. d) Heatmap of
different histone modification markers and replication timing signals for different categories of
genomic regions, including, all VGI enriched regions, all compartment A regions, VGI enriched
compartment A regions, all compartment B regions, and VGI compartment B regions. Strengths
of signals for the heatmap entries are represented by fold change compared to the genome-wide
average signals. H3K27ac data is from Ferrari et al. 2014. H3K9ac and H3K18ac data are from
Ferrari et al. 2012. Replication sequencing data are from Consortium 2012. Other chromatin
feature data was downloaded from NCBI epigenome roadmap project.
Table 2.2 Summary of VGI distributions between compartment A and B regions for different
chromosomes.
Chromosome | Compartment with higher VGI | P-value
chr1 | A | 0.00491
chr2 | A | 1.37e-12
chr3 | A | 9.35e-09
chr4 | A | 2.5e-19
chr5 | A | 3.02e-15
chr6 | A | 2.84e-09
chr7 | A | 1.86e-06
chr8 | A | 5.02e-11
chr9 | A | 1.58e-05
chr10 | A | 8.39e-05
chr11 | A | 0.00109
chr12 | A | 0.0134
chr13 | A | 3.98e-06
chr14 | A | 0.000873
chr15 | A | 0.0667
chr16 | B | 0.764
chr17 | B | 8.14e-10
chr18 | A | 1.29e-06
chr19 | B | 0.972
chr20 | A | 0.369
chr21 | A | 0.121
chr22 | B | 0.52
chrX | A | 7.63e-29
We also compared VGI frequencies for regions in different cytogenetic categories. The mean VGI
frequency decreases with increasing Giemsa staining levels (for bins in categories ranging from
“gneg” to the “gpos100” cytogenetic categories) (Figure 2.4c). Centromeric stained regions
(categorized as “acen” in Figure 2.4b) are heterochromatic and have the lowest average VGI
frequencies. This observation also suggests that the adenovirus genome interacts less frequently
with heterochromatin and much more so with active euchromatin regions.
We further analyzed the viral-host genome contacts in the context of data from DNase I
hypersensitivity assays (Figure 2.5) and ChIP-seq assays of various histone modifications
(Figure 2.4d). These analyses indicated that viral DNA preferentially interacts with active open
chromatin regions. For instance, the 222 VGI enriched regions at 6 hours p.i. show a similar
histone modification profile as those in compartment A regions (denoted as VGI and A,
respectively, in Figure 2.4d), whose histone modification markers typically are associated with
actively transcribed chromatin regions such as H3K4me1, H3K4me2, H3K27ac, H3K79me1 and
H3K79me2 [147,148]. Histone modifications H3K27ac and H3K4me1 were previously shown to
be defining features of active enhancer regions [149]. Even the 74 VGI regions classified as
being part of the B compartment (denoted as VGI_B in Figure 2.4d) showed higher fold change
for histone modifications associated with active open chromatin (i.e. H3K27ac, H3K4me1,
H3K4me2, etc.) compared with the chromatin regions in the B compartment (denoted as B in
Figure 2.4d). In addition, we observed a strong negative correlation (Pearson’s correlation
coefficient = -0.39 in Figure 2.5) and a depleted fold change (fold change = 0.5 in Figure 2.4d)
between the VGI frequency profile and the occurrence profile of histone modification H3K9me3,
which is a marker for heterochromatin regions [150,151]. These observations suggest that viral
DNA-host genome interactions occur preferentially in active open chromatin regions. This
enrichment could be the result of preferential binding of the viral genome to active chromatin
regions, an increased viral DNA replication in these active regions, active exclusion from
heterochromatin, or combinations of these mechanisms. Since our analyses were based on data
from 6 hours p.i. when no significant viral replication is taking place, and viral DNA interacts
with specific regions of host euchromatin, we favor the interpretation that the viral genome
preferentially interacts with open active chromatin regions.
Figure 2.5 Genome wide Pearson’s correlations between 6 hours VGI contact frequency profile
and different chromatin feature profiles. H3K9ac and H3K18ac ChIP-seq data were downloaded
from the study of Ferrari et al. [137]. Other chromatin feature data were downloaded from the NCBI
epigenome roadmap project.
[Figure 2.5 plot. Title: Correlations with VGI Contact Frequency Profile. Axis: Genome Wide Pearson's Correlation (−0.4 to 0.4). Chromatin features shown: DNase, MethylC, H2AK5ac, H2AK9ac, H2A.Z, H2BK120ac, H2BK12ac, H2BK15ac, H2BK20ac, H2BK5ac, H3K14ac, H3K18ac, H3K23ac, H3K27ac, H3K27me3, H3K36me3, H3K4ac, H3K4me1, H3K4me2, H3K4me3, H3K56ac, H3K79me1, H3K79me2, H3K9ac, H3K9me1, H3K9me3, H4K20me1, H4K5ac, H4K8ac, H4K91ac.]
Interestingly, histone marker H3K27me3, which is associated with Polycomb-group silencing
[148,152], is strongly enriched in the VGI genomic regions (VGI, VGI_A, and VGI_B in Figure
2.4d), even though it antagonizes H3K27ac, a histone modification enriched in compartment A
but not in compartment B regions and associated with gene activity [153]. Given their opposite
transcriptional activities, it is intriguing that both H3K27ac and H3K27me3 markers are enriched
at VGI regions (especially in VGI_A regions). Both H3K27ac and H3K27me3 are involved in
transcription regulation of genes responsible for cell differentiation. Whether their enrichment in
VGI regions is mechanistically linked to viral induced transformation remains an interesting
open question. Nevertheless, these analyses suggest that viral-host genome interactions may
define unique chromatin states, which appear to share many features of the open euchromatin but
also show a histone marker profile distinct from those associated with activated and repressed
genes.
2.3.5 Relations between viral DNA contacts and DNA replication
Viral DNA replication occurs within defined regions of the host cell’s nucleus [154,155]. Figure
2.4e shows that compartment A regions, particularly the 148 VGI bins that are part of compartment A
(denoted as VGI_A), are early replicated. Even the 74 VGI regions classified as being part of the
B compartment showed substantially earlier replication timing compared with other regions in
the B compartment, which generally replicate at later stages (Figure 2.4e; VGI_B regions
showed highest replication signals in the mid S phase, while compartment B regions’ signals
generally enrich towards the G2 phase.). These observations suggest that the adenovirus genome
prefers to interact with the host genome’s early replication regions.
Infection with the dl1500 adenovirus will drive arrested fibroblast cells into S phase [135,143].
To see if the host genome replication driven by dl1500 adenovirus infection spatially co-
localizes with positions of viral DNA attachments, we performed BrdU-seq experiments in 24
hours p.i. IMR90 cells following experimental procedures as previously described [156]. IMR90
cells were treated with BrdU at 23.5 hours post dl1500 adenovirus infection. Cells were
harvested 30 min later, and BrdU-incorporated DNA was immunoprecipitated and purified for
high-throughput sequencing. Sequencing reads were aligned to both the hg19 and Ad5 reference
genomes. Among the uniquely aligned reads, 95% aligned to the Ad5 reference genome,
indicating that the majority of DNA replication is from the virus genome (Figure 2.6a). The 5%
BrdU-seq reads aligned to hg19, a total number of 1,009,498 reads, were indexed to 500-kb
resolution and compared with the VGI contact profiles at 24 hours pi (Figure 2.6b). Among the
731 VGI enriched chromatin bins at 24 hours p.i., about 40% overlapped with BrdU-seq signals.
Of the remaining bins without virus-genome interactions, only ~20% showed host genome
replication signals (Figure 2.6c). However, since genomic regions with BrdU-seq signal are also
much earlier replicated than the other regions (Figure 2.7; p-value less than $10^{-16}$), we cannot
determine if host genome replication at 24 hours p.i. is either due to its intrinsic early replication
timing or driven by viral activities. Based on the BrdU-seq data, we observed that the majority of
the DNA replication activities happened to the virus genome, which is likely the result of high
MOI value of virus infection [135]. Therefore, the fact that more genomic regions are identified
with VGI enrichment in 24 vs. in 6 hours p.i. (731 vs. 222) could be due to extra viral genome
load at the later time point.
Figure 2.6 Virus and host genome replication. Epigenetic remodeling at VGI enriched bins. a)
Fraction of BrdU-seq reads alignment to either hg19 or Ad5 reference genomes. b) VGI contact
frequency profile (upper panel) and BrdU-seq frequencies (lower panel) at 24 hours pi for
chromosome 2. VGI contact frequencies and BrdU-seq reads counts of VGI enriched bins are
colored in red while all data of all other bins are colored in blue. All the profiles are based on
500kb resolution. c) Percentage of VGI enriched bins with host genome replication identified by
BrdU-seq (in red). For the other bins, the percentage is colored in blue. All data shown for 24
hours pi. d) Heatmap of different ChIP-seq data before and after 24 hours of adenovirus dl1500
infection at the 24 hours pi VGI enriched host genome regions. Strengths of signals for the
heatmap entries are represented by fold change in VGI enriched regions compared to the
genome-wide average signals. Sources of data are from Ferrari et al. 2012 and Ferrari et al. 2014.
e) 6 hours pi VGI contact frequency distribution on Chromosome 2. H3K18ac and H3K9ac
distributions for mock-infection and post-infection samples. Zoom in regions show alterations of
different histone acetylations after infection. Correlations between primary VGI contact profile
and either H3K18ac or H3K9ac distribution in either mock-infection or post-infection sample.
Figure 2.7 Comparison of replication timing signal between genomic regions with BrdU-seq
reads at 24 hours pi versus regions without BrdU-seq reads. The wavelet-smoothed replication
timing signal was retrieved from the ENCODE project consortium database [157].
[Figure 2.7 plot. Groups: bins with BrdU reads, other bins. Axis: Wavelet-smoothed replication timing signal (0 to 100). p-value = 3.12e-84.]
2.3.6 Remodeling of epigenetic features in the VGI enriched host genomic
regions
Since small e1a alters global patterns of specific histone modifications in the host genome
[137,142,143], we compared ChIP-seq data of various histone modifications before and after
adenovirus dl1500 infection. We found that VGI enriched genomic regions undergo dramatic
changes in their epigenetic chromatin features upon viral infection (Figure 2.6d). The host lysine
acetylase p300 binds to VGI enriched regions after infection. p300 interacts with the
retinoblastoma (RB) proteins, which together with e1a (the p300-e1a-RB complex) function in
repressing selected host genes involved in anti-viral defense [144]. Reduced fold changes of
H3K27ac and H3K4me1 were observed in the VGI enriched genomic regions after infection
(Figure 2.6d), indicating that VGI regions coincide with locations of host epigenome remodeling.
In addition, the histone modification profile of H3K18ac before infection is strongly anti-
correlated (Pearson’s correlation coefficient = -0.44) with the VGI frequencies at early infection
(Figure 2.4d, 2.6d, 2.7). Upon viral infection small e1a causes a substantial (~70%) reduction in
cellular levels of H3K18ac [137,143] and specifically, the elimination of essentially all H3K18ac
ChIP-seq peaks at promoters and intergenic regions of genes related to fibroblast functions.
Meanwhile, the small e1a also induces the appearance of new H3K18ac peaks at promoters of
highly induced genes associated with cell cycle control and at new putative enhancers [137]. It is
interesting to see that the viral genome preferentially interacts with the host genome at positions
deprived of H3K18ac at early stages of infection. And after 24 hours of infection, H3K18ac is
enhanced at such VGI genomic regions in the presence of RB and p300 (Figure 2.6d, 2.6e).
These observations suggest that acetylation of H3K18 [143] may be related to the binding of the
adenovirus genome or shows similar preference for chromatin regions associated with viral
genome interactions. Besides H3K18ac, enrichment is also found for H3K9ac in the VGI
enriched genomic regions because of the general colocalization of these two histone acetylation
marks [137,158].
The above observations suggest that virus genome attachments to the host genomic regions
coincide with, and may even induce, epigenetic changes during adenovirus infection. The
mechanisms by which the binding of the virus genome may be related causally or
consequentially to such epigenetic changes are unclear. However, our studies show non-random
viral-host genome interactions and raise interesting questions for future studies on the role of VGIs in
host epigenome remodeling, host gene expression, viral genome replication and the underlying
mechanisms of oncogenic transformation caused by adenovirus infection.
2.4 Conclusions
In this study, we have analyzed the global physical interactions between the adenovirus genome
and host fibroblast genome using tethered chromosome conformation capture. Our analyses
indicate that virus-host genome interactions occur at specific genomic locations. By mapping
VGIs in the host genome at different stages of adenovirus infection, we find that the initial
pattern of VGI locations is largely maintained throughout the infection process and that, at later
stages of infection, the distribution of VGIs expands to additional host genomic regions. VGIs
preferentially occur in the active chromatin subcompartment in the host genome, which could
potentially facilitate viral gene expression and DNA replication. While the host genomic regions
targeted preferentially by virus interactions share many epigenetic features of activated genes,
they also possess many unique histone markers, suggesting that virus attachment sites on the host
genome constitute a unique class of chromatin states. Our analyses also suggest that VGI could
have substantial impact on the activities of the host genome. Adenovirus-induced epigenetic
remodeling of the host genome appears to be enhanced in genomic regions enriched with VGIs.
Likewise, the binding of regulatory protein machineries (the p300 and RB complex) is increased
at chromatin regions with enriched VGIs. It is conceivable that VGIs may be facilitating these
changes or that these epigenetic changes facilitate VGIs to achieve a favorable outcome for the
virus. Further experiments with higher resolutions are needed to validate and expand on these
findings, to reveal more detailed structural features of the host chromatin regions where the virus
genome preferentially binds, and to elucidate the role of the virus genome in regulating host genome
activities.
Chapter 3 Hi-C Data Normalization by Restriction End
Sequencing
3.1 Introduction
The general principles of the Chromosome Conformation Capture (3C) based protocols rely on
formaldehyde fixation to capture trans and cis chromosomal interactions in living cells. The
crosslinked cells are incubated with a restriction enzyme that cuts the DNA into a number of
restriction fragments (RFs). A ligation step in either diluted, tethered or in situ conditions will
favor ligation events between RFs trapped within the same complex [102,110,112]. After a
decrosslinking step, the resulting 3C DNA templates consist of a collection of ligation products
of two RFs, whose relative abundance (after normalization) reflects the frequency with which
these two chromatin segments were crosslinked in the population of cells. The analysis of this
library enables the generation of chromosomal contact maps, which allow deciphering the 3D
spatial positioning of loci with respect to each other. In the past few years, quantification of the
abundance of ligation products has evolved from semi-quantitative PCR [55] to deep-sequencing
techniques [102]. The latter approach now enables genome-wide analysis of chromosome
organization. The result of such an experiment is a symmetric matrix with fixed bin size (i.e., the
observed contact (OC) matrix) that describes the number of times any given pair of RFs has been
detected in a ligation event at a genome-wide scale. Those matrices represent the relative
frequency of physical interaction for each RF in the genome with all of the other RFs.
3C-derived experiments are likely to generate biases given the complexity of the protocols, and
necessitate a dedicated effort to experimentally identify and limit the generation of byproducts at
each step [82]. Different biases, resulting from formaldehyde crosslinking, restriction enzyme
digestion, restriction fragment length, sequence GC content, sequence mappability, etc. arise at
different steps of the protocols [82,114,145,159]. Therefore, these data need to be carefully
processed in order to identify these biases and limit the introduction of false predictions in the
final analysis.
Various normalization methods have been suggested to remove the biases of Hi-C data at the
fragment and segment levels [114,115,160,161]. Yaffe and Tanay suggested the probabilistic
modeling method with explicit bias factors, where various error sources—such as fragment
length bias, mappability bias, and GC content bias—were identified based on the fragment level
and corrected based on a probabilistic model [114]. Although this approach is advantageous in
that the physical origins of experimental biases are clearly shown, an unknown portion of error
sources can become magnified in the processed results. Also, Hi-C biases could vary from
experiment to experiment and from organism to organism, so a new probabilistic model would need
to be generated each time. Alternatively, Imakaev et al. suggested an iterative segment-level correction
method by equalizing the coverage of each segment from the frequency of double-sided and
single-sided reads [115]. This approach avoids systematic errors of complex physical origins but
may increase the artificial error caused by physically isolated fragments that have no nearby RFs.
For most other Next-generation DNA Sequencing (NGS) based methods, such as Chromatin
Immunoprecipitated DNA sequencing (ChIP-seq) and Methylated-DNA immunoprecipitation
sequencing (MeDIP-seq), a background null model is usually required to normalize data and call
significant peak signals [162,163]. Even though there are a number of different ways to build
statistical null models for NGS data normalization, the ideal way is to achieve a null model from
a control experiment and perform normalization accordingly [164].
Following approaches for ChIP-seq experiments, a number of statistical null model methods are
proposed for 3C/Hi-C data normalization and peak calling [113,114,161,165]. However, none of
these methods are able to adopt a control experiment to establish the data normalization null
model. A published 3C/5C protocol adopts a randomized ligation control experiment for 3C/5C
data normalization [166]. However, for Hi-C experiments, the cost of building a contact bias
matrix is substantially higher than for the actual Hi-C experiment, because of the
two-dimensional nature of Hi-C data. Such a Hi-C control experiment might only work for studying
organisms with small genomes, such as bacteria and yeast [109].
Even though it is not cost effective to prepare a randomized ligation control experiment for Hi-C,
the frequency of randomized ligations between two restriction fragments can be predicted from
the total number of each restriction end in the library given a certain fragment ligation event.
Similarly, if contact frequencies between restriction fragments are represented in a binned
contact frequency matrix, expected ligation frequencies between two bins in the bias matrix can
be predicted by the total number of restriction end alignments that are sequenced and aligned to
each matrix bin. Therefore, in this study, we design a Restriction End Sequencing (RES) control
for Hi-C to predict the expected bias between two different restriction fragments or two Hi-C OC
matrix bins in order to perform Hi-C data normalization.
3.2 Methods
3.2.1 Prepare restriction end sequencing library for Hi-C data normalization
Ideally, the restriction end sequencing control experiment should be conducted in parallel with
the regular Hi-C experiment, because biases from each of the Hi-C experimental steps (e.g.,
formaldehyde crosslinking, restriction enzyme digestion, PCR amplification, etc.) may vary from
experiment to experiment. To prepare a RES library, the standard Hi-C experiment protocol is
followed until the step of intra-molecular ligation. This ligation step is skipped and the protocol
proceeds to the step of chromatin digestion and genomic DNA purification. In the standard
procedures for Hi-C sequencing library preparation, the purified genomic DNA is treated with an
enzyme that has exonuclease activity to remove the biotin labels from unligated restriction
ends [102,110]. In the RES control experiment, however, the goal is to pull down the
restriction end DNA fragments for sequencing and to quantitatively measure the number of
mapped reads from the restriction ends. Therefore, the purified DNA from the RES control
experiment is directly used for the DNA shearing step, streptavidin pull-down, and Illumina
sequencing library preparation following the same procedure as in the standard protocol for Hi-C
sequencing library preparation (Figure 3.1a).
Figure 3.1
a) Representations of RES reads from the RES control experiment and Hi-C contacts reads and
RES (dangling) reads from the Hi-C experiment.
b) The one-dimensional sequencing reads coverages of the filtered RES reads from control
experiment, Hi-C contact reads, RES (dangling) reads, and the total (non-filtered) RES reads
from control experiment. HindIII restriction site positions and 50mers mappability scores of
hg19 are also laid out.
c) Zoomed-in view of different read coverage distributions around 4 HindIII restriction sites.
d) Two-dimensional comparison of RES bias vector from control experiment and RES (dangling)
bias vector from Hi-C experiment. (with Pearson’s correlation coefficient of 0.92)
e) Two-dimensional comparison of RES bias vector from control experiment and KR Hi-C
matrix balancing bias vector. (with Pearson’s correlation coefficient of 0.86)
3.2.2 Mapping RES sequencing output and extract restriction end alignments
The paired-end sequencing output of the RES library is mapped to the hg19 reference genome with
the BWA-MEM aligner in paired-end mode using the parameters “-SP5M” [167], and only the
primary alignments are retained. Sequencing alignments with MAPQ quality scores less than 30
are removed (the MAPQ filter should follow the same criteria as used for Hi-C data
processing). As shown in Figure 3.1b, the majority of the Hi-C contact reads are aligned to the
genomic regions adjacent to restriction enzyme cutting sites. However, a significant amount of the
RES sequence alignments are located at genomic regions that are distant from the restriction sites.
These nonspecific alignments could result from nonspecific binding of DNA
fragments to the streptavidin beads during biotin pull-down, and from random fragmentation of
genomic DNA that produces fragment ends which can be labeled with biotinylated nucleotides but
cannot be ligated.
Therefore, the primary alignments of RES library are filtered to retrieve restriction end specific
alignments by searching the 5’ end of the query sequence for the restriction cutting site signature
(“AGCTT” for HindIII) in each read of the sequencing pair. One mismatch or deletion is allowed
when matching the cutting site signature, to account for the star activity of the restriction enzyme in Hi-C
experiments [110]. As long as one read of the pair satisfies the signature requirement, the
read pair is considered a candidate RES alignment for Hi-C data
normalization. These alignments contain read pairs for which both reads were aligned and read
pairs for which only a single read has been aligned. For the single end aligned RES reads, only
the mapped read will be considered as RES alignment. For read pairs for which both ends are
aligned, both of the two reads will be considered as RES alignments if the two alignments are
mapped to different strands and the genome position distance between the two reads is less than
1000bp.
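The filtering rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in this study: the `Read` record and the function names are hypothetical, MAPQ filtering is assumed to have been applied beforehand, and the text does not say what happens to double-aligned pairs that fail the strand or distance test, so the sketch simply discards them.

```python
from dataclasses import dataclass
from typing import List

SIGNATURE = "AGCTT"  # HindIII restriction end signature quoted in the text


@dataclass
class Read:
    """Hypothetical minimal record for one read of a sequenced pair."""
    seq: str
    mapped: bool
    chrom: str = ""
    pos: int = 0
    strand: str = "+"


def has_signature(seq: str, signature: str = SIGNATURE) -> bool:
    """True if the 5' end matches the signature with at most one mismatch or one deletion."""
    seq = seq.upper()
    n = len(signature)
    prefix = seq[:n]
    if len(prefix) == n and sum(a != b for a, b in zip(prefix, signature)) <= 1:
        return True
    short = seq[:n - 1]
    if len(short) < n - 1:
        return False
    # One deletion: the read prefix equals the signature with a single base removed.
    return any(signature[:i] + signature[i + 1:] == short for i in range(n))


def select_res_alignments(r1: Read, r2: Read, max_dist: int = 1000) -> List[Read]:
    """Return the reads of a pair that count as RES alignments."""
    if not (has_signature(r1.seq) or has_signature(r2.seq)):
        return []
    mapped = [r for r in (r1, r2) if r.mapped]
    if len(mapped) < 2:
        return mapped  # single-end aligned pair: keep only the mapped read
    # Assumption: the distance test is only meaningful on the same chromosome.
    same_chrom = r1.chrom == r2.chrom
    if same_chrom and r1.strand != r2.strand and abs(r1.pos - r2.pos) < max_dist:
        return [r1, r2]
    return []  # assumption: all other double-aligned pairs are discarded
```

For example, a read starting with `AGTTG...` passes the test because it matches `AGCTT` with the `C` deleted, while a read starting with `GGGGG...` does not.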
3.2.3 Retrieve RES reads from Hi-C sequencing library
In protocols of Hi-C based methods [102,110,112], exonuclease treatment is applied to the
extracted DNA to remove the biotin labels from unligated restriction ends in order to increase the
yield of Hi-C contact reads. However, as is common for most enzymatic reactions, the
exonuclease treatment is not 100% efficient. Therefore, a substantial amount of
restriction end fragments are pulled down by streptavidin beads and end up in the Hi-C
sequencing library. These fragments are commonly called “dangling ends” in Hi-C
experiments. Under the assumption of low efficiency of blunt end DNA ligation, alignments of
the dangling end reads have the same features as restriction end sequencing reads and therefore
can be used for Hi-C data normalization as well (Figure 3.1a, 3.1c).
To retrieve RES read alignments from the Hi-C library, the complete sequencing output (raw
fastq files) is aligned against the reference genome with BWA-MEM in paired-end mode using
the same parameters (“-SP5M”), and only the primary alignments are retained. First, read pairs
are identified as Hi-C contacts if both ends are mapped with a MAPQ score greater than 30 to
either different chromosomes or the same chromosome with a genomic distance greater than
1000 bp. Read pairs that are not recognized as Hi-C contacts are considered RES alignments
if they contain the restriction enzyme cutting site signature (“AGCTT” at the 5’ end) and the two
read ends are less than 1000 bp from each other. Additionally, to remove
ligation events from the identified RES alignments, a read pair is removed from the RES reads if
the alignment of a read does not completely match the reference genome and its query
sequence contains the restriction enzyme ligation junction (“AAGCTAGCTT” for HindIII) as a
sub-sequence. The selected RES reads from the Hi-C library are then processed in the
same way as RES reads from the control experiment to build normalization bias vectors.
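The classification of Hi-C read pairs described above can be summarized as a small decision function. The sketch below is illustrative only: the `Aln` record, the `full_match` flag, and the label strings are hypothetical names, and the signature test is simplified to an exact 5’ match rather than the one-mismatch-or-deletion rule of Section 3.2.2.

```python
from typing import NamedTuple

SIGNATURE = "AGCTT"      # HindIII restriction end signature from the text
JUNCTION = "AAGCTAGCTT"  # HindIII ligation junction from the text


class Aln(NamedTuple):
    """Hypothetical primary alignment of one read of a pair."""
    chrom: str
    pos: int
    mapq: int
    seq: str
    full_match: bool  # True if the read matches the reference end to end


def classify_pair(a: Aln, b: Aln, min_mapq: int = 30, max_dist: int = 1000) -> str:
    """Label a read pair as 'contact', 'res', 'ligation', or 'other'."""
    same_chrom = a.chrom == b.chrom
    dist = abs(a.pos - b.pos)
    # Hi-C contact: both ends well mapped, on different chromosomes or far apart.
    if a.mapq > min_mapq and b.mapq > min_mapq and (not same_chrom or dist > max_dist):
        return "contact"
    # Simplification: exact signature match at the 5' end of either read.
    has_sig = a.seq.startswith(SIGNATURE) or b.seq.startswith(SIGNATURE)
    if has_sig and same_chrom and dist < max_dist:
        # A partially matching read that carries the junction is a ligation product.
        if any(not r.full_match and JUNCTION in r.seq for r in (a, b)):
            return "ligation"
        return "res"
    return "other"
```

Only pairs labeled `res` would contribute to the dangling end bias vector; pairs labeled `contact` go into the OC matrix.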
3.2.4 Normalize Hi-C contact matrix by RES sequencing data
For Hi-C contacts, Observed Contact (OC) matrices can be constructed and binned at different
resolutions. The RES alignments are binned at the same resolution as the Hi-C OC
matrix. We then assume that the number of RES alignments in each bin represents the bias factor
for the given genomic regions, because the expected random ligation events between fragments
in the two bins are proportional to the product of the available fragments in the library, which is
detected by the RES alignments. Therefore, the OC matrix ($O_{i,j}$) can be normalized by the RES
bias factor product ($RES_i \cdot RES_j$) of each pair of bins ($i$, $j$) to achieve the normalized Contact
Frequency (CF) matrix ($F_{i,j}$). CF matrix is then scaled by a factor $M$ to achieve equal contact
coverage as the original OC matrix.
$$F_{i,j} = M \cdot \frac{O_{i,j}}{RES_i \cdot RES_j}$$
The OC matrix can be normalized by the RES bias vector retrieved either from the control
experiment or from the Hi-C dangling end sequences, which are denoted as RES normalization
and RES (dangling) normalization, respectively.
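A minimal sketch of this normalization is given below. It uses plain Python lists to stay dependency-free (array libraries would be the natural choice for genome-scale matrices). Two points are assumptions rather than statements from the text: the scaling factor M is chosen so that the total of the CF matrix equals the total of the OC matrix, and bin pairs with a zero RES product are left at zero. The function names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def bin_counts(positions: Sequence[int], bin_size: int, n_bins: int) -> List[float]:
    """Count RES alignment positions per fixed-size bin (the RES bias vector)."""
    counts = [0.0] * n_bins
    for p in positions:
        counts[p // bin_size] += 1.0
    return counts


def res_normalize(oc: Matrix, res: Sequence[float]) -> Matrix:
    """Compute F[i][j] = M * O[i][j] / (RES[i] * RES[j])."""
    n = len(oc)
    cf = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = res[i] * res[j]
            if denom > 0:  # assumption: bins without RES reads stay at zero
                cf[i][j] = oc[i][j] / denom
    total_oc = sum(map(sum, oc))
    total_cf = sum(map(sum, cf))
    # Assumption: M restores the total contact count of the OC matrix.
    m = total_oc / total_cf if total_cf > 0 else 0.0
    return [[m * v for v in row] for row in cf]
```

With an OC matrix `[[4, 2], [2, 8]]` and a RES bias vector `[2, 1]`, the unscaled ratios are `[[1, 1], [1, 8]]`, and M = 16/11 brings the total back to 16.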
3.3 Results and Discussions
3.3.1 RES normalization by Hi-C dangling end sequences reproduces
normalization result by RES control experiment
To evaluate the RES normalization method, a RES control experiment is performed in
parallel with a Tethered Conformation Capture (TCC) experiment on the lymphoblastoid cell line
GM12878 with HindIII as the restriction enzyme. Among the 289,827,242 raw paired-end
reads of the RES control experiment, 42,006,322 read pairs (a total of
77,716,651 RES alignments, counting both single-end and double-end alignments) are
retained as RES alignments after filtering (see Methods) for data normalization. Among the
304,961,337 raw paired-end reads of the TCC experiment, 108,619,021 Hi-C
contacts are identified to build OC matrices and 18,863,946 dangling end RES
alignments are identified for data normalization at the corresponding matrix resolution (most of the
analyses in this study are conducted at 100kb resolution). RES alignments from (i) the
control experiment and (ii) the Hi-C dangling end sequences are both binned at 100kb
resolution to construct the normalization bias vectors. The two resulting bias vectors
are highly correlated (Pearson’s r = 0.92), which indicates consistent normalization results by
the RES and RES (dangling) methods in the majority of genomic regions (Figure 3.1d). This high
correlation makes it possible to normalize public Hi-C datasets directly from the dangling end
reads of the Hi-C library, without the need to perform a dedicated RES control experiment.
3.3.2 Differences between RES normalization and genome-wide matrix
balancing normalization
The RES normalization method is based on a different assumption (experimental control based)
than the existing Hi-C data normalization methods, such as explicit factor normalization (non-
experimental probabilistic modeling based) and matrix balancing normalization (equal visibility
based). However, common to all normalization methods is the assumption that the bias is a
unique feature of a given genomic region. Therefore, bias can be corrected by applying a bias
vector to the observed Hi-C contact (OC) matrix. In this section, we select the Knight-Ruiz
(KR) matrix balancing normalization method [115,168] and compare it with the RES method.
The TCC OC matrix of GM12878 is normalized by KR matrix balancing as described in Rao et
al. [112]. To reduce the effects of matrix balancing artifacts and sampling errors of low contact
coverage bins, 2% of bins with lowest contact coverage (not including bins with zero contact
coverage) are removed before KR normalization. When comparing the resulting bias vectors
between KR and RES normalization, we observe that KR and RES bias vectors are fairly
consistent (Pearson’s r = 0.82) to each other for the majority of genomic regions. Some
differences are seen for genomic regions where the RES biases are relatively higher than KR
biases (Figure 3.1e). If the RES bias vector generated from the control experiment is taken
as the ground truth for bias quantification, this difference (RES bias higher than KR bias)
can potentially lead to over-representation of contact frequencies in these regions in KR
normalization. As shown in Figure 3.2d, these contact over-representations are most evident in
the peri-centromeric regions. At the same time, contact frequencies close to the telomere regions
are relatively enhanced in RES normalization.
We also observe differences in the one-dimensional (1D) contact coverage between RES and KR
normalizations. By definition, the KR normalization method balances the Hi-C matrix and
achieves equal genome-wide 1D contact coverage across different bins [168]. However, the 1D
contact coverage of the whole genome CF matrix after RES normalization follows a nearly normal
distribution instead of a uniform distribution (Figure 3.3). This is intuitively reasonable, since
biological differences between genomic regions will lead to differences in forming interactions.
Note that genome-wide KR balancing does not guarantee a balanced cis CF matrix, so the marginal
sum of the cis CF matrix follows a non-uniform distribution (Figure 3.3). Additionally, different studies
have adopted different basic assumptions in Hi-C matrix balancing. In Imakaev et al., the Iterative
Correction (IC) method suggests that the Hi-C matrix should be iteratively corrected by using the
whole genome 1D contact coverage plus the number of single end alignments in each bin to
balance genomic regions with repetitive reference sequences [115]. In Rao et al., the KR method
is proposed to balance the Hi-C matrix genome-wide, intrachromosome-wide, or
interchromosome-wide without considering the single end aligned reads [112]. Also, the
two methods treat the diagonal entries of the matrix differently: IC removes the
diagonal and diagonal +1 matrix entries before balancing, while KR does not. Such different
assumptions will lead to differences in matrix normalization results.
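To make the "equal visibility" idea behind matrix balancing concrete, the sketch below rescales a symmetric matrix until all bins have the same marginal sum. It is a generic damped scaling iteration written for illustration, and it is neither the IC procedure of [115] nor the KR algorithm of [168]: it keeps the diagonal, ignores single end alignments, and uses a fixed number of iterations. The function name and the square-root damping are choices made for this sketch.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def balance(matrix: Matrix, n_iter: int = 200) -> Tuple[Matrix, List[float]]:
    """Scale a symmetric matrix toward equal row sums; return (balanced, bias)."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    bias = [1.0] * n
    for _ in range(n_iter):
        sums = [sum(row) for row in m]
        nonzero = [s for s in sums if s > 0]
        if not nonzero:
            break
        mean = sum(nonzero) / len(nonzero)
        # Damped update: square root of the relative coverage; empty bins are untouched.
        step = [math.sqrt(s / mean) if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                m[i][j] /= step[i] * step[j]
        bias = [b * s for b, s in zip(bias, step)]
    return m, bias
```

The returned bias vector satisfies `matrix[i][j] ≈ balanced[i][j] * bias[i] * bias[j]`, which is the same multiplicative form as the RES bias factor product, but here the vector is inferred from the contact matrix itself rather than measured in a control experiment.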
For comparison, Figure 3.3 shows the genome-wide and chromosome-wide 1D contact coverage
distributions for the observed contact matrix, RES normalized contact matrix by control
experiment, RES normalized contact matrix by dangling end alignments, and KR normalized
contact matrix. Unlike KR matrix balancing, RES normalization yields matrices whose
contact coverage distributions are consistent (normal-like distributions) between the cis CF matrix
and the whole genome CF matrix. It is also noticeable that the contact coverage distributions are
consistent between RES and RES (dangling) normalizations. It is intuitive to assume that the
Hi-C matrix marginal sum does not necessarily follow a uniform distribution, since there could be
genomic regions that are isolated from the other regions and therefore form fewer contacts.
Moreover, the potential of a given genomic region to form interactions should be consistent
genome-wide and chromosome-wide. In other words, unlike matrix balancing methods, RES normalization
follows a consistent assumption and achieves consistent normalization results both genome-wide
and chromosome-wide.
Figure 3.2 Differences among RES normalization, RES (dangling) normalization, and KR
matrix balancing normalization. (all matrices are at 100kb resolution)
a) Chromosome 7 contact frequency matrix normalized by whole genome KR matrix balancing.
b) Chromosome 7 contact frequency matrix normalized by RES from control experiment.
c) Chromosome 7 contact frequency matrix normalized by RES (dangling) from Hi-C
experiment.
d) Log ratio difference matrix between RES normalized matrix and KR normalized matrix. Both
matrices are scaled with equal median before calculating log ratio.
e) Log ratio difference matrix between RES normalized matrix and RES (dangling) normalized
matrix. Both matrices are scaled with equal median before calculating log ratio.
Figure 3.3 TCC matrix marginal sum distributions of the observed contact matrix, the RES normalized
contact matrix by control experiment, the RES normalized contact matrix by dangling end
alignments, and the KR normalized contact matrix by matrix balancing. All matrices are analyzed at
100kb resolution.
3.3.3 Comparison of Hi-C data reproducibility between RES normalization
and genome-wide matrix balancing normalization
Hi-C data normalization methods can reduce a variety of technical biases and should therefore
increase the reproducibility of data between different Hi-C replicates that may have different
sources of biases [114,115,169]. In this study, we analyzed two Hi-C data replicates of
embryonic stem cells that use different restriction enzymes, HindIII and NcoI [170]. Both of the
two datasets are processed at 100kb and 500kb resolutions and normalized by KR matrix
balancing and RES (dangling) as described previously. First, we use the 500kb Hi-C matrices to
study data reproducibility of contact coverage. We define the 1D contact coverage of a
symmetric Hi-C matrix as the marginal sum of a given bin. For the observed OC Hi-C matrices
of the HindIII and NcoI replicates, we find a Spearman’s correlation of r = 0.52 for the whole
genome one-dimensional (1D) contact coverage (Figure 3.4a). However, this correlation does not
truly indicate high data reproducibility because of the intrinsic common biases within the two
datasets. For instance, both of the replicates have extremely high 1D contact coverage at bins
close to the telomere regions of chromosome 11, which could be biased by high enzyme
accessibility for both HindIII and NcoI. Also, the low contact coverage of chromosome X of the
two replicates is simply a result of the haploid chromosome X in the male cell line in comparison
to all the other diploid autosomes. After RES normalization, the correlation of the whole genome
1D contact coverage is substantially higher (Spearman’s r = 0.80) than the observed datasets
(Figure 3.4a) showcasing improved reproducibility of the replicate data sets after RES bias
removal. Interestingly, the previously observed spurious biases at chromosome 11 telomere
region and chromosome X are removed in RES normalization as well. As shown in Figure 3.4b,
after RES normalization, the 1D contact coverage distribution has lower standard deviation than
the observed dataset. Also, bins with lowest coverage in the observed dataset show much higher
contact coverage after RES normalization. This also indicates that RES normalization reduces
biases in the observed Hi-C data. Similar results can be observed in both of the two replicates
(Figure 3.4b).
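The two quantities used in this comparison, the 1D contact coverage of each bin and the Spearman correlation between two replicates, can be sketched as follows. The helper names are hypothetical and the code is a plain-Python illustration rather than the analysis code of this study; ties are handled with average ranks, and inputs with zero variance are not handled.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def coverage_1d(matrix: Matrix) -> List[float]:
    """1D contact coverage: the marginal (row) sum of each bin."""
    return [sum(row) for row in matrix]


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Given two replicate matrices binned identically, `spearman(coverage_1d(rep1), coverage_1d(rep2))` gives the kind of whole genome coverage correlation reported above.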
By definition, KR matrix balancing leads to uniform distributions of
whole genome 1D contact coverage, so KR normalized datasets cannot be compared in terms of
reproducibility with either observed Hi-C data or RES normalized Hi-C data. However, whole genome matrix
balancing normalization does not lead to uniform distribution of cis and trans 1D contact
coverage in different chromosomes, which should also be reproducible between the two
replicates. As shown in Figure 3.4c, on one hand, RES normalization also leads to reproducible
results in terms of both cis and trans 1D contact coverage of the two replicates. On the other
hand, cis and trans contact coverage reproducibility is low after KR normalization. The above
result indicates that RES normalization leads to more consistent results than whole genome
matrix balancing normalization in terms of contact coverage reproducibility.
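The cis and trans 1D contact coverage compared here splits each bin's marginal sum by whether the partner bin lies on the same chromosome. A minimal sketch, assuming a hypothetical `chrom_of_bin` list that gives the chromosome label of every matrix bin:

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def cis_trans_coverage(matrix: Matrix, chrom_of_bin: Sequence[str]) -> Tuple[List[float], List[float]]:
    """Split the marginal sum of each bin into cis and trans contributions."""
    n = len(matrix)
    cis = [0.0] * n
    trans = [0.0] * n
    for i in range(n):
        for j in range(n):
            if chrom_of_bin[i] == chrom_of_bin[j]:
                cis[i] += matrix[i][j]    # partner bin on the same chromosome
            else:
                trans[i] += matrix[i][j]  # partner bin on another chromosome
    return cis, trans
```

By construction the cis and trans values of a bin add up to its whole genome 1D contact coverage, so a genome-wide balanced matrix can still have non-uniform cis and trans coverage.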
To investigate Hi-C data reproducibility of short-range interactions, we utilize a weighted
correlation calculation method developed as “HiCRep” to evaluate data reproducibility of the
above two replicates [169]. In this study, we performed HiCRep analysis to calculate stratum-
adjusted correlation coefficients (SCC) between the two replicates’ Hi-C matrices at 100kb
resolution with equalized sequencing depth, optimized matrix smoothing, and short-range
interactions within 5Mb [169]. SCC scores are calculated between the observed, KR normalized,
and RES normalized replicates. As a result, the observed replicates always show the highest SCC
score, which might also be caused by the intrinsic common biases between the replicates. For
most of the chromosomes, KR normalized datasets result in relatively higher SCC scores than
RES normalized datasets. However, both KR and RES normalizations lead to high SCC scores
(HiCRep SCC > 0.70), which indicates that both normalization methods lead to high data
reproducibility in terms of short-range Hi-C interactions (Figure 3.4d).
Figure 3.4 Hi-C reproducibility of KR normalization and RES normalization.
a) Whole genome one-dimensional (1D) contact coverage along the chromosomes of the
observed Hi-C matrix and RES normalized Hi-C matrix between the HindIII and NcoI Hi-C
replicates.
b) Change of the whole genome 1D contact coverage distribution before and after RES
normalization.
c) cis and trans 1D contact coverage correlations between replicates (HindIII vs. NcoI) by
different normalization methods (RES vs KR matrix balancing).
d) Weighted correlation scores (Stratum-adjusted correlation coefficient by HiCRep) generated
from short range interactions (5Mb in this study) of different chromosomes in observed Hi-C
matrix, KR normalized Hi-C matrix, and RES normalized Hi-C matrix.
3.3.4 RES performance in high resolution Hi-C contact matrix
Next, we investigate the performance of RES normalization for high resolution Hi-C contact
matrices, at 1kb and 5kb resolution. We applied the RES normalization to the Rao et al. [112]
combined in situ Hi-C data sets for GM12878 cell lines. After processing their fastq sequencing
files, 4,139,627,512 Hi-C contacts are identified to build the contact matrices. Meanwhile,
437,562,753 RES alignments are identified to build RES bias vector at different resolutions.
Contact matrix heatmaps at different resolutions with both KR normalization and RES
normalization are shown in Figure 3.5. Since the in situ Hi-C dataset used a different restriction
enzyme (MboI), the bias behavior might be different from the HindIII TCC dataset discussed
previously. Nevertheless, the general behavior, such as the over-representation of KR normalization
at the peri-centromeric regions, is maintained. It is also evident that contact matrices after
RES normalization are not as smooth as those after KR normalization, especially at high
resolution such as 1kb. On one hand, the lower smoothness of RES normalization might be due
to the unbalanced nature of the method itself. On the other hand, RES alignments do not
completely match the positions of Hi-C contact alignments (Figure 3.1b, 3.1c). Therefore, bias
representation in RES reads might diverge in the aspects of GC content and mappability at high
resolution, because the bias level in a given genomic region might depend more strongly on
particular restriction enzyme cutting sites at high resolution. Ideally, a certain level of GC content and
mappability bias correction should be applied to RES normalization as well.
Figure 3.5 RES normalization in contact matrices at high resolutions.
3.4 Conclusions
In this study, we developed a novel Hi-C data normalization method which is based on a
restriction end sequencing (RES) experimental control. Unlike a randomized ligation Hi-C control,
RES is cost-effective for organisms with large genomes and can therefore be used to normalize Hi-C
data at high resolution. Additionally, the dangling end sequences from the Hi-C experiment also
capture restriction end frequencies and reproduce those measured in the RES control
experiment. Even though slight differences are observed between the dangling RES read
frequencies and those from a RES control experiment, the dangling ends still provide an option to
perform RES Hi-C data normalization without an actual control experiment. Compared with the whole
genome matrix balancing normalization method, RES normalization is superior because it adopts a
normalization null model from a real control experiment instead of statistical modelling or
mathematical manipulations based on a variety of different assumptions. In our analysis, we
showed that RES normalization results in higher data reproducibility of Hi-C replicates in terms
of one-dimensional contact coverage than whole genome matrix balancing normalization, which
indicates a higher normalization accuracy in reducing technical biases of Hi-C data.
References
[1] J.D. Watson, F.H.C. Crick, Molecular structure of nucleic acids: A structure for
deoxyribose nucleic acid, Nature. 171 (1953) 737–738. doi:10.1176/appi.ajp.160.4.623.
[2] R.E. Franklin, R.G. Gosling, The structure of sodium thymonucleate fibres. I. The
influence of water content, Acta Crystallogr. 6 (1953) 673–677.
doi:10.1107/S0365110X53001939.
[3] D. Elson, E. Chargaff, On the desoxyribonucleic acid content of sea urchin gametes,
Experientia. 8 (1952) 143–145. doi:10.1007/BF02170221.
[4] A.H. Wang, G.J. Quigley, F.J. Kolpak, J.L. Crawford, J.H. van Boom, G. van der Marel,
A. Rich, Molecular structure of a left-handed double helical DNA fragment at atomic
resolution., Nature. 282 (1979) 680–686. doi:10.1038/282680a0.
[5] R.E. Franklin, R.G. Gosling, Evidence for 2-Chain Helix in Crystalline Structure of
Sodium Deoxyribonucleate, Nature. 172 (1953) 156–157. doi:10.1038/172156a0.
[6] S. Arnott, R. Chandrasekaran, C.M. Marttila, Structures for polyinosinic acid and
polyguanylic acid., Biochem. J. 141 (1974) 537–543.
[7] S.B. Zimmerman, G.H. Cohen, D.R. Davies, X-ray fiber diffraction and model-building
study of polyguanylic acid and polyinosinic acid, J. Mol. Biol. 92 (1975).
doi:10.1016/0022-2836(75)90222-3.
[8] A. Rich, DNA comes in many forms, Gene. 135 (1993) 99–109. doi:10.1016/0378-
1119(93)90054-7.
[9] K. Luger, A.W. Mäder, R.K. Richmond, D.F. Sargent, T.J. Richmond, Crystal structure of
the nucleosome core particle at 2.8 A resolution., Nature. 389 (1997) 251–60.
doi:10.1038/38444.
[10] C.L. Peterson, M.-A. Laniel, Histones and histone modifications, Curr. Biol. 14 (2004)
R546–R551. doi:10.1016/j.cub.2004.07.007.
[11] B.D. Strahl, C.D. Allis, The language of covalent histone modifications., Nature. 403
(2000) 41–5. doi:10.1038/47412.
[12] T. Jenuwein, Translating the Histone Code, Science. 293 (2001) 1074–1080.
doi:10.1126/science.1063127.
[13] Y. Cho, S. Gorina, P.D. Jeffrey, N.P. Pavletich, Crystal structure of a p53 tumor
suppressor-DNA complex: understanding tumorigenic mutations., Science. 265 (1994)
346–55. doi:10.1126/science.8023157.
[14] Y. Wu, M. Borde, V. Heissmeyer, M. Feuerer, A.D. Lapan, J.C. Stroud, D.L. Bates, L.
Guo, A. Han, S.F. Ziegler, D. Mathis, C. Benoist, L. Chen, A. Rao, FOXP3 Controls
Regulatory T Cell Function through Cooperation with NFAT, Cell. 126 (2006) 375–387.
doi:10.1016/j.cell.2006.05.042.
[15] A. Han, F. Pan, J.C. Stroud, H.-D. Youn, J.O. Liu, L. Chen, Sequence-specific recruitment
of transcriptional co-repressor Cabin1 by myocyte enhancer factor-2., Nature. 422 (2003)
730–734. doi:10.1038/nature01555.
[16] J.C. Stroud, C. Lopez-Rodriguez, A. Rao, L. Chen, Structure of a TonEBP–DNA complex
reveals DNA encircled by a transcription factor, Nat. Struct. Biol. 9 (2002) 90–94.
doi:10.1038/nsb749.
[17] L. Chen, J.N. Glover, P.G. Hogan, A. Rao, S.C. Harrison, Structure of the DNA-binding
domains from NFAT, Fos and Jun bound specifically to DNA., Nature. 392 (1998) 42–48.
doi:10.1038/32100.
[18] G. Felsenfeld, M. Groudine, Controlling the double helix, Nature. 421 (2003) 448–453.
doi:10.1038/nature01411.
[19] C.L.F. Woodcock, J.P. Safer, J.E. Stanchfield, Structural repeating units in chromatin. I.
Evidence for their general occurrence, Exp. Cell Res. 97 (1976) 101–110.
doi:10.1016/0014-4827(76)90659-5.
[20] H.G. Davies, M.E. Haynes, Electron-microscope observations on cell nuclei in various
tissues of a teleost fish: the nucleolus-associated monolayer of chromatin structural units.,
J. Cell Sci. 21 (1976) 315–27. http://www.ncbi.nlm.nih.gov/pubmed/972173.
[21] H.G. Davies, M.E. Haynes, Light- and electron-microscope observations on certain
leukocytes in a teleost fish and a comparison of the envelope-limited monolayers of
chromatin structural units in different species, J Cell Sci. 17 (1975) 263–285.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=1127023.
[22] N. Gilbert, S. Gilchrist, W.A. Bickmore, Chromatin organization in the mammalian
nucleus, Int. Rev. Cytol. 242 (2004) 283–336. doi:10.1016/S0074-7696(04)42007-5.
[23] J. Sun, Q. Zhang, T. Schlick, Electrostatic mechanism of nucleosomal array folding
revealed by computer simulation, Proc. Natl. Acad. Sci. 102 (2005) 8180–8185.
doi:10.1073/pnas.0408867102.
[24] P.J. Robinson, D. Rhodes, Structure of the “30nm” chromatin fibre: A key role for the
linker histone, Curr. Opin. Struct. Biol. 16 (2006) 336–343. doi:10.1016/j.sbi.2006.05.007.
[25] T. Schalch, S. Duda, D.F. Sargent, T.J. Richmond, X-ray structure of a tetranucleosome
and its implications for the chromatin fibre, Nature. 436 (2005) 138–141.
doi:10.1038/nature03686.
[26] G. Li, D. Reinberg, Chromatin higher-order structures and gene regulation, Curr. Opin.
Genet. Dev. 21 (2011) 175–186. doi:10.1016/j.gde.2011.01.022.
[27] J. Widom, A. Klug, Structure of the 300A chromatin filament: X-ray diffraction from
oriented samples, Cell. 43 (1985) 207–213. doi:10.1016/0092-8674(85)90025-X.
[28] C.L. Woodcock, R.P. Ghosh, Chromatin higher-order structure and dynamics., Cold
Spring Harb. Perspect. Biol. 2 (2010). doi:10.1101/cshperspect.a000596.
[29] S.P. Williams, B.D. Athey, L.J. Muglia, R.S. Schappe, A.H. Gough, J.P. Langmore,
Chromatin fibers are left-handed double helices with diameter and mass per unit length
that depend on linker length, Biophys. J. 49 (1986) 233–248.
doi:10.1016/S0006-3495(86)83637-2.
[30] A. Bolzer, G. Kreth, I. Solovei, D. Koehler, K. Saracoglu, C. Fauth, S. Müller, R. Eils, C.
Cremer, M.R. Speicher, T. Cremer, Three-dimensional maps of all chromosomes in
human male fibroblast nuclei and prometaphase rosettes, PLoS Biol. 3 (2005) 0826–0842.
doi:10.1371/journal.pbio.0030157.
[31] C. Cremer, T. Cremer, J.W. Gray, Induction of chromosome damage by ultraviolet light
and caffeine: Correlation of cytogenetic evaluation and flow karyotype, Cytometry. 2
(1982) 287–290. doi:10.1002/cyto.990020504.
[32] M.R. Branco, A. Pombo, Intermingling of chromosome territories in interphase suggests
role in translocations and transcription-dependent associations, PLoS Biol. 4 (2006) 780–
788. doi:10.1371/journal.pbio.0040138.
[33] M. Simonis, P. Klous, E. Splinter, Y. Moshkin, R. Willemsen, E. de Wit, B. van Steensel,
W. de Laat, Nuclear organization of active and inactive chromatin domains uncovered by
chromosome conformation capture–on-chip (4C), Nat. Genet. 38 (2006) 1348–1354.
doi:10.1038/ng1896.
[34] L.S. Shopland, C.R. Lynch, K.A. Peterson, K. Thornton, N. Kepper, J. Von Hase, S. Stein,
S. Vincent, K.R. Molloy, G. Kreth, C. Cremer, C.J. Bult, T.P. O’Brien, Folding and
organization of a contiguous chromosome region according to the gene distribution
pattern in primary genomic sequence, J. Cell Biol. 174 (2006) 27–38.
doi:10.1083/jcb.200603083.
[35] C. Ferrai, S.Q. Xie, P. Luraghi, D. Munari, F. Ramirez, M.R. Branco, A. Pombo, M.P.
Crippa, Poised transcription factories prime silent uPA gene prior to activation, PLoS Biol.
8 (2010). doi:10.1371/journal.pbio.1000270.
[36] S. Chambeyron, W.A. Bickmore, Chromatin decondensation and nuclear reorganization of
the HoxB locus upon induction of transcription, Genes Dev. 18 (2004) 1119–1130.
doi:10.1101/gad.292104.
[37] D. Noordermeer, M.R. Branco, E. Splinter, P. Klous, W. van Ijcken, S. Swagemakers, M.
Koutsourakis, P. van der Spek, A. Pombo, W. de Laat, Transcription and chromatin
organization of a housekeeping gene cluster containing an integrated beta-globin locus
control region., PLoS Genet. 4 (2008) e1000016. doi:10.1371/journal.pgen.1000016.
[38] T. Ragoczy, M.A. Bender, A. Telling, R. Byron, M. Groudine, The locus control region is
required for association of the murine β-globin locus with engaged transcription factories
during erythroid maturation, Genes Dev. 20 (2006) 1447–1457. doi:10.1101/gad.1419506.
[39] S.T. Kosak, Subnuclear Compartmentalization of Immunoglobulin Loci During
Lymphocyte Development, Science. 296 (2002) 158–162.
doi:10.1126/science.1068768.
[40] P. Meister, B.D. Towbin, B.L. Pike, A. Ponti, S.M. Gasser, The spatial dynamics of
tissue-specific promoters during C. elegans development, Genes Dev. 24 (2010) 766–782.
doi:10.1101/gad.559610.
[41] M. Lundgren, C.M. Chow, P. Sabbattini, A. Georgiou, S. Minaee, N. Dillon, Transcription
factor dosage affects changes in higher order chromatin structure associated with
activation of a heterochromatic gene, Cell. 103 (2000) 733–743.
doi:10.1016/S0092-8674(00)00177-X.
[42] C.S. Osborne, L. Chakalova, K.E. Brown, D. Carter, A. Horton, E. Debrand, B.
Goyenechea, J.A. Mitchell, S. Lopes, W. Reik, P. Fraser, Active genes dynamically
colocalize to shared sites of ongoing transcription, Nat. Genet. 36 (2004) 1065–1071.
doi:10.1038/ng1423.
[43] H. Strickfaden, A. Zunhammer, S. van Koningsbruggen, D. Köhler, T. Cremer, 4D
chromatin dynamics in cycling cells: Theodor Boveri’s hypotheses revisited., Nucleus. 1
(2010) 284–97. doi:10.4161/nucl.1.3.11969.
[44] S.M. Janicki, T. Tsukamoto, S.E. Salghetti, W.P. Tansey, R. Sachidanandam, K. V.
Prasanth, T. Ried, Y. Shav-Tal, E. Bertrand, R.H. Singer, D.L. Spector, From silencing to
gene expression: Real-time analysis in single cells, Cell. 116 (2004) 683–698.
doi:10.1016/S0092-8674(04)00171-0.
[45] T. Tsukamoto, N. Hashiguchi, S.M. Janicki, T. Tumbar, A.S. Belmont, D.L. Spector,
Visualization of gene activity in living cells., Nat. Cell Biol. 2 (2000) 871–878.
doi:10.1038/35046510.
[46] L.E. Finlan, D. Sproul, I. Thomson, S. Boyle, E. Kerr, P. Perry, B. Ylstra, J.R. Chubb,
W.A. Bickmore, Recruitment to the nuclear periphery can alter expression of genes in
human cells, PLoS Genet. 4 (2008). doi:10.1371/journal.pgen.1000039.
[47] K.L. Reddy, J.M. Zullo, E. Bertolino, H. Singh, Transcriptional repression mediated by
repositioning of genes to the nuclear lamina, Nature. 452 (2008) 243–247.
doi:10.1038/nature06727.
[48] R.I. Kumaran, D.L. Spector, A genetic locus targeted to the nuclear periphery in living
cells maintains its transcriptional competence, J. Cell Biol. 180 (2008) 51–65.
doi:10.1083/jcb.200706060.
[49] P.K. Geyer, M.W. Vitalini, L.L. Wallrath, Nuclear organization: Taking a position on
gene expression, Curr. Opin. Cell Biol. 23 (2011) 354–359. doi:10.1016/j.ceb.2011.03.002.
[50] C. Ferrai, I.J. de Castro, L. Lavitas, M. Chotalia, A. Pombo, Gene positioning., Cold
Spring Harb. Perspect. Biol. 2 (2010). doi:10.1101/cshperspect.a000588.
[51] B. van Steensel, S. Henikoff, Identification of in vivo DNA targets of chromatin proteins
using tethered dam methyltransferase., Nat. Biotechnol. 18 (2000) 424–428.
doi:10.1038/74487.
[52] H. Pickersgill, B. Kalverda, E. de Wit, W. Talhout, M. Fornerod, B. van Steensel,
Characterization of the Drosophila melanogaster genome at the nuclear lamina, Nat. Genet.
38 (2006) 1005–1014. doi:10.1038/ng1852.
[53] D. Peric-Hupkes, W. Meuleman, L. Pagie, S.W.M. Bruggeman, I. Solovei, W. Brugman,
S. Gräf, P. Flicek, R.M. Kerkhoven, M. van Lohuizen, M. Reinders, L. Wessels, B. van
Steensel, Molecular Maps of the Reorganization of Genome-Nuclear Lamina Interactions
during Differentiation, Mol. Cell. 38 (2010) 603–613. doi:10.1016/j.molcel.2010.03.016.
[54] L. Guelen, L. Pagie, E. Brasset, W. Meuleman, M.B. Faza, W. Talhout, B.H. Eussen, A.
de Klein, L. Wessels, W. de Laat, B. van Steensel, Domain organization of human
chromosomes revealed by mapping of nuclear lamina interactions, Nature. 453 (2008)
948–951. doi:10.1038/nature06947.
[55] J. Dekker, K. Rippe, M. Dekker, N. Kleckner, Capturing chromosome conformation.,
Science. 295 (2002) 1306–11. doi:10.1126/science.1067799.
[56] B. Tolhuis, R.J. Palstra, E. Splinter, F. Grosveld, W. De Laat, Looping and interaction
between hypersensitive sites in the active beta-globin locus, Mol. Cell. 10 (2002) 1453–
1465. doi:10.1016/S1097-2765(02)00781-5.
[57] S.M. Tan-Wong, J.D. French, N.J. Proudfoot, M.A. Brown, Dynamic interactions between
the promoter and terminator regions of the mammalian BRCA1 gene, Proc. Natl. Acad.
Sci. 105 (2008) 5160–5165. doi:10.1073/pnas.0801048105.
[58] R.-J. Palstra, M. Simonis, P. Klous, E. Brasset, B. Eijkelkamp, W. de Laat, Maintenance
of Long-Range DNA Interactions after Inhibition of Ongoing RNA Polymerase II
Transcription, PLoS One. 3 (2008) e1661. doi:10.1371/journal.pone.0001661.
[59] A. Miele, K. Bystricky, J. Dekker, Yeast silent mating type loci form heterochromatic
clusters through silencer protein-dependent long-range interactions, PLoS Genet. 5 (2009).
doi:10.1371/journal.pgen.1000478.
[60] I. Comet, B. Schuettengruber, T. Sexton, G. Cavalli, A chromatin insulator driving three-
dimensional Polycomb response element (PRE) contacts and Polycomb association with
the chromatin fiber, Proc. Natl. Acad. Sci. 108 (2011) 2294–2299.
doi:10.1073/pnas.1002059108.
[61] H. Würtele, P. Chartrand, Genome-wide scanning of HoxB1-associated loci in mouse ES
cells using an open-ended Chromosome Conformation Capture methodology, Chromosom.
Res. 14 (2006) 477–495. doi:10.1007/s10577-006-1075-0.
[62] E. Splinter, H. Heath, J. Kooren, R.J. Palstra, P. Klous, F. Grosveld, N. Galjart, W. De
Laat, CTCF mediates long-range chromatin looping and local histone modification in the
beta-globin locus, Genes Dev. 20 (2006) 2349–2354. doi:10.1101/gad.399506.
[63] R.-J. Palstra, B. Tolhuis, E. Splinter, R. Nijmeijer, F. Grosveld, W. de Laat, The β-globin
nuclear compartment in development and erythroid differentiation, Nat. Genet. 35 (2003)
190–194. doi:10.1038/ng1244.
[64] C.R. Vakoc, D.L. Letting, N. Gheldof, T. Sawado, M.A. Bender, M. Groudine, M.J. Weiss,
J. Dekker, G.A. Blobel, Proximity among distant regulatory elements at the β-globin locus
requires GATA-1 and FOG-1, Mol. Cell. 17 (2005) 453–462.
doi:10.1016/j.molcel.2004.12.028.
[65] R. Drissen, R.J. Palstra, N. Gillemans, E. Splinter, F. Grosveld, S. Philipsen, W. De Laat,
The active spatial organization of the beta-globin locus requires the transcription factor
EKLF, Genes Dev. 18 (2004) 2485–2490. doi:10.1101/gad.317004.
[66] A. Murrell, S. Heeson, W. Reik, Interaction between differentially methylated regions
partitions the imprinted genes Igf2 and H19 into parent-specific chromatin loops, Nat.
Genet. 36 (2004) 889–893. doi:10.1038/ng1402.
[67] C.G. Spilianakis, M.D. Lalioti, T. Town, G.R. Lee, R.A. Flavell, Interchromosomal
associations between alternatively expressed loci, Nature. 435 (2005) 637–645.
doi:10.1038/nature03574.
[68] D. Vernimmen, M. De Gobbi, J.A. Sloane-Stanley, W.G. Wood, D.R. Higgs, Long-range
chromosomal interactions regulate the timing of the transition between poised and active
gene expression, EMBO J. 26 (2007) 2041–2051. doi:10.1038/sj.emboj.7601654.
[69] N. Gheldof, E.M. Smith, T.M. Tabuchi, C.M. Koch, I. Dunham, J.A. Stamatoyannopoulos,
J. Dekker, Cell-type-specific long-range looping interactions identify distant regulatory
elements of the CFTR gene, Nucleic Acids Res. 38 (2010) 4325–4336.
doi:10.1093/nar/gkq175.
[70] J.A. Wallace, G. Felsenfeld, We gather together: insulators and genome organization, Curr.
Opin. Genet. Dev. 17 (2007) 400–407. doi:10.1016/j.gde.2007.08.005.
[71] J.E. Phillips, V.G. Corces, CTCF: Master Weaver of the Genome, Cell. 137 (2009) 1194–
1211. doi:10.1016/j.cell.2009.06.001.
[72] P.K. Geyer, V.G. Corces, DNA position-specific repression of transcription by a
Drosophila zinc finger protein, Genes Dev. 6 (1992) 1865–1873.
doi:10.1101/gad.6.10.1865.
[73] Z. Zhao, G. Tavoosidana, M. Sjölinder, A. Göndör, P. Mariano, S. Wang, C. Kanduri, M.
Lezcano, K. Singh Sandhu, U. Singh, V. Pant, V. Tiwari, S. Kurukuti, R. Ohlsson,
Circular chromosome conformation capture (4C) uncovers extensive networks of
epigenetically regulated intra- and interchromosomal interactions, Nat. Genet. 38 (2006)
1341–1347. doi:10.1038/ng1891.
[74] K.S. Wendt, K. Yoshida, T. Itoh, M. Bando, B. Koch, E. Schirghuber, S. Tsutsumi, G.
Nagae, K. Ishihara, T. Mishiro, K. Yahata, F. Imamoto, H. Aburatani, M. Nakao, N.
Imamoto, K. Maeshima, K. Shirahige, J.-M. Peters, Cohesin mediates transcriptional
insulation by CCCTC-binding factor, Nature. 451 (2008) 796–801.
doi:10.1038/nature06634.
[75] V. Parelho, S. Hadjur, M. Spivakov, M. Leleu, S. Sauer, H.C. Gregson, A. Jarmuz, C.
Canzonetta, Z. Webster, T. Nesterova, B.S. Cobb, K. Yokomori, N. Dillon, L. Aragon,
A.G. Fisher, M. Merkenschlager, Cohesins Functionally Associate with CTCF on
Mammalian Chromosome Arms, Cell. 132 (2008) 422–433.
doi:10.1016/j.cell.2008.01.011.
[76] K.C. Wang, Y.W. Yang, B. Liu, A. Sanyal, R. Corces-Zimmerman, Y. Chen, B.R. Lajoie,
A. Protacio, R.A. Flynn, R.A. Gupta, J. Wysocka, M. Lei, J. Dekker, J.A. Helms, H.Y.
Chang, A long noncoding RNA maintains active chromatin to coordinate homeotic gene
expression, Nature. 472 (2011) 120–124. doi:10.1038/nature09819.
[77] S. Hadjur, L.M. Williams, N.K. Ryan, B.S. Cobb, T. Sexton, P. Fraser, A.G. Fisher, M.
Merkenschlager, Cohesins form chromosomal cis-interactions at the developmentally
regulated IFNG locus, Nature. (2009). doi:10.1038/nature08079.
[78] A. Németh, S. Guibert, V.K. Tiwari, R. Ohlsson, G. Längst, Epigenetic regulation of TTF-
I-mediated promoter–terminator interactions of rRNA genes, EMBO J. 27 (2008) 1255–
1265. doi:10.1038/emboj.2008.57.
[79] J.M. O’Sullivan, S.M. Tan-Wong, A. Morillon, B. Lee, J. Coles, J. Mellor, N.J. Proudfoot,
Gene loops juxtapose promoters and terminators in yeast, Nat. Genet. 36 (2004) 1014–
1018. doi:10.1038/ng1411.
[80] K.J. Perkins, M. Lusic, I. Mitar, M. Giacca, N.J. Proudfoot, Transcription-Dependent
Gene Looping of the HIV-1 Provirus Is Dictated by Recognition of Pre-mRNA Processing
Signals, Mol. Cell. 29 (2008) 56–68. doi:10.1016/j.molcel.2007.11.030.
[81] J.P. Lainé, B.N. Singh, S. Krishnamurthy, M. Hampsey, A physiological role for gene
loops in yeast, Genes Dev. 23 (2009) 2604–2609. doi:10.1101/gad.1823609.
[82] J. Dekker, The three “C”s of chromosome conformation capture: controls, controls,
controls., Nat. Methods. 3 (2006) 17–21. doi:10.1038/nmeth823.
[83] H. Hagège, P. Klous, C. Braem, E. Splinter, J. Dekker, G. Cathala, W. de Laat, T. Forné,
Quantitative analysis of chromosome conformation capture assays (3C-qPCR), Nat.
Protoc. 2 (2007) 1722–1733. doi:10.1038/nprot.2007.243.
[84] M. Simonis, J. Kooren, W. de Laat, An evaluation of 3C-based methods to capture DNA
interactions, Nat. Methods. 4 (2007) 895–901. doi:10.1038/nmeth1114.
[85] E. Splinter, W. de Laat, The complex transcription regulatory landscape of our genome:
control in three dimensions, EMBO J. 30 (2011) 4345–4355. doi:10.1038/emboj.2011.344.
[86] M.J. Fullwood, C.L. Wei, E.T. Liu, Y. Ruan, Next-generation DNA sequencing of paired-
end tags (PET) for transcriptome and genome analyses, Genome Res. 19 (2009) 521–532.
doi:10.1101/gr.074906.107.
[87] S. Schoenfelder, T. Sexton, L. Chakalova, N.F. Cope, A. Horton, S. Andrews, S. Kurukuti,
J.A. Mitchell, D. Umlauf, D.S. Dimitrova, C.H. Eskiw, Y. Luo, C.-L. Wei, Y. Ruan, J.J.
Bieker, P. Fraser, Preferential associations between co-regulated genes reveal a
transcriptional interactome in erythroid cells, Nat. Genet. 42 (2010) 53–61.
doi:10.1038/ng.496.
[88] O. Hakim, M.H. Sung, T.C. Voss, E. Splinter, S. John, P.J. Sabo, R.E. Thurman, J.A.
Stamatoyannopoulos, W. De Laat, G.L. Hager, Diverse gene reprogramming events occur
in the same spatial clusters of distal regulatory elements, Genome Res. 21 (2011) 697–706.
doi:10.1101/gr.111153.110.
[89] S. John, T.A. Johnson, M.H. Sung, S.C. Biddie, S. Trump, C.A. Koch-Paiz, S.R. Davis, R.
Walker, P.S. Meltzer, G.L. Hager, Kinetic complexity of the global response to
glucocorticoid receptor action, Endocrinology. 150 (2009) 1766–1774.
doi:10.1210/en.2008-0863.
[90] D. Noordermeer, E. de Wit, P. Klous, H. van de Werken, M. Simonis, M. Lopez-Jones, B.
Eussen, A. de Klein, R.H. Singer, W. de Laat, Variegated gene expression caused by cell-
specific long-range DNA interactions, Nat. Cell Biol. 13 (2011) 944–951.
doi:10.1038/ncb2278.
[91] D. Noordermeer, M.R. Branco, E. Splinter, P. Klous, W. Van Ijcken, S. Swagemakers, M.
Koutsourakis, P. Van Der Spek, A. Pombo, W. De Laat, Transcription and chromatin
organization of a housekeeping gene cluster containing an integrated β-globin locus
control region, PLoS Genet. 4 (2008). doi:10.1371/journal.pgen.1000016.
[92] A. Wutz, T.P. Rasmussen, R. Jaenisch, Chromosomal silencing and localization are
mediated by different domains of Xist RNA, Nat. Genet. 30 (2002) 167–174.
doi:10.1038/ng820.
[93] B. Tolhuis, M. Blom, R.M. Kerkhoven, L. Pagie, H. Teunissen, M. Nieuwland, M.
Simonis, W. de Laat, M. van Lohuizen, B. van Steensel, Interactions among polycomb
domains are guided by chromosome architecture, PLoS Genet. 7 (2011).
doi:10.1371/journal.pgen.1001343.
[94] F. Bantignies, V. Roure, I. Comet, B. Leblanc, B. Schuettengruber, J. Bonnet, V. Tixier, A.
Mas, G. Cavalli, Polycomb-dependent regulatory contacts between distant hox loci in
drosophila, Cell. 144 (2011) 214–226. doi:10.1016/j.cell.2010.12.026.
[95] K.M. Lower, J.R. Hughes, M. De Gobbi, S. Henderson, V. Viprakasit, C. Fisher, A.
Goriely, H. Ayyub, J. Sloane-Stanley, D. Vernimmen, C. Langford, D. Garrick, R.J.
Gibbons, D.R. Higgs, Adventitious changes in long-range gene expression caused by
polymorphic structural variation and promoter competition, Proc. Natl. Acad. Sci. 106
(2009) 21771–21776. doi:10.1073/pnas.0909331106.
[96] J. Dostie, T.A. Richmond, R.A. Arnaout, R.R. Selzer, W.L. Lee, T.A. Honan, E.D. Rubio,
A. Krumm, J. Lamb, C. Nusbaum, R.D. Green, J. Dekker, Chromosome Conformation
Capture Carbon Copy (5C): A massively parallel solution for mapping interactions
between genomic elements, Genome Res. 16 (2006) 1299–1309. doi:10.1101/gr.5571506.
[97] D. Baù, A. Sanyal, B.R. Lajoie, E. Capriotti, M. Byron, J.B. Lawrence, J. Dekker, M.A.
Marti-Renom, The three-dimensional folding of the α-globin gene domain reveals
formation of chromatin globules, Nat. Struct. Mol. Biol. 18 (2011) 107–114.
doi:10.1038/nsmb.1936.
[98] M.A. Ferraiuolo, M. Rousseau, C. Miyamoto, S. Shenker, X.Q.D. Wang, M. Nadler, M.
Blanchette, J. Dostie, The three-dimensional architecture of Hox cluster silencing, Nucleic
Acids Res. 38 (2010) 7472–7484. doi:10.1093/nar/gkq644.
[99] J. Fraser, M. Rousseau, S. Shenker, M.A. Ferraiuolo, Y. Hayashizaki, M. Blanchette, J.
Dostie, Chromatin conformation signatures of cellular differentiation, Genome Biol. 10
(2009) R37. doi:10.1186/gb-2009-10-4-r37.
[100] E.P. Nora, B.R. Lajoie, E.G. Schulz, L. Giorgetti, I. Okamoto, N. Servant, T. Piolot, N.L.
van Berkum, J. Meisig, J. Sedat, J. Gribnau, E. Barillot, N. Blüthgen, J. Dekker, E. Heard,
Spatial partitioning of the regulatory landscape of the X-inactivation centre, Nature. 485
(2012) 381–385. doi:10.1038/nature11049.
[101] D. Noordermeer, M. Leleu, E. Splinter, J. Rougemont, W. De Laat, D. Duboule, The
Dynamic Architecture of Hox Gene Clusters, Science. 334 (2011) 222–225.
doi:10.1126/science.1207194.
[102] E. Lieberman-Aiden, N.L. van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling,
I. Amit, B.R. Lajoie, P.J. Sabo, M.O. Dorschner, R. Sandstrom, B. Bernstein, M.A.
Bender, M. Groudine, A. Gnirke, J. Stamatoyannopoulos, L.A. Mirny, E.S. Lander, J.
Dekker, Comprehensive mapping of long-range interactions reveals folding principles of
the human genome., Science. 326 (2009) 289–93. doi:10.1126/science.1181369.
[103] Z. Duan, M. Andronescu, K. Schutz, S. McIlwain, Y.J. Kim, C. Lee, J. Shendure, S.
Fields, C.A. Blau, W.S. Noble, A three-dimensional model of the yeast genome, Nature.
465 (2010) 363–367. doi:10.1038/nature08973.
[104] C. Zimmer, E. Fabre, Principles of chromosomal organization: Lessons from yeast, J. Cell
Biol. 192 (2011) 723–733. doi:10.1083/jcb.201010058.
[105] K. Bystricky, P. Heun, L. Gehlen, J. Langowski, S.M. Gasser, Long-range compaction
and flexibility of interphase chromatin in budding yeast analyzed by high-resolution
imaging techniques, Proc. Natl. Acad. Sci. 101 (2004) 16495–16500.
doi:10.1073/pnas.0402766101.
[106] Q.W. Jin, J. Fuchs, J. Loidl, Centromere clustering is a major determinant of yeast
interphase nuclear organization., J. Cell Sci. 113 (2000) 1903–1912.
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10806101&retmode=ref&cmd=prlinks.
[107] P. Therizols, T. Duong, B. Dujon, C. Zimmer, E. Fabre, Chromosome arm length and
nuclear constraints determine the dynamic relationship of yeast subtelomeres, Proc. Natl.
Acad. Sci. 107 (2010) 2025–2030. doi:10.1073/pnas.0914187107.
[108] M. Thompson, Nucleolar Clustering of Dispersed tRNA Genes, Science. 302 (2003)
1399–1401. doi:10.1126/science.1089814.
[109] H. Tanizawa, O. Iwasaki, A. Tanaka, J.R. Capizzi, P. Wickramasinghe, M. Lee, Z. Fu, K.I.
Noma, Mapping of long-range associations throughout the fission yeast genome reveals
global genome organization linked to transcriptional regulation, Nucleic Acids Res. 38
(2010) 8164–8177. doi:10.1093/nar/gkq955.
[110] R. Kalhor, H. Tjong, N. Jayathilaka, F. Alber, L. Chen, Genome architectures revealed by
tethered chromosome conformation capture and population-based modeling., Nat.
Biotechnol. 30 (2012) 90–8. doi:10.1038/nbt.2057.
[111] J.R. Dixon, S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J.S. Liu, B. Ren,
Topological domains in mammalian genomes identified by analysis of chromatin
interactions, Nature. 485 (2012) 376–380. doi:10.1038/nature11082.
[112] S.S.P. Rao, M.H. Huntley, N.C. Durand, E.K. Stamenova, I.D. Bochkov,
J.T. Robinson, A.L. Sanborn, I. Machol, A.D. Omer, E.S. Lander, E.L. Aiden,
A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin
Looping, Cell. 159 (2014) 1665–1680. doi:10.1016/j.cell.2014.11.021.
[113] F. Jin, Y. Li, J.R. Dixon, S. Selvaraj, Z. Ye, A.Y. Lee, C.-A. Yen, A.D. Schmitt, C.A.
Espinoza, B. Ren, A high-resolution map of the three-dimensional chromatin interactome
in human cells, Nature. (2013). doi:10.1038/nature12644.
[114] E. Yaffe, A. Tanay, Probabilistic modeling of Hi-C contact maps eliminates systematic
biases to characterize global chromosomal architecture, Nat. Genet. 43 (2011) 1059–65.
doi:10.1038/ng.947.
[115] M. Imakaev, G. Fudenberg, R.P. McCord, N. Naumova, A. Goloborodko, B.R. Lajoie, J.
Dekker, L.A. Mirny, Iterative correction of Hi-C data reveals hallmarks of chromosome
organization, Nat. Methods. 9 (2012) 999–1003. doi:10.1038/nmeth.2148.
[116] M.A. Marti-Renom, L.A. Mirny, Bridging the resolution gap in structural modeling of 3D
genome organization, PLoS Comput. Biol. 7 (2011). doi:10.1371/journal.pcbi.1002125.
[117] R.M. Myers, J. Stamatoyannopoulos, M. Snyder, I. Dunham, R.C. Hardison, B.E.
Bernstein, T.R. Gingeras, W.J. Kent, E. Birney, B. Wold, G.E. Crawford, C.B. Epstein, N.
Shoresh, J. Ernst, T.S. Mikkelsen, P. Kheradpour, X. Zhang, L. Wang, R. Issner, M.J.
Coyne, T. Durham, M. Ku, T. Truong, L.D. Ward, R.C. Altshuler, M.F. Lin, M. Kellis,
C.A. Davis, P. Kapranov, A. Dobin, C. Zaleski, F. Schlesinger, P. Batut, S. Chakrabortty,
S. Jha, W. Lin, J. Drenkow, H. Wang, K. Bell, I. Bell, H. Gao, E. Dumais, J. Dumais, S.E.
Antonarakis, C. Ucla, C. Borel, R. Guigo, S. Djebali, J. Lagarde, C. Kingswood, P. Ribeca,
M. Sammeth, T. Alioto, A. Merkel, H. Tilgner, P. Carninci, Y. Hayashizaki, T. Lassmann,
H. Takahashi, R.F. Abdelhamid, G. Hannon, K.T. Fejes, J. Preall, A. Gordon, V. Sotirova,
A. Reymond, C. Howald, E.A.Y. Graison, J. Chrast, Y. Ruan, X. Ruan, A. Shahab, W.T.
Poh, C.L. Wei, T.S. Furey, A.P. Boyle, N.C. Sheffield, L. Song, Y. Shibata, T. Vales, D.
Winter, Z. Zhang, D. London, T. Wang, D. Keefe, V.R. Iyer, B.K. Lee, R.M. McDaniell,
Z. Liu, A. Battenhouse, A.A. Bhinge, J.D. Lieb, L.L. Grasfeder, K.A. Showers, P.G.
Giresi, S.K.C. Kim, C. Shestak, F. Pauli, T.E. Reddy, J. Gertz, E.C. Partridge, P. Jain, R.O.
Sprouse, A. Bansal, B. Pusey, M.A. Muratet, K.E. Varley, K.M. Bowling, K.M. Newberry,
A.S. Nesmith, J.A. Dilocker, S.L. Parker, L.L. Waite, K. Thibeault, K. Roberts, D.M.
Absher, A. Mortazavi, B. Williams, G. Marinov, D. Trout, B. King, K. McCue, A.
Kirilusha, G. DeSalvo, K.A. Fisher, H. Amrhein, S. Pepke, J. Vielmetter, G. Sherlock, A.
Sidow, S. Batzoglou, R. Rauch, A. Kundaje, M. Libbrecht, E.H. Margulies, S.C.J. Parker,
L. Elnitski, E.D. Green, T. Hubbard, J. Harrow, S. Searle, S.C.J. Parker, B. Aken, A.
Frankish, T. Hunt, G. Despacio-Reyes, M. Kay, G. Mukherjee, A. Bignell, G. Saunders, V.
Boychenko, M. Brent, M.J. van Baren, R.H. Brown, M. Gerstein, E. Khurana, S.
Balasubramanian, H. Lam, P. Cayting, R. Robilotto, Z. Lu, T. Derrien, A. Tanzer, D.G.
Knowles, M. Mariotti, D. Haussler, R. Harte, M. Diekhans, M. Lin, A. Valencia, M. Tress,
J.M. Rodriguez, D. Raha, M. Shi, G. Euskirchen, F. Grubert, M. Kasowski, J. Lian, P.
Lacroute, Y. Xu, H. Monahan, D. Patacsil, T. Slifer, X. Yang, A. Charos, B. Reed, L. Wu,
R.K. Auerbach, L. Habegger, M. Hariharan, J. Rozowsky, A. Abyzov, S.M. Weissman, K.
Struhl, N. Lamarre-Vincent, M. Lindahl-Allen, B. Miotto, Z. Moqtaderi, J.D. Fleming, P.
Newburger, P.J. Farnham, S. Frietze, H. O’Geen, X. Xu, K.R. Blahnik, A.R. Cao, S.
Iyengar, R. Kaul, R.E. Thurman, H. Wang, P.A. Navas, R. Sandstrom, P.J. Sabo, M.
Weaver, T. Canfield, K. Lee, S. Neph, V. Roach, A. Reynolds, A. Johnson, E. Rynes, E.
Giste, S. Vong, J. Neri, T. Frum, E.D. Nguyen, A.K. Ebersol, M.E. Sanchez, H.H. Sheffer,
D. Lotakis, E. Haugen, R. Humbert, T. Kutyavin, T. Shafer, W.S. Noble, J. Dekker, B.R.
Lajoie, A. Sanyal, K.R. Rosenbloom, T.R. Dreszer, B.J. Raney, G.P. Barber, L.R. Meyer,
C.A. Sloan, V.S. Malladi, M.S. Cline, K. Learned, V.K. Swing, A.S. Zweig, B. Rhead,
P.A. Fujita, K. Roskin, D. Karolchik, R.M. Kuhn, S.P. Wilder, D. Sobral, J. Herrero, K.
Beal, M. Lukk, A. Brazma, J.M. Vaquerizas, N.M. Luscombe, P.J. Bickel, N. Boley, J.B.
Brown, Q. Li, H. Huang, A. Sboner, K.Y. Yip, C. Cheng, K.K. Yan, N. Bhardwaj, J.
Wang, L. Lochovsky, J. Jee, T. Gibson, J. Leng, J. Du, R.S. Harris, G. Song, W. Miller, B.
Suh, B. Paten, M.M. Hoffman, O.J. Buske, Z. Weng, X. Dong, J. Wang, H. Xi, S.A.
Tenenbaum, F. Doyle, S. Chittur, L.O. Penalva, T.D. Tullius, K.P. White, S. Karmakar, A.
Victorsen, N. Jameel, N. Bild, R.L. Grossman, P.J. Collins, N.D. Trinklein, M.C.
Giddings, J. Khatun, C. Maier, T. Wang, T.W. Whitfield, X. Chen, Y. Yu, H.
Gunawardena, E.A. Feingold, R.F. Lowdon, L.A.L. Dillon, P.J. Good, B. Risk, A user’s
guide to the Encyclopedia of DNA elements (ENCODE), PLoS Biol. 9 (2011).
doi:10.1371/journal.pbio.1001046.
[118] G. Dennis Jr, B.T. Sherman, D.A. Hosack, J. Yang, W. Gao, H.C. Lane, R.A. Lempicki,
DAVID: Database for Annotation, Visualization, and Integrated Discovery, Genome Biol.
4 (2003) R60. doi:10.1186/gb-2003-4-9-r60.
[119] F. Al-Shahrour, R. Díaz-Uriarte, J. Dopazo, FatiGO: A web tool for finding significant
associations of Gene Ontology terms with groups of genes, Bioinformatics. 20 (2004)
578–580. doi:10.1093/bioinformatics/btg455.
[120] C. Dai, W. Li, H. Tjong, S. Hao, Y. Zhou, Q. Li, L. Chen, B. Zhu, F. Alber, X. Jasmine
Zhou, Mining 3D genome structure populations identifies major factors governing the
stability of regulatory communities, Nat. Commun. 7 (2016) 11549.
doi:10.1038/ncomms11549.
[121] J. Langowski, Chromosome conformation by crosslinking: polymer physics matters.,
Nucleus. 1 (2010) 37–9. doi:10.4161/nucl.1.1.10837.
[122] H. Tjong, W. Li, R. Kalhor, C. Dai, S. Hao, K. Gong, Y. Zhou, H. Li, X.J. Zhou, M.A. Le
Gros, C.A. Larabell, L. Chen, F. Alber, Population-based 3D genome structure analysis
reveals driving forces in spatial genome organization, Proc. Natl. Acad. Sci. 113 (2016)
E1663–E1672. doi:10.1073/pnas.1512577113.
[123] P.M. Lieberman, Chromatin organization and virus gene expression, J. Cell. Physiol. 216
(2008) 295–302. doi:10.1002/jcp.21421.
[124] D.M. Knipe, A. Cliffe, Chromatin control of herpes simplex virus lytic and latent
infection., Nat. Rev. Microbiol. 6 (2008) 211–221. doi:10.1038/nrmicro1794.
[125] A.M. Ishov, G.G. Maul, The periphery of nuclear domain 10 (ND10) as site of DNA virus
deposition, J. Cell Biol. 134 (1996) 815–826. doi:10.1083/jcb.134.4.815.
[126] G.G. Maul, Nuclear domain 10, the site of DNA virus transcription and replication,
BioEssays. 20 (1998) 660–667. doi:10.1002/(SICI)1521-1878(199808)20:8<660::AID-
BIES9>3.0.CO;2-M.
[127] T. Cremer, C. Cremer, Chromosome territories, nuclear architecture and gene regulation
in mammalian cells, Nat Rev Genet. 2 (2001) 292–301. doi:10.1038/35066075.
[128] T. Misteli, Beyond the Sequence: Cellular Organization of Genome Function, Cell. 128
(2007) 787–800. doi:10.1016/j.cell.2007.01.028.
[129] W.A. Bickmore, The spatial organization of the human genome., Annu. Rev. Genomics
Hum. Genet. 14 (2013) 67–84. doi:10.1146/annurev-genom-091212-153515.
[130] J.R. Dixon, S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J.S. Liu, B. Ren,
Topological domains in mammalian genomes identified by analysis of chromatin
interactions, Nature. 485 (2012) 376–380. doi:10.1038/nature11082.
[131] M.J. Fullwood, M.H. Liu, Y.F. Pan, J. Liu, H. Xu, Y. Bin Mohamed, Y.L. Orlov, S.
Velkov, A. Ho, P.H. Mei, E.G.Y. Chew, P.Y.H. Huang, W.-J. Welboren, Y. Han, H.S.
Ooi, P.N. Ariyaratne, V.B. Vega, Y. Luo, P.Y. Tan, P.Y. Choy, K.D.S.A. Wansa, B. Zhao,
K.S. Lim, S.C. Leow, J.S. Yow, R. Joseph, H. Li, K.V. Desai, J.S. Thomsen, Y.K. Lee,
R.K.M. Karuturi, T. Herve, G. Bourque, H.G. Stunnenberg, X. Ruan, V. Cacheux-
Rataboul, W.-K. Sung, E.T. Liu, C.-L. Wei, E. Cheung, Y. Ruan, An oestrogen-receptor-
alpha-bound human chromatin interactome., Nature. 462 (2009) 58–64.
doi:10.1038/nature08497.
[132] A.A. Gavrilov, E.S. Gushchanskaya, O. Strelkova, O. Zhironkina, I.I. Kireev, O. V.
Iarovaia, S. V. Razin, Disclosure of a structural milieu for the proximity ligation reveals
the elusive nature of an active chromatin hub, Nucleic Acids Res. 41 (2013) 3563–3575.
doi:10.1093/nar/gkt067.
[133] T. Nagano, Y. Lubling, T.J. Stevens, S. Schoenfelder, E. Yaffe, W. Dean, E.D. Laue, A.
Tanay, P. Fraser, Single-cell Hi-C reveals cell-to-cell variability in chromosome structure.,
Nature. 502 (2013) 59–64. doi:10.1038/nature12593.
[134] J.J. Trentin, Y. Yabe, G. Taylor, The quest for human cancer viruses., Science. 137 (1962)
835–841.
[135] A.J. Berk, Recent lessons in gene expression, cell cycle control, and cell biology from
adenovirus, Oncogene. 24 (2005) 7673–7685. doi:10.1038/sj.onc.1209040.
[136] A.J. Davison, M. Benko, B. Harrach, Genetic content and evolution of adenoviruses, J.
Gen. Virol. 84 (2003) 2895–2908. doi:10.1099/vir.0.19497-0.
[137] R. Ferrari, T. Su, B. Li, G. Bonora, A. Oberai, Y. Chan, R. Sasidharan, A.J. Berk, M.
Pellegrini, S.K. Kurdistani, Reorganization of the host epigenome by a viral oncogene,
Genome Res. 22 (2012) 1212–1221. doi:10.1101/gr.132308.111.
[138] D. Sims, I. Sudbery, N.E. Ilott, A. Heger, C.P. Ponting, Sequencing depth and coverage:
key considerations in genomic analyses., Nat. Rev. Genet. 15 (2014) 121–32.
doi:10.1038/nrg3642.
[139] N. Naumova, M. Imakaev, G. Fudenberg, Y. Zhan, B.R. Lajoie, L.A. Mirny, J. Dekker,
Organization of the mitotic chromosome., Science. 342 (2013) 948–53.
doi:10.1126/science.1236083.
[140] C. Montell, E.F. Fisher, M.H. Caruthers, A.J. Berk, Resolving the functions of
overlapping viral genes by site-specific mutagenesis at a mRNA splice site, Nature. 295
(1982) 380–384. doi:10.1038/295380a0.
[141] X. Liu, R. Marmorstein, Structure of the retinoblastoma protein bound to adenovirus E1A
reveals the molecular basis for viral oncoprotein inactivation of a tumor suppressor, Genes
Dev. 21 (2007) 2711–2716. doi:10.1101/gad.1590607.
[142] R. Ferrari, M. Pellegrini, G.A. Horwitz, W. Xie, A.J. Berk, S.K. Kurdistani, Epigenetic
reprogramming by adenovirus e1a., Science. 321 (2008) 1086–1088.
doi:10.1126/science.1155546.
[143] G.A. Horwitz, K. Zhang, M.A. McBrian, M. Grunstein, S.K. Kurdistani, A.J. Berk,
Adenovirus Small e1a Alters Global Patterns of Histone Modification, Science. 321
(2008) 1084–1085. doi:10.1126/science.1155544.
[144] R. Ferrari, D. Gou, G. Jawdekar, S.A. Johnson, M. Nava, T. Su, A.F. Yousef, N.R. Zemke,
M. Pellegrini, S.K. Kurdistani, A.J. Berk, Adenovirus Small E1A Employs the Lysine
Acetylases p300 / CBP and Tumor Suppressor Rb to Repress Select Host Genes and
Promote Productive Virus Infection, Cell Host Microbe. 16 (2014) 663–676.
doi:10.1016/j.chom.2014.10.004.
[145] S. Hahn, D. Kim, Identifying and Reducing Systematic Errors in Chromosome
Conformation Capture Data., PLoS One. 10 (2015) e0146007.
doi:10.1371/journal.pone.0146007.
[146] M.R. Frey, A.G. Matera, Coiled bodies contain U7 small nuclear RNA and associate with
specific DNA sequences in interphase human cells., Proc. Natl. Acad. Sci. U. S. A. 92
(1995) 5915–5919. doi:10.1073/pnas.92.18.8532a.
[147] N. Suka, Y. Suka, A.A. Carmen, J. Wu, M. Grunstein, Highly specific antibodies
determine histone acetylation site usage in yeast heterochromatin and euchromatin., Mol.
Cell. 8 (2001) 473–479. doi:10.1016/S1097-2765(01)00301-X.
[148] C. Martin, Y. Zhang, The diverse functions of histone lysine methylation, Nat. Rev. Mol.
Cell Biol. 6 (2005) 838–849. doi:10.1038/nrm1761.
[149] M.P. Creyghton, A.W. Cheng, G.G. Welstead, T. Kooistra, B.W. Carey, E.J. Steine, J.
Hanna, M.A. Lodato, G.M. Frampton, P.A. Sharp, L.A. Boyer, R.A. Young, R. Jaenisch,
Histone H3K27ac separates active from poised enhancers and predicts developmental
state, Proc. Natl. Acad. Sci. 107 (2010) 21931–21936. doi:10.1073/pnas.1016071107.
[150] J. Nakayama, J.C. Rice, B.D. Strahl, C.D. Allis, S.I. Grewal, Role of histone H3 lysine 9
methylation in epigenetic control of heterochromatin assembly., Science. 292 (2001) 110–3.
doi:10.1126/science.1060118.
[151] S. Rea, F. Eisenhaber, D. O’Carroll, B.D. Strahl, Z.W. Sun, M. Schmid, S. Opravil, K.
Mechtler, C.P. Ponting, C.D. Allis, T. Jenuwein, Regulation of chromatin structure by
site-specific histone H3 methyltransferases., Nature. 406 (2000) 593–599.
doi:10.1038/35020506.
[152] R. Cao, Role of Histone H3 Lysine 27 Methylation in Polycomb-Group Silencing, Science.
298 (2002) 1039–1043. doi:10.1126/science.1076997.
[153] F. Tie, R. Banerjee, C.A. Stratton, J. Prasad-Sinha, V. Stepanik, A. Zlobin, M.O. Diaz, P.C.
Scacheri, P.J. Harte, CBP-mediated acetylation of histone H3 lysine 27 antagonizes
Drosophila Polycomb silencing., Development. 136 (2009) 3131–3141.
doi:10.1242/dev.037127.
[154] A. de Bruyn Kops, D.M. Knipe, Preexisting nuclear architecture defines the intranuclear
location of herpesvirus DNA replication structures., J. Virol. 68 (1994) 3512–26.
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=236855&tool=pmcentrez&rendertype=abstract.
[155] I.H. Wang, M. Suomalainen, V. Andriasyan, S. Kilcher, J. Mercer, A. Neef, N.W. Luedtke,
U.F. Greber, Tracking viral genomes in host cells at single-molecule resolution, Cell Host
Microbe. 14 (2013) 468–480. doi:10.1016/j.chom.2013.09.004.
[156] B. Li, T. Su, R. Ferrari, J.Y. Li, S.K. Kurdistani, A unique epigenetic signature is
associated with active DNA replication loci in human embryonic stem cells., Epigenetics.
9 (2014) 257–267. doi:10.4161/epi.26870.
[157] ENCODE Project Consortium, An integrated encyclopedia of DNA elements in the human genome,
Nature. 489 (2012) 57–74. doi:10.1038/nature11247.
[158] Q. Jin, L.-R. Yu, L. Wang, Z. Zhang, L.H. Kasper, J.-E. Lee, C. Wang, P.K. Brindle,
S.Y.R. Dent, K. Ge, Distinct roles of GCN5/PCAF-mediated H3K9ac and CBP/p300-
mediated H3K18/27ac in nuclear receptor transactivation., EMBO J. 30 (2011) 249–262.
doi:10.1038/emboj.2010.318.
[159] A. Gavrilov, S. V. Razin, G. Cavalli, In vivo formaldehyde cross-linking: It is time for
black box analysis, Brief. Funct. Genomics. 14 (2015) 163–165. doi:10.1093/bfgp/elu037.
[160] A. Cournac, H. Marie-Nelly, M. Marbouty, R. Koszul, J. Mozziconacci, Normalization of
a chromosomal contact map, BMC Genomics. 13 (2012) 436. doi:10.1186/1471-2164-13-436.
[161] M. Hu, K. Deng, S. Selvaraj, Z. Qin, B. Ren, J.S. Liu, HiCNorm: Removing biases in Hi-
C data via Poisson regression, Bioinformatics. 28 (2012) 3131–3133.
doi:10.1093/bioinformatics/bts570.
[162] Y. Zhang, T. Liu, C.A. Meyer, J. Eeckhoute, D.S. Johnson, B.E. Bernstein, C. Nussbaum,
R.M. Myers, M. Brown, W. Li, X.S. Liu, Model-based Analysis of ChIP-Seq (MACS),
Genome Biol. 9 (2008) R137. doi:10.1186/gb-2008-9-9-r137.
[163] A. Diaz, K. Park, D.A. Lim, J.S. Song, Normalization, bias correction, and peak calling
for ChIP-seq, Stat. Appl. Genet. Mol. Biol. 11 (2012). doi:10.1515/1544-6115.1750.
[164] K. Liang, S. Keleş, Normalization of ChIP-seq data with control, BMC Bioinformatics. 13
(2012) 199. doi:10.1186/1471-2105-13-199.
[165] F. Ay, T.L. Bailey, W.S. Noble, Statistical confidence estimation for Hi-C data reveals
regulatory chromatin contacts, Genome Res. 24 (2014) 999–1011.
doi:10.1101/gr.160374.113.
[166] J.M. Belton, J. Dekker, Randomized ligation control for chromosome conformation
capture, Cold Spring Harb. Protoc. 2015 (2015) 587–592. doi:10.1101/pdb.prot085183.
[167] H. Li, R. Durbin, Fast and accurate long-read alignment with Burrows-Wheeler transform,
Bioinformatics. 26 (2010) 589–595. doi:10.1093/bioinformatics/btp698.
[168] P.A. Knight, D. Ruiz, A fast algorithm for matrix balancing, IMA J. Numer. Anal. 33
(2013) 1029–1047. doi:10.1093/imanum/drs019.
[169] T. Yang, F. Zhang, G.G. Yardimci, R.C. Hardison, W.S. Noble, F. Yue, Q. Li, HiCRep:
assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient,
bioRxiv. (2017). doi:10.1101/101386.
[170] J. Fraser, C. Ferrai, A.M. Chiariello, M. Schueler, T. Rito, G. Laudanno, M. Barbieri, B.L.
Moore, D.C. Kraemer, S. Aitken, S.Q. Xie, K.J. Morris, M. Itoh, H. Kawaji, I. Jaeger, Y.
Hayashizaki, P. Carninci, A.R. Forrest, C.A. Semple, J. Dostie, A. Pombo, M. Nicodemi,
Hierarchical folding and reorganization of chromosomes are linked to transcriptional
changes in cellular differentiation, Mol. Syst. Biol. 11 (2015) 852–852.
doi:10.15252/msb.20156492.
Abstract
Chromosomes from both eukaryotes and prokaryotes not only convey information through their linear DNA sequence but also contribute to the regulation of a number of DNA-related metabolic processes through their three-dimensional arrangements. Molecular biology techniques such as Chromosome Conformation Capture (3C) and the 3C-based methods allow us to investigate the three-dimensional organization of the genome at high resolution and high throughput. In this dissertation, we explored whole genome chromosome conformation capture techniques (Hi-C), developing novel Hi-C applications to study virus-host genome interactions and a more reliable Hi-C data normalization method, in order to exploit the potential of Hi-C data for deciphering 3D genome structures.
Viruses have evolved a variety of mechanisms to interact with host cells for their adaptive benefit, including subverting host immune responses and hijacking host DNA replication and transcription machinery. Although interactions between viral and host proteins have been studied extensively, little is known about how the viral genome may interact with the host genome and how such interactions could affect the activities of both the virus and the host cell. Since the three-dimensional organization of a genome can have a significant impact on genomic activities such as transcription and replication, we hypothesize that such structure-based regulation of genomic functions also applies to viral genomes, depending on their association with host genomic regions and their spatial locations inside the nucleus. Here, we used Tethered Chromosome Conformation Capture (TCC) to investigate viral-host genome interactions between adenovirus and human lung fibroblast cells. We found that viral-host genome interactions were enriched in certain active chromatin regions and in chromatin domains marked by H3K27me3. The contacts made by viral DNA seem to impact the structure and function of the host genome, leading to remodeling of the fibroblast epigenome. Our study represents the first comprehensive analysis of viral-host interactions at the genome structure level, revealing unexpectedly specific virus-host genome interactions. The non-random nature of such interactions indicates a deliberate but poorly understood mechanism for the targeting of host DNA by foreign genomes.
Extracting biologically meaningful information about chromosomal interactions from high-throughput sequencing chromosome conformation capture (Hi-C) experiments requires the elimination of systematic biases. So far, data normalization has been performed by computational methods that use either matrix balancing or explicit-factor probabilistic modeling to perform bias correction. Here, we present a control-based data normalization pipeline that relies on sequencing and alignment of restriction-end DNA fragments, obtained either by performing a Restriction End Sequencing (RES) control experiment or by retrieving RES reads from the actual Hi-C sequencing library. We validate RES normalization with both sources of RES control and show that our normalization is robust and reproducible. This pipeline is particularly suited to Hi-C based methods whose protocols attach biotin labels at restriction ends after the chromatin enzyme digestion step. We compare the results of RES normalization and whole genome matrix balancing normalization and observe major differences at regions close to the telomeres and centromeres. Furthermore, we show that the RES normalization technique results in higher data reproducibility than matrix balancing normalization in terms of overall cis and trans one-dimensional contact coverage. Lastly, we discuss the potential limitations and technical challenges of the RES normalization technique.
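As a rough illustration of the two normalization families the abstract contrasts, the Python sketch below balances a toy symmetric contact matrix and, separately, divides the same matrix by a per-bin control coverage. Everything in it is an assumption made for illustration: the function names `balance` and `normalize_with_control`, the toy counts, and in particular the control formula, which the abstract does not specify. The balancing loop is a simple square-root scaling iteration, not the Knight-Ruiz algorithm cited as [168] and not the pipeline developed in the dissertation.

```python
import math


def balance(matrix, max_iter=500, tol=1e-10):
    """Scale a symmetric, non-negative contact matrix so that every
    non-empty row (and column) sums to 1.

    Returns (balanced, bias) with
    balanced[i][j] == matrix[i][j] / (bias[i] * bias[j]).
    """
    n = len(matrix)
    bias = [1.0] * n
    for _ in range(max_iter):
        # Row sums of the matrix under the current bias estimate.
        sums = [
            sum(matrix[i][j] / (bias[i] * bias[j]) for j in range(n))
            for i in range(n)
        ]
        # Square-root update keeps the scaling symmetric; empty bins are skipped.
        step = [math.sqrt(s) if s > 0 else 1.0 for s in sums]
        bias = [b * s for b, s in zip(bias, step)]
        if max(abs(s - 1.0) for s in step) < tol:
            break
    balanced = [
        [matrix[i][j] / (bias[i] * bias[j]) for j in range(n)] for i in range(n)
    ]
    return balanced, bias


def normalize_with_control(matrix, control_coverage):
    """Hypothetical control-based correction: divide each contact count by
    the product of the two bins' control coverage, rescaled to mean 1 over
    covered bins. Bins with no control coverage are zeroed out."""
    covered = [c for c in control_coverage if c > 0]
    mean = sum(covered) / len(covered)
    factor = [c / mean if c > 0 else None for c in control_coverage]
    n = len(matrix)
    return [
        [
            matrix[i][j] / (factor[i] * factor[j])
            if factor[i] is not None and factor[j] is not None
            else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    contacts = [[10, 4, 1], [4, 20, 6], [1, 6, 5]]  # toy bin-by-bin counts
    balanced, bias = balance(contacts)
    print("row sums after balancing:", [sum(row) for row in balanced])
    print("control-normalized:", normalize_with_control(contacts, [1.0, 2.0, 1.0]))
```

The design difference is where the bias estimate comes from: balancing infers it from the contact matrix itself, while the control-based variant takes it from an independent measurement, so the two can disagree in bins where the equal-visibility assumption behind balancing does not hold.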
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Exploring three-dimensional organization of the genome by mapping chromatin contacts and population modeling
Mapping 3D genome structures: a data driven modeling method for integrated structural analysis
Exploring stem cell pluripotency through long range chromosome interactions
Understanding the 3D genome organization in topological domain level
Genome-wide studies reveal the function and evolution of DNA shape
3D modeling of eukaryotic genomes
Application of machine learning methods in genomic data analysis
Constructing metagenome-assembled genomes and mobile genetic element host interactions using metagenomic Hi-C
Profiling transcription factor-DNA binding specificity
Computational analysis of genome architecture
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
Computational analysis of the spatial and temporal organization of the cellular environment
Studies of the biological relevance of Histone H4 Lysine 20 monomethylation: discovery of its role in the cell cycle and localization within the human genome
Structural and biochemical studies of large T antigen: the SV40 replicative helicase
The development of targeted transcription factor transposition and understanding chromatin dynamics in hypertrophic cardiomyopathy
Discovery of mature microRNA sequences within the protein- coding regions of global HIV-1 genomes: Predictions of novel mechanisms for viral infection and pathogenicity
The function of Rpd3 in balancing the replicaton initiation of different genomic regions
A diagrammatic analysis of the secondary structural ensemble of CNG trinucleotide repeat
Investigating the function and epigenetic regulation of ABCA3, a novel LUAD tumor suppressor gene
Asset Metadata
Creator
Li, Haochen (author)
Core Title
Exploring the application and usage of whole genome chromosome conformation capture
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Molecular Biology
Publication Date
07/17/2017
Defense Date
05/24/2017
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
3D genomics, adenovirus, bias correction, chromosome conformation capture, epigenetics, experiment control, Hi-C, OAI-PMH Harvest, virus-host genome interaction
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Chen, Lin (committee chair), Alber, Frank (committee member), Jung, Jae (committee member), Zhou, Xianghong Jasmine (committee member)
Creator Email
haochenl@usc.edu,li.haoch@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-399556
Unique identifier
UC11264425
Identifier
etd-LiHaochen-5521.pdf (filename),usctheses-c40-399556 (legacy record id)
Legacy Identifier
etd-LiHaochen-5521.pdf
Dmrecord
399556
Document Type
Dissertation
Rights
Li, Haochen
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA