Understanding the 3D genome organization in topological domain level
(USC Thesis Other)
UNDERSTANDING THE 3D GENOME ORGANIZATION IN TOPOLOGICAL DOMAIN LEVEL

By Hanjun Shin

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)

May 2017

Copyright 2017 Hanjun Shin

ACKNOWLEDGEMENTS

First of all, I would like to thank my academic advisor, Professor Xianghong Jasmine Zhou. She has consistently supported me during my Ph.D. studies and always helped me to overcome my difficulties with deep patience. Without her warm support, understanding, and encouragement, I could not have finished this long journey. I would also like to express deep gratitude to my co-advisor, Professor Frank Abler. His guidance and constructive suggestions led all my Ph.D. work in the correct direction. I also give thanks to all my committee members, Professor Aiichiro Nakano, Professor Michael Smith Waterman, and Fengzu Sun, for their constructive and supportive suggestions on my research. I am very grateful to all my lab members, Dr. Chao Dai, Professor Yi Shi, Dr. Harianto Tjong, Qiangjiao Li, Nan Hua, and Long Pei, for being great research collaborators. I also wish to thank my supportive friends at USC, Dr. Younghoon Lee, Dr. Taeho Lee, and Pyojae Kim. I would like to express special thanks to my family, Changgeun Shin, Limsoon Park, and Dr. Jungmin Shin. Even though they were physically thousands of miles away from me, I could feel their warm love and support. Lastly, I would like to show my deepest appreciation to my wife, Jungeun Ryu, and two daughters, Yujin Shin and Catherine Jiwon Shin, for their endless love, devotion, and belief. This dissertation is dedicated to them.

TABLE OF CONTENTS

Acknowledgements
List of Figures
List of Tables
Abstract
Chapter 1: Introduction
1.1. Background
1.2. Topological Domain Identification Methods
1.3. Population based genome structure modeling
1.4. Goal of the study
Chapter 2: TopDom: An efficient and deterministic method for identifying topological domains in genomes
2.1. Introduction
2.2. Materials and Methods
2.3. Results
2.4. Discussion
Chapter 3: PGS: a dynamic and automated population-based genome structure software
3.1. Introduction
3.2. Software design and implementation
3.3. Materials
3.4. Procedures
3.5. Troubleshooting
3.6. Timing
3.7. Anticipated Results
Chapter 4: Conclusion
References
Figures
Tables
Appendix

LIST OF FIGURES

Figure 1: Chapter 2. TopDom Method
Figure 2: Chapter 2. Variations in the size and number of TDs
Figure 3: Chapter 2. Selection of the window size
Figure 4: Chapter 2. Illustration of topological domains identified by the DI method, HiCseg, and TopDom
Figure 5: Chapter 2. Quality comparison of TDs identified by TopDom, HiCseg, and DI
Figure 6: Chapter 2. Epigenetic characteristics surrounding boundary regions
Figure 7: Chapter 2. Common and Unique Boundaries
Figure 8: Chapter 2. Epigenetic characteristics of unique boundaries
Figure 9: Chapter 2. Gene expression and housekeeping gene density near TD boundaries
Figure 10: Chapter 2. Overlap score of TDs under different window settings
Figure 11: Chapter 3. PGS software workflows
Figure 12: Chapter 3. PGS setup
Figure 13: Chapter 3. Examples of PGS output

LIST OF TABLES

Table 1: Chapter 2. Comparison of TDs identified by three different methods for four cell types
Table 2: Chapter 2. Chi-squared test on the association between conserved TD structures and the locations of housekeeping genes
Table 3: Chapter 3. Troubleshooting

ABSTRACT

Genome-wide proximity ligation assays, such as Hi-C or TCC (i.e., genome-wide chromatin conformation capture), allow the identification of chromatin contacts at unprecedented resolution.
Several studies reveal that mammalian chromosomes are composed of topological domains (TDs) at sub-megabase resolution, which appear to be conserved across cell types and to some extent even between organisms. Studying topological domains is an important first step toward understanding the structure and functions of spatial genome organization. In this thesis, as the first project, we propose an efficient and deterministic method, TopDom, to identify TDs, along with a set of statistical methods for evaluating their quality. TopDom is much more efficient than existing methods and depends on just one intuitive parameter, a window size. TopDom also identifies more and higher quality TDs than the popular directionality index algorithm. The TDs identified by TopDom provide strong support for cross-tissue TD conservation. Finally, our analysis reveals that the locations of housekeeping genes are closely associated with cross-tissue conserved TDs.

As the second project, we present a population-based 3D genome structure modeling pipeline for further understanding of individual 3D genome structures in the nucleus. We implement the method described in Kalhor et al., Nature Biotechnology 2012 and Tjong et al., PNAS 2016, and package the whole workflow as a single software pipeline so that general users can easily examine the variability of individual 3D genome structures. PGS generates a population of 3D genome structures at the TD level based on a probabilistic framework that uses Hi-C data. It is fully automated and simplifies individual processing steps and job submissions into a single straightforward execution.

CHAPTER 1. INTRODUCTION

1.1. Background

Chromatin is the physical carrier of genetic and epigenetic information. Recent studies indicate that its high-order spatial conformation plays an important role in many nuclear processes, including gene expression, epigenetic organization, and DNA replication (1-7).
Although our understanding of the spatial organization of chromatin is still very limited, genome-wide proximity ligation assays (6,8-10) promise to grant new insights into 3D genome structures and their relation to nuclear functions. For example, Hi-C data have led to the interesting observation that the human, mouse, and Drosophila genomes are linearly partitioned into physical domains with strong internal connectivity but limited interaction with other domains (1,4). These domains occur below the megabase scale, and are termed topological domains (TDs). The chromatin within a TD often displays uniform functional properties such as histone modifications, active gene density, lamina interaction propensity, replication timing, or nucleotide and repetitive element compositions (1,4,6,11-13). Evidence suggests that topological domains are widely conserved across species, not just across cell types in the same species (1).

1.2. Topological domain identification methods

Several methods have been developed to identify topological domains (1,4,6,11-14). Sexton et al. (2012) defined the first relevant concept, “physical domains,” and devised a probabilistic approach to first infer a distance-scaling factor for each restriction fragment, then identify peaks in these distance-scaling factors as the boundaries of physical domains (4). Dixon et al. (2012) coined the term “topological domain,” and proposed an identification method based on a directionality index (DI). The DI quantifies the degree of upstream or downstream interaction bias for a genomic region, so its value changes drastically at the periphery of a topological domain. Dixon et al. used a Hidden Markov Model (HMM) to identify topological domains from DIs (1). Hou et al.
(2012) developed a Bayesian probability model assuming that the number of paired-end tags linking two loci follows a Poisson distribution, and adopted a Markov chain Monte Carlo (MCMC) strategy to estimate the locations of the TD boundaries (12). Filippova et al. (2014) proposed a dynamic programming method called “Armatus,” which is able to capture persistent domains across various resolutions. Lévy-Leduc et al. (2014) defined a block-wise segmentation model for the detection of TDs, and proved that the maximum likelihood estimate of the block boundaries can be rephrased as a 1D segmentation problem, which can be resolved using standard dynamic programming methods (14). More recently, Rao et al. (2014) used dynamic programming to transform the original contact frequency heatmap into an arrowhead matrix and annotated the domains based on the transformed matrix (6). For all the TDs identified by different methods, insulating factors such as CTCF and other histone modifications were found to be highly enriched at the domain boundaries (1,4,6,11-14).

All the above methods identify topological domains from different points of view, and all are effective at gaining biological insights in downstream analyses. However, source code is publicly available for only a few of these methods (1,4,6,11-14). Moreover, most of the above methods are challenging for biologists to use, because they require extensive pre-processing (1,4,6,11-14) and/or substantial parameter tuning (1,4,6,11-14). Among the methods with open source code, Dixon et al. (1,4,6,11-14) used a Gaussian Mixture Model and the HMM to predict the state of upstream or downstream bias for each bin, and found that the results depend on parameters chosen by the researcher: the input number of components in the mixture model and the cutoff for the median posterior probability.
The Armatus method has a great advantage in that it builds consensus domains combining multi-scale domain sets, and requires only a single major parameter (the resolution to generate domains) and two additional minor parameters (the highest resolution used to generate domains, and the step size to increment the resolution parameter) in its combining step (1,4,6,11-14). The parameters required by the block-wise segmentation approach (14) include the distribution of input data, which is challenging for biological users to determine.

Aside from usability, another important issue is the inconsistency among the domains generated by different methods, or even among domains generated by the same method but with different input parameters. This is especially important given that previous literature has shown that domain boundaries are more likely to be active regions (1,4,6,11-14). Such regions are less compact and more likely to form inter-chromosomal contacts with other active regions (1,4,6,11-14). But if the signal indicating a TD boundary is weak, the determination of topological domains is sensitive to inconsistencies caused by (a) the heuristic nature of the algorithms, (b) noise in the data, and probably most importantly (c) the ambiguity of Hi-C data due to heterogeneity among cells in the sample. A robust TD identification method should identify high-quality TDs in a consistent manner.

1.3. Population based genome structure modeling

The question of how a genome is intricately packed inside the nucleus has sparked a burgeoning field of study. Advanced Hi-C techniques are generating rich datasets of the contact frequencies between chromosome regions, which are extremely valuable for investigating the spatial organization of the genome. Reconstructing the genome in 3D is an appealing approach to understanding the relationship between genome structure and function. However, the 3D organization of the genome varies greatly between cells.
This variability poses a great challenge to interpreting Hi-C contact frequencies, which are averaged across an ensemble of cells. Long-range and inter-chromosomal interactions, which have low frequencies to begin with, are particularly difficult to integrate into consistent 3D models (9,15-21).

To address this challenge, we recently introduced the concept of population-based genome structure modeling. This probabilistic approach deconvolves the ensemble Hi-C data and generates an ensemble of distinct diploid 3D genome structures that is fully consistent with the input dataset of chromatin-chromatin interactions. Hence, our method explicitly models the variability of 3D genome structures across cells (20,22). Moreover, because the generated population contains many different structural states, it can accommodate all observed chromatin interactions, including low-frequency, long-range interactions that would be mutually exclusive in a single structure. Our method is sufficiently flexible to integrate additional experimental information and model the genome at various levels of resolution. In contrast with our approach, most other 3D genome modeling methods generate a single, consensus structure from the complete Hi-C dataset (23-31). However, a single 3D model cannot simultaneously reproduce all the contacts present in the Hi-C experiment, which calls into question the assumption that a single-structure approach can fairly represent the complexity of genome structures.

We have described the details of our population-based method elsewhere (20-22). Briefly, we employ a structure-based deconvolution of Hi-C data and optimize a population of distinct diploid 3D genome structures by maximizing the likelihood of observing the Hi-C data. Because there is no closed form solution, we employ an iterative and step-wise restraint optimization procedure.
Each iteration involves two steps: constraint assignment (termed the A-step) and optimization of the structure population with a combination of the simulated annealing and conjugate gradient methods (termed the M-step). We increase the optimization difficulty in a step-wise manner by gradually adding more contact constraints during the iterative optimization process (20). Importantly, by embedding an ensemble of genome structures in 3D space as part of the optimization process, the method can detect which chromatin contacts are likely to co-occur in individual cells. Hence, the population represents a deconvolution of the Hi-C data into individual structures and domain contacts; it is the best approximation to the underlying true population of genome structures in the Hi-C experiment, given the available data and assumptions. The chromatin domain contacts of the structure population, as a whole, are statistically highly consistent with the Hi-C data. Our approach incorporates the stochastic nature of chromosome conformations and allows a detailed analysis of alternative chromatin structural states.

1.4. Goal of the study

This study has two main goals. In Chapter 2, we propose an efficient and deterministic method, TopDom, to systematically identify topological domains. Compared to previous methods, TopDom has linear time complexity and only depends on a single, intuitive parameter. Using this method, we identify TDs that reproduce the fundamental definition of a chromatin TD, namely that the average contact frequency between regions within a TD is much higher than the average contact frequency between inside and outside regions. We compared our method with two existing methods, and showed that TopDom can identify fine-scaled TDs with high quality.
Using the TDs identified by our method, we show that cross-tissue TD conservation is even stronger than previously reported, and that the locations of housekeeping genes are strongly associated with cross-tissue conserved TDs.

In Chapter 3, we present a population-based 3D genome structure modeling pipeline for further understanding of individual 3D genome structures in the nucleus. Our Population-based Genome Structure (PGS) modeling package takes as input an experimental Hi-C contact frequency map and a segmentation of the genome sequence into chromatin domains (for example, the Topologically Associating Domains, or TADs, used in this work). PGS then generates a population of 3D genome structures in which the physical contacts between the chromatin domains (which are represented as spheres) as a whole reproduce those from the Hi-C experiment. The software automatically generates an analysis report about the structure population, including a description of the structure quality based on its agreement with experiments and various structural genome features, including the radial nuclear positions of individual chromatin domains. The genome structures contain a wealth of information and can be used to detect higher order structural patterns of chromatin regions.

CHAPTER 2. TopDom: An efficient and deterministic method for identifying topological domains in genomes

2.1. INTRODUCTION

In this chapter, we propose an efficient and deterministic method to systematically identify topological domains. Compared to the previous probabilistic methods, which have long run times and require complex parameter tuning, our method has linear time complexity and only depends on a single, intuitive parameter. Using this method, we identify TDs that reproduce the fundamental definition of a chromatin TD, namely that the average contact frequency between regions within a TD is much higher than the average contact frequency between inside and outside regions.
Our new method identifies more TDs and higher quality TDs than the directionality index method. Using the TDs identified by our method, we show that cross-tissue TD conservation is even stronger than previously reported, and that the locations of housekeeping genes are strongly associated with cross-tissue conserved TDs.

2.2. MATERIALS AND METHODS

We focused on three questions while designing a new method for TD identification:
(a) How can we reduce false detections and improve the quality of the TDs?
(b) How can we reduce the computational cost for TD detection?
(c) How can we minimize the number of parameters required to reliably identify TDs?

We propose an efficient and effective method, TopDom, with a single easy-to-adjust parameter. The input data is a Hi-C contact map, where entries are contact frequencies between any two chromatin segments (i.e., bins in the data matrix). Our method has three steps:
(1) For each bin, generate a value binSignal by computing the average contact frequency among pairs of chromatin regions (one upstream, the other downstream) in a small window surrounding the bin. This step results in a curve binSignal(i) that runs along the chromosome.
(2) Discover TD boundaries as local minima in the binSignal(i) series.
(3) Filter out false detections in the local minima by statistical testing.
Each step is described in more detail below.

2.2.1. Generating binSignal by computing bin-level contact frequencies

A TD boundary can be defined as a region between two adjacent TDs. In general, the contact frequencies between regions upstream and downstream of a TD boundary are lower than those between two regions within a TD. We use this property to identify TD boundaries. First, for each bin, we compute the average contact frequency between upstream and downstream regions around the bin location. The size of the window for this calculation is controlled by a free parameter w. Let i denote the bin index.
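As a rough illustration of this first step, the window average can be computed as in the Python sketch below. This is not the TopDom implementation; it assumes a symmetric contact matrix given as a list of lists, a window of w upstream bins ending at bin i and w downstream bins starting at bin i+1, and (as a further assumption) averaging over whatever pairs are available near the chromosome ends. The function name bin_signal is hypothetical.

```python
def bin_signal(contacts, w):
    """Average upstream-downstream contact frequency around each bin.

    contacts: square symmetric matrix (list of lists) of contact frequencies.
    w: window size in bins (assumed convention: w bins on each side).
    """
    n = len(contacts)
    signal = []
    for i in range(n):
        upstream = range(max(0, i - w + 1), i + 1)        # bins i-w+1 .. i
        downstream = range(i + 1, min(n, i + w + 1))      # bins i+1 .. i+w
        values = [contacts[a][b] for a in upstream for b in downstream]
        # Assumption: truncated windows at chromosome ends use the pairs
        # that exist; a bin with no downstream neighbour gets 0.0.
        signal.append(sum(values) / len(values) if values else 0.0)
    return signal


# Toy map: two domains (bins 0-3 and 4-7), dense inside, sparse across.
toy = [[10 if (a < 4) == (b < 4) else 1 for b in range(8)] for a in range(8)]
print(bin_signal(toy, 2))
```

On this toy map the signal dips at bin 3, the last bin before the domain switch, which is the behaviour the method relies on when it later searches for local minima.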
We define a window of length 2w that selects upstream regions U_i = { i-w+1, …, i-1, i } and downstream regions D_i = { i+1, i+2, …, i+w } around the bin i. The average contact frequency, denoted binSignal(i), is calculated as follows:

$$\mathrm{binSignal}(i) = \frac{1}{w^{2}} \sum_{l=1}^{w} \sum_{m=1}^{w} \mathrm{cont.freq}\big(U_i(l), D_i(m)\big) \qquad \text{(eq. 1)}$$

where cont.freq indicates the contact frequency between two bins. Intuitively, binSignal(i) represents the average contact frequency between bins in the neighbourhood of i, as illustrated by the diamond-shaped area of Figure 1-(a). We expect binSignal(i) to be high for bins located close to the center of a TD, while it should be low at a TD boundary. In the following section, we present our parameter-free approach to find the curve whose shape best fits binSignal across a TD boundary, and then detect local minima using the fitted curve.

2.2.2. Detect TD boundaries based on binSignals

Intuitively, local minima in the binSignal series along a chromosome represent TD boundaries. However, some local minima result from noise in the data. In order to capture the dominant local minima, we first smooth the binSignal curve. Our strategy is to approximate the binSignal curve with line segments to capture major trends, and for this purpose we adopt the linear-time algorithm of Kumar Ray et al. (32,33). Specifically, we fit binSignal with a piecewise linear function consisting of the longest possible line segments with minimum fitting error, defined as the sum of distances from points in binSignal to the fitted line segments. The fitness function for a given line segment j is calculated as Fj = Lj - Ej, where Lj denotes the line length and Ej the fitting error. The end of a fitted line segment is termed a turning point.

The algorithm is detailed below. Given a fixed starting point (Pstart), it tests line segments of increasing length connecting Pstart to a later point (Pj).
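The fitness Fj = Lj - Ej of a single candidate segment can be sketched in Python as follows. This is a minimal illustration rather than the TopDom code; it assumes that each point is a (bin index, binSignal value) pair, that Lj is the Euclidean length of the segment, and that the fitting error is the sum of perpendicular distances from the intermediate points to the line. The function names are hypothetical.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (x0, y0), (x1, y1), (x2, y2) = p, a, b
    numerator = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
    return numerator / math.hypot(x2 - x1, y2 - y1)


def segment_fitness(signal, start, j):
    """Fitness F = L - E of the segment from index start to index j (j > start)."""
    a, b = (start, signal[start]), (j, signal[j])
    length = math.hypot(b[0] - a[0], b[1] - a[1])          # L: segment length
    error = sum(point_line_distance((k, signal[k]), a, b)  # E: fitting error
                for k in range(start + 1, j))
    return length - error


# A perfectly straight run has zero error, so its fitness equals its length.
print(segment_fitness([0.0, 1.0, 2.0, 3.0], 0, 3))
# A spike between the endpoints is penalised through the error term.
print(segment_fitness([0.0, 0.0, 3.0, 0.0, 0.0], 0, 4))
```

Under these assumptions, extending a segment along a straight trend keeps raising the fitness, while extending it past a bend adds more error than length.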
The fitness generally increases with line length, but when the algorithm finds a line with a smaller fitness score than the previous line, it saves the previous line as part of the final curve. The previously tested endpoint becomes a turning point and the new starting point for the next iteration. Repeating this procedure until Pj arrives at the end of binSignal (the end of the chromosome), we are able to build a piecewise linear function that clearly identifies all turning points in binSignal.

Curve Fitting Algorithm

    Pstart = signal start, Pend = signal end
    Fj = 0, Fj-1 = 0
    Pj = Pstart + 2
    Do while Pstart <= Pend and Pj <= Pend
        // line(Pa, Pb) = a line connecting two points Pa and Pb
        Lj = length of line(Pstart, Pj)
        Ej = sum of distance error(Pk, line(Pstart, Pj)), where Pk are any points between Pstart and Pj
        Fj = Lj - Ej
        if (Fj < Fj-1)
            Set Pj-1 as turning point
            Pstart = Pj-1
            Pj = Pstart + 2
            Fj-1 = 0
        else
            Fj-1 = Fj
            Pj = Pj + 1
        endif
    loop

Given the set of turning points from the fitted line segments, we then search for the local minima that have the smallest contact frequencies compared to those of their neighboring bins. Local minima are points that satisfy the following two conditions:
1) The derivative changes from negative to positive in the interval between two adjacent turning points.
2) The contact frequency has the smallest value in the interval between two adjacent turning points.

Figure 1-(b) shows the original binSignal curve (black), the fitted curve (blue dotted line), and the turning points (blue dots). The local minima (red inverted triangles) capture “TD boundary-like” bins, and avoid weak local minima in the original curve that are likely due to noise.

2.2.3. Statistical filtering of false positive TD boundaries
To filter false positives from the identified TD boundaries, we take advantage of the fact that chromatin interactions inside TDs generally have higher frequencies than those between adjacent TDs. This means that at a TD boundary, interactions between upstream and downstream bins (i.e., between two different TDs) should be much less frequent than interactions between different upstream neighbour bins, or interactions between different downstream neighbour bins. Thus, given a bin i, an adjacent upstream window of length w, and an adjacent downstream window of length w, we denote interactions between the up- and downstream windows as the ‘between.interactions’, and interactions within the up- or downstream windows as ‘within.interactions’. If bin i is located at a TD boundary, we expect within.interactions to be stronger than between.interactions; otherwise, there should be no significant difference between within.interactions and between.interactions.

$$\mathrm{between.interactions}(i) = \{\, \mathrm{cont.freq}(U_i(l), D_i(m)) \mid 1 \le l \le w,\ 1 \le m \le w \,\}$$

$$\mathrm{within.interactions}(i) = \{\, \mathrm{cont.freq}(U_i(l), U_i(m)) \mid 1 \le l < m \le w \,\} \ \text{or}\ \{\, \mathrm{cont.freq}(D_i(l), D_i(m)) \mid 1 \le l < m \le w \,\}$$

We perform the Wilcoxon rank-sum test to assess whether there is a significant difference between within.interactions and between.interactions for each bin. Because the contact frequency between two bins is highly dependent on the genomic distance between them, we calculate the z-score of each cont.freq(A, B), normalized by all cont.freq(A, B) with the same genomic distance. Finally, we filter out local minima with p-values larger than 0.05. As shown in Figure 1-(c), almost all local minima discovered in the processed binSignal curve are associated with p-values < 0.05. In practice, only a small proportion of the local minima are discarded at this step. Note that although Steps 2 and 3 are both designed to identify TDs based on the fundamental definition, Step 3 draws on a broader chromatin range (two adjacent TDs) than Step 2 (only a window around a TD boundary).
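The filtering step can be sketched in Python as follows. This is an illustrative reading of the description above, not the TopDom code: it assumes a symmetric contact matrix, z-scores computed per diagonal with the population standard deviation, within.interactions taken as distinct bin pairs inside each window, and a two-sided rank-sum test using the normal approximation with average ranks for ties and no continuity correction. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def distance_zscores(contacts):
    """Z-score each entry against all entries at the same genomic distance."""
    n = len(contacts)
    z = [[0.0] * n for _ in range(n)]
    for d in range(n):
        diagonal = [contacts[a][a + d] for a in range(n - d)]
        mu, sd = mean(diagonal), pstdev(diagonal)
        for a in range(n - d):
            value = (contacts[a][a + d] - mu) / sd if sd > 0 else 0.0
            z[a][a + d] = z[a + d][a] = value
    return z


def rank_sum_pvalue(xs, ys):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    n1, n2 = len(xs), len(ys)
    if n1 == 0 or n2 == 0:
        return 1.0
    pooled = sorted(list(xs) + list(ys))
    rank_of, k = {}, 0
    while k < len(pooled):                      # average ranks over ties
        j = k
        while j < len(pooled) and pooled[j] == pooled[k]:
            j += 1
        rank_of[pooled[k]] = (k + 1 + j) / 2.0  # mean of ranks k+1 .. j
        k = j
    u1 = sum(rank_of[v] for v in xs) - n1 * (n1 + 1) / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - n1 * n2 / 2.0) / sd_u
    return math.erfc(abs(z) / math.sqrt(2.0))


def boundary_pvalue(z, i, w):
    """Compare within- and between-window z-scores around bin i."""
    n = len(z)
    up = range(max(0, i - w + 1), i + 1)
    down = range(i + 1, min(n, i + w + 1))
    between = [z[a][b] for a in up for b in down]
    within = [z[a][b] for grp in (up, down) for a in grp for b in grp if a < b]
    return rank_sum_pvalue(within, between)
```

A candidate local minimum at bin i would then be kept when boundary_pvalue(z, i, w) is at most 0.05 and discarded otherwise. With small windows the normal approximation is coarse, so an exact test may be preferable in practice.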
Given all identified local minima and the p-values of all bins along the chromosome, we use the following rule to annotate TDs and boundary regions: given two consecutive local minima, if any bin does not show a significant difference between the contact frequencies of within.interactions and between.interactions (p-value > 0.05), we classify the region between the minima as a TD; otherwise, we classify it as a boundary region. The boundary regions represent TD-free chromatin at the given sequencing resolution and current parameter settings.

2.3. RESULTS

2.3.1. Determination of the TopDom parameter

We performed our analysis on Hi-C datasets of two mouse cell types (embryonic stem cell and cortex cell) and two human cell lines (embryonic stem cell and IMR90), at a bin resolution of 40kb, as suggested in the previous study (1). TopDom has a single adjustable parameter, the window size w used to compute binSignal. In general, as w increases, the size of the discovered TDs increases and the number of TDs decreases (Figure 2). To determine the best window size w, we rely on an important characteristic of TDs: bins within a given TD should have more similar contact frequency profiles than bins outside the TD. Therefore, we calculated Pearson’s correlation coefficient (PCC) between the contact profiles of bins within a TD as a quality measurement. Moreover, since topological domains are local features of chromatin organization, we also calculate the weighted Pearson’s correlation coefficient (wPCC), where contacts inside the TDs are weighted more. Specifically, each bin’s contribution to wPCC is weighted by the background contact frequency b_i for any two bins at a distance i between the bin and the center bin of the TD. For the contact profiles x and y of two bins within a TD, the weighted correlation coefficient is calculated according to eq. 2:

$$\mathrm{wPCC}(x, y; b) = \frac{\mathrm{wCov}(x, y; b)}{\sqrt{\mathrm{wCov}(x, x; b)\,\mathrm{wCov}(y, y; b)}} \qquad \text{(eq. 2)}$$

where

$$\mathrm{wCov}(x, y; b) = \frac{\sum_i b_i \big(x_i - m(x; b)\big)\big(y_i - m(y; b)\big)}{\sum_i b_i} \quad \text{and} \quad m(x; b) = \frac{\sum_i b_i x_i}{\sum_i b_i}$$

As shown in Figure 3, among the window sizes w = 3, 5, 7, 9, 12, and 15, the choice w=5 consistently achieved the highest average PCC/wPCC scores. This measurement can be considered a general guideline to determine w, as the ideal value might depend on the genome studied. Considering the previously reported minimum TD size (~200Kb) (1) and our bin size of 40Kb, w=5 is a reasonable setting. All results discussed below, unless otherwise stated, are based on the setting w=5.

2.3.2. Comparison between TopDom and existing methods

Considering the popularity of existing methods (based on the number of citations and source code availability), we compared our TopDom method with the directionality index method (1) and the recently developed HiCseg method (14). We refer to these two methods hereafter as DI and HiCseg, respectively.

Our TopDom program (available via http://zhoulab.usc.edu/TopDom/) was written as an R script and tested on an Intel Xeon 3.3GHz computer with 10GB RAM. We ran the HiCseg algorithm on the same computer with nb_change_max=500, distrib="G", and model="Dplus". For the DI method, we followed the default settings mentioned in (1). In the same computational environment, TopDom is more efficient at identifying TDs from the same input Hi-C data. TopDom takes 6 to 7 minutes, while DI takes >8 hours and HiCseg takes about 2.5 hours to process the whole mouse genome.

We counted the number of TDs identified by the three methods. Setting the window size w=5, TopDom identifies more TDs than the other two methods; consequently, the average size of TDs identified by TopDom is smaller (see Table 1). Consistent with results from the DI and HiCseg methods, we found that the average size of TDs in hESC is slightly smaller than that in IMR90, ~450Kb versus ~600Kb respectively.
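The weighted correlation of eq. 2, one of the TD quality scores used in this comparison, can be sketched in Python as follows. This is a minimal illustration under the assumption that x and y are equal-length contact profiles and b is a list of positive weights of the same length; the function names are hypothetical and the code is not taken from TopDom.

```python
import math


def weighted_mean(x, b):
    """m(x; b): mean of x weighted by b."""
    return sum(bi * xi for bi, xi in zip(b, x)) / sum(b)


def weighted_cov(x, y, b):
    """wCov(x, y; b): covariance of x and y weighted by b."""
    mx, my = weighted_mean(x, b), weighted_mean(y, b)
    return sum(bi * (xi - mx) * (yi - my)
               for bi, xi, yi in zip(b, x, y)) / sum(b)


def weighted_pcc(x, y, b):
    """wPCC(x, y; b): weighted Pearson correlation of two contact profiles."""
    return weighted_cov(x, y, b) / math.sqrt(
        weighted_cov(x, x, b) * weighted_cov(y, y, b))


# With uniform weights the score reduces to the ordinary Pearson correlation.
print(weighted_pcc([1.0, 2.0, 4.0, 3.0], [2.0, 1.0, 5.0, 4.0], [1, 1, 1, 1]))
```

A profile with zero weighted variance makes the denominator zero, so a caller would need to handle constant profiles separately.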
As shown in Figure 4, TopDom captures more boundary-like regions, and most of the TD boundaries discovered by the DI and HiCseg methods are covered by the TopDom boundaries. While the DI method is generally good at identifying boundaries between large TDs, TopDom and HiCseg are able to detect TDs of smaller size. Thus, both algorithms reveal the topological structure of a genome on a finer scale than DI, and with improved efficiency.

We then compared the methods in terms of three different quality measurements: the intra-TD Pearson’s correlation coefficient (PCC), the intra-TD weighted Pearson’s correlation coefficient (wPCC), and the difference between the average intra-TD and inter-TD contact frequencies. The last measure is a good alternative quality score because bins in the same TD should have high-frequency interactions, while bins from adjacent TDs should have limited interactions. Let Intra(i) denote the average of contact frequencies between bins within the same TD i, and Inter(i, j) denote the average of contact frequencies between a bin in TD i and a bin in adjacent TD j, where |i – j| = 1. The TD quality can then be defined as Intra(i) – Inter(i, j).

With the setting w=5, TopDom displays significantly better performance in terms of the mean and variance of all three quality measurements across all four cell lines (p-value < 1e−50 by t-test), except for the measurements of PCC and wPCC on IMR90 (Figure 5). This evaluation indicates that generally, our method more accurately identifies TDs in terms of the similarity of their contact profiles.

2.3.3. Epigenetic characteristics of chromatin in Topological Domains

We explored whether certain regulatory factors might be associated with topological boundary regions. For mouse cortex and mESC cells, we collected ChIP-seq data from Shen et al. 2012 (34) for the architectural protein CTCF, promoter-related marks (RNA Polymerase II and H3K4me3), and enhancer-related histone modifications (H3K4me1 and H3K27ac).
As shown in Figure 6, in both cell types CTCF binding sites are twice as enriched near TD boundaries compared to surrounding regions, confirming the role of CTCF as an insulator (35,36). Similarly, promoter marks such as RNA Polymerase II and H3K4me3 also peak near the boundaries in both cell types (Figure 6). This observation suggests that gene transcription start sites (TSSs) are mainly located at TD boundaries. In addition, H3K4me1 is slightly depleted at locations close to the TD boundaries in both cell types. Interestingly, we observed that H3K27ac shows a different pattern, with a slight peak at the TD boundaries in the mESC cells and a slight depletion at the TD boundaries in mouse cortex cells. The signals are weak for both of these marks, however, due to the regulatory complexity of enhancer regions (Figure 6). All of these observations are highly consistent with previous discoveries (1,4,11-13,37) and support the claim that functional organizations are closely related to physical TD structures.

2.3.4. TopDom can identify fine-scaled topological domain structures

Our method identified more TDs than the other two methods at w=5 (see Table 1), and its domains are smaller than those identified by the other two methods. There is a great deal of overlap between the regions identified as TDs (comparing DI to TopDom and HiCseg to TopDom), as shown in Figure 4. We now analyze the overall consistency between different sets of TDs, and ask whether the new regions flagged by TopDom are likely to be true TDs.

We classify all identified TD boundaries as common boundaries if they are identified by two different methods or in two cell types, and unique boundaries otherwise. The following paragraphs describe how we match TDs from different sets to identify corresponding TD boundaries. Let A and B be two sets of TDs identified by the two methods on the same sample, {a1, a2, …, an} and {b1, b2, …, bm}.
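The matching of a TD from A against runs of consecutive TDs in B, which is defined in the text that follows, can be sketched in Python as shown here. The sketch assumes that each TD is a half-open interval (start bin, end bin), that the TDs in B are sorted along the chromosome, and that the overlap score is the ratio of shared bins to the union of covered bins; the exhaustive search over all consecutive runs is chosen for clarity, not speed. The function names are hypothetical.

```python
def overlap(a, subset):
    """Shared bins divided by the union of bins covered by a and by subset."""
    a_bins = set(range(a[0], a[1]))
    b_bins = set()
    for start, end in subset:
        b_bins.update(range(start, end))
    return len(a_bins & b_bins) / len(a_bins | b_bins)


def best_match(a, tds_b):
    """Return (score, run) for the consecutive run of TDs in tds_b that best matches a."""
    best_score, best_run = 0.0, []
    for first in range(len(tds_b)):
        for last in range(first + 1, len(tds_b) + 1):
            run = tds_b[first:last]              # consecutive TDs only
            score = overlap(a, run)
            if score > best_score:
                best_score, best_run = score, run
    return best_score, best_run


# One TD in A overlapped by four TDs in B; the last three match it best.
a_td = (10, 40)
b_tds = [(0, 12), (12, 20), (20, 30), (30, 40), (40, 50)]
print(best_match(a_td, b_tds))
```

In this toy case the outer edges of the best-matching run would be treated as common boundaries and the edges inside the run as unique ones.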
For each TD in A (a_i ∈ A), we aim to find the best matching subset B' ⊂ B, where B' contains consecutive TDs along a chromosome. The overlap score measures the degree of matching between a TD from A and a set of consecutive TDs from B:

$$\mathrm{overlap}(a_i, B') = \frac{|\{a_i\} \cap B'|}{|\{a_i\} \cup B'|}$$

We computed the overlap scores between every TD in A (a_i ∈ A) and all possible subsets of consecutive TDs in B (B' ⊂ B). The subset with the highest overlap score is selected as the best matching set of a_i:

$$\mathrm{bestmatch}(a_i) = \underset{B' \subset B}{\arg\max}\ \{\mathrm{overlap}(a_i, B')\}$$

Note that the overlap score and the bestmatch operation can also be used to compare TDs across cell types.

Figure 7-(a) illustrates this concept. The i-th TD in A (a_i ∈ A) is overlapped by four TDs (b_j, b_{j+1}, b_{j+2}, b_{j+3}) in B. We identify the subset B' = {b_{j+1}, b_{j+2}, b_{j+3}} with the highest overlap score as the bestmatch of a_i. Therefore, the boundaries separating b_j from b_{j+1} and b_{j+3} from b_{j+4} are classified as common boundaries, and the boundaries separating b_{j+1} from b_{j+2} and b_{j+2} from b_{j+3} are considered unique boundaries.

As shown in Figure 7-(b), TopDom identified 2,300 to 2,900 unique boundaries in the four cell types when compared with the DI method, and 1,200 to 1,700 unique boundaries when compared with the HiCseg method. This result suggests that the TopDom TD boundaries include most of the TD boundaries detected by the DI and HiCseg methods. Moreover, we confirm that TD boundaries are strongly conserved across cell types (>70%) (see Figure 7-(c)), which implies that TD structure is also conserved (1).

We next asked whether the TopDom unique boundaries can be considered “true TD boundaries” based on their epigenetic characteristics. We examined the enrichment patterns of three epigenetic profiles, CTCF, PolII, and H3K4me3, at unique and common (shared by different methods) boundaries for the two mouse cell types. As shown in Figure 8, epigenetic enrichment patterns at our unique boundaries are similar to those at the common boundaries.
This strongly suggests that our method is finding fine-scale structures not reported by other methods.

2.3.5. TopDom reveals a significant association between TD conservation and housekeeping gene locations

We examined the locations and properties of genes in the context of TDs. For all 22,000 hg18 RefSeq genes in the UCSC Genome Browser database, we assigned each gene to one of the TDs identified by TopDom. RNA-seq data of the hESC (GSM438363) and IMR90 (GSM438361) cell lines (38-41) were collected from the NCBI Epigenomics Gateway, and Cufflinks (42) was used to measure gene expression levels. Similar to the epigenetic profiles, gene density and gene expression increase in regions close to TD boundaries (Figure 9-(a)), suggesting that TD boundaries are likely to have open chromatin. Furthermore, we examined the locations of about 3,000 human housekeeping genes (43) and observed that they reside significantly closer to TD boundaries than would be expected under random assignment (Figure 9-(b)). This observation is consistent with previous claims that housekeeping genes tend to be located at TD boundaries (1,4). In contrast, we selected 230 differentially expressed genes (q-value < 0.05) using Cuffdiff (42) and observed that the locations of those genes do not show a preference towards TD boundaries (Figure 9-(b)). This could indicate that gene expression differences are largely driven by complex regulatory interactions within the TDs.

Considering that housekeeping genes behave similarly across cell types, and that their locations are highly related to TD structures, we next asked whether there is a relationship between the locations of housekeeping genes and the structural conservation of TDs across cell types. Dixon et al. reported that around 50~80% of boundaries are shared across cell types (1).
In our results, around 80~90% of TD boundaries in the differentiated cells (IMR90 cell, mouse cortex cell) are included in those of embryonic stem cells (hESC, mESC) (Figure 7-(c)), and more than 70% of TD boundaries found in embryonic stem cells are shared with differentiated cells. This suggests that our TDs can provide even stronger support for the conservation of TDs across cell types.

We then examined whether the conserved TD structures are related to the locations of housekeeping genes. First, we counted the number of housekeeping genes that are close to common and unique boundaries in the IMR90 and hESC cells. As shown in Table 2, around 90% of housekeeping genes are located close to common boundaries. Considering that the proportion of common boundaries in hESC is around 70%, this proportion is significantly higher than expected. By a chi-square test (Table 2), we confirm that housekeeping genes are located significantly closer to common boundaries (p-value < 2.2e-16). In summary, our analysis provides strong evidence that TDs are highly conserved across cell types, and that the locations of housekeeping genes are closely related to the conserved TDs.

2.4 DISCUSSION

We have presented an efficient and deterministic method, TopDom, for identifying chromatin topological domains (TDs). Compared to previous methods, TopDom is not only computationally efficient but also easy for general users to learn and apply. Using several objective assessments, we show that our method captures finer-scale TDs with generally higher quality than two popular existing methods. Given a Hi-C dataset with fixed resolution (bin size), the only parameter that needs to be chosen by a researcher is the window size, and we have discussed how to choose an appropriate value for this parameter in the Results section.
We discovered that the TDs identified under different window sizes w are slightly different, and additionally confirmed that the identified TDs overlap with each other a great deal (Figure 10). This indicates that the set of TD boundaries identified with a small value of w will include most of the TD boundaries identified with a large value of w.

Furthermore, we applied TopDom to the Hi-C data at 20kb and 40kb resolutions to determine how the bin resolution affects TopDom performance. According to the approach described above, we set w=5 for the 40kb Hi-C data of all four cell lines and w=7, 15, 15, and 5 for the 20kb Hi-C data of mESC, Cortex, hESC, and IMR90, respectively. We then computed the overlap scores of the TDs identified on 20kb-resolution data with those identified on 40kb-resolution data. The mean overlap score is around 0.97 on all four cell lines, with a standard deviation around 0.09, indicating that TopDom is not sensitive to the choice of bin size.

We also demonstrated the validity of the TopDom method through several biological assessments. The TDs identified by TopDom support previous claims that TD boundaries are highly related to CTCF binding, promoter regions, and housekeeping gene locations. We also observed that TDs are highly conserved; there can be more than 70% overlap in their boundaries across cell types. Moreover, when comparing TDs from two different cell types, we found that housekeeping genes are preferentially located close to TD boundaries in both cell types, which implies that topological conservation is associated with the locations of these genes.

TD identification can not only provide insights into local chromatin structures, but also facilitate the construction of global 3D genome models. Since chromatin interactions within a TD are much more frequent than interactions between TDs, a coarse genome structural model can use TDs as its basic building blocks.
While the fine structures within a TD can vary (44), all chromatin regions within a TD are likely to co-localize in nuclear space. Thus, our method, by efficiently and effectively identifying TDs, can support emerging efforts to investigate higher-order genome organization.

CHAPTER 3
PGS: a dynamic and automated population-based genome structure software

3.1. INTRODUCTION

In this chapter, we present a software package that generates a population of 3D genome structures based on a probabilistic framework that uses Hi-C data. The foundation of our method has been described in Kalhor et al. Nature Biotechnology 2012 and Tjong et al. PNAS 2016. We explicitly address one of the major challenges in Hi-C data analysis, namely that individual 3D genome structures vary dramatically from cell to cell even within an isogenic sample, while Hi-C data can measure only the relative frequencies of chromosome interactions averaged over a large population of cells. Our 3D genome structure analysis provides new insights into genome organization, as demonstrated in an additional publication (Dai et al. Nature Communications 2016).

Our Population-based Genome Structure (PGS) modeling package takes as input an experimental Hi-C contact frequency map and a segmentation of the genome sequence into chromatin domains (for example, the Topologically Associating Domains, or TADs, used in this chapter) (Figure 11). PGS then generates a population of 3D genome structures in which the physical contacts between the chromatin domains (which are represented as spheres) as a whole reproduce those from the Hi-C experiment. The software automatically generates an analysis report about the structure population, including a description of the structure quality based on its agreement with experiments and various structural genome features, including the radial nuclear positions of individual chromatin domains.
The genome structures contain a wealth of information and can be used to detect higher order structural patterns of chromatin regions (as described in our previous

3.2. SOFTWARE DESIGN AND IMPLEMENTATION

The population-based genome structure modeling method generates a large number of genome structure models, which as a whole constitute an optimized structure population. The computational complexity also stems from the large-scale data (i.e. high-resolution genome-wide Hi-C contact frequencies) that must be processed to generate the input restraints for the structure population modeling. Considering these computational challenges, PGS has been designed to run on high performance computing (HPC) environments, such as Sun Grid Engine (SGE) and TORQUE. We also designed PGS to work on a single laptop or personal computer for a small population of structures (around 100 models).

PGS is implemented as a single Python-based software package that users can easily install and use. We wrapped the whole source code into pyflow (https://github.com/Illumina/pyflow), a lightweight parallel task engine developed by Illumina, which allows the whole complex simulation process to run through a single command without intermediate human intervention. Note that, while the original pyflow only supports local computers and SGE, we extended it to pyflow-alabmod so that PGS can also run on HPC systems with PBS scripts. In addition to PGS, users are required to install the independent modeling software IMP 2.4 or above (IMP can be downloaded from https://integrativemodeling.org/), as well as Python 2.7 or above with the numpy, matplotlib, pandas, h5py, seaborn, and scipy libraries.

To provide flexibility of usage, we divided the whole workflow into three consecutive components as follows (Figure 11):
1) Producing domain-domain interaction matrix from input data.
2) Generating structure population.
3) Summarizing results with basic analysis.

In this way, we can flexibly accommodate users; for example, those who already have their own domain-domain interaction matrix can skip component 1 via the interactive graphical user interface (GUI; Figure 12a). By default, PGS takes a raw (Hi-C) interaction matrix as input (Figure 12b). In any case, the user must provide a text file containing the chromosome segmentations (i.e., TAD information; Figure 12c). The file formatting is described in the Materials section.

To run our software seamlessly, PGS comes with a GUI to help users generate an input configuration file (json file). For an experienced user, it is straightforward to modify the input configuration file. It contains the location of the input raw Hi-C matrix file, the location of the chromatin segmentation or TAD definition file, modeling parameters, and system parameters. In the first component, PGS normalizes the raw Hi-C contact map using KR-normalization (45) and generates a TAD-level interaction probability matrix. In the second component, PGS generates a given number of genome structures through structure optimization by iterative A-step and M-step cycles. Finally, in the third component, PGS produces analysis reports about the optimization quality and basic structural analyses, such as contact frequency heat maps and the nuclear radial position of each TAD (Figure 13).

3.3 MATERIALS

3.3.1. Equipment

A. Download (or “git clone”) PGS from https://www.github.com/alberlab/PGS. PGS is a Python package, which runs on Linux and Mac OS X systems. Python can be downloaded from http://www.python.org.

Necessary dependencies are as follows:
- Numpy (http://www.numpy.org/)
- Scipy (http://www.scipy.org/)
- Matplotlib (http://matplotlib.org/)
- Pandas (http://pandas.pydata.org/)
- H5py (http://www.h5py.org/)
- Seaborn (http://seaborn.pydata.org/)
- IMP (Integrative Modeling Package) version 2.4 or above, which can be downloaded from https://integrativemodeling.org/

Input data set (experimental data): depending on the options chosen by the user, PGS can take different sets of input files.

Option 1 (raw + TAD definition): the user provides a raw contact frequency matrix (uniformly binned) and user-defined TAD index information. PGS will then generate a TAD-TAD contact probability matrix and automatically proceed to the modeling flow. This option requires two input files:

File 1: Genome-wide chromosome-chromosome interaction matrix. The text file (can be gzip or bzip compressed) is formatted as follows (see Fig. 3b).
No header
Column 1: chromosome names (e.g. Chr1, Chr2, ..., ChrX)
Column 2: start genomic positions of bins (0-based)
Column 3: end genomic positions of bins (1-based)
Columns 4 to N+3: interaction matrix of size N (integers)

File 2: Chromosome segmentation (topologically associating domain/TAD) file (Fig. 3c). The text file adopts a bed file format:
No header
Column 1: chromosome names (e.g. Chr1, Chr2, ..., ChrX)
Column 2: start genomic positions of TADs (0-based)
Column 3: end genomic positions of TADs (1-based)
Column 4: flag of TADs (“domain”, “gap”, “CEN”)

Option 2 (TAD-TAD prob + TAD information): the user has prepared a TAD-TAD contact probability matrix and the TAD information file. This option requires two input files that are similar to files 1 and 2 in Option 1, where the matrix bins represent TADs and the matrix elements are probability values from 0 to 1.

Option 3 (hdf5 prob): the user provides a TAD-TAD contact probability matrix that was generated by PGS. This option is useful for reproducing structure populations or for calculations with different parameters using the same input data.
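The four-column TAD file layout described for File 2 can be checked before a run with a few lines of Python. The following is a minimal sketch and not part of PGS: the function name `read_tad_file`, the assumption that columns are whitespace-delimited, the assumption that the three listed flags are the only valid ones, and the example rows are all illustrative. It is written for Python 3.

```python
import io

# Flags listed in the File 2 description; treating them as the only
# valid values is an assumption of this sketch.
VALID_FLAGS = {"domain", "gap", "CEN"}


def read_tad_file(handle):
    """Parse a headerless 4-column TAD file into (chrom, start, end, flag) tuples."""
    tads = []
    for line_no, line in enumerate(handle, start=1):
        fields = line.split()  # assumes whitespace-delimited columns
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(
                "line %d: expected 4 columns, got %d" % (line_no, len(fields))
            )
        chrom, flag = fields[0], fields[3]
        start, end = int(fields[1]), int(fields[2])
        if start >= end:
            raise ValueError("line %d: start must be smaller than end" % line_no)
        if flag not in VALID_FLAGS:
            raise ValueError("line %d: unknown flag %r" % (line_no, flag))
        tads.append((chrom, start, end, flag))
    return tads


# Invented example rows, for illustration only.
example = io.StringIO(
    "chr1\t0\t1200000\tdomain\n"
    "chr1\t1200000\t1500000\tgap\n"
    "chr1\t1500000\t2400000\tdomain\n"
)
tads = read_tad_file(example)
n_domains = sum(1 for tad in tads if tad[3] == "domain")
print(len(tads), n_domains)
```

With a real file, the same function could be called on an open file handle. Failing early on a malformed row is a design choice of this sketch; it is cheaper than discovering the problem after the matrix-building step has started.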
3.3.2 Equipment setup

We recommend following the easy installation instructions in our online documentation (http://pgs.readthedocs.io/en/latest/quickstart.html). The easiest way to install PGS is to use the conda package manager. Either Anaconda (https://www.continuum.io/downloads) or the minimal Miniconda (http://conda.pydata.org/miniconda.html) is suitable for managing the required packages, including IMP. Once the PGS package has been downloaded with all the dependencies mentioned above, set up the package using the following command.

$ python setup.py install

The script “setup.py” is located in the PGS directory. To test whether PGS is installed properly, users can then execute the following shell commands.

$ cd test
$ sh runPgs_workflow_test.sh

This process should take less than two minutes on any current computing workstation.

3.4 PROCEDURES

3.4.1 Step 1: Generate a configuration file

A user can either modify the prepared configuration file and the execution script or use a graphical user interface (GUI), called PGS-Helper (requires Java), to generate those files.

A. Using PGS Helper (if Java is installed)

$ java -jar PGSHelper.jar

The command will display a GUI (Fig. 3a) for the user to fill in the information needed. Most of the fields have been pre-populated for the user to review and modify if necessary. There are only 4 blank fields for the user to complete (in points i to iii below). In the following, we describe the fields displayed in the GUI.

i. Working Directory
This will be the project directory where the output of this GUI (an executable script runPGS.sh and a configuration file input_config.json), log files (pyflow.data directory), and the result of 3D genome modeling will be stored.

ii. PGS Source – Directory
Specify the PGS installation directory, which contains pgs.py.

iii. Input
Select one of the three options depending on which experimental data are to be used for genome modeling (see Equipment for details of the options), and then specify the file locations.
Genome: specify the genome version. The current version supports recent human and mouse genomes, i.e. hg19, hg38, mm9, and mm10. PGS generates diploid autosome and X chromosome representations.
Resolution: an integer indicating the bin resolution (in base pairs) of the raw matrix.

iv. Modeling Parameters
We have populated the default ‘Modeling Parameters’, which can be easily modified.
Num of structures: specify the number of structures to calculate (default = 1,000; 10,000 recommended for final sampling).
Violation cutoff: specify the maximum fraction of violated restraints; smaller values generally result in better agreement with the input data (default = 0.05).
Theta list: specify values 0 < theta ≤ 1 to indicate contact probability thresholds in the step-wise optimization (default = 1, 0.2, 0.1, 0.05, 0.02, 0.01).
Max iteration: specify the maximum number of A/M cycles for each theta (default = 10).
Nucleus radius: specify the nuclear radius in nanometers. The typical value for a human nucleus is 5000 (default).
Genome occupancy: specify the ratio of the genome-wide chromosomal volume to the nuclear volume. The default value is 20%.

v. System Parameters
We have populated the default parameters for computing and job submissions, which can be modified by the user when necessary.
Default core: specify the number of computing cores to use for each job. Light jobs, like the modeling step (M-step), do not require more than one CPU (default = 1).
Default mem MB: specify the memory limit for each job in megabytes (default = 1,500).
Max core: specify the maximum number of computing cores to allocate for a heavy job (e.g. building the matrix and calculating pairwise distance distributions; default = 8).
Max mem MB: specify the memory allocation limit for a heavy job (default = 64,000 MB).

vi. Command Setup
Run mode: specify the user’s computing platform, such as Local computer, Sun Grid Engine or Torque.
Core limit: specify the limit on the number of cores to allocate. (Valid for all 3 run modes. In Local mode, please set this value to the maximum number of cores of the local computer.)
Mem limit: specify the limit on total memory usage in MB.
Optional argument list: additional arguments (user specific) for all job submissions. The GUI will provide a template where the user can modify (add or delete) the options and values, e.g. in ['-q','[qname]','-l','walltime=hh:mm:ss'] replace qname with the user’s HPC queue name, and hh:mm:ss with hours:minutes:seconds.

vii. Click the “Generate” button at the bottom. The user can then review the usage in the bottom box, and confirm to generate the configuration (input_config.json) and executable (runPGS.sh) files.

B. Check or modify the configuration and executable files.

For users who do not have Java installed to run the PGS Helper program, we also provide examples of both the configuration and executable files. Users can open the text files under the pgs/test directory and modify them accordingly. In the following we describe the two files.

i. input_config.json

{
    "source_dir" : "[Directory name where pgs source is]",
    "input" : {
        "contact_map_file_hdf5" : "[Contact map file]",
        "TAD_file" : "[TAD file, .bed format]",
        "resolution" : "[Resolution of input contact_map_file], e.g. 100000",
        "genome" : "[Genome version], e.g. hg19"
    },
    "output_dir" : "[Output Directory to store the results], e.g. $PROJECT_DIR/result",
    "modeling_parameters" : {
        "theta_list" : [Theta list], e.g. ["1", "0.2", "0.1", "0.05", "0.02", "0.01"],
        "num_of_structures" : [Number of structures to generate], e.g. 10000,
        "max_iter_per_theta" : [Max Iterations per job], e.g. 10,
        "violation_cutoff" : [Violation Cutoff], e.g. 0.005,
        "chr_occupancy" : [Chromosome Occupancy], e.g. 0.2,
        "nucleus_radius" : [Nucleus Radius], e.g. 5000.0
    },
    "system" : {
        "max_core" : [Maximum number of cores in a single node], e.g. 8,
        "max_memMB" : [Maximum size of mem(MB) in a single node], e.g. 64000,
        "default_core" : [Default number of cores], e.g. 1,
        "default_memMB" : [Default size of mem(MB)], e.g. 1500
    }
}

ii. runPGS.sh

python $PGS_DIRECTORY/pgs.py --input_config $PROJECT_DIR/input_config.json --run_mode [running platform] --nCores 300 --memMb 800000 --pyflow_dir $PROJECT_DIR --schedulerArgList ["-q","[qname]","-l","walltime=100:00:00"]

Step 2: Run PGS

Users can execute PGS through the following command.

$ sh runPgs.sh

3.5 TROUBLESHOOTING

Troubleshooting advice can be found in Table 3.

3.6 TIMING

Step 1: ~1 minute

Step 2: the timing of this step varies greatly depending on the computing resources, data size, and modeling complexity. We have designed PGS to automatically and dynamically run many successive processes or steps. When running jobs fail, for example because a node is down, the network is busy, or disk I/O fails, PGS resubmits the failed jobs 2 more times. The very first task is to build the input matrix, which should take less than 1 minute if Option 2 or 3 is selected, and varies from several minutes to several hours if Option 1 is selected. For instance, it takes about a minute to process a 2 Mb resolution Hi-C matrix, but it can take 14 hours to build a ~2300x2300 contact probability matrix from a 100 kb resolution Hi-C matrix on a common ~2.8 GHz CPU. The A/M cycles (iterations of A-step and M-step) then start immediately, with PGS submitting many jobs simultaneously on computing clusters. A typical time to finish one M-step optimization for a genome structure of ~2x2300 TADs is about 45~90 minutes (at ~1 Mb resolution).
If the user requests 2,000 structures and allocates 500 CPUs, then PGS will run the first 500 jobs simultaneously and start each of the remaining 1,500 queued jobs whenever one of the 500 allocated CPUs becomes available. PGS waits until the M-step of all the structures finishes; it then submits the A-step job. For this example, the A-step calculation can take about 5-30 minutes. Thus an A/M cycle at ~1 Mb resolution in the example could take about 3 hours. The theta list and the number of iterations per theta value also affect the timing, as this combination is a multiplier of the A/M cycle time. In our experience, the expected total time is about the number of theta values plus 5 to 10, multiplied by the A/M cycle time. Since PGS is a dynamic pipeline that decides (based on the violation cutoff parameter) whether to keep iterating the A/M cycle at the current theta level or to continue to the next theta level, we cannot give an accurate time estimate. The running time also depends on the data set, e.g. noisy or inconsistent data are likely to produce artifacts that are hard to optimize and in turn require more A/M cycles to obtain a fully optimized structure population (~ zero violation score).

3.7 ANTICIPATED RESULTS

The main output of PGS is a structure population. All results are stored under the result directory. In this version, PGS writes four folders under the result directory:

i. probMat: contains the input (contact probability matrix in hdf5 binary format) if option 1 or 2 is selected.

ii. actDist: contains intermediate files generated by the A-step, which will be used in the subsequent M-step.

iii. structure: contains the genome structure information during optimization, saved in hdf5 binary files (with the .hms file extension). One file corresponds to one structure, and it contains a history of optimization snapshots according to the theta parameters. The smallest theta with the latest iteration step (alphabetically ordered, i.e.
the last snapshot) is the final model. We refer to the whole set of final models as the structure population (Fig 4a). Users can then get TAD coordinates and perform further analysis related to their research. We have provided a library of tools in the PGS public repository that can help users easily analyze the structure population (for further details, please refer to the PGS documentation page http://pgs.readthedocs.io/en/latest/tools.html).

iv. report: contains examples of basic analysis with some plots, e.g. heat maps of contact probability matrices, radial positions of TADs, and the quality of optimization (Figs. 4b-e). PGS writes the average nuclear radial position of every TAD in the radialPlot_summary.txt file. Users can also find a summary of the violation portion, which reflects the overall agreement between the experimental data (input of PGS) and the structure population (output of PGS).

CHAPTER 4. CONCLUSION

In this thesis, we have introduced a new method and a new tool for understanding genome organization at the topological domain level. In the first part, we have presented an efficient and deterministic method, TopDom, for identifying chromatin topological domains (TDs). Compared to previous methods, TopDom is not only computationally efficient but also easy for general users to learn and apply. We also demonstrated the validity of the TopDom method through several biological assessments. The TDs identified by TopDom support previous claims that TD boundaries are highly related to CTCF binding, promoter regions, and housekeeping gene locations. We also observed that TDs are highly conserved; there can be more than 70% overlap in their boundaries across cell types. Moreover, when comparing TDs from two different cell types, we found that housekeeping genes are preferentially located close to TD boundaries in both cell types, which implies that topological conservation is associated with the locations of these genes.
In the second part, we have introduced a population-based 3D genome-modeling package called PGS, for further understanding of 3D genome organization. PGS takes a genome-wide Hi-C contact frequency matrix and produces an ensemble of 3D genome structures entirely consistent with the input. PGS has been designed to run parallel jobs automatically and dynamically. PGS supports high-performance computing environments, such as the Sun Grid Engine (SGE) and TORQUE, as well as local individual desktop computers. The software automatically generates an analysis report, and also provides tools to extract and analyze the 3D coordinates of specific domains. We believe that TopDom and PGS will be useful for the scientific community that utilizes Hi-C technology, and will become key tools for the study of spatial genome organization and genome structure-function relationships.

REFERENCES

1. Dixon, J.R., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., Hu, M., Liu, J.S. and Ren, B. (2012) Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 485, 376-380.
2. Lanctôt, C., Cheutin, T., Cremer, M., Cavalli, G. and Cremer, T. (2007) Dynamic genome architecture in the nuclear space: regulation of gene expression in three dimensions. Nature Reviews Genetics, 8, 104-115.
3. Sexton, T., Schober, H., Fraser, P. and Gasser, S.M. (2007) Gene regulation through nuclear organization. Nature structural & molecular biology, 14, 1049-1055.
4. Sexton, T., Yaffe, E., Kenigsberg, E., Bantignies, F., Leblanc, B., Hoichman, M., Parrinello, H., Tanay, A. and Cavalli, G. (2012) Three-dimensional folding and functional organization principles of the Drosophila genome. Cell, 148, 458-472.
5. Dixon, J.R., Jung, I., Selvaraj, S., Shen, Y., Antosiewicz-Bourget, J.E., Lee, A.Y., Ye, Z., Kim, A., Rajagopal, N. and Xie, W. (2015) Chromatin architecture reorganization during stem cell differentiation. Nature, 518, 331-336.
6.
Rao, S.S., Huntley, M.H., Durand, N.C., Stamenova, E.K., Bochkov, I.D., Robinson, J.T., Sanborn, A.L., Machol, I., Omer, A.D. and Lander, E.S. (2014) A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell, 159, 1665-1680.
7. Lin, Y.C., Benner, C., Mansson, R., Heinz, S., Miyazaki, K., Miyazaki, M., Chandra, V., Bossen, C., Glass, C.K. and Murre, C. (2012) Global changes in the nuclear positioning of genes and intra-and interdomain genomic interactions that orchestrate B cell fate. Nature immunology, 13, 1196-1204.
8. Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B.R., Sabo, P.J. and Dorschner, M.O. (2009) Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326, 289-293.
9. Kalhor, R., Tjong, H., Jayathilaka, N., Alber, F. and Chen, L. (2012) Genome architectures revealed by tethered chromosome conformation capture and population-based modeling. Nature biotechnology, 30, 90-98.
10. Duan, Z., Andronescu, M., Schutz, K., McIlwain, S., Kim, Y.J., Lee, C., Shendure, J., Fields, S., Blau, C.A. and Noble, W.S. (2010) A three-dimensional model of the yeast genome. Nature, 465, 363-367.
11. Filippova, D., Patro, R., Duggal, G. and Kingsford, C. (2014) Identification of alternative topological domains in chromatin. Algorithms for Molecular Biology, 9, 14.
12. Hou, C., Li, L., Qin, Z.S. and Corces, V.G. (2012) Gene density, transcription, and insulators contribute to the partition of the Drosophila genome into physical domains. Molecular cell, 48, 471-484.
13. Nora, E.P., Lajoie, B.R., Schulz, E.G., Giorgetti, L., Okamoto, I., Servant, N., Piolot, T., van Berkum, N.L., Meisig, J. and Sedat, J. (2012) Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature, 485, 381-385.
14. Levy-Leduc, C., Delattre, M., Mary-Huard, T. and Robin, S.
(2014) Two-dimensional segmentation for analyzing Hi-C data. Bioinformatics, 30, i386-i392.
15. Junier, I., Dale, R.K., Hou, C., Kepes, F. and Dean, A. (2012) CTCF-mediated transcriptional regulation through cell type-specific chromosome organization in the beta-globin locus. Nucleic acids research, 40, 7718-7727.
16. Barbieri, M., Chotalia, M., Fraser, J., Lavitas, L.M., Dostie, J., Pombo, A. and Nicodemi, M. (2012) Complexity of chromatin folding is captured by the strings and binders switch model. Proc Natl Acad Sci U S A, 109, 16173-16178.
17. Meluzzi, D. and Arya, G. (2013) Recovering ensembles of chromatin conformations from contact probabilities. Nucleic acids research, 41, 63-75.
18. Giorgetti, L., Galupa, R., Nora, E.P., Piolot, T., Lam, F., Dekker, J., Tiana, G. and Heard, E. (2014) Predictive polymer modeling reveals coupled fluctuations in chromosome conformation and transcription. Cell, 157, 950-963.
19. Zhang, B. and Wolynes, P.G. (2015) Topology, structures, and energy landscapes of human chromosomes. Proc Natl Acad Sci U S A, 112, 6062-6067.
20. Tjong, H., Li, W., Kalhor, R., Dai, C., Hao, S., Gong, K., Zhou, Y., Li, H., Zhou, X.J., Le Gros, M.A. et al. (2016) Population-based 3D genome structure analysis reveals driving forces in spatial genome organization. Proc Natl Acad Sci U S A, 113, E1663-1672.
21. Dai, C., Li, W., Tjong, H., Hao, S., Zhou, Y., Li, Q., Chen, L., Zhu, B., Alber, F. and Jasmine Zhou, X. (2016) Mining 3D genome structure populations identifies major factors governing the stability of regulatory communities. Nature communications, 7, 11549.
22. Kalhor, R., Tjong, H., Jayathilaka, N., Alber, F. and Chen, L. (2012) Genome architectures revealed by tethered chromosome conformation capture and population-based modeling. Nat Biotechnol, 30, 90-98.
23. Bau, D., Sanyal, A., Lajoie, B.R., Capriotti, E., Byron, M., Lawrence, J.B., Dekker, J. and Marti-Renom, M.A.
(2010) The three-dimensional folding of the alpha-globin gene domain reveals formation of chromatin globules. Nat Struct Mol Biol, 18, 107-114.
24. Duan, Z., Andronescu, M., Schutz, K., McIlwain, S., Kim, Y.J., Lee, C., Shendure, J., Fields, S., Blau, C.A. and Noble, W.S. (2010) A three-dimensional model of the yeast genome. Nature, 465, 363-367.
25. Bau, D. and Marti-Renom, M.A. (2011) Structure determination of genomic domains by satisfaction of spatial restraints. Chromosome Res, 19, 25-35.
26. Rousseau, M., Fraser, J., Ferraiuolo, M.A., Dostie, J. and Blanchette, M. (2011) Three-dimensional modeling of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling. BMC Bioinformatics, 12, 414.
27. Fraser, J., Rousseau, M., Blanchette, M. and Dostie, J. (2010) Computing chromosome conformation. Methods Mol Biol, 674, 251-268.
28. Hu, M., Deng, K., Qin, Z., Dixon, J., Selvaraj, S., Fang, J., Ren, B. and Liu, J.S. (2013) Bayesian inference of spatial organizations of chromosomes. PLoS Computational Biology, 9, e1002893.
29. Lesne, A., Riposo, J., Roger, P., Cournac, A. and Mozziconacci, J. (2014) 3D genome reconstruction from chromosomal contacts. Nature Methods, 11, 1141-1143.
30. Varoquaux, N., Ay, F., Noble, W.S. and Vert, J.P. (2014) A statistical approach for inferring the 3D structure of the genome. Bioinformatics, 30, i26-33.
31. Ay, F., Bunnik, E.M., Varoquaux, N., Bol, S.M., Prudhomme, J., Vert, J.P., Noble, W.S. and Le Roch, K.G. (2014) Three-dimensional modeling of the P. falciparum genome during the erythrocytic cycle reveals a strong connection between genome architecture and gene expression. Genome Research, 24, 974-988.
32. Ray, B.K. and Ray, K.S. (1993) Determination of optimal polygon from digital curve using L1 norm. Pattern Recognition, 26, 505-509.
33. Ray, B.K. and Ray, K.S. (1994) A non-parametric sequential method for polygonal approximation of digital curves. Pattern Recognition Letters, 15, 161-167.
34.
Shen, Y., Yue, F., McCleary, D.F., Ye, Z., Edsall, L., Kuan, S., Wagner, U., Dixon, J., Lee, L. and Lobanenkov, V.V. (2012) A map of the cis-regulatory sequences in the mouse genome. Nature, 488, 116-120.
35. Van Bortle, K. and Corces, V.G. (2013) The role of chromatin insulators in nuclear architecture and genome function. Current Opinion in Genetics & Development, 23, 212-218.
36. Bickmore, W.A. and van Steensel, B. (2013) Genome architecture: domain organization of interphase chromosomes. Cell, 152, 1270-1284.
37. Jin, F., Li, Y., Dixon, J.R., Selvaraj, S., Ye, Z., Lee, A.Y., Yen, C.-A., Schmitt, A.D., Espinoza, C.A. and Ren, B. (2013) A high-resolution map of the three-dimensional chromatin interactome in human cells. Nature, 503, 290-294.
38. Bernstein, B.E., Stamatoyannopoulos, J.A., Costello, J.F., Ren, B., Milosavljevic, A., Meissner, A., Kellis, M., Marra, M.A., Beaudet, A.L. and Ecker, J.R. (2010) The NIH roadmap epigenomics mapping consortium. Nature Biotechnology, 28, 1045-1048.
39. Hawkins, R.D., Hon, G.C., Lee, L.K., Ngo, Q., Lister, R., Pelizzola, M., Edsall, L.E., Kuan, S., Luu, Y. and Klugman, S. (2010) Distinct epigenomic landscapes of pluripotent and lineage-committed human cells. Cell Stem Cell, 6, 479-491.
40. Lister, R., Pelizzola, M., Dowen, R.H., Hawkins, R.D., Hon, G., Tonti-Filippini, J., Nery, J.R., Lee, L., Ye, Z. and Ngo, Q.-M. (2009) Human DNA methylomes at base resolution show widespread epigenomic differences. Nature, 462, 315-322.
41. Lister, R., Pelizzola, M., Kida, Y.S., Hawkins, R.D., Nery, J.R., Hon, G., Antosiewicz-Bourget, J., O’Malley, R., Castanon, R. and Klugman, S. (2011) Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature, 471, 68-73.
42. Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., Wold, B.J. and Pachter, L.
(2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28, 511-515.
43. Eisenberg, E. and Levanon, E.Y. (2013) Human housekeeping genes, revisited. Trends in Genetics, 29, 569-574.
44. Dekker, J., Rippe, K., Dekker, M. and Kleckner, N. (2002) Capturing chromosome conformation. Science, 295, 1306-1311.
45. Knight, P.A. and Ruiz, D. (2013) A fast algorithm for matrix balancing. IMA J Numer Anal, 33, 1029-1047.
46. Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nusbaum, C., Myers, R.M., Brown, M. and Li, W. (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol, 9, R137.

FIGURES

Figure 1. TopDom method. (a) We define binSignal(i) as the average contact frequency between an upstream and a downstream chromatin region (U_i and D_i) in a window (of size 2w) surrounding bin i. The value of binSignal(i) is relatively high if bin i is located inside a TD (red diamond), and reaches a local minimum at a TD boundary (dotted red diamond). (b) Using a piecewise linear curve fitting algorithm, we identify turning points (blue circles) in the original binSignal(i) curve (black). Dominant local minima (red inverted triangles) can be detected using the piecewise linear curve (dotted blue line). (c) We compute p-values to assess the validity of local minima by comparing within.interactions and between.interactions using a Wilcoxon rank-sum test. Deep valleys in the original binSignal(i) curve (top layer), regions with p-values < 0.05 (second layer), and local minima in the piecewise linear curve (third layer) are generally highly consistent. These regions also coincide with boundaries on the Hi-C contact map (bottom layer).

Figure 2. Variations in the size and number of TDs. As the window size increases, the median size of TDs increases (blue) and the number of TDs decreases (red).

Figure 3. Selection of the window size.
The average intra-TD Pearson’s correlation coefficient (PCC) and weighted PCC (wPCC) were adopted as measurements of a TD’s quality. We computed the intra-TD PCC and wPCC for all TDs, with the window size varying from 3 to 15; the highest average PCC was obtained for w = 5 (red) in all four cell lines. The results for wPCC are very similar (plots not shown).

Figure 4. Illustration of topological domains identified by the DI method, HiCseg, and TopDom. TD boundaries (gray bars) are plotted on the Hi-C contact map of chromosome 10 (randomly chosen for this illustration) in the mESC and hESC cell types. TD boundaries identified by TopDom (bottom) sensitively capture boundary-like regions. Most TD boundaries identified by the DI method and HiCseg are shared by the TopDom TD boundaries.

Figure 5. Quality comparison of TDs identified by TopDom, HiCseg, and DI on four cell lines using the intra-inter difference measurement (top panel), the average Pearson’s correlation coefficient (middle panel), and the weighted Pearson’s correlation coefficient (bottom panel). TopDom achieved higher scores than HiCseg and DI in all cases except IMR90 cells under the PCC and wPCC measurements.

Figure 6. Epigenetic characteristics surrounding boundary regions. From the ChIP-seq data of five epigenetic marks in mouse ESC and cortex cells, we identified peaks (p-value < 0.05) using MACS14 (46). CTCF and the promoter marks (Polymerase II and H3K4me3) are enriched near TD boundaries in both cell types. The enhancer marks H3K4me1 and H3K27ac are slightly depleted near boundaries in both cell types.

Figure 7. Common and Unique Boundaries. (a) Illustration of common and unique boundaries. (b) Overlap of TD boundaries identified by the DI method versus TopDom (top) and HiCseg versus TopDom (bottom). Most TD boundaries identified by the DI and HiCseg methods are included in the set identified by TopDom.
(c) Overlap of TD boundaries in different cell types (hESC vs. IMR90, mESC vs. cortex). The TD sets overlap greatly, indicating the strong conservation of TDs across cell types.

Figure 8. Epigenetic characteristics of unique boundaries. We examined the epigenetic characteristics of unique boundaries, identified with respect to DI (a) and HiCseg (b). In both (a) and (b), CTCF, Polymerase II, and H3K4me3 have strong peaks near unique TD boundaries in both mESC and mouse cortex cells. The epigenetic profiles at unique boundaries are very similar to those of the boundaries shown in Figure 6.

Figure 9. Gene expression and housekeeping gene density near TD boundaries. (a) For two human cell types (hESC and IMR90), we observed that the median RPKM tends to be higher closer to TD boundaries. (b) Housekeeping genes reside significantly closer to TD boundaries than expected (top), but no such pattern exists for differentially expressed genes (bottom).

Figure 10. Overlap score of TDs under different window settings. TDs identified by two different w settings have a high overlap score (>0.9) in all four cell types.

Figure 11. PGS software workflow: building the input matrix, modeling and optimizing the structure population with A/M cycles, and basic analysis of the final structure population.

Figure 12. PGS setup. (a) GUI to help users generate configuration files. (b) An example showing the format of an acceptable contact frequency matrix file. (c) An example showing the format of an acceptable TAD file.

Figure 13. Examples of PGS output. (a) Structure population. (b) Histogram of violated restraints. The maximum number allowed for violated restraints is set in the “violation cutoff” value (see Procedure Step 1). (c) Heat map of contact probability from the final structure population. The color scheme runs from white to red for 0 to 1, respectively.
(d) Density scatter plots comparing the contact probabilities from the structure population and the input Hi-C data. The Pearson’s correlation coefficient (PCC) of the comparison is indicated. A histogram of each data set is shown at the side. (e) The average radial position along an example chromosome. PGS generates these plots for every chromosome.

TABLES

Table 1. Comparison of TDs identified by three different methods for four cell types.

Species   Cell Type   Method   # of TDs   Ave. Size of TDs
Human     hESC        TopDom   5,904      453 kb
Human     hESC        DI       3,127      855 kb
Human     hESC        HiCseg   5,240      528 kb
Human     IMR90       TopDom   4,640      580 kb
Human     IMR90       DI       2,348      1,123 kb
Human     IMR90       HiCseg   4,189      657 kb
Mouse     mESC        TopDom   4,477      531 kb
Mouse     mESC        DI       2,200      1,093 kb
Mouse     mESC        HiCseg   3,484      720 kb
Mouse     Cortex      TopDom   4,094      596 kb
Mouse     Cortex      DI       1,518      1,540 kb
Mouse     Cortex      HiCseg   3,103      809 kb

Table 2. Chi-squared test on the association between conserved TD structures and the locations of housekeeping genes. Around 90% of housekeeping genes are mapped to common domain boundaries.

                          Common Boundaries   Unique Boundaries
Housekeeping genes        3,315               388
Non-housekeeping genes    16,053              3,255

Table 3. Troubleshooting

Step 1
Problem: Java is installed, but the PGS Helper GUI does not appear.
Possible reason: X11 for graphical display is not turned on.
Solution: Log in to your HPC again with the “ssh -X” option.

Step 2
Problem: The terminal where PGS was executed is closed, so the PGS process is stopped.
Possible reason: Accidentally closed, system shut-down, or broken node.
Solution: Rerun PGS using the same command as before.

Step 2
Problem: PGS stops with [ERROR] messages: “… failed sub-workflow classname: ‘BuildTADMapFlow’ …” and “IndexError: … is out of bounds …”
Possible reason: The resolution is set incorrectly, or the input matrix format is wrong.
Solution: Fix the resolution parameter in input-config.json, and check the input file format.

Step 2
Problem: PGS stops with [ERROR] messages that contain “… failed sub-workflow classname: ‘BuildTADMapFlow’ …”, “… using non-integer …”, and “originHist = …”
Possible reason: The raw input matrix contains non-integer values.
Solution: Check and fix the matrix.

Step 2
Problem: PGS stops while running the A/M cycles.
Possible reason: Computing cluster problems.
Solution: Try requesting more than 10 GB of memory for the main PGS program.

APPENDIX

APPENDIX CHAPTER 2

TopDom: An efficient and deterministic method for identifying topological domains in genomes

Availability
http://zhoulab.usc.edu/TopDom/

Requirement
The source code is implemented as an R script, so R must be installed on your machine. If R is not installed, please visit http://www.r-project.org, download the appropriate R version from CRAN, and install it. Note that the source code does not depend on any add-on packages, so the user does not need to install additional packages.

Download and Install
Please visit http://zhoulab.usc.edu/TopDom/ and download the zipped file (TopDom.zip) from the download section. No installation is required. Just unzip the downloaded file and put the script (TopDom.R) into your working directory. You can check or change your working directory with the following commands.

Usage
> getwd() # check the current working directory
> setwd("[YOUR DESIRED WORKING DIRECTORY ADDRESS]") # set your working directory
> source("TopDom.R") # read the source code from the working directory
> TopDom(matrix.file=[matrix file address], window.size=[window.size], outBinSignal=[.binSignal file address], outDomain=[.domain file address]) # run TopDom

matrix.file
The input matrix.file is a normalized Hi-C contact matrix (n by n+3, where n is the number of bins in the chromosome) for one chromosome. Columns are separated by tabs, and the first three columns must hold the bin information: "chromosome", "from.coord", "to.coord". Header lines are not allowed. Each row describes one bin: its position information followed by its contact frequencies with all n bins.
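As a rough illustration of the matrix.file layout described above, the following Python sketch parses a tab-separated n by n+3 file and checks that the contact part is square. It is not part of TopDom (which is an R script); the function name read_contact_matrix and the three-bin toy input are assumptions made only for this example.

```python
import io


def read_contact_matrix(handle):
    """Parse a headerless, tab-separated n x (n+3) contact matrix.

    Columns 1-3 are chromosome, from.coord, to.coord; the remaining n
    columns are contact frequencies. Hypothetical helper, not TopDom code.
    """
    bins, rows = [], []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        bins.append((fields[0], int(fields[1]), int(fields[2])))
        rows.append([float(x) for x in fields[3:]])
    n = len(bins)
    # Every row must carry exactly n contact frequencies (n x n+3 overall).
    for k, row in enumerate(rows):
        if len(row) != n:
            raise ValueError(f"row {k + 1}: expected {n} frequencies, got {len(row)}")
    return bins, rows


# Toy three-bin input at 40 kb resolution (made-up values).
demo = (
    "chr10\t0\t40000\t5\t2\t0\n"
    "chr10\t40000\t80000\t2\t6\t1\n"
    "chr10\t80000\t120000\t0\t1\t4\n"
)
bins, matrix = read_contact_matrix(io.StringIO(demo))
print(len(bins), matrix[0])
```

A check like this catches the most common formatting problems (a header line, a missing column, or a non-square matrix) before the file is handed to TopDom.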
<FORMAT>
chr10 0 40000 0 0 0 0 …
chr10 40000 80000 0 0 0 0 …
chr10 80000 120000 0 0 0 0 …
chr10 120000 160000 0 0 0 0 …

window.size
The window size should be a nonnegative integer and is used to compute binSignal. The recommended size is any integer between 5 and 20.

output file arguments (outBinSignal, outDomain)
TopDom.R produces two types of output files: the '.binSignal' file and the '.domain' file. If the user wants to keep a result as a file, specify a full file name. Note that the default is NULL.

Output
The results, binSignal and domain, are returned as a list. If the user wants to keep the results as files, give full file names for the arguments outBinSignal and outDomain.

A. binSignal
The binSignal output includes the mean contact frequency, local extreme, and p-value for every bin. The first four columns hold the basic bin information taken from the matrix file: bin id (id), chromosome (chr), start coordinate (from.coord), and end coordinate (to.coord) of each bin. The last three columns hold the values computed by this program.

<id> bin id
<chr> chromosome
<from.coord> start coordinate of the bin
<to.coord> end coordinate of the bin
<local.ext> -1: local minimum. -0.5: gap region. 0: general bin. 1: local maximum.
<mean.cf> average of the contact frequencies between the lower and upper regions for bin i
<p-value> p-value computed by the Wilcoxon rank-sum test. See the reference for more details.

B. binSignal format
id chr from.coord to.coord local.ext mean.cf p-value
1 chr10 0 40000 -0.5 0 1
2 chr10 40000 80000 -0.5 0 1
3 chr10 80000 120000 -0.5 0 1
…
1005 chr10 40160000 40200000 0 16.93 3.07e-01

C. domain
Every bin is assigned to a basic building block: gap, domain, or boundary. Each row describes one building block. The first five columns hold the basic information about the block, and the 'tag' column indicates the class of the building block.
<id> identifier of the block
<chr> chromosome
<from.id> start bin index of the block
<from.coord> start coordinate of the block
<to.id> end bin index of the block
<to.coord> end coordinate of the block
<tag> category of the block. Three block types exist: "gap", "domain", and "boundary"
<size> size of the block

D. domain format
chr from.id from.coord to.id to.coord tag size
chr10 1 0 75 3000000 gap 3000000
chr10 76 3000000 112 4480000 domain 1480000
chr10 113 4480000 125 5000000 domain 520000

E. bed (from v0.0.2)
chrom chromStart chromEnd name

F. bed format
chr10 0 3000000 gap
chr10 3000000 4480000 domain
chr10 4480000 5000000 domain
chr10 5000000 5840000 domain
chr10 5840000 5920000 boundary
….

Example Run
> source("TopDom.R")
> TopDom(matrix.file="chr10.nij.HindIII.comb.40kb.matrix", window.size=10)
[1] "######################################################## "
[1] "Step 0 : File Read and Matrix Scaling.."
[1] "######################################################## "
[1] "-- Matrix Scaling...."
[1] "-- Done!"
[1] "Step 0 : Done !!"
[1] "######################################################## "
[1] "Step 1 : Generating binSignals by computing bin-level contact frequencies"
[1] "######################################################## "
[1] "Step 1 : Done !!"
[1] "######################################################## "
[1] "Step 2 : Detect TD boundaries based on binSignals"
[1] "######################################################## "
[1] "Process Regions from 76 to 552"
[1] "Process Regions from 556 to 1439"
[1] "Process Regions from 1442 to 2028"
[1] "Process Regions from 2038 to 2494"
[1] "Process Regions from 2497 to 3250"
[1] "Step 2 : Done !!"
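To make the binSignal definition from Figure 1(a) concrete, here is a minimal Python sketch of the idea; it is not the TopDom.R code. It assumes that the upstream region U_i is bin i plus the w-1 bins before it and that the downstream region D_i is the w bins after it, which is one plausible reading of "a window of size 2w surrounding bin i". TopDom itself locates boundaries with piecewise linear curve fitting and a Wilcoxon rank-sum test; the sketch swaps in a plain neighbor comparison for finding local minima. The names bin_signal and local_minima and the toy matrix are hypothetical.

```python
def bin_signal(matrix, w):
    """Average contact frequency between U_i and D_i for every bin i.

    Assumed indexing (not confirmed by the source): U_i = bins i-w+1..i,
    D_i = bins i+1..i+w, both clipped at the chromosome ends.
    """
    n = len(matrix)
    signal = []
    for i in range(n):
        upstream = range(max(0, i - w + 1), i + 1)
        downstream = range(i + 1, min(n, i + w + 1))
        values = [matrix[u][d] for u in upstream for d in downstream]
        # The last bin has no downstream region; report 0.0 there.
        signal.append(sum(values) / len(values) if values else 0.0)
    return signal


def local_minima(signal):
    """Interior bins strictly lower than both neighbors (simplified stand-in)."""
    return [
        i for i in range(1, len(signal) - 1)
        if signal[i] < signal[i - 1] and signal[i] < signal[i + 1]
    ]


# Toy contact map: two 3-bin domains, dense inside (10) and sparse between (1).
toy = [[10 if (a < 3) == (b < 3) else 1 for b in range(6)] for a in range(6)]
signal = bin_signal(toy, w=2)
print(signal)
print(local_minima(signal))
```

On the toy map the signal dips at the last bin of the first domain, which is where the two domains meet. A real analysis would still need the significance filtering described in Figure 1(c).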
Abstract
Genome-wide proximity ligation assays, such as Hi-C or TCC (i.e., genome-wide chromatin conformation capture), allow the identification of chromatin contacts at unprecedented resolution. Several studies reveal that mammalian chromosomes are composed of topological domains (TDs) at sub-megabase resolution, which appear to be conserved across cell types and to some extent even between organisms. Studying topological domains is a first and important step toward understanding the structure and functions of spatial genome organization.

In this thesis, as the first work, we propose an efficient and deterministic method, TopDom, to identify TDs, along with a set of statistical methods for evaluating their quality. TopDom is much more efficient than existing methods and depends on just one intuitive parameter, a window size. TopDom also identifies more and higher-quality TDs than the popular directional index algorithm. The TDs identified by TopDom provide strong support for cross-tissue TD conservation. Finally, our analysis reveals that the locations of housekeeping genes are closely associated with cross-tissue conserved TDs.

As the second project, we present a population-based 3D genome structure modeling pipeline (PGS) for further understanding of individual 3D genome structures in the nucleus. We implement the method described in Kalhor et al. (Nature Biotechnology, 2012) and Tjong et al. (PNAS, 2016) and package the whole workflow as a single software pipeline so that general users can easily examine the variability of individual 3D genome structures. PGS generates a population of 3D genome structures at the TD level based on a probabilistic framework that uses Hi-C data. It is fully automated and simplifies the individual processing steps and job submissions into a single straightforward execution.
Asset Metadata
Creator
Shin, Hanjun (author)
Core Title
Understanding the 3D genome organization in topological domain level
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Degree Conferral Date
2017-05
Publication Date
03/08/2017
Defense Date
01/06/2017
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
Hi-C, OAI-PMH Harvest, PGS, population of genome structure modeling, TopDom, topological associated domain
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Zhou, Xianghong Jasmine (committee chair), Alber, Frank (committee member), Nakano, Aiichiro (committee member)
Creator Email
hanjun.shin@gmail.com,shanjun@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC11259407
Unique identifier
UC11259407
Identifier
etd-ShinHanjun-5092.pdf (filename)
Legacy Identifier
etd-ShinHanjun-5092
Dmrecord
347735
Document Type
Dissertation
Rights
Shin, Hanjun
Internet Media Type
application/pdf
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu