Exploring three-dimensional organization of the genome by mapping chromatin contacts and population modeling
EXPLORING THREE-DIMENSIONAL ORGANIZATION OF THE GENOME BY MAPPING CHROMATIN CONTACTS AND POPULATION MODELING

by Reza Kalhor

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (GENETIC, MOLECULAR AND CELLULAR BIOLOGY)

August 2012

Copyright 2012 Reza Kalhor

In memory of Mojgan Radnejad for her perpetual innocence and “aasheghtarin-e-zendegaan-e-88” (the most loving of the living souls in 2009).

Acknowledgements

I owe thanks to all those who bore with me during the last five and a half years, at the school or away from it. Most of all, I want to thank my advisor Dr. Lin Chen, for his wisdom, courage and vision, for being on the other side of many enjoyable discussions, for sharing teachable memories and anecdotes, for an always interesting historical perspective, for keeping open the door to his office at all times, for being candid, for trusting in me unwaveringly, and for being just as much a partner as an advisor.

I also want to express my gratitude to the other faculty members at USC who profoundly influenced my development as a scientist. Dr. Frank Alber always provided consistent and deep analysis as part collaborator and part advisor. Frank taught me a great deal about scientific writing, on multiple occasions staying long after midnight to work on manuscripts. Dr. Oscar Aparicio supported me unconditionally, provided me with guidance as a member of my dissertation committee, and granted me access to many resources of his lab. Dr. Michael Stallcup provided me with guidance and input as a member of my dissertation committee. Dr. Andrew D. Smith allowed me access to the high-performance cluster computing (HPCC), supported and encouraged my efforts, shared his valuable insights, and provided for many interesting and exciting scientific discussions.
MaryAnn Murphy worked tirelessly and patiently to better my general writing skills and, at the same time, kindly accommodated my unpredictable schedule.

I would like to acknowledge other faculty members at USC who helped me in various capacities. Dr. Peter Laird, Dr. Matthew Michaels, Dr. Ben Berman, Dr. Qi-Long Ying, Dr. Nunzio Bottini, Dr. David Vandenberg, Dr. Norman Arnheim, Dr. John Tower, Dr. Michelle Arbeitmen, and Dr. Michael Lieber supported my project by sharing their insights or resources. Dr. Zoltan Tokes introduced me to a unique philosophical and political perspective on science and the arts. Dr. Baruch Frenkel and Dr. Wange Lu accepted me into their labs as a rotation student and introduced me to new areas of scientific research. Baruch, together with Dr. Ite Laird, equipped me and many fellow graduate students with respectable presentation skills. Ite also supported me as well as other PIBBS students in general.

My friends and colleagues in the Lin Chen lab made the last few years enjoyable ones. Dr. Nimanthi Jayathilaka, my fellow PIBBS comrade, provided support and counsel as only someone in the same boat can. Kaori Noridomi, Michael Phillips, and Katherine Daugherty, together with Nimanthi, not only provided for great times in and outside of the lab but also supported me in all aspects of life. Their being excellent cooks and bakers was also crucial for my maintaining an acceptable intake of nutrients. Melissa Hansen, Xiao Lei, and Jordan Brown brought a lot of energy through their fresh curiosity. Dr. Aidong Han, Dr. Yongqing Wu, Dr. Cosma Dellisanti, and Chen Wang helped me quickly get settled in the Lin Chen lab and supported me with their experience and expertise. I also want to thank other members of the lab, Dr. Shuxing Li, Go Watanabe, Dr. Raja Dey, Yang Li, Dr. Aki Uchida, Sonya Hansen, Kevin Cheng, Dr. Yongheng Chen, Haochen Li, Zhiwei Chen, Yiming, Meng, and Victoria for creating a collegial environment.
I appreciate the friendship of all Lin Chen lab members.

I owe a debt of gratitude to many students, postdocs, and research assistants for their collaboration and kind support. Dr. Harianto Tjong was the best collaborator anyone could ask for. I enjoyed working with him tremendously and at the same time learned a lot from him. In return, I helped him break his rib! Dr. Ashley Williams helped with confocal microscopy experiments. Joseph Aman assisted in the sequencing experiments, always going out of his way to accommodate my samples. Dr. Charles Nicolet also kindly assisted me with sequencing. Dr. Eric Schulze, Dr. Cunye Qu, and Dr. Chang Tong in the Qi-Long Ying lab helped with obtaining stem cells and derived differentiated cells. Dr. Stephanie Stanford, Dr. Edoardo Fiorillo, and Dr. Valeria Orrù helped me out a lot in the fall of 2007 when I spent a few weeks in the Bottini lab using the Nucleofector to generate a massive amount of transgenic cells. Dr. Yang-Ho (David) Chen provided me with customized versions of PerM high-throughput alignment. Dr. Jeff Bertram and Meghna Patel from the Myron Goodman lab patiently answered my many questions regarding fluorescence measurement experiments and oligo synthesis. Many members of the Xiaojiang Chen lab also assisted me with various reagents as well as the in-house X-ray machine. At UCLA, Dr. Siavash Kurdistani and Dr. Bing Li kindly generated virally transformed fibroblasts.

My great friends at PIBBS and MCB provided me with solace from the grind of the daily work. Dr. Sarmad Al-Bassam, Aysen Erdem, Dr. Sudeep Srivastava, Ian Slaymaker, Dr. Houtan Noushmehr, Dr. Narges Rashidi, and Anil Sindhurakar made my experience as a graduate student and as a resident of Los Angeles unforgettable.
Sudeep also helped me with gene expression analyses and made his software available; Ian put a great deal of time, energy, and creativity into designing cover images; and Houtan encouraged and supported me to get my hands dirty with computational biology and bioinformatics. Aysen and Sarmad contributed to my caffeine addiction by being great company. I enjoyed the company of many others as well, such as Zachary Frazier, Jared Peace, Zachary Ostrow, Tittu Thomas, Wendy Vu, Frances Tran, Shikhar Sharma, Erik Petterson, and Varuzhan Balasanyan.

I also want to acknowledge Christina Tasulis for compassionately taking care of random problems that would arise in the Ray R. Irani Hall (RRI) and patiently assisting me with various administrative and bureaucratic matters, as well as Dawn Burke, Eleni Yokas, Hayley Peltz, Catherine Atienza, Marisela Zuniga, Luigi Manna, and other staff at USC whose hard work ensured that I only had to worry about research.

Several friends and colleagues at Illumina provided for exciting and enjoyable experiences this past summer: Dr. Kevin Gunderson and Dr. Frank Steemers, Casey Turk, Steve Norberg, and Natasha Pignatelli. My most heartfelt thanks go to Dr. Mostafa Ronaghi for always being incredibly generous with his support.

I also want to acknowledge those who helped me discover my passions and interests: Dr. Hosseinali Asgharian, Dr. Mehdi Dastgheib, Dr. Sasan Amini, and Dr. Afshin Ahmadian. Without their collective constructive influence I would have certainly been on a different path in my life, and for that, I am always indebted to them.

Nima Shojaee and Marzieh Vali were an extension of my family in Los Angeles. They kept me connected to Kalle-Paacheh, something I thought I had lost forever after I came to the US.
My family in the United States, my aunt, Khaleh Forouz, and my cousins Sepehr, Simeen, Noreen, Gabe, and Allison are owed a debt of gratitude for supporting me in all aspects of life and doing so unconditionally. Aleena Mansuri was born only an hour after I defended this dissertation.

Last but not least, I want to thank my family in Iran for their love and support. My parents, Morteza and Farah, have always been the best of role models; it is their values that guided me to my current path in life. My brothers, Farid and Kian, whom I have dearly missed, are sources of great pride. For these and many other things, I appreciate them all.

Reza Kalhor
May 2012

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Attributions
Abstract

Chapter 1 Introduction
  1.1 Prelude
  1.2 Genome architecture
    1.2.1 From DNA structure to chromatin structure
    1.2.2 Local protein-DNA interactions within chromatin
    1.2.3 The 30-nanometer fiber
    1.2.4 Higher-order structure of chromatin fiber
    1.2.5 Spatial arrangement of genome
      1.2.5.1 Arrangement of whole chromosomes
      1.2.5.2 Arrangement of loci
      1.2.5.3 Functional significance of spatial arrangement
  1.3 Methods for analyzing the genome architecture
  1.4 The study at hand

Chapter 2 Tethered Conformation Capture for Genome-wide Mapping of Chromatin Contacts
  2.1 Preview
  2.2 Introduction
    2.2.1 3C
    2.2.2 5C
    2.2.3 4C
    2.2.4 Limitations of the available conformation capture methods
  2.3 Results
    2.3.1 Design principles for developing a genome-wide conformation capture method
    2.3.2 Implementation of Tethered Conformation Capture (TCC)
      2.3.2.1 Choice of cells
      2.3.2.2 Choice of restriction enzymes
      2.3.2.3 Layout of experiments
      2.3.2.4 Experimental procedure of the TCC experiments
      2.3.2.5 Experimental procedure of the non-tethered conformation capture experiments
      2.3.2.6 Massively-parallel sequencing of conformation capture libraries
      2.3.2.7 Assembling contact catalogues of each library
    2.3.3 Evaluating noise levels
  2.4 Discussion
  2.5 Materials and methods
    2.5.1 Tethered Conformation Capture (TCC)
    2.5.2 Non-tethered conformation capture
    2.5.3 Accession numbers

Chapter 3 Genome-wide Patterns of Chromatin Contacts
  3.1 Preview
  3.2 Introduction
  3.3 Results
    3.3.1 Relative amounts of intrachromosomal and interchromosomal contacts
    3.3.2 Intrachromosomal contacts
      3.3.2.1 Distance dependence of contact frequencies
      3.3.2.2 Class dependence of contact frequencies: emergence of two classes
      3.3.2.3 Functional differences between the two classes: active class and inactive class
      3.3.2.4 Active and inactive class are affected differently by centromeres
    3.3.3 Interchromosomal contacts
      3.3.3.1 Interchromosomal contacts have low frequencies
      3.3.3.2 Active and inactive classes show different propensities to interchromosomal contacts
      3.3.3.3 Indiscriminate interactions between chromosome territories
      3.3.3.4 Confirming indiscriminate interactions using 3D FISH
    3.3.4 The effects of tethering on the analysis results
  3.4 Discussion
  3.5 Materials and methods
    3.5.1 Different levels of resolution and segmentation of the genome
    3.5.2 Ligation frequency matrix
    3.5.3 Contact frequency matrix
    3.5.4 Contact profile
    3.5.5 Contact enrichment (expected values for contacts)
    3.5.6 Correlation matrix
    3.5.7 Principal component analysis (PCA)
    3.5.8 Centromere-centromere contacts associations
    3.5.9 RNA polymerase II binding
    3.5.10 Gene expression
    3.5.11 Histone modifications
    3.5.12 DNase I sensitivity
    3.5.13 3D DNA FISH

Chapter 4 Three-dimensional Genome Structures Based on Tethered Conformation Capture Data
  4.1 Preview
  4.2 Introduction
  4.3 Results
    4.3.1 Structural modeling
      4.3.1.1 Structural representation
      4.3.1.2 Deriving structural parameters from TCC data
      4.3.1.3 Structure determination
    4.3.2 Structural features of the genome population
      4.3.2.1 Structural heterogeneity
      4.3.2.2 Radial positions of chromosome territories
      4.3.2.3 Distances between non-homologous chromosomes
      4.3.2.4 Packing of the active and inactive regions
      4.3.2.5 Association between different chromosome territories
    4.3.3 Assessment of the structure population
  4.4 Discussion
  4.5 Materials and methods
    4.5.1 Structural representation
      4.5.1.1 Coarse-graining of chromosomes
      4.5.1.2 Determining the sphere volumes
      4.5.1.3 Structure population
    4.5.2 Structure determination
      4.5.2.1 Converting block contacts to sphere contact probabilities
      4.5.2.2 Assigning sphere contacts to structure models
      4.5.2.3 Optimization of structure models
    4.5.3 Territory radial positioning based on FISH

Chapter 5 Conclusions

Bibliography

Appendices
  Appendix A: Figures for All Chromosomes
  Appendix B: Side-by-side Comparison of Tethered and Non-tethered Libraries
  Appendix C: Characteristics of Modeling Spheres

List of Tables

Table 2.1 Library statistics
Table 2.2 Star activity in different libraries
Table 3.1 Probes for 3D-FISH
Table 3.2 Different levels of resolution
Table 3.3 Number of segments in each chromosome for various resolution levels (H values)
Table 3.4 The defined centromeric regions for each chromosome in F 6002
Table 4.1 The number of spheres representing each chromosome
Table C.1 Characteristics of the modeling spheres

List of Figures

Figure 2.1 Standard library preparation in conformation capture techniques
Figure 2.2 The standard detection schemes for conformation capture techniques (3C, 5C, and 4C)
Figure 2.3 Chemical structure of Iodoacetyl-PEG2-Biotin
Figure 2.4 Overview of Tethered Conformation Capture (TCC)
Figure 2.5 Phosphorothioate bonds in DNA
Figure 2.6 Expected ligation junction sequences for HindIII and MboI
Figure 2.7 Contact catalogue
Figure 2.8 Distribution of alignment distance for all pairs in each library
Figure 2.9 Self-looping
Figure 2.10 Level of noise in different libraries
Figure 3.1 Contact frequency map
Figure 3.2 Intrachromosomal contact probability as a function of distance
Figure 3.3 Correlation map and class assignment
Figure 3.4 Correlation and contact frequency as a function of class and distance
Figure 3.5 Functional activity in each class
Figure 3.6 Effect of centromere on contact profile similarity and contact frequency in large chromosomes
Figure 3.7 Active-active and inactive-inactive correlation maps
Figure 3.8 Pairwise whole-chromosome contact frequencies
Figure 3.9 Comparison of intra and interchromosomal contact frequencies
Figure 3.10 Genome-wide contact enrichment map of chromosome 2
Figure 3.11 Alignment of ICP and EIG along chromosome 2
Figure 3.12 Interchromosomal contact probability (ICP)
Figure 3.13 Clustering and association patterns of the centromeres
Figure 3.14 Interactions between chromosome 11 and chromosome 19
Figure 3.15 Interactions between high-ICP active regions on chromosome 19 and other high-ICP regions in the genome
Figure 3.16 FISH experiments confirming indiscriminate interchromosomal contacts
Figure 3.17 Organization of active and inactive regions within chromosome territories
Figure 4.1 Constrained clustering and coarse-graining of chromosomes
Figure 4.2 Objective penalty function for chromosome 11
Figure 4.3 Representation of chromatin blocks as spheres
Figure 4.4 Genome structure population of 10,000
Figure 4.5 Genome-wide contact frequency maps from HindIII-TCC catalogue, structure population, and a population of random structures
Figure 4.6 Heterogeneity between the structures in the population
Figure 4.7 Contact overlap between structures in the population
Figure 4.8 Radial positioning of chromosome territories
Figure 4.9 Contact overlap between structures in the population
Figure 4.10 Local packing of the active and inactive class in the structure population
Figure A.1 Contact frequency maps for all chromosomes
Figure A.2 Correlation maps and class assignment
Figure A.3 Active-active and inactive-inactive correlation maps
Figure A.4 ICP in active and inactive classes
Figure A.5 Alignment of ICP and EIG along each chromosome
Figure A.6 Interactions between high-ICP active segments on chromosome 11 and high-ICP active segments on each of the other autosomal chromosomes
Figure A.7 Objective penalty function for all chromosomes
Figure A.8 Radial distribution of chromosomes
Figure B.1 Contact frequency maps
Figure B.2 Intrachromosomal contact probability as a function of genomic distance
Figure B.3 Characteristics and statistics of the compared libraries
Figure B.4 Pairwise whole-chromosome contact frequencies
Figure B.5 Genome-wide contact enrichment map of chromosome 2
Figure B.6 Interchromosomal contact probability (ICP)
Figure B.7 Interactions between chromosome 11 and chromosome 19

Attributions

I performed all of the work presented in this dissertation, under the supervision of my advisor Dr. Lin Chen, with the exceptions detailed here. My collaborators, Dr. Harianto Tjong and Dr. Frank Alber, designed and performed most structure determination experiments in Chapter 4. They also provided invaluable input and insight for analyzing and interpreting the data in Chapter 3. Dr. Alber and Dr. Tjong contributed to writing the sections of the Kalhor et al., 2012 manuscript [92] that served as a foundation for writing Chapter 3 and Chapter 4. Dr. Nimanthi Jayathilaka contributed to performing the 3D FISH experiments in Chapter 3.

Abstract

The genetic information in most organisms is stored in a double-helical fiber molecule known as DNA. The total DNA fiber inside every single human nucleus, also known as the genome, is about two meters in length. This very long fiber is organized within a nucleus that is about ten micrometers in diameter. In other words, were the nucleus the size of a soccer ball, the DNA fiber inside it would extend thirty miles or fifty kilometers.
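The soccer-ball analogy can be checked with a short calculation. The sketch below is illustrative only: the two-meter fiber length and ten-micrometer nuclear diameter come from the text, while the ball diameter of 0.22 m is an assumption (a typical regulation ball) that the text does not state. Under that assumption the scaled fiber comes out at roughly 44 km (about 27 miles), the same order of magnitude as the rounded figures quoted above; a slightly larger assumed ball would bring the result closer to fifty kilometers.

```python
# Scale the genome up by the same factor that turns a nucleus into a soccer ball.
GENOME_LENGTH_M = 2.0            # from the text: about two meters of DNA per nucleus
NUCLEUS_DIAMETER_M = 10e-6       # from the text: about ten micrometers
SOCCER_BALL_DIAMETER_M = 0.22    # assumption, not stated in the text
METERS_PER_MILE = 1609.344


def scaled_fiber_length_m(fiber_m, nucleus_m, ball_m):
    """Return the fiber length after enlarging the nucleus to the ball's size."""
    scale_factor = ball_m / nucleus_m
    return fiber_m * scale_factor


length_m = scaled_fiber_length_m(
    GENOME_LENGTH_M, NUCLEUS_DIAMETER_M, SOCCER_BALL_DIAMETER_M
)
length_km = length_m / 1000.0
length_miles = length_m / METERS_PER_MILE
print(f"Scaled fiber length: {length_km:.1f} km ({length_miles:.1f} miles)")
```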
This enormous length naturally results in a structural complexity that renders understanding the spatial organization of the genome very challenging. However, similar to other biological systems, a complete functional understanding of the genome requires structural insight. In fact, it has been demonstrated that the three-dimensional organization of the genome affects all nuclear processes, including replication, transcription, and DNA repair. For example, transcription, replication, and DNA repair take place in distinct spatial compartments known as “factories.” Active genes tend to localize together and in proximity to transcriptionally active regions of the nucleus while inactive genes tend to associate with the nuclear envelope and other transcriptionally silent regions of the nucleus. Gene activity itself is partly regulated by functional elements, such as enhancers or repressors, that are not in the promoter of the gene but exert their effect over a long range, presumably by physically associating with their targets.

Despite this well-recognized importance for nuclear functions, the three-dimensional organization of the genome is poorly understood. The main limitation in exploring this genome architecture is that its natural structural complexity renders available technologies inadequate for studying it. Only a few approaches exist for obtaining structural information about the genome. The most commonly used techniques are based on either fluorescence in situ hybridization (FISH) or chromosome conformation capture (3C). These techniques are limited by throughput, as they can address the positions of only a few genomic locations, or loci, in each experiment. As a result, none of the currently available methods is capable of addressing the spatial organization of loci at a genome-wide level.

Here, we present a method for assaying the structure of the entire genome.
This method measures the relative spatial proximity of all loci by combining the 3C and massively-parallel sequencing technologies. We call this method “Tethered Conformation Capture” (TCC). The output of TCC is a catalogue of several million contacts that involve all regions of the genome and are not biased to any particular locus or loci. Furthermore, TCC catalogues feature a significantly improved signal-to-noise ratio compared to other 3C-based methods. This improvement is a direct result of replacing the diluted liquid-phase ligation strategy, which is a staple of all 3C-based methods, with a novel solid-phase ligation strategy that is drastically more effective in preventing unwanted spurious ligation events.

We also applied TCC to human lymphoblastoid cells, generating a catalogue of more than one hundred million contacts. This catalogue enabled a systematic study of the genome architecture. As the lymphoblastoid cells that we used are also a subject of studies by the Encyclopedia of DNA Elements (ENCODE) Consortium, we were able to interpret the genome-wide architectural features in the context of the available one-dimensional functional data across the genome, such as gene expression levels, binding of various transcription factors, and histone modifications.

These analyses revealed new insights into the three-dimensional organization of the human genome. We found that genomic distance is an important factor in determining the contact frequency between loci on the same chromosome: contact probability decreases rapidly with increasing distance between loci. However, genomic distance is not the only determining factor in intrachromosomal contact frequency; the other important factor is functional activity. Functionally active and inactive loci - defined by transcriptional activity, gene density, activating histone modifications, etc. - show distinct patterns of intrachromosomal contact behavior.
Inactive regions prefer to interact with their neighboring loci while active regions are more likely to form long-range interactions. Furthermore, active loci are more likely to localize at the border of their chromosome’s territory while inactive loci prefer to localize internally. Moreover, the centromere affects the spatial organization of the active and inactive loci differently: active loci on the two chromosome arms frequently interact with each other; however, contact between inactive loci on different chromosome arms is effectively blocked by the centromere.

We also extracted a detailed profile of interactions between chromosomes. This profile revealed that interactions between chromosomes are abundant overall but distributed among a very large number of low-frequency locus-locus contacts. The loci that mediate these contacts belong to a specific subgroup of the active regions. Surprisingly, the members of this subgroup not only form interchromosomal interactions with numerous loci across the genome, they do so with seemingly little dependence on their interaction partners’ identities. Moreover, only a subset of these numerous interactions is present in each cell in the population. The exact subset that exists in each cell is likely determined by the accessibility of loci to each other in that cell, but each contact exists in only a small fraction of the cell population.

We referred to this pattern of contact between different chromosomes as indiscriminate interactions and further investigated their underlying mechanisms. The organization of these interactions together with their association with the highest levels of functional activity suggested that they are mediated by the functional machineries of the nucleus, such as transcription factories. We also found that the improved signal-to-noise ratio in TCC is fundamentally important for observing these low-frequency interactions between chromosomes.
Without using the solid-phase ligation strategy, true contacts between chromosomes cannot be separated from the background noise.

While the TCC contact data embody the three-dimensional organization of the genome, they are inherently two-dimensional. Translating these data into three-dimensional structures can provide insights beyond those obtained directly from the contact information. However, calculating any three-dimensional structures of the genome is very challenging, largely due to structural heterogeneity. It has long been observed in microscopy experiments that different cells can have profoundly different genome architectures. Furthermore, our own observations regarding the contacts between chromosomes, each of which is present in only a small fraction of the cell population, reinforce this notion. Accordingly, we developed a method to translate the TCC data into a population of genome structures that are different individually but, as a whole, reproduce the input TCC contacts and their statistical distributions. Using this method, we calculated a genome structure population solely based on the lymphoblastoid cells’ TCC data. This population reproduced the hallmarks of genome architecture, as described by fluorescence microscopy. These hallmarks include the radial position of chromosome territories that others have characterized using a variety of other approaches.

We also used the genome structure population to analyze other structural features of the genome in probabilistic terms. For instance, we quantified the heterogeneity of genome architecture, the packing density of active and inactive regions, and the pairwise distances between the territories of different chromosomes. Overall, our studies represent a combined experimental and computational approach to study how the genome is organized in three dimensions.
Chapter 1 Introduction

1.1 Prelude

Life is a self-sustaining wave of order, the sole purpose of which, if a purpose can be defined for accidents, is to propagate [58]. To delay the fate by which all things are bound in this universe - disorder - life must transfer information from generation to generation. Living entities encode this information in their genetic material, which is often double-stranded DNA. In mammalian cells, this source of information is by and large stored in the nucleus, where a complicated interplay of RNAs, proteins, and the DNA itself maintains the integrity of the genetic information, transcribes it for cellular functions, and replicates it for propagation. In humans, nuclear DNA is almost three billion base pairs long and is present in two copies in most cells. Such a string of base pairs would extend about 2 meters in length if stretched into a line. Remarkably, however, this string resides within the boundaries of the nucleus, a sphere of 5 microns in radius.

Two aspects regarding the residence of a long DNA inside a much smaller nucleus are most interesting. The first is the way these molecules are compacted to fit inside the nucleus. This question has been extensively studied and formidable theories exist as to how it can be done [71]. The second is the way these molecules are organized in the three-dimensional space of the nucleus, and how this organization interacts with all that goes on in the nucleus, which is the maintenance, transcription, and propagation of the information and the associated regulatory processes.

There are about one hundred trillion cells in an average human body. Almost every single cell receives a highly compacted form of this long genomic DNA in the form of mitotic chromosomes at its inception and thereafter unpacks it within the nuclear space.
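The two-meter figure quoted earlier in this section can be reproduced with a back-of-the-envelope calculation. The sketch below is a minimal illustration, assuming a rise of 0.34 nm per base pair (the standard value for B-form DNA, which the text does not state); the base-pair count, copy number, and nuclear radius are taken from the text. It yields a contour length of about 2.04 m, roughly 200,000 times the diameter of the nucleus that contains it.

```python
# Contour length of human nuclear DNA compared with the size of the nucleus.
BASE_PAIRS_PER_COPY = 3.0e9      # from the text: almost three billion base pairs
COPIES_PER_NUCLEUS = 2           # from the text: two copies in most cells
RISE_PER_BASE_PAIR_M = 0.34e-9   # assumption: B-form DNA rise, not stated in the text
NUCLEUS_RADIUS_M = 5e-6          # from the text: sphere of 5 microns in radius


def contour_length_m(base_pairs, copies, rise_m):
    """Length of the DNA if every copy were stretched into a straight line."""
    return base_pairs * copies * rise_m


def length_to_diameter_ratio(length_m, radius_m):
    """How many nuclear diameters the stretched DNA would span."""
    return length_m / (2.0 * radius_m)


total_length_m = contour_length_m(
    BASE_PAIRS_PER_COPY, COPIES_PER_NUCLEUS, RISE_PER_BASE_PAIR_M
)
ratio = length_to_diameter_ratio(total_length_m, NUCLEUS_RADIUS_M)
print(f"Contour length: {total_length_m:.2f} m, about {ratio:,.0f} nuclear diameters")
```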
It is an open question whether a spatial organization so complex can be exactly replicated in two different cells within the practical boundaries of thermodynamic and kinetic feasibility. At the very least, it is most improbable that all cells can reproduce the same organization. Therefore, another interesting aspect of nuclear organization is how much different cells differ in their spatial arrangement of the genomic DNA in nuclear space. These various aspects of the three-dimensional organization of the genome in the nuclear space are the subjects of this dissertation.

1.2 Genome architecture

1.2.1 From DNA structure to chromatin structure

Due to its importance in life, deoxyribonucleic acid, better known as DNA, has been a subject of ever-intensifying research. Friedrich Miescher discovered DNA, or what he called nuclein, in the 1800s, and later that century, Albrecht Kossel isolated the primary nucleotides that compose DNA [57]. These discoveries were followed by Levene’s characterization of each nucleotide and the atoms it comprises in the early 1900s [108], but the exact role of the molecule in cells remained unknown for a few more decades. Once it came to light that DNA is the substance that transfers genetic information, initially shown by Avery, Macleod, and McCarty [12] and later confirmed by Hershey and Chase [86], the structure of DNA became the focal point of global scientific interest. This interest culminated in James Watson and Francis Crick proposing a double-helical structure [191], a model that was based on evidence obtained by Franklin, Wilkins, Chargaff, and others [69, 194, 75]. The double helix was only the first in a line of discoveries concerning the three-dimensional structure of DNA molecules; the structures of A-DNA and Z-DNA were characterized later [74, 188], as were the less abundant triple-stranded and quadruple-stranded structures of DNA [141, 11, 205].
Most recently, the occurrence of non-Watson-Crick base pairs, namely Hoogsteen base pairs, in canonical duplex DNA was observed [127, 100].

The focus of these structural studies is the DNA molecule itself. Inside eukaryotic cells, however, this DNA operates as a part of a protein-DNA-RNA conglomerate known as “chromatin” [197]. While it is the DNA molecule alone that stores the genetic information, the maintenance, transcription, and propagation of this information are orchestrated and regulated in the larger context of chromatin. As a matter of course, chromatin has also been subject to intense structural interest.

The total chromatin in a eukaryotic nucleus comprises a limited number of chromosomes. Each chromosome contains a single very long molecule of DNA [25], the average length of which, in human cells for instance, is more than four centimeters (1.7 inches). As a consequence, chromatin structure is much more complicated than the structure of DNA itself and can be studied at various levels. In particular, chromatin structure can be studied at the finest level of DNA and protein structures, at the coarsest level of the spatial arrangement of chromosomes, or at any level in between. As this dissertation’s focus is the chromatin structure (also referred to as the genome architecture), the following sections of this chapter review the results from previous studies at various levels, starting from the finer levels and proceeding to the coarser ones. Also reviewed are the models and theories regarding the structure of chromatin and its importance for nuclear functions.

1.2.2 Local protein-DNA interactions within chromatin

The most fundamental features of chromatin behavior are determined by local interactions between its proteins and its DNA in the context of relatively small protein:DNA complexes. These complexes are the only level of chromatin structure for which reliable atomic-resolution structures are available.
The most abundant protein components of chromatin are histones [125], which package DNA into structural units known as nucleosomes. Nucleosomes represent the most basic unit of chromatin, and their structure has been elucidated at atomic resolution [115]: they consist of 146 base pairs (bp) of superhelical DNA wrapped left-handedly 1.65 turns around an octameric core of histones.

Histones not only play a central role in packaging DNA, they also regulate its accessibility and function. The nucleosome structure clearly shows that, when DNA is wrapped around the core histones, the amino termini of the eight histone molecules that compose the core are accessible; other proteins can interact with and modify these histone “tails.” Various histone modifications, including monomethylation, dimethylation, trimethylation, acetylation, phosphorylation, ubiquitination, ADP-ribosylation, and SUMOylation, have been identified on several residues in the histone tails [169, 155, 184, 190, 13, 161, 163, 189, 126, 103, 175, 13, 31, 174]. While some specific modifications remain poorly understood, many can be characterized as either activating or repressing gene expression. It is believed that some histone modifications exert their effect by causing condensation or decondensation of the nucleosomes, resulting in, respectively, lower and higher accessibility of DNA [169, 90]. These modifications, and histone tails in general, also serve as recruiting platforms for various protein complexes. Some of the interactions between histones and other proteins have been studied in high-resolution structures [147].

Nucleosomes are not the only protein:DNA complexes in chromatin. Many other proteins, including transcription factors and other DNA-binding proteins, can be considered a part of chromatin as well [197]. These proteins are necessary for all functions of chromatin, including the proper expression of genes, the identification of damaged DNA, and the replication of the genetic material.
The structure of these proteins has been extensively studied, most often in complex with DNA [1, 34, 170, 84, 171, 202, 38, 14, 150, 164, 89].

1.2.3 The 30-nanometer fiber

Individual nucleosomes are connected to each other by a piece of linker DNA that varies in length but is typically within the range of 10 to 80 bp [71]. This linear array of nucleosomes separated by a thin filament has a “beads on a string” appearance under the microscope and is also referred to as the 10-nm fiber based on its diameter [200, 128]. At the next level of organization, this 10-nm fiber is believed to fold into a fiber of ~30 nanometers in diameter [61, 60, 71]. Experimental evidence suggests that both the 10-nm and the 30-nm forms of the fiber inhabit the nucleus at the same time [63].

While an atomic-resolution structure of the 10-nm fiber can be directly inferred from those of the nucleosome and free DNA, the exact structure of the 30-nm fiber is unknown and subject to some controversy [152, 143, 173]. Two competing models for this structure remain popular [110, 173]. One is the solenoid model (one-start helix model), in which the linker DNA that connects adjacent nucleosomes is bent between them such that they follow a superhelical path with about 6-8 nucleosomes per turn [192, 144, 110]; the other is the zig-zag model (two-start helix model), in which adjacent nucleosomes are connected by a straight linker DNA and show a zig-zag arrangement such that each nucleosome in the fiber binds to its second neighboring nucleosome [196, 199, 66, 110, 152]. These models are not mutually exclusive; it is possible that different regions of chromatin assume one or the other structure at different times. In in vitro and in silico experiments, the dominance of either the solenoid or the zig-zag form is driven by the total concentration of bivalent and monovalent salts [110, 173].
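The figures quoted above allow a rough, back-of-the-envelope estimate of how many nucleosomes the 10-nm fiber contains. The sketch below is illustrative only: it assumes 146 bp of wrapped DNA per nucleosome and a single uniform linker length (the text notes that linker length actually varies), and it takes the genome size of roughly three billion base pairs per copy from the prelude. The constant and function names are hypothetical.

```python
# Back-of-the-envelope estimate; all names here are illustrative assumptions.
CORE_DNA_BP = 146                  # DNA wrapped around one histone octamer
HAPLOID_GENOME_BP = 3_000_000_000  # approximate ("almost three billion" bp)

def nucleosome_count(genome_bp: int, linker_bp: int) -> int:
    """Approximate number of nucleosomes if every repeat had the same linker."""
    repeat_bp = CORE_DNA_BP + linker_bp  # wrapped DNA plus one linker
    return genome_bp // repeat_bp

for linker in (10, 80):  # the typical linker range quoted in the text
    # Two copies of the genome are present in most cells.
    n = nucleosome_count(2 * HAPLOID_GENOME_BP, linker)
    print(f"linker {linker} bp: roughly {n:,} nucleosomes")
```

Under these simplifying assumptions, both ends of the linker range give a count in the tens of millions per nucleus.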
1.2.4 Higher-order structure of chromatin fiber

The 30-nm fiber is regarded as the basic form of chromatin, which can fold into structures of a higher order [71]. How this 30-nm fiber folds into higher-order structures is poorly understood [198] and has only been described using theoretical polymer folding models that are based on limited experimental measurements. One such model is the random-walk/giant-loop model [149], which stipulates that the large-scale folding of the chromatin fiber during interphase may consist of flexible chromatin loops that average ~3 Mb, attached to a random-walk backbone. Random walk (random coil) refers to the conformations that a polymer can assume when little steric repulsion exists between its monomers and the polymer interacts with the solvent [146]. This organization was proposed based on some measurements of physical distances between loci as a function of their genomic distance and is normally invoked to explain the looping of genes out of their chromosome territories (see below) [78, 29].

Another such model is the Random Loop model [121], which is similar to the random-walk/giant-loop model but also takes into account that chromosomes are limited by the nuclear space. This model assumes a self-avoiding random-walk folding of the chromatin fiber with probabilistic long-distance interactions between monomers.

Yet another model is the crumpled globule (fractal globule) model [81, 113], which stipulates that the chromatin fiber resembles a non-equilibrium polymeric state that forms after the collapse of a polymer from a random-walk (random-coil) state but before it reaches the equilibrium globule state¹ [81]. Important features of this conformation are a lack of knots and the spatial segregation of the polymer into subdomains, which may explain the appearance of segregated structural domains in microscopy experiments [113].
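The random-walk picture invoked by these models can be made concrete with a short simulation. The sketch below does not implement any of the cited models; it is a minimal unconstrained walk on a cubic lattice (an assumption chosen for simplicity), with hypothetical function names. It shows the defining property of an ideal random walk: the mean squared end-to-end distance grows linearly with the number of steps. The models described above add loops, confinement, or self-avoidance on top of this baseline.

```python
import random

# The six unit steps available on a cubic lattice.
STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def squared_end_to_end(n_steps: int, rng: random.Random) -> int:
    """Squared distance between the two ends of one unconstrained walk."""
    x = y = z = 0
    for _ in range(n_steps):
        dx, dy, dz = rng.choice(STEPS)
        x, y, z = x + dx, y + dy, z + dz
    return x * x + y * y + z * z

def mean_squared_distance(n_steps: int, n_walks: int = 2000, seed: int = 0) -> float:
    """Average squared end-to-end distance over many independent walks."""
    rng = random.Random(seed)
    total = sum(squared_end_to_end(n_steps, rng) for _ in range(n_walks))
    return total / n_walks

for n in (25, 100, 400):
    # For an ideal walk with unit steps, the expected value equals n.
    print(n, round(mean_squared_distance(n), 1))
```

In physical terms, this scaling means that the spatial distance between two points on an ideal chain grows as the square root of their separation along the chain, which is why measurements of physical distance as a function of genomic distance can be used to compare folding models.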
While each of these models can explain certain behaviors of the chromatin fiber, none is regarded as a generally predictive model of chromosome folding. These shortcomings are likely due to the simplistic nature of these models. For example, they assume uniform properties along the chromatin fiber. This assumption is clearly inadequate, as even the distribution of nucleosomes along genomic DNA is quite variable [93]. Additionally, not all regions of the genome are expected to fold into the 30-nm fiber. Moreover, the chromatin fibers of transcriptionally active and inactive regions are expected to have very different physical properties, as they involve different protein types and histone modifications.

The folding geometry of chromatin beyond the 30-nm fiber is even more unclear than that of the 30-nm fiber [123, 18]. What is clear is that the packaging of DNA around histones and the higher-order assemblies of nucleosomes provide the compaction ratios necessary to fit the entire genomic DNA inside the nucleus [71]. Furthermore, irrespective of the exact geometry, the folding of the chromatin fiber appears to be important for genomic functions [123].

¹ A confined polymer with attraction between its monomers, or repulsion from its solvent, collapses into an equilibrium globule state [146].

Despite the ambiguity of the exact folding geometries, local chromatin loops appear to be a ubiquitous structural element of the fiber [185, 76, 123]. Since loops bring distantly located sequences into spatial proximity, they are interesting organizational features of chromatin, possibly allowing for regulatory communication between distant locations. It is difficult to confirm the existence of loops and their physiological importance because they often cannot be detected under native conditions or visualized inside cells. Nevertheless, spatial proximity has been shown to play a role in several gene-regulatory events that suggest the involvement of loops [76, 123].
These studies argue that local chromatin loops are important in the positive and negative regulation of genes. The most prominent example is that of the β-globin gene: the β-globin enhancer, which is located ~50 kb upstream of its promoter, physically associates with the main body of the gene concomitantly with activation [193]. A similar example involves gene activation during the differentiation of erythroid cells: loop formation occurs prior to activation, when erythroid progenitor cells become lineage committed [132]. These looping events appear to bring upstream locus control regions (LCRs) together with the promoter or the gene itself, forming a “hub” that presumably creates a microenvironment of proper transcriptional activity by concentrating the relevant transcription factors [123].

Some studies suggest that chromatin looping may be generally important for proper gene expression. Analyzing the topology of several genes in budding yeast and humans, these studies suggest that active loci fold back onto themselves to form a gene loop that brings their 3′ end into physical proximity with their 5′ beginning [131, 10, 120]. Formation of such gene loops is consistent with the view that the transcription machinery physically interacts with the 3′-end processing and RNA processing factors [19]. The prevalence of gene looping and its exact correlation with transcriptional activity are not yet clear; however, it would provide an effective way to coordinate transcription and RNA processing [123].

1.2.5 Spatial arrangement of genome

The difficulties in studying the 30-nm and higher-order foldings of the chromatin fiber are a consequence of technical limitations. The high-resolution methods, such as crystallography and NMR, that were used to study nucleosomes and other protein:DNA complexes are impractical for larger complexes that embody the folding geometry of the fiber at higher levels.
Electron microscopy methods often involve extraction of chromatin material from the nucleus and harsh treatments that raise the question of whether the results are representative of the in vivo chromatin conformations. Chromosome conformation capture (3C)-based methods do not provide adequate structural detail for evaluating the folding geometry, and neither does fluorescence in situ hybridization (FISH), which is limited by the diffraction limit of light². The current resolution of FISH-based methods is ~200 nm. While this resolution is wildly inadequate for studying the folding geometry of the chromatin fiber, it can properly address the spatial arrangement of chromosomes and their parts relative to each other within the average nucleus, which has a diameter about fifty times this resolution limit. Consequently, together with other methods, FISH and its derivatives have been used extensively to study the genome architecture at coarser levels that involve the spatial arrangement of chromosomes and loci. These arrangements are the most global level of genome organization within the three-dimensional space of the nucleus.

² The various technologies that are used in studying the genome architecture are described later in this chapter.

The spatial arrangement of genomes is erased during mitosis and reset in every newly divided cell as mitotic chromosomes decondense. As indicated by the long-established observation that transcriptionally active and inactive regions segregate into physically separate domains of euchromatin and heterochromatin, this newly established genome arrangement in each cell is not random [123]. The concept of non-random positioning within the nucleus has been significantly augmented by the FISH-based mapping of many chromosomal regions in years past. Such mappings showed that many loci have preferential locations within the nucleus that can change during differentiation and development [133, 134].
Some aspects regarding the spatial arrangement of the genome are discussed below.

1.2.5.1 Arrangement of whole chromosomes

The most general case of spatial arrangement and positioning within the nucleus is that of whole chromosomes. It turned out that after mitosis, chromosomes maintain their spatial separation and occupy distinct territories in the nucleus [51, 53]. Such territories were proposed by Rabl and Boveri in the late nineteenth and early twentieth centuries [138, 24] but were first observed many decades later, in the 1980s, in an elegant study by Thomas Cremer and colleagues [52]. Cremer and colleagues microirradiated small portions of nuclei with a UV laser, labelled the sites of damage with radioactive nucleotides, and visualized them in the following metaphase. Instead of obtaining a scattered radioactive signal across many chromosomes, which would have been expected if chromosomes were completely intertwined, they observed discrete labeling in parts of only one or a few chromosomes, indicating that chromosomes are compartmentalized during interphase.

Direct visualization of chromosome territories (CTs), however, became possible only after the development of whole-chromosome paint probes and FISH [55, 112]. To also observe the CTs over the course of the cell cycle, a different approach was introduced in which nuclear DNA is labelled with fluorochrome-labeled nucleotides in living cells during the S phase. Labelled cells are followed through several cell divisions, which lead to the segregation of labelled and non-labelled sister chromatids into daughter nuclei during the second and subsequent mitotic events. This labeling/segregation (L/S) approach yields nuclei with distinct, replication-labelled patches that reflect the territories of single chromatids [72, 206, 119, 51].

These approaches were used to study the spatial arrangement of chromosomes.
It was discovered that no specific arrangement of chromosomes exists in the nucleus: a territory can have different neighbors in different cells [51]; however, each chromosome territory (CT) has a distribution along the radial axis of the nucleus and a preferential position [50, 26]. Small and gene-rich chromosomes prefer the interior positions of the nucleus, while others prefer the more peripheral positions [26]. A famous and striking example of this phenomenon is that of human chromosomes 18 and 19. While both chromosomes have a similar size, the gene-poor chromosome 18 territories are typically found at the nuclear periphery, whereas the gene-rich chromosome 19 territories prefer the nuclear interior [56]. Interestingly, these positions are evolutionarily conserved, as the genetic material corresponding to chromosomes 18 and 19 showed a similar pattern of radial positioning in other primates [176]. These conserved radial positions were observed in cells of the same tissue. Some chromosomes show different radial distributions in different tissue types [134].

1.2.5.2 Arrangement of loci

Similar to chromosomes, the position of chromosomal regions and even single genes along the radial axis of the nucleus is not random. For example, in B cell progenitors the IgH locus is preferentially located near the nuclear periphery, where it is silent, but in B cell precursors it prefers the nuclear interior, where it is poised for activity [101]. Other examples include the CD4 locus, which repositions from the nuclear periphery to the nuclear interior during T cell differentiation [98], and the Hoxb1 and Hoxb9 genes, which reposition to the interior upon their induction with retinoic acid [33]. It is important to note that there is no general correlation between activity and the preferred radial position of genes [123]. This
This 12 is suggested by the fact that the two homologous alleles are often positioned differently in cells even as they display similar functional activity [145], and the fact that for most genes, no change in radial positioning occurs upon a change in gene activity [123]. The radial positioning of loci is not the only important structural feature for gene expres- sion; localization within CTs has also been implicated in gene function. A general model has been proposed in which active genes are located at the periphery of CTs and inactive regions are located in the interior. In such an arrangement, active genes would be more accessible to transcription and splicing factors, which are more concentrated between CTs, while inactive regions would have limited accessibility to the transcription machineries [54]. While this model is supported by the observation that coding sequences are often located at the periphery and noncoding sequences at the interior of CTs [104], as well as the looping-out of the active gene clusters (see below), it is not absolute because coding sequences can be seen at the interior of territories even when transcriptionally active and noncoding sequences can be seen at the periphery [153]. Another important aspect of locus localization with respect to territory which contributes to genome regulation and organization is the formation giant or large loops (not to be con- fused with local loops, see above) [187, 42]. Large loops are giant protrusions of several megabases from a CT into other territories. These structures are believed to separate chro- mosome regions from their natural location within the territory and place them in specific microenvironments within the nucleus [42]. As such loops often involve chromosome re- gions with a high gene density, such as gene-clusters, that are being transcribed at high rates, it is suggested that transcription plays a central role in their formation [42]. 
As ev- idence, these loops collapse back into the territory when transcription is stopped by an inhibitor [42]. Prominent examples of these large loops are the human major histocompati- 13 bility complex II [187], the mouse epidermal differentiation complex [195], the mouse Hox cluster [33], and human chromosome 11p15.5 which has a high gene-density [117]. The giant loops are not the only instance of physical association between loci of dif- ferent chromosomes. Electron microscopy and high-resolution FISH methods have re- vealed extensive intermingling between chromosomes [28]. In fact, one study estimated that in20% of nuclear volume sequences from more than one chromosome are present [28]. 1.2.5.3 Functional significant of spatial arrangement In general, the functional consequence of the positioning of loci relative to each other and associations between different chromosomes are poorly understood [123]. One exception is the spatial clustering of ribosomal genes which associate with each other as a part of the nucleolus [172]. This clustering is important for proper regulation of the rRNA genes. Similar clustering has also been observed for tRNA genes in budding yeast [179, 68]. It is also suggested that some non-rRNA genes in mammalian erythroid cells can also cluster to coordinate gene expression: multiple genes in a 30 Mb region of a chromosome were observed to associate with shared sites of transcription [130]. However, the functional importance of this association is still unclear. Other studies suggest that direct interactions between different chromosomes (interchro- mosomal interactions) play regulatory functions. This paradigm, called “trans-regulation via interchromosomal communication,” was established by analysis of the Ifng and TH2 loci in mouse [165]. In the naive helper T-cells, the TH2 LCR on chromosome 11 was ob- served to physically associate with the Ifng locus on chromosome 10. 
Upon differentiation of naive T cells, these two loci separate, and Ifng becomes transcriptionally active. Another example was observed in mouse sensory neurons, where a single odor receptor from a very large repertoire of receptors on multiple chromosomes is selected for expression, apparently through physical interaction with a specific enhancer element [114]. However, these observations are subject to a caveat: associations are generally only observed for single alleles and not in all cells of a population [123].

In conclusion, the spatial arrangement of chromosomes and their regions within the nucleus appears to play important roles in nuclear function. However, the details of these arrangements, their underlying mechanisms, and their functional relevance are poorly understood. Moreover, most models and hypotheses are based on anecdotal evidence corresponding to a few examples, if not only one. A genome-wide picture is lacking.

1.3 Methods for analyzing the genome architecture

The methods that are applicable to studying the genome architecture at levels beyond small protein:DNA complexes can be divided into two categories: the visual methods and the molecular methods.

The visual methods entail observing the genome using microscopy. These methods can further be divided into two major subcategories based on the underlying microscopic principle. The first comprises electron microscopy (EM) methods, which can image the whole nucleus or a part, a cross-section, or an extract of it [85]. EM methods have a very high resolution, on the order of an angstrom [70]. Accordingly, the folding geometry of the chromatin fiber is often studied using these methods [198, 18, 178, 73]. While standard electron microscopy methods cannot identify specific macromolecular structures or gene loci, the immuno-electron microscopy (immuno-EM) variations can be specific. In immuno-EM, antibodies specific to a target of interest are labelled with gold particles.
The specimen is then labelled with the antibody and subjected to transmission electron microscopy (TEM), in which gold-labelled targets appear as distinctly dark spots since they completely block the transmission of electrons [116].

Despite their high resolution, EM-based methods have important limitations in studying the genome architecture. For one, they involve relatively harsh methods of fixing the samples that can alter the native structures. Also, their general lack of specificity renders them unsuitable for studying the spatial organization of specific regions or structures. Even in immuno-EM, where the use of antibodies allows for the detection of specific targets, the low throughput of the method makes it challenging to analyze more than one target at a time.

The second subcategory of visual methods is based on fluorescence microscopy. Several fluorescence microscopy-based techniques have been applied to studying the genome architecture. The less prominent variations involve imaging of chromosomes or loci in live cells, such as labeling/segregation (L/S, see above) [72, 206, 119, 51] and the GFP-fused lac repressor/lac operator array system [182, 17]. The most frequently used variation is fluorescence in situ hybridization (FISH) on interphase cells [16, 106]. This method can locate the position of a locus in the 3D lattice of the nucleus.

In a commonly used variation of FISH, cells are first fixed to preserve their 3D organization and that of the genome throughout the experiment. They are then made permeable by a series of chemical treatments. The permeable cells are treated with a hapten-labeled DNA probe that corresponds to the target sequence of interest. This probe can correspond to a single locus or an entire chromosome. Under the experimental conditions, the probe anneals to the target locus in the nucleus, labeling it with the hapten. The hapten is then bound by a fluorescence-labeled antibody.
Finally, the position of the fluorescence label, and hence the locus, is detected using a confocal fluorescence microscope [16].

Due to its locus-specific nature, FISH has been the technique of choice for analyzing the relative positioning of chromosomes and loci (see above). Its advantages over EM-based methods include being less laborious and enabling simultaneous analysis of a few targets. Simultaneous detection of three to five targets can be achieved by using a different hapten in each target’s probe and a different fluorophore for the detection of each hapten [16]. On the other hand, current variations of FISH are limited by the resolution of visible light, which is ~200 nanometers. In addition, a genome-wide analysis, or even the simultaneous analysis of more than a few targets, is not possible using current variations.

The molecular methods, the second category of methods for studying the genome architecture, are all based on chromosome conformation capture (3C) [64]. The underlying basis of these methods is to preserve the spatial proximity between loci using crosslinking and to quantify the crosslinking frequency between loci of interest as a measure of their spatial proximity. The detection methods are not microscopy based; rather, they rely on molecular methods such as quantitative PCR. The details of 3C and its variations are discussed in the introduction to Chapter 2.

The advantages of 3C-based methods include specificity, improved resolution compared to FISH, and much improved throughput compared to EM and FISH. The main limitations of current 3C-based methods are low signal-to-noise ratios and biased results based on the selection of targets (see Chapter 2). Another limitation is that their throughput is far from what is needed to analyze the entire genome³.

³ An unbiased and genome-wide version of 3C, called Hi-C [113], was developed in parallel with this study [35, 36]. Hi-C and its results are discussed in more detail in Chapters 2 and 3.
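The principle shared by 3C-based methods, that the crosslinking frequency between two loci serves as a proxy for their spatial proximity, can be illustrated with a toy tally. The sketch below assumes that each detected ligation product has already been reduced to a pair of locus labels; the labels, the example list, and the function name are hypothetical and are not data or code from this study.

```python
from collections import Counter
from typing import Iterable

def contact_frequencies(products: Iterable[tuple[str, str]]) -> Counter:
    """Tally ligation products per unordered pair of loci."""
    counts: Counter = Counter()
    for a, b in products:
        # A-B and B-A describe the same contact, so store the pair sorted.
        counts[tuple(sorted((a, b)))] += 1
    return counts

# Hypothetical ligation products detected in one experiment.
library = [
    ("locusA", "locusB"),
    ("locusB", "locusA"),
    ("locusA", "locusC"),
    ("locusA", "locusB"),
]

for pair, n in contact_frequencies(library).most_common():
    print(pair, n)
```

Pairs with higher counts are interpreted as being in spatial proximity in a larger fraction of the cell population; any real analysis would also need to account for the noise and bias limitations noted above.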
1.4 The study at hand

While the genome architecture is believed to play a central role in all nuclear functions, it is poorly understood, especially at the higher-order and spatial-arrangement levels. Specifically, no genome-wide understanding of the genome architecture exists, and most of the models and theories are based on limited observations or anecdotal evidence. Our limited understanding of the genome in 3D is not for a lack of interest; rather, it is due to the limitations of the available techniques. In this study we aim at addressing these technological limitations.

We introduce the Tethered Conformation Capture (TCC) technology for a genome-wide inspection of chromosome structures. This method combines 3C with massively parallel sequencing, solid-phase chemistry, and novel genetic engineering approaches to map chromatin contacts comprehensively and in an unbiased fashion, thereby addressing the limitations of current 3C-based methods. We demonstrate that not only is TCC capable of a genome-wide and unbiased analysis, it also leads to substantially improved signal-to-noise ratios in 3C assays. We also describe efficient strategies for processing these genome-wide conformation capture data into contact catalogues that are accurate representations of contacts in vivo. These topics are mainly covered in Chapter 2.

We apply TCC to human lymphoblastoid cells to obtain an extensive catalogue of genome-wide contacts. Based on these contacts, we provide a comprehensive three-dimensional view of the genome and reveal new insights about the spatial arrangement of loci in the nucleus and the interplay between chromatin structure and function. For example, we show that functionally active regions of the genome, such as those with elevated gene expression, form numerous long-range interactions within their chromosome and profusely interact with other chromosomes. This behavior makes these regions more accessible to transcription and replication factories.
By contrast, functionally inactive regions do not form long-range interactions. We also describe a previously unappreciated level of structural heterogeneity among cells: numerous distinct chromatin contacts are observed in a cell population; however, different cells harbor different subsets of all possible contacts. These topics are mainly covered in Chapter 3.

While the genome-wide contact data obtained from TCC embody the three-dimensional structure of the genome, they are inherently two-dimensional. Translating such data to three-dimensional structures is a challenging task due to the complexities of the genome architecture and the dataset. Here we present a method to translate the genome-wide contact data into a population of three-dimensional genome structures that is consistent with the TCC data. In order to do so, we define a structural representation strategy for the genome that allows calculation of genome structures. We also formulate approaches to translate the TCC contact data into structural parameters that can be directly used to calculate 3D structures. Furthermore, we introduce methods of structure calculation based on the aforementioned representation and data translation strategies. We analyze the structure population further, not only to confirm the validity of our approach but also to obtain insight into the statistical properties of the genome architecture. These topics are mainly covered in Chapter 4.

Chapter 2

Tethered Conformation Capture for Genome-wide Mapping of Chromatin Contacts

2.1 Preview

After a detailed introduction to the available chromosome conformation capture methods, this chapter presents Tethered Conformation Capture (TCC), a method for unbiased and genome-wide mapping of chromatin contacts. The design principles and experimental strategies are described in detail, the output data structure is characterized, data processing is discussed, and finally, the level of noise is evaluated.
2.2 Introduction

To study the spatial organization of chromatin, conformation capture methods are commonly used. The basic idea behind these methods is that the spatial proximity between loci in the nucleus can be inferred from their crosslinking frequency. In contrast to the imaging approaches (see Chapter 1), these methods rely on "molecular" techniques, such as formaldehyde crosslinking, ligation, and PCR [64, 204, 157]. As a consequence, they do not require harsh conditions or cumbersome techniques to preserve the three-dimensional structure of the cell throughout the experiments. They also have the potential to analyze small regions of the genome and resolve the spatial proximity of loci that are separated by short genomic distances.

A typical conformation capture experiment can be divided into two parts: library preparation, in which physical contacts between loci are captured in a library of DNA molecules, and quantification, in which the contact frequencies are measured. For the library preparation part (Figure 2.1), a population of cells is treated with formaldehyde, which crosslinks proteins to proteins and to DNA [88, 162]. This treatment results in the formation of covalent bonds between regions of the genome that are in spatial proximity. The formaldehyde-treated cells are then lysed, and DNA is digested with a restriction enzyme, a process which results in the fragmentation of chromatin into protein-DNA complexes. Next, these complexes are subjected to ligation at a very low concentration. In this condition, intramolecular ligations between crosslinked DNA fragments are favored over intermolecular ligations between random DNA fragments. Finally, the crosslinking is reversed, and the DNA is purified to obtain a conformation capture library.
The abundance of the ligation product between any two DNA fragments in this library is correlated with their crosslinking frequency, which, in turn, represents their spatial proximity inside the nucleus.

Figure 2.1: Standard library preparation in conformation capture techniques. Cells are treated with formaldehyde, which crosslinks proteins to proteins and to DNA. The formaldehyde-treated cells are then lysed, and DNA is digested with a restriction enzyme, resulting in the fragmentation of chromatin into protein-DNA complexes. These complexes are subjected to ligation at a very low concentration to promote intramolecular ligation between crosslinked DNA fragments. The crosslinking is then reversed, and the DNA is purified to obtain a conformation capture library. In this library, the abundance of the ligation product between any two DNA fragments represents their crosslinking frequency, and thus, their contact frequency in the population of cells analyzed.

This library preparation process is similar in all conformation capture methods; what distinguishes them is the quantification process (Figure 2.2). This process measures the abundance of the ligation product(s) of interest. To perform this measurement, various approaches have been used by the different conformation capture methods. Based on the approach that is employed, conformation capture methods differ in scope and in the number of interactions that they are able to quantify. The quantification steps in the more prominent conformation capture methods are described below.

2.2.1 3C

Chromosome Conformation Capture (3C) [64, 181, 166, 83] is the most basic conformation capture method. To quantify the ligation product of interest, this method uses quantitative PCR (qPCR). The PCR primers anneal to the ligation partners such that only their ligation product, not any other molecule in the library, can be amplified and quantified in the qPCR reaction (Figure 2.2 A).
2.2.2 5C

Chromosome Conformation Capture Carbon Copy (5C) is a more advanced conformation capture method which enables analysis of spatial proximity for multiple pairs of DNA fragments [67, 183]. The library preparation process in 5C is identical to that of 3C, but the quantification process is different (Figure 2.2 B). In 5C quantification, the ligation products of interest are first "carbon-copied" and then amplified through multiplexed ligation-mediated PCR. Only the amplified portion of the library is then quantified using either microarrays or a high-throughput sequencing platform. 5C is, in essence, a multiplexed 3C; it carries out 3C for many pairs of loci in parallel. Therefore, it enables simultaneous analysis of multiple interactions in a single experiment.

Figure 2.2: The standard detection schemes for conformation capture techniques (3C, 5C, and 4C). (A) 3C measures the contact frequency between two loci of interest. After the preparation of a conformation capture library (Figure 2.1), the ligation frequency between two loci of interest, here A and B, is quantified using PCR. The primers for this quantitative PCR reaction are designed based on the sequences of both loci. (B) 5C measures the contact frequency between multiple pairs of loci. After preparation, the library is hybridized to 5C oligonucleotides in a multiplex setting. Each 5C oligonucleotide has a universal adaptor sequence (shown in black) for later amplification and a target-specific sequence (shown in orange or blue) for hybridizing to the ligation junctions that involve its target. Once two 5C oligonucleotides are hybridized side-by-side to their targets, they can be ligated, effectively producing a "carbon copy" of the ligation event. The universal primer sequences are later used to amplify all copied ligation junctions in parallel, which can later be quantified using custom-designed microarrays or high-throughput sequencing.
(C) 4C measures the contact frequency between one locus of interest or bait (locus A in this panel) and its unknown contact partners or targets throughout the genome (loci X₁ through Xₙ). A few different 4C strategies exist, but the one most commonly used is depicted here. In this 4C strategy, the library is digested with a second restriction enzyme which cuts the chromatin more frequently than the one used in library preparation. The digested DNA fragments are then circularized, resulting in intermediates where a target is flanked by the bait. Finally, circular PCR is used to amplify the targets using a pair of primers that hybridize to the bait sequence. Amplified targets are later identified using high-throughput sequencing or tiling microarrays.

2.2.3 4C

4C is a more sophisticated conformation capture method [156, 204, 203, 80]. This method can identify all the contact partners (targets) of one genomic fragment (bait) and quantify the relative interaction frequencies. In contrast to 3C and 5C, in which both partners of an interaction have to be pre-determined, in 4C all unknown contact partners of one pre-determined DNA fragment can be identified. Once again, the library preparation process in 4C is similar to that of 3C and 5C, but the quantification process is more sophisticated (Figure 2.2 C). The library is first treated with a restriction enzyme that cuts DNA more frequently than the enzyme initially used to digest the crosslinked chromatin. The resulting DNA fragments are then ligated to themselves. This treatment produces circular DNA molecules in which the bait is connected from both its ends to a target. All targets are then amplified in an inverse PCR using primers that anneal to the bait and flank the target. Finally, the amplified targets are identified using tiling microarrays or high-throughput sequencing.
In this fashion, 4C can identify the genome-wide contact partners of a single fragment of interest.

2.2.4 Limitations of the available conformation capture methods

These conformation capture methods are limited in many ways. First, they are biased towards the fragments or interactions of interest to the researcher. In other words, one can only measure the interactions for which one is actively searching. Given this bias, knowledge of the most important interactions is essential to designing an appropriate experiment. In most studies, however, these interactions are unknown due to the lack of prior knowledge of the three-dimensional landscape, and thus, a biased analysis is inevitable. By overlooking the most consequential interactions, such an analysis is likely to result in a misinterpretation of the experimental results. Second, the scope of these methods is limited to only one pair of loci (3C), a few pairs of loci (5C), or the contact partners of a single locus (4C). A dramatically more extensive scope, one that involves the quantification of all existent crosslinking events, is required for a genome-wide understanding of chromatin architecture. Finally, these methods are encumbered by low signal-to-noise ratios, which are largely the result of intermolecular ligations between non-crosslinked DNA fragments. This problem, which persists despite ligating at low concentrations, reduces the sensitivity of the assay. Reduced sensitivity can result in a failure to identify important interactions.

To address these limitations, this study combines conformation capture with massively parallel sequencing and solid-phase chemistry. The resulting approach, Tethered Conformation Capture (TCC), is a method that not only enables an unbiased and genome-wide analysis of chromatin interactions, but is also highly sensitive. This chapter describes the design principles of TCC, its implementation, and the basic characteristics of its output.
2.3 Results

2.3.1 Design principles for developing a genome-wide conformation capture method

As an indicator of the spatial proximity between two loci, conformation capture experiments measure their in vivo crosslinking frequency. The currently available methods can only measure a small subset of all the crosslinking events. They also have to exploit the sequence of their targets for the quantification process. These limitations render the current methods incapable of performing a genome-wide analysis that is also unbiased. A genome-wide experiment requires the quantification of all crosslinking events. For the experiment to also be unbiased, this quantification must be carried out without any reliance on prior knowledge of the fragments' sequences. To address these requirements, massively parallel sequencing can be used to sequence a random and sufficiently large sample of all the DNA molecules in a conformation capture library, thereby identifying all the ligation events in the sample without prior knowledge of their sequences.

Such sequencing of a conformation capture library, nevertheless, has its own challenges. The biggest hurdle is that only a small fraction of a library's DNA molecules represent actual ligation events. Most molecules might be DNA fragments that failed to ligate at all during the library preparation process. Such a failure may happen for various reasons. For one, some fragments may not be crosslinked to any other fragments. Even for those that are, ligation cannot take place without a favorable distance and orientation between the DNA ends. The DNA ends should also be accessible and adequately exposed for any ligation to take place. Additionally, the nonspecific endonuclease activity of the restriction enzyme (i.e., the star activity) can generate DNA ends that are incompatible and therefore incapable of ligating to others. Ligation failure can also be caused by the spontaneous dephosphorylation of the 5′ ends.
For these reasons, a simple application of massively parallel sequencing to a conformation capture library (e.g., a 3C library) may produce mostly non-informative reads. For an informative and cost-effective sequencing of a library, the ligated DNA molecules must be purified from the non-ligated ones.

In order to purify the ligated DNA molecules, we devised the following strategy. Before ligation, the digestion overhangs are filled in by a specific combination of nucleotide analogues: an exonuclease-resistant analogue is inserted 5′ to a biotinylated analogue such that only the biotinylated residue would not be protected from exonuclease action. After ligation, the DNA is treated with an exonuclease to remove the terminal residues downstream of the exonuclease-resistant nucleotide. As a result of this treatment, only the ligation junctions, where the biotinylated residues are no longer terminal, retain the biotinylated residue. The affinity of biotin to streptavidin is then used to purify the ligated molecules from the non-ligated ones.

Such a modification of the ligation junctions requires the use of additional reactions in the library preparation process. Some of these involve reagents, enzymes, or byproducts that can interfere with the subsequent reactions. Therefore, for the library preparation process to be effective, it is important to remove the unwanted substances before each reaction. Other molecular engineering methods that involve DNA modification typically purify the DNA between successive reactions with alcohol precipitation or affinity columns. However, conformation capture experiments involve crosslinked chromatin, which, unlike free DNA, is difficult to purify. Therefore, in this study, crosslinked protein-DNA complexes are tethered to a surface, and most reactions take place on this solid-phase support.
As the surface can be easily washed, this solid-phase conformation capture strategy facilitates the removal of enzymes, unused reagents, and byproducts when necessary so that successive reactions can be carried out efficiently.

The design principles that were discussed so far address the specific challenges of devising an unbiased and genome-wide conformation capture method. It is also important to address the low signal-to-noise ratio, which is a general challenge to all the conformation capture methods. The main source of noise is intermolecular ligations between non-crosslinked DNA fragments. Because these ligations do not represent in vivo crosslinking events, they result in false-positive contacts between random DNA fragments. While a solitary intermolecular ligation may not be detrimental to the experiment, a large number would raise the overall background noise level and, hence, reduce the sensitivity of the assay. This reduction in sensitivity is not the only problem when it comes to a method that relies on massively parallel sequencing. These sequencing analyses are very expensive and have a limited output. For each false-positive contact that is sequenced, a true-positive one is missed. Therefore, if an experiment has a high level of noise, more sequencing has to be carried out to compensate for the lost information.

To reduce the noise-generating intermolecular ligations, all the available conformation capture methods perform the ligation step of the library preparation process at a low concentration of protein-DNA complexes. Nevertheless, low concentrations cannot effectively prevent these ligations because, during the time course of the reaction, different molecules can randomly encounter each other through diffusion and convection. It is by eliminating this free diffusion that we seek to reduce intermolecular ligations.
This task can be accomplished by tethering the crosslinked complexes to a solid-phase support; indeed, the solid-phase conformation capture strategy, in addition to facilitating the reaction clean-up, can also effectively eliminate the free diffusion of crosslinked protein-DNA complexes. Theoretically, immobilization of molecules in close proximity on a surface can increase intermolecular ligations. To eliminate the possibility of this adverse effect, we use a large surface area to reduce the likelihood of different crosslinked complexes being immobilized at distances that allow ligation between them.

In this study, we combined the strategies described above to devise a conformation capture method that is not only unbiased and genome-wide, but also highly sensitive. As the main feature of this method is the tethering of the crosslinked complexes to a surface, we refer to it as "Tethered Conformation Capture" (TCC). The following section describes, in detail, how TCC was implemented.

2.3.2 Implementation of Tethered Conformation Capture (TCC)

2.3.2.1 Choice of cells

All experiments that are presented here used the GM12878 cell line. This cell line is a part of the Human Genetic Cell Repository of the National Institute of General Medical Sciences (NIGMS). It was derived from a female member of the CEPH/Utah pedigree 1463, a family of Utah residents with ancestry from northern and western Europe. Samples from the pedigree were established as a part of a human genome cell line diversity panel by the Centre d'Etude du Polymorphisme Humain (CEPH). The cell line itself was established by an Epstein-Barr Virus (EBV) transformation of peripheral vein B lymphocytes [30, 59].

Several features of the GM12878 cell line make it an excellent choice for this project. First, it grows well. Second, its karyotype, knowledge of which is essential to any study of genome architecture, is relatively normal and, therefore, unambiguous.
Third, it is one of the cell lines which the International HapMap Project has chosen for deep sequencing with the Illumina Genome Analyzer (Solexa) platform [45]. As a result, both its genotype and haplotypes are well-characterized [46, 77, 140]. Fourth, it is also one of the cell lines chosen for the Encyclopedia of DNA Elements (ENCODE) Project [44, 22], and consequently, many of its standard functional features, such as gene expression, DNase hypersensitivity, histone modifications, DNA methylation, DNA polymerase binding, and the binding of other transcription factors, have been determined at the genome-wide level and are publicly available. Finally, its selection for the HapMap and the ENCODE projects has made GM12878 the cell line of choice for many other studies of the human genome. Therefore, the analysis of GM12878, which may be the best-characterized eukaryotic cell line, allows for a comparison and integration of the chromosome architecture data with the other functional aspects of the genome (see Chapter 3).

2.3.2.2 Choice of restriction enzymes

In order to evaluate the performance of TCC with significantly different chromatin preparations, experiments were carried out using two different restriction enzymes, HindIII and MboI. HindIII recognizes the hexameric sequence 5′-AAGCTT-3′, which appears 837,647 times in the reference sequence of the human genome, while MboI recognizes the tetrameric sequence 5′-GATC-3′, which appears more than 7,127,609 times. Therefore, the complete digestion of the genome with HindIII is expected to produce an average fragment size of about 3,416 bp, whereas complete digestion with MboI is expected to result in fragments that are on average smaller than 402 bp.

Several characteristics of these enzymes make them suitable for conformation capture experiments.
For one, both enzymes create 5′ overhangs, the only type of overhang that can be modified by fill-in reactions using a DNA polymerase and nucleotide analogues. As described earlier, such a modification is essential for purifying the ligated fragments from the non-ligated. Additionally, the overhangs that are generated by these enzymes, TCGA-5′ by HindIII and CTAG-5′ by MboI, contain all four bases and thus offer flexibility in the choice of nucleotide analogues that can be used to modify them. Furthermore, neither of these enzymes requires extreme conditions, such as high temperatures or low pH, and both are stable and operate well in the conditions that are used during the digestion of crosslinked chromatin (data not shown). Finally, unlike some other enzymes, these enzymes are not blocked by CpG methylation, which is an intrinsic property of the human genome [21, 37]. According to commercial providers, HindIII is completely insensitive to methylation, and MboI is only slightly less efficient in the case of an overlapping CpG methylation but is not completely blocked.

2.3.2.3 Layout of experiments

TCC was carried out once using HindIII and once using MboI as the restriction enzyme, with otherwise identical experimental parameters. Using this design, one can assess the effects of different chromatin preparations on the results of conformation capture assays. To also delineate the effects of solid-phase tethering, both experiments were replicated without surface immobilization. Instead, in these non-tethered experiments (which are equivalent to Hi-C [113] experiments), the ligation step was carried out in a diluted liquid phase, similar to the previous conformation capture methods. In summary, four conformation capture libraries, which only differed in the restriction enzyme and the use of tethering, were prepared: a HindIII tethered library, a HindIII non-tethered library, an MboI tethered library, and an MboI non-tethered library [footnote 1].
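The expected fragment sizes quoted in Section 2.3.2.2 follow from dividing the genome length by the number of recognition sites. A minimal Python sketch of that arithmetic is given here. The function names are hypothetical, and the genome length of 2.86 × 10^9 bp is an assumption back-calculated from the quoted figures rather than a value stated in the text, so the results only approximate the numbers above.

```python
def count_sites(sequence: str, site: str) -> int:
    """Count occurrences of a recognition site in one strand of a sequence."""
    sequence = sequence.upper()
    site = site.upper()
    count = 0
    start = sequence.find(site)
    while start != -1:
        count += 1
        # Continue one base further so overlapping matches are not skipped.
        start = sequence.find(site, start + 1)
    return count


def mean_fragment_length(genome_length: int, n_sites: int) -> float:
    """Approximate mean fragment size after complete digestion.

    For a genome with many sites, the number of fragments is
    approximately equal to the number of sites.
    """
    return genome_length / n_sites


# Assumed genome length (bp); not stated in the text.
ASSUMED_GENOME_LENGTH = 2_860_000_000

hindiii_mean = mean_fragment_length(ASSUMED_GENOME_LENGTH, 837_647)
mboi_mean = mean_fragment_length(ASSUMED_GENOME_LENGTH, 7_127_609)

# Toy sequence used only to exercise count_sites.
toy = "AAGCTTGGGATCAAGCTT"
toy_hindiii = count_sites(toy, "AAGCTT")
toy_mboi = count_sites(toy, "GATC")
```

Under the assumed genome length, `hindiii_mean` comes out close to the 3,416 bp quoted for HindIII, and `mboi_mean` falls just under the 402 bp bound quoted for MboI.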
2.3.2.4 Experimental procedure of the TCC experiments

GM12878 cells were treated with 1% formaldehyde to crosslink DNA and proteins and lysed using a combination of osmotic stress and physical homogenization to release the crosslinked chromatin. Chromatin proteins were denatured by adding sodium dodecyl sulfate (SDS) to the mixture and heating it at 65 °C. They were then biotinylated using Iodoacetyl-PEG2-Biotin, which is a sulfhydryl-reactive biotin labeling reagent with a polyethylene glycol (PEG) spacer arm (Figure 2.3) and reacts with cysteines [97]. After biotinylation, the SDS was neutralized using Triton X-100 to allow for the later enzymatic reactions. Chromatin DNA was then digested with a restriction enzyme (HindIII or MboI). Together, these treatments break the crosslinked chromatin into soluble crosslinked protein-DNA complexes in which proteins are biotinylated and DNA ends have 5′ overhangs (Figure 2.4).

[Footnote 1] These libraries are also respectively referred to as HindIII-TCC, HindIII-HiC, MboI-TCC, and MboI-HiC throughout this document.

Figure 2.3: Chemical structure of Iodoacetyl-PEG2-Biotin. Structure of Iodoacetyl-PEG2-Biotin, which was used for non-specific biotinylation of cellular proteins through their cysteine residues. (Image from www.piercenet.com)

These biotinylated complexes can be immobilized on a streptavidin-coated surface (Figure 2.4). To make this immobilization efficient, the excess Iodoacetyl-PEG2-Biotin, which is 0.6 kD, was removed by dialysis with a 20 kD cutoff membrane. The complexes were then immobilized on streptavidin-coated magnetic beads. The amount of beads that was used provides approximately 2 cm² of surface area per 1 million cells' worth of crosslinked chromatin. After immobilization, the excess streptavidin molecules, ones that were not bound by a protein-DNA complex, were blocked by free biotin to prevent interference with the next reactions, which included other biotinylated reagents.
After this point, all non-immobilized material, including free biotin, the restriction enzyme, non-biotinylated proteins, and the non-crosslinked DNA, was washed from the mixture.

Figure 2.4: Overview of Tethered Conformation Capture (TCC). Cells are treated with formaldehyde, which covalently crosslinks proteins (purple ellipses) to each other and to DNA (orange and blue strings). (1) The chromatin is solubilized and its proteins are biotinylated (purple ball and stick). DNA is digested with a restriction enzyme that generates 5′ overhangs. (2) Crosslinked complexes are immobilized at a very low density on the surface of streptavidin-coated magnetic beads (grey arc) through the biotinylated proteins; non-crosslinked DNA fragments are removed. (3) The 5′ overhangs are filled in with an α-thio-triphosphate-containing nucleotide analog (the yellow nucleotide in the inset), which is resistant to exonuclease digestion, and a biotinylated nucleotide analog (the red nucleotide with the purple ball and stick in the inset) to generate blunt ends. (4) Blunt DNA ends are ligated. (5) Crosslinking is reversed and DNA is purified. The biotinylated nucleotide is removed from non-ligated DNA ends using E. coli exonuclease III while the phosphorothioate bond protects DNA fragments from complete degradation. (6) The DNA is sheared and fragments that include a ligation junction are isolated on streptavidin-coated magnetic beads, but this time through the biotinylated nucleotides. (7) Sequencing adaptors are added to all DNA molecules to generate a library. (8) Ligation events are identified using paired-end sequencing.

In immobilized complexes, the DNA ends were modified to lay the groundwork for the purification of the ligated DNA fragments from the non-ligated. For this modification, using the Klenow fragment (the large fragment of E. coli DNA polymerase I), the 5′ overhangs were filled in with dATP, dTTP, Biotin-14-dCTP, and dGTPαS nucleotides (Figure 2.4 inset).
This reaction resulted in blunt DNA ends in which dGTPαS was inserted 5′ to Biotin-14-dCTP. This position of dGTPαS relative to Biotin-14-dCTP is ensured in the recognition site of either MboI or HindIII because, in both cases, guanine is positioned 5′ to cytosine.

These nucleotide analogues were chosen based on their specific properties. Biotin-14-dCTP is a biotinylated analogue of dCTP, which, upon incorporation into DNA, generates a biotinylated cytosine residue. dGTPαS, on the other hand, is an analogue of dGTP in which one of the non-bridging oxygen atoms of the alpha-phosphate is replaced with a sulfur atom. Upon integration into DNA, this analogue creates a phosphorothioate bond (Figure 2.5), which is resistant to certain exonucleases. This feature is important for the later purification of ligated DNA fragments. The choice of cytosine (C) and guanine (G) bases to carry their respective analogues, together with the overhangs that HindIII or MboI create, ensures that in all DNA fragments the phosphorothioate bond is inserted 5′ to the biotinylated residue. This arrangement is important for the purification of ligated DNA fragments in later steps.

The blunt-ended DNA fragments were ligated using T4 DNA ligase. As previously discussed, because the crosslinked protein-DNA complexes are immobilized at a low density on the surface of the beads, this reaction is likely to highly favor intramolecular ligations between crosslinked DNA fragments. Only such ligations represent in vivo crosslinking events. In other words, at this point in the experiment, in vivo crosslinking events between DNA fragments were converted to ligations between them.

Figure 2.5: Phosphorothioate bonds in DNA. (Image from IDT DNA technologies)

These ligations stably connected the crosslinked DNA fragments and thus made the crosslinked proteins dispensable. Therefore, DNA was purified using overnight incubation at 65 °C with proteinase K and extraction with phenol-chloroform.
This process, which also removes DNA from the surface of the beads, results in a primary library that is equivalent to a 3C library and contains both ligated and non-ligated DNA fragments.

To separate the ligated fragments from the non-ligated ones, the purified sample was treated with E. coli Exonuclease III (ExoIII). This enzyme catalyzes the stepwise removal of mononucleotides from 3′-hydroxyl termini of duplex DNA [201], but it cannot cleave phosphorothioate bonds [136]. Consequently, in this reaction, ExoIII removes the terminal residues, including the biotinylated residue, from all non-ligated DNA ends, but it is stopped by the phosphorothioate bond before it degrades the entire DNA fragment. The only biotinylated residues that can survive this treatment are those at the ligation junctions, making biotin an exclusive label of the ligated DNA fragments and thus enabling their isolation from the non-ligated.

At this juncture, in principle, the ligated DNA molecules can be purified by streptavidin-coated beads and sequenced. In practice, however, these fragments are too long for the Illumina Genome Analyzer (Solexa) sequencing platform [20], as they average more than 3 kb in length (data not shown). To break them into shorter pieces, the DNA solution was subjected to acoustic energy in a Covaris S2 instrument to obtain fragments with an average size of 200 bp. This treatment often results in structural anomalies at the point of fragmentation, such as 3′ and 5′ overhangs, 3′ phosphate, or a lack of 5′ phosphate. To repair such anomalies, sheared DNA was treated with an enzymatic cocktail including the Klenow fragment, T4 DNA polymerase, and T4 polynucleotide kinase in the presence of dNTPs and ATP. Afterwards, a single A-overhang was added to the resulting blunt ends using the exonuclease-deficient (exo-) Klenow fragment. This overhang facilitates the ligation of sequencing adaptors after the purification of the ligated DNA fragments.
The ligated DNA fragments were isolated on streptavidin-coated magnetic beads, and the non-ligated fragments were washed away. The sequencing adaptors, which contain T-overhangs, were then ligated to both ends of all the DNA fragments using T4 DNA ligase. All the DNA fragments on the beads were then amplified in a PCR reaction using adaptor-specific primers. The PCR products between 350 bp and 550 bp were isolated from a gel. A sample of these DNA molecules, which constitute a final library, was then sequenced on an Illumina Genome Analyzer IIx.

2.3.2.5 Experimental procedure of the non-tethered conformation capture experiments

The experimental parameters of each non-tethered conformation capture experiment were identical to those of the corresponding tethered experiment. However, all the steps that are required for immobilizing the crosslinked chromatin on the beads were eliminated, and instead, ligation was carried out in liquid phase at a low concentration. As a result, these non-tethered experiments are almost identical to Hi-C experiments [113]. Moreover, in these experiments, the amount of dilution per cell count (i.e., the concentration of crosslinked chromatin) during ligation was replicated from the Hi-C study by Lieberman-Aiden and colleagues [113]. Consequently, the results of our experiments are comparable to those of Lieberman-Aiden and colleagues.

2.3.2.6 Massively-parallel sequencing of conformation capture libraries

The quality of each library and its suitability for massively-parallel sequencing were determined by sequencing a small number of its molecules using the standard Sanger method [151] combined with fluorescent dye-terminator chemistry [159, 135]. This procedure confirmed that the majority of the DNA fragments in each library represented potential contacts in the genome and were suitable for sequencing. However, the non-tethered MboI library appeared to have a very low signal-to-noise ratio (see Section 2.3.3).
Consequently, this library was not analyzed using massively-parallel sequencing; rather, a sample of about 150 molecules from this library was sequenced using the standard dye-terminator method. Only a few parameters of this library have been used in this study. All of these were estimated from this sequenced sample.

For the three other libraries that passed the initial quality control, a sample of the final library was sequenced on an Illumina Genome Analyzer IIx platform using 40 or 50 bp paired-end reads [20]. All libraries were initially sequenced on either one or two lanes of the Genome Analyzer IIx. After this first round of sequencing, the tethered HindIII (HindIII-TCC) library was chosen for a more extensive analysis and further sequenced on two lanes of the Hi-Seq platform. Hi-Seq is, in principle, similar to the Genome Analyzer IIx but produces about 5 times as many reads on a single sequencing lane from a similar amount of the material. The sequencing output statistics of all the libraries generated as a part of this study are shown in the first row of Table 2.1.

Table 2.1: Library statistics. The sequencing, alignment, pairing, and filtering statistics for each library. The italicized numbers for PCR multiplication, flaking, and self-looping mark the pairs that were filtered out of the initial catalogue in order to obtain the final catalogue. Numbers in parentheses are percentage values of each category compared to the "Total pairs" row. The last row ("Filtered pairs") represents the catalogues that were used for all later analyses, mostly in Chapter 3.
Library | HindIII Tethered | HindIII Non-tethered | MboI Tethered
Total clusters | 211,592,642 | 31,346,767 | 24,343,780
Unique alignments, first end | 175,086,554 | 26,104,639 | 18,858,217
Unique alignments, second end | 170,949,684 | 25,741,087 | 18,009,151
Total pairs | 147,262,098 | 22,210,846 | 14,346,220
Non-informative, PCR multiplications | 10,337,451 (7%) | 624,976 (3%) | 172,414 (1%)
Non-informative, flaking | 26,404,870 (18%) | 6,654,621 (30%) | 1,845,049 (13%)
Non-informative, self-looping | 11,886,208 (8%) | 589,712 (3%) | 916,333 (6%)
Filtered pairs (final catalogue) | 98,633,569 (67%) | 14,341,537 (64%) | 11,412,424 (80%)

In the paired-end format, all the DNA molecules are first sequenced from one end and then from the other; as a result, the sequencing output comprises two files: one for all the first-end reads and the other for all the second-end reads. The two reads that belong to the same DNA molecule (one in each file) can be identified by a unique code that they share in the output files.

2.3.2.7 Assembling contact catalogues of each library

Pre-alignment filtering. In a conventional massively-parallel sequencing experiment, the output reads can be aligned to a reference genome directly. In conformation capture experiments, however, direct alignment is not an effective option because, in many molecules, the sequencing from either end can run past the ligation junction. In such a case, the resulting read represents a non-contiguous part of the reference genome. When appropriate alignment parameters are used, these reads often fail to align to the reference genome.

This problem can be alleviated by excluding the part of a read that comes after the ligation junction. This task is possible because the sequence of the ligation junction can be determined based on the restriction enzyme’s recognition sequence and overhang. For example, the ligation junctions in the HindIII libraries are expected to have an AAGCTAGCTT sequence (Figure 2.6 A), whereas those in the MboI libraries are expected to have a GATCGATC sequence (Figure 2.6 B).
In each case, the midpoint of the sequence represents the ligation point, after which the read corresponds to a different region of the genome. Therefore, in this study, we scanned all reads for the ligation junction sequence. When such a junction was identified in a read, we removed the part after the midpoint of the junction (i.e., after AAGCT for HindIII junctions and after the first GATC for MboI junctions).

Figure 2.6: Expected ligation junction sequences for HindIII and MboI. The expected ligation junction sequence in a HindIII (A) or MboI (B) library. The white letters represent bases that are inserted during the fill-in reaction.

While scanning for exact matches of the expected junction sequence may be adequate, it is not exhaustive. Several factors can lead to junctions that deviate slightly from the expected sequences. Firstly, non-enzymatic truncation of the 5′ overhangs during and after the digestion step in library preparation can lead to ligation junctions that have a small deletion in the middle. In the case of a HindIII digestion, for example, an AAGCTAGCTT can be converted to AAGCAGCTT (the fifth base, T, is deleted). Secondly, a PCR-introduced mutation or sequencing error can produce ligation junctions that appear to have a different nucleotide at one position. For instance, an AAGCTAGCTT can appear as AGGCTAGCTT (the second base is mutated from A to G). Thirdly, and most importantly, the star activity of the restriction enzyme during the digestion of crosslinked chromatin can result in junctions with deviating sequences. Star activity refers to the cleavage by a restriction enzyme of a sequence that is similar but not identical to its defined recognition sequence [142]. For example, HindIII may digest AAGCAT or AAGCT instead of AAGCTT (the deviating base is misidentified by the enzyme), albeit at a lower frequency.
This star activity is known to be exacerbated by certain conditions [142], among which are non-standard reaction buffers, a high ratio of the enzyme to the target DNA, and prolonged incubation times. All of these conditions were present during chromatin digestion in our conformation capture experiments (tethered or non-tethered): the reaction buffers differed from the standard buffer because they contained denatured cellular proteins, cellular lipids, and large amounts of detergents; the enzyme to DNA ratios were high because the amount of enzyme that was used to digest the crosslinked chromatin was more than five times the amount that would completely digest an equivalent amount of purified DNA in one hour; and incubation was prolonged as digestions were carried out overnight instead of the standard one to four hours. Combined, these factors suggest that the fraction of the deviant ligation junctions may be sizable.

To account for these deviant ligation junctions, we adjusted the scanning algorithm to allow for one mismatch or deletion in the entire expected junction sequence. In other words, any sequence that differed from the ligation junction by one mismatch or deletion was also considered to be a junction, and the sequence after the junction’s midpoint was removed from the corresponding read. This strategy is consistent with the nature of star activity, which often takes place on sites that differ from the defined recognition sequence in only one base.

Since star activity is likely the dominant cause of deviating junctions, the observed frequencies of these junctions in conformation capture libraries provide a window into star activity. Table 2.2 compares the observed frequency of various deviant junctions to that of the expected junction sequence.

Alignment. After filtering, the reads can be directly aligned to the reference genome.
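The pre-alignment filtering step described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names are hypothetical, and two details are assumptions made only for illustration. First, the read is scanned from left to right and trimmed at the first junction-like match. Second, when the match is a one-base deletion variant, one base fewer than the junction's half-length is kept, which is a conservative choice that never retains sequence from beyond the ligation point.

```python
def one_off_variants(junction):
    """Return (substitutions, deletions): all sequences that differ from
    the expected junction by exactly one mismatch or one deleted base."""
    subs, dels = set(), set()
    for i, base in enumerate(junction):
        dels.add(junction[:i] + junction[i + 1:])
        for alt in "ACGT":
            if alt != base:
                subs.add(junction[:i] + alt + junction[i + 1:])
    return subs, dels


def trim_read(read, junction):
    """Remove the part of a read that follows the junction midpoint.

    Exact matches and one-mismatch matches keep the first half of the
    junction; one-deletion matches keep one base fewer (assumption).
    Reads without a junction-like sequence are returned unchanged.
    """
    n, half = len(junction), len(junction) // 2
    subs, dels = one_off_variants(junction)
    for start in range(len(read) - n + 2):
        window = read[start:start + n]
        if window == junction or window in subs:
            return read[:start + half]
        if read[start:start + n - 1] in dels:
            return read[:start + half - 1]
    return read


HINDIII_JUNCTION = "AAGCTAGCTT"  # expected HindIII junction (Figure 2.6 A)
MBOI_JUNCTION = "GATCGATC"       # expected MboI junction (Figure 2.6 B)

print(trim_read("CCCCAAGCTAGCTTGGGG", HINDIII_JUNCTION))  # exact junction
print(trim_read("CCCCAAGCAGCTTGGGG", HINDIII_JUNCTION))   # one-base deletion
print(trim_read("ACGTGATCGATCAA", MBOI_JUNCTION))         # exact MboI junction
```

Because trimming shortens some reads and not others, the output of such a step has unequal read lengths, which is the property that constrains the choice of alignment program discussed next.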
Since these reads are short and numerous, aligning them is challenging and cannot be performed within a reasonable time-frame with the traditional alignment programs, such as BLAST and BLAT [94, 160, 9]. For this task, specialized programs have been developed, including RMAP, PerM, Maq, Bowtie, and Eland [158, 39, 111, 107, 20]. However, not all of these can be used to align conformation capture data because the pre-alignment filtering of the ligation junctions results in unequal read lengths. Most of the aforementioned programs create specific indices from each read and compare them to the reference genome. Because these indices are based on a constant read length, they cannot be easily reconciled with differing lengths, and as a consequence, most of these programs do not accept an input with reads of different lengths. An exception to this pattern is Bowtie, which creates indices for the reference genome instead of the reads and, as a result, can handle reads of varying lengths. Accordingly, we used Bowtie (version 0.12.7) [107] to align the conformation capture data to the GRCh37/hg19 build of the human genome.

We carried out alignment for the first-end and the second-end reads of each library independently. The alignment parameters were basic and did not allow for more than 3 mismatches for each alignment. For each read, only the best alignment result (i.e., the one with the lowest number of mismatches) was recorded. If a read had more than one best alignment result, all of them were discarded. Table 2.1 shows the alignment statistics for each end of each of the libraries. It also demonstrates the importance of ligation junction filtering and star-activity filtering for maximizing the fraction of the reads that are uniquely aligned to the genome.

Table 2.2: Star activity in different libraries. In each sequenced library, the number of detected ligation junctions on either read has been counted and their sequence has been determined. The top ten ligation junctions in each library are shown. The percentages are calculated as a fraction of all sequenced reads for each library. T: Tethered, NT: Non-tethered.

HindIII-T junction | Count | % | HindIII-NT junction | Count | % | MboI-T junction | Count | %
AAGCTAGCTT | 56435495 | 16.31 | AAGCTAGCTT | 4555136 | 8.79 | GATCGATC | 7019807 | 19.04
AAGCAGCTT | 6405030 | 1.85 | AAGGTAGCTT | 166598 | 0.32 | GATGATC | 322210 | 0.87
AAGCTAGCT | 1667694 | 0.48 | AAGCTAGCT | 154986 | 0.30 | GATCGAT | 185342 | 0.50
AAGCTGCTT | 1047976 | 0.30 | AAGCTACCTT | 151859 | 0.29 | GATCATC | 73439 | 0.20
AAGCTACCTT | 744880 | 0.22 | AAGATAGCTT | 122790 | 0.24 | GATCCATC | 72059 | 0.20
AATCTAGCTT | 437949 | 0.13 | AAGCTATCTT | 103168 | 0.20 | GATGGATC | 43424 | 0.12
AAGGTAGCTT | 294958 | 0.09 | AAGCTAGCCT | 67052 | 0.13 | GATTGATC | 27091 | 0.07
AAGCTAGCCT | 236731 | 0.07 | AAGCAGCTT | 66743 | 0.13 | GACGATC | 25638 | 0.07
ATGCTAGCTT | 220614 | 0.06 | AATCTAGCTT | 66390 | 0.13 | GATCGAC | 24764 | 0.07
AAGCTAGATT | 203785 | 0.06 | ATGCTAGCTT | 58795 | 0.11 | GAACGATC | 24480 | 0.07

Contact catalogue assembly and filtering. We coupled the first-end read and the second-end read alignments of each sequenced DNA fragment to form a “pair”, which is a line of data comprised of two locations in the genome and potentially represents a contact between the two corresponding loci (more on this potential below). This coupling was carried out based on a unique code that was assigned by the sequencing machinery to each sequenced DNA fragment and was shared by both reads from that fragment. All the pairs from each library were then combined in a list to form the primary contact catalogue (Figure 2.7). The pairs for which either end could not be unambiguously aligned were not included in this catalogue. Table 2.1 (page 39) shows the total number of pairs in each library’s primary contact catalogue.

Figure 2.7: Contact catalogue. A sample of ten pairs from a contact catalogue. Each pair includes the strand, chromosome, position, and the length of the alignment for each of its corresponding reads.
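The coupling of first-end and second-end alignments can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in this study: the Alignment record, the function name build_primary_catalogue, and the use of dictionaries keyed by the shared read code are all hypothetical, and each dictionary is assumed to hold only the reads that aligned uniquely.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """One uniquely aligned read (fields follow Figure 2.7)."""
    strand: str
    chrom: str
    pos: int
    length: int


def build_primary_catalogue(first_end, second_end):
    """Couple alignments that share a read code.

    first_end and second_end map a read code to its unique Alignment.
    A code missing from either mapping means that end could not be
    unambiguously aligned, so no pair is formed for it.
    """
    return [
        (first_end[code], second_end[code])
        for code in first_end
        if code in second_end
    ]


# Hypothetical read codes and positions, for illustration only.
first = {
    "read_1": Alignment("+", "chr1", 1_000_000, 40),
    "read_2": Alignment("-", "chr5", 2_500_000, 27),
}
second = {
    "read_1": Alignment("-", "chr1", 9_000_000, 40),
    "read_3": Alignment("+", "chrX", 100_000, 33),
}
catalogue = build_primary_catalogue(first, second)
print(len(catalogue))  # only read_1 aligned uniquely on both ends
```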
Not all pairs in the primary catalogues truly embody information about genome architecture. In fact, three types of pairs can be distinguished that do not contain any data about the spatial organization of the genome. The first type consists of pairs that represent PCR copies of a single DNA molecule generated during the pre-sequencing PCR amplification (PCR multiplication). Sequencing multiple PCR copies of a single DNA fragment can make the corresponding contact appear more frequent in cells when, in reality, it was either amplified more efficiently in the PCR or more than one of its copies was sequenced by chance. The discriminating feature of pairs that originate from PCR multiplication of a single DNA fragment is that they all align to the exact same position on both ends. This behavior contrasts with non-PCR copies of a truly frequent contact, which align to slightly different positions on one or both ends because they undergo independent shearing events that are very unlikely to cut them in identical locations on both ends. Therefore, when groups of pairs aligned to identical positions on both ends, we discarded all but a single member from the catalogue.

The second type of non-informative pairs originates from DNA molecules that do not include a ligation junction; nevertheless, when purifying the ligation junctions during the last part of the library preparation process, they bind to the streptavidin-coated beads and undergo sequencing (flaking). In other words, these are contiguous fragments that have evaded all the cleaning mechanisms that enrich the ligation junctions. This evasion can occur because these molecules bind to the beads non-specifically or because they do not undergo a complete cycle of exonuclease activity and retain their terminal biotins.
The discriminating characteristic of flaking pairs is that they represent uninterrupted genomic sequences, and therefore, their ends align to the opposite strands of the same chromosome just a fragment-length apart. This fragment length, which is determined by the size selection step before sequencing (see Materials and methods), is typically between 250 bp and 700 bp, and most certainly below 1000 bp. Accordingly, this flaking phenomenon can be readily identified in the histogram of the distance between pairs in the form of a pronounced peak of fragments that align less than 1000 bp apart from each other (Figure 2.8). To account for flaking, we removed from all the contact catalogues all pairs that aligned to the opposite strands of the same chromosome and closer than 1000 bp to each other.

Figure 2.8: Distribution of alignment distance for all pairs in each library. The plot shows the distribution of the alignment distance between the two ends of pairs in the HindIII Tethered (A), HindIII Non-tethered (B), and MboI Tethered (C) libraries. Only pairs that align to the same chromosome are considered. The sharp peak between 100 bp and 1000 bp (marked with an arrow in A) represents fragments that are an outcome of the flaking phenomenon.

The third type of non-informative pairs originates during the ligation step after restriction digestion of the chromatin DNA, when some DNA molecules ligate to themselves (self-looping) (Figure 2.9). Such ligation events clearly do not carry any information about the genome architecture as the two ends of any DNA molecule are, by definition, always in spatial proximity. The discerning features of self-looping pairs, though, are not as definitive as those of PCR multiplication or flaking. The two ends of these pairs also align on the opposing strands of the same chromosome and at relatively short genomic distances from each other.
However, this distance has a wide range: it can be as short as the shortest length that, based on DNA’s persistence length, allows the formation of a circular DNA (likely above 100 bp) and as long as the longest fragments after the digestion of the crosslinked chromatin (about 12 Kb for MboI and 30 Kb for HindIII). Many true contacts in the catalogues are also within the same distance range. Nevertheless, the comparison between the distance distribution of pairs aligning to the same strand and those aligning to the opposite strands suggests that self-looping pairs comprise a majority of the pairs in their distance range (data not shown). Based on these considerations, we discarded all pairs with a distance of less than 30 Kb for the HindIII libraries and less than 12 Kb for the MboI library. For extra caution, we chose these cutoff values larger than those calculated based on the size distribution of the digested fragments. Moreover, in order to maintain a consistent composition of binary contacts of all distances, this filtering also included the pairs that aligned to the same strand. Both of these strategies are justifiable because the analyses in ensuing chapters do not focus on short range contacts.

Figure 2.9: Self-looping. Various intermediates during the ligation step of any conformation capture experiment can ligate to themselves, resulting in self-looping.

The removal of PCR multiplication, flaking, and self-looping pairs from the primary catalogue generated the final catalogue of binary contacts (also referred to as the contact catalogue from here on out). This filtering procedure for each library is summarized in Table 2.1 (page 39), which contains the total number of pairs after each filtration step from the primary catalogue to the final contact catalogue. Unless otherwise stated, all the analyses that are described in this document are based on these final catalogues (i.e., the last row in Table 2.1).
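The three filtering rules described above can be summarized in a short Python sketch. It is illustrative only and is not the code used in this study. The End record and the function name filter_catalogue are hypothetical, and two details are assumptions: the distance between the two ends is taken as the absolute difference of their alignment positions, and a pair and its end-swapped copy are treated as the same pair when detecting PCR multiplication.

```python
from collections import Counter
from typing import NamedTuple


class End(NamedTuple):
    strand: str
    chrom: str
    pos: int


def filter_catalogue(pairs, self_loop_cutoff, flaking_cutoff=1000):
    """Split a primary catalogue into kept pairs and per-category counts.

    self_loop_cutoff: 30_000 for the HindIII libraries, 12_000 for MboI.
    flaking_cutoff: opposite-strand pairs closer than this are flaking.
    """
    seen = set()
    kept = []
    removed = Counter()
    for a, b in pairs:
        # PCR multiplication: identical positions on both ends.
        key = tuple(sorted((a, b)))
        if key in seen:
            removed["pcr_multiplication"] += 1
            continue
        seen.add(key)
        if a.chrom == b.chrom:
            distance = abs(a.pos - b.pos)
            # Flaking: opposite strands, same chromosome, < 1000 bp apart.
            if a.strand != b.strand and distance < flaking_cutoff:
                removed["flaking"] += 1
                continue
            # Self-looping: short range on either strand combination.
            if distance < self_loop_cutoff:
                removed["self_looping"] += 1
                continue
        kept.append((a, b))
    return kept, removed


# Hypothetical pairs, for illustration only.
primary = [
    (End("+", "chr1", 1000), End("-", "chr1", 1400)),       # flaking
    (End("+", "chr1", 1000), End("-", "chr1", 1400)),       # PCR copy
    (End("+", "chr1", 5000), End("-", "chr1", 20000)),      # self-looping
    (End("+", "chr1", 5000), End("+", "chr1", 25000)),      # short range
    (End("+", "chr1", 5000), End("-", "chr2", 5100)),       # kept
    (End("-", "chr3", 100000), End("+", "chr3", 900000)),   # kept
]
final, counts = filter_catalogue(primary, self_loop_cutoff=30_000)
print(len(final), dict(counts))
```

The per-category counts correspond to the “Non-informative” rows of Table 2.1, and the kept pairs correspond to the “Filtered pairs” row.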
2.3.3 Evaluating noise levels

It has been documented that conformation capture techniques are encumbered by low signal-to-noise ratios [68, 79, 157]. It is, thus, important to evaluate the level of noise in our libraries and assess the effect of tethering on this level. Such an assessment requires both a clear definition of noise and knowledge of its major sources. In the context of genome-wide conformation capture experiments, noise is comprised of those pairs in the catalogue that do not truly represent a crosslinking event but, unlike flaking and self-looping, cannot be discriminated from the other pairs. Such pairs result in false-positive contacts. Many different phenomena, in both the processing of the data and the experiment itself, result in noise. For example, the faulty alignment of an end converts the corresponding pair into noise. Noise can also be produced during the final preparation of the library for sequencing, when DNA fragments ligate to each other instead of the sequencing adaptors. Similarly, during the first ligation step of the library preparation process, an intermolecular ligation between non-crosslinked DNA fragments can produce fallacious contacts.

In the case of faulty alignments, there is no real measure of how extensive the problem can be. Nevertheless, high-quality sequencing and appropriate alignment parameters can probably minimize this problem to negligible amounts. The ligation of DNA fragments to each other instead of the sequencing adaptors is also a limited phenomenon and tends to be constant between different library preparations. In fact, based on the results of a study by Quail and colleagues [137], it is expected that only five percent of the pairs in each catalogue are a result of this problem.
On the other hand, intermolecular ligations between non-crosslinked DNA fragments have been recognized as the most important source of noise in conformation capture experiments [79], so much so that several technology development studies have considered their effects [68, 154]. Moreover, unlike the other noise-generating factors, the extent of intermolecular ligations can differ in different libraries largely based on the total concentration of DNA during the reaction. To make matters worse, there is no upper limit to the fraction of intermolecular ligations. These characteristics make intermolecular ligations between non-crosslinked fragments of DNA the most important source of noise in the catalogues.

The main challenge in measuring the noise level in a library is that individual false-positive noise pairs do not differ in any way from the other pairs in a catalogue. Collectively, however, these erroneous pairs have a different distribution from bona fide contacts. Whereas the true contact pairs are distributed based on the constraints of genome architecture, the false-positive pairs are largely distributed randomly or based on the concentration of the corresponding DNA fragments during ligation. For example, true pairs are likely to be overwhelmingly intrachromosomal (between fragments that belong to the same chromosome) because the chromosome fiber keeps neighboring loci in spatial proximity and because each chromosome resides within a largely discrete territory where it predominantly interacts with itself (see Chapter 1). False-positive pairs, on the other hand, are likely to be overwhelmingly interchromosomal (between fragments that belong to different chromosomes). This tendency arises because, in a random intermolecular ligation, there are more possible combinations of fragment pairs from different chromosomes (23 in human) than from the same chromosome.
In summary, all other parameters being equal, a library with a higher level of noise is expected to have a higher fraction of interchromosomal contacts than a library with a lower level of noise.

We used this behavior to compare the noise levels of the different libraries without directly identifying individual noise pairs. For this comparison specifically, the primary catalogues (2.3.2.7, page 45) were used instead of the final catalogues and processed with a slightly different filtering scheme. In this scheme, only PCR multiplication (see page 46) and flaking (see page 46) pairs were removed from consideration. Self-looping pairs (see page 48) were not removed because, even though they do not harbor any structural information, they are instances of intramolecular ligation and, thus, relevant to the fraction of intermolecular ligations in the library. The fraction of interchromosomal pairs (i.e., ligations) was then measured in each catalogue (Figure 2.10). This fraction is half as large in the tethered HindIII library compared to its non-tethered counterpart. This difference is even more pronounced in the MboI libraries, where the fraction of interchromosomal ligations is almost two and a half times smaller in the tethered library compared to its non-tethered counterpart. The consistently lower fraction of interchromosomal ligations in all the tethered libraries strongly suggests that these libraries have a significantly lower level of noise.

Figure 2.10: Level of noise in different libraries. The observed fractions of intra (dark red) and interchromosomal (light blue) ligations in tethered (T) and non-tethered (NT) libraries produced using HindIII or MboI. The random ligation (RL) bar represents the expected fractions if all ligations occurred between non-crosslinked DNA fragments. For the non-tethered MboI library only, these fractions were determined by sequencing 160 individual DNA molecules from three replicates of the experiment.
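The measurement described above, and the random ligation reference it is compared against, can be sketched in Python as follows. This is a minimal illustration and not the code used in this study. The function names are hypothetical. The random ligation estimate rests on an explicit assumption that is not taken from the text: both ends of a random ligation are drawn independently, each with probability proportional to the number of fragments that the chromosome contributes, so that the expected interchromosomal fraction is one minus the sum of the squared chromosome shares.

```python
def interchromosomal_fraction(pairs):
    """Fraction of pairs whose two ends lie on different chromosomes.

    pairs: iterable of (chromosome_of_first_end, chromosome_of_second_end).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("catalogue is empty")
    inter = sum(1 for chrom_a, chrom_b in pairs if chrom_a != chrom_b)
    return inter / len(pairs)


def random_ligation_interchromosomal_fraction(fragment_counts):
    """Expected interchromosomal fraction under fully random ligation.

    fragment_counts: mapping of chromosome name to its number of
    restriction fragments. Assumes independent ends (see lead-in).
    """
    total = sum(fragment_counts.values())
    same_chromosome = sum((n / total) ** 2 for n in fragment_counts.values())
    return 1.0 - same_chromosome


# Hypothetical toy catalogue, for illustration only.
toy_pairs = [("chr1", "chr1"), ("chr1", "chr2"),
             ("chr2", "chr2"), ("chr3", "chr1")]
print(interchromosomal_fraction(toy_pairs))

# With 23 equally sized chromosomes the expected fraction is 22/23.
equal = {f"chr{i}": 1 for i in range(1, 24)}
print(random_ligation_interchromosomal_fraction(equal))
```

Under this assumption, the more chromosomes there are and the more evenly the fragments are spread among them, the closer the random ligation fraction comes to one, which is why a noisier library is expected to show a higher interchromosomal fraction.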
Furthermore, among the non-tethered libraries, the MboI library shows a substantial increase in the fraction of interchromosomal ligations compared to the HindIII library. This fraction, however, is only modestly increased in the tethered MboI library compared to the tethered HindIII library. Because MboI results in a shorter size and a higher concentration of DNA fragments compared to HindIII (see 2.3.2.2, page 31), the small difference between the tethered libraries of these two enzymes demonstrates that, in contrast to the non-tethered approach, the tethered approach is minimally affected by the concentration of DNA fragments. This independence from concentration further supports that most ligations in TCC are between crosslinked DNA fragments. In other words, ligations in the tethered experiments appear to be zeroth-order reactions with respect to the reactants’ concentrations, which indicates that they are intramolecular [109].

It could be argued that tethering is somehow more efficient in capturing intrachromosomal contacts while the conventional approach is more suited for capturing interchromosomal ones, and thus, the observed differences in the fraction of interchromosomal ligations do not represent a difference in the level of noise. To rule out this possibility, we tested whether the distribution of the additional interchromosomal contacts in the non-tethered libraries does, in fact, resemble one that would be expected from random intermolecular ligations. To this end, we measured the average difference between the observed interchromosomal contact frequencies and those expected from completely random intermolecular ligations. This difference was half as large in the non-tethered library compared to the tethered library (see Materials and methods), indicating that, in the former, interchromosomal contacts follow the pattern that is expected from random intermolecular ligations.
This result shows that the higher fraction of interchromosomal ligations in the non-tethered library is, in fact, a result of noise pairs. It also confirms that the noise generated by random intermolecular ligations is considerably lower in the TCC libraries.

2.4 Discussion

We developed a new method, Tethered Conformation Capture (TCC), which is capable of mapping chromatin interactions at the genome-wide level (Figure 2.4). This method can be classified in the conformation capture family of techniques; however, it offers significant advances compared to those previously available. For example, our method extends the scope of conformation capture experiments from a limited number of loci or interactions to the entire genome. Moreover, it is not biased to any particular locus or interaction and, therefore, alleviates the need for prior knowledge of interactions in the nucleus.

TCC’s genome-wide scope and lack of bias are made possible by incorporating paired-end massively-parallel sequencing as the quantification strategy. Using this sequencing technology, ligations between all interacting loci can be identified and quantified. In fact, all chromosomes are represented in our results, and the number of contacts involving each chromosome is roughly proportional to its size. Moreover, because this second-generation sequencing technology alleviates the necessity for any locus-based enrichment, selection, or amplification, the results are also unbiased at the sub-chromosomal level. In fact, in our contact catalogues, almost all non-repetitive regions of the genome are represented roughly equally.

In addition to broadening the scope and eliminating bias, TCC also reduces the level of noise in conformation capture experiments. Noise is largely generated by intermolecular ligations between fragments of DNA that are not crosslinked to each other, a phenomenon which was dramatically reduced in TCC by integrating tethering and solid-phase ligation.
Solid-phase ligation refers to ligating crosslinked DNA fragments while they are tethered to a surface. The effectiveness of solid-phase ligation in reducing noise levels is indicated by the fractions of interchromosomal ligations (2.3.3 on page 49 and Figure 2.10). Such ligations, which are the likely outcomes of intermolecular ligations, were reduced almost two-fold in our tethered libraries. This reduction was even more drastic in our MboI libraries compared to the HindIII libraries because MboI cuts the genomic DNA into smaller fragments than does HindIII, thereby leading to a higher concentration of DNA during ligation. This higher concentration, in turn, drastically increased the fraction of interchromosomal ligations, rendering the non-tethered MboI library unfit for sequencing. In the tethered MboI library, by contrast, the increased concentration of DNA had little impact, leaving the library suitable for sequencing.

This unique advantage of TCC, namely the reduced level of intermolecular ligation noise in the tethered libraries, significantly increases the sensitivity of the conformation capture assay, making the analysis of low frequency contacts possible. This analysis is discussed in Chapter 3 and Appendix B, where it becomes clear that most interchromosomal interactions can only be observed in the tethered libraries. In the non-tethered libraries, interchromosomal interactions cannot be distinguished from the background noise produced by intermolecular ligations.

The lower level of intermolecular ligations in the tethered libraries may be attributed to several factors. First, only the DNA fragments that are crosslinked to proteins are immobilized, and thus, non-crosslinked fragments are effectively washed out of the reaction (Figure 2.4). This removal of the “naked” DNA fragments is important because any ligation involving these fragments is an intermolecular ligation and results in a false-positive noise contact.
Second, because immobilized protein-DNA complexes cannot diffuse freely, the only possible sources of intermolecular ligations are complexes that are immobilized in the immediate vicinity of each other. Using a large immobilization area can significantly and sufficiently reduce these “too close” immobilizations. Moreover, even if complexes are immobilized very close to each other, tethering may preclude the orientations that are required for a ligation reaction, further inhibiting intermolecular ligations. In the standard liquid phase ligation, in contrast, freely diffusing complexes can encounter each other throughout the course of the reaction and engage in intermolecular ligations. Using a large volume for the ligation reaction can reduce the probability of these encounters, but our results suggest that impractically large volumes would be needed in order to match the level of intermolecular ligations in the tethered samples.

To incorporate tethering and second-generation sequencing into TCC, we added several important steps to the standard library preparation protocol. A cysteine-reactive reagent (Figure 2.3) was used to biotinylate all proteins after cell lysis. By means of these biotinylated proteins, crosslinked chromatin was later tethered to streptavidin-coated magnetic beads.

For second-generation sequencing, we used a specific combination of biotinylated and exonuclease-resistant nucleotide analogues (Figure 2.4) which exclusively labeled the ligated DNA molecules. We then used streptavidin-coated beads, for a second time, to purify these molecules from the non-ligated ones. Such enrichment of the ligated DNA molecules is crucial for sequencing because the library preparation process is inefficient. Specifically, the vast majority of the DNA molecules in a standard library preparation do not undergo any ligation, making their sequencing futile.
Our procedure was very effective at enriching the ligated molecules; it reduced the fraction of non-ligated DNA molecules that were sequenced (i.e., the pairs that are a result of flaking, see 2.3.2.7 on page 46) to less than 19 percent and, in one of the libraries, as low as 13 percent (Table 2.1, page 39).

It is noteworthy that we initially conceived the idea of tethering not for reducing noise but for enabling clean-up between the different steps of the library preparation process. Though, at the time, we considered the potential of tethering to reduce noise, it was unknown that intermolecular ligations are such a pervasive problem in non-tethered libraries. It later turned out that the tethering strategy and the increased signal-to-noise ratios that result from it are a defining feature of TCC.

Another conformation capture method, which is also capable of unbiased genome-wide analysis, was developed by Lieberman-Aiden and colleagues in parallel with our studies [113]. This method, which is known as Hi-C, uses the standard liquid phase ligation strategy and, as a result, shows levels of intermolecular ligations that are slightly higher than those of our non-tethered libraries.

Although reduced noise is its main advantage, TCC offers several other advantages compared to Hi-C. For instance, because the intermolecular ligation noise remains low even at substantially increased DNA fragment concentrations, tethering facilitates higher resolution analyses with enzymes that cut the chromatin more frequently. Additionally, the removal of naked DNA fragments after surface immobilization eliminates the possibility of their self-looping, thereby increasing the efficiency of sequencing. The solid-phase approach also alleviates the need for large reaction volumes in the experiments, a feature that dramatically reduces the consumption of reagents.
Furthermore, by allowing for reaction clean-up, tethering also makes it possible to carry out multiple reactions on crosslinked protein-DNA complexes without interference between their reagents or enzymes. This property enables the use of alternative approaches for cutting the chromatin or more complex strategies for modifying the DNA fragments.

We have also presented multiple new strategies for the assembly of contact catalogues from genome-wide conformation capture data. For instance, the effect of ligation junctions on the success rate in aligning the reads to a reference genome has been quantified, and strategies for accounting for these junctions have been developed. These strategies also consider the complex issue of star activity, and by doing so, they improve the overall success rate of the alignment procedure. In addition, certain non-informative pairs, including those that result from flaking, self-looping, and PCR multiplication, have been characterized, and algorithms for their removal have been developed.

2.5 Materials and methods

2.5.1 Tethered Conformation Capture (TCC)

Cell culture and crosslinking of cells. GM12878 cells were obtained from Coriell Institute (Camden, NJ). They were grown in RPMI 1640 with 2 mM L-Glutamine, 15% fetal bovine serum (FBS), and 1% penicillin-streptomycin (50 U/mL penicillin and 50 μg/mL streptomycin), in accordance with the culture conditions provided by the ENCODE data coordination center on the UCSC website. For each batch of crosslinked cells, 25 million cells were spun down at 100 g for 5 minutes at room temperature, resuspended in 45 mL fresh medium, and fixed by adding 1.25 mL of 37% formaldehyde (Sigma-Aldrich, St Louis, MO) to a 1% final concentration and incubating at room temperature for 10 minutes. This crosslinking reaction was stopped by adding 2.5 mL of 2.5 M Glycine and incubating at room temperature for 15 minutes.
After stopping the crosslinking reaction, the cells were spun down at 4 degrees Celsius (°C), the supernatant was discarded, and the pellet was flash-frozen in liquid nitrogen. Each frozen batch was stored at -80 °C.

Cell lysis and biotinylation. A batch of 25 million crosslinked GM12878 cells was thawed on ice for 20 minutes. Cells were resuspended in 550 μL of the lysis buffer (10 mM HEPES pH=8.0, 10 mM NaCl, 0.2% IGEPAL CA-630, and 1X protease inhibitors solution (Roche Ltd, Basel, Switzerland)) and incubated on ice for 15 minutes. This mixture was then transferred to a Dounce homogenizer (Wheaton Industries Inc., Millville, NJ) and treated with 20 strokes of pestle A. The resulting lysate was transferred to a 2.0 mL centrifuge tube. It was then spun down at 2500 g for 5 minutes at room temperature, and the supernatant was discarded. The pellet was washed twice with an ice-cold wash buffer (50 mM Tris.HCl pH=8.0, 50 mM NaCl, 1 mM EDTA) and resuspended in 250 μL of the same buffer. It was then mixed with 95 μL of 2% SDS and incubated at 60 °C for 15 minutes to solubilize the crosslinked chromatin. After the suspension cooled down to room temperature, it was mixed with 105 μL of 25 mM EZlink Iodoacetyl-PEG2-Biotin (IPB) (Pierce Protein Research Products) and rocked at room temperature for 60 minutes to biotinylate the cysteine residues. To neutralize SDS, the biotinylated sample was first mixed with 1300 μL of 1X NEBuffer2 and incubated on ice for 5 minutes. It was then mixed with 225 μL of 10% Triton X-100 and incubated on ice for 15 more minutes.

Digestion and dialysis. To start digestion, 100 μL of 10X NEBuffer2, 5 μL of 1 M DTT, 430 μL of water, and 20 μL of 100 U/μL HindIII (NEB, Ipswich, MA) were added to the sample (for the MboI-tethered experiment, 35 μL of 25 U/μL MboI (NEB) was used and water was reduced accordingly). The digestion reaction was incubated at 37 °C overnight.
The next day, the mixture was placed in a 20 kD cutoff Slide-A-Lyzer Dialysis Cassette (Pierce Protein Research Products, Rockford, Illinois) and dialyzed for five hours at room temperature against 1 L of the dialysis buffer (10 mM Tris.HCl pH=8.0, 1 mM EDTA) to eliminate excess IPB remaining from the biotinylation step. The dialysis buffer was renewed after three hours. The dialyzed sample, approximately 2500 μL in volume, was transferred to a 15 mL conical tube.

Immobilization at low surface coverage (tethering). To immobilize the biotinylated chromatin, the surface of 400 μL MyOne Streptavidin T1 beads (Invitrogen) was used. These beads have about 250 cm²/mL surface area; therefore, for each experiment, the chromatin content was tethered on about 50 cm² of surface area. To prepare the beads for the immobilization, they were washed with PBST (phosphate buffer saline with 0.01% Tween20) three times and resuspended in 2.0 mL of the same buffer. The dialyzed sample was divided into five equal aliquots in 1.5 mL centrifuge tubes for ease of handling. To each aliquot, 400 μL of the bead suspension in PBST was added. Each aliquot was then rocked at room temperature for 30 minutes. During this 30-minute incubation, 150 μL of 25 mM IPB was neutralized with an equimolar amount of 2-mercaptoethanol. At the end of the incubation, each aliquot was mixed with 5 μL of the neutralized IPB and rocked at room temperature for 15 minutes.

Filling DNA ends and blunt-end ligation on beads. Note: For each buffer or solution exchange in the following reactions, the beads were collected on the wall of the tube using a magnet, the solution was aspirated out, the tube was removed from the magnet, and the beads were resuspended in the desired buffer or solution. The washing steps were carried out with 600 μL of a buffer, unless a different amount is specified.
To prevent aggregation of the beads, it is important to minimize the time the beads spend on the wall of the tube. The reactions described below are carried out in parallel and identically for each aliquot.

To remove the non-biotinylated chromatin and non-crosslinked DNA, the beads were washed once with PBST and once with a wash buffer (10 mM Tris.HCl pH=8.0, 50 mM NaCl, 0.4% Triton X-100) and resuspended in 100 μL of the same buffer. The 5′ overhangs that were generated by the restriction enzyme (HindIII or MboI) were filled in by adding 63 μL of water, 1 μL of 1 M MgCl₂, 10 μL of 10X NEBuffer2, 0.7 μL of 10 mM dATP, 0.7 μL of 10 mM dTTP, 0.7 μL of 10 mM dGTPS (2′-Deoxyguanosine-5′-O-(1-thiotriphosphate), sodium salt, Sp-isomer, AXXORA, LLC, San Diego, CA), 15 μL of 0.4 mM Biotin-14-dCTP (Invitrogen, Carlsbad, CA), 4 μL of 10% Triton X-100, and 5 μL of 5 U/μL Klenow (NEB) to the sample and rocking the tube at room temperature for 40 minutes. This treatment generates blunt DNA ends that are marked by a biotinylated nucleotide located 3′ to a phosphorothioate bond. This phosphorothioate bond is produced by the integration of dGTPS into the DNA. The fill-in reaction was stopped by adding 5 μL of 0.5 M EDTA to the suspension. The beads were washed twice with a buffer (50 mM Tris.HCl pH=7.4, 0.4% Triton X-100, 0.1 mM EDTA), resuspended in 500 μL of the same buffer, and transferred to a new 15 mL conical tube. For ligation, the sample was mixed with 4 mL of water, 250 μL of 10X ligase buffer (NEB), 90 μL of 20% Triton X-100, 100 μL of 1 M Tris.HCl pH=7.4, 50 μL of 100X BSA (NEB, Ipswich, MA), and 2 μL of 2000 U/μL T4 DNA ligase (NEB, Ipswich, MA) and rocked at 16 °C for 4 hours. Ligation was stopped by first adding 200 μL of 0.5 M EDTA to the sample and then aspirating the supernatant out. The use of EDTA for stopping the reaction is important for preventing intermolecular ligations as the beads are collected on the wall of the tube during aspiration.
DNA extraction. To reverse the formaldehyde crosslinking and purify the DNA, the beads were first resuspended in 400 μL of an extraction buffer (50 mM Tris.HCl pH=8.0, 0.2% SDS, 1 mM EDTA, 100 mM NaCl). The suspension was then mixed with 20 μL of 20 mg/mL proteinase K (NEB) and incubated at 65 °C overnight. The following day, an additional 5 μL of 20 mg/mL proteinase K was added and the incubation at 65 °C was continued for two more hours. Afterwards, the supernatant, which contains the initial conformation capture library, was transferred to a new tube and extracted twice with an equal volume of phenol:chloroform:isoamyl alcohol (25:24:1 v/v) and once with an equal volume of chloroform. After this extraction, the aqueous phase was transferred to another tube and mixed with sodium chloride and glycogen to final concentrations of 200 mM and 25 μg/mL, respectively. The DNA content was then precipitated by adding 900 μL of 200 proof ethanol and incubating at -20 °C overnight (alternatively, incubation can be done at -80 °C for 1 hour). The precipitated DNA was pelleted by centrifugation at 20,000 g for 20 minutes at 4 °C. The supernatant was removed and the pellet was immersed in 500 μL of 80% EtOH and spun down at 20,000 g for 10 minutes. The supernatant was removed again, and, before it dried out completely, the pellet was resuspended in 20 μL of 10 mM Tris.HCl pH=8.0 (it is important to prevent the DNA pellet from completely drying at this stage). At this point, all five aliquots, which had been subjected in parallel to the treatments described above, were pooled for a total volume of 100 μL. The solution was mixed with 1 μL of 10 mg/mL RNase A and incubated at 37 °C for 30 minutes. After this RNA removal, the DNA was purified with the QIAquick PCR purification kit (Qiagen, Valencia, CA) per the manufacturer’s instructions and eluted in 50 μL of their elution buffer (EB), and its concentration was measured.
Removal of biotin from non-ligated DNA ends and shearing. 5 μg of the purified DNA sample was treated with 300 units of ExoIII (NEB) in a total volume of 90 μL of NEBuffer1 for 1 hour at 37 °C. The reaction was stopped by adding 2 μL of 0.5 M EDTA and 2 μL of 5 M NaCl, and the enzyme was inactivated by incubation at 70 °C for 20 minutes. To shear the DNA fragments to smaller sizes, the total volume of the sample was first adjusted to 100 μL by adding water. The sample was then transferred to a 6x16 mm AFA fiber microtube with snap-cap and sheared to 100-500 bp in a Covaris S2 (Covaris, Woburn, MA) at a duty cycle of 5%, an intensity of 5, and 200 cycles/burst, for a total of 180 seconds. This sheared DNA was purified with a QIAquick PCR purification kit per the manufacturer’s protocol and eluted in 50 μL of EB.

End-repair and adding the A-overhangs. To repair the DNA ends after shearing, the purified sample was treated with 5 units of Klenow (NEB), 15 units of T4 DNA polymerase (NEB), and 50 units of T4 polynucleotide kinase (NEB) in 100 μL of 1X T4 ligase buffer (NEB) with 0.4 mM of dNTPs for 30 minutes at 20 °C. The DNA was then purified with a QIAquick PCR purification kit and eluted in 40 μL of EB. To add the A-overhangs, the sample was treated with 15 units of exo⁻ Klenow (Enzymatics, Beverly, MA) in 50 μL of NEBuffer2 with 0.2 mM dATP for 30 minutes at 37 °C. This reaction was stopped with 1 μL of 0.5 M EDTA.

Pull-down of biotinylated DNA and ligation of sequencing adaptors. Note: For each buffer or solution exchange in the following reactions, the beads were collected on the wall of the tube using a magnet, the solution was aspirated out, the tube was removed from the magnet, and the beads were resuspended in the desired buffer or solution.
All the wash steps were done with 500 μL of a buffer unless a different amount is indicated, and all reactions were carried out in Costar pre-lubricated tubes (Capitol Scientific, Inc, Austin, TX) to prevent MyOne beads from sticking to the walls of the tubes.

To pull down the biotinylated DNA fragments (i.e., ligation junctions), 10 μL of MyOne Streptavidin C1 beads (Invitrogen) were washed twice with 1X Bind and Wash buffer (B&W) (2X B&W: 10 mM Tris-HCl pH=7.5, 1 mM EDTA, 2 M NaCl), resuspended in 50 μL of 2X B&W, mixed with the A-added DNA sample, and rocked at room temperature for 30 minutes. This treatment results in the binding of biotinylated DNA fragments to the streptavidin-coated beads. The beads were then washed once with 1X B&W with 0.1% Triton X-100 and once with 10 mM Tris.HCl pH=8.0 to remove the non-biotinylated DNA fragments. To attach the amplification/sequencing adaptors to all bound DNA molecules, the beads were resuspended in 100 μL of 1X rapid ligation buffer (Enzymatics) with 3 μM of the paired-end sequencing adaptors (Illumina, San Diego, CA), mixed with 5 μL of 600 U/μL T4 DNA ligase (Enzymatics), and incubated at room temperature for 20 minutes with occasional mixing. This reaction was stopped with the addition of 6 μL of 0.5 M EDTA, and the beads were washed twice with 1X B&W and twice with TE and resuspended in 30 μL of water.

PCR amplification. The PCR amplification protocol was adapted from Quail et al. [137] and carried out on 20 μL of bead suspension in 1X Pfx buffer, 1.5 mM MgSO₄, 400 μM dNTPs, 0.625 μM of each PE primer (Illumina) and 2 units of Platinum Pfx DNA polymerase (Invitrogen) with the following cycling conditions:

Initial denaturation: 94 °C, 3 minutes: 1 cycle
Amplification: 94 °C, 20 sec - 65 °C, 30 sec - 72 °C, 45 sec: 15 cycles
Final extension: 72 °C, 10 minutes: 1 cycle

Note: Using an excessive number of cycles in the PCR reaction should be avoided.
We often optimize the number of PCR cycles by running several pilot 10 μL PCRs with 1 μL of the bead template and 12, 15, and 18 cycles (or more if necessary). We then run each product on an electrophoresis gel, choose the cycle number at which the PCR appears to be in the first half of the exponential phase, and use that number of cycles for the scaled-up 50 μL PCR reaction.

Size selection. The entire PCR product was loaded on a 2% low range ultra agarose (Bio-Rad) gel and run at 100 V for 60 minutes to separate the sample. The gel was stained with SybrSafe (Invitrogen) and visualized on a Gel Doc 1000 (Bio-Rad). The fragments between 350 and 550 base-pairs were excised and purified with a QIAgel extraction kit (Qiagen) in accordance with the manufacturer’s protocol. The extracted DNA, which constitutes a tethered library (HindIII-TCC or MboI-TCC), was eluted in 50 μL of EB into a pre-lubricated tube.

Quantification and paired-end sequencing. The library was quantified by real-time PCR as described by Quail et al. [137] and sequenced on an Illumina Genome Analyzer IIx (GA) machine using the paired-end module and 40 bp reads from each end.

2.5.2 Non-tethered conformation capture

The non-tethered HindIII library, or HindIII Hi-C library, was prepared from 25 million crosslinked GM12878 cells as previously described [113]. The non-tethered MboI library was prepared in the same way, but, for digestion, 200 units of MboI (NEB) were used.

2.5.3 Accession numbers

Sequencing data and binary contact catalogues are publicly available in NCBI SRA under the accession number SRA025848.

Chapter 3 Genome-wide Patterns of Chromatin Contacts

3.1 Preview

The focus of this chapter is the analysis of genome-wide contact frequency patterns that were obtained using TCC. It describes various approaches that were applied to analyzing high-throughput contact data, which are a novel type of dataset.
The chapter also provides a comprehensive description of the genome architecture as seen by genome-wide chromosome conformation capture, and compares these results and conclusions with those of previous studies in the field. These include many novel insights about the spatial organization of the genome, including the internal organization of chromosome territories and the interactions between them.

At a technological level, the results have been compared between tethered and non-tethered conformation capture strategies, demonstrating that tethering is essential for a proper understanding of many architectural features.

3.2 Introduction

The previous chapter described how TCC can be used to map chromosome contacts at a genome-wide level. The output of this method is a contact catalogue (Figure 2.7) which, in the case of our tethered-HindIII library, contains almost one hundred million contacts (Table 2.1). These contacts cover all non-repetitive regions of all chromosomes and represent the genome architecture in human lymphoblastoid cells. However, a direct look at the very long list of chromosomal contacts in the catalogue can hardly inform one’s understanding of the genome architecture. In order to analyze the catalogue in an informative way, appropriate theoretical approaches and computational tools must be developed.

This chapter describes the tools and strategies that we used for analyzing the contact catalogues. It also describes what we learned about the genome architecture from these analyses. Some of the analysis strategies that we used were inspired by Lieberman-Aiden and colleagues [113] but were modified to fit our understanding of the data. The other analysis strategies were developed by us. Most of these techniques are described in the Materials and methods section of this chapter.
The catalogue-based analysis of the genome architecture in this chapter¹ can be divided into two parts: the analysis of all contacts within each chromosome (intrachromosomal contacts) and the analysis of those between different chromosomes (interchromosomal contacts). We analyze intrachromosomal contacts first because understanding them is fundamental to understanding contacts between chromosomes.

Finally, this chapter describes the importance of the tethering strategy and the increased signal-to-noise ratio that it provides in understanding the interchromosomal contacts. Interchromosomal contacts have low frequencies when compared to intrachromosomal contacts. As a result, their signal in the contact catalogue is inherently low. A high level of intermolecular ligation noise in the conformation capture experiment can easily lower the signal-to-noise ratio of interchromosomal contacts to levels that make them impossible to detect.

¹ All analyses in this chapter use the Tethered-HindIII catalogue, unless otherwise stated.

3.3 Results

3.3.1 Relative amounts of intrachromosomal and interchromosomal contacts

After the total number of contacts, the most basic parameter of a catalogue is its fraction of interchromosomal and intrachromosomal pairs. In an intrachromosomal pair both reads align to the same chromosome, whereas in an interchromosomal pair, they align to two different chromosomes. In a final catalogue (Table 2.1, page 39), each pair represents a contact.

We observed that in the tethered-HindIII library about 70.0% of all contacts are intrachromosomal while only 30.0% are interchromosomal². The true fraction of interchromosomal contacts in a nucleus is expected to be even smaller than 30% as an unknown amount of the observed interchromosomal contacts correspond to noise, and also, some of the intrachromosomal pairs that are removed when filtering self-looping pairs represent true intrachromosomal contacts.
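The intra/inter split described above amounts to comparing the chromosome of the two reads in each pair. The following is a minimal sketch, not the thesis pipeline: it assumes that a catalogue can be reduced to a list of (chromosome, chromosome) tuples, and the function name `contact_fractions` and the toy pairs are hypothetical.

```python
from collections import Counter


def contact_fractions(pairs):
    """Return the fraction of intra- and interchromosomal pairs.

    `pairs` is an iterable of (chrom_of_read1, chrom_of_read2) tuples.
    """
    counts = Counter("intra" if a == b else "inter" for a, b in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty catalogue")
    return {kind: counts[kind] / total for kind in ("intra", "inter")}


# Toy catalogue: 7 pairs within a chromosome, 3 pairs between chromosomes.
toy_pairs = (
    [("chr1", "chr1")] * 4
    + [("chr2", "chr2")] * 3
    + [("chr1", "chr2"), ("chr2", "chr17"), ("chr19", "chr17")]
)
fractions = contact_fractions(toy_pairs)
print(fractions)
```

The toy input is chosen so that the split mirrors the roughly 70/30 proportions reported for the tethered-HindIII library; real catalogues would of course be read from the aligned and filtered pairs.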
These results, which demonstrate the dominance of intrachromosomal contacts in cells, are in agreement with the concept of chromosome territories [51].

In addition to this difference in abundance, intra- and interchromosomal contacts show fundamentally different properties. Therefore, the next two sections of this chapter focus on analyzing each type of contact separately. Intrachromosomal contacts, which due to their dominance and higher frequencies are more straightforward to study, are described first. Interchromosomal contacts, which due to their scattered nature and lower frequencies are more challenging to study, are described second.

² These fractions are slightly different from those displayed in Figure 2.10, page 52, as those are the fractions of ligations and include the self-looping pairs.

3.3.2 Intrachromosomal contacts

In order to visualize the contacts between various parts of a chromosome, we generated a contact frequency matrix from the catalogue (Materials and methods). Such a matrix can be visualized in the form of a heatmap in which the color intensity at each position represents the frequency of contact between the corresponding regions of the chromosome. Figure 3.1 A shows such a contact frequency map for chromosome 2 and Appendix A shows that of all chromosomes.

These maps show numerous contacts, indicating extensive physical association between various regions of each chromosome. It is almost inconceivable that all these contacts are present at the same time and crosslinked in the same nucleus. In other words, the observed patterns in the contact maps must be a result of substantially different chromosome structures in different cells. Consequently, these maps represent an overlay of all structures that are present in the population of cells. Nevertheless, a visual inspection of these maps reveals two structural features of chromosomes. First, contact frequency between two loci is a function of their genomic distance.
Regions close to the diagonal of the map, which correspond to neighboring regions, show very high contact frequencies. With increasing distance from the diagonal, the average contact frequencies decrease. Second, the maps show a plaid pattern with blocks of enriched and depleted contact frequency. This pattern suggests that factors other than genomic distance also contribute to the contact frequency between two loci. These two features are analyzed in more depth below.

Figure 3.1: Contact frequency map. The contact map of chromosome 2 obtained from the HindIII-TCC library. The intensity of the red color in each position of the map represents the observed frequency of contact between the corresponding segments of the chromosome, which are shown on the top and to the left of the map. A pair of tick marks on the ideogram encompasses 4986 HindIII sites. In this and other figures, the white lines in the heatmaps mark the unalignable region of the centromeres. In this map, chromosome 2 is divided into segments that span 277 HindIII sites each, resulting in 258 segments of 1 Mb (Table 3.2). See Appendix Figures A.1 and B.1 for the TCC contact frequency maps of all the other chromosomes and a side-by-side comparison of the maps obtained with and without tethering.

3.3.2.1 Distance dependence of contact frequencies

To better quantify the dependence of contact frequencies on the genomic distance, we measured the genome-wide average contact probability of loci based on their distance (Figure 3.2). These plots show that contact probability decreases rapidly with increasing genomic distance between loci. This decay is very fast, especially at short distances. For example, two regions that are 0.1 Mb apart are about nine times more likely to contact than two regions that are 1 Mb apart.

Figure 3.2: Intrachromosomal contact probability as a function of distance. The genome-wide average probability of contact between chromosomal regions in the HindIII TCC library is displayed as a function of their genomic distance. In panel A only the Y axis is logarithmic, whereas in panel B both axes are in logarithmic scales. The probability values were measured using a bin size of 10,000 bp. For a side-by-side comparison of the same analysis in different libraries see Appendix Figure B.2.

3.3.2.2 Class dependence of contact frequencies: emergence of two classes

The plaid pattern of contact frequency maps (Figure 3.1) indicates that distance is not the only parameter that determines intrachromosomal contact preferences. To further investigate this plaid pattern, we defined the “contact profile” of a region as the ordered list of frequency values for its contacts with all the other regions in the genome (Materials and methods). The Pearson’s correlation coefficient between two intrachromosomal contact profiles is a similarity measure for the corresponding regions’ contact behaviors. Positive correlation values indicate similar contact profiles, values around zero indicate dissimilar contact profiles, and negative values indicate opposite contact profiles. Using this measure, we observed that each chromosome can be divided into two classes of regions with anti-correlated intrachromosomal contact profiles.

To better visualize these correlation patterns, we calculated a correlation matrix from the contact frequency matrix of each chromosome (Materials and methods). In a correlation matrix, the value in each position is the Pearson’s correlation between the contact profiles of the corresponding regions. A correlation matrix can be visualized as a correlation map, in which the color of each region represents the Pearson’s correlation between the corresponding regions’ contact profiles. Figure 3.3 shows such a correlation map for chromosome 2 and Appendix A shows that of all chromosomes. These correlation maps clearly show that two main classes of loci with anti-correlated contact profiles are present in each chromosome.
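As a rough illustration of how a correlation matrix of this kind separates two classes, the sketch below computes Pearson correlations between the rows of a toy contact frequency matrix and then takes the leading eigenvector of the result by power iteration. It is a sketch under stated assumptions rather than the procedure from the Materials and methods: the function names are hypothetical, the toy matrix is invented, no distance normalization is applied, and using the sign of the leading eigenvector as a stand-in for a PCA-based class assignment is an assumption made here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant rows only)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(freq):
    """Correlation between the contact profiles (rows) of a frequency matrix."""
    return [[pearson(ri, rj) for rj in freq] for ri in freq]


def leading_eigenvector(matrix, iterations=200):
    """Power iteration: eigenvector of the largest-magnitude eigenvalue."""
    n = len(matrix)
    vec = [1.0 + 0.1 * i for i in range(n)]  # arbitrary, non-symmetric start
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec


# Toy chromosome with six segments that alternate between two classes;
# same-class contacts are enriched (5), cross-class contacts are depleted (1).
classes = [0, 1, 0, 1, 0, 1]
freq = [[5.0 if ci == cj else 1.0 for cj in classes] for ci in classes]

corr = correlation_matrix(freq)
component = leading_eigenvector(corr)
assignment = ["+" if v > 0 else "-" for v in component]
print(assignment)
```

In this toy case, segments of the same class receive the same sign and segments of different classes receive opposite signs. Which class ends up positive is arbitrary, so real data would need an external reference to orient the sign.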
Namely, in the row or column along each region in the map, the correlation values sharply alternate between positive and negative; all the regions that show positive correlation with a particular region are positively correlated with each other, while all the regions that show negative correlation with that region are positively correlated with each other but negatively correlated with the first group. In addition, almost all positions in the map show positive or negative correlation values; few, if any, positions have correlation coefficients close to zero. These observations strongly reinforce the notion that two main classes of regions are present in each chromosome.

The class affiliation of each chromosomal region can be determined using principal component analysis (PCA) [91]. Using this procedure, we calculated an EIG variable for each segment of each chromosome (Materials and methods). Positive EIG values mark one class (class A) and negative values mark the other (class B) (Figure 3.3). After genome-wide class assignment, we measured the genome-wide average intrachromosomal contact profile correlation for each class (Figure 3.4 A). These measurements clearly show that the contact profiles of regions in the same class are correlated while those of regions in different classes are anti-correlated.

Figure 3.3: Correlation map and class assignment. Correlation map and class assignment for chromosome 2. The color of each position in the map represents the Pearson’s correlation between the intrachromosomal contact profiles of the corresponding two segments of the chromosome to the left and on top. The color key is shown on the bottom-right corner of the figure. To assign each segment to one of the two classes (orange or purple assignment blocks on the top of the map), principal component analysis (PCA) is used to calculate the EIG variable (plotted on top of the assignment blocks) for each segment. Segments with a positive EIG are assigned to one class, while those with a negative EIG are assigned to the other. Segments with EIG values close to zero have not been assigned to a class. Note that the size of each chromosome band is based on the number of HindIII sites it contains. For this map, chromosome 2 is divided into 517 segments of 0.5 Mb, each spanning 138 HindIII sites (Table 3.2). See Figure A.2 in Appendix A for correlation maps and class assignments of all the other autosomal chromosomes.

It is not only the contact profile similarity that distinguishes these two classes. Not surprisingly, direct contacts within a class are also enriched and contacts between classes are depleted. To illustrate, at any given genomic distance, regions in the same class contact each other more frequently than regions in different classes (Figure 3.4 B).

In conclusion, the TCC data suggest that all chromosomes can be divided into two classes of regions such that regions within a class prefer to associate with other regions from the same class. This finding confirms the results of a previous study [113].

Figure 3.4: Correlation and contact frequency as a function of class and distance. (A) The genome-wide average Pearson’s correlation coefficient between the intrachromosomal contact profiles of two class A segments (orange), two class B segments (dark purple), and a class A and a class B segment (gray) plotted against their genomic distance. (B) The genome-wide averaged intrachromosomal contact frequencies for two class A (orange), two class B (purple) and a class A and a class B (gray) segments plotted against their genomic distance. The Y-axis is plotted in a logarithmic scale. For both panels, each chromosome is divided into segments of 138 HindIII sites, resulting in 6,000 segments of 0.5 Mb.

3.3.2.3 Functional differences between the two classes: active class and inactive class

To investigate functional differences between the two classes, we aligned EIG with various measures of functional activity along each chromosome (Figure 3.5). These measures of functional activity include DNase I sensitivity, RNA pol II binding, gene expression, gene density, and various histone modifications.

Figure 3.5: Functional activity in each class. Various measures of chromatin activity have been aligned to the EIG along chromosome 2. These measures include DNaseI sensitivity (DNaseI), RNA polymerase II binding (Pol II), gene expression (expression), gene density, and several histone modifications. Positive EIG values (active class or class A) are plotted in orange while negative EIG values (inactive class or class B) are plotted in dark gray. The colors of all the other plots match that of the EIG plot. The genome-wide Spearman correlation coefficient (ρ) between EIG and each marker is shown on the right of each plot. The X-axes for all markers are in arbitrary units with 100 corresponding to the maximum genome-wide signal (Materials and methods). In this plot, chromosome 2 was divided into segments spanning 138 HindIII sites (Tables 3.2 and 3.3).

In all chromosomes, DNaseI sensitivity, which marks the accessible regions of chromatin [148], shows enrichment in class A (Figure 3.5). Similarly, both gene density and gene expression levels are enriched in class A. Class A regions also show significantly higher binding by RNA polymerase II, which marks actively transcribed genes [95]. Moreover, various activating histone modifications show enrichment in class A. Namely, histone 3 lysine 9 acetylation (H3K9ac), histone 3 lysine 4 mono-, di-, and trimethylation (H3K4me1, H3K4me2, and H3K4me3), histone 3 lysine 27 acetylation (H3K27ac), histone 3 lysine 36 trimethylation (H3K36me3), and histone 4 lysine 20 monomethylation (H4K20me1) all show substantial enrichment in this class.
These histone modifications are known to mark active transcription, permissive euchromatin, transcription elongation, or other forms of functional activity [184, 190, 13, 161, 163, 189, 126, 103, 175]. Interestingly, the only histone modification we analyzed that does not show high enrichment in class A is histone 3 lysine 27 trimethylation (H3K27me3), which is a repressive mark [13, 31, 174]. Finally, a visual comparison of class assignment with chromosomal bands shows that class A regions tend to coincide with the lighter chromosomal bands while class B regions tend to coincide with the dark chromosomal bands (Figures 3.5 and A.5).

The combination of these observations suggests that class A represents most of the functionally active regions of the genome while class B harbors most of the functionally silent regions of the genome. Here, we will refer to class A also as the “active class” and to class B also as the “inactive class.”

3.3.2.4 Active and inactive classes are affected differently by centromeres

Having observed the functional differences between the two classes, we asked whether there are any differences between the active and inactive classes in forming long-range intrachromosomal contacts. We observed that, compared to the active regions (class A), inactive regions (class B) generally prefer to associate with their neighboring regions (Figure 3.4 B).

A special case of this behavior was observed in the interactions between inactive regions of large chromosomes (that is, chromosomes 1-5, 8, and 10). In these chromosomes, the similarity of contact profiles decreases abruptly for inactive regions separated by the centromere. Consequently, only inactive regions in the same chromosome arm have similar contact profiles (Figure 3.6 A). The frequency of contacts between inactive regions in different chromosome arms is also substantially lower than would be expected from their sequence separation alone (Figure 3.6 B). These characteristics give rise to a distinctive four-block pattern in the “inactive-only” correlation matrices of the larger chromosomes (Figures 3.7 and A.3).

Figure 3.6: Effect of centromere on contact profile similarity and contact frequency in large chromosomes. (A) Average correlation between intrachromosomal contact profiles of active segments (left) and inactive segments (right) on the same arm (red) or different arms (blue) of chromosomes 1, 2, 3, 4, 5, 8, and 10 as a function of distance between segments. (B) Average contact frequency between active segments (left) and inactive segments (right) on the same arm (red) or different arms (blue) of chromosomes 1, 2, 3, 4, 5, 8, and 10 as a function of their distance. Y-axes are in a logarithmic scale. For these analyses, the unalignable parts of the centromeres have been assumed to be 3 Mb long [43], but the results do not significantly change with centromeres as long as 5 Mb.

The active regions on different sides of the centromere, by contrast, show very similar contact profiles (Figure 3.6 A) and associate just as frequently as those on the same side (Figure 3.6 B). As a result, the four-block pattern does not exist in the “active-only” correlation matrices of the large chromosomes (Figures 3.7 and A.3). These results suggest that, in larger chromosomes, the centromere limits the physical association between inactive regions from opposing chromosome arms whereas it does not hinder association between the active regions.

3.3.3 Interchromosomal contacts

The contacts between chromosome territories constitute only 30% of all pairs in a catalogue. As described earlier (see 2.3.3, page 49), this 30% of the catalogue also likely harbors the majority of noise contacts.
Our tethering strategy, which drastically reduces the random ligation noise (Figure 2.10), together with relatively deep sequencing of libraries (Table 2.1), allowed us to analyze these interchromosomal contacts in detail.

We first measured the total number of contacts observed between all pairs of chromosomes (Figure 3.8). Confirming a previous study [113], the pairwise frequencies show that contacts are enriched between small gene-rich chromosomes. This observation is consistent with the preferential localization of these chromosomes near the center of the nucleus [26, 50].

Figure 3.7: Active-active and inactive-inactive correlation maps. Active-active (left) and inactive-inactive (right) correlation maps for chromosome 2. The color intensity of each point in the map represents the Pearson’s correlation between the “active-only” (left) or “inactive-only” (right) contact profiles of the corresponding segments, whose location in the chromosome has been marked by an arrow on the ideogram of chr2 in the middle. The ideogram shows the positions of the active (orange bars on the left) and inactive (purple bars on the right) segments. The different shades of orange and purple are used only to differentiate the adjacent segments. Each correlation map is calculated by only considering contacts between active segments (left) or inactive segments (right). For this map, chromosome 2 is divided into 517 segments of 0.5 Mb, each spanning 138 HindIII sites (Table 3.2). For similar active-active and inactive-inactive maps of the other large chromosomes see Appendix Figure A.3.

Figure 3.8: Pairwise whole-chromosome contact frequencies. The observed/expected frequencies of total contact between all pairs of chromosomes obtained from HindIII-TCC are shown. Red and blue respectively indicate enrichment and depletion of contacts compared to the expected value according to the color key on the bottom-right corner.
Expected values were calculated based on the size and number of observed reads per chromosome when assuming completely random ligations. See Appendix Figure B.4 for a side-by-side comparison of tethered and non-tethered libraries.

A more comprehensive analysis of interchromosomal contacts, however, must extend beyond the whole-chromosome level and focus on the different loci in a chromosome. For this task, we used the genome-wide contact frequency matrix (Materials and methods) to analyze contact between segments of different chromosomes.

3.3.3.1 Interchromosomal contacts have low frequencies

In a contact frequency matrix where the genome is divided into about 3,000 segments of about 1 Mb in size (F 3004: H = 277, K = 3004, Materials and methods), there are 489,256 intrachromosomal elements but 8,534,760 interchromosomal elements (Tables 3.2 and 3.3). In other words, there are about 17 times as many possibilities for interchromosomal contacts as there are for intrachromosomal contacts, yet interchromosomal contacts constitute only 30% of the contact catalogue. Consequently, the average frequency of interchromosomal elements in the matrix is about 40-fold lower than that of the intrachromosomal elements. This calculation suggests that interchromosomal contacts may have much lower frequencies than intrachromosomal contacts.

To examine this possibility, we sorted intra- and interchromosomal contacts in the genome-wide contact frequency matrix based on their frequencies and compared the most frequent contacts of each type (Figure 3.9). The results clearly show that interchromosomal contacts have lower frequencies compared to intrachromosomal contacts, even when only long-range intrachromosomal contacts are considered. They also suggest that the interchromosomal contacts in the catalogue do not correspond to only a few elements in the contact matrix; rather, they are distributed among many low-frequency elements.
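The element-count comparison above can be checked with a short calculation. The sketch below uses only the two element counts and the 30% share quoted in the text; the function and variable names are illustrative and are not taken from the thesis.

```python
# Element counts quoted in the text for the 1 Mb genome-wide matrix.
INTRA_ELEMENTS = 489_256
INTER_ELEMENTS = 8_534_760
# Share of catalogue contacts that are interchromosomal (about 30% in the text).
INTER_SHARE = 0.30


def element_ratio(inter_elements: int, intra_elements: int) -> float:
    """Number of interchromosomal matrix elements per intrachromosomal element."""
    return inter_elements / intra_elements


def mean_frequency_fold(inter_share: float, inter_elements: int,
                        intra_elements: int) -> float:
    """Mean intrachromosomal element frequency divided by the mean
    interchromosomal element frequency, for a catalogue normalized to 1."""
    intra_mean = (1.0 - inter_share) / intra_elements
    inter_mean = inter_share / inter_elements
    return intra_mean / inter_mean


ratio = element_ratio(INTER_ELEMENTS, INTRA_ELEMENTS)
fold = mean_frequency_fold(INTER_SHARE, INTER_ELEMENTS, INTRA_ELEMENTS)
print(f"inter/intra elements: {ratio:.1f}")
print(f"intra/inter mean frequency: {fold:.1f}")
```

With these inputs the two values come out between 17 and 18 and between 40 and 41, in line with the "17 times" and "about 40-fold" figures quoted above.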
Figure 3.9: Comparison of intra- and interchromosomal contact frequencies. In an F 3004 contact frequency matrix (Materials and methods), the 200,000 most frequent interchromosomal contacts (red) and the 200,000 most frequent medium- or long-range intrachromosomal contacts (blue) are sorted according to their frequency, and their frequency is plotted against their sorting rank. The short-range contacts, defined for this specific plot as intrasegment contacts or contacts between adjacent segments, have been excluded. The intrasegment contacts constitute 40%, and the contacts between adjacent segments constitute 16%, of all intrachromosomal contacts in the contact frequency matrix. When these short-range intrachromosomal contacts are also included, the 5,000 most frequent intrachromosomal contacts are 15 times more frequent than the 5,000 most frequent interchromosomal contacts (data not shown). Y-axis is in a logarithmic scale.

3.3.3.2 Active and inactive classes show different propensities to interchromosomal contacts

These low-frequency contacts must be identified and characterized in order to describe interchromosomal contacts in general terms and understand the factors that determine their patterns and frequencies. To do so, we first produced genome-wide enrichment maps (Figure 3.10) to visualize the distribution of interchromosomal contacts. In some columns and rows of the map, many contacts are enriched, whereas in others few, if any, contacts are enriched. This pattern suggests that some loci are inherently more likely to form interchromosomal contacts.
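An enrichment map of the kind described above divides each observed frequency by an expected value. The sketch below is only an illustration of that ratio: it assumes that, under completely random ligation, the expected count for a pair of segments is proportional to the product of their total contact counts. The actual expected-value calculation is given in Materials and methods and may differ; the function name and the toy counts are hypothetical.

```python
def enrichment_map(observed):
    """Return observed/expected ratios for a square contact count matrix.

    Assumption (not from the thesis): under random ligation the expected
    count for segments i and j is total_i * total_j / grand_total.
    """
    totals = [sum(row) for row in observed]
    grand_total = sum(totals)
    if grand_total == 0:
        raise ValueError("contact matrix contains no contacts")
    result = []
    for i, row in enumerate(observed):
        out_row = []
        for j, count in enumerate(row):
            expected = totals[i] * totals[j] / grand_total
            # Leave the ratio undefined where nothing is expected.
            out_row.append(count / expected if expected > 0 else float("nan"))
        result.append(out_row)
    return result


# Hypothetical toy counts for three segments (symmetric matrix).
toy_counts = [
    [4, 8, 2],
    [8, 4, 2],
    [2, 2, 6],
]
toy_enrichment = enrichment_map(toy_counts)
```

Ratios above 1 correspond to the enriched (red) positions in such a map and ratios below 1 to the depleted (blue) positions.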
To identify the loci that are more prone to forming interchromosomal contacts, we defined the “interchromosomal contact probability index” (ICP) as the sum of a region’s interchromosomal contact frequencies divided by the sum of its inter- and intrachromosomal contact frequencies:

$$ICP_i = \frac{\sum_{j=1,\; j \notin chr_i}^{K} f_{i,j}}{\sum_{l=1,\; l \neq i}^{K} f_{i,l}} \qquad (3.1)$$

where $chr_i$ denotes segments of the genome that are on the same chromosome as segment $i$, $K$ is the total number of segments for the genome, and $f_{i,j}$ is the frequency of contact between segments $i$ and $j$. ICP, therefore, describes the propensity of a region to form interchromosomal contacts.

Interestingly, we observed large differences in the distribution of ICP between the active and inactive classes (Figures 3.11 and A.5). In the inactive class, the vast majority of regions have relatively low ICPs (Figures 3.12 A and A.4). On the other hand, many active regions show high ICPs.

Figure 3.10: Genome-wide contact enrichment map of chromosome 2. The genome-wide enrichment map for chr2, compiled from the HindIII TCC library. Enrichment is calculated as the ratio of the observed frequency in each position to its expected value; expected values were obtained assuming completely random ligations (Materials and methods). Red and light blue respectively indicate enrichment and depletion of a contact in accordance with the color key on the bottom. Chromosome 2 (left) extends along the Y-axis while all 23 chromosomes (top) extend along the X-axis. The zoomed panel to the right of each map magnifies the section that corresponds to contacts between the small arm of chromosome 2 and chromosomes 20, 21, 22, and X. A pair of tick marks on chromosome 2 spans 5022 HindIII sites. This map is based on an F 1500 matrix (Materials and methods).

Figure 3.11: Alignment of ICP and EIG along chromosome 2.
ICP values above the dashed blue line are significantly above the average ICP for inactive segments of the chromosome. Therefore, the blue line separates the high-ICP segments. The Y-axis for ICP is in a logarithmic scale. See Appendix Figure A.5 for other chromosomes. An F 6002 contact frequency matrix (Tables 3.2 and 3.3) was used for these plots.

These results suggest that most interchromosomal contacts are mediated by the active class. However, in each chromosome, a few inactive regions show high ICPs. We found that most of these inactive regions flank the unalignable regions of the centromeres (Figures 3.12 A and A.4), and their high ICP is due to association with the centromeric regions of other chromosomes (Figure 3.13 A). These results are consistent with the previously described clustering of centromeres in interphase cells [6, 172].

Prompted by this observation, we further characterized the general interaction patterns between the centromeres of different chromosomes (Figure 3.13, Materials and methods). We found that the centromeres of different chromosomes frequently associate (Figure 3.13 A). We also found that the centromeric regions of the acrocentric chromosomes are more likely to contact each other than the centromeric regions of the metacentric chromosomes (Figure 3.13 B). This finding is consistent with the previously documented clustering of acrocentric centromeres around the nucleoli [172]. Finally, we observed the highest centromere contact frequencies between chromosomes 13 and 21 and between chromosomes 14 and 22 (Figure 3.13 C). This observation is also in agreement with previous fluorescence imaging studies in lymphocytic cells [6, 7].

Figure 3.12: Interchromosomal contact probability (ICP). (A) For all segments of chromosome 2, the interchromosomal contact probability index (ICP) is plotted against EIG.
Segments with a positive EIG (orange dots) belong to the active class, while those with a negative EIG (brown dots) belong to the inactive class. The blue dashed line separates high-ICP segments: values above the line are significantly larger than the average ICP for inactive segments. Hollow red dots mark those inactive segments with large ICPs that also flank the centromere. See Appendix Figure A.4 for similar plots of all autosomal chromosomes. (B) For all active segments in the genome, ICP is plotted against the binding of RNA polymerase II (pol II). Pol II binding values are reproduced from a ChIP-seq study [95] on the GM12878 cells and are in arbitrary units based on alignment frequency (Materials and methods). The p-value of the correlation is smaller than $10^{-16}$. The X-axis is plotted in a logarithmic scale. (C) For seven loci on the small arm of chromosome 11, ICP is plotted against their average distance from the edge of the chr11 territory as measured by FISH [117]. Positive distance values denote localization within the bulk territory, while negative values denote localization away from the bulk territory. Orange and brown dots represent assignment to the active and inactive classes, respectively. Error bars represent the 95% confidence interval [117]. An F 6002 contact frequency matrix (Tables 3.2 and 3.3) was used for these plots.

Figure 3.13: Clustering and association patterns of the centromeres. (A) The genome-wide average frequency of interchromosomal contacts between segments at a given relative distance from the centromere and centromeric regions (i.e., segments directly flanking the centromeres) of other chromosomes. (B) Average interchromosomal contact frequency of segments around the centromeres of acrocentric (red) and metacentric (blue) chromosomes with the centromeres of acrocentric chromosomes (13, 14, 15, 21, and 22). (C) Association profile of the centromere of each chromosome with other centromeric regions based on the TCC data.
The height of the horizontal line at each position represents the frequency of that association relative to other centromeric associations of the centromere on the Y-axis. Association frequencies between the centromeres of chromosomes 13 and 21 (the green highlighting) and 14 and 22 (the red highlighting) are most pronounced, although centromere 21 frequently associates with centromere 18 as well. Moreover, the centromeres of chromosomes 4 and 10 also appear to associate frequently. See Materials and methods for details of frequency calculations. An F 6002 contact frequency matrix (Tables 3.2 and 3.3) was used for these plots.

In contrast to the inactive class, many regions in the active class have high ICPs. In fact, the vast majority of regions with a large ICP belong to the active class (Figures 3.12 A and A.4). For example, in chromosome 2, 90% of the regions with an ICP in the top 25% are members of the active class (Figure 3.12 A). Nevertheless, not all the active regions have a large ICP. For instance, about 40% of the active regions in chromosome 2 form relatively few interchromosomal contacts, and their ICPs are similar to those of the inactive regions (Figure 3.12 A).

This nonuniform interchromosomal contact behavior in the active class may reflect functional variations within this class. Indeed, we observed that those active regions with larger ICPs also show higher RNA polymerase II binding (Figure 3.12 B) as well as higher total gene expression (Pearson’s r = 0.53, p-value < $10^{-15}$), indicating that high transcriptional activity is associated with an increased probability of forming interchromosomal contacts. These results support the notion that transcription factories are important in stabilizing interactions between chromosomes [154].

We also asked whether the regions’ differences in ICP are reflected in their localization within their chromosomes’ territories.
Previous fluorescence imaging studies have shown that highly transcribed regions can frequently extend outside of the bulk territory of their chromosome in the form of large loops [187, 118] (see Chapter 1). One of these studies analyzed several loci on chromosome 11 in lymphoblastoid cells [117]. Remarkably, we found that the reported average distances of these loci from the edge of their chromosome territory are strongly correlated with their ICPs (Pearson’s r = 0.98, p-value < $10^{-3}$) (Figure 3.12 C). Moreover, the loci that showed preferential localization in the bulk of the chromosome territory in the imaging study are inactive in the TCC data, while those that showed more frequent localization beyond the bulk of the territory are active and have large ICPs (Figure 3.12 C). While more fluorescence imaging experiments are required to extend this observation to the entire genome, these examples suggest that ICP also reflects the preferred position of a locus within the territory of its chromosome. They also indicate that spatial accessibility is important for interchromosomal contacts.

3.3.3.3 Indiscriminate interactions between chromosome territories

The analyses described above focus on the identity of the regions that frequently form interchromosomal contacts. To examine the patterns and properties of the contacts themselves, we analyzed those interchromosomal contacts with frequencies clearly above noise level. We refer to these contacts as “significant interactions.”³

Most of these significant interactions are formed between active regions, in particular between those with high ICPs (Figure 3.14 A). Interestingly, most of these regions interact with numerous other high-ICP active regions in other chromosomes. For instance, consider the 14 segments on chromosome 19 that, in an F 3004 contact frequency matrix (Tables 3.2 and 3.3), are classified as high-ICP (Figure 3.14 A, orange blocks on chr19 ideogram).
Each of these regions forms significant interactions with at least 64%, on average 80%, and at most 93% of all the high-ICP active regions on chromosome 11 (Figure 3.14). Moreover, each of these regions on chromosome 19 forms significant interactions with at least 53%, on average 71%, and at most 77% of all the high-ICP active regions in the genome (Figure 3.15). These measurements emphasize that numerous different interactions take place between different chromosome territories.

It is also interesting that none of these significant interactions appears to be dominant, and they all have relatively low frequencies (Figures 3.14 A and 3.15). In the case of chromosomes 11 and 19 (Figure 3.14 A), the significant interchromosomal interactions between high-ICP active regions are on average more than seventy times less frequent than intrachromosomal contacts between neighboring 1 Mb regions. Combined with their numerosity, these low frequencies suggest that each interchromosomal interaction is present in only a fraction of the cell population.

³ In this context, the word “interaction” does not necessarily imply a functional interplay between the loci; rather, it only refers to physical associations in the 3D space with statistically significant frequencies.

Figure 3.14: Interactions between chromosome 11 and chromosome 19. (A) Plotted are the frequencies of all contacts between high-ICP active segments on chr19 and all the segments on chr11. Contacts involving the high-ICP active segments on chr11 are shown as purple dots and contacts involving all other segments of this chromosome are shown as gray triangles. The contacts plotted between vertical dotted lines involve the same high-ICP active segment on chr19 and all the segments of chr11. Frequencies above the dashed blue line are significantly higher than the average frequency of contacts between high-ICP active segments on chr19 and inactive segments on chr11 (p-value < 0.04, non-parametric). These frequencies can be considered significantly larger than the noise level, defined as the false-positive contact frequencies due to random intermolecular ligations. 14 segments on chr19 and 28 segments on chr11 were classified as high-ICP active. The locations of the high-ICP active segments in chr19 are marked by an orange bar on the ideogram of the chromosome on the bottom of the panel. (B) For all possible pairs of high-ICP active segments from chromosomes 11 and 19, their contact frequency has been plotted against the product of their ICPs. The same interactions are marked in purple in (A). The p-value of the correlation is nominal. See Appendix Figure A.6 for similar plots of chr11 with all the other chromosomes. (C) The histogram of Pearson’s correlation of interchromosomal contact frequency with the product of ICP values for high-ICP active segments of all 231 pairwise combinations of autosomal chromosomes. The vertical purple line marks the average Pearson’s correlation at 0.47. An F 3004 contact frequency matrix (Tables 3.2 and 3.3) was used for these plots.

Even though none of the interchromosomal interactions shows dominance, different interactions have different frequencies. For example, the most distal high-ICP active region on the q arm of chromosome 19 (bottom panel in Figure 3.15) clearly prefers some of the high-ICP segments on chromosome 8 to others on the same chromosome. Strikingly, we observe a correlation between this contact frequency and the ICP of the contact partner (green line in Figure 3.15); the larger the ICP of the contact partner, the higher the observed frequency of its interaction. In fact, some active regions in the genome show increased contact frequency (i.e., a peak) with all fourteen high-ICP active segments on chromosome 19 (Figure 3.15).
Almost all such regions have very high ICPs themselves. These observations indicate that contact frequency peaks often coincide with ICP peaks (Figure 3.15).

Figure 3.15: Interactions between high-ICP active regions on chromosome 19 and other high-ICP regions in the genome. Each horizontal panel shows the contact profile of one of the fourteen high-ICP active segments on chr19 with all high-ICP active segments on other autosomal chromosomes (red, purple, and pink bars and the Y-axis on the left). Contact frequencies above the black dashed line are significantly larger than the average contact frequency between the marked active segment on chr19 and all inactive segments in the genome (p-value < 0.04, non-parametric). Such contact frequencies can be safely assumed to be above the background random ligation noise. The green line corresponds to the Y-axis on the right and represents the ICP values of the target segments on each chromosome. The ideograms of chr19 (on the right of each plot) mark the location of the active segment of each panel with an orange bar. ICP values have been normalized by their average for high-ICP segments of each chromosome to allow side-by-side display in the same plot. An F 3004 contact frequency matrix (Tables 3.2 and 3.3) was used for these plots.

The correspondence between the observed contact frequency and ICP extends to both of the interaction partners. To illustrate, the contact frequency between a pair of high-ICP active regions shows a positive correlation with the product of their ICPs (Figures 3.14 B,C and A.6). These observations suggest that for many high-ICP active regions the probability of forming interchromosomal interactions is independent of the identity of their interaction partners. We already established that ICP can be an indicator for the relative position of a region from the edge of the chromosome territory. This correlation, therefore, suggests that the propensity for forming interchromosomal contacts between high-ICP active regions is largely governed by the spatial accessibility of the contact partners.

In summary, our observations indicate that most active regions do not interact exclusively with a few specific regions on other chromosomes; rather, they can form interactions indiscriminately with many high-ICP active regions. These contacts may only be present in the fraction of cells where both interaction partners are mutually accessible.

3.3.3.4 Confirming indiscriminate interactions using 3D FISH

To confirm the existence of interchromosomal interactions between high-ICP active regions, we measured the colocalization frequency of one probe on chromosome 19 with each of four different probes on chromosome 11 using 3D DNA FISH [16] (Figure 3.16 A). The chromosome 19 probe was located in a high-ICP active region while the four chromosome 11 probes were equally split between inactive and high-ICP active regions (Table 3.1). Based on the conformation capture data, it is expected that a small fraction of the cell population would harbor each interaction between high-ICP regions; therefore, we analyzed about 1,000 cells for each pair of probes to estimate their colocalization frequencies accurately and with statistical confidence (Figure 3.16 B).

These measurements showed that, in a small but significant fraction of the cells, the high-ICP active region on chromosome 19 colocalizes with each of its active counterparts on chromosome 11 (Figure 3.16 C). In contrast, the same region on chromosome 19 is unlikely to localize in proximity to either of the inactive regions on chromosome 11 (Figure 3.16 C).
These results support the conclusion that high-ICP active regions on different chromosomes can interact and that each interaction occurs in only a small fraction of the cell population.

3.3.4 The effects of tethering on the analysis results

All the analyses described above are based on our HindIII-TCC library, which was generated using the tethering strategy. While the standard conformation capture assays, which involve a dilution strategy for preventing intermolecular ligations (see Chapter 2), have been used extensively during the past decade and are well accepted by the scientific community, the tethering strategy is a novel and fundamentally different approach to conformation capture experiments. It is, thus, important to evaluate the effects of this strategy on the results of conformation capture experiments and the conclusions we have made here. Appendix B focuses on such a comparison.

Figure 3.16: FISH experiments confirming indiscriminate interchromosomal contacts. (A) The layout of the 3D-FISH experiments where the localization of a high-ICP active locus on chr19 (H0) relative to four loci on chr11 (H1, H2, L1, and L2) was analyzed in about 1,000 cells per pair of loci. H1 and H2 are high-ICP active, while L1 and L2 are inactive. The blocks on the chromosomes’ ideograms mark the position of each locus (orange for high-ICP active and brown for inactive), and the arrows mark the pair combinations that are analyzed (purple for active-active and grey for active-inactive). (B) An example nucleus from each pair of loci analyzed in 3D-FISH. Nuclei are counterstained with DAPI (blue). In all four nuclei, the hybridization signal of H0 is shown in red and that of the other locus is shown in green. (C) Cumulative percentage of nuclei that show a pair of hybridization signals closer than a given distance is plotted. Only the closest pair of signals for each nucleus is considered. 1,011, 987, 976, and 998 total nuclei were analyzed in duplicates for H0-L1, H0-L2, H0-H1, and H0-H2 respectively. Distances smaller than 0.6 μm (dashed blue line, arbitrarily selected for visualization purposes) represent colocalizations in a close vicinity where a direct interaction between loci is possible. Because colocalization is required but not sufficient for a direct contact, these values likely provide a ceiling for the fraction of cells that harbor a direct contact between these loci.

Table 3.1: Probes for 3D-FISH. For each probe, its class assignment, ICP status, corresponding BAC clone, and genomic location are shown.

Probe  Status             BAC          Chr    Start position  End position
H0     High-ICP active    RP11-50I11   Chr19  49,676,368      49,847,517
H1     High-ICP active    RP11-169D4   Chr11  72,224,886      72,323,389
H2     High-ICP active    RP11-770J1   Chr11  118,245,278     118,439,596
L1     Low-ICP inactive   RP11-651M4   Chr11  27,659,448      27,827,522
L2     Low-ICP inactive   RP11-220C23  Chr11  40,972,099      41,132,971

In Appendix B, the tethered and non-tethered (i.e., dilution-based) libraries produced here and the non-tethered library produced in another study [113] are compared side-by-side. This comparison makes two points clear. First, in analyzing the intrachromosomal contacts, the tethering strategy does not produce different outcomes when compared to the non-tethering strategies. This observation indicates that tethering is a reliable method and does not introduce systemic biases that make its results unreproducible by the previous non-tethering strategies. Second, in analyzing interchromosomal contacts, tethering produces very different outcomes compared to the standard dilution-based strategy. The common theme of this difference is that only in the tethered libraries can any significant difference between the ICPs of different regions or the frequencies of different contacts be observed.
This observation suggests that the improved signal-to-noise ratios in the tethered libraries are crucial for the ability to measure these contacts altogether.

3.4 Discussion

We have used the TCC data to study the human genome architecture in a systematic way. Our studies not only confirm some previously observed results, they also provide many new insights into the genome architecture. These insights can be divided into two general categories: the internal organization of chromosome territories and the interactions between them.

Regarding the internal organization of chromosome territories, we observed that many different contacts are possible between various regions in the same chromosome and that these contacts show a wide range of frequencies (Figures 3.1 and 3.2), suggesting that the 3D organization of each chromosome is different in different cells. In other words, contacts between various parts of a chromosome are not absolute but rather probabilistic.

Despite its probabilistic nature, the 3D organization of a chromosome is not random; in fact, certain factors determine the association preferences and probabilities between different loci. In this regard, two factors can be considered the most consequential. The first factor, which has the more pronounced effect of the two, is the genomic distance between loci. The probability of contact between loci on the same chromosome decreases rapidly with increasing distance (Figure 3.2) [113]. The other important factor is what we refer to as “class”: the loci in each chromosome can be separated into two classes such that, after normalizing for genomic distance, loci that belong to the same class have higher contact frequencies compared to loci that belong to different classes (Figures 3.3 and 3.4 B). As a result, loci that belong to the same class share similar contact profiles whereas loci that belong to different classes show anti-correlated contact profiles (Figures 3.3 and 3.4 A) [113].
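The statement that same-class loci share similar contact profiles while different-class loci show anti-correlated profiles can be illustrated with a toy calculation. The sketch below assumes a small, already distance-normalized contact matrix in which same-class contacts are more frequent than different-class contacts; the matrix values and function names are hypothetical and are not data or code from this study.

```python
import math


def pearson(x, y):
    """Pearson's correlation of two equal-length sequences.

    Constant profiles (zero variance) are not handled in this sketch.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def profile_correlations(matrix):
    """Correlate the contact profile (row) of every segment with every other."""
    n = len(matrix)
    return [[pearson(matrix[i], matrix[j]) for j in range(n)] for i in range(n)]


# Hypothetical normalized contacts: segments 0 and 1 form one class,
# segments 2 and 3 the other; same-class contacts are more frequent.
toy_matrix = [
    [10, 10, 2, 2],
    [10, 10, 2, 2],
    [2, 2, 10, 10],
    [2, 2, 10, 10],
]
toy_corr = profile_correlations(toy_matrix)
```

In this toy case the same-class pair (0, 1) has a correlation of 1 and the different-class pair (0, 2) has a correlation of -1, which is the extreme form of the two-block pattern described above.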
The two classes that can be defined based on intrachromosomal contacts are not symmetrical with respect to their function, sequence structure, or 3D organization. Functionally, only one of the two classes (the active class) shows enrichment for various markers of functional activity such as DNaseI sensitivity, RNA polymerase II binding, gene expression, and activating histone modifications (Figure 3.5) [113]. Structurally, the active class is enriched for known genes (Figure 3.5) and is more likely to coincide with the lighter bands of each chromosome (Figures 3.11 and A.5). The other class (the inactive class), on the other hand, is largely depleted of known genes (Figure 3.5) and tends to coincide with the darker bands of each chromosome (Figures 3.11 and A.5).

In addition to these differences, the two classes differ in their 3D organization as well. The regions of the inactive class preferentially associate with neighboring inactive regions, while the regions of the active class have a diverse panel of long-range contact partners (Figure 3.4). Additionally, in large chromosomes, inactive regions on opposing sides of the centromere have little interaction with each other (Figures 3.6 and 3.7). At the same time, active regions on different arms show extensive interactions (Figures 3.6 and 3.7). This behavior is consistent with a recent report in D. melanogaster where interactions between some inactive polycomb-associated regions were constrained within a chromosome arm [180, 41]. Consistently, previous fluorescence microscopy studies of chromosomes 3 and 6 in amniotic fluid cell nuclei suggest that different chromosome arms form separate domains [65]. It should be noted that this behavior could not have been observed in mouse (M. musculus), where the karyotype almost exclusively comprises acrocentric chromosomes.
A very interesting 3D organizational difference between the active and inactive classes in each chromosome is their exposure to other chromosomes. Using the interchromosomal contact probability index (ICP, Formula 3.1 on page 84), which is a quantitative measure of interchromosomal contact propensity for each region, we have measured how frequently regions of the active and inactive classes engage in contacts with other chromosome territories (Figures 3.11, 3.12 A). Inactive regions, by and large, have a low tendency to interact with other chromosomes (Figures 3.12 A and A.4). The few exceptions to this trend are the regions that flank the centromeres. These regions show high contact frequencies with the centromeric regions of other chromosomes (Figure 3.13) and their association patterns are in excellent agreement with imaging results from other studies [6, 172, 7]. In contrast to inactive regions, active regions have a wide range of ICP values, and many show a high tendency to interact with other chromosomes (Figures 3.12 A and A.4).

The fact that certain active regions have a higher probability of forming contacts with other chromosomes suggests that, compared to inactive regions, these regions are more exposed to other chromosome territories. One way that this organization is possible would be for the high-ICP active regions to have preferential localization at the border of their chromosome’s territory and inactive regions to have preferential localization inside the territory or close to the nuclear envelope. This model is supported by FISH experiments [118, 104] including the ones that assayed the localization of several loci relative to their territory’s edge (Figure 3.12 C). In these experiments, loci that belong to the inactive class were localized inside the chromosome territory, whereas those that belong to the active class extended outside the bulk territory.
Furthermore, recent fluorescence microscopy experiments in mouse cells also support this model (Figure 3.17) [27]. In these experiments, a chromosome was labeled by a whole chromosome paint and a chromosome-specific exome probe, which labels all the coding sequences [27]. The results show the bulk of the chromosome near the nuclear envelope surrounded by the exome sequences, most of which belong to the active class.

Figure 3.17: Organization of active and inactive regions within chromosome territories. Interphase nucleus from mouse ES cells hybridized with TexasRed MMU2 exome sequence capture probe pool and FITC-labelled (green) standard MMU2 chromosome paint. The exome sequence capture probe is comprised of 32,517 different oligos which label various coding sequences in mouse chromosome 2 [27]. The image is unpublished data courtesy of Dr. Wendy A. Bickmore (University of Edinburgh) and is included with her permission.

Compared to the internal organization of chromosome territories, much less is known about the interactions between them. Our TCC data provide the first comprehensive and genome-wide view of interactions between chromosome territories. Most of these interactions are mediated by active regions with relatively high ICPs (Figures 3.12 A and A.4). Each of these regions forms significant interactions with a striking number of high-ICP active regions on other chromosomes (Figures 3.14 A and 3.15). Because of this numerosity, it is improbable that all these interactions can exist at the same time in one cell. In other words, no single 3D structure of the entire genome can satisfy all these interactions simultaneously. Furthermore, each of these interactions has a very low, yet significant, frequency. These observations strongly suggest that each interaction is only present in a small fraction of the cells. In fact, our FISH studies suggest that this fraction is smaller than 2% of the cell population (Figure 3.16).
A direct consequence of this conclusion is that each cell in the population harbors only a small subset of all possible interchromosomal interactions and different cells harbor different subsets.

If a different subset of the possible interchromosomal interactions is present in each cell, what factors determine the subset in each cell? To better answer this question, a related question must be addressed first: on what basis does a high-ICP active region choose its interchromosomal interaction partner? Our answer to this question is informed by a rather remarkable observation: interaction frequencies correlate with the ICPs of the interaction partners (Figures 3.14 B,C and A.6). This correlation suggests that ICP is an important basis on which the interaction partner is chosen. In other words, the higher the general tendency of a region to form interchromosomal interactions, the more likely it will be chosen as the partner by any high-ICP active region. Combined with evidence that ICP represents the accessibility of a region to other chromosomes (Figure 3.12 C), this result suggests that accessibility of loci to each other in each cell is a very important factor in determining the subset of interactions that are present in that cell.

The correlation between the product of ICPs and the contact frequencies of high-ICP active regions also suggests a level of independence between the ICP values of many regions in the genome. In other words, the tendency of most regions to form interchromosomal contacts may not depend on any other regions in the genome. This suggests that formation of an interchromosomal interaction between two high-ICP active regions is largely indiscriminate of their individual identities and may take place in cells where they are in spatial proximity. In conclusion, interchromosomal interactions can form indiscriminately between high-ICP active regions that are accessible to each other.
It is not yet clear what factors determine accessibility, but radial position of loci and chromosomes in each cell, the time in the cell cycle, and regional functional activity may be among them.

This conclusion is not entirely consistent with the concept of trans-regulation via interchromosomal communication [123] (see Chapter 1) as a general mode for gene regulation. This concept stipulates that the transcription of some genes is regulated by specific functional elements on other chromosomes. Such a specific regulatory interaction between two loci would require a discriminate interchromosomal interaction between the loci that leads to their spatial association in a considerable fraction of the cell population. Such physical associations are not prominent in our datasets.

Furthermore, our results provide a possible alternative explanation for why trans-regulation via interchromosomal control was observed in other studies [165, 114]. These studies analyzed the contact frequency between a few loci using biased methods. When, only in a certain developmental stage or in response to a stimulus, a significant interchromosomal interaction between a gene and a regulatory sequence was observed concomitantly with the gene becoming activated, the interaction was classified as a specific regulatory communication. It is possible that a genome-wide analysis would have revealed numerous condition-specific (i.e., triggered at a developmental stage or in response to a stimulus) interaction partners for both loci, calling into question the cause-and-effect relationship between transcriptional activation and any one of those many interactions. Specifically, if a locus that is in the inactive class or the low-ICP active class becomes high-ICP active in response to a signal or in the course of development, it will suddenly acquire numerous interaction partners, which are high-ICP active regions, on other chromosomes.
When studying only a few interactions involving this locus in a biased analysis, if all but one of the chosen partners are non-high-ICP partners, that one interaction with the high-ICP partner will appear more consequential than it truly is.

It should be noted that while our results do not support trans-regulation via specific interchromosomal communication as a general phenomenon, a caveat of this conclusion is that our analyses were done at low resolutions where interactions between a single regulatory element and a single promoter cannot be assessed. Furthermore, other forms of trans-regulation that are less specific, for instance, ones involving numerous interchromosomal interactions, cannot be ruled out (see below). It should also be emphasized that our results do not in any way contradict trans-regulation via intrachromosomal communication, especially over short ranges.

This conclusion raises another important question: if interchromosomal interactions are not a result of regulation-based communications, what mechanisms underlie them and why do they have an indiscriminate pattern? A clue to the answer may reside in the observation that the propensity to form interchromosomal contacts correlates with a region’s transcriptional activity (Figure 3.12 B). Because transcription is often focused at discrete sites (i.e., transcription factories) [47], this correlation may be a consequence of the active regions from different chromosomes being recruited to the same factory. Formation of interchromosomal contacts between transcribing loci within transcription factories not only supports previous suggestions that transcription factories play an important role in stabilizing interchromosomal interactions [29, 49, 154] but also can explain the indiscriminate behavior of these interactions, as loci may be recruited to the same factory completely independently of each other.
This indiscriminate nature of interactions also suggests that, based on accessibility in each cell, different combinations of loci associate in one factory. Nevertheless, the association of a specific transcription factor with only some of the transcription factories, as reported before [154], can make the recruitment of its targets to the same factories more likely. Moreover, since transcription is not the only nuclear function that is concentrated at discrete sites [105, 123], it is possible that other factories, such as those of splicing and DNA repair, also mediate the indiscriminate interactions between chromosome territories.

Finally, we showed that the tethering strategy is of paramount importance in analyzing interchromosomal contacts (Appendix B). While intrachromosomal contacts, owing to their high frequencies and thus high signal levels, can be readily analyzed with both tethered (TCC) and non-tethered (Hi-C) approaches, interchromosomal contacts, which generally have lower frequencies (Figure 3.9), can only be measured in the tethered libraries. This difference between tethered and non-tethered libraries is not limited to the libraries produced by us. The Hi-C libraries that were produced by the only other genome-wide conformation capture study [113] perform similarly to our non-tethered Hi-C libraries in analyzing interchromosomal contacts. In conclusion, the tethering step in TCC makes the analysis of interchromosomal contacts possible, and none of the observations that were made in this chapter about interchromosomal contacts could have been made with the previously available methods (Appendix B).

3.5 Materials and methods

3.5.1 Different levels of resolution and segmentation of the genome

The smallest units of chromatin that form contacts in a conformation capture assay are the fragments generated by the restriction enzyme. A complete digestion of the human genome with HindIII generates more than 800,000 fragments.
Using a catalogue with 100,000,000 contacts, similar to our tethered-HindIII catalogue, on average 250 contacts can be observed per restriction fragment. As each fragment has a theoretical maximum of 800,000 potential contact partners in the genome, 250 contacts is wildly inadequate to characterize its genome-wide contact behavior. Even if the amount of sequencing and the observed number of contacts were not limiting, the total number of possible contacts between all fragments (800,000² = 640,000,000,000 values) is challenging to handle in genome-wide computations. To circumvent these problems, we combined several consecutive restriction fragments into larger “segments” such that each segment includes an equal number of fragments. The number of restriction sites in each segment (H) can be chosen based on the resolution that is desired for a particular analysis. Smaller H values result in higher resolutions and higher totals for the number of segments that have to be defined in order to cover the genome (K). The maximum practical resolution (minimum H) in an analysis is limited by the total number of contacts in a catalogue. However, the exact value of this maximum cannot be measured as it also depends on the nature of the analysis and its statistical requirements. Different resolution levels have been used in various analyses in this chapter. These levels of resolution, defined by H and K, and their corresponding median segment sizes are shown in Table 3.2.

Table 3.2: Different levels of resolution. The median size of segments in kilobase pairs and total number of segments covering the genome (K) based on the number of HindIII restriction fragments per segment (H).

  H       K    Size (kb)
 83    9977        285
138    6002        475
166    4992        570
277    3004        950
417    2002       1425
558    1500       1900

An alternative to this approach, namely choosing the basic segments based on the number of restriction sites, is to choose segments of equal genomic length in base pairs.
We selected the restriction site-based approach because, in this approach, the resulting segments are expected to generate an equal number of DNA ends in the conformation capture experiment and, therefore, demonstrate equal activity. In a length-based approach, however, different segments can have drastically different activities based on their density of restriction sites. While this problem can be most pronounced in higher-resolution analyses, even at the chromosomal level, the number of restriction sites in some chromosomes is significantly lower than expected based on their size (data not shown).

To implement the restriction site-based segmentation procedure for each level of resolution (H), the reference sequence of each chromosome was scanned toward the telomere of the q arm. Every H consecutive restriction fragments were assigned to a segment. Only a few adjustments were applied. First, if two restriction sites were closer than 20 bp, they were counted as a single site. This adjustment prevents short repeat sequences that contain the restriction site from resulting in disproportionately short segments. Second, segments were not allowed to span stretches of ambiguous bases (N) that were longer than half the expected base-pair length of the segments. Such stretches of ambiguous sequence typically occur at the centromeres. This adjustment ensures that no segment includes regions that are very distant from one another on the sequence of the chromosome, for example, on two sides of the centromere. As a result of these adjustments, some segments in each chromosome, commonly the segment that immediately precedes the centromere and the last segment, included fewer than H restriction fragments. The results of this implementation of the restriction site-based segmentation are shown in Table 3.3.
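The segmentation rules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the analyses: the function names, the argument layout (restriction-site coordinates, chromosome length, a list of ambiguous-base stretches, and an expected segment length supplied by the caller), and the choice to treat the chromosome ends as fragment boundaries are all hypothetical.

```python
def merge_close_sites(sites, min_gap=20):
    """Count restriction sites closer than min_gap bp as a single site."""
    merged = []
    for pos in sorted(sites):
        if not merged or pos - merged[-1] >= min_gap:
            merged.append(pos)
    return merged


def segment_chromosome(sites, chrom_len, H, n_stretches, expected_len):
    """Group every H consecutive restriction fragments into one segment.

    sites        -- restriction-site coordinates (bp) on one chromosome
    chrom_len    -- chromosome length (bp); ends are treated as boundaries
                    (an assumption of this sketch)
    H            -- number of restriction fragments per segment
    n_stretches  -- (start, end) stretches of ambiguous bases (N)
    expected_len -- expected segment length in bp (hypothetical argument)

    Returns a list of (start, end, n_fragments) tuples.
    """
    # Only N stretches longer than half the expected segment length
    # act as barriers that a segment may not span.
    barriers = sorted((s, e) for s, e in n_stretches
                      if e - s > expected_len / 2)
    cuts = merge_close_sites(sites)

    # Split the chromosome into barrier-free stretches.
    stretches, left = [], 0
    for s, e in barriers:
        stretches.append((left, s))
        left = e
    stretches.append((left, chrom_len))

    segments = []
    for lo, hi in stretches:
        if hi <= lo:
            continue
        # Fragments are the intervals between consecutive boundaries.
        bounds = [lo] + [p for p in cuts if lo < p < hi] + [hi]
        for i in range(0, len(bounds) - 1, H):
            j = min(i + H, len(bounds) - 1)
            segments.append((bounds[i], bounds[j], j - i))
    return segments


example = segment_chromosome(
    sites=[100, 110, 200, 300, 400, 1500, 1600, 1700],
    chrom_len=2000, H=2, n_stretches=[(500, 1400)], expected_len=1000)
```

In this toy example the sites at 100 and 110 bp are counted once, and the segment immediately preceding the ambiguous stretch contains a single fragment, which mirrors the observation that some segments include fewer than H restriction fragments.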
Table 3.3: Number of segments in each chromosome for various resolution levels (H values).

Chr   H = 83   H = 138   H = 166   H = 277   H = 417   H = 558
1        776       464       387       232       155       115
2        858       517       429       258       172       129
3        715       430       358       215       143       107
4        707       426       354       212       142       106
5        661       398       331       198       133        99
6        615       371       308       185       123        92
7        546       329       274       165       109        82
8        518       312       259       156       104        78
9        419       250       208       125        83        63
10       457       275       228       138        91        68
11       474       285       237       143        94        71
12       468       282       234       141        94        71
13       353       212       177       106        71        53
14       309       186       155        93        62        46
15       275       166       138        84        55        41
16       240       144       120        72        48        36
17       228       137       115        69        46        35
18       273       165       137        83        55        42
19       140        84        70        43        29        22
20       189       114        95        58        39        29
21       122        74        61        37        26        19
22        92        55        46        28        19        14
X        542       326       271       163       109        82

3.5.2 Ligation frequency matrix

Given a resolution level (H) and the corresponding total number of segments that cover the genome (K), the genome-wide ligation frequency matrix was defined as a K × K matrix, $C_K = (c_{ij})_{K \times K}$, in which the entry $c_{ij}$ is equal to the number of observed sequencing pairs in the catalogue (Figure 2.7) that show ligation between segments i and j.

3.5.3 Contact frequency matrix

There exist minor variations in how efficiently a restriction enzyme can cut different regions of the genome. These differences in digestion efficiency can affect the observed total number of pairs in the catalogue that involve each segment. Differences in sequence coverage can also affect the observed number of pairs involving a certain segment [113, 68, 177].
Therefore, to obtain contact frequencies, each value in the ligation frequency matrix was normalized by the total number of ligations involving the two corresponding segments:

$$F_K = (f_{ij})_{K \times K} \tag{3.2}$$

$$f_{ij} = \frac{c_{ij} \left( \sum_{k=1}^{K-1} \sum_{l=k+1}^{K} c_{kl} \right)}{\sum_{k=1}^{K} c_{ik} \, \sum_{l=1}^{K} c_{lj}} \tag{3.3}$$

where $f_{ij}$ is the contact frequency (i.e., normalized ligation frequency) between segments i and j, each term in the denominator is the total number of contacts (i.e., ligation products) of one involved segment, and the double summation in the numerator is the total number of all contacts, a constant which applies to all pairs and can be modified for individual analyses. The resulting matrix ($F_K$) is the genome-wide contact frequency matrix. Other studies have used a similar normalization procedure to obtain contact frequency matrices [113, 68]. The portion of the genome-wide contact frequency matrix that contains the intrachromosomal contact frequencies of a chromosome is the intrachromosomal contact frequency matrix of that chromosome.

3.5.4 Contact profile

The contact profile of segment i is the ith row vector of the contact frequency matrix ($F_K$), which contains the ordered list of contact frequencies of segment i with all other segments in the genome.

3.5.5 Contact enrichment (expected values for contacts)

The expected value for the frequency of a contact between segments i and j ($e_{ij}$) was calculated as:

$$e_{ij} = \alpha \, s_i \, s_j \tag{3.4}$$

where $s_i$ and $s_j$ are the totals of all observed contact frequencies involving segments i and j, respectively, and $\alpha$ is a normalization constant. For example, in FIGURE, $\alpha$ is chosen such that the average observed/expected frequency ($f_{ij}/e_{ij}$) of all interchromosomal contacts is equal to 1.

3.5.6 Correlation matrix

The normalization described above is not always sufficient when comparing the intrachromosomal contact profiles.
This insufficiency is due to the fact that the DNA chain constrains the positions of the segments on the same chromosome relative to each other, and as a result, the frequency of contact between two segments on a chromosome depends on their distance. A correlation matrix compares the intrachromosomal contact profiles of all the segments on a chromosome. Therefore, to generate such correlation matrices, it is necessary to take the DNA chain effect into account. Without an adjustment, the local (short-range) contacts, which are most frequent due to the constraining effect of the DNA chain, would dominate the analysis.

To perform a distance adjustment for the correlation analysis of chromosomes, the contact frequency matrix of each chromosome was normalized for the distance of the contact. Let $K_{\mathrm{chr}}$ be the total number of segments covering a chromosome for a given H (Table 3.3). The distance-normalized matrix for each chromosome ($\tilde{F}_{K_{\mathrm{chr}}}$) was obtained by dividing each value of the contact matrix by the chromosome-wide average frequency of contacts with a similar distance [113]. Let the distance from the diagonal of a $K_{\mathrm{chr}} \times K_{\mathrm{chr}}$ matrix be $l = |i - j|$; then the normalized matrix is defined as:

$$\tilde{F}_{K_{\mathrm{chr}}} = (\tilde{f}_{ij})_{K_{\mathrm{chr}} \times K_{\mathrm{chr}}} \tag{3.5}$$

where

$$\tilde{f}_{ij} = \frac{f_{ij}}{\langle f_l \rangle_{l=|i-j|}} \tag{3.6}$$

and

$$\langle f_l \rangle_{l=|i-j|} = \frac{\sum_{(i,j) \in U} f_{ij}}{K_{\mathrm{chr}} - l} \tag{3.7}$$

where

$$U = \{(i,j) : |i - j| = l,\ 1 \le i, j \le K_{\mathrm{chr}}\} \tag{3.8}$$

Then, for each chromosome, a correlation matrix was defined in which $p_{ij}$ is Pearson’s correlation coefficient between the ith row and the jth column of the distance-normalized contact matrix ($\tilde{F}_{K_{\mathrm{chr}}}$). In other words, $p_{ij}$ is the correlation between the distance-normalized intrachromosomal contact profiles of segments i and j. Positive correlation indicates that these segments show a similar pattern of contact frequency with other segments in the chromosome. Negative correlation indicates opposite patterns of contact frequencies.
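The chain of normalizations in Sections 3.5.3 and 3.5.6 (ligation counts to contact frequencies, contact frequencies to distance-normalized values, and distance-normalized profiles to a Pearson correlation matrix) can be sketched as follows. This is an illustrative sketch using plain Python lists, not the code used in the analyses; the function names are hypothetical, the input matrices are assumed to be symmetric, and the per-distance average is taken here as the mean of the entries on one diagonal at offset l.

```python
from math import sqrt


def contact_frequency(C):
    """Normalize ligation counts c_ij by the totals of segments i and j.

    The numerator constant is the total number of contacts, summed over
    the upper triangle of the (assumed symmetric) ligation matrix.
    """
    K = len(C)
    total = sum(C[k][l] for k in range(K - 1) for l in range(k + 1, K))
    row = [sum(C[i]) for i in range(K)]
    col = [sum(C[k][j] for k in range(K)) for j in range(K)]
    return [[C[i][j] * total / (row[i] * col[j]) if row[i] and col[j] else 0.0
             for j in range(K)] for i in range(K)]


def distance_normalize(F):
    """Divide each f_ij by the mean frequency at distance l = |i - j|."""
    K = len(F)
    out = [[0.0] * K for _ in range(K)]
    for l in range(K):
        diag = [F[i][i + l] for i in range(K - l)]  # K - l entries
        mean = sum(diag) / (K - l)
        if mean > 0:
            for i in range(K - l):
                out[i][i + l] = out[i + l][i] = F[i][i + l] / mean
    return out


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0


def correlation_matrix(N):
    """p_ij = correlation between row i and column j of N."""
    K = len(N)
    cols = [[N[k][j] for k in range(K)] for j in range(K)]
    return [[pearson(N[i], cols[j]) for j in range(K)] for i in range(K)]


C = [[0, 4, 2],
     [4, 0, 6],
     [2, 6, 0]]
F = contact_frequency(C)
N = distance_normalize(F)
P = correlation_matrix(N)
```

For the toy matrix, the total number of contacts is 12 and the row totals are 6, 10, and 8, so the contact frequency between the first two segments is 4 × 12 / (6 × 10) = 0.8.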
3.5.7 Principal component analysis (PCA)

Principal component analysis (PCA) [91] can be used to classify segments into the active and inactive classes [113]. Therefore, to identify segments of the active and the inactive class, we carried out PCA for each chromosome independently using its correlation matrix. To determine the first principal component, the eigenvector with the largest eigenvalue was derived from the correlation matrix. For each segment on a chromosome, the value of the first principal component (EIG) was measured as the length of the projection of its correlation profile on the eigenvector. For all chromosomes except chromosome 4, the first principal component clearly corresponds to the plaid pattern and the active and inactive classes of sequences. For chromosome 4, the first principal component corresponds to the different arms instead of the plaid pattern. Therefore, for this chromosome PCA was carried out separately for each arm.

In PCA, the direction (sign) of the principal component vector (i.e., eigenvector) is arbitrary and depends on the computational strategy. To determine whether negative or positive values of the principal component correspond to the active segments, the Spearman correlation between the principal component and DNase hypersensitivity along the chromosome was calculated. A correlation coefficient smaller than 0 means that the negative values of the principal component mark the active segments. In these cases, all EIG values were multiplied by −1 to ensure that for all chromosomes positive EIG values correspond to the active segments. For the purpose of determining the direction of the principal component, using correlation with ICP, RNA pol II binding, gene expression, or gene density yields results identical to those obtained using DNase hypersensitivity.

3.5.8 Centromere-centromere contacts associations

The centromeres of all human chromosomes consist of repetitive sequences [40].
Because making unique alignments to these sequences is not possible, these regions basically appear as black boxes in the experiment, with few or no uniquely aligned pairs available for them. As a consequence, we used the segments immediately flanking the centromeres, which are often part of the centromeric heterochromatin, as a surrogate in the analyses involving centromeres.

To measure the contact frequency between centromeres, an $F_{6002}$ (H = 138) contact frequency matrix generated from the HindIII TCC library was used. The selected segments for each centromere, both in terms of genomic location and in terms of the number of segments on each side, are shown in Table 3.4. It should be noted that manually selecting centromeric surrogate regions on a chromosome-by-chromosome basis can yield optimal results. However, to avoid any bias introduced by such a selection process, we used the two segments on each side of the centromere of each chromosome as the surrogate “centromeric region” for the analyses (Table 3.4). In the case of acrocentric chromosomes (13, 14, 15, 21, 22), only the q-arm side of the centromere was considered.

Table 3.4: The defined centromeric regions for each chromosome in $F_{6002}$.

Chr   On p-arm (bp)          On q-arm (bp)          # segments on p-arm   # segments on q-arm
1     120471507-121485434    142535434-143523024    2                     2
2     91595103-92326171      95326171-96356447      2                     2
3     90104192-90504854      93504854-94262896      2                     2
4     48874116-49660117      52660117-53529140      2                     2
5     46149058-46405641      49405641-50262605      2                     2
6     58466074-58780166      61880166-62225396      2                     2
7     57011289-58054331      61054331-61552295      2                     2
8     43214215-43838887      46838887-47699557      2                     2
9     46413223-47317679      65467679-66542245      2                     2
10    38240375-39154935      42354935-43087187      2                     2
11    51285016-51594205      54694205-54887605      2                     2
12    34672924-34856694      37856694-38257963      2                     2
13    NA                     19020000-19786410      NA                    2
14    NA                     19000000-19921555      NA                    2
15    NA                     20000000-21191268      NA                    2
16    34540289-35285801      46385801-47497133      2                     2
17    21552183-22263006      25263006-26293199      2                     2
18    14943589-15410898      18510898-19711257      2                     2
19    24500980-24631782      27731782-28020582      2                     2
20    25589011-26319569      29419569-30883573      2                     2
21    NA                     14338129-15193734      NA                    2
22    NA                     16050000-17196878      NA                    2
X     58435691-58582012      61682012-62315999      2                     2

To obtain the average interchromosomal contact frequency with centromeric regions, the stretch from 18 segments before to 30 segments after the centromere of each chromosome was considered. For acrocentric chromosomes, the stretch between the centromere and 30 segments after the centromere was considered. For each position (e.g., 20 before the centromere or -20), the number of contacts between the segment in that position on each chromosome with the centromeric regions (the selected four segments) on all other chromosomes was summed. For example, for the +10 position after centromeres, the contact frequency of the 10th segment after the centromere of chromosome 1 with the centromeric regions of chromosomes 2 to X is obtained by adding the contact frequency between the +10 segment and each centromeric region and taking the average over all 22 target centromeres (the contact sum between a segment and the centromeric region of any acrocentric chromosome is multiplied by two because there are half as many segments in that combination). The same is then repeated for the contact frequency between the 10th segments after the centromeres of the other chromosomes (2 to X) and all centromeric regions, excluding the centromere of the same chromosome. The average frequency of all these contacts (23 × 22 combinations) is taken as the average interchromosomal contact frequency of the +10 position after the centromere with the centromeric regions of all chromosomes (Figure 3.13 A).
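The positional averaging described above can be illustrated with a short Python sketch. It is not the original analysis code: the function name, the dictionary-based inputs, and the toy matrix are assumptions made for illustration. The sketch sums the contact frequencies between the segment at a given position on each chromosome and the centromeric segments of every other chromosome, rescales sums for regions with fewer than four segments (a factor of two for the two-segment acrocentric regions), and averages over all ordered chromosome pairs.

```python
FULL_REGION = 4  # two segments on each side of the centromere


def mean_centromere_contact(F, position_segment, centromeric_segments):
    """Average contact of one centromere-relative position with the
    centromeric regions of all other chromosomes.

    F                    -- contact frequency matrix (nested lists)
    position_segment     -- chromosome -> index of the segment at the
                            position of interest (e.g. +10)
    centromeric_segments -- chromosome -> indices of its centromeric
                            surrogate segments
    """
    pair_values = []
    for chrom_a, seg in position_segment.items():
        for chrom_b, cen in centromeric_segments.items():
            if chrom_a == chrom_b:
                continue  # interchromosomal contacts only
            total = sum(F[seg][c] for c in cen)
            # Acrocentric regions have half as many segments, so their
            # sum is doubled (4 / 2 = 2).
            pair_values.append(total * FULL_REGION / len(cen))
    return sum(pair_values) / len(pair_values)


# Toy data: a 6 x 6 matrix with F[i][j] = i + j, two source chromosomes,
# and one acrocentric target with a two-segment centromeric region.
F_toy = [[i + j for j in range(6)] for i in range(6)]
value = mean_centromere_contact(
    F_toy,
    position_segment={"1": 0, "2": 1},
    centromeric_segments={"1": [0, 1, 2, 3], "2": [1, 2, 3, 4], "13": [4, 5]},
)
```

With the full set of 23 chromosomes in both dictionaries, the average runs over the 23 × 22 chromosome combinations mentioned in the text; acrocentric chromosomes can simply be left out of `position_segment` for positions before the centromere.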
The method for generating the centromeric interaction profiles (Figure 3.13 C) can be better understood in the context of an example. For the centromere of chromosome 1, the contact frequency between each of the four segments flanking its centromere (Table 3.4) and every segment of all other centromeric regions was retrieved from the $F_{6002}$ contact frequency map. The average of all possible contact frequencies between the segments around the centromere of chromosome 1 and those around the centromere of another chromosome was then taken as the total contact frequency of the chromosome 1 centromere with the centromere of that chromosome. This process was repeated for all other chromosomes.

3.5.9 RNA polymerase II binding

Raw RNA polymerase II (pol II) ChIP-seq data in GM12878 cells were obtained from another study [95]. The ChIP-seq data were aligned to the human genome (GRCh37/hg19). The binding of pol II to each segment was calculated as the number of reads that aligned to the segment in the anti-pol II ChIP divided by the number of aligned reads in the anti-IgG negative control.

3.5.10 Gene expression

Raw RNA-seq (poly-A enriched) data for GM12878 cells were obtained from another study [95]. The expression level of UCSC known canonical genes in hg19 was estimated using a two-parameter generalized Poisson model as described by Srivastava and Chen [167]. Total gene expression for each segment was measured as the sum of the expressions (theta values) of all genes that overlap with that segment.

3.5.11 Histone modifications

Raw histone modification ChIP-seq data in GM12878 cells were obtained from the ENCODE project [22] (generated at the Broad Institute and in the Bradley E. Bernstein lab at the Massachusetts General Hospital/Harvard Medical School). The ChIP-seq data were aligned to the human genome (GRCh37/hg19).
Each histone modification level was calculated as the number of reads that aligned to the segment in the corresponding antibody pulldown experiment divided by the number of aligned reads in the input negative control.

3.5.12 DNase I sensitivity

Raw DNase I sensitivity sequencing data in GM12878 cells, which were generated using the Digital DNase I methodology [148], were obtained from the ENCODE project [22] (these data were generated by the UW ENCODE group). The Digital DNase sequencing reads were aligned to the human genome (GRCh37/hg19). The total number of alignments to each segment was taken as the total amount of DNase sensitivity in that segment.

3.5.13 3D DNA FISH

BACs were obtained from the BACPAC Resource Center (BPRC) at Children’s Hospital Oakland Research Institute. 3D-FISH experiments were carried out as described previously [16]. The only BAC that aligns to chromosome 19 (RP11-50I11) was labelled with digoxigenin while the other BACs (RP11-651M4, RP11-220C23, RP11-169D4, and RP11-770J1), all of which align to chromosome 11, were labelled with biotin in nick-translation reactions. In each hybridization reaction, roughly 300 ng of each labelled probe and 5 μg of Cot-1 DNA were used. Each label was detected with two layers: avidin-FITC and mouse anti-dig as the first layer, and goat anti-avidin-FITC and sheep anti-mouse-Cy3 as the second layer. The total DNA was counterstained with DAPI. Confocal microscopy was carried out using an Olympus FluoView FV1000 imaging system equipped with a 60X/1.42 PlanApo objective. Optical sections (z stacks) 0.20 μm apart were obtained in the sequential mode in DAPI, FITC, and Cy3 channels. Center-to-center distances between the probes were calculated using the Smart 3D-FISH plugin for ImageJ as described [82]. Each pair of probes was processed in duplicate with about 1,000 total cells per pair.
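As an illustration of how center-to-center distances from such image stacks translate into a fraction of cells showing colocalization, a small Python sketch is given below. It does not reproduce the Smart 3D-FISH plugin. The 0.20 μm z-step is taken from the text, whereas the lateral pixel size, the function names, and the example cutoff are assumptions chosen only for the sketch.

```python
from math import sqrt

Z_STEP_UM = 0.20    # spacing between optical sections (from the text)
XY_PIXEL_UM = 0.10  # lateral pixel size: an assumed value, not from the text


def center_distance_um(p, q, xy=XY_PIXEL_UM, z=Z_STEP_UM):
    """Distance in micrometers between two probe centers.

    p and q are (x, y, z) centers in voxel units; x and y are scaled by
    the lateral pixel size and z by the optical section spacing.
    """
    dx = (p[0] - q[0]) * xy
    dy = (p[1] - q[1]) * xy
    dz = (p[2] - q[2]) * z
    return sqrt(dx * dx + dy * dy + dz * dz)


def fraction_within(distances_um, cutoff_um):
    """Fraction of measured cells whose probe pair lies within cutoff_um."""
    return sum(d <= cutoff_um for d in distances_um) / len(distances_um)


# Hypothetical measurements for four cells and a hypothetical 1 um cutoff.
distances = [0.2, 0.8, 1.5, 3.0]
colocalized = fraction_within(distances, cutoff_um=1.0)
```

Scaling the axial and lateral coordinates separately matters because confocal stacks are usually sampled more coarsely along z than within each optical section.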
Chapter 4

Three-dimensional Genome Structures Based on Tethered Conformation Capture Data

4.1 Preview

This chapter describes a method for translating the TCC data, which embody the three-dimensional structures of the genome in the form of two-dimensional or binary contacts, into three-dimensional structures. We lay out strategies to account for the various challenges that are associated with calculating structures for genomes and devise methods of doing so. Our methods produce a population of 3D genome structures based on the TCC data. We then analyze the structure population to gain insight into the genome architecture.

4.2 Introduction

The TCC contacts, or any other genome-wide conformation capture data, are inherently two-dimensional (Figure 3.1) even as they represent the three-dimensional organization of the genome. In other words, these data represent the three-dimensional structure of chromosomes indirectly. The relationship between TCC contacts and the three-dimensional structure of the genome is similar to that between X-ray diffraction patterns of a protein crystal and the protein structure. While both the TCC data and X-ray diffraction patterns can be very informative, a complete structural understanding of the entity under study demands the calculation of the three-dimensional structures that produced the corresponding TCC contacts or X-ray diffraction spots.

After decades of effort and theoretical and experimental improvement, calculating accurate protein and macromolecular structures from X-ray diffraction patterns or NMR spectra has now become routine [129]. The structural studies of the genome based on genome-wide contact information, on the other hand, are in their infancy. Furthermore, no attempt at calculating the three-dimensional structure of mammalian chromosomes based on this type of data has been made before.
Such a structural calculation faces several unique challenges that distinguish it from the structural studies of other molecules that are commonly studied using X-ray crystallography or nuclear magnetic resonance (NMR). The first challenge is the aspect ratio of the genome, defined as the ratio of the longest dimension of a molecule to its shortest dimension. In the case of the diploid human genome, the aspect ratio, or the ratio of its length to its diameter, is about 10⁹. This value compares very poorly to the aspect ratio of an average 100 kD protein, which is 10³. This large aspect ratio leads to both data collection and computational problems. In data collection, the amount of sequencing that can currently be performed is far short of what is required to provide adequate contact information along the entire length of the genome at the highest resolution. In computation, performing any structural calculation on this very long chromatin fiber at a high resolution would require computing capacities that are difficult or nearly impossible to reach.

The second challenge is the extensive heterogeneity in the genome structure. For most macromolecules, only a few dominant conformations, if not a single one, exist. The same cannot be said for the genome. The TCC data show numerous contacts within each chromosome that are observed at a wide range of frequencies (Figures 3.1 and A.1). They also reveal extensive interactions between different chromosomes that take place at very low frequencies (Figures 3.10, 3.14 A, 3.15, and 3.9). The TCC data together with FISH data (Figure 3.16) show that each contact is present in only a small fraction of the cell population, pointing to large heterogeneity in the structure of chromosomes and in the genome architecture generally. They also suggest that a dominant structure does not exist when it comes to the genome (Figures 3.14 A and 3.15).

These challenges must be addressed in any attempt at calculating genome structures.
In addition, such an attempt must introduce a proper theoretical and computational basis for structure calculation based on contact data. In this chapter, we present the first such effort. To account for the large aspect ratio of the genome, we use a coarse-graining strategy that represents each chromosome as a limited number of spheres. To account for the structural heterogeneity, we use a population-based structure modeling strategy which does not generate a single representative structure, but rather generates a population of genome structures that, as a whole, reproduce the TCC data but are substantially different from one another.

We also demonstrate that our population-based modeling approach reproduces a hallmark of chromosome structures, namely, their radial positions. Finally, we use the calculated structure population to analyze some structural properties of the genome in probabilistic terms.

4.3 Results

4.3.1 Structural modeling

The contacts in TCC data describe not one structure but the average contacts of numerous genome structures in different cells. We, therefore, aimed at generating a population of three-dimensional genomes in which the resulting variety of structures is statistically consistent with the data. This task can be expressed as an optimization problem with three main components: (1) a structural representation of chromosomes at an appropriate level of resolution; (2) a strategy to translate TCC contact data to structural parameters that can be implemented in the structural representation; and (3) a method for determining structures that agree with the structural parameters and optimizing them.

4.3.1.1 Structural representation

The first step in modeling genome structures based on the TCC data is to define an appropriate structural representation of the chromosomes.
Such a representation should not only deal with the enormous length of each chromosome (see above) and simplify it to an appropriate level of resolution, but should also provide a permissive basis for implementing the succeeding structure calculation steps.

In order to devise an effective structural representation, we drew inspiration from the appearance of contact frequency maps (Figures 3.1 and A.1). The plaid appearance of these maps suggests that each chromosome can be partitioned into megabase-scale “blocks” of contiguous segments that share similar contact profiles. Due to their genomic proximity, the segments within each block contact other regions of the same block most frequently; in fact, they do so several orders of magnitude more frequently than they contact other regions in the genome (Figure 3.2). Furthermore, the regions in the same block are restrained to each other by the relatively short chromatin fiber that encompasses them. Based on these considerations, we concluded that each block represents chromatin regions that are, by and large, within a limited distance of one another in the nucleus, making these blocks a possible foundation for a structural representation of chromosomes.

To find the borders of these blocks, we applied a constrained clustering algorithm to each chromosome’s contact frequency matrix¹ using the Pearson’s correlation between the segments’ contact profiles as a similarity measure (Figure 4.1 A, Materials and methods). We then used a previously described objective penalty function [96] to optimize the total number of clusters while maximizing the similarity of contact profiles within each cluster (Figure 4.2, Materials and methods). Applying this procedure to all chromosomes divided the haploid genome into 428 “chromatin blocks” (Appendix Figure A.7 and Table C.1). We then compiled a block-block contact frequency matrix (Materials and methods).
This block-based matrix (Figure 4.1 B) is highly correlated with the original contact matrix (Spearman’s $\rho$ = 0.81, p-value < $10^{-15}$), confirming that the long-range contact patterns are preserved in this block matrix (Figure 4.1 A,B).

Several observations further support that large portions of the chromatin in a block are in spatial proximity and predominantly occupy the same specific subterritory in the nuclear space. First, the vast majority of contacts involving each block are between regions inside the block: it is one hundred times more likely that any given region forms a contact with itself or a region inside the same block than with a region in a different block. Second, the contact probability between neighboring regions is reduced and an abrupt change in contact profiles is observed across the block borders. These observations suggest that these blocks are an appropriate foundation for a structural representation of the genome. Therefore, as a first approximation, we defined the space that is largely occupied by each chromatin block in the nucleus as a spherical volume whose size is approximated by the length of the block (Figure 4.3 and Table C.1).

¹ All analyses in this chapter were carried out using an $F_{4992}$ contact frequency matrix (Table 3.2, page 107 and Table 3.3, page 109) that was generated based on the HindIII tethered contact catalogue (Table 2.1, page 39).

Figure 4.1: Constrained clustering and coarse-graining of chromosomes. (A) The contact frequency map of chromosome 11 from the HindIII-TCC catalogue. Hierarchical constrained clustering was applied using the Pearson’s correlation between the segments’ contact profiles as the similarity measure (Materials and methods). The dendrogram of constrained clustering is shown to the left and on top of the map. An $F_{4992}$ contact frequency matrix (Tables 3.2 and 3.3) was used for this map. (B) Coarse-grained block matrix of chromosome 11. In the block map, the value of an element is the average contact frequency of all the corresponding elements in the contact frequency map. Spearman’s rank correlation coefficient between this block matrix and the contact frequency map in (A) is 0.78. Assignments of segments to the active (orange blocks) and inactive (dark brown blocks) classes are shown to the left and on top of the matrix. See Figure 4.5 A for the coarse-grained genome-wide block map.

Figure 4.2: Objective penalty function for chromosome 11. Detection of the optimal number of clusters (i.e., blocks) by optimizing a penalty function. The plot shows the score of the penalty function with respect to the number of clusters. For chr11, the optimal number of clusters is 15. Optimizing the same penalty function for all chromosomes led to 428 clusters (blocks) for the entire genome (Appendix C). Appendix Figure A.7 shows this plot for all chromosomes.

Figure 4.3: Representation of chromatin blocks as spheres. The sphere for each block (solid sphere) is defined by a hard radius (R) which is estimated from the block size and nuclear occupancy of the genome (Materials and methods). The sphere cannot be penetrated within this radius (Materials and methods). Each sphere has a soft shell (dotted line) with a distance of R from the surface of the sphere. A contact between two spheres is enforced as an overlap between the spheres’ respective soft shells. Also shown is a schematic hypothetical view of the chromatin fiber.

The structure of the genome can be represented by a spatial arrangement of these spheres. Based on this representation, structure calculation entails determining a population of genome structures where, in each structure, all the 856 spheres that constitute the diploid genome are packed into the nucleus in a way that their contacts across the population are wholly consistent with the TCC data.
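The coarse-graining step of this section, in which each element of the block map is the average contact frequency of the corresponding segment-level elements, can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function name `block_average`, the `borders` list of block boundaries, and the toy matrix are hypothetical and are not taken from the thesis, and the treatment of diagonal elements here may differ from the procedure in Materials and methods.

```python
import numpy as np

def block_average(contact_matrix, borders):
    """Average a segment-level contact matrix over contiguous blocks.

    borders lists the first segment index of every block followed by the
    total number of segments, e.g. [0, 2, 4] for two blocks of two segments.
    (Hypothetical helper; every element of a sub-matrix is weighted equally.)
    """
    n_blocks = len(borders) - 1
    block_matrix = np.zeros((n_blocks, n_blocks))
    for a in range(n_blocks):
        for b in range(n_blocks):
            sub = contact_matrix[borders[a]:borders[a + 1],
                                 borders[b]:borders[b + 1]]
            block_matrix[a, b] = sub.mean()
    return block_matrix

# Toy 4-segment map with two obvious blocks (made-up numbers).
segments = np.array([
    [8.0, 6.0, 1.0, 1.0],
    [6.0, 8.0, 1.0, 1.0],
    [1.0, 1.0, 9.0, 7.0],
    [1.0, 1.0, 7.0, 9.0],
])
blocks = block_average(segments, [0, 2, 4])
print(blocks)
```

In this toy case the two diagonal blocks average to 7 and 8, while the off-diagonal blocks average to 1, mirroring the plaid pattern that motivates the block representation.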
4.3.1.2 Deriving structural parameters from TCC data

In order to translate TCC data into structural parameters compatible with our representation of chromosomes, we converted the observed contact frequencies to sphere contact probabilities. We generated a block contact frequency matrix ($G$) from an $F_{4992}$ contact frequency matrix (Materials and methods). Converting these block contact frequencies ($g$) to contact probabilities between the corresponding spheres ($a$) requires defining reference points for contact frequencies ($g = f^{max}$) that represent a sphere contact probability of 1 ($a = 1$) as well as an assumption regarding the relationship between contact frequency and contact probability. For the former, we defined the reference contact frequency of each block ($f^{max}_{i}$) based on its contact frequency with its neighboring blocks by assuming that contact between adjacent spheres in each chromosome must be present in 100% of the cells because they are directly connected by the chromatin fiber (Materials and methods). A block-block contact frequency equal to or more than $f^{max}$ would result in the corresponding spheres having a contact probability of 1 ($a = 1$). For the latter, we assumed a linear relationship between contact frequency and contact probability ($a = cg$, where $c = 1/f^{max}$ when $g \le f^{max}$ and $c = 1/g$ when $g > f^{max}$); therefore, a contact with a frequency half that of the reference value ($f^{max}$) would have a probability of 0.5 (Materials and methods). It should be noted that because of the differences between blocks, $f^{max}$ was defined for each block separately (Materials and methods).

These sphere contact probabilities represent a structural parameter that can be integrated into our structure calculations. If the contact probability between two spheres is 0.5, then a contact between these spheres should be enforced in half of the structures in the population (Figure 4.3, Materials and methods).
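A minimal sketch of this conversion is given below, assuming a single reference frequency `f_max` is already known for the pair of blocks in question. The function names, the rounding of the number of restrained models, and the random choice of which models carry the restraint are illustrative assumptions, not the procedure of Materials and methods.

```python
import numpy as np

def contact_probability(g, f_max):
    """Linear frequency-to-probability rule: a = g / f_max, capped at 1."""
    g = np.asarray(g, dtype=float)
    return np.minimum(g / f_max, 1.0)

def models_with_restraint(a, population_size, rng):
    """Pick the models in which a contact of probability a is enforced.

    Assumption: round(a * population_size) models are drawn at random,
    without replacement, separately for every contact.
    """
    n_enforced = int(round(a * population_size))
    return rng.choice(population_size, size=n_enforced, replace=False)

# Made-up block contact frequencies and a made-up reference value.
probabilities = contact_probability([5.0, 10.0, 20.0], f_max=10.0)
print(probabilities)

rng = np.random.default_rng(0)
chosen = models_with_restraint(probabilities[0], 10_000, rng)
print(len(chosen))
```

With these made-up inputs, a frequency of half the reference value gives a probability of 0.5, so the contact is assigned to 5,000 of the 10,000 models, while frequencies at or above the reference value give a probability of 1.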
4.3.1.3 Structure determination

Structure determination and optimization entail the generation of a structure population based on the sphere representation that is consistent with the TCC-derived structural parameters (i.e., sphere contact probabilities). We started with a population of 10,000 random sphere arrangements (models). We then assigned a set of sphere contact restraints to each model in the population (Materials and methods). A contact restraint can be thought of as generating a “force” between the spheres so that they form a contact by overlapping in their soft shells (Figure 4.3, Materials and methods). Importantly, any given contact was only enforced in the fraction of models in the population corresponding to its TCC-based probability (Materials and methods).² If a contact was not enforced in a structure, no assumptions were made about the relative positions of the corresponding spheres. Furthermore, the assignment of a restraint to each structure was done independently of the other restraints that were assigned to that structure (Materials and methods). We also defined additional types of restraints. These restraints ensure that all spheres are positioned within the nuclear volume and that overlaps between spheres are prevented.

After assigning restraints, the positions of all the spheres in the model population should be optimized until no restraint violations remain (Materials and methods). This task requires a scoring function [2, 3] that quantifies the model population’s accordance with the TCC-derived structural parameters. We defined a scoring function that was the sum of contact and other types of restraints such that it would have a value of zero when no restraints were violated and an increasingly positive value with an increasing number of violations (Materials and methods).

² In a diploid cell, most loci are present in two copies.
Because the TCC data do not distinguish between these copies, the optimal assignment of a contact to two of the corresponding four spheres in each structure was determined as a part of our optimization process [5, 2]. See Materials and methods for the complete description.

Next, we simultaneously optimized the positions of all the spheres in the model population until no restraint violations remained. This optimization process relies on conjugate gradient [168] and molecular dynamics [8, 139] with simulated annealing [99, 32]; starting from the random configurations, it iteratively moves the spheres so as to minimize the scoring function to zero. The result is a population of 10,000 genome structures (Figure 4.4).

To test how consistent this structure population is with the experimental data, we calculated a sphere contact frequency matrix from the structure population and compared it with the block-block contact frequency matrix obtained from the TCC data (Figure 4.5 A,B). The two were strongly correlated with an average Pearson’s correlation of 0.94, confirming the excellent agreement between contact frequencies in the structure population and those in the experiment. Furthermore, three independently calculated populations showed that our structure population was highly reproducible (Pearson’s r > 0.999), which also indicates that, at this resolution, the size of the population is sufficiently large.

4.3.2 Structural features of the genome population

Having generated a population of genome structures, we studied its various structural features in probabilistic terms. We also compared it with the well-known features of the genome architecture.

4.3.2.1 Structural heterogeneity

Because chromatin contacts in the TCC data are observed over a wide range of frequencies, our genome structure population shows a fairly large degree of structural variation.
This heterogeneity is clear in the analysis of root-mean-square deviation (RMSD) among the structures in the population (Figure 4.6 A,B). We calculated the RMSD of all pairs of structures. When normalized by the nuclear radius, the RMSD ranges between 0.6 and 0.95 with an average of 0.82 (Figure 4.6 A). We also normalized each RMSD value by the mean distance of all spheres in the corresponding two structures (the average value for the two structures). In this case, the average RMSD is 0.98 (Figure 4.6 B).

Figure 4.4: Genome structure population of 10,000. A schematic of the optimized structure population is shown on top. A randomly selected sample from the population is magnified on the bottom. All forty-six chromosome territories are shown. Homologous pairs share the same color. The nuclear envelope is displayed in gray. For visualization purposes, the spheres are blurred in the magnified structure.

Figure 4.5: Genome-wide contact frequency maps from HindIII-TCC catalogue, structure population, and a population of random structures. Shown are the contact frequency maps from (A) HindIII-TCC catalogue and (B) the structure population. Also shown is the contact frequency map of a structure population generated with random chromosome chains (C), which only includes chromosome chain contacts. All matrices are 428 by 428.

Figure 4.6: Heterogeneity between the structures in the population. (A) Structural heterogeneity in the genome population: histogram of the root-mean-square deviation (RMSD) between all pairs of structures in the model population. RMSD values are in nuclear radius units (i.e., normalized by the nuclear radius). (B) Histogram of the RMSD between all pairs of structures in the model population. In this plot, each RMSD value is normalized by the average of the mean internal distance between all sphere pairs in the corresponding two structures.
See Materials and methods for more information.

We also measured the overlap of sphere contacts between all pairs of structures in the population. For that, a binary vector containing all possible sphere-sphere contacts was constructed for each structure; all contacts that were present in the structure were assigned a value of 1, and those that were absent from the structure were assigned a value of 0. An overlap index between each pair of structures was then calculated as the fraction of shared contacts (i.e., the number of vector elements where both structures carry a 1, divided by the total number of contacts in the union of both vectors). This analysis showed that, on average, only 21% of sphere contacts are shared between any two structures in the population (Figure 4.7).

Figure 4.7: Contact overlap between structures in the population. The histogram of the sphere contact overlap percentage between all pairs of structures in the population. The contact overlap is measured as the fraction of sphere contacts that are present in both structures. See Materials and methods for more information.

4.3.2.2 Radial positions of chromosome territories

Despite its large heterogeneity, the structure population reveals a distinct and non-random chromosome organization. Specifically, the population clearly identifies a preferred radial position for each chromosome (Figure 4.8 A,B and A.8). These positions strongly agree with independent FISH studies in lymphoblastoid cells [26, 50]. In fact, the Pearson’s correlation between these imaging-based and our population-based average positions is 0.71 (p-value < $10^{-3}$, Materials and methods). On the other hand, radial positions in a control population generated without the TCC data (Figure 4.5 C) did not agree with the experiment (Pearson’s r = -0.2, Figure 4.8 C), indicating that the TCC data are responsible for generating the correct radial distributions seen in the imaging experiments.
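To illustrate how a population-based average radial position can be obtained, the following sketch computes, for every chromosome territory, the distance of its center of mass from the nuclear center as a fraction of the nuclear radius and averages it over all structures. The array layout, the function name, the unweighted centroid, and the toy coordinates are assumptions made for illustration only.

```python
import numpy as np

def mean_radial_positions(coords, territory_of_sphere, nuclear_radius):
    """Average radial position of each territory over a structure population.

    coords: array of shape (n_structures, n_spheres, 3), with the nucleus
    centered at the origin (assumed layout).
    territory_of_sphere: one label per sphere naming its territory.
    The center of mass is taken as the unweighted mean of sphere centers,
    which is a simplifying assumption.
    """
    labels = np.asarray(territory_of_sphere)
    positions = {}
    for label in np.unique(labels):
        mask = labels == label
        center_of_mass = coords[:, mask, :].mean(axis=1)
        radial = np.linalg.norm(center_of_mass, axis=1) / nuclear_radius
        positions[str(label)] = float(radial.mean())
    return positions

# Two toy structures with four spheres in two territories (made-up numbers).
coords = np.array([
    [[1, 0, 0], [3, 0, 0], [0, 4, 0], [0, 4, 0]],
    [[0, 2, 0], [0, 2, 0], [0, 0, 8], [0, 0, 8]],
], dtype=float)
labels = ["chrA", "chrA", "chrB", "chrB"]
radial_positions = mean_radial_positions(coords, labels, nuclear_radius=10.0)
print(radial_positions)
```

Given such per-chromosome averages and a matching list of imaging-based positions, a Pearson's correlation could then be computed with `np.corrcoef(x, y)[0, 1]`.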
The structure population also shows that the radial positions of chromosomes tend to increase with their size, with some noticeable exceptions (Figure 4.8 B). One of these exceptions is chromosome 18, which was observed at a peripheral position. Chromosome 18 has a very similar size to chromosome 19; nevertheless, chromosome 19 is located closer to the center of the nucleus while chromosome 18 is preferentially located closer to the nuclear envelope (Figure 4.8 B). Furthermore, the homologous copies of chromosome 18 are often distant from each other while those of chromosome 19 are often closely associated (Figure A.8). These observations about chromosome 18 are in agreement with independent experimental evidence [26, 50].

Figure 4.8: Radial positioning of chromosome territories. (A) The distribution of the radial positions for chromosomes 18 (red dashed line) and 19 (blue solid line) calculated from the genome structure population. Radial positions are calculated for the center of mass of each chromosome and are given as a fraction of the nuclear radius. See Appendix Figure A.8 for the radial distribution of all chromosomes. (B) The average radial position of all chromosomes in the structure population plotted against their size. Error bars mark the standard deviation. (C) The average radial positions of all chromosomes in a structure population that was generated without the TCC data. The radial positions of chromosomes are plotted against the chromosome size, and error bars mark the standard deviation. See Figure 4.5 C for the genome-wide contact map from this random-chain population.

Another noticeable exception to the trend of increasing radial position with increasing size is chromosome 1. This chromosome prefers a more central position in the model population compared to other chromosomes of the same size (Figures 4.8 B and A.8).
While this observation is in contrast with imaging studies in fibroblasts [23], it is in complete agreement with the studies in lymphoblastoid cells [26, 50], to which our structure population corresponds.

4.3.2.3 Distances between non-homologous chromosomes

We also measured the pairwise average distance between all chromosome territories in our structure population. When territories are clustered based on these distances, two main groups can be identified (Figure 4.9 A). The first group (chromosomes 1, 11, 14-17, 19-22) tends to occupy the central region of the nucleus, as is evident from their population-based joint localization probabilities (Figure 4.9 B). These chromosomes also tend to have relatively higher gene densities [102]. The second group (chromosomes 2-10, 12, 13, 18, X) preferentially occupies the periphery of the nucleus (Figure 4.9 B).

Figure 4.9: Contact overlap between structures in the population. (A) Clustering of chromosomes with respect to the average distance between the center of mass of each chromosome pair in the genome structure population (shorter to longer average distance is colored by gradual purple to white). The clustering dendrogram, which identifies two clusters, is shown on top. (B) (Left panels) The density contour plot of the localization probability [2] for all chromosomes in cluster 1 (top panel) and cluster 2 (bottom panel) calculated from all the structures in the genome structure population. The rainbow color-coding ranges from blue (minimum value) to red (maximum value). (Right panels) Shown is a representative genome structure from the genome structure population. Chromosome territories are shown for all chromosomes in cluster 1 (top) and all chromosomes in cluster 2 (bottom).

4.3.2.4 Packing of the active and inactive regions

We also analyzed the structure population with respect to the active and inactive classes that were characterized in the previous chapter.
In fact, a majority of the blocks, which define the spheres, are composed of mostly active or mostly inactive segments. This observation is not surprising given that the active and inactive segments were identified based on their contact profile correlation and the borders of the blocks were identified on a similar basis. Accordingly, we classified spheres with at least 90% of their segments in the active class as active spheres and those with at least 90% of their segments in the inactive class as inactive spheres. Other spheres with a more heterogeneous composition of the active and inactive segments were not included in our analyses.

Even at the current resolution of the structure population, we observed an important difference in the packing of the active and inactive spheres when we measured the distances between all active and all inactive spheres (Figure 4.10). The average distances between spheres comprising mainly inactive regions are significantly smaller (p-value < $10^{-15}$, paired t-test), suggesting that inactive regions are more densely packed in the structure population in comparison to the active regions.

Figure 4.10: Local packing of the active and inactive class in the structure population. Histogram showing the distribution of the average distances between spheres in the structure population. The distribution of distances between all spheres composed of mainly inactive regions is shown in blue, and the distribution of distances between spheres composed of mainly active regions is shown in orange. The average distances between the inactive spheres are significantly smaller than those between the active spheres.

4.3.2.5 Association between different chromosome territories

In order to evaluate interchromosomal contacts and association between different chromosomes in our structure population, we analyzed the spheres that are mostly composed of high-ICP active segments. A given interchromosomal interaction between these spheres is present, on average, in about 16% of the structures in the population. This value is about 18 times larger than the corresponding value for an interchromosomal contact between spheres containing high-ICP regions and spheres containing low-ICP regions. It is important to note that the average sphere spans a block of about 6.8 Mb (i.e., 10 segments in $F_{4992}$). As a result, a sphere classified as “high-ICP” does not exclusively contain high-ICP segments.

Additionally, convergence of more than two chromosomes to the same location in the nucleus is observed frequently in the structure population. For example, segments of up to four chromosomes converge to the same location with a significantly high frequency in many structures. However, occurrences of more than 8 chromosomes converging on the same location appear to be extremely rare.

4.3.3 Assessment of the structure population

It is fundamentally important to assess the accuracy of our genome structure population, defined as the difference between the modeled and native structures in three-dimensional details. This difference, however, is currently impossible to evaluate with any certainty. Nevertheless, several lines of evidence indicate that our population is representative of the true configurations of the genome. First, it is possible to find a population of genome structures that is consistent with all the experimental data. The fact that such a structure population, one without any violation of the TCC-derived structural parameters, can be produced suggests that the TCC data and our structure determination strategies are self-consistent and also consistent with each other. Second, the structure population statistics agree with independent imaging experiments. Specifically, the average radial positions of chromosome territories in the structure population correlate strongly with results from FISH studies.
We carried out more analyses to further assess the validity of our structure population. In one analysis, we replicated the structure determination steps to generate three independent structure populations based on identical input data. We found that the results are highly reproducible, with independently generated populations producing the same statistical features with high precision. This level of agreement further indicates that the input data are self-consistent and that the size of the model population, for this level of resolution, is sufficiently large.

In another analysis, we generated multiple structure populations after removing parts of the input data. For example, we randomly selected 50% of the block contacts and converted their frequencies to zero. We then used the reduced matrix as the input for generating a structure population. The resulting population was able to correctly predict the excluded contact data. Specifically, the Pearson’s correlation between the partial contact frequency map used as input and the complete experimental data set was 0.87, whereas the Pearson’s correlation between the maps from the partial-input and complete-input structure populations increased to 0.93. Moreover, the radial positions of chromosome territories in the partial-input structure population agree with those of the complete-input population (Pearson’s r = 0.98, p-value < $10^{-3}$). Also, several independent replicates of this procedure produced a very similar outcome.

4.4 Discussion

We have devised a method for modeling three-dimensional structures based on the TCC contact data, which are inherently two dimensional. This method was made possible by selecting an effective structural representation for the chromosomes (Figure 4.3). By representing each chromosome as a limited number of spheres (Table C.1), we addressed the difficulties of genome-wide structural analysis on the very long chromatin fiber.
Furthermore, this representation provided an excellent foundation for the establishment and formulation of other structure derivation strategies. Another crucial feature of our modeling approach is the introduction of a population-based structural analysis concept: to account for the extensive structural heterogeneity between the genomes of different cells, we calculated not a single structure but a population of genome structures that, as a whole, embody the structural parameters that were embedded in the TCC data (Figure 4.4).

Our modeling approach was also served by a strategy to translate the TCC contact data into structural parameters. We formulated all observed contact frequencies between the chromatin blocks that underlie our sphere representation as a function of contact frequencies between consecutive blocks. We then used the contact frequencies between the consecutive blocks as a reference point for a sphere contact probability of one and measured all the other contact probabilities accordingly. This approach translated the chromatin contact frequencies in TCC data to structural parameters, namely sphere contact probabilities, that could be implemented in our modeling strategy. In this regard, it is important to emphasize that our method does not correlate contact frequencies with averaged distances as do other methods [183, 15, 113]; it relies purely on the TCC data by incorporating only the presence or absence of chromatin contacts. By not assuming a correlation between crosslinking frequency and distance, our method is not only more parsimonious in data interpretation, it is probably more reliable as well, since crosslinking frequency in any conformation capture dataset does not necessarily represent the distance between loci, for two reasons. First, it is unknown how many protein intermediates were crosslinked between the loci, and thus, it is unknown exactly what the distance between crosslinked loci is.
Second, when loci are not in crosslinking range of each other they can have any given distance, and these distances will not be reflected in crosslinking frequency values.

Finally, our modeling approach also includes effective methods to optimize random arrangements of the chromosome spheres into conformity with the sphere contact probabilities in a population context. In these optimization methods, we established the final step of calculating structures on the basis of our representation and data translation strategies.

Assessment of the resulting structure population indicated that it is representative of the native genome architectures in lymphoblastoid cells. For example, the structure population is highly reproducible, with independently generated populations reproducing the same statistical features with high precision. Furthermore, structure populations based on only part of the TCC data are able to correctly predict the missing data points. Most importantly, the calculated population reproduces the hallmarks of chromosome territory positioning in agreement with independent fluorescence in situ hybridization (FISH) studies.

Our population-based modeling, therefore, provides a novel and effective means of studying the three-dimensional genome architectures. By systematically translating the TCC data into a population of genome structures, this approach allows for a probabilistic interpretation of the genome organization (Figures 4.6, 4.7, 4.8, A.8, 4.9, and 4.10). The importance of probabilistic interpretations lies in the cell-to-cell structural variation that is observed in eukaryotic genomes [48, 122]. Many structural variables of eukaryotic genomes have a range of values with an unknown distribution. Only a probabilistic structural analysis approach can properly accommodate and analyze these variables.
The opposite, deterministic approaches, which are often used in structural biology, can only be effective when a variable has a known distribution with one or a few “main” values and little deviation outside the main value(s). This scenario does not appear to be the case when it comes to the architecture of higher eukaryotic genomes, underscoring the uniquely enabling role of population-based approaches in such studies [124].

The structure population also reconciles some previously known hallmarks of chromosome structure with indiscriminate interchromosomal interactions between some regions in the active class. In the previous chapter, we described numerous low-frequency interactions between high-ICP active regions. We showed that each interaction is only present in a small fraction of the cell population. These cell-to-cell differences are reflected in a fairly large variation between the genome structures in the population (Figures 4.6 and 4.7). In spite of this variation, however, the structure population reproduces the previously described [50, 26] preferred radial positions of chromosomes (Figures 4.8 and A.8). The structural analysis indicates that the genome-wide behavior of interchromosomal interactions, as observed in the TCC data, is in keeping with the previously described architectural features of the genome. Furthermore, the population demonstrates that the TCC data alone are sufficient to reproduce the distinct spatial distributions of chromosome territories.

It is also important to point out the current limitations of our modeling approach. The resolution of the structures is relatively low as they represent the genome with only 428 spheres. Increasing the resolution will require more sequencing data as well as more effective computational algorithms to handle structural calculations that grow much more complex with an increasing number of spheres. In addition, not every structure in the population is a definitive structure of chromosomes.
It is the population, as a whole, that captures the structural features of the genome. Finally, the current structure population is less suitable for analyses that compare homologous chromosomes, as the ambiguity of contacts with regard to the two copies of each chromosome can make identifying subtle differences between them difficult.

In summary, our modeling strategy represents the first method for genome-wide derivation of chromosome structures from experimental genome-wide data. Consequently, the structure population represents the first structure of an entire mammalian genome. Our approach also enables the integration of conformation capture data with other experimental datasets that pertain to the genome architecture. For example, genome-wide conformation capture data can be combined with FISH, electron microscopy, and DamID [186] data to generate model populations that more accurately represent the genome structure. Furthermore, our model population provides a starting point for a higher resolution description of the spatial properties of the genome.

4.5 Materials and methods

4.5.1 Structural representation

In order to model genome structures based on the TCC data, the first step is to define an appropriate structural representation of the genome that simplifies the problem to an appropriate level of resolution. To do so, we coarse-grained each chromosome into blocks of consecutive segments that show a similar intrachromosomal contact profile. We then represented each block with a sphere in the structural models. The following sections describe this process. They also define our sphere-based structure population. An $F_{4992}$ contact frequency matrix based on the HindIII tethered (HindIII-TCC) catalogue was used for all the following calculations and analyses (footnote 3).

4.5.1.1 Coarse-graining of chromosomes

Distance matrix.
As a first step in coarse-graining the genome, a distance matrix was calculated for each chromosome:

\[ D_{K_{chr}} = (d_{ij})_{K_{chr} \times K_{chr}} \tag{4.1} \]

where

\[ d_{ij} = 1 - p_{ij} \tag{4.2} \]

with $p_{ij}$ as the Pearson's correlation between the $i$th row and $j$th column in the distance-normalized matrix ($F_{K_{chr}}$, see section 3.5.6 on p. 111). $D$ thus represents the dissimilarity between the intrachromosomal contact profiles of various segments in $F_{4992}$ (Tables 3.2 and 3.3).

Footnote 3: Only a fraction of the HindIII tethered contact catalogue in Table 2.1 was used for the analyses in this chapter. This fraction contained 22,972,949 filtered pairs that represent a randomly selected subset of the complete catalogue. The experiments in this chapter were carried out only after limited sequencing of the HindIII tethered library [92]. Structure calculations were later reproduced with the complete catalogue and very similar results were obtained.

Constrained hierarchical clustering. Based on this distance matrix ($D$), constrained hierarchical clustering was performed to cluster consecutive segments into blocks that share similar contact profiles (footnote 4). This clustering strategy produces a hierarchy of clusters, with larger clusters being divided into increasingly smaller clusters until reaching single segments. In other words, this clustering strategy produces a hierarchy of blocks: at higher levels of this hierarchy there are a few blocks of large size and low uniformity in contact profile, and at lower levels there are many blocks of small size and high uniformity in contact profile (Figure 4.1 A). To choose an appropriate total number of blocks, we used an “objective function” that balances the total number of blocks against the distance ($d$) spread within the blocks [96].
This objective function minimizes a penalty score that is proportional to both the total number of blocks and the distance spread within the blocks, resulting in an arrangement that minimizes the total number of blocks without sacrificing uniformity within the blocks (footnote 5; Figures 4.2 and A.7). Using this objective penalty function resulted in a total of $N = 428$ blocks for all chromosomes. Table 4.1 shows the number of blocks of each chromosome.

Footnote 4: Constrained hierarchical clustering was done using the “chclust” function in the rioja package of R (http://cran.r-project.org/).

Footnote 5: This objective function is described by Kelley and colleagues, who applied it to a similar problem in Nuclear Magnetic Resonance (NMR) data analysis [96].

Table 4.1: The number of spheres representing each chromosome.

Chr      Blocks     Chr      Blocks
chr1     30         chr13    22
chr2     17         chr14    28
chr3     27         chr15    18
chr4     21         chr16    10
chr5     22         chr17    12
chr6     28         chr18    13
chr7     24         chr19    13
chr8     15         chr20    10
chr9     19         chr21    12
chr10    22         chr22    12
chr11    15         chrX     23
chr12    15

4.5.1.2 Determining the sphere volumes

We defined the volume that is predominantly occupied by the chromatin fiber of each block as a sphere.
Assuming that all the spheres have similar densities, the radius of each sphere ($R_i$) would be proportional to the cube root of the mass and, thus, the genomic length ($l$) of its corresponding chromosomal block:

\[ R_i \propto (\mathrm{mass}_i)^{1/3} \propto (l_i)^{1/3} \tag{4.3} \]

Since the volume of the nucleus is only partially occupied by the genome, we defined the nuclear genome occupancy ($O_{nuc}$) with respect to the sphere representation as:

\[ O_{nuc} = \frac{2 \cdot \frac{4}{3}\pi \sum_{i=1}^{N} R_i^3}{\frac{4}{3}\pi R_{nuc}^3} = \frac{2\alpha \sum_{i=1}^{N} l_i}{R_{nuc}^3} \tag{4.4} \]

where $R_{nuc}$ is the nuclear radius and $\alpha$ is a coefficient that is adjusted to reproduce the nuclear volume occupancy ($O_{nuc}$) and is written as:

\[ \alpha = \frac{O_{nuc}\, R_{nuc}^3}{2 \sum_{i=1}^{N} l_i} \tag{4.5} \]

Accordingly, the sphere radius for block $i$ is given by:

\[ R_i = (\alpha\, l_i)^{1/3} = \left( \frac{O_{nuc}\, l_i}{2 \sum_{j=1}^{N} l_j} \right)^{1/3} R_{nuc} \tag{4.6} \]

To calculate the radii of spheres, therefore, the nuclear occupancy of the genome ($O_{nuc}$) must be determined. It is possible to estimate the total volume of the diploid genome based on its length by assuming the 30 nm chromatin fiber as a cylindrical tube with a diameter of 30 nm. This volume results in a nuclear occupancy of $O_{nuc} = 7.5\%$ in an average nucleus with a radius of 5 microns. Since this 30 nm fiber may only be loosely packed in the spheres, the nuclear volume occupancy of the spheres should be more than 7.5%. Previously published studies estimate the nuclear occupancy level of the genome to be between 10% and 40% [62]. Considering these experimental measurements, our own estimates of 30 nm fiber-based occupancy, and loose packing of the fiber inside the spheres, we based our modeling calculations on a 20% occupancy of the nuclear volume by our 428 spheres ($O_{nuc} = 20\%$). It should be noted that we replicated all modeling steps with 10%, 30%, and 40% values for $O_{nuc}$ and observed very similar results. Appendix Table C.1 provides a list of the 428 spheres with their genomic locations, lengths, and radii.
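As a minimal sketch of Equation 4.6, the sphere radii can be computed directly from the block lengths. The function names and the toy block lengths below are hypothetical, and using the 5 micron radius of the "average nucleus" quoted above as the model's nuclear radius is an assumption; this is an illustration, not the code used in this work.

```python
def sphere_radii(block_lengths, occupancy=0.20, nuclear_radius=5.0):
    """Sphere radii following Eq. 4.6.

    block_lengths : genomic length l_i of each haploid block (any unit)
    occupancy     : nuclear occupancy O_nuc (0.20 is the value used in the text)
    nuclear_radius: R_nuc; 5.0 (microns) is an assumption for this sketch
    """
    total_length = sum(block_lengths)
    return [
        (occupancy * length / (2.0 * total_length)) ** (1.0 / 3.0) * nuclear_radius
        for length in block_lengths
    ]


def occupancy_from_radii(radii, nuclear_radius=5.0):
    """Recover O_nuc from the radii (Eq. 4.4); the factor 2 is for the diploid genome."""
    return 2.0 * sum(r ** 3 for r in radii) / nuclear_radius ** 3


# Hypothetical block lengths in base pairs (not taken from Table C.1).
toy_lengths = [10_000_000, 20_000_000, 5_000_000]
toy_radii = sphere_radii(toy_lengths)
```

Feeding the resulting radii back into Equation 4.4 should return the chosen occupancy, which is a convenient sanity check on any implementation of this step.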
4.5.1.3 Structure population

To capture the structural heterogeneity among the genome structures of a cell population, the spatial features of the genome were expressed as a population of genome structures. Each genome structure in the model population is defined by the configuration of the genome model or, in other words, the positions of all its spheres. The genome population $P$ is defined by a set of $M$ genome models (sphere coordinates):

\[ P = \{ X_1, X_2, \ldots, X_M \} \tag{4.7} \]

where $M$ is the total number of genome models in the population $P$ and $X_m \in P$ is defined as:

\[ X_m = \{ x_{m,n} \mid n = 1, 2, \ldots, 2N \text{ and } x_{m,n} \in \mathbb{R}^3 \} \tag{4.8} \]

$X_m$ is the set of all sphere coordinates $x_{m,n}$ for all chromosomes in a single model in the population. $N = 428$ is the total number of spheres in the haploid genome; the model $X_m$ contains $2N$ spheres for the diploid genome. $M = 10{,}000$ is the total number of models in the population.

4.5.2 Structure determination

Structure calculation entails the generation of a structure population $P$ that is consistent with the TCC data. In order to do so, one can start from a population of random sphere arrangements and optimize these arrangements based on the TCC data. Such a calculation requires three components: first, a strategy to convert the TCC contact frequency data into contact frequencies between spheres; second, a strategy to assign these sphere contacts to the different structures in the structure population; and third, an optimization method that can reorganize the spheres in a structure to minimize the disagreement between the structure and the assigned sphere contacts. The following sections describe these three components and their implementation in our modeling strategy.

4.5.2.1 Converting block contacts to sphere contact probabilities

Block-block contact frequency matrix.
Based on the 428 blocks that were determined by coarse-graining of chromosomes (Tables 4.1 and C.1), the block-block contact frequency matrix ($G_N$) was defined as:

\[ G_N = (g_{kl})_{N \times N} \tag{4.9} \]

where

\[ g_{kl} = \sum_{i=b_S(k)}^{b_E(k)} \; \sum_{j=b_S(l)}^{b_E(l)} f_{ij} \tag{4.10} \]

with $b_S(n)$ as the first segment and $b_E(n)$ as the last segment of the $n$th block and $f_{ij}$ as the contact frequency between segments $i$ and $j$ based on the contact frequency matrix ($F_{4992}$).

Sphere contact probability matrix. Based on the block-block contact frequency matrix, a sphere contact probability matrix $A_N = (a_{ij})_{N \times N}$ was defined that determines the fraction of models which should show contact between spheres $i$ and $j$, where $N$ is the total number of spheres representing the haploid genome (Tables 4.1 and C.1). $a_{ij}$ depends on the block contact frequency $g_{ij}$ observed in the TCC contact catalogue as described below.

For each block $i$, a cutoff contact frequency $f^{max}_i$ was defined at which a contact involving its corresponding sphere was set to a probability of 1 or, in other words, present in all of the $M$ models in the population. The $f^{max}_i$ value is unknown and can vary for different blocks because of the differences in the efficiency with which the corresponding genomic DNA is digested by the restriction enzyme in the course of the TCC experiment. However, $f^{max}_i$ can be calibrated by a common reference point. Because TCC contact frequencies between the blocks that belong to the same chromosome increase with decreasing distances between them, each block tends to show its highest TCC contact frequencies with its immediate neighbors. Since every two consecutive blocks on a chromosome represent contiguous chromatin regions, based on the polymeric nature of the chromatin fiber, their corresponding spheres must be in contact with each other at all times.
We, therefore, used the contact frequencies of consecutive blocks as reference points to calibrate the block contact frequency at which contacts are observed between their corresponding spheres in all models of the population. $f^{max}_i$ was defined for each individual block; $f^{max}_i$ for block $i$ is based on its contacts with the two genomic blocks, or spheres, that flank it:

\[ f^{max}_i = \min\{ g_{i,(i-1)},\; g_{i,(i+1)} \} \tag{4.11} \]

For the first and last block in each chromosome, which have only one neighboring block, $f^{max}_i$ is defined as $g_{i,(i+1)}$ and $g_{i,(i-1)}$, respectively. On this basis, the probability of contact between any two spheres in the genome can be determined. Accordingly, we calculated the probability of contact for sphere pair $i$ and $j$ in the model population as:

\[ a_{ij} = \begin{cases} \dfrac{g_{ij}}{\min(f^{max}_i, f^{max}_j)}, & \text{if } \dfrac{g_{ij}}{\min(f^{max}_i, f^{max}_j)} \le 1 \\[2ex] 1, & \text{otherwise} \end{cases} \]

Given two spheres $i$ and $j$ with an observed TCC contact frequency of $g_{ij}$ between their corresponding blocks, the contact was activated in only $a_{ij} \times M$ of the total structures ($M$) in the genome model population.

4.5.2.2 Assigning sphere contacts to structure models

If the probability of a sphere contact ($a_{ij}$) is equal to 1, it can be assigned to all the model structures in the population. The contacts that have a probability smaller than 1 ($0 < a_{ij} < 1$) should be assigned to the corresponding fraction of models in the population. The question is how to choose which subset of models receives this assignment.

The models in which a sphere contact is activated can be defined as:

\[ a_{ij} = \frac{\sum_{m=1}^{M} w_{ijm}}{M} \tag{4.12} \]

where the matrix element variable $w_{ijm}$ is binary (0, 1) with $W = (w_{ijm})_{N \times N \times M}$. $w_{ijm}$ denotes whether contact between spheres $i$ and $j$ in model $m$ ($1 \le m \le M$) is active ($w_{ijm} = 1$) or inactive ($w_{ijm} = 0$). The sum of all binary values $w_{ijm}$ over all $M$ models in the population is equal to the product of the contact probability $a_{ij}$ and the total number of models $M$ in the population.
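The calibration described in Section 4.5.2.1 (Equation 4.11 and the piecewise definition of the contact probability) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy block contact matrix, and the chromosome labels are hypothetical, block indices are 0-based, and self-contacts are simply left at zero because the text does not discuss them.

```python
def cutoff_frequencies(G, chrom):
    """f_max for each block (Eq. 4.11): the smaller of its contact
    frequencies with the flanking blocks on the same chromosome.
    Terminal blocks have a single neighbour, as described in the text."""
    n = len(G)
    f_max = []
    for i in range(n):
        neighbours = [
            G[i][j] for j in (i - 1, i + 1)
            if 0 <= j < n and chrom[j] == chrom[i]
        ]
        if not neighbours:
            # Assumption: a chromosome made of a single block is not covered
            # by the text, so this sketch refuses to guess a value.
            raise ValueError(f"block {i} has no neighbour on its chromosome")
        f_max.append(min(neighbours))
    return f_max


def contact_probabilities(G, f_max):
    """a_ij = g_ij / min(f_max_i, f_max_j), capped at 1.
    Assumes all cutoff frequencies are positive."""
    n = len(G)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # self-contacts are left at 0 in this sketch
                A[i][j] = min(1.0, G[i][j] / min(f_max[i], f_max[j]))
    return A


# Hypothetical block-block contact frequencies for two 2-block chromosomes.
toy_G = [
    [0, 40, 4, 2],
    [40, 0, 10, 5],
    [4, 10, 0, 20],
    [2, 5, 20, 0],
]
toy_chrom = ["chrA", "chrA", "chrB", "chrB"]
toy_f_max = cutoff_frequencies(toy_G, toy_chrom)
toy_A = contact_probabilities(toy_G, toy_f_max)
```

In this toy case consecutive blocks receive a probability of 1, while the interchromosomal pair (1, 2) receives 10/20 = 0.5, that is, it would be assigned to half of the models.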
The contact assignment problem can now be reframed as determining $W$. To determine $W$, we designed a heuristic approach. A contact restraint was activated in a model during optimization if the contacting pair of spheres in the initial random configuration was within a certain activation distance, namely the diameter of an activation volume:

\[ w_{ijm} = \begin{cases} 1, & \text{if } d^{m}_{ij} < d^{act}_{ij} \\ 0, & \text{otherwise} \end{cases} \]

Suppose the ratio of the activation volume for contact between $i$ and $j$ to the total nuclear volume $V_{nuc}$ is proportional to $a_{ij}$:

\[ a_{ij} \propto \frac{V^{act}_{ij}}{V_{nuc}} \propto \left( \frac{d^{act}_{ij}}{R_{nuc}} \right)^3 \tag{4.13} \]

Then, the activation distance of a sphere pair can be calculated as:

\[ d^{act}_{ij} = a_{ij}^{1/3} \, R_{nuc} \, s_{a_{ij}} \tag{4.14} \]

where $R_{nuc}$ is the nuclear radius, and $s_{a_{ij}}$ is an empirical scaling function which ranges from 0.90 to 1.39. For a genome occupancy of $O_{nuc} = 20\%$:

\[ s_x = 2.60638x^3 - 3.6264x^2 + 1.775696x + 0.638916 \tag{4.15} \]

This heuristic procedure allows an estimation of the correct activation probabilities for restraints with an rmsd error of 0.0298 ($a_{ij} \ge 0.25$) with Pearson's $r = 0.9945$ (p-value $< 10^{-15}$ and confidence level 99%).

We excluded contacts for pairs of spheres with $a_{ij} < 0.25$. This restriction reduces the number of false positive contacts detected in the experiments while not substantially decreasing the information content. The sphere contacts with frequency $a_{ij} \ge 0.25$, totaling 2717 contacts that involved 426 spheres (the blocks corresponding to 2 out of 428 spheres had no contacts in the TCC data), were included in later structure calculations.

4.5.2.3 Optimization of structure models

At this point in the procedure, each structural model in the population consists of a random sphere arrangement and has been assigned a set of contacts between spheres. Optimizing these models into structures that are consistent with TCC data requires a means of measuring the agreement between the sphere arrangements and their assigned contacts.
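Before turning to the scoring function, the activation heuristic of Section 4.5.2.2 (Equations 4.14 and 4.15) can be sketched in a few lines. The function names, the toy coordinates, and the nuclear radius of 5.0 are assumptions made for illustration; only the polynomial coefficients come from Equation 4.15, which the text gives for an occupancy of 20%.

```python
import math


def scaling_factor(a):
    """Empirical scaling function s_x of Eq. 4.15 (stated for O_nuc = 20%)."""
    return 2.60638 * a ** 3 - 3.6264 * a ** 2 + 1.775696 * a + 0.638916


def activation_distance(a, nuclear_radius=5.0):
    """Activation distance of Eq. 4.14: a^(1/3) * R_nuc * s_a.
    nuclear_radius=5.0 is an assumption for this sketch."""
    return a ** (1.0 / 3.0) * nuclear_radius * scaling_factor(a)


def activate_contact(positions_i, positions_j, a, nuclear_radius=5.0):
    """Return w_ijm for every model m: 1 if spheres i and j start out closer
    than the activation distance in that model's random configuration, else 0.

    positions_i, positions_j: per-model (x, y, z) centers of spheres i and j.
    """
    d_act = activation_distance(a, nuclear_radius)
    return [
        1 if math.dist(p, q) < d_act else 0
        for p, q in zip(positions_i, positions_j)
    ]


# Two hypothetical models: the pair starts 1.0 apart in one and 4.9 apart in the other.
toy_w = activate_contact(
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (4.9, 0.0, 0.0)],
    a=0.25,
)
```

Evaluating the polynomial at the two ends of the retained probability range, 0.25 and 1, gives approximately 0.90 and 1.39, matching the range quoted for the scaling function.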
To this end, we defined a scoring function that quantifies the disagreement between a given arrangement of spheres in a structure model and the contacts assigned to that structure, and that also ensures general structural soundness, with all spheres located within the nuclear border and not clashing with one another.

Scoring function. The scoring function $S(X_1, X_2, \ldots, X_M)$ is defined as the sum of all spatial restraints and depends on all the coordinates of each model $X_m$ in the population. Determining the structure of the genome is equivalent to finding the set of all sphere coordinates $P$ so that the scoring function is zero:

\[ S(X_1, X_2, \ldots, X_M) = \sum_{m=1}^{M} \sum_{i=1}^{2N} u^{nuc}_{mi} + \sum_{m=1}^{M} \sum_{i=1}^{2N-1} \sum_{j>i}^{2N} u^{exc}_{mij} + \sum_{m=1}^{M} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} w_{mij}\, u^{con}_{mij} \tag{4.16} \]

\[ \text{s.t.} \quad \begin{cases} w_{mij} \in \{0, 1\} \\ \sum_{m=1}^{M} w_{mij} = a_{ij} M \end{cases} \]

This scoring function comprises three types of restraints: the nuclear volume restraints ($u^{nuc}_{mi}$), excluded volume restraints ($u^{exc}_{mij}$), and contact restraints ($u^{con}_{mij}$). Each restraint can be thought of as applying a “force” to the sphere during the optimization of sphere arrangements such that spheres stay within the nuclear boundaries, stay out of each other's volume, or form a contact. The functional terms of these restraints are described in detail below. The scoring function is also subject to the matrix element variable $w_{mij}$ with $W = (w_{mij})_{N \times N \times M}$, which establishes which models a contact restraint is assigned to.

Nuclear volume restraints. These restraints confine all the spheres to a predefined spherical volume that represents the nucleus and are expressed as:

\[ u^{nuc}_{i}(X_m) = \begin{cases} \frac{1}{2} k_{nuc} (R_i + d_{i0} - R_{nuc})^2, & \text{for } R_{nuc} < R_i + d_{i0} \\ 0, & \text{otherwise} \end{cases} \]

where $R_{nuc}$ is the nuclear radius, $R_i$ is the radius of sphere $i$, $k_{nuc}$ is a harmonic constant, and $d_{i0}$ is the distance between the center of sphere $i$ and the nuclear center. $k_{nuc}$ was set to 1.
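The nuclear volume restraint is a one-sided harmonic penalty, which a short sketch makes concrete. The function name and the example values are hypothetical, and the nucleus is assumed to be centered at the origin.

```python
import math


def nuclear_volume_restraint(center, radius, nuclear_radius, k_nuc=1.0):
    """Penalty for a sphere protruding beyond the nuclear boundary.

    center: (x, y, z) of the sphere, with the nucleus centered at the origin
            (an assumption of this sketch).
    Returns 0.5 * k_nuc * (R_i + d_i0 - R_nuc)^2 when the sphere sticks out,
    and 0 otherwise.
    """
    d_i0 = math.dist(center, (0.0, 0.0, 0.0))
    overshoot = radius + d_i0 - nuclear_radius
    return 0.5 * k_nuc * overshoot ** 2 if overshoot > 0 else 0.0


# A sphere of radius 1 centered 4.5 from the nuclear center of a nucleus of
# radius 5 protrudes by 0.5.
protruding = nuclear_volume_restraint((4.5, 0.0, 0.0), 1.0, 5.0)
# A sphere well inside the nucleus contributes nothing to the score.
inside = nuclear_volume_restraint((1.0, 1.0, 1.0), 1.0, 5.0)
```

Because the penalty is zero everywhere inside the boundary, the restraint does not bias the radial position of spheres that already lie within the nucleus.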
Excluded volume restraints. These restraints prevent spatial overlap between spheres and are expressed as:

\[ u^{exc}_{ij}(X_m) = \begin{cases} \frac{1}{2} k_{exc} (R_i + R_j - d^{m}_{ij})^2, & \text{for } d^{m}_{ij} < R_i + R_j \\ 0, & \text{otherwise} \end{cases} \]

where $k_{exc}$ is a harmonic constant, $R_i$ and $R_j$ are the radii of spheres $i$ and $j$, and $d^{m}_{ij}$ is the Euclidean distance between spheres $i$ and $j$ in model $m$ as defined above. $k_{exc}$ was set to 1.

Contact restraints. These restraints enforce direct contact between two spheres in a structure. This restraint can be thought of as generating a “force” between the spheres during optimization (see below) so that they form a contact. However, the TCC contact data are ambiguous regarding the homologous copies of each chromosome. In other words, each locus is present in two copies in the diploid genome and the TCC data do not distinguish between them; therefore, it cannot be determined which of the alternative loci is responsible for a specific contact, nor can it be determined if only some or all of the possible contact combinations between homologous copies are formed in the cell population. An observed contact in the TCC data only means that at least one of the alternative combinations of loci has formed a contact.

To account for this ambiguity in calculating the structures, we distinguished three types of sphere contact ambiguity: contacts involving pairs of spheres that belong to different chromosomes ($B_1$), contacts involving pairs of consecutive spheres on the same chromosome ($B_2$), and those involving all other pairs of spheres on the same chromosome ($B_3$). These sets of sphere pairs were defined as:

\[ B_1 = \{ (i,j),\, (i+N,j),\, (i,j+N),\, (i+N,j+N) : 1 \le (i,j) \le N,\ \mathrm{chr}(j) \ne \mathrm{chr}(i) \} \]

\[ B_2 = \{ (i,j),\, (i+N,j+N) : 1 \le (i,j) \le N,\ \mathrm{chr}(j) = \mathrm{chr}(i),\ |i-j| = 1 \} \]

\[ B_3 = \{ (i,j),\, (i+N,j+N) : 1 \le (i,j) \le N,\ \mathrm{chr}(j) = \mathrm{chr}(i),\ |i-j| > 1 \} \]

where $i$ is the index of a sphere in one chromosome and $i+N$ the index for the same sphere in the homologous chromosome with $1 \le i \le N$.
$\mathrm{chr}(i)$ indicates the chromosome that sphere $i$ belongs to. Let $\Delta: B \to D$ be a function that, given a set of sphere indices $B$ and the coordinates of a single model from the population $X_m$, returns the corresponding set of distances:

\[ \Delta(B_1) = \{ d_{i,j},\, d_{i+N,j},\, d_{i,j+N},\, d_{i+N,j+N} \} \tag{4.17} \]

\[ \Delta(B_{k \in \{2,3\}}) = \{ d_{i,j},\, d_{i+N,j+N} \} \tag{4.18} \]

where $d_{i,j} = |x_i - x_j|$ is the Euclidean distance between spheres $i$ and $j$ with coordinates $x_i$ and $x_j$. The contact restraints are then defined as:

\[ u^{con}_{ij}(X_m) = \begin{cases} \frac{1}{2} k_{con} \sum_{(a,b) \in B_1} \theta_{ab} (d_{ab} - d^{0}_{ij})^2, & \text{if } \min(\Delta(B_1)) > d^{0}_{ij} \text{ and } 1 \le (i,j) \le N,\ \mathrm{chr}(j) \ne \mathrm{chr}(i) \\[1ex] \frac{1}{2} k_{con} \sum_{(a,b) \in B_2} (d_{ab} - d^{0}_{ij})^2, & \text{if } d_{ab} > d^{0}_{ij} \text{ and } 1 \le (i,j) \le N,\ \mathrm{chr}(j) = \mathrm{chr}(i),\ |i-j| = 1 \\[1ex] \frac{1}{2} k_{con} \sum_{(a,b) \in B_3} \theta_{ab} (d_{ab} - d^{0}_{ij})^2, & \text{if } \min(\Delta(B_3)) > d^{0}_{ij} \text{ and } 1 \le (i,j) \le N,\ \mathrm{chr}(j) = \mathrm{chr}(i),\ |i-j| > 1 \\[1ex] 0, & \text{otherwise} \end{cases} \]

where

\[ \theta_{ab} \big|_{(a,b) \in B_{k \in \{1,3\}}} = \begin{cases} 1, & \text{if } d_{ab} = \min(\Delta(B_k)) \\ 0, & \text{otherwise} \end{cases} \]

$k_{con}$ is a harmonic constant, and $d^{0}_{ij}$ is the maximum contact distance defined as $d^{0}_{ij} = (1 + \epsilon)(R_i + R_j)$ with scaling factor $\epsilon$. At the beginning of the optimization $\epsilon$ was set to 0.1 while $k_{con} = 10^{-4}$; then, during the course of the optimizations (see below), both terms were increased, to $\epsilon = 1$ and $k_{con} = 1$.

Optimization. The scoring function $S(X_1, X_2, \ldots, X_{M = 10{,}000})$ was optimized so that all restraints were satisfied, leading to a score of zero. The optimization started from 10,000 random initial sphere configurations, leading to an optimized population of 10,000 genome structures. The optimization process involved simulated annealing [99, 32] with molecular dynamics (MD) [8, 139] and conjugate gradient (CG) [168] optimizations. The scoring function was implemented and optimized in the Integrated Modeling Platform (IMP, http://www.integrativemodeling.org/) [4, 3].
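To illustrate how the excluded volume and contact restraints behave for a single diploid model, the following standalone sketch evaluates both terms in plain Python. It is not the IMP implementation used in this work: the function names, the 0-based indexing, the toy coordinates, and the reading of the consecutive-sphere case as "each homologous copy is restrained on its own" are assumptions made for illustration.

```python
import math
from itertools import combinations


def excluded_volume(coords, radii, k_exc=1.0):
    """Sum of harmonic overlap penalties over all pairs of the 2N spheres."""
    score = 0.0
    for i, j in combinations(range(len(coords)), 2):
        overlap = radii[i] + radii[j] - math.dist(coords[i], coords[j])
        if overlap > 0:
            score += 0.5 * k_exc * overlap ** 2
    return score


def contact_restraint(coords, radii, i, j, n_haploid, same_chromosome,
                      k_con=1.0, eps=1.0):
    """Contact restraint for haploid sphere indices i and j (0-based).

    Copies on the homologous chromosomes sit at i + N and j + N.
    Defaults correspond to the final values of k_con and eps given in the text.
    """
    N = n_haploid
    d0 = (1.0 + eps) * (radii[i] + radii[j])  # maximum contact distance
    if same_chromosome:
        pairs = [(i, j), (i + N, j + N)]                       # B2 or B3
    else:
        pairs = [(i, j), (i + N, j), (i, j + N), (i + N, j + N)]  # B1
    dists = [math.dist(coords[a], coords[b]) for a, b in pairs]
    if same_chromosome and abs(i - j) == 1:
        # Consecutive spheres: every copy that is too far apart is restrained.
        return sum(0.5 * k_con * (d - d0) ** 2 for d in dists if d > d0)
    # Ambiguous cases: only the closest alternative is pulled together,
    # and only when even that one exceeds the contact distance.
    closest = min(dists)
    return 0.5 * k_con * (closest - d0) ** 2 if closest > d0 else 0.0


# Hypothetical diploid model with N = 2 haploid spheres of radius 1.
toy_coords = [(0, 0, 0), (3, 0, 0), (10, 0, 0), (20, 0, 0)]
toy_radii = [1.0, 1.0, 1.0, 1.0]
trans = contact_restraint(toy_coords, toy_radii, 0, 1, 2,
                          same_chromosome=False, eps=0.1)
cis = contact_restraint(toy_coords, toy_radii, 0, 1, 2,
                        same_chromosome=True, eps=0.1)
clash = excluded_volume([(0, 0, 0), (1.5, 0, 0)], [1.0, 1.0])
```

In the toy model the interchromosomal case penalizes only the closest of the four alternatives (distance 3 against a contact distance of 2.2), whereas the consecutive-sphere case also penalizes the distant second copy, which is why its score is much larger.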
4.5.3 Territory radial positioning based on FISH

The radial positions of each chromosome territory in lymphoblastoid cells were obtained from a study by Boyle and colleagues, who used FISH for their measurements [26]. We eliminated chromosome 20 from our comparisons of radial positions between FISH experiments and the model population because the measured positions of this chromosome from two different imaging studies do not agree (Boyle et al. [26] and Cremer et al. [50]). Here we describe how the radial positions of each chromosome were extracted from the study by Boyle and colleagues.

First, radial position data were extracted from the histogram in Figure 3 of the article [26]. In that figure, the concentric shells were labeled from 5 (center) to 1 (periphery). To simplify the expressions we adopted a labeling of $i = 1..5$ from center to periphery. The radius of shell $i$ was scaled by using the smallest radius $r_1$, thus $r_i = r_1 \sqrt{i}$. Second, the midpoint of each shell was computed as $r^{mid}_i = \frac{r_i + r_{i-1}}{2} = 0.5\, r_1 (\sqrt{i} + \sqrt{i-1})$. Finally, for each chromosome, the values in the histogram of Figure 3 ($h_i$, $i = 1..5$) that were higher than $0.5 \max(h_i)$ were selected to contribute to the radial position calculations. Therefore, if $k$ is the subset of $i$ that was selected in the radial position calculation, then

\[ CT(chr) = \frac{\sum_{k} h_k\, r^{mid}_k}{\sum_{k} h_k} \tag{4.19} \]

where $CT(chr)$ is the average radial position of chromosome $chr$.

Chapter 5

Conclusions

The three-dimensional organization of the genome is not only a fascinating scientific problem, it is also crucially important for cellular functions. Nevertheless, a genome-wide perspective and understanding of this organization was lacking. We have here combined novel high-throughput experimental methods with innovative computational approaches in order to analyze the genome architecture systematically and provide a more comprehensive description of its properties.
We developed the Tethered Conformation Capture (TCC) technology, which is capable of mapping chromatin interactions at the genome-wide level (Figure 2.4). This method offers significant advances compared to those previously available. For one, it extends the scope of conformation capture experiments from a limited number of loci or interactions to the entire genome. At the same time, it is not biased to any particular locus or interaction and, therefore, requires no prior knowledge of the architectural landscape in the nucleus. This latter property enabled us to carry out a more agnostic and descriptive study as opposed to testing specific hypotheses of limited scope.

TCC's genome-wide scope and lack of bias were made possible by incorporating paired-end massively-parallel sequencing into the analysis of genome architecture. This development follows a general theme of many fields of biology that were revolutionized by the influence of the latest generation of sequencing technologies. Broadening the scope and eliminating bias through sequencing, however, are not the only important advances of the TCC technology. TCC also reduces the level of noise in conformation capture experiments. Noise is largely generated by intermolecular ligations between fragments of DNA that are not crosslinked to each other, a phenomenon which was dramatically reduced in TCC by integrating tethering and solid-phase ligation (Figure 2.10). This enhanced signal-to-noise ratio made the analysis of low-frequency contacts possible. The most prominent examples of these low-frequency contacts are interactions between chromosome territories, which could only be observed after incorporating tethering (Chapter 3 and Appendix B).

TCC also paves the way for future improvements in conformation capture technologies.
For example, not only does it facilitate higher resolution analyses with enzymes that cut the chromatin more frequently, it also enables the use of other methods, enzymatic, physical, chemical, and otherwise, for cutting the chromatin. It also allows for sophisticated methods of DNA end manipulation that would be necessary for customized conformation capture experiments in the future.

In this study, we used TCC to systematically study the genome architecture of human lymphoblastoid cells. Our studies provide many insights into the genome architecture. When considering any single chromosome, many different contacts are possible between various regions in that chromosome. These contacts show a wide range of frequencies (Figures 3.1 and 3.2), suggesting that the 3D organization of each chromosome is different in different cells. In other words, contacts between various parts of a chromosome are not absolute but rather probabilistic. In fact, two factors are most important in determining the association probabilities between different loci in the same chromosome. The factor with a larger magnitude of effect is the genomic distance between loci. The probability of contact between loci on the same chromosome rapidly decreases with increasing distance (Figure 3.2) [113]. The other important factor is “class”: the loci in each chromosome can be separated into two classes such that, after normalizing for genomic distance, loci that belong to the same class have higher contact probabilities compared to those that belong to different classes (Figures 3.3 and 3.4 B). It is this behavior that results in the signature “plaid pattern” [113] in the contact frequency map of each chromosome (Figures 3.1 and A.1).

These two classes, defined based on intrachromosomal contacts, are asymmetrical with respect to genomic functions, chromosome gene content and banding patterns, and spatial behavior.
With respect to genomic functions, only one of the two classes, henceforth called the active class, shows enrichment for DNaseI sensitivity, RNA polymerase II binding, gene expression, and activating histone modifications (Figure 3.5) [113]. With respect to chromosome content, the active class is enriched for known genes (Figure 3.5) and is more likely to coincide with the lighter bands of each chromosome (Figures 3.11 and A.5). The other class, henceforth called the inactive class, is by and large depleted of known genes (Figure 3.5) and tends to coincide with the darker bands in each chromosome (Figures 3.11 and A.5). With respect to spatial behavior, the regions of the inactive class preferentially associate with neighboring inactive regions, while the regions of the active class have a broad panel of long-range contact partners (Figure 3.4). Additionally, in large chromosomes, inactive regions on opposing sides of the centromere have little interaction with each other (Figures 3.6 and 3.7). At the same time, active regions on different arms show extensive interactions (Figures 3.6 and 3.7).

An important difference in the 3D organization of the active and inactive classes in each chromosome is their exposure to the other chromosomes. Inactive regions, by and large, have a low tendency to interact with other chromosomes (Figures 3.11, 3.12 A and A.4). There are a few exceptions to this trend, namely the regions that flank the centromeres. These regions show high contact frequencies with the centromeric regions of other chromosomes (Figure 3.13) and their association patterns are in excellent agreement with imaging results from other studies [6, 172, 7]. In contrast to the inactive regions, many active regions in each chromosome show a high tendency to interact with other chromosomes (Figures 3.12 A and A.4). This behavior indicates that, compared to inactive regions, these active regions are more exposed to other chromosome territories.
One way that this organization would be possible is for these active regions to have preferential localization at the border of their chromosome's territory and inactive regions to have preferential localization inside the territory or close to the nuclear envelope. In fact, FISH experiments that analyzed the localization of loci relative to their territory's edge [118, 104] support this model (Figures 3.12 C and 3.17).

As discussed earlier, the enhanced signal-to-noise ratio in TCC enabled us to accurately profile the interactions between chromosome territories. Such a profile was made more important by the fact that, compared to the internal organization of territories, much less was known about the interactions between them. We found that interactions between whole chromosomes are very abundant but distributed among a very large number of low-frequency locus-locus contacts. The loci that participate in interchromosomal interactions are a subgroup of the active class, which we called the high-ICP regions (Figures 3.12 A and A.4). Each of these regions forms significant interactions with a striking number of other high-ICP regions in other chromosomes (Figures 3.14 A and 3.15). The numerousness and the low-frequency nature of these interactions strongly suggested that each of them is only present in a small fraction of the cells. Our FISH experiments supported this notion and suggested that the fraction can be as small as 2% of the cell population (Figure 3.16). Together, these observations indicated that each cell in the population harbors only a small subset of all possible interchromosomal interactions and that different cells harbor different subsets.

What factors determine the subset of interactions in each cell and on what basis does a high-ICP active region choose its interchromosomal interaction partners?
We found that the higher the general tendency of a region to form interchromosomal interactions, the more likely it will be chosen as the partner by any high-ICP active region (Figures 3.14 B,C and A.6). Combined with evidence that high-ICP regions are more likely to be exposed to other chromosomes (Figure 3.12 C), our result suggested that accessibility of loci to each other in each cell is a very important factor in determining the subset of interactions that are present in that cell. In fact, our results suggest that the tendency of most regions to form interchromosomal contacts may not depend on any other regions in the genome. In other words, the formation of an interchromosomal interaction between two high-ICP active regions is largely indiscriminate of their individual identities and may take place in cells where they are in spatial proximity. In conclusion, interchromosomal interactions can form indiscriminately between high-ICP active regions that are accessible to each other.

We also noted that the existence of indiscriminate interactions between chromosomes is not entirely consistent with the concept of trans-regulation via interchromosomal communication [123] (see Chapter 1) as a general phenomenon in gene regulation. Trans-regulation via interchromosomal communication stipulates that the transcription of some genes is regulated by specific functional elements on other chromosomes. Such a specific regulatory interaction between two loci would require a discriminate interchromosomal interaction between the loci that leads to their spatial association in a considerable fraction of the cell population. Such physical associations are not prominent in our datasets. The conclusion about the existence of this phenomenon in other studies [165, 114] can be attributed to the absence of a genome-wide perspective and the bias of the analysis to the loci of interest.
In other words, if one could only observe one or a few of the indiscriminate interactions between chromosome territories, one may conclude that they are specific and discriminate. While our results do not support trans-regulation via specific interchromosomal communication as a general phenomenon, it may be the case for a few loci. It should also be considered that our analyses were done at low resolutions where interactions between a single regulatory element and a single promoter cannot be assessed. Furthermore, other forms of trans-regulation that are less specific, for instance, ones involving numerous interchromosomal interactions, cannot be ruled out. Additionally, our results do not in any way contradict trans-regulation via intrachromosomal communication.

This conclusion raises another question: if interchromosomal interactions are not a result of regulation-based communications, what mechanisms underlie them and why do they have an indiscriminate pattern? A clue to the answer may reside in the observation that the propensity to form interchromosomal contacts correlates with a region's transcriptional activity (Figure 3.12 B). Because transcription is often focused at discrete sites (i.e., transcription factories) [47], this correlation may be a consequence of the active regions from different chromosomes being recruited to the same factory. Formation of interchromosomal contacts between transcribing loci within transcription factories not only supports previous suggestions that transcription factories play an important role in stabilizing interchromosomal interactions [29, 49, 154] but also can explain the indiscriminate behavior of these interactions, as loci may be recruited to the same factory completely independently of each other.
However, the association of a specific transcription factor with only some of the transcription factories, as reported before [154], can make the recruitment of its targets to the same factories more likely. Moreover, since transcription is not the only nuclear function that is concentrated at discrete sites [105, 123], it is possible that other factories, such as those of splicing and DNA repair, also mediate the indiscriminate interactions between chromosome territories.

While analysis of contact data can provide insights into genome architecture, a more complete understanding requires obtaining three-dimensional structures. We therefore devised a method for modeling three-dimensional structures based on the TCC contact data. This method was made possible by selecting an effective structural representation for the chromosomes (Figure 4.3). By representing each chromosome as a limited number of spheres (Table C.1), we addressed the difficulties of genome-wide structural analysis on the very long chromatin fiber. Furthermore, this representation provided an excellent foundation for the establishment and formulation of other structure derivation strategies. Another crucial feature of our modeling approach is the introduction of a population-based structural analysis concept to account for the extensive structural heterogeneity between the genomes of different cells (Figure 4.4). Accordingly, we calculated a population of genome structures that represents the TCC data. Another integral part of our modeling approach was a strategy to translate the TCC contact data into structural parameters: the chromatin contact frequencies in the TCC data were translated to sphere contact probabilities. This is a unique aspect of our approach in that, unlike other methods [183, 15, 113], it does not correlate contact frequencies with averaged distances.
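The translation of contact frequencies into sphere contact probabilities, and their interpretation in a population of structures, can be illustrated with a minimal sketch. The function names, the reference frequency `f_ref`, the capping at 1, and the toy numbers below are illustrative assumptions and are not the normalization described in Chapter 4; the sketch only shows the idea that a contact probability specifies the fraction of structures in the population in which two spheres are in contact, rather than an averaged distance.

```python
# Minimal sketch (assumed, simplified): contact frequencies -> contact
# probabilities -> number of structures in a population with that contact.

def contact_probabilities(freq, f_ref):
    """Scale raw contact frequencies to probabilities in [0, 1].

    f_ref is a hypothetical reference frequency that is treated as
    'in contact in every cell'; larger frequencies are capped at 1.
    """
    return [[min(f / f_ref, 1.0) for f in row] for row in freq]


def structures_with_contact(prob, n_structures):
    """Expected number of structures (out of n_structures) in which
    each pair of spheres is in contact."""
    return [[round(p * n_structures) for p in row] for row in prob]


# Toy symmetric contact-frequency matrix for three spheres (made-up values).
freq = [
    [0, 50, 5],
    [50, 0, 200],
    [5, 200, 0],
]

prob = contact_probabilities(freq, f_ref=100.0)
counts = structures_with_contact(prob, n_structures=1000)
```

In this toy case, a pair with probability 0.5 would be in contact in about half of the structures and apart in the rest, which a single averaged distance cannot express.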
Finally, our modeling approach also includes effective methods to optimize random arrangements of the chromosome spheres into conformity with the sphere contact probabilities in a population context.

Assessment of the resulting structure population indicated that it is representative of the native genome architectures in lymphoblastoid cells. For example, structure populations based on only part of the TCC data are able to correctly predict the missing data points. Most importantly, the calculated population reproduces the radial positioning of chromosomes in agreement with FISH studies by others.

This population-based modeling, therefore, provides a novel means of studying three-dimensional genome architectures, one that allows for a probabilistic interpretation of genome organization (Figures 4.6, 4.7, 4.8, A.8, 4.9, and 4.10). The importance of probabilistic interpretations lies in the cell-to-cell structural variation that is observed in eukaryotic genomes [48, 122, 124]. By reproducing the radial positioning of chromosomes, the structure population also indicates that the genome-wide behavior of interchromosomal interactions is in keeping with the previously described architectural features of the genome. Furthermore, the population demonstrates that the TCC data alone are sufficient to reproduce the distinct spatial distributions of chromosome territories.

Our modeling strategy represents the first method for genome-wide derivation of chromosome structures from experimental genome-wide data. Consequently, the structure population represents the first structure of an entire mammalian genome. Our approach also enables the integration of conformation capture data with other experimental datasets that pertain to genome architecture, such as FISH, electron microscopy, and DamID data.

Bibliography
[1] A. K. Aggarwal, D. W. Rodgers, M. Drottar, M. Ptashne, and S. C. Harrison.
Recognition of a DNA operator by the repressor of phage 434: a view at high resolution. Science, 242:899–907, Nov 1988.
[2] F. Alber, S. Dokudovskaya, L. M. Veenhoff, W. Zhang, J. Kipper, D. Devos, A. Suprapto, O. Karni-Schmidt, R. Williams, B. T. Chait, M. P. Rout, and A. Sali. Determining the architectures of macromolecular assemblies. Nature, 450:683–694, Nov 2007.
[3] F. Alber, S. Dokudovskaya, L. M. Veenhoff, W. Zhang, J. Kipper, D. Devos, A. Suprapto, O. Karni-Schmidt, R. Williams, B. T. Chait, A. Sali, and M. P. Rout. The molecular architecture of the nuclear pore complex. Nature, 450:695–701, Nov 2007.
[4] F. Alber, F. Forster, D. Korkin, M. Topf, and A. Sali. Integrating diverse data for structure determination of macromolecular assemblies. Annu. Rev. Biochem., 77:443–477, 2008.
[5] F. Alber, M. F. Kim, and A. Sali. Structural characterization of assemblies from overall shape and subcomplex compositions. Structure, 13:435–445, Mar 2005.
[6] I. Alcobia, R. Dilao, and L. Parreira. Spatial associations of centromeres in the nuclei of hematopoietic cells: evidence for cell-type-specific organizational patterns. Blood, 95:1608–1615, Mar 2000.
[7] I. Alcobia, A. S. Quina, H. Neves, N. Clode, and L. Parreira. The spatial organization of centromeric heterochromatin during normal human lymphopoiesis: evidence for ontogenically determined spatial patterns. Exp. Cell Res., 290:358–369, Nov 2003.
[8] B. J. Alder and T. E. Wainwright. Studies in Molecular Dynamics. J. Chem. Phys., 31:459–467, Feb 1959.
[9] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403–410, Oct 1990.
[10] A. Ansari and M. Hampsey. A role for the CPF 3’-end processing machinery in RNAP II-dependent gene looping. Genes Dev., 19:2969–2978, Dec 2005.
[11] S. Arnott, R. Chandrasekaran, and C. M. Marttila. Structures for polyinosinic acid and polyguanylic acid. Biochem. J., 141:537–543, Aug 1974.
[12] O. T. Avery, C. M.
Macleod, and M. McCarty. Studies on the chemical nature of the substance inducing transformation of pneumococcal types: induction of transformation by a desoxyribonucleic acid fraction isolated from pneumococcus type III. J. Exp. Med., 79:137–158, Feb 1944.
[13] A. Barski, S. Cuddapah, K. Cui, T. Y. Roh, D. E. Schones, Z. Wang, G. Wei, I. Chepelev, and K. Zhao. High-resolution profiling of histone methylations in the human genome. Cell, 129:823–837, May 2007.
[14] D. L. Bates, Y. Chen, G. Kim, L. Guo, and L. Chen. Crystal structures of multiple GATA zinc fingers bound to DNA reveal new insights into DNA recognition and self-association by GATA. J. Mol. Biol., 381:1292–1306, Sep 2008.
[15] D. Bau, A. Sanyal, B. R. Lajoie, E. Capriotti, M. Byron, J. B. Lawrence, J. Dekker, and M. A. Marti-Renom. The three-dimensional folding of the α-globin gene domain reveals formation of chromatin globules. Nat. Struct. Mol. Biol., 18:107–114, Jan 2011.
[16] Barbara Beatty, Mai Sabine, and Jeremy Squire. FISH: a practical approach. Oxford University Press, Oxford, 2002.
[17] A. S. Belmont. Visualizing chromosome dynamics with GFP. Trends Cell Biol., 11:250–257, Jun 2001.
[18] A. S. Belmont and K. Bruce. Visualization of G1 chromosomes: a folded, twisted, supercoiled chromonema model of interphase chromatid structure. J. Cell Biol., 127:287–302, Oct 1994.
[19] D. L. Bentley. Rules of engagement: co-transcriptional recruitment of pre-mRNA processing factors. Curr. Opin. Cell Biol., 17:251–256, Jun 2005.
[20] D. R. Bentley, S. Balasubramanian, H. P. Swerdlow, G. P. Smith, J. Milton, C. G. Brown, K. P. Hall, D. J. Evers, C. L. Barnes, H. R. Bignell, J. M. Boutell, J. Bryant, R. J. Carter, R. Keira Cheetham, A. J. Cox, D. J. Ellis, M. R. Flatbush, N. A. Gormley, S. J. Humphray, L. J. Irving, M. S. Karbelashvili, S. M. Kirk, H. Li, X. Liu, K. S. Maisinger, L. J. Murray, B. Obradovic, T. Ost, M. L. Parkinson, M. R. Pratt, I. M. Rasolonjatovo, M. T. Reed, R. Rigatti, C.
Rodighiero, M. T. Ross, A. Sabot, S. V. Sankar, A. Scally, G. P. Schroth, M. E. Smith, V. P. Smith, A. Spiridou, P. E. Torrance, S. S. Tzonev, E. H. Vermaas, K. Walter, X. Wu, L. Zhang, M. D. Alam, C. Anastasi, I. C. Aniebo, D. M. Bailey, I. R. Bancarz, S. Banerjee, S. G. Barbour, P. A. Baybayan, V. A. Benoit, K. F. Benson, C. Bevis, P. J. Black, A. Boodhun, J. S. Brennan, J. A. Bridgham, R. C. Brown, A. A. Brown, D. H. Buermann, A. A. Bundu, J. C. Burrows, N. P. Carter, N. Castillo, M. Chiara E Catenazzi, S. Chang, R. Neil Cooley, N. R. Crake, O. O. Dada, K. D. Diakoumakos, B. Dominguez-Fernandez, D. J. Earnshaw, U. C. Egbujor, D. W. Elmore, S. S. Etchin, M. R. Ewan, M. Fedurco, L. J. Fraser, K. V. Fuentes Fajardo, W. Scott Furey, D. George, K. J. Gietzen, C. P. Goddard, G. S. Golda, P. A. Granieri, D. E. Green, D. L. Gustafson, N. F. Hansen, K. Harnish, C. D. Haudenschild, N. I. Heyer, M. M. Hims, J. T. Ho, A. M. Horgan, K. Hoschler, S. Hurwitz, D. V. Ivanov, M. Q. Johnson, T. James, T. A. Huw Jones, G. D. Kang, T. H. Kerelska, A. D. Kersey, I. Khrebtukova, A. P. Kindwall, Z. Kingsbury, P. I. Kokko-Gonzales, A. Kumar, M. A. Laurent, C. T. Lawley, S. E. Lee, X. Lee, A. K. Liao, J. A. Loch, M. Lok, S. Luo, R. M. Mammen, J. W. Martin, P. G. McCauley, P. McNitt, P. Mehta, K. W. Moon, J. W. Mullens, T. Newington, Z. Ning, B. Ling Ng, S. M. Novo, M. J. O’Neill, M. A. Osborne, A. Osnowski, O. Ostadan, L. L. Paraschos, L. Pickering, A. C. Pike, A. C. Pike, D. Chris Pinkard, D. P. Pliskin, J. Podhasky, V. J. Quijano, C. Raczy, V. H. Rae, S. R. Rawlings, A. Chiva Rodriguez, P. M. Roe, J. Rogers, M. C. Rogert Bacigalupo, N. Romanov, A. Romieu, R. K. Roth, N. J. Rourke, S. T. Ruediger, E. Rusman, R. M. Sanches-Kuiper, M. R. Schenker, J. M. Seoane, R. J. Shaw, M. K. Shiver, S. W. Short, N. L. Sizto, J. P. Sluis, M. A. Smith, J. Ernest Sohna Sohna, E. J. Spence, K. Stevens, N. Sutton, L. Szajkowski, C. L. Tregidgo, G. Turcatti, S. Vandevondele, Y. Verhovsky, S. M.
Virk, S. Wakelin, G. C. Walcott, J. Wang, G. J. Worsley, J. Yan, L. Yau, M. Zuerlein, J. Rogers, J. C. Mullikin, M. E. Hurles, N. J. McCooke, J. S. West, F. L. Oaks, P. L. Lundberg, D. Klenerman, R. Durbin, and A. J. Smith. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456:53–59, Nov 2008.
[21] A. P. Bird. CpG-rich islands and the function of DNA methylation. Nature, 321:209–213, 1986.
[22] E. Birney, J. A. Stamatoyannopoulos, A. Dutta, R. Guigo, T. R. Gingeras, E. H. Margulies, Z. Weng, M. Snyder, E. T. Dermitzakis, R. E. Thurman, M. S. Kuehn, C. M. Taylor, S. Neph, C. M. Koch, S. Asthana, A. Malhotra, I. Adzhubei, J. A. Greenbaum, R. M. Andrews, P. Flicek, P. J. Boyle, H. Cao, N. P. Carter, G. K. Clelland, S. Davis, N. Day, P. Dhami, S. C. Dillon, M. O. Dorschner, H. Fiegler, P. G. Giresi, J. Goldy, M. Hawrylycz, A. Haydock, R. Humbert, K. D. James, B. E. Johnson, E. M. Johnson, T. T. Frum, E. R. Rosenzweig, N. Karnani, K. Lee, G. C. Lefebvre, P. A. Navas, F. Neri, S. C. Parker, P. J. Sabo, R. Sandstrom, A. Shafer, D. Vetrie, M. Weaver, S. Wilcox, M. Yu, F. S. Collins, J. Dekker, J. D. Lieb, T. D. Tullius, G. E. Crawford, S. Sunyaev, W. S. Noble, I. Dunham, F. Denoeud, A. Reymond, P. Kapranov, J. Rozowsky, D. Zheng, R. Castelo, A. Frankish, J. Harrow, S. Ghosh, A. Sandelin, I. L. Hofacker, R. Baertsch, D. Keefe, S. Dike, J. Cheng, H. A. Hirsch, E. A. Sekinger, J. Lagarde, J. F. Abril, A. Shahab, C. Flamm, C. Fried, J. Hackermuller, J. Hertel, M. Lindemeyer, K. Missal, A. Tanzer, S. Washietl, J. Korbel, O. Emanuelsson, J. S. Pedersen, N. Holroyd, R. Taylor, D. Swarbreck, N. Matthews, M. C. Dickson, D. J. Thomas, M. T. Weirauch, J. Gilbert, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816, 2007.
[23] A. Bolzer, G. Kreth, I. Solovei, D. Koehler, K. Saracoglu, C. Fauth, S. Muller, R. Eils, C. Cremer, M. R. Speicher, and T.
Cremer. Three-dimensional maps of all chromosomes in human male fibroblast nuclei and prometaphase rosettes. PLoS Biol., 3:e157, May 2005.
[24] T. Boveri. Die Blastomerenkerne von Ascaris megalocephala und die Theorie der Chromosomenindividualität. Arch Zellforsch, 3:181–268, 1909.
[25] T. Boveri. Concerning the origin of malignant tumours by Theodor Boveri. Translated and annotated by Henry Harris. J. Cell. Sci., 121 Suppl 1:1–84, Jan 2008.
[26] S. Boyle, S. Gilchrist, J. M. Bridger, N. L. Mahy, J. A. Ellis, and W. A. Bickmore. The spatial organization of human chromosomes within the nuclei of normal and emerin-mutant cells. Hum. Mol. Genet., 10:211–219, Feb 2001.
[27] S. Boyle, M. J. Rodesch, H. A. Halvensleben, J. A. Jeddeloh, and W. A. Bickmore. Fluorescence in situ hybridization with high-complexity repeat-free oligonucleotide probes generated by massively parallel synthesis. Chromosome Res., 19:901–909, Oct 2011.
[28] M. R. Branco and A. Pombo. Intermingling of chromosome territories in interphase suggests role in translocations and transcription-dependent associations. PLoS Biol., 4:e138, May 2006.
[29] M. R. Branco and A. Pombo. Chromosome organization: new facts, new models. Trends Cell Biol., 17:127–134, Mar 2007.
[30] H. M. Cann, C. de Toma, L. Cazes, M. F. Legrand, V. Morel, L. Piouffre, J. Bodmer, W. F. Bodmer, B. Bonne-Tamir, A. Cambon-Thomsen, Z. Chen, J. Chu, C. Carcassi, L. Contu, R. Du, L. Excoffier, G. B. Ferrara, J. S. Friedlaender, H. Groot, D. Gurwitz, T. Jenkins, R. J. Herrera, X. Huang, J. Kidd, K. K. Kidd, A. Langaney, A. A. Lin, S. Q. Mehdi, P. Parham, A. Piazza, M. P. Pistillo, Y. Qian, Q. Shu, J. Xu, S. Zhu, J. L. Weber, H. T. Greely, M. W. Feldman, G. Thomas, J. Dausset, and L. L. Cavalli-Sforza. A human genome diversity cell line panel. Science, 296(5566):261–2, 2002.
[31] R. Cao, L. Wang, H. Wang, L. Xia, H. Erdjument-Bromage, P. Tempst, R. S. Jones, and Y. Zhang.
Role of histone H3 lysine 27 methylation in Polycomb-group silencing. Science, 298:1039–1043, Nov 2002.
[32] V. Cerny. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. J. Optimiz. Theory App., 45:41–51, Jan 1985.
[33] S. Chambeyron and W. A. Bickmore. Chromatin decondensation and nuclear reorganization of the HoxB locus upon induction of transcription. Genes Dev., 18:1119–1130, May 2004.
[34] L. Chen, J. N. Glover, P. G. Hogan, A. Rao, and S. C. Harrison. Structure of the DNA-binding domains from NFAT, Fos and Jun bound specifically to DNA. Nature, 392:42–48, Mar 1998.
[35] L. Chen and R. Kalhor. Genome-wide chromosome conformation capture. Patent (filed Aug. 6, 2009), 12 2011. US 8076070.
[36] L. Chen and R. Kalhor. Tethered conformation capture. Patent Application (filed May 18, 2010), 11 2011. US 2011/0287947 A1.
[37] L. Chen, A. M. MacMillan, W. Chang, K. Ezaz-Nikpay, W. S. Lane, and G. L. Verdine. Direct identification of the active-site nucleophile in a DNA (cytosine-5)-methyltransferase. Biochemistry, 30:11018–11025, Nov 1991.
[38] Y. Chen, R. Dey, and L. Chen. Crystal structure of the p53 core domain bound to a full consensus site as a self-assembled tetramer. Structure, 18:246–256, Feb 2010.
[39] Y. Chen, T. Souaiaia, and T. Chen. PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics, 25:2514–2521, Oct 2009.
[40] Andy K.H. Choo. The Centromere. Oxford University Press, Oxford, 1997.
[41] M. Chotalia and A. Pombo. Polycomb targets seek closest neighbours. PLoS Genet., 7:e1002031, Mar 2011.
[42] J. R. Chubb and W. A. Bickmore. Considering nuclear compartmentalization in the light of nuclear dynamics. Cell, 112:403–406, Feb 2003.
[43] D. W. Cleveland, Y. Mao, and K. F. Sullivan. Centromeres and kinetochores: from epigenetics to mitotic checkpoint signaling. Cell, 112:407–421, Feb 2003.
[44] ENCODE Project Consortium.
The ENCODE (ENCyclopedia Of DNA Elements) Project. Science, 306(5696):636–40, 2004.
[45] International HapMap Consortium. The International HapMap Project. Nature, 426(6968):789–96, 2003.
[46] International HapMap Consortium. A haplotype map of the human genome. Nature, 437(7063):1299–320, 2005.
[47] P. R. Cook. The organization of replication and transcription. Science, 284:1790–1795, Jun 1999.
[48] P. R. Cook. Predicting three-dimensional genome structure from transcriptional activity. Nat. Genet., 32:347–352, Nov 2002.
[49] P. R. Cook. A model for all genomes: the role of transcription factories. J. Mol. Biol., 395:1–10, Jan 2010.
[50] M. Cremer, J. von Hase, T. Volm, A. Brero, G. Kreth, J. Walter, C. Fischer, I. Solovei, C. Cremer, and T. Cremer. Non-random radial higher-order chromatin arrangements in nuclei of diploid human cells. Chromosome Res., 9:541–567, 2001.
[51] T. Cremer and C. Cremer. Chromosome territories, nuclear architecture and gene regulation in mammalian cells. Nat. Rev. Genet., 2:292–301, Apr 2001.
[52] T. Cremer, C. Cremer, H. Baumann, E. K. Luedtke, K. Sperling, V. Teuber, and C. Zorn. Rabl’s model of the interphase chromosome arrangement tested in Chinese hamster cells by premature chromosome condensation and laser-UV-microbeam experiments. Hum. Genet., 60:46–56, 1982.
[53] T. Cremer, M. Cremer, S. Dietzel, S. Muller, I. Solovei, and S. Fakan. Chromosome territories–a functional nuclear landscape. Curr. Opin. Cell Biol., 18:307–316, Jun 2006.
[54] T. Cremer, A. Kurz, R. Zirbel, S. Dietzel, B. Rinke, E. Schrock, M. R. Speicher, U. Mathieu, A. Jauch, P. Emmerich, H. Scherthan, T. Ried, C. Cremer, and P. Lichter. Role of chromosome territories in the functional compartmentalization of the cell nucleus. Cold Spring Harb. Symp. Quant. Biol., 58:777–792, 1993.
[55] T. Cremer, P. Lichter, J. Borden, D. C. Ward, and L. Manuelidis.
Detection of chromosome aberrations in metaphase and interphase tumor cells by in situ hybridization using chromosome-specific library probes. Hum. Genet., 80:235–246, Nov 1988.
[56] J. A. Croft, J. M. Bridger, S. Boyle, P. Perry, P. Teague, and W. A. Bickmore. Differences in the localization and morphology of chromosomes in the human nucleus. J. Cell Biol., 145:1119–1131, Jun 1999.
[57] R. Dahm and F. Miescher. Discovering DNA: Friedrich Miescher and the early years of nucleic acid research. Hum. Genet., 122:565–581, Jan 2008.
[58] Charles R. Darwin. On the Origin of Species by Means of Natural Selection. John Murray, London, 1859.
[59] J. Dausset, H. Cann, D. Cohen, M. Lathrop, J. M. Lalouel, and R. White. Centre d’etude du polymorphisme humain (CEPH): collaborative genetic mapping of the human genome. Genomics, 6(3):575–7, 1990.
[60] H. G. Davies and M. E. Haynes. Light- and electron-microscope observations on certain leukocytes in a teleost fish and a comparison of the envelope-limited monolayers of chromatin structural units in different species. J. Cell. Sci., 17:263–285, Mar 1975.
[61] H. G. Davies, A. B. Murray, and M. E. Walmsley. Electron-microscope observations on the organization of the nucleus in chicken erythrocytes and a superunit thread hypothesis for chromosome structure. J. Cell. Sci., 16:261–299, Nov 1974.
[62] S. de Nooijer, J. Wellink, B. Mulder, and T. Bisseling. Non-specific interactions are sufficient to explain the position of heterochromatic chromocenters and nucleoli in interphase nuclei. Nucleic Acids Res., 37:3558–3568, Jun 2009.
[63] H. Dehghani, G. Dellaire, and D. P. Bazett-Jones. Organization of chromatin in the interphase mammalian cell. Micron, 36:95–108, 2005.
[64] J. Dekker, K. Rippe, M. Dekker, and N. Kleckner. Capturing chromosome conformation. Science, 295(5558):1306–11, 2002.
[65] S. Dietzel, A. Jauch, D. Kienle, G. Qu, H. Holtgreve-Grez, R. Eils, C. Munkel, M. Bittner, P. S. Meltzer, J. M. Trent, and T. Cremer.
Separate and variably shaped chromosome arm domains are disclosed by chromosome arm painting in human cell nuclei. Chromosome Res., 6:25–33, Jan 1998.
[66] B. Dorigo, T. Schalch, A. Kulangara, S. Duda, R. R. Schroeder, and T. J. Richmond. Nucleosome arrays reveal the two-start organization of the chromatin fiber. Science, 306:1571–1573, Nov 2004.
[67] J. Dostie, T. A. Richmond, R. A. Arnaout, R. R. Selzer, W. L. Lee, T. A. Honan, E. D. Rubio, A. Krumm, J. Lamb, C. Nusbaum, R. D. Green, and J. Dekker. Chromosome conformation capture carbon copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res, 16(10):1299–309, 2006.
[68] Z. Duan, M. Andronescu, K. Schutz, S. McIlwain, Y. J. Kim, C. Lee, J. Shendure, S. Fields, C. A. Blau, and W. S. Noble. A three-dimensional model of the yeast genome. Nature, 465:363–367, May 2010.
[69] D. Elson and E. Chargaff. On the desoxyribonucleic acid content of sea urchin gametes. Experientia, 8:143–145, Apr 1952.
[70] R. Erni, M. D. Rossell, C. Kisielowski, and U. Dahmen. Atomic-resolution imaging with a sub-50-pm electron probe. Phys. Rev. Lett., 102(9):096101, Mar 2009.
[71] G. Felsenfeld and M. Groudine. Controlling the double helix. Nature, 421:448–453, Jan 2003.
[72] J. Ferreira, G. Paolella, C. Ramos, and A. I. Lamond. Spatial organization of large-scale chromatin domains in the nucleus: a magnified view of single chromosome territories. J. Cell Biol., 139:1597–1610, Dec 1997.
[73] J. T. Finch and A. Klug. Solenoidal model for superstructure in chromatin. Proc. Natl. Acad. Sci. U.S.A., 73:1897–1901, Jun 1976.
[74] R. E. Franklin and R. G. Gosling. Evidence for 2-chain helix in crystalline structure of sodium deoxyribonucleate. Nature, 172:156–157, Jul 1953.
[75] R. E. Franklin and R. G. Gosling. Molecular configuration in sodium thymonucleate. Nature, 171:740–741, Apr 1953.
[76] P. Fraser. Transcriptional control thrown for a loop. Curr. Opin. Genet. Dev., 16:490–495, Oct 2006.
[77] K. A. Frazer, D. G. Ballinger, D. R. Cox, D. A. Hinds, L. L. Stuve, R. A. Gibbs, J. W. Belmont, A. Boudreau, P. Hardenbol, S. M. Leal, S. Pasternak, D. A. Wheeler, T. D. Willis, F. Yu, H. Yang, C. Zeng, Y. Gao, H. Hu, W. Hu, C. Li, W. Lin, S. Liu, H. Pan, X. Tang, J. Wang, W. Wang, J. Yu, B. Zhang, Q. Zhang, H. Zhao, H. Zhao, J. Zhou, S. B. Gabriel, R. Barry, B. Blumenstiel, A. Camargo, M. Defelice, M. Faggart, M. Goyette, S. Gupta, J. Moore, H. Nguyen, R. C. Onofrio, M. Parkin, J. Roy, E. Stahl, E. Winchester, L. Ziaugra, D. Altshuler, Y. Shen, Z. Yao, W. Huang, X. Chu, Y. He, L. Jin, Y. Liu, Y. Shen, W. Sun, H. Wang, Y. Wang, Y. Wang, X. Xiong, L. Xu, M. M. Waye, S. K. Tsui, H. Xue, J. T. Wong, L. M. Galver, J. B. Fan, K. Gunderson, S. S. Murray, A. R. Oliphant, M. S. Chee, A. Montpetit, F. Chagnon, V. Ferretti, M. Leboeuf, J. F. Olivier, M. S. Phillips, S. Roumy, C. Sallee, A. Verner, T. J. Hudson, P. Y. Kwok, D. Cai, D. C. Koboldt, R. D. Miller, L. Pawlikowska, P. Taillon-Miller, M. Xiao, L. C. Tsui, W. Mak, Y. Q. Song, P. K. Tam, Y. Nakamura, T. Kawaguchi, T. Kitamoto, T. Morizono, A. Nagashima, Y. Ohnishi, et al. A second generation human haplotype map of over 3.1 million SNPs. Nature, 449(7164):851–61, 2007.
[78] G. Fudenberg and L. A. Mirny. Higher-order chromatin structure: bridging physics and biology. Curr Opin Genet Dev, Feb 2012.
[79] M. J. Fullwood and Y. Ruan. ChIP-based methods for the identification of long-range chromatin interactions. J. Cell. Biochem., 107:30–39, May 2009.
[80] A. Gondor, C. Rougier, and R. Ohlsson. High-resolution circular chromosome conformation capture assay. Nat Protoc, 3(2):303–13, 2008.
[81] A. Y. Grosberg, S. K. Nechaev, and E. I. Shakhnovich. The role of topological constraints in the kinetics of collapse of macromolecules. Journal de Physique, 49:2095–2100, 1988.
[82] M. Gue, C. Messaoudi, J. S. Sun, and T. Boudier.
Smart 3D-FISH: automation of distance analysis in nuclei of interphase cells by image processing. Cytometry A, 67:18–26, Sep 2005.
[83] H. Hagege, P. Klous, C. Braem, E. Splinter, J. Dekker, G. Cathala, W. de Laat, and T. Forne. Quantitative analysis of chromosome conformation capture assays (3C-qPCR). Nat Protoc, 2(7):1722–33, 2007.
[84] A. Han, F. Pan, J. C. Stroud, H. D. Youn, J. O. Liu, and L. Chen. Sequence-specific recruitment of transcriptional co-repressor Cabin1 by myocyte enhancer factor-2. Nature, 422:730–734, Apr 2003.
[85] M. A. Hayat. Principles and techniques of electron microscopy, biological applications. Macmillan Press, New York, NY, 3rd edition, 1989.
[86] A. D. Hershey and M. Chase. Independent functions of viral protein and nucleic acid in growth of bacteriophage. J. Gen. Physiol., 36:39–56, May 1952.
[87] S. J. Humphray, K. Oliver, A. R. Hunt, R. W. Plumb, J. E. Loveland, K. L. Howe, T. D. Andrews, S. Searle, S. E. Hunt, C. E. Scott, M. C. Jones, R. Ainscough, J. P. Almeida, K. D. Ambrose, R. I. Ashwell, A. K. Babbage, S. Babbage, C. L. Bagguley, J. Bailey, R. Banerjee, D. J. Barker, K. F. Barlow, K. Bates, H. Beasley, O. Beasley, C. P. Bird, S. Bray-Allen, A. J. Brown, J. Y. Brown, D. Burford, W. Burrill, J. Burton, C. Carder, N. P. Carter, J. C. Chapman, Y. Chen, G. Clarke, S. Y. Clark, C. M. Clee, S. Clegg, R. E. Collier, N. Corby, M. Crosier, A. T. Cummings, J. Davies, P. Dhami, M. Dunn, I. Dutta, L. W. Dyer, M. E. Earthrowl, L. Faulkner, C. J. Fleming, A. Frankish, J. A. Frankland, L. French, D. G. Fricker, P. Garner, J. Garnett, J. Ghori, J. G. Gilbert, C. Glison, D. V. Grafham, S. Gribble, C. Griffiths, S. Griffiths-Jones, R. Grocock, J. Guy, R. E. Hall, S. Hammond, J. L. Harley, E. S. Harrison, E. A. Hart, P. D. Heath, C. D. Henderson, B. L. Hopkins, P. J. Howard, P. J. Howden, E. Huckle, C. Johnson, D. Johnson, A. A. Joy, M. Kay, S. Keenan, J. K. Kershaw, A. M. Kimberley, A. King, A. Knights, G. K. Laird, C.
Langford, S. Lawlor, D. A. Leongamornlert, M. Leversha, C. Lloyd, D. M. Lloyd, J. Lovell, S. Martin, M. Mashreghi-Mohammadi, L. Matthews, S. McLaren, K. E. McLay, A. McMurray, S. Milne, T. Nickerson, J. Nisbett, G. Nordsiek, A. V. Pearce, A. I. Peck, K. M. Porter, R. Pandian, S. Pelan, B. Phillimore, S. Povey, Y. Ramsey, V. Rand, M. Scharfe, H. K. Sehra, R. Shownkeen, S. K. Sims, C. D. Skuce, M. Smith, C. A. Steward, D. Swarbreck, N. Sycamore, J. Tester, A. Thorpe, A. Tracey, A. Tromans, D. W. Thomas, M. Wall, J. M. Wallis, A. P. West, S. L. Whitehead, D. L. Willey, S. A. Williams, L. Wilming, P. W. Wray, L. Young, J. L. Ashurst, A. Coulson, H. Blocker, R. Durbin, J. E. Sulston, T. Hubbard, M. J. Jackson, D. R. Bentley, S. Beck, J. Rogers, and I. Dunham. DNA sequence and analysis of human chromosome 9. Nature, 429:369–374, May 2004.
[88] V. Jackson. Studies on histone organization in the nucleosome using formaldehyde as a reversible cross-linking agent. Cell, 15(3):945–54, 1978.
[89] N. Jayathilaka, A. Han, K. J. Gaffney, R. Dey, J. A. Jarusiewicz, K. Noridomi, M. A. Philips, X. Lei, J. He, J. Ye, T. Gao, N. A. Petasis, and L. Chen. Inhibition of the function of class IIa HDACs by blocking their interaction with MEF2. Nucleic Acids Res, Mar 2012.
[90] T. Jenuwein and C. D. Allis. Translating the histone code. Science, 293:1074–1080, Aug 2001.
[91] I. T. Jolliffe. Principal component analysis. Springer, New York, 2nd edition, 2002.
[92] R. Kalhor, H. Tjong, N. Jayathilaka, F. Alber, and L. Chen. Genome architectures revealed by tethered chromosome conformation capture and population-based modeling. Nat. Biotechnol., 30:90–98, Jan 2012.
[93] N. Kaplan, I. K. Moore, Y. Fondufe-Mittendorf, A. J. Gossett, D. Tillo, Y. Field, E. M. LeProust, T. R. Hughes, J. D. Lieb, J. Widom, and E. Segal. The DNA-encoded nucleosome organization of a eukaryotic genome. Nature, 458:362–366, Mar 2009.
[94] D. Karolchik, R. Baertsch, M. Diekhans, T. S. Furey, A.
Hinrichs, Y. T. Lu, K. M. Roskin, M. Schwartz, C. W. Sugnet, D. J. Thomas, R. J. Weber, D. Haussler, and W. J. Kent. The UCSC Genome Browser Database. Nucleic Acids Res., 31:51–54, Jan 2003.
[95] M. Kasowski, F. Grubert, C. Heffelfinger, M. Hariharan, A. Asabere, S. M. Waszak, L. Habegger, J. Rozowsky, M. Shi, A. E. Urban, M. Y. Hong, K. J. Karczewski, W. Huber, S. M. Weissman, M. B. Gerstein, J. O. Korbel, and M. Snyder. Variation in transcription factor binding among humans. Science, 328:232–235, Apr 2010.
[96] L. A. Kelley, S. P. Gardner, and M. J. Sutcliffe. An automated approach for clustering an ensemble of NMR-derived protein structures into conformationally related subfamilies. Protein Eng., 9:1063–1065, Nov 1996.
[97] K. Y. Kim, T. Rhim, I. Choi, and S. S. Kim. N-acetylcysteine induces cell cycle arrest in hepatic stellate cells through its reducing activity. J Biol Chem, 276(44):40591–8, 2001.
[98] S. H. Kim, P. G. McQueen, M. K. Lichtman, E. M. Shevach, L. A. Parada, and T. Misteli. Spatial genome organization during T-cell differentiation. Cytogenet. Genome Res., 105:292–301, 2004.
[99] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, May 1983.
[100] M. Kitayner, H. Rozenberg, R. Rohs, O. Suad, D. Rabinovich, B. Honig, and Z. Shakked. Diversity in DNA recognition by p53 revealed by crystal structures with Hoogsteen base pairs. Nat. Struct. Mol. Biol., 17:423–429, Apr 2010.
[101] S. T. Kosak, J. A. Skok, K. L. Medina, R. Riblet, M. M. Le Beau, A. G. Fisher, and H. Singh. Subnuclear compartmentalization of immunoglobulin loci during lymphocyte development. Science, 296:158–162, Apr 2002.
[102] G. Kreth, J. Finsterle, J. von Hase, M. Cremer, and C. Cremer. Radial arrangement of chromosome territories in human cell nuclei: a computer model approach based on gene density indicates a probabilistic global positioning code. Biophys. J., 86:2803–2812, May 2004.
[103] N. J. Krogan, M. Kim, A.
Tong, A. Golshani, G. Cagney, V. Canadien, D. P. Richards, B. K. Beattie, A. Emili, C. Boone, A. Shilatifard, S. Buratowski, and J. Greenblatt. Methylation of histone H3 by Set2 in Saccharomyces cerevisiae is linked to transcriptional elongation by RNA polymerase II. Mol. Cell. Biol., 23:4207–4218, Jun 2003.
[104] A. Kurz, S. Lampel, J. E. Nickolenko, J. Bradl, A. Benner, R. M. Zirbel, T. Cremer, and P. Lichter. Active and inactive genes localize preferentially in the periphery of chromosome territories. J. Cell Biol., 135:1195–1205, Dec 1996.
[105] A. I. Lamond and D. L. Spector. Nuclear speckles: a model for nuclear organelles. Nat. Rev. Mol. Cell Biol., 4:605–612, Aug 2003.
[106] P. R. Langer-Safer, M. Levine, and D. C. Ward. Immunological method for mapping genes on Drosophila polytene chromosomes. Proc. Natl. Acad. Sci. U.S.A., 79:4381–4385, Jul 1982.
[107] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 10(3):R25, 2009.
[108] P. A. Levene. The structure of yeast nucleic acid. J. Biol. Chem., 40:415–424, Dec 1919.
[109] Ira N. Levine. Physical Chemistry. McGraw-Hill, Boston, 6th edition, 2009.
[110] G. Li and D. Reinberg. Chromatin higher-order structures and gene regulation. Curr. Opin. Genet. Dev., 21:175–186, Apr 2011.
[111] H. Li, J. Ruan, and R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 18:1851–1858, Nov 2008.
[112] P. Lichter, T. Cremer, J. Borden, L. Manuelidis, and D. C. Ward. Delineation of individual human chromosomes in metaphase and interphase cells by in situ suppression hybridization using recombinant DNA libraries. Hum. Genet., 80:224–234, Nov 1988.
[113] E. Lieberman-Aiden, N. L. van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling, I. Amit, B. R. Lajoie, P. J. Sabo, M. O. Dorschner, R. Sandstrom, B. Bernstein, M. A. Bender, M. Groudine, A. Gnirke, J.
Stamatoyannopoulos, L. A. Mirny, E. S. Lander, and J. Dekker. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950):289–93, 2009.
[114] S. Lomvardas, G. Barnea, D. J. Pisapia, M. Mendelsohn, J. Kirkland, and R. Axel. Interchromosomal interactions and olfactory receptor choice. Cell, 126:403–413, Jul 2006.
[115] K. Luger, A. W. Mader, R. K. Richmond, D. F. Sargent, and T. J. Richmond. Crystal structure of the nucleosome core particle at 2.8 Å resolution. Nature, 389:251–260, Sep 1997.
[116] K. Maeshima, M. Eltsov, and U. K. Laemmli. Chromosome structure: improved immunolabeling for electron microscopy. Chromosoma, 114:365–375, Nov 2005.
[117] N. L. Mahy, P. E. Perry, and W. A. Bickmore. Gene density and transcription influence the localization of chromatin outside of chromosome territories detectable by FISH. J. Cell Biol., 159:753–763, Dec 2002.
[118] N. L. Mahy, P. E. Perry, S. Gilchrist, R. A. Baldock, and W. A. Bickmore. Spatial organization of active and inactive genes and noncoding DNA within chromosome territories. J. Cell Biol., 157:579–589, May 2002.
[119] E. M. Manders, H. Kimura, and P. R. Cook. Direct imaging of DNA in living cells reveals the dynamics of chromosome formation. J. Cell Biol., 144:813–821, Mar 1999.
[120] M. Martin, J. Cho, A. J. Cesare, J. D. Griffith, and G. Attardi. Termination factor-mediated DNA loop between termination and initiation sites drives mitochondrial rRNA synthesis. Cell, 123:1227–1240, Dec 2005.
[121] J. Mateos-Langerak, M. Bohn, W. de Leeuw, O. Giromus, E. M. Manders, P. J. Verschure, M. H. Indemans, H. J. Gierman, D. W. Heermann, R. van Driel, and S. Goetze. Spatially confined folding of chromatin in the interphase nucleus. Proc. Natl. Acad. Sci. U.S.A., 106:3812–3817, Mar 2009.
[122] T. Misteli. Protein dynamics: implications for nuclear architecture and gene expression. Science, 291:843–847, Feb 2001.
[123] T. Misteli.
Beyond the sequence: cellular organization of genome function. Cell, 128:787–800, Feb 2007.
[124] T. Misteli. Parallel genome universes. Nat. Biotechnol., 30:55–56, Jan 2012.
[125] B. Moore, E. Whitley, and A. Webster. The Basic and Acidic Proteins of the sperm of Echinus esculentus. Direct Measurements of the Osmotic Pressure of a Protamine or Histone. Biochem. J., 7:142–147, Mar 1913.
[126] T. Nakamura, T. Mori, S. Tada, W. Krajewski, T. Rozovskaia, R. Wassell, G. Dubois, A. Mazo, C. M. Croce, and E. Canaani. ALL-1 is a histone methyltransferase that assembles a supercomplex of proteins involved in transcriptional regulation. Mol. Cell, 10:1119–1128, Nov 2002.
[127] E. N. Nikolova, E. Kim, A. A. Wise, P. J. O’Brien, I. Andricioaei, and H. M. Al-Hashimi. Transient Hoogsteen base pairs in canonical duplex DNA. Nature, 470:498–502, Feb 2011.
[128] A. L. Olins and D. E. Olins. Spheroid chromatin units (ν bodies). Science, 183:330–332, Jan 1974.
[129] Li-ling Ooi. Principles of X-ray Crystallography. Oxford University Press, Oxford, 2009.
[130] C. S. Osborne, L. Chakalova, K. E. Brown, D. Carter, A. Horton, E. Debrand, B. Goyenechea, J. A. Mitchell, S. Lopes, W. Reik, and P. Fraser. Active genes dynamically colocalize to shared sites of ongoing transcription. Nat. Genet., 36:1065–1071, Oct 2004.
[131] J. M. O’Sullivan, S. M. Tan-Wong, A. Morillon, B. Lee, J. Coles, J. Mellor, and N. J. Proudfoot. Gene loops juxtapose promoters and terminators in yeast. Nat. Genet., 36:1014–1018, Sep 2004.
[132] R. J. Palstra, B. Tolhuis, E. Splinter, R. Nijmeijer, F. Grosveld, and W. de Laat. The beta-globin nuclear compartment in development and erythroid differentiation. Nat. Genet., 35:190–194, Oct 2003.
[133] L. Parada and T. Misteli. Chromosome positioning in the interphase nucleus. Trends Cell Biol., 12:425–432, Sep 2002.
[134] L. A. Parada, P. G. McQueen, and T. Misteli. Tissue-specific spatial organization of genomes. Genome Biol., 5:R44, 2004.
[135] J. M. Prober, G.
L. Trainor, R. J. Dam, F. W. Hobbs, C. W. Robertson, R. J. Zagursky, A. J. Cocuzza, M. A. Jensen, and K. Baumeister. A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science, 238:336–341, 1987.
[136] S. D. Putney, S. J. Benkovic, and P. R. Schimmel. A DNA fragment with an alpha-phosphorothioate nucleotide at one end is asymmetrically blocked from digestion by exonuclease III and can be replicated in vivo. Proc Natl Acad Sci U S A, 78(12):7350–4, 1981.
[137] M. A. Quail, I. Kozarewa, F. Smith, A. Scally, P. J. Stephens, R. Durbin, H. Swerdlow, and D. J. Turner. A large genome center’s improvements to the Illumina sequencing system. Nat Methods, 5(12):1005–10, 2008.
[138] C. Rabl. Über Zellteilung. Morphologisches Jahrbuch, 10:214–258, 1885.
[139] A. Rahman. Correlations in the Motion of Atoms in Liquid Argon. Phys. Rev., 136:405–411, Oct 1964.
[140] R. Redon, S. Ishikawa, K. R. Fitch, L. Feuk, G. H. Perry, T. D. Andrews, H. Fiegler, M. H. Shapero, A. R. Carson, W. Chen, E. K. Cho, S. Dallaire, J. L. Freeman, J. R. Gonzalez, M. Gratacos, J. Huang, D. Kalaitzopoulos, D. Komura, J. R. MacDonald, C. R. Marshall, R. Mei, L. Montgomery, K. Nishimura, K. Okamura, F. Shen, M. J. Somerville, J. Tchinda, A. Valsesia, C. Woodwark, F. Yang, J. Zhang, T. Zerjal, J. Zhang, L. Armengol, D. F. Conrad, X. Estivill, C. Tyler-Smith, N. P. Carter, H. Aburatani, C. Lee, K. W. Jones, S. W. Scherer, and M. E. Hurles. Global variation in copy number in the human genome. Nature, 444(7118):444–54, 2006.
[141] A. Rich. DNA comes in many forms. Gene, 135:99–109, Dec 1993.
[142] C. R. Robinson and S. G. Sligar. Molecular recognition mediated by bound water. A mechanism for star activity of the restriction endonuclease EcoRI. J. Mol. Biol., 234:302–306, Nov 1993.
[143] P. J. Robinson, L. Fairall, V. A. Huynh, and D. Rhodes. chromatin fiber: evidence for a compact, interdigitated structure. Proc. Natl. Acad. Sci.
U.S.A., 103:6506–6511, Apr 2006.
[144] P. J. Robinson and D. Rhodes. Structure of the ’30 nm’ chromatin fibre: a key role for the linker histone. Curr. Opin. Struct. Biol., 16:336–343, Jun 2006.
[145] J. J. Roix, P. G. McQueen, P. J. Munson, L. A. Parada, and T. Misteli. Spatial proximity of translocation-prone gene loci in human lymphomas. Nat. Genet., 34:287–291, Jul 2003.
[146] M. Rubinstein and R. H. Colby. Polymer Physics. Oxford University Press, USA, 1st edition, 2003.
[147] A. J. Ruthenburg, H. Li, T. A. Milne, S. Dewell, R. K. McGinty, M. Yuen, B. Ueberheide, Y. Dou, T. W. Muir, D. J. Patel, and C. D. Allis. Recognition of a mononucleosomal histone modification pattern by BPTF via multivalent interactions. Cell, 145:692–706, May 2011.
[148] P. J. Sabo, M. S. Kuehn, R. Thurman, B. E. Johnson, E. M. Johnson, H. Cao, M. Yu, E. Rosenzweig, J. Goldy, A. Haydock, M. Weaver, A. Shafer, K. Lee, F. Neri, R. Humbert, M. A. Singer, T. A. Richmond, M. O. Dorschner, M. McArthur, M. Hawrylycz, R. D. Green, P. A. Navas, W. S. Noble, and J. A. Stamatoyannopoulos. Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat. Methods, 3:511–518, Jul 2006.
[149] R. K. Sachs, G. van den Engh, B. Trask, H. Yokota, and J. E. Hearst. A random-walk/giant-loop model for interphase chromosomes. Proc. Natl. Acad. Sci. U.S.A., 92:2710–2714, Mar 1995.
[150] K. Sandman, S. L. Pereira, and J. N. Reeve. Diversity of prokaryotic chromosomal proteins and the origin of the nucleosome. Cell. Mol. Life Sci., 54:1350–1364, Dec 1998.
[151] F. Sanger, S. Nicklen, and A. R. Coulson. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U.S.A., 74:5463–5467, 1977.
[152] T. Schalch, S. Duda, D. F. Sargent, and T. J. Richmond. X-ray structure of a tetranucleosome and its implications for the chromatin fibre. Nature, 436:138–141, Jul 2005.
[153] M. O. Scheuermann, J. Tajbakhsh, A. Kurz, K. Saracoglu, R. Eils, and P. Lichter.
Topology of genes and nontranscribed sequences in human interphase nuclei. Exp. Cell Res., 301:266–279, Dec 2004.
[154] S. Schoenfelder, T. Sexton, L. Chakalova, N. F. Cope, A. Horton, S. Andrews, S. Kurukuti, J. A. Mitchell, D. Umlauf, D. S. Dimitrova, C. H. Eskiw, Y. Luo, C. L. Wei, Y. Ruan, J. J. Bieker, and P. Fraser. Preferential associations between co-regulated genes reveal a transcriptional interactome in erythroid cells. Nat. Genet., 42:53–61, Jan 2010.
[155] Y. Shiio and R. N. Eisenman. Histone sumoylation is associated with transcriptional repression. Proc. Natl. Acad. Sci. U.S.A., 100:13225–13230, Nov 2003.
[156] M. Simonis, P. Klous, E. Splinter, Y. Moshkin, R. Willemsen, E. de Wit, B. van Steensel, and W. de Laat. Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C). Nat Genet, 38(11):1348–54, 2006.
[157] M. Simonis, J. Kooren, and W. de Laat. An evaluation of 3C-based methods to capture DNA interactions. Nat Methods, 4(11):895–901, 2007.
[158] A. D. Smith, Z. Xuan, and M. Q. Zhang. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics, 9:128, 2008.
[159] L. M. Smith, J. Z. Sanders, R. J. Kaiser, P. Hughes, C. Dodd, C. R. Connell, C. Heiner, S. B. Kent, and L. E. Hood. Fluorescence detection in automated DNA sequence analysis. Nature, 321:674–679, 1986.
[160] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J. Mol. Biol., 147:195–197, Mar 1981.
[161] R. E. Sobel, R. G. Cook, C. A. Perry, A. T. Annunziato, and C. D. Allis. Conservation of deposition-related acetylation sites in newly synthesized histones H3 and H4. Proc. Natl. Acad. Sci. U.S.A., 92:1237–1241, Feb 1995.
[162] M. J. Solomon and A. Varshavsky. Formaldehyde-mediated DNA-protein crosslinking: a probe for in vivo chromatin structures. Proc Natl Acad Sci U S A, 82(19):6470–4, 1985.
[163] T. E. Spencer, G. Jenster, M. M. Burcin, C. D.
Allis, J. Zhou, C. A. Mizzen, N. J. McKenna, S. A. Onate, S. Y. Tsai, M. J. Tsai, and B. W. O’Malley. Steroid receptor coactivator-1 is a histone acetyltransferase. Nature, 389:194–198, Sep 1997.
[164] B. M. Spiegelman and R. Heinrich. Biological control through regulated transcriptional coactivators. Cell, 119:157–167, Oct 2004.
[165] C. G. Spilianakis, M. D. Lalioti, T. Town, G. R. Lee, and R. A. Flavell. Interchromosomal associations between alternatively expressed loci. Nature, 435:637–645, Jun 2005.
[166] E. Splinter, F. Grosveld, and W. de Laat. 3C technology: analyzing the spatial organization of genomic loci in vivo. Methods Enzymol, 375:493–507, 2004.
[167] S. Srivastava and L. Chen. A two-parameter generalized Poisson model to improve the analysis of RNA-seq data. Nucleic Acids Res., 38:e170, Sep 2010.
[168] Terry A. Straeter. On the Extension of the Davidon-Broyden Class of Rank One, Quasi-Newton Minimization Methods to an Infinite Dimensional Hilbert Space with Applications to Optimal Control Problems. PhD thesis, North Carolina State University, Raleigh, North Carolina, 1971.
[169] B. D. Strahl and C. D. Allis. The language of covalent histone modifications. Nature, 403:41–45, Jan 2000.
[170] J. C. Stroud, C. Lopez-Rodriguez, A. Rao, and L. Chen. Structure of a TonEBP-DNA complex reveals DNA encircled by a transcription factor. Nat. Struct. Biol., 9:90–94, Feb 2002.
[171] J. C. Stroud, Y. Wu, D. L. Bates, A. Han, K. Nowick, S. Paabo, H. Tong, and L. Chen. Structure of the forkhead domain of FOXP2 bound to DNA. Structure, 14:159–166, Jan 2006.
[172] G. J. Sullivan, J. M. Bridger, A. P. Cuthbert, R. F. Newbold, W. A. Bickmore, and B. McStay. Human acrocentric chromosomes with transcriptionally silent nucleolar organizer regions associate with nucleoli. EMBO J., 20:2867–2874, Jun 2001.
[173] J. Sun, Q. Zhang, and T. Schlick. Electrostatic mechanism of nucleosomal array folding revealed by computer simulation. Proc. Natl. Acad. Sci.
U.S.A., 102:8180–8185, Jun 2005.
[174] M. Tachibana, K. Sugimoto, T. Fukushima, and Y. Shinkai. Set domain-containing protein, G9a, is a novel lysine-preferring mammalian histone methyltransferase with hyperactivity and specific selectivity to lysines 9 and 27 of histone H3. J. Biol. Chem., 276:25309–25317, Jul 2001.
[175] H. Talasz, H. H. Lindner, B. Sarg, and W. Helliger. Histone H4-lysine 20 monomethylation is increased in promoter and coding regions of active genes and correlates with hyperacetylation. J. Biol. Chem., 280:38814–38822, Nov 2005.
[176] H. Tanabe, S. Muller, M. Neusser, J. von Hase, E. Calcagno, M. Cremer, I. Solovei, C. Cremer, and T. Cremer. Evolutionary conservation of chromosome territory arrangements in cell nuclei from higher primates. Proc. Natl. Acad. Sci. U.S.A., 99:4424–4429, Apr 2002.
[177] H. Tanizawa, O. Iwasaki, A. Tanaka, J. R. Capizzi, P. Wickramasinghe, M. Lee, Z. Fu, and K. Noma. Mapping of long-range associations throughout the fission yeast genome reveals global genome organization linked to transcriptional regulation. Nucleic Acids Res., 38:8164–8177, Dec 2010.
[178] F. Thoma and T. Koller. Influence of histone H1 on chromatin structure. Cell, 12:101–107, Sep 1977.
[179] M. Thompson, R. A. Haeusler, P. D. Good, and D. R. Engelke. Nucleolar clustering of dispersed tRNA genes. Science, 302:1399–1401, Nov 2003.
[180] B. Tolhuis, M. Blom, R. M. Kerkhoven, L. Pagie, H. Teunissen, M. Nieuwland, M. Simonis, W. de Laat, M. van Lohuizen, and B. van Steensel. Interactions among Polycomb domains are guided by chromosome architecture. PLoS Genet., 7:e1001343, Mar 2011.
[181] B. Tolhuis, R. J. Palstra, E. Splinter, F. Grosveld, and W. de Laat. Looping and interaction between hypersensitive sites in the active beta-globin locus. Mol Cell, 10(6):1453–65, 2002.
[182] T. Tsukamoto, N. Hashiguchi, S. M. Janicki, T. Tumbar, A. S. Belmont, and D. L. Spector. Visualization of gene activity in living cells. Nat.
Cell Biol., 2:871–878, Dec 2000.
[183] M. A. Umbarger, E. Toro, M. A. Wright, G. J. Porreca, D. Bau, S. H. Hong, M. J. Fero, L. J. Zhu, M. A. Marti-Renom, H. H. McAdams, L. Shapiro, J. Dekker, and G. M. Church. The three-dimensional architecture of a bacterial genome and its alteration by genetic perturbation. Mol. Cell, 44:252–264, Oct 2011.
[184] C. R. Vakoc, M. M. Sachdeva, H. Wang, and G. A. Blobel. Profile of histone lysine methylation across transcribed mammalian chromatin. Mol. Cell. Biol., 26:9185–9195, Dec 2006.
[185] R. van Driel, P. F. Fransz, and P. J. Verschure. The eukaryotic genome: a system regulated at different hierarchical levels. J. Cell. Sci., 116:4067–4075, Oct 2003.
[186] B. van Steensel and S. Henikoff. Identification of in vivo DNA targets of chromatin proteins using tethered dam methyltransferase. Nat. Biotechnol., 18:424–428, Apr 2000.
[187] E. V. Volpi, E. Chevret, T. Jones, R. Vatcheva, J. Williamson, S. Beck, R. D. Campbell, M. Goldsworthy, S. H. Powis, J. Ragoussis, J. Trowsdale, and D. Sheer. Large-scale chromatin organization of the major histocompatibility complex and other regions of human chromosome 6 and its response to interferon in interphase nuclei. J. Cell. Sci., 113 (Pt 9):1565–1576, May 2000.
[188] A. H. Wang, G. J. Quigley, F. J. Kolpak, J. L. Crawford, J. H. van Boom, G. van der Marel, and A. Rich. Molecular structure of a left-handed double helical DNA fragment at atomic resolution. Nature, 282:680–686, Dec 1979.
[189] H. Wang, R. Cao, L. Xia, H. Erdjument-Bromage, C. Borchers, P. Tempst, and Y. Zhang. Purification and functional characterization of a histone H3-lysine 4-specific methyltransferase. Mol. Cell, 8:1207–1217, Dec 2001.
[190] Z. Wang, C. Zang, J. A. Rosenfeld, D. E. Schones, A. Barski, S. Cuddapah, K. Cui, T. Y. Roh, W. Peng, M. Q. Zhang, and K. Zhao. Combinatorial patterns of histone acetylations and methylations in the human genome. Nat. Genet., 40:897–903, Jul 2008.
[191] J. D.
Watson and F. H. Crick. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171:737–738, Apr 1953.
[192] J. Widom and A. Klug. Structure of the 300 Å chromatin filament: X-ray diffraction from oriented samples. Cell, 43:207–213, Nov 1985.
[193] M. Wijgerde, F. Grosveld, and P. Fraser. Transcription complex stability and chromatin dynamics in vivo. Nature, 377:209–213, Sep 1995.
[194] M. H. Wilkins, A. R. Stokes, and H. R. Wilson. Molecular structure of deoxypentose nucleic acids. Nature, 171:738–740, Apr 1953.
[195] R. R. Williams, S. Broad, D. Sheer, and J. Ragoussis. Subchromosomal positioning of the epidermal differentiation complex (EDC) in keratinocyte and lymphoblast interphase nuclei. Exp. Cell Res., 272:163–175, Jan 2002.
[196] S. P. Williams, B. D. Athey, L. J. Muglia, R. S. Schappe, A. H. Gough, and J. P. Langmore. Chromatin fibers are left-handed double helices with diameter and mass per unit length that depend on linker length. Biophys. J., 49:233–248, Jan 1986.
[197] Alan P. Wolffe. Chromatin: structure and function (3rd Edition). Academic Press, San Diego, CA, 1999.
[198] C. L. Woodcock. Chromatin architecture. Curr. Opin. Struct. Biol., 16:213–220, Apr 2006.
[199] C. L. Woodcock and R. P. Ghosh. Chromatin higher-order structure and dynamics. Cold Spring Harb Perspect Biol, 2:a000596, May 2010.
[200] C. L. Woodcock, J. P. Safer, and J. E. Stanchfield. Structural repeating units in chromatin. I. Evidence for their general occurrence. Exp. Cell Res., 97:101–110, Jan 1976.
[201] R. Wu, G. Ruben, B. Siegel, E. Jay, P. Spielman, and C. P. Tu. Synchronous digestion of SV40 DNA by exonuclease III. Biochemistry, 15(4):734–40, 1976.
[202] Y. Wu, M. Borde, V. Heissmeyer, M. Feuerer, A. D. Lapan, J. C. Stroud, D. L. Bates, L. Guo, A. Han, S. F. Ziegler, D. Mathis, C. Benoist, L. Chen, and A. Rao. FOXP3 controls regulatory T cell function through cooperation with NFAT. Cell, 126:375–387, Jul 2006.
[203] H. Wurtele and P.
Chartrand. Genome-wide scanning of HoxB1-associated loci in mouse ES cells using an open-ended Chromosome Conformation Capture methodology. Chromosome Res, 14(5):477–95, 2006.
[204] Z. Zhao, G. Tavoosidana, M. Sjolinder, A. Gondor, P. Mariano, S. Wang, C. Kanduri, M. Lezcano, K. S. Sandhu, U. Singh, V. Pant, V. Tiwari, S. Kurukuti, and R. Ohlsson. Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nat Genet, 38(11):1341–7, 2006.
[205] S. B. Zimmerman, G. H. Cohen, and D. R. Davies. X-ray fiber diffraction and model-building study of polyguanylic acid and polyinosinic acid. J. Mol. Biol., 92:181–192, Feb 1975.
[206] D. Zink, T. Cremer, R. Saffrich, R. Fischer, M. F. Trendelenburg, W. Ansorge, and E. H. Stelzer. Structure and dynamics of human interphase chromosome territories in vivo. Hum. Genet., 102:241–251, Feb 1998.

Appendix A Figures for All Chromosomes

Many of the figures in the main text of this document provide data for only one chromosome. This appendix contains the comprehensive versions of these figures.

Figure A.1: Contact frequency maps for all chromosomes. Tethered HindIII contact frequency maps of all chromosomes. The scale bar on the bottom right of each map represents the length of 4,000 HindIII cut sites. To generate these maps, each chromosome was divided into segments that span 138 HindIII cut sites each, dividing the genome into 6,000 segments of 0.45 Mb (Table 3.2). Note that the size of chromosome bands is based on the number of HindIII cut sites in them and not their physical length. The white lines in the heatmap represent the unalignable region of the centromere.

Figure A.2: Correlation maps and class assignment.
Correlation maps of all autosomal chromosomes based on the HindIII-TCC library. The projection of the correlation profile on the eigenvector of the first principal component (EIG, Materials and methods) of each segment is aligned on the top of each map. The assignment of each segment to the active (orange) or the inactive (purple) class based on EIG is shown on top of the heatmap and below the EIG plot. The green bar marks the position of the centromere. The scale bar on the bottom right of each map represents the length of 4,000 HindIII cut sites on each chromosome.

Figure A.3: Active-active and inactive-inactive correlation maps. On the ideogram (middle) of chromosomes 1, 2, 3, 4, 5, 8 and 10, active (orange) and inactive (purple) segments are marked. A correlation matrix is compiled from the “active-active” contacts (left) and “inactive-inactive” contacts (right) for each chromosome. An “active-active” contact is a contact between two segments of the chromosome that are both assigned to the active class. The white lines in the map mark the position of the centromere. Different shades of purple and orange are only used to distinguish adjacent regions. The scale bar on the bottom right side of each map shows 4,000 HindIII restriction sites.

Figure A.4: ICP in active and inactive classes. Interchromosomal contact probability index (ICP) plotted against EIG for each autosomal chromosome. Active segments (orange dots) are marked by a positive EIG value and inactive segments (brown dots) are marked by negative EIG values. For each chromosome, inactive segments with a high ICP that flank the centromere are shown (hollow red dots). Chromosome 9 has the largest number of such segments, which is consistent with this chromosome having the largest centromeric heterochromatin [87].
In each plot, ICP values above the dashed blue line are significantly larger than the average ICP of the inactive segments. An F 6002 contact frequency matrix (Tables 3.2 and 3.3) was used for making these plots.

Figure A.5: Alignment of ICP and EIG along each chromosome. ICP values above the dashed blue line are significantly above the average ICP for inactive segments of the chromosome. Therefore, the blue line separates the high-ICP segments. See Appendix Figure A.5 for other chromosomes. The Y-axis for ICP is in a logarithmic scale. An F 6002 contact frequency matrix (Tables 3.2 and 3.3) was used for these plots.

Figure A.6: Interactions between high-ICP active segments on chromosome 11 and high-ICP active segments on each of the other autosomal chromosomes. For all possible combinations of binary interchromosomal interactions between high-ICP active segments of chr11 and all other chromosomes, their contact frequency has been plotted against the product of their ICP values. The Pearson’s correlation coefficient (r) is displayed on each plot. X-axes are logarithmic. An F 3004 contact frequency matrix (Tables 3.2 and 3.3) was used for these plots.

Figure A.7: Objective penalty function for all chromosomes. Detection of the optimal number of clusters (i.e., blocks) by optimizing a penalty function. The plot shows the value of the penalty function (Y-axes) with respect to the number of clusters (X-axes). An F 4992 contact frequency matrix (Tables 3.2 and 3.3) was used for this analysis.
Figure A.8: Radial distribution of chromosomes. The radial distributions for the center of mass of all chromosomes in the structure population. The radial positions are expressed relative to the nuclear radius with 0 representing the nuclear center. The histogram for each chromosome is composed of both the outer (red line) and the inner (blue line) homologous copy. The outer and inner copies are determined in each structure of the population based on their centers of mass.

Appendix B Side-by-side Comparison of Tethered and Non-tethered Libraries

This appendix compares our TCC library to the Hi-C libraries that were produced by us and by Lieberman-Aiden and colleagues [113]. The main focus of these comparisons is the effect and performance of the tethering strategy used in TCC relative to the dilution-based strategy used in Hi-C and in all other conformation capture methods so far.

The comparison figures in this appendix follow the order in which they appeared in the chapter Genome-wide Patterns of Chromosomal Contacts. For those figures that only involve intrachromosomal contacts, only the tethered and non-tethered libraries that we produced have been compared, and the catalogues are those introduced in Table 2.1. These libraries have significantly different numbers of contacts. While this difference does not bias the results of intrachromosomal contact comparisons, it can have a significant effect on the interchromosomal contact comparisons. As a result, for the latter type of comparison, we have randomly chosen a fraction of the HindIII-TCC library such that the total numbers of interchromosomal pairs in the compared libraries (i.e., our HindIII-TCC and HindIII-HiC libraries, and Lieberman-Aiden and colleagues’ HindIII-HiC library [113]) are comparable¹ (Figure B.3).
Incidentally, the fraction of the HindIII-TCC library that is used for interchromosomal contact comparisons in this chapter is identical to the one used in our published datasets [92].

The study by Lieberman-Aiden and colleagues [113] involves the GM06990 lymphoblastoid cell line. This cell line is very similar to the GM12878 cell line that we have used.² They also used HindIII as the restriction enzyme, so the results of our study and theirs are comparable. For the HindIII-HiC library produced by Lieberman-Aiden and colleagues, the raw sequencing data were downloaded from publicly available databases. The data consist of two experimental replicates of the library. We combined these replicates into a single catalogue, which is analyzed here. This catalogue was generated in the same fashion as the other catalogues (see Chapter 2).

¹ Note that as a consequence of this arrangement, the figures from our HindIII-TCC library from elsewhere in the text may be slightly different from those of the same library in this appendix.
² The difference between the two cell lines is that they were obtained from different individuals of the same pedigree. Also, only GM12878 is an ENCODE cell line, while GM06990 was only used in the pilot phase of the ENCODE project.

Figure B.1: Contact frequency maps. The contact maps of chromosome 2 obtained from the tethered HindIII (HindIII TCC) library (A) and the non-tethered HindIII (HindIII Hi-C) library (B). In these maps, chromosome 2 is divided into segments that span 277 HindIII sites each, resulting in 258 segments of 1 Mb (Table 3.2). A pair of tick marks on the ideogram encompasses 4986 HindIII sites. The white lines in the map mark the unalignable region of the centromeres.

Figure B.2: Intrachromosomal contact probability as a function of genomic distance. The genome-wide average probability of contact between chromosomal regions is displayed as a function of their genomic distance.
Measurements are repeated separately for the HindIII Tethered (black), MboI Tethered (green), and HindIII Non-tethered (red) catalogues. The probability values were measured using bin sizes of 10,000 bp. Both the X and the Y axes are in logarithmic scales.

Figure B.3: Characteristics and statistics of the compared libraries. The sequencing results from the HindIII-TCC library (column on the left) and HindIII-HiC library (middle column) from this study [92], and the HindIII-HiC libraries obtained by Lieberman-Aiden and colleagues [113] (column on the right) were processed and analyzed in an identical fashion. The results from Lieberman-Aiden and colleagues are a combination of two HindIII libraries of GM06990 cells that the authors had produced for their study. On the top, the cell line and restriction enzyme used to generate each library and the total number of interchromosomal contacts that were obtained from each library are shown. All libraries contain a similar number of interchromosomal contacts. On the bottom, the percentages of inter- and intrachromosomal ligations in each catalogue are shown in a pie chart format. A larger fraction of interchromosomal ligations suggests a higher fraction of random intermolecular ligation noise in the library.

Figure B.4: Pairwise whole-chromosome contact frequencies. The observed/expected frequencies of total contact between all pairs of chromosomes obtained from the TCC (A) and non-tethered Hi-C (B) conformation capture HindIII libraries are shown. Red and blue respectively indicate enrichment and depletion of contacts compared to the expected value according to the color key on the bottom-right corner. The stronger enrichment and depletion of contacts that is observed in the tethered library is due to the lower level of noise. Expected values were calculated based on the size and number of observed reads per chromosome when assuming completely random ligations.
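The observed/expected ratio described for Figure B.4 can be illustrated with a small sketch. The caption states only that expected values follow from chromosome size and read counts under completely random ligation; the particular normalisation used here (expected pair counts proportional to the product of per-chromosome read-end counts) is an assumption made for illustration, as are the function name, the toy chromosome labels, and all numbers.

```python
from itertools import combinations


def observed_over_expected(observed, read_ends):
    """Ratio of observed to expected interchromosomal contacts per chromosome pair.

    observed:  {(chrom_a, chrom_b): count}, keys in sorted order (hypothetical input).
    read_ends: {chrom: number of read ends mapped to that chromosome}.

    Assumption (not taken from the source): under random ligation the chance of
    a contact joining chromosomes a and b is proportional to
    read_ends[a] * read_ends[b].
    """
    chroms = sorted(read_ends)
    pairs = list(combinations(chroms, 2))
    total = sum(observed.get(pair, 0) for pair in pairs)
    weights = {(a, b): read_ends[a] * read_ends[b] for a, b in pairs}
    norm = sum(weights.values())
    ratios = {}
    for pair in pairs:
        expected = total * weights[pair] / norm
        ratios[pair] = observed.get(pair, 0) / expected
    return ratios


# Toy example with made-up counts for three hypothetical chromosomes.
read_ends = {"chrA": 100, "chrB": 100, "chrC": 200}
observed = {("chrA", "chrB"): 30, ("chrA", "chrC"): 20, ("chrB", "chrC"): 50}
ratios = observed_over_expected(observed, read_ends)
for pair, ratio in ratios.items():
    print(pair, round(ratio, 2))
```

In this toy input the chrA-chrB pair comes out enriched (ratio above 1) and the chrA-chrC pair depleted (ratio below 1), which is the kind of contrast the red and blue colors in the figure encode.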
Figure B.5: Genome-wide contact enrichment map of chromosome 2. The genome-wide enrichment map for chromosome 2, compiled from the tethered (A) and non-tethered (B) HindIII libraries. Enrichment is calculated as the ratio of the observed frequency in each position to its expected value; expected values were obtained assuming completely random ligations (Methods). Red and light blue respectively indicate enrichment and depletion of a contact in accordance with the color key between the panels. Chromosome 2 (left) extends along the Y-axis while all 23 chromosomes (top) extend along the X-axis. The zoomed panel to the right of each map magnifies the section that corresponds to contacts between the small arm of chromosome 2 and chromosomes 20, 21, 22, and X. A pair of tick marks on chromosome 2 spans 5022 HindIII sites. These maps are based on an F 1500 matrix (Materials and methods).

Figure B.6: Interchromosomal contact probability (ICP). (A) For all segments of chromosome 2, the interchromosomal contact probability index (ICP) is plotted against EIG. Segments with a positive EIG (orange dots) belong to the active class, while those with a negative EIG (brown dots) belong to the inactive class. The blue dashed line separates high-ICP segments: values above the line are significantly larger than the average ICP for inactive segments. Red dots mark those inactive segments with large ICPs that also flank the centromere. These plots are generated exactly as in Figure 3.12 A and show that the difference between active and inactive regions’ tendencies in forming interchromosomal contacts is only apparent in the tethered library. In the non-tethered libraries, the ICP values of all segments of the chromosome are increased, presumably due to increased random intermolecular ligations, such that the differences between the active and the inactive regions are no longer discernible.
(B) For all active segments in the genome, ICP is plotted against the binding of RNA polymerase II (pol II). Pol II binding values are reproduced from a ChIP-seq study [95] on the GM12878 cells and are in arbitrary units based on alignment frequency. These plots are generated exactly as in Figure 3.12 B and show that the correlation between pol II binding and ICP is only substantial in the tethered library. In the non-tethered library the smaller dynamic range of the derived ICP values masks the correlation. ChIP-seq data are not available for GM06990 cells. (C) For seven loci on the small arm of chromosome 11, ICP is plotted against their average distance from the edge of the chr11 territory as measured by FISH [117]. Positive distance values denote localization within the bulk territory, while negative values denote localization away from the bulk territory. Orange and brown dots represent assignment to the active and inactive classes, respectively. The regression goodness of fit (R²) is shown on each plot. These plots are generated exactly as in Figure 3.12 C for each library. Only in the tethered library can the correlation between ICP and localization within the chromosome territory be identified.

Figure B.7: Interactions between chromosome 11 and chromosome 19. (A) Plotted are the frequencies of all contacts between high-ICP active segments on chr19 and all the segments on chr11. Contacts involving the high-ICP active segments on chr11 are shown as purple dots and contacts involving all other segments of this chromosome are shown as gray triangles. Frequencies above the dashed blue line are significantly higher than the average frequency of contacts between high-ICP active segments on chr19 and inactive segments on chr11 (p-value < 0.04, non-parametric).
These plots are generated exactly as in Figure 3.14 A and show that the conclusion that many interchromosomal interactions between high-ICP active regions are significantly enriched can only be made from the tethered dataset. (B) For all possible interchromosomal interactions between a pair of active regions (high-ICP and otherwise) from chromosomes 11 and 19, their contact frequency has been plotted against the product of their ICPs. These plots were made similarly to Figure 3.14 B and show that the correlation is only clear in the tethered dataset. (C) The histogram of Pearson’s correlation of interchromosomal contact frequency with the product of ICP values for high-ICP active segments of all 231 pairwise combinations of autosomal chromosomes. The vertical purple line marks the average Pearson’s correlation. These plots were made similarly to Figure 3.14 C.

Appendix C Characteristics of Modeling Spheres

Table C.1: Characteristics of the modeling spheres. The genomic locations, sizes, and radii (when the nuclear radius is assumed to be 5,000 nm) of the 428 spheres that represent each haploid copy of the human genome in structural modeling. The genomic length is in megabases and the radius is in nanometers.
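As a reading aid for the columns of Table C.1, the short sketch below recomputes the genomic length l(Mb) from the Start and End coordinates for a few rows and compares each tabulated radius with the cube root of that length. The near-constant ratio it prints is only an observation on these tabulated values, not a statement of how the radii were derived; the helper names are hypothetical.

```python
def genomic_length_mb(start, end):
    """Genomic length of a sphere in megabases, from its Start and End coordinates."""
    return (end - start) / 1e6


def radius_per_cube_root_mb(start, end, radius_nm):
    """Tabulated radius divided by the cube root of the genomic length (nm per Mb^(1/3))."""
    return radius_nm / genomic_length_mb(start, end) ** (1.0 / 3.0)


# (sphere, chromosome, start, end, radius in nm), copied from rows of Table C.1.
rows = [
    (1, "chr1", 10000, 7932970, 323.3),
    (4, "chr1", 19737856, 47067667, 488.5),
    (46, "chr2", 180438460, 230572078, 598.0),
    (49, "chr3", 4217585, 5336721, 168.4),
]

ratios = [radius_per_cube_root_mb(s, e, r) for _, _, s, e, r in rows]
for (sphere, chrom, start, end, radius), ratio in zip(rows, ratios):
    length = genomic_length_mb(start, end)
    print(f"sphere {sphere} ({chrom}): l = {length:.2f} Mb, R / l^(1/3) = {ratio:.1f}")
```

For these four rows the ratio stays close to 162 nm per Mb^(1/3), which suggests that the listed radii scale roughly with the cube root of genomic length, as they would if sphere volume were proportional to the amount of sequence represented.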
Sphere Chr Start End l(Mb) R(nm)
1 chr1 10000 7932970 7.92 323.3
2 chr1 7932970 12578983 4.65 270.6
3 chr1 12578983 19737856 7.16 312.6
4 chr1 19737856 47067667 27.33 488.5
5 chr1 47067667 51167777 4.10 259.6
6 chr1 51167777 55748247 4.58 269.3
7 chr1 55748247 58727365 2.98 233.4
8 chr1 58727365 68149979 9.42 342.6
9 chr1 68149979 84480133 16.33 411.5
10 chr1 84480133 94280505 9.80 347.1
11 chr1 94280505 100429223 6.15 297.1
12 chr1 100429223 102051304 1.62 190.6
13 chr1 102051304 109015332 6.96 309.7
14 chr1 109015332 113689321 4.67 271.2
15 chr1 113689321 120852103 7.16 312.6
16 chr1 120852103 143920968 23.07 461.7
17 chr1 143920968 162859319 18.94 432.3
18 chr1 162859319 167040820 4.18 261.3
19 chr1 167040820 176073041 9.03 337.8
20 chr1 176073041 178602516 2.53 221.0
21 chr1 178602516 185468002 6.87 308.3
22 chr1 185468002 197435637 11.97 371.0
23 chr1 197435637 201271346 3.84 253.9
24 chr1 201271346 208044175 6.77 306.9
25 chr1 208044175 211302815 3.26 240.5
26 chr1 211302815 213257424 1.96 202.8
27 chr1 213257424 221885875 8.63 332.7
28 chr1 221885875 237117281 15.23 402.0
29 chr1 237117281 242969783 5.85 292.3
30 chr1 242969783 249240621 6.27 299.1
31 chr2 10000 8320580 8.31 328.5
32 chr2 8320580 12200329 3.88 254.9
33 chr2 12200329 23624283 11.42 365.3
34 chr2 23624283 34154455 10.53 355.5
35 chr2 34154455 42181350 8.03 324.7
36 chr2 42181350 48615617 6.43 301.7
37 chr2 48615617 59961484 11.35 364.5
38 chr2 59961484 76242187 16.28 411.1
39 chr2 76242187 84589799 8.35 329.0
40 chr2 84589799 114915216 30.33 505.8
41 chr2 114915216 118493252 3.58 248.1
42 chr2 118493252 122757471 4.26 263.0
43 chr2 122757471 127513033 4.76 272.8
44 chr2 127513033 169244772 41.73 562.6
45 chr2 169244772 180438460 11.19 362.8
46 chr2 180438460 230572078 50.13 598.0
47 chr2 230572078 243189373 12.62 377.6
48 chr3 60000 4217585 4.16 260.8
49 chr3 4217585 5336721 1.12 168.4
50 chr3 5336721 8375815 3.04 234.9
51 chr3 8375815 17941042 9.57 344.3
52 chr3 17941042 30870051 12.93 380.7
53 chr3 30870051 34305254 3.44 244.7
54 chr3 34305254 36943608 2.64 224.1
55 chr3 36943608 60021632 23.08 461.8
56 chr3 60021632 71162464 11.14 362.2
57 chr3 71162464 73391554 2.23 211.9
58 chr3 73391554 90504854 17.11 418.0
59 chr3 93504854 97228293 3.72 251.4
60 chr3 97228293 108430825 11.20 362.9
61 chr3 108430825 111562686 3.13 237.3
62 chr3 111562686 114703377 3.14 237.5
63 chr3 114703377 121505812 6.80 307.3
64 chr3 121505812 129790061 8.28 328.2
65 chr3 129790061 143574229 13.78 388.9
66 chr3 143574229 148521206 4.95 276.4
67 chr3 148521206 160513196 11.99 371.2
68 chr3 160513196 169141560 8.63 332.7
69 chr3 169141560 172426980 3.29 241.1
70 chr3 172426980 176555340 4.13 260.2
71 chr3 176555340 182490725 5.94 293.7
72 chr3 182490725 186982221 4.49 267.6
73 chr3 186982221 193628664 6.65 304.9
74 chr3 193628664 197962430 4.33 264.4
75 chr4 10000 4433960 4.42 266.3
76 chr4 4433960 18607518 14.17 392.5
77 chr4 18607518 24090760 5.48 286.0
78 chr4 24090760 26857285 2.77 227.7
79 chr4 26857285 35788427 8.93 336.5
80 chr4 35788427 43217205 7.43 316.5
81 chr4 43217205 46951133 3.73 251.6
82 chr4 46951133 58430859 11.48 365.9
83 chr4 58430859 65477609 7.05 311.0
84 chr4 65477609 73954211 8.48 330.7
85 chr4 73954211 89844497 15.89 407.8
86 chr4 89844497 98834276 8.99 337.2
87 chr4 98834276 124533065 25.70 478.6
88 chr4 124533065 128482883 3.95 256.4
89 chr4 128482883 130053300 1.57 188.5
90 chr4 130053300 138403443 8.35 329.0
91 chr4 138403443 151265484 12.86 380.0
92 chr4 151265484 154670618 3.41 244.0
93 chr4 154670618 183240770 28.57 495.8
94 chr4 183240770 187141178 3.90 255.3
95 chr4 187141178 191044276 3.90 255.4
96 chr5 10000 1849815 1.84 198.7
97 chr5 1849815 4824406 2.98 233.3
98 chr5 4824406 23041575 18.22 426.8
99 chr5 23041575 31582815 8.54 331.5
100 chr5 31582815 43773773 12.19 373.3
101 chr5 43773773 46405641 2.63 223.9
102 chr5 49405641 53541300 4.14 260.3
103 chr5 53541300 62277998 8.74 334.1
104 chr5 62277998 63829153 1.55 187.7
105 chr5 63829153 81719682 17.89 424.2
106 chr5 81719682 87866767 6.15 297.1
107 chr5 87866767 89843916 1.98 203.6
108 chr5 89843916 94564805 4.72 272.1
109 chr5 94564805 96597280 2.03 205.4
110 chr5 96597280 130366783 33.77 524.2
111 chr5 130366783 143262634 12.90 380.3
112 chr5 143262634 147969635 4.71 271.8
113 chr5 147969635 150997251 3.03 234.6
114 chr5 150997251 156289566 5.29 282.6
115 chr5 156289566 160183058 3.89 255.1
116 chr5 160183058 167309916 7.13 312.1
117 chr5 167309916 180905260 13.60 387.1
118 chr6 60000 7999375 7.94 323.6
119 chr6 7999375 10741729 2.74 227.0
120 chr6 10741729 16928729 6.19 297.7
121 chr6 16928729 24701561 7.77 321.3
122 chr6 24701561 44458123 19.76 438.5
123 chr6 44458123 47608169 3.15 237.8
124 chr6 47608169 51567742 3.96 256.6
125 chr6 51567742 53691261 2.12 208.5
126 chr6 53691261 56196075 2.51 220.3
127 chr6 56196075 57153390 0.96 159.8
128 chr6 57153390 70308348 13.16 382.9
129 chr6 70308348 76826739 6.52 303.0
130 chr6 76826739 87747715 10.92 359.8
131 chr6 87747715 91566560 3.82 253.5
132 chr6 91566560 95926372 4.36 265.0
133 chr6 95926372 101498414 5.57 287.5
134 chr6 101498414 105618007 4.12 260.0
135 chr6 105618007 112286744 6.67 305.3
136 chr6 112286744 134278985 21.99 454.4
137 chr6 134278985 139616621 5.34 283.5
138 chr6 139616621 142942116 3.33 242.1
139 chr6 142942116 145100975 2.16 209.6
140 chr6 145100975 148706832 3.61 248.7
141 chr6 148706832 151844075 3.14 237.4
142 chr6 151844075 154482740 2.64 224.1
143 chr6 154482740 161812693 7.33 315.1
144 chr6 161812693 166163517 4.35 264.8
145 chr6 166163517 171055067 4.89 275.3
146 chr7 10000 3164956 3.16 237.9
147 chr7 3164956 5273920 2.11 208.0
148 chr7 5273920 8150827 2.88 230.7
149 chr7 8150827 16814856 8.66 333.1
150 chr7 16814856 21356241 4.54 268.6
151 chr7 21356241 31035313 9.68 345.6
152 chr7 31035313 35380466 4.35 264.7
153 chr7 35380466 40564385 5.18 280.7
154 chr7 40564385 43257714 2.69 225.7
155 chr7 43257714 45506899 2.25 212.5
156 chr7 45506899 56400002 10.89 359.5
157 chr7 56400002 64707576 8.31 328.5
158 chr7 64707576 72069631 7.36 315.5
159 chr7 72069631 77554014 5.48 286.0
160 chr7 77554014 86249078 8.70 333.5
161 chr7 86249078 97318453 11.07 361.5
162 chr7 97318453 102884913 5.57 287.4
163 chr7 102884913 108381915 5.50 286.2
164 chr7 108381915 127289515 18.91 432.1
165 chr7 127289515 135356025 8.07 325.3
166 chr7 135356025 144708261 9.35 341.7
167 chr7 144708261 148363522 3.66 249.8
168 chr7 148363522 152548805 4.19 261.4
169 chr7 152548805 159128663 6.58 303.9
170 chr8 10000 5818417 5.81 291.5
171 chr8 5818417 13066982 7.25 313.9
172 chr8 13066982 16821531 3.76 252.1
173 chr8 16821531 27159343 10.34 353.3
174 chr8 27159343 31248151 4.09 259.4
175 chr8 31248151 37206259 5.96 294.0
176 chr8 37206259 43292280 6.09 296.1
177 chr8 43292280 47915443 4.62 270.2
178 chr8 47915443 82752247 34.84 529.7
179 chr8 82752247 94504937 11.75 368.8
180 chr8 94504937 104462065 9.96 348.9
181 chr8 104462065 123838965 19.38 435.6
182 chr8 123838965 136108207 12.27 374.1
183 chr8 136108207 139283053 3.18 238.4
184 chr8 139283053 146304022 7.02 310.6
185 chr9 10000 4282187 4.27 263.2
186 chr9 4282187 7046242 2.76 227.6
187 chr9 7046242 18967119 11.92 370.5
188 chr9 18967119 22269866 3.30 241.5
189 chr9 22269866 26894163 4.62 270.2
190 chr9 26894163 27827065 0.93 158.5
191 chr9 27827065 33056464 5.23 281.5
192 chr9 33056464 38377198 5.32 283.2
193 chr9 38377198 70492001 32.12 515.5
194 chr9 70492001 71828460 1.34 178.6
195 chr9 71828460 82777226 10.95 360.1
196 chr9 82777226 85790815 3.01 234.3
197 chr9 85790815 92228885 6.44 301.7
198 chr9 92228885 102953706 10.73 357.7
199 chr9 102953706 111524002 8.57 331.9
200 chr9 111524002 117689649 6.17 297.4
201 chr9 117689649 122975883 5.29 282.5
202 chr9 122975883 127328564 4.35 264.8
203 chr9 127328564 141153431 13.83 389.3
204 chr10 60000 7962701 7.90 323.1
205 chr10 7962701 11259030 3.30 241.4
206 chr10 11259030 14950928 3.69 250.7
207 chr10 14950928 21760439 6.81 307.4
208 chr10 21760439 30374158 8.61 332.5
209 chr10 30374158 32780389 2.41 217.3
210 chr10 32780389 43258385 10.48 354.9
211 chr10 43258385 45779229 2.52 220.7
212 chr10 45779229 52579890 6.80 307.3
213 chr10 52579890 63528446 10.95 360.1
214 chr10 63528446 65815083 2.29 213.7
215 chr10 65815083 69568692 3.75 252.1
216 chr10 69568692 76900443 7.33 315.1
217 chr10 76900443 82163108 5.26 282.1
218 chr10 82163108 88368585 6.21 298.0
219 chr10 88368585 101132884 12.76 379.0
220 chr10 101132884 105978328 4.85 274.4
221 chr10 105978328 111610417 5.63 288.6
222 chr10 111610417 112702860 1.09 167.0
223 chr10 112702860 120441770 7.74 320.8
224 chr10 120441770 127672510 7.23 313.6
225 chr10 127672510 135524747 7.85 322.4
226 chr11 60000 3973178 3.91 255.6
227 chr11 3973178 12451356 8.48 330.7
228 chr11 12451356 16525649 4.07 259.0
229 chr11 16525649 19915850 3.39 243.6
230 chr11 19915850 32768850 12.85 379.9
231 chr11 32768850 36764401 4.00 257.4
232 chr11 36764401 43424567 6.66 305.2
233 chr11 43424567 48021109 4.60 269.7
234 chr11 48021109 56750418 8.73 333.9
235 chr11 56750418 59707479 2.96 232.8
236 chr11 59707479 78391195 18.68 430.4
237 chr11 78391195 107345207 28.95 498.0
238 chr11 107345207 116737747 9.39 342.2
239 chr11 116737747 121396568 4.66 270.9
240 chr11 121396568 134946516 13.55 386.7
241 chr12 60000 13388979 13.33 384.6
242 chr12 13388979 24556599 11.17 362.5
243 chr12 24556599 32927195 8.37 329.3
244 chr12 32927195 38863809 5.94 293.7
245 chr12 38863809 45635139 6.77 306.8
246 chr12 45635139 58287535 12.65 377.9
247 chr12 58287535 62318275 4.03 258.1
248 chr12 62318275 70160235 7.84 322.2
249 chr12 70160235 92150622 21.99 454.4
250 chr12 92150622 97033387 4.88 275.2
251 chr12 97033387 101690656 4.66 270.8
252 chr12 101690656 120070840 18.38 428.0
253 chr12 120070840 125866473 5.80 291.3
254 chr12 125866473 131197367 5.33 283.3
255 chr12 131197367 133841895 2.65 224.3
256 chr13 19020000 19996596 0.98 160.9
257 chr13 19996596 22065733 2.07 206.7
258 chr13 22065733 23829067 1.76 195.9
259 chr13 23829067 34411998 10.58 356.1
260 chr13 34411998 39526017 5.11 279.4
261 chr13 39526017 47394617 7.87 322.6
262 chr13 47394617 48432782 1.04 164.2
263 chr13 48432782 53911777 5.48 285.9
264 chr13 53911777 60063970 6.15 297.2
265 chr13 60063970 61026382 0.96 160.1
266 chr13 61026382 63636186 2.61 223.3
267 chr13 63636186 73138522 9.50 343.5
268 chr13 73138522 81271480 8.13 326.2
269 chr13 81271480 94855676 13.58 387.0
270 chr13 94855676 96621529 1.77 196.0
271 chr13 96621529 97713634 1.09 167.0
272 chr13 97713634 101306453 3.59 248.4
273 chr13 101306453 105958364 4.65 270.8
274 chr13 105958364 110960622 5.00 277.4
275 chr13 110960622 112237201 1.28 176.0
276 chr13 112237201 113095052 0.86 154.1
277 chr13 113095052 115109878 2.02 204.9
278 chr14 19000000 20632263 1.63 191.0
279 chr14 20632263 25188413 4.56 268.9
280 chr14 25188413 30814538 5.63 288.5
281 chr14 30814538 34425880 3.61 248.8
282 chr14 34425880 36494470 2.07 206.7
283 chr14 36494470 41257777 4.76 272.9
284 chr14 41257777 45400918 4.14 260.5
285 chr14 45400918 45917105 0.52 130.1
286 chr14 45917105 50020107 4.10 259.7
287 chr14 50020107 51757959 1.74 195.0
288 chr14 51757959 55010280 3.25 240.3
289 chr14 55010280 56179168 1.17 170.9
290 chr14 56179168 58756298 2.58 222.4
291 chr14 58756298 62380982 3.63 249.1
292 chr14 62380982 63774936 1.39 181.2
293 chr14 63774936 68045535 4.27 263.1
294 chr14 68045535 78450609 10.41 354.1
295 chr14 78450609 79513679 1.06 165.5
296 chr14 79513679 81018856 1.51 185.9
297 chr14 81018856 82105491 1.09 166.8
298 chr14 82105491 88280116 6.18 297.6
299 chr14 88280116 91194792 2.92 231.7
300 chr14 91194792 93856735 2.66 224.8
301 chr14 93856735 97480902 3.62 249.1
302 chr14 97480902 99714984 2.23 212.0
303 chr14 99714984 101256020 1.54 187.3
304 chr14 101256020 102110391 0.85 153.9
305 chr14 102110391 107289540 5.18 280.6
306 chr15 20000000 30530631 10.53 355.5
307 chr15 30530631 36019334 5.49 286.1
308 chr15 36019334 40181278 4.16 260.9
309 chr15 40181278 45696309 5.52 286.6
310 chr15 45696309 48247492 2.55 221.6
311 chr15 48247492 50352507 2.11 207.9
312 chr15 50352507 53173665 2.82 229.2
313 chr15 53173665 55830405 2.66 224.6
314 chr15 55830405 63577780 7.75 320.9
315 chr15 63577780 66202618 2.63 223.7
316 chr15 66202618 74582360 8.38 329.4
317 chr15 74582360 79481256 4.90 275.5
318 chr15 79481256 86408538 6.93 309.2
319 chr15 86408538 89234181 2.83 229.3
320 chr15 89234181 91474712 2.24 212.2
321 chr15 91474712 93892572 2.42 217.7
322 chr15 93892572 98828650 4.94 276.2
323 chr15 98828650 102521392 3.69 250.7
324 chr16 60000 5206839 5.15 280.0
325 chr16 5206839 8232892 3.03 234.6
326 chr16 8232892 31763227 23.53 464.8
327 chr16 31763227 35285801 3.52 246.8
328 chr16 46385801 56275392 9.89 348.1
329 chr16 56275392 58421210 2.15 209.2
330 chr16 58421210 66608852 8.19 326.9
331 chr16 66608852 70587498 3.98 257.0
332 chr16 70587498 87734987 17.15 418.2
333 chr16 87734987 90294753 2.56 221.9
334 chr17 0 8987390 8.99 337.2
335 chr17 8987390 15565173 6.58 303.9
336 chr17 15565173 20874363 5.31 282.9
337 chr17 20874363 26474960 5.60 288.0
338 chr17 26474960 31378506 4.90 275.6
339 chr17 31378506 33144995 1.77 196.0
340 chr17 33144995 49419943 16.28 411.0
341 chr17 49419943 52993980 3.57 248.0
342 chr17 52993980 55227929 2.23 212.0
343 chr17 55227929 63150038 7.92 323.3
344 chr17 63150038 72196023 9.05 337.9
345 chr17 72196023 81195209 9.00 337.4
346 chr18 10000 8130548 8.12 326.0
347 chr18 8130548 13515331 5.39 284.3
348 chr18 13515331 19145302 5.63 288.5
349 chr18 19145302 24463425 5.32 283.1
350 chr18 24463425 28778565 4.32 264.0
351 chr18 28778565 43217967 14.44 394.9
352 chr18 43217967 48731568 5.51 286.5
353 chr18 48731568 54094540 5.36 283.9
354 chr18 54094540 57933526 3.84 254.0
355 chr18 57933526 59599627 1.67 192.3
356 chr18 59599627 61703508 2.10 207.8
357 chr18 61703508 71846983 10.14 351.1
358 chr18 71846983 78017248 6.17 297.5
359 chr19 60000 8764948 8.71 333.6
360 chr19 8764948 12572114 3.81 253.3
361 chr19 12572114 19534273 6.96 309.7
362 chr19 19534273 24631782 5.10 279.1
363 chr19 27731782 29832067 2.10 207.7
364 chr19 29832067 31226076 1.39 181.2
365 chr19 31226076 33047739 1.82 198.1
366 chr19 33047739 34659858 1.61 190.2
367 chr19 34659858 38521701 3.86 254.5
368 chr19 38521701 42404781 3.88 254.9
369 chr19 42404781 45242424 2.84 229.6
370 chr19 45242424 51834512 6.59 304.1
371 chr19 51834512 59118983 7.28 314.4
372 chr20 60000 6362819 6.30 299.6
373 chr20 6362819 17211535 10.85 359.0
374 chr20 17211535 21471690 4.26 262.9
375 chr20 21471690 30333068 8.86 335.6
376 chr20 30333068 37335022 7.00 310.3
377 chr20 37335022 42480099 5.15 280.0
378 chr20 42480099 49681080 7.20 313.2
379 chr20 49681080 58741775 9.06 338.1
380 chr20 58741775 61005470 2.26 213.0
381 chr20 61005470 62965520 1.96 203.0
382 chr21 9411193 15304465 5.89 293.0
383 chr21 15304465 17979033 2.68 225.1
384 chr21 17979033 25793965 7.82 321.9
385 chr21 25793965 27993201 2.20 210.9
386 chr21 27993201 30000560 2.01 204.6
387 chr21 30000560 31038017 1.04 164.2
388 chr21 31038017 32545472 1.51 185.9
389 chr21 32545472 33698272 1.15 170.1
390 chr21 33698272 39011423 5.31 283.0
391 chr21 39011423 40748853 1.74 195.0
392 chr21 40748853 42415346 1.67 192.3
393 chr21 42415346 48119895 5.71 289.8
394 chr22 16050000 17405389 1.36 179.5
395 chr22 17405389 19032590 1.63 190.8
396 chr22 19032590 24597230 5.57 287.4
397 chr22 24597230 27413073 2.82 229.0
398 chr22 27413073 29939474 2.53 220.9
399 chr22 29939474 32416683 2.48 219.4
400 chr22 32416683 35465028 3.05 235.2
401 chr22 35465028 36658158 1.19 172.0
402 chr22 36658158 43281482 6.62 304.6
403 chr22 43281482 47446423 4.17 261.0
404 chr22 47446423 50938947 3.49 246.1
405 chr22 50938947 51244566 0.31 109.3
406 chrX 60000 2457088 2.40 217.1
407 chrX 2457088 9294522 6.84 307.8
408 chrX 9294522 25213888 15.92 408.0
409 chrX 25213888 28345949 3.13 237.3
410 chrX 28345949 31443681 3.10 236.4
411 chrX 31443681 37353330 5.91 293.2
412 chrX 37353330 49204904 11.85 369.8
413 chrX 49204904 57337491 8.13 326.2
414 chrX 57337491 67180185 9.84 347.6
415 chrX 67180185 78611490 11.43 365.4
416 chrX 78611490 95960808 17.35 419.9
417 chrX 95960808 99567009 3.61 248.7
418 chrX 99567009 112288487 12.72 378.6
419 chrX 112288487 115104653 2.82 229.0
420 chrX 115104653 117382593 2.28 213.4
421 chrX 117382593 119707040 2.32 214.8
422 chrX 119707040 128390800 8.68 333.4
423 chrX 128390800 136544210 8.15 326.4
424 chrX 136544210 141561436 5.02 277.7
425 chrX 141561436 146518428 4.96 276.5
426 chrX 146518428 152789780 6.27 299.1
427 chrX 152789780 154978789 2.19 210.6
428 chrX 154978789 155260560 0.28 106.4
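The radii in Table C.1 appear to grow with the cube root of the genomic length, as would be expected if each sphere's volume were proportional to the amount of DNA it represents. This is an inference from the tabulated values, not a rule stated in the caption, so the Python sketch below treats it as an assumption and only checks it against a few rows copied from the table. The helper names (`length_mb`, `calibrate_k`, `predicted_radius`) and the constant `K` are hypothetical and introduced here for illustration; lengths are recomputed from the Start and End columns.

```python
# Illustrative check of an ASSUMED relationship R = K * l**(1/3);
# the scaling rule is inferred from Table C.1, not quoted from the thesis.

# Rows copied from Table C.1: (sphere, start_bp, end_bp, radius_nm).
ROWS = [
    (1, 10000, 7932970, 323.3),
    (4, 19737856, 47067667, 488.5),
    (46, 180438460, 230572078, 598.0),
    (405, 50938947, 51244566, 109.3),
]


def length_mb(start_bp, end_bp):
    """Genomic length of a sphere in megabases."""
    return (end_bp - start_bp) / 1e6


def calibrate_k(start_bp, end_bp, radius_nm):
    """Solve R = K * l**(1/3) for K using a single table row."""
    return radius_nm / length_mb(start_bp, end_bp) ** (1.0 / 3.0)


def predicted_radius(start_bp, end_bp, k):
    """Radius (nm) predicted under the assumed cube-root scaling."""
    return k * length_mb(start_bp, end_bp) ** (1.0 / 3.0)


# Calibrate the constant on sphere 1, then compare the other rows.
_, s0, e0, r0 = ROWS[0]
K = calibrate_k(s0, e0, r0)

for sphere, start, end, radius in ROWS:
    pred = predicted_radius(start, end, K)
    print(f"sphere {sphere}: table R = {radius} nm, predicted R = {pred:.1f} nm")
```

If the assumption holds, the predicted and tabulated radii should agree to within rounding of the table entries; a larger discrepancy in some row would indicate that the radii were derived differently.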
Abstract
The genetic information in most organisms is stored in a double-helical fiber molecule known as DNA. The total DNA fiber inside every single human nucleus, also known as the genome, is about two meters in length. This very long fiber is organized within a nucleus that is about ten micrometers in diameter. In other words, were the nucleus the size of a soccer ball, the DNA fiber inside it would extend thirty miles or fifty kilometers. This enormous length naturally results in a structural complexity that renders understanding the spatial organization of the genome very challenging. However, as with other biological systems, a complete functional understanding of the genome requires structural insight. In fact, it has been demonstrated that the three-dimensional organization of the genome affects all nuclear processes, including replication, transcription, and DNA repair. For example, transcription, replication, and DNA repair take place in distinct spatial compartments known as "factories." Active genes tend to localize together and in proximity to transcriptionally active regions of the nucleus, while inactive genes tend to associate with the nuclear envelope and other transcriptionally silent regions of the nucleus. Gene activity itself is partly regulated by functional elements, such as enhancers or repressors, that are not in the promoter of the gene but exert their effect over a long range, presumably by physically associating with their targets.

Despite this well-recognized importance for nuclear functions, the three-dimensional organization of the genome is poorly understood. The main limitation in exploring genome architecture is that its natural structural complexity renders available technologies inadequate for studying it. Only a few approaches exist for obtaining structural information about the genome. The most commonly used techniques are based on either fluorescence in situ hybridization (FISH) or chromosome conformation capture (3C). These techniques are limited in throughput, as they can address the positions of only a few genomic locations, or loci, in each experiment. As a result, none of the currently available methods is capable of addressing the spatial organization of loci at a genome-wide level.

Here, we present a method for assaying the structure of the entire genome. This method measures the relative spatial proximity of all loci by combining 3C with massively parallel sequencing. We call this method "Tethered Conformation Capture" (TCC). The output of TCC is a catalogue of several million contacts that involve all regions of the genome and are not biased toward any particular locus or loci. Furthermore, TCC catalogues feature a significantly improved signal-to-noise ratio compared to other 3C-based methods. This improvement is a direct result of replacing the dilute liquid-phase ligation strategy, which is a staple of all 3C-based methods, with a novel solid-phase ligation strategy that is drastically more effective in preventing unwanted spurious ligation events.

We also applied TCC to human lymphoblastoid cells, generating a catalogue of more than one hundred million contacts. This catalogue enabled a systematic study of the genome architecture. As these lymphoblastoid cells are also a subject of studies by the Encyclopedia of DNA Elements (ENCODE) Consortium, we were able to interpret the genome-wide architectural features in the context of the available one-dimensional functional data across the genome, such as gene expression levels, binding of various transcription factors, and histone modifications.

These analyses revealed new insights into the three-dimensional organization of the human genome. We found that genomic distance is an important factor in determining the contact frequency between loci on the same chromosome: contact probability decreases rapidly with increasing distance between loci. However, genomic distance is not the only determining factor in intrachromosomal contact frequency
Asset Metadata
Creator
Kalhor, Reza (author)
Core Title
Exploring three-dimensional organization of the genome by mapping chromatin contacts and population modeling
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Genetic, Molecular and Cellular Biology
Publication Date
05/30/2012
Defense Date
03/28/2012
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
chromatin structure, chromosome conformation capture, chromosome structure, genome architecture, Hi-C, OAI-PMH Harvest, population modeling, Population-based modeling, TCC, Tethered Conformation Capture
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Chen, Lin (committee chair), Alber, Frank (committee member), Aparicio, Oscar M. (committee member), Stallcup, Michael R. (committee member)
Creator Email
kalhor@usc.edu,reza.kalhor@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-44451
Unique identifier
UC11290106
Identifier
usctheses-c3-44451 (legacy record id)
Legacy Identifier
etd-KalhorReza-870.pdf
Dmrecord
44451
Document Type
Dissertation
Rights
Kalhor, Reza
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA