Constructing metagenome-assembled genomes and mobile genetic element host interactions using metagenomic Hi-C
CONSTRUCTING METAGENOME-ASSEMBLED GENOMES AND MOBILE GENETIC
ELEMENT HOST INTERACTIONS USING METAGENOMIC HI-C
by
Yuxuan Du
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
August 2024
Copyright 2024 Yuxuan Du
Dedication
I dedicate this dissertation to my beloved parents Junfeng Liu and Lizhi Du.
Acknowledgements
First of all, I would like to express my sincere gratitude to my advisor, Dr. Fengzhu Sun, for guiding
and supporting me throughout my Ph.D. journey. Dr. Sun introduced me to the world of metagenomics.
His passion for inquiry and dedication have deeply affected me, encouraging me to explore further in
academia. Whenever I faced challenges or critical decisions, Dr. Sun was there with valuable suggestions,
encouragement, and recognition of my achievements. His enthusiasm for research, attention to detail, and
resilience in the face of challenges have not only inspired me but also significantly changed my life. I am confident
that the lessons I’ve learned from him will continue to enlighten my path. As I progress in my academic
career, Dr. Sun’s example will always guide me.
I would also like to express my sincere gratitude to Dr. Michael Waterman and Dr. Jed A. Fuhrman for
their support and guidance as I pursue my academic career.
I would also like to express my sincere gratitude to my Doctoral qualifying exam and Dissertation
committee members, Dr. Jed A. Fuhrman, Dr. Liang Chen, Dr. Geoffrey Fudenberg, and Dr. Matthew
Pennell, for their constructive suggestions and invaluable feedback. Their insights and expertise greatly
improved the quality and scope of my work.
I would also like to thank Dr. Jed A. Fuhrman for his valuable comments on my research projects
and for his patient guidance in biological insights. Learning from him has taught me the importance of
collaboration with domain experts in the field of bioinformatics.
I would also like to extend my thanks to all the faculty members who have supported me during my
Ph.D. studies, including Dr. Remo Rohs, Dr. Liang Chen, Dr. Andrew Smith, Dr. Mark Chaisson, Dr. Peter
Calabrese, Dr. Adam Maclean, Dr. Geoffrey Fudenberg, Dr. Rory Spence, Dr. Vsevolod Katritch, Dr. Jazlyn
Mooney, Dr. Matthew Pennell, and Dr. Michael Doc Edge. Their efforts have broadened my understanding
of various fields within computational biology.
I would also like to thank my former labmates at Sun’s lab: Dr. Xin Bai, Dr. Yilin Gao, Dr. Kujin
Tang, Dr. Tianqi Tang, Dr. Weili Wang, Dr. Zifan Zhu, Siliangyu Cheng, and my current labmates: Dallace
Francis, Jiawei Huang, Yue Huang, Beibei Wang, Yuqiu Wang, Dr. Ziye Wang, and Wenxuan Zuo. Working
alongside you is exactly what made my Ph.D. journey both enjoyable and memorable.
I would also like to extend my thanks to my peers at USC including Bryan Dinh, Yibei Jiang, Jordy
Lam, Meilu McDermott, Raktim Mitra, Vardges Tserunyan, and my friends including Dr. Brendon Cooper,
Dr. Wenbo Chen, Haocheng Gao, Xinyu Guo, Wei Jiang, Dr. Jinsen Li, Shichao Liu, Dr. Tsung-Yu Lu,
Yangcheng Liu, Yingtong Liu, Zheyu Li, Dandan Peng, Jingwen Ren, Bo Sun, Yuan Tian, Xiaojun Wu,
Yingfei Wang, Jianzhi Yang, Liang You, Qingyang Yin, Yue Yu, Kan Zhou, Yue Zhou, Yuxiang Zhan and
many others.
Finally, and most importantly, I owe my deepest gratitude to my dear parents, Junfeng Liu and Lizhi
Du, for their unwavering support throughout my life. From childhood, they have been my role models,
teaching me the values of being a good person. Their guidance and constant support have played a crucial
role in shaping me into who I am today. They have consistently encouraged me to follow my dreams and
passions. Without their encouragement, I would not have achieved this milestone. This dissertation is
dedicated to them as a token of my love and appreciation.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Conventional metagenomic shotgun sequencing . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Metagenomic high-throughput chromosome conformation capture sequencing . . . . . . . 3
1.3 Computational challenges encountered in metaHi-C data analyses . . . . . . . . . . . . . . 4
1.3.1 Raw metaHi-C contacts harbor severe systematic biases . . . . . . . . . . . . . . . 4
1.3.2 Spurious Hi-C contacts confound the interpretability of Hi-C networks . . . . . . . 4
1.3.3 There exist significant gaps for current Hi-C-based contig binning approaches . . . 5
1.4 Outline of the dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 2: HiCzin: normalizing metagenomic Hi-C data and detecting spurious contacts using
zero-inflated negative binomial regression . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Framework of applying HiCzin to metagenomic Hi-C experiments . . . . . . . . . 10
2.2.2 Calculating the coverage of assembled contigs . . . . . . . . . . . . . . . . . . . . . 11
2.2.3 Applying TAXAassign to generate sample data of the intra-species contacts . . . . 11
2.2.4 Normalization via the HiCzin model . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.5 Spurious contact detection by a hybrid statistical method based on HiCzin . . . . . 14
2.2.6 Generalizing the HiCzin by selecting different independent variables . . . . . . . . 15
2.2.7 HiCzin normalization without labeled contigs . . . . . . . . . . . . . . . . . . . . . 15
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Analyses of experimental biases in synthetic metagenomic yeast samples . . . . . 16
2.3.2 Normalization methods in the public metagenomic Hi-C analysis pipelines . . . . . 18
2.3.3 Removing explicit biases and filtering out spurious contacts using zero-inflated
negative binomial regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.4 Applying the HiCzin model to the M-Y samples . . . . . . . . . . . . . . . . . . . . 19
2.3.5 Generalizing the HiCzin model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.6 Clustering of contigs by the Louvain algorithm . . . . . . . . . . . . . . . . . . . . 26
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Chapter 3: MetaCC allows scalable and integrative analyses of both long-read and short-read
metagenomic Hi-C data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Initial processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 NormCC normalization module in MetaCC . . . . . . . . . . . . . . . . . . . . . . 35
3.2.4 Discarding spurious inter-species contacts based on NormCC-normalized Hi-C
contacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.5 Genome binning in MetaCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.6 Evaluating the quality of recovered MAGs . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.7 Cleaning partially contaminated bins in MetaCC . . . . . . . . . . . . . . . . . . . 41
3.2.8 Assessing the performance of normalization and spurious contact removal on a
synthetic yeast metaHi-C dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.9 Estimating the coverages of assembled contigs . . . . . . . . . . . . . . . . . . . . 42
3.2.10 MAG analyses on the human gut short-read metaHi-C dataset . . . . . . . . . . . . 43
3.2.11 MAG analyses on two long-read metaHi-C datasets . . . . . . . . . . . . . . . . . . 43
3.2.12 Plasmid analyses on the sheep gut long-read metaHi-C dataset . . . . . . . . . . . 44
3.2.13 Other algorithms used in benchmarking . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 Overview of MetaCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.2 NormCC comprehensively corrects all systematic biases existing in a synthetic
yeast metaHi-C dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.3 NormCC outperforms HiCzin on the spurious contact removal, contig clustering,
and computational time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.4 MetaCC binning achieved the best performance of MAG retrieval on short-read
metaHi-C datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.5 MetaCC binning markedly outperformed existing binners on long-read metaHi-C
datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.6 MetaCC binning identified and expanded the order Erysipelotrichales from the
cow rumen and sheep gut samples . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.7 Plasmid analyses among high-quality MAGs retrieved by MetaCC binning from
the sheep gut sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.8 Running time of the overall MetaCC pipeline . . . . . . . . . . . . . . . . . . . . . 54
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Chapter 4: ImputeCC enhances integrative Hi-C-based metagenomic binning through constrained
random-walk-based imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2.2 Initial processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.3 The framework of ImputeCC binning . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.3.1 Detect assembled contigs with single-copy marker genes: . . . . . . . . . 63
4.2.3.2 Impute the metagenomic Hi-C contact matrix for contigs containing
marker genes: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.3.3 Precluster contigs with marker genes as preliminary bins . . . . . . . . . 65
4.2.3.4 Leiden clustering for all contigs using the information of preliminary bins 67
4.2.3.5 Integrative strategy to obtain the final bins . . . . . . . . . . . . . . . . . 68
4.2.4 Evaluating the quality of recovered MAGs from the mock and real metaHi-C datasets 68
4.2.5 MAG analyses on real metaHi-C datasets . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.6 Other binners used in benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.1 Overview of ImputeCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.2 ImputeCC achieved accurate preclustering for contigs with single-copy marker
genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3.3 ImputeCC retrieved the most high-quality genomes from the mock metaHi-C
datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.4 ImputeCC markedly outperformed existing binners on real metaHi-C datasets . . . 74
4.3.5 Running time analysis of the ImputeCC . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Chapter 5: ViralCC retrieves complete viral genomes and virus-host pairs from metagenomic Hi-C
data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.2 Initial processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.3 Viral contig detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.4 Construct the Hi-C interaction graph for viral contigs . . . . . . . . . . . . . . . . 85
5.2.5 Construct the host proximity graph for viral contigs . . . . . . . . . . . . . . . . . 85
5.2.6 Integrate the Hi-C interaction graph and the host proximity graph . . . . . . . . . 86
5.2.7 Leiden graph clustering based on the integrative graph . . . . . . . . . . . . . . . . 86
5.2.8 Evaluate the CheckV completeness of vMAGs on real metagenomic Hi-C datasets . 87
5.2.9 A systematic benchmarking strategy to evaluate the performance of binning viral
contigs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2.9.1 Rationale of the benchmarking framework . . . . . . . . . . . . . . . . . 87
5.2.9.2 Generate mock viral contigs with ground truth . . . . . . . . . . . . . . 88
5.2.9.3 Gold standards to assess binning performance using mock metagenomic
Hi-C datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2.10 The quality control of metagenomic Hi-C datasets . . . . . . . . . . . . . . . . . . 89
5.2.11 Annotate vMAGs at the order and family levels . . . . . . . . . . . . . . . . . . . . 89
5.2.12 Detect virus-host pairs between vMAGs and host MAGs . . . . . . . . . . . . . . . 90
5.2.13 Compare ViralCC to other pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.3.1 Generating mock metagenomic Hi-C datasets for benchmarking . . . . . . . . . . 90
5.3.2 Integrating the Hi-C interaction graph and the host proximity graph improves
binning performance on the mock human gut dataset . . . . . . . . . . . . . . . . . 91
5.3.3 ViralCC outperforms other binning methods on the mock human gut dataset . . . 92
5.3.4 Binning analyses of viral contigs on three real metagenomic Hi-C datasets . . . . . 93
5.3.5 Annotation of vMAGs demonstrated the high purity of the vMAGs at the family
level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.6 Phage-host network in the wastewater sample . . . . . . . . . . . . . . . . . . . . 96
5.3.7 Validate virus-host pairs using CRISPR spacer analysis on the wastewater dataset . 97
5.3.8 Running time of ViralCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Chapter 6: Conclusions and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.1.1 Metagenomic assembly using both shotgun reads and Hi-C reads . . . . . . . . . . 103
6.1.2 Reference-based metaHi-C analyses for the human gut environment . . . . . . . . 103
6.1.3 Imputing metaHi-C contact matrix using graph neural network . . . . . . . . . . . 104
Chapter 7: Supplementary materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.1 Supplementary materials for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.1.1 Initial processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.1.2 Shotgun assembly and Hi-C read alignment . . . . . . . . . . . . . . . . . . . . . . 105
7.1.3 Contact map generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.1.4 Annotating contigs by reference genomes for the M-Y samples . . . . . . . . . . . 106
7.2 Supplementary materials for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.2.1 The respective performances of HiCzin and HiCBin are markedly deteriorated
when only a small fraction of assembled contigs can be annotated . . . . . . . . . 108
7.2.2 Polishing HiFi assemblies using accurate short reads did not improve the binning
performance on the sheep gut dataset . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.2.3 The spurious contact removal step with default threshold consistently improved
the downstream binning results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.2.4 A standard read cleaning procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.2.5 Estimating the number of genomes in the metagenomic data using single-copy
marker genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2.6 Identifying the species identity of contigs on the synthetic yeast dataset . . . . . . 110
7.3 Supplementary materials for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.3.1 ImputeCC’s genus-level analysis unveiled key genera and species expansion in
the sheep gut microbiota . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.3.2 Filtering the incomplete reference genomes at the species level from the mock
microbial community . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.3.3 A standard read cleaning procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.3.4 The assembly of shotgun reads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.3.5 Identifying single-copy marker genes from assembled contigs . . . . . . . . . . . . 127
7.3.6 The modularity function of the Leiden algorithm used in ImputeCC . . . . . . . . 127
7.3.7 Estimation of the precision and recall for contig bins based on lineage-specific genes 127
7.3.8 Identifying the species identity of contigs on the mock metaHi-C dataset . . . . . . 128
7.4 Supplementary materials for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.4.1 Binning results for the viral contigs on the mock wastewater dataset and the mock
cow fecal dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.4.2 ViralCC performed better in recovering near complete vMAGs from large viral
genomes on the mock metagenomic Hi-C datasets . . . . . . . . . . . . . . . . . . 134
7.4.3 Results of virus-host detection on the human gut sample . . . . . . . . . . . . . . . 135
7.4.4 Results of virus-host detection on the cow fecal sample . . . . . . . . . . . . . . . . 136
7.4.5 The existence of biases by the viral genome sizes in the benchmarking method . . 136
7.4.6 Read cleaning procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.4.7 Algorithms to evaluate the completeness of vMAGs by CheckV . . . . . . . . . . . 137
7.4.8 Evaluation criteria of the clustering results . . . . . . . . . . . . . . . . . . . . . . . 138
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
List of Tables
2.1 Pearson correlation coefficients (absolute value) between normalized valid contacts and
the product of each of the three factors of explicit biases. . . . . . . . . . . . . . . . . . . . 22
2.2 Area under the discard-retain curve for different normalization methods. Higher AUDRC
score indicates better performance in spurious contact detection. The optimal values of
the results are in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Pearson correlation coefficients (absolute value) between normalized valid contacts and
the product of each of the three factors of explicit biases, and AUDRC for different
generalized HiCzin models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Comparison of the clustering results of contigs using the Louvain Algorithm. # of contigs
represents the number of contigs in groups; F-score, ARI, and NMI are Fowlkes Mallows
score, Adjusted Rand Index, and Normalized Mutual Information, respectively. The
optimal values of the results are in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
7.1 M-Y species list in the sample. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.2 The fractions of annotated contigs for HiCzin and HiCBin on different metaHi-C datasets.
Fewer than 1% of assembled contigs could be successfully labeled on both long-read
metaHi-C datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.3 The size of shotgun and Hi-C libraries from raw metaHi-C datasets. . . . . . . . . . . . . . 115
7.4 The assembly statistics of contigs from all datasets. . . . . . . . . . . . . . . . . . . . . . . 116
7.5 Comparison of the running time between NormCC and HiCzin on different metaHi-C
datasets. Values in the parentheses represent extra time consumed by HiCzin on preparing
the input data. NA means that HiCzin failed to converge on the cow rumen metaHi-C
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.6 Taxonomic statistics of 709 high-quality MAGs retrieved by MetaCC binning from the
sheep gut dataset. MAGs were annotated by GTDB-TK at the order level. . . . . . . . . . . 118
7.7 The distribution of plasmid contigs among high-quality MAGs retrieved by MetaCC
binning from the sheep gut dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.8 The plasmid contigs with coverage more than 2× the mean coverage of their
respective MAGs retrieved by MetaCC binning from the sheep gut dataset. . . . . . . . . . 120
7.9 The normalization performance of HiCzin when only 1% of contigs were annotated on the
yeast dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.10 The binning performance of HiCBin when only 1% of contigs were annotated on the yeast
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.11 The species list in the synthetic yeast sample. . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.12 The accession numbers (European Nucleotide Archive) and sizes of three shotgun
libraries, sequenced using the Illumina HiSeq 3000, ONT MinION R9, and PacBio Sequel
II platforms, from the same mock microbial community. . . . . . . . . . . . . . . . . . . . . 131
7.13 The sizes of shotgun and Hi-C libraries from raw metaHi-C datasets. . . . . . . . . . . . . 132
7.14 The number of mapped Hi-C read pairs for the mock metaHi-C datasets. . . . . . . . . . . 133
7.15 The statistics of assembled contigs for the three metagenomic Hi-C datasets. Note: N50 is
defined by the length of the shortest contig where contigs with longer and equal length
cover at least 50% of the assembly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.16 The statistics of viral contigs detected by VirSorter for the three metagenomic Hi-C
datasets. N50 is defined by the length of the shortest contig where contigs with longer
and equal length cover at least 50% of the assembly. . . . . . . . . . . . . . . . . . . . . . . 147
7.17 The numbers of putative reference genomes and mock viral contigs from the three
metagenomic datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.18 G_int outperforms G_hic and G_host for clustering viral contigs in terms of F-score, ARI, NMI,
and homogeneity on the mock human gut dataset. The optimal values of the results are
in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.19 G_int outperforms G_hic and G_host for clustering in terms of the completeness and contamination
criteria on the mock human gut dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.20 The numbers of high-quality potential host MAGs retrieved by HiCBin from the three
metagenomic Hi-C datasets according to the CheckM criteria. Note: near-complete
(CheckM completeness ≥ 90%, CheckM contamination ≤ 10%), substantially complete
(70% ≤ CheckM completeness < 90%, CheckM contamination ≤ 10%), and moderately
complete (50% ≤ CheckM completeness < 70%, CheckM contamination ≤ 10%) . . . . . . 151
7.21 The genome size fractions of near-complete vMAGs recovered by different binners on the
three mock datasets. VAMB failed to bin viral contigs on the mock wastewater and cow
fecal datasets due to the small number of contigs to train its model. The optimal values of
the results are in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.22 The taxonomic statistics of potential host MAGs derived from the human gut dataset. . . . 153
7.23 The taxonomic statistics of potential host MAGs derived from the cow fecal dataset. . . . . 154
List of Figures
2.1 Workflow of HiCzin utilized in metagenomic Hi-C analysis. . . . . . . . . . . . . . . . . . 10
2.2 Relationship between raw nonzero interaction counts and the product of the number of
restriction sites, length, and coverage between contig pairs. . . . . . . . . . . . . . . . . . . 17
2.3 Comparison of (a) the raw counts of spurious contacts and valid contacts (i.e., nonzero
intra-species Hi-C contacts), (b) the number of restriction sites of spurious contacts and
valid contacts, (c) the length of spurious contacts and valid contacts, (d) the coverage of
spurious contacts and valid contacts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 (a) Comparison of the distribution of raw valid contacts and raw spurious contacts. (b)
Comparison of the distribution of normalized valid contacts and normalized spurious
contacts by HiCzin. (c) The proportions of discarded valid contacts and discarded spurious
contacts. (d) The discard-retain curve using all sample data. . . . . . . . . . . . . . . . . . 23
3.1 Overview of the MetaCC framework for metagenomic Hi-C analyses. (a) The input
metaHi-C dataset consists of shotgun libraries and Hi-C libraries. Short/long reads
in shotgun libraries are assembled into contigs, to which Hi-C paired-end reads are
subsequently aligned. In this way, a raw Hi-C contact matrix displaying the proximity
similarity between contigs within cells can be constructed. The raw Hi-C contact matrix
is normalized by the NormCC normalization module to correct the systematic biases
and spurious inter-species contacts are subsequently removed. Assembled contigs are
then binned into high-quality MAGs leveraging the normalized Hi-C contact matrix.
Finally, downstream analyses are conducted. (b) Visualization of the NormCC
normalization and spurious contact removal procedures using heatmaps of the Hi-C contact
matrix for contigs belonging to the species Kluyveromyces wickerhamii and Ashbya
gossypii from a synthetic yeast dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Benchmarking the NormCC normalization module on the synthetic yeast metaHi-C
dataset. (a) Discard-retain curves for evaluating spurious contact removal based on the
raw, HiCzin-normalized, or NormCC-normalized Hi-C contact matrices, respectively.
NormCC achieved the highest AUDRC (i.e., area under discard-retain curve). (b)
Performance of contig clustering based on the raw, HiCzin-normalized, or NormCC-normalized
Hi-C contact matrices as well as NormCC-normalized Hi-C contact matrix
with spurious contact removal, respectively. NormCC outperformed HiCzin on the contig
clustering in terms of F-score, ARI, and NMI. . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Benchmarking the MetaCC binning module on short-read metaHi-C datasets. MetaCC
binning outperformed other binners on both the human gut and wastewater short-read
metaHi-C datasets according to the CheckM criteria (Near-complete: completeness ≥
90% and contamination ≤ 10%; Substantially complete: 70% ≤ completeness < 90%
and contamination ≤ 10%; Moderately complete: 50% ≤ completeness < 70% and
contamination ≤ 10%). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4 Benchmarking the MetaCC binning module on long-read metaHi-C datasets. (a) MetaCC
binning outperformed other binners on both the cow rumen and sheep gut long-read
metaHi-C datasets according to the CheckM criteria (Near-complete: completeness ≥
90% and contamination ≤ 10%; Substantially complete: 70% ≤ completeness < 90%
and contamination ≤ 10%; Moderately complete: 50% ≤ completeness < 70% and
contamination ≤ 10%). HiCBin failed to bin contigs on the cow rumen dataset due to
the nonconvergence of its adopted normalization method HiCzin. (b) Comparison of
near-complete bins identified by MetaCC binning and other Hi-C-based binners from
the long-read metaHi-C datasets. The total length of each bar shows the total number of
near-complete (NC) bins recovered by each binner. Each bar is then colored according to
the number of NC bins that can be identified by both binners (NC in both), the number of
NC bins that are substantially complete in the other bin set (SC in other), and the number
of NC bins that are moderately complete or missing in the other bin set (MC or miss in
other). (c) Comparison of the number of species recovered by different binners with high
quality. MAGs retrieved by MetaCC binning represent the largest taxonomic diversity at
the species level. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Overview of ImputeCC. Given an input of the metagenomic Hi-C contact matrix and
contigs containing single-copy marker genes, ImputeCC initiates the imputation of the
metaHi-C contact matrix using a new constrained random walk with restart (CRWR)
algorithm, specifically limiting random walks to originate from contigs with marker genes.
Subsequently, ImputeCC segregates and retains the imputed contact matrix exclusively
for marker-gene-containing contigs, using it in conjunction with the characteristics of
single-copy marker genes to effectively precluster these contigs as preliminary bins.
Finally, ImputeCC applies the Leiden clustering method to group all assembled contigs,
with insights from the preliminary bins guiding the optimization of the binning process.
4.2 Benchmarking using the three mock metaHi-C datasets. (a) Assessing the quality of
preliminary bins using ARI. ImputeCC accurately grouped marker-gene-containing
contigs while the CRWR imputation markedly improved the preclustering performance.
(b) ImputeCC outperformed other binners on all three mock metaHi-C datasets
with respect to the number of retrieved high-quality MAGs (completeness ≥ 90% and
contamination ≤ 5%).
4.3 Benchmarking using the real cow rumen and sheep gut long-read metaHi-C datasets. (a)
The number of MAGs with varying completeness (comp) and contamination (cont) ≤ 5%.
ImputeCC consistently outperforms other binning tools, producing a greater number of
high-quality bins in both long-read metaHi-C datasets. (b) The number of MAGs with
varying completeness and contamination ≤ 10%. ImputeCC returned more medium-quality bins when compared to alternative methods for both datasets. (c) Comparative
analysis of the taxonomic diversity at the species level within medium-quality bins
obtained by different binning tools. ImputeCC’s binning approach stands out by capturing
the broadest range of microbial species in medium-quality MAGs.
5.1 The general workflow of ViralCC to retrieve high-quality viral genomes and determine
virus-host pairs. Shotgun reads are first assembled into contigs, to which Hi-C paired-end
reads are aligned. Viral contigs are subsequently identified. Leveraging Hi-C linkages
and the virus-host proximity structure to link viral contigs, ViralCC constructs the Hi-C
interaction graph and the host proximity graph. After integrating the two graphs, ViralCC
employs Leiden clustering to reconstruct draft viral genomes, and additionally detects the
virus-host pairs based on recovered viral genomes and Hi-C linkages.
5.2 Comparison of viral genome retrieval performance according to (a) clustering metrics and
(b) completeness and contamination criteria (Moderately complete: 50% ≤ completeness
< 70%, contamination ≤ 10%; Substantially complete: 70% ≤ completeness < 90%,
contamination ≤ 10%; Near-complete: completeness ≥ 90%, contamination ≤ 10%).
ViralCC outperforms other binning methods on the mock human gut dataset.
5.3 Comparison of draft viral bins retrieved by different binning tools according to the
CheckV completeness standard on the (a) human gut, (b) cow fecal, and (c) wastewater
datasets. ViralCC can retrieve more complete viral genomes compared to VAMB, CoCoNet,
vRhyme, bin3C, and MetaTOR from all three real metagenomic Hi-C samples.
5.4 Heatmaps of raw Hi-C contact matrices of the top ten vMAGs from the (a) human gut, (b)
cow fecal, and (c) wastewater datasets with the contig index as the axis unit. The vMAGs
were first ranked by their numbers of contigs and then the contigs within each vMAG
were ranked by their sizes. The scale bar shows the number of raw Hi-C contacts between
viral contigs.
5.5 Taxonomy statistics of annotated vMAGs on the (a) human gut, (b) cow fecal, and (c)
wastewater datasets. The numbers on the graph indicate the number of vMAGs belonging
to different families.
5.6 (a) Taxonomic annotations of MAGs recovered by HiCBin from the domestic wastewater
sample. Burkholderiales, Pseudomonadales, Lachnospirales, Bacteroidales, and Oscillospirales
were the predominant orders. (b) The apparent infection spectrum of vMAGs from the
wastewater sample. vMAGs belonging to the family Myoviridae mainly targeted hosts
from the order Burkholderiales and a large number of vMAGs from the family Siphoviridae
could infect Bacteroidales bacteria.
7.1 The inter-species versus intra-species Hi-C contacts within the NormCC-normalized
Hi-C contact matrix from the synthetic yeast metaHi-C dataset. The y-axis represents
the logarithmically scaled values of normalized Hi-C contacts by NormCC. An unpaired
t test was conducted to compare the values between 393,228 normalized intra-species
Hi-C contacts and 125,860 inter-species Hi-C contacts. The resulting p-value is less than
2.22e-16, indicating that the magnitude of intra-species contacts significantly surpasses
that of spurious inter-species contacts.
7.2 The number of high-quality bins retrieved by MetaCC binning based on unpolished or
polished contigs from the sheep gut long-read metaHi-C dataset. Without polishing
HiFi assembly using accurate short reads (unpolished), MetaCC binning could retrieve
417, 162, and 130 near-complete, substantially complete, and moderately complete bins,
respectively. In contrast, after polishing (polished), only 416, 153, and 131 near-complete,
substantially complete, and moderately complete bins were recovered by MetaCC binning,
respectively, indicating that the polishing step did not substantially improve the binning
results.
7.3 The numbers of assembled contigs from 13 species in the synthetic yeast metaHi-C
dataset. The full species names on the x-axis are shown in Supplementary Table 10. Most
of the assembled contigs belong to the genus Saccharomyces.
7.4 Benchmarking using the real human gut and wastewater short-read metaHi-C datasets.
(a) The number of MAGs with varying completeness (comp) and contamination (cont) ≤
5%. ImputeCC consistently outperforms other binning tools, producing a greater number
of high-quality bins in both short-read metaHi-C datasets. (b) The number of MAGs with
varying completeness and contamination ≤ 10%. ImputeCC returned more medium-quality bins when compared to alternative methods for both datasets. (c) Comparative
analysis of the taxonomic diversity at the species level within medium-quality bins
obtained by different binning tools. ImputeCC’s binning approach stands out by capturing
the broadest range of microbial species in medium-quality MAGs.
7.5 Comparative analysis of high-quality MAGs retrieved from the sheep gut long-read
metaHi-C dataset. (a) Comparison of high-quality MAG recovery using ImputeCC and
three other Hi-C-based binning tools (MetaTOR, bin3C, and MetaCC), as determined
through Mash analysis. ImputeCC successfully retrieved the majority of high-quality
MAGs obtained by the alternative Hi-C-based tools, while also surpassing them by
reconstructing a significant number of additional high-quality MAGs. (b) Annotation
analysis of the high-quality MAGs highlighting the enhanced diversity captured by
ImputeCC at different taxonomic levels in comparison to its Hi-C-based counterparts,
such as MetaTOR, bin3C, and MetaCC.
7.6 Heatmaps of raw Hi-C contact matrices of the top ten vMAGs on the (a) human gut, (b)
cow fecal, and (c) wastewater datasets with the contig size as the axis unit. The vMAGs
were first ranked by their numbers of contigs and then the contigs within each vMAG
were ranked by their sizes. The scale bar shows the number of raw Hi-C contacts between
viral contigs.
7.7 Comparison of viral genome retrieval performance according to the (a) clustering metrics
and (b) completeness and contamination criteria on the mock wastewater dataset, and
the (c) clustering metrics and (d) completeness and contamination criteria on the mock
cow fecal dataset. Moderately complete: 50% ≤ completeness < 70%, contamination
≤ 10%; Substantially complete: 70% ≤ completeness < 90%, contamination ≤ 10%;
Near-complete: completeness ≥ 90%, contamination ≤ 10%.
7.8 The strip plot of the genome sizes of the near-complete vMAGs recovered by different
binners on the mock (a) human gut, (b) wastewater, and (c) cow fecal datasets. VAMB
failed to bin viral contigs on the mock wastewater and cow fecal datasets due to the small
number of contigs to train its model.
7.9 The apparent infection spectrum of vMAGs on hosts from different orders on the human
gut dataset.
7.10 The apparent infection spectrum of vMAGs on hosts from different orders on the cow
fecal dataset.
Abstract
The advent of metagenomic high-throughput chromosome conformation capture (metagenomic Hi-C or
metaHi-C) enables identifying contig-to-contig relationships with respect to the proximity within the same
physical cell and reveals great potential to simultaneously study multiple genomes and probe active virus-host interactions. However, metaHi-C data analyses encounter significant challenges, including the
existence of systematic biases from raw metaHi-C contacts, the impact of spurious Hi-C contacts on data
interpretation, and gaps in current Hi-C-based contig binning methods that overlook critical biological information and fail to adequately address viral genome recovery. In this dissertation, we present our efforts
to tackle these computational challenges, focusing on normalization and binning to improve microbial
community analysis. In terms of normalization, we first introduce HiCzin as the first method able to eliminate all systematic biases and we also put forward a straightforward but effective strategy to filter out
spurious Hi-C linkages. We further enhance our initial HiCzin approach by introducing NormCC, a more
efficient normalization method. With respect to the contig binning, we develop ImputeCC for enhanced
contig binning by incorporating single-copy marker genes, and ViralCC for dedicated viral genome recovery and identification of virus-host pairs. Each method advances the analysis of microbial communities by
correcting biases, improving clustering of metagenomic contigs, and enabling the discovery of new species
and mobile genetic element host interactions in complex microbial ecosystems.
Chapter 1
Introduction
Microbial communities consist of a wide range of microorganisms with many unexploited enzymes and
metabolic potentials encoded in the genomes of these diverse microbes [64, 140]. For instance, these microbes are found throughout the human body, particularly in the human gut, where the number of microbial cells ranges from 10^10 to 10^11 per wet-weight gram of human faeces [61, 143, 153]. The gut microbiome
is crucial to our health and is closely associated with various diseases [138]. Studies have demonstrated
that an imbalance in the gut microbiome, a condition known as gut dysbiosis, can result in significant gastrointestinal disorders like inflammatory bowel disease and colorectal cancer [2, 57]. Furthermore, gut
dysbiosis can worsen infections [120] and lead to metabolic diseases, including obesity and type II diabetes
[56, 150]. The gut-brain axis, a concept highlighting the link between gut health and cognitive function,
has also received considerable attention [27]. Recent research has underscored the gut-brain axis’s role
in cognitive disorders such as Parkinson’s disease and Alzheimer’s disease [80, 83]. In addition to their crucial impact on
human health and disease, microbial communities are fundamental to various environmental processes,
including nutrient cycling and environmental detoxification [12, 52]. Therefore, exploring the structure
and dynamics of microbial communities is essential for enhancing our understanding of their significant
effects on human health and ecosystems.
Traditional pure cultures grown in the laboratory are insufficient to explore microbial diversity because
most microorganisms cannot be cultivated [132, 144]. As a culture-independent genomic approach,
metagenomics avoids the isolation or cultivation of microorganisms and provides a broad view of the
community structure and the functional capabilities present in complex ecosystems [58, 146].
1.1 Conventional metagenomic shotgun sequencing
Metagenomic shotgun sequencing aims to capture DNA from all organisms present in the microbial communities. In metagenomic shotgun sequencing, all microbial genomes are fragmented into smaller pieces.
This approach does not target specific genomes but sequences all DNA present in the sample, producing what we refer
to as shotgun reads [3]. These reads represent a mix of all genomic material, including that from prokaryotes, eukaryotes, viruses, and plasmids. These shotgun reads can be either clustered into groups to reduce
the size of metagenomic datasets [8, 98] or assembled into longer contigs, which are usually a portion
of the full-length genomes [90, 112]. To retrieve the complete genomes present in microbial ecosystems,
assembled contigs are grouped into bins that represent draft genomes of different species. This grouping
process, termed binning, is the foundation of the downstream taxonomic profiling and functional analysis. Traditional shotgun-based binning methods make use of the contigs’ compositions and/or abundance
profiles [5, 77, 97, 158]. Compositions of contigs usually refer to GC-content and oligonucleotide frequencies and the shotgun-based binning tools assume that contigs from the same genome share similar
compositions [21, 163]. Besides, it has been shown that coverage profiles of metagenomic contigs from the
same genome are highly correlated across multiple samples [109]. Although some shotgun-based binning
pipelines have achieved good retrieval performance combining the information of the compositions and
abundance, effective co-abundance profiles cannot be constructed if there are not enough samples due to
the cost of sequencing or the limited ability to collect samples, which is common in clinical studies, for
instance. When applied to a single sample, the shotgun-based binning methods can merely rely on the
composition information to group contigs and struggle to distinguish closely related species with similar
genomic compositions.
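The composition features mentioned above (GC-content and oligonucleotide frequencies) can be illustrated with a minimal Python sketch. The function names are hypothetical and the code is not taken from any cited binning tool; in particular, it counts k-mers on the forward strand only, whereas real binners often merge reverse-complement k-mers.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a contig."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def kmer_frequencies(seq, k=4):
    """Relative frequency of every k-mer over the alphabet ACGT (forward strand only)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing ambiguous bases (e.g. N) are ignored in the total
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}
```

Two contigs with similar frequency vectors are candidates for the same bin under the composition assumption, which is also why closely related species with near-identical vectors are hard to separate from a single sample.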
1.2 Metagenomic high-throughput chromosome conformation capture
sequencing
High-throughput chromosome conformation capture (Hi-C) [94] is a DNA proximity ligation technique
and can generate millions of paired-end reads linking DNA fragments in close proximity within cells. Hi-C sequencing has already been utilized to explore topologically associated domains and the compartment
property of the mammalian genomes [31, 94].
The recent introduction of Hi-C sequencing into metagenomics leads to the metagenomic Hi-C technique (metaHi-C) and provides new insights for understanding species diversity and microorganism interactions within a single microbial sample [24, 79, 101, 142, 161]. This new technique combines the advanced
proximity ligation method with shotgun sequencing in metagenomics, offering a novel perspective on microbial communities. Specifically, shotgun sequencing involves extracting genomic fragments directly
from a microbial sample. These fragmented shotgun reads are assembled into longer contigs. In parallel,
Hi-C experiments on the same sample generate DNA-DNA proximity ligations, linking loci within the
same cell, and producing millions of paired-end Hi-C reads. The Hi-C reads are then aligned to the assembled contigs, enabling the identification of metagenomic Hi-C contacts. These contacts, representing the
number of Hi-C read pairs linking contig pairs, reveal contig relationships based on physical proximity
within the microbial community. Since metaHi-C contacts provide a reliable way to understand how different contigs are related, they hold great potential to group assembled contigs into metagenome-assembled
genomes (MAGs) [9, 29, 65, 121] and to identify the virus-host pairs or mobile genetic element-host interactions [79, 101].
1.3 Computational challenges encountered in metaHi-C data analyses
1.3.1 Raw metaHi-C contacts harbor severe systematic biases
Raw Hi-C interaction counts are influenced by contig attributes such as the number of restriction sites on
contigs, contig length, and coverage, leading to systematic biases [18, 33, 121]. Specifically, longer contigs
and those with more restriction sites have a higher chance of ligation, while high coverage suggests more
frequent interactions. These factors distort the true proximity between contigs, underscoring the need
for normalization. However, existing normalization methods, designed primarily for individual genomes
[63, 68], fall short with metagenomic Hi-C data due to its unique complexity and variability in species
abundance. Moreover, though some metagenomic Hi-C normalization approaches have been proposed,
they remain insufficient in fully correcting these systematic biases. Therefore, it is imperative to eliminate
all systematic biases from raw metaHi-C contacts.
1.3.2 Spurious Hi-C contacts confound the interpretability of Hi-C networks
Spurious Hi-C contacts can arise from multiple steps of experimental protocols. Because ligation in solution
can occur in protocols without sufficient crosslinking [62], DNA fragments from different species can
potentially be ligated together. Furthermore, reads originating from species that are closely related could
present challenges in mapping unambiguously to one species or the other [142]. These erroneous links
can significantly impair the contig binning process and lead to the incorrect identification of interactions
between mobile genetic elements and their hosts, potentially skewing our understanding of microbial
ecosystems. To maintain the integrity of our analyses and the conclusions drawn from metaHi-C data,
it is crucial to identify and eliminate these spurious contacts, ensuring a more accurate representation of
microbial interactions.
1.3.3 There exist significant gaps for current Hi-C-based contig binning approaches
Hi-C-based contig binning leverages the frequency of Hi-C contacts within the same genome to cluster
fragmented contigs into metagenome-assembled genomes (MAGs) [9, 29, 34], enhancing our understanding of microbial functions and virus-host dynamics. Compared to conventional binning methods that use
sequence composition and coverage, Hi-C-based binning demonstrated its superior capabilities in MAG
recovery from a single sample [34, 121]. However, current Hi-C-based binning methods mainly utilize
Hi-C interactions for clustering contigs, neglecting critical biological information like single-copy marker
genes embedded in assembled contigs. This oversight represents a critical gap in current methods, indicating significant potential for more refined and integrative contig binning methods. Moreover, despite
recent advances in recovering eukaryotic, bacterial, and archaeal genomes using Hi-C contact maps, few
Hi-C-based methods are designed to retrieve viral genomes. Therefore, it is imperative to fill these gaps.
1.4 Outline of the dissertation
This dissertation introduces innovative computational methods to tackle the aforementioned challenges
posed by metaHi-C data analysis. Specifically, we have developed the following methods to address the
computational challenges:
• In Chapter 2, we demonstrate the limitations of existing normalization methods for metagenomic
Hi-C data in addressing systematic biases. To solve this problem, we propose HiCzin [33], a novel
technique employing zero-inflated negative binomial regression for both normalization and spurious
contact detection. This approach models Hi-C contacts between contigs in the regression and takes
residuals of the regression as the normalized Hi-C contacts. HiCzin stands out as the first method to
fully eliminate systematic biases in a synthetic yeast dataset [18]. We also put forward a simple but
effective strategy to filter out spurious inter-species Hi-C contacts. By effectively correcting these
biases and identifying spurious contacts, HiCzin enhances the quality of metagenomic Hi-C contact
maps, leading to improved clustering of metagenomic contigs and enabling more accurate analysis
of microbial communities.
• In Chapter 3, we enhance our initial HiCzin approach by introducing NormCC [36], a new normalization method. Compared to HiCzin, which models Hi-C contacts within the same species and
depends on contig annotations, NormCC advances this by using a model that accounts for all proximity ligation events per contig without needing annotations, streamlining the normalization process.
We validate the normalization performance of NormCC using a synthetic yeast dataset [18] and
show that NormCC outperforms HiCzin in eliminating spurious contacts, improving contig clustering, and reducing computation time. Leveraging NormCC, we also develop MetaCC, a Hi-C-based
binning pipeline for more accurate microbial community analysis.
• In Chapter 4, we address the oversight in current Hi-C-based binning methods that ignore crucial
biological signals such as single-copy marker genes. To bridge this gap, we introduce ImputeCC,
an innovative binning tool that combines Hi-C interactions with single-copy marker genes to refine
contig binning. Utilizing a constrained random walk with restart algorithm, ImputeCC enhances
Hi-C connectivity, enabling more precise clustering for marker-gene-containing contigs and the recovery of numerous high-quality MAGs. ImputeCC markedly outperforms conventional Hi-C-based
binners across various datasets, proving its efficacy in improving microbial genome reconstruction
and uncovering new species in complex communities.
• In Chapter 5, we tackle the challenge of viral genome recovery from metaHi-C data with the development of ViralCC [32]. ViralCC aims to recover complete viral genomes and identify virus-host
pairs from metagenomic Hi-C data. Unlike other Hi-C-based methods that primarily rely on Hi-C interactions, ViralCC incorporates a virus-host proximity structure for more accurate binning.
Demonstrated across various ecosystems, including human gut, cow fecal, and wastewater samples, ViralCC outperforms existing methods in retrieving viral genomes with higher completeness.
This tool not only enhances viral genome recovery but also facilitates the exploration of virus-host
interactions within microbial communities.
Finally, we conclude the dissertation and put forward our future research proposals in Chapter 6.
Supplementary materials for Chapters 2-5 are included in Chapter 7.
Chapter 2
HiCzin: normalizing metagenomic Hi-C data and detecting spurious
contacts using zero-inflated negative binomial regression
In this chapter, we report on two types of biases in metagenomic Hi-C experiments: explicit biases and
implicit biases, and introduce HiCzin, a parametric model to correct both types of biases and remove
spurious interspecies contacts. We demonstrate that the normalized metagenomic Hi-C contact maps by
HiCzin result in lower biases, higher capability to detect spurious contacts, and better performance in
metagenomic contig clustering.
2.1 Introduction
High-throughput chromosome conformation capture (Hi-C) is a DNA proximity ligation approach with
many applications, including the investigation of genomic structures and DNA interactions and even the characterization of
virus-host interactions from metagenomes [15, 94]. In Hi-C experiments, chimeric junctions are formed
between pieces of DNA in close proximity within cells and then subjected to paired-end sequencing, generating millions of paired-end reads linking DNA fragments [94]. The number of reads connecting two
DNA fragments is significantly related to the probability of contact between genomic loci in the three-dimensional structure at a fixed time point. The Hi-C technique reveals the compartment property of the
mammalian genomes [94], identifies topologically associated domains [31], and reconstructs haplotypes
[136].
Most recently, the Hi-C technique has been applied to the metagenomics domain (metagenomic Hi-C), and a series of Hi-C experiments have been conducted for microbial communities rather than a single
species [29, 121]. Combined with traditional shotgun sequencing, the metagenomic Hi-C technique has displayed a powerful ability to probe virus–host interactions [15], simultaneously retrieve multiple genomes
[18], deconvolute assembled contigs from whole genome shotgun (WGS) sequencing data into genome
bins in both simulated and real microbial communities [9], and track horizontal gene transfer [161].
However, there exist strong experimental biases for the Hi-C interaction counts [162]. Therefore, normalizing Hi-C data is essential to remove these biases. Though multiple strategies have been put forward
[63, 68], most of these normalization methods aim to normalize Hi-C data derived from a single species,
mainly human cells, and are not suitable for metagenomic Hi-C data from complex communities. This is mainly because potential factors of biases for metagenomic Hi-C data are different from those
for Hi-C data within individual species. Additionally, it is not valid to theoretically assume all contigs
should have equal visibility in metagenomic Hi-C data as the relative abundance levels of the different
species can vary. Several relatively simple metagenomic Hi-C normalization methods have been developed. ProxiMeta [121] applied a normalization to the raw Hi-C counts by accounting for the estimated
abundance of the contigs, and further took the number of restriction sites on the contigs into consideration
[142]. As a proprietary metagenomic genome binning platform without an open-source pipeline, ProxiMeta
did not describe its normalization algorithms in detail. Beitel et al. [10] divided raw interaction counts by
the product of the lengths of the two contigs. MetaTOR [9] normalized raw counts by the geometric mean
of the contigs’ coverage. Metaphase [18] and bin3C [29] divided raw Hi-C counts by the product of the
number of restriction sites, and bin3C additionally used the Knight-Ruiz algorithm [84] to construct a general doubly
stochastic matrix after the first step correction. We will show that these normalization methods are not
effective in removing all biases. Additionally, the biases of spurious inter-species contacts are ignored for
metagenomic Hi-C data by all these normalization methods, considerably weakening the interpretability
of the Hi-C data [142].
Figure 2.1: Workflow of HiCzin utilized in metagenomic Hi-C analysis.
Here we first comprehensively discuss potential experimental biases for metagenomic Hi-C data, and
then propose HiCzin, a method to normalize metagenomic Hi-C data based on the zero-inflated negative
binomial regression framework [165]. We also develop a hybrid statistical method to detect spurious
inter-species contacts. We show that the normalized metagenomic Hi-C contact maps by HiCzin lead to
lower biases, higher ability to detect spurious contacts, and better performance in metagenomic contig
clustering on the published metagenomic Hi-C dataset.
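For orientation, the simple normalizations surveyed in this introduction can be written down in a few lines. The Python sketch below is a literal reading of the one-sentence descriptions given above (division by the product of contig lengths, by the geometric mean of coverage, or by the product of restriction-site counts); the function names are hypothetical, and the actual tools may differ in details that are not described here.

```python
import math


def normalize_by_length(raw, len_i, len_j):
    """Raw contact count divided by the product of the two contig lengths."""
    return raw / (len_i * len_j)


def normalize_by_coverage(raw, cov_i, cov_j):
    """Raw contact count divided by the geometric mean of the two contig coverages."""
    return raw / math.sqrt(cov_i * cov_j)


def normalize_by_sites(raw, sites_i, sites_j):
    """Raw contact count divided by the product of the restriction-site counts."""
    return raw / (sites_i * sites_j)
```

Each function corrects for a single factor, which illustrates the limitation argued above: none of them accounts for restriction sites, length, and coverage jointly.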
2.2 Methods
2.2.1 Framework of applying HiCzin to metagenomic Hi-C experiments
The workflow of HiCzin utilized in the metagenomic Hi-C analysis is shown in Figure 2.1. In metagenomic
Hi-C experiments, short reads are obtained by shotgun sequencing from microbial communities. At the
same time, metagenomic Hi-C sequencing reads are generated from the same sample. Contigs are assembled from the shotgun short reads and Hi-C reads are mapped to the assembled contigs to construct raw
contact maps consisting of the number of Hi-C reads mapped to contig pairs. Then, HiCzin is employed to
normalize raw contact maps and discard spurious contacts. Finally, downstream analysis can be conducted
on the basis of normalized contact maps by HiCzin.
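The construction of the raw contact map can be sketched as follows in Python. This is a minimal illustration, not the pipeline used in this work: it assumes the alignment step has already been reduced to an iterable of (contig of mate 1, contig of mate 2) name pairs, and it assumes that read pairs falling within a single contig are not counted as contig-to-contig contacts. The function name is hypothetical.

```python
from collections import Counter


def build_contact_map(read_pairs):
    """Count Hi-C read pairs linking each unordered pair of distinct contigs."""
    contacts = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a == contig_b:
            # Assumption: pairs mapped within one contig carry no
            # between-contig information and are skipped.
            continue
        # Sort the names so that (a, b) and (b, a) share one key
        key = tuple(sorted((contig_a, contig_b)))
        contacts[key] += 1
    return contacts
```

The resulting dictionary is a sparse representation of the symmetric raw contact matrix, which is the input that a normalization step would then operate on.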
2.2.2 Calculating the coverage of assembled contigs
The coverage of contigs was computed using the MetaBAT [76] v2.12.5 script ‘jgi_summarize_bam_contig_depths’.
2.2.3 Applying TAXAassign to generate sample data of the intra-species contacts
The taxonomic assignment of contigs was resolved by TAXAassign (v0.4) [67] with parameters ‘-p -c 20 -r
10 -m 98 -q 98 -t 95 -a “60,70,80,95,95,98” -f’. Assignment results with ‘unclassified’ at the species level were
discarded, and only deterministic results of taxonomic assignment at the species level were kept. Intra-species pairs were subsequently generated by pairwise combining contigs assigned to the same species,
and corresponding contacts were treated as samples to fit the HiCzin model.
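The pair-generation step described above can be sketched in Python. The sketch assumes the taxonomic assignments have already been parsed into a dictionary from contig name to species label (the parsing of TAXAassign output is not shown), and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def intra_species_pairs(species_of):
    """Return all unordered pairs of contigs assigned to the same species."""
    by_species = defaultdict(list)
    for contig, species in species_of.items():
        # Discard contigs without a species-level assignment
        if species is None or species == "unclassified":
            continue
        by_species[species].append(contig)
    pairs = []
    for contigs in by_species.values():
        pairs.extend(combinations(sorted(contigs), 2))
    return pairs
```

Pairs with no observed Hi-C reads are retained as zero contacts, since the zero-inflated model described in the next subsection is fitted on both zero and nonzero intra-species contacts.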
2.2.4 Normalization via the HiCzin model
Based on the zero-inflated generalized linear mixed framework [88], HiCzin is a two-component mixture model combining a mass point at zero with a count distribution. Specifically, within the intra-species
contacts, zero contacts may come from two sources: the count distribution, showing that these zeros are
observations of the population of the intra-species contacts and no interactions happened, or the zero mass
points, indicating that Hi-C interactions happened, but the observations of the interactions were lost due
to certain kinds of experimental noise.
Formally, denote the population of the intra-species contacts as a random variable Y. The basic assumption of the HiCzin model is that Y follows the negative binomial distribution. Let Zij denote a zero-inflated random variable of the intra-species contacts between the ith contig and the jth contig. Then the
random variable Zij is given by
$$
Z_{ij} \sim
\begin{cases}
0, & \text{with probability } \pi_{ij}, \\
\mathrm{NB}(\mu_{ij}, \theta_{ij}), & \text{with probability } 1 - \pi_{ij},
\end{cases}
\tag{2.1}
$$
where πij represents the probability of a zero count generated from the zero-inflation part of the model
and NB(µij , θij ) is the negative binomial distribution with mean µij and shape parameter θij .
Therefore, the zero-inflated density of Zij is the result of mixing a negative binomial distribution and
a degenerate distribution at zero as
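$$
\Pr(Z_{ij} = z_{ij}) =
\begin{cases}
\pi_{ij} + (1 - \pi_{ij}) \left( \dfrac{\theta_{ij}}{\mu_{ij} + \theta_{ij}} \right)^{\theta_{ij}}, & z_{ij} = 0, \\[2ex]
(1 - \pi_{ij}) \, \dfrac{\Gamma(z_{ij} + \theta_{ij})}{\Gamma(\theta_{ij}) \cdot z_{ij}!} \cdot \dfrac{\mu_{ij}^{z_{ij}} \cdot \theta_{ij}^{\theta_{ij}}}{(\mu_{ij} + \theta_{ij})^{z_{ij} + \theta_{ij}}}, & z_{ij} = 1, 2, \cdots,
\end{cases}
\tag{2.2}
$$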
where Γ(·) is the gamma function. The distribution of Zij reduces to the negative binomial
distribution when πij = 0.
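The zero-inflated density in Eq. (2.2) can be checked numerically with a short Python sketch. It uses only the standard library and evaluates the negative binomial term on the log scale for numerical stability; the function name and the parameter values in the usage note are illustrative assumptions, not fitted values.

```python
import math


def zinb_pmf(z, mu, theta, pi):
    """Pr(Z = z) for the zero-inflated negative binomial of Eq. (2.2).

    mu > 0 is the NB mean, theta > 0 the NB shape, and pi in [0, 1]
    the probability of a zero from the zero-inflation component.
    """
    # log of the NB(mu, theta) probability mass at z
    log_nb = (math.lgamma(z + theta) - math.lgamma(theta) - math.lgamma(z + 1)
              + z * math.log(mu) + theta * math.log(theta)
              - (z + theta) * math.log(mu + theta))
    nb = math.exp(log_nb)
    if z == 0:
        # zeros come from either the point mass or the count distribution
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb
```

For example, with the hypothetical values mu = 3, theta = 2, and pi = 0.3, the probabilities sum to one over z = 0, 1, 2, ... and the mean of Z equals (1 − pi) · mu, which is the relationship used later in Eq. (2.5).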
We assume that the parameters µij and πij depend on the three factors of explicit biases, while θij is
treated as a constant parameter θ that does not depend on these factors. Define sk, lk, and ck as the number of
restriction sites, the length and the coverage of the kth contig, respectively. As Lord et al. [96] suggested,
link functions in generalized linear models are used to model the dependence of parameters µij and πij
on the three factors of explicit biases. To be specific, we propose that µij is related to three factors by the
logarithmic link, i.e.,
$$
\log(\mu_{ij}) = \beta_0 + \beta_s \cdot \log(s_i \cdot s_j) + \beta_l \cdot \log(l_i \cdot l_j) + \beta_c \cdot \log(c_i \cdot c_j). \tag{2.3}
$$
We also propose that πij is related to the three factors by the logit link, i.e.,
$$
\log\left( \frac{\pi_{ij}}{1 - \pi_{ij}} \right) = \gamma_0 + \gamma_s \cdot \log(s_i \cdot s_j) + \gamma_l \cdot \log(l_i \cdot l_j) + \gamma_c \cdot \log(c_i \cdot c_j). \tag{2.4}
$$
Let µZij denote the mean of random variable Zij . Then, the corresponding regression equation for
µZij is
$$
\log(\mu_{Z_{ij}}) = \log(1 - \pi_{ij}) + \beta_0 + \beta_s \cdot \log(s_i \cdot s_j) + \beta_l \cdot \log(l_i \cdot l_j) + \beta_c \cdot \log(c_i \cdot c_j). \tag{2.5}
$$
The overall model parameters β = (β0, βs, βl, βc), γ = (γ0, γs, γl, γc) and the additional dispersion parameter θ can be estimated by maximum likelihood (ML) using the latest R package ‘glmmTMB’ [17].
Finally, the residuals of the counting part are the normalized metagenomic Hi-C contacts, i.e.,
\[
e_{ij} = z_{ij} / \hat{\mu}_{ij}. \tag{2.6}
\]
Hence, given sample data of the intra-species contacts, our HiCzin model can integrate all three factors
of explicit biases.
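The link functions in (2.3) to (2.5) and the residual in (2.6) can be sketched as follows. This is only an illustration of the arithmetic: in HiCzin the coefficients are estimated with ‘glmmTMB’, whereas here they are passed in as plain tuples, and the function names are hypothetical. All factors are assumed to be strictly positive so that the logarithms are defined.

```python
import math


def linear_predictor(coef, s_i, s_j, l_i, l_j, c_i, c_j):
    """Intercept plus slopes on log(site), log(length), and log(coverage) products."""
    b0, bs, bl, bc = coef
    return (
        b0
        + bs * math.log(s_i * s_j)
        + bl * math.log(l_i * l_j)
        + bc * math.log(c_i * c_j)
    )


def zinb_parameters(beta, gamma, s_i, s_j, l_i, l_j, c_i, c_j):
    """Return (mu, pi, mean of Z) for one contig pair."""
    mu = math.exp(linear_predictor(beta, s_i, s_j, l_i, l_j, c_i, c_j))  # log link (2.3)
    eta = linear_predictor(gamma, s_i, s_j, l_i, l_j, c_i, c_j)
    pi = 1.0 / (1.0 + math.exp(-eta))  # inverse of the logistic link (2.4)
    mean_z = (1.0 - pi) * mu  # taking logs gives (2.5)
    return mu, pi, mean_z


def normalized_contact(z, mu_hat):
    """Residual of the counting part (2.6): observed count over fitted NB mean."""
    return z / mu_hat
```

With all coefficients set to zero, the sketch returns mu = 1, pi = 0.5, and a mean of 0.5, which is a quick sanity check of the two links.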
2.2.5 Spurious contact detection by a hybrid statistical method based on HiCzin
From the HiCzin model, the intra-species contact $Y_{ij}$ follows the negative binomial distribution with mean
$\hat{\mu}_{ij}$ and shape $\hat{\theta}$. Given any contig pair with nonzero contacts, we denote the value of the observed raw
contacts as $O_{ij}$ and the expected contacts under the condition that the two contigs come from the same species
as $E_{ij}$, where $E_{ij} = E(Y_{ij}) = \hat{\mu}_{ij}$. We define the enrichment score as $S_{ij} = \log(O_{ij}/E_{ij})$.
Under our statistical framework, we also design a hypothesis test to detect spurious contacts. The null
hypothesis of the test is that Oij belongs to the intra-species contacts while the alternative hypothesis
is that Oij belongs to the spurious inter-species contacts. We directly regard Oij as the test statistic, with
Oij ∼ Yij under the null hypothesis. We choose a one-tailed test and calculate the p-value of Oij as
\[
p_{ij} = \Pr(Y_{ij} \le O_{ij}). \tag{2.7}
\]
Then, we develop a hybrid statistical method to detect spurious contacts. We choose a threshold t for
the enrichment score and a significance level α for the hypothesis test. Contacts of contig pairs whose
enrichment score is less than t or p-value is less than α will be regarded as spurious contacts and then
discarded.
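The decision rule described above can be sketched in a few lines. The negative binomial distribution function is obtained by direct summation of the probabilities, which is adequate for illustration; the function names and the threshold values used in the usage note are assumptions, not values from the thesis.

```python
import math


def nb_pmf(y, mu, theta):
    """Negative binomial probability of count y with mean mu and shape theta."""
    log_p = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + y * math.log(mu) + theta * math.log(theta)
        - (y + theta) * math.log(mu + theta)
    )
    return math.exp(log_p)


def lower_tail_p_value(observed, mu_hat, theta_hat):
    """One-tailed p-value Pr(Y <= observed), as in (2.7)."""
    return min(1.0, sum(nb_pmf(y, mu_hat, theta_hat) for y in range(observed + 1)))


def is_spurious(observed, mu_hat, theta_hat, t, alpha):
    """Flag a nonzero contact if its enrichment score or its p-value is too small."""
    score = math.log(observed / mu_hat)  # enrichment score S = log(O / E)
    p_value = lower_tail_p_value(observed, mu_hat, theta_hat)
    return score < t or p_value < alpha
```

For instance, with a fitted mean of 50 and shape 2, an observed count of 60 is kept under the hypothetical settings t = -1 and alpha = 0.05, whereas an observed count of 1 is flagged by the enrichment score.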
We define the valid contacts as the nonzero intra-species Hi-C contacts. To determine the threshold
and the significance level, we assume that the percentiles of the enrichment score and the p-value of the
valid contacts in our sample data are similar to those in the whole data and preselect a percentage (default
10%) reflecting the acceptable fraction of losses of the valid contacts. Taking advantage of our generated
sample data of the intra-species contact, we can determine the threshold t and significance level α such
that less than the preselected percentage of valid contacts in sample data are incorrectly identified as
spurious contacts for both methods, respectively. Based on our assumption, we suppose that around the
same percentage of valid contacts in the whole data might be mistakenly discarded. Therefore, thresholds
can be set strictly enough to detect most spurious contacts while avoiding the incorrect identification of a large
proportion of valid contacts in the whole data.
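One way to realize this threshold selection is to take a low order statistic of the scores of the valid contacts in the sample data, so that strictly fewer than the preselected fraction fall below the cutoff. The sketch below is an assumption about how such a rule could be coded, not the HiCzin implementation; the same function can be applied to enrichment scores (to obtain t) and to p-values (to obtain α).

```python
import math


def select_threshold(values, fraction=0.10):
    """Pick a cutoff so that fewer than `fraction` of `values` lie strictly below it."""
    ordered = sorted(values)
    # Index of the largest order statistic that keeps the discarded share below `fraction`.
    k = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[k]


def discarded_fraction(values, threshold):
    """Share of values that would be flagged as spurious (strictly below the cutoff)."""
    return sum(v < threshold for v in values) / len(values)
```

With 100 distinct sample scores and the default 10%, the cutoff is the tenth smallest score, so nine of the 100 valid contacts would be discarded.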
2.2.6 Generalizing the HiCzin by selecting different independent variables
Let $\{x^k\}_{k=1}^{n}$ denote the set of factors. Then, we modify the regression equation in (2.5) as
\[
\log(\mu_{Z_{ij}}) = \log(1 - \pi_{ij}) + \beta_0 + \sum_{k=1}^{n} \beta_k \cdot \log(x_i^k \cdot x_j^k), \tag{2.8}
\]
where πij in (2.4) is modified as
\[
\log\left( \frac{\pi_{ij}}{1 - \pi_{ij}} \right) = \gamma_0 + \sum_{k=1}^{n} \gamma_k \cdot \log(x_i^k \cdot x_j^k). \tag{2.9}
\]
Then, the ‘glmmTMB’ package [17] is employed to estimate $\{\beta_k\}_{k=0}^{n}$, $\{\gamma_k\}_{k=0}^{n}$, and θ, and the residuals of the
counting part are considered as normalized contacts.
2.2.7 HiCzin normalization without labeled contigs
As samples of the intra-species contacts cannot be obtained in some scenarios, we regard all nonzero
raw contacts as our sample data. Although these samples contain both valid contacts and spurious contacts,
the number of valid contacts is supposed to be much larger than that of spurious contacts, and thus we
suppose that spurious contacts will not result in significant biases in parameter estimation. Moreover, as
we do not have zero contacts to fit the zero-inflated part, one option to solve this problem is to set πij as a
constant parameter, i.e.,
\[
\operatorname{logit}(\pi_{ij}) = \log\left( \frac{\pi_{ij}}{1 - \pi_{ij}} \right) = \gamma. \tag{2.10}
\]
Unknown parameters are estimated by maximum likelihood using the ‘glmmTMB’ package [17], and the residuals are considered as normalized contacts.
2.3 Results
2.3.1 Analyses of experimental biases in synthetic metagenomic yeast samples
In addition to chromosomal contacts of interest, several other factors unrelated to chromosomal contacts
can also influence the number of Hi-C interactions between contigs [162]. We refer to such factors as
biases. We report on two kinds of biases with substantial influences on metagenomic Hi-C contact maps:
explicit biases and implicit biases. Explicit biases include three potential factors: i) the number of enzymatic
restriction sites on contigs, ii) contig length, and iii) contig coverage [10, 18, 121], all of which can be
observed. Implicit biases represent spurious inter-species contacts. Spurious Hi-C contacts may result
from insufficient crosslinking during experimental protocols [62], which leads to inter-species DNA ligations,
and from mapping challenges with closely related species [142].
We analyzed metagenomic yeast samples (M-Y), which consisted of 16 yeast strains representing 13 yeast species
(BioProject: PRJNA245328) [18]. The WGS dataset contains 85.7 million read pairs (101 bp per read) and
the Hi-C dataset contains 81 million read pairs (100 bp per read). After processing the raw WGS and Hi-C
reads (Supplementary materials), we generated raw Hi-C contact maps for 6,196 assembled contigs and
downloaded reference genomes of these 16 yeast strains (Supplementary Table 7.1). To determine the true
species identity of the assembled contigs, contigs were aligned to reference genomes at the species level
by BLASTn [6] with parameters: ‘-perc_identity 95 -evalue 1e-30 -word_size 50’ (Supplementary materials).
Thirty-seven contigs (0.6%) could not be aligned to reference genomes and were not considered in the
following analyses.
Figure 2.2: Relationship between raw nonzero interaction counts and the product of the number of restriction sites, length, and coverage between contig pairs.
According to the alignment results to the reference genomes, we refer to contig pairs from the same
species and different species as intra-species pairs and inter-species pairs, respectively. Interaction counts
of intra-species pairs and inter-species pairs are defined as valid contacts and spurious contacts, respectively. In particular, we record a zero contact when no interaction was observed between an intra-species pair;
hence the intra-species contacts, corresponding to intra-species pairs, are composed of valid contacts and
zero contacts. Valid contacts imply a high probability that a contig pair belongs to the same genome, while
spurious contacts confound the interpretation of the Hi-C data.
Raw interaction counts were enriched between pairs of contigs with a high number of restriction sites,
long contigs, and/or contigs with high coverage (Figure 2.2), which can be explained as follows. Longer contigs may have higher ligation efficiencies with other contigs than shorter contigs; more
restriction sites are likely to increase the probability of enzymatic cuts within DNA fragments; and higher
coverage, representing a higher concentration of contigs, can result in more Hi-C interactions between
contigs. The Pearson correlation coefficients between raw valid contacts and the products of the numbers of restriction sites, the lengths, and the coverages of each pair of contigs were 0.429, 0.400, and 0.184,
respectively, demonstrating that these three factors were indeed positively correlated with valid contacts.
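The correlation analysis above pairs each valid contact with the product of one factor over the two contigs involved. A minimal sketch of that computation, using only the standard library, is given below; the data layout (a list of contig pairs and a dictionary of per-contig factor values) and the function names are assumptions made for illustration.

```python
import math


def factor_products(pairs, factor):
    """Product of a per-contig factor (sites, length, or coverage) for each contig pair."""
    return [factor[i] * factor[j] for i, j in pairs]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Applying `pearson` to the valid contact counts and to `factor_products` for each of the three factors in turn yields three coefficients of the kind reported in the text.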
As for implicit biases, the number of spurious contacts made up 25.5% of all nonzero contacts, which
could not be neglected for the M-Y samples.
2.3.2 Normalization methods in the public metagenomic Hi-C analysis pipelines
Because of the aforementioned experimental biases, it is necessary to normalize the raw Hi-C contacts before downstream analysis, such as clustering and tracking virus-host interactions. Most of
the currently available pipelines divide the raw Hi-C interactions by the product of one factor of explicit
biases to normalize raw Hi-C contacts, which we refer to as naive normalization methods [9, 10, 121].
These naive normalization methods only corrected part of explicit biases, and the unnormalized factors of
explicit biases might still be highly correlated with Hi-C contact maps. As for the two-stage normalization
method in bin3C [29], equal visibility for all regions is a basic theoretical assumption for utilizing the matrix
balancing algorithm they use to recover normalized Hi-C matrices [68], yet this assumption is not satisfied
for metagenomic assembled contigs with huge differences in length and abundance. Moreover, all these
normalization methods ignored the influence of implicit biases and did not attempt to detect and remove
the spurious inter-species contacts. Therefore, it is imperative to develop new normalization methods to
overcome these shortcomings.
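For reference, a naive normalization of the kind described above amounts to a single division per contig pair. The sketch below assumes contacts stored in a dictionary keyed by contig pair and one strictly positive factor value per contig; it is an illustration of the idea, not code from any of the cited pipelines.

```python
def naive_normalize(contacts, factor):
    """Divide each raw contact by the product of one explicit-bias factor of the two contigs."""
    return {
        (i, j): count / (factor[i] * factor[j])
        for (i, j), count in contacts.items()
    }
```

Because only one factor enters the denominator, any dependence of the contacts on the other two factors is left untouched, which is the limitation discussed in this section.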
2.3.3 Removing explicit biases and filtering out spurious contacts using zero-inflated
negative binomial regression
The Poisson and negative binomial regression models are widely used in fitting count data and have been
successfully employed in fitting Hi-C interactions of human cells [63]. Therefore, there is potential to apply
frameworks based on Poisson or negative binomial regression to normalize metagenomic Hi-C data. Here
we model the population of the intra-species contacts using the negative binomial distribution rather than
the Poisson distribution because Hi-C data are always over-dispersed [63]. In the classical negative binomial regression model, we can fit the model given sample data of the intra-species contacts by regarding
factors of biases and intra-species contacts as predictor variables and the response variable, respectively.
Then, the residuals of this conventional model serve as normalized contacts.
However, one remarkable phenomenon for intra-species contacts was the presence of excess zeros,
which means zero contacts account for a large proportion within intra-species contacts. Specifically, the
number of nonzero intra-species contacts from the synthetic yeast dataset only made up 14.9% within
all intra-species contacts. Although classical negative binomial models can capture the property of overdispersion, they are not sufficient for modeling the excess zeros observed in Hi-C contact maps. To solve
this problem, we developed HiCzin, a novel metagenomic Hi-C normalization method based on the zero-inflated negative binomial regression framework [165], combining the counting distribution of the intra-species contacts with a mass point at zero. The residuals of the counting part serve as normalized contacts.
Compared with raw valid contacts, the average value of raw spurious contacts was smaller while the
average number of restriction sites, length, and coverage of contigs were significantly larger (Figure 2.3).
This evidence indicated that spurious inter-species contacts were more likely to be generated for longer
contigs with more restriction sites and higher abundances. Therefore, we expect the magnitude of
the spurious contacts normalized by the factors of explicit biases to be significantly smaller than that of
the normalized valid contacts. Thus, a basic idea is to discard the normalized contacts whose values are
less than a selected threshold as spurious contacts [142]. However, determining the threshold is extremely
challenging. Based on our HiCzin normalization model, we develop a hybrid statistical method to detect
spurious contacts and determine thresholds (see Methods).
2.3.4 Applying the HiCzin model to the M-Y samples
Figure 2.3: Comparison of (a) the raw counts of spurious contacts and valid contacts (i.e., nonzero intra-species Hi-C contacts), (b) the number of restriction sites of spurious contacts and valid contacts, (c) the
length of spurious contacts and valid contacts, (d) the coverage of spurious contacts and valid contacts.
[Each panel compares spurious contacts with valid contacts on a log scale: (a) Contacts, (b) Site, (c) Length, (d) Coverage; p < 2.22e-16 in every panel.]
To fit the HiCzin model, samples of the intra-species contacts were generated using TAXAassign [67],
which assigned 3,441 (55.5%) contigs to the known reference genomes in the NCBI nt database (see Methods). These 3,441 contigs were assigned to 10 species by TAXAassign. We compared the taxonomy assignment results by TAXAassign with the corresponding true species identities obtained by BLASTn. Only
21 labels were different, indicating the high precision of taxonomy assignments at the species level by
TAXAassign. Then, taking advantage of these labeled contigs, we generated the set of intra-species
pairs by pairwise combining contigs from the same species, and the corresponding contacts were obtained
as sample data to fit the HiCzin model. A total of 1,492,856 samples of the intra-species contacts were
generated.
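The construction of intra-species samples from labeled contigs can be sketched as follows. Every pair of contigs sharing a species label becomes one sample, and pairs without an observed interaction receive a zero contact. The input layout (a label dictionary and a contact dictionary keyed by sorted contig pairs) and the function name are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def intra_species_samples(labels, contacts):
    """Map each same-species contig pair to its raw contact count (0 if unobserved)."""
    by_species = defaultdict(list)
    for contig, species in labels.items():
        by_species[species].append(contig)

    samples = {}
    for members in by_species.values():
        # Sorting keeps the pair keys consistent with the contact dictionary.
        for a, b in combinations(sorted(members), 2):
            samples[(a, b)] = contacts.get((a, b), 0)
    return samples
```

A species with m labeled contigs contributes m(m - 1)/2 samples, which is why a few thousand labeled contigs can yield more than a million samples.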
All sample data were then utilized to fit the HiCzin model. We compared our model with naive normalization methods, the two-stage normalization method in bin3C and the classical negative binomial
regression model. To simplify the notation, we denote naive normalization methods by site, length, and
coverage as Naive Site, Naive Length, and Naive Coverage, and denote the two-stage normalization method
in bin3C and the classical negative binomial regression model as bin3C_Norm and Naive NB.
We first calculated the Pearson correlation coefficients between normalized valid contacts and the
product of each of the three factors of explicit biases to gauge the bias effects (Table 2.1). The Naive Site
and Naive Length approaches increased the Pearson correlation coefficients between valid contacts and
the product of the coverage from 0.184 to 0.559 and 0.694; the Naive Coverage approach increased the
correlation coefficient between valid contacts and the product of the site from 0.429 to 0.515 and increased
the correlation coefficient between valid contacts and the product of the length from 0.400 to 0.481. These
results showed that the naive normalization methods only corrected part of explicit biases, and the unnormalized factors of explicit biases showed even higher correlation with Hi-C contact maps. In contrast,
the two-stage normalization method in bin3C decreased all three correlation coefficients to 0.024, 0.025,
0.011, indicating that the matrix balancing algorithm can assist in correcting explicit biases to some extent. These three correlation coefficients were decreased to 0.023, 0.024, 0.154 using Naive NB, and further
decreased to 2 × 10⁻⁴, 0.002, 0.069 using HiCzin. Therefore, HiCzin achieved better performance than all
other normalization methods in removing explicit biases.
Normalization method    site        length    coverage
Raw Contacts            0.429       0.400     0.184
Naive Site              0.004       0.004     0.559
Naive Length            0.004       0.004     0.694
Naive Coverage          0.515       0.481     0.006
bin3C_Norm              0.024       0.025     0.011
Naive NB                0.023       0.024     0.154
HiCzin                  2 × 10⁻⁴    0.002     0.069
Table 2.1: Pearson correlation coefficients (absolute value) between normalized valid contacts and the
product of each of the three factors of explicit biases.

Normalization method       AUDRC
Naive site & bin3C_Norm    0.682
Naive length               0.712
Naive coverage             0.757
Naive NB                   0.792
HiCzin                     0.804
Table 2.2: Area under the discard-retain curve for different normalization methods. Higher AUDRC score
indicates better performance in spurious contact detection. The optimal values of the results are in bold.

The other objective of metagenomic Hi-C normalization is to identify valid contacts from all observed
contacts. Although raw values of spurious contacts were significantly smaller than those of valid contacts,
the distribution of spurious contacts overlapped with the distribution of valid contacts (Figure 2.4a), making
it challenging to separate spurious contacts from valid contacts. After normalization, the distribution of
normalized spurious contacts deviated considerably to the left from the distribution of normalized valid
contacts (Figure 2.4b), facilitating the distinction between spurious contacts and valid contacts.
Figure 2.4: (a) Comparison of the distribution of raw valid contacts and raw spurious contacts. (b) Comparison of the distribution of normalized valid contacts and normalized spurious contacts by HiCzin. (c)
The proportions of discarded valid contacts and discarded spurious contacts. (d) The discard-retain curve
using all sample data.
[Axes: (a) density against raw contacts (log scale); (b) density against contacts normalized by HiCzin (log scale); (c) proportion of the number of contacts against the preselected percentage, for discarded spurious contacts and discarded valid contacts; (d) proportion of retained valid contacts against proportion of discarded spurious contacts, for HiCzin, Naive Coverage, Naive Length, Naive NB, and Naive Site & bin3C_Norm.]
Therefore, we adopted our hybrid statistical approach based on the HiCzin model to detect and then
discard spurious contacts (see Methods). The main procedure of our approach is to select thresholds of
the enrichment score and the p-value, respectively, and any contact whose enrichment score or p-value
is below the corresponding threshold would be identified as a spurious contact. A percentage reflecting the acceptable
fraction of losses of the valid contacts was preselected and thresholds were determined such that less than
the preselected percentage of valid contacts in sample data were incorrectly identified as spurious contacts.
Noticeably, both thresholds increased with the preselected percentage, and larger thresholds could detect
more spurious contacts while incorrectly identifying a higher number of valid contacts. Though there
existed a ‘trade-off’, the proportion of discarded spurious contacts increased much faster than that of
discarded valid contacts (Figure 2.4c), indicating that we could remove a large fraction of spurious contacts
while keeping most of the valid contacts. For instance, if we set the preselected percentage as default 10%,
which means that we could withstand the losses of around 10% of valid contacts, about 60% of spurious
contacts were detected while only 13% of valid contacts were incorrectly removed. These results supported
the feasibility of our spurious contact detection method.
According to our basic idea of spurious contact detection, naive normalization methods and the Naive
NB method could also be used to detect spurious contacts by regarding normalized contacts less than
certain thresholds as spurious contacts. We employed the same technique proposed in our hybrid statistical
spurious contact detection method to determine thresholds for other normalization methods. For the two-stage normalization method in bin3C, as the matrix balancing algorithm in the second step may amplify
the influence of certain spurious contacts, it is better to remove the noise of spurious contacts after the first
step correction. Since the first stage of bin3C_Norm is equivalent to the Naive Site approach, the spurious
contact detection result of bin3C_Norm is the same as that of the Naive Site approach. To evaluate the
capability of normalization methods to detect spurious contacts while retaining the valid contacts, we
design the discard-retain curve (DR curve). In the graph of a DR curve, the x-axis is the proportion of
discarded spurious contacts among all spurious contacts in the whole data, and the y-axis represents the
proportion of retained valid contacts within all valid contacts in the whole data. We denote the area
under the DR curve as AUDRC. A larger AUDRC indicates that the normalization method can retain more
valid contacts while discarding more spurious contacts. Therefore, we plotted the DR curve to evaluate the
performance of different normalization methods (Figure 2.4d). AUDRC was subsequently calculated for
each of the normalization methods (Table 2.2), and our HiCzin model achieved the best result with respect
to AUDRC.
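The DR curve and its area can be sketched for the simplest case of a single cutoff swept over the normalized contacts, where every contact below the cutoff is discarded. The trapezoidal rule for the area and the function names are assumptions of this sketch; the thesis does not state how the area was integrated.

```python
def dr_curve(valid, spurious):
    """Points (discarded spurious share, retained valid share) as the cutoff increases."""
    cutoffs = sorted(set(valid) | set(spurious))
    cutoffs.append(float("inf"))  # final cutoff discards every contact
    points = []
    for t in cutoffs:
        x = sum(v < t for v in spurious) / len(spurious)
        y = sum(v >= t for v in valid) / len(valid)
        points.append((x, y))
    return points


def audrc(points):
    """Area under the DR curve by the trapezoidal rule, in the order the points were generated."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

If every spurious contact is smaller than every valid contact, the area is 1; if the ordering is fully reversed, the area is 0.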
2.3.5 Generalizing the HiCzin model
Our HiCzin model can be generalized to consider different independent variables and do normalization
without labeled contigs (see Methods). Here, we explore three significant scenarios.
HiCzin_LC: In some real situations, the specific enzymes utilized in Hi-C experiments are unknown;
thus only the length and the coverage of contigs can serve as independent variables.
HiCzin_GC: GC-content is sometimes considered as one source of biases in Hi-C experiments [63,
162]. Therefore, we explored the influence of adding GC-content as a new predictor variable to our HiCzin
model, though we did not observe a strong correlation between raw valid contacts and GC-content (Pearson correlation coefficient: 0.032) for the Hi-C contact maps of the synthetic M-Y samples.
Unlabeled HiCzin: In the real application of HiCzin, some extreme difficulties may be encountered.
For example, there may not be enough computational resources to run TAXAassign or an extremely small
number of contigs can be labeled. To solve these problems, a HiCzin normalization mode without labeled
contigs (Unlabeled HiCzin) is designed.
We applied these three generalized HiCzin models on the M-Y samples (Table 2.3). For HiCzin_LC
and HiCzin_GC, the Pearson correlation coefficients between normalized contact counts and the three
factors increased compared to those of HiCzin in Table 2.1, though the AUDRC was slightly higher than
that of HiCzin in Table 2.2. For the unlabeled HiCzin, detecting spurious contacts was difficult as it was
challenging to determine thresholds without specific samples of the intra-species contacts. Although the
normalization results were worse than those of the HiCzin model using labeled contigs, the unlabeled mode
of HiCzin still performed better than naive normalization methods, and it is more widely applicable and requires
fewer computational resources than the HiCzin model using labeled contigs.
Normalization method    site     length    coverage    AUDRC
HiCzin_LC               0.006    0.002     0.097       0.812
HiCzin_GC               0.008    0.003     0.131       0.816
Unlabeled HiCzin        0.114    0.105     0.079       -
Table 2.3: Pearson correlation coefficients (absolute value) between normalized valid contacts and the
product of each of the three factors of explicit biases, and AUDRC for different generalized HiCzin models.
2.3.6 Clustering of contigs by the Louvain algorithm
The Louvain algorithm has been widely employed to cluster contigs based on metagenomic Hi-C data [9,
99]. We applied this algorithm to the Hi-C data normalized by different methods. We set the preselected
percentage of maximum incorrectly identified valid contacts in sample data as 10% for all HiCzin models
and regarded groups above 500 kbp as effective bins to evaluate the clustering performance.
As shown in Table 2.4, the original HiCzin model achieved the best clustering performance with the Louvain algorithm. Although the matrix balancing algorithm in the second stage could improve the clustering
quality, bin3C_Norm grouped far fewer contigs compared to Naive Site. The Naive NB approach also
grouped a relatively small number of contigs. The performance of both correcting biases and clustering
by HiCzin_GC was worse than that of the original HiCzin model; hence it is not necessary to consider the
GC-content in the regression process. One potential explanation for the poor performance of HiCzin_GC
is that the genomes in the community have similar GC-content and the Hi-C contact maps are not dependent on GC-content. The clustering results of the HiCzin_LC and the unlabeled HiCzin were significantly
better than all naive normalization methods and the two-stage normalization method in bin3C, indicating
that our normalization model still performed well when the restriction enzymes of Hi-C experiments or
the labels of any contigs were unknown. These results ensure that the HiCzin models are widely applicable
with excellent normalization effects under different circumstances.
Normalization method    # of contigs    F-score    ARI      NMI
Naive Site              5997            0.763      0.724    0.783
Naive Length            6092            0.791      0.758    0.788
Naive Coverage          6131            0.752      0.706    0.791
bin3C_Norm              5266            0.791      0.761    0.793
Naive NB                4783            0.783      0.748    0.764
Unlabeled HiCzin        6105            0.794      0.761    0.806
HiCzin_LC               6044            0.802      0.771    0.807
HiCzin_GC               6039            0.799      0.767    0.803
HiCzin                  6065            0.807      0.776    0.810
Table 2.4: Comparison of the clustering results of contigs using the Louvain Algorithm. # of contigs represents the number of contigs in groups; F-score, ARI, and NMI are Fowlkes Mallows score, Adjusted Rand
Index, and Normalized Mutual Information, respectively. The optimal values of the results are in bold.
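Of the three clustering scores in Table 2.4, the Fowlkes Mallows score has the simplest pair-counting form: the number of contig pairs placed together in both the true and the predicted grouping, divided by the geometric mean of the numbers of pairs placed together in each grouping separately. The sketch below illustrates that definition with the standard library; it is not the evaluation script used for the table, and the function name is an assumption.

```python
from collections import Counter
from math import comb, sqrt


def fowlkes_mallows(true_labels, pred_labels):
    """Fowlkes Mallows score of a predicted clustering against true labels."""
    # Pairs that share both the true label and the predicted cluster.
    together_in_both = sum(
        comb(n, 2) for n in Counter(zip(true_labels, pred_labels)).values()
    )
    together_true = sum(comb(n, 2) for n in Counter(true_labels).values())
    together_pred = sum(comb(n, 2) for n in Counter(pred_labels).values())
    if together_true == 0 or together_pred == 0:
        return 0.0
    return together_in_both / sqrt(together_true * together_pred)
```

A clustering identical to the true labels scores 1, and the score decreases as contigs of one species are split across bins or contigs of different species are merged.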
2.4 Discussion
We put forward two types of experimental biases for metagenomic Hi-C data. Explicit biases include the
number of restriction sites, contig length, and contig coverage, and implicit biases include spurious inter-species contacts. Both types of biases were clearly observed in the metagenomic yeast samples.
Naive normalization methods could only correct part of explicit biases, and the unnormalized factors of
explicit biases showed even higher correlation with Hi-C contact maps. Based on the basic assumption that
the population of the intra-species contacts follows the negative binomial distribution, we have presented
HiCzin, a parametric model applying zero-inflated negative binomial regression framework to normalize
metagenomic Hi-C data, and have introduced a hybrid statistical method to detect and remove the spurious inter-species contacts. We have shown that normalized metagenomic Hi-C contact maps by HiCzin
lead to lower biases, higher ability to detect spurious contacts, and better metagenomic contig clustering
performance, compared with all naive methods and the two-stage normalization method in bin3C. In case
the specific enzymes utilized in Hi-C experiments are unknown or there are not enough computational
resources to run TAXAassign, we propose the generalized HiCzin, which selects only the length and
the coverage of contigs as predictor variables, and a HiCzin mode without labeled contigs. We have shown
that these two models also performed well in normalization, spurious contact detection and metagenomic
contig clustering. Although we can remove a large fraction of spurious contacts by our hybrid statistical
approach, it is inevitable to lose a small quantity of useful valid contacts. Directly modeling the spurious
contacts may separate the spurious contacts from valid contacts even better. Given that spurious Hi-C
contacts can result from the incorrect assignment of Hi-C reads to contigs of closely related species, the
effectiveness and precision of various short-read alignment methods may play an important role in the
occurrence of these erroneous contacts. Thus, it is essential to evaluate and compare different alignment
techniques to understand their impact on metaHi-C data analysis. As the Hi-C technique will be increasingly utilized in metagenomics in the near future, we expect that the normalization model
we propose here can facilitate the downstream analysis and improve results in retrieving metagenome-assembled genomes, identifying virus-host interactions, tracking horizontal gene transfer, and all other
areas making use of metagenomic Hi-C data.
Chapter 3
MetaCC allows scalable and integrative analyses of both long-read and
short-read metagenomic Hi-C data
In this chapter, we report MetaCC, an efficient and integrative framework for analyzing both short-read
and long-read metaHi-C datasets. MetaCC outperforms existing methods on normalization and binning.
In particular, the MetaCC normalization module, named NormCC, is more than 3000 times faster than
the HiCzin normalization method presented in Chapter 2 on a complex wastewater dataset. When applied to one sheep gut long-read metaHi-C dataset, the MetaCC binning module demonstrated its ability
to retrieve high-quality genomes, including an expansion of five uncultured members from the order
Erysipelotrichales. Further plasmid analyses reveal that MetaCC binning is able to capture multi-copy
plasmids.
3.1 Introduction
Metagenomics aims to study the complex community structures and reveal metabolic potentials in microbial ecosystems without the isolation or cultivation of microbes in the environment [58, 64, 140, 146]. The
recent introduction of high-throughput chromosome conformation capture technique (Hi-C) into metagenomics provides new insights into species diversity and interactions between microorganisms within a
single microbial sample [24, 79, 101, 142, 161].
Metagenomic Hi-C technique (metaHi-C) combines the rapidly-developed proximity ligation approach
with the metagenomic shotgun sequencing. Specifically, shotgun experiments directly extract genomic
fragments from a single microbial sample. In parallel, Hi-C experiments on the same microbial sample
create DNA-DNA proximity ligations between loci within the same physical cell, generating millions of
paired-end Hi-C short reads. Fragmented shotgun reads are assembled into contiguous sequences, termed
contigs, to which paired-end Hi-C reads are subsequently aligned. Therefore, metagenomic Hi-C contacts,
defined as the numbers of Hi-C read pairs linking any pair of assembled contigs, reflect the contig-to-contig
relationships with respect to their proximity. Since raw metagenomic Hi-C contacts are substantially affected by systematic biases, normalization is necessary after processing raw metaHi-C data [10, 18, 121].
We disclosed three systematic biases including the number of enzymatic restriction sites on contigs, contig length, and contig coverage and put forward a state-of-the-art normalization method HiCzin [33] that
can correct all three biases in Chapter 2. After the normalization, fragmented contigs can be grouped into
metagenome-assembled genomes (MAGs) [65] using Hi-C contacts. This process, termed Hi-C-based binning, enables the construction of large compendia of metagenomic assembled microbial genomes. Several
Hi-C-based binning methods have been designed, such as MetaTOR [9], bin3C [29], and HiCBin [34].
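The definition of metagenomic Hi-C contacts given above, the number of Hi-C read pairs linking two contigs, can be sketched as a simple count. The input is assumed to be an iterable of (contig of read 1, contig of read 2) tuples already extracted from the alignments; parsing the alignment files is outside the scope of this illustration, and the function name is hypothetical.

```python
from collections import Counter


def count_contacts(read_pair_contigs):
    """Count Hi-C read pairs linking each unordered pair of distinct contigs."""
    contacts = Counter()
    for contig_a, contig_b in read_pair_contigs:
        if contig_a == contig_b:
            continue  # both mates on the same contig: not a contig-to-contig contact
        contacts[tuple(sorted((contig_a, contig_b)))] += 1
    return contacts
```

Sorting the two contig names makes the count symmetric, so a read pair mapped as (c2, c1) adds to the same entry as one mapped as (c1, c2).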
Despite recent advances of these computational tools designed for metagenomic Hi-C data, there still
exists much room for improvement for more scalable and stable analyses. For instance, the Knight-Ruiz
algorithm [84] utilized by bin3C [29] may fail to generate a bistochastic matrix when the raw Hi-C contact matrix is highly sparse [123, 157]. MetaTOR [9] employs the classical Newman-Girvan modularity
function [48] in its binning procedure, which cannot identify small genomes due to the resolution limit
[43] in complex Hi-C contact networks [34]. HiCzin [33] and HiCBin [34] require a large amount of
computing resources for estimating contig abundances and generating contig annotations, which refers
to assigning nucleotide sequences to various taxonomic levels. Specifically, HiCzin and HiCBin utilize
TAXAassign [67] to label contigs at the species level by running BLAST [73] against a curated nucleotide
reference database. Moreover, since shotgun libraries in the original metagenomic Hi-C experiments are
constructed using next-generation sequencing (short-read metaHi-C) [10, 18, 121, 142], all existing computational methods are designed and merely benchmarked on short-read metaHi-C datasets [9, 29, 33, 34].
With the rapid development of third-generation sequencing, multiple recent metaHi-C experiments also
leveraged Nanopore or PacBio sequencing to generate long-read shotgun libraries (long-read metaHi-C)
[14, 15, 28, 51]. However, the current computational tools that have achieved state-of-the-art results on
short-read metaHi-C datasets encounter difficulties in adapting to long-read metaHi-C datasets. In our
experiments for this study, we observe that the performance of HiCBin, which demonstrated superior binning performance on short-read metaHi-C datasets according to recent benchmarking studies [70],
deteriorates markedly on long-read metaHi-C datasets. One essential factor contributing to this decline is the large degradation of HiCBin as well as its adopted normalization method HiCzin when only a
small fraction of assembled contigs can be successfully labeled at the species level by TAXAassign (Supplementary materials). Additionally, the taxonomic labeling of contigs assembled from long reads poses
a challenge for TAXAassign (Supplementary Table 7.2), consequently limiting the effectiveness of HiCBin
and HiCzin on long-read metaHi-C datasets. Therefore, it is imperative to develop new computational
methods to fill these gaps.
Here we report MetaCC, a scalable and integrative framework for both long-read and short-read
metaHi-C datasets. In the MetaCC framework, raw metagenomic Hi-C contacts are first efficiently and
effectively normalized by a new normalization method, NormCC. In comparison to HiCzin, which relies
on estimated contig abundances as input, NormCC employs a negative binomial regression model to represent contig abundances based on easily obtainable features including the number of restriction sites on
contigs, contig length, and the number of proximity ligation events within contigs. Consequently, NormCC
does not require the estimation of contig abundances. Additionally, HiCzin models the Hi-C contacts between contigs of the same species, necessitating contig annotations. Conversely, NormCC models the total
number of proximity ligation events for each contig using a second negative binomial regression, eliminating the need for contig annotation. Using a synthetic yeast dataset [18], we validate the normalization
performance of NormCC and show that NormCC outperforms HiCzin with respect to the removal of spurious contacts (i.e., Hi-C contacts linking contigs from different genomes due to experimental noise), contig
clustering, and computational time. Leveraging NormCC-normalized Hi-C contacts, the binning module
in MetaCC enables the retrieval of high-quality MAGs. We compare the retrieval performance of MetaCC
binning against all publicly-available Hi-C-based binning tools MetaTOR, bin3C, and HiCBin as well as one
state-of-the-art shotgun-based binner VAMB [110] on two real short-read metaHi-C datasets and two real
long-read metaHi-C datasets. Downstream annotation studies and plasmid analyses on long-read metaHi-C datasets demonstrate the superior ability of MetaCC in characterizing the species diversity, extracting
important microbes from the microbial ecosystems, and capturing multi-copy plasmid contigs.
3.2 Methods
3.2.1 Datasets
In this study, we leveraged several publicly available metagenomic Hi-C datasets, consisting of two short-read metaHi-C datasets and two long-read metaHi-C datasets. The sizes of the raw datasets are shown
in Supplementary Table 7.3.
Two short-read metaHi-C datasets were generated from different microbial ecosystems, including human gut (BioProject: PRJNA413092) [121] and wastewater (BioProject: PRJNA506462) [142]. Each short-read metaHi-C dataset was composed of shotgun libraries and Hi-C libraries derived from the same sample
source. The restriction endonucleases Sau3AI and MluCI were utilized to construct all Hi-C sequencing
libraries. All shotgun libraries and Hi-C libraries were sequenced by Illumina platforms at 150 bp.
Two long-read metaHi-C datasets were derived from cow rumen samples (BioProject: PRJNA507739)
[15] and sheep gut samples (BioProject: PRJNA595610) [14], respectively. The cow rumen long-read
metaHi-C dataset consisted of PacBio uncorrected long read libraries and Hi-C libraries. The error-prone
PacBio long reads were generated using the PacBio RSII and PacBio Sequel while Hi-C libraries were created by the restriction enzymes Sau3AI and MluCI and subsequently sequenced on an Illumina HiSeq 2000
at 80 bp. The sheep gut long-read metaHi-C dataset contained PacBio circular consensus sequencing (CCS)
long read libraries and Hi-C sequencing libraries. PacBio CCS long reads were highly accurate (average
Q scores above 20) and hereafter referred to as the HiFi reads. Separate Hi-C libraries from the sheep
gut long-read metaHi-C dataset were generated by the restriction endonucleases Sau3AI and MluCI and
sequenced at 150 bp for analysis.
3.2.2 Initial processing
In the metagenomic Hi-C experiment, the read cleaning procedure is necessary before the alignment of
Hi-C read pairs, since adaptor sequences, low-quality reads, and PCR duplicates can cause significant
problems in downstream analyses. Therefore, we applied a standard cleaning procedure to all Hi-C read
libraries using bbduk from the BBTools suite (v37.25) [19] (Supplementary materials).
For the two short-read metaHi-C datasets, shotgun reads were assembled into contigs by MEGAHIT
(v1.2.9) [90] with parameters ‘--k-min 21 --k-max 141 --k-step 12 --merge-level 20,0.95 --min-contig-len 1000’.
The assembled contigs of both PacBio uncorrected long reads and HiFi long reads from the two long-read metaHi-C datasets were provided by the original authors and thus were directly downloaded for
analyses. Bickhart et al. [15] assembled PacBio raw reads from the cow rumen long-read metaHi-C dataset
by Canu v1.6+101 changes (r8513) [87], and subsequently polished the assembly twice with Illumina data
using Pilon [155]. The final assembly was deposited at https://figshare.com/articles/usda_pacbio_second_pilon_indelsonly_fa_gz/8323154.
An updated version of the assembly of PacBio HiFi long reads from the
long-read sheep gut metaHi-C dataset was provided by authors of the original paper utilizing metaFlye
(v2.9) [85] with default parameters and was deposited at https://doi.org/10.5281/zenodo.5228989 under
the file ‘flye.v29.sheep_gut.hifi.250g.fasta.gz’. The assembly statistics of contigs from all datasets were
shown in Supplementary Table 7.4.
Finally, we aligned processed paired-end Hi-C reads to assembled contigs by BWA-MEM (v0.7.17) [91].
We switched off the read pairing mode and, with the BWA-MEM parameter ‘-5SP’, regarded the alignment with the lowest read coordinate as the primary alignment. After the alignment, we successively
removed unmapped reads, secondary alignments, supplementary alignments and alignments with low
quality (nucleotide match length <30 or mapping score <30). Raw contig-to-contig contacts were aggregated by counting the number of Hi-C read pairs whose two reads aligned to two different contigs as across-contig Hi-C
contacts, which reflected the degree of proximity between contigs. We also defined the number of Hi-C
read pairs mapped to the same contig as within-contig Hi-C contacts. Since shorter contigs with fewer
Hi-C signals and occurrences of restriction sites tended to have much higher variance, weakening the
stability in the downstream analyses [29, 34], restrictions on minimum contig length (default, 1000 bp),
minimum number of restriction sites (default, one), and minimum Hi-C signal (default, two across-contig
Hi-C contacts and one within-contig Hi-C contact) were imposed to filter problematic contigs. The raw Hi-C
contact matrix was then generated from the alignment of Hi-C paired-end reads, where the diagonal and
non-diagonal entries represented within-contig and across-contig Hi-C contacts, respectively. Notably,
because metaHi-C experiments were designed to explore contig-to-contig relationships, across-contig Hi-C contacts were much more important than within-contig Hi-C contacts; unless otherwise specified,
Hi-C contacts always refer to across-contig Hi-C contacts in this paper.
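The aggregation and filtering steps described above can be sketched as follows. This is a minimal illustration, not the MetaCC implementation: the input format (one pair of contig names per filtered Hi-C read pair) and the function names `count_contacts` and `keep_contigs` are assumptions made for the example, while the default thresholds are the ones quoted in the text.

```python
from collections import Counter


def count_contacts(read_pairs):
    """Aggregate Hi-C read pairs into within- and across-contig contacts.

    read_pairs: iterable of (contig_a, contig_b) names, one tuple per
    filtered Hi-C read pair (hypothetical input format).
    """
    within = Counter()  # diagonal entries of the raw contact matrix
    across = Counter()  # off-diagonal entries, keyed by a sorted contig pair
    for a, b in read_pairs:
        if a == b:
            within[a] += 1
        else:
            across[tuple(sorted((a, b)))] += 1
    return within, across


def keep_contigs(lengths, sites, within, across,
                 min_len=1000, min_sites=1, min_across=2, min_within=1):
    """Return the contigs that pass the default filters quoted in the text."""
    signal = Counter()  # total across-contig contacts per contig
    for (a, b), n in across.items():
        signal[a] += n
        signal[b] += n
    return {
        c for c in lengths
        if lengths[c] >= min_len
        and sites.get(c, 0) >= min_sites
        and signal[c] >= min_across
        and within[c] >= min_within
    }
```

Keying across-contig contacts by a sorted pair keeps the matrix symmetric without storing each entry twice.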
3.2.3 NormCC normalization module in MetaCC
NormCC is a scalable and effective normalization module that eliminates the biases of the number of restriction sites, contig length, and coverage on the raw metagenomic Hi-C contacts. Let $H$ denote the raw Hi-C
contact matrix. We define the Hi-C signal $M_i$ of contig $i$ as the total number of proximity ligation events
between contig $i$ and other contigs, i.e.,
$$M_i = \sum_{k \neq i} H_{ik}. \tag{3.1}$$
We model $M_i$ using the negative binomial (NB) distribution, i.e.,
$$M_i \sim \mathrm{NB}(\mu_i, \theta), \tag{3.2}$$
where $\theta$ is the negative binomial dispersion parameter and the mean $\mu_i$ depends on the three factors of
systematic biases for raw metagenomic Hi-C contacts, i.e., the number of restriction sites on contigs, contig
length and coverage [33]. Logarithmic link functions in negative binomial regression models [59] are used
to model the dependence of parameter $\mu_i$ on the three factors of biases, i.e.,
$$\log(\mu_i) = \beta_0 + \beta_s \cdot \log(s_i) + \beta_l \cdot \log(l_i) + \beta_c \cdot \log(c_i), \tag{3.3}$$
where $s_i$, $l_i$, and $c_i$ represent the number of restriction sites, the length, and the coverage of the contig $i$,
respectively.
To solve the regression equation 3.3, we need to obtain the specific values of independent variables,
i.e., the three factors of explicit biases for all contigs. Though the number of restriction sites and contig
length can be directly obtained, the true contig abundances are always unknown in real datasets. One
solution is to estimate the contig coverages by aligning short reads or long reads used in assembly back
to contigs. However, the alignment procedure usually consumes a huge amount of computing time and
memory resources, especially for long reads [75]. To tackle this problem, we design a statistical model
to represent the unknown coverage using known elements. Specifically, let $N_i$ denote the number of
proximity ligation events within the contig $i$, i.e.,
$$N_i = H_{ii}. \tag{3.4}$$
We assume that $N_i$ also follows the negative binomial distribution, i.e.,
$$N_i \sim \mathrm{NB}(\nu_i, \sigma), \tag{3.5}$$
where $\sigma$ is the negative binomial dispersion parameter and the mean $\nu_i$ is linked to three factors of biases
using logarithmic link functions, i.e.,
$$\log(\nu_i) = \gamma_0 + \gamma_s \cdot \log(s_i) + \gamma_l \cdot \log(l_i) + \gamma_c \cdot \log(c_i). \tag{3.6}$$
Based on formulas 3.5 and 3.6, we develop the first negative binomial regression model, denoted by
NBR1, where we consider the factors of systematic biases and the within-contig Hi-C contacts $N_i$ as the
predictor variables and the response variable, respectively. The residual of NBR1 for contig $i$ can be written
as
$$N_i / \nu_i. \tag{3.7}$$
We further assume that no factors other than the number of restriction sites, the length, and the coverage have a major impact on the number of proximity ligation events between fragments within the same
contig (i.e., within-contig Hi-C contacts). By taking residuals, the effects of all factors with substantial
impacts on the within-contig Hi-C contacts are eliminated. As a result, the residuals described in 3.7 are
primarily composed of non-essential factors, which are assumed to be the same for all contigs, i.e.,
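$$\frac{N_i}{\exp\{\gamma_0 + \gamma_s \cdot \log(s_i) + \gamma_l \cdot \log(l_i) + \gamma_c \cdot \log(c_i)\}} = \frac{N_i}{e^{\gamma_0} s_i^{\gamma_s} l_i^{\gamma_l} c_i^{\gamma_c}} \doteq C, \tag{3.8}$$
where $C$ is a constant. Notably, in addition to factors such as the number of restriction sites, the length,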
and the coverage of contigs, the extent of spatial proximity across different contigs also plays a major
role in determining the number of proximity ligation events between them. Therefore, the assumption
mentioned earlier regarding the within-contig Hi-C contacts is not applicable to the across-contig Hi-C
contacts.
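From formula 3.8, solving for $c_i$ gives an approximate expression of the contig coverage as
$$c_i \doteq \left(\bar{C} \cdot \frac{N_i}{s_i^{\gamma_s} l_i^{\gamma_l}}\right)^{1/\gamma_c}, \tag{3.9}$$
where $\bar{C} = C^{-1} \cdot e^{-\gamma_0}$.
Therefore, the unknown independent variable $c_i$ can be approximately represented using three observable variables $N_i$, $s_i$, and $l_i$. Though the parameters $\bar{C}$, $\gamma_s$, $\gamma_l$, and $\gamma_c$ are unknown, we will then show
that we do not need to estimate these parameters in our NormCC model.
Let us plug the approximate expression of contig coverage $c_i$ in formula 3.9 into equation 3.3, i.e.,
$$\begin{aligned}
\log(\mu_i) &= \beta_0 + \beta_s \cdot \log(s_i) + \beta_l \cdot \log(l_i) + \frac{\beta_c}{\gamma_c} \cdot \log\left(\bar{C} \cdot \frac{N_i}{s_i^{\gamma_s} l_i^{\gamma_l}}\right) \\
&= \left(\beta_0 + \frac{\beta_c}{\gamma_c} \cdot \log(\bar{C})\right) + \left(\beta_s - \frac{\beta_c \gamma_s}{\gamma_c}\right) \cdot \log(s_i) + \left(\beta_l - \frac{\beta_c \gamma_l}{\gamma_c}\right) \cdot \log(l_i) + \frac{\beta_c}{\gamma_c} \cdot \log(N_i) \\
&= \tilde{\beta}_0 + \tilde{\beta}_s \cdot \log(s_i) + \tilde{\beta}_l \cdot \log(l_i) + \tilde{\beta}_N \cdot \log(N_i),
\end{aligned} \tag{3.10}$$
where
$$\tilde{\beta}_0 = \beta_0 + \frac{\beta_c}{\gamma_c} \cdot \log(\bar{C}), \quad
\tilde{\beta}_s = \beta_s - \frac{\beta_c \gamma_s}{\gamma_c}, \quad
\tilde{\beta}_l = \beta_l - \frac{\beta_c \gamma_l}{\gamma_c}, \quad
\tilde{\beta}_N = \frac{\beta_c}{\gamma_c}. \tag{3.11}$$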
Based on formulas 3.2 and 3.10, we develop the second negative binomial regression model NBR2. In
NBR2, the Hi-C signal $M_i$ serves as the response variable, while $s_i$, $l_i$, and $N_i$ are considered as predictor
variables that contribute to the mean of the distribution $\mu_i$ for a given contig $i$. Since all variables in NBR2
are observable, we can directly estimate $\tilde{\beta}_0$, $\tilde{\beta}_s$, $\tilde{\beta}_l$, and $\tilde{\beta}_N$ by maximum likelihood. Let $\hat{\beta}_0$, $\hat{\beta}_s$, $\hat{\beta}_l$,
and $\hat{\beta}_N$ denote the corresponding maximum likelihood estimates. Once the parameters of the model are
determined, the estimated mean $\hat{\mu}_i$ can be obtained as
$$\hat{\mu}_i = e^{\hat{\beta}_0} s_i^{\hat{\beta}_s} l_i^{\hat{\beta}_l} N_i^{\hat{\beta}_N}. \tag{3.12}$$
Notably, from formula 3.2, $\hat{\mu}_i$ represents the estimated mean of the number of proximity ligation events
between contig $i$ and other contigs, while this estimate takes into account only the number of restriction
sites, the length, and the coverage of contigs. In other words, $\hat{\mu}_i$ reflects the capability of contig $i$ to
produce proximity ligations with other contigs, considering the influence of three bias factors. To address
the variations in the ability of contigs to generate Hi-C interactions due to these bias factors, we normalize
the raw Hi-C contacts between contig $i$ and contig $j$ (where $i \neq j$) by dividing them by the square root of
$\hat{\mu}_i \cdot \hat{\mu}_j$, i.e.,
$$\frac{H_{ij}}{\sqrt{\hat{\mu}_i \cdot \hat{\mu}_j}} \cdot \hat{C}, \tag{3.13}$$
where $\hat{C} = \max_k \hat{\mu}_k$ is a rescaling constant. In formula 3.13, the rescaling constant $\hat{C}$ is used to adjust
and scale the values of normalized Hi-C contacts in case they are too small. The square root of $\hat{\mu}_i \cdot \hat{\mu}_j$
can be regarded as a scaled geometric mean of the expected number of proximity ligation events across
contigs, predicted only based on three bias factors. Our intuition is that the deviation between the actual
across-contig Hi-C contacts and the expected number of proximity ligation events considering only the
three bias factors can primarily be attributed to the spatial proximity and thus can reflect the unbiased
proximity across contigs.
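As an illustration of formulas 3.12 and 3.13, the sketch below computes the estimated means from already-fitted NBR2 coefficients and then rescales the raw contacts. It assumes the coefficients have been estimated elsewhere (the fitting step itself is not shown), and the function names and dictionary-based data layout are hypothetical choices for this example only.

```python
import math


def predicted_mean(s, l, n, beta0, beta_s, beta_l, beta_n):
    """Formula 3.12: mu_hat = exp(beta0) * s^beta_s * l^beta_l * N^beta_N."""
    return math.exp(beta0) * s ** beta_s * l ** beta_l * n ** beta_n


def normalize_contacts(across, mu_hat):
    """Formula 3.13: H_ij / sqrt(mu_hat_i * mu_hat_j) * max_k mu_hat_k.

    across: dict mapping (contig_i, contig_j) to the raw contact count H_ij.
    mu_hat: dict mapping contig name to its estimated mean.
    """
    c_hat = max(mu_hat.values())  # rescaling constant
    return {
        (i, j): h / math.sqrt(mu_hat[i] * mu_hat[j]) * c_hat
        for (i, j), h in across.items()
    }
```

For example, with estimated means of 4 and 9 for two contigs that share 6 raw contacts, the normalized value is 6 / sqrt(36) * 9 = 9.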
3.2.4 Discarding spurious inter-species contacts based on NormCC-normalized Hi-C
contacts
Spurious inter-species Hi-C contacts refer to proximity ligation events between contigs
from different genomes that arise from experimental noise; they confound the interpretability of the Hi-C data
[142]. Based on the expectation that proximity ligations between genomic segments in the same species
occur orders of magnitude more frequently than interactions between different species [33], we discard
the lowest p percent (default, five) of NormCC-normalized Hi-C contacts as spurious.
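The percentile cut-off can be sketched as below. This is only one way to realize the rule, under the assumption that ties at the threshold are discarded together (so slightly more than p percent may be removed when values repeat); the function name and the dictionary input are hypothetical.

```python
def discard_spurious(normalized, percent=5.0):
    """Drop the lowest `percent` of normalized contacts as spurious.

    normalized: dict mapping a contig pair to its normalized contact value.
    Contacts tied with the threshold value are dropped as well (assumption).
    """
    values = sorted(normalized.values())
    n_drop = int(len(values) * percent / 100.0)  # number of contacts to drop
    if n_drop == 0:
        return dict(normalized)
    threshold = values[n_drop - 1]  # largest value among the dropped contacts
    return {pair: v for pair, v in normalized.items() if v > threshold}
```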
3.2.5 Genome binning in MetaCC
After correcting systematic biases by NormCC and removing spurious Hi-C contacts, the processed Hi-C
contact matrix is successively transformed to a weighted graph G without self-loops where vertices represent all contigs and edge weights are values of NormCC-normalized Hi-C contacts between contigs. Then,
we applied the Leiden graph clustering algorithm [151] on the Hi-C contact graph G to cluster contigs
into draft genomic bins. The Leiden algorithm is a modularity-based community detection algorithm and
takes greedy strategies to optimize the modularity function. Instead of the classical Newman-Girvan modularity [48], which suffers from a resolution limit and may fail to identify small bins [43], we leverage a flexible
modularity function based on Reichardt and Bornholdt’s Potts model [124]:
$$\sum_{\{i,j \mid \Delta_{ij}=1\}} \left( e_{ij} - \frac{d_i d_j}{2n} \cdot r \right), \tag{3.14}$$
where $e_{ij}$ is the edge weight (i.e., NormCC-normalized Hi-C contacts) between contigs $i$ and $j$; $d_i$ and $d_j$
denote the degree of contig $i$ and contig $j$ in the graph $G$, respectively; $n$ is the total number of edges in
the graph; $r$ represents a resolution parameter; $\Delta_{ij}$ is an indicator function equal to one if contigs $i$
and $j$ belong to the same community. Notably, the resolution parameter $r$ can be regarded as the relative
importance between the configuration null part and links within the communities and controls the number
of communities: a larger $r$ tends to generate more communities [151]. Therefore, the choice of this
hyper-parameter affects the results of contig clustering.
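To make the role of the resolution parameter concrete, the following sketch evaluates the quality function of formula 3.14 for a given partition. It rests on assumptions that the text does not spell out: the sum runs over all ordered pairs in the same community (including i = j), the degree is the weighted degree, and n is the total edge weight, which is a common convention for weighted graphs. The function name is hypothetical, and the sketch only scores a partition; it does not perform Leiden optimization.

```python
def rb_quality(edges, membership, resolution=1.0):
    """Evaluate formula 3.14 for a partition of an undirected weighted graph.

    edges: dict mapping (i, j) to a positive weight, one entry per edge.
    membership: dict mapping each vertex to its community label.
    """
    degree = {}
    total_weight = 0.0
    internal = {}  # summed weight of edges inside each community
    for (i, j), w in edges.items():
        degree[i] = degree.get(i, 0.0) + w
        degree[j] = degree.get(j, 0.0) + w
        total_weight += w
        if membership[i] == membership[j]:
            internal[membership[i]] = internal.get(membership[i], 0.0) + w

    community_degree = {}  # summed degree of the vertices in each community
    for v, d in degree.items():
        c = membership[v]
        community_degree[c] = community_degree.get(c, 0.0) + d

    quality = 0.0
    for c, d_sum in community_degree.items():
        # Each internal edge appears twice among ordered pairs (i, j).
        observed = 2.0 * internal.get(c, 0.0)
        expected = resolution * d_sum * d_sum / (2.0 * total_weight)
        quality += observed - expected
    return quality
```

Raising `resolution` increases the penalty on large communities, which is why larger values tend to split the graph into more bins.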
Similar to [159], we detect single-copy marker genes in assembled contigs using FragGeneScan [127]
and HMMER (v3.3.2) [41] to estimate the number of genomes in the metagenomic data, denoted by k
(Supplementary materials). We also set the minimal bin size to the default value of 150 kbp, slightly smaller
than the minimum length of known bacterial genomes [106], and consider only contig bins above this size
as resolved MAGs. Then, our objective is to select a suitable value of r for which the number of resolved
MAGs aligns with the estimated number of genomes in the sample. To achieve this, we sequentially try
a list of increasing values for r. For each candidate value of the resolution parameter r, we record the
number of resolved MAGs, denoted as kr. Considering the potential underestimation of the number of
genomes, which can occur due to factors such as the possibility of marker genes failing to be detected in
certain species, the resolution parameter is determined as the first value for which the number of resolved
MAGs surpasses the estimated number of genomes, i.e.,
$$\min r \quad \text{s.t.} \quad k_r > k; \; r \in \{1, 20, 40, 60, 80, \cdots\}. \tag{3.15}$$
After selecting the resolution parameter, we can cluster the assembled contigs into MAGs, making up the
initial bin set of MetaCC binning.
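The search in formula 3.15 can be sketched as a simple loop. Here `cluster_fn` is a hypothetical callback standing in for a Leiden clustering run at a given resolution, and `max_resolution` is an assumed cap added only so that the loop terminates; neither name comes from the original method description.

```python
def candidate_resolutions(max_resolution):
    """Yield 1, 20, 40, 60, ... up to max_resolution."""
    yield 1
    r = 20
    while r <= max_resolution:
        yield r
        r += 20


def select_resolution(cluster_fn, lengths, k_estimated,
                      min_bin_size=150_000, max_resolution=1000):
    """Return (r, k_r) for the first r whose resolved MAG count exceeds k.

    cluster_fn(r) must return a list of bins, each a list of contig names.
    lengths maps contig names to lengths in bp.
    If no candidate succeeds, the last (r, k_r) tried is returned.
    """
    last = None
    for r in candidate_resolutions(max_resolution):
        bins = cluster_fn(r)
        # Only bins above the minimal bin size count as resolved MAGs.
        k_r = sum(1 for b in bins
                  if sum(lengths[c] for c in b) > min_bin_size)
        last = (r, k_r)
        if k_r > k_estimated:
            return last
    return last
```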
3.2.6 Evaluating the quality of recovered MAGs
We applied CheckM (v1.1.3, module: lineage_wf) [119] to evaluate retrieved MAGs. Following the CheckM
criteria for completeness and contamination [121], we referred to the resolved MAGs with CheckM completeness larger than 50% and contamination smaller than 10% as high-quality MAGs. We further assigned
high-quality draft genomes to three ranks according to the CheckM completeness, i.e., near-complete
(completeness ≥ 90% and contamination ≤ 10%), substantially complete (70% ≤ completeness < 90%
and contamination ≤ 10%), and moderately complete (50% ≤ completeness < 70% and contamination ≤
10%).
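The three ranks can be expressed as a small helper. This is a sketch that applies the rank thresholds exactly as listed above; the function name is hypothetical, and completeness and contamination are assumed to be percentages as reported by CheckM.

```python
def mag_rank(completeness, contamination):
    """Return the quality rank of a bin, or None if it is not high-quality."""
    if contamination > 10:
        return None
    if completeness >= 90:
        return "near-complete"
    if completeness >= 70:
        return "substantially complete"
    if completeness >= 50:
        return "moderately complete"
    return None
```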
3.2.7 Cleaning partially contaminated bins in MetaCC
Apart from high-quality MAGs, there also existed partially contaminated bins with completeness higher
than 50% and contamination higher than 10% in the initial bin set of MetaCC binning. Similar to other
binners, such as MetaTOR [9] and HiCBin [34], we selected and cleaned all partially contaminated
bins by partitioning contigs within each contaminated bin using the Leiden algorithm. The resolution
parameter was kept at 1 in the re-clustering procedures since the number of groups within each partially
contaminated bin was expected to be small. As a result, groups of relatively smaller bins, denoted by sub-bins, could be generated, and those sub-bins with bin size larger than the minimal requirement (default, 150
kbp) were retained and merged back into the initial bin set to obtain the final bin set of MetaCC binning.
3.2.8 Assessing the performance of normalization and spurious contact removal on a
synthetic yeast metaHi-C dataset
We assessed the normalization performance of NormCC and the following spurious contact removal on
an additional synthetic yeast sample (BioProject: PRJNA245328) [18], consisting of 13 yeast species. The
synthetic yeast metaHi-C dataset contained shotgun libraries and Hi-C libraries created using restriction
enzymes NcoI and HindIII. The raw shotgun and Hi-C libraries contain 85.7 million read pairs at 101
bp and 81 million read pairs at 100 bp, respectively. The read cleaning, contig assembly, and Hi-C read
alignment procedures were consistent with those applied to the real short-read metaHi-C datasets. The
contig assembly statistics were shown in Supplementary Table 7.4. Since all species within the synthetic
yeast sample were known, the species identity of the assembled contigs could be identified (Supplementary
materials). Thereafter, the ground truth of intra-species Hi-C contacts (i.e., Hi-C contacts linking contigs
from the same species) and spurious inter-species Hi-C contacts (i.e., Hi-C contacts linking contigs from different
species) could be generated for benchmarking analyses.
3.2.9 Estimating the coverages of assembled contigs
Short/long reads used in the assembly were mapped back to assembled contigs to estimate the contigs’
abundances. We employed BBMap from the BBTools suite (v37.25) [19] and minimap2 (v2.24) [92] to align
short reads and long reads back to contigs, respectively. SAMtools [93] was used to convert the alignment files into BAM files, serving as the input for the script ‘jgi_summarize_bam_contig_depths’ provided
by [77] to calculate the contigs’ coverages.
3.2.10 MAG analyses on the human gut short-read metaHi-C dataset
Since many bacteria in the human gut have been identified in previous studies [42, 168, 4], we evaluated
the bins retrieved from the human gut short-read metaHi-C dataset using the repository of known human
gut bacteria. Specifically, we downloaded the Unified Human Gastrointestinal Genome (UHGG) database
(v1.0) [4], which is one of the largest species-level public gut microbial reference databases. To estimate
the number of bins corresponding to known bacteria and the number of bins that might contain chimeric
genomes for different binning methods, we utilized Mash (v2.2) [115] with 10,000 sketches per genome
to calculate the Mash distance between the UHGG species-level representative and the bins derived from
the human gut dataset. Mash distance serves as a reliable proxy for one minus the average nucleotide
identity (ANI) [115], with the Mash species-level threshold of 0.05 equivalent to the widely accepted 95%
ANI used to define species boundaries [86]. Therefore, we assigned one bin to one species if the Mash
distance between the bin and the representative reference genome of that species was less than 0.05. We
also identified a bin as chimeric if it was assigned to multiple species.
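The assignment rule can be sketched as follows, assuming the pairwise Mash distances have already been computed and loaded into a dictionary; the input layout and the function name are hypothetical.

```python
def assign_bins(distances, threshold=0.05):
    """Assign bins to species and flag chimeric bins.

    distances: dict mapping (bin_id, species_id) to a Mash distance.
    Returns (hits, chimeric), where hits maps each bin to the set of species
    within the threshold and chimeric is the set of bins with several hits.
    """
    hits = {}
    for (bin_id, species_id), d in distances.items():
        if d < threshold:
            hits.setdefault(bin_id, set()).add(species_id)
    chimeric = {b for b, species in hits.items() if len(species) > 1}
    return hits, chimeric
```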
3.2.11 MAG analyses on two long-read metaHi-C datasets
To identify which near-complete bins overlapped each other from MetaCC binning and other Hi-C-based
binners, we employed Mash (v2.2) [115] with 10,000 sketches per bin to calculate the Mash distance between near-complete bins from different bin sets. Two bins with Mash distance smaller than 0.01 were
identified as MAGs from the same genome [110, 117]. Moreover, to evaluate the ability of different binners to
capture the species diversity, we annotated all high-quality bins using GTDB-Tk (v2.1.0, Release: R207 v2)
[22] with the function ‘classify_wf’ to obtain the taxonomic information of high-quality MAGs recovered
by different binners.
3.2.12 Plasmid analyses on the sheep gut long-read metaHi-C dataset
A total of 6,320 contigs in 709 high-quality MAGs retrieved by MetaCC binning were first filtered by
PPR-Meta (v1.1) [39] with cut-off 0.5 to identify potential plasmid contigs. In this way, we identified 111
(1.7%) potential plasmid contigs. These 111 pre-filtered contigs were further screened using Platon (v1.6)
[135] with mode ‘Sensitivity’ to exclude potential chromosomal contigs. As a result, 99 contigs were finally
identified as plasmids with high confidence. We queried these 99 plasmid contigs by BLAST (v2.12.0) [73]
with at least 95% identity match of at least 1000 bp to the reference plasmid genomes from NCBI RefSeq
database (Release: November, 2022).
3.2.13 Other algorithms used in benchmarking
The normalization method HiCzin (v0.1.0) [33] was run with default parameters. All binners used for
comparison, i.e., VAMB (v3.0.3) [110], MetaTOR (v1.1.4) [9], bin3C (v0.1.1) [29], and HiCBin (v1.1.0) [34]
were executed with default parameters on all real metaHi-C datasets.
3.3 Results
3.3.1 Overview of MetaCC
MetaCC is a comprehensive analysis framework designed for both short-read and long-read metaHi-C
datasets (Figure 3.1a) and consists of four main components. (I) We design a scalable and effective normalization method, NormCC, to eliminate systematic biases from the raw metagenomic Hi-C contact matrix.
(II) We discard spurious inter-species Hi-C contacts linking contigs from different species due to experimental noises. (III) Based on the normalized Hi-C contact graph, we retrieve high-quality MAGs using
Leiden clustering [151] with all hyper-parameters automatically tuned. (IV) With several new computational strategies, we reliably characterize the structure of microbial ecosystems.
[Figure 3.1 graphic. Panel a: MetaCC workflow with the stages Input (short/long read set, Hi-C read set), Assembly, Pre-processing, Alignment, NormCC Normalization, Binning, and Downstream Analyses. Panel b: heatmaps of the raw Hi-C contact matrix, the normalized Hi-C contact matrix, and the normalized Hi-C contact matrix after removing spurious contacts, for contigs of Kluyveromyces wickerhamii and Ashbya gossypii; color bar in log scale from 0 to ≥ 4.]
Figure 3.1: Overview of the MetaCC framework for metagenomic Hi-C analyses. (a) The input metaHi-C
dataset consists of shotgun libraries and Hi-C libraries. Short/long reads in shotgun libraries are assembled
into contigs, to which Hi-C paired-end reads are subsequently aligned. In this way, the raw Hi-C contact
matrix displaying the proximity similarity between contigs within cells can be constructed. The raw Hi-C
contact matrix is normalized by the NormCC normalization module to correct the systematic biases, and
spurious inter-species contacts are subsequently removed. Assembled contigs are then binned into high-quality MAGs leveraging the normalized Hi-C contact matrix. Finally, downstream analyses are conducted.
(b) Visualization of the procedures of NormCC normalization and spurious contact removal: heatmaps
of the Hi-C contact matrix for contigs belonging to the species Kluyveromyces wickerhamii and Ashbya
gossypii from a synthetic yeast dataset.
3.3.2 NormCC comprehensively corrects all systematic biases existing in a synthetic
yeast metaHi-C dataset
Leveraging a synthetic yeast metaHi-C dataset [18] with all assembled contigs labeled at the species level,
we previously revealed that raw intra-species Hi-C contacts, defined as the number of proximity ligation events linking contigs from the same species, were more enriched between pairs of contigs with a
larger number of restriction sites, longer contigs, and/or contigs with higher coverages [33]. We have also
demonstrated that HiCzin, the normalization method employed in HiCBin, outperformed other metaHi-C-based normalization methods, including those utilized in bin3C and MetaTOR, in terms of spurious contact
detection and contig clustering using the synthetic yeast metaHi-C dataset [33, 34]. Notably, HiCzin incorporates contig annotations at the species level, obtained through TAXAassign [67], to select intra-species
Hi-C contacts utilized in fitting its normalization model. In line with the previous analyses, we validated
the performance of NormCC normalization on this synthetic sample and compared it to HiCzin using the
same benchmarking criteria. Details of processing raw data were shown in the Methods section.
The procedures of NormCC normalization and spurious contact removal can be visualized in Figure
3.1b. To quantify the biases existing for raw intra-species Hi-C contacts, we computed, for each of the three factors (the number of restriction sites, the length, and the coverage), the Pearson correlation coefficient between all raw intra-species contacts and the product of that factor over the corresponding contig pairs.
The coefficients were 0.429, 0.400, and 0.184, respectively, indicating the strong biases of three factors on raw metagenomic Hi-C contacts. After the NormCC
normalization, the correlations between the bias-corrected Hi-C contacts and the products of the three factors
decreased to 0.094, 0.090, and 0.004, respectively, demonstrating that NormCC was able to comprehensively correct all systematic biases for the metaHi-C datasets.
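The bias check described above can be sketched as below, under the reading that each contact value is paired with the product of one factor's values over the two contigs involved. The function names and the dictionary layout are assumptions for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def bias_correlation(contacts, factor):
    """Correlate contact values with the pairwise product of one factor.

    contacts: dict mapping (contig_i, contig_j) to a contact value.
    factor: dict mapping contig name to the factor value, e.g. its length.
    """
    pairs = list(contacts)
    values = [contacts[p] for p in pairs]
    products = [factor[i] * factor[j] for i, j in pairs]
    return pearson(values, products)
```

A coefficient close to zero after normalization indicates that the factor no longer explains the contact values.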
3.3.3 NormCC outperforms HiCzin on the spurious contact removal, contig clustering,
and computational time
Though the magnitude of the spurious inter-species contacts is significantly smaller than that of the intra-species contacts in the NormCC-normalized Hi-C contact matrix (Supplementary Figure 7.1), discarding
all Hi-C contacts below a threshold as spurious inevitably results in the loss of a few informative intra-species contacts. Therefore, a better capacity for removing spurious contacts from a Hi-C
contact matrix means eliminating a greater number of spurious contacts while
minimizing the unintended removal of informative intra-species contacts. We then applied the spurious
contact removal strategy (see Methods) based on the raw, HiCzin-normalized, or NormCC-normalized Hi-C contact matrices, respectively, and plotted discard–retain (DR) curves, where the proportion of discarded
spurious contacts among all spurious contacts is plotted against the proportion of retained intra-species
contacts among all intra-species contacts at thresholds corresponding to various percentiles (Figure 3.2a). The area under the discard–retain curve (AUDRC) measures the ability of spurious contact
removal. With respect to AUDRC, NormCC outperformed HiCzin.
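A discard-retain curve and its area can be sketched as follows. The sketch assumes that a contact is discarded when its value is at or below the threshold and that the area is computed with the trapezoidal rule; both choices, along with the function names, are assumptions for illustration.

```python
def discard_retain_curve(spurious, intra, thresholds):
    """Return (retained, discarded) points ordered by increasing retained.

    spurious: contact values of known spurious inter-species contacts.
    intra: contact values of known intra-species contacts.
    thresholds: cut-off values, e.g. taken at a range of percentiles.
    """
    points = []
    # Lowering the threshold retains more intra-species contacts, so
    # descending thresholds give points with non-decreasing x.
    for t in sorted(thresholds, reverse=True):
        discarded = sum(1 for v in spurious if v <= t) / len(spurious)
        retained = sum(1 for v in intra if v > t) / len(intra)
        points.append((retained, discarded))
    return points


def audrc(points):
    """Area under the discard-retain curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

When spurious and intra-species contacts are perfectly separated, the curve passes through the point (1, 1) and the area is 1.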
Moreover, we applied the Leiden clustering strategy (see Methods) on the raw, HiCzin-normalized,
or NormCC-normalized Hi-C contact matrices, respectively, to cluster contigs. To explore the impact of
spurious contact removal on contig clustering, we also grouped contigs based on the NormCC-normalized
Hi-C contact matrix after removing spurious contacts. Since the true species identity of all contigs was
available, we employed three comprehensive metrics (Supplementary materials): Fowlkes-Mallows score
(F-score), Adjusted Rand Index (ARI), and normalized mutual information (NMI) to evaluate the clustering
performance. As shown in Figure 3.2b, the bias elimination and the spurious contact removal could improve
the clustering performance while NormCC outperformed HiCzin on the contig binning in terms of F-score,
ARI, and NMI.
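As one example of these metrics, the Fowlkes-Mallows score can be computed from pair counts using its standard definition, TP / sqrt((TP + FP)(TP + FN)). The sketch below is quadratic in the number of contigs and is meant only to illustrate the definition; the function name is hypothetical, and the exact variants used in this study are described in the Supplementary materials.

```python
import math
from itertools import combinations


def fowlkes_mallows(true_labels, pred_labels):
    """Fowlkes-Mallows score from pairwise agreement counts."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both labelings
        elif same_pred:
            fp += 1  # pair grouped together only in the prediction
        elif same_true:
            fn += 1  # pair grouped together only in the ground truth
    if tp == 0:
        return 0.0
    return tp / math.sqrt((tp + fp) * (tp + fn))
```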
Finally, NormCC and HiCzin were executed on a 2.40 GHz Intel Xeon Processor E5-2665 with 50,000 MB
RAM provided by the Advanced Research Computing platform at the University of Southern California.
The time recording started at the input of raw Hi-C contact matrix and ended at the output of normalized
Hi-C contact matrix. We ran both NormCC and HiCzin on the synthetic yeast dataset as well as four
real metaHi-C datasets. The running times are shown in Supplementary Table 7.5. NormCC was
much faster than HiCzin on all datasets. In particular, NormCC was more than 3,000× faster than HiCzin
on the wastewater short-read metaHi-C dataset [142]. Apart from the running time, HiCzin also required
substantially more computational resources than NormCC to prepare its input data, including generating contig
annotations and estimating the contig abundances.
[Figure 3.2 plot. Panel (a), discard-retain curve: x-axis, proportion of discarded spurious contacts; y-axis, proportion of retained intra-species contacts; reported AUDRC values: 0.685, 0.804, and 0.847. Panel (b), the synthetic yeast dataset: clustering metrics (F-score, ARI, NMI) against scores for Raw, HiCzin, NormCC, and NormCC + Spurious contact removal.]
Figure 3.2: Benchmarking the NormCC normalization module on the synthetic yeast metaHi-C dataset.
(a) Discard-retain curves for evaluating spurious contact removal based on the raw, HiCzin-normalized, or
NormCC-normalized Hi-C contact matrices, respectively. NormCC achieved the highest AUDRC (i.e., area
under discard-retain curve). (b) Performance of contig clustering based on the raw, HiCzin-normalized,
or NormCC-normalized Hi-C contact matrices as well as NormCC-normalized Hi-C contact matrix with
spurious contact removal, respectively. NormCC outperformed HiCzin on the contig clustering in terms
of F-score, ARI, and NMI.
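To make the discard-retain evaluation of Figure 3.2a more concrete, the following minimal Python sketch builds a discard-retain curve from labeled contacts and integrates it to obtain an AUDRC. It assumes that contacts are discarded in increasing order of their normalized value and that the area is computed with the trapezoidal rule; the function names and the toy data are hypothetical and are not taken from the MetaCC implementation.

```python
def discard_retain_curve(contacts):
    """Build a discard-retain curve.

    contacts: list of (normalized_value, is_spurious) pairs.
    Returns (discarded_spurious_fraction, retained_intra_fraction) points
    obtained by discarding contacts in increasing order of value.
    """
    n_spurious = sum(1 for _, spurious in contacts if spurious)
    n_intra = len(contacts) - n_spurious
    if n_spurious == 0 or n_intra == 0:
        raise ValueError("need both spurious and intra-species contacts")
    points = [(0.0, 1.0)]  # nothing discarded yet
    discarded_spurious = 0
    discarded_intra = 0
    for _, spurious in sorted(contacts, key=lambda c: c[0]):
        if spurious:
            discarded_spurious += 1
        else:
            discarded_intra += 1
        points.append((discarded_spurious / n_spurious,
                       1.0 - discarded_intra / n_intra))
    return points


def audrc(points):
    """Trapezoidal area under the discard-retain curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example (hypothetical values): spurious contacts all score lower
# than intra-species contacts, which is the ideal case.
toy = [(0.1, True), (0.2, True), (0.9, False), (1.0, False)]
toy_area = audrc(discard_retain_curve(toy))
```

Under these assumptions, a normalization that ranks every spurious contact below every intra-species contact yields the maximum area, while the reverse ranking yields an area of zero.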
3.3.4 MetaCC binning achieved the best performance of MAG retrieval on short-read
metaHi-C datasets
To validate MetaCC binning on short-read metaHi-C datasets, we applied it to two datasets from different
microbial environments: human gut [121] and wastewater [142]. Since the actual genomes are unknown in
real samples, we leveraged CheckM [119] to evaluate the quality of the recovered bins (see Methods). We
compared MetaCC binning to three other publicly available Hi-C-based binning tools, i.e., MetaTOR [9],
bin3C [29], and HiCBin [34]. Additionally, we included one state-of-the-art shotgun-based binning method,
VAMB [110], in the comparison. Without using Hi-C information, shotgun-based binning depends on
the sequence similarity and abundance features of contigs to retrieve draft genomic bins. In both datasets,
MetaCC binning recovered more near-complete and high-quality bins than the alternatives considered
(Figure 3.3). Specifically, on the human gut dataset, VAMB and the three Hi-C-based binning methods,
i.e., MetaTOR, bin3C, and HiCBin, recovered 39, 47, 60, and 67 near-complete MAGs, respectively, while
MetaCC binning increased this number to 79. On the wastewater dataset, VAMB, MetaTOR, bin3C, and HiCBin retrieved
11, 82, 44, and 94 near-complete MAGs, respectively, which MetaCC binning improved to 103.
Notably, in all instances, Hi-C-based binning pipelines outperformed the shotgun-based method on short-read metaHi-C datasets, indicating the great potential of Hi-C information.
Additionally, for the human gut short-read metaHi-C dataset, we assessed the number of bins corresponding to known bacteria identified in the human gut environment and the number of bins that might
contain chimeric genomes for different binning methods using the UHGG gut microbial reference database
[4] (see Methods). MetaCC binning recovered the largest number of known bacteria from the human gut
environment based on the UHGG database. Specifically, VAMB, MetaTOR, bin3C, HiCBin, and MetaCC
binning retrieved 83, 107, 89, 118, and 128 bins, respectively, which were assigned to only one known
species. Furthermore, only one bin was detected as chimeric for MetaCC binning, while 4, 6, 2, and 11
chimeric bins were identified for VAMB, MetaTOR, bin3C, and HiCBin, respectively.
[Figure 3.3 plot. Panels: human gut short-read metaHi-C dataset and wastewater short-read metaHi-C dataset. x-axis: number of bins; y-axis: binning method (VAMB, MetaTOR, bin3C, HiCBin, MetaCC); bars stacked by Moderately complete, Substantially complete, and Near-complete.]
Figure 3.3: Benchmarking the MetaCC binning module on short-read metaHi-C datasets. MetaCC binning
outperformed other binners on both the human gut and wastewater short-read metaHi-C datasets according to the CheckM criteria (Near-complete: completeness ≥ 90% and contamination ≤ 10%; Substantially
complete: 70% ≤ completeness < 90% and contamination ≤ 10%; Moderately complete: 50% ≤ completeness < 70% and contamination ≤ 10%).
3.3.5 MetaCC binning markedly outperformed existing binners on long-read metaHi-C datasets
Since all previous studies only compared Hi-C-based binning tools on short-read metaHi-C datasets, we
focused on benchmarking MetaCC binning and other existing Hi-C-based binners on long-read
metaHi-C datasets, leveraging one cow rumen long-read metaHi-C dataset [15] and one sheep gut long-read metaHi-C dataset [14]. Results are shown in Figure 3.4a. On the cow rumen long-read metaHi-C
dataset, VAMB, MetaTOR, and bin3C created 4, 5, and 5 near-complete MAGs, respectively, while MetaCC
binning increased this number to 8. In total, MetaCC binning reconstructed 71 high-quality bins, a gain
of 38 (115%), 28 (65.1%) and 31 (77.5%) high-quality bins against VAMB, MetaTOR and bin3C, respectively.
HiCBin failed to bin contigs on the cow rumen dataset due to the nonconvergence of its adopted normalization method HiCzin. As for the sheep gut long-read metaHi-C dataset, VAMB generated 190 near-complete,
94 substantially complete, and 94 moderately complete bins. MetaTOR created 228 near-complete, 102 substantially complete, and 105 moderately complete MAGs. bin3C recovered 268 near-complete, 83 substantially complete, and 51 moderately complete draft genomic bins. HiCBin reconstructed 99 near-complete,
55 substantially complete, and 54 moderately complete bins. In contrast, MetaCC binning retrieved 417
near-complete, 162 substantially complete, and 130 moderately complete MAGs, significantly outperforming VAMB, MetaTOR, bin3C, and HiCBin with an increase of 227 (119.5%), 189 (82.9%), 149 (55.6%), and 318
(321%) near-complete bins, respectively. MetaCC binning also improved the total number of high-quality
MAGs by 331 (87.6%), 274 (63.0%), 307 (76.4%), and 501 (240.9%) compared to VAMB, MetaTOR, bin3C, and
HiCBin, respectively. We also tested the effect of polishing HiFi assemblies with short reads on binning
and found that it did not improve the binning performance on the sheep gut dataset (Supplementary materials), suggesting that the polishing step might not be necessary and could be omitted in the future, possibly
due to the high accuracy of HiFi reads.
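The gains quoted for the sheep gut dataset follow directly from the per-tier bin counts given in the text. As a worked check, the short Python sketch below recomputes them, taking "high-quality" to mean the sum of near-complete, substantially complete, and moderately complete bins, which is how the totals in this section add up. The variable and function names are chosen here for illustration only.

```python
# Bin counts per method on the sheep gut dataset, as reported in the text:
# (near-complete, substantially complete, moderately complete)
sheep_gut = {
    "VAMB": (190, 94, 94),
    "MetaTOR": (228, 102, 105),
    "bin3C": (268, 83, 51),
    "HiCBin": (99, 55, 54),
    "MetaCC": (417, 162, 130),
}


def gain(target, baseline):
    """Absolute and percentage gain of target over baseline."""
    diff = target - baseline
    return diff, 100.0 * diff / baseline


def gains_over_baselines(counts, target="MetaCC"):
    """Near-complete and high-quality gains of `target` over every other method."""
    target_nc = counts[target][0]
    target_hq = sum(counts[target])
    result = {}
    for method, tiers in counts.items():
        if method == target:
            continue
        result[method] = {
            "near_complete": gain(target_nc, tiers[0]),
            "high_quality": gain(target_hq, sum(tiers)),
        }
    return result


sheep_gains = gains_over_baselines(sheep_gut)
for method, g in sheep_gains.items():
    nc_diff, nc_pct = g["near_complete"]
    hq_diff, hq_pct = g["high_quality"]
    print(f"{method}: NC +{nc_diff} ({nc_pct:.1f}%), HQ +{hq_diff} ({hq_pct:.1f}%)")
```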
Moreover, we used Mash [115] to identify instances in which MetaCC binning and other Hi-C-based binners
(i.e., MetaTOR, bin3C, and HiCBin) retrieved the same near-complete MAGs on both long-read metaHi-C datasets. As shown in Figure 3.4b, most of the near-complete MAGs recovered by MetaTOR, bin3C, and
HiCBin could also be retrieved by MetaCC binning in near-complete quality. MetaCC binning further
reconstructed a large number of near-complete MAGs that were only recovered in substantially or moderately complete quality (or were absent) by other Hi-C-based binners on both long-read metaHi-C datasets,
while the inverse cases were relatively rare, validating the superior ability of MetaCC binning to retrieve
near-complete bins on long-read metaHi-C datasets.
Finally, we explored the capability of different binners to capture the species diversity in microbial
samples by annotating all high-quality bins generated by MetaCC and other Hi-C-based binners on both
long-read metaHi-C datasets using GTDB-TK [22]. As shown in Figure 3.4c, bins derived from MetaCC
binning represented a larger taxonomic diversity at the species level on both datasets. Additionally, we
found that one near-complete MAG (BIN 1254; Completeness: 97.94 and Contamination: 0.38) retrieved by
MetaCC binning from the sheep gut samples belonged to the species Bacteroides vulgatus, which is one of
the most important species in gut environments and plays important roles in inhibiting atherosclerosis and decreasing the production of gut microbial lipopolysaccharide [167]. However, this important
species could not be recovered with high quality by other binners from the sheep gut dataset. Therefore,
MetaCC binning outperformed other binners in extracting the species structure of microbial ecosystems.
[Figure 3.4 plot. Panels (a)-(c) for the cow rumen and sheep gut long-read metaHi-C datasets. (a) Number of bins per binning method (VAMB, MetaTOR, bin3C, HiCBin, MetaCC), stacked by Moderately complete, Substantially complete, and Near-complete. (b) Number of near-complete bins for MetaCC paired with MetaTOR, bin3C, or HiCBin, colored as MC or miss in other, SC in other, and NC in both. (c) Number of species per binning method.]
Figure 3.4: Benchmarking the MetaCC binning module on long-read metaHi-C datasets. (a) MetaCC binning outperformed other binners on both the cow rumen and sheep gut long-read metaHi-C datasets
according to the CheckM criteria (Near-complete: completeness ≥ 90% and contamination ≤ 10%; Substantially complete: 70% ≤ completeness < 90% and contamination ≤ 10%; Moderately complete: 50% ≤
completeness < 70% and contamination ≤ 10%). HiCBin failed to bin contigs on the cow rumen dataset
due to the nonconvergence of its adopted normalization method HiCzin. (b) Comparison of near-complete
bins identified by MetaCC binning and other Hi-C-based binners from the long-read metaHi-C datasets.
The total length of each bar shows the total number of near-complete (NC) bins recovered by each binner.
Each bar is then colored according to the number of NC bins that can be identified by both binners (NC
in both), the number of NC bins that are substantially complete in the other bin set (SC in other), and the
number of NC bins that are moderately complete or missing in the other bin set (MC or miss in other). (c)
Comparison of the number of species recovered by different binners with high quality. MAGs retrieved by
MetaCC binning represent the largest taxonomic diversity at the species level.
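The CheckM quality tiers used in Figures 3.3 and 3.4 can be written as a small decision rule. The Python sketch below is only an illustration of the thresholds stated in the captions; the function name and the returned labels are assumptions, and the completeness and contamination values are expected as percentages such as those reported by CheckM.

```python
def quality_tier(completeness, contamination):
    """Assign a bin to a quality tier from completeness and contamination (in %).

    Returns None when the bin does not reach the moderately complete tier
    or when its contamination exceeds 10%.
    """
    if contamination > 10.0:
        return None
    if completeness >= 90.0:
        return "near-complete"
    if completeness >= 70.0:
        return "substantially complete"
    if completeness >= 50.0:
        return "moderately complete"
    return None


# BIN 1254 from the sheep gut dataset, with the values quoted in the text.
bin_1254_tier = quality_tier(97.94, 0.38)
```

In this chapter, a bin that falls into any of the three tiers is counted as high-quality.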
3.3.6 MetaCC binning identified and expanded the order Erysipelotrichales from the
cow rumen and sheep gut samples
Members of the order Erysipelotrichales, which have important functions in animal disease and physiology [74], have been isolated from the human [74], cow [145], insect [148], and mouse [26]
gut. Among the high-quality MAG sets recovered by different binners on the cow rumen dataset, only the set from MetaCC
binning included a draft genome from the order Erysipelotrichales.
Similar to other gut environments, we also observed the prevalence of this order in the sheep
gut sample, indicating an increasingly important role of the order Erysipelotrichales in animal microbiomes. Specifically, according to the annotation results of GTDB-TK, eight high-quality MAGs retrieved
by MetaCC binning belonged to the order Erysipelotrichales (compared to five, three, and one recovered by
MetaTOR, bin3C, and HiCBin, respectively). Three of these eight bins could be annotated at the species level. The
other five MAGs could be annotated to four different genera but could not be annotated at the species level,
suggesting the potential expansion of species in the order Erysipelotrichales. Further experiments are required to collect more data on their phenotypic and physical properties before these uncultured members
can be conclusively determined.
3.3.7 Plasmid analyses among high-quality MAGs retrieved by MetaCC binning from
the sheep gut sample
Taxonomic statistics of 709 high-quality MAGs retrieved by MetaCC binning from the sheep gut dataset
are shown in Supplementary Table 7.6. Among contigs contained in these high-quality MAGs, 99 contigs
were identified as plasmid contigs with high confidence (see Methods). The majority of plasmid contigs
were included in MAGs from the orders Oscillospirales and Bacteroidales (Supplementary Table 7.7), which
were commonly reported in the gut microbiomes [55]. Though there were only 8 out of 709 MAGs (1.1%)
from the order Erysipelotrichales, 13 out of 99 plasmid contigs (13.1%) could be found within these 8 MAGs.
We also observed three plasmid contigs in MAGs from the order Christensenellales, members of which are
hydrogen-producing and fibrolytic and have been reported to be more predominant in the sheep rumen environment
than in other rumen environments, such as those of mice and rabbits [104].
Plasmids present in multiple copies in genomes are often absent from MAGs retrieved by shotgun-based binning methods, since such methods rely on coverage information to bin contigs [145].
Therefore, we explored whether MetaCC binning could bin multi-copy plasmids. To look for
multi-copy plasmid contigs, we extracted plasmid contigs with coverage more than 2× the
mean coverage of their respective MAGs, and we observed two plasmids, contig_24425 and contig_61128,
whose coverages were around 3× and 5× the mean coverage of their respective MAGs,
respectively (Supplementary Table 7.8). Another plasmid, contig_58576 (length: 103,370 bp), had strong
BLAST [73] hits with a total alignment length of 101,669 bp (98.4%) to the NCBI plasmid reference genome
NZ_CP080264.1 (assigned taxon: Escherichia coli). Indeed, MetaCC binning attributed this plasmid contig
to BIN 1239, which was annotated as Escherichia coli at the species level by GTDB-TK.
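The coverage-ratio screen for multi-copy plasmid contigs can be sketched as follows. This is a minimal illustration under stated assumptions: the MAG mean is taken as the unweighted mean over all contigs in the MAG (the text does not specify whether the mean is length-weighted or whether plasmid contigs are excluded), and the function name, contig identifiers, and coverage values are hypothetical.

```python
from statistics import mean


def multi_copy_plasmids(mag_contig_coverage, plasmid_contigs, min_ratio=2.0):
    """Return plasmid contigs whose coverage exceeds min_ratio times the MAG mean.

    mag_contig_coverage: dict mapping contig id -> coverage for one MAG.
    plasmid_contigs: set of contig ids identified as plasmid contigs.
    Assumption: the MAG mean is the unweighted mean over all of its contigs.
    """
    mag_mean = mean(mag_contig_coverage.values())
    ratios = {}
    for contig, coverage in mag_contig_coverage.items():
        ratio = coverage / mag_mean
        if contig in plasmid_contigs and ratio > min_ratio:
            ratios[contig] = ratio
    return ratios


# Toy MAG (hypothetical values): four chromosomal contigs and two plasmid contigs.
toy_mag = {"c1": 10.0, "c2": 12.0, "c3": 8.0, "c4": 10.0, "p1": 60.0, "p2": 12.0}
toy_hits = multi_copy_plasmids(toy_mag, plasmid_contigs={"p1", "p2"})
```

In this toy MAG only p1 passes the 2× screen, since p2 has roughly chromosomal coverage.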
3.3.8 Running time of the overall MetaCC pipeline
On a 2.40 GHz Intel Xeon Processor E5-2665 with 50,000-MB memory allocated, the overall MetaCC
pipeline spent 19 min, 56 min, 15 min, and 109 min on the human gut short-read, wastewater short-read,
cow rumen long-read, and sheep gut long-read metaHi-C datasets, respectively.
3.4 Discussion
In this work, we have developed MetaCC for scalable and integrative metaHi-C analyses. The MetaCC
framework consists of two major modules, the NormCC normalization module and the binning module.
NormCC models proximity ligation counts both across and within contigs using the negative binomial distribution and enables the correction of all systematic biases. Compared to HiCzin, NormCC showed better performance in terms of spurious inter-species contact removal and contig clustering on a synthetic yeast
dataset and was much faster on real metaHi-C datasets. Moreover, HiCzin suffers substantial performance
deterioration when the species-level annotation by TAXAassign is achieved for only a limited fraction of
assembled contigs (Supplementary materials). This vulnerability is particularly noticeable on long-read
metaHi-C datasets (Supplementary Table 7.2), as further evidenced by HiCzin’s failure to converge on the
cow rumen long-read metaHi-C dataset (Supplementary Table 7.5). In contrast, NormCC does not rely on
contig annotations as input and performs normalization solely based on fundamental features of assembled
contigs, including contig length and the number of restriction sites on contigs. These essential features
can be directly obtained from contigs after assembly regardless of the sequencing technologies employed.
This adaptability enables NormCC to be easily applied to both short-read and long-read metaHi-C datasets,
demonstrating its versatility in comparison to HiCzin.
MetaCC binning also outperformed all existing Hi-C-based binners consistently on short-read and
long-read metaHi-C datasets. Downstream annotation studies and plasmid analyses on real long-read
metaHi-C datasets further demonstrated the unique ability of MetaCC in characterizing the structures of
microbial samples. Notably, on short-read metaHi-C datasets, HiCBin substantially outperformed the other competing methods except MetaCC binning, in agreement with previous benchmarking studies [70]. MetaCC binning further showed a slight improvement over HiCBin (Figure 3.3).
Both methods employ Leiden clustering, with the key distinction lying in their respective normalization
approaches. MetaCC employs NormCC as its normalization method, whereas HiCBin relies on HiCzin.
Consequently, the improved performance of MetaCC binning over HiCBin can be primarily attributed
to the superior contig clustering performance facilitated by its normalization method NormCC, as also
supported by Figure 3.2b. However, in line with HiCzin, HiCBin also exhibits notable degradation in performance when assigning taxonomic labels for contigs at the species level is challenging (Supplementary
materials), which is particularly evident on long-read metaHi-C datasets (Supplementary Table 7.2). This
limitation of HiCBin adversely affects its performance on long-read metaHi-C datasets, underscoring the
notable superiority of MetaCC binning over other Hi-C-based binners, including HiCBin, specifically in
the context of long-read metaHi-C datasets.
In the spurious contact removal step, it is important to note that there is no gold standard for determining the threshold value, as the fraction of spurious inter-species contacts among all Hi-C contacts varies
with the quality of metaHi-C experiments. Moreover, there exists a trade-off in selecting this cut-off
value. Opting for larger thresholds can eliminate more spurious contacts but may also result in the unintended removal of a higher number of informative intra-species contacts. Therefore, we have taken a
conservative approach by selecting a small yet safe threshold (i.e., the default 5th percentile) to mitigate
the loss of important Hi-C information. From our experiments on the synthetic yeast dataset, the default
cut-off enabled the removal of 19.3% of spurious inter-species contacts while incorrectly discarding less
than 0.5% of informative intra-species contacts. Furthermore, we conducted experiments to evaluate the
impact of the spurious contact removal step using the default threshold on the downstream binning results
of all four real metaHi-C datasets. Our results consistently demonstrated that the inclusion of this step
with the default threshold led to improved binning outcomes across all datasets (Supplementary materials).
In addition to the default conservative threshold, an outcome-oriented strategy may be an alternative for
selecting cut-offs. For example, one can try different thresholds and choose the one that yields the best MAG
retrieval results in downstream analysis. However, such outcome-oriented strategies consume
much more computing resources and lack generalizability.
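The percentile-based cut-off discussed above can be illustrated with a short Python sketch. It assumes that the threshold is the q-th percentile of all non-zero normalized contact values and that contacts strictly below it are discarded; the linear-interpolation percentile, the function names, and the toy contact map are assumptions made for this example rather than details of the MetaCC code.

```python
import math


def percentile(values, q):
    """Percentile (q in [0, 100]) with linear interpolation between ranks."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = math.floor(pos)
    upper = math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def remove_spurious_contacts(contacts, q=5.0):
    """Discard contacts whose normalized value is below the q-th percentile.

    contacts: dict mapping (contig_a, contig_b) -> normalized contact value.
    Returns the retained contacts and the threshold that was applied.
    """
    threshold = percentile(list(contacts.values()), q)
    kept = {pair: value for pair, value in contacts.items() if value >= threshold}
    return kept, threshold


# Toy contact map (hypothetical): 21 contig pairs with values 1.0 to 21.0.
toy_contacts = {(f"contig_{i}", f"contig_{i + 1}"): float(i) for i in range(1, 22)}
kept_contacts, cutoff = remove_spurious_contacts(toy_contacts)
```

Raising q in this sketch removes more low-valued contacts, which mirrors the trade-off described in the text: more spurious contacts are eliminated, at the cost of losing more informative intra-species contacts.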
In long-read metaHi-C experiments, it is noteworthy that contigs assembled from error-prone long
reads are typically polished using accurate short reads obtained from the same sample to improve sequence
accuracy [15, 28, 51]. One important reason for adopting polishing is the low alignment quality of
accurate Hi-C short reads to contigs assembled from error-prone long reads. Regarding our NormCC
normalization method, the polishing step can also help to mitigate the impact of sequencing errors on the
identification of restriction sites on contigs due to the improved accuracy at the nucleotide level. However,
for the contigs assembled from the accurate HiFi long reads, previous studies have indicated that polishing
HiFi assemblies using short reads did not markedly enhance sequence accuracy [14]. Furthermore, we
have demonstrated that polishing HiFi assemblies did not improve the Hi-C-based binning results on the
sheep gut dataset (Supplementary materials). Therefore, considering the high accuracy of HiFi reads, we
believe the polishing step may not be necessary in this case.
For the four real metaHi-C datasets, we employed CheckM to evaluate the binning performance.
Though CheckM is the main software used to assess the quality of bins retrieved from real metagenomic
samples, there is a need for further investigation into how accurately the validation method based on
marker genes can reflect the actual completeness and contamination of the recovered MAGs. This is particularly relevant as certain genomic regions may lack marker genes. Moreover, the focus of CheckM on
marker sets suitable for evaluating bacterial and archaeal genomes may result in eukaryotic genomes being
classified as significantly incomplete [119].
There are several directions in which MetaCC can be further extended. For large MAGs with high abundances, it would be interesting to combine NormCC-normalized Hi-C contacts with other information sources,
such as the assembly graph, to scaffold the assembled contigs within the same MAG retrieved by MetaCC
binning. Moreover, identifying interactions between mobile genetic elements and hosts using NormCC-normalized Hi-C contacts is of great potential. One major challenge in this topic is choosing a threshold
on Hi-C contacts for calling true interactions. We hope that MetaCC, as a new and the most systematic framework to date,
enables improved analysis of metaHi-C data with the potential to shed new light on the dark
matter of the microbiome.
Chapter 4
ImputeCC enhances integrative Hi-C-based metagenomic binning
through constrained random-walk-based imputation
In this chapter, we introduce ImputeCC, an integrative contig binning tool tailored for metaHi-C datasets.
ImputeCC integrates Hi-C interactions with the inherent discriminative power of single-copy marker
genes, initially clustering marker-gene-containing contigs into preliminary bins, and introduces a new constrained random walk with
restart (CRWR) algorithm to improve Hi-C connectivity among these contigs. Extensive evaluations on
mock and real metaHi-C datasets from diverse environments, including the human gut, wastewater, cow
rumen, and sheep gut, demonstrate that ImputeCC consistently outperforms other Hi-C-based contig
binning tools. In particular, ImputeCC retrieves a total of 408 high-quality and 885 medium-quality MAGs,
representing the largest number of reference-quality MAGs reported from a single sample to date. ImputeCC’s genus-level analysis of the sheep gut microbiota further reveals its ability and potential to recover essential species from dominant genera such as Bacteroides, detect previously unrecognized genera,
and shed light on the characteristics and functional roles of genera such as Alistipes within the sheep gut
ecosystem.
4.1 Introduction
Metagenomics is revolutionizing microbial ecology by enabling the exploration of complex microbial communities in diverse environments without the need for traditional microbial isolation or cultivation [58,
64]. The recent combination of Hi-C sequencing with whole metagenomic shotgun sequencing has led to
the development of the metagenomic Hi-C (metaHi-C) technique, which has provided novel perspectives
on species diversity and the interactions among microorganisms within a single microbial sample [18, 100,
121]. In metaHi-C experiments, shotgun sequencing extracts genomic fragments from a microbial sample,
while Hi-C sequencing conducted on the same microbial sample generates DNA-DNA proximity ligations
within the same cells, resulting in millions of paired-end Hi-C short reads. These fragmented shotgun
reads are assembled into longer contigs, forming the basis for aligning paired-end Hi-C reads. MetaHi-C
contacts, representing the number of Hi-C read pairs linking contig pairs, reveal contig relationships based
on physical proximity within the microbial community. Depending on whether the shotgun libraries in
metaHi-C experiments are constructed using second-generation or third-generation sequencing technologies, the resulting datasets can be classified as short-read or long-read metaHi-C datasets, respectively. Because contigs originating from the same genome exhibit enriched Hi-C contact frequencies
relative to those derived from distinct genomes, Hi-C-based binning aims to
group fragmented contigs into metagenome-assembled genomes (MAGs) [65] by leveraging Hi-C contacts between contigs [9, 29, 34]. The resulting MAG collections serve as fundamental prerequisites for
downstream analyses, such as the elucidation of the metabolic potentials and functional roles of diverse
microorganisms, as well as the exploration of virus-host interactions [24, 161]. Various Hi-C-based contig
binning methods have been developed, including HiCBin [34], MetaTOR [9], bin3C [29], and the MetaCC
binning module presented in Chapter 3 [36]. Compared to conventional shotgun-based binning tools reliant on sequence composition and contig coverage for contig clustering, Hi-C-based binning methods
demonstrate their superior ability in MAG recovery using only one single sample [34, 121].
However, existing Hi-C-based binning methods rely solely on Hi-C interactions for contig grouping,
overlooking valuable biological information encapsulated within single-copy marker genes. These genes,
present as single copies in the vast majority of genomes [3], hold great potential to discriminate between contigs originating from distinct species, since contigs sharing the same marker gene most likely come from different genomes. This omission underscores a
critical gap in current approaches, leaving ample room for enhancement and improved analyses. In response, we introduce ImputeCC, an integrative binning tool designed for metaHi-C datasets. ImputeCC
harnesses the comprehensive insights offered by both Hi-C interactions and single-copy marker
genes to optimize the contig binning process. To thoroughly assess the effectiveness of ImputeCC, we
conduct simulations for both short-read and long-read metaHi-C datasets. Subsequently, we demonstrate
ImputeCC’s performance against other publicly-available Hi-C-based binning tools using a diverse set of
real short-read and long-read metaHi-C datasets including the human gut short-read [121], wastewater
short-read [142], cow rumen long-read [15], and sheep gut long-read [14] metaHi-C datasets. ImputeCC’s
superior performance is particularly evident in the challenging sheep gut environment, where ImputeCC
successfully retrieves an impressive total of 408 high-quality and 885 medium-quality MAGs, as assessed
by the latest CheckM2 [25]. To the best of our knowledge, this represents the largest number of reference-quality MAGs reported from a single microbial sample. Furthermore, ImputeCC’s genus-level analyses of
the sheep gut microbiota reveal the ability of ImputeCC to recover essential species from dominant genera
such as Bacteroides, show its potential to detect previously unrecognized genera, and unveil other
high-quality MAGs within the Alistipes genus that warrant further experimental investigation to elucidate
their characteristics and roles within this ecosystem.
4.2 Methods
4.2.1 Datasets
Mock metaHi-C datasets: The mock community sequencing data were downloaded from the European
Nucleotide Archive under project ID PRJEB52977 [102]. The mock community comprises 71 strains representing 69 distinct species and underwent comprehensive sequencing using the Illumina HiSeq 3000,
ONT MinION R9, and PacBio Sequel II platforms, generating three different shotgun libraries. The specific
accession numbers and sizes of these three shotgun libraries are shown in Supplementary Table 7.12. After filtering the incomplete reference genomes (Supplementary materials), we obtained reference genomes
of 66 distinct species for the following experiments. The abundances of all species were available from
the supplementary data of [102]. Since the original dataset lacked Hi-C sequencing reads, we employed
sim3C (v0.2) [30] to simulate metagenomic Hi-C reads based on the 66 reference genomes and their known
abundances in the mock community, utilizing parameters ‘-n 10000000 -l 150 -e MluCI -e Sau3AI -m hic
--insert-sd 20 --insert-mean 350 --insert-min 150 --linear --simple-reads’. Subsequently, we combined the
same simulated Hi-C library with each of the three shotgun libraries to construct three mock metaHi-C datasets. These mock metaHi-C datasets were named according to the shotgun library incorporated in the
mock dataset, resulting in the ‘mock Illumina,’ ‘mock PacBio,’ and ‘mock Nanopore’ metaHi-C datasets.
Each mock dataset comprised real shotgun reads sequenced from a known mock community, along with
simulated Hi-C reads.
Real metaHi-C datasets: Four publicly-available real metaHi-C datasets were utilized in this study, comprising two short-read metaHi-C datasets and two long-read metaHi-C datasets. The specific sizes of the
raw datasets are detailed in Supplementary Table 7.13.
The two short-read metaHi-C datasets were derived from the human gut (BioProject: PRJNA413092)
[121] and wastewater (BioProject: PRJNA506462) [142] samples, respectively. Each short-read metaHi-C
dataset consisted of both shotgun and Hi-C libraries originating from the same sample source. The construction of Hi-C sequencing libraries involved the use of restriction endonucleases Sau3AI and MluCI.
Sequencing of both the shotgun and Hi-C libraries was carried out on Illumina platforms, producing 150-base pair reads.
The two long-read metaHi-C datasets were obtained from cow rumen (BioProject: PRJNA507739) [15] and sheep gut (BioProject: PRJNA595610) [14] samples, respectively. The cow rumen
long-read metaHi-C dataset comprised uncorrected PacBio long-read libraries and Hi-C libraries. The
error-prone PacBio long reads were generated using both the PacBio RSII and PacBio Sequel platforms.
Hi-C libraries for this dataset were prepared using the Sau3AI and MluCI restriction enzymes and subsequently sequenced on an Illumina HiSeq 2000, producing 80-base pair reads. The sheep gut long-read
metaHi-C dataset consisted of PacBio circular consensus sequencing (CCS) long-read libraries and Hi-C
sequencing libraries. The PacBio CCS long reads, characterized by high accuracy with average Q scores
exceeding 20, were referred to as HiFi reads. Distinct Hi-C libraries for the sheep gut long-read metaHi-C
dataset were generated using the Sau3AI and MluCI restriction enzymes and sequenced at a length of 150
base pairs.
4.2.2 Initial processing
We first conduct essential read cleaning procedures using ‘bbduk’ from the BBTools suite (v37.25) [19]
to address issues such as adaptor sequences, low-quality reads, and PCR duplication (Supplementary materials). For each metaHi-C dataset, reads from the shotgun library are assembled into longer contigs
(Supplementary materials). After assembly, processed paired-end Hi-C reads are aligned to these contigs using BWA-MEM (v0.7.17) [91] with the ‘-5SP’ parameter to prioritize the alignment with the lowest
read coordinate as the primary alignment. Subsequent alignment filtering steps include the removal of
unmapped reads, secondary and supplementary alignments, and alignments with low quality (nucleotide
match length < 30 or mapping score < 30). We count Hi-C read pairs whose two ends align to different contigs as raw
Hi-C contacts between those contigs, and contigs with fewer than two Hi-C contacts are excluded. Raw
Hi-C contacts are normalized by NormCC [36] with default parameters to eliminate the systematic biases
derived from the number of restriction sites, contig length, and coverage.
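The filtering and counting rules above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's own code: it assumes the alignments have already been parsed into one `(contig, match_length, mapping_score)` tuple per read end, and the function name, the single-pass contig filter, and the choice to ignore pairs that fall within one contig are all assumptions made for the example.

```python
from collections import Counter

def count_raw_contacts(read_pairs, min_match=30, min_mapq=30, min_contacts=2):
    """Count inter-contig Hi-C contacts from pre-parsed read-pair alignments.

    read_pairs: iterable of (end1, end2), each end being a tuple
    (contig, match_length, mapping_score). Hypothetical input layout.
    """
    contacts = Counter()
    for end1, end2 in read_pairs:
        # Drop the pair if either end fails the quality thresholds.
        if any(m < min_match or q < min_mapq for _, m, q in (end1, end2)):
            continue
        c1, c2 = end1[0], end2[0]
        if c1 == c2:
            continue  # assumption: only pairs spanning two contigs are counted
        contacts[tuple(sorted((c1, c2)))] += 1

    # Total contacts per contig, used to exclude poorly supported contigs.
    per_contig = Counter()
    for (a, b), k in contacts.items():
        per_contig[a] += k
        per_contig[b] += k
    keep = {c for c, k in per_contig.items() if k >= min_contacts}
    return {pair: k for pair, k in contacts.items()
            if pair[0] in keep and pair[1] in keep}

example_pairs = [
    (("a", 50, 60), ("b", 50, 60)),
    (("a", 50, 60), ("b", 40, 40)),
    (("a", 50, 60), ("c", 50, 60)),
    (("a", 20, 60), ("c", 50, 60)),  # removed: match length below 30
    (("a", 50, 60), ("a", 50, 60)),  # ignored: both ends on the same contig
]
raw_contacts = count_raw_contacts(example_pairs)
```

In this toy input, contig `c` keeps only one contact after filtering and is therefore excluded, leaving a single entry for the pair `("a", "b")`.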
4.2.3 The framework of ImputeCC binning
4.2.3.1 Detect assembled contigs with single-copy marker genes:
Similar to [159], we identify single-copy marker genes, i.e., genes typically found as single copies in
the majority of genomes [3], within the assembled contigs. We accomplish this by employing FragGeneScan
[127] and HMMER (v3.3.2) [41] (Supplementary materials).
4.2.3.2 Impute the metagenomic Hi-C contact matrix for contigs containing marker genes:
The effective preclustering of contigs with single-copy marker genes partially depends on the expectation
that marker-gene-containing contigs can be reliably linked through robust Hi-C interactions if they come
from the same genome. However, this expectation encounters a practical limitation attributed to the localized characteristics of proximity ligations, which implies that even when two contigs share the same
genomic origin, they may fail to establish Hi-C contacts if they are not in close spatial proximity within
the cell, thereby contributing to the sparsity of the metagenomic Hi-C contact matrix [33]. To facilitate
improved connections among marker-gene-containing contigs originating from the same genome through
Hi-C interactions, we design a metagenomic Hi-C contact matrix imputation method. This involves employing a constrained random walk with restart (CRWR) technique to amplify the within-cell Hi-C signals
specifically for marker-gene-containing contigs. We define $m$ and $n$ as the number of contigs
containing single-copy marker genes and the total number of assembled contigs, respectively. Let $H$ denote the NormCC-normalized Hi-C contact matrix, where the entry $H_{ij}$ represents the normalized Hi-C
contacts between contigs $i$ and $j$. We first set all diagonal entries of $H$ to zero and reorganize the matrix
$H$ by moving the contigs containing marker genes to the first $m$ rows and $m$ columns consistently, and
denote the reorganized matrix as $H'$. Then, the reorganized matrix $H'$ is further normalized by its row
sums; let $M$ denote the matrix after the row-sum normalization, i.e.,
\[
M_{ij} = \frac{H'_{ij}}{\sum_{k} H'_{ik}}. \tag{4.1}
\]
We use $N^{(t)}$ to represent the matrix after the $t$-th iteration of random walk with restart and require that all
random walks start only from the contigs with marker genes. Mathematically, the random walk starts
from the initial matrix
\[
N^{(0)} = \begin{bmatrix} I_{m\times m} & 0_{m\times(n-m)} \\ 0_{(n-m)\times m} & 0_{(n-m)\times(n-m)} \end{bmatrix}_{n\times n},
\]
and $N^{(t)}$ is computed recursively by the following:
\[
N^{(t)} = (1-p)\cdot N^{(t-1)}\cdot M + p\cdot T, \tag{4.2}
\]
where $T = N^{(0)}$ denotes the restarting matrix, and $p$ (default, 0.5) serves as the restarting probability used
to maintain a balance between the influence of global and local network structures. Notably, since the last
$n-m$ rows of all iteration matrices $N^{(t)}$ remain zero, Equation 4.2 can be simplified by omitting
the last $n-m$ rows of $N^{(t)}$ and $T$. As a result, the new RWR can be represented as
\[
\tilde{N}^{(0)} = \tilde{T} = [\, I_{m\times m} \mid 0_{m\times(n-m)} \,]_{m\times n}, \qquad
\tilde{N}^{(t)} = (1-p)\cdot \tilde{N}^{(t-1)}\cdot M + p\cdot \tilde{T}. \tag{4.3}
\]
To avoid the imputed matrix becoming too dense, we only retain the largest $\tau$ percent (default, 20) of
non-zero entries in $\tilde{N}^{(t)}$ after each iteration, i.e.,
\[
\tilde{N}^{(t)} = \tilde{N}^{(t)} \circ \mathbb{1}_{\{\tilde{N}^{(t)} > C^{\tau}_{t}\}}, \tag{4.4}
\]
where $C^{\tau}_{t}$ is the $(100-\tau)$-th percentile of all non-zero entries in $\tilde{N}^{(t)}$; $\mathbb{1}$ represents an indicator matrix with
$\mathbb{1}_{ij} = 1$ only if $\tilde{N}^{(t)}_{ij} > C^{\tau}_{t}$; and $\circ$ denotes element-wise matrix multiplication.
Let $\delta_t = \|\tilde{N}^{(t)} - \tilde{N}^{(t-1)}\|_2$. The iteration ends if either of the following two conditions is satisfied:
• $\delta_t < 0.01$,
• $\delta_t - \delta_{t-1} < 0.001$ for five consecutive iterations (early stop).
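The imputation loop defined by Equations 4.1 to 4.4 and the two stopping rules can be sketched in dependency-free Python as follows. This is an illustrative sketch rather than the ImputeCC implementation, and several details are assumptions: the function name `crwr_impute`, the use of the Frobenius norm for $\delta_t$, a nearest-rank percentile for $C^{\tau}_{t}$, the absolute difference $|\delta_t - \delta_{t-1}|$ in the early-stop test, and the `max_iter` safety cap. Setting `tau=100` skips the sparsification step, which is convenient for very small toy matrices where ties at the cutoff would otherwise remove every entry.

```python
import math

def crwr_impute(H, marker_idx, p=0.5, tau=20.0, tol=0.01, max_iter=50):
    """Constrained random walk with restart on a dense contact matrix (list of lists)."""
    n, m = len(H), len(marker_idx)
    marker_set = set(marker_idx)
    # Reorder so that marker-gene contigs come first, and zero the diagonal (H').
    order = list(marker_idx) + [i for i in range(n) if i not in marker_set]
    Hp = [[0.0 if a == b else float(H[a][b]) for b in order] for a in order]
    # Eq. 4.1: row-sum normalization (all-zero rows are left as zeros).
    M = []
    for row in Hp:
        s = sum(row)
        M.append([v / s if s > 0 else 0.0 for v in row])
    # Eq. 4.3: walks start only from the m marker-gene contigs.
    T = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(m)]
    N = [row[:] for row in T]
    prev_delta, stalled = None, 0
    for _ in range(max_iter):
        new = [[(1 - p) * sum(N[i][k] * M[k][j] for k in range(n)) + p * T[i][j]
                for j in range(n)] for i in range(m)]
        # Eq. 4.4: keep only entries above the (100 - tau)-th percentile.
        nonzero = sorted(v for row in new for v in row if v > 0)
        if tau < 100 and nonzero:
            rank = max(1, math.ceil((100 - tau) / 100 * len(nonzero)))
            cutoff = nonzero[rank - 1]
            new = [[v if v > cutoff else 0.0 for v in row] for row in new]
        delta = math.sqrt(sum((a - b) ** 2
                              for ra, rb in zip(new, N) for a, b in zip(ra, rb)))
        N = new
        if delta < tol:
            break
        if prev_delta is not None and abs(delta - prev_delta) < 0.001:
            stalled += 1
            if stalled >= 5:
                break
        else:
            stalled = 0
        prev_delta = delta
    # The first m columns give the imputed matrix among marker-gene contigs.
    return [row[:m] for row in N]

# Toy case: marker contigs 0 and 1 share no direct contact but both touch contig 2.
H_toy = [[0, 0, 5], [0, 0, 5], [5, 5, 0]]
P_toy = crwr_impute(H_toy, marker_idx=[0, 1], tau=100)
```

In the toy case, the two marker-gene contigs have no raw Hi-C contact, yet the walk through contig 2 gives them a positive imputed contact, which is the behavior the imputation is designed to produce.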
Let $\hat{N}$ denote the final matrix output from the imputation. Then the first $m$ columns of $\hat{N}$, denoted by
$P_{m\times m}$, represent the imputed Hi-C matrix for contigs with marker genes. Finally, we transform
the matrix $P$ to a symmetric matrix $P'$ and further normalize $P'$ to eliminate the contigs’ coverage biases
derived from the imputation using the Square Root Vanilla Coverage (sqrtVC) method [123], i.e.,
\[
P' = P + P^{T}, \qquad Q = D^{-\frac{1}{2}} P' D^{-\frac{1}{2}}, \tag{4.5}
\]
where $D$ is a diagonal matrix in which each element $D_{ii}$ is the sum of the $i$-th row of $P'$.
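As a small worked illustration of Equation 4.5, the symmetrization and sqrtVC normalization can be written as below. The function name is hypothetical, and returning zero for rows whose sum is zero is an assumption added to avoid division by zero.

```python
import math

def sqrt_vc_normalize(P):
    """Symmetrize P as P + P^T, then scale entry (i, j) by 1 / sqrt(D_ii * D_jj)."""
    m = len(P)
    sym = [[P[i][j] + P[j][i] for j in range(m)] for i in range(m)]
    d = [sum(row) for row in sym]  # D_ii: row sums of the symmetric matrix
    return [[sym[i][j] / math.sqrt(d[i] * d[j]) if d[i] > 0 and d[j] > 0 else 0.0
             for j in range(m)] for i in range(m)]

Q_toy = sqrt_vc_normalize([[0.0, 2.0], [0.0, 0.0]])
```

Here the asymmetric input becomes `[[0, 2], [2, 0]]` after symmetrization, both row sums equal 2, and the off-diagonal entries of the result are `2 / sqrt(2 * 2) = 1.0`.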
4.2.3.3 Precluster contigs with marker genes as preliminary bins
Leveraging the imputed Hi-C matrix Q as well as the characteristics of single-copy marker genes, we would
like to accurately precluster contigs with marker genes as preliminary bins. Specifically, we first sort all
categories of detected marker genes by the number of contigs containing the marker genes. If several
marker genes correspond to the same number of contigs, they are further sorted by the gene length. Then,
we use a greedy strategy to iteratively construct the preliminary bins as follows:
• Initialization: choose all contigs from the first marker gene and initialize the preliminary bin set, denoted
by $\mathcal{B}$, with each bin containing one contig.
• Iteration: in the $k$-th iteration, we select all contigs containing the $k$-th marker gene and only handle
contigs that have not been assigned to any preliminary bin in $\mathcal{B}$. Let $C$ denote the set of contigs
to be processed in the iteration. We then define the contig-to-bin Hi-C similarity between a contig
$c \in C$ and a bin $B \in \mathcal{B}$ as
\[
S_{c,B} = \frac{\sum_{c_1 \in B} Q_{c,c_1}}{\#B}, \tag{4.6}
\]
where $c_1$ denotes a contig in the preliminary bin $B$, $Q_{c,c_1}$ is the imputed Hi-C contact between
contigs $c$ and $c_1$, and $\#B$ represents the number of contigs in $B$. In this way, we can construct an
undirected bipartite graph, where the top nodes are contigs from the set $C$ and the bottom nodes are
preliminary bins from the set $\mathcal{B}$. The weighted edges between top nodes and bottom nodes represent
the contig-to-bin Hi-C similarity. To assign the contigs to preliminary bins, we leverage Karp’s
algorithm [78] to find a maximum-weight matching between contigs and preliminary bins. For each
contig in the set $C$ with a matching preliminary bin, if the contig-to-bin Hi-C similarity is above
the median of non-zero entries in the imputed matrix $Q$, we attribute the contig to its matching
preliminary bin; otherwise, the contig is discarded. Finally, we add all unmatched contigs to $\mathcal{B}$
as new preliminary bins, with each new bin containing one unmatched contig.
• Repeat the iteration step until all marker genes are processed.
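The greedy procedure above can be sketched as follows. This is a simplified illustration under stated assumptions, not the ImputeCC code: the maximum-weight matching is found by exhaustive enumeration, which stands in for Karp’s algorithm and is only practical for a handful of contigs; the marker-gene lists are assumed to be already sorted as described; and all function names are hypothetical.

```python
from itertools import combinations, permutations
from statistics import median

def similarity(c, bin_contigs, Q):
    """Eq. 4.6: mean imputed contact between contig c and the contigs of one bin."""
    return sum(Q[c][c1] for c1 in bin_contigs) / len(bin_contigs)

def max_weight_matching(contigs, bins, Q):
    """Exhaustive search over matchings (stand-in for a real matching algorithm)."""
    sim = {(c, b): similarity(c, bins[b], Q)
           for c in contigs for b in range(len(bins))}
    k = min(len(contigs), len(bins))
    best, best_weight = [], -1.0
    for chosen in permutations(contigs, k):
        for targets in combinations(range(len(bins)), k):
            weight = sum(sim[c, b] for c, b in zip(chosen, targets))
            if weight > best_weight:
                best_weight = weight
                best = [(c, b, sim[c, b]) for c, b in zip(chosen, targets)]
    return best

def precluster(marker_contigs, Q):
    """marker_contigs: one list of contig indices per marker gene, already sorted."""
    nonzero = [v for row in Q for v in row if v > 0]
    threshold = median(nonzero) if nonzero else 0.0
    bins = []
    for contigs in marker_contigs:
        binned = {c for b in bins for c in b}
        todo = [c for c in contigs if c not in binned]
        if not bins:
            bins = [[c] for c in todo]  # initialization: one contig per bin
            continue
        matched = max_weight_matching(todo, bins, Q)
        for c, b, s in matched:
            if s > threshold:
                bins[b].append(c)  # matched contigs below the threshold are discarded
        matched_contigs = {c for c, _, _ in matched}
        bins.extend([c] for c in todo if c not in matched_contigs)
    return bins

# Contigs 0 and 1 carry marker gene A; contigs 2 and 3 carry marker gene B.
Q_toy = [[0.0, 0.0, 0.9, 0.1],
         [0.0, 0.0, 0.1, 0.8],
         [0.9, 0.1, 0.0, 0.0],
         [0.1, 0.8, 0.0, 0.0]]
toy_bins = precluster([[0, 1], [2, 3]], Q_toy)
```

In the toy matrix the median non-zero entry is 0.45, so contig 2 joins the bin of contig 0 (similarity 0.9) and contig 3 joins the bin of contig 1 (similarity 0.8).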
4.2.3.4 Leiden clustering for all contigs using the information of preliminary bins
We apply the Leiden community detection algorithm [151] to the NormCC-normalized Hi-C contact matrix H to cluster all assembled contigs, using the preliminary bin set as an initial framework. The Leiden
algorithm iteratively merges and refines communities to maximize modularity, a metric that quantifies
the partitioning quality. To incorporate preliminary bin information, we initialize contig memberships
based on preliminary bins, ensuring that contigs from the same preliminary bin are placed within the
same community, while contigs not associated with any preliminary bins are initially assigned to individual communities. Throughout the Leiden iterations, these assignments for contigs from preliminary bins
remain fixed. Consequently, contigs from the same preliminary bin coalesce into the same cluster, while
those from different preliminary bins form distinct clusters after the Leiden clustering.
Moreover, since the Leiden algorithm is modularity-based, we select a flexible modularity function
based on the Reichardt and Bornholdt’s Potts model [124]. Notably, the resolution parameter r in the
modularity function (Supplementary materials) is a hyper-parameter that determines the relative importance assigned to the configuration null part compared to the links within the communities. To ascertain
the optimal resolution parameter, we conduct parallel executions of the Leiden algorithm using various
resolution values and automatically select the most favorable outcome. Specifically, we identify lineage-specific genes, which act as indicators of genome quality, through the application of the CheckM (v1.1.3)
[119] function ‘checkm analyze’. Consequently, for any given contig bin, we employ the same evaluation
strategy as CheckM to efficiently estimate its precision and recall (Supplementary materials). Subsequently,
for each resolution parameter value, we count the number of genomic bins with precision exceeding 95%
and recall surpassing 90%, 70%, and 50%, respectively. Finally, we automatically select the resolution value
that maximizes the sum of three count numbers as the optimal choice.
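The selection rule for the resolution parameter can be illustrated with a short sketch. It assumes that each candidate resolution has already been run and that every resulting bin has an estimated `(precision, recall)` pair in percent; the function names and the toy values are hypothetical, and ties are resolved here simply by taking the first maximum.

```python
def count_quality_bins(bin_stats, min_precision=95.0, recall_cutoffs=(90.0, 70.0, 50.0)):
    """Sum of the three counts: bins with precision > 95% and recall > 90%, 70%, 50%."""
    return sum(
        sum(1 for precision, recall in bin_stats
            if precision > min_precision and recall > cutoff)
        for cutoff in recall_cutoffs
    )

def select_resolution(results):
    """results: dict mapping a resolution value to a list of (precision, recall)."""
    return max(results, key=lambda r: count_quality_bins(results[r]))

toy_results = {
    1: [(99.0, 95.0), (96.0, 60.0)],                # 3 + 1 = 4
    5: [(99.0, 95.0), (97.0, 75.0), (96.0, 55.0)],  # 3 + 2 + 1 = 6
    10: [(90.0, 95.0)],                             # precision too low: 0
}
best_resolution = select_resolution(toy_results)
```

A bin with recall above 90% contributes to all three counts, so the rule rewards nearly complete bins more strongly than partially complete ones.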
4.2.3.5 Integrative strategy to obtain the final bins
It is essential to acknowledge that the preliminary bins may not be entirely accurate. This can occur, for instance, in cases where genome coverage is insufficient or marker genes are fragmented into several pieces.
Furthermore, our clustering strategy in previous steps may exacerbate these mis-binnings arising from the
preliminary bin assignments. Consequently, it is still meaningful to apply the Leiden algorithm to cluster
contigs independently, without relying on the preliminary bin information. The selection of the resolution
parameter follows the same methodology as previously described. We denote the resulting bin sets as Fpre
and Fnull for the Leiden clustering with and without preliminary bin information, respectively. We then
implement an iterative greedy strategy to integrate these two bin sets. Specifically, in each iteration of this
integrative procedure, we assess the quality of all existing MAGs from Fpre and Fnull using the metric:
Recall − 2 × (100 − Precision). (4.7)
The MAG displaying the highest estimated quality across both bin sets is selected for further consideration.
In situations where two or more MAGs exhibit identical estimated quality scores, ties are resolved by
selecting the MAG with the greatest N50 statistic and bin size. Following the selection of a MAG, it is
moved from the corresponding bin set to the final bin set, and any contigs belonging to the selected MAG
are also removed from the other bin set, if present. This iterative procedure continues until the highest
quality MAG identified falls below 10. Finally, we can obtain the final bin set through the integration.
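The integration loop can be sketched as below. This is a minimal illustration, assuming a caller-supplied `evaluate` function that returns `(recall, precision)` in percent for any bin (a stand-in for the CheckM-style estimate); the tie-breaking by N50 and bin size is omitted, and all names and the toy data are hypothetical.

```python
def quality(recall, precision):
    """Eq. 4.7: Recall - 2 x (100 - Precision)."""
    return recall - 2 * (100 - precision)

def integrate(f_pre, f_null, evaluate, min_quality=10.0):
    """Greedily merge two bin sets (lists of contig sets) into one final bin set."""
    pools = [[set(b) for b in f_pre], [set(b) for b in f_null]]
    final = []
    while True:
        candidates = [(quality(*evaluate(b)), pi, bi)
                      for pi, pool in enumerate(pools)
                      for bi, b in enumerate(pool) if b]
        if not candidates:
            break
        score, pi, bi = max(candidates, key=lambda t: t[0])
        if score < min_quality:
            break
        chosen = pools[pi].pop(bi)
        final.append(chosen)
        # Remove the selected contigs from the bins of the other bin set.
        for b in pools[1 - pi]:
            b -= chosen
    return final

# Toy ground truth: two genomes with two contigs each.
truth = {"c1": "g1", "c2": "g1", "c3": "g2", "c4": "g2"}

def toy_evaluate(contigs):
    counts = {}
    for c in contigs:
        counts[truth[c]] = counts.get(truth[c], 0) + 1
    top = max(counts.values())
    return 100.0 * top / 2, 100.0 * top / len(contigs)  # (recall, precision)

final_bins = integrate([{"c1", "c2", "c3"}, {"c4"}],
                       [{"c1", "c2"}, {"c3", "c4"}], toy_evaluate)
```

Because bins in the other set shrink after each selection, the sketch re-evaluates every remaining bin in each iteration.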
4.2.4 Evaluating the quality of recovered MAGs from the mock and real metaHi-C
datasets
For the mock metaHi-C datasets, where all species within the mock microbial community were known, the
species identity of the assembled contigs could be determined (Supplementary materials). Then, we can
define the completeness and contamination of each MAG recovered from the mock datasets. Specifically, for each MAG, we segregated the lengths of contigs according to their respective reference genomes
and attributed the MAG to the reference genome with the largest cumulative contig length, denoted as
$L(q)$. The length of the corresponding reference genome was denoted as $L(r)$, and the total length of the
MAG was referred to as $L(v)$. The completeness of a MAG was quantified as $L(q)/L(r)$, while the contamination
of a MAG was defined as $(L(v) - L(q))/L(v)$. Finally, we classified high-quality genomes obtained from the mock
datasets as those MAGs with completeness ≥ 90% and contamination ≤ 5%.
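These two definitions translate directly into code. The sketch below assumes that each contig has already been assigned a reference genome and that lengths are given in base pairs; the function name, the dictionary layout, and the toy numbers are hypothetical.

```python
from collections import defaultdict

def mag_quality(contigs, contig_genome, contig_length, genome_length):
    """Return (reference genome, completeness, contamination) for one MAG."""
    per_genome = defaultdict(int)
    for c in contigs:
        per_genome[contig_genome[c]] += contig_length[c]
    # L(q): largest cumulative contig length among the reference genomes.
    ref, l_q = max(per_genome.items(), key=lambda kv: kv[1])
    l_v = sum(per_genome.values())             # L(v): total MAG length
    completeness = l_q / genome_length[ref]    # L(q) / L(r)
    contamination = (l_v - l_q) / l_v          # (L(v) - L(q)) / L(v)
    return ref, completeness, contamination

toy_ref, toy_comp, toy_cont = mag_quality(
    contigs=["c1", "c2"],
    contig_genome={"c1": "gA", "c2": "gB"},
    contig_length={"c1": 900, "c2": 50},
    genome_length={"gA": 1000, "gB": 2000},
)
```

In this toy case the MAG is attributed to `gA` with completeness 0.90 and contamination 50/950 (about 5.3%), so it would narrowly miss the high-quality threshold on contamination.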
For the real metaHi-C datasets, since the actual genomes are unknown in real samples, we applied
CheckM2 [25] to evaluate the completeness and contamination of retrieved MAGs. CheckM2 is an advanced machine learning-based method for assessing the quality of draft genomic bins, offering improved
accuracy and computational speed compared to existing tools [25]. Based on the CheckM2 assessments of
completeness and contamination, we categorized the resolved MAGs from real metaHi-C datasets as high-quality if their completeness ≥ 90% and contamination ≤ 5%, while MAGs were designated as medium-quality if their completeness ≥ 50% and contamination ≤ 10%.
4.2.5 MAG analyses on real metaHi-C datasets
To assess the capacity of various binning methods in capturing taxonomic diversity within real metaHi-C
datasets, we performed taxonomic annotation on all high-quality and medium-quality bins using GTDB-TK (v2.1.0, Release: R207 v2) [22] with the function ‘classify_wf’ to extract the taxonomic information of
the MAGs recovered by different binning methods.
Furthermore, to identify overlapping high-quality bins retrieved from the sheep gut long-read metaHi-C dataset between ImputeCC binning and other Hi-C-based binning approaches, we utilized Mash (v2.2)
[115] with 10,000 sketches per bin to calculate the Mash distance between high-quality bins from different
bin sets. Bins with a Mash distance below 0.01 were considered MAGs originating from the same genome.
4.2.6 Other binners used in benchmarking
All binners used for comparison, i.e., VAMB (v3.0.3) [110], HiCBin (v1.1.0) [34], MetaTOR (v1.1.4) [9],
bin3C (v0.1.1) [29], and MetaCC (v1.1.0) [36] were executed with default parameters on all mock and real
metaHi-C datasets.
4.3 Results
4.3.1 Overview of ImputeCC
ImputeCC is an integrative Hi-C-based binner that leverages the combined power of Hi-C interactions
and single-copy marker genes in the contig binning process. Figure 4.1 shows the outline of ImputeCC.
The core concept of ImputeCC involves the preclustering of marker-gene-containing contigs guided by
two fundamental principles: I) Contigs sharing the same single-copy marker gene originate from distinct
species with high probability; II) Contigs without overlapping single-copy marker genes are likely from
the same genome when connected by robust Hi-C signals. To address the challenge that marker-gene-containing contigs from the same genome may not be effectively linked by Hi-C contacts due to the locality
characteristics of proximity ligations, we design a new constrained random walk with restart (CRWR)
algorithm to impute the metaHi-C contact matrix before preclustering, with all random walks limited
to start from marker-gene-containing contigs. Subsequently, by leveraging the imputed Hi-C matrix in
conjunction with the aforementioned principles, ImputeCC can accurately precluster contigs with single-copy marker genes, establishing them as preliminary bins. Finally, the tool applies Leiden clustering [151]
to group all assembled contigs, utilizing the information from preliminary bins to optimize the binning
process.
[Figure 4.1 image: ImputeCC pipeline workflow. Panel labels: shotgun reads, Hi-C reads (aligned to contigs), assembled contigs, Hi-C contact matrix, single-copy marker genes, rearranged Hi-C matrix, CRWR imputation (Step 1), imputed Hi-C matrix, preclustering into preliminary bins (Step 2), integrative binning into final bins of all contigs (Step 3).]
Figure 4.1: Overview of the ImputeCC. Given an input of the metagenomic Hi-C contact matrix and contigs
containing single-copy marker genes, ImputeCC initiates the imputation of the metaHi-C contact matrix
using a new constrained random walk with restart (CRWR) algorithm, specifically limiting random walks
to originate from contigs with marker genes. Subsequently, ImputeCC segregates and retains the imputed
contact matrix exclusively for marker-gene-containing contigs, using it in conjunction with the characteristics of single-copy marker genes to effectively precluster these contigs as preliminary bins. Finally,
ImputeCC applies the Leiden clustering method to group all assembled contigs, with insights from the
preliminary bins guiding the optimization of the binning process.
4.3.2 ImputeCC achieved accurate preclustering for contigs with single-copy marker
genes
Since ImputeCC relies on the information provided by preliminary bins for final contig clustering, the quality of these preliminary bins, as established during the preclustering step, holds a pivotal role in affecting
the final binning results of ImputeCC. Mock metaHi-C datasets were created by combining simulated Hi-C reads with real shotgun sequencing data from a manually curated microbial community. The shotgun
data were obtained from the Illumina HiSeq 3000, ONT MinION R9, and PacBio Sequel II platforms. These
datasets, named ‘mock Illumina’, ‘mock Nanopore’, and ‘mock PacBio’, each comprised a combination of
simulated Hi-C reads and real shotgun reads corresponding to the specific sequencing platform. Since
the ground truth of all contigs from the mock metaHi-C datasets was known, we could leverage the
mock datasets to assess the quality of the preclustering of preliminary bins. Specifically, we calculated the
Adjusted Rand Index (ARI) clustering evaluation metric (Supplementary materials) for preliminary bins
derived from the mock Illumina, Nanopore, and PacBio datasets, resulting in values of 0.976, 0.975, and
0.988, respectively (Figure 4.2a). These values indicated that ImputeCC could accomplish precise preclustering for contigs with single-copy marker genes. Furthermore, we performed preclustering directly using
NormCC-normalized Hi-C contacts, omitting the imputation step. In this context, the ARI values for preliminary bins derived from the three mock datasets were decreased to 0.783, 0.903, and 0.775, respectively
(Figure 4.2a), underscoring the significant enhancement in the construction of preliminary bins achieved
through our CRWR imputation.
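For readers unfamiliar with the metric, the Adjusted Rand Index can be computed from two label vectors as sketched below. The exact definition used in this work is given in the Supplementary materials; the code follows the standard pair-counting formulation and is assumed, not confirmed, to match it. The function name and toy labels are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """Standard ARI between a ground-truth labeling and a predicted clustering."""
    n = len(truth)
    # Pairs of items placed together in both partitions, in truth only, in pred only.
    sum_both = sum(comb(v, 2) for v in Counter(zip(truth, pred)).values())
    sum_truth = sum(comb(v, 2) for v in Counter(truth).values())
    sum_pred = sum(comb(v, 2) for v in Counter(pred).values())
    expected = sum_truth * sum_pred / comb(n, 2)
    max_index = (sum_truth + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every item in one shared cluster
    return (sum_both - expected) / (max_index - expected)

ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # relabeled but identical
ari_cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # every true pair is split
```

The index equals 1 for identical partitions regardless of label names, and it is close to 0 (or negative) when the agreement is no better than chance.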
[Figure 4.2 image: (a) ARI scores of preliminary bins with and without imputation on the mock Illumina, mock Nanopore, and mock PacBio datasets. (b) Number of high-quality MAGs recovered by VAMB, MetaTOR, bin3C, MetaCC, and ImputeCC, respectively: mock Illumina 22, 26, 24, 25, 38; mock Nanopore 14, 17, 18, 15, 23; mock PacBio 16, 15, 15, 12, 36.]
Figure 4.2: Benchmarking using the three mock metaHi-C datasets. (a) Assessing the quality of preliminary
bins using ARI. ImputeCC accurately grouped marker-gene-containing contigs while the CRWR imputation markedly improved the preclustering performance. (b) ImputeCC outperformed other binners on all
the three mock metaHi-C datasets with respect to the number of retrieved high-quality MAGs (completeness ≥ 90% and contamination ≤ 5%).
4.3.3 ImputeCC retrieved the most high-quality genomes from the mock metaHi-C
datasets
We first conducted a comparative evaluation of ImputeCC binning against VAMB [110], MetaTOR [9],
bin3C [29], and the MetaCC binning module (referred to as MetaCC) [36] using the three mock metaHi-C
datasets. In addition to VAMB, a popular shotgun-based binning tool that utilizes sequence composition
and coverage information, three other tools in consideration are Hi-C-based. It is important to note that
another publicly available Hi-C-based binner HiCBin [34] was excluded from the benchmarking study on
the mock datasets due to its inability to converge when applied to the mock Nanopore and PacBio datasets.
As shown in Figure 4.2b, ImputeCC demonstrated a remarkable ability to reconstruct a markedly larger
number of high-quality genomes (completeness ≥ 90% and contamination ≤ 5%) across all the three mock
datasets. Specifically, ImputeCC outperformed the second-highest result by 46.2%, 27.8%, and 125% in terms
of high-quality genome reconstruction for the mock Illumina, Nanopore, and PacBio datasets, respectively.
Notably, the number of mapped Hi-C read pairs for the mock Nanopore dataset was considerably lower in
comparison to the mock Illumina and PacBio datasets (Supplementary Table 7.14), which can be attributed
to the relatively higher error rate associated with Nanopore R9 long reads. This disparity in read mapping
could be one of the contributing factors for ImputeCC retrieving a comparatively lower number of high-quality genomes from the mock Nanopore dataset. Finally, we evaluated ImputeCC’s stability against
Hi-C sequencing depth by downsampling the Hi-C read pairs from 10 million to 5 million in the mock
datasets. The recovery of high-quality MAGs slightly declined from 38 to 36 in the Illumina dataset and
from 23 to 21 in the Nanopore dataset, while the PacBio dataset consistently yielded 36 MAGs. These
results highlighted ImputeCC’s resilience to reduced Hi-C read counts, ensuring its reliable performance
in the mock metaHi-C datasets.
4.3.4 ImputeCC markedly outperformed existing binners on real metaHi-C datasets
To validate ImputeCC on real metaHi-C data, we applied it to two short-read and two long-read metaHi-C
datasets from four different environments: human gut, wastewater, cow rumen, and sheep gut. Here, we
compared ImputeCC to all four publicly-available Hi-C-based binners, namely HiCBin, MetaTOR, bin3C,
and MetaCC, in addition to VAMB. Given the absence of reference genomes in real-world datasets, we
utilized the CheckM2 [25] to evaluate the completeness and contamination of the recovered bins. The
results from the two long-read metaHi-C datasets are presented in Figure 4.3, while those from the two
short-read metaHi-C datasets can be found in Supplementary Figure 7.4. In all cases, ImputeCC recovered
more high-quality (completeness ≥ 90% and contamination ≤ 5%) and medium-quality (completeness ≥
50% and contamination ≤ 10%) bins than the alternatives considered. Notably, the sheep gut long-read
metaHi-C dataset, owing to its high complexity, posed a greater challenge. ImputeCC binning retrieved
408 high-quality MAGs, markedly outperforming VAMB, HiCBin, MetaTOR, bin3C, and MetaCC with an
increase of 235 (135.8%), 321 (369%), 279 (216.3%), 160 (64.5%), and 82 (25.2%), respectively (Figure 4.3a).
ImputeCC was also able to recover 125.8%, 279.8%, 91.1%, 120.1% and 23.1% more medium-quality bins than
VAMB, HiCBin, MetaTOR, bin3C, and MetaCC, respectively (Figure 4.3b).
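The reported percentages for high-quality MAGs are relative to each competing binner's own count, which can be checked with a few lines of arithmetic. The baseline counts below are not stated in the text; they are implied by subtracting each reported increase from ImputeCC's 408 high-quality MAGs, and the helper name is hypothetical.

```python
def relative_gain(new_count, gain):
    """Return (implied baseline count, percentage increase over that baseline)."""
    baseline = new_count - gain
    return baseline, round(100 * gain / baseline, 1)

imputecc_hq = 408
reported_gains = {"VAMB": 235, "HiCBin": 321, "MetaTOR": 279, "bin3C": 160, "MetaCC": 82}
summary = {tool: relative_gain(imputecc_hq, gain) for tool, gain in reported_gains.items()}
```

For example, an increase of 235 over VAMB implies 173 high-quality MAGs for VAMB, and 235/173 gives the reported 135.8%.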
Moreover, we explored the capability of different binners to capture the species diversity in microbial
samples by annotating all medium-quality and high-quality bins generated by different binners on all
real metaHi-C datasets using GTDB-TK [22]. As shown in Figure 4.3c and Supplementary Figure 7.4c,
medium-quality bins derived from ImputeCC represented a markedly larger taxonomic diversity at the
species level on all datasets. We further conducted a detailed comparative analysis of the high-quality
MAGs retrieved from the sheep gut long-read metaHi-C dataset. We employed Mash [115] to identify
cases where ImputeCC binning and three other Hi-C-based binning tools (MetaTOR, bin3C, and MetaCC)
retrieved identical high-quality MAGs on the sheep gut long-read metaHi-C dataset. Notably, the majority
of high-quality MAGs obtained through other Hi-C-based binning tools were also successfully recovered
by ImputeCC (Supplementary Figure 7.5a). In contrast, ImputeCC binning went beyond by reconstructing
a substantial number of high-quality MAGs that remained inaccessible to the other binning tools. Further
annotation analyses of the high-quality MAGs demonstrated ImputeCC recovered more distinct taxa at
various taxonomic levels compared to Hi-C-based alternatives, including bin3C, MetaTOR, and MetaCC
(Supplementary Figure 7.5b).
Finally, ImputeCC’s analysis at the genus level, leveraging its recovered high-quality MAGs, has unveiled significant insights into microbial composition of the sheep gut microbiota (Supplementary materials). Within this complex ecosystem, ImputeCC highlighted the dominance of the Bacteroides genus,
known for influencing intestinal immunity [129, 164], and uniquely detected critical species within it,
such as Bacteroides uniformis and Bacteroides vulgatus. It was also the only tool to uncover the Tidjanibacter genus and extensively characterized the Alistipes genus, revealing species with potential roles in
the sheep gut ecosystem and suggesting a broader species diversity. These capabilities demonstrate ImputeCC’s unparalleled contribution to elucidating the sheep gut’s microbial composition and its functional
significance.
4.3.5 Running time analysis of the ImputeCC
On an Intel Xeon Processor E5-2665 with a clock speed of 2.40 GHz and 50 GB of allocated memory, the
ImputeCC pipeline spent 64 min, 204 min, 25 min, and 2,115 min on the human gut short-read, wastewater
short-read, cow rumen long-read, and sheep gut long-read metaHi-C datasets, respectively.
4.4 Discussions
In this work, we developed ImputeCC, an integrative Hi-C-based contig binning method. ImputeCC
combines Hi-C interactions with the intrinsic discriminative potential of single-copy marker genes by
preclustering marker-gene-containing contigs as preliminary bins. To enhance the Hi-C connectivity of
[Figure 4.3 image: three panels (a, b, c) of bar charts comparing VAMB, HiCBin, MetaTOR, bin3C, MetaCC, and ImputeCC on the cow rumen and sheep gut long-read metaHi-C datasets. Axes: binning method versus number of bins (a, b) and number of species (c).]
Figure 4.3: Benchmarking using the real cow rumen and sheep gut long-read metaHi-C datasets. (a) The
number of MAGs with varying completeness (comp) and contamination (cont) ≤ 5%. ImputeCC consistently outperforms other binning tools, producing a greater number of high-quality bins in both long-read
metaHi-C datasets. (b) The number of MAGs with varying completeness and contamination ≤ 10%. ImputeCC returned more medium-quality bins when compared to alternative methods for both datasets. (c)
Comparative analysis of the taxonomic diversity at the species level within medium-quality bins obtained
by different binning tools. ImputeCC’s binning approach stands out by capturing the broadest range of
microbial species in medium-quality MAGs.
marker-gene-containing contigs, ImputeCC introduces a constrained random walk with restart (CRWR)
approach to impute the metaHi-C contact matrix. Finally, ImputeCC employs Leiden clustering to group
all assembled contigs, optimizing the binning process by leveraging information from the preliminary bins.
Evaluations of ImputeCC using a wide range of diverse mock/real metaHi-C datasets have demonstrated
its effectiveness for retrieving reference-quality MAGs and shown its potential to unravel the structure
of microbial ecosystems and their resident microorganisms. Notably, we utilized CheckM2 in assessing
the binning performance for the four real metaHi-C datasets. Although CheckM2 represents the most
advanced software for evaluating bin quality in real metagenomic samples, it is essential to delve further
into the accuracy of this machine-learning-based validation method in reflecting the true completeness
and contamination levels of the recovered MAGs. Moreover, previous research has established the efficacy
of Hi-C-based binning over shotgun-based approaches [29, 34]. Accordingly, our benchmarking analyses
focus on Hi-C-based methods, comparing ImputeCC with similar tools and including VAMB as a reference
shotgun-based method.
ImputeCC offers several promising avenues for expansion. For instance, when dealing with large
MAGs characterized by high abundances, there is potential in imputing normalized Hi-C contacts for
contigs within these MAGs to facilitate the scaffolding process. Moreover, exploring imputation methods
that consider additional information, such as the sequence composition of contigs, could yield improved
imputation results.
Chapter 5
ViralCC retrieves complete viral genomes and virus-host pairs from
metagenomic Hi-C data
In this chapter, we introduce ViralCC, a publicly available tool to recover complete viral genomes and detect
virus-host pairs using Hi-C data. Compared to other Hi-C-based methods, ViralCC leverages the virus-host proximity structure as a complementary information source for the Hi-C interactions. Using mock
and real metagenomic Hi-C datasets from several different microbial ecosystems, including the human
gut, cow fecal, and wastewater, we demonstrate that ViralCC outperforms existing Hi-C-based binning
methods as well as state-of-the-art tools specifically dedicated to metagenomic viral binning. ViralCC
can also reveal the taxonomic structure of viruses and virus-host pairs in microbial communities. When
applied to a real wastewater metagenomic Hi-C dataset, ViralCC constructs a phage-host network, which
is further validated using CRISPR spacer analyses.
5.1 Introduction
Viruses constitute the most divergent and ubiquitous biological organisms on Earth, with an estimated global
abundance of $10^{31}$ [16]. Viruses have enormous impacts on ecosystems as predators and/or parasites
within microbial communities through the lysogenic or lytic cycle infecting bacteria and archaea [50,
147]. For instance, viruses contribute significantly to the biogeochemical cycling of carbon and nitrogen
in aquatic habitats [44, 71] and are implicated in certain diseases such as inflammatory bowel disease
and severe acute malnutrition in human systems [111, 126]. Therefore, the interest in viromics has risen
dramatically in the past two decades.
Since the number of viruses that can be traditionally cultivated in the laboratory is too limited to
assess viral diversity [116], metagenomics, as a culture-independent sampling strategy, has been widely
exploited to recover viral genomes and to identify the hosts of these newly discovered viruses, one of
the most difficult aspects of studying viruses in microbial communities [38, 53, 54]. Metagenomic whole
genome shotgun sequencing (WGS) directly extracts genomic fragments from various environmental samples, generating a large number of short reads that are subsequently assembled into contigs [3, 90, 112].
Metagenomic viral contigs are then identified from large assemblies based on sequence composition, sequence similarity, and/or detection of viral proteins [82, 125, 131]. However, viral genome assembly from
shotgun reads is challenging [141] and short viral contigs may only represent segments of entire viral
genomes [46]. Incomplete viral fragments have a significantly adverse impact on the downstream analyses including the characterization of the underlying viral diversity and abundance, prediction of host and
functional capacity [130, 154]. Therefore, metagenomic viral binning, defined as a process to group viral
contigs from the same species into viral metagenome-assembled genomes (vMAGs), is valuable, especially
for giant viruses [134].
Most traditional shotgun-based binning tools are developed to recover eukaryotic, bacterial, and
archaeal genomes [5, 77, 110, 159] and ignore the challenges associated with viruses, such as the lack of
universal single-copy genes and relatively small size of viral genomes. Additionally, those binning tools
exploiting microbial marker gene analysis are not applicable for viruses [95, 139, 159]. CoCoNet [7] and
vRhyme [81] are two existing methods specifically dedicated to metagenomic viral binning. CoCoNet
trains a neural network using both composition and co-occurrence features of viral contigs across samples
to predict the probability that two viral contigs originate from the same genome. vRhyme utilizes
single- or multi-sample coverage effect size to calculate coverage differences between viral contigs. To process
the sequence composition information, vRhyme first pretrains supervised machine-learning based classification models using genome fragments. Then, the nucleotide feature similarity vector between two viral
contigs is input into the classification models to predict the probability value that viral contigs originate
from the same genome. Finally, vRhyme constructs a weighted network, where each node is a viral contig
and an edge weight is calculated by dividing the coverage difference by the probability value. Networks are
further refined into vMAGs. However, both CoCoNet and vRhyme may be critically impaired when there
are not enough samples to construct reliable co-abundance profiles of viral contigs, i.e., profiles showing
which contigs share consistent abundance values across multiple samples and are therefore likely to come
from the same genome.
Metagenomic high-throughput chromosome conformation capture (metagenomic Hi-C) has been developed in recent years to simultaneously recover metagenome-assembled genomes (MAGs) and determine virus-host pairs from a single microbial community sample [10, 18, 35, 99, 100, 101, 121]. Combined
with the conventional shotgun sequencing, metagenomic Hi-C applies a genomic proximity ligation technique to construct chimeric junctions between metagenomic sequences in close proximity within the same
cell. After sequencing, millions of Hi-C read pairs are generated and subsequently aligned to contigs assembled from the shotgun reads. Contigs belonging to the same genome display enriched Hi-C contact
frequencies compared to those from different genomes [18], resulting in dozens of nearly complete bacterial genomes retrieved by publicly available Hi-C-based binning tools [9, 29, 34]. Although recovering
high-quality viral genomes is vital and a prerequisite for downstream analyses, apart from a proprietary
and commercial genome reconstruction service called ProxiPhage [152], no open-source Hi-C-based binning
method has been developed to retrieve viral genomes. For example, HiCBin requires the taxonomic annotation of some contigs by TAXAassign (https://github.com/umerijaz/TAXAassign) to generate
the intra-species contacts in the normalization step [33] while TAXAassign can hardly annotate viral contigs, resulting in the inability of HiCBin to bin viral contigs.
In addition to the difficulties in recovering vMAGs, tools for benchmarking the performance of viral
genome retrieval remain rare in metagenomic Hi-C experiments. CheckV has been widely used to estimate the completeness of vMAGs by comparing them to a large database curated from NCBI GenBank and
environmental samples [107]. However, unlike CheckM, which takes advantage of universal single-copy marker genes to assess both completeness and contamination of prokaryotic MAGs [119], CheckV
is unable to estimate the contamination of vMAGs since there is no such marker gene set available for
viruses [130]. CheckV is also limited in its ability to assess the completion of vMAGs since randomly
grouping two viral contigs together generally increases completion. Moreover, though methods based
on simulating known viral contigs from NCBI RefSeq viral genomes [113] have already been employed
to estimate the binning results of shotgun-based methods [7, 81], they cannot be generalized to evaluate
Hi-C-based binning approaches since few studies have been conducted on modeling Hi-C interactions for
viral contigs. Therefore, it is imperative to design a systematic and comprehensive benchmarking strategy
for Hi-C-based metagenomic viral binning.
To tackle the problem of a paucity of viral binning methods in metagenomic Hi-C experiments, we
developed ViralCC, a Hi-C-based binning method dedicated to recovering complete viral genomes and
determining virus-host pairs. The general pipeline of ViralCC is shown in Figure 5.1. ViralCC not only
considers the Hi-C interaction graph, but also puts forward a host proximity graph of viral contigs as
a complementary source of information to the Hi-C interaction map. Two graphs are then integrated
together, followed by Leiden graph clustering [151], to generate draft viral genomes. We compared ViralCC
to VAMB [110], CoCoNet [7], vRhyme [81], MetaTOR [9], and bin3C [29]. Our experiments indicated
that ViralCC substantially improved the CheckV completeness of viral genomic bins on real metagenomic
Hi-C datasets. Moreover, we put forward a systematic strategy to benchmark the viral genome retrieval
performance in metagenomic Hi-C experiments by generating mock metagenomic Hi-C datasets from real
samples. The ground truth of all mock viral contigs is known in mock datasets while Hi-C interactions
between mock viral contigs can be obtained directly from real samples without simulation. Leveraging
mock metagenomic Hi-C datasets derived from three real samples, we further demonstrated that ViralCC
outperformed other binning methods and recovered viral genomes with higher completeness and lower
contamination. Finally, we showed that the virus-host pairs can be determined based on the recovered
viral genomes.
[Figure 5.1 diagram, ViralCC pipeline: shotgun read-sets, assembly, contigs, viral contig detection, viral contigs; Hi-C read-sets; Step 1: construct Hi-C interaction graph; Step 2: construct host proximity graph; Step 3: integrate two graphs; Step 4: Leiden clustering; outputs: draft viral genomes and virus-host pairs (viral bin, host bin).]
Figure 5.1: The general workflow of ViralCC to retrieve high-quality viral genomes and determine virus-host pairs. Shotgun reads are first assembled into contigs, to which Hi-C paired-end reads are aligned.
Viral contigs are subsequently identified. Leveraging Hi-C linkages and the virus-host proximity structure
to link viral contigs, ViralCC constructs the Hi-C interaction graph and the host proximity graph. After
integrating two graphs, ViralCC employs Leiden clustering to reconstruct draft viral genomes, and additionally detects the virus-host pairs based on recovered viral genomes and Hi-C linkages.
5.2 Methods
5.2.1 Datasets
Three real metagenomic Hi-C datasets, all previously published, were employed to validate the performance of viral genome retrieval and to discover virus-host pairs. Experiments from the previously published papers are briefly repeated here.
The human gut dataset: This dataset was derived from the microbiome of a human gut and was composed of one WGS library (NCBI accession: SRR6131123) and two separate Hi-C libraries constructed by
two four-cutter restriction enzymes, MluCI and Sau3AI (NCBI accession: SRR6131122 and SRR6131124)
[121]. The Illumina HiSeqX Ten was used to sequence the shotgun and Hi-C libraries, creating 151 bp
paired-end reads. The two Hi-C libraries consisted of 48.8 million (MluCI library) and 41.7 million (Sau3AI
library) read pairs, respectively. The sequencing of the raw WGS library produced 250.9 million read pairs
(ratio Hi-C:shotgun = 0.36).
The cow fecal dataset: The cow fecal sample was collected and processed at the Beef and Sheep Research
Centre of Scotland’s Rural College [145], generating one shotgun library (NCBI accession: ERX2333418)
and two Hi-C libraries fragmented using either the Sau3AI or MluCI restriction enzymes (NCBI accession: ERX2548555 and ERX2548556). After sequencing all libraries by the Illumina HiSeqX platform at 150
bp, 159.5 million paired-end reads were obtained in the shotgun library while the two Hi-C libraries contained 86.2 million (Sau3AI library) and 59.3 million (MluCI library) paired-end reads, respectively (ratio
Hi-C:shotgun = 0.91).
The wastewater dataset: In the wastewater (WW) sample [142], the shotgun library (NCBI accession:
SRR8239393) was prepared using the DNeasy PowerWater kit while the Hi-C library (NCBI accession:
SRR8239392) was produced by a proprietary Hi-C preparation kit (Phase Genomics, Inc). The cutting
enzymes utilized in the experiment were Sau3AI and MluCI. All read-sets were sequenced by the HiSeq
4000 at the length of 150 bp. There were 269.3 million and 95.3 million paired-end reads for the WW
shotgun metagenomic and Hi-C read-sets, respectively (ratio Hi-C:shotgun = 0.35).
5.2.2 Initial processing
We applied bbduk from the BBTools suite (v37.25) [19] to thoroughly clean raw WGS and Hi-C read libraries (Supplementary materials). Processed shotgun reads were assembled into contigs using MEGAHIT
(v1.2.9) [90] with options ‘--min-contig-len 1000 --k-min 21 --k-max 141 --k-step 12 --merge-level 20,0.95’
(Supplementary Table 7.15). Then, processed Hi-C paired-end reads were mapped to assembled contigs
by BWA MEM (v0.7.17) [91] with parameter ‘-5SP’. After the alignment, we removed unmapped reads,
secondary alignments, supplementary alignments and alignments with low quality (mapping score or nucleotide match length <30). Raw Hi-C contact maps between two contigs were constructed by counting
the number of Hi-C read pairs separately aligned to these two contigs.
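The counting step can be illustrated with a minimal sketch. It assumes the filtered alignments have already been reduced to one (contig, contig) tuple per Hi-C read pair; this input format and the function name `count_contacts` are illustrative and are not taken from the original pipeline.

```python
from collections import Counter


def count_contacts(read_pairs):
    """Count Hi-C read pairs whose two mates align to different contigs.

    read_pairs: iterable of (contig_a, contig_b) tuples, one per read pair
    that survived filtering (hypothetical input format).
    Returns a Counter keyed by an order-independent contig pair.
    """
    contacts = Counter()
    for a, b in read_pairs:
        if a == b:
            continue  # both mates on the same contig: not an inter-contig contact
        contacts[tuple(sorted((a, b)))] += 1
    return contacts


pairs = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3")]
contacts = count_contacts(pairs)
```

Sorting each pair makes the count symmetric, so a read pair aligned as (c2, c1) adds to the same entry as one aligned as (c1, c2).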
5.2.3 Viral contig detection
Long contigs (≥ 3 kbp) assembled from shotgun reads were screened by VirSorter (v1.0.6) [131] with default
parameters to identify viral contigs. VirSorter achieved the best F1 score in a recent benchmarking study
[49]. Contigs annotated as prophages were removed from the viral sequences (Supplementary Table 7.16).
We refer to the contigs that are not identified by VirSorter as potential host contigs.
5.2.4 Construct the Hi-C interaction graph for viral contigs
We define the Hi-C interaction graph for viral contigs as Ghic(V, Ehic), where the vertex vi ∈ V represents
the i-th identified viral contig, and an edge eij ∈ Ehic exists if vi and vj are linked by at least one Hi-C link.
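A minimal sketch of this definition follows, assuming contacts are stored as a mapping from contig pairs to Hi-C link counts. The function name `build_hic_graph` and the data layout are hypothetical.

```python
def build_hic_graph(contacts, viral_contigs):
    """Return the edge set E_hic: viral contig pairs joined by >= 1 Hi-C link.

    contacts: dict mapping (contig_a, contig_b) -> number of Hi-C links
    (illustrative layout). Non-viral contigs are ignored here.
    """
    viral = set(viral_contigs)
    return {
        frozenset((a, b))
        for (a, b), n in contacts.items()
        if n >= 1 and a != b and a in viral and b in viral
    }


contacts = {("v1", "v2"): 3, ("v1", "h1"): 5, ("v2", "v3"): 1, ("v1", "v3"): 0}
e_hic = build_hic_graph(contacts, ["v1", "v2", "v3"])
```

Edges are stored as `frozenset` objects so that the graph is undirected by construction.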
5.2.5 Construct the host proximity graph for viral contigs
Besides the Hi-C interaction graph, we also take advantage of virus-host proximity structure to link viral
contigs. Specifically, we define two viral contigs as associated by k shared host contigs if these two viral
contigs are linked to at least the same k host contigs by the Hi-C interaction. Based on this metric to
measure the linkage between viral contigs, we construct the host proximity graph for viral contigs, denoted
by Ghost(V, Ehost), where the vertex vi ∈ V still represents the i-th identified viral contig while an edge eij
exists in Ehost if vi and vj are associated by k shared host contigs. Formally, let Hi denote the set of host
contigs for viral contig vi. Then, vi and vj are connected in the host proximity graph Ghost if
$$|H_i \cap H_j| \geq k, \tag{5.1}$$
where |·| denotes the cardinality of a set and the parameter k here is automatically determined such that
$$\max_{k} \; |E_{\mathrm{host}}| \quad \text{s.t.} \quad |E_{\mathrm{host}}| \leq |E_{\mathrm{hic}}|, \; k \geq k_{\min}, \tag{5.2}$$
where kmin (default 4) is the lower bound of parameter k. Note that decreasing k relaxes the requirement
for the existence of an association by shared host contigs, leading to more edges in Ghost. Thus in formula
5.2, maximizing the number of edges in Ghost is equivalent to minimizing the value of k. Though a smaller k
provides a larger number of connections for viral contigs in Ghost, the value of k cannot be too small, since
this may introduce false positive associations due to experimental noise. Therefore, two constraints that the
number of edges in Ghost does not exceed that of Ghic and k is no less than kmin are utilized to control the value
of k. We found that the vast majority of edges within the host proximity graph linked the viral contigs
from the same genome on the three mock metagenomic Hi-C datasets, demonstrating the reliability of the
host proximity graph (see Results).
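The selection of k in formulas 5.1 and 5.2 can be sketched as a simple search. This is a minimal illustration under stated assumptions, not the ViralCC implementation: the host sets are given as a dictionary, and the names `host_proximity_edges` and `choose_k` are hypothetical.

```python
from itertools import combinations


def host_proximity_edges(hosts, k):
    """Edges between viral contigs sharing at least k host contigs (Eq. 5.1).

    hosts: dict mapping viral contig -> set of host contigs linked by Hi-C.
    """
    return {
        frozenset((u, v))
        for u, v in combinations(sorted(hosts), 2)
        if len(hosts[u] & hosts[v]) >= k
    }


def choose_k(hosts, n_hic_edges, k_min=4):
    """Smallest k >= k_min such that |E_host| <= |E_hic| (Eq. 5.2).

    |E_host| never increases as k grows, so the smallest feasible k also
    maximizes |E_host|. The loop ends because the edge set becomes empty
    once k exceeds the size of the largest host set.
    """
    k = k_min
    while True:
        edges = host_proximity_edges(hosts, k)
        if len(edges) <= n_hic_edges:
            return k, edges
        k += 1


hosts = {
    "v1": {"h1", "h2", "h3", "h4", "h5", "h6"},
    "v2": {"h1", "h2", "h3", "h4", "h5"},
    "v3": {"h1", "h2", "h3", "h4"},
    "v4": {"h9"},
}
k, e_host = choose_k(hosts, n_hic_edges=2)
```

In this toy input, k = 4 would give three edges, which exceeds the two Hi-C edges allowed, so the search moves to k = 5 and keeps only the v1-v2 edge.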
5.2.6 Integrate the Hi-C interaction graph and the host proximity graph
We have constructed the Hi-C interaction graph and the host proximity graph to link viral contigs. Then,
we would like to integrate these two graphs. Let Gint(V, Eint) denote the final integrative graph, where the
vertex set still represents all viral contigs and an edge eij belongs to the edge set Eint if vi and vj are linked
through any one of the Hi-C interaction graph Ghic or the host proximity graph Ghost.
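Since both graphs share the same vertex set, the integration amounts to a union of edge sets. A minimal sketch, with the function name `integrate_graphs` chosen for illustration:

```python
def integrate_graphs(e_hic, e_host):
    """E_int contains an edge if it appears in E_hic or in E_host."""
    return set(e_hic) | set(e_host)


e_hic = {frozenset(("v1", "v2")), frozenset(("v2", "v3"))}
e_host = {frozenset(("v1", "v2")), frozenset(("v3", "v4"))}
e_int = integrate_graphs(e_hic, e_host)
```

An edge present in both graphs appears once in the result, so the integrative graph here has three edges rather than four.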
5.2.7 Leiden graph clustering based on the integrative graph
We cluster the viral contigs using the Leiden graph clustering algorithm [151] based on the integrative
graph Gint. The Leiden algorithm is a modularity-based community detection algorithm. It takes a three-stage greedy approach to optimizing the modularity function. Specifically, in each iteration, the algorithm
assigns each node to a community such that the modularity function will increase after the local movement,
followed by refining the partition into sub-communities and aggregating the network. Moreover, a general
modularity function based on the Reichardt and Bornholdt’s Potts model [124] is selected for the Leiden
algorithm to overcome the resolution limit [43] and is defined as:
$$\sum_{\{i,j \mid c_i = c_j\}} \left( M_{ij} - r \frac{d_i d_j}{2n} \right), \tag{5.3}$$
where M is the adjacency matrix of graph Gint, c denotes the community of viral vertices, r is a resolution
parameter, d represents the degree of viral vertices and n is the total number of edges in the graph. The
resolution parameter r is tuned using the silhouette coefficient [128] of the binning results, which is a
popular clustering evaluation metric that requires no true labels and measures the cohesion and the separation
of the clusters. The candidate resolution that yields the highest silhouette coefficient is selected as the
optimal value for the Leiden clustering.
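To make the objective in formula 5.3 concrete, the sketch below evaluates it for a given partition of a small unweighted graph. It only scores a partition and does not implement the Leiden algorithm itself. It assumes the sum runs over ordered pairs (i, j), including i = j, which is one reading of the index set. The function name `rb_modularity` is hypothetical.

```python
def rb_modularity(edges, community, r=1.0):
    """Evaluate Eq. 5.3 for an undirected, unweighted graph.

    edges: iterable of (u, v) pairs; community: dict node -> label.
    Assumption: the sum is over ordered pairs in the same community, so
    each intra-community edge contributes M_ij twice.
    """
    edges = [tuple(e) for e in edges]
    n = len(edges)  # total number of edges
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Adjacency part: every intra-community edge counted as (i, j) and (j, i).
    adjacency_term = 2 * sum(1 for u, v in edges if community[u] == community[v])
    # Null-model part: the sum of d_i * d_j over ordered pairs in a community
    # equals the square of that community's total degree.
    degree_sum = {}
    for node, d in degree.items():
        label = community[node]
        degree_sum[label] = degree_sum.get(label, 0) + d
    null_term = sum(s * s for s in degree_sum.values()) / (2 * n)
    return adjacency_term - r * null_term


# Two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}
q_split = rb_modularity(edges, split)
q_merged = rb_modularity(edges, merged)
```

With r = 1, separating the two triangles scores 5 while placing all six vertices in one community scores 0, so the objective prefers the split. Increasing r enlarges the penalty term and therefore favors smaller communities.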
5.2.8 Evaluate the CheckV completeness of vMAGs on real metagenomic Hi-C datasets
We used CheckV (v0.7.0) [107], a widely used tool, to estimate the completeness of viral MAGs recovered from three real metagenomic Hi-C datasets. Since CheckV was originally designed for assessing
the quality of single-contig viral genomes, viral contigs from each vMAG were concatenated into a single
sequence as required by CheckV. CheckV applies two algorithms to compute the completeness of vMAGs
based on amino acid identity (AAI) or hidden Markov model (HMM) (Supplementary materials). The AAI-based approach reports a confidence level of estimation based on the alignment quality to the CheckV
genome database and the contig length, and high- and medium-confidence estimates are demonstrated to
be accurate and can be trusted [107]. Therefore, we combined the results estimated by two approaches
to determine the completeness of vMAGs. Specifically, for each vMAG, CheckV AAI-based estimation of
completeness was utilized if this estimation was qualified as medium or high confidence. Otherwise, the
HMM-based estimate was used if available.
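The selection rule above can be written as a small function. This is a sketch of the rule as described, with illustrative argument names that are not claimed to match the column names of CheckV's output files.

```python
def select_completeness(aai_completeness, aai_confidence, hmm_completeness):
    """Pick the completeness estimate for one vMAG.

    Argument names are illustrative only. Any estimate may be None when
    it is unavailable.
    """
    if aai_completeness is not None and aai_confidence in ("medium", "high"):
        return aai_completeness
    # Fall back to the HMM-based estimate; None if that is also missing.
    return hmm_completeness


chosen_high = select_completeness(95.0, "high", 80.0)
chosen_low = select_completeness(95.0, "low", 80.0)
chosen_none = select_completeness(None, None, None)
```

A vMAG with neither a trusted AAI-based estimate nor an HMM-based estimate receives no completeness value under this rule.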
5.2.9 A systematic benchmarking strategy to evaluate the performance of binning viral
contigs
5.2.9.1 Rationale of the benchmarking framework
Though CheckV has been widely exploited to evaluate the binning performance for viral contigs, the inability to assess the contamination renders the CheckV evaluation less comprehensive on vMAGs. Moreover,
benchmarking the viral genome retrieval through simulation is challenging since few studies have been
conducted on modeling Hi-C interactions for viral contigs. To solve these problems, we put forward a
benchmarking strategy to comprehensively evaluate the binning performance of Hi-C-based tools on viral
contigs without the need for simulating Hi-C interactions for viral contigs.
5.2.9.2 Generate mock viral contigs with ground truth
Instead of simulating viral contigs using known viral reference genomes, we designed a strategy to directly
generate mock viral contigs with ground truth from the real metagenomic Hi-C sample. Though viral
genome assemblies from shotgun reads are commonly plagued by insufficiently long contigs, there are still
a few single contigs that can individually represent the viral genome with relatively high completeness.
Therefore, we first applied CheckV to all identified viral contigs. Contigs above 10,000 bp and marked as
‘high-quality’ or ‘complete’ by CheckV were considered relatively complete viral genomes and served as
the putative reference genomes. Then, we directly generated mock viral contigs from real metagenomic Hi-C datasets using these putative reference genomes. Specifically, we extracted subsequences from putative
reference genomes in sliding windows of a 3 kbp length moving from the left to right without overlaps.
As a result, putative reference genomes were split into non-overlapping fragments of 3 kbp. Fragments
at the edges of putative reference genomes were retained if they were longer than 1 kbp. All fragmented
contigs were regarded as mock viral contigs and labeled based on which putative reference genomes they
originated from. We then mixed the obtained mock viral contigs with all potential host contigs and aligned
the Hi-C read pairs to the mixed contig set using BWA MEM with parameter ‘-5SP’ to create a mock
metagenomic Hi-C dataset. In this way, we generated mock viral contigs with ground truth and constructed
valid Hi-C interactions without simulating the Hi-C experiments for viral contigs in a mock metagenomic
Hi-C dataset. We were subsequently able to validate the binning performance based on mock metagenomic
Hi-C datasets for Hi-C-based binning approaches as well as shotgun-based binning tools.
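The fragmentation step can be sketched as follows. The window size and tail threshold come from the text; the function name `fragment_genome` and the placeholder sequence are illustrative.

```python
def fragment_genome(sequence, window=3000, min_tail=1000):
    """Split a sequence into non-overlapping windows, left to right.

    A final partial window is kept only if it is longer than min_tail.
    """
    fragments = []
    for start in range(0, len(sequence), window):
        piece = sequence[start:start + window]
        if len(piece) == window or len(piece) > min_tail:
            fragments.append(piece)
    return fragments


# Placeholder sequence with the length of the shortest human gut reference.
genome = "A" * 11410
mock_contigs = fragment_genome(genome)
fragment_lengths = [len(f) for f in mock_contigs]
```

For an 11,410 bp reference this yields three 3 kbp fragments and one 2,410 bp tail, which is retained because it exceeds 1 kbp. Each fragment would then be labeled with the reference genome it came from.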
5.2.9.3 Gold standards to assess binning performance using mock metagenomic Hi-C datasets
Since the true labels of all mock viral contigs in the mock metagenomic Hi-C dataset were known, we employed four comprehensive evaluation metrics of the clustering performance (Supplementary materials):
Fowlkes-Mallows scores (F-scores), Adjusted Rand Index (ARI), Normalized Mutual Information (NMI),
and Homogeneity. These four metrics were used to evaluate binning performance.
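As an illustration of one of these metrics, the sketch below computes the Fowlkes-Mallows score from pair counts. It assumes the standard pair-counting definition, TP / sqrt((TP + FP)(TP + FN)); the exact definitions used in this study are given in the Supplementary materials. The function name is illustrative.

```python
from collections import Counter
from math import comb, sqrt


def fowlkes_mallows(true_labels, pred_labels):
    """Pair-counting Fowlkes-Mallows score.

    TP      = contig pairs grouped together in both labelings
    TP + FP = contig pairs grouped together in the predicted bins
    TP + FN = contig pairs grouped together in the ground truth
    """
    pairs_both = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    pairs_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    if pairs_pred == 0 or pairs_true == 0:
        return 0.0
    return pairs_both / sqrt(pairs_pred * pairs_true)


truth = ["A", "A", "A", "B", "B", "B"]
bins = [0, 0, 1, 1, 1, 1]
score = fowlkes_mallows(truth, bins)
```

Here one contig from genome A is placed in the bin of genome B, which leaves 4 correctly co-binned pairs out of 7 predicted and 6 true pairs.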
Moreover, we defined the completeness and contamination of each vMAG. Specifically, for each vMAG,
we summed the lengths of contigs from different reference genomes separately and assigned the vMAG
to the reference genome with the largest query length, denoted by L(q). We also denoted the length
of corresponding reference genome as L(r) and referred to the total length of the vMAG as L(v). The
completeness of a vMAG is defined as $L(q)/L(r)$ and the contamination of a vMAG is defined as $(L(v) - L(q))/L(v)$.
Then, we assigned the high-quality vMAGs into three ranks, i.e., near-complete (completeness ≥ 90%,
contamination ≤ 10%), substantially complete (70% ≤ completeness < 90%, contamination ≤ 10%), and
moderately complete (50% ≤ completeness < 70%, contamination ≤ 10%), which is similar to the CheckM
evaluation criteria [119].
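These definitions and the three ranks can be sketched in a few lines. The function names and the input layout (total contig length per reference genome for each vMAG) are assumptions made for illustration.

```python
def evaluate_vmag(length_by_reference, reference_lengths):
    """Return (assigned reference, completeness, contamination) for one vMAG.

    length_by_reference: dict reference -> summed length of the vMAG's
    contigs originating from that reference (hypothetical layout).
    """
    best_ref = max(length_by_reference, key=length_by_reference.get)
    l_q = length_by_reference[best_ref]      # L(q)
    l_v = sum(length_by_reference.values())  # L(v)
    l_r = reference_lengths[best_ref]        # L(r)
    return best_ref, l_q / l_r, (l_v - l_q) / l_v


def rank_vmag(completeness, contamination):
    """Map completeness and contamination (as fractions) to a quality rank."""
    if contamination > 0.10 or completeness < 0.50:
        return None
    if completeness >= 0.90:
        return "near-complete"
    if completeness >= 0.70:
        return "substantially complete"
    return "moderately complete"


ref, completeness, contamination = evaluate_vmag(
    {"refA": 28000, "refB": 2000}, {"refA": 30000, "refB": 12000}
)
rank = rank_vmag(completeness, contamination)
```

In this toy case the vMAG is assigned to refA with completeness of about 93.3% and contamination of about 6.7%, so it ranks as near-complete.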
5.2.10 The quality control of metagenomic Hi-C datasets
As in [101], we defined the inter-contig Hi-C contacts as the paired-end Hi-C reads mapped to different
viral contigs. Then the 3D ratio was calculated by dividing the number of inter-contig Hi-C contacts by
the total number of paired-end Hi-C reads aligned to viral contigs.
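A minimal sketch of the 3D ratio follows. It assumes each input tuple is a Hi-C read pair with both mates aligned to viral contigs, which is one reading of the denominator; the function name `three_d_ratio` is illustrative.

```python
def three_d_ratio(viral_read_pairs):
    """Fraction of viral-aligned Hi-C read pairs whose mates hit different contigs.

    viral_read_pairs: iterable of (contig_a, contig_b) tuples (assumed to
    have both mates on viral contigs).
    """
    pairs = list(viral_read_pairs)
    if not pairs:
        return 0.0
    inter_contig = sum(1 for a, b in pairs if a != b)
    return inter_contig / len(pairs)


ratio = three_d_ratio([("v1", "v2"), ("v1", "v1"), ("v2", "v2"), ("v3", "v1")])
```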
5.2.11 Annotate vMAGs at the order and family levels
We first employed DemoVir (https://github.com/feargalr/Demovir) to classify viral contigs to the order and
family taxonomic levels by comparing genes on contigs against the curated viral protein database
(https://figshare.com/articles/NR_Viral_TrEMBL/5822166). Contigs whose genes were consistently classified to
the same family were finally annotated. Then, we defined the vMAG family as the family to which the
majority of contigs in the vMAG belonged.
5.2.12 Detect virus-host pairs between vMAGs and host MAGs
All non-viral contigs for each sample were binned using HiCBin (v1.1.0) [34] with default parameters to
generate potential host MAGs, which were subsequently annotated by GTDB-Tk (v2.1.0, Release: R207_v2)
[23] with default parameters and the taxonomic classification results were visualized using iTOL (v5)
[89]. vMAGs were associated with potential host MAGs if they were linked by at least two Hi-C read-pairs
as in [79].
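The association rule can be sketched by aggregating contig-level contacts to the bin level and applying the threshold of two read pairs. The data layout and the function name `detect_virus_host_pairs` are hypothetical.

```python
from collections import Counter


def detect_virus_host_pairs(contig_contacts, viral_bin_of, host_bin_of, min_links=2):
    """Sum Hi-C read pairs between each vMAG and host MAG; keep pairs >= min_links.

    contig_contacts: dict (contig_a, contig_b) -> number of Hi-C read pairs.
    viral_bin_of / host_bin_of: dict contig -> bin identifier.
    """
    links = Counter()
    for (a, b), n in contig_contacts.items():
        # A contact may be recorded in either orientation.
        for x, y in ((a, b), (b, a)):
            if x in viral_bin_of and y in host_bin_of:
                links[(viral_bin_of[x], host_bin_of[y])] += n
    return {pair: n for pair, n in links.items() if n >= min_links}


contig_contacts = {("v1", "h1"): 1, ("h2", "v2"): 1, ("v3", "h3"): 1, ("v1", "v2"): 5}
viral_bin_of = {"v1": "vMAG1", "v2": "vMAG1", "v3": "vMAG2"}
host_bin_of = {"h1": "MAG_A", "h2": "MAG_A", "h3": "MAG_B"}
virus_host_pairs = detect_virus_host_pairs(contig_contacts, viral_bin_of, host_bin_of)
```

Two single read pairs that connect different contigs of the same vMAG and host MAG add up to two links, so vMAG1 is paired with MAG_A, while the single link between vMAG2 and MAG_B falls below the threshold.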
5.2.13 Compare ViralCC to other pipelines
VAMB (v3.0.3) [110] was executed with option ‘-t 40’. vRhyme (v1.0.0) [81], MetaTOR (v1.1.4) [9], and
bin3C (v0.1.1) [29] were run with default parameters. The input coverage files of viral contigs for VAMB and
vRhyme were generated using script ‘jgi_summarize_bam_contig_depths’ provided by MetaBAT2 (v2.12.1)
[77]. Since CoCoNet [7] removed contigs occurring in only one sample, we used the mode ‘composition’
to recover the viral genomes. The other parameters were set to default values.
5.3 Results
5.3.1 Generating mock metagenomic Hi-C datasets for benchmarking
All viral contigs detected by VirSorter were assessed by CheckV to select single contigs with high completeness as putative reference genomes. As a result, 51 putative reference genomes, with length ranging
from 11,410 bp to 194,784 bp were generated from the human gut dataset; 11 putative reference genomes from
11,452 bp to 42,000 bp were obtained from the cow fecal dataset; and 17 putative reference genomes, ranging from 11,455 bp to 127,910 bp were derived from the wastewater dataset (Supplementary Table 7.17).
We then constructed mock viral contigs by splitting the putative viral genomes and obtained 1,010, 94,
and 279 fragmented mock viral contigs from the three datasets, respectively (Supplementary Table 7.17).
For each real metagenomic Hi-C dataset, mock viral contigs were mixed with all non-viral contigs (i.e.,
contigs that are not identified as viral contigs by VirSorter), followed by the alignment of Hi-C paired-end
reads to construct the mock metagenomic Hi-C dataset. The analyses of binning mock viral contigs on the
mock human gut dataset were presented in the main text. We also provided benchmarking results on the
mock wastewater and the mock cow fecal datasets in the Supplementary materials.
5.3.2 Integrating the Hi-C interaction graph and the host proximity graph improves
binning performance on the mock human gut dataset
We first constructed the Hi-C interaction graph Ghic and the host proximity graph Ghost for 1,010 mock
viral contigs from the mock human gut dataset. There are 2,699 edges in Ghic. The parameter k for Ghost
was tuned to be 30, which means that any two viral contig nodes with an edge in Ghost were linked to at
least the same 30 host contigs by the Hi-C interaction. This resulted in 2,698 edges in Ghost. Among these
2,698 edges in Ghost, 14.5% of the edges were spurious edges, which were defined as the edges that linked
two contigs from different putative reference genomes in Ghost. We then integrated Ghost and Ghic into Gint,
which contained 4,397 edges. We could observe 1,000 common edges between Ghost and Ghic, accounting
for around 37% of the total number of edges in either graph.
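The reported edge counts are consistent with inclusion-exclusion applied to the two edge sets, as the following check shows:

```latex
|E_{\mathrm{int}}| = |E_{\mathrm{hic}}| + |E_{\mathrm{host}}| - |E_{\mathrm{hic}} \cap E_{\mathrm{host}}|
                   = 2699 + 2698 - 1000 = 4397,
\qquad
\frac{1000}{2699} \approx 0.37 .
```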
We applied the Leiden clustering on Ghic, Ghost, and Gint, respectively, and assessed the binning results
using four clustering metrics: F-score, ARI, NMI, and homogeneity (Supplementary Table 7.18). Gint outperformed both Ghic and Ghost in terms of all four clustering metrics. We also evaluated the completeness
and contamination of each vMAG (Supplementary Table 7.19). Specifically, 8 near-complete, 3 substantially
complete, and 5 moderately complete vMAGs were recovered based only on Ghic, while 12 near-complete
and 2 substantially complete vMAGs were retrieved based only on Ghost. In contrast, employing the integrative graph Gint for clustering could reconstruct 26 near-complete, 2 substantially complete, and 4
moderately complete vMAGs. The improvement of binning performance by integrating two graphs indicated the Hi-C interaction graph and the host proximity graph were complementary to each other on
binning viral contigs.
5.3.3 ViralCC outperforms other binning methods on the mock human gut dataset
ViralCC was compared to VAMB, CoCoNet, vRhyme, bin3C, and MetaTOR on the mock human gut dataset
(see Methods). VAMB is a general shotgun-based binning tool while bin3C and MetaTOR are general Hi-C-based binning pipelines. CoCoNet and vRhyme are two shotgun-based binning methods specifically
designed for clustering sequenced viral particles.
As shown in Figure 5.2a, VAMB, CoCoNet, vRhyme, bin3C, and MetaTOR achieved 0.198, 0.485, 0.366,
0.404, and 0.750 in terms of F-score, respectively, which was improved to 0.795 by ViralCC. The ARI scores
for viral bins produced by VAMB, CoCoNet, vRhyme, bin3C, and MetaTOR were 0.111, 0.471, 0.302, 0.274,
and 0.744. In contrast, ViralCC increased the ARI score to 0.787. As for the NMI, VAMB, CoCoNet, vRhyme,
bin3C, and MetaTOR obtained 0.724, 0.742, 0.782, 0.817, and 0.928, whereas ViralCC achieved a score of
0.929. ViralCC also improved the homogeneity score to 0.921 from 0.570, 0.723, 0.687, 0.691, and 0.911,
achieved by VAMB, CoCoNet, vRhyme, bin3C, and MetaTOR, respectively.
VAMB, CoCoNet, vRhyme, bin3C, and MetaTOR could recover 1, 5, 0, 5, and 22 near-complete vMAGs,
respectively, while ViralCC increased this number to 26 (Figure 5.2b). In total, ViralCC could retrieve 32
high-quality vMAGs out of 51 reference genomes whereas VAMB, CoCoNet, vRhyme, bin3C, and MetaTOR could reconstruct 7, 11, 7, 6, and 30 high-quality vMAGs, respectively. Moreover, we also found
that ViralCC performed better than other binners in recovering near-complete vMAGs from large
putative viral genomes (Supplementary materials). Altogether, ViralCC outperformed other binning methods as it recovered viral genomes with higher completeness and lower contamination based on the mock
metagenomic Hi-C dataset. Notably, MetaTOR and ViralCC were comparable according to the NMI and
the homogeneity scores, indicating that both approaches could recover high purity viral contig bins. On
the other hand, ViralCC achieved better performance than MetaTOR in terms of F-score and ARI (Figure
5.2a) while retrieving more complete bins (Figure 5.2b) from the mock metagenomic Hi-C dataset. This
shows the effectiveness of combining host proximity information with Hi-C interaction information.
[Figure 5.2 plots, the mock human gut dataset: (a) scores for the clustering metrics F-score, ARI, NMI, and Homogeneity for VAMB, CoCoNet, vRhyme, bin3C, MetaTOR, and ViralCC; (b) number of viral bins per binning method, split into moderately complete, substantially complete, and near-complete.]
Figure 5.2: Comparison of viral genome retrieval performance according to (a) clustering metrics and (b)
completeness and contamination criteria (Moderately complete: 50% ≤ completeness < 70%, contamination ≤ 10%; Substantially complete: 70% ≤ completeness < 90%, contamination ≤ 10%; Near-complete:
completeness ≥ 90%, contamination ≤ 10%). ViralCC outperforms other binning methods on the mock
human gut dataset.
5.3.4 Binning analyses of viral contigs on three real metagenomic Hi-C datasets
VirSorter detected 791, 1,338, and 2,757 viral contigs from the human gut, cow fecal, and wastewater
samples, respectively. Viral contigs were binned using different methods for the three datasets. The
CheckV completeness of viral bins was estimated to evaluate the binning quality. We referred to viral
bins with CheckV completeness above 90% as draft viral genomes with high completion and denoted bins
with CheckV completeness above 50% as draft viral genomes with medium completion.
For the human gut dataset, ViralCC identified 465 viral bins with sizes ranging from 3,001 bp to 307,395
bp, and yielded more high and medium completion draft viral genomes than any other tested method
(Figure 5.3a). For the cow fecal dataset, ViralCC constructed 574 viral bins with sizes ranging from 3,002 bp
to 157,462 bp. It generated substantially more medium and high completion draft viral genomes than other
methods, specifically exceeding the numbers of high completion draft genomes from VAMB, CoCoNet,
vRhyme, bin3C, and MetaTOR by 161%, 140%, 66.7%, 93.5%, and 62.1%, respectively (Figure 5.3b). From the
wastewater dataset, ViralCC established 1,240 viral bins with sizes ranging from 3,006 bp to 461,626 bp,
and could reconstruct 32.8%, 103%, 141%, 175%, and 75% more high completion draft genomes compared
with VAMB, CoCoNet, vRhyme, bin3C, and MetaTOR, respectively (Figure 5.3c). ViralCC also recovered
markedly more draft viral genomes with medium completion.
[Figure 5.3 plots: number of bins per binning method (VAMB, CoCoNet, vRhyme, bin3C, MetaTOR, ViralCC), grouped by CheckV completeness threshold (≥ 50%, ≥ 60%, ≥ 70%, ≥ 80%, ≥ 90%); panels: (a) human gut dataset, (b) cow fecal dataset, (c) wastewater dataset.]
Figure 5.3: Comparison of draft viral bins retrieved by different binning tools according to the CheckV
completeness standard on the (a) human gut, (b) cow fecal, and (c) wastewater datasets. ViralCC can
retrieve more complete viral genomes compared to VAMB, CoCoNet, vRhyme, bin3C, and MetaTOR from
all three real metagenomic Hi-C samples.
Altogether, the analyses of three real metagenomic Hi-C datasets demonstrated that ViralCC retrieved
more complete viral genomes compared to VAMB, CoCoNet, vRhyme, bin3C, and MetaTOR, which was
consistent with our observations from the mock metagenomic Hi-C datasets. Moreover, we sorted vMAGs
by the number of viral contigs in descending order. If multiple vMAGs contained the same number of viral
contigs, they were further sorted by the bin size in descending order. Contigs in each vMAG were also
sorted by the contig length in descending order. We then plotted the raw Hi-C contact maps (see Methods)
of the top ten vMAGs for the three datasets with either the contig index (Figure 5.4) or the contig size
(Supplementary Figure 7.6) as the axis unit, respectively, which confirmed the valid reconstruction of the
viral genomes.
Finally, we explored the relationships between the quality of Hi-C datasets and the vMAG retrieval
performance. The 3D ratio was employed to measure the quality of Hi-C datasets (see Methods). Specifically, the 3D ratios were 23.3%, 38.3%, and 54.9% for the human gut, cow fecal, and wastewater datasets,
respectively. Although a higher 3D ratio does not necessarily mean more informative linkages between
contigs [101], we still observed that, compared to the traditional shotgun-based binning methods, the improvement in binning performance by ViralCC was remarkable on metagenomic datasets with high-quality
Hi-C libraries.
[Figure 5.4, panels a-c: Raw Hi-C contact maps of the (a) human gut, (b) cow fecal, and (c) wastewater datasets. Both axes: Index of viral contigs.]
Figure 5.4: Heatmaps of raw Hi-C contact matrices of the top ten vMAGs from the (a) human gut, (b) cow
fecal, and (c) wastewater datasets with the contig index as the axis unit. The vMAGs were first ranked by
their numbers of contigs and then the contigs within each vMAG were ranked by their sizes. The scale bar
shows the number of raw Hi-C contacts between viral contigs.
5.3.5 Annotation of vMAGs demonstrated the high purity of the vMAGs at the family
level
We annotated 191, 320, and 693 vMAGs in total at the family level for the human gut, cow fecal, and
wastewater datasets, respectively. We found that 173 (90.6%) out of 191 vMAGs in the human gut sample,
265 (82.8%) out of 320 vMAGs in the cow fecal sample, and 592 (85.4%) out of 693 vMAGs in the wastewater
sample contained only viral contigs from the same family, demonstrating the high purity of vMAGs at the
family level.
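The purity figures above are simple ratios of single-family vMAGs to annotated vMAGs. The sketch below illustrates one way such a count could be made; the function name `family_purity` and the toy labels are assumptions, and only the three (pure, total) pairs are taken from the text.

```python
def family_purity(vmag_families):
    """Return (pure vMAGs, total vMAGs, percentage of pure vMAGs).

    vmag_families: dict mapping a bin name to the family labels of its
    annotated viral contigs. A vMAG is pure when all labels agree.
    """
    total = len(vmag_families)
    pure = sum(1 for fams in vmag_families.values() if len(set(fams)) == 1)
    return pure, total, round(100.0 * pure / total, 1)


# Hypothetical toy annotation (not data from the study).
toy = {
    "bin1": ["Siphoviridae", "Siphoviridae"],
    "bin2": ["Myoviridae", "Siphoviridae"],
    "bin3": ["Podoviridae"],
}
toy_purity = family_purity(toy)

# Counts reported in the text, recomputed as percentages.
reported_counts = {
    "human gut": (173, 191),
    "cow fecal": (265, 320),
    "wastewater": (592, 693),
}
reported = {
    name: round(100.0 * pure / total, 1)
    for name, (pure, total) in reported_counts.items()
}
```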
As shown in Figure 5.5, the vMAGs were dominated by tailed bacteriophages of the order Caudovirales
and vMAGs belonging to the families Myoviridae, Siphoviridae, and Podoviridae were found in all three
samples [1]. Bacteriophages, mainly Siphoviridae, dominated the two gut samples [11]. Compared to the
other samples that were more dominated by Siphoviridae, Myoviridae and Siphoviridae vMAGs were of
similar abundance in the wastewater sample, as reported for water environments [66, 69, 149, 156].
[Figure 5.5, panels a-c: vMAG family charts for (a) the human gut dataset (legend: Myoviridae, Siphoviridae, Podoviridae, Herpesviridae; counts as extracted, unpaired: 136, 45, 5, 5), (b) the cow fecal dataset (legend: Myoviridae, Siphoviridae, Podoviridae, Herpesviridae, Lipothrixviridae; counts as extracted, unpaired: 223, 72, 22, 1, 2), and (c) the wastewater dataset (legend: Myoviridae, Siphoviridae, Podoviridae, Herpesviridae, Lipothrixviridae; counts as extracted, unpaired: 322, 295, 68, 1, 7).]
Figure 5.5: Taxonomy statistics of annotated vMAGs on the (a) human gut, (b) cow fecal, and (c) wastewater
datasets. The numbers on the graph indicate the number of vMAGs belonging to different families.
5.3.6 Phage-host network in the wastewater sample
We discovered virus-host pairs based on the vMAGs recovered by ViralCC, and showed the results from
the wastewater dataset in the main text below. The results of virus-host detection from the human gut and
cow fecal datasets are shown in the Supplementary materials.
For non-viral contigs, expected to be largely bacterial, HiCBin generated 1,253 MAGs, which were assessed by CheckM (v1.1.3, parameter: lineage_wf) [119]. The quality evaluation results are shown in Supplementary Table 7.20. Among the 1,253 MAGs, 600 could be unambiguously annotated by GTDB-Tk
[23], and the taxonomic classification results were visualized using iTOL [89] (Figure 5.6a). Burkholderiales, Pseudomonadales, Lachnospirales, Bacteroidales, and Oscillospirales were the predominant orders in the
wastewater sample. Burkholderiales and Pseudomonadales are common orders reported in water environments [40, 142]. Lachnospirales, Bacteroidales, and Oscillospirales have been reported in gut microbiomes
[55], so their detection is reasonable in this domestic wastewater sample from around 25,000 people
[142].
A total of 1,065 (85%) out of 1,253 MAGs were associated with at least one viral MAG. We then explored
the infection spectrum of annotated vMAGs on hosts from different orders (Figure 5.6b). We observed that
vMAGs from the family Myoviridae mainly targeted hosts from the order Burkholderiales, which is consistent with previous findings that some phages belonging to the family Myoviridae could lyse bacteria
from Burkholderia [166]. A large number of vMAGs belonging to the family Siphoviridae could infect Bacteroidales bacteria [114]. Moreover, we unexpectedly observed that 4 vMAGs apparently infecting members
of the order Burkholderiales came from the family Herpesviridae, which has previously been reported
to infect only animals, including humans [103]. Further research is needed to determine whether these associations reflect a
true infection or whether the proximity ligation occurred in a non-infection situation (e.g., extracellularly).
5.3.7 Validate virus-host pairs using CRISPR spacer analysis on the wastewater dataset
We predicted the CRISPR spacers in host MAGs using PILER-CR (v1.06) [37] and 925 CRISPR spacers were
detected. Then, we aligned these spacers to vMAGs using BLAST [73] with parameters ‘-task blastn-short
-evalue 1e-5’. The alignments with bitscore below 45 were further filtered out [101]. In this way, 16 robust
hits between host MAGs and virus MAGs were found using CRISPR spacer analysis.
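The spacer filtering step can be sketched as below. This is an illustrative example, not the pipeline used in the study: it assumes BLAST tabular output (`-outfmt 6`, where the twelfth column is the bitscore), and the function name `crispr_pairs`, the lookup dictionaries, and all identifiers in the toy data are hypothetical.

```python
def crispr_pairs(blast_lines, spacer_to_mag, contig_to_vmag, min_bitscore=45.0):
    """Collect (host MAG, vMAG) pairs supported by spacer alignments.

    blast_lines: iterable of tab-separated BLAST outfmt 6 records, with the
    spacer as query (column 1) and the viral contig as subject (column 2).
    Alignments with a bitscore below min_bitscore are discarded.
    """
    pairs = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        spacer, contig, bitscore = fields[0], fields[1], float(fields[11])
        if bitscore < min_bitscore:
            continue
        if spacer in spacer_to_mag and contig in contig_to_vmag:
            pairs.add((spacer_to_mag[spacer], contig_to_vmag[contig]))
    return pairs


# Hypothetical toy records (not data from the study).
toy_blast = [
    "sp1\tctgA\t100.0\t30\t0\t0\t1\t30\t500\t529\t1e-9\t56.5",
    "sp2\tctgB\t96.0\t25\t1\t0\t1\t25\t10\t34\t1e-6\t40.1",  # below the cutoff
    "sp3\tctgC\t100.0\t28\t0\t0\t1\t28\t90\t117\t1e-8\t52.8",
]
spacer_to_mag = {"sp1": "MAG_1", "sp2": "MAG_1", "sp3": "MAG_2"}
contig_to_vmag = {"ctgA": "vMAG_7", "ctgB": "vMAG_8", "ctgC": "vMAG_9"}
crispr = crispr_pairs(toy_blast, spacer_to_mag, contig_to_vmag)

# Pairs also linked by Hi-C (hypothetical set) are the cross-validated ones.
hic = {("MAG_1", "vMAG_7"), ("MAG_3", "vMAG_9")}
supported = crispr & hic
```

Intersecting the CRISPR-derived pairs with the Hi-C-derived pairs gives the cross-validated set, which is how a fraction such as "pairs supported by both lines of evidence" could be computed.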
Among those 16 hits, 13 virus-host MAG pairs (81.3%) were also supported by Hi-C linkages.
Notably, according to the CRISPR spacer analysis, vMAG 1,198 (family: Siphoviridae) was
associated with two host MAGs from the order Fusobacteriales, and these two host MAGs were also the only
two hosts of vMAG 1,198 predicted by the Hi-C interactions.
[Figure 5.6, panels a-b: (a) Taxonomic tree of MAGs; legend, Order: Burkholderiales, Lachnospirales, Pseudomonadales, Oscillospirales, Bacteroidales, Actinomycetales, Peptostreptococcales, Flavobacteriales, Enterobacterales, Propionibacteriales, Campylobacterales, Fusobacteriales, Desulfovibrionales, Erysipelotrichales, Rhodobacterales, Tissierellales, Selenomonadales, Nanopelagicales, Others. (b) Bar chart for the wastewater dataset; x-axis: Host order (Burkholderiales, Pseudomonadales, Lachnospirales, Bacteroidales, Oscillospirales); y-axis: Number of associations (0 to 1000); legend: vMAG family (Myoviridae, Siphoviridae, Podoviridae, Herpesviridae, Lipothrixviridae).]
Figure 5.6: (a) Taxonomic annotations of MAGs recovered by HiCBin from the domestic wastewater sample. Burkholderiales, Pseudomonadales, Lachnospirales, Bacteroidales, and Oscillospirales were the predominant orders. (b) The apparent infection spectrum of vMAGs from the wastewater sample. vMAGs belonging to the family Myoviridae mainly targeted hosts from the order Burkholderiales and a large number of
vMAGs from the family Siphoviridae could infect Bacteroidales bacteria.
5.3.8 Running time of ViralCC
ViralCC was executed on one computing node with a 2.40 GHz Intel Xeon Processor E5-2665 and 50,000
MB of RAM, provided by the Advanced Research Computing platform at the University of Southern California. ViralCC took 22.5 min, 76.6 min, and 21.7 min to run on the human gut, cow fecal, and
wastewater samples, respectively.
5.4 Discussion
ViralCC is an open-source Hi-C-based binning method for viral genome retrieval. Unlike other Hi-C-based binning tools that use only Hi-C contact maps, ViralCC exploits a host proximity graph based on
the virus-host proximity structure as a supplementary source of connections between viral contigs. We
demonstrate that ViralCC outperformed other tools on real metagenomic Hi-C datasets according to the
CheckV completeness criteria. Notably, because randomly binning viral contigs into vMAGs does
not reduce the CheckV completeness relative to that of the individual contigs, it is
necessary to construct a random binning model as a control experiment when CheckV completeness
is used as the evaluation metric. Moreover, we observe that, compared to the
shotgun-based binning methods, the improvement in binning performance by ViralCC was significant on metagenomic datasets with high-quality Hi-C libraries,
indicating the potential importance of good-quality Hi-C libraries for viral
genome retrieval.
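One possible form of such a random binning control is sketched below. The construction is an assumption made for illustration (contigs are shuffled across bins while each bin keeps its original number of contigs), and the function name `random_binning` and the toy bins are hypothetical; the control model used in the study may differ.

```python
import random


def random_binning(bins, seed=0):
    """Shuffle contigs across bins while preserving every bin's contig count.

    bins: dict mapping a bin name to a list of contig names.
    Returns a new dict with the same bin names and sizes.
    """
    rng = random.Random(seed)
    contigs = [c for members in bins.values() for c in members]
    rng.shuffle(contigs)
    shuffled, start = {}, 0
    for name, members in bins.items():
        shuffled[name] = contigs[start:start + len(members)]
        start += len(members)
    return shuffled


# Hypothetical toy bins (not data from the study).
observed = {"vMAG_1": ["c1", "c2", "c3"], "vMAG_2": ["c4"], "vMAG_3": ["c5", "c6"]}
control = random_binning(observed, seed=42)
```

Scoring both `observed` and `control` with the same completeness tool would then show how much of the apparent gain comes from merely merging contigs.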
Since the assessment by CheckV software is not comprehensive, we put forward a systematic benchmarking strategy to assess the performance of binning viral contigs using mock metaHi-C datasets. We
expect that this benchmarking strategy can facilitate the evaluation of any Hi-C-based binning tool in viral genome retrieval studies. However, there are also limitations and biases in the benchmarking strategy.
Since we only choose viral genomes that can be recovered by a single contig from the whole community,
our benchmarking method inevitably under-estimates the true diversity of the virus community. The effectiveness of the benchmarking is also less convincing if there are few putative viral genomes. Moreover,
though we have shown the low fraction of spurious contacts in the host proximity graph using the mock
metagenomic Hi-C datasets, we cannot obtain the results from the real datasets because it is challenging
to know the true labels of viral contigs from the real datasets. Finally, we observe that the sizes of putative viral genomes tend to be small in the benchmarking method (Supplementary materials). Though all
pipelines are treated equally on the same set of mock viral contigs derived from the selected putative viral
genomes, the sizes of putative viral genomes should be accounted for in the benchmarking considering
that the full recovery of a larger putative viral genome requires a binner to correctly group more viral
contigs into a single bin from the mock datasets.
Apart from the direct binning of viral contigs as we discussed here, training a classification model
to distinguish confidently labelled viral bins and bacterial bins can also contribute to providing a highly
enriched candidate set of viral bins from bulk metagenome data [72]. Viral genome retrieval, combined
with Hi-C proximity ligation, also sheds light on infection mechanisms and unveils active
virus-host interactions.
Compared to a popular approach, CRISPR spacer analysis, which can reflect historic linkages between
viruses and hosts [60, 122], metagenomic Hi-C experiments are able to detect active virus-host pairs at a
single time point. Chen et al. [24] used metagenomic Hi-C experiments to validate virus-host associated
pairs predicted by CRISPR in activated sludge (AS) samples using Illumina sequencing and Nanopore sequencing separately. They validated 11 out of 21 and 16 out of 28 virus-host associated pairs predicted by
CRISPR based on the Illumina and combined Illumina/Nanopore sequenced samples, respectively, leveraging Hi-C linkages. In our study, we validated 13 out of 16, 3 out of 4, and 2 out of 2 virus-host pairs
predicted by CRISPR based on the wastewater, human gut, and cow fecal datasets, respectively (see Results, Supplementary materials). Both studies clearly show how analyses of metagenomic Hi-C data can
be a powerful tool in recovering virus-host pairs that are otherwise difficult to determine (e.g., from non-cultured organisms). It should be noted that some CRISPR-predicted virus-host associations indicate historical associations that may not be present in a given sample, and such pairs cannot be detected by Hi-C
[24]. It must also be kept in mind that some virus-bacteria associations apparent from proximity ligation might result from proximity of bacterial and viral DNA through a mechanism other than infection; thus,
unexpected results like our reported apparent herpesvirus infection of Burkholderiales should be validated
before drawing extraordinary conclusions.
In the future, it will be interesting to explore whether existing binning methods can resolve closely
related viruses residing in the same bacterial host based on virus-host proximities. Moreover, recent studies
have found that specific viruses have mechanisms enabling multiple viral genomes to infect the same host
cell, which is called co-infection [133]. Leveraging Hi-C proximity ligation to discover
co-infection by multiple phages within the same cell is another potential topic for future research.
Chapter 6
Conclusions and future work
In this dissertation, we have made significant strides in the field of metagenomics by introducing computational methods tailored for metaHi-C data analysis. These include the normalization methods, HiCzin
and NormCC, and the contig binning methods, ImputeCC and ViralCC. HiCzin introduces a novel approach using zero-inflated negative binomial regression for normalization and spurious contact detection
in metagenomic Hi-C data, effectively eliminating systematic biases and enhancing the quality of metagenomic Hi-C contact maps for more precise analysis of microbial communities. NormCC builds on the foundation laid by HiCzin, presenting a new normalization method that models all proximity ligation events
per contig without the need for annotating contigs and estimating contig coverages, thus streamlining the
normalization process and reducing computation time in metagenomic Hi-C data analysis.
ImputeCC is an innovative binning tool that integrates Hi-C interactions with single-copy marker
genes through a constrained random walk with restart algorithm, refining contig binning, enhancing Hi-C connectivity, and boosting the recovery of high-quality MAGs from complex microbial communities.
ViralCC, designed to recover complete viral genomes and identify virus-host pairs from metagenomic Hi-C data, incorporates a virus-host proximity structure for more accurate binning, outperforming existing
methods in retrieving viral genomes with higher completeness across various ecosystems.
These innovations underscore the potential of metaHi-C in elucidating microbial interactions and community dynamics, offering new perspectives in health sciences and microbial ecology. The work sets a
foundation for future research aimed at refining these methods and exploring their applications in real-world problems, such as disease treatment, environmental monitoring, and beyond. As we continue to
unravel the complexities of microbial ecosystems, the tools developed here will be pivotal in advancing
our understanding of the microbiome.
6.1 Future work
Despite the recent advancement in the development of tools tailored for metaHi-C, there are still several
unsolved questions:
6.1.1 Metagenomic assembly using both shotgun reads and Hi-C reads
Current contig assembly methods employed in metagenomic Hi-C workflows only rely on short or long
reads from shotgun libraries [36, 121]. Consequently, it is imperative to design a novel contig assembly
method that harnesses the synergy of both shotgun and Hi-C reads. This novel approach will be tailored
to accommodate diverse sequencing platforms for shotgun libraries, encompassing widely used Illumina
short-read, Nanopore R9/10 long-read, and the PacBio HiFi sequencing platforms.
6.1.2 Reference-based metaHi-C analyses for the human gut environment
An essential scientific inquiry within the field of metaHi-C analysis pertains to the identification of active
phage-host and plasmid-host interactions. Presently, all existing metaHi-C data analysis pipelines rely on
contig assembly as the basis for detecting viral and plasmid contigs, subsequently associating them with
host contigs through Hi-C interactions [79, 161]. However, assembly-based approaches exhibit limitations
in their ability to detect low-abundance phages, primarily due to challenges in assembling genomes from
low-abundance taxa within a single sample [13]. Leveraging recent advancements in the construction of
comprehensive databases containing human gut bacterial, archaeal, viral genomes, and plasmid references
[20, 45, 108], we are inclined to explore the development of a reference-based metaHi-C analysis framework
tailored for the human gut ecosystem.
6.1.3 Imputing metaHi-C contact matrix using graph neural network
Previous experiments have highlighted the difficulties in recovering species with low Hi-C coverage, often
correlated with lower abundances [35]. Therefore, the exploration of imputation techniques for metaHi-C
contact matrices is important to predict the missing Hi-C interactions due to the sequencing depth. To
realize the imputation, the sequence composition of contigs can help. Specifically, we intend to explore
the Graph Neural Network (GNN)-based imputation methods to enhance our understanding of the local
topological structures within the Hi-C network. Here, the k-mer sequence composition matrix will act as
an attributed feature matrix within the GNN model, enriching the analysis with detailed sequence information.
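As a rough illustration of the feature matrix mentioned above, the sketch below computes a normalized k-mer frequency vector per contig. The choice of k, the handling of ambiguous bases, and the function name `kmer_composition` are assumptions for this example; no GNN model is implemented here.

```python
from itertools import product


def kmer_composition(seq, k=4):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries ordered lexicographically over A, C, G, T.
    k-mers containing other symbols (e.g. N) are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Hypothetical toy contigs; each row would be one node's feature vector.
features = [kmer_composition(s, k=2) for s in ["ACGTACGT", "AAAAAA"]]
```

Stacking one such vector per contig yields a contigs-by-k-mers matrix that could serve as node attributes alongside the Hi-C contact graph.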
Chapter 7
Supplementary materials
7.1 Supplementary materials for Chapter 2
7.1.1 Initial processing
We applied a standard cleaning pipeline on both WGS and Hi-C datasets using bbduk from the BBTools
suite (v37.25) [19]. Adaptor sequences were removed by bbduk with parameters ‘ktrim=r k=23 mink=11
hdist=1 minlen=50 tpe tbo’ and reads were quality-trimmed with parameters ‘trimq=10 qtrim=r ftm=5
minlen=50’ using bbduk. Then, the first 10 nucleotides of each read were trimmed by bbduk with parameter
‘ftl=10’.
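The three bbduk passes can be assembled as command lines along the following lines. This is a sketch under stated assumptions: the trimming parameters are those quoted above, while the file names, the adapter reference passed via `ref=`, the intermediate naming scheme, and the helper name `bbduk_commands` are hypothetical. The commands are only built here, not executed.

```python
def bbduk_commands(r1, r2, adapters, prefix):
    """Build the three bbduk.sh invocations as argument lists.

    r1, r2: paired-end input files; adapters: adapter FASTA (assumed);
    prefix: stem used for the intermediate and final output names (assumed).
    """
    steps = [
        # Adapter trimming with the parameters quoted in the text.
        ("adapter", [f"ref={adapters}", "ktrim=r", "k=23", "mink=11",
                     "hdist=1", "minlen=50", "tpe", "tbo"]),
        # Quality trimming.
        ("quality", ["trimq=10", "qtrim=r", "ftm=5", "minlen=50"]),
        # Removal of the first 10 nucleotides of each read.
        ("ftl", ["ftl=10"]),
    ]
    commands = []
    in1, in2 = r1, r2
    for tag, extra in steps:
        out1 = f"{prefix}.{tag}.R1.fq.gz"
        out2 = f"{prefix}.{tag}.R2.fq.gz"
        commands.append(
            ["bbduk.sh", f"in={in1}", f"in2={in2}", f"out={out1}", f"out2={out2}"]
            + extra
        )
        in1, in2 = out1, out2  # the next pass reads this pass's output
    return commands


cmds = bbduk_commands("sample_R1.fq.gz", "sample_R2.fq.gz", "adapters.fa", "sample")
```

Each list could be handed to `subprocess.run(cmd, check=True)` on a machine where BBTools is installed.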
7.1.2 Shotgun assembly and Hi-C read alignment
For the shotgun dataset, de novo assembly was produced by MEGAHIT [90] with parameters ‘--min-contig-len 300 --k-min 21 --k-max 141 --k-step 12 --merge-level 20,0.95’ and contigs shorter than 1 kb were discarded.
For the Hi-C dataset, only paired reads were kept for the downstream analysis. All PCR optical and
tile-edge duplicates for Hi-C paired-end reads were removed by ‘clumpify.sh’ from BBTools suite [19] with
default parameters. Processed Hi-C paired-end reads were mapped to assembled contigs using BWA-MEM
[91] with parameters ‘-5SP’. Then, samtools [93] with parameters ‘view -F 0x904’ was applied to the
resulting BAM files to remove unmapped reads (0x4) and supplementary (0x800) and secondary (0x100)
alignments. Alignments with low quality (<30 nucleotide match length or mapping score <30) were also
filtered out. By this means, 4,700,202 read pairs were mapped to different contigs for the synthetic M-Y
samples.
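The exclusion mask 0x904 is the bitwise OR of the three SAM flags named above. The short sketch below makes that arithmetic explicit and applies the two quality thresholds; the function name `keep_alignment` is hypothetical, and how the match length is derived from an alignment record is left as an assumption.

```python
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800

# Combined mask used with `samtools view -F`.
EXCLUDE_MASK = UNMAPPED | SECONDARY | SUPPLEMENTARY


def keep_alignment(flag, mapq, match_len, min_mapq=30, min_match=30):
    """Return True when an alignment passes the flag and quality filters.

    flag: SAM FLAG field; mapq: mapping quality; match_len: number of
    matched nucleotides (assumed to be precomputed from the record).
    """
    if flag & EXCLUDE_MASK:
        return False
    return mapq >= min_mapq and match_len >= min_match


examples = [
    keep_alignment(flag=99, mapq=60, match_len=100),         # properly paired read
    keep_alignment(flag=99 | 0x800, mapq=60, match_len=100),  # supplementary
    keep_alignment(flag=99, mapq=20, match_len=100),         # low mapping score
]
```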
7.1.3 Contact map generation
As the contact map reflects the proximity between contigs, only read pairs whose ends aligned to different
contigs were kept to generate the contact map. Raw contig–contig interactions were aggregated
as contacts by counting the number of alignments linking two contigs. Contigs to which no Hi-C reads were
aligned were discarded.
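The aggregation step can be illustrated with a small counting sketch. It assumes each retained read pair has already been reduced to the two contig names its ends aligned to; the function name `build_contact_map` and the toy pairs are hypothetical.

```python
from collections import Counter


def build_contact_map(read_pairs):
    """Count Hi-C read pairs linking two different contigs.

    read_pairs: iterable of (contig of read 1, contig of read 2).
    Returns a Counter keyed by an order-independent contig pair.
    """
    contacts = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a == contig_b:
            continue  # intra-contig pairs carry no contig-contig signal
        contacts[tuple(sorted((contig_a, contig_b)))] += 1
    return contacts


# Hypothetical toy read pairs (not data from the study).
pairs = [("k1", "k2"), ("k2", "k1"), ("k1", "k1"), ("k2", "k3")]
contacts = build_contact_map(pairs)

# Contigs that take part in at least one inter-contig contact.
linked = {contig for pair in contacts for contig in pair}
```

Sorting each pair before counting makes the map symmetric, so (k1, k2) and (k2, k1) contribute to the same entry.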
7.1.4 Annotating contigs by reference genomes for the M-Y samples
In order to explore experimental biases, the reference genomes of 16 yeast strains were downloaded (Supplementary Table 7.1). As the analysis was performed at the species level, the genomes of four strains (FY, CEN.PK,
RM11-1A and SK1) from the same species (S. cerevisiae) were combined into one reference genome. Then,
all contigs were aligned to the 13 reference genomes of all known species by BLASTn [6] with parameters ‘-perc_identity 95 -evalue 1e-30 -word_size 50’. Hence, the true species of each assembled contig
was determined if there existed any alignment of the contig to the species’ reference genome; the
placement of the alignment was ignored [18].
Genus | Species | Strain in sample | Reference strain
Saccharomyces | cerevisiae | FY4H | FY
Saccharomyces | cerevisiae | CEN.PK | CEN.PK
Saccharomyces | cerevisiae | RM11-1A | RM11-1A
Saccharomyces | cerevisiae | SK1 | SK1
Saccharomyces | paradoxus | YDG613 |
Saccharomyces | mikatae | FM356 | IFO 1815
Saccharomyces | kudriavzevii | FM527 | IFO 1802
Saccharomyces | bayanus var. uvarum | YZB5-113 | CBS 7001
Naumovozyma | castellii | 4310 | NRRL Y-12630
Lachancea | waltii | Kwaltii ura3 | NRRL Y-8285
Lachancea | kluyveri | FM628 | CBS 3082
Kluyveromyces | lactis | MW98-8C | NRRL Y-1140
Kluyveromyces | wickerhamii | Y-8286 | UCD 54-210
Ashbya | gossypii | WT | ATCC 10895
Scheffersomyces | stipitis | Y-11545 | CBS 6054
Pichia | pastoris | JC308 | GS115
Table 7.1: M-Y species list in the sample.
7.2 Supplementary materials for Chapter 3
7.2.1 The respective performances of HiCzin and HiCBin are markedly deteriorated
when only a small fraction of assembled contigs can be annotated
Since both the HiCzin normalization and the HiCBin binning methods require annotating contigs at the
species level by TAXAassign to fit their models, we would like to explore how the fraction of annotated
contigs affects the respective performances of HiCzin and HiCBin on a synthetic yeast metaHi-C dataset.
Details of processing the raw data are given in the Methods section of the main text. Notably, the fraction of annotated contigs from the yeast dataset is much larger than that from other metaHi-C datasets
(Supplementary Table 7.2).
We further downsampled annotated contigs utilized in fitting the HiCzin and HiCBin models to 1% on
the yeast dataset, which was not extreme considering that even fewer than 1% of assembled contigs could
be labeled on the cow fecal and sheep gut metaHi-C datasets (Supplementary Table 7.2). We conducted
five rounds of downsampling and assessed the respective performances of HiCzin and HiCBin. The performance of HiCzin was measured by the Pearson correlation coefficients between normalized Hi-C contacts
and three factors of systematic biases while the results of HiCBin were evaluated using three clustering
metrics, including F-score, ARI, and NMI. As shown in Supplementary Tables 7.9 and 7.10, the respective
performances of HiCzin and HiCBin were markedly deteriorated when only one percent of assembled
contigs could be annotated.
7.2.2 Polishing HiFi assemblies using accurate short reads did not improve the binning
performance on the sheep gut dataset
Pilon [155] polishing of the HiFi assembly from the sheep gut long-read metaHi-C dataset was accomplished using the Illumina short reads derived from the same sheep gut sample with the ‘--fix indels --nostrays’ setting. We then aligned paired-end Hi-C reads to polished contigs as described in the Methods
section of the main text and binned contigs using the MetaCC framework. As shown in Supplementary
Figure 7.2, the polishing step did not substantially improve the binning results.
7.2.3 The spurious contact removal step with default threshold consistently improved
the downstream binning results
To assess the impact of the spurious contact detection step using our default threshold on the subsequent
binning process, we executed the MetaCC pipeline without the spurious contact removal step. As a result, the MetaCC binning without spurious contact detection retrieved 75, 101, 6, and 412 near-complete
MAGs from the human gut, wastewater, cow rumen, and sheep gut metaHi-C datasets, respectively. With
the inclusion of the spurious contact detection step, these numbers were improved to 79, 103, 8, and 417,
respectively. Moreover, the total count of high-quality MAGs recovered from the four datasets also
increased from 118, 205, 68, and 696 to 124, 209, 71, and 708, respectively, after the spurious contact removal. These results demonstrated that the spurious contact removal step consistently improved the
downstream binning outcomes for real datasets.
7.2.4 A standard read cleaning procedure
Adaptor sequences were removed by bbduk from the BBTools suite (v37.25) [19] with parameters ‘ktrim=r
k=23 mink=11 hdist=1 minlen=50 tpe tbo’ and reads were quality-trimmed using bbduk with parameters
‘trimq=10 qtrim=r ftm=5 minlen=50’. Then, the first 10 nucleotides of each read were trimmed by bbduk
with parameter ‘ftl=10’. Identical PCR optical and tile-edge duplicates for Hi-C paired-end reads were
removed by the script ‘clumpify.sh’ from the BBTools suite (v37.25) with default parameters.
7.2.5 Estimating the number of genomes in the metagenomic data using single-copy
marker genes
Following the strategy in [159], we utilized single-copy marker genes to estimate the number of genomes
in the microbial sample. To predict genes from the contigs, FragGeneScan [127] was employed, and the
predicted genes were scanned using HMMER3 (v3.3.2) [41] with parameter ‘--cut_tc’ to identify 107 single-copy marker genes that are conserved in 95% of sequenced bacteria [3]. After filtering out genes that
did not meet the coverage threshold (set at 40%), we determined the number of genomes present in the
metagenomic data, k, as the median number of contigs containing each of the marker genes. This step
accounted for the possibility of marker genes being fragmented into multiple pieces, which could affect
the estimation of the number of genomes.
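The median-based estimate of k can be sketched as follows. The input is assumed to be the set of contigs carrying a hit for each marker gene after the coverage filter; the function name `estimate_genome_count`, the marker names, and the toy hits are hypothetical, and the treatment of markers with no hits is not addressed here.

```python
from statistics import median


def estimate_genome_count(marker_hits):
    """Estimate the number of genomes from single-copy marker genes.

    marker_hits: dict mapping a marker gene to the contigs on which it was
    detected. Returns the median number of distinct contigs per marker.
    """
    counts = [len(set(contigs)) for contigs in marker_hits.values()]
    return median(counts)


# Hypothetical toy hits (not data from the study).
toy_hits = {
    "marker_1": ["c1", "c2", "c3"],
    "marker_2": ["c1", "c4"],
    "marker_3": ["c2", "c5", "c6"],
}
k = estimate_genome_count(toy_hits)
```

Using the median rather than the maximum limits the influence of a marker gene that happens to be split across several contigs.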
7.2.6 Identifying the species identity of contigs on the synthetic yeast dataset
We first downloaded the reference genomes of all 16 yeast strains from 13 yeast species in the synthetic
yeast sample (Supplementary Table 7.11). As the analyses were made at the species level, the genomes
of four strains (FY, CEN.PK, RM11-1A and SK1) from the same species (Saccharomyces cerevisiae) were
combined into one reference genome. Then, all contigs were aligned to those 13 reference genomes of all
known species by BLAST [73] with parameters ‘-perc_identity 95 -evalue 1e-30 -word_size 50’. The true
species identity of the assembled contigs could be determined if there existed any alignment of the contigs
to the species’ reference genome (Supplementary Figure 7.3).
[Figure 7.1 plot: x-axis: Hi-C Contact Categories (Inter-species Hi-C Contacts, Intra-species Hi-C Contacts); y-axis: Normalized Contacts by NormCC (log scale); annotation: p < 2.22e-16.]
Figure 7.1: The inter-species versus intra-species Hi-C contacts within the NormCC-normalized Hi-C contact matrix from the synthetic yeast metaHi-C dataset. The y-axis represents the logarithmically scaled
values of normalized Hi-C contacts by NormCC. An unpaired t test was conducted to compare the values
between 393,228 normalized intra-species Hi-C contacts and 125,860 inter-species Hi-C contacts. The resulting p-value is less than 2.22e-16, indicating that the magnitude of intra-species contacts significantly
surpasses that of spurious inter-species contacts.
[Figure 7.2 plot: Sheep gut long-read metaHi-C dataset. Bars: Polished, Unpolished; x-axis: Number of bins (0 to 600); legend: Moderately complete, Substantially complete, Near-complete.]
Figure 7.2: The number of high-quality bins retrieved by MetaCC binning based on unpolished or polished
contigs from the sheep gut long-read metaHi-C dataset. Without polishing HiFi assembly using accurate
short reads (unpolished), MetaCC binning could retrieve 417, 162, and 130 near-complete, substantially
complete, and moderately complete bins, respectively. In contrast, after polishing (polished), only 416, 153,
and 131 near-complete, substantially complete, and moderately complete bins were recovered by MetaCC
binning, respectively, indicating that the polishing step did not substantially improve the binning results.
[Figure 7.3 plot: x-axis: Species (A.gossypii, K.lactis, K.wickerhamii, L.kluyveri, L.waltii, N.castellii, P.pastoris, S.bayanus, S.cerevisiae, S.kudriavzevii, S.mikatae, S.paradoxus, S.stipitis); y-axis: Number of assembled contigs (0 to 2000).]
Figure 7.3: The numbers of assembled contigs from 13 species in the synthetic yeast metaHi-C dataset.
The full species names in the x-axis are shown in Supplementary Table 7.11. Most of the assembled contigs
belong to the genus Saccharomyces.
Dataset | The fraction of annotated contigs
Yeast short-read metaHi-C | 58.3%
Human gut short-read metaHi-C | 11.4%
Wastewater short-read metaHi-C | 8.2%
Cow fecal long-read metaHi-C | 0.1%
Sheep gut long-read metaHi-C | 0.8%
Table 7.2: The fractions of annotated contigs for HiCzin and HiCBin on different metaHi-C datasets. Fewer
than 1% of assembled contigs could be successfully labeled on both long-read metaHi-C datasets.
Dataset | Size of shotgun libraries (Gbp) | Size of Hi-C libraries (Gbp)
Human gut | 37.9 | 25.9
Wastewater | 81.3 | 28.8
Cow rumen | 52 | 10.1
Sheep gut | 255 | 32.3
Synthetic yeast | 17.3 | 16.2
Table 7.3: The size of shotgun and Hi-C libraries from raw metaHi-C datasets.
Dataset | The number of contigs | Average length (bp) | Total length (bp)
Human gut | 105,267 | 5,044 | 530,969,816
Wastewater | 752,580 | 2,539 | 1,910,562,642
Cow rumen | 77,670 | 13,859 | 1,076,426,242
Sheep gut | 47,246 | 90,466 | 4,274,155,803
Synthetic yeast | 6,566 | 19,194 | 126,030,343
Table 7.4: The assembly statistics of contigs from all datasets.
Dataset | HiCzin | NormCC
Yeast | 15 min 46 s (6.5 h) | 13 s
Human gut | 104 min 23 s (7.5 h) | 22 s
Wastewater | 112 h 31 min (17.7 h) | 2 min 30 s
Cow rumen | NA | 15 s
Sheep gut | 13 s (33.1 h) | 4 s
Table 7.5: Comparison of the running time between NormCC and HiCzin on different metaHi-C datasets.
Values in the parentheses represent extra time consumed by HiCzin on preparing the input data. NA means
that HiCzin failed to converge on the cow rumen metaHi-C dataset.
The orders of high-quality MAGs | The number of MAGs
Oscillospirales | 287
Christensenellales | 109
Bacteroidales | 89
Lachnospirales | 56
RF39 | 36
TANB77 | 16
Victivallales | 10
Erysipelotrichales | 8
Peptostreptococcales | 7
Desulfovibrionales | 7
RFN20 | 6
Verrucomicrobiales | 5
RF32 | 5
UBA1381 | 5
Pirellulales | 4
UBA4068 | 4
Coriobacteriales | 3
Monoglobales | 3
ML615J-28 | 3
RFP12 | 3
Acholeplasmatales | 2
Acidaminococcales | 2
Burkholderiales | 2
Treponematales | 2
HGM11327 | 2
UBA7702 | 2
UMGS1883 | 2
Others | 15
Table 7.6: Taxonomic statistics of 709 high-quality MAGs retrieved by MetaCC binning from the sheep gut
dataset. MAGs were annotated by GTDB-Tk at the order level.
The orders of high-quality MAGs | The number of plasmid contigs included
Oscillospirales | 39
Bacteroidales | 19
Erysipelotrichales | 13
RFP12 | 9
Lachnospirales | 4
Christensenellales | 3
RF39 | 3
Peptostreptococcales | 2
Enterobacterales | 1
Burkholderiales | 1
RUG12999 | 1
DTUO25 | 1
Coriobacteriales | 1
UBA1381 | 1
HGM11514 | 1
Table 7.7: The distribution of plasmid contigs among high-quality MAGs retrieved by MetaCC binning
from the sheep gut dataset.
Plasmid contig The coverage of plasmid contigs The coverage of their respective MAGs
contig_24425 68.94 22.38
contig_61128 158.20 33.93
Table 7.8: The plasmid contigs whose coverage is more than twice the mean coverage of their respective
MAGs retrieved by MetaCC binning from the sheep gut dataset.
Yeast Site Length Coverage
Without downsampling 0.001 0.001 0.069
Downsampling 1 0.002 0.003 0.207
Downsampling 2 0.001 0.001 0.272
Downsampling 3 0.002 0.002 0.329
Downsampling 4 0.393 0.386 0.006
Downsampling 5 0.004 0.005 0.322
Table 7.9: The normalization performance of HiCzin when only 1% of contigs were annotated on the yeast
dataset.
Yeast F-score ARI NMI
Without downsampling 0.908 0.894 0.895
Downsampling 1 0.610 0.483 0.697
Downsampling 2 0.761 0.719 0.803
Downsampling 3 0.612 0.485 0.708
Downsampling 4 0.757 0.698 0.848
Downsampling 5 0.850 0.828 0.839
Table 7.10: The binning performance of HiCBin when only 1% of contigs were annotated on the yeast
dataset.
Genus Species Strain in sample Reference strain
Saccharomyces cerevisiae FY4H FY
Saccharomyces cerevisiae CEN.PK CEN.PK
Saccharomyces cerevisiae RM11-1A RM11-1A
Saccharomyces cerevisiae SK1 SK1
Saccharomyces paradoxus YDG613 YDG613
Saccharomyces mikatae FM356 IFO 1815
Saccharomyces kudriavzevii FM527 IFO 1802
Saccharomyces bayanus var. uvarum YZB5-113 CBS 7001
Naumovozyma castellii 4310 NRRL Y-12630
Lachancea waltii Kwaltii ura3 NRRL Y-8285
Lachancea kluyveri FM628 CBS 3082
Kluyveromyces lactis MW98-8C NRRL Y-1140
Kluyveromyces wickerhamii Y-8286 UCD 54-210
Ashbya gossypii WT ATCC 10895
Scheffersomyces stipitis Y-11545 CBS 6054
Pichia pastoris JC308 GS115
Table 7.11: The species list in the synthetic yeast sample.
7.3 Supplementary materials for Chapter 4
7.3.1 ImputeCC’s genus-level analysis unveiled key genera and species expansion in
the sheep gut microbiota
ImputeCC’s genus-level analysis, leveraging its retrieval of 408 high-quality MAGs, has unveiled significant insights into the microbial composition of the sheep gut microbiota. Within this complex ecosystem,
Bacteroides emerges as one of the dominant bacterial genera, well-recognized for its potential influence on
the intestinal immune system [129, 164]. ImputeCC’s distinctive capabilities stood out as it successfully
recovered two critical species from the Bacteroides genus, specifically Bacteroides uniformis and Bacteroides
vulgatus, within the sheep gut environment. B. uniformis has garnered attention for its reported role in
ameliorating immunological dysfunctions and metabolic disorders, often associated with intestinal dysbiosis [47]. In contrast, B. vulgatus assumes vital roles in reducing the production of gut microbial lipopolysaccharides and inhibiting atherosclerosis [167]. Notably, among high-quality MAGs, while MetaCC managed
to detect the presence of B. vulgatus, other binning tools failed to identify the genus Bacteroides from the
sheep gut dataset. ImputeCC’s distinctive capability also emerged as it was the only method that could
detect the Tidjanibacter genus, a relatively new and less-studied taxonomic group [160]. This discovery
creates opportunities for more research on this genus, offering the potential for exploring its ecological
roles within the sheep gut environment. Within the Rikenellaceae family, ImputeCC’s analysis illuminated
the prevalence and diversity of the Alistipes genus, which is predominantly found in the healthy human gastrointestinal tract [118, 137]. Specifically, ImputeCC retrieved 17 high-quality
MAGs affiliated with Alistipes, compared to the 4, 3, and 9 high-quality MAGs recovered by MetaTOR,
bin3C, and MetaCC, respectively. Among these 17 MAGs, Alistipes senegalensis emerged as a noteworthy
species, recognized for its involvement in mannose fermentation [105], suggesting a role for members
of the Alistipes genus within the sheep gastrointestinal tract’s intricate ecosystem. Furthermore, ImputeCC’s analysis unveiled five high-quality MAGs within the Alistipes genus that could not be annotated
at the species level by GTDB-TK, suggesting the potential expansion of species diversity within the Alistipes genus. Additional experiments are necessary to gather further data on the phenotypic and physical
characteristics of these uncultured members before their definitive identification can be achieved. In conclusion, all these findings underscore the unique efficacy of ImputeCC in advancing our understanding of
microbial ecosystems by characterizing the sheep gut microbiota’s taxonomic composition and functional
potential.
7.3.2 Filtering the incomplete reference genomes at the species level from the mock
microbial community
The reference genomes of all species in the mock microbial community [102] can be downloaded from
https://forgemia.inra.fr/metagenopolis/benchmark_mock/-/blob/main/reference/MOCK_001.fasta.gz. For
these 71 reference genomes, the species Methanococcus maripaludis includes two strains (S2 and C5), and
the species Shewanella baltica also comprises two strains (OS185 and OS223). Given our focus on species-level analyses, we randomly selected the reference genomes of Methanococcus maripaludis S2 and
Shewanella baltica OS185 to represent their respective species. Additionally, reference genomes of the
species Desulfovibrio piger, Methanomassiliicoccus luminyensis, and Methanobrevibacter oralis DSM 7256
were excluded from consideration due to their incomplete status. Consequently, we acquired a set of 66
complete reference genomes at the species level for the following experiments.
7.3.3 A standard read cleaning procedure
The removal of adaptor sequences was executed using the bbduk tool from the BBTools suite (v37.25) [19]
with the following settings: ‘ktrim=r k=23 mink=11 hdist=1 minlen=50 tpe tbo’. Subsequently, reads underwent a quality-trimming process using bbduk with the parameters ‘trimq=10 qtrim=r ftm=5 minlen=50’.
Following this, the initial 10 nucleotides of each read were trimmed using bbduk with the parameter
‘ftl=10’. Finally, identical PCR optical and tile-edge duplicates among Hi-C paired-end reads were removed using the ‘clumpify.sh’ script from the BBTools suite (v37.25) with
default settings.
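The four cleaning steps can be written down as command lines. The following Python sketch only assembles the argument lists and does not run anything. The parameter strings are the ones quoted in the text; the file naming scheme, the `adapter_ref` path, and the helper name `read_cleaning_commands` are hypothetical, and the use of the BBTools `in=`, `in2=`, `out=`, `out2=`, and `ref=` arguments in this layout is an assumption about how the runs were organized.

```python
from typing import List


def read_cleaning_commands(sample: str, adapter_ref: str) -> List[List[str]]:
    """Build BBTools command lines for the four read-cleaning steps (sketch only)."""

    def io(src: str, dst: str) -> List[str]:
        # Paired-end input/output files for one step (hypothetical naming scheme).
        return [
            f"in={sample}.{src}_1.fastq.gz",
            f"in2={sample}.{src}_2.fastq.gz",
            f"out={sample}.{dst}_1.fastq.gz",
            f"out2={sample}.{dst}_2.fastq.gz",
        ]

    # Step 1: adaptor removal (adapter_ref is an assumed path to adaptor sequences).
    adaptor = ["bbduk.sh", *io("raw", "adaptor"), f"ref={adapter_ref}",
               "ktrim=r", "k=23", "mink=11", "hdist=1", "minlen=50", "tpe", "tbo"]
    # Step 2: quality trimming.
    quality = ["bbduk.sh", *io("adaptor", "quality"),
               "trimq=10", "qtrim=r", "ftm=5", "minlen=50"]
    # Step 3: trim the first 10 nucleotides of each read.
    head = ["bbduk.sh", *io("quality", "head"), "ftl=10"]
    # Step 4: duplicate removal with clumpify.sh, default settings as stated in the text.
    dedup = ["clumpify.sh", *io("head", "clean")]
    return [adaptor, quality, head, dedup]


if __name__ == "__main__":
    for command in read_cleaning_commands("sample", "adapters.fa"):
        print(" ".join(command))
```

Each list can be passed to `subprocess.run` once the paths are adapted to the local setup.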
7.3.4 The assembly of shotgun reads
For shotgun reads from short-read metaHi-C datasets, we employed MEGAHIT (v1.2.9) [90] to assemble
shotgun reads into contigs with the parameters ‘--k-min 21 --k-max 141 --k-step 12 --merge-level 20,0.95
--min-contig-len 1000’.
Shotgun reads from the mock Nanopore and mock PacBio long-read metaHi-C datasets
were assembled with metaFlye (v2.9-b1768) [85] using ‘--meta --nano-raw’ for Nanopore reads and ‘--meta
--pacbio-hifi’ for PacBio reads. The contigs obtained from the two real long-read metaHi-C datasets (cow
rumen and sheep gut) were directly provided by the original authors and downloaded for analysis. For the
cow rumen dataset, Bickhart et al. [15] initially assembled PacBio raw reads using Canu (v1.6+101 changes,
r8513) [87] and further refined the assembly with two rounds of Illumina data polishing using Pilon
[155]. The finalized assembly is accessible at https://figshare.com/articles/usda_pacbio_second_pilon_indelsonly_fa_gz/8323154.
In the case of the sheep gut dataset [14], an updated assembly of PacBio HiFi
long reads was provided by the original authors, who used metaFlye (v2.9) with default parameters. This
updated assembly is available at https://doi.org/10.5281/zenodo.5228989 under the file ‘flye.v29.sheep_gut.hifi.250g.fasta.gz’.
7.3.5 Identifying single-copy marker genes from assembled contigs
We detected single-copy marker genes within assembled contigs following the strategy in [159]. Specifically,
FragGeneScan [127] was employed to predict genes from the contigs, and the predicted genes were scanned
using HMMER3 (v3.3.2) [41] with the parameter ‘--cut_tc’ to identify 107 single-copy marker genes that are
conserved in 95% of sequenced bacteria.
7.3.6 The modularity function of the Leiden algorithm used in ImputeCC
Let G denote the Hi-C contact graph, where vertices represent assembled contigs, and edge weights denote
the normalized Hi-C contacts between these contigs. The modularity function based on Reichardt and
Bornholdt’s Potts model [124] used in ImputeCC is
\[
\sum_{\{i,j \mid c_i = c_j\}} \left( h_{ij} - r \, \frac{d_i d_j}{2n} \right), \tag{7.1}
\]
where $c_i$ and $c_j$ denote the community index of contig $i$ and contig $j$, respectively; $h_{ij}$ represents the edge
weight (i.e., normalized Hi-C contacts) between contigs $i$ and $j$; $d_i$ and $d_j$ refer to the degree of contig $i$
and contig $j$ in the graph $G$, respectively; $n$ is the total number of edges in the graph $G$; and $r$ represents the
resolution parameter.
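To make Equation (7.1) concrete, the sketch below evaluates it for a given partition of a small graph. It is an illustration under stated assumptions, not ImputeCC's implementation: the sum is taken over all ordered pairs of contigs in the same community, the degree is the weighted degree (sum of incident edge weights), and $n$ is the total edge weight, which coincides with the number of edges when all weights equal 1. The function name `rb_modularity` and the toy graph are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple


def rb_modularity(edges: Dict[Tuple[Hashable, Hashable], float],
                  community: Dict[Hashable, int],
                  resolution: float = 1.0) -> float:
    """Evaluate sum_{c_i = c_j} (h_ij - r * d_i * d_j / (2n)) for an undirected graph.

    `edges` maps each undirected edge (i, j), i != j, to its weight h_ij.
    """
    degree = defaultdict(float)
    total_weight = 0.0
    for (i, j), weight in edges.items():
        degree[i] += weight
        degree[j] += weight
        total_weight += weight

    # Observed term: each within-community edge appears as (i, j) and (j, i).
    observed = sum(2.0 * w for (i, j), w in edges.items()
                   if community[i] == community[j])

    # Null-model term: sum over ordered pairs of d_i * d_j equals the squared
    # sum of degrees inside each community.
    degree_per_community = defaultdict(float)
    for node, d in degree.items():
        degree_per_community[community[node]] += d
    expected = sum(s * s for s in degree_per_community.values()) / (2.0 * total_weight)

    return observed - resolution * expected


if __name__ == "__main__":
    # Two triangles joined by a single edge, all weights equal to 1.
    toy_edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0,
                 (3, 4): 1.0, (4, 5): 1.0, (3, 5): 1.0, (2, 3): 1.0}
    two_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
    print(rb_modularity(toy_edges, two_groups))
```

Raising `resolution` penalizes large communities more strongly, which is why the parameter controls the granularity of the resulting bins.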
7.3.7 Estimation of the precision and recall for contig bins based on lineage-specific
genes
Similar to [119], we evaluate the precision and recall of a given bin based on lineage-specific genes. Specifically, the bin’s precision is defined as
\[
\max\left\{ 0,\; 100 - \frac{\sum_{s \in M} \frac{\sum_{g \in s} C_g}{|s|}}{|M|} \right\}, \tag{7.2}
\]
where $M$ is the set of all collocated gene sets constructed by [119]; $s$ is a set of collocated genes from $M$;
$C_g$ takes on the value of $N - 1$ when a gene $g$ is identified $N$ times ($N \geq 1$), and it equals 0 when the
gene is missing.
The bin’s recall is calculated as
\[
\frac{\sum_{s \in M} \frac{|s \cap G_M|}{|s|}}{|M|}, \tag{7.3}
\]
where $G_M$ is the set of genes identified in a bin.
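The two estimates can be computed directly from the collocated gene sets and the number of times each gene is found in a bin. The sketch below transcribes Equations (7.2) and (7.3) as written, without any rescaling of the contamination term; the function names and the toy gene sets are hypothetical.

```python
from typing import Dict, List, Set


def bin_precision(marker_sets: List[Set[str]], gene_counts: Dict[str, int]) -> float:
    """Equation (7.2): max{0, 100 - mean over sets of (sum of C_g) / |s|}."""
    total = 0.0
    for gene_set in marker_sets:
        # C_g = N - 1 if gene g is found N >= 1 times, and 0 if it is missing.
        extra_copies = sum(max(gene_counts.get(g, 0) - 1, 0) for g in gene_set)
        total += extra_copies / len(gene_set)
    return max(0.0, 100.0 - total / len(marker_sets))


def bin_recall(marker_sets: List[Set[str]], gene_counts: Dict[str, int]) -> float:
    """Equation (7.3): mean over sets of |s ∩ G_M| / |s|."""
    found = {g for g, n in gene_counts.items() if n >= 1}  # G_M
    return sum(len(s & found) / len(s) for s in marker_sets) / len(marker_sets)


if __name__ == "__main__":
    sets = [{"geneA", "geneB"}, {"geneC"}]
    counts = {"geneA": 1, "geneB": 2}  # geneC is missing, geneB is duplicated
    print(bin_precision(sets, counts), bin_recall(sets, counts))
```

Averaging within each collocated set before averaging across sets keeps a cluster of neighbouring genes from being counted several times when they are gained or lost together.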
7.3.8 Identifying the species identity of contigs on the mock metaHi-C dataset
All assembled contigs from the mock Illumina, Nanopore, and PacBio datasets were aligned to 66 reference
genomes by BLAST [73] with parameters ‘-perc_identity 95 -evalue 1e-30 -word_size 50’. The true species
identity of the assembled contigs could be determined if there existed any alignment of the contigs to the
species’ reference genome.
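The assignment rule above amounts to collecting, for every contig, the species whose reference genome received at least one alignment. The sketch below assumes the BLAST results were written in the default tabular format (`-outfmt 6`, where the first two columns are the query and subject identifiers) and that a lookup from reference sequence identifier to species is available; neither detail is stated in the text, and the names used here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def species_of_contigs(blast_rows: Iterable[str],
                       subject_species: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each contig to the species of every reference sequence it aligned to."""
    hits = defaultdict(set)
    for row in blast_rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank and comment lines
        fields = row.split("\t")
        contig, subject = fields[0], fields[1]
        hits[contig].add(subject_species[subject])
    return dict(hits)


if __name__ == "__main__":
    rows = [
        "contig_1\tref_A\t99.1\t1200\t10\t0\t1\t1200\t5\t1204\t0.0\t2100",
        "contig_1\tref_A\t97.0\t300\t9\t0\t1300\t1599\t9000\t9299\t1e-140\t500",
        "contig_2\tref_B\t96.4\t800\t29\t0\t1\t800\t40\t839\t0.0\t1300",
    ]
    lookup = {"ref_A": "Species A", "ref_B": "Species B"}
    print(species_of_contigs(rows, lookup))
```

Contigs absent from the returned dictionary had no qualifying alignment and therefore have no known species identity.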
[Figure 7.4 image: bar charts in three panels (a, b, c), each with one chart for the human gut short-read metaHi-C dataset and one for the wastewater short-read metaHi-C dataset. Axes: binning method (VAMB, HiCBin, MetaTOR, bin3C, MetaCC, ImputeCC) versus number of bins (a, b) or number of species (c).]
Figure 7.4: Benchmarking using the real human gut and wastewater short-read metaHi-C datasets. (a) The
number of MAGs with varying completeness (comp) and contamination (cont) ≤ 5%. ImputeCC consistently outperforms other binning tools, producing a greater number of high-quality bins in both short-read
metaHi-C datasets. (b) The number of MAGs with varying completeness and contamination ≤ 10%. ImputeCC returned more medium-quality bins when compared to alternative methods for both datasets. (c)
Comparative analysis of the taxonomic diversity at the species level within medium-quality bins obtained
by different binning tools. ImputeCC’s binning approach stands out by capturing the broadest range of
microbial species in medium-quality MAGs.
[Figure 7.5 image. Panel a: Venn diagrams of high-quality MAGs. ImputeCC and MetaCC: 301 high quality in both, 107 ImputeCC only, 25 MetaCC only. ImputeCC and MetaTOR: 112 high quality in both, 296 ImputeCC only, 17 MetaTOR only. ImputeCC and bin3C: 172 high quality in both, 236 ImputeCC only, 76 bin3C only. Panel b: bar charts of the number of taxa at the species, genus, family, and order levels, split into ImputeCC only, other tool only, and both, for MetaTOR, bin3C, and MetaCC.]
Figure 7.5: Comparative analysis of high-quality MAGs retrieved from the sheep gut long-read metaHi-C
dataset. (a) Comparison of high-quality MAG recovery using ImputeCC and three other Hi-C-based binning tools (MetaTOR, bin3C, and MetaCC), as determined through Mash analysis. ImputeCC successfully
retrieved the majority of high-quality MAGs obtained by the alternative Hi-C-based tools, while also surpassing them by reconstructing a significant number of additional high-quality MAGs. (b) Annotation
analysis of the high-quality MAGs highlighting the enhanced diversity captured by ImputeCC at different
taxonomic levels in comparison to its Hi-C-based counterparts, such as MetaTOR, bin3C, and MetaCC.
Sequencing platform Accession number Size of shotgun libraries (Gbp)
Illumina HiSeq 3000 ERR9765746 6.1
ONT MinION R9 ERR9765780 3.1
PacBio Sequel II ERR9765783 5.4
Table 7.12: The accession numbers (European Nucleotide Archive) and sizes of three shotgun libraries,
sequenced using the Illumina HiSeq 3000, ONT MinION R9, and PacBio Sequel II platforms, from the same
mock microbial community.
Dataset Size of shotgun libraries (Gbp) Size of Hi-C libraries (Gbp)
Human gut 37.9 25.9
Wastewater 81.3 28.8
Cow rumen 52 10.1
Sheep gut 255 32.3
Synthetic yeast 17.3 16.2
Table 7.13: The sizes of shotgun and Hi-C libraries from raw metaHi-C datasets.
Dataset The number of mapped Hi-C read pairs
Mock Illumina 8,549,960
Mock Nanopore 6,079,873
Mock PacBio 7,844,386
Table 7.14: The number of mapped Hi-C read pairs for the mock metaHi-C datasets.
7.4 Supplementary materials for Chapter 5
7.4.1 Binning results for the viral contigs on the mock wastewater dataset and the
mock cow fecal dataset
Since all viral contigs were labeled in the mock metagenomic Hi-C datasets, we defined the spurious edges
in the host proximity graph of ViralCC as the edges linking two contigs from different putative reference genomes. We then computed the fraction of spurious edges
by dividing the number of spurious edges by the total number of edges in the host
proximity graph for each of the three mock metagenomic Hi-C datasets. The fractions were 1.5% and 16.5% for the
mock wastewater and mock cow fecal datasets, respectively.
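The spurious-edge fraction defined above is a simple ratio over the edge list of the host proximity graph. A minimal sketch follows, assuming the graph is available as a list of contig pairs and that every contig carries the label of its putative reference genome; the function name and the toy data are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def spurious_edge_fraction(edges: Iterable[Tuple[str, str]],
                           genome_of: Dict[str, str]) -> float:
    """Fraction of edges whose two contigs come from different reference genomes."""
    edge_list = list(edges)
    if not edge_list:
        return 0.0  # convention chosen here for an empty graph
    spurious = sum(1 for u, v in edge_list if genome_of[u] != genome_of[v])
    return spurious / len(edge_list)


if __name__ == "__main__":
    labels = {"c1": "virus_1", "c2": "virus_1", "c3": "virus_2", "c4": "virus_2"}
    graph = [("c1", "c2"), ("c1", "c3"), ("c3", "c4"), ("c2", "c4")]
    print(spurious_edge_fraction(graph, labels))
```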
ViralCC was then compared to CoCoNet, vRhyme, bin3C, and MetaTOR on the mock wastewater
dataset and the mock cow fecal dataset (Supplementary Figure 7.7). VAMB failed to bin viral contigs on
both mock datasets due to the small number of contigs to train its model. ViralCC also outperformed
other binning methods in terms of F-score, ARI, and homogeneity and recovered the most near-complete
and high-quality vMAGs on the mock wastewater dataset. For the mock cow fecal dataset, three Hi-C-based binning methods (i.e., bin3C, MetaTOR, and HiCBin) obtained similar binning performance in terms
of the four clustering metrics. MetaTOR and ViralCC achieved slightly better performance in terms of
NMI and homogeneity, respectively. Although MetaTOR retrieved the most high-quality vMAGs, ViralCC
reconstructed the most near-complete vMAGs.
7.4.2 ViralCC performed better in recovering near-complete vMAGs from large viral
genomes on the mock metagenomic Hi-C datasets
Considering that almost all viral contigs have the same length in our mock datasets, the full recovery of
a larger putative viral genome requires a binner to correctly group more viral contigs into a single bin.
To explore the ability of different binners to retrieve large viral genomes, we plotted the genome sizes of
the near-complete vMAGs recovered by different binners on the three mock metagenomic Hi-C datasets
(Supplementary Figure 7.8). On both the mock human gut and wastewater datasets, we found that binners
apart from MetaTOR and ViralCC could only retrieve near-complete vMAGs with the reference genome
sizes below 60,000 bp. MetaTOR and ViralCC could recover near-complete vMAGs with the reference
genome sizes between 60,000 bp and 80,000 bp while ViralCC was the only method that could reconstruct
near-complete vMAGs with the reference genome sizes above 80,000 bp. As for the mock cow fecal dataset,
the largest reference viral genome is 42,500 bp and all binners could retrieve near-complete vMAGs with
the reference genome sizes above 40,000 bp while ViralCC could achieve the largest number. Moreover,
we divided the total sizes of near-complete vMAGs by the total sizes of the overall putative viral genomes
for the three mock datasets. As shown in Supplementary Table 7.21, ViralCC achieved the highest genome
size fractions of the near-complete vMAGs on all mock datasets. Therefore, ViralCC outperformed other
binners in recovering near-complete vMAGs from large putative viral genomes on the mock datasets.
7.4.3 Results of virus-host detection on the human gut sample
A total of 338 MAGs were generated for non-viral contigs and 164 MAGs could be annotated by GTDB-TK.
The CheckM assessment results of the 338 MAGs are shown in Supplementary Table 7.20. Lachnospirales,
Oscillospirales, and Bacteroidales are the predominant orders in the human gut sample (Supplementary
Table 7.22).
We then explored the infection spectrum of vMAGs on hosts from different orders (Supplementary Figure 7.9). vMAGs from the Siphoviridae and Myoviridae families were mainly associated with host MAGs
from the orders Lachnospirales and Oscillospirales, followed by Bacteroidales. The vast majority of vMAGs
from the families Podoviridae and Herpesviridae were associated with host MAGs from the orders Lachnospirales and Oscillospirales. CRISPR spacer analysis predicted four interactions between vMAGs and host
MAGs, and 3 out of 4 interactions could also be discovered using Hi-C.
7.4.4 Results of virus-host detection on the cow fecal sample
HiCBin retrieved 496 MAGs for non-viral contigs, which were subsequently evaluated by CheckM (Supplementary Table 7.20). A total of 191 MAGs could be annotated by GTDB-TK. Bacteroidales and Lachnospirales are the predominant orders in the cow fecal sample (Supplementary Table 7.23).
Supplementary Figure 7.10 showed the infection spectrum of vMAGs on hosts from different orders.
vMAGs from the families Siphoviridae and Myoviridae were mainly associated with host MAGs from the
orders Bacteroidales and Lachnospirales, followed by Oscillospirales, Coriobacteriales, and Erysipelotrichales.
The vast majority of vMAGs from the families Podoviridae and Herpesviridae were associated with host
MAGs from the orders Bacteroidales and Lachnospirales. CRISPR spacer analysis predicted two interactions
between vMAGs and host MAGs. Both interactions could also be discovered using Hi-C.
7.4.5 The existence of biases by the viral genome sizes in the benchmarking method
According to the benchmarking strategy, we selected the near-complete viral contigs according to the
CheckV completeness criteria as putative reference viral genomes. Given the limitations of shotgun sequencing, the larger a viral genome is, the more difficult it is to recover the complete viral
genome as a single contig during assembly. Therefore, it is inevitable that the sizes of
putative viral genomes tend to be small. Specifically, we found that the sizes of the largest vMAGs were
307,395 bp, 157,462 bp, and 461,626 bp on the human gut, cow fecal, and wastewater datasets, respectively.
In contrast, the sizes of the largest selected viral genomes were only 194,784 bp, 42,500 bp, and 127,910
bp for the three datasets, respectively, indicating that the benchmarking method was biased by the viral
genome sizes.
7.4.6 Read cleaning procedure
Adaptor sequences were removed by bbduk from the BBTools suite (v37.25) with parameters ‘ktrim=r
k=23 mink=11 hdist=1 minlen=50 tpe tbo’ and reads were quality-trimmed using bbduk with parameters
‘trimq=10 qtrim=r ftm=5 minlen=50’. Then, the first 10 nucleotides of each read were trimmed by bbduk
with parameter ‘ftl=10’. Identical PCR optical and tile-edge duplicates for Hi-C paired-end reads were
removed by the script ‘clumpify.sh’ from the BBTools suite.
7.4.7 Algorithms to evaluate the completeness of vMAGs by CheckV
CheckV applies two algorithms to compute the completeness of vMAGs: amino acid identity (AAI)-based
and hidden Markov model (HMM)-based approaches. In the AAI-based approach, proteins are first compared to the CheckV genome database using AAI. After identifying the top hits, completeness is computed
as the ratio between the contig length and the length of matched reference genomes and a confidence
level is reported based on the strength of the alignment and the length of the contig. High- and medium-confidence estimates are quite accurate and can be trusted. The second method is HMM-based. This
method aims to compute the completeness of highly novel viruses that may not match a CheckV genome
with sufficiently high AAI. In these cases CheckV identifies the viral HMMs on the contig and compares
the contig length with reference genomes sharing the same HMMs. CheckV then returns the estimated
range for genome completeness, which represents the 90% confidence interval based on the distribution
of lengths of reference genomes with the same viral HMMs.
7.4.8 Evaluation criteria of the clustering results
The Homogeneity Score: Let $C = \{c_i \mid i = 1, \cdots, n\}$ denote a set of classes and $K = \{k_i \mid i = 1, \cdots, m\}$
denote a set of clusters for $N$ samples. Let $A$ be the contingency table produced by the clustering algorithm
representing the clustering solution, such that $A = \{a_{ck}\}$, where $a_{ck}$ is the number of data points that are
members of class $c$ and elements of cluster $k$. Then the homogeneity score, denoted by $h$, is defined as:
\[
h = 1 - \frac{H(C \mid K)}{H(C)}, \tag{7.4}
\]
where
\[
H(C \mid K) = -\sum_{k=1}^{|K|} \sum_{c=1}^{|C|} \frac{a_{ck}}{N} \log \frac{a_{ck}}{\sum_{c'=1}^{|C|} a_{c'k}},
\]
\[
H(C) = -\sum_{c=1}^{|C|} \frac{\sum_{k=1}^{|K|} a_{ck}}{N} \log \frac{\sum_{k=1}^{|K|} a_{ck}}{N}.
\]
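The homogeneity score can be computed from two label vectors without building the contingency table explicitly. The sketch below is one possible implementation, assuming natural logarithms (the base cancels in the ratio), the number of samples $N$ as the denominator in both entropies, and the convention that $h = 1$ when $H(C) = 0$; the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def homogeneity(classes: Sequence[Hashable], clusters: Sequence[Hashable]) -> float:
    """Homogeneity h = 1 - H(C|K) / H(C) for true classes and predicted clusters."""
    n_samples = len(classes)
    joint = Counter(zip(classes, clusters))  # a_ck
    cluster_size = Counter(clusters)         # column sums of the contingency table
    class_size = Counter(classes)            # row sums of the contingency table

    # Conditional entropy of the classes given the clusters; empty cells contribute 0.
    h_c_given_k = -sum((a / n_samples) * math.log(a / cluster_size[k])
                       for (_, k), a in joint.items())
    # Entropy of the classes.
    h_c = -sum((s / n_samples) * math.log(s / n_samples)
               for s in class_size.values())

    if h_c == 0.0:
        return 1.0  # a single class is trivially homogeneous (convention)
    return 1.0 - h_c_given_k / h_c


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(homogeneity(truth, [0, 0, 1, 1]))  # each cluster holds one class
    print(homogeneity(truth, [0, 0, 0, 0]))  # one cluster mixes both classes
```

Splitting a class across several clusters does not lower homogeneity, which is why the score is reported alongside the other clustering metrics rather than on its own.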
The Fowlkes-Mallows score: The Fowlkes-Mallows score (F-score) is defined as the geometric mean of
the precision and recall, i.e.,
\[
FM = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}}, \tag{7.5}
\]
where $TP$ is the number of true positives, $FP$ is the number of false positives, and $FN$ is the number of
false negatives.
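For clustering, the counts in Equation (7.5) are usually taken over pairs of items. The sketch below assumes that convention, which is not spelled out in the text: a pair is a true positive if the two items share both a class and a cluster, a false positive if they share only a cluster, and a false negative if they share only a class. The function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Hashable, Sequence, Tuple


def pair_counts(true_labels: Sequence[Hashable],
                pred_labels: Sequence[Hashable]) -> Tuple[int, int, int, int]:
    """Count (TP, FP, FN, TN) over all unordered pairs of items."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = pred_labels[i] == pred_labels[j]
        if same_class and same_cluster:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def fowlkes_mallows(true_labels: Sequence[Hashable],
                    pred_labels: Sequence[Hashable]) -> float:
    """Geometric mean of pairwise precision and recall."""
    tp, fp, fn, _ = pair_counts(true_labels, pred_labels)
    if tp == 0:
        return 0.0  # avoids 0/0 when no pair is grouped correctly
    return math.sqrt((tp / (tp + fp)) * (tp / (tp + fn)))


if __name__ == "__main__":
    print(pair_counts([0, 0, 1, 1], [0, 0, 0, 1]))
    print(fowlkes_mallows([0, 0, 1, 1], [0, 0, 0, 1]))
```

The pairwise loop is quadratic in the number of items, which is acceptable for a few thousand viral contigs; larger inputs are better served by counting pairs from a contingency table.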
The Adjusted Rand Index: The Rand index (RI) is defined as the fraction of correct decisions made
by the clustering algorithm, i.e.,
\[
RI = \frac{TP + TN}{TP + TN + FP + FN}, \tag{7.6}
\]
where $TP$ is the number of true positives, $TN$ is the number of true negatives, $FP$ is the number of false
positives, and $FN$ is the number of false negatives.
Then, the Adjusted Rand Index (ARI) can be defined as
\[
ARI = \frac{RI - E(RI)}{\max(RI) - E(RI)}, \tag{7.7}
\]
where $E(RI)$ denotes the expected Rand index.
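Equations (7.6) and (7.7) leave the expectation and the maximum unspecified. The sketch below assumes the commonly used Hubert and Arabie form, in which class and cluster sizes are held fixed. Under that assumption the Rand index is an affine function of the number of true-positive pairs, so the adjustment can be applied to the true-positive count directly. The function name is hypothetical, and pair counts are derived from the contingency table rather than by enumerating pairs.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence, Tuple


def rand_indices(true_labels: Sequence[Hashable],
                 pred_labels: Sequence[Hashable]) -> Tuple[float, float]:
    """Return (RI, ARI) computed from pair counts."""
    total = comb(len(true_labels), 2)  # TP + TN + FP + FN
    if total == 0:
        raise ValueError("at least two items are required")

    tp = sum(comb(a, 2) for a in Counter(zip(true_labels, pred_labels)).values())
    same_class = sum(comb(a, 2) for a in Counter(true_labels).values())    # TP + FN
    same_cluster = sum(comb(a, 2) for a in Counter(pred_labels).values())  # TP + FP
    tn = total - same_class - same_cluster + tp

    ri = (tp + tn) / total

    # Adjustment for chance with fixed class and cluster sizes (assumed form).
    expected_tp = same_class * same_cluster / total
    max_tp = (same_class + same_cluster) / 2
    if max_tp == expected_tp:
        return ri, 1.0  # degenerate case, e.g. every item in its own group
    ari = (tp - expected_tp) / (max_tp - expected_tp)
    return ri, ari


if __name__ == "__main__":
    print(rand_indices([0, 0, 1, 1], [0, 0, 0, 1]))
    print(rand_indices([0, 0, 1, 1], [1, 1, 0, 0]))
```

The first example has a Rand index of 0.5 but an adjusted value of 0, which illustrates why the adjusted index is preferred when cluster sizes alone can produce many agreeing pairs.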
The Normalized Mutual Information: Let $U$ and $V$ denote the sets of true class labels and predicted
cluster labels, respectively. Define the entropy of a label set $S$ as
\[
H(S) = -\sum_{i=1}^{|S|} P(i) \log P(i), \tag{7.8}
\]
where $P(i) = |S_i|/N$ is the probability that an object belongs to class $S_i$.
The mutual information (MI) between $U$ and $V$ is calculated by:
\[
MI(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} P(i, j) \log\left( \frac{P(i, j)}{P(i) P'(j)} \right), \tag{7.9}
\]
where $P(i, j) = |U_i \cap V_j|/N$, $P(i) = |U_i|/N$, and $P'(j) = |V_j|/N$.
Then, the Normalized Mutual Information (NMI) is defined as
\[
NMI(U, V) = \frac{2 \times MI(U, V)}{H(U) + H(V)}. \tag{7.10}
\]
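Equations (7.8) to (7.10) translate almost line by line into code. The following sketch assumes natural logarithms, treats terms with $P(i, j) = 0$ as contributing zero, and returns 1 when both labelings are constant; these conventions and the function names are assumptions rather than details from the text.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """H(S) = -sum_i P(i) log P(i), with P(i) the relative size of group i."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u: Sequence[Hashable], v: Sequence[Hashable]) -> float:
    """MI(U, V) summed over the non-empty cells of the contingency table."""
    n = len(u)
    size_u, size_v = Counter(u), Counter(v)
    return sum((c / n) * math.log((c / n) / ((size_u[a] / n) * (size_v[b] / n)))
               for (a, b), c in Counter(zip(u, v)).items())


def normalized_mutual_information(u: Sequence[Hashable], v: Sequence[Hashable]) -> float:
    """NMI(U, V) = 2 * MI(U, V) / (H(U) + H(V))."""
    denominator = entropy(u) + entropy(v)
    if denominator == 0.0:
        return 1.0  # both labelings are constant (convention)
    return 2.0 * mutual_information(u, v) / denominator


if __name__ == "__main__":
    print(normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping
    print(normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # independent
```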
[Figure 7.6 image: heatmaps of raw Hi-C contact maps in three panels (a, b, c); the surviving panel title reads "Raw Hi-C contact maps of the wastewater dataset". Axes: size (Kb). Color scale: number of raw Hi-C contacts, from 0 to ≥3.]
Figure 7.6: Heatmaps of raw Hi-C contact matrices of the top ten vMAGs on the (a) human gut, (b) cow
fecal, and (c) wastewater datasets with the contig size as the axis unit. The vMAGs were first ranked by
their numbers of contigs and then the contigs within each vMAG were ranked by their sizes. The scale bar
shows the number of raw Hi-C contacts between viral contigs.
[Figure 7.7 image: four panels (a, b, c, d). Panels a and c: clustering scores (F-score, ARI, NMI, homogeneity) for CoCoNet, vRhyme, bin3C, MetaTOR, and ViralCC. Panels b and d: number of viral bins per binning method, split into moderately complete, substantially complete, and near-complete. Panel titles: the mock wastewater dataset and the mock cow fecal dataset.]
Figure 7.7: Comparison of viral genome retrieval performance according to the (a) clustering metrics and
(b) completeness and contamination criteria on the mock wastewater dataset, and the (c) clustering metrics
and (d) completeness and contamination criteria on the mock cow fecal dataset. Moderately complete:
50% ≤ completeness < 70%, contamination ≤ 10%; Substantially complete: 70% ≤ completeness < 90%,
contamination ≤ 10%; Near-complete: completeness ≥ 90%, contamination ≤ 10%.
Figure 7.8: The strip plot of the genome sizes of the near-complete vMAGs recovered by different binners
on the mock (a) human gut, (b) wastewater, and (c) cow fecal datasets. VAMB failed to bin viral contigs on
the mock wastewater and cow fecal datasets due to the small number of contigs to train its model.
[Figure 7.9 image: bar chart for the human gut dataset. Axes: host order (Lachnospirales, Oscillospirales, Bacteroidales, Coriobacteriales, Actinomycetales, Lactobacillales) and number of associations. Legend: vMAG family (Myoviridae, Siphoviridae, Podoviridae, Herpesviridae).]
Figure 7.9: The apparent infection spectrum of vMAGs on hosts from different orders on the human gut
dataset.
[Figure 7.10 image: bar chart for the cow fecal dataset. Axes: host order (Bacteroidales, Lachnospirales, Oscillospirales, Coriobacteriales, Peptostreptococcales, Erysipelotrichales, Selenomonadales) and number of associations. Legend: vMAG family (Myoviridae, Siphoviridae, Podoviridae, Herpesviridae, Lipothrixviridae).]
Figure 7.10: The apparent infection spectrum of vMAGs on hosts from different orders on the cow fecal
dataset.
Dataset Contigs ≥ 1 kbp N50 Average length (bp) Total length (bp)
Human gut 105,267 14,166 5,044 530,969,816
Cow fecal 190,817 5,820 3,573 681,943,699
Wastewater 752,580 2,977 2,538 1,910,562,642
Table 7.15: The statistics of assembled contigs for the three metagenomic Hi-C datasets. Note: N50 is
the length of the shortest contig such that contigs of that length or longer cover at least 50%
of the assembly.
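The N50 definition in the caption can be checked with a few lines of code. The sketch below is a generic implementation of that definition; the function name and the toy contig lengths are hypothetical.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length of the shortest contig such that contigs of that length or longer
    cover at least 50% of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:  # integer comparison avoids rounding issues
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    # Total length is 20; contigs of length 6 and 5 together cover 11 >= 10.
    print(n50([2, 3, 4, 5, 6]))
```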
Dataset Number of viral contigs N50 Total length
Human gut 791 9,198 13,226,295
Cow fecal 1,338 6,038 12,466,164
Wastewater 2,757 5,960 25,549,304
Table 7.16: The statistics of viral contigs detected by VirSorter for the three metagenomic Hi-C datasets.
N50 is the length of the shortest contig such that contigs of that length or longer cover at
least 50% of the assembly.
Dataset Number of putative reference genomes Number of mock viral contigs
Human gut 51 1,010
Cow fecal 11 94
Wastewater 17 279
Table 7.17: The numbers of putative reference genomes and mock viral contigs from the three metagenomic
datasets.
F-score ARI NMI Homogeneity
Ghic 0.491 0.383 0.834 0.717
Ghost 0.449 0.335 0.761 0.616
Gint 0.795 0.787 0.929 0.921
Table 7.18: Gint outperforms Ghic and Ghost for clustering viral contigs in terms of F-score, ARI, NMI, and
homogeneity on the mock human gut dataset. The optimal values of the results are in bold.
near-complete substantially complete moderately complete total
Ghic 8 3 5 16
Ghost 12 2 0 14
Gint 26 2 4 32
Table 7.19: Gint outperforms Ghic and Ghost for clustering in terms of the completeness and contamination
criteria on the mock human gut dataset.
Dataset Near-complete Substantially complete Moderately complete
Human gut 67 33 12
Cow fecal 79 27 17
Wastewater 94 56 41
Table 7.20: The numbers of high-quality potential host MAGs retrieved by HiCBin from the three metagenomic Hi-C datasets according to the CheckM criteria. Note: near-complete (CheckM completeness ≥ 90%,
CheckM contamination ≤ 10%), substantially complete (70% ≤ CheckM completeness < 90%, CheckM
contamination ≤ 10%), and moderately complete (50% ≤ CheckM completeness < 70%, CheckM contamination ≤ 10%).
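The three quality tiers in the note are mutually exclusive thresholds on completeness and contamination, so assigning a MAG to a tier is a short chain of comparisons. The sketch below encodes the thresholds exactly as listed, with values in percent; the function name is hypothetical.

```python
from typing import Optional


def completeness_category(completeness: float, contamination: float) -> Optional[str]:
    """Return the quality tier for a MAG, or None if it meets none of the criteria."""
    if contamination > 10:
        return None  # every tier requires contamination <= 10%
    if completeness >= 90:
        return "near-complete"
    if completeness >= 70:
        return "substantially complete"
    if completeness >= 50:
        return "moderately complete"
    return None


if __name__ == "__main__":
    for comp, cont in [(95.2, 1.3), (82.0, 9.9), (55.0, 10.0), (97.0, 12.5), (40.0, 0.0)]:
        print(comp, cont, completeness_category(comp, cont))
```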
Mock human gut Mock cow fecal Mock wastewater
VAMB 1.1% NA NA
vRhyme 0 20.4% 11.6%
bin3C 5.1% 44.9% 5.1%
CoCoNet 4.5% 14.9% 13.9%
MetaTOR 35.3% 59.5% 51.8%
ViralCC 45.6% 85.1% 65.4%
Table 7.21: The genome size fractions of near-complete vMAGs recovered by different binners on the three
mock datasets. VAMB failed to bin viral contigs on the mock wastewater and cow fecal datasets due to the
small number of contigs to train its model. The optimal values of the results are in bold.
Order Number of MAGs
Lachnospirales 62
Oscillospirales 53
Bacteroidales 23
Coriobacteriales 6
Actinomycetales 6
Lactobacillales 4
Christensenellales 2
Peptostreptococcales 2
Clostridiales 2
Erysipelotrichales 2
Veillonellales 1
Peptococcales 1
Table 7.22: The taxonomic statistics of potential host MAGs derived from the human gut dataset.
Order Number of MAGs
Bacteroidales 58
Lachnospirales 49
Oscillospirales 14
Coriobacteriales 10
Peptostreptococcales 8
Erysipelotrichales 8
Selenomonadales 8
Acidaminococcales 7
Enterobacterales 4
Actinomycetales 3
Veillonellales 3
Sphaerochaetales 3
Treponematales 2
Christensenellales 1
Eubacteriales 1
Desulfovibrionales 1
Gastranaerophilales 1
Table 7.23: The taxonomic statistics of potential host MAGs derived from the cow fecal dataset.
Bibliography
[1] H-W Ackermann. “5500 Phages examined in the electron microscope”. In: Archives of Virology
152.2 (2007), pp. 227–243.
[2] Jiyoung Ahn, Rashmi Sinha, Zhiheng Pei, Christine Dominianni, Jing Wu, Jianxin Shi,
James J Goedert, Richard B Hayes, and Liying Yang. “Human gut microbiome and risk for
colorectal cancer”. In: Journal of the National Cancer Institute 105.24 (2013), pp. 1907–1911.
[3] Mads Albertsen, Philip Hugenholtz, Adam Skarshewski, Kåre L Nielsen, Gene W Tyson, and
Per H Nielsen. “Genome sequences of rare, uncultured bacteria obtained by differential coverage
binning of multiple metagenomes”. In: Nature Biotechnology 31 (2013), pp. 533–538.
[4] Alexandre Almeida, Stephen Nayfach, Miguel Boland, Francesco Strozzi, Martin Beracochea,
Zhou Jason Shi, Katherine S Pollard, Ekaterina Sakharova, Donovan H Parks, Philip Hugenholtz,
et al. “A unified catalog of 204,938 reference genomes from the human gut microbiome”. In:
Nature Biotechnology 39 (2021), pp. 105–114.
[5] Johannes Alneberg, Brynjar Smári Bjarnason, Ino De Bruijn, Melanie Schirmer, Joshua Quick,
Umer Z Ijaz, Leo Lahti, Nicholas J Loman, Anders F Andersson, and Christopher Quince. “Binning
metagenomic contigs by coverage and composition”. In: Nature Methods 11 (2014), pp. 1144–1146.
[6] Stephen F Altschul, Warren Gish, Webb Miller, et al. “Basic local alignment search tool”. In:
Journal of Molecular Biology 215 (1990), pp. 403–410.
[7] Cédric G Arisdakessian, Olivia D Nigro, Grieg F Steward, Guylaine Poisson, and Mahdi Belcaid.
“CoCoNet: an efficient deep learning tool for viral metagenome binning”. In: Bioinformatics 37.18
(2021), pp. 2803–2810.
[8] Marleen Balvert, Xiao Luo, Ernestina Hauptfeld, Alexander Schönhuth, and Bas E Dutilh. “OGRE:
Overlap Graph-based metagenomic Read clustEring”. In: Bioinformatics 37.7 (2021), pp. 905–912.
[9] Lyam Baudry, Théo Foutel-Rodier, Agnès Thierry, Romain Koszul, and Martial Marbouty.
“MetaTOR: a computational pipeline to recover high-quality metagenomic bins from mammalian
gut proximity-ligation (meta3C) libraries”. In: Frontiers in Genetics 10 (2019), p. 753.
[10] Christopher W Beitel, Lutz Froenicke, Jenna M Lang, Ian F Korf, Richard W Michelmore,
Jonathan A Eisen, and Aaron E Darling. “Strain-and plasmid-level deconvolution of a synthetic
metagenome by sequencing proximity ligation products”. In: PeerJ 2 (2014), e415.
[11] Leen Beller and Jelle Matthijnssens. “What is (not) known about the dynamics of the human gut
virome in health and disease”. In: Current Opinion in Virology 37 (2019), pp. 52–57.
[12] Johan Bengtsson-Palme, Magnus Alm Rosenblad, Mikael Molin, and Anders Blomberg.
“Metagenomics reveals that detoxification systems are underrepresented in marine bacterial
communities”. In: BMC Genomics 15.749 (2014).
[13] Richa Bharti and Dominik G Grimm. “Current challenges and best-practice protocols for
microbiome analysis”. In: Briefings in Bioinformatics 22.1 (2021), pp. 178–193.
[14] Derek M Bickhart, Mikhail Kolmogorov, Elizabeth Tseng, Daniel M Portik, Anton Korobeynikov,
Ivan Tolstoganov, Gherman Uritskiy, Ivan Liachko, Shawn T Sullivan, Sung Bong Shin, et al.
“Generating lineage-resolved, complete metagenome-assembled genomes from complex
microbial communities”. In: Nature Biotechnology 40 (2022), pp. 711–719.
[15] Derek M Bickhart, Mick Watson, Sergey Koren, Kevin Panke-Buisse, Laura M Cersosimo,
Maximilian O Press, Curtis P Van Tassell, Jo Ann S Van Kessel, Bradd J Haley, Seon Woo Kim,
et al. “Assignment of virus and antimicrobial resistance genes to microbial hosts in a complex
microbial community by combined long-read assembly and proximity ligation”. In: Genome
Biology 20.153 (2019).
[16] Mya Breitbart and Forest Rohwer. “Here a virus, there a virus, everywhere the same virus?” In:
Trends in Microbiology 13.6 (2005), pp. 278–284.
[17] Mollie E Brooks, Kasper Kristensen, Koen J Van Benthem, et al. “glmmTMB balances speed and
flexibility among packages for zero-inflated generalized linear mixed modeling”. In: The R Journal
9 (2017), pp. 378–400.
[18] Joshua N Burton, Ivan Liachko, Maitreya J Dunham, and Jay Shendure. “Species-level
deconvolution of metagenome assemblies with Hi-C–based contact probability maps”. In: G3:
Genes, Genomes, Genetics 4.7 (2014), pp. 1339–1346.
[19] Brian Bushnell. BBMap: a fast, accurate, splice-aware aligner. Tech. rep. Lawrence Berkeley
National Lab. (LBNL), Berkeley, CA (United States), 2014.
[20] Luis F Camarillo-Guerrero, Alexandre Almeida, Guillermo Rangel-Pineros, Robert D Finn, and
Trevor D Lawley. “Massive expansion of human gut bacteriophage diversity”. In: Cell 184 (2021),
pp. 1098–1109.
[21] Sourav Chatterji, Ichitaro Yamazaki, Zhaojun Bai, and Jonathan A Eisen. “CompostBin: A DNA
composition-based algorithm for binning environmental shotgun reads”. In: Annual International
Conference on Research in Computational Molecular Biology. Springer. 2008, pp. 17–28.
[22] Pierre-Alain Chaumeil, Aaron J Mussig, Philip Hugenholtz, and Donovan H Parks. “GTDB-Tk v2:
memory friendly classification with the genome taxonomy database”. In: Bioinformatics 38.23
(2022), pp. 5315–5316.
[23] Pierre-Alain Chaumeil, Aaron J Mussig, Philip Hugenholtz, and Donovan H Parks. “GTDB-Tk: a
toolkit to classify genomes with the Genome Taxonomy Database”. In: Bioinformatics 36.6 (2020),
pp. 1925–1927.
[24] Yiqiang Chen, Yulin Wang, David Paez-Espino, Martin F Polz, and Tong Zhang. “Prokaryotic
viruses impact functional microorganisms in nutrient removal and carbon cycle in wastewater
treatment plants”. In: Nature Communications 12.5398 (2021).
[25] Alex Chklovski, Donovan H Parks, Ben J Woodcroft, and Gene W Tyson. “CheckM2: a rapid,
scalable and accurate tool for assessing microbial genome quality using machine learning”. In:
Nature Methods 20 (2023), pp. 1203–1212.
[26] Laura M Cox, Jiho Sohn, Kerin L Tyrrell, Diane M Citron, Paul A Lawson, Nisha B Patel,
Tadasu Iizumi, Guillermo I Perez-Perez, Ellie JC Goldstein, and Martin J Blaser. “Description of
two novel members of the family Erysipelotrichaceae: Ileibacterium valens gen. nov., sp. nov. and
Dubosiella newyorkensis, gen. nov., sp. nov., from the murine intestine, and emendation to the
description of Faecalibacterium rodentium”. In: International Journal of Systematic and
Evolutionary Microbiology 67.5 (2017), pp. 1247–1254.
[27] John F Cryan, Kenneth J O’Riordan, Caitlin SM Cowan, Kiran V Sandhu, Thomaz FS Bastiaanssen,
Marcus Boehme, Martin G Codagnone, Sofia Cussotto, Christine Fulling, Anna V Golubeva, et al.
“The microbiota-gut-brain axis”. In: Physiological Reviews 99.4 (2019), pp. 1877–2013.
[28] Anna Cuscó, Daniel Pérez, Joaquim Viñes, Norma Fàbregas, and Olga Francino. “Novel canine
high-quality metagenome-assembled genomes, prophages and host-associated plasmids provided
by long-read metagenomics together with Hi-C proximity ligation”. In: Microbial Genomics 8.3
(2022), p. 000802.
[29] Matthew Z DeMaere and Aaron E Darling. “bin3C: exploiting Hi-C sequencing data to accurately
resolve metagenome-assembled genomes”. In: Genome Biology 20.46 (2019).
[30] Matthew Z DeMaere and Aaron E Darling. “Sim3C: simulation of Hi-C and Meta3C proximity
ligation sequencing technologies”. In: Gigascience 7.2 (2018), gix103.
[31] Jesse R Dixon, Siddarth Selvaraj, Feng Yue, et al. “Topological domains in mammalian genomes
identified by analysis of chromatin interactions”. In: Nature 485 (2012), pp. 376–380.
[32] Yuxuan Du, Jed A Fuhrman, and Fengzhu Sun. “ViralCC retrieves complete viral genomes and
virus-host pairs from metagenomic Hi-C data”. In: Nature Communications 14.502 (2023).
[33] Yuxuan Du, Sarah M. Laperriere, Jed Fuhrman, and Fengzhu Sun. “Normalizing Metagenomic
Hi-C Data and Detecting Spurious Contacts Using Zero-Inflated Negative Binomial Regression”.
In: Journal of Computational Biology 29 (2022), pp. 106–120.
[34] Yuxuan Du and Fengzhu Sun. “HiCBin: binning metagenomic contigs and recovering
metagenome-assembled genomes using Hi-C contact maps”. In: Genome Biology 23.63 (2022).
[35] Yuxuan Du and Fengzhu Sun. “HiFine: integrating Hi-C-based and shotgun-based methods to
refine binning of metagenomic contigs”. In: Bioinformatics 38.11 (2022), pp. 2973–2979.
[36] Yuxuan Du and Fengzhu Sun. “MetaCC allows scalable and integrative analyses of both long-read
and short-read metagenomic Hi-C data”. In: Nature Communications 14.6231 (2023).
[37] Robert C Edgar. “PILER-CR: fast and accurate identification of CRISPR repeats”. In: BMC
Bioinformatics 8.18 (2007).
[38] Joanne B Emerson, Simon Roux, Jennifer R Brum, Benjamin Bolduc, Ben J Woodcroft,
Ho Bin Jang, Caitlin M Singleton, Lindsey M Solden, Adrian E Naas, Joel A Boyd, et al.
“Host-linked soil viral ecology along a permafrost thaw gradient”. In: Nature Microbiology 3
(2018), pp. 870–880.
[39] Zhencheng Fang, Jie Tan, Shufang Wu, Mo Li, Congmin Xu, Zhongjie Xie, and Huaiqiu Zhu.
“PPR-Meta: a tool for identifying phages and plasmids from metagenomic fragments using deep
learning”. In: Gigascience 8.6 (2019), giz066.
[40] Wen-Wen Feng, Jin-Feng Liu, Ji-Dong Gu, and Bo-Zhong Mu. “Nitrate-reducing community in
production water of three oil reservoirs and their responses to different carbon sources revealed
by nitrate-reductase encoding gene (napA)”. In: International Biodeterioration & Biodegradation
65.7 (2011), pp. 1081–1086.
[41] Robert D Finn, Jody Clements, and Sean R Eddy. “HMMER web server: interactive sequence
similarity searching”. In: Nucleic Acids Research 39.suppl_2 (2011), W29–W37.
[42] Samuel C Forster, Nitin Kumar, Blessing O Anonye, Alexandre Almeida, Elisa Viciani,
Mark D Stares, Matthew Dunn, Tapoka T Mkandawire, Ana Zhu, Yan Shao, et al. “A human gut
bacterial genome and culture collection for improved metagenomic analyses”. In: Nature
Biotechnology 37 (2019), pp. 186–192.
[43] Santo Fortunato and Marc Barthelemy. “Resolution limit in community detection”. In: Proceedings
of the National Academy of Sciences 104.1 (2007), pp. 36–41.
[44] Jed A Fuhrman. “Marine viruses and their biogeochemical and ecological effects”. In: Nature 399
(1999), pp. 541–548.
[45] Valentina Galata, Tobias Fehlmann, Christina Backes, and Andreas Keller. “PLSDB: a resource of
complete bacterial plasmids”. In: Nucleic Acids Research 47 (2019), pp. D195–D202.
[46] Rodrigo García-López, Jorge Francisco Vázquez-Castellanos, and Andrés Moya. “Fragmentation
and coverage variation in viral metagenome assemblies, and their effect in diversity calculations”.
In: Frontiers in Bioengineering and Biotechnology 3 (2015), p. 141.
[47] Paola Gauffin Cano, Arlette Santacruz, Ángela Moya, and Yolanda Sanz. “Bacteroides uniformis
CECT 7771 ameliorates metabolic and immunological dysfunction in mice with high-fat-diet
induced obesity”. In: PLoS One 7 (2012), e41079.
[48] Michelle Girvan and Mark EJ Newman. “Community structure in social and biological networks”.
In: Proceedings of the National Academy of Sciences 99.12 (2002), pp. 7821–7826.
[49] Cody Glickman, Jo Hendrix, and Michael Strong. “Simulation study and comparative evaluation
of viral contiguous sequence identification tools”. In: BMC Bioinformatics 22.329 (2021).
[50] Christopher J Gobler, David A Hutchins, Nicholas S Fisher, Elizabeth M Cosper, and
Sergio A Sañudo-Wilhelmy. “Release and bioavailability of C, N, P, Se, and Fe following viral lysis
of a marine chrysophyte”. In: Limnology and Oceanography 42.7 (1997), pp. 1492–1504.
[51] Jean-Sebastien Gounot, Minghao Chia, Denis Bertrand, Woei-Yuh Saw, Aarthi Ravikrishnan,
Adrian Low, Yichen Ding, Amanda Hui Qi Ng, Linda Wei Lin Tan, Yik-Ying Teo, et al.
“Genome-centric analysis of short and long read metagenomes reveals uncharacterized
microbiome diversity in Southeast Asians”. In: Nature Communications 13.6044 (2022).
[52] Emily B Graham, William R Wieder, Jonathan W Leff, Samantha R Weintraub, Alan R Townsend,
Cory C Cleveland, Laurent Philippot, and Diana R Nemergut. “Do we need to understand
microbial communities to predict ecosystem function? A comparison of statistical models of
nitrogen cycling processes”. In: Soil Biology and Biochemistry 68 (2014), pp. 279–282.
[53] Ann C Gregory, Olivier Zablocki, Ahmed A Zayed, Allison Howell, Benjamin Bolduc, and
Matthew B Sullivan. “The gut virome database reveals age-dependent patterns of virome
diversity in the human gut”. In: Cell Host & Microbe 28.5 (2020), pp. 724–740.
[54] Ann C Gregory, Ahmed A Zayed, Nádia Conceição-Neto, Ben Temperton, Ben Bolduc,
Adriana Alberti, Mathieu Ardyna, Ksenia Arkhipova, Margaux Carmichael, Corinne Cruaud,
et al. “Marine DNA viral macro- and microdiversity from pole to pole”. In: Cell 177.5 (2019),
pp. 1109–1123.
[55] Carolina Gubert, Chloe Jane Love, Saritha Kodikara, Jamie Jie Mei Liew, Thibault Renoir,
Kim-Anh Lê Cao, and Anthony John Hannan. “Gene-environment-gut interactions in
Huntington’s disease mice are associated with environmental modulation of the gut
microbiome”. In: iScience 25 (2022), p. 103687.
[56] Manoj Gurung, Zhipeng Li, Hannah You, Richard Rodrigues, Donald B Jump, Andrey Morgun,
and Natalia Shulzhenko. “Role of gut microbiota in type 2 diabetes pathophysiology”. In:
EBioMedicine 51 (2020), p. 102590.
[57] Jonas Halfvarson, Colin J Brislawn, Regina Lamendella, Yoshiki Vázquez-Baeza,
William A Walters, Lisa M Bramer, Mauro D’Amato, Ferdinando Bonfiglio, Daniel McDonald,
Antonio Gonzalez, et al. “Dynamics of the human gut microbiome in inflammatory bowel
disease”. In: Nature Microbiology 2.17004 (2017).
[58] Jo Handelsman. “Metagenomics: application of genomics to uncultured microorganisms”. In:
Microbiology and Molecular Biology Reviews 68.4 (2004), pp. 669–685.
[59] Joseph M Hilbe. Negative binomial regression. Cambridge University Press, 2011.
[60] Frank Hille, Hagen Richter, Shi Pey Wong, Majda Bratovič, Sarah Ressel, and
Emmanuelle Charpentier. “The biology of CRISPR-Cas: backward and forward”. In: Cell 172.6
(2018), pp. 1239–1259.
[61] Lesley Hoyles and Anne L McCartney. “What do we mean when we refer to Bacteroidetes
populations in the human gastrointestinal microbiota?” In: FEMS Microbiology Letters 299 (2009),
pp. 175–183.
[62] Tsung-Han S Hsieh, Geoffrey Fudenberg, Anton Goloborodko, and Oliver J Rando. “Micro-C XL:
assaying chromosome conformation from the nucleosome to the entire genome”. In: Nature
Methods 13 (2016), pp. 1009–1011.
[63] Ming Hu, Ke Deng, Siddarth Selvaraj, et al. “HiCNorm: removing biases in Hi-C data via Poisson
regression”. In: Bioinformatics 28.23 (2012), pp. 3131–3133.
[64] Philip Hugenholtz and Gene W Tyson. “Metagenomics”. In: Nature 455 (2008), pp. 481–483.
[65] Luisa W Hugerth, John Larsson, Johannes Alneberg, Markus V Lindh, Catherine Legrand,
Jarone Pinhassi, and Anders F Andersson. “Metagenome-assembled genomes uncover a global
brackish microbiome”. In: Genome Biology 16.279 (2015).
[66] Bonnie L Hurwitz and Matthew B Sullivan. “The Pacific Ocean Virome (POV): a marine viral
metagenomic dataset and associated protein clusters for quantitative viral ecology”. In: PLoS One
8.2 (2013), e57355.
[67] Umer Ijaz and Christopher Quince. TAXAassign v0.4. https://github.com/umerijaz/TAXAassign.
2013.
[68] Maxim Imakaev, Geoffrey Fudenberg, Rachel Patton McCord, et al. “Iterative correction of Hi-C
data reveals hallmarks of chromosome organization”. In: Nature Methods 9 (2012), pp. 999–1003.
[69] Vijayan Jasna, Ammini Parvathi, and Abhinandita Dash. “Genetic and functional diversity of
double-stranded DNA viruses in a tropical monsoonal estuary, India”. In: Scientific Reports
8.16036 (2018).
[70] Longhao Jia, Yingjian Wu, Yanqi Dong, Jingchao Chen, Wei-Hua Chen, and Xing-Ming Zhao. “A
survey on computational strategies for genome-resolved gut metagenomics”. In: Briefings in
Bioinformatics (2023), bbad162.
[71] Nianzhi Jiao, Gerhard J Herndl, Dennis A Hansell, Ronald Benner, Gerhard Kattner,
Steven W Wilhelm, David L Kirchman, Markus G Weinbauer, Tingwei Luo, Feng Chen, et al.
“Microbial production of recalcitrant dissolved organic matter: long-term carbon storage in the
global ocean”. In: Nature Reviews Microbiology 8 (2010), pp. 593–599.
[72] Joachim Johansen, Damian R Plichta, Jakob Nybo Nissen, Marie Louise Jespersen, Shiraz A Shah,
Ling Deng, Jakob Stokholm, Hans Bisgaard, Dennis Sandris Nielsen, Søren J Sørensen, et al.
“Genome binning of viral entities from bulk metagenomics data”. In: Nature Communications
13.965 (2022).
[73] Mark Johnson, Irena Zaretskaya, Yan Raytselis, Yuri Merezhuk, Scott McGinnis, and
Thomas L Madden. “NCBI BLAST: a better web interface”. In: Nucleic Acids Research 36.suppl_2
(2008), W5–W9.
[74] Nadeem O Kaakoush. Insights into the role of Erysipelotrichaceae in the human host. 2015.
[75] Saurabh Kalikar, Chirag Jain, Md Vasimuddin, and Sanchit Misra. “Accelerating minimap2 for
long-read sequencing applications on modern CPUs”. In: Nature Computational Science 2 (2022),
pp. 78–83.
[76] Dongwan D Kang, Jeff Froula, Rob Egan, and Zhong Wang. “MetaBAT, an efficient tool for
accurately reconstructing single genomes from complex microbial communities”. In: PeerJ 3
(2015), e1165.
[77] Dongwan D Kang, Feng Li, Edward Kirton, Ashleigh Thomas, Rob Egan, Hong An, and
Zhong Wang. “MetaBAT2: an adaptive binning algorithm for robust and efficient genome
reconstruction from metagenome assemblies”. In: PeerJ 7 (2019), e7359.
[78] Richard M Karp. “An algorithm to solve the m × n assignment problem in expected time O(mn
log n)”. In: Networks 10.2 (1980), pp. 143–152.
[79] Alyssa G Kent, Albert C Vill, Qiaojuan Shi, Michael J Satlin, and Ilana Lauren Brito. “Widespread
transfer of mobile antibiotic resistance genes within individual gut microbiomes revealed through
bacterial Hi-C”. In: Nature Communications 11.4379 (2020).
[80] Periyanaina Kesika, Natarajan Suganthy, Bhagavathi Sundaram Sivamaruthi, and
Chaiyavat Chaiyasut. “Role of gut-brain axis, gut microbial composition, and probiotic
intervention in Alzheimer’s disease”. In: Life Sciences 264 (2021), p. 118627.
[81] Kristopher Kieft, Alyssa Adams, Rauf Salamzade, Lindsay Kalan, and Karthik Anantharaman.
“vRhyme enables binning of viral genomes from metagenomes”. In: Nucleic Acids Research 50.14
(2022), e83.
[82] Kristopher Kieft, Zhichao Zhou, and Karthik Anantharaman. “VIBRANT: automated recovery,
annotation and curation of microbial viruses, and evaluation of viral community function from
genomic sequences”. In: Microbiome 8.90 (2020).
[83] Lisa Klingelhoefer and Heinz Reichmann. “Pathogenesis of Parkinson disease—the gut–brain axis
and environmental factors”. In: Nature Reviews Neurology 11 (2015), pp. 625–636.
[84] Philip A Knight and Daniel Ruiz. “A fast algorithm for matrix balancing”. In: IMA Journal of
Numerical Analysis 33.3 (2013), pp. 1029–1047.
[85] Mikhail Kolmogorov, Derek M Bickhart, Bahar Behsaz, Alexey Gurevich, Mikhail Rayko,
Sung Bong Shin, Kristen Kuhn, Jeffrey Yuan, Evgeny Polevikov, Timothy PL Smith, et al.
“metaFlye: scalable long-read metagenome assembly using repeat graphs”. In: Nature Methods 17
(2020), pp. 1103–1110.
[86] Konstantinos T Konstantinidis and James M Tiedje. “Genomic insights that advance the species
definition for prokaryotes”. In: Proceedings of the National Academy of Sciences 102.7 (2005),
pp. 2567–2572.
[87] Sergey Koren, Brian P Walenz, Konstantin Berlin, Jason R Miller, Nicholas H Bergman, and
Adam M Phillippy. “Canu: scalable and accurate long-read assembly via adaptive k-mer
weighting and repeat separation”. In: Genome Research 27.5 (2017), pp. 722–736.
[88] Diane Lambert. “Zero-inflated Poisson regression, with an application to defects in
manufacturing”. In: Technometrics 34 (1992), pp. 1–14.
[89] Ivica Letunic and Peer Bork. “Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic
tree display and annotation”. In: Nucleic Acids Research 49.W1 (2021), W293–W296.
[90] Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam. “MEGAHIT: an
ultra-fast single-node solution for large and complex metagenomics assembly via succinct de
Bruijn graph”. In: Bioinformatics 31.10 (2015), pp. 1674–1676.
[91] Heng Li. “Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM”. In:
arXiv (2013). doi: 10.48550/arXiv.1303.3997.
[92] Heng Li. “Minimap2: pairwise alignment for nucleotide sequences”. In: Bioinformatics 34.18
(2018), pp. 3094–3100.
[93] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth,
Goncalo Abecasis, and Richard Durbin. “The sequence alignment/map format and SAMtools”. In:
Bioinformatics 25.16 (2009), pp. 2078–2079.
[94] Erez Lieberman-Aiden, Nynke L Van Berkum, Louise Williams, Maxim Imakaev, Tobias Ragoczy,
Agnes Telling, Ido Amit, Bryan R Lajoie, Peter J Sabo, Michael O Dorschner, et al.
“Comprehensive mapping of long-range interactions reveals folding principles of the human
genome”. In: Science 326.5950 (2009), pp. 289–293.
[95] Hsin-Hung Lin and Yu-Chieh Liao. “Accurate binning of metagenomic contigs via automated
clustering sequences using information of genomic signatures and marker genes”. In: Scientific
Reports 6.24175 (2016).
[96] Dominique Lord, Seth D Guikema, and Srinivas Reddy Geedipally. “Application of the
Conway–Maxwell–Poisson generalized linear model for analyzing motor vehicle crashes”. In:
Accident Analysis & Prevention 40 (2008), pp. 1123–1134.
[97] Yang Young Lu, Ting Chen, Jed A Fuhrman, and Fengzhu Sun. “COCACOLA: binning
metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment and
paired-end read LinkAge”. In: Bioinformatics 33.6 (2017), pp. 791–798.
[98] Yunan Luo, Yun William Yu, Jianyang Zeng, Bonnie Berger, and Jian Peng. “Metagenomic
binning through low-density hashing”. In: Bioinformatics 35.2 (2019), pp. 219–226.
[99] Martial Marbouty, Lyam Baudry, Axel Cournac, and Romain Koszul. “Scaffolding bacterial
genomes and probing host-virus interactions in gut microbiome by proximity ligation
(chromosome capture) assay”. In: Science Advances 3.2 (2017), e1602105.
[100] Martial Marbouty, Axel Cournac, Jean-François Flot, Hervé Marie-Nelly, Julien Mozziconacci,
and Romain Koszul. “Metagenomic chromosome conformation capture (meta3C) unveils the
diversity of chromosome organization in microorganisms”. In: eLife 3 (2014), e03318.
[101] Martial Marbouty, Agnès Thierry, Gaël A Millot, and Romain Koszul. “MetaHiC phage-bacteria
infection network reveals active cycling phages of the healthy human gut”. In: eLife 10 (2021),
e60608.
[102] Victoria Meslier, Benoit Quinquis, Kévin Da Silva, Florian Plaza Oñate, Nicolas Pons,
Hugo Roume, Mircea Podar, and Mathieu Almeida. “Benchmarking second and third-generation
sequencing platforms for microbial metagenomics”. In: Scientific Data 9.694 (2022).
[103] Thomas C Mettenleiter, Barbara G Klupp, and Harald Granzow. “Herpesvirus assembly: an
update”. In: Virus Research 143.2 (2009), pp. 222–234.
[104] Lan Mi, Bin Yang, Xialu Hu, Yang Luo, Jianxin Liu, Zhongtang Yu, and Jiakun Wang.
“Comparative analysis of the microbiota between sheep rumen and rabbit cecum provides new
insight into their differential methane production”. In: Frontiers in Microbiology 9 (2018), p. 575.
[105] Ajay Kumar Mishra, Gregory Gimenez, Jean-Christophe Lagier, Catherine Robert, Didier Raoult,
and Pierre-Edouard Fournier. “Genome sequence and description of Alistipes senegalensis sp.
nov.” In: Standards in Genomic Sciences 6 (2012), pp. 304–314.
[106] Atsushi Nakabachi, Atsushi Yamashita, Hidehiro Toh, Hajime Ishikawa, Helen E Dunbar,
Nancy A Moran, and Masahira Hattori. “The 160-kilobase genome of the bacterial endosymbiont
Carsonella”. In: Science 314.5797 (2006), p. 267.
[107] Stephen Nayfach, Antonio Pedro Camargo, Frederik Schulz, Emiley Eloe-Fadrosh, Simon Roux,
and Nikos C Kyrpides. “CheckV assesses the quality and completeness of metagenome-assembled
viral genomes”. In: Nature Biotechnology 39 (2021), pp. 578–585.
[108] Stephen Nayfach, David Páez-Espino, Lee Call, Soo Jen Low, Hila Sberro, Natalia N Ivanova,
Amy D Proal, Michael A Fischbach, Ami S Bhatt, Philip Hugenholtz, et al. “Metagenomic
compendium of 189,680 DNA viruses from the human gut microbiome”. In: Nature Microbiology 6
(2021), pp. 960–970.
[109] H Bjørn Nielsen, Mathieu Almeida, Agnieszka Sierakowska Juncker, Simon Rasmussen,
Junhua Li, Shinichi Sunagawa, Damian R Plichta, Laurent Gautier, Anders G Pedersen,
Emmanuelle Le Chatelier, et al. “Identification and assembly of genomes and genetic elements in
complex metagenomic samples without using reference genomes”. In: Nature Biotechnology 32
(2014), pp. 822–828.
[110] Jakob Nybo Nissen, Joachim Johansen, Rosa Lundbye Allesøe, Casper Kaae Sønderby,
Jose Juan Almagro Armenteros, Christopher Heje Grønbech, Lars Juhl Jensen,
Henrik Bjørn Nielsen, Thomas Nordahl Petersen, Ole Winther, et al. “Improved metagenome
binning and assembly using deep variational autoencoders”. In: Nature Biotechnology 39 (2021),
pp. 555–560.
[111] Jason M Norman, Scott A Handley, Megan T Baldridge, Lindsay Droit, Catherine Y Liu,
Brian C Keller, Amal Kambal, Cynthia L Monaco, Guoyan Zhao, Phillip Fleshner, et al.
“Disease-specific alterations in the enteric virome in inflammatory bowel disease”. In: Cell 160.3
(2015), pp. 447–460.
[112] Sergey Nurk, Dmitry Meleshko, Anton Korobeynikov, and Pavel A Pevzner. “metaSPAdes: a new
versatile metagenomic assembler”. In: Genome Research 27.5 (2017), pp. 824–834.
[113] Nuala A O’Leary, Mathew W Wright, J Rodney Brister, Stacy Ciufo, Diana Haddad,
Rich McVeigh, Bhanu Rajput, Barbara Robbertse, Brian Smith-White, Danso Ako-Adjei, et al.
“Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and
functional annotation”. In: Nucleic Acids Research 44.D1 (2016), pp. D733–D745.
[114] Lesley A Ogilvie, Lucas D Bowler, Jonathan Caplin, Cinzia Dedi, David Diston, Elizabeth Cheek,
Huw Taylor, James E Ebdon, and Brian V Jones. “Genome signature-based dissection of human
gut metagenomes to extract subliminal viral sequences”. In: Nature Communications 4.2420 (2013).
[115] Brian D Ondov, Todd J Treangen, Páll Melsted, Adam B Mallonee, Nicholas H Bergman,
Sergey Koren, and Adam M Phillippy. “Mash: fast genome and metagenome distance estimation
using MinHash”. In: Genome Biology 17.132 (2016).
[116] David Paez-Espino, Emiley A Eloe-Fadrosh, Georgios A Pavlopoulos, Alex D Thomas,
Marcel Huntemann, Natalia Mikhailova, Edward Rubin, Natalia N Ivanova, and Nikos C Kyrpides.
“Uncovering Earth’s virome”. In: Nature 536 (2016), pp. 425–430.
[117] Shaojun Pan, Chengkai Zhu, Xing-Ming Zhao, and Luis Pedro Coelho. “A deep siamese neural
network improves metagenome-assembled genomes in microbiome datasets across different
environments”. In: Nature Communications 13.2326 (2022).
[118] Bianca J Parker, Pamela A Wearsch, Alida CM Veloo, and Alex Rodriguez-Palacios. “The genus
Alistipes: gut bacteria with emerging implications to inflammation, cancer, and mental health”.
In: Frontiers in Immunology 11 (2020), p. 906.
[119] Donovan H Parks, Michael Imelfort, Connor T Skennerton, Philip Hugenholtz, and
Gene W Tyson. “CheckM: assessing the quality of microbial genomes recovered from isolates,
single cells, and metagenomes”. In: Genome Research 25.7 (2015), pp. 1043–1055.
[120] Tu Anh N Pham and Trevor D Lawley. “Emerging insights on intestinal dysbiosis during bacterial
infections”. In: Current Opinion in Microbiology 17 (2014), pp. 67–74.
[121] Maximilian O Press, Andrew H Wiser, Zev N Kronenberg, Kyle W Langford, Migun Shakya,
Chien-Chi Lo, Kathryn A Mueller, Shawn T Sullivan, Patrick SG Chain, and Ivan Liachko. “Hi-C
deconvolution of a human gut microbiome yields high-quality draft genomes and reveals
plasmid-genome interactions”. In: bioRxiv (2017). doi: 10.1101/198713.
[122] Andreas S Puschnik, Karim Majzoub, Yaw Shin Ooi, and Jan E Carette. “A CRISPR toolbox to
study virus–host interactions”. In: Nature Reviews Microbiology 15.6 (2017), pp. 351–364.
[123] Suhas SP Rao, Miriam H Huntley, Neva C Durand, Elena K Stamenova, Ivan D Bochkov,
James T Robinson, Adrian L Sanborn, Ido Machol, Arina D Omer, Eric S Lander, et al. “A 3D map
of the human genome at kilobase resolution reveals principles of chromatin looping”. In: Cell
159.7 (2014), pp. 1665–1680.
[124] Jörg Reichardt and Stefan Bornholdt. “Statistical mechanics of community detection”. In: Physical
Review E 74 (2006), p. 016110.
[125] Jie Ren, Nathan A Ahlgren, Yang Young Lu, Jed A Fuhrman, and Fengzhu Sun. “VirFinder: a novel
k-mer based tool for identifying viral sequences from assembled metagenomic data”. In:
Microbiome 5.69 (2017).
[126] Alejandro Reyes, Laura V Blanton, Song Cao, Guoyan Zhao, Mark Manary, Indi Trehan,
Michelle I Smith, David Wang, Herbert W Virgin, Forest Rohwer, et al. “Gut DNA viromes of
Malawian twins discordant for severe acute malnutrition”. In: Proceedings of the National
Academy of Sciences 112.38 (2015), pp. 11941–11946.
[127] Mina Rho, Haixu Tang, and Yuzhen Ye. “FragGeneScan: predicting genes in short and error-prone
reads”. In: Nucleic Acids Research 38.20 (2010), e191–e191.
[128] Peter J Rousseeuw. “Silhouettes: a graphical aid to the interpretation and validation of cluster
analysis”. In: Journal of Computational and Applied Mathematics 20 (1987), pp. 53–65.
[129] Bertrand Routy, Vancheswaran Gopalakrishnan, Romain Daillère, Laurence Zitvogel,
Jennifer A Wargo, and Guido Kroemer. “The gut microbiota influences anticancer
immunosurveillance and general health”. In: Nature Reviews Clinical Oncology 15 (2018),
pp. 382–396.
[130] Simon Roux, Evelien M Adriaenssens, Bas E Dutilh, Eugene V Koonin, Andrew M Kropinski,
Mart Krupovic, Jens H Kuhn, Rob Lavigne, J Rodney Brister, Arvind Varsani, et al. “Minimum
information about an uncultivated virus genome (MIUViG)”. In: Nature Biotechnology 37 (2019),
pp. 29–37.
[131] Simon Roux, Francois Enault, Bonnie L Hurwitz, and Matthew B Sullivan. “VirSorter: mining
viral signal from microbial genomic data”. In: PeerJ 3 (2015), e985.
[132] Michelle Sait, Philip Hugenholtz, and Peter H Janssen. “Cultivation of globally distributed soil
bacteria from phylogenetic lineages previously only detected in cultivation-independent
surveys”. In: Environmental Microbiology 4.11 (2002), pp. 654–666.
[133] Rafael Sanjuán and María-Isabel Thoulouze. “Why viruses sometimes disperse in groups”. In:
Virus Evolution 5 (2019), vez014.
[134] Frederik Schulz, Julien Andreani, Rania Francis, Hadjer Boudjemaa, Jacques Yaacoub Bou Khalil,
Janey Lee, Bernard La Scola, and Tanja Woyke. “Advantages and limits of metagenomic assembly
and binning of a giant virus”. In: mSystems 5.3 (2020), e00048–20.
[135] Oliver Schwengers, Patrick Barth, Linda Falgenhauer, Torsten Hain, Trinad Chakraborty, and
Alexander Goesmann. “Platon: identification and characterization of bacterial plasmid contigs in
short-read draft assemblies exploiting protein sequence-based replicon distribution scores”. In:
Microbial Genomics 6.10 (2020).
[136] Siddarth Selvaraj, Jesse R Dixon, Vikas Bansal, et al. “Whole-genome haplotype reconstruction
using proximity-ligation and shotgun sequencing”. In: Nature Biotechnology 31 (2013),
pp. 1111–1118.
[137] Andrei N Shkoporov, Andrei V Chaplin, Ekaterina V Khokhlova, Victoria A Shcherbakova,
Oksana V Motuzova, Vladimir K Bozhenko, Lyudmila I Kafarskaia, and Boris A Efimov. “Alistipes
inops sp. nov. and Coprobacter secundus sp. nov., isolated from human faeces”. In: International
Journal of Systematic and Evolutionary Microbiology 65 (2015), pp. 4580–4588.
[138] Andrew B Shreiner, John Y Kao, and Vincent B Young. “The gut microbiome in health and in
disease”. In: Current Opinion in Gastroenterology 31 (2015), pp. 69–75.
[139] Christian MK Sieber, Alexander J Probst, Allison Sharrar, Brian C Thomas, Matthias Hess,
Susannah G Tringe, and Jillian F Banfield. “Recovery of genomes from metagenomes via a
dereplication, aggregation and scoring strategy”. In: Nature Microbiology 3 (2018), pp. 836–843.
[140] Carola Simon and Rolf Daniel. “Metagenomic analyses: past and future trends”. In: Applied and
Environmental Microbiology 77.4 (2011), pp. 1153–1161.
[141] Saskia L Smits, Rogier Bodewes, Aritz Ruiz-Gonzalez, Wolfgang Baumgärtner,
Marion P Koopmans, Albert DME Osterhaus, and Anita C Schürch. “Assembly of viral genomes
from metagenomes”. In: Frontiers in Microbiology 5 (2014), p. 714.
[142] Thibault Stalder, Maximilian O Press, Shawn Sullivan, Ivan Liachko, and Eva M Top. “Linking the
resistome and plasmidome to the microbiome”. In: The ISME Journal 13.10 (2019), pp. 2437–2446.
[143] Alison M Stephen and JH Cummings. “The microbial contribution to human faecal mass”. In:
Journal of Medical Microbiology 13 (1980), pp. 45–56.
[144] Bradley S Stevenson, Stephanie A Eichorst, John T Wertz, Thomas M Schmidt, and
John A Breznak. “New strategies for cultivation and detection of previously uncultured
microbes”. In: Applied and Environmental Microbiology 70.8 (2004), pp. 4748–4755.
[145] Robert D Stewart, Marc D Auffret, Amanda Warr, Andrew H Wiser, Maximilian O Press,
Kyle W Langford, Ivan Liachko, Timothy J Snelling, Richard J Dewhurst, Alan W Walker, et al.
“Assembly of 913 microbial genomes from metagenomic sequencing of the cow rumen”. In:
Nature Communications 9.870 (2018).
[146] Wolfgang R Streit and Ruth A Schmitz. “Metagenomics–the key to the uncultured microbes”. In:
Current Opinion in Microbiology 7.5 (2004), pp. 492–498.
[147] Curtis A Suttle. “Marine viruses—major players in the global ecosystem”. In: Nature Reviews
Microbiology 5 (2007), pp. 801–812.
[148] Dorothee Tegtmeier, Cornelius Riese, Oliver Geissinger, Renate Radek, and Andreas Brune.
“Breznakia blatticola gen. nov. sp. nov. and Breznakia pachnodae sp. nov., two fermenting
bacteria isolated from insect guts, and emended description of the family Erysipelotrichaceae”. In:
Systematic and Applied Microbiology 39.5 (2016), pp. 319–329.
[149] Rebecca Vega Thurber. “Current insights into phage biodiversity and biogeography”. In: Current
Opinion in Microbiology 12.5 (2009), pp. 582–587.
[150] Herbert Tilg, Arthur Kaser, et al. “Gut microbiome, obesity, and metabolic dysfunction”. In: The
Journal of Clinical Investigation 121.6 (2011), pp. 2126–2132.
[151] Vincent A Traag, Ludo Waltman, and Nees Jan Van Eck. “From Louvain to Leiden: guaranteeing
well-connected communities”. In: Scientific Reports 9.5233 (2019).
[152] Gherman Uritskiy, Maximillian Press, Christine Sun, Guillermo Domínguez Huerta,
Ahmed A Zayed, Andrew Wiser, Jonas Grove, Benjamin Auch, Stephen M Eacker,
Shawn Sullivan, et al. “Accurate viral genome reconstruction and host assignment with
proximity-ligation sequencing”. In: bioRxiv (2021). doi: 10.1101/2021.06.14.448389.
[153] Doris Vandeputte, Gunter Kathagen, Kevin D’hoe, Sara Vieira-Silva, Mireia Valles-Colomer,
João Sabino, Jun Wang, Raul Y Tito, Lindsey De Commer, Youssef Darzi, et al. “Quantitative
microbiome profiling links gut community variation to microbial load”. In: Nature 551 (2017),
pp. 507–511.
[154] Jorge F Vázquez-Castellanos, Rodrigo García-López, Vicente Pérez-Brocal, Miguel Pignatelli, and
Andrés Moya. “Comparison of different assembly and annotation tools on analysis of simulated
viral metagenomic communities in the gut”. In: BMC Genomics 15.37 (2014).
[155] Bruce J Walker, Thomas Abeel, Terrance Shea, Margaret Priest, Amr Abouelliel,
Sharadha Sakthikumar, Christina A Cuomo, Qiandong Zeng, Jennifer Wortman, Sarah K Young,
et al. “Pilon: an integrated tool for comprehensive microbial variant detection and genome
assembly improvement”. In: PLoS One 9.11 (2014), e112963.
[156] Shannon J Williamson, Douglas B Rusch, Shibu Yooseph, Aaron L Halpern, Karla B Heidelberg,
John I Glass, Cynthia Andrews-Pfannkoch, Douglas Fadrosh, Christopher S Miller,
Granger Sutton, et al. “The Sorcerer II Global Ocean Sampling Expedition: metagenomic
characterization of viruses within aquatic microbial samples”. In: PLoS One 3 (2008), e1456.
[157] Honglong Wu, Xuebin Wang, Mengtian Chu, Dongfang Li, Lixin Cheng, and Ke Zhou. “HCMB: a
stable and efficient algorithm for processing the normalization of highly sparse Hi-C contact
data”. In: Computational and Structural Biotechnology Journal 19 (2021), pp. 2637–2645.
[158] Yu-Wei Wu, Blake A Simmons, and Steven W Singer. “MaxBin 2.0: an automated binning
algorithm to recover genomes from multiple metagenomic datasets”. In: Bioinformatics 32.4
(2016), pp. 605–607.
[159] Yu-Wei Wu, Yung-Hsu Tang, Susannah G Tringe, Blake A Simmons, and Steven W Singer.
“MaxBin: an automated binning method to recover individual genomes from metagenomes using
an expectation-maximization algorithm”. In: Microbiome 2.26 (2014).
[160] Zhiyong Xie, Yixuan Bai, Guijie Chen, Ying Rui, Dan Chen, Yi Sun, Xiaoxiong Zeng, and
Zhonghua Liu. “Modulation of gut homeostasis by exopolysaccharides from Aspergillus cristatus
(MK346334), a strain of fungus isolated from Fuzhuan brick tea, contributes to
immunomodulatory activity in cyclophosphamide-treated mice”. In: Food & Function 11.12 (2020),
pp. 10397–10412.
[161] Eitan Yaffe and David A Relman. “Tracking microbial evolution in the human gut using Hi-C
reveals extensive horizontal gene transfer, persistence and adaptation”. In: Nature Microbiology 5
(2020), pp. 343–353.
[162] Eitan Yaffe and Amos Tanay. “Probabilistic modeling of Hi-C contact maps eliminates systematic
biases to characterize global chromosomal architecture”. In: Nature Genetics 43 (2011),
pp. 1059–1065.
[163] Bin Yang, Yu Peng, Henry CM Leung, Siu-Ming Yiu, Jing-Chi Chen, and Francis YL Chin.
“Unsupervised binning of environmental genomic fragments based on an error robust selection
of l-mers”. In: Proceedings of the Third International Workshop on Data and Text Mining in
Bioinformatics. 2009, pp. 3–10.
[164] Tanya Yatsunenko, Federico E Rey, Mark J Manary, Indi Trehan, Maria Gloria Dominguez-Bello,
Monica Contreras, Magda Magris, Glida Hidalgo, Robert N Baldassano, Andrey P Anokhin, et al.
“Human gut microbiome viewed across age and geography”. In: Nature 486 (2012), pp. 222–227.
[165] Kelvin KW Yau, Kui Wang, and Andy H Lee. “Zero-inflated negative binomial mixed regression
modeling of over-dispersed count data with extra zeros”. In: Biometrical Journal 45 (2003),
pp. 437–452.
[166] Umaporn Yordpratum, Unchalee Tattawasart, Surasakdi Wongratanacheewin, and
Rasana W Sermswan. “Novel lytic bacteriophages from soil that lyse Burkholderia pseudomallei”.
In: FEMS Microbiology Letters 314 (2011), pp. 81–88.
[167] Naofumi Yoshida, Takuo Emoto, Tomoya Yamashita, Hikaru Watanabe, Tomohiro Hayashi,
Tokiko Tabata, Namiko Hoshi, Naoya Hatano, Genki Ozawa, Naoto Sasaki, et al. “Bacteroides
vulgatus and Bacteroides dorei reduce gut microbial lipopolysaccharide production and inhibit
atherosclerosis”. In: Circulation 138.22 (2018), pp. 2486–2498.
[168] Yuanqiang Zou, Wenbin Xue, Guangwen Luo, Ziqing Deng, Panpan Qin, Ruijin Guo,
Haipeng Sun, Yan Xia, Suisha Liang, Ying Dai, et al. “1,520 reference genomes from cultivated
human gut bacteria enable functional microbiome analyses”. In: Nature Biotechnology 37 (2019),
pp. 179–185.
Abstract
Metagenomic high-throughput chromosome conformation capture (metagenomic Hi-C or metaHi-C) identifies contig-to-contig relationships based on physical proximity within the same cell, and it shows great potential for studying multiple genomes simultaneously and for probing active virus-host interactions. However, metaHi-C data analysis faces significant challenges: systematic biases in raw metaHi-C contacts, the impact of spurious Hi-C contacts on data interpretation, and gaps in current Hi-C-based contig binning methods, which overlook critical biological information and do not adequately address viral genome recovery. In this dissertation, we present our efforts to tackle these computational challenges, focusing on normalization and binning to improve microbial community analysis. For normalization, we introduce HiCzin, the first method able to eliminate all systematic biases, and propose a straightforward but effective strategy to filter out spurious Hi-C linkages. We then improve on HiCzin with NormCC, a more efficient normalization method. For contig binning, we develop ImputeCC, which enhances binning by incorporating single-copy marker genes, and ViralCC, which is dedicated to viral genome recovery and the identification of virus-host pairs. Together, these methods advance the analysis of microbial communities by correcting biases, improving the clustering of metagenomic contigs, and enabling the discovery of new species and virus-host interactions in complex microbial ecosystems.
Asset Metadata
Creator
Du, Yuxuan (author)
Core Title
Constructing metagenome-assembled genomes and mobile genetic element host interactions using metagenomic Hi-C
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Degree Conferral Date
2024-08
Publication Date
06/19/2024
Defense Date
03/01/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
metagenome-assembled genomes, metagenomic Hi-C, mobile genetic element host interactions, OAI-PMH Harvest
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Sun, Fengzhu (committee chair), Fuhrman, Jed (committee member), Chen, Liang (committee member), Fudenberg, Geoffrey (committee member)
Creator Email
yuxuandu@usc.edu,yxdu.cbb@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113996WR1
Unique identifier
UC113996WR1
Identifier
etd-DuYuxuan-13116.pdf (filename)
Legacy Identifier
etd-DuYuxuan-13116
Document Type
Dissertation
Rights
Du, Yuxuan
Internet Media Type
application/pdf
Type
texts
Source
20240619-usctheses-batch-1171 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu