Big Data Analytics in Metagenomics: Integration, Representation, Management,
and Visualization
by
Yang Lu
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computational Biology and Bioinformatics)
December 2017
Copyright 2017 Yang Lu
Dedication
Dedicated to my family.
Acknowledgements
I would like to thank my advisor, Professor Fengzhu Sun, for his supervision of my research over
the past four years. In addition to his inspiring guidance, he has always been a supportive mentor
within and beyond research. His research enthusiasm, meticulous attitude, critical thinking, and
encouragement of exploration have not only greatly improved the quality of this dissertation but
also cultivated my deep interest in bioinformatics and biostatistics. I am certain that his
influence will carry over into my future academic career.
I would also like to thank my co-advisor and close collaborator, Professor Jinchi Lv. I have truly
enjoyed working with him: his rigorous statistical thinking not only consolidated my intuitive
ideas but also motivated me to think like a statistician. His solid theoretical foundation and
broad interests complemented my weaknesses and shaped my taste for the forefront of statistical
research. Although we have worked together for less than a year, I have learned a great deal from
him, and I look forward to learning more in future collaborations.
I am also thankful to my oral exam and dissertation committee members, Professor
Michael S. Waterman, Professor Jed A. Fuhrman, Professor Shang-Hua Teng, Professor Ting Chen,
and Professor Andrew D. Smith, for their insightful suggestions and selfless help during my
research. Lastly, I am grateful to my wonderful collaborators and friends in SunLab: Kujin
Tang, Fang Zhang, Professor Ying Wang, Weili Wang, Zifan Zhu, Mengge Zhang, Yichao Dong,
Han Li, Wangshu Zhang, Quan Chen, and Xin Bai. Their love, humility, and decent work ethic
have made my journey at USC enjoyable.
Table of Contents
Dedication ii
Acknowledgements iii
List Of Tables vii
List Of Figures viii
Abstract xi
Chapter 1: Introduction 1
1.1 Data Analysis in Metagenomics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Sequencing and Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Binning Sequences into Operational Taxonomic Units (OTUs) . . . . . . . 4
1.1.3 Inferring Phylogenetic Relationships . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Big Data Analytics Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Data Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Data Analysis and Knowledge Discovery . . . . . . . . . . . . . . . . . . . . 9
1.2.3 Data Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Motivation: Marriage between Big Data Paradigm and Metagenetic Analyses . . . 10
1.4 Outline and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Chapter 2: COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment, and paired-end read LinkAge 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Feature Matrix Representation of Contigs . . . . . . . . . . . . . . . . . . . 18
2.2.3 Incorporating Additional Knowledge into Binning . . . . . . . . . . . . . . 19
2.2.4 Optimization by Alternating Nonnegative Least Squares (ANLS) . . . . . . 20
2.2.5 Initialization of W and H . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.6 Parameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.7 Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.8 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.9 Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 Performance on the Simulated Datasets . . . . . . . . . . . . . . . . . . . . 26
2.3.2 The Effect of Incorporating Additional Knowledge on Binning . . . . . . 27
2.3.3 Impact of Varying Number of Clusters K . . . . . . . . . . . . . . . . . . . 29
2.3.4 The Effect of the Number of Samples on the Performance . . . . . . . . 31
2.3.5 The Effect of Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.6 Performance on Real Datasets . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.7 Running Time of COCACOLA, CONCOCT, GroopM, MaxBin and MetaBAT 36
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 3: Hetero-RP: Towards Enhanced and Interpretable Clustering/Classification in Integrative Genomics 39
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.2 Parameter choice for . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.3 Insufficient Auxiliary Knowledge . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.1 Application to Clustering: Metagenomic Contig Binning . . . . . . . . . . . 46
3.3.2 Application to Classification: RBP Binding Site Prediction . . . . . . . 48
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Chapter 4: CAFE: aCcelerated Alignment-FrEe sequence analysis 55
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.1 Alignment-free Dissimilarity Basics . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.2 Conventional measures based on k-mer counts . . . . . . . . . . . . . . . . . 58
4.2.2.1 Chebyshev . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.2.2 Euclidean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.2.3 Manhattan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.2.4 Canberra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.2.5 d2 or Cosine [11] . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.2.6 Pearson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.2.7 Feature frequency profiles (FFP) [113] . . . . . . . . . . . . . 59
4.2.2.8 Jensen-Shannon divergence (JS) [46] . . . . . . . . . . . . . . . . 59
4.2.2.9 Co-phylog [138] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2.3 Measures based on background adjusted k-mer counts . . . . . . . . . . . . 60
4.2.3.1 CVTree [93] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2.3.2 d2* [96, 128] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2.3.3 d2S [96, 128] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2.4 Measures based on presence/absence of k-mers . . . . . . . . . . . . . . . . 62
4.2.4.1 Anderberg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.4.2 Antidice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.4.3 Dice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.4.4 Gower . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.4.5 Hamman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.4.6 Hamming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.4.7 Jaccard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.4.8 Kulczynski . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.4.9 Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.4.10 Ochiai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.4.11 Phi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.4.12 Russel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.4.13 Sneath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.4.14 Tanimoto . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.4.15 Yule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.1 Accelerate the calculation of d2*, d2S, and CVTree . . . . . . . . . . . . 67
4.3.2 Workflows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.3 Graphical User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.4 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.1 Application to Primate and Vertebrate Genomic Sequences . . . . . . . . . 71
4.4.2 Application to Microbial Genomic Sequences . . . . . . . . . . . . . . . . . 75
4.4.3 Application to Metagenomic Samples . . . . . . . . . . . . . . . . . . . . . 76
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Chapter 5: Conclusion and ongoing work 78
Reference List 80
List Of Tables
2.1 Running Time of COCACOLA, CONCOCT, GroopM, MaxBin and MetaBAT . . 37
List Of Figures
2.1 The performance of COCACOLA, CONCOCT, GroopM, MaxBin and MetaBAT
on both simulated datasets (a,b) and real datasets (c,d). . . . . . . . . . . . . . . 28
2.2 Evaluation of the impact of incorporating two additional types of knowledge (Section 2.2.3)
on sub-samples of the simulated "species" dataset. The first option is co-alignment
information to reference genomes, depicted by (a)-(c). The second option is paired-end
read linkage, depicted by (d)-(f). The ensemble of both is depicted by (g)-(i). . . . . 30
2.3 The performance of COCACOLA, CONCOCT, GroopM, MaxBin and MetaBAT in
terms of precision, recall and ARI on the simulated "species" and "strain" datasets.
The results for the simulated "species" dataset are depicted by (a)-(c), and the results
for the simulated "strain" dataset by (d)-(f). . . . . . . . . . . . . . . . . . . . . . . 32
2.4 Evaluation of COCACOLA, CONCOCT, GroopM, MaxBin and MetaBAT for sensitivity
to varying sample size on the simulated "species" dataset. The numbers of sub-samples
are 9, 4, 3, 2, 1, 1, 1, 1, 1, corresponding to sample sizes 10, 20, 30, 40, 50, 60, 70,
80, 90, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5 Evaluation of the impact of initialization on sub-samples of the simulated "species"
dataset. The performance comparison based only on initializations with different
distance measurements is depicted by (a)-(c). The performance comparison of
initializations with different distance measurements against COCACOLA is depicted
by (d)-(f). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1 Illustration of Hetero-RP. (A) Each object o1, . . . , o5 is represented by its corresponding
feature vector, containing features indexed from f1 to f10, with colors indicating different
data sources. "Positive-links" and "negative-links" are shown as green solid lines and
red dashed lines, respectively. (B) The input to Hetero-RP contains two parts: the data
matrix based on the concatenated feature vectors and the signed graph encoding both
"positive-links" and "negative-links". (C) Hetero-RP rescales each dimension of features,
but keeps the overall scale fixed. (D) The applications of Hetero-RP widely cover both
clustering and classification domains. . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Incorporating Hetero-RP in metagenomic contig binning. (A) COCACOLA with
Hetero-RP applied to the simulated "species" dataset with multiple sub-sampling
replicates, evaluated by the Adjusted Rand Index (ARI). The auxiliary knowledge
considers the co-alignment of contig pairs. (B) The feature weights of two random
samples of size 10 and 96 using co-alignment. The blue shadow on the left side of the
dashed line indicates the scales of the abundance profile, whereas the red shadow on
the right side indicates the scales of the composition profile. The green horizontal line
indicates a scale of 1. (C) The auxiliary knowledge considers the paired-end read
linkages. (D) The feature weights of two random samples of size 10 and 96 using
linkage. (E) COCACOLA with Hetero-RP using co-alignment is compared against
CONCOCT, MaxBin, and MetaBAT using ARI. The improvement is more prominent
for small sample sizes such as 10 and 20. (F) Application to the real "MetaHIT"
dataset. The evaluation is based upon the recovery of genome bins at every
completeness threshold: the number of recovered genome bins (X-axis) by each method
(Y-axis) at different completeness thresholds (gray scale) with precision > 80%,
calculated by the lack of contamination using CheckM. . . . . . . . . . . . . . . . . 49
3.3 Incorporating Hetero-RP in RBP Binding Sites Prediction. (A) iONMF with or
without Hetero-RP is applied to 31 published CLIP experiments. The positive-links
and negative-links sets are constructed according to the labels of the nucleotide po-
sitions in the training set. The performance is evaluated by the area under the
receiver operating characteristic curve (AUC). (B) Hetero-RP is compared against
state-of-the-art methods. (C) Hetero-RP is compared against popular feature se-
lection methods. (D,E,F) Interpretations of the scales obtained by Hetero-RP. (D)
The 3'-UTR region type of PUM2 and IGF2BP has the largest weight, consistent
with the fact that the binding sites of both IGF2BP and PUM2 are distributed
across 3'-UTRs. (E) The mutual co-binding of hnRNPC and U2AF2 has large
weight and this observation agrees with the fact that hnRNPC interacts with the
same positions as U2AF2. Moreover, the weights are even larger at the upstream
and downstream, supporting the evidence of direct competition between the two.
(F) The intron and exon region types of U2AF2 start to scale up at 30 nucleotides
upstream from the binding site, where the reported intron-exon boundary is located. 53
4.1 The radix trie constructed for the calculation of the expected occurrences of tetramers:
(a) the i.i.d. model, (b) the first-order Markov model, and (c) the second-order Markov
model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2 The workflow of CAFE. The JELLYFISH software parses the input sequence files
(in Fasta format), counts k-mers, and saves compressed information into separate
databases. CAFE subsequently loads the databases and constructs a symmetric
dissimilarity matrix among the inputs. CAFE also integrates four types of visual-
ized downstream analysis, including dendrograms, heatmap, principal coordinate
analysis (PCoA), and network display. . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Screenshot of CAFE user interface based on a toy example. The user interface
layout divides into six parts in terms of functionality: (1) data selection toolbar
(top left), (2) dissimilarity setting toolbar (top middle), (3) image toolbar (top
right), (4) input data list (middle left), (5) run-time information console (bottom
left), and (6) visualized analyses (bottom right). . . . . . . . . . . . . . . . . . . . 70
4.4 The Spearman correlation of various dissimilarity measures with the evolutionary
distances using the maximum likelihood approach across many genomic regions
based on 21 primate species (top), 28 vertebrate species (middle), and the combi-
nation of both (bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5 The Pearson correlation of various dissimilarity measures with the evolutionary dis-
tances using the maximum likelihood approach across many genomic regions based
on 21 primate species (top), 28 vertebrate species (middle), and the combination
of both (bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.6 The normalized Robinson-Foulds distance between the clustering tree using various
dissimilarity measures and the phylogenetic tree derived based on the maximum
likelihood approach across many genomic regions for the 21 primate species (top)
and 28 vertebrate species (bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.7 Wall time, peak memory usage, and speedup ratio comparison between CAFE and
the original implementation to calculate the d2* dissimilarity between a pair of genomes
for k=14. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.8 The clustering results of 27 E. coli and Shigella genomes using measures based on
background-adjusted 14-mer counts: d2S, d2*, and CVTree. The Markov order of the
sequences was set to 1. The colors indicate the six different E. coli reference
groups. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.9 The normalized Robinson-Foulds distance between the clustering tree using various
dissimilarity measures and the evolutionary tree derived based on the maximum
likelihood approach across many genomic regions for the 27 E. coli and Shigella
genomes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.10 The clustering results of the mammalian gut samples using measures based on
background-adjusted k-mer counts: d2S, d2*, and CVTree. . . . . . . . . . . . . . . 76
Abstract
The advent of next-generation sequencing (NGS) technologies has allowed biologists to extract
genomic data with unprecedentedly high resolution and sufficient sequencing depth, fueling the
rapid generation of genomic data at faster speed and lower cost. In addition to the direct
generation of genomic data, NGS has also facilitated the generation of other types of omics data,
such as epigenomics, transcriptomics, proteomics, metabolomics, and interactomics. This
explosively increasing amount of data raises compelling challenges and opportunities: how to
build data repositories, how to integrate and exploit data from various sources, how to
manipulate and manage data efficiently, and how to analyze and interpret data in an interactive
and visualized way. This dissertation presents three computational methods and software packages
developed specifically to analyze metagenomic data accurately, promptly, and interactively. To
investigate the taxonomic structure of microbial samples, given assembled contigs, we bin the
contigs into OTUs using not only the concatenation of the contigs' tetra-nucleotide composition
and coverage profiles across multiple metagenomic samples, but also the seamless integration of
two types of additional knowledge, co-alignment to reference genomes and linkage between contigs
provided by paired-end reads, to achieve improved performance. In contig binning, it is far from
optimal to concatenate the tetra-nucleotide composition and coverage profiles straightforwardly.
We therefore weigh important features more highly than less important ones in accord with
implicitly existing auxiliary knowledge. More importantly, the resulting feature importance
scores are interpretable and meaningful.
In inferring the phylogenetic relationships among OTUs by alignment-free sequence comparison,
the state-of-the-art dissimilarity measures CVTree, d2*, and d2S are computationally expensive.
Therefore, we accelerate their computation for efficient analyses from both the novel data
structure and fast implementation perspectives.
Chapter 1
Introduction
Though invisible to the human eye, the human body is inhabited by many microbial organisms
forming complex microbial communities, also called the microbiome. The number of microbial
cells living in the human body is believed to be 10 times larger than the number of human cells
[61]. The microbial genomes are regarded as an extended human genome, encoding genetic and
metabolic functionalities that humans do not inherently have [61].
Microbiome communities have recently been associated with human physiology and diseases, such
as obesity [123], type 2 diabetes [95], cardiovascular diseases [52], and even neurodevelopmental
disorders [40]. Thus, efforts have been made to collect and study the diverse composition of
microbiomes with respect to diet, health, and environmental factors, such as the Human
Microbiome Project (HMP) [87] as well as the European Metagenomics of the Human Intestinal
Tract (MetaHIT) project [29].
A microbiome generally consists of a complex mixture of microbial organisms from bacteria,
archaea, eukaryotes, and viruses, a great proportion of which were unexplored until recently.
Conventional studies of microbiomes isolate and culture the microbes individually. However, for
the extremely rich microbial communities in human bodies, the ocean, and soil, isolation and
cultivation become impractical. With the advent of next-generation sequencing (NGS)
technologies, such as Illumina Solexa sequencing, it has become possible to quantify and study
microbiomes by
directly extracting genomic data of a microbiome with unprecedentedly high resolution and
sufficient sequencing depth, avoiding laborious and time-consuming isolation and cultivation.
This technique, which studies genetic material directly recovered from environmental samples, is
referred to as metagenomics [100]; it offers biologists insights into complex microbial
communities, including even species with very low abundance [2].
1.1 Data Analysis in Metagenomics
1.1.1 Sequencing and Assembly
There are two basic sequencing methods to profile the taxonomic composition of microbiomes.
The first focuses on sequencing only specific phylogenetic marker genes. One common
representative is the 16S rRNA gene, a ubiquitous gene in bacterial genomes containing variable
regions that distinguish different types of bacteria. Therefore, sequencing 16S rRNA genes can
be used to profile the composition of bacterial microbiomes and study the diversity of bacterial
communities. Despite its simplicity, 16S rDNA sequencing has limited resolution among closely
related species [89]. In this dissertation, we focus on the second method, which sequences all
microbial genomes using shotgun sequencing.
In shotgun metagenomic sequencing, genomic fragments, also referred to as reads, are randomly
extracted from all genomic DNA sequences in the microbial samples [68]. Reads originate from
the microbial genomes that constitute the whole microbiome. Since shotgun metagenomic
sequencing generates tens to hundreds of millions of reads, analyzing these reads can provide
insights into the taxonomic composition of the microbiome. One drawback of shotgun metagenomic
sequencing is that detecting rare taxa in the microbiome requires a sequencing depth much
greater than that of 16S rDNA sequencing.
To recover the genomes of uncultured taxa in the microbiome, assembly is needed to combine
short metagenomic reads into complete genomes or longer stretches of sequence, also referred to
as contigs. The majority of current off-the-shelf assembly software programs are tailored to
assembling single genomes. Therefore, simply using such programs to assemble metagenomes
should be treated with great caution. Assembly from short metagenomic reads is extremely
challenging for the following reasons: (i) repetitive sequence regions within or across genomes;
(ii) strain-level variation within the same species; and (iii) homologous regions across
different species. Failure to resolve these concerns may give rise to chimeric assemblies [121].
There are two basic strategies to assemble metagenomic reads. The first option is
reference-based assembly, in which reads are aligned to reference genomes, as in MIRA [20] and
AMOS [88]. Note that reference-based assembly only works favorably when the genome sequences in
the microbiome have close matches among the available reference genomes. However, a shortage of
reference genomes, as well as large insertions and deletions between the genome sequences in the
microbiome and the reference genomes, gives rise to fragmented assemblies. The second and more
commonly used option is de novo assembly, in which the assembly process does not rely on any
reference genomes. De novo assembly is based upon the idea of the de Bruijn graph. Specifically,
in a de Bruijn graph, each k-mer of a read is represented as a directed edge, with the starting
and ending nodes representing the prefix and suffix (k-1)-mers of that particular k-mer. The de
Bruijn graph is constructed using all k-mers from all metagenomic reads, and sequence assembly
is equivalently transformed into the problem of finding Eulerian paths in the de Bruijn graph
and reporting them as contigs. In practice this is difficult because the actual de Bruijn graph
becomes massive and over-complicated due to sequencing errors as well as repetitive sequence
regions within or across genomes. Therefore, efforts are needed to mitigate sequencing errors by
simplifying the de Bruijn graph. Representative de novo assembly software programs for
metagenomic data include SOAPdenovo2 [67], Meta-IDBA [84], MetaVelvet [76], metaSPAdes [80],
etc.
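The de Bruijn construction described above can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions (the function name and the example reads are my own, and real assemblers additionally perform error correction and graph simplification before extracting contigs):

```python
from collections import defaultdict

def build_debruijn(reads, k):
    """Build a de Bruijn graph: each k-mer in a read becomes a directed
    edge from its (k-1)-mer prefix node to its (k-1)-mer suffix node."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # prefix -> suffix edge
    return graph

# Toy example: two overlapping "reads" drawn from the sequence ATGGCGT
reads = ["ATGGCG", "TGGCGT"]
graph = build_debruijn(reads, k=3)
# e.g. the 3-mer "ATG" contributes the edge "AT" -> "TG"
```

An assembler would then search this graph for Eulerian paths, reporting each path's concatenated nodes as a contig; here the repeated edge "GG" -> "GC" (seen in both reads) illustrates how coverage accumulates as edge multiplicity.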
3
1.1.2 Binning Sequences into Operational Taxonomic Units (OTUs)
To further investigate the taxonomic structure of microbial samples, assembled sequence
fragments, also known as contigs, need to be grouped into operational taxonomic units (OTUs)
that ultimately represent genomes or significant parts of genomes. OTU clustering is also called
binning (or genomic binning), and serves as the key step towards taxonomic profiling and
downstream functional analysis. Currently available binning tools can be broadly categorized
into classification and clustering methods. Classification methods are also referred to as
taxonomy-dependent methods, where reference databases are needed to assign contigs or reads to
meaningful taxa. The classification is based either on homology due to sequence identity, or on
genomic signatures such as oligonucleotide composition patterns and taxonomic clades.
Homology-based methods include MEGAN, which assigns reads to the lowest common taxonomic
ancestor annotated with the NCBI taxonomy [41]. MG-RAST is another widely used tool that
compares reads against both protein and nucleotide databases to build the phylogeny of the
microbiome [73]. Representative genomic signature-based methods include AMPHORA [133],
PhyloSift [25], MetaPhlAn [107], Kraken [132], etc. AMPHORA [133] employs hidden Markov models
to retrieve common marker protein-coding genes and then builds a marker gene phylogeny.
PhyloSift [25] is similar to AMPHORA, with more extensive databases, such as a viral gene family
database and a larger gene marker database. MetaPhlAn [107] uses a phylogenetic clade-specific
marker database to assign sequences to corresponding taxonomic groups. Kraken [132] assigns
taxonomic labels to sequences by exact alignment of k-mers to a lowest-common-ancestor database.
In addition, hybrid methods are available. For example, PhyloPythia [71] trains support vector
machine classifiers based upon oligonucleotide frequency to determine whether a new sequence
belongs to a certain taxonomic group. Another method, Phymm [13], uses interpolated Markov
models to represent the oligonucleotide patterns of each reference genome. Note that
classification methods still suffer from a drawback similar to reference-based assembly, namely
the shortage of reference genomes.
In comparison, clustering methods, also referred to as taxonomy-independent methods, are more
widely used; no additional reference databases or taxonomic information are needed. These
methods require similarity measurements over content types such as GC content, tetra-mer
composition [2, 19, 137], contig coverage profiles [8, 135], or the combination of both [65, 4,
43, 134, 47]. Based upon these content types, clustering methods are further divided into three
subtypes: (i) composition-based methods; (ii) abundance-based methods; and (iii)
composition-and-abundance-based methods. Specifically, composition-based methods rely on the
oligonucleotide frequencies of contigs, and only differ in how they represent and quantify
compositional differences. For example, Self-Organizing Maps can be used to project
tetra-nucleotide frequencies onto a two-dimensional map for clustering [17]. Another method,
CompostBin [19], calculates k-mer frequencies of varying length and subsequently reduces the
dimensionality of the compositional space using a weighted PCA approach. One major limitation
of composition-based methods lies in the fact that sequences must be sufficiently long to
guarantee good binning performance. In scenarios where the abundance levels of species in the
microbiome vary markedly, abundance-based methods mitigate the difficulties faced by
composition-based methods. For example, AbundanceBin [136] tends to create bins based upon
similar abundance levels, with the underlying assumption that the abundances of different
species follow separate Poisson distributions. Recently, composition-and-abundance-based
methods have emerged that bin contigs using both the composition and the coverage profiles of
the contigs across multiple metagenomic samples [65, 4, 43, 134, 47]. Among these methods,
GroopM [43] is advantageous in its visualized and interactive pipeline. On one hand, it is
flexible, allowing users to merge and split bins under expert intervention. On the other hand,
in the absence of expert intervention, the automatic binning results of GroopM are not as
satisfactory as those of CONCOCT [4]. CONCOCT [4] makes use of a Gaussian mixture model (GMM)
to cluster contigs into bins, and provides a mechanism to automatically determine the optimal
OTU number by variational Bayesian model selection [24]. MetaBAT [47] calculates an integrated
distance for pairwise contigs and then clusters contigs iteratively with a modified K-medoids
algorithm. MaxBin [134] compares the distributions of distances between and within the same
genomes.
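As a concrete illustration of the composition features these binning methods rely on, a normalized tetra-nucleotide (4-mer) frequency profile of a contig can be computed as follows. This is a minimal sketch with an illustrative function name, not code from any of the cited tools (which additionally handle reverse complements, ambiguous bases, and length normalization in their own ways):

```python
from collections import Counter
from itertools import product

def tetra_profile(contig):
    """Normalized tetra-nucleotide frequency vector of a contig.
    Returns frequencies in a fixed order over all 4^4 = 256 possible 4-mers."""
    counts = Counter(contig[i:i + 4] for i in range(len(contig) - 3))
    total = sum(counts.values()) or 1  # avoid division by zero for tiny inputs
    all_tetramers = ["".join(p) for p in product("ACGT", repeat=4)]
    return [counts[t] / total for t in all_tetramers]

profile = tetra_profile("ATGGCGTACGTTAGC")
# a 256-dimensional vector; entries sum to 1 for contigs containing only A/C/G/T
```

Composition-based binning then clusters contigs whose profiles are close in this 256-dimensional space, while composition-and-abundance-based methods concatenate such a profile with the contig's coverage values across samples.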
1.1.3 Inferring Phylogenetic Relationships
After contigs/reads have been assigned to OTUs, the phylogenetic relationships among OTUs
can also be inferred by sequence comparison. Sequence comparison is widely used to study
the relationship among molecular sequences. The dominant tools for sequence comparison are
alignment-based methods, including global [78] and local [114] sequence alignments. With the
advent of alignment-based tools such as BLAST [5] and sequence databases such as RefSeq [91],
alignment-based methods are widely used in a broad range of applications. Despite their extensive
applications, alignment-based methods are not appropriate in metagenomic studies. On one hand, NGS technologies generate large amounts of short reads, and it is challenging to assemble them in both genomic and metagenomic studies. Without long assembled contigs across many samples, it is challenging for alignment-based methods to compare genomes and metagenomes [130, 44]. On the other hand, viruses are more likely to infect bacterial hosts having similar word pattern usage [1, 103], and thus the hosts of viruses can potentially be inferred based on their word pattern usage. However, alignment-based methods may not be well suited for studying virus-host infectious associations.
Alignment-free sequence comparison methods serve as attractive alternatives for studying the relationships among sequences when alignment-based methods are not appropriate or too time-consuming to be implemented in practice [125, 115]. Several types of alignment-free methods are available, including those based on the counts of k-mers, longest common subsequences, shortest absent patterns, etc., which have recently been reviewed in a special issue of Briefings in Bioinformatics [124]. Alignment-free statistics using k-mer counts project each sequence into a k-mer count feature space, where sequence information is transformed into numerical values such as k-mer frequencies. The recently developed statistics d_2 and d_2^S have been shown to perform well theoretically [96] as well as in many applications, including the comparison of gene regulatory regions [115], whole genome sequences [97], metagenomes [44], and virus-bacteria host infectious associations [1].
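To make the k-mer framing concrete, the sketch below counts k-mers and computes a simple cosine-style d_2 dissimilarity. This is a minimal illustration only, assuming the common normalization d_2 = (1 - cosine similarity)/2; it does not implement the background-model corrections used by d_2^S.

```python
from itertools import product

def kmer_counts(seq, k=4):
    """Count k-mer occurrences in an upper-case A/C/G/T sequence."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing ambiguous bases
            counts[word] += 1
    return [counts[w] for w in sorted(counts)]

def d2_dissimilarity(x, y):
    """d2-style dissimilarity: half of (1 - cosine similarity) of count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x) ** 0.5
    norm_y = sum(b * b for b in y) ** 0.5
    return 0.5 * (1.0 - dot / (norm_x * norm_y))
```

Identical sequences give a dissimilarity of 0, while sequences with disjoint k-mer content give the maximum value of 0.5 under this normalization.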
1.2 Big Data Analytics Paradigm
Rapidly evolving high-throughput technologies have enabled biologists to progressively collect large amounts of genomic data with unprecedented diversity and high resolution. For example, The Cancer Genome Atlas (TCGA) project [72] and the Encyclopedia of DNA Elements (ENCODE) project [22] have provided open access to genomic, transcriptomic, and epigenomic information from a diverse group of samples. The TCGA project contains genome and protein sequences, measurements of somatic mutations, copy number variation, mRNA expression, miRNA expression, and histological slides from thousands of tumors. The ENCODE project comprises thousands of datasets from ChIP-seq, RNA-seq, Hi-C, as well as other experiments. Presently, data is reaching a much larger scale, such as in the Genome 10K project [37], where the amount reaches the magnitude of petabytes (PB). Such a wealth of huge and heterogeneous data, nowadays referred to as "Big Data", serves as the basis for uncovering novel discoveries and improving our understanding of the molecular heterogeneity of biological processes.

The explosively increasing amount of "Big Data", with high volume and high variety, has raised compelling challenges as well as opportunities. The high volume of data demands scalable storage techniques as well as distributed algorithms for efficient information search and retrieval. The high variety of data sources calls for novel integrative analysis instead of straightforward concatenation. At the same time, the opportunities of "Big Data" come with enormous challenges, such as how to build up repositories of data, how to integrate and exploit data from various sources, how to manipulate and manage data efficiently, and how to analyze and interpret data in an interactive and visualized way.
The advent of next-generation sequencing (NGS) technologies has allowed biologists to extract genomic data with unprecedentedly high resolution and sufficient sequencing depth, which has fueled the massive generation of genomic data at faster speed and lower cost. In addition to the direct generation of genomic data, NGS has also facilitated the generation of other types of omics data, such as epigenomics, transcriptomics, proteomics, metabolomics, and interactomics. Therefore, the role of bioinformatics in dealing with "Big Data" in the NGS era can be divided into the following broad categories:
1.2.1 Data Integration
The rich availability of omics data has encouraged researchers to derive multi-scale insights into the underlying biological mechanisms. Conventionally, many studies have scrutinized each individual type of omics data independently to uncover novel discoveries. However, restriction to a single data type may fail to untangle the molecular heterogeneity of biological processes by missing essential complementary information across different types of omics data. Recognizing such limitations, integrative omics data analyses have attracted growing interest, based upon the rationale that combining multiple data types may complement unreliable or missing information in any single data type. More importantly, complete biological processes are likely to be discovered only if different types of omics data provide multi-scale insights into the analyses. There are extensive existing methods to integrate different types of data, from concatenation-based integrative analyses to transformation-based integrative analyses [34]. However, despite extensive studies, challenges persist in (i) the curse of dimensionality (i.e., a small number of samples compared to a much larger number of features); (ii) data heterogeneity (i.e., different omics data varying in data distribution); and (iii) unbalanced scales (i.e., uneven sizes across different types of data). So far no existing methods are able to tackle all these challenges simultaneously, and more advanced techniques are still needed.
1.2.2 Data Analysis and Knowledge Discovery
Data analysis and knowledge discovery is the prime concern after collecting the omics data. The main workhorses of the analysis procedure are machine-learning models that can learn from data automatically and improve with experience [63]. Conventionally, machine-learning algorithms can be broadly categorized into clustering and classification techniques. The former captures underlying patterns of the data and groups them into biologically interpretable groups, such as in metagenomic contig binning [65]. The latter infers general properties of the data distribution from a few annotated examples, such as in RNA-binding protein (RBP) binding site prediction [117]. In addition to clustering and classification, searching also serves as an important analytic process for exploring "Big Data" to discover consistent and matched patterns, such as the sequence comparison in phylogeny reconstruction and metagenomics research.
1.2.3 Data Visualization
Though data analysis and knowledge discovery has been extensively applied, most studies have treated the underlying machine-learning models as a "black box" and presumed that the underlying mechanisms are incomprehensible and uninterpretable. Yet the lack of clear knowledge of how and why the machine-learning models work may prevent the models from achieving their best performance. Therefore, it is desirable to develop more transparent and interpretable platforms to better understand and improve the machine-learning models interactively, with the aid of interactive visualization. Omics data require the ability to transform large and complex data into comprehensive and intuitive plots that immediately reflect the result of the corresponding operations. This would help researchers diagnose the behaviour of the machine-learning models and thus guide them to improve the performance of the models. The key challenges of interactive visualization lie in the functionalities, the scalability, and the responsiveness to operations.
1.3 Motivation: Marriage between Big Data Paradigm and Metagenomic Analyses
The computational methods and software packages developed in this dissertation are mainly motivated by the need to analyze metagenomic data accurately, promptly, and interactively. Specifically, we have developed novel computational methods to address the following problems:

- To investigate the taxonomic structure of microbial samples, given assembled contigs, we want to bin the contigs into OTUs not only using the concatenation of the tetra-nucleotide composition and the coverage profiles of the contigs across multiple metagenomic samples, but also seamlessly integrating two types of additional knowledge, co-alignment to reference genomes and linkage between contigs provided by paired-end reads, to achieve improved performance.

- In contig binning, we are further aware that it is far from optimal to concatenate the tetra-nucleotide composition and the coverage profiles across multiple metagenomic samples straightforwardly. We want to weigh important features more highly than less important ones in accordance with implicitly existing auxiliary knowledge. More importantly, we expect the resulting feature importance scores to be interpretable and meaningful.

- In inferring the phylogenetic relationships among OTUs by alignment-free sequence comparison, we realize that the state-of-the-art dissimilarity measures CVTree, d_2, and d_2^S are computationally expensive. Therefore, we want to accelerate the computation for efficient analyses from both a novel data structure and a fast implementation perspective.
1.4 Outline and Contributions
The rest of this dissertation is organized as follows. Chapter 2 presents COCACOLA, a general framework that automatically bins contigs into OTUs based upon sequence composition and coverage across multiple samples. The effectiveness of COCACOLA is demonstrated on both simulated and real datasets in comparison to state-of-the-art binning approaches such as CONCOCT, GroopM, MaxBin and MetaBAT. The superior performance of COCACOLA relies on two aspects. One is employing the L_1 distance instead of the Euclidean distance for better taxonomic identification during initialization. More importantly, COCACOLA takes advantage of both hard clustering and soft clustering through sparsity regularization. In addition, the COCACOLA framework seamlessly embraces customized knowledge to facilitate binning accuracy. Two types of additional knowledge are investigated, the co-alignment to reference genomes and the linkage of contigs provided by paired-end reads, as well as the ensemble of both. It has been shown that both co-alignment and linkage information further improve binning in the majority of cases. COCACOLA is scalable and faster than CONCOCT [4], GroopM [43], MaxBin [134], and MetaBAT [47].
It has been observed that binning metagenomic contigs is difficult when the number of samples is small, regardless of whether COCACOLA, CONCOCT, GroopM, MaxBin or MetaBAT is used [65]. With a small number of metagenomic samples, the relationships between the contigs cannot be accurately inferred from the relationships between their abundance profiles. Therefore, Chapter 3 introduces Heterogeneity Rescaling Pursuit (Hetero-RP), a scalable and tuning-free preprocessing framework for integrative genomic studies, to re-weigh the contributions of abundance profiles and composition profiles so that important features are weighted more highly than less important ones. The rationale for determining the weights of different features is based on the implicitly existing auxiliary knowledge related to the problem of interest. As a result, Hetero-RP automatically weighs abundance and composition profiles according to the varying number of samples, resulting in markedly improved performance of contig binning on both simulated and real datasets.
Next, we aim to infer the phylogenetic relationships among OTUs by alignment-free sequence comparison. Chapter 4 reports a stand-alone software package, aCcelerated Alignment-FrEe sequence analysis (CAFE), for the efficient calculation of 28 alignment-free dissimilarity measures. CAFE allows both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in the standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures, including dendrograms, heatmaps, principal coordinate analysis (PCoA), and network displays. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes.

Finally, Chapter 5 concludes the dissertation with potential improvements of the proposed methods as well as future directions. Chapters 2-4 are self-contained and can be read independently.
Chapter 2
COCACOLA: binning metagenomic contigs using sequence
COmposition, read CoverAge, CO-alignment, and
paired-end read LinkAge
In this chapter, we present COCACOLA, a general framework that automatically bins contigs into OTUs based upon sequence composition and coverage across multiple samples. The effectiveness of COCACOLA is demonstrated for both simulated and real datasets in comparison to state-of-the-art binning approaches such as CONCOCT, GroopM, MaxBin and MetaBAT. The superior performance of COCACOLA relies on two aspects. One is employing the L_1 distance instead of the Euclidean distance for better taxonomic identification during initialization. More importantly, COCACOLA takes advantage of both hard clustering and soft clustering through sparsity regularization. In addition, the COCACOLA framework seamlessly embraces customized knowledge to facilitate binning accuracy. Two types of additional knowledge are investigated, the co-alignment to reference genomes and the linkage of contigs provided by paired-end reads, as well as the ensemble of both. It has been shown that both co-alignment and linkage information further improve binning in the majority of cases. COCACOLA is scalable and faster than CONCOCT [4], GroopM [43], MaxBin [134], and MetaBAT [47].
2.1 Introduction
Metagenomic studies aim to understand microbial communities directly from environmental samples without cultivating member species [100]. Next-generation sequencing (NGS) technologies allow biologists to extract genomic data with unprecedentedly high resolution and sufficient sequencing depth, offering insights into complex microbial communities, even including species with very low abundance [2]. To further investigate the taxonomic structure of microbial samples, assembled sequence fragments, also known as contigs, need to be grouped into operational taxonomic units (OTUs) that ultimately represent genomes or significant parts of genomes. OTU clustering is also called binning (or genomic binning), and serves as the key step towards taxonomic profiling and downstream functional analysis. Therefore, accurate binning of the contigs is an essential problem in metagenomic studies.

Despite extensive studies, accurate binning of contigs remains challenging for several major reasons, including chimeric assemblies due to repetitive sequence regions within or across genomes, sequencing errors or artifacts, and strain-level variation within the same species [4, 68]. The currently available binning methods can be broadly categorized into classification and clustering approaches. Classification approaches are "taxonomy dependent", where reference databases are needed for the assignment of contigs or reads to meaningful taxa. The classification is based either on homology due to sequence identity, or on genomic signatures such as oligonucleotide composition patterns and taxonomic clades. Homology-based methods include MEGAN [41], which assigns reads to the lowest common taxonomic ancestor. Examples of genomic signature-based methods include PhyloPythia [71] and Kraken [132], which are composition-based classifiers, and the naive Bayesian classifier (NBC) [102], a clade-specific approach. In addition, hybrid methods are available that take both alignment- and composition-based strategies into consideration, such as PhymmBL [13] and SPHINX [75].
In comparison, clustering approaches are "taxonomy independent", where no additional reference databases or taxonomic information is needed. These approaches require similarity measurements ranging from GC content, tetra-mer composition [2, 19, 137], or Interpolated Markov Models [49], to contig coverage profiles [8, 136].

Recently, several methods have been developed to bin contigs using the coverage profiles of the contigs across multiple metagenomic samples [2, 4, 16, 43, 47, 79, 134]. Here the coverage of a contig is defined as the fraction of reads mapped to the contig in a sample. The idea is that if two contigs are from the same genome, their coverage profiles across multiple samples should be highly correlated. These methods can be further improved by integrating coverage profiles with the sequence tetra-mer composition of the contigs [4, 43, 47]. Among these methods, GroopM [43] is advantageous in its visualized and interactive pipeline. On one hand, it is flexible, allowing users to merge and split bins under expert intervention. On the other hand, in the absence of expert intervention, the automatic binning results of GroopM are not as satisfactory as those of CONCOCT [4]. CONCOCT [4] makes use of the Gaussian mixture model (GMM) to cluster contigs into bins. Also, CONCOCT provides a mechanism to automatically determine the optimal OTU number by variational Bayesian model selection [24]. MetaBAT [47] calculates an integrated distance for pairwise contigs and then clusters contigs iteratively by a modified K-medoids algorithm. And MaxBin [134] compares the distributions of distances between and within the same genomes.
In this paper we present COCACOLA, a general framework for contig binning incorporating sequence COmposition, CoverAge, CO-alignment and paired-end read LinkAge across multiple samples. By default, COCACOLA utilizes sequence composition and coverage across multiple samples for binning. Compared to recent approaches such as CONCOCT, GroopM, MaxBin and MetaBAT, COCACOLA performs better in three aspects. Firstly, COCACOLA is superior with respect to precision, recall and Adjusted Rand Index (ARI). Secondly, COCACOLA is more robust in the case of varying numbers of samples. Thirdly, COCACOLA is scalable and faster than CONCOCT, GroopM, MaxBin and MetaBAT.
In addition, the COCACOLA framework seamlessly embraces customized knowledge to facilitate binning accuracy. In our study, we have investigated two types of knowledge, in particular, the co-alignment to reference genomes and the linkage between contigs provided by paired-end reads. We find that both co-alignment and linkage information facilitate better binning performance in the majority of cases.
2.2 Materials and Methods
2.2.1 Problem Formulation
A microbial community is comprised of a set of OTUs at different abundance levels, and our objective is to put contigs into the genomic OTU bins from which they were originally derived. OTUs are expected to be disentangled based on contigs exhibiting either discriminative abundance or dissimilarity among sequences in terms of tetra-mer composition. The rationale of binning contigs into OTUs relies on the underlying assumption that contigs originating from the same OTU share similar relative abundance as well as sequence composition.

Formally, we encode the abundance and composition of the k-th OTU by an (M+V)-dimensional feature vector W_k, k = 1, 2, …, K, where M is the number of samples, V is the number of distinct l-mers, and K is the total number of OTUs. Specifically, W_{mk} represents the abundance of the k-th OTU in the m-th sample, m = 1, 2, …, M. And W_{M+v,k} stands for the l-mer relative frequency composition of the k-th OTU, v = 1, 2, …, V. Similarly, the feature vector of the n-th contig is denoted as X_n. Let H_{kn} be the indicator describing whether the n-th contig belongs to the k-th OTU, i.e., H_{kn} = 1 means the n-th contig originates from the k-th OTU and H_{kn} = 0 otherwise. Therefore, X_n can be represented as:
X_n = H_{1n} W_1 + H_{2n} W_2 + ⋯ + H_{Kn} W_K,    n = 1, 2, …, N    (2.1)
where N is the number of contigs. Equation (2.1) can be further written in matrix form:

X ≈ W H    s.t.    W ≥ 0,  H ∈ {0, 1}^{K×N},  ‖H_n‖_0 = 1    (2.2)
where W = (W_1, W_2, …, W_K) is an (M+V)×K nonnegative matrix with each column encoding the feature vector of the corresponding OTU, and H = (H_1, H_2, …, H_N) is a K×N binary matrix with each column encoding the indicator vector of the corresponding contig. ‖H_n‖_0 = Σ_{k=1}^{K} H_{kn} = 1 ensures that the n-th contig belongs exclusively to one particular OTU.
The matrices W and H are obtained by minimizing a certain objective function. In this paper we use the Frobenius norm, commonly known as the sum of squared errors:

argmin_{W,H≥0} ‖X − WH‖_F²    s.t.    H ∈ {0, 1}^{K×N},  ‖H_n‖_0 = 1    (2.3)

Note that solving Equation (2.3) is NP-hard, as it can be formulated as an integer programming problem with an exponential number of feasible solutions [45]. A common way to tackle Equation (2.3) is to relax the binary constraint on H to numerical values. Hence Equation (2.3) is reformulated as the following minimization problem:

argmin_{W,H} ‖X − WH‖_F²    s.t.    W, H ≥ 0    (2.4)
where H serves as a coefficient matrix instead of an indicator matrix. In the scenario of Equation (2.4), W_k, the feature vector of the k-th OTU, represents the centroid of the k-th cluster. Meanwhile, each contig X_n is approximated by a weighted mixture of clusters, where the weights are encoded in H_n. In other words, relaxation of the binary constraint changes the interpretation from hard clustering to soft clustering, where hard clustering means that a contig can be assigned to one OTU only, while soft clustering allows a contig to be assigned to multiple OTUs. It has been observed that by imposing sparsity on each column of H, the hard clustering behavior can be facilitated [51]. Therefore, Equation (2.4) is further modified into the Sparse Nonnegative Matrix Factorization (SNMF) form [51]:

argmin_{W,H≥0} ‖X − WH‖_F² + λ Σ_{n=1}^{N} ‖H_n‖_1²    (2.5)

where ‖·‖_1 indicates the L_1-norm. Due to the non-negativity of H, ‖H_n‖_1 equals the column sum of the n-th column vector of H. The parameter λ > 0 controls the trade-off between approximation accuracy and the sparseness of H. Namely, a larger λ implies stronger sparsity while a smaller value ensures better approximation accuracy.
2.2.2 Feature Matrix Representation of Contigs
Similar to CONCOCT [4], each contig longer than 1000 bp is represented by an (M+V)-dimensional column feature vector consisting of an M-dimensional coverage profile and a V-dimensional tetra-mer composition. The coverage denotes the average number of mapped reads per base pair from each of the M different samples, while the tetra-mer composition denotes the tetra-mer frequencies for the contig itself plus its reverse complement. Due to palindromic tetra-mers, V = 136.
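The count V = 136 follows from pairing each tetra-mer with its reverse complement: of the 4^4 = 256 tetra-mers, 16 are their own reverse complement (palindromic), giving (256 − 16)/2 + 16 = 136 canonical tetra-mers. A minimal sketch (an illustration, not the COCACOLA implementation):

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(s):
    return "".join(COMPLEMENT[c] for c in reversed(s))

# Canonical tetra-mers: each tetra-mer is represented by the lexicographically
# smaller of itself and its reverse complement.
CANONICAL = sorted({min(w, revcomp(w))
                    for w in ("".join(p) for p in product("ACGT", repeat=4))})

def tetramer_counts(contig):
    """Tetra-mer counts over the contig plus its reverse complement (V = 136)."""
    index = {w: i for i, w in enumerate(CANONICAL)}
    counts = [0] * len(CANONICAL)
    for seq in (contig, revcomp(contig)):
        for i in range(len(seq) - 3):
            w = seq[i:i + 4]
            if set(w) <= set("ACGT"):  # skip windows with ambiguous bases
                counts[index[min(w, revcomp(w))]] += 1
    return counts
```

For an unambiguous contig of length L, the counts sum to 2(L − 3), since every window on both strands is tallied once under its canonical form.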
Adopting the notation of CONCOCT [4], the coverage of all N contigs is represented by an N×M matrix Y, where N is the number of contigs of interest and Y_{nm} indicates the coverage of the n-th contig in the m-th sample. The tetra-mer composition of the N contigs is represented by an N×V matrix Z, where Z_{nv} indicates the count of the v-th tetra-mer found in the n-th contig. Before normalization, a pseudo-count is added to each entry of the coverage matrix Y and the composition matrix Z, respectively. For the coverage, a small value is added, i.e., Y′_{nm} = Y_{nm} + 100/L_n, analogous to a single read aligned to each contig as a prior, where L_n is the length of the n-th contig. For the composition, a single count is simply added, i.e., Z′_{nv} = Z_{nv} + 1.
The coverage matrix Y is first column-wise normalized (i.e., normalization within each individual sample), followed by row-wise normalization (i.e., normalization across the M samples) to obtain the coverage profile p. The row-wise normalization aims to mitigate sequencing efficiency heterogeneity among contigs:

Y″_{nm} = Y′_{nm} / Σ_{n=1}^{N} Y′_{nm},    p_{nm} = Y″_{nm} / Σ_{m=1}^{M} Y″_{nm}    (2.6)

The composition matrix Z is row-wise normalized for each contig (i.e., normalization across the V tetra-mer counts) to obtain the composition profile q:

q_{nv} = Z′_{nv} / Σ_{v=1}^{V} Z′_{nv}    (2.7)

The feature matrix of contigs is denoted as X = [p q]^T, the combination of the coverage profile p and the composition profile q. To be specific, X is an (M+V)×N nonnegative matrix where each column represents the feature vector of a particular contig.
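Equations (2.6) and (2.7), together with the pseudo-counts, can be sketched with NumPy as follows; this is a minimal illustration with our own variable names, not code from COCACOLA itself.

```python
import numpy as np

def contig_feature_matrix(Y, Z, lengths):
    """Build the (M+V) x N feature matrix X from raw coverage and tetra-mer counts.

    Y: N x M raw coverage matrix; Z: N x V tetra-mer count matrix;
    lengths: array of contig lengths L_n.
    """
    Y = np.asarray(Y, float) + 100.0 / np.asarray(lengths, float)[:, None]  # Y' pseudo-count
    Z = np.asarray(Z, float) + 1.0                                          # Z' pseudo-count
    Y = Y / Y.sum(axis=0, keepdims=True)   # column-wise: within each sample
    p = Y / Y.sum(axis=1, keepdims=True)   # row-wise: across the M samples (Eq. 2.6)
    q = Z / Z.sum(axis=1, keepdims=True)   # row-wise: across the V tetra-mers (Eq. 2.7)
    return np.hstack([p, q]).T             # columns are contig feature vectors
```

Since the p-part and the q-part of each contig's feature vector each sum to one, every column of X sums to two.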
2.2.3 Incorporating Additional Knowledge into Binning
We consider two types of additional knowledge that may enhance the binning accuracy [9]. One option is paired-end read linkage. Specifically, a high number of links connecting two contigs implies a high possibility that they belong to the same OTU. Since the linkage may be erroneous due to the existence of chimeric sequences, we keep only linkages that are reported across multiple samples. The other option is co-alignment to reference genomes. That is, two contigs mapped to the same reference genome support the evidence that they belong to the same OTU.

We encode additional knowledge by an undirected network in the form of a nonnegative weight matrix A, where A_{nn′} ≥ 0 quantifies the confidence level that the n-th contig and the n′-th contig should be clustered together. Based upon the aforementioned matrix A, a network regularization term is introduced to measure the coherence of binning [14]:

R_g = (1/2) Σ_{n,n′=1}^{N} ‖H_n − H_{n′}‖² A_{nn′} = Tr(H L H^T)    (2.8)

where Tr(·) indicates the matrix trace, i.e., the sum of the entries along the diagonal. D denotes the diagonal matrix whose entries are the column sums (or, due to symmetry, the row sums) of A, i.e., D_{nn} = Σ_{n′=1}^{N} A_{nn′}. The Laplacian matrix [21] is defined as L = D − A. By convention we use the normalized Laplacian matrix instead, that is, L̃ = D^{−1/2} L D^{−1/2} = I − D^{−1/2} A D^{−1/2} ≜ I − Ã. By incorporating the network regularization of Equation (2.8), the objective function in Equation (2.5) changes to the following form:

argmin_{W,H≥0} ‖X − WH‖_F² + λ Σ_{n=1}^{N} ‖H_n‖_1² + γ Tr(H L̃ H^T)    (2.9)

where the parameter γ > 0 controls the trade-off of belief between unsupervised binning and the additional knowledge. Namely, a large γ indicates strong confidence in the additional knowledge; conversely, a small γ puts more weight on the data.
To utilize multiple sources of additional knowledge together, a combined Laplacian matrix is constructed as the weighted average of the individual Laplacian matrices, L̃ = (Σ_d ω_d L̃_d) / (Σ_d ω_d), where each positive weight ω_d reflects the contribution of the corresponding source of information.
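As an illustration (not the COCACOLA code itself), the normalized Laplacian defined above can be computed directly from a symmetric weight matrix A:

```python
import numpy as np

def normalized_laplacian(A):
    """Normalized Laplacian I - D^{-1/2} A D^{-1/2} of a symmetric weight matrix A."""
    A = np.asarray(A, float)
    d = A.sum(axis=1)  # degrees D_nn
    with np.errstate(divide="ignore"):
        inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    A_norm = A * inv_sqrt[:, None] * inv_sqrt[None, :]  # D^{-1/2} A D^{-1/2}
    return np.eye(A.shape[0]) - A_norm
```

For the two-node graph with a single unit-weight edge, this yields [[1, -1], [-1, 1]]; in general the eigenvalues of a normalized Laplacian lie in [0, 2].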
2.2.4 Optimization by Alternating Nonnegative Least Squares (ANLS)
Among the comprehensive set of algorithms for solving Equation (2.9), the multiplicative updating approach [58] is the most widely used. Despite its simplicity of implementation, its slow convergence is of high concern. This paper adopts a more efficient algorithm with provable convergence called alternating nonnegative least squares (ANLS) [51]. ANLS iteratively solves the two nonnegative least squares (NNLS) [129] subproblems in Equation (2.10) until convergence. The ANLS algorithm is summarized in Algorithm 1.

H ← argmin_{H≥0} ‖X − WH‖_F² + λ Σ_{n=1}^{N} ‖H_n‖_1² + γ Tr(H L̃ H^T)    (2.10a)

W ← argmin_{W≥0} ‖X^T − H^T W^T‖_F²    (2.10b)
Algorithm 1: Optimization by Alternating Nonnegative Least Squares
Input: feature matrix X ∈ R^{(M+V)×N}, initial basis matrix W ∈ R^{(M+V)×K} and coefficient matrix H ∈ R^{K×N}, tolerance threshold ε, maximum iteration threshold T
1: repeat
2:   Obtain the optimal H of Equation (2.10a) by fixing W
3:   Obtain the optimal W of Equation (2.10b) by fixing H
4: until a stopping criterion involving ε is satisfied or the iteration number exceeds T
Output: W, H

We solve Equation (2.10a) by block coordinate descent (BCD); that is, we divide Equation (2.10a) into N subproblems and minimize the objective function with respect to each subproblem at a time while keeping the rest fixed:
argmin_{H_n≥0} ‖X_n − W H_n‖_2² + λ ‖H_n‖_1² + γ H_n^T L̃ H_n,    n = 1, …, N

= argmin_{H_n≥0} ‖X_n − W H_n‖_2² + λ ‖H_n‖_1² + γ H_n^T (H_n − 2 Σ_{n′=1}^{N} Ã_{nn′} H^old_{n′})

= argmin_{H_n≥0} ‖X_n − W H_n‖_2² + λ ‖H_n‖_1² + γ ‖H_n − Σ_{n′=1}^{N} Ã_{nn′} H^old_{n′}‖_2²    (2.11)
where the matrix H^old denotes the value of H obtained from the previous iteration. Following the Jacobi updating rule, we combine the N subproblems in Equation (2.11) into the matrix form:

argmin_{H≥0} ‖ [ X ; 0_{1×N} ; √γ H^old Ã ] − [ W ; √λ e_{1×K} ; √γ I_K ] H ‖_F²    (2.12)

where the two bracketed matrices are stacked row-wise, 0_{1×N} is an N-dimensional row vector of all zeros, and e_{1×K} is a K-dimensional row vector of all ones.
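The H-update of Equation (2.12) is an ordinary NNLS problem per column of H. A minimal sketch using SciPy's nnls (our own variable names: `lam` and `gamma` stand for the two regularization parameters, and `A_norm` for the normalized weight matrix Ã):

```python
import numpy as np
from scipy.optimize import nnls

def update_H(X, W, H_old, A_norm, lam, gamma):
    """Solve the stacked NNLS problem of Eq. (2.12) column by column."""
    K, N = W.shape[1], X.shape[1]
    # Stacked design matrix: [W; sqrt(lam) * ones(1, K); sqrt(gamma) * I_K]
    design = np.vstack([W,
                        np.sqrt(lam) * np.ones((1, K)),
                        np.sqrt(gamma) * np.eye(K)])
    target_net = H_old @ A_norm  # K x N targets for the network term
    H = np.zeros((K, N))
    for n in range(N):
        b = np.concatenate([X[:, n], [0.0], np.sqrt(gamma) * target_net[:, n]])
        H[:, n], _ = nnls(design, b)
    return H
```

With both regularization parameters set to zero and a noise-free X = W H, the update recovers the nonnegative coefficients exactly when W has full column rank.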
2.2.5 Initialization of W and H
Note that we need to initialize W and H as the input to Algorithm 1. A good initialization not only enhances the accuracy of the solution but also facilitates fast convergence to a better local minimum [55]. We initialize W and H by K-means clustering; namely, W is set to the K-means centroids of X, with each column W_k corresponding to the feature vector of the k-th centroid. Meanwhile, H is set to the indicator matrix encoding the cluster assignment.

The distance measure contributes crucially to the success of binning. Ideally, a proper distance measure exhibits more distinguishable taxonomic differences. The traditional K-means approach takes the Euclidean distance as its default measure of closeness. However, for the coverage profile, Su et al. [118] show that the L_1 distance produces more reasonable binning results than Euclidean and correlation-based distances. For the composition profile, the L_1 distance is likewise better than Euclidean and cosine distances [62]. Therefore, our method adopts K-means clustering with the L_1 distance. Once a preliminary K-means clustering is achieved, we eliminate suspicious clusters with few contigs using the bottom-up L-Method [104].
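K-means under the L_1 distance amounts to k-medians, since the coordinate-wise median minimizes the within-cluster L_1 cost. A minimal sketch, assuming randomly chosen contig columns as initial centroids (not the exact COCACOLA initialization):

```python
import numpy as np

def kmeans_l1(X, K, n_iter=100, seed=0):
    """Cluster the columns of X into K groups under the L1 distance (k-medians)."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    W = X[:, rng.choice(N, size=K, replace=False)].copy()  # initial centroids
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid in L1 distance
        dists = np.abs(X[:, :, None] - W[:, None, :]).sum(axis=0)  # N x K
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: the L1-optimal centroid is the coordinate-wise median
        for k in range(K):
            members = labels == k
            if members.any():
                W[:, k] = np.median(X[:, members], axis=1)
    H = np.zeros((K, N))
    H[labels, np.arange(N)] = 1.0  # hard-assignment indicator matrix
    return W, H
```

On two well-separated groups of columns, the procedure recovers the partition regardless of which columns seed the centroids.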
2.2.6 Parameter Tuning
We have two parameters (λ, γ) to be tuned in our algorithm. A traditional cross-validation-like strategy demands searching through a two-dimensional grid of candidate values, which is computationally unaffordable in the case of large datasets. Instead, we first search for a good marginal λ value by fixing γ = 0. After that, a one-dimensional search is performed on a range of candidate γ values while keeping λ fixed.

In our implementation, when γ = 0, λ is approximated by the Lagrange multiplier of Equation (2.4) with the constraint Σ_{n=1}^{N} (‖H_n‖_1 − 1)² = 0, denoted by λ*. Then we run the algorithm with respect to each candidate γ and fixed λ = λ*, resulting in binning results with various cluster numbers. Notice that traditional internal cluster validity indices, such as the SSE (sum of squared errors) and the Davies-Bouldin index [26], are only applicable in the fixed-cluster-number scenario [131]. To be specific, these indices tend to increase or decrease monotonically as the cluster number increases [64]. We tackle the impact of monotonicity by adopting the TSS minimization index [120]; that is, we choose the candidate γ with the minimum TSS value, recorded as γ*. Then we solve Equation (2.9) by using (λ*, γ*) as the selected regularization parameters.
2.2.7 Post-processing
The binning result obtained from Algorithm 1 may contain clusters that are not well separated. Therefore, we define the separable conductance as an effective measure to evaluate the separation of pairwise clusters, so as to determine whether to merge them or not. Namely, we consider each cluster as having a spherical scope centered at its centroid. To be robust against outliers, the radius is chosen as the third quartile of the intra-cluster distances. The separable conductance between the c_1-th cluster and the c_2-th cluster, sep(c_1, c_2), is defined as the number of contigs from the c_1-th cluster also included in the spherical scope of the c_2-th cluster, divided by the size of the smaller of the two clusters. Intuitively, the separable conductance exploits the overlap between two clusters. The post-processing procedure works as follows: we repeatedly pick the pair of clusters with the maximum separable conductance and merge them until this value fails to exceed a certain threshold. The threshold is set to 1 in this study.
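The merge criterion above can be sketched as follows; this is our reading of the definition (Euclidean distances to the centroid, radius as the third quartile of contig-to-centroid distances), and the original implementation may differ in such details.

```python
import numpy as np

def separable_conductance(X, labels, c1, c2):
    """Overlap of cluster c1 with the spherical scope of cluster c2.

    X: feature matrix with contigs as columns; labels: cluster label per contig.
    """
    pts1 = X[:, labels == c1]
    pts2 = X[:, labels == c2]
    centroid2 = pts2.mean(axis=1, keepdims=True)
    # Radius: third quartile of the intra-cluster distances of c2 (robust to outliers)
    radius = np.percentile(np.linalg.norm(pts2 - centroid2, axis=0), 75)
    inside = np.linalg.norm(pts1 - centroid2, axis=0) <= radius
    return inside.sum() / min(pts1.shape[1], pts2.shape[1])
```

Well-separated clusters score 0, while heavily overlapping clusters score high and become merge candidates.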
2.2.8 Datasets
Alneberg et al. [4] simulated a "species" dataset and a "strain" dataset. Both simulated
datasets were constructed based upon 16S rRNA samples originating from the Human Microbiome
Project (HMP) [23]. The relative abundance profiles of the different species/strains for the
simulation were based on the HMP samples as well.
The simulated "species" dataset consisted of 101 different species across 96 samples. It aimed
to test the ability of CONCOCT to cluster contigs in complex populations [4]. The species were
approximated by operational taxonomic units (OTUs) from HMP with more than 3% sequence
differences. Each species was guaranteed to appear in at least 20 samples. A total of 37,628 contigs
remained for binning after co-assembly and filtering.
The simulated "strain" dataset aimed to test the ability of CONCOCT to cluster contigs at
different levels of taxonomic resolution [4]. To be more specific, it consisted of 20 different
species or strains from the same species across 64 samples, including five different E. coli
strains, five different Bacteroides species, five different species from different Clostridium
genera, and five different gut bacteria. It was challenging for CONCOCT to separate the five
different E. coli strains [4]. A total of 9,417 contigs remained for binning after co-assembly
and filtering.
In addition to the two simulated datasets, we use a time-series study of 11 fecal microbiome samples
from a premature infant [108], denoted as the "Sharon" dataset. Since the true species to which the
contigs belong are not known, we assign class labels by annotating contigs using the TAXAassign
script [42]. As a result, 2,614 out of 5,579 contigs are unambiguously labeled at the species level
for evaluation. Another real dataset contains 264 samples from the MetaHIT consortium [94]
(SRA:ERP000108), the same dataset used in MetaBAT [47], denoted as the "MetaHIT" dataset.
17,136 out of 192,673 co-assembled contigs are unambiguously labeled at the species level for
evaluation.
2.2.9 Evaluation Criteria
To evaluate a binning result with K clusters symbolizing predicted OTUs against S target
species, a $K \times S$ matrix $A = (a_{ks})$ can be constructed so that $a_{ks}$ indicates the
number of contigs shared between the $k$-th OTU and the $s$-th species. Therefore,
$a_{k\cdot} = \sum_s a_{ks}$ and $a_{\cdot s} = \sum_k a_{ks}$ stand for the size of the $k$-th
OTU and the $s$-th species, respectively.
One evaluation measurement focuses on whether pairs of contigs belonging to the same species can
be clustered together, such as the Adjusted Rand Index (ARI). Given the knowledge of the species
to which each contig is mapped, we treat the corresponding species as class labels. Then the
classification of a pair of contigs falls into one of four cases: $TP$ (True Positive) and $FN$
(False Negative) represent the number of pairs of contigs that truly belong to the same species
being clustered into the same OTU and distinct OTUs, respectively; $FP$ (False Positive) and $TN$
(True Negative) stand for the number of pairs of contigs from different species being clustered
into the same OTU and distinct OTUs, respectively. ARI is calculated as Equation (2.13):
\[
\mathrm{ARI} = \frac{2\,(TP \cdot TN - FP \cdot FN)}{FP^2 + FN^2 + 2\,TP \cdot TN + (TP + TN)(FP + FN)}
= \frac{\sum_{k,s} \binom{a_{ks}}{2} - t_3}{\tfrac{1}{2}(t_1 + t_2) - t_3},
\]
where
\[
t_1 = \sum_k \binom{a_{k\cdot}}{2}, \qquad
t_2 = \sum_s \binom{a_{\cdot s}}{2}, \qquad
t_3 = \frac{2\,t_1 t_2}{N(N-1)},
\]
and $N$ is the total number of contigs. In addition to ARI, we also consider precision and recall,
defined as follows:
\[
\mathrm{precision} = \frac{1}{N} \sum_k \max_s \{a_{ks}\}, \qquad
\mathrm{recall} = \frac{1}{N} \sum_s \max_k \{a_{ks}\}.
\]
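All three criteria can be computed directly from the contingency matrix $A$; a sketch follows (the function name is illustrative, and the ARI uses the contingency-table form of Equation (2.13)):

```python
import numpy as np

def binning_metrics(otu_labels, species_labels):
    """Compute precision, recall and ARI from paired OTU/species labels,
    following the K x S contingency matrix A = (a_ks) defined above."""
    otus = np.unique(otu_labels)
    species = np.unique(species_labels)
    A = np.array([[np.sum((otu_labels == k) & (species_labels == s))
                   for s in species] for k in otus])
    N = A.sum()
    precision = A.max(axis=1).sum() / N   # (1/N) * sum_k max_s a_ks
    recall = A.max(axis=0).sum() / N      # (1/N) * sum_s max_k a_ks
    comb2 = lambda x: x * (x - 1) / 2.0   # binomial coefficient C(x, 2)
    t1 = comb2(A.sum(axis=1)).sum()       # sum_k C(a_k., 2)
    t2 = comb2(A.sum(axis=0)).sum()       # sum_s C(a_.s, 2)
    t3 = 2.0 * t1 * t2 / (N * (N - 1))    # expected index under randomness
    ari = (comb2(A).sum() - t3) / (0.5 * (t1 + t2) - t3)
    return precision, recall, ari
```

A perfect binning yields precision = recall = ARI = 1, while grouping everything into a single bin drives ARI to 0.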
2.3 Results
Given the same input, i.e., sequence composition and coverage across multiple samples, we show
the effectiveness of COCACOLA on the simulated "species" and "strain" datasets, in comparison with
four state-of-the-art, methodologically distinct methods for contig binning: CONCOCT [4], GroopM
[43], MaxBin [134] and MetaBAT [47]. The comparison excludes Canopy [79], which bins co-abundant
gene groups instead of contigs. Furthermore, we investigate the performance improvement of
COCACOLA after incorporating two additional types of knowledge, co-alignment to reference genomes
and linkage between contigs provided by paired-end reads, as well as the ensemble of both. The
results reveal that both types of information facilitate better performance in the majority of
cases. Finally, we report the performance of COCACOLA on two real datasets.
2.3.1 Performance on the Simulated Datasets
Even though both COCACOLA and CONCOCT are able to determine the OTU number automatically,
an initial estimate of the OTU number K is needed as a starting point. Since the OTU number
is usually unknown, we study the binning performance with respect to the value of K chosen
empirically. We observe that K-means clustering tends to generate empty clusters given large
K. Our strategy is to increase K until there are more than K/2 empty clusters, and we choose
the corresponding K as the input. At this stage, we place more emphasis on the redundancy of
the OTU number rather than its accuracy. Thus, we obtain K = 192 and K = 48 as input to the
simulated "species" and "strain" datasets, respectively.
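The K-selection strategy above might be sketched as follows, using a minimal Lloyd's K-means for illustration. The helper names, the step size, and the exact stopping rule are assumptions; the dissertation's actual implementation may differ:

```python
import numpy as np

def kmeans_labels(X, k, iters=20, seed=0):
    """Minimal Lloyd's K-means returning hard assignments (requires k <= len(X))."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):            # leave empty clusters untouched
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def choose_initial_k(X, k0=2, step=2, k_max=64):
    """Increase K until K-means leaves more than K/2 empty clusters, and
    return that K as a deliberately redundant initial estimate."""
    k = k0
    while k < k_max:
        labels = kmeans_labels(X, k)
        empty = k - len(np.unique(labels))
        if empty > k / 2:
            break
        k += step
    return k  # falls back to k_max if the condition is never triggered
```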
For the simulated "species" dataset, Figure 2.1(a) compares COCACOLA against CONCOCT,
GroopM, MaxBin and MetaBAT in terms of precision, recall and ARI. The precision of COCACOLA
is 0.9978, suggesting that almost all contigs within each cluster originate from the same
species. In comparison, the precision of CONCOCT, GroopM, MaxBin and MetaBAT is 0.9343,
0.9324, 0.9973 and 0.9958, respectively. The recall obtained by COCACOLA is 0.9993, implying
that nearly all contigs derived from the same species are grouped into the same clusters.
In contrast, the recall of CONCOCT, GroopM, MaxBin and MetaBAT is 0.996, 0.881, 0.9973
and 0.9174, respectively. As for ARI, COCACOLA achieves 0.997 while CONCOCT, GroopM,
MaxBin and MetaBAT get 0.9296, 0.7922, 0.9961 and 0.9308, respectively.
For the simulated "strain" dataset, the results are shown in Figure 2.1(b). The precision,
recall and ARI of COCACOLA reach 0.9766, 0.9747 and 0.9512, respectively. In comparison,
CONCOCT, GroopM, MaxBin and MetaBAT achieve 0.8733, 0.9525, 0.8151 and 0.8730 in terms
of precision; 0.9552, 0.7805, 0.9167 and 0.8009 in terms of recall; and 0.8809, 0.7529, 0.757
and 0.5858 in terms of ARI, respectively.
We conclude that COCACOLA performs well in reconstructing species from highly complicated
environmental samples. In addition, COCACOLA handles strain-level variations well, which cannot
be fully resolved due to assembly limitations [4].
2.3.2 The Effect of Incorporating Additional Knowledge on Binning
We investigate the performance improvement of COCACOLA after incorporating two additional
types of knowledge as proposed in the "Methods" section, in particular, co-alignment to reference
genomes and linkage between contigs provided by paired-end reads. Moreover, we study the
ensemble of both. We compare the binning results of COCACOLA with and without the additional
knowledge, based upon sub-samples of the simulated "species" dataset. We choose sub-samples of
sizes ranging from 10 to 90, in increments of 10. To avoid duplicate contributions from a
particular sample, we choose sub-samples without overlap. Therefore, the numbers of sub-samples
are 9, 4, 3, 2, 1, 1, 1, 1, 1, respectively. Since the contributions from additional knowledge
nearly diminish when the sample size exceeds 30, we focus on the 16 cases with sample sizes
from 10 to 30.
Figure 2.1: The performance of COCACOLA, CONCOCT, GroopM, MaxBin and MetaBAT in terms of
precision, recall and ARI on (a) the simulated "species" dataset, (b) the simulated "strain"
dataset, (c) the real "Sharon" dataset, and (d) the real "MetaHIT" dataset.
In terms of co-alignment, we design the symmetric weight matrix A with $A_{nn'} = 1$ if contig
n and contig n' are aligned to the same species using the TAXAassign script [42]. As shown in
Figure 2.2(a)-(c), the precision is improved noticeably in 7 cases and decreased in 3 cases; the
recall is improved noticeably in 11 cases and decreased slightly in 1 case; and the ARI is
improved noticeably in 10 cases and decreased slightly in 2 cases.
In terms of linkage, we design the symmetric weight matrix A with $A_{nn'}$ equal to the number
of samples supporting a linkage connecting contig n and contig n'. As depicted in
Figure 2.2(d)-(f), the precision is improved noticeably in 7 cases and decreased in 2 cases; the
recall is improved noticeably in 7 cases and decreased slightly in 4 cases; and the ARI is
improved noticeably in 5 cases and decreased in 3 cases.
In terms of the ensemble of co-alignment and linkage, as depicted in Figure 2.2(g)-(i), the
precision is improved noticeably in 10 cases and decreased in 3 cases; the recall is improved
noticeably in 13 cases with no case decreasing; and the ARI is improved noticeably in 11 cases
and decreased in 1 case.
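The two symmetric weight matrices described above could be assembled as follows. This is a sketch; the function names and the input encodings (a per-contig species annotation, and per-sample link tuples) are assumptions:

```python
import numpy as np
from itertools import combinations

def coalignment_matrix(n_contigs, species_of):
    """A[n, n'] = 1 if contigs n and n' are annotated to the same species
    (species_of maps contig index -> species label, or None if unannotated)."""
    A = np.zeros((n_contigs, n_contigs))
    for i, j in combinations(range(n_contigs), 2):
        if species_of[i] is not None and species_of[i] == species_of[j]:
            A[i, j] = A[j, i] = 1
    return A

def linkage_matrix(n_contigs, links):
    """A[n, n'] = number of samples supporting a paired-end link between
    contigs n and n' (links: iterable of (sample, n, n') tuples)."""
    A = np.zeros((n_contigs, n_contigs))
    for sample, i, j in set(links):  # count each supporting sample once
        A[i, j] += 1
        A[j, i] += 1
    return A
```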
We have the following conclusions: (1) when there is a sufficient number of samples, the
contributions from additional knowledge diminish; (2) additional knowledge such as co-alignment
and linkage information facilitates better overall performance in the majority of cases; (3) the
ensemble of both types of information performs more stably than either individually.
2.3.3 Impact of Varying Number of Clusters K
We study the impact of varying K on the binning results. We apply COCACOLA and CONCOCT
to both the simulated "species" and "strain" datasets under different values of K (Figure 2.3).
We let K range from the exact number of real species to four times that number, in 10% increments.
Since GroopM, MaxBin and MetaBAT do not involve K, their results remain unchanged for
different K.
Figure 2.2: Evaluation of the impact of incorporating two additional types of knowledge (Section
2.2.3) on sub-samples of the simulated "species" dataset, comparing precision, recall and ARI
with versus without the additional knowledge for sub-sample sizes from 10 to 90. The first option
is co-alignment information to reference genomes, depicted by (a)-(c). The second option is
paired-end read linkage, depicted by (d)-(f). The ensemble of both is depicted by (g)-(i).
In the simulated "species" dataset (Figure 2.3(a)-(c)), there are several observations. First,
the recall obtained by COCACOLA stays around 0.999, slightly higher than 0.996 by CONCOCT
and 0.9973 by MaxBin, and much higher than 0.881 by GroopM and 0.9174 by MetaBAT. The nearly
perfect recall for COCACOLA, CONCOCT and MaxBin suggests that almost all contigs in each species
are grouped within the same cluster no matter how K is chosen. Second, the precision of both
COCACOLA and CONCOCT improves as K increases, with some deviations. When K is small, some
clusters may contain contigs from different species; as K increases, the contigs in each bin
tend to be more homogeneous, and thus precision improves. Notice that MetaBAT performs very well
in terms of precision and ARI, and a large K is required for COCACOLA to catch up.
In the simulated "strain" dataset (Figure 2.3(d)-(f)), when the number of clusters K varies
from 20 to 80, COCACOLA reaches a plateau at K = 26 with respect to precision, recall and
ARI. Specifically, precision stabilizes at 0.976, recall stabilizes at 0.976, and ARI stabilizes
at 0.9516. COCACOLA outperforms CONCOCT, MaxBin and MetaBAT in almost all cases except
when K = 78.
We conclude that COCACOLA tends to perform better with a larger K than the exact number of
real species, which is understandable because a larger K explores the data better.
2.3.4 The Effect of the Number of Samples on the Performance
To evaluate the effect of the number of samples on the performance, we use only a fraction of the
samples as input. The simulated "species" dataset comprises 96 samples overall; we choose
sub-samples of sizes ranging from 10 to 90, in increments of 10. To avoid duplicate contributions
from a particular sample, we choose sub-samples without overlap. Therefore, the numbers of
sub-samples are 9, 4, 3, 2, 1, 1, 1, 1, 1, respectively.
We set the number of clusters K = 200 for all sub-sample studies. Figure 2.4 shows the precision,
recall and ARI of the five methods for the different sub-samples. Notice that the precision, recall
Figure 2.3: The performance of COCACOLA, CONCOCT, GroopM, MaxBin and MetaBAT in
terms of precision, recall and ARI as a function of the initial cluster number K. The results for
the simulated "species" dataset are depicted by (a)-(c), and the results for the simulated "strain"
dataset by (d)-(f).
and ARI start to decrease when the sample size drops below 40. The decrease is mainly driven by
the loss of recall, interpreted as over-estimation of the OTU number. This is because hard
clustering-like approaches such as COCACOLA are less robust than mixture model-like approaches
such as CONCOCT. Nevertheless, COCACOLA demonstrates more robustness in terms of precision and
ARI than CONCOCT across all sub-samples.
GroopM demonstrates unstable performance when the sample size is small. For instance, as
depicted in Figure 2.4(a) and Figure 2.4(c), GroopM groups all contigs into one holistic bin in
one sub-sample of size 10 and one of size 30, respectively. In comparison, the numbers of bins
obtained are 84 and 96 by COCACOLA, and 80 and 92 by CONCOCT, respectively.
MaxBin performs the best for small sample sizes with regard to precision and ARI. However,
this superiority comes from the fact that MaxBin excludes a proportion of ambiguous contigs from
binning, whereas the other methods consider all contigs.
We conclude that COCACOLA and CONCOCT perform more stably for small sample sizes than
GroopM and MetaBAT. When the sample size is small, COCACOLA and CONCOCT display different
trade-offs between precision and recall: COCACOLA is superior in terms of precision but inferior
in terms of recall in comparison to CONCOCT. Overall, COCACOLA outperforms CONCOCT in terms of
ARI, an indicator taking both precision and recall into account. When the sample size grows
large, COCACOLA outperforms CONCOCT in all respects.
2.3.5 The Effect of Initialization
To evaluate the effect of initialization on the performance, we compare initialization using L1
distance and L2 (Euclidean) distance on sub-samples of the simulated "species" dataset.
We first evaluate the performance resulting from initialization only (Figure 2.5(a)-(c)). In
6 out of 23 cases, initialization using L2 distance shows better precision but suffers worse
recall. In the remaining cases, initialization using L2 distance is dominated by L1 distance.
Figure 2.4: Evaluation of the sensitivity of COCACOLA, CONCOCT, GroopM, MaxBin and MetaBAT to
varying sample size on the simulated "species" dataset, in terms of precision, recall and ARI.
The numbers of sub-samples are 9, 4, 3, 2, 1, 1, 1, 1, 1, corresponding to sample sizes 10, 20,
30, 40, 50, 60, 70, 80, 90, respectively.
We then evaluate the performance of COCACOLA with initialization using L1 and L2 distance.
In 8 out of 23 cases, initialization using L2 distance shows better precision but suffers worse
recall, and in 2 out of 23 cases, initialization using L2 distance shows better recall but
suffers worse precision. In the remaining cases, initialization using L2 distance is dominated
by L1 distance. As for ARI, initialization using L2 distance outperforms L1 distance in only
1 out of 23 cases.
We conclude that L1 distance performs better than L2 (Euclidean) distance for initialization.
2.3.6 Performance on Real Datasets
Applying COCACOLA to the "Sharon" dataset (Figure 2.1(c)) with an initial choice of K = 30,
the precision, recall and ARI reach 0.9889, 0.9759 and 0.9670, respectively. In comparison,
CONCOCT, GroopM, MaxBin and MetaBAT achieve 0.9801, 0.9820, 0.7077 and 0.9705 in terms
of precision; 0.9606, 0.9147, 0.9767 and 0.8344 in terms of recall; and 0.9600, 0.9126, 0.5639
and 0.8634 in terms of ARI, respectively. COCACOLA identifies 6 OTUs corresponding to the six
reported genomes. In comparison, CONCOCT, GroopM, MaxBin and MetaBAT identify 14, 24, 5 and 11
OTUs, respectively.
Next we investigate the performance improvement of COCACOLA after incorporating additional
knowledge. We use linkage information only, because it would be circular to use the TAXAassign
script [42] for both alignment and labeling. COCACOLA still identifies 6 OTUs, with the precision,
Figure 2.5: Evaluation of the impact of initialization on sub-samples of the simulated "species"
dataset. The performance comparison based on initialization with different distance measurements
alone is depicted by (a)-(c); the comparison based on initialization with different distance
measurements followed by COCACOLA is depicted by (d)-(f).
recall and ARI reaching 0.9923, 0.9797 and 0.9743, slightly outperforming the case without
additional knowledge.
Applying COCACOLA to the "MetaHIT" dataset (Figure 2.1(d)) with an initial choice of K = 100,
the precision, recall and ARI reach 0.9082, 0.8272 and 0.7717, respectively. In comparison,
CONCOCT, GroopM, MaxBin and MetaBAT achieve 0.8933, 0.5247, 0.6655 and 0.5738 in terms
of precision; 0.7901, 0.6843, 0.8228 and 0.7397 in terms of recall; and 0.7518, 0.3757, 0.5866
and 0.1088 in terms of ARI, respectively.
Next we investigate the performance improvement of COCACOLA after incorporating linkage
information. The performance further improves slightly, from 0.9082 to 0.9084 in terms of
precision, from 0.8272 to 0.8350 in terms of recall, and from 0.7717 to 0.7844 in terms of ARI.
2.3.7 Running Time of COCACOLA, CONCOCT, GroopM, MaxBin and MetaBAT
COCACOLA shares the same data parsing pipeline as CONCOCT and differs only in the binning
step, whereas GroopM uses its own workflow. It is therefore reasonable to compare the running
time of binning directly between COCACOLA and CONCOCT. To bring GroopM into context, we take
into account only the stages related to binning and exclude the data parsing stage. For MaxBin
and MetaBAT, we simply pre-calculate the abundance and depth information. MaxBin involves a
multi-threading parameter, which is set to the number of cores. All five methods run on a
12-core, 60GB-RAM computing platform provided by the USC High Performance Computing
Cluster. The comparison is conducted on both the simulated datasets and the real datasets
(Table 2.1). We conclude that COCACOLA runs faster than CONCOCT, GroopM, MaxBin and MetaBAT.
Dataset      COCACOLA            CONCOCT              GroopM               MaxBin               MetaBAT
             Time      Speedup   Time       Speedup   Time       Speedup   Time       Speedup   Time      Speedup
"species"    1m41.50s  1x        17m14.71s  10.2x     1h57m28s   69.4x     49m48.52s  29.4x     4m16.14s  2.5x
"strain"     10.94s    1x        1m10.99s   6.5x      17m00.46s  93.3x     9m54.80s   54.4x     2m31.52s  13.9x
"Sharon"     13.22s    1x        25.11s     1.9x      4m45.85s   21.6x     1m36.09s   7.3x      24.66s    1.9x
"MetaHIT"    2m39.12s  1x        20m20.90s  7.7x      12m47.68s  4.8x      2h20m52s   53.1x     7m25.07s  2.8x

Table 2.1: Running time of COCACOLA, CONCOCT, GroopM, MaxBin and MetaBAT (speedups are relative to COCACOLA).
2.4 Discussion
We develop a general framework to bin metagenomic contigs utilizing sequence composition and
coverage across multiple samples. Our approach, COCACOLA, outperforms the state-of-the-art
binning approaches CONCOCT [4], GroopM [43], MaxBin [134] and MetaBAT [47] on both simulated
and real datasets.
The superior performance of COCACOLA relies on several aspects. First, initialization plays
an important role in binning accuracy. Second, COCACOLA employs L1 distance instead of
Euclidean distance for better taxonomic identification. Third, COCACOLA takes advantage
of both hard clustering and soft clustering. Specifically, soft clustering (such as the Gaussian
mixture model used by CONCOCT) allows a contig to be assigned probabilistically to multiple
OTUs, and hence generally yields more robust results than hard clustering (such as the Hough
partitioning used by GroopM). However, in complex environmental samples with strain-level
variations, the corresponding OTUs are closely intertwined; in this situation soft clustering
further mixes the OTUs up and thus deteriorates clustering performance. COCACOLA obtains a
better trade-off between hard clustering and soft clustering by exploiting sparsity.
However, we notice that binning metagenomic contigs remains challenging when the number of
samples is small, regardless of whether COCACOLA, CONCOCT, GroopM, MaxBin or MetaBAT is used.
With a small number of metagenomic samples, the relationship between the contigs cannot be
accurately inferred from the relationship between the abundance profiles. Therefore, future
research needs to study how to re-weight the contributions of abundance profiles and composition
profiles in unsupervised [15] or semi-supervised [141] scenarios. Moreover, recent studies
suggest that Euclidean or L1 distance between l-mer frequencies does not perform as well in
comparing genome sequences as alternative dissimilarity measurements such as $d_2$ and $d_2^S$
[128]. However, the use of such measurements is computationally challenging and needs further
exploration.
The COCACOLA framework seamlessly embraces customized knowledge to facilitate binning
accuracy. In our study, we have investigated two types of knowledge, in particular, co-alignment
to reference genomes and linkage of contigs provided by paired-end reads. Even though the
contributions from additional knowledge diminish when there is a sufficient number of samples,
they play an important role in binning results when the number of samples is small. In future
studies, we intend to explore better customized prior knowledge: one option is exploiting
phylogenetic information in taxonomic annotation [92]; another relies on identifying functional
annotations of contigs, including open reading frames (ORFs) that are likely to encode
co-abundant gene groups [79], etc. We have also investigated the ensemble of both co-alignment
and linkage knowledge, which shows more stable performance than either type of information
individually. In future studies, we aim to find optimal weights [122] instead of equal weights.
Chapter 3
Hetero-RP: Towards Enhanced and Interpretable
Clustering/Classification in Integrative Genomics
In this chapter, we study the integration of different types of data and propose a scalable and
tuning-free preprocessing framework, Heterogeneity Rescaling Pursuit (Hetero-RP), which weighs
important features more highly than less important ones in accordance with implicitly existing
auxiliary knowledge. We demonstrate the effectiveness of Hetero-RP in diverse clustering and
classification applications. More importantly, Hetero-RP offers an interpretation of feature
importance, shedding light on the driving forces of the underlying biology. In metagenomic contig
binning, Hetero-RP automatically weighs abundance and composition profiles according to the
varying number of samples, resulting in markedly improved binning performance. In RNA-binding
protein (RBP) binding site prediction, Hetero-RP not only improves the prediction performance
measured by the area under the receiver operating characteristic curve (AUC), but also uncovers
evidence supported by independent studies, including the distribution of the binding sites of
IGF2BP and PUM2, the binding competition between hnRNPC and U2AF2, and the intron-exon
boundary of U2AF2.
3.1 Introduction
Rapidly evolving high-throughput technologies have enabled biologists to progressively collect
large amounts of genomic data with unprecedented diversity and high resolution. For example,
The Cancer Genome Atlas (TCGA) project [72] and the Encyclopedia of DNA Elements (ENCODE)
project [22] have provided open access to genomic, transcriptomic, and epigenomic information
from a diverse group of samples. Potential data include, but are not limited to, genome and
protein sequences [90], single nucleotide variants (SNVs) [39], and gene ontologies [6].
Integrative analysis of such a wealth of heterogeneous data has motivated growing interest,
giving rise to enhanced reliability of novel discoveries and improved understanding of the
molecular heterogeneity of biological processes. Thus far, integrative analyses have been
carried out in pervasive clustering and classification studies. The former captures underlying
patterns of the data and groups them into biologically interpretable groups, such as metagenomic
contig binning [65]. The latter infers general properties of the data from a few annotated
examples, such as RNA-binding protein (RBP) binding site prediction [117].
One common idea for integrating different types of data is to concatenate the feature vectors
representing the data, as illustrated in Figure 3.1(A). Despite its simplicity, due to the
unbalanced scales, data sources with a large number of features tend to have a larger influence
on the final outcome than others. One potential remedy is to normalize each data source inversely
proportional to its feature size. However, meaningful features from larger data sources may then
be diluted and become even weaker than unwanted features from smaller ones. Another common
practice of data integration projects multiple data types onto the same latent feature space
[34]. However, different data sources usually exhibit some unique features that are not shared
by others, and thus the enforcement of a joint space can miss essential complementary
information from the different data sources. In general, despite extensive research, integrative
studies pose significant challenges.
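The naive concatenation and its inverse-size normalization remedy can be sketched as follows (illustrative only; the function name is an assumption):

```python
import numpy as np

def concat_features(blocks, inverse_size=False):
    """Concatenate feature blocks from multiple data sources.

    With inverse_size=True, each block is scaled by 1 / (number of its
    features), the naive remedy for unbalanced feature counts discussed
    above; larger blocks are damped, which may also dilute their signal."""
    scaled = []
    for B in blocks:
        B = np.asarray(B, dtype=float)
        scaled.append(B / B.shape[1] if inverse_size else B)
    return np.hstack(scaled)
```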
In this paper, we introduce Heterogeneity Rescaling Pursuit (Hetero-RP), a scalable and
tuning-free preprocessing framework for integrative genomic studies, which rescales features from
multiple data sources so that important features are weighted more highly than less important
ones. The rationale for determining the weights of different features is based on implicitly
existing auxiliary knowledge related to the problem of interest. We demonstrate the effectiveness
of Hetero-RP in two clustering and classification applications: metagenomic contig binning and
RBP binding site prediction. The objective of metagenomic contig binning is to cluster contigs in
metagenomic samples so that contigs from the same genome are binned together. Two contigs may map
to the same reference genome, or there may be multiple paired-end reads linking the two contigs.
To utilize such auxiliary information on contig pairs, we introduce the concept of
"positive-links" between pairs of contigs supported by strong evidence of belonging in the same
bin. Conversely, two contigs mapped to phylogenetically distant genomes are most unlikely to
belong to the same bin. Thus we introduce the concept of "negative-links" between pairs of
contigs not belonging to the same bin. With such auxiliary information available, we determine
the weights of the different features. Similarly, RBP binding site prediction aims to predict
whether RBPs bind to specific nucleotide positions of a target RNA. With the class labels
provided by the training data, two nucleotide positions may share the same label or have
different labels. To utilize this label information, we introduce "positive-links" between pairs
of nucleotide positions sharing the same class label in the training data, and "negative-links"
otherwise. In addition, the auxiliary knowledge not only exists implicitly for exploration, but
can be acquired actively and interactively as well. With human aid, interactive annotation
gathered from experts can further improve the performance in both clustering [126] and
classification [53].
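Constructing "positive-links" and "negative-links" from class labels, as described above for RBP binding site prediction, can be sketched as follows (the function name and the handling of unlabeled objects are assumptions):

```python
from itertools import combinations

def links_from_labels(labels):
    """Build 'positive-links' (pairs sharing the same class label) and
    'negative-links' (pairs with different labels) over all labeled objects.
    Objects with label None are treated as unlabeled and skipped."""
    positive, negative = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] is None or labels[j] is None:
            continue
        (positive if labels[i] == labels[j] else negative).append((i, j))
    return positive, negative
```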
Hetero-RP seeks feature weights that match up with the auxiliary knowledge encoded by the
"positive-links" and "negative-links", and is particularly tailored for heterogeneous data from
multiple sources. Unlike conventional feature selection, Hetero-RP makes no assumptions of
f1
f2
f3
f4
f5
f6
f7
f8
f9
f10
o1
o2
o4
o3
o5
positive-links
negative-links
f1
f2
f3
f4
f5
f6
f7
f8
f9
f10
f1 f2 f3 f4 f5 f6 f7 f8 f9 f10
f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f1 f2 f3 f4 f5 f6 f7 f8 f9 f10
1 1
1
-1
-1
-1 -1
1
1
1
1 1
-1
-1 -1
-1
o1
o2
o3
o4
o5
o1 o2 o3 o4 o5
f1
f2
f3
f4
f5
f6
f7
f8
f9
f10
o1 o2 o3 o4 o5
Signed Network
encoding links
Data Matrix
Feature Weights
by Hetero-RP
Rescaled Data
Matrix
weight=1
o1 o2 o3 o4 o5
f1
f2
f3
f4
f5
f6
f7
f8
f9
f10
Metagenome Binning
Binding Site Detection
Precision Medicine
Applications
(A)
(B) (C)
(D) Output of Hetero-RP Re-formulation of Input
Input of Hetero-RP
weight=0
Figure 3.1: Illustration of Hetero-RP. (A) Each object o_1, …, o_5 is represented by its
corresponding feature vector, containing features indexed from f_1 to f_10, with colors indicating
different data sources. "Positive-links" and "negative-links" are shown as green solid lines and
red dashed lines, respectively. (B) The input to Hetero-RP contains two parts: the data matrix
based on the concatenated feature vectors, and the signed graph encoding both "positive-links"
and "negative-links". (C) Hetero-RP rescales each dimension of features, but keeps the overall
scale fixed. (D) The applications of Hetero-RP widely cover both clustering and classification
domains.
feature independence [30, 31, 32]. Likewise, Hetero-RP does not enforce the majority of features
to be "irrelevant". Instead, it assumes that the fraction of features without unit weight is small.
These interpretable weights enable us to characterize the driving forces of the underlying biology.
We demonstrate the effectiveness of Hetero-RP using metagenomic contig binning and RBP
binding site prediction as examples. In metagenomic contig binning, Hetero-RP automatically
weighs the abundance and composition profiles according to the varying number of samples,
resulting in markedly improved binning performance on both simulated and real datasets. In RBP
binding site prediction, combining Hetero-RP with state-of-the-art methods substantially improves
the prediction performance, measured by the area under the receiver operating characteristic
curve (AUC), in 28 out of 31 real datasets by an average of 5.9%. Better still, the interpretable
feature importance learned by Hetero-RP uncovers evidence supported by independent studies,
including the distribution of the binding sites of IGF2BP and PUM2, the binding competition
between hnRNPC and U2AF2, and the intron-exon boundary of U2AF2.
3.2 Materials and Methods
3.2.1 Problem Formulation
Let O = {o_1, o_2, …, o_n} be the set of n objects, which indicate metagenomic contigs for
binning, RBP interaction sites for prediction, etc. Each object is represented by a feature vector,
for features from a single data source or the concatenation of multiple data sources.
Mathematically, each data source i with feature dimensionality p_i on the same set of n objects is
represented by a data matrix X_i ∈ R^{p_i × n}. When the number of data sources m > 1, we stack
them together by X = [X_1; X_2; …; X_m]. X ∈ R^{p × n} is the stacked data matrix illustrated in
Figure 3.1(B) and p ≜ p_1 + p_2 + ⋯ + p_m, where p_1, p_2, …, p_m are the feature
dimensionalities of the respective data sources. Then the feature vector of o_i is denoted as
X_{·i}, for i = 1, 2, …, n.
We encode "positive-links" and "negative-links" by an undirected signed graph G = (O, P, N),
where O is the set of objects, and P and N consist of the "positive-links" and "negative-links",
respectively. G can be represented by an adjacency matrix A ∈ R^{n × n}, where A_ij = 1 if there
is a "positive-link" between o_i and o_j, A_ij = −1 if there is a "negative-link" between o_i and
o_j, and A_ij = 0 otherwise. Hetero-RP aims to find a p-dimensional vector W = [w_1, w_2, …, w_p]
for the overall p features, so as to minimize the inconsistency between the signed graph G and
the feature-wise rescaled data matrix diag(W)X, where diag(·) diagonalizes a vector into a
diagonal matrix:

    min_W L(W) = Σ_{i,j} A_ij ‖diag(W) X_{·i} − diag(W) X_{·j}‖²
               = tr(diag(W) X L X^T diag(W)),
    s.t. W ≥ 0, and Σ_i W_i = p                                            (3.1)

where L = D − A denotes the Laplacian matrix [21] of the adjacency matrix A, and D is the
diagonal matrix whose entry d_ii equals the sum of the i-th row (or column, due to symmetry) of
A. In the above formulation, the inconsistency decreases when object pairs joined by
positive-links are pulled closer after the data matrix is rescaled. To avoid trivial solutions, we
enforce W to be nonnegative and conserved in sum, i.e., Σ_i W_i = p, as shown in Figure 3.1(C).
Unlike conventional feature selection, which assumes most features are irrelevant, Hetero-RP
assumes the majority of features are useful. Among those useful features, only a subset are more
or less informative (weight ≠ 1), whereas the rest are neutral (weight = 1). In comparison,
conventional feature selection treats features as either relevant (weight = 1) or irrelevant (weight
= 0). To examine whether the assumption of Hetero-RP holds, for the clustering scenario, the
dip test [36] can be used to check whether each feature is multi-modal. If not, that feature is
regarded as uninformative and thus excluded. For the classification task, univariate metrics such
as the t-test can be applied to score each feature and obtain the resulting p-values. Features
whose p-values exceed a certain threshold are likewise excluded. The assumption of Hetero-RP
naturally leads to the regularization of ΔW = W − 1, the deviation from unit weight. Thus,
Equation (3.1) changes to the following quadratic programming problem:
    min_{ΔW} L(ΔW) = tr(diag(1 + ΔW) X L X^T diag(1 + ΔW)) + λ‖ΔW‖²
                   = Σ_i [ Y_i (ΔW_i + 1)² + λ ΔW_i² ],
    s.t. ΔW_i ≥ −1, and Σ_i ΔW_i = 0                                       (3.2)
where the parameter λ > 0 shrinks the weights towards one and towards each other, and Y is the
diagonal vector of X L X^T, i.e., Y_i = (X L X^T)_ii, for i = 1, 2, …, p. Note that when
"negative-links" are available, Y may no longer remain nonnegative. To keep Equation (3.2)
convex for easy optimization, a common practice chooses Y_i = max(0, (X L X^T)_ii), for
i = 1, 2, …, p.
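Given Y and λ, the quadratic program in Equation (3.2) is separable once the sum constraint is dualized: each coordinate has the closed form ΔW_i(μ) = max(−1, −(2Y_i + μ)/(2Y_i + 2λ)), which is decreasing in the multiplier μ, so μ can be found by bisection. The sketch below is one possible implementation of this idea (the function name is ours, and it assumes Y has already been truncated at zero); it is an illustration, not the authors' released code.

```python
import numpy as np

def solve_hetero_rp_qp(Y, lam, tol=1e-10):
    """Sketch of the quadratic program in Eq. (3.2):
        minimize  sum_i [ Y_i (dW_i + 1)^2 + lam * dW_i^2 ]
        s.t.      dW_i >= -1  and  sum_i dW_i = 0.
    Dualizing the sum constraint makes the problem separable:
    dW_i(mu) = max(-1, -(2 Y_i + mu) / (2 Y_i + 2 lam)),
    which is decreasing in mu, so mu is found by bisection."""
    Y = np.asarray(Y, dtype=float)

    def dW(mu):
        return np.maximum(-1.0, -(2.0 * Y + mu) / (2.0 * Y + 2.0 * lam))

    lo, hi = -2.0 * (Y.max() + lam), 2.0 * (Y.max() + lam)  # bracket for mu
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dW(mid).sum() > 0.0:   # constraint sum too large -> increase mu
            lo = mid
        else:
            hi = mid
    return 1.0 + dW(0.5 * (lo + hi))  # weights W = 1 + dW, so sum(W) = p
```

For example, with Y = [10, 1, 1, 0] and λ = 1, the multiplier works out to μ = −42/23, giving W ≈ [0.174, 0.957, 0.957, 1.913]: the feature most at odds with the links is shrunk, and the slack is redistributed to the neutral feature by the sum constraint.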
3.2.2 Parameter choice for λ
Hetero-RP selects the parameter λ in Equation (3.2) automatically [119] by carrying out the
following two steps iteratively until convergence:

    ΔŴ ← arg min_{ΔW ≥ −1, Σ_i ΔW_i = 0} Σ_i Y_i (ΔW_i + 1)² + 2 p λ_0 σ̂ ‖ΔW‖²,     (3.3a)

    σ̂ ← ( (1/p) Σ_i Y_i (ΔŴ_i + 1)² )^{1/2},                                        (3.3b)

where λ_0 is chosen according to the suggestion of [98] and has also been used in [33]. In
particular, λ_0 = B / (p − 1 + B²)^{1/2}, where B = t_q(1 − p^{−1/2}/(2r log r); p − 1), with
t_q(α; d) the α-th quantile of a t-distribution with d degrees of freedom, and r the rank of L.
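The alternation of (3.3a) and (3.3b) can be sketched as a fixed-point loop around the same separable solver. The fragment below illustrates the scheme only; λ_0 is passed in as an input, since its exact value follows the quantile rule above, and the function name is ours.

```python
import numpy as np

def select_lambda(Y, lam0, n_iter=50):
    """Illustrative sketch of the alternation (3.3a)-(3.3b): repeatedly solve
    the penalized problem with penalty 2 * p * lam0 * sigma, then refit the
    scale sigma, until the scale stabilizes."""
    Y = np.asarray(Y, dtype=float)
    p = len(Y)

    def solve(lam):
        # Separable QP of Eq. (3.2): bisection on the multiplier of sum(dW) = 0.
        def dW(mu):
            return np.maximum(-1.0, -(2.0 * Y + mu) / (2.0 * Y + 2.0 * lam))
        lo, hi = -2.0 * (Y.max() + lam), 2.0 * (Y.max() + lam)
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if dW(mid).sum() > 0.0 else (lo, mid)
        return dW(0.5 * (lo + hi))

    sigma = 1.0
    for _ in range(n_iter):
        d = solve(2.0 * p * lam0 * sigma)                    # step (3.3a)
        sigma_new = np.sqrt(np.sum(Y * (d + 1.0) ** 2) / p)  # step (3.3b)
        if abs(sigma_new - sigma) < 1e-10:
            break
        sigma = sigma_new
    return 2.0 * p * lam0 * sigma, 1.0 + d                   # (lambda, weights W)
```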
3.2.3 Insufficient Auxiliary Knowledge
When "positive-links" and "negative-links" are sparse, Equation (3.2) may suffer from overfitting
and will be unable to provide the expected weights reliably. Therefore, we utilize the intrinsic
structure of the original data matrix X alongside the auxiliary knowledge. Specifically, we
consider a k-nearest-neighbor network containing n vertices, where each vertex i corresponds to
X_{·i}, the i-th column of X, and k is chosen as √n. For each vertex i, i = 1, 2, …, n, if vertex j
belongs to the k nearest neighbors of vertex i, then vertices i and j are connected by an edge with
weight M^(0)_ij = exp{ −‖X_{·i} − X_{·j}‖² / (2σ²) }, where σ can be chosen as 1.06 σ̂ n^{−1/5}
with σ̂ the standard deviation of X [112]. We let M_ij = max(M^(0)_ij, M^(0)_ji) for symmetry.
Finally, we use the combined adjacency matrix γA + M, where the parameter γ > 0 controls the
trade-off between the intrinsic data structure and the auxiliary knowledge. To balance the
contributions from A and M, γ is chosen as γ = tr(A^T M) / tr(A^T A), the minimizer of
arg min_γ ‖γA − M‖²_F, where ‖·‖²_F indicates the sum of squared errors (the squared
Frobenius norm).
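The construction of the k-nearest-neighbor affinity M and the combined matrix can be sketched as follows. This is an illustrative NumPy fragment with a hypothetical function name; in particular, placing γ on A (rather than on M) is our reading of the balancing rule γ = tr(A^T M)/tr(A^T A) stated above.

```python
import numpy as np

def auxiliary_affinity(X, A):
    """Sketch of Section 3.2.3: build a k-nearest-neighbor affinity M from
    the columns of X (k = sqrt(n)), symmetrize it, and combine it with the
    signed link matrix A as gamma * A + M."""
    p, n = X.shape
    k = max(1, int(np.sqrt(n)))
    sigma = 1.06 * X.std() * n ** (-0.2)           # bandwidth 1.06 * sd * n^(-1/5)
    sq = np.sum(X ** 2, axis=0)                    # pairwise squared distances
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    M0 = np.zeros((n, n))
    for i in range(n):
        nbrs = [j for j in np.argsort(d2[i]) if j != i][:k]
        M0[i, nbrs] = np.exp(-d2[i, nbrs] / (2.0 * sigma ** 2))
    M = np.maximum(M0, M0.T)                       # symmetrize
    gamma = np.trace(A.T @ M) / np.trace(A.T @ A)  # argmin_g ||g*A - M||_F^2
    return gamma * A + M
```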
3.3 Results
3.3.1 Application to Clustering: Metagenomic Contig Binning
Next-generation sequencing (NGS) technologies enable biologists to sequence microbial
communities from environmental samples directly. Contig binning is a process to group assembled
sequence fragments, also known as contigs, into operational taxonomic units (OTUs), in which
contigs in the same bin belong to closely related genomes [68]. Most available methods rely on
the integrated usage of abundance profiles across multiple metagenomic samples and tetra-mer
composition profiles of contig sequences [65, 4, 43, 134, 47].

In brief, binning utilizes two types of data: p_1-dimensional relative abundance profiles X_1 and
p_2-dimensional composition profiles X_2, where p_1 is the dimension of the abundance profiles
and p_2 is the number of distinct tetranucleotides. In addition, the co-alignment of contig pairs
and paired-end read linkage are considered as auxiliary knowledge that potentially contributes to
the binning performance [65].
We compared our previously proposed method, COCACOLA [65], with or without using
Hetero-RP. The COCACOLA used in this paper is an upgraded version that applies a non-linear
transformation to the input features by spectral embedding [127] (see supplementary material for
more details). To guarantee a fair comparison, the bin number is fixed as the estimate from
single-copy genes [134]. We not only showed the improvement of COCACOLA after
incorporating Hetero-RP as preprocessing, but also compared the improved performance against
three state-of-the-art, methodologically distinct methods: CONCOCT [4], MaxBin [134], and
MetaBAT [47].
We evaluated the gain from Hetero-RP on both simulated and real datasets. The simulated
"species" dataset consists of 101 distinct species across 96 samples [4], with more than 3%
sequence differences. Overall, n = 37,628 contigs of length at least 1 kbp were assembled for
binning. The binning result is evaluated by the Adjusted Rand Index (ARI), an overall measure
taking both precision and recall into account (see supplementary material for the definition).
The real "MetaHIT" dataset contains 264 different samples from the MetaHIT consortium [94]
(SRA:ERP000108), with a total of 192,673 assembled contigs of length at least 1 kbp remaining
for binning. Unlike the simulated dataset, the true labels are inaccessible in the real dataset.
Instead, we applied CheckM [83] to estimate the approximate precision (by the percentage of
genes absent from different genomes) and completeness (by the percentage of expected
single-copy genes that are binned).
For the simulated "species" dataset, to assess the gain of Hetero-RP thoroughly, we further
sub-sampled the data at sizes varying from 10 to 90 samples, with a step size of 10. To avoid
duplicate contributions from multiple replicates, the numbers of replicates are 9, 4, 3, 2, 1, 1, 1,
1, 1, respectively. When using co-alignment as auxiliary knowledge, as shown in Figure 3.2(A),
the ARI is improved in 20 cases by an average of 5.9% and decreased in 2 cases by an average of
1.7%. We also scrutinized the weights obtained by Hetero-RP on two randomly chosen cases of
sample size 10 and 96, respectively. As illustrated in Figure 3.2(B), the average weights of the
abundance profiles are 0.26 and 0.87 when the sample sizes are 10 and 96, respectively. That is,
Hetero-RP prefers to scale down the abundance profiles when the sample size is small, consistent
with the observation that binning performs better as the sample size increases [65, 4] (see
supplementary material for more details). When using paired-end read linkage as auxiliary
knowledge, as depicted in Figure 3.2(C), the ARI is improved in 9 cases by an average of 2.9%
and decreased in 3 cases by an average of 1.4%. The improvement is less prominent than with
co-alignment because the positive-links set from co-alignment is 1800x larger than the set from
linkage. The inferior performance is also revealed by the weights obtained by Hetero-RP. As
shown in Figure 3.2(D), the abundance profiles are not sufficiently scaled down, with an average
weight of 0.74, when the sample size is 10, and the weight has an average of 0.99, i.e., the
rescaling almost diminishes, when the sample size is 96. We next compared COCACOLA with
Hetero-RP using co-alignment against CONCOCT, MaxBin, and MetaBAT.
As illustrated in Figure 3.2(E), Hetero-RP performs well in a majority of the cases, achieving a
better precision-recall tradeoff, especially when sample sizes are small.
For the real "MetaHIT" dataset, we focused on the identification of genome bins having > 80%
precision (lack of contamination) and > 30% recall (completeness). As shown in Figure 3.2(F),
Hetero-RP contributes more or equally many genome bins at every completeness threshold.
Because co-alignment outperforms linkage consistently, we compared COCACOLA with
Hetero-RP using co-alignment against CONCOCT, MaxBin, and MetaBAT. We observe that for
the recovery of genome bins with > 90% completeness, MaxBin dominates the other methods. In
particular, MaxBin recovers 29 genome bins, in comparison to 25 by COCACOLA with
Hetero-RP, 15 by CONCOCT, and 14 by MetaBAT, respectively. Nevertheless, MaxBin does not
perform well for genome bins with < 70% completeness. COCACOLA with Hetero-RP
consistently recovers more genome bins than CONCOCT at every completeness threshold, and it
recovers more high-quality genome bins with ≥ 60% completeness than MetaBAT. We conclude
that in the experiment involving real metagenomic contigs, COCACOLA with Hetero-RP still
performs well.
3.3.2 Application to Classification: RBP Binding Site Prediction
We next incorporated Hetero-RP as a preprocessing step to predict whether RNA-binding
proteins (RBPs) bind to specific nucleotide positions of the target RNA, as RBPs are of vital
importance in the control of gene expression. Current state-of-the-art methods utilize the
combination of multiple data sources, including tetranucleotides, secondary structure, region
type, CLIP co-binding, and Gene Ontology (GO) terms [117].

For a given RBP, a predictive model is built upon n training nucleotide positions indicating
whether each position is a binding site or not. For each nucleotide position, the neighboring
[−50, 50] positions are considered in five types of data. Specifically, the tetranucleotide
composition X_1 has p_1 dimensions, where p_1 = 256 × 101 = 25,856; the probabilistic scores of
secondary structure X_2 computed by RNAfold [27] have p_2 dimensions, where p_2 = 101; the
region type
Figure 3.2: Incorporating Hetero-RP in Metagenomic Contig Binning. (A) COCACOLA with
Hetero-RP applied to the simulated "species" dataset with multiple sub-sampling replicates,
evaluated by Adjusted Rand Index (ARI). The auxiliary knowledge considers the co-alignment of
contig pairs. (B) The feature weights of two random samples of size 10 and 96 using
co-alignment. The blue shadow on the left side of the dashed line indicates the scales of the
abundance profile, whereas the red shadow on the right side of the dashed line indicates the
scales of the composition profile. The green horizontal line indicates the scale of 1. (C) The
auxiliary knowledge considers the paired-end read linkages. (D) The feature weights of two
random samples of size 10 and 96 using linkage. (E) COCACOLA with Hetero-RP using
co-alignment is compared against CONCOCT, MaxBin, and MetaBAT using ARI. The
improvement is more prominent in small-size cases such as 10 and 20. (F) Application to the real
"MetaHIT" dataset. The evaluation is based upon the recovery of genome bins at every
completeness threshold. The number of recovered genome bins (X-axis) by each method (Y-axis)
at different completeness thresholds (gray scale) with precision > 80%, calculated by the lack of
contamination using CheckM.
X_3 has five presence/absence indicators for intron, exon, 5'-UTR, 3'-UTR, and ORF, with
dimensionality p_3 = 5 × 101 = 505; the co-binding protein cDNA counts X_4 involve 30 other
RBP experiments, with dimensionality up to 30 × 101 = 3,030; and the GO annotations X_5 have
p_5 = 39,560 GO term markers indicating positions within known genes having that annotation
(see supplementary material for more detailed descriptions). In addition, the auxiliary knowledge
is taken as whether pairwise nucleotide positions share the same or different labels.
We evaluated Hetero-RP on 31 published CLIP experiments covering 19 distinct RBPs, each with
one or multiple experimental replicates [117]. These RBPs cover a variety of functionalities such
as splicing (ELAVL1, FUS, hnRNPs, TDP-43, U2AF2, etc.) and processing of 3'-UTRs (Ago,
IGF2BP, etc.). For each individual experiment, up to 20,000 identified crosslinking sites were
split into training and test sets as positive samples, whereas sites within non-interacting genes
served as negatives. The prediction performance is measured by the area under the receiver
operating characteristic curve (AUC).
We first compared the state-of-the-art method iONMF [117] with or without using Hetero-RP.
iONMF relies on orthogonality-regularized nonnegative matrix factorization, not only
outperforming other state-of-the-art methods, GraphProt [70] and RNAContext [48], but also
discovering class-specific RNA binding patterns (see supplementary material for more details). In
particular, we applied Hetero-RP to the training set and obtained the corresponding weights.
The same weights were afterwards applied to both training and test sets. We not only showed
the improvement of iONMF after incorporating Hetero-RP as preprocessing, but also compared
the improved performance against GraphProt and RNAContext. Meanwhile, we also compared
Hetero-RP against three popular feature selection methods: Correlation-based Feature Selection
(CFS) [35], Fast Correlation-Based Filter (FCBF) [139], and Sparse Logistic Regression (SLR)
[109] (see supplementary material for definitions).

We ran iONMF 20 times with different random seeds. For each round, the random seed was
fixed for both training and testing stages regardless of whether Hetero-RP was used. As shown in
Figure 3.3(A), in 28 out of 31 cases, Hetero-RP substantially improves the performance of the
already state-of-the-art method, by an average increase of 5.9%. In the remaining cases,
Hetero-RP shows negligibly worse performance, by an average decrease of 1.2%. It is notable
that QKI achieves a 21.1% improvement by Hetero-RP, with the AUC score increasing from
0.678 to 0.821. Furthermore, Hetero-RP facilitates more robust prediction by decreasing the
standard deviation by an average of 45.9%. We next compared the improved performance of
iONMF against state-of-the-art methods. As illustrated in Figure 3.3(B), iONMF with Hetero-RP
outperforms GraphProt and RNAContext in 25 out of 31 cases, by average increases of 11.1%
and 11.0%, respectively. In the remaining 6 cases, Hetero-RP is outperformed by an average of
6.2% and 5.7%, respectively. After that, we compared Hetero-RP with three different types of
feature selection methods. As shown in Figure 3.3(C), Hetero-RP outperforms CFS, FCBF, and
SLR in 26, 30, and 31 out of 31 cases, by average increases of 6.7%, 12.0%, and 12.9%,
respectively. In comparison, Hetero-RP merely shows worse performance by average decreases of
2.4% and 1.9% in the remaining cases for CFS and FCBF, respectively.
We scrutinized the weights obtained by Hetero-RP (see supplementary material for more details).
As shown in Figure 3.3(D), the 3'-UTR region types turn out to be the most informative features
across the upstream and downstream of the crosslinking sites for RBPs such as IGF2BP and
PUM2. In contrast, the intron, 5'-UTR, and ORF region types are scaled down. This observation
agrees with the fact that the binding sites of both IGF2BP and PUM2 are distributed across
3'-UTRs [105]. In addition, both IGF2BP and PUM2 are mainly cytoplasmic and their binding
sites are mainly located in exons [105], which is also captured by Hetero-RP. Specifically, the
exon features upstream of the crosslinking sites are scaled up for both IGF2BP and PUM2, and
the downstream features are also enriched for IGF2BP.

It has been reported that hnRNPC interacts with the same positions as U2AF2 [117], consistent
with the weights revealed in Figure 3.3(E), which are either scaled up or unchanged across the
upstream and downstream of the crosslinking sites. The underlying rationale is that binding of
the hnRNPC or U2AF2 serves as indirect evidence of binding of the counterpart. Meanwhile,
there is evidence supporting direct competition between the two [140], implied by the fact that
the weights of positions around [−25, 25] relative to the binding site are lower than those
upstream and downstream.

Finally, U2AF2 is a splicing factor that predominantly crosslinks to the 3' splice site [140]. It has
also been reported that the intron-exon boundary lies at about 30 nucleotides upstream from the
binding site [117], exactly where the weights of both the intron and exon region types start to
increase steeply, as depicted in Figure 3.3(F).
3.4 Discussion
Hetero-RP provides a general data preprocessing framework for integrative genomic studies. By
utilizing implicitly existing auxiliary knowledge, Hetero-RP introduces a scalable algorithm to
weigh important features more highly than less important ones. At the same time, efforts have
been made to avoid overfitting through regularization and by incorporating the intrinsic
structure of the data per se. From a practitioner's perspective, Hetero-RP is tuning-free, without
tedious cross-validation for tuning parameters.

We demonstrate the effectiveness of Hetero-RP in both clustering and classification domains,
from metagenomic contig binning to RBP binding site prediction, showing the wide applicability
of our framework. More importantly, Hetero-RP not only plays the role of a "black box", but also
leads to interpretability of feature importance, offering insights into better biological
understanding.
We also note potential improvements of Hetero-RP for future investigation:

(i) The "positive-links" and "negative-links" chosen as auxiliary knowledge are assumed to be
generic, so that they remain invariant across all situations. Nevertheless, such a generic
assumption may be limited when the auxiliary knowledge varies with different situations. That
is, a specific
Figure 3.3: Incorporating Hetero-RP in RBP Binding Site Prediction. (A) iONMF with or
without Hetero-RP is applied to 31 published CLIP experiments. The positive-links and
negative-links sets are constructed according to the labels of the nucleotide positions in the
training set. The performance is evaluated by the area under the receiver operating
characteristic curve (AUC). (B) Hetero-RP is compared against state-of-the-art methods. (C)
Hetero-RP is compared against popular feature selection methods. (D,E,F) Interpretations of the
scales obtained by Hetero-RP. (D) The 3'-UTR region type of PUM2 and IGF2BP has the
largest weight, consistent with the fact that the binding sites of both IGF2BP and PUM2 are
distributed across 3'-UTRs. (E) The mutual co-binding of hnRNPC and U2AF2 has large weight,
and this observation agrees with the fact that hnRNPC interacts with the same positions as
U2AF2. Moreover, the weights are even larger at the upstream and downstream, supporting the
evidence of direct competition between the two. (F) The intron and exon region types of U2AF2
start to scale up at about 30 nucleotides upstream from the binding site, where the reported
intron-exon boundary is located.
"positive-link" or "negative-link" may hold in certain situations but not in others. For example,
individual RBP binding activities may change along with different cell types, environmental
conditions, or biological systems. Therefore, modeling condition-specific auxiliary knowledge is
needed to adaptively learn feature weights that exhibit condition-specific behavior. Moreover,
once feature weights have been learned from auxiliary knowledge under some conditions, how
and to what extent to reuse such results to boost the performance given auxiliary knowledge
under other conditions also needs to be studied.
(ii) More general forms of auxiliary knowledge, such as relative comparisons of the form "A is
closer to B than A is to C", can be considered. Relative comparisons are pervasive in many
scenarios. For example, a relative comparison encodes a phylogenetic tree containing the
interspecies relationships among microbial organisms. To be specific, two genomes sharing the
same genus taxonomic level are more likely to belong to the same OTU than those at different
levels [92]. Actually, positive-links and negative-links are special cases of relative comparisons: a
positive-link pair is closer than any random pair, which in turn is closer than a negative-link pair.

(iii) The quadratic programming form of Hetero-RP is not computationally optimal, and faster
solvers such as ADMM [12] can be applied.
In summary, Hetero-RP can serve as a foundational preprocessing tool for integrative genomic
studies. With the advent of high-throughput technologies, researchers are exposed to "Big Data"
in biology and medicine. Though integrative studies are increasingly common and facilitate
better understanding of biological mechanisms, the optimal integration of diverse, heterogeneous,
massive data still needs to be explored. We expect Hetero-RP to motivate a rich set of
applications in integrative genomic studies and "Big Data" practices.
Chapter 4
CAFE: aCcelerated Alignment-FrEe sequence analysis
In this chapter, we aim to infer the phylogenetic relationships among OTUs by alignment-free
sequence comparison. We observe that the state-of-the-art dissimilarity measures CVTree, d_2^*,
and d_2^S are computationally expensive. Therefore, we report a stand-alone software package,
aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28
alignment-free dissimilarity measures. CAFE accepts both assembled genome sequences and
unassembled NGS shotgun reads as input, and wraps the output in the standard PHYLIP format.
In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures,
including dendrograms, heatmaps, principal coordinate analysis (PCoA), and network displays.
CAFE serves as a general k-mer based alignment-free analysis platform for studying the
relationships among genomes and metagenomes.
4.1 Introduction
Sequence comparison is widely used to study the relationships among molecular sequences. The
dominant tools for sequence comparison are alignment-based methods, including global [78] and
local [114] sequence alignment. With the advent of alignment-based tools such as BLAST [5] and
sequence databases such as RefSeq [91], alignment-based methods are used in a broad range of
applications. Despite their extensive applications, alignment-based methods are not appropriate
in some situations. First, gene regulatory regions are generally not highly conserved, making it
difficult for alignment-based approaches to identify related regulatory regions that are bound by
similar transcription factors [59]. Second, next-generation sequencing (NGS) technologies
generate large amounts of short reads, and it is challenging to assemble them for both genomic
and metagenomic studies. Without long assembled contigs across many samples, it is challenging
for alignment-based methods to compare genomes and metagenomes [130, 44]. Third, viruses are
more likely to infect bacterial hosts having similar word pattern usage [1, 103], and thus the
hosts of viruses can potentially be inferred based on their word pattern usages. However,
alignment-based methods are usually not applicable for studying virus-host infectious
associations.
Alignment-free sequence comparison methods serve as attractive alternatives for studying the
relationships among sequences when alignment-based methods are not appropriate or too
time-consuming to be implemented in practice [125, 115]. Several types of alignment-free
approaches are available, including those based on the counts of k-mers, longest common
subsequences, shortest absent patterns, etc., which have recently been reviewed in a special issue
of Briefings in Bioinformatics [124]. Here we concentrate on alignment-free statistics using k-mer
counts. These approaches project each sequence into a k-mer (or equivalently, k-tuple, k-gram)
count feature space, where sequence information is transformed into numerical values such as
k-mer frequencies. We do not consider dissimilarity measures using spaced k-mers, due to the
added computational complexity of counting spaced k-mers. The recently developed statistics
d_2^* and d_2^S have been shown to perform well theoretically [96] as well as in many
applications, including the comparison of gene regulatory regions [115], whole genome sequences
[97], metagenomes [44], and virus-bacteria host infectious associations [1]. Despite their excellent
performance in many applications, the original implementations of these statistics are relatively
slow, due to the requirement of calculating the expected k-mer counts, and this limits their usage.
CAFE significantly speeds up the calculation of recently developed measures based on
background-adjusted k-mer counts, such as CVTree [93], d_2^* [96], and d_2^S [96], with
reduced memory requirements. In addition, CAFE integrates 10 conventional measures based on
k-mer counts, such as Chebyshev (Ch), Euclidean (Eu), Manhattan (Ma), the d_2 dissimilarity
[11], Jensen-Shannon divergence (JS) [46], feature frequency profiles (FFP) [113], and Co-phylog
[138]. CAFE also offers 15 measures based on the presence/absence of k-mers, such as the
Jaccard and Hamming distances. We further demonstrate the value of alignment-free
dissimilarity measures using CAFE on real datasets, ranging from primate, vertebrate, and
microbial genomic sequences to metagenomic sequence reads.
4.2 Background
4.2.1 Alignment-free Dissimilarity Basics
The k-mer based alignment-free dissimilarity measures aim to compare two genome sequences,
G^(1) and G^(2), of lengths L^(1) and L^(2), respectively, based upon the occurrences of all
k-mers of fixed length k in the molecular sequences.

Let N_w^(i) be the number of occurrences of a given k-mer w via a sliding window of length k
over the sequence G^(i), where i = 1, 2. When studying double-stranded sequences, N_w^(i) also
takes into account the number of occurrences of the reverse complementary k-mer. Further, the
frequency of the k-mer, f_w^(i) = N_w^(i) / Σ_w N_w^(i), is defined as its relative abundance.
For some of the dissimilarity measures such as d
2
and d
S
2
, the expected counts of the k-mers
under a certain model of the genomic sequences are needed. Here we use Markov models as the
generative models of the sequences. Specically, assuming an r-th order Markov model (r < k)
with transition probabilities (i;j) and stationary probabilities (i) with i2A
r
; j 2A, the
expected occurrences EN
(i)
w
can be calculated as:
EN
(i)
w
= (L
(i)
k + 1)(w[1 :r])
kr
Y
i=1
(w[i :i +r 1]; w[i +r]) (4.1)
57
where the transition and stationary probabilities can be estimated from the sequence data.
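As a concrete illustration, estimating the Markov parameters from a sequence and evaluating Eq. (4.1) can be sketched in Python as follows. This is a minimal sketch under stated assumptions (r >= 1, single strand, no pseudocounts); the function names are ours, not CAFE's:

```python
from collections import Counter

def estimate_markov(seq, r):
    """Estimate r-th order Markov parameters (r >= 1) from one sequence.
    Returns (pi, t): pi[u] is the empirical frequency of the r-mer u,
    t[(u, b)] is the estimated P(next base = b | preceding r-mer = u)."""
    rmers = Counter(seq[i:i + r] for i in range(len(seq) - r + 1))
    total = sum(rmers.values())
    pi = {u: c / total for u, c in rmers.items()}
    nxt = Counter(seq[i:i + r + 1] for i in range(len(seq) - r))
    t = {}
    for w in nxt:
        u, b = w[:-1], w[-1]
        denom = sum(nxt.get(u + a, 0) for a in "ACGT")
        t[(u, b)] = nxt[w] / denom
    return pi, t

def expected_count(w, L, pi, t, r):
    """Expected occurrences of k-mer w in a length-L sequence under the
    r-th order Markov model, following Eq. (4.1):
    EN_w = (L - k + 1) * pi(prefix of length r) * product of transitions."""
    k = len(w)
    e = (L - k + 1) * pi[w[:r]]
    for i in range(k - r):
        e *= t[(w[i:i + r], w[i + r])]
    return e
```

For example, for the periodic sequence "ACGTACGTACGT" under a first order model, every base has stationary probability 0.25 and A is always followed by C, so the expected count of "AC" is (12 - 2 + 1) * 0.25 * 1 = 2.75.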
4.2.2 Conventional measures based on k-mer counts

4.2.2.1 Chebyshev

The Chebyshev (Ch) distance is defined as:

$$Ch = \max_{w \in A^k} \left| f_w^{(1)} - f_w^{(2)} \right| \qquad (4.2)$$

4.2.2.2 Euclidean

The Euclidean (Eu) distance is defined as:

$$Eu = \sqrt{\sum_{w \in A^k} \left( f_w^{(1)} - f_w^{(2)} \right)^2} \qquad (4.3)$$

4.2.2.3 Manhattan

The Manhattan (Ma) distance is defined as:

$$Ma = \sum_{w \in A^k} \left| f_w^{(1)} - f_w^{(2)} \right| \qquad (4.4)$$

4.2.2.4 Canberra

The Canberra distance is a variation of the Manhattan distance, defined as:

$$Canberra = \sum_{w \in A^k} \frac{\left| f_w^{(1)} - f_w^{(2)} \right|}{f_w^{(1)} + f_w^{(2)}} \qquad (4.5)$$
4.2.2.5 $d_2$ or Cosine [11]

The $d_2$ distance, or equivalently the Cosine distance, is defined as:

$$d_2 = \frac{1}{2}\left(1 - \frac{\sum_{w \in A^k} f_w^{(1)} f_w^{(2)}}{\sqrt{\sum_{w \in A^k} \bigl(f_w^{(1)}\bigr)^2}\,\sqrt{\sum_{w \in A^k} \bigl(f_w^{(2)}\bigr)^2}}\right) \qquad (4.6)$$
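The frequency-based measures in Eqs. (4.2)-(4.6) reduce to a few lines once the k-mer frequency vectors are in hand. A minimal sketch, treating k-mers absent from one sequence as having zero frequency (the function names are illustrative, not CAFE's API):

```python
import math

def kmer_freqs(seq, k):
    """Relative k-mer frequencies of a sequence (exact matches, single strand)."""
    counts = {}
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        counts[w] = counts.get(w, 0) + 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def conventional_measures(f1, f2):
    """Chebyshev, Euclidean, Manhattan, Canberra and d2 (cosine)
    dissimilarities (Eqs. 4.2-4.6) from two k-mer frequency dicts."""
    keys = set(f1) | set(f2)
    pairs = [(f1.get(w, 0.0), f2.get(w, 0.0)) for w in keys]
    ch = max(abs(a - b) for a, b in pairs)
    eu = math.sqrt(sum((a - b) ** 2 for a, b in pairs))
    ma = sum(abs(a - b) for a, b in pairs)
    canberra = sum(abs(a - b) / (a + b) for a, b in pairs if a + b > 0)
    dot = sum(a * b for a, b in pairs)
    n1 = math.sqrt(sum(a * a for a, _ in pairs))
    n2 = math.sqrt(sum(b * b for _, b in pairs))
    d2 = 0.5 * (1 - dot / (n1 * n2))
    return {"Ch": ch, "Eu": eu, "Ma": ma, "Canberra": canberra, "d2": d2}
```

Two sequences with identical frequency profiles give 0 under all five measures; two sequences with disjoint k-mer sets give the maximal values (e.g. d2 = 0.5).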
4.2.2.6 Pearson

The Pearson distance is defined as:

$$Pearson = 1 - \frac{\sum_{w \in A^k} f_w^{(1)} f_w^{(2)} - \dfrac{\sum_{w \in A^k} f_w^{(1)} \sum_{w \in A^k} f_w^{(2)}}{|A^k|}}{\sqrt{\sum_{w \in A^k} \bigl(f_w^{(1)}\bigr)^2 - \dfrac{\bigl(\sum_{w \in A^k} f_w^{(1)}\bigr)^2}{|A^k|}}\;\sqrt{\sum_{w \in A^k} \bigl(f_w^{(2)}\bigr)^2 - \dfrac{\bigl(\sum_{w \in A^k} f_w^{(2)}\bigr)^2}{|A^k|}}} \qquad (4.7)$$
4.2.2.7 Feature frequency profiles (FFP) [113]

The feature frequency profiles (FFP) dissimilarity is defined as:

$$FFP = \frac{1}{2}\left(\sum_{w \in A^k} f_w^{(1)} \log_2 \frac{f_w^{(1)}}{f_w^{(2)}} + \sum_{w \in A^k} f_w^{(2)} \log_2 \frac{f_w^{(2)}}{f_w^{(1)}}\right) \qquad (4.8)$$
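Eq. (4.8) is half the symmetrized Kullback-Leibler divergence between the two frequency profiles. A sketch, with the simplifying assumption (ours, not stated in the text) that the sums are restricted to k-mers with nonzero frequency in both profiles so that both log ratios are defined:

```python
import math

def ffp(f1, f2):
    """FFP dissimilarity (Eq. 4.8): half the symmetrized KL divergence
    between two k-mer frequency dicts. Assumption: only k-mers present
    in both profiles contribute, so no log of zero occurs."""
    shared = [w for w in f1 if w in f2]
    kl12 = sum(f1[w] * math.log2(f1[w] / f2[w]) for w in shared)
    kl21 = sum(f2[w] * math.log2(f2[w] / f1[w]) for w in shared)
    return 0.5 * (kl12 + kl21)
```

The measure is symmetric by construction and is zero when the two profiles coincide.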
4.2.2.8 Jensen-Shannon divergence (JS) [46]

Given two sequences fitted with r-th order Markov models $M_1$ and $M_2$, respectively, the Jensen-Shannon divergence between the two sequences is defined as:

$$JS = h\!\left(\frac{M_1 + M_2}{2}\right) - \frac{1}{2} h(M_1) - \frac{1}{2} h(M_2) \qquad (4.9)$$

where $h(M_i)$ denotes the Shannon entropy for Markov model $M_i$, $i = 1, 2$. That is, $h(M_i) = -\sum_{w \in A^r} f_w^{(i)} \sum_{w' \in A} f_{w'|w}^{(i)} \log f_{w'|w}^{(i)}$, where $f_{w'|w}^{(i)} = N_{ww'}^{(i)} / N_w^{(i)}$ and $ww'$ represents the concatenation of $w$ and $w'$. The length of $ww'$ is $(r + 1)$. $\frac{M_1 + M_2}{2}$ denotes the average Markov model between $M_1$ and $M_2$.
4.2.2.9 Co-phylog [138]

Different from the other dissimilarity measures, where the k-mer counts require an exact match, Co-phylog focuses on approximate matches. Thus, Co-phylog defines a structure $S = C_{a_1, a_2, \ldots, a_n} O_{b_1, b_2, \ldots, b_{n-1}}$, where $a_i$ and $b_i$ are the lengths of the i-th consecutive segment of 1s and the i-th consecutive segment of 0s, respectively. For example, the seed 1110111 has the structure $S = C_{3,3} O_1$. CAFE uses the structure $S = C_{\frac{k-1}{2}, \frac{k-1}{2}} O_1$ when k is odd and $S = C_{\frac{k}{2}-1, \frac{k}{2}} O_1$ when k is even.

Given a structure $S = C_{a_1, \ldots, a_n} O_{b_1, \ldots, b_{n-1}}$ and a k-mer $w = s_1 s_2 \cdots s_k$, $S$ divides $w$ into $2n - 1$ parts from left to right of lengths $a_1, b_1, a_2, b_2, \ldots, a_{n-1}, b_{n-1}, a_n$. Then the C-gram, denoted as $C_S(w)$, is defined as the concatenation of the first, the third, ... parts of $w$, whereas the O-gram, denoted as $O_S(w)$, is defined as the concatenation of the second, the fourth, ... parts of $w$. For example, given the structure $S = C_{3,3} O_1$ and $w = actgact$, we have $C_S(w) = actact$ and $O_S(w) = g$.

For a given genome $G$, we can obtain all its k-mers and the corresponding C-grams. For any C-gram $c$, its objects are defined as $object_{G,S}(c) = \{O_S(w) : w \in G, C_S(w) = c\}$. Further, the C-gram $c$ is called a context if and only if the set $object_{G,S}(c)$ has exactly one element. Suppose we aim to compare two genome sequences $G^{(1)}$ and $G^{(2)}$; given a structure $S$, we have their context C-gram sets $C_S(G^{(1)})$ and $C_S(G^{(2)})$, respectively. Co-phylog only considers the context C-grams shared by both sets. Let $R$ be the intersection of the two sets. For the i-th context C-gram $c_i$, $i$ from 1 to $|R|$, define $I_i = 0$ if $object_{G^{(1)},S}(c_i) = object_{G^{(2)},S}(c_i)$ and 1 otherwise. Finally, the Co-phylog distance is defined as $\sum_{i=1}^{|R|} I_i \,/\, |R|$.
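The construction above can be sketched directly. The code below takes the structure as lists of C-segment and O-segment lengths, and (as an assumption of ours, not stated in the text) returns distance 0 when the two genomes share no context:

```python
def split_grams(w, a, b):
    """C-gram and O-gram of k-mer w under structure S = C_{a1..an} O_{b1..bn-1}:
    w is split into parts of lengths a1, b1, a2, ..., bn-1, an; the C-gram
    concatenates the odd-position parts, the O-gram the even-position parts."""
    lengths = []
    for i in range(len(a)):
        lengths.append(a[i])
        if i < len(b):
            lengths.append(b[i])
    parts, pos = [], 0
    for n in lengths:
        parts.append(w[pos:pos + n])
        pos += n
    return "".join(parts[0::2]), "".join(parts[1::2])

def contexts(genome, k, a, b):
    """Map each C-gram to its set of observed O-grams; a C-gram is a
    'context' iff that set has exactly one element."""
    obj = {}
    for i in range(len(genome) - k + 1):
        c, o = split_grams(genome[i:i + k], a, b)
        obj.setdefault(c, set()).add(o)
    return {c for c, os in obj.items() if len(os) == 1}, obj

def cophylog(g1, g2, k, a, b):
    """Co-phylog distance: among contexts shared by both genomes, the
    fraction whose single-element object sets differ. Assumption: the
    distance is defined as 0 when no context is shared."""
    ctx1, obj1 = contexts(g1, k, a, b)
    ctx2, obj2 = contexts(g2, k, a, b)
    shared = ctx1 & ctx2
    if not shared:
        return 0.0
    return sum(1 for c in shared if obj1[c] != obj2[c]) / len(shared)
```

With the structure $C_{3,3}O_1$ and w = "actgact", `split_grams` reproduces the paper's example ("actact", "g").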
4.2.3 Measures based on background adjusted k-mer counts

We first define the expected number of occurrences of k-mer $w$ in the sequence $G^{(i)}$ as $EN_w^{(i)}$, and denote $\tilde{N}_w^{(i)} = N_w^{(i)} - EN_w^{(i)}$.
4.2.3.1 CVTree [93]

The CVTree dissimilarity is defined as:

$$CVTree = \frac{1}{2}\left(1 - \frac{\sum_{w \in A^k} \hat{f}_w^{(1)} \hat{f}_w^{(2)}}{\sqrt{\sum_{w \in A^k} \bigl(\hat{f}_w^{(1)}\bigr)^2}\,\sqrt{\sum_{w \in A^k} \bigl(\hat{f}_w^{(2)}\bigr)^2}}\right) \qquad (4.10)$$

where $\hat{f}_w^{(i)} = \tilde{N}_w^{(i)} / EN_w^{(i)}$. CVTree assumes a $(k-2)$-th order Markov chain for the background sequence. After preliminary exploration of the relationship between the CVTree dissimilarity and the evolutionary distance calculated based on maximum likelihood approaches, we propose to use the transformation $T(x) = \bigl(\log(1 - 2x)\bigr)^2$ on CVTree so that the transformed dissimilarity is highly linearly related to the evolutionary distance calculated using the maximum likelihood approach.
4.2.3.2 $d_2^*$ [96, 128]

The $d_2^*$ dissimilarity is defined as:

$$d_2^* = \frac{1}{2}\left(1 - \frac{\sum_{w \in A^k} f_w^{*(1)} f_w^{*(2)}}{\sqrt{\sum_{w \in A^k} \bigl(f_w^{*(1)}\bigr)^2}\,\sqrt{\sum_{w \in A^k} \bigl(f_w^{*(2)}\bigr)^2}}\right) \qquad (4.11)$$

where $f_w^{*(i)} = \tilde{N}_w^{(i)} / \sqrt{EN_w^{(i)}}$. Similar to CVTree, we use the transformation $T(x) = \bigl(\log(1 - 2x)\bigr)^2$ on $d_2^*$.
4.2.3.3 $d_2^S$ [96, 128]

The $d_2^S$ dissimilarity is defined as:

$$d_2^S = \frac{1}{2}\left(1 - \frac{\sum_{w \in A^k} \tilde{f}_w^{(1)} \tilde{f}_w^{(2)}}{\sqrt{\sum_{w \in A^k} \bigl(\tilde{f}_w^{(1)}\bigr)^2}\,\sqrt{\sum_{w \in A^k} \bigl(\tilde{f}_w^{(2)}\bigr)^2}}\right) \qquad (4.12)$$

where $\tilde{f}_w^{(i)} = \tilde{N}_w^{(i)} \Big/ \Bigl(\bigl(\tilde{N}_w^{(1)}\bigr)^2 + \bigl(\tilde{N}_w^{(2)}\bigr)^2\Bigr)^{1/4}$. Similar to CVTree, we use the transformation $T(x) = \bigl(\log(1 - 2x)\bigr)^2$ on $d_2^S$.
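Given observed and expected k-mer counts, the three background-adjusted measures above differ only in how the centered counts are standardized before the cosine-style comparison. A dict-based sketch, assuming all expected counts are positive (names are ours, not CAFE's):

```python
import math

def adjusted_measures(n1, n2, e1, e2):
    """CVTree, d2* and d2S dissimilarities (Eqs. 4.10-4.12) from observed
    counts (n1, n2) and expected counts (e1, e2), all dicts over k-mers."""
    keys = sorted(set(n1) | set(n2))
    t1 = {w: n1.get(w, 0) - e1[w] for w in keys}  # centered counts N - EN
    t2 = {w: n2.get(w, 0) - e2[w] for w in keys}

    def half_one_minus_cos(x, y):
        dot = sum(x[w] * y[w] for w in keys)
        nx = math.sqrt(sum(v * v for v in x.values()))
        ny = math.sqrt(sum(v * v for v in y.values()))
        return 0.5 * (1 - dot / (nx * ny))

    cv1 = {w: t1[w] / e1[w] for w in keys}                 # CVTree scaling
    cv2 = {w: t2[w] / e2[w] for w in keys}
    s1 = {w: t1[w] / math.sqrt(e1[w]) for w in keys}       # d2* scaling
    s2 = {w: t2[w] / math.sqrt(e2[w]) for w in keys}
    den = {w: (t1[w] ** 2 + t2[w] ** 2) ** 0.25 for w in keys}  # d2S scaling
    u1 = {w: t1[w] / den[w] if den[w] > 0 else 0.0 for w in keys}
    u2 = {w: t2[w] / den[w] if den[w] > 0 else 0.0 for w in keys}
    return {"CVTree": half_one_minus_cos(cv1, cv2),
            "d2star": half_one_minus_cos(s1, s2),
            "d2S": half_one_minus_cos(u1, u2)}
```

When the two sequences have identical counts and expectations, the standardized vectors coincide and all three dissimilarities are 0; in general each lies in [0, 1].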
4.2.4 Measures based on presence/absence of k-mers

The presence/absence of k-mers is treated as binary data. Let $b_w^{(1)}$ and $b_w^{(2)}$ be the presence/absence indicators of the k-mer $w$ in the two sequences $G^{(1)}$ and $G^{(2)}$, respectively.
4.2.4.1 Anderberg

The Anderberg dissimilarity is defined as:

$$Anderberg = 1 - \bigl(A/(A+B) + A/(A+C) + D/(C+D) + D/(B+D)\bigr)/4 \qquad (4.13)$$

where $A$ is the number of k-mers present in both sequences, $B$ is the number of k-mers present in $G^{(1)}$ and absent in $G^{(2)}$, $C$ is the number of k-mers absent in $G^{(1)}$ and present in $G^{(2)}$, and $D$ is the number of k-mers absent in both sequences.

4.2.4.2 Antidice

The Antidice dissimilarity is defined as:

$$Antidice = 1 - A/\bigl(A + 2(B + C)\bigr) \qquad (4.14)$$

where $A$, $B$, and $C$ are defined as before.

4.2.4.3 Dice

The Dice dissimilarity is defined as:

$$Dice = 1 - 2A/(2A + B + C) \qquad (4.15)$$

where $A$, $B$, and $C$ are defined as before.

4.2.4.4 Gower

The Gower dissimilarity is defined as:

$$Gower = 1 - AD\big/\sqrt{(A+B)(A+C)(D+B)(D+C)} \qquad (4.16)$$

where $A$, $B$, $C$, and $D$ are defined as before.
4.2.4.5 Hamman

The Hamman dissimilarity is defined as:

$$Hamman = 1 - \bigl[\bigl((A+D) - (B+C)\bigr)/N\bigr]^2 \qquad (4.17)$$

where $A$, $B$, $C$, and $D$ are defined as before, and $N = A + B + C + D$ is the total number of k-mers.

4.2.4.6 Hamming

The Hamming dissimilarity is defined as:

$$Hamming = (B + C)/N \qquad (4.18)$$

4.2.4.7 Jaccard

The Jaccard dissimilarity is defined as:

$$Jaccard = 1 - A/(N - D) \qquad (4.19)$$

4.2.4.8 Kulczynski

The Kulczynski dissimilarity is defined as:

$$Kulczynski = 1 - \bigl(A/(A+B) + A/(A+C)\bigr)/2 \qquad (4.20)$$

4.2.4.9 Matching

The Matching dissimilarity is defined as:

$$Matching = 1 - (A + D)/N \qquad (4.21)$$

4.2.4.10 Ochiai

The Ochiai dissimilarity is defined as:

$$Ochiai = 1 - A\big/\sqrt{(A+B)(A+C)} \qquad (4.22)$$

4.2.4.11 Phi

The Phi dissimilarity is defined as:

$$Phi = 1 - \Bigl[(AD - BC)\big/\sqrt{(A+B)(A+C)(D+B)(D+C)}\Bigr]^2 \qquad (4.23)$$

4.2.4.12 Russel

The Russel dissimilarity is defined as:

$$Russel = 1 - A/N \qquad (4.24)$$
4.2.4.13 Sneath

The Sneath dissimilarity is defined as:

$$Sneath = 1 - 2(A + D)\big/\bigl(2(A + D) + (B + C)\bigr) \qquad (4.25)$$

4.2.4.14 Tanimoto

The Tanimoto dissimilarity is defined as:

$$Tanimoto = 1 - (A + D)\big/\bigl((A + D) + 2(B + C)\bigr) \qquad (4.26)$$

4.2.4.15 Yule

The Yule dissimilarity is defined as:

$$Yule = 1 - \bigl[(AD - BC)/(AD + BC)\bigr]^2 \qquad (4.27)$$

In Eqs. (4.18)-(4.27), $A$, $B$, $C$, $D$, and $N$ are defined as before.
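All 15 binary measures are functions of the four contingency counts A, B, C, D. A sketch computing the counts from two k-mer sets over a fixed k-mer universe, together with a few of the measures (the rest follow the same pattern; the function names are illustrative):

```python
def contingency(s1, s2, universe):
    """A, B, C, D over a fixed k-mer universe: A present in both,
    B only in s1, C only in s2, D absent from both (s1, s2 are sets)."""
    A = len(s1 & s2)
    B = len(s1 - s2)
    C = len(s2 - s1)
    D = len(universe) - A - B - C
    return A, B, C, D

def binary_measures(A, B, C, D):
    """A few of the presence/absence dissimilarities (Eqs. 4.15, 4.18,
    4.19, 4.21, 4.24)."""
    N = A + B + C + D
    return {
        "Dice": 1 - 2 * A / (2 * A + B + C),
        "Hamming": (B + C) / N,
        "Jaccard": 1 - A / (N - D),
        "Matching": 1 - (A + D) / N,
        "Russel": 1 - A / N,
    }
```

For instance, over the four single-base "k-mers" with s1 = {A, C} and s2 = {A, G}, the contingency counts are A = B = C = D = 1, giving Jaccard 2/3 and Hamming 1/2.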
4.3 Materials and Methods

4.3.1 Accelerating the calculation of $d_2^*$, $d_2^S$, and CVTree

The computation of $d_2^*$, $d_2^S$, and CVTree is dominated by the calculation of $EN_w^{(i)}$, $i = 1, 2$, when the sequence is relatively short while k is large. Accelerating the calculation of $EN_w^{(i)}$ becomes possible based upon the observation that some k-mers share common prefix strings. For example, the tetramers $w_1 = AAAA$ and $w_2 = AAAT$ share the longest common prefix AAA. Given a first order Markov model, in principle, we do not have to calculate $EN_{w_2}^{(i)}$ from the beginning as long as we have calculated $EN_{w_1}^{(i)}$. To be specific, $EN_{w_2}^{(i)} = EN_{w_1}^{(i)} \frac{P(T|A)}{P(A|A)}$, where the transition probabilities are $P(A|A) = \pi(w_1[3], w_1[4])$ and $P(T|A) = \pi(w_2[3], w_2[4])$, respectively.

Based on this observation, we organize all possible k-mers into a radix trie (Figure 4.1). To be specific, the radix trie represents a full quadtree of height k. Each leaf node represents the $EN_w^{(i)}$ of the corresponding k-mer $w$, whereas every internal node stores the marginal probability of the k-mers sharing the common prefix corresponding to the node. Edges are labeled with respect to the alphabet $A = \{A, C, G, T\}$, denoting the succeeding base of the current prefix. For example, in the radix trie illustrated in Figure 4.1(a), the leaf nodes with ids 65-68 represent the k-mers AAAA to AAAT, respectively. In the scenario of the independent identically distributed (i.i.d.) model, the internal nodes with ids 1, 5, and 17 store the marginal probabilities of k-mers with common prefixes A, AA, and AAA, with values $P(A)$, $P(A)^2$, and $P(A)^3$, respectively. In the scenario of the first order Markov model, the internal nodes with ids 1, 5, and 17 store the marginal probabilities of k-mers with common prefixes A, AA, and AAA, with values $P(A)$, $P(A)P(A|A)$, and $P(A)P(A|A)^2$, respectively. In the scenario of the second order Markov model, the internal nodes with ids 1, 5, and 17 store the marginal probabilities of k-mers with common prefixes A, AA, and AAA, with values 1, $P(AA)$, and $P(AA)P(A|AA)$, respectively.
Figure 4.1: The radix trie constructed for the calculation of the expected occurrences of tetramers, under (a) the i.i.d. model, (b) the first order Markov model, and (c) the second order Markov model.

Figure 4.2: The workflow of CAFE. The JELLYFISH software parses the input sequence files (in Fasta format), counts k-mers, and saves compressed information into separate databases. CAFE subsequently loads the databases and constructs a symmetric dissimilarity matrix among the inputs. CAFE also integrates four types of visualized downstream analysis, including dendrograms, heatmap, principal coordinate analysis (PCoA), and network display.
Then the calculation of $EN_w^{(i)}$ is reduced to a depth-first search on the radix trie, whose cost is proportional to the total number of internal and leaf nodes. Therefore, the overhead is reduced to complexity $O(4^k)$.
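The prefix-sharing idea above can be sketched as a depth-first traversal that extends one base at a time, so each prefix probability is computed once and reused by its four children. Shown here for a first order Markov model; the function and parameter names are ours:

```python
def expected_counts_dfs(L, k, pi, t):
    """Expected counts EN_w for every k-mer via depth-first extension of
    shared prefixes (the radix-trie idea of Section 4.3.1): the marginal
    probability of each prefix is computed once and reused by all four
    children, so total work is proportional to the number of trie nodes
    rather than k * 4**k. First order Markov model assumed: pi[b] is the
    stationary base probability, t[(a, b)] is P(b | a)."""
    out = {}

    def dfs(prefix, prob):
        if len(prefix) == k:
            out[prefix] = (L - k + 1) * prob  # leaf: Eq. (4.1)
            return
        for b in "ACGT":
            # the root edge uses the stationary probability; every deeper
            # edge multiplies in a single transition probability
            p = pi[b] if not prefix else prob * t[(prefix[-1], b)]
            dfs(prefix + b, p)

    dfs("", 1.0)
    return out
```

Under a uniform model the expected counts of all 4^k k-mers sum to L - k + 1, which gives a quick sanity check.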
4.3.2 Workflows

CAFE works with sequence data, both assembled genomic sequences and unassembled shotgun sequence reads from NGS technologies, and counts k-mers with JELLYFISH [69], a fast and memory-efficient k-mer counting tool. JELLYFISH produces compressed databases containing all k-mer counts for the query sequences. CAFE subsequently loads the databases and generates the necessary transformed information with respect to the various dissimilarity measures. For example, measures based on the presence/absence of k-mers binarize k-mer counts into presence/absence indicators. Most conventional measures normalize k-mer counts into k-mer frequencies. In addition, expected k-mer counts are involved in the recently developed measures based on background adjusted k-mer counts, such as CVTree, $d_2^*$, and $d_2^S$. In such cases, Markov models are assumed as the underlying generative models of the sequences, with the parameters estimated from the sequence data accordingly. The Markov order can be either manually set or automatically chosen using the Bayesian information criterion (BIC) [77].

The resulting pairwise dissimilarities among the sequences form a symmetric matrix. CAFE can directly output the dissimilarity matrix in the standard PHYLIP format. Alternatively, CAFE provides four types of built-in downstream visualized analyses: clustering the sequences into dendrograms using the UPGMA algorithm, heatmap visualization of the matrix, projection of the matrix into a two-dimensional space using principal coordinate analysis (PCoA), and network display. A graphical illustration of the CAFE workflow is shown in Figure 4.2.
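The matrix-building and matrix-output steps of the workflow can be sketched as follows. The square PHYLIP-style layout shown is a common convention; the exact field widths CAFE emits may differ, and the helper names are ours:

```python
def pairwise(seqs, measure):
    """Symmetric dissimilarity matrix for a dict name -> feature vector,
    given any pairwise measure function."""
    names = list(seqs)
    n = len(names)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = measure(seqs[names[i]], seqs[names[j]])
    return names, d

def phylip_matrix(names, d):
    """Render a square PHYLIP-style distance matrix: the first line is
    the number of taxa, then one row per taxon (name, then distances)."""
    lines = [str(len(names))]
    for i, name in enumerate(names):
        row = " ".join(f"{x:.6f}" for x in d[i])
        lines.append(f"{name:<10}{row}")
    return "\n".join(lines) + "\n"
```

For example, with Manhattan distance on k-mer frequency dicts, `pairwise` fills the off-diagonal entries symmetrically and `phylip_matrix` produces text that downstream tree-building tools can read.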
4.3.3 Graphical User Interface

The CAFE user interface consists of four major tools: the data selection toolbar, the dissimilarity setting toolbar, the image toolbar, and the visualized analyses. The data selection toolbar enables users to browse and add/delete genome sequences or NGS shotgun reads with the file extension ".fasta", ".fa", or ".fna". The selected files are shown in the input data list. The data selection toolbar also supports loading pre-computed results in the standard PHYLIP format.

The dissimilarity setting toolbar determines the choice of dissimilarity measures as well as the associated parameter configuration, including the k-mer length, the order of the potential Markov model, the cutoff on the minimum k-mer occurrences, and whether to consider the reverse complement of each k-mer, a common practice in dealing with NGS shotgun reads. When a certain parameter is unnecessary for a particular dissimilarity measure, the corresponding configuration is disabled. In the cases of CVTree, $d_2^*$, and $d_2^S$, the proper order of the Markov model usually remains unclear to the user. A simple yet time-consuming option is to set "-1", which infers the order automatically using the BIC [77].

Figure 4.3: Screenshot of the CAFE user interface based on a toy example. The user interface layout divides into six parts in terms of functionality: (1) data selection toolbar (top left), (2) dissimilarity setting toolbar (top middle), (3) image toolbar (top right), (4) input data list (middle left), (5) run-time information console (bottom left), and (6) visualized analyses (bottom right).

After the "Run" button is pressed, the CAFE workflow starts, and the consolidated dissimilarity results are saved in the standard PHYLIP format, together with run-time information trackable from the console. Meanwhile, four types of built-in analyses are provided in tabbed windows, including dendrograms, heatmap, PCoA, and network display.

The view of the visualized analyses can be adjusted using the "zoom-in" and "zoom-out" buttons located in the image toolbar. CAFE also supports downloading the visualized results for publication. To access this function, users can either use the "save" button in the image toolbar or right-click on the figure directly. A screenshot of the CAFE user interface is shown in Figure 4.3.
4.3.4 Design

CAFE is designed for extensibility and reusability. For example, users can specify a threshold to filter out k-mers whose counts are below the threshold. In this case, the Iterator hides the details of filtering, uniformly wrapping the enumeration of qualified k-mer counts or frequencies. Also, some dissimilarity measures do not need the expected k-mer counts, so the Proxy provides the calculation of expected k-mer counts as a service on demand. Moreover, the dissimilarity measures are encapsulated in Strategy objects, enabling users to easily integrate customized dissimilarity measures into CAFE as plug-ins.
4.4 Results

4.4.1 Application to Primate and Vertebrate Genomic Sequences

We compared various alignment-free dissimilarity measures using CAFE on three real genomic datasets. We first investigated the evolutionary relationships of 21 primates whose complete genome sequences are available in the NCBI database [86]. For each dissimilarity measure, the calculated pairwise dissimilarities are directly compared against the corresponding evolutionary distances calculated by Ape (an R package) [82] as the benchmark, in terms of the Spearman and Pearson correlations between the estimated alignment-free dissimilarities and the evolutionary distances, and the normalized Robinson-Foulds distance [101] between the clustering tree obtained using UPGMA and the standard phylogenetic tree.

Similarly, we investigated the evolutionary relationships of 28 vertebrate species and compared the alignment-free dissimilarity measures with the pairwise evolutionary distances given in [74]. Finally, we combined the two datasets to see how the alignment-free dissimilarity measures relate to evolutionary distances calculated based on the maximum likelihood approach from a large number of genomic regions.
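The Spearman correlation used as one of the benchmarks can be computed from average ranks without external dependencies. A sketch (in practice SciPy's `scipy.stats.spearmanr` would normally be used instead):

```python
def spearman(x, y):
    """Spearman rank correlation between two equally long lists,
    as used in Section 4.4.1 to compare alignment-free dissimilarities
    against benchmark evolutionary distances. Ties get average ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):
            j = i
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average rank over the tied block
            for idx in order[i:j + 1]:
                r[idx] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

Because it is rank-based, any strictly monotone relationship between a dissimilarity and the evolutionary distance yields a correlation of 1, regardless of the scale of either quantity.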
Figure 4.4: The Spearman correlation of various dissimilarity measures with the evolutionary distances using the maximum likelihood approach across many genomic regions, based on 21 primate species (top), 28 vertebrate species (middle), and the combination of both (bottom).
Figure 4.5: The Pearson correlation of various dissimilarity measures with the evolutionary distances using the maximum likelihood approach across many genomic regions, based on 21 primate species (top), 28 vertebrate species (middle), and the combination of both (bottom).
Figure 4.6: The normalized Robinson-Foulds distance between the clustering tree using various dissimilarity measures and the phylogenetic tree derived based on the maximum likelihood approach across many genomic regions, for the 21 primate species (top) and 28 vertebrate species (bottom).
Figure 4.7: Wall time, peak memory usage, and speedup ratio comparison between CAFE and the original implementation in calculating the $d_2^*$ dissimilarity between a pair of genomes for k = 14.

The comparison involves 3 dissimilarity measures based on background adjusted k-mer counts, including CVTree, $d_2^*$, and $d_2^S$; 10 conventional measures based on k-mer counts, including Canberra, Ch, Cosine, Co-phylog, $d_2$, Eu, FFP, JS, Ma, and Pearson; and 15 measures based on the presence/absence of k-mers, including Anderberg, Antidice, Dice, Gower, Hamman, Hamming, Jaccard, Kulczynski, Matching, Ochiai, Phi, Russel, Sneath, Tanimoto, and Yule. We used k = 14 as in [97]. The results are illustrated in Figure 4.4, Figure 4.5, and Figure 4.6 for the Spearman correlations, Pearson correlations, and normalized Robinson-Foulds distances, respectively. Markov order 12 is used for $d_2^*$, $d_2^S$, and JS, as most of the sequences have estimated order 12 based on the BIC [77]. Consistent with previous studies, the background adjusted dissimilarity measures markedly outperform the non-background adjusted measures.
Figure 4.8: The clustering results of 27 E. coli and Shigella genomes using measures based on background adjusted 14-mer counts: $d_2^S$, $d_2^*$, and CVTree. The Markov order of the sequences was set at 1. The colors indicate the 6 different E. coli reference groups.
Figure 4.9: The normalized Robinson-Foulds distance between the clustering tree using various dissimilarity measures and the evolutionary tree derived based on the maximum likelihood approach across many genomic regions, for the 27 E. coli and Shigella genomes.
We then evaluated the computational speed of CAFE compared with the original implementation of $d_2^*$ in [97]. We calculated the dissimilarity using the $d_2^*$ measure ($d_2^*$ and $d_2^S$ share a highly similar formulation) on random pairs of genome sequences. As shown in Figure 4.7, CAFE achieves a 24.0X speedup with 55.3% peak memory consumption on average.
4.4.2 Application to Microbial Genomic Sequences

We applied CAFE to analyze the dataset of 27 E. coli and Shigella genomes as in [10]. These genomes can be assigned to 6 E. coli reference (ECOR) groups: A, B1, B2, D, E, and S. We investigated how well various alignment-free dissimilarity measures can identify these groups. For each dissimilarity measure, we used UPGMA to cluster the samples based on the calculated pairwise dissimilarity matrix. Markov order 1 is used for $d_2^*$ and $d_2^S$, as most of the sequences have estimated order 1 based on the BIC [77].

Figure 4.10: The clustering results of the mammalian gut samples using measures based on background adjusted k-mer counts: $d_2^S$, $d_2^*$, and CVTree.

We used k = 14 for the comparison. The comparison involves 3 dissimilarity measures based on background adjusted k-mer counts, including CVTree, $d_2^*$, and $d_2^S$, and the results are illustrated in Figure 4.8. Consistent with previous studies, for $d_2^S$, each ECOR group is monophyletic except A and B2. The normalized Robinson-Foulds distance [101] between the estimated clustering tree and the standard phylogenetic tree is illustrated in Figure 4.9.
4.4.3 Application to Metagenomic Samples

We used CAFE to analyze a mammalian gut metagenomic dataset [44] comprised of NGS short reads from 28 samples. These samples further split into 3 groups: 8 hindgut-fermenting herbivores, 13 foregut-fermenting herbivores, and 7 simple-gut carnivores. We investigated how well various alignment-free dissimilarity measures can identify these groups. For each dissimilarity measure, we used UPGMA to cluster the samples based on the calculated pairwise dissimilarity matrix.

We used k = 5 as in [44]. The comparison involves 3 dissimilarity measures based on background adjusted k-mer counts, including CVTree, $d_2^*$, and $d_2^S$, and the results are illustrated in Figure 4.10. Markov order 0 is used for $d_2^*$ and $d_2^S$ as in [44]. Consistent with previous studies, $d_2^S$ achieves clear separations among the 3 groups.
4.5 Discussion

We have developed CAFE, a fast and user-friendly alignment-free analysis platform for studying the relationships among genomes and metagenomes. With reduced memory usage, CAFE speeds up the calculation of state-of-the-art alignment-free measures that perform well both theoretically and practically. For ease of use, CAFE not only integrates 28 dissimilarity measures but also provides four types of downstream visualized analyses. CAFE will make alignment-free methods more accessible to researchers. We encourage users to contribute their own dissimilarity measures to CAFE as plug-ins.
Chapter 5
Conclusion and ongoing work
This dissertation presents multiple computational methods and software packages developed
specifically for analyzing metagenomic data accurately, promptly, and interactively. These methods
target important problems in metagenomic analysis pipelines, such as binning metagenomic
contigs into OTUs (COCACOLA, Hetero-RP) and inferring phylogenetic relationships among
OTUs (CAFE, CRAFT). For potential improvements of the proposed methods as well as future
directions of research, I have identified the following problems of interest. Some of them may
have immediate solutions, whereas others may need more careful scrutiny.
Kernel methods for Heterogeneity Rescaling Pursuit (Hetero-RP). Hetero-RP implicitly measures
the features against auxiliary knowledge in Euclidean distance, which may not hold well for
all sources of data. A kernelized Hetero-RP takes advantage of both worlds: one can enjoy
the expressiveness of a nonlinear feature space together with the efficiency of linear weighting.
Specifically, Kernel Principal Component Analysis [106] can be used to project the original data
into a new data matrix with the nonlinear features induced by the kernel, and the resulting data
matrix then serves as the input to Hetero-RP. This procedure is referred to as the "KPCA trick"
and is theoretically sound [18].
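A minimal sketch of this KPCA preprocessing step, assuming an RBF kernel and plain NumPy; the function name `kpca_features` and the kernel choice are illustrative, not part of Hetero-RP itself:

```python
import numpy as np

def kpca_features(X, n_components=2, gamma=1.0):
    """Project rows of X onto the top principal components of an
    RBF-kernel feature space; the output can then feed a linear
    method such as Hetero-RP (the "KPCA trick")."""
    n = X.shape[0]
    # RBF (Gaussian) kernel matrix from pairwise squared distances
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)
    # double-center the kernel matrix (centering in feature space)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J
    # eigendecomposition; np.linalg.eigh returns ascending eigenvalues
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[idx], vecs[:, idx]
    # scale eigenvectors by sqrt(eigenvalue) to obtain the projections
    return vecs * np.sqrt(np.maximum(vals, 0.0))
```

Because the kernel matrix is centered, the resulting nonlinear features have zero mean, as a linear downstream weighting scheme typically expects.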
Deep learning based Heterogeneous Data Integration. Direct integration of data is problematic
because the common knowledge shared by data from multiple sources may lie in a higher
abstraction space, beyond the complementarity among different types of observed data.
Deep learning [56] offers a way to map high-dimensional data from multiple sources to a
low-dimensional but abstract space by training a multiple-layer neural network [38]. A deep
learning based integrative method can provide an attractive solution for integrating data from multiple
sources. More importantly, complex features and non-linear representations encoding much more
abstract knowledge can be learned automatically and effectively.
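As a toy illustration of the underlying idea, a single-hidden-layer linear autoencoder trained by gradient descent already maps concatenated multi-source features to a shared low-dimensional code; deep nonlinear stacks follow the same recipe. All names and settings here are invented for the sketch:

```python
import numpy as np

def train_autoencoder(X, dim, steps=200, lr=0.01, seed=0):
    """Train a one-hidden-layer linear autoencoder by gradient descent.
    X holds concatenated features from multiple sources (rows = samples);
    the returned Z is the `dim`-dimensional shared representation."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W_enc = rng.normal(scale=0.1, size=(p, dim))   # encoder weights
    W_dec = rng.normal(scale=0.1, size=(dim, p))   # decoder weights
    for _ in range(steps):
        Z = X @ W_enc             # low-dimensional codes
        R = Z @ W_dec - X         # reconstruction residual
        # gradients of the squared reconstruction error
        g_dec = Z.T @ R / n
        g_enc = X.T @ (R @ W_dec.T) / n
        W_dec -= lr * g_dec
        W_enc -= lr * g_enc
    Z = X @ W_enc
    loss = float(np.mean((Z @ W_dec - X) ** 2))
    return Z, loss
```

At convergence a linear autoencoder spans the same subspace as PCA; replacing the linear maps with nonlinear layers yields the abstract representations discussed above.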
Constructing a varying-length k-mer dictionary using deep learning methods. Alignment-free
sequence comparison methods rely on the construction of k-mer vectors. Nevertheless, k-mer
vectors incur excessive space consumption. Moreover, k-mers are not interpretable and cannot
pinpoint the driving sequence fragments that encode the taxonomic differences. Convolutional
neural networks (CNNs) [57] offer a way to learn features from the original genome sequences in
a data-driven way, eliminating manual feature extraction steps. In addition, the importance of learned
convolutional features can be quantified by recently developed methods such as DeepLIFT [110]
or layer-wise relevance propagation (LRP) [7].
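For concreteness, a fixed-k k-mer vector and a plain d2-style dissimilarity (one minus cosine similarity, scaled to [0, 1]) can be sketched as follows. This is a simplified illustration with invented function names, not CAFE's background-adjusted d_2^* or d_2^S:

```python
from itertools import product
from math import sqrt

def kmer_vector(seq, k):
    """Count all 4^k k-mers of a DNA sequence (fixed k; the
    varying-length dictionary discussed above is future work)."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:          # skip k-mers with ambiguous bases
            counts[kmer] += 1
    return [counts[m] for m in sorted(counts)]

def d2_dissimilarity(x, y):
    """Plain d2 dissimilarity: 0.5 * (1 - cosine similarity) of two
    k-mer count vectors, without any Markov background adjustment."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = sqrt(sum(a * a for a in x))
    ny = sqrt(sum(b * b for b in y))
    return 0.5 * (1.0 - dot / (nx * ny))
```

Even for this simple measure, the vector length grows as 4^k, which illustrates the space-consumption problem that motivates learned, data-driven features.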
The problems mentioned above capture only a tiny corner of the exciting marriage between
the big data paradigm and metagenomic analyses. With the advent of high-throughput technologies,
researchers are exposed to "Big Data" in biology and medicine. Current statistical and
computational methods for metagenomic analyses are far from satisfactory. We expect this dissertation
will motivate the development of a rich set of computational and statistical tools in this field.
Reference List
[1] Nathan A Ahlgren, Jie Ren, Yang Young Lu, Jed A Fuhrman, and Fengzhu Sun. Alignment-free d2* oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Research, 45(1):39–53, 2017.
[2] Mads Albertsen, Philip Hugenholtz, Adam Skarshewski, Kåre L Nielsen, Gene W Tyson, and Per H Nielsen. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nature Biotechnology, 31(6):533–538, 2013.
[3] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8):831–838, 2015.
[4] Johannes Alneberg, Brynjar Smári Bjarnason, Ino de Bruijn, Melanie Schirmer, Joshua Quick, Umer Z Ijaz, Leo Lahti, Nicholas J Loman, Anders F Andersson, and Christopher Quince. Binning metagenomic contigs by coverage and composition. Nature Methods, 11(11):1144–1146, 2014.
[5] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.
[6] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29, 2000.
[7] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One, 10(7):e0130140, 2015.
[8] Yael Baran and Eran Halperin. Joint analysis of multiple metagenomic samples. PLoS Computational Biology, 8(2):e1002373, 2012.
[9] Sugato Basu, Ian Davidson, and Kiri Wagstaff. Constrained Clustering: Advances in Algorithms, Theory, and Applications. CRC Press, 2008.
[10] Guillaume Bernard, Cheong Xin Chan, and Mark A Ragan. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer. Scientific Reports, 6:28970, 2016.
[11] B Edwin Blaisdell. A measure of the similarity of sets of sequences not requiring sequence alignment. Proceedings of the National Academy of Sciences, 83(14):5155–5159, 1986.
[12] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
[13] Arthur Brady and Steven L Salzberg. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nature Methods, 6(9):673–676, 2009.
[14] Deng Cai, Xiaofei He, Jiawei Han, and Thomas S Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1548–1560, 2011.
[15] Deng Cai, Chiyuan Zhang, and Xiaofei He. Unsupervised feature selection for multi-cluster data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 333–342, 2010.
[16] Rogan Carr, Shai S Shen-Orr, and Elhanan Borenstein. Reconstructing the genomic content of microbiome taxa through shotgun metagenomic deconvolution. PLoS Computational Biology, 9(10):e1003292, 2013.
[17] Chon-Kit Kenneth Chan, Arthur L Hsu, Saman K Halgamuge, and Sen-Lin Tang. Binning sequences using very sparse labels within a metagenome. BMC Bioinformatics, 9(1):215, 2008.
[18] Ratthachat Chatpatanasiri, Teesid Korsrilabutr, Pasakorn Tangchanachaianan, and Boonserm Kijsirikul. A new kernelization framework for Mahalanobis distance learning algorithms. Neurocomputing, 73(10):1570–1579, 2010.
[19] Sourav Chatterji, Ichitaro Yamazaki, Zhaojun Bai, and Jonathan A Eisen. CompostBin: A DNA composition-based algorithm for binning environmental shotgun reads. In Research in Computational Molecular Biology, pages 17–28, 2008.
[20] Bastien Chevreux, Thomas Wetter, Sándor Suhai, et al. Genome sequence assembly using trace signals and additional sequence information. In German Conference on Bioinformatics, volume 99, pages 45–56, 1999.
[21] Fan RK Chung. Spectral Graph Theory, volume 92. American Mathematical Society, 1997.
[22] ENCODE Project Consortium et al. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414):57–74, 2012.
[23] Human Microbiome Project Consortium et al. Structure, function and diversity of the healthy human microbiome. Nature, 486(7402):207–214, 2012.
[24] Adrian Corduneanu and Christopher M Bishop. Variational Bayesian model selection for mixture distributions. In Artificial Intelligence and Statistics, volume 2001, pages 27–34, 2001.
[25] Aaron E Darling, Guillaume Jospin, Eric Lowe, Frederick A Matsen IV, Holly M Bik, and Jonathan A Eisen. PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ, 2:e243, 2014.
[26] David L Davies and Donald W Bouldin. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2):224–227, 1979.
[27] Robert B Denman. Using RNAFOLD to predict the activity of small catalytic RNAs. Biotechniques, 15(6):1090–1095, 1993.
[28] Ingo Ebersberger, Sascha Strauss, and Arndt von Haeseler. HaMStR: profile hidden Markov model based search for orthologs in ESTs. BMC Evolutionary Biology, 9(1):157, 2009.
[29] S Dusko Ehrlich, MetaHIT Consortium, et al. MetaHIT: The European Union project on metagenomics of the human intestinal tract. In Metagenomics of the Human Body, pages 307–316. Springer, 2011.
[30] Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. The Annals of Statistics, 36(6):2605–2637, 2008.
[31] Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911, 2008.
[32] Jianqing Fan and Jinchi Lv. A selective overview of variable selection in high dimensional feature space (invited review article). Statistica Sinica, 20(1):101–148, 2010.
[33] Yingying Fan and Jinchi Lv. Innovated scalable efficient estimation in ultra-large Gaussian graphical models. The Annals of Statistics, 44:2098–2126, 2016.
[34] Vladimir Gligorijević and Nataša Pržulj. Methods for biological data integration: perspectives and challenges. Journal of The Royal Society Interface, 12(112):20150571, 2015.
[35] Mark A Hall and Lloyd A Smith. Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. In FLAIRS Conference, volume 1999, pages 235–239, 1999.
[36] John A Hartigan and PM Hartigan. The dip test of unimodality. The Annals of Statistics, pages 70–84, 1985.
[37] David Haussler, Stephen J O'Brien, Oliver A Ryder, F Keith Barker, Michele Clamp, Andrew J Crawford, Robert Hanner, Olivier Hanotte, Warren E Johnson, Jimmy A McGuire, et al. Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species. 2009.
[38] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[39] Joel N Hirschhorn and Mark J Daly. Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics, 6(2):95–108, 2005.
[40] Elaine Y Hsiao, Sara W McBride, Sophia Hsien, Gil Sharon, Embriette R Hyde, Tyler McCue, Julian A Codelli, Janet Chow, Sarah E Reisman, Joseph F Petrosino, et al. Microbiota modulate behavioral and physiological abnormalities associated with neurodevelopmental disorders. Cell, 155(7):1451–1463, 2013.
[41] Daniel H Huson, Alexander F Auch, Ji Qi, and Stephan C Schuster. MEGAN analysis of metagenomic data. Genome Research, 17(3):377–386, 2007.
[42] Umer Ijaz and Christopher Quince. TAXAassign v0.4, June 2009.
[43] Michael Imelfort, Donovan Parks, Ben J Woodcroft, Paul Dennis, Philip Hugenholtz, and Gene W Tyson. GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ, 2:e603, 2014.
[44] Bai Jiang, Kai Song, Jie Ren, Minghua Deng, Fengzhu Sun, and Xuegong Zhang. Comparison of metagenomic samples using sequence signatures. BMC Genomics, 13(1):730, 2012.
[45] Peng Jiang, Jiming Peng, Michael Heath, and Rui Yang. A clustering approach to constrained binary matrix factorization. In Data Mining and Knowledge Discovery for Big Data, pages 281–303. 2014.
[46] Se-Ran Jun, Gregory E Sims, Guohong A Wu, and Sung-Hou Kim. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution. Proceedings of the National Academy of Sciences, 107(1):133–138, 2010.
[47] Dongwan D Kang, Jeff Froula, Rob Egan, and Zhong Wang. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ, 3:e1165, 2015.
[48] Hilal Kazan, Debashish Ray, Esther T Chan, Timothy R Hughes, and Quaid Morris. RNAcontext: a new method for learning the sequence and structure binding preferences of RNA-binding proteins. PLoS Computational Biology, 6(7):e1000832, 2010.
[49] David R Kelley and Steven L Salzberg. Clustering metagenomic sequences with interpolated Markov models. BMC Bioinformatics, 11(1):544, 2010.
[50] W James Kent. BLAT: the BLAST-like alignment tool. Genome Research, 12(4):656–664, 2002.
[51] Jingu Kim and Haesun Park. Sparse nonnegative matrix factorization for clustering. Technical Report GT-CSE-08-01, 2008.
[52] Robert A Koeth, Zeneng Wang, Bruce S Levison, Jennifer A Buffa, Elin Org, Brendan T Sheehy, Earl B Britt, Xiaoming Fu, Yuping Wu, Lin Li, et al. Intestinal microbiota metabolism of L-carnitine, a nutrient in red meat, promotes atherosclerosis. Nature Medicine, 19(5):576–585, 2013.
[53] Natsumaro Kutsuna, Takumi Higaki, Sachihiro Matsunaga, Tomoshi Otsuki, Masayuki Yamaguchi, Hirofumi Fujii, and Seiichiro Hasezawa. Active learning framework with iterative clustering for bioimage classification. Nature Communications, 3:1032, 2012.
[54] Thomas K Landauer, Peter W Foltz, and Darrell Laham. An introduction to latent semantic analysis. Discourse Processes, 25(2-3):259–284, 1998.
[55] Amy N Langville, Carl D Meyer, Russell Albright, James Cox, and David Duling. Initializations for the nonnegative matrix factorization. In Proceedings of the Twelfth ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 23–26, 2006.
[56] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[57] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[58] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
[59] Garmay Leung and Michael B Eisen. Identifying cis-regulatory sequences by word profile similarity. PloS One, 4(9):e6901, 2009.
[60] Henry CM Leung, Siu-Ming Yiu, Bin Yang, Yu Peng, Yi Wang, Zhihua Liu, Jingchi Chen, Junjie Qin, Ruiqiang Li, and Francis YL Chin. A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics, 27(11):1489–1495, 2011.
[61] Hongzhe Li. Microbiome, metagenomics, and high-dimensional compositional data analysis. Annual Review of Statistics and Its Application, 2:73–94, 2015.
[62] Ruiqi Liao, Ruichang Zhang, Jihong Guan, and Shuigeng Zhou. A new unsupervised binning approach for metagenomic sequences based on n-grams and automatic feature weighting. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(1):42–54, 2014.
[63] Maxwell W Libbrecht and William Stafford Noble. Machine learning in genetics and genomics. Nature Reviews Genetics, 16(6):321, 2015.
[64] Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, Junjie Wu, and Sen Wu. Understanding and enhancement of internal clustering validation measures. IEEE Transactions on Cybernetics, 43(3):982–994, 2013.
[65] Yang Young Lu, Ting Chen, Jed A Fuhrman, and Fengzhu Sun. COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment, and paired-end read LinkAge. Bioinformatics, 33(6):791–798, 2017.
[66] Yang Young Lu, Kujin Tang, Jie Ren, Jed A Fuhrman, Michael S Waterman, and Fengzhu Sun. CAFE: aCcelerated Alignment-FrEe sequence analysis. Nucleic Acids Research, page gkx351, 2017.
[67] Ruibang Luo, Binghang Liu, Yinlong Xie, Zhenyu Li, Weihua Huang, Jianying Yuan, Guangzhu He, Yanxiang Chen, Qi Pan, Yunjie Liu, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 1(1):18, 2012.
[68] Sharmila S Mande, Monzoorul Haque Mohammed, and Tarini Shankar Ghosh. Classification of metagenomic sequences: methods and challenges. Briefings in Bioinformatics, 13(6):669–681, 2012.
[69] Guillaume Marçais and Carl Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764–770, 2011.
[70] Daniel Maticzka, Sita J Lange, Fabrizio Costa, and Rolf Backofen. GraphProt: modeling binding preferences of RNA-binding proteins. Genome Biology, 15(1):R17, 2014.
[71] Alice Carolyn McHardy, Héctor García Martín, Aristotelis Tsirigos, Philip Hugenholtz, and Isidore Rigoutsos. Accurate phylogenetic classification of variable-length DNA fragments. Nature Methods, 4(1):63–72, 2007.
[72] Roger McLendon, Allan Friedman, Darrell Bigner, Erwin G Van Meir, Daniel J Brat, Gena M Mastrogianakis, Jeffrey J Olson, Tom Mikkelsen, Norman Lehman, Ken Aldape, et al. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455(7216):1061–1068, 2008.
[73] Folker Meyer, Daniel Paarmann, Mark D'Souza, Robert Olson, Elizabeth M Glass, Michael Kubal, Tobias Paczian, A Rodriguez, Rick Stevens, Andreas Wilke, et al. The metagenomics RAST server: a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 9(1):386, 2008.
[74] Webb Miller, Kate Rosenbloom, Ross C Hardison, Minmei Hou, James Taylor, Brian Raney, Richard Burhans, David C King, Robert Baertsch, Daniel Blankenberg, et al. 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Research, 17(12):1797–1808, 2007.
[75] Monzoorul Haque Mohammed, Tarini Shankar Ghosh, Nitin Kumar Singh, and Sharmila S Mande. SPHINX: an algorithm for taxonomic binning of metagenomic sequences. Bioinformatics, 27(1):22–30, 2011.
[76] Toshiaki Namiki, Tsuyoshi Hachiya, Hideaki Tanaka, and Yasubumi Sakakibara. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20):e155, 2012.
[77] Leelavati Narlikar, Nidhi Mehta, Sanjeev Galande, and Mihir Arjunwadkar. One size does not fit all: on how Markov model order dictates performance of genomic sequence analyses. Nucleic Acids Research, 41(3):1416–1424, 2013.
[78] Saul B Needleman and Christian D Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[79] H Bjørn Nielsen, Mathieu Almeida, Agnieszka Sierakowska Juncker, Simon Rasmussen, Junhua Li, Shinichi Sunagawa, Damian R Plichta, Laurent Gautier, Anders G Pedersen, Emmanuelle Le Chatelier, et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nature Biotechnology, 32(8):822–828, 2014.
[80] Sergey Nurk, Dmitry Meleshko, Anton Korobeynikov, and Pavel A Pevzner. metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5):824–834, 2017.
[81] Brian D Ondov, Todd J Treangen, Páll Melsted, Adam B Mallonee, Nicholas H Bergman, Sergey Koren, and Adam M Phillippy. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1):132, 2016.
[82] Emmanuel Paradis, Julien Claude, and Korbinian Strimmer. APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20(2):289–290, 2004.
[83] Donovan H Parks, Michael Imelfort, Connor T Skennerton, Philip Hugenholtz, and Gene W Tyson. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7):1043–1055, 2015.
[84] Yu Peng, Henry CM Leung, Siu-Ming Yiu, and Francis YL Chin. Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics, 27(13):i94–i101, 2011.
[85] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global Vectors for Word Representation. In EMNLP, volume 14, pages 1532–1543, 2014.
[86] Polina Perelman, Warren E Johnson, Christian Roos, Hector N Seuánez, Julie E Horvath, Miguel AM Moreira, Bailey Kessing, Joan Pontius, Melody Roelke, Yves Rumpler, et al. A molecular phylogeny of living primates. PLoS Genetics, 7(3):e1001342, 2011.
[87] Jane Peterson, Susan Garges, Maria Giovanni, Pamela McInnes, Lu Wang, Jeffery A Schloss, Vivien Bonazzi, Jean E McEwen, Kris A Wetterstrand, Carolyn Deal, et al. The NIH Human Microbiome Project. Genome Research, 19(12):2317–2323, 2009.
[88] Mihai Pop, Adam Phillippy, Arthur L Delcher, and Steven L Salzberg. Comparative genome assembly. Briefings in Bioinformatics, 5(3):237–248, 2004.
[89] Rachel Poretsky, Luis M Rodriguez-R, Chengwei Luo, Despina Tsementzi, and Konstantinos T Konstantinidis. Strengths and limitations of 16S rRNA gene amplicon sequencing in revealing temporal microbial community dynamics. PLoS One, 9(4):e93827, 2014.
[90] Kim D Pruitt, Tatiana Tatusova, Garth R Brown, and Donna R Maglott. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Research, 40(D1):D130–D135, 2012.
[91] Kim D Pruitt, Tatiana Tatusova, and Donna R Maglott. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 33(suppl 1):D501–D504, 2005.
[92] Elizabeth Purdom. Analysis of a data matrix and a graph: metagenomic data and the phylogenetic tree. The Annals of Applied Statistics, pages 2326–2358, 2011.
[93] Ji Qi, Hong Luo, and Bailin Hao. CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Research, 32(suppl 2):W45–W47, 2004.
[94] Junjie Qin, Ruiqiang Li, Jeroen Raes, Manimozhiyan Arumugam, Kristoffer Sølvsten Burgdorf, Chaysavanh Manichanh, Trine Nielsen, Nicolas Pons, Florence Levenez, Takuji Yamada, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464(7285):59–65, 2010.
[95] Junjie Qin, Yingrui Li, Zhiming Cai, Shenghui Li, Jianfeng Zhu, Fan Zhang, Suisha Liang, Wenwei Zhang, Yuanlin Guan, Dongqian Shen, et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature, 490(7418):55–60, 2012.
[96] Gesine Reinert, David Chew, Fengzhu Sun, and Michael S Waterman. Alignment-free sequence comparison (I): statistics and power. Journal of Computational Biology, 16(12):1615–1634, 2009.
[97] Jie Ren, Kai Song, Minghua Deng, Gesine Reinert, Charles H Cannon, and Fengzhu Sun. Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics. Bioinformatics, 32(7):993–1000, 2016.
[98] Zhao Ren, Tingni Sun, Cun-Hui Zhang, Harrison H Zhou, et al. Asymptotic normality and optimalities in estimation of large Gaussian graphical models. The Annals of Statistics, 43(3):991–1026, 2015.
[99] Steffen Rendle. Factorization machines. In 2010 IEEE 10th International Conference on Data Mining (ICDM), pages 995–1000, 2010.
[100] Christian S Riesenfeld, Patrick D Schloss, and Jo Handelsman. Metagenomics: genomic analysis of microbial communities. Annual Review of Genetics, 38:525–552, 2004.
[101] David F Robinson and Leslie R Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53(1-2):131–147, 1981.
[102] Gail L Rosen, Erin R Reichenberger, and Aaron M Rosenfeld. NBC: the Naive Bayes Classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics, 27(1):127–129, 2011.
[103] Simon Roux, Steven J Hallam, Tanja Woyke, and Matthew B Sullivan. Viral dark matter and virus–host interactions resolved from publicly available microbial genomes. eLife, 4:e08490, 2015.
[104] Stan Salvador and Philip Chan. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In Proceedings of the 16th IEEE International Conference on Tools with AI (ICTAI), pages 576–584, 2004.
[105] Marion Scheibe, Falk Butter, Markus Hafner, Thomas Tuschl, and Matthias Mann. Quantitative mass spectrometry and PAR-CLIP to identify RNA-protein interactions. Nucleic Acids Research, 40(19):9897–9902, 2012.
[106] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583–588, 1997.
[107] Nicola Segata, Levi Waldron, Annalisa Ballarini, Vagheesh Narasimhan, Olivier Jousson, and Curtis Huttenhower. Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Methods, 9(8):811–814, 2012.
[108] Itai Sharon, Michael J Morowitz, Brian C Thomas, Elizabeth K Costello, David A Relman, and Jillian F Banfield. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Research, 23(1):111–120, 2013.
[109] Shirish Krishnaj Shevade and S Sathiya Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.
[110] Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not just a black box: learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713, 2016.
[111] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10. IEEE, 2010.
[112] Bernard W Silverman. Density Estimation for Statistics and Data Analysis, volume 26. CRC Press, 1986.
[113] Gregory E Sims, Se-Ran Jun, Guohong A Wu, and Sung-Hou Kim. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences, 106(8):2677–2682, 2009.
[114] Temple F Smith and Michael S Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[115] Kai Song, Jie Ren, Gesine Reinert, Minghua Deng, Michael S Waterman, and Fengzhu Sun. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Briefings in Bioinformatics, 15(3):343–353, 2014.
[116] Zachary D Stephens, Skylar Y Lee, Faraz Faghri, Roy H Campbell, Chengxiang Zhai, Miles J Efron, Ravishankar Iyer, Michael C Schatz, Saurabh Sinha, and Gene E Robinson. Big data: astronomical or genomical? PLoS Biology, 13(7):e1002195, 2015.
[117] Martin Stražar, Marinka Žitnik, Blaž Zupan, Jernej Ule, and Tomaž Curk. Orthogonal matrix factorization enables integrative analysis of multiple RNA binding proteins. Bioinformatics, 32(10):1527–1535, 2016.
[118] Chien-Hao Su, Tse-Yi Wang, Ming-Tsung Hsu, Francis Cheng-Hsuan Weng, Cheng-Yan Kao, Daryi Wang, and Huai-Kuang Tsai. The impact of normalization and phylogenetic information on estimating the distance for metagenomes. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 9(2):619–628, 2012.
[119] Tingni Sun and Cun-Hui Zhang. Scaled sparse linear regression. Biometrika, 99:879–898, 2012.
[120] Yuangang Tang, Fuchun Sun, and Zengqi Sun. Improved validation index for fuzzy clustering. In American Control Conference, pages 1120–1125, 2005.
[121] Todd J Treangen and Steven L Salzberg. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics, 13(1):36, 2012.
[122] Koji Tsuda, Hyunjung Shin, and Bernhard Schölkopf. Fast protein classification with multiple networks. Bioinformatics, 21(suppl 2):ii59–ii65, 2005.
[123] Peter J Turnbaugh, Micah Hamady, Tanya Yatsunenko, Brandi L Cantarel, Alexis Duncan, Ruth E Ley, Mitchell L Sogin, William J Jones, Bruce A Roe, Jason P Affourtit, et al. A core gut microbiome in obese and lean twins. Nature, 457(7228):480, 2009.
[124] Susana Vinga. Editorial: alignment-free methods in computational biology. Briefings in Bioinformatics, 15(3):341–342, 2014.
[125] Susana Vinga and Jonas Almeida. Alignment-free sequence comparison: a review. Bioinformatics, 19(4):513–523, 2003.
[126] Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, and Yu Xia. Active clustering of biological sequences. Journal of Machine Learning Research, 13(Jan):203–225, 2012.
[127] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[128] Lin Wan, Gesine Reinert, Fengzhu Sun, and Michael S Waterman. Alignment-free sequence comparison (II): theoretical power of comparison statistics. Journal of Computational Biology, 17(11):1467–1490, 2010.
[129] Michael S Waterman. A restricted least squares problem. Technometrics, 16(1):135–136, 1974.
[130] Dana Willner, Rebecca Vega Thurber, and Forest Rohwer. Metagenomic signatures of 86 microbial and viral metagenomes. Environmental Microbiology, 11(7):1752–1766, 2009.
[131] Christian Wiwie, Jan Baumbach, and Richard Röttger. Comparing the performance of biomedical clustering methods. Nature Methods, page (epub ahead of print), 2015.
[132] Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3):R46, 2014.
[133] Martin Wu and Jonathan A Eisen. A simple, fast, and accurate method of phylogenomic inference. Genome Biology, 9(10):R151, 2008.
[134] Yu-Wei Wu, Blake A Simmons, and Steven W Singer. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics, page btv638, 2015.
[135] Yu-Wei Wu, Yung-Hsu Tang, Susannah G Tringe, Blake A Simmons, and Steven W Singer. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome, 2(1):1–18, 2014.
[136] Yu-Wei Wu and Yuzhen Ye. A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. Journal of Computational Biology, 18(3):523–534, 2011.
[137] Bin Yang, Yu Peng, Henry CM Leung, Siu-Ming Yiu, Jing-Chi Chen, and Francis YL Chin. Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers. BMC Bioinformatics, 11(Suppl 2):S5, 2010.
[138] Huiguang Yi and Li Jin. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Research, 41:e75, 2013.
[139] Lei Yu and Huan Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. In ICML, volume 3, pages 856–863, 2003.
[140] Kathi Zarnack, Julian König, Mojca Tajnik, Iñigo Martincorena, Sebastian Eustermann, Isabelle Stévant, Alejandro Reyes, Simon Anders, Nicholas M Luscombe, and Jernej Ule. Direct competition between hnRNP C and U2AF65 protects the transcriptome from the exonization of Alu elements. Cell, 152(3):453–466, 2013.
[141] Zheng Zhao and Huan Liu. Semi-supervised feature selection via spectral analysis. In SDM, pages 641–646, 2007.
Abstract
The advent of the next-generation sequencing technologies (NGS) has allowed biologists to extract genomic data with unprecedented high resolution and sufficient sequence depth, which has fueled the excessive generation of genomic data, with faster speed and lower cost. In addition to the direct generation of genomic data, NGS has also facilitated the generation of other types of omics data, such as epigenomics, transcriptomics, proteomics, metabolomics, and interactomics. Such explosively increasing amount of data has raised compelling challenges and opportunities, such as how to build up repositories of data, how to integrate and exploit data from various sources, how to manipulate and manage data efficiently, and how to analyze and interpret data in an interactive and visualized way. This dissertation presents three computational methods and software packages developed specifically for the need to analyze metagenomic data accurately, promptly, and interactively. ❧ To investigate the taxonomic structure of microbial samples, given assembled contigs, we want to bin the contigs into OTUs not only using the concatenation of tetra-nucleotide composition and the coverage profiles across multiple metagenomic samples of the contigs, but seamlessly integrate two types of additional knowledge, co-alignment to reference genomes and linkage between contigs provided by paired-end reads, to achieve improved performance. ❧ In contig binning, it is far from optimal to concatenate tetra-nucleotide composition and the coverage profiles across multiple metagenomic samples straightforwardly. We weigh important features more highly than less important ones in accord with implicitly existing auxiliary knowledge. More importantly, the achieved feature importance scores interpretable and meaningful. 
In inferring the phylogenetic relationships among OTUs by alignment-free sequence comparison, the state-of-the-art dissimilarity measures $CVTree$, $d_2^*$, and $d_2^S$ are computationally expensive. We therefore accelerate their computation for efficient analyses from both the data structure and the implementation perspectives.
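To make the cost of these measures concrete, here is a minimal sketch of the $d_2^*$ dissimilarity under a simple i.i.d. nucleotide background estimated separately from each sequence. It is an assumption-laden illustration, not the accelerated implementation the dissertation develops (which supports richer Markov backgrounds and optimized data structures); the function names are ours. $d_2^*$ centers each $k$-mer count by its expected count, normalizes by the expectation, and maps the resulting correlation into $[0, 1]$. The inner loop visits all $4^k$ words per pair of sequences, which is why naive computation becomes expensive at scale.

```python
from itertools import product
from math import sqrt

def kmer_counts(seq, k):
    """Count occurrences of every DNA k-mer in seq."""
    counts = {''.join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        if w in counts:
            counts[w] += 1
    return counts

def d2star(x, y, k=4):
    """d2* dissimilarity with an i.i.d. background per sequence.
    Returns a value in [0, 1]; 0 means maximal similarity."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # background nucleotide frequencies, estimated from each sequence
    px = {b: x.count(b) / len(x) for b in "ACGT"}
    py = {b: y.count(b) / len(y) for b in "ACGT"}
    nx, ny = len(x) - k + 1, len(y) - k + 1
    num = sx = sy = 0.0
    for w in cx:
        pwx = pwy = 1.0
        for b in w:  # word probability under the i.i.d. background
            pwx *= px[b]
            pwy *= py[b]
        ex, ey = nx * pwx, ny * pwy  # expected counts
        if ex == 0.0 or ey == 0.0:
            continue
        xt, yt = cx[w] - ex, cy[w] - ey  # centered counts
        num += xt * yt / sqrt(ex * ey)
        sx += xt * xt / ex
        sy += yt * yt / ey
    # normalize the correlation-like statistic into a dissimilarity
    return 0.5 * (1.0 - num / sqrt(sx * sy))
```

By construction, a sequence compared against itself yields 0, and the Cauchy–Schwarz inequality keeps the value within $[0, 1]$ for any pair.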
Asset Metadata
Creator: Lu, Yang (author)
Core Title: Big data analytics in metagenomics: integration, representation, management, and visualization
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Computational Biology and Bioinformatics
Publication Date: 11/10/2017
Defense Date: 08/17/2017
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: alignment-free sequence comparison, integrative genomics, metagenomics, next generation sequencing, OAI-PMH Harvest
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Sun, Fengzhu (committee chair), Fuhrman, Jed A. (committee member), Lv, Jinchi (committee member), Waterman, Michael S. (committee member)
Creator Email: isaac.lu.phd@gmail.com, ylu465@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-455199
Unique identifier: UC11264681
Identifier: etd-LuYang-5895.pdf (filename), usctheses-c40-455199 (legacy record id)
Legacy Identifier: etd-LuYang-5895.pdf
Dmrecord: 455199
Document Type: Dissertation
Rights: Lu, Yang
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA