IDENTIFYING FUNCTIONAL METABOLIC GUILDS:
A COMPUTATIONAL APPROACH TO CLASSIFYING HETEROTROPHIC DIVERSITY IN
THE MARINE SYSTEM
by
Ryan C. Reynolds
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOLOGY (MARINE BIOLOGY AND BIOLOGICAL OCEANOGRAPHY))
AUGUST 2024
Copyright 2024 Ryan C. Reynolds
Dedication
This work, the culmination of my academic career, is dedicated to the countless collaborators,
colleagues, and friends (shoutout to the Fun Gang) that are necessary to conquer a challenge of
this magnitude. To the incredible people in the Levine lab group during my tenure as well as the
professors and administrative staff of the Marine and Environmental Biology department who
embraced and supported me throughout this journey. To my best friend José and the many game
nights and happy hours we shared with Melisa, Dylan, Kelly, and so many other good friends that
reminded me of the things in this life that really matter. To my loving partner, Linda, who stood
by me in the throes of the PhD defense and always gave her time generously to help me keep it
together. And finally, to my parents, who never ceased to support me and picked me up whenever
I needed a helping hand. To you all, I owe everything.
TABLE OF CONTENTS
Dedication................................................................................................ ii
LIST OF TABLES...................................................................................vi
LIST OF FIGURES.............................................................................. viii
ABSTRACT ...........................................................................................xv
Chapter 1 : Introduction............................................................................1
Chapter 2 : Identification of Microbial Metabolic Functional Guilds
from Large Genomic Datasets..................................................................6
1 Introduction .................................................................................................... 6
2 Materials and Methods................................................................................... 8
2.1 Dataset .........................................................................................................................8
2.2 Classic Methods.........................................................................................................12
2.3 Aspect Bernoulli........................................................................................................13
2.4 Scoring.......................................................................................................................14
2.5 Guild identification and mapback genomes ..............................................................16
2.7 Artificial datasets.......................................................................................................18
2.8 Data Visualization .....................................................................................................19
3 Results.......................................................................................................... 19
3.1 Phylogeny of Datasets...............................................................................................19
3.2 Classic Methods.........................................................................................................20
3.3 AB model...................................................................................................................23
3.4 Comparison between AB model and clustergram guilds ..........................................28
4 Discussion .................................................................................................... 32
4.1 Emergent microbial metabolic guilds........................................................................32
4.2 MAG vs SAG guild comparison ...............................................................................35
5 Conclusions.................................................................................................. 36
Chapter 3 : Emergent Metabolic Niches for Marine Heterotrophs.........39
1 Introduction .................................................................................................. 39
2 Methods........................................................................................................ 40
2.1 Genomic Data ............................................................................................................40
2.2 Phylogeny ..................................................................................................................41
2.3 Model Generation and Quality Assessment ..............................................................42
2.4 Compound Classification ..........................................................................................44
2.5 Growth sensitivity analysis........................................................................................45
2.6 Validation on Experimentally Characterized Genomes ............................................46
2.7 Generation of Self-Organized Maps..........................................................................47
2.8 Maximum Growth Rate Estimations.........................................................................49
2.9 Global Distribution....................................................................................................50
2.10 Data Visualization .....................................................................................................51
3 Results.......................................................................................................... 51
3.1 CarveMe model quality .............................................................................................51
3.2 Metabolic strategy assessment...................................................................................53
3.3 Emergent metabolic clusters......................................................................................55
3.4 Biogeographic Distribution .......................................................................................59
4 Discussion .................................................................................................... 61
Chapter 4 : Emergent Structural Differences in Metabolic Models for
Marine Heterotrophs...............................................................................64
1 Introduction .................................................................................................. 64
2 Methods........................................................................................................ 68
2.1 Genomic Data ............................................................................................................68
2.2 Phylogeny ..................................................................................................................69
2.3 Metabolic Model Generation and Quality Assessment .............................................69
2.4 In silico knockouts.....................................................................................................71
2.5 Network Generation ..................................................................................................73
2.6 Data Visualization .....................................................................................................75
3 Results.......................................................................................................... 75
3.1 Model Consensus Variability ....................................................................................75
3.2 Bulk Network Structure.............................................................................................77
3.3 Metabolic Compartment Assessment ........................................................................79
3.4 Knockout Experiments..............................................................................................83
3.5 Knockout Network Metabolic Compartment Assessment ........................................89
3.6 High Impact Knockout Reactions..............................................................................92
4 Discussion .................................................................................................... 93
Chapter 5 : Conclusions..........................................................................97
REFERENCES .....................................................................................102
APPENDICES......................................................................................122
Appendix A....................................................................................................... 122
A1 Extended Methods...................................................................................................122
A2 Extended validation of AB ......................................................................................126
A3 Speed and Stability ..................................................................................................130
Appendix B ....................................................................................................... 156
B1 CarveMe Validation ................................................................................................156
B2 SOMs clustering ......................................................................................................157
B3 SOM cluster analyses ..............................................................................................160
Appendix C ....................................................................................................... 206
LIST OF TABLES
Table 2.1: Top 15 functions based on score (see Methods) for two aspects related to DMSP
degradation and motility. Functions that constitute the resulting DMSP and motility guilds are
highlighted in bold or bold and italics. SBP is the substrate-binding protein associated with
the respective ABC transporter......................................................................................................11
Table 3.1: Description of 8 SOM clusters including the number of genomes per cluster, the
growth strategy as determined by the dCUB distributions, the number and names of the growth
limiting substrate classes, as well as the 2 most numerically abundant orders. ............................55
Table 4.1: Statistical significance values for the metabolite (a) and reaction type (b) frequency
comparisons between good and bad ensembles. We used a Welch’s modified t-test for uneven
sample sizes to test significance. ...................................................................................................79
Table 4.2: Statistical significance values for the metabolite (a) and reaction type (b) frequency
comparisons between the good and bad knockout ensembles.......................................................90
Table 4.3: The 14 reactions determined to be high impact reactions by having average
decreases twice as great as their standard deviations. ...................................................................92
Appendix Table A4: Results of Aspect Bernoulli runs with three non-overlapping artificial
guilds at three aspect numbers K=5,10,20. Hit Rate describes the percentage of hits observed
out of all possible hits. Extra Hits represent runs where an artificial guild appears in more than
one aspect. Multi Hits occur when multiple artificial guilds appear together in a single aspect.
Hit Rate, Extra Hits, and Multi Hits are shown as percent values (%).........................................145
Appendix Table A5: Results of Aspect Bernoulli runs with three non-overlapping artificial
guilds at three aspect numbers K=5,10,20. Hit Rate describes the percentage of hits observed
out of all possible hits. Extra Hits represent runs where an artificial guild appears in more than
one aspect. Multi Hits occur when multiple artificial guilds appear together in a single aspect.
Hit Rate, Extra Hits, and Multi Hits are shown as percent values (%).........................................149
Appendix Table A6: Results of Aspect Bernoulli runs with three artificial guilds inserted
randomly at three aspect numbers K=5,10,20. Hit Rate describes the percentage of hits
observed out of all possible hits. Extra Hits represent runs where an artificial guild appears in
more than one aspect. Multi Hits occur when multiple artificial guilds appear together in a
single aspect. Hit Rate, Extra Hits, and Multi Hits are shown as percent values (%). .................150
Appendix Table A7: Guilds defined as top 5 functions of each aspect for a run of the AB
model on the composite dataset with K=10 aspects....................................................................151
Appendix Table A8: Guilds defined as top 5 functions of each aspect for a run of the AB
model on the MAG-only dataset with K=10 aspects...................................................................152
Appendix Table A9: Guilds defined as top 5 functions of each aspect for a run of the AB
model on the SAG-only dataset with K=10 aspects....................................................................153
Appendix Table A10: Number of mapback genomes for guilds defined by the top 5 highest
scoring functions for three different values of K (K=5,10,20). X’s represent guilds beyond the
size of K.......................................................................................................................................154
Appendix Table A11: Guilds for the two models generated to assess the numerical stability of
the AB procedure. Each column reflects the functions from one model with the rows
distinguishing which guild they belonged to. For visual ease, a blank row is inserted between
guilds. ..........................................................................................................................................155
Appendix Table B12: All data for 1,591 high consensus genomes. This table provides the
unique identifiers for the genomes used in the SOM analysis as well as information on the
SOM cluster they were assigned to, their specific value of the dCUB growth proxy, taxonomic
information (order, genus, and species), as well as the raw growth sensitivity values computed
for the 11 compound classes clustered in this study....................................................................182
Appendix Table B13: Metabolite classification information. This table provides information
on the 456 compounds that were manually classified for this study including their names in
plain English, the compound class they were assigned to, and the name of the corresponding
external reaction in the CarveMe universal model......................................................................183
Appendix Table B14: Biogeographical distribution of SOM clusters by oceanographic region.
This table provides the RPKM information for the 23 regions defined by Lanclos et al. (2023)
including the full name of each region, the identifier for each region, the oceanographic
category each region was assigned to, the number of stations assigned to each region, and the
relative abundance of the 8 SOM clusters based on the bootstrapped RPKM values.................198
Appendix Table B15: Biogeographical distribution of SOM clusters by oceanographic
category. This table provides the RPKM information for the 5 oceanographic categories
defined in this study including the number of stations assigned to each category and the RPKM
relative abundance information for each category. .....................................................................201
Appendix Table B16: Matrix of significance values for all pairs of SOM clusters based on
dCUB distributions. This table provides the p-values for all paired comparisons of the
distributions of growth rates (estimated using dCUB) for our 8 SOM clusters. .........................202
Appendix Table B17: Unifrac distances for all paired comparisons of the SOM clusters. This
table provides the Unifrac distances for all paired comparisons of the genomes in SOM
clusters. We report the Unifrac distances when comparing across the full phylogenetic tree as
well as for subtrees of only the genomes in each of our 15 major taxonomic orders. ................203
Appendix Table C18: Relative frequency of the total set of surveyed reactions (3,068) that
belonged to each of the 9 unique reaction types..........................................................................217
LIST OF FIGURES
Figure 2.1: Abundances of functions within an example aspect’s probabilistic representatives,
Ak, compared to their score rank before (rfk, cyan) and after (sfk, orange) applying the score
adjustment qfk (step 2). After the adjustment, a large density of points in the upper left
quadrant is observed indicating that the highest rank functions using sfk are also found within
a large number of probabilistic representative genomes. ..............................................................17
Figure 2.2: Results of the NMDS run on the composite dataset. Points plotted are the loadings
of the functions in the dataset on MDS axes 1 and 2. Points are semi-transparent to emphasize
points that overlap one another. Note: the NMDS algorithm did not reach convergence with a
minimum stress value of 0.211......................................................................................................21
Figure 2.3: Resulting clustergram plot on the presence/absence pathway data for our
composite dataset (red = present, black = absent) using a cut height of 0.9 with rows (genomes)
and columns (functions) clustered based on Jaccard distance.......................................................22
Figure 2.4: Hit Rate and number of Extra Hits for 100 simulated datasets with three artificial
guilds inserted in a non-overlapping manner across a range of K values. Results are colored
by the guild parameters where #fn denotes the number of functions in each artificial guild.
The red (#fn = 5/Abundance = 0.02) versus the green (#fn = 5/Abundance = 0.1) lines
illustrate the impact of a change in guild abundance. The impact of guild size on Hit Rate
and Extra Hits can be seen in Supplemental Table 2. ...................................................................25
Figure 2.5: Number of genomes that possess all of the functions in a guild (mapback genomes)
as guild size is expanded to include more functions in decreasing score order (starting at size
2)....................................................................................................................................................26
Figure 2.6: Specificity of guild function pairs for a guild related to the degradation of DMSP.
Values are shown for the confidence of the guild function pairs in the outgroup genomes such
that low values indicate high specificity of the guild function pairs for the DMSP guild. Note
that the colorbar is scaled from 0 to 0.8. The diagonal is omitted since it is 1 by definition.
The axes are non-symmetric because DmdA → ddd* is fundamentally different from ddd* →
DmdA (see Eq. 6). .........................................................................................................................27
Figure 2.7: Distribution of guild sizes (number of functions) and number of mapback genomes
for guilds generated with clustergram at cut heights of 0.9 (blue square) and 1 (purple plus
sign) as well as AB. AB approach 1 (red circle) defined guilds using a fixed size of 5 functions
while AB approach 2 (green triangle) defined guilds using a minimum mapback genome
cutoff of 100 genomes. Points were jittered using the built-in position_jitter function in the
ggplot2 package v3.4.2 with h=0.1, w=0.35 using the random seed 123. ....................................28
Figure 3.1: Diversity of dataset, quality of metabolic models, and designation of metabolic
clusters. Phylogenetic tree of all 3,984 bacterial genomes included in this study (including the
66 reference genomes from the BiGG database). The tree is contextualized by several external
rings that describe different qualitative and quantitative components of the genomes in this
study. The first ring around the tree denotes both the position and density of high quality
ensembles within the tree as well as the assignment of these genomes to each of our eight
SOM clusters. The second ring shows the ensemble consensus score (Equation 1) for each
genome in the tree. The third, sparse ring of red lines denotes the position of the 66 BiGG
reference genomes present in the tree. Finally, the fourth and innermost ring shows the
location of the top 15 most abundant orders..................................................................................43
Figure 3.2: Substrate sensitivities for 8 SOM clusters. Bubble plot of the mean growth
sensitivity values for genomes in each of our 8 SOM clusters. A growth sensitivity of 1
indicates high sensitivity to that substrate such that the modeled growth rate was reduced
proportionally to the reduction in the substrate’s flux (e.g., 50% substrate reduction
corresponded to 50% growth rate reduction). The size of the bubbles in this plot reflects the
relative sensitivity of each of the 8 SOM clusters to a given compound class where larger
bubbles indicate that cluster was more sensitive to that compound class than others. The 6
compound classes which resulted in significant growth reduction for at least one of the SOM
clusters are shown here. The full results for all 11 substrate classes are provided in
Supplemental Figures S2 & S3. Cluster numbers were colored based on maximal genomic
growth rate (Supplemental Figure S6)...........................................................................................57
Figure 3.3: Biogeographical relative abundances of 8 SOM clusters. Clustered bar charts of
the relative abundances of the 8 SOM clusters as determined by RPKM at each of the 1,203
stations assigned to one of the 23 oceanographic regions. Stations were grouped into our 5
defined oceanographic categories and then arranged based on a hierarchical clustering of the
relative abundances........................................................................................................................60
Figure 4.1: Range of consensus scores for two genomes. Histograms of the resulting ensemble
consensus scores after re-running two individual genomes through CarveMe’s ensemble mode
with 60 model ensembles 500 times. The two bar colors delineate the two genomes with initial
consensus scores of 0.829 (red, lower quality ensemble) and 0.941 (blue, higher quality
ensemble).......................................................................................................................................77
Figure 4.2: Distributions of metabolite frequencies for low (bad) and high (good) quality
ensembles. The distributions for the low- and high-quality ensembles are statistically
significantly different for all three compartments. ........................................................................78
Figure 4.3: Distributions of reaction type frequencies for low- and high- quality ensembles for
the 9 possible reaction types within a CarveMe model. The arrow denotes the direction of the
reaction such that Cytoplasm → Periplasm denotes a reaction converting a reactant in the
cytoplasm to a product in the periplasm. All 9 pairs of distributions are statistically
significantly different from one another. The low-quality ensembles have higher frequencies
of all reaction types except Cytoplasm → Cytoplasm and Periplasm → Cytoplasm reaction
types...............................................................................................................................................81
Figure 4.4: Change in consensus score as a result of reaction knockouts. Scatterplot of the
average starting and post-knockout consensus score per reaction surveyed during our
knockout simulations (N = 3,068 total reactions). Points are colored and sized by the number
of replicate ensembles the reaction was knocked out in. The solid black line represents the 1:1
line where knocking out a reaction would, on average, create no change in ensemble consensus.
The dashed black line represents the average decrease in consensus score across all the
knockout experiments....................................................................................................................85
Figure 4.5: Rank abundance curve of the average change in consensus score after knockouts
for each of the 3,068 reactions surveyed. Points are colored by the proportion of times that
individual reaction was added back to an ensemble after being knocked out and sized by the
number of total replicate ensembles per reaction. The solid black lines denote the bulk of the
data (90%) which fall between a consensus score change of -0.024 and -0.244. The dashed
black line represents the dataset average change in consensus score across all reactions of
-0.134. .............................................................................................................................................88
Figure 4.6: Scatterplot of the average change in consensus score versus the standard deviation
in consensus change per reaction for the 3,068 reactions surveyed. Points are colored and sized
by the number of replicate ensembles they were knocked out in. The black dashed line
represents the line of slope -1 which separates the reactions based on whether the absolute
value of the mean consensus change is larger (below the line) or smaller (above the line) than
the standard deviation of consensus change. Similarly, the orange dashed line represents the
line of slope -0.5 which separates the reactions based on whether the absolute value of the
mean consensus change is more (below the line) or less (above the line) than twice as large as
the standard deviation....................................................................................................................91
Appendix Figure A1: Phylogenomic tree of the full composite dataset consisting of 3,840
genomes including MAGs, SAGs, and isolate genomes. This tree presents 3,775 of those
genomes (see Results) that represent 51 unique bacterial phyla and 2 archaeal phyla. For
clarity, the two archaeal phyla have been collapsed simply into an “Archaea” designation. .....132
Appendix Figure A2: Phylogenomic tree for the SAG genomes sourced and quality filtered
from the GORG-Tropics expedition constituting 1,733 genomes. This tree presents 1,415 of
those genomes (see Results) that represent 9 unique bacterial phyla as well as 2 archaeal phyla
that are collapsed simply to the designation “Archaea”..............................................................133
Appendix Figure A3: Phylogenomic tree for the DMSP guild as defined in the main text (see
bolded functions in main text Table 2). This guild is distributed across 4 bacterial families,
primarily Rhodobacteraceae........................................................................................................134
Appendix Figure A4: Phylogenomic tree for the motility guild as defined in the main text (see
bolded functions in main text Table 2). This guild is distributed across 9 bacterial orders, most
notably in the Enterobacterales and Caulobacterales. .................................................................135
Appendix Figure A5: Total distribution of AAI values between 30% and 90% for all genome
pairs in our composite dataset of 3,840 genomes. On average, a given genome pair had an
AAI value of 39.1%.....................................................................................................................136
Appendix Figure A6: Histogram of average AAI values for our Monte Carlo style simulation
of 1,000 sets of 100 random genomes. The AAI values for all genome pairs in each random
set were averaged to construct the distribution in white bars. In addition, we computed the
average AAI value for all pairs of genomes in each of our 10 guilds and overlaid those with
colored vertical lines. The High ANI line that is shown to the right of the plot break shows the
AAI value for the 100 genomes with the most non-NA ANI values (i.e., the most similar set
of 100 genomes). The axis break was produced using ggbreak (Xu et al., 2021).......................137
Appendix Figure A7: Visual schematic of the model procedure that shows how we model our
data matrix Y as a matrix V of Bernoulli probabilities that we then decomposed into the two
matrices to create a low-dimensional representation of Y.............................................138
Appendix Figure A8: Resulting dendrogram from running clustergram on our composite
dataset with a cut height of 1 (red = present, black = absent). The rows (genomes) and columns
(functions) were both clustered using the Jaccard distance metric. ............................................139
Appendix Figure A9: Simulated data metric values for 100 simulations with three artificial
guilds across the tested range of K’s (K=5-20): Hit Rate as a proportion (top panel), Extra Hits
(middle panel), and Multi Hits (bottom panel)............................................................................140
Appendix Figure A10: Simulated data metric values for 100 simulations with a single artificial
guild across the tested range of K’s (K=5-20): Hit Rate as a proportion (top panel), Extra Hits
(middle panel), and Multi Hits (bottom panel). As seen in the bottom panel, there are no Multi
Hits for the single guild simulations because, as defined, a Multi Hit requires at least two guilds
(Appendix A2).............................................................................................................................141
Appendix Figure A11: Simulated data metric values for 100 simulations with three artificial
guilds randomly inserted into the dataset across the tested range of K’s (K=5-20): Hit Rate as
a proportion (top panel), Extra Hits (middle panel), and Multi Hits (bottom panel). .................142
Appendix Figure A12: Value of the MLE estimator for runs of the AB model that vary only in
the number of iterations used. ...........................................................................................................143
Appendix Figure A13: Comparison of scores for matched guilds from two independent AB
model estimates using a two-step approach that identified good initialization states and then
ran the EM algorithm for many steps for those states. We see that the scores lie along the 1:1
line (red), showing that the guilds are relatively stable across model estimates.........................144
Appendix Figure B1: Bubble plot of the mean growth sensitivity values (similar to Figure 3.2)
for a new set of 8 SOM clusters generated on the 983 genomes with a consensus of 90% or
greater. A growth sensitivity of 1 indicates high sensitivity to that substrate such that the
modeled growth rate was reduced proportionally to the reduction in the substrate’s flux (e.g.,
50% substrate reduction corresponded to 50% growth rate reduction). The size of the bubbles
in this plot reflect the relative sensitivity of each of the 8 SOM clusters to a given compound
class where larger bubbles indicate that cluster was more sensitive to that compound class
than others. The 6 compound classes which resulted in significant growth reduction for at least
one of the SOM clusters are shown here. While the ordering of the clusters changed, we still
observed the same overall patterns. We have one cluster with no growth sensitivities and
multiple clusters with sensitivity to one compound and multiple with sensitivity to two
compounds. The fast growth cluster and intermediate growth single sensitivity clusters from
the primary analysis emerged in this higher consensus group of models. The slight shift for
the multiple sensitivity clusters is consistent with the observation that the more classically
oligotrophic orders generally had model ensembles with lower consensus values such that
excluding these genomes from the SOM generation would be expected to have the largest
impact on the slow growth clusters.
Appendix Figure B2: Distribution of growth sensitivity values by cluster. Density plots of the
growth sensitivity values for each model for each of the 11 compound classes grouped by
SOM cluster (N=1,050,060).
Appendix Figure B3: Comparison of results between CarveMe model ensembles and
experimental growth studies performed in Gralka et al. Heatmap of agreement between
CarveMe model ensemble reactions and experimental growth data for a collection of 146
strains grown on 58 different sole carbon sources. White squares indicate direct agreement
between the models and data (i.e., model includes the exchange reaction and growth was
observed or model does not include the exchange reaction and no growth was observed), blue
squares indicate a false positive (model includes the exchange reaction, experimental data
does not) and red squares indicate a false negative (experimental data predicts growth, model
does not include the exchange reaction). Gray squares indicate that the presence of the
compound in the model exchange reactions was variable (between the consensus thresholds
for “present” and “absent”).
Appendix Figure B4: Relative growth sensitivities between SOM clusters. (a) Ordered bar
plot of the proportion of models across all clusters with substantial growth sensitivity to the
reduction of each compound class (substantial is defined as >80% reduction in growth). (b)
Bar plots of the relative mean growth sensitivity values for each of the 11 compound classes
across the 8 SOM clusters. The error bars represent one standard deviation. Plot facets are
ordered from the highest overall sensitivity (carboxylic acids) to the lowest (amines/amides).
Appendix Figure B5: Taxonomy by cluster. (a) Ordered bar plot of the proportion of models
in each of the 15 most abundant orders with substantial growth sensitivity to the reduction of
any compound class (substantial is defined as >80% reduction in growth). (b) Stacked bar
plots of the relative abundances of the top 15 orders in each cluster.
Appendix Figure B6: Codon usage bias (dCUB) by cluster. The dCUB values fall into four
statistically distinct groups designated with letters according to the key. Group a is the slow-growing group (Clusters 1, 3, 7 and 8) and statistically distinct from the other clusters. Group
c (Cluster 2) is the fast-growing group and is distinct from all other clusters. Groups ab and
bc represent our intermediate growers. Specifically, group ab (Clusters 4 and 6) are
statistically distinct from fast-growing group c but not from slow-growing group a, whereas
group bc (Cluster 5) is statistically distinct from group a but not from group c.
Appendix Figure B7: Taxonomic abundance enrichments of top 15 Orders by cluster.
Percentage enrichment in the relative abundance of the top 15 Orders (and Other) in each of
the 8 SOM clusters compared to the relative abundances of each of these Orders in the full
dataset.
Appendix Figure B8: Relative abundances of SOM clusters by oceanographic region.
Sampling sites were grouped for bootstrapping according to the 23 oceanographic regions
given in Table S3. For each region, bar plots of the average relative abundances of each of the
8 SOM clusters are shown. The relative abundances are calculated based on the bootstrap
distributions of the raw RPKM values. The clusters are colored by their growth strategy (fast,
fast-intermediate, slow-intermediate, and slow). Error bars represent the standard deviations
of the bootstrapped distributions.
Appendix Figure B9: Relative abundances of SOM clusters by oceanographic category.
Sampling sites were grouped for bootstrapping according to the 5 oceanographic categories.
Bar plots show the average relative abundances of each of the 8 SOM clusters in each category
where the abundance is based on the bootstrap distributions of the raw RPKM values. The
clusters are colored by their growth strategy (fast, fast-intermediate, slow-intermediate, and
slow). Error bars represent the standard deviations of the bootstrapped distributions.
Appendix Figure B10: CarveMe run parameterizations. (a) Rarefaction curve of the total
number of unique reactions found in any model within an ensemble of models generated for
a given genome. This curve was generated for ensemble sizes ranging from 2-100 models. At
low ensemble sizes (e.g., in the range from 2-20) the model space rapidly identifies new
unique reactions as more models are generated. The curves stabilize around ensemble sizes of
40-80 such that increasing the number of models in the ensemble does not add new reactions.
(b) Histogram of the consensus scores for model ensembles when annotating reactions for
CarveMe with eggNOG vs. the native Diamond (ensemble size = 60). Overall models
generated with Diamond annotation produced significantly higher quality models than when
eggNOG annotations were used for the same genomes.
Appendix Figure B11: SOM metrics. (a) The SOM grid is shown where each circle represents
a grid point in the map (N=400). Grid points are colored by their assignment to the 8 defined
SOM clusters. Grid points to which no genomes were assigned are colored gray to represent
the absence of mapped data. It is important to note that the SOM uses a toroidal grid where
the edges wrap around such that, for example, all nodes in Cluster 8 are in fact connected. (b)
The training progress of the grid is shown for the duration of the map refinement process. (c)
The number of models from each genome ensemble that were assigned to each SOM cluster,
where a value of 60 denotes instances when all models from the ensemble were assigned to
the same SOM cluster and a value of 0 denotes that no models from a specific ensemble were
assigned to the cluster. Data is only shown for the 1,591 high consensus ensembles. The
bimodal distribution of the data around 0 and 60 illustrates that all models from a given
ensemble were almost always assigned to the same SOM cluster.
Appendix Figure B12: PCA plot of the growth sensitivity data. The PCA captured 50.6% of
the total variance on the first two principal component axes, and distinguished two major
groups of data points, a slow growing and fast growing cluster. Of note, the estimates of
maximum growth rate were not included in this clustering. The points in the PCA are colored
by SOM cluster assignment to illustrate that both approaches identified similar clustering of
the data but that the SOM method was able to better differentiate between the 8 clusters.
Appendix Figure C1: (a) The total number of reactions included in an ensemble as a function
of CarveMe ensemble size (results for 18 genomes are shown). We see that the cumulative
number of reactions begins to saturate around ensemble sizes of 60, indicating that the reaction
space is fully explored. (b) Comparisons of the CarveMe ensemble consensus scores based on
the annotation method used during the CarveMe run process. Orange bars represent the
resulting ensemble consensus scores when eggNOG orthologies are imported as opposed to
using the native DIAMOND annotation process.
Appendix Figure C2: Distributions of node degree for networks associated with good and bad
model ensemble networks.
Appendix Figure C3: Distributions of eigen centrality values for good and bad model
ensemble networks.
Appendix Figure C4: Distributions of betweenness centrality values for good and bad model
ensemble networks.
Appendix Figure C5: Ratios of export to import for the unique metabolites in each of the three
cellular compartments. The distributions for Extracellular and Cytoplasm metabolites are
statistically significantly different (Welch’s modified t-test p < 0.01). The Periplasm is a
transition space between the other two compartments so it’s unsurprising that virtually all
networks had a precise export/import ratio of 1.
Appendix Figure C6: Distributions of reaction type frequencies for the 263 high quality
Flavobacteria genomes and all high quality genomes (N = 1,591) for the 9 possible reaction
types within a CarveMe model. All distributions are statistically indistinguishable between
the two groups (Welch’s modified t-test, p > 0.05).
Appendix Figure C7: Mean consensus change for a test genome (C = 0.999) as the number
of reactions that are knocked out is increased. Error bars reflect the standard deviation in
individual ensemble consensus values for each number of reaction knockouts tested.
Appendix Figure C8: Comparison of the original consensus scores for each of the 263
genomes to the average change in consensus for all 250 replicate knockout ensembles
generated for each genome. Points are colored by the mean change in consensus (y-axis).
Appendix Figure C9: Histograms of the change in consensus score for reaction knockouts
created for the 9 unique types of reactions present in CarveMe. There are no statistical
differences between the changes in consensus score when evaluated by reaction type (Welch’s
modified t-test, p > 0.05) suggesting that any kind of reaction can impact model precision.
Appendix Figure C10: Violin plots of the changes in consensus for knockout ensembles
grouped based on how many of the 5 knocked out reactions were ultimately added back during
the carving process. The violin at 0 reflects the case where none of the 5 reactions were added
back while 1 reflects the case where all 5 reactions were added back. All pairwise comparisons
of the 6 distributions were found to be significantly different according to a Tukey test on an
ANOVA (p < 0.05).
Appendix Figure C11: Histograms of the changes in consensus for individual reaction
knockouts depending on whether they were added back (bottom) or not (top). Reactions that
are added back to knockout ensembles have a much higher tendency towards creating
decreases in ensemble consensus compared to reactions that are not added back.
ABSTRACT
Ocean microbial communities are made up of thousands of diverse taxa whose metabolic
demands set the rates of both biomass production and degradation. Thus, these microscopic
organisms play a critical role in ecosystem dynamics, global carbon cycling, and climate.
Modeling these dynamics requires reducing the complexity of microbial communities and linking
microbial activities directly with biogeochemical rates. I developed a Bayesian statistical method
for defining functional guilds from annotated genomes, derived from both uncultured and cultured
organisms. Expanding on this work, I leveraged global metagenomic datasets, metabolic models,
and unsupervised machine learning techniques to identify key marine heterotroph metabolic
guilds. I found eight clusters with distinct substrate preferences, growth strategies, taxonomic
profiles, and biogeographic distributions. I demonstrated that the slowest growing groups are
sensitive to the availability of multiple classes of substrates while the intermediate growth groups
are only sensitive to a single class. Moreover, organisms from diverse taxonomic groups can
occupy the same metabolic niche such that metabolic strategy cannot always be inferred from
taxonomy alone. I also show that the automated metabolic model generation software is a
powerful tool for understanding microbial metabolism but must be used with caution to ensure
robust solutions. Overall, this work provides the building blocks for analyzing and modeling
diverse marine microbial populations.
Chapter 1 : Introduction
Marine microbial heterotrophs play an integral role in regulating biogeochemical cycling,
particularly through the process of remineralization (Falkowski et al., 2008; Fuhrman and Azam,
1982, 1980). Photosynthetic organisms in the surface ocean fix carbon dioxide and inorganic
nutrients into oxygen and organic carbon compounds (Ault, 2000; Johnson et al., 2006; Nelson et
al., 1995). This oxygen and organic carbon are respired, or remineralized, throughout the water
column as a result of microbial activity and released back to carbon dioxide and dissolved
inorganic nutrients (Azam, 1998). Remineralization occurs as part of a cyclical process broadly
defined as the microbial loop (Pomeroy et al., 2007; Pomeroy, 1974).
Upper ocean carbon cycling is a tightly constrained process, and in some parts of the ocean
upwards of 90% of the carbon dioxide that is fixed is remineralized on rapid time scales before it
can be exported to the deep ocean (Giering et al., 2014; Henson et al., 2012, 2011; Martin et al.,
1987). The fraction of fixed organic matter (OM) that persists either exits the surface via sinking
in the particulate form (POM) (Martin et al., 1987; Nguyen et al., 2022) or subduction in the
dissolved form (DOM). DOM can also accumulate as a result of low consumption rates driven by
factors such as the complexity of machinery required to degrade it, termed recalcitrance (Bligh et
al., 2022; Zakem et al., 2021). The fixation and remineralization of carbon also impacts the
concentrations of other major macronutrients such as nitrogen, sulfur, and phosphorus that are
available in the form of dissolved inorganic compounds. These fundamental elements are
incorporated into organic matter at specific ratios during carbon fixation and released back to the
dissolved inorganic pool during remineralization (Redfield et al., 1963; Sharoni and Halevy, 2020).
The balance between rates of photosynthetic carbon fixation and remineralization is instrumental
in understanding the fate of carbon dioxide in the ocean.
Understanding the drivers of the ocean carbon cycle, specifically the balance between
fixation and remineralization, is particularly crucial as anthropogenic changes alter the global
climate and the atmosphere continues to accumulate greenhouse gases, chiefly carbon dioxide.
The international community has come together to model the impact of climate change on the
Earth System (e.g., the Intergovernmental Panel on Climate Change (IPCC)) (Calvin et al., 2023).
These model simulations are used to inform governmental policy and public sentiment. The global
biogeochemical models used for these simulations capture overall biogeographical patterns such
as low productivity in the oligotrophic gyres, elevated productivity in upwelling regions, and the
high-nutrient-low-chlorophyll regions of the oceans. However, these models still struggle to
capture key dynamics related to primary production such as temperature dependences, initial
nutrient concentrations, and grazer dynamics (Laufkötter et al., 2015). There are also key drivers that
remain poorly understood such as the factors that influence phytoplankton bloom dynamics and
robust measures of phytoplankton community diversity. However, the biochemical mechanisms
of carbon fixation by primary producers are well understood (Badger et al., 1998; Brzezinski et al.,
1998) and can be modeled with relatively high fidelity even in these global models with limited
spatial resolution.
Currently, the global biogeochemical models used for climate predictions (with a few
exceptions) do not include explicit micro-heterotroph populations primarily because we lack
fundamental understanding of the biochemical mechanisms of remineralization by heterotrophic
organisms (Segschneider and Bendtsen, 2013). Remineralization is typically represented in global
models as a first-order rate constant, sometimes with temperature dependence (Aumont et al.,
2015), which acts to convert different classes of organic matter back to inorganic nutrients. As a
result, we are not modeling the spatial and temporal variability in remineralization rates across
oceanographic regions as heterotrophic populations change in response to changing environments.
In effect, this means that our global climate models are missing the real, and essential,
heterotrophic biotic factors that vary remineralization rates spatially and temporally. It is therefore
essential to classify these heterotrophs into populations based on their biochemical function so
they can be dynamically incorporated into global models in order to capture this variability in
remineralization.
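To make the conventional parameterization concrete, the following is a minimal sketch of remineralization as a first-order decay of an organic matter pool, optionally scaled by temperature. The Q10 form of the temperature dependence, the function names, and all parameter values are illustrative assumptions and are not taken from any specific global model.

```python
import math


def remineralization_rate(k_ref, temp_c=None, t_ref=20.0, q10=2.0):
    """Return a first-order rate constant (per day).

    If temp_c is given, k_ref is scaled with a Q10 relationship
    (an assumed functional form, used here only for illustration).
    """
    if temp_c is None:
        return k_ref
    return k_ref * q10 ** ((temp_c - t_ref) / 10.0)


def remineralize(om0, k, days):
    """Analytic solution of dOM/dt = -k * OM.

    Returns the organic matter remaining and the amount converted
    back to the inorganic pool after `days`.
    """
    remaining = om0 * math.exp(-k * days)
    return remaining, om0 - remaining


# Hypothetical example: 10 units of organic matter, k = 0.1 per day at 20 C
k_warm = remineralization_rate(0.1, temp_c=30.0)
remaining, released = remineralize(10.0, k_warm, days=5.0)
```

In this form the rate depends only on the size of the organic matter pool and, at most, on temperature, so nothing in the calculation responds to the composition of the heterotrophic community.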
Currently, our understanding of the rates of heterotrophic microbial activity in the oceans
comes primarily from bulk measurements such as the rate of uptake of leucine and thymidine
(Chin-Leo and Kirchman, 1988; Fuhrman and Azam, 1982). Bulk measurements are essential for
generating hypotheses on how microbial communities set rates of carbon cycling and are
instrumental for constraining modeling efforts. Specifically, these experimental measurements are
essential to guiding how we parametrize and validate our models as we build them. However, we
do not know the mechanisms that drive variability in these rate measurements. As such, we are not
able to model how these variations in rates and rate kinetics affect key drivers of the global carbon
budget like rates of carbon export, carbon transfer efficiency, and substrate recalcitrance (Henson
et al., 2012, 2011; Nguyen et al., 2022; Zakem et al., 2021). Furthermore, these measurements do
not directly explain the underlying biogeochemical mechanisms that set the values, variability, and
scaling laws of these rates.
To better incorporate the microbial loop into climate modeling we must define a framework
that captures how microbial communities set rates of carbon cycling, particularly remineralization,
and allows us to simplify marine heterotrophic diversity to a limited set of representative
individuals. Taxonomy has long served as a fundamental framework for differentiating organisms
based on the contents of their genomes (Caporaso et al., 2012; Sogin et al., 2006), and we use
taxonomic surveys as a proxy for community function (Wemheuer et al., 2020). However, there is
increasing evidence that function cannot be directly extrapolated from taxonomy (Louca et al.,
2018, 2017, 2016; Matthews et al., 2021; Tully et al., 2018). With the advent of more advanced
genomics techniques including metagenome assembled genomes (MAGs) (Graham et al., 2017;
MetaHIT Consortium et al., 2014; Parks et al., 2017; Tully et al., 2018; Venter, 2004)
and single-cell amplified genomes (SAGs) (Martinez-Garcia et al., 2012; Pachiadaki et al., 2019;
Sieracki et al., 2019; Stepanauskas and Sieracki, 2007; Swan et al., 2013, 2011) which allow for
the sequencing of uncultured organisms, we now have a growing database of marine heterotrophic
genomes. This wealth of genomic data serves as the foundation for new analyses and new ways of
describing, quantifying, and categorizing heterotrophic metabolism.
In the following three chapters, I will establish the need to develop new frameworks for
categorizing microbial metabolism, specifically functional or metabolic guilds. While there is a
great diversity of marine microbial heterotrophs, in this dissertation I will primarily focus on
marine heterotrophic bacteria. In Chapter 2, I develop a new Bayesian statistical method for
defining functional guilds and demonstrate that this approach can be used to define guilds from a
large, diverse set of draft marine genomes. Having established the need for, and ability to define,
functional guilds, in Chapter 3 I develop a set of metabolic guilds based on genome-scale metabolic
models (GEMs) and present a simple, but powerful measure of model quality for these GEMs. I
identify 8 metabolic guilds from a large set of high quality GEMs, separated into 4 distinct growth
types, based on the specific preferences and requirements of these GEMs for several classes of
organic compounds. In Chapter 4, I refine my analysis of GEMs to identify specific properties that
can distinguish low and high quality metabolisms, represented as networks, including a significant
increase in metabolite import and export in low quality networks. I also identify key model
reactions that may be exceptionally important to annotate consistently and correctly in marine
heterotrophs to develop more high quality GEMs.
Chapter 2 : Identification of Microbial Metabolic Functional Guilds
from Large Genomic Datasets
1 Introduction
Microbes are the engines that drive many global processes critical for maintaining Earth as
a habitable planet, including the cycling of carbon and nitrogen. In particular, heterotrophic
microbes (bacteria and archaea) control the rate at which organic compounds are cycled
(Falkowski et al., 2008; Fuhrman and Azam, 1982, 1980; Pomeroy, 1974), which has important
implications for atmospheric CO2 concentrations and thus climate. However, we currently have a
limited knowledge of what sets the rate of organic matter cycling (Dittmar et al., 2021; Zakem et
al., 2021) and how these rates vary as a function of microbial community composition.
Global ecological models which are used to study large-scale carbon cycling typically
consider the impact of microbial heterotrophy to be a constant or a bulk approximation acting on
a generic organic carbon pool (Aumont and Bopp, 2006; Séférian et al., 2013). Thus, these models
are unable to capture variations in rates of biogeochemical cycling driven by dynamic and diverse
microbial communities. This is partially due to the lack of a tractable framework for explicitly
modelling complex heterotrophic microbial communities, their biogeochemical function, and how
these functions vary both temporally and spatially. Such a framework requires understanding
organismal-level metabolic potential (i.e., which metabolic pathways co-occur within individual
cells) and how microbes assemble to form communities. While such a framework exists for
phytoplankton (Quere et al., 2005; Raitsos et al., 2008), we lack a similar framework for defining
meaningful heterotrophic functional types or metabolic functional guilds. Metabolic functional
guilds are defined here as groups of organisms that are capable of the same biogeochemical or
ecological function (e.g., nitrogen fixation or chitin degradation) in an ecosystem.
Microbial communities have primarily been characterized using the amplification of
marker genes (e.g., 16S small subunit RNA gene). Analysis of functional diversity has either relied
upon ‘omics analyses (Larkin et al., 2021; Ustick et al., 2021; Venter, 2004; Yooseph et al., 2007)
or closest cultured representatives (Hornick and Buschmann, 2018; Roth Rosenberg et al., 2021;
Staley et al., 2014). The former provides an accounting of which genes are present but does not
provide insight into which functions are co-occurring within individual organisms. The latter
extends phylogenetic analyses to gain insight into function by using genomic data from the closest
cultured representative via tools such as PICRUSt or Tax4Fun2 (Langille et al., 2013; Wemheuer
et al., 2020). While this provides insights into the metabolic potential of the community, it relies
on having a cultured representative, whereas the vast majority of organisms in the ocean do not have
such representatives (Parks et al., 2017; Sogin et al., 2006). In addition, the cultured representative
approach relies on the assumption that biogeochemically relevant functions are highly
phylogenetically conserved, which may not always hold due to high rates of horizontal gene
transfer (McDaniel et al., 2010). Several experimental and observational studies have
demonstrated that function and phylogeny are often decoupled in a variety of environments (Louca
et al., 2018, Louca et al., 2017, Louca et al., 2016; Tully et al., 2018). Pangenomics has revealed
microdiversity within individual species that results in genetically distinct species sub-groups or
sub-clades (Delmont and Eren, 2018) further complicating the link between function and
phylogeny.
Recent advances in bioinformatic techniques have allowed for the high throughput
assembly of organismal genomes from metagenomes, termed metagenome assembled genomes
(MAGs) (Graham et al., 2017; Imelfort et al., 2014; Kang et al., 2019, 2015; Lu et al., 2016;
MetaHIT Consortium et al., 2014; Strous et al., 2012; Wu et al., 2016). In addition, microfluidics
techniques have enabled the sequencing of single cells (single-cell amplified genomes (SAGs))
(Martinez-Garcia et al., 2012; Pachiadaki et al., 2019; Sieracki et al., 2019; Stepanauskas and
Sieracki, 2007; Swan et al., 2013, Swan et al., 2011). Combined, these innovations have led to
large datasets of publicly available annotated MAGs and SAGs (Klemetsen et al., 2018; Pachiadaki
et al., 2019; Paoli et al., 2021), thus dramatically increasing our knowledge of microbial diversity.
Most notable is the Tara Oceans circumnavigation expedition (Sunagawa et al., 2015), which
collected metagenomes from a global set of sampling stations that have subsequently been
assembled into thousands of MAGs (Baker et al., 2015; Graham et al., 2018; Lombard et al., 2014;
Rawlings et al., 2018; Zhang et al., 2018; Zhou et al., 2019). These large, well-annotated datasets
provide an unprecedented opportunity to assess co-occurring functions within a cell for uncultured
organisms.
Here, we present a new statistical approach for defining microbial metabolic functional
guilds and show that the guilds we identify are specific and ecologically relevant. This approach
also establishes a framework that can be used to generate new hypotheses for co-occurring
functions. As our approach is agnostic to phylogeny with no a priori phylogenetic data provided,
this framework provides an excellent tool for interrogating the metabolic potential of uncultured
organisms. This work lays the foundation for defining microbial communities in terms of
metabolic functional guilds that will allow us to better understand the role that dynamic microbes
play in determining rates of biogeochemical cycles.
2 Materials and Methods
2.1 Dataset
Three different sources of genomes were used for this analysis: metagenome-assembled
genomes (MAGs), isolate genomes (i.e., from cultures), and single-cell amplified genomes (SAGs).
Specifically, we used 1,859 MAGs (Tully et al., 2018) assembled from the Tara Oceans
metagenomes (Sunagawa et al., 2015) using the BinSanity v0.2.6.1 technique and assembly
pipeline (Graham et al., 2017). Only bins that met the following minimum requirements were
assigned as draft genomes and included as MAGs: >90% complete and <10% contamination, 80-
90% complete with <5% contamination, or 50-80% complete with <2% contamination. These
genomes can be found at NCBI under BioProject ID PRJNA391943. A total of 6,872 SAG
genomes were obtained from the GORG-Tropics database (Pachiadaki et al., 2019) which can be
found at NCBI under BioProject ID PRJEB33281 and at Open Science Framework under
DOI 10.17605/OSF.IO/PCWJ9. Only SAGs with at least 70% completeness were included in our
analysis (N=1,733). In addition, 967 isolate genomes and 980 genomes with unresolved
provenance (i.e., unclear from the metadata if MAGs or isolates) were obtained from the MarDB
(Klemetsen et al., 2018) (https://mmp.sfb.uit.no/databases/) (accessed 31 May 2018). A composite
genomic dataset was generated using the Tara Oceans MAGs, isolates, and MarDB genomes
(N=3,840). The 1,733 SAGs separately formed a second dataset and a third dataset was generated
of the 1,859 known MAGs from the composite dataset, resulting in two datasets comprised solely
of SAGs or MAGs, respectively. These two datasets were used to compare the resulting guilds
between two distinct methods of determining genome reconstructions.
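The inclusion criteria described above can be expressed as a simple filter. The sketch below is illustrative only: the function names are hypothetical, and the handling of genomes that fall exactly on a tier boundary (e.g., exactly 80% or 90% complete) is an assumption, since the text does not specify it.

```python
def passes_mag_quality(completeness, contamination):
    """Tiered draft-genome criteria for MAGs (values in percent).

    >90% complete: <10% contamination
    80-90% complete: <5% contamination
    50-80% complete: <2% contamination
    Boundary handling (>= 80, >= 50) is an assumption.
    """
    if completeness > 90:
        return contamination < 10
    if completeness >= 80:
        return contamination < 5
    if completeness >= 50:
        return contamination < 2
    return False


def passes_sag_quality(completeness):
    """SAGs were retained with at least 70% completeness."""
    return completeness >= 70


# Hypothetical bins given as (completeness, contamination) in percent
bins = [(95.0, 8.0), (85.0, 8.0), (85.0, 4.0), (60.0, 1.5), (40.0, 0.0)]
kept = [b for b in bins if passes_mag_quality(*b)]
```

Applied to the hypothetical bins above, the filter retains the first, third, and fourth entries; stricter contamination limits at lower completeness reflect the tiers described in the text.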
Genomes from the composite and SAG datasets were classified with the GTDB taxonomy
toolkit (GTDB-Tk) (Chaumeil et al., 2022) using r207 of the Genome Taxonomy Database (Parks
et al., 2018). GTDB-Tk v2.1.0 utilized Prodigal v2.6.3 (Hyatt et al., 2010) to predict genes on the
3,840 input genomes provided as FASTA nucleotide sequence files. The set of 120 bacterial and
53 archaeal target marker genes used in GTDB-Tk were identified with HMMER 3 v3.1b2 (Eddy,
2011). Phylogenetic estimation was performed with FastTree2 v2.1.11 (Price et al., 2010) and
then FastANI v1.32 (Jain et al., 2018) and Mash v2.3 (Ondov et al., 2016) were used to confirm
phylogenetic groups with ANI measures. Quality analysis of the genomes in both datasets was
performed with CheckM v1.2.1 (Parks et al., 2015). The average completeness for the composite
dataset was 90.8% with an average contamination of 1.5% and the average completeness for the
SAG dataset was 80.6% with an average contamination of 0.15%. Phylogenomic trees were
constructed for the full set of genomes using GToTree v1.7.05 (Lee, 2019), as well as for the guilds
shown in Table 1 (see Appendix A1) using the taxonomic classifications from GTDB-Tk to
annotate each tree. Much like GTDB-Tk, GToTree utilized Prodigal v2.6.3 (Hyatt et al., 2010) to
predict functional genes for the 3,840 input genomes provided as FASTA sequence files. Target
genes from the pre-built Archaea_and_Bacteria gene set (25 genes) were identified with HMMER
3 v3.3.2 (Eddy, 2011), aligned with muscle v5.1 (Edgar, 2021), trimmed with TrimAl v1.4
(Capella-Gutierrez et al., 2009), and concatenated before phylogenetic estimation was performed
with FastTree 2 v2.1.11 (Price et al., 2010).
To further assess the phylogenetic diversity of the composite dataset, we also computed
the average nucleotide identity (ANI) and average amino acid identity (AAI). ANI values were
computed on whole genomes using FastANI v1.33 (Jain et al., 2018), while AAI values were
computed using FastAAI v0.1.20 (https://github.com/cruizperez/FastAAI). FastAAI used
Pyrodigal (Larralde, 2022), a Python library binding to Prodigal (Hyatt et al., 2010), to predict
genes and PyHMMER (Larralde and Sincomb, 2022) to perform the alignments to FastAAI’s
single-copy protein (SCP) datasets. A full breakdown of this pipeline is presented in Appendix A1.
We selected 212 experimentally verified and well-characterized metabolic pathways from
the KEGG database (Ogata et al., 1999) (Appendix Table A1). These functions were chosen due
to their biogeochemical (e.g., nitrogen fixation, methanogenesis) and ecological (e.g., motility,
chemotaxis) relevance. All genomes were then analyzed using KEGG-Decoder v0.6sbp and
KEGG-Expander v0.5 (Graham et al., 2018) to assign the presence or absence of the 212 pathways.
KEGG-Decoder is informed by KEGG pathways/modules; however, specific steps and key
biogeochemical reactions are broken out to reflect essential steps. Specifically, several different
criteria or thresholds were used to determine whether pathways were present in a given genome.
KEGG-Decoder first assumes that core metabolisms must be present for normal cellular function
in most organisms, and thus a fragmentary core pathway is unlikely to be non-functional.
Thus, for core metabolisms (e.g., glycolysis, gluconeogenesis, ATP synthase) a low threshold
of 25% total gene presence was used. Conversely, KEGG-Decoder assumes that the same is not
true for complex or geochemically relevant pathways, so a higher threshold is implemented to ensure
that it is tracking actual functionality rather than misannotation. Thus, for pathways that were
either complex (e.g., multiple branching options), geochemically relevant (e.g., thiosulfate
oxidation), or both (e.g., secretion pathways), a total gene presence of 50-75% was
required. An intermediate threshold of 33-40% total gene presence was used for simple pathways
consisting of 3 to 4 genes. For “pathways” that consist of only a single reaction, presence/absence
was determined directly.

Table 2.1: Top 15 functions based on score (see Methods) for two aspects related to DMSP degradation and motility.
Functions that constitute the resulting DMSP and motility guilds are highlighted in bold or bold and italics. SBP is the
substrate-binding protein associated with the respective ABC transporter.

| DMSP aspect function | Score | Motility aspect function | Score |
|---|---|---|---|
| DMSP demethylation | 30.908 | Type II Secretion | 20.603 |
| DMSP lyase (dddLQPDKW) | 29.901 | Ubiquinol Cytochrome c reductase | 18.733 |
| sulfite dehydrogenase (quinone) | 27.231 | Cytochrome-c oxidase cbb3-type | 17.174 |
| trimethylamine methyltransferase | 22.441 | Flagellum | 12.752 |
| dimethylamine/trimethylamine dehydrogenase | 17.902 | phospholipid SBP | 12.180 |
| putative simple sugar SBP | 16.735 | Chemotaxis | 11.285 |
| microcin C SBP | 13.544 | Glyoxylate shunt | 7.971 |
| Ubiquinol Cytochrome c reductase | 13.391 | thiamin biosynthesis | 7.577 |
| taurine SBP | 13.029 | phosphate transporter | 7.430 |
| glycine betaine/proline SBP | 12.989 | Cytochrome bd complex | 7.406 |
| general L-amino acid SBP | 12.160 | Type I Secretion | 7.304 |
| spermidine/putrescine SBP | 11.625 | cationic peptide SBP | 7.006 |
| putative spermidine/putrescine SBP | 11.493 | ammonia transporter | 6.610 |
| tungstate SBP | 10.723 | Sec/SRP | 6.484 |
| thiosulfate oxidation | 10.663 | TCA cycle | 6.458 |
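The threshold rules described above can be illustrated with a short Python sketch. This is not KEGG-Decoder's interface: the category labels, the gene names, and the single cut-off chosen from each stated range (33% from 33-40%, 50% from 50-75%) are hypothetical placeholders used only to show how a fraction of genes present is turned into a binary call.

```python
# Hypothetical category labels and cut-offs. The text gives ranges for two of
# the categories; the single values below are placeholders for illustration.
THRESHOLDS = {
    "core": 0.25,                     # core metabolisms
    "simple": 0.33,                   # simple pathways of 3 to 4 genes
    "complex_or_geochemical": 0.50,   # complex and/or geochemically relevant
}


def pathway_present(genes_found, pathway_genes, category):
    """Return True if enough of a pathway's genes were annotated in a genome."""
    pathway_genes = list(pathway_genes)
    if len(pathway_genes) == 1:
        # Single-reaction "pathways": presence/absence is read off directly.
        return pathway_genes[0] in genes_found
    n_hits = sum(gene in genes_found for gene in pathway_genes)
    return n_hits / len(pathway_genes) >= THRESHOLDS[category]


# Hypothetical genome annotation: a set of gene identifiers found in one genome.
genome = {"geneA", "geneB", "geneX"}
four_gene_pathway = ["geneA", "geneB", "geneC", "geneD"]  # 2 of 4 genes present

print(pathway_present(genome, four_gene_pathway, "core"))
print(pathway_present(genome, ["geneA", "geneC", "geneD", "geneE"], "complex_or_geochemical"))
print(pathway_present(genome, ["geneX"], "simple"))
```

Repeating such a call for every genome and pathway would yield the binary genome-by-function matrix used in the rest of the pipeline.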
This large binary dataset was used as input for metabolic guild identification using both
the classic methods and our new Aspect Bernoulli-based method (see below). It is important to
note that the AB method presented here is not restricted to this number of functions and can be
extended to include as many functions or hypothetical proteins as the user desires. Furthermore,
genome annotations can be performed in any manner the user desires so long as the resulting data
matrix is binary. However, we emphasize that the choice of annotations is paramount in determining
the types of metabolic signal that the method can recover. This is a discovery-based
dimension reduction method and as such can only directly identify patterns based on the
data presented to it.
2.2 Classic Methods
We tested several clustering and dimensionality reduction methods to try to identify
microbial metabolic guilds including Nonmetric Multidimensional Scaling (NMDS) (Kruskal,
1964) of the functions and complete linkage hierarchical clustering of both the genomes and
functions concurrently. NMDS was performed using the metaMDS function from the vegan
package v2.6.4 (Oksanen et al., 2019) in R v4.2.3 with two dimensions, Bray-Curtis dissimilarity
(Bray and Curtis, 1957), and a maximum of 50 iterations. We also analyzed our composite dataset
using an agglomerative hierarchical clustering method using the clustergram function from the
Statistics and Machine Learning toolbox v12.1 from MATLAB R2021a (The Math Works, Inc.,
2021). We applied these two statistical methods to our composite dataset of 3,840 genomes and
assessed their ability to extract a low-dimensional structure of co-occurring functions in the form
of guilds.
Ultimately, we sought a method that could reduce our data to a lower number of dimensions
with defined and clear separation into clusters of functions that represent metabolic guilds.
Therefore, it was essential that our method could identify signals of metabolic guilds driven by
relatively rare functions even in the presence of high abundance functions such as core carbon
metabolism or housekeeping genes. This aspect was important because we expected many of these
core metabolisms to strongly co-occur due to their essential nature and thus could potentially limit
our ability to define more biogeochemically relevant metabolic functional guilds. We found that
an augmented Aspect Bernoulli model was best able to accommodate all of these requirements.
We present this model and the underlying statistical method that defines this approach in the
following section.
2.3 Aspect Bernoulli
We used the Aspect Bernoulli (AB) model (Bingham et al., 2009) to perform a statistical
matrix decomposition of our binary data matrix $X \in \{0,1\}^{G \times F}$. The AB model was selected as it is
designed for sparse matrices of binary data. AB is similar to Latent Dirichlet Allocation (LDA),
which has been applied to similar problems (e.g., topic modeling, population structure (Blei, 2003;
Pritchard et al., 2000)); however, LDA is not designed to handle binary data. The AB model assumes that each
entry $x_{g,f}$ in the data matrix $X$ is a random Bernoulli realization of an underlying scalar probability
$p_{g,f} \in [0,1]$. Here $g$ denotes genome, and $f$ denotes function. In other words, the AB method
assumes that the observed pattern in the data is the result of a Bernoulli coin flip based on the
probability of a specific function occurring in a specific genome. Thus we can define another matrix
$\{p_{gf}\}_{g=1,\ldots,G,\; f=1,\ldots,F}$ with the same dimensions as the data matrix that represents these underlying
probabilities.
We then assume that this matrix of probabilities $\{p_{gf}\}_{g=1,\ldots,G,\; f=1,\ldots,F}$ can be defined as the
product of two additional matrices $\Gamma$ and $\beta$ such that:

$$p_{gf} = \Gamma_{g\cdot}\,\beta_{\cdot f} \qquad \text{(Eq. 1)}$$

for each probability $p_{gf}$ in the matrix. The $\Gamma$ and $\beta$ matrices are of size $G$ by $K$ and $K$ by $F$,
respectively, where $G$ is the total number of genomes in the data set and $F$ is the total number of
functions. These two matrices allow us to identify $K$ groups or aspects in our dataset (see
Terminology box for definition). Here we use the term aspect instead of guild because the aspects
contain all functions in the dataset. As we describe below, we then can define metabolic functional
guilds based on the $\beta$ matrix, which provides the probability that function $f$ is present in a given
genome if that genome is associated with the $k^{th}$ aspect. Specifically, if $\beta_{kf}$ is close to 1 then
function $f$ is highly associated with aspect $k$. The $\Gamma$ matrix quantifies how strong the $k^{th}$ aspect
is within each genome $g$. Specifically, if $\Gamma_{gk}$ is close to 1 then genome $g$ is strongly associated
with aspect $k$. $\beta$ and $\Gamma$ are then optimized using an iterative Expectation Maximization (EM)
algorithm as described in Bingham et al. (2009). For a full, rigorous description of the methods
please see Appendix A1.2.
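To make the decomposition concrete, the following is a minimal NumPy sketch of an EM fit for the likelihood described above. It is a textbook EM for this mixture formulation, written under our own assumptions (random initialization, a small `eps` guard, no convergence check or random restarts); it is not the implementation used in this work, and the update equations in Bingham et al. (2009) may be written in a different but equivalent form. The function names `fit_aspect_bernoulli` and `log_likelihood` are hypothetical.

```python
import numpy as np


def log_likelihood(X, Gamma, beta, eps=1e-12):
    """Bernoulli log-likelihood of X under P = Gamma @ beta."""
    P = np.clip(Gamma @ beta, eps, 1.0 - eps)
    return float(np.sum(X * np.log(P) + (1 - X) * np.log(1.0 - P)))


def fit_aspect_bernoulli(X, K, n_iter=200, seed=0, eps=1e-9):
    """Fit Gamma (G x K, rows sum to 1) and beta (K x F, entries in [0, 1])."""
    rng = np.random.default_rng(seed)
    G, F = X.shape
    Gamma = rng.dirichlet(np.ones(K), size=G)       # genome-by-aspect weights
    beta = rng.uniform(0.25, 0.75, size=(K, F))     # aspect-by-function probabilities
    present = (X == 1)[:, None, :]                  # G x 1 x F mask of presences
    for _ in range(n_iter):
        # E-step: responsibility of aspect k for entry (g, f).
        lik = np.where(present, beta[None, :, :], 1.0 - beta[None, :, :])
        q = Gamma[:, :, None] * lik                 # G x K x F
        q /= q.sum(axis=1, keepdims=True) + eps
        # M-step: re-estimate Gamma and beta from the responsibilities.
        Gamma = q.mean(axis=2)
        Gamma /= Gamma.sum(axis=1, keepdims=True)
        beta = (q * present).sum(axis=0) / (q.sum(axis=0) + eps)
        beta = np.clip(beta, eps, 1.0 - eps)
    return Gamma, beta


# Toy data: two groups of genomes, each carrying its own block of functions.
X_demo = np.zeros((20, 6), dtype=int)
X_demo[:10, :3] = 1
X_demo[10:, 3:] = 1
Gamma_demo, beta_demo = fit_aspect_bernoulli(X_demo, K=2, n_iter=100, seed=1)
print(np.round(beta_demo, 2))
```

The sketch holds a G x K x F array in memory, which is manageable for a matrix of a few thousand genomes and a few hundred functions but would need restructuring for much larger inputs.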
One key advantage of the AB method is that the use of the matrix of probabilities
$\{p_{gf}\}_{g=1,\ldots,G,\; f=1,\ldots,F}$ allows the method to deal with inaccuracies in the data (e.g., false absences or
presences) as detailed in Bingham et al. (2009). Specifically, the AB method can accommodate
instances where the presence (or absence) of a function in the genome is otherwise inconsistent with
the main aspects associated with it.
2.4 Scoring
In order to define metabolic functional guilds (see Terminology box for definition) from
the AB model output, we needed a way to quantify the relative importance of functions within an
aspect. To this end, we introduced a post-processing score to order the functions within each aspect
such that two conditions were met: 1) functions that were strong indicators of membership in that
aspect were scored highly (i.e., if that function was present in a genome then it was likely that
aspect $k$ was present); 2) genomes that were identified as being associated with aspect $k$ were
likely to contain functions at the top of aspect $k$’s list (i.e., if genome $g$ was associated with aspect
$k$ it was likely to have function A which was at the top of aspect $k$’s list). The functions that
combined to define a metabolic functional guild could then be identified based on high-ranking
functions in the aspect lists.
To meet the first condition, we posed the following question: having observed a function
$f$ to be present in a randomly chosen genome $g$, how likely was it that the function was present
due to aspect $k$? We could quantify this likelihood by calculating

$$m_{fk} = \frac{1}{G} \sum_{g=1}^{G} P(z_{gfk} = 1 \mid x_{gf} = 1), \qquad \text{(Eq. 2)}$$

where $z_{gfk} = 1$ denotes that the entry for function $f$ in genome $g$ was generated by aspect $k$.
Using Bayes' rule, we computed the above conditional probability in terms of the AB parameters:

$$P(z_{gfk} = 1 \mid x_{gf} = 1) = \frac{\Gamma_{gk}\,\beta_{kf}}{p_{gf}}. \qquad \text{(Eq. 3)}$$

Next, we identified the genomes that were most strongly associated with each aspect (i.e., had
large $\Gamma$ values). We will hereafter refer to this set of genomes $A_k \subseteq \{1, \ldots, G\}$ as aspect $k$’s
“probabilistic representatives”. We filtered $\{1, \ldots, G\}$ into $K$ non-overlapping sets $A_1, \ldots, A_K$; each
set $A_k$ was defined as the genomes $g$ that placed the highest value of $\Gamma_{g,\cdot}$ on $k$ and also had a large
enough $\Gamma_{g,k} = P(z_{gfk} = 1)$ (specifically, $\Gamma_{g,k} > 2/K$). This $2/K$ threshold ensured that we were
excluding genomes that had nearly uniform $\Gamma$ vectors. For our composite dataset, this threshold
did not exclude any genomes.
From $A_k$, we calculated $w_{fk}$:

$$w_{fk} = \frac{\sum_{g \in A_k} x_{gf}}{\frac{1}{F} \sum_{f'=1}^{F} \sum_{g \in A_k} x_{gf'}}, \qquad \text{(Eq. 4)}$$

which is the ratio of the abundance of each function within $A_k$ to the mean abundance within $A_k$.
Lastly, we multiplied the marginal probability $m_{fk}$ (Eq. 2) by the adjustment factor $w_{fk}$ (Eq. 4).
This gave us the score metric $s_{fk}$ that we used to identify our guilds:

$$s_{fk} = m_{fk} \cdot w_{fk}. \qquad \text{(Eq. 5)}$$

In this score, $w_{fk}$ upweights functions $f$ that are more abundant among probabilistic
representatives of aspect $k$ than average (Figure 2.1) and makes the score $s_{fk} = m_{fk} \cdot w_{fk}$ more
comparable across aspects. Since a function that is highly specific to aspect $k$ is scored highly,
top-scoring functions are attractive candidates for forming metabolic functional guilds from aspects.
Next, we describe how to choose a small set of functions to form such guilds. The full algorithm
for the AB procedure can be found in the extended methods (Appendix A1).
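The scoring steps (Eqs. 2 to 5) can be sketched in NumPy as follows. This is an illustrative reading of the equations rather than the pipeline's code: the function name `score_functions`, the `eps` guard, and the toy `Gamma`, `beta`, and `X` values are all assumptions made for the example.

```python
import numpy as np


def score_functions(X, Gamma, beta, eps=1e-12):
    """Return (m, w, s), each of shape F x K, following Eqs. 2-5."""
    G, F = X.shape
    K = Gamma.shape[1]
    P = Gamma @ beta                                   # Eq. 1, G x F
    # Eqs. 2-3: average over genomes of Gamma[g, k] * beta[k, f] / P[g, f].
    m = np.einsum("gk,kf,gf->fk", Gamma, beta, 1.0 / (P + eps)) / G
    # Probabilistic representatives: genomes whose largest Gamma entry falls
    # on aspect k and exceeds the 2/K threshold.
    top = Gamma.argmax(axis=1)
    strong = Gamma[np.arange(G), top] > 2.0 / K
    w = np.zeros((F, K))
    for k in range(K):
        members = (top == k) & strong
        counts = X[members].sum(axis=0)                # abundance per function in A_k
        mean_count = counts.mean()
        w[:, k] = counts / mean_count if mean_count > 0 else 0.0   # Eq. 4
    return m, w, m * w                                 # Eq. 5


# Toy example with K = 3 aspects, G = 6 genomes, and F = 4 functions.
Gamma_toy = np.array([[0.8, 0.1, 0.1], [0.8, 0.1, 0.1],
                      [0.1, 0.8, 0.1], [0.1, 0.8, 0.1],
                      [0.1, 0.1, 0.8], [0.1, 0.1, 0.8]])
beta_toy = np.array([[0.9, 0.9, 0.1, 0.1],
                     [0.1, 0.5, 0.9, 0.1],
                     [0.4, 0.1, 0.1, 0.9]])
X_toy = np.array([[1, 1, 0, 0], [1, 1, 0, 0],
                  [0, 0, 1, 0], [0, 1, 1, 0],
                  [0, 0, 0, 1], [1, 0, 0, 1]])
m_toy, w_toy, s_toy = score_functions(X_toy, Gamma_toy, beta_toy)
print(np.argsort(-s_toy[:, 0]))   # functions of aspect 0 in decreasing score order
```

Because the posterior probabilities in Eq. 3 sum to one over aspects, each row of `m` sums to one, which provides a quick sanity check on an implementation.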
2.5 Guild identification and mapback genomes
After identifying the probabilistic representatives $A_k$ based on our pipeline, we further
narrowed each aspect down to metabolic functional guilds $\mathcal{G}_k$ according to the scores $s_{fk}$. Then,
we obtained the mapback genomes $M_k$ (see Terminology Box) for guild $\mathcal{G}_k$ as the set of genomes
possessing all of the functions in $\mathcal{G}_k$. We used two alternative approaches to identify the set of
functions that comprise metabolic functional guilds: 1) using a fixed number of functions, 5
functions in this case (Option 1 in Appendix A1.2), or 2) requiring a minimum number of genomes
in the dataset to be associated with a given guild (Option 2 in Appendix A1.2). The number of
mapback genomes is an important quantity in our pipeline, as it quantifies how strongly the original
data supports the proposed metabolic functional guilds. For instance, if we found many mapback
genomes for a fixed-size functional guild, we would be more confident in the validity of that guild.
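The two approaches to guild definition, and the mapback step, can be sketched as below. The helper names are hypothetical, and the stopping rule used for approach 2 (add functions in decreasing score order for as long as enough genomes still carry all of them) is our assumption about how the minimum-genome criterion might be applied; the exact procedure is the one given in Appendix A1.2.

```python
import numpy as np


def mapback_genomes(X, guild_functions):
    """Indices of genomes that carry every function in the guild."""
    return np.flatnonzero(X[:, guild_functions].all(axis=1))


def guild_fixed_size(scores_k, n_functions=5):
    """Approach 1: the n top-scoring functions of one aspect."""
    return np.argsort(-scores_k)[:n_functions]


def guild_min_mapback(X, scores_k, min_genomes=100):
    """Approach 2 (assumed rule): grow the guild in decreasing score order
    while at least `min_genomes` genomes still possess all of its functions."""
    guild = []
    for f in np.argsort(-scores_k):
        if len(mapback_genomes(X, guild + [int(f)])) < min_genomes:
            break
        guild.append(int(f))
    return np.array(guild, dtype=int)


# Toy example: 6 genomes, 4 functions, and made-up scores for one aspect.
X_small = np.array([[1, 1, 0, 1],
                    [1, 1, 0, 0],
                    [1, 1, 0, 1],
                    [1, 0, 1, 0],
                    [0, 1, 1, 0],
                    [0, 0, 0, 1]])
scores_small = np.array([0.9, 0.8, 0.1, 0.5])

top2 = guild_fixed_size(scores_small, n_functions=2)
print(top2, mapback_genomes(X_small, top2))
print(guild_min_mapback(X_small, scores_small, min_genomes=2))
```

Raising `min_genomes` shortens the guild, which mirrors the trade-off between guild size and the number of mapback genomes.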
Figure 2.1: Abundances of functions within an example aspect’s probabilistic representatives, $A_k$, compared to their
score rank before ($m_{fk}$, cyan) and after ($s_{fk}$, orange) applying the score adjustment $w_{fk}$ (step 2). After the adjustment,
a large density of points in the upper left quadrant is observed, indicating that the highest-ranked functions using $s_{fk}$ are
also found within a large number of probabilistic representative genomes.

2.6 Guild specificity
A key objective of the pipeline was to identify functions co-occurring within individual
genomes that were meaningfully associated. Specifically, for a guild $k$ containing functions
A and B, the presence of function A in a genome would ideally indicate both that the genome was a
member of guild $k$ and that the genome would also contain function B. To test the association
between pairs of functions within our guilds, we calculated the confidence (Agrawal et al., 1993)
of seeing B given A ($A \rightarrow B$) as

$$\mathrm{conf}(A, B) = \frac{\sum_{g=1}^{G} x_{gA}\, x_{gB}}{\sum_{g=1}^{G} x_{gA}}, \qquad \text{(Eq. 6)}$$

where A and B are functions from our dataset and $x_{gA}$ and $x_{gB}$ are the presence or absence of A
and B in genome $g$. High confidence values suggested that the presence of function B was highly
conserved with that of function A. We computed the forward and reverse confidence values for
every pair of functions in the guilds identified from our data. Because of the way we defined
mapback genomes, these confidence values were all 1 within our mapback genomes and between
0 and 1 for our ‘outgroup’ genomes (i.e., the rest of the dataset).
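As an illustration of Eq. 6, the confidence of one function given another can be computed directly from two columns of the binary matrix. The sketch below uses a made-up toy matrix and a hypothetical helper name, and shows why the forward and reverse values generally differ.

```python
import numpy as np


def confidence(X, a, b):
    """conf(A -> B): fraction of genomes carrying function a that also carry b."""
    x_a, x_b = X[:, a], X[:, b]
    n_a = x_a.sum()
    return float((x_a * x_b).sum() / n_a) if n_a > 0 else float("nan")


# Toy matrix: 6 genomes (rows) by 4 functions (columns).
X_conf = np.array([[1, 1, 0, 1],
                   [1, 1, 0, 0],
                   [1, 1, 0, 1],
                   [1, 0, 1, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 1]])

print(confidence(X_conf, 0, 3))   # function 0 -> function 3
print(confidence(X_conf, 3, 0))   # function 3 -> function 0

# Restricting to genomes that carry both functions gives a confidence of 1,
# which is why only the outgroup genomes are informative.
both = X_conf[(X_conf[:, 0] == 1) & (X_conf[:, 3] == 1)]
print(confidence(both, 0, 3))
```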
2.7 Artificial datasets
The number of aspects, $K$, is a free parameter in the AB model that determines the
maximum number of guilds that can be identified. The ideal choice of $K$ is dataset specific and is
a function of the underlying structure of the data matrix. To test the impact of this choice on the
resulting guilds identified by our method, we constructed a large collection of synthetic datasets
comprised of either one or three artificial guilds appended to our original composite dataset of
3,840 genomes and 212 functions. These guilds were defined to be “perfect” guilds where genomes
either had all the artificial guild functions or none of them. For example, an artificial guild with 5
functions and 2% total abundance in the dataset would have all 5 functions perfectly co-occurring
in 77 genomes, while the remaining 3,763 genomes would not possess any of these artificial
functions (all zeros). Guild parameters were drawn from three possible abundances (2%, 5%, or
10% of the genomes containing the artificial guild) and three possible sizes (guilds consisting of
5, 7, or 9 functions) with all unique combinations tested (Appendix Table A2). Each artificial guild
was inserted in a non-overlapping manner such that each genome could only belong to a maximum
of one artificial guild. For each combination, we created 100 replicates of our synthetic data.
Additional sensitivity analyses were conducted where we assigned guilds randomly, allowing
some genomes to belong to multiple artificial guilds (Appendix A2).
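The construction of a synthetic dataset with "perfect", non-overlapping artificial guilds can be sketched as follows. The function name and the use of an all-zero placeholder in place of the real composite matrix are assumptions for illustration; the sketch reproduces the worked numbers in the text (2% of 3,840 genomes rounds to 77).

```python
import numpy as np


def append_artificial_guilds(X, sizes, abundances, seed=0):
    """Append one block of perfectly co-occurring columns per artificial guild.

    Genomes are drawn without replacement so that no genome belongs to more
    than one artificial guild.
    """
    rng = np.random.default_rng(seed)
    G = X.shape[0]
    n_members = [int(round(a * G)) for a in abundances]
    chosen = rng.choice(G, size=sum(n_members), replace=False)
    blocks, start = [], 0
    for size, n in zip(sizes, n_members):
        rows = chosen[start:start + n]
        start += n
        block = np.zeros((G, size), dtype=X.dtype)
        block[rows, :] = 1          # members carry all of the guild's functions
        blocks.append(block)
    return np.hstack([X] + blocks)


# Placeholder for the 3,840 x 212 composite matrix (all zeros here).
X_placeholder = np.zeros((3840, 212), dtype=np.int8)

one_guild = append_artificial_guilds(X_placeholder, sizes=[5], abundances=[0.02])
print(one_guild.shape, one_guild[:, 212:].sum(axis=0))

three_guilds = append_artificial_guilds(
    X_placeholder, sizes=[5, 7, 9], abundances=[0.02, 0.05, 0.10], seed=1)
print(three_guilds.shape)
```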
2.8 Data Visualization
All data visualizations in MATLAB were performed using the Statistics and Machine
Learning Toolbox v12.1 from MATLAB R2021a (The Math Works, Inc., 2021). Data
visualizations in R v4.2.3 were performed using the ggplot2 v3.4.2 and ggbreak v0.1.1 packages
(Wickham, 2009; Xu et al., 2021) as well as the lattice v0.21.8 package (Sarkar, 2008).
3 Results
3.1 Phylogeny of Datasets
The phylogeny of our composite dataset of 3,840 genomes was assessed using GToTree
and GTDB-Tk. From this large dataset, 65 genomes (60 archaeal, 5 bacterial) were excluded due
to insufficient marker gene coverage. Another 39 genomes that were included in the tree were
flagged during the quality assessment step for high redundancy estimates (average 16.7%
redundancy) but were still highly complete (average 95.7% completeness). Of the 3,775 high
quality genomes, there were 3,529 bacterial genomes representing 51 unique bacterial phyla.
Among these phyla were the key marine superphylum Proteobacteria (Yarza et al., 2014) with
1,774 genomic representatives as well as other notable phyla such as the Cyanobacteria (108
genomes), Bacteroidota (545 genomes), Firmicutes (111 genomes), Desulfobacterota (55
genomes), and the Verrucomicrobiota (91 genomes). In addition, there were 246 archaeal genomes
representing 2 unique archaeal phyla, Thermoplasmatota and Thermoproteota. Appendix Figure
A1 shows the full phylogenomic tree visualized in the iTOL web application (Letunic and Bork,
2021), which is colored by individual bacterial phylum identity.
We passed our high-quality SAG dataset of 1,733 genomes through GToTree and GTDB-Tk as well and determined phylogeny for 1,415 genomes (Appendix Figure A2). A total of 318 genomes
(301 bacterial, 17 archaeal) were excluded for insufficient marker gene coverage while three of
the included genomes were flagged during the quality assessment step for high redundancy
estimates (average 14% redundancy). Of the 1,415 high quality genomes, there were 1,409
bacterial genomes representing 9 unique bacterial phyla and 6 archaeal genomes representing 2
unique archaeal phyla. Like the composite dataset, many of the bacterial genomes were classified
in the phylum Proteobacteria (1,158 genomes). The next two largest phyla were Bacteroidota (103)
and Cyanobacteria (83). Collectively, these three phyla accounted for 95.4% of all SAGs with an
ascribed bacterial phylogeny.
3.2 Classic Methods
We applied two classic statistical methods (NMDS and clustergram) to our dataset and
assessed their ability to extract low-dimensional structure of co-occurring functions in the form of
guilds. The results of the NMDS are shown in Figure 2.2 where each point in the NMDS represents
a function such that clusters of points could, potentially, indicate guilds. No distinct features
emerge along either axis. The majority of data points group into a dense cloud of points with no
clear separation along an axis of variance. While approaches for analyzing variance in reduced
dimensions such as NMDS can be powerful for identifying clusters of similarly acting samples,
NMDS was unable to identify clusters that could be interpreted as metabolic guilds when applied
to our dataset.
Next, we present results using a standard clustering approach, namely hierarchical clustering, as
implemented by clustergram. Here, we clustered both the genomes and functions (rows and columns)
using the Jaccard distance metric with complete linkage and two different cut heights, 0.9 and 1
(Figure 2.3). We selected the Jaccard distance for clustergram because of the binary format of our data.
However, unlike the AB method, Jaccard treats all presences/absences equally and
thus does not provide differential weights for rare versus highly abundant functions. We chose to
use cut heights of 0.9 and 1 based on the resulting dendrograms as they produced clusters among
both rare and high abundance functions. At lower cut heights, we found that a large bulk of the
functions clustered out as singletons and the clusters that did form were primarily the core, high
abundance functions. Thus, we felt that 0.9 and 1 were good values for comparing the microbial
metabolic functional guilds identified by clustergram and AB.
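For readers who want to reproduce this style of analysis outside MATLAB, the clustering of functions by Jaccard distance with complete linkage and a fixed cut height can be approximated with SciPy. This is a swapped-in equivalent under our own assumptions, not the clustergram call used here, and the two tools may differ in details such as how ties and cut heights are handled. The helper name `cluster_functions` and the toy matrix are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_functions(X, cut_height):
    """Cluster the columns (functions) of a binary genome-by-function matrix."""
    distances = pdist(X.T.astype(bool), metric="jaccard")   # function-function distances
    tree = linkage(distances, method="complete")
    return fcluster(tree, t=cut_height, criterion="distance")


# Toy matrix: functions 0 and 1 always co-occur, as do functions 2 and 3,
# but the two pairs never occur in the same genome.
X_blocks = np.array([[1, 1, 0, 0],
                     [1, 1, 0, 0],
                     [0, 0, 1, 1],
                     [0, 0, 1, 1]])

labels_09 = cluster_functions(X_blocks, cut_height=0.9)
labels_10 = cluster_functions(X_blocks, cut_height=1.0)
print(labels_09, labels_10)
```

In this toy case a cut height of 0.9 keeps the two pairs apart, whereas a cut height of 1 merges everything, since the maximum Jaccard distance is 1.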
Applying clustergram to our data with a cut height of 0.9 yielded 30 distinct clusters of
functions that we interpreted as potential metabolic guilds (Figure 2.3). These clusters averaged
5.8 functions (range 2-42 functions) and 38.8 mapback genomes (range 3-354 genomes). Nearly
20% of the total functions (N=42) were in a single guild of highly abundant core functions that
drove the clustering of the remainder of the dataset. We also tested clustergram with a cut height
of 1, which produced 17 distinct clusters of functions. The average number of functions in a cluster
increased to 11.1 (range 2-66 functions) but the number of mapback genomes
dropped sharply to an average of just 3.2 mapback genomes (range 0-17 genomes) per guild. Seven
of these guilds had no mapback genomes and the two largest guilds alone accounted for 46.7% of
the total data used for this clustering procedure.

Figure 2.2: Results of the NMDS run on the composite dataset. Points plotted are the loadings of the functions in the
dataset on MDS axes 1 and 2. Points are semi-transparent to emphasize points that overlap one another. Note: the NMDS
algorithm did not reach convergence; the minimum stress value was 0.211.
We identified several disadvantages of the classic statistical methods. Firstly, large
numbers of core metabolisms found in many genomes (such as housekeeping genes, core carbon
metabolism, etc.) formed huge guilds with few mapback genomes, which were therefore not
informative as metabolic guilds (see Figure 2.3). Secondly, these methods do not permit
functions to be part of more than one guild, which is inconsistent with the high functional
redundancy that has been demonstrated in microbial communities (Louca et al., 2018; Louca et
al., 2017; Louca et al., 2016; Tully et al., 2018). Finally, these methods do not provide an intrinsic
ranking of the importance of each function for defining a guild (e.g., which functions are strong
indicators of membership in the guild). Below we compare the guilds from clustergram to those
of the AB model and demonstrate that both methods identify similar guilds but that clustergram
both breaks the AB guilds up into smaller groups (fewer functions) and results in guilds with fewer
mapback genomes. Thus, the AB method is better able to capture metabolic functional guilds that
contain a meaningful number of functions (>3) with substantial numbers of mapback genomes.

Figure 2.3: Resulting clustergram plot on the presence/absence pathway data for our composite dataset (red = present,
black = absent) using a cut height of 0.9 with rows (genomes) and columns (functions) clustered based on Jaccard
distance.
3.3 AB model
Below we present an assessment of the robustness of the AB model for detecting guilds, a
summary of the AB model guilds from the composite dataset, and then a comparison between the
AB model and the classic methods.
3.3.1 Choosing a value for K
The AB model requires the user to define $K$ prior to running the algorithm. To test the
impact of the choice of $K$ on the ability to detect different sized guilds (i.e., numbers of functions)
and guilds with different abundances in the dataset (i.e., frequency), we ran the artificial datasets
through the method with a wide range of $K$ values ($K = 5, \ldots, 20$). This analysis (described in
Appendix A2 and summarized below) identified a clear trade-off between using low $K$ values,
which inhibited the detection of low abundance guilds, and using high $K$ values, which overfitted
the dataset. What qualifies as a ‘low’ versus ‘high’ $K$ value will be dataset specific. The analysis
described below allows the user to identify a range of reasonable $K$ values for a given dataset and
the type of guilds (e.g., abundance and size) that are being targeted in the analysis. For this study,
we manually assessed guilds derived from $K$ values within the identified range in order to select
our final value of $K$ ($K = 10$). We recommend that a similar analysis be performed prior to
applying this method to a new dataset.
We quantified the ability of our method to identify artificial guilds in our artificial datasets
(see Methods) over a range of $K$ values using two metrics: Hit Rate and Extra Hits. Hit Rate
describes the overall frequency with which we identified our artificial guilds. In the ideal case, we
would observe all of an artificial guild’s functions present at the top of the score-ordered function
list (top 15) in exactly one aspect. So, for a simulation using three distinct artificial guilds, we
would expect to see three hits per simulated dataset (i.e., each guild showing up at the top of only
one aspect list) which would give us a 100% hit rate, or a hit rate frequency of 1. Extra Hits
catalogues instances where we observed an artificial guild occurring at the top of more than one
aspect list, i.e., an artificial guild being divided across two aspects.
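The two metrics can be sketched in plain Python as below. The counting conventions are our assumptions for illustration: a guild counts as a hit if all of its functions appear in the top 15 of at least one aspect list, and every additional aspect list that also contains the full guild counts as one extra hit. The precise definitions used in the analysis are those given in Appendix A2, and the function and variable names here are hypothetical.

```python
def count_aspects_with_guild(aspect_rankings, guild_functions, top_n=15):
    """Number of aspects whose top_n functions include the whole guild."""
    guild = set(guild_functions)
    return sum(guild <= set(ranking[:top_n]) for ranking in aspect_rankings)


def hit_rate_and_extra_hits(simulations, top_n=15):
    """simulations: list of (aspect_rankings, artificial_guilds) pairs,
    one pair per simulated dataset."""
    hits = extra = total = 0
    for aspect_rankings, guilds in simulations:
        for guild in guilds:
            n = count_aspects_with_guild(aspect_rankings, guild, top_n)
            hits += int(n >= 1)        # the guild was recovered at least once
            extra += max(n - 1, 0)     # the guild was split or duplicated across aspects
            total += 1
    return hits / total, extra


# One toy "simulation": three score-ordered aspect lists and three guilds.
rankings = [["f1", "f2", "f3", "a", "b"],
            ["a", "b", "c", "f9"],
            ["x", "y", "z"]]
guilds = [["a", "b"], ["x", "y", "z"], ["q"]]
rate, extra_hits = hit_rate_and_extra_hits([(rankings, guilds)])
print(rate, extra_hits)
```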
The size of the guild and abundance of the guild in the dataset impacted the ability of the
method to identify artificial guilds at different $K$ values (Figure 2.4). As guild size and abundance
in the dataset increased, the hit rate at low $K$ values increased to 1. In other words, it was easier to
identify larger and more abundant guilds, as one might expect. When $K$ was low, extra hits were
zero. As we increased the value of $K$, the hit rate remained high, but we started to see extra hits.
When guilds were large and/or abundant, extra hits increased more quickly and at lower values of
$K$ than for smaller and less abundant guilds. This analysis demonstrated that if the choice of $K$ was
too small only the largest and most abundant guilds were identified (under-fitting system). On the
other hand, if $K$ was too large, guilds showed up in multiple aspects (over-fitting system). We
concluded that a good range for $K$ was around the point where hit rate was maximized while extra
hits remained zero. A full analysis of the impact of guild size, guild abundance, and $K$ value on
guild identification, as well as the impact of randomly inserting guilds and number of artificial
guilds inserted, is presented in Appendix A2.
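The selection rule stated above (maximize hit rate while extra hits remain zero) can be written as a small helper. The function name and the numbers in the example are hypothetical and are not results from this study; in practice the candidate range would still be assessed manually, as described in this section.

```python
def candidate_k_values(results):
    """results maps K -> (hit_rate, extra_hits); return the K values with the
    highest hit rate among those that produced no extra hits."""
    no_extra = {k: hit_rate for k, (hit_rate, extra) in results.items() if extra == 0}
    if not no_extra:
        return []
    best = max(no_extra.values())
    return sorted(k for k, hit_rate in no_extra.items() if hit_rate == best)


# Made-up summary of a simulation sweep, for illustration only.
sweep = {5: (0.40, 0), 8: (0.90, 0), 10: (1.00, 0), 12: (1.00, 0), 15: (1.00, 3)}
print(candidate_k_values(sweep))
```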
We also tested various numbers of iterations for the expectation-maximization (EM)
algorithm, implemented as detailed by Bingham et al. (2009), to determine how quickly the model
converged to a local maximum. For each iteration value (ranging from 10 to 1,500 steps) we
initialized and ran 10 random restarts. For our chosen value of $K = 10$, the likelihood appeared to
plateau at its maximum value after approximately 500 iterations (Appendix Figure A12). We also
assessed the stability of the AB results and showed that the identification of guilds was consistent
across runs initialized with different random seeds (Appendix Figure A13).

Figure 2.4: Hit Rate and number of Extra Hits for 100 simulated datasets with three artificial guilds inserted in a
non-overlapping manner across a range of K values. Results are colored by the guild parameters where #fn denotes the
number of functions in each artificial guild. The red (#fn = 5/Abundance = 0.02) versus the green (#fn = 5/Abundance
= 0.1) lines illustrate the impact of a change in guild abundance. The impact of guild size on Hit Rate and Extra Hits
can be seen in Supplemental Table 2.
3.3.2 Guild identification in composite dataset
The method successfully identified guilds within the composite dataset that were found
in a substantial number of genomes in the dataset and contained functions that were specific to that
guild (see Methods). When defined using the top 5 scoring functions (approach 1), the resulting guilds
averaged 116.2 mapback genomes (range 11 to 468 genomes). When
guilds were defined to include functions co-occurring within at least 100 genomes (approach 2),
the average guild size was 5.7 functions per guild (range 2 to 20 functions). Figure 2.5 shows the
number of mapback genomes present in the dataset as the number of functions defining each guild
is increased from 2 to 20.

Figure 2.5: Number of genomes that possess all of the functions in a guild (mapback genomes) as guild size is expanded
to include more functions in decreasing score order (starting at size 2).

Both approaches for defining guilds resulted in guilds comprised of functions that were
specific to that guild. When looking at the co-occurrence of each pair of functions from the guild
set of functions (guild function pairs), low confidence values were observed in the outgroup
genomes for each guild function pair as compared to the value of 1 for the guild function pairs in
the mapback genomes (by definition). Guilds identified using approach 1 (top 5 scoring functions)
had a 0.455 average confidence value in the outgroup genomes. However, many pairs of functions
were substantially less conserved in the outgroup genomes (i.e., these pairs were strongly indicative of
membership in the guild). For this we looked at the minimum outgroup confidence value across
all pairs of functions in each guild (i.e., the two functions that most strongly indicated membership
in the guild). For approach 1, the average across all 10 guilds ($K = 10$) of the minimum confidence
values was 0.09 (range 0.029-0.132). In other words, functions A and B in guild $k$ were found
together only ~10% of the time in the non-mapback genomes and 100% of the time in the mapback
genomes. Guilds defined using approach 2 (~100 mapback genomes) had a 0.338 average
confidence value in the outgroup genomes and a 0.029 (range 0-0.105) average minimum
confidence value. Figure 2.6 shows an example heatmap of both the forward and reverse
confidence values for a putative DMSP guild. Low confidence values for the outgroup genomes
confirm that this method identified functional co-occurrences that are specific only to a subset of
genomes.

Figure 2.6: Specificity of guild function pairs for a guild related to the degradation of DMSP. Values are shown for the
confidence of the guild function pairs in the outgroup genomes such that low values indicate high specificity of the guild
function pairs for the DMSP guild. Note that the colorbar is scaled from 0 to 0.8. The diagonal is omitted since it is 1 by
definition. The axes are non-symmetric because DmdA → ddd* is fundamentally different from ddd* → DmdA (see Eq. 6).
3.4 Comparison between AB model and clustergram guilds
We compared the guild sizes and mapback genome numbers of the clustergram guilds to
guilds generated using the AB method approaches 1 & 2. Figure 2.7 shows the distribution of guild
sizes versus number of mapback genomes for each of these three methods. Based on our simulated
data analysis described in Section 3.3, we determined that � = 10 was an appropriate number of
guilds for the AB method.
Overall, we found that the
clustergram method
identified more guilds with
fewer functions and fewer
mapback genomes than the
AB method. Specifically,
with a cut height of 0.9,
clustergram identified
three times as many guilds
(N=30) as the AB method
(N=10). Of these 30
clustergram guilds, the
majority (60% of the guilds)
possessed 3 or fewer
functions with 33.3% of
the guilds constituting just
a pair of functions. When
Figure 2.7: Distribution of guild sizes (number of functions) and number of
mapback genomes for guilds generated with clustergram at cut heights of 0.9 (blue
square) and 1 (purple plus sign) as well as AB. AB approach 1 (red circle) defined
guilds using a fixed size of 5 functions while AB approach 2 (green triangle)
defined guilds using a minimum mapback genome cutoff of 100 genomes. Points
were jittered using the built-in position_jitter function in the ggplot2 package v3.4.2
with h=0.1, w=0.35 using the random seed 123.
29
we used the conservative criteria of at least 100 mapback genomes per guild (approach 2), the AB
method generated a comparable number of guilds with 3 or fewer functions (50% of the total
guilds). However, the two methods differ substantially in terms of number of mapback genomes
identified for each guild. Clustergram yielded guilds with an average of 38.8 mapback genomes
per guild, substantially less than the two AB methods which averaged 116.2 and 142.9 mapback
genomes for approaches 1 and 2, respectively. When we reduced the threshold for AB approach 2
to the clustergram average of 39 mapback genomes per guild, we found just one guild with 3 or
fewer functions (10% of the total guilds). To make a more direct comparison to the clustergram
guilds, we re-ran the AB pipeline with K = 30. Allowing for a higher number of guilds in the AB
method resulted in a similar number of mapback genomes per guild as the runs with K=10 with an
average of 113 mapback genomes (range 0-1436) for approach 1 and with only one guild having
no mapbacks. Allowing for a larger number of guilds in the AB method results in a high frequency
of full or partial guild duplication (see Figure 2.4 and Section 3.3.1).
To test the impact of the cut height on guild size, we increased the clustergram cut height
to 1 (Appendix Figure A8). This results in a more similar number of total guilds (17 for
clustergram compared to 10 for AB) between the different methods. A cut height of 1 reduced the
number of small clustergram guilds (3 or fewer functions) to 41.2%. However, this even further
decreased the number of mapback genomes for each guild (average of 3.2 genomes per guild with
some guilds having no mapbacks). At each cut height, clustergram identified one large guild, with 42
functions (cut height=0.9) or 66 functions (cut height=1), corresponding to 19.8% and 31.1%
of all functions in the dataset, respectively. This large guild consisted entirely of highly
abundant functions and was substantially larger than the largest guild produced by AB approach 2
(28 functions using the lower threshold of 39 or more mapback genomes). Furthermore, the large
clustergram guild had just 4 and 0 mapback genomes for cut heights 0.9 and 1, respectively, while
the 28 function AB guild had 61 mapback genomes. Finally, we tried using a dynamic cut height
method for clustering functions which improved the guild sizes and number of mapback genomes
over the static height, but still resulted in guilds with fewer mapback genomes than the AB guilds
(see Appendix A1.3).
We next assessed the differences in the guild functions identified by the two methods using
AB approach 1 where guilds were defined with a static number of functions. We observed several
recurring patterns. When using a cut height of 0.9 for clustergram, the 5 AB guild functions
were typically split between 2 distinct clustergram guilds (range 1-3 guilds) with
only two of the ten AB guilds being contained within a single clustergram cluster. When we look
at the clustergram guilds that contain the AB guild functions, we find that they average 52.8
mapback genomes compared to 116.2 for the corresponding AB guilds. This suggests that the AB
method is finding groups of functions that are more commonly found together in the dataset.
Increasing the cut height to 1 resulted in fewer clustergram clusters and marginally reduced the
fragmentation of AB guilds between clustergram guilds with AB guilds now being split across 1.7
clustergram guilds on average (range 1-3 guilds). At this linkage, the clustergram guilds which
contained the AB guild functions had on average 30 additional functions (range 5.5-61) and only
0.33 mapback genomes (range 0-1) compared to the corresponding AB guilds which had 116.2
mapback genomes (range 11-468). There were several instances (4 of 10), where the AB guild
functions clustered fully or partially into the large clustergram guild with 66 functions containing
the highly abundant functions in the dataset with no mapback genomes.
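The fragmentation statistic reported above (the number of clustergram guilds across which the functions of one AB guild are split) can be illustrated with a minimal Python sketch. The function name, the function labels f1-f5, and the cluster assignments are hypothetical and are not taken from the pipeline or the dataset.

```python
def count_guild_splits(ab_guild, cluster_of):
    """Count the distinct clustergram clusters that hold the AB guild's functions."""
    return len({cluster_of[function] for function in ab_guild})


# Hypothetical cluster labels for five functions (illustration only).
cluster_of = {"f1": 1, "f2": 1, "f3": 2, "f4": 2, "f5": 2}
ab_guild = ["f1", "f2", "f3", "f4", "f5"]

n_splits = count_guild_splits(ab_guild, cluster_of)
print(n_splits)  # 2
```

Averaging this count over all AB guilds would give a mean fragmentation value of the kind reported in the text.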
This analysis demonstrated that both the AB and clustering methods are able to identify
functional guilds from our dataset and that there was overlap in the functions that were grouped
together into guilds using the two methods. We show that the AB guilds both contained more
functions and were more highly represented in the dataset (have substantially more mapback
genomes) than the guilds defined using the clustering method. As with any method, there are both
advantages and disadvantages to the AB method. One disadvantage of the AB method is the need
to choose a value of the free parameter K which determines the number of guilds identified (see
discussion above in Section 3.3.1). However, we demonstrate how a user can use our pipeline to
make an informed decision as to the best value for K. Another key distinction between the two
methods is that clustering methods precisely define the functions belonging to each guild. The AB
method provides information both about which functions are strong indicators of the guild and
which genomes have a high probability of membership in the guild. The user must then decide
which set of functions to define as a guild. We provide two approaches for making this distinction
and highlight how this additional information generated by the AB method can be used to generate
hypotheses (see discussion below in Section 4.1). Additional advantages to the AB method are that
the AB method does not require that all functions be members of a guild nor that a function be a
member of just one guild and that the AB method can distinguish between false and true
absences/presences in the dataset. Lastly, it is important to note for the AB method that if there are
mapback genomes for a guild then the guild is by definition meaningful (i.e., found in the dataset).
However, the absence of a guild from the output does not imply that the guild does not exist. The AB method
might not have identified a guild for several other reasons, including other structures in the data
matrix which can make rare guilds difficult to find or the absence of a key annotation that is crucial
for distinguishing it from the rest of the dataset.
4 Discussion
4.1 Emergent microbial metabolic guilds
Our approach identified several biogeochemically relevant metabolic functional guilds
with numerous genome representatives in the composite dataset. It is important to note that these
guilds emerged from this analysis without any curation or a priori knowledge applied. As such,
the identification of known guilds (e.g., photosynthesis) is a strong indication that the method is
able to detect biologically meaningful phenomena even when these associations are in low
abundance in the dataset. Here we highlight three emergent guilds and draw connections to
previously identified co-occurring biochemical processes. The other 7 guilds identified by the
method are also of significance (11-235 mapback genomes) and are listed in Appendix Table A4.
For example, we identify a guild associated with phosphorus acquisition (C-P lyase genes, see
Section 4.2) and several associated with different types of carbon metabolisms (see Guilds 8 and
9 in Appendix Table A4). However, for succinctness, we describe in detail just three guilds which
illustrate the power of the AB method.
The photosynthetic functions served as a nice test case of our method. Our composite
dataset was curated in such a way that photosystems I and II were only present in 2.5% (N=95)
and 2.7% (N=105) of the genomes, respectively. However, our method was able to identify a
photosynthesis guild with 10 total functions including photosystems I and II, NAD(P)H quinone
oxidoreductase, cytochrome b6f complex, and RuBisCO (Appendix Table A4). This 10 function
guild had 12 mapback genomes in the composite dataset. We were also able to identify this
photosynthetic guild in the SAG dataset where Photosystems I and II have abundances of 6.3%
and 5.8% respectively. The identification of this well characterized system provided an excellent
‘ground truth’ validation of our method.
The approach identified a guild related to the consumption of the organic sulfur compound
dimethylsulfoniopropionate (DMSP). This guild consisted of DMSP demethylation, DMSP lyase,
and sulfite dehydrogenase (quinone) and had 139 mapback genomes. These three functions were
the highest ranked functions within a single aspect (Table 1). For this analysis, we assessed the
presence of at least one of 7 different DMSP lyases (DddL, DddQ, DddP, DddD, DddK, DddY
and DddW). DMSP lyase has been shown experimentally to co-occur with the enzyme DMSP
demethylase (DmdA), which performs the demethylation reaction for DMSP (Reisch et al., 2011,
2008) – though this association is not obligatory. These pathways have been characterized in
abundant marine clades, such as Roseobacters (Moran et al., 2007) and SAR11 (Tripp et al., 2008).
Sulfite dehydrogenase has also been implicated as a potential pathway through which
DMSP-derived sulfur is oxidized from sulfite to sulfate (Reisch et al., 2011).
The AB method suggests that there are several additional functions that might commonly
co-occur with these 3 DMSP related functions (Table 1). For example, taurine and glycine betaine
transport, either into the cell to meet metabolic demands or out of the cell to excrete waste products,
could be features of this guild. In fact, previous work suggests that many Roseobacters utilize a
diverse suite of labile dissolved organic sulfur (DOS) metabolites to meet their sulfur requirements
(Landa et al., 2019). In a co-culture experiment with R. pomeroyi strain DSS-3 and two
phytoplankton species, Landa et al. (2019) demonstrated enriched expression patterns of transport
and catabolism genes for seven sulfur-rich phytoplankton exometabolites, including DMSP and
taurine. These findings are consistent with the fact that both DMSP and taurine are produced in
high concentrations by certain phytoplankton groups (Jackson et al., 1992; Saltzman and Cooper,
1989). The nitrogen-rich compatible solute glycine betaine is also produced by certain
phytoplankton groups (Keller et al., 1999) and has been implicated as a nitrogen source for
Roseobacters (Moran et al., 2007). Therefore, the capacity to use these substrates co-occurring
within a single organism is consistent with known ecological interactions – and might indicate that
organisms in the DMSP guild could be associated with the phycosphere. Including taurine as a 4th
function in the guild resulted in 100 mapback genomes, including glycine betaine as a 4th function
resulted in 134 mapback genomes, and including both (5 function guild) resulted in 98 mapback
genomes.
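The effect of adding functions on the mapback count can be illustrated with a short Python sketch. It assumes that a mapback genome is a genome encoding every function in the guild; the genome IDs, function labels, and helper name are hypothetical.

```python
def mapback_genomes(presence, guild):
    """Return the genomes whose function set contains every guild function."""
    required = set(guild)
    return sorted(genome for genome, functions in presence.items()
                  if required <= functions)


# Hypothetical presence data: genome ID -> set of annotated functions.
presence = {
    "genome_1": {"A", "B", "C"},
    "genome_2": {"A", "B"},
    "genome_3": {"A", "B", "C", "D"},
}

core = mapback_genomes(presence, ["A", "B"])           # all three genomes
extended = mapback_genomes(presence, ["A", "B", "C"])  # genome_1 and genome_3
print(len(core), len(extended))
```

Because every added function is an additional requirement, the mapback count can only stay the same or fall as a guild grows, which matches the pattern seen when taurine and glycine betaine are added to the DMSP guild.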
Thiosulfate oxidation also occurs in the top 15 ranked score list (rank 15). Previous
experimental work has shown that this pathway is involved in DMSP degradation (Reisch et al.,
2011). In fact, if we included thiosulfate oxidation within the DMSP guild, we obtained a guild of
4 DMSP functions with 89 mapback genomes in the composite dataset all co-occurring with a high
degree of specificity (Figure 2.6).
The last example guild was a large guild related to motile microbial lifestyles. The key
functions in the motility guild were type II secretion, cbb3-type cytochrome c oxidase, flagellum,
chemotaxis, ubiquinol cytochrome c reductase, a phospholipid SBP, and the glyoxylate shunt,
totaling seven guild functions with 385 mapback genomes (Table 1). These functions are all
consistent with copiotrophic lifestyles where organisms are motile and capable of responding to
signals in the environment through chemotaxis. Similar to the DMSP guild, a key advantage to
our approach is it provides a list of functions that co-occur with classic ‘copiotrophic’ functions
(e.g., chemotaxis and flagellum) with high specificity to the guild mapback genomes. This can
allow us to develop hypotheses related to the ecological and biogeochemical roles played by this
group. For this motility guild, type II secretion and the glyoxylate shunt co-occur with both
chemotaxis and flagellum with a high degree of specificity (average outgroup confidence of 0.35).
4.2 MAG vs SAG guild comparison
We ran both our MAG and SAG datasets through our method to investigate the differences
in guilds generated by these two different datasets. These datasets not only used different
methodologies but also sampled different oceanographic regions. The MAG dataset
comprised globally distributed samples, most notably 68 sampling sites from Tara Oceans
(Sunagawa et al., 2015) spanning all major oceanographic regions (except the Arctic Ocean) and
3 depths from the surface (5 m) to the mesopelagic zone (600 m). The SAG dataset on the other
hand was obtained from samples primarily located in the North Atlantic and Pacific Oceans at a
mean depth of 70.7 meters and was prefiltered (Pachiadaki et al., 2019). Thus, the expectation is
that these different datasets will yield different guilds because they sampled fundamentally
different communities. Indeed, while guilds related to DMSP, the C-P lyase pathway, motility, and
rhodopsins (Appendix Table A5) were identified in the MAG dataset, the SAG dataset generated
guilds primarily related to the uptake of substrates (Appendix Table A6).
A guild associated with the acquisition of phosphorus was identified in both datasets. In
the SAG dataset, this guild had 163 mapback genomes and comprised four functions:
the C-P lyase complex (PhnGHIJ), CP-lyase operon (PhnFKLMNOP), CP-lyase
cleavage (PhnJ), and a phosphonate transporter (PhnCED). The C-P lyase pathway has been shown
to break down a variety of phosphonate bonds, including phosphonates associated with semi-labile
high molecular weight dissolved organic matter (Metcalf and Wanner, 1993; Sosa et al., 2017;
White and Metcalf, 2004). It is unsurprising to see the CP-lyases grouped together since they are
co-located in a single operon. However, this guild served as another example that our method can
extract well-known functional co-occurrences (our method does not take into account co-location
of genes within the genome). These four functions associated with the SAG phosphorus guild were
also found together in one of the MAG guilds with 62 mapback genomes.
The guilds identified by our method were an emergent property of the dataset itself. This
means that the absence of a known or potential guild in the model output does not necessarily
mean that that guild was not present in the dataset. Using a different collection of annotated
genomes could potentially change the abundances of the functions within the dataset, which could
greatly impact whether the method identified a specific group of functions as a guild or not. For
example, we demonstrated that guilds with abundances of 2% or lower were difficult to consistently
observe. Furthermore, as discussed above, K is a crucial free parameter that needs to be selected
for each novel dataset to which this method is applied. We recommend constraining K using a
similar heuristic approach to the one we describe above or using other previously suggested
methods like the Akaike Information Criterion (Bingham et al., 2009; deLeeuw, 1992).
5 Conclusions
Co-occurrence of metabolic functions has long been studied in the field of biochemistry
where metabolic pathways are elucidated. However, these studies are typically very labor intensive
and require cultured representatives. This can present an issue since only a small fraction of marine
microbes have been cultured (Rappé and Giovannoni, 2003; Steen et al., 2019). Our method
presents a way to generate hypotheses about co-occurring functions across large collections of
genomes without relying on cultured representatives. These hypotheses might aid in future
biochemical studies by providing targeted functions to test.
In addition to generating testable hypotheses, this method presents several potential future
applications. One possibility is in assisting with genome annotation through the incorporation of
hypothetical gene products that have not yet been functionally characterized. One recent study
(Faure et al., 2021) developed a large-scale sequence similarity network to identify protein
functional clusters (PFCs) and demonstrated the potential for characterizing PFCs of previously
unannotated proteins and correlating them with multiple environmental variables. Rather than
focusing on whole community functional composition, our method identifies collections of
ecologically relevant functions that are found to co-occur within assembled and isolate genomes.
Using our method, one could construct a dataset composed of a mix of annotated and unannotated
genes/proteins. Any mapback genomes identified for those hypothetical functions would be
excellent culture candidates for characterizing that hypothetical gene. This method offers the
potential to significantly refine the targeting of these culturing efforts to make them nimbler and
more cost effective.
Understanding microbial metabolic functional guilds is an essential step in describing
microbial communities based on their metabolic activity, particularly for key heterotrophic
communities. Rather than focusing on the functional composition of the entire community, our
method identifies collections of co-occurring functions that form the building blocks of a
community’s functional structure. Defining the community as such will allow us to develop
improved numerical ecosystem models that capture these metabolic capabilities. In addition, it will
help us to better build and validate models such as the trait-based ecosystem model GENOME
described in Coles et al. (2017) that directly simulated the metagenomes and
metatranscriptomes of communities. Furthermore, because our approach is phylogenetically
independent, it also provides the ability to disentangle analyses of function and phylogeny when
assessing the structure of a given community. This provides a window into the level of functional
redundancy present both within a single guild and across the community as a whole. Additionally,
our approach generates hypotheses about potential co-occurring metabolic functions that can be
tested experimentally. Furthermore, since we demonstrate that this approach works for both MAG
and SAG genomes, this method offers the ability to characterize the genomic potential of
uncultured organisms from a wide range of studies.
Chapter 3 : Emergent Metabolic Niches for Marine Heterotrophs
1 Introduction
Classification of heterotrophic microbes into metabolic functional guilds can provide a
framework for coalescing diverse microbial communities (Reynolds et al., 2023) into more
tractable units for incorporation into biogeochemical models (Zakem et al., 2024). Historically, we
have grouped marine microbial heterotrophs into copiotrophic organisms, which thrive in high
resource environments and generally have faster growth rates with flexible metabolisms, and
oligotrophic organisms, which dominate resource poor environments and have slower growth rates
(Koch, 2001). While these broad categories are useful, they do not inherently facilitate defining
metabolic niches or substrate preferences that are critical when considering rates of
biogeochemical cycling. Specifically, there is no intrinsic linkage between fast or slow maximum
growth rates and the substrate preferences for organisms (Liu et al., 2020). In this analysis, we
expand beyond the copiotroph-oligotroph paradigm and independently assess metabolic strategy
and growth rates to generate a generalizable functional categorization of marine microbial
heterotrophic metabolisms.
Genome-scale metabolic models (GEMs) provide a means for translating genomic
information into cellular metabolisms (Oberhardt et al., 2009) but have historically been labor
intensive to generate and have been generally restricted to cultured isolates (Gu et al., 2019). The
advent of fast automated metabolic model construction software such as CarveMe, ModelSEED,
Agora (Henry et al., 2010; Machado et al., 2018; Magnúsdóttir et al., 2017), etc., has enabled
generating GEMs for large numbers of genomes and from uncultured organisms (Mendoza et al.,
2019). Metabolic potential from GEMs can further be explored through the use of flux balance
analysis tools such as CobraPy (Ebrahim et al., 2013). These combined analyses provide insights
into the minimal metabolic requirements for a cell and hypotheses about the preferred substrates
for growth (Régimbeau et al., 2022).
Here we leveraged a large global dataset of marine microbial genomes (Ocean Microbial
Database (OMD)) (Paoli et al., 2022) to identify patterns in metabolic strategies among marine
bacteria through GEMs. Testing model sensitivity to growth on different substrates allowed us to
define unique clusters of marine heterotrophic bacteria with shared growth strategies. We
identified a classic fast-growing copiotrophic cluster, four slow-growing oligotrophic clusters each
with a unique metabolic strategy, and three intermediate growth clusters, also with unique
metabolic strategies. These clusters are found globally but at varying abundances in different
ecological regimes. While clear phylogenetic signals emerged distinguishing the clusters, our
findings also suggest that similar metabolic niches are occupied by distinct taxonomic groups.
2 Methods
2.1 Genomic Data
Genomic data was obtained from the Ocean Microbial Database (OMD) hosted at
microbiomics.io (Paoli et al., 2022), which contains approximately 35,000 microbial genomes
including metagenome-assembled genomes (MAGs), single amplified genomes (SAGs), and
cultured isolates. We included only high quality bacterial genomes as defined by standard
thresholds of > 80% completeness and < 5% contamination (Parks et al., 2017; Tully
et al., 2018). These estimates were determined based on the average of the CheckM (Parks et al.,
2015) and Anvi’o (Eren et al., 2015) completeness and contamination scores. High quality
genomes were then dereplicated using dRep (Olm et al., 2017) with a 95% ANI threshold which
was provided in the OMD metadata. We used the resulting 3,918 high-quality dereplicated
bacterial genomes as our preliminary dataset for analysis.
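The quality filter described above can be sketched in Python as follows. This is an illustrative sketch only: the dictionary keys and the function name are hypothetical and do not correspond to the column names in the OMD metadata.

```python
def is_high_quality(checkm, anvio, min_completeness=80.0, max_contamination=5.0):
    """Apply the thresholds to the mean of the CheckM and Anvi'o estimates (percent)."""
    completeness = (checkm["completeness"] + anvio["completeness"]) / 2
    contamination = (checkm["contamination"] + anvio["contamination"]) / 2
    return completeness > min_completeness and contamination < max_contamination


# Hypothetical estimates for two genomes (illustration only).
passes = is_high_quality({"completeness": 90.0, "contamination": 2.0},
                         {"completeness": 75.0, "contamination": 4.0})
fails = is_high_quality({"completeness": 78.0, "contamination": 1.0},
                        {"completeness": 80.0, "contamination": 1.0})
print(passes, fails)  # True False
```

The first genome passes because the averaged completeness (82.5%) exceeds 80% even though one of the two estimates does not; the second fails because its averaged completeness is 79%.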
2.2 Phylogeny
The phylogenetic tree of the 3,918 bacterial genomes was determined using GToTree v1.7.0
(Lee, 2019) and IQ-TREE v2.0.3 (Minh et al., 2020). We also included the 66 unique bacterial
reference genomes underlying the bacterial metabolic models in the BiGG database (King et al.,
2016) that was used to generate CarveMe’s universal reaction model (Machado et al., 2018). From
these 3,984 total genomes, we created a multiple sequence alignment (MSA) file using the
predefined Bacteria single copy gene (SCG) set in GToTree v1.7.00 (Lee, 2019). During this
process, eight genomes were excluded from the tree due to an insufficient number of hits to the
target SCG set resulting in an alignment file of 3,976 genomes. However, these eight genomes
were still included in our taxonomic analyses of the SOM clusters as we were able to assign their
phylogeny using the Genome Taxonomy Database (GTDB). The MSA file was then passed to
IQ-TREE v2.0.3 using the LG+R10 model with 3,554 amino-acid sites to generate a phylogenetic tree
(Figure 3.1). For a current taxonomy of all genomes in the dataset, we overlaid full taxonomic
assignments generated by GTDB-Tk v2.1.0 (Chaumeil et al., 2022) with the GTDB r214 database
(Parks et al., 2022) onto this tree.
We also calculated a quantitative measure of phylogenetic relatedness, the UniFrac
distance (Lozupone et al., 2011), for subgroups of genomes we defined based on several external
parameters (e.g., SOM cluster, ensemble consensus score, dCUB, etc.). For example, we created
sub-datasets containing genomes assigned to each of our eight SOM clusters that we compared
using UniFrac. UniFrac-Binaries was run using the Striped UniFrac algorithm (McDonald et al.,
2018) to compute unweighted UniFrac scores (Lozupone et al., 2011). The results from this
assessment are presented in Appendix B3.3.
2.3 Model Generation and Quality Assessment
CarveMe v1.5.1 (Machado et al., 2018) was used to generate multiple metabolic models
for each genome, called an ensemble. CarveMe’s ensemble function creates multiple models from
a single genome by randomizing the weighting factors for unannotated reactions before generating
each model using its mixed integer linear programming (MILP) algorithm. Annotated reactions
receive weighting factors based on their gene-protein-reaction score, a metric that reflects the level
of confidence in whether all proteins and subunits required for a reaction to take place are
supported in the genome. We tested a variety of ensemble sizes ranging from 2 models to 100
models to assess the necessary number of model replicates to effectively capture the reaction space
of each genome (Appendix Figure B10). We found that the new reactions added to the total
reaction space started to plateau around an ensemble size of 60 suggesting that 60 models were
sufficient to capture the majority of possible model solutions.
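The saturation check described above (how many new reactions each additional model contributes to the ensemble's reaction space) can be sketched in Python as follows. The reaction IDs and the function name are hypothetical, and the toy ensemble is far smaller than the ensembles tested in this study.

```python
def cumulative_reaction_counts(models):
    """Return the size of the union of reactions after each model is added."""
    seen = set()
    counts = []
    for reactions in models:
        seen.update(reactions)
        counts.append(len(seen))
    return counts


# Hypothetical ensemble of four models, each given as a set of reaction IDs.
models = [{"R1", "R2"}, {"R2", "R3"}, {"R1", "R3"}, {"R3"}]

counts = cumulative_reaction_counts(models)
new_per_model = [counts[0]] + [b - a for a, b in zip(counts, counts[1:])]
print(counts)         # [2, 3, 3, 3]
print(new_per_model)  # [2, 1, 0, 0]
```

A plateau corresponds to the point at which the entries of `new_per_model` approach zero.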
For each of the 3,918 high quality genomes in our dataset, 60 models were generated using
their protein FASTA files as input. CarveMe was run using Python 3.7 and IBM ILOG CPLEX
Optimizer v20.1.0, using the native DIAMOND annotation procedure v0.9.14 (Buchfink et al.,
2015). To assess the quality of the metabolic models generated by CarveMe, we developed a
consensus score metric C. The consensus score is defined as:
C = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{E} \sum_{m=1}^{M} I(X_{mr} = 1)    Eq. 7
where X_mr is the presence/absence matrix of ensemble model reactions across M individually
generated models, r is an individual reaction, R is the total number of reactions in the ensemble,
and E is the ensemble size. In this context, I is the indicator function for the case that reaction r is
present in ensemble model m. In plain terms, C measures the consistency of the CarveMe model
reactions across the ensemble generated for a single genome. If all models in the ensemble
contained all the same reactions then the consensus score would be 1; if only half of the models
had the same set of reactions then the consensus score would be 0.5. Similar to Machado et al.
(2018), we equate the consensus score with overall ensemble quality because significant
dissimilarities between ensemble models suggest that the cutting algorithm in CarveMe was forced
to make more uninformed choices. On the other hand, an ensemble with highly consistent models
suggests that the cutting procedure had sufficient knowledge to consistently include the correct
pathways in the model. Only genomes with a consensus score greater than 0.8 were used for further
analyses (N=1,591).
Figure 3.1: Diversity of dataset, quality of metabolic models, and designation of metabolic clusters. Phylogenetic tree
of all 3,984 bacterial genomes included in this study (including the 66 reference genomes from the BiGG database).
The tree is contextualized by several external rings that describe different qualitative and quantitative components of
the genomes in this study. The first ring around the tree denotes both the position and density of high quality ensembles
within the tree as well as the assignment of these genomes to each of our eight SOM clusters. The second ring shows
the ensemble consensus score (Eq. 7) for each genome in the tree. The third, sparse ring of red lines denotes the
position of the 66 BiGG reference genomes present in the tree. Finally, the fourth and innermost ring shows the
location of the top 15 most abundant orders.
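A minimal Python sketch of a consensus score is given below. It assumes one reading of Eq. 7, namely that C is the average, over all R reactions found anywhere in the ensemble, of the fraction of the E models that contain each reaction. The function name and reaction IDs are hypothetical, and this is not the code used in this study.

```python
def consensus_score(models):
    """Mean, over all ensemble reactions, of the fraction of models containing each."""
    all_reactions = set().union(*models)  # the R reactions in the ensemble
    n_models = len(models)                # the ensemble size E
    total = sum(
        sum(1 for model in models if reaction in model) / n_models
        for reaction in all_reactions
    )
    return total / len(all_reactions)


# Hypothetical ensemble of four models, each given as a set of reaction IDs.
ensemble = [{"R1", "R2", "R3"}, {"R1", "R2"}, {"R1", "R2", "R4"}, {"R1", "R2"}]
score = consensus_score(ensemble)
print(score)  # 0.625

identical = consensus_score([{"R1", "R2"}, {"R1", "R2"}])
print(identical)  # 1.0
```

Under this reading, R1 and R2 occur in every model and R3 and R4 in one model of four, so the score is (1 + 1 + 0.25 + 0.25) / 4 = 0.625, which would fall below the 0.8 cutoff.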
2.4 Compound Classification
To provide an assessment of broad metabolic strategies, we analyzed the CarveMe model
growth sensitivities by compound classes. To do this, compounds that were used as substrates
(imported into the cell by the model) were manually classified into the following 13 major
categories: carboxylic acids, amino acids and derivatives, peptides,
nucleobases/nucleosides/nucleotides and derivatives, carbohydrates and derivatives,
ketones/aldehydes, organic sulfur, phospholipids/fatty acids and triglycerides, alcohols, amines
and amides, B vitamins, inorganics, and ‘other’ (Appendix Table B2). We excluded the inorganics
and ‘other’ categories from our downstream analyses to focus on the eleven categories with organic
substrates necessary for growth. References used in the categorization included ChEBI (Hastings
et al., 2016), NIH PubChem (Kim et al., 2023), BiGG Database (Norsigian et al., 2019;
Schellenberger et al., 2010), HMDB (Wishart et al., 2022), BioCyc (Caspi et al., 2020; Karp et al.,
2019), ChemSpider (Pence and Williams, 2010), ECMDB (Guo et al., 2012; Sajed et al., 2016),
and prior knowledge. There were 2,467 external exchange reactions in the universal model
representing the acquisition of compounds from the environment or media. Only 633 of the 2,467
appeared as external reactions in any of our models. We then classified the 456 compounds that
showed up the most frequently and accounted for the majority of the total flux into the models
across all 95,460 CarveMe models generated for this study. Specifically, our classified compounds
accounted for at least 90% of the total import flux in 98.3% of models across all of our sensitivity
tests (N=1,050,060). The 177 compounds that were not included each appeared in fewer than 10
of the 95,460 total models in the dataset.
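The coverage criterion above (the share of a model's total import flux carried by classified compounds) can be sketched in Python as follows. The compound IDs, flux values, and function name are hypothetical, and fluxes are assumed to be given as non-negative import magnitudes.

```python
def classified_flux_fraction(import_fluxes, classified):
    """Return the fraction of total import flux carried by classified compounds."""
    total = sum(import_fluxes.values())
    if total == 0:
        return 0.0
    covered = sum(flux for compound, flux in import_fluxes.items()
                  if compound in classified)
    return covered / total


# Hypothetical import fluxes for one model (arbitrary units).
import_fluxes = {"glucose": 8.0, "alanine": 1.5, "rare_compound": 0.5}
classified = {"glucose", "alanine"}

fraction = classified_flux_fraction(import_fluxes, classified)
print(fraction >= 0.9)  # True
```

Repeating this calculation for every model and sensitivity test, and counting how often the fraction is at least 0.9, would give a summary of the kind reported above.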
2.5 Growth sensitivity analysis
We used the CobraPy v0.25.0 (Ebrahim et al., 2013) software to test the CarveMe model
growth sensitivities under a wide range of substrate availability. First, we assessed the type and
quantity of compounds preferred for growth for each of the CarveMe models under replete
conditions (where we define replete conditions as having maximum flux of all substrates available
to the model). Specifically, we estimated the maximum model growth rate using the slim_optimize
function in CobraPy with all possible media components turned on. We then determined the
minimal set of compounds that allowed the previously determined maximum model growth rate
using the CobraPy minimal media prediction (minimal_medium function). This function solves a
mixed integer linear programming (MILP) problem to minimize the import fluxes (external
exchange reactions) while maintaining the maximum model growth rate.
Compound-specific growth sensitivities for each of our 11 growth compound classes were
then determined for each model. For each substrate compound class (defined above in section 2.4),
the available flux for that class was supplied at 50% of the import flux value in the ‘replete
conditions’ while all other substrates were allowed to reach their maximum values. Any medium
component from the limited growth compound class that was not originally predicted as part of
the minimal medium of a given model was made unavailable to prevent models from
circumventing the substrate limitation. We then assessed how the substrate import fluxes shifted
under these limitation scenarios and the resulting change in predicted growth rate. Sensitivities
were computed on a [0,1] scale using the following equation:
2 \times \left(1 - \frac{\mu_n}{\mu}\right)    Eq. 8
where μ_n is the predicted growth rate under substrate limitation by compound class n and μ is the
predicted growth rate in the ‘replete conditions’. The eleven compound-specific growth
sensitivities that were estimated per model then served as the input data for the SOM clustering.
The full enumeration of the compound specific growth sensitivities for each genome can be found
in Appendix Table B1.
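The sensitivity calculation can be sketched in Python as follows, assuming that Eq. 8 is twice one minus the ratio of the limited growth rate to the replete growth rate. The function name and the growth rates are hypothetical. The result lies on the [0,1] scale as long as the limited growth rate stays between half of and equal to the replete rate.

```python
def growth_sensitivity(limited_growth, replete_growth):
    """Return 2 * (1 - limited/replete) for one compound class."""
    if replete_growth <= 0:
        raise ValueError("replete growth rate must be positive")
    return 2.0 * (1.0 - limited_growth / replete_growth)


# Hypothetical growth rates (per hour) for one model.
replete = 1.0
insensitive = growth_sensitivity(1.0, replete)   # growth unchanged by the limitation
intermediate = growth_sensitivity(0.75, replete)
fully_limited = growth_sensitivity(0.5, replete)  # growth halved with the flux
print(insensitive, intermediate, fully_limited)   # 0.0 0.5 1.0
```

A value of 1 therefore indicates that growth fell in proportion to the 50% reduction in available flux, and a value of 0 indicates that the model compensated fully with other substrates.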
2.6 Validation on Experimentally Characterized Genomes
To validate our CarveMe models and growth sensitivity analysis, we compared our model
results to experimentally validated measurements on a shared set of genomes. Specifically, we
constructed model ensembles for 176 marine bacterial genomes that were experimentally tested
for their ability to grow on a variety of sugar/acid substrates (Gralka et al., 2023). The data from
this study included binary measurements of growth/no growth on 118 compounds as the sole
carbon substrate in the media and a prediction of sugar/acid preference (SAP). Of the 176 genomes,
146 generated high quality CarveMe models (above the consensus threshold of 0.8).
We conducted a paired comparison of the experimental and model predictions of substrate
growth for the 146 high quality genomes that had been assessed for growth on the range of carbon
substrates. Of the 118 experimentally measured compounds, 59 had corresponding reactions in the
BiGG database. Our results exclude one of these 59 compounds, oxaloacetate, which is highly
unstable and rapidly degrades to pyruvate (Wilcock and Goldberg, 1972) and confounds the
fidelity of the growth experiment with that compound as the sole carbon source. For presence in
the model, we required the external exchange reaction for the substrate to be present in >80% of
the ensemble models. For absence, we required that the exchange reaction be absent in >80% of
the ensemble models. 1.5% of the model-substrate comparisons fell between these two cutoffs and
so were not assigned an outcome. For each genome and each substrate, we assigned one of three
outcomes: 1) agreement between the models and the data (either presence/growth or absence/no growth); 2) disagreement between the models and the data (absence in the model/growth in the
data); or 3) false positives where the models contained the exchange reaction but the organism was
not able to grow on the substrate as a sole carbon source.
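The outcome assignment described above can be sketched as follows. The function names and outcome labels are hypothetical; the cutoffs follow the >80% presence and >80% absence rules stated in the text, evaluated on integer counts to avoid floating-point edge cases.

```python
from typing import Optional


def model_call(n_present: int, n_models: int) -> Optional[bool]:
    """Return True (present), False (absent), or None (between the cutoffs)."""
    if 5 * n_present > 4 * n_models:               # present in >80% of models
        return True
    if 5 * (n_models - n_present) > 4 * n_models:  # absent in >80% of models
        return False
    return None


def compare(n_present: int, n_models: int, grew: bool) -> str:
    """Classify one genome-substrate comparison against the growth assay."""
    call = model_call(n_present, n_models)
    if call is None:
        return "unassigned"
    if call == grew:
        return "agreement"
    if grew:
        return "disagreement"    # absent in the model, growth in the data
    return "false positive"      # present in the model, no growth in the data
```

For an ensemble of 60 models, an exchange reaction found in exactly 48 models (80%) would not pass the presence cutoff under this reading and would be left unassigned.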
To assess the agreement between the modeled growth sensitivities of these organisms and
the experimental findings for sugar/acid preference (SAP), we computed the growth sensitivities
of the CarveMe model ensembles (see Section 2.5). We used the sensitivity to carbohydrates for
the sugar preference assessment and the sensitivity to amino acids for the acid preference
assessment. The 146 genomes with high quality models were grouped into two groups based on
whether they were sugar-preferring organisms (SAP > 0, N=77) or acid-preferring organisms (SAP
< 0, N=69). We then compared the relative sensitivities for each of these classes to their
experimentally assigned SAP value by determining the average growth sensitivity of the models
associated with the genomes in each group to our sugar and acid compound classes. The results
from this assessment are presented in Appendix B1.
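A minimal sketch of this grouping step is given below. The record layout (SAP value, carbohydrate sensitivity, amino acid sensitivity) and the function name are assumptions made for illustration; the example values are not taken from the study.

```python
from statistics import mean


def mean_sensitivity_by_preference(records):
    """Average the two sensitivities within the sugar- and acid-preferring groups.

    Each record is (sap, carbohydrate_sensitivity, amino_acid_sensitivity).
    Genomes with SAP > 0 are treated as sugar-preferring, SAP < 0 as acid-preferring.
    """
    groups = {"sugar": [], "acid": []}
    for sap, carb, amino in records:
        if sap > 0:
            groups["sugar"].append((carb, amino))
        elif sap < 0:
            groups["acid"].append((carb, amino))
    return {
        name: {
            "carbohydrates": mean(c for c, _ in rows),
            "amino_acids": mean(a for _, a in rows),
        }
        for name, rows in groups.items()
        if rows
    }


# Illustrative values only.
example = mean_sensitivity_by_preference(
    [(0.5, 0.8, 0.2), (0.3, 0.6, 0.0), (-0.4, 0.1, 0.9)]
)
```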
2.7 Generation of Self-Organized Maps
To identify clusters of organisms with similar metabolic strategies, we applied Self-Organized Maps (SOMs) to the compound-specific growth sensitivities. SOMs are
an unsupervised machine learning dimension reduction method capable of handling large data
formats (Kohonen, 1990). SOMs are a non-parametric approach, capable of highlighting nonlinear,
complex patterns in two-dimensional space from highly dimensional data. The map was built using
the CobraPy growth sensitivity analysis for the 95,460 high-quality ensemble models (C ≥ 0.8).
These scaled compound flux predictions were clustered using kohonen v3.0.12 (Wehrens and
Buydens, 2007) and solved over 1,500 iterations with a learning rate vector of (0.025, 0.01) and
default neighborhood radii on a 20-by-20 toroidal, hexagonal grid spatially described by standard
Euclidean distance. Map parameters were determined using heuristics and metrics of error
proposed in the SOMs literature (Céréghino and Park, 2009; Kalteh et al., 2008; Kiviluoto, 1996;
Kohonen, 1990; Park et al., 2003) and are discussed further in Appendix B2. Each node in the grid
was initialized with a random codebook vector of values for each independent variable. Data
entries were then randomly drawn from the dataset - every entry in the dataset was drawn in each
iteration - and the grid point values of the closest neighborhood of nodes were updated. After
sufficient training, the values assigned to each grid point reflect the spatial topology of the data
(e.g., density of data points, variation) as well as the full range of values in the original dataset.
The final SOM map was then grouped into eight distinct clusters using k-means clustering
(Hartigan and Wong, 1979) based on the coherence of the growth compound sensitivity predictions.
The full map and designation of the clusters is shown in Appendix Figure B11a. After 1,500
iterations, the mean object distance to its closest map unit (the quantization error) was
approximately 1×10⁻⁴ (Appendix Figure B11b).
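The quantization error reported above can be illustrated with a short sketch. This is not the kohonen implementation used in the study; it is a plain Python reading of "mean object distance to its closest map unit" under Euclidean distance, with hypothetical function names.

```python
import math


def best_matching_unit(x, codebook):
    """Index of the codebook vector closest to x under Euclidean distance."""
    return min(range(len(codebook)), key=lambda i: math.dist(x, codebook[i]))


def quantization_error(data, codebook):
    """Mean distance from each data vector to its best-matching unit."""
    if not data:
        raise ValueError("data must not be empty")
    total = 0.0
    for x in data:
        total += math.dist(x, codebook[best_matching_unit(x, codebook)])
    return total / len(data)


# Two toy data points and two toy map units.
toy_error = quantization_error([(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, 0.0)])
```

In the study each data vector would hold the eleven compound-specific growth sensitivities of one ensemble model, and the codebook would hold the 400 node vectors of the 20-by-20 grid.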
As the SOM map and clusters were built using all of the ensemble models (60 per genome),
we then needed to assign each genome to a cluster. This was done by assigning each model to its
closest mapping unit, and determining the mapping unit possessing a simple majority of the 60
models generated from a single genome. This mapping unit was designated as the mapping node
for the genome and the genome was assigned to the associated SOM cluster. We assessed the
frequency with which each of the 60 ensemble models occurred in a single SOM cluster (Appendix
Figure B11c) and showed that 96.0% of the genomes had 90% of their models assigned to the
same SOM cluster (and 68.6% had all 60 models assigned to the same SOM cluster). Of the 1,591
genomes, only one genome had models split equally between two SOM clusters. In this case, this
genome was randomly assigned to one of the two clusters using a fixed random seed of 123. The
parameter optimization of the SOM map developed in this study is discussed in further detail in
Appendix B2.
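The genome-level assignment can be sketched as below. For brevity this sketch counts SOM clusters directly rather than mapping units, which is a simplification of the procedure described above; the function names are hypothetical, and the seeded tie-break mirrors the fixed seed of 123 mentioned in the text.

```python
import random
from collections import Counter


def assign_genome(model_clusters, seed=123):
    """Assign a genome to the cluster holding most of its ensemble models.

    Ties are broken by a seeded random choice among the tied clusters.
    """
    counts = Counter(model_clusters)
    top = max(counts.values())
    tied = sorted(c for c, n in counts.items() if n == top)
    if len(tied) == 1:
        return tied[0]
    return random.Random(seed).choice(tied)


def share_in_assigned_cluster(model_clusters, seed=123):
    """Fraction of a genome's models that fall in its assigned cluster."""
    assigned = assign_genome(model_clusters, seed)
    return sum(1 for c in model_clusters if c == assigned) / len(model_clusters)


# 55 of 60 ensemble models fall in Cluster 2, so the genome is assigned there.
cluster = assign_genome([2] * 55 + [5] * 5)
```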
2.8 Maximum Growth Rate Estimations
To assess differences in maximum growth rates, we estimated the codon usage bias (dCUB)
for all 1,591 genomes using the gRodon program (J. L. Weissman et al., 2021). dCUB is a metric
that has been empirically linked with optimization for faster growth. gRodon measures codon
usage bias of highly expressed genes, in this case ribosomal proteins, compared to the codon usage
patterns across the whole genome. This genomic measure of maximum growth is a reasonable
proxy and allows us to examine the differences in growth optimization for this set of uncultured
organisms without needing to do extensive culturing and metabolic characterization efforts.
Because estimating actual growth rates from codon usage bias requires correcting for temperature,
we used the raw dCUB scores for this analysis to assess relative differences in genomic
optimization for rapid growth. Previous work by Weissman (J. Weissman et al., 2021; J. L.
Weissman et al., 2021) suggests that differences in dCUB values are only reliable below the
threshold of -0.08 (i.e., lower values of dCUB represent faster growth rates). We use this threshold
to differentiate between ‘slow growth’ and ‘fast growth’ organisms. The results from these
analyses are presented in Appendix B3.2.
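The threshold rule can be written as a short sketch. The function names are hypothetical, and treating a dCUB value exactly equal to -0.08 as slow growth is an assumption based on the phrase "below the threshold".

```python
DCUB_THRESHOLD = -0.08  # lower (more negative) dCUB corresponds to faster growth


def growth_class(dcub: float) -> str:
    """Label a genome as 'fast' or 'slow' from its raw dCUB score."""
    return "fast" if dcub < DCUB_THRESHOLD else "slow"


def percent_fast(dcub_values) -> float:
    """Percentage of genomes whose dCUB falls below the threshold."""
    values = list(dcub_values)
    if not values:
        raise ValueError("no dCUB values supplied")
    n_fast = sum(1 for v in values if growth_class(v) == "fast")
    return 100.0 * n_fast / len(values)
```

Applied per SOM cluster, `percent_fast` would yield the kind of "% fast growers" summary used to compare clusters.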
2.9 Global Distribution
To assess the global distribution of the genomes within each SOM cluster, we performed
competitive metagenomic recruitment. Specifically, we calculated normalized Reads Per Kilobase
per Million mapped reads (RPKM) with the pipeline RRAP v1.3.2 (Kojima et al., 2022). RRAP
uses bowtie2 v2.4.2 (Langmead and Salzberg, 2012) to align reads and SAMTools v1.14 (Danecek
et al., 2021) to index and sort the read alignment data. RRAP takes read alignment statistics
generated by SAMTools to calculate RPKM. A total of 1,424 metagenomes were used for the read
recruitment from several metagenomics surveys including Tara Oceans, BioGeoTraces, and
Malaspina (Paoli et al., 2022). Raw metagenome fastq files were aggregated by sample and by
depth when multiple depths were present – e.g., the Tara Oceans dataset – and quality filtered
using the iu-filter-quality-minoche script from the Illumina-utils library v2.10 with default
parameters. This script follows the quality filtering approach outlined by Minoche et al. (2011).
After quality filtering, our genome set was recruited to the metagenomic reads, and reads per
kilobase per million mapped reads (RPKM) values were calculated for each genome at each site.
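For reference, the standard RPKM normalization can be sketched as follows. This is the textbook definition rather than a reproduction of RRAP's internals, and the function and argument names are hypothetical.

```python
def rpkm(reads_mapped: int, genome_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of genome per million mapped reads.

    Equivalent to reads / (length_kb * total_reads_in_millions).
    """
    if genome_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return reads_mapped * 1e9 / (genome_length_bp * total_mapped_reads)


# 1,000 reads recruited to a 2 Mbp genome in a sample with 10 million mapped reads.
value = rpkm(reads_mapped=1_000, genome_length_bp=2_000_000, total_mapped_reads=10_000_000)
```

Normalizing by both genome length and sequencing depth is what allows values to be compared across genomes of different sizes and across metagenomes of different depths.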
We then partitioned our data into the 23 oceanographic regions defined by Lanclos et al. (2023)
and aggregated the raw RPKM values for the genomes in our study. Of the
1,424 sampling sites in the metagenomic recruitment, we had oceanographic region assignments
from the metadata for 1,203 sites. The 23 defined oceanographic regions in this metadata averaged
52.3 distinct samples per region (ranging from 3 samples/sites in the Southern Ocean to 299
samples at station ALOHA) (Appendix Table B3). We then further clustered the 23 oceanographic
regions into 5 categories: Estuarine, Coastal, Oligotrophic Seas, Oligotrophic Open Oceans, and
the Southern Ocean (Appendix Table B4). We used this categorization to group the sampling sites
and compare the relative abundances determined from the raw RPKM values. For the sampling
sites associated with each category, we clustered the relative abundances of the eight SOM clusters
at each site using Euclidean distance and hierarchical clustering with McQuitty linkage distance.
To assess the relative abundance of genomes assigned to each SOM cluster per
oceanographic region or category, we conducted a bootstrap recruitment of the individual genome
abundance values at each station. We employed bootstrapping due to large variation both in the
number of samples present in each region (ranging from 3 at SOC to 299 at ALOHA) and in the
number of genomes assigned to each cluster (ranging from 74 genomes in Cluster 8 to 558
genomes in Cluster 2). For each region, we computed 1,000 independent bootstrap iterations (using
the fixed random seed 123), drawing 10,000 data points from the pool of RPKM samples for each
of our eight clusters. During each bootstrapping step, we calculated the cumulative RPKM of the
sampled data for each cluster and then compared their magnitudes to determine the relative
abundances of the clusters. Average values and 95% confidence intervals were then computed
from the resulting distributions of relative abundances for each cluster/region combination.
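The bootstrap procedure described above can be sketched as below. The data layout (a pool of RPKM values per cluster for one region), the function name, and the percentile-based 95% interval are assumptions of this sketch; the iteration count, draw size, and seed defaults follow the text.

```python
import random


def bootstrap_relative_abundance(rpkm_by_cluster, n_iter=1000, n_draws=10_000, seed=123):
    """Return {cluster: (mean, lower, upper)} of bootstrapped relative abundance."""
    rng = random.Random(seed)
    clusters = sorted(rpkm_by_cluster)
    draws = {c: [] for c in clusters}
    for _ in range(n_iter):
        # Resample each cluster's pool with replacement and sum the RPKM values.
        totals = {c: sum(rng.choices(rpkm_by_cluster[c], k=n_draws)) for c in clusters}
        grand = sum(totals.values())
        for c in clusters:
            draws[c].append(totals[c] / grand if grand > 0 else 0.0)
    summary = {}
    for c in clusters:
        vals = sorted(draws[c])
        lower = vals[int(0.025 * (n_iter - 1))]  # percentile interval (assumption)
        upper = vals[int(0.975 * (n_iter - 1))]
        summary[c] = (sum(vals) / n_iter, lower, upper)
    return summary


# Small illustrative run; the pools hold made-up RPKM values for two clusters.
result = bootstrap_relative_abundance(
    {1: [0.2, 0.4, 0.1], 2: [1.0, 0.8, 1.5]}, n_iter=200, n_draws=50
)
```

Drawing the same number of values from every cluster's pool is what keeps clusters with many genomes from dominating simply because they contribute more data points.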
2.10 Data Visualization
All data visualizations were produced in R v4.2.3 using ggplot2 v3.4.2, ggridges v0.5.4
(Wilke, 2024), ggtree (Yu et al., 2017), patchwork v1.1.2 (Pedersen, 2024), ragg v1.2.5, and plots
native to kohonen v3.0.12 (Wehrens and Buydens, 2007).
3 Results
3.1 CarveMe model quality
We generated GEMs for 3,918 high quality bacterial genomes (including cultured isolates,
metagenomes, and single-cell genomes) from OMD using CarveMe (Machado et al., 2018). This
dataset spanned a wide diversity of marine bacteria representing 205 distinct taxonomic orders,
fifteen of which had fifty or more genomes (Figure 3.1). Given the stochastic nature of the cutting
algorithm in CarveMe, it is necessary to run ensembles of models for each genome (Bernstein et
al., 2021; Machado et al., 2018). To ensure we included only high quality models in our analysis,
we generated a large model ensemble for each genome (N = 60) and assessed the robustness of the
models using a consensus metric. Specifically, higher confidence can be placed in models where
a consistent set of enzymatic reactions are included across the entire ensemble of models generated
by CarveMe for a single genome (high consensus value).
Consensus values for the OMD genomes ranged from 1 (all model reactions were the same
across all ensemble members) to 0.24 (only 24% of reactions were conserved between ensemble
members thus providing low confidence in the CarveMe models). Systematic differences in
consensus values were seen between phylogenetic groups (Figure 3.1, 3rd ring). The
Enterobacterales, Rhodobacterales, Cytophagales, Sphingomonadales, and Pseudomonadales
had the largest number of genomes with high consensus values: on average, 65.0% of genomes
from each of these orders had consensus values above 0.8 (range 51.3%-81.3%). Genomes that
were phylogenetically similar to the reference genomes used to develop CarveMe generally had
higher consensus values (Figure 3.1). Several orders had a large proportion of genomes with high
consensus values despite being phylogenetically distant from the reference genomes. The
Rhodobacterales, for example, have no reference genome but 72.1% of these genomes had a
consensus value greater than 0.8. CarveMe struggled to generate high consensus ensembles for
several orders. Only 10.8% of Pelagibacterales and 4.1% of PCC-6307 genomes had consensus
values above 0.8, while 66.5% and 49.7% of genomes in these groups had values at or below 0.5,
respectively.
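One way to read the consensus values discussed above is as the fraction of reactions shared by every ensemble member. The sketch below implements that reading only; the exact metric used in this work is defined in the methods and may differ, and the function names and reaction identifiers are hypothetical.

```python
def ensemble_consensus(models):
    """Fraction of all reactions in the ensemble that occur in every member.

    `models` is an iterable of reaction-ID collections, one per ensemble member.
    This definition is an assumption made for illustration.
    """
    sets = [set(m) for m in models]
    if not sets:
        raise ValueError("ensemble must contain at least one model")
    union = set().union(*sets)
    if not union:
        raise ValueError("ensemble contains no reactions")
    shared = set.intersection(*sets)
    return len(shared) / len(union)


def is_high_quality(models, cutoff=0.8):
    """Apply the consensus cutoff of 0.8 used to retain genomes for analysis."""
    return ensemble_consensus(models) >= cutoff


# Identical members give a consensus of 1; divergent members give a lower value.
identical = ensemble_consensus([{"R1", "R2"}, {"R1", "R2"}])
divergent = ensemble_consensus([{"R1", "R2", "R3", "R4"}, {"R1", "R5"}])
```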
This analysis suggests that adding reference genomes (experimentally validated metabolic
models used to improve the CarveMe tool) in these low consensus orders would substantially
improve the ability of CarveMe to generate high consensus ensembles. This would greatly
improve our ability to robustly apply CarveMe broadly to environmental datasets (Giordano et al.,
2024). Our further analyses only used the 1,591 genomes with consensus values of 0.8 or greater.
We tested a more conservative cutoff of 0.9, yielding a dataset of 983 genomes, and showed that
the primary findings of this work remain unchanged (Appendix Figure B1). The low number of
PCC-6307 genomes with high quality models (N=7) was likely due to the fact that these are
Cyanobacteria and so phototrophic or mixotrophic (capable of growing on or supplementing
growth with organic compounds), whereas the CarveMe universal model is based on, and validated
with, heterotrophic bacterial genomes. Given that CarveMe was designed for heterotrophic
microbes and only 0.44% of the genomes used for subsequent analyses were from the PCC-6307
order, we focused our analyses on heterotrophic metabolic strategies.
3.2 Metabolic strategy assessment
We defined metabolic strategy as the substrates that are preferred by an organism for
growth. We assessed the metabolic strategy for each genome using a suite of sensitivity studies.
Specifically, model growth dynamics were evaluated using CobraPy under ‘replete’ conditions (all
potential growth substrates available), and then under ‘limiting’ conditions in which the
availability of certain compound classes was substantially reduced. Here we used a threshold of
an 80% reduction in growth rate under the limiting condition as the designation of substantial
reduction in growth (Appendix Figure B2). We then considered a genome to be sensitive to a
compound class if growth was substantially reduced when the compound class was removed.
We validated our approach using an extensive culture-based analysis of carbon substrate
preferences for 191 marine microbes (Gralka et al., 2023). Good agreement was observed between
the CarveMe model predictions for these genomes and the experimentally validated growth
preferences (Appendix Figure B3a). Specifically, 51.0% of the comparisons showed exact
agreement between the model predictions and experimental data and only 4.6% of models
predicted no growth where growth was experimentally observed. For the remaining 44.4% of the
comparisons, the models predicted that the substrate could be taken up by the organism but no
growth was experimentally observed when that compound was provided as a sole carbon source.
These cases suggest that the organisms might be able to use the substrate, but not as a sole carbon
source or under the conditions tested. Good agreement was also seen between model-predicted
compound sensitivities and the designation of acid versus sugar specialists identified by Gralka et al. (2023)
(Appendix Figure B3b), suggesting that our framework is capturing substrate preferences observed
experimentally (Appendix B1).
The highest growth sensitivity occurred under carboxylic acid limitation, with 39.9% of all
models in the dataset demonstrating substantial growth reduction when the uptake of this
compound class was limited (Appendix Figure B4a). Reducing the availability of amino acids or
carbohydrates resulted in substantial growth reduction in 29.2% and 17.6% of the models,
respectively (Appendix Figure B4a). In contrast, the models were generally insensitive to the
reduction of amines/amides, ketones/aldehydes, or alcohols with only 0.13%, 0.2%, and 0.25% of
models showing substantial growth reduction, respectively (Appendix Figure B4a). When
analyzed by taxonomic order, we found that the most sensitive taxa to substrate limitation across
all compound classes were the Pelagibacterales (24.9% of models showed substantial growth
reduction to at least one compound class), SAR86 (21.6% showed substantial growth reduction),
and AEGEAN-169 (19.1% showed substantial growth reduction) (Appendix Figure B5a). This is
consistent with these groups being classically oligotrophic organisms with streamlined genomes
and limited metabolic flexibility (Getz et al., 2023; Giovannoni, 2017; Swan et al., 2013). By
contrast, the classically copiotrophic groups (the Enterobacterales, Sphingomonadales, and
Rhodobacterales) showed the least growth sensitivity to substrate reduction, with only 5.9%, 9.1%
and 9.4% of these models showing substantial growth reduction across all compound classes,
respectively (Appendix Figure B5a). This indicates that the classically designated copiotrophic
orders may have more flexible metabolisms where they can achieve high growth rates using many
different compound classes.
Table 3.1: Description of the 8 SOM clusters including the number of genomes per cluster, the growth strategy as
determined by the dCUB distributions, the number and names of the growth limiting substrate classes, as well as the
2 most numerically abundant orders.

| Cluster | Genomes | Growth Strategy | % fast growers | # Limiting Classes | Limiting Substrate Classes | Top 2 orders |
|---|---|---|---|---|---|---|
| 1 | 144 | Slow | 50.00% | 2 | Carboxylic Acids, Amino Acids | Pelagibacterales, Rhodobacterales |
| 2 | 558 | Fast | 78.00% | 0 | None | Flavobacteriales, Enterobacterales |
| 3 | 95 | Slow | 40.00% | 2 | Carboxylic Acids, Peptides | Flavobacteriales, Pelagibacterales |
| 4 | 211 | Slow Intermediate | 62.10% | 1 | Amino Acids | Flavobacteriales, Rhodobacterales |
| 5 | 299 | Fast Intermediate | 73.20% | 1 | Carboxylic Acids | Pseudomonadales, Rhodobacterales |
| 6 | 133 | Slow Intermediate | 66.20% | 1 | Carbohydrates | Flavobacteriales, Rhodobacterales |
| 7 | 77 | Slow | 42.90% | 2 | Peptides, Amino Acids | Flavobacteriales, Rhodobacterales |
| 8 | 74 | Slow | 48.60% | 3 | Carboxylic Acids, Amino Acids, B vitamins | Flavobacteriales, Rhodobacterales |
3.3 Emergent metabolic clusters
Metabolic niches were identified based on the substrate preference profiles for all 1,591
genomes using Self Organizing Maps (SOMs). The SOMs method is an unsupervised clustering
method that reduces large, high dimensional datasets to a topologically defined two-dimensional
grid space (Wehrens and Buydens, 2007). Eight SOM clusters emerged with distinct growth
sensitivities to different compound classes (Table 3.1). Differences in sensitivities to carbohydrates,
carboxylic acids, amino acids, peptides, and B-vitamins drove the largest separations between the
clusters (Figure 3.2, Appendix Figure B4b). To further expand the analysis of growth strategy, we
computed an estimate of maximum growth rate for each genome based on genomic optimization
using codon usage bias (dCUB) (J. Weissman et al., 2021; J. L. Weissman et al., 2021). We then
assessed differences in genomic estimates of maximum growth rates between clusters (dCUB
values were not used in the SOM clustering). Significant differences in estimated maximum
growth rates were observed between the SOM clusters (Tukey’s HSD, ANOVA) with one fast-growing cluster (Cluster 2), four slow-growing clusters (Clusters 1, 3, 7, and 8), and three clusters
with intermediate growth rates (Clusters 4, 5, and 6) (Appendix Figure B6, Table 3.1).
The eight SOM clusters had conserved phylogenetic signals with enrichment in specific
taxonomic groups (Table 3.1). However, we simultaneously observed that many taxonomic groups
appeared across multiple clusters (Appendix Figure B5b). This suggests that diverse taxonomic
groups have similar substrate preferences and growth sensitivities, and also that some taxonomic
groups appear to have sublineages with wide variations in lifestyle. For example, the
Enterobacterales and Rhodobacterales were enriched in our fast-growing Cluster 2 by 174.6%
and 121.0%, respectively, relative to their abundances in the total dataset (Appendix Figure B7).
Similarly, SAR86 and the Pelagibacterales were on average enriched in the slow-growing Clusters
1, 3, 7, and 8 by 211% and 307%, respectively, relative to their abundances in the total dataset.
This was compared to 74.1% and 80.6% reductions of these two classically oligotrophic orders in
the fast-growing Cluster 2 relative to the total dataset. However, we also found that nine of the
fifteen most abundant orders were present in all clusters, and only three orders were absent in more
than one cluster (the Sphingomonadales, PCC-6307, and AEGEAN-169). The Flavobacteriales,
for instance, were present across all eight clusters accounting for 9.7% of slow-growing Cluster 1
up to 33.1% of the intermediate growth Cluster 6 (Appendix Figure B5b). Thus, although there
was variation in the taxonomic composition of the clusters, the differences between clusters were
not driven by taxonomy alone (Appendix B3.3). Below we provide an analysis of the four growth
types that emerged from the SOM clusters: fast-growing, slow-growing, fast intermediate growth,
and slow intermediate growth.
Figure 3.2: Substrate sensitivities for 8 SOM clusters. Bubble plot of the mean growth sensitivity values for
genomes in each of our 8 SOM clusters. A growth sensitivity of 1 indicates high sensitivity to that substrate such
that the modeled growth rate was reduced proportionally to the reduction in the substrate’s flux (e.g., 50% substrate
reduction corresponded to 50% growth rate reduction). The size of the bubbles in this plot reflects the relative
sensitivity of each of the 8 SOM clusters to a given compound class where larger bubbles indicate that cluster was
more sensitive to that compound class than others. The 6 compound classes which resulted in significant growth
reduction for at least one of the SOM clusters are shown here. The full results for all 11 substrate classes are
provided in Supplemental Figures S2 & S3. Cluster numbers were colored based on maximal genomic growth rate
(Supplemental Figure S6).
The fast-growing Cluster 2 was classically ‘copiotrophic’. 78.0% of genomes in this cluster
had predicted maximum genomic growth rates that were higher (more negative dCUB) than the
threshold of slow growth (dCUB=-0.08, where lower dCUB values correspond to faster growth).
This threshold corresponds to a doubling time of approximately 5 hours for mesophilic organisms
(optimal growth temperature between 20-60°C) (J. Weissman et al., 2021) (Appendix Figure B6).
Taxonomically, Cluster 2 consisted primarily of the Enterobacterales, Flavobacteriales,
Rhodobacterales, and Pseudomonadales (Figure 3.1, Appendix Figure B5a). Metabolically, this
fast-growing cluster showed the least sensitivity to the removal of compounds with no significant
growth sensitivity to the reduction of any of the 11 measured compound classes (Figure 3.2). This
suggests that these organisms have flexible metabolisms capable of growing on a wide range of
substrates and are able to synthesize or substitute essential metabolites when not available from
the environment. Hereafter, we will refer to this as the fast-growing generalist cluster.
In contrast, the slow-growing clusters (Clusters 1, 3, 7, and 8) had significantly slower
estimated maximum growth rates than the fast-growing generalist cluster (average dCUB of -0.106)
(Appendix Figure B6). For example, 60% of genomes in Cluster 3 had dCUB values within the
‘indistinguishable slow growth range’ (dCUB values above the -0.08 threshold). This cluster had
a high proportion of known oligotrophic orders such as SAR86 and the Pelagibacterales, with
these groups enriched in this cluster by 457% and 343% relative to the overall dataset (Appendix
Figure B6). The Marinisomatales was also found to be slightly enriched (139.6%) in this cluster
relative to its abundance in the overall dataset. All four slow-growing clusters showed growth
sensitivities to multiple (two or more) compound classes (Figure 3.2). This is in contrast to the
intermediate growth clusters which demonstrated sensitivity to a single compound class (described
below). We observed metabolic niche separation within the slow-growing clusters. For example,
Cluster 1 exhibited high growth sensitivities to two classes of acids (carboxylic acids and amino
acids/derivatives), whereas Cluster 3 models showed high growth sensitivities to carboxylic acids
and peptides (Figure 3.2, Appendix Figure B4b). Cluster 7 showed growth sensitivity to peptides
and amino acids, while Cluster 8 models were sensitive to carboxylic acids, B vitamins, and amino
acids.
The three intermediate growth clusters (Clusters 4, 5, and 6) showed growth sensitivity to
a single compound class: amino acids, carboxylic acids, and carbohydrates, respectively (Figure
3.2). All three intermediate growth clusters had predicted growth rates that were significantly
slower than the fast-growing Cluster 2 (Appendix Figure B6). Cluster 5 was also estimated to be
significantly faster than the four slow-growing clusters – hereafter referred to as the fast
intermediate growth cluster. Overall, these intermediate growth clusters appeared to be more
flexible metabolically and faster-growing than the slow-growing specialist clusters but more
specialized and slower growing than the fast-growing generalist cluster. The intermediate growth
clusters corroborate a recent modeling study which suggested that the dominant heterotrophic
group in the subsurface ocean might be slow-growing copiotrophs.
3.4 Biogeographic Distribution
To investigate the biogeographic distributions of our eight SOM clusters, we performed a
competitive metagenomic recruitment and calculated normalized Reads Per Kilobase per Million
mapped reads (RPKM) for 1,424 globally distributed samples. We compared the relative
abundance of aggregate total RPKM for each of the 23 unique oceanographic regions in our global
dataset using a bootstrapping approach (Appendix Figure B8). We further grouped these 23
regions into 5 oceanographic categories and applied our bootstrapping approach, as well as a direct
clustering on the raw RPKM values from each site. Clear biogeographical patterns emerged across
both the 23 oceanographic regions and the 5 defined oceanographic categories (Figure 3.3,
Appendix Figures S8&S9). Estuarine sites showed the highest abundance of both the fast-growing
generalist and the fast-growing intermediate clusters. The co-occurrence of these copiotrophic and
metabolically flexible organisms in these frequently eutrophic and variable salinity environments
aligns with our expectation that microbial communities present in these regions are dominated by
fast-growing organisms. In contrast, these faster growing clusters were rare at open ocean
oligotrophic sites where copiotrophs are primarily present at low abundances, occupying niches
such as sinking particles (Fuhrman, 2009). We show that community compositions in the
oligotrophic seas (Mediterranean and Red Sea) and open ocean sites are dominated by the four
slow-growing clusters and the two slower-growing intermediate clusters. The slow-growing acid-specialist cluster (Cluster 1) was the most numerically dominant group of organisms across all
samples and also had the greatest enrichment of the Pelagibacterales, the most numerically
dominant order of marine heterotrophs (Giovannoni, 2017). Unfortunately, the dataset did not
allow us to make further conclusions related to the co-occurrence of specific metabolic preferences
and the biogeochemical environment in which these communities were found. However,
expanding this analysis to samples where interdisciplinary information (e.g. metagenomic,
metabolomic, organic matter composition, and rate measurements) are co-collected is an exciting
avenue for future work.
Figure 3.3: Biogeographical relative abundances of 8 SOM clusters. Clustered bar charts of the relative abundances
of the 8 SOM clusters as determined by RPKM at each of the 1,203 stations assigned to one of the 23 oceanographic
regions. Stations were grouped into our 5 defined oceanographic categories and then arranged based on a hierarchical
clustering of the relative abundances.
4 Discussion
Although marine microbial heterotrophs play a primary role in regulating organic matter
cycling, biogeochemical cycles, and global climate outcomes, we have historically lacked an
overarching framework for characterizing these diverse communities and assessing their
functional metabolic niches. Since the majority of microbial heterotrophic diversity in the oceans
remains uncultured (Lloyd et al., 2018), we must rely on indirect methods to assess these metabolic
strategies. By combining metagenomic information, numerical models, and statistical approaches,
we identified eight distinct metabolic strategies for marine heterotrophic metabolism. Critically,
our approach is a high throughput method for generating key metabolic and physiological insight
that could previously only be obtained through labor intensive laboratory experiments restricted
to cultured organisms. The hypothesized clusters generated by this analysis provide a set of
microbial building blocks on which we can understand the assemblage of global heterotrophic
microbial communities and how they differ by oceanographic region.
We demonstrated that, when applied correctly, the CarveMe tool provides fundamental
insights into the metabolism of a highly diverse set of marine heterotrophic organisms. We
additionally showed that there were clear biases in the quality of models generated by CarveMe
with some orders, such as the Pelagibacterales, consistently producing poor quality models.
There are likely several factors that result in poor quality models. We postulate that the current
universal model and cutting algorithm used by CarveMe may struggle with streamlined genomes
(e.g., the Pelagibacterales) and for marine heterotrophs that specialize in growth on more complex
carbon substrates. We also hypothesize that issues with annotation, in particular for transporters,
might also contribute to poor quality models for certain groups. We suggest that including
additional reference genomes with validated metabolic models for certain orders (e.g., the
Pelagibacterales) will substantially improve our ability to generate high quality metabolic models
across diverse groups.
While the clusters identified in this analysis are robust, they are not necessarily complete.
In particular, we demonstrated that the CarveMe tool was not successful at creating high quality
models for the majority of genomes in many key microbial groups. Thus, we anticipate that once
we can create high quality models for these groups and investigate their metabolic strategies, we
will potentially identify additional meaningful clusters.
Here we used metabolic models to analyze the growth strategies for a large number of
marine microbial genomes (the majority uncultured) via in silico methods. We identified eight
clusters with distinct substrate preferences, growth strategies, taxonomic profiles, and
biogeographic distributions (Table 3.1, Appendix Table B1). We demonstrated that some growth
strategies correspond strongly with phylogeny, suggesting that we can infer metabolism directly
from phylogeny in some cases. However, the majority of the phylogenetic groups in our dataset
were distributed across multiple clusters with distinct metabolic preferences, demonstrating that
organisms from diverse taxonomic groups can occupy the same metabolic niche, consistent with
the findings of widely varying growth rates within closely related organisms (Deulofeu-Capo et
al., 2024). Our approach also provides a new resource for artificial media development by
identifying key growth-limiting substrates whose absence in traditional media may currently be
inhibiting culturing efforts. Finally, our metabolic clusters provide a framework for developing
biogeochemical models that explicitly incorporate diverse microbial communities by specifying
the growth strategies and metabolic preferences that can be used to parameterize these groups.
Chapter 4 : Emergent Structural Differences in Metabolic Models
for Marine Heterotrophs
1 Introduction
Marine microbial heterotrophs are the key engines behind all major biogeochemical cycles,
including the global carbon cycle, which mediates atmospheric carbon dioxide levels. In the oceans,
micro-heterotrophs drive the microbial loop (Pomeroy, 1974) where organic carbon is
remineralized back to carbon dioxide (Azam, 1998; Falkowski et al., 2008; Fuhrman and Azam,
1980). The location in the water column where remineralization occurs determines whether this CO2
is sequestered in the deep ocean or whether it remains in the surface ocean where it can be either
fixed again or exchanged back into the atmosphere (Giering et al., 2014; Henson et al., 2012, 2011).
Despite the importance of these processes, we still have a limited mechanistic understanding of
what drives variable rates of remineralization in the ocean and how these rates will be impacted
under changing oceanic conditions as a result of climate change (Behrenfeld et al., 2006;
Kwiatkowski et al., 2020). To effectively incorporate the cycling of organic carbon by
micro-heterotrophs into global models, it is crucial that we reduce the vast diversity of heterotrophic
microbes into a simplified framework of heterotrophic metabolism that accurately reflects these
mechanisms. Since we currently lack such a framework for how these rates are set, microbial
processes such as remineralization are implicitly represented in global Earth system models as
simple phenomenological rate constants (Aumont et al., 2015).
Historically, microorganisms have been grouped into taxonomic groups based on their
phylogenetic relatedness as measured by the similarity of conserved genetic elements such as the
16S ribosomal subunit (Caporaso et al., 2012; Fuhrman et al., 1993; Sogin et al., 2006). The
metabolic functionality or niche of these organisms has then often been inferred based on
phylogenetic similarity to organisms with known metabolic capacities (Langille et al., 2013). This
type of inference assumes that metabolic capacity is primarily vertically inherited and thus is
synonymous with evolutionary history (Wemheuer et al., 2020). However, there is increasing
evidence that phylogeny is not always a good predictor of metabolism and can even be completely
independent from it (Louca et al., 2018, 2017, 2016; Matthews et al., 2021; Benjamin J Tully et
al., 2018). There are many prevailing hypotheses for this disconnect between function and
phylogeny; however, it is thought that the high rates of horizontal gene transfer found in the ocean
play a key role (Fan et al., 2020; McDaniel et al., 2010). This horizontal gene transfer disrupts the
reliability of vertically inherited traits – what phylogeny tracks – even when comparing organisms
at the strain level (Hehemann et al., 2016). Cosmopolitan horizontal gene transfer means that
phylogenetically similar organisms can adopt very different sets of genes and metabolic functions
that are favorable to their specific environmental regimes. Additionally, given the short generation
time of marine microbes and high selective pressures for advantageous metabolic traits, rapid
adaptation is possible where traits can evolve in response to environmental or ecological shifts in
some strain level subpopulations (Jaspers and Overmann, 2004; Ward and Collins, 2022). Since
phylogeny is not a consistent predictor of metabolic functional potential, other approaches are
needed to quantify the functional role of marine heterotrophic microbes in the environment.
Genome-scale metabolic models (GEMs) are powerful tools for determining the
metabolism and functional potential of uncultured heterotrophic microbes (Mendoza et al., 2019;
Oberhardt et al., 2009). These models take genomic information about functional capacity from
gene annotations and translate it into a network of metabolic pathways. A GEM can then be used
to generate hypotheses as to the metabolic responses of the organism to different growth conditions
using an approach called flux balance analysis (FBA) (Kauffman et al., 2003; Noor et al., 2016;
Varma and Palsson, 1994). In FBA, the model optimizes fluxes of substrate and energy through
different parts of the metabolic network in order to construct a metabolism that maximizes growth,
measured in biomass, often while minimizing total flux or total flux components. FBA can be used
to predict metabolic strategies for microbes under different conditions, for example which internal
metabolic pathways are favorable or which metabolites might be secreted. However, GEMs have
historically required intensive experimental efforts in order to validate the network structure and
predicted FBA output. Specifically, for each model, the individual genes and proteins of an
organism must be well characterized and flux through different pathways tracked using isotopic
labeling (Boschker et al., 1998; Fuhrman and Azam, 1982). Thus, curated GEMs have been limited
to cultured isolates (Gu et al., 2019) and as a result we lack curated GEMs for most of the important
members of the marine microbial community. With the advent of automated metabolic modeling
software such as CarveMe, ModelSEED, and Agora (Henry et al., 2010; Machado et al., 2018;
Magnúsdóttir et al., 2017), among others, we are now capable of generating GEMs for large sets
of organisms, including uncultured organisms (Mendoza et al., 2019). These tools have also made
GEMs much more widely accessible to non-specialist scientists and have dramatically increased
throughput on model generation. This increased throughput allows for the generation of GEMs
representing a wide array of environmentally relevant organisms that would have never been
attainable with manually curated models (Giordano et al., 2024).
While automated GEM generation tools mark an important advance in the field of marine
microbiology, the ease with which they can be applied comes with the potential for them to be
used without sufficient validation or regard for their constraints. There has been some effort by the
modeling community to better understand the limitations of these tools and provide
recommendations for their use (Gu et al., 2019; Moyer et al., 2024). Tools like Memote (Lieven
et al., 2020) have been developed to check the quality of these models by looking for dead end
reactions, continuous generation of biomass, and mass and charge balance. While these are critical
components of a good quality model, they do not assess the assumptions and choices that were
made to create a functional model from often incomplete information. Reynolds et al. (2024)
demonstrated that for CarveMe the carving process itself is stochastic and can be highly variable
for individually generated models from certain genomes and highlighted the need to generate
ensembles of multiple models for each input genome. Here we conduct an in-depth analysis to
expand our ability to identify reliable GEMs beyond the currently used quality metrics.
Specifically, we assess how and why automatically generated GEMs may fail to create reliable
models, how this links to model generation procedures, the sensitivity of model generation to
missing input information (e.g., annotation issues), and the signature of potentially problematic
models. We use the CarveMe software for this study as it has been widely used and always
produces functioning models based on its unique procedure of “carving” out the GEM from a
curated universal model.
We previously applied CarveMe to a large dataset (3,918 marine bacterial genomes) and
demonstrated that approximately 60% of the genomes produced GEMs that did not meet our
threshold for high-quality (Reynolds et al., 2024). This assessment was related to the “precision”
of GEM generation by CarveMe where GEMs were considered high-quality if CarveMe generated
consistent models for a specific genome. We also showed that there were substantial differences
in GEM quality outcomes across taxonomic orders where typically genomes that were
taxonomically distant from the reference genomes used to develop CarveMe were lower quality
(Reynolds et al., 2024). Here, we identified the underlying differences between high- and low-quality
GEMs and assessed what drives these differences. Specifically, we identified key network
metrics and structural differences that distinguished low- and high-quality ensemble networks. We
then conducted a series of simulations where we quantified the impact of missing (or poorly
annotated) reactions on the quality of the GEM ensembles using in silico knockouts of reactions.
We demonstrate that knocking out reactions consistently reduces the consensus of model
ensembles and that the metabolisms of low- and high-quality ensemble networks differ
significantly on the basis of reaction type, particularly the proportions of metabolite import and
export reactions. We also identify a handful of key reactions that appear to be particularly
important for CarveMe for producing high-quality GEMs. Our findings both assist in GEM
generation software improvement and provide user recommendations on how to implement quality
control of automatically generated GEMs.
2 Methods
2.1 Genomic Data
For this study, we leveraged the Ocean Microbial Dataset (OMD) v1 hosted at
microbiomics.io (Paoli et al., 2022) which houses over 35,000 genomes from a variety of sources
including metagenome-assembled genomes (MAGs) (Graham et al., 2017; MetaHIT Consortium
et al., 2014; Parks et al., 2017; Benjamin J. Tully et al., 2018), single-cell amplified genomes
(SAGs) (Martinez-Garcia et al., 2012; Pachiadaki et al., 2019; Sieracki et al., 2019; Stepanauskas
and Sieracki, 2007; Swan et al., 2013, 2011), and isolates. We used only high-quality genomes
defined using standard thresholds (Parks et al., 2017; Benjamin J. Tully et al., 2018) of > 80%
completeness and < 5% contamination as determined by the average of the values derived from
CheckM (Parks et al., 2015) and anvi’o (Eren et al., 2015). We also dereplicated the high-quality
OMD genomes using dRep (Olm et al., 2017) at the 95% ANI threshold, which was already
provided in the OMD metadata. Because the CarveMe software was primarily validated with
bacterial GEMs, we limited our study to marine bacterial genomes. This yielded a dataset of 3,918
high-quality bacterial genomes.
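The quality filter described above can be sketched as follows. This is a minimal illustration, assuming each genome carries one completeness and one contamination estimate from each of CheckM and anvi’o; the class, field names, and example values are hypothetical and are not taken from the OMD metadata.

```python
from dataclasses import dataclass


@dataclass
class GenomeQC:
    """Quality estimates for one genome (field names are illustrative)."""
    genome_id: str
    checkm_completeness: float
    checkm_contamination: float
    anvio_completeness: float
    anvio_contamination: float


def is_high_quality(g, min_completeness=80.0, max_contamination=5.0):
    """Apply the thresholds to the mean of the CheckM and anvi'o estimates."""
    completeness = (g.checkm_completeness + g.anvio_completeness) / 2
    contamination = (g.checkm_contamination + g.anvio_contamination) / 2
    return completeness > min_completeness and contamination < max_contamination


# Hypothetical genomes, invented for this example.
genomes = [
    GenomeQC("genome_A", 92.0, 1.5, 88.0, 2.5),  # mean 90.0 / 2.0
    GenomeQC("genome_B", 84.0, 3.0, 74.0, 4.0),  # mean 79.0 / 3.5
    GenomeQC("genome_C", 95.0, 6.5, 93.0, 4.5),  # mean 94.0 / 5.5
]
kept = [g.genome_id for g in genomes if is_high_quality(g)]
print(kept)
```

Only the first genome passes: the second falls below the completeness threshold and the third exceeds the contamination threshold once the two estimates are averaged.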
2.2 Phylogeny
We determined both taxonomy and phylogeny for the 3,918 genomes used in this study.
We also included the 66 unique bacterial reference genomes used to build the metabolic models in
the BiGG database that form the basis of the CarveMe universal model. This resulted in a total of
3,984 genomes for our analyses. Taxonomy was assigned using GTDB-Tk v2.1.0 (Chaumeil et
al., 2022) with the GTDB r214 database (Parks et al., 2018). Phylogeny was determined using
GToTree v1.7.0 (Lee, 2019) and IQ-TREE v2.0.3 (Minh et al., 2020). GToTree v1.7.0 was first
used to build a multiple sequence alignment (MSA) file using the predefined Bacteria single copy
gene (SCG) set for the 3,984 genomes. During this process, 8 genomes were excluded for having
insufficient hits to the Bacteria SCG, but these genomes were still assigned taxonomy and used in
our experimental analyses. This MSA file with the remaining 3,976 genomes was then processed
in IQ-TREE v2.0.3 with 1,000 ultrafast bootstraps (Hoang et al., 2018) and the LG+R10 model
with 3,554 amino-acid sites to build the phylogenetic tree.
2.3 Metabolic Model Generation and Quality Assessment
Metabolic models were generated using CarveMe v1.5.1 (Machado et al., 2018) with its
mixed integer linear programming (MILP) algorithm. CarveMe can take either gene/protein FASTA
files or eggNOG annotations as input. Here we selected DIAMOND v0.9.14 (Buchfink et al., 2015)
based on tests which indicated that eggNOG annotations produced low-quality models (Appendix
Figure C1). Protein predictions for the input genome from Prodigal v2.6.3 (Hyatt et al., 2010) were
aligned by DIAMOND to the unique set of proteins present in the BiGG database (Schellenberger
et al., 2010). CarveMe then cross-referenced the annotated BiGG proteins for the input genome
against its gene-protein-reaction (GPR) file, which contains 30,814 proteins corresponding to the
unique protein sequences in the universal model. Proteins in the GPR file that correspond to
annotated BiGG proteins were then individually assigned reaction scores. Specifically, BiGG
protein scores were converted to reaction scores by summing the scores of all annotated proteins
involved in catalyzing the same reaction. The annotated reaction scores for all reactions in the
genome were then normalized to a median value of 1.
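The conversion from protein scores to normalized reaction scores described above can be sketched as follows. This is a simplified illustration rather than CarveMe's own code: it assumes a flat mapping from each protein to the reactions it catalyzes, and all identifiers and score values are hypothetical.

```python
from collections import defaultdict
from statistics import median


def reaction_scores(protein_scores, protein_to_reactions):
    """Sum protein scores per reaction, then rescale so the median score is 1."""
    totals = defaultdict(float)
    for protein, score in protein_scores.items():
        for rxn in protein_to_reactions.get(protein, ()):
            totals[rxn] += score
    if not totals:
        return {}
    med = median(totals.values())
    return {rxn: total / med for rxn, total in totals.items()}


# Hypothetical alignment scores for four annotated proteins.
protein_scores = {"P1": 100.0, "P2": 50.0, "P3": 200.0, "P4": 40.0}
# P1 and P2 both catalyze R_A, so their scores are summed.
protein_to_reactions = {"P1": ["R_A"], "P2": ["R_A"], "P3": ["R_B"], "P4": ["R_C"]}

scores = reaction_scores(protein_scores, protein_to_reactions)
print(scores)
```

In this example the summed scores are 150, 200, and 40, so dividing by the median (150) leaves R_A at exactly 1 with the other two reactions scaled around it.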
CarveMe has two modes of model generation: single model mode and ensemble mode.
When CarveMe was run in single model mode, all unannotated reactions in the universal model
were assigned a negative score of -1 and a single metabolic model was generated. When CarveMe
was run in ensemble mode, all unannotated reactions in the universal model were then assigned a
random score value. There is also a set of spontaneous reactions in the universal model which
were assigned a neutral score of 0. In ensemble mode, CarveMe then generates the user specified
number of metabolic models (M). The output can either be interpreted as one pan-reactome – the
union of predicted reactions across all ensemble members – or M individual metabolic models in
the ensemble.
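The two readings of the ensemble output can be illustrated with a short sketch. It assumes each ensemble member is available as a set of reaction identifiers; the identifiers and the three-member ensemble are hypothetical.

```python
def pan_reactome(models):
    """Union of the reactions predicted by any ensemble member."""
    return sorted(set().union(*models))


def presence_matrix(models):
    """Rows are ensemble members, columns are pan-reactome reactions (1 = present)."""
    reactions = pan_reactome(models)
    matrix = [[1 if rxn in model else 0 for rxn in reactions] for model in models]
    return reactions, matrix


# Hypothetical ensemble of M = 3 carved models.
models = [
    {"R_A", "R_B", "R_C"},
    {"R_A", "R_C", "R_D"},
    {"R_A", "R_B"},
]
reactions, matrix = presence_matrix(models)
print(reactions)
print(matrix)
```

The pan-reactome view keeps every reaction that any member uses, while the rows of the matrix retain the M individual models.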
After the reaction weights for all reactions in the universal model are set, CarveMe trims,
or carves, the universal model to create a draft organismal metabolism with no futile cycles, infinite
biomass loops, etc. and a minimal number of total reaction components. The reaction scores in the
GPR file inform the carving process by directing the model to prioritize the inclusion of reactions
with high GPR scores (i.e., annotated reactions) while prompting it to remove reactions with
limited genetic support (i.e., unannotated reactions). The CarveMe approach allows for reactions
not annotated in the genome to be included in the model, which mitigates issues with
annotation or incomplete genomes and circumvents the need for manual gap filling. When run in
ensemble mode, the randomization of scores for the unannotated reactions accounts for uncertainty
in the annotation process but also introduces a degree of stochasticity such that CarveMe can create
a different model each time it is run.
To assess the consistency (precision) of model generation by CarveMe in ensemble mode,
we previously developed a consensus score metric (Reynolds et al., 2024) defined as:
\[
C = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{M} \sum_{j=1}^{M} I(R_{ji} = 1)
\]
where R_ji is a matrix of reaction presence/absences across j individually generated models, i is
an individual reaction, N is the total number of reactions in the ensemble, and M is the ensemble
size. Here, I is defined as the indicator function for the case that reaction i is present in ensemble
model j. Our previous analysis showed that the distribution of consensus scores across the 3,918 genomes was
bimodal with peaks at ≈ 0.5 and ≈ 0.8 (Reynolds et al., 2024). Based on this, we used a threshold
of C ≥ 0.8 for high-quality ensembles and C ≤ 0.5 to designate poor quality models. Here we use
an ensemble size of 60 models based on our assessment that 60 models was sufficient to saturate
the metabolic reaction space (Appendix Figure C1).
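As an illustration, the consensus score can be computed from a presence/absence matrix as sketched below. The sketch assumes the score is read as the mean fraction of ensemble models that contain each reaction in the ensemble, which is our interpretation of the definition above; the function names and the small 4-model matrix are hypothetical.

```python
def consensus_score(states):
    """Mean presence of reactions across models.

    `states` holds one row per ensemble model (M rows) and one 0/1 entry per
    reaction in the ensemble (N columns).
    """
    m = len(states)
    n = len(states[0])
    present = sum(sum(row) for row in states)
    return present / (n * m)


def quality_class(score, high=0.8, low=0.5):
    """Apply the thresholds used in the text for high- and low-quality ensembles."""
    if score >= high:
        return "high"
    if score <= low:
        return "low"
    return "intermediate"


# Hypothetical ensemble: 4 models (rows) by 5 reactions (columns).
states = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 1],
]
score = consensus_score(states)
print(score, quality_class(score))
```

Here 15 of the 20 entries are present, giving a score of 0.75, which falls between the two thresholds.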
2.4 In silico knockouts
To rigorously assess what causes some genomes to generate high-quality ensembles and
others to generate poor quality ensembles, we selected a subset of the genomes to perform an in-depth
sensitivity test. For this analysis, we focused on the Flavobacteriales (n = 613). This order
was selected because it is phylogenetically distant from the reference genomes used to build the
CarveMe universal model. It is also one of the most abundant taxa in our dataset and represents
key strains in marine microbial ecosystems, such as in chitin-degrading communities (Enke et al.,
2018; Fontanez et al., 2015). Furthermore, 42.9% of the genomes in this order generated high-quality
ensembles (n = 263, C ≥ 0.8) and 16.8% generated low-quality ensembles (n = 103,
C ≤ 0.5). The remaining 247 genomes had consensus scores between 0.5 and 0.8. Focusing on a
single taxonomic order allowed us to minimize the phylogenetic effects on the results of our
simulations so that we could more directly measure the effect of modifying the input genome
content on model consensus.
To assess the impact of loss of genomic information on CarveMe model precision (as
quantified by the consensus score), we performed in silico knockout simulations in high-quality
ensembles where reactions were changed from annotated to unannotated. Our hypothesis was that
a lack of annotation information would make the model carving more unstable and thus reduce
model precision and the ensemble consensus score. Thus, we conducted our knockout simulations
only on high-quality ensembles so we could determine how information loss in previously high
precision metabolic models could disrupt the stability of the carving process. Specifically, we
generated 250 replicate knockouts per genome for the 263 high-quality genome ensembles
assigned to the Flavobacteriales order, resulting in a total of 65,750 new model ensembles.
Knockouts were performed by directly modifying the gene-protein-reaction (GPR) rules file used
by CarveMe (Machado et al., 2018). As described above, the GPR file contains 30,814 proteins
corresponding to the unique protein sequences in the universal model. Reactions associated with
annotated BiGG proteins received a positive score, unannotated proteins received random scores,
and spontaneous reactions received a score of 0. Knockouts were performed by randomly selecting
reactions that were annotated from the input genome (i.e., positive score in the GPR file) and
setting the score to zero. CarveMe then treated the knocked-out reaction as lacking any information
as to whether it is present or absent in the model (this is the way that spontaneous reactions are
treated in the model). Crucially, this formulation ensured that CarveMe could still choose to use
this reaction in the new ensemble if it remained part of the optimal solution.
To modify this GPR file and create our knockouts, we first determined the reactions that
were eligible to be removed. This was done by taking the ensemble of 60 models generated for the
current genome and identifying the total set of reactions present across all 60 carved models. From
this list of total reactions, the subset with positive scores in the GPR file were selected (i.e., those
pathways associated with BiGG annotations from the input genome). On average, 1,010 reactions
(range 422-1,517) for each genome were kept as eligible reactions to knock out.
From the subset of N eligible reactions, Z were randomly selected to be knocked out (Z
ranged from 1-50) using a random number generator from 1 to N without replacement. Since we
were interested in knocking out whole reactions rather than individual proteins, we identified all
protein isozymes that were associated with catalyzing the individual reaction, X. We then excised
all information pertaining to the proteins that catalyze X from the GPR file. To designate reaction
X as spontaneous, the following line was appended to the GPR file for each knocked-out reaction
R_X:
G_s0001, P_s0001, R_X, iML1515
where G_s0001, P_s0001 denotes that R_X can occur spontaneously. iML1515 specifies which of the
curated models used to build the universal model that reaction R_X was originally annotated in.
The specific model that the reaction was annotated in is not relevant to setting the reaction scores
so iML1515 is just a placeholder.
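The knockout procedure above can be sketched as follows. This is a minimal illustration that holds the GPR file in memory as four-field rows (gene, protein, reaction, source model), following the layout of the appended line shown above; the row contents, scores, and function names are hypothetical and do not reproduce the actual CarveMe file.

```python
import random

# Hypothetical GPR rows: (gene, protein, reaction, source model).
gpr_rows = [
    ("G_b0001", "P_b0001", "R_X1", "iML1515"),
    ("G_b0002", "P_b0002", "R_X1", "iML1515"),  # isozyme for the same reaction
    ("G_b0003", "P_b0003", "R_X2", "iML1515"),
    ("G_b0004", "P_b0004", "R_X3", "iML1515"),
    ("G_b0005", "P_b0005", "R_X4", "iML1515"),
]
# Hypothetical reaction scores; positive means annotated from the input genome.
reaction_scores = {"R_X1": 1.4, "R_X2": 0.9, "R_X3": 1.0, "R_X4": -0.2}
# Reactions present in at least one of the carved ensemble models.
pan_reactome = {"R_X1", "R_X2", "R_X3", "R_X4"}


def eligible_reactions(pan_reactome, reaction_scores):
    """Reactions used by the ensemble that also have a positive score."""
    return sorted(r for r in pan_reactome if reaction_scores.get(r, 0.0) > 0.0)


def knock_out(gpr_rows, eligible, z, rng):
    """Drop every row for Z randomly chosen reactions and mark them spontaneous."""
    targets = set(rng.sample(eligible, z))  # sampling without replacement
    kept = [row for row in gpr_rows if row[2] not in targets]
    spontaneous = [("G_s0001", "P_s0001", rxn, "iML1515") for rxn in sorted(targets)]
    return kept + spontaneous, targets


rng = random.Random(42)  # fixed seed so the sketch is repeatable
eligible = eligible_reactions(pan_reactome, reaction_scores)
new_rows, knocked = knock_out(gpr_rows, eligible, z=2, rng=rng)
for row in new_rows:
    print(", ".join(row))
```

Removing all rows for a reaction, including isozymes, before appending the spontaneous entry ensures that no residual annotation score remains for the knocked-out reaction.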
2.5 Network Generation
CarveMe models consist of hundreds to thousands of reactions that transfer carbon and
energy between different metabolite pools. Each model can be thought of as a network where each
metabolite is a node and each reaction is a connection (or edge) between the nodes. Similarly,
networks can be generated for the model ensemble simulations. These networks have the exact
same structure as the individual models, except that they reflect the pan-reactome of all 60 replicate
models. In this network, each node is still a metabolite as in the individual model network but now
each connection (edge) has a weight [0,1] associated with it describing what proportion of models
in the ensemble predicted the presence of that reaction. Here, we generated ensemble networks for
all 3,918 genomes from the OMD dataset, as well as the ensembles generated during the knockout
experiment (described in Section 2.4). Networks were generated from the CarveMe output files of
the ensemble states, which describe whether a given reaction in the ensemble was present (value
of 1) or absent (value of 0) in each of the 60 ensemble models. Each metabolite was annotated
with the cellular compartment in which it was found in the model (extracellular, periplasm, or
cytoplasm) and each reaction was assigned the cellular compartment (e.g., cytoplasm→cytoplasm)
or compartments (e.g., extracellular→cytoplasm) in which it occurred.
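The construction of a weighted ensemble network can be sketched as follows. The sketch assumes that each reaction contributes one edge per reactant-product pair, which is one possible way to expand multi-metabolite reactions and is an assumption of this example; the reaction and metabolite identifiers are hypothetical.

```python
def reaction_frequencies(states):
    """Fraction of ensemble models that contain each reaction."""
    return {rxn: sum(flags) / len(flags) for rxn, flags in states.items()}


def weighted_edges(states, stoichiometry):
    """One (reactant, product, reaction, weight) edge per reactant-product pair."""
    edges = []
    for rxn, weight in reaction_frequencies(states).items():
        reactants, products = stoichiometry[rxn]
        for src in reactants:
            for dst in products:
                edges.append((src, dst, rxn, weight))
    return edges


# Hypothetical ensemble states for 4 models (1 = reaction present in that model).
states = {
    "R_GLCt": [1, 1, 1, 1],
    "R_HEX1": [1, 1, 0, 1],
    "R_PYRt": [0, 1, 0, 0],
}
# Hypothetical reactant and product lists for each reaction.
stoichiometry = {
    "R_GLCt": (["glc_e"], ["glc_c"]),
    "R_HEX1": (["glc_c", "atp_c"], ["g6p_c", "adp_c"]),
    "R_PYRt": (["pyr_c"], ["pyr_e"]),
}
edges = weighted_edges(states, stoichiometry)
for edge in edges:
    print(edge)
```

Edges with a weight near 1 are supported by nearly every ensemble member, whereas low-weight edges appear in only a few models.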
To examine differences between low- and high-quality ensemble metabolisms, we
analyzed the resulting networks for the two groups of ensembles to identify structural differences
in the predicted metabolisms. As described above, we used a threshold of C ≥ 0.8 to define the
group of high-quality ensemble networks and C ≤ 0.5 to define the low-quality ensemble
networks. We chose 0.5 as the critical point for defining the low-quality group of networks since
a consensus score of 0.5 effectively describes the case where, on average, the presence/absence of
a given reaction being added is essentially a random draw from a binary distribution. Thus, any
network constructed from an ensemble with consensus score of 0.5 or less can effectively be
considered a pseudo-random metabolic ensemble. Network analyses and metrics were generated
using ggraph v2.1.0 and tidygraph v1.2.3 in R v4.2.3. We computed standard measures of
overall network composition, including the distribution of node degrees, eigen centrality, and
betweenness centrality. Node degree defines the number of edges connected to each individual
node in a given network, while eigen and betweenness centrality describe the relative importance
and local density of the surrounding connections for a given node.
2.6 Data Visualization
All data visualizations were produced in R v4.2.3 using ggplot2 v3.4.2 and patchwork
v1.1.2.
3 Results
3.1 Model Consensus Variability
CarveMe uses two fundamentally different approaches for scoring unannotated reactions
during model generation depending on whether the user wants to generate a single model from an
input genome, or an ensemble of many models (described above in Section 2.3). To investigate
the impact this can have on resulting models, we performed a sensitivity analysis of the model
generation process in which we ran an individual genome through CarveMe 500 separate times
generating either individual models, or ensembles of 60 models. We selected a genome that had a
particularly high initial ensemble consensus score (C = 0.944, for the first ensemble generated).
When CarveMe was run in single model mode, all 500 models were identical,
demonstrating that running CarveMe in this mode is a completely deterministic process. This is
expected as the single model generation assigns a weight of -1 to all unannotated reactions
specifying that they are not present. This contrasts with the ensemble mode where CarveMe
introduces stochasticity by randomizing this reaction weight, which accounts for the fact that some
unannotated reactions might in fact be present in the model.
Next, we tested the variability in consensus scores when CarveMe was run in ensemble
mode. We ran the test with two genomes, a genome with a high consensus score (C = 0.944) used
for the individual-mode test and a genome with a moderately high consensus score (C = 0.829),
with 500 replicates of 60-model ensembles. The ensemble consensus scores for each of the 500
ensembles varied substantially. For the high consensus score genome, the median consensus score
was 0.941 with a standard deviation of 0.055 (range 0.772 − 0.999) (Figure 4.1). The moderate
consensus score genome generated a wider variance in consensus scores than the higher quality
ensemble with a median value of 0.876 and a standard deviation of 0.089 (range 0.651 − 0.999)
(Figure 4.1). These tests suggest that individual genome ensembles with high consensus
scores are more likely to come from distributions of ensemble consensus scores that have
smaller ranges and are thus less variable. As the consensus score drops, so too does the confidence
in the consensus score based on a single ensemble run.
While running CarveMe in single model mode is reproducible, this is not necessarily
reflective of fidelity of the metabolism predicted by the genome. Rather, we argue that allowing
CarveMe to account for key uncertainties in genome annotation and assembly through ensemble
mode provides greater insight into the potential metabolisms predicted by the genome. We suggest
that for the most thorough assessment of metabolic potential – when computational resources are
not a barrier – a user should generate many model ensembles. This allows for a quantification of
consistency in the predicted metabolism (precision in model generation). To further assess the
metabolisms predicted by low- and high-consensus score model ensembles, here we investigate
differences in network structure using both classical network theory and the intrinsic metabolic
features associated with each network. We further quantify the impact of reducing the amount of
information available to CarveMe a priori regarding the reactions that are present in the input
genome to determine how this impacts model consensus, and if there are specific reactions that are
essential to predicting consistent metabolisms.
3.2 Bulk Network Structure
Metabolic models are essentially networks in which each metabolite or intermediate
product is a node, and each reaction is an edge connecting nodes. Thus, we can apply classical
network theory approaches to study differences in the metabolic models generated by CarveMe.
Specifically, we assessed the low (C ≤ 0.5) and high (C ≥ 0.8) quality ensemble networks
(Methods Section 2.5) in order to determine if there were significant differences in the structure
and connectivity between the two groups of networks. Since by definition these two groups have
substantially different consensus scores, we hypothesized that the underlying metabolisms may
have fundamentally different network structures. To quantify this, we computed node
degree, eigen centrality, and betweenness centrality (see Methods Section 2.5 for full description
of these metrics) for all 3,918 ensemble networks. We then compared these network metrics for
the low-quality ensembles (n = 956) to the high-quality ensembles (n = 1,591). If the network
structures were significantly different from one another, we would observe significantly different
distributions of these network metrics. However, we found that the distributions of all 3 measures
of network structure were indistinguishable between the low- and high-quality groups of networks
(Appendix Figures S2-S4). This suggests that, while there was more variability in the predicted
networks within the ensemble for low-quality models based on their ensemble consensus score,
the inherent structure and connectivity of the metabolic networks themselves were not
meaningfully different. That is, we did not observe fundamental differences in the density or
connectivity of the metabolites (nodes) and reactions (edges) between low- and high-quality
ensemble networks.
Figure 4.1: Range of consensus scores for two genomes. Histograms of the resulting ensemble consensus scores
after re-running two individual genomes through CarveMe’s ensemble mode with 60 model ensembles 500 times.
The two bar colors delineate the two genomes with initial consensus scores of 0.829 (red, lower quality ensemble)
and 0.944 (blue, higher quality ensemble).
Figure 4.2: Distributions of metabolite frequencies for low (bad) and high (good) quality ensembles. The distributions
for the low- and high-quality ensembles are statistically significantly different for all three compartments.
3.3 Metabolic Compartment Assessment
Metabolites in CarveMe models can exist in three different cellular compartments: in the
extracellular space, in the periplasm, or in the cytoplasm. Similarly, reactions in CarveMe models
fall into 9 distinct types – reactions that transfer metabolites into or out of each compartment or
reactions that occur within one of the compartments. To investigate if there were differences in the
metabolic structure of the high-quality versus low-quality CarveMe ensembles, we examined the
distributions of metabolites in each of these three compartments and differences in the types of
reactions included in the models.
Table 4.1: Statistical significance values for the metabolite (a) and reaction type (b) frequency comparisons between
good and bad ensembles. We used a Welch’s modified t-test for uneven sample sizes to test significance.
a.
Compartment     Extracellular     Cytoplasm         Periplasm
p-value         < 2.2 × 10^-16    < 2.2 × 10^-16    < 2.2 × 10^-16
b.
Compartment     Extracellular     Cytoplasm         Periplasm
Extracellular   < 2.2 × 10^-16    < 2.2 × 10^-16    1.79 × 10^-6
Cytoplasm       < 2.2 × 10^-16    < 2.2 × 10^-16    < 2.2 × 10^-16
Periplasm       1.55 × 10*-       < 2.2 × 10^-16    < 2.2 × 10^-16
Significant differences were observed in the relative frequency of metabolites in each of
these three compartments between the high-quality and low-quality ensembles (Figure 4.2).
Specifically, high-quality ensembles had a significantly greater frequency of their metabolites in
the cytoplasm compared to the low-quality ensemble networks (Table 4.1a) with 65.1% cytoplasm
metabolites compared to 61.0% (Welch’s modified t-test, p ≪ 0.01). By contrast, the low-quality
ensembles had a significantly greater frequency of extracellular metabolites in their metabolism
(Welch’s modified t-test, p ≪ 0.01) with 20.4% extracellular metabolites compared to 17.3%.
This difference in metabolite frequencies suggests that high-quality ensembles have a much greater
ability to carry out de novo synthesis of intermediate metabolites and require less metabolite
exchange with the environment. Low-quality ensembles, on the other hand, exchange a greater
number of metabolites with the environment to support a more limited central metabolism. The
mean frequency of metabolites in the periplasm also differed significantly
between the low- and high-quality ensembles (Welch’s modified t-test, p ≪ 0.01), although the
difference was much smaller, with 18.5% periplasm metabolites compared to 17.6%, respectively.
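For readers unfamiliar with the test, the Welch statistic used in these comparisons can be sketched as follows. The sketch computes only the t statistic and the Welch-Satterthwaite degrees of freedom; obtaining a p-value additionally requires the t distribution, which is not included here. The two samples are hypothetical values invented for the example and are not our data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the first sample mean
    vb = variance(b) / len(b)  # squared standard error of the second sample mean
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical cytoplasm metabolite frequencies for two groups of unequal size.
high_quality = [0.66, 0.64, 0.65, 0.67, 0.63]
low_quality = [0.60, 0.62, 0.61, 0.59]

t_stat, dof = welch_t(high_quality, low_quality)
print(round(t_stat, 2), round(dof, 2))
```

Unlike the pooled-variance t-test, this form does not assume equal variances or equal group sizes, which suits the unequal numbers of low- and high-quality ensembles.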
We further assessed the prevalence of reactions within each compartment (e.g., cytoplasm to
cytoplasm) and between the three compartments (e.g., extracellular to periplasm) (Figure 4.3). The
distributions of all 9 reaction types were statistically different between the low- and
high-quality ensembles (Table 4.1b). The largest differences were observed in the frequency of
reactions within the cytoplasm (c → c) and between the cytoplasm and the extracellular space
(� → �). Specifically, high-quality ensembles had significantly more c → c reactions, 74.6%
versus 73.2% (Welch’s modified t-test, p ≪ 0.01) while low-quality ensembles had significantly
more (� → �) reactions, 1.6% versus 1.1% (Welch’s modified t-test, p ≪ 0.01). Overall, the high-quality
ensembles had more reactions in the cytoplasm (c → c) while the low-quality ensembles
had on average more of 7 of the 9 reaction types. For this analysis, we normalized the number of
reactions of each type to the total number of reactions in each ensemble to minimize the effect of
the variability in this total number of reactions per ensemble.
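The normalization described above can be sketched as follows. This is an illustrative example under the simplifying assumption that each reaction is tagged with a single (reactant compartment, product compartment) pair; the function name and the toy reaction list are hypothetical.

```python
from collections import Counter
from itertools import product

COMPARTMENTS = ("extracellular", "periplasm", "cytoplasm")


def reaction_type_frequencies(reactions):
    """Return the share of all reactions falling in each of the 9 compartment pairs."""
    counts = Counter(reactions)
    total = len(reactions)
    # Dividing by the model's own total removes the effect of model size.
    return {pair: counts[pair] / total for pair in product(COMPARTMENTS, repeat=2)}


# Hypothetical model with 20 reactions (placeholder values only).
toy_reactions = (
    [("cytoplasm", "cytoplasm")] * 15
    + [("extracellular", "periplasm")] * 3
    + [("periplasm", "cytoplasm")] * 2
)
freqs = reaction_type_frequencies(toy_reactions)
```

Because every model is divided by its own reaction total, the resulting frequencies can be compared across ensembles that differ in size.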
The higher frequency of reactions into and out of the extracellular compartment for the
low-quality ensembles reinforces the observed differences in frequency of metabolites by
compartment between the low- and high-quality ensembles. Our analysis indicates that low-quality
ensembles have metabolisms that are reliant on exchanging greater numbers of metabolites with
the environment particularly to support central metabolism (greater frequency of � → � ).
Figure 4.3: Distributions of reaction type frequencies for low- and high-quality ensembles for the 9 possible reaction
types within a CarveMe model. The arrow denotes the direction of the reaction such that Cytoplasm → Periplasm
denotes a reaction converting a reactant in the cytoplasm to a product in the periplasm. All 9 pairs of distributions are
statistically significantly different from one another. The low-quality ensembles have higher frequencies of all reaction
types except ��������� → ��������� and ��������� → ��������� reaction types.
Furthermore, the increase in the frequency of � → � reactions suggests that the metabolisms of
low-quality ensembles have a greater amount of export, leakage, or secretion of intermediate
metabolites produced during their central metabolism. When we assessed the ratio of export (as a
reactant) to import (as a product) of the metabolites into/out of the extracellular compartment, we
found that the low-quality ensembles had a significantly greater ratio of export to import for
extracellular metabolites, with an average ratio of 0.80 compared to 0.75 for the high-quality
ensembles (Welch’s modified t-test, p ≪ 0.01, Appendix Figure C5).
While the abundance of metabolites and frequency of reactions exhibited normal or
lognormal distributions of values across the ensembles, a bimodal distribution was observed for
the reactions exchanging metabolites between the periplasm and extracellular space (Periplasm → Extracellular and
Extracellular → Periplasm). Specifically, 229 of the 1,591 high-quality genome ensembles in the dataset generated
ensembles with elevated � → � reactions and 233 genomes generated ensembles with elevated
� → � (threshold of 0.03). Moreover, 228 of these genomes were shared between the two groups.
Thus, we conclude this bimodality is generated by the high flux between the periplasm and
extracellular space in a subset of the models. As this is observed in the high-quality ensembles, we
hypothesize that this is likely a conserved metabolic feature that is specific to this group of
genomes. While investigating the specific metabolisms generating this feature is beyond the scope
of this work, we would expect this pattern from metabolisms such as the consumption of large
polymeric compounds that must be degraded in stages. Breaking down these compounds often
requires degradation to smaller oligomers in the periplasm as well as secreting externally active
enzymes that would elevate the relative investment into these two reaction types.
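The thresholding and overlap check described above can be sketched as follows. This is a hypothetical illustration: the genome identifiers and frequencies are placeholders, and treating the 0.03 threshold as a strict "greater than" cutoff on the normalized reaction frequency is an assumption.

```python
THRESHOLD = 0.03  # frequency cutoff quoted in the text


def elevated(freq_by_genome, threshold=THRESHOLD):
    """Return the set of genome IDs whose reaction frequency exceeds the threshold."""
    return {genome for genome, freq in freq_by_genome.items() if freq > threshold}


# Hypothetical frequencies for the two periplasm/extracellular exchange directions.
direction_one = {"g1": 0.041, "g2": 0.012, "g3": 0.035, "g4": 0.008}
direction_two = {"g1": 0.038, "g2": 0.010, "g3": 0.029, "g4": 0.033}

# Genomes elevated in both directions.
shared = elevated(direction_one) & elevated(direction_two)
```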
Due to computational resource limitations, the knockout experiment was conducted on a
subset of genomes instead of on the entire dataset. Specifically, we conducted this experiment
using the genomes from the Flavobacteriales Order as this group contained a substantial number
of both low- and high-quality ensembles and was phylogenetically distant from the reference
models used to build the CarveMe software. To confirm that this subset of genomes was
representative of the larger dataset, we compared the metabolic structure of the 263 high-quality
Flavobacteriales ensembles to the full set of 1,591 high-quality ensembles. The reaction type
distributions for the Flavobacteriales genomes were statistically indistinguishable from those of the full set of
high-quality ensembles for all 9 reaction types (Appendix Figure C6). Thus, the Flavobacteriales
ensembles provide a representative subset of the larger dataset on which to conduct the reaction
knockout experiments.
3.4 Knockout Experiments
The knowledge used to generate metabolic models from genomes is inherently incomplete
due to genes of unknown function, incomplete genomes, or both. Here we assessed how a loss
of information impacts the quality of CarveMe models by performing knockout simulations in
which we randomly removed annotated reactions and assessed the impact on the resulting model.
To determine how many reactions to knock out, we tested our procedure on one of the 263
Flavobacteriales genomes that generated high-quality ensembles (consensus score of 0.999). We generated 100
replicates for each of a range of knockout group sizes: 2, 5, 10, 20, and 40 reactions. On average, there
was a substantial drop in consensus score for all knockout sizes, with the largest
decrease of 0.29 at 5 knockouts (range 0.25-0.29). Removing more than 5 reactions produced a slightly
smaller reduction in consensus score but, as we analyze in detail below, there was significant
variability in the impact of these knockouts, so it is not clear how meaningful this trend
is (Appendix Figure C7). We selected a knockout size of 5 for the subsequent experiment.
A large set of 65,750 knockout ensembles were generated – 250 different knockout
experiments for each of the 263 Flavobacteriales genomes. There was substantial variation in the
change in ensemble consensus score as a result of these knockouts. This was expected, as different
reactions should differ in their impact on the ability of the CarveMe software to find a
consistent model solution. Specifically, for each knockout experiment, 5 reactions were randomly
selected to be removed such that for some knockout experiments the impact of removing reactions
might be minimal while for other sets of 5 reactions the impact might be quite large. Furthermore,
the impact of removing different reactions should vary by genome.
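The random knockout design described above can be sketched as follows. This is an illustrative example only: the reaction identifiers, the size of the reaction pool, and the function name are hypothetical, and the subsequent step of rebuilding the ensemble with CarveMe is not shown.

```python
import random


def draw_knockouts(annotated_reactions, n_experiments=250, knockout_size=5, seed=0):
    """Draw independent random knockout sets from one genome's annotated reactions."""
    rng = random.Random(seed)
    pool = sorted(annotated_reactions)  # fixed order keeps the draws reproducible
    # Each experiment samples without replacement within the set of 5.
    return [rng.sample(pool, knockout_size) for _ in range(n_experiments)]


# Hypothetical genome with 600 annotated reactions named rxn_000 ... rxn_599.
genome_reactions = {f"rxn_{i:03d}" for i in range(600)}
experiments = draw_knockouts(genome_reactions)

# 250 experiments for each of 263 genomes gives the 65,750 ensembles in the text.
total_ensembles = 250 * 263
```

Because each set is drawn independently, the number of times any single reaction is sampled varies widely, which is consistent with the uneven replicate counts per reaction reported in this section.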
Overall, removing reactions from the highest quality ensembles created substantially larger
decreases in consensus score than when reactions were removed from ensembles with consensus
scores closer to 0.8 (Appendix Figure C8). There was a significant inverse linear relationship
between the original model consensus score and the average change in consensus after knockouts
(Pearson’s correlation, r = −0.455, p < 0.01) (Appendix Figure C8). Within the 65,750
knockout ensemble runs, 3,068 reactions were knocked out at least one time with a reaction being
removed 93 times on average (range 1-334). The average drop in consensus score by reaction was
0.134 (Figure 4.4) – this is the drop in consensus score averaged across all knockout ensembles
that removed a given reaction. No significant differences were observed in the average change in
consensus when we grouped reactions based on the 9 reaction types (e.g., Cytoplasm → Cytoplasm
or Extracellular → Cytoplasm) (Appendix Figure C9). This suggests that the CarveMe software is
not particularly susceptible to loss of information in one of the reaction types compared to the
others.
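For reference, the Pearson correlation between original consensus score and mean change in consensus can be computed as in the following sketch. The input values are hypothetical placeholders chosen only to show a negative relationship; they are not the data behind the reported r = −0.455.

```python
from statistics import mean


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Hypothetical original consensus scores and mean changes after knockout.
original = [0.80, 0.85, 0.90, 0.95, 1.00]
mean_change = [-0.05, -0.10, -0.08, -0.15, -0.20]

r = pearson_r(original, mean_change)
```

A negative `r` corresponds to the pattern described above, in which ensembles with higher starting consensus lose more consensus after knockouts.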
While most knockout experiments resulted in a reduction or no change in consensus score,
14.3% of knockout experiments resulted in a substantial increase in consensus score (defined here
as an increase of at least 0.01). Overall, the knockout experiments that resulted in increased
consensus scores had on average lower original consensus scores (mean = 0.856) than the ensembles
where knockouts generated a reduction or no change (mean = 0.942). When assessed at the individual
reaction level, only 3.9% (N=120) of the knocked-out reactions produced an average increase in
ensemble consensus across the full set of 65,750 knockout experiments. However, these 120
reactions were also sampled much more infrequently, averaging just 3 replicates with a maximum
sampling size of 15 replicates compared to the experiment-wide average of 93 replicates (range 1-334).
Figure 4.4: Change in consensus score as a result of reaction knockouts. Scatterplot of the average starting and
post-knockout consensus score per reaction surveyed during our knockout simulations (N = 3,068 total reactions). Points
are colored and sized by the number of replicate ensembles the reaction was knocked out in. The solid black line
represents the 1:1 line where knocking out a reaction would, on average, create no change in ensemble consensus. The
dashed black line represents the average decrease in consensus score across all the knockout experiments.
Thus, it is likely that many of these reactions were simply under-sampled in our dataset and
the stochasticity of the carving process was primarily driving the consensus increases. We again
highlight that ensemble consensus does not necessarily directly correlate to high metabolic fidelity
to the organism. Rather, a high ensemble consensus score tells us that CarveMe reliably identifies
one optimal metabolic configuration for the input genome. Thus, it is conceivable, especially for
ensembles with lower and more variable consensus scores (see Section 3.1), that removing some
reactions actually helps CarveMe find a reproducible solution. For these instances, it is
likely that while the results are more consistent, they might also have lower fidelity to the true
metabolism of the genome, if it were experimentally characterized. For the remainder of our
analysis, we focus on the reactions that produced decreases in consensus to characterize the
magnitude and variance in decreases when knocking out different reactions. This further allowed
us to identify specific reactions that consistently produce decreases in consensus that would be
good candidates for experimental characterization, so they can be annotated better.
To better understand the distribution of the consensus changes in our knockout ensembles,
we performed a rank abundance analysis of the 3,068 unique reactions surveyed in the reaction knockout
experiments. These reactions were ranked according to the mean change in consensus produced
across all replicates where that reaction was knocked out (Figure 4.5). The bulk of the reaction
data, defined here as 90% of the total set of reactions, had a mean change in consensus score
between −0.024 and −0.244 with a mean decrease of 0.134 (Figure 4.5). These reactions were
also the most frequently knocked out with each reaction being knocked out in approximately 106
ensembles on average compared to 93 ensembles for all reactions in the dataset. 7.4% of the total
set of reactions had changes in consensus scores that fell above the bulk data with an average
increase in consensus score of 0.033. As discussed above, 3.9% of all reactions resulted in an
increase in consensus score on average, while the additional 3.5% of reactions in the upper tail of
the data produced average decreases between 0 and 0.024. These reactions in the upper tail were
under-sampled relative to the average sampling rate, with 4 replicates per reaction relative to 93. The
remaining 2.6% of reactions surveyed were found in the lower tail of the data. These reactions
produced on average large decreases in consensus (average decrease of 0.328). Similar to the
subset of reactions that resulted in increased consensus scores, the reactions that resulted in large
decreases were on average under-sampled relative to the rest of the dataset, with 5 replicates per reaction
relative to 93.
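The ranking and partitioning into upper tail, bulk, and lower tail can be sketched as follows. The bulk bounds are the values quoted above; the reaction names and mean changes are hypothetical placeholders, and how the bounds themselves were chosen is not reproduced here.

```python
BULK_LOW, BULK_HIGH = -0.244, -0.024  # bulk bounds on mean consensus change, from the text


def rank_and_partition(mean_change_by_reaction):
    """Rank reactions from largest increase to largest decrease and split them into groups."""
    ranked = sorted(mean_change_by_reaction.items(), key=lambda kv: kv[1], reverse=True)
    groups = {"upper_tail": [], "bulk": [], "lower_tail": []}
    for reaction, change in ranked:
        if change > BULK_HIGH:
            groups["upper_tail"].append(reaction)
        elif change < BULK_LOW:
            groups["lower_tail"].append(reaction)
        else:
            groups["bulk"].append(reaction)
    return ranked, groups


# Hypothetical mean changes in consensus per reaction (placeholders only).
toy = {"rxn_a": 0.03, "rxn_b": -0.01, "rxn_c": -0.13, "rxn_d": -0.20, "rxn_e": -0.33}
ranked, groups = rank_and_partition(toy)
```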
The knockout experiments removed the annotation information for reactions rather than
requiring them to be excluded from the model. Thus, the reactions that were knocked out were still
considered by CarveMe as possible reactions for inclusion in the final model. We assessed how
many times CarveMe included a reaction that had been knocked-out in the final model ensemble
(we term this ‘adding back the reaction’). A knockout reaction was considered to be added back if
it was present in at least one of the 60 models. In the bulk data (90%), knockout reactions were
added back 61.2% of the time (Figure 4.5). By contrast, reactions in the upper tail (average changes
in consensus greater than −0.024) were added back substantially less frequently with only 42.8%
of the reactions being added back. Reactions in the lower tail of the data (average decrease in
consensus greater than 0.244) were added back at a rate similar to the bulk data, with reactions being
added back to the resulting ensembles 57.6% of the time on average. While the reactions in the
tails are under-sampled relative to the bulk data, these results do suggest that the reactions that are
not added back into the model are more likely to produce consensus increases while reactions that
are more consistently added back produce consensus decreases.
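The "added back" criterion can be sketched as follows, under the assumption that each model in an ensemble is represented as a set of reaction identifiers. The identifiers and the toy ensemble are hypothetical; only the rule that a reaction counts as added back when it appears in at least one of the 60 models comes from the text.

```python
def added_back(knocked_out, ensemble_models):
    """Return the knocked-out reactions present in at least one model of the ensemble."""
    present = set().union(*ensemble_models)
    return set(knocked_out) & present


# Hypothetical ensemble of 60 models, each a set of reaction IDs.
base = {"rxn_1", "rxn_2", "rxn_3"}
models = [set(base) for _ in range(60)]
models[7].add("ko_a")   # a single model re-adds ko_a
models[12].add("ko_b")  # two models re-add ko_b
models[40].add("ko_b")

knockouts = ["ko_a", "ko_b", "ko_c", "ko_d", "ko_e"]
restored = added_back(knockouts, models)
add_back_rate = len(restored) / len(knockouts)
```

Counting the size of `restored` per ensemble gives the number of knockout reactions added back (0 to 5) for that ensemble.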
We extended this analysis to assess on a per ensemble basis how many of the 5 knockout
reactions were added back. When only 1 or 2 reactions were added back, the distributions of
consensus changes were centered around a median value of approximately 0 (< 0.001) (Appendix
Figure C10). The ensembles where all 5 reactions were added back had significantly larger
decreases in consensus score (average decrease of 0.21) than the ensembles where none, 1, 2,
or 3 reactions were added back (pairwise Tukey test, p < 0.01, Appendix Figure C10). Overall,
there was a negative trend between the number of reactions added back and the change in
consensus.
Figure 4.5: Rank abundance curve of the average change in consensus score after knockouts for each of the 3,068
reactions surveyed. Points are colored by the proportion of times that individual reaction was added back to an
ensemble after being knocked out and sized by the number of total replicate ensembles per reaction. The solid black
lines denote the bulk of the data (90%) which fall between a consensus score change of -0.024 and -0.244. The dashed
black line represents the dataset average change in consensus score across all reactions of -0.134.
3.5 Knockout Network Metabolic Compartment Assessment
Clear differences were observed between low- and high-quality ensembles across the entire
dataset (Section 3.2). With the knockout experiments, we were able to demonstrate how loss of
information about genomic content (knockouts) typically resulted in a reduction in the consensus
score. Here we assessed whether the structures of the low- and high-quality ensemble networks
from the knockout experiments were similar to the structures of the low- and high-quality
ensemble networks from the original dataset. While the vast majority of ensembles experienced
reductions in consensus after knocking reactions out (mean decrease of 0.134), only a small
proportion of genomes dropped sufficiently to fall into our low-quality group with a consensus
score below 0.5 (Figure 4.6). Of our 65,750 knockout ensembles, 28,884 were
high-quality ensembles (consensus score ≥ 0.8) and 2,169 were low-quality ensembles (consensus
score ≤ 0.5). (Note: all genomes had original ensemble consensus scores prior to knockout of
≥ 0.8.) To compare even sample sizes,
we randomly drew 2,169 ensembles from the pool of high-quality ensembles, equivalent to the
total number of low-quality ensembles.
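The quality grouping and balanced subsampling described above can be sketched as follows. The consensus cutoffs and group sizes are those quoted in the text; the ensemble identifiers, function names, and random seed are hypothetical.

```python
import random

HIGH_CUTOFF, LOW_CUTOFF = 0.8, 0.5  # consensus score thresholds quoted in the text


def quality_label(consensus):
    """Label an ensemble by its consensus score."""
    if consensus >= HIGH_CUTOFF:
        return "high"
    if consensus <= LOW_CUTOFF:
        return "low"
    return "intermediate"


def downsample(larger_group, target_size, seed=0):
    """Draw target_size items without replacement from the larger group."""
    return random.Random(seed).sample(larger_group, target_size)


# Hypothetical ensemble identifiers, sized to match the counts in the text.
high_ids = [f"high_{i}" for i in range(28884)]
low_ids = [f"low_{i}" for i in range(2169)]
balanced_high = downsample(high_ids, len(low_ids))
```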
The low- and high-quality knockout ensembles showed similar patterns to the original
dataset. Specifically, no significant differences were observed in the structure of the low- and
high-quality ensemble networks for the knockout experiments (based on measures of node degree, eigen
centrality, and betweenness centrality), consistent with the results for the original dataset. However,
when we compared the frequencies of nodes (metabolites) found in each of the three cellular
compartments for the two groups of knockout ensembles, consistent differences emerged (Table
4.2a). As with the original dataset, the low-quality knockout ensembles had greater frequencies of
extracellular metabolites, averaging 19.8% of their total metabolites compared to 19.1% for the
high-quality knockout ensembles (Welch’s modified t-test, p ≪ 0.01). The high-quality knockout
ensembles had greater frequencies of metabolites in the cytoplasm, averaging 62.7% of their total
metabolites relative to just 61.4% for the low-quality knockout ensembles (Welch’s modified
t-test, p ≪ 0.01), again consistent with the original dataset. This shift occurred even though we were
typically knocking out less than 1% of the total eligible reactions and drawing the reactions to
knock out randomly from among all 9 reaction types. The relative frequencies of the 9 reaction
types that were knocked out during these simulations are provided in Appendix Table C1.
Table 4.2: Statistical significance values for the metabolite (a) and reaction type (b) frequency comparisons between
the good and bad knockout ensembles.
a.
Compartment      Extracellular     Cytoplasm         Periplasm
p-value          < 2.2 × 10^-16    < 2.2 × 10^-16    < 2.2 × 10^-16
b.
Compartment      Extracellular     Cytoplasm         Periplasm
Extracellular    0.13              6.63 × 10*.       1.79 × 10^-6
Cytoplasm        < 2.2 × 10^-16    < 2.2 × 10^-16    < 2.2 × 10^-16
Periplasm        1.55 × 10*+/      < 2.2 × 10^-16    < 2.2 × 10^-16
Significant differences were observed in the distributions of reaction types between the
low- and high-quality knockout ensembles (i.e., � → � or � → � reactions). We saw similar
patterns to our analysis of the original ensembles where 8 of the 9 reaction type distributions were
different between the low- and high-quality knockout ensembles. While the differences in
percentages are small, all differences were statistically significant (Welch’s modified t-test, p ≪
0.01, Table 4.2b). In particular, the low-quality knockout ensembles had greater frequencies of � →
� and � → � type reactions, with lower frequencies of � → � type reactions as compared to the
high-quality knockout ensembles – as was observed in the original dataset. Overall, 7 of the 9
reaction types had the same differences in reaction frequency as the original dataset. The � → �
and � → � reaction types in the knockout experiments did show different trends to the original
data with the high-quality ensembles having lower frequencies of � → � reactions than the
low-quality ensembles. The Extracellular → Extracellular type reactions, by contrast, were statistically
indistinguishable in the knockout ensembles but were different in the original dataset, with low-quality networks having
significantly more of this reaction type. This assessment suggests that our knockout experiments
were able to reduce the ensemble consensus score sufficiently to convert high-quality ensembles
to low-quality ensembles, and demonstrated that these low-quality ensembles have a reproducible
metabolic structure.
Figure 4.6: Scatterplot of the average change in consensus score versus the standard deviation in consensus change
per reaction for the 3,068 reactions surveyed. Points are colored and sized by the number of replicate ensembles they
were knocked out in. The black dashed line represents the line of slope -1 which separates the reactions based on
whether the absolute value of the mean consensus change is larger (below the line) or smaller (above the line) than
the standard deviation of consensus change. Similarly, the orange dashed line represents the line of slope -0.5 which
separates the reactions based on whether the absolute value of the mean consensus change is more (below the line) or
less (above the line) than twice as large as the standard deviation.
Table 4.3: The 14 reactions determined to be high impact reactions by having average decreases twice as great as their
standard deviations.
Reaction (BiGG notation)  Reaction (Descriptive Name)                                      Number of replicates  Mean decrease in consensus  Standard Deviation
R_AMPD2                   AMP deaminase                                                    7                     0.250                       0.081
R_BTNt2i                  Biotin uptake                                                    7                     0.295                       0.135
R_BZDH                    Benzaldehyde dehydrogenase                                       9                     0.169                       0.069
R_CPGNtonex               Coprogen transport via ton system (extracellular)                7                     0.229                       0.089
R_DESAT16                 Palmitoyl CoA desaturase n C160CoA n C161CoA                     10                    0.280                       0.034
R_FRDx                    Fumarate reductase NADH                                          8                     0.192                       0.085
R_GPDDA3pp                Glycerophosphodiester phosphodiesterase (Glycerophosphoserine)   6                     0.222                       0.079
R_HACD4i                  3-hydroxyacyl-CoA dehydrogenase (3-oxodecanoyl-CoA)              8                     0.216                       0.093
R_HPPK2_1                 6-hydroxymethyl-dihydropterin pyrophosphokinase                  7                     0.387                       0.111
R_O2XOADOX                None reported in BiGG database                                   6                     0.435                       0.092
R_PGPP140                 Phosphatidylglycerol phosphate phosphatase (n-C14:0)             13                    0.242                       0.104
R_PLIPA1G140pp            Phospholipase A1 (phosphatidylglycerol, nC14:0) (periplasm)      11                    0.113                       0.051
R_PLIPA2E120pp            Phospholipase A2 (phosphatidylethanolamine n-C12:0) (periplasm)  9                     0.134                       0.046
R_URIC                    Uricase                                                          7                     0.152                       0.060
3.6 High Impact Knockout Reactions
We were particularly interested in identifying whether specific reactions sampled
during our knockout experiments consistently produced significant reductions in ensemble
consensus. To investigate this, we aggregated our data by individual knocked-out reaction and
computed the mean and standard deviation of the change in consensus for each of the 3,068 unique
reactions knocked out at least once during our experiment. We then filtered this dataset for
reactions that were knocked out at least 5 times and had a mean drop in consensus greater than the
standard deviation of the consensus drop for that reaction. Using these filters, we found
247 reactions that matched these criteria, which are enumerated in Appendix Table C2. The
majority of these reactions took place within the cytoplasm compartment (Cytoplasm → Cytoplasm, 73.7% of
reactions), with the remaining reactions belonging to every other reaction type except from the periplasm to the
extracellular space (Periplasm → Extracellular).
To look more rigorously at reactions with statistically significant negative impacts on
consensus, we further filtered this dataset for reactions that appeared at least 5 times and had a
mean drop in consensus twice as large as the standard deviation of consensus drop for that reaction.
We found that 14 reactions matched these criteria; they are shown in Table 4.3. These reactions took
place within the cytoplasm compartment (Cytoplasm → Cytoplasm, 9 reactions), within the periplasm compartment
(Periplasm → Periplasm, 3 reactions), from the extracellular to cytoplasm compartment (Extracellular → Cytoplasm, 1 reaction), and
from the periplasm to cytoplasm compartment (Periplasm → Cytoplasm, 1 reaction). Since these reactions have
mean decreases twice as large as their standard deviations, we can infer that, within roughly two standard
deviations of the mean, knocking out these reactions only produces drops in consensus. We hypothesize that these 14
reactions would be excellent candidates for characterization in marine organisms in order to
stabilize the consistency of marine metabolic model annotation. Many of these reactions were
associated with fatty acid metabolism, particularly oxidoreductases including multiple forms of
Coenzyme A.
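The two-stage filter described in this section can be sketched as follows. The first three entries reuse the replicate counts, mean decreases, and standard deviations listed in Table 4.3; the last two entries are hypothetical reactions added only to show the filter rejecting cases. The function name and the use of "greater than or equal to" for the ratio comparison are assumptions.

```python
def high_impact(stats, min_replicates=5, ratio=2.0):
    """Keep reactions whose mean consensus drop is at least `ratio` times its standard deviation."""
    return sorted(
        reaction
        for reaction, (n, mean_drop, sd) in stats.items()
        if n >= min_replicates and mean_drop >= ratio * sd
    )


# Values are (replicates, mean decrease in consensus, standard deviation).
stats = {
    "R_AMPD2": (7, 0.250, 0.081),                # from Table 4.3
    "R_DESAT16": (10, 0.280, 0.034),             # from Table 4.3
    "R_URIC": (7, 0.152, 0.060),                 # from Table 4.3
    "R_hypothetical_noisy": (12, 0.150, 0.120),  # hypothetical: mean below 2x SD
    "R_hypothetical_rare": (3, 0.400, 0.050),    # hypothetical: too few replicates
}

selected = high_impact(stats)                 # stricter filter (mean at least 2x SD)
relaxed = high_impact(stats, ratio=1.0)       # looser filter (mean at least 1x SD)
```

With these inputs the stricter filter keeps only the three Table 4.3 reactions, while the looser filter additionally admits the noisy hypothetical reaction.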
4 Discussion
Automated GEM reconstruction tools, specifically CarveMe, leverage numerical optimization techniques to take
incomplete metabolic annotations and predict complete organismal metabolism without user
supervision or manual curation. As such, understanding the quality of these predictions is crucial
for determining how reliably we can use these methods for classification or comparison to other
measurement data. We demonstrate that the outcome of a CarveMe ensemble generation is highly
dependent on the number and type of BiGG proteins annotated in the genome a priori. When we
begin to remove BiGG annotations using an in silico style knockout scheme, the consistency of
previously high-quality model ensembles can decrease substantially to the point that those
ensembles become pseudo-random. We identify specific reactions that consistently produce
decreases in model consistency, measured as ensemble consensus, and typically produce
substantial decreases. We show that, while there are not significant differences in the traditional
structure of these metabolic networks, there are fundamental differences in the frequency of
metabolite and reaction types depending on whether an ensemble network is of high or low-quality.
Furthermore, through our in silico knockouts, we demonstrate that these differences are
reproducible, and can be observed even in low-quality ensembles produced by modifying
previously high-quality ensembles.
Our results suggest that there are specific reactions that may be keystones for the outcome
of a CarveMe ensemble generation. These reactions consistently produced decreases in model
consensus when they were included in the 5 reactions that were knocked out. However, the broad
distribution of, and large standard deviations in, consensus change suggest that CarveMe
outcomes are likely very individualized to the specific genome of interest. These results make it
clear that the number and accuracy of the reaction annotations for each input genome are
paramount to creating a high-quality model ensemble. For potential keystone reactions,
experimentally characterizing the genes that compose these reactions in marine organisms could
lead to substantial improvements in the frequency with which these reactions are annotated and
annotated accurately. Thus, we suggest that characterizing specific reactions with large impacts on
model consensus in marine strains is an excellent place to begin experimental efforts in support of
better CarveMe outcomes.
An important finding of this study is that low-quality ensemble networks are more likely
to have greater quantities of metabolite import and export reactions, occupying a significantly
larger proportion of their total metabolism, than high-quality ensembles. This means that these
low-quality metabolisms are predicted to have the highest amount of exchange and interaction
with the extracellular space. In the environment, the extracellular metabolite space is crucial for
the formation, structure, and interactions of marine microbial communities. Several previous
studies have attempted to use metabolic models to interrogate the composition of the extracellular
space for metabolite exchange and predict metabolic interactions (cooperation vs competition)
within microbial communities. In this light, our findings are particularly troubling because they
suggest that low-quality models disproportionately interact with the extracellular metabolite space.
Thus, predicted interactions within marine microbial communities are most likely enriched in, if
not totally dominated by, unconstrained metabolic predictions. Without controlling for model
quality, these measurements of community interaction levels and types cannot reliably be
leveraged for further insight and analysis given the cosmopolitan prevalence of unconstrained
model predictions. We strongly recommend that future community interaction studies centered
around predicting uncultured organismal metabolisms with automated metabolic models
implement some kind of quality control for model consistency.
We find that unconstrained metabolic predictions are endemic to CarveMe and can be
caused by gaps in annotation for reactions across the universal model space. Undeniably, we have
demonstrated that these automated modeling programs are generally sensitive to the amount of
annotated information available at the outset of model optimization. With a modification of just
0.5% of reactions producing greater than 50% reductions in ensemble consensus, it is clear that
missing or incorrect annotations are hugely detrimental to extracting consistent metabolism from
automated metabolic modeling tools. However, we have further shown that these unconstrained
models predict a specific kind of metabolism with measurably anomalous features that promote
greater metabolite exchange with the environment. This finding will aid in the development and
refinement of GEMs in the future by providing a target behavior for developers to minimize from
their model predictions during benchmarking. Applying intrinsic measures of model reliability as
we have done here will also enable studies such as the interaction studies mentioned above that
draw conclusions directly predicated on these GEMs to appropriately filter their data. Isolating
only the well-constrained metabolisms from their modeling surveys will allow them to make much
more robust predictions of metabolite exchange in in silico communities of uncultured organisms.
Chapter 5: Conclusions
This thesis identifies the need to develop new frameworks for identifying and modeling
the differences between marine microbial heterotrophs beyond the traditional paradigm of
phylogeny. Phylogenetic differences have been – and remain – a powerful, methodologically
sound, and accessible way to distinguish organisms within a microbial community. However, I
have expanded the growing body of evidence that these phylogenetic differences often do not
accurately reflect the functional and metabolic differences that are integral to the biogeochemical
function and stability of microbial communities. Furthermore, it is clear that there are a wide
variety of mechanisms through which heterotrophic microbes control biogeochemical rates, and
that these mechanisms are functionally redundant across many taxonomic groups. With our current
understanding of marine heterotrophic metabolism, it will remain virtually infeasible (barring
significant advances in computer processing) to directly model the vast diversity of marine
heterotrophs. It is also not computationally tractable to incorporate heterotrophic metabolism at
the resolution of genome-scale metabolic models into large-scale biogeochemical models.
However, I argue that a tractable path forward is to incorporate simplified metabolisms that
represent the first-order processes necessary for capturing variability in remineralization rates. This
approach, however, requires that marine diversity and metabolism be simplified into functional
archetypes or guilds. This will allow us to identify a limited set of representative organisms that
more efficiently reflect the metabolically distinct groups of organisms that underpin natural marine
microbial communities. This set of representative organisms will enable us to simulate classes of
marine heterotrophs using explicit population models in a way analogous to how we currently
model marine phytoplankton.
For decades, marine microbiology existed in a “data poor” state reliant on developing a
knowledge base of the ocean directly from bulk numerical measurements of growth and enzymatic
rates, uptake kinetics, or compound concentrations. Efforts to develop and extend this knowledge
base required additional experimental studies that involved extensive cruise time, culturing efforts,
and explicit rate/concentration measurements. With the advent of genomics techniques, the field
is now becoming more “data rich” with the generation of high volumes of ‘omics measurements
and datasets (e.g., the OMD dataset used in Chapters 2 and 3). Without question the addition of
these ‘omics techniques, and the resulting datasets, have revolutionized our understanding of the
ocean. We now have a fundamental understanding of the distribution, abundance, and genomic
content of marine microbial diversity globally. However, the number of unannotated components
in these ‘omics datasets highlights the fact that our ability to sequence and identify genes, transcripts,
and proteins has now outpaced our fundamental knowledge of the function or purpose of these
genetic elements. This disconnect between the quantity of data and our ability to interpret the data
points to the continued need for experimental studies and culturing efforts. I propose that
marine microbiology finds itself in a unique position of being “understanding poor” and thus
unable to fully leverage newfound sources of data, like ‘omics data, to further improve our
understanding of marine microbial life.
We have at our fingertips a vast and growing wealth of genomic, transcriptomic, and
proteomic data yet lack a functional understanding of huge swathes of genes, transcripts, and
proteins identified by these sequencing efforts. The original Tara Oceans paper, for instance,
reported approximately 80% of the identified genes as novel and 40% of these genes as having an
unknown function (Sunagawa et al., 2015). I demonstrate that this issue of limited annotation (or
poor annotations) is especially challenging as we attempt to parlay this vast genomic data into
more functional data types, such as genome-scale metabolic models (GEMs). In particular, a
profound lack of marine heterotrophic model organisms for metabolic curation and validation
creates significant issues for the development of high fidelity marine heterotrophic GEMs. To
begin to contextualize this ‘omics data and use it to its fullest potential, we need to continue our
cultivation efforts of key heterotrophs (e.g., SAR11 and R. pomeroyi) as model organisms. This
will allow us to truly expand our knowledge base and begin to rival model organisms like E. coli
and C. elegans that have been developed in traditional human and terrestrial microbiology.
Establishing techniques for building comprehensive culture libraries, as has been done for E. coli,
will crucially allow us to begin deciphering the function of the many unannotated genes and
proteins found in our ‘omics data via in situ knockouts and other experimental techniques.
Reassuringly, these efforts are already underway (Henson et al., 2018; Lanclos et al., 2023; Schroer
et al., 2023).
Characterizing the internal cellular machinery of marine heterotrophic organisms is
undoubtedly a crucial next step to a comprehensive understanding of the mechanistic
underpinnings of the remineralization process. However, as we expand these characterization
efforts, we also need to pay attention to the diverse and dynamic external metabolic landscape
these organisms interact with. From the phycosphere to particle snow, marine microbial life is
heavily reliant on interactions with its environment and other community members through
metabolite exchange. This secondary metabolite marketplace is crucial for heterotrophic microbial
community dynamics (Moran et al., 2022). Currently, we can classify these metabolites into
generic compound classes (e.g., amino acids), and into broad groups based on their ecological
roles, such as “infochemical” signaling compounds. However, due to the incredibly low
concentrations of these metabolites in the ocean, among other roadblocks, we do not have a robust
characterization of the full distribution of these essential metabolites. The secondary metabolite
marketplace plays a key role in determining community composition, which ultimately influences
rates of organic matter degradation and determines what compounds are drawn down or are
sequestered.
Organismal metabolism ultimately determines the amount and type of external metabolites
in the local extracellular space. I have shown that we do not currently have robust enough
metabolic models to confidently predict these distributions of external metabolites for marine
heterotrophs. To understand this at a deeper level we need to refine our knowledge base of
compound transporters, including the types of transport, their mechanisms, and differences in
uptake kinetics. We need to better characterize this secondary metabolite marketplace in order to
understand what compounds are strongly influencing community assembly, stability, and thus the
specific metabolic niches present. Each of these aims can be systematically addressed through the
development of stable model organism cultures capable of supporting high-throughput
experimental efforts.
By enhancing our culturing efforts and improving our classification of marine
heterotrophs into generalizable groups based directly on their metabolisms, we will solidify our
fundamental understanding of marine microbial heterotrophy. Developing stable cultures of model
organisms will enable us to characterize and understand the function of the genes and proteins that
drive biotic influences in the marine ecosystem. Experimentally characterizing these unannotated
genes and hypothetical proteins will unlock the wealth of ‘omics data that has already been
diligently and thoroughly collected. These robust annotations, in combination with continued
culturing efforts, will enable us to more comprehensively quantify and classify the metabolites
these organisms are producing. Understanding the type and quantity of metabolites that these
organisms are producing will allow us to better understand what metabolites are secreted or leaked
into the external environment to form the basis of community interactions. Quantifying and
enumerating the metabolites these organisms excrete and exchange will further inform our
understanding of what drives community formation and function. Ultimately, this will elucidate
the mechanisms of marine heterotrophy that set remineralization rates and enable us to accurately
model the biological component of this process at scale.
REFERENCES
Agrawal, R., Imieliński, T., Swami, A., 1993. Mining association rules between sets of items in
large databases, in: Proceedings of the 1993 ACM SIGMOD International Conference on
Management of Data - SIGMOD ’93. Presented at the 1993 ACM SIGMOD
international conference, ACM Press, Washington, D.C., United States, pp. 207–216.
https://doi.org/10.1145/170035.170072
Ault, T.R., 2000. Vertical migration by the marine dinoflagellate Prorocentrum triestinum
maximises photosynthetic yield. Oecologia 125, 466–475.
https://doi.org/10.1007/s004420000472
Aumont, O., Bopp, L., 2006. Globalizing results from ocean in situ iron fertilization studies.
Glob. Biogeochem. Cycles 20.
https://doi.org/10.1029/2005GB002591
Aumont, O., Ethé, C., Tagliabue, A., Bopp, L., Gehlen, M., 2015. PISCES-v2: an ocean
biogeochemical model for carbon and ecosystem studies. Geosci. Model Dev. 8, 2465–
2513. https://doi.org/10.5194/gmd-8-2465-2015
Azam, F., 1998. Microbial Control of Oceanic Carbon Flux: The Plot Thickens. Science 280,
694–696. https://doi.org/10.1126/science.280.5364.694
Badger, M.R., Andrews, T.J., Whitney, S.M., Ludwig, M., Yellowlees, D.C., Leggat, W., Price,
G.D., 1998. The diversity and coevolution of Rubisco, plastids, pyrenoids, and
chloroplast-based CO2-concentrating mechanisms in algae. Can. J. Bot. 76, 1052–1071.
https://doi.org/10.1139/b98-074
Baker, B.J., Lazar, C.S., Teske, A.P., Dick, G.J., 2015. Genomic resolution of linkages in carbon,
nitrogen, and sulfur cycling among widespread estuary sediment bacteria. Microbiome 3,
14. https://doi.org/10.1186/s40168-015-0077-6
Behrenfeld, M.J., O’Malley, R.T., Siegel, D.A., McClain, C.R., Sarmiento, J.L., Feldman, G.C.,
Milligan, A.J., Falkowski, P.G., Letelier, R.M., Boss, E.S., 2006. Climate-driven trends
in contemporary ocean productivity. Nature 444, 752–755.
https://doi.org/10.1038/nature05317
Bernstein, D.B., Sulheim, S., Almaas, E., Segrè, D., 2021. Addressing uncertainty in genome-scale metabolic model reconstruction and analysis. Genome Biol. 22, 64.
https://doi.org/10.1186/s13059-021-02289-z
Bingham, E., Kabán, A., Fortelius, M., 2009. The aspect Bernoulli model: multiple causes of
presences and absences. Pattern Anal. Appl. 12, 55–78. https://doi.org/10.1007/s10044-
007-0096-4
Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3, 993–1022.
Bligh, M., Nguyen, N., Buck-Wiese, H., Vidal-Melgosa, S., Hehemann, J.-H., 2022. Structures
and functions of algal glycans shape their capacity to sequester carbon in the ocean. Curr.
Opin. Chem. Biol. 71, 102204. https://doi.org/10.1016/j.cbpa.2022.102204
Boschker, H.T.S., Nold, S.C., Wellsbury, P., Bos, D., De Graaf, W., Pel, R., Parkes, R.J.,
Cappenberg, T.E., 1998. Direct linking of microbial populations to specific
biogeochemical processes by 13C-labelling of biomarkers. Nature 392, 801–805.
https://doi.org/10.1038/33900
Bray, J.R., Curtis, J.T., 1957. An Ordination of the Upland Forest Communities of Southern
Wisconsin. Ecol. Monogr. 27, 325–349. https://doi.org/10.2307/1942268
Brzezinski, M., Villareal, T., Lipschultz, F., 1998. Silica production and the contribution of
diatoms to new and primary production in the central North Pacific. Mar. Ecol. Prog. Ser.
167, 89–104. https://doi.org/10.3354/meps167089
Buchfink, B., Xie, C., Huson, D.H., 2015. Fast and sensitive protein alignment using
DIAMOND. Nat. Methods 12, 59–60. https://doi.org/10.1038/nmeth.3176
Calvin, K., Dasgupta, D., Krinner, G., Mukherji, A., Thorne, P.W., Trisos, C., Romero, J.,
Aldunce, P., Barrett, K., Blanco, G., Cheung, W.W.L., Connors, S., Denton, F., Diongue-Niang, A., Dodman, D., Garschagen, M., Geden, O., Hayward, B., Jones, C., Jotzo, F.,
Krug, T., Lasco, R., Lee, Y.-Y., Masson-Delmotte, V., Meinshausen, M., Mintenbeck,
K., Mokssit, A., Otto, F.E.L., Pathak, M., Pirani, A., Poloczanska, E., Pörtner, H.-O.,
Revi, A., Roberts, D.C., Roy, J., Ruane, A.C., Skea, J., Shukla, P.R., Slade, R., Slangen,
A., Sokona, Y., Sörensson, A.A., Tignor, M., Van Vuuren, D., Wei, Y.-M., Winkler, H.,
Zhai, P., Zommers, Z., Hourcade, J.-C., Johnson, F.X., Pachauri, S., Simpson, N.P.,
Singh, C., Thomas, A., Totin, E., Arias, P., Bustamante, M., Elgizouli, I., Flato, G.,
Howden, M., Méndez-Vallejo, C., Pereira, J.J., Pichs-Madruga, R., Rose, S.K., Saheb,
Y., Sánchez Rodríguez, R., Ürge-Vorsatz, D., Xiao, C., Yassaa, N., Alegría, A., Armour,
K., Bednar-Friedl, B., Blok, K., Cissé, G., Dentener, F., Eriksen, S., Fischer, E., Garner,
G., Guivarch, C., Haasnoot, M., Hansen, G., Hauser, M., Hawkins, E., Hermans, T.,
Kopp, R., Leprince-Ringuet, N., Lewis, J., Ley, D., Ludden, C., Niamir, L., Nicholls, Z.,
Some, S., Szopa, S., Trewin, B., Van Der Wijst, K.-I., Winter, G., Witting, M., Birt, A.,
Ha, M., Romero, J., Kim, J., Haites, E.F., Jung, Y., Stavins, R., Birt, A., Ha, M.,
Orendain, D.J.A., Ignon, L., Park, S., Park, Y., Reisinger, A., Cammaramo, D., Fischlin,
A., Fuglestvedt, J.S., Hansen, G., Ludden, C., Masson-Delmotte, V., Matthews, J.B.R.,
Mintenbeck, K., Pirani, A., Poloczanska, E., Leprince-Ringuet, N., Péan, C., 2023. IPCC,
2023: Climate Change 2023: Synthesis Report. Contribution of Working Groups I, II and
III to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change
[Core Writing Team, H. Lee and J. Romero (eds.)]. IPCC, Geneva, Switzerland.
Intergovernmental Panel on Climate Change (IPCC).
https://doi.org/10.59327/IPCC/AR6-9789291691647
Capella-Gutierrez, S., Silla-Martinez, J.M., Gabaldon, T., 2009. trimAl: a tool for automated
alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973.
https://doi.org/10.1093/bioinformatics/btp348
Caporaso, J.G., Lauber, C.L., Walters, W.A., Berg-Lyons, D., Huntley, J., Fierer, N., Owens,
S.M., Betley, J., Fraser, L., Bauer, M., Gormley, N., Gilbert, J.A., Smith, G., Knight, R.,
2012. Ultra-high-throughput microbial community analysis on the Illumina HiSeq and
MiSeq platforms. ISME J. 6, 1621–1624. https://doi.org/10.1038/ismej.2012.8
Caspi, R., Billington, R., Keseler, I.M., Kothari, A., Krummenacker, M., Midford, P.E., Ong,
W.K., Paley, S., Subhraveti, P., Karp, P.D., 2020. The MetaCyc database of metabolic
pathways and enzymes - a 2019 update. Nucleic Acids Res. 48, D445–D453.
https://doi.org/10.1093/nar/gkz862
Céréghino, R., Park, Y.-S., 2009. Review of the Self-Organizing Map (SOM) approach in water
resources: Commentary. Environ. Model. Softw. 24, 945–947.
https://doi.org/10.1016/j.envsoft.2009.01.008
Chaumeil, P.-A., Mussig, A.J., Hugenholtz, P., Parks, D.H., 2022. GTDB-Tk v2: memory
friendly classification with the genome taxonomy database. Bioinformatics 38, 5315–
5316. https://doi.org/10.1093/bioinformatics/btac672
Chin-Leo, G., Kirchman, D.L., 1988. Estimating Bacterial Production in Marine Waters from the
Simultaneous Incorporation of Thymidine and Leucine. Appl. Environ. Microbiol. 54,
1934–1939. https://doi.org/10.1128/aem.54.8.1934-1939.1988
Coles, V.J., Stukel, M.R., Brooks, M.T., Burd, A., Crump, B.C., Moran, M.A., Paul, J.H.,
Satinsky, B.M., Yager, P.L., Zielinski, B.L., Hood, R.R., 2017. Ocean biogeochemistry
modeled with emergent trait-based genomics. Science 358, 1149–1154.
https://doi.org/10.1126/science.aan5712
Danecek, P., Bonfield, J.K., Liddle, J., Marshall, J., Ohan, V., Pollard, M.O., Whitwham, A.,
Keane, T., McCarthy, S.A., Davies, R.M., Li, H., 2021. Twelve years of SAMtools and
BCFtools. GigaScience 10, giab008. https://doi.org/10.1093/gigascience/giab008
deLeeuw, J., 1992. Introduction to Akaike (1973) Information Theory and an Extension of the
Maximum Likelihood Principle, in: Kotz, S., Johnson, N.L. (Eds.), Breakthroughs in
Statistics, Springer Series in Statistics. Springer New York, New York, NY, pp. 599–609.
https://doi.org/10.1007/978-1-4612-0919-5_37
Delmont, T.O., Eren, A.M., 2018. Linking pangenomes and metagenomes: the Prochlorococcus
metapangenome. PeerJ 6, e4320. https://doi.org/10.7717/peerj.4320
Deulofeu-Capo, O., Sebastián, M., Auladell, A., Cardelús, C., Ferrera, I., Sánchez, O., Gasol,
J.M., 2024. Growth rates of marine prokaryotes are extremely diverse, even among
closely related taxa. ISME Commun. 4, ycae066. https://doi.org/10.1093/ismeco/ycae066
Dittmar, T., Lennartz, S.T., Buck-Wiese, H., Hansell, D.A., Santinelli, C., Vanni, C., Blasius, B.,
Hehemann, J.-H., 2021. Enigmatic persistence of dissolved organic matter in the ocean.
Nat. Rev. Earth Environ. 2, 570–583. https://doi.org/10.1038/s43017-021-00183-7
Ebrahim, A., Lerman, J.A., Palsson, B.O., Hyduke, D.R., 2013. COBRApy: COnstraints-Based
Reconstruction and Analysis for Python. BMC Syst. Biol. 7, 74.
https://doi.org/10.1186/1752-0509-7-74
Eddy, S.R., 2011. Accelerated Profile HMM Searches. PLoS Comput. Biol. 7, e1002195.
https://doi.org/10.1371/journal.pcbi.1002195
Edgar, R.C., 2021. High-accuracy alignment ensembles enable unbiased assessments of
sequence homology and phylogeny (preprint). Bioinformatics.
https://doi.org/10.1101/2021.06.20.449169
Enke, T.N., Leventhal, G.E., Metzger, M., Saavedra, J.T., Cordero, O.X., 2018. Microscale
ecology regulates particulate organic matter turnover in model marine microbial
communities. Nat. Commun. 9, 2743. https://doi.org/10.1038/s41467-018-05159-8
Eren, A.M., Esen, Ö.C., Quince, C., Vineis, J.H., Morrison, H.G., Sogin, M.L., Delmont, T.O.,
2015. Anvi’o: an advanced analysis and visualization platform for ‘omics data. PeerJ 3,
e1319. https://doi.org/10.7717/peerj.1319
Falkowski, P.G., Fenchel, T., Delong, E.F., 2008. The Microbial Engines That Drive Earth’s
Biogeochemical Cycles. Science 320, 1034–1039.
https://doi.org/10.1126/science.1153213
Fan, X., Qiu, H., Han, W., Wang, Y., Xu, D., Zhang, X., Bhattacharya, D., Ye, N., 2020.
Phytoplankton pangenome reveals extensive prokaryotic horizontal gene transfer of
diverse functions. Sci. Adv. 6, eaba0111. https://doi.org/10.1126/sciadv.aba0111
Faure, E., Ayata, S.-D., Bittner, L., 2021. Towards omics-based predictions of planktonic
functional composition from environmental data. Nat. Commun. 12, 4361.
https://doi.org/10.1038/s41467-021-24547-1
Fontanez, K.M., Eppley, J.M., Samo, T.J., Karl, D.M., DeLong, E.F., 2015. Microbial
community structure and function on sinking particles in the North Pacific Subtropical
Gyre. Front. Microbiol. 6. https://doi.org/10.3389/fmicb.2015.00469
Fuhrman, J.A., 2009. Microbial community structure and its functional implications. Nature 459,
193–199. https://doi.org/10.1038/nature08058
Fuhrman, J.A., Azam, F., 1982. Thymidine incorporation as a measure of heterotrophic
bacterioplankton production in marine surface waters: Evaluation and field results. Mar.
Biol. 66, 109–120. https://doi.org/10.1007/BF00397184
Fuhrman, J.A., Azam, F., 1980. Bacterioplankton Secondary Production Estimates for Coastal
Waters of British Columbia, Antarctica, and California. Appl. Environ. Microbiol. 39,
1085–1095. https://doi.org/10.1128/aem.39.6.1085-1095.1980
Fuhrman, J.A., McCallum, K., Davis, A.A., 1993. Phylogenetic diversity of subsurface marine
microbial communities from the Atlantic and Pacific Oceans. Appl. Environ. Microbiol.
59, 1294–1302. https://doi.org/10.1128/aem.59.5.1294-1302.1993
Galili, T., 2015. dendextend: an R package for visualizing, adjusting and comparing trees of
hierarchical clustering. Bioinformatics 31, 3718–3720.
https://doi.org/10.1093/bioinformatics/btv428
Getz, E.W., Lanclos, V.C., Kojima, C.Y., Cheng, C., Henson, M.W., Schön, M.E., Ettema,
T.J.G., Faircloth, B.C., Thrash, J.C., 2023. The AEGEAN-169 clade of bacterioplankton
is synonymous with SAR11 subclade V (HIMB59) and metabolically distinct. mSystems
8, e00179-23. https://doi.org/10.1128/msystems.00179-23
Giering, S.L.C., Sanders, R., Lampitt, R.S., Anderson, T.R., Tamburini, C., Boutrif, M., Zubkov,
M.V., Marsay, C.M., Henson, S.A., Saw, K., Cook, K., Mayor, D.J., 2014. Reconciliation
of the carbon budget in the ocean’s twilight zone. Nature 507, 480–483.
https://doi.org/10.1038/nature13123
Gifford, S.M., Sharma, S., Booth, M., Moran, M.A., 2013. Expression patterns reveal niche
diversification in a marine microbial assemblage. ISME J. 7, 281–298.
https://doi.org/10.1038/ismej.2012.96
Giordano, N., Gaudin, M., Trottier, C., Delage, E., Nef, C., Bowler, C., Chaffron, S., 2024.
Genome-scale community modelling reveals conserved metabolic cross-feedings in
epipelagic bacterioplankton communities. Nat. Commun. 15, 2721.
https://doi.org/10.1038/s41467-024-46374-w
Giovannoni, S.J., 2017. SAR11 Bacteria: The Most Abundant Plankton in the Oceans. Annu.
Rev. Mar. Sci. 9, 231–255. https://doi.org/10.1146/annurev-marine-010814-015934
Graham, E.D., Heidelberg, J.F., Tully, B.J., 2018. Potential for primary productivity in a
globally-distributed bacterial phototroph. ISME J. 12, 1861–1866.
https://doi.org/10.1038/s41396-018-0091-3
Graham, E.D., Heidelberg, J.F., Tully, B.J., 2017. BinSanity: unsupervised clustering of
environmental microbial assemblies using coverage and affinity propagation. PeerJ 5,
e3035. https://doi.org/10.7717/peerj.3035
Gralka, M., Pollak, S., Cordero, O.X., 2023. Genome content predicts the carbon catabolic
preferences of heterotrophic bacteria. Nat. Microbiol. 8, 1799–1808.
https://doi.org/10.1038/s41564-023-01458-z
Gu, C., Kim, G.B., Kim, W.J., Kim, H.U., Lee, S.Y., 2019. Current status and applications of
genome-scale metabolic models. Genome Biol. 20, 121. https://doi.org/10.1186/s13059-
019-1730-3
Guo, A.C., Jewison, T., Wilson, M., Liu, Y., Knox, C., Djoumbou, Y., Lo, P., Mandal, R.,
Krishnamurthy, R., Wishart, D.S., 2012. ECMDB: The E. coli Metabolome Database.
Nucleic Acids Res. 41, D625–D630. https://doi.org/10.1093/nar/gks992
Hartigan, J.A., Wong, M.A., 1979. Algorithm AS 136: A K-Means Clustering Algorithm. Appl.
Stat. 28, 100. https://doi.org/10.2307/2346830
Hastings, J., Owen, G., Dekker, A., Ennis, M., Kale, N., Muthukrishnan, V., Turner, S.,
Swainston, N., Mendes, P., Steinbeck, C., 2016. ChEBI in 2016: Improved services and
an expanding collection of metabolites. Nucleic Acids Res. 44, D1214–D1219.
https://doi.org/10.1093/nar/gkv1031
Hehemann, J.-H., Arevalo, P., Datta, M.S., Yu, X., Corzett, C.H., Henschel, A., Preheim, S.P.,
Timberlake, S., Alm, E.J., Polz, M.F., 2016. Adaptive radiation by waves of gene transfer
leads to fine-scale resource partitioning in marine microbes. Nat. Commun. 7, 12860.
https://doi.org/10.1038/ncomms12860
Henry, C.S., DeJongh, M., Best, A.A., Frybarger, P.M., Linsay, B., Stevens, R.L., 2010. High-throughput generation, optimization and analysis of genome-scale metabolic models. Nat.
Biotechnol. 28, 977–982. https://doi.org/10.1038/nbt.1672
Henson, M.W., Lanclos, V.C., Faircloth, B.C., Thrash, J.C., 2018. Cultivation and genomics of
the first freshwater SAR11 (LD12) isolate. ISME J. 12, 1846–1860.
https://doi.org/10.1038/s41396-018-0092-2
Henson, S.A., Sanders, R., Madsen, E., 2012. Global patterns in efficiency of particulate organic
carbon export and transfer to the deep ocean. Glob. Biogeochem. Cycles 26,
2011GB004099. https://doi.org/10.1029/2011GB004099
Henson, S.A., Sanders, R., Madsen, E., Morris, P.J., Le Moigne, F., Quartly, G.D., 2011. A
reduced estimate of the strength of the ocean’s biological carbon pump. Geophys. Res. Lett. 38.
https://doi.org/10.1029/2011GL046735
Hoang, D.T., Chernomor, O., Von Haeseler, A., Minh, B.Q., Vinh, L.S., 2018. UFBoot2:
Improving the Ultrafast Bootstrap Approximation. Mol. Biol. Evol. 35, 518–522.
https://doi.org/10.1093/molbev/msx281
Hornick, K.M., Buschmann, A.H., 2018. Insights into the diversity and metabolic function of
bacterial communities in sediments from Chilean salmon aquaculture sites. Ann.
Microbiol. 68, 63–77. https://doi.org/10.1007/s13213-017-1317-8
Hyatt, D., Chen, G.-L., LoCascio, P.F., Land, M.L., Larimer, F.W., Hauser, L.J., 2010. Prodigal:
prokaryotic gene recognition and translation initiation site identification. BMC
Bioinformatics 11, 119. https://doi.org/10.1186/1471-2105-11-119
Imelfort, M., Parks, D., Woodcroft, B.J., Dennis, P., Hugenholtz, P., Tyson, G.W., 2014.
GroopM: an automated tool for the recovery of population genomes from related
metagenomes. PeerJ 2, e603. https://doi.org/10.7717/peerj.603
Jackson, A.E., Ayer, S.W., Laycock, M.V., 1992. The effect of salinity on growth and amino
acid composition in the marine diatom Nitzschia pungens. Can. J. Bot. 70, 2198–2201.
https://doi.org/10.1139/b92-272
Jain, C., Rodriguez-R, L.M., Phillippy, A.M., Konstantinidis, K.T., Aluru, S., 2018. High
throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries.
Nat. Commun. 9, 5114. https://doi.org/10.1038/s41467-018-07641-9
Jaspers, E., Overmann, J., 2004. Ecological Significance of Microdiversity: Identical 16S rRNA
Gene Sequences Can Be Found in Bacteria with Highly Divergent Genomes and
Ecophysiologies. Appl. Environ. Microbiol. 70, 4831–4839.
https://doi.org/10.1128/AEM.70.8.4831-4839.2004
Johnson, Z.I., Zinser, E.R., Coe, A., McNulty, N.P., Woodward, E.M.S., Chisholm, S.W., 2006.
Niche Partitioning Among Prochlorococcus Ecotypes Along Ocean-Scale Environmental
Gradients. Science 311, 1737–1740. https://doi.org/10.1126/science.1118052
Kalteh, A.M., Hjorth, P., Berndtsson, R., 2008. Review of the self-organizing map (SOM)
approach in water resources: Analysis, modelling and application. Environ. Model.
Softw. 23, 835–845. https://doi.org/10.1016/j.envsoft.2007.10.001
Kang, D.D., Froula, J., Egan, R., Wang, Z., 2015. MetaBAT, an efficient tool for accurately
reconstructing single genomes from complex microbial communities. PeerJ 3, e1165.
https://doi.org/10.7717/peerj.1165
Kang, D.D., Li, F., Kirton, E., Thomas, A., Egan, R., An, H., Wang, Z., 2019. MetaBAT 2: an
adaptive binning algorithm for robust and efficient genome reconstruction from
metagenome assemblies. PeerJ 7, e7359. https://doi.org/10.7717/peerj.7359
Karp, P.D., Billington, R., Caspi, R., Fulcher, C.A., Latendresse, M., Kothari, A., Keseler, I.M.,
Krummenacker, M., Midford, P.E., Ong, Q., Ong, W.K., Paley, S.M., Subhraveti, P.,
2019. The BioCyc collection of microbial genomes and metabolic pathways. Brief.
Bioinform. 20, 1085–1093. https://doi.org/10.1093/bib/bbx085
Kauffman, K.J., Prakash, P., Edwards, J.S., 2003. Advances in flux balance analysis. Curr. Opin.
Biotechnol. 14, 491–496. https://doi.org/10.1016/j.copbio.2003.08.001
Keller, M.D., Kiene, R.P., Matrai, P.A., Bellows, W.K., 1999. Production of glycine betaine and
dimethylsulfoniopropionate in marine phytoplankton. I. Batch cultures. Mar. Biol. 135,
237–248. https://doi.org/10.1007/s002270050621
Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., Li, Q., Shoemaker, B.A., Thiessen,
P.A., Yu, B., Zaslavsky, L., Zhang, J., Bolton, E.E., 2023. PubChem 2023 update.
Nucleic Acids Res. 51, D1373–D1380. https://doi.org/10.1093/nar/gkac956
King, Z.A., Lu, J., Dräger, A., Miller, P., Federowicz, S., Lerman, J.A., Ebrahim, A., Palsson,
B.O., Lewis, N.E., 2016. BiGG Models: A platform for integrating, standardizing and
sharing genome-scale models. Nucleic Acids Res. 44, D515–D522.
https://doi.org/10.1093/nar/gkv1049
Kiviluoto, K., 1996. Topology preservation in self-organizing maps, in: Proceedings of
International Conference on Neural Networks (ICNN’96). Presented at the International
Conference on Neural Networks (ICNN’96), IEEE, Washington, DC, USA, pp. 294–299.
https://doi.org/10.1109/ICNN.1996.548907
Klemetsen, T., Raknes, I.A., Fu, J., Agafonov, A., Balasundaram, S.V., Tartari, G., Robertsen,
E., Willassen, N.P., 2018. The MAR databases: development and implementation of
databases specific for marine metagenomics. Nucleic Acids Res. 46, D692–D699.
https://doi.org/10.1093/nar/gkx1036
Koch, A.L., 2001. Oligotrophs versus copiotrophs. BioEssays 23, 657–661.
https://doi.org/10.1002/bies.1091
Kohonen, T., 1990. The self-organizing map. Proc. IEEE 78, 1464–1480.
https://doi.org/10.1109/5.58325
Kojima, C.Y., Getz, E.W., Thrash, J.C., 2022. RRAP: RPKM Recruitment Analysis Pipeline.
Microbiol. Resour. Announc. 11, e00644-22. https://doi.org/10.1128/mra.00644-22
Kruskal, J.B., 1964. Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis. Psychometrika 29, 1–27. https://doi.org/10.1007/BF02289565
Kuhn, H.W., 1955. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 2,
83–97. https://doi.org/10.1002/nav.3800020109
Kwiatkowski, L., Torres, O., Bopp, L., Aumont, O., Chamberlain, M., Christian, J.R., Dunne,
J.P., Gehlen, M., Ilyina, T., John, J.G., Lenton, A., Li, H., Lovenduski, N.S., Orr, J.C.,
Palmieri, J., Santana-Falcón, Y., Schwinger, J., Séférian, R., Stock, C.A., Tagliabue, A.,
Takano, Y., Tjiputra, J., Toyama, K., Tsujino, H., Watanabe, M., Yamamoto, A., Yool,
A., Ziehn, T., 2020. Twenty-first century ocean warming, acidification, deoxygenation,
and upper-ocean nutrient and primary production decline from CMIP6 model projections.
Biogeosciences 17, 3439–3470. https://doi.org/10.5194/bg-17-3439-2020
Lanclos, V.C., Rasmussen, A.N., Kojima, C.Y., Cheng, C., Henson, M.W., Faircloth, B.C.,
Francis, C.A., Thrash, J.C., 2023. Ecophysiology and genomics of the brackish water
adapted SAR11 subclade IIIa. ISME J. 17, 620–629. https://doi.org/10.1038/s41396-023-
01376-2
Landa, M., Burns, A.S., Durham, B.P., Esson, K., Nowinski, B., Sharma, S., Vorobev, A.,
Nielsen, T., Kiene, R.P., Moran, M.A., 2019. Sulfur metabolites that facilitate oceanic
phytoplankton–bacteria carbon flux. ISME J. 13, 2536–2550.
https://doi.org/10.1038/s41396-019-0455-3
Langille, M.G.I., Zaneveld, J., Caporaso, J.G., McDonald, D., Knights, D., Reyes, J.A.,
Clemente, J.C., Burkepile, D.E., Vega Thurber, R.L., Knight, R., Beiko, R.G.,
Huttenhower, C., 2013. Predictive functional profiling of microbial communities using
16S rRNA marker gene sequences. Nat. Biotechnol. 31, 814–821.
https://doi.org/10.1038/nbt.2676
Langmead, B., Salzberg, S.L., 2012. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9,
357–359. https://doi.org/10.1038/nmeth.1923
Larkin, A.A., Garcia, C.A., Garcia, N., Brock, M.L., Lee, J.A., Ustick, L.J., Barbero, L., Carter,
B.R., Sonnerup, R.E., Talley, L.D., Tarran, G.A., Volkov, D.L., Martiny, A.C., 2021.
High spatial resolution global ocean metagenomes from Bio-GO-SHIP repeat
hydrography transects. Sci. Data 8, 107. https://doi.org/10.1038/s41597-021-00889-9
Larralde, M., 2022. Pyrodigal: Python bindings and interface to Prodigal, an efficient method for
gene prediction in prokaryotes. J. Open Source Softw. 7, 4296.
https://doi.org/10.21105/joss.04296
Larralde, M., Sincomb, T., 2022. althonos/pyhmmer: v0.7.1.
https://doi.org/10.5281/ZENODO.7442141
Laufkötter, C., Vogt, M., Gruber, N., Aita-Noguchi, M., Aumont, O., Bopp, L., Buitenhuis, E.,
Doney, S.C., Dunne, J., Hashioka, T., Hauck, J., Hirata, T., John, J., Le Quéré, C., Lima,
I.D., Nakano, H., Seferian, R., Totterdell, I., Vichi, M., Völker, C., 2015. Drivers and
uncertainties of future global marine primary production in marine ecosystem models.
Biogeosciences 12, 6955–6984. https://doi.org/10.5194/bg-12-6955-2015
Lee, M.D., 2019. GToTree: a user-friendly workflow for phylogenomics. Bioinformatics 35,
4162–4164. https://doi.org/10.1093/bioinformatics/btz188
Letunic, I., Bork, P., 2021. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic
tree display and annotation. Nucleic Acids Res. 49, W293–W296.
https://doi.org/10.1093/nar/gkab301
Lieven, C., Beber, M.E., Olivier, B.G., Bergmann, F.T., Ataman, M., Babaei, P., Bartell, J.A.,
Blank, L.M., Chauhan, S., Correia, K., Diener, C., Dräger, A., Ebert, B.E., Edirisinghe,
J.N., Faria, J.P., Feist, A.M., Fengos, G., Fleming, R.M.T., García-Jiménez, B.,
Hatzimanikatis, V., Van Helvoirt, W., Henry, C.S., Hermjakob, H., Herrgård, M.J.,
Kaafarani, A., Kim, H.U., King, Z., Klamt, S., Klipp, E., Koehorst, J.J., König, M.,
Lakshmanan, M., Lee, D.-Y., Lee, S.Y., Lee, S., Lewis, N.E., Liu, F., Ma, H., Machado,
D., Mahadevan, R., Maia, P., Mardinoglu, A., Medlock, G.L., Monk, J.M., Nielsen, J.,
Nielsen, L.K., Nogales, J., Nookaew, I., Palsson, B.O., Papin, J.A., Patil, K.R., Poolman,
M., Price, N.D., Resendis-Antonio, O., Richelle, A., Rocha, I., Sánchez, B.J., Schaap,
P.J., Malik Sheriff, R.S., Shoaie, S., Sonnenschein, N., Teusink, B., Vilaça, P., Vik, J.O.,
Wodke, J.A.H., Xavier, J.C., Yuan, Q., Zakhartsev, M., Zhang, C., 2020. MEMOTE for
standardized genome-scale metabolic model testing. Nat. Biotechnol. 38, 272–276.
https://doi.org/10.1038/s41587-020-0446-y
Liu, S., Parsons, R., Opalk, K., Baetge, N., Giovannoni, S., Bolaños, L.M., Kujawinski, E.B.,
Longnecker, K., Lu, Y., Halewood, E., Carlson, C.A., 2020. Different carboxyl‐rich
alicyclic molecules proxy compounds select distinct bacterioplankton for oxidation of
dissolved organic matter in the mesopelagic Sargasso Sea. Limnol. Oceanogr. 65, 1532–
1553. https://doi.org/10.1002/lno.11405
Lloyd, K.G., Steen, A.D., Ladau, J., Yin, J., Crosby, L., 2018. Phylogenetically Novel
Uncultured Microbial Cells Dominate Earth Microbiomes. mSystems 3.
https://doi.org/10.1128/mSystems.00055-18
Lombard, V., Golaconda Ramulu, H., Drula, E., Coutinho, P.M., Henrissat, B., 2014. The
carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res. 42, D490–
D495. https://doi.org/10.1093/nar/gkt1178
Louca, S., Jacques, S.M.S., Pires, A.P.F., Leal, J.S., Srivastava, D.S., Parfrey, L.W., Farjalla,
V.F., Doebeli, M., 2017. High taxonomic variability despite stable functional structure
across microbial communities. Nat. Ecol. Evol. 1, 0015. https://doi.org/10.1038/s41559-
016-0015
Louca, S., Parfrey, L.W., Doebeli, M., 2016. Decoupling function and taxonomy in the global
ocean microbiome. Science 353, 1272–1277. https://doi.org/10.1126/science.aaf4507
Louca, S., Polz, M.F., Mazel, F., Albright, M.B.N., Huber, J.A., O’Connor, M.I., Ackermann,
M., Hahn, A.S., Srivastava, D.S., Crowe, S.A., Doebeli, M., Parfrey, L.W., 2018.
Function and functional redundancy in microbial systems. Nat. Ecol. Evol. 2, 936–943.
https://doi.org/10.1038/s41559-018-0519-1
Lozupone, C., Lladser, M.E., Knights, D., Stombaugh, J., Knight, R., 2011. UniFrac: an effective
distance metric for microbial community comparison. ISME J. 5, 169–172.
https://doi.org/10.1038/ismej.2010.133
Lu, Y.Y., Chen, T., Fuhrman, J.A., Sun, F., 2016. COCACOLA: binning metagenomic contigs
using sequence COmposition, read CoverAge, CO-alignment and paired-end read
LinkAge. Bioinformatics btw290. https://doi.org/10.1093/bioinformatics/btw290
Machado, D., Andrejev, S., Tramontano, M., Patil, K.R., 2018. Fast automated reconstruction of
genome-scale metabolic models for microbial species and communities. Nucleic Acids
Res. 46, 7542–7553. https://doi.org/10.1093/nar/gky537
Magnúsdóttir, S., Heinken, A., Kutt, L., Ravcheev, D.A., Bauer, E., Noronha, A., Greenhalgh,
K., Jäger, C., Baginska, J., Wilmes, P., Fleming, R.M.T., Thiele, I., 2017. Generation of
genome-scale metabolic reconstructions for 773 members of the human gut microbiota.
Nat. Biotechnol. 35, 81–89. https://doi.org/10.1038/nbt.3703
Martin, J.H., Knauer, G.A., Karl, D.M., Broenkow, W.W., 1987. VERTEX: carbon cycling in
the northeast Pacific. Deep Sea Res. Part A. Oceanogr. Res. Pap. 34, 267–285.
https://doi.org/10.1016/0198-0149(87)90086-0
Martinez-Garcia, M., Brazel, D.M., Swan, B.K., Arnosti, C., Chain, P.S.G., Reitenga, K.G., Xie,
G., Poulton, N.J., Gomez, M.L., Masland, D.E.D., Thompson, B., Bellows, W.K.,
Ziervogel, K., Lo, C.-C., Ahmed, S., Gleasner, C.D., Detter, C.J., Stepanauskas, R., 2012.
Capturing Single Cell Genomes of Active Polysaccharide Degraders: An Unexpected
Contribution of Verrucomicrobia. PLoS ONE 7, e35314.
https://doi.org/10.1371/journal.pone.0035314
Matthews, A., Majeed, A., Barraclough, T.G., Raymond, B., 2021. Function is a better predictor
of plant rhizosphere community membership than 16S phylogeny. Environ. Microbiol.
23, 6089–6103. https://doi.org/10.1111/1462-2920.15652
McDaniel, L.D., Young, E., Delaney, J., Ruhnau, F., Ritchie, K.B., Paul, J.H., 2010. High
Frequency of Horizontal Gene Transfer in the Oceans. Science 330, 50–50.
https://doi.org/10.1126/science.1192243
McDonald, D., Vázquez-Baeza, Y., Koslicki, D., McClelland, J., Reeve, N., Xu, Z., Gonzalez,
A., Knight, R., 2018. Striped UniFrac: enabling microbiome analysis at unprecedented
scale. Nat. Methods 15, 847–848. https://doi.org/10.1038/s41592-018-0187-8
Mendoza, S.N., Olivier, B.G., Molenaar, D., Teusink, B., 2019. A systematic assessment of
current genome-scale metabolic reconstruction tools. Genome Biol. 20, 158.
https://doi.org/10.1186/s13059-019-1769-1
MetaHIT Consortium, Nielsen, H.B., Almeida, M., Juncker, A.S., Rasmussen, S., Li, J.,
Sunagawa, S., Plichta, D.R., Gautier, L., Pedersen, A.G., Le Chatelier, E., Pelletier, E.,
Bonde, I., Nielsen, T., Manichanh, C., Arumugam, M., Batto, J.-M., Quintanilha dos
Santos, M.B., Blom, N., Borruel, N., Burgdorf, K.S., Boumezbeur, F., Casellas, F., Doré,
J., Dworzynski, P., Guarner, F., Hansen, T., Hildebrand, F., Kaas, R.S., Kennedy, S.,
Kristiansen, K., Kultima, J.R., Léonard, P., Levenez, F., Lund, O., Moumen, B., Le
Paslier, D., Pons, N., Pedersen, O., Prifti, E., Qin, J., Raes, J., Sørensen, S., Tap, J., Tims,
S., Ussery, D.W., Yamada, T., Renault, P., Sicheritz-Ponten, T., Bork, P., Wang, J.,
Brunak, S., Ehrlich, S.D., 2014. Identification and assembly of genomes and genetic
elements in complex metagenomic samples without using reference genomes. Nat.
Biotechnol. 32, 822–828. https://doi.org/10.1038/nbt.2939
Metcalf, W.W., Wanner, B.L., 1993. Evidence for a fourteen-gene, phnC to phnP locus for
phosphonate metabolism in Escherichia coli. Gene 129, 27–32.
https://doi.org/10.1016/0378-1119(93)90692-V
Minh, B.Q., Schmidt, H.A., Chernomor, O., Schrempf, D., Woodhams, M.D., Von Haeseler, A.,
Lanfear, R., 2020. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic
Inference in the Genomic Era. Mol. Biol. Evol. 37, 1530–1534.
https://doi.org/10.1093/molbev/msaa015
Minoche, A.E., Dohm, J.C., Himmelbauer, H., 2011. Evaluation of genomic high-throughput
sequencing data generated on Illumina HiSeq and Genome Analyzer systems. Genome
Biol. 12, R112. https://doi.org/10.1186/gb-2011-12-11-r112
Moran, M.A., Belas, R., Schell, M.A., González, J.M., Sun, F., Sun, S., Binder, B.J., Edmonds,
J., Ye, W., Orcutt, B., Howard, E.C., Meile, C., Palefsky, W., Goesmann, A., Ren, Q.,
Paulsen, I., Ulrich, L.E., Thompson, L.S., Saunders, E., Buchan, A., 2007. Ecological
Genomics of Marine Roseobacters. Appl. Environ. Microbiol. 73, 4559–4569.
https://doi.org/10.1128/AEM.02580-06
Moran, M.A., Kujawinski, E.B., Schroer, W.F., Amin, S.A., Bates, N.R., Bertrand, E.M.,
Braakman, R., Brown, C.T., Covert, M.W., Doney, S.C., Dyhrman, S.T., Edison, A.S.,
Eren, A.M., Levine, N.M., Li, L., Ross, A.C., Saito, M.A., Santoro, A.E., Segrè, D.,
Shade, A., Sullivan, M.B., Vardi, A., 2022. Microbial metabolites in the marine carbon
cycle. Nat. Microbiol. 7, 508–523. https://doi.org/10.1038/s41564-022-01090-3
Moyer, D.C., Reimertz, J., Segrè, D., Fuxman Bass, J.I., 2024. Semi-Automatic Detection of
Errors in Genome-Scale Metabolic Models. https://doi.org/10.1101/2024.06.24.600481
Nelson, D.M., Tréguer, P., Brzezinski, M.A., Leynaert, A., Quéguiner, B., 1995. Production and
dissolution of biogenic silica in the ocean: Revised global estimates, comparison with
regional data and relationship to biogenic sedimentation. Glob. Biogeochem. Cycles 9,
359–372. https://doi.org/10.1029/95GB01070
Nguyen, T.T.H., Zakem, E.J., Ebrahimi, A., Schwartzman, J., Caglar, T., Amarnath, K.,
Alcolombri, U., Peaudecerf, F.J., Hwa, T., Stocker, R., Cordero, O.X., Levine, N.M.,
2022. Microbes contribute to setting the ocean carbon flux by altering the fate of sinking
particulates. Nat. Commun. 13, 1657. https://doi.org/10.1038/s41467-022-29297-2
Noor, E., Flamholz, A., Bar-Even, A., Davidi, D., Milo, R., Liebermeister, W., 2016. The Protein
Cost of Metabolic Fluxes: Prediction from Enzymatic Rate Laws and Cost Minimization.
PLOS Comput. Biol. 12, e1005167. https://doi.org/10.1371/journal.pcbi.1005167
Norsigian, C.J., Pusarla, N., McConn, J.L., Yurkovich, J.T., Dräger, A., Palsson, B.O., King, Z.,
2019. BiGG Models 2020: multi-strain genome-scale models and expansion across the
phylogenetic tree. Nucleic Acids Res. gkz1054. https://doi.org/10.1093/nar/gkz1054
Oberhardt, M.A., Palsson, B.Ø., Papin, J.A., 2009. Applications of genome‐scale metabolic
reconstructions. Mol. Syst. Biol. 5, 320. https://doi.org/10.1038/msb.2009.77
Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., Kanehisa, M., 1999. KEGG: Kyoto
Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27, 29–34.
https://doi.org/10.1093/nar/27.1.29
Oksanen, J., Blanchet, F.G., Friendly, M., Kindt, R., Legendre, P., McGlinn, D., Minchin, P.R.,
O’Hara, R.B., Simpson, G.L., Solymos, P., Stevens, M.H.H., Szoecs, E., Wagner, H.,
2019. vegan: Community Ecology Package. R package version 2.5-6. https://CRAN.R-project.org/package=vegan
Olm, M.R., Brown, C.T., Brooks, B., Banfield, J.F., 2017. dRep: a tool for fast and accurate
genomic comparisons that enables improved genome recovery from metagenomes
through de-replication. ISME J. 11, 2864–2868. https://doi.org/10.1038/ismej.2017.126
Ondov, B.D., Treangen, T.J., Melsted, P., Mallonee, A.B., Bergman, N.H., Koren, S., Phillippy,
A.M., 2016. Mash: fast genome and metagenome distance estimation using MinHash.
Genome Biol. 17, 132. https://doi.org/10.1186/s13059-016-0997-x
Pachiadaki, M.G., Brown, J.M., Brown, J., Bezuidt, O., Berube, P.M., Biller, S.J., Poulton, N.J.,
Burkart, M.D., La Clair, J.J., Chisholm, S.W., Stepanauskas, R., 2019. Charting the
Complexity of the Marine Microbiome through Single-Cell Genomics. Cell 179, 1623-
1635.e11. https://doi.org/10.1016/j.cell.2019.11.017
Paoli, L., Ruscheweyh, H.-J., Forneris, C.C., Hubrich, F., Kautsar, S., Bhushan, A., Lotti, A.,
Clayssen, Q., Salazar, G., Milanese, A., Carlström, C.I., Papadopoulou, C., Gehrig, D.,
Karasikov, M., Mustafa, H., Larralde, M., Carroll, L.M., Sánchez, P., Zayed, A.A.,
Cronin, D.R., Acinas, S.G., Bork, P., Bowler, C., Delmont, T.O., Gasol, J.M., Gossert,
A.D., Kahles, A., Sullivan, M.B., Wincker, P., Zeller, G., Robinson, S.L., Piel, J.,
Sunagawa, S., 2022. Biosynthetic potential of the global ocean microbiome. Nature 607,
111–118. https://doi.org/10.1038/s41586-022-04862-3
Paoli, L., Ruscheweyh, H.-J., Forneris, C.C., Kautsar, S., Clayssen, Q., Salazar, G., Milanese, A.,
Gehrig, D., Larralde, M., Carroll, L.M., Sánchez, P., Zayed, A.A., Cronin, D.R., Acinas,
S.G., Bork, P., Bowler, C., Delmont, T.O., Sullivan, M.B., Wincker, P., Zeller, G.,
Robinson, S.L., Piel, J., Sunagawa, S., 2021. Uncharted biosynthetic potential of the
ocean microbiome (preprint). Microbiology. https://doi.org/10.1101/2021.03.24.436479
Park, Y.-S., Céréghino, R., Compin, A., Lek, S., 2003. Applications of artificial neural networks
for patterning and predicting aquatic insect species richness in running waters. Ecol.
Model. 160, 265–280. https://doi.org/10.1016/S0304-3800(02)00258-2
Parks, D.H., Chuvochina, M., Rinke, C., Mussig, A.J., Chaumeil, P.-A., Hugenholtz, P., 2022.
GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically
consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res.
50, D785–D794. https://doi.org/10.1093/nar/gkab776
Parks, D.H., Chuvochina, M., Waite, D.W., Rinke, C., Skarshewski, A., Chaumeil, P.-A.,
Hugenholtz, P., 2018. A standardized bacterial taxonomy based on genome phylogeny
substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004.
https://doi.org/10.1038/nbt.4229
Parks, D.H., Imelfort, M., Skennerton, C.T., Hugenholtz, P., Tyson, G.W., 2015. CheckM:
assessing the quality of microbial genomes recovered from isolates, single cells, and
metagenomes. Genome Res. 25, 1043–1055. https://doi.org/10.1101/gr.186072.114
Parks, D.H., Rinke, C., Chuvochina, M., Chaumeil, P.-A., Woodcroft, B.J., Evans, P.N.,
Hugenholtz, P., Tyson, G.W., 2017. Recovery of nearly 8,000 metagenome-assembled
genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542.
https://doi.org/10.1038/s41564-017-0012-7
Pedersen, T.L., 2024. patchwork: The Composer of Plots.
Pence, H.E., Williams, A., 2010. ChemSpider: An Online Chemical Information Resource. J.
Chem. Educ. 87, 1123–1124. https://doi.org/10.1021/ed100697w
Pomeroy, L., leB. Williams, P., Azam, F., Hobbie, J., 2007. The Microbial Loop. Oceanography
20, 28–33. https://doi.org/10.5670/oceanog.2007.45
Pomeroy, L.R., 1974. The Ocean’s Food Web, A Changing Paradigm. BioScience 24, 499–504.
https://doi.org/10.2307/1296885
Price, M.N., Dehal, P.S., Arkin, A.P., 2010. FastTree 2 – Approximately Maximum-Likelihood
Trees for Large Alignments. PLoS ONE 5, e9490.
https://doi.org/10.1371/journal.pone.0009490
Pritchard, J.K., Stephens, M., Donnelly, P., 2000. Inference of Population Structure Using
Multilocus Genotype Data. Genetics 155, 945. https://doi.org/10.1093/genetics/155.2.945
Quere, C.L., Harrison, S.P., Colin Prentice, I., Buitenhuis, E.T., Aumont, O., Bopp, L., Claustre,
H., Cotrim Da Cunha, L., Geider, R., Giraud, X., Klaas, C., Kohfeld, K.E., Legendre, L.,
Manizza, M., Platt, T., Rivkin, R.B., Sathyendranath, S., Uitz, J., Watson, A.J., Wolf-Gladrow, D., 2005. Ecosystem dynamics based on plankton functional types for global
ocean biogeochemistry models. Glob. Change Biol. 11, 2016–2040.
https://doi.org/10.1111/j.1365-2486.2005.1004.x
Raitsos, D.E., Lavender, S.J., Maravelias, C.D., Haralabous, J., Richardson, A.J., Reid, P.C.,
2008. Identifying four phytoplankton functional types from space: An ecological
approach. Limnol. Oceanogr. 53, 605–613. https://doi.org/10.4319/lo.2008.53.2.0605
Rappé, M.S., Giovannoni, S.J., 2003. The Uncultured Microbial Majority. Annu. Rev. Microbiol.
57, 369–394. https://doi.org/10.1146/annurev.micro.57.030502.090759
Rawlings, N.D., Barrett, A.J., Thomas, P.D., Huang, X., Bateman, A., Finn, R.D., 2018. The
MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a
comparison with peptidases in the PANTHER database. Nucleic Acids Res. 46, D624–
D632. https://doi.org/10.1093/nar/gkx1134
Redfield, A.C., Bostwick, H.K., Richards, F.A., 1963. The influence of organisms on the
composition of seawater. The sea.
Régimbeau, A., Budinich, M., Larhlimi, A., Pierella Karlusich, J.J., Aumont, O., Memery, L.,
Bowler, C., Eveillard, D., 2022. Contribution of genome‐scale metabolic modelling to
niche theory. Ecol. Lett. 25, 1352–1364. https://doi.org/10.1111/ele.13954
Reisch, C.R., Moran, M.A., Whitman, W.B., 2011. Bacterial Catabolism of
Dimethylsulfoniopropionate (DMSP). Front. Microbiol. 2.
https://doi.org/10.3389/fmicb.2011.00172
Reisch, C.R., Moran, M.A., Whitman, W.B., 2008. Dimethylsulfoniopropionate-Dependent
Demethylase (DmdA) from Pelagibacter ubique and Silicibacter pomeroyi. J. Bacteriol.
190, 8018–8024. https://doi.org/10.1128/JB.00770-08
Reynolds, R., Hyun, S., Tully, B., Bien, J., Levine, N.M., 2023. Identification of microbial
metabolic functional guilds from large genomic datasets. Front. Microbiol. 14, 1197329.
https://doi.org/10.3389/fmicb.2023.1197329
Reynolds, R., Weiss, A.C., James, C.C., Kojima, C.Y., Weissman, J.L., Thrash, J.C., Levine,
N.M., 2024. Emergent Metabolic Niches for Marine Heterotrophs.
https://doi.org/10.1101/2024.05.29.596556
Roth Rosenberg, D., Haber, M., Goldford, J., Lalzar, M., Aharonovich, D., Al‐Ashhab, A.,
Lehahn, Y., Segrè, D., Steindler, L., Sher, D., 2021. Particle‐associated and free‐living
bacterial communities in an oligotrophic sea are affected by different environmental
factors. Environ. Microbiol. 23, 4295–4308. https://doi.org/10.1111/1462-2920.15611
Sajed, T., Marcu, A., Ramirez, M., Pon, A., Guo, A.C., Knox, C., Wilson, M., Grant, J.R.,
Djoumbou, Y., Wishart, D.S., 2016. ECMDB 2.0: A richer resource for understanding
the biochemistry of E. coli. Nucleic Acids Res. 44, D495–D501.
https://doi.org/10.1093/nar/gkv1060
Saltzman, E.S., Cooper, W.J. (Eds.), 1989. Biogenic Sulfur in the Environment, ACS
Symposium Series. American Chemical Society, Washington, DC.
https://doi.org/10.1021/bk-1989-0393
Sarkar, D., 2008. Lattice: Multivariate Data Visualization with R. https://doi.org/10.1007/978-0-
387-75969-2
Sarmento, H., Morana, C., Gasol, J.M., 2016. Bacterioplankton niche partitioning in the use of
phytoplankton-derived dissolved organic carbon: quantity is more important than quality.
ISME J. 10, 2582–2592. https://doi.org/10.1038/ismej.2016.66
Schellenberger, J., Park, J.O., Conrad, T.M., Palsson, B.Ø., 2010. BiGG: a Biochemical Genetic
and Genomic knowledgebase of large scale metabolic reconstructions. BMC
Bioinformatics 11, 213. https://doi.org/10.1186/1471-2105-11-213
Schroer, W.F., Kepner, H.E., Uchimiya, M., Mejia, C., Rodriguez, L.T., Reisch, C.R., Moran,
M.A., 2023. Functional annotation and importance of marine bacterial transporters of
plankton exometabolites. ISME Commun. 3, 37. https://doi.org/10.1038/s43705-023-
00244-6
Séférian, R., Bopp, L., Gehlen, M., Orr, J.C., Ethé, C., Cadule, P., Aumont, O., Salas y Mélia,
D., Voldoire, A., Madec, G., 2013. Skill assessment of three earth system models with
common marine biogeochemistry. Clim. Dyn. 40, 2549–2573.
https://doi.org/10.1007/s00382-012-1362-8
Segschneider, J., Bendtsen, J., 2013. Temperature‐dependent remineralization in a warming
ocean increases surface pCO2 through changes in marine ecosystem composition. Glob.
Biogeochem. Cycles 27, 1214–1225. https://doi.org/10.1002/2013GB004684
Sharoni, S., Halevy, I., 2020. Nutrient ratios in marine particulate organic matter are predicted by
the population structure of well-adapted phytoplankton. Sci. Adv. 6, eaaw9371.
https://doi.org/10.1126/sciadv.aaw9371
Sieracki, M.E., Poulton, N.J., Jaillon, O., Wincker, P., de Vargas, C., Rubinat-Ripoll, L.,
Stepanauskas, R., Logares, R., Massana, R., 2019. Single cell genomics yields a wide
diversity of small planktonic protists across major ocean ecosystems. Sci. Rep. 9, 6025.
https://doi.org/10.1038/s41598-019-42487-1
Sogin, M.L., Morrison, H.G., Huber, J.A., Welch, D.M., Huse, S.M., Neal, P.R., Arrieta, J.M.,
Herndl, G.J., 2006. Microbial diversity in the deep sea and the underexplored “rare
biosphere.” Proc. Natl. Acad. Sci. 103, 12115–12120.
https://doi.org/10.1073/pnas.0605127103
Sosa, O.A., Repeta, D.J., Ferrón, S., Bryant, J.A., Mende, D.R., Karl, D.M., DeLong, E.F.,
2017. Isolation and Characterization of Bacteria That Degrade Phosphonates in Marine
Dissolved Organic Matter. Front. Microbiol. 8, 1786.
https://doi.org/10.3389/fmicb.2017.01786
Staley, C., Gould, T.J., Wang, P., Phillips, J., Cotner, J.B., Sadowsky, M.J., 2014. Core
functional traits of bacterial communities in the Upper Mississippi River show limited
variation in response to land cover. Front. Microbiol. 5.
https://doi.org/10.3389/fmicb.2014.00414
Steen, A.D., Crits-Christoph, A., Carini, P., DeAngelis, K.M., Fierer, N., Lloyd, K.G., Cameron
Thrash, J., 2019. High proportions of bacteria and archaea across most biomes remain
uncultured. ISME J. 13, 3126–3130. https://doi.org/10.1038/s41396-019-0484-y
Stepanauskas, R., Sieracki, M.E., 2007. Matching phylogeny and metabolism in the uncultured
marine bacteria, one cell at a time. Proc. Natl. Acad. Sci. 104, 9052–9057.
https://doi.org/10.1073/pnas.0700496104
Strous, M., Kraft, B., Bisdorf, R., Tegetmeyer, H.E., 2012. The Binning of Metagenomic Contigs
for Microbial Physiology of Mixed Cultures. Front. Microbiol. 3.
https://doi.org/10.3389/fmicb.2012.00410
Sunagawa, S., Coelho, L.P., Chaffron, S., Kultima, J.R., Labadie, K., Salazar, G., Djahanschiri,
B., Zeller, G., Mende, D.R., Alberti, A., Cornejo-Castillo, F.M., Costea, P.I., Cruaud, C.,
d’Ovidio, F., Engelen, S., Ferrera, I., Gasol, J.M., Guidi, L., Hildebrand, F., Kokoszka,
F., Lepoivre, C., Lima-Mendez, G., Poulain, J., Poulos, B.T., Royo-Llonch, M.,
Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S.,
Tara Oceans coordinators, Bowler, C., de Vargas, C., Gorsky, G., Grimsley, N.,
Hingamp, P., Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S.,
Stemmann, L., Sullivan, M.B., Weissenbach, J., Wincker, P., Karsenti, E., Raes, J.,
Acinas, S.G., Bork, P., Boss, E., Bowler, C., Follows, M., Karp-Boss, L., Krzic, U.,
Reynaud, E.G., Sardet, C., Sieracki, M., Velayoudon, D., 2015. Structure and function of
the global ocean microbiome. Science 348, 1261359–1261359.
https://doi.org/10.1126/science.1261359
Swan, B.K., Martinez-Garcia, M., Preston, C.M., Sczyrba, A., Woyke, T., Lamy, D., Reinthaler,
T., Poulton, N.J., Masland, E.D.P., Gomez, M.L., Sieracki, M.E., DeLong, E.F., Herndl,
G.J., Stepanauskas, R., 2011. Potential for Chemolithoautotrophy Among Ubiquitous
Bacteria Lineages in the Dark Ocean. Science 333, 1296–1300.
https://doi.org/10.1126/science.1203690
Swan, B.K., Tupper, B., Sczyrba, A., Lauro, F.M., Martinez-Garcia, M., Gonzalez, J.M., Luo,
H., Wright, J.J., Landry, Z.C., Hanson, N.W., Thompson, B.P., Poulton, N.J.,
Schwientek, P., Acinas, S.G., Giovannoni, S.J., Moran, M.A., Hallam, S.J., Cavicchioli,
R., Woyke, T., Stepanauskas, R., 2013. Prevalent genome streamlining and latitudinal
divergence of planktonic bacteria in the surface ocean. Proc. Natl. Acad. Sci. 110,
11463–11468. https://doi.org/10.1073/pnas.1304246110
The MathWorks, Inc., 2021. MATLAB, Version 2021a. The MathWorks, Inc.
Tripp, H.J., Kitner, J.B., Schwalbach, M.S., Dacey, J.W.H., Wilhelm, L.J., Giovannoni, S.J.,
2008. SAR11 marine bacteria require exogenous reduced sulphur for growth. Nature 452,
741–744. https://doi.org/10.1038/nature06776
Tully, B.J., Graham, E.D., Heidelberg, J.F., 2018. The reconstruction of 2,631 draft
metagenome-assembled genomes from the global oceans. Sci. Data 5, 170203.
https://doi.org/10.1038/sdata.2017.203
Tully, B.J., Wheat, C.G., Glazer, B.T., Huber, J.A., 2018. A dynamic microbial
community with high functional redundancy inhabits the cold, oxic subseafloor aquifer.
ISME J. 12, 1–16. https://doi.org/10.1038/ismej.2017.187
Ustick, L.J., Larkin, A.A., Garcia, C.A., Garcia, N.S., Brock, M.L., Lee, J.A., Wiseman, N.A.,
Moore, J.K., Martiny, A.C., 2021. Metagenomic analysis reveals global-scale patterns of
ocean nutrient limitation. Science 372, 287–291. https://doi.org/10.1126/science.abe6301
Varma, A., Palsson, B.O., 1994. Metabolic Flux Balancing: Basic Concepts, Scientific and
Practical Use. Bio/Technology 12, 994–998. https://doi.org/10.1038/nbt1094-994
Venables, H., Moore, C.M., 2010. Phytoplankton and light limitation in the Southern Ocean:
Learning from high‐nutrient, high‐chlorophyll areas. J. Geophys. Res. Oceans 115,
2009JC005361. https://doi.org/10.1029/2009JC005361
Venter, J.C., 2004. Environmental Genome Shotgun Sequencing of the Sargasso Sea. Science
304, 66–74. https://doi.org/10.1126/science.1093857
Vesanto, J., 2000. Neural network tool for data mining: SOM toolbox. TOOLMET2000.
Ward, B.A., Collins, S., 2022. Rapid evolution allows coexistence of highly divergent lineages
within the same niche. Ecol. Lett. 25, 1839–1853. https://doi.org/10.1111/ele.14061
Wehrens, R., Buydens, L.M.C., 2007. Self- and Super-organizing Maps in R : The kohonen
Package. J. Stat. Softw. 21. https://doi.org/10.18637/jss.v021.i05
Weissman, J., Dimbo, E.-R.O., Krinos, A.I., Neely, C., Yagües, Y., Nolin, D., Hou, S.,
Laperriere, S., Caron, D.A., Tully, B., Alexander, H., Fuhrman, J.A., 2021. Estimating
global variation in the maximum growth rates of eukaryotic microbes from cultures and
metagenomes via codon usage patterns. https://doi.org/10.1101/2021.10.15.464604
Weissman, J.L., Hou, S., Fuhrman, J.A., 2021. Estimating maximal microbial growth rates from
cultures, metagenomes, and single cells via codon usage patterns. Proc. Natl. Acad. Sci.
118, e2016810118. https://doi.org/10.1073/pnas.2016810118
Wemheuer, F., Taylor, J.A., Daniel, R., Johnston, E., Meinicke, P., Thomas, T., Wemheuer, B.,
2020. Tax4Fun2: prediction of habitat-specific functional profiles and functional
redundancy based on 16S rRNA gene sequences. Environ. Microbiome 15, 11.
https://doi.org/10.1186/s40793-020-00358-7
White, A.K., Metcalf, W.W., 2004. Two C—P Lyase Operons in Pseudomonas stutzeri and
Their Roles in the Oxidation of Phosphonates, Phosphite, and Hypophosphite. J.
Bacteriol. 186, 4730–4739. https://doi.org/10.1128/JB.186.14.4730-4739.2004
Wickham, H., 2009. ggplot2: Elegant Graphics for Data Analysis. https://doi.org/10.1007/978-0-
387-98141-3
Wilcock, A.R., Goldberg, D.M., 1972. Kinetic determination of malate dehydrogenase activity
eliminating problems due to spontaneous conversion of oxaloacetate to pyruvate.
Biochem. Med. 6, 116–126. https://doi.org/10.1016/0006-2944(72)90029-4
Wilke, C.O., 2024. ggridges: Ridgeline Plots in “ggplot2.”
Wishart, D.S., Guo, A., Oler, E., Wang, F., Anjum, A., Peters, H., Dizon, R., Sayeeda, Z., Tian,
S., Lee, B.L., Berjanskii, M., Mah, R., Yamamoto, M., Jovel, J., Torres-Calzada, C.,
Hiebert-Giesbrecht, M., Lui, V.W., Varshavi, Dorna, Varshavi, Dorsa, Allen, D., Arndt,
D., Khetarpal, N., Sivakumaran, A., Harford, K., Sanford, S., Yee, K., Cao, X., Budinski,
Z., Liigand, J., Zhang, L., Zheng, J., Mandal, R., Karu, N., Dambrova, M., Schiöth, H.B.,
Greiner, R., Gautam, V., 2022. HMDB 5.0: the Human Metabolome Database for 2022.
Nucleic Acids Res. 50, D622–D631. https://doi.org/10.1093/nar/gkab1062
Wu, Y.-W., Simmons, B.A., Singer, S.W., 2016. MaxBin 2.0: an automated binning algorithm to
recover genomes from multiple metagenomic datasets. Bioinformatics 32, 605–607.
https://doi.org/10.1093/bioinformatics/btv638
Xu, S., Chen, M., Feng, T., Zhan, L., Zhou, L., Yu, G., 2021. Use ggbreak to Effectively Utilize
Plotting Space to Deal With Large Datasets and Outliers. Front. Genet. 12, 774846.
https://doi.org/10.3389/fgene.2021.774846
Yarza, P., Yilmaz, P., Pruesse, E., Glöckner, F.O., Ludwig, W., Schleifer, K.-H., Whitman,
W.B., Euzéby, J., Amann, R., Rosselló-Móra, R., 2014. Uniting the classification of
cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat. Rev.
Microbiol. 12, 635–645. https://doi.org/10.1038/nrmicro3330
Yooseph, S., Sutton, G., Rusch, D.B., Halpern, A.L., Williamson, S.J., Remington, K., Eisen,
J.A., Heidelberg, K.B., Manning, G., Li, W., Jaroszewski, L., Cieplak, P., Miller, C.S.,
Li, H., Mashiyama, S.T., Joachimiak, M.P., van Belle, C., Chandonia, J.-M., Soergel,
D.A., Zhai, Y., Natarajan, K., Lee, S., Raphael, B.J., Bafna, V., Friedman, R., Brenner,
S.E., Godzik, A., Eisenberg, D., Dixon, J.E., Taylor, S.S., Strausberg, R.L., Frazier, M.,
Venter, J.C., 2007. The Sorcerer II Global Ocean Sampling Expedition: Expanding the
Universe of Protein Families. PLoS Biol. 5, e16.
https://doi.org/10.1371/journal.pbio.0050016
Yu, G., Smith, D.K., Zhu, H., Guan, Y., Lam, T.T., 2017. GGTREE : an R package for
visualization and annotation of phylogenetic trees with their covariates and other
associated data. Methods Ecol. Evol. 8, 28–36. https://doi.org/10.1111/2041-210X.12628
Zakem, E.J., Cael, B.B., Levine, N.M., 2021. A unified theory for organic matter accumulation.
Proc. Natl. Acad. Sci. U. S. A. 118, e2016896118.
https://doi.org/10.1073/pnas.2016896118
Zakem, E.J., McNichol, J., Weissman, J.L., Raut, Y., Xu, L., Halewood, E.R., Carlson, C.A.,
Dutkiewicz, S., Fuhrman, J.A., Levine, N.M., 2024. Predictable functional biogeography
of marine microbial heterotrophs. https://doi.org/10.1101/2024.02.14.580411
Zhang, H., Yohe, T., Huang, L., Entwistle, S., Wu, P., Yang, Z., Busk, P.K., Xu, Y., Yin, Y.,
2018. dbCAN2: a meta server for automated carbohydrate-active enzyme annotation.
Nucleic Acids Res. 46, W95–W101. https://doi.org/10.1093/nar/gky418
Zhou, Z., Tran, P., Liu, Y., Kieft, K., Anantharaman, K., 2019. METABOLIC: A scalable high-throughput metabolic and biogeochemical functional trait profiler based on microbial
genomes (preprint). Bioinformatics. https://doi.org/10.1101/761643
APPENDICES
Appendix A
A1 Extended Methods
A1.1 Phylogeny of Datasets
We constructed phylogenomic trees for both the composite and SAG-only datasets, as well
as for the guilds presented in main text Table 2. Appendix Figures A1-A4 show the trees, annotated
at the highest taxonomic classification with 2 or more distinct phylogenetic groups. The composite
dataset, which included globally sourced MAGs, isolates, and SAGs, was much more diverse
(51 distinct bacterial phyla) than the SAG-only dataset, which represented just 9 bacterial phyla
(see Appendix Figures A1 and A2). We also found that the phylogenetic diversity of our
guilds varied. For example, the DMSP guild we identified was highly conserved within the
Alphaproteobacteria, spanning just 4 distinct bacterial families, with the majority of genomes
coming from the family Rhodobacteraceae. In contrast, the motility guild was represented
across 9 bacterial orders, with the order Enterobacterales comprising the bulk
of the associated mapback genomes.
We also computed ANI and AAI values for the composite dataset to numerically assess its
diversity. Only 0.52% of the possible genome pairs produced a non-NaN ANI value,
suggesting that this dataset was not appropriate for ANI-based analysis. Instead, we used AAI,
which is more suitable for genomes that are phylogenetically distant from one another. After
excluding values outside of FastAAI’s defined range of 30-90%, a given genome pair had an
average AAI value of 39.1%. The full distribution of AAI values is shown in Appendix
Figure A5. We also assessed the AAI values between pairs of mapback genomes for each of our
10 guilds. To test this against the full dataset, we developed a Monte Carlo-style simulation to
approximate the distribution of average AAI values for a comparable subset of random genomes
from the composite dataset. To accomplish this, we took 1,000 random subsets of 100 genomes
(similar to the average number of genomes in a 5-function guild) and computed the average AAI
value of all possible genome pairs in each subset. The resulting distribution can be seen in
Appendix Figure A6, with the individual guild average AAI values overlaid as vertical lines. Our
guilds span the distribution: some have higher AAI values than average (more phylogenetically
conserved) and some have lower AAI values than average (less phylogenetically conserved).
This suggests that our method finds guilds that are sometimes phylogenetically conserved
and sometimes not. To confirm that it was possible to find 100 genomes with higher AAI
values (a more similar set of genomes), we took the 100 genomes with the most non-NaN ANI values
and computed their average AAI value. For this group, we found a much higher average value of
50.7%, compared to the 39.1% value of the overall dataset.
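The random-subset comparison described above can be sketched in a few lines. This is a minimal illustration in Python, assuming NumPy and a precomputed square matrix of pairwise AAI values; the function name `subset_mean_aai` and its parameters are hypothetical and are not taken from the original analysis code.

```python
import numpy as np

def subset_mean_aai(aai, n_subsets=1000, subset_size=100, lo=30.0, hi=90.0, seed=0):
    """Mean pairwise AAI for many random genome subsets.

    aai: square, symmetric matrix of pairwise AAI values in percent
         (NaN marks pairs without a value).
    Values outside [lo, hi] are ignored, following the 30-90% range in the text.
    """
    rng = np.random.default_rng(seed)
    n = aai.shape[0]
    # Upper-triangle indices visit each unordered genome pair exactly once.
    iu = np.triu_indices(subset_size, k=1)
    means = np.empty(n_subsets)
    for i in range(n_subsets):
        idx = rng.choice(n, size=subset_size, replace=False)
        vals = aai[np.ix_(idx, idx)][iu]
        keep = ~np.isnan(vals) & (vals >= lo) & (vals <= hi)
        means[i] = vals[keep].mean() if keep.any() else np.nan
    return means
```

A guild's own average AAI can then be compared against the returned array, for example by checking where it falls relative to the array's percentiles.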
A1.2 Extended AB Method
Below we present the complete procedure for selecting functional guilds from an Aspect
Bernoulli model:
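1. Calculate the Aspect Bernoulli estimates $\hat{\Gamma}$ and $\hat{\beta}$, main text Eq. 1
2. Calculate $w_{fk} = \frac{1}{G} \sum_{g=1}^{G} \frac{\Gamma_{gk}\,\beta_{kf}}{p_{gf}}$, main text Eq. 2 and Eq. 3
3. For each aspect $k = 1, \cdots, K$:
a. Calculate $\mathcal{G}_k \subseteq \{1, \cdots, G\}$ as the set of genomes $g$ such that the following is true:
– For each $k' \in \{1, \cdots, K\}$, $k' \neq k$, $\Gamma_{g,k} > \Gamma_{g,k'}$
– $\Gamma_{g,k}$ exceeds a fixed threshold (a fraction whose value is not legible in this transcript)
b. Compute $r_{fk} = \frac{\sum_{g \in \mathcal{G}_k} x_{gf}}{\frac{1}{F} \sum_{f'=1}^{F} \sum_{g \in \mathcal{G}_k} x_{gf'}}$, main text Eq. 4
c. Form the score $s_{fk} = w_{fk} \cdot r_{fk}$, main text Eq. 5
d. Denote as $\{o_{jk}\}_{j=1,\cdots,F}$ the decreasing order of $\{s_{fk}\}_{f=1,\cdots,F}$ (such that $s_{o_{\cdot,k},k}$ is
sorted from high to low). Next, denote as $\tilde{x}_{g,\cdot} = x_{g,o_{\cdot,k}}$ the rearranged version of the
row $x_{g,\cdot}$ and proceed to Option 1 or 2.
– Option 1: Fixed guild size. Calculate $\mathcal{M}_k$ to be the set of row indices $g$
such that $\tilde{x}_{g,1:N}$ are all equal to 1, where $N$ is the fixed guild size.
– Option 2: Data guild size.
• For each $n = 1, 2, \ldots, F$ calculate $\mathcal{M}_n$ to be the set of row indices $g$
such that $\tilde{x}_{g,1:n}$ are all equal to 1. Set $\hat{n}$ to be the largest value of $n$
such that $|\mathcal{M}_n| \geq 100$.
• Set $\mathcal{M}_k$ to be equal to $\mathcal{M}_{\hat{n}}$ and set functional guild $\mathcal{F}_k$ to be equal to
$o_{1:\hat{n},k}$.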
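The data-driven guild size in Option 2 can be illustrated with a short sketch. The Python below assumes NumPy, a binary genome-by-function matrix `X`, and a vector of per-function scores for a single aspect; the name `select_guild` and its arguments are hypothetical, and the code is an illustration of the selection rule rather than the implementation used in this work.

```python
import numpy as np

def select_guild(X, scores, min_genomes=100):
    """Pick the largest top-n function set shared by at least `min_genomes` genomes.

    X:      binary array, shape (genomes, functions)
    scores: per-function score for one aspect, shape (functions,)
    Returns (guild function indices, mapback genome indices).
    """
    # Order functions from highest to lowest score.
    order = np.argsort(-np.asarray(scores), kind="stable")
    X_sorted = np.asarray(X)[:, order].astype(bool)
    # has_top[g, n-1] is True when genome g carries all of the top-n functions.
    has_top = np.logical_and.accumulate(X_sorted, axis=1)
    counts = has_top.sum(axis=0)  # non-increasing in n
    valid = np.flatnonzero(counts >= min_genomes)
    if valid.size == 0:
        return order[:0], np.array([], dtype=int)
    n_hat = valid[-1] + 1  # largest guild size that still meets the threshold
    guild = order[:n_hat]
    mapback = np.flatnonzero(has_top[:, n_hat - 1])
    return guild, mapback
```

Because the number of genomes carrying all of the top-n functions can only shrink as n grows, the last size that meets the threshold is also the largest one.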
The underlying assumptions of AB are twofold. First, entry $x_{g,f} \in \{0,1\}$ (the presence or
absence of a function in a genome in our dataset) is a random Bernoulli realization of an underlying
scalar probability $p_{g,f} \in [0,1]$. Second, the matrix of probabilities $\{p_{gf}\}_{g=1,\ldots,G,\,f=1,\ldots,F}$ has an
underlying low-dimensional representation, and each entry is assumed to be a convex combination
of $K$ probabilities:
$p_{gf} = \Gamma_{g\cdot}\,\beta_{\cdot f}$ , Eq. 1
where β and Γ are two matrices which relate to a latent variable $z_{gfk}$, where $g$ denotes genome,
$f$ denotes function, and $k$ denotes aspect (see Terminology box for definition). The latent variable
$z_{gfk}$ encodes whether aspect $k$ is the active aspect for genome-function pair $gf$ (that is, only one
1 for each $gf$ pair; zero otherwise). With this latent variable, we can describe the matrices β and
Γ more precisely:
– $\Gamma_{gk} = P(z_{gfk} = 1)$ quantifies how strong the $k^{th}$ aspect is within each genome $g$.
– $\beta_{kf} = P(x_{gf} = 1 \mid z_{gfk} = 1)$ is the probability that function $f$ is present given that the
$k^{th}$ latent aspect is present for a given genome-function pair.
A visual schematic of the Aspect Bernoulli model can be seen in Appendix Figure A7.
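The generative reading of Eq. 1 can be made concrete with a small simulation. The sketch below is a minimal Python example, assuming NumPy; the function name `simulate_ab` and any input matrices are hypothetical and not part of the original analysis. For every genome-function pair it draws one active aspect from that genome's row of Γ, then draws presence or absence from the matching entry of β, so the marginal presence probability is the convex combination in Eq. 1.

```python
import numpy as np

def simulate_ab(Gamma, beta, seed=0):
    """Sample a binary genome-by-function matrix from an Aspect Bernoulli model.

    Gamma: shape (G, K), rows sum to 1 (aspect weights per genome)
    beta:  shape (K, F), entries in [0, 1] (function probability per aspect)
    Returns (X, z, P): sampled data, active aspect per pair, and P = Gamma @ beta.
    """
    rng = np.random.default_rng(seed)
    G, K = Gamma.shape
    F = beta.shape[1]
    # Draw the active aspect for each (genome, function) pair by inverting
    # the cumulative distribution of that genome's aspect weights.
    cdf = np.cumsum(Gamma, axis=1)                      # (G, K)
    u = rng.random((G, F))
    z = (u[:, :, None] > cdf[:, None, :]).sum(axis=2)   # (G, F)
    z = np.minimum(z, K - 1)  # guard against rounding in the cumulative sum
    # Presence is a Bernoulli draw with the active aspect's probability.
    p_active = beta[z, np.arange(F)[None, :]]
    X = (rng.random((G, F)) < p_active).astype(int)
    P = Gamma @ beta  # marginal presence probabilities, as in Eq. 1
    return X, z, P
```

With many genomes that share the same row of Γ, the column means of `X` should approach the corresponding row of `P`.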
A1.3 Extended Classical Methods
We ran the hierarchical clustering method clustergram at a cut height of 1 (Appendix
Figure A8) in addition to the cut height of 0.9 (main text Figure 2.3). The choice of linkage value
determines how sensitive the merging step is when deciding whether to join nodes together
in a cluster. Increasing the linkage value from 0.9 to 1 therefore had the primary effect of enlarging
the average cluster size from 5.8 to 11.1 functions per guild while reducing the total number of
clusters from 30 to 17.
We also tested the use of a dynamic cut height based on the topology of the tree using the
dynamicTreeCut package v1.63.1 on a complete linkage dendrogram generated by hclust and
analyzed with the dendextend v1.17.1 package (Galili, 2015) with R v4.2.3. With this dynamic cut
height we identified 19 distinct guilds averaging 8.8 functions (range 5-19) with an average of 47.6
mapback genomes per guild (range 0-391). However, 26.3% of the guilds (N=5) still had no
mapback genomes. Though this is overall a marked improvement over the static height clustering
values, these values are still substantially lower than those of the AB guilds (see main text Section 3.4).
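The effect of a static cut height on cluster count and size can be reproduced on any dendrogram. The analyses in this appendix used MATLAB's clustergram and R's hclust; the sketch below instead uses SciPy as a stand-in, and the function name `clusters_at_heights`, the Euclidean metric, and the input profiles are assumptions made only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def clusters_at_heights(profiles, heights, method="complete", metric="euclidean"):
    """Cut one dendrogram at several heights.

    profiles: array, shape (items, features), e.g. one row per function
    Returns {height: (number of clusters, mean cluster size)}.
    """
    Z = linkage(profiles, method=method, metric=metric)
    summary = {}
    for h in heights:
        # Items whose merge distance is at most h end up in the same cluster.
        labels = fcluster(Z, t=h, criterion="distance")
        sizes = np.bincount(labels)[1:]  # labels start at 1
        summary[h] = (int(labels.max()), float(sizes.mean()))
    return summary
```

For a monotone linkage such as complete linkage, raising the cut height can only merge clusters, so the cluster count never increases and the mean cluster size never decreases.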
A2 Extended validation of AB
A2.1 Extended Simulated Data Analysis
To test the sensitivity of the Aspect Bernoulli (AB) model to the number of factors, $K$, we
generated simulated datasets in which either one or three artificial guilds were added to the
composite dataset (see Methods). For each artificial guild type, we created 100 datasets with
individually inserted guilds drawn from nine combinations of guild size (i.e., number of functions)
and abundance (i.e., number of genomes). The AB model was then run using $K$ values ranging
from 5 to 20. The results from the sensitivity tests are presented in Appendix Table A2 for three
values of $K$ across our tested range ($K = 5, 10, 20$). Within each super column of Appendix Table
A2 (e.g., $K = 5$), three metrics (sub columns) are shown: Hit Rate, Extra Hits, and Multi Hits
(which we define below). The full set of values for each of these three metrics is plotted from
$K = 5$ to $K = 20$ for four of the size/abundance combinations in Appendix Figures A9 and A10,
for simulations with three and one artificial guilds, respectively. Below, we present the results from
the three artificial guild simulated data.
As described in the main text, Hit Rate describes the overall frequency of an artificial guild
being identified by the method (i.e., appearing in the top 15 functions of an aspect). One can think
of this as the recovery rate of an artificial guild. For example, at $K = 5$ we see that an artificial
guild of size 5 with 2% abundance was never found. As shown in Appendix Figure A9, the hit rate
for this guild improved as we increased $K$, reaching 16.0% at $K = 10$ and 81.7% at $K = 20$
(Appendix Table A2), substantially improving our ability to recover this rare guild. Similarly, as
the number of functions contained within the 2% abundant artificial guild increased from 5 to 9,
the recovery rate increased as well (aside from the case of $K = 5$, where it remained undetectable)
(Appendix Table A2). This trend held across all three abundance levels.
For each run, we also calculated Extra Hits (where an artificial guild appears at the top
of the function list in multiple aspects) and Multi Hits (where two or more artificial guilds occur
together at the top of the function list in a single aspect). The presence of Extra Hits in these
datasets is interpreted as overfitting by the Aspect Bernoulli algorithm since each artificial guild
should appear only once and be completely intact. At $K = 5$ and $K = 10$, zero Extra Hits were
observed for any of our nine size/abundance combinations. However, at $K = 20$, 4.7% of the total
observed hits across all guild parameter combinations were Extra Hits, reaching as much as 17.4%
of the total observed hits for guilds of size 9 and abundance of 10%. In other words, when $K$ was
high and the artificial guild was abundant, the method identified additional underlying structure in
the composite dataset and combined this with that abundant guild.
The presence of Multi Hits is interpreted as underfitting by the Aspect Bernoulli
algorithm since each guild should be in its own distinct factor. For large, high abundance guilds
(e.g., the size 9, 10% abundance guild in Appendix Table A2), which have high Hit Rates even at low
$K$ values, we observed a large number of Multi Hits when $K$ was small (evidence for underfitting),
with 21.0% of the total observed hits being Multi Hits when $K = 5$. Multi Hits then decreased as
$K$ increased as a result of more accurate fitting as the model was able to separate the artificial
guilds into individual factors. For artificial guilds with medium size and abundance, the Hit Rate
was low at low $K$. When this was the case, Multi Hits were also initially low. For these artificial
guilds, both Hit Rate and Multi Hits increased as $K$ increased. Once the Hit Rate approached 100%,
the Multi Hits began to decline back to zero with further increases in $K$ (Appendix Figure A10).
For example, in the case of a guild with 7 functions and a 5% abundance, the Multi Hits increased
from none at $K = 5$ to 18.7% at $K = 10$ before decreasing again to just 1.3% of hits at $K = 20$
(Appendix Table A2). At the same time, the Hit Rate increased from 2.0% at $K = 5$ to 94.0% at
$K = 10$ and 100% at $K = 20$. Low size and abundance guilds showed a similar pattern to medium
size guilds, but the transitions were shifted to higher values of $K$.
Finally, we checked whether our guild hits in each run were unique, or if some of the hits
were duplicate occurrences of the same artificial guild. For each simulated dataset that produced
Extra Hits, all of the unique guilds always appeared, as opposed to multiple duplications of the
same artificial guild.
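The three metrics can be made concrete with a minimal Python sketch. It assumes that an artificial guild counts as appearing in an aspect only when all of its functions fall within that aspect's top-ranked functions (the top 15 in this work), and it adopts one possible counting convention for Extra Hits and Multi Hits. The function `score_run` and its argument names are hypothetical and are not taken from the pipeline.

```python
def score_run(top_functions_per_aspect, artificial_guilds):
    """Count hits, extra hits, and multi hits for a single model run.

    top_functions_per_aspect: list of sets, the top-ranked functions of
        each aspect (e.g., the top 15).
    artificial_guilds: dict mapping guild name -> set of its functions.
    """
    # Aspects whose top functions contain the entire guild (assumed rule).
    aspects_with_guild = {
        name: [
            k
            for k, top in enumerate(top_functions_per_aspect)
            if functions <= top
        ]
        for name, functions in artificial_guilds.items()
    }
    found = [aspects for aspects in aspects_with_guild.values() if aspects]
    hits = len(found)
    # Each appearance beyond the first counts as one extra hit.
    extra_hits = sum(len(aspects) - 1 for aspects in found)
    # A multi hit is an aspect that contains two or more artificial guilds.
    multi_hits = sum(
        1
        for k in range(len(top_functions_per_aspect))
        if sum(k in aspects for aspects in found) >= 2
    )
    return {
        "hits": hits,
        "hit_rate": hits / len(artificial_guilds),
        "extra_hits": extra_hits,
        "multi_hits": multi_hits,
    }


tops = [{"a", "b", "c", "x"}, {"a", "b", "c", "d", "e"}, {"z"}]
guilds = {"g1": {"a", "b", "c"}, "g2": {"d", "e"}, "g3": {"q"}}
result = score_run(tops, guilds)
```

Here guild g1 is recovered twice (one hit plus one extra hit), g2 shares an aspect with g1 (one multi hit), and g3 is never recovered.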
A2.2 Randomly Inserted Artificial Datasets
The results described above and presented in Appendix Table A2 used non-overlapping
guilds such that each genome could only be a member of one artificial guild. We also generated a
second dataset type in which the artificial guilds were inserted randomly rather than in a strictly
non-overlapping fashion, such that any given genome could be a member of any number of
artificial guilds (Appendix Table A3). These overlapping datasets were generated using three
simulated guilds and the same nine size/abundance combinations as shown in Appendix Table A2.
Overall, we found that the Hit Rates and Extra Hits for the overlapping guilds were remarkably
similar to the values from the non-overlapping artificial guilds, but the frequency of Multi Hits
increased on average by 116.1% at $K = 5$, 83.1% at $K = 10$, and 79.7% at $K = 20$. We conclude
this is most likely due to the random insertion of guilds, which can create a stronger linkage
between artificial guilds than if only one guild is allowed per genome. As discussed in Appendix
A2.1, for guilds with medium size and abundance, the increase in Multi Hits at low $K$ values
corresponded to increasing Hit Rates. Once the Hit Rates neared 100%, the Multi Hits decreased
to zero as the model began to distinguish those guilds. The full range of values is presented for
four of the size/abundance combinations for 100 simulated datasets with three randomly inserted
guilds in Appendix Figure A11.
A2.3 Extended Analysis of the Impact of K
In addition to impacting the frequency with which we identify guilds of different sizes and
abundances, the choice of $K$ also influences the number of mapback genomes associated with each
guild. As mentioned in the Methods section, for each aspect $k$, the pipeline provides a score for
every function in the dataset. The user must then make a decision as to the subset of functions to
include within each functional guild. We investigated several different choices for determining
guild functions (see Methods). When $K$ was low (e.g., < 10), we observed higher numbers of
mapbacks on average (Appendix Table A7). We attribute this to the fact that, at low $K$, the pipeline
preferentially identified guilds of functions that were more abundant in the dataset. Specifically,
guilds identified at $K = 5$ averaged 362 mapback genomes (range 61-1,295) compared to 116
genomes (range 11-468) for guilds identified at $K = 10$ and 54 genomes (range 0-210) for guilds
identified at $K = 20$. This analysis was conducted by defining the guilds as the top 5 functions in
each aspect.
The number of functions included within each guild has a large impact on the number of
mapback genomes, as expected (Figure 2.3). For all aspects, as the number of functions in a guild
decreased, the number of mapback genomes increased. In the most extreme case, when a guild
was defined by just the top two functions, guilds identified at $K = 10$ in the composite dataset
averaged 301 mapback genomes (range 79-694). For many guilds, we observed a plateauing of
mapback genomes across a range of guild sizes indicating the presence of a robust guild in which
there was a strong co-occurrence of functions (e.g., if function A then also functions B and C).
Specifically, decreasing the number of functions required to be a member of the guild did not
substantially change the number of mapback genomes.
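The relationship between guild size and mapback genomes can be sketched in a few lines of Python. The sketch assumes that a genome maps back to a guild when it carries every function in that guild, and that a guild of size n is the n highest-scoring functions of an aspect. The function name `mapback_counts` and the toy inputs are hypothetical.

```python
import numpy as np


def mapback_counts(presence, aspect_scores, max_size):
    """Count genomes carrying all of the top-n functions, for n = 1..max_size.

    presence: binary genome-by-function array.
    aspect_scores: one score per function for a single aspect.
    """
    ranked = np.argsort(-aspect_scores)  # function indices, best score first
    counts = []
    for n in range(1, max_size + 1):
        guild = ranked[:n]
        # A genome maps back only if it has every function in the guild.
        has_all = presence[:, guild].all(axis=1)
        counts.append(int(has_all.sum()))
    return counts


toy_presence = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
])
toy_scores = np.array([0.9, 0.8, 0.1])
counts = mapback_counts(toy_presence, toy_scores, max_size=3)
```

Under this definition the counts can never increase as functions are added to the guild. A plateau across several guild sizes would then indicate strongly co-occurring functions.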
We also compared the number of mapback genomes we found among our probabilistic
representatives (defined as the top 5 scoring functions in each aspect) to the number of mapback
genomes across the full dataset. In most instances, there were additional mapback genomes in the
full dataset that were not included in the probabilistic representatives. For the composite dataset,
when $K = 10$, there was on average a 33% increase (range 0-134%) in the number of mapback
genomes found within the entire dataset as compared to just within the probabilistic representatives.
This occurred because genomes can be members of multiple guilds. For the probabilistic
representative identification, each genome is only assigned to the single aspect with which it is
most strongly associated, according to that genome’s Γ vector.
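The assignment of probabilistic representatives described above amounts to a row-wise argmax. The short Python sketch below assumes that Γ is stored as a genome-by-aspect array; the function name `assign_representatives` and the toy values are hypothetical.

```python
import numpy as np


def assign_representatives(gamma):
    """Assign each genome to the single aspect with its largest Gamma value.

    gamma: genome-by-aspect array (assumed orientation).
    Returns one aspect index per genome.
    """
    return np.argmax(gamma, axis=1)


toy_gamma = np.array([
    [0.7, 0.3],
    [0.2, 0.8],
    [0.6, 0.4],
])
assigned = assign_representatives(toy_gamma)
```

Because each genome receives exactly one aspect here, a genome that carries the functions of several guilds is counted for only one of them. This is consistent with the larger mapback counts found in the full dataset.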
A3 Speed and Stability
A3.1 Computational speed
In this section, we briefly examine the computational speed of the EM algorithm (Bingham,
Kabán, and Fortelius 2009) by running the algorithm once on our main dataset and examining the
objective value $\ell_t$ across algorithm iterations $t = 1, 2, \cdots$ (Appendix Figure A12). In our analysis
(see Results), we stopped the algorithm at a fixed number of iterations (2,000), which took about
1,320 seconds on a single core of an M1 processor. More generally, we recommend the
conventional stopping rule for this algorithm class: stopping when the relative improvement of
the objective value, $(\ell_t - \ell_{t-1})/\ell_{t-1}$, is sufficiently small (e.g., below a small tolerance given as a negative power of 10).
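A minimal Python sketch of such a stopping rule is given below. The tolerance of 1e-6 is an assumed placeholder rather than the value recommended in this work, and the absolute value is added so that the test also behaves sensibly for negative objectives such as log-likelihoods. The function names are hypothetical.

```python
def relative_improvement(previous, current):
    """Relative change in the objective between consecutive iterations."""
    return (current - previous) / abs(previous)


def first_converged_iteration(objective_values, tol=1e-6):
    """Return the 1-based iteration at which the relative improvement first
    falls below tol, or None if that never happens.

    objective_values: objective recorded after iterations 1, 2, ...
    tol: assumed tolerance (placeholder value).
    """
    for t in range(1, len(objective_values)):
        change = relative_improvement(objective_values[t - 1], objective_values[t])
        if abs(change) < tol:
            return t + 1
    return None


trace = [-100.0, -50.0, -40.0, -39.99999]
stop_at = first_converged_iteration(trace)
```

In practice this check would sit inside the EM loop, with a fixed iteration cap such as the 2,000 iterations used here kept as a fallback.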
A3.2 Numerical Stability
In estimating the AB model, we attempt to maximize a non-convex objective. For a better
chance at finding a global maximum, the EM algorithm must be run multiple times with new
random initializations, and the final estimate can be taken from the run with the highest final
objective value. Even using multiple restarts, obtaining a near-global optimum is still
computationally difficult. In the following analysis, we further investigated how numerically stable
the estimated guilds are, from two independent model estimates.
We produced each of the two model estimates as an approximate result of 30,000 random
initializations. Specifically, we first ran the EM algorithm many times (30,000 random
initializations) for a short amount of time (300 iterations) in order to identify an especially
promising initialization, which we then reran through the EM algorithm until we reached
numerical convergence.
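The two-stage strategy can be outlined generically in Python. In the sketch below, `fit` is a hypothetical callable standing in for one EM run from a given random initialization; it is assumed to return the final objective value together with the estimate. The long run is capped at a fixed iteration count here rather than run to numerical convergence.

```python
import random


def two_stage_fit(fit, n_starts, short_iters, long_iters, seed=0):
    """Screen many initializations briefly, then refine the best one.

    fit(init_seed, n_iters) -> (objective, estimate) is a hypothetical
    stand-in for one EM run from the initialization given by init_seed.
    """
    rng = random.Random(seed)
    init_seeds = [rng.randrange(2**32) for _ in range(n_starts)]
    # Stage 1: short runs to find the most promising initialization.
    best_seed = max(init_seeds, key=lambda s: fit(s, short_iters)[0])
    # Stage 2: rerun that initialization for many more iterations.
    return fit(best_seed, long_iters)
```

With the settings described above, `n_starts` would be 30,000 and `short_iters` would be 300.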
Next, the six aspects from each of the two model estimates needed to be numerically
matched using an appropriate permutation of one set of six aspects to closely resemble the other
set. For this matching, we used the Hungarian algorithm (Kuhn, 1955) to find the best permutation
of one model’s estimated $\{\Gamma_{\cdot,k}\}_{k=1}^{K}$ to the other’s. Matching using $\{\beta_{\cdot,k}\}_{k=1}^{K}$ produces the same
result.
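A minimal sketch of this matching step in Python is shown below, using SciPy's `linear_sum_assignment` as the Hungarian-style solver. The cost used here, the squared Euclidean distance between aspect columns, is an assumption, since the matching cost is not specified in this section. The function name `match_aspects` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_aspects(gamma_a, gamma_b):
    """Match the aspects (columns) of gamma_b to those of gamma_a.

    Returns an index array `match` such that gamma_b[:, match[k]] is the
    column paired with gamma_a[:, k].
    """
    # cost[k, l] = squared distance between column k of A and column l of B.
    diff = gamma_a[:, :, None] - gamma_b[:, None, :]
    cost = (diff ** 2).sum(axis=0)
    _, match = linear_sum_assignment(cost)
    return match


rng = np.random.default_rng(0)
model_a = rng.random((10, 4))
shuffle = np.array([2, 0, 3, 1])
model_b = model_a[:, shuffle]          # same aspects in a different order
match = match_aspects(model_a, model_b)
realigned = model_b[:, match]          # columns now line up with model_a
```

The same function can be applied to the β columns instead of the Γ columns, and the two matchings can then be compared.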
The final comparison of the two model estimates is presented in Appendix Figure A13 and
Appendix Table A8. Appendix Figure A13 shows six scatterplots (one for each aspect $k$) of the
per-function scores from the two model estimates. The points in each scatterplot have a close but
imperfect fit to the $y = x$ line. Next, Appendix Table A8 shows the six functional guilds produced
using our proposed pipeline (option 2 of Appendix A1.2) from the two model estimates. Of note,
Guilds 4 and 6 are perfectly conserved across models 1 and 2 and strongly resemble the DMSP
and motility guilds, respectively, that we present in the main text (see Discussion). Though the
matched scores do not fall precisely along the 1:1 line shown in red on each plot, the guilds in
Appendix Table A8 were very consistent between models, and the scores still showed a strong
linear trend.
Appendix Figure A1: Phylogenomic tree of the full composite dataset consisting of 3,840 genomes including MAGs,
SAGs, and isolate genomes. This tree presents 3,775 of those genomes (see Results) that represent 51 unique bacterial
phyla and 2 archaeal phyla. For clarity, the two archaeal phyla have been collapsed simply into an “Archaea”
designation.
Tree scale: 1
Appendix Figure A2: Phylogenomic tree for the SAG genomes sourced and quality filtered from the GORG-Tropics
expedition constituting 1,733 genomes. This tree presents 1,415 of those genomes (see Results) that represent 9 unique
bacterial phyla as well as 2 archaeal phyla that are collapsed simply to the designation “Archaea”.
Tree scale: 1. Phyla: Archaea, Verrucomicrobiota, Marinisomatota, Proteobacteria, Bacteroidota, Cyanobacteria, Actinobacteriota, SAR324, Planctomycetota, Chloroflexota
Appendix Figure A3: Phylogenomic tree for the DMSP guild as defined in the main text (see bolded functions in main
text Table 2). This guild is distributed across 4 bacterial families, primarily Rhodobacteraceae.
Phaeobacter daeponensis
Rhodobacterales bacterium Y4I
Phaeobacter caeruleus
Leisingera sp ANG-M1
Leisingera sp ANG-Vp
Leisingera aquimarina
Roseobacter sp MedPE-SWde
Phaeobacter arcticus
Phaeobacter sp 11ANDIMAR09
Phaeobacter sp CECT 5382
Nautella italica CECT 7645 1
Ruegeria sp R11
Alphaproteobacteria bacterium MedPE-SWcel
Rhodobacterales bacterium 56 14 T64
Ruegeria atlantica CECT 4293
Ruegeria atlantica CECT 4292
Ruegeria sp ZGT108
Ruegeria sp ZGT118
Ruegeria marina strain CGMCC 19108
Tropicibacter litoreus R37
Sedimentitalea nanhaiensis strain CGMCC 110959
Aestuariivita boseongensis BS-B2
Sulfitobacter noctilucicola
Sulfitobacter geojensis strain EhN01
Sulfitobacter mediterraneus 1FIGIMAR09
Oceanibulbus sp HI0023
Roseobacter litoralis Shiba 1991
Roseobacter denitrificans OCh 114
Tateyamaria sp ANG-S1
Phaeobacter sp CECT 7735
Pseudopelagicola gijangensis strain DSM 100564
Thalassobius aestuarii strain DSM 15283
Shimia sp SK013
Lentibacter algarum strain DSM 24677
Roseovarius mucosus
Roseovarius sp TM1035
Roseovarius sp 217
Roseovarius sp MCTG156 2b
Roseovarius marisflavi strain DSM 29327
Roseovarius gaetbuli CECT 8370
Roseovarius lutimaris strain DSM 28463
Pelagicola litorisediminis CECT 8287
Roseovarius aestuarii CECT 7745
Roseovarius indicus
Roseovarius atlanticus R12b
Roseovarius nubinhibens
Lutimaribacter saemankumensis strain DSM 28010
Marivita cryptomonadis CL-SK44
Marivita geojedonensis DPG-138
Marivita hallyeonensis strain DSM 29431
Tropicibacter phthalicicus strain CECT 8649
Pelagimonas varians strain CECT 8663
Maliponia aquimaris CECT 8898
Marinovum algicola strain FF3
Aestuariivita atlantica 22II-S11-z3
Rhodobacteraceae bacterium MED-G08
Rhodobacteraceae bacterium REDSEA-S34 B6
Marinovum sp strain MED682
Rhodobacteraceae bacterium HIMB11
Rhodobacteraceae bacterium SB2
Alphaproteobacteria bacterium casp-alpha6
Marinovum sp strain CPC320
Labrenzia alba CECT 7551
Labrenzia alba CECT 5096
Labrenzia alexandrii CECT 5112
Roseibium sp TrichSKD4
Leucothrix mucor Oersted 1844
Oceanospirillaceae bacterium UBA7410
Rhodobacterales bacterium HTCC2255
Rhodobacteraceae bacterium HTCC2150
Rhodobacter sp BACL10 MAG-120419-bin15
Litoreibacter janthinus strain DSM 26921
Pseudorhodobacter ferrugineus
Litoreibacter ascidiaceicola DSM 100566
Thalassobacter arenae DSM 19593
Jannaschia donghaensis CECT 7802
Thalassobacter stenotrophicus 1CONIMAR09
Jannaschia sp EhC01
Nioella nitratireducens strain SSW136
Maribius pelagius strain DSM 26893
Boseongicola aestuarii CECT 8489
Loktanella rosea DSM 29591
Loktanella sp 5RATIMAR09
Loktanella koreensis strain DSM 17925
Loktanella marina CECT 8899
Octadecabacter temperatus strain DSM 26878
Octadecabacter ascidiaceicola CECT 8868
Planktotalea sp IMCC9565
Rhodobacteraceae bacterium HTCC2083
Tree scale: 1. Family: Stappiaceae, Rhodobacteraceae, Thiotrichaceae, Nitrincolaceae
Appendix Figure A4: Phylogenomic tree for the motility guild as defined in the main text (see bolded functions in
main text Table 2). This guild is distributed across 9 bacterial orders, most notably in the Enterobacterales and
Caulobacterales.
Marinobacter hydrocarbonoclasticus UBA1145
Marinobacter sp strain CPC48
Marinobacter sp tcs-11
Marinobacter sp C1S70
Marinobacter daepoensis
Marinobacter hydrocarbonoclasticus strain STW2
Marinobacter excellens LAMA 842
Marinobacter santoriniensis
Marinobacter sp AK21
Marinobacter sp DSM 26671
Marinobacter sp strain IN6
Marinobacter sp YWL01
Marinobacter manganoxydans MnI7-9
Marinobacter sp ES 048
Marinobacter sp EhC06
Marinobacter adhaerens strain PBVC038
Marinobacter guineae strain M3B
Marinobacter antarcticus strain CGMCC 110835
Marinobacter sp ANT B65
Marinobacter salexigens strain HJR7
Marinobacter aromaticivorans strain D15-8P
Marinobacter sp HL-58
Marinobacter sp UBA3604
Marinobacter daqiaonensis strain CGMCC 19167
Marinobacter segnicrescens strain CGMCC 16489
Marinobacter mobilis strain CGMCC 17059
Marinobacter zhejiangensis strain CGMCC 17061
Marinobacter sp X15-166B
Marinobacter sp UBA2698
Marinobacter nanhaiticus D15-8W
Alteromonadaceae bacterium strain EAC69
Gammaproteobacteria bacterium UBA3977
Mangrovitalea sediminis strain M11-4
Hahellaceae bacterium strain EAC91
Gammaproteobacteria bacterium UBA2679
Oleiphilus sp HI0078
Oleiphilus sp HI0061
Oleiphilus sp HI0080
Proteobacteria bacterium strain DOLZORAL124 50 18
Oceanospirillum sp AK56
Oceanospirillum linum ATCC 11336
Oceanospirillum maris
Oceanospirillum sp strain CPC24
Oceanospirillum beijerinckii
Oceanospirillum multiglobuliferum ATCC 33336
Marinobacterium stanieri S30
Marinobacterium georgiense DSM 11526
Marinobacterium sp AK27
Marinobacterium jannaschii
Amphritea japonica ATCC BAA-1530
Amphritea atlantica strain DSM 18887
Thalassolituus sp UBA2527
Thalassolituus sp UBA2535
Oceanospirillaceae bacterium strain NP39
Oceanospirillaceae bacterium strain CPC2
Oceanospirillaceae bacterium strain RS342
Oleibacter marinus DSM 24913
Thalassolituus sp ESRF-bin18
Oceanobacter kriegii
Bermanella sp strain CPC64
Bermanella marisrubri
Reinekea blandensis
Alteromonadaceae bacterium Bs12
Alteromonadales bacterium BS08
Teredinibacter turnerae 3
Simiduia agarivorans
Microbulbifer thermotolerans strain DSM 19189
Microbulbifer sp HZ11
Gammaproteobacteria bacterium 50 400 T64
Cellvibrionales bacterium strain NORP45
Cellvibrionales bacterium UBA5860
Dasania marina
marine gamma proteobacterium HTCC2143
Halioglobus sp HI00S01
Melitea salexigens DSM 19753
Pseudomonas baetica strain LMG 25716
Pseudomonas moraviensis strain UCD-KL30
Pseudomonas sp ATCC PTA-122608
Pseudomonas sp 06C 126
Pseudomonas deceptionensis strain LMG 25555
Pseudomonas sp UBA2684
Pseudomonas taeanensis
Pseudomonas cuatrocienegasensis strain CIP 109853
Pseudomonas alcaliphila NBRC 102411
Pseudomonas sp PI11
Pseudomonas oleovorans MGY01
Pseudomonas alcaligenes OT 69
Pseudomonas stutzeri NF13
Pseudomonas stutzeri CCUG 16156
Pseudomonas stutzeri HI00D01
Pseudomonas stutzeri MF28
Pseudomonas xanthomarina DSM 18231
Pseudomonas sp strain SP133
Pseudomonas aeruginosa Ocean-1187
Pseudomonas pachastrellae strain JCM 12285
Pseudomonas sp MT5
Pseudomonas aestusnigri VGXO14
Pseudomonas pelagia
Zooshikella ganghwensis JC2044
Alcanivorax sp UBA5099
Alcanivorax sp UBA4521
Alcanivorax sp strain NAT75
Alcanivorax sp UBA5084
Alcanivorax sp UBA5163
Alcanivorax sp 43B GOM-46m
Alcanivorax pacificus W11-5
Woodsholea maritima
Gammaproteobacteria bacterium UBA6940
Pseudomonadales bacterium strain NAT4
Gammaproteobacteria bacterium 45 16 T64
Gammaproteobacteria bacterium UBA3067
Pseudacidovorax intermedius NH-1
Rhodoferax ferrireducens A7
Hydrogenophaga sp LPB0072
Burkholderiales bacterium UBA1834
Cupriavidus sp SK-3
Cupriavidus sp UBA2534
Janthinobacterium sp BJB303
Oxalobacteraceae bacterium IMCC9480
Sutterellaceae bacterium strain SAT62
Lysobacter spongiicola DSM 21749
Luteimonas sp FCS-9
Algiphilus aromaticivorans
Oceanococcus atlanticus 22II-S10r2
Nevskiales bacterium strain NP100
Salinisphaera sp strain NP40
Salinisphaera shabanensis
Chromatiales bacterium ex Bugula neritina AB1
Halothiobacillaceae bacterium UBA5861
Thiorhodococcus drewsii
Thiorhodococcus sp AK35
Thiocystis violascens
Marichromatium purpuratum 984
endosymbiont of Loripes lucinalis strain G A
endosymbiont of Loripes lucinalis strain G E
endosymbiont of Codakia orbicularis strain COS
Sedimenticola selenatireducens
Cycloclasticus sp 46 120 T64
Gallaecimonas xiamenensis
Spongiobacter sp S2292
Pseudoalteromonas sp S3431
Pseudoalteromonas sp BSw20308
Pseudoalteromonas sp Bsi20652
Pseudoalteromonas sp 78C3
Alteromonadales bacterium TW-7
Pseudoalteromonas sp ECSMB14103
Pseudoalteromonas sp ESRF-bin5
Pseudoalteromonas sp H103
Pseudoalteromonas sp TB51
Pseudoalteromonas sp P1-7a
Pseudoalteromonas sp 6BO GOM-1096m
Pseudoalteromonas sp P1-16-1b
Pseudoalteromonas haloplanktis
Pseudoalteromonas prydzensis strain DSM 14232
Pseudoalteromonas porphyrae UCD-SED14
Pseudoalteromonas sp H105
Pseudoalteromonas sp 3D05
Pseudoalteromonas sp UCD-33C
Pseudoalteromonas arabiensis JCM 17292
Pseudoalteromonas lipolytica SCSIO 04301
Pseudoalteromonas luteoviolacea S4054
Pseudoalteromonas luteoviolacea HI1 HI1
Pseudoalteromonas luteoviolacea H33-S
Pseudoalteromonas luteoviolacea 2ta16
Pseudoalteromonas luteoviolacea S2607
Pseudoalteromonas luteoviolacea strain IPB1
Pseudoalteromonas luteoviolacea
Pseudoalteromonas rubra S2471
Pseudoalteromonas sp R3
Pseudoalteromonas rubra
Pseudoalteromonas elyakovii
Pseudoalteromonas piscicida strain 36Y RITHPW
Pseudoalteromonas luteoviolacea ATCC 29581
Pseudoalteromonas byunsanensis strain JCM 12483
Pseudoalteromonas sp JW3
Pseudoalteromonas citrea NCIMB 1889
Pseudoalteromonas sp SW0106-04
Pseudoalteromonas sp P1-9
Pseudoalteromonas sp 520P1 No. 423
Pseudoalteromonas denitrificans DSM 6059
Pseudoalteromonas tunicata D2
Pseudoalteromonas ulvae TC14
Algicola sagamiensis
Rheinheimera perlucida
Rheinheimera sp strain CPC13
Arsukibacterium ikkense GCM72
Alishewanella aestuarii
Rheinheimera pacifica strain DSM 17616
Rheinheimera baltica
Rheinheimera nanhaiensis E407-8
Rheinheimera sp KH87
Colwellia marinimaniae MTCD1
Colwellia sp TT2012
Colwellia piezophila ATCC BAA-637
Colwellia psychrerythraea GAB14E
Colwellia sp 75C3
Colwellia sp 12G3
Colwellia sp Bg11-28
Colwellia psychrerythraea ND2E
Colwellia sp 39 35 sub15 T18
Colwellia sp strain NORP29
Colwellia polaris MCCC1C00015
Colwellia sediminilitoris KCTC 52213
Colwellia aestuarii KCTC 12480
Colwellia mytili KCTC 52417
Colwellia chukchiensis CGMCC1 9127
Colwellia sp QM50
Thalassomonas actiniarum A5K-106
Thalassomonas viridans
Thalassotalea sp PP2-459
Alteromonas marina AD001
Alteromonas macleodii REDSEA-S09 B2
Alteromonas sp W12
Alteromonas macleodii UBA1010
Alteromonas sp ALT199
Alteromonas sp V450
Alteromonas sp LTR
Alteromonas australica strain UBA2516
Alteromonas sp strain NORP73
Alteromonas confluentis strain KCTC 42603
Alteromonadaceae bacterium strain NP2
Alteromonadaceae bacterium UBA7877
Alteromonadaceae bacterium UBA7387
Alteromonas lipolytica strain JW12
Aestuariibacter aggregatus CGMCC 1 8995
Alteromonas sp P0211
Alteromonas sp P0213
Glaciecola sp 33A
Glaciecola pallidula
Glaciecola sp UBA2563
Aestuariibacter salexigens
Glaciecola mesophila KMM 241
Pseudoalteromonas sp PLSV
Glaciecola chathamensis S18K6
Glaciecola polaris LMG 21857
Glaciecola psychrophila 170
Paraglaciecola sp MB-3u-78
Glaciecola arctica BSs20135
Paraglaciecola sp S66
Glaciecola lipolytica E3
Idiomarina abyssalis strain KMM 227
Idiomarina sp UBA4206
Idiomarina sp 28-8
Idiomarina zobellii strain KMM 231
Idiomarina sp MD25a
Idiomarinaceae bacterium strain SP259
Idiomarina xiamenensis 10-D-4
Shewanella frigidimarina Ag06-30
Shewanella sp ALD9
Shewanella sp Actino-trap-3
Shewanella sp P1-14-1
Shewanella sp UCD-KL21
Shewanella baltica OS625
Shewanella morhuae ATCC BAA-1205
Shewanella xiamenensis BC01
Shewanella sp ECSMB14101
Shewanella sp cp20
Shewanella colwelliana ATCC 39565
Shewanella waksmanii ATCC BAA-643
Shewanella sp Alg231 23
Shewanella benthica KT99
Shewanella sp Bg11-22
Shewanella algae strain CSB04KR
Paraferrimonas sedimenticola NBRC 101628
Ferrimonas sediminum strain DSM 23317
Ferrimonas kyonanensis
Ferrimonas futtsuensis
Ferrimonas senticii
Ferrimonas marina strain DSM 16917
Psychromonas arctica
Psychromonas sp psych-6C06
Moritella sp Urea-trap-13
Gammaproteobacteria bacterium MedPE
Alteromonadales bacterium alter-6D02
Aeromonas veronii ARB3
Aeromonas veronii strain UBA1835
Aeromonas caviae strain CH129
Aeromonas hydrophila AL10-121
Aeromonas salmonicida subsp salmonicida strain J223
Aeromonas salmonicida CBA100
Aeromonas molluscorum 848
Oceanimonas baumannii ATCC 700832
Oceanimonas smirnovii ATCC BAA-899
Zobellella denitrificans ZD1
Aliagarivorans marinus
Photobacterium swingsii
Photobacterium sanguinicancri CAIM 1827
Photobacterium jeanii strain R-40508
Photobacterium profundum 3TCK
Photobacterium sp 13-12
Photobacterium sp J15
unclassified bacterium 3
Photobacterium aquae
Photobacterium aphoticum
Photobacterium ganghwense
Photobacterium gaetbulicola AD005a
Photobacterium leiognathi CUB1
Photobacterium damselae subsp damselae strain RM-71
Photobacterium halotolerans MELD1
Enterovibrio norvegicus FF-33
Enterovibrio calviensis
Enterovibrio calviensis FF-85
Enterovibrio nigricans DSM 22720
Grimontia marina CECT 8713
Grimontia sp AK16
Grimontia sp CECT 9029
Enterovibrio pacificus strain CAIM 1920
Salinivibrio sp IB868
Salinivibrio sp MA607
Aliivibrio sp 1S175
Salinivibrio sp PR5
Aliivibrio wodanis CL4
Aliivibrio logei 5S-186
Vibrio sp vnigr-6D03
Vibrio sp qd031
Aliivibrio fischeri strain 5F7
Vibrio lentus strain 5F79
Vibrio tasmaniensis strain UCD-FRSSP16 35
Vibrio nigripulchritudo Wn13
Vibrio tasmaniensis 1F-187
Vibrio cyclitrophicus 1F97
Vibrio splendidus MOR2
Vibrio kanaloae 5S-149
Vibrio sp 624788
Vibrionales bacterium SWAT-3
Vibrio splendidus MARa
Vibrio splendidus ATCC 33789
Vibrio sp HI00D65
Vibrio fortis Dalian14
Vibrio genomosp FF-238
Vibrio sp A354
Vibrio thalassae strain CECT8203
Vibrio mediterranei strain 21LN0615E
Vibrio genomo sp 9ZC157
Vibrio mexicanus
Vibrio astriarenae
Vibrio metoecus OYP9C12
Vibrio sp RC586
Vibrio cholerae YB4G05
Vibrio mimicus VM223
Vibrio furnissii S0821
Vibrio fluvialis S1110
Vibrio diazotrophicus
Vibrio alginolyticus RM-24-2
Vibrio anguillarum HI618
Vibrio proteolyticus
Vibrio xiamenensis strain CGMCC 110228
Vibrio sp JCM 18905 2
Vibrio parahaemolyticus HS-22-14
Vibrio alginolyticus LMG 11650
Vibrio sp JCM 18904 2
Vibrio natriegens 2
Vibrio parahaemolyticus HS-13-1 100
Vibrio parahaemolyticus M0605
Vibrio parahaemolyticus 22702
Vibrio campbellii CAIM198
Vibrio campbellii PEL22A
Vibrio rotiferianus DAT722
Vibrio harveyi ZJ0603
Vibrio harveyi MOR3
Vibrio sp PID17 43
Vibrio azureus 2
Vibrio sagamiensis NBRC 104589
Vibrio scophthalmi strain FP3289
Vibrio vulnificus VVyb1 BT3
Vibrio scophthalmi LMG 19158
Vibrio ichthyoenteri ATCC 700023
Vibrio panuliri strain CAIM 1902
Vibrio ponticus CAIM 1731
Vibrio sp DCR 1-4-2
Vibrio nereis NBRC 15637
Vibrio hepatarius
Vibrio europaeus strain PP-638
Vibrio bivalvicida 605
Vibrio tubiashii T33
Vibrio galatheae S2757
Vibrio orientalis CIP102891
Vibrio sinaloensis AD032
Vibrio brasiliensis
Vibrio sinaloensis DSMZ 21326
Vibrio sp B183
Vibrio coralliilyticus OCN008
Vibrio neptunius S2394
Vibrio caribbenthicus ATCC BAA2122
Vibrio pacinii
Tree scale: 1. Order: Chromatiales, Xanthomonadales, Caulobacterales, Enterobacterales, Nevskiales, Pseudomonadales, Methylococcales, Burkholderiales, Granulosicoccales
Appendix Figure A5: Total distribution of AAI values between 30% and 90% for all genome pairs in our composite
dataset of 3,840 genomes. On average, a given genome pair had an AAI value of 39.1%.
Appendix Figure A6: Histogram of average AAI values for our Monte Carlo style simulation of 1,000 sets of 100
random genomes. The AAI values for all genome pairs in each random set were averaged to construct the distribution
in white bars. In addition, we computed the average AAI value for all pairs of genomes in each of our 10 guilds and
overlaid those with colored vertical lines. The High ANI line that is shown to the right of the plot break shows the
AAI value for the 100 genomes with the most non-NA ANI values (i.e., the most similar set of 100 genomes). The
axis break was produced using ggbreak (Xu et al., 2021).
Appendix Figure A7: Visual schematic of the model procedure that shows how we model our data matrix Y as a
matrix V of Bernoulli probabilities that we then decomposed into the two matrices $\Gamma$ and $\beta$ to create a low-dimensional representation of Y.
Appendix Figure A8: Resulting dendrogram from running clustergram on our composite dataset with a cut height of
1 (red = present, black = absent). The rows (genomes) and columns (functions) were both clustered using the Jaccard
distance metric.
Appendix Figure A9: Simulated data metric values for 100 simulations with three artificial guilds across the tested
range of K’s (K=5-20): Hit Rate as a proportion (top panel), Extra Hits (middle panel), and Multi Hits (bottom panel).
Appendix Figure A10: Simulated data metric values for 100 simulations with a single artificial guild across the tested
range of K’s (K=5-20): Hit Rate as a proportion (top panel), Extra Hits (middle panel), and Multi Hits (bottom panel).
As seen in the bottom panel, there are no Multi Hits for the single guild simulations because a Multi Hit,
as defined in Appendix A2, requires two or more artificial guilds.
Appendix Figure A11: Simulated data metric values for 100 simulations with three artificial guilds randomly inserted
into the dataset across the tested range of K’s (K=5-20): Hit Rate as a proportion (top panel), Extra Hits (middle panel),
and Multi Hits (bottom panel).
Appendix Figure A12: Value of the MLE estimator for runs of the AB that only vary in number of iterations used.
Appendix Figure A13: Comparison of scores for matched guilds from two independent AB model estimates using a
two-step approach that identified good initialization states and then ran the EM algorithm for many steps for those
states. We see that the scores lie along the 1:1 line (red), showing that the guilds are relatively stable across model
estimates.
Appendix Table A4: Results of Aspect Bernoulli runs with three non-overlapping artificial guilds at three aspect
numbers K=5,10,20. Hit Rate describes the percentage of hits observed out of all possible hits. Extra Hits represent
runs where an artificial guild appears in more than one aspect. Multi Hits occur when multiple artificial guilds appear
together in a single aspect. Hit rate, Extra Hits, and Multi Hits are shown as percent values (%).
glycolysis | NAD-reducinghydrogenase | CP-lyasecomplex | riboseSBP
gluconeogenesis | NADP-reducinghydrogenase | CP-lyaseoperon | erythritolSBP
TCACycle | NiFehydrogenaseHyd-1 | TypeISecretion | putativexylitolSBP
NAD(P)Hquinoneoxidoreductase | thiaminbiosynthesis | TypeIIISecretion | inositolSBP
NADHquinoneoxidoreductase | riboflavinbiosynthesis | TypeIISecretion | inositolphosphateSBP
F-typeATPase | cobalaminbiosynthesis | TypeIVSecretion | putativefructooligosaccharideSBP
V-typeATPase | transporter:vitaminB12 | TypeVISecretion | glycerolSBP
Cytochromecoxidase | transporter:thiamin | Sec-SRP | putativesnGlycerol3-phosphateSBP
Ubiquinolcytochromecreductase | transporter:urea | TwinArginineTargeting | putativesorbitol/mannitolSBP
Cytochromeoubiquinoloxidase | transporter:phosphonate | TypeVabcSecretion | arabinosaccharideSBP
Cytochromeaa3-600menaquinoloxidase | transporter:phosphate | sulfateSBP | gammahexachlorocyclohexaneSBP
Cytochromecoxidase,cbb3-type | Flagellum | molybdateSBP | phospholipidSBP
Cytochromebdcomplex | Chemotaxis | molybdate/tungstateSBP | putativemultiplesugarSBP
RuBisCo | Methanogenesisviamethanol | tungstateSBP | putativesimplesugarSBP
CBBCycle | Methanogenesisviaacetate | nirate/nitriteSBP | lysine/arginine/ornithineSBP
rTCACycle | Methanogenesisviadimethylsulfide,methanethiol,methylpropanoate | bicarbonateSBP | histidineSBP
Wood-Ljungdahl | Methanogenesisviamethylamine | taurineSBP | glutamineSBP
3-HydroxypropionateBicycle | Methanogenesisviatrimethylamine | sulfonateSBP | arginineSBP
4-Hydroxybutyrate/3-hydroxypropionate | Methanogenesisviadimethylamine | phthalateSBP | glutamate/aspartateSBP
ammoniaoxidation | MethanogenesisviaCO2 | ironiiiSBP | octopine/nopalineSBP
hydroxylamineoxidation | CoenzymeB/CoenzymeMregeneration | putrescineSBP | generallaminoacidSBP
nitriteoxidation | CoenzymeMreductiontomethane | spermidine/putrescineSBP | glutamateSBP
dissimnitratereduction | Solublemethanemonooxygenase | putativespermidine/putrescineSBP | cystineSBP
DNRA | dimethylamine/trimethylaminedehydrogenase | mannopineSBP | l-cystineSBP
nitritereduction | PhotosystemII | 2-aminoethylphosphonateSBP | arginine/ornithineSBP
nitricoxidereduction | PhotosystemI | glycinebetaine/prolineSBP | putativelysineSBP
nitrousoxidereduction | Cytochromeb6/fcomplex | osmoprotectantSBP | branchedchainaminoacidSBP
nitrogenfixation | anoxygenictype-IIreactioncenter | hydroxymethylpyrimidineSBP | neutralaminoacidSBP
hydrazinedehydrogenase | anoxygenictype-Ireactioncenter | putativethaiminSBP | d-methionineSBP
hydrazinesynthase | Retinalbiosynthesis | maltose/maltodextrinSBP | putativeglutamineSBP
dissimilatorysulfate<>APS | Entner-DoudoroffPathway | arabinogalactanSBP | putativeaminoacidSBP
dissimilatorysulfite<>APS | Mixedacid:Lactate | raffinose/stachyose/melibioseSBP | putativepolaraminoacidSBP
dissimilatorysulfite<>sulfide | Mixedacid:Formate | alphaglucosideSBP | oligopeptideSBP
147
thiosulfateoxidati
on
Mixedacid:FormatetoCO2&H2 glucose/arabinose
SBP
dipeptideSBP
altthiosulfateoxid
ationtsdA
Mixedacid:Acetate glucose/mannoseS
BP
cationicpeptideSB
P
altthiosulfateoxid
ationdoxAD
Mixedacid:Ethanol,AcetatetoAcetyl
aldehyde
trehalose/maltose
SBP
nickelSBP
sulfurreductasesr
eABC
Mixedacid:Ethanol,AcetylCoAtoAcetylaldehyde(reversible)
trehaloseSBP glutathioneSBP
thiosulfate/polysu
lfidereductase
Mixedacid:Ethanol,Acetylaldehydet
oEthanol
nacetylglucosamine
SBP
peptides/nickelSB
P
sulfhydrogenase Mixedacid:PEPtoSuccinateviaOAA,
malate&fumarate
cellobioseSBP microcincSBP
sulfurdisproportio
nation
Naphthalenedegradationtosalicylate ndiacetylchitobiose
SBP
ironcomplexSBP
sulfurdioxygenas
e
BiofilmPGASynthesisprotein putativechitobiose
SBP
manganeseSBP
sulfitedehydrogen
ase
ColanicacidandBiofilmtranscription
alregulator
l-arabinoseSBP zincSBP
sulfitedehydrogen
ase(quinone)
BiofilmregulatorBssS lactose/larabinoseSBP
iron/zinc/mangane
se/copperSBP
sulfideoxidation ColanicacidandBiofilmproteinA oligogalacturonide
SBP
manganese/ironS
BP
sulfurassimilation Curlifimbriaebiosynthesis alpha14digalactur
onateSBP
manganese/zincS
BP
DMSPdemethylat
ion
Adhesion putativealdouronat
eSBP
manganese/zinc/ir
onSBP
DMSdehydrogen
ase
Competence-relatedcorecomponents methylgalactosideSBP
cobalt/nickelSBP
DMSOreductase Competencerelatedrelatedcomponents
d-xyloseSBP biotinSBP
NiFehydrogenase Competencefactors xylobioseSBP betacarotene15,15-
monooxygenase
ferredoxinhydrog
enase
Glyoxylateshunt multiplesugarSBP rhodopsin
148
membraneboundhydrogenas
e
Anapleroticgenes d-alloseSBP transporter:ammo
nia
hydrogen:quinon
eoxidoreductase
Sulfolipidbiosynthesis fructoseSBP DMSPlyase(dddL
QPDKW)
Appendix Table A5: Results of Aspect Bernoulli runs with three non-overlapping artificial guilds at three aspect
numbers K=5,10,20. Hit Rate describes the percentage of hits observed out of all possible hits. Extra Hits represent
runs where an artificial guild appears in more than one aspect. Multi Hits occur when multiple artificial guilds appear
together in a single aspect. Hit rate, Extra Hits, and Multi Hits are shown as percent values (%).
Guild Size/Abundance | K=5 Hit Rate | K=5 Extra Hits | K=5 Multi Hits | K=10 Hit Rate | K=10 Extra Hits | K=10 Multi Hits | K=20 Hit Rate | K=20 Extra Hits | K=20 Multi Hits
5/0.02 | 0 | 0 | 0 | 16.0 | 0 | 1.0 | 81.7 | 0 | 7.7
5/0.05 | 0 | 0 | 0 | 88.0 | 0 | 11.3 | 99.7 | 0.7 | 4.0
5/0.1 | 22.3 | 0 | 0 | 100 | 0 | 9.7 | 100 | 12.3 | 0
7/0.02 | 0 | 0 | 0 | 40.0 | 0 | 3.3 | 95.0 | 0 | 7.3
7/0.05 | 2.0 | 0 | 0 | 94.0 | 0 | 18.7 | 100 | 0 | 1.3
7/0.1 | 64.3 | 0 | 10.0 | 100 | 0 | 3.3 | 100 | 12.0 | 0
9/0.02 | 0 | 0 | 0 | 70.1 | 0 | 8.0 | 98.7 | 0 | 7.3
9/0.05 | 14.0 | 0 | 2.0 | 99.0 | 0 | 15.0 | 100 | 0 | 1.0
9/0.1 | 87.7 | 0 | 21.0 | 100 | 0 | 0.3 | 100 | 17.4 | 0
Appendix Table A6: Results of Aspect Bernoulli runs with three artificial guilds inserted randomly at three aspect
numbers K=5,10,20. Hit Rate describes the percentage of hits observed out of all possible hits. Extra Hits represent
runs where an artificial guild appears in more than one aspect. Multi Hits occur when multiple artificial guilds appear
together in a single aspect. Hit rate, Extra Hits, and Multi Hits are shown as percent values (%).
Guild Size/Abundance | K=5 Hit Rate | K=5 Extra Hits | K=5 Multi Hits | K=10 Hit Rate | K=10 Extra Hits | K=10 Multi Hits | K=20 Hit Rate | K=20 Extra Hits | K=20 Multi Hits
5/0.02 | 0 | 0 | 0 | 11.7 | 0 | 0 | 81.7 | 0.3 | 6.3
5/0.05 | 0 | 0 | 0 | 87.7 | 0 | 19.3 | 100 | 0.7 | 8.9
5/0.1 | 23.3 | 0 | 3.0 | 100 | 0 | 23.0 | 100 | 11.5 | 5.3
7/0.02 | 0 | 0 | 0 | 46.7 | 0 | 4.3 | 95.0 | 0 | 10.7
7/0.05 | 4.3 | 0 | 0.3 | 95.3 | 0 | 25.0 | 100 | 0.3 | 6.6
7/0.1 | 78.3 | 0 | 26.3 | 100 | 0 | 17.7 | 100 | 15.7 | 2.0
9/0.02 | 0 | 0 | 0 | 69.0 | 0 | 8.3 | 99.3 | 0 | 8.3
9/0.05 | 27.3 | 0 | 9.7 | 98.7 | 0 | 23.7 | 100 | 0.3 | 2.7
9/0.1 | 92.7 | 0 | 32.0 | 100 | 0 | 8.0 | 100 | 16.2 | 0.6
Appendix Table A7: Guilds defined as top 5 functions of each aspect for a run of the AB model on the composite
dataset with K=10 aspects.
Guild 1 | CoenzymeB/CoenzymeMregeneration | Methanogenesisviaacetate | molybdate/tungstateSBP | dissimilatorysulfite<>sulfide | dissimilatorysulfite<>APS
Guild 2 | TypeIISecretion | Ubiquinolcytochromecreductase | Cytochromecoxidase,cbb3-type | Flagellum | phospholipidSBP
Guild 3 | DMSPdemethylation | DMSPlyase(dddLQPDKW) | sulfitedehydrogenase(quinone) | Methanogenesisviatrimethylamine | dimethylamine/trimethylaminedehydrogenase
Guild 4 | NAD(P)Hquinoneoxidoreductase | Cytochromeb6/fcomplex | PhotosystemII | PhotosystemI | putativechitobioseSBP
Guild 5 | CPlyasecleavagePhnJ | CPlyaseoperon | CPlyasecomplex | fructoseSBP | d-xyloseSBP
Guild 6 | glutamateSBP | cellobioseSBP | arabinogalactanSBP | d-methionineSBP | ndiacetylchitobioseSBP
Guild 7 | maltose/maltodextrinSBP | transporter:vitaminB12 | arginineSBP | putativeaminoacidSBP | dmethionineSBP
Guild 8 | sulfurassimilation | Retinalbiosynthesis | manganese/zinc/ironSBP | nitrousoxidereduction | gluconeogenesis
Guild 9 | V-typeATPase | NADHquinoneoxidoreductase | glycolysis | riboflavinbiosynthesis | MethanogenesisviaCO2
Guild 10 | rhodopsin | betacarotene15,15-monooxygenase | Retinalbiosynthesis | Mixedacid:Ethanol,AcetatetoAcetylaldehyde | Cytochromecoxidase
Appendix Table A8: Guilds defined as top 5 functions of each aspect for a run of the AB model on the MAG-only
dataset with K=10 aspects.
Guild 1 | molybdate SBP | glycinebetaine/proline SBP | DMSP lyase (dddLQPDKW) | hydroxymethylpyrimidine SBP | Methanogenesis via trimethylamine
Guild 2 | Cytochrome-c oxidase cbb3 type | Type II secretion | Cytochrome bd complex | DNRA | Chemotaxis
Guild 3 | gluconeogenesis | Gamma.hexachlorocyclohexane SBP | Type I secretion | Sec.SRP | Mixedacid.Ethanol.AcetatetoAcetylaldehyde
Guild 4 | ribose SBP | manganese/zinc/iron SBP | sulfur assimilation | Mixedacid.Ethanol.Acetyl.CoAtoAcetylaldehyde.reversible | Entner.Doudoroff Pathway
Guild 5 | DMSP demethylation | putative spermidine/putrescine SBP | sulfite dehydrogenase (quinone) | dimethylamine/trimethylamine dehydrogenase | thiosulfate oxidation
Guild 6 | Mixedacid.Acetate | glycolysis | F-type ATPase | putative multiple sugar SBP | peptides/nickel SBP
Guild 7 | V-type ATPase | NADHquinone oxidoreductase | Mixedacid-PEP to Succinate via OAA/malate/fumarate | rhodopsin | Anaplerotic genes
Guild 8 | C-P lyase cleavage PhnJ | CP-lyase operon | CP-lyase complex | d-methionine SBP | trehalose/maltose SBP
Guild 9 | ubiquinol cytochrome-c reductase | Type I Secretion | rhodopsin | beta.carotene15.15.monooxygenase | ammonia transporter
Guild 10 | Mixedacid.Formate | Cobalt/nickel SBP | Methanogenesis via acetate | CoenzymeB/CoenzymeM regeneration | rhamnose SBP
Appendix Table A9: Guilds defined as top 5 functions of each aspect for a run of the AB model on the SAG-only
dataset with K=10 aspects.
Guild 1 | NAD(P)Hquinone oxidoreductase | Cytochrome b6/f complex | Photosystem I | Photosystem II | manganese/zinc SBP
Guild 2 | zinc SBP | biotin SBP | DMSP demethylation | taurine SBP | general l-amino acid SBP
Guild 3 | dissimilatory sulfite<>APS | DMSP demethylation | biotin SBP | NADP-reducing hydrogenase | branched chain amino acid SBP
Guild 4 | Ubiquinol-cytochrome c reductase | Type II Secretion | Naphthalene degradation to salicylate | Glyoxylate shunt | Type I Secretion
Guild 5 | CP-lyase operon | CP-lyase complex | C-P lyase cleavage PhnJ | transporter: phosphonate | glucose/mannose SBP
Guild 6 | Competence-related core components | iron complex SBP | dipeptide SBP | glutamine SBP | riboflavin biosynthesis
Guild 7 | gluconeogenesis | anoxygenic type-II reaction center | cystine SBP | Mixed acid: Formate to CO2 & H2 | Methanogenesis via trimethylamine
Guild 8 | putative simple sugar SBP | trehalose/maltose SBP | putative snGlycerol 3-phosphate SBP | glutathione SBP | ribose SBP
Guild 9 | Methanogenesis via CO2 | cobalamin biosynthesis | sulfur assimilation | V-type ATPase | gammahexachlorocyclohexane SBP
Guild 10 | raffinose/stachyose/melibiose SBP | peptides/nickel SBP | putrescine SBP | arginine SBP | arabinosaccharide SBP
Appendix Table A10: Number of mapback genomes for guilds defined by the top 5 highest scoring functions for three
different values of K (K=5,10,20). X’s represent guilds beyond the size of K.
Guild | Number of Mapbacks (K=5) | Number of Mapbacks (K=10) | Number of Mapbacks (K=20)
Guild 1 | 109 | 11 | 51
Guild 2 | 65 | 468 | 12
Guild 3 | 1295 | 68 | 21
Guild 4 | 282 | 45 | 171
Guild 5 | 61 | 97 | 210
Guild 6 | X | 49 | 1
Guild 7 | X | 62 | 51
Guild 8 | X | 37 | 69
Guild 9 | X | 90 | 8
Guild 10 | X | 235 | 24
Guild 11 | X | X | 14
Guild 12 | X | X | 49
Guild 13 | X | X | 97
Guild 14 | X | X | 50
Guild 15 | X | X | 0
Guild 16 | X | X | 150
Guild 17 | X | X | 46
Guild 18 | X | X | 11
Guild 19 | X | X | 9
Guild 20 | X | X | 46
Appendix Table A11: Guilds for the two models generated to assess the numerical stability of the AB procedure. Each
column reflects the functions from one model with the rows distinguishing which guild they belonged to. For visual
ease, a blank row is inserted between guilds.
Guilds | Model 1 | Model 2
Guild 1 | glutamate SBP | glutamate SBP
Guild 1 | arabinogalactan SBP | arabinogalactan SBP

Guild 2 | arginine/ornithine SBP | arginine/ornithine SBP
Guild 2 | maltose/maltodextrin SBP | maltose/maltodextrin SBP

Guild 3 | beta-carotene15,15-monooxygenase | beta-carotene15,15-monooxygenase
Guild 3 | Retinal biosynthesis | Retinal biosynthesis
Guild 3 | Type I Secretion | Type I Secretion
Guild 3 | gamma/hexachlorocyclohexane SBP | gamma/hexachlorocyclohexane SBP
Guild 3 | gluconeogenesis | gluconeogenesis

Guild 4 | sulfite dehydrogenase (quinone) | sulfite dehydrogenase (quinone)
Guild 4 | DMSP lyase (dddLQPDKW) | DMSP lyase (dddLQPDKW)
Guild 4 | DMSP demethylation | DMSP demethylation

Guild 5 | glycolysis | glycolysis
Guild 5 | NADH-quinone oxidoreductase | NADH-quinone oxidoreductase
Guild 5 | riboflavin biosynthesis | riboflavin biosynthesis
Guild 5 | Anaplerotic genes | V-type ATPase
Guild 5 | F-type ATPase | Anaplerotic genes

Guild 6 | Ubiquinol-cytochrome c reductase | Ubiquinol-cytochrome c reductase
Guild 6 | Type II Secretion | Type II Secretion
Guild 6 | Cytochrome c oxidase cbb3-type | Cytochrome c oxidase cbb3-type
Guild 6 | phospholipid SBP | phospholipid SBP
Guild 6 | Flagellum | Flagellum
Appendix B
B1 CarveMe Validation
We validated the CarveMe models and predictions of growth sensitivity to specific
compounds by comparing our model predictions to an extensive experimental dataset (Gralka et
al., 2023) where 176 marine bacterial strains were tested for their ability to grow on 118 substrates
as a sole carbon source.¹ In 51.0% of cases, the CarveMe ensembles directly agreed with the experimental
growth results – either the models contained the exchange reaction and experimental data confirmed
growth, or the models did not include the exchange reaction and the experiments showed no growth.
Here, we required the exchange reaction to be present in >80% of the ensemble members to be
considered present in the model predictions for a given strain. This agreement between the models
and data was similar to the results reported by Gralka et al. of 58.0% agreement between CarveMe
models and experimental data. In just 4.6% of cases, a specific exchange reaction was absent from
our CarveMe ensembles despite experimental evidence that the strain could grow on that substrate
as a sole carbon source. Gralka et al. reported such false negatives much more often, in 16.1% of
cases. Gralka et al.’s methodology differed from ours in several key ways: they did not use model
ensembles, nor did they exclude strains that generated poor quality models. Our low false negative rate
offers strong evidence that the predictions of exchange reactions from CarveMe are robust when
generated following our methodology.
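The comparison described above can be sketched as a simple tally. The following is a minimal illustration, not the code used in this work: the function names and the toy data are hypothetical, and the only details taken from the text are the >80% ensemble threshold and the three outcome categories (agreement, false positive, false negative).

```python
def classify_pair(presence_fraction, grew, threshold=0.8):
    """Label one strain/substrate pair.

    presence_fraction: share of ensemble members containing the exchange reaction.
    grew: True if the experiment showed growth on the substrate.
    """
    # The reaction counts as predicted only if present in >80% of ensemble members.
    predicted = presence_fraction > threshold
    if predicted == grew:
        return "agreement"
    return "false_positive" if predicted else "false_negative"


def summarize(pairs, threshold=0.8):
    """Return the percentage of pairs falling in each category."""
    counts = {"agreement": 0, "false_positive": 0, "false_negative": 0}
    for presence_fraction, grew in pairs:
        counts[classify_pair(presence_fraction, grew, threshold)] += 1
    total = len(pairs)
    return {name: 100.0 * n / total for name, n in counts.items()}


# Toy, made-up data: (presence fraction across the ensemble, observed growth).
toy_pairs = [(0.95, True), (0.10, False), (0.90, False), (0.50, True)]
print(summarize(toy_pairs))
```

Under this convention, a reaction present in exactly 80% of ensemble members is treated as absent, because the criterion is a strict inequality.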
In 44.4% of cases, our CarveMe ensembles predicted that the exchange reaction was
present in the models despite experimental evidence demonstrating no growth on that substrate as
the sole carbon source. The models generated by Gralka et al. produced such false positives in 26.0%
of cases. This high false positive rate is not necessarily inconsistent with high quality CarveMe
models. The experimental growth data from Gralka et al. used each tested substrate as the sole
carbon source. However, for this comparison, we only examined whether the exchange reaction
was present in the model, not whether the model was able to grow on the substrate as a sole carbon
source. For some compounds that appear as false positives, it is likely that the organisms
could use the substrate, but not as a sole carbon source. Amino acids are a good example of
compounds that could be used in this way – an organism might be able to take up an amino
acid from the environment, which would be energetically favorable compared to synthesizing it de
novo, yet be unable to grow on that amino acid as the sole carbon source. Indeed, the
false positive rate for amino acids in the Gralka dataset (55.2%) is higher than for the rest of the
compounds. When we looked at the carbohydrate compounds, which are more often used as primary
carbon sources, the false positive rate decreased to 36.2%. Similarly, when we looked at only the
carboxylic acids, the false positive rate decreased to 34.8%. This analysis suggests that the high
false positive rates are in large part because the experiments and models are assessing different
aspects of microbial metabolism – i.e. we are not comparing like to like. We also conducted a
series of growth sensitivity tests on the models generated from the Gralka dataset which are
described below (Appendix B3.2). These tests again are not a perfect comparison to the
experimental design and so serve less as a validation and more as a point of comparison. Overall,
these results provide support that the models generated by the CarveMe automated pipeline can be
used to provide robust hypotheses about the metabolic capabilities of organisms.
B2 SOMs clustering
The model growth sensitivity analysis generated 1,050,060 data points (1,591 genomes x
60 models x 11 compound class tests). To analyze this large dataset and identify overarching
patterns, we employed Self Organizing Maps (SOMs), a type of dimensionality reduction method. In
addition to SOMs, we analyzed the data using other traditional dimensionality reduction approaches
such as PCA, and clustering methods such as direct hierarchical clustering. The PCA was able to
distinguish two broad clusters within the dataset with PC1 explaining 28.0% of the total variance
and PC2 explaining 22.6% of the total variance (Appendix Figure B12). However, PCA did not
allow us to further differentiate within these groups while the SOM clustering provided clearly
differentiated clusters. Direct hierarchical clustering was computationally infeasible on the full
dataset, but was able to identify some patterns when the data values were averaged by ensemble.
We used a hexagonal, toroidal grid configuration to build the SOM in order to avoid the
development of edge effects where the majority of data ends up in the corners of the map. Training
a toroidal map connects these edges together fluidly to negate these edge effects (e.g., the top and
bottom of the map are treated as adjacent). Adjusting for potential edge effects was necessary
because, although the growth sensitivity results are continuous on a scale from 0-1, we observed
bimodal distributions for most of the sensitivity tests where either a model was sensitive to the
removal of a substrate (value between 0.8 and 1) or was insensitive to the removal (value between
0 and 0.2) (Appendix Figure B2). Due to this bimodality, we also adjusted the learning rate vector
from the default (0.05, 0.01) down to (0.025, 0.01). This reduced the learning rate for the map in
early iterations to produce a smoother set of map values and counteract the tendency of bimodal data
to push map values towards the extremes. We tested the number of iterations needed to achieve small quantization
error values (discussed further below) (Appendix Figure B11b) and increased the run length to
1500 iterations from the default setting of 100 iterations, in part due to the reduction in the initial
learning rate which slows down convergence.
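To illustrate how a toroidal map removes edge effects, the following is a minimal sketch of online SOM training with wrap-around grid distances. It is not the implementation used in this work. For brevity it assumes a rectangular rather than hexagonal grid, a Gaussian neighbourhood, and linear decay of both the radius and the learning rate; the function names and toy data are hypothetical. Only the learning rate vector (0.025, 0.01), the 20-by-20 default size, and the 1500-iteration run length are taken from the text.

```python
import numpy as np


def toroidal_grid_distances(rows, cols):
    """Pairwise squared grid distances between nodes on a wrap-around map."""
    r, c = np.divmod(np.arange(rows * cols), cols)
    dr = np.abs(r[:, None] - r[None, :])
    dc = np.abs(c[:, None] - c[None, :])
    dr = np.minimum(dr, rows - dr)  # top and bottom edges are adjacent
    dc = np.minimum(dc, cols - dc)  # left and right edges are adjacent
    return dr**2 + dc**2


def train_som(data, rows=20, cols=20, rlen=1500, alpha=(0.025, 0.01), seed=0):
    """Train a SOM online; returns a codebook of shape (rows*cols, n_features)."""
    rng = np.random.default_rng(seed)
    grid_d2 = toroidal_grid_distances(rows, cols)
    # Initialise each node from a randomly drawn data vector.
    codebook = data[rng.integers(0, len(data), rows * cols)].astype(float)
    radii = np.linspace(max(rows, cols) / 2.0, 0.5, rlen)  # shrinking neighbourhood (assumed)
    rates = np.linspace(alpha[0], alpha[1], rlen)          # linearly decaying learning rate
    for radius, rate in zip(radii, rates):
        # One iteration presents the whole data set once, in random order.
        for x in data[rng.permutation(len(data))]:
            bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))
            h = np.exp(-grid_d2[bmu] / (2.0 * radius**2))
            codebook += rate * h[:, None] * (x - codebook)
    return codebook


# Toy bimodal data: values near 0 or near 1, loosely mimicking the sensitivity scores.
rng = np.random.default_rng(42)
toy = np.clip(rng.choice([0.05, 0.95], size=(200, 11)) + rng.normal(0, 0.03, (200, 11)), 0, 1)
codebook = train_som(toy, rows=5, cols=5, rlen=20)
print(codebook.shape)
```

Because each update moves a node only a small fraction of the way towards a data vector, a lower initial learning rate yields smoother map values but needs more iterations to converge, which is the trade-off described above.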
We also conducted a sensitivity analysis on the size of the map from 5-by-5 to 100-by-100.
Determining map size for SOM clustering is an open problem, and there are no definitive
theoretical bases for defining a “correct” size based on the size and characteristics of the input data
(Céréghino and Park, 2009; Kalteh et al., 2008; Park et al., 2003). Heuristic rules of thumb have
been suggested (Vesanto, 2000) as well as field-specific guidelines (Park et al., 2003). However,
the guidelines are not generalizable and remain dependent on data characteristics. Metrics of error
have been proposed (Kiviluoto, 1996; Kohonen, 1990) to identify if a given set of map parameters
generates a SOM with an appropriate level of resolution (quantization error) or topology
preservation (topographic error). We computed quantization error, which measures the resolution
of a SOM by the difference between each data vector and its best matching unit (BMU).
The final map size is a balance between a map that is large enough to fully differentiate all of the
patterns in the dataset and one that does not contain too many unassigned nodes. We chose a 20-
by-20 grid size such that the map had sufficient space to distribute the variability in the data
without overfitting the map. Quantization error decreases with increasing map size so map size
cannot be optimized to a global error minimum but only a local minimum.³ Borders of unassigned
nodes between clustered mapping units are one indication that a map is sufficiently large.
The final map had 126 unassigned nodes (31.5%).
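The two map-size diagnostics discussed above can be sketched as follows. This is an illustration under stated assumptions rather than the code used here: quantization error is taken as the mean Euclidean distance from each data vector to its best matching unit, an unassigned node is one that is the BMU of no data vector, and the function names are hypothetical.

```python
import numpy as np


def bmu_distances(data, codebook):
    """Distance matrix with entry [i, j] = distance from data vector i to map node j."""
    return np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)


def quantization_error(data, codebook):
    """Mean distance from each data vector to its best matching unit."""
    return bmu_distances(data, codebook).min(axis=1).mean()


def unassigned_fraction(data, codebook):
    """Fraction of map nodes that are the BMU of no data vector."""
    bmus = bmu_distances(data, codebook).argmin(axis=1)
    return 1.0 - len(np.unique(bmus)) / len(codebook)


# Tiny hand-checkable example: two data vectors, three map nodes.
data = np.array([[0.0, 0.0], [1.0, 1.0]])
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(quantization_error(data, codebook), unassigned_fraction(data, codebook))
```

For a data set as large as the one analyzed here, the full distance matrix would need to be computed in batches; the dense version above is only meant to make the definitions concrete.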
To identify clusters amongst the SOM nodes, we used k-means clustering. We also tried
hierarchical clustering but elected to use k-means because it consistently performed better based
on intra-cluster distances. The number of clusters (eight) was chosen to minimize the intra-cluster
difference while not overfitting the map. The final map is shown in Appendix Figure B11a colored
by cluster. When visualized in two-dimensional space, this toroidal map appears roughly square,
but the top and bottom, and left and right, sides of the map are continuously connected.
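As a sketch of the node-clustering step, the snippet below runs a plain Lloyd's-algorithm k-means on a set of vectors and returns the within-cluster sum of squares, one common measure of intra-cluster distance. It is a generic illustration with hypothetical names and toy points, not the routine used in this work, and the criterion used here to select eight clusters is not reproduced.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centers, within-cluster sum of squares)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the previous centre if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Recompute assignments against the final centres.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wss = d2[np.arange(len(points)), labels].sum()
    return labels, centers, wss


# Toy points forming two well-separated groups.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers, wss = kmeans(points, k=2)
print(labels, wss)
```

In practice the within-cluster sum of squares always decreases as k grows, so it is usually inspected across a range of k values and balanced against the risk of over-splitting the map.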
B3 SOM cluster analyses
We performed a variety of analyses with the SOM clusters to identify patterns within the
growth sensitivity profile data and between these data and other independent measures. For
each SOM cluster, we determined the set of compound classes that resulted in reduced growth
when limited (hereafter called the growth sensitivity profile). We then calculated the genome
estimated maximum growth rates for each cluster using codon usage bias (dCUB) and assessed
the taxonomic composition of each cluster. For these analyses, we used the dCUB threshold of
-0.08 (J. Weissman et al., 2021; J. L. Weissman et al., 2021) to differentiate
between fast growing (more negative dCUB values) and slow growing (more positive dCUB values)
organisms.
B3.1 dCUB growth estimates
We tested whether there were statistical differences in genome estimated maximum growth
rates for the eight SOM clusters. Based on the results of our Tukey tests, we identified four
statistically significant groups (Appendix Figure B6). It is important to highlight that dCUB values
were not included in building the SOM, so any differences in dCUB values between
clusters are emergent rather than prescribed. Genomes in Cluster 2 were predicted to be
significantly faster growers (Appendix Table B5) than all clusters except Cluster 5 based on codon
usage bias (average dCUB of -0.228), while Clusters 1, 3, 7, and 8 all had significantly slower
predicted growth rates (average dCUB ranging from -0.097 to -0.134). SOM Cluster 5 was
statistically distinct from the slow growing clusters (average dCUB of -0.199) but overlapped with
the fast growing cluster. Intermediate growth Clusters 4 and 6 were significantly slower than fast
growing Cluster 2 (average dCUB of -0.161 and -0.153, respectively) but overlapped with the slow
growing clusters. This suggests that we identified two distinct types of intermediate growers in our
dataset – a fast intermediate growth cluster and slow intermediate growth clusters.
To further distinguish the intermediate growth clusters (Clusters 4-6) from one another, we
calculated the fraction of genomes within each cluster with dCUB values below the -0.08 dCUB
threshold (fraction of fast growth genomes). We found that 66.2% and 62.1% of genomes in
Clusters 4 and 6 were ‘fast growers’ while 73.2% of genomes in Cluster 5 were ‘fast growers’. For
comparison, 78.0% of genomes in Cluster 2 were ‘fast growers’ while on average 45.4% of
genomes in the slow growing clusters (range 40.0-50.0%) were ‘fast growers’ (Appendix Figure
B6). We posit that genomes in the ‘intermediate growth’ clusters could belong to some
intermediary lifestyle phenotype(s) between copiotrophs and oligotrophs. Furthermore, our results
point to the existence of more than one of these intermediary phenotypes. Overall, our results
suggest that there are several growth strategies for the ‘intermediate’ lifestyle and multiple growth
strategies for the slow growing oligotrophs. In contrast, we identified only one growth strategy for
the fast growing genomes. Our inability to differentiate multiple fast growing groups might be a
bias in the model formulation as we are only able to differentiate the genomes based on the
sensitivity to the 11 compound classes tested.
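The fraction of fast growth genomes per cluster can be computed as in the minimal sketch below. The function name and the toy records are hypothetical; the only value taken from the text is the dCUB threshold of -0.08, with more negative values indicating faster predicted growth.

```python
from collections import defaultdict

DCUB_THRESHOLD = -0.08  # genomes with dCUB below this value count as fast growers


def fast_grower_fraction(records, threshold=DCUB_THRESHOLD):
    """records: iterable of (cluster_id, dcub). Returns {cluster_id: percent fast growers}."""
    totals = defaultdict(int)
    fast = defaultdict(int)
    for cluster, dcub in records:
        totals[cluster] += 1
        if dcub < threshold:
            fast[cluster] += 1
    return {cluster: 100.0 * fast[cluster] / totals[cluster] for cluster in totals}


# Toy, made-up records: (SOM cluster, dCUB value).
toy_records = [(2, -0.25), (2, -0.20), (2, -0.05), (2, -0.30), (1, -0.10), (1, -0.02)]
print(fast_grower_fraction(toy_records))
```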
B3.2 Growth Sensitivity Profiles
We observed substantial differences in the metabolic strategies between all eight of our
SOM clusters (Appendix Figure B4b). Generally, we found that our SOM clusters fell into one of
three distinct metabolic strategies, and that these strategies aligned with the four statistically
different growth strategies identified above. The slow growing clusters demonstrated high growth
sensitivity to two or more compound classes. For example, slow growing Cluster 3 demonstrated
substantial sensitivity to both carboxylic acids and peptides. All four of the slow growth clusters
had unique pairings of growth sensitive compound classes. By contrast, our fast growing Cluster
2 demonstrated little to no sensitivity to any compound class. Our three intermediate growth
clusters each demonstrated a single compound class sensitivity, with our two slower intermediate
growth Clusters 4 and 6 demonstrating sensitivity to amino acids and carbohydrates, respectively,
while fast intermediate growth Cluster 5 showed sensitivity to carboxylic acids. Across our eight
SOM clusters, carboxylic acids (4 clusters), amino acids/derivatives (2 clusters), and peptides
(2 clusters) were the compound classes that most often produced significantly high sensitivities.
Overall, we observed that sugar and acid compounds were the primary drivers of our SOM
clustering, which aligns with the classification proposed by Gralka et al. through their
sugar acid preference (SAP) metric. While our sensitivity test is not equivalent to their SAP
metric, we computed growth sensitivities for the strains used in the Gralka et al. study to
examine the differences in sugar/acid sensitivity for organisms with a defined preference for sugars
or acids based on their SAP. The sugar compounds from the Gralka et al. study were primarily
classified as carbohydrates with some carboxylic acids and nucleobases/nucleotides/nucleosides,
while the acid compounds were primarily classified as amino acids, peptides, and carboxylic acids.
Using the sugar/acid preference metric developed by Gralka et al., we grouped the 146 strains into
sugar specialists (N=77) and acid specialists (N=69). For each group, we then assessed the growth
sensitivity to removing carbohydrates, amino acids, or carboxylic acids (Appendix Figure B3b).
Reducing the availability of carbohydrates resulted in substantial growth reduction (growth
sensitivity values of 0.8 or greater) in 14.3% of the sugar specialist strains, while there were no
acid specialists that suffered substantial growth reduction. Given that the SAP is a continuum
between sugar and acid preference such that some of the strains were only weak sugar preferrers,
we also used a very liberal growth sensitivity threshold of 0.2 (i.e., a 20% reduction in growth
due to limitation of the compound). With the lower threshold, 16.9% of the sugar specialists
showed a response while we still observed 0% of the acid specialists with a response.
Reducing the availability of amino acids resulted in substantial growth limitation (0.8
threshold) in 15.9% of acid specialist strains compared to 11.7% of sugar specialist strains. We
hypothesize that part of the challenge with the assessment of acid preference using amino acids is
that amino acids are so central to growth that models often have multiple strategies for synthesizing
amino acids when they are not available from the environment. When we assessed the growth
reduction to the removal of carboxylic acids, we saw that 36.2% of acid specialist models were
sensitive to the removal of carboxylic acids while 18.2% of sugar specialists were sensitive (0.8
threshold). With the liberal threshold of 0.2, we observed 42.0% sensitivity to the removal of
carboxylic acids by acid specialists relative to 19.5% for sugar specialists.
It is important to note that the Gralka et al. sugar/acid preference index does not indicate auxotrophy
for these compounds; the preference metric is a continuum from -1 to 1. In contrast, our
sensitivity test was not designed to determine the relative preference of sugar to acid but rather the
response to removing the substrate. Thus, we would not necessarily expect perfect agreement, as
these are two different tests, but the overall consistency in response is encouraging.
B3.3 Cluster phylogeny
To confirm that our SOM clusters were identifying de novo guilds of metabolic similarity
rather than simply recapturing known phylogenetic groups, we performed phylogenetic and
taxonomic analyses of the genomes assigned to each cluster. Qualitative analyses of taxonomic
relative abundances within our SOM clusters showed that clusters were composed of many
different taxonomic orders, and many orders were represented across multiple clusters, albeit at
varying relative abundances (Appendix Figures B5, B7). Thus, the metabolic niches captured by
our SOM clusters could not be identified directly from taxonomy alone. To further investigate
possible differences between the SOM cluster phylogenies, we computed the UniFrac distances
between clusters to quantify their phylogenetic relatedness. We found that all pairwise
comparisons of our clusters resulted in relatively high UniFrac distances (values ranging from
0.767-0.847 on a [0,1] scale). These values are reported in Appendix Table B6. Typically, UniFrac
distances in this range would suggest that the SOM clusters are taxonomically distinct from one
another. Since our qualitative analyses demonstrated that our SOM clusters were not simply
recapitulating taxonomy at the order level, we concluded that the UniFrac distances indicated
phylogenetic distinction based on each cluster having unique members within each order. These
high UniFrac distances were likely attributable in large part to the demonstrably different relative
abundances of the major taxonomic orders between SOM clusters (Appendix Figure B7). Another
possibility was that the UniFrac scores reflected unique taxonomy below the order level (e.g.,
genus or species).
We tested this hypothesis by restricting our UniFrac analysis to individual orders and then
re-computing the distances between SOM clusters (Appendix Table B6). We completed all
pairwise comparisons for the 8 SOM clusters across each of the top 15 orders. In general, the
UniFrac distances between SOM clusters were lower when looking within an order than when
looking at the entire phylogeny, which suggests that there are greater phylogenetic differences
between SOM cluster assignments at the order level than there are at the sub-order level. This
further supports the notion that the majority of these phylogenetic differences are likely driven by
differences in the relative abundance of these orders in each SOM cluster. The average UniFrac
distances between SOM clusters within the top 15 orders spanned a wide range; the
Flavobacteriales were the most phylogenetically different between clusters (with the highest
average UniFrac distance of 0.795) while Caulobacterales, AEGEAN-169, and PCC-6307 were
the most similar between clusters (with average UniFrac distances of 0.293, 0.240, and 0.180,
respectively). This suggests that SOM cluster membership formed distinct clades at the sub-order
level for the Flavobacteriales, Pseudomonadales, Rhodobacterales, and Cytophagales, all of
which had UniFrac distances larger than 0.5 (Appendix Table B6). For the remaining orders, the
low average UniFrac distances could indicate that either the organisms in the clusters had a high
level of metabolic flexibility or that metabolic differences could be accounted for by strain level
variation in metabolic capability.
We also found that Cluster 2 had the highest average UniFrac distance to the other SOM
clusters in 7 of the 15 orders, suggesting that it was the most phylogenetically distinct cluster.
Cluster 2 was also our fast growing cluster and the only cluster without any compound sensitivities,
which suggests that organisms occupying this metabolic niche were more distinct from organisms
in other niches.
Overall these results demonstrate that, while in some cases there is a relationship between
cluster membership and phylogeny, the clusters cannot be explained through phylogeny alone.
They also emphasize the need to define metabolic niches directly from metabolism,
as in the approach presented in this study, rather than inferring them from
phylogeny.
B3.4 Geographic distribution
The SOM clusters showed unique biogeographic patterns based on metagenomic
recruitment (Appendix Figure B9). We found that slow growing Cluster 1 was the most
numerically dominant of our eight SOM clusters with a mean relative abundance of 22.6%, while
fast growing Cluster 2 was the least abundant with a mean relative abundance of 4.95%. Our fast
intermediate growth Cluster 5 was second lowest in average relative abundance at 9.52%. Cluster
1 was found to be highly abundant relative to the other seven clusters in every oceanographic
category except for the Southern Ocean where it was the fourth most abundant.
When we looked within specific oceanographic categories, further patterns emerged. While
overall Clusters 2 and 5 had the lowest abundances, these two clusters were significantly more
abundant at estuarine stations, reaching mean relative abundances of 12.7% for Cluster 2 (fifth
highest abundance in estuarine regions amongst all clusters) and 18.2% for Cluster 5 (second
highest abundance in estuarine regions among all clusters). The estuarine stations were highly
diverse, with six of the eight SOM clusters present at relatively high, statistically
indistinguishable abundances (Clusters 3 and 6 were significantly rarer than the other clusters in the estuarine
samples). The relative evenness of the majority of the SOM clusters at estuarine stations suggests
that the microbial community is being supplied with a diverse set of compounds at concentrations
sufficiently high to support the metabolic requirements of a diverse group of organisms. This type
of flexible environment with diverse compound availability favors more balanced metabolic
strategies and higher maximum growth rates of the faster growing clusters, which is consistent with
the higher abundances of these clusters at estuarine stations.
The abundance of the clusters in coastal stations was more evenly distributed than in
estuaries. However, unlike in the estuaries, the coastal stations were primarily dominated by the
slow growing Cluster 1 (23.3% mean relative abundance) and slow intermediate growth Cluster 4
(20.6% mean relative abundance). As at the estuarine stations, the overall taxonomic
evenness suggests that the microbial community at coastal sites is also being supplied a diverse
set of compounds at sufficiently high concentrations to support the growth of diverse metabolic
strategies. In particular, since the two most dominant clusters were sensitive to amino acids, their
abundance in coastal locations suggests that those stations had high enough concentrations of
amino acids to sustain large total biomasses of these organisms. One possible cause for the greater
evenness of our eight SOM clusters in coastal sites compared to estuaries is the effect of varying
salinity on community membership at estuarine stations. Variations in the salinity levels could be
an external environmental factor limiting the growth of organisms that could otherwise grow
effectively, thereby suppressing the abundance of the SOM clusters they were assigned to.
The remaining three categories - oligotrophic seas, oligotrophic open oceans, and the
Southern Ocean - were dominated by just two of the eight SOM clusters. The oligotrophic seas and
oligotrophic open oceans showed similar distributions of relative abundance and were both
dominated by slow growing Clusters 1 and 8. The high abundance of slow growing organisms in
these categories is consistent with these oceanographic regions being resource limited. In resource
limited environments, organisms often cope with consistently low nutrient concentrations by
specializing in growth on specific compounds resulting in an environment with rigid, defined
niches (Gifford et al., 2013). Organisms typically specialize in growth on certain compounds by
using transporters with greater affinity for these compounds, and/or streamlining their genomes to
reduce internal nutrient requirements. This sort of rigid niche structure, and low nutrient
availability, is unfavorable to fast growing, metabolically flexible organisms as they can be
outcompeted for compound acquisition by specialists. Resource limited conditions thus favor
compound sensitive, specialist organisms occupying defined niches for growth on their specific
compounds (Sarmento et al., 2016). The significantly lower abundances of slow growing Clusters
3 and 7 then suggest that the environmental niches occupied by organisms in these clusters are
absent from these regions. Specifically, Clusters 3 and 7 are the only two clusters sensitive to peptides, which suggests
that stations in these oligotrophic categories might have particularly low concentrations of peptide
compounds available. This is consistent with the fact that peptides are one of the most labile forms
of DOM available to heterotrophic organisms and would be drawn down rapidly under resource
limited conditions.
In contrast to the oligotrophic categories, the Southern Ocean category was actually
dominated by slow growing Cluster 3 and slow intermediate growth Cluster 6. There were only
three samples from the Southern Ocean in our dataset so it is difficult to conclude if these relative
abundances are representative of the entire region. However, the Southern Ocean is distinct from
the oligotrophic open oceans in that it has higher nutrients and sustains higher productivity when
iron limitation is alleviated (Venables and Moore, 2010). Thus, it is possible that increased
nutrients and the subsequent increase in productivity promote the abundance of intermediate
growth strategy organisms like those found in Cluster 6.
Appendix Figure B14: Bubble plot of the mean growth sensitivity values (similar to Figure 3.2) for a new set of 8
SOM clusters generated on the 983 genomes with a consensus of 90% or greater. A growth sensitivity of 1 indicates
high sensitivity to that substrate such that the modeled growth rate was reduced proportionally to the reduction in the
substrate’s flux (e.g., 50% substrate reduction corresponded to 50% growth rate reduction). The size of the bubbles in
this plot reflects the relative sensitivity of each of the 8 SOM clusters to a given compound class, where larger bubbles
indicate that cluster was more sensitive to that compound class than others. The 6 compound classes which resulted
in significant growth reduction for at least one of the SOM clusters are shown here. While the ordering of the clusters
changed, we still observed the same overall patterns. We have one cluster with no growth sensitivities and multiple
clusters with sensitivity to one compound and multiple with sensitivity to two compounds. The fast growth cluster and
intermediate growth single sensitivity clusters from the primary analysis emerged in this higher consensus group of
models. The slight shift for the multiple sensitivity clusters is consistent with the observation that the more classically
oligotrophic orders generally had model ensembles with lower consensus values such that excluding these genomes
from the SOM generation would be expected to have the largest impact on the slow growth clusters.
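The growth sensitivity scale described in this caption can be illustrated with a minimal sketch. It assumes, as a reading of the caption rather than the thesis code, that sensitivity is the fractional loss in modeled growth rate divided by the fractional reduction in substrate flux; the function name and arguments are hypothetical.

```python
def growth_sensitivity(baseline_growth, reduced_growth, flux_reduction):
    """Fractional growth loss per unit fractional reduction in substrate flux.

    Assumed definition (not taken from the thesis code): a value of 1 means
    growth fell in proportion to the flux reduction, 0 means no response.
    """
    if baseline_growth <= 0:
        raise ValueError("baseline growth rate must be positive")
    if not 0 < flux_reduction <= 1:
        raise ValueError("flux_reduction must be in (0, 1]")
    growth_loss = (baseline_growth - reduced_growth) / baseline_growth
    return growth_loss / flux_reduction


# A 50% flux reduction that halves growth gives a sensitivity of 1.
proportional = growth_sensitivity(1.0, 0.5, 0.5)
# A 50% flux reduction with unchanged growth gives a sensitivity of 0.
insensitive = growth_sensitivity(1.0, 1.0, 0.5)
```

Under this assumed definition, the example in the caption (50% substrate reduction, 50% growth rate reduction) maps to a sensitivity of exactly 1.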
Appendix Figure B15: Distribution of growth sensitivity values by cluster. Density plots of the growth sensitivity
values for each model for each of the 11 compound classes grouped by SOM cluster (N=1,050,060).
Appendix Figure B16: Comparison of results between CarveMe model ensembles and experimental growth studies
performed in Gralka et al. Heatmap of agreement between CarveMe model ensemble reactions and experimental
growth data for a collection of 146 strains grown on 58 different sole carbon sources. White squares indicate direct
agreement between the models and data (i.e., model includes the exchange reaction and growth was observed or model
does not include the exchange reaction and no growth was observed), blue squares indicate a false positive (model
includes the exchange reaction, experimental data does not) and red squares indicate a false negative (experimental
data predicts growth, model does not include the exchange reaction). Gray squares indicate that the presence of the
compound in the model exchange reactions was variable (between the consensus thresholds for “present” and
“absent”).
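The four heatmap categories in this caption amount to a small decision rule, sketched below. The function name, the return labels, and the use of `None` to encode a variable exchange reaction are illustrative assumptions, not part of the original analysis.

```python
def classify_agreement(model_has_exchange, growth_observed):
    """Compare one model-ensemble call with one experimental growth result.

    model_has_exchange: True, False, or None when the exchange reaction is
    variable across the ensemble (between the consensus thresholds).
    growth_observed: True if the strain grew on the carbon source.
    """
    if model_has_exchange is None:
        return "variable"        # gray square
    if model_has_exchange == growth_observed:
        return "agreement"       # white square
    if model_has_exchange:
        return "false positive"  # blue square: model has reaction, no growth
    return "false negative"      # red square: growth seen, reaction missing


labels = [
    classify_agreement(True, True),
    classify_agreement(True, False),
    classify_agreement(False, True),
    classify_agreement(None, True),
]
```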
Appendix Figure B17: Relative growth sensitivities between SOM clusters. (a) Ordered bar plot of the proportion of
models across all clusters with substantial growth sensitivity to the reduction of each compound class (substantial is
defined as >80% reduction in growth). (b) Bar plots of the relative mean growth sensitivity values for each of the 11
compound classes across the 8 SOM clusters. The error bars represent one standard deviation. Plot facets are ordered
from the highest overall sensitivity (carboxylic acids) to the lowest (amines/amides).
Appendix Figure B18: Taxonomy by cluster. (a) Ordered bar plot of the proportion of models in each of the 15 most
abundant orders with substantial growth sensitivity to the reduction of any compound class (substantial is defined
as >80% reduction in growth). (b) Stacked bar plots of the relative abundances of the top 15 orders in each cluster.
Appendix Figure B19: Codon usage bias (dCUB) by cluster. The dCUB values fall into four statistically distinct
groups designated with letters according to the key. Group a is the slow-growing group (Clusters 1, 3, 7 and 8) and
statistically distinct from the other clusters. Group c (Cluster 2) is the fast-growing group and is distinct from all other
clusters. Groups ab and bc represent our intermediate growers. Specifically, group ab (Clusters 4 and 6) are statistically
distinct from fast-growing group c but not from slow-growing group a, whereas group bc (Cluster 5) is statistically
distinct from group a but not from group c.
Appendix Figure B20: Taxonomic abundance enrichments of top 15 Orders by cluster. Percentage enrichment in the
relative abundance of the top 15 Orders (and Other) in each of the 8 SOM clusters compared to the relative abundances
of each of these Orders in the full dataset.
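One plausible way to compute the percentage enrichment described in this caption is sketched below. It assumes enrichment is the relative change between an Order's share of a cluster and its share of the full dataset; the thesis does not state the exact formula here, and the counts in the example are invented for illustration only.

```python
def percent_enrichment(cluster_counts, dataset_counts):
    """Percent change in each Order's relative abundance, cluster vs. full dataset.

    Both arguments map Order name -> number of genomes. Orders absent from the
    full dataset are skipped to avoid division by zero.
    """
    cluster_total = sum(cluster_counts.values())
    dataset_total = sum(dataset_counts.values())
    enrichment = {}
    for order, n_all in dataset_counts.items():
        if n_all == 0:
            continue
        rel_all = n_all / dataset_total
        rel_cluster = cluster_counts.get(order, 0) / cluster_total
        enrichment[order] = 100.0 * (rel_cluster - rel_all) / rel_all
    return enrichment


# Hypothetical counts: Order A is over-represented in the cluster, Order B under-represented.
example = percent_enrichment({"A": 30, "B": 10}, {"A": 50, "B": 50})
```

Positive values indicate that an Order makes up a larger share of the cluster than of the full dataset, and negative values the opposite.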
Appendix Figure B21: Relative abundances of SOM clusters by oceanographic region. Sampling sites were grouped
for bootstrapping according to the 23 oceanographic regions given in Table S3. For each region, bar plots of the
average relative abundances of each of the 8 SOM clusters are shown. The relative abundances are calculated based
on the bootstrap distributions of the raw RPKM values. The clusters are colored by their growth strategy (fast, fast-intermediate, slow-intermediate, and slow). Error bars represent the standard deviations of the bootstrapped
distributions.
Appendix Figure B22: Relative abundances of SOM clusters by oceanographic category. Sampling sites were grouped
for bootstrapping according to the 5 oceanographic categories. Bar plots show the average relative abundances of each
of the 8 SOM clusters in each category where the abundance is based on the bootstrap distributions of the raw RPKM
values. The clusters are colored by their growth strategy (fast, fast-intermediate, slow-intermediate, and slow). Error
bars represent the standard deviations of the bootstrapped distributions.
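The bootstrapped relative abundances in the two preceding captions can be approximated by the following sketch. It assumes that sampling sites within a region or category are resampled with replacement and that relative abundance is each cluster's share of the summed RPKM in every resample; the actual resampling scheme is not specified in this section, and all names are hypothetical.

```python
import random
from statistics import mean, stdev


def bootstrap_relative_abundance(rpkm_by_site, n_boot=1000, seed=0):
    """Return {cluster: (mean, standard deviation)} of bootstrapped relative abundance.

    rpkm_by_site: list of dicts, one per sampling site, mapping cluster -> raw RPKM.
    Assumes total RPKM is positive in every resample.
    """
    rng = random.Random(seed)
    clusters = sorted({c for site in rpkm_by_site for c in site})
    draws = {c: [] for c in clusters}
    for _ in range(n_boot):
        # Resample sites with replacement, keeping the original number of sites.
        sample = [rng.choice(rpkm_by_site) for _ in rpkm_by_site]
        totals = {c: sum(site.get(c, 0.0) for site in sample) for c in clusters}
        grand_total = sum(totals.values())
        for c in clusters:
            draws[c].append(totals[c] / grand_total)
    return {c: (mean(v), stdev(v)) for c, v in draws.items()}


# Hypothetical RPKM values for two clusters at two sites.
sites = [{"Cluster 1": 1.0, "Cluster 2": 1.0}, {"Cluster 1": 5.0, "Cluster 2": 1.0}]
summary = bootstrap_relative_abundance(sites, n_boot=200, seed=1)
```

The returned standard deviations correspond to the error bars described in the captions, under the assumptions stated above.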
Appendix Figure B23: CarveMe run parameterizations. (a) Rarefaction curve of the total number of unique reactions
found in any model within an ensemble of models generated for a given genome. This curve was generated for
ensemble sizes ranging from 2-100 models. At low ensemble sizes (e.g., in the range from 2-20) the model space
rapidly identifies new unique reactions as more models are generated. The curves stabilize around ensemble sizes of
40-80 such that increasing the number of models in the ensemble does not add new reactions. (b) Histogram of the
consensus scores for model ensembles when annotating reactions for CarveMe with eggNOG vs. the native Diamond
(ensemble size = 60). Overall, Diamond annotation produced significantly higher quality
models than eggNOG annotation for the same genomes.
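The rarefaction curve in panel (a) can be sketched as follows. This is an illustrative approach, assuming each model is represented as a set of reaction identifiers and that the curve averages the number of unique reactions over random subsets of each ensemble size; the subsampling scheme and all names are assumptions rather than the procedure used in the thesis.

```python
import random


def rarefaction_curve(models, sizes, n_perm=100, seed=0):
    """Mean number of unique reactions found in random ensembles of each size.

    models: list of sets of reaction IDs, one set per model in the ensemble.
    sizes: ensemble sizes to evaluate (each must not exceed len(models)).
    """
    rng = random.Random(seed)
    curve = {}
    for k in sizes:
        counts = []
        for _ in range(n_perm):
            subset = rng.sample(models, k)  # draw k models without replacement
            counts.append(len(set().union(*subset)))
        curve[k] = sum(counts) / n_perm
    return curve


# Hypothetical ensemble of three small models.
toy_models = [{"r1", "r2"}, {"r2", "r3"}, {"r1", "r3"}]
toy_curve = rarefaction_curve(toy_models, sizes=[1, 2, 3])
```

A curve that flattens as the ensemble size grows, as described for sizes of 40-80, indicates that additional models contribute few new reactions.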
Appendix Figure B24: SOM metrics. (a) The SOM grid is shown where each circle represents a grid point in the map
(N=400). Grid points are colored by their assignment to the 8 defined SOM clusters. Grid points to which no genomes
were assigned are colored gray to represent the absence of mapped data. It is important to note that the SOM uses a
toroidal grid where the edges wrap around such that, for example, all nodes in Cluster 8 are in fact connected. (b) The
training progress of the grid is shown for the duration of the map refinement process. (c) The number of models from
each genome ensemble that were assigned to each SOM cluster, where a value of 60 denotes instances when all models
from the ensemble were assigned to the same SOM cluster and a value of 0 denotes that no models from a specific
ensemble were assigned to the cluster. Data is only shown for the 1,591 high consensus ensembles. The bimodal
distribution of the data around 0 and 60 illustrates that all models from a given ensemble were almost always assigned
to the same SOM cluster.
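The per-ensemble counts summarized in panel (c) can be tallied as in the sketch below. It assumes each ensemble is given as a list of SOM cluster labels (1-8), one per model; the function name and input layout are hypothetical.

```python
from collections import Counter


def cluster_assignment_counts(assignments, n_clusters=8):
    """Count how many models of one ensemble fall in each SOM cluster.

    assignments: list of cluster labels (1..n_clusters), one per model.
    Returns a list of length n_clusters; for an ensemble of 60 models, a value
    of 60 means every model mapped to that cluster and 0 means none did.
    """
    counts = Counter(assignments)
    return [counts.get(cluster, 0) for cluster in range(1, n_clusters + 1)]


# Hypothetical ensembles: one fully consistent, one split between two clusters.
consistent = cluster_assignment_counts([3] * 60)
split = cluster_assignment_counts([1] * 45 + [8] * 15)
```

Pooling these count vectors across ensembles yields the distribution plotted in panel (c), where values concentrated at 0 and 60 indicate consistent assignment.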
Appendix Figure B25: PCA plot of the growth sensitivity data. The PCA captured 50.6% of the total variance on the
first two principal component axes, and distinguished two major groups of data points, a slow growing and fast
growing cluster. Of note, the estimates of maximum growth rate were not included in this clustering. The points in the
PCA are colored by SOM cluster assignment to illustrate that both approaches identified similar clustering of the data
but that the SOM method was able to better differentiate between the 8 clusters.
Appendix Table B12: All data for 1,591 high consensus genomes. This table provides the unique identifiers for the
genomes used in the SOM analysis as well as information on the SOM cluster they were assigned to, their specific
value of the dCUB growth proxy, taxonomic information (order, genus, and species), as well as the raw growth
sensitivity values computed for the 11 compound classes clustered in this study.
Appendix Table B13: Metabolite classification information. This table provides information on the 456 compounds
that were manually classified for this study including their names in plain English, the compound class they were
assigned to, and the name of the corresponding external reaction in the CarveMe universal model.
Metabolite Classification CarveMe Reaction
10 Phenyldecanoic acid Carboxylic Acid EX_phedca_e
2 4 6 Trinitrotoluene Other EX_tnt_e
2-Oxoglutarate Ketones/Aldehydes EX_akg_e
4-Aminobenzoate Amino Acids/Derivatives EX_4abz_e
4-Aminobutanoate Amino Acids/Derivatives EX_4abut_e
Acetaldehyde Ketones/Aldehydes EX_acald_e
Adenosine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_adn_e
Ala-Gln Peptides EX_ala_gln_e
Ala-His Peptides EX_ala_his_e
Ala-L-Thr-L Peptides EX_ala_L_thr__L_e
Ala-Leu Peptides EX_ala_leu_e
Arsenate Inorganic EX_aso4_e
D Valine Amino Acids/Derivatives EX_val__D_e
Dimethyl sulfoxide Organic Sulfur EX_dmso_e
Ethanol Alcohol EX_etoh_e
Gly asn L C6H11N3O4 Peptides EX_gly_asn__L_e
Gly pro L C7H12N2O3 Peptides EX_gly_pro__L_e
Gly-Cys Peptides EX_gly_cys_e
Gly-Gln Peptides EX_gly_gln_e
Gly-Leu Peptides EX_gly_leu_e
Gly-Met Peptides EX_gly_met_e
Gly-Phe Peptides EX_gly_phe_e
Gly-Tyr Peptides EX_gly_tyr_e
Glycerophosphoglycerol Other EX_g3pg_e
Glycine-glycine-glutamine tripeptide Peptides EX_glyglygln_e
Glycylphenylalanine Peptides EX_glyphe_e
H2O H2O Inorganic EX_h2o_e
Hydrogen cyanide Other EX_cyan_e
Hydrogen peroxide Inorganic EX_h2o2_e
L Glycinylmethionine Peptides EX_glymet_e
L alaninyltryptophan Peptides EX_alatrp_e
L glycinylglutamine Peptides EX_glygln_e
L histidinylhistidine Peptides EX_hishis_e
L methionine R oxide C5H11NO3S Amino Acids/Derivatives EX_metox__R_e
L-Alanine Amino Acids/Derivatives EX_ala__L_e
L-Arginine Amino Acids/Derivatives EX_arg__L_e
L-Asparagine Amino Acids/Derivatives EX_asn__L_e
L-Glutamate Amino Acids/Derivatives EX_glu__L_e
L-Glutamine Amino Acids/Derivatives EX_gln__L_e
L-Lysine Amino Acids/Derivatives EX_lys__L_e
L-Phenylalanine Amino Acids/Derivatives EX_phe__L_e
L-Proline Amino Acids/Derivatives EX_pro__L_e
L-Threonine Amino Acids/Derivatives EX_thr__L_e
L-Tryptophan Amino Acids/Derivatives EX_trp__L_e
L-Tyrosine Amino Acids/Derivatives EX_tyr__L_e
L-alanine-D-glutamate-meso-2,6-diaminoheptanedioate Other EX_LalaDgluMdap_e
Lysine-glutamine-glycine tripeptide Peptides EX_lysglugly_e
Met L ala L C8H16N2O3S Peptides EX_met_L_ala__L_e
Nitrate Inorganic EX_no3_e
Nitrite Inorganic EX_no2_e
O2 O2 Inorganic EX_o2_e
Phosphoenolpyruvate Carboxylic Acid EX_pep_e
Putrescine Amines/Amides EX_ptrc_e
Sulfur Inorganic EX_s_e
Tetrathionate Organic Sulfur EX_tet_e
Thiamin B Vitamins EX_thm_e
Trithionate Organic Sulfur EX_tton_e
Uracil Nucleobases/Nucleosides/Nucleotides/Derivatives EX_ura_e
Xanthine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_xan_e
3 Hydroxy 10 phenyldecanoic acid Carboxylic Acid EX_R_3hpdeca_e
2-Oxobutanoate Carboxylic Acid EX_2obut_e
Alpha-L-Arabinan (3 subunits) Nucleobases/Nucleosides/Nucleotides/Derivatives EX_araban__L_e
Chorismate Carboxylic Acid EX_chor_e
Cytidine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_cytd_e
D-Serine Amino Acids/Derivatives EX_ser__D_e
Deoxycytidine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_dcyt_e
Diacetylchitobiose Carbohydrates/Derivatives EX_dachi_e
Fe(III)hydroxamate Other EX_fe3hox_e
Folate B Vitamins EX_fol_e
Formamide Amines/Amides EX_frmd_e
Glycine betaine Amino Acids/Derivatives EX_glyb_e
Guanine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_gua_e
Iron(III) chelated carboxymycobactin T (R=8 carbon, final carbon is carboxyl group) Other EX_fcmcbtt_e
L Methionine S oxide C5H11NO3S Amino Acids/Derivatives EX_metox_e
L alaninylthreonine Peptides EX_alathr_e
L-Citrulline Amino Acids/Derivatives EX_citr__L_e
L-Leucine Amino Acids/Derivatives EX_leu__L_e
Maltotriose C18H32O16 Carbohydrates/Derivatives EX_malttr_e
N,N'-diacetylchitobiose Carbohydrates/Derivatives EX_chtbs_e
NMN C11H14N2O8P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_nmn_e
Pyridoxine B Vitamins EX_pydxn_e
S-Methyl-L-methionine Amino Acids/Derivatives EX_mmet_e
Thymidine C10H14N2O5 Nucleobases/Nucleosides/Nucleotides/Derivatives EX_thymd_e
UDP-N-acetyl-3-O-(1-carboxyvinyl)-D-glucosamine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_uaccg_e
Undecaprenyl phosphate Phospholipids/Fatty Acids/Triglycerides EX_udcpp_e
Urea CH4N2O Amines/Amides EX_urea_e
1,4-alpha-D-glucan Carbohydrates/Derivatives EX_14glucan_e
1-deoxy-D-xylulose Carbohydrates/Derivatives EX_dxyl_e
Ammonium Inorganic EX_nh4_e
Carbon monoxide Inorganic EX_co_e
D-Fructose Carbohydrates/Derivatives EX_fru_e
L histidinylglycine Amino Acids/Derivatives EX_hisgly_e
N-Acetyl-D-glucosamine 1-phosphate Carbohydrates/Derivatives EX_acgam1p_e
Shikimate Carboxylic Acid EX_skm_e
Sulfate Inorganic EX_so4_e
Thiosulfate Organic Sulfur EX_tsul_e
2',3'-Cyclic CMP Nucleobases/Nucleosides/Nucleotides/Derivatives EX_23ccmp_e
4-Aminobutanal Ketones/Aldehydes EX_4abutn_e
5-Methylcytosine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_5mcsn_e
Adenine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_ade_e
Alpha,alpha'-Trehalose 6-phosphate Carbohydrates/Derivatives EX_tre6p_e
Choline C5H14NO Amines/Amides EX_chol_e
Creatinine Amines/Amides EX_crtn_e
D Asparagine Amino Acids/Derivatives EX_asn__D_e
D-Glucose 1-phosphate Carbohydrates/Derivatives EX_g1p_e
D-Glucuronate Carbohydrates/Derivatives EX_glcur_e
D-Tagatose Carbohydrates/Derivatives EX_tag__D_e
Deoxyuridine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_duri_e
Glycine Amino Acids/Derivatives EX_gly_e
H+ Inorganic EX_h_e
Hydrogen sulfide Inorganic EX_h2s_e
L alaninylhistidine Peptides EX_alahis_e
L glycinylserine Peptides EX_glyser_e
L-Homoserine Amino Acids/Derivatives EX_hom__L_e
L-Methionine Sulfoxide Amino Acids/Derivatives EX_metsox_S__L_e
L-Prolinylglycine Peptides EX_progly_e
L-alanine-L-glutamate Peptides EX_LalaLglu_e
Leucylleucine Peptides EX_leuleu_e
Maltoheptaose Carbohydrates/Derivatives EX_malthp_e
Phenylethyl alcohol Alcohol EX_pea_e
Proline-histidine-glutamine tripeptide Peptides EX_prohisglu_e
Raffinose C18H32O16 Carbohydrates/Derivatives EX_raffin_e
Reduced glutathione Organic Sulfur EX_gthrd_e
Uridine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_uri_e
3 Hydroxypentanoic acid Phospholipids/Fatty Acids/Triglycerides EX_R_3hpt_e
R R 2 3 Butanediol C4H10O2 Alcohol EX_btd_RR_e
2',3'-Cyclic GMP Nucleobases/Nucleosides/Nucleotides/Derivatives EX_23cgmp_e
2,3-diaminopropionate Amino Acids/Derivatives EX_23dappa_e
2,5-diketo-D-gluconate Carboxylic Acid EX_25dkglcn_e
2-Dehydro-D-gluconate Carboxylic Acid EX_2dhglcn_e
3 CMP C9H12N3O8P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_3cmp_e
3-Phospho-D-glycerate Carboxylic Acid EX_3pg_e
4 Hydroxyphenylacetic acid C8H8O3 Carboxylic Acid EX_4hoxpac_e
Arbutin C12H16O7 Carbohydrates/Derivatives EX_arbt_e
Benzoate Carboxylic Acid EX_bz_e
Catechol Alcohol EX_catechol_e
Cellobiose Carbohydrates/Derivatives EX_cellb_e
Cis-Aconitate Carboxylic Acid EX_acon_C_e
Coenzyme A Other EX_coa_e
Creatine cytosol Amino Acids/Derivatives EX_creat_e
D-Galacturonate Carbohydrates/Derivatives EX_galur_e
D-Glucosamine Carbohydrates/Derivatives EX_gam_e
D-Glucose Carbohydrates/Derivatives EX_glc__D_e
D-Glyceraldehyde Carbohydrates/Derivatives EX_glyald_e
D-Methionine Amino Acids/Derivatives EX_met__D_e
D-O-Phosphoserine Amino Acids/Derivatives EX_pser__D_e
D-Proline Amino Acids/Derivatives EX_pro__D_e
D-Xylose Carbohydrates/Derivatives EX_xyl__D_e
D-phenylalanine Amino Acids/Derivatives EX_phe__D_e
Dextrin C12H20O10 Carbohydrates/Derivatives EX_dextrin_e
Dihydroxyacetone Carbohydrates/Derivatives EX_dha_e
Dihydroxyacetone phosphate Carbohydrates/Derivatives EX_dhap_e
Dopamine Amines/Amides EX_dopa_e
Fe2+ mitochondria Inorganic EX_fe2_e
Fumarate Carboxylic Acid EX_fum_e
Galactitol Carbohydrates/Derivatives EX_galt_e
Glycerol 3-phosphate Carbohydrates/Derivatives EX_glyc3p_e
Isocitrate Carboxylic Acid EX_icit_e
L-Arabinose Carbohydrates/Derivatives EX_arab__L_e
L-tartrate Carboxylic Acid EX_tartr__L_e
Linoleic acid (all cis C18:2) n-6 Phospholipids/Fatty Acids/Triglycerides EX_lnlc_e
Meso-2,6-Diaminoheptanedioate Other EX_26dap__M_e
N-Acetyl-D-glucosamine(anhydrous)N-Acetylmuramic acid Other EX_anhgm_e
N-Formimidoyl-L-glutamate Amino Acids/Derivatives EX_forglu_e
Nitric oxide Inorganic EX_no_e
Octacosanoyl-CoA Other EX_octscoa_e
Selenate Inorganic EX_sel_e
Succinate Carboxylic Acid EX_succ_e
Thymine C5H6N2O2 Nucleobases/Nucleosides/Nucleotides/Derivatives EX_thym_e
Trans 4 Hydroxy L proline C5H9NO3 Amino Acids/Derivatives EX_4hpro_LT_e
Trimethylamine N-oxide Amines/Amides EX_tmao_e
Undecaprenyl diphosphate Phospholipids/Fatty Acids/Triglycerides EX_udcpdp_e
3 hydroxynonanoic acid Carboxylic Acid EX_R_3hnonaa_e
2',3'-Cyclic AMP Nucleobases/Nucleosides/Nucleotides/Derivatives EX_23camp_e
3 AMP C10H12N5O7P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_3amp_e
3-Methylbutanoic acid Carboxylic Acid EX_3mb_e
4-Hydroxyphenylacetate Carboxylic Acid EX_4hphac_e
8 Phenyloctanoic acid Carboxylic Acid EX_pheocta_e
Beta D glucose 6 phosphate C6H11O9P Carbohydrates/Derivatives EX_g6p_B_e
Cu+ Inorganic EX_cu_e
D-Arginine Amino Acids/Derivatives EX_arg__D_e
D-Galactarate Carboxylic Acid EX_galct__D_e
D-Lysine Amino Acids/Derivatives EX_lys__D_e
D-Mannitol Carbohydrates/Derivatives EX_mnl_e
D-tartrate Carboxylic Acid EX_tartr__D_e
DCMP C9H12N3O7P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_dcmp_e
DTMP C10H13N2O8P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_dtmp_e
Deoxyadenosine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_dad_2_e
Fructoselysine Other EX_frulys_e
Glycerol Alcohol EX_glyc_e
Glycerophosphoserine Phospholipids/Fatty Acids/Triglycerides EX_g3ps_e
Glycolate C2H3O3 Carboxylic Acid EX_glyclt_e
Heptanoate Phospholipids/Fatty Acids/Triglycerides EX_hpta_e
Hydrogen Inorganic EX_h2_e
Hydroxylamine Amines/Amides EX_ham_e
Indole Other EX_indole_e
L-Histidine Amino Acids/Derivatives EX_his__L_e
L-Isoleucine Amino Acids/Derivatives EX_ile__L_e
L-Serine Amino Acids/Derivatives EX_ser__L_e
L-Valine Amino Acids/Derivatives EX_val__L_e
L-alanine-D-glutamate-meso-2,6-diaminoheptanedioate-D-alanine Other EX_LalaDgluMdapDala_e
L-methionine-R-sulfoxide Amino Acids/Derivatives EX_metsox_R__L_e
Methanol Alcohol EX_meoh_e
N-Acetylneuraminate Carbohydrates/Derivatives EX_acnam_e
Nonanoate C9H17O2 Phospholipids/Fatty Acids/Triglycerides EX_nona_e
Octadecanoate (n-C18:0) Phospholipids/Fatty Acids/Triglycerides EX_ocdca_e
Octadecenoate (n-C18:1) Phospholipids/Fatty Acids/Triglycerides EX_ocdcea_e
Ornithine Amino Acids/Derivatives EX_orn_e
Petroselaidic acid Phospholipids/Fatty Acids/Triglycerides EX_ptsla_e
Pyruvate Carboxylic Acid EX_pyr_e
Sulfite Inorganic EX_so3_e
Superoxide anion Inorganic EX_o2s_e
Triacylglycerol hexadecanoate Phospholipids/Fatty Acids/Triglycerides EX_tag160_e
Triacylglycerol nC181d9 Phospholipids/Fatty Acids/Triglycerides EX_tag181d9_e
UDP-D-glucuronate Carbohydrates/Derivatives EX_udpglcur_e
L-fuculose Carbohydrates/Derivatives EX_fuc_e
Propanal Ketones/Aldehydes EX_ppal_e
L-Malate Carboxylic Acid EX_mal__L_e
3 Hydroxy 10 phenyldecanoic acid Carboxylic Acid EX_R_3hpdeca_e
UMP C9H11N2O9P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_ump_e
3 Hydroxy 4Z decenic acid Carboxylic Acid EX_R3hdec4e_e
3 Hydroxypentanoic acid Carboxylic Acid EX_R_3hpt_e
R R 2 3 Butanediol C4H10O2 Alcohol EX_btd_RR_e
4-Hydroxybenzoate Carboxylic Acid EX_4hbz_e
Cytosine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_csn_e
D-Mannose Carbohydrates/Derivatives EX_man_e
D-Mannose 1-phosphate Carbohydrates/Derivatives EX_man1p_e
Diacetyl C4H6O2 Ketones/Aldehydes EX_diact_e
Lactose C12H22O11 Carbohydrates/Derivatives EX_lcts_e
Orotate C5H3N2O4 Nucleobases/Nucleosides/Nucleotides/Derivatives EX_orot_e
3 hydroxynonanoic acid Carboxylic Acid EX_R_3hnonaa_e
Cys Gly C5H10N2O3S Peptides EX_cgly_e
D-Ornithine Amino Acids/Derivatives EX_orn__D_e
(R)-Propane-1,2-diol Alcohol EX_12ppd__R_e
1-Propanol Alcohol EX_ppoh_e
2-3-dihydroxybenzoylserine trimerFe-III Other EX_fe3dhbzs3_e
2-Aminoethylphosphonate Other EX_2ameph_e
3-Oxoadipate Carboxylic Acid EX_3oxoadp_e
4-aminobenzoate-glutamate Peptides EX_abg4_e
D-Glucose 6-phosphate Carbohydrates/Derivatives EX_g6p_e
DAMP C10H12N5O6P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_damp_e
DGMP C10H12N5O7P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_dgmp_e
DUMP C9H11N2O8P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_dump_e
Decanoate (n-C10:0) Phospholipids/Fatty Acids/Triglycerides EX_dca_e
Glycolaldehyde Ketones/Aldehydes EX_gcald_e
Hexadecanoate (n-C16:0) Phospholipids/Fatty Acids/Triglycerides EX_hdca_e
Hexadecenoate (n-C16:1) Phospholipids/Fatty Acids/Triglycerides EX_hdcea_e
Inosine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_ins_e
Pentanoate Phospholipids/Fatty Acids/Triglycerides EX_pta_e
Phenylacetaldehyde Ketones/Aldehydes EX_pacald_e
Phosphate Inorganic EX_pi_e
Phosphotyrosine Amino Acids/Derivatives EX_tyrp_e
Starch C12H20O10 Carbohydrates/Derivatives EX_starch_e
Sucrose C12H22O11 Carbohydrates/Derivatives EX_sucr_e
UDP-N-acetyl-D-galactosamine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_udpacgal_e
Alpha L Arabinan C15H24O12 Carbohydrates/Derivatives EX_Larab_e
Calcium Inorganic EX_ca2_e
Chloride Inorganic EX_cl_e
Co2+ Inorganic EX_cobalt2_e
Copper Inorganic EX_cu2_e
Iron bound extracellular staphyloferrin B Other EX_istfrnB_e
Potassium Inorganic EX_k_e
Magnesium Inorganic EX_mg2_e
Manganese Inorganic EX_mn2_e
Pyridoxamine B Vitamins EX_pydam_e
Zinc Inorganic EX_zn2_e
Iron bound extracellular staphyloferrin A Other EX_istfrnA_e
L-Methionine Amino Acids/Derivatives EX_met__L_e
Sn-Glycero-3-phosphocholine Phospholipids/Fatty Acids/Triglycerides EX_g3pc_e
2 methylbutyraldehyde C5H10O Ketones/Aldehydes EX_2mbald_e
Citrate Carboxylic Acid EX_cit_e
Deoxyguanosine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_dgsn_e
(R)-Pantothenate B Vitamins EX_pnto__R_e
Riboflavin C17H20N4O6 B Vitamins EX_ribflv_e
4-Amino-5-hydroxymethyl-2-methylpyrimidine Other EX_4ahmmp_e
D-Alanyl-D-alanine Peptides EX_alaala_e
Iron (Fe3+) Inorganic EX_fe3_e
Hypoxanthine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_hxan_e
Diphosphate Inorganic EX_ppi_e
Pseudouridine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_psuri_e
Trehalose Carbohydrates/Derivatives EX_tre_e
Ferrypyoverdine P putida KT2440 specific Other EX_fe3pyovd_kt_e
2-Dehydro-3-deoxy-D-gluconate Carboxylic Acid EX_2ddglcn_e
D Arabinose C5H10O5 Ketones/Aldehydes EX_arab__D_e
Methanethiol CH4S Organic Sulfur EX_ch4s_e
L-Cystathionine Amino Acids/Derivatives EX_cyst__L_e
Myo-Inositol Carbohydrates/Derivatives EX_inost_e
Serine-glutamine-glycine tripeptide Peptides EX_serglugly_e
Fe(III)dicitrate Other EX_fe3dcit_e
D-Glucosamine 6-phosphate Carbohydrates/Derivatives EX_gam6p_e
Guanosine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_gsn_e
Nicotinate B Vitamins EX_nac_e
D Tyrosine Amino Acids/Derivatives EX_tyr__D_e
D-Galactonate Carboxylic Acid EX_galctn__D_e
3 UMP C9H11N2O9P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_3ump_e
L-Cysteine Amino Acids/Derivatives EX_cys__L_e
Maltose C12H22O11 Carbohydrates/Derivatives EX_malt_e
Maltohexaose Carbohydrates/Derivatives EX_malthx_e
GMP C10H12N5O8P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_gmp_e
D-Glycerate 2-phosphate Carbohydrates/Derivatives EX_2pg_e
D-Mannose 6-phosphate Carbohydrates/Derivatives EX_man6p_e
N-Acetyl-D-mannosamine Carbohydrates/Derivatives EX_acmana_e
2-Phosphoglycolate Carboxylic Acid EX_2pglyc_e
Aminoimidazole-riboside Nucleobases/Nucleosides/Nucleotides/Derivatives EX_airs_e
Phenethylamine Amines/Amides EX_peamn_e
Pyridoxal B Vitamins EX_pydx_e
(S)-Propane-1,2-diol Alcohol EX_12ppd__S_e
Guanosine 3 phosphate C10H12N5O8P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_3gmp_e
AMP C10H12N5O7P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_amp_e
Formate Carboxylic Acid EX_for_e
Deoxyribose C5H10O4 Carbohydrates/Derivatives EX_drib_e
Gly asp L C6H9N2O5 Peptides EX_gly_asp__L_e
Fe-enterobactin Other EX_feenter_e
Sn-Glycero-3-phosphoethanolamine Phospholipids/Fatty Acids/Triglycerides EX_g3pe_e
Sn-Glycero-3-phospho-1-inositol Phospholipids/Fatty Acids/Triglycerides EX_g3pi_e
D-Fructose 6-phosphate Carbohydrates/Derivatives EX_f6p_e
L-Fucose Carbohydrates/Derivatives EX_fuc__L_e
4-Hydroxy-benzyl alcohol Alcohol EX_4hba_e
L Ornithine C5H13N2O2 Amino Acids/Derivatives EX_orn__L_e
D-Sorbitol Carbohydrates/Derivatives EX_sbt__D_e
N-Acetyl-D-glucosamine Carbohydrates/Derivatives EX_acgam_e
L-Aspartate Amino Acids/Derivatives EX_asp__L_e
Ethanolamine Amines/Amides EX_etha_e
Allantoin Other EX_alltn_e
Propionate (n-C3:0) Carboxylic Acid EX_ppa_e
4-aminobenzoyl-glutamate Amino Acids/Derivatives EX_4abzglu_e
Tagatose Carbohydrates/Derivatives EX_tgt_e
Methanesulfonate Organic Sulfur EX_mso3_e
D-Glucuronate 1-phosphate Carbohydrates/Derivatives EX_glcur1p_e
D-Ribose Carbohydrates/Derivatives EX_rib__D_e
Maltotetraose Carbohydrates/Derivatives EX_maltttr_e
Agmatine Amines/Amides EX_agm_e
L-Carnosine Peptides EX_carn_e
Toluene Other EX_tol_e
Xanthosine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_xtsn_e
Salicin C13H18O7 Carbohydrates/Derivatives EX_salcn_e
CMP C9H12N3O8P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_cmp_e
L-Threonine O-3-phosphate Amino Acids/Derivatives EX_thrp_e
Nitrous oxide Inorganic EX_n2o_e
Benzaldehyde Ketones/Aldehydes EX_bzal_e
Deoxyinosine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_din_e
N Ribosylnicotinamide C11H15N2O5 Nucleobases/Nucleosides/Nucleotides/Derivatives EX_rnam_e
Nicotinamide B Vitamins EX_ncam_e
2',3'-Cyclic UMP Nucleobases/Nucleosides/Nucleotides/Derivatives EX_23cump_e
D-Alanine Amino Acids/Derivatives EX_ala__D_e
DIMP C10H12N4O7P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_dimp_e
L alaninylleucine Peptides EX_alaleu_e
L-Ascorbate Other EX_ascb__L_e
UDP-N-acetyl-D-glucosamine Nucleobases/Nucleosides/Nucleotides/Derivatives EX_uacgam_e
Oxidized glutathione Organic Sulfur EX_gthox_e
UDPgalactose Nucleobases/Nucleosides/Nucleotides/Derivatives EX_udpgal_e
L-Lyxose Carbohydrates/Derivatives EX_lyx__L_e
Propanoyl phosphate Other EX_ppap_e
Formaldehyde Ketones/Aldehydes EX_fald_e
Indole 3 acetaldehyde C10H9NO Ketones/Aldehydes EX_id3acald_e
6-Phospho-D-gluconate Carboxylic Acid EX_6pgc_e
L-Xylulose Carbohydrates/Derivatives EX_xylu__L_e
D-Mannitol 1-phosphate Carbohydrates/Derivatives EX_mnl1p_e
D-Glutamate Amino Acids/Derivatives EX_glu__D_e
IMP C10H11N4O8P Nucleobases/Nucleosides/Nucleotides/Derivatives EX_imp_e
UDPglucose Nucleobases/Nucleosides/Nucleotides/Derivatives EX_udpg_e
Phenylacetic acid Carboxylic Acid EX_pac_e
3 Hydroxy 8 phenyloctanoic acid Phospholipids/Fatty Acids/Triglycerides NA
Acetate Carboxylic Acid EX_ac_e
5-Methylthio-D-ribose Organic Sulfur EX_5mtr_e
Acetoacetate Carboxylic Acid EX_acac_e
Tetradecanoate (n-C14:0) Phospholipids/Fatty Acids/Triglycerides EX_ttdca_e
Alpha D-glucose Carbohydrates/Derivatives EX_glc__aD_e
Quinate Carboxylic Acid EX_quin_e
L Ectoine Carboxylic Acid EX_ecto__L_e
Ala L asp L C7H11N2O5 Peptides EX_ala_L_asp__L_e
6 Phenylhexanoic acid Carboxylic Acid EX_phehxa_e
D-Glucarate Carboxylic Acid EX_glcr_e
Choline sulfate Organic Sulfur EX_chols_e
L-Lactate Carboxylic Acid EX_lac__L_e
9 Phenylnonanoic acid Carboxylic Acid EX_phenona_e
D Leucine Amino Acids/Derivatives EX_leu__D_e
Oxaloacetate Carboxylic Acid EX_oaa_e
Psicoselysine Amino Acids/Derivatives EX_psclys_e
D-Gluconate Carboxylic Acid EX_glcn_e
4-Hydroxy-L-threonine Amino Acids/Derivatives EX_4hthr_e
Beta-Alanine Amino Acids/Derivatives EX_ala_B_e
L Sorbose C6H12O6 Carbohydrates/Derivatives EX_srb__L_e
Beta Alaninamide Amino Acids/Derivatives EX_balamd_e
Butyrate (n-C4:0) Phospholipids/Fatty Acids/Triglycerides EX_but_e
3 hydroxy 5Z 8Z tetradecedienic acid Phospholipids/Fatty Acids/Triglycerides NA
Benzyl alcohol Alcohol EX_bzalc_e
2(alpha-D-Mannosyl)-D-glycerate Carbohydrates/Derivatives EX_manglyc_e
Maltopentaose Carbohydrates/Derivatives EX_maltpt_e
Glycylglycine C4H8N2O3 Peptides EX_glygly_e
Beta D-Galactose Carbohydrates/Derivatives EX_gal_bD_e
1 4 Diguanidinobutane Amines/Amides EX_dgudbutn_e
L Arginine phosphate C6H14N4O5P Amino Acids/Derivatives EX_argp_e
R Acetoin C4H8O2 Ketones/Aldehydes NA
N-Acetyl-L-glutamate Amino Acids/Derivatives EX_acglu_e
L-Rhamnose Carbohydrates/Derivatives EX_rmn_e
Beta-1,3/1,4-glucan (Barley, n=6, Glc beta1->3,4 Glc) Carbohydrates/Derivatives EX_glucan6_e
Xylotriose Carbohydrates/Derivatives EX_xyl3_e
Arbutin 6-phosphate Carbohydrates/Derivatives EX_arbt6p_e
Xanthosine 5'-phosphate Nucleobases/Nucleosides/Nucleotides/Derivatives EX_xmp_e
Salmochelin-S2-Fe-III Other EX_salchs2fe_e
O-Phospho-L-serine Amino Acids/Derivatives EX_pser__L_e
Ala L glu L C8H13N2O5 Peptides EX_ala_L_glu__L_e
N-Acetylmuramate Carbohydrates/Derivatives EX_acmum_e
L-alanine-D-glutamate Peptides EX_LalaDglu_e
Ferric 2,3-dihydroxybenzoylserine Other EX_fe3dhbzs_e
Vaccenic acid Phospholipids/Fatty Acids/Triglycerides EX_vacc_e
Urate C5H4N4O3 Nucleobases/Nucleosides/Nucleotides/Derivatives EX_urate_e
Glycerol 2-phosphate Other EX_glyc2p_e
Hexanoate (n-C6:0) Phospholipids/Fatty Acids/Triglycerides EX_hxa_e
3 hydroxyheptanoic acid Carboxylic Acid NA
Tyramine Amino Acids/Derivatives EX_tym_e
Cellulose (n=4 repeating units) Carbohydrates/Derivatives EX_cell4_e
(R)-mevalonate Carboxylic Acid EX_mevR_e
D-Galactose Carbohydrates/Derivatives EX_gal_e
L-Galactonate Carboxylic Acid EX_galctn__L_e
L glycinylglutamate Peptides EX_glyglu_e
4-Hydroxybenzaldehyde Ketones/Aldehydes EX_4hbald_e
Triacylglycerol octadecanoate Phospholipids/Fatty Acids/Triglycerides EX_tag180_e
Triacylglycerol nC182d9d12 Phospholipids/Fatty Acids/Triglycerides EX_tag182d9d12_e
Dimethyl sulfone Organic Sulfur EX_dmso2_e
Beta alanylL leucine Peptides EX_balaleu_e
Myo-Inositol hexakisphosphate Other EX_minohp_e
5-Dehydro-D-gluconate Carbohydrates/Derivatives EX_5dglcn_e
L-Idonate Carboxylic Acid EX_idon__L_e
L-Carnitine Amines/Amides EX_crn_e
L-Homocysteine Amino Acids/Derivatives EX_hcys__L_e
D-Lactate Carboxylic Acid EX_lac__D_e
Salmochelin-S4-Fe-III Other EX_salchs4fe_e
3 Hydroxydodecanoic 6 en acid Carboxylic Acid NA
Isethionic acid Organic Sulfur EX_isetac_e
Salicylate Carboxylic Acid EX_salc_e
Ethanesulfonate Organic Sulfur EX_ethso3_e
Glycol Alcohol EX_glycol_e
Ribitol Carbohydrates/Derivatives EX_rbt_e
Coniferol Other EX_confrl_e
Isethionate C2H5O4S Organic Sulfur EX_istnt_e
2-Oxoarginine Carboxylic Acid EX_5g2oxpt_e
Xylan (4 backbone units, 1 glcur side chain) Carbohydrates/Derivatives EX_xylan4_e
Phosphonate Other EX_ppat_e
1-O-methyl-Beta-D-glucuronate Carbohydrates/Derivatives EX_metglcur_e
D Galactarate C6H8O8 Carboxylic Acid EX_galctr__D_e
D Galactosamine C6H13NO5 Carbohydrates/Derivatives EX_galam_e
Galactomannan(n=6 repeat units mannose, alpha-1,4 man) Carbohydrates/Derivatives EX_galman6_e
3 Hydroxyphenylacetic acid C8H8O3 Carboxylic Acid EX_3hoxpac_e
Galactomannan(n=4 repeat units mannose, alpha-1,4 man) Carbohydrates/Derivatives EX_galman4_e
D-Xylonate Carboxylic Acid EX_dxylnt_e
L-gulonate Carboxylic Acid EX_guln__L_e
Xylan (8 backbone units, 2 glcur side chain) Carbohydrates/Derivatives EX_xylan8_e
Beta Methylglucoside C7H14O6 Other EX_mbdg_e
D-Fructuronate Carboxylic Acid EX_fruur_e
Gallic acid Alcohol EX_ga_e
(R)-Glycerate Carboxylic Acid EX_glyc__R_e
Mannotriose (beta-1,4) Carbohydrates/Derivatives EX_mantr_e
GTP C10H12N5O14P3 Nucleobases/Nucleosides/Nucleotides/Derivatives EX_gtp_e
Ethanesulfonate C2H5O3S Organic Sulfur EX_eths_e
Appendix Table B14: Biogeographical distribution of SOM clusters by oceanographic region. This table provides the
RPKM information for the 23 regions defined by Lanclos et al. (2023) including the full name of each region, the
identifier for each region, the oceanographic category each region was assigned to, the number of stations assigned to
each region, and the relative abundance of the 8 SOM clusters based on the bootstrapped RPKM values.
Region Identifier Full Region Name Number of Stations Oceanographic Category Cluster 8 Cluster 7 Cluster 6 Cluster 5 Cluster 4 Cluster 3 Cluster 2 Cluster 1
AON Atlantic Ocean North 263 Oligotrophic Open Ocean 0.2066 0.1256 0.0495 0.0248 0.1366 0.1616 0.0349 0.2604
AOS Atlantic Ocean South 117 Oligotrophic Open Ocean 0.1725 0.1590 0.0413 0.0601 0.1414 0.1260 0.0436 0.2560
Baltic Sea Baltic Sea 55 Coastal 0.1011 0.0366 0.0497 0.0815 0.2931 0.0730 0.0386 0.3265
Baltic_Pelagic Baltic Sea Pelagic 4 Coastal 0.0566 0.0457 0.0583 0.1915 0.3240 0.0354 0.0464 0.2422
Black Sea Black Sea 5 Coastal 0.0345 0.2846 0.1386 0.0546 0.2403 0.0407 0.1045 0.1022
CB Chesapeake/Delaware Bay 36 Estuarine 0.1740 0.1007 0.0632 0.1249 0.1999 0.0646 0.1427 0.1299
Chesapeake Bay Chesapeake Bay 15 Estuarine 0.0641 0.0071 0.1993 0.0854 0.0538 0.0192 0.3986 0.1725
Columbia River Columbia River 4 Estuarine 0.0441 0.0562 0.0747 0.1079 0.2835 0.0181 0.1936 0.2218
GOM Gulf of Mexico 7 Coastal 0.4318 0.0294 0.0057 0.1050 0.0800 0.2007 0.0291 0.1182
HOT Station ALOHA 299 Oligotrophic Open Ocean 0.2098 0.1113 0.0369 0.1194 0.1596 0.1523 0.0228 0.1879
ION Indian Ocean North 9 Oligotrophic Open Ocean 0.3474 0.1563 0.0112 0.0273 0.1148 0.1064 0.0192 0.2176
IOS Indian Ocean South 12 Oligotrophic Open Ocean 0.2870 0.1967 0.0209 0.0179 0.1299 0.1134 0.0169 0.2173
MED Mediterranean Sea 7 Oligotrophic Seas 0.1854 0.0850 0.0748 0.0162 0.1106 0.1132 0.0283 0.3865
NPAC San Pedro Ocean Time-series 12 Coastal 0.1022 0.1312 0.1778 0.1501 0.1422 0.0191 0.0487 0.2287
PON Pacific Ocean North 78 Oligotrophic Open Ocean 0.2486 0.1507 0.0513 0.0209 0.0959 0.1667 0.0324 0.2335
POS Pacific Ocean South 207 Oligotrophic Open Ocean 0.2510 0.1062 0.0587 0.0257 0.0768 0.1234 0.0238 0.3343
Pearl_river Pearl River 15 Estuarine 0.1420 0.2054 0.0590 0.1183 0.1468 0.0185 0.1678 0.1422
RED Red Sea 6 Oligotrophic Seas 0.4256 0.1422 0.0189 0.0217 0.0997 0.1265 0.0203 0.1450
SFBay San Francisco Bay 8 Estuarine 0.1416 0.0520 0.0559 0.1554 0.1526 0.0652 0.0821 0.2951
SI Saanich Inlet 4 Coastal 0.0130 0.1554 0.3339 0.0351 0.2760 0.0167 0.0171 0.1527
SOC Southern Ocean 3 Southern Ocean 0.0839 0.0472 0.2580 0.1229 0.0617 0.2958 0.0209 0.1096
Sapelo Sapelo Island 5 Coastal 0.0943 0.0914 0.2425 0.1496 0.2198 0.0305 0.0506 0.1212
Yaquina Bay Yaquina Bay 32 Estuarine 0.1180 0.1396 0.0233 0.2045 0.1458 0.0147 0.1095 0.2446
Appendix Table B15: Biogeographical distribution of SOM clusters by oceanographic category. This table provides
the RPKM information for the 5 oceanographic categories defined in this study including the number of stations
assigned to each category and the RPKM relative abundance information for each category.
Oceanographic Category Number of Stations Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8
Coastal 92 0.2334 0.0439 0.0708 0.2061 0.1084 0.1047 0.0844 0.1482
Estuarine 110 0.2271 0.1272 0.0228 0.1534 0.1824 0.0390 0.1275 0.1206
Oligotrophic Open Ocean 985 0.2649 0.0302 0.1440 0.1202 0.0432 0.0477 0.1267 0.2231
Oligotrophic Seas 13 0.2936 0.0253 0.1184 0.1065 0.0184 0.0526 0.1071 0.2781
Southern Ocean 3 0.1097 0.0206 0.2957 0.0613 0.1234 0.2591 0.0469 0.0834
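Each row of relative abundances in Tables B14 and B15 sums to approximately 1 across the eight clusters. The sketch below shows one way such values could be obtained from per-cluster RPKM totals. The function name and the toy RPKM numbers are hypothetical, and the bootstrapping step mentioned in the captions is not reproduced here.

```python
def relative_abundance(rpkm_by_cluster):
    """Scale per-cluster RPKM totals so that they sum to 1."""
    total = sum(rpkm_by_cluster.values())
    if total <= 0:
        raise ValueError("RPKM totals must sum to a positive number")
    return {cluster: value / total for cluster, value in rpkm_by_cluster.items()}


# Hypothetical RPKM totals for three clusters at one station.
toy = relative_abundance({1: 50.0, 2: 10.0, 3: 40.0})
print(toy)

# The Coastal row of Table B15 (clusters 1 to 8), copied from the table.
coastal = [0.2334, 0.0439, 0.0708, 0.2061, 0.1084, 0.1047, 0.0844, 0.1482]
print(round(sum(coastal), 4))
```

The same check can be applied to any row of either table to confirm that it was transcribed completely.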
Appendix Table B16: Matrix of significance values for all pairs of SOM clusters based on dCUB distributions. This
table provides the p-values for all paired comparisons of the distributions of growth rates (estimated using dCUB) for
our 8 SOM clusters.
Cluster Comparison p value
2-1 3.58E-08
3-1 0.954627773
4-1 0.7981970848
5-1 0.00255195585
6-1 0.9752095145
7-1 0.7596861371
8-1 0.9999507727
3-2 3.76E-09
4-2 1.49E-05
5-2 0.2200741497
6-2 8.68E-05
7-2 2.39E-09
8-2 1.38E-05
4-3 0.1893939319
5-3 1.21E-04
6-3 0.4932582833
7-3 0.9996558229
8-3 0.9988789317
5-4 0.1662492488
6-4 0.9999237676
7-4 0.0708115924
8-4 0.7433751924
6-5 0.1405184432
7-5 3.80E-05
8-5 0.0129446364
7-6 0.2429095454
8-6 0.9333372143
8-7 0.9675410519
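One way to read this matrix is to filter the pairs at a significance threshold. The sketch below does that with the p-values copied from the table. The threshold of 0.05 is an assumption (the caption does not state one), and the helper names are hypothetical.

```python
# Pairwise p-values copied from Appendix Table B16 (keys are "A-B" cluster pairs).
P_VALUES = {
    "2-1": 3.58e-08, "3-1": 0.954627773, "4-1": 0.7981970848, "5-1": 0.00255195585,
    "6-1": 0.9752095145, "7-1": 0.7596861371, "8-1": 0.9999507727,
    "3-2": 3.76e-09, "4-2": 1.49e-05, "5-2": 0.2200741497, "6-2": 8.68e-05,
    "7-2": 2.39e-09, "8-2": 1.38e-05,
    "4-3": 0.1893939319, "5-3": 1.21e-04, "6-3": 0.4932582833, "7-3": 0.9996558229,
    "8-3": 0.9988789317,
    "5-4": 0.1662492488, "6-4": 0.9999237676, "7-4": 0.0708115924, "8-4": 0.7433751924,
    "6-5": 0.1405184432, "7-5": 3.80e-05, "8-5": 0.0129446364,
    "7-6": 0.2429095454, "8-6": 0.9333372143, "8-7": 0.9675410519,
}
ALPHA = 0.05  # assumed threshold; not stated in the table caption


def significant_pairs(p_values, alpha=ALPHA):
    """Return the cluster pairs whose p-value falls below alpha."""
    return sorted(pair for pair, p in p_values.items() if p < alpha)


def partners(cluster, pairs):
    """Clusters that appear in a listed pair together with `cluster`."""
    found = set()
    for pair in pairs:
        a, b = pair.split("-")
        if cluster in (a, b):
            found.add(b if a == cluster else a)
    return sorted(found)


sig = significant_pairs(P_VALUES)
print(sig)
print("Cluster 2 differs from:", partners("2", sig))
print("Cluster 5 differs from:", partners("5", sig))
```

At that threshold, cluster 2 differs from every cluster except cluster 5, and cluster 5 differs from clusters 1, 3, 7, and 8.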
Appendix Table B17: Unifrac distances for all paired comparisons of the SOM clusters. This table provides the Unifrac
distances for all paired comparisons of the genomes in SOM clusters. We report the Unifrac distances when comparing
across the full phylogenetic tree as well as for subtrees of only the genomes in each of our 15 major taxonomic orders.
Compared Cluster Compared Cluster All Acidimicrobiales AEGEAN-169 Burkholderiales Caulobacterales Cytophagales Enterobacterales_A Flavobacteriales Marinisomatales Opitutales Other PCC-6307 Pelagibacterales Pseudomonadales Rhodobacterales SAR86 Sphingomonadales
2 1 0.4774 0.6747 0.2306 0.6798 0.4199 0.7989 NA 0.5099 0.3565 0.6349 0.5986 0.8194 0.5018 0.2828 0.7958 NA 0.3102
3 1 0.5056 0.2632 0.8449 NA 0.4709 0.5002 NA 0.7938 NA 0.5563 0.4352 0.8370 0.6294 NA 0.2932 NA 0.2075
4 1 0.3618 0.2815 0.3959 0.4839 0.7559 0.5441 0.3514 0.7996 0.1510 0.5026 0.5422 0.1715 0.5563 0.4298 0.8375 NA 0.2571
5 1 0.1649 0.1184 0.5654 0.4257 0.7723 0.6052 0.3736 0.7708 0.2031 0.5480 0.6712 0.3811 0.6490 0.4196 0.8274 NA 0.2831
6 1 0.3997 0.3168 0.5139 0.4510 0.7740 0.6282 0.3719 0.7762 0.0789 0.6362 0.5759 0.2279 0.4922 0.4531 0.8073 NA 0.1035
7 1 0.4591 0.6188 0.1675 0.5637 0.2717 0.8285 NA 0.2907 0.1752 0.5582 0.3982 0.7717 0.5820 0.3927 0.8476 NA 0.3186
8 1 0.8026 NA 0.5520 NA 0.3886 0.5285 NA 0.2744 0.2866 0.5193 0.4212 0.7541 0.5531 0.3297 0.7634 NA 0.2410
3 2 0.7187 0.5119 0.7754 NA 0.4708 0.6926 NA 0.8124 NA 0.6478 0.6587 0.8804 0.5401 NA 0.2949 0.3870 0.4859
4 2 0.3602 0.6084 0.3406 0.6797 0.4546 0.7971 NA 0.4241 0.2850 0.5294 0.2746 0.6463 0.5726 0.7540 0.6299 0.3217 0.7851
5 2 0.3351 0.6258 0.4072 0.6268 0.1901 0.7816 NA 0.5518 0.3808 0.6888 0.5157 0.7943 0.6067 0.2840 0.7683 NA 0.3216
6 2 0.4453 0.7147 0.3283 0.6928 0.3945 0.7672 NA 0.4649 0.3524 0.7191 0.6060 0.7490 0.4875 0.3803 0.8036 NA 0.3359
7 2 0.3431 0.7346 0.2679 0.7280 0.3697 0.8182 NA 0.4725 0.3182 0.6691 0.6501 0.8731 0.5161 0.3455 0.8467 NA 0.3538
8 2 0.7760 NA 0.7275 NA 0.3654 0.7629 NA 0.3250 0.2825 0.5854 0.4128 0.6823 0.6322 0.8164 0.5414 0.3348 0.8100
4 3 0.4827 0.3647 0.7700 NA 0.4451 0.5593 NA 0.7990 NA 0.5501 0.4743 0.8279 0.5311 NA 0.3347 0.2135 0.3372
5 3 0.6562 0.5116 0.8005 NA 0.5029 0.6519 NA 0.8309 NA 0.4840 0.5423 0.8229 0.4230 NA 0.2885 NA 0.3482
6 3 0.3986 0.3742 0.8039 NA 0.5161 0.4870 NA 0.8340 NA 0.5466 0.3600 0.8257 0.3026 NA 0.4107 NA 0.2128
7 3 0.4836 0.3700 0.8328 NA 0.3768 0.6244 NA 0.8413 NA 0.5069 0.3167 0.7026 0.4189 NA 0.3133 NA 0.3857
8 3 0.8392 NA 0.5118 NA 0.3419 0.5280 NA 0.8124 NA 0.3812 0.3552 0.8142 0.3427 NA 0.1272 0.1976 0.3703
5 4 0.3580 0.3365 0.5982 0.5088 0.7453 0.5171 0.4149 0.8204 0.2269 0.3646 0.6553 0.4418 0.6253 0.4542 0.8054 NA 0.3911
6 4 0.4125 0.4001 0.6334 0.3740 0.7639 0.5534 0.3967 0.7929 0.2082 0.3868 0.5397 0.2080 0.4721 0.2444 0.7764 NA 0.1863
7 4 0.3148 0.6836 0.1954 0.5813 0.3362 0.8471 NA 0.4340 0.2745 0.5077 0.4020 0.8408 0.4398 0.4171 0.8271 NA 0.4293
8 4 0.8296 NA 0.5719 NA 0.3131 0.6220 NA 0.3674 0.0724 0.4253 0.2697 0.5792 0.3799 0.8212 0.4510 0.3559 0.8004
6 5 0.3962 0.3473 0.5701 0.4838 0.7627 0.4296 0.4541 0.8045 0.2110 0.4143 0.6617 0.4211 0.6181 0.3940 0.7732 NA 0.2876
7 5 0.3565 0.6892 0.4619 0.6192 0.3693 0.8213 NA 0.2858 0.2158 0.4893 0.5072 0.8399 0.4284 0.3480 0.8497 NA 0.1947
8 5 0.8015 NA 0.6955 NA 0.4034 0.7194 NA 0.1454 0.3198 0.4835 0.4873 0.8423 0.3491 0.4165 0.8232 NA 0.3531
7 6 0.4638 0.6192 0.2004 0.4971 0.3585 0.8263 NA 0.2956 0.2551 0.5631 0.3614 0.7985 0.4380 0.4925 0.8138 NA 0.3233
8 6 0.8026 NA 0.5255 NA 0.5041 0.5229 NA 0.4579 0.2349 0.4748 0.3016 0.7552 0.3266 0.2447 0.7894 NA 0.2463
8 7 0.8018 NA 0.5695 NA 0.2610 0.4938 NA 0.3718 0.2191 0.4862 0.2852 0.7518 0.3675 0.4587 0.8021 NA 0.3168
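For readers unfamiliar with the metric in Table B17, the sketch below computes an unweighted UniFrac distance on a toy tree: the branch length leading to genomes of only one cluster, divided by the branch length leading to genomes of either cluster. The source does not state which UniFrac variant or software was used, so the unweighted form, the tree encoding, and the genome names here are all assumptions. Returning None for an empty cluster mirrors one possible reading of the NA entries, which is also an assumption.

```python
def unweighted_unifrac(branches, tips_a, tips_b):
    """Unweighted UniFrac between two sets of tips.

    `branches` is a list of (length, descendant_tips) pairs, one per branch
    of a rooted tree. Returns None when either set is empty (assumed to be
    the situation behind NA entries in a distance table).
    """
    tips_a, tips_b = set(tips_a), set(tips_b)
    if not tips_a or not tips_b:
        return None
    unique = shared = 0.0
    for length, tips in branches:
        in_a = bool(tips & tips_a)
        in_b = bool(tips & tips_b)
        if in_a and in_b:
            shared += length   # branch leads to genomes of both clusters
        elif in_a or in_b:
            unique += length   # branch leads to genomes of one cluster only
    return unique / (unique + shared)


# Toy tree ((g1:1, g2:1):1, (g3:1, g4:1):1) with hypothetical genomes g1..g4.
TOY_BRANCHES = [
    (1.0, {"g1"}), (1.0, {"g2"}), (1.0, {"g1", "g2"}),
    (1.0, {"g3"}), (1.0, {"g4"}), (1.0, {"g3", "g4"}),
]

print(unweighted_unifrac(TOY_BRANCHES, {"g1", "g3"}, {"g2", "g4"}))  # interleaved clusters
print(unweighted_unifrac(TOY_BRANCHES, {"g1", "g2"}, {"g3", "g4"}))  # fully separated clusters
```

Clusters whose genomes interleave on the tree share internal branches and score below 1, while clusters confined to separate clades share none and score 1.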
Appendix C
Appendix Figure C26: (a) The total number of reactions included in an ensemble as a function of CarveMe ensemble
size (results for 18 genomes are shown). We see that the cumulative number of reactions begins to saturate around
ensemble sizes of 60, indicating that the reaction space is fully explored. (b) Comparisons of the CarveMe ensemble
consensus scores based on the annotation method used during the CarveMe run process. Orange bars represent the
resulting ensemble consensus scores when eggNOG orthologies are imported as opposed to using the native
DIAMOND annotation process.
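The saturation curve in panel (a) can be sketched as a running union of reaction sets over the models in an ensemble. The example below is a minimal illustration with hypothetical reaction identifiers. The `saturation_size` helper and its `window` criterion are assumptions added for illustration, since the caption describes saturation only qualitatively.

```python
def cumulative_reaction_counts(ensemble):
    """Number of distinct reactions seen after adding each model in turn."""
    seen = set()
    counts = []
    for model_reactions in ensemble:
        seen |= set(model_reactions)
        counts.append(len(seen))
    return counts


def saturation_size(counts, window):
    """Smallest ensemble size after which `window` further models add nothing.

    Returns None if no such size exists within the ensemble.
    """
    for i in range(len(counts) - window):
        if counts[i + window] == counts[i]:
            return i + 1
    return None


# Hypothetical four-model ensemble; each model is a list of reaction IDs.
toy_ensemble = [["R1", "R2"], ["R2", "R3"], ["R1", "R3"], ["R3"]]
counts = cumulative_reaction_counts(toy_ensemble)
print(counts)
print(saturation_size(counts, window=2))
```

Plotting `counts` against ensemble size gives a curve of the kind described in the caption.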
Appendix Figure C27: Distributions of node degree for networks associated with good and bad model ensemble
networks.
Appendix Figure C28: Distributions of eigen centrality values for good and bad model ensemble networks.
Appendix Figure C29: Distributions of betweenness centrality values for good and bad model ensemble networks.
Appendix Figure C30: Ratios of export to import for the unique metabolites in each of the three cellular compartments.
The distributions for Extracellular and Cytoplasm metabolites are statistically significantly different (Welch’s
modified t-test, p < 0.01). The Periplasm is a transition space between the other two compartments so it’s unsurprising
that virtually all networks had a precise export/import ratio of 1.
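Welch's test, cited in this caption, compares two means without assuming equal variances. The sketch below computes the t statistic and the Welch-Satterthwaite degrees of freedom with the standard library only. The two samples are hypothetical ratios, not data from the figure. A p-value additionally requires the t distribution, which SciPy provides through `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / (se2_a + se2_b) ** 0.5
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical export/import ratios for two compartments.
ratios_a = [1.0, 2.0, 3.0, 4.0]
ratios_b = [2.0, 4.0, 6.0, 8.0]
t_stat, dof = welch_t(ratios_a, ratios_b)
print(t_stat, dof)
```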
Appendix Figure C31: Distributions of reaction type frequencies for the 263 high quality Flavobacteria genomes and
all high quality genomes (n = 1,591) for the 9 possible reaction types within a CarveMe model. All distributions are
statistically indistinguishable between the two groups (Welch’s modified t-test, p > 0.05).
Appendix Figure C32: Mean consensus change for a test genome (consensus score = 0.999) as the number of reactions that are
knocked out is increased. Error bars reflect the standard deviation in individual ensemble consensus values for each
number of reaction knockouts tested.
Appendix Figure C33: Comparison of the original consensus scores for each of the 263 genomes to the average change
in consensus for all 250 replicate knockout ensembles generated for each genome. Points are colored by the mean
change in consensus (y-axis).
Appendix Figure C34: Histograms of the change in consensus score for reaction knockouts created for the 9 unique
types of reactions present in CarveMe. There are no statistical differences between the changes in consensus score
when evaluated by reaction type (Welch’s modified t-test, p > 0.05), suggesting that any kind of reaction can impact
model precision.
Appendix Figure C35: Violin plots of the changes in consensus for knockout ensembles grouped based on how many
of the 5 knocked out reactions were ultimately added back during the carving process. The violin at 0 reflects the case
where none of the 5 reactions were added back while 1 reflects the case where all 5 reactions were added back. All
pairwise comparisons of the 6 distributions were found to be significantly different according to a Tukey test on an
ANOVA (p < 0.05).
Appendix Figure C36: Histograms of the changes in consensus for individual reaction knockouts depending on
whether they were added back (bottom) or not (top). Reactions that are added back to knockout ensembles have a
much higher tendency towards creating decreases in ensemble consensus compared to reactions that are not added
back.
Appendix Table C18: Relative frequency of the total set of surveyed reactions (3,068) that belonged to each of the 9
unique reaction types.
Reaction Type Percentage of Total
Cytoplasm → Cytoplasm (C → C) 72.8%
Cytoplasm → Extracellular (C → E) 8.2%
Cytoplasm → Periplasm (C → P) 1.5%
Extracellular → Cytoplasm (E → C) 5.2%
Extracellular → Extracellular (E → E) 4.6%
Extracellular → Periplasm (E → P) 4.3%
Periplasm → Cytoplasm (P → C) 9.3%
Periplasm → Extracellular (P → E) 0.7%
Periplasm → Periplasm (P → P) 4.9%
Abstract
Ocean microbial communities are made up of thousands of diverse taxa whose metabolic demands set the rates of both biomass production and degradation. These microscopic organisms therefore play a critical role in ecosystem dynamics, global carbon cycling, and climate. Modeling these dynamics requires reducing the complexity of microbial communities and linking microbial activities directly with biogeochemical rates. I developed a Bayesian statistical method for defining functional guilds from annotated genomes derived from both uncultured and cultured organisms. Expanding on this work, I leveraged global metagenomic datasets, metabolic models, and unsupervised machine learning techniques to identify key metabolic guilds of marine heterotrophs. I found eight clusters with distinct substrate preferences, growth strategies, taxonomic profiles, and biogeographic distributions. I demonstrated that the slowest growing groups are sensitive to the availability of multiple classes of substrates, while the intermediate growth groups are sensitive to only a single class. Moreover, organisms from diverse taxonomic groups can occupy the same metabolic niche, such that metabolic strategy cannot always be inferred from taxonomy alone. I also showed that automated metabolic model generation software is a powerful tool for understanding microbial metabolism but must be used with caution to ensure robust solutions. Overall, this work provides the building blocks for analyzing and modeling diverse marine microbial populations.
Asset Metadata
Creator: Reynolds, Ryan C. (author)
Core Title: Identifying functional metabolic guilds: a computational approach to classifying heterotrophic diversity in the marine system
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Biology (Marine Biology and Biological Oceanography)
Degree Conferral Date: 2024-08
Publication Date: 09/04/2024
Defense Date: 07/10/2024
Publisher: Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag: carbon cycling, functional guilds, heterotrophic metabolism, marine heterotrophs, metabolic niches, microbial communities, microbial metabolism
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Levine, Naomi M. (committee chair), Bien, Jacob (committee member), Thrash, Cameron (committee member), Webb, Eric (committee member)
Creator Email: ryan.reynolds.806@gmail.com, ryanreyn@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC11399AGMD
Unique identifier: UC11399AGMD
Identifier: etd-ReynoldsRy-13488.pdf (filename)
Legacy Identifier: etd-ReynoldsRy-13488
Document Type: Dissertation
Rights: Reynolds, Ryan C.
Internet Media Type: application/pdf
Type: texts
Source: 20240905-usctheses-batch-1208 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu