Page 18 of 21 Ponsero et al. Microbiome Res Rep 2023;2:27 https://dx.doi.org/10.20517/mrr.2023.26
Finally, we assessed the impact of k-mer filtering on the k-mer-based distance computed between samples
in a simulated metagenomic dataset. Filtering out low-abundance k-mers has been proposed as a way to
mitigate sequencing errors, while filtering out high-abundance k-mers aims to remove potential sequence
contaminants. Additionally, filtering out the rare k-mers reduces the computational requirements of the
comparison by reducing the number of unique k-mers to be taken into account [10]. In the conditions chosen
for the simulated experiment (simple mock communities of 25 organisms each, sequenced at 5 million
reads), applying a low-abundance k-mer filter consistently degraded the correlation between expected and
k-mer-based distances, even with a minimum k-mer abundance threshold of 2. Importantly, while our simulated
metagenomes allow for realistic modeling of sequencing errors, we acknowledge that additional sequencing
errors not simulated in our experiments could be present in real metagenomic datasets. As expected, the
potential impact of k-mer filtering is particularly important to consider when using distances such as the
presence/absence Jaccard distance.
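
The effect of a minimum-abundance filter on the two kinds of distance can be illustrated with a minimal Python sketch. This is a toy example under stated assumptions, not the implementation used by Simka or any other benchmarked tool: the helper names (count_kmers, filter_min_abundance, bray_curtis, jaccard_presence_absence), the reads, and the choice of k = 4 are all hypothetical, and k-mers are counted as written rather than in canonical form.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (as written, no canonical form) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_min_abundance(counts, min_abundance):
    """Keep only k-mers seen at least `min_abundance` times in the sample."""
    return Counter({kmer: n for kmer, n in counts.items() if n >= min_abundance})


def bray_curtis(a, b):
    """Abundance-based Bray-Curtis distance between two k-mer count tables."""
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0


def jaccard_presence_absence(a, b):
    """Presence/absence Jaccard distance: counts are ignored, only k-mer sets matter."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return 1.0 - len(a.keys() & b.keys()) / len(union)


# Hypothetical toy samples (two short reads each), k chosen small for readability.
K = 4
sample_a = count_kmers(["ACGTACGT", "ACGTACGA"], K)
sample_b = count_kmers(["ACGTACGT", "TTGTACGT"], K)

# Same samples after removing k-mers observed fewer than 2 times.
filtered_a = filter_min_abundance(sample_a, 2)
filtered_b = filter_min_abundance(sample_b, 2)

for label, a, b in [("unfiltered", sample_a, sample_b),
                    ("min abundance 2", filtered_a, filtered_b)]:
    print(label,
          "Bray-Curtis:", round(bray_curtis(a, b), 4),
          "Jaccard:", round(jaccard_presence_absence(a, b), 4))
```

In this sketch the filter removes whole k-mers from the sets used by the presence/absence Jaccard distance, whereas for Bray-Curtis it only removes the small counts those k-mers contributed. The toy values are not meant to reproduce the correlations reported in the simulated experiment.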
To our knowledge, 12 tools have been published to date for k-mer-based de novo comparative
metagenomic tasks. Older tools, such as Commet, TriageTool, and Compareads, used k-mers to compare
metagenomes in terms of read content. However, these approaches are unable to scale to modern
metagenomic dataset sizes. More recent approaches compare datasets on their k-mer content directly. We
benchmarked all de novo comparative metagenomic tools that could be installed and run on a dataset of 30
metagenomes. Most tools were able to compute pairwise distances in less than 5 h. Strikingly, Simka
allowed for a comparison of the samples on their complete k-mer spectra in less than 3 h, a run time
comparable to other tools such as Mash or Sourmash that use a sketching approach. The fastest tool in this
benchmark was SimkaMin, which was able to perform the comparison in less than 30 minutes. Finally, we
compared the output of all tools on two real metagenomic datasets, and assessed if the tools were able to
recapitulate data structures observed taxonomically. Importantly, most of the tested k-mer-based de novo
tools were able to successfully recapitulate this data structure using the standard parameters, with the
exception of CAFE, whose recommended small k-mer size (5-13 bp) does not seem appropriate for a
fine-scale exploration of metagenomic differences.
Recommendations for de novo comparative metagenomic users
From the experiments and benchmarks performed in this study, we highlight key points for users interested
in applying de novo comparative methods to their metagenomic datasets. In terms of usability, ease of
installation, and computational requirements, we believe that Simka allows for a fast and accurate k-mer-
based comparison of metagenomic datasets, and SimkaMin provides an alternative for the fast estimation of
Bray-Curtis and presence/absence Jaccard distances for very large-scale datasets or for users with limited
computational resources.
In accordance with previously published observations, we recommend using a k-mer length of 20 bp or
above to measure k-mer-based Bray-Curtis distances between metagenomes, in order to obtain results that
are well correlated with taxonomic-based distances. However, we highlight here that presence/absence
k-mer-based metrics such as the presence/absence Jaccard distance do not correlate well with the equivalent
taxonomic-based distances. Importantly, our experiments also show that sequencing depth can have a
drastic effect on the k-mer-based distances, and users should watch for inflated k-mer-based distances close
to or equal to 1. Finally, users should restrict minimum-abundance k-mer filters to cases where
they strongly suspect a large number of erroneous k-mers, or where computational resources are limiting.
Even in these cases, users should refrain from using presence/absence distances, as these are the most
affected by k-mer filtering.
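
As a practical aid for the recommendation on inflated distances, sample pairs whose distance is close to or equal to 1 can be flagged before interpretation. The following Python sketch is only an illustration: the function name flag_saturated_pairs, the input layout (a dictionary keyed by sample pairs), the sample names, and the 0.99 threshold are assumptions made for this example and are not values proposed in this study.

```python
def flag_saturated_pairs(distances, threshold=0.99):
    """Return the sample pairs whose distance is at or above `threshold`.

    `distances` maps (sample_i, sample_j) tuples to a distance in [0, 1].
    The default threshold is an arbitrary illustrative choice.
    """
    return sorted(pair for pair, d in distances.items() if d >= threshold)


# Hypothetical pairwise distances between three samples.
example_distances = {
    ("S1", "S2"): 0.42,
    ("S1", "S3"): 1.0,
    ("S2", "S3"): 0.995,
}

flagged = flag_saturated_pairs(example_distances)
print(flagged)
```

Flagged pairs would then merit a check of the sequencing depth of the samples involved before the distances are interpreted.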