
Page 18 of 21               Ponsero et al. Microbiome Res Rep 2023;2:27  https://dx.doi.org/10.20517/mrr.2023.26

               Finally, we assessed the impact of k-mer filtering on the k-mer-based distance computed between samples
               in a simulated metagenomic dataset. Filtering low-abundance k-mers has been proposed as a way to
               mitigate sequencing errors, while filtering high-abundance k-mers aims to remove potential sequence
               contaminants. Additionally, filtering out rare k-mers lowers the computational requirements of the
               comparison by reducing the number of unique k-mers to be taken into account [10]. In the conditions chosen
               for the simulated experiment (simple mock communities of 25 organisms each, sequenced at 5 million
               reads), applying a low-abundance k-mer filter consistently degraded the correlation between expected and
               k-mer-based distances, even with a minimum k-mer abundance threshold of 2. Importantly, while our simulated
               metagenomes allow for realistic modeling of sequencing errors, we acknowledge that additional sequencing
               errors not simulated in our experiments could be present in real metagenomic datasets. As expected, the
               potential impact of k-mer filtering is particularly important to consider when using distances such as the
               presence/absence Jaccard distance.
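
               As an illustration of the filtering step discussed above, the following is a minimal sketch, not the
               implementation used by any of the benchmarked tools. It counts k-mers on the forward strand only
               (real tools typically use canonical k-mers), applies a minimum-abundance filter, and computes a
               Bray-Curtis and a presence/absence Jaccard distance. The function names, the toy reads, and the
               choice of k = 4 are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical toy reads, chosen only to keep the example small.
TOY_A = ["ACGTACGT", "ACGTACGA"]
TOY_B = ["ACGTACGT", "TTGTACGT"]


def count_kmers(reads, k):
    """Count every k-mer (forward strand only) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_min_abundance(counts, min_abundance):
    """Keep only k-mers observed at least `min_abundance` times."""
    return Counter({kmer: n for kmer, n in counts.items() if n >= min_abundance})


def bray_curtis(a, b):
    """Abundance-based Bray-Curtis dissimilarity between two k-mer count tables."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


def jaccard_presence_absence(a, b):
    """Presence/absence Jaccard distance: counts are ignored, only k-mer sets matter."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return 1.0 - len(a.keys() & b.keys()) / len(union)


def compare(reads_a, reads_b, k, min_abundance=1):
    """Return (Bray-Curtis, Jaccard) after applying the minimum-abundance filter."""
    a = filter_min_abundance(count_kmers(reads_a, k), min_abundance)
    b = filter_min_abundance(count_kmers(reads_b, k), min_abundance)
    return bray_curtis(a, b), jaccard_presence_absence(a, b)


if __name__ == "__main__":
    for threshold in (1, 2):
        bc, jac = compare(TOY_A, TOY_B, k=4, min_abundance=threshold)
        print(f"min_abundance={threshold}: Bray-Curtis={bc:.3f}, Jaccard={jac:.3f}")
```

               In this toy case, a threshold of 2 discards every singleton k-mer, which changes the set of k-mers
               each sample is considered to contain and therefore shifts both distances. The example only shows the
               mechanism; it does not reproduce the simulated experiment or its correlations.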

               To our knowledge, there are 12 tools currently published for k-mer-based de novo comparative
               metagenomic tasks. Older tools, such as Commet, TriageTool, and Compareads, used k-mers to compare
               metagenomes in terms of read content. However, these approaches are unable to scale to modern
               metagenomic dataset sizes. More recent approaches compare datasets on their k-mer content directly. We
               benchmarked all de novo comparative metagenomic tools that could be installed and run on a dataset of 30
               metagenomes. Most tools were able to compute pairwise distances in less than 5 h. Strikingly, Simka
               allowed for a comparison of the samples on their complete k-mer spectra in less than 3 h, a run time
               comparable to other tools such as Mash or Sourmash that use a sketching approach. The fastest tool in this
               benchmark was SimkaMin, which was able to perform the comparison in less than 30 minutes. Finally, we
               compared the output of all tools on two real metagenomic datasets, and assessed if the tools were able to
               recapitulate data structures observed taxonomically. Importantly, most of the tested k-mer-based de novo
               tools were able to successfully recapitulate this data structure using the standard parameters, with the
               exception of CAFE, whose recommended small k-mer size (5-13 bp) does not seem appropriate for a
               fine-scale exploration of metagenomic differences.
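
               One possible intuition for why very short k-mers lack resolution, offered here as a rough sketch and
               not as an analysis from this study, is that the space of possible k-mers is small relative to genome
               sizes. The snippet below computes the number of distinct k-mers (4^k) and, under the simplifying
               assumption of a uniformly random sequence, the expected fraction of that space present in a genome.
               The 5 Mbp genome length and the function names are hypothetical.

```python
import math


def kmer_space(k):
    """Number of distinct DNA k-mers of length k over the alphabet {A, C, G, T}."""
    return 4 ** k


def expected_fraction_present(k, genome_length):
    """Expected fraction of all possible k-mers present in a random sequence.

    Poisson approximation: each k-mer is expected to occur genome_length / 4**k
    times, so the chance of seeing it at least once is 1 - exp(-rate).
    Real genomes are not random, so this is only an order-of-magnitude guide.
    """
    rate = genome_length / kmer_space(k)
    return -math.expm1(-rate)


if __name__ == "__main__":
    genome_length = 5_000_000  # hypothetical bacterial-sized genome
    for k in (5, 13, 20, 31):
        fraction = expected_fraction_present(k, genome_length)
        print(f"k={k}: {kmer_space(k):.3e} possible k-mers, "
              f"expected fraction present = {fraction:.3e}")
```

               Under this model, nearly every 5-mer is expected in any genome of that size, so presence/absence of
               such k-mers carries little discriminating signal, whereas at k = 20 only a tiny fraction of the k-mer
               space is expected by chance.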


               Recommendations for de novo comparative metagenomic users
               From the experiments and benchmarks performed in this study, we highlight key points for users interested
               in applying de novo comparative methods to their metagenomic datasets. In terms of usability, ease of
               installation, and computational requirements, we believe that Simka allows for a fast and accurate k-mer-
               based comparison of metagenomic datasets, and SimkaMin provides an alternative for the fast estimation of
               Bray-Curtis and presence/absence Jaccard distances for very large-scale datasets or for users with limited
               computational resources.


               In accordance with previously published observations, we recommend using a k-mer length of 20 bp or
               above to measure k-mer-based Bray-Curtis distances between metagenomes, in order to obtain results that
               are well correlated with taxonomic-based distances. However, we highlight here that presence/absence k-
               mer-based metrics such as the presence/absence Jaccard do not correlate well with the equivalent
               taxonomic-based distances. Importantly, our experiments also show that sequencing depth can have a
               drastic effect on the k-mer-based distances, and users should look out for inflation of k-mer distances close
               to or equal to 1. Finally, users should limit their use of minimum abundance k-mer filters to cases where
               they strongly suspect a large number of erroneous k-mers, or in cases of computational limitations.
               In such situations, however, users should avoid presence/absence distances, as they are the most
               affected by k-mer filtering.
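
               The advice to look out for distances close to or equal to 1 can be turned into a simple check on a
               pairwise distance matrix. The sketch below is one way to do this, assuming the matrix has already
               been loaded as a square list of lists; the 0.99 cutoff, the sample names, and the function name are
               hypothetical and should be adapted to the dataset at hand.

```python
def flag_saturated_pairs(names, matrix, threshold=0.99):
    """Return (sample_a, sample_b, distance) for every pair at or above `threshold`.

    `matrix` is a square, symmetric list of lists of distances in [0, 1];
    only the upper triangle is inspected so each pair is reported once.
    """
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if matrix[i][j] >= threshold:
                flagged.append((names[i], names[j], matrix[i][j]))
    return flagged


if __name__ == "__main__":
    # Hypothetical distances between three samples.
    names = ["sample_1", "sample_2", "sample_3"]
    matrix = [
        [0.0, 0.45, 1.0],
        [0.45, 0.0, 0.995],
        [1.0, 0.995, 0.0],
    ]
    for a, b, d in flag_saturated_pairs(names, matrix):
        print(f"{a} vs {b}: distance {d} is close to 1; check sequencing depth")
```

               Flagged pairs are not necessarily wrong, but they are candidates for checking whether shallow
               sequencing depth, rather than true dissimilarity, explains the value.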