
Ponsero et al. Microbiome Res Rep 2023;2:27  https://dx.doi.org/10.20517/mrr.2023.26  Page 17 of 21

               approaches aim to allow metagenomic comparisons in conditions where reference-based methods are
               impractical because of high taxonomic novelty, or are biased by the underrepresentation of taxa in a
               reference database. These
               computational approaches compare metagenomic samples solely on their k-mer composition, thus
               bypassing the need for taxonomic profiling. These types of approaches, which take into account both
               known and unknown taxa in the microbiota, are particularly relevant when analyzing understudied
               ecosystems where microbial unknowns are prevalent[27].
               Very few previous studies have compared taxonomic beta-diversity metrics to k-mer-based distances.
               Notably, Dubinkina et al. explored the relationship between k-mer-based and taxonomy-based
               beta-diversity measurements, using simulated metagenomic datasets composed of ten human gut bacteria[5].
               Using these simple simulated datasets, the authors reported a high correlation between taxonomy-based and
               k-mer-based Bray-Curtis distances (rho = 0.88 with k = 10 bp) and observed that the correlation increased
               with k-mer size. Importantly, due to computational constraints, the authors only explored these
               correlations for small k-mer sizes (maximum 12 bp). These observations were later confirmed for longer
               k-mers by Benoit et al., who demonstrated a strong correlation (rho = 0.885 with k = 21 bp)
               between k-mer-based and taxonomy-based Bray-Curtis metrics on real metagenome datasets from the
               Human Microbiome Project (HMP)[10]. In this study, we built on this prior work and used simulated
               metagenomes to extensively assess the correlation between taxonomy-based and k-mer-based
               beta-diversity distances. We focused our analysis on two commonly used ecological metrics: the quantitative
               Bray-Curtis index and the presence/absence Jaccard index. As previously observed by Dubinkina et al., the
               correlation between taxonomic and k-mer-based beta-diversity distances improved when the k-mer length
               increased and reached a plateau for k-mer lengths above 20 bp[5].
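
The two k-mer-based metrics discussed above can be illustrated with a minimal Python sketch. It is not the implementation used by any of the tools or studies cited here: the function names (`kmer_counts`, `bray_curtis`, `jaccard_distance`) and the toy reads are hypothetical, k-mers are counted on the forward strand only (no canonical reverse-complement form), raw counts are not normalized, and both samples are assumed to contain at least one k-mer.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Quantitative Bray-Curtis dissimilarity between two k-mer count tables."""
    # Abundance shared by both samples: sum of the smaller count per common k-mer.
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total


def jaccard_distance(a, b):
    """Presence/absence Jaccard distance between two sets of distinct k-mers."""
    shared = a.keys() & b.keys()
    union = a.keys() | b.keys()
    return 1.0 - len(shared) / len(union)


# Toy example (hypothetical reads, k = 4 only to keep it readable).
sample_a = kmer_counts(["ACGTACGT"], k=4)
sample_b = kmer_counts(["ACGTTTTT"], k=4)
bc = bray_curtis(sample_a, sample_b)        # 0.8: one shared k-mer occurrence out of 5 + 5
jd = jaccard_distance(sample_a, sample_b)   # 6/7: one shared k-mer out of 7 distinct
```

Bray-Curtis weights each k-mer by its abundance, whereas Jaccard only records whether a k-mer is present, which is why the two metrics can respond differently to the same data.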

               Using simulated metagenomic datasets of increasing sequencing depth, we showed that the correlation
               between taxonomic and k-mer-based distances was strongly impacted by the number of reads in the
               metagenomes. The correlation increased with the sequencing depth, and k-mer-based distances measured
               between shallow metagenomes were close to 1 (completely dissimilar). This result suggests that k-mer-based
               distances at shallow sequencing depth tend to overestimate the dissimilarity between metagenomes. This is
               further supported by the results obtained for communities of increasing richness. While a
               strong correlation between the expected taxonomic and k-mer-based Bray-Curtis distance was measured at
               a sequencing depth of 5M reads for simple communities of 25 organisms, the correlation dropped under the
               same conditions for more complex communities composed of 500 organisms.
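
The depth effect described above can be reproduced on a toy scale. The following Python sketch is an illustration under stated assumptions, not the simulation protocol of this study: it draws error-free reads from a single random "genome" (hypothetical parameters: 20,000 bp genome, 50 bp reads, k = 21) and computes the k-mer-based Bray-Curtis distance between two independent samplings of that same genome, for which the expected taxonomic distance is 0.

```python
import random
from collections import Counter


def simulate_reads(genome, n_reads, read_len, rng):
    """Draw error-free reads uniformly at random from one genome."""
    last_start = len(genome) - read_len
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


def kmer_counts(reads, k):
    """Count every k-mer across a list of reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two k-mer count tables."""
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / (sum(a.values()) + sum(b.values()))


def same_community_distance(n_reads, seed=0, genome_len=20_000, read_len=50, k=21):
    """Distance between two independent samplings of ONE genome (true distance: 0)."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    sample_1 = kmer_counts(simulate_reads(genome, n_reads, read_len, rng), k)
    sample_2 = kmer_counts(simulate_reads(genome, n_reads, read_len, rng), k)
    return bray_curtis(sample_1, sample_2)


shallow = same_community_distance(n_reads=20)      # about 0.05x coverage
deep = same_community_distance(n_reads=4_000)      # about 10x coverage
```

At roughly 0.05x coverage the two samples rarely cover the same genomic positions, so they share few k-mers and `shallow` is expected to fall close to 1 even though the underlying community is identical. At roughly 10x coverage most k-mers are observed in both samples and `deep` should be markedly lower, although sampling noise in the counts keeps it above 0.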

               While sequencing depth and community richness had a notable impact on the correlation between expected
               and k-mer-based distances, neither the sequencing technology nor low-abundance sequence contamination
               had a major impact. These experiments demonstrate an overall resilience of k-mer-based distances
               to low levels of k-mer noise. This is in accordance with prior experiments showing that low-rate SNP
               mutations had a minor impact on k-mer-based distances[5]. Additionally, we assessed the effect of
               community phylogenetic richness on the k-mer-based distances. This experiment showed little impact of
               phylogenetic richness when considering k-mer sizes above 20 bp.


               Although de novo k-mer-based methods are generally scalable, applying these methods to very large
               metagenomic projects containing thousands of metagenomes is still a computational challenge. To
               reduce the computational time of k-mer-based comparisons, several tools approximate pairwise
               distances by subsampling the k-mer space, instead of considering the billions of k-mers typically present in
               metagenomic projects. Here we show that these sketched approaches allow for a robust estimation of the
               k-mer-based distance at a sketch size of 1 million k-mers or above. Importantly, the estimation is more precise
               for quantitative distances such as the Bray-Curtis metric than for presence/absence distances.
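
One common way to subsample the k-mer space is a bottom-k sketch: each sample keeps only the distinct k-mers with the smallest hash values, and distances are estimated from the smallest hashes of the merged sketches. The Python sketch below illustrates that general idea only. It is an assumption about how such a scheme can be written, not the algorithm of any specific tool evaluated here, and the names `kmer_hash`, `sketch`, and `sketched_distances` are hypothetical.

```python
import hashlib
import heapq
from collections import Counter


def kmer_hash(kmer):
    """Deterministic 64-bit hash, so every sample subsamples the same k-mers."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(counts, sketch_size):
    """Keep the `sketch_size` distinct k-mers with the smallest hashes, with counts."""
    kept = heapq.nsmallest(sketch_size, counts, key=kmer_hash)
    return {kmer: counts[kmer] for kmer in kept}


def sketched_distances(sketch_a, sketch_b, sketch_size):
    """Estimate (Bray-Curtis, Jaccard) distances from two bottom-k sketches."""
    # The smallest hashes of the merged sketches form a uniform sample of the
    # union of both k-mer sets; a k-mer in this sample that is missing from one
    # sketch is truly absent from that sample, so its count there is 0.
    merged = heapq.nsmallest(sketch_size, sketch_a.keys() | sketch_b.keys(), key=kmer_hash)
    a = [sketch_a.get(kmer, 0) for kmer in merged]
    b = [sketch_b.get(kmer, 0) for kmer in merged]
    shared_abundance = sum(min(x, y) for x, y in zip(a, b))
    shared_kmers = sum(1 for x, y in zip(a, b) if x and y)
    bray_curtis = 1.0 - 2.0 * shared_abundance / (sum(a) + sum(b))
    jaccard = 1.0 - shared_kmers / len(merged)
    return bray_curtis, jaccard


# Toy k-mer count tables (hypothetical values).
counts_a = Counter({"ACGT": 2, "CGTA": 1, "GTAC": 1, "TACG": 1})
counts_b = Counter({"ACGT": 1, "CGTT": 1, "GTTT": 1, "TTTT": 2})

# A sketch at least as large as the k-mer space reproduces the exact distances.
exact_bc, exact_jd = sketched_distances(sketch(counts_a, 100), sketch(counts_b, 100), 100)

# A smaller sketch gives an estimate computed from fewer k-mers.
approx_bc, approx_jd = sketched_distances(sketch(counts_a, 3), sketch(counts_b, 3), 3)
```

The sketch size trades precision for time and memory: larger sketches sample more of the k-mer space and give estimates closer to the exact distances.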