Ponsero et al. Microbiome Res Rep 2023;2:27 https://dx.doi.org/10.20517/mrr.2023.26 Page 17 of 21
approaches aim to allow metagenomic comparisons in conditions where reference-based methods are
impossible given high novelty or bias due to the underrepresentation of taxa in a reference database. These
computational approaches compare metagenomic samples solely on their k-mer composition, thus
bypassing the need for taxonomic profiling. These types of approaches, which take into account both
known and unknown taxa in the microbiota, are particularly relevant when analyzing understudied
ecosystems where microbial unknowns are prevalent[27].
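
As an illustration of comparing samples solely on their k-mer composition, the following minimal sketch computes a quantitative Bray-Curtis distance and a presence/absence Jaccard distance from k-mer counts. The function names and the toy reads are hypothetical, and the sketch makes simplifying assumptions that dedicated tools may not share: it does not collapse reverse-complement (canonical) k-mers and it does not filter low-abundance k-mers.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Quantitative distance: 1 - 2 * (shared abundance) / (total abundance)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    shared = sum(min(a[x], b[x]) for x in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


def jaccard_distance(a, b):
    """Presence/absence distance: 1 - |intersection| / |union| of k-mer sets."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return 1.0 - len(a.keys() & b.keys()) / len(union)


# Toy example with two single-read "samples" and k = 4 (illustrative only).
sample_a = kmer_counts(["ACGTACGT"], k=4)
sample_b = kmer_counts(["ACGTTTTT"], k=4)
bc = bray_curtis(sample_a, sample_b)
jd = jaccard_distance(sample_a, sample_b)
```

Both functions return 0 for identical k-mer profiles and 1 for profiles that share no k-mer, which matches the scale used for the taxonomy-based versions of the same indices.
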
Very few previous studies have compared taxonomic beta-diversity metrics to k-mer-based distances.
Notably, Dubinkina et al. explored the relationship between k-mer-based and taxonomy-based
beta-diversity measurements, using simulated metagenomic datasets composed of ten human gut bacteria[5].
Using these simple simulated datasets, the authors reported a high correlation between taxonomy-based and
k-mer-based Bray-Curtis distances (rho = 0.88 with k = 10 bp) and observed that the correlation increased
with longer k-mer sizes. Importantly, due to computational constraints, the authors only explored these
correlations for small k-mer sizes (maximum 12 bp). These observations were later confirmed using a
k-mer size above 21 bp by Benoit et al., who demonstrated a strong correlation (rho = 0.885 with k = 21 bp)
between k-mer-based and taxonomy-based Bray-Curtis metrics on real metagenome datasets from the
Human Microbiome Project (HMP)[10]. In this study, we built on this prior work and used simulated
metagenomes to extensively compare the correlation between taxonomy-based and k-mer-based
beta-diversity distances. We focused our analysis on two commonly used ecological metrics: the quantitative
Bray-Curtis index and the presence/absence Jaccard index. As previously observed by Dubinkina et al., the
correlation between taxonomic and k-mer-based beta-diversity distances improved when the k-mer length
increased and reached a plateau for k-mer lengths above 20 bp[5].
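
The agreement between the two kinds of distances can be summarized by a rank correlation over all sample pairs. The sketch below assumes that the reported rho is a Spearman coefficient, which the text here does not state explicitly, and the distance values are made up purely for illustration. It ranks both lists (averaging ranks for ties) and computes the Pearson correlation of the ranks.

```python
import math


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical pairwise distances for four sample pairs (not from the study).
taxonomic_distances = [0.10, 0.35, 0.50, 0.80]
kmer_distances = [0.30, 0.55, 0.52, 0.95]
rho = spearman_rho(taxonomic_distances, kmer_distances)
```

Because only the ordering of the pairs matters, a k-mer-based distance can be systematically larger than the taxonomic one and still reach a high rho, as long as it ranks the sample pairs in the same order.
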
Using simulated metagenomic datasets of increasing sequencing depth, we showed that the correlation
between taxonomic and k-mer-based distances was strongly impacted by the number of reads in the
metagenomes. The correlation increased with the sequencing depth, and k-mer-based distances measured
between shallow metagenomes were close to 1 (completely dissimilar). This result suggests that k-mer-based
distances at shallow sequencing depth tend to overestimate the dissimilarity between metagenomes. This is
further confirmed when comparing the results obtained for communities of increasing richness. While a
strong correlation between the expected taxonomic and k-mer-based Bray-Curtis distance was measured at
a sequencing depth of 5M reads for simple communities of 25 organisms, the correlation dropped in the
same conditions for more complex communities composed of 500 organisms.
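
The overestimation at shallow depth can be reproduced with a toy simulation. In the sketch below, two samples are drawn from the same random "genome", so the expected distance between them is 0, yet the k-mer-based Bray-Curtis distance is high when only a few reads are sampled and drops as coverage grows. The genome length, read length, read counts, and seeds are arbitrary assumptions chosen to keep the example fast; they are not the settings used in the study, and sequencing errors are not modeled.

```python
import random
from collections import Counter


def simulate_reads(genome, n_reads, read_len, rng):
    """Draw error-free reads uniformly at random from a single genome."""
    last_start = len(genome) - read_len
    starts = [rng.randrange(last_start + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


def kmer_counts(reads, k):
    """Count every k-mer occurring in a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Bray-Curtis distance between two k-mer count profiles."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[x], b[x]) for x in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


def distance_at_depth(genome, n_reads, k=21, read_len=50, seed=0):
    """Distance between two independent samples of the same genome."""
    rng = random.Random(seed)
    sample_a = kmer_counts(simulate_reads(genome, n_reads, read_len, rng), k)
    sample_b = kmer_counts(simulate_reads(genome, n_reads, read_len, rng), k)
    return bray_curtis(sample_a, sample_b)


# A random 20 kb sequence stands in for a community (assumption).
genome_rng = random.Random(42)
genome = "".join(genome_rng.choice("ACGT") for _ in range(20000))

shallow = distance_at_depth(genome, n_reads=5)     # far below 1x coverage
deep = distance_at_depth(genome, n_reads=8000)     # roughly 20x coverage
```

At shallow depth the two samples rarely cover the same positions, so they share almost no k-mers even though they come from the same source. Richer communities spread the same number of reads over more genomes, which lowers the per-genome coverage and produces the same effect at a given sequencing depth.
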
While sequencing depth and community richness had a notable impact on the correlation between expected
and k-mer-based distances, neither the sequencing technology nor low-abundance sequence contamination
had a major impact. These experiments demonstrate a global resilience of k-mer-based distances
towards low k-mer noise. This is in accordance with prior experiments, showing that low-rate SNP
mutations had a minor impact on the k-mer-based distances[5]. Additionally, we assessed the effect of
community phylogenetic richness on the k-mer-based distances. This experiment showed little impact of
the phylogenetic richness when considering k-mer sizes above 20 bp.
Although de novo k-mer-based methods scale well overall, applying these methods to very large
metagenomic projects containing thousands of metagenomes is still a computational challenge. In order to
reduce the computational time of k-mer-based comparisons, several tools choose to approximate pairwise
distances by subsampling the k-mer space, instead of considering the billions of k-mers typically present in
metagenomic projects. Here, we show that these sketched approaches allow for a robust estimation of the
k-mer-based distance at a sketch size of 1 million k-mers or above. Importantly, the estimation is more precise
for quantitative distances such as the Bray-Curtis metric than for presence/absence distances.
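
One common way to subsample the k-mer space is a bottom-s sketch: every k-mer is hashed and only the s smallest hash values are kept per sample. The sketch below illustrates that general idea for the presence/absence Jaccard distance only. It is an assumption-laden example rather than the procedure of any particular tool, and the hash function, sketch size, and placeholder k-mer strings are all hypothetical choices.

```python
import hashlib


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer string."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(kmers, size):
    """Keep the `size` smallest distinct hash values of a k-mer collection."""
    return sorted({kmer_hash(x) for x in kmers})[:size]


def sketched_jaccard_distance(sketch_a, sketch_b, size):
    """Estimate the Jaccard distance from two bottom-s sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    # The `size` smallest hashes of the merged sketches are a uniform
    # sample of the union of the two original k-mer sets.
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return 1.0 - shared / len(merged)


def exact_jaccard_distance(kmers_a, kmers_b):
    """Reference value computed on the full k-mer sets."""
    set_a, set_b = set(kmers_a), set(kmers_b)
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


# Placeholder strings stand in for real k-mers (illustrative only).
kmers_a = [f"kmer{i}" for i in range(6000)]
kmers_b = [f"kmer{i}" for i in range(3000, 9000)]

sketch_size = 1000
estimate = sketched_jaccard_distance(
    bottom_sketch(kmers_a, sketch_size),
    bottom_sketch(kmers_b, sketch_size),
    sketch_size,
)
exact = exact_jaccard_distance(kmers_a, kmers_b)
```

The sampling error of this estimator shrinks roughly with the square root of the sketch size, which is consistent with the estimate stabilizing once the sketch is large enough. A quantitative distance such as Bray-Curtis would additionally require storing the abundance of each retained k-mer, which this sketch does not do.
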