
Page 12 of 21               Ponsero et al. Microbiome Res Rep 2023;2:27  https://dx.doi.org/10.20517/mrr.2023.26

                Figure 6. Impact of k-mer sketching on the estimation of k-mer-based Bray-Curtis distances. (A) Absolute differences between the
                exact and sketched k-mer-based Bray-Curtis distances for increasing sketch sizes; (B) Exact against sketched k-mer-based
                Bray-Curtis distances (k = 30 bp), obtained for a dataset of 100 metagenomes simulated at a sequencing depth of 5 million
                paired reads using the HiSeq sequencing error model.

               To assess the effect of sketch size on the precision of the k-mer-based distances, we computed the
               absolute difference between exact and sketched k-mer-based Bray-Curtis and presence/absence Jaccard
               distances obtained on a dataset of simulated metagenomes composed of 25 organisms sequenced at 5
               million reads using a HiSeq error model (SimSet 1). As expected, the difference between exact and
               sketched k-mer-based Bray-Curtis indices decreased as the sketch size increased [Figure 6]. Strikingly, for
               all considered k-mer lengths, even small sketch sizes allowed for a reliable estimation of the indices, and
               increasing the sketch size above 50K k-mers only marginally improved the estimated distance. As expected
               from a presence/absence distance metric, the estimation of presence/absence Jaccard distances was noisier
               than that of the Bray-Curtis distances, even with a large sketch size [Supplementary Figure 6].
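The comparison described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the toy random "genomes", the read simulator, the hash function and the pairwise bottom-s sketching scheme (keeping the `size` k-mers with the smallest hash values over the union of both samples) are all assumptions made for illustration, and the helper names are hypothetical. The tools evaluated here may subsample k-mers differently.

```python
import hashlib
import random
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Abundance-based Bray-Curtis distance between two k-mer count tables."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[x], b[x]) for x in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


def jaccard_distance(a, b):
    """Presence/absence Jaccard distance between two k-mer count tables."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return 1.0 - len(a.keys() & b.keys()) / len(union)


def stable_hash(kmer):
    """Deterministic 64-bit hash (assumption: any uniform hash would do)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch_pair(a, b, size):
    """Restrict both samples to the `size` smallest-hash k-mers of their union."""
    kept = set(sorted(a.keys() | b.keys(), key=stable_hash)[:size])
    sketch_a = Counter({x: n for x, n in a.items() if x in kept})
    sketch_b = Counter({x: n for x, n in b.items() if x in kept})
    return sketch_a, sketch_b


def simulate_reads(rng, genome, n_reads, read_len=100):
    """Draw error-free reads uniformly from a genome (toy simulator)."""
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


# Toy data: three random genomes mixed at different abundances in two samples.
rng = random.Random(0)
genomes = ["".join(rng.choice("ACGT") for _ in range(2000)) for _ in range(3)]
reads_a = simulate_reads(rng, genomes[0], 200) + simulate_reads(rng, genomes[1], 100)
reads_b = simulate_reads(rng, genomes[0], 100) + simulate_reads(rng, genomes[2], 200)
sample_a = count_kmers(reads_a, k=21)
sample_b = count_kmers(reads_b, k=21)

exact_bc = bray_curtis(sample_a, sample_b)
exact_jac = jaccard_distance(sample_a, sample_b)
for size in (100, 1000, 10000):
    sk_a, sk_b = sketch_pair(sample_a, sample_b, size)
    print(size,
          abs(bray_curtis(sk_a, sk_b) - exact_bc),
          abs(jaccard_distance(sk_a, sk_b) - exact_jac))
```

Once the sketch size reaches the number of distinct k-mers in the pair, the sketched and exact distances coincide by construction. Smaller sketches trade precision for memory and run time.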


               Assessing the impact of k-mer filtering
               To improve upon the k-mer-based beta-diversity measures, several tools enable users to filter out or
               weight k-mers for consideration in the comparison between samples. In particular, the ability to filter
               extremely low-abundance or extremely high-abundance k-mers was proposed as a method to
               remove potentially erroneous k-mers due to sequencing error or to filter out contaminants in the
               metagenomes [10]. While the rationale behind the use and chosen thresholds for these filters has been mostly
               empirical, some tools, such as Simka, implement a default filtering of low (n < 2) and high abundance
               k-mers. To assess the impact of k-mer filtration on the k-mer-based distance metrics, the Bray-Curtis
               and presence/absence Jaccard indices were computed between pairs of samples in a simulated community
               composed of 25 random organisms, sequenced at 5 million reads using a HiSeq error model (SimSet 1).
               Distances obtained on the same sample pairs before and after filtering of k-mers were compared to the
               expected taxonomic beta-diversity metric.
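As a rough illustration of abundance filtering, the following Python sketch removes k-mers outside a count range from each sample before computing the Bray-Curtis distance. The function names, the per-sample application of the filter, the upper threshold of 100 and the toy count tables are assumptions chosen for illustration; they do not reproduce Simka's implementation or the data analysed here.

```python
from collections import Counter


def filter_by_abundance(counts, min_count=2, max_count=None):
    """Keep k-mers whose count lies in [min_count, max_count].

    min_count=2 mirrors the low-abundance cutoff (n < 2 removed) mentioned in
    the text; max_count is a hypothetical high-abundance cutoff.
    """
    return Counter({
        kmer: n for kmer, n in counts.items()
        if n >= min_count and (max_count is None or n <= max_count)
    })


def bray_curtis(a, b):
    """Abundance-based Bray-Curtis distance between two k-mer count tables."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[x], b[x]) for x in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


# Toy count tables: singletons stand in for sequencing errors, and the very
# abundant "TTTT" in sample_a stands in for a contaminant or repeat.
sample_a = Counter({"ACGT": 10, "CGTA": 8, "GTAC": 1, "TTTT": 500})
sample_b = Counter({"ACGT": 9, "CGTA": 7, "TACG": 1, "TTTT": 3})

before = bray_curtis(sample_a, sample_b)

filtered_a = filter_by_abundance(sample_a, min_count=2, max_count=100)
filtered_b = filter_by_abundance(sample_b, min_count=2, max_count=100)
after = bray_curtis(filtered_a, filtered_b)

print(f"Bray-Curtis before filtering: {before:.4f}")
print(f"Bray-Curtis after filtering:  {after:.4f}")
```

In this toy case a single highly abundant k-mer dominates the unfiltered distance, which shows why the choice of thresholds can change the resulting metric substantially.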