Page 12 of 16                 Pham et al. Microbiome Res Rep 2024;3:25  https://dx.doi.org/10.20517/mrr.2024.01

               (the lowest quality) and 49 (the median quality) were used; for the CAMI dataset, thresholds of 18 (the
               lowest quality) and 30 (the median quality) were used. MetaBIDx lets users adjust the k-mer quality
               parameter to the desired level.
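To make the effect of such a threshold concrete, the following is a minimal sketch of quality-based k-mer filtering. It assumes, purely for illustration, that a k-mer's quality is the lowest per-base Phred score within it; this section does not state how MetaBIDx defines k-mer quality, and the function name `kmers_above_quality` is hypothetical.

```python
def kmers_above_quality(seq, quals, k, threshold):
    """Return the k-mers of `seq` whose lowest base quality is >= threshold.

    Assumption (not from the paper): a k-mer's quality is the minimum
    per-base quality score inside the k-mer window.
    """
    kept = []
    for i in range(len(seq) - k + 1):
        if min(quals[i:i + k]) >= threshold:
            kept.append(seq[i:i + k])
    return kept


# Toy read with one low-quality base (score 12) near the start.
read = "ACGTACGT"
quals = [35, 36, 12, 34, 33, 32, 31, 30]

# Every 3-mer overlapping the low-quality base is dropped.
print(kmers_above_quality(read, quals, k=3, threshold=30))  # ['TAC', 'ACG', 'CGT']
```

Under this assumption, raising the threshold discards more k-mers, which is consistent with a gain in precision at some cost in recall.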


               The findings, summarized in Table 3, indicate that higher quality thresholds for k-mers lead to improved
               precision in bacterial species identification. This improvement was particularly notable in the CAMI
               dataset, where a significant increase in precision was observed, although it was accompanied by a slight
               reduction in recall.


               Mende Dataset: For the 10- and 100-species samples, precision and recall did not change when the
               k-mer quality threshold was increased from 33 to 49. In the 400-species sample, both thresholds (33 and 49)
               resulted in high precision and recall, with an F1-score of 0.972.


               CAMI Dataset: Across all samples, increasing the k-mer quality threshold from 18 to 30 led to a notable
               improvement in precision and F1-scores. The increase in k-mer quality, however, resulted in a decrease in
               recall, though the overall F1-score improvement suggests a favorable balance between precision and recall.
               The trade-off between precision and recall highlights the importance of selecting an optimal k-mer quality
               threshold that balances the need for accurate species identification and comprehensive species detection.
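The trade-off can be illustrated with the standard definitions of precision, recall, and F1-score (the harmonic mean of precision and recall). The sketch below uses invented species labels, not values from Table 3, to show how removing false positives can raise the F1-score even when a true species is lost.

```python
def precision_recall_f1(predicted, truth):
    """Compute precision, recall, and F1 for a set of predicted species."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical ground truth and two hypothetical prediction sets.
truth = {"sp_A", "sp_B", "sp_C", "sp_D"}
low_threshold = {"sp_A", "sp_B", "sp_C", "sp_D", "sp_X", "sp_Y"}  # 2 false positives
high_threshold = {"sp_A", "sp_B", "sp_C"}                         # 1 missed species

for name, predicted in [("low", low_threshold), ("high", high_threshold)]:
    p, r, f1 = precision_recall_f1(predicted, truth)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
# low: precision=0.667 recall=1.000 F1=0.800
# high: precision=1.000 recall=0.750 F1=0.857
```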


               Running time analysis
               We report the running time of MetaBIDx for building the Mende index and for querying reads from the
               400-species sample with different numbers of CPUs in Supplementary Tables 1 and 2. The time to build
               the Mende index decreases from 165 to 65 min as the number of CPUs increases from 1 to 32, roughly a
               2.5-fold reduction. Similarly, the time to query reads from the 400-species sample decreases from
               145 min with 1 CPU to 16 min with 32 CPUs, roughly a 9-fold reduction. The results in Supplementary
               Tables 1 and 2 indicate that parallelization is effective in reducing computational time.
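The reported times can be restated as speedup (serial time divided by parallel time) and parallel efficiency (speedup divided by the number of CPUs). The short calculation below uses only the four running times quoted above; the helper name is illustrative.

```python
def speedup_and_efficiency(t_serial, t_parallel, n_cpus):
    """Return (speedup, parallel efficiency) for a run on `n_cpus` CPUs."""
    speedup = t_serial / t_parallel
    return speedup, speedup / n_cpus


# Running times in minutes, as quoted in the text (1 CPU vs. 32 CPUs).
runs = {
    "build Mende index": (165, 65),
    "query 400-species sample": (145, 16),
}

for task, (t1, t32) in runs.items():
    s, e = speedup_and_efficiency(t1, t32, n_cpus=32)
    print(f"{task}: speedup={s:.2f}x efficiency={e:.2f}")
# build Mende index: speedup=2.54x efficiency=0.08
# query 400-species sample: speedup=9.06x efficiency=0.28
```

By this measure, querying benefits considerably more from additional CPUs than index building does.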

               A comparison of running times between our method, MetaBIDx, and other tools (CLARK, KrakenUniq,
               Kraken2, and Centrifuge) indicates that MetaBIDx generally has longer running times across
               different samples in both the Mende and CAMI datasets. For the Mende dataset, the average running
               time per sample was 16 min for MetaBIDx, 6 min for CLARK, 2 min for Kraken2, and 7 min each for
               KrakenUniq and Centrifuge. For the CAMI dataset, the average running times were 71, 17, 17, 8, and
               16 min for MetaBIDx, CLARK, KrakenUniq, Kraken2, and Centrifuge, respectively. Supplementary Table 3
               reports detailed information on this comparison.


               DISCUSSION
               The proposed method employs Bloom filters to store unique genomic signatures and facilitates species
               indexing. It incorporates a novel strategy for reducing false positives by clustering species based on their
               “approximate” coverages derived from identified reads. We found that the method surpassed several well-
               known metagenomic tools in precision, recall, and F1-score across various datasets, particularly in complex
               microbiomes where accurate species identification is vital.
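For readers unfamiliar with the data structure, the following is a minimal, generic Bloom filter sketch. It is not the MetaBIDx implementation: the class name, the SHA-256-based hashing scheme, and the sizes used are assumptions chosen only to show how k-mer signatures can be stored and queried with no false negatives and a tunable false-positive rate.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter over strings (illustrative, not MetaBIDx's code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive `num_hashes` bit positions by hashing the item with a seed prefix.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Store a few hypothetical k-mer signatures and query them.
signatures = BloomFilter(num_bits=1 << 16, num_hashes=3)
for kmer in ["ACGTACGTAC", "TTGACCGTAA", "GGCATTCGAT"]:
    signatures.add(kmer)

print("ACGTACGTAC" in signatures)  # True: an inserted k-mer is always found
```

A query that returns `False` is always correct, while a `True` may occasionally be a false positive, which is one reason a downstream filtering step is useful.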


               The approach for reducing false positives by clustering “approximate” genome coverages notably
               enhances the prediction precision of not only MetaBIDx but also other approaches based on read
               classification, representing an advancement in addressing the prevalent issue of false positives in
               metagenomic analysis.
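The idea of separating likely true species from likely false positives by coverage can be sketched as follows. This is an assumption-laden illustration, not the clustering procedure used by MetaBIDx: it takes per-species coverage estimates as given, splits them into two groups with a simple one-dimensional 2-means, and keeps the higher-coverage group. The function names and coverage values are hypothetical.

```python
def two_means_1d(values, max_iterations=50):
    """Return (low_center, high_center) from a 1-D 2-means clustering."""
    low_c, high_c = min(values), max(values)
    for _ in range(max_iterations):
        low = [v for v in values if abs(v - low_c) <= abs(v - high_c)]
        high = [v for v in values if abs(v - low_c) > abs(v - high_c)]
        new_low = sum(low) / len(low)
        new_high = sum(high) / len(high) if high else high_c
        if (new_low, new_high) == (low_c, high_c):
            break  # assignments are stable
        low_c, high_c = new_low, new_high
    return low_c, high_c


def keep_high_coverage(coverage):
    """Keep species whose coverage is closer to the high-coverage center."""
    low_c, high_c = two_means_1d(list(coverage.values()))
    if low_c == high_c:  # all coverages identical: nothing to separate
        return set(coverage)
    return {sp for sp, c in coverage.items() if abs(c - high_c) < abs(c - low_c)}


# Hypothetical approximate coverages; sp_D and sp_E look like false positives.
coverage = {"sp_A": 12.0, "sp_B": 9.5, "sp_C": 11.0, "sp_D": 0.3, "sp_E": 0.1}
print(sorted(keep_high_coverage(coverage)))  # ['sp_A', 'sp_B', 'sp_C']
```

Because the step operates only on per-species coverage estimates, a filter of this kind could in principle be applied to the output of any read classifier.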