Page 6 of 16 Pham et al. Microbiome Res Rep 2024;3:25 https://dx.doi.org/10.20517/mrr.2024.01

processing speed, and the ability to create custom indexes, a vital feature for our analysis. Tools that lacked
comprehensive documentation or presented installation and experimental challenges were not considered
for this study.

After experimenting with several k-mer lengths, MetaBIDx was built with k-mers of length 31. A previous
study experimented with different k-mer sizes and observed that k-mer similarity between genomes
approximated various degrees of taxonomic similarity [43], and that a k-mer length of 31 appeared to
correspond to species-level similarity. Most tools used in the evaluation also have the default k-mer size of
31. An index with a size of 8 GB, utilizing 3 hash functions, was created for the Mende dataset, which
included 457 reference genomes. For the CAMI dataset, a more extensive index of 16 GB was built using 2
hash functions, accommodating 2,850 reference genomes. The other tools in the study also employed
k-mers of length 31 and were run with their default settings for a fair comparison. To ensure consistency
across all methods, the same collections of reference genomes were used to construct the genome libraries
or indexes. The experiments were conducted on a standardized computational setup, using a machine with
32 cores and 330 GB of RAM, and all tools were run in multi-threaded mode to utilize the full
computational capacity. The script for building the index for each tool is shared in the Supplementary Materials.
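
To make the k-mer terminology above concrete, the following minimal Python sketch extracts the set of k-mers from a sequence and computes a Jaccard similarity between two k-mer sets. It is an illustration only and not the MetaBIDx implementation: the function names `kmers` and `jaccard`, the choice to skip k-mers containing ambiguous bases, and the use of Jaccard similarity as the measure of k-mer similarity are all assumptions made for this example.

```python
from typing import Set


def kmers(seq: str, k: int = 31) -> Set[str]:
    """Return the set of length-k substrings of seq made only of A, C, G, T.

    Hypothetical helper for illustration; k defaults to 31 as in the text.
    """
    seq = seq.upper()
    valid = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid  # skip k-mers with ambiguous bases (assumption)
    }


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets; defined here as 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

With k = 31, a sequence of length n yields at most n - 30 distinct k-mers, so two genomes of the same species would be expected to share a larger fraction of their k-mer sets than two genomes from different species.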

The evaluation of prediction performance utilized three widely recognized metrics: precision, recall, and
F1-score. Precision quantifies the proportion of true positive predictions out of all positive predictions made
(sum of true positives and false positives), essentially reflecting the accuracy in predicting species as a
fraction of all species predictions. Recall measures the proportion of true positive predictions relative to the
total actual species present in the sample, indicating the method’s ability to identify all relevant species. The
F1-score, derived as the harmonic mean of precision and recall, offers a composite metric that equally
weights precision and recall, providing a single measure to assess the balance between them.
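
The three metrics can be sketched in a few lines of Python, treating the predicted and actual species of a sample as sets. This is a minimal illustration of the definitions above, not the evaluation script used in the study; the function name `evaluate` and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
from typing import Dict, Set


def evaluate(predicted: Set[str], actual: Set[str]) -> Dict[str, float]:
    """Compute precision, recall, and F1-score for a set of predicted species."""
    tp = len(predicted & actual)   # species predicted and truly present
    fp = len(predicted - actual)   # species predicted but absent
    fn = len(actual - predicted)   # species present but not predicted

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, if a tool predicts four species of which two are truly present, and the sample contains three species, precision is 2/4, recall is 2/3, and the F1-score is 4/7.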

The comparative assessment of our method, MetaBIDx, alongside other tools, was structured into two key
experiments. The first experiment focused on evaluating the ability of MetaBIDx and other tools to predict
bacterial species based on identified reads alone, which is the standard approach adopted by these tools for
species prediction in metagenomic samples. This experiment’s main objective was to show that the default
behavior of these tools could be enhanced for more accurate species identification.

In the second experiment, we aimed to facilitate a more equitable comparison by augmenting the other
methods with our strategy for reducing false positives. This was achieved by applying our technique of
clustering “approximate” coverages to each tool. The inclusion of this approach in the assessment was
intended to improve the precision of the other tools, thereby enabling a fairer and more balanced
comparison with MetaBIDx. Through this two-pronged experimental design, we sought to comprehensively
evaluate and demonstrate the effectiveness of our method in the context of metagenomic species prediction.

Additionally, we explored the effect of using high-quality k-mers at different thresholds to enhance
species prediction accuracy. This dual-phase approach was designed to provide a thorough understanding
of the capabilities and limitations of each tool in metagenomic species identification.

RESULTS
Comparative analysis of species identification
First, we compared MetaBIDx against other methods that predict species solely based on read classification.
This approach predicates species prediction on the detection of reads originating from the species in
question. If a read from a particular species is detected in a sample, that species is predicted to be present.
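
The read-based prediction rule described above can be sketched as follows. The sketch assumes that each read has already been assigned a species label (or `None` if unclassified) by some classifier; the function name `predict_species` and the `min_reads` parameter are hypothetical and are included only to show how the rule could be made stricter. With the default `min_reads=1`, a single classified read is enough to call a species present, as in the text.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def predict_species(read_labels: Iterable[Optional[str]],
                    min_reads: int = 1) -> Set[str]:
    """Predict a species as present if at least min_reads reads are assigned to it.

    read_labels holds one species label per read; None marks an unclassified read.
    """
    counts = Counter(label for label in read_labels if label is not None)
    return {species for species, n in counts.items() if n >= min_reads}
```

Under this rule, a single misclassified read produces a false-positive species call, which is why precision is sensitive to the read-detection criterion.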
