Pham et al. Microbiome Res Rep 2024;3:25 https://dx.doi.org/10.20517/mrr.2024.01 Page 5 of 16
crucial step involves identifying and examining the cluster with the lowest mean coverage. This cluster
comprises species with exceptionally low coverages, attributable to misidentified unique signatures. By
excluding these species from our predictions, we significantly reduce the incidence of false positives. We
employed a popular clustering method, K-means, to cluster species with similar coverages. The
implementation of K-means was provided by scikit-learn [33,34].
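The filtering step described above can be sketched as follows. This is a minimal illustration, not the MetaBIDx implementation: the study used the K-means implementation from scikit-learn, whereas this sketch substitutes a small pure-Python one-dimensional Lloyd's iteration so that it runs without dependencies. The function names, the number of clusters (k = 2), the deterministic initialisation, and the coverage values are all assumptions made for illustration.

```python
from statistics import mean


def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on scalar values; returns (labels, centroids).

    Stand-in for a library K-means (assumption: the real pipeline
    calls scikit-learn instead of this helper).
    """
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids over the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(mean(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def drop_lowest_coverage_cluster(coverage, k=2):
    """Return the species that are NOT in the cluster with the lowest mean coverage."""
    species = list(coverage)
    values = [coverage[s] for s in species]
    labels, centroids = kmeans_1d(values, k)
    # Only clusters that actually hold species are candidates for removal.
    lowest = min(set(labels), key=lambda c: centroids[c])
    return [s for s, lab in zip(species, labels) if lab != lowest]


# Hypothetical per-species coverages (not taken from the paper).
coverage = {"sp_a": 0.92, "sp_b": 0.88, "sp_c": 0.85, "sp_d": 0.03, "sp_e": 0.05}
kept = drop_lowest_coverage_cluster(coverage)
# kept == ["sp_a", "sp_b", "sp_c"]
```

With scikit-learn available, `sklearn.cluster.KMeans` fitted on the coverages reshaped to a single column would play the role of `kmeans_1d`. Note that this sketch always discards one cluster, so it assumes that at least some low-coverage false positives are present.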
Data collection and preparation
To evaluate the proposed method, we employed metagenomic shotgun sequencing data without imposing
data quality constraints or specific requirements. Three widely recognized datasets were utilized for this
assessment, comprising two mock community datasets and one derived from a human sample.
Mende Dataset: this dataset (available at https://swifter.embl.de/~mende/simulated_data) comprises three
metagenomic samples. These samples are distinguished by their species complexity, featuring 10, 100, and
400 species, respectively. Each sample contains 75 bp long reads. The number of reads varies from
26,665,674 to 26,667,004 pairs. This dataset, originally used in a study on metagenomic assembly [35], was
constructed using simulated Illumina sequencing errors and quality values, reflecting the characteristics of
actual metagenomic data.
CAMI Challenge Dataset: this dataset was obtained from the CAMI challenge (accessible at
https://data.cami-challenge.org) [36]. It was also used in another benchmark [37]. It includes eight metagenomic samples
representing a gradient of complexity: low (RL_S001), medium (RM_S001, RM_S002), and high (RH_S001,
RH_S002, RH_S003, RH_S004, RH_S005). These samples are characterized by experimental conditions and
features akin to real datasets, such as the inclusion of multiple, closely related strains, the presence of
plasmid and viral sequences, and realistic abundance distributions. The reads in this dataset are 150 bp in
length. The number of reads varies from 49,898,179 to 49,905,935 pairs, from low complexity to high
complexity.
PT-8 (S2): this sample was used in a previous study [38]. It was derived from brain tissue biopsies of a 67-year-old
patient with osteomyelitis, lung disease, and multifocal brain and spinal lesions, who was diagnosed with a
Mycobacterium tuberculosis infection. The PT-8 (S2) sample, with human reads excluded, can be accessed at the
NCBI SRA repository (https://www.ncbi.nlm.nih.gov/sra/SRX1621515). This sample consists of bacterial,
viral, and fungal species.
Reference Genomes: the evaluated methods use reference genomes to build indices for species identification. For the
Mende dataset, we collected the reference genomes from which the reads in all three metagenomic samples originate. This
resulted in a set of 457 bacterial genomes, averaging approximately 3.5 Mbp each, with a cumulative size of
around 1.6 GB. In contrast, the reference genomes for the CAMI dataset were more extensive,
encompassing 2,850 bacterial genomes. This collection included all bacterial genomes present in the reads
from the eight samples, along with genomes of closely related strains within the same species or subspecies.
The average genome size in this collection was about 5.7 Mbp, for a total size of approximately
16 GB. The genome collection for the CAMI dataset was also used in the experiment with the real dataset, as it
contains Mycobacterium tuberculosis. All reference genomes were downloaded from NCBI.
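As a quick consistency check, the reported collection sizes follow from the genome counts and mean genome sizes given above. The sketch below assumes roughly one byte per base in uncompressed FASTA, so that gigabase pairs and gigabytes are comparable; the helper name is hypothetical.

```python
def total_size_gbp(n_genomes, mean_size_mbp):
    """Total collection size in Gbp from a genome count and a mean size in Mbp."""
    return n_genomes * mean_size_mbp / 1000


# Counts and mean sizes are the figures reported in the text.
mende_total = total_size_gbp(457, 3.5)   # roughly 1.6 Gbp
cami_total = total_size_gbp(2850, 5.7)   # roughly 16.2 Gbp
```

Both values agree with the approximate totals of 1.6 GB and 16 GB stated for the two collections.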
Comparative analysis of different methods for species prediction
We conducted a comprehensive comparative analysis of our tool, MetaBIDx, and various metagenomic
tools that can be used for species detection, including CLARK [39], Kraken2 [40], KrakenUniq [24],
Centrifuge [41], and Sourmash [42]. These tools were selected based on their robustness, documented accuracy,