
Pham et al. Microbiome Res Rep 2024;3:25  https://dx.doi.org/10.20517/mrr.2024.01  Page 5 of 16

               crucial step involves identifying and examining the cluster with the lowest mean coverage. This cluster
               comprises species with exceptionally low coverages, attributable to misidentified unique signatures. By
               excluding these species from our predictions, we significantly reduce the incidence of false positives. We
               used K-means, a widely used clustering method, to group species with similar coverages, relying on the
               implementation provided by scikit-learn [33,34].
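The filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MetaBIDx implementation: the paper uses scikit-learn's K-means, whereas this sketch substitutes a small 1-D Lloyd's iteration so that it runs with the Python standard library alone. The function names, the choice of k = 2, and the coverage values are all hypothetical.

```python
def kmeans_1d(values, k, max_iter=100):
    """Plain Lloyd's algorithm on scalar values (a stand-in for scikit-learn's KMeans)."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic initialisation: spread the starting centroids over the sorted values.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assign every value to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are final
        labels = new_labels
        # Move each non-empty cluster's centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


def drop_lowest_coverage_cluster(coverage, k=2):
    """Return the species that are NOT in the cluster with the lowest mean coverage."""
    names = list(coverage)
    values = [coverage[name] for name in names]
    labels, centroids = kmeans_1d(values, k)
    used = set(labels)
    if len(used) < 2:
        return names  # only one cluster formed, so there is nothing to separate
    lowest = min(used, key=lambda c: centroids[c])
    return [name for name, lab in zip(names, labels) if lab != lowest]


# Hypothetical per-species coverages (fraction of unique signatures matched).
coverage = {
    "species_A": 0.82,
    "species_B": 0.75,
    "species_C": 0.91,
    "species_D": 0.03,
    "species_E": 0.05,
}
kept = drop_lowest_coverage_cluster(coverage)
print(kept)  # ['species_A', 'species_B', 'species_C']
```

In this toy input, the two species with near-zero coverage form the lowest-mean cluster and are removed, which mirrors how excluding that cluster reduces false positives.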

               Data collection and preparation
               To evaluate the proposed method, we employed metagenomic shotgun sequencing data without imposing
               data quality constraints or specific requirements. Three widely recognized datasets were utilized for this
               assessment, comprising two mock community datasets and one derived from a human sample.


               Mende Dataset: this dataset (available at https://swifter.embl.de/~mende/simulated_data) comprises three
               metagenomic samples. These samples are distinguished by their species complexity, featuring 10, 100, and
               400 species, respectively. Each sample contains 75 bp long reads. The number of reads varies from
               26,665,674 to 26,667,004 pairs. This dataset, originally used in a study on metagenomic assembly [35], was
               constructed using simulated Illumina sequencing errors and quality values, reflecting the characteristics of
               actual metagenomic data.


               CAMI Challenge Dataset: this dataset was obtained from the CAMI challenge [36] (accessible at
               https://data.cami-challenge.org). It was also used in another benchmark [37]. It includes eight metagenomic samples
               representing a gradient of complexity: low (RL_S001), medium (RM_S001, RM_S002), and high (RH_S001,
               RH_S002, RH_S003, RH_S004, RH_S005). These samples are characterized by experimental conditions and
               features akin to real datasets, such as the inclusion of multiple, closely related strains, the presence of
               plasmid and viral sequences, and realistic abundance distributions. The reads in this dataset are 150 bp in
               length. The number of reads varies from 49,898,179 to 49,905,935 pairs, from low complexity to high
               complexity.


               PT-8 (S2): this sample was used in a study [38]. It was derived from brain tissue biopsies of a 67-year-old
               patient with osteomyelitis, lung disease, and multifocal brain and spinal lesions, who was diagnosed with a
               Mycobacterium tuberculosis infection. The PT-8 (S2) sample, with human reads excluded, can be accessed at the
               NCBI SRA repository (https://www.ncbi.nlm.nih.gov/sra/SRX1621515). This sample consists of bacterial,
               viral, and fungal species.

               Reference Genomes: the compared methods use reference genomes to build indices for species identification. For the
               Mende dataset, we collected the reference genomes from which the reads in all three metagenomic samples originate. This
               resulted in a set of 457 bacterial genomes, averaging approximately 3.5 Mbp each, with a cumulative size of
               around 1.6 GB. In contrast, the reference genomes for the CAMI dataset were more extensive,
               encompassing 2,850 bacterial genomes. This collection included all bacterial genomes present in the reads
               from the eight samples, along with genomes of closely related strains within the same species or subspecies.
               The average genome size in this collection was about 5.7 Mbp, for a total size of approximately
               16 GB. The genome collection for the CAMI dataset was used in the experiment with the real dataset, as it
               also contains Mycobacterium tuberculosis. All reference genomes were downloaded from NCBI.
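As a quick sanity check, the reported totals are consistent with the genome counts and average sizes given above. The sketch below assumes roughly one byte per base in uncompressed sequence files, so that 1 Gbp corresponds to about 1 GB; that equivalence and the helper name are assumptions made for illustration.

```python
def total_size_gbp(n_genomes, mean_size_mbp):
    """Cumulative collection size in Gbp from a genome count and mean genome size in Mbp."""
    return n_genomes * mean_size_mbp / 1000


# Figures quoted in the text: 457 genomes at ~3.5 Mbp and 2,850 genomes at ~5.7 Mbp.
mende_total_gbp = total_size_gbp(457, 3.5)
cami_total_gbp = total_size_gbp(2850, 5.7)

print(f"Mende collection: ~{mende_total_gbp:.1f} Gbp")
print(f"CAMI collection:  ~{cami_total_gbp:.1f} Gbp")
```

The products come to roughly 1.6 Gbp and 16 Gbp, in line with the quoted 1.6 GB and 16 GB.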

               Comparative analysis of different methods for species prediction
               We conducted a comprehensive comparative analysis of our tool, MetaBIDx, and various metagenomic
               tools that can be used for species detection, including CLARK [24], Kraken2 [39], KrakenUniq [40],
               Centrifuge [41], and Sourmash [42]. These tools were selected based on their robustness, documented accuracy,