Pham et al. Microbiome Res Rep 2024;3:25  https://dx.doi.org/10.20517/mrr.2024.01  Page 3 of 16

               identification. For instance, horizontal gene transfer can result in shared genetic segments across distinct
               species, complicating the attribution of reads to specific genomes. To mitigate this challenge, alternative
               strategies like metagenomic assembly can be employed [19,20]. This approach involves assembling reads into
               longer contiguous sequences, providing more contextual information than individual reads and aiding in
               more accurate species identification. This approach, however, has its own challenges. First, it is
               computationally demanding, requiring significant processing power and memory, especially for complex or
               large metagenomic datasets. Further, its success is heavily dependent on the quality and length of reads.
               Short or poor-quality reads may lead to fragmented assemblies, reducing the ability to reconstruct complete
               genomes. Lastly, assembling genomes of low-abundance species may yield unreliable assemblies,
               leading to an underrepresentation of less abundant members of the microbial community.


               Many existing approaches for species detection in metagenomics rely on the outcomes of read classification,
               which, while common, may not optimize species identification accuracy. This paper posits that prioritizing
               species identification directly enhances accuracy by providing a more precise representation of the
               microbiome community. We introduce a novel species identification method for microbiomes that utilizes
               distinctive genomic signatures and a modified Bloom filter for indexing the genomes within a microbiome.
               To reduce false positives and enhance identification accuracy, we integrate a clustering approach, an
               unsupervised machine learning technique. This method groups species with similar genomic coverages,
               facilitating the identification of species with low coverages that might otherwise be mistaken for artifacts
               of inaccurate read detection. Our results show that this method outperforms existing techniques in terms of
               accuracy and successfully identifies a pathogen in an actual metagenomic dataset.


               METHODS
               MetaBIDx, our proposed method, consists of two stages:


               1. Index phase. This phase involves collecting reference genomes for a target microbiome and constructing
               an index from these reference genomes. The reference genomes represent the universe of species that may
               exist in a specific microbiome. The index comprises signatures of all k-mers (short sequences of length k)
               from potentially hundreds to thousands of reference genomes.

               2. Prediction phase. Here, metagenomic DNA sequences, consisting of short reads from a metagenomic
               sample, are matched against the index, built in the first phase, to ascertain their probable species origins.
               This sample of reads comes from a specific environment or host whose species are a subset of the
               universe of species whose genomes are collected in the first phase. Subsequently, the species of identified reads
               are clustered based on their approximate coverages, aiming to filter out false positive species predictions,
               leaving only the species present in the metagenomic environment.
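
               The two stages above can be illustrated with a minimal sketch. It is not the MetaBIDx implementation:
               it assumes a plain Python dictionary in place of the modified Bloom filter, treats a k-mer as a
               signature only when it occurs in exactly one reference genome, and estimates coverage as assigned
               read bases divided by genome length. The function names, the toy genomes, and the value of k are all
               hypothetical, and the coverage-based clustering step is not shown.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(genomes, k):
    """Index phase: map each k-mer to the set of species whose genome contains it."""
    index = defaultdict(set)
    for species, genome in genomes.items():
        for kmer in kmers(genome, k):
            index[kmer].add(species)
    return index


def assign_read(read, index, k):
    """Prediction phase: return the species with the most species-specific k-mer hits, or None."""
    votes = Counter()
    for kmer in kmers(read, k):
        owners = index.get(kmer, ())
        if len(owners) == 1:  # assumption: only k-mers unique to one genome count
            votes[next(iter(owners))] += 1
    return votes.most_common(1)[0][0] if votes else None


def approximate_coverage(reads, genomes, index, k):
    """Estimate per-species coverage as assigned read bases / genome length."""
    bases = Counter()
    for read in reads:
        species = assign_read(read, index, k)
        if species is not None:
            bases[species] += len(read)
    return {s: bases[s] / len(g) for s, g in genomes.items()}


# Toy data (hypothetical): two tiny "genomes" and three reads.
genomes = {"A": "ACGTACGTTAGC", "B": "TTTTGGGGCCCC"}
reads = ["ACGTTAGC", "GGGGCCCC", "AAAAAAAA"]
index = build_index(genomes, k=4)
coverage = approximate_coverage(reads, genomes, index, k=4)
print(coverage)
```

               In this sketch the third read shares no k-mer with either genome and is left unassigned. The
               per-species coverage values are the kind of input that the clustering step would then group.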

               Phase 1: Building the index of a microbiome
               MetaBIDx employs a modified Bloom filter [32] as its indexing mechanism. A Bloom filter is a space-efficient
               probabilistic data structure used for efficient membership queries. Although a Bloom filter may mistakenly
               report items that were never stored in it (false positives), it always recognizes items that were stored
               (no false negatives).
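
               For readers unfamiliar with the data structure, the following is a minimal sketch of a standard
               Bloom filter, not the modified variant used by MetaBIDx. The class name, the bit-array size, the
               number of hash functions, and the use of seeded SHA-256 digests as hash functions are assumptions
               made for illustration.

```python
import hashlib


class BloomFilter:
    """A standard Bloom filter over strings (illustrative, not the MetaBIDx variant)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        """Set every bit position derived from the item."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        """True if all derived bits are set; may be a false positive, never a false negative."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Store a few k-mers and query them back.
stored = ["ACGT", "CGTA", "GTAC", "TACG"]
bf = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in stored:
    bf.add(kmer)
print(all(kmer in bf for kmer in stored))
```

               Every stored k-mer is reported as present because its bits were set on insertion. A k-mer that was
               never stored is reported as present only if all of its bit positions happen to have been set by
               other items, which is the source of false positives.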


               Index construction for a microbiome relies on the reference genomes of species present. Without prior
               knowledge of the microbiome’s composition, one can use comprehensive bacterial and viral genomes from
               public databases for index creation. However, if specific species within the microbiome are of interest or
               there is partial knowledge about its composition, the index can be built using only those species’ reference
               genomes.