Pham et al. Microbiome Res Rep 2024;3:25 https://dx.doi.org/10.20517/mrr.2024.01 Page 3 of 16
identification. For instance, horizontal gene transfer can result in shared genetic segments across distinct
species, complicating the attribution of reads to specific genomes. To mitigate this challenge, alternative
strategies like metagenomic assembly can be employed [19,20]. This approach involves assembling reads into
longer contiguous sequences, providing more contextual information than individual reads and aiding in
more accurate species identification. This approach, however, has its own challenges. First, it is
computationally demanding, requiring significant processing power and memory, especially for complex or
large metagenomic datasets. Further, its success is heavily dependent on the quality and length of reads.
Short or poor-quality reads may lead to fragmented assemblies, reducing the ability to reconstruct complete
genomes. Lastly, assembling genomes of low-abundance species might create an unreliable assembly,
leading to an underrepresentation of less abundant members of the microbial community.
Many existing approaches for species detection in metagenomics rely on the outcomes of read classification,
which, while common, may not optimize species identification accuracy. This paper posits that prioritizing
species identification directly enhances accuracy by providing a more precise representation of the
microbiome community. We introduce a novel species identification method for microbiomes that utilizes
distinctive genomic signatures and a modified Bloom filter for indexing the genomes within a microbiome.
To reduce false positives and enhance identification accuracy, we integrate a clustering approach, an
unsupervised machine learning technique. This method groups species with similar genomic coverages,
facilitating the identification of species with low coverages that might otherwise be mistaken for artifacts due
to inaccurate read detection. Our results show that this method outperforms existing techniques in terms of
accuracy and successfully identifies a pathogen in an actual metagenomic dataset.
METHODS
MetaBIDx, our proposed method, consists of two stages:
1. Index phase. This phase involves collecting reference genomes for a target microbiome and constructing
an index from these reference genomes. The reference genomes represent the universe of species that may
exist in a specific microbiome. The index comprises signatures of all k-mers (short sequences of length k)
from potentially hundreds to thousands of reference genomes.
2. Prediction phase. Here, metagenomic DNA sequences, consisting of short reads from a metagenomic
sample, are matched against the index, built in the first phase, to ascertain their probable species origins.
This sample of reads comes from a specific environment or host containing species that form a subset of the
universe of species whose genomes are collected in the first phase. Subsequently, species of identified reads
are clustered based on their approximate coverages, aiming to filter out false positive species predictions,
leaving only the species present in the metagenomic environment.
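The two phases can be illustrated with a toy sketch. It is not the MetaBIDx implementation: a plain dictionary stands in for the modified Bloom filter, only k-mers found in exactly one genome are treated as signatures, and the function names (`build_index`, `assign_read`, `approximate_coverage`), the value of `k`, and the `min_hits` voting threshold are all assumptions made for illustration. The sketch stops at the approximate coverages that a clustering step would take as input; the clustering itself is not shown.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(genomes, k):
    """Index phase (toy): map each k-mer seen in exactly one genome to its species."""
    owners = defaultdict(set)
    for species, genome in genomes.items():
        for kmer in kmers(genome, k):
            owners[kmer].add(species)
    return {kmer: next(iter(s)) for kmer, s in owners.items() if len(s) == 1}


def assign_read(read, index, k, min_hits=2):
    """Prediction phase (toy): majority vote over the read's indexed k-mers."""
    votes = Counter(index[km] for km in kmers(read, k) if km in index)
    if not votes:
        return None  # no indexed k-mer matched this read
    species, hits = votes.most_common(1)[0]
    return species if hits >= min_hits else None


def approximate_coverage(reads, genomes, index, k):
    """Approximate coverage = total assigned read bases / genome length."""
    bases = Counter()
    for read in reads:
        species = assign_read(read, index, k)
        if species is not None:
            bases[species] += len(read)
    return {sp: bases[sp] / len(g) for sp, g in genomes.items()}


# Hypothetical toy data: two short "genomes" and three reads.
genomes = {"A": "ACGTACGGTCATGCA", "B": "TTGACCGATTAGCCA"}
reads = ["ACGTACGG", "GGTCATGC", "CGATTAGC"]
index = build_index(genomes, k=4)
coverage = approximate_coverage(reads, genomes, index, k=4)
```

In this toy input, two reads are assigned to species A and one to species B, so A receives roughly twice the coverage of B. Real genomes share many k-mers, which is why the choice of signatures and the later filtering step matter.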
Phase 1: Building the index of a microbiome
MetaBIDx employs a modified Bloom filter as its indexing mechanism. A Bloom filter [32] is a space-efficient
probabilistic data structure used for efficient membership queries. Although a Bloom filter may report an
item that was never stored as present (a false positive), it never fails to recognize an item that was stored
(it produces no false negatives).
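The behavior described above can be seen in a minimal sketch of a standard Bloom filter. This is the textbook structure, not the modified variant used by MetaBIDx, whose modifications are not described in this section. The class name, the bit-array size, the number of hash functions, and the use of salted BLAKE2b hashes are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter over strings (illustrative, not the MetaBIDx variant)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        # Storing an item sets every one of its bit positions to 1.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # An item is reported present only if all of its bit positions are set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Hypothetical usage: store a few k-mers, then query them.
bf = BloomFilter(num_bits=1024, num_hashes=3)
stored = ["ACGT", "CGTA", "GTAC"]
for kmer in stored:
    bf.add(kmer)

found = [kmer in bf for kmer in stored]  # stored items are always reported present
```

Because bits are only ever set and never cleared, a stored item is always found. A query for an item that was never added returns true only if other items happen to have set all of its bit positions, which is the source of false positives; a larger bit array makes this less likely.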
Index construction for a microbiome relies on the reference genomes of species present. Without prior
knowledge of the microbiome’s composition, one can use comprehensive bacterial and viral genomes from
public databases for index creation. However, if specific species within the microbiome are of interest or
there is partial knowledge about its composition, the index can be built using only those species’ reference
genomes.