Page 2 of 16 Pham et al. Microbiome Res Rep 2024;3:25 https://dx.doi.org/10.20517/mrr.2024.01
the field. The study lays the groundwork for future improvements in computational efficiency and the expansion of
microbial databases.
Keywords: Bacteria identification, metagenomics, species identification, bloom filter, clustering
INTRODUCTION
Advances in next-generation sequencing technologies have reduced both cost and sequencing errors [1],
enabling large-scale analyses of metagenomic data to help understand the microbial composition of
environments like the human gut [2,3]. This understanding can provide insights into various disorders and
diseases including diabetes [4,5], depression [6,7], rheumatoid arthritis [8], and gout [9]. Dysbiosis, or microbial
imbalance, is not only linked to gastrointestinal disorders [3] but can also affect the respiratory system [10].
Key processes for analyzing microbial communities include read classification, profiling, and species
identification. Read classification uses computational algorithms and a reference database to assign
metagenomic reads to specific groups or organisms. Profiling assesses the relative abundance of different
organisms in a sample, providing crucial environmental insights [11-13]. Species identification, particularly
important in clinical metagenomics, determines the organisms present in a sample and is essential for
diagnosing infections caused by specific pathogens. Despite their importance, these processes can be
challenging due to the vast amount of information they require.
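To make the distinction between read classification and profiling concrete, the following is a minimal sketch, assuming a toy exact-match k-mer index and a simple majority vote. It is not the method of any tool discussed in this paper; the function names (`build_index`, `classify_read`, `profile`), the toy sequences, and the tie-breaking rule are all illustrative assumptions.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of reference names that contain it."""
    index = {}
    for name, genome in references.items():
        for kmer in kmers(genome, k):
            index.setdefault(kmer, set()).add(name)
    return index


def classify_read(read, index, k):
    """Assign a read to the reference sharing the most k-mers with it.

    Returns None when no k-mer matches or when the top two references tie
    (an assumed, deliberately conservative rule).
    """
    votes = Counter()
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            votes[name] += 1
    ranked = votes.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def profile(reads, index, k):
    """Relative abundance of each reference among the classified reads."""
    labels = (classify_read(read, index, k) for read in reads)
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


# Hypothetical toy data: two short "genomes" and four reads.
references = {"species_a": "ACGTACGTTAGC", "species_b": "TTGACCGGTATC"}
index = build_index(references, k=4)
reads = ["ACGTACGT", "CGTTAGC", "GACCGGTA", "GGGGGGGG"]
abundances = profile(reads, index, k=4)
```

In this sketch, species identification would amount to reporting the keys of `abundances`; real tools add thresholds and taxonomic context that are omitted here.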
Various techniques exist for metagenomic analysis, including read alignment to reference genomes [14],
taxonomically informative gene marker analysis [15], clustering metagenomic sequences [16-18], assembling
sequences [19,20], using unique characteristics of the 16S rRNA genes [16,21], and using k-mers [22-25].
Alignment-based approaches are accurate but time-consuming, while k-mer-based approaches achieve a better
balance between performance and runtime [26]. CLARK [24], a read classifier, assigns reads to the targets with
the most distinguishing k-mers, and stands out for its efficiency and speed in read classification, making it
suitable for extensive datasets. DUDes [14], a taxonomic profiler, identifies candidate organisms by iteratively
comparing read mapping strength at each node of the taxonomic tree, and is effective in single and multiple
organism detection, excelling in scenarios with unevenly represented references. Kraken [25] creates a
database of k-mers and their corresponding common ancestors, then uses exact-match database queries of
k-mers for rapid processing. Kraken achieved high precision and sensitivity at the genus level while also
standing out for its accuracy and speed. MetaCache [23], a k-mer-based read classifier, uses minhashing and
context-aware k-mer matches, significantly reducing memory requirements while maintaining high sensitivity
and precision. GSM [22], another profiler, builds an index using genomic markers and computes abundances
with linear equations, and also shows high accuracy in comparison to other tools. Several approaches use a
Bloom filter, a specialized data structure for information retrieval. FACS [27] uses a Bloom filter to classify
DNA sequences. MetaProFi [28] uses a Bloom filter to build indexes of amino acid sequences, providing a
memory-efficient and storage-efficient solution for protein sequence comparison. Bloom filters have recently
been used to index large collections of short-read sequencing data. BIGSI [29] and COBS [30] use multiple
Bloom filters to index k-mers in a way that attempts to limit cache misses during query. Kmtricks [31] also
used Bloom filters to index terabase-sized collections of sequencing data.
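As an illustration of how a Bloom filter can index k-mers, the following is a minimal sketch, assuming a fixed-size bit array and double hashing derived from SHA-256. It does not reproduce the implementation of any tool cited above; the class and function names, the filter size, and the number of hash functions are illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive num_hashes bit positions from two 64-bit
        # halves of one SHA-256 digest (an assumed, simple hashing scheme).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def index_kmers(genome, k, num_bits=4096, num_hashes=3):
    """Insert every k-mer of a genome into a new Bloom filter."""
    bloom = BloomFilter(num_bits, num_hashes)
    for kmer in kmers(genome, k):
        bloom.add(kmer)
    return bloom


def hit_fraction(read, bloom, k):
    """Fraction of a read's k-mers reported as present in the filter."""
    read_kmers = list(kmers(read, k))
    if not read_kmers:
        return 0.0
    return sum(kmer in bloom for kmer in read_kmers) / len(read_kmers)


# Hypothetical toy genome and a read drawn from it.
genome_filter = index_kmers("ACGTACGTTAGCCGATAGGCTA", k=5)
score = hit_fraction("CGTTAGCCGA", genome_filter, k=5)
```

Because a Bloom filter stores only bits, a membership query can return a false positive but never a false negative, which is the trade-off that makes it memory-efficient for large k-mer collections.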
Most methods rely on read classification for species identification [24,25]. This process typically involves
mapping sequenced reads against a reference database or index, and assigning reads to species based on
how they are matched to the database or index. This approach, however, faces challenges caused by
sequencing errors, mutations, horizontal gene transfer, or strain-level variation, which impact species