
Page 2 of 16                  Pham et al. Microbiome Res Rep 2024;3:25  https://dx.doi.org/10.20517/mrr.2024.01
               the field. The study lays the groundwork for future improvements in computational efficiency and the expansion of
               microbial databases.

               Keywords: Bacteria identification, metagenomics, species identification, bloom filter, clustering



               INTRODUCTION
               Advances in next-generation sequencing technologies have reduced both cost and sequencing errors,
               enabling large-scale analyses of metagenomic data [1] to help understand the microbial composition of
               environments like the human gut. This understanding can provide insights into various disorders and
               diseases [2,3], including diabetes [4,5], depression [6,7], rheumatoid arthritis [8], and gout [9].
               Dysbiosis, or microbial imbalance, is not only linked to gastrointestinal disorders [10] but can also
               affect the respiratory system [3].
               Key processes for analyzing microbial communities include read classification, profiling, and species
               identification. Read classification uses computational algorithms and a reference database to assign
               metagenomic reads to specific groups or organisms. Profiling assesses the relative abundance of different
               organisms in a sample, providing crucial environmental insights [11-13]. Species identification, particularly
               important in clinical metagenomics, determines the organisms present in a sample and is essential for
               diagnosing infections caused by specific pathogens. Despite their importance, these processes can be
               challenging due to the vast amount of information they require.
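
               The relationship between read classification and profiling can be illustrated with a minimal sketch.
               It assumes a toy reference database of two made-up sequences and assigns each read to the reference
               sharing the most exact k-mers; this voting rule, the function names (`kmers`, `classify_read`,
               `profile`), and the sequences are all assumptions chosen for illustration, not the method of any
               specific tool.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_read(read, reference_kmers, k):
    """Assign a read to the reference sharing the most k-mers.

    Returns None when no k-mer of the read occurs in any reference.
    Ties are broken by dictionary order (an arbitrary choice).
    """
    read_kmers = kmers(read, k)
    hits = {name: len(read_kmers & ref) for name, ref in reference_kmers.items()}
    best = max(hits, key=hits.get, default=None)
    if best is None or hits[best] == 0:
        return None
    return best


def profile(reads, reference_kmers, k):
    """Relative abundance of each reference among the classified reads."""
    counts = Counter(classify_read(r, reference_kmers, k) for r in reads)
    counts.pop(None, None)  # unclassified reads do not contribute
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


# Hypothetical toy data: two short "genomes" and four reads.
K = 4
references = {
    "species_A": kmers("ACGTACGTGGCCTTAA", K),
    "species_B": kmers("TTGACCATGCAAGGTC", K),
}
reads = ["ACGTACGT", "GGCCTTAA", "CATGCAAG", "NNNNNNNN"]

abundance = profile(reads, references, K)
print(abundance)  # species_A gets 2 of 3 classified reads, species_B gets 1 of 3
```

               Species identification would then amount to reporting which references appear in the profile at all;
               in practice, the size of real reference databases is what makes each of these steps expensive.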

               Various techniques exist for metagenomic analysis, including read alignment to reference genomes [14],
               analyzing taxonomically informative gene markers [15], clustering metagenomic sequences [16-18], assembling
               sequences [19,20], using unique characteristics of the 16S rRNA genes [16,21], and using k-mers [22-25].
               Alignment-based approaches are accurate but time-consuming, while k-mer-based approaches achieve a better
               balance between performance and runtime [26]. CLARK [24], a read classifier, assigns reads to targets with the most
               distinguishing k-mers, and stands out for its efficiency and speed in read classification, making it suitable
               for extensive datasets. DUDes [14], a taxonomic profiler, identifies candidate organisms by comparing read
               mapping strength in each node of the taxonomic tree iteratively, and demonstrates effectiveness in single
               and multiple organism detection, excelling in scenarios with unevenly represented references. Kraken [25]
               creates a database with k-mers and corresponding common ancestors, then utilizes exact-match database
               queries of k-mers for rapid processing. Kraken achieved high precision and sensitivity at the genus level
               while also standing out for its accuracy and speed. MetaCache [23], a k-mer-based read classifier, uses a
               technique known as minhashing and context-aware k-mer matches, significantly reducing memory
               requirements while maintaining high sensitivity and precision. GSM [22], another profiler, builds an index
               using genomic markers and computes abundances with linear equations, and shows high accuracy in
               comparison to other tools. Several approaches utilize a specialized data structure for information retrieval
               known as a Bloom filter. FACS [27] uses a Bloom filter to classify DNA sequences. MetaProFi [28] uses a Bloom
               filter to build indexes of amino acid sequences to provide a memory-efficient and storage-efficient solution
               for protein sequence comparison. Bloom filters have recently been used to index large collections of
               short-read sequencing data. BIGSI [29] and COBS [30] use multiple Bloom filters to index k-mers in a way that
               attempts to limit cache misses during queries. Kmtricks [31] also uses Bloom filters to index terabase-sized
               collections of sequencing data.
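
               To make the Bloom filter idea concrete, the following is a minimal sketch of indexing the k-mers of
               a sequence in a single Bloom filter and querying a read against it. It is not the implementation used
               by FACS, MetaProFi, BIGSI, COBS, or Kmtricks; the class and function names, the filter size, the number
               of hash functions, and the double-hashing scheme built on SHA-256 are all assumptions made for
               illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with num_hashes hash positions per item."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all positions from two 64-bit halves
        # of a single SHA-256 digest (an illustrative choice).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """Return all length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_filter(genome, k, num_bits=1 << 16, num_hashes=3):
    """Insert every k-mer of genome into a new Bloom filter."""
    bf = BloomFilter(num_bits, num_hashes)
    for km in kmers(genome, k):
        bf.add(km)
    return bf


def match_fraction(read, bf, k):
    """Fraction of the read's k-mers reported as present in the filter."""
    kms = kmers(read, k)
    if not kms:
        return 0.0
    return sum(km in bf for km in kms) / len(kms)


# Hypothetical toy genome; a read cut from it matches on every k-mer.
genome = "ACGTACGTGGCCTTAAGGCTA"
bf = build_filter(genome, k=5)
print(match_fraction(genome[3:15], bf, k=5))  # 1.0, since a Bloom filter has no false negatives
```

               The design trade-off is that a Bloom filter never misses an inserted k-mer but may report a k-mer
               that was never inserted; the false-positive rate depends on the number of bits, the number of hash
               functions, and the number of inserted items.
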
               Most methods rely on read classification for species identification [24,25]. This process typically involves
               mapping sequenced reads against a reference database or index, and assigning reads to species based on
               how they are matched to the database or index. This approach, however, faces challenges caused by
               sequencing errors, mutations, horizontal gene transfer, or strain-level variation, which impact species