Page 2 of 21                Ponsero et al. Microbiome Res Rep 2023;2:27  https://dx.doi.org/10.20517/mrr.2023.26

               of low amounts of sequence contamination and sequencing error was limited. Finally, we benchmarked currently
               available  de-novo  comparative  metagenomic  tools  and  compared  their  output  on  two  datasets  of  fecal
               metagenomes and showed that most k-mer-based tools were able to recapitulate the data structure observed
               using taxonomic approaches.

               Conclusion: This study expands our understanding of the strengths and limitations of k-mer-based de novo
               comparative metagenomic approaches and aims to provide concrete guidelines for researchers interested in
               applying these approaches to their metagenomic datasets.

               Keywords: De-novo comparative metagenomics, metagenomes, k-mers



               INTRODUCTION
               The advent of modern metagenomics has led to the generation of massive amounts of genomic data
               allowing the characterization of microbes’ diversity and their function in ecosystems. Comparative
               metagenomics aims to explore the similarities and differences of microbial communities by comparing
               metagenomes to one another. These studies generally measure the distance between each pair of
               metagenomes in order to investigate the impact of an ecological condition on the composition of microbial
               communities. By computing distances between communities, comparative metagenomic tools also provide
               a way to cluster similar metagenomes together or, on the contrary, distinguish distinct communities. These
               techniques can be used to retrieve similar metagenomes using a query metagenome or to classify a
               metagenome based on its characteristics. For whole genome shotgun (WGS) datasets, these comparisons can be
               achieved by measuring the similarity of the samples in terms of their taxonomic or functional diversity.
               These approaches require the annotation of metagenomic datasets using taxonomic or functional reference
               databases. On the other hand, de novo comparative metagenomic approaches compare metagenomic
               samples based on their sequence content only. Using these approaches, the similarity between datasets is
               measured by evaluating the proportion of shared sequences across the entire dataset, in contrast to
               reference-based methods, which can be limited by incomplete or biased reference databases (reviewed in
               Comin et al. 2021[1]).
               Historically, de novo comparative metagenomic tools using WGS have relied on two distinct approaches:
               read-based and k-mer-based comparisons. While the first studies used alignment-based algorithms such as
               BLAST[2] for comparing reads to one another, the ever-increasing size and number of metagenomic datasets
               quickly required more computationally efficient algorithms. As a result, several approaches emerged to
               retrieve the number of shared reads between two samples and compute a distance based on this measure.
               Compareads[3] and its successor Commet[4] approximate the read similarity between each pair of
               metagenomes to estimate the number of shared reads. However, these algorithms are computationally
               intensive and are difficult to scale to modern-size metagenomic datasets.
               Instead of comparing datasets at the read level, another approach is to consider the dataset using a
               bag-of-words model, where a metagenomic dataset can be considered as a text composed of DNA words of
               length k (referred to as k-mers). This approach relies on three core tenets: (1) closely related
               organisms share k-mer profiles and cluster together, making taxonomic assignment unnecessary[5];
               (2) k-mer frequency is correlated with the abundance of an organism[6]; and (3) k-mers of sufficient
               length can be used to distinguish specific organisms[7]. Hence, k-mer spectra can be used to
               differentiate between samples. The simplest and most effective approach is the comparison of
               metagenomic datasets by calculation of pairwise distances between datasets on the basis of their
               k-mer composition. These approaches first count the k-mers in each dataset using different
               algorithms, then calculate a dissimilarity metric between pairs of samples based on their k-mer
               count frequencies.
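The two steps described above (count k-mers, then compute a pairwise dissimilarity from their frequencies) can be illustrated with a minimal sketch. It is not the implementation of any tool discussed in this article. The function names, the toy reads, and the choice of Bray-Curtis dissimilarity on relative frequencies are assumptions made for illustration. Real tools use far more memory-efficient counting, and many also merge each k-mer with its reverse complement, which this sketch does not do.

```python
from collections import Counter
from itertools import combinations


def count_kmers(reads, k):
    """Count every overlapping k-mer in a list of reads (hypothetical helper).

    k-mers containing characters other than A, C, G, T are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity on relative k-mer frequencies.

    Returns 0.0 for identical profiles and 1.0 when no k-mer is shared.
    Using relative frequencies is an illustrative choice that removes the
    effect of sequencing depth.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("each sample needs at least one valid k-mer")
    shared = sum(
        min(counts_a[kmer] / total_a, counts_b[kmer] / total_b)
        for kmer in counts_a.keys() & counts_b.keys()
    )
    return 1.0 - shared


def pairwise_distances(samples, k):
    """Return {(name_1, name_2): dissimilarity} for every pair of samples."""
    profiles = {name: count_kmers(reads, k) for name, reads in samples.items()}
    return {
        (a, b): bray_curtis(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


if __name__ == "__main__":
    # Toy "metagenomes": made-up reads, for illustration only.
    toy_samples = {
        "sample_1": ["ACGTACGT"],
        "sample_2": ["ACGTACGT"],
        "sample_3": ["TTTTGGGG"],
    }
    for pair, distance in pairwise_distances(toy_samples, k=4).items():
        print(pair, round(distance, 3))
```

The resulting pairwise values can be arranged into a distance matrix and passed to any standard clustering or ordination method.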