
Ponsero et al. Microbiome Res Rep 2023;2:27  https://dx.doi.org/10.20517/mrr.2023.26  Page 3 of 21

               Importantly, the all-vs-all comparison of an ever-growing number of metagenomic samples, each composed
               of millions of reads and billions of k-mers, provides a complex challenge in terms of computation time and
               resources. Different approaches have been used to reduce the computing costs of such large-scale analyses.
               Some tools approximate the real similarity distance between metagenomes using subsampling or sketching.
               This approach was notably used in the k-mer-based tool MASH [8], designed for an easy and fast de novo
               comparison of genomes and metagenomes. On the other hand, calculating an actual similarity distance
               between samples is possible and scalable when architectures such as High-Performance Computing or
               Hadoop clusters are used [9,10].
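
The trade-off between an exact k-mer similarity and a sketched approximation can be illustrated with a minimal sketch. It assumes a bottom-s MinHash scheme over plain (non-canonical) k-mers and uses BLAKE2b as a stand-in hash function; the function names (`kmers`, `sketch`, `jaccard_exact`, `jaccard_sketch`) and the toy sequences are hypothetical and are not taken from any of the evaluated tools.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (stand-in hash, an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(kmer_set, s):
    """Bottom-s sketch: the s smallest hash values of the k-mer set."""
    return sorted(hash64(x) for x in kmer_set)[:s]


def jaccard_exact(a, b):
    """Exact Jaccard index computed on the full k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_sketch(sk_a, sk_b, s):
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(sk_a) | set(sk_b))[:s]  # bottom-s of the union
    if not merged:
        return 0.0
    shared = set(sk_a) & set(sk_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "genome" and a copy with roughly 2% random substitutions.
    genome = "".join(rng.choice("ACGT") for _ in range(20000))
    mutated = "".join(
        rng.choice("ACGT") if rng.random() < 0.02 else base for base in genome
    )
    a, b = kmers(genome, 21), kmers(mutated, 21)
    print("exact   :", jaccard_exact(a, b))
    print("sketched:", jaccard_sketch(sketch(a, 1000), sketch(b, 1000), 1000))
```

The exact computation touches every k-mer of both samples, whereas the sketched estimate compares only s hash values per sample, which is what makes all-vs-all comparisons tractable at the cost of some estimation error.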

               In recent years, several bioinformatic tools that perform k-mer-based de novo comparative metagenomics
               have been released; however, it is not clear how these tools and metrics compare with each other. For
               biologists and domain experts to choose a tool, it is important to understand the limitations and pitfalls
               associated with each approach. To meet this need, we developed sets of simulated metagenomes that
               allowed us to (1) thoroughly assess the relationship between k-mer-based and taxonomy-based distances
               and evaluate the impact of technical and biological variables on these metrics, in particular, the effect of
               sequencing depth, sequencing technology, metagenomic contamination and community diversity;
               (2) evaluate the effect of sketching and filtering methods; and (3) provide an overview of the currently
               available tools for large-scale de novo metagenomic comparative analysis.


               METHODS
               k-mer-based tools for de novo metagenomic analysis
               Each of the tools evaluated in this study was installed from the recommended source following the authors’
               instructions. When tools were available from several sources, Bioconda was preferred due to simplified
               dependency management. Tools that could not be obtained through Bioconda were directly cloned from
               GitHub or Sourceforge.

               All tools were run in a SLURM High Performance Computing (HPC) environment. The standard running
               conditions were four cores and 24 GB of memory. If a tool had higher memory requirements, it received
               more memory but was limited to four threads to keep runtime comparisons consistent. If a tool supported
               paired-end reads, the R1 and R2 files were used; otherwise, only the R1 files were used. If a tool allowed for
               cluster computing commands, they were used, with subjobs limited to four cores and 24 GB of memory.
               Pre-processing, such as read filtering and trimming, was not included in the runtimes. A k-mer size of
               k = 31 bp was used for all tools except CAFE, which used a k-mer size of 5 bp, and all tools were run with the
               default options, excluding threads, memory, cluster computing, and k-mer size options.
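
As an illustration only, the standard running conditions described above could be expressed as a SLURM submission along the following lines. This is a sketch under stated assumptions: the helper `sbatch_command`, the job name, and the wrapper script `run_tool.sh` are hypothetical, and the actual submission scripts used in the study are not described here.

```python
def sbatch_command(script, mem_gb=24, job_name="kmer_tool"):
    """Build an sbatch argument list for one tool run.

    The CPU count is fixed at four so that runtimes stay comparable;
    only the memory request may be raised for memory-hungry tools.
    """
    return [
        "sbatch",
        f"--job-name={job_name}",
        "--cpus-per-task=4",   # four cores for every tool
        f"--mem={mem_gb}G",    # 24 GB by default
        script,                # hypothetical wrapper script for the tool
    ]


if __name__ == "__main__":
    # Standard conditions.
    print(" ".join(sbatch_command("run_tool.sh")))
    # A tool with higher memory requirements, still limited to four cores.
    print(" ".join(sbatch_command("run_tool.sh", mem_gb=64)))
```

Keeping the core count out of the function's parameters mirrors the constraint in the text: memory could be increased when needed, but the thread limit stayed constant across tools.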


               Simulated datasets
               This study leverages four distinct simulated datasets: (1) SimSet 1 to assess technical effects; (2) SimSet 2 to
               mimic low abundance contamination effects; (3) SimSet 3 to assess the impact of microbial community
               richness; and (4) SimSet 4 to assess the impact of taxonomic diversity.


               All simulated datasets were generated using InSilicoSeq v1.5.4 [11]. Briefly, this tool uses an error model of
               per-base quality (Phred) scores using Kernel Density Estimation, trained on real sequencing reads, and is
               able to generate reads with realistic quality score distributions for several sequencing platforms, including
               MiSeq, HiSeq, and NovaSeq [11,12]. All simulated metagenomes were generated from complete bacterial and
               archaeal genomes downloaded from RefSeq in November 2022 [13].