Ponsero et al. Microbiome Res Rep 2023;2:27 https://dx.doi.org/10.20517/mrr.2023.26 Page 3 of 21
Importantly, the all-vs-all comparison of an ever-growing number of metagenomic samples, each composed
of millions of reads and billions of k-mers, poses a complex challenge in terms of computation time and
resources. Different approaches have been used to reduce the computing costs of such large-scale analyses.
Some tools approximate the real similarity distance between metagenomes using subsampling or sketching.
This approach was notably used in the k-mer-based tool MASH [8], designed for easy and fast de novo
comparison of genomes and metagenomes. On the other hand, calculating an actual similarity distance
between samples is possible and scalable when architectures such as High-Performance Computing or
Hadoop clusters are used [9,10].
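To illustrate the difference between an exact k-mer similarity and a sketch-based approximation, the following is a minimal sketch in Python. It assumes a bottom-s MinHash scheme and a Jaccard-to-distance transform in the style of MASH. The hash function, the sketch size, the toy sequences, and all function names are illustrative assumptions and do not reproduce the implementation of any tool evaluated here.

```python
import hashlib
import math
import random


def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hash(kmer):
    """Stable 64-bit hash of a k-mer (illustrative choice of hash function)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def exact_jaccard(a, b):
    """Exact Jaccard similarity between two non-empty k-mer sets."""
    return len(a & b) / len(a | b)


def bottom_sketch(kmer_set, s):
    """Bottom-s sketch: the s smallest hash values of the k-mer set."""
    return sorted(kmer_hash(x) for x in kmer_set)[:s]


def sketch_jaccard(sk_a, sk_b, s):
    """Estimate the Jaccard similarity from two bottom-s sketches.

    The s smallest hashes of the merged sketches act as a random sample
    of the union; the fraction found in both sketches is the estimate.
    """
    merged = sorted(set(sk_a) | set(sk_b))[:s]
    shared = set(sk_a) & set(sk_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """MASH-style distance derived from a Jaccard estimate j and k-mer size k."""
    if j <= 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k


if __name__ == "__main__":
    random.seed(0)
    k, s = 31, 1000  # k as in this study; s is an arbitrary sketch size
    ref = "".join(random.choice("ACGT") for _ in range(20000))
    # Toy "second sample": the same sequence with about 1% substitutions.
    alt = "".join(random.choice("ACGT") if random.random() < 0.01 else c for c in ref)
    a, b = kmers(ref, k), kmers(alt, k)
    j_exact = exact_jaccard(a, b)
    j_sketch = sketch_jaccard(bottom_sketch(a, s), bottom_sketch(b, s), s)
    print("exact Jaccard:", j_exact, "distance:", mash_distance(j_exact, k))
    print("sketch Jaccard:", j_sketch, "distance:", mash_distance(j_sketch, k))
```

The sketch only ever compares s hash values per sample, whereas the exact computation touches every distinct k-mer, which is the source of the computational savings. When s is at least as large as the number of distinct k-mers, the estimate coincides with the exact value.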
In recent years, several bioinformatic tools that perform k-mer-based de novo comparative metagenomics
have been released; however, it is not clear how these tools and metrics compare with each other. For
biologists and domain experts to choose a tool, it is important to understand the limitations and pitfalls
associated with each approach. To meet this need, we developed sets of simulated metagenomes that
allowed us to (1) thoroughly assess the relationship between k-mer-based and taxonomy-based distances
and evaluate the impact of technical and biological variables on these metrics, in particular, the effect of
sequencing depth, sequencing technology, metagenomic contamination and community diversity;
(2) evaluate the effect of sketching and filtering methods; and (3) provide an overview of the currently
available tools for large-scale de novo metagenomic comparative analysis.
METHODS
k-mer-based tools for de novo metagenomic analysis
Each of the tools evaluated in this study was installed from the recommended source following the authors’
instructions. When tools were available from several sources, Bioconda was preferred due to simplified
dependency management. Tools that could not be obtained through Bioconda were directly cloned from
GitHub or Sourceforge.
All tools were run in a SLURM High Performance Computing (HPC) environment. The standard running
conditions were four cores and 24 GB of memory. If a tool had higher memory requirements, it received
more memory but was limited to four threads to keep runtime comparisons consistent. If a tool supported
paired-end reads, the R1 and R2 files were used; otherwise, only the R1 files were used. If a tool allowed for
cluster computing commands, they were used, with subjobs limited to four cores and 24 GB of memory.
Pre-processing, such as read filtering and trimming, was not included in the runtimes. A k-mer size of
k = 31 bp was used for all tools except CAFE, which used a k-mer size of 5 bp. All tools were run with
default options, apart from the thread, memory, cluster computing, and k-mer size options.
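The standard running conditions described above can be expressed as a SLURM batch script. The following Python helper is a minimal sketch that renders such a script with the four-core, 24 GB allocation. The function name, the job name, and the command `some_kmer_tool` with its arguments are hypothetical placeholders, not the submission scripts used in this study.

```python
def sbatch_script(job_name, command, cpus=4, mem_gb=24):
    """Render a SLURM batch script requesting `cpus` cores and `mem_gb` GB of memory.

    Defaults mirror the standard running conditions (four cores, 24 GB).
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # `some_kmer_tool` and its arguments are hypothetical placeholders.
    print(sbatch_script("kmer_compare", "some_kmer_tool sample_R1.fastq sample_R2.fastq"))
```

For a tool with higher memory requirements, only `mem_gb` would be raised while `cpus` stays at four, which keeps runtime comparisons consistent.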
Simulated datasets
This study leverages four distinct simulated datasets: (1) SimSet 1 to assess technical effects; (2) SimSet 2 to
mimic low abundance contamination effects; (3) SimSet 3 to assess the impact of microbial community
richness; and (4) SimSet 4 to assess the impact of taxonomic diversity.
All simulated datasets were generated using InSilicoSeq v1.5.4 [11]. Briefly, this tool builds an error model of
per-base quality (Phred) scores by Kernel Density Estimation, trained on real sequencing reads, and
can generate reads with realistic quality score distributions for several sequencing platforms, including
MiSeq, HiSeq, and NovaSeq [11,12]. All simulated metagenomes were generated from complete bacterial and
archaeal genomes downloaded from RefSeq in November 2022 [13].
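As an illustration of how a read simulation of this kind might be launched, the following Python helper assembles an `iss generate` command line for one of the pre-built platform error models. This is a sketch under stated assumptions: the input file name, read count, and output prefix are hypothetical, and the exact parameters used for each SimSet are not reproduced here.

```python
import shlex

# Pre-built InSilicoSeq error models for the platforms named in the text.
MODELS = {"miseq", "hiseq", "novaseq"}


def iss_command(genomes_fasta, model, n_reads, output_prefix, cpus=4):
    """Build the argument list for an `iss generate` run.

    All argument values are supplied by the caller; none are taken from the study.
    """
    if model not in MODELS:
        raise ValueError(f"unknown error model: {model}")
    return [
        "iss", "generate",
        "--genomes", genomes_fasta,
        "--model", model,
        "--n_reads", str(n_reads),
        "--cpus", str(cpus),
        "--output", output_prefix,
    ]


if __name__ == "__main__":
    # Hypothetical file names and read count, for illustration only.
    cmd = iss_command("refseq_genomes.fasta", "novaseq", "1M", "simulated_sample")
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string makes it safe to pass directly to `subprocess.run` without shell quoting issues.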