Page 2 of 21 Ponsero et al. Microbiome Res Rep 2023;2:27 https://dx.doi.org/10.20517/mrr.2023.26
of low amounts of sequence contamination and sequencing error was limited. Finally, we benchmarked currently
available de-novo comparative metagenomic tools and compared their output on two datasets of fecal
metagenomes and showed that most k-mer-based tools were able to recapitulate the data structure observed
using taxonomic approaches.
Conclusion: This study expands our understanding of the strengths and limitations of k-mer-based de novo
comparative metagenomic approaches and aims to provide concrete guidelines for researchers interested in
applying these approaches to their metagenomic datasets.
Keywords: De-novo comparative metagenomics, metagenomes, k-mers
INTRODUCTION
The advent of modern metagenomics has led to the generation of massive amounts of genomic data
allowing the characterization of microbes’ diversity and their function in ecosystems. Comparative
metagenomics aims to explore the similarities and differences of microbial communities by comparing
metagenomes to one another. These studies generally measure the distance between each pair of
metagenomes in order to investigate the impact of an ecological condition on the composition of microbial
communities. By computing distances between communities, comparative metagenomic tools also provide
a way to cluster similar metagenomes together or, on the contrary, distinguish distinct communities. These
techniques can be used to retrieve similar metagenomes using a query metagenome or to classify a
metagenome based on its characteristics. For whole genome shotgun (WGS) datasets, these comparisons can be
achieved by measuring the similarity of the samples in terms of their taxonomic or functional diversity.
These approaches require the annotation of metagenomic datasets using taxonomic or functional reference
databases. On the other hand, de novo comparative metagenomic approaches compare metagenomic
samples based on their sequence content only. Using these approaches, the similarity between datasets is
measured by evaluating the proportion of shared sequences using the entire dataset, in contrast to
reference-based methods, which can be limited by incomplete or biased reference databases (reviewed in
Comin et al. 2021[1]).
Historically, de novo comparative metagenomic tools using WGS have relied on two distinct approaches:
read-based and k-mer-based comparisons. While the first studies used alignment-based algorithms such as
BLAST[2] for comparing reads to one another, the ever-increasing size and number of metagenomic datasets
quickly required more computationally efficient algorithms. As a result, several approaches emerged to
retrieve the number of shared reads between two samples and compute a distance based on this measure.
Compareads[3] and its successor Commet[4] approximate the read similarity between each pair of
metagenomes to estimate the number of shared reads. However, these algorithms are computationally
intensive and are difficult to scale to modern-size metagenomic datasets.
Instead of comparing datasets at the read level, another approach is to consider the dataset using a
bag-of-words model, where a metagenomic dataset can be considered as a text composed of DNA words of length k
(referred to as k-mers). This approach relies on three core tenets: (1) closely related organisms share k-mer
profiles and cluster together, making taxonomic assignment unnecessary[5]; (2) k-mer frequency is
correlated with the abundance of an organism[6]; and (3) k-mers of sufficient length can be used to
distinguish specific organisms[7]. Hence, k-mer spectra can be used to differentiate between samples. The
simplest and most effective approach is to compare metagenomic datasets by calculating pairwise
distances between datasets on the basis of their composition of k-mers. These approaches first count the
number of k-mers in the datasets using different algorithms, then calculate a dissimilarity metric between
pairs of samples based on their k-mer count frequencies.
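
The two steps described above (k-mer counting followed by a pairwise dissimilarity calculation) can be illustrated with a minimal Python sketch. It is not the implementation of any tool discussed here: the function names, the toy reads, the choice of k = 4, and the use of Bray-Curtis dissimilarity on relative k-mer frequencies are all assumptions made for illustration. Steps that real tools commonly apply, such as merging each k-mer with its reverse complement or subsampling, are omitted.

```python
from collections import Counter
from itertools import combinations


def count_kmers(reads, k):
    """Count every k-mer (DNA word of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip words containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two k-mer count profiles.

    Counts are converted to relative frequencies first, so that samples
    with different sequencing depths remain comparable.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both samples must contain at least one k-mer")
    shared = sum(
        min(counts_a[kmer] / total_a, counts_b[kmer] / total_b)
        for kmer in counts_a.keys() & counts_b.keys()
    )
    # Frequencies sum to 1 in each sample, so BC = 1 - sum of minima.
    return 1.0 - shared


def pairwise_distances(samples, k):
    """Return {(name_a, name_b): dissimilarity} for every pair of samples."""
    profiles = {name: count_kmers(reads, k) for name, reads in samples.items()}
    return {
        (a, b): bray_curtis(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


if __name__ == "__main__":
    # Hypothetical toy "metagenomes"; real inputs would be parsed FASTQ files.
    samples = {
        "sample_1": ["ACGTACGTAC", "GGGTTTAAAC"],
        "sample_2": ["ACGTACGTAC", "GGGTTTAAAG"],
        "sample_3": ["TTTTTTTTTT", "CCCCCCCCCC"],
    }
    for pair, distance in pairwise_distances(samples, k=4).items():
        print(pair, round(distance, 3))
```

The resulting dictionary of pairwise values can be arranged into a distance matrix and passed to a clustering or ordination method. Holding all counts in memory, as done here, would not scale to realistic dataset sizes, which is one reason dedicated k-mer counting algorithms exist.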