Page 6 of 21 Ponsero et al. Microbiome Res Rep 2023;2:27 https://dx.doi.org/10.20517/mrr.2023.26
Taxonomic profiling of the metagenomic samples was performed using Kraken2 v2.1.1[15] against the
HumGut database[16], and Bracken v2.6.1 was run on the Kraken2 outputs[19]. PCoA visualization of the
distances computed between sample pairs was generated using the ecodist R package v2.0.9.
Before hierarchical clustering of the samples, low-abundance species (< 0.01% relative abundance and
< 0.1% prevalence) were filtered out. The dataset was then transformed into relative abundances, and a
distance matrix was calculated from the transformed data using either the Bray-Curtis or the
presence/absence distance with the ecodist R package. Hierarchical clustering was performed with the
hclust function using the ward.D2 method. Cluster purity was calculated as follows: (1) each cluster was
assigned to the sample group that is most frequent in the cluster; (2) the number of correctly assigned
samples was counted; and (3) this count was divided by the total number of samples.
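The three purity steps above can be sketched in a few lines of Python. This is only an illustration of the calculation as described, not the authors' code (the analysis itself was run in R); the function name `cluster_purity` and the toy cluster and group labels are assumptions made for the example.

```python
from collections import Counter


def cluster_purity(cluster_ids, group_labels):
    """Fraction of samples whose group matches the majority group of their cluster."""
    if len(cluster_ids) != len(group_labels):
        raise ValueError("cluster_ids and group_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one sample is required")

    # Collect the group labels of the samples that fall in each cluster.
    members = {}
    for cluster, group in zip(cluster_ids, group_labels):
        members.setdefault(cluster, []).append(group)

    # Steps (1) and (2): the most frequent group defines the cluster, so the
    # number of correctly assigned samples is the size of that majority group.
    correct = sum(Counter(groups).most_common(1)[0][1] for groups in members.values())

    # Step (3): divide by the total number of samples.
    return correct / len(cluster_ids)


# Hypothetical example: two clusters, each containing one misplaced sample.
clusters = [1, 1, 1, 2, 2, 2]
groups = ["A", "A", "B", "B", "B", "A"]
purity = cluster_purity(clusters, groups)
print(purity)
```

A purity of 1 means every cluster contains samples from a single group; values decrease as clusters mix groups.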
PERMANOVA testing was performed using the adonis2 function from the vegan R package with 999
permutations.
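For readers unfamiliar with the test, the sketch below shows the idea behind a one-factor PERMANOVA: a pseudo-F statistic is computed from the distance matrix and compared with the values obtained after shuffling the group labels. It is a simplified illustration, not a reimplementation of adonis2; the function names, the fixed random seed, and the toy one-dimensional data are assumptions.

```python
import random


def pseudo_f(dist, labels):
    """Pseudo-F statistic for one grouping factor, from a square distance matrix."""
    n = len(labels)
    groups = sorted(set(labels))
    a = len(groups)
    # Total sum of squares: squared distances over all sample pairs, divided by N.
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: the same quantity per group, divided by group size.
    ss_within = 0.0
    for g in groups:
        idx = [i for i in range(n) if labels[i] == g]
        ss_g = sum(dist[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:])
        ss_within += ss_g / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(dist, labels, permutations=999, seed=0):
    """Return the observed pseudo-F and a permutation p-value."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    # The observed labelling is counted as one permutation, so p is never zero.
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical example: two well-separated groups of samples on a line.
positions = [0, 1, 2, 3, 10, 11, 12, 13]
labels = ["A"] * 4 + ["B"] * 4
dist = [[abs(p - q) for q in positions] for p in positions]
f_stat, p_value = permanova(dist, labels)
print(f_stat, p_value)
```

With 999 permutations, the smallest attainable p-value is 1/1000, which is why this number of permutations is a common choice.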
RESULTS
Comparing k-mer-based and taxonomy-based analysis
To assess and compare beta-diversity distances obtained using reference-based and k-mer-based
approaches, four simulated short-read metagenomic datasets were generated. Each dataset was composed
of 100 metagenomes, and each sample had a known taxonomic composition and relative abundance profile.
Pairwise beta-diversity metrics were computed between all pairs of samples in the dataset using the true
taxonomic profiles at the species level and are referred to as the “expected taxonomic-based” beta-diversity
metrics. Using the generated sample taxonomic compositions and profiles, simulated metagenomic reads
were generated with a given sequencing depth and sequencing error model. The k-mer-based beta-diversity
distances between each pair of simulated metagenomes were assessed using Simka[10] and are referred to as
“k-mer-based” beta-diversity metrics. Finally, the simulated metagenomes were profiled using the read
classifier Kraken2 and Bracken. The read counts obtained were used to compute a “read-based taxonomic”
beta-diversity metric at the species level. It is important to note that because all genomes used to generate
the mock communities are present in the Kraken2 database, the impact of unknown taxa in metagenomes is
not investigated in this experiment. The correlation between the beta-diversity metrics for the same sample
pairs was measured using a Spearman correlation. Figure 1 provides an overview of the simulated
experiment.
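The comparison step can be illustrated as follows: each sample pair contributes one value from each method, and the two resulting lists are compared by rank. The sketch below computes a Spearman correlation in plain Python, using average ranks for ties; the function names and the distance values are hypothetical and are not data from the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical distances for the same six sample pairs under two methods.
expected_taxonomic = [0.10, 0.40, 0.35, 0.80, 0.55, 0.90]
kmer_based = [0.30, 0.62, 0.58, 0.93, 0.70, 0.97]
rho = spearman(expected_taxonomic, kmer_based)
print(rho)
```

Because only ranks are used, two methods can correlate perfectly even when their absolute distance values differ, as in this toy example.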
Technical effects
We first evaluated the correlation between taxonomic-based beta-diversity and k-mer-based metrics in
simple simulated metagenomes and assessed the potential impact of technical variables such as sequencing
technology and sequencing depth. A simulated dataset (SimSet 1) of 100 simulated metagenomes composed
of 25 bacterial species was generated for three different sequencing technologies (HiSeq, MiSeq, and
NovaSeq) and at different sequencing depths (50K, 100K, 500K, 1M, 5M, 10M, and 50M paired reads). The
“expected taxonomic-based” beta-diversity distances (Bray-Curtis and presence/absence Jaccard distance)
were computed at the species level between each pair of samples using the true taxonomic profiles used to
generate the simulated metagenomes. The same beta-diversity distances were computed on the simulated
metagenomes’ k-mer composition using Simka at different k-mer lengths (10, 15, 20, 25, and 30)[10]. The
correlation between expected taxonomic and k-mer-based beta-diversity distances was assessed for each
setting using Spearman correlations.
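To make the k-mer-based distances concrete, the toy sketch below counts k-mers in two small read sets and computes a Bray-Curtis distance on the counts and a Jaccard distance on k-mer presence/absence. It only illustrates the two formulas; it does not reflect Simka's actual implementation or options, and the function names, reads, and value of k are assumptions.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Abundance-based distance: 1 - 2 * shared counts / total counts."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1 - 2 * shared / total


def jaccard_distance(a, b):
    """Presence/absence distance: 1 - |shared k-mers| / |all distinct k-mers|."""
    union = a.keys() | b.keys()
    if not union:
        raise ValueError("both samples are empty")
    return 1 - len(a.keys() & b.keys()) / len(union)


# Hypothetical single-read "metagenomes", compared with k = 4.
sample_a = kmer_counts(["ACGTACGT"], k=4)
sample_b = kmer_counts(["ACGTTTTT"], k=4)
bc = bray_curtis(sample_a, sample_b)
jd = jaccard_distance(sample_a, sample_b)
print(bc, jd)
```

Changing `k` changes which substrings are shared between samples, which is one reason the correlation with taxonomic distances is assessed separately for each k-mer length.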