Ponsero et al. Microbiome Res Rep 2023;2:27 https://dx.doi.org/10.20517/mrr.2023.26 Page 5 of 21
Beta-diversity distances between simulated metagenomes
Three distinct types of distance metrics were computed on the simulated metagenomes:
The expected taxonomic beta-diversity distances (Bray-Curtis and presence/absence Jaccard distances) were
computed on the simulated samples’ taxonomic abundance profiles using the Vegan R package [14].
Read-based taxonomic profiles were obtained using Kraken2 [15] and Bracken [16] on the simulated
metagenomes with the “Standard plus protozoa & fungi” database (obtained from
https://benlangmead.github.io/aws-indexes/k2 in May 2021). The read-based taxonomic beta-diversity
distances (Bray-Curtis and presence/absence Jaccard distances) were computed on the resulting taxonomic
abundance profiles of the simulated samples using the Vegan R package.
K-mer-based beta-diversity distances (Bray-Curtis and presence/absence Jaccard distances) were computed
using Simka with controlled k-mer length. The minimum abundance k-mer filter was set to 2 and the
maximum abundance k-mer filter to 999999999 [10].
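As an illustration of the two metrics used for all three profile types above, the following minimal Python sketch computes a Bray-Curtis dissimilarity and a presence/absence Jaccard distance between two abundance vectors. It is not the Vegan code used in the study; the function names and the toy profiles are hypothetical, and the vectors are assumed to list the same taxa (or k-mers) in the same order.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # two empty profiles are treated as identical (assumption)
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def jaccard_presence_absence(u, v):
    """Jaccard distance on presence/absence: 1 - |shared| / |union|."""
    present_u = {i for i, a in enumerate(u) if a > 0}
    present_v = {i for i, b in enumerate(v) if b > 0}
    union = present_u | present_v
    if not union:
        return 0.0  # nothing present in either profile (assumption)
    return 1.0 - len(present_u & present_v) / len(union)


# Hypothetical abundance profiles over the same four taxa.
sample_a = [10, 0, 5, 5]
sample_b = [5, 5, 0, 10]

print(bray_curtis(sample_a, sample_b))
print(jaccard_presence_absence(sample_a, sample_b))
```

Bray-Curtis weights each feature by its abundance, whereas the presence/absence Jaccard distance only records whether a feature is shared, which is why the two metrics can respond differently to the same filtering or sketching choice.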
Spearman correlations between the different types of beta-diversity distances were assessed using the Stats R
package v3.6.2.
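For readers unfamiliar with the statistic, a Spearman correlation is a Pearson correlation computed on ranks. The sketch below shows this under simple assumptions (tied values receive their average rank; both inputs are equal-length lists of pairwise distances). It is a plain-Python illustration, not the R code used in the study, and all names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical pairwise distances from two methods, in the same pair order.
taxonomic = [0.10, 0.40, 0.35, 0.80]
kmer_based = [0.20, 0.50, 0.45, 0.90]
print(spearman(taxonomic, kmer_based))
```

Because only ranks enter the calculation, the coefficient measures whether two distance types order the sample pairs in the same way, not whether their absolute values agree.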
Effect of sketched k-mer distances
A dataset of 100 simulated metagenomes, each composed of 25 organisms, was generated using
InSilicoSeq at a sequencing depth of 5 million reads with the HiSeq error model. The exact k-mer-based
Bray-Curtis and presence/absence Jaccard distances were obtained for specified k-mer lengths
using Simka with the default filtering parameter. Sketched k-mer profiles and distances were obtained using
SimkaMin [17] at specified k-mer lengths and sketch sizes.
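The idea of comparing exact and sketched k-mer distances can be illustrated with a generic bottom-s sketch, in which only the s smallest k-mer hash values of each sample are retained. The Python sketch below is an assumption-laden illustration of that general technique and not SimkaMin's implementation: it ignores k-mer abundances and canonical (reverse-complement) k-mers, covers only the presence/absence Jaccard distance, and uses hypothetical function names and toy sequences.

```python
import hashlib


def kmers(seq, k):
    """Set of distinct k-mers in a sequence (no reverse complements)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(kmer_set, size):
    """Keep the `size` smallest hash values of a k-mer set."""
    return set(sorted(hash_kmer(x) for x in kmer_set)[:size])


def exact_jaccard_distance(a, b):
    """Exact presence/absence Jaccard distance between two k-mer sets."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


def sketched_jaccard_distance(a, b, size):
    """Estimate the Jaccard distance from two bottom-`size` sketches."""
    sketch_a, sketch_b = bottom_sketch(a, size), bottom_sketch(b, size)
    # The smallest hashes of the merged sketches form a sketch of the union.
    merged = set(sorted(sketch_a | sketch_b)[:size])
    if not merged:
        return 0.0
    shared = len(merged & sketch_a & sketch_b)
    return 1 - shared / len(merged)


# Hypothetical toy sequences standing in for two metagenomes.
set_a = kmers("ACGTACGTGACCTGA", 4)
set_b = kmers("ACGTTCGTGACCAGA", 4)
exact = exact_jaccard_distance(set_a, set_b)
sketched = sketched_jaccard_distance(set_a, set_b, size=4)
print(abs(exact - sketched))  # absolute difference for this sample pair
```

When the sketch size is at least the number of distinct k-mers in both samples, the estimate coincides with the exact distance; smaller sketches trade accuracy for memory and run time.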
The absolute difference between the exact and sketched k-mer distance was calculated for each sample pair
comparison. The correlation between the expected Bray-Curtis distances on the simulated taxonomic
profiles and the sketched k-mer-based distances was calculated using a Spearman correlation.
Minimum and maximum abundance k-mer filter effects
For this experiment, a dataset of 100 simulated metagenomes, each composed of 25 organisms, was
generated using InSilicoSeq at a sequencing depth of 5 million reads with the HiSeq error model.
K-mer-based Bray-Curtis and presence/absence Jaccard distances were obtained for a specified k-mer
length using Simka without a k-mer filter. Distances were also computed on the same simulated
metagenome dataset using either the minimum or the maximum k-mer abundance parameter of
Simka. The absolute difference between the unfiltered and filtered k-mer distances was calculated for each
sample pair comparison. The correlation between the expected Bray-Curtis distances on the simulated
taxonomic profiles and the filtered k-mer-based distances was calculated using a Spearman correlation.
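The effect of an abundance filter on a k-mer-based distance can be sketched as follows. This Python example assumes the filter is applied to each sample independently, so that k-mers whose count falls outside the [minimum, maximum] range are dropped before the distance is computed. It is not Simka's implementation, and the function names, reads, and thresholds are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across the reads of one sample."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_by_abundance(counts, min_abundance=1, max_abundance=None):
    """Keep k-mers whose count lies within [min_abundance, max_abundance]."""
    return Counter({
        kmer: c for kmer, c in counts.items()
        if c >= min_abundance and (max_abundance is None or c <= max_abundance)
    })


def bray_curtis_counts(a, b):
    """Bray-Curtis dissimilarity between two k-mer count tables."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # both tables empty after filtering (assumption)
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1 - 2 * shared / total


# Hypothetical reads for two samples.
counts_a = count_kmers(["ACGTACGT", "ACGTTT"], k=4)
counts_b = count_kmers(["ACGTACGA", "TTTACG"], k=4)

unfiltered = bray_curtis_counts(counts_a, counts_b)
filtered = bray_curtis_counts(
    filter_by_abundance(counts_a, min_abundance=2),
    filter_by_abundance(counts_b, min_abundance=2),
)
print(abs(unfiltered - filtered))  # absolute difference for this sample pair
```

A minimum abundance filter mainly removes rare k-mers, which at realistic depths include many sequencing errors, while a maximum abundance filter removes highly repeated k-mers; either can shift the distance between a pair of samples.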
Benchmark on infant and mother metagenomic dataset
Publicly available fecal metagenomes from infants and pregnant mothers were retrieved from the European
Nucleotide Archive (ENA Bioproject ID: PRJEB52774). The sample collection and sequencing are described
in a previously published study [18]. Sequences were trimmed and quality filtered using FastQC v0.11.9 and
Trim Galore v0.6.6 with default parameters. Quality-filtered sequences were screened to remove human
reads using Bowtie2 v2.4.2 against the human genome (Human Build 38, patch release 7). After
quality control and human read filtering, infant fecal metagenomes containing fewer than 10 million
paired-end reads and mother fecal metagenomes with fewer than 20 million paired-end reads were discarded.
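The depth filter described above amounts to a per-group threshold on the number of reads remaining after quality control and human read removal. A minimal Python sketch follows; the sample names and read counts are hypothetical, and whether the thresholds count read pairs or individual reads is not specified here, so the counts are simply assumed to be in the same unit as the thresholds.

```python
# Minimum number of paired-end reads required to keep a sample, per group.
MIN_READS = {"infant": 10_000_000, "mother": 20_000_000}


def keep_sample(group, read_count):
    """Return True if the sample meets the depth threshold of its group."""
    return read_count >= MIN_READS[group]


# Hypothetical samples: (name, group, reads remaining after filtering).
samples = [
    ("sample_01", "infant", 12_500_000),
    ("sample_02", "infant", 8_900_000),
    ("sample_03", "mother", 19_000_000),
    ("sample_04", "mother", 31_200_000),
]

retained = [name for name, group, reads in samples if keep_sample(group, reads)]
print(retained)
```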