Page 4 of 16 Pham et al. Microbiome Res Rep 2024;3:25 https://dx.doi.org/10.20517/mrr.2024.01
In constructing the index F for the reference bacterial genomes, every k-mer of each genome is processed,
on both the forward and reverse-complement strands. A set of n randomly generated hash functions maps
each k-mer to n entries in the index. Each entry holds one of three kinds of value: 0 for an empty entry,
-1 for a “dirty” entry, or a positive genome id. If a k-mer from one genome hashes to an entry that already
holds a different genome’s id, the k-mer is not unique to its genome, and the entry is marked as “dirty” by
setting it to -1. Otherwise, the entry is updated with the genome id of the currently processed genome if it
is empty or already contains the same genome id.
This approach allows F to function similarly to a Bloom filter, aiding in the detection of genomes present in
a metagenomic sample. The algorithm for building this index processes each genome sequentially, updating
the index entries accordingly.
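The update rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the table size, and the use of salted BLAKE2b digests in place of "randomly generated hash functions" are all assumptions made for the example.

```python
import hashlib

EMPTY, DIRTY = 0, -1  # entry values as described in the text


def reverse_complement(seq):
    """Return the reverse-complement strand of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def slots(kmer, n_hashes, size):
    """Map a k-mer to n_hashes entries of the index.

    Assumption: salted BLAKE2b stands in for the n random hash functions.
    """
    out = []
    for salt in range(n_hashes):
        digest = hashlib.blake2b(
            kmer.encode(), digest_size=8, salt=salt.to_bytes(8, "little")
        ).digest()
        out.append(int.from_bytes(digest, "little") % size)
    return out


def build_index(genomes, k, n_hashes, size):
    """Build the index F from a dict {positive genome id: sequence}."""
    index = [EMPTY] * size
    for gid, seq in genomes.items():          # genomes are processed sequentially
        for strand in (seq, reverse_complement(seq)):
            for kmer in kmers(strand, k):
                for s in slots(kmer, n_hashes, size):
                    if index[s] == EMPTY or index[s] == gid:
                        index[s] = gid        # empty or same genome: store the id
                    else:
                        index[s] = DIRTY      # other genome's id, or already dirty
    return index
```

With a single genome, every touched entry ends up holding that genome's id; when two genomes share a k-mer, the entries that k-mer hashes to become dirty and no longer identify either genome.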
Phase 2: Determining species in a microbiome
The process consists of two steps. First, the index F is used to assign reads in the sample to the species
stored in the index. Second, these reads are further clustered into groups of similar coverages to determine
which species are present in the sample.
Step 1: Querying reads
To determine which species a read belongs to, hash values of all k-mers within the read are checked against
the index F. For a read that is part of genome g, if it contains a k-mer with a unique hash value in F, it is
correctly identified as belonging to genome g. A read not belonging to genome g might be mistakenly
identified as such if it contains a k-mer with a hash value matching one from genome g. This could be due
to sequencing errors or genetic variants.
The strategy for querying each k-mer of a read is as follows. First, gather the values stored in the n
entries of F to which the k-mer hashes. If all of these entries hold the same positive value, we predict this
value to be the id of the genome containing the k-mer. Otherwise, the k-mer is discarded.
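The per-k-mer decision can be illustrated as follows. This is a hedged sketch: `classify_kmer` is a hypothetical name, and it assumes the caller has already looked up the entries of F that the k-mer hashes to and passes them in as a list.

```python
def classify_kmer(values):
    """Predict the genome id for one k-mer from its entries in F.

    values: the entries of F the k-mer hashes to (0 = empty, -1 = dirty,
    positive = genome id). Returns the genome id, or None if the k-mer
    is discarded.
    """
    if not values:
        return None
    first = values[0]
    # Accept only when every entry holds the same positive genome id.
    if first > 0 and all(v == first for v in values):
        return first
    return None
```

For example, `classify_kmer([3, 3, 3])` yields genome 3, whereas a single dirty, empty, or disagreeing entry, as in `classify_kmer([3, -1, 3])`, causes the k-mer to be discarded.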
If over 50% of the non-discarded k-mers in a read have the same value, we predict this value to be the id of
the genome that contains the read. Otherwise, the read is discarded. We adhere to the standard majority rule,
setting the threshold at 50% by default. Users, however, can adjust this threshold to the desired stringency
level. After all reads in the sample have been processed, only those assigned a predicted species are retained.
The output consists of these reads and their corresponding genome ids. This strategy helps to ensure accuracy
in species identification despite potential errors or variants in the sequencing data.
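The read-level majority rule can be sketched as below. The function name `classify_read` and its input format (one predicted genome id per k-mer, with `None` for a discarded k-mer) are assumptions for illustration; only the 50% default and the "over 50% of non-discarded k-mers" rule come from the text.

```python
from collections import Counter


def classify_read(kmer_calls, threshold=0.5):
    """Assign a read to a genome by majority vote over its k-mers.

    kmer_calls: per-k-mer predictions (genome id, or None if discarded).
    threshold: fraction of non-discarded k-mers that must agree (strictly
    exceeded), 0.5 by default. Returns the genome id, or None if the read
    is discarded.
    """
    kept = [g for g in kmer_calls if g is not None]
    if not kept:
        return None  # no informative k-mer in this read
    gid, count = Counter(kept).most_common(1)[0]
    return gid if count > threshold * len(kept) else None
```

Raising `threshold` makes the assignment more stringent: a read whose k-mers split three to two between genomes passes the default rule but is discarded at `threshold=0.8`.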
Step 2: Identification of species based on approximate genomic coverages
Our method for predicting bacterial presence in a microbiome innovatively employs clustering based on the
“approximate” coverages of bacterial genomes. Central to this approach is the identification of reads
containing unique genetic signatures indicative of specific bacterial species. In this context, a bacterium’s
presence is inferred from nontrivial genomic coverage, while its absence is suggested by minimal coverage,
which arises primarily from false-positive read assignments. The underlying assumption of our approach is
that modern sequencing technologies produce fewer sequencing errors and therefore fewer false positives.
Consequently, species that are falsely predicted because of falsely assigned reads should have very low coverage.
In this step, species with similar “approximate” coverages are placed in the same clusters. “Approximate”
coverages are calculated based on the number of identified reads, which contain unique k-mers of species. A