
Page 4 of 16                  Pham et al. Microbiome Res Rep 2024;3:25  https://dx.doi.org/10.20517/mrr.2024.01

               To construct the index F over the reference bacterial genomes, the k-mers of each genome are processed,
               including those on both the forward and reverse-complement strands. A set of n randomly generated hash
               functions maps each k-mer to n entries in the index. An entry can hold one of three kinds of value: 0 for an
               empty entry, -1 for a “dirty” entry, and a positive genome id. When a k-mer from one genome hashes to an
               entry that already holds a different genome’s id, the k-mer is treated as not unique and the entry is
               marked as “dirty” (-1). Otherwise, if the entry is empty or already contains the id of the genome currently
               being processed, it is set to that genome id.

               This approach allows F to function similarly to a Bloom filter, aiding in the detection of genomes present in
               a metagenomic sample. The algorithm for building this index processes each genome sequentially, updating
               the index entries accordingly.
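
               The construction rules above can be sketched in Python as follows. This is an illustrative sketch
               only, not the authors' implementation: the function names, the list-based index, and the use of
               salted BLAKE2b digests as a stand-in for the "randomly generated hash functions" are all assumptions.

```python
import hashlib

EMPTY, DIRTY = 0, -1  # entry values described in the text
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse-complement strand of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def make_hash_functions(n, size):
    """Hypothetical stand-in for n random hash functions: salted BLAKE2b."""
    def make(i):
        salt = i.to_bytes(8, "little")

        def h(kmer):
            digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
            return int.from_bytes(digest, "little") % size
        return h
    return [make(i) for i in range(n)]


def build_index(genomes, k, n, size):
    """genomes maps a positive integer genome id to its sequence (assumed layout)."""
    index = [EMPTY] * size
    hashes = make_hash_functions(n, size)
    for genome_id, seq in genomes.items():      # genomes are processed sequentially
        for strand in (seq, reverse_complement(seq)):
            for kmer in kmers(strand, k):
                for h in hashes:
                    slot = h(kmer)
                    if index[slot] == EMPTY:
                        index[slot] = genome_id  # first genome to reach this entry
                    elif index[slot] != genome_id:
                        index[slot] = DIRTY      # different genome, or already dirty
    return index, hashes
```

               In this sketch an entry that has been marked dirty is never reset, so the result depends only on
               which genomes reach an entry, not on the order in which they are processed.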

               Phase 2: Determining species in a microbiome
               The process consists of two steps. First, the index F is used to assign reads in the sample to the species
               stored in the index. Second, these reads are further clustered into groups of similar coverages to determine
               which species are present in the sample.


               Step 1: Querying reads
               To determine which species a read belongs to, hash values of all k-mers within the read are checked against
               the index F. For a read that is part of genome g, if it contains a k-mer with a unique hash value in F, it is
               correctly identified as belonging to genome g. Conversely, a read that does not belong to genome g might be
               mistakenly assigned to g if it contains a k-mer whose hash value matches one from genome g. This could be
               due to sequencing errors or genetic variants.


               The strategy for querying each k-mer of a read is as follows. First, the values stored in F at the n entries
               to which the k-mer hashes are gathered. If these values are all the same positive value, we predict this
               value to be the id of the genome containing the k-mer. Otherwise, the k-mer is discarded.
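
               A minimal sketch of this per-k-mer rule is given below. The function name `query_kmer`, the list-based
               index, and the toy constant hash functions used in the demonstration are hypothetical and serve only
               to illustrate the rule.

```python
def query_kmer(kmer, index, hashes):
    """Return the predicted genome id for a k-mer, or None if it is discarded.

    index  : list of entry values (0 = empty, -1 = dirty, >0 = genome id)
    hashes : list of functions mapping a k-mer to a position in index
    """
    values = {index[h(kmer)] for h in hashes}   # values at all hashed entries
    if len(values) == 1:
        (value,) = values
        if value > 0:                           # identical AND a positive genome id
            return value
    return None                                 # empty, dirty, or conflicting entries


# Toy demonstration with a hand-built index and constant hash functions.
toy_index = [7, 7, -1, 0]
agree = [lambda kmer: 0, lambda kmer: 1]        # both entries hold genome id 7
conflict = [lambda kmer: 0, lambda kmer: 2]     # one entry is dirty
print(query_kmer("ACGT", toy_index, agree))     # 7
print(query_kmer("ACGT", toy_index, conflict))  # None
```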

               If over 50% of non-discarded k-mers in a read have the same value, we predict this value to be the id of the
               genome that contains the read. Otherwise, the read is discarded. We adhere to the standard majority rule,
               setting the threshold at 50% as a default. Users, however, can adjust this threshold to desired stringency
               levels. After processing all reads in the sample, only those assigned a predicted species are retained.
               The output includes these reads and their corresponding genome ids. This strategy helps to ensure accuracy
               in species identification despite potential errors or variants in the sequencing data.
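
               The read-level majority rule can be sketched as follows, assuming each k-mer of the read has already
               been resolved to a genome id or discarded (represented here by `None`). The function name
               `classify_read` and the `threshold` parameter name are illustrative assumptions.

```python
from collections import Counter


def classify_read(kmer_calls, threshold=0.5):
    """Assign a read to a genome id by majority vote over its k-mer calls.

    kmer_calls : per-k-mer predictions; None marks a discarded k-mer
    threshold  : fraction of non-discarded k-mers that must be exceeded
    """
    kept = [call for call in kmer_calls if call is not None]
    if not kept:
        return None                               # nothing to vote on: discard read
    genome_id, count = Counter(kept).most_common(1)[0]
    # "Over 50%" is read as a strict inequality, so an exact tie discards the read.
    return genome_id if count / len(kept) > threshold else None
```

               For example, `classify_read([1, 1, 2, None])` returns 1 because two of the three non-discarded k-mers
               agree, whereas `classify_read([1, 2])` returns `None` because neither genome exceeds 50%.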

               Step 2: Identification of species based on approximate genomic coverages
               Our method for predicting bacterial presence in a microbiome innovatively employs clustering based on the
               “approximate” coverages of bacterial genomes. Central to this approach is the identification of reads
               containing unique genetic signatures indicative of specific bacterial species. In this context, a bacterium’s
               presence is inferred from nontrivial genomic coverage, while an absence is suggested by minimal coverage,
               primarily due to false positives. The underlying assumption of our approach is that modern sequencing
               technologies produce few sequencing errors and therefore few false positive reads. Consequently, species
               predicted only because of falsely assigned reads should have very low coverage.


               In this step, species with similar “approximate” coverages are placed in the same clusters. “Approximate”
               coverages are calculated based on the number of identified reads, which contain unique k-mers of species. A