Lugli et al. Microbiome Res Rep 2023;2:15 https://dx.doi.org/10.20517/mrr.2022.21 Page 13 of 16
Focusing on the assembled genomes, we observed that MEGAnnotator generates a higher number of
contigs with respect to the oldest version [Figure 2]. Moreover, the genomes assembled by the oldest version
of the pipeline were characterized by a lower N50 and a higher L50 (74,157 and 18, respectively) than those
assembled by the updated pipeline (137,143 and 10) [Figure 2F]. Furthermore, in the previous software
version, the user was forced to provide the reference genome sequence in the same analysis folder. In contrast,
MEGAnnotator2 assembles microbial genomes more efficiently, and the selection of a reference strain
for the reordering of contigs is now automated based on the knowledge acquired in the species
identification step. Overall, the new pipeline version is 63 times faster than its predecessor in assembling
genomes [Figure 2].
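For readers less familiar with the contiguity metrics compared above, the following is a minimal sketch of how N50 and L50 are commonly computed from a list of contig lengths. It is not taken from the pipeline: the function name `n50_l50` and the toy contig lengths are illustrative assumptions.

```python
from typing import Sequence, Tuple


def n50_l50(contig_lengths: Sequence[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a set of contig lengths.

    N50 is the length of the contig at which the running sum of lengths,
    taken from the longest contig downwards, first reaches half of the
    total assembly size. L50 is the number of contigs needed to get there.
    """
    if not contig_lengths:
        raise ValueError("at least one contig is required")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable: the running sum always reaches half")


# Toy assemblies of the same total size (illustrative values only).
fragmented = [400, 300, 200, 100, 100, 100, 100, 100]
contiguous = [900, 300, 200]

print(n50_l50(fragmented))
print(n50_l50(contiguous))
```

In this toy case the more contiguous assembly has the higher N50 and the lower L50, which is the same direction of change reported for the updated pipeline.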
For the functional classification of genes, the previous version of the pipeline chose the first hit among
the 10 hits that possessed an appropriate protein name. Because of the gradual expansion of the reference
database, this strategy is no longer optimal. Thus, MEGAnnotator2 is provided with pre-processed databases
from which inappropriate protein names have already been removed, so the best hit automatically represents
an orthologous gene with an appropriate protein name. In addition, the new database is more manageable,
and the computing time decreased from 60.3 min with MEGAnnotator to 2.9 min with
MEGAnnotator2.
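The difference between the two hit-selection strategies can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the marker strings in `UNINFORMATIVE`, the function names, and the example hit names are hypothetical, and hits are assumed to arrive sorted from best to worst.

```python
from typing import Optional, Sequence

# Hypothetical markers of non-informative protein names; the actual list
# used to pre-process the databases is not given in the text.
UNINFORMATIVE = ("hypothetical protein", "uncharacterized", "unknown function")


def is_informative(name: str) -> bool:
    """True when the protein name contains none of the markers."""
    lowered = name.lower()
    return not any(marker in lowered for marker in UNINFORMATIVE)


def pick_legacy(hits: Sequence[str], max_hits: int = 10) -> Optional[str]:
    """Previous strategy: first informative name among the first 10 hits."""
    for name in hits[:max_hits]:
        if is_informative(name):
            return name
    return None


def pick_prefiltered(hits: Sequence[str]) -> Optional[str]:
    """New strategy: the database is already filtered, so take the best hit."""
    return hits[0] if hits else None


# Hits against an unfiltered database, best first (illustrative names).
raw_hits = [
    "hypothetical protein",
    "protein of unknown function",
    "DNA gyrase subunit A",
]
# The same search against a database filtered in advance.
filtered_hits = [name for name in raw_hits if is_informative(name)]

print(pick_legacy(raw_hits))
print(pick_prefiltered(filtered_hits))
```

Both calls return the same name here, but the second one does no per-query filtering, because that work was done once when the database was built.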
Overall, the previous version of the pipeline was 43 times slower than MEGAnnotator2 in providing
assembled and annotated genomes, with MEGAnnotator2 showing a 20-fold improvement in the annotation of
genes and a 63-fold improvement in the assembly of genomes [Figure 2].
Performance of the pre-processed RefSeq database of NCBI
In addition to the selection of more efficient software for the execution of each task, one of the major
improvements to the pipeline is represented by the pre-processed RefSeq database of NCBI. To select the
optimal strategy to assign functional annotation to gene sequences, we employed the genomic repertoire of
Geobacter lovleyi SZ (CP001089), comprising 3,623 genes, and subsets of the RefSeq database of NCBI.
First, RefSeq genes were processed by removing non-informative genes, such as hypothetical proteins, and a
collection of unsuitable gene names that may compromise the quality of the resulting functional
classification. Then, RefSeq genes were clustered with CD-HIT using sequence identity thresholds of 90%,
80%, and 70%. Finally, databases were generated with RAPSearch2 and DIAMOND for taxonomy annotation tests.
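The pre-processing steps described above can be sketched as below. This is a rough illustration rather than the authors' procedure: the `UNSUITABLE` list, the helper names, and the file names are hypothetical. The commands are only assembled, not executed, and use the standard `cd-hit` options `-i`, `-o`, `-c`, `-n` and the `diamond makedb` options `--in` and `--db`.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for the collection of unsuitable gene names.
UNSUITABLE = ("hypothetical protein",)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_informative(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop records whose header contains an unsuitable name."""
    return [
        (header, seq)
        for header, seq in records
        if not any(term in header.lower() for term in UNSUITABLE)
    ]


def cd_hit_command(fasta_in: str, fasta_out: str, identity: float) -> List[str]:
    """Build a CD-HIT call; word size 5 covers identity thresholds of 0.7 to 1.0."""
    return ["cd-hit", "-i", fasta_in, "-o", fasta_out,
            "-c", f"{identity:.2f}", "-n", "5"]


def diamond_makedb_command(fasta_in: str, db_prefix: str) -> List[str]:
    """Build a DIAMOND call that formats a protein FASTA file as a database."""
    return ["diamond", "makedb", "--in", fasta_in, "--db", db_prefix]


sample = [
    ">WP_1 hypothetical protein [Example sp.]", "MKT",
    ">WP_2 DNA gyrase subunit A [Example sp.]", "MAA", "GG",
]
kept = keep_informative(read_fasta(sample))
print(kept)

# One clustering command per identity threshold tested in the text.
for identity in (0.90, 0.80, 0.70):
    print(" ".join(cd_hit_command("filtered.faa", f"clustered_{int(identity * 100)}.faa", identity)))
print(" ".join(diamond_makedb_command("clustered_70.faa", "refseq_70")))
```

Filtering before clustering keeps uninformative sequences from being chosen as cluster representatives, which is one plausible reason for the order of the steps.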
The reduction in the size of the database was heavily dependent on the software used and the level of
clustering among genes, i.e., from 492.9 to 35.1 GB using RAPSearch2 and from 96.8 to 6.2 GB using
DIAMOND [Figure 2I]. Similarly, the processing time depended on both the software and the clustered
database: it was significantly lower using DIAMOND (on average twice as fast as RAPSearch2) and the
RefSeq database built with 70% clustering (33 times faster than the original RefSeq database and 2.8 times
faster than clustering at 80%) [Figure 2D].
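As a quick arithmetic check, the size reductions implied by the reported figures can be computed as follows. The sketch assumes that, for each program, the larger value is the database built from the original RefSeq and the smaller value is the one built from the most aggressively clustered set; the variable names are illustrative.

```python
# Reported database sizes in GB: (largest, smallest) for each program.
sizes_gb = {
    "RAPSearch2": (492.9, 35.1),
    "DIAMOND": (96.8, 6.2),
}

# Fold reduction = largest size divided by smallest size.
reductions = {tool: full / clustered for tool, (full, clustered) in sizes_gb.items()}

for tool, fold in reductions.items():
    print(f"{tool}: {fold:.1f}-fold smaller")
```

Under this reading, both programs shrink the database by a similar factor (roughly 14- to 16-fold), while the DIAMOND databases are several times smaller than the RAPSearch2 ones at either end.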
The functional annotations obtained with both programs and the clustered RefSeq databases did not show
significant differences [Supplementary Table 4], while classification with the unfiltered RefSeq was
superficial due to the imprecise gene classification of the unprocessed database. Thus, DIAMOND and
clustering at 70% with CD-HIT were selected for their speed advantages and reduced
memory usage. This strategy allowed us to build a consistent database for the functional classification of
genes that constitutes a fraction of the RefSeq database (1/80) and classifies genes 33 times
faster. Pre-processed databases will be updated twice a year to guarantee the inclusion of novel genes.