Lugli et al. Microbiome Res Rep 2023;2:15  https://dx.doi.org/10.20517/mrr.2022.21  Page 13 of 16

               Focusing on the assembled genomes, we observed that the original MEGAnnotator generates a higher number
               of contigs than the updated version [Figure 2]. Moreover, genomes assembled with the previous version of
               the pipeline were characterized by a lower N50 and a higher L50 (74,157 and 18, respectively) than those
               assembled with the updated pipeline (137,143 and 10) [Figure 2F]. Furthermore, in the previous software
               version, the user had to provide the reference genome sequence in the analysis folder. In contrast,
               MEGAnnotator2 assembles microbial genomes more efficiently, and the selection of a reference strain
               for the reordering of contigs is now automated based on the knowledge acquired in the species
               identification step. Overall, the new pipeline version is 63 times faster than its predecessor in
               assembling genomes [Figure 2].
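
For readers less familiar with the assembly metrics compared above, the following is a minimal sketch of how N50 and L50 are conventionally computed from a list of contig lengths. It is not code from either pipeline; the function name `n50_l50` and the example lengths are illustrative assumptions.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50: length of the contig at which the running sum of lengths,
         taken from longest to shortest, first reaches half of the
         total assembly size.
    L50: number of contigs needed to reach that point.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical toy assembly (lengths in bp), not data from the paper.
example = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(example)
print(f"N50 = {n50}, L50 = {l50}")
```

A higher N50 together with a lower L50 means that half of the assembly is covered by fewer, longer contigs, which is why these two values are read as signs of a more contiguous assembly.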


               For the functional classification of genes, the previous version of the pipeline chose the first hit
               among the top 10 hits that possessed an appropriate protein name. Because the reference database is
               gradually expanding, this strategy is no longer optimal. MEGAnnotator2 is therefore provided with
               pre-processed databases from which inappropriate protein names have already been removed, so the best
               hit automatically represents an orthologous gene with an appropriate protein name. In addition, the new
               database is more manageable, and the computing time decreased from 60.3 min with MEGAnnotator to
               2.9 min with MEGAnnotator2.
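
The two hit-selection strategies described above can be contrasted with a small sketch. This is an illustration under stated assumptions, not the pipelines' actual code: hits are assumed to be a list of dictionaries sorted from best to worst, and the terms in `UNINFORMATIVE` are assumed examples of inappropriate protein names.

```python
# Assumed examples of uninformative protein names; the actual list used
# by the pipeline is not given in the text.
UNINFORMATIVE = ("hypothetical protein", "uncharacterized protein")


def is_informative(name):
    """True if the protein name contains none of the uninformative terms."""
    lowered = name.lower()
    return not any(term in lowered for term in UNINFORMATIVE)


def pick_hit_previous(hits, max_hits=10):
    """Earlier strategy: scan the top hits and return the first one
    whose protein name is informative (None if there is none)."""
    for hit in hits[:max_hits]:
        if is_informative(hit["name"]):
            return hit
    return None


def pick_hit_preprocessed(hits):
    """Newer strategy: the database was filtered beforehand, so the
    best hit can be taken directly (None if there are no hits)."""
    return hits[0] if hits else None


# Hypothetical hits against an unfiltered database, best first.
raw_hits = [
    {"name": "hypothetical protein", "score": 300},
    {"name": "ATP synthase subunit alpha", "score": 280},
]
# Filtering the database once up front removes the per-query scan.
filtered_hits = [h for h in raw_hits if is_informative(h["name"])]
print(pick_hit_previous(raw_hits)["name"])
print(pick_hit_preprocessed(filtered_hits)["name"])
```

Both calls select the same protein in this toy case; the difference is that the filtering cost is paid once when the database is built rather than for every query.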


               Accordingly, the previous version of the pipeline was 43 times slower than MEGAnnotator2 in providing
               assembled genomes and gene annotations, with MEGAnnotator2 showing a 20-fold improvement in the
               annotation of genes and a 63-fold improvement in the assembly of genomes [Figure 2].


               Performance of the pre-processed RefSeq database of NCBI
               In addition to the selection of more efficient software for the execution of each task, one of the major
               improvements to the pipeline is the pre-processed RefSeq database of NCBI. To select the optimal
               strategy for assigning functional annotation to gene sequences, we employed the genomic repertoire of
               Geobacter lovleyi SZ (CP001089), comprising 3,623 genes, and subsets of the RefSeq database of NCBI.
               First, RefSeq genes were processed by removing non-informative genes, such as hypothetical proteins, and
               a collection of unsuitable gene names that may compromise the quality of the resulting functional
               classification. Then, RefSeq genes were clustered with CD-HIT using sequence identity thresholds of 90%,
               80%, and 70%. Finally, databases were generated with RAPSearch2 and DIAMOND for taxonomy annotation
               tests.
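
The first pre-processing step, removal of non-informative entries, can be sketched as a simple FASTA filter. This is a minimal illustration, not the authors' implementation: the function name `filter_fasta` and the default excluded term are assumptions, and the actual collection of unsuitable gene names is not listed in the text. Clustering and database construction are then left to the external tools named above.

```python
def filter_fasta(lines, excluded_terms=("hypothetical protein",)):
    """Yield (header, sequence) pairs from FASTA-formatted lines,
    skipping records whose header contains an excluded term."""

    def keep(header):
        lowered = header.lower()
        return not any(term in lowered for term in excluded_terms)

    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one if it passes.
            if header is not None and keep(header):
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    # Emit the final record.
    if header is not None and keep(header):
        yield header, "".join(seq)


# Hypothetical records, not taken from RefSeq.
records = [
    ">prot1 hypothetical protein",
    "MKV",
    ">prot2 DNA polymerase III subunit beta",
    "MAA",
    "TTG",
]
for name, sequence in filter_fasta(records):
    print(name, sequence)
```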

               The reduction in database size depended heavily on the software used and the level of clustering among
               genes, i.e., from 492.9 to 35.1 GB using RAPSearch2 and from 96.8 to 6.2 GB using DIAMOND [Figure 2I].
               Similarly, the execution time depended on both the software and the clustered database: it was
               significantly lower using DIAMOND (on average twice as fast as RAPSearch2) and using the RefSeq database
               built with 70% clustering (33 times faster than the complete RefSeq database and 2.8 times faster than
               clustering at 80%) [Figure 2D].

               The functional annotations obtained with both strategies and the clustered RefSeq databases did not show
               significant differences [Supplementary Table 4], while classification from the unfiltered RefSeq was
               superficial due to the imprecise gene classification of the unprocessed database. Thus, DIAMOND and
               clustering at 70% with CD-HIT were selected for their speed advantages and reduced memory usage. This
               strategy allowed us to build a consistent database for the functional classification of genes that
               constitutes a fraction (1/80) of the RefSeq database and classifies genes 33 times faster.
               Pre-processed databases will be updated twice a year to guarantee the inclusion of novel genes.
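
As a quick arithmetic check on the figures reported in this section, the fold changes can be recomputed from the stated sizes and run times. The sketch below uses only numbers quoted in the text; reading the 1/80 fraction as the smallest DIAMOND database relative to the largest RAPSearch2 database is an assumption that happens to match the reported value.

```python
def fold_change(before, after):
    """Ratio of an initial value to a final value."""
    return before / after


# Database sizes in GB as reported in the text: (largest, smallest).
sizes_gb = {"RAPSearch2": (492.9, 35.1), "DIAMOND": (96.8, 6.2)}
reductions = {tool: fold_change(*pair) for tool, pair in sizes_gb.items()}

# Assumed reading of the 1/80 fraction: largest RAPSearch2 database
# versus smallest DIAMOND database.
overall = fold_change(492.9, 6.2)

# Annotation time in minutes: MEGAnnotator versus MEGAnnotator2.
annotation_speedup = fold_change(60.3, 2.9)

for tool, value in reductions.items():
    print(f"{tool}: {value:.1f}-fold smaller")
print(f"Overall size reduction: {overall:.1f}-fold")
print(f"Annotation speed-up: {annotation_speedup:.1f}-fold")
```

The results, roughly 14-fold and 16-fold size reductions per tool, about 80-fold overall, and about 21-fold for annotation time, agree with the rounded values quoted in the text.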