
Page 4 of 16                  Lugli et al. Microbiome Res Rep 2023;2:15  https://dx.doi.org/10.20517/mrr.2022.21

               probiogenomics.unipr.it/cmu/. As reported in the manual, a single Unix command line is needed to have
               the full pipeline installed in the system. One of the advantages of using MEGAnnotator2 is that it can be
               used without internet access since all the programs and databases will be accessible locally after the pipeline
               installation.


               Another main novelty in pipeline execution is the possibility of processing multiple genomes in series
               without idle time between analyses. Specifically, the script can recognize multiple FASTQ files
               retrieved from NGS base-calling and schedule the analyses one after another according to the
               parameters set by the user. Furthermore, the results of multiple analyses can be combined to
               provide an overall view of the assembled data. In MEGAnnotator2, extending the automated
               script is straightforward, as reported in the manual. Thus, additional extensions will be implemented in
               future updates of the software, and users can also program and introduce their own based on their needs.
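
As a rough illustration of the batch mode described above, the following Python sketch groups FASTQ files by sample and builds a queue of jobs to be run one after another. The file-naming convention (`_R1`/`_R2` suffixes), the function names `group_fastq` and `build_queue`, and the `threads` parameter are assumptions made for this example; they are not taken from the MEGAnnotator2 code or manual.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming convention: <sample>_R1.fastq(.gz), <sample>_R2.fastq(.gz),
# or <sample>.fastq(.gz) for single-end data. Real data sets may differ.
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+?)(?:_R?[12])?\.(?:fastq|fq)(?:\.gz)?$"
)


def group_fastq(filenames):
    """Group FASTQ file names by sample so each genome can be queued in turn."""
    samples = defaultdict(list)
    for name in filenames:
        match = FASTQ_PATTERN.match(Path(name).name)
        if match:  # anything that is not a FASTQ file is skipped
            samples[match.group("sample")].append(name)
    # Sort samples and reads for a deterministic processing order.
    return {sample: sorted(files) for sample, files in sorted(samples.items())}


def build_queue(filenames, params):
    """Return one job description per sample, all sharing the user's parameters."""
    return [
        {"sample": sample, "reads": reads, **params}
        for sample, reads in group_fastq(filenames).items()
    ]
```

Each entry of the resulting queue describes one genome, so a driver loop could hand the entries to the assembly and annotation steps in sequence without manual intervention between runs.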

               MEGAnnotator2 databases
               Alongside the software, MEGAnnotator2 is provided with multiple databases to avoid reliance on online
               computing during the execution of the pipeline. Specifically, apart from installation and software updates,
               the pipeline can be run on a local machine without constant network access. Notably, four ad hoc
               precompiled databases are downloaded together with all the scripts to use the pipeline at its full potential.

               The first database is dedicated to the functional annotation of genes, aiming at providing reliable outputs
               with the most up-to-date data for gene classification. To do so, the RefSeq database of NCBI (amino acid
               sequences) is processed by removing non-informative genes, such as hypothetical proteins, and a collection
               of inappropriate gene names that may compromise the resulting functional classification.
               Then, selected genes are clustered with CD-HIT using a sequence identity threshold of 70%[17]. This process
               reduces the overall size of the database without removing any sequence information, resulting in a
               decreased computational cost for the system of the final user. Using this strategy, we reduced the previous
               database of MEGAnnotator from hundreds of gigabytes to 35 gigabytes. As with the other
               databases provided by MEGAnnotator2, the installation of the software downloads the pre-processed
               database. Thus, the user does not need to process or compile individual databases.
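
The filtering and clustering steps described above can be sketched in Python as follows. This is a minimal example under stated assumptions: the exclusion terms in `UNINFORMATIVE` are illustrative only (the actual list used to build the database is not given in the text), and the helper names are hypothetical. The CD-HIT call is only assembled, not executed; it uses the `-i`, `-o`, `-c`, and `-n` options, with the word size of 5 that the CD-HIT documentation recommends for identity thresholds between 0.7 and 1.0.

```python
# Illustrative exclusion terms; the pipeline's real list is not given in the text.
UNINFORMATIVE = (
    "hypothetical protein",
    "uncharacterized protein",
    "unnamed protein product",
)


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_informative(records):
    """Drop records whose header contains one of the uninformative terms."""
    return [
        (header, seq)
        for header, seq in records
        if not any(term in header.lower() for term in UNINFORMATIVE)
    ]


def cd_hit_command(fasta_in, fasta_out, identity=0.70):
    """Build (but do not run) a CD-HIT call for the given identity threshold."""
    if not 0.7 <= identity <= 1.0:
        # Word size 5 is only appropriate for thresholds in the 0.7-1.0 range.
        raise ValueError("this sketch only covers thresholds from 0.7 to 1.0")
    return ["cd-hit", "-i", fasta_in, "-o", fasta_out, "-c", str(identity), "-n", "5"]
```

The returned list could be passed to `subprocess.run` on a machine where CD-HIT is installed; the input and output file names are left to the caller.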


               The second database consists of a single reference genome for each species of microorganism, covering
               all known genome variability but avoiding redundancy within the same species. All bacterial genomes
               available in the NCBI RefSeq database were retrieved and filtered based on the most up-to-date reference
               ANI table made available from the repository. Next, for each bacterial species, each genome was processed
               using the sourmash software[18] and compared in a pair-wise approach to obtain a series of Jaccard similarity
               matrices. Then, the optimal reference genome was extracted from each Jaccard similarity matrix, given by
               the highest average Jaccard similarity score. The sequences of the representative genomes are used to
               compute average nucleotide identity (ANI) values with respect to the assembled genome sequence.
               Furthermore, a subset of the database, represented by complete reference genomes only, is used to perform
               sequence alignment, allowing contig reordering of partially reconstructed chromosomal sequences.
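
To make the selection criterion concrete, the sketch below picks, within one species, the genome with the highest average Jaccard similarity to all the others. It is a simplified stand-in rather than the procedure used to build the database: sourmash estimates Jaccard similarity from MinHash sketches, whereas this example uses exact k-mer sets, and the function names and the default k-mer size are assumptions chosen for illustration.

```python
from itertools import combinations


def kmers(sequence, k=21):
    """Return the set of all k-mers in a sequence (exact, no MinHash subsampling)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A n B| / |A u B| of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_representative(genomes, k=21):
    """Return the name of the genome with the highest mean Jaccard similarity.

    `genomes` maps a genome name to its sequence; all are assumed to belong
    to the same species.
    """
    names = sorted(genomes)
    if len(names) == 1:
        return names[0]
    sets = {name: kmers(genomes[name], k) for name in names}
    totals = dict.fromkeys(names, 0.0)
    # Each pair is compared once; the score counts toward both genomes.
    for a, b in combinations(names, 2):
        score = jaccard(sets[a], sets[b])
        totals[a] += score
        totals[b] += score
    # Mean over the other genomes; ties resolve to the first name alphabetically.
    return max(names, key=lambda name: totals[name] / (len(names) - 1))
```

With real genomes, the exact k-mer sets would be replaced by sketches to keep memory use manageable, but the selection rule (highest row average of the similarity matrix) stays the same.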


               The third database represents a collection of validated 16S and 18S rRNA gene sequences of all classified
               microorganisms based on the SILVA repository[19]. Specifically, the database is generated by processing the
               latest release of the complete SILVA repository, removing sequences with non-informative microbial
               taxonomy, such as unknown species. Then, selected ribosomal genes are clustered with CD-HIT using a
               sequence identity threshold of 99.9%[17]. At first glance, results based on this database might appear
               redundant since, through the MEGAnnotator2 pipeline, the species is attributed through ANI values