Lugli et al. Microbiome Res Rep 2023;2:15  https://dx.doi.org/10.20517/mrr.2022.21  Page 7 of 16

               As in the first MEGAnnotator version, resulting contig sequences retrieved from the assemblies can be
               reordered based on a reference genome sequence of the same species. The difference in this software version
               is that the user does not need to provide the genome sequence of the reference strain, but only the species
               name in the parameters file. The pipeline then retrieves the genome sequence of the reference
               strain from the NCBI RefSeq genomes and performs a genome alignment using MAUVE [28]. If
               the ANI analysis results are discordant with the given species name, MEGAnnotator2 will choose the
               appropriate genome for the reordering.


               Finally, assembled contigs are filtered by length before gene prediction and functional annotation.
               The user can provide two separate length cut-offs, one for contigs obtained through the short-read
               assembly (using SPAdes) and one for those from the long-read assembly (using CANU); contigs shorter
               than the applicable cut-off are removed.
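
               The length filter described above can be sketched as follows. This is a minimal illustration,
               not the pipeline's own code: the function names and the two cut-off values are assumptions
               chosen for the example, and contigs shorter than the cut-off are the ones dropped.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records: Iterable[Tuple[str, str]], min_length: int) -> Dict[str, str]:
    """Keep only contigs whose length is at least min_length."""
    return {name: seq for name, seq in records if len(seq) >= min_length}


# Hypothetical cut-offs, one per assembly type (values are assumptions).
MIN_LEN_SHORT_READ = 1000  # contigs from the short-read (SPAdes) assembly
MIN_LEN_LONG_READ = 5000   # contigs from the long-read (CANU) assembly

example = [">contig_1", "ACGT" * 300, ">contig_2", "ACGT" * 100]
kept = filter_contigs(parse_fasta(example), MIN_LEN_SHORT_READ)
print(sorted(kept))
```

               Keeping the two cut-offs as separate constants mirrors the text: long-read assemblies usually
               yield longer contigs, so a single threshold would be too lenient for one or too strict for the other.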


               Step 4: gene prediction and functional annotation
               Gene prediction is performed by prodigal [29], whose high efficiency in predicting the start of genes has been
               documented [30]. Collected amino acid gene sequences are then used to perform their functional prediction.
               Notably, partial sequences predicted at the edge of contigs (genes without the start and/or the stop codon)
               may be removed by the user based on their length. The functional annotation of each gene sequence is
               managed by DIAMOND, due to its reduced computational run time with respect to other similar tools [31].
               By default, DIAMOND performs the alignment using the --sensitive option, requiring a query coverage > 50%
               and an e-value < 1·10⁻⁸. Like the other parameters described above, these thresholds can be easily
               customized by modifying their values in the parameters file. The putative function of the subject
               sequence with the highest score is then attributed to each query sequence.
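
               A minimal sketch of this step is given below. It builds a DIAMOND command line with the default
               thresholds stated above and selects the best hit per query from tabular output. The file names,
               function names, and sample rows are hypothetical, and the "highest score" is assumed here to be
               the bit score (column 12 of DIAMOND's default tabular format); the text does not specify how
               MEGAnnotator2 invokes DIAMOND internally.

```python
import csv
import io
from typing import Dict, List, Tuple


def diamond_command(query: str, db: str, out: str,
                    evalue: float = 1e-8, query_cover: int = 50) -> List[str]:
    """Build a DIAMOND blastp command (not executed here)."""
    return [
        "diamond", "blastp",
        "--query", query, "--db", db, "--out", out,
        "--outfmt", "6",            # BLAST-like tabular output
        "--sensitive",
        "--evalue", str(evalue),
        "--query-cover", str(query_cover),
    ]


def best_hits(tabular_text: str) -> Dict[str, Tuple[str, float]]:
    """Return, per query, the subject with the highest bit score."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row:
            continue
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Hypothetical tabular rows: qseqid, sseqid, pident, length, mismatch,
# gapopen, qstart, qend, sstart, send, evalue, bitscore.
sample = (
    "gene_1\tprotA\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t350.0\n"
    "gene_1\tprotB\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-60\t410.0\n"
    "gene_2\tprotC\t70.0\t150\t45\t0\t1\t150\t1\t150\t1e-20\t120.0\n"
)
hits = best_hits(sample)
print(hits["gene_1"][0])
```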

               Genes left unclassified by the DIAMOND search are further investigated with InterProScan against an
               HMM-based database [32], with the aim of classifying them into protein families and predicting domains
               that may suggest their biological role. If a gene remains unclassified even after the InterProScan
               profiling, its functional annotation is set to “hypothetical protein”.
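
               The fallback order described above (DIAMOND first, then InterProScan, then a default label) can be
               summarized in a few lines. The function name and arguments are assumptions made for illustration only.

```python
from typing import Optional


def assign_function(diamond_hit: Optional[str], interpro_hit: Optional[str]) -> str:
    """Pick an annotation following the DIAMOND -> InterProScan -> default order."""
    if diamond_hit:
        return diamond_hit
    if interpro_hit:
        return interpro_hit
    return "hypothetical protein"


print(assign_function(None, None))
```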


               Additionally, non-coding genes are predicted using barrnap (https://github.com/tseemann/barrnap) and
               tRNAscan-SE 2.0 [33], allowing the detection of rRNA and tRNA genes across the assembled genome sequence.
               In this regard, the pipeline can be set to process prokaryotic or eukaryotic genomes and predict the
               appropriate ribosomal genes, using the -k (kingdom) option or the corresponding setting in the
               parameters file. By default, MEGAnnotator2 will predict ribosomal genes associated with prokaryotes.
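
               As an illustration of how a kingdom setting could be passed on to the two tools, the sketch below
               builds (without running) one barrnap and one tRNAscan-SE command. The kingdom labels, function
               name, and output file name are assumptions; the text does not state which values the -k option
               accepts or how MEGAnnotator2 calls these programs. The tool flags used are barrnap's --kingdom
               (bac or euk) and tRNAscan-SE's -B (bacterial) and -E (eukaryotic) search modes.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from a kingdom label to each tool's own flag.
KINGDOM_FLAGS: Dict[str, Dict[str, str]] = {
    "prokaryote": {"barrnap": "bac", "trnascan": "-B"},
    "eukaryote": {"barrnap": "euk", "trnascan": "-E"},
}


def noncoding_commands(genome_fasta: str,
                       kingdom: str = "prokaryote") -> Tuple[List[str], List[str]]:
    """Build rRNA (barrnap) and tRNA (tRNAscan-SE) prediction commands."""
    if kingdom not in KINGDOM_FLAGS:
        raise ValueError(f"unsupported kingdom: {kingdom}")
    flags = KINGDOM_FLAGS[kingdom]
    barrnap = ["barrnap", "--kingdom", flags["barrnap"], genome_fasta]
    trnascan = ["tRNAscan-SE", flags["trnascan"], "-o", "trna.out", genome_fasta]
    return barrnap, trnascan


rrna_cmd, trna_cmd = noncoding_commands("genome.fasta")
print(" ".join(rrna_cmd))
```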

               Step 5: metabolic profiling (optional)
               As a new feature of MEGAnnotator2, predicted gene sequences are screened against the MetaCyc metabolic
               database to retrieve each attributable enzymatic reaction [20]. Enzyme Commission (EC) numbers are
               assigned to each amino acid sequence using DIAMOND [31]. By default, DIAMOND performs the alignment
               using the --sensitive option, requiring a query coverage > 50% and an e-value < 1·10⁻⁸. Results of the
               analysis are reported as raw counts for each EC number as well as a percentage based on the total
               number of genes.
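
               The reporting of EC numbers as raw counts and percentages can be sketched as follows. The gene
               identifiers, EC assignments, and total gene count are made-up example data, and the function
               name is an assumption.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def ec_profile(gene_to_ec: Dict[str, Optional[str]],
               total_genes: int) -> Dict[str, Tuple[int, float]]:
    """Return {EC number: (raw count, percentage of all genes)}."""
    counts = Counter(ec for ec in gene_to_ec.values() if ec)
    return {ec: (n, 100.0 * n / total_genes) for ec, n in counts.items()}


# Hypothetical assignments: three of ten predicted genes received an EC number.
assignments = {"gene_1": "2.7.1.1", "gene_2": "2.7.1.1", "gene_3": "3.2.1.23"}
profile = ec_profile(assignments, total_genes=10)
print(profile["2.7.1.1"])
```

               Note that the denominator is the total number of predicted genes, not only those with an EC
               number, following the description above.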

               MEGAnnotator2 output files
               The number of output files provided by MEGAnnotator2 depends on the number of analyses defined
               in the parameters file. By default, the assembly and annotation of the microbial genomes ends
               with the generation of a GenBank file compatible with the Artemis genome browser [34]. The GenBank
               file reports information about the genome sequence, gene positions, and gene annotation. Furthermore,
               a comprehensive file (genome_info.txt) reports the main characteristics of the assembled microbial