
Lugli et al. Microbiome Res Rep 2023;2:15  https://dx.doi.org/10.20517/mrr.2022.21  Page 5 of 16

               comparison. The issue is that, to date, we do not possess the genome sequences of all known
               microorganisms. Thus, additional information, such as sequence similarity of the 16S/18S rRNA gene, can
               be helpful in studying uncommon microorganisms.


               Finally, MEGAnnotator2 is provided with a database of metabolic reactions collected from MetaCyc [20].
               Using this database, it is possible to obtain a profile of the enzymatic reactions attributable to the
               predicted genes of the microorganism under analysis.

               All databases will be updated every six months to overcome taxonomy re-classification issues and provide
               reliable output profiles. Support will end when updated methodologies supersede the current
               classification strategies, resulting in a reshaping of the pipeline and databases. The user can also provide
               custom databases to perform custom DNA filtering steps before assembly. These additional databases need
               to be compiled using the BWA aligner [21], as reported in the manual, starting from DNA sequences in fasta
               format.
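
As a rough illustration of compiling a custom filtering database, the sketch below wraps the standard `bwa index` command, which builds a BWA index from a fasta file. It assumes that this is the compilation step meant here; the manual remains the authoritative reference. The helper names `bwa_index_command` and `build_custom_database` and the file name `contaminants.fasta` are hypothetical.

```python
import shutil
import subprocess


def bwa_index_command(fasta_path, prefix=None):
    """Build the `bwa index` command for a fasta file of sequences to filter out."""
    command = ["bwa", "index"]
    if prefix is not None:
        command += ["-p", prefix]  # -p sets the prefix of the generated index files
    command.append(fasta_path)
    return command


def build_custom_database(fasta_path, prefix=None):
    """Run `bwa index`; assumes the bwa executable is available on PATH."""
    if shutil.which("bwa") is None:
        raise RuntimeError("bwa executable not found on PATH")
    subprocess.run(bwa_index_command(fasta_path, prefix), check=True)


if __name__ == "__main__":
    # Hypothetical input file; print the command instead of running it.
    print(" ".join(bwa_index_command("contaminants.fasta")))
```

Where the resulting index files must be placed so that MEGAnnotator2 finds them is not covered by this sketch and should be taken from the manual.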

               MEGAnnotator2 input files
               To run MEGAnnotator2, the user needs to provide DNA sequencing data in fastq format. Short reads,
               either single- or paired-end (Illumina or Ion Torrent data), can be used, as well as long reads (PacBio or
               Nanopore data). In this context, PacBio HiFi reads can only be used after conversion from BAM to fastq
               format. The pipeline can be executed in a Unix terminal with a single command, specifying the name of the
               project and the input data path. For example, the following three commands are based on paired-end,
               long-read, and mixed-read input, respectively:

               MEGAnnotator2 -t 60 -n project_name -p -f forward_input.fastq -r reverse_input.fastq


               MEGAnnotator2 -t 60 -n project_name -l -i input.fastq

               MEGAnnotator2 -t 60 -n project_name -o -i long_input.fastq -f forward_input.fastq -r reverse_input.fastq


               Alternatively, dedicated scripts are implemented in MEGAnnotator2 to automate the processing of the input
               data by generating a bash script that runs the samples in series, without the need to execute each command
               manually. For additional information on the execution of the program, see the manual.
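
To make the idea of a generated batch script concrete, the following minimal sketch writes one paired-end command per sample into a bash script. It is not the dedicated script shipped with MEGAnnotator2: the function name `build_batch_script` and the `<sample>_R1.fastq` / `<sample>_R2.fastq` naming convention are assumptions, and only the command-line flags shown in the examples above are used.

```python
import shlex
from pathlib import Path


def build_batch_script(read_dir, threads=60):
    """Return a bash script that runs each paired-end sample in series."""
    lines = ["#!/usr/bin/env bash", "set -e  # stop at the first failing sample"]
    # Assumed naming convention: <sample>_R1.fastq and <sample>_R2.fastq
    for forward in sorted(Path(read_dir).glob("*_R1.fastq")):
        sample = forward.name[: -len("_R1.fastq")]
        reverse = forward.with_name(f"{sample}_R2.fastq")
        if not reverse.exists():
            continue  # skip samples that lack a reverse read file
        lines.append(
            f"MEGAnnotator2 -t {threads} -n {shlex.quote(sample)} -p "
            f"-f {shlex.quote(str(forward))} -r {shlex.quote(str(reverse))}"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the script for the current directory; redirect to a file to use it.
    print(build_batch_script(".", threads=60), end="")
```

Because the commands are written one per line, the resulting script processes the samples one after another, which matches the serial execution described above.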


               Step 1: quality filtering of the data
               To provide more reliable results, we implemented a DNA filtering step, a feature absent in the previous
               version of MEGAnnotator [14]. By default, MEGAnnotator2 performs a quality filtering step aimed at
               removing DNA sequences that are too short or that display low quality. Based on the input file type, the
               pipeline performs either short read filtering (single- or paired-end, depending on the technology) or long
               read filtering of the data. To do so, the fastq-mcf utility (https://github.com/ExpressionAnalysis/ea-utils)
               is employed to filter short reads, removing by default reads shorter than 100 nucleotides and those with a
               quality < 20. Long reads are instead managed by Filtlong (https://github.com/rrwick/Filtlong), which by
               default removes reads shorter than 1,000 nucleotides and keeps the 90% of reads with the highest quality,
               up to a maximum of 500 Gb of data. Whenever both short and long read data are used as input, Filtlong
               evaluates the long read quality more accurately using k-mer matches to the short reads, improving the
               final genome quality. The user can manually edit all parameters to tailor the filtering step to specific
               needs.
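
The default thresholds described above can be illustrated with a small, self-contained sketch. It is not how fastq-mcf or Filtlong work internally: it assumes Phred+33 quality encoding, uses the mean read quality as the quality measure, ranks long reads by that mean alone, and leaves out both the data cap and the k-mer comparison against short reads. All function names are hypothetical.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a fastq quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from fastq lines, 4 lines per read."""
    stripped = (line.rstrip("\n") for line in lines)
    for header, seq, _plus, qual in zip(stripped, stripped, stripped, stripped):
        yield header, seq, qual


def filter_short_reads(records, min_length=100, min_quality=20):
    """Keep reads that are long enough and whose mean quality reaches the threshold."""
    for header, seq, qual in records:
        if len(seq) >= min_length and mean_phred(qual) >= min_quality:
            yield header, seq, qual


def filter_long_reads(records, min_length=1000, keep_fraction=0.9):
    """Drop reads below min_length, then keep the best fraction by mean quality."""
    long_enough = [rec for rec in records if len(rec[1]) >= min_length]
    long_enough.sort(key=lambda rec: mean_phred(rec[2]), reverse=True)
    return long_enough[: int(len(long_enough) * keep_fraction)]


if __name__ == "__main__":
    example = ["@read1\n", "A" * 120 + "\n", "+\n", "I" * 120 + "\n"]
    for header, seq, qual in filter_short_reads(parse_fastq(example)):
        print(header, len(seq), mean_phred(qual))
```

Changing `min_length`, `min_quality`, or `keep_fraction` mirrors the kind of parameter tuning the pipeline leaves to the user.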