Lugli et al. Microbiome Res Rep 2023;2:15 https://dx.doi.org/10.20517/mrr.2022.21 Page 5 of 16

comparison. The issue is that, to date, genome sequences are not available for all known
microorganisms. Thus, additional information, such as sequence similarity of the 16S/18S rRNA gene, can
be helpful in studying uncommon microorganisms.

Finally, MEGAnnotator2 is provided with a database comprising information regarding metabolic reactions
collected from MetaCyc [20]. Using this database, it is possible to build a profile of each enzymatic
reaction attributable to the predicted genes of the microorganism under analysis.

All databases will be updated every six months to overcome taxonomy re-classification issues and provide
reliable output profiles. Support will end when newer methodologies supersede the current
classification strategies, resulting in a reshaping of the pipeline and databases. The user can also provide
custom databases to perform custom DNA filtering steps before assembly. These additional databases need
to be compiled using the BWA aligner [21] as reported in the manual, starting from DNA sequences in fasta
format.
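
As a minimal sketch of this compilation step, assuming the procedure in the manual reduces to BWA's
standard indexing command (bwa index), a custom database could be prepared as follows. The function
names and the example file name are hypothetical, and the location where MEGAnnotator2 expects the
resulting index files is defined in the manual, not here.

```shell
# Return success if the file is non-empty and its first line is a FASTA header.
is_fasta() {
    [ -s "$1" ] && head -n 1 "$1" | grep -q '^>'
}

# Index a FASTA file with BWA after a basic format check.
# "bwa index" writes .amb, .ann, .bwt, .pac and .sa files next to the input.
build_custom_db() {
    fasta="$1"
    if ! is_fasta "$fasta"; then
        echo "error: $fasta does not look like a FASTA file" >&2
        return 1
    fi
    bwa index "$fasta"
}
```

A call such as build_custom_db custom_sequences.fasta (hypothetical file name) would then produce the
BWA index files alongside the input.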

MEGAnnotator2 input files
To run MEGAnnotator2, the user needs to provide DNA sequencing data in fastq format. Single- or
paired-end short reads (Illumina or Ion Torrent data) can be used, as well as long reads (PacBio or
Nanopore data). In this context, PacBio HiFi reads can only be used after conversion from BAM to fastq format.
The pipeline can be executed in a Unix terminal with a single command, specifying the name of the project
and the input data path. For example, the following three commands are based on paired-end, long-read, and
mixed-read input:

MEGAnnotator2 -t 60 -n project_name -p -f forward_input.fastq -r reverse_input.fastq

MEGAnnotator2 -t 60 -n project_name -l -i input.fastq

MEGAnnotator2 -t 60 -n project_name -o -i long_input.fastq -f forward_input.fastq -r reverse_input.fastq

Alternatively, dedicated scripts are implemented in MEGAnnotator2 to automate the processing of the input
data, generating a bash script that runs samples in series without the need to execute individual commands.
For additional information on the execution of the program, see the manual.
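
The idea of running samples in series can be sketched as follows. This is not the helper script shipped
with MEGAnnotator2, only an illustration that reuses the paired-end command shown above; the function
name, the <sample>_R1.fastq / <sample>_R2.fastq naming convention, and the output script name are
assumptions.

```shell
# Write one paired-end MEGAnnotator2 command per sample into a bash script.
# Assumes reads are named <sample>_R1.fastq and <sample>_R2.fastq (hypothetical convention).
# $1: input directory, $2: script to write, $3: threads (defaults to 60 as in the examples above)
make_batch_script() {
    input_dir="$1"
    out_script="$2"
    threads="${3:-60}"
    printf '#!/bin/bash\n' > "$out_script"
    for fwd in "$input_dir"/*_R1.fastq; do
        [ -e "$fwd" ] || continue            # glob matched nothing
        sample=$(basename "$fwd" _R1.fastq)
        rev="$input_dir/${sample}_R2.fastq"
        if [ ! -e "$rev" ]; then
            echo "warning: no reverse file for $sample, skipped" >&2
            continue
        fi
        # Paths are quoted in the generated script; sample names are assumed to have no spaces.
        printf 'MEGAnnotator2 -t %s -n %s -p -f "%s" -r "%s"\n' \
            "$threads" "$sample" "$fwd" "$rev" >> "$out_script"
    done
}
```

For example, make_batch_script ./reads run_all.sh followed by bash run_all.sh would process every
complete pair found in ./reads, one sample after another.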

Step 1: quality filtering of the data
To provide more reliable results, we implemented a DNA filtering step, a feature absent in the previous
version of MEGAnnotator [14]. By default, MEGAnnotator2 performs a quality filtering step aimed at
removing DNA sequences that are too short or of low quality. Based on the input file type, the
pipeline performs short-read filtering (single- or paired-end, based on the technology) or long-read
filtering of the data. The fastq-mcf utility (https://github.com/ExpressionAnalysis/ea-utils) is
employed to filter short reads, removing by default reads shorter than 100 nucleotides and
those with a quality < 20. Long reads are instead managed by Filtlong
(https://github.com/rrwick/Filtlong), which by default removes reads shorter than 1,000 nucleotides and
keeps the 90% of reads with the highest quality, not exceeding 500 Gb of data. Whenever both short- and
long-read data are used as input, Filtlong evaluates long-read quality using k-mer matches to the short
reads, improving the final genome quality. The user can manually edit all parameters to tailor the
filtering step to their needs.
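
To make the short-read thresholds concrete, the sketch below applies a length cutoff and a quality
cutoff to a four-line-per-record fastq file. It is not how fastq-mcf works internally (that tool
performs quality-based trimming); summarizing each read by its mean Phred score, the Phred+33 encoding,
and the function name are all assumptions made for illustration.

```shell
# Keep reads with length >= min_len and mean Phred quality >= min_q.
# $1: fastq file (4 lines per record, Phred+33 assumed)
# $2: minimum length (default 100), $3: minimum mean quality (default 20)
filter_short_reads() {
    awk -v min_len="${2:-100}" -v min_q="${3:-20}" '
    BEGIN {
        # Lookup table from quality character to Phred score (ASCII code minus 33).
        for (i = 33; i <= 126; i++) phred[sprintf("%c", i)] = i - 33
    }
    {
        rec[NR % 4] = $0                 # 1: header, 2: sequence, 3: "+", 0: quality
        if (NR % 4 == 0) {
            n = length(rec[2])
            total = 0
            for (i = 1; i <= n; i++) total += phred[substr(rec[0], i, 1)]
            if (n >= min_len && n > 0 && total / n >= min_q)
                printf "%s\n%s\n%s\n%s\n", rec[1], rec[2], rec[3], rec[0]
        }
    }' "$1"
}
```

Calling filter_short_reads input.fastq > filtered.fastq uses the defaults quoted above, while
filter_short_reads input.fastq 50 25 shows how the thresholds could be changed.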
