Page 8 of 16 Lugli et al. Microbiome Res Rep 2023;2:15 https://dx.doi.org/10.20517/mrr.2022.21
genomes, including the amount of the DNA sequencing output, number of filtered reads, number of
assembled contigs, genome length, average coverage, completeness of the genome and its contamination
level, number of genes, rRNA genes, and tRNA genes, and species prediction based on the 16S/18S rRNA
gene sequence and ANI values of the chromosomal sequence.
Additional files are produced to allow the user to evaluate the results of each step of the pipeline. These
files report the quality of the genome sequence (checkM_report), the results of the 16S/18S rRNA
gene alignment (16S.blastn or 18S.blastn), the protein sequences of the predicted genes (aaORFs.fasta), the
sequence of the assembled contigs (contigs.fasta) and, if requested, a report of long read sequence polishing
(polishing_report.txt).
In addition, multiple folders are provided, containing data regarding the main steps of genome processing.
Filtered reads are stored as FASTQ files in the folder “filtered_reads” together with, if requested, HTML files
reporting the quality of raw and filtered reads. Genome alignment of the assembled data with respect to
the reference genome retrieved from the ANI database is located in the folder “mauve_alignment” and can
be visualized by using MAUVE. Furthermore, the assembly documentation produced by SPAdes or CANU
is located in the “assembly” folder, including statistics, assembly steps, and logs. Finally, the folder named
“metabolic_reactions” contains the results of the metabolic profiling, if requested by the user.
When multiple microbial strains are analyzed in tandem with MEGAnnotator2, folders are
generated for each analyzed genome, named with the microorganism code (project_name).
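The output layout described above can be checked with a short script. The following is a minimal sketch, not part of MEGAnnotator2: the file and folder names are taken from the text, but the assumption that they all sit directly inside the project folder, and the helper name missing_outputs, are hypothetical and introduced only for illustration.

```python
from pathlib import Path

# Names taken from the text. Their placement directly inside the project
# folder is an assumption made for this sketch.
EXPECTED_FILES = ["checkM_report", "contigs.fasta", "aaORFs.fasta"]
RRNA_ALIGNMENTS = ["16S.blastn", "18S.blastn"]  # one of the two is expected
EXPECTED_DIRS = ["filtered_reads", "mauve_alignment", "assembly"]
# Outputs produced only on request (polishing_report.txt, metabolic_reactions)
# are deliberately not checked here.


def missing_outputs(project_dir):
    """Return the names of expected outputs not found in project_dir."""
    root = Path(project_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    if not any((root / name).is_file() for name in RRNA_ALIGNMENTS):
        missing.append("16S.blastn or 18S.blastn")
    missing += [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

Calling missing_outputs on a project folder returns an empty list when every expected file and folder is present, and otherwise lists what was not found.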
RESULTS AND DISCUSSION
MEGAnnotator2 performance and statistics using short reads
This work aims to deliver a complete pipeline to manage any sequencing output and provide the user with
statistics and biological information about the assembled microorganism. Thus, each software tool
available online and included in the pipeline was chosen based on recent scientific literature highlighting its
performance with respect to other tools [30,31,35].
To test the whole pipeline, we used one million short reads belonging to 10 microbial species characterized
by different genome sizes, ranging from two Mb to five Mb [Table 1]. The machine used to benchmark the
pipeline was equipped with a 32-core AMD Threadripper and 256 GB of RAM. Storage read and
write operations were handled by a 2 TB NVMe M.2 SSD. The average execution time of the complete
pipeline was 14.2 min, while mandatory steps (assembly and annotation) took an average of 5.6 min to be
executed. Figure 2 reports the individual timing of each step, with the genome quality step being
the most time-consuming (median of 269 s), followed by gene prediction and annotation (median of
175 s) and genome assembly (median of 163 s) [Figure 2A and Supplementary Table 1]. An example
of the relevant statistics provided by MEGAnnotator2 to the user is reported in Table 1; these statistics are
available as a text document among the results once the pipeline has finished.
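To put the per-step timings on the same scale as the overall figures, the reported medians can be converted from seconds to minutes. This is only an illustrative sketch: the three values are the medians quoted in the text, and a sum of medians is not the same quantity as the average reported for the mandatory steps, so the comparison is approximate.

```python
# Median step times in seconds, as quoted in the text for the short-read benchmark.
median_seconds = {
    "genome quality": 269,
    "gene prediction and annotation": 175,
    "genome assembly": 163,
}


def to_minutes(seconds):
    """Convert a duration in seconds to minutes."""
    return seconds / 60


for step, sec in median_seconds.items():
    print(f"{step}: {to_minutes(sec):.1f} min")

# Mandatory steps (assembly and annotation), summed from their medians.
mandatory = (
    median_seconds["genome assembly"]
    + median_seconds["gene prediction and annotation"]
)
print(f"assembly + annotation (sum of medians): {to_minutes(mandatory):.1f} min")
```

The sum of the two medians, 338 s, lies close to the 5.6 min average reported for the mandatory steps.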
Performance and statistics of the pipeline using long reads
Compared with short read analyses, the use of long read sequences resulted in a more time-consuming procedure
due to the implementation of dedicated filtering and assembly algorithms. To benchmark the efficiency of
MEGAnnotator2 using long reads, an additional 10 microbial strains were subjected to genome assembly and
annotation using long reads, and a further 10 microbial strains using a combination of long and short reads
[Tables 2 and 3]. Notably, to simulate a real-world scenario, the genomes of the microorganisms
used in the hybrid approach were larger than five Mb, except for one [Table 3]. Testing has been performed