Page 8 of 16                  Lugli et al. Microbiome Res Rep 2023;2:15  https://dx.doi.org/10.20517/mrr.2022.21

               genomes, including the amount of the DNA sequencing output, number of filtered reads, number of
               assembled contigs, genome length, average coverage, completeness of the genome and its contamination
               level, number of genes, rRNA genes, and tRNA genes, and species prediction based on the 16S/18S rRNA
               gene sequence and ANI values of the chromosomal sequence.


               Additional files are produced to allow the user to evaluate the results of each step of the pipeline. These
               files report the quality of the genome sequence (checkM_report), the results of the 16S/18S rRNA
               gene alignment (16S.blastn or 18S.blastn), the collection of predicted protein sequences (aaORFs.fasta), the
               sequence of the assembled contigs (contigs.fasta) and, if requested, the long read sequence polishing
               (polishing_report.txt).
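
               The report files listed above lend themselves to a quick completeness check after a run. The
               following is a minimal sketch, not part of MEGAnnotator2: the function names are hypothetical,
               the file names are taken as written in the text (the actual files may carry extensions), and
               their location directly inside the project folder is an assumption.

```python
from pathlib import Path

# File names as written in the text. Their exact spelling on disk and their
# location directly inside the project folder are assumptions of this sketch.
ALWAYS_EXPECTED = ["checkM_report", "aaORFs.fasta", "contigs.fasta"]
RRNA_ALIGNMENTS = ["16S.blastn", "18S.blastn"]  # one of the two is expected
OPTIONAL = ["polishing_report.txt"]  # only produced when polishing is requested


def check_reports(project_dir):
    """Map each report file name to True if it exists in project_dir."""
    root = Path(project_dir)
    names = ALWAYS_EXPECTED + RRNA_ALIGNMENTS + OPTIONAL
    return {name: (root / name).is_file() for name in names}


def missing_reports(project_dir):
    """List the expected reports that are absent (optional files are ignored)."""
    status = check_reports(project_dir)
    missing = [name for name in ALWAYS_EXPECTED if not status[name]]
    # Either the 16S or the 18S alignment satisfies the rRNA requirement.
    if not any(status[name] for name in RRNA_ALIGNMENTS):
        missing.append("16S.blastn or 18S.blastn")
    return missing
```

               Calling missing_reports on a project folder returns an empty list when every mandatory report is
               present; the polishing report is treated as optional because it is only produced on request.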


               In addition, multiple folders are provided, containing data regarding the main steps of genome processing.
               Filtered reads are stored as FASTQ files in the folder “filtered_reads” together with, if requested, HTML
               files reporting the quality of the raw and filtered reads. The alignment of the assembled genome against
               the reference genome retrieved from the ANI database is located in the folder “mauve_alignment” and can
               be visualized by using MAUVE. Furthermore, the assembly documentation produced by SPAdes or CANU
               is located in the “assembly” folder, including statistics, assembly steps, and logs. Finally, the folder named
               “metabolic_reactions” contains the results of the metabolic profiling, if requested by the user.


               When multiple microbial strains are analyzed in tandem with MEGAnnotator2, a separate folder is
               generated for each analyzed genome, named with the microorganism code (project_name).
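
               When several genomes are processed together, the per-project layout described above can be
               inspected programmatically. The sketch below is illustrative only: the function name is
               hypothetical, and it assumes that each project folder sits directly under a common output
               directory and contains the step folders named in the text.

```python
from pathlib import Path

# Step folders named in the text; some exist only if the step was requested.
STEP_FOLDERS = ["filtered_reads", "mauve_alignment", "assembly", "metabolic_reactions"]


def summarize_projects(output_root):
    """Map each project folder name to the step folders found inside it."""
    summary = {}
    for project in sorted(Path(output_root).iterdir()):
        if project.is_dir():  # plain files in the output root are skipped
            summary[project.name] = [
                folder for folder in STEP_FOLDERS if (project / folder).is_dir()
            ]
    return summary
```

               A project whose list lacks “metabolic_reactions” is not necessarily incomplete, since that
               folder is only expected when metabolic profiling was requested.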


               RESULTS AND DISCUSSION
               MEGAnnotator2 performance and statistics using short reads
               This work aims to deliver a complete pipeline to manage any sequencing output and provide the user with
               statistics and biological information about the assembled microorganism. Thus, each software tool
               included in the pipeline is available online and was chosen based on recent scientific literature
               highlighting its performance with respect to other tools [30,31,35].


               To test the whole pipeline, we used one million short reads belonging to 10 microbial species characterized
               by different genome sizes, ranging from two Mb to five Mb [Table 1]. The machine used to benchmark the
               pipeline was equipped with an AMD Threadripper with 32 cores and 256 GB of RAM. Memory read and
               write operations were managed by a 2 TB NVMe M.2 SSD. The average execution time of the complete
               pipeline was 14.2 min, while the mandatory steps (assembly and annotation) took an average of 5.6 min to be
               executed. Figure 2 reports the individual timing of each step, with the genome quality step being
               the most time-consuming (median of 269 sec), followed by gene prediction and annotation (median of
               175 sec) and genome assembly (median of 163 sec) [Figure 2A and Supplementary Table 1]. An example
               of the relevant statistics provided by MEGAnnotator2 to the user is reported in Table 1; these statistics
               are written to a text document once the pipeline has finished.
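
               The timings quoted above can be put in proportion with a short calculation. This is a worked
               arithmetic sketch using only the figures reported in the text; the variable and function names
               are illustrative, and note that the overall figures are averages while the per-step figures
               are medians, so the two sets are not directly additive.

```python
# Figures quoted in the text (average runtimes in minutes, median step times in seconds).
MEAN_TOTAL_MIN = 14.2      # complete pipeline
MEAN_MANDATORY_MIN = 5.6   # assembly and annotation only
MEDIAN_STEP_SEC = {
    "genome quality": 269,
    "gene prediction and annotation": 175,
    "genome assembly": 163,
}


def mandatory_share(mandatory_min, total_min):
    """Fraction of the total runtime spent in the mandatory steps."""
    return mandatory_min / total_min


def seconds_to_minutes(seconds):
    """Convert a duration in seconds to minutes."""
    return seconds / 60


share = mandatory_share(MEAN_MANDATORY_MIN, MEAN_TOTAL_MIN)
print(f"Mandatory steps: {share:.1%} of the average total runtime")
for step, sec in MEDIAN_STEP_SEC.items():
    print(f"{step}: {seconds_to_minutes(sec):.1f} min (median)")
```

               On these numbers, the mandatory steps account for roughly 39% of the average total runtime,
               and the median genome quality step alone takes about 4.5 min.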

               Performance and statistics of the pipeline using long reads
               Compared with short read analyses, the use of long read sequences resulted in a more time-consuming procedure
               due to the implementation of dedicated filtering and assembly algorithms. To benchmark the efficiency of
               MEGAnnotator2 using long reads, 10 additional microbial strains were subjected to genome assembly and
               annotation using long reads only, and a further 10 microbial strains using a combination of long and short reads
               [Tables 2 and 3]. Notably, to simulate a real-world scenario, the genomes of the microorganisms
               used in the hybrid approach were larger than five Mb, with one exception [Table 3]. Testing has been performed