Page 14 of 16                 Lugli et al. Microbiome Res Rep 2023;2:15  https://dx.doi.org/10.20517/mrr.2022.21

               Benchmark of synthetic datasets
               Short- and long-read synthetic datasets were produced from complete genome sequences downloaded from
               the NCBI repository. In this context, the genome sequences of Bifidobacterium bifidum ATCC 29521,
               Pseudomonas aeruginosa ATCC 27853, Escherichia coli K-12, Streptococcus pneumoniae TIGR4, Clostridium
               perfringens JXJA17, and Salmonella enterica MAC15 were chosen, to cover genomes ranging from two to
               seven Mb [Supplementary Table 5]. The tool wgsim (https://github.com/lh3/wgsim) was used to generate
               one million synthetic short-read sequences and 150,000 synthetic long-read sequences per genome. Then,
               the MEGAnnotator2 pipeline was employed to simulate the genome assemblies of each microorganism
               using a combination of synthetic short and long reads. The results showed that genome integrity was
               higher when long reads were used, allowing the reconstruction of repetitive genome regions that were
               lost with short reads alone, as reflected by larger genome sizes and higher numbers of identified rRNA
               genes [Supplementary Table 5]. The execution time of each pipeline step confirmed the trends previously
               observed with real samples [Supplementary Table 6]. Hybrid and long-read strategies were more
               time-consuming, taking twice the assembly time of short-read assemblies, as was the filtering step for
               long-read sequences [Supplementary Table 6].
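
The read-simulation step can be sketched as a small Python helper that assembles a wgsim command line. This is only an illustration under stated assumptions: `-N` (number of read pairs), `-1`/`-2` (read lengths), and `-S` (random seed) are wgsim options, but the read length, the seed, the file names, and the interpretation of "one million" as read pairs are hypothetical, since the text does not report them. The settings used for the long-read datasets are likewise not given and are not guessed here.

```python
import shlex


def wgsim_command(reference_fasta, out_prefix, n_pairs=1_000_000,
                  read_len=150, seed=42):
    """Build an argument list for wgsim (short-read simulation only).

    read_len, seed, and the output file names are illustrative
    assumptions, not values taken from the paper.
    """
    return [
        "wgsim",
        "-N", str(n_pairs),    # number of read pairs to simulate
        "-1", str(read_len),   # length of the first read
        "-2", str(read_len),   # length of the second read
        "-S", str(seed),       # fixed seed for reproducibility
        reference_fasta,
        f"{out_prefix}_R1.fq",
        f"{out_prefix}_R2.fq",
    ]


# Hypothetical input and output names.
cmd = wgsim_command("genome.fasta", "synthetic_short")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where wgsim is installed.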


               Furthermore, the assembly of complex samples was simulated using a limited number of short-read
               sequences, i.e., 100,000 reads per genome. This synthetic benchmark aimed to test the pipeline when the
               quality of the sequencing reads is lower than expected, leaving only a small number of DNA sequences to
               assemble. In this scenario, the reconstructed genomes had low average coverage, ranging from 8
               to 25, but the integrity of the genomes was maintained, resulting in genome completeness ranging from
               96.43% to 99.2% [Supplementary Table 5]. Altogether, the MEGAnnotator2 report showed that the complex
               genome structure of Pseudomonas aeruginosa ATCC 27853 was difficult to assemble, resulting in 663
               contigs [Supplementary Table 5].
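
The link between read count and average coverage can be approximated as total sequenced bases divided by genome length. The sketch below applies that textbook estimate to the genome size range mentioned above. The read length is a hypothetical value, as the text does not state it, so the printed figures are not expected to reproduce the coverage values reported by the pipeline.

```python
def expected_coverage(n_reads: int, read_len: int, genome_size: int) -> float:
    """Approximate average depth: total sequenced bases / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_len / genome_size


READ_LEN = 150      # hypothetical read length in bases (not from the paper)
N_READS = 100_000   # reduced read count used in the low-input benchmark

# Genome sizes spanning the two to seven Mb range covered by the benchmark.
for genome_mb in (2, 7):
    depth = expected_coverage(N_READS, READ_LEN, genome_mb * 1_000_000)
    print(f"{genome_mb} Mb genome: ~{depth:.1f}x expected coverage")
```

At a fixed read count, the estimate falls as genome size grows, which is why larger genomes sit at the lower end of a coverage range.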

               CONCLUSIONS
               MEGAnnotator2 is a pipeline that handles all the current output formats of modern DNA sequencing
               systems, including short and long reads. Most of the associated software has been replaced to
               improve the quality of the results and the execution time of the pipeline [Table 1 and Figure 2].
               Furthermore, additional features such as read quality filtering, a quality check of DNA and assembled
               genomes, and metabolic profiling have been added to provide the user with more information and flexibility
               in the execution of programs. Notably, the execution time has decreased 43-fold compared with the
               previous pipeline version, and multiple genomes can be processed in series to avoid idle time between
               genome analyses. Furthermore, the pipeline installation requires no additional actions from the user, and
               the disk space occupied by the functional annotation database has been reduced 80-fold. Altogether,
               MEGAnnotator2 provides all the features needed for the genome reconstruction of prokaryotes and
               unicellular eukaryotes and can be easily extended by the user with additional features thanks to the
               modular architecture of the pipeline.

               DECLARATIONS
               Acknowledgments
               We thank GenProbio Srl for the financial support of the Laboratory of Probiogenomics. Part of this research
               was conducted using the High Performance Computing (HPC) facility of the University of Parma.


               Authors’ contributions
               Manuscript writing and pipeline implementation: Lugli GA