Page 14 of 16 Lugli et al. Microbiome Res Rep 2023;2:15 https://dx.doi.org/10.20517/mrr.2022.21
Benchmark of synthetic datasets
Short- and long-read synthetic datasets were produced from complete genome sequences downloaded from
the NCBI repository. Specifically, the genome sequences of Bifidobacterium bifidum ATCC 29521,
Pseudomonas aeruginosa ATCC 27853, Escherichia coli K-12, Streptococcus pneumoniae TIGR4, Clostridium
perfringens JXJA17, and Salmonella enterica MAC15 were chosen to cover genome sizes ranging from two to
seven Mb [Supplementary Table 5]. The tool wgsim (https://github.com/lh3/wgsim) was used to generate
one million synthetic short-read sequences and 150,000 synthetic long-read sequences per genome. Then,
the MEGAnnotator2 pipeline was employed to simulate the genome assembly of each microorganism
using a combination of synthetic short and long reads. The results showed that including long reads
yielded more complete genomes, allowing the reconstruction of repetitive genome regions that were
lost with short reads alone, as reflected by larger genome sizes and higher numbers of identified rRNA
genes [Supplementary Table 5]. The execution time of each pipeline step confirmed the data previously
observed with real samples [Supplementary Table 6]. Hybrid and long-read strategies were more
time-consuming, both in the assembly step, which took twice as long as for short-read assemblies, and in
the filtering step of long-read sequences [Supplementary Table 6].
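The read simulation step described above can be sketched as a small helper that builds a wgsim command line for one reference genome. This is a minimal sketch under stated assumptions, not the authors' actual invocation: the file names, the read length, and the seed are hypothetical, since this section does not report them. The flags used (`-N` for the number of read pairs, `-1` and `-2` for the two read lengths, `-S` for the random seed) are standard wgsim options; note that `-N` counts pairs, and the text does not say whether "one million" refers to reads or pairs.

```python
from typing import List


def wgsim_command(
    reference_fasta: str,
    out_fq1: str,
    out_fq2: str,
    n_pairs: int,
    read_length: int,
    seed: int = 42,
) -> List[str]:
    """Build (but do not run) a wgsim command for one reference genome.

    All argument values are supplied by the caller; none of them are
    taken from the paper except where the caller chooses to do so.
    """
    if n_pairs <= 0 or read_length <= 0:
        raise ValueError("n_pairs and read_length must be positive")
    return [
        "wgsim",
        "-N", str(n_pairs),      # number of read pairs to simulate
        "-1", str(read_length),  # length of the first read
        "-2", str(read_length),  # length of the second read
        "-S", str(seed),         # fixed seed for reproducibility
        reference_fasta,         # input reference genome (FASTA)
        out_fq1,                 # output FASTQ, first reads
        out_fq2,                 # output FASTQ, second reads
    ]


# Hypothetical example: file names and the 150 bp read length are assumptions.
cmd = wgsim_command("genome.fa", "reads_1.fq", "reads_2.fq", 1_000_000, 150)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a system where wgsim is installed; building the command separately keeps the parameters easy to inspect and log for each genome.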
Furthermore, the assembly of complex samples was simulated using a limited number of short-read
sequences, i.e., 100,000 reads per genome. This synthetic benchmark aimed to test the pipeline when the
quality of the sequencing reads is lower than expected, leaving only a small number of DNA sequences to
assemble. In this scenario, the genomes were reconstructed with low average coverage, ranging from 8
to 25, but the integrity of the genomes was maintained, resulting in genome completeness ranging from
96.43% to 99.2% [Supplementary Table 5]. Altogether, the MEGAnnotator2 report showed that the complex
genome structure of Pseudomonas aeruginosa ATCC 27853 was difficult to assemble, resulting in 663
contigs [Supplementary Table 5].
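The link between a reduced read count and low average coverage follows from the usual expected-coverage relation, coverage = (number of reads × read length) / genome size. The sketch below illustrates it with hypothetical values: the 250 bp read length is an assumption, since the read length is not reported in this section, so the result is not meant to reproduce the values in Supplementary Table 5.

```python
def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Expected average coverage: total sequenced bases over genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


# Hypothetical example: 100,000 reads of 250 bp (assumed length)
# over a 2 Mb genome.
print(expected_coverage(100_000, 250, 2_000_000))  # prints 12.5
```

For a fixed number of reads, the expected coverage falls in proportion to genome size, which is consistent with a range of coverage values across genomes of two to seven Mb.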
CONCLUSIONS
MEGAnnotator2 is a pipeline that handles the data formats produced by current DNA sequencing
systems, including short and long reads. Most of the associated software has been changed to
improve the quality of the results and the execution time of the pipeline [Table 1 and Figure 2].
Furthermore, additional features such as read quality filtering, a quality check of DNA and assembled
genomes, and metabolic profiling have been added to provide the user with more information and flexibility
in the execution of programs. Notably, the execution time has decreased 43-fold compared with the
previous pipeline version, and multiple genomes can be processed in series to avoid wasting time between
genome analyses. Moreover, the pipeline installation does not require additional actions from the user, and
the disk space occupied by the functional annotation database has been reduced 80-fold. Altogether,
MEGAnnotator2 provides all the features needed for the genome reconstruction of prokaryotes and
unicellular eukaryotes, and users can easily extend it with additional features thanks to the modular
architecture of the pipeline.
DECLARATIONS
Acknowledgments
We thank GenProbio Srl for the financial support of the Laboratory of Probiogenomics. Part of this research
was conducted using the High Performance Computing (HPC) facility of the University of Parma.
Authors’ contributions
Manuscript writing and pipeline implementation: Lugli GA