Page 2 of 16 Lugli et al. Microbiome Res Rep 2023;2:15 https://dx.doi.org/10.20517/mrr.2022.21
INTRODUCTION
Since the publication of the first complete genome sequence of Haemophilus influenzae in 1995 [1],
whole genome sequencing (WGS) has been the gold standard for the reconstruction of microbial genome
sequences. WGS was an efficient strategy that allowed gathering random DNA sequences of a microbial
genome, which were then used to reconstruct the entire chromosome sequence using mathematical algorithms [2].
Nowadays, the most common DNA sequencing technologies used for the reconstruction of genomes are
Illumina, followed by Pacific Biosciences and Oxford Nanopore [3,4]. While the first is
widely used for its ability to produce a massive amount of high-quality data, it relies on the production of
short DNA sequences ranging from 150 to 250 bp [5]. In contrast, PacBio and Nanopore sequencing systems are
chosen for the genome reconstruction of microorganisms thanks to their ability to produce
long DNA sequences of up to 40,000 bp [6]. However, the latter technologies, also known as third-generation
sequencers, display some limitations in accuracy and throughput with respect to short-read sequencing.
Nonetheless, the advent of long-read DNA sequencers has improved the draft assembly of microbial
genomes, producing complete genome sequences [7], and, recently, the implementation of PacBio HiFi reads
has drastically improved the final quality of long-read data.
Sequenced genomic data then need to be processed by bioinformatic tools to reconstruct the
chromosomal sequences and unveil their genomic repertoire [8,9]. Thus, software for assembling and
annotating microbial genomes has been implemented to process and manage such DNA data [10-13]. In 2016,
the MEGAnnotator pipeline was implemented to provide researchers with automated in silico tools for
analyzing prokaryotic genomes [14]. Nowadays, many pipelines have been implemented to ease the genome
assembly and annotation process [15,16]. Nevertheless, selecting free software that manages all types of
sequenced DNA and can be used in a local environment is still highly challenging.
Here, we describe the improved bioinformatic pipeline MEGAnnotator2 that allows the assembly of
prokaryotic genomes and chromosomes from unicellular eukaryotes, followed by gene prediction,
functional annotation, and DNA quality evaluation of the reconstructed genome sequences. The pipeline
can manage data from every NGS platform and modern third-generation sequencers such as PacBio and
Nanopore long reads. Furthermore, each analysis step is automated and managed by a bash script, which
coordinates freely available online software and custom databases that are kept continuously updated to
overcome issues related to taxonomy re-classification.
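
The stage order described above can be sketched in a few lines of Python. This is only an illustrative outline, not the MEGAnnotator2 bash script itself: the names `AssemblyMode`, `STAGES`, `choose_assembly_mode`, and `plan_run` are hypothetical, and the sketch assumes that the input consists of short reads, long reads, or both.

```python
from enum import Enum
from typing import List, Optional


class AssemblyMode(Enum):
    """Hypothetical labels for the kind of reads supplied by the user."""
    SHORT = "short reads only"
    LONG = "long reads only"
    HYBRID = "short and long reads"


# Stage order paraphrased from the text; the strings are illustrative labels,
# not names of real MEGAnnotator2 commands or options.
STAGES = (
    "assembly",
    "gene prediction",
    "functional annotation",
    "quality evaluation",
)


def choose_assembly_mode(short_reads: Optional[str],
                         long_reads: Optional[str]) -> AssemblyMode:
    """Pick an assembly mode from whichever read files were provided."""
    if short_reads and long_reads:
        return AssemblyMode.HYBRID
    if long_reads:
        return AssemblyMode.LONG
    if short_reads:
        return AssemblyMode.SHORT
    raise ValueError("at least one read set is required")


def plan_run(short_reads: Optional[str] = None,
             long_reads: Optional[str] = None) -> List[str]:
    """Return the ordered stages, tagging the assembly step with its mode."""
    mode = choose_assembly_mode(short_reads, long_reads)
    return [
        f"{stage} ({mode.value})" if stage == "assembly" else stage
        for stage in STAGES
    ]
```

For example, `plan_run(long_reads="sample_long.fastq")` returns the four stages in order, with the first one labelled as a long-read assembly; the file name is a placeholder.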
MATERIALS AND METHODS
MEGAnnotator2 workflow
MEGAnnotator2 is a bash script that runs on Linux under the GNU General Public License (GPL). The
complete workflow, reported in Figure 1, shows the different steps managed by the pipeline through the
coordination of freely available software programs. Complete execution of the pipeline starts from the
filtering of the raw sequencing data, providing statistics on the quality of the sequenced DNA as well as the
filtered DNA that will be used for the assembly of the microbial genome. Based on the sequencing
technology (short reads, long reads, or both), a specific assembly strategy is employed, resulting in one or
more consensus sequences of the microbial chromosomes. Then, a quality assessment of the assembled data
is performed to highlight the genome quality and the species relatedness. The latter information is used
to reorder contigs based on the reference strain of the identified species. The pipeline then predicts
the coding genes (as well as non-coding genes) and infers their function using similarity
searches against the custom NCBI RefSeq database and a domain search in the InterProScan database. The gathered
data are used to generate a GenBank file that stores all biological information, while all main statistics are
reported in a text file. Finally, the pipeline performs a metabolic screening to retrieve each