Page 2 of 16                  Lugli et al. Microbiome Res Rep 2023;2:15  https://dx.doi.org/10.20517/mrr.2022.21

               INTRODUCTION
               Since 1995, whole genome sequencing (WGS) has been the gold standard for the reconstruction of
               microbial genome sequences, starting with the publication of the first complete genome sequence of
               Haemophilus influenzae [1]. WGS proved to be an efficient strategy in which random DNA sequences
               gathered from a microbial genome are used to reconstruct the entire chromosome sequence using
               mathematical algorithms [2]. Nowadays, the most common DNA sequencing technologies used for the
               reconstruction of genomes are Illumina, followed by Pacific Biosciences (PacBio) and Oxford
               Nanopore [3,4]. While Illumina is widely used for its ability to produce a massive amount of
               high-quality data, it relies on the production of short DNA sequences ranging from 150 to 250 bp [5].
               In contrast, PacBio and Nanopore sequencing systems are chosen for the genome reconstruction of
               microorganisms thanks to their ability to produce long DNA sequences of up to 40,000 bp [6]. However,
               these latter technologies, also known as third-generation sequencers, display some limitations in
               accuracy and throughput with respect to short-read sequencing. Nonetheless, the advent of long-read
               DNA sequencers has improved the draft assembly of microbial genomes, making it possible to produce
               complete genome sequences [7], and, recently, the introduction of PacBio HiFi reads has drastically
               improved the final quality of long-read DNA data.

               Accordingly, sequenced genomic data need to be processed by bioinformatic tools to reconstruct the
               chromosomal sequences and unveil their genomic repertoire [8,9]. Thus, software for assembling and
               annotating microbial genomes has been developed to process and manage such DNA data [10-13]. In 2016,
               the MEGAnnotator pipeline was implemented to provide researchers with automated in silico tools for
               analyzing prokaryotic genomes [14]. Nowadays, many pipelines have been implemented to ease the genome
               assembly and annotation process [15,16]. Nevertheless, selecting free software that manages all types
               of sequenced DNA and can be used in a local environment is still highly challenging.


               Here, we describe the improved bioinformatic pipeline MEGAnnotator2, which allows the assembly of
               prokaryotic genomes and chromosomes from unicellular eukaryotes, followed by gene prediction,
               functional annotation, and DNA quality evaluation of the reconstructed genome sequences. The pipeline
               can manage data from every NGS platform as well as long reads from modern third-generation sequencers
               such as PacBio and Nanopore. Furthermore, each analysis step is automated and managed by a bash
               script, which coordinates freely available online software and custom databases that are kept
               continuously updated to overcome issues related to taxonomy re-classification.
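
Accepting input from both short-read platforms and long-read sequencers implies choosing an assembly route from the data supplied. The following is a minimal sketch of such a dispatch in Python. The function name, the route labels, and the file names are hypothetical and serve only as illustration; the pipeline itself is a bash script, and its actual logic is not reproduced here.

```python
from typing import Optional


def choose_assembly_strategy(short_reads: Optional[str],
                             long_reads: Optional[str]) -> str:
    """Return a label for the assembly route implied by the supplied reads.

    Hypothetical helper: the labels are illustrative and do not name
    any specific assembler used by MEGAnnotator2.
    """
    if short_reads and long_reads:
        # Both data types are available, so they can be combined.
        return "hybrid"
    if long_reads:
        # Only third-generation (e.g., PacBio or Nanopore) reads.
        return "long-read"
    if short_reads:
        # Only short reads (e.g., Illumina).
        return "short-read"
    raise ValueError("at least one set of reads is required")


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the three routes.
    print(choose_assembly_strategy("illumina.fastq.gz", None))
    print(choose_assembly_strategy(None, "nanopore.fastq.gz"))
    print(choose_assembly_strategy("illumina.fastq.gz", "pacbio.fastq.gz"))
```

Keeping the decision in a single function makes each downstream step independent of the sequencing technology that produced the input.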


               MATERIALS AND METHODS
               MEGAnnotator2 workflow
               MEGAnnotator2 is a bash script that runs on Linux and is distributed under the GNU General Public
               License (GPL). The complete workflow, reported in Figure 1, shows the different steps managed by the
               pipeline through the coordination of freely available software programs. Complete execution of the
               pipeline starts with the filtering of the raw sequencing data, providing statistics on the quality of
               both the sequenced DNA and the filtered DNA that will be used for the assembly of the microbial
               genome. Based on the sequencing technology (short reads, long reads, or both), a specific assembly
               strategy is employed, resulting in one or more consensus sequences of the microbial chromosomes. Then,
               a quality assessment of the assembled data is performed to highlight the genome quality and the
               species relatedness. The latter information is used to reorder contigs based on the reference strain
               of the identified species. Next, the pipeline predicts the coding genes (as well as non-coding genes)
               and infers their function using similarity searches against the custom NCBI RefSeq database and a
               domain search in the InterProScan database. The gathered data are used to generate a GenBank file
               that stores all biological information, while all main statistics are
               reported in an available text file. Finally, the pipeline performs a metabolic screening to retrieve each