
Page 6 of 16                  Lugli et al. Microbiome Res Rep 2023;2:15  https://dx.doi.org/10.20517/mrr.2022.21

               Furthermore, the pipeline allows a custom filtering step before assembly to remove putative contamination
               that may arise during strain isolation or sequencing. For example, the user may choose to remove the
               DNA of a specific bacterial species or DNA vector sequences used in certain experimental procedures.


               Moreover, as a new feature, the pipeline generates statistics for each FASTQ input file to document its quality.
               Specifically, pre-filtering and post-filtering analyses are run with the FastQC quality control tool to
               spot potential problems in the sequencing dataset used. Data regarding base quality scores, read quality
               scores, sequence length distribution, sequence duplication levels, and overrepresented sequences are
               displayed before and after read filtering.

               Step 2: assembly of the filtered reads
               After an initial quality filtering of the input data, assemblies of DNA sequences can be performed using a
               combination of short and long sequences obtained by any NGS platform as well as modern third-generation
               sequencers such as PacBio and Nanopore. Filtered short reads are handled by SPAdes[22], which evaluates
               the average length of the DNA sequences to generate an optimal list of k-mer sizes to be used as a
               parameter in the assembly phase. For example, an Illumina 250 bp paired-end output will result in the
               k-mer list “21,33,55,77,99,127”. Long-read sequences are instead handled by the CANU assembler[23]. To obtain
               more reliable data, which usually consists of a complete reconstruction of the chromosomal sequence, the
               user can supply the putative length of the genome sequence to the pipeline, which is then used as a variable in
               the assembly step.
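The k-mer selection described above can be illustrated with a minimal Python sketch. Only the 250 bp case ("21,33,55,77,99,127") comes from the text; the other thresholds and the function names `mean_read_length` and `kmer_list` are assumptions made for illustration and are not taken from the MEGAnnotator2 source code.

```python
from statistics import mean


def mean_read_length(fastq_lines):
    """Return the average read length from the lines of a FASTQ file."""
    # In a four-line FASTQ record the sequence is the second line (index 1).
    lengths = [len(line.strip()) for i, line in enumerate(fastq_lines) if i % 4 == 1]
    return mean(lengths)


def kmer_list(avg_len):
    """Return a comma-separated k-mer list for a given average read length."""
    # Assumed thresholds: only the >= 250 bp list is stated in the article.
    if avg_len >= 250:
        ks = [21, 33, 55, 77, 99, 127]
    elif avg_len >= 150:
        ks = [21, 33, 55, 77]
    else:
        ks = [21, 33, 55]
    # Keep only k values shorter than the reads themselves.
    return ",".join(str(k) for k in ks if k < avg_len)
```

The resulting string has the comma-separated form that the SPAdes `-k` option accepts.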


               Furthermore, the pipeline can also manage hybrid assemblies that use both short and long-read sequences as
               input. MEGAnnotator2 lets the user choose between two strategies. The first approach takes
               advantage of the capability of SPAdes to manage hybrid assemblies: the chromosomal
               sequence obtained from a long-read assembly performed by CANU is passed to SPAdes as a
               reference for the hybrid assembly, together with the long and short-read sequences. The
               second approach also starts from the chromosomal sequence obtained from the long-read
               assembly, which is then polished using the Polypolish tool[24]. The resulting high-quality
               genome is obtained by aligning each short read to all possible locations of the assembled genome, making
               use of the SAM file generated by the BWA aligner[21]. Both methodologies can be used to generate a
               high-quality complete genome sequence. Nonetheless, based on our validation test, the
               polishing approach can minimize the occurrence of INDELs in the genome sequence.
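As a rough illustration of the alignment step that precedes polishing, the sketch below builds the BWA commands that map each short-read file to all possible locations of a draft assembly (`bwa mem -a`). The function name, file names and thread count are hypothetical, and the exact invocation used inside MEGAnnotator2 is not given in the text.

```python
def bwa_all_alignment_commands(draft_fasta, reads_1, reads_2, threads=4):
    """Build BWA commands that align each read file to every possible location."""
    # Index the draft (long-read) assembly once.
    index_cmd = ["bwa", "index", draft_fasta]
    # "-a" makes bwa mem report all alignments instead of only the best one.
    # Each read file is aligned separately; bwa mem writes SAM to stdout,
    # so the caller is expected to redirect the output to a .sam file.
    mem_cmds = [
        ["bwa", "mem", "-t", str(threads), "-a", draft_fasta, reads]
        for reads in (reads_1, reads_2)
    ]
    return index_cmd, mem_cmds
```

Each command list can be run with `subprocess.run(cmd, stdout=sam_handle, check=True)`; the two SAM files produced this way are the kind of input the polishing step consumes together with the draft assembly.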

               Step 3: genome quality check (optional)
               As a new feature of MEGAnnotator2, assembled data is assessed with multiple validation methods. A first
               screening consists of identifying the microbial species of the assembled genome. The 16S/18S
               rRNA gene sequences are compared to the non-redundant SILVA database described above through
               BLASTn[25]. At the same time, the fastANI tool is used to identify the microorganism with the highest
               whole-genome Average Nucleotide Identity (ANI) value[26]. The microorganisms with the
               highest 16S/18S rRNA gene sequence identity and the highest ANI value, together with the respective values,
               are reported as genome information in the output.
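Selecting the reference with the highest ANI can be sketched as follows. The example assumes the default tab-separated fastANI output (query, reference, ANI, mapped fragments, total fragments); the function name `best_ani_hit` and the file names in the usage example are hypothetical.

```python
import csv
import io


def best_ani_hit(fastani_tsv_text):
    """Return (query, reference, ani) for the row with the highest ANI value."""
    best = None
    for row in csv.reader(io.StringIO(fastani_tsv_text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip empty or malformed lines
        query, reference, ani = row[0], row[1], float(row[2])
        if best is None or ani > best[2]:
            best = (query, reference, ani)
    return best
```

For example, given two rows comparing `asm.fasta` against `refA.fna` (ANI 98.7) and `refB.fna` (ANI 95.2), the function returns the `refA.fna` row.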


               The average genome coverage is calculated using the BBMap aligner (https://github.com/BioInfoTools/BBMap)
               by mapping the short reads on the assembled contig sequences. The coverage of long-read
               assemblies is instead retrieved directly from the CANU report. Additionally, the quality of the assembled
               genome is evaluated using the CheckM tool[27]. Data regarding the completeness and contamination of the
               reconstructed genome are reported as values in the output information.
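One way to turn per-contig mapping results into a single average genome coverage is a length-weighted mean, sketched below. This is an assumption about how such a value could be derived, not a description of the BBMap or MEGAnnotator2 internals; the function name and input format are hypothetical.

```python
def average_coverage(contigs):
    """Length-weighted mean depth over (length_bp, mean_depth) pairs."""
    contigs = list(contigs)  # allow generators, since the data is read twice
    total_len = sum(length for length, _ in contigs)
    if total_len == 0:
        raise ValueError("assembly has no bases")
    # Weight each contig's depth by its length so that short contigs
    # do not dominate the genome-wide average.
    return sum(length * depth for length, depth in contigs) / total_len
```

For instance, a 1,000 bp contig at 50x and a 3,000 bp contig at 10x give (1000 × 50 + 3000 × 10) / 4000 = 20x.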