Page 6 of 16 Lugli et al. Microbiome Res Rep 2023;2:15 https://dx.doi.org/10.20517/mrr.2022.21

Furthermore, the pipeline allows a custom filtering step before assembly to remove putative contamination
that may arise during strain isolation or sequencing. For example, the user may choose to remove the
DNA of a specific bacterial species or DNA vector sequences used in certain experimental procedures.

Moreover, as a new feature, the pipeline generates statistics for each fastq input file to certify its quality.
More specifically, pre-filtering and post-filtering analyses are performed with the FastQC quality control
tool to spot potential problems in the sequencing dataset used. Data regarding base quality scores, read
quality scores, sequence length distribution, sequence duplication levels, and overrepresented sequences are
displayed before and after read filtering.
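
As an illustration of the before/after comparison described above, the following minimal sketch computes a few of the listed metrics directly from FASTQ text. It is not FastQC and does not reproduce its reports; the function name `fastq_summary`, the Phred+33 encoding, and the two-read example data are assumptions made for this example.

```python
from statistics import mean


def fastq_summary(fastq_text):
    """Summarise FASTQ text: read count, mean read length, mean Phred score,
    and the fraction of reads whose sequence duplicates an earlier read."""
    lines = fastq_text.strip().splitlines()
    lengths, quals, seen, duplicates = [], [], set(), 0
    # Each FASTQ record spans four lines: header, sequence, '+', quality string.
    for i in range(0, len(lines), 4):
        seq, qual = lines[i + 1], lines[i + 3]
        lengths.append(len(seq))
        quals.extend(ord(c) - 33 for c in qual)  # assumes Phred+33 encoding
        if seq in seen:
            duplicates += 1
        seen.add(seq)
    return {
        "reads": len(lengths),
        "mean_length": mean(lengths),
        "mean_quality": mean(quals),
        "duplicate_fraction": duplicates / len(lengths),
    }


# Invented two-read example: one high-quality read and one low-quality read.
raw = "@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\n######\n"
# The same data after a hypothetical filter has removed the low-quality read.
filtered = "@r1\nACGT\n+\nIIII\n"
before, after = fastq_summary(raw), fastq_summary(filtered)
```

Comparing `before` and `after` side by side mirrors the pre-filtering and post-filtering reports: the read count drops while the mean quality rises.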

Step 2: assembly of the filtered reads
After an initial quality filtering of the input data, DNA sequences can be assembled using a combination
of short and long reads obtained from any NGS platform as well as modern third-generation sequencers
such as PacBio and Nanopore. Filtered short reads are managed by SPAdes[22], which evaluates the average
length of the DNA sequences to generate an optimal list of k-mer sizes to be used as a parameter in the
assembly phase. For example, an Illumina 250 bp paired-end output will result in the k-mer list
“21,33,55,77,99,127”. Long-read sequences, in turn, are managed by the CANU assembler[23]. To obtain
more reliable data, which usually consists of a complete reconstruction of the chromosomal sequence, the
user can provide the putative length of the genome sequence to the pipeline, which will be used as a
variable in the assembly step.
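
The k-mer selection described above can be sketched as a lookup on the mean read length. Only the 250 bp case is taken from the text; the 150 bp cut-off, the fallback list, and the function names `kmer_list_for` and `spades_k_argument` are assumptions for illustration and do not describe the pipeline's actual rule.

```python
from typing import List


def kmer_list_for(mean_read_length: float) -> List[int]:
    """Pick an odd k-mer series suited to the mean read length (illustrative cut-offs)."""
    if mean_read_length >= 250:
        return [21, 33, 55, 77, 99, 127]  # matches the 250 bp example in the text
    if mean_read_length >= 150:
        return [21, 33, 55, 77]  # assumed cut-off, not taken from the paper
    return [21, 33, 55]  # assumed fallback for shorter reads


def spades_k_argument(read_lengths: List[int]) -> str:
    """Format the list as the comma-separated value passed to SPAdes via -k."""
    mean_length = sum(read_lengths) / len(read_lengths)
    return ",".join(str(k) for k in kmer_list_for(mean_length))


# Three invented read lengths from a nominal 250 bp run.
k_argument = spades_k_argument([250, 250, 251])
```

All k values are odd, as SPAdes requires, and the largest value stays well below the read length so that each read still contributes several k-mers.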

Furthermore, the pipeline can also manage assemblies that use both short- and long-read sequences as input.
MEGAnnotator2 lets the user choose between two strategies. The first approach takes advantage of the
capability of SPAdes to manage hybrid assemblies: the assembled chromosomal sequence obtained from a
long-read assembly managed by CANU is used by SPAdes as a reference to perform the hybrid assembly
together with the long- and short-read sequences. The second approach also starts from the assembled
chromosomal sequence obtained from the long-read assembly, followed by DNA sequence polishing using
the Polypolish tool[24]. The resulting high-quality genome is obtained by aligning each short read to all
possible locations of the assembled genome, making use of the SAM file generated by the BWA aligner[21].
Both methodologies can be used to generate a high-quality complete genome sequence. Nonetheless, based
on our validation test, the polishing approach can minimize the occurrence of INDELs in the genome
sequence.
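
The second (polishing) strategy can be sketched as a short sequence of shell commands, built here as strings rather than executed. This is an assumption-laden outline, not the pipeline's own invocation: the file names and the helper `polishing_commands` are hypothetical, and the Polypolish call follows the form used by its v0.5 releases, whereas newer releases use a `polish` subcommand, so the installed version should be checked.

```python
import shlex


def polishing_commands(draft, reads_1, reads_2, threads=4):
    """Build shell commands for short-read polishing of a long-read draft assembly."""
    d, r1, r2 = (shlex.quote(p) for p in (draft, reads_1, reads_2))
    return [
        f"bwa index {d}",
        # -a makes bwa mem report every alignment of a read, not only the best one
        f"bwa mem -t {threads} -a {d} {r1} > alignments_1.sam",
        f"bwa mem -t {threads} -a {d} {r2} > alignments_2.sam",
        # Assumed Polypolish v0.5-style call; newer releases use 'polypolish polish'
        f"polypolish {d} alignments_1.sam alignments_2.sam > polished.fasta",
    ]


# Hypothetical file names for a CANU draft and an Illumina read pair.
commands = polishing_commands("draft.fasta", "reads_1.fastq.gz", "reads_2.fastq.gz")
```

Aligning the two read files separately with `-a` is what provides the "all possible locations" information in the SAM files that the polisher then uses.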

Step 3: genome quality check (optional)
As a new feature of MEGAnnotator2, assembled data are assessed with multiple validation methods. A first
screening consists of identifying the microbial species of the assembled genome. The 16S/18S rRNA gene
sequences are compared to the non-redundant SILVA database described above through BLASTn[25]. At the
same time, the fastANI tool is used to identify the microorganism with the highest whole-genome Average
Nucleotide Identity (ANI) value[26]. The microorganisms with the highest 16S/18S rRNA gene sequence
identity and the highest ANI value, together with the respective values, are reported as genome
information in the output.
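
Selecting the reference with the highest ANI can be sketched as follows, assuming the tab-separated fastANI output layout (query, reference, ANI, mapped fragments, total query fragments). The helper name `best_ani_hit` and the example rows are invented for illustration and are not results from the paper.

```python
def best_ani_hit(fastani_output):
    """Return the reference with the highest ANI from tab-separated fastANI output."""
    best = None
    for line in fastani_output.strip().splitlines():
        # Assumed column order: query, reference, ANI, mapped fragments, total fragments
        query, reference, ani, mapped, total = line.split("\t")
        hit = {
            "reference": reference,
            "ani": float(ani),
            "mapped_fraction": int(mapped) / int(total),
        }
        if best is None or hit["ani"] > best["ani"]:
            best = hit
    return best


# Invented example rows for one assembly compared against two references.
example = (
    "assembly.fasta\tref_A.fna\t98.7\t1500\t1600\n"
    "assembly.fasta\tref_B.fna\t92.1\t1200\t1600\n"
)
top = best_ani_hit(example)
```

Reporting the mapped fraction next to the ANI value helps judge how much of the genome supports the identity estimate.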

The average genome coverage is calculated using the BBmap aligner (https://github.com/BioInfoTools/BBMap)
by mapping the short reads on the assembled contig sequences. The coverage of long-read assemblies, by
contrast, is retrieved directly from the CANU report. Additionally, the quality of the assembled genome is
evaluated using the checkM tool[27]. Data regarding the completeness and contamination of the
reconstructed genome are reported as values in the output information.
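
The average coverage figure can be understood through the usual definition of mean sequencing depth: total mapped bases divided by total assembly length. The sketch below shows only this arithmetic and is not a description of how BBmap computes its statistics; the function name `mean_coverage` and the numbers are invented.

```python
def mean_coverage(mapped_read_lengths, contig_lengths):
    """Mean depth = total mapped bases divided by total assembly length."""
    assembly_length = sum(contig_lengths)
    if assembly_length == 0:
        raise ValueError("assembly has no bases")
    return sum(mapped_read_lengths) / assembly_length


# Invented example: 1,000 mapped reads of 250 bp over two contigs (2,000 + 500 bp).
coverage = mean_coverage([250] * 1000, [2000, 500])
```

In this example 250,000 mapped bases over a 2,500 bp assembly give a 100-fold mean coverage.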
