Page 8 of 16 Renzi et al. Microbiome Res Rep 2024;3:2 https://dx.doi.org/10.20517/mrr.2023.27
AspGD: Aspergillus genome database; BOLD: barcode of life data systems; CGD: Candida genome database; CYGD: comprehensive yeast
genome database; ISHAM-ITS: international society for human and animal mycology-internal transcribed spacer; ISHAM-MLST: international
society for human and animal mycology-multilocus sequence typing; JGI: joint genome institute; MLST: multilocus sequence typing; SGD:
Saccharomyces genome database.
Aspergillus Genome Database (AspGD) [100], Barcode of Life Data Systems (BOLD) [101], Broad Institute
databases (http://www.broadinstitute.org/scientific-community/data/), Candida Genome Database (CGD) [102],
Comprehensive Yeast Genome Database (CYGD) [103], Ensembl Fungi (https://fungi.ensembl.org),
FungiDB [104], FUNGIpath [105], Fusarium-ID [106], Fusarium Multilocus Sequence Typing (MLST) [107],
International Society for Human and Animal Mycology-Internal Transcribed Spacer (ISHAM-ITS) [108],
International Society for Human and Animal Mycology-MultiLocus Sequence Typing (ISHAM-MLST)
(http://mlst.mycologylab.org/), JGI MycoCosm [109], NCBI GenBank (https://www.ncbi.nlm.nih.gov/genbank/),
NCBI RefSeq (http://www.ncbi.nlm.nih.gov/refseq/), PomBase [110], Saccharomyces Genome Database
(SGD) [111], and UNITE [112] have been summarized and extensively classified by Prakash et al. [113]. To overcome the
obstacles to comprehensive data management, they suggest a cloud-based, dynamic network
platform that integrates specialized, focus-group databases and offers maximum access and functional
characteristics for the user community.
One of the most concerning analytic challenges in mycobiota investigations is the inadequate curation of
fungal databases. The shortage of high-quality fungal sequences within curated databases results in a
substantial number of unclassified reads. Addressing this issue may involve producing additional
high-quality metagenomic and whole-fungal genome assemblies [87]. Furthermore, sequencing data frequently
lack biologically relevant information, such as the substrate of origin or details on the technology
used. Thus, well-curated fungal databases with accurate sequence data play a pivotal role in further research
and diagnostics in the field of mycology. The current fungal databases represent the diversity of
the fungal kingdom only poorly, which limits their analytical power.
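As a rough illustration of the two database issues raised above, the following minimal Python sketch computes the share of reads left without a taxonomic label and counts reference records that lack metadata. All data, field names (`substrate`, `technology`), and function names are hypothetical and chosen only for illustration; they do not correspond to any specific database or tool cited in this review.

```python
from collections import Counter

# Hypothetical per-read taxonomic assignments; None marks a read
# that found no match in the reference database.
assignments = [
    "Candida albicans",
    None,
    "Saccharomyces cerevisiae",
    None,
    "Malassezia restricta",
    None,
]


def unclassified_fraction(labels):
    """Return the share of reads that received no taxonomic label."""
    if not labels:
        return 0.0
    missing = sum(1 for label in labels if label is None)
    return missing / len(labels)


# Hypothetical reference records; the field names are illustrative only.
records = [
    {"id": "seq1", "substrate": "human gut", "technology": "Illumina"},
    {"id": "seq2", "substrate": None, "technology": "Illumina"},
    {"id": "seq3", "substrate": None, "technology": None},
]


def missing_metadata(entries, fields=("substrate", "technology")):
    """Count, per metadata field, how many records lack a value."""
    counts = Counter()
    for entry in entries:
        for field in fields:
            if not entry.get(field):
                counts[field] += 1
    return dict(counts)


print(unclassified_fraction(assignments))
print(missing_metadata(records))
```

In practice, such summaries would be computed from the output of a taxonomic classifier and from the metadata accompanying database entries; the toy inputs here only show the bookkeeping.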
Pipelines
The bioinformatics analysis workflow for amplicon data can be summarized into four main steps: (i) pre-processing;
(ii) “grouping” of amplicon sequences; (iii) taxonomic classification; and (iv) visualization and
statistical analysis [114]. While various tools can be used in each of these steps, producing slightly different
results, the second step, in particular, is crucial. Amplicon sequences can either be clustered based on their
similarity [115-119], akin to classical clustering techniques such as k-means or agglomerative clustering,
or resolved based on single-nucleotide differences among them, an approach currently known as sequence variant
inference [60]. Methods falling into the first category profile bacterial communities by grouping similar
sequences into Operational Taxonomic Units (OTUs), but the definition of a similarity threshold has always
been empirical. As a consequence, these methods tend to produce a large number of OTUs that are not
always biologically relevant, an issue that goes by the name of “OTU inflation” [120]. This massive production
of OTUs may lead to wrong conclusions and/or to the generation of huge datasets, which can be difficult to
analyze. Tackling this issue is not trivial, and a series of novel approaches have been proposed. These
approaches rely on the definition of sequence variants from single-nucleotide differences in the amplicon
reconstruction, trying to profile microbial communities based on “real” differences instead of sequence
similarity. Nowadays, the research community is gradually moving to the new concept of Amplicon
Sequence Variants (ASVs) or Exact Sequence Variants (ESVs) [121] for profiling bacterial communities, and
this concept should also be recommended for yeasts and yeast-like organisms. These approaches generate an error
model for each sequencing run, which enables discriminating between a true sequence variant (i.e., one