
Page 8 of 16                   Renzi et al. Microbiome Res Rep 2024;3:2  https://dx.doi.org/10.20517/mrr.2023.27



               AspGD: Aspergillus Genome Database; BOLD: Barcode of Life Data Systems; CGD: Candida Genome Database; CYGD: Comprehensive Yeast
               Genome Database; ISHAM-ITS: International Society for Human and Animal Mycology-Internal Transcribed Spacer; ISHAM-MLST: International
               Society for Human and Animal Mycology - MultiLocus Sequence Typing; JGI: Joint Genome Institute; MLST: multilocus sequence typing; SGD:
               Saccharomyces Genome Database.


               Aspergillus Genome Database (AspGD) [100], Barcode of Life Data Systems (BOLD) [101], Broad Institute
               databases (http://www.broadinstitute.org/scientific-community/data/), Candida Genome Database (CGD) [102],
               Comprehensive Yeast Genome Database (CYGD) [103], Ensembl Fungi (https://fungi.ensembl.org),
               FungiDB [104], FUNGIpath [105], Fusarium-ID [106], Fusarium Multilocus Sequence Typing (MLST) [107],
               International Society for Human and Animal Mycology-Internal Transcribed Spacer (ISHAM-ITS) [108],
               International Society for Human and Animal Mycology - MultiLocus Sequence Typing (ISHAM-MLST)
               (http://mlst.mycologylab.org/), JGI MycoCosm [109], NCBI GenBank (https://www.ncbi.nlm.nih.gov/genbank/),
               NCBI RefSeq (http://www.ncbi.nlm.nih.gov/refseq/), PomBase [110], Saccharomyces Genome Database
               (SGD) [111], and UNITE [112] have been summarized and extensively classified by Prakash et al. [113]. To
               overcome the obstacles to comprehensive data management, they suggest a cloud-based, dynamic network
               platform that integrates specialized, focused-group databases and offers maximum accessibility and
               functionality to the user community.
               One of the most concerning analytical challenges in mycobiota investigations is the inadequate curation of
               fungal databases. The shortage of high-quality fungal sequences in curated databases results in a
               substantial number of unclassified reads. Addressing this issue may involve producing additional high-
               quality metagenomic and whole-fungal genome assemblies [87]. Furthermore, sequence records frequently
               lack biologically relevant metadata, such as the substrate of origin or details on the technology
               used. Well-curated fungal databases with accurate sequence data therefore play a pivotal role in further
               research and diagnostics in the field of mycology. Current fungal databases represent the diversity of
               the fungal kingdom only poorly, which limits their analytical power.
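
               As a rough illustration of how the share of unclassified reads can be quantified, the following minimal
               sketch computes the fraction of reads left without an assignment at a given taxonomic rank. The function
               name, the input layout (read count plus lineage per sequence cluster), and the example values are
               assumptions made for illustration only; they do not correspond to any specific tool or dataset.

```python
# Illustrative sketch: the data layout and the numbers below are invented for this example.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"]


def unclassified_fraction(assignments, rank):
    """Return the fraction of reads with no assignment at the given rank.

    assignments: list of (read_count, lineage) tuples, where lineage is a list
    ordered as in RANKS and None marks a rank the classifier could not assign.
    """
    idx = RANKS.index(rank)
    total = sum(count for count, _ in assignments)
    if total == 0:
        return 0.0
    missing = sum(
        count
        for count, lineage in assignments
        if len(lineage) <= idx or lineage[idx] is None
    )
    return missing / total


# Hypothetical classifier output for three sequence clusters.
example = [
    (500, ["Fungi", "Ascomycota", "Saccharomycetes", "Saccharomycetales",
           "Saccharomycetaceae", "Saccharomyces"]),
    (300, ["Fungi", "Ascomycota", None, None, None, None]),
    (200, ["Fungi", None, None, None, None, None]),
]

for rank in ("kingdom", "phylum", "genus"):
    print(rank, unclassified_fraction(example, rank))
```

               Tracking this fraction per rank, rather than as a single number, shows where in the taxonomy the
               reference database stops being informative.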


               Pipelines
               The bioinformatics analysis workflow for amplicon data can be summarized into four main steps: (i) pre-
               processing; (ii) “grouping” of amplicon sequences; (iii) taxonomic classification; and (iv) visualization and
               statistical analysis [114]. While various tools can be used in each of these steps, producing slightly different
               results, the second step, in particular, is crucial. Amplicon sequences can be clustered based on their
               similarity [115-119], akin to classical clustering techniques such as k-means clustering or agglomerative
               clustering, or based on single nucleotide differences across them, an approach currently known as sequence
               variant inference [60]. Methods falling into the first category profile bacterial communities by grouping similar
               sequences into Operational Taxonomic Units (OTUs), but the definition of a similarity threshold has always
               been empirical. As a consequence, these methods tend to produce a large number of OTUs that are not
               always biologically relevant, an issue known as “OTU inflation” [120]. This massive production
               of OTUs may lead to wrong conclusions and/or to the generation of huge datasets, which can be difficult to
               analyze. Tackling this issue is not trivial, and a series of novel approaches have been proposed. These
               approaches rely on the definition of sequence variants from single nucleotide differences in the amplicon
               reconstruction, trying to profile microbial communities based on “real” differences instead of sequence
               similarity. Research communities are gradually moving to the concept of Amplicon
               Sequence Variants (ASVs) or Exact Sequence Variants (ESVs) [121] for profiling bacterial communities, and
               this approach should also be recommended for yeasts and yeast-like organisms. These approaches generate an error
               model for each sequencing run, which enables discriminating between a true sequence variant (i.e., one