Fabbrini et al. Microbiome Res Rep 2023;2:25 https://dx.doi.org/10.20517/mrr.2023.25 Page 9 of 18
is difficult, as it depends on several factors, such as the complexity of the data and the strength of the
relationships between the variables. In general, the larger the sample size, the higher the statistical power
and thus the chance of detecting meaningful relationships between variables. No single minimum sample
size is universally applicable to co-occurrence network analysis, as it varies with the specific context and
research objectives. However, some studies suggest that 25-30 samples per group may be considered
reasonable for this kind of analysis [45-57]. It is important to note that the quality of the data, the accuracy
of the sequencing technology used, and the statistical methods used to infer the network can also influence
the minimum sample size required for a robust analysis.
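To give a rough feel for how sample size and the strength of a relationship trade off, the sketch below uses the standard Fisher z-transform approximation for a single two-sided Pearson correlation test. It is an illustration only and not a method taken from the cited studies: the function names, the significance level (0.05) and the target power (0.80) are assumptions, and the calculation ignores multiple testing across feature pairs and the compositional nature of microbiome data, both of which would raise the sample size needed in practice.

```python
from math import atanh, ceil, sqrt, tanh
from statistics import NormalDist


def min_samples_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate number of samples needed to detect a Pearson correlation
    of magnitude r with a two-sided test (Fisher z-transform approximation)."""
    if not 0 < abs(r) < 1:
        raise ValueError("r must be non-zero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return ceil(((z_alpha + z_beta) / atanh(abs(r))) ** 2 + 3)


def smallest_detectable_correlation(n, alpha=0.05, power=0.80):
    """Smallest |r| detectable with n samples under the same approximation."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return tanh((z_alpha + z_beta) / sqrt(n - 3))


if __name__ == "__main__":
    for n in (25, 30):
        print(n, round(smallest_detectable_correlation(n), 2))
    for r in (0.3, 0.5):
        print(r, min_samples_for_correlation(r))
```

Under these assumptions, a group of 30 samples is enough to detect only fairly strong pairwise correlations (|r| of roughly 0.5), whereas a moderate correlation of 0.3 would call for about 85 samples.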
The second aspect to consider is the structure of the dataset, including the choice of using compositional
(taxonomic) and/or functional data. Generally, in both cases, the data are present as tabular outputs
reporting for each sample a given value (either relative abundances or counts, or normalized counts) for n
observed features (e.g., taxa, pathways, etc.). Concerning compositional information, data can be obtained
from: (i) 16S rRNA amplicon sequencing, often followed by QIIME 2 bioinformatic pipeline processing [58];
or (ii) shotgun metagenomics sequencing, followed by read alignment tools such as MetaPhlAn 4 [59],
Kraken2 [60] and METAnnotatorX2 [61], to ultimately produce the compositional table. From the functional
standpoint, inference techniques starting from 16S rRNA data, such as PICRUSt2 [62], can be used, yet shotgun
metagenomics is highly preferred. The reason for this is that 16S rRNA amplicon sequencing methods rely
on the use of reference sequences to analyze small amplicons derived from metagenomes, rather than
examining the entire metagenome as a whole. Such a limitation might result in improper assessment of
metabolic capability and inadequate taxonomic assignment to resolve microbiome compositional data
down to the species level. Many tools are available for the functional annotation of metagenomic samples,
including both read-mapping and assembly approaches. For read-mapping, the most commonly used tools
include HUMAnN 3 [63], MetaCV [64], EggNOG [65], and other methods comprising the use of Hidden
Markov Models on tailored databases. Assembly approaches, on the other hand, include tools for
species-level genome bin definition, such as MetaWRAP [66], and tools for functional annotation, such as
Prokka [67] and EggNOG [65,86]. The yield of metagenomics
approaches often involves multiple layers of information such as taxonomic composition and functional
profile, which require multi-omic integration to properly address their relationships. Multi-omics
integration is arguably the most complex scenario and, probably because of this, has received the least
coverage to date.
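The tabular structure described above (one row per sample, one value per observed feature) can be sketched as follows. This is a minimal illustration in plain Python that converts raw counts into per-sample relative abundances; the sample names, genus names, counts and the function name `to_relative_abundance` are all hypothetical and are not output from any of the pipelines cited above.

```python
def to_relative_abundance(counts):
    """Convert {sample: {feature: count}} into per-sample relative abundances."""
    relative = {}
    for sample, features in counts.items():
        total = sum(features.values())
        if total == 0:
            raise ValueError(f"sample {sample!r} has no counts")
        # Each value becomes the fraction of that sample's total counts.
        relative[sample] = {name: value / total for name, value in features.items()}
    return relative


# Hypothetical toy table: two samples, three taxa (raw read counts).
counts = {
    "sample_A": {"Bacteroides": 60, "Faecalibacterium": 30, "Akkermansia": 10},
    "sample_B": {"Bacteroides": 20, "Faecalibacterium": 0, "Akkermansia": 20},
}
rel = to_relative_abundance(counts)

if __name__ == "__main__":
    for sample, features in rel.items():
        print(sample, features)
```

The same layout applies to functional data, with pathways or gene families in place of taxa.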
When considering the structure of the dataset, another important aspect is the decision of whether or not to
filter the data. Including all variables, even low-abundance ones, may provide a more comprehensive view
of the relationships between microbial taxa or functions detected, possibly revealing previously unknown
associations. Nevertheless, this may also increase network complexity (and with it the computational
resources and runtime required) and might introduce weak or spurious associations
that could obscure the real relationships. On the other hand, including only the most abundant variables
can simplify the network, possibly highlighting the most prominent relationships between microbial taxa or
functionalities. This approach is particularly useful for studying the composition or function of a “core”
microbiome, focusing attention on relevant microbial taxa and functional pathways while limiting the
computational load. Accordingly, the choice between including all variables or considering only the most
abundant ones (e.g., retaining only taxa/functions present in the majority of samples, filtering out zero
values) should be based on the research question and the availability of computational resources, also
taking into account the related limitations [69-71]. Typically, filtering procedures reduce the complexity of microbiome data while
providing more reproducible and comparable results in microbiome data analysis. However, studies tend to