(36, 37).
After data normalization, traditional amplicon sequencing data analyses
include the generation of distance matrices for ordination, clustering,
and variance partitioning analyses. Commonly used distance metrics
include Bray-Curtis, Jaccard and UniFrac (weighted and unweighted),
which, regardless of their value in other fields, also do not take into
account the compositional nature of sequencing data. The Aitchison
distance, defined as the Euclidean distance computed on a centered
log-ratio (CLR) transformed count matrix, is a viable compositional
alternative (36) on which ordinations (e.g. PCA biplots) can be
performed. Additionally, the
PhILR transform has been introduced as a compositional alternative
to the weighted UniFrac that carries phylogenetic information (38).
Most of the above-mentioned compositional options are implemented in R
packages and are accompanied by publicly available tutorials.
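
To make the definition of the Aitchison distance concrete, the following minimal Python sketch applies a CLR transform to two count vectors and takes the Euclidean distance between them. The function names and the pseudocount used to handle zero counts are illustrative assumptions, not part of the cited methods; dedicated packages offer more careful zero-replacement strategies.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's count vector.

    The pseudocount is an assumed, simplistic way of avoiding log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log is equivalent to dividing by the geometric mean.
    return [value - mean_log for value in logs]


def aitchison_distance(sample_a, sample_b, pseudocount=1.0):
    """Euclidean distance between two CLR-transformed samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same taxa")
    a = clr(sample_a, pseudocount)
    b = clr(sample_b, pseudocount)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical counts for three taxa in two samples.
shallow = [10, 20, 30]
deep = [100, 200, 300]  # same proportions, ten times the sequencing depth
distance = aitchison_distance(shallow, deep, pseudocount=0.0)
```

With no pseudocount, the two hypothetical samples above differ only in sequencing depth, so their distance is zero up to floating-point error. This illustrates why the metric is suited to compositional data.
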
In addition to the limitations imposed by sequencing technology and the
nature of the sequencing data, another aspect that prevents amplicon
sequencing data analyses from being fully quantitative is the potential
presence of multiple copies of marker genes per organism. The 16S rRNA
gene copy number per microbial cell can vary between 1 and 18 and can
additionally differ among strains of the same species (39, 40;
Lavrinienko et al., 2020). Therefore, relying solely on the number and
diversity of 16S rRNA gene sequences can lead to inaccurate estimates of
abundance and diversity of microbial communities. Correcting for 16S
rRNA gene copy numbers in sequencing surveys remains challenging,
particularly for soil. Introducing an internal spike-in can be a useful
tool towards more quantitative amplicon data analyses, and a few
studies have applied this technique in soil (41, 42). There are,
however, important considerations: i) the spike should neither be a
member of the existing microbial community nor be added at
concentrations that would shift the sequencing effort towards it; ii)
the timing of spike addition (before or after nucleic acid extraction)
dictates the kind of information retrieved: while adding the spike
after extraction can provide good estimates of sequencing biases, it
does not take extraction efficiency into account (42).
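
The effect of gene copy number described in this paragraph can be illustrated with a naive correction that divides each taxon's read count by an assumed 16S rRNA gene copy number. This is only a sketch: the taxon names and copy numbers are hypothetical, and in practice copy numbers are often unknown or variable, which is why such corrections remain challenging.

```python
def correct_for_copy_number(read_counts, copy_numbers):
    """Divide each taxon's read count by its assumed 16S copy number."""
    corrected = {}
    for taxon, reads in read_counts.items():
        copies = copy_numbers[taxon]
        if copies < 1:
            raise ValueError(f"copy number for {taxon} must be at least 1")
        corrected[taxon] = reads / copies
    return corrected


def relative_abundance(values):
    """Scale values so that they sum to 1."""
    total = sum(values.values())
    return {taxon: value / total for taxon, value in values.items()}


# Hypothetical example: two taxa with equal cell numbers but different
# numbers of 16S rRNA gene copies per cell.
reads = {"taxon_A": 100, "taxon_B": 1000}
copies = {"taxon_A": 1, "taxon_B": 10}

uncorrected = relative_abundance(reads)
corrected = relative_abundance(correct_for_copy_number(reads, copies))
```

In this hypothetical case the uncorrected read counts suggest that taxon_B is ten times more abundant than taxon_A, whereas the corrected values are equal.
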
A recent study combined amplicon sequencing, a synthetic DNA spike of
known concentration added to the samples prior to extraction, and qPCR
quantifications to back-calculate the number of copies before extraction
after taking into account the extraction yield. The ratio of each OTU
against the initial concentration of 16S rRNA genes was used to
calculate more accurate abundance levels of each OTU after taking
extraction efficiency into account (43). As a
consequence of all the above-mentioned limitations, we recommend a
critical evaluation of the different data analysis tools in light of the
intrinsic nature of each experimental setup (see section “Ecological
intrinsic nature of each experimental setup (see section “Ecological
interpretations from amplicon sequencing data”).
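
The spike-based back-calculation summarized above can be sketched as follows. This is one possible reading of the general idea, not the published procedure of the cited study (43). It assumes that the extraction yield is estimated as the fraction of spike copies recovered by qPCR, and that the total 16S rRNA gene copies in the extract are quantified separately by qPCR. All names and numbers are hypothetical.

```python
def extraction_yield(spike_added, spike_recovered):
    """Fraction of the spike recovered after extraction (assumed from qPCR)."""
    if spike_added <= 0 or spike_recovered <= 0:
        raise ValueError("spike quantities must be positive")
    return spike_recovered / spike_added


def otu_copies_before_extraction(otu_reads, total_16s_in_extract,
                                 spike_added, spike_recovered):
    """Estimate per-OTU 16S copies present in the sample before extraction.

    otu_reads should exclude reads assigned to the spike itself.
    """
    yield_fraction = extraction_yield(spike_added, spike_recovered)
    # Scale the qPCR-measured total up to the pre-extraction amount.
    total_before = total_16s_in_extract / yield_fraction
    total_reads = sum(otu_reads.values())
    return {
        otu: reads / total_reads * total_before
        for otu, reads in otu_reads.items()
    }


# Hypothetical values: half of the spike is recovered, so the extract is
# assumed to hold half of the 16S copies originally present.
estimates = otu_copies_before_extraction(
    otu_reads={"OTU_1": 300, "OTU_2": 100},
    total_16s_in_extract=2e8,
    spike_added=1e6,
    spike_recovered=5e5,
)
```

Under these assumptions, relative read abundances are converted into estimated absolute copy numbers that account for losses during extraction.
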