To circumvent challenges associated with highly compositional relative abundance data, numerous tools for statistical analysis have been introduced. Here, we suggest data-driven approaches that address normalization, false-discovery rates and the compositional nature of sequencing data.
One of the first steps in analyzing amplicon sequencing data is to
remove potential sequencing errors. Doing so contributes to the
elimination of chimeras and other sequencing artefacts that tend to
falsely boost diversity levels \cite{Edgar2011,Haas2011}. The use
of amplicon sequence variants (ASVs) instead of operational taxonomic
units (OTUs) attempts to overcome this issue by assuming that a true
biological sequence is more likely to be abundant than an
error-containing sequence
(30). To that end,
bioinformatic tools such as DADA2
(31) and Deblur
(32) attempt to
use sequencing error profiles to resolve amplicon sequencing data into
ASVs. Furthermore, even though there are some caveats associated with
the use of ASVs that might require prior consideration, they have an
intrinsic biological meaning as a DNA sequence, as opposed to OTUs, which can be either a representation of the most abundant biological sequence or a consensus sequence. Additionally, ASVs make the merging of datasets possible, even when the
sequencing primer pairs are different
(30).
Another relevant step when analyzing sequencing data is to account for
the different sequencing effort across samples (i.e. different library
sizes) that can result in a substantially different number of recovered
reads even within sample replicates. Ways to tackle this issue include
total library size normalization and rarefaction, although recent
literature has advised against the latter
(33).
Bioinformatic tools such as DESeq2 and edgeR, which were originally built
for differential gene expression analyses of RNA-seq data, now extend to
amplicon-based studies. These packages provide ways to normalize count tables using
the “relative log expression” (RLE) and the “Trimmed Mean of
M-values” (TMM) normalization approaches, respectively
(34, 35). Both
methods are applied to a raw or a low-abundance-filtered count table,
have performed well on both real and simulated datasets, and have outperformed
rarefaction-based approaches
(33). Other
alternatives that account for the compositional aspect of sequencing data
include centered log-ratio (CLR), isometric log-ratio (ILR) or additive log-ratio (ALR)
transformations of a count matrix
(36, 37).
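The RLE idea mentioned above can be illustrated with a minimal sketch of median-of-ratios size factors. This is not the DESeq2 code itself: the function name `rle_size_factors`, the samples-by-taxa orientation and the toy counts are assumptions made for illustration, and the sketch simply ignores taxa that are absent from any sample, which can be limiting for sparse amplicon tables.

```python
import numpy as np


def rle_size_factors(counts):
    """Median-of-ratios size factors for a samples x taxa matrix of raw counts."""
    counts = np.asarray(counts, dtype=float)
    # Only taxa detected in every sample have a finite geometric mean.
    shared = (counts > 0).all(axis=0)
    if not shared.any():
        raise ValueError("no taxon is present in all samples")
    log_counts = np.log(counts[:, shared])
    # Log geometric mean of each shared taxon across samples (the pseudo-reference).
    log_reference = log_counts.mean(axis=0, keepdims=True)
    # Size factor per sample: median ratio of its counts to the pseudo-reference.
    return np.exp(np.median(log_counts - log_reference, axis=1))


# Hypothetical toy table: sample 2 was sequenced twice as deeply as sample 1.
counts = np.array([[10, 100, 5, 0],
                   [20, 200, 10, 3]])
size_factors = rle_size_factors(counts)
normalized = counts / size_factors[:, None]
```

Dividing each sample by its size factor puts the shared taxa on a common scale without discarding reads, in contrast to rarefaction.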
After data normalization, traditional amplicon sequencing data analyses
include the generation of distance matrices for ordination, clustering,
and variance partitioning analyses. Commonly used distance metrics
include Bray-Curtis, Jaccard and UniFrac (weighted and unweighted), which,
regardless of their value in other fields, also do not take into
account the compositional nature of sequencing data. The Aitchison
distance, defined as the Euclidean distance between CLR-transformed
samples, is a viable compositional alternative
(36) on top of
which ordinations (e.g. PCA biplots) can be performed. Additionally, the
PhILR transform has been introduced as a compositional alternative
to weighted UniFrac that carries phylogenetic information
(38). Most of the
above-mentioned compositional options are implemented in R packages and
are accompanied by publicly available tutorials. Given the limitations discussed above, we recommend a critical evaluation of the different data analysis tools in light of the intrinsic nature of each experimental setup (see section “Ecological interpretations from amplicon sequencing data”).
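As an illustration of the Aitchison distance described above, the following minimal sketch applies a CLR transform to a samples-by-taxa count matrix and then computes pairwise Euclidean distances. The function names, the default pseudocount of 0.5 used to handle zeros, and the toy counts are assumptions made for this example; how zeros are best handled is a separate choice that the sketch does not settle.

```python
import numpy as np


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of a samples x taxa count matrix."""
    shifted = np.asarray(counts, dtype=float) + pseudocount
    log_counts = np.log(shifted)
    # Subtract each sample's mean log count (the log of its geometric mean).
    return log_counts - log_counts.mean(axis=1, keepdims=True)


def aitchison_distance(counts, pseudocount=0.5):
    """Pairwise Euclidean distances between CLR-transformed samples."""
    z = clr(counts, pseudocount)
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Hypothetical toy table without zeros, so no pseudocount is needed here.
# Sample 2 has the same composition as sample 1 at a tenfold sequencing depth.
counts = np.array([[10, 20, 70],
                   [100, 200, 700],
                   [30, 30, 40]])
distances = aitchison_distance(counts, pseudocount=0.0)
```

In this toy case the distance between the first two samples is zero up to floating-point error, because the CLR transform removes differences in library size and keeps only the ratios between taxa.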
Steps toward reproducible and quantitative sequencing studies
Data-driven approaches to improve reproducibility
In addition to the limitations imposed by sequencing technology and the
compositional nature of sequencing data, another aspect that prevents data analyses from being fully quantitative is the potential
presence of multiple copies of marker genes per organism. For example, the 16S rRNA gene copy
number per bacterial cell can vary between 1 and 18 and can additionally
vary among strains of the same species
(39, 40;
Lavrinienko et al., 2020; \cite{Stoddard_2014}). Therefore, relying solely on the number and
diversity of 16S rRNA gene sequences can lead to inaccurate estimates of
abundance and diversity of microbial communities. Several computational tools can correct amplicon datasets for the number of 16S rRNA gene copies based on existing genome information (e.g. PICRUSt \cite{Douglas_2019} and
CopyRighter \cite{Angly_2014}). However, correcting for 16S
rRNA gene copy numbers in sequencing surveys remains challenging,
particularly for soil, as the gene copy numbers are only known for a subset of the soil microbes \cite{Louca_2018,Nunan_2020}. This challenge becomes even more problematic for marker genes of fungi and other eukaryotes such as protists, as copy numbers can vary more drastically among these taxa \cite{Gong_2013,Gong_2019}. Other housekeeping genes that occur only once in a genome, such as recA \cite{Eisen1995}, have been proposed in the past as universal phylogenetic marker genes, but their use remains limited because of lower taxonomic resolution or limited databases.
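The principle behind copy number correction can be sketched as follows. This is a simplified illustration and not the procedure implemented in PICRUSt or CopyRighter, which additionally have to predict copy numbers for taxa without sequenced genomes; the function name and the read counts and copy numbers below are hypothetical.

```python
import numpy as np


def correct_for_copy_number(read_counts, copy_numbers):
    """Divide reads per taxon by its 16S rRNA gene copy number and re-close to 1."""
    read_counts = np.asarray(read_counts, dtype=float)
    copy_numbers = np.asarray(copy_numbers, dtype=float)
    if np.any(copy_numbers <= 0):
        raise ValueError("copy numbers must be positive")
    # Reads per gene copy approximate the number of genomes (cells) per taxon.
    genome_equivalents = read_counts / copy_numbers
    return genome_equivalents / genome_equivalents.sum()


# Hypothetical example: two taxa with equal read counts but 1 versus 10 copies.
read_counts = np.array([500, 500])
copy_numbers = np.array([1, 10])
corrected = correct_for_copy_number(read_counts, copy_numbers)
```

Although both taxa contribute half of the reads, the single-copy taxon accounts for 10/11 of the corrected community, which shows how strongly uncorrected data can distort abundance and evenness estimates.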
Introducing an internal spike-in can be a useful
tool towards more quantitative amplicon data analyses, and a
few studies have applied this technique in soil
\cite{Tkacz_2018,Hardwick2018}. There
are, however, important considerations: i) the spike should
neither be a member of the existing microbial community nor
be added at concentrations that would shift the sequencing effort
towards it; ii) the timing of spike addition (before or after
nucleic acid extraction) dictates the kind of information
retrieved: while adding the spike after extraction can provide good
estimates of amplification and/or sequencing biases, it does not take extraction efficiency
into account \cite{Hardwick2018}.
A recent study combined amplicon sequencing, a synthetic DNA spike of
known concentration added to the samples prior to extraction, and qPCR
quantification to back-calculate the number of copies before extraction
after taking the extraction yield into account. The ratio of each OTU
against the initial concentration of 16S rRNA genes was used to
calculate more accurate abundance levels of each OTU after taking
extraction efficiency into account \cite{Zemb2020}.
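The general logic of spike-based scaling can be sketched as follows. This is a simplified illustration and not the exact procedure of the cited study: it assumes that the spike and the native sequences are extracted, amplified and sequenced with equal efficiency, and the function name and all numbers are hypothetical.

```python
import numpy as np


def absolute_from_spike(taxon_reads, spike_reads, spike_copies_added):
    """Scale read counts to gene copies using a spike of known copy number."""
    if spike_reads <= 0:
        raise ValueError("the spike must be detected to allow scaling")
    # Gene copies represented by one read, estimated from the spike.
    copies_per_read = spike_copies_added / spike_reads
    return np.asarray(taxon_reads, dtype=float) * copies_per_read


# Hypothetical sample: 1e6 spike copies were added per gram of soil before
# extraction and 100 spike reads were recovered after sequencing.
taxon_reads = np.array([9000, 900])
estimated_copies = absolute_from_spike(taxon_reads, spike_reads=100,
                                       spike_copies_added=1e6)
```

Under these assumptions the two taxa correspond to 9e7 and 9e6 gene copies per gram of soil; because the spike is added before extraction, losses during extraction affect the spike and the native DNA alike.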
Experimental approaches to more quantitative sequencing studies
Maybe we need a short introduction here on why we think that sequencing data should become more quantitative? For example, if sequencing data show that species X increases in sample A compared to sample B, this does not necessarily reflect a quantitative increase of species X in sample A. It could well be that the numbers of species X decrease in both samples, with a stronger decrease in sample B, or that they increase in both. Thus, knowing the absolute numbers may help to "adjust" sequencing data?
One approach towards absolute abundance data from soil communities is direct cell counting by fluorescence microscopy \cite{Bloem1995} or fluorescence-activated cell counting \cite{Khalili_2019}. Total counts help to assess the absolute abundance of microbial cells that fall within a certain range of parameters such as cell size and morphology. Counting only intact cells may also help to circumvent the overestimation of microbial diversity caused by extracellular DNA (48, 49). In addition, cell counting may be combined with BioOrthogonal Non-Canonical Amino acid Tagging (BONCAT) to target only the fraction of cells within a soil sample that is translationally active in situ \cite{Couradeau_2019}.
To the best of our knowledge, a direct use of absolute abundances of microbial cells to improve the evaluation of microbial diversity (evenness) of sequencing data has not been reported.
Would it even make sense? Please brainstorm with me. What if:
- We cut a soil sample in half
- Sequence one half
- count the single/filamentous/aggregated cells (archaea plus bacteria) in the other
- Then we have an estimate of the absolute abundance of cells per gram of soil as well as an estimate of the relative abundance of certain taxa/groups obtained through extraction, PCR and sequencing
- Let's assume for a second that there was no bias through extraction (gram-positives harder to disrupt), PCR (preferential amplification of certain (abundant) sequences), and sequencing (seq errors)
- Actinobacteria were found to make up 20\% of the $10^{9}$ total cells, which makes $2 \times 10^{8}$ Actinobacteria in a gram of soil. Would this information help us a lot?
- If we have two treatments in an experiment, and we do not see a difference in the total cell numbers but we do see a decrease of Actinobacteria to 10\%, could we translate this to a decrease in total Actinobacteria due to the treatment?
- A change in relative abundances alone does not tell us whether there was an absolute increase or decrease: even if the relative abundance of taxon A is higher in sample A than in sample B, its absolute abundance could still have decreased in both. So could such an approach using total cell counts actually help? A requirement would be that dilutions of the same soil sample ($10^{9}$ to $10^{7}$ cells) would still give comparable relative abundances of groups after sequencing. Is anyone aware of such a study?
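The arithmetic behind this thought experiment can be written out as follows. It is only a sketch under the assumption stated in the list that extraction, PCR and sequencing introduce no bias. The symbols $N_{\mathrm{total}}$ (total cells per gram of soil), $r_{i}$ (relative abundance of taxon $i$) and $N_{i}$ (absolute abundance of taxon $i$) are introduced here for illustration, and the doubled total cell count in the last line is a hypothetical value, not a measurement.

\[ N_{i} = r_{i} \, N_{\mathrm{total}} \]
\[ \text{Control:} \quad N_{\mathrm{Actinobacteria}} = 0.20 \times 10^{9} = 2 \times 10^{8} \]
\[ \text{Treatment, unchanged total:} \quad N_{\mathrm{Actinobacteria}} = 0.10 \times 10^{9} = 1 \times 10^{8} \]
\[ \text{Treatment, doubled total (hypothetical):} \quad N_{\mathrm{Actinobacteria}} = 0.10 \times 2 \times 10^{9} = 2 \times 10^{8} \]

With an unchanged total cell count, halving the relative abundance corresponds to halving the absolute abundance. If the total cell count doubles, the same drop in relative abundance leaves the absolute number of Actinobacteria unchanged, which is why the total count is needed to interpret the relative data.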
In contrast to using the total number of all cells, the actual abundances of certain taxa of interest in a soil sample can be obtained using fluorescence in situ hybridization (FISH) techniques. Recently, Piwosz et al. \cite{Piwosz2020} used CARD-FISH to count bacterial taxa in aquatic samples and compared these counts to relative abundances generated by amplicon sequencing. The authors concluded that relative abundance data obtained through amplicon sequencing were robust enough for ecological interpretation at the community level. For specific taxonomic groups, however, the abundances obtained with the two techniques disagreed to a large extent, suggesting that care has to be taken when interpreting relative abundance data of single taxa derived from amplicon sequencing. Such technical comparisons for soil samples are rare (e.g. \cite{Ushio_2014}), but given the diversity of soil microbiomes we suggest using amplicon sequencing data mainly for screening soil microbiomes at a community scale (e.g. phylum/class/order level). If the dynamics of certain phylogenetic groups are to be understood on a quantitative basis, we suggest applying suitable FISH techniques (e.g. \cite{Schmidt_2013}).