Normalization of compositional data/making it comparable across studies

A major challenge in the analysis and interpretation of amplicon sequencing data remains the relative nature of the data, which may not reflect actual microbial abundances (25). Numerous normalization approaches have been considered to account for differences in rRNA gene counts across taxa (26). Data normalization is even more critical in amplicon sequencing of eukaryotes. At each step of the sample handling and sequencing process, only a subset of the sample is carried forward, and additional bias can be introduced (27). Numerous statistical tools have been introduced to circumvent the challenges associated with highly compositional relative abundance data. Here we outline data-driven approaches that address normalization, false-discovery rates and the compositional nature of sequencing data.
One of the first steps in analyzing amplicon sequencing data is to remove potential sequencing errors. Doing so helps eliminate chimeras and other sequencing artifacts that tend to inflate diversity estimates \cite{Edgar2011,Haas2011}. The use of amplicon sequence variants (ASVs) instead of operational taxonomic units (OTUs) attempts to overcome this issue by assuming that a true biological sequence is likely to be more abundant than an error-containing sequence (30). To that end, bioinformatic tools such as DADA2 (31) and Deblur (32) use sequencing error profiles to resolve amplicon sequencing data into ASVs. Although the use of ASVs carries some caveats that should be considered beforehand, ASVs have an intrinsic biological meaning as DNA sequences, unlike OTUs. They also make it possible to merge datasets, even when the sequencing primer pairs differ (30).
Another relevant step when analyzing sequencing data is to account for differences in sequencing effort across samples (i.e. different library sizes), which can result in substantially different numbers of recovered reads even among replicates of the same sample. Ways to tackle this issue include total library size normalization and rarefaction, although recent literature has advised against the latter (33). Bioinformatic tools such as DESeq2 and edgeR, which were originally built for differential gene expression analyses of RNA-seq data, are now also applied to amplicon-based studies. These packages normalize count tables using the “relative log expression” (RLE) and the “trimmed mean of M-values” (TMM) approaches, respectively (34, 35). Both methods are applied to a raw or low-abundance-filtered count table, have performed well on both real and simulated datasets, and outperform rarefaction-based approaches (33). Alternatives that account for the compositional nature of sequencing data include the centered log-ratio (CLR), isometric log-ratio (ILR) and additive log-ratio (ALR) transformations of a count matrix (36, 37).
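To make two of these options concrete, the following minimal Python sketch computes median-of-ratios size factors (the idea underlying RLE normalization) and a CLR transformation with NumPy. It is an illustration under stated assumptions rather than the DESeq2 or edgeR implementation: the function names, the toy count table and the default pseudocount of 0.5 are hypothetical choices made here.

```python
import numpy as np


def median_of_ratios_size_factors(counts):
    """Per-sample size factors for a samples x taxa count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Use only taxa observed in every sample so that all logs are finite.
    # Sparse amplicon tables may leave few such taxa (a known limitation).
    shared = (counts > 0).all(axis=0)
    if not shared.any():
        raise ValueError("no taxon is present in every sample")
    log_counts = np.log(counts[:, shared])
    # The per-taxon geometric mean across samples serves as a pseudo-reference.
    log_reference = log_counts.mean(axis=0)
    # Size factor = median ratio of a sample's counts to that reference.
    return np.exp(np.median(log_counts - log_reference, axis=1))


def clr_transform(counts, pseudocount=0.5):
    """Centered log-ratio transform of a samples x taxa count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Zeros have no logarithm; a pseudocount is one common workaround
    # (assumed here), and the result can be sensitive to its value.
    log_counts = np.log(counts + pseudocount)
    # Subtracting the mean log value divides by the sample's geometric mean.
    return log_counts - log_counts.mean(axis=1, keepdims=True)


# Hypothetical toy table: the second sample has the same composition as the
# first but was sequenced ten times deeper.
counts = np.array([[10, 20, 70],
                   [100, 200, 700]])

size_factors = median_of_ratios_size_factors(counts)
normalized = counts / size_factors[:, None]   # depth-corrected counts
clr = clr_transform(counts, pseudocount=0.0)  # no zeros, so no pseudocount
```

In this toy case both approaches remove the tenfold difference in library size: the two rows of `normalized` coincide, as do the two rows of `clr`. Real tables contain many zeros, so the treatment of zeros (filtering, pseudocounts or model-based replacement) should be reported alongside the results.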
After data normalization, traditional amplicon sequencing data analyses include the generation of distance matrices for ordination, clustering, and variance partitioning analyses. Commonly used distance metrics include Bray-Curtis, Jaccard and UniFrac (weighted and unweighted), which, regardless of their value in other fields, do not take the compositional nature of sequencing data into account. The Aitchison distance, defined as the Euclidean distance between samples in a centered log-ratio (CLR) transformed count matrix, is a viable compositional alternative (36) on which ordinations (e.g. PCA biplots) can be based. Additionally, the PhILR transform has been introduced as a compositional alternative to weighted UniFrac that carries phylogenetic information (38). Most of the compositional options mentioned above are implemented in R packages with publicly available tutorials.
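The definition of the Aitchison distance given above can be sketched in a few lines of Python. This is a minimal illustration with NumPy, not a substitute for the R packages mentioned: the function names, the toy table and the pseudocount handling are assumptions made for the example.

```python
import numpy as np


def clr_transform(counts, pseudocount=0.5):
    """Centered log-ratio transform of a samples x taxa count matrix."""
    counts = np.asarray(counts, dtype=float)
    # The pseudocount is an assumed workaround for zero counts.
    log_counts = np.log(counts + pseudocount)
    return log_counts - log_counts.mean(axis=1, keepdims=True)


def aitchison_distance_matrix(clr):
    """Pairwise Euclidean distances between CLR-transformed samples."""
    diff = clr[:, None, :] - clr[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def pca_scores(clr):
    """Sample coordinates from a PCA of the CLR matrix, computed via SVD."""
    centered = clr - clr.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u * s


# Hypothetical toy table (3 samples x 4 taxa). Samples 0 and 1 share one
# composition at different sequencing depths; sample 2 reverses it.
counts = np.array([[10, 20, 30, 40],
                   [100, 200, 300, 400],
                   [40, 30, 20, 10]])

clr = clr_transform(counts, pseudocount=0.0)  # no zeros in this toy table
dist = aitchison_distance_matrix(clr)
scores = pca_scores(clr)
```

Here `dist[0, 1]` is zero because the two samples differ only in library size, whereas `dist[0, 2]` is positive. The Euclidean distances between rows of `scores` reproduce `dist`, which is why a PCA of CLR-transformed data can be read as an ordination in Aitchison geometry.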
Given all the limitations mentioned above, we recommend critically evaluating the available data analysis tools in light of the intrinsic nature of each experimental setup (see section “Ecological interpretations from amplicon sequencing data”).

Data-driven approaches to more quantitative sequencing studies

Or: Experimental approaches to more quantitative sequencing studies
In addition to the limitations imposed by sequencing technology and the nature of the sequencing data, another aspect that prevents amplicon sequencing data analyses from being fully quantitative is that organisms can carry multiple copies of a marker gene. The 16S rRNA gene copy number per microbial cell can vary from 1 to 18 and can also differ among strains of the same species (39, 40; Lavrinienko et al., 2020). Therefore, relying solely on the number and diversity of 16S rRNA gene sequences can lead to inaccurate estimates of the abundance and diversity of microbial communities. Correcting for 16S rRNA gene copy numbers in sequencing surveys remains challenging, particularly for soil. Introducing an internal spike-in can be a useful step towards more quantitative amplicon data analyses, and a few studies have applied this technique in soil (41, 42). There are, however, important considerations: i) the spike should neither be a member of the existing microbial community nor be added at a concentration that would shift the sequencing effort towards it; ii) the timing of spike addition (before or after nucleic acid extraction) dictates the kind of information retrieved: adding the spike after extraction can provide good estimates of sequencing biases but does not take extraction efficiency into account (42). A recent study combined amplicon sequencing, a synthetic DNA spike of known concentration added to the samples prior to extraction, and qPCR quantification to back-calculate the number of gene copies present before extraction, taking the extraction yield into account. The ratio of each OTU to the initial concentration of 16S rRNA genes was then used to calculate more accurate abundance levels for each OTU (43).
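The arithmetic behind a spike-in correction can be illustrated with a simplified Python sketch. It is not the procedure of the study cited above (43), which additionally uses qPCR. It assumes that a known number of spike copies is added before extraction and that spike and native templates are extracted, amplified and sequenced with equal efficiency. All names and numbers are hypothetical.

```python
def spike_scaled_gene_copies(taxon_reads, spike_reads, spike_copies_added,
                             sample_mass_g):
    """Estimate 16S rRNA gene copies per gram of sample for each taxon."""
    if spike_reads <= 0:
        raise ValueError("spike-in was not recovered; cannot scale reads")
    # Each read stands for this many template copies in the original sample,
    # under the equal-efficiency assumption stated above.
    copies_per_read = spike_copies_added / spike_reads
    return {taxon: reads * copies_per_read / sample_mass_g
            for taxon, reads in taxon_reads.items()}


def copy_number_corrected(gene_copies, copies_per_genome):
    """Convert gene copies to approximate cell numbers."""
    return {taxon: gene_copies[taxon] / copies_per_genome[taxon]
            for taxon in gene_copies}


# Hypothetical example: 1e6 spike copies added to 0.5 g of soil.
taxon_reads = {"taxon_A": 6000, "taxon_B": 3000}
gene_copies = spike_scaled_gene_copies(
    taxon_reads, spike_reads=1000, spike_copies_added=1e6, sample_mass_g=0.5
)
# Assumed 16S rRNA gene copy numbers per genome (illustrative values only).
cells = copy_number_corrected(gene_copies, {"taxon_A": 6, "taxon_B": 1})
```

In this example taxon_A yields twice as many reads as taxon_B, yet after dividing by the assumed copy numbers it is estimated to be three times less abundant in cells per gram. Such a reversal shows why read counts alone can misrepresent community structure. In practice, copy numbers are unknown for many soil taxa, so this last step carries considerable uncertainty.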