3. Challenges in amplicon sequence data analysis
One of the first steps in the analysis of amplicon sequencing data is the
removal of potential sequencing errors. Doing so eliminates sequencing artefacts that may falsely boost diversity levels \cite{Edgar2011,Haas2011}. The use
of amplicon sequencing variants (ASVs), instead of operational taxonomic
units (OTUs) helps to overcome this issue by assigning a greater
probability of a true biological sequence being more abundant than an
error-containing sequence
\cite{Callahan_2017}. To that end,
bioinformatic tools such as DADA2
\cite{Callahan_2016} and Deblur
\cite{Amir_2017} attempt to
use sequencing error profiles to resolve amplicon sequencing data into
ASVs. An ASV is more likely to have an
intrinsic biological meaning (i.e. true DNA sequence), as opposed to an OTU which can either be a representation of the most abundant biological sequence or a consensus sequence. Additionally, ASVs make the merging of datasets possible, even when the
sequencing primer pairs are different
\cite{Callahan_2017}.
Another relevant step when analyzing sequencing data is to account for
the different sequencing efforts across samples (i.e. library
size, sequencing depth) that can result in a substantially different number of recovered
reads even among replicates. Ways to tackle this include
total library size normalization and rarefaction, with the latter remaining debated to date \cite{McMurdie_2014}.
Bioinformatic tools such as DeSeq2 and EdgeR provide ways to normalize count tables \cite{Love_2014,Robinson_2010}. Both
methods are applied on raw or low-abundance filtered count tables, and
have performed well in both real as well as simulated datasets and outperform
rarefaction-based approaches
\cite{McMurdie_2014} . Other
alternatives that account for the compositional aspect of sequencing data
include center log (CLR), isometric log (ILR) or additive log (ALR)
ratios transformations on a count data matrix
(Aitchison 1984, \cite{Egozcue_2003}).
Following data normalization, traditional workflows
include the generation of distance matrices for ordination, clustering,
and variance partitioning analyses. Commonly used distance metrics
include Bray-Curtis, Jaccard and Unifrac (weighted and unweighted). These metrics are often used although they do not take into
account the compositional nature of sequencing data. The Aitchison
distance - defined as the Euclidian distance on top of a center-log
transformed count matrix – is a viable compositional alternative
( Aitchison 1984, ) that allows to perform ordinations (e.g. PCA biplots). Additionally, the
Philr transform metric has been introduced as a compositional alternative
to the weighted Unifrac that carries phylogenetic information
(38). Most of the
above mentioned compositional options are implemented in R packages (except Aitchison to the best of our knowledge) and
include publicly available tutorials. In light of the challenges related to normalization and analysis of compositional data, we recommend a critical evaluation of available data analysis tools to best address the nature of each experimental setup (see section 6).
Another aspect that prevents data analyses from being fully quantitative is the potential multiple copies of marker genes per organism. For example, the 16S rRNA gene copy number per bacterial cell can vary between 1 and 18 and can additionally show variation within different strains of the same species \cite{Stoddard_2014,Coenye2003,Lavrinienko2021}. Therefore, relying solely on the number and diversity of markers such a 16S rRNA genes can lead to inaccurate estimates of abundance and diversity of microbial communities. Several computational tools can correct amplicon datasets for the number of 16S rRNA gene copies based on existing genome information (e.g. PICRUSt2 \cite{Douglas_2019} and CopyRighter \cite{Angly_2014}). However, correcting for 16S rRNA gene copy numbers in sequencing surveys remains challenging, particularly for soil, as the gene copy numbers are only known for a subset of the soil microbes \cite{Louca_2018,Nunan_2020}. This challenge becomes even more problematic for marker genes of fungi and other eukaryotes, such as protists, as the copy number here can vary drastically between taxa \cite{Gong_2013,Gong_2019}. Other housekeeping genes, which occur only once in a genome, have been proposed as univeral phylogenetic marker genes (such as the recA \cite{Eisen1995}), but their use remains limited due to lower taxonomic resolution and limited availability in databases.
Storing data in repositories should be a requirement for publication
Sharing data makes research reproducible and data re-usable which is of particular relevance in light of increasing amounts of sequencing data obtained from soils around the globe. A requirement to ensure data availability and that data is stored in sufficient quality (e.g. usable format) is that authors are either aware of these benefits or required to do so by respective journals.
In order to assess the current state of data depisition in the field, we searched the author guidelines of the 10 specialized soil journals that contributed most studies with amlicon sequencing data (see Fig. 1 for reference). Out of 10 journals, many "encourage their authors to make data available" while only 2 journals require sequencing data to be deposited in public repositories such as NCBI before a manuscript have be accepted for publication. But even if authors feel encouraged to comply, storage of their data in a repository does not always safeguard reproducibility of the reported research. It seems that quite many datasets deposited are in fact results of the whole sequencing runs with no meaningful information provided to separate the individual amplicons. In addition, as only the raw output of sequencing experiments are available in most cases, it may be difficult to reconstitute from such data the exact datasets used for the reported statistics and illustrations. This is because it needs precise reporting of all applied quality filters and processing steps (see above) including referencing the versions of the software packages.
Consequently, we call on all specialized soil journals that accept and publish sequencing data to (i) provide community standards for appropriate repositories in their data policy statements and (ii) require the submission of sequencing data to repositories to facilitate reproducibility, open science, and meta-analyses. We advocate for the deposition of raw sequencing data in combination with processed ASV/OTU tables together with sample metadata, taxonomy files, and a meaningful description of how these files were processed to ensure reproducibility of published research and usability of collected data.
4. Addressing and interpreting compositional sequencing data
Interpreting relative abundance data
Compositionality of amplicon sequencing data presents challenges to the interpretation of changes in microbial community structure. The amount of sequence data obtained through high-throughput sequencing is a fixed value, resulting in a random sampling of sequences from a sample that cannot be directly linked to actual counts based on sequences alone \cite{Gloor_2017}. Numerous studies have revealed shifts in microbial community composition across treatments including gradients of temperature, pH, and salinity, as well as seasonal or temporal parameters. This practice is robust on a community level when broad-scale changes in taxa are of interest (e.g. phylum), and has resulted in similar ecological conclusions as data generated with more quantitative approaches \cite{Piwosz2020}. However, at higher taxonomic resolution (e.g. genus), quantitative inferences from relative abundance sequencing data become more challenging. Due to the nature of sequencing, a change in relative abundance of one species is often reflected in a corresponding change in another species. We depict such challenges in interpretation in the following theoretical experiment.
Amplicon sequencing data obtained from the same soil sample at two different time points (t1, t2) consists of two species (A, B). The relative abundance observed for species A and B is 0.55 and 0.45 at timepoint 1 (t1), and 0.8 and 0.2 at timepoint 2 (t2), respectively (Figure 3). From t1 to t2, species B decreases in relative abundance coupled to an increase in the relative abundance of species A. The bars below (t2a-t2e) illustrate five examples of changes in absolute abundance in t2 that could underlie the patterns observed in relative abundance data. The initial time point (t1) is also shown for comparison.
The first case represents a biologically true situation where the absolute abundance matches the relative abundance observations. There are no changes in biomass from t1 to t2 and species A increases, whereas species B decreases (Fig. 3, t2a). The second case depicts an increase in overall biomass between t1 and t2 caused by an absolute increase in species B and no absolute changes in species A (Fig. 3, t2b). In relative abundance data, species A would appear to reduce in abundance when in fact that was caused by a change in species B alone. The third case represents an opposite scenario where there is a total decrease in biomass from t1 to t2 caused by a decrease in species B and no changes in species A (Fig. 3, t2c). Again, in a relative abundance plot, species B would appear to have decreased in relative abundance (true), but inferences regarding species A would be false. The fourth case represents a situation where there is a general increase in biomass from t1 to t2 prompted by increases in absolute abundances of both species A and B (Fig. 3, t2d) while the fifth case represents an opposite scenario where there is a general decrease in biomass from t1 to t2 caused by decreases in absolute abundances of both species A and B (Fig. 3, t2e). In these last two cases, as long as the proportion between the two species stays the same, the relative abundance plot reflects biological changes.
This theoretical exercise shows, that even for a community of only two member species, there are five potential changes in the absolute abundance that could underlie the observed shift in relative abundance. Given that soil communities usually harbour thousands of species, the degree of complexity increases dramatically. This is further compounded by factors such as PCR bias, which underlines the caution that must be applied when interpreting relative abundances retrieved from amplicon sequencing.