4. Addressing and interpreting compositional sequencing data

Interpreting relative abundance data

Compositionality of amplicon sequencing data presents challenges to the interpretation of changes in microbial community structure. The number of sequences obtained through high-throughput sequencing is a fixed value, so the data represent a random sample of the sequences in a sample and cannot be directly linked to actual counts based on sequences alone \cite{Gloor_2017}. Numerous studies have used relative abundance data to reveal shifts in microbial community composition across treatments, including gradients of temperature, pH, and salinity, as well as seasonal or temporal parameters. This practice is robust at the community level when broad-scale changes in taxa are of interest (e.g. phylum), and has led to ecological conclusions similar to those drawn from data generated with more quantitative approaches \cite{Piwosz2020}. However, at higher taxonomic resolution (e.g. genus), quantitative inferences from relative abundance sequencing data become more challenging. Because relative abundances must sum to one, a change in the relative abundance of one species is necessarily accompanied by a corresponding change in the relative abundance of at least one other species. We illustrate these challenges in interpretation with the following theoretical experiment.
Amplicon sequencing data obtained from the same soil sample at two different time points (t1, t2) contain two species (A, B). The relative abundances observed for species A and B are 0.55 and 0.45 at time point 1 (t1), and 0.8 and 0.2 at time point 2 (t2), respectively (Figure 3). From t1 to t2, the relative abundance of species B decreases while that of species A increases. The bars below (t2a-t2e) illustrate five examples of changes in absolute abundance at t2 that could underlie the patterns observed in the relative abundance data. The initial time point (t1) is also shown for comparison.
The first case represents the situation in which the changes in absolute abundance match the relative abundance observations: total biomass does not change from t1 to t2, species A increases, and species B decreases (Fig. 3, t2a). The second case depicts an increase in overall biomass between t1 and t2 caused by an absolute increase in species A and no absolute change in species B (Fig. 3, t2b). In relative abundance data, species B would appear to decline in abundance when in fact the shift was caused by a change in species A alone. The third case represents the opposite scenario, in which total biomass decreases from t1 to t2 because species B decreases while species A does not change (Fig. 3, t2c). Again, in a relative abundance plot, species B would appear to have decreased in relative abundance (true), but the inferred increase in species A would be false. The fourth case represents a situation where there is a general increase in biomass from t1 to t2 prompted by increases in the absolute abundances of both species A and B (Fig. 3, t2d), while the fifth case represents the opposite scenario, where there is a general decrease in biomass from t1 to t2 caused by decreases in the absolute abundances of both species A and B (Fig. 3, t2e). In these last two cases, as long as the proportion between the two species stays the same, the relative abundance plot reflects biological changes.
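The arithmetic behind these cases can be checked with a short numerical sketch. Only the relative abundances (0.55/0.45 at t1 and 0.8/0.2 at t2) are taken from the example above; the absolute values, their arbitrary units, the scenario names, and the helper functions `relative_abundance` and `direction` are hypothetical and chosen purely for illustration. Each scenario is one possible set of absolute abundances at t2, not the values underlying Figure 3.

```python
# Hypothetical absolute abundances in arbitrary units (e.g. cells per gram of soil).
# Only the relative abundances 0.55/0.45 (t1) and 0.8/0.2 (t2) come from the text;
# every absolute value below is an illustrative assumption.
T1 = {"A": 55.0, "B": 45.0}

SCENARIOS = {
    "total_constant": {"A": 80.0, "B": 20.0},      # A increases, B decreases
    "total_up_A_only": {"A": 180.0, "B": 45.0},    # only A changes
    "total_down_B_only": {"A": 55.0, "B": 13.75},  # only B changes
    "total_up_both": {"A": 240.0, "B": 60.0},      # both increase
    "total_down_both": {"A": 40.0, "B": 10.0},     # both decrease
}


def relative_abundance(absolute):
    """Convert absolute abundances to proportions that sum to one."""
    total = sum(absolute.values())
    return {species: value / total for species, value in absolute.items()}


def direction(before, after):
    """Describe the direction of change in absolute abundance."""
    if after > before:
        return "increase"
    if after < before:
        return "decrease"
    return "no change"


if __name__ == "__main__":
    for name, t2 in SCENARIOS.items():
        rel = relative_abundance(t2)
        changes = {s: direction(T1[s], t2[s]) for s in T1}
        print(
            f"{name}: relative A={rel['A']:.2f}, B={rel['B']:.2f}; "
            f"absolute A: {changes['A']}, B: {changes['B']}"
        )
```

Under these assumed values, all five scenarios yield the same relative abundances at t2, although the direction of the absolute change differs between them for at least one species.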
This theoretical exercise shows that, even for a community of only two species, there are five potential changes in absolute abundance that could underlie the observed shift in relative abundance. Given that soil communities usually harbour thousands of species, the degree of complexity increases dramatically. This is further compounded by factors such as PCR bias, which underlines the caution that must be applied when interpreting relative abundances retrieved from amplicon sequencing.