Technical challenges in a heterogeneous and diverse habitat
The diversity of microorganisms in soil is a well-documented challenge in studying soil microbial communities \cite{Gans_2005,Fierer2006}. A single gram of soil is estimated to contain $10^{8}$--$10^{9}$ cells \cite{Bloem1995,Nunan_2001} and tens of thousands of microbial taxa \cite{Roesch2007}. Compared to host-associated microbiomes (e.g., gut, skin microbiome), free-living bacteria exhibit higher levels of diversity. In a recent comparison of alpha-, beta- and gamma-diversity from samples collected as part of the Earth Microbiome Project (EMP), soils were determined to have the highest alpha-diversity across environments \cite{Walters_2020}. In terms of beta- and gamma-diversity, soil ranked second only to sediment. However, all biomes sampled in the EMP indicate that we are only scratching the surface of bacterial diversity across environments. Fewer studies have investigated the diversity and global distribution of fungi \cite{Tedersoo_2014,V_trovsk__2019}. Taken together, these studies indicate that more heterogeneous environments, such as soils and sediments, may contain more diverse communities than more homogeneous habitats (e.g., marine, freshwater, air, biofilms) \cite{Fierer_2011,Walters_2020,Torsvik_2002}.
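For readers unfamiliar with the three diversity levels compared above, one common formalization (an illustrative assumption here, not necessarily the measure used in the cited comparison) is Whittaker's multiplicative partition, in which $\bar{\alpha}$ is the mean within-sample richness, $\gamma$ is the total richness across all samples, and $\beta$ expresses compositional turnover among samples:
\begin{equation}
\gamma = \bar{\alpha} \times \beta, \qquad \beta = \frac{\gamma}{\bar{\alpha}} .
\end{equation}
As a hypothetical worked example, if ten soil samples contain on average $\bar{\alpha} = 1000$ taxa each and $\gamma = 4000$ taxa are observed across all ten, then $\beta = 4$, i.e., the data set holds the equivalent of four compositionally distinct communities.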
In addition to high biological diversity, researchers interested in the microbial composition of soils are confronted with technical challenges throughout the sample processing workflow. The general workflow of amplicon sequencing includes 1) careful planning of the experimental design, 2) nucleic acid extraction, 3) quality control and sequence library preparation, 4) primer choice, PCR amplification, and sequencing, and 5) analysis of sequence data (Figure 2b). At each of these steps, information can be lost as a result of the techniques applied (i.e., nucleic acid extraction method, primer selection, statistical approaches), with consequences for data interpretation in the context of ecological questions. As with any scientific experiment, the experimental design determines which specific hypotheses can be addressed. In amplicon sequencing experiments, one must consider the appropriate spatial scale (i.e., aggregate/microscale, centimeter scale, meter scale) and temporal scale for sampling in order to address specific questions regarding community dynamics (discussed in section 5). Additionally, sample replication remains a critical concern in soil studies, particularly for statistical inference and the construction of co-occurrence networks (discussed in sections 5 and 6).
The physicochemical properties of soils make nucleic acid extraction from this matrix particularly challenging. Numerous extraction protocols and kits have been developed to circumvent challenges with DNA extraction from soil; however, each method introduces a distinct bias in the subset of the microbial community retrieved \cite{Terrat_2011,Zieli_ska_2017,Dopheide_2018}. Inhibitors such as humic substances are common in soil and can reduce the quality and purity of extracted nucleic acids, inhibit the generation of sequencing libraries, and decrease the efficiency of the PCR step required for amplicon sequencing. In addition to the nucleic acid extraction method, primer selection dictates the organisms or functions targeted by the approach (16S rRNA, ITS, functional markers). Finally, due to the diversity and heterogeneity of soil samples, the resulting data are often sparse, containing numerous taxa with low abundance and prevalence, which may be dealt with through filtering thresholds or statistical approaches. Despite these challenges, the application of sequencing technologies to soil has provided invaluable information on the structure of microbial communities and has underscored the importance of understanding them. However, the loss of information at each step of the process, from sampling to analysis, must be carefully considered when interpreting amplicon sequencing data.
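The filtering thresholds mentioned above can be illustrated with a minimal sketch. The function name \texttt{filter\_taxa}, the threshold values, and the toy count table are hypothetical and chosen only for illustration; appropriate cutoffs depend on sequencing depth and study design, and this is not a recommended or standard setting.

```python
from typing import Dict, List


def filter_taxa(
    counts: Dict[str, List[int]],
    min_prevalence: float = 0.2,
    min_total: int = 10,
) -> Dict[str, List[int]]:
    """Keep taxa that pass both a prevalence and a total-abundance threshold.

    counts maps a taxon name to its read counts, one entry per sample.
    Prevalence is the fraction of samples in which the taxon is detected.
    Both thresholds are illustrative assumptions, not recommended values.
    """
    kept = {}
    for taxon, row in counts.items():
        if not row:
            continue  # skip taxa with no sample data
        prevalence = sum(1 for c in row if c > 0) / len(row)
        if prevalence >= min_prevalence and sum(row) >= min_total:
            kept[taxon] = row
    return kept


# Toy table (hypothetical counts): rows are taxa, columns are five samples.
toy_counts = {
    "taxon_A": [120, 98, 143, 110, 87],  # abundant and present everywhere
    "taxon_B": [0, 0, 3, 0, 0],          # rare, detected in one sample
    "taxon_C": [0, 15, 0, 22, 0],        # moderate, detected in two samples
    "taxon_D": [1, 0, 0, 0, 0],          # singleton
}

filtered = filter_taxa(toy_counts, min_prevalence=0.4, min_total=10)
print(sorted(filtered))
```

With these toy values only \texttt{taxon\_A} and \texttt{taxon\_C} are retained. Such hard thresholds trade sensitivity to rare taxa for reduced sparsity, which is one instance of the information loss discussed in this section.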