Assessment of taxonomic diversity across the tree of life from extracellular eDNA
Our results demonstrate that PCR-free deep sequencing of extracellular eDNA is a promising approach for taxonomic diversity assessment across the tree of life in large aquatic ecosystems. By generating one of the deepest shotgun sequencing datasets of extracellular eDNA, we push the limits of biodiversity assessment to detect taxa across the tree of life, including the relatively low-abundance non-microbial taxa in the ecosystem. Through statistical extrapolation of richness accumulation curves, we show that the asymptotic taxonomic richness of the ecosystem across the tree of life can be reliably estimated. We also found that extracellular eDNA provides broad-scale spatiotemporal resolution to detect changes in the relative abundance of taxa across the tree of life.
We achieved these results through adaptations at every level of the workflow: sample collection, eDNA extraction, library preparation, sequencing, and bioinformatics. We enriched extracellular eDNA using a lysis-free protocol, rather than using total eDNA, to avoid the DNA extraction bias caused by differences in lysis efficiency between cell types from a wide range of taxa (Djurhuus et al., 2017). By eliminating PCR from the laboratory workflow through PCR-free library preparation, we achieved very low duplication rates in the sequences; high duplication rates would otherwise render a considerable part of the data useless by reducing the effective depth and increasing PCR-induced artifacts (Kebschull & Zador, 2015). As the probability of detecting low-abundance taxa is determined by the depth of sequencing, we estimated the required sequencing depth by analyzing the library complexity. We then sequenced the extracellular eDNA libraries to the point of saturation on an extremely high-throughput sequencing platform. Further, to achieve sensitive taxonomic classifications, we derived two independent taxonomic assignments from the paired-end reads using protein-based classification algorithms and calculated the lowest common ancestor taxon for each read. Reads of bacterial origin dominated the taxonomic assignments (86.95%) due to the high abundance of Bacteria in aquatic ecosystems. However, the family richness of Eukaryotes was higher than that of Bacteria, possibly because more eukaryotic than prokaryotic families are represented in the reference database (Supplementary Fig. 5). Studies in the past could not detect a high diversity of Eukaryotes from shotgun sequencing of total eDNA mainly due to the shallow sequencing depth (22.3 million) and a low percentage of reads assigned to Eukaryota (0.34%) (Stat et al., 2017).
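The per-read lowest common ancestor step mentioned above can be illustrated with a minimal Python sketch. It assumes that each mate of a read pair has already been assigned a lineage, written as a list of taxon names from root to tip; the function name, the lineage representation, the example lineages, and the treatment of an unclassified mate (no assignment is returned) are illustrative assumptions and not the pipeline used in this study.

```python
def lowest_common_ancestor(lineage_a, lineage_b):
    """Return the deepest taxon shared by two root-to-tip lineages.

    Each lineage is a list of taxon names ordered from the root downward.
    Returns None when the lineages share no taxon, which includes the
    case where one mate is unclassified (empty lineage). That handling
    is an assumption made for this sketch.
    """
    lca = None
    for taxon_a, taxon_b in zip(lineage_a, lineage_b):
        if taxon_a != taxon_b:
            break  # lineages diverge here; keep the last shared taxon
        lca = taxon_a
    return lca


# Simplified, illustrative lineages for the two mates of one read pair.
mate_1 = ["Eukaryota", "Metazoa", "Chordata", "Actinopteri",
          "Cypriniformes", "Cyprinidae"]
mate_2 = ["Eukaryota", "Metazoa", "Chordata", "Actinopteri",
          "Siluriformes", "Bagridae"]

print(lowest_common_ancestor(mate_1, mate_2))  # Actinopteri
```

When the two mates agree down to the family, the family is retained; when they disagree, the assignment falls back to the deepest rank they share, which trades resolution for a more conservative assignment.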
We achieved over sixteen-fold more taxonomic assignments to Eukaryota (5.48%) and detected hundreds of families of Protists, Fungi, Plants, and Animals. In particular, the high diversity of metazoan families indicates that extracellular eDNA contains amounts of DNA from non-microbial species that are detectable by shotgun sequencing. This opens up the possibility of detecting taxa across the tree of life without targeted enrichment techniques such as PCR or hybridization capture, which can introduce a bias toward certain taxa (van der Loos & Nijland, 2021). We also showed that statistical extrapolation of taxonomic richness accumulation curves can be used to account for undetected taxa with very low abundances and to estimate the asymptotic richness across the tree of life. The estimates of asymptotic family richness were in line with the expected richness of well-characterized taxa in the ecosystem, such as fishes. Such estimates of total taxonomic richness can be used to monitor changes in taxonomic richness across the tree of life over long periods and to help identify and prioritize taxa for conservation. Although we did not detect any substantial change in the composition of taxonomic families among the samples, we detected high variation in the relative abundance of the families across space and time. This indicates that the set of taxonomic families in the ecosystem can remain largely unchanged while their relative abundances vary at the given spatiotemporal scale. Furthermore, the genome-scale data generated using this approach can be repurposed for assessing diversity at the gene level, mapping functional traits to specific taxa, inferring species co-occurrence patterns, and linking community changes to ecosystem functioning and services.
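To make the idea of extrapolating to an asymptotic richness concrete, the sketch below computes the bias-corrected Chao2 estimator, a standard incidence-based lower bound on total richness. It is offered only as an example of this class of estimator and is not necessarily the extrapolation method applied in this study; the function name, the input format, and the family labels are hypothetical.

```python
def chao2(incidence, n_units):
    """Bias-corrected Chao2 estimate of asymptotic richness.

    incidence: dict mapping each taxon to the number of sampling units
               in which it was detected.
    n_units:   total number of sampling units.
    """
    counts = [c for c in incidence.values() if c > 0]
    s_obs = len(counts)                       # observed richness
    q1 = sum(1 for c in counts if c == 1)     # taxa seen in exactly one unit
    q2 = sum(1 for c in counts if c == 2)     # taxa seen in exactly two units
    k = (n_units - 1) / n_units               # small-sample correction
    if q2 > 0:
        return s_obs + k * q1 ** 2 / (2 * q2)
    # Fallback form used when no taxon occurs in exactly two units.
    return s_obs + k * q1 * (q1 - 1) / 2


# Hypothetical incidence counts for six families across four sampling units.
example = {"family_A": 1, "family_B": 1, "family_C": 2,
           "family_D": 3, "family_E": 1, "family_F": 2}

print(chao2(example, n_units=4))  # 7.6875
```

In this toy example six families are observed, three of them in a single unit and two in exactly two units, so the estimator adds roughly 1.7 undetected families. The more taxa that are seen only once relative to those seen twice, the larger the estimated undetected fraction.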
Limitations
The taxonomic resolution achievable through deep sequencing of extracellular eDNA is generally lower than that of approaches targeting a barcoding region in the genome. The taxonomic classification of extracellular eDNA sequences depends on the taxonomic resolution of the various genomic loci that are stochastically captured, the sensitivity of the algorithm used to detect homology, and the availability of reference sequences from the target organisms. Different regions of the genome provide variable taxonomic resolution depending on the sequence complexity, mutation rate, selection pressure, recombination, and evolutionary history of the species (Coissac et al., 2016). Further, sensitive alignment-based homology detection algorithms such as BLAST (Altschul, 2014) are prohibitively slow for querying billions of reads against large reference databases. Alternative alignment-free k-mer-based algorithms such as KRAKEN2 (Wood et al., 2019) are thousands of times faster than BLAST but far less sensitive and cannot find homology between highly divergent species (Lindgreen et al., 2016). Because existing reference sequence databases are sparse, many underrepresented taxa may remain undetected, leading to underestimates of taxonomic diversity when using DNA-based classifiers. Hence, we adopted a protein-based classification algorithm, as protein sequences are more conserved than genomic DNA sequences and offer better sensitivity with incomplete databases than DNA-based algorithms (Menzel et al., 2016). Even when the exact species is not represented in the database, the sequences can be taxonomically identified using the evolutionarily closest species present in the database as a proxy. Protein-based classification also eliminates erroneous taxonomic assignments from repetitive DNA sequences that are abundant in eukaryotic genomes.
The trade-off of using protein-based rather than DNA-based classification, however, is lower taxonomic resolution, because protein sequences are conserved among closely related species. Such trade-offs are nonetheless inevitable when accurate estimates of taxonomic richness are required, especially when assessing a tropical ecosystem like ours where the majority of the diversity is yet to be documented.
Sequencing costs and the availability of genome-scale reference data are the main limiting factors for the adoption of deep sequencing of extracellular eDNA for taxonomic assessment of ecosystems. Deep sequencing of samples to the point of saturation may quickly become infeasible for large-scale projects with hundreds of samples. Decreasing the sampling resolution and using statistical extrapolation, as demonstrated in this study, can bring down sequencing costs and enable the assessment of large ecosystems. Moreover, advancements in sequencing technologies are expected to decrease the sequencing cost to as little as $1 per GB in the near future, which will make the approach more affordable. Furthermore, only a small fraction of all known species have their genomes assembled, annotated, and archived in public sequence databases. Nevertheless, an international moonshot initiative in biology called the Earth BioGenome Project is set to change the scenario of incomplete databases by generating genomic resources for all known eukaryotic species (about 1.5 million) within a record time of about a decade (Lewin et al., 2018). Several large-scale genome sequencing initiatives across the world have joined this massive effort, targeting a wide variety of taxa. As these initiatives progress and reach completion, the increased availability of reference sequences in the databases will improve the sensitivity and specificity of the taxonomic assignments and provide a more accurate snapshot of taxonomic diversity.
Conclusion
Extracellular eDNA is a natural repertoire of genetic material from all the organisms inhabiting an ecosystem and is a reliable source for taxonomic diversity assessment. Organisms across the tree of life can be effectively detected through PCR-free deep sequencing of extracellular eDNA. The total taxonomic richness of the ecosystem can be estimated through statistical extrapolation of richness accumulation curves derived from incidence frequencies of taxa in the extracellular eDNA sequences. Extracellular eDNA also provides broad-scale spatiotemporal resolution of changes in biodiversity across the tree of life in an ecosystem. With plummeting sequencing costs and increasing coverage of reference databases by large-scale genome sequencing projects, we envision the wide adoption of PCR-free deep sequencing of extracellular eDNA for large-scale biodiversity assessment across the tree of life. Although there is further scope to test and optimize the workflow, we believe that this study significantly advances our understanding of the capabilities and limits of extracellular eDNA for taxonomic diversity assessment. Its application to detect taxa across the tree of life is fundamental for a paradigm shift toward implementing large-scale next-generation bioassessment and biomonitoring programs for the conservation, restoration, and management of ecosystems in the Anthropocene.