We generated a total of 851.43 Giga bases of sequencing data from the three spatially-replicated extracellular environmental DNA samples. Upon demultiplexing, we obtained about 2.6 billion paired-end reads from the Station 29 sample (S29), and 1.5 billion paired-end reads each from Station 28 (S28) and Station 27 (S27) samples. After quality and low-complexity filtering, we retained an average of 98.19% (SD 0.39%) of reads for taxonomic assignment. Since we used a completely PCR-free library preparation method, the average duplication rate was 6.65% (SD 0.31%) arising from optical duplicates and sequencing of the complementary strand. We derived 26 sampling units of 100 million reads from the sample S29 and 14 sampling units each from S28 and S27, which were separately analysed for determining the taxonomic composition. We obtained a set of 35.5 million reference sequences by filtering and clustering about 60.9 million nucleotide sequences from the INSDC at 99% sequence identity. By searching the high-quality reads against the constructed reference database using MMseqs2, we annotated on average 8.71% (SD 0.54%) of the reads to one of the levels of NCBI taxonomy using the 2bLCA algorithm. The paired-end structure of the reads allowed us to correct the discordant taxonomic assignments using the independently derived LCA labels for each read in the pair. We obtained a total of 353.07 million pair-merged pseudo-taxonomic assignments across the three samples. The taxonomic assignments were spread across the tree of life with different proportions, specifically, Viruses (51.41%), Archaea (0.32%), Bacteria (44.23%), and Eukaryota (4.02%). Overall, about 21.53% of the pseudo-taxonomic assignments had a resolution at least up to the genus level, which provided an estimate of 1815 genera across the tree of life in our samples. Eukaryota had the highest proportion of reads classified up to the genus level (45.65%), and the Viruses had the lowest proportion with just 3.67%. Specifically, the results indicated the presence of 47 genera of Viruses, 64 genera of Archaea, 675 genera of Bacteria, and 1029 genera of Eukaryotes among the sequenced samples (Figure 4). Further, eukaryotic genera were distributed across all the kingdoms, namely, Protists (32.36%), Fungi (18.85%), Viridiplantae (18.85%), and Metazoa (29.93%) (Table 1A). Since the low abundant macro-organisms are the most difficult to detect, we inspected the number of genera observed among the commonly targeted groups of Metazoa in the sequenced samples. The 308 Metazoan genera comprised of 39.61% of chordates and 60.39% of invertebrates. Phylum Arthropoda with 97 genera and class Actinopteri with 61 genera had the highest representation among the invertebrates and chordates. Among the low represented groups were Mammalia with 30 genera, Aves with 14 genera, Amphibia with four genera, Chondrichthyes with three genera, and Reptiles with one observed genus.
Optimal sequencing depth for biomonitoring
To estimate an optimal depth of sequencing to detect maximum diversity with minimal effort, we generated extrapolated genus accumulation curves across various taxa for each sample as a function of sequencing depth. All the accumulation curves show saturation as the sequencing depth increases beyond 1 billion paired-end reads (Figure 5). We then estimated the asymptotic genus richness of each taxon in all the samples using the accumulation curves (Table 1B). The asymptotic genus richness represents the maximum diversity of a particular taxon that can be detected in the given sample. Using the asymptotic genus richness of respective samples as a reference point, we then estimated sample coverage as a function of sequencing depth (Figure 6). Sample coverage is the fraction of genera detected by at least 1 read at a given depth. At the lowest depth of 100 million reads, samples have the highest coverage for Bacteria (AVG 93.59%, SD 1.54%) followed by Viruses (AVG 87.75%, SD 1.52%) and the lowest coverage for Aves (AVG 55.51%, SD 15.03%) followed by Mammalia (AVG 60.42%, SD 14.13%). At a read depth of 1 billion reads, all the taxa have coverage greater than the threshold of 90% (Table 1C) with the lowest coverage for Mammalia (AVG 91.74%, SD 13.7%). With double the sequencing effort of 2 billion reads, the coverage increases only by 3.88% for Mammalia (AVG 95.62%, SD 7.57%), crossing the 95% threshold for all the taxa under consideration (Figure 6).
Effectiveness of the pseudo-taxonomic approach
Since we adopted a pseudo-taxonomic assignment strategy to alleviate the issue of incomplete databases, the taxonomic assignments may only indicate the presence of a related species when the target species is not represented in the reference database. Our spatially-replicated sampling strategy allowed us to estimate the reproducibility of the pseudo-taxonomic approach by comparing the unique and shared number of genera across various taxa among the samples. The community similarity among the samples was measured using the Sorenson and Jaccard indices across various taxa (Figure 7). The lowest community similarity was found among Protists with a Sorenson index of 0.96 (Table 1D) and a Jaccard index of 0.91 (Table 1E). To further examine the effectiveness of the pseudo-taxonomic approach for detecting low abundant taxon that is underrepresented in the reference database, we chose the Irrawaddy dolphin (Orcaella brevirostris), an endangered flagship species of the Chilika ecosystem as a case study. Since the whole genome of the Irrawaddy dolphin has not been assembled, we examined hits to related species whose annotated regions of the genome are available in the reference database. We found 12,399 hits to Lipotes, a possibly extinct genus of Chinese river dolphin endemic to the Yangtze river in China, whose whole genome sequence is available in the reference database. There were no other hits in our filtered taxonomy dataset to any other marine mammals in the reference database (Supporting data).
Discussion
Biomonitoring across the tree of life is plausible
Our results show that PCR-free ultra-deep shotgun sequencing of extracellular environmental DNA is a promising approach for next-generation biomonitoring across the tree of life in large aquatic ecosystems. By generating one of the deepest shotgun sequencing datasets for environmental DNA to date, we push the limits of biomonitoring to detect biodiversity even from low abundant macroorganisms in the ecosystem (Figure 4). We also achieve very low duplication rates in the sequences by eliminating PCR from the laboratory workflow, which otherwise may render a considerable part of the data useless by reducing the effective depth. However, due to stringent filtering and relatively less sensitive search algorithm, we obtain low proportions of taxonomically annotated sequences (recall rate) compared to studies using BLAST searches (e.g., Cowart et al., 2018). The compromise with the recall rate was necessary for our study as BLAST is computationally infeasible to search billions of reads against large reference databases. Nevertheless, less sensitive algorithms than BLAST such as MMseqs2 used in this study are likely to miss only evolutionarily very distant homologs and not the related species. Thus, most of the unclassified reads due to less sensitivity may have been classified to very low taxonomic ranks below family level by the LCA algorithm, which does not affect our analysis at the genus level. On the contrary, our choice of analysing the paired-end reads independently and merging them after verifying their taxonomic annotation provides us with more confidence in the classification at the cost of reducing the number of reads assigned to the genus level due to reduced read length of 150bp compared to a merged read length of 300bp. Further studies are required to investigate the effect of such computational choices on the results to optimise the computational workflow. Although Eukaryotes made up only 4.02% of the total taxonomic assignments, the number of genera detected was higher in Eukaryota than Viruses and Bacteria, which made up most of the classified reads, possibly due to three reasons. First, 4.02% of eukaryotic reads accounted for over 14.2 million taxonomic assignments due to the high depth of our dataset. Second, Eukaryota had the highest proportion of reads classified up to the genus level with about 6.49 million reads. Such a large number of reads with high-taxonomic resolution provides excellent power to detect low abundant macro-organisms that is comparable to targeted studies. Third, despite the high abundance of Viruses and Bacteria in the environment, more than 99% of their diversity is yet to be catalogued (Locey & Lennon, 2016). Hence, we observe a greater number of eukaryotic genera than Bacteria and Viruses. Further, Viruses had the least proportion of reads classified up to genus-level (3.67%), likely due to the high mutation rates and divergence between lineages requiring protein level comparisons to detect remote homology (Nooij et al., 2018). The low number of Viral genera detected could also be due to the missing diversity of RNA viruses that cannot be detected with DNA-based direct metagenomic methods (Zhang et al., 2019). Additionally, our results demonstrate successful detection of a large number of genera from various groups of Metazoa which indicates the presence of detectable amounts of DNA from macro-fauna in the extracellular eDNA. The main limiting factor for detecting macroorganisms is the depth of sequencing that determines the sensitivity towards low-abundant taxa. In fact, a previous study with shallow depth (22.3 million) concluded that metagenomics is not a suitable approach for biomonitoring eukaryotes from an abundance perspective (Stat et al., 2017). In contrast, another study with a better depth (328.7 million) concluded that eDNA metagenomics is suitable for detecting marine benthic fauna from a presence/absence viewpoint (Cowart et al., 2018). We utilise an extremely high-throughput next-generation sequencer to generate billions of reads per sample and show that biodiversity across the tree of life, including low abundant macroorganisms, can be reliably monitored using an incidence-based statistical framework. We detected a high diversity of commonly monitored groups such as Arthropoda and Actinopteri, that are consistent with the known inventories of highly diverse species occurring in the sampled ecosystem (Suresh et al., 2018). Reptiles had the lowest diversity among the inspected taxa, likely due to their low shedding rates of eDNA (Adams et al., 2019). A commonly observed silver lining in eDNA-based studies is the detection of unintended organisms, also referred to as the by-catch. eDNA studies in aquatic ecosystems have demonstrated the ability to monitor even the terrestrial species occurring in the catchment area (Deiner et al., 2016). Therefore, a significant proportion of the detected taxa in our study may represent terrestrial organisms inhabiting the catchment area which can be distinctly noticed in the class Mammalia containing taxonomic assignments of rodents, bats, and a high number of reads from Human, Dog, Cat, and domestically farmed animals (Supporting data), possibly originating from the untreated sewage runoffs into the ecosystem.
One billion reads may be optimal for biomonitoring
Although the sequencing costs have drastically reduced over the last decade, biomonitoring projects with a large number of samples may require considerable funding to perform ultra-deep sequencing. Hence, we also determine the optimal sequencing depth required to detect maximum biodiversity with minimal sequencing effort. Since the abundance of organisms across the tree of life vary widely, we define the optimal depth as the discrete sequencing depth (in multiples of 100 million) at which all the examined taxa, including the low-abundant ones, cross a maximum sample coverage threshold (at the genus level) that cannot be further increased with minimal sequencing effort. In this study, we generated a total of 5.6 billion paired-end reads but obtained unequal depths of sequencing data across samples due to the differences in library concentrations. Statistical extrapolation of genus accumulation curves allowed us to mitigate the technical issues caused by unequal sequencing depths and predict values up to a maximum sequencing depth of 3 billion reads per sample, which is the practical limit under current sequencing costs. Our results indicate that about one billion paired-end reads may be optimal for biomonitoring across the tree of life, which could detect about 90% of the diversity (Figure 6). Doubling the depth from one billion to two billion only increased the coverage threshold to 95%, indicating saturation. Filtering false positives from low abundant taxa with an extremely small number of reads is challenging, especially when the number of samples is low. One can take advantage of the presence of low abundant taxa in multiple spatiotemporally replicated samples for differentiating them from false positives. Currently, generating one billion paired-end reads per sample may be feasible for small to moderately sized biomonitoring programs when they utilise next-generation sequencers such as the Illumina Novaseq 6000. The Novaseq platform can deliver paired-end reads up to 150bp in length at extremely high depths in a single run reducing the cost incurred per sample. Such sequencing depths are also now typical in various applications in biology. For instance, re-sequencing a human genome at 30x coverage requires 100 Gb of sequencing, and de novo assembling a mammalian sized genome at 100x coverage requires 300 Gb of sequencing, whereas one billion 150bp paired-end reads for biomonitoring requires 150 Gb of sequencing. In contrast, targeted approaches such as metabarcoding may require only 10 million reads per library to achieve good sensitivity (Singer et al., 2019). Due to the relatively low-depth requirement per library, metabarcoding may seem to be a better choice than metagenomics. However, the cost of targeted approaches also increases significantly considering the number of technical replicates required to alleviate PCR-induced stochasticity, libraries with different primers targeting a taxon to reduce primer-bias, and libraries with different markers to target different groups of taxa to achieve tree of life metabarcoding (Stat et al., 2017; van der Loos & Nijland, 2020). For example, to implement a completely unbiased biomonitoring program across the tree of life using metabarcoding, one may need at least three technical replicates per primer pair, two different sets of primers per marker for each of the six groups, namely, Archaea (16S, 23S), Bacteria (16S, 23S), Protists (18S, 28S), Fungi (18S, ITS), Plants (matK, rbcL), and Animals (COI, CytB) which can add up to 72 libraries per location (3 technical replicates x 2 primer sets x 2 markers x 6 taxa) requiring 720 million reads. Although the tree of life metabarcoding example is an extreme case on the higher-end, ultra-deep shotgun sequencing of eDNA is a much simpler choice with relatively fewer complications and might become more attractive with the further reduction of sequencing costs in the future.
The pseudo-taxonomic approach is reproducible
Currently, for most non-model organisms whose genomes are not yet assembled, the reference sequence databases contain only a few loci that are routinely used in barcoding and population genetics. Also, due to a high proportion of uncatalogued novel taxa in the environment, a large number of sequenced reads may not directly match the source taxa from which they originated. Taxonomy-free approaches are being employed in metabarcoding studies to eliminate the dependency on the reference databases (Mächler et al., 2020). In marker-based targeted studies, the sequences derived arise from a single locus and can be clustered into operational taxonomic units (OTUs) based on sequence identity (Westcott & Schloss, 2015) or amplicon sequence variants (ASVs) based on the denoised sequences (Tikhonov et al., 2015). The relative abundance of different OTUs or ASVs provides a taxonomy-free approach to compare diversity across samples (Callahan et al., 2017). But, in case of metagenomic studies, the sequences arise from loci across the genome with varying taxonomic resolutions. Further, higher organisms with large genome sizes may only have a fraction of their genome sampled. In such a scenario, utilising a taxonomy-free approach is not feasible. Hence, we adapted a pseudo-taxonomic approach with advantages of taxonomy-based and taxonomy-free approaches by being partially dependent on reference databases. In the pseudo-taxonomic approach, the nearest available species is chosen as the best hit when the locus from the target species is not available in the reference database. Our results from the spatially-replicated samples have very high community similarity (Figure 7) indicating high reproducibility of the pseudo-taxonomic approach. To illustrate, we also provide an example of the Irrawaddy dolphin, an endangered flagship species of Chilika, whose whole genome is not available in the database but expected to be present in our samples. Interestingly, thousands of sequences from the Irrawaddy dolphin hit the Chinese river dolphin genome in the database, even though there were other genomes in the database from the family Delphinidae that may be more phylogenetically closer. We suspect that the annotated gene sequences from the other Delphinidae genomes might have been clustered to the Chinese river dolphin gene sequences due to high sequence similarity while reducing the redundancy during the process of database construction. The pseudo-taxonomic approach will provide more evolutionarily closer hits to the source taxon with the availability of well-populated databases in the future.
The high similarity in community composition across spatially-replicated samples also indicates that the spatial resolution of biodiversity provided by extracellular DNA is much higher than 10 km in the sampled ecosystem. The high spatial resolution of biodiversity could be due to the homogenous distribution of extracellular DNA in the water column and implies that extracellular DNA may be suitable for monitoring large ecosystems by effectively reducing the number of samples required to represent spatially large ecosystems. Our study also emphasises the importance of pilot-scale studies before designing large-scale biomonitoring programs to evaluate the spatio-temporal resolutions of the targeted type of environmental DNA (Figure 1) in the ecosystem under consideration.
Limitations and conclusion
Successful application of technologies to solve practical problems requires a thorough understanding of its technical strengths and weaknesses. Numerous studies in the past decade have revealed the limitations of the widely-used PCR-based metabarcoding technique (Elbrecht et al., 2017). Such targeted approaches may only be suitable for monitoring a particular taxon of interest but not the entire tree of life due to the inherent inability to provide the relative abundances across major groups of organisms that have different standardised barcoding loci. Moreover, next-generation biomonitoring employs a more holistic approach of monitoring a wide range of ecological entities from genes to entire ecosystems which requires data from the entire genome of individuals from various species across the tree of life (Bohan et al., 2017). Hence, advancements in metagenomic technologies are an indispensable requirement for deploying next-generation biomonitoring at a global scale (Makiola et al., 2020). Our novel metagenomic workflow (Figure 2) with adaptations at every level of the pipeline from sample collection, DNA extraction, library preparation, sequencing, and bioinformatics is an attempt to develop futuristic technologies that can deliver the required data in an unbiased way for next-generation biomonitoring. Nevertheless, our metagenomic approach also has some limitations that warrant further testing and optimisation. Although we achieved our objectives with a limited number of samples in a pilot-scale study in a model ecosystem, more testing is necessary for different ecosystems where obtaining microgram quantities of extracellular DNA may be difficult (e.g., Tundra ecosystem). Further, applicability to terrestrial ecosystems should be rigorously tested since the extracellular DNA isolated from the soil, pooled from multiple sampling plots in terrestrial ecosystems, may only be useful to detect microbes, plants, and sub-terranean organisms, but not large terrestrial animals. Such cases may need innovative sampling strategies such as collecting water samples from waterholes routinely visited by animals (Seeber et al., 2019). Next, obtaining one billion reads per sample may be economically prohibitive for very large-scale biomonitoring programs at the current costs of sequencing. Hence, a metagenomic approach may not be widely adopted by researchers and ecosystem managers until the deep sequencing costs become more affordable. Finally, our approach is currently limited to monitoring broad-scale changes in the ecosystem using pseudo-taxonomic labels at the genus level using uncurated reference sequence databases. For applications such as resolving fine-scale community structure at the species level, targeted approaches using curated barcode databases may be more suitable than metagenomics (Singer et al., 2020). Despite the limitations, our approach has the potential to be adapted and incorporated into large-scale next-generation biomonitoring programs in the future due to three key features. First, the possibility of simultaneously and effectively detecting biodiversity across the tree of life in a single assay that enables inferring species interactions across kingdoms by tracking changes in the relative abundances. Second, the multi-faceted nature of the generated data from the whole genome allows for monitoring diversity at the gene level, mapping of functional traits to specific taxa, and linking community changes to ecosystem functioning and services. Third, the high spatio-temporal resolution of biodiversity provided by our approach through employing extracellular DNA (Figure 1) enables monitoring of large ecosystems by significantly reducing the sampling effort. Plummeting sequencing costs and increasing large-scale genome sequencing projects provide an excellent opportunity to further test and optimise our novel metagenomic workflow under different conditions. We believe that this study advances our understanding regarding the capabilities of eDNA-based metagenomics that is necessary for a paradigm shift towards implementing large-scale next-generation biomonitoring programs across the tree of life in the earth biogenome era.