We generated a total of 851.43 Giga bases of sequencing data from the
three spatially-replicated extracellular environmental DNA samples. Upon
demultiplexing, we obtained about 2.6 billion paired-end reads from the
Station 29 sample (S29), and 1.5 billion paired-end reads each from
Station 28 (S28) and Station 27 (S27) samples. After quality and
low-complexity filtering, we retained an average of 98.19% (SD 0.39%)
of reads for taxonomic assignment. Since we used a completely PCR-free
library preparation method, the average duplication rate was 6.65% (SD
0.31%) arising from optical duplicates and sequencing of the
complementary strand. We derived 26 sampling units of 100 million reads
from the sample S29 and 14 sampling units each from S28 and S27, which
were separately analysed for determining the taxonomic composition. We
obtained a set of 35.5 million reference sequences by filtering and
clustering about 60.9 million nucleotide sequences from the INSDC at
99% sequence identity. By searching the high-quality reads against the
constructed reference database using MMseqs2, we annotated on average
8.71% (SD 0.54%) of the reads to one of the levels of NCBI taxonomy
using the 2bLCA algorithm. The paired-end structure of the reads allowed
us to correct the discordant taxonomic assignments using the
independently derived LCA labels for each read in the pair. We obtained
a total of 353.07 million pair-merged pseudo-taxonomic assignments
across the three samples. The taxonomic assignments were spread across
the tree of life with different proportions, specifically, Viruses
(51.41%), Archaea (0.32%), Bacteria (44.23%), and Eukaryota (4.02%).
Overall, about 21.53% of the pseudo-taxonomic assignments had a
resolution at least up to the genus level, which provided an estimate of
1815 genera across the tree of life in our samples. Eukaryota had the
highest proportion of reads classified up to the genus level (45.65%),
and the Viruses had the lowest proportion with just 3.67%.
Specifically, the results indicated the presence of 47 genera of
Viruses, 64 genera of Archaea, 675 genera of Bacteria, and 1029 genera
of Eukaryotes among the sequenced samples (Figure 4). Further,
eukaryotic genera were distributed across all the kingdoms, namely,
Protists (32.36%), Fungi (18.85%), Viridiplantae (18.85%), and
Metazoa (29.93%) (Table 1A). Since the low abundant macro-organisms are
the most difficult to detect, we inspected the number of genera observed
among the commonly targeted groups of Metazoa in the sequenced samples.
The 308 Metazoan genera comprised of 39.61% of chordates and 60.39% of
invertebrates. Phylum Arthropoda with 97 genera and class Actinopteri
with 61 genera had the highest representation among the invertebrates
and chordates. Among the low represented groups were Mammalia with 30
genera, Aves with 14 genera, Amphibia with four genera, Chondrichthyes
with three genera, and Reptiles with one observed genus.
Optimal sequencing depth for biomonitoring
To estimate an optimal depth of sequencing to detect maximum diversity
with minimal effort, we generated extrapolated genus accumulation
curves across various taxa for each sample as a function of sequencing
depth. All the accumulation curves show saturation as the sequencing
depth increases beyond 1 billion paired-end reads (Figure 5). We then
estimated the asymptotic genus richness of each taxon in all the
samples using the accumulation curves (Table 1B). The asymptotic genus
richness represents the maximum diversity of a particular taxon that
can be detected in the given sample. Using the asymptotic genus
richness of respective samples as a reference point, we then estimated
sample coverage as a function of sequencing depth (Figure 6). Sample
coverage is the fraction of genera detected by at least 1 read at a
given depth. At the lowest depth of 100 million reads, samples have
the highest coverage for Bacteria (AVG 93.59%, SD 1.54%) followed by
Viruses (AVG 87.75%, SD 1.52%) and the lowest coverage for Aves (AVG
55.51%, SD 15.03%) followed by Mammalia (AVG 60.42%, SD 14.13%).
At a read depth of 1 billion reads, all the taxa have coverage greater
than the threshold of 90% (Table 1C) with the lowest coverage for
Mammalia (AVG 91.74%, SD 13.7%). With double the sequencing effort
of 2 billion reads, the coverage increases only by 3.88% for Mammalia
(AVG 95.62%, SD 7.57%), crossing the 95% threshold for all the taxa
under consideration (Figure 6).
Effectiveness of the pseudo-taxonomic approach
Since we adopted a pseudo-taxonomic assignment strategy to alleviate the
issue of incomplete databases, the taxonomic assignments may only
indicate the presence of a related species when the target species is
not represented in the reference database. Our spatially-replicated
sampling strategy allowed us to estimate the reproducibility of the
pseudo-taxonomic approach by comparing the unique and shared number of
genera across various taxa among the samples. The community similarity
among the samples was measured using the Sorenson and Jaccard indices
across various taxa (Figure 7). The lowest community similarity was
found among Protists with a Sorenson index of 0.96 (Table 1D) and a
Jaccard index of 0.91 (Table 1E). To further examine the effectiveness
of the pseudo-taxonomic approach for detecting low abundant taxon that
is underrepresented in the reference database, we chose the Irrawaddy
dolphin (Orcaella brevirostris), an endangered flagship species
of the Chilika ecosystem as a case study. Since the whole genome of the
Irrawaddy dolphin has not been assembled, we examined hits to related
species whose annotated regions of the genome are available in the
reference database. We found 12,399 hits to Lipotes, a possibly
extinct genus of Chinese river dolphin endemic to the Yangtze river in
China, whose whole genome sequence is available in the reference
database. There were no other hits in our filtered taxonomy dataset to
any other marine mammals in the reference database (Supporting data).
Discussion
Biomonitoring across the tree of life is plausible
Our results show that PCR-free ultra-deep shotgun sequencing of
extracellular environmental DNA is a promising approach for
next-generation biomonitoring across the tree of life in large aquatic
ecosystems. By generating one of the deepest shotgun sequencing datasets
for environmental DNA to date, we push the limits of biomonitoring to
detect biodiversity even from low abundant macroorganisms in the
ecosystem (Figure 4). We also achieve very low duplication rates in the
sequences by eliminating PCR from the laboratory workflow, which
otherwise may render a considerable part of the data useless by reducing
the effective depth. However, due to stringent filtering and relatively
less sensitive search algorithm, we obtain low proportions of
taxonomically annotated sequences (recall rate) compared to studies
using BLAST searches (e.g., Cowart et al., 2018). The compromise with
the recall rate was necessary for our study as BLAST is computationally
infeasible to search billions of reads against large reference
databases. Nevertheless, less sensitive algorithms than BLAST such as
MMseqs2 used in this study are likely to miss only evolutionarily very
distant homologs and not the related species. Thus, most of the
unclassified reads due to less sensitivity may have been classified to
very low taxonomic ranks below family level by the LCA algorithm, which
does not affect our analysis at the genus level. On the contrary, our
choice of analysing the paired-end reads independently and merging them
after verifying their taxonomic annotation provides us with more
confidence in the classification at the cost of reducing the number of
reads assigned to the genus level due to reduced read length of 150bp
compared to a merged read length of 300bp. Further studies are required
to investigate the effect of such computational choices on the results
to optimise the computational workflow.
Although Eukaryotes made up only 4.02% of the total taxonomic
assignments, the number of genera detected was higher in Eukaryota than
Viruses and Bacteria, which made up most of the classified reads,
possibly due to three reasons. First, 4.02% of eukaryotic reads
accounted for over 14.2 million taxonomic assignments due to the high
depth of our dataset. Second, Eukaryota had the highest proportion of
reads classified up to the genus level with about 6.49 million reads.
Such a large number of reads with high-taxonomic resolution provides
excellent power to detect low abundant macro-organisms that is
comparable to targeted studies. Third, despite the high abundance of
Viruses and Bacteria in the environment, more than 99% of their
diversity is yet to be catalogued (Locey & Lennon, 2016). Hence, we
observe a greater number of eukaryotic genera than Bacteria and Viruses.
Further, Viruses had the least proportion of reads classified up to
genus-level (3.67%), likely due to the high mutation rates and
divergence between lineages requiring protein level comparisons to
detect remote homology (Nooij et al., 2018). The low number of Viral
genera detected could also be due to the missing diversity of RNA
viruses that cannot be detected with DNA-based direct metagenomic
methods (Zhang et al., 2019).
Additionally, our results demonstrate successful detection of a large
number of genera from various groups of Metazoa which indicates the
presence of detectable amounts of DNA from macro-fauna in the
extracellular eDNA. The main limiting factor for detecting
macroorganisms is the depth of sequencing that determines the
sensitivity towards low-abundant taxa. In fact, a previous study with
shallow depth (22.3 million) concluded that metagenomics is not a
suitable approach for biomonitoring eukaryotes from an abundance
perspective (Stat et al., 2017). In contrast, another study with a
better depth (328.7 million) concluded that eDNA metagenomics is
suitable for detecting marine benthic fauna from a presence/absence
viewpoint (Cowart et al., 2018). We utilise an extremely high-throughput
next-generation sequencer to generate billions of reads per sample and
show that biodiversity across the tree of life, including low abundant
macroorganisms, can be reliably monitored using an incidence-based
statistical framework. We detected a high diversity of commonly
monitored groups such as Arthropoda and Actinopteri, that are consistent
with the known inventories of highly diverse species occurring in the
sampled ecosystem (Suresh et al., 2018). Reptiles had the lowest
diversity among the inspected taxa, likely due to their low shedding
rates of eDNA (Adams et al., 2019). A commonly observed silver lining in
eDNA-based studies is the detection of unintended organisms, also
referred to as the by-catch. eDNA studies in aquatic ecosystems have
demonstrated the ability to monitor even the terrestrial species
occurring in the catchment area (Deiner et al., 2016). Therefore, a
significant proportion of the detected taxa in our study may represent
terrestrial organisms inhabiting the catchment area which can be
distinctly noticed in the class Mammalia containing taxonomic
assignments of rodents, bats, and a high number of reads from Human,
Dog, Cat, and domestically farmed animals (Supporting data), possibly
originating from the untreated sewage runoffs into the ecosystem.
One billion reads may be optimal for biomonitoring
Although the sequencing costs have drastically reduced over the last
decade, biomonitoring projects with a large number of samples may
require considerable funding to perform ultra-deep sequencing. Hence, we
also determine the optimal sequencing depth required to detect maximum
biodiversity with minimal sequencing effort. Since the abundance of
organisms across the tree of life vary widely, we define the optimal
depth as the discrete sequencing depth (in multiples of 100 million) at
which all the examined taxa, including the low-abundant ones, cross a
maximum sample coverage threshold (at the genus level) that cannot be
further increased with minimal sequencing effort. In this study, we
generated a total of 5.6 billion paired-end reads but obtained unequal
depths of sequencing data across samples due to the differences in
library concentrations. Statistical extrapolation of genus accumulation
curves allowed us to mitigate the technical issues caused by unequal
sequencing depths and predict values up to a maximum sequencing depth of
3 billion reads per sample, which is the practical limit under current
sequencing costs. Our results indicate that about one billion paired-end
reads may be optimal for biomonitoring across the tree of life, which
could detect about 90% of the diversity (Figure 6). Doubling the depth
from one billion to two billion only increased the coverage threshold to
95%, indicating saturation. Filtering false positives from low abundant
taxa with an extremely small number of reads is challenging, especially
when the number of samples is low. One can take advantage of the
presence of low abundant taxa in multiple spatiotemporally replicated
samples for differentiating them from false positives. Currently,
generating one billion paired-end reads per sample may be feasible for
small to moderately sized biomonitoring programs when they utilise
next-generation sequencers such as the Illumina Novaseq 6000. The
Novaseq platform can deliver paired-end reads up to 150bp in length at
extremely high depths in a single run reducing the cost incurred per
sample. Such sequencing depths are also now typical in various
applications in biology. For instance, re-sequencing a human genome at
30x coverage requires 100 Gb of sequencing, and de novo assembling a
mammalian sized genome at 100x coverage requires 300 Gb of sequencing,
whereas one billion 150bp paired-end reads for biomonitoring requires
150 Gb of sequencing. In contrast, targeted approaches such as
metabarcoding may require only 10 million reads per library to achieve
good sensitivity (Singer et al., 2019). Due to the relatively low-depth
requirement per library, metabarcoding may seem to be a better choice
than metagenomics. However, the cost of targeted approaches also
increases significantly considering the number of technical replicates
required to alleviate PCR-induced stochasticity, libraries with
different primers targeting a taxon to reduce primer-bias, and libraries
with different markers to target different groups of taxa to achieve
tree of life metabarcoding (Stat et al., 2017; van der Loos & Nijland,
2020). For example, to implement a completely unbiased biomonitoring
program across the tree of life using metabarcoding, one may need at
least three technical replicates per primer pair, two different sets of
primers per marker for each of the six groups, namely, Archaea (16S,
23S), Bacteria (16S, 23S), Protists (18S, 28S), Fungi (18S, ITS), Plants
(matK, rbcL), and Animals (COI, CytB) which can add up to 72 libraries
per location (3 technical replicates x 2 primer sets x 2 markers x 6
taxa) requiring 720 million reads. Although the tree of life
metabarcoding example is an extreme case on the higher-end, ultra-deep
shotgun sequencing of eDNA is a much simpler choice with relatively
fewer complications and might become more attractive with the further
reduction of sequencing costs in the future.
The pseudo-taxonomic approach is reproducible
Currently, for most non-model organisms whose genomes are not yet
assembled, the reference sequence databases contain only a few loci that
are routinely used in barcoding and population genetics. Also, due to a
high proportion of uncatalogued novel taxa in the environment, a large
number of sequenced reads may not directly match the source taxa from
which they originated. Taxonomy-free approaches are being employed in
metabarcoding studies to eliminate the dependency on the reference
databases (Mächler et al., 2020). In marker-based targeted studies, the
sequences derived arise from a single locus and can be clustered into
operational taxonomic units (OTUs) based on sequence identity (Westcott
& Schloss, 2015) or amplicon sequence variants (ASVs) based on the
denoised sequences (Tikhonov et al., 2015). The relative abundance of
different OTUs or ASVs provides a taxonomy-free approach to compare
diversity across samples (Callahan et al., 2017). But, in case of
metagenomic studies, the sequences arise from loci across the genome
with varying taxonomic resolutions. Further, higher organisms with large
genome sizes may only have a fraction of their genome sampled. In such a
scenario, utilising a taxonomy-free approach is not feasible. Hence, we
adapted a pseudo-taxonomic approach with advantages of taxonomy-based
and taxonomy-free approaches by being partially dependent on reference
databases. In the pseudo-taxonomic approach, the nearest available
species is chosen as the best hit when the locus from the target species
is not available in the reference database. Our results from the
spatially-replicated samples have very high community similarity (Figure
7) indicating high reproducibility of the pseudo-taxonomic approach. To
illustrate, we also provide an example of the Irrawaddy dolphin, an
endangered flagship species of Chilika, whose whole genome is not
available in the database but expected to be present in our samples.
Interestingly, thousands of sequences from the Irrawaddy dolphin hit the
Chinese river dolphin genome in the database, even though there were
other genomes in the database from the family Delphinidae that may be
more phylogenetically closer. We suspect that the annotated gene
sequences from the other Delphinidae genomes might have been clustered
to the Chinese river dolphin gene sequences due to high sequence
similarity while reducing the redundancy during the process of database
construction. The pseudo-taxonomic approach will provide more
evolutionarily closer hits to the source taxon with the availability of
well-populated databases in the future.
The high similarity in community composition across spatially-replicated
samples also indicates that the spatial resolution of biodiversity
provided by extracellular DNA is much higher than 10 km in the sampled
ecosystem. The high spatial resolution of biodiversity could be due to
the homogenous distribution of extracellular DNA in the water column and
implies that extracellular DNA may be suitable for monitoring large
ecosystems by effectively reducing the number of samples required to
represent spatially large ecosystems. Our study also emphasises the
importance of pilot-scale studies before designing large-scale
biomonitoring programs to evaluate the spatio-temporal resolutions of
the targeted type of environmental DNA (Figure 1) in the ecosystem under
consideration.
Limitations and conclusion
Successful application of technologies to solve practical problems
requires a thorough understanding of its technical strengths and
weaknesses. Numerous studies in the past decade have revealed the
limitations of the widely-used PCR-based metabarcoding technique
(Elbrecht et al., 2017). Such targeted approaches may only be suitable
for monitoring a particular taxon of interest but not the entire tree of
life due to the inherent inability to provide the relative abundances
across major groups of organisms that have different standardised
barcoding loci. Moreover, next-generation biomonitoring employs a more
holistic approach of monitoring a wide range of ecological entities from
genes to entire ecosystems which requires data from the entire genome of
individuals from various species across the tree of life (Bohan et al.,
2017). Hence, advancements in metagenomic technologies are an
indispensable requirement for deploying next-generation biomonitoring at
a global scale (Makiola et al., 2020). Our novel metagenomic workflow
(Figure 2) with adaptations at every level of the pipeline from sample
collection, DNA extraction, library preparation, sequencing, and
bioinformatics is an attempt to develop futuristic technologies that can
deliver the required data in an unbiased way for next-generation
biomonitoring. Nevertheless, our metagenomic approach also has some
limitations that warrant further testing and optimisation. Although we
achieved our objectives with a limited number of samples in a
pilot-scale study in a model ecosystem, more testing is necessary for
different ecosystems where obtaining microgram quantities of
extracellular DNA may be difficult (e.g., Tundra ecosystem). Further,
applicability to terrestrial ecosystems should be rigorously tested
since the extracellular DNA isolated from the soil, pooled from multiple
sampling plots in terrestrial ecosystems, may only be useful to detect
microbes, plants, and sub-terranean organisms, but not large terrestrial
animals. Such cases may need innovative sampling strategies such as
collecting water samples from waterholes routinely visited by animals
(Seeber et al., 2019). Next, obtaining one billion reads per sample may
be economically prohibitive for very large-scale biomonitoring programs
at the current costs of sequencing. Hence, a metagenomic approach may
not be widely adopted by researchers and ecosystem managers until the
deep sequencing costs become more affordable. Finally, our approach is
currently limited to monitoring broad-scale changes in the ecosystem
using pseudo-taxonomic labels at the genus level using uncurated
reference sequence databases. For applications such as resolving
fine-scale community structure at the species level, targeted approaches
using curated barcode databases may be more suitable than metagenomics
(Singer et al., 2020).
Despite the limitations, our approach has the potential to be adapted
and incorporated into large-scale next-generation biomonitoring programs
in the future due to three key features. First, the possibility of
simultaneously and effectively detecting biodiversity across the tree of
life in a single assay that enables inferring species interactions
across kingdoms by tracking changes in the relative abundances. Second,
the multi-faceted nature of the generated data from the whole genome
allows for monitoring diversity at the gene level, mapping of functional
traits to specific taxa, and linking community changes to ecosystem
functioning and services. Third, the high spatio-temporal resolution of
biodiversity provided by our approach through employing extracellular
DNA (Figure 1) enables monitoring of large ecosystems by significantly
reducing the sampling effort. Plummeting sequencing costs and increasing
large-scale genome sequencing projects provide an excellent opportunity
to further test and optimise our novel metagenomic workflow under
different conditions. We believe that this study advances our
understanding regarding the capabilities of eDNA-based metagenomics that
is necessary for a paradigm shift towards implementing large-scale
next-generation biomonitoring programs across the tree of life in the
earth biogenome era.