INTRODUCTION
The vast biodiversity on earth is the result of billions of years of
evolution. All the evolutionary lineages that make up the tree of life
belong to three domains, Archaea, Bacteria, and Eukaryota, along with a
contested fourth category, the Viruses. Organisms across the tree of life have
evolved and adapted to inhabit various environments on earth. Widely
accepted studies estimate that about 8.7 million (±1.3 million) species
of Eukaryotes (Mora et al., 2011) and up to a trillion species of
microbes (Locey & Lennon, 2016) exist on earth. Given these estimates
and the completeness of the Encyclopedia of Life database (Parr et al.,
2014), the majority of eukaryotic diversity and most of the microbial
diversity remain unknown to science despite 250 years of scientific
exploration. At the current rate of novel species discovery, it would
take hundreds of years to catalogue all the eukaryotic species alone
(Costello et al., 2013). However, there is an impending threat of a
sixth mass extinction due to various anthropogenic factors such as
pollution, land-use change, habitat loss, poaching, and climate change
(Ceballos et al., 2020). The population sizes of many species have
dropped significantly, and species extinction rates have increased
hundreds to thousands of times compared to the background rate (Ceballos
et al., 2015, 2017). Species extinction is irreversible and may have
long-lasting effects on ecosystem functioning and services. A global
assessment report by the U.N. estimated that up to a million eukaryotic
species may be threatened with extinction and might go extinct in the
next few decades (Intergovernmental Science-Policy Platform on
Biodiversity and Ecosystem Services, 2019). Under the current scenario,
many species may go extinct even before being catalogued in the
Encyclopedia of Life. Therefore, it has become imperative to assess and
monitor biodiversity at a larger scale than ever before to chart
conservation policies and guide management strategies.
Classical biomonitoring techniques are time-consuming and
resource-intensive, require manual identification of specimens, and are
difficult to deploy at a large scale. Environmental DNA
(eDNA)-based molecular methods offer several advantages over classical
biomonitoring methods (Thomsen & Willerslev, 2015). eDNA-based
biomonitoring techniques detect the presence of various taxa in the
ecosystem utilising the DNA extracted directly from whole environmental
samples (e.g., water, soil, air) (Taberlet et al., 2012a). The last
decade witnessed tremendous strides in the methodological development of
eDNA-based biomonitoring techniques (Seymour, 2019). Along with the
technical advances, there has also been considerable effort to
understand the ecology of eDNA (Barnes & Turner, 2016; Stewart, 2019),
and to clearly define the term eDNA (Pawlowski et al., 2020;
Rodriguez-Ezpeleta et al., 2020). By exploiting various sources of DNA
in an environmental sample (Figure 1), eDNA-based biomonitoring has
emerged as a powerful new technique that has revolutionised the way we
survey ecological communities (Deiner et al., 2017). Numerous
comparative studies have concluded that eDNA-based biomonitoring could
complement or even potentially replace classical biomonitoring methods
in the future (Leempoel et al., 2020; Piggott et al., 2020).
eDNA-based biomonitoring offers high scalability at four levels:
physical, economic, biological, and ecological scalability. First,
collecting eDNA samples requires considerably less effort than most
classical biomonitoring techniques. For example, filtering water
samples to detect fish communities requires significantly less time and
resources than surveying using gill nets or electrofishing. Such
physical scalability in sampling enables collection of a large number of
samples covering entire ecosystems with minimal effort (e.g., West et
al., 2021). Second, the high-throughput nature of molecular methods used
in eDNA-based biomonitoring such as quantitative PCR and next-generation
sequencing reduces the economic cost incurred per sample as the scale of
the project increases. Third, eDNA samples typically contain multiple
sources of DNA from many different organisms across the tree of life
(Figure 1) (Barnes & Turner, 2016; Torti et al., 2015). For instance, a
bulk sediment sample can contain microorganisms, invertebrates,
extracellular DNA, and other biological particles of various sizes from
organisms across a wide range of taxa. Thus, eDNA samples are
biologically scalable, in the sense that a single sample can be used to
target diverse taxa and the same sample can be repurposed to detect a
different set of organisms, eliminating the need for repeated sampling
(Dysthe et al., 2018). Fourth, the different types of environmental DNA
present in a sample offer ecological scalability by providing a wide
range of spatio-temporal resolutions of biodiversity for various
applications (Figure 1) (Bohmann et al., 2014). For example, organismal
DNA is best suited for studying the functional ecology of microbial
communities, while extra-organismal and extracellular DNA are considered
relic DNA that obscures the desired functional signal (Carini et al.,
2016; Lennon et al., 2018). In contrast, extra-organismal DNA with a
moderate temporal resolution of a few days is suitable for a
time-sensitive application such as detecting invasive species where
direct sampling of organisms is difficult (Thomas et al., 2020).
Further, extracellular DNA bound to suspended surface-reactive soil
particles with a high temporal resolution of a few weeks to a few months
might benefit long-term biomonitoring with seasonal or annual sampling
strategies (Nagler et al., 2018). Finally, extracellular DNA preserved
in deep sediments with an extremely high temporal resolution ranging
from years to centuries can be used to reconstruct past and
paleo-communities (Bálint et al., 2018). Such flexibility and
scalability make eDNA-based techniques suitable for the very large-scale
biomonitoring programs needed to track fast-changing environments in the
current Anthropocene.
The methodologies employed in eDNA-based biomonitoring are currently
limited to targeted approaches in which a specific species or a group of
related taxa is detected. Routinely, a quantitative PCR assay is used
to sensitively detect a single species, and universal PCR-based
metabarcoding is used to detect a group of related taxa sharing a
common barcoding marker. Recently, PCR-free approaches such as
hybridisation capture sequencing were employed for targeting a single
species (Jensen et al., 2020) and multiple species (Seeber et al., 2019).
The scientific community readily adopted the targeted approaches of
eDNA-based biomonitoring for two main reasons. First, the idea of
targeted monitoring of a single species or a group of related taxa
aligned closely with the existing classical monitoring of a single
invasive species, keystone species, or communities of fishes and
macro-invertebrates (Seymour, 2019). Second, large reference databases
of standardised DNA barcodes (e.g., COI, rbcL-matK, ITS) that are
conserved across high taxonomic ranks (e.g., within a kingdom) and
provide high taxonomic resolution between closely related taxa were available
(Meiklejohn et al., 2019). The International Barcode of Life consortium
is amassing millions of DNA barcodes from a wide range of eukaryotic
species and plans to complete the DNA barcode library for all
multicellular organisms through a planetary biodiversity mission in the
coming decades. Nevertheless, several disadvantages and limitations are
inherent to the targeted approaches that restrict their applicability to
the next frontier of eDNA-based biomonitoring, known as
next-generation biomonitoring (Bohan et al., 2017). Next-generation
biomonitoring aims to push the current limits of classical biomonitoring
by incorporating a broader spectrum of ecological organisational levels,
from genes to meta-ecosystems, in the eDNA-based biomonitoring framework
(Makiola et al., 2020). Since an ecosystem is a continually interacting
web of life whose members influence one another (Bascompte, 2009), next-generation
biomonitoring uses a holistic approach by inferring network properties
from large-scale multi-layered ecological networks derived using
machine-learning approaches (Derocles et al., 2018). The network
properties are then directly linked to ecosystem functions and services
rather than relying on simple biodiversity metrics such as alpha, beta,
and gamma diversity to detect changes in the ecosystem. The key factor
limiting the usage of targeted approaches in next-generation
biomonitoring is their inability to retrieve all the ecological
information that can be, in principle, obtained from an environmental
sample. For instance, due to the dependency on standard DNA barcodes
that differ between major groups of organisms such as prokaryotes,
protists, fungi, plants, and animals, a targeted approach can never be
used to determine the relative abundances of all the organisms across
the tree of life from an environmental sample in a single assay. Such
holistic data are necessary for building ecosystem-level ecological
networks that incorporate trophic and other types of interactions
between species separated by large evolutionary distances. Further, the
lack of data from non-barcoding genes limits the use of targeted
approaches in determining the functional potential of communities and
monitoring their genomic diversity. Finally, various known technical
biases, such as DNA extraction bias, amplification bias, marker bias,
and primer bias, obscure the original diversity and relative abundances
when a wide range of taxa are targeted (van der Loos & Nijland, 2020).
An untargeted and unbiased detection approach that preserves the
underlying structure of biodiversity, in terms of the relative abundance
of biomass across the tree of life, is fundamental to reaping the full
benefits of eDNA-based next-generation biomonitoring programs. Taberlet
et al. were the first to envision that a metagenomic approach,
performing environmental shotgun sequencing on extremely high-throughput
sequencing platforms, could yield billions of DNA sequences per sample
that could be used to monitor biodiversity across the tree of life and
overcome the biases and limitations of targeted approaches (Taberlet et al., 2012b).
Due to the absence of targeted enrichment steps such as PCR and
hybridisation capture, metagenomic approaches provide an unbiased
representation of the input library of environmental DNA sequences. The
proportion of metagenomic sequences that can be taxonomically annotated
will directly depend upon the completeness of reference databases
containing whole genome sequences. Currently, DNA barcode databases
have much higher species coverage than whole genome databases, which is
why metabarcoding is presently preferred over metagenomics (Stat et al.,
2017). Nevertheless, an international
moonshot initiative in biology called the Earth BioGenome Project is set
to change the scenario of incomplete genome databases by generating
genomic resources for all the known eukaryotic species (about 1.5
million) in a record time of about a decade (Lewin et al., 2018). Several
large-scale genome sequencing initiatives have joined this massive
effort targeting a wide variety of taxa
(https://www.earthbiogenome.org/affiliated-project-networks). The
Earth BioGenome Project has adopted a phylogenomic approach in which a
representative reference genome will be generated for each of the
families in the eukaryotic lineage in the first phase of the project,
followed by representative genomes for every genus in the second phase,
and finally sequencing every known species in the third phase. In the
near future, with the progress and completion of various genome
sequencing initiatives, the availability of reference sequences will not
be a barrier to adopting metagenomic approaches. Instead, technical
challenges in the field, wet lab, and dry lab will be the limiting
factors for transitioning from targeted biomonitoring to untargeted
next-generation biomonitoring across the tree of life. Therefore, we
aimed to develop a futuristic biomonitoring pipeline for the Earth
BioGenome era, from sampling to data analysis, that could benefit from
the growing number of whole genome sequences.
Based on a literature survey, we identified five main challenges
encountered while monitoring biodiversity across the tree of life in
large ecosystems using a metagenomic approach: (i) the collection of a
large number of samples to obtain an unbiased representation of the
biodiversity of a vast area (e.g., the Tara Oceans project (Sunagawa et
al., 2020)); (ii) DNA extraction bias due to differences in lysis
efficiencies between cell types from a wide range of taxa (Djurhuus et
al., 2017); (iii) a low proportion of overall taxonomic assignment of
reads due to incomplete reference databases (e.g., Singer et al., 2020);
(iv) an overwhelming proportion of microbial taxa relative to macrobial
taxa due to the high abundance of microbes in environmental samples
(e.g., Stat et al., 2017); and (v) a high number of uninformative
duplicate reads and artefacts introduced by PCR amplification of
sequencing libraries (Kebschull & Zador, 2015). We followed a top-down
approach to alleviate
these issues by adopting logically deduced solutions at every step of
the biomonitoring pipeline: sample collection, environmental DNA
extraction, library preparation, sequencing, and bioinformatics. First,
we targeted the extracellular DNA in the environmental samples using
large-volume water filtration techniques and a customised DNA extraction
protocol. Second, we used a completely PCR-free method of library
preparation to eliminate PCR-induced artefacts and duplicates in the
reads. Third, we pushed the limits of sequencing by performing
ultra-deep sequencing of the PCR-free libraries and obtained billions of
paired-end reads per sample to detect the low-abundant macrobial taxa.
Fourth, we adapted a pseudo-taxonomic approach of read assignment to
approximate the taxa richness when the target organisms are
underrepresented in the reference database. Fifth, we used
incidence-based community analysis rather than abundance-based
statistics to minimise the overshadowing effect of microbial taxa on
macrobial taxa. In this paper, we demonstrate the feasibility and
reproducibility of our novel metagenomic workflow (Figure 2) to
effectively monitor biodiversity across the tree of life in large
aquatic ecosystems in a pilot-scale experimental setup. Large-scale
implementation of such an untargeted and unbiased approach provides a
powerful tool for molecular ecologists to implement next-generation
biomonitoring programs in the Earth BioGenome era.
MATERIALS AND METHODS
Pilot-scale study design
To test our methodology for effective monitoring of biodiversity across
the tree of life in a large ecosystem, we selected Chilika, a highly
biodiverse tropical lagoon ecosystem located on the east coast of India.
Chilika is the second largest brackish-water lagoon in the world,
spanning about 1100 sq. km., with a unique community assemblage
consisting of marine organisms from the Bay of Bengal (the north-eastern
part of the Indian Ocean) and freshwater species from the tributaries of
the Mahanadi river, a major river system in east-central India (Sarkar et
al., 2012). We designed a pilot-scale spatially replicated sampling
strategy to demonstrate the feasibility and reproducibility of our
metagenomic approach. We selected three equally spaced geolocated
stations (S27, S28, S29) on a 10 km transect in the central sector of
the lagoon (Figure 3). We targeted the extracellular environmental DNA
in the water (Figure 1) at the sampling locations. Extracellular DNA
with a very high spatio-temporal resolution is an excellent choice for
seasonal or annual biomonitoring programs in vast ecosystems where
obtaining the maximum information with minimal sampling is critical.
Since Chilika is a shallow water lagoon with an average depth of about 2
metres (Sarkar et al., 2012) and experiences strong coastal winds, the
possibility of vertical stratification in the water column is minimal
and sampling the surface water for extracellular DNA will suffice for
the objective of this study.
Large-volume sample collection
Since extracellular DNA makes up only a fraction of the total eDNA that
can be extracted from a water sample (Torti et al., 2015), large
volumes of water need to be filtered to obtain a sufficient yield of DNA
for PCR-free approaches, which require comparatively large amounts of
input DNA owing to the absence of an amplification step. We selected a
pore size of 0.45µm instead of 0.22µm to allow a greater volume of water to be
filtered before clogging due to the high turbidity observed in the
sampling locations. The selected pore size also retains suspended soil
particles to which extracellular DNA can potentially be bound, such as
clay (<2µm), silt (2-50µm), and sand (50µm-2mm) (Nagler et
al., 2018). Further, we used a 47mm filter membrane made of mixed
cellulose ester (MCE), as it has been shown to bind the free form of
extracellular DNA due to its chemical affinity (Liang & Keeley, 2013).
We filtered about 10 litres of water at each geolocated sampling station
on 28 December 2019 using the integrated eDNA
sampler by Smith-Root Inc. (Thomas et al., 2018). A triplicate sampling
module was used to maximise the rate and volume of water sampled per
location. We set a maximum vacuum pressure of 10psi to minimise any cell
lysis during filtration and maintain a flow rate of less than one litre
per minute. The sampling system also avoids the risk of sample
contamination by utilising sterile single-use self-preserving filter
holders which are replaced for every sampling location. Thus, we did not
use any negative filtration controls during the sampling. We transported
the samples to the laboratory at room temperature in the dark for DNA
extraction. The self-preserving filter holder is made up of desiccating
plastic material which completely removes any traces of water and
preserves eDNA for several weeks in room temperature (Thomas et al.,
2019).
Extracellular DNA enrichment
We adapted and modified a modular lysis-free phosphate buffer-based DNA
extraction protocol from the literature (Lever et al., 2015; Liang &
Keeley, 2013; Taberlet et al., 2012c) to enrich the extracellular DNA
and minimise the proportion of organismal and extra-organismal DNA
(Figure 1) extracted from the filter membranes. The main principle of
the extraction protocol is to desorb the extracellular DNA bound to the
surface of the MCE filter membrane and soil particles without lysing the
intact cellular particles using the saturated phosphate buffer. The
phosphate groups from the buffer compete with the phosphate groups of
the extracellular DNA bound to the surface of soil particles via cation
bridging and desorb the DNA by chemical displacement (Taberlet et al., 2012c).
The desorbed DNA is then isolated through a column-based DNA isolation
protocol using reagents and columns from the Nucleospin soil kit
(Macherey-Nagel, Germany). Within a week of sample collection, we
performed extracellular DNA extraction from the filter membranes in a
clean lab. The bench surfaces were wiped with 50% diluted commercial
bleach, followed by distilled water and 70% ethanol. We used
filter-tips to pipette all the liquids during the extraction to avoid
any aerosol contamination. The phosphate buffer was freshly prepared
before extraction by mixing 0.197g of NaH2PO4 and 1.47g of Na2HPO4 in
100ml of DNA-free water (0.12M, pH 8). Filter membranes were carefully
taken out of the filter
holders and rolled using sterile forceps before placing them into 15ml
falcon tubes containing 5ml phosphate buffer and large ceramic beads
(0.6-0.8mm) from the Nucleospin soil kit. The falcon tubes were shaken
for 10-15 minutes on a vortex-mixer fitted with the vertical falcon
holder module. The large ceramic beads help homogenise soil clumps
recalcitrant to desorption without disrupting intact cells; this process
is fundamentally different from the bead-beating employed in
microbiology to lyse cells. We took care not to mix beyond 15 minutes to
avoid co-extracting a large proportion of humic acids. The homogenised
mixture was immediately centrifuged at
11000 x g to precipitate particulate matter, and the supernatant was
passed through the Nucleospin inhibitor removal column. The DNA-binding
condition of the flow-through was then adjusted using the binding
buffer, and then passed through the silica column. DNA was eluted using
150µl of Tris-EDTA buffer from the Nucleospin soil kit. We did not
include any negative controls during extraction since the high amount of
input DNA required for the PCR-free library preparation protocols cannot
be obtained from negative extraction controls.
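For illustration, the buffer recipe can be cross-checked against the
stated molarity and pH from the molar masses of the anhydrous salts and
the Henderson-Hasselbalch equation; the minimal R sketch below assumes
the textbook pKa2 of phosphate (7.21):

    # Sketch: verify the phosphate buffer recipe (0.197 g NaH2PO4 + 1.47 g
    # Na2HPO4 in 100 ml) against the stated 0.12 M, pH 8. Molar masses are
    # for the anhydrous salts; pKa2 = 7.21 is a textbook constant.
    mw_NaH2PO4 <- 119.98                 # g/mol, anhydrous
    mw_Na2HPO4 <- 141.96                 # g/mol, anhydrous
    mol_acid   <- 0.197 / mw_NaH2PO4     # mol of H2PO4- (weak acid)
    mol_base   <- 1.47  / mw_Na2HPO4     # mol of HPO4^2- (conjugate base)
    molarity   <- (mol_acid + mol_base) / 0.1        # total phosphate in 0.1 L
    pH         <- 7.21 + log10(mol_base / mol_acid)  # Henderson-Hasselbalch
    round(c(molarity = molarity, pH = pH), 2)
    # molarity ~ 0.12, pH ~ 8.0, matching the protocol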
PCR-free library preparation and ultra-deep sequencing
We quantified the extracellular DNA extract using a high-sensitivity
double-stranded DNA assay in Qubit 4 (Thermo Fisher Scientific, USA).
The DNA was diluted to 20ng/µl, and one microgram of DNA was
taken as input for library preparation. We chose the Illumina TruSeq DNA
PCR-free library preparation method to avoid PCR-induced artefacts such
as substitutions, indels, and chimaeras, and to drastically reduce
uninformative duplicate reads that arise from library amplification. We
did not include technical replicates for library preparation, since the
completely PCR-free workflow involves no amplification stochasticity.
Moreover, the spatially-replicated sampling strategy
allows us to test the reproducibility of our approach. The input DNA was
first randomly sheared into 350bp fragments using the Covaris
ultra-sonicator. The ends of the fragmented DNA were repaired and
dA-tailed prior to ligation of unique dual (UD) indexing adaptors for
Illumina (IDT, USA). The UD indices significantly reduce tag jumps that
cause cross-contamination across multiplexed samples. The
adaptor-ligated library fragments were size-selected and purified with
SPRI beads (Beckman Coulter, USA). The fragment sizes of the libraries
were verified using Agilent Bioanalyzer high-sensitivity DNA chip. The
libraries were quantified using a qPCR library validation kit (Takara
Bio, USA) and the concentrations were adjusted before pooling with other
libraries. We adjusted the relative concentration of our libraries to
yield billions of paired-end reads per sample which would enable us to
detect low-abundant macroorganisms. The pooled libraries were denatured
into single-stranded DNA before loading onto a patterned flow cell and
sequenced for 300 cycles on the NovaSeq platform. The NovaSeq 6000 is
the latest series of next-generation sequencers from Illumina; it has
drastically improved short-read quality and provides the extremely
high-throughput capability suited to our deep sequencing requirements.
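For illustration, the depth-targeted pooling can be expressed as a
simple proportional calculation: each library's volume is proportional
to its target read fraction divided by its molar concentration. The R
sketch below uses hypothetical concentrations, read targets, and a
hypothetical 50 µl final pool, not our actual values:

    # Sketch: compute pooling volumes so each library receives its target
    # share of the flow cell output. All values are placeholders.
    libs <- data.frame(
      sample       = c("S27", "S28", "S29"),
      conc_nM      = c(4.2, 3.8, 4.5),        # qPCR-validated molarity
      target_reads = c(1.5e9, 1.5e9, 2.6e9)   # desired paired-end reads
    )
    # Volume is proportional to (target fraction) / (molar concentration):
    frac    <- libs$target_reads / sum(libs$target_reads)
    rel_vol <- frac / libs$conc_nM
    libs$volume_ul <- round(50 * rel_vol / sum(rel_vol), 1)  # 50 µl pool
    libs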
Sequence analysis strategy
The process of taxonomic annotation of metagenomic reads directly
depends on the availability of reference genome sequences from the
target organisms, the sensitivity of the algorithm used to detect
homology, and the taxonomic resolution of the genomic loci. Currently,
the proportion of taxonomically classifiable reads would be very low
because only a small fraction of all the known species have their
genomes assembled, annotated, and archived in public nucleotide sequence
databases. Further, different regions in the genome provide variable
taxonomic resolutions depending on the sequence complexity, mutation
rate, selection pressure, recombination, and evolutionary history of the
species (Coissac et al., 2016). In general, the proportion of
taxonomically classified metagenomic reads decreases as we move from
broad taxonomic ranks towards species-level resolution. Under these
circumstances, the Lowest Common Ancestor (LCA) algorithm originally
implemented in MEGAN (Huson et al., 2007) provides a robust taxonomic
assignment to the queried sequences. The LCA algorithm requires the
output of sensitive search algorithms such as BLAST to assign a
taxonomic label to the query sequence. However, alignment-based homology
detection algorithms such as BLAST are prohibitively slow to query
billions of reads against large reference databases. Alternative
alignment-free k-mer-based algorithms such as KRAKEN (Wood et al., 2019)
are thousands of times faster than BLAST but are far less sensitive and
cannot find homology between divergent species (Lindgreen et al., 2016).
Therefore, we selected a hybrid search algorithm called MMseqs2
(Steinegger & Söding, 2017) which first prefilters hits for each query
using shared short k-mers (k=15) and then utilises a vectorised
Smith-Waterman alignment to infer homology between the query reads and
the database hits. MMseqs2 also provides flexibility to fine-tune the
sensitivity parameter based on the available computing resources, size
of the query datasets, and size of the reference database.
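For illustration, such a search could be launched as in the R sketch
below, which wraps a hypothetical MMseqs2 invocation via system2. The
file and database names are placeholders, the thresholds anticipate the
values reported in the next subsection, and the exact flag names and
values should be verified against the MMseqs2 documentation for the
installed version:

    # Sketch: taxonomic assignment with the MMseqs2 easy-taxonomy workflow.
    # File and database names are placeholders; verify flags in the manual.
    system2("mmseqs", c(
      "easy-taxonomy",
      "reads_qc.fastq",        # quality- and complexity-filtered reads
      "refTaxDB",              # sequence database with taxonomy mapping
      "taxResult",             # output prefix for assignments and reports
      "tmp",                   # scratch directory for intermediate files
      "-e", "1e-5",            # e-value threshold (as used in this study)
      "--min-seq-id", "0.6",   # minimum sequence identity of 60%
      "-c", "0.8",             # minimum alignment coverage of 80%
      "-s", "2",               # sensitivity; trades recall for speed
      "--threads", "32"
    ))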
Pseudo-taxonomic assignment of reads
We adapted a pseudo-taxonomic assignment approach using the dual BLAST
lowest common ancestor algorithm (2bLCA) (Hingamp et al., 2013)
implemented in MMseqs2 (Steinegger & Söding, 2017). The algorithm
searches for a homologous locus from the nearest sequenced species and
finds the best hit when the reference database does not contain the
sequence from the target species. Further, the taxonomic resolution of
the locus is assessed by querying the aligned region of the best hit
against the database. If the locus is not sequenced in any related
species, the taxonomic label of the best hit will be assigned to the
original query; otherwise, the LCA of all the significant hits relative
to the best hit (in terms of E-value) is calculated. Currently, due to
incomplete reference genome databases, most of the taxonomic labels
represent pseudo-taxonomic assignments and not the real taxonomic
identity of the query sequence. Therefore, the taxonomic labels of
metagenomic reads cannot be taken as a direct indicator of species
presence in the sampled environment; in most cases, a label may only
indicate the presence of a related species that is not represented in
the database.
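The LCA computation underlying this step is simply the longest shared
prefix of the hits' root-to-leaf lineages; the self-contained R sketch
below illustrates the idea with toy lineages (not real annotations):

    # Sketch: lowest common ancestor of taxonomic lineages, computed as the
    # longest shared prefix of root-to-leaf lineage vectors. Toy data only.
    lca <- function(lineages) {
      shared <- lineages[[1]]
      for (lin in lineages[-1]) {
        n <- min(length(shared), length(lin))
        match_len <- which(shared[1:n] != lin[1:n])[1] - 1
        if (is.na(match_len)) match_len <- n   # no mismatch found
        shared <- shared[seq_len(match_len)]
      }
      shared
    }
    hits <- list(
      c("Eukaryota", "Chordata", "Mammalia", "Cetacea", "Lipotidae", "Lipotes"),
      c("Eukaryota", "Chordata", "Mammalia", "Cetacea", "Delphinidae", "Orcaella")
    )
    lca(hits)  # "Eukaryota" "Chordata" "Mammalia" "Cetacea"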
We constructed the reference database using the non-redundant nucleotide
sequences (nt) and their taxonomic labels from the International
Nucleotide Sequence Database Collaboration (INSDC), downloaded on
26 October 2020 from the NCBI FTP servers. The nt
database also contains annotated regions from the NCBI GenBank and
RefSeq genome assemblies. We filtered out sequences shorter than 100bp
or longer than 1Mb, as well as repetitive sequences identified by the
annotation keywords (satellite, repeat, repetitive, transposon,
intersperse, tandem). The remaining sequences were clustered at 99%
identity to reduce redundancy using LINCLUST (Steinegger & Söding,
2018). We masked the low-complexity regions in the representative
sequences of each cluster using the DUSTMASKER algorithm implemented in
BLAST+ (Camacho et al., 2009). The raw sequencing reads were also
quality trimmed and low-complexity filtered using FASTP (Chen et al.,
2018) prior to taxonomic assignment. We used the easy-taxonomy workflow
implemented in MMseqs2 to taxonomically annotate the reads with the
2bLCA algorithm (Hingamp et al., 2013). We used an e-value threshold of
1e-5, a minimum sequence identity of 60%, and a minimum
query coverage of 80%. We obtained the 2bLCA taxonomic assignments for
each read of the paired-end reads independently and then calculated the
LCA for every pair. To reduce false-positive assignments, we filtered
out very low frequency taxa at the species, genus, and family levels
that contained either less than 1% of reads relative to their parent
taxon or fewer than ten reads, whichever threshold was higher.
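A minimal R sketch of this frequency filter, on toy counts (the column
names are our own illustration, not the output format of any tool), is
shown below:

    # Sketch: drop low-frequency taxa with fewer reads than
    # max(1% of the parent taxon's reads, 10). Toy example data.
    taxa <- data.frame(
      taxon        = c("GenusA", "GenusB", "GenusC"),
      reads        = c(5000, 45, 8),
      parent_reads = c(100000, 100000, 600)   # reads of the parent family
    )
    threshold <- pmax(0.01 * taxa$parent_reads, 10)
    taxa[taxa$reads >= threshold, ]
    # GenusA passes (5000 >= 1000); GenusB fails (45 < 1000);
    # GenusC fails (8 < 10)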
Incidence-based community analysis
The number of reads assigned to each taxon (copy number) depends on
the abundance of the taxon in the environment, its genome size, and the
fraction of the genome sequence available in the reference database.
Thus, raw read counts do not represent the true abundance of a taxon
when the reference database is incomplete. Moreover, the high abundance
and diversity of microorganisms in the environment inevitably cause
overrepresentation of microbial taxa and obscure the taxonomic
signal from macrobial taxa. Therefore, we adapted an incidence-based
statistical framework instead of the standard abundance-based community
analysis (Colwell et al., 2012). We divided each metagenomic sample
containing billions of reads into multiple sampling units of 100 million
reads each. In each sampling unit, we obtained the counts of reads that
are taxonomically classified at least up to the genus level. We
restricted our analysis to the genus level because the proportion of
taxonomically misidentified sequences at the genus level in the GenBank
database is estimated to be less than 1% (Leray et al., 2019) but
considerably higher at the species level (Locatelli et al., 2020). We
then estimated the asymptotic genus richness of various taxa in each
sample (Hill number of order 0) using statistical
extrapolation of genus accumulation curves of the observed incidences of
different genera in each of the sampling units (Chao et al., 2014). The
interpolation and extrapolation of genus accumulation curves were
performed using the iNEXT R package with 100 bootstrap replicates (Hsieh et
al., 2016). We then assessed sample completeness as a function of
sequencing depth to estimate the optimal depth required to detect
maximum diversity across various taxa. Finally, we measured the
community similarity among the three spatially-replicated samples using
the numbers of unique and shared taxa in the SpadeR R package with 100
bootstrap replicates (Chao & Jost, 2015). SpadeR estimates the Jaccard
and Sørensen indices while accounting for differences in sequencing
depth among the samples.
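For illustration, the incidence-based estimation can be set up as in
the R sketch below, where a toy 0/1 matrix of genera by sampling units
stands in for the real incidence data; the community-overlap step would
then pass the same incidence data to SpadeR:

    # Sketch: incidence-based genus richness with iNEXT. The 0/1 matrix of
    # genera (rows) by 100-million-read sampling units (columns) is a toy
    # stand-in for the real data.
    library(iNEXT)
    set.seed(1)
    inc <- matrix(rbinom(20 * 14, 1, 0.4), nrow = 20,
                  dimnames = list(paste0("genus", 1:20), NULL))
    out <- iNEXT(list(S27 = inc), q = 0,      # q = 0: richness (Hill number)
                 datatype = "incidence_raw",
                 endpoint = 30, nboot = 100)  # extrapolate beyond 14 units
    out$AsyEst   # asymptotic richness estimates with bootstrap SEs
    # Shared/unique genus counts across samples would then be passed to
    # SpadeR for its incidence-based Jaccard and Sørensen estimators.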
RESULTS
Biomonitoring across the tree of life
We generated a total of 851.43 gigabases (Gb) of sequencing data from the
three spatially-replicated extracellular environmental DNA samples. Upon
demultiplexing, we obtained about 2.6 billion paired-end reads from the
Station 29 sample (S29), and 1.5 billion paired-end reads each from
Station 28 (S28) and Station 27 (S27) samples. After quality and
low-complexity filtering, we retained an average of 98.19% (SD 0.39%)
of reads for taxonomic assignment. Since we used a completely PCR-free
library preparation method, the average duplication rate was 6.65% (SD
0.31%) arising from optical duplicates and sequencing of the
complementary strand. We derived 26 sampling units of 100 million reads
from the sample S29 and 14 sampling units each from S28 and S27, which
were separately analysed for determining the taxonomic composition. We
obtained a set of 35.5 million reference sequences by filtering and
clustering about 60.9 million nucleotide sequences from the INSDC at
99% sequence identity. By searching the high-quality reads against the
constructed reference database using MMseqs2, we annotated on average
8.71% (SD 0.54%) of the reads to one of the levels of NCBI taxonomy
using the 2bLCA algorithm. The paired-end structure of the reads allowed
us to correct the discordant taxonomic assignments using the
independently derived LCA labels for each read in the pair. We obtained
a total of 353.07 million pair-merged pseudo-taxonomic assignments
across the three samples. The taxonomic assignments were spread across
the tree of life with different proportions, specifically, Viruses
(51.41%), Archaea (0.32%), Bacteria (44.23%), and Eukaryota (4.02%).
Overall, about 21.53% of the pseudo-taxonomic assignments had a
resolution at least up to the genus level, which provided an estimate of
1815 genera across the tree of life in our samples. Eukaryota had the
highest proportion of reads classified up to the genus level (45.65%),
and the Viruses had the lowest proportion with just 3.67%.
Specifically, the results indicated the presence of 47 genera of
Viruses, 64 genera of Archaea, 675 genera of Bacteria, and 1029 genera
of Eukaryotes among the sequenced samples (Figure 4). Further,
eukaryotic genera were distributed across all the kingdoms, namely,
Protists (32.36%), Fungi (18.85%), Viridiplantae (18.85%), and
Metazoa (29.93%) (Table 1A). Since low-abundance macro-organisms are
the most difficult to detect, we inspected the number of genera observed
among the commonly targeted groups of Metazoa in the sequenced samples.
The 308 metazoan genera comprised 39.61% chordates and 60.39%
invertebrates. Phylum Arthropoda with 97 genera and class Actinopteri
with 61 genera had the highest representation among the invertebrates
and chordates, respectively. Among the low represented groups were Mammalia with 30
genera, Aves with 14 genera, Amphibia with four genera, Chondrichthyes
with three genera, and Reptiles with one observed genus.
Optimal sequencing depth for biomonitoring
To estimate an optimal depth of sequencing to detect maximum diversity
with minimal effort, we generated extrapolated genus accumulation
curves across various taxa for each sample as a function of sequencing
depth. All the accumulation curves show saturation as the sequencing
depth increases beyond 1 billion paired-end reads (Figure 5). We then
estimated the asymptotic genus richness of each taxon in all the
samples using the accumulation curves (Table 1B). The asymptotic genus
richness represents the maximum diversity of a particular taxon that
can be detected in the given sample. Using the asymptotic genus
richness of respective samples as a reference point, we then estimated
sample coverage as a function of sequencing depth (Figure 6). Sample
coverage is the fraction of genera detected by at least one read at a
given depth. At the lowest depth of 100 million reads, samples have
the highest coverage for Bacteria (AVG 93.59%, SD 1.54%) followed by
Viruses (AVG 87.75%, SD 1.52%) and the lowest coverage for Aves (AVG
55.51%, SD 15.03%) followed by Mammalia (AVG 60.42%, SD 14.13%).
At a read depth of 1 billion reads, all the taxa have coverage greater
than the threshold of 90% (Table 1C) with the lowest coverage for
Mammalia (AVG 91.74%, SD 13.7%). With double the sequencing effort
of 2 billion reads, the coverage increases only by 3.88% for Mammalia
(AVG 95.62%, SD 7.57%), crossing the 95% threshold for all the taxa
under consideration (Figure 6).
Effectiveness of the pseudo-taxonomic approach
Since we adopted a pseudo-taxonomic assignment strategy to alleviate the
issue of incomplete databases, the taxonomic assignments may only
indicate the presence of a related species when the target species is
not represented in the reference database. Our spatially-replicated
sampling strategy allowed us to estimate the reproducibility of the
pseudo-taxonomic approach by comparing the unique and shared number of
genera across various taxa among the samples. The community similarity
among the samples was measured using the Sørensen and Jaccard indices
across various taxa (Figure 7). The lowest community similarity was
found among Protists, with a Sørensen index of 0.96 (Table 1D) and a
Jaccard index of 0.91 (Table 1E). To further examine the effectiveness
of the pseudo-taxonomic approach for detecting a low-abundance taxon
that is underrepresented in the reference database, we chose the
Irrawaddy dolphin (Orcaella brevirostris), an endangered flagship
species of the Chilika ecosystem, as a case study. Since the whole genome of the
Irrawaddy dolphin has not been assembled, we examined hits to related
species whose annotated regions of the genome are available in the
reference database. We found 12,399 hits to Lipotes, a possibly
extinct genus of Chinese river dolphin endemic to the Yangtze river in
China, whose whole genome sequence is available in the reference
database. There were no other hits in our filtered taxonomy dataset to
any other marine mammals in the reference database (Supporting data).
DISCUSSION
Biomonitoring across the tree of life is plausible
Our results show that PCR-free ultra-deep shotgun sequencing of
extracellular environmental DNA is a promising approach for
next-generation biomonitoring across the tree of life in large aquatic
ecosystems. By generating one of the deepest shotgun sequencing datasets
for environmental DNA to date, we push the limits of biomonitoring to
detect biodiversity even from low-abundance macroorganisms in the
ecosystem (Figure 4). We also achieve very low duplication rates in the
sequences by eliminating PCR from the laboratory workflow, which could
otherwise render a considerable part of the data useless by reducing
the effective depth. However, due to stringent filtering and a
relatively less sensitive search algorithm, we obtain a lower proportion
of taxonomically annotated sequences (recall rate) than studies
using BLAST searches (e.g., Cowart et al., 2018). The compromise on
the recall rate was necessary for our study as BLAST is computationally
infeasible to search billions of reads against large reference
databases. Nevertheless, algorithms less sensitive than BLAST, such as
the MMseqs2 used in this study, are likely to miss only evolutionarily
distant homologs rather than related species. Thus, the reads left
unclassified due to the lower sensitivity would mostly have been
assigned by the LCA algorithm to broad ranks above the family level,
which does not affect our analysis at the genus level. On the contrary, our
choice of analysing the paired-end reads independently and merging them
after verifying their taxonomic annotations provides more confidence in
the classification, at the cost of fewer reads assigned to the genus
level, owing to the reduced read length of 150bp compared to a merged
read length of 300bp. Further studies are required
to investigate the effect of such computational choices on the results
to optimise the computational workflow.
Although Eukaryotes made up only 4.02% of the total taxonomic
assignments, the number of genera detected was higher in Eukaryota than
Viruses and Bacteria, which made up most of the classified reads,
possibly for three reasons. First, even at 4.02%, the eukaryotic reads
accounted for over 14.2 million taxonomic assignments owing to the high
depth of our dataset. Second, Eukaryota had the highest proportion of
reads classified up to the genus level with about 6.49 million reads.
Such a large number of reads with high taxonomic resolution provides
power to detect low-abundance macro-organisms comparable to that of
targeted studies. Third, despite the high abundance of
Viruses and Bacteria in the environment, more than 99% of their
diversity is yet to be catalogued (Locey & Lennon, 2016). Hence, we
observe a greater number of eukaryotic genera than Bacteria and Viruses.
Further, Viruses had the lowest proportion of reads classified up to
the genus level (3.67%), likely due to their high mutation rates and
divergence between lineages requiring protein level comparisons to
detect remote homology (Nooij et al., 2018). The low number of Viral
genera detected could also be due to the missing diversity of RNA
viruses that cannot be detected with DNA-based direct metagenomic
methods (Zhang et al., 2019).
Additionally, our results demonstrate successful detection of a large
number of genera from various groups of Metazoa, which indicates the
presence of detectable amounts of macro-faunal DNA in the extracellular
eDNA fraction. The main limiting factor for detecting
macroorganisms is the depth of sequencing that determines the
sensitivity towards low-abundance taxa. In fact, a previous study with
shallow depth (22.3 million reads) concluded that metagenomics is not a
suitable approach for biomonitoring eukaryotes from an abundance
perspective (Stat et al., 2017). In contrast, another study with
greater depth (328.7 million reads) concluded that eDNA metagenomics is
suitable for detecting marine benthic fauna from a presence/absence
viewpoint (Cowart et al., 2018). We utilise an extremely high-throughput
next-generation sequencer to generate billions of reads per sample and
show that biodiversity across the tree of life, including low abundant
macroorganisms, can be reliably monitored using an incidence-based
statistical framework. We detected a high diversity of commonly
monitored groups such as Arthropoda and Actinopteri, consistent with the
known inventories of highly diverse species occurring in the sampled
ecosystem (Suresh et al., 2018). Reptiles had the lowest
diversity among the inspected taxa, likely due to their low shedding
rates of eDNA (Adams et al., 2019). A commonly observed silver lining in
eDNA-based studies is the detection of unintended organisms, also
referred to as by-catch. eDNA studies in aquatic ecosystems have
demonstrated the ability to monitor even terrestrial species
occurring in the catchment area (Deiner et al., 2016). Therefore, a
significant proportion of the detected taxa in our study may represent
terrestrial organisms inhabiting the catchment area. This is most
evident in the class Mammalia, which contains taxonomic assignments of
rodents and bats, and a high number of reads from humans, dogs, cats,
and domestically farmed animals (Supporting data), possibly originating
from untreated sewage runoff into the ecosystem.
One billion reads may be optimal for biomonitoring
Although sequencing costs have dropped drastically over the last
decade, biomonitoring projects with a large number of samples may
require considerable funding to perform ultra-deep sequencing. Hence, we
also determine the optimal sequencing depth required to detect maximum
biodiversity with minimal sequencing effort. Since the abundance of
organisms across the tree of life vary widely, we define the optimal
depth as the discrete sequencing depth (in multiples of 100 million) at
which all the examined taxa, including the low-abundant ones, cross a
maximum sample coverage threshold (at the genus level) that cannot be
further increased with minimal sequencing effort. In this study, we
generated a total of 5.6 billion paired-end reads but obtained unequal
depths of sequencing data across samples due to the differences in
library concentrations. Statistical extrapolation of genus accumulation
curves allowed us to mitigate the technical issues caused by unequal
sequencing depths and predict values up to a maximum sequencing depth of
3 billion reads per sample, which is the practical limit under current
sequencing costs. Our results indicate that about one billion paired-end
reads may be optimal for biomonitoring across the tree of life, which
could detect about 90% of the diversity (Figure 6). Doubling the depth
from one billion to two billion reads only increased the coverage to
95%, indicating saturation. Filtering false positives from low-abundance
taxa with an extremely small number of reads is challenging, especially
when the number of samples is low. One can take advantage of the
presence of low-abundance taxa in multiple spatiotemporally replicated
samples to differentiate them from false positives. Currently,
generating one billion paired-end reads per sample may be feasible for
small to moderately sized biomonitoring programs when they utilise
next-generation sequencers such as the Illumina NovaSeq 6000. The
NovaSeq platform can deliver paired-end reads up to 150bp in length at
extremely high depths in a single run, reducing the cost incurred per
sample. Such sequencing depths are also now typical of various
applications in biology. For instance, re-sequencing a human genome at
30x coverage requires 100 Gb of sequencing, and de novo assembling a
mammalian sized genome at 100x coverage requires 300 Gb of sequencing,
whereas one billion 150bp paired-end reads for biomonitoring requires
150 Gb of sequencing. In contrast, targeted approaches such as
metabarcoding may require only 10 million reads per library to achieve
good sensitivity (Singer et al., 2019). Due to the relatively low-depth
requirement per library, metabarcoding may seem to be a better choice
than metagenomics. However, the cost of targeted approaches also
increases significantly considering the number of technical replicates
required to alleviate PCR-induced stochasticity, libraries with
different primers targeting a taxon to reduce primer-bias, and libraries
with different markers to target different groups of taxa to achieve
tree of life metabarcoding (Stat et al., 2017; van der Loos & Nijland,
2020). For example, to implement a completely unbiased biomonitoring
program across the tree of life using metabarcoding, one may need at
least three technical replicates per primer pair, two different sets of
primers per marker for each of the six groups, namely, Archaea (16S,
23S), Bacteria (16S, 23S), Protists (18S, 28S), Fungi (18S, ITS), Plants
(matK, rbcL), and Animals (COI, CytB) which can add up to 72 libraries
per location (3 technical replicates x 2 primer sets x 2 markers x 6
taxa) requiring 720 million reads. Although the tree of life
metabarcoding example is an extreme case on the higher-end, ultra-deep
shotgun sequencing of eDNA is a much simpler choice with relatively
fewer complications and might become more attractive with the further
reduction of sequencing costs in the future.
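The read-budget arithmetic behind this comparison is summarised in the
short R sketch below, using the per-library figure of 10 million reads
quoted above:

    # Sketch: read budgets for tree-of-life metabarcoding vs. metagenomics,
    # using the figures quoted in the text.
    replicates  <- 3   # technical replicates per primer pair
    primer_sets <- 2   # primer sets per marker
    markers     <- 2   # markers per taxonomic group
    groups      <- 6   # Archaea, Bacteria, Protists, Fungi, Plants, Animals
    libraries   <- replicates * primer_sets * markers * groups  # 72/location
    c(libraries           = libraries,
      metabarcoding_reads = libraries * 10e6,  # ~10 M reads each: 720 M
      metagenomic_reads   = 1e9)               # one PCR-free library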
The pseudo-taxonomic approach is reproducible
Currently, for most non-model organisms whose genomes are not yet
assembled, the reference sequence databases contain only a few loci that
are routinely used in barcoding and population genetics. Also, due to a
high proportion of uncatalogued novel taxa in the environment, a large
number of sequenced reads may not directly match the source taxa from
which they originated. Taxonomy-free approaches are being employed in
metabarcoding studies to eliminate the dependency on the reference
databases (Mächler et al., 2020). In marker-based targeted studies, the
derived sequences arise from a single locus and can be clustered into
operational taxonomic units (OTUs) based on sequence identity (Westcott
& Schloss, 2015) or into amplicon sequence variants (ASVs) based on the
denoised sequences (Tikhonov et al., 2015). The relative abundance of
different OTUs or ASVs provides a taxonomy-free approach to compare
diversity across samples (Callahan et al., 2017). In metagenomic
studies, however, the sequences arise from loci across the genome
with varying taxonomic resolutions. Further, higher organisms with large
genome sizes may only have a fraction of their genome sampled. In such a
scenario, utilising a taxonomy-free approach is not feasible. Hence, we
adapted a pseudo-taxonomic approach that combines the advantages of
taxonomy-based and taxonomy-free approaches by being only partially
dependent on reference databases. In the pseudo-taxonomic approach, the nearest available
species is chosen as the best hit when the locus from the target species
is not available in the reference database. Our results from the
spatially-replicated samples show very high community similarity (Figure
7), indicating high reproducibility of the pseudo-taxonomic approach. To
illustrate, we also provide the example of the Irrawaddy dolphin, an
endangered flagship species of Chilika, whose whole genome is not
available in the database but which is expected to be present in our samples.
Interestingly, thousands of sequences from the Irrawaddy dolphin hit the
Chinese river dolphin genome in the database, even though the database
contained other genomes from the family Delphinidae that may be
phylogenetically closer. We suspect that the annotated gene sequences
from the other Delphinidae genomes were clustered with the Chinese river
dolphin gene sequences due to high sequence similarity during the
redundancy-reduction step of database construction. The
pseudo-taxonomic approach will provide evolutionarily closer hits to the
source taxon as better-populated databases become available in the
future.
The high similarity in community composition across spatially-replicated
samples also indicates that the spatial resolution of biodiversity
provided by extracellular DNA is much higher than 10 km in the sampled
ecosystem. This high spatial resolution could be due to the homogeneous
distribution of extracellular DNA in the water column, and it implies
that extracellular DNA may be suitable for monitoring large ecosystems
by effectively reducing the number of samples required to represent
them. Our study also emphasises the
importance of pilot-scale studies before designing large-scale
biomonitoring programs to evaluate the spatio-temporal resolutions of
the targeted type of environmental DNA (Figure 1) in the ecosystem under
consideration.
Limitations and conclusion
Successful application of technologies to solve practical problems
requires a thorough understanding of their technical strengths and
weaknesses. Numerous studies in the past decade have revealed the
limitations of the widely-used PCR-based metabarcoding technique
(Elbrecht et al., 2017). Such targeted approaches may only be suitable
for monitoring a particular taxon of interest but not the entire tree of
life due to the inherent inability to provide the relative abundances
across major groups of organisms that have different standardised
barcoding loci. Moreover, next-generation biomonitoring employs a more
holistic approach of monitoring a wide range of ecological entities,
from genes to entire ecosystems, which requires data from across the
genomes of individuals from various species across the tree of life (Bohan et al.,
2017). Hence, advancements in metagenomic technologies are an
indispensable requirement for deploying next-generation biomonitoring at
a global scale (Makiola et al., 2020). Our novel metagenomic workflow
(Figure 2), with adaptations at every level of the pipeline, from sample
collection and DNA extraction to library preparation, sequencing, and
bioinformatics, is an attempt to develop futuristic technologies that can
deliver the required data in an unbiased way for next-generation
biomonitoring. Nevertheless, our metagenomic approach also has some
limitations that warrant further testing and optimisation. Although we
achieved our objectives with a limited number of samples in a
pilot-scale study in a model ecosystem, more testing is necessary in
different ecosystems where obtaining microgram quantities of
extracellular DNA may be difficult (e.g., tundra ecosystems). Further,
applicability to terrestrial ecosystems should be rigorously tested,
since extracellular DNA isolated from soil pooled across multiple
sampling plots may only be useful for detecting microbes, plants, and
subterranean organisms, but not large terrestrial animals. Such cases
may need innovative sampling strategies, such as
collecting water samples from waterholes routinely visited by animals
(Seeber et al., 2019). Next, obtaining one billion reads per sample may
be economically prohibitive for very large-scale biomonitoring programs
at the current costs of sequencing. Hence, a metagenomic approach may
not be widely adopted by researchers and ecosystem managers until the
deep sequencing costs become more affordable. Finally, our approach is
currently limited to monitoring broad-scale changes in the ecosystem
through pseudo-taxonomic labels at the genus level derived from
uncurated reference sequence databases. For applications such as resolving
fine-scale community structure at the species level, targeted approaches
using curated barcode databases may be more suitable than metagenomics
(Singer et al., 2020).
Despite the limitations, our approach has the potential to be adapted
and incorporated into large-scale next-generation biomonitoring programs
in the future due to three key features. First, it can simultaneously
and effectively detect biodiversity across the tree of life in a single
assay, enabling inference of species interactions across kingdoms by
tracking changes in relative abundances. Second,
the multi-faceted nature of the generated data from the whole genome
allows for monitoring diversity at the gene level, mapping of functional
traits to specific taxa, and linking community changes to ecosystem
functioning and services. Third, the high spatio-temporal resolution of
biodiversity provided by our approach through employing extracellular
DNA (Figure 1) enables monitoring of large ecosystems by significantly
reducing the sampling effort. Plummeting sequencing costs and the
growing number of large-scale genome sequencing projects provide an
excellent opportunity
to further test and optimise our novel metagenomic workflow under
different conditions. We believe that this study advances our
understanding of the capabilities of eDNA-based metagenomics, which is
necessary for a paradigm shift towards implementing large-scale
next-generation biomonitoring programs across the tree of life in the
Earth BioGenome era.