2.3 Methods

First, DNA was extracted from 35 C. hainanense leaf samples, and its quality and concentration were checked before the samples were sent to Hangzhou Lianchuan Biotechnology Co. for sequencing. After sequencing, SNPs were mined from the C. hainanense genome. Based on these SNPs, a phylogenetic tree of C. hainanense was constructed using the neighbor-joining algorithm in MEGA. Principal component analysis (PCA) was then performed on the C. hainanense populations using the same SNPs. In addition, the population structure of all samples was analyzed with ADMIXTURE to estimate the distribution of genetic material among the different populations of C. hainanense. Finally, genetic distances among all samples were calculated from the SNPs. The detailed method is described by Chen et al. (2022).
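
As an illustration of the final step, pairwise genetic distances can be computed directly from a genotype matrix. The text does not state which distance measure was used, so the sketch below assumes a simple identity-by-state (allele-sharing) distance, with genotypes coded as the number of alternate alleles (0, 1 or 2) and None for missing calls. The function names and the coding scheme are illustrative assumptions, not the procedure of Chen et al. (2022).

```python
import math


def ibs_distance(g1, g2):
    """Allele-sharing distance between two samples.

    g1, g2: lists of genotypes coded 0/1/2 (alternate-allele count),
    with None for a missing call. Sites missing in either sample are skipped.
    Returns a value in [0, 1], or NaN if no site is shared.
    """
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not pairs:
        return math.nan
    # Each site contributes 0, 0.5 or 1 (0, 1 or 2 alleles differing).
    return sum(abs(a - b) for a, b in pairs) / (2 * len(pairs))


def distance_matrix(samples):
    """samples: dict mapping sample name -> genotype list (hypothetical input)."""
    names = list(samples)
    matrix = [[ibs_distance(samples[a], samples[b]) for b in names] for a in names]
    return names, matrix


if __name__ == "__main__":
    toy = {"S1": [0, 1, 2, 0], "S2": [0, 1, 2, 0], "S3": [2, 1, 0, None]}
    names, matrix = distance_matrix(toy)
    for name, row in zip(names, matrix):
        print(name, [round(d, 3) for d in row])
```

A matrix of this kind is the usual input to a neighbor-joining tree, although MEGA computes its own distances from the aligned data.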

2.3.1 Enzyme digestion protocol design

Following previously published methods, we selected the restriction enzyme combination HaeIII + Hpy166II for the simplified-genome (reduced-representation) digestion scheme, with the 'Insert Size' set to 550-600 bp (Xia et al., 2019).
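
A digestion scheme of this kind can be explored in silico before library preparation. The sketch below is a minimal illustration, not the pipeline used here: it assumes the standard recognition sites GG^CC for HaeIII and GTN^NAC for Hpy166II (both blunt cutters with palindromic sites, so scanning one strand is sufficient), and it treats the 'Insert Size' window as a plain fragment-length filter. The function names are hypothetical.

```python
import re

# Recognition site (as a lookahead, so overlapping sites are found) and the
# offset of the blunt cut within the site: HaeIII GG^CC, Hpy166II GTN^NAC.
ENZYMES = {
    "HaeIII": (re.compile(r"(?=GGCC)"), 2),
    "Hpy166II": (re.compile(r"(?=GT[ACGT]{2}AC)"), 3),
}


def cut_positions(seq):
    """Return the sorted cut coordinates of both enzymes on seq."""
    seq = seq.upper()
    cuts = set()
    for pattern, offset in ENZYMES.values():
        for match in pattern.finditer(seq):
            cuts.add(match.start() + offset)
    return sorted(cuts)


def digest(seq):
    """Split seq at every cut site and return the resulting fragments."""
    bounds = [0] + cut_positions(seq) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments, lo=550, hi=600):
    """Keep fragments whose length falls inside the selection window."""
    return [f for f in fragments if lo <= len(f) <= hi]


if __name__ == "__main__":
    toy = "AAAAGGCCTTTTGTACACAA"
    fragments = digest(toy)
    print(fragments)
    # A tiny window is used here only because the toy sequence is short.
    print(size_select(fragments, lo=6, hi=9))
```

This sketch does not record which enzyme produced each fragment end, and it ignores adapter length; both would matter in a real library design.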

2.3.2 Sequencing Quality Control

The raw data (the reads in the original off-machine data) generated by sequencing were pre-processed by quality filtering to obtain clean data. The processing steps were as follows: 1) remove adapters; 2) remove reads in which N (a base that could not be determined) accounts for more than 5% of the bases; 3) remove low-quality reads (reads in which bases with quality value Q <= 10 account for more than 20% of the read); 4) count the raw sequencing volume, the effective sequencing volume, Q20 (the proportion of bases with quality values greater than or equal to 20, i.e. a sequencing error rate of at most 0.01), Q30 (the proportion of bases with quality values greater than or equal to 30, i.e. a sequencing error rate of at most 0.001), and GC content (the proportion of guanine (G) and cytosine (C) bases), and perform a comprehensive evaluation.
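
The filtering thresholds and summary statistics above can be expressed compactly in code. The following is a minimal sketch, assuming Phred+33 quality encoding and reads already parsed into (sequence, quality string) pairs. Adapter removal (step 1) is not shown, and the function names are illustrative rather than those of the software actually used.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 encoding is assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def passes_filter(seq, quality_string, max_n_frac=0.05, low_q=10, max_low_frac=0.20):
    """Apply steps 2 and 3: drop reads with more than 5% N or more than 20% bases at Q <= 10."""
    if not seq:
        return False
    quals = phred_scores(quality_string)
    n_frac = seq.upper().count("N") / len(seq)
    low_frac = sum(q <= low_q for q in quals) / len(quals)
    return n_frac <= max_n_frac and low_frac <= max_low_frac


def summarize(reads):
    """Step 4: read count, base count, Q20, Q30 and GC content for a set of reads."""
    quals = [q for _, qs in reads for q in phred_scores(qs)]
    bases = "".join(seq.upper() for seq, _ in reads)
    total = len(bases)
    if total == 0:
        return {"reads": 0, "bases": 0, "q20": 0.0, "q30": 0.0, "gc": 0.0}
    return {
        "reads": len(reads),
        "bases": total,
        "q20": sum(q >= 20 for q in quals) / total,
        "q30": sum(q >= 30 for q in quals) / total,
        "gc": (bases.count("G") + bases.count("C")) / total,
    }


if __name__ == "__main__":
    raw = [
        ("ACGTACGTAC", "IIIIIIIIII"),  # high quality, kept
        ("NNGTACGTAC", "IIIIIIIIII"),  # 20% N, removed
        ("ACGTACGTAC", "+++IIIIIII"),  # 30% of bases at Q10, removed
    ]
    clean = [r for r in raw if passes_filter(*r)]
    print(summarize(raw))
    print(summarize(clean))
```

Comparing the summary of the raw reads with that of the clean reads gives the raw and effective sequencing volumes referred to in step 4.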

2.3.3 Alignment to consensus sequences

We used the Burrows-Wheeler Aligner (BWA) to align the sequencing data to the consensus sequences obtained from read clustering. Because the reference is the set of consensus sequences obtained from read clustering, the mapping rate varies somewhat between samples.

2.3.4 Variation detection and SNP statistics

After aligning the data to the consensus sequences, we used the Genome Analysis Toolkit (GATK) and SAMtools for variant detection, retaining the SNPs that were consistently called by both tools as reliable loci. We then filtered the SNP data by minor allele frequency (MAF > 0.05) and data integrity (> 0.8), retaining the polymorphic SNPs among them. The final filtered SNPs were used as input for the subsequent evolutionary analysis. Based on these SNP data, we analyzed the genetic evolution and population structure using the differences in genetic information among the C. hainanense samples, including the phylogenetic relationships among samples, population structure, principal component analysis (PCA), and relatedness among samples. This part of the analysis involves grouping the samples: the 35 samples were divided into six groups according to species.
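
The two-caller intersection and the subsequent filters can be sketched as follows. This is an illustration under stated assumptions, not the actual pipeline: each call set is represented as a dictionary keyed by a hypothetical (contig, position, ref, alt) tuple, genotypes are coded as alternate-allele counts (0, 1, 2) with None for missing calls, data integrity is interpreted as the fraction of samples with a non-missing genotype, and the genotypes of the first call set are used for shared sites.

```python
def maf_and_completeness(genotypes):
    """Return (minor allele frequency, fraction of non-missing calls) for one SNP."""
    called = [g for g in genotypes if g is not None]
    completeness = len(called) / len(genotypes)
    if not called:
        return 0.0, completeness
    alt_freq = sum(called) / (2 * len(called))  # diploid samples assumed
    return min(alt_freq, 1 - alt_freq), completeness


def filter_snps(gatk_calls, samtools_calls, min_maf=0.05, min_completeness=0.8):
    """Keep SNPs reported by both callers that pass the MAF and integrity filters."""
    shared = gatk_calls.keys() & samtools_calls.keys()
    kept = {}
    for key in sorted(shared):
        maf, completeness = maf_and_completeness(gatk_calls[key])
        # MAF > 0.05 also guarantees that the site is polymorphic.
        if maf > min_maf and completeness > min_completeness:
            kept[key] = gatk_calls[key]
    return kept


if __name__ == "__main__":
    gatk = {
        ("c1", 10, "A", "G"): [0, 1, 2, 0, 1, 0, 0, 1, 0, 0],
        ("c1", 20, "C", "T"): [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # monomorphic
        ("c2", 5, "G", "A"): [0, 1, None, None, None, 1, 0, 2, 0, 0],  # 70% called
        ("c3", 7, "T", "C"): [0, 1, 1, 0, 2, 0, 0, 1, 0, 0],  # GATK only
    }
    samtools = {key: value for key, value in gatk.items() if key[0] != "c3"}
    print(list(filter_snps(gatk, samtools)))
```

In this toy input only the first site survives: the second is monomorphic, the third falls below the integrity threshold, and the fourth is reported by one caller only.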