2.6 Evaluation of assembly quality
The quality of the assembly was evaluated using the mapping rate of the paired-end and long reads to the assembly (Figure S1). We also evaluated the completeness and accuracy of the genome assembly using Benchmarking Universal Single-Copy Orthologs (BUSCO) version 3.0.2 (Simão et al., 2015). Genome completeness was further evaluated by mapping transcripts from 18 tissues and organs (Table S1) to the assembly using GMAP (Wu and Watanabe, 2005).
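The two summary statistics used here can be illustrated with a minimal Python sketch. It assumes the conventional definitions: mapping rate as the percentage of reads aligned to the assembly, and BUSCO completeness as the percentage of complete orthologs (single-copy plus duplicated) among all orthologs searched. The function names and the example counts are hypothetical and are not results from this study.

```python
def mapping_rate(mapped_reads: int, total_reads: int) -> float:
    """Return the percentage of reads that aligned to the assembly."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= mapped_reads <= total_reads:
        raise ValueError("mapped_reads must lie between 0 and total_reads")
    return 100.0 * mapped_reads / total_reads


def busco_completeness(single: int, duplicated: int,
                       fragmented: int, missing: int) -> float:
    """Return the percentage of complete BUSCOs (single-copy + duplicated)."""
    total = single + duplicated + fragmented + missing
    if total <= 0:
        raise ValueError("at least one ortholog must be counted")
    return 100.0 * (single + duplicated) / total


# Hypothetical counts, for illustration only.
example_rate = mapping_rate(mapped_reads=950, total_reads=1000)
example_busco = busco_completeness(single=90, duplicated=5,
                                   fragmented=3, missing=2)
```

In practice the read counts would come from the alignment summary of the mapper, and the ortholog counts from the BUSCO short summary file.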
2.7 Genome annotation
We annotated repeat sequences, gene structure, and noncoding RNA in the Chinese walnut genome (workflow, Figure S2). We used both homology-based prediction and de novo prediction to identify transposable elements (TEs). For de novo prediction, we constructed a repeat sequence database using RepeatModeler (http://www.repeatmasker.org) and predicted repeat sequences using RepeatMasker (Maja et al., 2009) (http://www.repeatmasker.org), LTR-FINDER (Zhao and Hao, 2007), and PILER (Edgar and Myers, 2005) with default parameters. For homology-based prediction, we identified TEs at the DNA level and at the protein level by comparing the genomic sequence with the Repbase v21.12 database (Jurka, 2000) using RepeatMasker (Maja et al., 2009) and RepeatProteinMask v4.0.7 (Maja et al., 2009), respectively. Finally, all TEs identified by either method were merged into the final transposon annotation. Tandem repeats in the assembled Chinese walnut genome were also annotated using Tandem Repeats Finder (TRF) v4.09 (Benson et al., 1999).
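The merging step is not described in detail here. One common way to combine repeat calls from several tools into a non-redundant set is to take the union of overlapping intervals on each sequence, as in the following sketch. It assumes that annotations are available as (sequence ID, start, end) tuples in half-open, BED-style coordinates; the function name and the example records are hypothetical.

```python
from collections import defaultdict


def merge_repeat_intervals(*annotation_sets):
    """Union overlapping or adjacent intervals from several annotation sets.

    Each annotation set is an iterable of (seq_id, start, end) tuples in
    half-open coordinates. Returns {seq_id: [(start, end), ...]} with
    sorted, non-overlapping intervals.
    """
    by_seq = defaultdict(list)
    for annotations in annotation_sets:
        for seq_id, start, end in annotations:
            by_seq[seq_id].append((start, end))

    merged = {}
    for seq_id, intervals in by_seq.items():
        intervals.sort()
        result = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= result[-1][1]:
                # Overlaps or touches the previous interval: extend it.
                result[-1][1] = max(result[-1][1], end)
            else:
                result.append([start, end])
        merged[seq_id] = [tuple(iv) for iv in result]
    return merged


def total_repeat_bases(merged):
    """Sum the lengths of all merged intervals."""
    return sum(end - start for ivs in merged.values() for start, end in ivs)


# Hypothetical records standing in for de novo and homology-based calls.
de_novo = [("chr1", 0, 100), ("chr1", 200, 300)]
homology = [("chr1", 50, 150), ("chr2", 10, 20)]
merged = merge_repeat_intervals(de_novo, homology)
```

Taking the union avoids double-counting bases that are reported by more than one tool when the total repeat content of the genome is calculated.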
To ensure accurate gene structure annotation, we combined homology-based and de novo prediction methods. RNA sequences from 18 tissues (Table S1) were used to train AUGUSTUS with default parameters (Stanke et al., 2006). We predicted gene structure de novo from the statistical characteristics of the genomic sequence (such as codon frequency and the distribution of exons and introns) using SNAP (Johnson et al., 2008). We further predicted the structure of protein-coding genes by homology with genes identified in Arabidopsis thaliana, Citrus sinensis, Juglans regia, Malus domestica, Olea europaea, Oryza sativa, Populus euphratica, Quercus robur, and Chinese walnut using Exonerate v2.2.0 (Slater et al., 2005). The final structural annotation of protein-coding genes was produced with the MAKER pipeline (Holt et al., 2011), which integrates the AUGUSTUS predictions (Stanke et al., 2006) with results from homologous protein mapping, RNA-seq mapping, and Nanopore mapping.
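As an illustration of one of the sequence statistics mentioned above, the following Python sketch computes relative codon frequencies for a single in-frame coding sequence. It is only a simplified example of this kind of statistic and does not reproduce the internal models of SNAP or AUGUSTUS; the function name and the example sequence are hypothetical.

```python
from collections import Counter


def codon_frequencies(cds: str) -> dict:
    """Return the relative frequency of each codon in an in-frame sequence.

    Trailing bases that do not form a complete codon are ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical toy sequence: ATG GCC GCC TAA
example = codon_frequencies("ATGGCCGCCTAA")
```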