Chinese Herbal Plant Genome Database

Genome

De novo genome assembly of the PacBio reads was performed using NextDenovo assembler version 2.3.1 (https://github.com/Nextomics/NextDenovo) with the following parameters “input_type = raw, read_cutoff = 1k, seed_cutoff = 10k, minimap2_options_raw = ava-pb, minimap2_options_cns = ava-pb”. The genome assembly produced by NextDenovo was further polished using NextPolish software version 1.0.5 (Hu et al., 2019) with cleaned short-reads as input. A non-scaffolded draft assembly was produced after two rounds of polishing.

Hi-C technology was applied to build a chromosome-scale genome assembly. Quality filtering of Hi-C read pairs was performed using the same procedure as described in the genomic Illumina reads filtering step. Then the Hi-C read pairs were mapped to the draft assembly using Juicer version 1.6.2 (Durand et al., 2016). Read pairs with valid Hi-C contacts were retained by Juicer and further fed into the 3D de novo assembly (3D-DNA) pipeline (Dudchenko et al., 2017). Subsequently, contigs were assembled into pseudo-chromosomes using 3D-DNA based on the Hi-C contact signals. Finally, the chromosome-scale assembly was manually reviewed to rectify any mis-joins using Juicebox version 1.11.08 (https://github.com/aidenlab/Juicebox).

HBenchmarking Universal Single-Copy Orthologs (BUSCO) software v4.0.5 was used to assess the completeness of the genome assembly. The lineage dataset eudicots_odb10 with 2,326 single-copy orthologous genes was used for BUSCO analysis.

Resequencing

Genomic data from Illumina sequencing was filtered using Fastp V0.20.1 with default parameters. Clean reads were mapped to the genome assembly using Burrows-Wheeler Aligner (BWA) v0.7.10-r789 with default parameters. The alignments were sorted based on mapping coordinates, and subsequently converted into BAM format using SAMtools v1.3.1. Unmapped and nonunique reads were filtered out. Duplicated reads were filtered with the Picard package v2.1.1. Reads around indels were realigned with Genome Analysis Toolkit (GATK) v3.3-0-g37228af using the following procedures: package RealignerTargetCreator was used to identify regions where realignment was needed, then package IndelRealigner was used to realign the above regions.

GATK was using to detect potential variations. In brief, variants were called for each sample by the GATK HaplotypeCaller. A joint genotyping step for comprehensive variations union was performed on the GVCF files. VCFtools v0.1.13 was used to filter SNPs with the following parameters “QD<2.0 || FS>60.0 || MQ<40.0 || MQRankSum<-12.5 || ReadPosRankSum<-8.0”. Indels shorter than or equal to 10 bp were filtered with “QD<2.0 || FS>200.0 || ReadPosRankSum<-20.0”. SNPs and Indels that were not biallelic, had >5% missing calls and MAF<0.05 were removed. The identified SNPs and Indels were further annotated with ANNOVAR software and divided into groupings of variations occurring in intergenic regions, coding sequences and introns.

Transcriptome

Raw reads were firstly transformed into clean reads by removing reads with sequencing adaptors, reads with frequency of unknown nucleotides above 5% and low-quality reads (containing more than 50% bases with Q-value≤20) using a custom Perl script. Then, the clean reads were de novo assembled using the Trinity program (k-mer=25, group pairs distance =300) with default parameters.

For functional annotations, the assembled unigenes were blasted against public databases (E-value < 1 × 10-5), including the NCBI non-redundant protein (NR) database1, the SwissProt database, the euKaryotic Ortholog Groups (KOG) database, the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, and the protein family (Pfam) database (version 26.0). Based on NR database annotation, GO unigene annotations were obtained by Blast2GO program. Then, GO functional classification was performed by WEGO software. Finally, CDSs of unigenes were predicted using BLSATX and ESTscan.

The high-quality reads were aligned to the assembled unigenes with the BWA program. An RPKM value was calculated for each unigene in each tissue. The RPKMs of all annotated isoforms for the same gene were summed as the RPKM of that gene. Differential expression of unigenes was calculated with a threshold of P value <0.001 and two-fold change.