2024 Volume 1
Article Contents
About this article
ARTICLE   Open Access    

Genome resequencing reveals an independently originated Camellia sinensis variety – Hainan tea

  • #Authors contributed equally: Dazhong Guo, Dongliang Li

More Information
  • Tea, originating in China over 3,000 years ago, has transitioned from a medicinal herb to a widely consumed beverage. Despite considerable research focusing on tea plants in southwestern China, little attention has been paid to those on Hainan Island. The notable resemblance between Hainan tea and C. sinensis var. assamica, alongside the unique geographical and climatic conditions of Hainan Island, has presented significant challenges for taxonomic and genetic investigations concerning Hainan tea. Our study bridged this gap by collecting 500 samples from Hainan Province and employing whole-genome resequencing to examine interspecific differences between Hainan tea and cultivated varieties. The findings confirmed the distinct taxonomic position of Hainan tea within Camellia sinensis, providing valuable insights for resource conservation and molecular breeding. Furthermore, our methodology offers a framework for investigating the origin, domestication, and genetic diversity of other species native to Hainan Island.
  • Columnar cacti are plants of the Cactaceae family distributed across arid and semi-arid regions of America, with ecological, economic, and cultural value[1]. One trait that makes it possible for the columnar cactus to survive in the desert ecosystem is its thick epidermis covered by a hydrophobic cuticle, which limits water loss in dry conditions[1]. The cuticle is the external layer that covers the non-woody aerial organs of land plants. The careful control of cuticle biosynthesis could produce drought stress tolerance in relevant crop plants[2]. In fleshy fruits, the cuticle maintains adequate water content during fruit development on the plant and reduces water loss in fruit during postharvest[3]. Efforts to elucidate the molecular pathway of cuticle biosynthesis have been carried out for fleshy fruits such as tomato (Solanum lycopersicum)[4], apple (Malus domestica)[5], sweet cherry (Prunus avium)[6], mango (Mangifera indica)[7], and pear (Pyrus 'Yuluxiang')[8].

    The plant cuticle is formed by the two main layers cutin and cuticular waxes[3]. Cutin is composed mainly of oxygenated long-chain (LC) fatty acids (FA), which are synthesized by cytochrome p450 (CYP) enzymes. CYP family 86 subfamily A (CYP86A) enzymes carry out the terminal (ω) oxidation of LC-FA[9]. Then, CYP77A carries out the mid-chain oxidation to synthesize the main cutin monomers. In Arabidopsis, AtCYP77A4 and AtCYP77A6 carry out the synthesis of mid-chain epoxy and mid-chain dihydroxy LC-FA, respectively[10,11]. AtCYP77A6 is required for the cutin biosynthesis and the correct formation of floral surfaces[10]. The expression of CYP77A19 (KF410855) and CYP77A20 (KF410856) from potato (Solanum tuberosum) restored the petal cuticular impermeability in Arabidopsis null mutant cyp77a6-1, tentatively by the synthesis of cutin monomers[12]. In eggplant (Solanum torvum), the over-expression of StoCYP77A2 leads to resistance to Verticillium dahlia infection in tobacco plants[13]. Although the function of CYP77A2 in cutin biosynthesis has not yet been tested, gene expression analysis suggests that CaCYP77A2 (A0A1U8GYB0) could play a role in cutin biosynthesis during pepper fruit development[14].

    It has been hypothesized that the export of cuticle precursors is carried out by ATP binding cassette subfamily G (ABCG) transporters. ABCG11/WBC11, ABCG12, and ABCG13 are required for the load of cuticle lipids in Arabidopsis[1517], but ABCG13 function appears to be specific to the flower epidermis[18]. The overexpression of TsABCG11 (JQ389853) from Thellungiella salsugineum increases cuticle amounts and promotes tolerance to different abiotic stresses in Arabidopsis[19].

    Once exported, the cutin monomers are polymerized on the surface of epidermal cells. CD1 code for a Gly-Asp-Ser-Leu motif lipase/esterase (GDSL) from tomato required for the cutin formation through 2-mono(10,16-dihydroxyhexadecanoyl)glycerol esterification[20]. GDSL1 from tomato carries out the ester bond cross-links of cutin monomers located at the cuticle layers and is required for cuticle deposition in tomato fruits[21]. It has been shown that the transcription factor MIXTA-like reduces water loss in tomato fruits through the positive regulation of the expression of CYP77A2, ABCG11, and GDSL1[22]. Despite the relevant role of cuticles in maintaining cactus homeostasis in desert environments[1], the molecular mechanism of cuticle biosynthesis has yet to be described for cactus fruits.

    Stenocereus thurberi is a columnar cactus endemic from the Sonoran desert (Mexico), which produces an ovoid-globose fleshy fruit named sweet pitaya[23]. In its mature state, the pulp of sweet pitaya contains around 86% water with a high content of antioxidants and natural pigments such as betalains and phenolic compounds, which have nutraceutical and industrial relevance[23]. Due to the arid environment in which pitaya fruit grows, studying its molecular mechanism of cuticle biosynthesis can generate new insights into understanding species' adaptation mechanisms to arid environments. Nevertheless, sequences of transcripts from S. thurberi in public databases are scarce.

    RNA-sequencing technology (RNA-seq) allows the massive generation of almost all the transcripts from non-model plants, even if no complete assembled genome is available[24]. Recent advances in bioinformatic tools has improved our capacity to identify long non-coding RNA (lncRNA), which have been showed to play regulatory roles in relevant biological processes, such as the regulation of drought stress tolerance in plants[25], fruit development, and ripening[2629].

    In this study, RNA-seq data were obtained for the de novo assembly and characterization of the S. thurberi fruit peel transcriptome. As a first approach, three transcripts, StCYP77A, StABCG11, and StGDSL1, tentatively involved in cuticle biosynthesis, were identified and quantified during sweet pitaya fruit development. Due to no gene expression analysis having been carried out yet for S. thurberi, stably expressed constitutive genes were identified for the first time.

    Sweet pitaya fruits (S. thurberi) without physical damage were hand harvested from plants in a native conditions field located at Carbó, Sonora, México. They were collocated in a cooler containing dry ice and transported immediately to the laboratory. The superficial part of the peels (~1 mm deep) was removed carefully from the fruits using a scalpel. Peel samples from three fruits were pooled according to their tentative stage of development defined by their visual characteristics, frozen in liquid nitrogen, and pulverized to create a single biological replicate. Four samples belonging to four different plants were analyzed. All fruits harvested were close to the ripening stage. Samples named M1 and M2 were turning from green to ripe [~35−40 Days After Flowering (DAF)], whereas samples M3 and M4 were turning from ripe to overripe (~40−45 DAF).

    Total RNA was isolated from the peels through the Hot Borate method[30]. The concentration and purity of RNA were determined in a spectrophotometer Nanodrop 2000 (Thermo Fisher) by measuring the 260/280 and 260/230 absorbance ratios. RNA integrity was evaluated through electrophoresis in agarose gel 1% and a Bioanalyzer 2100 (Agilent). Pure RNA was sequenced in the paired-end mode in an Illumina NextSeq 500 platform at the University of Arizona Genetics Core Facility. Four RNA-seq libraries, each of them from each sample, were obtained, which include a total of 288,199,704 short reads with a length of 150 base pairs (bp). The resulting sequence data can be accessed at the Sequence Read Archive (SRA) repository of the NCBI through the BioProject ID PRJNA1030439. Libraries are named corresponding to the names of samples M1, M2, M3, and M4.

    FastQC software (www.bioinformatics.babraham.ac.uk/projects/fastqc) was used for short reads quality analysis. Short reads with poor quality were trimmed or eliminated by Trimmomatic (www.usadellab.org/cms/?page=trimmomatic) with a trailing and leading of 25, a sliding window of 4:25, and a minimum read length of 80 bp. A total of 243,194,888 reads with at least a 25 quality score on the Phred scale were used to carry out the de novo assembly by Trinity (https://github.com/trinityrnaseq/trinityrnaseq/wiki) with the following parameters: minimal k-mer coverage of 1, normalization of 50, and minimal transcript length of 200 bp.

    Removal of contaminating sequences and ribosomal RNA (rRNA) was carried out through SeqClean. To remove redundancy, transcripts with equal or more than 90% of identity were merged through CD-hit (www.bioinformatics.org/cd-hit/). Alignment and quantification in terms of transcripts per million (TPM) were carried out through Bowtie (https://bowtie-bio.sourceforge.net/index.shtml) and RSEM (https://github.com/deweylab/RSEM), respectively. Transcripts showing a low expression (TPM < 0.01) were discarded. Assembly quality was evaluated by calculating the parameters N50 value, mean transcript length, TransRate score, and completeness. The statistics of the transcriptome were determined by TrinityStats and TransRate (https://hibberdlab.com/transrate/). The transcriptome completeness was determined through a BLASTn alignment (E value < 1 × 10−3) by BUSCO (https://busco.ezlab.org/) against the database of conserved orthologous genes from Embryophyte.

    To predict the proteins tentatively coded in the S. thurberi transcriptome, the best homology match of the assembled transcripts was found by alignment to the Swiss-Prot, RefSeq, nr-NCBI, PlantTFDB, iTAK, TAIR, and ITAG databases using the BLAST algorithm with an E value threshold of 1 × 10−10 for the nr-NCBI database and of 1 × 10−5 for the others[3134]. An additional alignment was carried out to the protein databases of commercial fruits Persea americana, Prunus persica, Fragaria vesca, Citrus cinensis, and Vitis vinifera to proteins of the cactus Opuntia streptacantha, and the transcriptomes of the cactus Hylocereus polyrhizus, Pachycereus pringlei, and Selenicereus undatus. The list of all databases and the database websites of commercial fruits and cactus are provided in Supplementary Tables S1 & S2. The open reading frame (ORF) of the transcripts and the protein sequences tentative coded from the sweet pitaya transcriptome was predicted by TransDecoder (https://github.com/TransDecoder/TransDecoder/wiki), considering a minimal ORF length of 75 amino acids (aa). The search for protein domains was carried out by the InterPro database (www.ebi.ac.uk/interpro). Functional categorization was carried out by Blast2GO based on GO terms and KEGG metabolic pathways[35].

    LncRNA were identified based on the methods reported in previous studies[25,29,36]. Transcripts without homology to any protein from Swiss-Prot, RefSeq, nr-NCBI, PlantTFDB, iTAK, TAIR, ITAG, P. americana, P. persica, F. vesca, C. cinensis, V. vinifera, and O. streptacantha databases, without a predicted ORF longer than 75 aa, and without protein domains in the InterPro database were selected to identify tentative lncRNA.

    Transcripts coding for signal peptide or transmembrane helices were identified by SignalP (https://services.healthtech.dtu.dk/services/SignalP-6.0/) and TMHMM (https://services.healthtech.dtu.dk/services/TMHMM-2.0/), respectively, and discarded. Further, transcripts corresponding to other non-coding RNAs (ribosomal RNA and transfer RNA) were identified through Infernal by using the Rfam database[37] and discarded. The remaining transcripts were analyzed by CPC[38], and CPC2[39] to calculate their coding potential. Transcripts with a coding potential score lower than −1 for CPC and a coding probability lower than 0.1 for CPC2 were considered lncRNA. To characterize the identified lncRNA, the length and abundance of coding and lncRNA were calculated. Bowtie and RSEM were used to align and quantify raw counts, respectively. The edgeR package[40] was used to normalize raw count data in terms of counts per million (CPM) for both coding and lncRNA.

    To obtain the transcript's expression, the aligning of short reads and quantifying of transcripts were carried out through Bowtie and RSEM software, respectively. A differential expression analysis was carried out between the four libraries by edgeR package in R Studio. Only the transcripts with a count equal to or higher than 0.5 in at least one sample were retained for the analysis. Transcripts with log2 Fold Change (log2FC) between +1 and −1 and with a False Discovery Rate (FDR) lower than 0.05 were taken as not differentially expressed (NDE).

    For the identification of the tentative reference genes two strategies were carried out as described below: i) The NDE transcripts were aligned by BLASTn (E value < 1 × 10−5) to 43 constitutive genes previously reported in fruits from the cactus H. polyrhizus, S. monacanthus, and S. undatus[4143] to identify possible homologous constitutive genes in S. thurberi. Then, the homologous transcripts with the minimal coefficient of variation (CV) were selected; ii) For all the NDE transcripts, the percentile 95 value of the mean CPM and the percentile 5 value of the CV were used as filters to recover the most stably expressed transcripts, based on previous studies[44]. Finally, transcripts to be tested by quantitative reverse transcription polymerase chain reaction (qRT-PCR) were selected based on their homology and tentative biological function.

    The fruit harvesting was carried out as described above. Sweet pitaya fruit takes about 43 d to ripen, therefore, open flowers were tagged, and fruits with 10, 20, 30, 35, and 40 DAF were collected to cover the pitaya fruit development process (Supplementary Fig. S1). The superficial part of the peels (~1 mm deep) was removed carefully from the fruits using a scalpel. Peel samples from three fruits were pooled according to their stage of development defined by their DAF, frozen in liquid nitrogen, and pulverized to create a single biological replicate. One biological replicate consisted of peels from three fruits belonging to the same plant. Two to three biological replicates were evaluated for each developmental stage. Two technical replicates were analyzed for each biological replicate. RNA extraction, quantification, RNA purity, and RNA integrity analysis were carried out as described above.

    cDNA was synthesized from 100 ng of RNA by QuantiTect Reverse Transcription Kit (QIAGEN). Primers were designed using the PrimerQuest™, UNAFold, and OligoAnalyzer™ tools from Integrated DNA Technologies (www.idtdna.com/pages) and following the method proposed by Thornton & Basu[45]. Transcripts quantification was carried out in a QIAquant 96 5 plex according to the PowerUp™ SYBR™ Green Master Mix protocol (Applied Biosystems), with a first denaturation step for 2 min at 95 °C, followed by 40 cycles of denaturation step at 95 °C for 15 s, annealing and extension steps for 30 s at 60 °C.

    The Cycle threshold (Ct) values obtained from the qRT-PCR were analyzed through the algorithms BestKeeper, geNorm, NormFinder, and the delta Ct method[46]. RefFinder (www.ciidirsinaloa.com.mx/RefFinder-master/) was used to integrate the stability results and to find the most stable expressed transcripts in sweet pitaya fruit peel during development. The pairwise variation value (Vn/Vn + 1) was calculated through the geNorm algorithm in R Studio software[47].

    An alignment of 17 reported cuticle biosynthesis genes from model plants were carried out by BLASTx against the predicted proteins from sweet pitaya. Two additional alignments of 17 charaterized cuticle biosynthesis proteins from model plants against the transcripts and predicted proteins of sweet pitaya were carried out by tBLASTn and BLASTp, respectively. An E value threshold of 1 × 10−5 was used, and the unique best hits were recovered for all three alignments. The sequences of the 17 characterized cuticle biosynthesis genes and proteins from model plants are showed in Supplementary Table S3. The specific parameters and the unique best hits for all the alignments carried out are shown in Supplementary Tables S4S8.

    Cuticle biosynthesis-related transcripts tentatively coding for a cytochrome p450 family 77 subfamily A (CYP77A), a Gly-Asp-Ser-Leu motif lipase/esterase 1 (GDSL1), and an ATP binding cassette transporter subfamily G member 11 (ABCG11) were identified by best bi-directional hit according to the functional annotation described above. Protein-conserved domains, signal peptide, and transmembrane helix were predicted through InterProScan, SignalP 6.0, and TMHMM, respectively. Alignment of the protein sequences to tentative orthologous of other plant species was carried out by the MUSCLE algorithm[48]. A neighbor-joining (NJ) phylogenetic tree with a bootstrap of 1,000 replications was constructed by MEGA11[49].

    Fruit sampling, primer design, RNA extraction, cDNA synthesis, and transcript quantification were performed as described above. Relative expression was calculated according to the 2−ΔΔCᴛ method[50]. The sample corresponding to 10 DAF was used as the calibrator. The transcripts StEF1a, StTUA, StUBQ3, and StEF1a + StTUA were used as normalizer genes.

    Normality was assessed according to the Shapiro-Wilk test. Significant differences in the expression of the cuticle biosynthesis-related transcripts between fruit developmental stages were determined by one-way ANOVA based on a completely randomized sampling design and a Tukey honestly significant difference (HSD) test, considering a p-value < 0.05 as significant. Statistical analysis was carried out through the stats package in R Studio.

    RNA was extracted from the peels of ripe sweet pitaya fruits (S. thurberi) from plants located in the Sonoran Desert, Mexico. Four cDNA libraries were sequenced in an Illumina NextSeq 500 platform at the University of Arizona Genetics Core Facility. A total of 288,199,704 reads with 150 base pairs (bp) in length were sequenced in paired-end mode. After trimming, 243,194,888 (84.38%) cleaned short reads with at least 29 mean quality scores per read in the Phred scale and between 80 to 150 bp in length were obtained to carry out the assembly. After removing contaminating sequences, redundancy, and low-expressed transcripts, the assembly included 174,449 transcripts with an N50 value of 2,110 bp. Table 1 shows the different quality variables of the S. thurberi fruit peel transcriptome. BUSCO score showed that 85.4% are completed transcripts, although out of these, 37.2% were found to be duplicated. The resulting sequence data can be accessed at the SRA repository of the NCBI through the BioProject ID PRJNA1030439.

    Table 1.  Quality metrics of the Stenocereus thurberi fruit peel transcriptome.
    Metric Data
    Total transcripts 174,449
    N50 2,110
    Smallest transcript length (bp) 200
    Largest transcript length (bp) 19,114
    Mean transcript length (bp) 1,198.69
    GC (%) 41.33
    Total assembled bases 209,110,524
    TransRate score 0.05
    BUSCO score (%) C: 85.38 (S:48.22, D:37.16),
    F: 10.69, M: 3.93.
    Values were calculated through the TrinityStats function of Trinity and TransRate software. Completeness analysis was carried out through BUSCO by aligning the transcriptome to the Embryophyte database through BLAST with an E value threshold of 1 × 10−3. Complete (C), single (S), duplicated (D), fragmented (F), missing (M).
     | Show Table
    DownLoad: CSV

    A summary of the homology search in the main public protein database for the S. thurberi transcriptome is shown in Supplementary Table S1. From these databases, the higher homologous transcripts were found in RefSeq with 93,993 (53.87 %). Based on the E value distribution, for 41,685 (44%) and 68,853 (49%) of the hits, it was found a strong homology (E value lower than 1 × 10−50) to proteins in the Swiss-Prot and RefSeq databases, respectively (Supplementary Fig. S2a & b). On the other hand, 56,539 (52.34%) and 99,599 (71.11%) of the matches showed a percentage of identity higher than 60% in the Swiss-Prot and RefSeq databases, respectively (Supplementary Fig. S2c & d).

    Figure 1 shows the homology between transcripts from S. thurberi and proteins of commercial fruits, as well as proteins and transcripts of cacti. Transcripts from S. thurberi homologous to proteins from fruits of commercial interest avocado (P. americana), peach (P. persica), strawberry (F. vesca), orange (C. sinensis), and grapefruit (V. vinifera) ranged from 77,285 (44.30%) to 85,421 (48.96%), with 70,802 transcripts homologous to all the five fruit protein databases (Fig. 1a).

    Transcripts homologous to transcripts or proteins from the cactus dragon fruit (H. polyrhizus), prickly pear cactus (O. streptacantha), Mexican giant cardon (P. pringlei), and pitahaya (S. undatus) ranged from 76,238 (43.70%) to 114,933 (65.88%), with 64,009 transcripts homologous to all the four cactus databases (Fig. 1b). Further, out of the total of transcripts, 44,040 transcripts (25.25%) showed homology only to sequences from cactus, but not for model plants Arabidopsis, tomato, or the commercial fruits included in this study (Fig. 1c).

    Figure 1.  Venn diagram of the homology search results against model plants databases, commercial fruits, and cactus. The number in the diagram corresponds to the number of transcripts from S. thurberi homologous to sequences from that plant species. (a) Homologous to sequences from Fragaria vesca (Fa), Persea americana (Pa), Prunus persica (Pp), Vitis vinifera (Vv), and Citrus sinensis (Cs). (b) Homologous to sequences from Opuntia streptacantha (Of), Selenicereus undatus (Su), Hylocereus polyrhizus (Hp), and Pachycereus pringlei (Pap). (c) Homologous to sequences from Solanum lycopersicum (Sl), Arabidopsis thaliana (At), from the commercial fruits (Fa, Pa, Pp, Vv, and Cs), or the cactus included in this study (Of, Su, Hp, and Pap). Homologous searching was carried out by BLAST alignment (E value < 1 × 10−5). The Venn diagrams were drawn by ggVennDiagram in R Studio.

    A total of 45,970 (26.35%), 58,704 (33.65%), and 48,186 (27.65%) transcripts showed homology to transcription factors, transcriptional regulators, and protein kinases in the PlantTFDB, iTAK-TR, and iTAK-PK databases, respectively (Supplementary Tables S1, S9S11). For the PlantTFDB, the homologous transcripts belong to 57 transcriptional factors (TF) families (Fig. 2 & Supplementary Table S9), from which, the most frequent were the basic-helix-loop-helix (bHLH), myeloblastosis-related (MYB-related), NAM, ATAF, and CUC (NAC), ethylene responsive factor (ERF), and the WRKY domain families (WRKY) (Fig. 2).

    Figure 2.  Transcription factor (TF) families distribution of S. thurberi fruit peel transcriptome. The X-axis indicates the number of transcripts with hits to each TF family. Alignment to the PlantTFDB database by BLASTx was carried out with an E value threshold of 1 × 10−5. The bar graph was drawn by ggplot2 in R Studio.

    Based on the homology found and the functional domain searches, gene ontology terms (GO) were assigned to 68,559 transcripts (Supplementary Table S12). Figure 3 shows the top 20 GO terms assigned to the S. thurberi transcriptome, corresponding to the Biological Processes (BP) and Molecular Function (MF) categories. For BP, organic substance metabolic processes, primary metabolic processes, and cellular metabolic processes showed a higher number of transcripts (Supplementary Table S13). Further, for MF, organic cyclic compound binding, heterocyclic compound binding, and ion binding were the processes with the higher number of transcripts. S. thurberi transcripts were classified into 142 metabolic pathways from the KEGG database (Supplementary Table S14). The pathways with the higher number of transcripts recorded were pyruvate metabolism, glycerophospholipid metabolism, glycolysis/gluconeogenesis, and citrate cycle. Further, among the top 20 KEEG pathways, the cutin, suberin, and wax biosynthesis include more than 30 transcripts (Fig. 4).

    Figure 3.  Top 20 Gene Ontology (GO) terms assigned to the S. thurberi fruit peel transcriptome. Bars indicate the number of transcripts assigned to each GO term. Assignment of GO terms was carried out by Blast2GO with default parameters. BP and MF mean Biological Processes and Molecular Functions GO categories, respectively. The graph was drawn by ggplot2 in R Studio.
    Figure 4.  Top 20 KEGG metabolic pathways distribution in the S. thurberi fruit peel transcriptome. Bars indicate the number of transcripts assigned to each KEGG pathway. Assignment of KEGG pathways was carried out in the Blast2GO suite. The bar graph was drawn by ggplot2 in R Studio.

    Out of the total of transcripts, 43,391 (24.87%) were classified as lncRNA (Supplementary Tables S15 & S16). Figure 5 shows a comparison of the length (Fig. 5a) and expression (Fig. 5b) of lncRNA and coding RNA. Both length and expression values were higher in coding RNA than in lncRNA. In general, coding RNA ranged from 201 to 18,629 bp with a mean length of 1,507.18, whereas lncRNA ranged from 200 to 5,198 bp with a mean length of 481.51 (Fig. 5a). The higher expression values recorded from coding RNA and lncRNA were 12.83 and 9.45 log2(CPM), respectively (Fig. 5b).

    Figure 5.  Comparison of coding RNA and long non-coding RNA (lncRNA) from S. thurberi transcriptome. (a) Box plot of transcript length distribution. The Y-axis indicates the length of each transcript in base pairs. (b) Box plot of expression levels. The Y-axis indicates the log2 of the count per million of reads (log2(CPM)) recorded for each transcript. Expression levels were calculated by the edgeR package in R studio. (a), (b) The lines inside the boxes indicate the median. The higher and lower box limits represent the 75th and 25th percentiles, respectively. The box plots were drawn by ggplot2 in R Studio.

    To identify the transcripts without significant changes in expression between the four RNA-seq libraries, a differential expression analysis was carried out. Of the total of transcripts, 4,980 were not differentially expressed (NDE) at least in one paired comparison between the libraries (Supplementary Tables S17S20). Mean counts per million of reads (CPM) and coefficient of variation (CV)[44] were calculated for these NDE transcripts. Transcripts with a CV value lower than 0.113, corresponding with the percentile 5 of the CV, and a mean CPM higher than 1,138.06, corresponding with the percentile 95 of the mean CPM were used as filters to identify the most stably expressed transcripts (Supplementary Table S21). Based on its homology and its tentative biological function, five transcripts were selected to be tested as tentative reference genes. Besides, three NDE transcripts homologous to previously identified stable expressed reference genes in other species of cactus fruit[4143] were selected (Supplementary Table S22). Homology metrics for the eight tentative reference genes selected are shown in Supplementary Table S23. The primer sequences used to amplify the transcripts by qRT-PCR and their nucleotide sequence are shown in Supplementary Tables S24 & S25, respectively.

    The amplification specificity of the eight candidate reference genes determined by melting curves analysis is shown in Supplementary Fig. S3. For the eight tentative reference transcripts selected, the cycle threshold (Ct) values were recorded during sweet pitaya fruit development by qRT-PCR (Supplementary Table S26). The Ct values obtained ranged from 16.85 to 30.26 (Fig. 6a). Plastidic ATP/ADP-transporter (StTLC1) showed the highest Ct values with a mean of 27.34 (Supplementary Table S26). Polyubiquitin 3 (StUBQ3) showed the lowest Ct values in all five sweet pitaya fruit developmental stages (Fig. 6a).

    Figure 6.  Expression stability analysis of tentative reference genes. (a) Box plot of cycle threshold (Ct) distribution of candidate reference genes during sweet pitaya fruit development (10, 20, 30, 35, and 40 d after flowering). The black line inside the box indicates the median. The higher and lower box limits represent the 75th and 25th percentiles, respectively. (b) Bar chart of the geometric mean (geomean) of ranking values calculated by RefFinder for each tentative reference gene (X-axis). The lowest values indicate the best reference genes. (c) Bar chart of the pairwise variation analysis and determination of the optimal number of reference genes by the geNorm algorithm. A pairwise variation value lower than 0.15 indicates that the use of Vn/Vn + 1 reference genes is reliable for the accurate normalization of qRT-PCR data. The Ct data used in the analysis were calculated by qRT-PCR in a QIAquant 96 5 plex (QIAGEN) according to the manufacturer's protocol. The box plot and the bar graphs were drawn by ggplot2 and Excel programs, respectively. Abbreviations: Actin 7 (StACT7), alpha-tubulin (StTUA), elongation factor 1-alpha (StEF1a), COP1-interactive protein 1 (StCIP1), plasma membrane ATPase 4 (StPMA4), BEL1-like homeodomain protein 1 (StBLH1), polyubiquitin 3 (StUBQ3), and plastidic ATP/ADP-transporter (StTLC1).

    The best stability values calculated by NormFinder were 0.45, 0.51, 0.97, and 0.99, corresponding to the transcripts elongation factor 1-alpha (StEF1a), alpha-tubulin (StTUA), plastidic ATP/ADP-transporter (StTLC1), and actin 7 (StACT7), respectively (Supplementary Table S27). For BestKeeper, the most stable expressed transcripts were StUBQ3, StTUA, and StEF1a, with values of 0.72, 0.75, and 0.87, respectively. In the case of the delta Ct method[51], the transcripts StEF1a, StTUA, and StTLC1 showed the best stability.

    According to geNorm analysis, the most stable expressed transcripts were StTUA, StEF1a, StUBQ3, and StACT7, with values of 0.74, 0.74, 0.82, and 0.96, respectively. All the pairwise variation values (Vn/Vn + 1) were lower than 0.15, ranging from 0.019 for V2/V3 to 0.01 for V6/V7 (Fig. 6c). The V value of 0.019 obtained for V2/V3 indicates that the use of the best two reference genes StTUA and StEF1a is reliable enough for the accurate normalization of qRT-PCR data, therefore no third reference gene is required[47]. Except for BestKeeper analysis, StEF1a and StTUA were the most stable transcripts for all of the methods carried out in this study (Supplementary Table S27). The comprehensive ranking analysis indicates that StEF1a, followed by StTUA and StUBQ3, are the most stable expressed genes and are stable enough to be used as reference genes in qRT-PCR analysis during sweet pitaya fruit development (Fig. 6b).

    Three cuticle biosynthesis-related transcripts TRINITY_DN17030_c0_g1_i2, TRINITY_DN15394_c0_g1_i1, and TRINITY_DN23528_c1_g1_i1 tentatively coding for the enzymes cytochrome p450 family 77 subfamily A (CYP77A), Gly-Asp-Ser-Leu motif lipase/esterase 1 (GDSL1), and an ATP binding cassette transporter subfamily G member 11 (ABCG11/WBC11), respectively, were identified and quantified. The nucleotide sequence and predicted amino acid sequences of the three transcripts are shown in Supplementary File 1. The best homology match for StCYP77A (TRINITY_DN17030_c0_g1_i2) was for AtCYP77A4 (AT5G04660) from Arabidopsis and SmCYP77A2 (P37124) from eggplant (Solanum melongena) in the TAIR and Swiss-Prot databases, respectively (Supplementary Table S23).

    TransDecoder, InterPro, and TMHMM analysis showed that StCYP77A codes a polypeptide of 518 amino acids (aa) in length that comprises a cytochrome P450 E-class domain (IPR002401) and a transmembrane region (residues 10 to 32). The phylogenetic tree constructed showed that StCYP77A is grouped in a cluster with all the CYP77A2 proteins included in this analysis, being closer to CYP77A2 (XP_010694692) from B. vulgaris and Cgig2_012892 (KAJ8441854) from Carnegiea gigantean (Supplementary Fig. S4).

    StGDSL1 (TRINITY_DN15394_c0_g1_i1) alignment showed that it is homologous to a GDSL esterase/lipase from Arabidopsis (Q9LU14) and tomato (Solyc03g121180) (Supplementary Table S23). TransDecoder, InterPro, and SignalP analysis showed that StGDSL1 codes a polypeptide of 354 aa in length that comprises a GDSL lipase/esterase domain IPR001087 and a signal peptide with a cleavage site between position 25 and 26 (Supplementary Fig. S5).

    Supplementary Figure S6 shows the analysis carried out on the predicted amino acid sequence of StABCG11 (TRINITY_DN23528_c1_g1_i1). The phylogenetic tree constructed shows three clades corresponding to the ABCG13, ABCG12, and ABCG11 protein classes with bootstrap support ranging from 40% to 100% (Supplementary Fig. S6a). StABCG11 is grouped with all the ABCG11 transporters included in this study in a well-separated clade, being closely related to its tentative ortholog from C. gigantean Cgig2_004465 (KAJ8441854). InterPro and TMHMM results showed that the StABCG11 sequence contains an ABC-2 type transporter transmembrane domain (IPR013525; PF01061.27) with six transmembrane helices (Supplementary Fig. S6b).

    The predicted protein sequence of StABCG11 is 710 aa in length, holding the ATP binding domain (IPR003439; PF00005.30) and the P-loop containing nucleoside triphosphate hydrolase domain (IPR043926; PF19055.3) of the ABC transporters of the G family. Multiple sequence alignment shows that the Walker A and B motif sequence and the ABC signature[15] are conserved between the ABCG11 transporters from Arabidopsis, tomato, S. thurberi, and C. gigantean (Supplementary Fig. S6c).

    According to the results of the expression stability analysis (Fig. 6), four normalization strategies were tested to quantify the three cuticle biosynthesis-related transcripts during sweet pitaya fruit development. The four strategies consist of normalizing by StEF1a, StTUA, StUBQ3, or StEF1a+StTUA. Primer sequences used to quantify the transcripts StCYP77A (TRINITY_DN17030_c0_g1_i2), StGDSL1 (TRINITY_DN15394_c0_g1_i1), and StABCG11 (TRINITY_DN23528_c1_g1_i1) by qRT-PCR during sweet pitaya fruit development are shown in Supplementary Table S24.

    The three cuticle biosynthesis-related transcripts showed differences in expression during sweet pitaya fruit development (Supplementary Table S28). The same expression pattern was recorded for the three cuticle biosynthesis transcripts when normalization was carried out by StEF1a, StTUA, StUBQ3, or StEF1a + StTUA (Fig. 7). A higher expression of StCYP77A and StGDSL1 are shown at the 10 and 20 DAF, showing a decrease at 30, 35, and 40 DAF. StABCG11 showed a similar behavior, with a higher expression at 10 and 20 DAF and a reduction at 30 and 35 DAF. Nevertheless, unlike StCYP77A and StGDSL1, a significant increase at 40 DAF, reaching the same expression as compared with 10 DAF, is shown for StABCG11 (Fig. 7).

    Figure 7.  Expression analysis of cuticle biosynthesis-related transcripts StCYP77A, StGDSL1, and StABCG11 during sweet pitaya (Stenocereus thurberi) fruit development. Relative expression was calculated through the 2−ΔΔCᴛ method using elongation factor 1-alpha (StEF1a), alpha-tubulin (StTUA), polyubiquitin 3 (StUBQ3), or StEF1a + StTUA as normalizing genes at 10, 20, 30, 35, and 40 d after flowering (DAF). The Y-axis and error bars represent the mean of the relative expression ± standard error (n = 4−6) for each developmental stage in DAF. The Ct data for the analysis was recorded by qRT-PCR in a QIAquant 96 5 plex (QIAGEN) according to the manufacturer's protocol. The graph line was drawn by ggplot2 in R Studio. Abbreviations: cytochrome p450 family 77 subfamily A (StCYP77A), Gly-Asp-Ser-Leu motif lipase/esterase 1 (StGDSL1), and ATP binding cassette transporter subfamily G member 11 (StABCG11).

    Characteristics of a well-assembled transcriptome include an N50 value closer to 2,000 bp, a high percentage of conserved transcripts completely assembled (> 80%), and a high proportion of reads mapping back to the assembled transcripts[52]. In the present study, the first collection of 174,449 transcripts from S. thurberi fruit peel are reported. The generated transcriptome showed an N50 value of 2,110 bp, a TransRate score of 0.05, and a GC percentage of 41.33 (Table 1), similar to that reported for other de novo plant transcriptome assemblies[53]. According to BUSCO, 85.4% of the orthologous genes from the Embryophyta databases completely matched the S. thurberi transcriptome, and only 3.9% were missing (Table 1). These results show that the S. thurberi transcriptome generated is not fragmented, and it is helpful in predicting the sequence of almost all the transcripts expressed in sweet pitaya fruit peel[24].

    The percentage of transcripts homologous found, E values, and identity distribution (Supplementary Tables S1 & S2; Supplementary Fig. S2) were similar to that reported in the de novo transcriptome assembly for non-model plants and other cactus fruits[4143,54] and further suggests that the transcriptome assembled of S. thurberi peel is robust[52]. Of the total of transcripts, 70,802 were common to all the five commercial fruit protein databases included in this study, which is helpful for the search for conserved orthologous involved in fruit development and ripening (Fig. 2a). A total of 34,513 transcripts (20%) show homology only to sequences in the cactus's databases, but not in the others (Supplementary Tables S1 & S2; Fig. 1c). This could suggest that a significant conservation of sequences among plants of the Cactaceae family exists which most likely are to have a function in this species adaptation to desert ecosystems.

    To infer the biological functionality represented by the S. thurberi fruit peel transcriptome, gene ontology (GO) terms and KEGG pathways were assigned. Of the main metabolic pathways assigned, 'glycerolipid metabolism' and 'cutin, suberine, and wax biosynthesis' suggests an active cuticle biosynthesis in pitaya fruit peel (Fig. 4). In agreement with the above, the main GO terms assigned for the molecular function (MF) category were 'organic cyclic compound binding', 'transmembrane transporter activity', and 'lipid binding' (Fig. 3). For the biological processes (BP) category, the critical GO terms for the present research are 'cellular response to stimulus', 'response to stress', 'anatomical structure development', and 'transmembrane transport', which could suggest the active development of the fruit epidermis and cuticle biosynthesis for protection to stress.

    The most frequent transcription factors (TF) families found in S. thurberi transcriptome were NAC, WRKY, bHLH, ERF, and MYB-related (Fig. 2), which had been reported to play a function in the tolerance to abiotic stress in plants[55,56]. Although the role of NAC, WRKY, bHLH, ERF, and MYB TF in improving drought tolerance in relevant crop plants has been widely documented[57,58], their contribution to the adaptation of cactus to arid ecosystems has not yet been elucidated and further experimental pieces of evidence are needed.

    It has been reported that the heterologous expression of ERF TF from Medicago truncatula induces drought tolerance and cuticle wax biosynthesis in Arabidopsis leaf[59]. In tomato fruits, the gene SlMIXTA-like which encodes a MYB transcription factor avoids water loss through the positive regulation of genes related to the biosynthesis and transport of cuticle compounds[22]. Despite the relevant role of cuticles in maintaining cactus physiology in desert environments, experimental evidence showing the role of the different TF-inducing cuticle biosynthesis has yet to be reported for cactus fruits.

    Out of the transcripts, 43,391 were classified as lncRNA (Supplementary Tables S15 & S16). This is the first report of lncRNA identification for the species S. thurberi. In fruits, 3,679 lncRNA has been identified from tomato[26], 3,330 from peach (P. persica)[29], 3,857 from melon (Cucumis melo)[28], 2,505 from hot pepper (Capsicum annuum)[27], and 3,194 from pomegranate (Punica granatum)[36]. Despite the stringent criteria to classify the lncRNA of sweet pitaya fruit (S. thurberi), a higher number of lncRNAs are shown when compared with previous reports. This finding is most likely due to the higher level of redundancy found during the transcriptome analysis. To reduce this redundancy, further efforts to achieve the complete genome assembly of S. thurberi are needed.

    Previous studies showed that lncRNA is shorter and has lower expression levels than coding RNA[6062]. In agreement with those findings, both the length and expression values of lncRNA from S. thurberi were lower than coding RNA (Fig. 5). It has been suggested that lncRNA could be involved in the biosynthesis of cuticle components in cabbage[61] and pomegranate[36] and that they could be involved in the tolerance to water deficit through the regulation of cuticle biosynthesis in wild banana[60]. Nevertheless, the molecular mechanism by which lncRNA may regulate the cuticle biosynthesis in S. thurberi fruits has not yet been elucidated.

    A relatively constant level of expression characterizes housekeeping genes because they are involved in essential cellular functions. These genes are not induced under specific conditions such as biotic or abiotic stress. Because of this, they are very useful as internal reference genes for qRT-PCR data normalization[63]. Nevertheless, their expression could change depending on plant species, developmental stages, and experimental conditions[64]. Reliable reference genes for a specific experiment in a given species must be identified to carry out an accurate qRT-PCR data normalization[63]. An initial screening of the transcript expression pattern through RNA-seq improves the identification of stably expressed transcripts by qRT-PCR[44,64].

    Identification of stable expressed reference transcripts during fruit development has been carried out in blueberry (Vaccinium bracteatum)[65], kiwifruit (Actinidia chinensis)[66], peach (P. persica)[67], apple (Malus domestica)[68], and soursop (Annona muricata)[69]. These studies include the expression stability analysis through geNorm, NormFinder, and BestKeeper algorithms[68,69], some of which are supported in RNA-seq data[65,66]. Improvement of expression stability analysis by RNA-seq had led to the identification of non-previously reported reference genes with a more stable expression during fruit development than commonly known housekeeping genes in grapevine (V. vinifera)[44], pear (Pyrus pyrifolia and P. calleryana)[64], and pepper (C. annuum)[70].

    For fruits of the Cactaceae family, only a few studies identifying reliable reference genes have been reported[4143]. Mainly because gene expression analysis has not been carried out previously for sweet pitaya (S. thurberi), the RNA-seq data generated in this work along with geNorm, NormFinder, BestKeeper, and RefFinder algorithms were used to identify reliable reference genes. The comprehensive ranking analysis showed that out of the eight candidate genes tested, StEF1a followed by StTUA and StUBQ3 were the most stable (Fig. 6b). All the pairwise variation values (Vn/Vn + 1) were lower than 0.15 (Fig. 6c), which indicates that StEF1a, StTUA, and StUBQ3 alone or the use of StEF1a and StTUA together are reliable enough to normalize the gene expression data generated by qRT-PCR.

    The genes StEF1a, StTUA, and StUBQ3 are homologous to transcripts found in the cactus species known as dragonfruit (Hylocereus monacanthus and H. undatus)[41], which have been tested as tentative reference genes during fruit development. EF1a has been proposed as a reliable reference gene in the analysis of changes in gene expression of dragon fruit (H. monacanthus and H. undatus)[41], peach (P. persica)[67], apple (M. domestica)[68], and soursop (A. muricata)[69]. According to the expression stability analysis carried out in the present study (Fig. 6) four normalization strategies were designed. The same gene expression pattern was recorded for the three target transcripts evaluated when normalization was carried out by the genes StEF1a, StTUA, StUBQ3, or StEF1a + StTUA (Fig. 7). Further, these data indicates that these reference genes are reliable enough to be used in qRT-PCR experiments during fruit development of S. thurberi.

    The plant cuticle is formed by two main layers: the cutin, composed mainly of mid-chain oxygenated LC fatty acids, and the cuticular wax, composed mainly of very long-chain (VLC) fatty acids, and their derivates VLC alkanes, VLC primary alcohols, VLC ketones, VLC aldehydes, and VLC esters[3]. In Arabidopsis CYP77A4 and CYP77A6 catalyze the synthesis of midchain epoxy and hydroxy ω-OH long-chain fatty acids, respectively[10,11], which are the main components of fleshy fruit cuticle[3].

    The functional domain search carried out in the present study showed that StCYP77A comprises a cytochrome P450 E-class domain (IPR002401) and a membrane-spanning region from residues 10 to 32 (Supplementary Fig. S4). This membrane-spanning region has been previously characterized in CYP77A enzymes from A. thaliana and Brassica napus[11,71]. It suggests that the protein coded by StCYP77A could catalyze the oxidation of fatty acids embedded in the endoplasmic reticulum membrane of the epidermal cells of S. thurberi fruit. Phylogenetic analysis showed that StCYP77A was closer to proteins from its phylogenetic-related species B. vulgaris (BvCYP772; XP_010694692) and C. gigantea (Cgig2_012892) (Supplementary Fig. S4). StCYP77A, BvCYP77A2, and Cgig2_012892 were closer to SlCYP77A2 and SmCYP77A2 than to CYP77A4 and CYP77A6 proteins, suggesting that StCYP77A (TRINITY_DN17030_c0_g1_i2) could correspond to a CYP77A2 protein.

    Five CYP77A are present in the Arabidopsis genome, named CYP77A4, CYP77A5, CYP77A6, CYP77A7, and CYP77A9, but their role in cuticle biosynthesis has only been reported for CYP77A4 and CYP77A6[72]. It has been suggested that CYP77A2 from eggplant (S. torvum) could contribute to the defense against fungal phytopathogen infection by the synthesis of specific compounds[13]. In pepper fruit (C. annuum), the expression pattern of CYP77A2 (A0A1U8GYB0) and ABCG11 (LOC107862760) suggests a role of CYP77A2 and ABCG11 in cutin biosynthesis at the early stages of pepper fruit development[14].

    In the case of the protein encoded by StGDSL1 (354 aa), the length found in this work is similar to the length of its homologous from Arabidopsis (AT3G16370) and tomato (Solyc03g121180) (Supplementary Fig. S5). A GDSL1 protein named CD1 polymerizes midchain oxygenated ω-OH long-chain fatty acids to form the cutin polyester in the extracellular space of tomato fruit peel[20,21]. It has been suggested that the 25-amino acid N-signal peptide found in StGDSL1 (Supplementary Fig. S5), previously reported in GDSL1 from Arabidopsis, B. napus, and tomato, plays a role during the protein exportation to the extracellular space[21,73].

    A higher expression of StCYP77A, StGDSL1, and StABCG11 is shown at the 10 and 20 DAF of sweet pitaya fruit development (Fig. 7), suggesting the active cuticle biosynthesis at the early stages of sweet pitaya fruit development. In agreement with that, two genes coding for GDSL lipase/hydrolases from tomato named SGN-U583101 and SGN-U579520 are highly expressed in the early stages and during the expansion stages of tomato fruit development, but their expression decreases in later stages[74]. It has been shown that the expression of GDSL genes, like CD1 from tomato, is higher in growing fruit[20,21]. Like tomato, the increase in expression of StCYP77A and StGDSL1 shown in pitaya fruit development could be due to an increase in cuticle deposition during the expansion of the fruit epidermis[20].

    The phylogenetic analysis, the functional domains, and the six transmembrane helices found in the StABCG11 predicted protein (Supplementary Fig. S6), suggests that it is an ABCG plasma membrane transporter of sweet pitaya[15]. Indeed, an increased expression of StABCG11 at 40 DAF was recorded in the present study (Fig. 7). Further, this data strongly suggests that it could be playing a relevant role in the transport of cuticle components at the beginning and during sweet pitaya fruit ripening.

    In Arabidopsis, ABCG11 (WBC11) exports cuticular wax and cutin compounds from the plasma membrane[15,75]. It has been reported that a high expression of the ABC plasma membrane transporter from mango MiWBC11 correlates with a higher cuticle deposition during fruit development[7]. The expression pattern for StABCG11, StCYP77A, and StGDSL1 suggests a role of StABCG11 as a cutin compound transporter in the earlier stages of sweet pitaya fruit development (Fig. 7). Further, its increase at 40 DAF suggests that it could be transporting cuticle compounds other than oxygenated long-chain fatty acids, or long-chain fatty acids that are not synthesized by StCYP77A and StGDSL1 in the later stages of fruit development.

    Like sweet pitaya, during sweet cherry fruit (Prunus avium) development, the expression of PaWCB11, homologous to AtABCG11 (AT1G17840), increases at the earlier stages of fruit development decreases at the intermediate stages, and increases again at the later stages[76]. PaWCB11 expression correlated with cuticle membrane deposition at the earlier and intermediate stages of sweet cherry fruit development but not at the later[76]. The increased expression of StABCG11 found in the present study could be due to the increased transport of cuticular wax compounds, such as VLC fatty acids and their derivates, in the later stages of sweet pitaya development[15,75].

    Cuticular waxes make up the smallest amount of the fruit cuticle. Even so, they mainly contribute to the impermeability of the fruit's epidermis[3]. An increase in the transport of cuticular waxes at the beginning of the ripening stage carried out by ABCG transporters could be due to a greater need to avoid water loss and to maintain an adequate amount of water during the ripening of the sweet pitaya fruit. Nevertheless, further expression analysis of cuticular wax biosynthesis-related genes, complemented with chemical composition analysis of cuticles could contribute to elucidating the molecular mechanism of cuticle biosynthesis in cacti and their physiological contribution during fruit development.

    In this study, the transcriptome of the sweet pitaya (S. thurberi) fruit peel was assembled for the first time. The reference genes found here are a helpful tool for further gene expression analysis in sweet pitaya fruit. Transcripts tentatively involved in cuticle compound biosynthesis and transport are reported for the first time in sweet pitaya. The results suggest a relevant role of cuticle compound biosynthesis and transport at the early and later stages of fruit development. The information generated will help to improve the elucidation of the molecular mechanism of cuticle biosynthesis in S. thurberi and other cactus species in the future. Understanding the cuticle's physiological function in the adaptation of the Cactaceae family to harsh environmental conditions could help design strategies to increase the resistance of other species to face the increase in water scarcity for agricultural production predicted for the following years.

    The authors confirm contribution to the paper as follows: study conception and design: Tiznado-Hernández ME, Tafolla-Arellano JC, García-Coronado H, Hernández-Oñate MÁ; data collection: Tiznado-Hernández ME, Tafolla-Arellano JC, García-Coronado H, Hernández-Oñate MÁ; analysis and interpretation of results: Tiznado-Hernández ME, García-Coronado H, Hernández-Oñate MÁ, Burgara-Estrella AJ; draft manuscript preparation: Tiznado-Hernández ME, García-Coronado H. All authors reviewed the results and approved the final version of the manuscript.

    All data generated or analyzed during this study are included in this published article and its supplementary information files. The sequence data can be accessed at the Sequence Read Archive (SRA) repository of the NCBI through the BioProject ID PRJNA1030439.

    The authors wish to acknowledge the financial support of Consejo Nacional de Humanidades, Ciencias y Tecnologías de México (CONAHCYT) through project number 579: Elucidación del Mecanismo Molecular de Biosíntesis de Cutícula Utilizando como Modelo Frutas Tropicales. We appreciate the University of Arizona Genetics Core and Illumina for providing reagents and equipment for library sequencing. The author, Heriberto García-Coronado (CVU 490952), thanks the CONAHCYT (acronym in Spanish) for the Ph.D. scholarship assigned (749341). The author, Heriberto García-Coronado, thanks Dr. Edmundo Domínguez-Rosas for the technical support in bioinformatics for identifying long non-coding RNA.

  • The authors declare that they have no conflict of interest.

  • Supplemental Table S1 The relevant information of 500 Hainan tea samples used in this study.
    Supplemental Table S2 Quantity table of SNPs after different hard filter.
    Supplemental Table S3 Supplementary Table 3. The information used in this study includes global Assamonia, global Sinensis, Other, Bloom Camellia, Oil seed, and Wild samples.
    Supplemental Table S4 f3 Statistical analysis results.
    Supplemental Table S5 D Statistical analysis results.
    Supplemental Table S6 Hainan tea samples used in KING software analysis of genetic relationships.
    Supplemental Fig. S1 The density plot of the sample's mapping rate, X-coordinates indicate the Mapping rate and Y-coordinates indicate the density.
    Supplemental Fig. S2 Density distribution plot of SNPs hard filter.
    Supplemental Fig. S3 Group rootless trees constructed by OLMS, LMS, global Assamica and global Sinensis. KM6 (C. cuspidata) was selected as the outgroup.
    Supplemental Fig. S4 Figure 3A cross-validation error. The x-axis represents the K value, and the y-axis represents the cross-validation error. The dots in the figure show K = 8 with the smallest cross-validation error.
    Supplemental Fig. S5 Principal Component Analysis of Hainan Tea with global Assamica, global Sinensis, wild Camellia sinensis, Bloom Camellia, Oil seed Camellia and other Camellia. Principal Components 1 and 3. The analysis focused on principal components 1 and 3. Each sample is represented by a small circle, with the color of each circle indicating the corresponding group as depicted in the legend on the right side.
    Supplemental Fig. S6 (A,B) Optimal m-value diagrams for OptM judgment. The horizontal axis is the m value, and the vertical axis is the likelihood value. The optimal m value in the figure is 1, which means that the historical relationship of samples can be best explained by 1 migration.
  • [1]

    Xia EH, Zhang HB, Sheng J, Li K, Zhang QJ, et al. 2017. The tea tree genome provides insights into tea flavor and independent evolution of caffeine biosynthesis. Molecular Plant 10:866−77

    doi: 10.1016/j.molp.2017.04.002

    CrossRef   Google Scholar

    [2]

    Wei C, Yang H, Wang S, Zhao J, Liu C, et al. 2018. Draft genome sequence of Camellia sinensis var. sinensis provides insights into the evolution of the tea genome and tea quality. Proceedings of the National Academy of Sciences of the United States of America 115:E4151−E4158

    doi: 10.1073/pnas.1719622115

    CrossRef   Google Scholar

    [3]

    Henry BC. 1886. Ling-Nam: or, interior views of southern China, including explorations in the hitherto untraversed island of Hainan. London: SW Partridge. 511 pp.

    [4]

    Wambulwa MC, Meegahakumbura MK, Kamunya S, Wachira FN. 2021. From the wild to the cup: tracking footprints of the tea species in time and space. Frontiers in Nutrition 8:706770

    doi: 10.3389/fnut.2021.706770

    CrossRef   Google Scholar

    [5]

    Li MM, Meegahakumbura MK, Wambulwa MC, Burgess KS, Möller M, et al. 2023. Genetic analyses of ancient tea trees provide insights into the breeding history and dissemination of Chinese Assam tea (Camellia sinensis var. assamica). Plant Diversity 46:229−37

    doi: 10.1016/j.pld.2023.06.002

    CrossRef   Google Scholar

    [6]

    Zhang W, Zhang Y, Qiu H, Guo Y, Wan H, et al. 2020. Genome assembly of wild tea tree DASZ reveals pedigree and selection history of tea varieties. Nature Communications 11:3719

    doi: 10.1038/s41467-020-17498-6

    CrossRef   Google Scholar

    [7]

    Zhang X, Chen S, Shi L, Gong D, Zhang S, et al. 2021. Haplotype-resolved genome assembly provides insights into evolutionary history of the tea plant Camellia sinensis. Nature Genetics 53:1250−59

    doi: 10.1038/s41588-021-00895-y

    CrossRef   Google Scholar

    [8]

    Wang X, Feng H, Chang Y, Ma C, Wang L, et al. 2020. Population sequencing enhances understanding of tea plant evolution. Nature Communications 11:4447

    doi: 10.1038/s41467-020-18228-8

    CrossRef   Google Scholar

    [9]

    Zhou Y, He W, He Y, Chen Q, Gao Y, et al. 2023. Formation of 8-hydroxylinalool in tea plant Camellia sinensis var. Assamica ‘Hainan dayezhong’. Food Chemistry: Molecular Sciences 6:100173

    doi: 10.1016/j.fochms.2023.100173

    CrossRef   Google Scholar

    [10]

    Huang H, Shi C, Liu Y, Mao SY, Gao LZ. 2014. Thirteen Camellia chloroplast genome sequences determined by high-throughput sequencing: genome structure and phylogenetic relationships. BMC Evolutionary Biology 14:151

    doi: 10.1186/1471-2148-14-151

    CrossRef   Google Scholar

    [11]

    Whittaker RJ, Fernández-Palacios JM, Matthews TJ, Borregaard MK, Triantis KA. 2017. Island biogeography: Taking the long view of nature’s laboratories. Science 357:eaam8326

    doi: 10.1126/science.aam8326

    CrossRef   Google Scholar

    [12]

    Zhou M, Liu J, Liang Y, Li D. 2017. Distribution of Holttumochloa (Poaceae: Bambusoideae) in China with description of a new species revealed by morphological and molecular evidence. Plant Diversity 39:135−39

    doi: 10.1016/j.pld.2017.05.001

    CrossRef   Google Scholar

    [13]

    Tian X, Wang Q, Zhou Y. 2018. Euphorbia Section Hainanensis (Euphorbiaceae), a New Section Endemic to the Hainan Island of China From Biogeographical, Karyological, and Phenotypical Evidence. Frontiers in Plant Scienc 9:660

    doi: 10.3389/fpls.2018.00660

    CrossRef   Google Scholar

    [14]

    Wang XH, Li J, Zhang LM, He ZW, Mei QM, et al. 2019. Population Differentiation and Demographic History of the Cycas taiwaniana Complex (Cycadaceae) Endemic to South China as Indicated by DNA Sequences and Microsatellite Markers. Frontiers in Genetics 10:1238

    doi: 10.3389/fgene.2019.01238

    CrossRef   Google Scholar

    [15]

    Li X, Shen Z, Ma C, Yang L, Duan S, et al. 2023. Teabase: A comprehensive omics database of Camellia. Plant Communications 4:100664

    doi: 10.1016/j.xplc.2023.100664

    CrossRef   Google Scholar

    [16]

    Jiang H, Long W, Zhang H, Mi C, Zhou T, et al. 2019. Genetic diversity and genetic structure of Decalobanthus boisianus in Hainan Island, China. Ecology and Evolution 9:5362−71

    doi: 10.1002/ece3.5127

    CrossRef   Google Scholar

    [17]

    Darwin C. 1859. On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. London: John Murray. https://doi.org/10.5962/bhl.title.87938

    [18]

    Lussu M, Marignani M, Lai R, Loi MC, Cogoni A, et al. 2020. A Synopsis of Sardinian Studies: Why Is it Important to Work on Island Orchids? Plants 9:853

    doi: 10.3390/plants9070853

    CrossRef   Google Scholar

    [19]

    Nazir MF, He S, Ahmed H, Sarfraz Z, Jia Y, et al. 2021. Genomic insight into the divergence and adaptive potential of a forgotten landrace G. hirsutum L. purpurascens. Journal of Genetics and Genomics 48:473−84

    doi: 10.1016/j.jgg.2021.04.009

    CrossRef   Google Scholar

    [20]

    Lynch M, Ackerman MS, Gout JF, Long H, Sung W, et al. 2016. Genetic drift, selection and the evolution of the mutation rate. Nature Reviews Genetics 17:704−14

    doi: 10.1038/nrg.2016.104

    CrossRef   Google Scholar

    [21]

    Su H, Qu LJ, He K, Zhang Z, Wang J, et al. 2003. The Great Wall of China: a physical barrier to gene flow? Heredity 90:212−19

    doi: 10.1038/sj.hdy.6800237

    CrossRef   Google Scholar

    [22]

    Wu LX, Xu HY, Jian SG, Gong X, Feng XY. 2022. Geographic factors and climatic fluctuation drive the genetic structure and demographic history of Cycas taiwaniana (Cycadaceae), an endemic endangered species to Hainan Island in China. Ecology and Evolution 12:e9508

    doi: 10.1002/ece3.9508

    CrossRef   Google Scholar

    [23]

    Wang N, Liang B, Wang J, Yeh CF, Liu Y, et al. 2016. Incipient speciation with gene flow on a continental island: Species delimitation of the Hainan Hwamei (Leucodioptron canorum owstoni, Passeriformes, Aves). Molecular Phylogenetics and Evolution 102:62−73

    doi: 10.1016/j.ympev.2016.05.022

    CrossRef   Google Scholar

    [24]

    Wang C, Ma X, Ren M, Tang L. 2020. Genetic diversity and population structure in the endangered tree Hopea hainanensis (Dipterocarpaceae) on Hainan Island, China. PLoS One 15:e0241452

    doi: 10.1371/journal.pone.0241452

    CrossRef   Google Scholar

    [25]

    Gu S, Yan YR, Yi MR, Luo ZS, Wen H, et al. 2022. Genetic pattern and demographic history of cutlassfish (Trichiurus nanhaiensis) in South China Sea by the influence of Pleistocene climatic oscillations. Scientific Reports 12:14716

    doi: 10.1038/s41598-022-18861-x

    CrossRef   Google Scholar

    [26]

    Amos W, Harwood J. 1998. Factors affecting levels of genetic diversity in natural populations. Philosophical Transactions of the Royal Society of London Series B, Biological Sciences 353:177−86

    doi: 10.1098/rstb.1998.0200

    CrossRef   Google Scholar

    [27]

    Kremen C, Merenlender AM. 2018. Landscapes that work for biodiversity and people. Science 362:eaau6020

    doi: 10.1126/science.aau6020

    CrossRef   Google Scholar

    [28]

    Goodall-Copestake WP, Tarling GA, Murphy EJ. 2012. On the comparison of population-level estimates of haplotype and nucleotide diversity: a case study using the gene cox1 in animals. Heredity 109:50−6

    doi: 10.1038/hdy.2012.12

    CrossRef   Google Scholar

    [29]

    Salgotra RK, Chauhan BS. 2023. Genetic diversity, conservation, and utilization of plant genetic resources. Genes 14:174

    doi: 10.3390/genes14010174

    CrossRef   Google Scholar

    [30]

    Warschefsky E, Penmetsa RV, Cook DR, von Wettberg EJB. 2014. Back to the wilds: tapping evolutionary adaptations for resilient crops through systematic hybridization with crop wild relatives. American Journal of Botany 101:1791−800

    doi: 10.3732/ajb.1400116

    CrossRef   Google Scholar

    [31]

    Niu S, Song Q, Koiwa H, Qiao D, Zhao D, et al. 2019. Genetic diversity, linkage disequilibrium, and population structure analysis of the tea plant (Camellia sinensis) from an origin center, Guizhou plateau, using genome-wide SNPs developed by genotyping-by-sequencing. BMC Plant Biology 19:328

    doi: 10.1186/s12870-019-1917-5

    CrossRef   Google Scholar

    [32]

    Chen S, Zhou Y, Chen Y, Gu J. 2018. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34:i884−i890

    doi: 10.1093/bioinformatics/bty560

    CrossRef   Google Scholar

    [33]

    Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754−60

    doi: 10.1093/bioinformatics/btp324

    CrossRef   Google Scholar

    [34]

    Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078−9

    doi: 10.1093/bioinformatics/btp352

    CrossRef   Google Scholar

    [35]

    McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, et al. 2010. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20:1297−303

    doi: 10.1101/gr.107524.110

    CrossRef   Google Scholar

    [36]

    Wang K, Li M, Hakonarson H. 2010. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research 38:e164

    doi: 10.1093/nar/gkq603

    CrossRef   Google Scholar

    [37]

    Lee T-H, Guo H, Wang X, Kim C, Paterson AHJBg. 2014. SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics 15:162

    doi: 10.1186/1471-2164-15-162

    CrossRef   Google Scholar

    [38]

    Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, et al. 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics 81:559−75

    doi: 10.1086/519795

    CrossRef   Google Scholar

    [39]

    Yang J, Lee SH, Goddard ME, Visscher PM. 2011. GCTA: a tool for genome-wide complex trait analysis. American Journal of Human Genetics 88:76−82

    doi: 10.1016/j.ajhg.2010.11.011

    CrossRef   Google Scholar

    [40]

    Alexander DH, Novembre J, Lange K. 2009. Fast model-based estimation of ancestry in unrelated individuals. Genome Research 19:1655−64

    doi: 10.1101/gr.094052.109

    CrossRef   Google Scholar

    [41]

    Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, et al. 2010. Robust relationship inference in genome-wide association studies. Bioinformatics 26:2867−73

    doi: 10.1093/bioinformatics/btq559

    CrossRef   Google Scholar

    [42]

    Pickrell JK, Pritchard JK. 2012. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genetics 8:e1002967

    doi: 10.1371/journal.pgen.1002967

    CrossRef   Google Scholar

    [43]

    Fitak RR. 2021. OptM: estimating the optimal number of migration edges on population trees using Treemix. Biology Methods and Protocols 6:bpab017

    doi: 10.1093/biomethods/bpab017

    CrossRef   Google Scholar

    [44]

    Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, et al. 2012. Ancient admixture in human history. Genetics 192:1065−93

    doi: 10.1534/genetics.112.145037

    CrossRef   Google Scholar

    [45]

    Malinsky M, Matschiner M, Svardal H. 2021. Dsuite - Fast D-statistics and related admixture evidence from VCF files. Molecular Ecology Resources 21:584−95

    doi: 10.1111/1755-0998.13265

    CrossRef   Google Scholar

    [46]

    Schrempf D, Minh BQ, De Maio N, von Haeseler A, Kosiol C. 2016. Reversible polymorphism-aware phylogenetic models and their application to tree inference. Journal of Theoretical Biology 407:362−70

    doi: 10.1016/j.jtbi.2016.07.042

    CrossRef   Google Scholar

    [47]

    Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution 32:268−74

    doi: 10.1093/molbev/msu300

    CrossRef   Google Scholar

  • Cite this article

    Guo D, Li D, Wang Z, Li D, Zhou Y, et al. 2024. Genome resequencing reveals an independently originated Camellia sinensis variety – Hainan tea. Agrobiodiversity 1(1): 3−12 doi: 10.48130/abd-0024-0003
    Guo D, Li D, Wang Z, Li D, Zhou Y, et al. 2024. Genome resequencing reveals an independently originated Camellia sinensis variety – Hainan tea. Agrobiodiversity 1(1): 3−12 doi: 10.48130/abd-0024-0003

Figures(4)  /  Tables(3)

Article Metrics

Article views(3487) PDF downloads(635)

ARTICLE   Open Access    

Genome resequencing reveals an independently originated Camellia sinensis variety – Hainan tea

Agrobiodiversity  1 2024, 1(1): 3−12  |  Cite this article

Abstract: Tea, originating in China over 3,000 years ago, has transitioned from a medicinal herb to a widely consumed beverage. Despite considerable research focusing on tea plants in southwestern China, little attention has been paid to those on Hainan Island. The notable resemblance between Hainan tea and C. sinensis var. assamica, alongside the unique geographical and climatic conditions of Hainan Island, has presented significant challenges for taxonomic and genetic investigations concerning Hainan tea. Our study bridged this gap by collecting 500 samples from Hainan Province and employing whole-genome resequencing to examine interspecific differences between Hainan tea and cultivated varieties. The findings confirmed the distinct taxonomic position of Hainan tea within Camellia sinensis, providing valuable insights for resource conservation and molecular breeding. Furthermore, our methodology offers a framework for investigating the origin, domestication, and genetic diversity of other species native to Hainan Island.

    • Tea (Camellia sinensis (L.) O. Kuntze) stands as China’s earliest documented tree crop, boasting a domestication history spanning over 3,000 years. Initially employed as a medicinal herb with roots dating back nearly 5,000 years, it later evolved into a beverage widely embraced for consumption[1]. On a global scale, cultivated tea plants are classified into two primary groups: C. sinensis var. sinensis (CSS) and C. sinensis var. assamica (CSA)[2].

      Hainan Island, positioned in the northern part of the South China Sea, has a rich history of tea plant cultivation and extensive planting areas. There were reports of the abundant tea plant resources on Hainan Island at the end of the Qing Dynasty. For instance, the American missionary and botanist Benjamin Couch Henry uncovered a significant number of wild tea trees during his extensive exploration of the Li ethnic group area in Hainan, confirming the abundance of ancient tea tree resources on the island[3]. As the Yunnan-Guizhou Plateau is widely recognized as a potential geographical origin of tea[46], most studies on tea plant population genomics encompass samples from southwestern China, particularly CSA varieties[1, 68], leaving research on tea plants in Hainan Island relatively sparse. The self-incompatibility of tea trees results in high offspring heterozygosity, and the abundant wild tea plant germplasm on the island provides a wealth of genetic variation, laying the groundwork for cultivating new varieties with desirable traits[7]. Despite Hainan Island’s abundance of tea resources, fully comprehending the genetic resources of tea plants there poses a challenge due to its unique climate and geographical environment. Hence, a genome-wide investigation into the genetic diversity of Hainan tea is imperative for a comprehensive understanding of the genetic resource background of Hainan tea.

      It is noteworthy that the tea plant species on the island closely resembles CSA and is referred to as ‘Hainan dayezhong’[9]. However, evidence is insufficient to conclusively determine whether the Hainan dayezhong belongs to CSA or not. The classification of Hainan tea presents a significant challenge for several reasons: Firstly, C. sinensis plants are prone to hybridization between different species, posing a challenge in accurately classifying various hybrid progenies. Secondly, numerous morphological characteristics of tea plants resemble each other, complicating precise taxonomic delineations[10]. Lastly, traditional classification of tea plants primarily relies on morphological characteristics, which may sometimes conflict with the latest molecular-based classification results[8]. Despite Hainan tea’s identification by the National Crop Variety Approval Committee in 1985, its taxonomic status within the Camellia genus on Hainan Island remains unclear due to the absence of support from modern genomic research data.

      Islands, as an ideal system for studying the effects of geographical isolation and long-distance diffusion, offer valuable insights into species evolution, encompassing phenomena such as adaptive radiation and speciation[11]. Previous studies have documented the discovery of several new plant species on Hainan Island, including Holttumochloa[12], Euphorbia[13], Cycadaceae[14], among others. Moreover, advancements in whole-genome resequencing technology have confirmed the independent evolutionary histories and parallel domestication processes of CSS and CSA[7, 8]. Building upon these findings, our hypothesis suggests that tea trees on Hainan Island may constitute a distinct species separated from CSS and CSA, and that Hainan tea has undergone an independent evolutionary trajectory on the island.

      To furnish molecular evidence regarding the genomic divergence and relationship of Hainan tea with CSS and CSA, and to elucidate the genetic background of Hainan tea on Hainan Island, we procured 500 samples of Hainan tea from the Baisha, Qiongzhong, Wuzhishan, and Ledong regions of Hainan Province, China. Employing whole-genome resequencing technology, we identified SNPs in the Hainan tea samples and constructed a phylogenetic tree that included both cultivated tea and Hainan tea, utilizing the Yunkang 10 as the reference genome. Subsequently, detailed analyses of population structure and kinship relationships were conducted to offer a comprehensive understanding of the population structure and genetic diversity of Hainan tea and to unveil the phylogenetic relationships between Hainan tea and global Camellia sinensis varieties. This study furnished robust genomic data support and further corroborated the independent status of Hainan tea within the taxonomy of Camellia sinensis. Concurrently, these findings furnished a crucial scientific foundation for the conservation of tea germplasm resources and molecular breeding on Hainan Island. Furthermore, the research methodologies and techniques employed herein hold the potential to provide valuable insights into the origin and domestication analyses of other species on Hainan Island, as well as for the investigation of genetic diversity.

    • In this study, 500 tea tree samples were collected from four major tea-producing regions in Hainan: Ledong, Qiongzhong, Baisha, and Wuzhishan (Fig. 1). Notably, the samples encompassed a substantial number of ancient tea trees. Detailed sample information is provided in the Methods and Materials section and Supplemental Table S1. Following the sequencing, a total of 6.9 Tb of raw sequencing data were obtained. Subsequently, the original data source underwent filtration and was aligned with the reference genome (Yunkang 10), yielding a final average alignment rate of 98.98%. Notably, A-MCXB3-1D was excluded from the dataset due to its notably low mapping rate of 27.27% (Supplemental Fig. S1; Table S1).

      Figure 1. 

      Geographical distribution of tea samples collected and analyzed in this study.

      The color of circles denotes the collection area of Hainan tea samples, with circle size indicating the corresponding sample count. Larger circles indicate higher sample counts in the respective areas.

      Based on the results from Supplemental Fig. S2, which contain information about SNPs initially detected using GATK, hard filtering conditions were established initially. Subsequently, SNPs with a minor allele frequency (MAF) of at least 0.05 were retained, resulting in a total of 32,334,340 SNPs (Supplemental Table S2). Among them, 91.3% were located in the intergene region, 4.85% in the intron region, 0.12% in the 5' UTR, and 0.28% in the 3' UTR. In addition, 1.01% of SNPs were located in the upstream region of the gene, 1.06% in the downstream region. Exon SNPs accounted for 1.34% of the total SNPs, of which non-synonymous SNPs and synonymous SNPs accounted for 49.82% and 47.93%, respectively (Tables 1, 2).

      Table 1.  The number of SNPs in different genome structures.

      VariantsTypeCore set
      SNPTotal32,334,340
      Intergenic29,520,274
      Intronic1,566,641
      Exonic433,604
      5' UTR40,383
      3' UTR92,101
      UTR5;UTR3229
      Upstream326,710
      Downstream341,541
      Upstream;downstream7,803
      Splicing4,838
      Exonic;splicing216

      Table 2.  The number of large-effect SNPs.

      VariantsTypeCore set
      SNPTotal (exonic + exonic;splicing)433,820
      Nonsynonymous243,634
      Synonymous179,902
      Nonsyn/Syn ratio1.35
      Stop-gain9,699
      Stop-loss518
      Unknown67
    • To explore the phylogenetic relationship between tea trees of CSS and CSA on Hainan Island, genomic data of CSS and CSA were obtained from various global regions from the teabase database (http://teabase.ynau.edu.cn/)[15], as well as from the Genome Sequence Archive project number PRJCA001158[8]. Additionally, KM6 (Camellia cuspidata) was selected as an outgroup. The chosen tea plant materials are listed in Supplemental Table S3. Subsequently, we constructed a phylogenetic tree between the resequenced samples and globally cultivated tea trees using the maximum likelihood method with SNPs extracted from prior results. Phylogenetic analysis revealed distinct categorization of the samples into four main classes: global Aassamica1, global Assamica2, global Sinensis, and tea trees from Hainan Island. Notably, global Assamica and Sinensis clustered together on one side of the tree, while the tea tree samples from Hainan Island formed a separate, distinct cluster, without mingling with global Assamica and Sinensis. Particularly noteworthy is the significant geographic clustering observed in samples collected from the Limu Mountain area (abbreviated as LMS in Fig. 2), forming a subgroup within the Hainan tea sample cluster. Conversely, samples from other regions lacked a discernible pattern of geographic or regional clustering. For simplicity, we collectively refer to Hainan tea samples as ‘Hainan tea’. Based on the phylogenetic tree clustering results, we abbreviated tea samples from the Limu Mountain region as LMS and those from outside the Limu Mountain region as OLMS.

      Figure 2. 

      Phylogenetic relationship between Hainan tea and cultivated tea.

      Branch color indicates the sample source. Blue, purple, and green branches in the upper section of the phylogenetic tree denote cultivated tea across global regions. CSA is divided into Assamac1 (blue) and Assamac2 (purple). The lower section illustrates the phylogenetic relationships among Hainan tea samples from distinct regions.

      Further population rooted tree analysis (Supplemental Fig. S3) indicated that Hainan tea exhibited significant genetic divergence from CSS and CSA, making it challenging to classify them within the same category. This suggested a considerable evolutionary divergence of Hainan tea from CSS and CSA, although additional evidence is required to bolster this hypothesis. Notably, a discernible genetic distance exists between OLMS and LMS in the evolutionary tree, and they do not form a distinct category, consistent with the findings of phylogenetic analysis. In comparison to the outgroups, Hainan tea clustered more closely with global Assamica and Sinensis, prompting speculation that Hainan tea may belong to a Camellia species distinct from CSS and CSA.

    • Utilizing the data from Fig. 2 as the foundation, sequencing data from 21 bloom Camellia, 24 oilseed Camellia, 41 wild Camellia and 15 other Camellia groups were extracted from the teabase database (Supplemental Table S3), followed by population structure analysis and principal component analysis. K values ranging from 1 to 9 and their corresponding cross-validation (CV) error values were examined (Fig. 3a; Supplemental Fig. S4). Notably, at K = 2, Hainan tea diverged from other Camellia species, manifesting distinct ancestral compositions. With K = 3, a subsequent separation occurred with Sinensis exhibiting novel ancestral components. By K = 4, Assamica segregated from the genetic makeup of the added Camellia plant samples. At this juncture, Hainan tea demonstrated a distinctive population genetic background compared to the global Assamica and Sinensis, corroborating the findings depicted in Fig. 2. Furthermore, at K = 8, the cross-validation error minimized (Fig. 3a), indicating the optimal model where multiple species within Hainan tea and Camellia were delineated into eight genetically discrete populations. Within these populations, the LMS group displayed an autonomous genetic composition (depicted in purple). Particularly noteworthy, populations from the LD-JFL and QZ-BML, originating from two rainforest reserves, exhibited a considerable blend of purple genetic backgrounds. Additionally, the NS-JM and SM-SM populations shared a common genetic makeup (depicted in dark brown). It’s of significance to observe that certain SM-MN populations encompassed a genetic background akin to global Assamica1, plausibly due to the historical introduction of Assamica in Hainan Province.

      Figure 3. 

      The population structure and principal component analysis results for various species of Hainan tea and the Camellia genus are presented, alongside gene flow maps illustrating interactions between cultivated tea and Hainan tea.

      (a) Population structure of Hainan tea and global Assamica, global Sinensis, 21 bloom Camellia, 24 oilseed Camellia, 41 wild Camellia, and 15 other Camellia groups. The picture shows the population structure of K value from 2 to 8 analyses. (b) Principal component analysis. These samples can be divided into six different groups, represented by six circles. Among them, Other, Bloom Camellia and Oilseed Camellia are grouped together, while Hainan Tea, global Assamica1, global Assamica2, global Sinensis, and wild Camellia are grouped separately. (c) Graph illustrating the results of the optimal gene flow analysis conducted using Treemix. This figure shows the history of gene flow among three populations of Assamica, Sinensis and Hainan tea. Arrows are used to mark possible population migration events in the figure, the length of the arrow indicates the intensity of migration, and the direction of the arrow indicates the direction of migration.

      In the principal component analysis of Hainan tea and Camellia species (Fig. 3b; Supplemental Fig. S5), PC1, PC2, and PC3 carried weights of 6.25%, 5.20%, and 4.53% respectively. PC1 distinctly segregated Hainan tea from other Camellia species, suggesting a unique genetic background for Hainan tea. This finding aligned with the outcomes of the population structure analysis (K = 2). PC2 separated global Assamica1 from Assamica2, while PC3 separated global Assamica from Sinensis. Consistent with the population structure results, PCA also indicated the inclusion of SM-SM samples within global Assamica1, further supporting the hypothesis that certain samples clustered within global Assamica was due to the introduction of CSA from Yunnan Province to Hainan Province. It’s noteworthy that Hainan tea displayed a closer clustering with Assamica and Sinensis in the PCA plot compared to the subsequently added Camellia genus. The findings from population structure and PCA complemented those of Fig. 2, demonstrating that Hainan tea not only stands distinct from Assamica and Sinensis but also lacks any shared genetic components with the newly added Camellia genus population.

    • To provide additional evidence for our hypothesis proposing that Hainan tea belongs to a distinct species within the Camellia genus, separate from Assamica and Sinensis, we conducted f3 statistical analysis and examined the genetic relationships among Hainan tea, Assamica, and Sinensis using Treemix and D statistics.

      During the f3 statistical analysis, we observed that the f3 values between Hainan tea and cultivated tea were highly similar, with a difference of only 0.00347, indicating a close genetic relationship between them. Specifically, the f3 values of OLMS and LMS exceeded those of Hainan tea and cultivated tea, implying a higher genetic similarity between OLMS and LMS. This discovery enhanced our understanding of the potential genetic connection between Hainan tea and LMS, suggesting a potential sharing of deeper genetic characteristics.

      In the gene flow analysis, OptM determined the optimal migration model to be m = 1, implying the possibility of a migration event among the studied populations (Supplemental Fig. S6a, S6b). Employing a model with m = 1, we identified gene flow from Sinensis to Assamica1 (Fig. 3c), supported by a significant D value in the D statistics (Supplemental Table S5). In summary, these findings indicated that Hainan tea exhibits limited similarity to Assamica and Sinensis, with no significant historical gene flow between Hainan tea and Assamica or Sinensis.

    • In order to assess the genetic relationship between Hainan tea samples, KING software was used to analyze the genetic relationship of Hainan tea, Assamica and Sinensis, and the results are shown in Fig. 4a. In the genetic process, individuals accumulate their mutations, which can be shared among different individuals, so two individuals with the same mutations do not necessarily have the same ancestors. This similarity of mutations is called Identical by state (IBS). The King software can calculate the IBS value based on the SNPs, and reflect the composition and reliability of the kinship relationship among groups through the ratio of the kinship coefficient. The abscissa in the figure represented the proportion when IBS is 0, closer to 0, and higher the reliability of the result. The kinship coefficient of the vertical axis was mainly divided into three levels. A coefficient lower than 0.0442 indicated that the kinship relationship among individuals was far away, while a negative kinship coefficient indicated that there may be a large difference in the population structure between two individuals.

      Figure 4. 

      The genetic relationships between samples and the genetic differentiation coefficient (FST) between different populations, as well as the corresponding nucleotide diversity (π).

      (a) The family analysis results are based on KING software. The X-axis is the IBS value, and the Y-axis is the kinship coefficient. The data comprise pairs of samples exhibiting close kinship, which were selected from Global Assamica, Global Sinensis, and Hainan tea. Sample pairs from Hainan tea are highlighted in red. The three intervals, separated by dotted lines from top to bottom, represent primary, secondary, and tertiary degrees of kinship, respectively. (b) The FST heat map llustraties the genetic differences among the five populations of OLMS, LMS, Assamica1, Assamica2 and Sinensis. (c) π values of OLMS, LMS, Assamica and Sinensis populations.

      Highlighted in red in Fig. 4a was the sample pair with close genetic relationship screened out of Hainan tea samples, that was, kinship greater than 0.0442 (Supplemental Table S6). The closely related samples primarily belonged to the WDB group in the SM-MN area (refer to Fig. 4a; Supplemental Table S6 for specifics). All Hainan tea samples in this region were sourced from artificially managed tea gardens, with human activities influencing the reproduction of tea trees. Moreover, the area has a history of large-scale tea plant cultivation, suggesting that the close relationship between these samples may stem from the expansion of tea tree cultivation areas.

    • Pairwise FST values were computed for five populations: Assamica1, Assamica2, Sinensis, Hainan tea, and LMS, revealing a range between 0.036 and 0.328 (Fig. 4b; Table 3). Notably, the minimal group distinction was observed between Hainan tea and LMS, resulting in an FST value of 0.036. Importantly, when compared to Assamica, the distinction between tea and Sinensis populations in the Hainan and LMS regions was less pronounced. This observation aligned with the findings from the phylogenetic analysis of f3. Genetic diversity levels for these five populations were additionally assessed through the analysis of π values. As illustrated in Fig. 4c, tea populations in the Hainan and LMS regions exhibited greater genetic diversity compared to Assamica and Sinensis. It is notable that teas from Hainan and LMS regions display heightened genetic diversity levels. The level of genetic diversity among tea populations is relatively consistent.

      Table 3.  Genetic differentiation coefficient (FST) among groups. FST values range from 0 and 1, with higher values indicating greater genetic differences among populations.

      FSTAssamica1Assamica2OLMSLMSSinensis
      Assamica10.2360.2390.2360.321
      Assamica20.2810.2820.328
      OLMS0.0360.209
      LMS0.212
      Sinensis
    • Although Hainan Island is rich in wild tea tree resources and possesses vast plantation areas of rainforest tea trees, tea tree resources have not yet been comprehensively investigated and fully developed. In this study, we selected a large number of ancient tea tree samples from the rainforest area, and analyzed them by whole genome resequencing, obtaining 32,334,340 SNPs. This dataset is the most extensive resequencing dataset of Hainan tea samples reported so far.

      The classification of Camellia species basing on traditional taxonomy is very challenging[8], and Hainan dayezhong, as a unique Camellia species in Hainan, lacks the support of genomics data so far, and its status in taxonomy is always unknown so that it is often not yet a CSA. We analyzed the population relationship between Hainan tea and globally cultivated tea trees based on resequencing data to clarify the status of Hainan tea in Camellia from a genomic perspective. By constructing a phylogenetic tree between Hainan tea and globally cultivated tea trees, it can be observed that Hainan tea does not belong to either CSS or CSA, but rather forms an independent branch and clusters into a single taxon. It is important to note that in this cluster of Hainan tea, the samples from the LMS group formed distinct geographic subgroups, whereas the samples from the OLMS group did not appear to be geographically clustered (Fig. 2). This may be attributed to the fact that samples from the LMS population were collected in the Limu Mountain Rainforest Reserve, which is relatively undisturbed by human activities. In contrast, other areas have more human activities, which may lead to the mixing of genetic backgrounds of Hainan tea in multiple regions[16].

      Although the Wuzhishan region is located in a tropical rainforest reserve, according to the Qiongzhong County Record, the state actively promoted tea planting in the region in the mid-1990s and introduced CSA varieties for breeding and cultivation. Therefore, the samples from the Wuzhishan region did not show obvious geographical clustering (Fig. 2). Additional results of population structure and principal component analysis further confirmed this observation. The population structure analysis revealed that Hainan tea has an independent genetic background, whereas LMS differs from OLMS in genetic background. It is particularly noteworthy that, except for LMS, OLMS presented a mixture of genetic backgrounds, which coincided with the results of phylogenetic trees (Figs 2, 3a). The results of principal component analysis also clearly showed the independent group status of Hainan tea with CSS and CSA (Fig. 3b; Supplemental Fig. S5). Despite the presence of several Hainan tea samples in the global Assamica1 cluster, this is consistent with the historical context of the introduction of CSA from Yunnan in the mid-1990s.

      Geographic isolation is one of the main causes of species formation[17]. When populations of the same breeding stock separate, they face independent evolutionary histories defined by natural selection, genetic drift, adaptation, and colonization to local conditions[18]. Hainan, as a tropical island, has extensive rainforests that provide high-quality growing environments for plants, and the island’s geography provides the necessary geographic isolation for new species to arise. The results of the population structure analysis, which incorporated data from additional Camellia plants, clearly indicated that Hainan tea possesses a distinct genetic background compared to other Camellia species. Moreover, Hainan tea clustered closer to CSS and CSA in the principal component analysis while remaining distant from other Camellia (Fig. 3a, b). Therefore, we cautiously proposed that Hainan tea represents a novel variety of Camellia sinensis distinct from CSS and CSA. Notably, samples from the LMS region form a distinct subgroup cluster in the phylogenetic tree depicted in Fig. 2 and demonstrated an independent genetic component in the population structure analysis, akin to the scenario observed with G. hirsutum L. purpurascens on Hainan Island[19]. Thus, it was deduced that Hainan tea from the LMS region constitutes a unique endemic variety within the Hainan tea species.

      Genetic drift is one of the important mechanisms for maintaining genetic diversity among biological populations. High levels of genetic drift help to reduce genetic differences and increase the homogeneity between two populations[20]. However, when physical barriers prevent genetic drift, different populations may form or experience physical isolation that prevents the exchange of genetic materials. These physical barriers are usually, although not always, caused by natural factors[21].

      Hainan Island, once connected to the mainland, has undergone a long period of rotation and movement, rotating counter-clockwise from its original position in the Beibu Gulf to its current position. The initial separation occurred in the Paleocene (ca. 65 Mya), while the major part of the rotational drift occurred in the Eocene[22]. During the Quaternary, ice ages and interglacial periods alternated, the most recent major ice age occurring about 15 Kya ago. The onset of the Ice Age led to a drop in global temperatures and a steady decline in global sea levels, which led to the formation of natural land bridges between sea islands and continents. During the ~8,000-year-long Ice Age (15 Kya-7 Kya ago), genetic exchange of species between Hainan Island and neighboring continents may have occurred. For example, a literature survey study found the existence of gene flow between Hainan’s native Painted Lady and the Chinese Painted Lady in South China[23]. However, the cold global climate during the Ice Age reduced the population size of the species, especially for the cold-intolerant CSA, and the likelihood of genetic exchange diminished[24]. After the Ice Age ended, the rise of the sea level led to the emergence of Qiongzhong Strait, which switched Hainan Island once again to the island mode. This geological event may have hindered genetic exchange between Hainan Island tea trees and those on the mainland, leading to their gradual and independent evolution in response to the tropical island climate. As a result, a new variant emerged, possibly falling under the categorization of Camellia sinensis[25].

      Considering the potential existence of land bridges facilitating gene flow between Hainan tea and mainland tea plants, we intensively investigated the gene flow between Hainan tea and cultivated tea. First, we performed f3 statistical analysis (Supplemental Table S4) and found that the genetic relationship between LMS and OLMS was closer comparing to cultivated tea, which is consistent with the results in Fig. 2 and 3a. Especially noteworthy is that Hainan tea was closer to Sinensis comparing to Assamica. The results of the Treemix analysis visually demonstrated how geographic isolation significantly impeded gene flow between cultivated and Hainan teas (Fig. 3c). In addition, the Dsuite program was applied to perform ABBA-BABA analysis, and this result further supported our view (Supplemental Table S5). These findings strongly suggested that the geographic separation of Hainan tea has prevented the exchange of genetic material between it and cultivated tea, thus contributing to its possible independent evolution as a new variant of Camellia sinensis.

      Groups that are highly segregated and lack genetic drift are usually prone to inbreeding[26]. However, the current analyses showed (Fig. 4a) that Hainan tea do not show excessive kinship among each other, and the concentration of samples with high kinship was overwhelmingly from samples from the WDB group (Supplemental Table S6), a group whose tea trees came from an artificially managed tea plantation. This phenomenon may be caused by anthropogenic factors. Genetic diversity, species diversity, and ecosystem diversity are the three pillars of biodiversity. Tea plants are typically propagated asexually via cuttings. If individuals propagated through this method are presented in the study samples, a significant portion of sample pairs will exhibit an affinity coefficient exceeding 0.354. Nevertheless, the present findings do not corroborate this hypothesis (Fig. 4a). Based on the principles of population genetics, the conservation of biodiversity is ultimately the conservation of genetic diversity[27]. Nucleotide diversity is an important indicator for assessing the diversity of DNA sequences in a species or population[28]. The processes of domestication and breeding have reduced the genetic diversity of crops, and the widespread cultivation of monoculture crop varieties has led to an increase in genetic vulnerability[29,30]. Wild ancient tea trees, as a precious natural resource with high genetic diversity, are of great value for the study of the evolutionary mechanisms and diversity of the tea trees[31]. Interestingly, the Hainan tea and LMS have higher genetic diversity than CSS and CSA (Fig. 4c), even though Hainan tea is affected by geographic isolation, resulting in restricted gene flow (Fig. 3c). This can be partially attributed to the unique climatic conditions of the tropical island, which are very favorable for the growth of tea trees. Combined with minimal anthropogenic disturbance, this has resulted in less natural pressure on tea tree population expansion, thus helping to maintain genetic diversity[16]. Furthermore, the genetic relationships between tea plants in Hainan and LMS were closer to those of Sinensis than to Assamica, and the genetic relationships between tea trees in Hainan and LMS were closer to each other (Fig. 4b; Table 3). This is consistent with the results obtained in the f3 statistical analysis (Supplemental Table S4), suggesting that Hainan tea and Assamica taxa diverged earlier than Hainan tea and Sinensis taxa.

      In summary, the whole-genome resequencing of 500 Hainan tea samples from major tea-producing regions of Hainan Island was performed in this study, and 32,334,340 SNPs were successfully identified. The results of this study strongly support the existence of Hainan tea as a new variant of Camellia sinensis, which is genetically distinct from CSS and CSA, and also reveal the existence of Hainan tea in the LMS region as an independently evolved local variety. Although Hainan tea did not show significant gene flow between Hainan tea and cultivated tea trees due to the geographic barrier of the strait, it still maintained high genetic diversity, which manifested itself in high π values. The results of this study help to clarify the position of Hainan tea in the taxonomy of Camellia sinensis from a genomic perspective. Additionally, they provide reliable data support for an in-depth understanding of the genetic background and diversity of Hainan tea on the island. Furthermore, they offer an important scientific basis for the conservation of tea germplasm resources and molecular breeding on Hainan Island. In addition, our research methods and techniques can also provide lessons and references for the analyses of the origin and domestication of other species on Hainan Island, as well as for genetic diversity studies.

    • Systematically, 500 samples of Hainan tea from Hainan Province, China were collected. These included Jianfengling (Ledong, 28 samples), Limu Mountain (Qiongzhong, 160 samples), Gaofeng Fangtong Village (Nankai, Baisha, 41 samples), Miao Village Junior Class (Nankai, Baisha, 53 samples), Mengya Village (Nankai, Baisha, 24 samples), Shifu Village Junior Class (Nankai, Baisha, 26 samples), Yaxing (Nankai, Baisha, 15 samples), Junmin Village (Nansheng, Wuzhishan, 13 samples), Maoxiang Village (Nansheng, Wuzhishan, 32 samples), Baimaling (Qiongzhong, 13 samples), Fanglong Village (Shuiman, Wuzhishan, 14 samples), Maona Village (Shuiman, Wuzhishan, 47 samples), and Shuiman Village (Shuiman, Wuzhishan, 34 samples). Additionally, the teabase database[15] and data from various Camellia species in the genome sequence archive with project number PRJCA001158 from Genome Sequence Archive[8] were utilized for analysis. Notably, the KM6 strain (Cuspidata Camellia) was selected as an outgroup for our study and subsequent analyses. Detailed information on the study samples can be found in Supplemental Tables S1, S3, and Fig. 1.

    • Five hundred tea accessions were acquired exclusively from Hainan province in China. Young leaves were harvested from these plants and rapidly frozen in liquid nitrogen. Total DNA extraction was performed using the DNAsecure plant kit (Tiangen, Beijing). Subsequently, 2 µg of genomic DNA from each accession was utilized to prepare sequencing libraries according to the manufacturer’s protocol using the NEBNext Ultra DNA Library Prep Kit (NEB Inc., America). Sequencing was carried out on an Illumina NovaSeq 6000 sequencer, generating paired-end sequencing libraries with an approximate insert size of 400 bp.

    • The paired-end resequencing reads underwent filtering utilizing fastp (Version: 0.12.2)[32]. This process eliminated reads containing adapter sequences or poly-N sequences, as well as low-quality reads (defined as reads with more than 40% bases having Phred quality scores ≤ 20) from the raw data. The outcome of this step was the production of clean data, which were then utilized for subsequent downstream analyses.

    • The paired-end resequencing reads were aligned to our tea reference genome using BWA (Version: 0.7.17-r1188)[33], employing default parameters. The mapping results were converted into the BAM format and unmapped as well as non-unique reads were filtered using SAMtools (Version: 1.3.1)[34]. Additionally, duplicated reads were removed using the Picard package (picard.sourceforge.net, Version: 2.1.1).

      Following BWA alignment, we performed realignment of reads around indels using GATK in a two-step process. Initially, the RealignerTargetCreator package was utilized to identify regions necessitating realignment. Subsequently, the identified regions were realigned using IndelRealigner, resulting in a realigned BAM file for each accession. Variant detection was conducted following the recommended best practice workflow by GATK[35]. Specifically, variants were called for each accession using the GATK HaplotypeCaller[35]. A joint genotyping step was carried out to merge variations comprehensively from the gVCF files. During the filtering step, the SNP filter expression was set as ‘QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 5.0 || MQRankSum < −12.5 || ReadPosRankSum < −8.0 || QUAL < 30’. SNPs that were not bi-allelic were excluded, resulting in the creation of the basic set. Subsequently, SNPs with more than 20% missing calls and MAF less than 0.05 were further eliminated to generate the core set, which was used for phylogenetic tree construction, PCA, and population structure analysis.

      SNPs were annotated according to the tea genome using the ANNOVAR package (Version: 2015-12-14)[36]. Based on the genome annotation, SNPs were classified into various genomic regions, including exonic regions (overlapping with coding exons), splicing sites (within 2 bp of a splicing junction), 5' UTRs, 3' UTRs, intronic regions (overlapping with introns), upstream and downstream regions (within a 1 kb region upstream or downstream from the transcription start site), and intergenic regions. SNPs located in coding exons were further categorized into synonymous SNPs (which did not cause amino acid changes), nonsynonymous SNPs (which caused amino acid changes), stop gain mutations (mutations resulting in the gain of a stop codon), and stop-loss mutations (mutations resulting in the loss of a stop codon). Indels within exonic regions were classified based on whether they caused frame-shift mutations (3 bp insertion or deletion) and whether they resulted in the gain or loss of a stop codon.

    • Whole-genome SNPs were utilized to construct the maximum likelihood (ML) phylogenetic tree with 100 bootstrap replicates using SNPhylo (Version: 20140701)[37]. Camellia cuspidata (KM6) served as an outgroup to provide corresponding positional information. The phylogenetic tree was visualized and color-coded using iTOL (http://itol.embl.de).

      Chromosomal SNPs were filtered by removing SNPs in linkage disequilibrium with PLINK (Version v1.90b3.38)[38] , employing a window size of 50 SNPs (advancing 1 SNP at a time) and an r2 threshold of 0.5. Principal component analysis was conducted using Genome-wide Complex Trait Analysis (GCTA, version: 1.25.3) software[39] , and the first three eigenvectors were plotted. Population structure analysis was performed using the ADMIXTURE program (Version: 1.3)[40] with a block-relaxation algorithm. The number of genetic clusters (K) was predefined from 2 to 9, and the cross-validation error (CV) procedure was run to explore convergence of individuals. Default methods and settings were applied in all analyses.

    • The relationship between each accession was examined using KING (Version: 2.2.5)[41] , utilizing the basic set SNPs with the option ‘--kinship’. This option employed the KING-Robust algorithm to estimate pair-wise kinship coefficients. Close relatives were reliably inferred based on the estimated kinship coefficients using the following simple algorithm: an estimated kinship coefficient range greater than 0.354 indicates a duplicate relationship, while ranges of [0.177, 0.354], [0.0884, 0.177], and [0.0442, 0.0884] correspond to 1st-degree, 2nd-degree, and 3rd-degree relationships, respectively.

    • The calculation of average pairwise diversity within each population (π) was conducted using 100 kb sliding windows. Population differentiation (FST) was assessed through pairwise FST comparisons among populations.

    • Admixture graphs of geographically defined Hainan tea populations were inferred using TreeMix[42], employing a Maximum Likelihood (ML) approach based on a Gaussian model of allele frequency change. The topology of the ML trees varies depending on the number of migration events (m) permitted in the model, ranging from m = 0 to m = 5. Bootstrap values on the tree were derived from 1,000 replicates. Admixture events among different tea populations were indicated by arrows on the graph, with KM6 serving as the root. To ensure robustness, each migration event was iterated 10 times with a random seed. The optimal number of migration edges was determined using the R package ‘OptM’ (Version: v0.1.6)[43].

    • The f3 statistics were computed using the R package ‘admixr’ (Version: 0.9.1)[44] for all conceivable combinations of tea groups, with KM6 serving as the outgroup. SNPs exhibiting missing data and monomorphism were excluded from the analysis.

      To assess the presence of introgression signals among tea groups, Patterson’s D (also known as the ABBA-BABA test) and f4 admixture ratio statistics for all possible trios of tea groups were calculated using Dtrios in Dsuite (Version: 0.4 r42)[45], with KM6 designated as the outgroup. SNPs with missing data and monomorphism were removed from consideration.

      To investigate the species-level relationships among tea groups, we explored the backbone of the phylogeny using the PoMo model[46] within IQ-Tree[47]. This analysis included 1,000 bootstrap replicates, employing the ultrafast bootstrap approximation method. The tree was rooted using KM6 as the outgroup.

    • The authors confirm contribution to the paper as follows: study conception and design: Duan S; draft manuscript preparation: Guo D, Li D; manuscript revision and editing: Huang Y, Duan S; tea samples collection: Wang Z, Li D, Zhou Y, Xiang G, Zhang W, Wang W, Fang Z, Hao T, Zheng D, Lei Y, Yang L, Zhang W, Tang S, Zheng L, Cao Y. All authors reviewed the results and approved the final version of the manuscript.

    • The data that supported the findings of this study have been deposited into the CNGB Sequence Archive (CNSA) of China National GeneBank DataBase (CNGBdb) with accession number CNP0005405. In addition, the sequencing data are also accessible from the tea database (http://teabase.ynau.edu.cn/index/download/index) and the BIG Data Center under the accession number PRJCA001158.

      • This work was supported by the Hainan Academy of Agricultural Sciences Research Project (HAAS2022KJCX03), Research and Demonstration on Key Technologies of Germplasm Resource Bank Construction and Resource Innovation Utilization of Wuzhishan Big Leaf Tea (ZDYF2024XDNY245) and Monitoring and Analysis of Key Quality Components of Hainan Big Leaf Black Tea and Development and Demonstration of New Standardized Processing Technology (WZSKTPXM202202). We express our sincere gratitude to the People’s Government of Wuzhishan City, Hainan Province, and the Wuzhishan Scenic Area of Hainan Tropical Rainforest National Park for their generous support and assistance in this project.

      • The authors declare that they have no conflict of interest.

      • #Authors contributed equally: Dazhong Guo, Dongliang Li

      • Supplemental Table S1 The relevant information of 500 Hainan tea samples used in this study.
      • Supplemental Table S2 Quantity table of SNPs after different hard filter.
      • Supplemental Table S3 Supplementary Table 3. The information used in this study includes global Assamonia, global Sinensis, Other, Bloom Camellia, Oil seed, and Wild samples.
      • Supplemental Table S4 f3 Statistical analysis results.
      • Supplemental Table S5 D Statistical analysis results.
      • Supplemental Table S6 Hainan tea samples used in KING software analysis of genetic relationships.
      • Supplemental Fig. S1 The density plot of the sample's mapping rate, X-coordinates indicate the Mapping rate and Y-coordinates indicate the density.
      • Supplemental Fig. S2 Density distribution plot of SNPs hard filter.
      • Supplemental Fig. S3 Group rootless trees constructed by OLMS, LMS, global Assamica and global Sinensis. KM6 (C. cuspidata) was selected as the outgroup.
      • Supplemental Fig. S4 Figure 3A cross-validation error. The x-axis represents the K value, and the y-axis represents the cross-validation error. The dots in the figure show K = 8 with the smallest cross-validation error.
      • Supplemental Fig. S5 Principal Component Analysis of Hainan Tea with global Assamica, global Sinensis, wild Camellia sinensis, Bloom Camellia, Oil seed Camellia and other Camellia. Principal Components 1 and 3. The analysis focused on principal components 1 and 3. Each sample is represented by a small circle, with the color of each circle indicating the corresponding group as depicted in the legend on the right side.
      • Supplemental Fig. S6 (A,B) Optimal m-value diagrams for OptM judgment. The horizontal axis is the m value, and the vertical axis is the likelihood value. The optimal m value in the figure is 1, which means that the historical relationship of samples can be best explained by 1 migration.
      • Copyright: © 2024 by the author(s). Published by Maximum Academic Press on behalf of Yunnan Agricultural University. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.
    Figure (4)  Table (3) References (47)
  • About this article
    Cite this article
    Guo D, Li D, Wang Z, Li D, Zhou Y, et al. 2024. Genome resequencing reveals an independently originated Camellia sinensis variety – Hainan tea. Agrobiodiversity 1(1): 3−12 doi: 10.48130/abd-0024-0003
    Guo D, Li D, Wang Z, Li D, Zhou Y, et al. 2024. Genome resequencing reveals an independently originated Camellia sinensis variety – Hainan tea. Agrobiodiversity 1(1): 3−12 doi: 10.48130/abd-0024-0003

Catalog

  • About this article

/

DownLoad:  Full-Size Img  PowerPoint
Return
Return