De novo assembly and annotation of the IPC genome: a high-quality reference genome for salt-tolerance research
-
The genome size of IPC was first estimated by flow cytometry (FCM) at approximately 1.08 GB (Supplemental Table S1). A total of 24.07 GB of CCS reads (30×) was generated from 338.5 GB of PacBio SMRT HiFi reads (Supplemental Table S2). The resulting contig-level assembly was 2.08 GB, representing the diploid genome of IPC, with an N50 of 1.77 MB and an average contig length of 255 KB (Supplemental Table S3). The haploid assembly was 1.05 GB, with an N50 of 4.02 MB and an average contig length of 1.87 MB (Supplemental Table S4). The agreement between the haploid assembly size and the FCM estimate, together with the quality of the contigs, indicates that the haploid contig-level genome is suitable for chromosome anchoring. A total of 125.55 GB (125×) of Hi-C data (Supplemental Table S5) was used to anchor the contigs onto 15 pseudochromosomes (Supplemental Table S6), and the interaction heatmap supports the quality of the chromosome-level assembly (Fig. 1, Supplemental Fig. S1).
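The contig statistics quoted above (N50 and average contig length) can be computed from a list of contig lengths. The following is a minimal sketch of the standard N50 definition, not the assembly pipeline used in this study; the function names and the toy lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together cover at least half of the total assembly."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def mean_length(lengths):
    """Average contig length (total assembly size / number of contigs)."""
    if not lengths:
        raise ValueError("no contig lengths given")
    return sum(lengths) / len(lengths)


# Toy contig lengths (hypothetical, arbitrary units).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_contigs)
toy_mean = mean_length(toy_contigs)
```

Because N50 is weighted by sequence length rather than contig count, it is typically much larger than the mean contig length when an assembly contains many short contigs.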
A comprehensive and reliable annotation of IPC protein-coding genes was obtained by combining de novo prediction, homology-based prediction, and RNA-seq-based prediction (Supplemental Table S7, Supplemental File 1−3). The completeness of the predicted protein set was evaluated using BUSCO: 1,571 complete BUSCOs were recovered, accounting for 97.4% of the BUSCO set (Supplemental Table S8). In total, 34,077 protein-coding genes were predicted, with an average gene length of 3,496.93 bp and an average exon length of 252.6 bp; of these, 14,008 (41.1%) were annotated in the GO database and 13,975 in the KEGG database (Supplemental Table S9).
Understanding the IPC genome in the context of comparative genomic analysis
-
Comparative genomics offers a powerful approach to studying the evolutionary history of an organism. To elucidate the mechanisms underlying the development of salt resistance in IPC, we conducted a comparative genomic analysis of IPC and nine other species. These included four species from the Convolvulaceae family: Cuscuta australis, Cuscuta campestris, ITB, and ITF. We also included five model or node species: Oryza sativa, Actinidia chinensis, Vitis vinifera, Solanum lycopersicum, and Solanum tuberosum. The longest transcripts of the protein-coding genes from each genome were extracted using OrthoFinder, and comparative analysis and gene family clustering were conducted (Supplemental File 4). As a result, 28,134 orthologous gene families were identified among the 301,339 genes from the 10 species (Supplemental Table S10). Among these genes, 87,501 were single-copy orthologs, while the remaining 213,838 were multi-copy orthologs. Notably, Oryza sativa had the most unclustered genes (7,591) and unique family genes (2,052), reflecting its distant evolutionary relationship with the dicot species (Fig. 2a). Gene composition and gene number differed markedly between the two Cuscuta species, while the three Ipomoea species showed similar gene compositions, indicating close evolutionary and taxonomic relationships. Shared orthologs among ITB, IPC, Vitis vinifera, Solanum lycopersicum, and Oryza sativa were visualized with a Venn diagram (Fig. 2b), showing that these five species shared 10,170 orthologous gene families, with the largest number of species-specific orthologous gene families (2,458) observed in IPC. This indicates an early divergence of IPC from other Ipomoea species, specifically ITB. The 168 single-copy orthologs from the 10 species were used to construct a phylogenetic tree, which showed that ITB and ITF are closer to each other than they are to IPC. 
The divergence of IPC from ITB and ITF is estimated to have occurred ~17.41 million years ago (Fig. 2c).
To study whole-genome duplication (WGD) in IPC, the homologous protein pairs between ITB and IPC (34,239) and within the genomes of ITB (17,890), Vitis vinifera (7,822), and IPC (19,945) were used to calculate Ks with the WGDI software (Supplemental File 4). The kernel density distribution of syntenic blocks indicated that only one WGD event had occurred in these three species. The close proximity of the Ks peaks for IPC and ITB suggests a relatively close genetic relationship between the two Ipomoea species and that they diverged recently (Supplemental Fig. S2a). There are 24,041 shared orthologs between IPC and ITF, which account for 70.55% of the annotated genes in IPC. In comparison, 24,434 shared orthologs were detected between IPC and ITB, accounting for 71.70% of the total number of IPC genes. We analyzed the genome synteny of IPC with ITB and ITF based on the syntenic blocks of orthologous genes. The results showed that ITB and ITF have good chromosome synteny, consistent with previous analysis. However, except for chromosomes 8, 9, and 13, significant inversions were observed in the other chromosomes of IPC compared to ITB and ITF. Additionally, translocations were detected in chromosomes 2 and 12 of IPC (Supplemental Fig. S2b).
Gene expansion and duplication analysis shed light on salt-tolerance in IPC at the genomic level
-
The same set of species used in the previous analysis was analyzed for gene family expansion and contraction. The orthologous gene families clustered by OrthoFinder were used to detect gene family expansion and contraction in each species (Supplemental Table S11, Supplemental Fig. S3). To gain insights into the functions of the expanded and contracted gene families in the IPC genome, we performed GO enrichment analysis on each set separately. The contracted gene families were mainly enriched in the following processes or molecular functions: glycosyltransferase activity, response to oxygen levels, and lipid and sterol metabolism (Supplemental Fig. S4). These results suggest that these processes or molecular functions are not critical for IPC's salt tolerance. In contrast, the genes in the expanded gene families are mainly involved in immune response, vacuole composition, and plasma membrane composition (Supplemental Fig. S3, Supplemental File 5). Genes of these categories have been widely studied in the context of plant salt tolerance, suggesting that the expansion of these gene families improves the response to salt stress and enhances salt tolerance in IPC. The enrichment of expanded genes in salt tolerance-related GO terms provides new insights into the salt tolerance mechanism of IPC from the genomic perspective.
To determine the origin of the rapidly expanded genes in IPC, we analyzed gene duplication in the genome with the DupGen_finder software, which identified 27,040 duplicated genes divided into five duplication types: 9,790 WGD (whole-genome duplication) genes (36.21%), 2,760 TD (tandem duplication) genes (10.21%), 1,280 PD (proximal duplication) genes (4.73%), 6,891 TRD (transposed duplication) genes (25.48%), and 6,319 DSD (dispersed duplication) genes (23.37%). A comparative analysis between the duplicated genes and the rapidly expanded genes was conducted (Fig. 3a, b; Supplemental Table S12). Our findings indicate that 1,412 of the 1,907 rapidly expanded genes belong to the DSD type, suggesting that DSD genes were the primary source of the rapidly expanded genes in IPC. Moreover, DSD genes account for a high proportion of the duplicated genes in IPC.
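The proportions reported above follow directly from the per-type gene counts. The short sketch below recomputes them as a consistency check; the counts are taken from the text, while the variable names are illustrative only.

```python
# Duplicated gene counts per duplication type, as reported in the text.
dup_counts = {"WGD": 9790, "TD": 2760, "PD": 1280, "TRD": 6891, "DSD": 6319}

# Total number of duplicated genes (should equal the reported 27,040).
total_duplicated = sum(dup_counts.values())

# Percentage of all duplicated genes contributed by each type.
dup_share = {
    dup_type: round(100 * count / total_duplicated, 2)
    for dup_type, count in dup_counts.items()
}

# Share of rapidly expanded genes that are of the DSD type (1,412 of 1,907).
dsd_in_expanded = round(100 * 1412 / 1907, 2)

for dup_type, share in dup_share.items():
    print(f"{dup_type}: {share}%")
print(f"DSD among rapidly expanded genes: {dsd_in_expanded}%")
```

On these counts, DSD genes make up roughly a quarter of all duplicated genes but about three quarters of the rapidly expanded genes, which is the contrast the paragraph draws on.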
To gain insights into the salt tolerance mechanism of IPC from a gene duplication perspective, we analyzed the gene duplications in ITB and ITF, two salt-intolerant species in the genus Ipomoea. Our results revealed that the number of duplicated genes of all types in IPC was significantly higher than that in ITB and ITF. We performed Gene Ontology (GO) enrichment analysis based on duplication types. Our results show that TRD and DSD genes are mainly enriched in terms related to transposition and DNA/chromosome stability maintenance. Previous studies have demonstrated that salt stress can induce DNA damage[35−37]. Thus, the expansion of genes related to DNA/chromosome stability maintenance might be one of the mechanisms by which IPC adapts to salt-stressed environments. On the other hand, PD genes are mainly enriched in terms related to the biosynthesis and metabolism of secondary metabolites, such as terpenoids and isoprene, while TD genes are enriched in terms related to tissue immunity and biotic/abiotic stress response, and WGD genes are enriched in terms related to DNA transcription and protein modification, such as protein ubiquitination modification (Fig. 3c).
Expansion of IPC repeat genes is driven by Transposable Element (TE) events
-
To understand the cause of the expansion of repeat genes in IPC, we examined the genomes of three Ipomoea species: IPC, ITB, and ITF. While the species have the same chromosome number (2n = 30), their genome sizes differ considerably. IPC has a genome size of 1.05 GB, which is much larger than the genomes of ITB (461.83 MB) and ITF (492.38 MB). To investigate the reason for the larger genome of IPC, we annotated the repeat sequences in the genomes of the three species. The results revealed that a substantial portion of the IPC genome, 83.81% or 871.63 MB, consisted of repetitive sequences. The primary contributors to these repetitive elements were transposable elements (TEs), collectively making up 82.81% of the genome sequence and encompassing both Class I and Class II TE categories. Among TE-related repeats, long terminal repeat (LTR) elements were the most prevalent, accounting for 47.37% of the genome, followed by DNA transposons at 19.79% and non-LTR retrotransposons at 12.84%. Furthermore, 1.92% of the repeat sequences were tandem repeats, and an additional 1.28% of the genome sequence remained unclassified in terms of repeat type (Fig. 3d, Supplemental Table S13, Supplemental File 6).
To determine whether gene duplication in IPC was linked to transposable element activity, we conducted a genome-wide analysis of transposable elements in the 10 KB upstream and downstream of gene coding regions. Genes with TEs detected in these regions were defined as TE-associated genes. Our findings indicated that 82.51% of the genes in the IPC genome were associated with TEs. Notably, 84.82% of duplicated genes exhibited TE associations, whereas only 77.44% of non-duplicated genes were identified as TE-associated, showing a higher prevalence of TE-associated genes among duplicated genes than among non-duplicated genes (Supplemental Table S14). In addition, statistics of the nearest TEs upstream and downstream of gene coding regions indicated that duplicated genes tend to have TEs located within 1,000 bp upstream of their coding regions. Furthermore, the first TE appeared close to the gene coding regions, as seen from the ratio of the first TE to the gene coding regions (Supplemental Fig. S5).
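The definition of a TE-associated gene above amounts to an interval-overlap test on the two flanking windows of each gene. The sketch below illustrates that logic under stated assumptions: coordinates are 1-based and inclusive, strand is ignored (so "upstream" and "downstream" are simply the left and right flanks), and TEs lying entirely inside the coding region are not counted. The `Interval` type, the function names, and the example coordinates are hypothetical and do not reproduce the scripts used in this study.

```python
from collections import namedtuple

# Hypothetical record type: chromosome plus 1-based inclusive coordinates.
Interval = namedtuple("Interval", ["chrom", "start", "end"])


def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


def is_te_associated(gene, tes, flank=10_000):
    """Return True if any TE overlaps the left or right flanking window
    (default 10 kb) of the gene. Strand is ignored (assumption)."""
    left_start, left_end = max(1, gene.start - flank), gene.start - 1
    right_start, right_end = gene.end + 1, gene.end + flank
    for te in tes:
        if te.chrom != gene.chrom:
            continue
        if overlaps(te.start, te.end, left_start, left_end):
            return True
        if overlaps(te.start, te.end, right_start, right_end):
            return True
    return False


def te_associated_fraction(genes, tes, flank=10_000):
    """Fraction of genes with at least one TE in a flanking window."""
    if not genes:
        return 0.0
    hits = sum(is_te_associated(g, tes, flank) for g in genes)
    return hits / len(genes)


# Toy example with made-up coordinates.
example_gene = Interval("chr1", 20_000, 22_000)
example_tes = [Interval("chr1", 12_000, 12_500), Interval("chr2", 21_000, 21_500)]
example_flag = is_te_associated(example_gene, example_tes)
```

The linear scan over all TEs is adequate for illustration; at genome scale, a sorted index or an interval tree per chromosome would be the usual choice.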
To further confirm the association between transposition events and gene duplication, all TEs within 10 KB upstream and downstream of the coding regions were statistically analyzed according to duplication type. Except for DSD genes, the TEs in the IPC genome displayed a position-dependent distribution, with fewer TEs detected closer to the gene coding regions. On average, fewer TEs were found near the coding regions of WGD genes than near those of the other duplicated gene types (DSD, PD, TD, and TRD) and unduplicated genes (UD). The occurrence of TEs near TD and TRD genes was similar to that near unduplicated genes. However, the number of TEs near the coding regions of DSD and PD genes was higher than for TD, TRD, and UD genes. In particular, far more TEs were detected near the coding regions of DSD genes than near any other gene type. The distribution of TEs near the coding regions of PD genes showed two pronounced peaks at approximately 2,500 bp upstream and downstream of the coding regions. Given the potential of transposons to mediate gene duplication, and the higher density of TEs near PD and DSD genes compared to unduplicated genes (Fig. 4a), the formation of PD and DSD duplicated genes in IPC is likely due to transposition events.
Genome-wide TEs were investigated using the TEsorter software, which identifies different TE types, especially LTR retrotransposons. The results showed a high abundance of TEs in the IPC genome, with LTRs being the most abundant type. TEs, particularly LTRs, are known to cause stable mutations through insertion into or near functional genes. We therefore analyzed the presence of TEs in the coding regions of IPC genes. Of the 34,292 annotated genes, the coding regions of 3,911 genes were found to contain TEs, of which 3,096 were LTRs, 280 were LINEs, 373 were TIRs, 26 were Helitrons, and 136 were of other types (Supplemental Fig. S6, Supplemental Table S15). To better understand the relationship between repeat genes and TEs, the distribution of TE categories and duplication types was visualized using Venn diagrams. The results showed that TEs were detected more frequently in the coding regions of DSD and TRD genes, accounting for 32.39% and 14.71%, respectively (Fig. 4b, Supplemental File 7). Similarly, DSD and TRD were the most abundant duplication types among TE-related genes, accounting for 52.19% and 25.93% of the total TE-related genes, respectively (Supplemental Table S16). This result confirms a strong association between DSD and TRD gene duplication and TE events.
Time-course salt-treatment analysis provides insight into the salt tolerance of IPC at the transcriptional level
-
In the absence of a reference genome for IPC, our group and Liu et al. de novo assembled the transcriptome and studied the responses of IPC to temperature extremes and salt treatment[19,20]. For this study, we downloaded the same RNA-seq data and extensively investigated it using our genome assembly as a reference[19]. GO and KEGG enrichment analysis at different time points revealed the response details of IPC roots and leaves to salt treatment (Supplemental Result S1, Supplemental Files 8 & 9). In the IPC roots and leaves, we obtained five main categories of enriched GO as summarized in Fig. 5a and b: response-related pathways, secondary metabolic process, small molecule catabolic process, ion transmembrane transportation-related functions, and phytohormone signaling pathways. Enrichment of the GO terms 'response to water deprivation' at all time points, both in roots and leaves, indicated that salt stress directly affects water absorption in the root systems. Enrichment of the GO terms 'response to oxidative stress' and 'response to reactive oxygen species' at most time points in roots and leaves is consistent with the previous observation that oxidative stress is the secondary toxicity to salinity. The results also showed that secondary metabolic and small molecule catabolic processes played important roles in IPC's response to salt stress. This tendency is more evident in leaves than in roots, with specific metabolites/molecules indicated, including monocarboxylic acid, cutin, suberin, fatty acid, and wax. Furthermore, transmembrane transportation of ions and organic compounds might play critical roles in IPC's adaptation to salt stress because of the enrichment of related GOs, especially in leaves (Fig. 5a, b). Phytohormone regulations also participate in IPC's response to salt stress. The participation of Abscisic acid (ABA), Ethylene (ET), Jasmonates (JA), Brassinosteroids (BR), and Salicylates (SA) was observed at the early stage of salt-treated roots. 
Only JA and ET signaling pathways were persistently activated at the later stages of roots under treatment. However, the involvement of phytohormone regulation in the IPC leaf responding to salt was not as significant as in the root, with only the ABA signaling pathway up-regulated at a very early stage after treatment (Fig. 5a, b; Supplemental File 10).
We analyzed the RNA-seq data over time using ImpulseDE to group differentially expressed genes in IPC into four categories: transient up-, transient down-, transition up-, and transition down-regulated genes. The transient up-regulated genes represent the early response of IPC to salt stress, while the transition up-regulated genes could be responsible for the permanent adaptation of IPC to salt treatment. We performed GO and KEGG term enrichment analyses on these gene groups to gain a deeper understanding of the response and tolerance of IPC to salt treatment (Supplemental Result S1). In roots, early-responsive genes are primarily involved in the response to abiotic stress (Fig. 5c), while genes conferring long-term adaptation are related to small molecule and secondary metabolite synthesis (Fig. 5d). In addition to the responsive pathways, transmembrane transport plays a critical role in the early response of IPC leaves to salt treatment (Fig. 5e). Among the transition up-regulated genes in leaves, monocarboxylic acid metabolic processes are the most enriched GO terms, while amino acid metabolic/catabolic GO terms are also frequently enriched, indicating the significance of monocarboxylic acids and amino acids in the leaf response to salt stress in IPC (Fig. 5f). We generated DAG figures to reveal the critical GO terms involved in the salt tolerance of IPC (Supplemental Result S1). The responsive pathways and amino acid catabolic process are the enriched biological processes for both roots and leaves, while the fatty acid metabolic process is involved in the adaptation of IPC roots. The phenol-containing compound catabolic, alpha-amino acid catabolic, and suberin biosynthetic processes participate in the adaptation of IPC leaves.
Investigation of salinity-related genes provides a constitutive salt tolerance mechanism of IPC
-
The genes involved in salt-stress perception, signal transduction, and downstream response in IPC were identified through genome analysis. Based on previously published research, the roles of these genes in salt tolerance are illustrated in Fig. 6a. Plants perceive salt stress through three stimuli, Na+, osmotic stress, and mechanical pressure, with three different types of sensors. Compared to Arabidopsis, most of the sensors in IPC are significantly expanded, particularly OSCA1, one of the osmotic sensors. Expression analysis showed that the Na+ sensors, RbohD and MOCA1, and the osmotic sensor, OSCA1, are significantly upregulated at the early stage of salt treatment, indicating their involvement in salinity tolerance in IPC. Among the transporter channels, CNGCs and ANN1 are responsible for the early Ca2+ influx from the outer side of the membrane, which is critical for initiating the salt-response pathways. The IPC genome contains 26 CNGCs and two ANN1 genes, with relatively high expression levels of ANN1 in both IPC roots and leaves. Na+ efflux and K+ influx are fulfilled by SOS1 and AKT1, respectively, which are essential for plant tolerance to salt stress. Interestingly, seven SOS1 and four AKT1 genes were identified in the IPC genome, a significant expansion compared to Arabidopsis. The relatively high expression of SOS1 and AKT1 in IPC roots indicates that the expansion of these genes might confer Na+ efflux and salt tolerance in IPC. Other salt-tolerance genes, such as those involved in the SOS and PA pathways, were also expanded to some extent, and most of them were upregulated at an early stage under salt treatment, especially in the roots (Fig. 6).
The genes involved in ion uptake, transportation, and sequestration were investigated extensively (Supplemental Fig. S7 & S8). Specifically, GLRs and AKT1, responsible for Na+ uptake, exhibit significant expansion in IPC, with 40 and five genes, respectively, compared to 20 and one in Arabidopsis. Additionally, members of the high-affinity K+ transporters (HAKs), which primarily function as K+ acquisition transporters, may also be involved in Na+ transport. Notably, a HAK from maize, ZmHAK4/ZmNC2, has been shown to act as a Na+-selective transporter[38]. Furthermore, HAK5 in Arabidopsis, initially identified as a K+ transporter responsive to low K+ signals[39,40], also mediates low-affinity Na+ uptake, regulated by external K+ concentrations[41]. These findings suggest that HAK5 may also play a role in Na+ uptake under saline conditions. In this study, seven HAK5s were identified, a significant expansion as indicated by the p-value. The substantial expansion of Na+ uptake transporters implies that IPC tends to absorb Na+ from the soil, which aligns with previous research indicating that IPC's salt tolerance is not due to an ability to exclude Na+. Moreover, the NHX and V-ATPase gene families, responsible for vacuolar Na+ sequestration, also display significant expansion in the IPC genome. This expansion likely contributes to IPC's salt tolerance by facilitating the transport of salt into vacuoles and its storage within plant cells, thus mitigating salt stress in the cytoplasm.
The synthesis of the primary medicinal component CQA was induced by salt treatment
-
Several studies have characterized the bioactive components of IPC, which are primarily isochlorogenic acids, including isochlorogenic acid A (ISA), isochlorogenic acid B (ISB), and isochlorogenic acid C (ISC). ISA, also known as 3,5-Di-O-caffeoylquinic acid, is the main component of the aqueous ethanol extract of IPC's aerial part and has significant pain-relieving and anti-inflammatory effects[42,43]. Using the assembled and annotated IPC genome, we identified the genes involved in the biosynthesis of caffeoylquinic acid (CQA) and investigated their expression in different organs and under salt stress. Dicaffeoylquinic acid was identified as the primary medicinal component of IPC in previous studies, and two main competing pathways have been found along the route from phenylalanine to CQA, namely the biosynthesis of flavonoids and their isoforms and the biosynthetic pathway of lignin. To explain the accumulation of CQA in IPC, we constructed the CQA biosynthesis pathway according to the KEGG reactions, using the globe artichoke CQA pathway as a reference. Subsequently, genes encoding the enzymes in the CQA pathway were identified, including five PALs, four C4Hs, three 4CLs, one HCT, and two HQTs (Fig. 7, Supplemental Table S17, Supplemental File 11). Since flavonoid biosynthesis competes with CQA biosynthesis for the substrate p-Coumaroyl-CoA, we also identified genes encoding enzymes for isoflavone and flavonoid synthesis, resulting in two CHSs and one CHI (Fig. 7, Supplemental Table S17). Additionally, quinic acid is a necessary substrate for the synthesis of CQA, so its availability also determines CQA synthesis. 
To this end, we constructed the biosynthetic pathway of quinic acid and identified the genes involved in this pathway, resulting in four aroF-encoding candidate genes, three aroB candidate genes, five aroD candidate genes, and five QUIB candidate genes (Fig. 8, Supplemental Table S18).
To understand the regulation of CQA biosynthesis in IPC plants, the expression levels of the identified genes in the biosynthesis pathway were investigated. Among the tissues examined, PALs have the highest expression levels in filaments, indicating active biosynthesis in filaments, since cinnamic acid, a precursor for other biosynthesis pathways, is synthesized from phenylalanine. Other genes in this pathway, C4Hs, 4CLs, HCT, and HQTs, have relatively higher expression in roots and leaves. HQT and 4CL are critical enzymes for the synthesis of CQA, and the genes encoding these two enzymes, IPC03G0005070 and IPC02G0011120, have higher expression levels in roots. The expression level of IPC10G0011520, an HQT-encoding gene, is also very high, although it is not specifically expressed in roots. HCT and HQT, encoding the enzymes responsible for the last two steps in this pathway, have the highest expression in roots, indicating that the main medicinal component, CQA, may primarily accumulate in the roots of IPC (Fig. 7, Supplemental Fig. S9). Analysis of transcriptome data from salt-treated plants showed that in leaves, the genes encoding 4CL, C4H, and PAL are highly expressed within 24 h of salt treatment, while the expression levels of HQT- and HCT-encoding genes increased gradually with the duration of salt stress. In roots, the expression levels of most of these genes increased after salt treatment. The PAL-, HQT-, and HCT-encoding genes were up-regulated under salt treatment, while other genes were up-regulated at the early time points (Supplemental Fig. S10), indicating that salt treatment might enhance secondary metabolite synthesis, a trend that is especially evident for the main medicinal component, CQA.
The synthesis of quinic acid is also significant for the synthesis of CQA and its derivatives. The expression levels of genes involved in quinic acid production were investigated in different organs (Fig. 8), and the results showed that their expression levels were relatively higher in leaves than in other tissues, indicating a stable and higher level of quinic acid production in leaves, which provides a stable substrate for the synthesis of CQA and its derivatives there. The three genes encoding quinate isomerase (QUIB), IPC04G0015730, IPC09G0005340, and IPC13G0002640, were specifically and highly expressed in the pedicel. At least one gene each encoding aroF, aroB, and aroD also had relatively higher expression in the pedicel than in other tissues (Supplemental Fig. S11), indicating that the quinic acid content in the pedicel might be relatively high. Interestingly, IPC11G0004020 had remarkably high expression in filaments, indicating an active secondary metabolism in the filaments. The expression of these genes in roots and leaves was also analyzed under salt treatment. The results revealed that several key genes, including IPC11G0004020, IPC04G0016490, IPC02G0024340, and IPC14G0008790, were induced by salt treatment specifically in the roots (Supplemental Fig. S12).