-
Long-read PacBio HiFi sequencing was performed on all four species of Macadamia. The sequencing depths were as follows: M. jansenii (28×), M. integrifolia (27×), M. tetraphylla (42×), and M. ternifolia (37×)[7]. The collapsed HiFiasm assemblies of the four Macadamia species were highly contiguous, with N50 values above 45 Mb, whereas the haploid assemblies were less contiguous and slightly smaller than the collapsed assemblies. The M. integrifolia contig assembly had the largest number of contigs (1,049), whereas M. tetraphylla had the fewest. In all species, the haploid 1 assembly was more contiguous and longer than the haploid 2 assembly (Supplementary Table S1). The BUSCO analysis revealed a high level of genome completeness, with completeness scores above 97%. Among the identified BUSCO genes, the majority were found as single-copy genes, with percentages ranging from 83.3% to 84.1%. A small proportion of the BUSCOs were detected as duplicated genes (double BUSCOs), with percentages ranging from 13.4% to 14.2%. Minor percentages of fragmented BUSCOs, ranging from 0.6% to 0.9%, were also reported. The percentage of missing BUSCOs, representing genes absent from the assemblies, was low, varying from 1.4% to 2.6% (Supplementary Table S1).
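For readers less familiar with the N50 statistic used throughout this section, the following is a minimal sketch of how it can be computed from a list of contig or scaffold lengths. The function name and the toy lengths are illustrative assumptions, not values or code from this study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence downwards until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy lengths in Mb (hypothetical, for illustration only).
toy_lengths_mb = [80, 70, 50, 30, 20]
print(n50(toy_lengths_mb))  # prints 70
```

In the toy case the total is 250 Mb, and the two longest sequences (80 + 70 = 150 Mb) already cover half of it, so the N50 is 70 Mb.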
Chromosome level assembly
-
The total length of the RagTag-scaffolded assemblies ranged from 735 to 795 Mb across the species. The collapsed assembly was slightly larger than the individual haploid assemblies, and the Hap2 assemblies were the smallest, ranging from 735 to 776 Mb across species. Among the species, M. tetraphylla had the longest collapsed assembly, while M. integrifolia had the shortest. The length of the collapsed assembly for each species reflects the total size of its merged haplotypes, providing a more complete view of the genome. M. tetraphylla had the longest haploid assembly, while M. jansenii had the shortest. Among the chromosomes in the collapsed genome assemblies of the four species, chr 9 (70 to 75 Mb) and chr 10 (68 to 72 Mb) were consistently the longest, whereas chr 7 was the smallest chromosome in all collapsed assemblies (Supplementary Table S2). The overall BUSCO completeness scores ranged from 95.0% to 98.9%, indicating that a large proportion of the BUSCOs were present in the assemblies. The majority of BUSCOs were found as single-copy genes, with percentages ranging from 81.6% to 84.2%, confirming the accurate representation of essential genes in the collapsed assemblies. Only a small percentage of BUSCOs were fragmented or missing, suggesting robust and reliable genome assemblies (Table 1, Supplementary Table S2). The N50 values for the collapsed assemblies ranged from 51.7 to 56 Mb. M. tetraphylla exhibited the highest N50 values, while M. ternifolia had the lowest. These N50 values indicate that the collapsed assemblies are highly contiguous. The N50 values for the haploid assemblies were generally smaller than those of the collapsed assemblies, ranging from 51.4 to 54.8 Mb.
The k-mer analysis showed that M. jansenii had the smallest genome and low heterozygosity, whereas M. integrifolia and M. tetraphylla possessed larger genomes and higher heterozygosity. A substantial portion (approximately 63%−69%) of their genetic sequences was found to be unique (Supplementary Table S3 & Supplementary Fig. S1a−d). Genome size estimation by flow cytometry showed that M. tetraphylla had the largest genome, followed by M. ternifolia, which is consistent with the scaffolded assembly results (Supplementary Table S3).
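The BUSCO categories reported here are related by simple arithmetic: under the standard BUSCO convention, complete = single-copy + duplicated, and complete + fragmented + missing = 100%. The sketch below checks those two identities for a row of percentages. The function name and the rounding tolerance are assumptions made for illustration; the three rows are the M. jansenii values as given in Table 1.

```python
def busco_consistent(complete, single, duplicated, fragmented, missing, tol=0.15):
    """Return True if a row of BUSCO percentages is internally consistent.

    tol is an assumed allowance for values rounded to one decimal place.
    """
    parts_match = abs((single + duplicated) - complete) <= tol
    total_match = abs((complete + fragmented + missing) - 100.0) <= tol
    return parts_match and total_match


# (complete, single, duplicated, fragmented, missing), M. jansenii rows of Table 1.
rows = {
    "M. jansenii Hap1": (98.9, 83.3, 13.6, 0.6, 2.5),
    "M. jansenii Hap2": (95.0, 82.1, 12.9, 0.6, 4.4),
    "M. jansenii Collapsed": (97.7, 84.2, 13.5, 0.7, 1.6),
}

for name, values in rows.items():
    print(name, busco_consistent(*values))
```

With the values as transcribed in Table 1, the Hap2 and collapsed rows satisfy both identities, whereas the Hap1 row does not (83.3 + 13.6 = 96.9 against a reported 98.9), so that entry may be worth checking against the supplementary data.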
Table 1. Chromosome level assemblies of four species of Macadamia representing assembly length, BUSCO and N50 values.
| | M. jansenii Hap1 | M. jansenii Hap2 | M. jansenii Collapsed | M. ternifolia Hap1 | M. ternifolia Hap2 | M. ternifolia Collapsed | M. integrifolia Hap1 | M. integrifolia Hap2 | M. integrifolia Collapsed | M. tetraphylla Hap1 | M. tetraphylla Hap2 | M. tetraphylla Collapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Assembly length (Mb) | 761 | 735 | 773 | 766b | 748 | 780 | 748 | 751 | 775 | 776 | 775 | 795 |
| Complete BUSCO | 98.9% | 95.0% | 97.7% | 97.1% | 96.5% | 97.7% | 95.1% | 94.3% | 97.6% | 97.4% | 97.3% | 97.8% |
| Single | 83.3% | 82.1% | 84.2% | 83.8% | 83.4% | 84.1% | 82.4% | 81.6% | 84.1% | 83.5% | 83.8% | 83.7% |
| Double | 13.6% | 12.9% | 13.5% | 13.3% | 13.1% | 13.6% | 12.7% | 12.7% | 13.5% | 13.9% | 13.5% | 14.1% |
| Fragmented | 0.6% | 0.6% | 0.7% | 0.8% | 0.8% | 0.8% | 0.9% | 0.6% | 0.6% | 0.8% | 0.7% | 0.7% |
| Missing | 2.5% | 4.4% | 1.6% | 2.1% | 2.7% | 1.5% | 4.0% | 5.1% | 1.8% | 1.8% | 2.0% | 1.5% |
| N50 (Mb) | 54.2 | 51.7 | 54.7 | 53.8 | 51.8 | 53.8 | 52.8 | 53 | 53.7 | 54 | 56 | 56 |

*The chromosomes were numbered according to the M. integrifolia genome which used the seven genetic linkage maps[4].
Genome structure comparison
-
The genomic structure comparison of the four Macadamia species using SyRI revealed syntenic regions, inversions, translocations, and duplications. Chromosomes 9 and 10 showed several structural rearrangements, with chr 9 exhibiting changes in the first half and chr 10 in the second half. Chr 4 also displayed genomic rearrangements at one end, while chr 12 showed several duplications in the middle in all four species (Fig. 1). Dotplots of the reference genome (M. jansenii Hi-C) against the four Macadamia species (scaffolded with RagTag) showed varying structural rearrangements, with M. integrifolia and M. tetraphylla having more structural differences relative to M. jansenii (Supplementary Fig. S2 & S3). Among all chromosomes, chr 9 and chr 10 carried the majority of rearrangements. Similarly, dotplot comparison between the haploid assemblies showed that the M. integrifolia haplotypes differed the most from each other, while the M. jansenii haplotypes differed the least (Supplementary Fig. S3). Overall, the genomes of the Macadamia species differ in structure and arrangement, with most of the differences concentrated on chr 9 and chr 10.
Figure 1.
The genome structure comparison of four Macadamia species, with different colours denoting each species and structural rearrangements (synteny, inversion, translocation, and duplication) as indicated on the top of the image.
Genome annotation
-
The repeat content analysis of the four species identified a total repeat content of 61% to 62% across both haploid and collapsed assemblies, indicating that a major portion of each genome is composed of repetitive elements. Among the different repeat types, long terminal repeat (LTR) elements were the most prevalent, comprising around 22.1% to 23.8% of the genomes, followed by long interspersed nuclear elements (LINEs). Other repeat types, such as DNA elements, unclassified elements, small RNA elements, satellites, and simple repeats, contributed a smaller fraction of the total repeat content, ranging from 4.13% to 6.51% (Supplementary Table S4). The consistency of the total repeat content between haploid and collapsed assemblies suggests that the repetitive landscape is preserved even after haplotype merging. Comparison of the collapsed assemblies with their respective haplotypes showed that the number of predicted genes remained relatively stable. Among the collapsed assemblies, M. integrifolia had the highest number of genes (40,534), while M. jansenii had the lowest (37,198). In the haploid assemblies, the number of genes ranged from 36,465 to 47,388. The distribution of genes across the chromosomes showed that chr 9 and chr 10 have more genes than the other chromosomes (Table 2). The number of CDS and protein sequences identified by Braker3 is higher than the gene count because some genes produce multiple transcripts through alternative splicing. The telomere analysis revealed that the collapsed assemblies were generally assembled 'telomere to telomere' for most chromosomes. However, a few exceptions were observed in which a telomere was present at only one end, suggesting missing or ambiguous telomeric sequences on some chromosome ends (Supplementary Table S5).
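The idea behind classifying a chromosome as 'telomere to telomere' can be sketched as a search for tandem telomeric repeats in a window at each end of the sequence. The sketch below is not the method used in this study: it assumes the canonical plant (Arabidopsis-type) repeat TTTAGGG, and the window size, minimum repeat count, function names, and toy sequences are all hypothetical.

```python
import re


def has_telomere(seq, motif="TTTAGGG", min_repeats=3):
    """True if seq contains >= min_repeats tandem copies of motif or its reverse complement.

    The TTTAGGG motif and the repeat threshold are assumptions for illustration.
    """
    complement = str.maketrans("ACGT", "TGCA")
    revcomp = motif.translate(complement)[::-1]  # TTTAGGG -> CCCTAAA
    pattern = re.compile(
        f"(?:{motif}){{{min_repeats},}}|(?:{revcomp}){{{min_repeats},}}"
    )
    return pattern.search(seq.upper()) is not None


def telomere_status(chrom_seq, window=50):
    """Classify a chromosome by telomeric repeats found in windows at both ends."""
    at_start = has_telomere(chrom_seq[:window])
    at_end = has_telomere(chrom_seq[-window:])
    if at_start and at_end:
        return "telomere to telomere"
    if at_start or at_end:
        return "one end only"
    return "none detected"


# Toy chromosomes (hypothetical sequences, far shorter than real ones).
both_ends = "CCCTAAA" * 5 + "ACGT" * 50 + "TTTAGGG" * 5
one_end = "CCCTAAA" * 5 + "ACGT" * 50
print(telomere_status(both_ends))
print(telomere_status(one_end))
```

For real chromosomes a much larger window (kilobases rather than tens of bases) would be appropriate, and a chromosome end reported as lacking a telomere may simply reflect an incomplete assembly of that end.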
The functional annotation of the CDS sequences showed that the majority of similarity hits were to Telopea, the only other member of the Proteaceae with a high-quality genome sequence. In all four species, the most hits were to Telopea, followed by Nelumbo nucifera and Tetracentron sinense (Supplementary Fig. S4). The pathway analysis of the annotated CDS sequences identified a consistent number of pathways among the four species: 580 pathways each in M. jansenii and M. tetraphylla, 578 in M. ternifolia, and 581 in M. integrifolia. The top five pathways, namely purine and thiamine metabolism, response to drought, biosynthesis of cofactors, and starch and sucrose metabolism, were found in all four species. This suggests that these pathways play crucial roles in the biological processes and responses shared by all four species.
Table 2. Distribution of genes across the 14 chromosomes of Macadamia species.
| | M. jansenii Hap1 | M. jansenii Hap2 | M. jansenii Collapsed | M. ternifolia Hap1 | M. ternifolia Hap2 | M. ternifolia Collapsed | M. integrifolia Hap1 | M. integrifolia Hap2 | M. integrifolia Collapsed | M. tetraphylla Hap1 | M. tetraphylla Hap2 | M. tetraphylla Collapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Chr_01 | 2483 | 2543 | 2474 | 2455 | 2484 | 2612 | 2483 | 2389 | 2665 | 2643 | 2521 | 2631 |
| Chr_02 | 2666 | 2514 | 2608 | 2774 | 2666 | 2739 | 2453 | 2613 | 2699 | 2664 | 2735 | 2786 |
| Chr_03 | 2802 | 2868 | 2844 | 3007 | 2943 | 3053 | 2837 | 2771 | 2974 | 2949 | 2917 | 3017 |
| Chr_04 | 2780 | 2670 | 2718 | 2832 | 2706 | 2931 | 2833 | 2746 | 3078 | 3142 | 2782 | 2813 |
| Chr_05 | 2800 | 2783 | 2798 | 2798 | 2636 | 2911 | 2755 | 2569 | 2814 | 2746 | 2780 | 2866 |
| Chr_06 | 2607 | 2579 | 2568 | 2623 | 2465 | 2683 | 2585 | 2616 | 2667 | 2702 | 2731 | 2709 |
| Chr_07 | 2790 | 2702 | 2696 | 2764 | 2699 | 2836 | 2810 | 2587 | 2623 | 2711 | 2578 | 2712 |
| Chr_08 | 2768 | 2768 | 2677 | 2742 | 2671 | 2770 | 2509 | 2802 | 2878 | 3062 | 2869 | 2837 |
| Chr_09 | 2870 | 2897 | 2878 | 2915 | 2874 | 3053 | 3373 | 2816 | 3842 | 3626 | 2978 | 3137 |
| Chr_10 | 2402 | 2359 | 2428 | 2301 | 2209 | 2463 | 2699 | 2367 | 3103 | 3710 | 2295 | 2392 |
| Chr_11 | 2820 | 2896 | 2812 | 2910 | 2845 | 3001 | 2917 | 2879 | 3087 | 2888 | 3024 | 2935 |
| Chr_12 | 2590 | 2567 | 2517 | 2642 | 2408 | 2721 | 2576 | 2092 | 2538 | 2617 | 2430 | 2566 |
| Chr_13 | 2766 | 2627 | 2732 | 2641 | 2716 | 2790 | 2684 | 2723 | 2875 | 2694 | 2663 | 2724 |
| Chr_14 | 2560 | 2409 | 2448 | 2598 | 2474 | 2626 | 2446 | 2495 | 2691 | 2634 | 2534 | 2608 |
| Total no. of genes | 37704 | 37182 | 37198 | 38002 | 36796 | 39189 | 37960 | 36465 | 40534 | 40788 | 37837 | 38733 |
| Number of mRNA | 43510 | 43098 | 43092 | 44506 | 43016 | 45694 | 44527 | 43010 | 47301 | 47184 | 44490 | 45519 |
| Number of CDS | 43510 | 43098 | 43092 | 44506 | 43016 | 45694 | 44527 | 43010 | 47301 | 47184 | 44490 | 45519 |

Gene family analysis
-
Anti-microbial gene analysis: Homologs of an antimicrobial gene were identified in all four species of Macadamia using a BLAST search. A single gene was identified in each of the four species, on chr 9. The sequence alignment of the reference gene MiAMP-2 with the copies from all four species revealed a high degree of homology (Supplementary Fig. S5). The protein sequence alignment clearly shows four repeated segments with a four-cysteine motif C-X-X-X-C-(10 ± 12)-X-C-X-X-X-C.
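A cysteine motif of this kind can be located in a protein sequence with a regular expression. The sketch below is illustrative only and rests on an assumed reading of the motif: a cysteine, three residues, a cysteine, a spacer of 10 to 12 residues, a cysteine, three residues, and a final cysteine. The pattern, function name, and toy protein are hypothetical and are not taken from the MiAMP-2 sequence.

```python
import re

# Assumption: the spacer written as "(10 ± 12)" is read as 10 to 12 residues.
CYS_MOTIF = re.compile(r"C.{3}C.{10,12}C.{3}C")


def find_cysteine_motifs(protein):
    """Return (start, matched_segment) for each non-overlapping motif occurrence."""
    return [(m.start(), m.group()) for m in CYS_MOTIF.finditer(protein.upper())]


# Toy protein with two motif-bearing segments (hypothetical sequence).
segment = "C" + "AEQ" + "C" + "G" * 11 + "C" + "RKL" + "C"
toy_protein = "MK" + segment + "SS" + segment + "LL"

for start, match in find_cysteine_motifs(toy_protein):
    print(start, match)
```

Applied to an alignment of the four species' copies, a search like this would be expected to report four hits per protein if the four repeated segments are conserved, subject to the assumed spacer length.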
Fatty acid pathways
-
The number of FatA and FatB genes, which are essential for fatty acid production, varied between species. M. integrifolia had the highest number of both genes (10 and 11, respectively), suggesting the potential of this species for robust fatty acid synthesis. SAD (stearoyl-ACP desaturase) genes, which are mainly responsible for converting stearic acid (C18:0, SA) to oleic acid (C18:1, OA)[13], were present in high numbers across the four species, indicating their active involvement in desaturation. This supports the observations of Hu et al.[14], who reported that the elongation of C16:0 to C18:0 is more efficient than the conversion of C16:0 to C16:1, and that the desaturation of C18:0 to C18:1 appears to be more effective than the desaturation of C16:0 to C16:1[14]. KAS (ketoacyl-ACP synthase) genes, crucial for fatty acid chain elongation, were notably absent in M. integrifolia, potentially indicating a unique fatty acid metabolism pathway in this species. In contrast, the other three species possess KAS genes, particularly M. jansenii and M. ternifolia (10 each), highlighting their capacity for elongating fatty acid chains (Supplementary Table S6a).
Cyanogenic glycoside pathway
-
CYP79, which catalyses the first step in the biosynthesis of cyanogenic glycosides by converting amino acids into aldoximes[15], was present in M. integrifolia and M. tetraphylla and absent in M. jansenii and M. ternifolia, indicating a potential deviation from the typical cyanogenic glycoside biosynthesis pathway in the latter two species. In contrast, CYP71, responsible for further converting aldoximes into cyanohydrin[16], was present in all the species. The numbers of BGLU and UGT genes, which are responsible for detoxification and glycoside modification, varied across the four species, reflecting differences in detoxification capabilities in the cyanogenic pathway. M. tetraphylla lacks UGT genes entirely, potentially indicating unique detoxification mechanisms (Supplementary Table S6b).
WRKY genes
-
The WRKY gene family, known for its key role in plant development and stress responses[17], comprised 58 to 61 proteins in each of the four Macadamia species (Supplementary Table S6c). These counts are close to the 55 WRKY proteins previously reported in the M. tetraphylla genome by Niu et al. in 2022[12].
Orthologous and phylogenetic analysis
-
Orthologous clusters were generated across the four Macadamia species, using Telopea as the outgroup, to identify genes that have been conserved across the species and may have similar functions. The clustering of gene families across the five plant species (T. speciosissima and the four Macadamia species) grouped a total of 195,004 proteins into 34,696 gene clusters. Among all the clusters, only 31 showed overlaps among two or more of the plant species, and 8,217 single-copy clusters indicated conserved genes among the five species (Supplementary Table S7). A total of 30,111 (15.4%) singleton or species-specific genes were found in 2,090 unique gene clusters, of which Telopea contained the largest number (902). Among the Macadamia species, M. integrifolia had the highest number of singleton gene clusters (403), whereas M. jansenii had the lowest (201) (Fig. 2 & Supplementary Fig. S3). Gene Ontology (GO) enrichment analysis of these unique gene clusters holds great promise for providing insights into the distinct biological functions and potential adaptations of each species.
Figure 2.
A Venn diagram showing clusters of orthologous groups of genes (OGs) for the four Macadamia species and T. speciosissima. The diagram shows the number of OGs belonging to the core genome (OGs common to all five species; the intersection of all circles), the number of singletons (unique genes; the outer area of each circle), and the OGs shared by the remaining combinations of the five species (between the core and the periphery of the diagram).
A phylogenetic tree was constructed to investigate the genetic divergence and evolutionary distances among the Macadamia species, with Telopea as the outgroup. The tree has two main branches. One branch includes M. integrifolia and M. tetraphylla, indicating a shared genetic lineage. The other branch comprises M. jansenii and M. ternifolia, which form a distinct genetic lineage (Supplementary Fig. S6 & S7).
WGD and synteny
-
The analysis of Ks values in the genomes of all four Macadamia species revealed a distinctive peak at Ks ≈ 0.32 (Fig. 3). The Telopea genome exhibited a peak at Ks ≈ 0.28. This comparison of the peaks in Macadamia and Telopea suggests a more recent whole-genome duplication (WGD) event in Telopea compared to Macadamia. In some WGD studies, WGD and divergence time estimation have been based solely on Ks values. However, in recent years, a growing body of research has cautioned against relying exclusively on Ks plot analysis for these estimations. Instead, additional sources of evidence are recommended to ensure more robust WGD assessments[18,19].
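As a rough illustration of how a peak in a Ks distribution can be located, the sketch below bins a list of paralog Ks values and reports the midpoint of the most populated bin. This is a simple histogram mode, not the procedure used in this study; published analyses often use kernel density or mixture models instead. The function name, bin width, upper cut-off, and toy values are all assumptions.

```python
from collections import Counter


def ks_peak(ks_values, bin_width=0.1, max_ks=2.0):
    """Return the midpoint of the most populated Ks bin.

    Values outside (0, max_ks] are ignored; both parameters are assumed defaults.
    """
    bins = Counter(
        int(ks / bin_width) for ks in ks_values if 0 < ks <= max_ks
    )
    if not bins:
        raise ValueError("no Ks values in range")
    top_bin, _count = bins.most_common(1)[0]
    return (top_bin + 0.5) * bin_width


# Toy Ks values (hypothetical, chosen so that most fall between 0.3 and 0.4).
toy_ks = [0.05, 0.12, 0.31, 0.32, 0.33, 0.34, 0.36, 0.55, 0.91, 1.25]
print(round(ks_peak(toy_ks), 2))
```

A lower peak Ks corresponds to fewer accumulated synonymous substitutions between duplicated genes, which is why a peak at 0.28 is read as more recent than one at 0.32; converting either value to absolute time would additionally require a substitution rate, which is not given here.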
Figure 3.
Ks distribution plot of the four Macadamia species and Telopea. The colour code of each species is provided on the top left corner.
The duplication events were further verified using synteny plots, which highlighted the duplicated genomic regions and genes. Synteny analysis revealed extensive similarity within each species and among the four species, particularly on chromosomes 9 and 10 (Fig. 4 & Supplementary Fig. S8).
Figure 4.
Synteny plot across all the four Macadamia species. The vertical lines connect orthologous genes across the four species. The blue coloured ribbons represent the regular conserved regions while the red ribbons represent the inverted regions.
Expansion-contraction of gene families
-
Comparison of protein family sizes among the annotated species revealed significant differences between the groups. The protein family size varied notably between the Macadamia species and Telopea. A total of 613 protein clusters were contracted and only 21 protein family clusters were expanded in Macadamia as compared to Telopea. Of the two clades of Macadamia, the edible species (M. integrifolia and M. tetraphylla) exhibited more expansion and contraction (+18/−140) than the bitter non-edible species (+0/−5) (Fig. 5). Among the five contracted clusters of the bitter species, one belonged to xanthotoxin 5-hydroxylase CYP82C4, which is expressed in roots under iron-deficient conditions.
Figure 5.
Gene family expansion and contraction across the Macadamia species and Telopea. The blue colour represents contraction and pink represents expansion of gene clusters.
All four species of Macadamia individually displayed more contraction than expansion. Contraction ranged from 259 to 423 protein clusters, with M. jansenii showing the highest number of contractions, followed by M. ternifolia and M. tetraphylla. In contrast, only 54−94 protein clusters were expanded; M. tetraphylla displayed the highest expansion (+94), and one of its expanded clusters was associated with the GO term 'rejection of self-pollen'. For Telopea, however, the opposite was found, with more expansion than contraction (+485/−57) of protein clusters (Fig. 5). The two edible species showed similar changes and similar gene enrichment patterns, and the same held true for the two non-edible species.
-
The genome sequencing data from PacBio have been submitted under NCBI BioProject PRJNA694456. The genome assemblies and annotations of the four Macadamia species have been deposited in the Genome Warehouse under BioProject PRJCA020274. The NCBI genome submission IDs are SUB14785838 (M. integrifolia), SUB14787551 (M. tetraphylla), SUB14787648 (M. jansenii), and SUB14786002 (M. ternifolia).
-
Cite this article
Sharma P, Masouleh AK, Constantin L, Topp B, Furtado A, et al. 2024. Genome sequences to support conservation and breeding of Macadamia. Tropical Plants 3: e035 doi: 10.48130/tp-0024-0029
Genome sequences to support conservation and breeding of Macadamia
- Received: 11 February 2024
- Revised: 15 March 2024
- Accepted: 25 March 2024
- Published online: 21 October 2024
Abstract: Macadamia, a genus native to Eastern Australia, comprises four species, Macadamia integrifolia, M. tetraphylla, M. ternifolia, and M. jansenii. Macadamia was recently domesticated largely from a limited gene pool of Hawaiian germplasm and has become a commercially significant nut crop. Disease susceptibility and climate adaptability challenges highlight the need for a wider range of genetic resources for macadamia production. High-quality haploid resolved genome assemblies were generated using HiFiasm to allow comparison of the genomes of the four species. Assembly sizes ranged from 735 to 795 Mb and N50 from 53.7 to 56 Mb, indicating high assembly continuity with most of the chromosomes covered from telomere to telomere. Repeat analysis revealed that approximately 61% of the genomes were repetitive sequences. The BUSCO completeness scores ranged from 95.0% to 98.9%, confirming good coverage of the genomes. Gene prediction identified 37,198 to 40,534 genes. The species shared a common whole genome duplication event. Synteny analysis revealed a high conservation and similarity of the genome structure in all four species. Differences in the content of genes of fatty acid and cyanogenic glycoside biosynthesis were found between the species. An antimicrobial gene with a conserved cysteine motif was found in all four species. The four genomes provide reference genomes for exploring genetic variation across the genus in wild and domesticated germplasm, targeting conservation of genetic resources and supporting plant breeding.
-
Key words:
- Genome assembly
- Genome annotation
- Comparative genomics
- Wild species