Genome sequences to support conservation and breeding of <i>Macadamia</i>

Priyanka Sharma; Ardashir Kharabian Masouleh; Lena Constantin; Bruce Topp; Agnelo Furtado; Robert J. Henry

doi:10.48130/tp-0024-0029

2024 Volume 3

1. Queensland Alliance for Agriculture and Food Innovation, University of Queensland, Brisbane 4072, Australia
2. ARC Centre of Excellence for Plant Success in Nature and Agriculture, University of Queensland, Brisbane 4072, Australia

Corresponding author: robert.henry@uq.edu.au

Received: 11 February 2024
Revised: 15 March 2024
Accepted: 25 March 2024
Published online: 21 October 2024
Tropical Plants 3, Article number: e035 (2024)

Abstract

Macadamia, a genus native to Eastern Australia, comprises four species, Macadamia integrifolia, M. tetraphylla, M. ternifolia, and M. jansenii. Macadamia was recently domesticated largely from a limited gene pool of Hawaiian germplasm and has become a commercially significant nut crop. Disease susceptibility and climate adaptability challenges highlight the need for a wider range of genetic resources for macadamia production. High-quality haploid resolved genome assemblies were generated using HiFiasm to allow comparison of the genomes of the four species. Assembly sizes ranged from 735 to 795 Mb and N50 from 53.7 to 56 Mb, indicating high assembly continuity with most of the chromosomes covered from telomere to telomere. Repeat analysis revealed that approximately 61% of the genomes were repetitive sequences. The BUSCO completeness scores ranged from 95.0% to 98.9%, confirming good coverage of the genomes. Gene prediction identified 37,198 to 40,534 genes. The species shared a common whole genome duplication event. Synteny analysis revealed a high conservation and similarity of the genome structure in all four species. Differences in the content of genes of fatty acid and cyanogenic glycoside biosynthesis were found between the species. An antimicrobial gene with a conserved cysteine motif was found in all four species. The four genomes provide reference genomes for exploring genetic variation across the genus in wild and domesticated germplasm, targeting conservation of genetic resources and supporting plant breeding.
- Genome assembly
- Genome annotation
- Comparative genomics
- Wild species

Supplementary information

Supplementary Table S1 HiFiasm contig assembly statistics and Benchmarking Universal Single-Copy Orthologs (BUSCO) completeness in four Macadamia species.
Supplementary Table S2 Chromosome-level assemblies for four Macadamia species, indicating the length of each chromosome and the overall assembly length.
Supplementary Table S3 Genome estimation statistics of four Macadamia species through K-mer analysis (using Jellyfish tool) and flow cytometry.
Supplementary Table S4 Repeat element distribution across Macadamia species.
Supplementary Table S5 Telomere distribution across all the four macadamia assemblies.
Supplementary Table S6 Distribution of gene families (Fatty acid, cyanogenic and WRKY) across the four species of Macadamia.
Supplementary Table S7 Distribution table of Orthologous gene clusters across the four Macadamia species and Telopea.
Supplementary Fig. S1 (a): K-mer profile (k = 21) spectrum analysis to estimate genome size of M. jansenii generated from short read sequence data using Jellyfish and GenomeScope. (b): K-mer profile (k = 21) spectrum analysis to estimate genome size of M. ternifolia generated from short read sequence data using Jellyfish and GenomeScope. (c): K-mer profile (k = 21) spectrum analysis to estimate genome size of M. integrifolia generated from short read sequence data using Jellyfish and GenomeScope. (d): K-mer profile (k = 21) spectrum analysis to estimate genome size of M. tetraphylla generated from short read sequence data using Jellyfish and GenomeScope.
Supplementary Fig. S2 Dotplots illustrating the genomic comparison of the M. jansenii Hi-C assembly (used as reference) against all four assembled Macadamia genomes.
Supplementary Fig. S3 Dotplots illustrating the genomic comparisons between the haploid assemblies of each Macadamia species.
Supplementary Fig. S4 Species distribution graph of coding sequences of M. jansenii.
Supplementary Fig. S5 Multiple sequence alignment of the antimicrobial protein (AMP) across the four Macadamia species. 01, 02, 03, 04: AMP sequences from M. jansenii, M. ternifolia, M. integrifolia and M. tetraphylla, respectively.
Supplementary Fig. S6 Distribution of unique and common orthologous gene clusters across the Macadamia species and Telopea.
Supplementary Fig. S7 Phylogenetic tree of Macadamia species with Telopea, with number of orthogroups corresponding to each species in purple.
Supplementary Fig. S8 Self synteny of four Macadamia species, showing the collinearity of genes across the genome assemblies.

Rights and permissions
Copyright: © 2024 by the author(s). Published by Maximum Academic Press on behalf of Hainan University. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.

References

[1] Trueman SJ. 2013. The reproductive biology of macadamia. Scientia Horticulturae 150:354−59 doi: 10.1016/j.scienta.2012.11.032
[2] O'Connor K, Hayes B, Topp B. 2018. Prospects for increasing yield in macadamia using component traits and genomics. Tree Genetics & Genomes 14:7 doi: 10.1007/s11295-017-1221-1
[3] Kilian B, Dempewolf H, Guarino L, Werner P, Coyne C, et al. 2021. Crop Science special issue: Adapting agriculture to climate change: A walk on the wild side. Crop Science 61(1):32−36 doi: 10.1002/csc2.20418
[4] Nock CJ, Baten A, Mauleon R, Langdon KS, Topp B, et al. 2020. Chromosome-scale assembly and annotation of the macadamia genome (Macadamia integrifolia HAES 741). G3 Genes|Genomes|Genetics 10(10):3497−504 doi: 10.1534/g3.120.401326
[5] Nock CJ, Baten A, Barkla BJ, Furtado A, Henry RJ, et al. 2016. Genome and transcriptome sequencing characterises the gene space of Macadamia integrifolia (Proteaceae). BMC Genomics 17:937 doi: 10.1186/s12864-016-3272-3
[6] Murigneux V, Rai SK, Furtado A, Bruxner TJC, Tian W, et al. 2020. Comparison of long-read methods for sequencing and assembly of a plant genome. GigaScience 9:giaa146 doi: 10.1093/gigascience/giaa146
[7] Sharma P, Murigneux V, Haimovitz J, Nock CJ, Tian W, et al. 2021. The genome of the endangered Macadamia jansenii displays little diversity but represents an important genetic resource for plant breeding. Plant Direct 5(12):e364 doi: 10.1002/pld3.364
[8] Sharma P, Al-Dossary O, Alsubaie B, Al-Mssallem I, Nath O, et al. 2021a. Improvements in the sequencing and assembly of plant genomes. GigaByte 2021:gigabyte24 doi: 10.46471/gigabyte.24
[9] Sharma P, Masouleh AK, Topp B, Furtado A, Henry RJ. 2022. De novo chromosome level assembly of a plant genome from long read sequence data. The Plant Journal 109(3):727−36 doi: 10.1111/tpj.15583
[10] Xia C, Jiang S, Tan Q, Wang W, Zhao L, et al. 2022. Chromosomal-level genome of macadamia (Macadamia integrifolia). Tropical Plants 1:3 doi: 10.48130/tp-2022-0003
[11] Lin J, Zhang W, Zhang X, Ma X, Zhang S, et al. 2022. Signatures of selection in recently domesticated macadamia. Nature Communications 13:242 doi: 10.1038/s41467-021-27937-7
[12] Niu Y, Li G, Ni S, He X, Zheng C, et al. 2022. The chromosome-scale reference genome of Macadamia tetraphylla provides insights into fatty acid biosynthesis. Frontiers in Genetics 13:835363 doi: 10.3389/fgene.2022.835363
[13] Si X, Lyu S, Hussain Q, Ye H, Huang C, et al. 2023. Analysis of Delta (9) fatty acid desaturase gene family and their role in oleic acid accumulation in Carya cathayensis kernel. Frontiers in Plant Science 14:1193063 doi: 10.3389/fpls.2023.1193063
[14] Hu W, Fitzgerald M, Topp B, Alam M, O'Hare TJ. 2022. Fatty acid diversity and interrelationships in macadamia nuts. LWT 154:112839 doi: 10.1016/j.lwt.2021.112839
[15] Irmisch S, Clavijo McCormick A, Boeckler GA, Schmidt A, Reichelt M, et al. 2013. Two herbivore-induced cytochrome P450 enzymes CYP79D6 and CYP79D7 catalyze the formation of volatile aldoximes involved in poplar defense. The Plant Cell 25(11):4737−54 doi: 10.1105/tpc.113.118265
[16] Hansen CC, Sørensen M, Veiga TA, Zibrandtsen JF, Heskes AM, et al. 2018. Reconfigured cyanogenic glucoside biosynthesis in Eucalyptus cladocalyx involves a cytochrome P450 CYP706C55. Plant Physiology 178(3):1081−95 doi: 10.1104/pp.18.00998
[17] He X, Li JJ, Chen Y, Yang JQ, Chen XY. 2019. Genome-wide analysis of the WRKY gene family and its response to abiotic stress in buckwheat (Fagopyrum tataricum). Open Life Sciences 14(1):80−96 doi: 10.1515/biol-2019-0010
[18] Tiley GP, Barker MS, Burleigh JG. 2018. Assessing the performance of Ks plots for detecting ancient whole genome duplications. Genome Biology and Evolution 10(11):2882−98 doi: 10.1093/gbe/evy200
[19] Zwaenepoel A, Van de Peer Y. 2019. wgd—simple command line tools for the analysis of ancient whole-genome duplications. Bioinformatics 35(12):2153−55 doi: 10.1093/bioinformatics/bty915
[20] Nakandala U, Masouleh AK, Smith MW, Furtado A, Mason P, et al. 2023. Haplotype resolved chromosome level genome assembly of Citrus australis reveals disease resistance and other citrus specific genes. Horticulture Research 10(5):uhad058 doi: 10.1093/hr/uhad058
[21] Zhang X, Chen S, Shi L, Gong D, Zhang S, et al. 2021. Haplotype-resolved genome assembly provides insights into evolutionary history of the tea plant Camellia sinensis. Nature Genetics 53(8):1250−59 doi: 10.1038/s41588-021-00895-y
[22] Cheng H, Concepcion GT, Feng X, Zhang H, Li H. 2021. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18(2):170−75 doi: 10.1038/s41592-020-01056-5
[23] McManus AM, Nielsen KJ, Marcus JP, Harrison SJ, Green JL, et al. 1999. MiAMP1, a novel protein from Macadamia integrifolia adopts a Greek key β-barrel fold unique amongst plant antimicrobial proteins. Journal of Molecular Biology 293(3):629−38 doi: 10.1006/jmbi.1999.3163
[24] Li J, Hu S, Jian W, Xie C, Yang X. 2021. Plant antimicrobial peptides: structures, functions, and applications. Botanical Studies 62:5 doi: 10.1186/s40529-021-00312-x
[25] Campos ML, de Souza CM, de Oliveira KBS, Dias SC, Franco OL. 2018. The role of antimicrobial peptides in plant immunity. Journal of Experimental Botany 69(21):4997−5011 doi: 10.1093/jxb/ery294
[26] Furtado A. 2014. DNA extraction from vegetative tissue for next-generation sequencing. In Cereal Genomics. Methods in Molecular Biology, eds. Henry R, Furtado A. Totowa, NJ: Humana Press. pp. 1−5 doi: 10.1007/978-1-62703-715-0_1
[27] Rubio-Piña JA, Zapata-Pérez O. 2011. Isolation of total RNA from tissues rich in polyphenols and polysaccharides of mangrove plants. Electronic Journal of Biotechnology 14(5):1−8 doi: 10.2225/vol14-issue5-fulltext-10
[28] Alonge M, Soyk S, Ramakrishnan S, Wang X, Goodwin S, et al. 2019. RaGOO: fast and accurate reference-guided scaffolding of draft genomes. Genome Biology 20:224 doi: 10.1186/s13059-019-1829-6
[29] Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. 2015. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31(19):3210−12 doi: 10.1093/bioinformatics/btv351
[30] Gurevich A, Saveliev V, Vyahhi N, Tesler G. 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29:1072−75 doi: 10.1093/bioinformatics/btt086
[31] Galbraith DW, Harkins KR, Maddox JM, Ayres NM, Sharma DP, et al. 1983. Rapid flow cytometric analysis of the cell cycle in intact plant tissues. Science 220(4601):1049−51 doi: 10.1126/science.220.4601.1049
[32] Arumuganathan K, Earle ED. 1991. Estimation of nuclear DNA content of plants by flow cytometry. Plant Molecular Biology Reporter 9:229−41 doi: 10.1007/BF02672073
[33] Sadhu A, Bhadra S, Bandyopadhyay M. 2016. Novel nuclei isolation buffer for flow cytometric genome size estimation of Zingiberaceae: a comparison with common isolation buffers. Annals of Botany 118(6):1057−70 doi: 10.1093/aob/mcw173
[34] Doležel J, Greilhuber J, Suda J. 2007. Estimation of nuclear DNA content in plants using flow cytometry. Nature Protocols 2(9):2233−44 doi: 10.1038/nprot.2007.310
[35] International Rice Genome Sequencing Project, Sasaki T. 2005. The map-based sequence of the rice genome. Nature 436:793−800 doi: 10.1038/nature03895
[36] Marçais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6):764−70 doi: 10.1093/bioinformatics/btr011
[37] Ranallo-Benavidez TR, Jaron KS, Schatz MC. 2020. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature Communications 11:1432 doi: 10.1038/s41467-020-14998-3
[38] Pérez-Wohlfeil E, Diaz-del-Pino S, Trelles O. 2019. Ultra-fast genome comparison for large-scale genomic experiments. Scientific Reports 9:10274 doi: 10.1038/s41598-019-46773-w
[39] Brůna T, Hoff KJ, Lomsadze A, Stanke M, Borodovsky M. 2021. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genomics and Bioinformatics 3(1):lqaa108 doi: 10.1093/nargab/lqaa108
[40] Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. 2019. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology 37(8):907−15 doi: 10.1038/s41587-019-0201-4
[41] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078−79 doi: 10.1093/bioinformatics/btp352
[42] Conesa A, Götz S. 2008. Blast2GO: a comprehensive suite for functional analysis in plant genomics. International Journal of Plant Genomics 1:619832 doi: 10.1155/2008/619832
[43] Jones P, Binns D, Chang HY, Fraser M, Li W, et al. 2014. InterProScan 5: genome-scale protein function classification. Bioinformatics 30(9):1236−40 doi: 10.1093/bioinformatics/btu031
[44] Naithani S, Gupta P, Preece J, D'Eustachio P, Elser JL, et al. 2020. Plant Reactome: a knowledgebase and resource for comparative pathway analysis. Nucleic Acids Research 48(D1):D1093−D1103 doi: 10.1093/nar/gkz996
[45] Kanehisa M, Goto S. 2000. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28(1):27−30 doi: 10.1093/nar/28.1.27
[46] Emms DM, Kelly S. 2019. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology 20:238 doi: 10.1186/s13059-019-1832-y
[47] Sun J, Lu F, Luo Y, Bie L, Xu L, et al. 2023. OrthoVenn3: an integrated platform for exploring and visualizing orthologous data across genomes. Nucleic Acids Research 51(W1):W397−W403 doi: 10.1093/nar/gkad313
[48] Chapman BA, Bowers JE, Schulze SR, Paterson AH. 2004. A comparative phylogenetic approach for dating whole genome duplication events. Bioinformatics 20(2):180−85 doi: 10.1093/bioinformatics/bth022
[49] Goel M, Sun H, Jiao WB, Schneeberger K. 2019. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biology 20:277 doi: 10.1186/s13059-019-1911-0
[50] Li H. 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18):3094−100 doi: 10.1093/bioinformatics/bty191
[51] Goel M, Schneeberger K, et al. 2022. Plotsr: visualizing structural similarities and rearrangements between multiple genomes. Bioinformatics 38(10):2922−26 doi: 10.1093/bioinformatics/btac196
[52] Wang Y, Tang H, DeBarry JD, Tan X, Li J, et al. 2012. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Research 40(7):e49 doi: 10.1093/nar/gkr1293
[53] Bandi V, Gutwin C. 2020. Interactive exploration of genomic conservation. Proceedings of Graphics Interface 2020, University of Toronto, 28−29 May 2020. pp. 74−83 doi: 10.20380/GI2020.09

About this article

Cite this article

Sharma P, Masouleh AK, Constantin L, Topp B, Furtado A, et al. 2024. Genome sequences to support conservation and breeding of Macadamia. Tropical Plants 3: e035 doi: 10.48130/tp-0024-0029

Introduction

Macadamia, a genus of evergreen trees in the family Proteaceae, is highly valued for its unique flavor, texture, and nutritional properties. It is native to Australia but has been introduced and is now widely cultivated in other parts of the world, including Hawaii, South Africa, Vietnam, China, and Central and South America. The genus comprises four species, M. integrifolia (Maiden & Betche), M. tetraphylla (L. A. S. Johnson), M. ternifolia (F. Muell), and M. jansenii (C.L. Gross), of which only M. integrifolia, M. tetraphylla, and their hybrids are used for commercial production of edible kernel. The other two species are non-commercial because of the high content of cyanogenic glycosides in the mature kernels^[1]. In the absence of high-quality genomic data for Macadamia, breeding programs for crop improvement have relied heavily on phenotypic characteristics, primarily from the two commercial species. This reliance poses a risk of diminishing genetic diversity^[2,3]. To enhance breeding accuracy, there is therefore a critical need for high-quality genomic data that can provide comprehensive insights into the genetic makeup and variability within Macadamia species. Among the four species, the M. integrifolia (HAES 741) genome was the first to be sequenced, using Illumina short reads^[4]. This 518 Mb assembly was highly fragmented (N50: 4,745 bp) and incomplete, containing 77.4% of BUSCO genes and covering only 79% of the genome^[5]. HAES 741 was later reassembled using Pacific Biosciences (PacBio) long-read data combined with the Illumina short-read sequences^[4]. This chromosome-level assembly was more contiguous than the previous one, with a size of 745 Mb, an N50 of 413 kb, and 90.2% of BUSCO genes. M. jansenii was the second macadamia species to be sequenced; contig-level de novo assemblies were generated using three different long-read sequencing methods^[6]. Among the three assemblies, the PacBio continuous long read (CLR) contig assembly outperformed the others in terms of contiguity (N50: 1.55 Mb). This PacBio CLR M. jansenii contig-level assembly was scaffolded to chromosome level using chromosome conformation capture (Hi-C): 762 contigs were reduced to 219 scaffolds, of which 14 were of chromosome length, and genome contiguity was improved more than 50 times (N50: 52.1 Mb), with 97% BUSCO^[7].

All four Macadamia species were then sequenced with PacBio HiFi reads and, for the first time, assembled together, using the Improved Phased Assembler (IPA). That study reported that the PacBio HiFi contig-level assembly outperformed the earlier CLR contig and scaffold assemblies^[8]. A further update on the M. jansenii contig-level assembly reported the possibility of achieving a de novo assembly at near chromosome level from sequence data alone^[9]. Recently, more contiguous and complete assemblies of the M. integrifolia Chinese cultivar GUIRE 1 (GR1)^[10], the Hawaiian cultivar 'Kau'^[11], and the M. tetraphylla genome^[12] were also reported. The M. integrifolia (GR1) chromosome-level genome was assembled using Nanopore sequencing, producing a genome of 807 Mb with a scaffold N50 of 54.7 Mb and 95.7% BUSCO. The M. integrifolia (Kau) genome was assembled using PacBio RSII with Hi-C and yielded a genome of 794 Mb with 92% complete BUSCO. The M. tetraphylla genome was assembled with Hi-C to give a 750 Mb genome with an N50 of 51 Mb and a BUSCO of 90%.

The available macadamia genome assemblies were produced with different sequencing technologies and assembly pipelines. This variability makes the genomic data difficult to integrate and hinders the comprehensive understanding needed for accurate breeding. To address this limitation, this study aimed to assemble the genomes of all four Macadamia species using the same sequencing platform and assembly pipeline, an approach that enables more reliable and accurate comparative genome analysis. The genomic data generated in this study will help to identify species-specific genes and variation among the four species. Genes for desirable characteristics present in the non-commercial species may be identified for incorporation into domesticated cultivars, widening the gene pool of domesticated macadamia.

Discussion

In this study, high-quality reference genomes and annotations were created for the four species of Macadamia. The completeness of the gene model set, as measured by BUSCO, suggested that the annotation pipeline was suitable for comprehensive capture of protein-coding genes. Comparison of the previously available genomes of M. jansenii, M. integrifolia, and M. tetraphylla with those generated in this study revealed notable improvements in the assembly statistics. For M. jansenii, the new assembly showed an increase in length (from 758 to 773 Mb), an improvement in N50 from 52 to 55 Mb, and a slight improvement in BUSCO compared with the existing M. jansenii Hi-C assembly^[7]. This study has greatly improved the M. integrifolia (cultivar 741) genome, with a longer assembly length of 775 Mb, a significantly higher BUSCO of 97%, and an N50 of 53 Mb, compared with the previous assemblies by Nock et al. in 2016 (N50: 4.7 kb)^[5] and 2020 (N50: 413 kb)^[4]. Similarly, the M. tetraphylla genome showed great improvement, with an N50 of 56 Mb and 98% BUSCO, compared with the previously available M. tetraphylla genome^[12]. The genome assemblies generated in this study offer enhanced continuity, higher BUSCO completeness, and increased gene identification compared with previous versions, providing a robust basis for genome comparison. Additionally, the assemblies attained complete coverage from telomere to telomere for most of the chromosomes, which has not been reported in previous studies.

Comparison of the collapsed assembly statistics of the four Macadamia species showed that the M. tetraphylla assembly had the longest genome length and the M. jansenii assembly the shortest. Comparison of gene content across the four species revealed that the M. integrifolia assembly had the highest number of genes, followed by M. ternifolia and M. tetraphylla. These variations in gene counts may be attributed to species-specific genomic features. Haploid-resolved assemblies are essential in genomics research, as they facilitate accurate gene phasing, improved annotation, and enhanced insights into genetic diversity^[20−22]. Heterozygosity between the haplotypes in diploids can complicate genome assembly. The low heterozygosity of M. jansenii and the high heterozygosity of M. integrifolia and M. tetraphylla^[4,7,10,12] were also supported by k-mer analysis, haploid assembly statistics, and dot plot comparisons. The dot plot comparison of the two M. jansenii haploid assemblies showed minimal differences between the two. In contrast, the highly heterozygous species, M. integrifolia and M. tetraphylla, exhibited significant differences in the dot plots, gene numbers, structural rearrangements, and individual chromosome lengths. These findings highlight the genomic variation at the haploid level among the different Macadamia species, providing valuable insights into their genetic diversity.

Antimicrobial proteins (AMPs) are essential components of plant innate immunity, exhibiting diverse activities such as antibacterial, antifungal, insecticidal, and antiviral effects that enable effective defense against pathogens and pests^[23,24]. Comparative analysis of the gene encoding a well-known AMP across the four macadamia species showed that the gene location was conserved on chromosome 9 in all species. Sequence alignment revealed a highly conserved cysteine motif, although the amino acid sequence was otherwise variable. These results are consistent with earlier reports of these novel proteins^[23−25]. The present study reveals the sequences of the genes and confirms a high level of conservation across the Macadamia species.

The variable distribution of CYP79 across the four species may indicate potential deviations from the conventional cyanogenic glycoside biosynthesis pathway in the two bitter species, M. jansenii and M. ternifolia. In contrast, the uniform distribution of CYP71 across all species indicates an essential role. The differential counts of the detoxifying enzymes BGLU and UGT underscore species-specific strategies, with the lack of UGT genes in M. tetraphylla suggesting a different detoxification mechanism. These insights will be important in breeding new varieties that make use of genotypes from all four species. Breeding new varieties with smaller trees for intensified production will require introgression of smaller plant stature from the two bitter species (M. ternifolia and M. jansenii), while avoiding the transfer of the cyanogenic glycoside genes associated with bitterness.

A unique feature of Macadamia is its novel fatty acid composition. The analysis of fatty acid pathway genes showed that M. integrifolia stands out with the highest numbers of both FatA and FatB genes, signifying a strong capability for fatty acid production, which may explain why the domestication of Macadamia has been based mainly on this species. Additionally, the higher abundance of SAD genes across the four species suggests an active role in desaturation, as confirmed by Hu et al.^[14], highlighting the efficiency of C18:0 to C18:1 conversion. The absence of KAS genes in M. integrifolia suggests that its fatty acid metabolism pathway may differ from that of the other three species, which possess KAS genes (especially M. jansenii and M. ternifolia, with 10 each) and therefore the capacity to extend fatty acid chains. Variations in WRKY protein counts (ranging from 58 to 61) across the Macadamia species are consistent with roles in development and stress responses. This analysis has greatly expanded knowledge of fatty acid biosynthesis in Macadamia and identified significant differences between species, suggesting the need for further studies of the differences in lipid composition within Macadamia.

Using long-read assemblies in this study of Macadamia gene families significantly increased the accuracy of the results for expansion and contraction events. This accuracy is crucial for identifying essential genes and gene families involved in important biological processes, and hence for the accurate interpretation of expansion-contraction (CAFE) analysis. Remarkably, the edible macadamia species showed a higher incidence of expansion and contraction, while the bitter species exhibited fewer changes. This observation implies potential differences in the distribution of gene families between the two groups, suggesting distinct evolutionary trajectories. Understanding the factors behind the expansion of particular gene families in the edible Macadamia species could provide valuable clues about the evolution of Macadamia and be harnessed for the development of improved cultivars with desirable traits. Moreover, the presence of common Ks peaks in the four Macadamia species suggests significant evolutionary events that have shaped their genomes. Comparison of the Ks plots of the Macadamia and Telopea genomes suggests that Telopea has also undergone a duplication event. This may be a separate event, but the possibility of different rates of sequence evolution makes conclusions difficult. Synteny analysis further highlights the conservation of genetic regions and genes within each species and reveals intriguing similarities among the different species, particularly on chromosomes 9 and 10. These findings emphasize the importance of whole genome duplication in shaping the genetic landscape of macadamia and provide valuable insights into the evolutionary dynamics of this economically important crop. The analysis of orthologous clusters and gene families among the four Macadamia species and Telopea provided valuable insights into the conservation and divergence of genes in these plants. Among the 195,004 proteins grouped into 34,696 gene clusters, only 31 clusters showed overlaps among two or more species, while 8,217 clusters contained conserved single-copy genes across the five species. These unique gene clusters hold great promise for uncovering distinct biological functions and potential adaptations of each species. The phylogenetic tree, with Telopea as the outgroup, shows two main branches: one containing M. integrifolia and M. tetraphylla and the other comprising M. jansenii and M. ternifolia, illustrating the genetic relationships among the Macadamia species. The core orthologous genes, as expected, included gene families related to categories such as cell growth, DNA replication and repair, metabolism, and cell cycle regulation.
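The way orthologous clusters are sorted into categories such as conserved single-copy or species-specific can be illustrated with a minimal Python sketch. The cluster identifiers, species labels, gene counts, and category names below are hypothetical and chosen only for illustration; they are not output from OrthoVenn3 or OrthoFinder and do not reproduce the numbers reported in this study.

```python
from collections import Counter

# Hypothetical labels for the five genomes compared (four Macadamia species plus Telopea)
SPECIES = ("M_integrifolia", "M_tetraphylla", "M_ternifolia", "M_jansenii", "Telopea")


def classify_cluster(counts):
    """Classify one orthologous cluster from its per-species gene counts."""
    present = [s for s in SPECIES if counts.get(s, 0) > 0]
    if not present:
        return "empty"
    if len(present) == len(SPECIES):
        # Present in every species: single-copy if each species has exactly one gene
        if all(counts[s] == 1 for s in SPECIES):
            return "single_copy_core"
        return "multi_copy_core"
    if len(present) == 1:
        return "species_specific"
    return "shared_subset"


# Toy clusters (illustrative values only, not data from the study)
toy_clusters = {
    "OG1": {s: 1 for s in SPECIES},
    "OG2": {**{s: 1 for s in SPECIES}, "M_integrifolia": 3},
    "OG3": {"M_jansenii": 2},
    "OG4": {"M_integrifolia": 1, "M_tetraphylla": 1},
}

categories = {name: classify_cluster(c) for name, c in toy_clusters.items()}
summary = Counter(categories.values())
print(summary)
```

Counting clusters per category in this way gives the kind of summary shown in the orthologous cluster distribution (Supplementary Table S7 and Supplementary Fig. S6).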

The comparative genomics and experimental study presented here allows, for the first time, a genus-wide view of the biological diversity of Macadamia and provides a strong foundation for genome-wide analysis.

Materials and methods

DNA and RNA sample

Fresh leaf samples of all four species were collected from trees in ex situ collections at Nambour and Tiaro, operated by the Department of Agriculture and Fisheries. Accessions: M. jansenii (ANAM82-5-11), M. integrifolia (741), M. tetraphylla (GTIARO1-17-7) and M. ternifolia (GTIARO1-2-14). Leaf tissue was ground under cryogenic conditions using a mortar, pestle, and Tissue Lyser. DNA was isolated from all accessions following a modified extraction method^[26] that excluded phenol. The HiFi sequencing data of the four Macadamia species^[8] were used for this study. RNA sequence data for M. jansenii were taken from Sharma et al.^[8]. Total RNA for M. ternifolia and M. tetraphylla was extracted from fresh leaf tissues using the RNA isolation method of Rubio-Piña & Zapata-Pérez^[27] together with the Qiagen kit method and sent for short-read sequencing at Macrogen Oceania, NSW, Australia. RNA-Seq data for young leaves of M. integrifolia (HAES 741) were downloaded from the NCBI SRA (accession SRR10897159).

Genome assembly
The HiFi reads of the four species were assembled using HiFiasm to generate both the collapsed and the haploid assemblies^[8,22]. Contigs of more than 1 Mb in size from the HiFiasm assemblies were then scaffolded using a reference-guided approach with the RagTag tool^[28], using the M. jansenii Hi-C assembly as the reference^[7]. The chromosomes were numbered according to the M. integrifolia genome^[5]. To assess the completeness of the assemblies, Benchmarking Universal Single-Copy Orthologs (BUSCO) (version 5.4.6)^[29] was used with the eudicots_odb10 dataset. Assembly quality metrics were evaluated using the quality assessment tool QUAST^[30].
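
As an illustration of two steps mentioned above, the following minimal Python sketch applies the 1 Mb contig-size cut-off and computes N50, one of the continuity metrics reported by tools such as QUAST. It is not the pipeline used in this study; the function names and the example contig lengths are hypothetical.

```python
def keep_large_contigs(lengths_by_name, min_len=1_000_000):
    """Keep contigs strictly longer than min_len (default 1 Mb)."""
    return {name: n for name, n in lengths_by_name.items() if n > min_len}


def n50(lengths):
    """Return the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    # Hypothetical contig lengths in bp, for illustration only.
    contigs = {"ctg1": 56_000_000, "ctg2": 40_000_000, "ctg3": 800_000}
    large = keep_large_contigs(contigs)
    print(sorted(large), n50(large.values()))
```

In practice the contig lengths would be read from the assembly FASTA or its index rather than typed in by hand.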

Genome estimation (flow cytometry and k-mer) and dot plots
For flow cytometry, nuclei were extracted from leaf tissue by mechanical dissociation as described by Galbraith et al.^[31], with modifications for woody plant species. Briefly, 40 mg of young macadamia leaf was co-chopped with 15 mg of the internal standard Oryza sativa ssp. japonica cv. Nipponbare in 0.4 mL of ice-cold nuclear isolation buffer in a 5 cm polystyrene Petri dish. For M. tetraphylla and M. integrifolia, Arumuganathan & Earle^[32] nuclear isolation buffer was used, while MB01^[33] nuclear isolation buffer was used for M. ternifolia and M. jansenii. Samples were chopped for approximately 10−12 min, first into fine longitudinal strips with new parts of a sharp razor blade and then into perpendicular slices. The resulting homogenates were gently filtered through a pre-soaked 40-μm nylon mesh into a 5 mL round-bottom polystyrene tube. Homogenates were then stained with 50 μg/mL of propidium iodide (PI) (Sigma, P4864-10ML) and 50 μg/mL of RNase A (Qiagen, 19101) for 10 min on ice. The BD Biosciences LSR II Flow Cytometer and the FlowJo software package were used to analyze the homogenates. Briefly, fluorescence was collected using a 488 nm excitation laser tuned to 514.4 nm and a 610/20 nm bandpass filter. Instrument settings were kept constant across and throughout experiments: forward scatter voltage at 199, side scatter voltage at 300, fluorescence intensity voltage at 500, with a slow flow rate (20−50 events/s). Three biological replicates were performed on three different days. For each biological replicate, a minimum of 1,500 PI-stained events were collected per PI-stained peak. Nuclear DNA content was calculated as previously described^[34] using 388.8 Mb at 1C for the assumed size of O. sativa^[35].
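
The calculation with an internal standard is usually a simple ratio: the sample genome size equals the ratio of the sample and standard PI fluorescence peak positions multiplied by the known genome size of the standard. The Python sketch below illustrates that ratio under the assumption that both peaks represent nuclei at the same ploidy stage (for example, both G1 peaks); the exact procedure of reference 34 may differ, and the peak values shown are invented for illustration.

```python
RICE_1C_MB = 388.8  # 1C size assumed for O. sativa in the text


def genome_size_1c_mb(sample_peak, standard_peak, standard_1c_mb=RICE_1C_MB):
    """Estimate the sample 1C genome size (Mb) from the mean PI
    fluorescence of the sample and internal-standard peaks."""
    if sample_peak <= 0 or standard_peak <= 0:
        raise ValueError("peak fluorescence values must be positive")
    return sample_peak / standard_peak * standard_1c_mb


def mean_genome_size(peak_pairs, standard_1c_mb=RICE_1C_MB):
    """Average the estimate over replicates given as
    (sample_peak, standard_peak) tuples."""
    estimates = [genome_size_1c_mb(s, r, standard_1c_mb) for s, r in peak_pairs]
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    # Hypothetical peak means for three replicates (arbitrary units).
    replicates = [(20100.0, 10000.0), (19900.0, 10000.0), (20000.0, 10000.0)]
    print(round(mean_genome_size(replicates), 1))
```

Averaging over the three biological replicates, as sketched in `mean_genome_size`, mirrors the replicate design described above.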

Genome estimation using k-mer analysis was performed with the count and histo commands of Jellyfish version 2.3.0^[36]. The histo file was visualized in GenomeScope^[37]. Dot plots for the assembly comparisons were plotted using the Chromeister^[38] tool available at Galaxy Australia (https://usegalaxy.org.au).
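
For intuition about what a k-mer histogram encodes, the sketch below gives a deliberately crude estimate: the total number of k-mers divided by the depth of the main coverage peak. GenomeScope fits a statistical model that also accounts for heterozygosity and repeats, so this is not a substitute for it. The two-column input (multiplicity, count) follows the format written by Jellyfish histo; the `min_depth` cut-off for discarding likely error k-mers and the example histogram are assumptions.

```python
def parse_histo(text):
    """Parse 'multiplicity count' lines into a dict."""
    hist = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            hist[int(parts[0])] = int(parts[1])
    return hist


def naive_genome_size(hist, min_depth=5):
    """Total k-mers / peak depth, ignoring multiplicities below
    min_depth (assumed to be dominated by sequencing errors)."""
    kept = {depth: n for depth, n in hist.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(kept, key=kept.get)
    total_kmers = sum(depth * n for depth, n in kept.items())
    return total_kmers / peak_depth


if __name__ == "__main__":
    # Tiny invented histogram, for illustration only.
    example = "1 1000\n2 100\n9 10\n10 50\n11 10\n"
    print(naive_genome_size(parse_histo(example)))
```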

Genome annotation
De novo repeat elements in the collapsed assemblies of all four species were identified and classified using RepeatModeler (version 2.0.2a) (www.repeatmasker.org/RepeatModeler). The repeats identified were then masked with RepeatMasker (version 4.0.9) (www.repeatmasker.org). The gene models in the masked assemblies were identified using an ab initio method along with RNA-seq evidence in Braker3 version 3.0.3^[39]. To prepare the input files for the Braker3 run, the RNA-seq reads were first aligned to the masked assemblies using HISAT2 version 2.1^[40], then the output .sam alignment file was converted to a .bam file using samtools^[41]. The soft-masked genome assembly file and the sorted bam file were used as input for the Braker3 pipeline. The protein and coding sequence (CDS) fasta files generated by Braker3 contain multiple transcripts per gene; therefore, a Python script was used to keep only one transcript per gene. The filtered protein and CDS fasta files were then used for the downstream analysis. The telomere identification toolkit tidk version 0.2.31 (https://github.com/tolkit/telomeric-identifier) was used to identify the telomere regions in the genome assemblies using the 'search' and 'plot' commands.
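
The transcript-filtering script itself is not reproduced in the text. A minimal sketch of one way to do this is shown below, assuming transcript identifiers of the form `g1.t1`, `g1.t2` (gene ID followed by `.t` and a transcript number) and assuming that the longest transcript per gene is the one retained; the selection rule actually used in this study is not stated, so both points are assumptions.

```python
def read_fasta(text):
    """Parse FASTA text into {record_id: sequence}."""
    chunks = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {rec_id: "".join(parts) for rec_id, parts in chunks.items()}


def one_transcript_per_gene(records):
    """Keep the longest transcript for each gene; on ties the first
    transcript encountered is kept."""
    best = {}
    for transcript_id, seq in records.items():
        gene_id = transcript_id.rsplit(".t", 1)[0]
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (transcript_id, seq)
    return {tid: seq for tid, seq in best.values()}


if __name__ == "__main__":
    # Invented protein records, for illustration only.
    fasta = ">g1.t1\nMKT\n>g1.t2\nMKTAA\n>g2.t1\nMP\n"
    for tid, seq in one_transcript_per_gene(read_fasta(fasta)).items():
        print(f">{tid}\n{seq}")
```

Applying the same list of retained transcript IDs to both the protein and the CDS files would keep the two filtered sets consistent.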

Functional annotation of the gene set identified for each of the four genomes was performed with OmicsBox (version 3.0.27) (OmicsBox, 2019). This pipeline consists of BLAST2GO^[42] and InterProScan^[43]. For BLAST2GO, the 'blastx-fast' feature was used with the NCBI non-redundant protein sequences (nr v5) database, with the e-value set at 1e-10 and 10 blast hits retained. The taxonomy filter was set at 33090 Viridiplantae. For InterProScan, all the available databases, such as the families, structural domains, sites, and repeats databases, were selected. Pathway analysis with Plant Reactome (Gramene)^[44] and KEGG pathway^[45] was performed using OmicsBox.

Gene family analysis: Antimicrobial genes were identified across the four species by a BLAST homology search for transcripts resembling the M. integrifolia antimicrobial cDNA (MiAMP2). Sequence alignment was performed using Clone Manager ver 9.0 with an alignment parameter scoring matrix of Mismatch (2), Open Gap (4), and Extension-Gap (1). To identify genes involved in cyanogenic glycoside and fatty acid metabolism and WRKY genes across the four genomes, BLAST was performed and the top hits based on sequence similarity were selected.

Orthologous and phylogenetic analysis
Orthologous and phylogenetic analysis was performed with OrthoFinder (V2.5.5)^[46] using the protein sequences of all four Macadamia species along with data for Telopea. The common and unique sets of orthologous protein sequences among the five species were plotted using the UpSet plot and the Venn diagram of OrthoVenn3^[47]. The core or single-copy orthologs obtained from OrthoFinder were used to construct the phylogenetic tree using OrthoVenn3.
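
The criterion for a core single-copy ortholog is that the cluster contains exactly one gene from every species. The Python sketch below applies that criterion to a tab-separated gene-count table with one row per orthogroup, one column per species, and a final `Total` column, a layout assumed to resemble the gene-count table written by OrthoFinder. OrthoFinder also reports single-copy orthogroups directly, so the sketch only illustrates the rule; the species column names and counts are invented.

```python
import csv
import io


def single_copy_orthogroups(tsv_text):
    """Return IDs of orthogroups with exactly one gene in every species."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    selected = []
    for row in reader:
        counts = [
            int(value)
            for column, value in row.items()
            if column not in ("Orthogroup", "Total")
        ]
        if counts and all(count == 1 for count in counts):
            selected.append(row["Orthogroup"])
    return selected


if __name__ == "__main__":
    # Invented counts for five species, for illustration only.
    table = (
        "Orthogroup\tMjan\tMter\tMint\tMtet\tTelopea\tTotal\n"
        "OG0000001\t1\t1\t1\t1\t1\t5\n"
        "OG0000002\t2\t1\t1\t1\t1\t6\n"
        "OG0000003\t1\t1\t0\t1\t1\t4\n"
    )
    print(single_copy_orthogroups(table))
```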

Whole genome duplication
Whole genome duplication (WGD) analysis was performed to compute the whole set of paralogous genes in the genome using the wgd tool version 1.1.2^[19,48]. Ancient WGDs were inferred by examining the distribution of synonymous substitutions per site (Ks) among paralogs within a genome (the Ks distribution). WGD analysis of all four species of Macadamia was performed to estimate the origin and diversification. The wgd 'dmd' and 'ksd' commands were used to generate the Ks distribution plot.

Conservation of gene order and genomic regions
Pairwise whole-genome comparisons were performed using SyRI^[49] to find the structural and sequence differences between each pair of genomes. The genomes were first aligned using minimap2^[50], and samtools^[41] was used to index the alignment BAM file. The BAM file was then used to run SyRI, and the SyRI output file was passed to the visualisation tool plotSR^[51] with default parameters to visualize the synteny and the structural rearrangements between the Macadamia species.

Collinearity and expansion-contraction of gene families
The degree of collinearity within and between the genomes of the four Macadamia species was calculated using MCScanX^[52]. The protein fasta files of all four species were combined and used as input for the all-vs-all homology search with the blastp algorithm, with the e-value set at 1e-10, max target sequences at 5, and output format 6. The resulting tabular blastp file along with the combined gff file was then fed into MCScanX using default parameters. For self-synteny, MCScanX was run with default settings using the blastp output and the gff file of each individual species. The web-based tool SynVisio^[53] was used to visualize collinearity. The CAFE5 tool in OrthoVenn3 was used to analyse the expansion and contraction of the gene families, with all parameters at their defaults.
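
The blastp thresholds above are applied at search time, but the same limits can be checked or re-applied to an existing tabular file. The sketch below does this in Python for the standard 12-column BLAST output format 6 (query, subject, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, e-value, bit score). It is an illustrative post hoc filter, not part of the MCScanX workflow described here, and the function name and example hits are hypothetical.

```python
from collections import defaultdict


def filter_hits(lines, max_evalue=1e-10, max_targets=5):
    """Keep hits with e-value <= max_evalue and at most max_targets
    distinct subjects per query, in input order."""
    kept = []
    subjects_seen = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or comment lines
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > max_evalue:
            continue
        targets = subjects_seen[query]
        if subject not in targets:
            if len(targets) >= max_targets:
                continue
            targets.append(subject)
        kept.append(fields)
    return kept


if __name__ == "__main__":
    # Invented hits, for illustration only.
    rows = [
        "g1.t1\tg7.t1\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-80\t350",
        "g1.t1\tg9.t1\t40.0\t80\t48\t2\t1\t80\t5\t84\t1e-04\t45",
    ]
    for hit in filter_hits(rows):
        print(hit[0], hit[1], hit[10])
```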

Author contributions

The authors confirm contribution to the paper as follows: project design and supervision: Henry RJ, Masouleh AK, Furtado A, Topp B; genome assembly, annotation and downstream analysis: Sharma P, Masouleh AK; flow-cytometry analysis: Constantin L; draft manuscript preparation: Sharma P, Constantin L; data deposition: Sharma P. All authors reviewed the results and approved the final version of the manuscript.

	M. jansenii			M. ternifolia			M. integrifolia			M. tetraphylla
	Hap1	Hap2	Collapsed	Hap1	Hap2	Collapsed	Hap1	Hap2	Collapsed	Hap1	Hap2	Collapsed
Assembly length (Mb)	761	735	773	766b	748	780	748	751	775	776	775	795
Complete BUSCO	98.9%	95.0%	97.7%	97.1%	96.5%	97.7%	95.1%	94.3%	97.6%	97.4%	97.3%	97.8%
Single	83.3%	82.1%	84.2%	83.8%	83.4%	84.1%	82.4%	81.6%	84.1%	83.5%	83.8%	83.7%
Double	13.6%	12.9%	13.5%	13.3%	13.1%	13.6%	12.7%	12.7%	13.5%	13.9%	13.5%	14.1%
Fragmented	0.6%	0.6%	0.7%	0.8%	0.8%	0.8%	0.9%	0.6%	0.6%	0.8%	0.7%	0.7%
Missing	2.5%	4.4%	1.6%	2.1%	2.7%	1.5%	4.0%	5.1%	1.8%	1.8%	2.0%	1.5%
N50 (Mb)	54.2	51.7	54.7	53.8	51.8	53.8	52.8	53	53.7	54	56	56
The chromosomes were numbered according to the M. integrifolia* genome which used the seven genetic linkage maps^[4].

	M. jansenii			M. ternifolia			M. integrifolia			M. tetraphylla
	Hap1	Hap2	Collapsed	Hap1	Hap2	Collapsed	Hap1	Hap2	Collapsed	Hap1	Hap2	Collapsed
Chr_01	2483	2543	2474	2455	2484	2612	2483	2389	2665	2643	2521	2631
Chr_02	2666	2514	2608	2774	2666	2739	2453	2613	2699	2664	2735	2786
Chr_03	2802	2868	2844	3007	2943	3053	2837	2771	2974	2949	2917	3017
Chr_04	2780	2670	2718	2832	2706	2931	2833	2746	3078	3142	2782	2813
Chr_05	2800	2783	2798	2798	2636	2911	2755	2569	2814	2746	2780	2866
Chr_06	2607	2579	2568	2623	2465	2683	2585	2616	2667	2702	2731	2709
Chr_07	2790	2702	2696	2764	2699	2836	2810	2587	2623	2711	2578	2712
Chr_08	2768	2768	2677	2742	2671	2770	2509	2802	2878	3062	2869	2837
Chr_09	2870	2897	2878	2915	2874	3053	3373	2816	3842	3626	2978	3137
Chr_10	2402	2359	2428	2301	2209	2463	2699	2367	3103	3710	2295	2392
Chr_11	2820	2896	2812	2910	2845	3001	2917	2879	3087	2888	3024	2935
Chr_12	2590	2567	2517	2642	2408	2721	2576	2092	2538	2617	2430	2566
Chr_13	2766	2627	2732	2641	2716	2790	2684	2723	2875	2694	2663	2724
Chr_14	2560	2409	2448	2598	2474	2626	2446	2495	2691	2634	2534	2608
Total no. of genes	37704	37182	37198	38002	36796	39189	37960	36465	40534	40788	37837	38733
Number of mRNA	43510	43098	43092	44506	43016	45694	44527	43010	47301	47184	44490	45519
Number of CDS	43510	43098	43092	44506	43016	45694	44527	43010	47301	47184	44490	45519
