A nearly gapless, highly contiguous reference genome for a doubled haploid line of <i>Populus ussuriensis</i>, enabling advanced genomic studies

Wenxuan Liu; Caixia Liu; Song Chen; Meng Wang; Xinyu Wang; Yue Yu; Ronald R. Sederoff; Hairong Wei; Xiangling You; Guanzheng Qu; Su Chen

doi:10.48130/forres-0024-0016

2024 Volume 4

1. State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin 150040, China
2. College of Life Science, Northeast Forestry University, Harbin 150040, China
3. Forest Biotechnology Group, Department of Forestry and Environmental Resources, North Carolina State University, Raleigh, NC 27695, USA
4. College of Forest Resources and Environmental Science, Michigan Technological University, MI 49931, USA
5. Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education, Northeast Forestry University, Harbin 150040, China
^# Authors contributed equally: Wenxuan Liu, Caixia Liu, Song Chen

Corresponding authors: youxiangling@nefu.edu.cn; gzqu@nefu.edu.cn; chensu@nefu.edu.cn

Received: 05 February 2024
Revised: 19 March 2024
Accepted: 17 April 2024
Published online: 13 May 2024
Forestry Research 4, Article number: e019 (2024)

Abstract

Populus species, particularly P. trichocarpa, have long served as model trees for genomics research, owing to the availability of fully sequenced genomes. However, high heterozygosity and the presence of repetitive regions, including centromeres and ribosomal RNA gene clusters, have left 59 unresolved gaps, accounting for approximately 3.32% of the P. trichocarpa genome. In this study, the callus induction method was improved to derive a doubled haploid (DH) callus line from P. ussuriensis anthers. Leveraging long-read sequencing, we assembled a nearly gap-free, telomere-to-telomere (T2T) P. ussuriensis genome spanning 412.13 Mb. This genome assembly contains only seven gaps and has a contig N50 length of 19.50 Mb. Annotation revealed 34,953 protein-coding genes in this genome, 465 more than in P. trichocarpa. Notably, we identified and annotated the centromeric regions, which are characterized by higher-order repeats, in all chromosomes of the DH genome, a first for poplars. The DH genome exhibits high collinearity with that of P. trichocarpa and fills gaps present in the latter's genome. This T2T P. ussuriensis reference genome will not only enhance our understanding of genome structure and function within the poplar genus but also provide valuable resources for poplar genomic and evolutionary studies.
- Doubled haploid
- Genome assembly
- Telomere
- Centromere
- Populus ussuriensis
- T2T genome

Supplementary information

Supplemental Table S1 Primer sequences for the identification of haploids.
Supplemental Table S2 NGS reads statistics of Populus ussuriensis.
Supplemental Table S3 Estimation of genome size of Populus ussuriensis based on K-mer analysis.
Supplemental Table S4 Sequencing data statistics of Populus ussuriensis.
Supplemental Table S5 HiC reads statistics of Populus ussuriensis.
Supplemental Table S6 ALLHiC unnecessary contigs annotation.
Supplemental Table S7 Populus ussuriensis genome assembly results.
Supplemental Table S8 Statistics of consistency assessment of the Populus ussuriensis genome.
Supplemental Table S9 Assessment the gene coverage rate using BUSCO.
Supplemental Table S10 Populus ussuriensis genome assembly results.
Supplemental Table S11 Arm ratio of chromosomes of Populus ussuriensis and Populus trichocarpa.
Supplemental Table S12 Assessment the proteins coverage rate of genome annotation using BUSCO.
Supplemental Table S13 Annotation of repetitive sequences.
Supplemental Table S14 Comparison of Populus genome assemblies.
Supplemental Table S15 Comparison of centromeres length between Populus ussuriensis, Populus trichocarpa and Populus tremula.
Supplemental Table S16 Populus Genome size comparison.
Supplemental Table S17 Gene annotation specific to Populus ussuriensis centromere.
Supplemental Fig. S1 Estimation of genome size of P. ussuriensis based on K-mer analysis. A. 21-mer frequency distribution of the DH15 homozygous callus line. B. 21-mer frequency distribution of the paternal anther donor tree.
Supplemental Fig. S2 Heat map of Hi-C interactions based on chromosome-scale assembly.
Supplemental Fig. S3 The numbers of HOR repeats in centromeres of 19 chromosomes.
Supplemental Fig. S4 Structure and annotation of 19 chromosome centromeres.
Supplemental Fig. S5 Inferred phylogenetic tree of P. ussuriensis based on the centromere monomers from the 19 chromosomes.
Supplemental Fig. S6 Arm ratio of chromosomes of P. ussuriensis.
Supplemental Fig. S7 Specific primers were used to validate the genes. M: DL5000 marker. 1. Pus028233, 2. Pus028236.

Rights and permissions
Copyright: © 2024 by the author(s). Published by Maximum Academic Press, Fayetteville, GA. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.

References

[1]	Zhang B, Zhu W, Diao S, Wu X, Lu J, et al. 2019. The poplar pangenome provides insights into the evolutionary history of the genus. Communications Biology 2:215 doi: 10.1038/s42003-019-0474-7
[2]	Bradshaw HD, Ceulemans R, Davis J, Stettler R. 2000. Emerging model systems in plant biology: poplar (Populus) as a model forest tree. Journal of Plant Growth Regulation 19:306−13 doi: 10.1007/s003440000030
[3]	Brunner AM, Busov VB, Strauss SH. 2004. Poplar genome sequence: functional genomics in an ecologically dominant plant species. Trends in Plant Science 9:49−56 doi: 10.1016/j.tplants.2003.11.006
[4]	Wullschleger SD, Jansson S, Taylor G. 2002. Genomics and forest biology: Populus emerges as the perennial favorite. The Plant Cell 14:2651−55 doi: 10.1105/tpc.141120
[5]	Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, et al. 2006. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313:1596−604 doi: 10.1126/science.1128691
[6]	Meinke DW, Cherry JM, Dean C, Rounsley SD, Koornneef M. 1998. Arabidopsis thaliana: a model plant for genome analysis. Science 282:662−82 doi: 10.1126/science.282.5389.662
[7]	The Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796−815 doi: 10.1038/35048692
[8]	Goff SA, Ricke D, Lan TH, Presting G, Wang R, et al. 2005. Erratum: A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296:92−100 doi: 10.1126/science.309.5736.879b
[9]	Ma J, Wan D, Duan B, Bai X, Bai Q, et al. 2019. Genome sequence and genetic transformation of a widely distributed and cultivated poplar. Plant Biotechnology Journal 17:451−60 doi: 10.1111/pbi.12989
[10]	Zhang Z, Chen Y, Zhang J, Ma X, Li Y, et al. 2020. Improved genome assembly provides new insights into genome evolution in a desert poplar (Populus euphratica). Molecular Ecology Resources 20:781−94 doi: 10.1111/1755-0998.13142
[11]	Lin YC, Wang J, Delhomme N, Schiffthaler B, Sundström G, et al. 2018. Functional and evolutionary genomic inferences in Populus through genome and population sequencing of American and European aspen. Proceedings of the National Academy of Sciences of the United States of America 115:E10970−E10978 doi: 10.1073/pnas.1801437115
[12]	Chen Z, Ai F, Zhang J, Ma X, Yang W, et al. 2020. Survival in the Tropics despite isolation, inbreeding and asexual reproduction: insights from the genome of the world's southernmost poplar (Populus ilicifolia). The Plant Journal 103:430−42 doi: 10.1111/tpj.14744
[13]	Yang W, Wang K, Zhang J, Ma J, Liu J, et al. 2017. The draft genome sequence of a desert tree Populus pruinosa. GigaScience 6:gix075 doi: 10.1093/gigascience/gix075
[14]	An X, Gao K, Chen Z, Li J, Yang X, et al. 2022. High quality haplotype-resolved genome assemblies of Populus tomentosa Carr., a stabilized interspecific hybrid species widespread in Asia. Molecular Ecology Resources 22:786−802 doi: 10.1111/1755-0998.13507
[15]	Huang X, Chen S, Peng X, Bae EK, Dai X, et al. 2021. An improved draft genome sequence of hybrid Populus alba × Populus glandulosa. Journal of Forestry Research 32:1663−72 doi: 10.1007/s11676-020-01235-2
[16]	Chen S, Yu Y, Wang X, Wang S, Zhang T, et al. 2023. Chromosome-level genome assembly of a triploid poplar Populus alba 'Berolinensis'. Molecular Ecology Resources 23:1092−107 doi: 10.1111/1755-0998.13770
[17]	Ambardar S, Gupta R, Trakroo D, Lal R, Vakhlu J. 2016. High throughput sequencing: an overview of sequencing chemistry. Indian Journal of Microbiology 56:394−404 doi: 10.1007/s12088-016-0606-4
[18]	Daccord N, Celton JM, Linsmith G, Becker C, Choisne N, et al. 2017. High-quality de novo assembly of the apple genome and methylome dynamics of early fruit development. Nature Genetics 49:1099−106 doi: 10.1038/ng.3886
[19]	Shi X, Cao S, Wang X, Huang S, Wang Y, et al. 2023. The complete reference genome for grapevine (Vitis vinifera L.) genetics and breeding. Horticulture Research 10:uhad061 doi: 10.1093/hr/uhad061
[20]	Maluszynski M, Kasha KJ, Szarejko I. 2003. Published doubled haploid protocols in plant species. In Doubled Haploid Production in Crop Plants, eds Maluszynski M, Kasha KJ, Forster BP, Szarejko I. Dordrecht: Springer. pp. 309−35. https://doi.org/10.1007/978-94-017-1293-4_46
[21]	Aboobucker SI, Jubery TZ, Frei UK, Chen YR, Foster T, et al. 2022. Protocols for in vivo doubled haploid (DH) technology in maize breeding: from haploid inducer development to haploid genome doubling. In Haploid Inducer Development to Haploid Genome Doubling, ed. Lambing C. New York, NY: Humana. 2484: 213–35. https://doi.org/10.1007/978-1-0716-2253-7_16
[22]	Zhong Y, Chen B, Wang D, Zhu X, Li M, et al. 2022. In vivo maternal haploid induction in tomato. Plant Biotechnology Journal 20:250−52 doi: 10.1111/pbi.13755
[23]	Cistué L, Vallés M, Echávarri B, Sanz JM, Castillo A. 2003. Barley anther culture. In Doubled Haploid Production in Crop Plants, eds Maluszynski M, Kasha KJ, Forster BP, Szarejko I. Dordrecht: Springer. pp. 29–34. https://doi.org/10.1007/978-94-017-1293-4_5
[24]	Zhao X, Yuan K, Liu Y, Zhang N, Yang L, et al. 2022. In vivo maternal haploid induction based on genome editing of DMP in Brassica oleracea. Plant Biotechnology Journal 20:2242−44 doi: 10.1111/pbi.13934
[25]	Pendleton M, Sebra R, Pang AWC, Ummat A, Franzen O, et al. 2015. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nature Methods 12:780−86 doi: 10.1038/nmeth.3454
[26]	Lieberman-Aiden E, Van Berkum NL, Williams L, Imakaev M, Ragoczy T, et al. 2009. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326:289−93 doi: 10.1126/science.1181369
[27]	Wang Y, Zhao Y, Bollas A, Wang Y, Au KF. 2021. Nanopore sequencing technology, bioinformatics and applications. Nature Biotechnology 39:1348−65 doi: 10.1038/s41587-021-01108-x
[28]	Zhang X, Zhang S, Zhao Q, Ming R, Tang H. 2019. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nature Plants 5:833−45 doi: 10.1038/s41477-019-0487-8
[29]	Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. 2015. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31:3210−12 doi: 10.1093/bioinformatics/btv351
[30]	Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, et al. 2020. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences of the United States of America 117:9451−57 doi: 10.1073/pnas.1921046117
[31]	Tarailo-Graovac M, Chen N. 2009. Using RepeatMasker to identify repetitive elements in genomic sequences. Current Protocols in Bioinformatics doi: 10.1002/0471250953.bi0410s25
[32]	Xu Z, Wang H. 2007. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Research 35:W265−W268 doi: 10.1093/nar/gkm286
[33]	Hu K, Liao X, Zou Y, Wang J. 2021. Accelerating RepeatClassifier based on spark and greedy algorithm with dynamic upper boundary. bioRxiv doi: 10.1101/2021.06.03.446998
[34]	Shao B, Wang H, Li Y. Trinity: a distributed graph engine on a memory cloud. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, USA, 2013. pp. 505–16. New York, NY, United States: Association for Computing Machinery. https://doi.org/10.1145/2463676.2467799.
[35]	Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, et al. 2003. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research 31:5654−66 doi: 10.1093/nar/gkg770
[36]	Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, et al. 2008. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 9:R7 doi: 10.1186/gb-2008-9-1-r7
[37]	Lin Y, Ye C, Li X, Chen Q, Wu Y, et al. 2023. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Horticulture Research 10:uhad127 doi: 10.1093/hr/uhad127
[38]	Gao S, Yang X, Guo H, Zhao X, Wang B, et al. 2023. HiCAT: a tool for automatic annotation of centromere structure. Genome Biology 24:58 doi: 10.1186/s13059-023-02900-5
[39]	Emms DM, Kelly S. 2019. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology 20:238 doi: 10.1186/s13059-019-1832-y
[40]	Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution 30:772−80 doi: 10.1093/molbev/mst010
[41]	Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. 2019. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35:4453−55 doi: 10.1093/bioinformatics/btz305
[42]	Sanderson MJ. 2003. r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics 19:301−02 doi: 10.1093/bioinformatics/19.2.301
[43]	Manchester SR, Dilcher DL, Tidwell WD. 1986. Interconnected reproductive and vegetative remains of populus (Salicaceae) from the middle Eocene green river formation, northeastern Utah. American Journal of Botany 73:156−60 doi: 10.1002/j.1537-2197.1986.tb09691.x
[44]	Nguyen LT, Schmidt HA, Von Haeseler A, Minh BQ. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution 32:268−74 doi: 10.1093/molbev/msu300
[45]	Zhang Z, Li J, Zhao X, Wang J, Wong G, et al. 2006. KaKs_Calculator: calculating Ka and Ks through model selection and model averaging. Genomics Proteomics Bioinformatics 4:259−63 doi: 10.1016/S1672-0229(07)60007-2
[46]	Ginestet C. 2011. ggplot2: Elegant graphics for data analysis. Journal of the Royal Statistical Society A: Statistics in Society 174:245−46 doi: 10.1111/j.1467-985X.2010.00676_9.x
[47]	Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, et al. 2018. MUMmer4: a fast and versatile genome alignment system. PLoS Computational Biology 14:e1005944 doi: 10.1371/journal.pcbi.1005944
[48]	Goel M, Sun H, Jiao WB, Schneeberger K. 2019. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biology 20:277 doi: 10.1186/s13059-019-1911-0
[49]	Goel M, Schneeberger K. 2022. plotsr: visualizing structural similarities and rearrangements between multiple genomes. Bioinformatics 38:2922−26 doi: 10.1093/bioinformatics/btac196
[50]	Wang Y, Tang H, DeBarry JD, Tan X, Li J, et al. 2012. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Research 40:e49 doi: 10.1093/nar/gkr1293
[51]	Shakirov EV, Chen JJL, Shippen DE. 2022. Plant telomere biology: the green solution to the end-replication problem. The Plant Cell 34:2492−504 doi: 10.1093/plcell/koac122
[52]	Lampson MA, Cheeseman IM. 2011. Sensing centromere tension: Aurora B and the regulation of kinetochore function. Trends in Cell Biology 21:133−40 doi: 10.1016/j.tcb.2010.10.007
[53]	Naish M, Alonge M, Wlodzimierz P, Tock AJ, Abramson BW, et al. 2021. The genetic and epigenetic landscape of the Arabidopsis centromeres. Science 374:abi7489 doi: 10.1126/science.abi7489
[54]	Song J, Xie W, Wang S, Guo Y, Koo D, et al. 2021. Two gap-free reference genomes and a global view of the centromere architecture in rice. Molecular Plant 14:1757−67 doi: 10.1016/j.molp.2021.06.018
[55]	Su H, Liu Y, Liu Y, Birchler JA, Han F. 2018. The behavior of the maize B chromosome and centromere. Genes 9:476 doi: 10.3390/genes9100476
[56]	Dvorkina T, Kunyavskaya O, Bzikadze AV, Alexandrov I, Pevzner PA. 2021. CentromereArchitect: inference and analysis of the architecture of centromeres. Bioinformatics 37:i196−i204 doi: 10.1093/bioinformatics/btab265
[57]	Xin H, Zhang T, Wu Y, Zhang W, Zhang P, et al. 2020. An extraordinarily stable karyotype of the woody Populus species revealed by chromosome painting. The Plant Journal 101:253−64 doi: 10.1111/tpj.14536
[58]	Stettler R, Bradshaw H, Heilman P, Hinckley T. 1996. Biology of Populus and its implications for management and conservation. Ottawa, Ontario, Canada: NRC Research Press. 539 pp.
[59]	Qin L, Hu Y, Wang J, Wang X, Zhao R, et al. 2021. Insights into angiosperm evolution, floral development and chemical biosynthesis from the Aristolochia fimbriata genome. Nature Plants 7:1239−53 doi: 10.1038/s41477-021-00990-2
[60]	Gao B, Chen M, Li X, Liang Y, Zhu F, et al. 2018. Evolution by duplication: paleopolyploidy events in plants reconstructed by deciphering the evolutionary history of VOZ transcription factors. BMC Plant Biology 18:256 doi: 10.1186/s12870-018-1437-8
[61]	Wang H, Pak S, Yang J, Wu Y, Li W, et al. 2022. Two high hierarchical regulators, PuMYB40 and PuWRKY75, control the low phosphorus driven adventitious root formation in Populus ussuriensis. Plant Biotechnology Journal 20:1561−77 doi: 10.1111/pbi.13833
[62]	Fan Q, Fu Y. 2017. Telomere and centromere: DNA tandem arrays on the chromosome. Chinese Science Bulletin 62:3245−55 doi: 10.1360/N972016-01145
[63]	Wu H, Yao D, Chen Y, Yang W, Zhao W, et al. 2020. De novo genome assembly of Populus simonii further supports that Populus simonii and Populus trichocarpa belong to different sections. G3 Genes|Genomes|Genetics 10:455−66 doi: 10.1534/g3.119.400913
[64]	Stults DM, Killen MW, Pierce HH, Pierce AJ. 2008. Genomic architecture and inheritance of human ribosomal RNA gene clusters. Genome Research 18:13−18 doi: 10.1101/gr.6858507
[65]	Miga KH. 2020. Centromere studies in the era of 'telomere-to-telomere' genomics. Experimental Cell Research 394:112127 doi: 10.1016/j.yexcr.2020.112127

About this article

Cite this article

Liu W, Liu C, Chen S, Wang M, Wang X, et al. 2024. A nearly gapless, highly contiguous reference genome for a doubled haploid line of Populus ussuriensis, enabling advanced genomic studies. Forestry Research 4: e019 doi: 10.48130/forres-0024-0016

Introduction

Poplars, fast-growing tree species with a relatively short life cycle, are widely distributed across northern temperate regions, spanning from North America through Eurasia to Northern Africa^[1]. Their uses extend beyond the production of paper, pallets, furniture, and kitchen supplies: as pioneer tree species, they are also highly suitable for reforestation^[2]. Poplars produce large quantities of seeds, and their roots readily sprout on marginal lands^[3]. Owing to its modest genome size, rapid growth rate, facile vegetative propagation, and high amenability to genetic manipulation, Populus has emerged as the model system for genetic and molecular studies of forest trees^[4].

Populus trichocarpa was the first tree species, and the third plant species after Arabidopsis thaliana^[6,7] and Oryza sativa^[8], to have its whole genome sequenced^[5]. In recent years, several poplar genomes, including those of P. alba^[9], P. euphratica^[10], P. tremula^[11], P. ilicifolia^[12], P. pruinosa^[13], and P. tomentosa^[14], and a few hybrid poplar genomes, including those of P. alba × P. glandulosa^[15] and P. alba 'Berolinensis'^[16], have been sequenced. However, owing to the high heterozygosity and highly repetitive sequences present in these genomes, the published assemblies are not highly contiguous and remain incomplete in repetitive regions, centromeres, and telomeres^[17].

Poplars are dioecious plants with highly heterozygous genomes^[1]. These genomes have been shaped by events such as whole-genome duplications, widespread expansions of repetitive sequences, and subsequent chromosome rearrangements, which have made them complex and difficult to assemble^[18]. The pronounced heterozygosity complicates efforts to achieve highly contiguous assemblies, while the abundance of repetitive sequences often results in assembly gaps, particularly when short reads are used for diploid genome assembly. This is because biparental allelic sequences from the two homologous chromosomes may be erroneously fused during assembly, leading to inaccurate gene annotation^[1]. In the last few years, the advent of long-read high-throughput sequencing technologies has largely alleviated the challenges of assembling highly repetitive regions. Nevertheless, the issue of high genomic heterozygosity remains, and the adoption of homozygous lines offers a radical solution to this challenge. Generating homozygous lines can take many generations; for instance, the highly homozygous grapevine cultivar PN40024^[19] was developed through nine generations of selfing. This approach is not feasible for woody plant species, primarily because of their long juvenile periods. For dioecious poplars, this remains a persistent challenge, as decreased heterozygosity can affect their environmental adaptability. Nonetheless, haploid plants can potentially be induced and developed into homozygous diploids, albeit with considerable difficulty. For instance, haploid cells derived from a single pollen grain can be doubled artificially to form homozygous diploids, generally referred to as doubled haploid (DH) lines^[20]. DH individuals possess two identical sets of homologous chromosomes, making them ideal materials for genome sequencing. DH lines with whole-genome sequences have been reported in crops, including maize^[21], tomato^[22], barley^[23], and Brassica oleracea^[24]. However, haploid or DH lines have been reported less frequently in forest trees.

In this study, DH callus lines of P. ussuriensis were obtained through in vitro anther induction, and a DH line named DH15 was selected for DNA sequencing using PacBio High Fidelity (HiFi) long-read sequencing, Illumina sequencing, and high-throughput chromosome conformation capture (Hi-C) sequencing. A de novo assembly was then performed by combining the PacBio long reads, Illumina reads, and Hi-C reads, which resulted in a high-quality T2T poplar genome. Annotation of this new assembly identified 465 more genes than in the current v4.1 version of the P. trichocarpa genome. In previous poplar genome assemblies, the centromeres and telomeres were either not assembled at all or only partially assembled, and thus were not reported. Here, a comprehensive analysis of the structures, features, composition, and distribution of these regions was conducted, and nearly all the gaps were closed in the newly assembled reference genome. The structural components and characteristics of the centromeres of all chromosomes in the DH15 poplar genome were dissected and carefully annotated. Additionally, the annotation of transposable elements (TEs) and new genes in highly repetitive regions, particularly centromeres, has been improved. This refined genome assembly will be highly instrumental in molecular analyses of gene function in poplar trees and will enable comparative genomic studies across different poplar species. It serves as a solid foundation for future research on poplar and other plant genomes.

Materials and methods

Plant materials and haploid calli induction

Male flower buds from a Populus ussuriensis tree were collected for anther culture at the mid- or late-uninucleate stage of microspore development. Anther culture was conducted on Murashige and Skoog (MS) basal medium containing 2.0 mg/L 2,4-dichlorophenoxyacetic acid (2,4-D), 1.0 mg/L kinetin (KT), 3 g/L Gelrite, and 3% sucrose to induce haploid formation. Following an initial culture in the dark for 40 d, a cold treatment at 4 °C was administered for 24 or 48 h. The anthers were then kept in culture on the medium for six months. Flow cytometry was used to determine the ploidy levels of calli at different stages. Heterozygous genomic sites of the paternal P. ussuriensis tree were identified by genome resequencing. Ten selected heterozygous sites were amplified by polymerase chain reaction (PCR), and the amplicons were analyzed by Sanger sequencing.

DNA extraction, library construction and sequencing
A doubled haploid line (DH15) of P. ussuriensis was used for genome sequencing. Genomic DNA was isolated using SMRTbell Template Prep Kit 1.0 (Pacific Biosciences, Menlo Park, CA, USA). The quality of the DNA was assessed by agarose gel electrophoresis, and the quantity was determined using a NanoDrop spectrophotometer (Thermo Fisher Scientific). The DNA libraries were constructed as described in a previous study^[25]. Both the Illumina and PacBio Sequel II sequencing platforms were employed: Illumina reads were used for the genome survey, while HiFi reads from the PacBio Sequel II were used for the genome assembly.

The Hi-C library was constructed and sequenced by Annoroad Gene Technology Company as follows: (1) DH15 calli were treated with 1% (vol/vol) formaldehyde to cross-link DNA; (2) the cross-linked samples were then lysed and digested with MboI overnight; (3) MboI was inactivated, and the cohesive ends were filled in with biotin-labeled dCTP; (4) after proximity ligation in a blunt-end ligation buffer, the cross-linking was reversed, and the DNA was purified for Hi-C library construction^[26]; (5) the final Hi-C library was sequenced on an Illumina HiSeq 2500 platform in 150-bp paired-end mode.

Genome assembly and assessment
The genome size of DH15 was estimated by K-mer (k = 21) analysis of short reads sequenced on the Illumina platform. The filtered PacBio HiFi reads (longer than 1,000 bp) were assembled into contigs using Hifiasm with default parameters^[27].
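
The principle behind a K-mer based genome size estimate can be illustrated with a minimal Python sketch. It is not the software used in this study: it counts K-mers naively in memory, ignores reverse-complement strands, and assumes that the most frequent multiplicity above an error cut-off (the hypothetical `min_depth` parameter) is the homozygous coverage peak. The estimate is then the total number of usable K-mers divided by that peak depth.

```python
from collections import Counter


def kmer_histogram(reads, k=21):
    """Return {multiplicity: number of distinct k-mers} for a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return dict(Counter(counts.values()))


def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size as total k-mers divided by the main peak depth.

    Multiplicities below min_depth are treated as sequencing errors
    (an assumption of this sketch, not a value taken from the study).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth shared by the largest number of distinct k-mers (coverage peak).
    peak_depth = max(usable, key=lambda d: usable[d])
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth
```

For a homozygous DH line a single dominant peak is expected, which is the situation this simplified estimate assumes; dedicated K-mer counters should be used for real data.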

The contig-level assembly was indexed with bwa index (with -a bwtsw) (v.0.7.15-r1140) and samtools faidx. The DH15 Hi-C read pairs were aligned using bwa aln and bwa sampe. Aligned read pairs were converted into BAM files using samtools view with the options -b -F12. The BAM files were filtered with filterBAM_forHiC.pl (from the ALLHiC package, v.0.9.13)^[28] to remove non-uniquely mapped reads. Then, on the filtered BAM files, ALLHiC_partition was run with -e GATC -k 1 -m 25; allhic extract was run with the --RE GATC option; allhic optimize and ALLHiC_build were run with default settings; and the chromosome contact map was visualized with ALLHiC_plot at 100-kb resolution.
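
The effect of the -F12 option can be made explicit with a short Python sketch. In the SAM format, flag bit 4 marks an unmapped read and bit 8 marks a read whose mate is unmapped, so excluding flag value 12 keeps only pairs in which both ends aligned. The function names and the example records below are hypothetical and serve only to illustrate the bit test.

```python
READ_UNMAPPED = 0x4   # SAM flag bit: the read itself is unmapped
MATE_UNMAPPED = 0x8   # SAM flag bit: the mate of the read is unmapped


def passes_F12(flag):
    """True if neither the read nor its mate is unmapped (as with -F 12)."""
    return (flag & (READ_UNMAPPED | MATE_UNMAPPED)) == 0


def keep_mapped_pairs(records):
    """Filter (read_name, flag) tuples, keeping reads from fully mapped pairs."""
    return [(name, flag) for name, flag in records if passes_F12(flag)]


# Hypothetical records: flag 99 is a properly paired first read,
# flag 73 has an unmapped mate, flag 77 is unmapped with an unmapped mate.
example = [("pair1/1", 99), ("pair2/1", 73), ("pair3/1", 77)]
kept = keep_mapped_pairs(example)
```

Only `pair1/1` survives the filter in this example, since flags 73 and 77 both carry bit 8.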

The completeness of the genome was assessed using BUSCO v.4.0.6 with default parameters and the 'embryophyta_odb10' dataset, which contains 1,614 genes^[29].
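
Two summary statistics commonly reported in assembly assessment, contig N50 and the percentage of complete BUSCO genes, reduce to simple arithmetic. The Python sketch below shows one way to compute them; the function names are illustrative, the default of 1,614 genes matches the dataset size stated above, and the input values in the usage note are hypothetical rather than results of this study.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    lengths = list(lengths)
    total = sum(lengths)
    if total <= 0:
        raise ValueError("at least one positive length is required")
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def percent_complete(complete_buscos, total_buscos=1614):
    """Percentage of BUSCO genes found complete (single-copy plus duplicated)."""
    return 100.0 * complete_buscos / total_buscos
```

For example, `n50([2, 3, 4, 5, 6])` returns 5, because the two longest contigs (6 and 5) already cover 11 of the 20 total bases.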

Genome annotation
The repetitive sequences in the DH genome were identified as follows. Tandem repeats were identified using the TRF tool with default settings. RepeatModeler (version 2.0.1)^[30] was used for de novo identification of repetitive sequences, and RepeatMasker (version 4.1.0)^[31] was used to predict TE sequences based on sequence homology. Long terminal repeat (LTR) retrotransposons were identified by LTR_FINDER (version 1.1)^[32]. RepeatClassifier^[33] was used to classify the identified repetitive sequences in the DH genome.

The protein-coding genes of the DH15 genome were predicted by combining de novo, homology-based, and RNA-seq-aided methods. The AUGUSTUS model was trained and optimized using the single-copy genes identified by BUSCO and then used for de novo prediction. The protein sequences of ten species, P. bolleana, P. tomentosa, P. tremula, P. deltoides, P. simonii, P. trichocarpa, P. wilsonii, P. euphratica, P. pruinosa, and P. ilicifolia, were used for homology-based annotation. For RNA-seq-assisted gene prediction, we downloaded poplar transcriptome data from the NCBI SRA database (BioProject: PRJNA808967). Clean reads were assembled into transcripts using Trinity^[34], and the transcripts were aligned to the genome assembly using the Program to Assemble Spliced Alignments^[35] to predict gene structures. Finally, EVidenceModeler^[36] was used to integrate the homology-based, de novo, and transcriptome-based predictions into a non-redundant, more complete gene set.

Telomeric and centromeric regions of the DH15 genome were identified using quarTeT^[37]. The TeloExplorer module of quarTeT, which employs the 'explore' and 'search' tools from the telomere identification toolkit (tidk) (https://github.com/tolkit/telomeric-identifier/), was used to identify the telomeres in the genome, and the telomeres were then manually validated. The CentroMiner module was used to predict the centromeres. It takes the genome in FASTA format as input, and additionally supplying the transposable element (TE) annotation in GFF3 format yields better results. We then used HiCAT with default parameters to annotate the centromeres of the DH15 genome^[38]. HiCAT takes a monomer template and a centromere DNA sequence as inputs.
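
The idea behind telomere identification can be sketched in a few lines of Python: tandem copies of the telomeric repeat are counted within a window at each chromosome end. This is a simplified illustration, not the algorithm implemented in quarTeT or tidk. The motif TTTAGGG is the canonical plant (Arabidopsis-type) telomeric repeat and is assumed here, and the window size and the `min_copies` threshold are arbitrary illustrative values.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def terminal_repeat_counts(chrom_seq, motif="TTTAGGG", window=10_000):
    """Count motif copies (either strand) in the first and last `window` bases."""
    chrom_seq = chrom_seq.upper()
    pattern = re.compile(f"{motif}|{reverse_complement(motif)}")
    start_hits = len(pattern.findall(chrom_seq[:window]))
    end_hits = len(pattern.findall(chrom_seq[-window:]))
    return start_hits, end_hits


def has_both_telomeres(chrom_seq, min_copies=20, **kwargs):
    """True if both chromosome ends carry at least min_copies motif copies."""
    start_hits, end_hits = terminal_repeat_counts(chrom_seq, **kwargs)
    return start_hits >= min_copies and end_hits >= min_copies
```

A chromosome that passes such a check at both ends is a candidate telomere-to-telomere sequence; as in the workflow described above, candidates would still need manual validation.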

Analysis of genomic evolution and WGD events
Protein sequences of 16 plant species, P. trichocarpa, P. deltoides, P. simonii, P. wilsonii, P. tremula, P. tomentosa, P. bolleana, P. alba, P. pruinosa, P. euphratica, P. ilicifolia, Salix brachista, S. purpurea, Arabidopsis thaliana, Carica papaya, and Vitis vinifera, were used to construct a phylogenetic tree. Genes with internal stop codons, incompatible reading frames, or fewer than 50 amino acids were removed. For genes with alternative splicing variants, the longest transcript was selected. Sequence comparison was then performed using BLASTP with an e-value cut-off of 1e-5. OrthoFinder^[39] was used for gene family analysis. Gene families with exactly one copy in each of the 16 species were selected as single-copy genes and used for subsequent analysis.
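The filtering and single-copy selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: the input layout (tuples of gene ID, transcript ID, and protein sequence), the use of `*` to mark stop codons, the order in which the filters are applied, and the function names are all hypothetical. The reading-frame check is omitted because it requires the coding sequences.

```python
def filter_proteins(records, min_len=50):
    """Keep the longest valid protein per gene.

    records: iterable of (gene_id, transcript_id, protein_seq) tuples.
    A protein is dropped if it has an internal stop ('*') or is shorter
    than min_len amino acids.
    """
    best = {}
    for gene_id, tx_id, seq in records:
        seq = seq.rstrip("*")      # a terminal stop symbol is not an error
        if "*" in seq:             # internal stop codon
            continue
        if len(seq) < min_len:     # too short
            continue
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (tx_id, seq)
    return best


def single_copy_orthogroups(orthogroups, species):
    """Return IDs of orthogroups with exactly one gene in every species.

    orthogroups: dict mapping orthogroup ID -> {species: [gene IDs]}.
    """
    return [
        og for og, members in orthogroups.items()
        if all(len(members.get(sp, [])) == 1 for sp in species)
    ]


if __name__ == "__main__":
    recs = [
        ("g1", "g1.1", "M" + "A" * 60 + "*"),
        ("g1", "g1.2", "M" + "A" * 80),
        ("g2", "g2.1", "M" + "A" * 10),             # too short
        ("g3", "g3.1", "M" + "A" * 30 + "*" + "A" * 30),  # internal stop
    ]
    print({g: tx for g, (tx, _) in filter_proteins(recs).items()})
```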

MAFFT software (v7.158b)^[40] was employed to generate multiple sequence alignments of protein-coding sequences for each single-copy gene. Subsequently, the alignments of all single-copy genes were concatenated to construct a phylogenetic tree using RAxML v8.2.8 software^[41] with 1,000 bootstrap replicates. Next, r8s software (v1.71)^[42] with default parameters was applied to estimate the divergence times among species. The fossil-based divergence time of Populus and Salix (48 Mya)^[43] was used as one calibration point. A second calibration point was based on the divergence time of V. vinifera and A. thaliana (109.8−124.4 Mya) obtained from TimeTree (www.timetree.org). The final phylogenetic tree was visualized using the iqtree tool^[44]. Ks values for each gene pair were calculated via KaKs_Calculator^[45]. The distributions of all Ks values were plotted in R using the ggplot2 package^[46].
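The concatenation step can be illustrated with a short sketch. It assumes that each per-gene alignment has already been read into a dictionary mapping species name to aligned sequence; the data layout and the function name are hypothetical and are not taken from the tools cited above.

```python
def concatenate_alignments(alignments, species):
    """Join per-gene alignments into one supermatrix.

    alignments: list of dicts {species: aligned_sequence}, one per gene.
    species: ordered list of species names to include.
    A species missing from a gene is padded with gaps, although this should
    not occur for single-copy genes present in every species.
    """
    parts = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        width = lengths.pop()
        for sp in species:
            parts[sp].append(aln.get(sp, "-" * width))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


if __name__ == "__main__":
    genes = [
        {"A": "MK-L", "B": "MKAL"},
        {"A": "GG", "B": "G-"},
    ]
    for name, seq in concatenate_alignments(genes, ["A", "B"]).items():
        print(f">{name}\n{seq}")
```

The resulting supermatrix has one row per species and can be written out in FASTA or PHYLIP format for tree inference.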

Collinearity analysis
The genomes were compared using Nucmer with the parameters '-c 100 -b 500 -l 50'^[47]. Subsequently, the alignment file generated by Nucmer was filtered using Delta-filter with the parameters '-i 90 -l 100'. SyRI, a tool for identifying synteny and rearrangement^[48], was then used to compare the chromosome-level assemblies of DH15 and P. trichocarpa and to identify syntenic regions and structural rearrangements. Finally, the results were visualized using Plotsr^[49].
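For reference, the alignment and filtering commands with the parameters given above could be assembled as in the sketch below. The file names, the output prefix, and the choice of which genome serves as the reference are hypothetical placeholders. The `-p` prefix option is an addition for this example, and the SyRI and Plotsr steps are not shown.

```python
import shlex


def collinearity_commands(ref_fasta, qry_fasta, prefix="dh15_vs_ptr"):
    """Build the nucmer and delta-filter command lines as argument lists."""
    nucmer = [
        "nucmer",
        "-c", "100",   # minimum cluster length (from the text)
        "-b", "500",   # maximum extension distance (from the text)
        "-l", "50",    # minimum exact match length (from the text)
        "-p", prefix,  # output prefix (assumed for this sketch)
        ref_fasta, qry_fasta,
    ]
    delta_filter = [
        "delta-filter",
        "-i", "90",    # minimum alignment identity in percent (from the text)
        "-l", "100",   # minimum alignment length (from the text)
        f"{prefix}.delta",
    ]
    return nucmer, delta_filter


if __name__ == "__main__":
    # Hypothetical input file names.
    nucmer_cmd, filter_cmd = collinearity_commands("Ptrichocarpa.fa", "DH15.fa")
    print(shlex.join(nucmer_cmd))
    # delta-filter writes to standard output, so redirect it to a file.
    print(shlex.join(filter_cmd) + " > dh15_vs_ptr.filtered.delta")
```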

Discussion

The emergence of next- and third-generation DNA sequencing technologies, coupled with advanced bioinformatics tools, has greatly facilitated the sequencing, assembly, and public release of several poplar genomes^[9,10,14,63]. Recent advancements in sequencing technologies, notably the widespread availability of highly accurate long-read sequencing provided by PacBio, along with the adoption of diverse assembly algorithms, have significantly enhanced the quality of recently published poplar genomes. However, achieving optimal genome contiguity and completeness remains a persistent challenge, especially when dealing with large, structurally diverse, hybrid, or heterozygous genomes. Notably, all published poplar genomes are incomplete in their centromere and telomere regions and fall short of a high level of contiguity. These problems arise mainly from two sources: (1) poplars are dioecious plants whose genomes are often highly heterozygous^[1]; (2) whole genome duplication, widespread events such as repetitive sequence expansions, and subsequent chromosome rearrangements have made poplar genomes more complicated and difficult to assemble. To solve these problems, we generated a doubled haploid callus line of P. ussuriensis, an ideal material for genome assembly. The DH15 genome represents a significant improvement over previously released poplar genomes. It has filled the majority of gaps, closed all telomeres and centromeres across its 19 chromosomes, and improved the representation of repetitive regions, including transposons, in comparison to the earlier poplar genome assemblies. Nevertheless, seven gaps in the DH15 assembly remain unclosed, and it is reasonable to suspect that these gaps consist of rDNA clusters.
This assumption is supported by the annotation of the remaining 41 contigs, totaling 5.35 Mb in length, which could not be assigned to specific chromosomes and in which small subunit ribosomal RNA genes are prevalent. The results indicate that HiFi reads are not long enough to span the repeat regions of rRNA clusters. This is in line with findings in the human genome, where the majority of rRNA clusters are typically detected as 3 Mb DNA fragments^[64].

The near-gapless assembly of the Arabidopsis thaliana genome has enabled epigenomic profiling of centromeres and analysis of transposon insertion patterns^[53]. In a similar vein, the identification and annotation of the centromere regions of the DH15 genome represent a crucial step toward comparative sequence and epigenetic analyses of centromere evolution within the Populus genus, shedding light on its relation to speciation^[65]. This high-resolution view of centromeric regions in DH15 offers a unique opportunity to investigate the origins and evolution of satellite repeats within centromeric regions. Moreover, it provides valuable insights into the organization and functioning of centromeres, not only within poplar species but also in a broader biological context.

When the DH15 genome was compared with the P. trichocarpa genome, several notable improvements became evident. The DH15 genome resolved many (> 50) assembly gaps present in the P. trichocarpa genome. This had a significant impact on gene prediction, resulting in more accurate and comprehensive gene annotations. Furthermore, the DH15 genome revealed a greater number of repeat sequences than the P. trichocarpa genome. Additionally, a karyotype analysis of the DH15 genome was performed and the results were compared with previous experiments. The ratio of long arm to short arm of Chr 14 differed markedly: it was 3.23 in P. trichocarpa in a previous study and 5.48 in the DH15 genome in this study. This discrepancy may be attributed, at least in part, to the presence of 45S rDNA. However, the long-arm to short-arm ratios of the other chromosomes remained relatively consistent, with only minor variations^[57].
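The arm ratio discussed above is the length of the long arm divided by the length of the short arm. A minimal sketch of how it could be derived from an assembly is given below, assuming that the centromere midpoint is taken as the boundary between the two arms. That convention, the function name, and the coordinates in the example are assumptions for illustration and are not values from this study.

```python
def arm_ratio(chrom_length, centromere_start, centromere_end):
    """Return long-arm / short-arm length, splitting at the centromere midpoint."""
    midpoint = (centromere_start + centromere_end) / 2
    arm_a, arm_b = midpoint, chrom_length - midpoint
    long_arm, short_arm = max(arm_a, arm_b), min(arm_a, arm_b)
    if short_arm <= 0:
        raise ValueError("centromere midpoint must lie inside the chromosome")
    return long_arm / short_arm


if __name__ == "__main__":
    # Hypothetical 20 Mb chromosome with a centromere at 3.8-4.2 Mb.
    print(arm_ratio(20_000_000, 3_800_000, 4_200_000))
```

Because only relative arm lengths matter, the same result is obtained whether the centromere lies near the start or the end of the assembled sequence.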

Conclusions

By using a doubled haploid callus induced from an anther of a paternal tree and leveraging PacBio long-read sequencing technology, we sequenced and assembled a nearly gapless, highly contiguous T2T P. ussuriensis genome. The assembly reveals the composition and distribution of telomeres and centromeres, making it a valuable resource for studies on poplar genomes. With this assembly, including high-resolution centromeric regions for all 19 chromosomes, we can significantly advance research on the evolution of centromeres, their roles in shaping karyotypes, and their influence on speciation. Moreover, the new assembly creates opportunities for exploring the genetic and genomic functions of poplar centromeres, including their interactions with kinetochore proteins and their potential in the development of plant artificial chromosomes. Furthermore, it can expedite studies on the generation of haploids and polyploids, thus advancing molecular breeding efforts. This study offers genomic resources that will drive progress in comparative genomics, genetic and epigenetic studies, reproductive biology, and molecular breeding strategies for poplar trees.

Author contributions

The authors confirm contribution to the paper as follows: study conception and supervision: Qu G, Su Chen, You X; samples and data collection: Liu C, Liu W, Wang M, Wang X, Yu Y; experimental guidance: Qu G, Su Chen, You X, Sederoff RR, Song Chen, Liu W; performing the analyses: Su Chen, Wei H, Liu W; draft manuscript preparation: Liu W, Liu C; manuscript revision: Qu G, Wei H, Su Chen. All authors reviewed the results and approved the final version of the manuscript.
