GFAnno: integrated method for plant flavonoid biosynthesis pathway gene annotation

Liuxu Du; Cui Lu; Zhentao Wang; LongXiang Zou; Yi Xiong; Qunjie Zhang; Liuxu Du; Cui Lu; Zhentao Wang; LongXiang Zou; Yi Xiong; Qunjie Zhang

doi:10.48130/bpr-0023-0041

2024 Volume 4

Article Contents

Next Previous

METHOD Open Access

GFAnno: integrated method for plant flavonoid biosynthesis pathway gene annotation

1.
Center of Genomics and Bioinformatics, Guangdong Key Laboratory of Plant Molecular Breeding, College of Agriculture, South China Agricultural University, Guangzhou 510642, China
^# These authors contributed equally: Liuxu Du, Cui Lu, Zhentao Wang

More Information

Corresponding author: s-zqj@163.com

Received: 11 July 2023
Revised: 09 December 2023
Accepted: 14 December 2023
Published online: 01 March 2024
Beverage Plant Research 4, Article number: e008 (2024) | Cite this article

Abstract

Flavonoids are important secondary metabolites synthesized by tea plants. However, inconsistencies in the variations in gene annotation methods across numerous studies, have hindered the comparison of results from previous studies. In this work, we offer 'GFAnno', an open-source software package which annotates genes and gene families based on sequence features, along with annotated parameters for 18 key genes related to the flavonoid biosynthesis pathway. The package takes a protein sequence file as input, performs gene annotation based on the identity and coverage of pre-prepared known seed protein sequences and the coverage of conserved Hidden Markov Model (HMM) domain. We used 11 dicotyledon, seven monocotyledon, and two basal angiosperm genomes to construct three datasets. We then use the seed species collection to construct seed sequences, use the test species collection to follow strict parameter selection rules, and use the validation species collection to verify the accuracy of the analysis results. The annotation results of validation collection using the filtering parameters by test collection shows that, our parameter selection can effectively exclude various structurally incomplete and abnormal proteins, while correctly distinguishing genes with high sequence similarity, such as Flavonoid 3'-Hydroxylase (F3'H) and Flavonoid 3'5'-Hydroxylase (F3'5'H) in the cytochrome P450 (CYP450). Our work aids ongoing tea plant pan-genome research by offering a convenient software for target gene annotation and sets comparative standards for analyzing the flavonoid biosynthesis pathway and conducting sequence comparison of catalytic enzymes.
- Flavonoids biosynthesis pathway,
- Gene annotation pipeline,
- Camellia sinensis,
- Angiosperms

Supplementary information

Supplemental Table S1 Enzymatic catalysis in the flavonoid synthesis pathway in tea plants.
Supplemental Table S2 Genome date used in this work.
Supplemental Table S3 Seed sequence information statistics.
Supplemental Table S4 Comparison of results from different annotation methods using F. hispida genome.
Supplemental Fig. S1 Phylogenetic trees of CYP450 superfamily proteins. NJ tree was constructed using the protein sequences of three species (A. thaliana, TGY, DASZ) genomes. C4H，F3'5'H, FNS II and F3'H were marked in different colors.
Supplemental Fig. S2 The parameter selection for 4CL of the test species.
Supplemental Fig. S3 The parameter selection for CHI of the test species.
Supplemental Fig. S4 The parameter selection for CHS of the test species. The sequences marked with asterisks were previously misannotated in earlier studies^[1].
Supplemental Fig. S5 The parameter selection for LAR of the test species.
Supplemental Fig. S6 The parameter selection for PAL of the test species.
Supplemental Fig. S7 The parameter selection for PPO of the test species.
Supplemental Fig. S8 The parameter selection for SCPL I A of the test species. The sequences marked with asterisks were previously misannotated in earlier studies^[2].
Supplemental Fig. S9 The parameter selection for UGT84A of the test species.
Supplemental Fig. S10 The parameter selection for ANR and DFR in SDR superfamily of the test species.
Supplemental Fig. S11 The parameter validation for C4H, FNS II, F3'H, and F3'5'H in CYP450 superfamily.
Supplemental Fig. S12 The parameter validation for F3H, FLS, ANS, and LDOX in 2OG superfamily.
Supplemental Fig. S13 The parameter validation for 4CL.
Supplemental Fig. S14 The parameter validation for CHI.
Supplemental Fig. S15 The parameter validation for CHS.
Supplemental Fig. S16 The parameter validation for LAR.
Supplemental Fig. S17 The parameter validation for PAL.
Supplemental Fig. S18 The parameter validation for PPO.
Supplemental Fig. S19 The parameter validation for SCPL I A.
Supplemental Fig. S20 The parameter validation for UGT84A.
Supplemental Fig. S21 The parameter validation for ANR and DFR in SDR superfamily.
Supplemental Fig. S22 Comparison of results from Gfanno and KAAS annotation methods using F. hispida genome. Green stars indicate more correctly annotated genes by Gfanno than by KAAS, and red stars indicate incorrectly annotated genes by KAAS.

Rights and permissions
Copyright: © 2024 by the author(s). Published by Maximum Academic Press, Fayetteville, GA. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.

References

[1]	Zhang L, Chen F, Zhang X, Li Z, Zhao Y, et al. 2020. The water lily genome and the early evolution of flowering plants. Nature 577:79−84 doi: 10.1038/s41586-019-1852-5 CrossRef Google Scholar
[2]	Lozano R, Hamblin MT, Prochnik S, Jannink JL. 2015. Identification and distribution of the NBS-LRR gene family in the Cassava genome. Bmc Genomics 16:360 doi: 10.1186/s12864-015-1554-9 CrossRef Google Scholar
[3]	Ashihara H, Deng WW, Mullen W, Crozier A. 2010. Distribution and biosynthesis of flavan-3-ols in Camellia sinensis seedlings and expression of genes encoding biosynthetic enzymes. Phytochemistry 71:559−566 doi: 10.1016/j.phytochem.2010.01.010 CrossRef Google Scholar
[4]	Zhang W, Zhang Y, Qiu H, Guo Y, Wan H, et al. 2020. Genome assembly of wild tea tree DASZ reveals pedigree and selection history of tea varieties. Nature Communications 11:3719 doi: 10.1038/s41467-020-17498-6 CrossRef Google Scholar
[5]	Zhang QJ, Li W, Li K, Nan H, Shi C, et al. 2020. The chromosome-level reference genome of tea tree unveils recent bursts of non-autonomous LTR retrotransposons in driving genome size evolution. Molecular Plant 13:935−938 doi: 10.1016/j.molp.2020.04.009 CrossRef Google Scholar
[6]	Lin P, Wang K, Wang Y, Hu Z, Yan C, et al. 2022. The genome of oil-Camellia and population genomics analysis provide insights into seed oil domestication. Genome Biology 23:14 doi: 10.1186/s13059-021-02599-2 CrossRef Google Scholar
[7]	Emms DM, Kelly S. 2015. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biology 16:157 doi: 10.1186/s13059-015-0721-2 CrossRef Google Scholar
[8]	Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. Journal of Molecular Biology 215:403−10 doi: 10.1016/S0022-2836(05)80360-2 CrossRef Google Scholar
[9]	Potter SC, Luciani A, Eddy SR, Park Y, Lopez R, et al. 2018. HMMER web server: 2018 update. Nucleic Acids Research 46:W200−W204 doi: 10.1093/nar/gky448 CrossRef Google Scholar
[10]	Guo L, Gao L, Ma X, Guo F, Ruan H, et al. 2019. Functional analysis of flavonoid 3'-hydroxylase and flavonoid 3',5'-hydroxylases from tea plant (Camellia sinensis), involved in the B-ring hydroxylation of flavonoids. Gene 717:144046 doi: 10.1016/j.gene.2019.144046 CrossRef Google Scholar
[11]	Wang YS, Xu YJ, Gao LP, Yu O, Wang XZ, et al. 2014. Functional analysis of flavonoid 3',5'-hydroxylase from tea plant (Camellia sinensis): critical role in the accumulation of catechins. BMC Plant Biology 14:347 doi: 10.1186/s12870-014-0347-7 CrossRef Google Scholar
[12]	Wei K, Wang L, Zhang C, Wu L, Li H, et al. 2015. Transcriptome analysis reveals key flavonoid 3'-hydroxylase and flavonoid 3',5'-hydroxylase genes in affecting the ratio of dihydroxylated to trihydroxylated catechins in Camellia sinensis. Plos ONE 10:e137925 doi: 10.1371/journal.pone.0137925 CrossRef Google Scholar
[13]	Xiong S, Tian N, Long J, Chen Y, Qin Y, et al. 2016. Molecular cloning and characterization of a flavanone 3-Hydroxylase gene from Artemisia annua L. Plant Physiology and Biochemistry 105:29−36 doi: 10.1016/j.plaphy.2016.04.005 CrossRef Google Scholar
[14]	The UniProt Consortium. 2023. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Research 51:D523−D531 doi: 10.1093/nar/gkac1052 CrossRef Google Scholar
[15]	Fu L, Niu B, Zhu Z, Wu S, Li W. 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28:3150−52 doi: 10.1093/bioinformatics/bts565 CrossRef Google Scholar
[16]	Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, et al. 2021. Pfam: The protein families database in 2021. Nucleic Acids Research 49:D412−D419 doi: 10.1093/nar/gkaa913 CrossRef Google Scholar
[17]	Letunic I, Khedkar S, Bork P. 2021. SMART: recent updates, new developments and status in 2020. Nucleic Acids Research 49:D458−D460 doi: 10.1093/nar/gkaa937 CrossRef Google Scholar
[18]	Marchler-Bauer A, Bo Y, Han L, He J, Lanczycki CJ, et al. 2017. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Research 45:D200−D203 doi: 10.1093/nar/gkw1129 CrossRef Google Scholar
[19]	Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. 2009. TrimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25:1972−73 doi: 10.1093/bioinformatics/btp348 CrossRef Google Scholar
[20]	Tamura K, Stecher G, Kumar S. 2021. MEGA11: molecular evolutionary genetics analysis version 11. Molecular Biology and Evolution 38:3022−27 doi: 10.1093/molbev/msab120 CrossRef Google Scholar
[21]	Bailey TL, Johnson J, Grant CE, Noble WS. 2015. The MEME Suite. Nucleic Acids Research 43:W39−W49 doi: 10.1093/nar/gkv416 CrossRef Google Scholar
[22]	Chen C, Chen H, Zhang Y, Thomas HR, Frank MH, et al. 2020. TBtools: an integrative toolkit developed for interactive analyses of big biological data. Molecular Plant 13:1194−202 doi: 10.1016/j.molp.2020.06.009 CrossRef Google Scholar
[23]	Cui L, Yao S, Dai X, Yin Q, Liu Y, et al. 2016. Identification of UDP-glycosyltransferases involved in the biosynthesis of astringent taste compounds in tea (Camellia sinensis). Journal of Experimental Botany 67:2285−97 doi: 10.1093/jxb/erw053 CrossRef Google Scholar
[24]	Yao S, Liu Y, Zhuang J, Zhao Y, Dai X, et al. 2022. Insights into acylation mechanisms: co-expression of serine carboxypeptidase-like acyltransferases and their non-catalytic companion paralogs. The Plant Journal 111:117−33 doi: 10.1111/tpj.15782 CrossRef Google Scholar
[25]	Singh K, Kumar S, Rani A, Gulati A, Ahuja PS. 2009. Phenylalanine ammonia-lyase (PAL) and cinnamate 4-hydroxylase (C4H) and catechins (flavan-3-ols) accumulation in tea. Functional & Integrative Genomics 9:125−34 doi: 10.1007/s10142-008-0092-9 CrossRef Google Scholar
[26]	Singh K, Rani A, Kumar S, Sood P, Mahajan M, et al. 2008. An early gene of the flavonoid pathway, flavanone 3-hydroxylase, exhibits a positive relationship with the concentration of catechins in tea (Camellia sinensis). Tree Physiology 28:1349−56 doi: 10.1093/treephys/28.9.1349 CrossRef Google Scholar
[27]	Lin GZ, Lian YJ, Ryu JH, Sung MK, Park JS, et al. 2007. Expression and purification of His-tagged flavonol synthase of Camellia sinensis from Escherichia coli. Protein Expression and Purification 55:287−92 doi: 10.1016/j.pep.2007.05.013 CrossRef Google Scholar
[28]	Initiative TAG. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796−815 doi: 10.1038/35048692 CrossRef Google Scholar
[29]	Velasco R, Zharkikh A, Troggio M, Cartwright DA, Cestaro A, et al. 2007. A high quality draft consensus sequence of the genome of a heterozygous grapevine variety. Plos ONE 2:e1326 doi: 10.1371/journal.pone.0001326 CrossRef Google Scholar
[30]	Zhang X, Chen S, Shi L, Gong D, Zhang S, et al. 2021. Haplotype-resolved genome assembly provides insights into evolutionary history of the tea plant Camellia sinensis. Nature Genetics 53:1250−59 doi: 10.1038/s41588-021-00895-y CrossRef Google Scholar
[31]	Kawahara Y, de la Bastide M, Hamilton JP, Kanamori H, McCombie WR, et al. 2013. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice 6:4 doi: 10.1186/1939-8433-6-4 CrossRef Google Scholar
[32]	Zhang GQ, Liu KW, Li Z, Lohaus R, Hsiao YY, et al. 2017. The Apostasia genome and the evolution of orchids. Nature 549:379−83 doi: 10.1038/nature23897 CrossRef Google Scholar
[33]	Amborella Genome Project , Albert VA, Barbazuk WB, Depamphilis CW, DER JP, et al. 2013. The Amborella genome and the evolution of flowering plants. Science 342:1241089 doi: 10.1126/science.1241089 CrossRef Google Scholar
[34]	Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, et al. 2006. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313:1596−604 doi: 10.1126/science.1128691 CrossRef Google Scholar
[35]	Wang P, Yu J, Jin S, Chen S, Yue C, et al. 2021. Genetic basis of high aroma and stress tolerance in the oolong tea cultivar genome. Horticulture Research 8:107 doi: 10.1038/s41438-021-00542-x CrossRef Google Scholar
[36]	Olsen JL, Rouzé P, Verhelst B, Lin YC, Bayer T, et al. 2016. The genome of the seagrass Zostera marina reveals angiosperm adaptation to the sea. Nature 530:331−35 doi: 10.1038/nature16548 CrossRef Google Scholar
[37]	Chaw SM, Liu YC, Wu YW, Wang HY, Lin CYI, et al. 2019. Stout camphor tree genome fills gaps in understanding of flowering plant genome evolution. Nature Plants 5:63−73 doi: 10.1038/s41477-018-0337-0 CrossRef Google Scholar
[38]	Wu H, Ma T, Kang M, Ai F, Zhang J, et al. 2019. A high-quality Actinidia chinensis (kiwifruit) genome. Horticulture Research 6:117 doi: 10.1038/s41438-019-0202-y CrossRef Google Scholar
[39]	Zhang X, Wang G, Zhang S, Chen S, Wang Y, et al. 2020. Genomes of the banyan tree and pollinator wasp provide insights into fig-wasp coevolution. Cell 183:875−889.E17 doi: 10.1016/j.cell.2020.09.043 CrossRef Google Scholar
[40]	Wang X, Feng H, Chang Y, Ma C, Wang L, et al. 2020. Population sequencing enhances understanding of tea plant evolution. Nature Communications 11:4447 doi: 10.1038/s41467-020-18228-8 CrossRef Google Scholar
[41]	Xia E, Tong W, Hou Y, An Y, Chen L, et al. 2020. The reference genome of tea plant and resequencing of 81 diverse accessions provide insights into its genome evolution and adaptation. Molecular Plant 13:1013−1026 doi: 10.1016/j.molp.2020.04.010 CrossRef Google Scholar
[42]	Eichten SR, Foerster JM, de Leon N, Kai Y, Yeh CT, et al. 2011. B73-Mo17 near-isogenic lines demonstrate dispersed structural variation in maize. Plant Physiology 156:1679−90 doi: 10.1104/pp.111.174748 CrossRef Google Scholar
[43]	Harkess A, Zhou J, Xu C, Bowers JE, Van der Hulst R, et al. 2017. The asparagus genome sheds light on the origin and evolution of a young Y chromosome. Nature Communications 8:1279 doi: 10.1038/s41467-017-01064-8 CrossRef Google Scholar
[44]	Wang Z, Miao H, Liu J, Xu B, Yao X, et al. 2019. Musa balbisiana genome reveals subgenome evolution and functional divergence. Nature Plants 5:810−21 doi: 10.1038/s41477-019-0452-6 CrossRef Google Scholar
[45]	Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M. 2007. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Research 35:W182−W185 doi: 10.1093/nar/gkm321 CrossRef Google Scholar
[46]	Werck-Reichhart D, Feyereisen R. 2000. Cytochromes P450: a success story. Genome Biology 1:reviews3003.1 doi: 10.1186/gb-2000-1-6-reviews3003 CrossRef Google Scholar
[47]	Ding Q, Wang F, Xue J, Yang X, Fan J, et al. 2020. Identification and expression analysis of hormone biosynthetic and metabolism genes in the 2OGD family for identifying genes that may be involved in tomato fruit ripening. International Journal of Molecular Sciences 21:5344 doi: 10.3390/ijms21155344 CrossRef Google Scholar
[48]	Paysan-Lafosse T, Blum M, Chuguransky S, Grego T, Pinto BL, et al. 2023. InterPro in 2022. Nucleic Acids Research 51:D418−D427 doi: 10.1093/nar/gkac993 CrossRef Google Scholar
[49]	Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, et al. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21:3674−76 doi: 10.1093/bioinformatics/bti610 CrossRef Google Scholar

About this article

Cite this article

Du L, Lu C, Wang Z, Zou L, Xiong Y, et al. 2024. GFAnno: integrated method for plant flavonoid biosynthesis pathway gene annotation. Beverage Plant Research 4: e008 doi: 10.48130/bpr-0023-0041

Du L, Lu C, Wang Z, Zou L, Xiong Y, et al. 2024. GFAnno: integrated method for plant flavonoid biosynthesis pathway gene annotation. Beverage Plant Research 4: e008 doi: 10.48130/bpr-0023-0041

Figures(5) / Tables(1)

Download PDF

Article Metrics

Article views(9690) PDF downloads(1286)

Other Articles By Authors

on this site
on Google Scholar

HTML

Introduction

The complexity of plant secondary metabolism poses major challenges for analyses in different species due to the evolution of multiple independent pathways. With the rapid development of third-generation sequencing, the era of pan-genomics has fully arrived, offering us new opportunities to study the diversity of secondary metabolites in plants. However, the classical annotation methods for superfamily, gene family, and multicopy genes are similar, but with different parameter settings. Extensive research is required to manually examine sequence features, such as sequence similarity and coverage, as well as enzyme catalytic domains, in order to determine the identification parameters for each target gene^[1,2]. This can create analytical obstacles for researchers outside of the field.

Taking gene annotation in the flavonoid biosynthesis pathway of tea plants as an example. Flavonoids are an important class of biologically active secondary metabolites produced by tea plants^[3]. In addition to their myriad of health benefits, flavonoids contribute to the enjoyable flavors and overall sensory experience of tea. The flavonoid pathway is complex, involving a wide array of enzymes, which generates diverse metabolites. In tea plants, the copy number of key genes involved in flavonoid biosynthesis differs significantly among different tea genomes^[4−6]. Annotation of the same genome using different tools and parameters often results in different gene numbers. For instance, we used Orthofinder^[7] and utilized BLASTP^[8] and HMMsearch^[9] to annotate the C. sinensis genomes 'Shuchazao' (SCZ) and 'Yunkang10' (YK10) in two different studies and received different results^{[4, 6]}. In addition, errors often occur when distinguishing multi-copy genes with very similar sequences within the same gene family. For example, the cytochrome P450 (CYP450, which includes Flavonoid 3'-Hydroxylase (F3'H), Flavonoid 3'5'-Hydroxylase (F3'5'H), Flavone Synthase II (FNSII), and Trans-cinnamate 4-Monooxygenase (C4H)) and the 2-oxoglutarate-dependent dioxygenase (2OGDs, which includes Flavanone 3-Dioxygenase (F3H), Anthocyanidin Synthase (ANS), Leucoanthocyanidin Dioxygenase (LDOX) and Flavonol Synthase (FLS)) superfamilies participate in multiple oxidation and hydroxylation reactions within the flavonoid biosynthesis pathway, leading to the production of various flavonoid compounds^[10−13]. F3'H and F3'5'H, which hydroxylate a broad range of flavonoid substrates, are characterized by a high sequence similarity (Supplemental Fig. S1)^[10−12]. Therefore, it is necessary to use precise annotation parameters for the accurate identification of different gene copies within these families to avoid false-positive or false-negative annotation results.

Here, we present 'GFAnno,' an open-source software package that integrates the workflow based on sequence similarity and conserved domain analysis into a single package. We used three dicots, two monocots, and a basal angiosperm species to generate the seed sequence, provide strict filtered and doubled checked parameters, allowing researchers to obtain annotations and classifications for the 18 key genes in the plant flavonoid biosynthesis pathway with just one command. Furthermore, this paper offers a parameter selection process and guidelines for creating parameters for the identification of other target genes, enhancing the applicability and scalability of this workflow.

Input and output

Input

The input to GFAnno includes the following options:

• The '-f / --fasta' flag specifies the input protein file in FASTA format.

• The '-c / --config' flag specifies the config file.

The config file can contain multiple genes, each with specific settings, including the local of seed file for BLASTP (blastp_seed) and corresponding filtering parameters (b_iden, b_qcov and b_tcov) for the target genes, the local of HMM file for HMMERsearch (hmm_file) and corresponding parameters (h_cov).

For first-time users of the software, you can use the '-g / --generate' parameter to generate a config file (flavno.ini) for the 18 genes involved in the plant catechin biosynthesis pathway, along with the required blast seed files and hmm_file files. If users need to annotate and filter other genes, they can refer to the format of the initial config file to create new filtering schemes.

Output
The output of GFAnno includes annotated target gene sequences along with their alignment parameters. A folder is created with the prefix specified by '-o', and within this folder, a series of FASTA files are generated, named using both the specified prefix and the names of the target genes. The 'stat' file contains at least four columns of information: gene ID, blastp_identity with the seed file, blastp_qcovs, blastp_tcovs with the seed file, and hmm_coverage. If multiple HMM files are provided, additional columns will be output for the hmm_coverage of the second and third HMM models.

For example, running it with '-o flavno -c flavno.ini' will create a folder named 'flavno. In this folder, you can find a series of FASTA files named after gene names. These files contain candidate genes that have been annotated by GFAnno using the input file. Additionally, there is a 'flavno.stat' file that provides alignment parameters for the corresponding genes.

Discussion

Annotation methods for superfamily, gene family, and multicopy genes are similar, but with different parameter settings. Superfamily and family analysis focus on identifying characteristic structural domains and each gene or gene family has its own sequence characteristics. In the flavonoid metabolic pathway, DFR and ANR belong to the Short-chain dehydrogenases/reductases (SDRs) family (Supplemental Fig. S10), while F3H, FLS, ANS, and LDOX belong to the 2OGD superfamily (Fig. 4), C4H, FNS II, F3'H, and F3'5'H belong to the CYP450 superfamily (Fig. 5 & Supplemental Fig. S1). Species within the same superfamily share conserved domains, so they can use the same HMM model. Furthermore, conserved motif analysis showed that the candidate genes classified as CYP450 superfamily members shared four common regions, namely the heme-binding domain Phe-X-X-Glu-Arg-Arg-X-Cys-X-Gly, which is responsible for binding carbon monoxide, the Glu-X-X-Arg and the Ala-Gly-X-(Asp/Glu)-Thr-(Thr/Ser) motifs, which stabilizes the core structure and facilitates oxygen or substrate binding, respectively, and the hinge region Pro-Pro-Gly-Pro-Thr-(Gly/Pro)-Trp-Pro, which determines the correct orientation of CYP450 enzymes^[46]. Characteristic iron-binding sites His-X-Asp-X(n)-His and the 2-oxoglutarate binding domain Arg-X-Ser, which are specific to 2OGDs^[47]. The sequence similarity of LDOX and FLS, F3'H and F3'5'H is very high, and their phylogenetic relationship is also very close (Figs 4b & 5b). In our work, by using GFAnno, our parameters can accurately distinguish LDOX and FLS in the 2OGDs superfamily (Fig. 4a) and distinguish F3'H and F3'5'H in the CYP450 superfamily (Fig. 5a).

To assess the accuracy of the GFAnno flavonoid pathway gene annotation parameters, we utilized all proteins from Ficus hispida in a validation dataset as input and compared the results with KAAS annotations result (Supplemental Table S4). Supplemental Fig. S22 illustrates the phylogenetic relationships and comparison of conserved domains between the annotation results of GFAnno and KAAS. Red stars represent false positives from KAAS machine annotation, while green stars represent false negatives from KAAS annotation. Among the 18 genes, GFAnno effectively filtered out structurally incomplete sequences and some fused genes, resulting in an increase of 29.41% in accurate annotation results and the exclusion of 52.94% annotation errors. Additionally, other functional annotation software, such as InterPro^[48] and blast2GO^[49], employs a similar approach using BLASTP and HMMsearch separately, which tends to yield a large number of sequences from the same gene family. Therefore, we also compiled the results obtained solely using BLASTP and HMMsearch (Supplemental Table S4). It can be observed that such annotations effectively identify the source of sequences but face challenges in determining the completeness of the target sequences (Figs 4 & 5; Supplemental Figs S2−S22).

The above results indicate that, when conducting batch analyses for pan-genomic studies, accuracy is crucial. Therefore, our study serves as a valuable addition to genes and gene families annotation based on sequence features, it also provides a firm foundation for future flavonoid studies.

Conclusions

In conclusion, our study addresses the challenges in annotating genes related to the flavonoid biosynthesis pathway, offering 'GFAnno,' an open-source software package for gene and gene family annotation. We provide accurate parameters for precise identification of plant flavonoid biosynthesis-related genes. Our contribution lies in streamlining target gene annotation and establishing comparative benchmarks for analyzing the flavonoid biosynthesis pathway and comparing catalytic enzyme sequences, benefiting ongoing tea plant pan-genome research.

Author contributions

The authors confirm contribution to the paper as follows: study conception & design and project management: Zhang Q; data analyses: Lu C, Du L, Xiong Y, Zou L; Python code: Du L, Wang Z; draft manuscritp preparation: Lu C, Du L; manuscript revision: Zhang Q. All authors reviewed the results and approved the final version of the manuscript..

{{lists.name}}

GFAnno: integrated method for plant flavonoid biosynthesis pathway gene annotation