FTGD: a machine learning method for flowering-time gene prediction

Junyu Zhang; Shuang He; Wenquan Wang; Fei Chen; Zhidong Li

doi:10.48130/TP-2023-0023

2023 Volume 2

ARTICLE Open Access

1. College of Breeding and Multiplication, Sanya Institute of Breeding and Multiplication, Hainan University, Sanya 572025, China
2. School of Tropical Agriculture and Forestry, Hainan University, Haikou 570228, China
3. Hainan Yazhou Bay Seed Laboratory, Sanya 572024, China

Corresponding authors: feichen@hainanu.edu.cn; m15132506079_1@163.com

Received: 03 October 2023
Accepted: 07 November 2023
Published online: 22 November 2023
Tropical Plants 2, Article number: 23 (2023)

Highlights

We have developed a high-accuracy machine learning model for predicting flowering-time-associated genes in plants and created a practical tool for this purpose.

We successfully predicted 318,521 flowering-time-associated genes across protein datasets from 81 plant species, providing a substantial amount of data related to plant flowering timing.

To facilitate user access to both the tool and the data, we have established a database of plant flowering-time-associated genes, which will serve as a valuable resource for research and breeding endeavors.

Abstract

The timing of flowering significantly affects plant reproduction and crop yield, making it important to detect flowering-time-associated genes (FTAGs). In this study, we retrieved 628 flowering-time-associated protein sequences from FLOR-ID, a database of flowering-time genes in Arabidopsis thaliana, and built seven machine learning models based on the Support Vector Machine (SVM) algorithm to discriminate FTAGs from non-FTAGs. The SVM-Kmer-PC-PseAAC model performed best (F1 score = 0.934, accuracy = 0.939, and area under the receiver operating characteristic curve (AUC) = 0.943). Using this model, we developed a plant FTAG prediction tool called 'FTAGs_Find', with which we identified a total of 318,521 FTAGs in the protein datasets of 81 species. Notably, in O. lucimarinus, a non-flowering plant, only 208 FTAGs were predicted in the whole genome, accounting for just 2.68% of all genes, which is consistent with extensive FTAG loss during evolution. To facilitate user access to the FTAG prediction tool and the FTAG dataset, we constructed a plant flowering-time-associated genes database (FTAGdb), which will be a valuable resource for researchers and breeders.

Keywords
- Flowering-time genes
- Support Vector Machine (SVM)
- Classification
- Database

Supplementary information

Supplemental Table S1 The FTAGs data of 81 examined species.
Supplemental Table S2 The hyperparameters of SVM predictive model.

Rights and permissions
Copyright: © 2023 by the author(s). Published by Maximum Academic Press on behalf of Hainan University. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.

References

[1] Hong L, Niu F, Lin Y, Wang S, Chen L, et al. 2021. MYB117 is a negative regulator of flowering time in Arabidopsis. Plant Signaling & Behavior 16:1901448 doi: 10.1080/15592324.2021.1901448
[2] Song J, Li B, Cui Y, Zhuo C, Gu Y, et al. 2021. QTL mapping and diurnal transcriptome analysis identify candidate genes regulating Brassica napus flowering time. International Journal of Molecular Sciences 22:7559 doi: 10.3390/ijms22147559
[3] Hassankhah A, Rahemi M, Ramshini H, Sarikhani S, Vahdati K. 2020. Flowering in Persian walnut: patterns of gene expression during flower development. BMC Plant Biology 20:136 doi: 10.1186/s12870-020-02372-w
[4] Yao T, Park BS, Mao HZ, Seo JS, Ohama N, et al. 2019. Regulation of flowering time by SPL10/MED25 module in Arabidopsis. The New Phytologist 224:493−504 doi: 10.1111/nph.15954
[5] Bouché F, Lobet G, Tocquin P, Périlleux C. 2016. FLOR-ID: an interactive database of flowering-time gene networks in Arabidopsis thaliana. Nucleic Acids Research 44:D1167−D1171 doi: 10.1093/nar/gkv1054
[6] Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, et al. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421 doi: 10.1186/1471-2105-10-421
[7] Connor CW. 2019. Artificial intelligence and machine learning in anesthesiology. Anesthesiology 131:1346−59 doi: 10.1097/ALN.0000000000002694
[8] Yuan Y, Cairns JE, Babu R, Gowda M, Makumbi D, et al. 2019. Genome-wide association mapping and genomic prediction analyses reveal the genetic architecture of grain yield and flowering time under drought and heat stress conditions in maize. Frontiers in Plant Science 9:1919 doi: 10.3389/fpls.2018.01919
[9] Wang X, Xuan H, Evers B, Shrestha S, Pless R, et al. 2019. High-throughput phenotyping with deep learning gives insight into the genetic architecture of flowering time in wheat. GigaScience 8:giz120 doi: 10.1093/gigascience/giz120
[10] Mora-Poblete F, Maldonado C, Henrique L, Uhdre R, Scapim CA, et al. 2023. Multi-trait and multi-environment genomic prediction for flowering traits in maize: a deep learning approach. Frontiers in Plant Science 14:1153040 doi: 10.3389/fpls.2023.1153040
[11] Satake A, Kawagoe T, Saburi Y, Chiba Y, Sakurai G, et al. 2013. Forecasting flowering phenology under climate warming by modelling the regulatory dynamics of flowering-time genes. Nature Communications 4:2303 doi: 10.1038/ncomms3303
[12] Meher PK, Mohapatra A, Satpathy S, Sharma A, Saini I, et al. 2021. PredCRG: a computational method for recognition of plant circadian genes by employing support vector machine with Laplace kernel. Plant Methods 17:46 doi: 10.1186/s13007-021-00744-3
[13] Li Z, Tang W, You X, Hou X. 2022. LSAP: a machine learning method for leaf-senescence-associated genes prediction. Life 12:1095 doi: 10.3390/life12071095
[14] Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, et al. 2021. Pfam: the protein families database in 2021. Nucleic Acids Research 49:D412−D419 doi: 10.1093/nar/gkaa913
[15] Potter SC, Luciani A, Eddy SR, Park Y, Lopez R, et al. 2018. HMMER web server: 2018 update. Nucleic Acids Research 46:W200−W204 doi: 10.1093/nar/gky448
[16] Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, et al. 2012. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Research 40:D1202−D1210 doi: 10.1093/nar/gkr1090
[17] Huang Y, Niu B, Gao Y, Fu L, Li W. 2010. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 26:680−82 doi: 10.1093/bioinformatics/btq003
[18] Liu B, Liu F, Wang X, Chen J, Fang L, et al. 2015. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Research 43:W65−W71 doi: 10.1093/nar/gkv458
[19] Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, et al. 2012. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Research 40:D1178−D1186 doi: 10.1093/nar/gkr944
[20] Sayers EW, Bolton EE, Brister JR, Canese K, Chan J, et al. 2022. Database resources of the national center for biotechnology information. Nucleic Acids Research 50:D20−D26 doi: 10.1093/nar/gkab1112
[21] Gupta P, Naithani S, Tello-Ruiz MK, Chougule K, D'Eustachio P, et al. 2016. Gramene database: navigating plant comparative genomics resources. Current Plant Biology 7−8:10−15 doi: 10.1016/j.cpb.2016.12.005
[22] Yu J, Zhao M, Wang X, Tong C, Huang S, et al. 2013. Bolbase: a comprehensive genomics database for Brassica oleracea. BMC Genomics 14:664 doi: 10.1186/1471-2164-14-664
[23] Li Z, Li Y, Liu T, Zhang C, Xiao D, et al. 2022. Non-heading Chinese cabbage database: an open-access platform for the genomics of Brassica campestris (syn. Brassica rapa) ssp. chinensis. Plants 11:1005 doi: 10.3390/plants11081005
[24] Zheng Y, Wu S, Bai Y, Sun H, Jiao C, et al. 2019. Cucurbit Genomics Database (CuGenDB): a central portal for comparative and functional genomics of cucurbit crops. Nucleic Acids Research 47:D1128−D1136 doi: 10.1093/nar/gky944
[25] Brown AV, Conners SI, Huang W, Wilkey AP, Grant D, et al. 2021. A new decade and new data at SoyBase, the USDA-ARS soybean genetics and genomics database. Nucleic Acids Research 49:D1496−D1501 doi: 10.1093/nar/gkaa1107
[26] Jayakodi M, Choi BS, Lee SC, Kim NH, Park JY, et al. 2018. Ginseng Genome Database: an open-access platform for genomics of Panax ginseng. BMC Plant Biology 18:62 doi: 10.1186/s12870-018-1282-9
[27] Sakai H, Naito K, Takahashi Y, Sato T, Yamamoto T, et al. 2016. The Vigna genome server, 'Vig GS': a genomic knowledge base of the genus Vigna based on high-quality, annotated genome sequence of the azuki bean, Vigna angularis (Willd.) Ohwi & Ohashi. Plant & Cell Physiology 57:e2 doi: 10.1093/pcp/pcv189
[28] Yu HJ, Baek S, Lee YJ, Cho A, Mun JH. 2019. The radish genome database (RadishGD): an integrated information resource for radish genomics. Database 2019:baz009 doi: 10.1093/database/baz009
[29] Plomion C, Aury JM, Amselem J, Leroy T, Murat F, et al. 2018. Oak genome reveals facets of long lifespan. Nature Plants 4:440−52 doi: 10.1038/s41477-018-0172-3
[30] Wei T, van Treuren R, Liu X, Zhang Z, Chen J, et al. 2021. Whole-genome resequencing of 445 Lactuca accessions reveals the domestication history of cultivated lettuce. Nature Genetics 53:752−60 doi: 10.1038/s41588-021-00831-0
[31] Wang X, Wu J, Liang J, Cheng F, Wang X. 2015. Brassica database (BRAD) version 2.0: integrating and mining Brassicaceae species genomic resources. Database 2015:bav093 doi: 10.1093/database/bav093
[32] Chalhoub B, Denoeud F, Liu S, Parkin IAP, Tang H, et al. 2014. Early allopolyploid evolution in the post-Neolithic Brassica napus oilseed genome. Science 345:950−53 doi: 10.1126/science.1253435
[33] Byrne SL, Erthmann PØ, Agerbirk N, Bak S, Hauser TP, et al. 2017. The genome sequence of Barbarea vulgaris facilitates the study of ecological biochemistry. Scientific Reports 7:40728 doi: 10.1038/srep40728
[34] Droc G, Larivière D, Guignon V, Yahiaoui N, This D, et al. 2013. The banana genome hub. Database 2013:bat035 doi: 10.1093/database/bat035
[35] Poza-Viejo L, Payá-Milans M, San Martín-Uriz P, Castro-Labrador L, Lara-Astiaso D, et al. 2022. Conserved and distinct roles of H3K27me3 demethylases regulating flowering time in Brassica rapa. Plant, Cell & Environment 45:1428−41 doi: 10.1111/pce.14258
[36] Qu G, Gao Y, Wang X, Fu W, Sun Y, et al. 2022. Fine mapping and analysis of candidate genes for qFT7.1, a major quantitative trait locus controlling flowering time in Brassica rapa L. Theoretical and Applied Genetics 135:2233−46 doi: 10.1007/s00122-022-04108-w
[37] Jung H, Lee A, Jo SH, Park HJ, Jung WY, et al. 2021. Nitrogen signaling genes and SOC1 determine the flowering time in a reciprocal negative feedback loop in Chinese cabbage (Brassica rapa L.) based on CRISPR/Cas9-mediated mutagenesis of multiple BrSOC1 homologs. International Journal of Molecular Sciences 22:4631 doi: 10.3390/ijms22094631
[38] Zhang C, Zhou Q, Liu W, Wu X, Li Z, et al. 2022. BrABF3 promotes flowering through the direct activation of CONSTANS transcription in pak choi. The Plant Journal 111:134−48 doi: 10.1111/tpj.15783
[39] Teng Z, Zheng W, Yu Y, Hong SB, Zhu Z, et al. 2021. Effects of BrMYC2/3/4 on plant development, glucosinolate metabolism, and Sclerotinia sclerotiorum resistance in transgenic Arabidopsis thaliana. Frontiers in Plant Science 12:707054 doi: 10.3389/fpls.2021.707054
[40] Wang Y, Song S, Hao Y, Chen C, Ou X, et al. 2023. Role of BraRGL1 in regulation of Brassica rapa bolting and flowering. Horticulture Research 10:uhad119 doi: 10.1093/hr/uhad119
[41] Lee A, Jung H, Park HJ, Jo SH, Jung M, et al. 2023. Their C-termini divide Brassica rapa FT-like proteins into FD-interacting and FD-independent proteins that have different effects on the floral transition. Frontiers in Plant Science 13:1091563 doi: 10.3389/fpls.2022.1091563
[42] Si S, Zhang M, Hu Y, Wu C, Yang Y, et al. 2021. BrcuHAC1 is a histone acetyltransferase that affects bolting development in Chinese flowering cabbage. Journal of Genetics 100:56 doi: 10.1007/s12041-021-01303-4
[43] Wei Q, Hu T, Xu X, Tian Z, Bao C, et al. 2022. The new variation in the promoter region of FLOWERING LOCUS T is involved in flowering in Brassica rapa. Genes 13:1162 doi: 10.3390/genes13071162

About this article

Cite this article

Zhang J, He S, Wang W, Chen F, Li Z. 2023. FTGD: a machine learning method for flowering-time gene prediction. Tropical Plants 2:23 doi: 10.48130/TP-2023-0023

Introduction

Flowering is a critical developmental stage in higher plants, indicating the transition from the vegetative phase to the reproductive phase^[1,2]. The timing of flowering significantly influences plant reproduction, crop yield, and overall plant fitness, making it essential to understand the molecular mechanisms for improving agricultural productivity^[3]. Substantial progress has been made in comprehending the mechanisms governing flowering time, with six pathways, including the GA pathway, age pathway, autonomous pathway, photoperiod pathway, temperature pathway, and vernalization pathway, identified as regulators of the timing of floral transition^[4]. To support systematic research on flowering-time-associated genes (FTAGs) in Arabidopsis thaliana, the Flowering Interactive Database (FLOR-ID: www.phytosystems.ulg.ac.be/florid) was established. Currently, the FLOR-ID database houses a comprehensive collection of 306 genes and provides links to 1646 articles, representing the collaborative work of more than 4600 scientists^[5]. This freely accessible database offers valuable resources for the study of flowering timing.

Presently, the identification of flowering-time genes relies primarily on wet-lab experiments, which are costly, time-consuming, and labor-intensive. Detecting flowering-time-associated genes with high-throughput omics technologies likewise demands significant human and financial resources. To address these challenges, computational and mathematical methods have emerged as promising alternatives. BLAST+^[6], a widely used bioinformatics tool, allows FTAGs to be detected through sequence similarity searches. However, it considers only sequence composition and order features and does not take a more comprehensive range of information into account, which leads to low recognition rates. The application of artificial intelligence has made significant strides in recent times, particularly in fields like textual analysis, self-learning, and image recognition^[7]. Machine learning (ML), a vital component of artificial intelligence, is used extensively across academic disciplines, including data analytics and gene discovery^[8]. Researchers have developed multi-trait and multi-environment genomic prediction methods for flowering traits^[9−11]. Meher et al.^[12] developed an ML model for identifying plant circadian genes, while our team recently proposed a method for recognizing leaf-senescence-associated genes using ML techniques^[13]. Notably, no machine learning method based on the protein sequences of FTAGs is currently available. This motivated our team to train an ML model for the identification of proteins encoded by flowering-time-associated genes.

In this study, we employed the support vector machine (SVM), one of the most commonly used ML methods, to discriminate between FTAGs and non-FTAGs using a protein sequence dataset. The SVM-Kmer-PC-PseAAC model performed best, with an F1 score of 0.934, an accuracy of 0.939, and an area under the receiver operating characteristic curve (AUC) of 0.943. Building upon this ML model, we developed a Python software tool named 'FTAGs_Find', which is made available to the research community and allows proteome-wide identification of flowering-time-associated genes. Furthermore, we conducted large-scale identification of FTAGs across 81 species using 'FTAGs_Find', shedding light on their evolutionary mechanisms. To facilitate access to the FTAGs dataset and the use of the 'FTAGs_Find' software, we have established the Plant Flowering-Time-Associated Genes Database (FTGD: www.sagsanno.top:8080/FTGD). We are confident that the FTGD database will prove to be a valuable and user-friendly resource for all researchers.

Discussions

Flowering indicates that the plant has completed the transition from the vegetative stage to the reproductive stage^[1,2]. Many advances have revealed that the photoperiod pathway, vernalization pathway, autonomous pathway, GA pathway, temperature pathway, and age pathway regulate the timing of floral transition^[4]. The Flowering Interactive Database integrates a comprehensive collection of 306 FTAGs, providing researchers with valuable resources for studying FTAGs.

In this study, a total of 628 protein sequences were collected from the FLOR-ID database^[5] and used to construct the positive dataset. The negative dataset consisted of 8,163 protein sequences downloaded from the TAIR^[16] database (www.arabidopsis.org). We addressed the resulting class imbalance by assigning different weights to the positive and negative sets. Subsequently, we developed seven SVM-based machine learning models to distinguish FTAGs from non-FTAGs. Based on the best-performing SVM-Kmer-PC-PseAAC classification model (F1 score = 0.934, accuracy = 0.939, and AUC = 0.943), we created a local Python program for the proteome-wide identification of proteins encoded by FTAGs. Compared to biological experiments and high-throughput omics technologies, our prediction tool 'FTAGs_Find' saves resources and time. Whereas the existing homology search tool BLAST+ takes into account only sequence composition and order features when identifying homologous genes, the predictive algorithm constructed in this study considers a broader range of information, including sequence composition, order features, and physicochemical properties.
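As a minimal sketch of two ideas in this paragraph, the Python below computes a 2-mer (dipeptide) composition vector, which has 20 × 20 = 400 entries and so matches the feature count reported for SVM-Kmer, and derives inverse-frequency class weights for the 628 positive and 8,163 negative sequences. The function names, the example peptide, and the "balanced" weighting formula are assumptions made for illustration; the article does not state the exact weights or the implementation used in FTAGs_Find.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k=2):
    """Return normalized k-mer frequencies in a fixed alphabetical order."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # skip windows with non-standard residues (e.g. X)
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def balanced_class_weights(class_counts):
    """Assumed scheme: weight = n_samples / (n_classes * n_in_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * n) for label, n in class_counts.items()}


features = kmer_composition("MKTAYIAKQR")  # hypothetical peptide, not from the dataset
weights = balanced_class_weights({"FTAG": 628, "non-FTAG": 8163})
print(len(features), weights)
```

Under this assumed scheme the minority FTAG class receives the larger weight. Weights of this kind could be handed to an SVM implementation that accepts per-class weights, for example through the `class_weight` parameter of scikit-learn's `SVC`.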

Next, a total of 318,521 FTAGs were identified from the protein datasets of 81 species, encompassing 69 higher plants and 12 lower plants. Among these 81 examined species, we detected 11,823 FTAGs among the 45,611 genes in the whole genome of Sphagnum fallax. Notably, Sphagnum fallax exhibited the highest proportion of FTAGs of all the examined species, at 25.92% of all genes. Interestingly, Sphagnum fallax belongs to the group of flowering plants, which suggests that FTAGs may have expanded following whole-genome duplication events in Sphagnum fallax. In contrast, O. lucimarinus, a non-flowering plant, displayed the lowest proportion of FTAGs (2.68%). This result indicates that FTAGs may have expanded in flowering plants and contracted in non-flowering plants.
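The proportion quoted for Sphagnum fallax follows directly from the two counts given in this paragraph. A small check, using a hypothetical helper name:

```python
def ftag_percentage(n_ftags, n_genes):
    """Percentage of annotated genes that were predicted to be FTAGs."""
    return 100.0 * n_ftags / n_genes


# Counts for Sphagnum fallax as reported in the text.
sphagnum = ftag_percentage(11823, 45611)
print(f"{sphagnum:.2f}%")  # 25.92%
```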

Finally, using available plant FTAGs datasets and the FTAGs_Find tool, we have constructed the Flowering-time Gene Database (FTGD: www.sagsanno.top:8080/FTGD), which enables users to download FTAGs datasets from 81 species and identify new FTAGs in other plants. In the future, we plan to incorporate additional plant FTAGs datasets into FTGD. We will also explore other machine learning methods, such as Random Forest algorithms, to enhance the performance of our prediction model. We believe that FTGD will prove to be a valuable resource for breeders and the flowering time research community.

Author contributions

The authors confirm contribution to the paper as follows: study conception and design: Li Z; performance of experiments: Zhang J, He S, Wang W, Chen F, Li Z; draft manuscript preparation and revision: Zhang J, He S, Chen F, Li Z. All authors reviewed the results and approved the final version of the manuscript.

Method	Number of features	F1 score	Accuracy (ACC)	AUC
SVM-ACC	27	0.769	0.811	0.849
SVM-Kmer	400	0.872	0.890	0.929
SVM-PC-PseAAC	22	0.766	0.810	0.915
SVM-Kmer-ACC	427	0.919	0.926	0.898
SVM-Kmer-PC-PseAAC	422	0.934	0.939	0.943
SVM-ACC-PC-PseAAC	49	0.792	0.829	0.896
SVM-ACC-Kmer-PC-PseAAC	449	0.887	0.901	0.909
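
Two things in the table above can be made concrete with a short sketch. First, the feature counts of the combined models equal the sums of the individual descriptor counts, which is what one would expect if the descriptor vectors are simply concatenated (an inference from the numbers, not a statement from the article). Second, the F1 score and accuracy columns follow the standard definitions from a confusion matrix. The confusion-matrix counts below are hypothetical and are not the authors' results.

```python
# Feature counts per descriptor, taken from the single-descriptor rows of the table.
FEATURE_COUNTS = {"Kmer": 400, "ACC": 27, "PC-PseAAC": 22}


def combined_dimension(descriptors):
    """Length of the vector obtained by concatenating the named descriptors."""
    return sum(FEATURE_COUNTS[name] for name in descriptors)


def classification_metrics(tp, fp, tn, fn):
    """Standard precision, recall, F1 score, and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


best_dim = combined_dimension(["Kmer", "PC-PseAAC"])  # 422, as in the table
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)  # hypothetical counts
print(best_dim, metrics)
```

The AUC column is a ranking-based measure computed from classifier scores rather than from a single confusion matrix, so it is not reproduced in this sketch.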
