Progress and perspectives on genomic selection models for crop breeding

Dongfeng Zhang; Feng Yang; Jinlong Li; Zhongqiang Liu; Yanyun Han; Qiusi Zhang; Shouhui Pan; Xiangyu Zhao; Kaiyi Wang

doi:10.48130/tia-0025-0002

Figure 1.
An overview of genomic selection. (a) Genomic selection comprises three parts: the training dataset, the models, and the evaluation dataset. The training dataset consists of phenotypic data collected from field trials and genotypic data generated in the marker lab. The models are trained using one of two strategies: statistical methods or machine learning. The evaluation dataset consists of genotypic data and the phenotypes predicted from them. Materials are selected according to their predicted phenotypes and then advanced to field experiments. (b) Summary of models in GS.
Figure 2.
An overview of the random forest^[51]. The random forest combines bootstrap samples with weak learners, each a decision tree built using the Gini index.
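The two ingredients named in this caption can be sketched in a few lines. This is a minimal illustration, not the implementation from the cited work: the function names (`gini_impurity`, `split_gini`, `bootstrap_indices`) and the toy class labels are assumptions made for the example, and continuous traits would normally use a regression criterion instead of Gini.

```python
import random
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini impurity of a candidate split (lower is better)."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement: one bootstrap sample for one tree."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
idx = bootstrap_indices(10, rng)       # rows used to grow one hypothetical tree
pure = gini_impurity(["tall"] * 4)     # a node holding a single class
mixed = gini_impurity(["tall", "tall", "short", "short"])  # an evenly mixed node
```

A tree picks, at each node, the split with the lowest weighted impurity; the forest then averages or votes over trees grown on different bootstrap samples.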
Figure 3.
An overview of LightGBM^[57]. LightGBM combines gradient-based one-side sampling (GOSS), exclusive feature bundling (EFB), histogram-based feature selection, and leaf-wise growth of the decision tree.
Figure 4.
An architecture of a multilayer perceptron^[62]. The multilayer perceptron has one input layer (a0) and one output layer (a5). The hidden part may consist of several layers (here a1 to a4).
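The layer-by-layer computation in this architecture can be sketched as a forward pass. The caption does not state the activation function, so the ReLU used here is an assumption, as are the function names, the toy weights, and the 0/1/2 marker coding of the input.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

def dense(x, weights, bias):
    """One fully connected layer: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def mlp_forward(x, layers):
    """Apply the hidden layers with ReLU, then a linear output layer."""
    a = x
    for weights, bias in layers[:-1]:
        a = relu(dense(a, weights, bias))
    weights, bias = layers[-1]
    return dense(a, weights, bias)

# Toy network: 3 markers -> 2 hidden units -> 1 predicted phenotype.
layers = [
    ([[1.0, 0.0, 1.0], [-1.0, 1.0, 0.0]], [0.0, 0.0]),  # hidden layer
    ([[0.5, 2.0]], [1.0]),                               # output layer
]
prediction = mlp_forward([1.0, 0.0, 2.0], layers)
```

Adding more `(weights, bias)` pairs before the last entry gives the deeper variants, such as a network with four hidden layers between input and output.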
Figure 5.
An architecture of a convolutional neural network^[67]. (a) General CNN architecture, including a convolutional layer, a pooling layer, and a fully connected layer. (b) Illustration of the convolution computation. (c) Max pooling method.
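Panels (b) and (c) can be illustrated with a small worked example. This is a sketch under assumptions: stride 1 and no padding for the convolution, non-overlapping windows for the pooling, and toy input values that do not come from the paper. As in most CNN libraries, the "convolution" is computed without flipping the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2d(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

image = [[1, 2, 3, 0],
         [0, 1, 2, 3],
         [3, 0, 1, 2],
         [2, 3, 0, 1]]
kernel = [[1, 0],
          [0, 1]]

feature_map = conv2d_valid(image, kernel)  # [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
pooled = max_pool2d(image)                 # [[2, 3], [3, 2]]
```

In a full network the pooled feature maps are flattened and passed to the fully connected layer.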
Figure 6.
Factors affecting the prediction ability of GS models. (a), (b) Comparison of nine GS algorithms on wheat, using the Pearson correlation coefficient as the measure of prediction ability. (a) Plant height, with two major QTLs and a heritability of 75.7% in 2014 and 76.5% in 2015; (b) yield, with five major QTLs and a heritability of 70.1% in 2014 and 85.6% in 2015. 'MLP_1' and 'MLP_5' denote multilayer perceptrons with one and five hidden layers, respectively; 'CNN' denotes a network with one convolutional layer, one pooling layer, and one fully connected layer; 'Alexnet' denotes a model based on the AlexNet architecture. (c), (d) Impact of marker number on prediction accuracy. Thirteen marker sets were randomly selected from 30,548 markers, and prediction accuracy was evaluated with (c) the BayesA model and (d) the MLP_1 model (multilayer perceptron with one hidden layer). (e), (f) Impact of feature processing on prediction accuracy. Six feature-processing methods were evaluated with (e) the MLP_1 model and (f) the CNN model. 'Null' is all markers; 'minmax' is min-max scaling; 'standard' is z-score normalization; 'robust' is robust feature processing; 'PCA' is principal component analysis; 'variance' is variance scaling. All models were validated by five-fold cross-validation repeated 30 times. The least significant difference (LSD) test was used to assess significance at a threshold of 0.05.
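The validation scheme in this caption (five-fold cross-validation repeated 30 times, scored by the Pearson correlation between predicted and observed phenotypes) can be sketched as follows. All names are hypothetical, the data are synthetic, and `one_marker_ols` is a deliberately trivial stand-in model, not one of the nine algorithms compared in the figure.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def k_fold_splits(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k disjoint test folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]

def repeated_cv(X, y, fit_predict, k=5, repeats=30, seed=0):
    """Return one Pearson score per fold per repeat (k * repeats scores)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for test in k_fold_splits(len(y), k, rng):
            held_out = set(test)
            train = [i for i in range(len(y)) if i not in held_out]
            preds = fit_predict([X[i] for i in train],
                                [y[i] for i in train],
                                [X[i] for i in test])
            scores.append(pearson(preds, [y[i] for i in test]))
    return scores

def one_marker_ols(X_train, y_train, X_test):
    """Stand-in model: least-squares regression on the first feature only."""
    x = [row[0] for row in X_train]
    mx, my = mean(x), mean(y_train)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y_train))
             / sum((a - mx) ** 2 for a in x))
    return [my + slope * (row[0] - mx) for row in X_test]

# Synthetic data: 30 lines, one feature, phenotype = 2 * feature + small noise.
X = [[float(i)] for i in range(30)]
y = [2.0 * i + (i % 3 - 1) for i in range(30)]
scores = repeated_cv(X, y, one_marker_ols)
```

Any model can be plugged in through `fit_predict`, so the same loop compares algorithms, marker subsets, or feature-processing choices on identical folds.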

Comparison item	ML-based GS algorithms	Statistical algorithms
Data handling capacity	Process high-dimensional datasets, handle omics data	Limited to traditional markers
Non-linear relationship	Capture non-linear relationships and enhance model performance	Struggle with non-linear relationships
Computational resources	Require significant computational resources	Require fewer resources
Interpretability	Act as black boxes, difficult to interpret	Provide transparent models
Applicability	Offer flexible processing, require tuning	Suit linear relationships

Table 1.

Comparison of ML-based GS algorithms and statistical algorithms.

Crop	Population size	Marker no.	Performance	Ref.
Wheat	2,374	39,758	GBLUP ≥ MLP	[63]
Wheat	250	12,083	GBLUP ≥ MLP	[63]
Wheat	693, 670, 807	15,744	GBLUP ≥ MLP	[63]
Maize	309	158,281	GBLUP ≥ MLP	[63]
Wheat	767, 775, 964, 980, 945, 1,145	2,038	GBLUP ≥ MLP ≈ SVM	[64]
Maize	2,267	19,465	MLP > Lasso	[100]
Maize	4,328	564,692	GBLUP ≈ BayesR ≈ SVM	[49]
Barley	400	50,000	Transformer ≈ BLUP	[83]
Maize	8,652	32,559	LightGBM > rrBLUP	[60]
Wheat	2,000	33,709	LightGBM ≈ DNNGP > GBLUP	[70]
Maize	1,404	6,730,418	SVR ≈ DNNGP > GBLUP	[70]
Wheat	599	1,447	SVR ≈ DNNGP > GBLUP	[70]

Table 2.

Summary of performance comparisons between ML and traditional methods.