Raw data for the SNP loci and sample calls were organized in Microsoft Excel (Microsoft 365 applications). Quality control was performed using the Quality Assurance Module of SNP Variation Suite version 8.9.0[35]. Any SNP with a no-call rate above 5% was removed from the data set. SNPs in linkage disequilibrium (LD) with each other at r² > 0.5 were also removed, leaving a data set of 219 Tcm SNPs for further analysis. These loci were randomly distributed across the cacao genome, with chromosomal locations as follows: Chromosome 1 (23), Chromosome 2 (21), Chromosome 3 (15), Chromosome 4 (27), Chromosome 5 (21), Chromosome 6 (26), Chromosome 7 (24), Chromosome 8 (18), Chromosome 9 (15), and Chromosome 10 (29) (Supplemental Table S2).
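The two filters described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SNP Variation Suite procedure: the function names and toy genotypes are hypothetical, LD is approximated by the squared Pearson correlation of genotypes coded 0/1/2, and a greedy rule (keep the first SNP of any correlated pair) is assumed because the text does not state which member of a linked pair was dropped.

```python
def no_call_rate(calls):
    """Fraction of samples with a missing genotype (None) at one SNP."""
    return sum(g is None for g in calls) / len(calls)


def r_squared(x, y):
    """Squared Pearson correlation of two genotype vectors coded 0/1/2,
    using only samples called at both SNPs (an approximation of LD r^2)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic in the shared samples: treat as uncorrelated
    return sxy ** 2 / (sxx * syy)


def filter_snps(genotypes, max_no_call=0.05, max_r2=0.5):
    """genotypes: dict of SNP name -> list of 0/1/2/None, one entry per sample."""
    # Step 1: drop SNPs whose no-call rate exceeds the threshold.
    kept = [s for s, calls in genotypes.items()
            if no_call_rate(calls) <= max_no_call]
    # Step 2 (assumed greedy rule): keep a SNP only if it is not in LD
    # with any SNP that has already been retained.
    pruned = []
    for snp in kept:
        if all(r_squared(genotypes[snp], genotypes[p]) <= max_r2 for p in pruned):
            pruned.append(snp)
    return pruned


# Hypothetical toy data: 20 samples, 4 SNPs.
base = [0, 2, 0, 2] * 5
toy = {
    "snp_a": base,
    "snp_b": list(base),                 # perfectly correlated with snp_a
    "snp_c": [None, None] + base[2:],    # 10% no-calls
    "snp_d": [0, 0, 2, 2] * 5,           # uncorrelated with snp_a
}
selected = filter_snps(toy)
print(selected)  # ['snp_a', 'snp_d']
```

In this toy set, snp_c fails the no-call filter and snp_b is pruned because it duplicates snp_a.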
Descriptive statistics
Four hundred and twenty DNA samples and three controls produced amplification (Supplemental Table S1). Summary statistics were computed based on the 420 samples and 219 selected Tcm SNP markers, and the results are presented in Supplemental Table S2. The mean value for Shannon's information index was 0.617, ranging from 0.398 to 0.693. The mean observed heterozygosity (HObs) was 0.247, ranging from 0.118 to 0.400. The mean gene diversity (expected heterozygosity) was 0.428, ranging from 0.235 to 0.500. The mean fixation index (FIS) was 0.419, ranging from 0.149 to 0.668. The mean minor allele frequency was 0.359, ranging from 0.150 to 0.500 (Supplemental Table S3). A Mantel test showed a highly significant correlation (r = 0.718; P < 0.001) between these 219 SNPs and the 91 SSR markers reported in a previous study[10] (Fig. 1).
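For a biallelic SNP, the per-locus statistics summarized above can be computed from genotype counts alone. The sketch below uses the standard textbook definitions (Shannon's index I = -Σ p ln p, expected heterozygosity He = 1 - Σ p², fixation index F = 1 - Ho/He). The function name and the genotype counts are hypothetical, and no small-sample correction is applied, so values may differ slightly from those produced by the software used in the study.

```python
import math


def locus_stats(n_AA, n_AB, n_BB):
    """Summary statistics for one biallelic SNP from its genotype counts."""
    n = n_AA + n_AB + n_BB
    p = (2 * n_AA + n_AB) / (2 * n)      # frequency of allele A
    q = 1 - p                            # frequency of allele B
    h_obs = n_AB / n                     # observed heterozygosity
    h_exp = 1 - (p ** 2 + q ** 2)        # expected heterozygosity (gene diversity)
    shannon = -sum(f * math.log(f) for f in (p, q) if f > 0)
    f_is = 1 - h_obs / h_exp if h_exp > 0 else float("nan")
    return {
        "maf": min(p, q),
        "h_obs": h_obs,
        "h_exp": h_exp,
        "shannon": shannon,
        "f_is": f_is,
    }


# Hypothetical locus: 30 AA, 40 AB and 30 BB individuals.
stats = locus_stats(30, 40, 30)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With equal allele frequencies, as in this example, He reaches 0.5 and I reaches ln 2 ≈ 0.693, which correspond to the upper ends of the ranges reported above for a two-allele marker.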
Figure 1.
Mantel test results indicating significant correlation (r = 0.718; P < 0.001) between SNPs and SSR markers.
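A Mantel test of this kind correlates the off-diagonal entries of two distance matrices and assesses significance by permuting the sample labels of one matrix. The following is a minimal pure-Python sketch, assuming Pearson correlation and a one-sided permutation P-value; the function names, the seed, and the toy matrices are illustrative and do not come from the study.

```python
import random


def upper_triangle(m):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mantel(d1, d2, permutations=999, seed=0):
    """Return (r, P) for two symmetric distance matrices over the same samples."""
    rng = random.Random(seed)
    n = len(d1)
    x = upper_triangle(d1)
    r_obs = pearson(x, upper_triangle(d2))
    hits = 0
    idx = list(range(n))
    for _ in range(permutations):
        rng.shuffle(idx)
        # Permute rows and columns of d2 together (relabel the samples).
        permuted = [[d2[idx[i]][idx[j]] for j in range(n)] for i in range(n)]
        if pearson(x, upper_triangle(permuted)) >= r_obs:
            hits += 1
    # Count the observed arrangement itself among the permutations.
    return r_obs, (hits + 1) / (permutations + 1)


# Hypothetical example: five samples placed on a line; the second matrix
# is the first one rescaled, so the two are perfectly correlated.
positions = [0, 1, 3, 6, 10]
snp_dist = [[abs(a - b) for b in positions] for a in positions]
ssr_dist = [[2 * v for v in row] for row in snp_dist]
r, p = mantel(snp_dist, ssr_dist)
print(f"r = {r:.3f}, P = {p:.3f}")
```

With 999 permutations the smallest attainable P-value under this convention is 1/1000.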
Inference of population structure
From the STRUCTURE analysis, the most probable number of genetically distinct groups (K) was two (Fig. 2a), based on Evanno's Delta K value[40]. However, when the STRUCTURE results were analyzed using the method of Puechmaille[41], as implemented in STRUCTURE SELECTOR[39], all four supervised estimators (MedMedK, MedMeaK, MaxMedK and MaxMeaK) suggested an optimum K of nine populations (Fig. 2b).
Figure 2.
(a) Number of clusters based on Evanno's Delta K value[40]. (b) Inferred clusters obtained using the method of Puechmaille. (c) Population structure of the 420 cacao (Theobroma cacao L.) germplasm accessions, containing representative genotypes of the nine cacao genetic groups, obtained using Structure v2.3.3. Black vertical lines separate the genetic groups. Multiple colors within a genetic group indicate admixed individuals.
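The Delta K statistic used in panel (a) compares the second-order change in the log-likelihood across successive K values with its run-to-run variability. A minimal sketch follows, assuming the common formulation in which the second difference is taken on the mean ln P(D) over replicate runs and divided by the standard deviation at that K. The function name and the log-likelihood values are invented for illustration and are not the study's STRUCTURE output.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """lnp_by_k: dict of K -> list of ln P(D) values from replicate runs.
    Returns a dict of K -> Delta K for every K with both neighbours present."""
    means = {k: mean(v) for k, v in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in means and k + 1 in means:
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            # stdev() needs at least two runs; identical runs would give a
            # zero denominator, which this sketch does not handle.
            result[k] = second_diff / stdev(lnp_by_k[k])
    return result


# Hypothetical ln P(D) values, two replicate runs per K.
runs = {
    1: [-1000.0, -1002.0],
    2: [-800.0, -802.0],
    3: [-790.0, -794.0],
    4: [-785.0, -795.0],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(best_k)  # 2
```

Delta K cannot be evaluated at the smallest or largest K tested, which is one reason complementary estimators such as those of Puechmaille are often consulted as well.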
At K = 9, seven out of the ten populations had assignment results consistent with the SSR-based study reported previously[10]. These populations include Amelonado, Curaray, Guiana, Marañon, Nanay, and Purús (Fig. 2c; Supplemental Table S4). However, the Nacional and Contamana populations were grouped together. Moreover, a discrepancy was found within the Iquitos population: the germplasm from Iquitos, Peru and that from Rio Solimões, Brazil were separated into two distinct groups. This represents 76% of the 420 samples used (Supplemental Table S1) and constitutes 44% of the samples used in the initial classification of the ten genetic groups[10]. Their distribution across the genetic groups is as follows: Amelonado 46% (43/94), Contamana 51% (18/69), Criollo 3% (1/39), Curaray 56% (66/117), Guiana 47% (28/59), Iquitos 34% (40/117), Marañon 45% (65/143), Nacional 21% (11/52), Nanay 64% (98/152), and Purús 45% (50/110). The highest DNA amplification was obtained in samples from the Nanay group and the lowest in the Criollo group. For this reason, the Criollo sample was not included in the PCoA and STRUCTURE analysis. Samples with a Q-value ≥ 0.75 were selected as reference clones for each of the corresponding populations (Supplemental Table S3).
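The reference-clone rule described above (assign each sample to its highest-membership cluster and keep it only if that membership reaches 0.75) can be expressed as a short filter. This is a sketch with hypothetical accession names and membership coefficients; the function name is also an assumption.

```python
def select_reference_clones(q_matrix, threshold=0.75):
    """q_matrix: dict of sample name -> list of membership coefficients (Q),
    one per cluster. Returns a dict of cluster index -> reference samples."""
    references = {}
    for sample, q in q_matrix.items():
        cluster = max(range(len(q)), key=q.__getitem__)  # best-supported cluster
        if q[cluster] >= threshold:
            references.setdefault(cluster, []).append(sample)
    return references


# Hypothetical Q-values for four accessions and three clusters.
q_values = {
    "acc_1": [0.90, 0.05, 0.05],
    "acc_2": [0.40, 0.35, 0.25],   # admixed: no cluster reaches the threshold
    "acc_3": [0.10, 0.80, 0.10],
    "acc_4": [0.75, 0.20, 0.05],   # exactly at the threshold, so retained
}
refs = select_reference_clones(q_values)
print(refs)  # {0: ['acc_1', 'acc_4'], 1: ['acc_3']}
```

Admixed accessions such as acc_2 are left out of the reference set but remain part of the overall data set.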
Relationship among different populations
Principal coordinate analysis (PCoA) based on the results of the STRUCTURE analysis is presented in Fig. 3a and 3b, providing a complementary illustration of the relationships among the nine genetic groups. The first three main axes accounted for 23.1%, 7.6%, and 3.8% of the total variation, respectively. The distinctiveness of the nine clusters was clearly revealed. The results of the analysis of molecular variance (AMOVA) provide additional evidence of significant population differentiation (Table 1). The within-population molecular variance accounted for 47.0% of the total, whereas the among-population molecular variance accounted for 53.0%. The inter-population differentiation was highly significant, as shown by Phi-statistics[49] (P < 0.001) (Table 2). The Fst values ranged from 0.038 (Nacional vs Contamana) to 0.194 (Amelonado vs Nanay), with an average of 0.109 among all the populations (Table 3).
Figure 3.
Principal Coordinates Analysis plots of 420 cacao accessions belonging to nine genetic groups. The first three main axes accounted for 23.1%, 7.6% and 3.8% of the total variation, respectively.
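The AMOVA percentages reported in the text and in Table 1 follow directly from the estimated variance components: the among-population share of the total estimated variance is the overall PhiPT. The short check below uses the two estimated variances listed in Table 1; the function name is an assumption made for illustration.

```python
def amova_partition(var_among, var_within):
    """Split total estimated variance into among- and within-population shares."""
    total = var_among + var_within
    return {
        "phi_pt": var_among / total,
        "pct_among": 100 * var_among / total,
        "pct_within": 100 * var_within / total,
    }


# Estimated variance components taken from Table 1.
result = amova_partition(var_among=53.84, var_within=47.69)
print(f"PhiPT = {result['phi_pt']:.3f}")
print(f"among = {result['pct_among']:.0f}%, within = {result['pct_within']:.0f}%")
```

The two shares round to 53% and 47%, in agreement with the percentages given for the among- and within-population components.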
Table 1. Analysis of molecular variance (AMOVA) for the nine cacao genetic groups.
Source of variation   d.f.   Sum of squares   Mean squares   Est. Var   %
Among populations     9      12,047.62        1,338.62       53.84      53
Within populations    215    11,874.38        47.69          47.69      47
Total                 435    23,922.00        16.57380       101.53     100
Table 2. Pairwise population PhiPT values among nine cacao germplasm groups.
Populations  Amelonado  Contamana  Curaray    Guiana     Iquitos    Marañon    Nacional   Nanay      Purús I    Purús II
Amelonado    0.000      0.001      0.001      0.001      0.001      0.001      0.001      0.001      0.001      0.001
Contamana    0.633      0.000      0.001      0.001      0.001      0.001      0.011      0.001      0.001      0.001
Curaray      0.645      0.407      0.000      0.001      0.001      0.001      0.001      0.001      0.001      0.001
Guiana       0.702      0.561      0.540      0.000      0.001      0.001      0.001      0.001      0.001      0.001
Iquitos      0.695      0.462      0.500      0.634      0.000      0.001      0.001      0.001      0.001      0.001
Marañon      0.539      0.406      0.443      0.477      0.444      0.000      0.001      0.001      0.001      0.001
Nacional     0.622      0.095      0.436      0.561      0.499      0.395      0.000      0.001      0.001      0.001
Nanay        0.712      0.593      0.558      0.644      0.489      0.547      0.673      0.000      0.001      0.001
Purús I      0.637      0.319      0.373      0.568      0.437      0.404      0.317      0.555      0.000      0.001
Purús II     0.577      0.300      0.382      0.508      0.360      0.367      0.336      0.483      0.268      0.000
Note: PhiPT values are shown below the diagonal. The probability, P(rand ≥ data), based on 999 permutations is shown above the diagonal.
Table 3. Pairwise population Fst values based on the result of population stratification. Within each population, samples with an assignment coefficient > 0.75 were retained for analysis.
Populations  Amelonado  Contamana  Curaray    Guiana     Iquitos    Marañon    Nacional   Nanay      Purús I    Purús II
Amelonado    0.000
Contamana    0.181      0.000
Curaray      0.179      0.092      0.000
Guiana       0.180      0.143      0.124      0.000
Iquitos      0.180      0.117      0.113      0.150      0.000
Marañon      0.129      0.094      0.095      0.104      0.098      0.000
Nacional     0.166      0.038      0.101      0.138      0.129      0.091      0.000
Nanay        0.194      0.135      0.121      0.150      0.087      0.115      0.176      0.000
Purús I      0.168      0.085      0.078      0.133      0.100      0.087      0.083      0.114      0.000
Purús II     0.157      0.083      0.086      0.125      0.082      0.077      0.094      0.092      0.065      0.000
The UPGMA tree (Fig. 4) provided complementary information regarding the inter-population relationships. The clustering pattern is largely consistent with the previous SSR-based result[10]. As in the STRUCTURE stratification, the Nacional and Contamana populations were grouped together, which is also compatible with the results of the PCoA. In addition, the Guiana and Amelonado populations were grouped together in the UPGMA tree, whereas the Iquitos and Nanay populations fell in the same main group. Purús I and Purús II were grouped together. All branches were supported by bootstrap values above 50%, ranging from 516 to 1000 in the consensus tree (Fig. 4).
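UPGMA builds a tree by repeatedly merging the two closest clusters and averaging their distances to all remaining clusters, weighted by cluster size. The sketch below applies that procedure to four populations taken from Table 3 purely for illustration. It assumes Fst as the distance measure, which this section does not confirm was the basis of Fig. 4, and a four-population subset will not reproduce the full tree. The function name and data layout are hypothetical.

```python
def upgma(pairwise):
    """pairwise: dict of (label_a, label_b) -> distance, one entry per pair.
    Returns (tree, merges): a nested tuple and a list of (left, right, height)."""
    sizes, dist = {}, {}
    for (a, b), value in pairwise.items():
        sizes[a] = sizes[b] = 1
        dist[frozenset((a, b))] = value
    merges = []
    while len(sizes) > 1:
        pair = min(dist, key=dist.get)        # closest pair of clusters
        a, b = sorted(pair, key=str)          # fixed order for reproducible output
        height = dist.pop(pair) / 2           # node height is half the distance
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        for c in sizes:
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            # UPGMA update: size-weighted mean of the two old distances.
            dist[frozenset((merged, c))] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
            )
        sizes[merged] = size_a + size_b
        merges.append((a, b, height))
    return next(iter(sizes)), merges


# Pairwise Fst values for four populations, copied from Table 3.
fst = {
    ("Amelonado", "Contamana"): 0.181,
    ("Amelonado", "Nacional"): 0.166,
    ("Amelonado", "Nanay"): 0.194,
    ("Contamana", "Nacional"): 0.038,
    ("Contamana", "Nanay"): 0.135,
    ("Nacional", "Nanay"): 0.176,
}
tree, merges = upgma(fst)
print(tree)
for left, right, height in merges:
    print(left, right, round(height, 4))
```

In this subset the first merge joins Contamana and Nacional, the pair with the smallest Fst in Table 3, which agrees with their grouping in the STRUCTURE and PCoA results.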
-
Various SNP genotyping sets have been used for cacao germplasm identification. However, these panels have not been systematically evaluated for genotyping efficiency or for population and sub-population classification. The ideal genotyping panel should comprise a minimum number of SNP markers with maximum discriminating power. Moreover, the capacity to infer the population origin of a given cacao accession is essential for germplasm identification when a reference SNP profile is not available. Linkage disequilibrium is a critical factor for efficient germplasm identification, because each SNP marker is expected to be independently informative. In the present study, we evaluated 956 SNPs on 451 wild cacao samples with known population origin. Based on the criteria of LD ≤ 0.5, call rate > 95% and minor allele frequency (MAF) > 0.15, we selected a total of 219 SNPs. Population stratification with these SNPs was highly compatible with that obtained from previously reported SSR markers, and a Mantel test of the distance matrices from the SSR and SNP markers showed a high correlation (r = 0.718; P < 0.001). In addition, the present study generated complementary insight regarding the classification of wild cacao populations and sub-populations in the Amazon region.
These newly selected SNPs can also be combined with previously identified SNP markers, e.g., the TcSNPs that have been commonly used in cacao germplasm identification, to form different genotyping panels. The generated SNP profiles can be converted into a simple bar code and used in many downstream applications, such as nursery accreditation, clone registration and the authentication of geographically referenced cocoa beans. This is our pilot project for the development of SNP markers reflecting population origin for cacao (Theobroma cacao L.) germplasm identification. Marker evaluation is continuing, with emphasis on selecting SNP markers to detect sub-population structures in the primary gene pool of T. cacao.
-
About this article
Cite this article
Gutiérrez OA, Martinez K, Zhang D, Livingstone DS, Turnbull CJ, et al. 2021. Selecting SNP markers reflecting population origin for cacao (Theobroma cacao L.) germplasm identification. Beverage Plant Research 1: 15 doi: 10.48130/BPR-2021-0015
Selecting SNP markers reflecting population origin for cacao (Theobroma cacao L.) germplasm identification
- Received: 30 September 2021
- Accepted: 26 November 2021
- Published online: 27 December 2021
Abstract: Cacao is one of the most economically important agricultural commodities in the world, providing the principal ingredient for the global chocolate industry. Accurate genotype identification is essential for effective conservation and utilization of cacao germplasm. Here, we report the screening of 956 candidate SNPs, pre-selected from the 6K and 15K Theobroma cacao SNP Arrays, using targeted Genotyping-by-Sequencing on 451 cacao germplasm accessions representing ten known genetic groups from the tropical Americas. Based on call rate (no-call rate < 5%), Minor Allele Frequency (MAF > 0.15) and Linkage Disequilibrium (LD ≤ 0.5), a total of 219 SNPs were selected. The efficacy of these SNP markers for population classification was compared with the previous SSR-based analysis in cacao. The population assignment results for the retained 420 cacao accessions were highly comparable with those of the SSR study. The genetic distance matrices from the SSR and SNP markers were highly correlated (r = 0.718; P < 0.001). These results demonstrate the consistency of the present SNP markers for cacao germplasm identification. This is our pilot project for the development of SNP markers reflecting population origin for cacao germplasm identification. These SNP markers and the selected reference germplasm for different populations are suitable for use in cacao germplasm management and crop improvement, including genotype identification, seed garden and nursery accreditation, and cocoa authentication. Work is continuing, with emphasis on selecting SNP markers for the detection of sub-population structures in the primary gene pool of T. cacao.
-
Key words:
- SNP
- Theobroma cacao L.
- Population structure
- Genetic groups