-
Through a comprehensive review of relevant literature on poplar transcriptome sequencing, a dataset comprising 298 RNA-seq samples focusing on poplar stem research was compiled for this study (Supplemental Dataset 1). These samples provide a comprehensive representation of poplar stem development, enabling the identification and evaluation of candidate RGs. The 298 samples were aligned to the Populus trichocarpa v3.0 genome, resulting in 294 samples with an alignment rate exceeding 70% (Supplemental Dataset 2). Among these, samples K5_20 and K5_28 had alignment rates higher than 80% (84.80% and 84.49% respectively), but were excluded from further analysis due to their low read counts[41]. Consequently, transcriptome data from a total of 292 samples were retained for subsequent analysis in this study.
In addition, a collection of 37 reported RGs in poplar stems was compiled. After removing duplicate genes based on their gene ID numbers, a set of 30 reported RGs was obtained for further analysis (Supplemental Table S2).
Selection of normalization method
-
By analyzing the upper and lower quartile values of gene expression levels in the 292 samples before and after normalization, and presenting the results in bar graphs, the differences in gene expression levels among samples were reduced after normalization with RPKM/FPKM, TPM, TMM, and DESeq2's median of ratios compared to without normalization (RC). Specifically, the distribution of gene expression levels between samples became more similar and less variable after DESeq2 and TMM normalization, indicating that these methods performed better than RPKM/FPKM and TPM (Fig. 2a). Furthermore, the CV of the relative expression of all genes in the 292 samples were examined. After normalization with RPKM/FPKM, TPM, TMM, and DESeq2's median of ratios, the median CV values were 1.52, 1.51, 1.47, and 1.48, respectively, with TMM and DESeq2's median of ratios exhibiting more favorable effects compared to RPKM/FPKM and TPM (Fig. 2b). To evaluate the effect of different normalization methods, the changes in the expression levels of the 30 reported RGs were analyzed. The median and mean CVs for unnormalized RC were 0.90 and 0.97, respectively. After normalization with RPKM/FPKM, TPM, TMM, and DESeq2's median of ratios, the median CVs were 0.52, 0.52, 0.53, and 0.51, respectively; and the mean CVs were 0.58, 0.56, 0.57, and 0.55, respectively (Fig. 2c).
Figure 2.
Comparison of normalization methods. RC (raw counts) represents raw counts without normalization, while FPKM, TPM, TMM, and DESeq2 are results after normalization using four different methods. (a) Distribution of counts for all genes in each sample before and after normalization using 292 samples. Bar plots are employed instead of box plots, with the upper limit of the bar representing the upper quartile and the lower limit representing the lower quartile of the data. (b) Box plot demonstrates the coefficient of variance (CV) of gene expression levels for all genes before and after normalization, utilizing data from 292 samples. (c) Box plot depicting the CV for 30 previously reported reference genes (RGs) after different normalizations across all 292 samples, with the central white point indicating the mean CV.
DESeq2's median of ratios and TMM showed similar effects in normalizing the expression levels of the 30 reported RGs in poplar. However, DESeq2's median of ratios yielded the smallest CV values and was therefore chosen as the method for sample normalization in the subsequent steps.
Screening and functional annotation of stably expressed genes
-
To screen out the RGs suitable for various poplar species and experimental treatments, the RNA-Seq data of the 292 samples were further analyzed. After RC was normalized by the DESeq2's median of ratios method, the mean, standard deviation (SD), and CV of the relative expression levels of each gene were calculated. A total of 541 stably expressed genes (CV ≤ 0.3) were screened, including four reported RGs (PP2A-2, EIF-A, CDPK and ATPase) (Fig. 3a, Supplemental Dataset 3). The log values of the relative expression levels' mean of the 541 stably expressed genes ranged from 5.91 to 13.03; and for the 30 reported RGs, the log values ranged from 7.17 to 16.09 (Fig. 3b). This indicates that the stably expressed genes identified in this study have similar expression levels to the reported RGs. GO and KEGG enrichment analyses were performed on the stably expressed genes and the reported RGs. The results indicate that the stably expressed genes identified in this study share functional similarities with the reported RGs (Supplemental Dataset 4).
Figure 3.
Comparative analysis between stably expressed genes and reported reference genes (RGs). This figure presents a comparison of expression patterns between stably expressed genes and previously reported RGs. (a) Venn diagram illustrates the overlap between stably expressed genes (with CV ≤ 0.3) identified from 292 RNA-Seq samples and the reported RGs. (b) The density distribution of gene expression levels is depicted for all genes in the Populus genome, stably expressed genes, and reported RGs. The X-axis represents log2(counts + 1), and the Y-axis represents gene density.
Selection of candidate RGs
-
From the 541 stably expressed genes, 12 genes (Table 1) with the smallest CV values among the 292 samples were selected as candidate RGs, based on the normalization method of DESeq2's median of ratios. Comparing the stability of expression levels between these candidate RGs and the reported RGs, the former had CV values ranging from 0.16 to 0.20, and the latter had CV values ranging from 0.23 to 1.53. The expression of the candidate RGs showed better stabilities (lower CV values) than the reported RGs in the 292 samples (Fig. 4). Furthermore, using the six reported RGs with the smallest CV values as comparisons, the expression stability of 12 candidate RGs was verified by RT-qPCR analysis.
Table 1. Information of candidate reference genes (RGs) with the smallest coefficient of variance (CV) values among the 292 RNA-Seq samples.
Gene ID Gene name Description CV Potri.001G349400 CNOT2 CCR4-NOT transcription complex subunit 2 0.168 Potri.002G157500 RH8 Similar to DEAD/DEAH box helicase 0.172 Potri.005G110600 VPS35 Vacuolar protein sorting-associated protein 35 0.173 Potri.002G197600 FIP37.1 Similar to ARABIDOPSIS THALIANA FKBP12 INTERACTING PROTEIN 37 0.175 Potri.013G070001 NA UDP-glucose pyrophosphorylase 0.186 Potri.001G197400 Pt-UBP6.2 Similar to UBIQUITIN-SPECIFIC PROTEASE 6 0.190 Potri.006G116700 U2AF1 Splicing factor U2AF 35 kDa subunit 0.193 Potri.008G111700 NA Predicted hydrolases of HD superfamily 0.194 Potri.011G084400 CUL4 Similar to hypothetical protein 0.194 Potri.004G064400 NA Similar to ankyrin protein kinase 0.198 Potri.008G217300 Pt-CUL1.4 Similar to cullin-like protein1 0.199 Potri.003G045700 Pt-ATRLI1.2 Similar to RNase L inhibitor protein; putative 0.202 NA: not available. The estimated CVs were based on the normalization method of DESeq2's median of ratios. Figure 4.
Assessment of candidate and reported reference genes (RGs) stability based on RNA-Seq data. Box plots are employed to display the distribution of expression levels for 12 candidate RGs and 30 reported RGs. The left side shows the expression distribution of candidate RGs, while the right side shows the expression distribution of reported RGs. The small circles on the red lines represent the coefficient of variance (CV) for the gene expression levels.
Design of RT-qPCR primers and validation of amplification efficiency
-
To design universal primers suitable for different poplar species, the workflow illustrated in Fig. 5a was developed. The primary transcript sequences of 18 genes in P. trichocarpa was used for blastn to retrieve their homologous genes from P. trichocarpa, P. euphratica, P. deltoides, and poplar 717. For example, when examining the primary transcript sequence of Potri.005G10600, a total of 13 homologous transcripts were retrieved from blastn searches of the five poplar genomes. A phylogenetic tree was constructed using the primary transcript sequences of Potri.005G10600 and it's 13 homologues to identify the former's orthologs (Fig. 5b). Multiple sequence alignments were performed between Potri.005G10600 and its orthologs to obtain consensus sequences for primer design (Fig. 5c). The specificity of the primers on the five poplar genomes was verified using the mfeprimer-3.2.0 software to obtain universal primers for these species. Finally, RT-qPCR primers for all 18 RGs, including 12 candidate RGs and six reported RGs, were designed according to the method described above (Supplemental Table S5).
Figure 5.
RT-qPCR primer design. This figure illustrates the process of designing RT-qPCR primers using Potri.005G110600 as a representative example. (a) Flowchart outlining the primer design process is presented. (b) Phylogenetic tree showing the target gene and homologous genes in five different Populus species, with color-coded gene names representing the most homologous genes to the target gene. (c) Multiple sequence alignment of transcript sequences for the target gene Potri.005G110600 and its homologous genes. The alignment highlights conserved positions in blue and non-conserved positions in white. The region indicated by red arrows corresponds to the primer binding site.
To verify the performance of the designed primer pairs, RNA was extracted from the stems of poplar 717 tissue culture seedlings, and cDNA was synthesized through reverse transcription of RNA. The gene specificity and amplification efficiency of the designed primer pairs were then evaluated using RT-qPCR. The correlation coefficient (R2 values) of the 18 primer pairs in RT-qPCR amplification were all above 0.99, indicating that they had high gene specificity. The amplification efficiencies varied from 91.3% to 109.7%, all of which were in the reliable section from 90% to 115%[58]. These results indicated that the designed primers were suitable for further gene expression analysis.
Validation of expression stability of candidate RGs
-
Poplar 717 plants were subjected to different stress treatments (Supplemental Table S3) to verify the expression stability of the 12 candidate RGs and six reported RGs by RT-qPCR. The expression stability of these genes was evaluated and ranked using four commonly used stability evaluation programs: delta-Ct, BestKeeper, Normfinder, and geNorm (Supplemental Dataset 5). The physiological parameters, including net photosynthetic rate (Pn), stomatal conductance (gs), intercellular CO2 (Ci), and transpiration rate (Tr) were used to monitor the stress response of poplar plants to various treatments (Supplemental Dataset 6). The comprehensive ranking was generated by computing the geometric mean (geomean) of the gene rankings. The distribution of Ct values of RT-qPCR for all RGs was concentrated closely near the mean, indicating consistent and reliable expression levels of the candidate RGs and the reported RGs (Fig. 6a). Four candidate RGs, Potri.001G349400, Potri.005G110600, Potri.011G084400 and Potri.008G111700, exhibited the highest stability and ranked at the top under all stress conditions (Fig. 6a). To further verify that the candidate RGs in this study are suitable for comparison among different poplar species, young and mature stems (IN3 and IN9) collected from five different Populus cultivars including P. simonii, P. deltoides, poplar 717, P. alba, and P. adenopoda grown on the campus of Huazhong Agricultural University were used. And the stability of 18 RGs in these stems was analyzed. Potri.001G349400, Potri.011G084400, Potri.005G110600 and Potri.002G157500 were among the best for stability in the stems of these cultivars (Fig. 6b). Based on the above two experiments, Potri.001G349400, Potri.005G110600, and Potri.011G084400 showed the best stability compared to other candidate RGs and reported RGs.
Figure 6.
Comprehensive analysis of expression stability and Ct value distribution of reference genes (RGs). This figure presents a comprehensive evaluation of the expression stability of 18 RGs (12 candidate RGs and six reported RGs) across diverse stress treatments and Populus species using Ref-finder, along with the distribution of Ct values for these 18 RGs. (a) Box plots showing the Ct value distribution and integrated stability ranking of the 18 RGs under different stress treatment experiments (Supplemental Table S5). (b) Box plots depicting the Ct value distribution and integrated stability ranking of the 18 RGs in stems of five different Populus cultivars including P. simonii, P. deltoides, poplar 717, P. alba, and P. adenopoda. The small circles on red line indicate the integrated stability ranking, with smaller values indicating greater stability. All RT-qPCR experiments were performed with three biological replicates and two technical replicates. Each dot of the boxplot represents the average of these six replicates.
To further verify the stability of the candidate RGs and reported RGs, a test dataset was constructed using the latest RNA-Seq data recently uploaded to the public databases, which included nine sets of poplar stem-related experiments with a total of 117 samples (Supplemental Table S4, Supplemental Dataset 7). Based on the DESeq2's median of ratios normalization method, distributions and CV values of relative expression levels of the 18 RGs in the test dataset were assessed. In total, there were six candidate RGs and three reported RGs with CV values below 0.3 (Fig. 7, Supplemental Dataset 8). Among them, Potri.001G349400, Potri.002G197600 and Potri.002G157500 can be considered as novel RGs for poplar stem gene expression analysis, and their gene annotations are respectively CNOT2 (CCR4-NOT transcription complex subunit 2), FIP37.1 (similar to Arabidopsis FKBP12-interacting protein 37), and RH8 (DEAD/DEAH box RNA helicase-like 8). Potri.001G349400/CNOT2 ranked first in the training dataset, RT-qPCR validation and test dataset, showing the best expression stability in stem-related samples among all poplar genes. The universal primers used to amplify Potri.001G349400/CNOT2 were: 5'-GGTCTGACGAGCCAGCAAAGGGT-3' (forward) and 5'-CGCCGCTGCTCCTTGTGGT-3' (reverse).
Figure 7.
Evaluation of candidate and reported reference genes (RGs) stability based on RNA-Seq data. Box plots are employed to display the distribution of expression levels for 12 candidate RGs and six reported RGs. The small circles on the red line represents the coefficient of variance (CV) for the gene expression levels, where smaller CV values signify greater stability of gene expression. The three genes considered as novel RGs for poplar stem gene expression analysis are shown in purple.
-
All data generated or analyzed during this study are included in this published article and its supplementary information files.
-
About this article
Cite this article
Xie Q, Ahmed U, Qi C, Du K, Luo J, et al. 2024. A protocol for identifying universal reference genes within a genus based on RNA-Seq data: a case study of poplar stem gene expression. Forestry Research 4: e021 doi: 10.48130/forres-0024-0017
A protocol for identifying universal reference genes within a genus based on RNA-Seq data: a case study of poplar stem gene expression
- Received: 23 October 2023
- Revised: 07 April 2024
- Accepted: 07 May 2024
- Published online: 01 June 2024
Abstract: Real-time quantitative reverse transcription polymerase chain reaction (RT-qPCR) plays a crucial role in relative gene expression analysis, and accurate normalization relies on suitable reference genes (RGs). In this study, a pipeline for identifying candidate RGs from publicly available stem-related RNA-Seq data of different Populus species under various developmental and abiotic stress conditions is presented. DESeq2's median of ratios yielded the smallest coefficient of variance (CV) values in a total of 292 RNA-Seq samples and was therefore chosen as the method for sample normalization. A total of 541 stably expressed genes were retrieved based on the CV values with a cutoff of 0.3. Universal gene-specific primer pairs were designed based on the consensus sequences of the orthologous genes of each Populus RG candidate. The expression levels of 12 candidate RGs and six reported RGs in stems under different abiotic stress conditions or in different Populus species were assessed by RT-qPCR. The expression stability of selected genes was further evaluated using ΔCt, geNorm, NormFinder, and BestKeeper. All candidate RGs were stably expressed in different experiments and conditions in Populus. A test dataset containing 117 RNA-Seq samples was then used to confirm the expression stability, six candidate RGs and three reported RGs met the requirement of CV ≤ 0.3. In summary, this study was to propose a systematic and optimized protocol for the identification of constitutively and stably expressed genes based on RNA-Seq data, and Potri.001G349400 (CNOT2) was identified as the best candidate RG suitable for gene expression studies in poplar stems.
-
Key words:
- Reference genes /
- Populus /
- Gene expression /
- Transcriptome /
- RT-qPCR /
- Abiotic stress /
- Stem development