Optimisation of vibrational spectroscopy instruments and pre-processing for classification problems across various decision parameters

Joy Sim; Cushla McGoverin; Indrawati Oey; Russell Frew; Biniam Kebede

doi:10.48130/fia-0024-0004

Vibrational spectroscopy is a green, rapid, and affordable analytical tool for analysing the quality, safety, and origin of biological materials in agri-food sectors. Pre-processing spectral data is crucial for removing instrumental interferences and physical artifacts when developing a classification model. However, there is not yet a consensus on which spectral pre-processing methods, settings, and decision parameters to use to optimise pre-processing for different spectroscopy tools. Using an arbitrary criterion risks applying the wrong type of pre-processing, or pre-processing that is too severe, which removes valuable information or degrades the model's predictive performance. The Matthews correlation coefficient (MCC), a statistic for parameterising classification performance, accounts for data set imbalance and can improve model selection decisions and the expression of uncertainty in future predictions. Four vibrational spectroscopy instruments [near-infrared (NIR), hyperspectral (HSI), mid-infrared (FTIR), and Raman] were compared using different pre-processing methods, with MCC used to assess their performance in classifying coffee from four countries (Indonesia, Ethiopia, Brazil and Rwanda). Key decision parameters were evaluated for the development of reliable classification models. The best pre-processing for NIR was extended multiplicative scatter correction with mean centering (MNCN), and for HSI, Savitzky-Golay (1st derivative, 15 points) with MNCN. NIR performed best of the four instruments, and FTIR performed worst. Raman showed potential for coffee origin classification with appropriate pre-processing. Pre-processing with weighted least squares, normalisation, and MNCN eliminated the fluorescence effect on Raman spectral data. These findings show the feasibility of using MCC for classification problems.


Introduction

Vibrational spectroscopy-based tools have gained traction as green, rapid, and affordable modern analytical tools in the pharmaceutical and forensic sciences for verifying the quality of incoming materials or outgoing products^[1,2]. Their value has also been recognised in the agri-food industry for quality and origin verification of organic and biological materials^[3,4]. These tools have mainly included near-infrared (NIR), mid-infrared (FTIR), Raman, and, more recently, hyperspectral imaging (HSI) spectroscopies.

Different spectroscopic tools operate over different defined frequency ranges and differ in the underlying principle by which molecular vibrations generate a signal (Supplemental Fig. S1). A change in polarisability leads to a Raman active vibrational mode. In contrast, a vibrational mode needs to be associated with a change in dipole moment to be infrared (NIR, FTIR) active. Despite following different mathematical relationships, both infrared and Raman spectroscopies follow linear relationships relating sample constituent concentration to signal intensity or absorbance. These relationships are obeyed only when no other phenomenon (e.g., other forms of scattering, specular reflection) occurs. Even well-designed studies include noise in the form of undesirable light scattering because of inconsistencies in particle size, packing densities and the spectral regions (wavelengths) used in the study. These affect the effective path length of light travelling through a sample, causing non-linearities and baseline shifts^[5]. Pre-processing can transform the data and reduce these undesirable influences. Consequently, the spectral data follow these linear relationships more closely, and unmodelled variability in the data is minimised^[6,7]. More recently, the popularity of HSI spectroscopies, which provide both spatial and spectral chemical information, has led to investigations into advanced image processing methods for improving classification performance^[8].

The final model performance is significantly influenced by the choice of pre-processing method^[7]. The pre-processing method needs to be selected with the vibrational spectroscopic technique in mind and optimised for the data set and the objectives of the investigation. Two main types of pre-processing methods are available: scatter correction and spectral derivation. However, there is a danger of applying the wrong pre-processing treatment or introducing bias by removing valuable information from the spectra^[6].

The literature shows no consensus on the best decision parameter (e.g., R², RMSEP) for choosing the final model, even for investigations of the same type of sample (Supplemental Table S1). When creating classification models, there is always a risk of overfitting. Overfitting occurs when the mathematical equations developed fit the calibration data set extremely accurately but predict poorly once an external validation set not seen by the calibration model is introduced^[9]. Approaches to optimising a model have come to include decision parameters such as the root mean squared errors of calibration (RMSEC) and prediction (RMSEP) and the coefficients of determination for calibration (R_cal²) and validation (R_val²)^[10]. Most publications have offered little insight into the pre-processing selection steps taken during calibration model development and have chosen the final model based on the root mean squared error of cross-validation (RMSECV) and RMSEP. A comprehensive overview of the literature can be found in Supplemental Table S1. In addition to these model-fit statistics, confusion matrices are typically used in classification problems to represent the quality of the prediction, but they can be hard to communicate. Accuracy and F1 scores are popular parameters for quantifying model performance. Accuracy, however, cannot distinguish between false positives and false negatives. The F1 score accounts for the number and types of prediction errors made but fails to consider the number of samples in each class.

The research gaps include a lack of consensus on which spectral pre-processing methods, settings, and decision parameters to use to optimise pre-processing for different vibrational spectroscopy tools. In addition, few studies have compared the sensitivity of various vibrational spectroscopy tools for origin classification problems.

A case study on coffee
This paper aims to compare different pre-processing methods on various vibrational spectroscopy tools (near-infrared, hyperspectral, mid-infrared, and Raman) to understand the performance of these methods for classification problems using partial least squares-discriminant analysis (PLS-DA). Key decision parameters will be evaluated to develop robust and stable calibration models for four vibrational spectroscopy tools. This paper is part of a wider study which involves the development of a rapid origin traceability toolbox for coffee. As part of this process, optimisation work was conducted.

Material and methods

Coffee samples

Green coffee beans from four countries across three continents were used as case studies: Santos, Brazil, South America; Yirga Cheffe Oromia, Ethiopia, Africa; Sumatra Mandheling, Indonesia, Asia; Kopakama, Rwanda, Africa. The coffee samples were all of the species Coffea arabica and wet-washed. All samples were harvested in 2020 across the same period, and postharvest processing steps were conducted in the country of origin. The samples were chosen based on their relevance to the international coffee sector, specifically from the coffee bean belt representing beans from America, Africa, and Asia. Green coffee beans were stored at 65% relative humidity with an ambient temperature of 18 ± 2 °C before further processing. Three replicates of each sample, each consisting of 100 g of green coffee beans, were ground into a fine (< 5 µm) green coffee powder (GCP) using a cryomill (Retsch, Haan, Germany) and liquid nitrogen. Forty-eight samples were each placed in 5 ml polypropylene screw-capped tubes, wrapped in aluminium, and stored at −18 °C. The samples were prepared a week prior to analysis across all four instruments (near-infrared, hyperspectral, mid-infrared, and Raman). From each of the biological replicates, seven to nine analytical replicates were taken for each instrument.

Spectral acquisition

Near-infrared analysis
This study used a dispersive 'bulk' NIR (DG-NIR) and a hyperspectral imaging push-broom dispersive NIR (HSI-NIR) system.

DG-NIR measurements were performed using a NIR XDS Rapid Content Analyser (Metrohm, USA) fitted with an iris adaptor to centre the sample cup towards the window area. The device was warmed up for 30 min before recording spectra. Before recording sample spectra, a background spectrum from a Spectralon 99% diffuse reflectance standard was recorded in a dry, controlled atmosphere (20 ± 0.5 °C, 75% relative humidity ± 4%). All the spectra were collected in absorbance mode. Each sample was carefully mixed before sampling 2 g of GCP for each of the three replicates. The sample holder (17.25 mm spot size) was rotated during measurement to collect a more representative spectrum. Spectral data were collected over 400-2500 nm (data sampling interval, 0.5 nm; background, 256 scans; sample, 32 scans). Vision Air 2.0 Network software (version 66072207) was used for instrumental control and spectral acquisition. The spectrum was then saved into text format for further data analysis.

Hyperspectral imaging (HSI-NIR) measurements were performed using a PIKA NIR-320 camera (Resonon, MT, USA), a dispersive push-broom hyperspectral system 320 pixels wide. A dark reference was taken to remove dark current noise by blocking the objective lens using the lens cap, and a reflective reference was then taken using Spectralon 99% reflectance reference to account for illumination and instrument-sensor response effects. Spectra were collected in reflectance mode. A small amount of powder was packed into a standardised plastic ring (40 mm ring with an inner 15 mm diameter) compartment and levelled off. The ring was then placed on the stage. Hyperspectral data were collected over the range of 900−1,700 nm (resolution, 8.8 nm; 168 spectral sampling points (bands); framerate, 10.0 Hz; integration time, 100 ms; scanning speed, 0.10 cm/s). Spectronon Pro software (version 3.4.5, Resonon) was used for instrumental control and spectral acquisition. Regions of interest (ROI) were manually selected from each sample to include only GCP and exclude the plastic ring and background. This was done by selecting the internal diameter of the ring using the Spectronon software and then choosing seven to nine random ROIs. A mean spectrum of the ROI was then saved into text format for further data analysis.

Mid-infrared analysis
Attenuated total reflection-Fourier transform infrared (ATR-FTIR) measurements were performed in a dry, controlled atmosphere (20 ± 0.5 °C) using a Bruker Vertex 70 FTIR Spectrometer (Bruker Optik GmbH, Ettlingen, Germany) with a deuterated L-alanine-doped triglycine sulfate (DLATGS) detector, equipped with a diamond crystal for ATR measurements. All spectra were recorded in the 4,000−400 cm⁻¹ range with 4 cm⁻¹ resolution and 64 scans. The background (atmosphere spectrum) was removed, and the Bruker extended ATR correction was applied. OPUS software (version 7.5) was used for instrumental control and spectral acquisition. Seven to eight analytical replicates were obtained from each of the three sample replicates. The sample was repacked between measurements so that various parts of it were measured, giving a more representative result. A total of 86 spectra were obtained for all four country samples.

Raman analysis
Raman measurements were performed using a BWTEK i-Raman-Plus operating at 785 nm excitation with a silicon-based detector and fibre-optic Raman probe. The spectra were recorded in the region between 4,200 and 65 cm⁻¹. Before analysis, the Raman system was turned on for 30 min to allow the laser to stabilise. Silicon and ibuprofen spectra were recorded to serve as wavelength reference checks. A small amount of powder was packed into a standardised plastic ring compartment and levelled off. The conditions for collecting sample spectra were as follows: 1 s integration time, 30 accumulations, increment of 1 cm⁻¹, and power at sample of 130 mW at 100% laser power. The system was operated using the BWSpec software (version 4.10, USA). Dark noise was removed from each spectrum prior to each analysis. Photobleaching samples for 2 to 20 min prior to Raman spectral data collection did not improve the fluorescence-Raman signal balance.

Data analysis
Chemometric data analysis of the spectral data was conducted using R (version 4.2.0)^[11] and SOLO (version 9.0). The analytical replicates per biological replicate were first averaged. Various pre-processing steps were investigated to eliminate potential artifacts from the spectra, namely fluorescence effects in the Raman spectra and baseline shifts and non-linear behaviour due to particle size differences in the IR spectra. The selection of pre-processing methods to trial was based on literature reports of specific method purposes and on methods previously applied to coffee samples. The data were split into training and test sets using the caTools (version 1.18.2, USA) package in R with a split ratio of 75% train and 25% test, and cross-validation was performed using venetian blinds with seven data splits^[12].
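The splitting scheme described above can be sketched in a few lines. This is a minimal illustration only: it is written in Python rather than R, it does not reproduce the caTools or SOLO implementations, the random split is not stratified by class, and the sample count of 40 and the function names are hypothetical.

```python
import random


def train_test_split_indices(n_samples, train_ratio=0.75, seed=0):
    """Randomly assign sample indices to a training set and a test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train_ratio * n_samples)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def venetian_blinds(n_samples, n_splits=7):
    """Venetian blinds: position i of the training set is held out in split i % n_splits."""
    return [list(range(start, n_samples, n_splits)) for start in range(n_splits)]


# Hypothetical data set of 40 spectra (illustrative size, not taken from the study).
train_idx, test_idx = train_test_split_indices(40)
folds = venetian_blinds(len(train_idx))

for number, held_out in enumerate(folds, start=1):
    print(f"split {number}: held-out training positions {held_out}")
```

Each of the seven splits holds out every seventh training sample, so every training sample is left out exactly once during cross-validation.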

Pre-processing methods for spectral data
Pre-processing is essential to reduce noise, extract useful information from overlapping peaks, and mitigate slope change effects. The most widely used pre-processing techniques in spectroscopy are scatter corrections and spectral derivatives. Scatter correction methods include multiplicative scatter correction (MSC), standard normal variate (SNV), normalisation, de-trending, and extended MSC (EMSC). MSC estimates correction coefficients by fitting each raw spectrum to a reference spectrum with a 1st-order polynomial (an offset and a slope) and then corrects the raw spectrum with them^[13]. The average spectrum of the calibration dataset is used as the reference for finding the correction coefficients. For SNV, the average and standard deviation of the absorption/intensity values of a spectrum are calculated; the average is then subtracted from every point of the spectrum, and the result is divided by the standard deviation^[13]. EMSC is a more elaborate extension of MSC. Instead of a 1st-order polynomial, which corrects a slope, a 2nd-order polynomial is fitted onto the average spectrum, fitting a baseline on the wavelength axis^[14]. The most common derivative method uses Savitzky-Golay (SG) polynomial derivative filters, which combine a smoothing step with the derivative calculation to limit the loss of signal-to-noise ratio caused by differentiation. SG filters can be applied with different derivative orders and filter widths. Derivatives remove additive constant background effects (first derivative) and sloping baseline changes (second derivative). All the spectral datasets were also subjected to mean centering (MNCN), in which the mean of each data column (variable) is subtracted from all the values in that column to give a data matrix where the mean of each processed variable is zero.
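The scatter corrections and mean centering described above can be illustrated with a short sketch. It assumes spectra stored as plain lists of equal length, uses the sample standard deviation for SNV, and takes the mean of the supplied spectra as the MSC reference; the function names and toy spectra are hypothetical, and the code is not the SOLO or R implementation used in the study.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own mean and std."""
    m, s = mean(spectrum), stdev(spectrum)
    return [(x - m) / s for x in spectrum]


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum).

    Each spectrum x is regressed on the reference, x = a + b * reference,
    and corrected as (x - a) / b.
    """
    if reference is None:
        reference = [mean(col) for col in zip(*spectra)]
    ref_mean = mean(reference)
    sxx = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for x in spectra:
        x_mean = mean(x)
        b = sum((r - ref_mean) * (xi - x_mean) for r, xi in zip(reference, x)) / sxx
        a = x_mean - b * ref_mean
        corrected.append([(xi - a) / b for xi in x])
    return corrected


def mean_center(spectra):
    """Subtract the column (variable) mean from every value in that column."""
    col_means = [mean(col) for col in zip(*spectra)]
    return [[x - m for x, m in zip(row, col_means)] for row in spectra]


# Toy example: the second spectrum is the first with a gain of 2 and an offset of 0.5.
base = [1.0, 2.0, 4.0, 3.0]
scaled = [2 * x + 0.5 for x in base]

msc_corrected = msc([base, scaled])
centered = mean_center([base, scaled])
```

In this toy case both MSC and SNV map the two spectra onto the same corrected spectrum, because they differ only by a multiplicative and an additive effect.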

The pre-processing steps investigated for NIR and FTIR calibration data included min-max normalisation (0 to 1), SNV, MSC, 1st and 2nd derivative Savitzky-Golay (SG) with different window widths, detrend, gap-segment derivative, and autoscaling, applied either alone or in combination with other techniques. The pre-processing steps investigated for Raman data included the aforementioned steps and asymmetric weighted least squares (WLS)^[15], applied either alone or in combination with other techniques. All spectra were mean-centered and saved before exploratory analysis and classification.
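As an illustration of how the SG derivative settings (derivative order, window width) act on a spectrum, the sketch below computes the centre-point SG weights for a 2nd-order polynomial fit and applies them as a moving window. It assumes evenly spaced points and returns only the interior points where the full window fits; edge handling in chemometrics software generally differs. The function names and the test spectrum are hypothetical.

```python
def sg_coefficients(window, deriv):
    """Centre-point Savitzky-Golay weights for a 2nd-order polynomial fit."""
    if window % 2 == 0 or window < 3:
        raise ValueError("window must be an odd number of points >= 3")
    half = window // 2
    offsets = range(-half, half + 1)
    s2 = sum(i ** 2 for i in offsets)
    if deriv == 1:
        # Slope of the least-squares fit evaluated at the window centre.
        return [i / s2 for i in offsets]
    if deriv == 2:
        # Twice the quadratic coefficient of the least-squares fit.
        s4 = sum(i ** 4 for i in offsets)
        denom = window * s4 - s2 ** 2
        return [2 * (window * i ** 2 - s2) / denom for i in offsets]
    raise ValueError("only 1st and 2nd derivatives are sketched here")


def sg_derivative(spectrum, window=15, deriv=1, spacing=1.0):
    """Apply the SG derivative filter; returns len(spectrum) - window + 1 points."""
    coeffs = sg_coefficients(window, deriv)
    n_valid = len(spectrum) - window + 1
    return [
        sum(c * y for c, y in zip(coeffs, spectrum[start:start + window])) / spacing ** deriv
        for start in range(n_valid)
    ]


# Hypothetical quadratic "spectrum": y = 3 + 2x + 0.5x^2 sampled at x = 0..19.
spectrum = [3 + 2 * x + 0.5 * x ** 2 for x in range(20)]
first = sg_derivative(spectrum, window=15, deriv=1)   # settings as in "1st der, 15 pts"
second = sg_derivative(spectrum, window=7, deriv=2)   # settings as in "2nd der, 7 pts"
```

For this quadratic input the first derivative is recovered exactly as 2 + x at each window centre, and the second derivative is constant, which shows how the constant offset is removed by the first derivative and the slope by the second.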

Linear classification model
PCA was first conducted to explore the dataset for any patterns. The reduced Hotelling's T², reduced Q residuals, and K-nearest neighbour (KNN) score distances were used to assess the model fit and check for extreme outliers. The reduced Hotelling's T² and reduced Q residuals are normalised forms of the Hotelling's T² and Q residuals, calculated by dividing each by its confidence limit; Hotelling's T² is a measure of the variation in each sample within the model, while Q residuals represent the variation remaining in each sample after modelling. The KNN score distance is a common outlier detection metric that gives, for each sample, the average distance to its k nearest neighbours in the score space. Partial least squares-discriminant analysis (PLS2-DA), a supervised classifier, was used to predict the geographical origins of green coffee beans (GCBs) from four countries. In this study, the output classes were Brazil (class B), Ethiopia (class E), Indonesia (class I), and Rwanda (class R). PLS-DA summarises the information from the independent variables in a small number of latent variables, which are constructed to maximise the covariance between the predictors (x-block) and the response (y-block). PLS-DA can reduce high-dimensional datasets and handle multi-collinear and correlated variables, which makes it a popular classification method. Various pre-processing techniques were applied to the four data sets, and country-based PLS-DA classification models were developed. The PLS-DA models were analysed independently for each of the datasets from all four instruments. The classification performance was validated by comparing several decision parameters listed in the next section.
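The KNN score distance defined above can be sketched as follows. The sketch assumes Euclidean distance and scores given as coordinate tuples; the choice of k, the distance metric used by the software in the study, and the example scores are all assumptions made for illustration.

```python
import math


def knn_score_distance(scores, k=3):
    """Average Euclidean distance from each sample to its k nearest neighbours in score space."""
    if k >= len(scores):
        raise ValueError("k must be smaller than the number of samples")
    result = []
    for i, sample in enumerate(scores):
        distances = sorted(
            math.dist(sample, other) for j, other in enumerate(scores) if j != i
        )
        result.append(sum(distances[:k]) / k)
    return result


# Hypothetical two-component scores: four clustered samples and one far away.
scores = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
knn_distances = knn_score_distance(scores, k=2)
most_isolated = knn_distances.index(max(knn_distances))
```

A sample with a much larger average neighbour distance than the rest, such as the last one here, would be a candidate for closer inspection as an outlier.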

Model evaluation

The models produced using PLS-DA on the four separate data sets were evaluated for the influence of pre-processing steps on the model prediction performance. The decision parameters included the total variance captured and the root mean square errors of calibration, cross-validation and prediction (RMSEC, RMSECV, and RMSEP, respectively). A low RMSEP means that the prediction performance is high and the estimated response is close to the measured response (0 or 1 in PLS-DA).
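To make the link between the 0/1 response and the RMSE concrete, the sketch below dummy-codes class labels into a y-block and computes an RMSE against estimated responses. Pooling the error over all four class columns is an assumption (software may report one RMSE per class), and the estimated responses are invented for illustration, not taken from the study.

```python
import math

CLASSES = ["B", "E", "I", "R"]  # Brazil, Ethiopia, Indonesia, Rwanda


def dummy_code(labels, classes=CLASSES):
    """Build the PLS-DA y-block: 1 for the sample's own class, 0 elsewhere."""
    return [[1.0 if label == c else 0.0 for c in classes] for label in labels]


def rmse(measured, estimated):
    """Root mean square error pooled over all samples and class columns."""
    residuals = [
        m - e
        for row_m, row_e in zip(measured, estimated)
        for m, e in zip(row_m, row_e)
    ]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


# Hypothetical estimated responses for one Brazilian and one Ethiopian sample.
measured = dummy_code(["B", "E"])
estimated = [[0.8, 0.1, 0.1, 0.0], [0.2, 0.6, 0.1, 0.1]]
example_rmse = rmse(measured, estimated)
```

The same function gives RMSEC, RMSECV, or RMSEP depending on whether the estimated responses come from the calibration fit, the cross-validation splits, or the independent test set.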

In addition to statistics that assess the model fit, confusion matrices are typically used in classification problems to represent the quality of the prediction, but they can be hard to communicate. Accuracy and F1 scores are popular parameters for quantifying model performance^[16,17]. The equations for accuracy and F1 are given below, where TP is the number of true positives, TN of true negatives, FP of false positives, and FN of false negatives. Accuracy cannot distinguish between false positives and false negatives. The F1 score accounts for the number and types of prediction errors made. It weights false positives and false negatives equally by taking the harmonic mean of precision and recall.

$ {\rm{Accuracy}} = \dfrac{T P+T N}{T P+T N+F P+F N} $

(1)

${\rm{ F1 \;score}} = \dfrac{2T P}{2T P+F P+F N} $

(2)

However, these two parameters are only good indicators of performance for balanced datasets, in which the number of analytical replicates is the same for every sample. In this study, more analytical replicates were collected for certain samples because the signal-to-noise ratio was visually suspected to be problematic for some spectra; after pre-processing, these spectra were not flagged as outliers and were thus included. Given that the dataset imbalances arose from the additional analytical replicates taken for some samples, other decision parameters are needed. The Matthews correlation coefficient (MCC) can address this issue by accounting for the dataset imbalance and summarising the confusion matrix as a correlation coefficient^[16,17]. Unlike the F1 score, it involves all four confusion matrix terms. The metric represents the correlation between actual and predicted values. A score of 1.0 indicates a perfect classifier, while a value close to 0 means that the classifier is no better than random chance. For a high MCC, the model must accurately predict both positive (belonging to the class) and negative (not belonging to the class) outcomes simultaneously. Equation (3) applies to binary classification, while Equation (4) is for multi-class classification problems, where K is the number of classes, $ {t}_{k} $ is the number of times class k truly occurred, $ {p}_{k} $ is the number of times class k was predicted, C is the number of samples correctly predicted, and S is the total number of samples. To the best of our knowledge, MCC has not been applied to food classification models utilising vibrational spectroscopy.

$ {\rm{MCC}} = \dfrac{(T P\times T N)-(F P \times F N)}{\sqrt{\left(T P+F P\right)(T P+F N)(T N+F P)(T N+F N)}} $

(3)

$ {\rm{MCC}} = \dfrac{(C\times S)-({\sum }_{k}^{K}{p}_{k}\times {t}_{k})}{\sqrt{({S}^{2}-{\sum }_{k}^{K}{p}_{k}^{2}) \times ({S}^{2}-{\sum }_{k}^{K}{t}_{k}^{2})}} $

(4)
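Equation (4) can be evaluated directly from a confusion matrix, as the sketch below shows. The matrix layout (rows as true classes, columns as predicted classes), the function name, and the example counts are assumptions made for illustration; the counts are not results from this study.

```python
import math


def multiclass_mcc(confusion):
    """Multi-class MCC following Eqn (4).

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    t = [sum(row) for row in confusion]                                   # t_k: true occurrences
    p = [sum(row[j] for row in confusion) for j in range(n_classes)]      # p_k: predicted occurrences
    c = sum(confusion[k][k] for k in range(n_classes))                    # C: correctly predicted
    s = sum(t)                                                            # S: total samples
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = math.sqrt(
        (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
    )
    return numerator / denominator if denominator else 0.0


# Hypothetical four-class confusion matrix (classes B, E, I, R) with unequal class sizes.
confusion = [
    [5, 0, 0, 0],
    [0, 4, 1, 0],
    [0, 0, 3, 0],
    [1, 0, 0, 2],
]
example_mcc = multiclass_mcc(confusion)
print(round(example_mcc, 3))
```

For a two-class matrix the same function returns the value given by Eqn (3), which is a convenient consistency check between the two equations.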

The F1 scores, accuracy, and MCC of the validation (predicted) data were compared to understand the influence of these decision parameters. The prediction accuracy was calculated as a percentage of the number of actual samples in that class. A high F1 score may suggest that the classification model is performing well even when the MCC score is low. An MCC score above 0.7 is considered a good classification score^[17].
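The contrast between F1 and MCC on an imbalanced data set can be seen with a small worked example using Eqns (1) to (3). The counts below are hypothetical and chosen only to show the effect: the positive class dominates, and the classifier identifies almost none of the negatives.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Eqn (1): fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    """Eqn (2): harmonic mean of precision and recall (true negatives are not used)."""
    return 2 * tp / (2 * tp + fp + fn)


def binary_mcc(tp, tn, fp, fn):
    """Eqn (3): binary Matthews correlation coefficient."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical imbalanced case: 94 positives, 6 negatives.
tp, tn, fp, fn = 90, 1, 5, 4

print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}")
print(f"F1       = {f1_score(tp, fp, fn):.3f}")
print(f"MCC      = {binary_mcc(tp, tn, fp, fn):.3f}")
```

Accuracy and F1 are both above 0.9 here, while the MCC is below 0.2, because only one of the six negative samples is identified correctly.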

[1]	Zhang L, Henson MJ, Sekulic SS. 2005. Multivariate data analysis for Raman imaging of a model pharmaceutical tablet. Analytica Chimica Acta 545:262−78 doi: 10.1016/j.aca.2005.04.080
[2]	Khandasammy SR, Fikiet MA, Mistek E, Ahmed Y, Halámková L, et al. 2018. Bloodstains, paintings, and drugs: Raman spectroscopy applications in forensic science. Forensic Chemistry 8:111−33 doi: 10.1016/j.forc.2018.02.002
[3]	Mcgoverin CM, Clark ASS, Holroyd SE, Gordon KC. 2010. Raman spectroscopic quantification of milk powder constituents. Analytica Chimica Acta 673:26−32 doi: 10.1016/j.aca.2010.05.014
[4]	Beć KB, Grabska J, Bonn GK, Popp M, Huck CW. 2020. Principles and applications of vibrational spectroscopic imaging in plant science: A review. Frontiers in Plant Science 11:1226 doi: 10.3389/fpls.2020.01226
[5]	Barnes RJ, Dhanoa MS, Lister SJ. 1989. Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. Applied Spectroscopy 43:772−77 doi: 10.1366/0003702894202201
[6]	Rinnan Å, Berg FVD, Engelsen SB. 2009. Review of the most common pre-processing techniques for near-infrared spectra. Trends in Analytical Chemistry 28:1201−22 doi: 10.1016/j.trac.2009.07.007
[7]	Karoui R, Downey G, Blecker C. 2010. Mid-infrared spectroscopy coupled with chemometrics: A tool for the analysis of intact food systems and the exploration of their molecular structure−Quality relationships − A review. Chemical Reviews 110:6144−68 doi: 10.1021/cr100090k
[8]	Lv Z, Zhang P, Sun W, Lei T, Benediktsson JA, et al. 2023. Sample iterative enhancement approach for improving classification performance of hyperspectral imagery. IEEE Geoscience and Remote Sensing Letters 21:2500605 doi: 10.1109/LGRS.2023.3348093
[9]	Hruschka WR. 1987. Data analysis: wavelength selection methods. In Near-infrared technology in the agricultural and food industries, eds. Williams P, Norris K. St. Paul, MN, USA: American Association of Cereal Chemists. pp. 35–55.
[10]	Zhao N, Wu ZS, Zhang Q, Shi XY, Ma Q, et al. 2015. Optimization of Parameter Selection for Partial Least Squares Model Development. Scientific Reports 5:11647 doi: 10.1038/srep11647
[11]	R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
[12]	Tuszynski J. 2021. caTools: Tools: Moving Window Statistics, GIF, Base64, ROC AUC, etc. https://CRAN.R-project.org/package=caTools
[13]	Dhanoa MS, Lister SJ, Sanderson R, Barnes RJ. 1994. The link between multiplicative scatter correction (MSC) and standard normal variate (SNV) transformations of NIR spectra. Journal of Near Infrared Spectroscopy 2:43−47 doi: 10.1255/jnirs.30
[14]	Martens H, Stark E. 1991. Extended multiplicative signal correction and spectral interference subtraction: New preprocessing methods for near infrared spectroscopy. Journal of Pharmaceutical and Biomedical Analysis 9:625−35 doi: 10.1016/0731-7085(91)80188-F
[15]	Newey WK, Powell JL. 1987. Asymmetric least squares estimation and testing. Econometrica 55(4):819−47 doi: 10.2307/1911031
[16]	Chicco D, Jurman G. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21:6 doi: 10.1186/s12864-019-6413-7
[17]	Powers DMW. 2011. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technology 2(1):37−63
[18]	Lee LC, Liong CY, Jemain AA. 2017. A contemporary review on data preprocessing (DP) practice strategy in ATR-FTIR spectrum. Chemometrics and Intelligent Laboratory Systems 163:64−75 doi: 10.1016/j.chemolab.2017.02.008
[19]	Norris KH, Williams PC. 1984. Optimisation of mathematical treatments of raw near-infrared signal in the measurement of protein in hard red spring wheat I. influence of particle. Cereal Chemistry 61(2):158−65
[20]	Keidel A, Von Stetten D, Rodrigues C, Máguas C, Hildebrandt P. 2010. Discrimination of green arabica and robusta coffee beans by Raman spectroscopy. Journal of Agricultural and Food Chemistry 58:11187−92 doi: 10.1021/jf101999c
[21]	Wermelinger T, D'Ambrosio L, Klopprogge B, Yeretzian C. 2011. Quantification of the robusta fraction in a coffee blend via Raman spectroscopy: Proof of principle. Journal of Agricultural and Food Chemistry 59:9074−79 doi: 10.1021/jf201918a
[22]	Figueiredo LP, Borém FM, Almeida MR, Oliveira LFC, Alves APDC, et al. 2019. Raman spectroscopy for the differentiation of arabic coffee genotypes. Food Chemistry 288:262−67 doi: 10.1016/j.foodchem.2019.02.093
[23]	Abreu GF, Borém FM, Oliveira LFC, Almeida MR, Alves APC. 2019. Raman spectroscopy: A new strategy for monitoring the quality of green coffee beans during storage. Food Chemistry 287:241−48 doi: 10.1016/j.foodchem.2019.02.019
[24]	Dias RCE, Yeretzian C. 2016. Investigating coffee samples by Raman spectroscopy for quality control- Preliminary study. International Journal of Experimental Spectroscopic Techniques 1:006 doi: 10.35840/2631-505x/8506
[25]	Marquetti I, Link JV, Lemes ALG, dos Santos Scholz MB, Valderrama P, et al. 2016. Partial least square with discriminant analysis and near infrared spectroscopy for evaluation of geographic and genotypic origin of arabica coffee. Computers and Electronics in Agriculture 121:313−19 doi: 10.1016/j.compag.2015.12.018
[26]	Moghimi A, Aghkhani MH, Sazgarnia A, Sarmad M. 2010. Vis/NIR spectroscopy and chemometrics for the prediction of soluble solids content and acidity (pH) of kiwifruit. Biosystems Engineering 106:295−302 doi: 10.1016/j.biosystemseng.2010.04.002
[27]	Lasch P. 2012. Spectral pre-processing for biomedical vibrational spectroscopy and microspectroscopic imaging. Chemometrics and Intelligent Laboratory Systems 117:100−14 doi: 10.1016/j.chemolab.2012.03.011
[28]	Liu Y, Huang J, Li M, Chen Y, Cui Q, et al. 2022. Rapid identification of the green tea geographical origin and processing month based on near-infrared hyperspectral imaging combined with chemometrics. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy 267:120537 doi: 10.1016/j.saa.2021.120537
[29]	Downey G, Briandet R, Wilson RH, Kemsley EK. 1997. Near- and mid-infrared spectroscopies in food authentication: Coffee varietal identification. Journal of Agricultural and Food Chemistry 45:4357−61 doi: 10.1021/jf970337t
[30]	Obeidat SM, Hammoudeh AY, Alomary AA. 2018. Application of FTIR spectroscopy for assessment of green coffee beans according to their origin. Journal of Applied Spectroscopy 84:1051−55 doi: 10.1007/s10812-018-0585-9
[31]	Bona E, Marquetti I, Link JV, Makimori GYF, da Costa Arca V, et al. 2017. Support vector machines in tandem with infrared spectroscopy for geographical classification of green arabica coffee. LWT - Food Science and Technology 76:330−36 doi: 10.1016/j.lwt.2016.04.048
[32]	Medina J, Caro Rodríguez D, Arana VA, Bernal A, Esseiva P, et al. 2017. Comparison of attenuated total reflectance mid-infrared, near infrared, and ¹H-nuclear magnetic resonance spectroscopies for the determination of coffee's geographical origin. International Journal of Analytical Chemistry 2017:7210463 doi: 10.1155/2017/7210463
[33]	Cebi N, Yilmaz MT, Sagdic O. 2017. A rapid ATR-FTIR spectroscopic method for detection of sibutramine adulteration in tea and coffee based on hierarchical cluster and principal component analyses. Food Chemistry 229:517−26 doi: 10.1016/j.foodchem.2017.02.072
[34]	Rubayiza AB, Meurens M. 2005. Chemical discrimination of arabica and robusta Coffees by Fourier transform Raman spectroscopy. Journal of Agricultural and Food Chemistry 53:4654−59 doi: 10.1021/jf0478657
[35]	El-Abassy RM, Donfack P, Materny A. 2011. Discrimination between arabica and robusta green coffee using visible micro Raman spectroscopy and chemometric analysis. Food Chemistry 126:1443−48 doi: 10.1016/j.foodchem.2010.11.132
[36]	Luna AS, Da Silva AP, Alves EA, Rocha RB, Lima ICA, De Gois JS. 2017. Evaluation of chemometric methodologies for the classification of coffea canephora cultivars via FT-NIR spectroscopy and direct sample analysis. Analytical Methods 9:4255−60 doi: 10.1039/C7AY01167A
[37]	Giraudo A, Grassi S, Savorani F, Gavoci G, Casiraghi E, Geobaldo F. 2019. Determination of the geographical origin of green coffee beans using NIR spectroscopy and multivariate data analysis. Food Control 99:137−45 doi: 10.1016/j.foodcont.2018.12.033

Instrument	Optimised pre-processing	TVar (%)	RMSECV	RMSEC	RMSEP	MCC, Pred.	Accuracy, Pred.	F1, Pred.
DG-NIR	MNCN	98.54	0.338	0.329	0.473	0.383	0.665	0.483
	MSC, MNCN	91.49	0.245	0.243	0.309	0.774	0.882	0.865
	SNV, MNCN	91.49	0.245	0.243	0.309	0.774	0.882	0.865
	SNV, Detrend, MNCN	91.32	0.245	0.243	0.309	0.774	0.882	0.865
	MSC, SG (1st der, 2nd poly, 15 pts), MNCN	76.06	0.268	0.265	0.350	0.684	0.835	0.788
	Normalisation, SG (2nd der, 2nd poly, 7 pts), MNCN	98.05	0.358	0.352	0.351	0.652	0.812	0.757
	EMSC, MNCN	87.87	0.240	0.238	0.250	0.876	0.929	0.916
HSI-NIR	MNCN	99.69	0.372	0.362	0.421	0.618	0.800	0.681
	Normalisation, MNCN	98.28	0.333	0.322	0.402	0.655	0.767	0.650
	SG (1st der, 2nd poly, 15 pts), MNCN	68.87	0.341	0.325	0.364	0.636	0.800	0.728
	MSC, SG (1st der, 2nd poly, 15 pts), MNCN	63.41	0.338	0.324	0.403	0.473	0.733	0.605
	Normalisation, SG (1st der, 2nd poly, 15 pts), MNCN	85.79	0.324	0.313	0.375	0.612	0.800	0.732
	SNV, SG (1st der, 2nd poly, 15 pts), MNCN	63.38	0.337	0.324	0.402	0.473	0.733	0.605
FTIR	MNCN	99.69	0.335	0.321	0.386	0.253	0.452	0.200
	Normalisation, MNCN	98.17	0.334	0.320	0.391	0.372	0.452	0.179
	Normalisation, SG (1st der, 2nd poly, 15 pts), MNCN	97.12	0.402	0.369	0.490	0.141	0.500	0.330
	EMSC, MNCN	71.52	0.409	0.384	0.482	0.286	0.500	0.326
Raman	MNCN	99.91	0.319	0.312	0.329	0.756	0.860	0.819
	SG (2nd der, 2nd poly, 7 pts), MNCN	99.66	0.350	0.343	0.369	0.521	0.735	0.648
	Normalisation, SG (1st der, 2nd poly, 15 pts), MNCN	98.94	0.321	0.315	0.334	0.554	0.747	0.622
	WLS (2nd poly), MNCN	96.86	0.343	0.336	0.372	0.611	0.795	0.760
FT, Fourier-Transform; DG, Dispersive; HSI, Hyperspectral Imaging; NIR, near-infrared; TVar, Total explained variance; RMSE(C/CV/P), Root Mean Square Errors of Calibration/Cross-Validation/Prediction; MCC, Matthews Correlation Coefficient; MNCN, Mean centering; MSC, Multiplicative Scatter Correction; SNV, Standard Normal Variate; EMSC, Extended Multiplicative Scatter Correction; SG (#der, #poly, #pts), Savitzky-Golay #derivative, #polynomial, #window points; WLS, Weighted Least Squares; Pred., Prediction.
