Comparison Between Linear and Non-linear Variable Selection Methods with Applications to Spectroscopic (UV-Vis/NIR) Data
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Chiang Mai J. Sci. 2020; 47(1) : 160-174 http://epg.science.cmu.ac.th/ejournal/ Contributed Paper Comparison Between Linear and Non-linear Variable Selection Methods with Applications to Spectroscopic (UV-Vis/NIR) Data Chanida Krongchai [a], Sakunna Wongsaipun [a], Sujitra Funsueb [a], Parichat Theanjumpol [b,c], Jaroon Jakmunee [a,d] and Sila Kittiwachana*[a,e] [a] Department of Chemistry, Faculty of Science, Chiang Mai University, Chiang Mai 50200, Thailand. [b] Postharvest Technology Research Center, Faculty of Agriculture, Chiang Mai University, Chiang Mai 50200, Thailand. [c] Postharvest Technology Innovation Center, Office of the Higher Education Commission, Bangkok 10400, Thailand. [d] Institute for Science and Technology Research and Development, Chiang Mai University, Chiang Mai 50200, Thailand. [e] Environmental Science Research Center, Faculty of Science, Chiang Mai University, Chiang Mai 50200, Thailand. *Author for correspondence; e-mail: silacmu@gmail.com Received: 14 January 2019 Revised: 11 September 2019 Accepted: 16 September 2019 A BSTRACT Variable selection aims to identify important parameters in relation to predicted responses. Selection outcomes of the important variables could be different depending on the methods used. In this research, the important variables identified using linear and non-linear variable selection methods based on partial least squares-variable important in prediction (PLS-VIP) and self organizing map- discrimination index (SOM-DI) were compared. Two datasets, near-infrared (NIR) spectra of adulterated Thai Jasmine rice and ultraviolet-visible (UV-Vis) spectra of food colorant mixtures were used for the demonstration. The advantages and disadvantages for the use of the different algorithms were compared and discussed. For the NIR data, the calibration model using supervised self organizing map (SSOM) offered better prediction results and the SOM-DI variable selection method identified the spectral changes in NIR overtone regions as significance. On the other hand, PLS calibration model resulted in higher predictive errors while the PLS-VIP variable selection captured variation from the visible region between 664 nm and 884 nm. Using the UV-Vis data, PLS appeared to put attention on only the highest absorbance region of the peak maximum absorbance. In contrast, SSOM model highlighted the variation around the isosbestic spectral regions between the mixture components. The drawback for the use of a mixture design to construct the calibration models, leading to wrong interpretation of the important variables, was also discussed. Keywords: variable selection, multivariate calibration, partial least squares (PLS), self organizing map (SOM), spectral data analysis
Chiang Mai J. Sci. 2020; 47(1) 161 1. I NTRODUCTION Spectroscopic measurements especially calibration techniques can deal with a large number ultraviolet-visible (UV-Vis) and near-infrared (NIR) of variables dataset. However, there are some have been increasingly used as analytical tools in benefits if the number of predictive variables is various field such as clinical chemistry, process reduced. This is because not all the variables are monitoring, food, agriculture and environmental useful or contain informative variation for the science [1-4]. These spectroscopic measurements prediction models. The detected absorbance at are based on the similar principle where the baseline may be irrelevant or represent only noise. interaction between the electromagnetic light In addition, a measurement at one wavelength radiation and analyst sample is detected. The can be correlated to the measurement of the difference is that UV-Vis detects the absorption wavelengths nearby or they behave similarly. corresponding to the electronic transitions of the The similar trends in the measurement variables electrons in an atom or molecule, whereas the cause an overly complexity in data and could absorption of NIR is as a result of the overtones dramatically reduce the predictive performance or combinations of the chemical functional groups due to multicollinear problem [5]. Therefore, it is originating in the infrared (IR) region. These advised that suitable variable selection is performed measurement techniques gained advantages over prior to the construction of a calibration model. other analytical techniques because the sample Variable selection can be used to identify detection can be operated quickly without or less variables (wavelengths) that contribute useful sample pretreatment. Using different detection information (in this case the response), or it modes such as transmittance, reflectance and aims to evaluate the importance of the measured interaction, these spectroscopic measurements can parameters. In general, chemometric techniques, be practical for samples that are either liquid or used for classification or regression, could be solid. By modern scanning instruments, UV-Vis categorized into two major groups based on and NIR can generate a large number of variables. the nature of the algorithms; linear and non- For example, a measurement of NIR could yield linear methods [6]. At the present time, many 1701 spectral points or variables corresponding to variable selection methods have been proposed the absorbance in the region of 700-2400 nm at and most of them are a generalized form their 1 nm interval. For that reason, sophisticated data related predictive models. Partial least squares- analysis techniques are required, and the predictive variable important in prediction (PLS-VIP), results should be obtained from multivariate partial least squares-selectivity ratio (PLS-SR) and analysis of available spectra rather than a single PLS coefficients, are among the most common observed variable or a single spectrum of an variable selections which are based on the partial individual sample. least squares (PLS) regression. These methods Multivariate calibration aims to investigate a expect that the significant variables linearly affect relationship between predictive (X) and response the change in variation of response. Unlike (c) variables. The predictive variables are data that PLS, self organizing map (SOM) is a non-linear can be directly measured from samples. On the method. This model does not expect that data other hand, the response variables are information follow multivariate normal distribution or that which cannot be directly obtained from the mathematical equations are required to explain the measurement. This relationship information characteristic structure of the model. Compared between the two data blocks can be then used to the other non-linear prediction such as artificial to establish calibration model for prediction of neural network (ANN) and support vector machine unknown samples. In general, most multivariate (SVM), SOM has an ability to display the internal
162 Chiang Mai J. Sci. 2020; 47(1) structure of the model using some visualization constructing the calibration model was discussed. methods such as component planes, supervised The advantages and disadvantages of applying color shading and U-matrix [7]. Therefore, it is these two methods were reported. possible to investigate the non-linear behavior in relation to the predictive response. Recently, 2. M ATERIALS AND METHODS the development of a variable selection index, 2.1 Spectroscopic Data called self organizing map-discrimination index 2.1.1 NIR of adulterated rice (SOM-DI), were proposed [8]. This index could The rice samples were purchased from a be used to evaluate the variable significance in local department store. Two rice varieties were addition the visualization of the non-linear behavior used including Khao Dawk Mali 105 (KDML105) from the component planes. For spectral analysis, and Chai Nat 1 (CN1) white rice. The quality of linear and non-linear calibrations could result in the rice samples was certified by ISO 9001:2008 different predictive performance. Consequently, standard with good manufacturing practice (GMP). the identification of the important variables could To synthetically generate adulterated KDML105 be different. This led to a possible variety in the rice samples, the KDML105 rice was blended interpretation and conclusion. with the CN1 rice where the concentrations of This research reported the comparison of linear the mixed rice samples were ranged from 0.0 and non-linear variable selections for identifying %w/w to 100.0 %w/w with the increment of 5 important wavelengths in spectral analysis. PLS-VIP %w/w. After mixing, the samples were maintained and SOM-DI were used to represent the linear in a controlled temperature room at 25 °C for at and non-linear variable selections, respectively. least 6 hours to stabilize the sample temperature. Two datasets including UV-Vis of food colorant The NIR spectra were recorded using FOSS mixtures and NIR of adulterated Khao Dawk Mali NIR DS2500 (FOSS NIR system, USA) from 105 (KDML105) rice were used to demonstrate 400 - 2500 nm at 0.5 nm resolution. Each sample the model characteristics. The UV-Vis dataset was measured three times and the recorded spectra was used to demonstrate the performance of were averaged. The NIR spectra were separated the variable selection methods when dealing into two datasets where the samples adulterated with samples with multi-components or there at 0, 10, 20, …, 100% w/w were used as training were several analytes at different concentration samples and the rests were used as test samples. combined in samples. KDML105 is a well-known The recorded NIR data were exported to Matlab Thai jasmine rice variety which is famous for program (MATLAB V7.0, The Math Works Inc., its present fragrant and delicious texture when Natick) for further data calculation. cooked. The price of KDML105 is relatively expensive and therefore it is often blended with 2.1.2 UV-Vis spectra of food colorants some other cheaper non-fragrant rice causing This dataset was used to demonstrate the adulteration. To comply with the labelling law, the performance of the variable selection methods adulteration level should be clearly clarified. The when dealing with samples with multi-components presence of the substitution or other blending or there were several analytes at different rice more than the regulation reveals deliberate concentration combined in samples. Three food substitution, and this will be illegal under the colorants, Carmoisine, Tartrazine and Brilliant food labelling rules. The effect of non-linearity blue FCF representing red, yellow and blue in data on the predictive performance of the colors, were prepared from commercial grade calibration models was reported. A problem of chemicals. The spectrum of each food colorant sample permutation in a mixture design when was shown in Figure 1(A). The concentrations
Chiang Mai J. Sci. 2020; 47(1) 163 450 451 Carmoisine 3 Tartrazine Brilliant blue FCF 2.5 Absorbance 2 1.5 1 0.5 0 350 400 450 500 550 600 650 700 750 800 Wavelength (nm) (A) (B) 452 Figure 1. (A) The spectrum of each food colorant and (B) a three-component diagram model by 453 Figure 1. (A) The spectrum of each food colorant and (B) a three-component diagram model by mixture design used to prepare the food colorant samples. The model consisted of 43 samples indicated 454 mixture design used to prepare the food colorant samples. The model consisted of 43 samples 455 using the numbers indicated in numbers using the parenthesis. in parenthesis. 456 457 of the mixing samples were prepared following a data is maximized, PLS, in most cases, provides mixture design with three components as shown satisfactory predictive results [11]. Several algorithms 458 in Figure 1(B) resulting in a total of 43 samples of PLS calculations have been reported and 459 [9]. Twenty-eight samples were used as training PLS1, proposed by Geladi and Kowalski [9-11], samples labeled using blue circles in Figure 1(B) was used in this research. The construction of 460 and the rests were used as test samples using red PLS1 is as follows: 461 circles. Each of the samples were prepared from 462 the same solution stocks in DI water to eliminate X = TP + E the variation form the food colorant impurity. 463 The mixing samples were measured using UV-Vis c = uq + f 464 spectrometer (GENESYS 10S UV-Vis, Thermo Scientific, USA) ranging from 350 nm to 850 nm Firstly, X with I samples and J variables, is 465 with a resolution of 1 nm. decomposed into X-scores (T) and X-loadings (P). 466 At the same time, c is the product approximation 467 2.2 Calibration and Variable Selection Methods of u and c-loadings (q). Then, the correlation 2.2.1 Partial least squares (PLS) and partial between X and c is expressed by: 468 least squares-variable important in prediction 469 (PLS-VIP) u = bt Partial least squares (PLS) is among the most 470 common linear regression method in multivariate When b is a regression coefficient vector 471 data analysis [10]. Based on non-linear iterative with size J x 1 and the estimation of b can be partial least squares (NIPALS) algorithm, PLS calculated as: captures the variation from both predictive and response data and simultaneously then used b = Wq for constructing a calibration model. Since the covariance between the predictive and response
164 Chiang Mai J. Sci. 2020; 47(1) Where W is a normalized PLS weight matrix. set of square or hexagonal units. Using iterative In this work the optimum number of latent learning process [20], the trained map of SOM variables (LVs) were defined using bootstrap adapts itself so that the training samples are algorithm [12]. located as far as possible from each other where Partial least squares-variable important in the aim is to maintain the topological structure prediction (PLS-VIP) was first reported by Wold of the training samples. At the beginning, SOM et al. [13]. The variable selection parameter has was used as unsupervised model where only the been extensively used in various researches such as predictive data was used for constructing the chemistry [14], agriculture [15], medicine [16] and model. However, SOM can be used as supervised engineering [17]. The VIP score summarize the model where the response data was given during influence of each of X variables which considers the learning process as demonstrated in Figure 2. the amount of explained y variance in each LV By allowing the response data to be associated in (the number of PCs used in PLS modelling). The the learning process, it is possible to adopt SOM VIP scores provide a measurement of useful to for classification and calibration purposes. For selected which variables are contributed the most example, supervised SOM was used to predict to the c response. The PLS-VIP for the jth variable the retention time of chromatographic analysis is calculated as follows: based on quantitative structure–activity relationship (QSAR) data [21]. M In addition to the classification and calibration ∑w 2 jm .SSYm . J models, it is possible to investigate the importance VIPj = m =1 of the studied variable from the SOM training SSYtotal .M map. The extended used of SOM for variable selection and called the proposed algorithm as self and SSYtotal = b 2T ′T organizing map-discrimination index (SOM-DI) was demonstrated [8]. The idea was to see if Where wjm is the weight value for the jth the component plane profiles and the response variable and the mth component. SSYm is the plane were alike. This can be done by calculating sum of squares of explained variance for the mth the correlation between the response plane and component. SSYtotal is the total sum of squares each of the variable component plane of the explained of the dependent variable, and M is the trained map after appropriate data scaling. The total number of components. VIPj is a measure component planes which are strongly associated of the contribution of each variable according to with the response will have larger coefficient values. the variance explained by each PLS component The calculation of SOM-DI and the important were w2jm represents the importance of the jth parameters set for the supervised SOM have been variable [18]. described in detail in report of Lloyd et al. and Krongchai et al. [8 and 22]. 2.2.2 Self organizing map (SOM) and self organizing map-discrimination index (SOM-DI) 2.3 Assessment of Model Predictive Performance Self organizing map (SOM) or Kohonen To evaluate the predictive performance of network is one type of non-linear learning models the chemometric models, various model statistics [19]. Unlike principal component analysis (PCA) including root mean square error of calibration that clusters samples into an orthogonal space of (RMSEC), root mean square error of prediction the first few principal components (PCs), SOM (RMSEP), cross-validated explained variance of organizes samples into a map consisting of a training (R2) and test (Q2) sets and ratio of RMSEP
Chiang Mai J. Sci. 2020; 47(1) 165 Figure 2. A schematic diagram showing a SOM model with a size of P × Q. The data characterizes by J variables and two class memberships. and RMSEC (RP/Auto) were calculated [23]. this ratio is close to 1, this indicates a stability of RMSEC is the average difference between model when some training samples are removed predicted ( ĉi ) and expected ( ci ) response values from the modeling or the model is tested with in auto-prediction mode and can be calculated as: unknown samples [23]. To highlight the scope of this research, the spectral data were tested with various data pretreatment such as standard normal ∑i =1 (cˆi − ci ) 2 N RMSEC = variate (SNV), multiplicative scatter correction N −1 (MSC), normalization, and centering [24]. The where N is the number of samples. Using models with the best predictive results were RMSEC, the establish model is tested directly on reported. The computations of PLS, PLS-VIP, the calibration data or training samples, thus it supervised SOM and SOM-DI were carried is an internal validation or auto-predictive mode. out using in-house scripts in Matlab (2010, The On the other hand, RMSEP calculates the error MathWorks, Natick, MA). of the predicted response values ( ĉi ) of the test samples. 3. R ESULTS AND DISCUSSIONS The cross-validated explained variance of 3.1 NIR and UV-Vis Datasets the model was calculated by: NIR spectra of the rice samples and the corresponding PCA are presented in Figure 3(A) N ∑ (cˆ − c )i i 2 and 3(B), respectively. From the NIR spectra, the 2 Q =1− i =1 N rice samples had similar pattern where the shapes ∑ (c − c ) i 2 of the NIR spectra were nearly identical. In this i =1 situation, it was not easy to recognize the difference If the predicted response values ( ĉi ) are between the samples from the investigation of the test samples, this correlation index implies the raw spectra. However, when the data was visualized error in test mode (Q2). Normally, the values of using the first two PCs, it was possible to observe 2 2 R and Q as close as possible to 1.0 are expected the change in the KDML105 rice samples when and imply the greater degree of variation within mixed with the different amount of the white rice. the data modelled by the calibration model. In Figure 3(B), the rice samples were scattered The ratio of RMSEP and RMSEC (RP/Auto) was across the PCA space where the mixing levels calculated to indicate the model robustness. If increased from the top to the bottom along PC2.
166 Chiang Mai J. Sci. 2020; 47(1) 0.8 1.6 15 5 0.6 10 0 1.4 25 0.4 35 45 20 1.2 0.2 30 Absorbance 40 55 1 0 Trainging samples 50 PC2 Test samples -0.2 60 0.8 65 85 -0.4 80 75 0.6 70 -0.6 90 0.4 -0.8 100 95 -1 400 600 800 1000 1200 1400 1600 1800 2000 2200 2400 -64.799 -64.798 -64.797 -64.796 -64.795 -64.794 -64.793 -64.792 Wavelength (nm) PC1 (A) (B) 495 Figure 3. (A) NIR spectra of the rice samples and (B) PCA score plot of PC1 against PC2 of the NIR 496 Figure 3. (A) NIR spectra of the rice samples and (B) PCA score plot of PC1 against PC2 of the spectra after SNV treatment and the samples were labeled according to the percentage of KDML105. 497 NIR spectra after SNV treatment and the samples were labeled according to the percentage of 498 KDML105. 499 On the other hand, the change in variation on the cluster (the samples labeled as 1, 22 and 28) 500 PC1 was rather complicated. The samples having where the mixture samples having more than different adulteration levels possessed similar score one component were placed in the middle of the 501 values of PC1. For example, the samples with cluster as presented in Figure 4(B). 502 10% and 90% have nearly the same PC1 score 503 values the middle of the PC1 axis. This implied 3.2 Variable Selection for the NIR Dataset of that the NIR data has non-linear characteristics Adulterated Rice 504 in nature and multivariate analysis should be used The important variables identified using 505 to process data. PLS-VIP and SOM-DI are illustrated in Figure 5(A) 506 Figure 4(A) shows the UV-Vis spectra of and 5(B), respectively. The significant variables the food colorant samples. It can be seen that the from both selection methods were not identical 507 samples were characterized by three overlapping meaning that they utilized the data from different 508 peaks with the λmax at 426 nm, 516 nm and 630 parts of the NIR for predicting the adulteration 509 nm, respectively, representing the absorbance of level. In Figure 5(A), PLS-VIP seemed to capture the yellow, red and blue food colorants. The PCA the variation in the region of long visible light 510 model of the UV-Vis spectral data is presented in (664-884 nm). This indicated that the PLS-based 511 Figure 4(B). The characteristic pattern in the PCA model was sensitive toward the change in sample 512 structure of the UV-Vis data was quite different color. The variation in the sample color could be from that of the NIR data. In the UV-Vis spectra, due to that the grain characteristics of KDML105 513 there were three components in the mixing samples was less opaque and relatively clearer than that 514 and their compositions were varied according to of CN1 white rice. Although the grain color of 515 the three-component mixture design. The detected both KDML105 and CN1 were not obviously peaks allowed to be overlapped with different different when observed using naked eyes, the ratios to provide the variation for quantitative spectrophotometer could be more effectively analysis purpose. As a result, the samples in the detect the color difference. The absorbance was PCA were clustered into one region. Each of the linearly changed with the increase of the KDML105 samples of the pure color component (100% of composition. red, yellow and blue) was located at the edge of
Chiang Mai J. Sci. 2020; 47(1) 167 40 3.5 22 Trainging samples 30 Test samples 23 3 24 16 20 30 42 2.5 18 17 11 26 25 10 38 43 Absorbance 28 19 12 20 44 2 27 40 45 39 PC2 0 46 31 13 7 8 35 1.5 14 9 36 4 -10 21 41 15 5 33 10 2 29 1 6 34 37 -20 32 3 0.5 -30 1 0 -40 350 400 450 500 550 600 650 700 750 800 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 Wavelength (nm) PC1 (A) (B) Figure 516Figure 4. (A) 4. (A) Visible Visible spectra spectra of of thethefood foodcolorant colorantsamples samplesand and (B) (B) PCA PCA score score plot plotof ofPC1 PC1 against 517PC2against PC2 of the visible of the visible spectra. spectra. 1.6 1.6 1.4 1.4 1.2 1.2 1 1 Absorbance Absorbance 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 400 600 800 1000 1200 1400 1600 1800 2000 2200 2400 400 600 800 1000 1200 1400 1600 1800 2000 2200 2400 Wavelength (nm) Wavelength (nm) (A) (B) Figure 5. Variable selection results of the NIR dataset using (A) PLS-VIP and (B) SOM-DI. The Figure 5. Variable selection results of the NIR dataset using (A) PLS-VIP and (B) SOM-DI. The wavelengths wavelengthsidentified as significance identified were as significance highlighted were highlightedusing usingvertical vertical closed and dotted closed and dottedred red lines, respectively, for PLS-VIP and SOM-DI. lines, respectively, for PLS-VIP and SOM-DI. On the other hand, SOM-DI shown in Figure The predictive results using PLS and supervised 5(B) identified the characteristic NIR bands of SOM before and after the variable selection are water (1,400 nm and 1,900 nm), OH (1,600 nm) summarized in Table 1. In this case, the predictive and CH (1,700-1,800 nm) bonds, respectively, as performance of the PLS model clearly improved importance [25]. These NIR regions corresponded where the RMSEP was reduced from 10.97 to to moisture and starch molecules of grains 5.206. In addition, the error in prediction of each implying that KDML105 has different moisture sample was reduced resulting in the higher Q2. content and ratio of starch molecules compared The samples placed closer to the regression line to the CN1 white rice. It was possible that the (Figure 6(B)) when compared to the correlation changed of water, OH and CH bond signals graph of the prediction model using the whole related to adulteration in the fragrant rice [26]. variables (Figure 6(A)). The ratio between RMSEP
168 Chiang Mai J. Sci. 2020; 47(1) Table 1. Predictive results of the NIR and ultraviolet-visible dataset using PLS and supervised SOM before and after variable selection methods. Full spectra Data Methods Total variables RMSEC R2 RMSEP Q2 RP/Auto PLS 4200 2.143 0.995 10.97 0.854 5.120 NIR SOM 4200 1.795 0.997 3.222 0.987 1.795 PLS 451 0.0154 1.00 0.0270 0.999 1.753 UV-Vis SOM 451 0.0548 0.999 0.0848 0.994 1.547 Selected variables PLS-VIP 478 3.697 0.986 5.206 0.967 1.408 NIR SOM-DI 430 3.189 0.988 2.230 0.981 1.147 PLS-VIP 44 0.015 1.00 0.0237 0.999 1.539 UV-Vis SOM-DI 45 0.149 0.989 0.4481 0.820 3.013 (A) (B) (C) (D) Figure 6. Correlation Figure graphs 6. Correlation graphsbetween betweenexpected expectedand and predicted concentrationofofKDML105 predicted concentration KDML105 in the in the mixing ricerice mixing samples. (A)(A) samples. andand (B)(B) areare thetheprediction predictionusing using PLS PLS before and andafter afterthe thevariables variableswere were screened screened by by PLS-VIP. PLS-VIP. (C)(C) andand (D)are (D) arethe theprediction prediction using supervised supervisedSOM SOMbefore beforeand after and thethe after variables variables werewere screened screened by SOM-DI. by SOM-DI.
Chiang Mai J. Sci. 2020; 47(1) 169 and RMSEC was also reduced confirming that the based on the regression model. According to model robustness was improved. If this parameter the assumption that the predicted response was is close to 1, this informs that the models is not linearly changed. The predictive performance prone to overfitting problem and the predictive of SOM could be improved by increasing the performance of the training and test samples can number of the training samples. Since there were be comparable. three components mixing in the samples, three The predictive accuracy of the supervised different PLS1 models were established for each SOM was slightly improved after the variables of the color components. Figure 7(A), 7(C) and were screened by the SOM-DI selection. In this 7(E) show the important variables identified using study case, SOM as a non-linear prediction still PLS-VIP for Carmoisine, Tartrazine and Brilliant provided better predictive results when compared blue FCF food colorants. The correlation graphs to the PLS model with the RMSEP value of 3.222 of the expected and predicted concentrations and 2.230, for the prediction using all and those before and after the variable selection of PLS selected variables. This implied that the SOM model models are illustrated in Figure 8. For comparison, could be suitable for capturing and processing the Figure 9 shows the prediction results of the SOM non-linear structure in the data shown in the PCA models before and after the reduction of the model in Figure 3(B). A slightly decrease in the Q2 prediction variables. value, illustrated in Figure 6(D) when compared In all cases, PLS-VIP identified the absorbance to Figure 6(C), implied that SOM model had a in the region around 600-650 nm as important capability to handle the entire variation in the data variables (Figure 7(A), 7(C) and 7(E)). It is noted and utilize them for the non-linear prediction. that the peak maximum at 630 is from the blue This was the main advantage of the SOM models. food colorant as shown in Figure 1(A). For PLS, on the contrary, was the prediction based the prediction of the yellow food colorant, the on the captured variation on the selected latent maximum wavelengths of all peaks were identified variables which should be carefully optimized. as importance (Figure 7(C)). For the prediction of the blue food colorant (Figure 7(E)), it appeared 3.3 Variable Selection for the UV-Vis Dataset that the PLS-VIP captured the variation of the of Food Colorants absorption peak at only the region of blue food In this case study, the mixture samples colorant for the main prediction. However, the consisted of three different components and prediction for the red food compound also indicated their concentrations were varied according to a that the peak band at around 513-518 nm and three-component mixture design. In overall, the 610-645 nm were significant for the prediction. predictive results of PLS was better than that of This interpretation was incorrect because ideally supervised SOM. Using the whole spectra, the the absorbance at 513-518 nm should be only the RMSEPs of the three components were 0.0270 peak band that was responsible for the estimation and 0.0848, respectively, for PLS and supervised of the red color component. SOM. The greater value of RMSEP of supervised The reason for the misinterpretation could be SOM indicated the poorer predictive results. This that the training samples, in this case study, were could be that SOM, in general, required more prepared using a mixture design model. Although samples to establish the complete variation in the the concentrations of the color components were modelling. The more samples used for training varied, their variation presented in the design should the model, the better predictive ability the model be approximately the same. However, the PLS could be obtained. In contrast, PLS, which was model captured the variables having the maximum a linear model, could interpolate the variation variation and correlated these variations for the
170 Chiang Mai J. Sci. 2020; 47(1) Carmoisine (Red) Carmoisine (Red) 3.5 3.5 3 3 2.5 2.5 Absorbance Absorbance 2 2 1.5 1.5 1 1 0.5 0.5 0 0 350 400 450 500 550 600 650 700 750 800 350 400 450 500 550 600 650 700 750 800 Wavelength (nm) Wavelength (nm) (A) (B) Tartrazine (Yellow) Tartrazine (Yellow) 3.5 3.5 3 3 2.5 2.5 Absorbance Absorbance 2 2 1.5 1.5 1 1 0.5 0.5 0 0 350 400 450 500 550 600 650 700 750 800 350 400 450 500 550 600 650 700 750 800 Wavelength (nm) Wavelength (nm) (C) (D) Brilliant blue FCF (Blue) Brilliant blue FCF (Blue) 3.5 3.5 3 3 2.5 2.5 Absorbance Absorbance 2 2 1.5 1.5 1 1 0.5 0.5 0 0 350 400 450 500 550 600 650 700 750 800 350 400 450 500 550 600 650 700 750 800 Wavelength (nm) Wavelength (nm) (E) (F) Figure Figure7.7.Variable Variableselection selectionresults resultsof of the UV-Vis UV-Vis dataset datasetusing usingPLS-VIP PLS-VIP(A), (A),(C) (C)and and(E), (E), andand SOM-DI SOM-DI (B), (B), (D)(D) andand (F). (F). The Thewavelengths wavelengths identified identified as as significance significance were were highlighted highlighted using using vertical vertical closed andclosed dottedand dotted lines, lines, respectively, respectively, for PLS-VIP forand PLS-VIP SOM-DI. and SOM-DI.
Chiang Mai J. Sci. 2020; 47(1) 171 (A) (B) (C) (D) (E) (F) Figure Figure8. 8. PLS correlation PLS plots correlation plotsusing full using spectra full ((A), spectra (C), ((A), and (C), (E)) and and (E)) selected and variables selected ((B), variables (D), ((B), and(D), (F)). and (F)).
172 Chiang Mai J. Sci. 2020; 47(1) (A) (B) (C) (D) (E) (F) Figure Figure 9. 9. Supervised Supervised SOM SOM correlation correlation plots plots using using fullfull spectra spectra ((A),((A), (C), (C), and and and (E)) (E))selected and selected variables variables ((B), ((B), (D), and (F)).(D), and (F)).
Chiang Mai J. Sci. 2020; 47(1) 173 prediction of the response. Therefore, when the 4. C ONCLUSIONS PLS models were not simultaneously used for the The significant variables identified by different prediction, the region having the highest variation variable selection methods could be different. These (in this case the blue color compound) possessed resulted in variation in the predictive performance the most significance in the prediction. In this of the constructed models. The different sets of case, PLS successfully obtained good predictive the importance variables allowed the widened results. The model with the variable reduction interpretation of data. In this research, supervised also resulted in slightly lower RMSEP as reported SOM as a non-linear calibration model utilized in Table 1. The fortunate explanation could be the variation from the NIR overtones and offered that the test samples were generated based on the better predictive results for the NIR dataset of same mixture design or they were a subset model the adulterated rice. On the other hand, for the of the training samples. If the test samples were UV-Vis dataset, PLS captured the peaks with the from different systems, for example, additional highest variation and resulted in good predictive food colorants or impurities were added in the performance. However, PLS-VIP in some cases samples, the predict results could be weakened. picked out the wrong peak positions in the On the contrary to PLS-VIP, for the red and prediction. In this case, the concentrations of all yellow food colorants, SOM-DI differently color components were estimated based on the identified significant variables for the prediction absorbance of the blue color component due to models. The non-linear model reported that the the rotational problem of the mixture design. isosbestic regions (the wavelengths of different compounds present the same absorbance) as the A CKNOWLEDGMENT important variables for the prediction. For example, S. Kittiwachana would like to acknowledge 460-480 nm and 550-570 nm for Carmoisine in the Chiang Mai University (CMU) Junior Research Figure 7(B) and 350-355 nm and 450-480 nm for Fellowship Program. The Postharvest Technology Tartrazine in Figure 7(D). For the prediction of Innovation Centre, Office of the Higher Education the blue component, the model correctly identified Commission, Bangkok, Thailand, was also the significant region. However, the predictive acknowledged. S. Wongsaipun would like to thank performance of the supervised SOM with the the Science Achievement Scholarship of Thailand variable reduction were severely reduced having (SAST). C. Krongchai and S. Funsueb would like increase RMSEP. This implied that SOM more to thank the Development and Promotion of effectively handled the entire variation in the Science and Technology Talents Project (DPST). dataset. The only one model was simultaneously used for predicting all of the color components R EFERENCES which was different from the PLS model that [1] Brown J.Q., Vishwanath K., Palmer G.M. and requited three separating models for the prediction Ramanujam N., Curr. Opin. Biotechnol., 2009; 20: of three different color components. In this case, 119-131. DOI 10.1016/j.copbio.2009.02.004. the variable reduction could lead to the missing [2] Magwaza L., Opara U., Nieuwoudt H., Cronje of important information. Using SOM-DI, the P., Saeys W. and Nicolaï B., Food Bioprocess regions corresponding the absorbance of the Technol., 2011; 5: 425-444. DOI 10.1007/ yellow and red food colorant were discarded s11947-011-0697-1. after the variable screening leading to the poorer prediction. Whereas, using PLS, the positions [3] Bosch Ojeda C. and Sánchez Rojas F., where the absorbance was high were incorporated Appl. Spectrosc. Rev., 2009; 44: 245-265. DOI into the prediction model. 10.1080/05704920902717898.
174 Chiang Mai J. Sci. 2020; 47(1) [4] Févotte G., Calas J., Puel F. and Hoff C., Int. [16] Palermo G., Piraino P. and Zucht H.D., Adv. J. Pharm., 2004; 273: 159-169. DOI 10.1016/j. Appl. Bioinform. Chem., 2009; 2: 57-70. PMCID ijpharm.2004.01.003. PMC3169946. [5] Brereton R.G. Chemometrics for Pattern Recognition, [17] Jun C., Lee S.H., Park H.S. and Lee J.H., 2009 1st Edn., Wiley: Chichester, U.K., 2009. International Conference on Computers & Industrial Engineering (CIE 2009), Troyes, France, 6-8 [6] Andersen C.M. and Bro R., J. Chemometr., July 2009; 1302-1307. 2010; 24: 728-737. DOI 10.1002/cem.1360. [18] Farrés M., Platikanov S., Tsakovski S. and [7] Liu F., Jiang Y. and He Y., Anal. Chim. Acta, 2009; Tauler R., J. Chemometr., 2015; 29: 528-536. 635: 45-52. DOI 10.1016/j.aca.2009.01.017. DOI 10.1002/cem.2736. [8] Lloyd G.R., Wongravee K., Silwood C.J., [19] Kohonen T. The self-organizing map. Grootveld M. and Brereton R.G., Chemom. Proc. IEEE, 1990; 78: 1464-1480. DOI Intell. Lab. Syst., 2009; 98: 149-161. DOI 10.1109/5.58325. 10.1016/j.chemolab.2009.06.002. [20] Lloyd G.R., Brereton R.G. and Duncan J.C., [9] Brereton R.G. Chemometrics: Data Analysis Analyst, 2009; 133: 1046-1059. DOI 10.1039/ for the Laboratory and Chemical Plant, 1st Edn., b715390b. Wiley: Chichester, U.K., 2005. [21] Kittiwachana S., Wangkarn S., Grudpan K. [10] Geladi P. and Kowalski B.R., Anal. Chim. and Brereton R.G., Talanta, 2013; 106: 229- Acta, 1986; 185: 1-17. DOI 10.1016/0003- 236. DOI 10.1016/j.talanta.2012.12.005. 2670(86)80028-9. [22] Krongchai C., Funsueb S., Jakmunee J. and [11] Marbach R. and Heise H.M., Chemom. Intell. Kittiwachana S., J. Chemometr., 2017; 31: 1-10. Lab. Syst., 1990; 9: 45-63. DOI 10.1016/0169- DOI 10.1002/cem.2871. 7439(90)80052-8. [23] Wongsaipun S., Krongchai C., Jakmunee J. and [12] Brás L.P., Lopes M., Ferreira A. and Menezes Kittiwachana S., Food Anal. Method., 2018; 11: J., J. Chemometr., 2008; 22: 695-700. DOI 613-623. DOI 10.1007/s12161-017-1031-y. 10.1002/cem.1153. [24] Xiaobo Z., Jiewen Z., Povey M.J.W., Holmes [13] Wold S., Johansson E. and Cocchi M., PLS- M. and Hanpin M., Anal. Chim. Acta, 2010; Partial Least Squares Projections to Latent 667: 14-32. DOI 10.1016/j.aca.2010.03.048. Structures; ESCOM Science, Umetrics Inc., Theory Methods and Applications, Kinnelon, [25] Theanjumpol P., Ripon S., Karaboon S., USA, 1993: 523-550. Suwapanit K., Thanapornpoonpong S. and Vearasilp S., Proceedings of Conference on [14] Andries J.P.M., Heyden Y.V. and Buydens International Agricultural Research for Development L.M.C., Anal. Chim. Acta, 2013; 760: 34-45. (Tropentag 2005), Germany, 11-13 October DOI 10.1016/j.aca.2012.11.012. 2005; 1-4. [15] Morita A., Araki T., Ikegami S., Okaue M., Sumi [26] Verma K.D. and Srivastav P.P., Rice Sci., 2017; M., Ueda R., Sagara Y., Food Sci. Technol. Res., 24: 21-31. DOI 10.1016/j.rsci.2016.05.005. 2015; 21: 175-186. DOI 10.3136/fstr.21.175.
You can also read