Multivariate Spectra Analysis: PLSR vs. PCA + MLR

Abstract: For mixtures of compounds with very similar spectral features, which are common for larger organic molecules, multivariate analysis (MVA) methods can be applied to determine the concentrations of the individual components. We analyzed photoacoustic spectra of mixtures of different volatile organic compounds (VOCs) with and without different feature selection and feature projection methods. The methods include Multiple Linear Regression (MLR), Principal Component Analysis (PCA), Partial Least Squares Regression (PLSR) and the Random Forest Algorithm (RFA). Although PLSR provided the best prediction accuracy, the other techniques also exhibited some advantages.


Introduction
Spectroscopic probing of energetic transitions of molecules or atoms enables the analysis of mixtures and the selective determination of concentrations. Successfully applied laser spectroscopic methods include absorption spectroscopy, atomic emission spectroscopy, fluorescence spectroscopy and photoacoustic spectroscopy (PAS) [1].
If the spectral features of the individual substances are broad and overlap strongly, evaluating the spectra requires multivariate analysis. The general suitability of Partial Least Squares Regression (PLSR) to determine the absolute concentrations of different components of a mixture has been demonstrated [2,3]. However, the corresponding study also revealed certain limitations of this evaluation method. Therefore, we investigated further methods of multivariate statistics and compared their prediction accuracy.
A spectrum of each volatile organic compound (VOC) was recorded with a photoacoustic analyzer based on an optical parametric oscillator (OPO). The system delivers highly resolved spectra in the mid-IR wavelength region between 3.2 and 3.5 μm [4,5].
The measured spectra of the single VOCs were weighted and additively combined in several variations to obtain a larger dataset covering a wider range of concentrations. To account for the measurement uncertainty, noise was added to each of these synthetic mixtures.
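The construction of the synthetic mixtures can be sketched as follows. This is a minimal illustration and not the authors' procedure: the Gaussian band shapes, the concentration range, the noise level and all names (`pure_spectra`, `make_mixtures`, `X`, `Y`) are assumptions standing in for the measured single-VOC spectra.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the measured single-VOC spectra:
# 5 compounds, 200 wavelength points, broad overlapping Gaussian bands.
wavelengths = np.linspace(3.3, 3.5, 200)               # micrometres
centres = np.linspace(3.36, 3.44, 5)
pure_spectra = np.exp(-((wavelengths[None, :] - centres[:, None]) / 0.03) ** 2)


def make_mixtures(pure, n_mixtures, c_max, noise_sd, rng):
    """Weight the pure spectra by random concentrations, sum them, add noise."""
    conc = rng.uniform(0.0, c_max, size=(n_mixtures, pure.shape[0]))
    clean = conc @ pure                                 # additive (linear) mixing
    noisy = clean + rng.normal(0.0, noise_sd, size=clean.shape)
    return conc, noisy


# Y: concentrations (100 x 5), X: noisy synthetic spectra (100 x 200).
Y, X = make_mixtures(pure_spectra, n_mixtures=100, c_max=1.0, noise_sd=0.01, rng=rng)
```

The additive combination is a single matrix product, which is the same linearity assumption that the regression models rely on.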

Multivariate Analysis
Multivariate analysis (MVA) is used to identify the relationship between the photoacoustic spectra and the concentrations of the different VOCs. The so-called response matrix $Y \in \mathbb{R}^{n \times m}$ contains the dependent variables $y_{ij}$, i.e. the concentration of VOC $j$ in mixture/spectrum $i$. We investigated $m = 5$ components and $n = 100$ mixtures. The predictor matrix $X \in \mathbb{R}^{n \times p}$ contains the independent variables $x_{ik}$, which correspond to the photoacoustic signal of mixture/spectrum $i$ at wavelength $k$. One measurement contains $p = 200$ values, which are equally distributed over the wavelength range (3.3 μm to 3.5 μm) in 1 nm steps. For the analysis, the synthetic spectra are split into a training set of 70 spectra and a validation set of 30 spectra.
The investigated methods, Multiple Linear Regression (MLR), Partial Least Squares Regression (PLSR) and Principal Component Analysis (PCA), are linear methods. Since the absorption of the VOCs at low concentrations is relatively weak, a linear relationship between the photoacoustic signal and the concentration can be assumed. In the simplest model, MLR, this relationship is defined as follows:

$$Y = X B + F \qquad (2)$$

with the model's linearity coefficients $B$, the prediction error $F$ and the predicted values (here the concentrations) $Y$.
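A minimal sketch of MLR on such data, writing the linear model as $Y = XB + F$ (symbol names assumed here). The synthetic spectra, the omission of an intercept and the variable names are assumptions for illustration. Only the 70/30 split and the matrix sizes are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical synthetic data: n = 100 mixtures, p = 200 wavelengths, m = 5 VOCs.
n, p, m = 100, 200, 5
grid = np.linspace(0.0, 1.0, p)
pure = np.exp(-((grid[None, :] - np.linspace(0.3, 0.7, m)[:, None]) / 0.15) ** 2)
Y = rng.uniform(0.0, 1.0, size=(n, m))                  # concentrations
X = Y @ pure + rng.normal(0.0, 0.01, size=(n, p))       # noisy spectra

# 70 training spectra and 30 validation spectra, as in the text.
X_train, X_val = X[:70], X[70:]
Y_train, Y_val = Y[:70], Y[70:]

# Least-squares estimate of B in Y = X B + F. With more wavelengths (200)
# than training spectra (70) the system is underdetermined, and lstsq
# returns the minimum-norm solution.
B, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)

Y_pred = X_val @ B
mae = np.mean(np.abs(Y_pred - Y_val))
```

Because the number of predictors exceeds the number of training spectra, the fit reproduces the training concentrations exactly, including their noise. This is one reason why reducing the dimensionality before the regression can help.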

Dimensionality Reduction by Feature Projection
Dimensionality reduction can increase the accuracy of the regression. PLSR performs this dimensionality reduction as a feature projection prior to the actual regression. Feature projection is a technique that generates new, fewer variables while preserving most of the information of the original dataset.
While Principal Component Analysis (PCA) only decomposes the matrix of independent variables $X$ (Equation (3)), PLSR also decomposes the matrix of dependent variables $Y$ into corresponding linear combinations ($T$, $U$) (Equation (4)) [6]:

$$X = T P^{\mathsf{T}} + E \qquad (3)$$

$$Y = U C^{\mathsf{T}} + G \qquad (4)$$

The noise in the data sets $X$ and $Y$ is indicated by the corresponding error terms $E$ and $G$. The regression model described by Equation (2) also applies to PLSR. The linearity coefficients $B$ are determined by the model's weights $W$ and loadings ($P$ and $C$) [6]:

$$B = W \left(P^{\mathsf{T}} W\right)^{-1} C^{\mathsf{T}} \qquad (5)$$

Equations (3)-(5) can be solved by the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm [6].
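The NIPALS iteration can be sketched in a few lines of NumPy. This is a textbook-style illustration under stated assumptions, not the implementation used in the study: the function name `pls_nipals`, the symbol names (`W`, `P`, `C` for weights and loadings, `t` and `u` for scores), the mean-centring and the synthetic test data are all chosen here for illustration.

```python
import numpy as np


def pls_nipals(X, Y, n_components, tol=1e-10, max_iter=500):
    """NIPALS PLS regression on mean-centred data.

    Returns the coefficient matrix B and the column means of X and Y.
    """
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    E, G = X - x_mean, Y - y_mean               # residuals, deflated per component
    W, P, C = [], [], []
    for _ in range(n_components):
        u = G[:, [np.argmax(G.var(axis=0))]]    # start from the Y column with most variance
        for _ in range(max_iter):
            w = E.T @ u
            w /= np.linalg.norm(w)              # X weights
            t = E @ w                           # X scores
            c = G.T @ t / (t.T @ t)             # Y loadings
            u_new = G @ c / (c.T @ c)           # Y scores
            converged = np.linalg.norm(u_new - u) < tol * np.linalg.norm(u_new)
            u = u_new
            if converged:
                break
        p = E.T @ t / (t.T @ t)                 # X loadings
        E = E - t @ p.T                         # deflate X
        G = G - t @ c.T                         # deflate Y
        W.append(w)
        P.append(p)
        C.append(c)
    W, P, C = np.hstack(W), np.hstack(P), np.hstack(C)
    B = W @ np.linalg.inv(P.T @ W) @ C.T        # coefficients from weights and loadings
    return B, x_mean, y_mean


# Hypothetical synthetic data: 100 mixtures, 200 wavelengths, 5 VOCs.
rng = np.random.default_rng(4)
grid = np.linspace(0.0, 1.0, 200)
pure = np.exp(-((grid[None, :] - np.linspace(0.3, 0.7, 5)[:, None]) / 0.15) ** 2)
Y = rng.uniform(0.0, 1.0, size=(100, 5))
X = Y @ pure + rng.normal(0.0, 0.01, size=(100, 200))

B, x_mean, y_mean = pls_nipals(X[:70], Y[:70], n_components=5)
Y_pred = (X[70:] - x_mean) @ B + y_mean
mae = np.mean(np.abs(Y_pred - Y[70:]))
```

Each component is extracted from the current residuals and then removed from both matrices (deflation), so the number of components directly sets the dimensionality of the projected feature space.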

Dimensionality Reduction by Feature Selection
In addition to feature projection, we investigated feature (subset) selection, a method of dimensionality reduction in which only the most relevant features (independent variables) of the original data set are retained [7,8]. We used the Random Forest Algorithm (RFA) implementation of the scikit-learn library for Python, version 0.19.1 [9].
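One possible way to use a random forest for feature selection is to rank the wavelengths by the forest's impurity-based importances and keep the top fraction. The sketch below assumes this ranking approach, a retained fraction of 70%, 100 trees and synthetic data. The text does not specify how the RFA was configured, so none of these settings should be read as the authors' choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Hypothetical synthetic data: 100 mixtures, 200 wavelengths, 5 VOCs.
grid = np.linspace(0.0, 1.0, 200)
pure = np.exp(-((grid[None, :] - np.linspace(0.3, 0.7, 5)[:, None]) / 0.15) ** 2)
Y = rng.uniform(0.0, 1.0, size=(100, 5))
X = Y @ pure + rng.normal(0.0, 0.01, size=(100, 200))

# Fit one multi-output forest and read the importance of each wavelength.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, Y)
importance = forest.feature_importances_

# Keep the 70% of wavelengths with the highest importance (assumed fraction).
n_keep = int(0.7 * X.shape[1])
selected = np.sort(np.argsort(importance)[::-1][:n_keep])
X_reduced = X[:, selected]
```

`X_reduced` can then be passed to any of the regression methods in place of the full spectra, and `selected` lists the wavelength indices that would still have to be measured.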

Results and Discussion
For the evaluation of the individual methods, two values are considered: the Mean Absolute Error (MAE) and the standard deviation ($\sigma$) of the prediction error ($\Delta c = c_{\mathrm{pred}} - c_{\mathrm{ref}}$), both averaged over all five VOCs. The bias of the different prediction methods is 6 ppb (parts per billion) or below. Table 1 lists the results of the different multivariate analysis methods. MLR is, in general, well suited for determining concentrations but gives less accurate results than the other methods. Even in combination with the RFA as a feature selection method, the accuracy remains the same. However, feature selection has a significant practical advantage: it reduces the measuring time considerably, since only the ca. 70% of the spectrum with the most significant values has to be recorded rather than the entire spectrum. This enables sensors with an approximately 30% shorter response time, which is quite relevant considering that it can take several hours to record a complete spectrum.
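The evaluation metrics can be computed as in the following sketch. The function name `error_metrics`, the argument names and the use of the sample standard deviation (`ddof=1`) are assumptions, and the numbers in the usage example are made up purely to show the call.

```python
import numpy as np


def error_metrics(c_pred, c_ref):
    """MAE and standard deviation of the prediction error, averaged over VOCs.

    c_pred, c_ref: arrays of shape (n_spectra, n_vocs).
    Returns (mae, sigma, bias), where bias holds one value per VOC.
    """
    delta = c_pred - c_ref                               # prediction error
    mae = np.mean(np.abs(delta), axis=0).mean()          # per VOC, then averaged
    sigma = np.std(delta, axis=0, ddof=1).mean()         # per VOC, then averaged
    bias = np.mean(delta, axis=0)                        # mean error per VOC
    return mae, sigma, bias


# Made-up example values (two spectra, two VOCs), in ppb.
c_ref = np.array([[10.0, 20.0], [30.0, 40.0]])
c_pred = np.array([[11.0, 19.0], [33.0, 38.0]])
mae, sigma, bias = error_metrics(c_pred, c_ref)
```

Reporting the bias separately from $\sigma$ distinguishes a systematic offset of a method from the scatter of its predictions.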
Applying feature projection, as in PCA and PLSR, significantly increases the prediction accuracy. PLSR provides the highest prediction accuracy of all methods. An advantage of PCA+MLR is that the dimensionality reduction is performed independently of the regression, so even data sets of completely unknown composition can be used for it.
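The separation of projection and regression in PCA+MLR can be illustrated with a short sketch in which the principal components are estimated from labelled and unlabelled spectra together, and only the labelled spectra enter the regression. The synthetic data, the number of components (5) and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical synthetic data (not the measured spectra): 5 overlapping bands.
p, m = 200, 5
grid = np.linspace(0.0, 1.0, p)
pure = np.exp(-((grid[None, :] - np.linspace(0.3, 0.7, m)[:, None]) / 0.15) ** 2)


def simulate(n):
    """Return (concentrations, noisy spectra) for n random mixtures."""
    conc = rng.uniform(0.0, 1.0, size=(n, m))
    return conc, conc @ pure + rng.normal(0.0, 0.01, size=(n, p))


Y_train, X_train = simulate(70)       # labelled training mixtures
Y_val, X_val = simulate(30)           # labelled validation mixtures
_, X_unknown = simulate(200)          # spectra of unknown composition

# PCA uses the spectra only, so the unlabelled spectra can contribute as well.
X_all = np.vstack([X_train, X_unknown])
mean = X_all.mean(axis=0)
_, _, Vt = np.linalg.svd(X_all - mean, full_matrices=False)
loadings = Vt[:m].T                   # (p, 5) orthonormal directions


def scores(X):
    """Project spectra onto the retained principal components."""
    return (X - mean) @ loadings


# MLR on the scores of the labelled spectra (with an intercept column).
design = np.column_stack([np.ones(len(X_train)), scores(X_train)])
coef, *_ = np.linalg.lstsq(design, Y_train, rcond=None)

Y_pred = np.column_stack([np.ones(len(X_val)), scores(X_val)]) @ coef
mae = np.mean(np.abs(Y_pred - Y_val))
```

In PLSR, by contrast, the projection itself depends on the concentrations, so spectra without reference values cannot be used to estimate it.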
Based on the first results presented here, the MVA models will be investigated further by cross-validation and with additional test data in the form of real gas mixtures. In addition, feature selection will be investigated in greater depth. It can also be combined with the feature projection methods introduced here.

Conflicts of Interest:
The authors declare no conflict of interest.