L1-Norm Based PCA for Unsupervised Classification

Abstract: Principal component analysis (PCA) is a widespread technique for the analysis of multivariate data, which finds applications in the fields of machine learning and artificial intelligence. Standard PCA seeks to calculate the subspace that minimizes the Euclidean distance (L2-norm) of the data points to it. Unfortunately, PCA is extremely sensitive to the presence of large outliers in the data. Recently, the L1-norm has been proposed as an alternative to the classical L2-norm criterion in PCA, drawing considerable research interest on account of its increased robustness to outliers. The present contribution shows that, when combined with a whitening preprocessing step, L1-norm based PCA is endowed with discriminative power and can perform data classification in an unsupervised manner, i.e., sparing the need for labelled data. By minimizing the L1-norm in the feature space, the technique mimics the action of common spatial patterns (CSP), a supervised feature extraction method used in brain-computer interfaces. This result is of theoretical interest and opens new research perspectives for L1-PCA. Furthermore, it enables us to perform classification using algorithms that optimize the L1-norm, which inherit the improved robustness to outliers of the L1-norm criterion. Several numerical experiments confirm the theoretical findings.


Introduction
L1-norm based criteria are becoming increasingly popular in the fields of machine learning and signal processing. In particular, there is growing interest in the development of L1-norm Principal Component Analysis (L1-PCA) [1,2]. L1-PCA is a variant of traditional PCA which offers enhanced robustness against large outliers. This is an interesting feature because outliers, which are erroneous measurements that lie far apart from the main bulk of the data, are very common in experimental datasets, due to imperfections in the measuring instruments or the environmental conditions. Specifically, L1-PCA has proven to be highly effective for the restoration of faulty data, the reconstruction of occluded images, and dimensionality reduction problems [1-4]. On the negative side, however, L1-PCA algorithms are either computationally intensive and time consuming [3], despite efforts to simplify their operation [4], or prone to fall into local optima [1]. Furthermore, L1-PCA is a difficult subject to analyze mathematically because, implicitly, it involves the higher-order statistics of the data. For one reason or another, only a few attempts have been made to explain the behavior of L1-PCA in practical situations. Among them, to cite an example, [5] showed that L1-PCA is able to perform Independent Component Analysis (ICA) if the data follow the ICA model. The present contribution continues the investigation of the properties of L1-PCA. Here, we report that L1-PCA, after a minor modification, replicates the operation of the technique known as Common Spatial Patterns (CSP), a supervised feature extraction method used in brain-computer interfaces [6]. As a result, L1-PCA can be used to separate overlapping populations that are normally distributed and to perform data classification in an unsupervised manner, i.e., sparing the need for labelled data, which is a remarkable feature. This finding opens new interesting research perspectives for L1-PCA in the field of machine learning. Furthermore, it enables us to develop classification algorithms based on the L1-norm, which inherit the improved robustness to outliers of the L1-norm criterion.
The paper is organized as follows: Section 2 introduces the L1-norm criterion, starting from standard PCA. Section 3 shows that the L1-norm is endowed with discriminative properties in binary classification scenarios. Section 4 illustrates the performance of the approach through computer simulations. Finally, Section 5 brings the paper to an end.

Background
Let x ∈ ℝᵖ be a multivariate random variable measured or observed during an experiment. For simplicity, we assume that E[x] = 0, where E[·] is the expectation operator. The aim of standard PCA is to find the best-fit low-dimensional subspace for the data points. This is the subspace that minimizes the average squared distance of the data points to it. It can also be shown that this problem is equivalent to finding linear projections of the variables that have maximal variance [7]. A projection onto the direction of a unit vector a is given by y = aᵀx.
The variance of the projected data equals

E[y²] = E[(aᵀx)²] = aᵀCa,   (1)

where C = E[xxᵀ] is the covariance matrix of x. The first principal component is the vector that solves the problem

arg max_{‖a‖=1} E[(aᵀx)²].   (2)

The n-th principal component is the vector that solves the optimization problem (2) subject to the additional constraint of being orthogonal to the previous n − 1 principal components. The desired best-fitting subspace, finally, is the span of the first few principal components. They are, in other words, the most significant directions characterizing the point cloud of x.
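For concreteness, the standard L2 solution described above can be sketched numerically via an eigendecomposition of the sample covariance; the function name and test data below are illustrative, not taken from the paper.

```python
import numpy as np

def pca_directions(X):
    """X: (N, p) zero-mean data matrix. Returns unit vectors (as columns)
    sorted by decreasing projected variance a^T C a, cf. (1)-(2)."""
    C = X.T @ X / len(X)              # sample covariance estimate
    evals, evecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]   # re-sort by descending variance
    return evecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2)) @ np.diag([3.0, 0.5])  # elongated cloud
A = pca_directions(X)
# The first column of A aligns with the high-variance axis of the cloud.
```

The successive columns are mutually orthogonal by construction, matching the orthogonality constraint on the n-th principal component.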
However, it is well-known that standard PCA overreacts to large outliers because it takes the square of the projected data in (1). In order to palliate this weakness, [1] proposed the replacement of the square function by the absolute value, yielding the following alternative criterion:

arg max_{‖a‖=1} E[|aᵀx|].   (3)

In practice, given a sample x₁, . . ., x_N from the random variable x, (3) is approximated by its sample-based estimate

max_{‖a‖=1} ∑_{i=1}^{N} |aᵀxᵢ|.   (4)

Because ∑_{i=1}^{N} |aᵀxᵢ| represents the L1-norm of the vector y whose ith entry is given by yᵢ = aᵀxᵢ, PCA based on criterion (3) is usually referred to as 'L1-norm based PCA' or, simply, 'L1-PCA'. Working algorithms for solving (3) have been proposed in [1,3,4].
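One simple way to attack the sample criterion (4) is a fixed-point ("sign-flipping") iteration in the spirit of the algorithms cited in [1]; the exact procedures in [1,3,4] may differ, and, as noted above, such iterations can stop at a local maximum. A minimal sketch:

```python
import numpy as np

def l1_pca_direction(X, n_iter=100, seed=0):
    """X: (N, p) zero-mean data. Returns a unit vector a that (locally)
    maximizes sum_i |a^T x_i|, i.e., the sample L1 criterion (4)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=X.shape[1])
    a /= np.linalg.norm(a)
    for _ in range(n_iter):
        s = np.sign(X @ a)           # polarity of each projection
        s[s == 0] = 1.0
        a_new = X.T @ s              # a <- sum_i sign(a^T x_i) x_i
        a_new /= np.linalg.norm(a_new)
        if np.allclose(a_new, a):    # fixed point reached
            break
        a = a_new
    return a

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2)) @ np.diag([4.0, 0.5])
a = l1_pca_direction(X)
# For this elongated Gaussian cloud, a aligns with the dominant axis,
# matching the behavior of standard PCA discussed in the next section.
```

Each iteration only flips signs and re-averages, which is what makes such schemes attractive compared with exhaustive search, at the price of possible local optima.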

L1-PCA in the Case of Gaussian Data
To gain some insight into the performance of L1-PCA, let us make the usual assumption that the probability density function of the data is a p-variate normal density function of the form

f(x) = (2π)^(−p/2) |C|^(−1/2) exp(−xᵀC⁻¹x / 2),   (5)

where C = E[xxᵀ] is the data covariance matrix. Let y = aᵀx be the projection of x onto the direction defined by a ∈ ℝᵖ. The probability density function of y is given by

f(y) = (2πσ²)^(−1/2) exp(−y² / (2σ²)),   (6)

where σ² = aᵀCa is the variance of the projected data. Now, some calculus shows that

E[|y|] = √(2/π) σ.

Then, as maximizing the standard deviation σ is equivalent to maximizing the variance σ², one sees that L1-PCA behaves in this case like traditional PCA, while offering robustness against the presence of large outliers in the data [1-3].
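The identity E[|y|] = √(2/π) σ is easy to check by Monte Carlo simulation; the value σ = 2.5 below is an arbitrary illustrative choice.

```python
import numpy as np

# Numerical check of E[|y|] = sqrt(2/pi) * sigma for y ~ N(0, sigma^2).
rng = np.random.default_rng(42)
sigma = 2.5
y = rng.normal(scale=sigma, size=1_000_000)

empirical = np.abs(y).mean()                 # sample mean of |y|
theoretical = np.sqrt(2.0 / np.pi) * sigma   # closed-form expectation
# The two values agree to a few decimal places for this sample size.
```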

L1-norm Based Classification
Binary classification problems are ubiquitous in many real-life applications. Consider that we observe random samples drawn from two different populations ω₁ and ω₂ with the same population mean, assumed to be zero. It is supposed that the distribution of the random samples can be modeled as a mixture of Gaussians, i.e.,

f(x) = π₁ f₁(x) + π₂ f₂(x),

where π₁ and π₂ are the a priori probabilities of occurrence of ω₁ and ω₂, and f₁ and f₂ are zero-mean Gaussian densities with the corresponding class covariance matrices C₁ and C₂ (C₁ ≠ C₂). Consider again the L1-norm criterion

E[|aᵀx|].   (7)

Similar calculus as above shows that

E[|aᵀx|] = √(2/π) (π₁ σ₁(a) + π₂ σ₂(a)),   (8)

where σᵢ²(a) = aᵀCᵢa is the variance of the ith class in the direction of the unit vector a ∈ ℝᵖ. Let us assume hereafter, without any loss of generality, that the data are whitened. A random variable x is whitened by multiplying it by a matrix Q so that the result Qx has covariance QCQᵀ = I, where C = E[xxᵀ] and I is the identity matrix. This goal can be achieved in practice by setting Q = C^(−1/2). To keep the notation simple, the whitened data are also denoted, with some abuse, by x. Likewise, the whitened class covariance matrices are still denoted by C₁ and C₂. Whitening implies that

π₁ C₁ + π₂ C₂ = I.   (9)

Furthermore, Equation (8) still holds true. The real utility of whitening is that it introduces a constraint on the class variances, namely,

π₁ σ₁²(a) + π₂ σ₂²(a) = 1,   (10)

so that when one of them increases the other decreases, and vice versa. As a consequence, a thorough analysis leads to the following result (proof is omitted):

Theorem 1. Under the whitening assumption, the minimizers of (8) with the constraint ‖a‖ = 1 maximize or minimize the power ratio

R(a) = σ₁²(a) / σ₂²(a).   (11)

This Theorem can be put in relation to the useful technique known as common spatial patterns (CSP), which is widely used in brain-computer interfaces (BCIs). Typically, electroencephalogram (EEG) samples are acquired under two different experimental conditions (e.g., imagining left and right hand movements). CSP linearly projects the data onto directions where the ratio (11) is maximal or minimal or, in simple
words, where the variance of the projected data points is significantly higher for one class than for the other. The projected data variances are then used as features for classification [6]. It follows that the L1 criterion possesses the discriminative capabilities of CSP. Quite interestingly, CSP is a supervised technique, whose performance relies heavily on the availability of correctly labeled data. On the contrary, minimizing the L1 criterion (8) can be performed in a completely unsupervised fashion.
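This connection can be illustrated numerically. In the sketch below (illustrative 2-D classes, not from the paper), the labels are used only to evaluate the variance ratio (11) afterwards; the L1-minimizing direction itself is found without them.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20_000
X1 = rng.normal(size=(N, 2)) @ np.diag([1.8, 0.3])   # class omega_1
X2 = rng.normal(size=(N, 2)) @ np.diag([0.3, 1.8])   # class omega_2
X = np.vstack([X1, X2])                               # pooled; labels unused

# Whitening step: Q = C^(-1/2) for the pooled covariance C, cf. Section 4.
C = X.T @ X / len(X)
evals, evecs = np.linalg.eigh(C)
Q = evecs @ np.diag(evals ** -0.5) @ evecs.T
Xw, X1w, X2w = X @ Q.T, X1 @ Q.T, X2 @ Q.T

# Scan unit directions a(theta) and pick the minimizer of the sample L1 cost.
thetas = np.linspace(0.0, np.pi, 721)
A = np.stack([np.cos(thetas), np.sin(thetas)])        # (2, n_thetas)
l1_cost = np.abs(Xw @ A).mean(axis=0)
theta_star = thetas[np.argmin(l1_cost)]
a = np.array([np.cos(theta_star), np.sin(theta_star)])

# At the unsupervised L1 minimizer, the ratio (11) is extreme: one class
# dominates the projected variance, exactly as CSP would arrange.
ratio = (X1w @ a).var() / (X2w @ a).var()
```

In a BCI setting, CSP would need the class labels to find such a direction; here the same direction emerges from the unlabeled pooled data.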

Computer Experiments
Some experiments are now conducted to illustrate the potential of the L1-norm approach.

Experiment 1
To illustrate Theorem 1, let us consider a mixture in a bidimensional space of two equiprobable Gaussian classes, i.e., π₁ = π₂ = 1/2, with zero means and respective covariance matrices C₁ and C₂ fulfilling the whitening condition (9). Figure 1 represents the theoretically exact value of the cost function J(θ) = E[|a(θ)ᵀx|], with a(θ) = [cos(θ), sin(θ)]ᵀ, calculated from Equation (8). For reference, we also plot the power ratio R(θ), defined as in (11), in the same figure. We see that the minima of the L1-cost J(θ) coincide with either the maximum or the minimum of R(θ), as predicted by the Theorem. We also see that the maxima of J(θ) are at 0 and ±π/2 rad. At these points, the standard deviations of the projected populations are the same, i.e., σ₁ = σ₂, with σᵢ² = a(θ)ᵀCᵢa(θ). It follows that the projected populations are totally mixed, because the different classes cannot be distinguished from each other.
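A closed-form version of this experiment can be reproduced as follows. The covariance matrices below are an illustrative pair satisfying condition (9) for π₁ = π₂ = 1/2, not necessarily those used for the original figure.

```python
import numpy as np

C1 = np.array([[1.0, 0.6], [0.6, 1.0]])
C2 = np.array([[1.0, -0.6], [-0.6, 1.0]])   # (C1 + C2) / 2 = I: condition (9)

thetas = np.linspace(0.0, np.pi, 3601)
A = np.stack([np.cos(thetas), np.sin(thetas)])     # unit vectors a(theta)
s1 = np.sqrt(np.einsum('it,ij,jt->t', A, C1, A))   # sigma_1(theta)
s2 = np.sqrt(np.einsum('it,ij,jt->t', A, C2, A))   # sigma_2(theta)
J = np.sqrt(2.0 / np.pi) * 0.5 * (s1 + s2)         # Equation (8)
R = s1**2 / s2**2                                  # Equation (11)

# For these matrices, J is maximal at theta = 0 and pi/2, where
# sigma_1 = sigma_2 (R = 1), and minimal at theta = pi/4 and 3*pi/4,
# where R reaches its extremes (4 and 1/4), as Theorem 1 predicts.
```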

Experiment 2
To test the L1-norm approach in a multidimensional setting, we perform several experiments with p ∈ {2, 5, 10, 15, 20, 25, 30}. In each one, we draw N = 50p samples for each of the two Gaussian classes, and the covariance matrices C₁ and C₂ are randomly generated. After applying a whitening pre-processing to the data, the gradient descent algorithm in [8] is applied to find the orthogonal directions that (globally or locally) minimize the L1-norm criterion (7). The closeness to the subspace spanned by the line in the direction of the global minimum is used as an unsupervised criterion to classify the random samples into one cluster or the other. Figure 2 shows the accuracy of the classification, averaged over 100 independent experiments. Furthermore, L1-norm criteria are also expected to exhibit robustness against large outliers. To test this property, we repeat the experiment with the difference that the whitened data points are now corrupted by replacing 10 percent of the data samples, at randomly chosen time instants, with Gaussian noise realizations with identity covariance matrix and mean µ_outliers = [10, 10, . . ., 10]ᵀ. The new results are also represented in Figure 2, confirming the robustness of the L1-norm approach. In both cases, we see that the performance increases with the dimensionality of the input representation. This finding reflects the well-known fact that it is usually easier to perform classification in high-dimensional spaces.
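A self-contained version of this classification rule can be sketched for p = 2, where the L1 minimizer can be found by a simple grid scan instead of the gradient algorithm of [8]; the class covariances and decision rule below are illustrative choices, not the randomly generated setup of the actual experiment.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2000
X1 = rng.normal(size=(N, 2)) @ np.diag([1.8, 0.3])   # class omega_1
X2 = rng.normal(size=(N, 2)) @ np.diag([0.3, 1.8])   # class omega_2
X = np.vstack([X1, X2])
labels = np.repeat([0, 1], N)                         # held out for scoring only

# Whitening pre-processing of the pooled data.
C = X.T @ X / len(X)
evals, evecs = np.linalg.eigh(C)
Xw = X @ (evecs @ np.diag(evals ** -0.5) @ evecs.T).T

# Direction minimizing the sample L1 criterion, by scanning theta in [0, pi].
thetas = np.linspace(0.0, np.pi, 721)
A = np.stack([np.cos(thetas), np.sin(thetas)])
a = A[:, np.argmin(np.abs(Xw @ A).mean(axis=0))]

# Unsupervised rule: assign each point by its closeness to the line span(a),
# i.e., compare the projection onto a with the residual distance to the line.
proj = Xw @ a
dist_to_line = np.linalg.norm(Xw - np.outer(proj, a), axis=1)
pred = (np.abs(proj) > dist_to_line).astype(int)

# Cluster labels are arbitrary, so score up to a label permutation.
acc = max((pred == labels).mean(), (pred != labels).mean())
```

For higher p, the grid scan would be replaced by an iterative minimizer such as the gradient descent of [8], but the classification rule stays the same.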

Figure 2. Accuracy of the classification as a function of the data dimensionality. Blue line: accuracy calculated from outlier-free data. Red line: ditto for the outlier-corrupted data.