
Efficient sparse least squares support vector machines for pattern classification



Yingjie Tian*, Xuchan Ju, Zhiquan Qi, Yong Shi
Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing, China


Keywords: Least squares support vector machine; Sparseness; Loss function; Classification; Regression

Abstract

We propose a novel least squares support vector machine, named the ε-least squares support vector machine (ε-LSSVM), for binary classification. By introducing the ε-insensitive loss function instead of the quadratic loss function into LSSVM, ε-LSSVM has several advantages over the plain LSSVM. (1) It is sparse, and the sparseness is controlled by the parameter ε. (2) By weighting a different sparseness parameter ε for each class, the unbalanced problem can be solved successfully; furthermore, a useful choice of the parameter ε is proposed. (3) It is actually a kind of ε-support vector regression (ε-SVR); the only difference is that it treats the binary classification problem as a special kind of regression problem. (4) It can therefore be implemented efficiently by the sequential minimal optimization (SMO) method for large scale problems. Experimental results on several benchmark datasets show the effectiveness of our method in sparseness, balance performance and classification accuracy, and thereby confirm the above conclusions.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Support vector machines (SVMs), introduced by Vapnik and his co-workers in the early 1990s [1–3], are computationally powerful tools for supervised learning [4,5] and have outperformed most other methods in a wide variety of applications [6–11]. Least squares support vector machines (LSSVMs) were proposed later [12,13]; they only need to solve a linear system instead of the quadratic programming problem (QPP) of standard SVMs, and extensive empirical comparisons [14] show that LSSVMs obtain good performance on various classification and regression problems. LSSVMs have been studied extensively [15–18].

Unfortunately, the plain LSSVM has two drawbacks. (1) Unlike the standard SVM, which employs a soft-margin loss function for classification and an ε-insensitive loss function for regression, LSSVM loses sparseness by using a quadratic loss function. (2) Although the linear system is in principle solvable [19], it is in practice intractable for a large dataset by classical techniques, since the computational complexity is usually of order O(l³) (l is the size of the training set), which severely limits the utility of LSSVMs in large scale applications.

Many papers in the literature have considered these two issues. As for fast algorithms for LSSVMs, Suykens et al. [20] presented an iterative algorithm based on the conjugate gradient algorithm, and Chu et al. [21] improved the conjugate gradient algorithm by solving one reduced linear system. Keerthi and Shevade [22] extended the well-known sequential minimal optimization (SMO) [23] algorithm of SVMs to the solution of LSSVMs. For problems with very large numbers of data points but small numbers of features, Chua [24] proposed a method that works with (and stores) matrices of size at most l × n (l is the size of the training set, n is the number of features), which extends the possible range of application of LSSVMs. However, the solutions produced by the above methods are still not sparse.

∗ Corresponding author. Tel.: +86 10 82680997. E-mail addresses: [email protected] (Y. Tian), [email protected] (X. Ju), [email protected] (Z. Qi), [email protected] (Y. Shi).

0898-1221/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.camwa.2013.06.028


As for sparse algorithms for LSSVMs, a range of methods are available and can be roughly divided into two major classes: pruning and fixed size. In the first class, a simple approach to introduce sparseness is based on the sorted support value spectrum (SVS): the network is pruned by gradually removing points from the training set [25,26]. A more sophisticated mechanism weights the support values and then selects the data point whose omission introduces the smallest error; this pruning method is claimed to outperform the standard scheme [27]. Hoegaerts et al. [28] suggested an improved selection of the pruning point based on a derived criterion. Zeng and Chen [29] proposed an SMO-based pruning method, in which SMO is introduced into the pruning process and, instead of determining the pruning points by their errors, the data points that introduce minimum changes to a dual objective function are omitted. Li et al. [30] selected the reduced classification training set based on yif(xi) (f(x) is the decision function and y the label) instead of the support value. The second class mainly considers fixed-size LSSVMs for quickly finding a sparse approximate solution of LSSVM, in which a reduced set of candidate support vectors is used in the primal space [13] or in kernel space [31–34]. However, there are still shortcomings in the existing sparse LSSVMs. The first class imposes sparseness by gradually omitting the least important data from the training set and re-estimating the LSSVM, which is time consuming. The second class assumes that the weight vector w can be represented as a weighted sum of a limited number (far less than the size of the training set) of basis vectors, which is a rough approximation and is not theoretically guaranteed.

In this paper, we propose a novel LSSVM, termed ε-LSSVM, for binary classification. ε-LSSVM introduces the ε-insensitive loss function instead of the quadratic loss function into LSSVM. (1) It is sparse, and the sparseness is controlled by the parameter ε. (2) By weighting a different sparseness parameter for each class, the unbalanced problem can be solved successfully; furthermore, we also propose a useful choice of the parameter ε. (3) It is actually a kind of ε-support vector regression (ε-SVR) [3–5]; the only difference is that it treats the binary classification problem as a special kind of regression problem. (4) Consequently, it can be implemented efficiently by SMO for large scale problems.

The paper is organized as follows. Section 2 briefly reviews the standard C-support vector machine for classification (C-SVC) and LSSVM. Section 3 proposes our ε-LSSVM, and a weighted ε-LSSVM is given in Section 4. Section 5 reports experimental results. Section 6 contains concluding remarks.

2. Background

In this section, we give a brief outline of C-SVC and LSSVMs.

2.1. C-SVC

Consider the binary classification problem with the training set

\[
T = \{(x_1, y_1), \ldots, (x_l, y_l)\} \in (\mathbb{R}^n \times \mathcal{Y})^l,
\tag{1}
\]

where xi ∈ R^n, yi ∈ Y = {1, −1}, i = 1, . . . , l. Standard C-SVC formulates the problem as a convex quadratic programming problem (QPP)

\[
\begin{aligned}
\min_{w,b,\xi}\quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\xi_i,\\
\text{s.t.}\quad & y_i((w\cdot x_i)+b) \ge 1-\xi_i,\quad i=1,\dots,l,\\
& \xi_i \ge 0,\quad i=1,\dots,l,
\end{aligned}
\tag{2}
\]

where ξ = (ξ1, . . . , ξl)⊤ and C > 0 is a penalty parameter. For this primal problem, C-SVC solves its Lagrangian dual problem

\[
\begin{aligned}
\min_{\alpha}\quad & \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j K(x_i,x_j) - \sum_{i=1}^{l}\alpha_i,\\
\text{s.t.}\quad & \sum_{i=1}^{l}y_i\alpha_i = 0,\\
& 0 \le \alpha_i \le C,\quad i=1,\dots,l,
\end{aligned}
\tag{3}
\]

where K(x, x′) is the kernel function. The dual (3) is also a convex QPP; from its solution the decision function is constructed.
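The dual (3) is exactly the problem solved by LIBSVM-style C-SVC implementations. The following is a minimal sketch, assuming scikit-learn is available; the toy data, parameter values and variable names are illustrative assumptions, not part of the paper.

```python
# Minimal sketch (assumption: scikit-learn installed): sklearn.svm.SVC solves
# the C-SVC dual (3) internally via an SMO-type method; toy data for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)),   # negative class
               rng.normal(+1.0, 1.0, (20, 2))])  # positive class
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(C=10.0, kernel="rbf", gamma=1.0)       # gamma plays the role of 1/sigma^2
clf.fit(X, y)
print(clf.n_support_)                            # number of support vectors per class
```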

2.2. LSSVM

For the given training set (1), the primal problem of standard LSSVM to be solved is

\[
\begin{aligned}
\min_{w,b,\eta}\quad & \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{l}\eta_i^2,\\
\text{s.t.}\quad & y_i((w\cdot x_i)+b) = 1-\eta_i,\quad i=1,\dots,l.
\end{aligned}
\tag{4}
\]


Fig. 1. Geometric interpretation of LSSVM: positive points represented by "+"s, negative points represented by "∗"s, positive proximal line (w · x) + b = 1 (down left line), negative proximal line (w · x) + b = −1 (top right line), separating line (w · x) + b = 0 (middle line).

The geometric interpretation of the above problem with x ∈ R² is shown in Fig. 1, where minimizing (1/2)∥w∥² realizes the maximum margin between the positive proximal straight line and the negative proximal straight line
\[
(w\cdot x)+b = 1 \quad\text{and}\quad (w\cdot x)+b = -1,
\tag{5}
\]
while minimizing Σ_{i=1}^{l} η_i² makes the straight lines (5) proximal to all positive inputs and negative inputs, respectively. Its dual problem is also a convex QPP

\[
\begin{aligned}
\min_{\alpha}\quad & \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\Big(K(x_i,x_j)+\frac{\delta_{ij}}{C}\Big) - \sum_{i=1}^{l}\alpha_i,\\
\text{s.t.}\quad & \sum_{i=1}^{l}\alpha_i y_i = 0,
\end{aligned}
\tag{6}
\]
where K(x, x′) is the kernel function and
\[
\delta_{ij} =
\begin{cases}
1, & i = j;\\
0, & i \ne j.
\end{cases}
\tag{7}
\]

For the choice of the kernel function K(x, x′), one has several possibilities: K(x, x′) = (x · x′) (linear kernel); K(x, x′) = ((x · x′) + 1)^d (polynomial kernel of degree d); K(x, x′) = exp(−∥x − x′∥²/σ²) (RBF kernel); K(x, x′) = tanh(κ(x · x′) + θ) (sigmoid kernel); etc. The solution of the above problem is given by the following set of linear equations

\[
\begin{pmatrix} 0 & -Y^\top \\ Y & \Omega + C^{-1}I \end{pmatrix}
\begin{pmatrix} b \\ \alpha \end{pmatrix}
=
\begin{pmatrix} 0 \\ e \end{pmatrix},
\tag{8}
\]
where Y = (y1, . . . , yl)⊤, Ω = (Ωij)l×l = (yiyjK(xi, xj))l×l, I is the identity matrix and e = (1, . . . , 1)⊤ ∈ R^l. The decision function is therefore
\[
f(x) = \operatorname{sgn}(g(x)) = \operatorname{sgn}\Big(\sum_{i=1}^{l}\alpha_i y_i K(x_i,x) + b\Big).
\tag{9}
\]
The support values αi are proportional to the errors at the data points, since
\[
\alpha_i = C\eta_i,\quad i = 1,\dots,l.
\tag{10}
\]
Clearly, the points located close to the two hyperplanes (w · x) + b = ±1 have the smallest support values, so in the least squares case one speaks of a support value spectrum rather than of support vectors as in standard C-SVC.
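Because the LSSVM solution is obtained from the linear system (8), it can be computed with a dense solver in O(l³) time, which is exactly the limitation discussed in the Introduction. The following is a minimal sketch, assuming numpy and a precomputed kernel matrix; the function name lssvm_fit is an illustrative assumption, not the authors' code.

```python
# Minimal sketch (assumptions: numpy available, kernel matrix precomputed):
# solve the LSSVM linear system (8) directly with a dense O(l^3) solver.
import numpy as np

def lssvm_fit(K, y, C):
    """K: l x l kernel matrix, y: labels in {+1, -1}, C: penalty parameter."""
    y = np.asarray(y, dtype=float)
    l = len(y)
    Omega = np.outer(y, y) * K                 # Omega_ij = y_i y_j K(x_i, x_j)
    A = np.zeros((l + 1, l + 1))
    A[0, 1:] = -y                              # first row: [0, -Y^T]
    A[1:, 0] = y                               # first column below: Y
    A[1:, 1:] = Omega + np.eye(l) / C          # Omega + C^{-1} I
    rhs = np.concatenate([[0.0], np.ones(l)])  # right-hand side (0, e)
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]                 # g(x) = sum_i alpha_i y_i K(x_i, x) + b
    return alpha, b
```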

3. ε-LSSVM

As the points located close to the two hyperplanes (w · x) + b = ±1 have the smallest support values, they contribute less to the decision function (9). Following the idea of the ε-insensitive loss function for the regression problem, the following optimization problem is constructed:


Fig. 2. Geometric interpretation of ε-LSSVM: positive proximal line (w · x) + b = 1 (down left thick line), negative proximal line (w · x) + b = −1 (top right thick line), positive ε-bounded lines (w · x) + b = 1 ± ε (down left dotted lines), negative ε-bounded lines (w · x) + b = −1 ± ε (top right dotted lines), separating line (w · x) + b = 0 (middle line).

\[
\begin{aligned}
\min_{w,b,\xi^{(*)}}\quad & \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{l}\big(\xi_i^2 + \xi_i^{*2}\big),\\
\text{s.t.}\quad & -1-\varepsilon-\xi_i^{*} \le (w\cdot x_i)+b \le -1+\varepsilon+\xi_i, \quad \text{for } y_i=-1,\\
& \phantom{-}1-\varepsilon-\xi_i^{*} \le (w\cdot x_i)+b \le 1+\varepsilon+\xi_i, \quad \text{for } y_i=1,\\
& \xi_i,\ \xi_i^{*} \ge 0,\quad i=1,\dots,l,
\end{aligned}
\tag{11}
\]

where ε > 0 is a prior parameter. We now discuss the primal problem (11) geometrically in R² (see Fig. 2). On the one hand, we hope that the positive class lies as much as possible in the ε-band between the bounded hyperplanes (w · x) + b = 1 + ε and (w · x) + b = 1 − ε, and that the negative class lies as much as possible in the ε-band between the hyperplanes (w · x) + b = −1 + ε and (w · x) + b = −1 − ε; here the errors ξi + ξi∗, i = 1, . . . , l, are measured by the ε-insensitive loss function. On the other hand, we still hope to maximize the margin between the two proximal hyperplanes (w · x) + b = 1 and (w · x) + b = −1. Based on these two considerations, problem (11) is established, and the structural risk minimization principle is implemented naturally.

For problem (11), the constraint ξi, ξi∗ ≥ 0, i = 1, . . . , l, is redundant: a negative value of ξi or ξi∗ cannot appear in a solution of the problem with this constraint removed, because resetting such a variable to zero keeps the corresponding constraint satisfied and gives a lower value of the objective function. Hence, problem (11) is equivalent to the following problem

\[
\begin{aligned}
\min_{w,b,\xi,\xi^{*}}\quad & \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{l}\big(\xi_i^2 + \xi_i^{*2}\big),\\
\text{s.t.}\quad & (w\cdot x_i)+b-y_i \le \varepsilon+\xi_i,\quad i=1,\dots,l,\\
& y_i-(w\cdot x_i)-b \le \varepsilon+\xi_i^{*},\quad i=1,\dots,l.
\end{aligned}
\tag{12}
\]

Interestingly but not surprisingly, problem (12) is in fact the ε-support vector regression machine with L2-loss (L2-SVR [35]) applied to the training set (1), where the regression target yi is taken as +1 for positive inputs and −1 for negative inputs.
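To make the classification-as-regression view concrete, the following minimal sketch, assuming scikit-learn, fits an ε-SVR on the ±1 labels and classifies by the sign of the regression output. Note that sklearn.svm.SVR uses the L1 ε-insensitive loss rather than the squared-slack loss of (12), so this is only an approximation of the formulation above; the helper name eps_lssvm_predict is an illustrative assumption.

```python
# Minimal sketch (assumptions: scikit-learn installed; SVR's L1 epsilon-insensitive
# loss stands in for the squared-slack loss of problem (12)).
import numpy as np
from sklearn.svm import SVR

def eps_lssvm_predict(X_train, y_train, X_test, C=10.0, eps=0.1, sigma=1.0):
    reg = SVR(kernel="rbf", gamma=1.0 / sigma ** 2, C=C, epsilon=eps)
    reg.fit(X_train, y_train)              # regression targets are the labels +1 / -1
    return np.sign(reg.predict(X_test))    # decide by the sign of g(x), cf. (24)
```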

Now we map the training set T by a mapping Φ(x) into a Hilbert space H. In order to obtain the solution of problem (12) in H, we need to derive its dual problem. Introducing the Lagrangian

\[
\begin{aligned}
L(w,b,\xi,\xi^{*},\alpha,\alpha^{*}) = {} & \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{l}\big(\xi_i^2+\xi_i^{*2}\big)
+ \sum_{i=1}^{l}\alpha_i\big((w\cdot\Phi(x_i))+b-y_i-\varepsilon-\xi_i\big)\\
& + \sum_{i=1}^{l}\alpha_i^{*}\big(y_i-(w\cdot\Phi(x_i))-b-\varepsilon-\xi_i^{*}\big),
\end{aligned}
\tag{13}
\]


where α and α∗ are the Lagrange multiplier vectors, the dual problem is obtained:

\[
\begin{aligned}
\min_{\alpha^{(*)}}\quad & \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\alpha_i^{*}-\alpha_i)(\alpha_j^{*}-\alpha_j)K(x_i,x_j)
+ \frac{1}{2C}\sum_{i=1}^{l}\big(\alpha_i^2+\alpha_i^{*2}\big)
+ \varepsilon\sum_{i=1}^{l}(\alpha_i^{*}+\alpha_i)
- \sum_{i=1}^{l}y_i(\alpha_i^{*}-\alpha_i),\\
\text{s.t.}\quad & \sum_{i=1}^{l}(\alpha_i-\alpha_i^{*})=0,\\
& \alpha_i,\ \alpha_i^{*}\ge 0,\quad i=1,\dots,l,
\end{aligned}
\tag{14}
\]

where K(x, x′) = (Φ(x) · Φ(x′)) is the kernel function. For this dual problem, we have the following conclusions.

Theorem 3.1. If (α, α∗) is a solution of problem (14), then αiαi∗ = 0 for i = 1, . . . , l.

Proof. If αi > 0, then from the KKT conditions
\[
\alpha_i\big((w\cdot x_i)+b-y_i-\varepsilon-\xi_i\big)=0,
\tag{15}
\]
\[
C\xi_i-\alpha_i=0,
\tag{16}
\]
we have
\[
(w\cdot x_i)+b-y_i-\varepsilon=\xi_i>0.
\tag{17}
\]
Therefore yi − (w · xi) − b − ε − ξi∗ = −ξi − 2ε − ξi∗ < 0, and the KKT condition
\[
\alpha_i^{*}\big(y_i-(w\cdot x_i)-b-\varepsilon-\xi_i^{*}\big)=0
\tag{18}
\]
then forces αi∗ = 0. The same argument applies in the other direction, i.e., αi∗ > 0 implies αi = 0.

Theorem 3.2. Problem (6) is equivalent to problem (14) with ε = 0.

Proof. Let
\[
y_i\beta_i = \alpha_i^{*}-\alpha_i,\quad i=1,\dots,l;
\tag{19}
\]
since yi = 1 or −1, this gives
\[
\beta_i = y_i(\alpha_i^{*}-\alpha_i),\quad i=1,\dots,l.
\tag{20}
\]

Setting ε = 0, problem (14) degenerates to the following problem

\[
\begin{aligned}
\min_{\beta}\quad & \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\beta_i\beta_j y_i y_j\tilde{K}(x_i,x_j) - \sum_{i=1}^{l}\beta_i,\\
\text{s.t.}\quad & \sum_{i=1}^{l}y_i\beta_i = 0,
\end{aligned}
\tag{21}
\]
which is the same as problem (6), where \(\tilde{K}(x_i,x_j) = K(x_i,x_j) + \delta_{ij}/C\), i, j = 1, . . . , l.

Now we are in a position to state that LSSVM for binary classification can be implemented by L2-SVR on the same training set with ε = 0. If we want to endow LSSVM with the valuable sparseness, we only need to apply the standard L2-SVR to the classification problem to obtain support vectors; in this way the ε-LSSVM is established.

Algorithm 3.3 (ε-LSSVM).
(1) Input the training set (1);
(2) Choose an appropriate kernel function K(x, x′) and parameters C > 0 and ε > 0;
(3) Construct and solve the convex QPP (14), obtaining a solution (α, α∗);
(4) Compute b: if some αj > 0 is chosen, compute
\[
b = y_j - \sum_{i=1}^{l}(\alpha_i^{*}-\alpha_i)K(x_i,x_j) + \varepsilon;
\tag{22}
\]
if some αk∗ > 0 is chosen, compute
\[
b = y_k - \sum_{i=1}^{l}(\alpha_i^{*}-\alpha_i)K(x_i,x_k) - \varepsilon.
\tag{23}
\]


(5) Construct the decision function
\[
y = \operatorname{sgn}(g(x)) = \operatorname{sgn}\Big(\sum_{i=1}^{l}(\alpha_i^{*}-\alpha_i)K(x_i,x) + b\Big).
\tag{24}
\]

Obviously, solving problem (14) can be implemented efficiently by LIBSVM [36], since it is actually a variation of ε-SVR. In fact, problem (14) can be concisely formulated as
\[
\begin{aligned}
\min_{\beta}\quad & \frac{1}{2}\beta^{\top}Q\beta + p^{\top}\beta,\\
\text{s.t.}\quad & y^{\top}\beta = 0,\quad \beta \ge 0,
\end{aligned}
\tag{25}
\]
where Q ∈ R^{2l×2l} and β, p, y ∈ R^{2l}. It is shown in [36] that for such a problem the SMO-type decomposition method [37] implemented in LIBSVM has complexity (1) #Iterations × O(l) if most columns of Q are cached throughout the iterations, and (2) #Iterations × O(nl) if the columns of Q are not cached and each kernel evaluation costs O(n); [36] also points out that there is no theoretical result yet on LIBSVM's number of iterations. Empirically, the number of iterations may grow more than linearly with the number of training points.
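For illustration, the following is a minimal sketch, assuming numpy and cvxopt are available, that builds the dual (14) in the concise form (25) and solves it with a generic QP solver instead of the SMO method used by LIBSVM; all function and variable names (rbf_kernel_matrix, eps_lssvm_dual, and so on) are assumptions made for this sketch, not the authors' implementation.

```python
# Minimal sketch (assumptions: numpy + cvxopt; a generic QP solver replaces SMO).
import numpy as np
from cvxopt import matrix, solvers

def rbf_kernel_matrix(X, Z, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / sigma^2), the RBF kernel of Section 2.2
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def eps_lssvm_dual(X, y, C=10.0, eps=0.1, sigma=1.0):
    y = np.asarray(y, dtype=float)
    l = len(y)
    K = rbf_kernel_matrix(X, X, sigma)
    Kt = K + np.eye(l) / C                       # K + delta_ij / C
    # beta = (alpha, alpha*) in R^{2l}: objective (1/2) beta'Q beta + p'beta, cf. (25)
    Q = np.block([[Kt, -K], [-K, Kt]])
    p = np.concatenate([eps + y, eps - y])
    G, h = -np.eye(2 * l), np.zeros(2 * l)       # beta >= 0
    A = np.hstack([np.ones(l), -np.ones(l)])[None, :]  # sum_i (alpha_i - alpha*_i) = 0
    solvers.options["show_progress"] = False
    sol = solvers.qp(matrix(Q), matrix(p), matrix(G), matrix(h),
                     matrix(A), matrix(0.0))
    beta = np.array(sol["x"]).ravel()
    alpha, alpha_star = beta[:l], beta[l:]
    u = alpha_star - alpha                       # coefficients of g(x) in (24)
    # b as in (22)/(23); using Kt here accounts for the xi_j = alpha_j / C slack term
    j = int(np.argmax(alpha + alpha_star))
    if alpha[j] >= alpha_star[j]:
        b = y[j] - float(u @ Kt[:, j]) + eps     # case (22): alpha_j > 0
    else:
        b = y[j] - float(u @ Kt[:, j]) - eps     # case (23): alpha*_j > 0
    return u, b

def eps_lssvm_decision(X_train, u, b, X_new, sigma=1.0):
    # decision function (24): sgn( sum_i (alpha*_i - alpha_i) K(x_i, x) + b )
    return np.sign(rbf_kernel_matrix(X_new, X_train, sigma) @ u + b)
```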

4. Weighted ε-LSSVM

For the unbalanced classification problem, in contrast with the approach that weights C for each class (weighted C-LSSVM) [26],

\[
\begin{aligned}
\min_{w,b,\eta}\quad & \frac{1}{2}\|w\|^2 + \frac{C^{+}}{2}\sum_{y_i=1}\eta_i^2 + \frac{C^{-}}{2}\sum_{y_i=-1}\eta_i^2,\\
\text{s.t.}\quad & y_i((w\cdot x_i)+b) = 1-\eta_i,\quad i=1,\dots,l,
\end{aligned}
\tag{26}
\]

our weighted ε-LSSVM applies a different sparseness parameter ε to each class, and the primal problem is constructed as

\[
\begin{aligned}
\min_{w,b,\xi^{(*)}}\quad & \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{l}\big(\xi_i^2+\xi_i^{*2}\big),\\
\text{s.t.}\quad & -1-\varepsilon^{-}-\xi_i^{*} \le (w\cdot x_i)+b \le -1+\varepsilon^{-}+\xi_i, \quad \text{for } y_i=-1,\\
& \phantom{-}1-\varepsilon^{+}-\xi_i^{*} \le (w\cdot x_i)+b \le 1+\varepsilon^{+}+\xi_i, \quad \text{for } y_i=1,
\end{aligned}
\tag{27}
\]

and obviously its dual problem is
\[
\begin{aligned}
\min_{\alpha^{(*)}}\quad & \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\alpha_i^{*}-\alpha_i)(\alpha_j^{*}-\alpha_j)\tilde{K}(x_i,x_j)
+ \varepsilon^{-}\sum_{y_i=-1}(\alpha_i^{*}+\alpha_i)
+ \varepsilon^{+}\sum_{y_i=1}(\alpha_i^{*}+\alpha_i)
- \sum_{i=1}^{l}y_i(\alpha_i^{*}-\alpha_i),\\
\text{s.t.}\quad & \sum_{i=1}^{l}(\alpha_i-\alpha_i^{*})=0,\\
& \alpha_i,\ \alpha_i^{*}\ge 0,\quad i=1,\dots,l,
\end{aligned}
\tag{28}
\]
where \(\tilde{K}(x_i,x_j) = K(x_i,x_j) + \delta_{ij}/C\), i, j = 1, . . . , l.

If the positive class is smaller than the negative class, a smaller ε+ than ε− should be chosen, so that more negative points than positive points become non-support vectors and the problem is balanced. A recommended range for ε− and ε+ is (0, 1), and the relation between ε− and ε+ satisfies
\[
l^{+}(1-\varepsilon^{+}) \cong l^{-}(1-\varepsilon^{-}),
\tag{29}
\]
where l+ and l− are the numbers of positive and negative points, respectively. Eq. (29) means that the number of positive points outside the ε+-band approximately equals the number of negative points outside the ε−-band; equivalently, the number of positive SVs approximately equals the number of negative SVs.
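As a small illustration of the heuristic (29), the following sketch (with a hypothetical helper name, assuming the class counts l+ and l− are known) solves (29) for ε+ given ε−.

```python
# Minimal sketch: pick eps_plus from eps_minus via the balance heuristic (29),
# l_plus * (1 - eps_plus) ~= l_minus * (1 - eps_minus).
def eps_plus_from_eps_minus(l_plus, l_minus, eps_minus):
    eps_plus = 1.0 - l_minus * (1.0 - eps_minus) / l_plus
    return max(eps_plus, 0.0)          # keep it in the recommended range (0, 1)

# Example with the artificial dataset below: l+ = 15, l- = 85, eps_minus = 0.84
# gives eps_plus ~ 0.09, close to the 0.1 used in Fig. 5.
print(eps_plus_from_eps_minus(15, 85, 0.84))
```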

In order to illustrate the proposed weighted ε-LSSVM, we generated a small unbalanced artificial two-dimensional two-class dataset [38]. The dataset consists of 100 points, 15 of which are positive and 85 of which are negative. When the problem is solved using the plain LSSVM (4), the influence of the 85 negative points prevails over that of the much smaller set of positive data points. As a result, 5 of the 15 positive points are misclassified. The total training set correctness is 95%, with only 66.7% correctness for the smaller positive class and 100% correctness for the larger negative class. The resulting separating plane is shown in Fig. 3. When a weighted C-LSSVM is used with C+ = (85/15) × C−, we can see an improvement over the plain


Fig. 3. An unbalanced dataset consisting of 100 points, 15 of which are positive (represented by "+"s) and 85 of which are negative (represented by "∗"s). The separating plane (middle line) is obtained by using the plain LSSVM (4). The positive class is mostly ignored by the solution. The total training set correctness is 95%, with 66.7% correctness for the positive class and 100% correctness for the negative class.

Fig. 4. Linear classifier improvement by weighted C-LSSVM, demonstrated on the same dataset as Fig. 3. The separating plane (middle line) is obtained by using a weighted C-LSSVM. Even though the positive class is correctly classified in its entirety, the overall performance is still rather unsatisfactory due to the significant difference in the distribution of points in each of the classes. Total training set correctness is 89%.

LSSVM, in the sense that a separating plane is obtained that correctly classifies all the points in the positive class. However, due to the significant difference in the cardinality of the two classes and the distribution of their points, a subset of 9 points in the negative class is now misclassified. The total training set correctness is 89%, with 100% correctness for the positive class and 87.06% correctness for the negative class. The resulting separating plane is shown in Fig. 4. If the weighted ε-LSSVM is now used with ε+ = 0.1 and ε− = 0.84, which approximately satisfy (29), we obtain a separating plane that misclassifies only one point. The total training set correctness is 98%. The resulting separating plane is shown in Fig. 5.

5. Experimental results

In this section, some experiments are carried out to demonstrate the performance of our ε-LSSVM. All methods are implemented in MATLAB 2010 on a PC with an Intel Core i5 processor and 2 GB RAM. C-SVC and ε-LSSVM are solved by the quadratic programming routine of the MATLAB optimization toolbox. LSSVM is the special case of our ε-LSSVM with ε = 0. The "Accuracy" used to evaluate the methods is defined as Accuracy = (TP + TN)/(TP + FP + TN + FN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. The classification accuracy of each method is measured by the standard tenfold cross-validation methodology.
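A minimal sketch of this evaluation protocol, assuming scikit-learn and reusing the hypothetical eps_lssvm_predict helper from Section 3, is given below; it is illustrative only and not the authors' MATLAB code.

```python
# Minimal sketch (assumptions: scikit-learn; predict_fn(X_train, y_train, X_test)
# returns +1/-1 predictions, e.g. the eps_lssvm_predict sketch from Section 3).
import numpy as np
from sklearn.model_selection import StratifiedKFold

def tenfold_accuracy(X, y, predict_fn):
    accs = []
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for train_idx, test_idx in cv.split(X, y):
        y_pred = predict_fn(X[train_idx], y[train_idx], X[test_idx])
        accs.append(np.mean(y_pred == y[test_idx]))   # (TP+TN)/(TP+FP+TN+FN)
    return np.mean(accs), np.std(accs)
```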

First, we apply ε-LSSVM to the iris dataset [39], which is an established dataset used for demonstrating the performance of classification algorithms. It contains three classes (Setosa, Versicolor, Virginica) and four attributes for each iris, and the goal is to classify the class of an iris based on these four attributes. Here we restrict ourselves to the two classes (Versicolor, Virginica) and to the two features that contain the most information about the class, namely the petal length and the petal width. The distribution of the data is illustrated in Fig. 6, where "+"s and "∗"s represent the classes Versicolor and Virginica, respectively.

Linear and RBF kernels K(x, x′) = exp(−∥x − x′∥²/σ²) are used, in which the parameter σ is fixed to 1.0; we set C = 10, and ε varies over {0, 0.1, 0.2, 0.3, 0.4, 0.5}. Experimental results are shown in Figs. 6 and 7, where the two proximal lines g(x) = −1 and g(x) = +1, the four ε-bounded lines g(x) = −1 ± ε and g(x) = 1 ± ε, and the separating line g(x) = 0 are depicted, and the support vectors are marked for each value of ε.


Fig. 5. Very significant linear classifier improvement as a consequence of the weighted ε-LSSVM, demonstrated on the same dataset as Figs. 3 and 4. The total training set correctness is now 98%, compared to 95% for the plain LSSVM and 89% for the weighted C-LSSVM.

Fig. 6. Linear ε-LSSVM for ε = 0, 0.1, 0.2, 0.3, 0.4, 0.5 (panels (a)–(f)): positive proximal line g(x) = 1 (down left thick line), negative proximal line g(x) = −1 (top right thick line), positive ε-bounded lines g(x) = 1 ± ε (down left dotted lines), negative ε-bounded lines g(x) = −1 ± ε (top right dotted lines), separating line g(x) = 0 (middle line), support vectors (marked). With the increase of ε, the percentage of SVs decreases.

Fig. 8 records the varying percentage of support vectors: with increasing ε, the number of support vectors decreases, and therefore the sparseness increases, in both the linear and nonlinear cases.

We also apply the weighted ε-LSSVM to this classification problem, where half of the training points are randomly selected from the "∗" class (negative class). The sparseness parameter ε− takes values in {0, 0.05, 0.1, 0.15, 0.2, 0.25} and ε+ is computed by (29). Experimental results are shown in Fig. 9 for the linear kernel and in Fig. 10 for the RBF kernel, where the corresponding lines are depicted and the support vectors are marked for each pair (ε−, ε+). Fig. 11 records the varying percentage of positive support vectors, negative support vectors and total support vectors for the linear and RBF cases separately.


Fig. 7. Kernel ε-LSSVM for ε = 0, 0.1, 0.2, 0.3, 0.4, 0.5 (panels (a)–(f)): positive proximal line g(x) = 1 (down left line), negative proximal line g(x) = −1 (top right line), positive ε-bounded lines g(x) = 1 ± ε (dotted lines around the positive proximal line), negative ε-bounded lines g(x) = −1 ± ε (dotted lines around the negative proximal line), separating line g(x) = 0 (thick line), support vectors (marked). With the increase of ε, the percentage of SVs decreases.

Fig. 8. Percentage of SVs versus ε (ε ranging from 0 to 0.5): sparseness increases with increasing ε. Linear case (top broken line), nonlinear case (bottom broken line).

We can also see that, with increasing ε− and ε+, the number of support vectors decreases, and therefore the sparseness increases, in both cases.

Second, in order to compare our ε-LSSVM and weighted ε-LSSVM with LSSVM and C-SVC, we choose several datasets from the UCI machine learning repository [39]. Table 1 lists the classification accuracy and the percentage of support vectors. For all the methods, the RBF kernel K(x, x′) = exp(−∥x − x′∥²/σ²) is used; the optimal parameters C and σ are obtained by searching in the range 2⁻⁸ to 2⁸, and the optimal parameter ε in ε-LSSVM is obtained in the range [0.1, 1] with step 0.1, by using a tuning set comprising 30% of the dataset. Once the parameters are selected, the tuning set is returned to learn the final classifier.


Fig. 9. Weighted ε-LSSVM with linear kernel for the unbalanced dataset, for ε− = 0, 0.05, 0.1, 0.15, 0.2, 0.25 (panels (a)–(f)); (29) is used to compute ε+ for a given ε−. Improved, more balanced results are obtained, and as ε− and ε+ increase, the percentage of SVs in each class decreases.

Table 1. Tenfold testing percentage accuracy of ε-LSSVM. Each cell reports Accuracy % / SVs % ("\" for LSSVM, whose SV percentage is always 100%).

| Dataset (size) | LSSVM | C-SVC | ε-LSSVM | Weighted ε-LSSVM |
|---|---|---|---|---|
| Hepatitis (155 × 19) | 81.63 ± 5.34 / \ | 80.65 ± 5.32 / 35.48 ± 2.23 | 81.45 ± 3.17 / 33.49 ± 4.06 | 81.95 ± 2.66 / 31.63 ± 3.87 |
| BUPA liver (345 × 6) | 67.84 ± 5.12 / \ | 70.43 ± 4.27 / 79.13 ± 3.08 | 69.21 ± 4.73 / 76.49 ± 2.16 | 69.80 ± 3.59 / 75.04 ± 3.18 |
| Heart-Statlog (270 × 14) | 83.29 ± 3.91 / \ | 83.70 ± 6.18 / 43.33 ± 3.01 | 84.36 ± 3.77 / 41.15 ± 3.22 | 84.15 ± 3.41 / 39.33 ± 2.57 |
| Votes (435 × 16) | 90.72 ± 3.65 / \ | 93.33 ± 3.85 / 40.46 ± 4.52 | 90.63 ± 2.76 / 38.31 ± 3.45 | 92.81 ± 3.14 / 37.19 ± 2.93 |
| WPBC (198 × 34) | 74.59 ± 3.38 / \ | 76.28 ± 4.68 / 51.55 ± 5.17 | 76.77 ± 2.94 / 48.36 ± 3.77 | 76.82 ± 3.53 / 50.64 ± 4.06 |
| Sonar (208 × 60) | 83.11 ± 5.12 / \ | 85.10 ± 5.04 / 41.83 ± 3.80 | 84.11 ± 3.81 / 42.04 ± 3.13 | 84.87 ± 3.75 / 41.17 ± 2.86 |
| Ionosphere (351 × 34) | 91.02 ± 4.79 / \ | 94.59 ± 5.53 / 25.07 ± 3.24 | 91.96 ± 3.72 / 22.54 ± 3.87 | 93.09 ± 2.47 / 23.17 ± 3.34 |
| Australian (690 × 14) | 85.09 ± 5.06 / \ | 85.50 ± 4.17 / 41.01 ± 2.91 | 85.23 ± 4.27 / 39.28 ± 4.18 | 85.37 ± 3.44 / 38.71 ± 3.26 |
| Pima-Indian (768 × 8) | 76.08 ± 5.72 / \ | 77.60 ± 3.76 / 53.26 ± 3.27 | 76.52 ± 4.33 / 50.12 ± 3.78 | 77.91 ± 4.27 / 50.55 ± 3.34 |
| CMC (1473 × 9) | 64.12 ± 2.78 / \ | 64.46 ± 3.26 / 69.67 ± 4.35 | 65.18 ± 2.69 / 67.31 ± 3.86 | 65.75 ± 3.18 / 66.18 ± 3.25 |


Fig. 10. Weighted ε-LSSVM with RBF kernel for the unbalanced dataset, for ε− = 0, 0.05, 0.1, 0.15, 0.2, 0.25 (panels (a)–(f)); (29) is used to compute ε+ for a given ε−. Improved, more balanced results are obtained, and as ε− and ε+ increase, the percentage of SVs in each class decreases.

Fig. 11. Percentage of positive, negative and total SVs versus ε− (ε− ranging from 0 to 0.25): the sparseness of each class increases with increasing ε− and ε+. Linear and nonlinear cases are shown as separate broken lines.

For the unbalanced datasets, we apply the weighted ε-LSSVM and set the smaller class to be negative; ε− is chosen in [0.1, 1] with step 0.1, and Eq. (29) is used to compute ε+ approximately.

From Table 1, it is easy to see that the accuracy and the sparseness of our ε-LSSVM and weighted ε-LSSVM are better than those of LSSVM on all datasets, since LSSVM is a special case of these two models. At the same time, the sparseness of ε-LSSVM and weighted ε-LSSVM is better than that of C-SVC on most datasets, while the accuracy is almost the same. Furthermore, the weighted ε-LSSVM performs better than ε-LSSVM on most datasets, as expected, since ε-LSSVM is the special case of weighted ε-LSSVM with ε− = ε+. For example, on CMC the accuracy of our ε-LSSVM and weighted ε-LSSVM is 65.18% and 65.75% respectively, while the accuracy of LSSVM and C-SVC is 64.12% and 64.46% respectively. The percentage of SVs of LSSVM is obviously 100%, while that of the weighted ε-LSSVM is 66.18%, better than the 69.67% of C-SVC.


6. Conclusion

In this paper, we have proposed a novel LSSVM, termed ε-LSSVM, for binary classification. By introducing the ε-insensitive loss function instead of the quadratic loss function into LSSVM, ε-LSSVM has several advantages over the plain LSSVM. (1) It is sparse, and the sparseness is controlled by the parameter ε. (2) By weighting a different sparseness parameter ε for each class, the unbalanced problem can be solved successfully. (3) It is actually a kind of ε-support vector regression (ε-SVR); the only difference is that it treats the binary classification problem as a special kind of regression problem. (4) It can be implemented efficiently by SMO for large scale problems.

The parameters ε control the sparseness and can be chosen flexibly, thereby improving the plain LSSVM in many ways. A useful choice of ε− and ε+ for the two classes was also given in the weighted ε-LSSVM algorithm. Computational comparisons between these two ε-LSSVMs and other methods, including C-SVC and LSSVM, have been made on several datasets, indicating the effectiveness of our method in sparseness, balance performance and classification accuracy. Extensions of ε-LSSVM to multi-class classification, robust classification and multi-instance classification are also interesting and are under our consideration.

Acknowledgments

This work has been partially supported by grants from the National Natural Science Foundation of China (Nos. 11271361 and 70921061), the CAS/SAFEA International Partnership Program for Creative Research Teams, the Major International (Regional) Joint Research Project (No. 71110107026), and the President Fund of GUCAS.

References

[1] C. Cortes, V.N. Vapnik, Support-vector networks, Machine Learning 20 (3) (1995) 273–297.
[2] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1996.
[3] V.N. Vapnik, Statistical Learning Theory, John Wiley and Sons, New York, 1998.
[4] N.Y. Deng, Y.J. Tian, Support Vector Machines: Theory, Algorithms and Extensions, Science Press, Beijing, 2009.
[5] N.Y. Deng, Y.J. Tian, C.H. Zhang, Support Vector Machines: Optimization Based Theory, Algorithms and Extensions, Chapman and Hall/CRC Press, 2012.
[6] M.M. Adankon, M. Cheriet, Model selection for the LS-SVM application to handwriting recognition, Pattern Recognition 42 (12) (2009) 3264–3270.
[7] M.B. Karsten, Kernel methods in bioinformatics, in: Handbook of Statistical Bioinformatics, Part 3, 2011, pp. 317–334.
[8] K.J. Kim, Financial time series forecasting using support vector machines, Neurocomputing 55 (1–2) (2003) 307–319.
[9] G. Schweikert, A. Zien, G. Zeller, J. Behr, C. Dieterich, C.S. Ong, P. Philips, F. De Bona, L. Hartmann, A. Bohlen, et al., mGene: accurate SVM-based gene finding with an application to nematode genomes, Genome Research 19 (2009) 2133–2143.
[10] D. Anguita, A. Boni, Improved neural network for SVM learning, IEEE Transactions on Neural Networks 13 (5) (2002) 1243–1244.
[11] L.J. Cao, F.E.H. Tay, Support vector machine with adaptive parameters in financial time series forecasting, IEEE Transactions on Neural Networks 14 (6) (2003) 1506–1518.
[12] J.A.K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Processing Letters 9 (3) (1999) 293–300.
[13] J.A.K. Suykens, V.G. Tony, D.B. Jos, D.M. Bart, V. Joos, Least Squares Support Vector Machines, World Scientific, 2002.
[14] T. Van Gestel, J. Suykens, B. Baesens, S. Viaene, J. Vanthienen, G. Dedene, B. De Moor, J. Vandewalle, Benchmarking least squares support vector machine classifiers, Machine Learning 54 (1) (2004) 5–32.
[15] M.M. Adankon, M. Cheriet, A. Biem, Semisupervised learning using Bayesian interpretation: application to LS-SVM, IEEE Transactions on Neural Networks 22 (4) (2011) 513–524.
[16] L.F. Bo, L.C. Jiao, L. Wang, Working set selection using functional gain for LS-SVM, IEEE Transactions on Neural Networks 18 (5) (2007) 1541–1544.
[17] K. Pelckmans, J.A.K. Suykens, B. De Moor, A convex approach to validation-based learning of the regularization constant, IEEE Transactions on Neural Networks 18 (3) (2007) 917–920.
[18] K. De Brabanter, J. De Brabanter, J.A.K. Suykens, B. De Moor, Optimized fixed-size kernel methods for large scale data sets, Computational Statistics & Data Analysis 54 (6) (2010) 1484–1504.
[19] L.V. Ferreira, E. Kaszkurewicz, A. Bhaya, Solving systems of linear equations via gradient systems with discontinuous righthand sides: application to LS-SVM, IEEE Transactions on Neural Networks 16 (2) (2005) 501–505.
[20] J.A.K. Suykens, L. Lukas, P. Van Dooren, B. De Moor, J. Vandewalle, Least squares support vector machine classifiers: a large scale algorithm, in: Proc. European Conference on Circuit Theory and Design (ECCTD-99), Stresa, Italy, Sep. 1999, pp. 839–842.
[21] W. Chu, C.J. Ong, S.S. Keerthy, An improved conjugate gradient scheme to the solution of least squares SVM, IEEE Transactions on Neural Networks 16 (2) (2005) 498–501.
[22] S.S. Keerthi, S.K. Shevade, SMO algorithm for least-squares SVM formulations, Neural Computation 15 (2) (2003) 487–507.
[23] J. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 2000.
[24] K.S. Chua, Efficient computations for large least square support vector machine classifiers, Pattern Recognition Letters 24 (1–3) (2003) 75–80.
[25] J.A.K. Suykens, L. Lukas, J. Vandewalle, Sparse approximation using least squares support vector machines, in: Proc. 2000 IEEE International Symposium on Circuits and Systems (ISCAS), Geneva, Switzerland, 2000, pp. 757–760.
[26] J.A.K. Suykens, J.D. Brabanter, L. Lukas, J. Vandewalle, Weighted least squares support vector machines: robustness and sparse approximation, Neurocomputing 48 (1–4) (2002) 85–105.
[27] B.J. de Kruif, T.J.A. de Vries, Pruning error minimization in least squares support vector machines, IEEE Transactions on Neural Networks 14 (3) (2004) 696–702.
[28] L. Hoegaerts, J.A.K. Suykens, J. Vandewalle, B. De Moor, A comparison of pruning algorithms for sparse least squares support vector machines, in: Lecture Notes in Computer Science, vol. 3316, 2004, pp. 1247–1253.
[29] X.Y. Zeng, X.W. Chen, SMO-based pruning methods for sparse least squares support vector machines, IEEE Transactions on Neural Networks 16 (6) (2005) 1541–1546.
[30] Y.G. Li, C. Lin, W.D. Zhang, Improved sparse least-squares support vector machine classifiers, Neurocomputing 69 (13–15) (2006) 1655–1658.
[31] L. Hoegaerts, J. Suykens, J. Vandewalle, B. De Moor, Primal space sparse kernel partial least squares regression for large scale problems, in: Proc. IEEE Int. Joint Conf. Neural Networks, 2004, pp. 561–566.
[32] G.C. Cawley, N.L.C. Talbot, Improved sparse least-squares support vector machines, Neurocomputing 48 (1–4) (2002) 1025–1031.
[33] G.C. Cawley, N.L.C. Talbot, Fast exact leave-one-out cross-validation of sparse least-squares support vector machines, Neural Networks 17 (10) (2004) 1467–1475.
[34] L.C. Jiao, L.F. Bo, L. Wang, Fast sparse approximation for least squares support vector machine, IEEE Transactions on Neural Networks 18 (3) (2007) 685–697.
[35] M.W. Chang, C.J. Lin, Leave-one-out bounds for support vector regression model selection, Neural Computation 17 (5) (2005) 1188–1222.
[36] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (3) (2011) 27:1–27:27.
[37] R.E. Fan, P.H. Chen, C.J. Lin, Working set selection using second order information for training SVM, Journal of Machine Learning Research 6 (2005) 1889–1918. URL: http://www.csie.ntu.edu.tw/cjlin/papers/quadworkset.pdf.
[38] G.M. Fung, O.L. Mangasarian, Multicategory proximal support vector machine classifiers, Machine Learning 59 (1–2) (2005) 77–97.
[39] C.L. Blake, C.J. Merz, UCI repository for machine learning databases, Dept. Inf. Comput. Sci., Univ. California, Irvine [online]. Available: http://www.ics.uci.edu/mlearn/MLRepository.html.