Research on SVM and FLDA in Classification with Comparative Experiments

Yegan Qian, Gang Xiong, Yanjie Yao

Proceedings of the 2012 9th IEEE International Conference on Networking, Sensing and Control (ICNSC), Beijing, China, April 11-14, 2012.

This work was supported in part by NSFC 70890084, 60921061, 90920305; CAS 2F09N05, 2F09N06, 2F10E08, 2F10E10; 4T05Y03.

Yegan Qian is with Anhui Radio TV Station and Hefei Hanteng Info. Tech Company, Hefei, Anhui 230022, China (phone: 0551-341-5524; fax: 0551-342-2353; e-mail: [email protected]). His research interests include image processing and pattern recognition, video mining, and software engineering.

Gang Xiong and Yanjie Yao are with the State Key Laboratory of Management and Control for Complex Systems, Beijing Engineering Research Center of Intelligent Systems and Technology, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]; [email protected]).

Abstract— This paper discusses two important classification techniques, Fisher's linear discriminant analysis (FLDA) and the Support Vector Machine (SVM). We first give a theoretical discussion, and then implement FLDA and SVM on several two-class and multiclass datasets; a comparative experimental analysis of the two techniques aims at exploring and assessing the performance of FLDA and SVM classifiers. To support this analysis, the two classification techniques are compared on training and testing datasets of different sizes, and performance indicators such as classification accuracy are used to report the experimental results in a detailed and accurate way. The results obtained on the different datasets show that FLDA and SVM are both valid and effective approaches for pattern classification, and reveal how their performance and problems differ with dataset size. The paper also employs a non-traditional method to generate the training and testing datasets, and draws detailed pros and cons from the experimental results.

Key Words: Pattern classification, multiclass, Fisher's Linear Discriminant Analysis (FLDA), Support Vector Machine (SVM)

I. INTRODUCTION

The ideas behind the two classification techniques, Fisher's linear discriminant analysis (FLDA) and the Support Vector Machine (SVM), are very different [1]-[4]. FLDA reduces the pattern space from a high dimension to one dimension; in contrast, SVM maps the pattern space from a low dimension to a high dimension. After the transformation, an input pattern is projected onto a value in a one-dimensional space with FLDA, and onto a point in a high-dimensional output space with SVM.

The datasets we applied the two classification techniques to are generated by the following approach:

We simulate the tosses of nine coins, each with a given bias for the head outcome, and fixed-length sequences of outcomes are treated as measurements. The simulation of the nine coins proceeds as follows:


For nine coins with bias θ_k = k/10, k = 1, 2, ..., 9, and dataset sizes N = 100, 300, 600, 900, 10000, 20000:

For a coin with bias θ_k:
    if rand < θ_k
        x_j = 1   (heads)
    else
        x_j = 0   (tails)
    end

The MATLAB code used for generating the datasets is as follows:

for i = 1:N
    k = ceil(9 * rand);    % pick one of the nine coins at random
    switch k
        case 1
            dataset{i} = randsrc(1, 100, [1 0; 0.1 0.9]);
        case 2
            dataset{i} = randsrc(1, 100, [1 0; 0.2 0.8]);
        case 3
            dataset{i} = randsrc(1, 100, [1 0; 0.3 0.7]);
        case 4
            dataset{i} = randsrc(1, 100, [1 0; 0.4 0.6]);
        case 5
            dataset{i} = randsrc(1, 100, [1 0; 0.5 0.5]);
        case 6
            dataset{i} = randsrc(1, 100, [1 0; 0.6 0.4]);
        case 7
            dataset{i} = randsrc(1, 100, [1 0; 0.7 0.3]);
        case 8
            dataset{i} = randsrc(1, 100, [1 0; 0.8 0.2]);
        case 9
            dataset{i} = randsrc(1, 100, [1 0; 0.9 0.1]);
    end
end

In this project, we present a theoretical discussion and a careful experimental analysis that aim at: 1) assessing the properties of SVM classifiers and FLDA classifiers on the datasets generated above, and 2) comparing the two classification techniques, SVM and FLDA, using four training datasets of different sizes. For multiclass datasets, we use the one-against-one strategy to generate FLDA classifiers and SVM classifiers.


II. FLDA CLASSIFICATION APPROACH

A. FLDA Mathematical Formulation

The idea behind FLDA is to use a linear projection of the data onto a one-dimensional space [5], so that an input vector $x$ is projected onto a value $y$ given by

$y = w^T x$  (1)

where $w$ is a vector of adjustable weight parameters.

By adjusting the components of the weight vector $w$, we can select a projection which maximizes the class separation. The principle is illustrated in Fig. 1.

Consider a $d$-dimensional $X$ space; the mean vectors of the two classes are given by

$m_i = \frac{1}{N_i}\sum_{x \in c_i} x, \quad i = 1, 2$  (2)

where $c_i$ is the region of class $i$ and $N_i$ is the number of pattern vectors belonging to $c_i$.

The within-class scatter of the patterns from class $c_i$ is

$S_i = \sum_{x \in c_i} (x - m_i)(x - m_i)^T, \quad i = 1, 2$  (3)

so the total within-class scatter matrix is

$S_w = S_1 + S_2$  (4)

The between-class scatter matrix is

$S_b = (m_1 - m_2)(m_1 - m_2)^T$  (5)

Consider the transformed one-dimensional $Y$ space; the mean values of the two classes are given by

$\tilde{m}_i = \frac{1}{N_i}\sum_{y \in c_i} y, \quad i = 1, 2$  (6)

The within-class scatter of the projected patterns from class $c_i$ is

$\tilde{S}_i^2 = \sum_{y \in c_i} (y - \tilde{m}_i)^2, \quad i = 1, 2$  (7)

so the total within-class scatter in the projected space is

$\tilde{S}_w = \tilde{S}_1^2 + \tilde{S}_2^2$  (8)

We therefore arrive at the Fisher criterion

$J_F(w) = \dfrac{(\tilde{m}_1 - \tilde{m}_2)^2}{\tilde{S}_1^2 + \tilde{S}_2^2}$  (9)

We need to adjust $w$ to maximize $J_F(w)$. Since

$\tilde{m}_i = \frac{1}{N_i}\sum_{y \in c_i} y = \frac{1}{N_i}\sum_{x \in c_i} w^T x = w^T \left(\frac{1}{N_i}\sum_{x \in c_i} x\right) = w^T m_i$  (10)

we have

$(\tilde{m}_1 - \tilde{m}_2)^2 = (w^T m_1 - w^T m_2)^2 = w^T (m_1 - m_2)(m_1 - m_2)^T w = w^T S_b w$  (11)

Similarly,

$\tilde{S}_i^2 = \sum_{y \in c_i} (y - \tilde{m}_i)^2 = \sum_{x \in c_i} (w^T x - w^T m_i)^2 = w^T \left[\sum_{x \in c_i} (x - m_i)(x - m_i)^T\right] w = w^T S_i w$  (12)

so that

$\tilde{S}_1^2 + \tilde{S}_2^2 = w^T (S_1 + S_2) w = w^T S_w w$  (13)

From (9)-(13), we get

$J_F(w) = \dfrac{(\tilde{m}_1 - \tilde{m}_2)^2}{\tilde{S}_1^2 + \tilde{S}_2^2} = \dfrac{w^T S_b w}{w^T S_w w}$  (14)

To find the solution which maximizes $J_F(w)$, we fix $w^T S_w w = c \neq 0$ and define the Lagrange function

$L(w, \lambda) = w^T S_b w - \lambda (w^T S_w w - c)$  (15)

Differentiating (15) with respect to $w$ gives

$\dfrac{\partial L(w, \lambda)}{\partial w} = S_b w - \lambda S_w w$  (16)

Setting $S_b w - \lambda S_w w = 0$, the $w$ that maximizes $J_F(w)$ satisfies

$S_b w = \lambda S_w w, \qquad \lambda w = S_w^{-1} S_b w = S_w^{-1} (m_1 - m_2)(m_1 - m_2)^T w$  (17)

Let $R = (m_1 - m_2)^T w$, which is a scalar. Then

$\lambda w = R\, S_w^{-1}(m_1 - m_2), \qquad w = \dfrac{R}{\lambda}\, S_w^{-1}(m_1 - m_2) \;\propto\; S_w^{-1}(m_1 - m_2)$  (18)
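As an illustration of (18), the discriminant direction and the projection threshold can be computed in a few lines of MATLAB. This is a minimal sketch, assuming two hypothetical matrices X1 and X2 that hold the training patterns of the two classes, one pattern per row; it is not the exact code used in our experiments.

% Minimal two-class FLDA sketch; X1 and X2 hold the patterns of the two
% classes, one pattern per row (these variable names are illustrative).
m1 = mean(X1, 1)';                 % class mean vectors, Eq. (2)
m2 = mean(X2, 1)';
S1 = size(X1, 1) * cov(X1, 1);     % within-class scatter of class 1, Eq. (3)
S2 = size(X2, 1) * cov(X2, 1);
Sw = S1 + S2;                      % total within-class scatter, Eq. (4)
w  = Sw \ (m1 - m2);               % Fisher direction, Eq. (18)
y0 = (mean(X1 * w) + mean(X2 * w)) / 2;   % projection threshold used in Section II.B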

B. FLDA in Pattern Space

By formula (1) we can thus reduce the $d$-dimensional pattern space to a one-dimensional space.

To classify the projected patterns $y$ in the one-dimensional space, we select the threshold

$y_0 = \dfrac{\tilde{m}_1 + \tilde{m}_2}{2}$

and, for every transformed pattern $y = w^T x$: if $y \geq y_0$, then $x$ is assigned to class $c_1$; if $y < y_0$, then $x$ is assigned to class $c_2$.

For multiclass classification, we use the one-against-one strategy, with pairwise discriminants $w_{i,j}$, $1 \leq i < j \leq c$, for classification [6]. A candidate set $\{1, 2, \ldots, c\}$ is maintained for each pattern; we iterate through the pairwise classifiers, each time discarding the class to which the pattern does not belong, until only one class remains. This remaining class is the winner class to which the pattern is assigned, as sketched in the MATLAB fragment below.
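A minimal MATLAB sketch of this elimination scheme is given below. The names w_ij, y0_ij, and x are illustrative (a cell array of pairwise Fisher directions, a matrix of pairwise thresholds trained as in Section II.A, and a test pattern as a column vector); the fragment sketches the idea rather than the exact code used in the experiments.

% One-against-one FLDA decision sketch for a pattern x (column vector).
c = 9;                               % number of classes (the nine coins)
alive = true(1, c);                  % candidate class set {1, 2, ..., c}
for i = 1:c-1
    for j = i+1:c
        if alive(i) && alive(j)
            y = w_ij{i,j}' * x;      % project x with the (i, j) pairwise classifier
            if y >= y0_ij(i,j)       % pattern looks like class i: discard class j
                alive(j) = false;
            else                     % otherwise discard class i
                alive(i) = false;
            end
        end
    end
end
winner = find(alive);                % the single remaining (winner) class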

III. SVM CLASSIFICATION APPROACH

A. SVM Mathematical Formulation

SVM classification includes the linear SVM and the nonlinear SVM, and the linear SVM can be divided into the linearly separable case and the linearly non-separable case [7]-[12].

1) Linearly Separable Case: Consider a training set of $N$ patterns from the $d$-dimensional feature space, $x_i \in \mathbb{R}^d$ $(i = 1, 2, \ldots, N)$, with a target $y_i \in \{-1, +1\}$ associated with each pattern $x_i$. Assume the two classes are linearly separable; the membership decision rule can then be based on the function $\mathrm{sgn}[f(x)]$, where $f(x)$ is the discriminant function associated with the separating hyperplane, defined as

$f(x) = w \cdot x + b$  (19)

In order to find such a hyperplane, we estimate $w$ and $b$ so that

$y_i (w \cdot x_i + b) - 1 \geq 0, \quad i = 1, 2, \ldots, N$  (20)

The SVM approach consists in finding the optimal hyperplane that maximizes the distance between the closest training sample and the separating hyperplane, this distance being $1/\|w\|$. The optimal hyperplane can therefore be determined as the solution of the following convex quadratic programming problem:

minimize $\frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) - 1 \geq 0, \quad i = 1, 2, \ldots, N$  (21)

Using the Lagrangian function, this linearly constrained optimization problem can be translated into the following dual problem:

maximize $\displaystyle\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$

subject to $\displaystyle\sum_{i=1}^{N} \alpha_i y_i = 0$ and $\alpha_i \geq 0, \quad i = 1, 2, \ldots, N$

Suppose $\alpha_i^{*}$ is the optimal solution; then $w^{*} = \sum_{i=1}^{N} \alpha_i^{*} y_i x_i$ defines the optimal hyperplane.
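The step from (21) to this dual is the standard one: the Lagrangian of (21) is

$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right], \quad \alpha_i \geq 0$

and setting $\partial L / \partial w = 0$ and $\partial L / \partial b = 0$ gives $w = \sum_{i=1}^{N} \alpha_i y_i x_i$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$; substituting these back into $L(w, b, \alpha)$ yields the dual objective above.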

Linearly Non-separable Case: In the classification of real data, to handle non-separable patterns we introduce a set of nonnegative slack variables $\xi_i$ and construct the cost function

$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$  (22)

with the constraints

$y_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad i = 1, 2, \ldots, N$  (23)

2) Nonlinear Kernel Method: Let $x$ denote a vector from an $m_0$-dimensional input space, and let $\{\varphi_j(x)\}_{j=1}^{m_1}$ denote a set of nonlinear transformations from the input space to the feature space. We may define a hyperplane acting as the decision surface as follows:

$\sum_{j=1}^{m_1} w_j \varphi_j(x) + b = 0$  (24)

which can be written more simply as

$\sum_{j=0}^{m_1} w_j \varphi_j(x) = 0, \quad \varphi_0(x) = 1$  (25)

or, in compact form, $w^T \varphi(x) = 0$. Since

$w = \sum_{i=1}^{N} \alpha_i d_i \varphi(x_i)$

the decision surface in the feature space is

$\sum_{i=1}^{N} \alpha_i d_i \varphi^T(x_i)\, \varphi(x) = 0$  (26)

that is,

$\sum_{i=1}^{N} \alpha_i d_i K(x, x_i) = 0, \quad K(x, x_i) = \varphi^T(x)\, \varphi(x_i)$  (27)

Hence the optimal solution is the one that maximizes

$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$  (28)

subject to the constraints $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$, $i = 1, 2, \ldots, N$. The discriminant function $f(x)$ is therefore

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^{N} \alpha_i^{*} y_i K(x, x_i) + b^{*} \right)$  (29)

The shape of the discriminant function depends on the kind of kernel function adopted. In this project, the two common kernel functions adopted are as follows.

Polynomial function of order $p$:

$K(x, x_i) = [(x \cdot x_i) + 1]^p$  (30)

Gaussian radial basis function:

$K(x, x_i) = \exp\{-\gamma \|x - x_i\|^2\}$  (31)

In our experiments, $\gamma = 0.5$, $C = 1000$, $p = 2$, and $\mathrm{eps} = 0.00001$.
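To illustrate how the dual problem (28) with these kernels could be solved in MATLAB, the following sketch uses quadprog from the Optimization Toolbox. The variables X (patterns stored row-wise) and y (labels in {-1, +1}) are illustrative placeholders, and the fragment is a sketch of the formulation rather than the exact implementation used in our experiments.

% Sketch: solve the dual problem (28) with the RBF kernel (31) using quadprog.
% X is N-by-d with one pattern per row; y is N-by-1 with labels in {-1, +1}.
gamma = 0.5;  C = 1000;                          % parameter values from Section III.A
N  = size(X, 1);
sq = sum(X.^2, 2);
D  = bsxfun(@plus, sq, sq') - 2 * (X * X');      % squared pairwise distances
K  = exp(-gamma * D);                            % RBF kernel matrix, Eq. (31)
H  = (y * y') .* K;                              % quadratic term of the dual
f  = -ones(N, 1);                                % quadprog minimizes, so negate the linear term
Aeq = y';  beq = 0;                              % constraint sum(alpha_i * y_i) = 0
lb  = zeros(N, 1);  ub = C * ones(N, 1);         % box constraints 0 <= alpha_i <= C
alpha = quadprog(H, f, [], [], Aeq, beq, lb, ub);
sv = alpha > 1e-5;                               % indices of the support vectors
b  = mean(y(sv) - K(sv, :) * (alpha .* y));      % bias estimated from the support vectors
pred = sign(K * (alpha .* y) + b);               % decision rule (29) applied to the training patterns

Any quadratic programming solver could be substituted for quadprog here; the essential point is that only the kernel matrix K, not the explicit mapping phi(x), enters the computation.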

B. SVM in Pattern Space

By the discriminant function

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^{N} \alpha_i^{*} y_i K(x, x_i) + b^{*} \right)$  (32)

the patterns in the original space are classified. For multiclass problems, we again use the one-against-one strategy described in Section II.B.

IV. EXPERIMENTAL RESULTS

A. Dataset Description and Experiment Design

Before applying the two classification techniques, each 100-dimensional vector in the datasets is transformed into a one-dimensional pattern: by summing the 100 components of each vector we obtain the total number of heads, which is used as the feature of the transformed pattern. We then conducted four categories of experiments according to the size of the training dataset, which consists of 100, 300, 600, or 900 patterns for the nine-class classification problem; the testing datasets consist of 10000 and 20000 patterns.

The experimental analysis was organized into these four categories of experiments. From them, we first analyze the effectiveness of both FLDA and SVM for multiclass classification. Then, using the four training datasets of different sizes, we analyze the effect of training-set size on FLDA and SVM. Finally, we compare the performance of FLDA and SVM.
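As a concrete illustration of this feature transform, the following MATLAB fragment computes the head-count feature for every pattern, assuming the cell array dataset produced by the generation code in Section I; it is a sketch, not the exact code of the experiments.

% Feature transform: each 1-by-100 toss sequence is reduced to its number of heads.
features = cellfun(@sum, dataset);   % features(i) = total heads in dataset{i}
features = features(:);              % one scalar feature per pattern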

B. Results of Experiment 1

Classification based on a training dataset with 100 patterns:

Method      Training size   Accuracy (10000 test patterns)   Accuracy (20000 test patterns)
FLDA        100             0.7716                           0.7662
SVM-POLY    100             0.7599                           0.7536
SVM-RBF     100             0.7771                           0.7693

C. Results of Experiment 2

Classification based on a training dataset with 300 patterns:

Method      Training size   Accuracy (10000 test patterns)   Accuracy (20000 test patterns)
FLDA        300             0.7802                           0.7744
SVM-POLY    300             0.7716                           0.7675
SVM-RBF     300             0.7802                           0.7745

D. Results of Experiment 3

Classification based on a training dataset with 600 patterns:

Method      Training size   Accuracy (10000 test patterns)   Accuracy (20000 test patterns)
FLDA        600             0.7802                           0.7744
SVM-POLY    600             0.7792                           0.77105
SVM-RBF     600             0.7737                           0.767

E. Results of Experiment 4

Classification based on a training dataset with 900 patterns:

Method      Training size   Accuracy (10000 test patterns)   Accuracy (20000 test patterns)
FLDA        900             0.7802                           0.7744
SVM-POLY    900             0.7765                           0.76715
SVM-RBF     900             0.7751                           0.76875

V. DISCUSSION AND CONCLUSION

The training datasets for these four experiments are labeled according to a rounding policy:

label(i) = round(data(i) * 10)

where data(i) is the scaled feature of the i-th pattern, i.e., the total number of heads in 100 tosses of a coin divided by 100, and label(i) indicates the class to which the i-th pattern belongs. For the two testing datasets, we did not apply any labeling rule to generate them; each testing pattern was generated from a coin whose bias was selected uniformly at random among the nine coins.
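In terms of the head-count feature of Section IV.A, the rounding policy above amounts to the following illustrative line of MATLAB (with features as in the sketch of Section IV.A):

labels = round(features / 10);   % since data(i) = features(i)/100, label = round(features/10)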

From the experimental results above, we can conclude the following. 1) The three classifiers, FLDA, SVM-POLY, and SVM-RBF, all achieve a good classification rate when applied to a large dataset. 2) FLDA attains higher classification accuracy as the size of the training dataset increases, but its accuracy levels off once the training dataset reaches a certain size. SVM-POLY and SVM-RBF also increase their accuracy as the training dataset grows, but once the training dataset exceeds a certain size, their accuracy decreases as the training dataset continues to grow. 3) SVM-RBF performs better than SVM-POLY and FLDA when the training dataset is small, but once the training dataset reaches a certain size, FLDA performs better than SVM-RBF and SVM-POLY. 4) When the size of the testing dataset increases drastically, all three classifiers lose some accuracy.

REFERENCES
[1] C.-I Chang, Hyperspectral Imaging: Techniques for Spectral Detection and Classification. New York: Kluwer, 2003.
[2] C.-W. Hsu and C.-J. Lin, "A Comparison of Methods for Multiclass Support Vector Machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415-425, Mar. 2002.
[3] Y.-F. Ma and H.-J. Zhang, "Motion Pattern-Based Video Classification and Retrieval," EURASIP Journal on Applied Signal Processing, pp. 199-208, Feb. 2003.
[4] F. Melgani, "Classification of Hyperspectral Remote Sensing Images With Support Vector Machines," IEEE Transactions on Geoscience and Remote Sensing, vol. 42, pp. 1778-1790, Aug. 2004.
[5] Q. Du, "Modified Fisher's Linear Discriminant Analysis for Hyperspectral Imagery," IEEE Geoscience and Remote Sensing Letters, vol. 4, no. 4, pp. 503-507, Oct. 2007.
[6] E. L. Allwein, R. E. Schapire, and Y. Singer, "Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers," Journal of Machine Learning Research, pp. 113-141, Jan. 2000.
[7] S. Chen and X. Yang, "Alternative Linear Discriminant Classifier," Pattern Recognition, vol. 37, pp. 1545-1547, 2004.
[8] K. Crammer and Y. Singer, "On the Algorithmic Implementation of Multiclass Kernel-Based Vector Machines," Journal of Machine Learning Research, pp. 265-292, Feb. 2001.
[9] S. Yu, T. Tan, K. Huang, K. Jia, and X. Wu, "A Study on Gait-Based Gender Classification," IEEE Transactions on Image Processing, vol. 18, no. 8, pp. 1905-1910, Aug. 2009.
[10] S. Chen and D. Li, "Modified Linear Discriminant Analysis," Pattern Recognition, vol. 38, pp. 441-443, 2005.
[11] J. W. Lee, J. B. Lee, M. Park, and S. H. Song, "An Extensive Comparison of Recent Classification Tools Applied to Microarray Data," Computational Statistics & Data Analysis, vol. 48, pp. 869-885, 2005.
[12] The Curse of Dimensionality, ACAS short course. Available: http://www.galaxy.gmu.edu/ACAS/ACAS00-02/ACAS02ShortCourse/ACASCourse10.pdf
