2010 / 03 / 17 Yi - Xian Lin 1 A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee Accepted by IEEE Transactions on Knowledge and Data Engineering Reporter Yi-Xian Lin National University of Tainan

A Fuzzy Self Constructing Feature Clustering Algorithm for Text Classification


2010 / 03 / 17 Yi - Xian Lin 1

A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification

Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee

Accepted by IEEE Transactions on Knowledge and Data Engineering

Reporter: Yi-Xian Lin

National University of Tainan


Outline

• Motivation & Objective

• Feature Reduction

• Feature Clustering

• Fuzzy Feature Clustering

• Text Classification

• Experimental results

• Advantages


Motivation & Objective

• In text classification, the dimensionality of the feature vector is usually huge

• Problems with existing feature clustering methods

  ◦ The desired number of extracted features has to be specified in advance

  ◦ The variance of the underlying cluster is not considered when calculating similarities

• How can we reduce the dimensionality of feature vectors for text classification and run faster?


Feature Reduction

• Purpose

  ◦ Reduce the classifier's computation load

  ◦ Increase data consistency

• Techniques

  ◦ Eliminate redundant data

  ◦ Find representative data

  ◦ Reduce the dimensions of the feature sets

  ◦ Find the set of vectors which best separates the patterns

• Two ways of doing feature reduction: feature selection and feature extraction


Feature Reduction

• Feature selection

  ◦ Let the word set $W = \{w_1, w_2, \ldots, w_m\}$ be the feature vector of the document set

  ◦ Find a new word set $W' = \{w'_1, w'_2, \ldots, w'_k\}$, $k < m$

  ◦ Then W' is used as the input for classification tasks

• Feature extraction

  ◦ Extracted features are obtained by a projecting process through algebraic transformations

  ◦ Let a corpus of documents be represented as an $m \times n$ matrix $X \in R^{m \times n}$

  ◦ Find an optimal transformation matrix $F^* \in R^{m \times k}$


Feature Clustering

• Feature clustering is an efficient approach for feature reduction

• Groups all features into clusters such that the features in a cluster are similar to each other

• Let D be the matrix consisting of all the original documents with m features and D' be the matrix consisting of the converted documents with k new features

• The new feature set $W' = \{w'_1, w'_2, \ldots, w'_k\}$ corresponds to a partition $\{W_1, W_2, \ldots, W_k\}$ of the original feature set W


Fuzzy Feature Clustering

• A document set D of n documents d1,d2,...,dn

• Feature vector W of m words w1,w2,...,wm

• p classes c1,c2,...,cp

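• Construct one word pattern for each word in W

$x_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip} \rangle = \langle P(c_1 \mid w_i), P(c_2 \mid w_i), \ldots, P(c_p \mid w_i) \rangle$

where

$P(c_j \mid w_i) = \dfrac{\sum_{q=1}^{n} d_{qi} \times \delta_{qj}}{\sum_{q=1}^{n} d_{qi}}, \quad 1 \le j \le p$

Here $d_{qi}$ is the number of occurrences of $w_i$ in document $d_q$, and $\delta_{qj}$ is 1 if $d_q$ belongs to class $c_j$ and 0 otherwise.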


Fuzzy Feature Clustering

$x_6 = \langle P(c_1 \mid w_6), P(c_2 \mid w_6) \rangle$

$P(c_2 \mid w_6) = \dfrac{1 \times 0 + 2 \times 0 + 0 \times 0 + 1 \times 0 + 1 \times 1 + 1 \times 1 + 1 \times 1 + 1 \times 1 + 0 \times 1}{1 + 2 + 0 + 1 + 1 + 1 + 1 + 1 + 0} = 0.50$
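The word-pattern computation for $w_6$ can be checked with a few lines of Python. This is only a minimal sketch: the function name `word_pattern` and the label strings are illustrative, and the class assignment (first four documents in $c_1$, last five in $c_2$) is read off the 0/1 factors in the numerator of the worked example.

```python
def word_pattern(occurrences, doc_labels, classes):
    """Return <P(c_1|w), ..., P(c_p|w)> for one word.

    occurrences[q] is the count of the word in document q,
    doc_labels[q] is the class of document q.
    """
    total = sum(occurrences)
    pattern = []
    for c in classes:
        # Sum the occurrences of the word over the documents of class c.
        in_class = sum(d for d, label in zip(occurrences, doc_labels) if label == c)
        pattern.append(in_class / total if total else 0.0)
    return pattern


# Counts of w6 in the nine documents, taken from the worked example.
w6_counts = [1, 2, 0, 1, 1, 1, 1, 1, 0]
# Assumed class layout: documents 1-4 belong to c1, documents 5-9 to c2.
doc_labels = ["c1"] * 4 + ["c2"] * 5

x6 = word_pattern(w6_counts, doc_labels, ["c1", "c2"])
print(x6)  # [0.5, 0.5]
```

Each component of a word pattern is a conditional probability, so the components of one pattern sum to 1 whenever the word occurs at least once.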


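Fuzzy Feature Clustering

• Let G be a cluster containing q word patterns x1,x2,...,xq

• Let $x_j = \langle x_{j1}, x_{j2}, \ldots, x_{jp} \rangle$, $1 \le j \le q$

• The mean $m = \langle m_1, m_2, \ldots, m_p \rangle$, where $m_i = \dfrac{\sum_{j=1}^{q} x_{ji}}{|G|}$

• The deviation $\sigma = \langle \sigma_1, \sigma_2, \ldots, \sigma_p \rangle$, where $\sigma_i = \sqrt{\dfrac{\sum_{j=1}^{q} (x_{ji} - m_i)^2}{|G|}}$ for $1 \le i \le p$

• The fuzzy similarity of a word pattern x to cluster G

$\mu_G(x) = \prod_{i=1}^{p} \exp\left[ -\left( \dfrac{x_i - m_i}{\sigma_i} \right)^2 \right]$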
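The mean and deviation of a cluster defined on this slide can be sketched as follows. The function names and the two toy word patterns are assumptions made for illustration, and the deviation is taken as a square root of the averaged squared differences.

```python
import math


def cluster_mean(patterns):
    """Component-wise mean of the word patterns in a cluster."""
    q = len(patterns)
    p = len(patterns[0])
    return [sum(x[i] for x in patterns) / q for i in range(p)]


def cluster_deviation(patterns):
    """Component-wise deviation, dividing by the cluster size |G|."""
    q = len(patterns)
    m = cluster_mean(patterns)
    return [
        math.sqrt(sum((x[i] - m[i]) ** 2 for x in patterns) / q)
        for i in range(len(m))
    ]


# Hypothetical cluster with two word patterns over two classes.
G = [[0.2, 0.8], [0.6, 0.4]]
mean = cluster_mean(G)
deviation = cluster_deviation(G)
```

For this toy cluster the mean is approximately < 0.4, 0.6 > and the deviation approximately < 0.2, 0.2 >.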


Fuzzy Feature Clustering

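• A word pattern close to the mean of a cluster is regarded to be very similar to this cluster, i.e., $\mu_G(x) \approx 1$

• Suppose m1 = < 0.4, 0.6 > , σ1 = < 0.3 , 0.5 > , and the word pattern is x2 = < 0.2, 0.8 >

$\mu_{G_1}(x_2) = \exp\left[ -\left( \dfrac{0.2 - 0.4}{0.3} \right)^2 \right] \times \exp\left[ -\left( \dfrac{0.8 - 0.6}{0.5} \right)^2 \right] = 0.6412 \times 0.8521 = 0.5464$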
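The similarity value in this example can be reproduced with a short sketch of the membership function, a product of one Gaussian-like factor per class dimension. The function name `membership` is an illustrative choice.

```python
import math


def membership(x, mean, deviation):
    """Fuzzy similarity of word pattern x to a cluster with the given mean and deviation."""
    mu = 1.0
    for xi, mi, si in zip(x, mean, deviation):
        mu *= math.exp(-((xi - mi) / si) ** 2)
    return mu


# Values from the example: m1 = <0.4, 0.6>, sigma1 = <0.3, 0.5>, pattern <0.2, 0.8>.
mu = membership([0.2, 0.8], [0.4, 0.6], [0.3, 0.5])
print(round(mu, 4))  # 0.5464
```

A pattern equal to the cluster mean gives a membership of exactly 1, and the value decays toward 0 as the pattern moves away from the mean.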


Fuzzy Feature Clustering

• A predefined threshold ρ, $0 \le \rho \le 1$

• If $\mu_{G_j}(x_i) \ge \rho$, xi passes the similarity test on cluster Gj

• If the user intends to have larger clusters, give a smaller threshold

• Two cases may occur

  ◦ No existing fuzzy clusters on which xi has passed the similarity test

  ◦ Create a new cluster Gh , h = k + 1 ( k is the number of currently existing clusters), with $m_h = x_i$, $\sigma_h = \sigma_0$

  ◦ $\sigma_0 = \langle \sigma_0, \ldots, \sigma_0 \rangle$ is a user-defined constant vector


Fuzzy Feature Clustering

• If there are existing clusters on which xi has passed the similarity test, let cluster Gt be the cluster with the largest membership degree,

$t = \arg\max_{1 \le j \le k} \left( \mu_{G_j}(x_i) \right)$

• Modification to cluster Gt

$m_{tj} = \dfrac{S_t \times m_{tj} + x_{ij}}{S_t + 1}, \quad \sigma_{tj} = \sqrt{A - B} + \sigma_0$

$A = \dfrac{(S_t - 1)(\sigma_{tj} - \sigma_0)^2 + S_t \times m_{tj}^2 + x_{ij}^2}{S_t}, \quad B = \dfrac{S_t + 1}{S_t} \left( \dfrac{S_t \times m_{tj} + x_{ij}}{S_t + 1} \right)^2$

for $1 \le j \le p$, and $S_t = S_t + 1$
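The two cases (create a new cluster, or update the best-matching one) can be sketched as a single incremental step. This is a minimal illustration rather than the authors' implementation: the dictionary layout, the function name `feed`, and the `max(..., 0.0)` guard against floating-point round-off are assumptions. The terms A and B are evaluated with the mean and deviation held before the update.

```python
import math


def membership(x, mean, dev):
    mu = 1.0
    for xi, mi, si in zip(x, mean, dev):
        mu *= math.exp(-((xi - mi) / si) ** 2)
    return mu


def feed(clusters, x, rho, sigma0):
    """Feed one word pattern x into the list of clusters (modified in place)."""
    # Find the cluster with the largest membership among those passing the test.
    best, best_mu = None, -1.0
    for g in clusters:
        mu = membership(x, g["mean"], g["dev"])
        if mu >= rho and mu > best_mu:
            best, best_mu = g, mu

    if best is None:
        # Case 1: no cluster passes, so create a new one centred on x.
        clusters.append({"size": 1, "mean": list(x), "dev": [sigma0] * len(x)})
        return

    # Case 2: update mean and deviation of the best cluster.
    s = best["size"]
    for j, xj in enumerate(x):
        m_old, d_old = best["mean"][j], best["dev"][j]
        m_new = (s * m_old + xj) / (s + 1)
        a = ((s - 1) * (d_old - sigma0) ** 2 + s * m_old ** 2 + xj ** 2) / s
        b = (s + 1) / s * m_new ** 2
        best["mean"][j] = m_new
        # Guard against tiny negative values caused by round-off (assumption).
        best["dev"][j] = math.sqrt(max(a - b, 0.0)) + sigma0
    best["size"] = s + 1


clusters = []
feed(clusters, [1.0, 0.0], rho=0.64, sigma0=0.5)  # creates the first cluster
feed(clusters, [1.0, 0.0], rho=0.64, sigma0=0.5)  # identical pattern, updates it
```

Feeding the same pattern < 1.00, 0.00 > twice with σ0 = 0.5 and ρ = 0.64 leaves a single cluster of size 2 with mean < 1.00, 0.00 > and deviation < 0.5, 0.5 >.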


Fuzzy Feature Clustering

• The order in which the word patterns are fed in influences the clusters obtained

• Sort all the patterns, in decreasing order, by their largest components

  ◦ Let x1 = < 0.1 , 0.3 , 0.6 > , x2 = < 0.3, 0.3, 0.4 > , x3 = < 0.8, 0.1, 0.1 >

  ◦ The largest components in these word patterns are 0.6, 0.4, and 0.8

  ◦ The sorted list is 0.8, 0.6, 0.4

  ◦ The order of feeding is x3, x1, x2
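The feeding order in this example follows from a single sort on the largest component of each pattern. A minimal sketch, with the dictionary of patterns as an illustrative data layout:

```python
# Word patterns from the example, keyed by name.
patterns = {
    "x1": [0.1, 0.3, 0.6],
    "x2": [0.3, 0.3, 0.4],
    "x3": [0.8, 0.1, 0.1],
}

# Sort by the largest component of each pattern, in decreasing order.
order = sorted(patterns, key=lambda name: max(patterns[name]), reverse=True)
print(order)  # ['x3', 'x1', 'x2']
```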


Fuzzy Feature Clustering

• The order of feeding : x5, x7, x10, x1, x4, x9, x2, x3, x8, x6

• No clusters exist at the beginning , k = 0

• Set σ0 = 0.5 , ρ=0.64

• Create G1

cluster   Size S   mean m              deviation σ
G1        1        < 1.00 , 0.00 >     < 0.5 , 0.5 >


Fuzzy Feature Clustering

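• Feeding x7: $\mu_{G_1}(x_7) = 1 > \rho$

$m_{11} = \dfrac{1 \times 1.00 + 1.00}{1 + 1} = 1.00, \quad m_{12} = \dfrac{1 \times 0.00 + 0.00}{1 + 1} = 0.00$

$m_1 = \langle 1.00, 0.00 \rangle$

$A_{11} = \dfrac{(1 - 1)(0.5 - 0.5)^2 + 1 \times 1.00^2 + 1.00^2}{1}, \quad B_{11} = \dfrac{1 + 1}{1} \left( \dfrac{1 \times 1.00 + 1.00}{1 + 1} \right)^2$

$A_{12} = \dfrac{(1 - 1)(0.5 - 0.5)^2 + 1 \times 0.00^2 + 0.00^2}{1}, \quad B_{12} = \dfrac{1 + 1}{1} \left( \dfrac{1 \times 0.00 + 0.00}{1 + 1} \right)^2$

$\sigma_{11} = \sqrt{A_{11} - B_{11}} + 0.5 = 0.5, \quad \sigma_{12} = \sqrt{A_{12} - B_{12}} + 0.5 = 0.5$

$\sigma_1 = \langle 0.5, 0.5 \rangle, \quad S_1 = 1 + 1 = 2$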


Fuzzy Feature Clustering

• After self-constructing clustering

• Similarities of patterns to clusters


Fuzzy Feature Clustering

• Data transformation

$D' = DT$

$D = [d_1 \; d_2 \; \cdots \; d_n]^T, \quad D' = [d'_1 \; d'_2 \; \cdots \; d'_n]^T$

• H-FFC (hard weighting)

  ◦ each word belongs to exactly one cluster, so it contributes to only one new extracted feature

$t_{ij} = \begin{cases} 1, & \text{if } j = \arg\max_{1 \le \alpha \le k} \left( \mu_{G_\alpha}(x_i) \right) \\ 0, & \text{otherwise} \end{cases}$


Fuzzy Feature Clustering

H-FFC :


Fuzzy Feature Clustering

• S-FFC (soft weighting)

  ◦ each word is allowed to contribute to all new extracted features, with the degrees depending on the values of the membership functions

$t_{ij} = \mu_{G_j}(x_i)$

• M-FFC (mixed weighting)

  ◦ a combination of the hard-weighting approach and the soft-weighting approach

$t_{ij} = \gamma \times t^H_{ij} + (1 - \gamma) \times t^S_{ij}$

  ◦ γ is a user-defined constant lying between 0 and 1
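The three weighting schemes and the transformation D' = DT can be sketched in plain Python. The membership values in `mu` and the single document row in `D` are made-up toy numbers, and the function names are illustrative; ties in the hard weighting are resolved by taking the first maximal cluster, which is an assumption.

```python
def hard_weights(mu):
    """H-FFC: mu[i][j] is the membership of word i in cluster j."""
    T = []
    for row in mu:
        best = max(range(len(row)), key=lambda j: row[j])
        T.append([1.0 if j == best else 0.0 for j in range(len(row))])
    return T


def soft_weights(mu):
    """S-FFC: the weights are the membership values themselves."""
    return [list(row) for row in mu]


def mixed_weights(mu, gamma):
    """M-FFC: convex combination of the hard and soft weights."""
    H, S = hard_weights(mu), soft_weights(mu)
    return [
        [gamma * h + (1 - gamma) * s for h, s in zip(h_row, s_row)]
        for h_row, s_row in zip(H, S)
    ]


def transform(D, T):
    """Compute D' = D T for an n x m matrix D and an m x k matrix T."""
    return [
        [sum(d[i] * T[i][j] for i in range(len(T))) for j in range(len(T[0]))]
        for d in D
    ]


# Toy data: 3 words, 2 clusters, 1 document (all values hypothetical).
mu = [[0.9, 0.1], [0.2, 0.7], [0.6, 0.5]]
D = [[1, 0, 2]]

D_hard = transform(D, hard_weights(mu))
D_soft = transform(D, soft_weights(mu))
D_mixed = transform(D, mixed_weights(mu, gamma=0.5))
```

With γ = 1 the mixed weights reduce to the hard weights, and with γ = 0 they reduce to the soft weights.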


Fuzzy Feature Clustering

S-FFC :


Fuzzy Feature Clustering

M-FFC :


Text Classification

• Training document data set → Feature reduction → Training data set for class 1, …, Training data set for class p → Train 1st classifier (SVM), …, Train p-th classifier (SVM)

• Unknown pattern → Feature reduction → …

• p classifiers are constructed.


Text Classification

• Training data set and target sets for SVMs

Class   Target 1   Target 2
C1      +1         -1
C1      +1         -1
C1      +1         -1
C1      +1         -1
C2      -1         +1
C2      -1         +1
C2      -1         +1
C2      -1         +1
C2      -1         +1

Target 1: training target set for class C1

Target 2: training target set for class C2
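The target sets in this table follow a one-classifier-per-class scheme: for each class, documents of that class get +1 and all others get -1. A minimal sketch, with `one_vs_rest_targets` as an illustrative helper name:

```python
def one_vs_rest_targets(labels, classes):
    """Build one +1/-1 target vector per class from the document labels."""
    return {c: [1 if y == c else -1 for y in labels] for c in classes}


# Document labels as in the table: four documents of C1, five of C2.
labels = ["C1"] * 4 + ["C2"] * 5
targets = one_vs_rest_targets(labels, ["C1", "C2"])

print(targets["C1"])  # [1, 1, 1, 1, -1, -1, -1, -1, -1]
print(targets["C2"])  # [-1, -1, -1, -1, 1, 1, 1, 1, 1]
```

Each target vector would then be paired with the reduced training data to train the classifier for that class.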


Text Classification

• Training classifiers

  ◦ $D'_H$ + target 1 → Training classifier (SVM1)

  ◦ $D'_H$ + target 2 → Training classifier (SVM2)

• Feature reduction for unknown pattern

  ◦ Unknown pattern → Unknown pattern after feature reduction


Text Classification

• Classify the unknown pattern

  ◦ Trained classifier (SVM1) outputs -1; trained classifier (SVM2) outputs +1

  ◦ Unknown pattern d is classified to class C2


Experimental results

• Performance measures

  ◦ $TP_i$: true positives with respect to the i-th class

  ◦ $FP_i$: false positives with respect to the i-th class

  ◦ $TN_i$: true negatives with respect to the i-th class

  ◦ $FN_i$: false negatives with respect to the i-th class

  ◦ $p$: number of classes

$MicroP = \dfrac{\sum_{i=1}^{p} TP_i}{\sum_{i=1}^{p} (TP_i + FP_i)}, \quad MicroR = \dfrac{\sum_{i=1}^{p} TP_i}{\sum_{i=1}^{p} (TP_i + FN_i)}$

$MicroF1 = \dfrac{2 \times MicroP \times MicroR}{MicroP + MicroR}, \quad MicroAcc = \dfrac{\sum_{i=1}^{p} (TP_i + TN_i)}{\sum_{i=1}^{p} (TP_i + TN_i + FP_i + FN_i)}$
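The microaveraged measures pool the per-class counts before taking ratios. A minimal sketch under the assumption that the counts are available as one (TP, FP, TN, FN) tuple per class; the numbers below are invented purely to exercise the formulas.

```python
def micro_metrics(counts):
    """counts: list of (TP, FP, TN, FN) tuples, one per class."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    tn = sum(c[2] for c in counts)
    fn = sum(c[3] for c in counts)
    micro_p = tp / (tp + fp)
    micro_r = tp / (tp + fn)
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
    micro_acc = (tp + tn) / (tp + tn + fp + fn)
    return micro_p, micro_r, micro_f1, micro_acc


# Hypothetical counts for two classes.
counts = [(8, 2, 85, 5), (6, 4, 88, 2)]
micro_p, micro_r, micro_f1, micro_acc = micro_metrics(counts)
```

For these made-up counts, MicroP is 14/20, MicroR is 14/21, MicroF1 is 28/41, and MicroAcc is 187/200.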


Experimental results

• 20 Newsgroups data set

Number of classes                   20
Number of documents                 20000
Proportion of training documents    2/3
Proportion of testing documents     1/3
Number of features                  25718


Experimental results

Execution time (sec) of different methods on 20 Newsgroups data


Experimental results

Microaveraged accuracy (%) of different methods on 20 Newsgroups data


Experimental results

Microaveraged F1 (%) of M-FFC with different γ values for 20 Newsgroups data


Advantages

• A fuzzy self-constructing feature clustering (FFC) algorithm, an incremental clustering approach that reduces the dimensionality of the features in text classification

• Determines the number of features automatically

• Matches membership functions closely with the real distribution of the training data

• Runs faster

• Extracts better features than other methods