Data Classification with the Radial Basis Function Network Based on a Novel Kernel Density Estimation Algorithm
Yen-Jen Oyang
Department of Computer Science and Information Engineering
National Taiwan University
An Example of Data Classification
| Data | Class | Data | Class | Data | Class |
|---|---|---|---|---|---|
| (15,33) | O | (18,28) | × | (16,31) | O |
| (9,23) | × | (15,35) | O | (9,32) | × |
| (8,15) | × | (17,34) | O | (11,38) | × |
| (11,31) | O | (18,39) | × | (13,34) | O |
| (13,37) | × | (14,32) | O | (19,36) | × |
| (18,32) | O | (25,18) | × | (10,34) | × |
| (16,38) | × | (23,33) | × | (15,30) | O |
| (12,33) | O | (21,28) | × | (13,22) | × |
Distribution of the Data Set
[Scatter plot of the data set: the O samples form a cluster at roughly 10 < x < 20, 30 < y < 35, surrounded by × samples.]
Rule Based on Observation
If $15 \le x \le 22$ and $y \ge 30$, then class O; otherwise class X.
Rule Generated by the Proposed RBF (Radial Basis Function) Network Based Learning Algorithm

Let
$$f_o(v)=\frac{1}{10}\sum_{i=1}^{10}\frac{1}{2\pi\sigma_i^2}\exp\!\left(-\frac{\|v-c_i\|^2}{2\sigma_i^2}\right)$$
and
$$f_x(v)=\frac{1}{14}\sum_{j=1}^{14}\frac{1}{2\pi\sigma_j^2}\exp\!\left(-\frac{\|v-c_j\|^2}{2\sigma_j^2}\right),$$
with the centers and bandwidths listed in the tables below.

If
$$\frac{|S_o|}{|S|}\,f_o(v)\ \ge\ \frac{|S_x|}{|S|}\,f_x(v),\qquad |S_o|=10,\ |S_x|=14,\ |S|=24,$$
then prediction = "O". Otherwise prediction = "X".
Kernels for the O class (centers $c_i$ and bandwidths $\sigma_i$):

| $c_i$ | (15,33) | (11,31) | (18,32) | (12,33) | (15,35) | (17,34) | (14,32) | (16,31) | (13,34) | (15,30) |
|---|---|---|---|---|---|---|---|---|---|---|
| $\sigma_i$ | 1.723 | 2.745 | 2.327 | 1.794 | 1.973 | 2.045 | 1.794 | 1.794 | 1.794 | 2.027 |

Kernels for the × class (centers $c_j$ and bandwidths $\sigma_j$):

| $c_j$ | (9,23) | (8,15) | (13,37) | (16,38) | (18,28) | (18,39) | (25,18) | (23,33) | (21,28) | (9,32) | (11,38) | (19,36) | (10,34) | (13,22) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\sigma_j$ | 6.458 | 10.08 | 2.939 | 2.745 | 5.451 | 3.287 | 10.86 | 5.322 | 5.070 | 4.562 | 3.463 | 3.587 | 3.232 | 6.260 |
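As a minimal sketch, the rule can be evaluated directly from the two tables above. The $\frac{1}{2\pi\sigma^2}$ normalization and the $\frac{|S_m|}{|S|}$ prior weighting follow the reconstructed formulas, so treat those constants as assumptions rather than the slide's exact arithmetic:

```python
import numpy as np

# Class-O kernels: centers c_i and bandwidths sigma_i (from the tables above).
o_centers = np.array([(15,33),(11,31),(18,32),(12,33),(15,35),
                      (17,34),(14,32),(16,31),(13,34),(15,30)], dtype=float)
o_sigmas = np.array([1.723,2.745,2.327,1.794,1.973,2.045,1.794,1.794,1.794,2.027])

x_centers = np.array([(9,23),(8,15),(13,37),(16,38),(18,28),(18,39),(25,18),
                      (23,33),(21,28),(9,32),(11,38),(19,36),(10,34),(13,22)], dtype=float)
x_sigmas = np.array([6.458,10.08,2.939,2.745,5.451,3.287,10.86,
                     5.322,5.070,4.562,3.463,3.587,3.232,6.260])

def density(v, centers, sigmas):
    """Average of 2-D spherical Gaussians: (1/m) * sum_i N(v; c_i, sigma_i^2 I)."""
    d2 = np.sum((centers - v) ** 2, axis=1)
    return np.mean(np.exp(-d2 / (2 * sigmas**2)) / (2 * np.pi * sigmas**2))

def predict(v):
    # Weight each class-conditional density by its prior |S_m| / |S| (10 O, 14 X).
    v = np.asarray(v, dtype=float)
    lo = (10 / 24) * density(v, o_centers, o_sigmas)
    lx = (14 / 24) * density(v, x_centers, x_sigmas)
    return "O" if lo >= lx else "X"

print(predict((15, 33)))  # expected: "O"
print(predict((25, 18)))  # expected: "X"
```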
Identifying Boundary of Different Classes of Objects
Boundary Identified
The Vector Space Model
• In the vector space model, each object is described by a number of numerical attributes.
• For example, the physical profile of a person can be described by height, weight, and age.
Transformation of Categorical Attributes into Numerical Attributes

• Represent the attribute values of the object in a binary table form, as exemplified in the following:

| Objects | Male | Female | High School Education | College Education | Graduate School Education |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 1 |
• Assign appropriate weight to each column.
• Treat the weighted vector of each row as the feature vector of the corresponding object.
With column weights $w_1, w_2, \ldots, w_5$, the rows become the feature vectors:

| Objects | Male | Female | High School Education | College Education | Graduate School Education |
|---|---|---|---|---|---|
| 1 | $w_1$ | 0 | 0 | $w_4$ | 0 |
| 2 | 0 | $w_2$ | $w_3$ | 0 | 0 |
| 3 | $w_1$ | 0 | 0 | 0 | $w_5$ |
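A minimal sketch of this transformation (one-hot expansion scaled by column weights); the category names and weight values are illustrative placeholders, not values from the slides:

```python
import numpy as np

# Hypothetical categorical records: (gender, education).
objects = [("male", "college"), ("female", "high_school"), ("male", "graduate")]

gender_levels = ["male", "female"]
edu_levels = ["high_school", "college", "graduate"]

# Per-column weights (w1..w5 in the table above); values are placeholders.
weights = np.array([1.0, 1.0, 0.5, 0.5, 0.5])

def to_feature_vector(gender, edu):
    """Binary indicator row for one object, scaled by the column weights."""
    row = np.zeros(len(gender_levels) + len(edu_levels))
    row[gender_levels.index(gender)] = 1.0
    row[len(gender_levels) + edu_levels.index(edu)] = 1.0
    return row * weights

features = np.array([to_feature_vector(g, e) for g, e in objects])
print(features)
```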
Application of Data Classification in Bioinformatics
• Data classification has been applied to predict the function and tertiary structure of a protein sequence.
Basics of Protein Structures
• A typical protein consists of hundreds to thousands of amino acids.
• There are 20 basic amino acids, each of which is denoted by one English character.
Three-dimensional Structure of Myoglobin
Source: Lectures of BioInfo by yukijuan
Prediction of Protein Functions and Tertiary Structures
• Given a protein sequence, biochemists are interested in its functions and its tertiary structure.
• The PDB database, which collects proteins with verified tertiary structures, contains ~19,000 proteins.
• The SWISS-PROT database, which collects proteins with verified functions, contains ~110,000 proteins.
• The PIR-PSD database, which collects proteins with verified functions, contains ~280,000 proteins.
• The PIR-NREF database, which collects all protein sequences, contains ~1,060,000 proteins.
Problem Definition of Kernel Smoothing
• Given the values of a function $f(\cdot)$ at a set of samples $S=\{s_1, s_2, \ldots, s_m\}$, we want to find a set of symmetric kernel functions $K(v; c_i, b_i)$ and the corresponding weights $w_i$ such that
$$\hat f(v)=\sum_i w_i\,K(v; c_i, b_i)\cong f(v).$$
Kernel Smoothing with the Spherical Gaussian Functions
• Hartman et al. showed that a linear combination of spherical Gaussian functions can approximate any function with arbitrarily small error.
• “Layered neural networks with Gaussian hidden units as universal approximations”, Neural Computation, Vol. 2, No. 2, 1990.
• With the Gaussian kernel functions, we want to find $w_i$, $\mu_i$, and $\sigma_i$ such that
$$\hat f(v)=\sum_i w_i\exp\!\left(-\frac{\|v-\mu_i\|^2}{2\sigma_i^2}\right)\cong f(v).$$
Problem Definition of Kernel Density Estimation
• Assume that we are given a set of samples taken from a probability distribution in a d-dimensional vector space. The problem now is how to find a linear combination of kernel functions that approximates the probability density function of the distribution.
• The value of the probability density function $f(\cdot)$ at a vector $v$ can be estimated as follows:
$$\hat f(v)=\frac{k}{n}\cdot\frac{\Gamma\!\left(\frac d2+1\right)}{\pi^{d/2}\,R(v)^d},$$
where $n$ is the total number of samples, $R(v)$ is the distance between vector $v$ and its $k$-th nearest sample, and $\frac{\pi^{d/2}R(v)^d}{\Gamma\left(\frac d2+1\right)}$ is the volume of a sphere with radius $R(v)$ in a d-dimensional vector space.
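A small sketch of this k-nearest-neighbor estimate; SciPy's cKDTree and gamma handle the neighbor search and the $\Gamma$ term, and the Gaussian test data is illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density(v, samples, k):
    """k-NN estimate: (k/n) divided by the volume of the d-ball of radius
    R(v) = distance from v to its k-th nearest sample."""
    n, d = samples.shape
    tree = cKDTree(samples)
    r = tree.query(v, k=k)[0][-1]          # distance to the k-th nearest sample
    volume = np.pi ** (d / 2) * r ** d / gamma(d / 2 + 1)
    return (k / n) / volume

rng = np.random.default_rng(0)
samples = rng.standard_normal((1000, 2))   # true density at origin: 1/(2*pi) ~ 0.159
print(knn_density(np.zeros(2), samples, k=20))
```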
A 1-D Example of Kernel Smoothing with the Spherical Gaussian Functions

$$\hat f(x)=\sum_i w_i\exp\!\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right)\cong f(x)$$

[Figure: a 1-D target function approximated by a weighted sum of Gaussian bumps $w_i\exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right)$.]
The Existing Approaches for Kernel Smoothing with Spherical Gaussian Functions

• One conventional approach is to place one Gaussian function at each sample. As a result, the problem becomes how to find $w_i$ and $\sigma_i$ for each sample $s_i$ such that
$$\hat f(v)=\sum_i w_i\exp\!\left(-\frac{\|v-s_i\|^2}{2\sigma_i^2}\right)\cong f(v).$$
• The most widely-used objective is to minimize
$$\sum_{j=1}^{n}\left(\hat f(v_j)-f(v_j)\right)^2,$$
where $v_1, v_2, \ldots, v_n$ are test samples and $S$ is the set of training samples.
• The conventional approach suffers from high time complexity, $O(|S|^3)$, due to the need to compute the inverse of a matrix; the sketch below makes this cost concrete.
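A minimal NumPy sketch of the conventional one-Gaussian-per-sample fit. The dense linear system below is what incurs the roughly cubic cost; the fixed bandwidth and the sine target are illustrative assumptions:

```python
import numpy as np

# One Gaussian per training sample with a fixed bandwidth (an assumption):
# solve min_w ||G w - f||^2 where G[j, i] = exp(-(v_j - s_i)^2 / (2 sigma^2)).
rng = np.random.default_rng(1)
s = np.sort(rng.uniform(-3, 3, 60))            # training samples
f = np.sin(s)                                  # sampled function values
sigma = 0.3

G = np.exp(-(s[:, None] - s[None, :]) ** 2 / (2 * sigma**2))
w = np.linalg.lstsq(G, f, rcond=None)[0]       # dense solve: ~O(|S|^3)

x = np.linspace(-3, 3, 5)
f_hat = np.exp(-(x[:, None] - s[None, :]) ** 2 / (2 * sigma**2)) @ w
print(np.c_[np.sin(x), f_hat])                 # target vs. fit
```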
• M. Orr proposed a number of approaches to reduce the number of units in the hidden layer of the RBF network.
• Beatson et al. proposed $O(n\log n)$ learning algorithms using polyharmonic spline functions.
An O(n) Algorithm for Kernel Smoothing
• In the proposed learning algorithm, we assume uniform sampling. That is, samples are located at the crosses of an evenly-spaced grid in the d-dimensional vector space. Let $\delta$ denote the distance between two adjacent samples.
• If the assumption of uniform sampling does not hold, then some sort of interpolation can be conducted to obtain the approximate function values at the crosses of the grid.
A 2-D Example of Uniform Sampling
The Basic Idea of the O(n) Kernel Smoothing Algorithm
• Under the assumption that the sampling density is sufficiently high, i.e. $\delta\to 0$, the function values at a sample $s_h$ and its $k$ nearest samples $s_1, s_2, \ldots, s_k$ are virtually equal; that is, $f(s_h)\cong f(s_1)\cong\cdots\cong f(s_k)$.
• In other words, $f(x)$ is virtually a constant function equal to $f(s_h)$ in the proximity of $s_h$.
• Accordingly, we can expect that
$$w_h\cong w_1\cong w_2\cong\cdots\cong w_k\quad\text{and}\quad\sigma_h\cong\sigma_1\cong\sigma_2\cong\cdots\cong\sigma_k.$$
A 1-D Example
[Figure: the curve $f(x)$ sampled at grid points spaced $h$ apart; over the window from $\left(\bar i-\frac k2\right)h$ to $\left(\bar i+\frac k2\right)h$, $f(x)\cong f(\bar i h)$.]
• In the 1-D example, samples are located at $x=ih$, where $i$ is an integer.
• Under the assumption that $f(x)\cong f(\bar i h)$ for $|x-\bar i h|\le\frac h2$, we have
$$w_{\bar i-(k/2)}\cong w_{\bar i-(k/2)+1}\cong\cdots\cong w_{\bar i+(k/2)}\quad\text{and}\quad\sigma_{\bar i-(k/2)}\cong\sigma_{\bar i-(k/2)+1}\cong\cdots\cong\sigma_{\bar i+(k/2)}.$$
• The issue now is to find appropriate $w$ and $\sigma$ such that
$$\sum_{i=\bar i-(k/2)}^{\bar i+(k/2)} w\,\exp\!\left(-\frac{(x-ih)^2}{2\sigma^2}\right)\cong f(\bar i h).$$
• If we set $\sigma=h$, then we have
$$\sum_{i=\bar i-(k/2)}^{\bar i+(k/2)} e^{-\frac{(x-ih)^2}{2h^2}}\cong 2.5066,$$
if $k$ is sufficiently large.
• Therefore, with $\sigma=h$, we can set
$$w=\frac{f(\bar i h)}{2.5066}$$
and obtain, for $|x-\bar i h|\le\frac h2$,
$$g(x)=\sum_{i=\bar i-(k/2)}^{\bar i+(k/2)} w\,\exp\!\left(-\frac{(x-ih)^2}{2h^2}\right)\cong f(\bar i h).$$
• In fact, it can be shown that with $\sigma=h$, $\sum_i\exp\left(-\frac{(x-ih)^2}{2h^2}\right)$ is bounded by $2.5066282745\pm 1.35\times 10^{-8}$.
• Therefore, we have the following function approximator:
$$\hat f(x)=\frac{1}{2.5066282745}\sum_{i=\bar i-(k/2)}^{\bar i+(k/2)} f(ih)\,\exp\!\left(-\frac{(x-ih)^2}{2h^2}\right),$$
for a real number $x\in\left(\left(\bar i-\frac12\right)h,\ \left(\bar i+\frac12\right)h\right]$.
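Because the weights are in closed form, the smoother costs a constant number of operations per query point. Below is a minimal NumPy sketch under the uniform-sampling assumption; the window size k and the sine test function are illustrative:

```python
import numpy as np

SQRT_2PI = 2.5066282745  # the constant bounding sum_i exp(-(x - ih)^2 / (2h^2))

def smooth_1d(x, h, f, k=10):
    """Sum Gaussians at the k+1 grid samples nearest x, each weighted by
    f(ih) / 2.5066..., with sigma = h; constant work per query point."""
    i0 = int(round(x / h))                       # index of the nearest sample
    i = np.arange(i0 - k // 2, i0 + k // 2 + 1)
    return np.sum(f(i * h) * np.exp(-(x - i * h) ** 2 / (2 * h**2))) / SQRT_2PI

h = 0.05
for x in (0.0, 0.123, 1.0):
    print(x, smooth_1d(x, h, np.sin), np.sin(x))  # f_hat(x) vs. f(x)
```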
Generalization of the 1-D Kernel Smoothing Function
• We can generalize the result by setting $\sigma=\beta h$, where $\beta$ is a real number.
• The table on the next page shows the bounds of
$$\sum_j\exp\!\left(-\frac{(x-jh)^2}{2\beta^2h^2}\right)$$
with various $\beta$ values.
Bounds of $\sum_j\exp\!\left(-\frac{(x-jh)^2}{2\beta^2h^2}\right)$

| $\beta$ | Bound |
|---|---|
| 0.5 | $1.253\pm 1.8\times 10^{-2}$ |
| 1.0 | $2.5066282745\pm 1.34\times 10^{-8}$ |
| 1.5 | $3.7599424119\pm 2.94\times 10^{-11}$ |
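These bounds are easy to check numerically: the sum is close to $\sqrt{2\pi}\,\beta$ for every $x$. The sketch below slides x across one grid cell and reports the extreme values:

```python
import numpy as np

# Check: sum_j exp(-(x - j*h)^2 / (2 (beta*h)^2)) ~ sqrt(2*pi)*beta for any x.
h, j = 1.0, np.arange(-50, 51)
for beta in (0.5, 1.0, 1.5):
    vals = [np.exp(-(x - j * h) ** 2 / (2 * (beta * h) ** 2)).sum()
            for x in np.linspace(0, h, 11)]     # slide x across one grid cell
    print(beta, np.sqrt(2 * np.pi) * beta, min(vals), max(vals))
```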
An Example of the Effect of Different Setting of β
[Plot of $\frac{1}{\sqrt{2\pi}\cdot 0.5}\sum_j\exp\left(-\frac{(x-jh)^2}{2\cdot 0.5^2 h^2}\right)$: with $\beta=0.5$ the sum oscillates visibly, whereas larger $\beta$ yields a nearly flat response.]
The Smoothing Effect
• The kernel smoothing function is actually a weighted average of the sampled function values. Therefore, selecting a larger $\beta$ value implies that the smoothing effect will be more significant.
• Our suggestion is to set $\beta=1$.
An Example of the Smoothing Effect
[Two plots of target vs. fit over $[-2, 2]$ (y from $-1$ to $1$): the left panel shows the smoothing effect; the right panel shows elimination of the smoothing effect with a compensation procedure.]
The General Form of a Kernel Smoothing Function in the Multi-Dimensional Vector Space

• Under the assumption that the sampling density is sufficiently high, i.e. $\delta\to 0$, the function values at a sample $s_h$ and its $k$ nearest samples $s_1, s_2, \ldots, s_k$ are virtually equal; that is, $f(s_h)\cong f(s_1)\cong\cdots\cong f(s_k)$.
• As a result, we can expect that
$$w_h\cong w_1\cong\cdots\cong w_k\quad\text{and}\quad\sigma_h\cong\sigma_1\cong\cdots\cong\sigma_k,$$
where $w_1,\ldots,w_k$ and $\sigma_1,\ldots,\sigma_k$ are the weights and bandwidths of the Gaussian functions located at $s_1,\ldots,s_k$, respectively.
• Since the influence of a Gaussian function decreases exponentially as the distance increases, we can set $k$ to a value such that, for a vector $v$ in the proximity of sample $s_h$, we have
$$\hat f(v)=\sum_{s_i\in S} w_i\exp\!\left(-\frac{\|v-s_i\|^2}{2\sigma_i^2}\right)\cong\sum_{j=1}^{k} w_j\exp\!\left(-\frac{\|v-s_j\|^2}{2\sigma_j^2}\right).$$
• Since we have $w_h\cong w_1\cong\cdots\cong w_k$ and $\sigma_h\cong\sigma_1\cong\cdots\cong\sigma_k$, our objective is to find $w_h$ and $\sigma_h$ such that
$$\hat f(v)\cong\sum_{j=1}^{k} w_j\exp\!\left(-\frac{\|v-s_j\|^2}{2\sigma_j^2}\right)\cong\sum_{j=1}^{k} w_h\exp\!\left(-\frac{\|v-s_j\|^2}{2\sigma_h^2}\right)\cong f(v).$$
• Let
$$g(v)=\sum_{(j_1,j_2,\ldots,j_d)} w_h\exp\!\left(-\frac{(x_1-j_1\delta)^2+\cdots+(x_d-j_d\delta)^2}{2\sigma_h^2}\right),$$
where the sum runs over the integer grid indices $(j_1, j_2, \ldots, j_d)$ of the samples. Then, we have
$$g(v)\cong\sum_{j=1}^{k} w_h\exp\!\left(-\frac{\|v-s_j\|^2}{2\sigma_h^2}\right)\cong\hat f(v).$$
• We have
$$\sum_{j_1,\ldots,j_d}\exp\!\left(-\frac{(x_1-j_1\delta)^2+\cdots+(x_d-j_d\delta)^2}{2\sigma_h^2}\right)=\prod_{i=1}^{d}\left(\sum_{j_i}\exp\!\left(-\frac{(x_i-j_i\delta)^2}{2\sigma_h^2}\right)\right),$$
where $v=(x_1,x_2,\ldots,x_d)$.
• If we set $\sigma_h=\delta$, then each factor $\sum_{j}\exp\left(-\frac{(x-j\delta)^2}{2\delta^2}\right)$ is bounded by $2.50662827\pm 1.3545\times 10^{-8}$.
• Accordingly, the d-dimensional sum is bounded by $\left(2.50662827\pm 1.3545\times 10^{-8}\right)^d$.
• Therefore, with $\sigma_h=\delta$, $g(v)$ is virtually a constant function and
$$\hat f(v)\cong g(v)\cong\left(2.5066282745\right)^d w_h,\quad\text{for } v \text{ in the proximity of } s_h.$$
• Accordingly, we want to set
$$w_h=\frac{f(s_h)}{\left(2.5066282745\right)^d}.$$
• Finally, by setting $\sigma_i$ uniformly to $\delta$, $i=1,2,\ldots,n$, we obtain the following kernel smoothing function that approximates $f(v)$:
$$\hat f(v)=\sum_{s_i}\frac{f(s_i)}{\left(2.5066282745\right)^d}\exp\!\left(-\frac{\|v-s_i\|^2}{2\delta^2}\right),$$
where $s_1, s_2, \ldots, s_k$ are the $k$ nearest samples of the object located at $v$.
• Generally speaking, if we set $\sigma_i$ uniformly to $\beta\delta$, $i=1,2,\ldots,n$, we will obtain
$$\hat f(v)\cong\frac{1}{\left(\sqrt{2\pi}\,\beta\right)^d}\sum_{s_i} f(s_i)\exp\!\left(-\frac{\|v-s_i\|^2}{2\beta^2\delta^2}\right),$$
where $s_1, s_2, \ldots, s_k$ are the $k$ nearest samples of the object located at $v$.
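A minimal NumPy/SciPy sketch of this d-dimensional smoother under the uniform-grid assumption; the grid, the test function, and the k = 50 neighbor cutoff are illustrative choices:

```python
import numpy as np
from scipy.spatial import cKDTree

def kernel_smooth(v, samples, values, delta, beta=1.0, k=50):
    """f_hat(v) = (1 / (sqrt(2*pi)*beta)^d) * sum over the k nearest samples
    of f(s_i) * exp(-|v - s_i|^2 / (2 (beta*delta)^2))."""
    d = samples.shape[1]
    dist, idx = cKDTree(samples).query(v, k=k)
    g = np.exp(-dist ** 2 / (2 * (beta * delta) ** 2))
    return (values[idx] * g).sum() / (np.sqrt(2 * np.pi) * beta) ** d

# Uniform 2-D grid with spacing delta, f(x, y) = x^2 + y.
delta = 0.1
gx, gy = np.meshgrid(np.arange(-2, 2, delta), np.arange(-2, 2, delta))
samples = np.c_[gx.ravel(), gy.ravel()]
values = samples[:, 0] ** 2 + samples[:, 1]
print(kernel_smooth(np.array([0.5, 0.25]), samples, values, delta))  # ~ 0.5
```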
Application in Data Classification
• One of the applications of the RBF network is data classification.
• However, recent development in data classification has focused on support vector machines (SVM), due to accuracy concerns.
• In this paper, we propose an RBF network based data classifier that delivers the same level of accuracy as the SVM and enjoys some additional advantages.
The Proposed RBF Network Based Classifier
• The proposed algorithm constructs one RBF network to approximate the probability density function of each class of objects, based on the kernel smoothing algorithm just presented.
The Proposed Kernel Density Estimation Algorithm for Data Classification

• Classification of a new object is conducted based on the likelihood function:
$$L_m(v)=\frac{|S_m|}{|S|}\,\hat f_m(v),$$
where $|S_m|$ and $|S|$ are the number of training samples of class $m$ and the total number of training samples of all classes, respectively, and $\hat f_m(\cdot)$ is the approximate probability density function of class-$m$ objects.
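As a sketch, the likelihood rule simply weights each class-conditional density estimate by the class prior and picks the argmax. The fixed-bandwidth gaussian_density below is a stand-in; the paper's adaptive-bandwidth estimator follows on the next slides:

```python
import numpy as np

def gaussian_density(v, samples, sigma=1.0):
    """Placeholder class-conditional density: average of fixed-width Gaussians."""
    d = samples.shape[1]
    d2 = np.sum((samples - v) ** 2, axis=1)
    return np.mean(np.exp(-d2 / (2 * sigma**2))) / (np.sqrt(2 * np.pi) * sigma) ** d

def classify(v, class_samples):
    """Pick the class m maximizing L_m(v) = (|S_m| / |S|) * f_m_hat(v)."""
    total = sum(len(s) for s in class_samples.values())
    scores = {m: len(s) / total * gaussian_density(v, s)
              for m, s in class_samples.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(2)
data = {"O": rng.normal([0, 0], 1, (30, 2)), "X": rng.normal([4, 4], 1, (40, 2))}
print(classify(np.array([0.5, 0.5]), data))  # expected: "O"
```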
• Let us adopt the following estimation of the value of the probability density function at each training sample $s_i$:
$$\hat f_m(s_i)\cong\frac{k-1}{|S_m|}\cdot\frac{\Gamma\!\left(\frac d2+1\right)}{\pi^{d/2}\,R_k(s_i)^d},$$
where (1) $f_m(\cdot)$ is the probability density function of class-$m$ objects; (2) $R_k(s_i)$ is the distance between training sample $s_i$ and its $k$-th nearest training sample of the same class; (3) $d$ is the dimension of the vector space.
• In the kernel smoothing problem, we set the bandwidth of each Gaussian function uniformly to $\beta\delta$, where $\delta$ is the distance between two adjacent training samples.
• In the kernel density estimation problem, for each training sample $s_i$ we need to determine $\delta_i$, the average distance between two adjacent training samples of the same class in the local region.
• In the d-dimensional vector space, if the average distance between samples is $\delta_i$, then the number of samples in a subspace of volume $V$ is approximately equal to $\frac{V}{\delta_i^d}$.
• Accordingly, we can estimate $\delta_i$ by
$$\hat\delta_i=\frac{\sqrt{\pi}\;R_k(s_i)}{\left[(k-1)\,\Gamma\!\left(\frac d2+1\right)\right]^{1/d}}.$$
• Accordingly, with the kernel smoothing function that we obtained earlier, we have the following approximate probability density function for class-$m$ objects:
$$\hat f_m(v)\cong\frac{1}{|S_m|}\sum_{i=1}^{k}\frac{1}{\left(\sqrt{2\pi}\,\sigma_i\right)^d}\exp\!\left(-\frac{\|v-s_i\|^2}{2\sigma_i^2}\right),$$
where $s_1, s_2, \ldots, s_k$ are the $k$ nearest class-$m$ training samples of the object located at $v$,
$$\hat\delta_i=\frac{\sqrt{\pi}\;R_k(s_i)}{\left[(k-1)\,\Gamma\!\left(\frac d2+1\right)\right]^{1/d}},\qquad\sigma_i=\beta\hat\delta_i.$$
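A sketch of this estimator in NumPy/SciPy, using the reconstructed $\hat\delta_i$ formula above (treat that formula and the parameter values k' = 30, k = 10, β = 0.7 as assumptions, not the paper's tuned settings):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def rbf_class_density(v, samples, k1=30, k2=10, beta=0.7):
    """f_m_hat(v) ~ (1/|S_m|) * sum over the k1 nearest class-m samples of
    a Gaussian with per-sample bandwidth sigma_i = beta * delta_i_hat."""
    n, d = samples.shape
    tree = cKDTree(samples)
    dist, idx = tree.query(v, k=k1)
    # R_k(s_i): distance from s_i to its k2-th nearest same-class sample
    # (k2 + 1 neighbors queried because the nearest neighbor is s_i itself).
    r = tree.query(samples[idx], k=k2 + 1)[0][:, -1]
    # delta_i_hat: local sample spacing (reconstructed formula; an assumption).
    delta = np.sqrt(np.pi) * r / ((k2 - 1) * gamma(d / 2 + 1)) ** (1 / d)
    sigma = beta * delta
    g = np.exp(-dist ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma) ** d
    return g.sum() / n

rng = np.random.default_rng(3)
samples = rng.standard_normal((2000, 2))   # true density at origin: ~0.159
print(rbf_class_density(np.zeros(2), samples))
```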
• An interesting observation is that, regardless of the value of $\sigma_i$, each kernel integrates to one over the vector space:
$$\int\frac{1}{\left(\sqrt{2\pi}\,\sigma_i\right)^d}\exp\!\left(-\frac{\|v-s_i\|^2}{2\sigma_i^2}\right)dv=1.$$
• If the observation holds generally, then
$$\int\hat f_m(v)\,dv\cong 1,$$
i.e., $\hat f_m(\cdot)$ behaves like a proper probability density function.
• In the discussion above, $R_k(s_i)$ is defined to be the distance between sample $s_i$ and its $k$-th nearest training sample.
• However, this definition depends on only one single sample and tends to be unreliable if the data set is noisy.
• We can replace $R_k(s_i)$ with
$$\bar R_k(s_i)=\frac{d+1}{d}\cdot\frac1k\sum_{j=1}^{k}\|s_i-s_j\|,$$
where $s_1, s_2, \ldots, s_k$ are the $k$ nearest training samples of the same class as $s_i$.
Parameter Tuning
• The discussions so far are based on the assumption that the sampling density is sufficiently high, which may not hold for some real data sets.
• Three parameters, namely $d'$, $k'$, and $k''$, are incorporated in the learning algorithm:
$$\hat f_m(v)\cong\frac{1}{|S_m|}\sum_{i=1}^{k'}\frac{1}{\left(\sqrt{2\pi}\,\sigma_i\right)^{d'}}\exp\!\left(-\frac{\|v-s_i\|^2}{2\sigma_i^2}\right),$$
$$\bar R_{k''}(s_i)=\frac{d+1}{d}\cdot\frac{1}{k''}\sum_{j=1}^{k''}\|s_i-s_j\|,$$
where $s_1, s_2, \ldots, s_{k''}$ are the $k''$ nearest training samples of the same class as $s_i$.
• One may wonder how $\beta$ should be set.
• According to our experimental results, the value of $\beta$ has essentially no effect, as long as $\beta$ is set to a value within $[0.6, 2.0]$.
Time Complexity
• The average time complexity of constructing an RBF network is $O(dn\log n + kn\log n)$ if the k-d tree structure is employed, where $n$ is the number of training samples.
• The time complexity of classifying $c$ new objects with unknown class is $O(dn\log n + ck\log n)$.
Comparison of Classification Accuracy on the 6 Smaller Data Sets

| Data sets | Proposed algorithm | SVM | 1NN | 3NN |
|---|---|---|---|---|
| 1. iris (150) | 97.33 (k' = 24, k'' = 14, d' = 5, β = 0.7) | 97.33 | 94.0 | 94.67 |
| 2. wine (178) | 99.44 (k' = 3, k'' = 16, d' = 1, β = 0.7) | 99.44 | 96.08 | 94.97 |
| 3. vowel (528) | 99.62 (k' = 15, k'' = 1, d' = 1, β = 0.7) | 99.05 | 99.43 | 97.16 |
| 4. segment (2310) | 97.27 (k' = 25, k'' = 1, d' = 1, β = 0.7) | 97.40 | 96.84 | 95.98 |
| Avg. 1-4 | 98.42 | 98.31 | 96.59 | 95.70 |
| 5. glass (214) | 75.74 (k' = 9, k'' = 3, d' = 2, β = 0.7) | 71.50 | 69.65 | 72.45 |
| 6. vehicle (846) | 73.53 (k' = 13, k'' = 8, d' = 2, β = 0.7) | 86.64 | 70.45 | 71.98 |
| Avg. 1-6 | 90.49 | 91.89 | 87.74 | 87.87 |
Comparison of Classification Accuracy on the 3 Larger Data Sets

| Data sets (train, test) | Proposed algorithm | SVM | 1NN | 3NN |
|---|---|---|---|---|
| 7. satimage (4435, 2000) | 92.30 (k' = 6, k'' = 26, d' = 1, β = 0.7) | 91.30 | 89.35 | 90.6 |
| 8. letter (15000, 5000) | 97.12 (k' = 28, k'' = 28, d' = 2, β = 0.7) | 97.98 | 95.26 | 95.46 |
| 9. shuttle (43500, 14500) | 99.94 (k' = 18, k'' = 1, d' = 3, β = 0.7) | 99.92 | 99.91 | 99.92 |
| Avg. 7-9 | 96.45 | 96.40 | 94.84 | 95.33 |
Data Reduction
• As the proposed learning algorithm is instance-based, removal of redundant training samples will lower the complexity of the RBF network.
• The effect of a naïve data reduction mechanism was studied.
• The naïve mechanism removes a training sample if all of its 10 nearest samples belong to the same class as that sample, as sketched below.
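A minimal sketch of this naïve mechanism; the two-Gaussian toy data is illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

def reduce_training_set(samples, labels, k=10):
    """Drop a sample if all of its k nearest samples (excluding itself)
    share its class; only samples near class boundaries are kept."""
    tree = cKDTree(samples)
    idx = tree.query(samples, k=k + 1)[1][:, 1:]   # skip column 0 (self)
    interior = np.all(labels[idx] == labels[:, None], axis=1)
    keep = ~interior
    return samples[keep], labels[keep]

rng = np.random.default_rng(4)
X = np.r_[rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))]
y = np.r_[np.zeros(200, int), np.ones(200, int)]
Xr, yr = reduce_training_set(X, y)
print(len(X), "->", len(Xr))   # most interior samples are removed
```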
Effect of Data Reduction

| | satimage | letter | shuttle |
|---|---|---|---|
| # of training samples in the original data set | 4435 | 15000 | 43500 |
| # of training samples after data reduction is applied | 1815 | 7794 | 627 |
| % of training samples remaining | 40.92% | 51.96% | 1.44% |
| Classification accuracy after data reduction is applied | 92.15% | 96.18% | 99.32% |
| Degradation of accuracy due to data reduction | -0.15% | -0.94% | -0.62% |
| | # of training samples after data reduction is applied | # of support vectors identified by LIBSVM |
|---|---|---|
| satimage | 1815 | 1689 |
| letter | 7794 | 8931 |
| shuttle | 627 | 287 |
Execution Times (in seconds)

| | | Proposed algorithm without data reduction | Proposed algorithm with data reduction | SVM |
|---|---|---|---|---|
| Cross validation | satimage | 670 | 265 | 64622 |
| | letter | 2825 | 1724 | 386814 |
| | shuttle | 96795 | 59.9 | 467825 |
| Make classifier | satimage | 5.91 | 0.85 | 21.66 |
| | letter | 17.05 | 6.48 | 282.05 |
| | shuttle | 1745 | 0.69 | 129.84 |
| Test | satimage | 21.3 | 7.4 | 11.53 |
| | letter | 128.6 | 51.74 | 94.91 |
| | shuttle | 996.1 | 5.85 | 2.13 |
Conclusions
• A novel learning algorithm for data classification with the RBF network is proposed.
• The proposed RBF network based data classification algorithm delivers the same level of accuracy as the SVM.
• The time complexity of constructing an RBF network with the proposed algorithm is $O(dn\log n + kn\log n)$, which is much lower than that required by the SVM.
• The proposed RBF network based classifier can handle data sets with multiple classes directly.
• It is of interest to develop further data reduction mechanisms.
• The PowerPoint file of this presentation can be downloaded from
syslab.csie.ntu.edu.tw
• An extended version of the presented paper can be downloaded at the same address.