
Stochastic Meta Descent in Online Kernel Methods

Supawan Phonphitakchai
Department of Electrical and Computer Engineering, Naresuan University, Thailand
[email protected]

Tony J. Dodd
Department of Automatic Control and Systems Engineering, The University of Sheffield, UK
t.j.dodd@sheffield.ac.uk

Abstract—A learning system approximates an underlying function from a finite set of observation data. Since batch learning has a disadvantage in dealing with large data sets, online learning is proposed to avoid the computational expense. An iterative method called Stochastic Gradient Descent (SGD) is applied to solve for the underlying function on reproducing kernel Hilbert spaces (RKHSs). To use SGD in a time-varying environment, the learning rate is adjusted by Stochastic Meta Descent (SMD). The simulation results show that SMD can follow shifting and switching target functions, while the size of the model can be restricted using a sparse solution.

I. INTRODUCTION

The objective of learning is to approximate an underlying function from observation data perturbed by noise. Once the unknown function is accurately estimated, it can be applied to predict future outputs from a given input. Generally, batch learning is assumed, but this is very inefficient in the case of large data sets. To overcome this problem, online learning can be employed. This framework uses only one sample at a time and then discards it after learning. Therefore, online learning does not need to store the whole set of samples.

Reproducing kernel Hilbert spaces (RKHSs) show advantages in solving a number of well-known problems in signal processing, control, machine learning and function approximation. The strong theoretical foundations of RKHSs yield global and usually unique solutions, since training is formulated as a convex optimisation problem [1]. Recent interest in kernel methods has been dominated by the machine learning community, e.g. support vector machines, Gaussian processes and regularisation networks.

The first contribution of this work is to present online learning in an RKHS by estimating the approximating function from given data (supervised learning) generated from a time-varying function. The iterative method called Stochastic Gradient Descent (SGD) is applied to solve for the approximating function in a non-stationary environment by adjusting the learning rate using the idea of Stochastic Meta Descent (SMD). The other contribution is to restrict the size of the approximating function through a sparse solution using an orthogonal projection method in the non-stationary environment.

The next section provides the general framework of RKHSs, followed by the SGD method used to estimate the approximating function. SMD for adjusting the learning rate and the method for bounding the size of the model are described in Sections IV and V respectively. Finally, the simulations in Section VI show results of using SMD to model a time-varying function, illustrated in two examples, and the paper ends with the conclusions.

II. PRELIMINARIES AND NOTATION

A set of samples is given in the form of observation data $\{x_i, z_i\}_{i=1}^{N} \in X \times Z$. The unknown function $f : X \to \mathbb{R}$ is assumed to belong to some RKHS, $\mathcal{H}_k$, defined on $X$, whereas the space of all possible observations is $Z$. Neglecting errors, the observations arise as follows

$z_i = L_i f$    (1)

where $\{L_i\}_{i=1}^{N}$ is a set of linear evaluation functionals in unique correspondence with the $z_i$. The complete set of $z_i$ can be expressed by [2]

$z = \sum_{i=1}^{N} (L_i f) e_i = L f$    (2)

where $e_i \in \mathbb{R}^N$ is the $i$th standard basis vector.

An RKHS can be defined as a Hilbert space of functions on $X$ with the property that, for each $x \in X$, the evaluation functional $L_i$, which associates $f$ with $f(x_i)$, $L_i f \to f(x_i)$, is a bounded linear functional. Boundedness means that there exists a scalar $M$ such that

$|L_i f| = |f(x_i)| \le M \|f\|_{\mathcal{H}_k}$ for all $f$ in the RKHS    (3)

where $\|\cdot\|_{\mathcal{H}_k}$ is the norm in the RKHS [3]. Following from the Riesz representation theorem, the observations are given by [3]

$L_i f = \langle k_i, f \rangle = f(x_i), \quad f \in \mathcal{H}_k$    (4)

where $k_i$ depends only on $x_i$. Therefore the learning problem can be stated as follows: given an RKHS of functions $\mathcal{H}_k$, the set of functions $\{k_i\}_{i=1}^{N} \in \mathcal{H}_k$, and the observations $\{x_i, z_i\}_{i=1}^{N}$, find a function $f \in \mathcal{H}_k$ such that (4) is satisfied.

The $k_i$ is a positive definite function called the reproducing kernel (r.k.) of $\mathcal{H}_k$, which is the unique representation of evaluation at $x_i$ [4]. The function $k : X \times X \to \mathbb{R}$ can be defined by the Riesz theorem such that

$\langle k(x_i, \cdot), k(x_j, \cdot) \rangle_{\mathcal{H}_k} = k(x_i, x_j)$    (5)

where $k(x_i, \cdot)$ is a function on $X$ centred on $x_i$. There are several examples of kernel functions, e.g. the polynomial, Gaussian and sigmoid kernels. We can then represent any function with reproducing kernel $k$ in the form

$f(\cdot) = \sum_{i=1}^{N} \alpha_i k(x_i, \cdot)$    (6)

for $N \in \mathbb{Z}^+$ and $\alpha_i \in \mathbb{R}$. This expression defines a finite dimensional subspace of $\mathcal{H}_k$, and the r.k.s are a basis for the RKHS. The learning problem then reduces to that of estimating the parameters, $\alpha_i$, in (6).
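As a concrete illustration of the representation in (6), the following is a minimal Python sketch (ours, not taken from the paper) that evaluates a kernel expansion with a Gaussian kernel; the kernel width and all names are assumptions made for illustration only.

```python
import numpy as np

def gaussian_kernel(x, centre, sigma=1.0):
    """Gaussian reproducing kernel k(x_i, x) centred on `centre` (assumed kernel choice)."""
    return np.exp(-(x - centre) ** 2 / (2.0 * sigma ** 2))

def evaluate_expansion(x, centres, alphas, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i k(x_i, x) as in (6)."""
    return sum(a * gaussian_kernel(x, c, sigma) for a, c in zip(alphas, centres))

# Example: a two-term expansion
print(evaluate_expansion(0.3, centres=[0.0, 1.0], alphas=[0.5, -0.2]))
```

The Gaussian kernel is only one of the kernel choices mentioned above; any positive definite kernel could be substituted.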

III. FUNCTION APPROXIMATION WITH SGD

In online learning, at each iteration we observe only a part of $Z$, denoted $z_n$ (typically the $n$th observation). The associated linear evaluation functional at the present time is then [5]

$L_n f = z_n.$    (7)

The function approximation of $f$ at time $n$ can be achieved by minimising the instantaneous non-negative functional

$g_{reg}(f_n) = \frac{1}{2}\|L_{n+1} f_n - z_{n+1}\|^2 + \frac{\rho}{2}\|f_n\|^2$    (8)

where $\frac{\rho}{2}\|f_n\|^2$ is a regularisation term and $\rho \ge 0$ is a regularisation parameter. The regularisation term is added to prevent the ill-posed situation in which the estimated function fits the observation data very well but cannot fit unforeseen data.
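For concreteness, the instantaneous objective (8) can be computed for a kernel expansion of the form (6) as in the short sketch below; this is our own illustration, assuming a Gaussian kernel, and the names are not from the paper.

```python
import numpy as np

def instantaneous_loss(centres, alphas, x_next, z_next, rho, sigma=1.0):
    """g_reg(f_n) in (8): squared prediction error plus the RKHS-norm penalty."""
    centres_a = np.asarray(centres, dtype=float)
    alphas_a = np.asarray(alphas, dtype=float)
    k_vec = np.exp(-(centres_a - x_next) ** 2 / (2 * sigma ** 2))      # k_i(x_{n+1})
    residual = float(alphas_a @ k_vec) - z_next                        # L_{n+1} f_n - z_{n+1}
    K = np.exp(-np.subtract.outer(centres_a, centres_a) ** 2 / (2 * sigma ** 2))
    norm_sq = float(alphas_a @ K @ alphas_a)                           # ||f_n||^2 in the RKHS
    return 0.5 * residual ** 2 + 0.5 * rho * norm_sq
```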

The minimisation of (8) can be performed using SGD, such that the function update, $f_{n+1}$, is calculated at each incoming observation, $(x_{n+1}, z_{n+1})$, by

$f_{n+1} = f_n - \eta_n \nabla g_{reg}(f_n)$    (9)

where $\nabla g_{reg}(f_n)$ is the instantaneous gradient of $g_{reg}$, i.e. the direction of gradient descent at $f_n$, and $\eta_n$ is the learning rate. Hence, we obtain the function update

$f_{n+1} = (1 - \eta_n \rho_n) f_n - \eta_n L^*_{n+1}(L_{n+1} f_n - z_{n+1})$    (10)

where $L^*$ is the adjoint operator of $L$ defined by $\langle Lf, z \rangle = \langle f, L^* z \rangle$. For some constant $a$, $L^*_{n+1} a = k_{n+1} a$, and also $L_{n+1} f_n = f_n(x_{n+1})$, therefore

$f_{n+1} = (1 - \eta_n \rho_n) f_n - \eta_n k_{n+1}(f_n(x_{n+1}) - z_{n+1}).$    (11)

From (6), the function update can be written in the form of kernel functions as

$f_{n+1}(x) = (1 - \eta_n \rho_n)\sum_{i=1}^{n} \alpha^i_n k_i(x) - \eta_n e_{n+1} k_{n+1}(x) = \sum_{i=1}^{n+1} \alpha^i_{n+1} k_i(x)$    (12)

where the prediction error $e_{n+1} = L_{n+1} f_n - z_{n+1}$ and $L^*_{n+1} e_{n+1} = k_{n+1} e_{n+1}$. Hence, the parameters $\alpha^i_{n+1}$ are updated from

$\alpha^i_{n+1} = \begin{cases} (1 - \eta_n \rho)\alpha^i_n & \text{for } i \le n \\ -\eta_n e_{n+1} & \text{for } i = n+1. \end{cases}$    (13)

In SGD, the learning rate, $\eta_n$, plays an important role as the step size of the move in the direction that minimises $g_{reg}(f_n)$. The value of the learning rate is critical: small values can lead to slow minimisation, whereas large values may cause divergence. Moreover, when SGD is used in time-varying systems, the learning rate has to be adapted online. Based on this idea, we propose SMD to adjust the learning rate so that the unknown system can be modelled in time-varying environments.
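To make the coefficient update (11)-(13) concrete, here is a minimal sketch of one online SGD step, assuming a Gaussian kernel; the function and variable names are ours, not the authors'.

```python
import numpy as np

def sgd_step(centres, alphas, x_new, z_new, eta, rho, sigma=1.0):
    """One SGD update of the kernel expansion, following (11)-(13)."""
    centres_a = np.asarray(centres, dtype=float)
    alphas_a = np.asarray(alphas, dtype=float)
    k_vec = np.exp(-(centres_a - x_new) ** 2 / (2 * sigma ** 2))   # k_i(x_{n+1})
    e = float(alphas_a @ k_vec) - z_new                            # e_{n+1} = f_n(x_{n+1}) - z_{n+1}
    new_alphas = list((1.0 - eta * rho) * alphas_a) + [-eta * e]   # parameter update (13)
    new_centres = list(centres) + [x_new]                          # new kernel centred on x_{n+1}
    return new_centres, new_alphas, e
```

Each call appends one kernel term, which is exactly the unbounded growth addressed by the sparse solution in Section V.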

IV. LEARNING RATE ADAPTATION USING SMD

This method avoids negative values of the learning rate by making updates via an exponential gradient algorithm [6], [7]. In addition, all past gradients are taken into account when updating the present parameter, which means that information from every past datum is incorporated in the update rule.

The local learning rate is best adapted by exponential gradient descent, so that it can cover a wide dynamic range whilst staying strictly positive [6]. The learning rate is updated as follows

$\ln \eta_n = \ln \eta_{n-1} - \mu \frac{\partial g_{reg}(f_n)}{\partial \ln \eta_{n-1}} = \ln \eta_{n-1} - \mu \frac{\partial g_{reg}(f_n)}{\partial f_n} \cdot \frac{\partial f_n}{\partial \ln \eta_{n-1}}$    (14)

and therefore

$\eta_n = \eta_{n-1} \cdot \exp(-\mu \langle h_n, v_n \rangle)$    (15)

where $h_n \equiv \frac{\partial g_{reg}(f_n)}{\partial f_n}$, $v_n \equiv \frac{\partial f_n}{\partial \ln \eta_{n-1}}$ and $\mu$ is a global meta learning rate.

To simplify the computation, the linearisation $e^u \approx 1 + u$, valid for small $|u|$, is used [7]. Hence, the update rule for the learning rate is

$\eta_n = \eta_{n-1} \cdot \max\left(\frac{1}{2},\, 1 - \mu \langle h_n, v_n \rangle\right).$    (16)
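The safeguarded, linearised update (16) translates directly into code; a minimal sketch (our naming) is:

```python
def update_learning_rate(eta_prev, mu, h_dot_v):
    """SMD learning-rate update (16): a linearised exponential-gradient step,
    clipped below at half the previous rate so that eta stays strictly positive."""
    return eta_prev * max(0.5, 1.0 - mu * h_dot_v)
```

The inner product $\langle h_n, v_n \rangle$ supplied here is derived in (25) below.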

The gradient trace, $v$, measures the effect that the learning rate makes on the function update. Here, we consider the effect of all changes up to time $n$ in the form of an exponential average. Therefore, the gradient trace is given by the summation of gradients up to time $n$ [7]

$v_{n+1} \equiv \sum_{i=0}^{n} \lambda^i \frac{\partial f_{n+1}}{\partial \ln \eta_{n-i}}$    (17)

where $\lambda \in [0, 1]$ is a forgetting factor. Substituting $f_{n+1}$ given in (9),

$v_{n+1} = \sum_{i=0}^{n} \lambda^i \frac{\partial f_n}{\partial \ln \eta_{n-i}} - \sum_{i=0}^{n} \lambda^i \frac{\partial (\eta_n \cdot h_n)}{\partial \ln \eta_{n-i}} \approx \lambda v_n - \eta_n \cdot h_n - \eta_n \sum_{i=0}^{n} \lambda^i \frac{\partial h_n}{\partial f_n} \cdot \frac{\partial f_n}{\partial \ln \eta_{n-i}} = \lambda v_n - \eta_n (h_n + \lambda H_n v_n)$    (18)

where $H_n$ denotes the Hessian of $g_{reg}(f_n)$ at time $n$ [7].

Considering the update rule in (16), we need to define the dot product of $h_n$ and $v_n$. Here,

$h_n = \frac{\partial g_{reg}(f_n)}{\partial f_n} = \nabla g_{reg}(f_n) = L^*_{n+1}(L_{n+1} f_n - z_{n+1}) + \rho_n f_n.$    (19)

The Hessian matrix $H_n$ is given by

$H_n = \frac{\partial h_n}{\partial f_n} = \frac{\partial^2 g_{reg}(f_n)}{\partial f_n^2} = L^*_{n+1} L_{n+1} + \rho.$    (20)

Therefore, the update rule for the gradient trace given in (18) is equivalent to

$v_{n+1} = \lambda v_n - \eta_n (L^*_{n+1} e_{n+1} + \rho f_n) - \eta_n \lambda (L^*_{n+1} L_{n+1} + \rho) v_n = (1 - \rho \eta_n)\lambda v_n - \eta_n \rho f_n - \eta_n \lambda k_{n+1} v_n(x_{n+1}) - \eta_n k_{n+1} e_{n+1}.$    (21)

The gradient trace can also be represented in the form of r.k.s as

$v_{n+1} = \sum_{i=1}^{n+1} \omega^i_{n+1} k_i$    (22)

where $\omega^i \in \mathbb{R}$. Combining (21) and (22), the gradient trace in the form of reproducing kernels and its parameters are given by

$v_{n+1} = (1 - \rho \eta_n)\lambda \sum_{i=1}^{n} \omega^i_n k_i - \eta_n \rho \sum_{i=1}^{n} \alpha^i_n k_i - \eta_n k_{n+1}(\lambda v_n(x_{n+1}) + e_{n+1})$    (23)

and the parameter vector $\omega_{n+1}$ has the update

$\omega^i_{n+1} = \begin{cases} (1 - \eta_n \rho)\lambda \omega^i_n - \eta_n \rho \alpha^i_n & \text{for } i \le n \\ -\eta_n \lambda v_n(x_{n+1}) - \eta_n e_{n+1} & \text{for } i = n+1. \end{cases}$    (24)

Therefore, the inner product $\langle h_n, v_n \rangle$ is as follows

$\langle h_n, v_n \rangle = \langle L^*_{n+1} e_{n+1} + \rho f_n, v_n \rangle = e_{n+1} v_n(x_{n+1}) + \rho \alpha_n^T K \omega_n$    (25)

where $\langle f_n, v_n \rangle = \langle \sum_{i=1}^{n} \alpha^i_n k_i, \sum_{i=1}^{n} \omega^i_n k_i \rangle = \alpha_n^T K \omega_n$. Hence, the update rule for the learning rate is

$\eta_n = \eta_{n-1} \cdot \max\left(\frac{1}{2},\, 1 - \mu\left(e_{n+1} v_n(x_{n+1}) + \rho \alpha_n^T K \omega_n\right)\right).$    (26)
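Putting (13), (24) and (26) together, one SMD iteration first updates the learning rate and then the expansion and gradient-trace coefficients. The sketch below is our own illustration, assuming a Gaussian kernel and a model that already contains at least one kernel term; it is not the authors' code.

```python
import numpy as np

def smd_step(centres, alphas, omegas, x_new, z_new, eta, rho, mu, lam, sigma=1.0):
    """One combined SGD + SMD step in coefficient form, following (13), (24) and (26)."""
    centres_a = np.asarray(centres, dtype=float)
    alphas_a = np.asarray(alphas, dtype=float)
    omegas_a = np.asarray(omegas, dtype=float)

    k_vec = np.exp(-(centres_a - x_new) ** 2 / (2 * sigma ** 2))   # k_i(x_{n+1})
    e = float(alphas_a @ k_vec) - z_new                            # prediction error e_{n+1}
    v_x = float(omegas_a @ k_vec)                                  # v_n(x_{n+1})

    # Learning-rate update (26): <h_n, v_n> = e_{n+1} v_n(x_{n+1}) + rho alpha_n^T K omega_n
    K = np.exp(-np.subtract.outer(centres_a, centres_a) ** 2 / (2 * sigma ** 2))
    h_dot_v = e * v_x + rho * float(alphas_a @ K @ omegas_a)
    eta = eta * max(0.5, 1.0 - mu * h_dot_v)

    # Gradient-trace update (24) and coefficient update (13), both using the new eta
    new_omegas = list((1 - eta * rho) * lam * omegas_a - eta * rho * alphas_a) + [-eta * lam * v_x - eta * e]
    new_alphas = list((1 - eta * rho) * alphas_a) + [-eta * e]
    return list(centres) + [x_new], new_alphas, new_omegas, eta
```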

V. SPARSE SOLUTION

Considering the framework of function approximation presented in Section III, at each iteration the function update is augmented with a new kernel and parameter, so the size of the model obviously grows without bound. To overcome this problem, we need to restrict the model growth by [5]
1) adding only kernels that significantly reduce the approximation error (incremental step), and
2) removing kernels that no longer contribute significant error (decremental step).
A method based on orthogonal projections is applied to do both.

A. Incremental Step

The function update $f_n$ belongs to the RKHS and is a linear combination of $n$ kernels, so it lies in an $n$-dimensional subspace, $f_n \in \mathcal{H}_{k_n}$. In the same way, when a new kernel is added to $f_n$, the new function update belongs to the subspace $\mathcal{H}_{k_{n+1}}$ such that $f_{n+1} \in \mathcal{H}_{k_{n+1}}$. However, the update can be performed without adding a new kernel by using an orthogonal projection, in which $f_{n+1}$ is projected onto $\mathcal{H}_{k_n}$, denoted by $f^\perp_{n+1}$. The orthogonal projection $f^\perp_{n+1}$ takes into account the effect of the new data point, but the size of the model is maintained such that the number of kernel terms remains $n$.

The orthogonal projection $f^\perp_{n+1}$ is chosen when the squared norm of the difference, $\|f_{n+1} - f^\perp_{n+1}\|^2$, is lower than an a priori defined threshold, $\kappa_{inc}$; otherwise the function update becomes $f_{n+1}$.

Considering the orthogonal projection of $f_{n+1}$ onto $\mathcal{H}_{k_n}$, we then have $f_{n+1} - f^\perp_{n+1} \perp \mathcal{H}_{k_n}$ [8]. Define $f_{n+1} = \sum_{l=1}^{n+1} \alpha^l_{n+1} k_l$, whose parameter vector $\alpha_{n+1}$ has already been determined in approximating $f_{n+1}$, and $f^\perp_{n+1} = \sum_{j=1}^{n} \beta^j_{n+1} k_j$. Using the projection condition $\langle k_i, f_{n+1} - f^\perp_{n+1} \rangle = 0$ gives

$\left\langle k_i, \sum_{l=1}^{n+1} \alpha^l_{n+1} k_l - \sum_{j=1}^{n} \beta^j_{n+1} k_j \right\rangle = 0$    (27)

where $i = 1, \ldots, n$, or equivalently

$\alpha^1_{n+1}\langle k_i, k_1 \rangle + \ldots + \alpha^{n+1}_{n+1}\langle k_i, k_{n+1} \rangle - \beta^1_{n+1}\langle k_i, k_1 \rangle - \ldots - \beta^n_{n+1}\langle k_i, k_n \rangle = 0$    (28)

where $i = 1, \ldots, n$. The above expression has the unique solution

$\beta_{n+1} = K_{n,n}^{-1} c.$    (29)

Note that $K_{n,n} \in \mathbb{R}^{n \times n}$ and $c_i = \alpha^1_{n+1}\langle k_i, k_1 \rangle + \ldots + \alpha^{n+1}_{n+1}\langle k_i, k_{n+1} \rangle$ for $i = 1, \ldots, n$.

Considering the gradient trace $v_{n+1}$ in (22), the orthogonal projection of the gradient trace must also be used if $f^\perp_{n+1}$ is selected. This can be achieved using the same method as given in (27)-(29). Defining $v^\perp_{n+1} = \sum_{j=1}^{n} \psi^j_{n+1} k_j$, the parameter vector $\psi_{n+1}$ is given by

$\psi_{n+1} = K_{n,n}^{-1} c'$    (30)

where $c'_i = \omega^1_{n+1}\langle k_i, k_1 \rangle + \ldots + \omega^{n+1}_{n+1}\langle k_i, k_{n+1} \rangle$ for $i = 1, \ldots, n$.

The squared norm of the difference between $f_{n+1}$ and $f^\perp_{n+1}$ is given by

$\|f_{n+1} - f^\perp_{n+1}\|^2 = \langle f_{n+1} - f^\perp_{n+1}, f_{n+1} - f^\perp_{n+1} \rangle = \langle f_{n+1}, f_{n+1} - f^\perp_{n+1} \rangle - 0.$    (31)

The second term on the RHS of (31) vanishes due to the orthogonal projection condition. Therefore

$\|f_{n+1} - f^\perp_{n+1}\|^2 = \sum_{l=1}^{n+1}\sum_{j=1}^{n+1} \alpha^l_{n+1}\alpha^j_{n+1}\langle k_l, k_j \rangle - \sum_{l=1}^{n+1}\sum_{j=1}^{n} \alpha^l_{n+1}\beta^j_{n+1}\langle k_l, k_j \rangle = \alpha_{n+1}^T K_{n+1,n+1}\alpha_{n+1} - \alpha_{n+1}^T K_{n+1,n}\beta_{n+1}$    (32)

where $\alpha_{n+1}, \beta_{n+1}$ are the parameter vectors, $K_{n+1,n+1} \in \mathbb{R}^{(n+1)\times(n+1)}$ and $K_{n+1,n} \in \mathbb{R}^{(n+1)\times n}$.
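In matrix form, the incremental test (29) and (32) reduces to one small linear solve and a quadratic form. The following is a minimal sketch with our own names, assuming the Gram matrix is well conditioned (in practice a small jitter term could be added).

```python
import numpy as np

def incremental_step(K_full, alpha_full, kappa_inc):
    """Decide whether to keep the newest kernel, following (29) and (32).

    K_full     : (n+1, n+1) Gram matrix over all n+1 kernel centres
    alpha_full : (n+1,) coefficient vector of the candidate update f_{n+1}
    Returns (coefficients, new_kernel_kept).
    """
    K_nn = K_full[:-1, :-1]                  # K_{n,n}
    c = K_full[:-1, :] @ alpha_full          # c_i = sum_l alpha^l_{n+1} <k_i, k_l>
    beta = np.linalg.solve(K_nn, c)          # projection coefficients (29)
    # Squared projection error (32)
    err = float(alpha_full @ K_full @ alpha_full - alpha_full @ K_full[:, :-1] @ beta)
    if err < kappa_inc:
        return beta, False                   # project: model size stays at n
    return alpha_full, True                  # accept the new kernel
```

The same solve with c' built from the ω coefficients gives the projected gradient trace in (30).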

B. Decremental Step

After the incremental step, we denote the new function update by

$f^\uparrow_{n+1} = \sum_{j=1}^{m} \nu^j_{n+1} k_j$    (33)

where $m$ can be $n$ or $n+1$, and $\nu_{n+1}$ is a vector equal to $\alpha_{n+1}$ or $\beta_{n+1}$ according to whether the function update is $f_{n+1}$ or $f^\perp_{n+1}$ respectively. The size of the existing model is reduced by eliminating an unwanted kernel term. Define the function called the decremental estimate with the $l$th kernel removed by

$f^{\downarrow\setminus l}_{n+1} = \sum_{j \in \mathcal{M}} \zeta^j_{n+1} k_j$    (34)

where $l \in \{1, \ldots, m\}$ and $\mathcal{M} = \{1, 2, \ldots, m\}\setminus l$. The function $f^{\downarrow\setminus l}_{n+1}$ is an orthogonal projection of $f^\uparrow_{n+1}$ onto the reduced set of kernels forming a subspace $\mathcal{H}_{k_\mathcal{M}}$. The kernel term $l$ is removed when the squared norm difference, $\|f^\uparrow_{n+1} - f^{\downarrow\setminus l}_{n+1}\|^2$, is lower than the threshold $\kappa_{dec}$, and the current model then becomes $f^{\downarrow\setminus l}_{n+1}$. Using the same method given in the previous subsection, $\zeta_{n+1}$ can be estimated from the projection conditions. Therefore,

$\zeta_{n+1} = K_{\mathcal{M},\mathcal{M}}^{-1} d.$    (35)

Note that $K_{\mathcal{M},\mathcal{M}} \in \mathbb{R}^{(m-1)\times(m-1)}$ and $d_j = \nu^1_{n+1}\langle k_1, k_j \rangle + \ldots + \nu^m_{n+1}\langle k_m, k_j \rangle$ for $j \in \mathcal{M}$. The norm difference is given as follows

$\|f^\uparrow_{n+1} - f^{\downarrow\setminus l}_{n+1}\|^2 = \nu_{n+1}^T K_{m,m}\nu_{n+1} - \nu_{n+1}^T K_{m,\mathcal{M}}\zeta_{n+1}$    (36)

where $K_{m,m} \in \mathbb{R}^{m \times m}$ and $K_{m,\mathcal{M}} \in \mathbb{R}^{m \times (m-1)}$.

If the $l$th kernel is removed from the function update, the corresponding term must be removed from the gradient trace as well. Define $v^\uparrow_{n+1} = \sum_{j=1}^{m} \chi^j_{n+1} k_j$ for the gradient trace, where $\chi_{n+1}$ can be $\omega_{n+1}$ or $\psi_{n+1}$ as appropriate. The decremental estimate with the $l$th kernel removed is $v^{\downarrow\setminus l}_{n+1} = \sum_{j \in \mathcal{M}} \varphi^j_{n+1} k_j$, and the parameter vector $\varphi_{n+1}$ is determined by

$\varphi_{n+1} = K_{\mathcal{M},\mathcal{M}}^{-1} d'$    (37)

where $d'_j = \chi^1_{n+1}\langle k_1, k_j \rangle + \ldots + \chi^m_{n+1}\langle k_m, k_j \rangle$ for $j \in \mathcal{M}$.
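A corresponding sketch of the decremental test (34)-(36), again with our own names, scans the current kernel terms and prunes the first one whose removal changes the function by less than the threshold:

```python
import numpy as np

def decremental_step(K_mm, nu, kappa_dec):
    """Try to remove one kernel term by projection onto the reduced set, following (35)-(36).

    K_mm : (m, m) Gram matrix of the current kernel centres
    nu   : (m,) coefficient vector of the function after the incremental step
    Returns (kept_indices, coefficients) after at most one removal.
    """
    m = len(nu)
    for l in range(m):
        keep = [j for j in range(m) if j != l]                     # index set M = {1,...,m} \ l
        K_MM = K_mm[np.ix_(keep, keep)]
        d = K_mm[np.ix_(keep, range(m))] @ nu                      # d_j = sum_i nu^i_{n+1} <k_i, k_j>
        zeta = np.linalg.solve(K_MM, d)                            # projection coefficients (35)
        err = float(nu @ K_mm @ nu - nu @ K_mm[:, keep] @ zeta)    # squared norm difference (36)
        if err < kappa_dec:
            return keep, zeta                                      # drop kernel l
    return list(range(m)), nu                                      # nothing pruned
```

The gradient-trace coefficients are pruned with the same solve using the χ coefficients, as in (37).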

VI. SIMULATED EVALUATION OF THE ALGORITHMS

SMD was evaluated with observation data from non-stationary environments, generated from a sinc function $f(x) = \frac{\sin \pi(x - x')}{\pi(x - x')}$, where $x'$ is the centre. The experimental data were corrupted by Gaussian distributed noise $N(0, \sigma^2)$. Throughout our experiments, $\sigma^2$ was set to 0.1. To simulate non-stationarity, the sinc function was varied in time in two cases: the centre was shifted at a rate of $0.0004n$, and switched to another value at the 150th and 300th iterations.
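A minimal sketch of generating such non-stationary observations is given below; the input range and the centre values used for the switching case are our own assumptions, since they are not listed here.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.1                                    # noise variance used in the experiments
n_samples = 500
n = np.arange(n_samples)
xs = rng.uniform(-5.0, 5.0, n_samples)          # assumed input range
noise = rng.normal(0.0, np.sqrt(sigma2), n_samples)

# np.sinc(y) = sin(pi y) / (pi y), matching the target f(x) = sin(pi(x - x')) / (pi(x - x'))
# Shifting target: the centre drifts at rate 0.0004 n
z_shift = np.sinc(xs - 0.0004 * n) + noise

# Switching target: the centre jumps at the 150th and 300th iterations (example centre values)
centre_switch = np.where(n < 150, 0.0, np.where(n < 300, 1.0, -1.0))
z_switch = np.sinc(xs - centre_switch) + noise
```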

In our experiments, we used 500 samples to test the SMD algorithm with a Gaussian kernel of width equal to 50. A test set of 100 data points was used to verify the performance of the algorithms in terms of the MSE, where

$\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}\left(f(x_n) - z_n\right)^2.$    (38)

Fig. 1(a) presents the MSE obtained when applying SMD to the shifting and switching targets; the results show that the algorithm can follow the target even though it is a time-varying function. The experiments also include the sparse solution to bound the size of the model. Fig. 1(b) illustrates that the number of kernels was reduced from 500 to around 12, and the program running time is correspondingly shorter.

[Fig. 1. Variation in (a) MSE and (b) number of kernels against iterations, for SMD with the sparse solution, setting $\kappa_{inc} = \kappa_{dec} = 0.005$, $\mu = 0.1$, $\lambda = 0.99$ and $\eta_0 = 0.1$. Shown are the shifting and switching target functions.]

VII. CONCLUSIONS

This work presents a framework for solving the function approximation problem on an RKHS using online SGD. For application in a time-varying environment, the learning rate is adapted at each iteration by SMD, whereas the regularisation parameter is kept constant. The simulation results support that SMD is able to track a time-varying function in two cases: a shifting and a switching target function. The size of the model is bounded by the sparse solution, which avoids the computational expense and decreases the program running time.

REFERENCES

[1] C. Campbell, Radial Basis Function Networks: Design and Applications. Springer Verlag, 2000.
[2] T. Dodd and R. Harrison, “Some lemmas on reproducing kernel Hilbert spaces,” Department of Automatic Control and Systems Engineering, University of Sheffield, UK, Tech. Rep. 819, 2002.
[3] N. Akhiezer and I. Glazman, Theory of Linear Operators in Hilbert Space. Pitman, 1981, vol. I.
[4] G. Wahba, Spline Models for Observational Data, ser. Series in Applied Mathematics. Philadelphia: SIAM, 1990, vol. 50.
[5] T. Dodd, B. Mitchinson, and R. Harrison, “Sparse stochastic gradient descent learning in kernel models,” in Proceedings of the Second International Conference on Computational Intelligence, Robotics and Autonomous Systems, Singapore, 2003.
[6] N. N. Schraudolph and T. Graepel, “Towards stochastic conjugate gradient methods,” in Proc. 9th Intl. Conf. Neural Information Processing, L. Wang, J. C. Rajapakse, K. Fukushima, S.-Y. Lee, and X. Yao, Eds. IEEE, 2002, pp. 853–856.
[7] S. V. N. Vishwanathan, N. N. Schraudolph, and A. J. Smola, “Step size adaptation in reproducing kernel Hilbert space,” J. Mach. Learn. Res., vol. 7, pp. 1107–1133, 2006.
[8] E. Kreyszig, Introductory Functional Analysis with Applications. John Wiley & Sons, 1978.