Kernel Based Nonlinear Weighted Least Squares Regression


Antoni Wibowo and Mohamad Ishak Desa

Abstract: In this paper, we consider a regression model with heteroscedastic errors, in which the prediction of ordinary least squares (OLS) based regression can be inappropriate. Weighted least squares (WLS) is widely used to handle the heteroscedastic model by transforming the original model into a new model that satisfies the homoscedasticity assumption. However, WLS yields a linear prediction and has no guarantee of avoiding the negative effect of multicollinearity. Under this circumstance, we propose a method to overcome these difficulties using a hybridization of WLS regression with the kernel method. We use WLS to handle the heteroscedastic errors and use the kernel method to obtain a nonlinear model and to eliminate the negative effect of multicollinearity in this regression model. We then compare the performance of the proposed method with WLS regression, and it gives better results than WLS regression.

Index Terms: Kernel principal component regression, kernel trick, multicollinearity, heteroscedasticity, nonlinear regression, weighted least squares.


    1 INTRODUCTION

Let us consider a heteroscedastic regression model as follows:

y = X̃β + ε,  E(ε) = 0,  Cov(ε) = σ²V,   (1)

where y = (y_1, y_2, ..., y_N)^T, X = (x_1, x_2, ..., x_N)^T with x_i = (x_i1, x_i2, ..., x_ip)^T, X̃ = (1_N X), β = (β_0, β_1, ..., β_p)^T, ε = (ε_1, ε_2, ..., ε_N)^T and V = diag(1/w_1, 1/w_2, ..., 1/w_N), with σ² and the w_i positive real numbers for i = 1, 2, ..., N. The weight w_i is estimated using the data; see for example [2], [6] and [7]. The sizes of x_i, y, X, X̃, β and ε are p×1, N×1, N×p, N×(p+1), (p+1)×1 and N×1, respectively, where 1_N is an N×1 vector with all elements equal to one and ℝ is the set of real numbers. The vector x^T denotes the transpose of the vector x. An implication of the assumption Cov(ε) = σ²V ≠ σ²I_N

is that the ordinary least squares (OLS) estimator β̂_OLS = (X̃^T X̃)^{-1} X̃^T y can be inappropriate: it is not the best linear unbiased estimator (BLUE) of β; see, for example, [1] and [8]. That is, among all the unbiased estimators, OLS does not provide the estimate with the smallest variance. Depending on the nature of the heteroscedasticity, significance tests can be too high or too low, which implies the significance test has low power. These difficulties are usually handled by transforming model (1) into a new model that satisfies the homoscedasticity assumption, a procedure called the weighted least squares (WLS) method. This method also yields a linear prediction model; however, the WLS solution is preferred to the OLS solution [1].

Furthermore, we say that multicollinearity exists in the regressor matrix X if X̃^T X̃ is a nearly singular matrix, i.e., if some eigenvalues of X̃^T X̃ are close to zero. If multicollinearity exists in X, then the variances of the estimated regression coefficients can be very large and, under the assumption that ε_i is normally distributed, the tests for inferences on β_j (j = 0, 1, ..., p) have low power and the confidence intervals can be large [15]. Therefore, it will be difficult to decide whether a variable x_j makes a significant contribution to the regression. These implications are known as the effect of multicollinearity.
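To make the effect concrete, the following short numerical sketch (synthetic data; all names are ours, not the paper's) shows how a nearly collinear regressor matrix yields eigenvalues of X^T X close to zero and, through (X^T X)^{-1}, inflated coefficient variances.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x1 = rng.normal(size=N)
x2 = x1 + 1e-3 * rng.normal(size=N)        # nearly collinear with x1
X = np.column_stack([np.ones(N), x1, x2])  # regressors with intercept

XtX = X.T @ X
print("eigenvalues of X^T X:", np.linalg.eigvalsh(XtX))  # smallest is near zero

# Var(beta_hat) = sigma^2 * diag((X^T X)^{-1}); these factors blow up
print("diagonal of (X^T X)^{-1}:", np.diag(np.linalg.inv(XtX)))
```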

We can use principal component regression (PCR) to eliminate the effects of multicollinearity. However, PCR yields linear prediction models, which limits its applicability since most real problems are nonlinear. In recent years, [3], [4], [9], [10], [11] and [16] have proposed kernel principal component regression (KPCR) to overcome this linearity and to avoid the negative effect of multicollinearity. KPCR was constructed based on kernel principal component analysis (KPCA) and the homoscedasticity assumption (see [12] and [13] for the details of KPCA), and is performed by mapping the original input space into a higher-dimensional feature space. Therefore, KPCR can be inappropriate for use in regression analysis when the homoscedasticity assumption is not satisfied.

Antoni Wibowo is with the Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia.

Mohamad Ishak Desa is with the Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia.


Under this circumstance, we propose a nonlinear method which can eliminate the effect of multicollinearity and accommodates heteroscedastic errors. In this method, an original data set is transformed into a higher-dimensional feature space, and a multiple linear regression with the heteroscedasticity assumption is created in this space. Then, we perform the WLS method on this linear regression to obtain a multiple linear regression with the homoscedasticity assumption. Furthermore, we use a trick to obtain an explicitly homoscedastic multiple linear regression, and perform the similar procedure of PCR in this feature space to obtain a nonlinear prediction and to eliminate the effect of multicollinearity. We refer to the proposed technique as weighted least squares KPCR (WLS KPCR).

The rest of the manuscript is organized as follows. In Section 2, we present the theories and methods of WLS regression, the kernel trick, WLS KPCR and the WLS KPCR algorithm. In Section 3, we compare the performance of WLS regression and WLS KPCR. Then, we give conclusions in Section 4.

    2 THEORIES AND METHODS

    2.1 Weighted Least Squares Regression

The WLS method for the linear regression is performed as follows. First, we define L = diag(1/√w_1, 1/√w_2, ..., 1/√w_N), which implies that V = LL, and L^{-1} = diag(√w_1, √w_2, ..., √w_N). It is evident that

L^{-1} V L^{-1} = I_N.   (2)

Let z = L^{-1}y and δ = L^{-1}ε. It is easy to verify that E(δ) = 0 and Cov(δ) = σ²I_N, and model (1) becomes

z = L^{-1}X̃β + δ,  E(δ) = 0,  Cov(δ) = σ²I_N.   (3)

We can see that the error in model (3) satisfies the homoscedasticity assumption. To obtain the estimator of β in model (3) we solve

min_β ‖z - L^{-1}X̃β‖²   (4)

with respect to β. Let β̂ be the solution to problem (4), which satisfies the least squares normal equations

(L^{-1}X̃)^T(L^{-1}X̃)β̂ = (L^{-1}X̃)^T z.   (5)

It is evident that if the column vectors of X̃ are linearly independent, then the column vectors of L^{-1}X̃ are also linearly independent. Hence, (L^{-1}X̃)^T(L^{-1}X̃) is invertible and we obtain

β̂ = ((L^{-1}X̃)^T(L^{-1}X̃))^{-1}(L^{-1}X̃)^T z,   (6)

which is called the WLS estimator of β; in practice, β̂ is computed by replacing y in Eq. (6) with the observed data. The fitted value of z, say ẑ, is given by

ẑ = L^{-1}X̃β̂   (7)

and the prediction of the WLS-based linear regression is given by

f_1(x) = (1 x^T)β̂,   (8)

where f_1 is a function from ℝ^p into ℝ. We should notice that the elements of the matrix X in model (1) can be chosen such that multicollinearity is not present in X̃. Unfortunately, the eigenvalues of X̃^T X̃ are not equal to the eigenvalues of (L^{-1}X̃)^T(L^{-1}X̃), which implies there is no guarantee that multicollinearity does not exist in L^{-1}X̃. The presence of multicollinearity in L^{-1}X̃ can seriously deteriorate the prediction of WLS regression.
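As an illustration of Eqs. (2)-(8), the sketch below computes the WLS estimator through the transformation z = L^{-1}y on synthetic heteroscedastic data; the function name wls_fit and the particular weight pattern are our own assumptions, not part of the paper.

```python
import numpy as np

def wls_fit(X_tilde, y, w):
    """WLS estimate via the transformed model z = L^{-1} X_tilde beta + delta,
    where L = diag(1/sqrt(w_i)), so L^{-1} = diag(sqrt(w_i))."""
    L_inv = np.diag(np.sqrt(w))
    Z = L_inv @ X_tilde                # transformed regressors
    z = L_inv @ y                      # transformed response
    beta_hat, *_ = np.linalg.lstsq(Z, z, rcond=None)  # solves Eq. (5)/(6)
    return beta_hat

# synthetic heteroscedastic example: Var(eps_i) = sigma^2 / w_i
rng = np.random.default_rng(1)
N = 100
x = rng.uniform(0, 10, size=N)
w = 1.0 / (1.0 + x) ** 2               # assumed known weights
y = 2.0 + 3.0 * x + rng.normal(scale=np.sqrt(1.0 / w))
X_tilde = np.column_stack([np.ones(N), x])
print(wls_fit(X_tilde, y, w))          # approximately [2, 3]
```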

2.2 Heteroscedastic Regression Model in Feature Space

The procedure for constructing WLS KPCR is similar to the WLS regression explained in Subsection 2.1. First, we transform our data set into a Euclidean space of higher dimension by a function; the important point here is that this function is not explicitly defined. Then, we construct a heteroscedastic linear regression model in this space and use the WLS procedure to obtain a homoscedastic linear regression model, followed by a trick that makes the homoscedastic regression explicit, and finally the similar procedure of PCR to eliminate the effect of multicollinearity in this regression model.

Assume we have a function φ: ℝ^p → F, where F is called the feature space and is a Euclidean space of dimension higher than p, say p_F. Then, we define


Φ = (φ(x_1) φ(x_2) ... φ(x_N))^T, Φ^TΦ and ΦΦ^T, where the sizes of Φ, Φ^TΦ and ΦΦ^T are N×p_F, p_F×p_F and N×N, respectively. We assume that Σ_{i=1}^N φ(x_i) = 0. If F is infinite dimensional, we consider the corresponding linear operator instead of the matrix Φ^TΦ [13]. The heteroscedastic regression model in the feature space is given by

y = Φβ_F + ε_F,  E(ε_F) = 0,  Cov(ε_F) = σ²V,   (9)

where β_F is a p_F×1 vector of regression coefficients in the feature space, ε_F is an N×1 vector of random errors and V = diag(1/w_1, 1/w_2, ..., 1/w_N), with σ² and the w_i positive real numbers for i = 1, 2, ..., N. The weight w_i is estimated using the data (y_i, x_i) and φ. We notice that we cannot use the generalized inverse matrix to obtain the estimator of β_F, since Φ is not known explicitly. Then, we define L = diag(1/√w_1, 1/√w_2, ..., 1/√w_N), which implies V = LL, and L^{-1} = diag(√w_1, √w_2, ..., √w_N), and obtain

z = Φ̃β_F + δ_F,  E(δ_F) = 0,  Cov(δ_F) = σ²I_N,   (10)

where z = L^{-1}y, Φ̃ = L^{-1}Φ and δ_F = L^{-1}ε_F. Furthermore, we define the two matrices Φ̃^TΦ̃ and Φ̃Φ̃^T. The eigenvalues and eigenvectors of the matrices Φ̃^TΦ̃ and Φ̃Φ̃^T are related by the following theorem.

Theorem 1. [16] Suppose λ > 0 and u ∈ F \ {0}. The following statements are equivalent:

1. λ and u satisfy Φ̃^TΦ̃u = λu.
2. λ and b satisfy Φ̃Φ̃^Tb = λb and b = αΦ̃u, for some α ∈ ℝ \ {0}.
3. λ and b satisfy Φ̃Φ̃^Tb = λb and u = αΦ̃^Tb, for some α ∈ ℝ \ {0}.
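The duality in Theorem 1 can be checked numerically with a few lines (our own illustration, not taken from [16]): the nonzero eigenvalues of Φ̃^TΦ̃ and Φ̃Φ̃^T coincide, and an eigenvector of one matrix is mapped to an eigenvector of the other by Φ̃^T.

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(6, 4))            # stands in for the weighted feature matrix

A = Phi.T @ Phi                          # p_F x p_F
B = Phi @ Phi.T                          # N x N ("kernel-side" matrix)

eig_A = np.sort(np.linalg.eigvalsh(A))[::-1]
eig_B = np.sort(np.linalg.eigvalsh(B))[::-1]
print(np.allclose(eig_A, eig_B[:4]))     # True: nonzero spectra agree

# map an eigenvector b of Phi Phi^T to an eigenvector u of Phi^T Phi
lam, b = np.linalg.eigh(B)
lam_max, b_max = lam[-1], b[:, -1]
u = Phi.T @ b_max / np.sqrt(lam_max)
print(np.allclose(A @ u, lam_max * u))   # True: (lam, u) is an eigenpair of Phi^T Phi
```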

Let q be the rank of Φ̃, where q ≤ min(N, p_F). Since the rank of Φ̃ is equal to the rank of Φ̃^TΦ̃ and to the rank of Φ̃Φ̃^T, the ranks of Φ̃^TΦ̃ and Φ̃Φ̃^T are both equal to q. We should notice that Φ̃Φ̃^T is symmetric and positive semidefinite, which implies that its eigenvalues are nonnegative real numbers. Let λ_1 ≥ λ_2 ≥ ... ≥ λ_q > 0 be the nonzero eigenvalues of Φ̃^TΦ̃, let b_1, b_2, ..., b_q be the corresponding normalized eigenvectors of Φ̃Φ̃^T, and define u_l = Φ̃^T b_l/√λ_l for l = 1, 2, ..., q. Then, according to Theorem 1, we obtain the eigenvalue and eigenvector problems

Φ̃^TΦ̃u_l = λ_l u_l,  Φ̃Φ̃^T b_l = λ_l b_l,  l = 1, 2, ..., q.

Since the rank of Φ̃^TΦ̃ is equal to q, the remaining (p_F - q) eigenvalues of Φ̃^TΦ̃ are zero. Let λ_k (k = q+1, ..., p_F) be the zero eigenvalues of Φ̃^TΦ̃ and u_k be the normalized eigenvectors of Φ̃^TΦ̃ corresponding to λ_k. Then, we define U = (u_1 u_2 ... u_{p_F}), which is an orthogonal matrix. It is not difficult to verify that

Φ̃^TΦ̃ = U D U^T,  D = ( Λ_q  O ; O  O ),

where Λ_q = diag(λ_1, λ_2, ..., λ_q) and O is a zero matrix. Since UU^T = I_{p_F}, we can rewrite the model (10) as

z = Φ̃UU^Tβ_F + δ_F = Wγ + δ_F,  E(δ_F) = 0,  Cov(δ_F) = σ²I_N,   (11)

where W = Φ̃U and γ = U^Tβ_F. Let W = (W_(1) W_(2)) and γ = (γ_(1)^T γ_(2)^T)^T, where the sizes of W_(1), W_(2), γ_(1) and γ_(2) are N×q, N×(p_F-q), q×1 and (p_F-q)×1, respectively. The model (11) can be written as

z = W_(1)γ_(1) + W_(2)γ_(2) + δ_F,  E(δ_F) = 0,  Cov(δ_F) = σ²I_N.   (12)

Since W^TW = U^TΦ̃^TΦ̃U = D, we obtain


W_(1)^TW_(1) = Λ_q, W_(1)^TW_(2) = O and W_(2)^TW_(2) = O. Since W_(2)^TW_(2) is equal to zero, we see that W_(2) is equal to 0. Consequently, the model (12) is simplified to

z = W_(1)γ_(1) + δ_F,  E(δ_F) = 0,  Cov(δ_F) = σ²I_N.   (13)

Let us assume that λ_{r+1}, λ_{r+2}, ..., λ_q are close to zero for some r < q, and define W_(1) = (W*_(1) W*_(2)), γ_(1) = (γ*_(1)^T γ*_(2)^T)^T and

Λ_q = ( Λ*_(1)  O ; O  Λ*_(2) ),

with Λ*_(1) = diag(λ_1, ..., λ_r), Λ*_(2) = diag(λ_{r+1}, ..., λ_q), and sizes of W*_(1), W*_(2), γ*_(1) and γ*_(2) equal to N×r, N×(q-r), r×1 and (q-r)×1, respectively. The model (13) can now be written as

z = W*_(1)γ*_(1) + W*_(2)γ*_(2) + δ_F,  E(δ_F) = 0,  Cov(δ_F) = σ²I_N.   (14)

It is evident that the estimator of γ_(1), say γ̂_(1) = (γ̂_1, γ̂_2, ..., γ̂_q)^T, is given by

γ̂_(1) = Λ_q^{-1} W_(1)^T z   (15)

and the variance of γ̂_(1) is

Var(γ̂_(1)) = σ² Λ_q^{-1}.   (16)

Since λ_{r+1}, ..., λ_q are close to zero, the diagonal elements of Λ*_(2)^{-1}, and hence the variances of γ̂_{r+1}, ..., γ̂_q, will be very large numbers. Thus, we encounter the ill effect of multicollinearity in the model (14). To avoid the effects of multicollinearity we dispose of the term W*_(2)γ*_(2), as in [15], and obtain

z = W*_(1)γ*_(1) + δ*,   (17)

where δ* is a random vector influenced by dropping W*_(2)γ*_(2) in the model (17). We usually dispose of the term W*_(2)γ*_(2) to handle the effects of multicollinearity in PCR. We can use the ratio λ_1/λ_l (l = 1, 2, ..., q) to detect the presence of multicollinearity in W_(1): if λ_1/λ_l is smaller than, say, 1000, the l-th component is retained; otherwise it is disposed of.


2.3 Kernel Trick

By Mercer's theorem [5], if k: ℝ^p × ℝ^p → ℝ is a continuous, symmetric and positive semidefinite function, then there exists φ such that k(x, y) = φ(x)^Tφ(y) for all x, y ∈ ℝ^p. The function k is called the kernel function. Instead of choosing φ explicitly, we choose a kernel k and employ the corresponding function as φ. Let K be the N×N matrix with entries K_ij = k(x_i, x_j). Hence, we obtain

Φ̃Φ̃^T = L^{-1}ΦΦ^TL^{-1} = L^{-1}KL^{-1} = K̃.

It is evident that K̃ is known explicitly now. This implies that its eigenvalues λ_l and eigenvectors b_l are also known explicitly.

2.4 WLS KPCR

Now, let us consider model (17) again. Since we now know K̃, its eigenvalues λ_l and its eigenvectors b_l explicitly, we can obtain the regression coefficients of this model. Let Ũ_(r) = (ũ_1 ũ_2 ... ũ_r) with ũ_l = b_l/√λ_l, T*_(r) = K̃Ũ_(r), D_(r) = diag(λ_1, ..., λ_r) and γ̂_(r) = D_(r)^{-1} T*_(r)^T z_o, where z_o = L^{-1}y_o and y_o = y - ȳ1_N is the centred response. Then, the fitted value of z_o in the transformed scale, say ẑ_o, is given by

ẑ_o = T*_(r) γ̂_(r)   (21)

and the residual between z_o and ẑ_o is given by

e = z_o - ẑ_o.   (22)

The fitted value of y in the original scale, say ŷ, is given by

ŷ = ȳ1_N + Lẑ_o   (23)

and the WLS KPCR prediction is given by

g(x) = ȳ + Σ_{i=1}^N c_i k(x, x_i),   (24)

where ȳ = (1/N)1_N^T y, g is a function from ℝ^p into ℝ and c = (c_1 c_2 ... c_N)^T = L^{-1}Ũ_(r)γ̂_(r). The number r is called the retained number of nonlinear PCs for the WLS KPCR.

2.5 WLS KPCR Algorithm

We summarize the procedure in Subsections 2.2-2.4 to obtain the WLS KPCR prediction as follows:

1. Given (y_i, x_i), x_i = (x_i1, x_i2, ..., x_ip)^T, i = 1, 2, ..., N.
2. Calculate ȳ = (1/N)1_N^T y and y_o = (I_N - (1/N)1_N 1_N^T)y.
3. Estimate V and find L.
4. Calculate z_o = L^{-1} y_o.
5. Choose a kernel k.
6. Construct K = (K_ij), K_ij = k(x_i, x_j), and K̃ = L^{-1} K L^{-1}.
7. Diagonalize K̃. Let λ_1 ≥ λ_2 ≥ ... ≥ λ_r ≥ ... ≥ λ_{p_F} ≥ λ_{p_F+1} = ... = λ_N = 0 be the eigenvalues of K̃ and b_1, b_2, ..., b_N be the corresponding normalized eigenvectors of K̃.
8. Detect multicollinearity on K̃. Let r be the retained number of nonlinear PCs such that λ_r ≥ (1/1000) max_s λ_s.
9. Construct ũ_l = b_l/√λ_l for l = 1, 2, ..., r and Ũ_(r) = (ũ_1 ũ_2 ... ũ_r).
10. Calculate T*_(r) = K̃Ũ_(r), γ̂_(r) = D_(r)^{-1} T*_(r)^T z_o and c = (c_1 c_2 ... c_N)^T = L^{-1}Ũ_(r)γ̂_(r), where D_(r) = diag(λ_1, ..., λ_r).
11. Given a vector x, the WLS KPCR prediction is given by

g(x) = ȳ + Σ_{i=1}^N c_i k(x, x_i).

Note that the above algorithm works under the assumption Σ_{i=1}^N φ(x_i) = 0. When Σ_{i=1}^N φ(x_i) ≠ 0, we replace K by K_N = K - EK - KE + EKE in Step 6, where E is the N×N matrix with all elements equal to 1/N, and work based on K_N in the subsequent steps.
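The eleven steps above can be collected into a short script. The sketch below is our own reading of the algorithm, assuming a Gaussian kernel, the 1/1000 eigenvalue cut-off of Step 8, the centred kernel matrix K_N of the note above, and weights w that have already been estimated; the function names are illustrative only.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    # k(x, y) = exp(-||x - y||^2 / sigma), evaluated for all pairs of rows
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma)

def wls_kpcr_fit(X, y, w, sigma, ratio=1e-3):
    """Steps 1-10: returns (y_bar, c) so that g(x) = y_bar + sum_i c_i k(x, x_i)."""
    N = len(y)
    y_bar = y.mean()                          # Step 2
    y_o = y - y_bar
    L_inv = np.diag(np.sqrt(w))               # Steps 3-4: L = diag(1/sqrt(w_i))
    z_o = L_inv @ y_o
    K = gaussian_kernel(X, X, sigma)          # Step 6 (kernel chosen in Step 5)
    E = np.full((N, N), 1.0 / N)
    K = K - E @ K - K @ E + E @ K @ E         # K_N: centring in the feature space
    K_t = L_inv @ K @ L_inv                   # K tilde
    lam, B = np.linalg.eigh(K_t)              # Step 7
    lam, B = lam[::-1], B[:, ::-1]            # decreasing order
    r = int(np.sum(lam >= ratio * lam[0]))    # Step 8: retained nonlinear PCs
    U_r = B[:, :r] / np.sqrt(lam[:r])         # Step 9: u_l = b_l / sqrt(lambda_l)
    T_r = K_t @ U_r                           # Step 10
    gamma = (T_r.T @ z_o) / lam[:r]           # D_(r)^{-1} T*_(r)^T z_o
    c = L_inv @ (U_r @ gamma)
    return y_bar, c

def wls_kpcr_predict(X_new, X_train, y_bar, c, sigma):
    # Step 11: g(x) = y_bar + sum_i c_i k(x, x_i)
    return y_bar + gaussian_kernel(X_new, X_train, sigma) @ c
```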

    3 CASE STUDY

In our case study, we use the Gaussian kernel k(x, y) = exp(-‖x - y‖²/σ), where σ is the parameter of the kernel. As mentioned before, there are several methods to estimate the weight w_i; see for example [2], [6], [7] and [14]. We use the method based on replication to estimate the weight w_i as follows. First, we arrange the data x_i in order of increasing y_i and form some groups, say M groups.


For each group k we compute x̄_k, the mean of the x_ik, and s_k², the variance of the y_ik, respectively. Then we make a prediction from the set {(x̄_k, s_k²)}, say f_2(x) = ĉ_0 + ĉ_1 x, where f_2 is a function from ℝ into ℝ and ĉ_0, ĉ_1 ∈ ℝ. Furthermore, we calculate the estimated variance of y_i by using the predictor f_2(x_i). The weight w_i is chosen inversely proportional to the estimated variance of y_i. A similar procedure is performed to obtain the weights of WLS KPCR; we only replace y_i by y_oi, where y_oi is the ith element of y_o.
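A sketch of this replication-based weight estimate (our own reading of the procedure; the helper name replication_weights, the use of np.polyfit for f_2, and the clipping of non-positive variance estimates are our assumptions):

```python
import numpy as np

def replication_weights(x, y, M):
    """Group the data sorted by y, fit a line to the group means of x versus the
    group variances of y, and take w_i = 1 / estimated Var(y_i)."""
    order = np.argsort(y)                        # arrange the data by increasing y_i
    groups = np.array_split(order, M)            # M groups (each needs >= 2 points)
    xbar = np.array([x[g].mean() for g in groups])
    s2 = np.array([y[g].var(ddof=1) for g in groups])
    c1, c0 = np.polyfit(xbar, s2, 1)             # f_2(x) = c0 + c1 * x
    est_var = np.clip(c0 + c1 * x, 1e-8, None)   # guard against non-positive values
    return 1.0 / est_var

# usage: w = replication_weights(x, y, M=2); for WLS KPCR, reuse with y_o instead of y
```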

We also use the average monthly income from food sales (y) and the corresponding annual advertising expenses (x) for 30 restaurants [6]. The data are given in Table 1 and are called the training data. We add noise to the response values of the training data, generated from a normally distributed random variable with zero mean and standard deviation σ_1, where σ_1 ∈ (0, 1). Afterward, we use some of the data to test the prediction by WLS-based linear regression and WLS KPCR, and call them the testing data. We also add noise to the response values of the testing data, generated from a normally distributed random variable with zero mean and standard deviation σ_2, where σ_2 ∈ (0, 1). For the sake of comparison, we set σ_1 and σ_2 equal to 0.25 and 0.5, respectively. Then, we generated 1000 sets of both training and testing data to test the performance of the WLS-based regression and WLS KPCR.
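The Monte Carlo comparison just described can be organized as in the following sketch (our own illustration; the predict argument stands for any fit-and-predict routine, e.g. the WLS or WLS KPCR sketches given earlier, and the noise levels follow σ_1 = 0.25 and σ_2 = 0.5):

```python
import numpy as np

def average_rmse(predict, x_train, y_train, x_test, y_test,
                 sd_train=0.25, sd_test=0.5, n_sets=1000, seed=0):
    """Average training/testing RMSE over noisy replicates of the responses.
    `predict(x_tr, y_tr, x_eval)` fits on (x_tr, y_tr) and predicts at x_eval."""
    rng = np.random.default_rng(seed)
    rmse_train, rmse_test = [], []
    for _ in range(n_sets):
        y_tr = y_train + rng.normal(0.0, sd_train, size=y_train.shape)
        y_te = y_test + rng.normal(0.0, sd_test, size=y_test.shape)
        pred_tr = predict(x_train, y_tr, x_train)
        pred_te = predict(x_train, y_tr, x_test)
        rmse_train.append(np.sqrt(np.mean((pred_tr - y_tr) ** 2)))
        rmse_test.append(np.sqrt(np.mean((pred_te - y_te) ** 2)))
    return np.mean(rmse_train), np.mean(rmse_test)
```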

Table 1: The restaurant food sales data (y_i ×100)

 i    x_i      y_i (×100)     i    x_i      y_i (×100)     i    x_i      y_i (×100)
 1    3.00      81.464       11    9.000   131.434        21   15.050   178.187
 2    3.150     72.661       12   11.345   140.564        22  178.187   185.304
 3    3.085     72.344       13   12.275   151.352        23   15.150   155.931
 4    5.225     90.743       14   12.400   146.426        24   16.800   172.579
 5    5.350     98.588       15   12.525   130.963        25   16.500   188.851
 6    6.090     96.507       16   12.310   144.630        26   17.830   192.424
 7    8.925    126.574       17   13.700   147.041        27   19.500   203.112
 8    9.015    114.133       18   15.000   179.021        28   19.200   192.482
 9    8.885    115.814       19   15.175   166.200        29   19.000   218.715
10    8.950    123.181       20   14.995   180.732        30   19.350   214.317

Let e_i and ŷ_i be the residual and the prediction of the OLS method, respectively. It is well known that the plot of the residual e_i against its corresponding ŷ_i is useful for checking the assumption of constant variance. Our choice of the WLS solution is also based on the pattern of the residuals. The plot of e_i against ŷ_i is shown in Figure 1(a), in which the variation of the residuals increases significantly as the prediction values increase. Hence, this plot indicates a violation of the assumption of constant variance. Consequently, the ordinary least squares fit is inappropriate and the OLS estimator is not the best linear unbiased estimator (BLUE). For the sake of comparison, the values of M are chosen to be two, four and five.

[Figure 1: Plot of the residuals against the corresponding predicted values for the training data: (a) OLS-based linear regression, (b) WLS KPCR.]

For instance, when M = 2, the ordered data are divided into two groups, where each group contains 50 percent of the ordered data. The plot of the residual e_{2i} against its corresponding predicted value z_{oi}, with M = 2 and σ = 0.5, is shown in Figure 1(b).


Figure 1(b) shows a residual plot with no systematic pattern around zero. It seems that the assumption of constant variance is satisfied for the data associated with this residual plot. We notice that the eigenvalues of X̃^T X̃ are 5.2012e+03 and 4.3800, and the eigenvalues of (L^{-1}X̃)^T(L^{-1}X̃) are 1.05717e-03 and 6.4251e-07. The ratios of the eigenvalues of X̃^T X̃ and of (L^{-1}X̃)^T(L^{-1}X̃) are 4.3800/5.2012e+03 = 8.4228e-04 and 6.4251e-07/1.05717e-03 = 6.0779e-04, respectively. Hence, multicollinearity exists in the regression matrices X̃ and L^{-1}X̃. This implies that the prediction of OLS-based linear regression and the prediction of WLS regression can be inappropriate.
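The residual-versus-fitted check discussed around Figure 1 takes only a few lines; the sketch below assumes matplotlib and that fitted and resid already hold the fitted values and residuals of whichever fit is being inspected.

```python
import matplotlib.pyplot as plt

def residual_plot(fitted, resid, title=""):
    # A funnel-shaped spread (as in Figure 1(a)) suggests non-constant variance;
    # an unstructured band around zero (as in Figure 1(b)) supports homoscedasticity.
    plt.scatter(fitted, resid, s=15)
    plt.axhline(0.0, linewidth=1)
    plt.xlabel("predicted value")
    plt.ylabel("residual")
    plt.title(title)
    plt.show()
```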

We also notice that the estimator of WLS KPCR is BLUE for the regression model with heteroscedastic errors and provides the estimate with the smallest variance. The comparison of the two methods is shown in Table 2, where the values in the table are averages of the RMSE over 1000 sets of the training and testing data in the original scale. From Table 2, we see that the RMSEs of WLS KPCR are smaller than the RMSEs of WLS regression. For these data, the choice σ = 0.05 yields better results than other values of σ, and WLS KPCR reduces the RMSE of WLS regression by more than 50 percent.

Table 2: RMSE of WLS regression and WLS KPCR for the restaurant food sales data.

Data      Method                      RMSE
                                      M=2        M=4        M=5
Training  WLS regression              870.3370   870.0094   869.9483
          WLS KPCR (σ=0.05, r=23)     387.4919   387.4247   387.4118
          WLS KPCR (σ=0.1,  r=21)     430.6168   430.5629   430.5506
          WLS KPCR (σ=0.5,  r=18)     601.4776   601.4338   601.4254
          WLS KPCR (σ=1,    r=17)     624.2837   624.2512   624.2440
Testing   WLS regression              835.0128   834.8192   835.2938
          WLS KPCR (σ=0.05, r=23)     398.7448   398.7926   398.7979
          WLS KPCR (σ=0.1,  r=21)     463.4775   463.4887   463.4777
          WLS KPCR (σ=0.5,  r=18)     689.8822   689.9098   689.8971
          WLS KPCR (σ=1,    r=17)     721.3939   721.3949   721.3855

    4 CONCLUSIONS

WLS is a technique to be used in the case of a regression model with non-constant variances. However, this technique yields a linear prediction and has no guarantee that it can handle the negative effects of multicollinearity. In this paper, we proposed WLS KPCR to be used in a heteroscedastic regression model and to overcome those limitations. We should notice that WLS KPCR yields a nonlinear prediction and can avoid the effects of multicollinearity in a heteroscedastic regression model. The estimator of the WLS KPCR regression coefficients is the best linear unbiased estimator; the proof of its unbiasedness follows the proof for the unbiased estimator of PCR (see [15]). In our case study, WLS KPCR outperforms WLS regression in a heteroscedastic model.

ACKNOWLEDGEMENT

The authors sincerely thank Universiti Teknologi Malaysia and the Ministry of Higher Education (MOHE) Malaysia for the Research University Grant (RUG) number Q.J130000.7128.02J88. In addition, we also thank the Research Management Center (RMC) of UTM for supporting this research project.

    REFERENCES

[1] S. Chatterjee and A. S. Hadi. Regression Analysis by Example, Fourth Edition. John Wiley and Sons, 2006.

[2] J. J. Faraway. Linear Models with R. Chapman and Hall/CRC, 2005.

[3] L. Hoegaerts, J. A. K. Suykens, J. Vandewalle, and B. De Moor. Subset based least squares subspace regression in reproducing kernel Hilbert space. Neurocomputing, pages 293-323, 2005.

[4] A. M. Jade, B. Srikanth, B. D. Kulkarni, J. P. Jog, and L. Priya. Feature extraction and denoising using kernel PCA. Chemical Engineering Science, 58:4441-4448, 2003.

[5] H. Q. Minh, P. Niyogi, and Y. Yao. Mercer's theorem, feature maps, and smoothing. Lecture Notes in Computer Science, Springer Berlin, 4005/2006, 2006.

[6] D. C. Montgomery, E. A. Peck, and G. G. Vining. Introduction to Linear Regression Analysis. Wiley-Interscience, 2006.

[7] N. R. Draper and H. Smith. Applied Regression Analysis. John Wiley and Sons, 1998.


[8] J. O. Rawlings, S. G. Pantula, and D. A. Dickey. Applied Regression Analysis: A Research Tool. Springer, 1998.

[9] R. Rosipal, M. Girolami, L. J. Trejo, and A. Cichocki. Kernel PCA for feature extraction and de-noising in nonlinear regression. Neural Computing and Applications, pages 231-243, 2001.

[10] R. Rosipal and L. J. Trejo. Kernel partial least squares regression in reproducing kernel Hilbert space. Journal of Machine Learning Research, 2:97-123, 2002.

[11] R. Rosipal, L. J. Trejo, and A. Cichocki. Kernel principal component regression with EM approach to nonlinear principal component extraction. Technical Report, University of Paisley, UK, 2001.

[12] B. Schölkopf, A. Smola, and K. R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.

[13] B. Schölkopf and A. J. Smola. Learning with Kernels. The MIT Press, 2002.

[14] G. A. F. Seber and A. J. Lee. Linear Regression Analysis. John Wiley and Sons, Inc., 2003.

[15] M. S. Srivastava. Methods of Multivariate Statistics. John Wiley and Sons, Inc., 2002.

[16] A. Wibowo and Y. Yamamoto. A note on kernel principal component regression. To appear in Computational Mathematics and Modeling.

Antoni Wibowo is currently working as a senior lecturer in the Faculty of Computer Science and Information Systems in UTM. He received a B.Sc. in Mathematical Engineering from the University of Sebelas Maret (UNS), Indonesia, and an M.Sc. in Computer Science from the University of Indonesia. He also received an M.Eng. and a Dr.Eng. in Systems and Information Engineering from the University of Tsukuba, Japan. His interests are in the fields of computational intelligence, machine learning, operations research and data analysis.

Mohamad Ishak Desa is a professor in the Faculty of Computer Science and Information Systems in UTM. He received a B.Sc. in Mathematics from UKM in Malaysia, along with an advanced diploma in systems analysis from Aston University. He received an M.A. in Mathematics from the University of Illinois and a PhD in operations research from Salford University in the UK. He is currently the Head of the Operation Business Intelligence Research Group in UTM. His interests are operations research, optimization, logistics and supply chain.
