
Nonlinear identification using orthogonal forward regression with nested optimal regularization

Article, Accepted Version

Hong, X., Chen, S., Gao, J. and Harris, C. J. (2015) Nonlinear identification using orthogonal forward regression with nested optimal regularization. IEEE Transactions on Cybernetics, 45 (12). pp. 2925-2936. ISSN 2168-2267 doi: https://doi.org/10.1109/TCYB.2015.2389524 Available at http://centaur.reading.ac.uk/39716/

It is advisable to refer to the publisher’s version if you intend to cite from the work. See Guidance on citing.

To link to this article DOI: http://dx.doi.org/10.1109/TCYB.2015.2389524 

Publisher: IEEE 

Publisher statement: © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. 

All outputs in CentAUR are protected by Intellectual Property Rights law, including copyright law. Copyright and IPR is retained by the creators or other copyright holders. Terms and conditions for use of this material are defined in the End User Agreement.

www.reading.ac.uk/centaur   

CentAUR 

Central Archive at the University of Reading 

Reading’s research outputs online


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.


Nonlinear Identification Using Orthogonal Forward Regression With Nested Optimal Regularization

Xia Hong, Senior Member, IEEE, Sheng Chen, Fellow, IEEE, Junbin Gao, and Chris J. Harris

Abstract—An efficient data-based modeling algorithm for nonlinear system identification is introduced for radial basis function (RBF) neural networks with the aim of maximizing generalization capability based on the concept of leave-one-out (LOO) cross validation. Each of the RBF kernels has its own kernel width parameter, and the basic idea is to optimize the multiple pairs of regularization parameters and kernel widths, each of which is associated with a kernel, one at a time within the orthogonal forward regression (OFR) procedure. Thus, each OFR step consists of one model term selection based on the LOO mean square error (LOOMSE), followed by the optimization of the associated kernel width and regularization parameter, also based on the LOOMSE. Since, like our previous state-of-the-art local regularization assisted orthogonal least squares (LROLS) algorithm, the same LOOMSE is adopted for model selection, our proposed new OFR algorithm is also capable of producing a very sparse RBF model with excellent generalization performance. Unlike our previous LROLS algorithm, which requires an additional iterative loop to optimize the regularization parameters as well as an additional procedure to optimize the kernel width, the proposed new OFR algorithm optimizes both the kernel widths and regularization parameters within the single OFR procedure, and consequently the required computational complexity is dramatically reduced. Nonlinear system identification examples are included to demonstrate the effectiveness of this new approach in comparison to the well-known approaches of support vector machine and least absolute shrinkage and selection operator as well as the LROLS algorithm.

Index Terms—Cross validation (CV), forward regression, identification, leave-one-out (LOO) errors, nonlinear system, regularization.

I. INTRODUCTION

A LARGE class of nonlinear models including some types of neural networks can be classified as linear-in-the-parameters models. These models have provable learning and convergence conditions, are well suited for adaptive learning, and have clear engineering applications [1]–[3].

Manuscript received August 11, 2014; revised November 22, 2014; accepted January 6, 2015. This paper was recommended by Associate Editor L. Rutkowski.

X. Hong is with the School of Systems Engineering, University of Reading, Reading RG6 6AY, U.K.

S. Chen is with Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, U.K., and also with the Faculty of Engineering, King Abdulaziz University, Jeddah 21589, Saudi Arabia (e-mail: [email protected]).

J. Gao is with the School of Computing and Mathematics, Charles Sturt University, Bathurst, NSW 2795, Australia.

C. J. Harris is with Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, U.K.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCYB.2015.2389524

Note, however, that in order to obtain a desired linear-in-the-parameters structure [4]–[7], nonlinear parameters in these models must be determined by other means, as they cannot be estimated by a linear learning process. For example, the radial basis function (RBF) centers and kernel centers in the RBF neural network and the kernel model must be selected or be restricted to the training input data in order to apply the linear learning approaches of [4]–[7]. Furthermore, the RBF or kernel width parameter and other so-called hyperparameters, such as regularization parameters, also have to be learnt, typically via cross validation (CV). One of the main aims of data-based modeling is good generalization, i.e., the model’s capability to approximate accurately the system output for unseen data. The concept of CV is fundamental to the evaluation of model generalization capability [8], and it is widely embedded either in parameter estimation, e.g., tuning the regularization parameter [9]–[11] and forming new parameter estimates [12], or in deriving model selection criteria based on information theoretic principles [13], which regularize the model structure in order to produce sparse models.

The orthogonal least squares (OLS) algorithm, which efficiently constructs sparse models in an orthogonal forward regression (OFR) procedure [14], [15], is a popular modeling tool in neural networks, such as RBF networks [4], [16], neurofuzzy systems [17], [18], and wavelet neural networks [19], [20]. The original OLS algorithm [14] selects regressors by virtue of their contribution to the maximization of the model error reduction ratio. One commonly used version of CV is the leave-one-out (LOO) CV. For linear-in-the-parameters models, the LOO errors can be calculated without actually splitting the training data set and estimating the associated models, by making use of the Sherman–Morrison–Woodbury theorem [21]. By incorporating the OFR framework with the analytical expression of the LOO errors, the LOO mean square error (LOOMSE) was proposed as a model term selection criterion, which can be sequentially optimized within the model construction process [22], [23], enabling the OFR model construction procedure to automatically terminate with a sparse model that generalizes well, without resorting to any other stopping criterion. It is worth emphasizing that, for nonlinear system identification at least, the objective is to obtain sparse models that generalize well. A system engineer is unlikely to accept a huge-size model with many model terms for controller design purposes, even if the model is a faithful representation of the underlying nonlinear system that generates the data. Take the engine data identification of [24], which is also considered in this paper:


no control engineer will use a model with two hundred model terms to design a controller for the lorry engine, but a small model with 20 model terms is acceptable to a control engineer, provided that the model is accurate.

In [6], empirical evidence is provided to suggest that using the LOOMSE to optimize the kernel width and regularization parameter leads to overfitting for the least squares support vector machine (LSSVR), which uses all the training data as its model base. From a model selection viewpoint based on bias and variance analysis, [6] shows that this overfitting phenomenon is due to the resultant LSSVR models having a large variance. We argue that it is very unlikely for the sparse models identified by our local regularization assisted OLS (LROLS) algorithm based on the LOOMSE [23] to overfit, because our model selection framework is very different from the one used in [6]. In fact, the nonsparse LSSVR model with all the training data as its kernels is inherently prone to overfitting. By contrast, our subset model selection based on the LOOMSE criterion constructs a sparse model, exactly aiming to avoid overfitting. While the results of [6] show that the LOOMSE continuously decreases as the iterations of the hyperparameter optimization increase, a classical sign of overfitting, it has been shown in [22] and [23] that in our subset model selection procedure, the LOOMSE reaches an optimal value at a certain model size.

Generally, better model generalization can be achieved using parameter regularization, which penalizes the norm of the parameters. The effect of parameter regularization is to add a small bias in order to gain advantage in combating the problem of large variance, which is often associated with an oversized model structure. Although a very small positive regularization parameter based on the l2-norm is often used to improve the numerical condition, more advanced techniques aim at the regularization design with respect to the given data, leading to a sparser model structure. For example, sparse models can be constructed using the l1-norm penalized cost function, e.g., the basis pursuit or least absolute shrinkage and selection operator (LASSO) [25]–[27]. However, the analytical optimization of the l1-norm regularizer with respect to model generalization is less studied. Alternatively, the l2-norm regularization technique has been incorporated into the OLS algorithm to produce a regularized OLS (ROLS) algorithm that carries out model term selection while reducing the variance of the parameter estimates simultaneously [10], [11]. It has been shown [28], [29] that the l2-norm parameter regularization is equivalent to a maximum a posteriori probability estimate of the parameters from the Bayesian viewpoint by adopting a Gaussian prior for the parameters, leading to an iterative evidence procedure for solving for the optimal regularization parameters [29], [30]. Further adopting the LOOMSE as the model selection criterion leads to our state-of-the-art LROLS algorithm [23], which is capable of producing a very sparse nonlinear model with excellent generalization performance.

Similar to all the other existing regularization assisted sparse model construction algorithms, the optimization of the regularization parameters in the LROLS algorithm [23] includes the OFR procedure as an inner loop, and therefore several iterations of the OFR procedure are needed. More specifically, with all the regularization parameters initially set to a very small positive value, the OFR procedure in the inner loop constructs a sparse model. The regularization parameters are then updated at the outer loop using, for example, the efficient evidence procedure of [30], and the inner-loop OFR algorithm is started again with the new regularization parameters. A few outer-loop iterations, typically no more than 10, will ensure that a set of near optimal regularization parameters is found. Therefore, the required computational complexity is approximately equal to the complexity of a single OFR procedure scaled up by the number of outer-loop iterations for computing near optimal regularization parameters. Moreover, other hyperparameters, such as the RBF widths, must also be optimized in order to ensure an excellent generalization capability. For the RBF or kernel model which employs a single common RBF width for all model regressors, a grid search procedure based on CV is typically used to find a near-optimal RBF width, and this adds an extra third iterative loop to the model construction. Thus, the true total complexity is scaled up by the number of grid searches for the RBF width. For the RBF model where each model regressor has an individual RBF width, the nonlinear search space for all the RBF width parameters becomes too large for a grid-based procedure, and other nonlinear optimization means must be adopted [31]–[33]. It is, therefore, highly desirable that the hyperparameters, including the regularization parameters and RBF widths, can be analytically optimized within the single OFR model construction process. This motivates our current study.

In this paper, we propose a new OFR algorithm in which the optimization of kernel widths and regularization parameters is nested within the OFR procedure, so that only a single OFR iteration is needed for the estimation of these hyperparameters. Specifically, we use the Gaussian RBF model that has an individually tunable kernel width for each kernel, and both the kernel width and the associated regularization parameter are optimized one kernel at a time within the OFR procedure. An l2-norm locally ROLS cost function [23], [28] is used for model parameter estimation, and the resultant LOOMSE measures the model generalization performance. By exploiting the analytical expression of the LOO errors, we derive a new approach for successively estimating the two hyperparameters of RBF width and regularization parameter associated with each selected regressor using the LOOMSE criterion, based on a simple line search and a gradient descent algorithm, respectively, which is embedded naturally in each regressor selection step of the OFR procedure. Consequently, a computationally efficient and fully automated procedure can be achieved without resorting to any other validation data set for iterative model evaluations. Since the same LOOMSE criterion of the LROLS algorithm [23] is adopted, our proposed new OFR algorithm is also capable of producing a similar very sparse RBF model with excellent generalization performance. Unlike our previous LROLS algorithm [23], the newly proposed algorithm constructs a sparse model within a single OFR procedure, and consequently the required computational complexity is dramatically reduced.

This paper is organized as follows. Section II outlines the concept of the OFR using a regularized cost function.


In Section III, we present the nth stage of the OFR model representation and the concept of LOO CV for model term selection using the LOOMSE, followed by a line search for tuning the kernel width and a gradient descent algorithm to estimate the regularization parameter for the selected regressor. Both the line search and the gradient descent algorithm are also based on minimizing the LOOMSE. Section IV presents the entire proposed OFR algorithm with the nested LOOMSE-based optimal regularization, referred to as the OFR + LOOMSE, for constructing sparse models. In Section V, the empirical results demonstrate the effectiveness of our proposed OFR + LOOMSE algorithm. The conclusion is given in Section VI.

II. PRELIMINARIES

Consider the general nonlinear system represented by the nonlinear model [34]
$$y(k) = f\big(y(k-1), \ldots, y(k-n_y), u(k-1), \ldots, u(k-n_u)\big) + v(k) = f(\mathbf{x}(k)) + v(k) \quad (1)$$
where $y(k)$ and $u(k)$ are the system output and control input with the lags $n_y$ and $n_u$, respectively, at sample time index $k$, and $v(k)$ denotes the system white noise, while $m = n_y + n_u$, $\mathbf{x}(k) = [y(k-1) \cdots y(k-n_y)\; u(k-1) \cdots u(k-n_u)]^{\mathrm T} = [x_1(k)\; x_2(k) \cdots x_m(k)]^{\mathrm T} \in \mathbb{R}^m$ denotes the $m$-dimensional system input vector, and $f(\bullet)$ is the unknown system mapping. The unknown nonlinear system (1) is to be identified based on an observation data set $D_N = \{\mathbf{x}(k), y(k)\}_{k=1}^{N}$ using some suitable functional which can approximate $f(\bullet)$ with arbitrary accuracy. Without loss of generality, in this paper, we use the data set $D_N$ to construct an RBF network model of the form
$$\hat{y}^{(M)}(k) = \hat{f}^{(M)}(\mathbf{x}(k)) = \sum_{i=1}^{M} \theta_i \phi_i(\mathbf{x}(k)) \quad (2)$$

where $\hat{y}^{(M)}(k)$ is the model prediction output for the input vector $\mathbf{x}(k)$ based on the $M$-term RBF model, $M$ is the total number of regressors or model terms, and $\theta_i$ are the model weights, while the regressor $\phi_i(\mathbf{x})$ takes the form of the Gaussian basis function given by
$$\phi_i(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{c}_i\|^2}{2\tau_i^2}\right) \quad (3)$$
in which $\mathbf{c}_i = [c_{1,i}\; c_{2,i} \cdots c_{m,i}]^{\mathrm T}$ is the center vector of the $i$th RBF unit and $\tau_i > 0$ is the $i$th RBF unit's width parameter. We assume that each RBF kernel is placed on a training data point, namely, all the RBF center vectors $\{\mathbf{c}_i\}_{i=1}^{M}$ are selected from the training data $\{\mathbf{x}(k)\}_{k=1}^{N}$.

Denote $e^{(M)}(k) = y(k) - \hat{y}^{(M)}(k)$ as the $M$-term modeling error for the input data point $\mathbf{x}(k)$. Over the training data set $D_N$, further denote $\mathbf{y} = [y(1)\; y(2) \cdots y(N)]^{\mathrm T}$, $\mathbf{e}^{(M)} = [e^{(M)}(1)\; e^{(M)}(2) \cdots e^{(M)}(N)]^{\mathrm T}$, and $\mathbf{\Phi}_M = [\boldsymbol{\phi}_1\; \boldsymbol{\phi}_2 \cdots \boldsymbol{\phi}_M]$ with $\boldsymbol{\phi}_n = [\phi_n(\mathbf{x}(1))\; \phi_n(\mathbf{x}(2)) \cdots \phi_n(\mathbf{x}(N))]^{\mathrm T}$, $1 \le n \le M$. We have the $M$-term model in the matrix form of
$$\mathbf{y} = \mathbf{\Phi}_M \boldsymbol{\theta}_M + \mathbf{e}^{(M)} \quad (4)$$

where $\boldsymbol{\theta}_M = [\theta_1\; \theta_2 \cdots \theta_M]^{\mathrm T}$. Let an orthogonal decomposition of the regression matrix $\mathbf{\Phi}_M$ be $\mathbf{\Phi}_M = \mathbf{W}_M \mathbf{A}_M$, where
$$\mathbf{A}_M = \begin{bmatrix} 1 & a_{1,2} & \cdots & a_{1,M} \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & a_{M-1,M} \\ 0 & \cdots & 0 & 1 \end{bmatrix} \quad (5)$$
and
$$\mathbf{W}_M = [\mathbf{w}_1\; \mathbf{w}_2 \cdots \mathbf{w}_M] \quad (6)$$
with $\mathbf{w}_i^{\mathrm T}\mathbf{w}_j = 0$ if $i \neq j$. The regression model (4) can alternatively be expressed as
$$\mathbf{y} = \mathbf{W}_M \mathbf{g}_M + \mathbf{e}^{(M)} \quad (7)$$
where $\mathbf{g}_M = [g_1\; g_2 \cdots g_M]^{\mathrm T}$ satisfies the triangular system $\mathbf{A}_M \boldsymbol{\theta}_M = \mathbf{g}_M$, which can be used to determine $\boldsymbol{\theta}_M$, given $\mathbf{A}_M$ and $\mathbf{g}_M$.

Further consider the following regularized cost function:
$$L_e(\mathbf{\Lambda}_M, \mathbf{g}_M) = \big\|\mathbf{y} - \mathbf{W}_M \mathbf{g}_M\big\|^2 + \mathbf{g}_M^{\mathrm T}\mathbf{\Lambda}_M \mathbf{g}_M \quad (8)$$
where $\mathbf{\Lambda}_M = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_M\}$, which contains the local regularization parameters $\lambda_i > 0$, for $1 \le i \le M$. For a given $\mathbf{\Lambda}_M$, the solution for $\mathbf{g}_M$ can be obtained by setting the derivative vector of $L_e$ to zero, i.e., $\partial L_e / \partial \mathbf{g}_M = \mathbf{0}$, yielding
$$g_i^{(R)} = \frac{\mathbf{w}_i^{\mathrm T}\mathbf{y}}{\mathbf{w}_i^{\mathrm T}\mathbf{w}_i + \lambda_i} \quad (9)$$
for $1 \le i \le M$. Our objective $L_e(\mathbf{\Lambda}_M, \mathbf{g}_M)$ is constructed in the orthogonal space, and the l2-norm parameter constraints are associated with the orthogonal bases $\mathbf{w}_i$, $1 \le i \le M$.
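As a concrete illustration of (3), (5)-(7), and (9), the short sketch below builds the Gaussian regression matrix, orthogonalizes it by Gram-Schmidt, and computes the regularized weights on the orthogonal basis. It is a minimal numerical sketch assuming NumPy; the function and variable names (`gaussian_design_matrix`, `Phi`, `lam`, etc.) are chosen here for illustration and are not from the paper.

```python
import numpy as np

def gaussian_design_matrix(X, centers, widths):
    """Regression matrix Phi with Phi[k, i] = exp(-||x(k) - c_i||^2 / (2 tau_i^2)), cf. (3)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # squared distances, N x M
    return np.exp(-d2 / (2.0 * widths[None, :] ** 2))

def orthogonalize(Phi):
    """Gram-Schmidt decomposition Phi = W A with unit-diagonal upper-triangular A, cf. (5)-(6)."""
    N, M = Phi.shape
    W = Phi.astype(float).copy()
    A = np.eye(M)
    for n in range(M):
        for i in range(n):
            A[i, n] = W[:, i] @ Phi[:, n] / (W[:, i] @ W[:, i])
            W[:, n] -= A[i, n] * W[:, i]
    return W, A

def regularized_weights(W, y, lam):
    """Element-wise solution (9): g_i = w_i' y / (w_i' w_i + lambda_i)."""
    return (W.T @ y) / (np.einsum('ij,ij->j', W, W) + lam)

# toy usage: 50 samples of a 1-D input, every data point doubling as an RBF center
rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(50, 1))
y = np.sinc(X[:, 0] / np.pi) + 0.2 * rng.standard_normal(50)   # sin(x)/x plus noise
Phi = gaussian_design_matrix(X, X, np.full(50, np.sqrt(10.0)))
W, A = orthogonalize(Phi)
g = regularized_weights(W, y, np.full(50, 1e-8))
theta = np.linalg.solve(A, g)   # recover theta from the triangular system A theta = g
```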

III. SIMULTANEOUS REGULARIZATION PARAMETER OPTIMIZATION AND MODEL CONSTRUCTION USING LOOMSE

For each stage of the OFR, we select the regressor with the smallest LOOMSE amongst all the candidate regressors using a common preset kernel width and a very small preset regularization parameter, followed by determining the optimal kernel width and then the optimal regularization parameter associated with the selected regressor.

A. nth Stage OFR Model Representation and LOOMSE

At the nth OFR stage, the nth model term is selected from a candidate pool that is formed according to (3) using the whole set or a subset of the training data as candidate centers and a common preset kernel width $\tau_n = \tau_0$ and a preset regularization parameter $\lambda_n = \lambda_0$ initially. Consider the OFR modeling process that has produced the n-term model. Let us denote the constructed n columns of regressors as $\mathbf{W}_n = [\mathbf{w}_1\; \mathbf{w}_2 \cdots \mathbf{w}_n]$, with $\mathbf{w}_n = [w_n(1)\; w_n(2) \cdots w_n(N)]^{\mathrm T}$. The model output vector of this n-term model is denoted as
$$\hat{\mathbf{y}}^{(n)} = \sum_{i=1}^{n} g_i^{(R)}\mathbf{w}_i = \mathbf{W}_n \mathbf{g}_n^{(R)} \quad (10)$$
where $\mathbf{g}_n^{(R)} = [g_1^{(R)}\; g_2^{(R)} \cdots g_n^{(R)}]^{\mathrm T}$, and the corresponding modeling error vector is given by $\mathbf{e}^{(n)} = \mathbf{y} - \hat{\mathbf{y}}^{(n)}$. Clearly, the nth OFR stage can be represented by
$$\mathbf{e}^{(n-1)} = g_n \mathbf{w}_n + \mathbf{e}^{(n)}. \quad (11)$$

Equation (11) illustrates the fact that the nth OFR stage is simply to fit a one-variable model using the current model residual produced after the (n−1)th stage as the desired system output.

The selection of one regressor from the (M − n + 1) candidate regressors involves generating the (M − n + 1) candidates for $\mathbf{w}_n$, i.e., by making each of the (M − n + 1) candidate regressors orthogonal to the (n − 1) orthogonal basis vectors $\mathbf{w}_i$, $1 \le i \le n-1$, already selected in the previous (n − 1) OFR stages. Then the contributions of the candidate regressors are evaluated based on the idea of the LOO CV outlined below.

Consider the model fitting (11) on $D_N$, where the desired output vector is $\mathbf{e}^{(n-1)}$ with its elements given by $e^{(n-1)}(k)$, $1 \le k \le N$. For notational convenience, we also denote the prediction of a candidate model identified using all the N data points as $\hat{e}^{(n-1)}(k, \lambda_n, \tau_n) = g_n^{(R)} w_n(k)$ at the nth model fitting stage. Suppose that we now sequentially set aside each data point in the estimation set $D_N$ in turn and estimate a model using the remaining (N − 1) data points. The prediction error is calculated on the data point that was removed from the identification. That is, for $k = 1, 2, \ldots, N$, the model is estimated by removing the kth data point from $D_N$, and the output of the model, identified based on $D_N \setminus (\mathbf{x}(k), y(k))$, is computed for the kth data point unused in the identification and is denoted by $\hat{e}^{(n-1,-k)}(k, \lambda_n, \tau_n)$. Then, the LOO prediction error is calculated as
$$e^{(n,-k)}(k, \lambda_n, \tau_n) = e^{(n-1)}(k) - \hat{e}^{(n-1,-k)}(k, \lambda_n, \tau_n). \quad (12)$$
Direct evaluation of $e^{(n,-k)}(k, \lambda_n, \tau_n)$ by splitting the data set requires extensive computational effort. However, as we have shown in [23], these errors can be exactly calculated without actually sequentially splitting the estimation data set. The LOOMSE is then defined as the average of the squared LOO errors, given by $J(\lambda_n, \tau_n) = \frac{1}{N}\sum_{k=1}^{N}\big(e^{(n,-k)}(k, \lambda_n, \tau_n)\big)^2$.

The model generalization contribution from the nth stage of the OFR depends on the selection of a regressor from the (M − n + 1) candidate regressors and the further tuning of the kernel width as well as the regularization parameter optimization based on the updated regressor. We propose that the LOOMSE is used initially for regressor selection and then used for kernel width as well as regularization parameter estimation. To be more specific, firstly we use a very small regularization parameter $\lambda_0$, e.g., $\lambda_0 = 10^{-8}$, for all the (M − n + 1) candidate regressors, and the associated LOOMSE values are calculated and ranked. Then the regressor with the smallest LOOMSE is selected as the nth regressor, denoted as $\boldsymbol{\phi}_n(\tau_0)$. Secondly, for this selected regressor, we first maximize its potential model generalization performance by successively updating $\mathbf{w}_n$ via $\tau_n$ while $\lambda_0$ is fixed, then followed by the associated regularization parameter optimization by minimizing the LOOMSE with respect to $\lambda_n$ when $\tau_n$ is fixed, so that
$$\lambda_n^{\mathrm{opt}} = \arg\min_{\lambda_n}\left\{\min_{\tau_n}\left\{\frac{1}{N}\sum_{k=1}^{N}\big(e^{(n,-k)}(k, \lambda_n, \tau_n)\big)^2\right\}\right\}. \quad (13)$$

An optimization procedure is introduced in Section III-B for this two-variable optimization problem.

B. Successive Kernel Width and Regularization Parameter Estimation via Minimizing LOOMSE

Because the joint optimization of the LOOMSE with respect to both the kernel width and the regularization parameter is very complex, we opt for the following successive optimization procedure: 1) we first hold the regularization parameter at $\lambda_0$ and tune the kernel width $\tau_n$ by directly evaluating the LOOMSE using a line search algorithm and 2) with the obtained optimal $\tau_n$, we then optimize $\lambda_n$ based on the LOOMSE using a gradient descent algorithm.

The model parameter estimator $g_n^{(R)}(\lambda_n, \tau_n)$ can be written as
$$g_n^{(R)}(\lambda_n, \tau_n) = \mathbf{w}_n^{\mathrm T}\mathbf{e}^{(n-1)} / \alpha_n \quad (14)$$
where
$$\alpha_n = \mathbf{w}_n^{\mathrm T}\mathbf{w}_n + \lambda_n. \quad (15)$$
The model residual is then given by
$$e^{(n)}(k, \lambda_n, \tau_n) = e^{(n-1)}(k) - w_n(k)\,\mathbf{w}_n^{\mathrm T}\mathbf{e}^{(n-1)} / \alpha_n \quad (16)$$
and the LOOMSE can be calculated as [23]
$$J(\lambda_n, \tau_n) = \frac{1}{N}\sum_{k=1}^{N}\left(\frac{e^{(n)}(k, \lambda_n, \tau_n)}{\beta_n(k)}\right)^2 \quad (17)$$
where
$$\beta_n(k) = 1 - \sum_{i=1}^{n}\frac{(w_i(k))^2}{\alpha_i} = \beta_{n-1}(k) - \frac{(w_n(k))^2}{\alpha_n}. \quad (18)$$
Note that $\beta_{n-1}(k)$ has already been determined in the previous regression step, and $\beta_0(k) = 1$.
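The closed-form LOOMSE (14)-(18) is cheap to evaluate once the orthogonal regressor is available. The following sketch, a non-authoritative illustration assuming NumPy and using hypothetical names, computes the regularized weight, the residual update, the β recursion, and the resulting LOOMSE for one candidate regressor.

```python
import numpy as np

def loomse_for_regressor(w, e_prev, beta_prev, lam):
    """One evaluation of (14)-(18): returns g, the updated residual e,
    the updated beta, and the LOOMSE J for the candidate orthogonal regressor w."""
    alpha = w @ w + lam                      # (15)
    g = (w @ e_prev) / alpha                 # (14)
    e = e_prev - g * w                       # (16), element-wise
    beta = beta_prev - w ** 2 / alpha        # (18), recursive update of beta_n(k)
    J = np.mean((e / beta) ** 2)             # (17)
    return g, e, beta, J

# toy usage on synthetic data: beta starts at 1 and e^(0) = y
rng = np.random.default_rng(1)
N = 100
y = rng.standard_normal(N)
w = rng.standard_normal(N)                   # an already-orthogonalized candidate column
g, e, beta, J = loomse_for_regressor(w, y, np.ones(N), lam=1e-8)
print(f"LOOMSE of this one-term fit: {J:.4f}")
```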

Further, note that $\boldsymbol{\phi}_n = \boldsymbol{\phi}_n(\tau_n)$ is a functional of $\tau_n$ and
$$\mathbf{w}_n = \boldsymbol{\phi}_n(\tau_n) - \sum_{i=1}^{n-1} a_{i,n}\mathbf{w}_i \quad (19)$$
with $a_{i,n} = \mathbf{w}_i^{\mathrm T}\boldsymbol{\phi}_n(\tau_n) / \mathbf{w}_i^{\mathrm T}\mathbf{w}_i$, $1 \le i \le n-1$. Therefore, we have
$$\mathbf{w}_n = \mathbf{B}_n \boldsymbol{\phi}_n(\tau_n) \quad (20)$$
where $\mathbf{B}_n = \mathbf{B}_{n-1} - \mathbf{w}_{n-1}\mathbf{w}_{n-1}^{\mathrm T} / \mathbf{w}_{n-1}^{\mathrm T}\mathbf{w}_{n-1}$, and $\mathbf{B}_1 = \mathbf{I}$ is the $N \times N$ identity matrix. Clearly, the gradient $\partial J(\lambda_n, \tau_n)/\partial \tau_n$ is very complex. However, if we fix $\lambda_n = \lambda_0$, then the LOOMSE becomes a one-variable function of $\tau_n$ which either increases or decreases as $\tau_n$ increases until it reaches a local minimum. Thus, we can simply use a 1-D line search to solve for $\tau_n$, as shown in Table I.

TABLE I: KERNEL WIDTH PARAMETER UPDATE FOR FIXED λ0

Remark 1: This 1-D line search algorithm is guaranteed to find a near optimal kernel width $\tau_n \in [\varsigma^{I_{t1}}\tau_0,\ \varsigma^{-I_{t1}}\tau_0]$ with no more than $I_{t1}$ iterations, given the learning rate $0 < \varsigma < 1$. Provided that the initial kernel width $\tau_0$ is not too far away from a local minimum, the preset number of iterations $I_{t1}$ can be set to a small value, e.g., 10. In this way, the computational cost of this 1-D line search in each OFR step can be controlled to be negligibly small, compared to the complexity required by selecting a model term in each OFR step. More specifically, it is well known that the complexity of selecting a model term in each OFR step is approximately on the order of $N^2$, denoted as $O(N^2)$ [23], [33]. It is straightforward to verify that the complexity of computing the LOOMSE according to (20) and (15)–(18) is on the order of $O(N)$, and the total complexity of this line search is no more than $I_{t1} \cdot O(N)$. Since $I_{t1} \ll N$, this is still on the order of $O(N)$, which is much smaller than $O(N^2)$.
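To make the nested width update concrete, here is a minimal sketch of a 1-D line search of the kind Table I describes: it shrinks or enlarges τ by the factor ς, keeps the move whenever the LOOMSE (15)-(18) of the one-term fit improves, and stops after at most It1 steps per direction. The exact update schedule of Table I may differ in detail; the function and argument names here are illustrative only.

```python
import numpy as np

def orthogonalize_against(phi, W_selected):
    """Make a candidate regressor orthogonal to the already selected columns, cf. (19)."""
    w = phi.astype(float).copy()
    for i in range(W_selected.shape[1]):
        wi = W_selected[:, i]
        w -= (wi @ phi) / (wi @ wi) * wi
    return w

def line_search_width(X, center, e_prev, beta_prev, W_selected,
                      tau0, lam0=1e-8, shrink=0.95, max_iter=10):
    """Greedy 1-D search over the kernel width of the selected regressor: each trial
    width is scored by the LOOMSE of the one-term fit to the current residual."""
    def score(tau):
        phi = np.exp(-np.sum((X - center) ** 2, axis=1) / (2.0 * tau ** 2))
        w = orthogonalize_against(phi, W_selected)
        alpha = w @ w + lam0                      # (15)
        e = e_prev - (w @ e_prev) / alpha * w     # (14), (16)
        beta = beta_prev - w ** 2 / alpha         # (18)
        return np.mean((e / beta) ** 2)           # (17)

    tau, best = tau0, score(tau0)
    for factor in (shrink, 1.0 / shrink):         # try shrinking tau first, then enlarging
        t = tau
        for _ in range(max_iter):
            J_new = score(t * factor)
            if J_new >= best:                     # stop in this direction once the LOOMSE worsens
                break
            t, best = t * factor, J_new
        tau = t
    return tau, best
```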

Once $\tau_n$ is found, we seek $\lambda_n$ using a gradient descent algorithm. From (16), we have
$$\frac{\partial e^{(n)}(k, \lambda_n, \tau_n)}{\partial \lambda_n} = \frac{w_n(k)\,\mathbf{w}_n^{\mathrm T}\mathbf{e}^{(n-1)}}{\alpha_n^2} = \frac{w_n(k)\,g_n^{(R)}(\lambda_n, \tau_n)}{\alpha_n} \quad (21)$$
and the gradient of the LOOMSE with respect to $\lambda_n$ is given by
$$\frac{\partial J(\lambda_n, \tau_n)}{\partial \lambda_n} = \frac{2}{N}\sum_{k=1}^{N}\frac{e^{(n)}(k, \lambda_n)}{\beta_n(k)}\,\frac{\partial e^{(n,-k)}(k, \lambda_n)}{\partial \lambda_n} \quad (22)$$
where
$$\frac{\partial e^{(n,-k)}(k, \lambda_n)}{\partial \lambda_n} = \frac{\partial e^{(n)}(k, \lambda_n)/\partial \lambda_n}{\beta_n(k)} - \frac{e^{(n)}(k, \lambda_n)}{(\beta_n(k))^2}\,\frac{(w_n(k))^2}{\alpha_n^2} = g_n^{(R)}(\lambda_n)\gamma_n(k) - e^{(n)}(k, \lambda_n)(\gamma_n(k))^2 \quad (23)$$
and
$$\gamma_n(k) = \frac{w_n(k)}{\alpha_n \beta_n(k)}. \quad (24)$$

With $\lambda_n^{\mathrm{old}} = \lambda_0$ and $\lambda_0$ a preset very small positive value, the gradient descent algorithm for minimizing the LOOMSE of (17) is applied as follows:
$$\left\{\begin{array}{l} \lambda_n^{\mathrm{new}} = \max\left\{\lambda_0,\; \lambda_n^{\mathrm{old}} - \eta \cdot \mathrm{sign}\!\left(\left.\frac{\partial J}{\partial \lambda_n}\right|_{\lambda_n = \lambda_n^{\mathrm{old}}}\right)\right\} \\ \lambda_n^{\mathrm{old}} = \lambda_n^{\mathrm{new}} \end{array}\right. \quad (25)$$
for a predetermined number of iterations $I_{t2}$, e.g., $I_{t2} = 20$, where $\eta > 0$ is a very small positive learning rate. Note that $\mathrm{sign}(\partial J(\lambda_n, \tau_n)/\partial \lambda_n)$ is used in (25), indicating that this is a normalized version of the gradient descent algorithm, and a small learning rate $\eta$ will scale well with the search space of $\lambda_n$, irrespective of the actual size of $\partial J(\lambda_n, \tau_n)/\partial \lambda_n$.

Remark 2: This 1-D gradient descent algorithm is guaranteed to converge to a near optimal solution $\lambda_n$, provided that the initial regularization parameter $\lambda_0$ is not too far away from a local minimum. The computational cost of this gradient descent algorithm is low due to the neat expression of (25). Furthermore, the recursion formula for $\beta_n(k)$ given in (18) significantly reduces the cost of the derivative evaluation. Thus, the computational complexity of the above optimal regularization parameter estimator is on the order of $O(N)$, scaled by the number of iterations $I_{t2}$, which is usually much smaller than $N$. Therefore, the computational complexity of this gradient descent algorithm is on the order of $O(N)$, which is negligibly small compared to the complexity $O(N^2)$ required by selecting a model term in each OFR step.

This near optimal regularization parameter estimation, together with the kernel width tuning for the selected regressor, offers the benefit of a further reduction in the LOOMSE for the selected regressor and hence improves the model generalization for the same model terms selected sequentially by the OFR procedure.
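The normalized (sign-based) gradient descent of (21)-(25) can be written in a few lines; the sketch below is a hedged illustration assuming NumPy, with the orthogonal regressor, the previous residual, and the previous β vector already available from the current OFR step, and with illustrative names not taken from the paper.

```python
import numpy as np

def optimize_lambda(w, e_prev, beta_prev, lam0=1e-8, eta=0.02, iters=20):
    """Sign-gradient descent on the LOOMSE with respect to lambda, following (21)-(25)."""
    lam = lam0
    for _ in range(iters):
        alpha = w @ w + lam                          # (15)
        g = (w @ e_prev) / alpha                     # (14)
        e = e_prev - g * w                           # (16)
        beta = beta_prev - w ** 2 / alpha            # (18)
        gamma = w / (alpha * beta)                   # (24)
        dJ = 2.0 / len(w) * np.sum((e / beta) * (g * gamma - e * gamma ** 2))  # (22)-(23)
        lam = max(lam0, lam - eta * np.sign(dJ))     # (25), lambda never drops below lambda_0
    return lam
```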

IV. PROPOSED OFR WITH NESTED OPTIMAL REGULARIZATION USING LOOMSE

The proposed OFR + LOOMSE algorithm is presented below, integrating: 1) the model regressor selection based on minimizing the LOOMSE using a preset regularization parameter from the candidate set; 2) successively updating the kernel width and optimizing the regularization parameter, also based on minimizing the LOOMSE, for the selected model regressor; and 3) the modified Gram-Schmidt orthogonalization procedure [14]. For notational convenience, define

$$\mathbf{\Phi}^{(n-1)} = \left[\mathbf{w}_1 \cdots \mathbf{w}_{n-1}\; \boldsymbol{\phi}_n^{(n-1)} \cdots \boldsymbol{\phi}_M^{(n-1)}\right] \in \mathbb{R}^{N \times M} \quad (26)$$
with $\mathbf{\Phi}^{(0)} = \mathbf{\Phi}_M$. If some of the columns in $\mathbf{\Phi}^{(n-1)}$ are interchanged, it is still referred to as $\mathbf{\Phi}^{(n-1)}$ for notational simplicity. The initial conditions are set as follows: $\mathbf{e}^{(0)} = \mathbf{y}$, $\beta_0(k) = 1$ for $1 \le k \le N$, and the learning rate $\eta$ is a given small positive number, e.g., $\eta = 0.01$. Further denote the kth element of $\boldsymbol{\phi}_j^{(n-1)}$ as $\phi_j^{(n-1)}(k)$.

With the initialization of $\lambda_n^{\mathrm{old}} = \lambda_0$ and $\tau_n = \tau_0$, the nth stage of the OFR procedure is given as follows.

Step 1: Model Term Selection

1) For $n \le j \le M$, calculate
$$\alpha_n^{(j)} = \big(\boldsymbol{\phi}_j^{(n-1)}\big)^{\mathrm T}\boldsymbol{\phi}_j^{(n-1)} + \lambda_n^{\mathrm{old}} \quad (27)$$
$$\beta_n^{(j)}(k) = \beta_{n-1}(k) - \big(\phi_j^{(n-1)}(k)\big)^2 / \alpha_n^{(j)}, \quad 1 \le k \le N \quad (28)$$
$$g_n^{(R,j)} = \big(\boldsymbol{\phi}_j^{(n-1)}\big)^{\mathrm T}\mathbf{e}^{(n-1)} / \alpha_n^{(j)} \quad (29)$$
$$\mathbf{e}^{(n,j)} = \mathbf{e}^{(n-1)} - g_n^{(R,j)}\boldsymbol{\phi}_j^{(n-1)} \quad (30)$$
$$J_n^{(j)} = \frac{1}{N}\sum_{k=1}^{N}\big(e^{(n,j)}(k) / \beta_n^{(j)}(k)\big)^2. \quad (31)$$

2) Find
$$J_n = J_n^{(j_n)} = \min\big\{J_n^{(j)},\; n \le j \le M\big\}. \quad (32)$$
Then the $j_n$th and the nth columns of $\mathbf{\Phi}^{(n-1)}$ are interchanged. The $j_n$th column and the nth column of $\mathbf{A}_M$ are interchanged up to the (n − 1)th row. This effectively selects the nth regressor in the subset model.
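Step 1 amounts to scoring every remaining candidate column with the one-term LOOMSE of (27)-(31) and keeping the best, cf. (32); a vectorized NumPy sketch is given below (the array names `Phi_work`, `e_prev`, `beta_prev`, and `lam0` are illustrative, not from the paper).

```python
import numpy as np

def select_term(Phi_work, e_prev, beta_prev, n, lam0=1e-8):
    """Score candidate columns n..M-1 of the working matrix by (27)-(31)
    and return the index of the column with the smallest LOOMSE, cf. (32)."""
    C = Phi_work[:, n:]                                   # remaining candidates (already orthogonalized)
    alpha = np.einsum('ij,ij->j', C, C) + lam0            # (27), one alpha per candidate
    g = (C.T @ e_prev) / alpha                            # (29)
    E = e_prev[:, None] - C * g[None, :]                  # (30), residual per candidate
    B = beta_prev[:, None] - C ** 2 / alpha[None, :]      # (28), beta per candidate
    J = np.mean((E / B) ** 2, axis=0)                     # (31)
    idx = int(np.argmin(J))                               # (32)
    return n + idx, float(J[idx])
```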

Step 2: Kernel Width Optimization

Apply the algorithm in Table I to find the optimal kernel width parameter $\tau_n$. Form $\boldsymbol{\phi}_n(\tau_n)$ using (3) and update the nth column of $\mathbf{A}_M$ up to the (n − 1)th row as
$$a_{i,n} = \frac{\mathbf{w}_i^{\mathrm T}\boldsymbol{\phi}_n(\tau_n)}{\mathbf{w}_i^{\mathrm T}\mathbf{w}_i}, \quad 1 \le i \le n-1. \quad (33)$$
The modified Gram-Schmidt orthogonalization procedure [14] then calculates the nth row of $\mathbf{A}_M$ and transfers $\mathbf{\Phi}^{(n-1)}$ into $\mathbf{\Phi}^{(n)}$ as follows:
$$\left\{\begin{array}{l} \mathbf{w}_n = \boldsymbol{\phi}_n(\tau_n) - \sum_{i=1}^{n-1} a_{i,n}\mathbf{w}_i \\ a_{n,j} = \mathbf{w}_n^{\mathrm T}\boldsymbol{\phi}_j^{(n-1)} / \mathbf{w}_n^{\mathrm T}\mathbf{w}_n, \quad n+1 \le j \le M \\ \boldsymbol{\phi}_j^{(n)} = \boldsymbol{\phi}_j^{(n-1)} - a_{n,j}\mathbf{w}_n, \quad n+1 \le j \le M. \end{array}\right. \quad (34)$$
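The Gram-Schmidt bookkeeping of (33)-(34) updates both the triangular matrix and the remaining candidate columns; the following sketch performs one such step under the same illustrative naming assumptions as the earlier sketches.

```python
import numpy as np

def mgs_step(Phi_work, A, W_selected, phi_n, n):
    """One orthogonalization step, cf. (33)-(34): orthogonalize the new regressor phi_n
    against the selected columns, store the coupling coefficients in A, and deflate the
    remaining candidate columns against the new basis vector."""
    M = Phi_work.shape[1]
    w_n = phi_n.astype(float).copy()
    for i in range(n):                                   # (33) and the first line of (34)
        A[i, n] = W_selected[:, i] @ phi_n / (W_selected[:, i] @ W_selected[:, i])
        w_n -= A[i, n] * W_selected[:, i]
    denom = w_n @ w_n
    for j in range(n + 1, M):                            # remaining lines of (34)
        A[n, j] = w_n @ Phi_work[:, j] / denom
        Phi_work[:, j] -= A[n, j] * w_n
    Phi_work[:, n] = w_n                                 # keep the orthogonalized column in place
    return w_n
```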

Step 3: Regularization Parameter Optimization

1) For the derived $\mathbf{w}_n$, iterate the following steps $I_{t2}$ times:
$$\alpha_n = \mathbf{w}_n^{\mathrm T}\mathbf{w}_n + \lambda_n^{\mathrm{old}} \quad (35)$$
$$\beta_n(k) = \beta_{n-1}(k) - (w_n(k))^2 / \alpha_n, \quad 1 \le k \le N \quad (36)$$
$$\gamma_n(k) = \frac{w_n(k)}{\alpha_n \beta_n(k)}, \quad 1 \le k \le N \quad (37)$$
$$g_n^{(R)} = \mathbf{w}_n^{\mathrm T}\mathbf{e}^{(n-1)} / \alpha_n \quad (38)$$
$$\mathbf{e}^{(n)} = \mathbf{e}^{(n-1)} - g_n^{(R)}\mathbf{w}_n \quad (39)$$
$$\lambda_n^{\mathrm{new}} = \max\left\{\lambda_0,\; \lambda_n^{\mathrm{old}} - \eta \cdot \mathrm{sign}\!\left(\frac{\partial J(\lambda_n^{\mathrm{old}})}{\partial \lambda_n}\right)\right\} \quad (40)$$
$$\lambda_n^{\mathrm{old}} = \lambda_n^{\mathrm{new}} \quad (41)$$
where $\partial J(\lambda_n)/\partial \lambda_n = (2/N)\sum_{k=1}^{N}\big(e^{(n)}(k)/\beta_n(k)\big)\big(g_n^{(R)}\gamma_n(k) - e^{(n)}(k)(\gamma_n(k))^2\big)$.

2) Update the LOOMSE
$$J_n = \frac{1}{N}\sum_{k=1}^{N}\big(e^{(n)}(k)/\beta_n(k)\big)^2. \quad (42)$$

Termination

The OFR procedure is terminated at the $(n_s + 1)$th stage automatically when the condition $J_{n_s+1} \ge J_{n_s}$ is detected, yielding a subset model with the $n_s$ significant regressors.

Remark 3: The existence of this desired model size $n_s$ is guaranteed. This is because initially the LOOMSE decreases as the model size increases, reaching a minimum value at a certain model size, and then the LOOMSE starts increasing when the model size increases further [22], [23], [30]. Thus, the OFR model selection procedure based on the LOOMSE is guaranteed to automatically “converge” or to terminate at a minimum LOOMSE solution with $n_s$ significant model terms.
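Putting Steps 1-3 and the termination test together, the overall OFR + LOOMSE construction can be organized as the skeleton below. It reuses the hypothetical helpers sketched above (`select_term`, `line_search_width`, `mgs_step`, `optimize_lambda`) and is only an outline of one possible implementation, not the authors' code; all parameter defaults are the example settings quoted in this paper.

```python
import numpy as np

def ofr_loomse(X, y, tau0, lam0=1e-8, eta=0.02, It1=10, It2=30, shrink=0.95):
    """Skeleton of the OFR + LOOMSE procedure: at each stage select a term by LOOMSE,
    tune its width, tune its regularization parameter, and stop once the LOOMSE rises."""
    N = len(y)
    Phi_work = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * tau0 ** 2))
    A = np.eye(N)
    order = np.arange(N)                      # keeps track of column interchanges
    e, beta = y.astype(float).copy(), np.ones(N)
    J_prev, selected = np.inf, []

    for n in range(N):
        j, _ = select_term(Phi_work, e, beta, n, lam0)                  # Step 1, cf. (27)-(32)
        Phi_work[:, [n, j]] = Phi_work[:, [j, n]]                       # column interchanges
        A[:n, [n, j]] = A[:n, [j, n]]
        order[[n, j]] = order[[j, n]]
        tau_n, _ = line_search_width(X, X[order[n]], e, beta,
                                     Phi_work[:, :n], tau0, lam0, shrink, It1)   # Step 2
        phi_n = np.exp(-((X - X[order[n]]) ** 2).sum(-1) / (2 * tau_n ** 2))
        w_n = mgs_step(Phi_work, A, Phi_work[:, :n], phi_n, n)          # (33)-(34)
        lam_n = optimize_lambda(w_n, e, beta, lam0, eta, It2)           # Step 3, cf. (35)-(41)
        alpha = w_n @ w_n + lam_n
        g = (w_n @ e) / alpha
        e, beta = e - g * w_n, beta - w_n ** 2 / alpha
        selected.append((order[n], tau_n, lam_n, g))
        J_n = np.mean((e / beta) ** 2)                                  # (42)
        if J_n >= J_prev:                                               # termination: drop last term
            selected.pop()
            break
        J_prev = J_n
    ns = len(selected)
    return selected, A[:ns, :ns]
```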

The computational complexity of our proposed OFR + LOOMSE algorithm can easily be shown to be
$$C_{\text{OFR+LOOMSE}}^{\text{total}} \approx (n_s + 1) \cdot O\big(N^2 + (I_{t1} + I_{t2})N\big) \approx (n_s + 1) \cdot O\big(N^2\big). \quad (43)$$

Note that, with a fixed common kernel width and given all the fixed regularization parameters, our previous LROLS algorithm has the computational complexity given by [23], [32], and [33]
$$C_{\text{LROLS}}^{\text{single-OFR}} \approx (n_s + 1) \cdot O\big(N^2\big) \quad (44)$$
assuming that the OFR procedure also produces an $n_s$-term model. This confirms that the new OFR + LOOMSE algorithm only requires negligible extra computational cost in comparison to the OFR algorithm with a fixed kernel width and fixed regularization parameters. Further assuming that the iterative loop for updating the regularization parameters requires $I_{\mathrm{rps}}$ iterations and the grid search for tuning the single common kernel width employs $I_{\mathrm{skw}}$ points, then the total complexity of the LROLS algorithm is approximately
$$C_{\text{LROLS}}^{\text{total}} \approx I_{\mathrm{skw}} \cdot I_{\mathrm{rps}} \cdot (n_s + 1) \cdot O\big(N^2\big). \quad (45)$$


Therefore, an approximate complexity reduction factor of $I_{\mathrm{skw}} \cdot I_{\mathrm{rps}}$ is achieved by the proposed new OFR + LOOMSE algorithm over our previous LROLS algorithm.

In the following experimental study section, we will demonstrate that the new OFR + LOOMSE algorithm is capable of obtaining sparse models and excellent generalization performance very similar to those of the state-of-the-art LROLS algorithm.

V. SIMULATION STUDY

Example 1: Consider using an RBF network to approximate the unknown scalar function
$$f(x) = \frac{\sin(x)}{x}. \quad (46)$$
A data set of two hundred points was generated from $y(x) = f(x) + v$, where the input $x$ was uniformly distributed in the range $[-10, 10]$ and the noise $v$ was Gaussian with zero mean and standard deviation 0.2. The noisy data $y(x)$ and the underlying function $f(x)$ are illustrated in Fig. 1(a). The Gaussian function $\phi_i(x) = \exp\big(-(x - c_i)^2/2\tau_i^2\big)$ was used as the basis function to construct an RBF model. All the two hundred data points $\{x\}$ were used as the candidate RBF center set for $\{c_i\}$. For the proposed OFR + LOOMSE algorithm, at each OFR step, the initial kernel width was set to $\tau_0 = \sqrt{10}$, and the initial regularization parameter was set to $\lambda_0 = 10^{-8}$. For the line search algorithm of Table I, the iteration number was $I_{t1} = 10$ and the learning rate $\varsigma = 0.95$. For the 1-D gradient descent algorithm, we set the learning rate to $\eta = 0.02$ and the iteration number to $I_{t2} = 30$. At each OFR step, the near optimal kernel width and regularization parameter found for the selected regressor are shown in Fig. 1(b), while Fig. 1(c) indicates that the proposed OFR + LOOMSE algorithm automatically selects a sparse model of size $n_s = 8$ when $J_n$ reaches the minimum at $n = n_s$. The model predictions $\{\hat{y}(x)\}$ of the resultant eight-term model are also depicted in Fig. 1(a).
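For concreteness, a minimal data-generation sketch matching the setup just described is shown below (NumPy assumed; the random seed is arbitrary and not from the paper).

```python
import numpy as np

# Example 1 setup: 200 noisy samples of f(x) = sin(x)/x on [-10, 10], noise std 0.2
rng = np.random.default_rng(42)                 # seed chosen arbitrarily for illustration
x = rng.uniform(-10.0, 10.0, size=200)
f = np.sinc(x / np.pi)                          # np.sinc(t) = sin(pi t)/(pi t), so this is sin(x)/x
y = f + 0.2 * rng.standard_normal(200)

# every data point doubles as a candidate RBF center; tau0 = sqrt(10), lambda0 = 1e-8
centers, tau0, lam0 = x.copy(), np.sqrt(10.0), 1e-8
```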

Fig. 1. Scalar function modeling example. (a) Noisy data, model prediction, and true function. (b) Values of the kernel width parameters and regularization parameters. (c) Evolution of the LOOMSE.

For comparison, the ε-SVM algorithm [5], the LASSO using the MATLAB function lasso.m with tenfold CV being used to select the associated regularization parameter, and our previous LROLS algorithm [23] were applied to construct models based on the same kernel function set but with a common kernel width τ for every kernel. The MATLAB function quadprog.m was used with the algorithm option set as “interior-point-convex” for the ε-SVM algorithm. The tuning parameters in the ε-SVM algorithm, such as the soft margin parameter C [5], were set empirically so that the best possible result was obtained after several trials. The best kernel width value τ for each of these three algorithms was also empirically set after several trials. The mean square errors (MSEs) of the four resulting models over the noisy data set, defined by $E[(y(x) - \hat{y}(x))^2]$, and the MSEs of these four models over the true function, defined as $E[(\hat{y}(x) - f(x))^2]$, are recorded in Table II, where the expectation $E[\cdot]$ indicates averaging over the data set $\{x\}$. The results of Table II show that the proposed OFR + LOOMSE algorithm achieves similar excellent performance as our previous state-of-the-art LROLS algorithm, in terms of both model sparsity and model generalization. As mentioned in the previous section, the proposed method significantly reduces the total computational cost in constructing a sparse model.

TABLE II: COMPARISON OF THE MODELING PERFORMANCE FOR THE UNKNOWN SCALAR FUNCTION

Example 2: The engine data set [24] consists of the 410 data points of the fuel rack position (input u(k)) and the engine speed (output y(k)), collected from a Leyland TL11 turbocharged, direct injection diesel engine when operated at a low engine speed. Fig. 2(a) depicts the input u(k) and output y(k) of this engine data set. The first 210 data samples were used in training and the last 200 data samples for model testing. The system input vector was given by $\mathbf{x}(k) = [y(k-1)\; u(k-1)\; u(k-2)]^{\mathrm T}$. We used the Gaussian RBF kernel (3), and the candidate set for $\{\mathbf{c}_i\}$ was formed using all the training data samples. For the proposed OFR + LOOMSE algorithm, at each OFR step, the initial kernel width was chosen to be $\tau_0 = 1$ and the initial regularization parameter was set to $\lambda_0 = 10^{-8}$. Furthermore, the line search algorithmic parameters were set to $I_{t1} = 10$ and $\varsigma = 0.95$, while the 1-D gradient descent algorithmic parameters were given by $\eta = 0.02$ and $I_{t2} = 30$. At each OFR step, the near optimal kernel width and regularization parameter found for the selected regressor are shown in Fig. 2(b), while Fig. 2(c) indicates that the proposed OFR + LOOMSE algorithm automatically selects a sparse model of size $n_s = 21$ when $J_n$ reaches the minimum at $n = n_s$.
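The system input vector above is a standard NARX-style lagged regressor; a small sketch of how such input-output pairs could be assembled from recorded series is given below (the helper name and the stand-in series are illustrative, not the actual engine data).

```python
import numpy as np

def build_narx_data(y, u, y_lags, u_lags):
    """Form x(k) = [y(k-1) ... y(k-n_y)  u(k-1) ... u(k-n_u)]^T and the target y(k)."""
    start = max(max(y_lags), max(u_lags))
    X = np.column_stack([np.roll(y, lag)[start:] for lag in y_lags] +
                        [np.roll(u, lag)[start:] for lag in u_lags])
    return X, y[start:]

# engine-data style setup: x(k) = [y(k-1) u(k-1) u(k-2)]^T
# (y_series and u_series stand in for the recorded engine speed and fuel rack position)
y_series = np.random.default_rng(0).standard_normal(410)
u_series = np.random.default_rng(1).standard_normal(410)
X, t = build_narx_data(y_series, u_series, y_lags=[1], u_lags=[1, 2])
# split the constructed pairs, mirroring the 210/200 train/test split described above
X_train, t_train, X_test, t_test = X[:210], t[:210], X[210:], t[210:]
```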

The ε-SVM algorithm [5], the LASSO and our previous LROLS algorithm [23] were used for comparison, based on the same Gaussian kernel set with a common kernel width τ for every kernel. Again, for the ε-SVM, the MATLAB function quadprog.m was used with the option set as “interior-point-convex,” while the soft margin parameter and other tuning parameters were empirically chosen after several trials. For the LASSO, the MATLAB function lasso.m was used. For both the ε-SVM and LASSO, we list the results obtained for a range of kernel width τ values in Table III, in comparison to the results obtained by the proposed OFR + LOOMSE algorithm as well as the LROLS algorithm. The results of Table III show that the ε-SVM can achieve a similar test MSE as the proposed algorithm but requires a very large model, and the LASSO algorithm is unable to yield a comparable performance to our algorithm. Moreover, the OFR + LOOMSE algorithm achieves similar excellent performance as the LROLS algorithm, in terms of both model sparsity and model generalization, while imposing a lower computational cost than the latter, as mentioned in the previous section. It is also worth pointing out that the computational cost of the proposed OFR + LOOMSE algorithm is significantly lower than that of the ε-SVM algorithm and the LASSO algorithm, even when these two algorithms were given a fixed kernel width τ. For example, using MATLAB on a computer with an Intel Core i7-3770K CPU, the recorded running time for the proposed algorithm was 0.972248 s, but the LASSO with τ = 0.1 took 26.24080 s.

Example 3: The liquid level data set was collected from a nonlinear liquid level plant, which consisted of a dc water pump feeding a conical flask which in turn fed a square tank. The system input u(k) was the voltage to the pump motor and the system output y(k) was the water level in the conical flask.

Fig. 2. Engine data set modeling example. (a) System input u(k) and output y(k). (b) Values of the kernel width parameters and regularization parameters. (c) Evolution of the LOOMSE.

A description of this nonlinear process can be found in [35], and Fig. 3(a) shows the 1000 data samples of the data set used in this experiment. From the data set, 1000 data points $\{\mathbf{x}(k), y(k)\}$ were constructed with
$$\mathbf{x}(k) = [y(k-1)\; y(k-2)\; y(k-3)\; u(k-1)\; u(k-2)\; u(k-3)\; u(k-4)]^{\mathrm T}. \quad (47)$$
The first 500 pairs of the data were used for training and the remaining 500 pairs for testing the constructed model.


TABLE III: COMPARISON OF MODELING PERFORMANCE FOR THE ENGINE DATA SET

The Gaussian RBF kernel (3) was employed, and the candidate set for $\{\mathbf{c}_i\}$ was formed using all the training data points. The initial kernel width was chosen as $\tau_0 = 2.5$ and the initial regularization parameter was set to $\lambda_0 = 10^{-8}$. For the line search algorithm of Table I, the iteration number was $I_{t1} = 10$ and the learning rate $\varsigma = 0.95$, while for the 1-D gradient descent algorithm, we set the learning rate to $\eta = 0.02$ and the iteration number to $I_{t2} = 30$. At each OFR step, the near optimal kernel width and regularization parameter found for the selected regressor are depicted in Fig. 3(b), while Fig. 3(c) indicates that the OFR + LOOMSE algorithm automatically selects a sparse model of size $n_s = 26$ when $J_n$ reaches the minimum at $n = n_s$.

For comparison, the ε-SVM algorithm [5], the LASSO algorithm and the LROLS algorithm [23] were also applied based on the same Gaussian kernel set but with a common kernel width τ for every kernel. The ε-SVM was implemented using the MATLAB function quadprog.m with the algorithm option set as “interior-point-convex,” and the soft margin and other tuning parameters were empirically chosen after several trials. For the LASSO, the MATLAB function lasso.m was used with tenfold CV for selecting the associated regularization parameter. For both the ε-SVM and LASSO algorithms, we list the results obtained for a range of kernel width τ values in Table IV, in comparison to the results obtained by our proposed OFR + LOOMSE algorithm and our previous LROLS algorithm. The modeling results of Table IV again demonstrate that the proposed OFR + LOOMSE algorithm has comparable performance to the LROLS algorithm [23], in terms of both model generalization and model sparsity. The ε-SVM algorithm is able to achieve a similar test MSE as the proposed OFR + LOOMSE algorithm but requires a very large model. The LASSO algorithm is able to obtain a sparser model but yields a poorer test MSE. Note that the proposed OFR + LOOMSE algorithm is computationally much more efficient than the ε-SVM and LASSO algorithms. Using MATLAB on a computer with an Intel Core i7-3770K CPU, the recorded running time for the proposed algorithm, which includes optimizing the kernel width $\tau_n$ at each OFR step, was 3.361728 s, compared to 26.685302 s required by the LASSO with the kernel width fixed to τ = 3.

Fig. 3. Liquid level data set modeling example. (a) System input u(k) and output y(k). (b) Values of the kernel width parameters and regularization parameters. (c) Evolution of the LOOMSE.

Example 4: The gas furnace data set (the time series J in [36]) contained 296 pairs of input-output points, as depicted in Fig. 4(a), where the input u(k) was the coded input gas feed rate and the output y(k) represented the CO2 concentration from the gas furnace. The Gaussian RBF kernel (3) was employed with the system input vector given by
$$\mathbf{x}(k) = [y(k-1)\; y(k-2)\; y(k-3)\; u(k-1)\; u(k-2)\; u(k-3)]^{\mathrm T}. \quad (48)$$

TABLE IV: COMPARISON OF MODELING PERFORMANCE FOR THE LIQUID LEVEL DATA SET

From Fig. 4(a), it can be seen that the second half of the data set is significantly different from the first half. Therefore, we used the even-number pairs of $\{\mathbf{x}(k), y(k)\}$ for training and the odd-number pairs for testing. The candidate RBF center set $\{\mathbf{c}_i\}$ was formed using all the training data samples. The initial kernel width was given by $\tau_0 = \sqrt{1000}$ and the initial regularization parameter was set to $\lambda_0 = 10^{-8}$. For the line search algorithm of Table I, the iteration number was $I_{t1} = 10$ and the learning rate $\varsigma = 0.95$, while the 1-D gradient descent algorithm used the learning rate of $\eta = 0.02$ and the iteration number of $I_{t2} = 30$. At each OFR step, the optimal kernel width and regularization parameter found for the selected regressor are shown in Fig. 4(b), while Fig. 4(c) indicates that the proposed OFR + LOOMSE algorithm automatically selects a sparse model of size $n_s = 12$ when $J_n$ reaches the minimum at $n = n_s$.

The ε-SVM algorithm [5], the LASSO algorithm and the LROLS algorithm [23] were used based on the same Gaussian kernel set but with a common kernel width τ for every kernel, for a comparison. The results of the ε-SVM algorithm and LASSO algorithm based on the common kernel variance of $\tau^2$ = 500, 1000, and 2000, respectively, are compared with the results obtained by the OFR + LOOMSE and LROLS algorithms in Table V. The modeling results of Table V demonstrate that for this example the proposed OFR + LOOMSE algorithm attains the best test MSE performance.

Example 5: The Boston housing data set, available at the UCI repository [37], comprised 506 data points with 14 variables. We performed the task of predicting the median house value from the remaining 13 attributes. We randomly selected 456 data points from the data set for training and used the remaining 50 data points to form the test set. Average results were given over 100 realizations. For each realization, the 13 input attributes were normalized so that each attribute has zero mean and a standard deviation of one. The Gaussian RBF model was constructed from the 456 candidate regressors of each realization. For the proposed OFR + LOOMSE algorithm, at each OFR step, the initial kernel width was set to $\tau_0 = 3$ and the initial regularization parameter was set to $\lambda_0 = 10^{-8}$, while the learning rates $\eta = 0.002$ and $\varsigma = 0.95$ as well as the iteration numbers $I_{t1} = 10$ and $I_{t2} = 50$ were used.

Fig. 4. Gas furnace data set example. (a) System input u(k) and output y(k). (b) Values of the kernel width parameters and regularization parameters. (c) Evolution of the LOOMSE.

TABLE V: COMPARISON OF MODELING PERFORMANCE FOR THE GAS FURNACE DATA SET

TABLE VI: COMPARISON OF MODELING PERFORMANCE FOR THE BOSTON HOUSE DATA SET. THE RESULTS WERE AVERAGED OVER 100 REALIZATIONS AND GIVEN AS MEAN ± STANDARD DEVIATION, AND THE RESULTS FOR THE ε-SVM AND THE LROLS ARE QUOTED FROM [32]

In Table VI, the modeling performance of the proposed OFR + LOOMSE algorithm is compared with the results of the ε-SVM [5] and the LROLS [23], which are quoted from [32]. We also experimented with the LASSO algorithm supplied by the MATLAB function lasso.m, with tenfold CV used to select the associated regularization parameter. For the LASSO, a common kernel width τ was set for constructing the kernel model from the 456 candidate regressors of each realization, and a range of τ values was experimented with. As shown in Table VI, the proposed OFR + LOOMSE algorithm is very competitive in terms of model size and model generalization capability, compared with the LROLS algorithm and the LASSO algorithm. As analyzed in Section IV, the computational cost of the OFR + LOOMSE algorithm is lower than that of the LROLS algorithm. Also, the running time of the OFR + LOOMSE algorithm over 100 realizations was recorded as 126.02 s, much faster than the LASSO algorithm with τ = 3, which took 1259.9 s.

VI. CONCLUSION

We have shown how to optimize the pairs of kernel widths and regularization parameters, one pair at a time, within the OFR procedure with the aim of maximizing the model generalization capability, by proposing a new OFR + LOOMSE algorithm based on the Gaussian RBF model with tunable kernel widths for nonlinear system identification applications. During each stage of the OFR, the LOOMSE is used as the selection criterion to select one regressor from the candidate set, while the kernel width of the selected regressor is tuned, followed by optimizing the associated regularization parameter for the tuned regressor, both also based on the LOOMSE. Since the kernel width parameters as well as the regularization parameters are optimized within the OFR algorithm, there is no need to iteratively run the OFR for choosing these parameters, as many other existing schemes do, including our previous state-of-the-art LROLS algorithm. Computational analysis has shown that the total computational cost of the proposed OFR + LOOMSE algorithm is lower than that of the LROLS algorithm, which is well-known to be a highly efficient method for constructing sparse models that generalize well. Five examples, including three real-world nonlinear system identification applications, have been included to demonstrate the effectiveness of this new data modeling approach, in comparison to several existing approaches, in terms of model generalization performance and model sparsity. Although the proposed algorithm is presented based on the Gaussian RBF model, it can be applied to other types of linear-in-the-parameters models.

REFERENCES

[1] A. E. Ruano, Ed., Intelligent Control Systems Using Computational Intelligence Techniques. London, U.K.: IEE Publishing, 2005.

[2] R. Murray-Smith and T. A. Johansen, Multiple Model Approaches to Modeling and Control. Bristol, PA, USA: Taylor and Francis, 1997.

[3] S. G. Fabri and V. Kadirkamanathan, Functional Adaptive Control: An Intelligent Systems Approach. London, U.K.: Springer, 2001.

[4] S. Chen, C. F. N. Cowan, and P. M. Grant, "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Trans. Neural Netw., vol. 2, no. 2, pp. 302–309, Mar. 1991.

[5] S. R. Gun, "Support vector machines for classification and regression," Dept. Electron. Comput. Sci., ISIS Group, University of Southampton, Southampton, U.K., May 1998.

[6] G. C. Cawley and N. L. C. Talbot, "On over-fitting in model selection and subsequent selection bias in performance evaluation," J. Mach. Learn. Res., vol. 11, pp. 2079–2107, Jul. 2010.

[7] D. Du, K. Li, X. Li, M. Fei, and H. Wang, "A multi-output two-stage locally regularized model construction method using the extreme learning machine," Neurocomputing, vol. 128, pp. 104–112, Mar. 2014.

[8] M. Stone, "Cross-validatory choice and assessment of statistical predictions," J. Royal Stat. Soc. B, vol. 36, no. 2, pp. 111–147, 1974.

[9] G. H. Golub, M. Heath, and G. Wahba, "Generalized cross-validation as a method for choosing a good ridge parameter," Technometrics, vol. 21, no. 2, pp. 215–223, 1979.

[10] S. Chen, Y. Wu, and B. L. Luk, "Combined genetic algorithm optimization and regularized orthogonal least squares learning for radial basis function networks," IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 1239–1243, Sep. 1999.

[11] M. J. L. Orr, "Regularisation in the selection of radial basis function centers," Neural Comput., vol. 7, no. 3, pp. 606–623, 1995.

[12] X. Hong and S. A. Billings, "Parameter estimation based on stacked regression and evolutionary algorithms," IEE Proc. Control Theory Appl., vol. 146, no. 5, pp. 406–414, Sep. 1999.

[13] L. Ljung and T. Glad, Modeling of Dynamic Systems. Englewood Cliffs, NJ, USA: Prentice Hall, 1994.

[14] S. Chen, S. A. Billings, and W. Luo, "Orthogonal least squares methods and their applications to non-linear system identification," Int. J. Control, vol. 50, no. 5, pp. 1873–1896, 1989.

[15] M. J. Korenberg, "Identifying nonlinear difference equation and functional expansion representations: The fast orthogonal algorithm," Ann. Biomed. Eng., vol. 16, no. 1, pp. 123–142, 1988.

[16] D. Du, X. Li, M. Fei, and G. W. Irwin, "A novel locally regularized automatic construction method for RBF neural models," Neurocomputing, vol. 98, pp. 4–11, Dec. 2012.

[17] L. Wang and J. M. Mendel, "Fuzzy basis functions, universal approximation, and orthogonal least-squares learning," IEEE Trans. Neural Netw., vol. 3, no. 5, pp. 807–814, Sep. 1992.

[18] X. Hong and C. J. Harris, "Neurofuzzy design and model construction of nonlinear dynamical processes from data," IEE Proc. Control Theory Appl., vol. 148, no. 6, pp. 530–538, Nov. 2001.

[19] Q. Zhang, "Using wavelet network in nonparametric estimation," IEEE Trans. Neural Netw., vol. 8, no. 2, pp. 227–236, Mar. 1997.

[20] S. A. Billings and H. L. Wei, "The wavelet-NARMAX representation: A hybrid model structure combining polynomial models with multiresolution wavelet decompositions," Int. J. Syst. Sci., vol. 36, no. 3, pp. 137–152, 2005.

[21] R. H. Myers, Classical and Modern Regression with Applications, 2nd ed. Boston, MA, USA: PWS-KENT, 1990.

[22] X. Hong, P. M. Sharkey, and K. Warwick, "Automatic nonlinear predictive model construction using forward regression and the PRESS statistic," IEE Proc. Control Theory Appl., vol. 150, no. 3, pp. 245–254, May 2003.

[23] S. Chen, X. Hong, C. J. Harris, and P. M. Sharkey, "Sparse modeling using orthogonal forward regression with PRESS statistic and regularization," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 2, pp. 898–911, Apr. 2004.

[24] S. A. Billings, S. Chen, and R. Backhouse, "The identification of linear and nonlinear models of a turbocharged automotive diesel engine," Mech. Syst. Signal Process., vol. 3, no. 2, pp. 123–142, 1989.

[25] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.

[26] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Royal Stat. Soc. B, vol. 58, no. 1, pp. 267–288, 1996.

[27] B. Efron, I. Johnstone, T. Hastie, and R. Tibshirani, "Least angle regression," Ann. Stat., vol. 32, no. 2, pp. 407–451, 2004.

[28] S. Chen, "Locally regularized orthogonal least squares algorithm for the construction of sparse kernel regression models," in Proc. 6th Int. Conf. Signal Process., Beijing, China, 2002, pp. 1229–1232.

[29] D. J. C. MacKay, "Bayesian methods for adaptive models," Ph.D. dissertation, Dept. Comput. Neural Syst., California Inst. Technol., Pasadena, CA, USA, 1992.

[30] S. Chen, X. Hong, and C. J. Harris, "Sparse kernel regression modeling using combined locally regularized orthogonal least squares and D-optimality experimental design," IEEE Trans. Autom. Control, vol. 48, no. 6, pp. 1029–1036, Jun. 2003.

[31] S. Chen, X. Hong, and C. J. Harris, "Non-linear system identification using particle swarm optimization tuned radial basis function models," Int. J. Bio-Ins. Comput., vol. 1, no. 4, pp. 246–258, 2009.

[32] S. Chen, X. Hong, and C. J. Harris, "Construction of tunable radial basis function networks using orthogonal forward selection," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 2, pp. 457–466, Apr. 2009.

[33] S. Chen, X. Hong, and C. J. Harris, "Particle swarm optimization aided orthogonal forward regression for unified data modeling," IEEE Trans. Evol. Comput., vol. 14, no. 4, pp. 477–499, Aug. 2010.

[34] S. Chen and S. A. Billings, "Representation of nonlinear systems: The NARMAX model," Int. J. Control, vol. 49, no. 3, pp. 1013–1032, 1989.

[35] S. A. Billings and W. Voon, "A prediction-error and stepwise-regression estimation algorithm for nonlinear systems," Int. J. Control, vol. 44, no. 3, pp. 803–822, 1986.

[36] G. E. P. Box and G. M. Jenkins, Time Series Analysis, Forecasting and Control. San Francisco, CA, USA: Holden-Day, 1976.

[37] A. Frank and A. Asuncion. (2010). UCI Machine Learning Repository. [Online]. Available: http://archive.ics.uci.edu/ml

Xia Hong (SM'02) received the B.Sc. and M.Sc. degrees from the National University of Defense Technology, Changsha, China, in 1984 and 1987, respectively, and the Ph.D. degree from the University of Sheffield, Sheffield, U.K., in 1998, all in automatic control.

She was a Research Assistant at the Beijing Institute of Systems Engineering, Beijing, China, from 1987 to 1993. She was a Research Fellow at the Department of Electronics and Computer Science, University of Southampton, Southampton, U.K., from 1997 to 2001. She is currently a Professor with the School of Systems Engineering, University of Reading, Reading, U.K. She is actively engaged in research of nonlinear systems identification, data modeling, estimation and intelligent control, neural networks, pattern recognition, learning theory, and their applications. She has published over 200 research papers, and co-authored a research book.

Prof. Hong was the recipient of the Donald Julius Groen Prize by IMechE in 1999.

Sheng Chen (M'90–SM'97–F'08) received the B.Eng. degree from the East China Petroleum Institute, Dongying, China, and the Ph.D. degree from the City University London, London, U.K., in 1982 and 1986, respectively, both in control engineering, and the D.Sc. degree from the University of Southampton, Southampton, U.K., in 2005.

From 1986 to 1999, he held research and academic appointments at the University of Sheffield, Sheffield, U.K., the University of Edinburgh, Edinburgh, U.K., and the University of Portsmouth, Portsmouth, U.K. Since 1999, he has been with the Department of Electronics and Computer Science, University of Southampton, where he is currently a Professor of Intelligent Systems and Signal Processing. His current research interests include adaptive signal processing, wireless communications, modeling and identification of nonlinear systems, neural network and machine learning, intelligent control system design, evolutionary computation methods, and optimization. He has published over 500 research papers. He was an ISI highly cited researcher in engineering in 2004.

Dr. Chen is a Distinguished Adjunct Professor at King Abdulaziz University, Jeddah, Saudi Arabia. He is a fellow of the IET. He was elected as a fellow of the U.K. Royal Academy of Engineering in 2014.

Junbin Gao received the B.Sc. degree in computational mathematics from the Huazhong University of Science and Technology (HUST), Wuhan, China, and the Ph.D. degree from the Dalian University of Technology, Dalian, China, in 1982 and 1991, respectively.

He is a Professor of Computing Science with the School of Computing and Mathematics, Charles Sturt University, Bathurst, NSW, Australia. He was a Senior Lecturer and a Lecturer of Computer Science at the University of New England, Armidale, NSW, Australia, from 2001 to 2005. From 1982 to 2001, he was an Associate Lecturer, a Lecturer, an Associate Professor, and a Professor at the Department of Mathematics, HUST. His current research interests include machine learning, data mining, Bayesian learning and inference, and image analysis.

Chris J. Harris received the B.Sc. degree from the University of Leicester, Leicester, U.K., the M.A. degree from the University of Oxford, Oxford, U.K., and the Ph.D. and D.Sc. degrees from the University of Southampton, Southampton, U.K., in 1972 and 2001, respectively.

He is an Emeritus Research Professor with the University of Southampton. He held senior academic appointments at Imperial College, London, U.K., the University of Oxford, and the University of Manchester, Manchester, U.K. He was a Deputy Chief Scientist for the U.K. Government. He has co-authored over 450 scientific research papers.

Prof. Harris was the recipient of the IEE Senior Achievement Medal for data fusion research and the IEE Faraday Medal for distinguished international research in machine learning. He was elected as a fellow of the U.K. Royal Academy of Engineering in 1996.