2004 IEEE 11th Digital Signal Processing Workshop & IEEE Signal Processing Education Workshop
BAYESIAN ESTIMATION OF NON-STATIONARY AR MODEL PARAMETERS VIA AN UNKNOWN FORGETTING FACTOR
Václav Šmídl
ÚTIA, Czech Academy of Sciences, Prague, Czech Republic
[email protected]
ABSTRACT
We study Bayesian estimation of the time-varying parameters of a non-stationary AR process. This is traditionally achieved via exponential forgetting. A numerically tractable solution is available if the forgetting factor is known a priori. This assumption is now relaxed. Instead, we propose joint Bayesian estimation of the AR parameters and the unknown forgetting factor. The posterior distribution is intractable, and is approximated using the Variational-Bayes (VB) method. Improved parameter tracking is revealed in simulation.
1. INTRODUCTION
The standard Bayesian approach to on-line estimation of AutoRegressive (AR) model parameters leads to the classical solution in terms of the normal equations [1]. These results, reviewed in Section 2, can be extended to cope with non-stationary AR processes by using a known forgetting factor [2]. The resulting recursive updates involve the same dyadic form (Section 3) as in the stationary case. This paper confronts the difficulty associated with selecting a value for the forgetting factor. Using Bayesian principles, the forgetting factor is included as a random variable, to be estimated at each step (Section 4). A Variational Bayes (VB) approximation of the intractable posterior distribution restores the dyadic update structure of the conventional algorithm, with the forgetting factor replaced by its posterior expectation. The overhead implied by the procedure is the need to evaluate this expected value via an iterative algorithm, as explained in Section 4.2. In Section 5, a simulation is presented, involving estimation of time-varying AR parameters. The proposed approach appears to offer significant improvements in tracking, compared to a fixed known forgetting factor. In Section 6, the approach is discussed in the context of currently available solutions in the literature.
0-7803-8434-2/04/$20.00 ©2004 IEEE
Anthony Quinn
Trinity College Dublin, Ireland
2. BAYESIAN ESTIMATION OF STATIONARY AR MODEL PARAMETERS
A scalar univariate AR process is first studied:
$$x_n = -\sum_{k=1}^{p} a_k x_{n-k} + \sigma e_n. \qquad (1)$$
Here, x_n denotes the AR process observed at times n = 1, 2, 3, ..., and σe_n is the modelling error. The unknown parameters are θ = [a', σ]', where a = [a_1, ..., a_p]'. The model order, p, is assumed known in this paper. The classical solution to estimation of θ is based on the Wiener Minimum Mean Squared Error (MMSE) criterion [1], leading to the so-called normal equations. The standard Bayesian approach is based on the assumption that e_n in (1) is white noise with Gaussian distribution; i.e. f(e_n) = N(0, 1). Then:
$$f(x_n | a, \sigma, x_n) = \mathcal{N}\left(-a' x_n, \sigma^2\right), \qquad (2)$$

where n > p, and x_n = [x_{n-1}, ..., x_{n-p}]' is the vector of regressors. On-line estimation requires the posterior distribution of θ to be elicited at each time n. Bayes' rule, in recursive form, is used to update the posterior distribution of parameters from time n − 1 to n:

$$f(\theta | X_n) \propto f(x_n | \theta, X_{n-1})\, f(\theta | X_{n-1}), \qquad (3)$$

where X_n = [x_1, ..., x_n] denotes the data history at time n, and X_0 = ∅, by assignment.
If the required prior, f(θ), is chosen to be conjugate to the observation model (2) [3], then the functional forms of the prior and posterior are identical under update (3). Since the model (2) belongs to the exponential family, both a conjugate prior and sufficient statistics are available, of the Normal-inverse-Gamma (NiG) type [3]:
$$\mathcal{NiG}_{a,\sigma}(V, \nu) = \zeta_{NiG}^{-1}(V, \nu)\, \sigma^{-\nu} \exp\left\{ -\tfrac{1}{2}\sigma^{-2}\, [-1, a']\, V\, [-1, a']' \right\}, \qquad (4)$$

$$\zeta_{NiG}(V, \nu) = \Gamma(0.5\nu)\, \lambda^{-0.5\nu}\, |V_{aa}|^{-0.5}\, 2^{0.5p}, \qquad (5)$$

$$V = \begin{bmatrix} V_{11} & V_{a1}' \\ V_{a1} & V_{aa} \end{bmatrix}, \qquad (6)$$

where (6) denotes the partitioning of V ∈ R^{(p+1)×(p+1)} into blocks, with V_{11} being the (1,1) element, and λ = V_{11} − V_{a1}' V_{aa}^{-1} V_{a1}. V, ν are the sufficient statistics of NiG_{a,σ}(·). |·| denotes the matrix determinant.
The statistics of the conjugate prior distribution, V_0, ν_0, are chosen to reflect our initial knowledge of the parameters. If we do not have any preference, we use a diffuse distribution. Typically V_0 = ε_1 I_{p+1}, ν_0 = ε_2, where I_{p+1} is the (p+1) × (p+1) identity matrix, and ε_1, ε_2 are small positive scalars. Substituting (2) into (3) and invoking (4) at time n − 1, the posterior distribution at time n > p is
$$f(\theta | X_n) = \mathcal{NiG}_{a,\sigma}(V_n, \nu_n), \qquad (7)$$

$$V_n = V_{n-1} + \bar{x}_n \bar{x}_n', \qquad (8)$$

$$\nu_n = \nu_{n-1} + 1. \qquad (9)$$
Here, x̄_n = [x_n, x_n']' is the extended regression vector. (8) will be called a dyadic update in this paper. Since the recursion begins at n = p + 1, we choose V_p = V_0 and ν_p = ν_0. This is equivalent to choosing the distribution on parameters to be invariant for n ≤ p.
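The dyadic update (8,9) and the resulting point estimate of a (10) are straightforward to implement. The following NumPy sketch is illustrative only: the variable names are ours, the diffuse prior values and the simulated AR(2) example are arbitrary assumptions, and the regressors are sign-flipped so that the least-squares solve returns a in the sign convention of (1).

```python
import numpy as np

def dyadic_update(V, nu, x_n, regressors):
    """One stationary update (8,9): V_n = V_{n-1} + xbar xbar', nu_n = nu_{n-1} + 1."""
    xbar = np.concatenate(([x_n], regressors))  # extended regression vector [x_n, x_n']'
    return V + np.outer(xbar, xbar), nu + 1.0

def posterior_mean_a(V):
    """Expected regression coefficients from the partitioned statistic V, cf. (10):
    with V = [[V_11, V_a1'], [V_a1, V_aa]], a_hat = V_aa^{-1} V_a1."""
    V_a1 = V[1:, 0]
    V_aa = V[1:, 1:]
    return np.linalg.solve(V_aa, V_a1)

# Illustrative run: stationary AR(2), x_n = -a' [x_{n-1}, x_{n-2}]' + sigma e_n.
rng = np.random.default_rng(0)
a_true = np.array([-0.9, 0.5])           # stable: roots have magnitude sqrt(0.5)
x = np.zeros(500)
for n in range(2, 500):
    x[n] = -a_true @ x[n-2:n][::-1] + 0.1 * rng.standard_normal()

V, nu = 1e-6 * np.eye(3), 1e-6           # diffuse prior statistics V_0, nu_0
for n in range(2, 500):
    # feed -[x_{n-1}, x_{n-2}] so that a_hat estimates a directly
    V, nu = dyadic_update(V, nu, x[n], -x[n-2:n][::-1])
a_hat = posterior_mean_a(V)
```

With 500 samples and low noise, a_hat recovers a_true closely, illustrating that the recursion accumulates exactly the normal-equation statistics.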
The following moments of (7) will be required later in the paper:
$$\hat{a}_n = V_{aa;n}^{-1} V_{a1;n}, \qquad (10)$$

$$\widehat{\sigma^{-2}}_n = (\nu_n - p - 1)\, \lambda_n^{-1}, \qquad (11)$$

$$\widehat{\ln \sigma}_n = \tfrac{1}{2}\left( \ln(0.5\lambda_n) - \psi\left(0.5(\nu_n - p - 1)\right) \right), \qquad (12)$$

where, for example, â_n denotes the expected value of a with respect to the distribution f(a|X_n), and λ_n = V_{11;n} − V_{a1;n}' V_{aa;n}^{-1} V_{a1;n}. In (12), ψ(·) denotes the Digamma function [4].
3. ESTIMATION FOR A NON-STATIONARY AR MODEL USING A KNOWN FORGETTING FACTOR
The stationarity assumption, above, is rarely met in practice. Ideally, then, a model of parameter variations is required, using, for example, a hidden state variable modelling approach [5] and the Kalman filter. When such a model is not available, the estimation problem is under-determined, obviating the full Bayesian solution. The standard batch (off-line) algorithm uses windowing. Alternatively, the concept of forgetting [2] is used in adaptive signal processing [6], recursive estimation [7], and in particle filtering approaches [8].
A Bayesian treatment of forgetting was developed in [9]. There, the missing model of parameter evolution is optimally approximated (in the sense of minimum Kullback-Leibler distance) via a probabilistic operator:

$$f(\theta_n | X_{n-1}) \propto \left[ f(\theta_{n-1} | X_{n-1}) \right]_{\theta_n}^{\phi_n}\, \left[ \bar{f}(\theta_n) \right]^{1-\phi_n}. \qquad (13)$$
The notation [f(·)]_{θ_n} indicates the replacement of the argument of f(·) by θ_n, where θ_n is the time-varying unknown parameter set at time n. f̄(·) is a pre-selected alternative distribution, expressing auxiliary knowledge about θ_n at time n. The coefficient φ_n, 0 ≤ φ_n ≤ 1, is known as a forgetting factor. Note that the dependence of (13) on θ_{n-1} is replaced by dependence on the sequence φ_i, i = 1, ..., n, a fact suppressed in the notation of (13).
In this paper, the NiG distribution with parameters V̄, ν̄ is used as the alternative. It is typically chosen as a diffuse distribution, with the same parameter values as the prior: V̄ = V_0, ν̄ = ν_0. The NiG conjugate family (4) is closed under the convex combination (i.e. geometric mean) in (13), yielding another member of the same family. Substituting (13) and (2) into the time-varying form of (3), the following recursive update of the NiG statistics is revealed:
$$V_n = \phi_n V_{n-1} + \bar{x}_n \bar{x}_n' + (1 - \phi_n) \bar{V}, \qquad (14)$$

$$\nu_n = \phi_n \nu_{n-1} + 1 + (1 - \phi_n) \bar{\nu}. \qquad (15)$$
When φ_n = 1, the update is identical to the stationary equations (8,9).
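As a concrete sketch, the forgetting update (14,15) is a convex combination of the accumulated and alternative statistics, plus the data dyad. The function below (our naming; NumPy assumed, and the demo values are arbitrary) reduces to the stationary update (8,9) at φ_n = 1 and discards the accumulated statistics in favour of the alternative ones at φ_n = 0.

```python
import numpy as np

def forgetting_update(V_prev, nu_prev, xbar, phi, V_alt, nu_alt):
    """NiG statistics update with a known forgetting factor phi, Eqs. (14,15):
    V_n  = phi*V_{n-1} + xbar*xbar' + (1 - phi)*V_alt
    nu_n = phi*nu_{n-1} + 1 + (1 - phi)*nu_alt
    """
    V_n = phi * V_prev + np.outer(xbar, xbar) + (1.0 - phi) * V_alt
    nu_n = phi * nu_prev + 1.0 + (1.0 - phi) * nu_alt
    return V_n, nu_n

# phi = 1 recovers the stationary dyadic update (8,9); phi = 0 restarts the
# recursion from the alternative statistics (V_alt, nu_alt) plus the new dyad.
V_prev, nu_prev = np.diag([4.0, 3.0, 2.0]), 20.0
V_alt, nu_alt = np.eye(3), 2.0
xbar = np.array([1.0, 0.5, -0.5])
V1, nu1 = forgetting_update(V_prev, nu_prev, xbar, 1.0, V_alt, nu_alt)
V0, nu0 = forgetting_update(V_prev, nu_prev, xbar, 0.0, V_alt, nu_alt)
```

The φ_n = 0 limit is what produces the "statistics dump" behaviour discussed in Section 5.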
4. EXTENSION TO AN UNKNOWN FORGETTING FACTOR
A time-varying sequence of forgetting factors, φ_n, can be accommodated by (13), permitting variable non-stationary dynamics to be modelled. No guidance exists for its choice, however, and so a known constant is used in practice. In this Section, we derive a Bayesian technique for posterior inference of the forgetting factor sequence, φ_n, in tandem with the AR model parameters, θ_n. Specifically, the task is to elicit the posterior distribution, f(θ_n, φ_n|X_n). Using Bayes' rule:
$$f(\theta_n, \phi_n | X_n) \propto f(x_n | \theta_n, X_{n-1})\, f(\theta_n | X_{n-1}, \phi_n)\, f(\phi_n). \qquad (16)$$
In the first term on the right-hand side (given by (2)), it is assumed that the observations, x_n, are conditionally independent of φ_n, given the model parameters, θ_n. The second term on the right-hand side is given by (13). The third term on the right-hand side follows from the assumption that φ_n is independent of the previous data, X_{n-1}, ∀n. f(φ_n) is chosen conservatively, being uniform on the interval [0,1];
i.e. f(φ_n) = U([0,1]). The resulting posterior distribution (16) is not analytically tractable. Therefore, we seek a suitable approximation. In this paper, the approximation is forced to exhibit posterior independence between θ_n and φ_n:

$$f(\theta_n, \phi_n | X_n) \approx \tilde{f}(\theta_n | X_n)\, \tilde{f}(\phi_n | X_n), \qquad (17)$$

where the distributions, f̃(·), are chosen appropriately. This is the approximating function class used in the Variational Bayes procedure, reviewed next.
4.1. Variational Bayes (VB) approximation
Theorem 1 (Variational Bayes) Let f(θ_1, θ_2|X) be a posterior pdf of θ_1, θ_2, given data, X. Let f̃(θ_1, θ_2|X) be an approximate pdf restricted to the set of conditionally independent distributions on θ_1, θ_2:

$$\tilde{f}(\theta_1, \theta_2 | X) = \tilde{f}_1(\theta_1 | X)\, \tilde{f}_2(\theta_2 | X). \qquad (18)$$

Then, the minimum of the Kullback-Leibler (KL) distance [10],

$$\left\{\tilde{f}_1(\theta_1 | X), \tilde{f}_2(\theta_2 | X)\right\} = \arg\min_{\tilde{f}_1, \tilde{f}_2} KL\left( \tilde{f}(\theta_1, \theta_2 | X) \,\big\|\, f(\theta_1, \theta_2 | X) \right), \qquad (19)$$

is reached for

$$\tilde{f}_1(\theta_1 | X) \propto \exp\left( E_{\theta_2|X}\left( \ln f(\theta_1, \theta_2, X) \right) \right), \qquad (20)$$

$$\tilde{f}_2(\theta_2 | X) \propto \exp\left( E_{\theta_1|X}\left( \ln f(\theta_1, \theta_2, X) \right) \right). \qquad (21)$$
We will refer to (20,21) as the VB-optimal posteriors. Functions of X arising in (20,21) will be called VB-statistics. E_·(·) denotes expectation of the argument with respect to the other VB-posterior.
The VB-statistics, which parameterize f̃_1(·) (20), are needed for evaluation of f̃_2(·) (21), and vice-versa. Hence, the VB solution (20,21) is usually not available in closed form. It is found by iterating the following algorithm to convergence; the algorithm is a stochastic generalization of the classical Expectation-Maximization (EM) algorithm.
Algorithm 4.1 (Variational EM (VEM)) Cyclic iteration of the following steps converges to a solution of (20,21):

E-step: compute the approximate distribution of θ_2 at iteration i:

$$\tilde{f}_2^{(i)}(\theta_2 | X) \propto \exp\left( \int_{\Theta_1} \tilde{f}_1^{(i-1)}(\theta_1 | X)\, \ln f(\theta_1, \theta_2, X)\, d\theta_1 \right). \qquad (22)$$

M-step: using the approximate distribution from the i-th E-step, compute the approximate distribution of θ_1 at iteration i:

$$\tilde{f}_1^{(i)}(\theta_1 | X) \propto \exp\left( \int_{\Theta_2} \tilde{f}_2^{(i)}(\theta_2 | X)\, \ln f(\theta_1, \theta_2, X)\, d\theta_2 \right). \qquad (23)$$
Convergence of the algorithm was proven in [11].
4.2. VB-conjugate on-line inference
We intend to use the VB approximation (Theorem 1) at each step of the on-line update (16). The VB-optimal posteriors (20,21) remain functionally invariant during the update from n − 1 to n if they are drawn from a family of distributions closed under the combination of (i) exponential forgetting, (ii) Bayes updating (16), and (iii) the VB-approximation. This requirement, called VB-conjugacy [12], ensures a tractable on-line update of sufficient statistics. It may be shown [12] that the following choices satisfy VB-conjugacy:
$$\tilde{f}(\theta_{n-1} | X_{n-1}) = \mathcal{NiG}(V_{n-1}, \nu_{n-1}), \qquad (24)$$

$$\bar{f}(\theta_n | X_{n-1}) = \mathcal{NiG}(\bar{V}, \bar{\nu}). \qquad (25)$$
In Section 4, the choice f(φ_n|X_{n-1}) = f(φ_n) = U([0,1]) was made, and so VB-conjugacy is not required for φ_n.
Substituting (24) and (25) into (13), and the result, along with (2), into (16), then the VB-optimal posterior for θ_n, using Theorem 1, is:
$$\tilde{f}(\theta_n | X_n) = \mathcal{NiG}(V_n, \nu_n), \qquad (26)$$

$$V_n = \hat{\phi}_n V_{n-1} + \bar{x}_n \bar{x}_n' + (1 - \hat{\phi}_n) \bar{V}, \qquad (27)$$

$$\nu_n = \hat{\phi}_n \nu_{n-1} + 1 + (1 - \hat{\phi}_n) \bar{\nu}. \qquad (28)$$

Hence, the updates of the VB-statistics for θ_n are identical to the sufficient-statistic updates in the case of a known forgetting factor, φ_n (14,15), but with the latter replaced by φ̂_n. This is the expected value of φ_n with respect to the VB-optimal posterior, f̃(φ_n|X_n), at time n.
4.3. Evaluation of the VB-statistic, φ̂_n

f̃(φ_n|X_n) is analytically intractable, since it is normalized with respect to the term

$$\zeta(\phi_n) = \zeta_{NiG}\left( V(\phi_n), \nu(\phi_n) \right), \qquad (29)$$

where

$$V(\phi_n) = \phi_n V_{n-1} + (1 - \phi_n) \bar{V}, \qquad (30)$$

$$\nu(\phi_n) = \phi_n \nu_{n-1} + (1 - \phi_n) \bar{\nu}, \qquad (31)$$
with ζ_{NiG}(·,·) given by (5). Note that the recursion in (30) is onto the VB-statistic, V_{n-1}, given by (27), and likewise for (31). From (30,31), the values of ζ(φ_n) at the extrema of φ_n are:
$$\zeta(0) = \zeta_{NiG}(\bar{V}, \bar{\nu}), \qquad (32)$$

$$\zeta(1) = \zeta_{NiG}(V_{n-1}, \nu_{n-1}), \qquad (33)$$
with V_{n-1} and ν_{n-1} (the VB-statistics at the last step) being given by (27) and (28) respectively. We now propose a tractable approximation of (29) which matches these endpoints.
Proposition 1 (Approximation of Normalizing Constant)
$$\zeta(\phi_n) \approx \exp(h_1 + h_2 \phi_n). \qquad (34)$$

Matching the extrema (32,33), then:

$$h_1 = \ln \zeta_{NiG}(\bar{V}, \bar{\nu}), \qquad (35)$$

$$h_2 = \ln \zeta_{NiG}(V_{n-1}, \nu_{n-1}) - \ln \zeta_{NiG}(\bar{V}, \bar{\nu}). \qquad (36)$$
Now, the VB-optimal distribution for φ_n is greatly simplified:
$$\tilde{f}(\phi_n | X_n) \approx \mathcal{E}xp(a)\, U([0,1]), \qquad (37)$$

being the truncated exponential distribution with parameter a, on support [0,1]. Its parameter is
$$a = -(\nu_{n-1} - \bar{\nu})\, \widehat{\ln \sigma}_n - \tfrac{1}{2}\, \mathrm{tr}\left( (V_{aa;n-1} - \bar{V}_{aa})\, V_{aa;n}^{-1} \right) - \ln \zeta_{NiG}(V_{n-1}, \nu_{n-1}) + \ln \zeta_{NiG}(\bar{V}, \bar{\nu}) - \tfrac{1}{2}\, \widehat{\sigma^{-2}}_n\, [-1, \hat{a}_n']\, (V_{n-1} - \bar{V})\, [-1, \hat{a}_n']'. \qquad (38)$$
The VB-moments, â_n, \widehat{σ^{-2}}_n, \widehat{\ln σ}_n, are given by (10,11,12), using the VB-statistics (27,28). tr(·) denotes the trace of a matrix. The required mean of (37) is given by:
$$\hat{\phi}_n = \frac{\exp(a)(1 - a) - 1}{a\left(1 - \exp(a)\right)}. \qquad (39)$$
φ̂_n (39) is a function of V_n (27) and ν_n (28), via (10,11,12). Meanwhile, V_n and ν_n are functions of φ̂_n. Hence, φ̂_n is obtained using the iterative VEM algorithm (Algorithm 4.1), yielding the VB-optimal AR parameter inference at time n (26). Note that, under approximation (34), this VB-conjugate distribution for θ_n is undisturbed.
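The mean (39) has a closed form, but the naive expression overflows for large |a| and is indeterminate near a = 0. The sketch below uses an algebraically equivalent rearrangement, 1/(1 − e^{−a}) − 1/a, together with a small-|a| series; the rearrangement and the thresholds are ours, not the paper's.

```python
import math

def trunc_exp_mean(a):
    """Mean of the truncated exponential f(phi) ∝ exp(a*phi) on [0,1], Eq. (39),
    evaluated via the equivalent form 1/(1 - exp(-a)) - 1/a for stability."""
    if abs(a) < 1e-6:
        # series around a = 0: mean = 1/2 + a/12 + O(a^3)
        return 0.5 + a / 12.0
    if a > 0:
        return 1.0 / (1.0 - math.exp(-a)) - 1.0 / a   # no overflow for large a
    return math.exp(a) / (math.exp(a) - 1.0) - 1.0 / a
```

The mean rises monotonically from 0 (a → −∞) through 1/2 (a = 0) to 1 (a → +∞), which is exactly the behaviour required of the forgetting factor estimate.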
5. TRACKING AN AR PROCESS WITH ABRUPT PARAMETER CHANGEPOINTS
A univariate, second-order (i.e. x̄_n = [x_n, x_{n-1}, x_{n-2}]'), stable AR model is simulated, with parameters σ = 1, and

a = [−1.8, 0.98]'  or  a = [0.29, 0.98]'.
Abrupt switching between these two cases occurs every 30 samples. The alternative distribution (25) parameters are chosen to be

$$\bar{V} = \mathrm{diag}\left([1, 0.001, 0.001]'\right), \quad \bar{\nu} = 10. \qquad (40)$$
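The simulated scenario can be reproduced in a few lines. The generator below is our sketch of this Section-5 setup; the seed and sample count are arbitrary assumptions.

```python
import numpy as np

def simulate_switching_ar(n_samples=100, period=30, sigma=1.0, seed=1):
    """Simulate the Section-5 scenario: a second-order AR process with sigma = 1,
    whose coefficients switch between a = [-1.8, 0.98]' and a = [0.29, 0.98]'
    every `period` samples (both regimes are stable: |roots| = sqrt(0.98))."""
    rng = np.random.default_rng(seed)
    a_sets = [np.array([-1.8, 0.98]), np.array([0.29, 0.98])]
    x = np.zeros(n_samples)
    for n in range(2, n_samples):
        a = a_sets[(n // period) % 2]
        x[n] = -a @ np.array([x[n-1], x[n-2]]) + sigma * rng.standard_normal()
    return x

x = simulate_switching_ar()
```

Both regimes are deliberately close to the unit circle, so the changepoints are visible but the signal remains bounded.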
The prior distribution, f̃(θ_p|X_p) (26), is chosen equal to the alternative distribution. At each time-step, the VEM algorithm is initialized with φ̂_n^{(1)} = 0.7, and stopped when
Fig. 1. Estimation of a non-stationary AR process using time-varying forgetting. In sub-figures 2-4, full lines denote simulated values of parameters, dashed lines denote VB-posterior expected values, and dotted lines denote uncertainty bounds.
|φ̂_n^{(m)} − φ̂_n^{(m-1)}| < 0.001. The required number of iterations, m, to reach convergence is recorded. As a variant of this scheme, an 'on-line' VB algorithm allows only two iterations of the VEM algorithm per time-step. Results of parameter estimation are displayed in Figure 1.
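The per-time-step iteration just described has a simple fixed-point structure. The sketch below abstracts the chain (27,28,38) into a caller-supplied mapping `a_of_phi` from the current φ̂ to the parameter a of (37), since the full mapping needs the NiG moments and is not reproduced here; the linear surrogate in the demo is purely hypothetical. Setting `max_iter=2` gives the 'on-line' variant.

```python
import math

def trunc_exp_mean(a):
    """Mean of the truncated exponential tExp(a) on [0,1], Eq. (39)."""
    if abs(a) < 1e-6:
        return 0.5 + a / 12.0
    if a > 0:
        return 1.0 / (1.0 - math.exp(-a)) - 1.0 / a
    return math.exp(a) / (math.exp(a) - 1.0) - 1.0 / a

def vem_step(a_of_phi, phi_init=0.7, tol=1e-3, max_iter=100):
    """One time-step of the VEM fixed point: phi -> a (via statistics) -> phi.
    Stops when |phi^(m) - phi^(m-1)| < tol, or after max_iter iterations.
    Returns (phi_hat, iterations used)."""
    phi = phi_init
    for m in range(1, max_iter + 1):
        phi_new = trunc_exp_mean(a_of_phi(phi))
        if abs(phi_new - phi) < tol:
            return phi_new, m
        phi = phi_new
    return phi, max_iter

# Hypothetical surrogate for the phi -> a chain, chosen only so the demo has a
# well-behaved fixed point; it does not implement (38).
phi_hat, m = vem_step(lambda phi: 4.0 * phi - 1.0)
phi_online, m_online = vem_step(lambda phi: 4.0 * phi - 1.0, max_iter=2)
```

Capping `max_iter` bounds the per-sample cost, at the price of returning a φ̂ that has not fully converged, which is exactly the trade-off examined in Figure 1.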
The algorithm promptly detects the changepoints. It achieves this by switching the estimated forgetting factor automatically to a value close to zero (φ̂_n = 0.05 when n = 33). This causes an abrupt 'statistics dump' of the accumulated VB-statistics, V_n, ν_n, and their re-initialization with the alternative (i.e. prior) values, V̄, ν̄. Thus, the estimation process is effectively restarted. Note that the required number of iterations of the VEM algorithm is significantly higher at the changepoints. Therefore, at these points, φ̂_n, obtained using the on-line VEM variant, is higher than the value found by allowing the algorithm to iterate to convergence (Figure 1). For comparison, estimates obtained using a fixed, known forgetting factor (14,15), φ_n = 0.9, ∀n, are displayed in the 4th sub-figure. Clearly, parameter tracking
is greatly improved using the on-line estimation of φ_n, even when the VEM iterations are stopped before convergence. This should be the case whenever parameter variations are not too rapid.
6. DISCUSSION
The main drawback of the proposed VEM algorithm is that convergence to a solution at each time-step is not assured within a bounded number of iterations. This can present problems for on-line processing. The issue was addressed in [11] for stationary data. Using only one iteration of the VEM algorithm at each time-step, it was proved that the algorithm converges asymptotically to the true time-invariant posterior. However, no such result is available for non-stationary models.
A known, time-invariant forgetting factor is used in many estimation methods for non-stationary processes [7]. Previous attempts have been made to relax the assumption of an a priori known forgetting factor, particularly in Recursive Least Squares (RLS) algorithms. The method presented in [13] is the closest to our approach. It uses a gradient-based approach for estimation of the forgetting factor. However, the criterion of asymptotic MSE, minimized in [13], can only cope with slow parameter variations.
Clearly, φ̂_n plays a critical rôle in the detection of non-stationarities in the data, and, consequently, in the quality of parameter estimation. In this paper, the simple approximation (Proposition 1), which yielded a closed-form expression for φ̂_n (39), influences the quality of the tracking results. Further research may suggest better approximations for critical applications.
Finally, in on-line processing, the maximum number of VEM iterations per time-step should be determined by the sampling period of the data. If this number is too low, performance of the algorithm will again be degraded.
7. CONCLUSION
The principle of VB-conjugacy has led to a method for approximate joint Bayesian estimation of non-stationary AR parameters, θ_n, in tandem with a time-varying forgetting factor, φ_n. On-line estimation, φ̂_n, of the latter proved to be of significance, since (i) φ̂_n responds sensitively to changepoints in the data, and (ii) it weights the accumulated statistics against alternative values. During non-stationary behaviour, φ̂_n falls, allowing statistics to be partially re-initialized. Detection of changepoints, and improved parameter tracking, are the direct consequences, as compared with traditional stationary forgetting. An iterative VEM procedure is required at each time-step.
8. REFERENCES
[ 1] J. Makhoul, "Linear prediction: A tutorial review," Proceedings of the IEEE, vol. 63, no. 4, pp. 561-580, 1975.
[2] A. H. Jazwinski, Stochastic Processes and Filtering Theory, Academic Press, New York, 1979.
[3] J.M. Bernardo and A.F.M. Smith, Bayesian Theory, John Wiley & Sons, Chichester, New York, Brisbane, Toronto, Singapore, 1997, 2nd edition.
[4] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions, Dover Publications, Inc., New York, 1972.
[5] M. Cassidy and W. D. Penny, "Bayesian nonstationary autoregressive models for biomedical signal analysis," IEEE Transactions on Biomedical Engineering, vol. 49, no. 10, 2002.
[6] G. V. Moustakides, "Locally optimum adaptive signal processing algorithms," IEEE Transactions on Signal Processing, vol. 46, no. 12, pp. 3315-3325, 1998.
[7] L. Ljung and T. Söderström, Theory and Practice of Recursive Identification, MIT Press, Cambridge; London, 1983.
[8] P. M. Djurić, J. H. Kotecha, F. Esteve, and E. Perret, "Sequential parameter estimation of time-varying non-Gaussian autoregressive processes," EURASIP Journal on Applied Signal Processing, vol. 2002, no. 8, 2002.
[9] R. Kulhavý and M. B. Zarrop, "On a general concept of forgetting," International Journal of Control, vol. 58, no. 4, pp. 905-924, 1993.
[10] S. J. Roberts and W. D. Penny, "Variational Bayes for generalized autoregressive models," IEEE Transactions on Signal Processing, vol. 50, no. 9, pp. 2245-2257, 2002.
[11] M. Sato, "Online model selection based on the Variational Bayes," Neural Computation, vol. 13, pp. 1649-1681, 2001.
[12] V. Šmídl, The Variational Bayes Approach in Signal Processing, Ph.D. thesis, Trinity College Dublin, Dublin, Ireland, 2004.
[13] C.-F. So, S. C. Ng, and S. H. Leung, "Gradient based variable forgetting factor RLS algorithm," Signal Processing, vol. 83, pp. 1163-1175, 2003.