
Pergamon

CONTRIBUTED ARTICLE 0893-6080(94)00059-X

Neural Networks, Vol. 8, No. 1, pp. 25-29, 1995 Copyright © 1994 Elsevier Science Ltd Printed in the USA. All rights reserved

0893-6080/95 $9.50 + .00

Recurrent Neural Networks Can Be Trained to Be Maximum A Posteriori Probability Classifiers

SIMONE SANTINI AND ALBERTO DEL BIMBO

Dipartimento di Sistemi e Informatica

(Received 27 July 1993; accepted 27 April 1994)

Abstract--This paper proves that supervised learning algorithms used to train recurrent neural networks have an equilibrium point when the network implements a maximum a posteriori probability (MAP) classifier. The result holds as a limit when the size of the training set goes to infinity. The result is general because it stems as a property of cost minimizing algorithms, but to prove it we implicitly assume that the network we are training has enough computing power to actually implement the MAP classifier. This assumption can be satisfied using a universal dynamic system approximator. We refer our discussion to Block Feedback Neural Networks (BFNs) and show that they actually have the universal approximation property.

Keywords--Recurrent neural networks, MAP classifier, Supervised learning, Nonlinear Kalman filter, Reasonable error measures.

1. INTRODUCTION

In this paper we prove that cost minimizing learning algorithms used to train dynamic neural networks have an equilibrium point when the network implements a multistep maximum a posteriori probability (MAP) classifier. The result holds for a large class of cost functionals and in the limit of the number of training samples going to infinity.

We restrict our attention to a particular class of neural networks: multilayer perceptrons with discrete-time feedback paths between layers. More specifically, we use a feedback network model called Block Feedback Neural Networks (BFN) (Santini, Del Bimbo, & Jain, 1991, 1994), which allows a good flexibility in the specification of the network architecture.

The flexibility of the model makes it easier to derive certain approximation properties. Most of the results presented here, however, hold for other discrete-time dynamic neural network models as well.

Several results exist on the function approximation (Blum & Li, 1991) and the MAP classification (Hampshire & Pearlmutter, 1990) properties of feedforward neural networks.

(Requests for reprints should be sent to Simone Santini, Dipartimento di Sistemi e Informatica, Via S. Marta, 3, 50137 Firenze, Italy.)

Here, we extend the results in Hampshire and Pearlmutter (1990) to the classification of a sequence of input samples. When a sample is presented at time $t$, we ask the network to assign it to one of the classes $\omega_i$ based on the value of the sample $x(t)$ and on the past history of the system. If the samples in the sequence are uncorrelated, then the past history has no influence over the current classification, and a dynamic network does no better than a feedforward one. If, on the other hand, the input samples are correlated, that is, if

$$E[x(t)x(\tau)] \neq E[x(t)]\,E[x(\tau)] \quad \text{for some } \tau \neq t,$$

a dynamic network can use past samples to drive the classification of the current sample, giving potentially better results.

If the sample $x(t)$ belongs to the class $\omega_i$, then we prove that a feedback network can be trained to compute, at its $i$th output,

$$P[\omega_i \mid x(t), x(t-1), \ldots, x(0)].$$
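As a concrete illustration (ours, not from the paper), the multistep posterior above can be computed in closed form for an assumed toy model in which the class is drawn once and, given the class, the samples are i.i.d. Gaussian. The posterior of the generating class is refined as the sequence unfolds, which is the behavior the network is asked to learn:

```python
import numpy as np

# Toy model (assumed, not from the paper): class w_0 has mean +1,
# class w_1 has mean -1, unit-variance Gaussian samples, equal priors.
rng = np.random.default_rng(0)
means = np.array([+1.0, -1.0])
log_post = np.log(np.array([0.5, 0.5]))   # log P[w_i]

x = rng.normal(means[0], 1.0, size=50)    # sequence generated by class w_0

history = []
for xt in x:
    log_post = log_post - 0.5 * (xt - means) ** 2  # accumulate log-likelihoods
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                             # P[w_i | x(t), ..., x(0)]
    history.append(post[0])

# The posterior of the generating class approaches 1 as t grows.
```

With fifty correlated-by-class samples, the running posterior of the true class ends essentially at 1, whereas a single-sample classifier would keep the same uncertainty at every step.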

Section 2 contains results on Bayesian classification using recurrent networks. Section 3 contains a brief discussion of the results.

2. BAYESIAN CLASSIFICATION IN FEEDBACK NETWORKS

In this section we prove that, for a large class of cost functionals, the multistep MAP classifier is an equilibrium point of the supervised learning algorithm for recurrent networks and, in general, for any cost minimizing training algorithm applied to a network model powerful enough to be a universal approximator. Our


considerations are largely based upon Hampshire and Pearlmutter (1990), where a similar property is proved for feedforward networks.

For feedforward neural networks, given a sequence of vectors

$$x(0), \ldots, x(t), \ldots,$$

there is an equilibrium point of the learning algorithm such that the network computes the single step a posteriori probability that the sample $x(t)$ belongs to the class $\omega_i$, that is, at time $t$, it computes:

$$P[\omega_i \mid x(t)]. \qquad (1)$$

To extend this result, we first show that there exists a BFN with enough computing power to compute the multistep a posteriori probability

$$P[\omega_i \mid x(t), x(t-1), \ldots] \qquad (2)$$

with any desired degree of accuracy. Then, we show that a network computing eqn (2) is an equilibrium point for the BFN learning algorithm under fairly general hypotheses.

2.1. Generalities

A standard technique in estimation theory is to consider the input sequence $x(t)$ as generated by a (possibly nonlinear) dynamic system driven by a white noise $e(t)$. By estimating the model, we can build, based on the past samples, an optimal estimation of the present sample $x(t)$ or of any quantity related to it (in the present case, the class $\omega_i$ it belongs to).

Suppose that the sequence to be classified $\{x(t) \mid t \in \mathbb{N}\}$ is a stochastic process generated by a finite-dimensional nonlinear system

$$x(t) = h(x(t-1), \ldots, x(t-k), e(t-1), \ldots, e(t-l)) + e(t) \qquad (3)$$

where $\{e(t)\}$ is a white process.

If $B_t^{(e)} = B_t(e(t), e(t-1), \ldots)$ is the Borel field generated by $e(t), e(t-1), \ldots$, we may write

$$h(x(t-1), \ldots, x(t-k), e(t-1), \ldots, e(t-l)) = E[x(t) \mid B_{t-1}^{(e)}]. \qquad (4)$$

Thus, the projection of $x(t)$ on the space $[B_{t-1}^{(e)}]$ generated by $B_{t-1}^{(e)}$ (i.e., the conditional expectation)

$$\Pi\{x(t) \mid [B_{t-1}^{(e)}]\} \qquad (5)$$

can be computed as a function of a finite number of terms $(e(t-1), e(t-2), \ldots)$. If the function $h$ is analytic ($C^{\omega}$), and if

$$X(t) = [e(t-l+1), \ldots, e(t), x(t-k+1), \ldots, x(t)]^T \in \mathbb{R}^{l+k} \qquad (6)$$

is a state vector for the system, then eqn (3) can be written in the form (Priestley, 1988):

$$x(t) + \sum_{u=1}^{k} \phi_u(X(t-1))\, x(t-u) = \mu(X(t-1)) + e(t) + \sum_{u=1}^{l} \psi_u(X(t-1))\, e(t-u). \qquad (7)$$

This model can be interpreted as a locally linear ARMA model, governed by a set of AR coefficients $\{\phi_u\}$ and a set of MA coefficients $\{\psi_u\}$, where $\{\phi_u\}$ and $\{\psi_u\}$ are the derivatives of $h$ with respect to $x$ and $e$, respectively.

A formal state space description of eqn (7) is:

$$X(t+1) = \mu(X(t)) + \{F(X(t))\}\, X(t) + e(t+1)$$
$$x(t+1) = H X(t+1) \qquad (8)$$

where

$$\mu(X(t)) = [0, \ldots, 0 \mid 0, \ldots, 0, \mu(X(t))]^T,$$
$$e(t) = e(t)\, [0, \ldots, 1 \mid 0, \ldots, 1]^T,$$
$$H = [0, \ldots, 0 \mid 0, \ldots, 1], \qquad (9)$$

and $F(X(t))$ is the companion-form matrix carrying the state-dependent coefficients of eqn (7) in its last row:

$$F(X(t)) =
\begin{bmatrix}
0 & 1 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & & \ddots & \vdots & \vdots & & & \vdots \\
0 & 0 & \cdots & 1 & 0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 0 & 1 & \cdots & 0 \\
\vdots & & & \vdots & \vdots & & \ddots & \vdots \\
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 1 \\
\psi_l & \psi_{l-1} & \cdots & \psi_1 & -\phi_k & -\phi_{k-1} & \cdots & -\phi_1
\end{bmatrix}. \qquad (10)$$

An estimator for $X(t)$ is given by

$$\hat{X}(t+1) = \mu(\hat{X}(t)) + \{F(\hat{X}(t))\}\, \hat{X}(t) + e(t+1)$$
$$\hat{x}(t+1) = H \hat{X}(t+1). \qquad (11)$$

From this and eqn (8), it follows that the estimation error at time $t$ for the model (11) is

$$\tilde{x}(t) \triangleq x(t) - \hat{x}(t) = e(t). \qquad (12)$$
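The state-space construction of eqns (8)-(10) can be sanity-checked in the linear special case, where the coefficients $\phi_u$ and $\psi_u$ are constants. The sketch below (ours; the coefficient and state values are arbitrary) builds $F$, $H$, and the noise vector for $k = l = 2$ with $\mu = 0$, and verifies that one transition step reproduces the ARMA recursion of eqn (7):

```python
import numpy as np

# Linear special case of eqns (7)-(10): constant AR coefficients phi_u
# and MA coefficients psi_u, k = l = 2, mu = 0. State vector (eqn (6)):
# X(t) = [e(t-1), e(t), x(t-1), x(t)]^T.
phi1, phi2 = 0.5, -0.2
psi1, psi2 = 0.3, 0.1

F = np.array([
    [0.0,  1.0,  0.0,   0.0],    # e(t-1) slot <- e(t)
    [0.0,  0.0,  0.0,   0.0],    # e(t) slot   <- new noise (added below)
    [0.0,  0.0,  0.0,   1.0],    # x(t-1) slot <- x(t)
    [psi2, psi1, -phi2, -phi1],  # x(t+1) coefficients from eqn (7)
])
H = np.array([0.0, 0.0, 0.0, 1.0])   # reads x(t+1) out of the state

X = np.array([0.2, -0.4, 1.0, 0.7])  # [e(t-1), e(t), x(t-1), x(t)]
e_next = 0.05
X_next = F @ X + e_next * np.array([0.0, 1.0, 0.0, 1.0])  # eqn (8)

# Direct ARMA recursion, eqn (7):
# x(t+1) = -phi1 x(t) - phi2 x(t-1) + e(t+1) + psi1 e(t) + psi2 e(t-1)
x_direct = -phi1 * 0.7 - phi2 * 1.0 + e_next + psi1 * (-0.4) + psi2 * 0.2
assert np.isclose(H @ X_next, x_direct)
```

The same shift-and-combine structure holds in the nonlinear case, with the last row of $F$ evaluated at the current state.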


Because $e(t)$ is drawn from a white sequence $\{e(t)\}$, it is, by definition, unpredictable, that is,

$$e(t) \in [B_{t-1}^{(e)}]^{\perp}. \qquad (13)$$

Because the prediction error $e(t)$ is orthogonal to $[B_{t-1}^{(e)}]$, it follows that

$$\hat{x}(t) = \Pi\{x(t) \mid [B_{t-1}^{(e)}]\} \qquad (14)$$

and thus eqn (11) is an optimal estimator.

The problem is whether a BFN can implement every model of the form (8). If this is true, then any sequence generated by such a finite model can be optimally estimated by a BFN implementing the estimator (11).

Because $h$ is analytic, so are $\phi$, $\psi$, $F$, and $\mu$. According to the universal approximation property of multilayer perceptrons (Hornik, 1991), there exists a three-layer feedforward neural network that approximates $F$ and $\mu$ with any given degree of accuracy.

A three-layer feedforward network (say, $N_U$) also exists to compute the analytic compound function:

$$U(X) = \mu(X) + \{F(X)\}\, X. \qquad (15)$$

In the same way, it is possible to build the network $N_Q$, which approximates the function $Q(X) = HX$. These two networks can be built in the framework of BFNs by the cascade connection. If time dependency is now considered, it is easily seen that a network approximating eqn (11) can be obtained in the BFN framework by adding a feedback connection as shown in Figure 1. The overall network takes in input $e(t)$ and computes $\hat{x}(t+1)$.

Given the generality of eqn (7), it is possible to state the following.

THEOREM 1. Any stochastic process generated by a finite order nonlinear model can be optimally estimated by a suitable BFN network.

Note that, because any deterministic function can be computed by a feedforward network, it follows as a corollary of Theorem 1 that any quantity given as $y(t) = y(x(t))$, where $y$ is a deterministic function satisfying the condition for neural network approximability (Hecht-Nielsen, 1989), can also be optimally estimated. All we need, in this case, is to have the network $N_Q$ approximate the function $y(HX)$ instead of $HX$.

FIGURE 1. Architecture of a network that approximates a kth order model generating a stochastic process.

In particular, this holds for the function $\omega_i$, defined as

$$\omega_i(x) = \begin{cases} 1 & \text{if } x \in \omega_i \\ 0 & \text{if } x \notin \omega_i \end{cases} \qquad (16)$$

2.2. Neural Networks and MAP Classification

The classification problem can be stated as follows: given a vector space $V$, $N$ disjoint subsets $\omega_i$ of $V$ (classes), and an input vector $x(t)$ corrupted by noise, we are required to associate to $x(t)$ the appropriate class $\omega_c$ based on a minimum error criterion. A typical error criterion is the MAP, which requires that $x(t)$ be associated with the class $\omega_c$ such that

$$P[\omega_c \mid x(t), x(t-1), \ldots] \geq P[\omega_j \mid x(t), x(t-1), \ldots] \quad \forall j \neq c. \qquad (17)$$

To this end, it is sufficient to build a function $q(\omega_i, x(t), \ldots)$ such that $q$ has its maximum for $\omega_i = \omega_c$. A classifier implementing a function like this is called a discriminant classifier (Hampshire & Pearlmutter, 1990).

In our case we will ask for more. We will ask for a function $q$ such that

$$q(\omega_i, x(t), \ldots) = P[\omega_i \mid x(t), x(t-1), \ldots], \qquad (18)$$

that is, we require that the classifier be a Bayesian classifier. Because of eqn (17), a Bayesian classifier is also a discriminant classifier.

As in Hampshire and Pearlmutter (1990), we consider cost functionals referred to as reasonable error measures (REM). A cost functional $f(u)$ is a REM if

$$f'(u) = f'(1-u)\, \frac{u}{1-u}, \qquad (19)$$

where

$$f'(u) = \left. \frac{df(t)}{dt} \right|_{t=u}.$$

The class of REMs contains all the costs of the form

$$f(u) = \int_0^u v^r (1-v)^{r-1}\, dv, \qquad (20)$$

which, in turn, includes some of the most commonly used cost functionals, such as the cross-entropy (for $r = 0$) and the quadratic (for $r = 1$).
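The REM family of eqn (20) is easy to verify numerically. The sketch below (ours) checks, for several values of $r$, that the derivative $f'(u) = u^r (1-u)^{r-1}$ of eqn (20) satisfies the condition of eqn (19), and that $r = 0$ and $r = 1$ reproduce the cross-entropy and quadratic derivatives:

```python
import numpy as np

# Derivative of the REM family (20): f_r'(u) = u**r * (1-u)**(r-1).
def df(u, r):
    return u**r * (1.0 - u)**(r - 1)

# Condition (19): f'(u) = f'(1-u) * u / (1-u), for every r.
for r in [0.0, 0.5, 1.0, 2.0]:
    for u in np.linspace(0.05, 0.95, 19):
        assert np.isclose(df(u, r), df(1.0 - u, r) * u / (1.0 - u))

# r = 0 gives f'(u) = 1/(1-u), i.e. f(u) = -log(1-u) (cross-entropy type);
# r = 1 gives f'(u) = u, i.e. f(u) = u**2 / 2 (quadratic).
u = 0.3
assert np.isclose(df(u, 0.0), 1.0 / (1.0 - u))
assert np.isclose(df(u, 1.0), u)
```

The check is an identity: $f'(1-u)\,u/(1-u) = (1-u)^r u^{r-1} \cdot u/(1-u) = u^r (1-u)^{r-1} = f'(u)$.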

In Section 2.1, we have shown that any stochastic process $x(t)$ generated according to the model (8) can be optimally estimated (with any degree of accuracy) by a BFN network, and so can every deterministic function $y(x(t))$.

This result states that the optimal MAP estimator is infinitely close to the subspace $\mathcal{A}$ of the functions that can be implemented by a BFN. Nevertheless, a BFN moves through $\mathcal{A}$ along a trajectory determined


by the learning algorithm, and ends up implementing a function that gives a minimum, over the sequences making up the training set, of the learning cost. It is not clear whether this function and the MAP classifier coincide.

In other words, knowing that BFNs can implement a MAP classifier does not guarantee that, when trained on a given training set with a suitable learning algorithm, they will.

In this section we prove that, under certain hypotheses, this is indeed the case. If the size of the training set grows to infinity and the cost that the training algorithm is minimizing is a REM, then the MAP classifier is a minimum of the cost.

In the following we will compute the average estimation error $\mathcal{E}$ over the training set. To do this, we imagine a series of experiments, all starting with the same value $x(0)$, and all subject to the same probability distribution $P[x(t) \mid [B_{t-1}^{(e)}]]$, which is fixed and given by the model (8). The only variable will be the particular realization of the process $\{e(t)\}$. We will take averages over the set of experiments.

What happens in practice, however, is that the network is fed with samples taken while the sequence $x(t)$ is generated. Thus, the training goes on in time, based on a single realization of $\{e(t)\}$, and not on a set of experiments. This suggests hypothesizing that $x(t)$ be an ergodic process, so that averages in time are equal to averages over the set of experiments.
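The ergodicity hypothesis can be illustrated on a simple stable AR(1) process (our example, not from the paper): the time average of $x(t)^2$ over one long realization agrees with the average of $x(T)^2$ over many independent realizations, both estimating the stationary variance $1/(1-a^2)$:

```python
import numpy as np

# Ergodicity illustration (ours): stable AR(1), x(t) = a x(t-1) + e(t).
rng = np.random.default_rng(1)
a = 0.8

def ar1(T):
    e = rng.normal(size=T)
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = a * x[t - 1] + e[t]
    return x

# Time average over a single long realization...
time_avg = np.mean(ar1(100_000) ** 2)
# ...vs. ensemble average over many short independent realizations.
ensemble_avg = np.mean([ar1(150)[-1] ** 2 for _ in range(3000)])

stationary_var = 1.0 / (1.0 - a**2)   # = 2.777...
```

Both estimates land close to the stationary variance, so training on one long realization is, for this process, equivalent to training over the experiment set.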

We define a prototype $x_p$ as a unique sample of a random vector $x$; if we have two statistically independent samples $x(t_1) = x(t_2) = x_p$, then they are said to be instances of the same prototype $x_p$ (see Hampshire & Pearlmutter, 1990).

Let $o_i \in [0, 1]$ be the $i$th output of a BFN classifier, and $t_p^i$ the targets, with

$$t_p^i = \begin{cases} 1 & \text{if } x_p \in \omega_i \\ 0 & \text{if } x_p \notin \omega_i \end{cases} \qquad (21)$$

Moreover, let $\theta$ be the parameter vector for the network, and let the error function be defined as:

$$\mathcal{E}[o_i(x_p, \theta), t_p^i] = \begin{cases} f(1 - o_i(x_p, \theta)) & \text{if } x_p \in \omega_i \\ f(o_i(x_p, \theta)) & \text{if } x_p \notin \omega_i \end{cases} \qquad (22)$$

where $f$ is a monotonically nondecreasing function of its argument.

We perform a number of statistically independent experiments. During the experiments, we observe that, at time $t-1$, the network is in state $X(t-1)$ for $n_t$ times. In one of these $n_t$ times, we observe, at time $t$, a sample $x(t)$. Let $x_p$ be the prototype of $x(t)$, and let $n_{pi}$ be the number of statistically independent experiments in which the sample $x(t)$, with prototype $x_p$, belongs to $\omega_i$. Let $n_{p\bar{i}}$ be the number of statistically independent experiments in which the sample $x(t)$, with prototype $x_p$, does not belong to $\omega_i$; then $n_p = n_{pi} + n_{p\bar{i}}$ is the total number of statistically independent occurrences of prototype $x_p$.

The main result of this section is then expressed as follows.

THEOREM 2. Let $o_i$ be the output of the network corresponding to the class $\omega_i$, and $\mathcal{E}$ a REM measure. Then

$$\frac{\partial \mathcal{E}}{\partial o_i} = 0 \quad \text{when} \quad o_i = P[\omega_i \mid x(t), x(t-1), \ldots] \qquad (23)$$
$$= P[\omega_i \mid X(t)], \qquad (24)$$

where the last equality is due to eqn (4) and to the fact that the model is supposed to be finite-dimensional.

Proof. Computing the average error over all the experiments, we obtain:

$$\mathcal{E} = \frac{1}{n_t} \sum_{p=1}^{P} \sum_{i=1}^{N} \left\{ n_{pi}\, f(1 - o_i(x_p, \theta)) + n_{p\bar{i}}\, f(o_i(x_p, \theta)) \right\}$$
$$= \sum_{p=1}^{P} \sum_{i=1}^{N} \frac{n_p}{n_t} \left\{ \frac{n_{pi}}{n_p} f(1 - o_i(x_p, \theta)) + \frac{n_{p\bar{i}}}{n_p} f(o_i(x_p, \theta)) \right\}. \qquad (25)$$

According to the law of large numbers, we have:

$$\lim_{n_t \to \infty} \frac{n_{pi}}{n_p} = P[\omega_i \mid x_p, X(t-1)] = P[\omega_i \mid X(t)] \qquad (26)$$

and

$$\lim_{n_t \to \infty} \frac{n_{p\bar{i}}}{n_p} = P[\bar{\omega}_i \mid x_p, X(t-1)] = P[\bar{\omega}_i \mid X(t)]. \qquad (27)$$

Thus,

$$\lim_{n_t \to \infty} \mathcal{E} = \sum_{p=1}^{P} \sum_{i=1}^{N} P[x_p] \left\{ P[\omega_i \mid X(t)]\, f(1 - o_i(x_p, \theta)) + P[\bar{\omega}_i \mid X(t)]\, f(o_i(x_p, \theta)) \right\}. \qquad (28)$$

The cost $\mathcal{E}$ is minimized when $\nabla \mathcal{E} = 0$, that is,

$$\frac{\partial \mathcal{E}}{\partial o_i(x_p, \theta)} = \sum_{p} P[x_p] \left\{ -P[\omega_i \mid X(t)]\, f'(1 - o_i(x_p, \theta)) + P[\bar{\omega}_i \mid X(t)]\, f'(o_i(x_p, \theta)) \right\} = 0 \quad \forall i. \qquad (29)$$

Equation (29) is satisfied iff

$$\frac{f'(o_i(x_p, \theta))}{f'(1 - o_i(x_p, \theta))} = \frac{P[\omega_i \mid X(t)]}{P[\bar{\omega}_i \mid X(t)]} = \frac{P[\omega_i \mid X(t)]}{1 - P[\omega_i \mid X(t)]} \quad \forall x_p. \qquad (30)$$

If the cost satisfies eqn (19), it is apparent that the cost $\mathcal{E}$ is minimized only if

$$o_i = P[\omega_i \mid X(t)] = P[\omega_i \mid B_t^{(e)}]. \qquad (31)$$
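The fixed point of eqn (31) can be checked numerically for a single prototype. The sketch below (ours) minimizes the one-class, one-prototype slice of the limit cost of eqn (28), $P f(1-o) + (1-P) f(o)$, for the two REM examples, and finds the minimum at $o = P$, as the theorem states (here $P = 0.7$ is an arbitrary assumed posterior):

```python
import numpy as np

# One-prototype slice of the limit cost (28): P*f(1-o) + (1-P)*f(o).
P = 0.7                                  # assumed P[w_i | X(t)]
o = np.linspace(0.001, 0.999, 9999)      # candidate output values

quadratic = P * (1 - o) ** 2 + (1 - P) * o ** 2       # f(u) = u**2  (r = 1)
cross_ent = -P * np.log(o) - (1 - P) * np.log(1 - o)  # f(u) = -log(1-u)  (r = 0)

for cost in (quadratic, cross_ent):
    assert abs(o[np.argmin(cost)] - P) < 1e-3  # minimum at o = P
```

For a non-REM cost the same experiment generally places the minimum away from $P$, which is why the REM condition (19) is needed.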


3. DISCUSSION

How does this classifier differ from a classifier implemented in a feedforward network?

The MAP classifier based on $P[\omega_i \mid x(t)]$, which can be implemented in a feedforward network, makes an estimation based on the current input sample only. For this reason, the variance of the estimation error is the same at all instants, no matter how many input vectors have been presented so far, and depends only on the variance of the input noise.

On the contrary, the classifier based on $P[\omega_i \mid X(t)]$ can refine its estimation as it "sees" more input samples.

When the network converges to the MAP classifier, its outputs compute $P[\omega_i \mid X(t)]$. Therefore, the BFN can be seen as an optimal estimator of the function $\omega_i(x(t))$, and its performance can be analyzed using well-known results for the nonlinear Kalman filter (Priestley, 1988), which also, at steady state, implements the optimal Bayesian estimator.

In particular, a measure of the performance of the estimator is given by the eigenvalues of the covariance matrix of the estimation error. If $\omega$ is the quantity to be estimated, $\hat{\omega}$ is the estimation, and $\tilde{\omega} = \omega - \hat{\omega}$, then a measure of the estimation error is given by the eigenvalues of

$$\Psi = E[\tilde{\omega} \tilde{\omega}^T]. \qquad (32)$$

Here we implicitly assume $E[\tilde{\omega}] = 0$, which is the case for the Bayesian estimator.

It should be noted that the previous measure also applies to the feedforward estimator, in which the value of $\omega$ at time $t$ is estimated based on the input sample $x(t)$ at time $t$ only. In this case, the eigenvalues of $\Psi$ do not change with time, confirming that the single-step estimator cannot "refine" its estimation as it receives more inputs.

On the other hand, the standard Kalman filtering literature shows that, under reasonable stability conditions, the eigenvalues of $\Psi$ for the filter, and hence for the BFN after the cost has been minimized, decrease with time. This confirms that the BFN estimator, being able to refine its estimation as more inputs are presented, provides better performance than a single-step, feedforward estimator.
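The time behavior described above can be seen in the simplest possible setting (our sketch, not from the paper): a scalar Kalman filter estimating a constant quantity from noisy observations. Its error variance, the scalar analogue of the eigenvalues of the error covariance in eqn (32), decreases at every step, while a single-step estimator that looks only at the current observation keeps the observation-noise variance $R$ forever:

```python
# Scalar Kalman filter for y(t) = w + v(t), with Var[v(t)] = R.
# A single-step estimator (use y(t) alone) has constant error variance R.
R = 1.0            # observation noise variance
P = 10.0           # initial error variance of the recursive estimate
variances = []
for _ in range(20):
    K = P / (P + R)        # Kalman gain
    P = (1.0 - K) * P      # updated error variance: P * R / (P + R)
    variances.append(P)

# The recursive (feedback) estimator's error variance decreases
# monotonically and drops below the single-step variance R.
```

Since $P' = PR/(P+R) < P$ whenever $P, R > 0$, the decrease is strict at every step, mirroring the decreasing eigenvalues claimed above.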

This is true, of course, when the input samples $x(t)$ are generated according to a stochastic model that provides some correlation between samples at different times. If the samples $x(t)$ are completely uncorrelated, then, as confirmed by nonlinear estimation theory, the performance of the two classifiers is the same.

REFERENCES

Blum, E. K., & Li, L. K. (1991). Approximation theory and feedforward networks. Neural Networks, 4, 511-515.

Hampshire, J. B., II, & Pearlmutter, B. A. (1990). Equivalence proofs for multi-layer perceptron classifiers and the Bayesian discriminant function. In D. Touretzky, J. Elman, T. Sejnowski, & G. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann.

Hecht-Nielsen, R. (1989). Theory of the backpropagation neural network. In Proceedings of the International Joint Conference on Neural Networks (pp. I-593-I-605).

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4, 251-257.

Priestley, M. B. (1988). Non-linear and non-stationary time series analysis. New York: Academic Press.

Santini, S., Del Bimbo, A., & Jain, R. (1994). Block-structured recurrent neural networks. Neural Networks, 7, 000-000.

Santini, S., Del Bimbo, A., & Jain, R. (1991). An algorithm for training neural networks with arbitrary feedback structure (Tech. Rep. TR 10/91). Dipartimento di Sistemi e Informatica, Università di Firenze.