
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 2, NO. 1, JANUARY 1991

Associative Learning in Random Environments Using Neural Networks

Kumpati S. Narendra, Fellow, IEEE, and Snehasis Mukhopadhyay

Abstract- In recent years, efforts have been made to generalize learning automata operating in random environments using a context vector. The process of associating an optimal action with a context vector has been defined as associative learning. In this paper, associative learning is investigated using neural networks and concepts based on learning automata. The behavior of a single decision maker containing a neural network is first studied in a random environment using reinforcement learning, and the method is extended to more complex situations where two or more neural networks are involved in decentralized decision making. Simulation results are presented to complement the theoretical discussions.

Manuscript received March 19, 1990; revised September 20, 1990. This work was supported by the National Science Foundation under Grant EET-8814747 and by Sandia National Laboratories under Contract 84-1791. The authors are with the Center for Systems Science, Department of Electrical Engineering, Yale University, New Haven, CT 06520. IEEE Log Number 9040769.

I. INTRODUCTION

Deterministic and stochastic learning automata operating in random environments have been studied extensively for over two decades [1]. A learning automaton is shown in Fig. 1(a) and consists of an automaton connected to a random environment in a feedback configuration. The automaton A, which consists of a finite number of actions α_1, α_2, ..., α_m, performs one action α(n) = α_i at stage n in the random environment E. The output β(n) of the environment at stage n belongs to a set, which for convenience is assumed to be binary, where the value 1 indicates a success and 0 a failure. The environment E is completely defined by a set of reward probabilities {d_i} such that d_i = Prob[β(n) = 1 | α(n) = α_i]. The automaton chooses the action at stage (n + 1) on the basis of the responses received from the environment in the first n stages. The performance of the automaton is judged by the asymptotic expected value of the reward at the output of the environment, and terms such as expedient, optimal, ε-optimal, and absolutely expedient have been defined in the literature. Numerous deterministic and stochastic algorithms have been developed in the past decades [1] to exhibit the different types of behavior described above. Further, the basic idea has also been extended to include multiple environments [2], time-varying environments [3], and distributed automata operating in a decentralized fashion [4], as well as environments which can have either discrete or continuous outputs in the interval [0, 1] [5].

In spite of the powerful results that have been derived for learning automata, they are of limited applicability since in general they deal only with a single random environment. To make the model more realistic and applicable to a wider range of problems, the concept of the automaton has been generalized by the introduction of a context vector [12]. Fig. 1(b) shows an automaton operating in a context space X. It is assumed that the context x changes from stage to stage with x(n) ∈ X. As the context changes, the environment also changes; hence it is convenient to represent the environment as E_x. Therefore it is clear that the efficacy of any action α_i at any instant depends upon the context vector at that instant. The aim of designing the automaton is to determine the optimal action corresponding to x(n) for all n. If the number of states is finite, a single automaton A_x can be associated with each state x, and it is well known that this automaton will converge only if x is visited an infinite number of times. While the above procedure may be effective when the number of distinct context vectors is small, it becomes inefficient when the number is very large and impractical when the context space is continuous. The main thrust of this paper is to demonstrate that decision makers using neural networks can be designed to make rational generalizations throughout the domain of interest in the context space, based on the responses obtained at a countable set of values assumed by the context vector x(n).

In recent years artificial multilayer neural networks have emerged as powerful components which have proved to be extremely successful in pattern recognition problems [6]-[9]. From a systems theory point of view, multilayer neural networks can be considered to be versatile nonlinear maps, and it is their ability to generalize which has proved most effective in pattern recognition problems. In this paper we attempt to investigate the ability of such networks, using concepts based on learning automata theory, to perform as efficient decision makers in random environments associated with a context space. In particular, our ultimate objective is to design learning algorithms for a neural network whose input is the current context vector and whose outputs are the probabilities of the various actions. The different structures that the decision maker can assume, the performance that such structures lead to, and the motivation for the choice of a neural network as the decision maker are all of major interest.

In Section II, the general problem of decision making in a context space is stated along the lines given in [1, ch. 7]. This is the first of two problems to be investigated in the following sections and assumes that all the relevant information to make a decision is available in a centralized fashion. The statement of the second problem, in which several decision makers are involved in decentralized decision making, is also given in this section. In Section III, the first problem of centralized decision making is investigated. While the details of the proposed algorithms are included in this section, their analysis is relegated to Section IV. In Section V, the manner in which the same algorithms can also be used in problem 2 is considered. Simulation results for both problems 1 and 2 are presented in Section VI and some extensions of the results given are outlined in Section VII.



Fig. 1. (a) A learning automaton. (b) A generalized learning automaton.

II. STATEMENT OF THE PROBLEMS

In this paper we consider two problems in decision making in random environments. In problem 1, which corresponds to centralized decision making, a decision maker attempts to determine the optimal actions at every point in a context space. In problem 2, the results derived for the previous case are applied to decentralized decision making. At every instant n, when the context is x(n), each of M decision makers chooses an action from its action set. If α(n) is assumed to be a decision vector in R^M, the individual decisions can be considered to be components of α(n). The environment yields a response, as in problem 1, based on the entire vector α(n). Each decision maker is not aware of the choices made by the other decision makers at any instant, so that we have a cooperative game of decentralized decision makers in the context space X. The objective once again is to determine how the individual decision makers should process on-line information so that they converge to their optimal decisions at every state x in X.

In the following subsections precise mathematical statements of the above two problems are given. For ease of exposition, it is assumed that the action set of each decision maker has only two actions, α_1 and α_2, in problem 1. In problem 2, α_1 and α_2 are two-dimensional vectors, so that we have a stochastic cooperative game of two players with two actions each. Methods by which the results can be generalized to the multiple action case, as well as cooperative games involving more than two players, are briefly indicated in Section VII.

Problem 1

Let X be a metric space with a metric ρ defined on it. X is defined as a context space, and to every x ∈ X there corresponds an environment E_x. E_x is defined by a triple {A, d(x), B} for all x ∈ X, where A = {α_1, α_2} denotes the input set and B = {0, 1} denotes the output set. It is assumed that the input and output sets A and B are independent of the state x. The elements of d(x), denoted by d_i(x), are the reward probabilities of E_x. If at stage n the context vector, the action chosen, and the response of the environment are respectively x(n), α(n), and β(n), then d_i(x) = Prob{β(n) = 1 | α(n) = α_i, x(n) = x}, i = 1, 2. If d_l(x) = max{d_1(x), d_2(x)}, then α_l(x) is the optimal action for E_x.

Let D ⊂ X be a compact region. Let D_j (j = 1, 2, ...) be a finite number of disjoint regions in D such that ∪_j D_j = D and D_i ∩ D_j = Φ, i ≠ j, where Φ is the null set. Let D̂_i be compact regions which lie strictly in the interior of D_i, and let D̂ = ∪_i D̂_i. It is assumed that ρ(D̂_i, δ(D_i)) ≥ ε_1 (where δ(D_i) is the boundary of D_i) for all i, and that the Lebesgue measure μ(D̂_i) for all i is greater than some constant ε_2.

It is further assumed that in each of the regions D_j, one and only one of the actions {α_1, α_2} is optimal. Our objective is to determine the structure as well as the learning algorithm h(x(n)) of a decision maker, where h: D̂ → A, such that

    lim_{n→∞} h(x(n) = x) = α_l(x),   for all x ∈ D̂

where the convergence is in some desired deterministic or stochastic sense.

Comment 1: We assume that the context vector x lies in a compact set D of X to make the problem analytically tractable. The objectives of the learning process are the implicit determination of the regions D_i and the corresponding optimal actions in these regions, based on the sequential performance of a decision maker.

Comment 2: The determination of the optimal action α_l(x) corresponding to all context vectors x ∈ D, based on an infinite sequence of actions, is an approximation problem which involves generalization. The methods described in the following sections can be considered to be specific forms of such generalizations using multilayer neural networks.
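The setup above can be summarized concretely. The sketch below is an assumed, illustrative simulation of a context-dependent environment E_x (the reward functions d_1(·), d_2(·) and the interval D = [−10, 10] are invented for illustration and are not taken from the paper): a context is drawn, an action is applied, and a binary response β ∈ {0, 1} is returned with probability d_i(x).

```python
# Minimal sketch of a context-dependent random environment E_x (assumed, illustrative).
# d1(x), d2(x) are reward probabilities; beta = 1 means success, 0 means failure.
import numpy as np

rng = np.random.default_rng(0)

def d(x, action):
    """Illustrative reward probabilities; any functions of x with values in (0, 1) would do."""
    d1 = 0.9 if np.sin(np.pi * x / 4) > 0 else 0.1   # action 1 good where sin > 0 (assumed)
    return d1 if action == 1 else 1.0 - d1

def environment_response(x, action):
    """Return beta(n) in {0, 1} with Prob[beta = 1] = d_action(x)."""
    return int(rng.random() < d(x, action))

# One stage: a context is drawn from D = [-10, 10], an action is chosen, a response is observed.
x_n = rng.uniform(-10, 10)
beta_n = environment_response(x_n, action=1)
print(x_n, beta_n)
```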

Problem 2

At each environment E_x corresponding to the context vector x ∈ X, two decision makers A and B choose actions from their action sets. A has two actions α_1 and α_2, while B has two actions γ_1 and γ_2. A play at any instant n is the set of actions chosen by A and B at that instant and is denoted by (α(n), γ(n)). Each player is assumed to be unaware of the choice made by the other player. The outcome, β(n), is once again assumed to take one of two possible values {0, 1}, where 0 corresponds to a failure and 1 corresponds to a success. The success probability of a play (α_i, γ_j) in the context x is defined as d_ij(x) = Prob{β(n) = 1 | α(n) = α_i, γ(n) = γ_j, x(n) = x}. If d_lm(x) = max{d_ij(x)} over all i, j = 1, 2, then the play (α_l, γ_m) is the optimal play for context x. We define disjoint regions D_i which have the same properties as in problem 1, D = ∪_i D_i and D_i ∩ D_j = Φ, i ≠ j. For every x ∈ D_j it is assumed that a unique pair {α_j, γ_k} exists which is optimal. The optimal action set can then be described by two functions, α_l(x) and γ_m(x), respectively.

If D̂_i ⊂ D_i satisfies the conditions in problem 1, the objective is to determine the structure of two decentralized decision makers as well as their learning algorithms h(x(n)) and g(x(n)) such that

    lim_{n→∞} h(x(n) = x) = α_l(x)   and   lim_{n→∞} g(x(n) = x) = γ_m(x),   for all x ∈ D̂.

Comment 3: The decomposition of D into disjoint regions D_i in problems 1 and 2 assumes that the optimal actions change discontinuously across the boundaries. The definition of the compact regions D̂_i implies that we are indifferent to the decisions made in the vicinities of the boundaries of D_i. The need to define D̂_i and D̂ is dictated by analytical tractability. In particular, the speed of convergence of the sequential decision process is determined by the constants ε_1 and ε_2.

III. THREE APPROACHES TO PROBLEM 1

Three methods of dealing with problem 1 are described briefly in this section. Of these, the first two indicate the evolution of the principal ideas while the last can be considered the main contribution of the paper. In all cases, a neural network is used as a decision maker and its weights are adjusted based on the response of the environment.

Method 1

If the environment E_x for every x ∈ D is assumed to be deterministic, the use of an action α_i (i = 1, 2) yields a response 0 or 1. This implies that the optimal action in E_x can be determined by a single experiment (since the action set contains only two elements). For example, if α_1 is selected and results in a failure, α_2 is the optimal action.

When the environment is deterministic as stated above, problem 1 can be considered to be a supervised pattern recognition problem. The objective in such a case is to determine a discriminant function N*(x), which partitions the region D into disjoint regions where α_1 or α_2 is optimal. Since the regions corresponding to different optimal actions in problem 1 were assumed to be disjoint with a minimum distance 2ε_1 between them, it follows that a smooth function N*, which belongs to a set X*, exists such that for all x ∈ D̂

    N*(x) > 0  ⇒  α_1 is the optimal action
    N*(x) ≤ 0  ⇒  α_2 is the optimal action.

A neural network is used in method 1 to realize such a discriminant function, with N(x) ∈ [−1, 1] for x ∈ D. If N(x) > 0 action α_1 is chosen, while if N(x) ≤ 0, α_2 is the action selected. The weights of the neural network are adjusted so that the map N(·) evolves to an element of X*.

If at stage n the context vector is x(n), the desired output is defined as y_d(x(n)) such that y_d(x(n)) = sgn N*(x(n)), x ∈ D. The parameters of the neural network are updated using back propagation based on the error function e²(n), where

    e(n) = y_d(x(n)) − N(x(n)) = sgn N*(x(n)) − N(x(n)).

Assuming that no local minima exist, the back propagation method results in the convergence of N(x) to sgn N*(x) for all x ∈ D̂. In fact, after a finite number of corrections, sgn N(x) = sgn N*(x) for x ∈ D̂; i.e., the optimal action is chosen in D̂. However, this does not imply that the learning process terminates. The latter will continue until N(x) converges to sgn N*(x) in D̂.

While the method, as discussed above, is intended for use in deterministic environments, it can also be applied directly even when the environment is stochastic. In the latter case, at each state x, the reward probabilities d_1(x) and d_2(x) are not binary but lie in the interval (0, 1) and are unknown. In spite of this, the decision maker merely behaves as though it is operating in a deterministic environment. For example, if α_1 results in a failure response, α_2 is assumed to be the optimal action and back propagation is used to update the weights with y_d(x(n)) = −1.
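As a concrete illustration of method 1, the sketch below is an assumed minimal implementation (not the authors' code; the network sizes, the tanh output, the step size, and the stochastic environment are illustrative choices). A target y_d = +1 or −1 is inferred from a single response, as described above, and one back-propagation step is taken.

```python
# Method 1 sketch (assumed): discriminant N(x) in [-1, 1], trained by back propagation
# on e = y_d - N(x), where y_d is set as if the environment were deterministic.
import numpy as np

rng = np.random.default_rng(1)

# Tiny 1-20-10-1 network with tanh units (sizes follow Example 1 loosely; illustrative).
W1, b1 = rng.normal(0, 0.2, (20, 1)), np.zeros(20)
W2, b2 = rng.normal(0, 0.2, (10, 20)), np.zeros(10)
w3, b3 = rng.normal(0, 0.2, 10), 0.0

def forward(x):
    h1 = np.tanh(W1[:, 0] * x + b1)
    h2 = np.tanh(W2 @ h1 + b2)
    return np.tanh(w3 @ h2 + b3), (h1, h2)

def train_step(x, y_d, lr=0.1):
    """One back-propagation step on the squared error (y_d - N(x))^2."""
    global W1, b1, W2, b2, w3, b3
    N, (h1, h2) = forward(x)
    delta3 = (y_d - N) * (1 - N**2)            # output-layer error signal
    delta2 = (w3 * delta3) * (1 - h2**2)
    delta1 = (W2.T @ delta2) * (1 - h1**2)
    w3 += lr * delta3 * h2;  b3 += lr * delta3
    W2 += lr * np.outer(delta2, h1);  b2 += lr * delta2
    W1[:, 0] += lr * delta1 * x;      b1 += lr * delta1

def environment(x, action):                     # stochastic environment (illustrative d_i(x))
    d1 = 0.9 if np.sin(np.pi * x / 4) > 0 else 0.1
    return int(rng.random() < (d1 if action == 1 else 1 - d1))

for n in range(20000):
    x = rng.uniform(-10, 10)
    action = 1 if forward(x)[0] > 0 else 2      # deterministic choice from sgn N(x)
    beta = environment(x, action)
    # Treat the response as if the environment were deterministic:
    # success => chosen action assumed optimal; failure => the other action assumed optimal.
    optimal_guess = action if beta == 1 else (2 if action == 1 else 1)
    train_step(x, +1.0 if optimal_guess == 1 else -1.0)
```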

Method 2

The second method for dealing with problem 1 uses multiple learning automata distributed throughout the region D in the context space. Since D is assumed to be compact, we can find a finite cover {B_i} so that B_i ∩ B_j = Φ, i ≠ j, and ∪_j B_j = D. The B_i can therefore be regarded as elementary cells in the context space. A lattice of points is chosen in D with a single point in each of the cells B_i, and an automaton A_i is associated with it. Each automaton has two actions, α_1 and α_2, and as described in Section II the response of the environment E_x has reward probabilities d_1(x) and d_2(x) corresponding to the two actions.

The operation of the entire system may now be discussed briefly as follows. At every instant n the context vector assumes the value x(n). The corresponding learning automaton chooses one of the two actions in the action set {α_1, α_2} according to the current action probabilities; it then receives a response and updates its action probabilities using one of the well-known learning algorithms (e.g., L_R-I or L_R-P). For example, if all the automata use the L_R-I (linear reward-inaction) algorithm, the action probabilities p_1(n) and p_2(n) of a typical automaton are updated as follows.

If α(n) (the action chosen at stage n) = α_i and β(n) ∈ {0, 1} is the response of the environment, then

    p_i(n + 1) = p_i(n) + a(1 − p_i(n)) β(n)
    p_j(n + 1) = p_j(n) − a p_j(n) β(n),   j ≠ i,        (1)

where a ∈ (0, 1) is the step size.

Similarly, if an L_R-P (linear reward-penalty) algorithm is used, the probabilities are updated as follows.

If α(n) = α_i, then

    p_i(n + 1) = p_i(n) + a(1 − p_i(n))    if β(n) = 1,
    p_i(n + 1) = p_i(n) − a p_i(n)         if β(n) = 0,
    p_j(n + 1) = 1 − p_i(n + 1),           j ≠ i.        (2)

We note that automaton A_i is operative only when x ∈ B_i; consequently the argument n in (1) and (2) corresponds to the nth time the context vector occurs in B_i. For the convergence properties of the learning automata, the reader is referred to [1]. It is well known that the L_R-I scheme is ε-optimal while the L_R-P scheme converges in distribution.
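A direct transcription of updates (1) and (2) is given below. This is a sketch for the two-action case; the class name and the default step size are illustrative and not from the paper.

```python
# Sketch of a two-action learning automaton with L_R-I and L_R-P updates (eqs. (1) and (2)).
import numpy as np

class TwoActionAutomaton:
    def __init__(self, step=0.05, rng=None):
        self.p = np.array([0.5, 0.5])     # action probabilities p_1(n), p_2(n)
        self.a = step                     # step size a in (0, 1)
        self.rng = rng or np.random.default_rng()

    def choose(self):
        """Pick action 0 or 1 according to the current probabilities."""
        return int(self.rng.random() >= self.p[0])

    def update_LRI(self, i, beta):
        """Linear reward-inaction: move toward the chosen action only on a success."""
        if beta == 1:
            self.p[i] += self.a * (1.0 - self.p[i])
            self.p[1 - i] = 1.0 - self.p[i]

    def update_LRP(self, i, beta):
        """Linear reward-penalty: move toward the chosen action on success, away on failure."""
        if beta == 1:
            self.p[i] += self.a * (1.0 - self.p[i])
        else:
            self.p[i] -= self.a * self.p[i]
        self.p[1 - i] = 1.0 - self.p[i]
```

With two actions, renormalizing the other probability as 1 − p_i(n + 1) is equivalent to the per-component updates in (1).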

From the above discussion it is clear that the choice of the cells B_i is critical to the efficiency of the learning process. The use of a large number of cells implies the use of a large number of automata and hence a slow learning process, while a small number of cells results in a poor resolution in the choice of the optimal action. If B_i is in the interior of a region D_j, the optimal action at all points in B_i is the same. Hence, questions of proper convergence arise only in those cells which have an intersection with more than one set D_j. The choice of the number of cells is consequently determined by the prior information concerning the geometry of the D_j and hence by the constants ε_1 and ε_2.

As the number of times a context vector visits a typical cell B_i tends to infinity, the action probabilities of the corresponding automaton converge in the sense described earlier. Hence, for sufficiently large n, the action probabilities of the automata {A_i} contain the relevant information concerning the optimal action in each cell. If a neural network is used to approximate the action probabilities p_i(n) at all the lattice points x, it can in turn be used to determine the optimal actions at all points x. For example, if N(x) > 0.5 at any context x, the optimal action at x is chosen as α_1.

Method 3

In method 2, automata located at the lattice points in D determine the probabilities with which the action α_1 is to be chosen. The process is terminated after a finite time and the action probabilities are approximated by the output of a neural network. Hence, learning takes place only for a finite time. In contrast to this, in method 3 the output of the network corresponding to state x(n) at any instant n is directly the probability with which the action α_1 is chosen. Denoting the weight vector of the neural network at stage n as θ(n),

    N(x(n), θ(n)) = p_1(x(n)) = probability of choosing action α_1 at state x(n) at the nth stage.

As in the previous cases, two outcomes are possible for the action chosen. In both cases the learning algorithm has to determine how θ(n) is to be updated.

In the method adopted, the parameter vector θ(n) is directly adjusted to increase or decrease the output probability depending upon the response received from the environment and the action chosen. In this sense, the neural network can be considered a generalized automaton which operates everywhere in D. The learning algorithm, for a reward-penalty scheme, now takes the form

    N(x(n), θ(n + 1)) = N(x(n), θ(n)) + ΔN_1(x(n), θ(n))
        if α_1 is selected and results in a success or
        if α_2 is selected and results in a failure;

    N(x(n), θ(n + 1)) = N(x(n), θ(n)) − ΔN_2(x(n), θ(n))
        if α_2 is selected and results in a success or
        if α_1 is selected and results in a failure.        (3)

In a similar manner, algorithms corresponding to reward-inaction and reward-ε-penalty can also be defined. To determine ΔN_1 (or −ΔN_2) in (3), e(x(n), θ(n)) = 1 − N(x(n), θ(n)) (or −N(x(n), θ(n))) is used as the error signal in the back propagation algorithm, and θ(n) is adjusted along the negative gradient of e² with a step size λ, i.e., Δθ(n) = λ e(x(n), θ(n)) (∂N/∂θ)|_(x(n), θ(n)).

As described above, the parameters of the network are adjusted to realize an objective function N*(x) which satisfies the conditions

    N*(x) = +1   where α_1 is optimal, x ∈ D̂
    N*(x) = 0    where α_2 is optimal, x ∈ D̂.

This is similar in spirit to the learning automaton approach when the context space contains a single point. As in that approach, the evolution of N(x) is stochastic in nature and our interest is in the stochastic convergence of the function N(x).
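The sketch below shows one possible on-line implementation of method 3 under the reward-penalty rule (3). It is assumed code, not the authors'; the one-hidden-layer network, the sigmoid output, the step size λ = 0.05, and the environment are illustrative. The network output is used as the probability of α_1, the action is sampled, and a single back-propagation step is taken toward the target 1 or 0.

```python
# Method 3 sketch (assumed): N(x, theta) in [0, 1] is the probability of choosing action alpha_1.
import numpy as np

rng = np.random.default_rng(2)

W1, b1 = rng.normal(0, 0.3, (20, 1)), np.zeros(20)
w2, b2 = rng.normal(0, 0.3, 20), 0.0

def N(x):
    h = np.tanh(W1[:, 0] * x + b1)
    return 1.0 / (1.0 + np.exp(-(w2 @ h + b2))), h

def reward_penalty_step(x, beta, action, lam=0.05):
    """One gradient step with error e = target - N, target = 1 or 0 per the reward-penalty rule (3)."""
    global w2, b2
    p, h = N(x)
    # target 1: alpha_1 succeeded or alpha_2 failed; target 0: the complementary cases.
    target = 1.0 if (action == 1) == (beta == 1) else 0.0
    e = target - p
    dz = e * p * (1 - p)                        # sigmoid derivative times error signal
    dh = (w2 * dz) * (1 - h**2)
    w2 += lam * dz * h;  b2 += lam * dz
    W1[:, 0] += lam * dh * x;  b1 += lam * dh

def environment(x, action):                     # illustrative reward probabilities
    d1 = 0.9 if np.sin(np.pi * x / 4) > 0 else 0.1
    return int(rng.random() < (d1 if action == 1 else 1 - d1))

for n in range(50000):
    x = rng.uniform(-10, 10)
    p1, _ = N(x)
    action = 1 if rng.random() < p1 else 2      # sample the action from N(x, theta)
    beta = environment(x, action)
    reward_penalty_step(x, beta, action)
```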

IV. ANALYSIS

In all the methods presented in Section III, the output of the network determines which of the two actions is to be chosen, and the weights of the network are adjusted on the basis of the response of the environment. The analysis of the performance of the methods suggested is rendered complex by the fact that an attempt is made in methods 2 and 3 to generalize concepts based on learning automata theory, using multilayer neural networks. Hence, before undertaking such an analysis, it appears fruitful to consider the two principal components of all the schemes suggested, i.e., deterministic decision making in the context space using a neural network and stochastic decision making at a point in the context space using learning automata, in somewhat greater detail. The former is best described in the context of method 1 and a deterministic environment.

Method 1

The problem of determining the optimal action at every point in D when the response of E_x, x ∈ D, is deterministic can be regarded as a problem in pattern recognition. Since the latter has been well studied in the literature, only a few relevant comments are made here. If the output of the neural network can be expressed as N(x, θ), where θ is a weight vector, the aim of the learning procedure is to determine a rule for updating θ(n) such that

    lim_{n→∞} N(·, θ(n)) ∈ X*

where X* is the class of desired discriminant functions defined in Section III. If N*_1, N*_2 ∈ X*, then sgn N*_1(x) = sgn N*_2(x), x ∈ D̂.

It is well known that using the back propagation method θ(n) can be adjusted along the negative gradient of an error function to reach a minimum value. While from a theoretical standpoint the negative gradient has to be computed using the entire training set, in practice the adjustment is generally based on the measurement at every instant. As pointed out in [14], it is theoretically not possible to demonstrate that such a procedure will result in the minimization of the error function. However, it has been observed in numerous empirical studies that the procedure, using a small step size, results in convergence to the desired value. Further, while the gradient method can, in theory, converge to a local minimum, in many cases a global minimum is obtained. To separate the effects of the neural network from those resulting from the stochastic nature of the environment, we shall assume that the neural network in a deterministic environment will invariably converge to the global minimum even when the parameters are updated after every measurement, i.e.,

    lim_{n→∞} |sgn N*(x) − N(x, θ(n))| = 0.        (4)

We now consider the case where E_x corresponding to every context vector x is random. Let d_1(x) > d_2(x) at every point x ∈ D for which sgn N*(x) = +1 in the deterministic case. This implies that α_1 is the optimal action in the same regions in D̂ in both deterministic and stochastic cases. While very little can be said about the convergence of method 1 when applied directly in a random environment, its behavior can be analyzed if the algorithm for the adjustment of θ(n) is suitably modified. In particular, if θ(n) is adjusted using a stochastic approximation algorithm, the function N will converge to N* ∈ X* with probability 1. This follows from the facts that the gradient method converges to the global optimum in the deterministic case and the expected value of the function to be optimized in the stochastic case has the same optima as that in the deterministic case. In method 1, the parameter vector is adjusted using a fixed step size which is sufficiently small. Hence, θ(n) will lie in some neighborhood of the optimal value θ* with a probability depending on the step size as well as the randomness of the environment.

Use of Learning Automata in Stochastic Environments

The convergence properties of various learning algorithms operating in stochastic environments have been studied extensively and have been collected in [1]. If α_1, α_2, ..., α_r correspond to the actions, p_1(n), p_2(n), ..., p_r(n) the probabilities of the actions at stage n, and p(n) the action probability vector with p_i(n) as the elements, the learning algorithm is usually formulated as

    p(n + 1) = T[p(n), α(n), β(n)]        (5)

where α(n) is the action at stage n and β(n) is the response of the environment. Since the elements of the vector p(n) lie in the interval [0, 1] and satisfy the constraint equation Σ_{i=1}^r p_i(n) = 1, the unit simplex

    S_r = { p | p^T = [p_1, p_2, ..., p_r], 0 ≤ p_i ≤ 1, Σ_i p_i = 1 }

represents the state space of the process {p(n)}_{n≥1}. The learning algorithm (5) represents a discrete-time Markov process defined on the state space S_r and having a stationary transition function. The asymptotic behavior of p(n) is found to depend upon the specific algorithm used. With absolutely expedient algorithms (e.g., L_R-I), p(n) converges w.p. 1 to an absorbing set. Further, the probability that p_l(n) converges to unity can be made arbitrarily close to 1 by choosing the step size sufficiently small. Such classes of algorithms are known as ε-optimal algorithms. With L_R-P and L_R-εP algorithms, p(n) converges to a limiting distribution, whose parameters depend upon the reward probabilities. By a proper choice of the step size, it is found that L_R-εP algorithms can be made arbitrarily close to optimal.

As described above, the learning corresponds to a single point x in the context space X. Methods 2 and 3, described in the previous section and analyzed below, use a neural network to generalize the results throughout the region D in the context space.

Method 2

In the absence of prior information concerning D, all the cells B_i are chosen to have identical measure μ_c. Let L be the number of cells and, hence, the number of automata. The context vector at stage n is assumed to be chosen with a uniform probability distribution over D, so that the probability of x(n) ∈ B_i is 1/L. As the number n of context vectors chosen becomes arbitrarily large, each cell is visited on an average n/L times, which grows with n at the same rate. Let L_2 be the number of cells which contain boundary points of two regions in which two different actions are optimal.

An automaton using an absolutely expedient scheme will converge to its optimal action with a probability 1 − ε, where ε can be made arbitrarily small. Hence, in the limit as n → ∞, on an average (L − L_2)(1 − ε) automata can be expected to converge to their true values (i.e., 0 or 1) while (L − L_2)ε converge to the wrong values. In addition, along the boundaries are L_2 cells in each of which exists a subdomain where the action chosen is not optimal. Hence, L_2 + ε(L − L_2) provides a conservative bound on the number of cells in which the wrong action can be chosen. If L_2 << L and ε << 1, this implies that the correct action is chosen in almost all the cells. The importance of the measure μ_c of the cells B_i now becomes evident. For a given distribution of the regions D_j in which a single action is optimal, decreasing μ_c results in strengthening the inequality L_2 << L. Hence, by a proper choice of the cell size as well as the step size of the learning automata, the desired accuracy can be obtained.

Since a neural network is used off-line to approximate the probabilities realized by the automata, by the assumptions made earlier in this section, the resulting output of the network can be used to make the decision regarding the optimal action corresponding to any context vector x ∈ D.

Method 3

This case represents a further evolution of method 2 so that the output N(x) (∈ [0, 1]) of the network at state x is directly the probability of choosing the action α_1. At every stage n, the actions α_1 and α_2 are chosen with probabilities N(x(n), θ(n)) and 1 − N(x(n), θ(n)) respectively. The parameter vector θ(n) is then adjusted depending upon the action chosen and the corresponding response of the environment. The updating is readily stated in terms of N(x(n), θ(n)) rather than θ(n), using well-known concepts in learning automata theory mentioned earlier. For example, either a reward-inaction or reward-penalty scheme can be used. If a reward-penalty algorithm is used, N(x(n), θ(n + 1)) = N(x(n), θ(n)) + ΔN(x(n), θ(n)), where ΔN(x(n), θ(n)) > 0 if α_1 is chosen and results in a success or α_2 is chosen and results in a failure, and ΔN(x(n), θ(n)) < 0 for the complementary cases. The problem arises because the change ΔN(x(n), θ(n)) is produced indirectly by a change in θ(n). If Δθ(n) is a sufficiently small change in θ(n), ΔN(x(n), θ(n)) ≈ g_n^T Δθ(n), where g_n is the gradient of N(x(n)) with respect to θ(n) at the state x and parameter θ(n). It is assumed that ||g_n|| is bounded for all n and that 0 < ||g_n|| < g_max, where g_max is known. If Δθ(n) is adjusted to decrease an error function (N* − N)², as is done in method 3, it is adjusted along the negative gradient so that Δθ(n) = λ[N* − N] g_n, where λ is a positive constant.

If λ << 1,

    ΔN(x(n), θ(n)) ≈ λ g_n^T g_n [N* − N]

or

    ΔN(x(n), θ(n)) = a_n [N* − N]

where a_n is a time-varying gain which lies in the open interval (0, ε) if λ < ε / g_max². Substituting N*(x) = 1 or 0 in the above equations, updating algorithms corresponding to the L_R-I and L_R-P schemes with a time-varying step size are realized and can be stated as follows.

For L_R-I,

    N(x(n), θ(n + 1)) = N(x(n), θ(n)) + a_n (1 − N(x(n), θ(n)))
        if α_1 is chosen and results in a success, and
    N(x(n), θ(n + 1)) = N(x(n), θ(n)) − a_n N(x(n), θ(n))
        if α_2 is chosen and results in a success.

For L_R-P,

    N(x(n), θ(n + 1)) = N(x(n), θ(n)) + a_n (1 − N(x(n), θ(n)))
        if α_1 is chosen and results in a success or α_2 is chosen and results in a failure, and
    N(x(n), θ(n + 1)) = N(x(n), θ(n)) − a_n N(x(n), θ(n))
        if α_2 is chosen and results in a success or α_1 is chosen and results in a failure.

If a_n < ε, where ε is sufficiently small, it follows that N(x(n), θ(n)) approaches N* with a probability arbitrarily close to 1 if an L_R-I type of scheme is used. Similarly, with an L_R-P type of scheme, N(x(n), θ(n)) will converge to a normal distribution with mean (1 − d_2(x)) / (2 − d_1(x) − d_2(x)).

Fig. 2. (a) The function N*(·) for Example 1, D = [−10, 10], with the segments where α_1 and α_2 are the optimal actions marked. (b) Initial choice of N(x) in Example 1.
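The relation ΔN ≈ a_n[N* − N] with a_n = λ g_n^T g_n can be checked numerically. The sketch below uses an assumed toy network and finite-difference gradients (purely illustrative); it takes one step Δθ = λ[N* − N]g_n and compares the actual change in N with the predicted one.

```python
# Numerical check of the approximation dN ~ lambda * g^T g * (N* - N)  (assumed toy example).
import numpy as np

rng = np.random.default_rng(0)

def forward(theta, x):
    """Tiny one-hidden-layer network with a sigmoid output in [0, 1] (illustrative)."""
    W1 = theta[:10].reshape(5, 2);  b1 = theta[10:15]
    w2 = theta[15:20];              b2 = theta[20]
    h = np.tanh(W1 @ x + b1)
    return 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))

def grad(theta, x, eps=1e-6):
    """Finite-difference gradient g_n of N(x, theta) with respect to theta."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps; tm[i] -= eps
        g[i] = (forward(tp, x) - forward(tm, x)) / (2 * eps)
    return g

theta = rng.normal(scale=0.5, size=21)
x = np.array([0.3, -1.2])
lam, N_star = 0.05, 1.0                           # step size lambda and target N*

N_old = forward(theta, x)
g = grad(theta, x)
theta_new = theta + lam * (N_star - N_old) * g    # one step along lambda * [N* - N] * g_n
N_new = forward(theta_new, x)

a_n = lam * g @ g                                 # predicted time-varying gain
print("actual change in N   :", N_new - N_old)
print("predicted a_n*(N*-N) :", a_n * (N_star - N_old))
```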

Comment 4: Method 3 differs from method 1 in the interpretation of the output N(x) of the neural network. In the former, it corresponds to the probability of the action α_1, while in the latter it is the estimate of the value of the desired discriminant function at x. While the action at x is chosen probabilistically in method 3, it is chosen deterministically in method 1. The empirical results presented in Section VI show that, for the examples considered, method 3 results in better performance than method 1. However, demonstrating this theoretically is rather difficult. Work is currently in progress to formulate the problem in such a manner as to make it analytically tractable.

V. DECENTRALIZED CONTROL USING NEURAL NETWORKS

The methods described in Sections III and IV reveal that neural networks can be used effectively for centralized decision making in random environments. We now consider problem 2, where decentralized decisions have to be made when global information is not available. Cooperative games of learning automata have been studied in the past [10], [11], when the individual automata update their action probabilities based only on the response of the environment and without any knowledge of the actions of the other automata. Problem 2 generalizes this further to include a context space. The response of the random environment depends upon both the actions of the various decision makers and the context vector x ∈ D. We shall consider only two decision makers in our discussion, but the results can be extended to multiple decision makers along the same lines as in [10] and [11] (see Section VII).

A and B are two decision makers operating in a random environment. The response of this environment to their combined actions is specified by a game matrix G(x), whose elements d_ij(x) are functions of the context vector x. Each decision maker is aware of the context vector x(n) at stage n, as well as the response of the environment, but not of the strategy of the other decision maker. Their common objective is to converge to their respective optimal actions α_l(x) and γ_m(x) corresponding to every x ∈ D.

N_A(·) and N_B(·) are two neural networks corresponding to A and B. The outputs N_A(x) and N_B(x) of the networks correspond to the probabilities of the actions α_1 and γ_1 respectively. At every stage n, the two decision makers choose their actions on the basis of their output probabilities and adjust the parameters of N_A and N_B as described in Section III.

In this paper we merely present simulation results of this two-person cooperative game. The analysis of the convergence of the algorithms is substantially more complex.
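One stage of this decentralized scheme can be sketched as follows. This is assumed code: the linear-in-features networks and the quadrant-based game matrices are illustrative stand-ins for N_A, N_B, and G(x), not the configuration used in the paper.

```python
# Sketch of the decentralized cooperative game (assumed, illustrative setup).
import numpy as np

rng = np.random.default_rng(3)

def features(x):                       # simple fixed feature map for a 2-D context (assumed)
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

theta_A = np.zeros(4)                  # parameters of N_A and N_B (logistic outputs)
theta_B = np.zeros(4)

def prob(theta, x):
    return 1.0 / (1.0 + np.exp(-theta @ features(x)))

def game_response(x, i, j):
    """Success prob. 0.9 for the pair assumed optimal in the quadrant of x, else 0.1 (illustrative)."""
    optimal = (0 if x[0] < 0 else 1, 0 if x[1] < 0 else 1)
    return int(rng.random() < (0.9 if (i, j) == optimal else 0.1))

def rp_update(theta, x, chose_first, beta, lam=0.05):
    """Reward-penalty step on one network: push its output toward 1 or 0."""
    p = prob(theta, x)
    target = 1.0 if bool(chose_first) == (beta == 1) else 0.0
    return theta + lam * (target - p) * p * (1 - p) * features(x)

for n in range(100000):
    x = rng.uniform(-10, 10, size=2)
    a = rng.random() < prob(theta_A, x)          # A chooses alpha_1 with probability N_A(x)
    b = rng.random() < prob(theta_B, x)          # B chooses gamma_1 with probability N_B(x)
    beta = game_response(x, 0 if a else 1, 0 if b else 1)
    theta_A = rp_update(theta_A, x, a, beta)     # each player updates without seeing the other's action
    theta_B = rp_update(theta_B, x, b, beta)
```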

VI. SIMULATION RESULTS

Extensive simulation studies of problems 1 and 2 have been carried out using the schemes described in Sections III and V. In this section, specific examples are collected and presented in a coherent fashion. The inclusion of all the examples in a single section enables the different schemes to be compared in an efficient manner based on their complexity, speed of convergence, accuracy, and asymptotic behavior. Examples 1, 2, and 3 deal with problem 1; in example 1 X = R, while in examples 2 and 3 X = R². While the samples in context space are assumed to be uniformly distributed in a domain D in example 2, the distribution is nonuniform in example 3. Example 4 deals with decentralized decision making involving two decision makers, each with two actions, in X = R². In all the simulations carried out using method 3, an L_R-P algorithm was used to update the weights of the neural network, since the use of an L_R-I algorithm resulted in very slow convergence.

Fig. 3. Performance of the three methods in Example 1 after 80 000 iterations, for reward probabilities (0.9, 0.1) and (0.7, 0.3). (a) Method 1. (b) Method 2. (c) Method 3.

Example 1 (X = R)

The region D of interest in the context space is the closed interval [−10, 10]. The function N*(x), a decaying exponential multiplied by sin(πx/4) and shown in Fig. 2(a), determines the regions where α_1 and α_2 are the optimal actions. Defining D_1 = [−10, −8), D_2 = [−8, −4), D_3 = [−4, 0], D_4 = [0, 4), D_5 = [4, 8), and D_6 = [8, 10], α_1 is the optimal action in S_1 = D_2 ∪ D_4 ∪ D_6 and α_2 is the optimal action in S_2 = D_1 ∪ D_3 ∪ D_5. The compact regions of interest are Ŝ = ∪_i Ŝ_i, where Ŝ_i ⊂ D_i and ρ(Ŝ_i, δ(D_i)) ≥ ε_1, with ε_1 = 0.1. The objective is to determine the extent to which the various schemes suggested in Section IV will succeed in determining the regions Ŝ_i (i = 1, 2, ..., 6).

The neural networks used in all cases belong to the class N_{1,20,10,1} (one input node, two hidden layers with 20 and 10 nodes, and one output node) as defined in [13]. The context vector x(n) at stage n is chosen using a uniform distribution over D, and the initial choice of N(x) is shown in Fig. 2(b). The step size in the adjustment of the weights of the network was 0.1.

Fig. 3 shows the function N(x) realized using all three methods after 80 000 iterations for the two sets of values (0.9, 0.1) and (0.7, 0.3) of the reward probabilities. Method 1 is seen to perform satisfactorily when the reward probabilities are (0.9, 0.1), but fails entirely to recognize the regions Ŝ_1 and Ŝ_6 at the beginning and end of the interval when the reward probabilities are (0.7, 0.3). This indicates that the method has limited applicability when the measure of randomness of the environment is high. In contrast to method 1, method 3 performs satisfactorily in both environments and is uniformly better than method 1.

Fig. 4. (a) Regions D_i in Example 2. (b) Initial estimates of the regions D_i in Example 2.

Example 2 (X = R²)

The region D in the context space belongs to R² and is the 20 × 20 square around the origin. Regions D_1 and D_2 are circles of radius 3 centered at (−5, 5) and (5, 5) respectively, while region D_3 is a 6 × 4 rectangle whose vertices are (0, −2), (0, −6), (−6, −6), and (−6, −2). In the set S_1 = D_1 ∪ D_2 ∪ D_3, the action α_1 is optimal, and the action α_2 is optimal in the complement S_1^c of S_1 with respect to D. These regions are shown in Fig. 4(a). The objective is to determine estimates of the regions S_1 and S_1^c.

The neural network used has three input nodes (since X = R²), but the numbers of nodes in the hidden layers and the output layer are the same as in example 1. The context vectors are chosen with a uniform distribution over D and the weights of the network are adjusted using the algorithm described in Section III. A vertical line (|) at any context x indicates that the action α_1 is selected with higher probability than α_2 in that context, and a horizontal line (-) corresponds to a state where the probability of action α_2 is higher. Fig. 4(b) shows the initial conditions of these probabilities given by the initial weights of the neural network.

In Fig. 5 the simulation results are shown for the case where d_1(x) = 0.7, d_2(x) = 0.3 for x ∈ S_1 and d_1(x) = 0.3, d_2(x) = 0.7 for x ∈ S_1^c. For methods 1 and 3, 250 000 iterations were needed for convergence. In method 2, 3600 automata were used to cover the region D and 36 000 experiments were performed to update them. Since the neural network was used off-line to deterministically approximate the probabilities of the automata in method 2, 80 000 iterations were found to be adequate. On the basis of these simulations, it appears that methods 2 and 3 are significantly better than method 1 when the degree of randomness is high.

An Alternative Method of Implementation: In many practical problems the number of iterations needed in the earlier simulations to make the decision regarding S_1 and S_1^c may be quite unrealistic if each iteration corresponds to an experiment performed at a specific value of the context vector. In such cases, the modification suggested in this section may be practically more feasible. We illustrate this in terms of a problem in viniculture. The region D is assumed to be located somewhere in Bordeaux, France, and the objective is to determine which of two grape varieties, α_1 (Cabernet Sauvignon) or α_2 (Merlot), is to be grown in it. While it is known that a large number of factors, among them geographic location, soil, weather, the process of wine making, and the kind of grape used, determine the quality of wine, we shall assume that only the choice of the grape can be made by the decision maker, and that all the other factors contribute to the stochastic nature of the outcome.

It is obvious that the method used in the earlier case of performing a single experiment (planting a grapevine) and updating the weights of a neural network based on the outcome of the experiment (quality of the wine) would not be feasible in the present context, particularly since the response time is generally a few years. Hence, the following modification is proposed for the earlier method. At every stage, a large number (e.g., ten thousand) of experiments are performed and the corresponding values of x are stored. Based on the outcomes at these values of x, the weights of the network are updated. Hence, in ten stages (or 100 000 iterations) the network weights approximate their final values. These results for the environment with reward probabilities (0.9, 0.1) using method 3 are shown in Fig. 6. Hence, to make method 3 practically feasible, it may be necessary to store part of the information external to the neural network.
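A sketch of this batched variant of method 3 is given below (assumed code; the batch size, the logistic model, and the environment are illustrative). Each stage collects a batch of (context, action, response) triples and then sweeps over the stored data to update the weights.

```python
# Sketch of the batched variant of method 3 (assumed): collect experiments first, update afterwards.
import numpy as np

rng = np.random.default_rng(4)
theta = np.zeros(4)                              # logistic model on simple features (illustrative)

def features(x):
    return np.array([1.0, x[0], x[1], x[0]**2 + x[1]**2])

def prob_alpha1(x):
    return 1.0 / (1.0 + np.exp(-theta @ features(x)))

def environment(x, action):                      # illustrative d_i(x): alpha_1 good inside a disk
    d1 = 0.9 if x[0]**2 + x[1]**2 < 25 else 0.1
    return int(rng.random() < (d1 if action == 1 else 1 - d1))

for stage in range(10):                          # ten stages of 10 000 stored experiments each
    batch = []
    for _ in range(10000):                       # experiments performed and stored
        x = rng.uniform(-10, 10, size=2)
        action = 1 if rng.random() < prob_alpha1(x) else 2
        batch.append((x, action, environment(x, action)))
    for x, action, beta in batch:                # weights updated from the stored outcomes
        p = prob_alpha1(x)
        target = 1.0 if (action == 1) == (beta == 1) else 0.0
        theta += 0.05 * (target - p) * p * (1 - p) * features(x)
```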

Fig. 5. Example 2: Performance of the three methods. (a) Method 1. (b) Method 2. (c) Method 3.

Fig. 6. Performance of method 3 with memory in Example 2.

Example 3 (X = R²): Nonuniform Distribution of Context Vectors

The problem in this case is identical to that discussed in connection with Example 2, but the context vectors are not assumed to occur with a uniform distribution over D. This is a problem of common occurrence in many practical applications and its effect is to overtrain the neural network at regions in which the probability of the context vector is high. If the probability distribution of x is known, simple measures (e.g., accepting or rejecting samples with probabilities inversely proportional to their probabilities of occurrence) can be adopted to ensure a uniform distribution of the accepted context vectors. However, since the distribution of x is generally unknown, the following modification is proposed. Region D is divided into uniform cells B_i as in method 2. Exactly one context vector and the outcome of the experiment corresponding to it are stored in each cell. Updating of the network weights is started only after all the cells are filled. Following this, at each iteration, one of the boxes is chosen using a uniform distribution and the weights of the neural network are updated based on the response at the context vector contained in it. Further, as new samples arrive, the older samples contained in B_i are replaced by them. Hence at any instant the most recent samples are found in each of the cells. A similar procedure was used in [15] for a network control problem.
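The cell-buffer modification can be sketched as follows (assumed code; the grid resolution and data structures are illustrative). One (context, action, response) triple is kept per cell and overwritten as newer samples arrive, and training samples are drawn by picking cells uniformly.

```python
# Sketch of the cell-buffer scheme for nonuniformly distributed contexts (assumed, illustrative).
import numpy as np

rng = np.random.default_rng(5)
GRID = 20                                        # D = [-10, 10]^2 divided into GRID x GRID cells
buffer = {}                                      # cell index -> most recent (x, action, beta)

def cell_index(x):
    i = min(int((x[0] + 10) / 20 * GRID), GRID - 1)
    j = min(int((x[1] + 10) / 20 * GRID), GRID - 1)
    return i, j

def store(x, action, beta):
    """Keep only the most recent sample observed in each cell."""
    buffer[cell_index(x)] = (x, action, beta)

def draw_training_sample():
    """Once every cell is filled, pick a cell uniformly and return its stored sample."""
    if len(buffer) < GRID * GRID:
        return None
    key = list(buffer.keys())[rng.integers(len(buffer))]
    return buffer[key]

x = rng.uniform(-10, 10, size=2)
store(x, action=1, beta=1)
print(cell_index(x), draw_training_sample())
```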

Fig. 7(a) shows a rectangular region D_4 in D defined by the vertices (−10, 5), (−5, 5), (−5, −10), and (−10, −10). Ninety percent of the context vectors occur in D_4 and the remaining 10% in its complement D_4^c. The reward probabilities are (0.9, 0.1). After 50 000 trials, out of the three regions D_1, D_2, and D_3, only D_1 is recognized by the unmodified method (Fig. 7(b)). In contrast to this, the procedure outlined in this subsection yields more accurate information concerning D_1, D_2, and D_3, as seen in Fig. 7(c).

Example 4 (Decentralized Decision Making)

In this example two decision makers, A and B, are involved in a cooperative game. A has two possible strategies {α_1, α_2}, while B has strategies {γ_1, γ_2}. The context space is X = R² and the region of interest, D, is the 20 × 20 square about the origin. The region D can be expressed as D = D_1 ∪ D_2 ∪ D_3 ∪ D_4, where D_1, D_2, D_3, and D_4 are disjoint as indicated in Fig. 8(a). The game matrix corresponding to the region D_i (i = 1, 2, 3, 4) is denoted by G_i. The four game matrices are given as

    G_1 = [ 0.9 0.1 ]      G_2 = [ 0.1 0.9 ]
          [ 0.1 0.1 ]            [ 0.1 0.1 ]

    G_3 = [ 0.1 0.1 ]      G_4 = [ 0.1 0.1 ]
          [ 0.9 0.1 ]            [ 0.1 0.9 ]

Here, rows correspond to actions of A, columns correspond to actions of B, and G_ij = Prob[β(n) = 1 | α(n) = α_i, γ(n) = γ_j]. Hence, in any of the regions D_i the optimal pair yields a success with a probability of 0.9, while all other combinations yield a success with probability 0.1.

Fig. 7. Performance of method 3 with nonuniform distribution of context vectors. (a) The region D_4 where 90% of the context vectors occur. (b) Performance of unmodified method 3. (c) Performance with the modification suggested in Example 3 (Section VI).

Fig. 8. (a) The regions D_i in Example 4. (b) Performance of the decentralized decision makers.
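The context-dependent game environment of this example can be sketched as follows. This is assumed code: since the geometry of the regions D_i is given only in Fig. 8(a), which is not reproduced here, a quadrant layout is used purely as a placeholder.

```python
# Sketch of the Example 4 game environment (assumed: quadrants stand in for the regions D_i).
import numpy as np

rng = np.random.default_rng(6)

G = {1: np.array([[0.9, 0.1], [0.1, 0.1]]),      # optimal play (alpha_1, gamma_1)
     2: np.array([[0.1, 0.9], [0.1, 0.1]]),      # optimal play (alpha_1, gamma_2)
     3: np.array([[0.1, 0.1], [0.9, 0.1]]),      # optimal play (alpha_2, gamma_1)
     4: np.array([[0.1, 0.1], [0.1, 0.9]])}      # optimal play (alpha_2, gamma_2)

def region(x):
    """Placeholder geometry: one region per quadrant of the 20 x 20 square (assumed)."""
    return int(1 + (x[0] >= 0) + 2 * (x[1] >= 0))

def play(x, i, j):
    """Response beta for the play (alpha_i, gamma_j) in context x; rows: A's action, columns: B's."""
    return int(rng.random() < G[region(x)][i - 1, j - 1])

x = rng.uniform(-10, 10, size=2)
print(region(x), play(x, 1, 1))
```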

Each decision maker, unaware of the strategies being used by the other decision maker, uses one of the methods discussed in Section III to update the strategies independently. The neural network used in all cases is of the same size as in Example 2. The context vectors are generated by sampling a uniform distribution over the region of interest D.

Simulation results are given in Fig. 8(b) for the case when they use method 3. The decentralized decision makers are seen to perform almost perfectly after 30 000 iterations. The figure shows the action pair being selected by the two decision makers with higher probabilities throughout the context space. The symbols ||, |-, -|, and = are used to represent (α_1, γ_1), (α_1, γ_2), (α_2, γ_1), and (α_2, γ_2), respectively.

VII. EXTENSIONS AND COMMENTS

We consider briefly in this section the relative advantages of the methods suggested and how the results can be extended to more general cases, involving more than two actions in problem 1 and more than two decision makers in problem 2.

A. Comparison of Methods Discussed in Section III

Method 1 is perhaps the simplest to apply, but the discriminant function does not converge in any statistical sense with a fixed step size. Further, unlike methods 2 and 3, method 1 cannot be extended in a straightforward manner to cases where there are multiple actions. Method 2 is essentially an off-line procedure, since the neural network is used to generalize the results provided by the learning automata. The performance of this approach depends both on the number of learning automata used and on the number of trials before the neural network is introduced. The number of automata required increases exponentially with the dimension of the context space, and some prior knowledge is required regarding the magnitude of the regions D_i to choose a proper size of the cells. Method 3 appears to be the only scheme which is particularly suited for on-line learning. Convergence, in this case, is satisfactory (in simulation studies) only when a reward-penalty (rather than a reward-inaction) scheme is used.

B. Multiple Actions

The methods suggested in Section III can be suitably modified for use in situations where the action set contains more than two elements. However, only the second and third methods can be extended in a straightforward manner, as described below.

Let the action set of a typical automaton contain r elements. If method 2 is used, r - 1 networks, N_1(·), N_2(·), ..., N_{r-1}(·), can be trained to approximate r - 1 of the action probabilities. The probability of the last action is then 1 - Σ_{i=1}^{r-1} N_i(x).
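The bookkeeping for these probabilities might be sketched as follows; the clipping and renormalization are practical safeguards added here and are not part of method 2 as stated.

```python
import numpy as np

def action_probabilities(networks, x):
    """Combine r-1 network outputs N_1(x), ..., N_{r-1}(x) into r probabilities.

    Each element of `networks` is assumed to be a callable returning a value
    in [0, 1]; the last action receives whatever probability mass remains.
    """
    p = np.array([N(x) for N in networks], dtype=float)
    p = np.clip(p, 0.0, 1.0)
    p = np.append(p, max(0.0, 1.0 - p.sum()))   # p_r = 1 - sum of the others
    return p / p.sum()                          # guard against rounding drift
```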

As in method 2, r - 1 networks are used in method 3, but the process is considerably more direct. The normalized outputs of the networks N_1(·), N_2(·), ..., N_{r-1}(·) correspond to the probabilities p_1, p_2, ..., p_{r-1} with which the actions α_1, α_2, ..., α_{r-1} (and hence α_r) are chosen. As in the case of learning automata, when a reward-penalty scheme is used, if an action α_i is chosen and results in a success (failure), the parameters of N_i(·) are adjusted to increase (decrease) N_i(x), and the parameters of N_j(·) (j ≠ i) are adjusted to change N_j(x) in the opposite direction.
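A possible sketch of this multi-action reward-penalty step is given below. The single sigmoid unit standing in for each N_i(·), the step size, and the helper names are illustrative assumptions (the paper uses multilayer networks); what the sketch retains from the text is the direction of the update: the chosen action's network output at x is increased on a success and decreased on a failure, while the remaining networks are moved the opposite way.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class UnitNet:
    """Stand-in for N_i(.): one sigmoid unit on the context vector (assumed)."""
    def __init__(self, dim):
        self.w = 0.1 * rng.standard_normal(dim)
        self.b = 0.0

    def __call__(self, x):
        return sigmoid(self.w @ x + self.b)

    def adjust(self, x, direction, step=0.05):
        # Gradient step that raises (direction=+1) or lowers (direction=-1) N(x).
        y = self(x)
        g = y * (1.0 - y)                  # derivative of the sigmoid
        self.w += step * direction * g * x
        self.b += step * direction * g

def choose_action(nets, x):
    """Select one of r actions from the normalized outputs of the r-1 networks."""
    raw = np.array([N(x) for N in nets])
    p = np.append(raw, max(1e-6, 1.0 - raw.sum()))
    p = np.clip(p, 1e-6, None)
    p /= p.sum()
    return rng.choice(len(p), p=p)

def reward_penalty_update(nets, x, chosen, beta):
    """On a success (beta=1), raise the chosen action's network output at x and
    lower the others; on a failure, do the reverse. The last action (index r-1)
    has no network of its own, so choosing it moves all r-1 networks down on a
    success (raising its residual probability) and up on a failure."""
    direction = 1 if beta == 1 else -1
    for i, net in enumerate(nets):
        net.adjust(x, direction if i == chosen else -direction)

# Example with r = 3 actions (two networks) and a two-dimensional context:
nets = [UnitNet(2), UnitNet(2)]
x = rng.uniform(-10, 10, size=2)
a = choose_action(nets, x)
reward_penalty_update(nets, x, a, beta=1)
```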

C. Multiple Decision Makers

In the statement of problem 2, it was assumed that only two decision makers are involved in the decentralized decision making and that each has only two actions. The method outlined in Section V can be extended to multiple decision makers, each having more than two actions. It is well known in learning automata theory [1] that when multiple decision makers, each having several actions, operate in a random environment using an absolutely expedient scheme, they converge to actions corresponding to one of the equilibrium states with a probability arbitrarily close to 1. The same can also be expected when multiple decision makers use neural networks to make the decisions as described in Section V.

VIII. CONCLUSIONS

The paper discusses decision making in a context space when the responses to the actions chosen in a specific context are random. The objective is to determine the optimal action α(x) corresponding to the context x. Since decisions have to be made throughout the context space on the basis of a countable number of experiments, generalization is inevitable. Naturally, many different approaches can be followed to generate the desired discriminant function. The paper discusses three different methods which use neural networks and compares their relative merits. In the third method, which is the most general, the output of the network determines the probability with which one of the actions (α_1) is to be chosen. The weights of the neural network are updated on the basis of the actions chosen and the response of the environment. The extension of similar concepts to decentralized decision making in a context space is also introduced in Section V. Simulation results are included for all the methods analyzed in Section IV. In particular, modifications in the implementation of method 3 that make it practically viable are also presented. The various simulations reveal that all the methods suggested are feasible and that the choice of a specific method will depend on the accuracy desired as well as on the available computational power. Method 3 is to be preferred to the other two methods on the basis of accuracy, as well as of the memory required, when the context space is of high dimension and the corresponding environments have a high degree of randomness.

ACKNOWLEDGMENT

The authors would like to thank the reviewers and the associate editor for their careful reading of the paper and their helpful comments.

REFERENCES

[1] K. S. Narendra and M. A. L. Thathachar, Learning Automata: An Introduction. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[2] N. Baba, "New topics in learning automata theory and applications," Lecture Notes in Control and Information Sciences, vol. 71. Berlin: Springer-Verlag, 1984.
[3] O. V. Nedzelnitsky, Jr., and K. S. Narendra, "Nonstationary models of learning automata routing in data-communication networks," IEEE Trans. Syst., Man, Cybern., vol. SMC-17, pp. 1004-1015, 1987.
[4] R. M. Wheeler, Jr., and K. S. Narendra, "Decentralized learning in finite Markov chains," IEEE Trans. Automat. Contr., vol. AC-31, pp. 519-526, 1986.
[5] R. Viswanathan and K. S. Narendra, "Stochastic automata models with application to learning systems," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, pp. 107-111, 1973.
[6] T. J. Sejnowski and C. R. Rosenberg, "Parallel networks that learn to pronounce English text," Complex Syst., vol. 1, pp. 145-168, 1987.
[7] R. P. Gorman and T. J. Sejnowski, "Learned classification of sonar targets using a massively parallel network," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 7, pp. 1135-1140, 1988.
[8] D. J. Burr, "Experiments on neural net recognition of spoken and written text," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 7, pp. 1162-1168, 1988.
[9] B. Widrow, R. G. Winter, and R. A. Baxter, "Layered neural nets for pattern recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 7, pp. 1109-1118, 1988.
[10] K. S. Narendra and R. M. Wheeler, Jr., "An N-player sequential stochastic game with identical payoffs," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, pp. 1154-1158, 1983.
[11] M. A. L. Thathachar and K. R. Ramakrishnan, "A cooperative game of a pair of learning automata," Automatica, vol. 20, pp. 797-801, 1984.
[12] A. G. Barto, "Learning by statistical cooperation of self-interested neuron-like computing elements," COINS Tech. Rep. 81-11, University of Massachusetts, Amherst, 1985.
[13] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Trans. Neural Networks, vol. 1, pp. 4-27, Mar. 1990.
[14] A. H. Kramer and A. Sangiovanni-Vincentelli, "Efficient parallel learning algorithms for neural networks," in Proc. IEEE Conf. Neural Information Processing Systems (Denver, CO), 1988, pp. 40-48.
[15] A. Hiramatsu, "ATM communications network control by neural networks," IEEE Trans. Neural Networks, vol. 1, pp. 122-131, Mar. 1990.

Kumpati S. Narendra (S'55-M'60-SM'63-F'79) received the Ph.D. degree from Harvard University in 1959. He is currently a Professor in the Electrical Engineering Department at Yale University, New Haven, CT, and Director of the Center for Systems Science there.

He is the author of numerous technical publications in the area of systems theory and of the books Frequency Domain Criteria for Absolute Stability (with J. H. Taylor, Academic Press), Stable Adaptive Systems (with A. M. Annaswamy), and Learning Automata: An Introduction (with M. A. L. Thathachar, Prentice-Hall). He is also the editor of three books and is currently editing a reprint volume entitled Recent Advances in Adaptive Control (coeditors R. Ortega and P. Dorato, to be published by IEEE Press). His research interests are in the areas of stability theory, adaptive control, learning automata, and the control of complex systems using neural networks.

Dr. Narendra is a member of Sigma Xi and the American Mathematical Society and a Fellow of the American Association for the Advancement of Science and the IEE (U.K.). He was the recipient of the 1972 Franklin V. Taylor Award of the IEEE Systems, Man, and Cybernetics Society, the George S. Axelby Best Paper Award of the IEEE Control Systems Society in 1988, and the Education Award of the American Automatic Control Council in 1990.


Snehasis Mukhopadhyay was born near Calcutta, India, in 1964. He received the B.E. degree in electronics and telecommunications engineering from Jadavpur University, Calcutta, in 1985, the M.E. degree in systems science and automation from the Indian Institute of Science, Bangalore, in 1987, and the M.S. degree in electrical engineering from Yale University, New Haven, CT, in 1990. He is currently a graduate student pursuing the Ph.D. degree in electrical engineering at Yale University.

His research focuses on learning systems, including neural networks and learning automata, and the applications of hierarchical learning structures to the control of dynamical systems.