



354 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 9, NO. 3, MAY 1998

Recurrent Neural-Network Training by a Learning Automaton Approach for Trajectory Learning and Control System Design

Malur K. Sundareshan and Thomas A. Condarcure

Abstract—Training of recurrent neural networks by conventional error backpropagation methods introduces considerable computational complexities due to the need for gradient evaluations. In this paper we shall present a training approach using concepts from the theory of stochastic learning automata that eliminates the need for computation of gradients and hence affords a very simple implementation, particularly for implementation on low-end platforms such as personal computers. This approach also offers the flexibility of tailoring a number of specific training algorithms based on the selection of linear and nonlinear reinforcement rules for updating automaton action probabilities. The training efficiency is demonstrated by application to two complex temporal learning scenarios, viz. learning of time-dependent continuous trajectories and feedback controller designs for continuous dynamical plants. For the first problem, it is shown that training algorithms can be tailored following the present approach for a recurrent neural net to learn to generate a benchmark circular trajectory more accurately than possible with existing gradient-based training procedures. For the second problem, it is shown that recurrent neural-network-based feedback controllers can be trained for different control objectives (plant stabilization, output tracking) even when the plant being controlled has complex nonlinear dynamics, and that these controllers possess excellent robustness characteristics with respect to variations in the initial states of the dynamical plant being controlled.

Index Terms—Control system design, learning systems, neural-network learning, neurocontrollers, recurrent neural networks, stochastic learning automata, trajectory learning.

I. INTRODUCTION

NEURAL networks with recurrent connections and dynamical processing elements (neurons) are finding increasing applications in diverse areas. While past efforts at designing such networks have mainly focussed on the steady-state fixed point behavior for such applications as associative memory [1], [2] and optimization [3], [4], the capabilities offered by these networks for dynamically processing temporal sequences have begun to be increasingly appreciated. Such capabilities render these networks highly valuable for application to complex dynamical problems such as recognition of continuous time-dependent signals (speech, for instance) and adaptive control of dynamical systems.

Manuscript received April 6, 1994; revised July 12, 1996, June 22, 1997, and November 27, 1997.

The authors are with the Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85721-0104 USA.

Publisher Item Identifier S 1045-9227(98)02775-1.

Despite the great potential dynamic recurrent neural networks hold, a major problem in a successful deployment of these networks in practice (for instance, for control applications) is the complexity in training. Earlier procedures, suggested by Pineda [5], Almeida [6], and Werbos [45], attempt to extend the backpropagation approach that has proved very popular for static feedforward networks [7]. The implementation of the required updating equations for achieving desired dynamical behavior can however be computationally very demanding. For overcoming the computational requirements and ensuring a relatively manageable implementation, one is usually forced to resort to simplifying approximations, such as coarser gradient evaluations, heuristic selection of high gains in the activation functions [8], [39], and the like. The importance of developing alternate training procedures which bypass the need for the computation of gradients is clearly evident. The reduction in computational complexity is particularly appealing if the training algorithm is to be implemented on low-end platforms such as personal computers. This paper discusses a different approach for designing training schemes for a class of dynamic recurrent neural networks based on the theory of stochastic learning automata.

Research on learning automata dates back to the early work of Tsetlin [9] in the 1960's and has been developed since then by a number of others in various contexts ([10] and [11] provide an excellent treatise on these developments). A popularly investigated version of these ideas, known as reinforcement learning, has been used in studies of animal learning [12] and in early learning control work [13]. Reinforcement learning is based on the intuitive idea that if an action on a system results in an improvement in the state of the system (as measured by a precisely stated performance measure), then the probability of repeating that action is strengthened. More recently, the role of reinforcement learning in the training of neural networks has been studied by Barto et al. [14], [15], particularly in the context of learning control signals for complex dynamical plants [15]–[17]. Nevertheless, the rather appealing notions of reinforcement learning, and more generally learning automata, do not seem to have received the attention they deserve in neural-network training. It is the objective of this paper to demonstrate the feasibility of learning automata-based approaches as alternatives to gradient methods in the training of dynamic recurrent neural networks and further to demonstrate their performance in certain highly



complex learning scenarios such as those arising in the learning of continuous-time trajectories and in the design of control systems for nonlinear dynamical plants.

In comparison with more popularly used neural-network training procedures, reinforcement learning falls between supervised learning methods such as backpropagation [7], which require a detailed and precise knowledge of the desired state of neurons, and unsupervised learning methods [44], such as competitive learning, which do not require a knowledge of external signals for the evolution of the learning process. In order to guide the training in the proper direction, reinforcement learning methods use a scalar value or an indicator function that indicates whether the chosen inputs (or actions) have steered the network in the direction of accomplishing the learning task. In particular, reinforcement learning does not require the network to generate close approximations to the desired detailed input–output mappings, as in the case of supervised learning schemes (which in turn is facilitated by the computation of error gradients), and it is this feature that is fundamental to obtaining the required training simplicity, especially for dynamic recurrent neural networks.

The power of dynamical neural networks is manifested perhaps most decisively in the learning of continuous-time trajectories. Network training to learn such trajectories has received some attention in the recent past with the investigation of schemes which use various forms of gradient descent algorithms. These include the real-time recurrent learning (RTRL) scheme of Williams and Zipser [18], the method of directed derivatives of Pearlmutter [19], and the method of adjoint operators of Toomarian and Barhen [20]. These researchers have shown that a dynamical network can indeed be trained to exhibit desired limit cycle behavior (it may be noted that this behavior is not possible to emulate in a static feedforward network) and have demonstrated the success of their training algorithms by application to the problem of learning certain benchmark trajectories (for example, a circular trajectory of a specified radius). While the closeness with which the desired trajectory could be generated varied from one algorithm to another, the required computation of gradients and other implementation considerations for error backpropagation impose considerable burden (in fact, the methods differ from one another mainly in the specific procedure employed for implementing the required gradient computations). Development of alternate training procedures that do not require such computations is hence an issue of considerable interest, which will be addressed in this paper.

Another application area in which the capabilities of dynamical neural networks are increasingly being exploited in recent times is the control of complex nonlinear dynamical systems. While the control of dynamical systems using neural networks is in itself not new and dates back to the early sixties [21], a popular avenue followed in the later work reported in the literature uses in some form or another a multilayer feedforward network trained by the error backpropagation approach for identifying a dynamical mapping function (plant dynamics or the inverse plant dynamics, for instance) in order to implement the required control. It must be noted that unlike the earlier discussed application, viz. learning to

generate a continuous-time trajectory, for the identification of a dynamical plant for controlling it adaptively it is not necessary to employ a recurrent network, and adequate approximations to the input–output mapping function can be obtained from training static multilayer neural nets with only feedforward connections. However, as recent research has conclusively demonstrated [39]–[42], use of dynamic recurrent networks can provide significant benefits in this application as well. Use of multilayer networks with dynamical nodes and recurrent connections for the self-tuning adaptive control of complex dynamical processes (as demonstrated by applications to the design of adaptive controllers for multijointed robotic manipulators and synchronous power generators) is described in [24] and [25]. The training scheme employed in these works is error backpropagation, which once again has the potential to result in computational complexities. For facilitating implementation of adaptive control, the design algorithms in [24] and [25] suggest the use of a priori selected arbitrarily high values for the activation function parameters. Development of alternate training procedures which do not require such a priori selections of network parameters is hence of considerable utility in this application area as well, and this paper attempts to make a contribution in this regard.

The primary focus of this paper is on the demonstration of the feasibility of a learning automaton approach for the training of dynamic recurrent neural networks. For the sake of illustrating the details, we shall consider a specific neural-network structure which has received considerable attention in the literature and is described by the set of coupled nonlinear differential equations

\tau_i \frac{dx_i(t)}{dt} = -x_i(t) + \sum_{j=1}^{N} w_{ij}\, g(\gamma_j x_j(t)), \quad i = 1, 2, \ldots, N \qquad (1a)

where x_i(t) denotes the state of the ith neuron, τ_i is a time constant referred to as the relaxation time, γ_i is a parameter that controls the slope of the sigmoidal activation function g(·), and w_ij denotes the interconnection weight from the jth neuron to the ith neuron. The inputs to this network come from the initial conditions x_i(0) and the outputs are the observations of the behavior of the state trajectories x_i(t) for t ≥ 0. The task of training this network to serve as a useful computational device involves the implementation of an algorithm for progressively updating the parameters w_ij, τ_i, and γ_i such that when the training is completed, the network trajectories starting from any initial states x_i(0) behave in a prescribed manner to perform the desired computation.

In several practical problems, observing only a subset of the state variables may be of particular importance for checking whether the goals of the desired computation are met, and consequently designation of a set O of output neurons (which is a subset of the total set of N neurons) may be appropriate. Also, in certain problems where the input–output mapping behavior of the neural network is of interest, the use of externally applied time-dependent forcing signals to alter the activation dynamics of one or more neurons may be necessary.


Fig. 1. General architecture of an N-node recurrent neural network with m external inputs.

In order to be able to handle such problems, the dynamical framework can be expanded to permit the introduction of external inputs I_k(t), k = 1, 2, ..., m, by modifying the dynamical equation (1a) into

\tau_i \frac{dx_i(t)}{dt} = -x_i(t) + \sum_{j=1}^{N} w_{ij}\, g(\gamma_j x_j(t)) + \sum_{k=1}^{m} \tilde{w}_{ik} I_k(t), \quad i = 1, 2, \ldots, N \qquad (1b)

The weights w̃_ik, some of which could be zero, serve to fan out the input signals¹ I_k(t) into the individual nodes of the network. The number of weight parameters that need to be trained increases in this case to N² + Nm. Evidently, for m = 0, (1b) reduces to (1a). In this paper, for control system design applications, where external reference commands and feedback from the plant being controlled are used as inputs to the neural network, we will employ the more general network architecture described by the dynamical equation (1b), whereas for trajectory learning problems we will use the architecture described by (1a) since no external inputs are needed in this case. A schematic of the general recurrent network architecture described by (1b) is shown in Fig. 1.

¹ To conform this architecture to the more familiar multilayer configurations, the input signals I_k can be considered as the input nodes of the network. These nodes, however, are different from the N dynamical nodes in that they do not have recurrent or feedback connections, but connect to the N dynamical nodes only in the feedforward direction through the weights w̃_ik.
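For concreteness, the following minimal Python sketch integrates the dynamics (1a) and (1b) with a fourth-order Runge–Kutta step. It is only an illustration under stated assumptions: the sigmoid is taken as g(γx) = tanh(γx), and the step size, network size, and parameter values are placeholders rather than the settings used in the experiments reported later.

    import numpy as np

    def network_rhs(x, W, tau, gamma, W_in=None, I=None):
        # Right-hand side of (1b): tau_i dx_i/dt = -x_i + sum_j w_ij g(gamma_j x_j)
        # + sum_k w~_ik I_k, with g assumed here to be tanh.
        drive = W @ np.tanh(gamma * x)
        if W_in is not None and I is not None:
            drive = drive + W_in @ I
        return (-x + drive) / tau

    def rk4_step(x, h, W, tau, gamma, W_in=None, I=None):
        # One fourth-order Runge-Kutta step of the network dynamics.
        k1 = network_rhs(x, W, tau, gamma, W_in, I)
        k2 = network_rhs(x + 0.5 * h * k1, W, tau, gamma, W_in, I)
        k3 = network_rhs(x + 0.5 * h * k2, W, tau, gamma, W_in, I)
        k4 = network_rhs(x + h * k3, W, tau, gamma, W_in, I)
        return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

    # Example: a six-node network with no external inputs, i.e., the form (1a).
    N = 6
    rng = np.random.default_rng(0)
    W = np.zeros((N, N))                 # interconnection weights w_ij
    tau = 6.0 * np.ones(N)               # relaxation times tau_i
    gamma = 10.0 * np.ones(N)            # activation gains gamma_i
    x = 0.01 * rng.standard_normal(N)    # small random initial states
    for _ in range(1000):
        x = rk4_step(x, 0.01, W, tau, gamma)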

An outline of the development of the paper is as follows. In Section II, we shall give the details of the training procedure by first stating a few key concepts from learning automata theory and then introducing the incremental learning and teacher forcing concepts that we will use as additional constructs for implementing the reinforcement learning rules. A demonstration of the training performance will be given in Sections III and IV by applying the present procedures in two distinct application scenarios—one for learning a benchmark continuous-time trajectory (a circle of specified radius) and the other for designing controllers with different control objectives for complex plants. While it is shown that the performance of the trained neural network is highly satisfactory in these applications (i.e., in the degree of closeness with which the desired trajectory is generated and in the degree of robustness of the controller to variations in the initial states of the nonlinear plant being controlled), a principal attractive feature of the present training schemes is the implementational simplicity compared with existing gradient-based training procedures.

II. DEVELOPMENT OF TRAINING PROCEDURE

A. Basic Updating Rules for Learning Automata

A learning automaton interacts adaptively with the environment it is operating in and updates its actions at each stage based on the response of the environment to these actions [10],


[11]. Hence an automaton can be defined by the triple (α, β, A), where α = {α_1, α_2, ..., α_r} denotes the set of actions available to the automaton at any stage, β is the set of observed responses from the environment, which are used by the automaton as inputs, and A is an updating algorithm which the automaton uses for selecting a particular action from the set α at any stage. In the present context of neural-network training, a specific action at any stage corresponds to the updating of the values of one or more parameters of the network.

For a stochastic learning automaton, the updating algorithm specifies a rule for adjusting the probability p_i(n) of choosing a particular action α_i at stage n. Such a rule may be generally described by a functional relation of the form

p(n+1) = T(p(n), \alpha(n), \beta(n)) \qquad (2)

The learning procedure at each stage hence consists of two sequential steps. In the first step the automaton chooses a specific action from the finite set of actions available, and in the second step, the probabilities of choosing the actions are updated depending on the response of the environment to the action in the first step, which influences the choice of future actions.

An alternative way of specifying the updating algorithm is to define a state vector for the automaton and consider the transition of the state due to a certain action, which enables one to state the updating rule in terms of state transition probabilities. This approach has been quite popular in the development of learning automaton theory [26]. For our application to neural-network training, however, the action probability updating approach, with the updating algorithms specified in the form of (2), provides a simpler and more convenient framework.

For execution of training, the feedback signal from the environment, which triggers the updating of the action probabilities by the automaton, can be given by specifying an appropriate "error" function. The environmental response set β at any stage can then be selected as the binary set {0, 1}, with β = 1 indicating that the selected action is not considered satisfactory by the environment and β = 0 indicating that the action selected is considered satisfactory.²

For a stochastic automaton with r available actions (i.e., α = {α_1, α_2, ..., α_r}), the updating rules can then be specified in a general form as follows. For the selected action at the nth stage, α(n) = α_i, if β(n) = 0, then

p_i(n+1) = p_i(n) + \sum_{j \neq i} g_j(p(n))

p_j(n+1) = p_j(n) - g_j(p(n)), \quad \text{for } j \neq i \qquad (3)

² In the literature on learning automata [10], [11], this case of β allowed to take two distinct values only is referred to as the P-model. More general models where β can take a number of values within an interval have also been discussed.

whereas if β(n) = 1, then

p_i(n+1) = p_i(n) - h_i(p(n))

p_j(n+1) = p_j(n) + \frac{1}{r-1}\, h_i(p(n)), \quad \text{for } j \neq i \qquad (4)

The functions g_j(·) and h_i(·) are appropriately selected continuous-valued nonnegative functions. The summation in (3) and the division by (r − 1) in (4) are to ensure preservation of probability measure (i.e., the sum of the probabilities at stage n + 1 equals one).

The two sets of (3) and (4) specify a reinforcement learning algorithm. By tailoring the functions g_j(·) and h_i(·), an appropriate degree of reinforcement in the selection of a particular action can be introduced. A scheme where both sets of equations are employed together is termed a reward-penalty reinforcement scheme. It is evident that in this scheme an action that is judged favorable is rewarded by having its probability of selection increased while an unfavorable action is penalized by having its probability of selection decreased. Another reinforcement scheme, termed the reward-inaction scheme, employs the updating only for β = 0, whereas for β = 1 the action probabilities are maintained at the same values as before. These schemes and several other variations of them have been discussed in the literature [10], [11]. Due to the stochastic nature of the framework, however, very few analytical results can be developed for these schemes and studies directed to the evaluation of performance (such as convergence, asymptotic behavior, etc.) typically employ simulation experiments.

It should be emphasized that (3) and (4) describe a general framework for tailoring a variety of specific training algorithms useful in specific applications by selecting g_j(·) and h_i(·) appropriately as linear or nonlinear functions. In fact, a number of heuristic algorithms where g_j(·) and h_i(·) may not have an analytical form can also be considered for realizing improved speed and accuracy in training. In certain applications of neural-network training such constructions motivated by intuitive reasoning may indeed prove to be more efficient. An illustrative example of this will be demonstrated in the next section for application to the trajectory learning problem.
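As an illustration of how (3) and (4) translate into an implementation, the sketch below carries out one stage of the automaton: an action is drawn according to the current probability vector, and the vector is then updated with user-supplied reward and penalty functions. The function names and the linear choices at the end are assumptions made for illustration; passing a penalty function that returns zeros yields the reward-inaction scheme.

    import numpy as np

    def automaton_stage(p, beta, i, g, h):
        # One probability update in the form of (3)-(4).
        #   p    : current action probabilities, p[j] = p_j(n)
        #   beta : environment response (0 = favorable, 1 = unfavorable)
        #   i    : index of the action selected at stage n
        #   g, h : nonnegative reward/penalty functions of the probability vector
        r = len(p)
        p_new = p.copy()
        if beta == 0:                                    # favorable action: rule (3)
            gj = g(p)
            p_new -= gj                                  # decrease p_j for j != i
            p_new[i] = p[i] + np.sum(np.delete(gj, i))   # p_i absorbs the removed mass
        else:                                            # unfavorable action: rule (4)
            hi = h(p)[i]
            p_new[i] = p[i] - hi                         # decrease p_i
            p_new[np.arange(r) != i] += hi / (r - 1)     # spread the mass over the others
        return p_new

    # Linear choices give a linear reward-penalty scheme (cf. (11)-(12) below).
    a, b = 0.02, 0.01
    g_lin = lambda p: a * p
    h_lin = lambda p: b * p

    rng = np.random.default_rng(1)
    p = np.ones(4) / 4                         # uniform initial action probabilities
    i = rng.choice(len(p), p=p)                # first step: choose an action
    p = automaton_stage(p, beta=0, i=i, g=g_lin, h=h_lin)   # second step: update
    assert abs(p.sum() - 1.0) < 1e-12          # probability measure is preserved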

B. Neural-Network Training

A principal advantage of the learning automaton approach is its ability to determine optimal actions among a set of possible actions, and this is particularly useful in neural-network training where a number of possible actions exist. For training the neural network described by (1a), we will employ the learning configuration schematically shown in Fig. 2. The automaton actions are defined as either an increment or a decrement to any of the network parameters w_ij, τ_i, and γ_i. For an N-neuron network, this corresponds to a set of single-parameter updating actions defined over the N² + 2N adjustable parameters. Multiple-parameter actions can also be considered, with the number of possible actions in this case increasing to (N² + 2N)!
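The single-parameter action set can be enumerated directly from the parameter list of (1a), as the sketch below indicates; here each action simply names the parameter to be perturbed, with the sign and size of the perturbation left to the trial procedure described in Section III.

    def build_action_set(N):
        # One candidate action per adjustable parameter of an N-node network (1a):
        # the N*N weights w_ij, the N time constants tau_i, and the N gains gamma_i.
        actions = [("w", (i, j)) for i in range(N) for j in range(N)]
        actions += [("tau", i) for i in range(N)]
        actions += [("gamma", i) for i in range(N)]
        return actions

    actions = build_action_set(6)     # 36 + 6 + 6 = 48 parameters for a six-node network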

Fig. 2. Learning configuration.

The environment for this learning configuration comprises the neural network itself together with an appropriately specified error functional defined over the time interval [t_0, t_f] by

E = \int_{t_0}^{t_f} \sum_{i \in O} f(v_i(t), v_i^d(t))\, dt \qquad (5)

where O denotes the set of designated output nodes of the network and v_i^d(t), i ∈ O, denote the desired output signals. For trajectory learning problems and control system design problems that are of interest to us in this paper, appropriate error functionals can be readily specified. The feedback signal β to the automaton can be defined as

\beta = \begin{cases} 0 & \text{for an action that reduces the error} \\ 1 & \text{for an action that does not reduce the error} \end{cases} \qquad (6)

The function f(·, ·) in (5) can be specified in various ways; for the examples that will be discussed in the later sections, we employed³ f(v_i, v_i^d) = |v_i − v_i^d| to define the error functional E. Subsequent to the determination that an action has reduced the error, the corresponding changes to the neural-network parameters are retained. However, if the action increases the error, the corresponding parameter changes are not kept. Thus, the only modifications to the neural-network structure come from those actions that reduce the value of the specified error.
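A minimal sketch of how the error functional (5), with f(v_i, v_i^d) = |v_i − v_i^d|, and the binary response (6) can be evaluated from sampled trajectories is given below; the trapezoidal approximation of the integral and the array layout are assumptions made for illustration.

    import numpy as np

    def error_functional(t, v, v_des):
        # Error functional (5) with f(v, v_d) = |v - v_d|.
        #   t     : sample times over [t0, tf], shape (T,)
        #   v     : actual outputs of the designated output nodes, shape (T, |O|)
        #   v_des : desired output signals, shape (T, |O|)
        integrand = np.abs(v - v_des).sum(axis=1)    # sum over the output set O
        return np.trapz(integrand, t)                # approximate the time integral

    def feedback_signal(E_new, E_old):
        # Binary response (6): beta = 0 if the attempted action reduced the error.
        return 0 if E_new < E_old else 1

    # An action is retained only when feedback_signal(...) == 0; otherwise the
    # corresponding parameter change is undone before the next action is tried.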

A probability of selection is initially assigned to each action. Since no a priori knowledge generally exists as to which of the network parameters have the greatest influence in reducing the specified error, the entropy in learning is maximum at the beginning of training. Hence, a uniform distribution is used at the beginning for the action probabilities. As learning progresses, the probability associated with each action is changed. This probability determines the relative frequency with which a particular action will be selected. Thus, the more successful a particular action is at reducing the error, the more likely its selection will be in the future stages.

Any available prior knowledge on the qualitative behavior of the network being trained can be utilized in the process of

³ Since an evaluation of the gradient of E with respect to the neural-network parameters is not needed in our learning scheme, we have a greater degree of freedom in the selection of f(·, ·). In contrast, it may be noted that the conventional gradient-based learning approaches invariably use f(v_i, v_i^d) = (v_i − v_i^d)², mainly for simplicity in gradient evaluations.

initializing the training algorithm. The network described by (1a) is one whose dynamics and equilibrium behavior have been extensively studied in the past [1]–[4], [39] and the correlations of these results with the training performance can be exploited for the initial setting of parameter values. For illustration, some past results that underscore the role of high-gain sigmoidal nonlinearities in ensuring desirable stability properties for the network equilibria [2] and the observed correlation between selection of high gains and improvement in learning rates [8], [28] could be usefully employed in the initial selection of parameters for improving the efficiency of the training process.

In discussing the time behavior of the training process, two types of convergence come into the picture—convergence of the training error and convergence of the automaton to some optimal action. Convergence of the error is assured by the nature of the learning algorithm. Since changes to the neural-network structure come only from those actions that result in a reduction of the error, starting from any finite positive initial error, a monotonic decreasing sequence of positive real numbers is generated. This sequence is bounded and, from the monotone convergence theorem [27], is convergent.

Under certain conditions, the learning automaton will converge toward some optimal action depending on the type of reinforcement rule used. By associating with each action a penalty probability, it has been shown in the literature [11] that if the penalty probabilities are stationary, then the action probabilities will converge to an optimal action. In particular, for the linear reward-inaction scheme [i.e., for g_j(·) a linear function of the argument and h_i(·) ≡ 0 in the updating rules (3) and (4)], convergence is assured in this sense. It should however be noted that convergence of this type may not be desirable in the present context of neural-network training. The penalty probabilities are not known at the start of training and their distribution may not be stationary since the structure of the neural network is constantly changing during the training process. An action that may produce a favorable response at some point in the training process may not yield a favorable response at a later time. Furthermore, the gains γ_i and the time constants τ_i are constrained to be nonnegative and hence cannot be continually decremented to take on negative values. Therefore, convergence of the learning automaton to an optimal action is not desirable and will not occur when the reward-penalty reinforcement rules are used (since the probability of any action approaching one is not possible with this reinforcement scheme for a nonstationary environment [11]).

C. Additional Constructs to Improve Training Performance

1) Incremental Learning: When neural networks are trained in a supervised manner, there is a tendency for the training to proceed rapidly, reducing the value of the specified error for some time, until a point is reached where no further training becomes possible. This corresponds to the case when the training has proceeded to a local minimum. In the present context, this condition may be visualized by considering the error surface in a space (where the


axes correspond to the adjustable parameters of the network and the final dimension corresponds to the error function), which indicates that the error has been reduced with respect to these parameters but has fallen into an energy well, from which a recovery with the type of parameter changes already used is not possible. In the specific application to the trajectory learning problem, which is of particular interest in this paper, this situation corresponds to the neural network learning to generate a trajectory that reduces the error, but the generated trajectory not having the same shape as the desired trajectory.

Fig. 3. Target trajectories for incremental learning.

In order to reduce the occurrence of becoming trapped in a local minimum, some method of controlling the evolution of trajectories during learning could be used. A simple way of overcoming the problem is by a process of incremental learning, which generates a set of intermediate learning goals. Let y_0(t) denote the trajectory generated by the neural network at the start of training and y_d(t) denote the final desired trajectory. It is desired to establish L learning goals, where the absolute error between one goal and the next is small. This can be accomplished by defining a sequence of learning goals as

y_k(t) = y_0(t) + \frac{k}{L}\,[y_d(t) - y_0(t)] \qquad (7)

where k = 1, 2, ..., L. For illustration, suppose it is desired to train a dynamic

recurrent neural network of the form (1a) to output a desired trajectory y_d(t) (8). Let y_0(t) denote the initial trajectory output of the network for some initial set of parameters and initial states of neurons. An arbitrary number, say 100, of learning targets can be selected as

y_k(t) = y_0(t) + \frac{k}{100}\,[y_d(t) - y_0(t)], \quad \text{for } k = 1, 2, \ldots, 100 \qquad (9)

When the error with respect to the current target y_k falls within some predetermined error bound, the next learning target becomes y_{k+1}. Learning progresses through these increments until the final desired target is reached. Fig. 3 shows a succession of these desired trajectories which represent incremental targets.
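The construction of the incremental targets in (7) and (9) is a simple linear interpolation between the initial and the final trajectories, as the following sketch (with sampled trajectories stored as arrays) indicates.

    import numpy as np

    def incremental_targets(y0, yd, L=100):
        # Learning goals (9): y_k = y0 + (k/L)(yd - y0), k = 1, ..., L.
        #   y0 : trajectory generated by the untrained network (sampled array)
        #   yd : final desired trajectory (sampled array of the same shape)
        return [y0 + (k / L) * (yd - y0) for k in range(1, L + 1)]

    # Training steps through the goals in order, moving to goal k+1 only after the
    # error with respect to goal k has fallen below a predetermined bound.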

It may be noted that since the neural network being trained is characterized by nonlinear dynamics, the effort in moving from one incremental learning goal to another may not be uniform even when a uniform distance between these learning goals is implicit. This however is of no major consequence insofar as the overall learning performance is concerned, since the motivation for modifying the learning goal is to provide a mechanism for perturbation of the error function during the training process and the objectives of incremental learning are achieved when an appropriately large number of learning goals is selected for implementation.

2) Teacher Forcing: In training problems such as trajectory

learning, where the desired output is available at every instant of time during the training process, using an appropriate mechanism to directly feed this information to alter the activation dynamics of the neural network provides several benefits. This formalism, referred to as teacher forcing, has been used by several previous researchers [18], [20] in one form or another. The idea of including a teacher forcing signal in general supervised learning problems comes from the desire to supply additional instantaneous information from the teacher directly to the activation dynamics during the learning stage. The role of including this signal on the training performance can be understood from the analogy with the use of continuous feedback in reducing the error in closed-loop control systems. A temporal modulation of this signal as learning proceeds is often desirable so that the activation dynamics during learning progressively reduces to the activation dynamics during the recall stage.

In the present work, for improving the training performance of the learning automaton, a method of teacher forcing similar to the one suggested originally by Williams and Zipser [18] can be employed. In this scheme, the desired network output signals are used in the place of the actual network outputs when fed back into the network via the recurrent connections. The actual outputs are still used for computing the error in order to determine whether the automaton action at any stage is favorable or not. The teacher forcing drives the network outputs closer to the desired signals as training progresses and the network is trained at each stage as if it were already generating the correct signal. This seems to significantly speed up learning, particularly at the beginning stages.

Upon completion of successful training, i.e., when the error functional becomes zero, the teacher forcing will no longer exist and the network dynamics will revert to the usual dynamics described by (1a) or (1b). As pointed out by Toomarian and Barhen [20], there exist training scenarios (particularly arising in trajectory learning problems) where the error functional cannot be reduced to zero and consequently the activation dynamics of the neural network after training is completed, i.e., during the recall phase, will be different from that specified by (1a) or (1b). To avoid this discrepancy, at some point in the training process, when confidence in the shape of generated trajectories is developed, the teacher forcing is disabled and the learning is progressed with the actual outputs of the network. Alternately, a temporally modulated teacher forcing scheme [20] that progressively reduces the amount of teacher intervention during the training phase


can be employed; a simple mechanism for implementing such modulation is by multiplying the signal by a time-varying gain of the form (1 − E/K), where E is the measured error and K is an appropriately selected, sufficiently large number (a large value of K relative to the expected values of the error is recommended to prevent the gain from becoming negative).
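One plausible way of realizing teacher forcing inside a simulation loop is sketched below: the desired signals replace (or, for partial forcing, are blended with) the actual outputs of the designated output nodes in the recurrent feedback path, while the error is still computed on the actual outputs. The blending gain lam used here for temporal modulation is an assumption introduced for illustration, not the paper's exact mechanism.

    import numpy as np

    def forced_feedback(x, v_des, out_idx, lam=1.0):
        # Values fed back through the recurrent connections under teacher forcing.
        # lam = 1 gives full forcing (desired outputs substituted), lam = 0 disables it.
        x_fb = x.copy()
        x_fb[out_idx] = lam * v_des + (1.0 - lam) * x[out_idx]
        return x_fb

    # Small usage example with a six-node network and two output nodes.
    out_idx = np.array([0, 1])
    x_fb = forced_feedback(np.zeros(6), np.array([0.3, 0.4]), out_idx)

    # Inside the integration loop the recurrent drive would then use
    #     drive = W @ np.tanh(gamma * forced_feedback(x, v_des_t, out_idx, lam))
    # while the error functional is still evaluated on the actual outputs x[out_idx].
    # Reducing lam toward zero as training proceeds recovers the recall-phase
    # dynamics (1a) or (1b) when training is complete.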

III. LEARNING CONTINUOUS TRAJECTORIES

The performance of the new training approach described in the last section has been tested in the task of learning continuous trajectories. We shall give the results for a circle trajectory of specified radius, which has become a benchmark trajectory [19], [20] for comparison of training efficiency. A circle trajectory can be generated by a network with dynamics described by (1a) and with two designated output nodes, i.e., O = {1, 2}, generating the two sinusoidal signals v_1^d(t) = R sin ωt and v_2^d(t) = R cos ωt, with an arbitrarily selected frequency ω. In the present example, a circle of radius 0.5 (i.e., R = 0.5) is desired to be generated.

Simulation experiments were conducted using a fourth-order Runge–Kutta algorithm for studying the temporal dynamical behavior of the neural network. A time increment equal to a small fraction of T was selected as the integration step size, where T is approximately the period of the trajectory to be generated. For implementing the actions of the learning automaton, it is necessary to generate an output function which maps the stage number into a selection of the appropriate action to take in a probabilistic fashion. Since these action probabilities are unknown at the start of the experiment, they are initialized to a uniform distribution. Then, as the experiment progresses and successful actions are found, a discrete probability density function is built up, with the probability for a particular action being increased or decreased according to the specific reinforcement in the form of (3) or (4). As the density function is being generated, it is used for the selection of actions by an inverse distribution method. This is done by generating uniformly distributed random numbers (by a standard procedure such as the Lewis–Payne method [29]) and then summing the entries of the density function to create a distribution function until the running sum exceeds the generated random number. The action is then selected at the point where the sum of the densities first exceeds the uniform random number.
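The inverse distribution method described above reduces to the following few lines; numpy's default generator is used here in place of the Lewis–Payne generator mentioned in the text.

    import numpy as np

    def select_action(p, rng):
        # Draw a uniform random number u and return the first index at which the
        # running sum of the discrete densities exceeds u.
        u = rng.uniform()
        running = 0.0
        for k, pk in enumerate(p):
            running += pk
            if running > u:
                return k
        return len(p) - 1        # guard against round-off in the final sum

    rng = np.random.default_rng(7)
    p = np.full(48, 1.0 / 48)    # uniform over the 48 candidate parameter-updating actions
    action = select_action(p, rng)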

A six-node network (i.e., N = 6) with two nodes designated as the output nodes and with no externally applied inputs was trained to generate the desired circle trajectory. In order to better control the trajectory rise time, rather than try to force the network to generate the circle with an unknown rise time, a parameter λ was introduced to modify the desired outputs in the form

v_1^d(t) = 0.5\,(1 - e^{-\lambda t}) \sin \omega t, \qquad v_2^d(t) = 0.5\,(1 - e^{-\lambda t}) \cos \omega t \qquad (10)

For initializing the network, the weights w_ij were set to 0.0, the gains γ_i were set to 10.0, and the time constants τ_i were set to numbers randomly distributed around 6.0. The initial states x_i(0) of the neurons were chosen to be small random numbers centered around zero. Incremental learning was used with 100 intermediate learning targets established as discussed in Section II-C.

A brief explanation on the role of the parameter λ appears in order. Observe that with the selection of v_1^d(t) and v_2^d(t) as in (10), we have (1 − e^{−λt}) → 1 as t becomes progressively larger, and hence v_1^d(t) and v_2^d(t) approach the desired signals 0.5 sin ωt and 0.5 cos ωt, respectively, for any selection of λ > 0. However, by selection of a sufficiently large λ, a scaling of time can be achieved, thus accelerating the convergence to the desired final values. It may also be noted that the use of v_1^d(t) and v_2^d(t) as in (10) is motivated by our desire to generate the desired circle trajectory from the starting values of v_1(0) = 0 and v_2(0) = 0, which corresponds to a more challenging learning task than the case when the initial point is selected to lie on the desired circle. Selection of λ hence offers a mechanism for controlling the trajectory rise time, which is a highly desirable feature. In the experiments reported below, a representative value of λ was used.
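Assuming the exponential-rise form shown in (10), the desired outputs can be generated as follows; the values of ω and λ in this sketch are placeholders rather than the values used in the experiments.

    import numpy as np

    def circle_targets(t, radius=0.5, omega=2 * np.pi, lam=2.0):
        # Desired outputs in the form (10): the circle of radius 0.5 modulated by an
        # exponential rise factor, so that the targets start at zero and the rise
        # time is controlled by lam.
        rise = 1.0 - np.exp(-lam * t)
        return radius * rise * np.sin(omega * t), radius * rise * np.cos(omega * t)

    t = np.linspace(0.0, 4.0, 2001)     # four periods for omega = 2*pi
    v1d, v2d = circle_targets(t)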

To test the effects of selecting alternate reinforcement rules and parameter updating actions on the training performance, several experiments [43] were conducted. For the sake of brevity, only two illustrative cases will be described in the following.

Experiment 1: In this experiment, a simple linear reward-penalty reinforcement scheme obtained by defining g_j(·) and h_i(·) in (3) and (4) as linear functions was used. The reinforcement rules in this case take the following form.

For an automaton with r available actions, with the selected action at the nth stage being α(n) = α_i, if β(n) = 0, then

p_i(n+1) = p_i(n) + a \sum_{j \neq i} p_j(n)

and

p_j(n+1) = (1 - a)\, p_j(n), \quad j \neq i \qquad (11)

whereas if β(n) = 1, then

p_i(n+1) = (1 - b)\, p_i(n)

and

p_j(n+1) = p_j(n) + \frac{b}{r-1}\, p_i(n), \quad j \neq i \qquad (12)

In (11) and (12), a and b are constants that may be selected appropriately in the ranges 0 < a < 1 and 0 < b < 1. Also, from (11) it is evident that an action considered favorable will result in a reduction of the probabilities p_j (for j ≠ i) by a percentage a while increasing the probability p_i by an amount such that the sum of the probabilities at stage n + 1 is one. Similarly, when action α_i is unfavorable, the probability p_i is reduced by a percentage b while the remaining probabilities p_j (for j ≠ i) are correspondingly increased such that the sum of the probabilities remains at one, as reflected by the form of the updating rules in (12).
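Written out directly, the linear rules (11) and (12) are the special case of the general sketch in Section II-A obtained with g_j(p) = a p_j and h_i(p) = b p_i; a compact form is given below for reference.

    import numpy as np

    def linear_reward_penalty(p, i, beta, a=0.02, b=0.01):
        # Rules (11)-(12) for the action alpha_i chosen at stage n;
        # a is the reward parameter and b the penalty parameter (0 < a, b < 1).
        r = len(p)
        q = p.copy()
        if beta == 0:                          # favorable: rule (11)
            q = (1.0 - a) * p                  # p_j reduced by the fraction a, j != i
            q[i] = p[i] + a * (1.0 - p[i])     # p_i absorbs the removed probability mass
        else:                                  # unfavorable: rule (12)
            q[i] = (1.0 - b) * p[i]            # p_i reduced by the fraction b
            mask = np.arange(r) != i
            q[mask] = p[mask] + b * p[i] / (r - 1)
        return q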

For the numerical simulations we used the values a = 0.02 (corresponding to a 2% change in the case of a favorable action) and b = 0.01 (corresponding to a 1% change in the case of


an unfavorable action); these values were determined from experimentation to give good results.⁴ A single-parameter action (increment or decrement), defined as an incremental change to one network parameter that is continued until it is no longer successful for a given trial, was employed. The error functional discussed earlier [viz. (5) with f(v_i, v_i^d) = |v_i − v_i^d|] was used and it was required that the value of the error be reduced to 0.06 before moving from one learning goal to the next. Teacher forcing was used to help accelerate the learning process at the start and was disabled at the 50th learning increment, when the shape of the actual output trajectory was sufficiently close to the desired trajectory.

Fig. 4. (a) Automaton actions per learning increment. (b) Neural-network output trajectory in Experiment 1. (c) Trajectory generated in Experiment 2.

Fig. 4(a) depicts the parameter changes or actions that were attempted by the automaton for each learning increment. It may be noted that learning was very easy when teacher forcing was active, which agrees well with intuition. After the 50th step, when teacher forcing was disabled, learning became more difficult, as the network must meet the learning goals on its own. This continued until about step 82, when the automaton

⁴ In earlier work on learning automata [10], [11], it is observed that a certain degree of asymmetry between the reward and the penalty parameters results in general in a desirable training behavior, i.e., rewarding a favorable response more than penalizing an unfavorable response is generally preferable.

developed enough experience in making better selections. The results of this experiment with the network trained for four cycles (each cycle corresponding to one period of the sinusoidal waveforms) and then continued to run for another eight cycles are shown in Fig. 4(b), which clearly indicates the stability of the generated limit cycle. It may be noted that only the first three cycles, during which the trajectory evolves into the limit cycle, are distinguishable while the rest overlap.

Experiment 2: In this experiment the primary goal was to study the effects of allowing multiple-parameter actions, i.e., sets of parameters to be updated simultaneously. It is to be noted that since the neural network is nonlinear, the effect of changing more than one parameter at a time is not the same as the combined effect resulting from changing them one after another. For a six-node network (N = 6), the number of possible actions now increases to 48! (i.e., (N² + 2N)!). Consequently, to reduce the memory requirements, two options were exercised. The first is to limit the actions to those that update ten parameters or fewer at a time. The second is to store the successful actions in a repertoire for a preferential selection at the later stages. An action is added to the repertoire if it is used successfully to reduce the error, which is the reward. If an action in the repertoire does not


successfully reduce the error, it is penalized by being removed from the repertoire. Once all actions existing in the repertoire are used at any stage, new actions are selected randomly from the remaining set of available actions based on a uniform distribution. The following learning reinforcement is also used. When rewarded, the probability for an action in the repertoire stays at its previous value, whereas for a successful action not in the repertoire it is increased from its value in the uniform distribution to a higher value. When penalized, the action probability is reduced to its value in the uniform distribution.
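The repertoire bookkeeping described above can be summarized as follows; since the exact probability values assigned to repertoire actions are not quantified in the text, the boost factor in this sketch is an assumption.

    import numpy as np

    class ActionRepertoire:
        # Nonlinear reinforcement of Experiment 2: successful actions are stored for
        # preferential reuse, unsuccessful repertoire actions fall back to the
        # uniform baseline probability.
        def __init__(self, num_actions, boost=5.0):
            self.num_actions = num_actions
            self.baseline = 1.0 / num_actions
            self.boost = boost                 # assumed preference factor
            self.repertoire = set()

        def reward(self, action):
            # A successful action enters (or remains in) the repertoire.
            self.repertoire.add(action)

        def penalize(self, action):
            # An unsuccessful repertoire action is removed (back to baseline).
            self.repertoire.discard(action)

        def selection_probabilities(self):
            # Repertoire actions are preferred; the remaining actions share a
            # uniform probability, as described in the text.
            p = np.full(self.num_actions, self.baseline)
            for a in self.repertoire:
                p[a] = self.boost * self.baseline
            return p / p.sum()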

In the framework of the reinforcement rules discussed earlier, the present updating mechanism corresponds to a nonlinear reinforcement scheme, more general than the linear reinforcement rules used in Experiment 1. An analytical modeling of the updating rules is however more difficult to obtain in this case.

To provide a greater ease of implementation, in this experiment the activation gains γ_i were permanently set to the value 10.0 and learning was restricted to changes in the other parameters (w_ij and τ_i). Incremental learning was used as before with 100 learning steps. The result of this experiment is shown in Fig. 4(c), which indicates a substantial improvement in the achieved performance over the linear reinforcement, single-parameter action case considered in Experiment 1. As can be observed, the trajectory rise time is also significantly reduced in this case (to about 0.2 s) and the evolution into the final orbit is almost complete within half a cycle. Fig. 4 shows the results of the experiment with the network trained for four cycles and then continued to run for another eight cycles. The remarkable accuracy with which the recall cycles overlap is worthy of emphasis and this represents a level of performance significantly better than that provided by any of the existing training procedures.

As a further note, in the two experiments described above, the training took approximately 2500 attempted actions to reach the final learning goal. It must be emphasized that the computations required at each step are extremely simple (involving updating of probability vectors) and are almost negligible compared to the evaluation of gradients required by existing methods, which makes the present scheme more attractive to implement. Also, comparing the performance depicted in Fig. 4(b) and 4(c) with the best available result for the trajectory learning problem, that reported by Toomarian and Barhen [20], it may be noted that this level of accuracy in generating the circle trajectory could only be achieved in [20] when the learning was started with the initial values of the neuron states adjusted such that the initial point is already on the desired circle (specifically, case 3 in [20]). In contrast, in our case the learning was started with the initial states set at arbitrary small random values. It must also be noted that this level of performance was achieved even when the learning was restricted to only the weights w_ij and the time constants τ_i. It is conceivable that even better performance levels can be realized by permitting the activation gains γ_i also to be updated, although at the cost of increased memory requirements. This, together with the significantly reduced computational requirements compared to the conventional gradient-based algorithms, establishes the

learning automaton approach as a very attractive alternative to handle the problem of learning continuous trajectories.

Fig. 5. (a) Learning configuration for neural-net controller. (b) Control configuration with trained neural network.

IV. CONTROLLER DESIGN FOR NONLINEAR SYSTEMS

While the performance of the training scheme in trajectory learning applications has been quite satisfactory, its application to the design of controllers for complex nonlinear plants seems to be even more attractive. At the outset it must be observed that despite the significant advances in mathematical control theory during the past several decades, convincing and well-developed algorithms exist at present only for the control of linear systems, and corresponding systematic design procedures are under development for more general nonlinear systems [30], [31]. Recent work toward exploiting the model-independent framework afforded by neural-network-based methods offers attractive alternatives for the design of implementable controllers for real-world nonlinear systems. Nevertheless, the possibility of using simpler training procedures that do not require gradient computations is highly appealing in this application area as well.

In this section, we shall describe the performance of a recurrent neural network of the form described by (1b) and trained by the present learning automaton approach when used for the control of nonlinear systems with distinct control objectives. In this application, the network with its training mechanism is implemented in a negative feedback configuration shown in Fig. 5(a) and 5(b). The feedback signal, which


could be either the output or the state of the plant being controlled, is used as an external input to the neural network. A reference command, which specifies the control objective in the particular application, also serves as another input to the neural network. The output of the neural network is the control signal that drives the plant. As depicted in Fig. 5(a), during the training phase the neural network together with the plant being controlled forms the environment that provides evaluative feedback for choosing the automaton actions, whereas after completion of training, the trained neural network serves as the feedback controller that drives the plant to achieve the desired control objectives, as depicted in Fig. 5(b). It is evident that since the signal available for testing whether the objectives of the control problem are met is the neural-network output after being processed by the dynamical plant, this offers greater challenges compared to the trajectory learning scenario discussed in the last section. Furthermore, since the desired control signal that must be input to the plant for realizing the desired performance is not known a priori (and is in general a nonlinear function of the state of the plant), teacher forcing cannot be used in this application and the learning must proceed on its own right from the start.

As mentioned in Section I, concepts of reinforcement learning have been utilized in various forms in the past, particularly for learning control signals for dynamical plants. References listed in [13]–[17] and [34]–[37] are representative not only of the wide interest this topic has received over the years, but also of the richness of ideas that are contributed by these concepts to solving difficult control problems posed by complex dynamical systems. Some distinctive contributions of the present work include a formalization of the concepts of learning automata for designing neural-network-based control schemes that employ trained recurrent neural networks and the demonstration of the robustness of these control schemes. The control system architecture employed here, which may be characterized by the recurrent neural network serving as a direct adaptive controller for the plant,⁵ also differs considerably from the more traditional neural-network-based control methods, such as those requiring learning the inverse dynamics of the plant to be controlled [38] or self-tuning regulator methods to be used [24], etc. For a brief description of the inherent differences, it may be mentioned that inverse dynamics learning through backpropagation methods [38] generally poses some complexities in requiring backpropagation of error through the unknown plant dynamics (due to the fact that the measured error at the plant output is different from the error at the neural-network output which is needed for implementing the gradient-based updating rules). On the other hand, self-tuning regulator methods [24], which implement an indirect adaptive control approach, require certain separability assumptions (separation of control terms from plant states) on

⁵ There are two basic schemes used in adaptive control of dynamical plants, viz. direct and indirect control. The plant dynamics to be controlled need not be known in either method. In the indirect control scheme, two distinct steps involving an identification of the plant dynamics followed by an evaluation of the control signal are conducted at each control update instant, whereas in the direct control scheme the control signal input to the plant is tailored directly based on the error in the plant response to the applied input. The architecture shown in Fig. 5(a) and 5(b) corresponds to a direct control scheme.

the plant dynamics for facilitating a simple implementation ofthe control scheme.

Results of two control experiments with distinct control objectives will be briefly outlined in the following to illustrate the performance of the training approach in this application. The objective in Experiment 1 is to regulate the output of the dynamical plant to zero whereas the objective in Experiment 2 is to drive the plant such that its output follows a prespecified desired trajectory. The control architecture [shown in Fig. 5(a)] for these experiments can be described as employing the reference input appropriately to specify the desired plant output (zero for Experiment 1 and the desired time trajectory for Experiment 2) and a feedback of the instantaneous state of the plant provided as an external input to the neural network. The learning framework employed here, where the neural network is trained while connected to the plant in this feedback configuration, deserves a particular note.
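To make the learning configuration of Fig. 5(a) concrete, the sketch below closes the loop between the network (1b) and a generic plant and returns the error functional evaluated on the plant response. The Euler integration, the use of the first node as the designated output node, and the plant model supplied by the caller are assumptions made for illustration; this is not the benchmark plant of Experiment 1.

    import numpy as np

    def closed_loop_error(params, plant_rhs, x_plant0, ref, T=30.0, h=0.01):
        # Run the plant under the recurrent-network controller (1b) and return the
        # accumulated error on the plant output; this plays the role of the
        # environment that judges each automaton action.
        W, W_in, tau, gamma = params
        x_nn = np.zeros(len(tau))                   # network state
        x_p = np.array(x_plant0, dtype=float)       # plant state
        t, E = 0.0, 0.0
        while t < T:
            I = np.concatenate(([ref(t)], x_p))     # reference command + state feedback
            u = x_nn[0]                             # designated output node is the control
            x_nn += h * (-x_nn + W @ np.tanh(gamma * x_nn) + W_in @ I) / tau
            x_p += h * plant_rhs(x_p, u)            # plant integration (Euler, for brevity)
            E += h * abs(x_p[0] - ref(t))           # |output - desired| accumulated in time
            t += h
        return E

    # A parameter-perturbing automaton action is retained only if it reduces the
    # value returned by closed_loop_error, exactly as in the trajectory-learning case.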

Experiment 1: In this experiment, stabilization of a nonlinear dynamical plant described by

(13)

where x(t) denotes the plant output and u(t) the applied control input, was considered. The problem of interest is to design a control signal u(t) that drives the plant output x(t) to zero when started from a specified initial condition x(0). Evidently, if x(0) = 0, x(t) will remain at zero for all t, and consequently only the case x(0) ≠ 0 is of interest.

The complexity of designing a stabilization scheme for this nonlinear plant was underscored by Ichikawa and Sawa [32]. In particular, it was noted that since the coefficient multiplying the control input in (13) attains a positive value when x ≥ 0 and a negative value when x < 0, the direction of the applied control force is inverted when x changes sign, and consequently, linear constant feedback gains that stabilize the system from positive values of x(0) fail to stabilize it from negative values, and vice versa. Ichikawa and Sawa [32] then proceeded to train a multilayer perceptron network by a genetic algorithm that stabilized the plant from x(0) = +1 as well as from x(0) = −1. We will now demonstrate that a dynamic recurrent neural network trained by the present learning automaton approach will not only stabilize the plant from these initial conditions but also offers very attractive robustness characteristics with respect to variations in the initial condition values.

An eight-node neural network of the form given by (1b), with one designated output node, was trained in the state feedback control configuration depicted in Fig. 5(a). The two plant states are used as inputs to the neural network, while the output of the neural network is the control signal that drives the plant. A single-parameter-action automaton and a linear reward–penalty reinforcement scheme defined by (11) and (12), with fixed reward and penalty parameters, were used for training. For computing the error functional, the task specified for the network was to drive the plant states to zero in a specified time of about 30 s. Incremental learning was not used during training; it was only required that the learning automaton reduce the value of the error functional at each attempted action.
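For reference, the following is a minimal sketch of the standard linear reward–penalty recursion for an r-action automaton, which is assumed here to correspond to the scheme of (11) and (12); the reward and penalty constants shown are placeholders, since the values used in the experiment are not repeated here.

```python
import numpy as np

def lrp_update(p, i, rewarded, a=0.1, b=0.02):
    """Standard linear reward-penalty update of the action-probability vector p
    after action i was attempted.  a and b are the reward and penalty
    parameters (placeholder values)."""
    p = p.copy()
    r = len(p)
    if rewarded:                          # reinforce the successful action
        p[i] += a * (1.0 - p[i])
        for j in range(r):
            if j != i:
                p[j] *= (1.0 - a)
    else:                                 # penalize: redistribute probability
        p[i] *= (1.0 - b)
        for j in range(r):
            if j != i:
                p[j] = b / (r - 1) + (1.0 - b) * p[j]
    return p

p = np.full(4, 0.25)                      # e.g., four candidate parameter actions
p = lrp_update(p, i=2, rewarded=True)
print(p, p.sum())                         # probabilities remain normalized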

Fig. 6. (a) Plant output profiles. (b) Stabilizing control signal for x(0) = +1. (c) Stabilizing control signal for x(0) = −1. (d), (e) Decay of the errors in the two plant state variables during training.

The network was trained for the two sets of initial conditions x(0) = +1 and x(0) = −1, with the weights initialized to small random values centered around zero with a maximum amplitude of 0.05, the time constants initialized to 1.0, and the activation gains initialized to 2.0. At the end of training, the neural-network controller, operated in the configuration shown in Fig. 5(b), was able to stabilize the plant from either set of initial conditions. The resulting plant output trajectories for the two cases are shown in Fig. 6(a), which indicates that the control objectives are satisfactorily met. The stabilizing control signals learned by the neural network are shown in Fig. 6(b) and (c) for the initial conditions x(0) = +1 and x(0) = −1, respectively.

Fig. 6. (Continued.) (f) Controller robustness: plant outputs from different initial conditions.

For providing greater insight into the training behavior of the automaton, Fig. 6(d) and (e) show the evolution of the errors in the two state variables as a function of the training loop number, with the error computed after running the plant with the neural net controller for a certain length of time following an updating action by the automaton. For this experiment, the automaton updating together with the generation of the trajectory over a 20-s interval is considered one training loop. Initially, a training strategy was specified in which an attempted action was rewarded only when the changes reduced the errors in both state variables simultaneously. It was found, however, that more efficient training was possible with a strategy that allowed small increases in the error in the second state variable as long as the error in x (the primary state variable for stabilization) decreased. It may be observed in Fig. 6(e) that there is a small increase in the error around step 200, which facilitates a rapid reduction in the error in the primary state variable.
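The modified reward strategy just described can be stated compactly as an acceptance test; the slack factor below is a hypothetical choice, since the tolerance actually used is not specified.

```python
def action_is_rewarded(err_x_old, err_xd_old, err_x_new, err_xd_new, slack=0.05):
    # Reward the attempted action if the error in x (the primary state variable
    # for stabilization) decreases, while tolerating a small increase in the
    # error in the second state variable.  The 5% slack is a hypothetical value.
    return err_x_new < err_x_old and err_xd_new <= (1.0 + slack) * err_xd_old
```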

Comparing the performance of the learning automaton approach with the genetic-algorithm-based training proposed by Ichikawa and Sawa [32], it may be noted that [32] uses a population size of ten with a maximum of 100 generations for this problem. Ten loops were run per generation, which gives a total number of loops of 20 000. This is considerably larger than the number of loops required with the learning automaton approach, which was of the order of 200 before the error reduced to negligible values. It must also be noted that [32] employs static neurons operating with zero-order holds in order to mimic the operation of a digital filter, whereas the present strategy uses dynamic neurons in a continuous-time simulation environment. Despite these differences, after training was completed both methods resulted in stabilization of the plant in roughly the same amount of time (15–20 s). Thus, it is in the training time that the two methods differ considerably, with the learning automaton approach offering a decided advantage (even though the network being trained here is a recurrent network).

While the controller performance described above parallels that demonstrated in [32], of particular significance was the evaluation of the robustness of the neural-network controller (not demonstrated in [32]). More specifically, the question of interest is the following: since the controller was trained for two sets of initial conditions, will it generalize to other sets of initial conditions, so that plant stabilization is guaranteed even under a mismatch in the initial states? This is a question of great significance in practical implementations of the control strategy. Fig. 6(f) shows the plant output trajectories generated by the controller for several sets of initial conditions different from those used in training, which clearly demonstrate the excellent robustness of the present controller.
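Such a robustness check amounts to re-running the trained closed loop from initial states not seen during training and verifying that the plant state still decays to zero. A minimal evaluation harness is sketched below, with a generic stand-in plant and a simple fixed feedback law standing in for the trained recurrent-network controller; the settling tolerance and the list of test initial conditions are assumptions.

```python
import numpy as np

def settles(controller, x0, t_final=30.0, dt=0.01, tol=1e-2):
    # Simulate the closed loop from initial state x0 and report whether the
    # state has decayed close to zero by t_final.  The plant below is a generic
    # nonlinear stand-in, not the plant of (13).
    x = np.array(x0, dtype=float)
    for _ in range(int(t_final / dt)):
        u = controller(x)
        xdot = np.array([x[1], -np.sin(x[0]) - 0.5 * x[1] + u])
        x += dt * xdot
    return bool(np.linalg.norm(x) < tol)

# A fixed linear feedback law standing in for the trained recurrent-network
# controller, used only to exercise the evaluation harness.
controller = lambda x: -2.0 * x[0] - 2.0 * x[1]

for x0 in [(1.0, 0.0), (-1.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]:
    print(x0, settles(controller, x0))
```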

Experiment 2: A different control objective, that of tracking a specified trajectory, was considered in this experiment. Control problems with such objectives arise in a number of diverse applications, such as controlling the motion of robot arms, aircraft flight path control, flexible manufacturing environments, etc. The problem for this experiment was specified as the design of a tracking control signal for the nonlinear system

(14)

such that the plant output tracks the signal specified by

(15)

A single-parameter-action automaton with a linear reinforcement rule, as in Experiment 1, was used for training an eight-node recurrent network with one output node. The external inputs to the network consist of the reference command specified by the signal to be tracked and a feedback of the plant output. Incremental learning was used by setting up 50 learning steps, and an error criterion, specified by the integral of the absolute value of the deviation from the incremental target trajectory with a final value of 0.06, was used for guiding the training from one learning step to the next. The weight parameters of the network were initialized to small random values centered around zero, the time constants were initialized to 5.0, and the activation gains to 10.0. The initial state values for the plant were selected as 0.0.
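A minimal sketch of how the incremental-learning schedule and the error criterion described above might be set up follows; the reference signal, the 4-s horizon, and the amplitude-scaling schedule for the intermediate targets are all assumptions, since (15) and the exact incremental scheme are not reproduced here.

```python
import numpy as np

def desired_trajectory(t):
    # Stand-in for the reference signal of (15): a smooth target with an
    # inflection near t = 2 s, used here only for illustration.
    return 0.5 * np.sin(np.pi * t / 2.0)

def incremental_targets(t, n_steps=50):
    # One plausible incremental-learning schedule (an assumption): the target
    # amplitude is scaled up gradually over the 50 learning steps.
    full = desired_trajectory(t)
    return [(k / n_steps) * full for k in range(1, n_steps + 1)]

def deviation_criterion(y, y_target, dt):
    # Integral of the absolute deviation from the current incremental target;
    # training advances to the next learning step once this falls below 0.06.
    return np.sum(np.abs(y - y_target)) * dt

t = np.arange(0.0, 4.0, 0.01)            # 4-s horizon is an assumption
targets = incremental_targets(t)
print(len(targets), deviation_criterion(np.zeros_like(t), targets[-1], 0.01))
```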

Fig. 7. (a) Tracking performance of the neural controller. (b) Tracking error profile. (c) Tracking control signal generated by the neural network. (d) Robustness of the tracking controller.

The tracking performance delivered by the controller is shown in Fig. 7(a), which clearly indicates that the neural network is well trained for generating the desired control signal. A plot of the tracking error is shown on a magnified scale in Fig. 7(b), and the profile of the control signal generated by the neural network is shown in Fig. 7(c). It may be noted that the tracking error is kept below 0.06 throughout the length of the trajectory and attains its maximum values at the points where the signal to be tracked changes the most, that is, at the beginning and at the point of inflection at 2.0 s. It must be emphasized that the design of a tracking controller for a nonlinear dynamical plant of the present form is exceedingly complex and is usually carried out only under crude linearization approximations. As is well known, the magnitude of the initial tracking error can be further reduced by including a feedforward control signal in addition to the neural-network-generated feedback control signal, which improves the overall tracking performance. The emphasis in the present experiment, however, is on the learning capability of the neural network for implementing the state feedback control signal that achieves the desired tracking, which is illustrated by the results of this experiment.
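The feedforward augmentation mentioned above would simply add a reference-derived term to the network-generated feedback control; the particular feedforward map and gain below are hypothetical choices for illustration, not taken from the paper.

```python
def total_control(u_feedback, r, r_dot, k_ff=0.5):
    # Combine the neural-network feedback control with a simple feedforward
    # term derived from the reference signal and its derivative; this reduces
    # the transient error at the start of the trajectory.
    u_feedforward = k_ff * (r + r_dot)
    return u_feedback + u_feedforward

print(total_control(u_feedback=0.1, r=0.0, r_dot=0.25))
```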

As in the stabilization experiment, the robustness of the neural net controller to changes in the initial plant states was tested. Fig. 7(d) shows the tracking performance of the trained neural network when the initial states of the plant were set to random numbers centered around zero with a maximum magnitude of 0.1. It is evident that the tracking performance degrades very little despite the mismatch between these initial states and the zero initial states used during training.

V. CONCLUSION

The major contribution of this paper is the demonstration of the feasibility of a learning automaton approach for training recurrent neural networks in the execution of certain complex temporal learning tasks. The development given here can be regarded as a formalization of the method of reinforcement learning that has been used successfully with static feedforward networks in some past attempts. The learning automaton approach enables the neural network to gain experience with the environment it is operating in and to be trained on the basis of that experience. The principal advantage of this method of training is that it requires no complex computations (such as gradient evaluations) and hence affords a simpler implementation than the available procedures for training recurrent networks. Furthermore, it offers the flexibility of tailoring a variety of specific algorithms based on the selection of linear and nonlinear reinforcement rules.

The efficiency of the training approach was demonstrated in two illustrative application areas, viz. the learning of continuous trajectories and the design of controllers for nonlinear dynamical plants. The superior accuracy with which the benchmark circular trajectory could be generated, even when the initial point is not necessarily on the desired trajectory, attests to the efficiency of the temporal learning performance. The highlights of the application to controller design, on the other hand, are the simplicity with which control problems with diverse objectives can be handled (as exemplified by the designs of a stabilizing controller and of a tracking controller) and the excellent robustness characteristics of these controllers with respect to variations in the initial plant states (which is of considerable usefulness in practice).

Further improvements to the temporal learning performance, aimed at reducing the memory requirements and increasing the speed of learning, should be possible through additional investigations. A characterization of linear and nonlinear reinforcement rules, and the identification of neural-network learning environments that call for tailoring a particular class of reinforcement rules, would be highly useful. Also of interest would be the development of procedures for identifying optimal reward/penalty functions for particular applications. Another potentially useful direction for future investigation would be to integrate the basic concepts of the learning automaton approach with the familiar gradient-based algorithms, for instance by combining the simplicity of implementing linear reinforcement rules with approximate gradient computations (of the form used in [24] for adaptive control designs), particularly in scenarios where the overall learning problem can be decomposed into several subproblems. The recently proposed idea of Baldi and Toomarian [33] of training a hierarchy of modules independently and interconnecting them could find useful application in this work.

REFERENCES

[1] J. J. Hopfield, "Neurons with graded response have collective computational properties like those of two-state neurons," in Proc. Nat. Academy Sci., 1984, vol. 81, pp. 3088–3092.

[2] S. I. Sudharsanan and M. K. Sundareshan, "Equilibrium characterization of dynamical neural networks and a systematic synthesis procedure for associative memories," IEEE Trans. Neural Networks, vol. 2, pp. 509–522, Sept. 1991.

[3] J. J. Hopfield and D. W. Tank, "Neural computation of decisions in optimization problems," Biol. Cybern., vol. 52, pp. 1–12, 1985.

[4] S. I. Sudharsanan and M. K. Sundareshan, "Exponential stability and a systematic synthesis of a neural network for quadratic minimization," Neural Networks, vol. 4, pp. 599–613, 1991.

[5] F. J. Pineda, "Generalization of backpropagation to recurrent neural networks," Phys. Rev. Lett., vol. 59, no. 19, pp. 2229–2232, 1987.

[6] L. B. Almeida, "A learning rule for asynchronous perceptrons with feedback in a combinatorial environment," in Proc. IEEE 1st Annu. Int. Conf. Neural Networks, San Diego, CA, 1987, pp. 609–618.

[7] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA: MIT Press, 1986, pp. 45–76.

[8] S. I. Sudharsanan and M. K. Sundareshan, "Training of a three-layer recurrent neural network for nonlinear input–output mapping," in Proc. 1991 Int. Joint Conf. Neural Networks (IJCNN-91), Seattle, WA, July 1991.

[9] M. Tsetlin, "On the behavior of finite automata in random media," Automat. Remote Contr., vol. 22, pp. 1210–1219, 1962.

[10] S. Lakshmivarahan, Learning Algorithms: Theory and Applications. New York: Springer-Verlag, 1981.

[11] K. S. Narendra and M. A. L. Thathachar, Learning Automata: An Introduction. Englewood Cliffs, NJ: Prentice-Hall, 1989.

[12] R. A. Rescorla and A. R. Wagner, "A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement," in Classical Conditioning II: Current Research and Theory, A. H. Black and W. R. Prokasy, Eds. New York: Appleton-Century-Crofts, 1972.

[13] J. M. Mendel and R. W. McLaren, "Reinforcement learning control and pattern recognition systems," in Adaptive, Learning, and Pattern Recognition Systems, J. M. Mendel and K. S. Fu, Eds. New York: Academic, 1970.

[14] A. G. Barto, "Connectionist learning for control: An overview," in Neural Networks for Control, W. T. Miller, R. S. Sutton, and P. J. Werbos, Eds. Cambridge, MA: MIT Press, 1990.

[15] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, pp. 834–846, 1983.

[16] C. W. Anderson, "Learning to control an inverted pendulum using neural networks," IEEE Contr. Syst. Mag., pp. 31–47, 1989.

[17] R. S. Sutton, A. G. Barto, and R. J. Williams, "Reinforcement learning is direct adaptive optimal control," IEEE Contr. Syst. Mag., pp. 19–22, 1992.

[18] R. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computa., vol. 1, pp. 270–280, 1989.

[19] B. Pearlmutter, "Learning state-space trajectories in recurrent neural networks," Neural Computa., vol. 1, pp. 263–269, 1989.

[20] N. Toomarian and J. Barhen, "Learning a trajectory using adjoint functions and teacher forcing," Neural Networks, vol. 5, pp. 473–484, 1992.

[21] B. Widrow and F. W. Smith, "Pattern recognizing control systems," in Proc. Computer and Inform. Sci. Symp., Washington, DC, 1963.

[22] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Trans. Neural Networks, vol. 1, pp. 4–27, 1990.

[23] K. S. Narendra and S. Mukhopadhyay, "Intelligent control using neural networks," IEEE Contr. Syst. Mag., pp. 11–18, 1992.

[24] A. Karakasoglu, S. I. Sudharsanan, and M. K. Sundareshan, "Identification and decentralized adaptive control using dynamical neural networks with application to robotic manipulators," IEEE Trans. Neural Networks, vol. 4, pp. 919–931, Nov. 1993.

[25] S. I. Sudharsanan, I. Muhsin, and M. K. Sundareshan, "Self-tuning adaptive control of multiinput multioutput systems using multilayer recurrent neural networks with application to synchronous power generators," in Proc. IEEE Conf. Neural Networks, San Francisco, CA, Mar. 1993.

[26] V. I. Varshavskii and I. P. Vorontsova, "On the behavior of stochastic automata with variable structure," Automat. Remote Contr., vol. 24, pp. 327–333, 1963.

[27] R. Bartle and D. Sherbert, Introduction to Real Analysis. New York: Wiley, 1992.

[28] H. Behrens, D. Gawronska, J. Hollatz, and B. Schurmann, "Recurrent and feedforward backpropagation for time-independent pattern recognition," in Proc. 1991 Int. Joint Conf. Neural Networks (IJCNN), Seattle, WA, July 1991.

[29] T. Lewis and W. H. Payne, "Generalized feedback shift register pseudorandom number algorithms," J. ACM, vol. 20, no. 3, pp. 456–468, July 1973.

[30] A. Isidori, Nonlinear Control Systems. New York: Springer-Verlag, 1989.

[31] S. S. Sastry and A. Isidori, "Adaptive control of linearizable systems," IEEE Trans. Automat. Contr., vol. 34, pp. 1123–1131, 1989.

[32] Y. Ichikawa and T. Sawa, "Neural-network application for direct feedback controllers," IEEE Trans. Neural Networks, vol. 3, pp. 224–232, Mar. 1992.

[33] P. Baldi and N. Toomarian, "Learning trajectories with a hierarchy of oscillatory modules," in Proc. IEEE Conf. Neural Networks, San Francisco, CA, June 1993, pp. 1171–1176.

[34] M. D. Waltz and K. S. Fu, "A heuristic approach to learning control systems," IEEE Trans. Automat. Contr., vol. AC-10, pp. 390–398, 1965.

[35] B. Widrow, N. K. Gupta, and S. Maitra, "Punish/reward: Learning with a critic in adaptive threshold systems," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, pp. 455–463, 1973.

[36] S. Mahadevan and J. Connell, "Automatic programming of behavior-based robots using reinforcement learning," IBM Tech. Rep., 1990.

[37] V. Gullapalli, J. A. Franklin, and H. Benbrahim, "Acquiring robot skills via reinforcement learning," IEEE Contr. Syst. Mag., vol. 14, no. 1, pp. 13–24, 1994.

[38] D. Psaltis, A. Sideris, and A. A. Yamamura, "A multilayered neural network controller," IEEE Contr. Syst. Mag., vol. 8, no. 4, pp. 17–21, 1988.

[39] S. I. Sudharsanan and M. K. Sundareshan, "Supervised training of dynamical neural networks for associative memory design and identification of nonlinear maps," Int. J. Neural Syst., vol. 5, pp. 165–180, Sept. 1994.

[40] B. Pearlmutter, "Gradient calculation for dynamic neural networks: A survey," IEEE Trans. Neural Networks, vol. 6, pp. 1212–1228, 1995.

[41] J. T. H. Lo, "Synthetic approach to optimal filtering," IEEE Trans. Neural Networks, vol. 5, pp. 803–811, 1994.

[42] D. Obradovic, "On-line training of recurrent neural networks with continuous topology adaptation," IEEE Trans. Neural Networks, vol. 6, pp. 222–228, 1996.

[43] T. Condarcure, "A learning automaton approach to trajectory learning and control systems design using recurrent neural networks," M.S. thesis, Univ. Arizona, Tucson, May 1993.

[44] B. Kosko, "Unsupervised learning in noise," IEEE Trans. Neural Networks, vol. 1, pp. 44–57, 1990.

[45] P. Werbos, "Backpropagation through time: What it does and how to do it," Proc. IEEE, vol. 78, pp. 1550–1560, 1990.

Malur K. Sundareshan received the B.E. degree in electrical engineering from Bangalore University, Bangalore, India, in 1967, and the M.E. and Ph.D. degrees in electrical engineering from the Indian Institute of Science, Bangalore, India, in 1969 and 1973, respectively.

Between 1973 and 1976, he held various visiting faculty positions at the Indian Institute of Science, Bangalore, at the University of Santa Clara, CA, and at Concordia University, Montreal, Quebec, Canada. From 1976 to 1981, he was on the faculty of the Department of Electrical Engineering, University of Minnesota, Minneapolis. Since 1981, he has been on the faculty of the Department of Electrical and Computer Engineering, University of Arizona, Tucson, where he is a Professor. From 1985 to 1989 he served as the Chairman of the Systems Group, coordinating the activities of the communications, control systems, and signal processing faculty. His current research interests are in statistical signal processing, communication networks, adaptive control and estimation, neural networks, and biomedical systems. He is the author or coauthor of several papers in these areas and is a coauthor of the book Fullerene C60: History, Physics, Nanobiology and Nanotechnology (New York: Elsevier, 1993).

Thomas A. Condarcure, photograph and biography not available at the time of publication.