Preface

This volume represents a turning point in neural network advancements. The first neural networks proposed, such as the multilayer perceptron, were static networks that classified static patterns—fixed vectors—and resulted in a network output that was yet another static pattern, another fixed-valued vector. Neither pattern changed with time.

Today the field of neural networks is advancing beyond these static neural networks, to more advanced concepts that incorporate time-dynamics in their inputs, outputs, and internal processing. Neural networks now can accept, as input, time-varying signals, even multichannel signals that correspond to a vector or image that changes over time, and often provide classification of data that varies over time. Some networks produce results that are time-dynamic, including oscillations and temporal patterns, and sometimes self-sustained activity can be a signature unique to the network's structure or to the patterns that stimulate the network.

What are the elements and architectures that make it possible to advance from static architectures to dynamic computation? What approaches provide increased capabilities for neural networks? These questions are answered, in part, by this volume.

Pulse-coupled neural networks incorporate processing elements, neurons, that communicate by sending pulses to one another. Pulse-coupled neural networks can represent spatial information in the time structure of their output pulse trains and can segment an image into multi-neuron time-synchronous groups. Johnson, Ranganath, Kuntimad, and Caulfield, in Chapter 1, illustrate these capabilities and show the architectural structure of the pulse-coupled networks.

Motion perception is an essential capability for advanced organisms, yet detecting motion and computing image flow is a difficult computational problem. In Chapter 2, Li and Wang propose a recurrent neural network model that can be operated asynchronously in parallel to achieve a real-time solution.

In Chapter 3, temporal pattern matching is performed when dynamic time warping is combined with a Hopfield network. Unal and Tepedelenlioglu show how a dynamic programming algorithm that compares an input test signal with a reference template signal, reducing the nonlinear time misalignments between the two patterns, can be implemented with a neural network approach to achieve an optimum match between two patterns.

Dynamic attractors in neural networks with prolonged, self-sustained activity are the subject of Chapter 4. Different attractors can be evoked by different network structures and different stimulus patterns, with a wide range of flexibility. Dynamic attractors can also be trained into a network. Authors Dayhoff, Palmadesso, Richards, and Lin demonstrate potential enhancements in computational paradigms for which dynamic networks show promise.

A macroscopic model of oscillations in ensembles of neurons that characterizes very large networks of neurons is presented in Chapter 5. In this chapter, Ghosh, Chang, and Liano study the interaction between two neuron groups and show how to predict the presence of oscillations and their frequencies.

The relationship between automata and recurrent neural networks is developed in Chapter 6 by Tino, Horne, Giles, and Collingwood. Recurrent neural networks can be trained to mimic finite state machines, and mathematical relationships that demonstrate their ability to act as automata can be proven. The enormous potential, then, of appropriately trained recurrent networks becomes apparent.

In Chapter 7, Anderson presents a putative neurobiological model of trial-and-error learning. He demonstrates the plausibility of synaptic weights being trained through random fluctuations in their strengths and the concomitant changes in the synapses, and argues for the biological plausibility of such a model.

Segmentation of continuous sequences is addressed in Chapter 8, with the SONNET 1 network, which incorporates temporal decay on the input activation values. These networks learn to segment temporal patterns as the patterns are presented (e.g., as temporal signals) and learn to segment the patterns with no a priori knowledge of when a pattern begins or ends. In this case, the network performs a transformation of temporal events into spatial patterns of activity.

Models of living neural systems are related to models developed for complex engineering operations in Chapter 9, where Venkatesh, Pandya, and Hsu show how to extend the concepts of Petri nets to encompass high-level structures found in biological neurons and in biological neural systems. The result is a new class of high-level Petri nets (HPNs).

Chapter 10 attests to the high potential of locally recurrent networks for processing time-varying signals. In this chapter, Principe, Celebi, DeVries, and Harris review the gamma neural network structure and show variations such as the Laguerre and Gamma II memory networks. The functionality of these networks is identified, and their structure is described as a class of neural topologies that are intermediate between purely feedforward static networks and globally recurrent networks. The gamma operators are capable of adapting the time scale of the memory to best match the properties of the data.

Altogether, this volume incorporates landmark results on how neural network models have evolved from simple feedforward systems with no temporally dynamic activity into advanced neural architectures with self-sustained activity patterns, simple and complicated oscillations, specialized time elements, and new capabilities for analysis and processing of time-varying signals. The enormous potential of these advanced architectures becomes apparent through the compendium of applications that appear here, including speech recognition, pattern classification, image analysis, temporal pattern matching, and the modeling of neurobiological systems.

Judith Dayhoff

Omid Omidvar


Contributors

• Russell W. Anderson

Smith-Kettlewell Eye Research Institute 2232 Webster Street San Francisco, CA 94115 and Biomedical Engineering University of Northern California Petaluma, CA E-mail: [email protected]

• H. J. Caulfield Alabama A&M University Department of Physics Normal, AL 35762

• Samel Celebi Lucent Technologies-Bell Labs Innovations Middletown, NJ 07748 E-mail: [email protected]

• Hung-Jen Chang Department of Molecular and Cell Biology University of California at Berkeley Berkeley, CA 94720 E-mail: [email protected]

• Pete C. Collingwood School of Computing & Management Sciences Sheffield Hallam University Hallam Business Park 100 Napier St. Sheffield, S11 8HD United Kingdom E-mail: [email protected]

• Judith E. Dayhoff Institute for System Research University of Maryland College Park, MD 20742 E-mail: [email protected]

• Joydeep Ghosh Department of Electrical and Computer Engineering Engineering Sciences Building The University of Texas at Austin Austin, TX 78712-1084 E-mail: [email protected]

• C. Lee Giles NEC Research Institute 4 Independence Way Princeton, NJ 08540 Institute for Advanced Computer Studies University of Maryland College Park, MD 20742 E-mail: [email protected] .nec.com

• John G. Harris Department of Electrical and Computer Engineering University of Florida Gainesville, FL 32611 E-mail: [email protected]

• Bill G. Home NEC Research Institute 4 Independence Way Princeton, NJ 08540 E-mail: [email protected] .nec.com

• Sam Hsu Department of Computer Science and Engineering Florida Atlantic University Boca Raton, FL 33431 [email protected]

• J. L. Johnson U. S. Army Missile Command Weapons Sciences Directorate AMSMI-RD-WS-PL Redstone Arsenal, AL 35898-5248

• Govinda Kuntimad Boeing North America Rocketdyne Division Huntsville, AL 35806 USA E-mail: [email protected]

• Hua Li Computer Engineering Department College of Engineering San Jose State University San Jose, California 95192 E-mail: [email protected]

• Kadir Liano Pavilion Technologies Austin, Texas

• Daw-Tung Lin Computer Science Department Chung Hua Polytechnic University Hsin-Chu, 30 Tung-Shiang Taiwan E-mail: [email protected]

• Albert Nigrin 6942 Clearwind Ct. Baltimore, MD 21209 E-mail: [email protected]

• Peter J. Palmadesso Plasma Physics Division Naval Research Laboratory Washington, D.C. 20375 E-mail: [email protected]

• Abhijit Pandya Department of Computer Science and Engineering Florida Atlantic University Boca Raton, FL 33431 E-mail: [email protected]

• Jose C. Principe Department of Electrical and Computer Engineering University of Florida Gainesville, FL 32611 E-mail: [email protected]

• H. Ranganath University of Alabama in Huntsville Computer Sciences Department Huntsville, AL E-mail: [email protected]

• Fred Richards Entropic Research Laboratory, Inc. 600 Pennsylvania Ave. S.E., Suite 202 Washington, D.C. 20003

• Nazif Tepedelenlioglu Department of Electrical and Computer Engineering Florida Institute of Technology 150 W. University Blvd. Melbourne, FL 32901 E-mail: [email protected]

• Peter Tino Dept. of Computer Science and Engineering Slovak Technical University Ilkovicova 3 812 19 Bratislava, Slovakia NEC Research Institute 4 Independence Way Princeton, NJ 08540 E-mail: [email protected]

• Fatih A. Unal National Semiconductor National Semiconductor Drive Mail Stop C1-495 Santa Clara, CA 95052 E-mail: [email protected]

• Kurapati Venkatesh Center for Manufacturing Systems Department of Mechanical Engineering New Jersey Institute of Technology Newark, NJ 07104

• Bert De Vries David Sarnoff Research Center CN5300 Princeton, NJ 08543-5300 E-mail: [email protected]

• Jun Wang Industrial Technology Department The University of North Dakota Grand Forks, ND 58202 E-mail: [email protected]


Chapter 1

Pulse-Coupled Neural Networks

J. L. Johnson, H. Ranganath, G. Kuntimad, and H. J. Caulfield

ABSTRACT A pulse-coupled neural network using the Eckhorn linking field coupling [1] is shown to contain invariant spatial information in the phase structure of the output pulse trains. The time domain signals are directly related to the intensity histogram of an input spatial distribution and have complex phase factors that specify the spatial location of the histogram elements. Two time scales are identified. On the fast time scale the linking produces dynamic, quasi-periodic, fringe-like traveling waves [2] that can carry information beyond the physical limits of the receptive fields. These waves contain the morphological connectivity structure of image elements. The slow time scale is set by the pulse generator, and on that scale the image is segmented into multineuron time-synchronous groups. These groups act as giant neurons, firing together, and by the same linking field mechanism as for the linking waves can form quasi-periodic pulse structures whose relative phases encode the location of the groups with respect to one another. These time signals are a unique, object-specific, and roughly invariant time signature for their corresponding input spatial image or distribution [3].

The details of the model are discussed, giving the basic Eckhorn linking field, extensions, generation of time series in the limit of very weak linking, invariances from the symmetries of the receptive fields, time scales, waves, and signatures. Multirule logical systems are shown to exist on single neurons. Adaptation is discussed. The pulse-coupled nets are compatible with standard nonpulsed adaptive nets rather than competitive with them, in the sense that any learning law can be used. Their temporal nature results in adaptive associations in time as well as over space, and they are similar to the time-sequence learning models of Reiss and Taylor [4]. Hardware implementations, optical and electronic, are reviewed. Segmentation, object identification, and location methods are discussed and current results given. The conjugate basic problem of transforming a time signal into a spatial distribution, comparable in importance to the transformation of a spatial distribution into a time signal, is discussed. It maps the invariant time signature into a phase versus frequency spatial distribution and is the spatial representation of the complex histogram. A method of generating this map is discussed. Image pattern recognition using this network is shown to have the power of syntactical pattern recognition and the simplicity of statistical pattern recognition.

1 Introduction

The linking field model of Eckhorn et al. [1] was proposed as a minimal model to explain the experimentally observed synchronous feature-dependent activity of neural assemblies over large cortical distances in the cat cortex [5]. It is a cortical model. It emphasizes synchronizations of oscillatory spindles that occur in the limit of strong linking fields and distinguishes two major types: (1) forced, or stimulus-locked, synchronous activity and (2) induced synchronous activity. Forced activity is produced by abrupt temporal changes such as movement. Induced activity occurs when the pulse train structures of the outputs of groups of cells are similar [6]. The model is called "linking field" because it uses a secondary receptive field's input to modulate a primary receptive field's input by multiplication in order to obtain the necessary coupling that links the pulse activity into synchronicity.

This paper is concerned with the behavior of the linking field model in the limit of weak-to-moderate linking strengths [2], [7]. Strong linking is characterized by synchronous bursts of pulses. When the linking strength is reduced, the neurons no longer fire in bursts but still have a high degree of phase and frequency locking. This is the regime of moderate linking strength. Further reduction continuously lowers the degree of linking to a situation where locking can occur only for small phase and frequency differences. This is the weak linking regime. A major result of this research is the finding that in the weak linking regime it is possible to encode spatial input distributions into corresponding temporal patterns with enough structure to have object-specific time series for each input pattern. The pulse phase patterns in the time series are often found to be periodic. In both simulations and in an optical hybrid laboratory demonstration system, periodicity is observed to be the rule rather than the exception. The time series can be made insensitive to translation, rotation, and scale changes of the input image distribution by an appropriate choice of the structure of the receptive field weight patterns. Substantial insensitivity to scene illumination and image distortion has also been observed in simulations.

[Figure 1 diagram: pulse inputs from other neurons enter the two channels of the dendritic tree; the linking channel modulates the feeding channel through the factor 1 + β_jL_j; a threshold and step function form the pulse generator, whose output goes to other neurons. Labeled stages: DENDRITIC TREE, LINKING, PULSE GENERATOR.]

FIGURE 1. The model neuron. The model neuron has three parts: the dendritic tree, the linking, and the pulse generator. The dendritic tree is subdivided into two channels, linking and feeding. All synapses are leaky integrator connections. The inputs are pulses from other neurons and the output is a pulse. The linking input modulates the feeding input. When a pulse occurs in the linking input it briefly raises the total internal activity U_j and can cause the model neuron to fire at that time, thus synchronizing it with the neuron transmitting the linking pulse. (Reprinted with permission from [1]).

2 Basic Model

This section reviews the basic model as discussed in Eckhorn et al. [1], [5], [6], [8], [9], and [10]. The model neuron is a neuromime [11], modified with two receptive fields per neuron and a linking mechanism added. It is shown in Figure 1. There are three parts to the model neuron: the dendritic tree, the linking modulation, and the pulse generator. Each part will be described separately, and then the operation of the complete model will be discussed.

2.1 The Dendritic Tree

The dendrit ic tree is divided into two principal branches in order to make two distinct inputs to the linking par t of the j t h neuron. They are the primary input , te rmed the feeding input F j , and the secondary input , which is the linking input Lj. These are given in equations 1 and 2, respectively, for the case of continuous t ime. For discrete t ime steps, the digital filter

Page 16: NNPattern

4 Johnson, Ranganath, Kuntimad, and Caulfield

model is used, as given in the appendix of Eckhorn et al. [1]. (The simulations reported here used the discrete model. The equations are given in Section 9.) Each input is a weighted sum from the synaptic connections on its dendritic branch. The synapses themselves are modeled as leaky integrators. An electrical version of a leaky integrator is a capacitor and a resistor in parallel, charged by a brief voltage pulse and decaying exponentially. Likewise, when a synapse receives a pulse, it is charged, and its output amplitude rises steeply. The amount of rise depends on the amplitude gain factor assigned to the synapse. It then decays exponentially according to its time constant. These postsynaptic signals are summed to form the total signal out of that branch of the dendritic tree, as indicated in Figure 1. The amplitude gain factors and the decay time constants of the synapses characterize the signals. The synapses in the feeding branch are assumed [1] to have smaller time constants than those of the linking branch. This assumption lets the feeding signal have a long decay tail on which the spikelike linking input can operate through the linking modulation process. The linking and feeding inputs are given by

L_j = Σ_k L_kj = Σ_k (W_kj e^{-α_L t}) * Y_k(t),   (1)

F_j = Σ_k F_kj = Σ_k (M_kj e^{-α_F t}) * Y_k(t) + I_j,   (2)

where W_kj and M_kj are the synaptic gain strengths, or weights, for the kth synapse of the linking and feeding receptive fields, respectively, to the jth neuron. Y_k(t) is the input pulse, or pulse train, from the kth neuron; α_L and α_F are the time constants; and f_1 * f_2 denotes the convolution integral operation for any two functions f_1 and f_2. Note that both the feeding and linking fields can receive inputs from the kth neuron. I_j is an analog feeding input to the jth neuron. It is shown here as a distinct single term but in general can be a weighted sum like the pulsed inputs. If the inputs Y_k(t) are allowed to be arbitrary functions of time, then I_j can be included in the weighted sum over the F's as a step function in time, Step(t − t_0).

Each neuron thus has two receptive fields, linking and feeding. Both fields are dendritic tree structures and can overlay the same areas around the neuron. However, their weighted sums enter the neuron via distinct channels and are combined internally by the linking, as discussed below.
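In discrete time (the digital filter form referred to above), each leaky-integrator synapse reduces to a one-line update: decay the stored value, then add the weighted incoming pulse. The sketch below is an illustrative reconstruction, not the chapter's code; the function and parameter names (`leaky_integrator_step`, `M_w`, `W_w`, `alpha_F`, `alpha_L`) are assumptions chosen to mirror equations (1) and (2).

```python
import math

def leaky_integrator_step(state, pulse_in, weight, alpha, dt=1.0):
    """One discrete-time step of a leaky-integrator synapse:
    the stored charge decays by exp(-alpha*dt) and is then
    kicked by the weighted incoming pulse (0 or 1)."""
    return state * math.exp(-alpha * dt) + weight * pulse_in

def update_inputs(F_syn, L_syn, pulses, M_w, W_w, alpha_F, alpha_L, I_j):
    """Advance every synapse one step and return (F_j, L_j):
    the feeding input is the sum of its synapse outputs plus the
    analog input I_j; the linking input is the sum of its synapses."""
    for k, y in enumerate(pulses):
        F_syn[k] = leaky_integrator_step(F_syn[k], y, M_w[k], alpha_F)
        L_syn[k] = leaky_integrator_step(L_syn[k], y, W_w[k], alpha_L)
    return sum(F_syn) + I_j, sum(L_syn)
```

With `alpha_L` chosen larger than `alpha_F`, the linking sum is spike-like while the feeding sum keeps the long decay tail the text describes.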

2.2 The Linking

The linking modulation (see Figure 1) is obtained by adding a constant positive bias to the linking input and multiplying that by the feeding input. The bias is taken to be unity. This bias has many uses, as we will see, and one of them is obvious: the linking input cannot drive the internal activity to zero. The total internal activity U_j of the neuron is

U_j = F_j (1 + β_j L_j),   (3)

where β_j is the linking strength. For convenience, it is broken out separately here, but strictly speaking, it could be incorporated in the synaptic weights. U_j is a function of time. Under the above assumption that the feeding input has a smaller time constant than that of the linking input, the general behavior of U_j is that the linking inputs appear as spike-like modulations riding on a quasi-constant carrier formed by the feeding input. The internal activity U_j thus is briefly raised above the feeding input level whenever a linking input occurs (Figure 2), and it can then trigger the neuron to fire. This effect is responsible for the synchronous activity found in the network as a whole. Equation 3 also establishes a correspondence between the linking field model and higher-order networks. If equations 1 and 2 are inserted into equation 3, there will be product terms of the form M_kj W_lj Y_k Y_l within a double sum. This is a second-order network [12]. This implies that if a pulse output model rather than an average firing rate output model is used in higher-order nets, time-synchronous behavior should be observed. The work on adaptive higher-order nets [13] may be applicable to adaptation in pulse-coupled nets as well.

2.3 The Pulse Generator

The pulse generator uses a threshold discriminator followed by a pulse former, and a variable threshold that is dependent upon the prior pulse output of the generator itself. When the neuron emits a pulse, a portion of it feeds back to the threshold, which is yet another leaky integrator, as shown in Figure 1. One or more output pulses recharge the threshold to a high level. This quickly raises it above the current value of the internal activity U_j, which in turn causes the threshold discriminator to turn off the pulse former, and the neuron stops emitting pulses. The recharged threshold then decays exponentially according to its time constant and amplitude gain factor until it drops below the internal activity again, triggering a new output pulse or pulse burst from the neuron (Figure 2). This is the pulse generator model illustrated in Eckhorn et al. [1], [8], [9], and [10], and given analytically in the appendix of [3]. One important result of the model is that under constant stimulation, the pulse former produces a train of uniformly spaced pulses. The spacing represents the refractory period τ_r of the neuron, within which time a new pulse cannot occur. This will give an upper saturation limit to the maximum output pulse frequency. The pulse generator is modeled by a leaky integrator threshold θ_j (equation (4)), a threshold discriminator in the form of a sigmoidal envelope (equation (5)),

FIGURE 2. Pulse generation and linking. The threshold is recharged when it decays below the internal activity U_j = F_j(1 + β_jL_j). The output pulse is formed as the threshold turns the step function of equation (5) on and then off as the threshold goes below U_j, starts recharging, and then rises above U_j. If a linking pulse occurs in the capture zone time, it causes the threshold to recharge sooner than otherwise, and the neuron fires a pulse synchronized with the arrival of the linking pulse. (Reprinted with permission from [3]).

and a pulse former (equations 6 and 8):

θ_j(t) = (V_T e^{-α_T t}) * Y_j(t) + θ_0,   (4)

Y_j(t) = (Sig(U_j(t) − θ_j(t)) P(t)) * e^{-α_Y t},   (5)

P(t) = Σ_n pulse(t − nτ_r),   (6)

where Sig(z) is a hyperbolic tangent sigmoidal envelope for the pulse train P(t) out of the pulse former. The sigmoid function and the pulse function pulse(t − nτ_r) are

Sig(z) = 1 / (1 + e^{-λz}),   (7)

pulse(t − nτ_r) = K ∫_{−∞}^{t} [δ(t′ − nτ_r) − δ(t′ − nτ_r − τ_w)] dt′.   (8)

Equation (8) defines a square pulse of height K whose leading and trailing edges are formed by two delta functions separated by width τ_w. It has a constant area of Kτ_w. V_T and α_T are the amplitude gain and the time constant of the leaky integrator threshold, and θ_0 is a threshold offset. λ is the scale of the sigmoid argument, and α_Y is the time constant for the convolution of equation (5). The number n refers to the pulse number. In order to have a good dynamic range of pulse periods it is desirable to require

α_T τ_r < 1.   (9)

The system of equations (4)–(8) explicitly shows the causality in the pulse generator and that the pulses are finite. Now idealize it. First let τ_w go to zero. This makes P(t) into a train of delta function pulses. Perform the convolution of equation (5) and take the limit of both α_Y and K going to infinity, in such a way that their ratio is constant, to obtain yet another delta function limit, and finally, take λ approaching infinity to get a single equation for Y_j that replaces equations (5)–(8):

Y_j(t) = Σ_n δ(t − nτ_r) Step(U_j(t) − θ_j(t)).   (10)

Step( ) is defined as 1 when its argument is positive, and 0 otherwise. Equations (10) and (4) are the idealized pulse generator. Its input has a lower limit of θ_0. An upper limit can be established by asking for the largest value of the input that will just barely recharge the threshold back to that level in a decay time τ_r, the minimum pulse period. Equation (4) gives

U e^{-α_T τ_r} + V_T > U,

from which

U < V_T / (1 − e^{-α_T τ_r}) ≡ U_max ≈ V_T / (α_T τ_r)   (11)

under the dynamic range requirement of equation (9). Figure 3 summarizes the properties of the pulse generator. There, the digital filter form (equation (29)) of the time convolutions was used.
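The idealized generator of equations (4), (10), and (11) can be exercised numerically. This is a minimal sketch under assumed parameter values (V_T = 5, α_T = 0.2, θ_0 = 1, with the time step taken equal to τ_r), not the chapter's simulation code:

```python
import math

V_T, a_T, theta0 = 5.0, 0.2, 1.0          # assumed illustrative values
U_max = V_T / (1.0 - math.exp(-a_T))      # equation (11), with tau_r = 1

def pulse_train(U, steps=500):
    """Idealized pulse generator: theta decays toward theta0 each
    step and is recharged by V_T on every output pulse (eqs 4, 10)."""
    theta, out = theta0, []
    for _ in range(steps):
        theta = theta0 + (theta - theta0) * math.exp(-a_T)
        if U > theta:
            out.append(1)
            theta += V_T      # recharge on the output pulse
        else:
            out.append(0)
    return out
```

Under constant stimulation the train settles into uniform spacing, any input above θ_0 + U_max fires on every step (the saturation limit of equation (11)), and inputs below θ_0 never fire.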

2.4 Pulse Periods

The firing rate of a single neuron is a sigmoidal function of the feeding input. This is shown by obtaining the pulse period τ_j of a neuron. It is the time required for the threshold to decay from its recharged initial height down to the internal activity level (Figure 2). Consider equation (4) when the threshold is recharged with a single pulse by an amount V_T. For a constant feeding input F and a zero linking input, the decay time back down to F is

τ_j = (1/α_T) ln( V_T / (F − θ_0) ).   (12)

[Figure 3 diagram: the threshold, sigmoid envelope, and pulse former, each with a low-pass filter decay loop; the decay loops correspond to the time constants in the convolution integrals.]
FIGURE 3. The pulse generator. The internal activity U feeds a sigmoidal envelope. When U > θ the envelope becomes high, allowing the pulse former to make an output of uniformly spaced pulses. These are the cell's output. The envelope and pulse former are in a decay loop with a large time constant. This loop ensures causality, i.e., it gives a small time delay between the pulse output and the recharging of the threshold (upper feedback loop). The threshold is another leaky integrator, recharged by the pulse output. An idealization (see text) reduces the sigmoidal envelope to a step function and makes the pulse former's output into a train of delta function pulses. (Reprinted with permission from [26]).

The refractory period is added to the decay time to obtain the total pulse period. The pulse firing rate f_j is then

f_j = (τ_j + τ_r)^{-1}.   (13)

As shown in Figure 4, it is a sigmoid function [14]: it increases more slowly than linear up to θ_0, then rises quickly (this is the center of the "S" shape), and finally goes to saturation. Its monotonically increasing behavior shows that the original input feeding distribution can be approximately recovered at any time by taking an average over many pulse periods, because the pulse frequency is faster for stronger (more intense) feeding inputs. The sigmoidal nonlinearity will cut off values below θ_0 and act as a squashing function near saturation, so the overall function is a sigmoidal mapping of the internal activity to the output when pulse-averaging is done.
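The sigmoidal rate curve can be seen by measuring the empirical pulse rate of the idealized generator over a range of internal activities. This is a sketch with assumed parameters (the refractory period is one time step), not the closed form of equation (13):

```python
import math

def firing_rate(U, V_T=5.0, a_T=0.2, theta0=1.0, steps=2000):
    """Fraction of time steps on which the idealized pulse
    generator fires for a constant internal activity U."""
    theta, count = theta0, 0
    for _ in range(steps):
        theta = theta0 + (theta - theta0) * math.exp(-a_T)
        if U > theta:
            count += 1
            theta += V_T
    return count / steps

# sample the curve at increasing internal activities
rates = [firing_rate(U) for U in (0.5, 1.5, 3.0, 6.0, 50.0)]
```

The measured rate is zero below the offset θ_0, rises monotonically with U, and saturates at one pulse per step, the refractory limit.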

FIGURE 4. Pulse frequency f_j as a function of the internal activity U_j. The pulse frequency is a sigmoidal function of the internal activity. Addition of a refractory time period τ_r makes the frequency saturate at the refractory frequency. A bias offset θ_0 will shift the curve's origin to that bias point. (Reprinted with permission from [3]).

When linking pulses are present, their strength defines a capture zone in the neuron receiving the linking pulse. From Figure 2, the capture zone time interval is

τ_c = (1/α_T) ln( 1 + βL F / (F − θ_0) ),   (14)

where β is the linking strength. If a linking pulse is received in this interval, it will briefly raise the internal activity level and cause the receiving neuron to fire at the arrival time of the linking pulse (Figure 2). The receiving neuron will frequency lock to the transmitting neuron if their pulse rates T_1 and T_2 are similar. If the neurons have the same frequency (T_1 = T_2), they will phase lock when their phase difference φ is within the capture zone time period:

Frequency lock:  |T_2 − T_1| < τ_c,

Phase lock:  |φ| < α_T τ_c.   (15)

There is also a forbidden zone immediately after each linking pulse. For α_L much greater than α_T, the length of the forbidden zone is equal to that of the capture zone (see Figure 2).
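The capture mechanism itself is easy to demonstrate numerically: a linking pulse arriving inside the capture zone fires the neuron at the pulse's arrival time, earlier than the feeding input alone would. The single-neuron setup below, with one injected linking pulse of strength β, is an assumed illustration rather than the authors' simulation:

```python
import math

def fire_times(F, link_step=None, beta=2.0, steps=30,
               V_T=5.0, a_T=0.2, theta0=1.0):
    """Idealized neuron with constant feeding input F.  A single
    linking pulse at `link_step` briefly multiplies the internal
    activity by (1 + beta), as in U = F(1 + beta*L)."""
    theta = theta0 + V_T          # start just after a firing
    times = []
    for t in range(steps):
        theta = theta0 + (theta - theta0) * math.exp(-a_T)
        U = F * (1.0 + beta) if t == link_step else F
        if U > theta:
            times.append(t)
            theta += V_T
    return times

free = fire_times(2.0)                        # no linking pulse
captured = fire_times(2.0, link_step=free[0] - 2)
```

The linking pulse lands two steps before the free firing time, inside the capture zone, so the captured neuron fires exactly at the pulse's arrival; this is the synchronizing event illustrated in Figure 2.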

This completes the description of the basic model neuron. The threshold time constants used by Eckhorn are intermediate in value between the linking and feeding time constants. The pulse-coupled linking field model contains synaptic weights but does not require any learning law. On the other hand, any learning law can be used. The frequency function of equation (12) gives the desired nonlinear response in the limit of averaging over many pulses, so this model reduces to the usual nonpulsed networks in that limit. It has the weighted interconnects, the internal sums, and the sigmoid nonlinearity. The simple pulse generator used in simulations by Eckhorn and others [1] corresponds to a two-cell oscillator [15], [16], where the threshold acts as an inhibitor cell with a slow response and the step function as an excitatory cell with a fast response. The three parts of the model (the dendritic tree, the linking, and the pulse generator) act together to weight and sum inputs in the receptive fields, modulate one input channel with a second input channel, and form the output pulses, which in turn are received by other neurons through their receptive fields. In the remainder of this paper the same threshold time constant α_T will be used for all neurons, the same linking time constant α_L for all linking fields, the same feeding time constant α_F for all feeding fields, and the same linking strength β for all neurons unless otherwise stated. The subscript j will be suppressed except where necessary.

3 Multiple Pulses

Suppose that at time zero a cell receives linking pulses from N other cells, all arriving at the same time, and that a single firing is inadequate to raise its threshold above the composite linking pulse. It will continue to fire until it exceeds the linking pulse height, as shown in Figure 5(a). Let M be the number of pulses required. For simplicity take θ = F at t = 0. Then from equations (1), (3), and (4),

V_T Σ_{m=0}^{M−1} e^{-α_T τ_r (M−1−m)} > F (1 + Nβ e^{-α_L τ_r (M−1)}),

where M − 1 has been used because the time interval for the cell to fire M times is (M − 1)τ_r. The left-hand side yields a finite sum of exponential decays. Expressing this in closed form leads to the result that

V_T (1 − e^{-α_T τ_r (M−1)}) / (1 − e^{-α_T τ_r}) ≥ βFN e^{-α_L τ_r (M−1)}.   (16)

This gives M in terms of N. If M is small enough so that all the exponentials can be expanded (see the condition of equation (9)), then M is approximately given by

Page 23: NNPattern

1. Pulse-Coupled Neural Networks 11

M ≈ 1 + βFN / (V_T + α_L τ_r βFN).
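The closed-form condition of equation (16) and the first-order estimate of M can be compared numerically. The sketch below (assumed parameter values; `pulses_needed` simply searches for the smallest integer M) is illustrative only, not code from the chapter:

```python
import math

def pulses_needed(beta, F, N, V_T=1.0, a_T=0.05, a_L=0.1, tau_r=1.0):
    """Smallest integer M satisfying the closed-form threshold
    condition of equation (16), found by direct search."""
    M = 1
    while True:
        lhs = (V_T * (1 - math.exp(-a_T * tau_r * (M - 1)))
               / (1 - math.exp(-a_T * tau_r)))
        rhs = beta * F * N * math.exp(-a_L * tau_r * (M - 1))
        if lhs >= rhs:
            return M
        M += 1

def pulses_approx(beta, F, N, V_T=1.0, a_L=0.1, tau_r=1.0):
    """First-order expansion: M ~ 1 + beta*F*N/(V_T + a_L*tau_r*beta*F*N)."""
    return 1 + beta * F * N / (V_T + a_L * tau_r * beta * F * N)
```

For weak composite linking pulses the direct search and the expansion give essentially the same pulse count, which is the regime in which the expansion above is valid.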

But it is not that simple. Suppose that the N pulses came from the same group containing the cell and they all had the same feeding input F. Then every cell in the group must send M pulses to the others. The situation, shown in Figure 5(b), is that each cell receives N pulses at a time, N being the number of cells in the group, for M times, with a separation of τ_r between times. The cells must pulse their way over a much larger linking pulse than in the previous case. Let M′ be the number of pulses required. The linking pulse is now, at t = (M′ − 1)τ_r,

L = N Σ_{m=0}^{M′−1} e^{−α_L τ_r m} = N (1 − e^{−α_L τ_r M′}) / (1 − e^{−α_L τ_r}).

Applying the condition that the threshold must be greater than this gives, after some rearrangement,

(1 − e^{−α_T τ_r M′}) / (1 − e^{−α_L τ_r M′}) > (βFN / V_T) · (1 − e^{−α_T τ_r}) / (1 − e^{−α_L τ_r}).  (17)

Unfortunately, since as shown in Figure 5(b) this condition depends on the gradual saturation of the envelope of the linking pulses, a first-order expansion may be inappropriate for the left-hand side. An asymptotic approximation comes from noting that the left-hand side is of order unity if M′ ≫ 1. This gives a rough upper limit of

1 > (βF / V_T) · (1 − e^{−α_T τ_r}) / (1 − e^{−α_L τ_r}) · N.

This is similar to equation (11) when equation (3) is used in it to explicitly include F and β. The limit of equation (17) is above that of equation (11), which was the pulse saturation limit. This shows that the model can handle all multiple pulses under the pulse saturation limit. A somewhat better approximation is to assume that α_T τ_r M′ is small. This allows the expansion to first order of the numerator on the left-hand side of equation (17):

α_L τ_r M′ > (βFN / V_T)(1 − e^{−α_L τ_r M′}),

which is of the form x > a(1 − e^{−x}), where x = α_L τ_r M′ and a = βFN/V_T. Finally, the value of N can be related to the receptive field kernel (equation (2)) as

N → N_RF = ∫ W_L(r − r′) Y(r′, t) d²r′,


12 Johnson, Ranganath, Kuntimad, and Caulfield

(a) A cell receives a composite linking pulse from an external group and fires M times for the threshold to exceed the internal activity U.

(b) A cell receives a composite linking pulse from its own group. It fires M' times, as do all the other cells in the group, causing more linking pulses. The linking pulse envelope saturates, allowing the threshold to finally exceed the internal activity U.

FIGURE 5. Multiple pulses. Two cases are shown. In 5(a), a cell receives N linking pulses simultaneously, as would occur when the cell is not part of the group of N cells making the pulses. It must fire M times to overcome the composite linking pulse. In 5(b), the cell is a member of a group of N + 1 cells. Since every member must fire multiple pulses, each fires M′ times, and each firing generates an additional linking pulse of size N, which the cell must attempt to overcome by firing again. It succeeds eventually because the linking pulse train envelope saturates more quickly than the threshold pulse train, allowing the threshold to catch up after M′ pulses. (Reprinted with permission from [26].)


which, with equation (11) or (17), shows that the integral of the receptive field kernel W needs to be finite if the slab is not bounded.
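The condition x > a(1 − e^{−x}) has no closed-form solution for x, but the smallest qualifying M′ is easy to find numerically. A bisection sketch, using hypothetical values for a = βFN/V_T and for α_Lτ_r:

```python
import math

def min_M_prime(a, alpha_L, tau_r):
    """Smallest integer M' with x > a*(1 - exp(-x)), x = alpha_L*tau_r*M'.

    For a > 1 the inequality first holds past the nonzero root of
    x = a*(1 - exp(-x)), located here by bisection."""
    f = lambda x: x - a * (1.0 - math.exp(-x))
    lo, hi = 1e-6, 10.0 * a          # f(lo) < 0 and f(hi) > 0 when a > 1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return math.ceil(hi / (alpha_L * tau_r))

print(min_M_prime(a=5.0, alpha_L=1.0, tau_r=0.1))
```

With a = βFN/V_T = 5 and α_Lτ_r = 0.1, the cell needs a burst of about fifty pulses to climb over its own group's linking envelope.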

4 Multiple Receptive Field Inputs

The pulse-coupled neural network is a dendritic model. The inputs from the receptive fields enter along the length of the dendrite, the linking modulation occurs at the point of entry, and the dendritic signal flows toward the cell body. There can be many inputs. The internal activity U is in general of the form

U = F(1 + β₁L₁)(1 + β₂L₂)(1 + β₃L₃) ⋯ (1 + β_nL_n).  (18)

This is for one dendrite. A cell can have many dendrites. They are summed to form the total input to the cell, and can be excitatory or inhibitory. If the products are carried out, the internal activity has all possible products of all the receptive fields. These are products of weighted sums of inputs, as shown in equations (1) and (2). It is seen, then, that these are general higher-order networks. Eckhorn argues that the inputs far out on the dendrite have small synaptic time constants, while those close to the cell body have large synaptic time constants, so there is a transition from "feeding" to "linking" inputs along the length of the dendrite. The receptive fields can overlap, they can be offset, and each one can have its own kernel defining its weight distribution. Now, a given weight distribution W can give the same weighted response for more than one input distribution. This corresponds to a logical "OR" gate in that sense. The linking modulation uses an algebraic product, which corresponds to a logical "AND" gate. The inhibitory inputs give logical complementation. In this view (Figure 6), each neuron is a large multirule logical system. This property was used to achieve exact scale, translation, and rotation invariance as shown by the simulations discussed later.
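The product form of equation (18) is simple to evaluate; the sketch below (with made-up input values) shows how expanding the product generates all cross terms of the linking inputs, which is the sense in which the linking modulation acts as a higher-order "AND":

```python
from functools import reduce

def internal_activity(F, linking, betas):
    """U = F * prod(1 + beta_i * L_i), equation (18): products of weighted
    sums make this a higher-order network."""
    return F * reduce(lambda u, bl: u * (1.0 + bl[0] * bl[1]),
                      zip(betas, linking), 1.0)

# Expanding the product exposes every cross term:
# F(1 + b1*L1)(1 + b2*L2) = F(1 + b1*L1 + b2*L2 + b1*b2*L1*L2)
U = internal_activity(F=2.0, linking=[0.5, 1.0], betas=[0.1, 0.2])
print(U)   # 2 * (1 + 0.05) * (1 + 0.2)
```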

5 Time Evolution of Two Cells

This section shows how to follow the time evolution of the pulse outputs of a two-cell system. As each cell fires, it can capture the other cell and shift its phase. By constructing an iterative map of the phase shifts from one pulse to the next, the time of firing can be predicted. The map plots the current phase versus the next phase. The simplest form of the map, discussed here, is constrained to one-way linking. There are two cells. The first one has a feeding input F₁ and the second has F₂. The first cell sends a linking input to the second, but not vice versa. It is assumed that the


[Figure 6 diagram: receptive fields feed the dendrite along its length toward the cell body; the nth input updates the dendritic signal as U_{n+1} = U_n(1 + β_nL_n); each dendrite is a logical rule, with the RF weighted sums acting as "OR" and the linking product as "AND".]

FIGURE 6. The linking field model neuron is a multirule logical system. A dendrite receives inputs from many receptive fields along its length. Each input modulates the dendritic signal by the factor (1 + β_nL_n) for the nth input. The receptive fields can give the same signal for more than one input distribution and thus correspond to a logical "OR". The product term in the modulation factors corresponds to a logical "AND". These logic gate correspondences are not exact, but they can be used effectively, as shown by the example discussed in the text. (Reprinted with permission from [3].)

linking pulses are Kronecker delta functions (0 or 1), with no decay tail. The threshold is assumed to recharge instantly by an amount V_T from the point where it intersected the internal activity. In this case the forbidden zone is equal to the capture zone. To form the map, first construct the threshold diagram of Figure 7. Pulses can intersect the internal activity outside the forbidden zone, including on the leading vertical edge of the linking pulse. This then defines an upper trace, where the recharged threshold can begin its decay back down to the internal activity. The upper trace is simply the lower one, raised up by a distance V_T. It is effectively a launch platform from which the recharged threshold begins its downward decay. When the threshold again intersects the lower trace, it recharges and comes to a new location on one of the upper traces at a later time. This generates a mapping from one upper trace to another, and it can be used to make the iterative map with which to follow the time behavior of the system. Let the total length along the trace be X. Note that this consists of a horizontal (H) section followed by a short vertical (V) section corresponding to the


leading edge of the linking pulse (Figure 7). Let the remapped length be Y. If the threshold launches from the horizontal part of X, it can hit either a horizontal or a vertical part of Y, and the same is true for launch from the vertical part of X. The mapping accordingly will be linear (horizontal to horizontal, vertical to vertical), exponential (horizontal to vertical), or logarithmic (vertical to horizontal). There are five distinct cases, depending on where the mapping starts and ends. They are

Case I: HV - HH - VH - VV
Case II: HV - HH - VH
Case III: HH - HV - HH - VH
Case IV: HH - HV - VV - VH
Case V: HH - HV - VV

The iterative map for Case I is shown in Figure 8. It is piecewise continuous and has an upper section and a lower section. All the curve sections can be written parametrically in terms of the inputs F₁, F₂, the time constants α_T, α_L, the linking strength β, the linking period τ_L and pulse period τ_T, the capture zone time period τ_C (which is also the forbidden zone in this case), and the number N of linking periods spanned by the threshold pulse period. The map of Figure 8 can be followed, step by step, by reference to the traces shown in Figure 7. Suppose a pulse begins on the upper trace's horizontal region and maps to the next lower trace's vertical region, following the b-b decay curve of Figure 7, for example. This would be an HV transition in Figure 8. It is reset by V_T to the corresponding upper trace. From there, it decays and hits the horizontal section of the next lower trace, as indicated by the e-e decay curve of Figure 7. This is a VH transition. It is again reset to the upper trace by V_T, decays to a horizontal section through an HH transition (the a-a decay curve of Figure 7), resets to the upper trace, again decays to another horizontal section (HH), resets, and this time maps from a horizontal section to a vertical section (HV) as shown in Figure 8. This follows the two-cell system through one cycle around the phase map of Figure 8. Note that although it has again reached an HV transition, it occurs at a different point than the first HV transition. If the system approaches a limit cycle in Figure 8, this means that the corresponding cell has a periodic pulse train output.

5.1 The Linking Decay Tail Is an Unstable Region

A geometrical argument can be used to show that the linking decay tail is an unstable region. Suppose there are two mutually linked cells, both fed by the same input F. Then they pulse at the same basic frequency. Now suppose that they are out of phase such that they link on each other's linking decay



FIGURE 7. Two cells with one-way linking. The top figure shows the threshold diagram for the cell receiving an idealized linking pulse from the other cell. The second cell does not receive linking from the first cell (two-way linking is shown in Figure 9). The threshold recharges from the lower trace by V_T, defining an upper trace as well. When the threshold decays from the upper trace to the lower and then is recharged back to the upper trace, it defines a mapping between upper traces that can be used to track the time evolution of the pulse history of the system. (Reprinted with permission from [26].)

tail, as shown in Figure 9(a): Each cell's threshold intersects the internal activity level of the other cell beyond the capture zone. Consider first cell #1. It links on the decay tail of the linking input from cell #2 at point A₁, recharges to the upper trace, decays, and links again at point B₁. The diagram shows a composite trace combining the upper and lower traces for cell #1, with points A₁ and B₁ both on it. A similar composite trace holds for cell #2. Now consider both cells, as shown in Figure 9(b). The difference A₂ − A₁ is the change in time separation between the firing of the two cells. Due to the difference in the height of the linking trace at points A₁ and A₂, A₂ − A₁ will in general not be zero. (There is a single point on the decay tail where this difference is zero, but it is an unstable point.)


FIGURE 8. Iterative map. The horizontal axis is the total distance along the upper trace of Figure 7, from which the threshold can begin its decay, and the vertical axis is the distance along the upper trace where the pulse returns after it has recharged. There are five distinct cases, and each case is defined by the particular values of the two-cell system and its two feeding input strengths. For each case there are four possible transitions, HH, HV, VH, VV, corresponding to the initial and final locations on the traces of Figure 7. H indicates horizontal, V indicates vertical. These transitions are discussed in the text. (Reprinted with permission from [26].)

It is clear from the diagram that the firing time B₁ of cell #1 will move closer to the leading edge of the linking pulse from cell #2 by an amount A₂ − A₁. The same is true for B₂. The cells constantly try to catch up with each other by firing more frequently, but each one's gain helps the other one gain more, and the overall effect is that they repel each other out of the decay tail region. After several cycles, one of the cells' thresholds will decay into the leading edge of the linking pulse from the other cell and thus will fire at essentially the same time as that cell. Since both have the same


feeding input, they will be phase locked together from this time on. This shows how two cells with the same feeding input will always become phase locked together, regardless of their initial phase difference, due to the finite decay tails of the linking pulses.
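This capture-and-lock behavior can be reproduced in a small simulation. The discretization and every constant below are our own illustrative choices, not values from the chapter: two cells share the same feeding input F, exchange linking pulses with finite decay tails, and the initially lagging cell ends up captured one time step after each firing of the leading cell.

```python
import math

F, beta, V_T = 1.0, 0.5, 4.0          # illustrative constants
alpha_T, alpha_L, dt = 0.5, 1.5, 0.001

theta = [0.0, 1.2]                    # unequal start: cell 1 lags cell 0
L = [0.0, 0.0]                        # linking input seen by each cell
fires = ([], [])
for step in range(10000):
    t = step * dt
    Y = [False, False]
    for i in (0, 1):
        U = F * (1.0 + beta * L[i])   # internal activity U = F(1 + beta*L)
        if U > theta[i]:              # step-function pulse generator
            theta[i] += V_T           # instant recharge by V_T
            fires[i].append(t)
            Y[i] = True
        theta[i] *= math.exp(-alpha_T * dt)
    for i in (0, 1):                  # each cell receives the other's pulse
        L[i] = L[i] * math.exp(-alpha_L * dt) + (1.0 if Y[1 - i] else 0.0)

# Firing-time difference at each joint firing: pinned at one time step.
print([round(b - a, 4) for a, b in zip(fires[0], fires[1])])
```

Each printed entry is the lag between corresponding firings of the two cells; it stays at a single time step, i.e., the pair is phase locked.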

6 Space to Time

Consider a group of weakly linked neurons. Suppose at time zero all the neurons fire together. As time goes by they will occasionally link in different combinations, as illustrated in Figure 10. Each neuron has its own basic firing rate due to its particular feeding input. Suppose further that at time T the neurons' combined firing rates and linking interactions cause them all to fire together a second time. This duplicates the initial state at time zero. Then everything that happened during time T will happen again in the same order, and all the neurons will fire together again at time 2T. This will continue, resulting in periodic behavior of the group with period T. The assumption of a single exact repetition of a given state (all the neurons fire together, for example) leads to the conclusion that everything that happened between the repetitions must necessarily happen again in the same order, in a permanently periodic way, for every neuron in that group. If all the outputs of the group are linearly summed, the result will be a single periodic time series that is the signature of that spatial input distribution. This is the time series S(t) for that group of neurons [7]. The length of time required for periodicity is primarily governed by the ratio τ_C/τ_typ, where τ_typ is the characteristic pulse period of the input image. (For large β the ratio can be much greater than one, in which case the group links on every pulse and is completely periodic.) Two other factors that promote periodicity in a two-neuron system are linking in quasiharmonic ratios and linking on the decay tail of the linking pulses. For quasiharmonic pulse rates such that

|mτ₂ − nτ₁| < τ_C,   m, n integers,  (19)

the two neurons will periodically link about every mτ₂ seconds. When two mutually linked neurons link on the decay tails of the linking pulses (Figure 9), the cycle-to-cycle behavior is that they actively expel each other from this region into the leading-edge linking region. While both effects promote periodicity, they do not guarantee it. The time required to achieve periodicity, and the overall period length, can be large for large, weakly linked slabs.
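Condition (19) can be scanned for directly. The two periods and the capture-zone width below are hypothetical:

```python
def quasiharmonic_pairs(tau1, tau2, tau_C, max_mn=10):
    """Integer pairs (m, n) with |m*tau2 - n*tau1| < tau_C, equation (19):
    the two neurons periodically link about every m*tau2 seconds."""
    return [(m, n)
            for m in range(1, max_mn + 1)
            for n in range(1, max_mn + 1)
            if abs(m * tau2 - n * tau1) < tau_C]

# Hypothetical periods tau1 = 1.0 s, tau2 = 1.52 s, capture zone 0.1 s
print(quasiharmonic_pairs(1.0, 1.52, 0.1))
```

Here the pair (m, n) = (2, 3) means the two neurons re-link roughly every 2τ₂ ≈ 3τ₁ seconds.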

The following interpretation of the time series relates it to the input image's intensity histogram. The network maps intensity to frequency. The size of an isointensity image patch determines how many neurons fire at that


(a) Threshold diagram for cell #1, showing origin of composite trace diagram.

(b) Interaction of cell #1 and cell #2. B₂ actually occurs in time on the next cycle, at the point (B₂).

FIGURE 9. Two cells each linking on the other's linking pulse decay tail. Upper and lower traces are defined for each cell, and a composite trace is constructed that shows for each cell its map points A and B from one recharging point to the next (a). Both cells have the same feeding input strength F. Figure 9(b) uses the composite traces for both cells to show their interaction. Each cell's second recharging point B shifts the linking pulse time for the pulse that it sends to the other, with the result that both cells' firing points steadily move closer to the leading edge of the linking pulses until one or the other locks in the capture zone. The cells are then phase locked. When finite linking decay exists, as assumed here, this interaction shows that two cells with the same feeding input strength will always become phase locked. (Reprinted with permission from [26]).



FIGURE 10. Formation of a periodic time series. Neurons 1-4 all fire together at t = 0. As time passes, they occasionally link in various combinations. If at time T they again link so as to fire together, the situation will be the same as at t = 0. The system will repeat its behavior, generating a time series. The linear sum of the group's outputs is the periodic time signature of the input distribution to neurons 1-4. (Reprinted with permission from [3].)

corresponding frequency, so patch size maps to amplitude. The image's intensity histogram counts the number of pixels with a given intensity, while the amplitude of a given frequency counts the number of neurons firing at that rate. The frequency spectrum of the time signal is the intensity histogram of the input image as mapped through the sigmoidal response. Although this argument holds exactly only for a system with zero linking, a linked system will generate an intensity-quantized histogram whose envelope generally follows that of the analog input image. This is true for discrete pulse models and for continuous oscillator models, and for any other model where the output frequency is proportional to the input signal strength.
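The histogram correspondence is easy to see in a toy zero-linking example; the 4 × 4 intensity values and the intensity-to-frequency map below are invented for the sketch:

```python
from collections import Counter

# With zero linking, each pixel's neuron fires at a rate set by its
# intensity, so the number of neurons at a given frequency equals the
# number of pixels at the matching intensity: the spectrum's amplitude
# profile is the image's intensity histogram. Toy 4x4 "image":
image = [3, 3, 1, 2,
         2, 3, 1, 1,
         3, 2, 2, 3,
         1, 3, 3, 2]

histogram = Counter(image)                  # pixels per intensity
rate = lambda I: 0.5 * I                    # hypothetical intensity-to-frequency map
spectrum = Counter(rate(I) for I in image)  # neurons per firing rate

print(histogram, spectrum)
```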

For a linked slab, the coherent periodicity of the time signal suggests that there must exist phase factors as well as frequency and amplitude. Suppose that the time signal S(t) is expressed as a sum of delta-function pulses:

S(t) = Σ_n Σ_{k=1}^{K} a_k δ(t − nT − φ_k),  (20)

where T is the periodicity, a_k is the amplitude of the kth subgroup, and φ_k is the time offset of the subgroup of cells with amplitude a_k. The time offset


is between zero and T, and there are K subgroups that are linked into the overall repetition period T. If a Fourier transform is taken, it factors into a sum of complex phases and a sum representing the repetition period:

F.T.(S) = [Σ_{k=1}^{K} a_k e^{iωφ_k}] [Σ_n e^{iωnT}].  (21)

The corresponding "histogram" must in some form include the phases as well as the amplitudes. Other transforms may be more appropriate; the Fourier transform was used here for illustrative purposes. This discussion shows that the geometrical content of an image, as well as its intensity, is encoded in the time signal, and that distance-dependent linking action provides a way to include syntactical information. The time signals are object-specific. They are a signature, or identification code, that represents a two-dimensional image as a time-domain signal that can be processed by the neural net. The signatures have some degeneracy, but this can be an advantage rather than a drawback, since certain classes of degeneracy can also be viewed as invariance.

7 Linking Waves and Time Scales

The linking pulses are transmitted very quickly as compared to the firing rates of the cells. If the receiving cells are within their capture zone, they will be induced to fire upon receipt of the linking inputs, and their output pulses can in turn capture other cells. This causes a wave of pulses to sweep across a region of the slab. The propagation of the wave will follow the underlying feeding input distribution, generally flowing down gradients and firing larger or smaller areas of cells according to how many are within their capture zones. The time profile of the firing history will reflect the shape of the underlying feeding spatial distribution and, for the case of the feeding input being an image intensity pattern, will be related to the geometry of the image, as shown in Figure 11. The repetition rate of a linking wave, e.g., how often it sweeps through an area, is determined by the intensity in that area. On a time scale that shows the linking wave profiles, the profiles can be taken as elementary signatures identifying their areas. On a time scale that compresses the linking wave profiles into a single time bin, the repetition period of each area can be used to segment that subregion of the total image. These segmented areas are in effect "giant neurons," i.e., synchronous groups. The linking still exists, and these groups transmit and receive composite linking pulses. They have their own group capture zones and behave like single neurons in many ways, with the exception that their output pulse is no longer a binary 1 or 0 but instead has an amplitude that is equal to the number of individual cells comprising the synchronous group.


Accordingly, group linking waves can exist. This is discussed in the next section. The time profile on this scale is the signature of the group of linked groups, and on yet another still-larger time scale the repetition period of the group of groups can be used to segment it into a supergroup. At this point the interpretation from an image processing standpoint is that the syntactic information of a large composite image has been encoded into an object-specific signature for that image. In principle, further time scales can be incorporated indefinitely in a self-similar manner, leading to groups of supergroups, supergroups of supergroups, and so on, each having its own time signature and segmentation time scale. This is indicated by Figure 12. It reduces the fundamental problem of image understanding to one of time correlation of time signatures, which may be a solvable problem. It has implications for how the brain works to send and receive signals. The Eckhorn linking field and in general all higher-order networks, when used with pulsed neuronal models, provide a specific mechanism to generate the essential time signals that carry syntactic information about arbitrary spatial distributions.

8 Groups

On a time scale that segments groups of cells, multiple pulses occur even for very weak linking strengths. Consider an idealized situation (Figure 13) where there are two groups, A and B, with A and B cells, respectively. Assume for simplicity that each group sends a linking pulse of amplitude A′ or B′ to the other. Look at a cell in group A. Let M′_A be the number of multiple pulses of group A. Then equation (17) gives an estimate M′_A = βF_A A′/V_T for large numbers of multiple pulses. The repetition period of group A is longer than that for an individual cell because its threshold must rise via multiple pulses within the group to overcome A′. Approximately, it can be obtained from equation (12) by substituting M′_A V_T for V_T. Now look at the linking inputs, and write the total internal activity:

U_A = F_A(1 + β(A′Y_A + B′Y_B)).  (22)

The Y's give the moments in time when the groups' pulses occur, each at its own characteristic period. The groups A and B rescale all their characteristic times in proportion to the group sizes. The capture zone for group A with respect to group B, for example, is now

τ_C = (1/α_T) ln(1 + βF_A B′ / (M′_A V_T)) ≈ (1/α_T) ln(1 + B′/A′),


[Figure 11 panels: the wave-amplitude signature on the shorter time scale, and the segmentation on the longer repetition time scale.]

FIGURE 11. Linking waves. An elementary region generates a linking wave that sweeps through it. The time history of the wave amplitude as summed over the region depends on the geometry of the area and is its signature. The repetition rate of the wave defines a time scale on which the elementary area can be segmented. (Reprinted with permission from [26]).

and the decay time of group A is

T_A = (1/α_T) ln(1 + M′_A V_T / F_A) = (1/α_T) ln(1 + βA′).  (23)


[Figure 12 panels: elementary image patch, image feature, composite object.]

FIGURE 12. Time scales. Linking waves for elementary areas make signatures for them. On a time scale where these areas are segmented, the signatures are compressed into a single time bin and become a composite pulse. The composite pulses link as groups (see Figure 13) to make linking waves on a group of elementary groups. The time history of the amplitude of these waves is the signature for the group of groups. Increasing the time scale so that these signatures are in turn compressed into a single time bin leads to supergroups, which in turn link together and form linking waves on that time scale. The process continues, leading to signatures for entire images as suggested by the figure. (Reprinted with permission from [26].)


The period of group A is the sum of the time required for the pulse burst and the decay time. This is a major change from the operation in the single pulse regime. There, the period depended on the individual cells' feeding inputs, while here it depends on the linking input from its own group. Since that linking input will be proportional to the area of the group and not its intensity, the behavior of a system of groups in the multiple pulse regime is driven by the sizes of the areas rather than only by their intensities. The intensity, however, will partially control the number of pulses in the bursts from each group (see equation (17)) and thus will enter into the period via M′. The size of the capture zone is still a function of the linking input, so the ratio of it to the group's period will determine the degree of linking among groups. This ratio can still be small, which defines the linking to be in a "weak linking" regime again. Even though the system emits multiple pulses and synchronous bursts, it is still in a "weak linking" mode on this larger time scale of group interactions. The system for groups is scaled in proportion to the number of cells in each group (with allowance for multiple pulses), giving a larger time scale on which linking among groups occurs, but in the same way as linking occurs for individual cells. This is illustrated in Figure 13.
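The logarithmic dependence of the group decay time on the group linking amplitude, equation (23), can be sketched numerically; α_T, β, and the A′ values below are illustrative:

```python
import math

def group_decay_time(alpha_T, beta, A_prime):
    """Decay time of group A from equation (23):
    T_A = (1/alpha_T) * ln(1 + beta*A'), using M'_A * V_T ~ beta*F_A*A'."""
    return math.log(1.0 + beta * A_prime) / alpha_T

T_small = group_decay_time(alpha_T=1.0, beta=0.2, A_prime=50)
T_big = group_decay_time(alpha_T=1.0, beta=0.2, A_prime=100)
print(round(T_small, 3), round(T_big, 3))
```

Doubling A′ lengthens the decay time, but only logarithmically.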

9 Invariances

If there are symmetries in a receptive field weight pattern such that any two parts of it are the same, then an exchange of the corresponding parts of an input image distribution will not change the sum of the product of the field and the image. The exchanged parts of the input image will still be multiplied by the same weight values, because the weight pattern was the same in those two regions. The exchange symmetry of the weight pattern makes the output of that field invariant against the corresponding exchange operation acting on the input image. This is because the neuron's output is determined by the internal activity U_j, which is a function of the feeding and linking inputs. They, in turn, are weighted sums. In general, if the image changes in a way that fits the symmetries of the feeding and linking receptive fields so that the internal activity doesn't change, then the neuronal output will be invariant against those changes. The utility of this is that the symmetries of the receptive fields then correspond to invariances of the time signal generated by the input image [7] because the time signal is driven by the internal activity. This is a very general principle. It can be used to make desirable time signal invariances by an appropriate choice of receptive field symmetries. The pulse-coupled network produces time series that encode in their phase structure the two-dimensional spatial input distribution, including its geometrical connectivity relationships. Symmetries



FIGURE 13. Group linking. Two groups A and B send linking pulses to each other. Their thresholds must recharge to a height that exceeds their own group action (Figure 5), and so they reach heights approximately equal to their group linking amplitudes. These are much greater than for single-cell recharging. But the inter-group linking pulses are also much larger, and as a result the relative heights of both the thresholds and the linking inputs scale with group size. The ratio of the capture zone and the group periods can still be small, giving effectively "weak linking" despite the presence of multiple pulses. The detailed structure of the amplitudes is shown in Figure 5; it is simplified here for clarity. (Reprinted with permission from [26]).

can be introduced in the receptive fields to make the time signature of an image invariant against translation, rotation, and scale. Simulation results also show that there is a significant insensitivity to scene illumination and distortion, and further that there is some limited insensitivity to changes in the overlying patterns (shadows) on a given image shape.

The design objective is to make the internal activity invariant by introducing geometrical symmetries into the receptive field weight pattern. (1) For translational invariance use the same receptive field weight pattern at every neuron. (2) For rotational invariance make the receptive field patterns circularly symmetric. A translated and rotated image then covers a different set of neurons, but due to the translational and rotational symmetry of their receptive fields, sees the same receptive field patterns as before. The time signal is a sum over all the neurons, so it doesn't matter which neurons are used. (3) For scale invariance use an inverse square radial falloff. This does not make the internal activity invariant against distances r, but rather against scale changes as represented by the factor k in the rescaled distance kr. To see this, consider an optical image that is rescaled by a change in the object distance (Figure 14). In this case, the intensity per


image patch is constant. The number of cells affected by the rescaled patch is changed, but not their output pulse frequency. A neuron receiving the input at the rescaled location of the original image patch is driven by the same intensity as the neuron at the original location. For a rescaling factor of k,

Y(kR) = Y(R).

The linking input to that neuron, using an inverse-square kernel, is

L(kR) = ∫₀^{2π} ∫_{ρ₀}^{∞} [1/(kρ)²] Y(k(R + ρ)) kρ · k dρ dθ = ∫₀^{2π} ∫_{ρ₀}^{∞} (1/ρ²) Y(k(R + ρ)) ρ dρ dθ = L(R).  (24)

This removes the scale factor dependence k from the integrand. The lower integration limit ρ₀ is fixed and does not scale, so the above relation is not an exact equality in some cases. This will be discussed below.
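Equation (24) can be checked numerically. The intensity pattern Y, the quadrature, and all values below are our own illustrative choices; note that the inner cutoff ρ₀ is scaled along with the image here, which sidesteps the fixed-ρ₀ caveat just mentioned:

```python
import math

def linking(Y, Rx, Ry, rho0, rho_max, n_r=400, n_t=96):
    """L(R) = integral of (1/rho^2) * Y(R + rho) * rho drho dtheta, via a
    log-radius midpoint rule (rho = rho0*e^u turns the integrand into Y du dtheta)."""
    U = math.log(rho_max / rho0)
    total = 0.0
    for i in range(n_r):
        rho = rho0 * math.exp((i + 0.5) * U / n_r)
        for j in range(n_t):
            th = (j + 0.5) * 2 * math.pi / n_t
            total += Y(Rx + rho * math.cos(th), Ry + rho * math.sin(th))
    return total * (U / n_r) * (2 * math.pi / n_t)

Y = lambda x, y: 1.0 / (1.0 + x * x + y * y)   # hypothetical intensity pattern
k = 2.0
Yk = lambda x, y: Y(x / k, y / k)              # rescaled image, same patch intensity

L_orig = linking(Y, 2.0, 0.0, rho0=0.1, rho_max=50.0)
L_resc = linking(Yk, k * 2.0, 0.0, rho0=k * 0.1, rho_max=k * 50.0)
print(round(L_orig, 6), round(L_resc, 6))
```

The two linking inputs agree, which is the scale invariance the inverse-square falloff is designed to provide.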


FIGURE 14. Geometry for scale invariance. A neuron at R receives a linking contribution from a neuron at ρ. When the image is rescaled, the image patch at R goes to kR and the patch at ρ goes to kρ. Only the latter patch is shown. For the case of an optical image rescaled by a change in the object distance, the intensity per image patch is constant. The object is to design a linking receptive field such that L(kR) = L(R). (Reprinted with permission from [3].)

If the feeding field is a single pixel (this is not essential and is done here only for simplicity), then

F(kR) = F(R).

The internal activity of the rescaled image is thus the same as that for the unscaled image:

U(kR) = F(kR)(1 + βL(kR)) = F(R)(1 + βL(R)) = U(R).  (25)

There is a problem that must be resolved before complete scale invariance is achieved. It appears to be less important for large images on a fine grid of cells, but when the isointensity patch size covers less than approximately 10 × 10 cells in the simulations, it has some effect. The problem is that the local group around a neuron also changes in scale. The linking input due to the local group accordingly varies with scale, making the internal activity change as well. The cause is the fixed inner edge ρ₀ of the linking field. It does not scale. External groups do not have this property because all their boundaries shift accordingly as the image scale is changed. For simplicity consider a neuron at the center of its local patch, which is surrounded by an external patch, making two concentric circles, as shown in Figure 15. Let ρ₀ be the fixed inner edge of the local patch, and Y₁ and Y₂ the pulse activities in the local and external patches, respectively. Then

L = 2π ∫_{p₀}^{r} (Y₁/p²) p dp + 2π ∫_{r}^{R} (Y₂/p²) p dp = 2πY₁ ln(r/p₀) + 2πY₂ ln(R/r).    (26)

Under a scale change, r and R become kr and kR, but p₀ is fixed. The linking input to the center neuron then has a scale-factor dependence proportional to Y₁ ln(k). This is the problem.

The solution is to make the internal activity distinguish between the local and the external groups, and to make both scale-invariant. The local group can be made independent of scale by using a nearest-neighbor receptive field with a fixed outer limit so it fits in the image's characteristic isointensity patch size. To distinguish between local and external groups, however, it is necessary to use the generalized linking field model with multiple linking fields as well as excitatory and inhibitory dendritic inputs. The dendritic signals are summed in the cell body and can be either excitatory or inhibitory. The weighted sums in the receptive fields correspond to fuzzy OR-gates, while the products from the linking modulation correspond to fuzzy AND-gates. This view will be used to construct a "semi-exclusive OR" that lets the neuron distinguish between the local and the external linking inputs. Use two dendrites, each having two linking inputs. One dendrite is excitatory, the other inhibitory. The same linking inputs L₁ and L₂ are used on both, and both are fed by the same feeding input F, but the linking strength coefficients are all different:

Uexc = +α₁F(1 + β₁L₁)(1 + β₂L₂),

Uinh = −α₂F(1 + β₃L₁)(1 + β₄L₂),

Utotal = Uexc + Uinh.    (27)

Choose the α's and β's such that they are all positive and such that the


FIGURE 15. Geometry used to show that the fixed inner radius p₀ of the local group L₁ causes a dependency on the rescaling factor k. The external group L₂ is in the annulus from r to R, while L₁ extends from p₀ to r. (Reprinted with permission from [3]).

total internal activity has the form

Utotal = F(1 + βL₁ + β′[1 − L₁/L₁(max)]L₂).    (28)

For the values β = 0.2, β′ = 0.3, and L₁(max) = 40 used in the simulations, one possible set of coefficients is α₁ = 2, α₂ = 1, β₁ = 1, β₂ = 219/640, β₃ = 1.8, and β₄ = 123/320. L₁(max) is the maximum possible value of the local-neighborhood linking input L₁, and L₂ is a linking input from a larger and more extended receptive field such as the inverse square field. L₁ gives the input from the local group, and L₂ gives the input from external groups that do not contain the neuron being linked. When the entire local group fires, L₁ = L₁(max), and the neuron sees only its nearest neighbors. When the local group is quiet, L₁ = 0, and the neuron can receive the L₂ linking from the external groups. Suppose the rescaled image patch now makes several new adjacent groups out of the local group, all with the same frequency. If they are in phase, the neuron's local group will mask them. If they are not in phase then they will link with the local group through the second linking input and be captured by the local group. Then they


will be in phase, and the local group has effectively enlarged to include them but without altering the internal activity seen by a given neuron. When the outer limit of L₁ is chosen to overlap the inner limit on L₂, the inner boundary of the external group is always the outer boundary of the composite local group, as desired. The system's architecture has translation, rotation, and scale invariance. It is a third-order network, which has been shown [17] to be the minimum order necessary for achieving these invariances all at the same time. An open problem is to derive specific geometrical rules, in terms of the synaptic weights, through equations (1), (2), and the internal activity equation, for these invariances.
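The quoted coefficient set can be verified to collapse the two-dendrite form into the "semi-exclusive OR" form exactly. The following check is an illustration (not from the original text); it expands equation (27) with exact rational arithmetic and compares it term-for-term against equation (28):

```python
from fractions import Fraction as Fr

# Exact-arithmetic check that the quoted coefficients collapse the
# two-dendrite form of equation (27) into the form of equation (28).
a1, a2 = Fr(2), Fr(1)                      # alpha_1, alpha_2
b1, b2 = Fr(1), Fr(219, 640)               # beta_1, beta_2
b3, b4 = Fr(9, 5), Fr(123, 320)            # beta_3 = 1.8, beta_4
beta, beta_p, L1_max = Fr(1, 5), Fr(3, 10), Fr(40)

def u_total(F, L1, L2):
    # Equation (27): excitatory plus inhibitory dendrite
    u_exc = a1 * F * (1 + b1 * L1) * (1 + b2 * L2)
    u_inh = -a2 * F * (1 + b3 * L1) * (1 + b4 * L2)
    return u_exc + u_inh

def u_target(F, L1, L2):
    # Equation (28): the desired "semi-exclusive OR" form
    return F * (1 + beta * L1 + beta_p * (1 - L1 / L1_max) * L2)

# The identity holds for every input because the polynomial coefficients match
for L1 in range(0, 41, 8):
    for L2 in range(0, 41, 8):
        assert u_total(Fr(1), Fr(L1), Fr(L2)) == u_target(Fr(1), Fr(L1), Fr(L2))
```

The constant, L₁, L₂, and L₁L₂ coefficients agree exactly: α₁ − α₂ = 1, α₁β₁ − α₂β₃ = 0.2, α₁β₂ − α₂β₄ = 0.3, and α₁β₁β₂ − α₂β₃β₄ = −0.3/40.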

9.1 Invariance Simulation Results

This model was simulated [3] on a PC. The array size was 33 × 33, and the images were made of five blocks, each with its own intensity, and the blocks rearranged to form the different test images. A cross shape and a "T" shape were used. They differed only in their geometrical arrangement, or syntax, an observation that will turn out to be of vital importance in our discussion of pattern recognition. Each block contained from five to eleven cells on a side, depending on the scale factor, and the background was set to zero in all cases. No noise was added. Analysis of the grid size indicated that reasonable results could be expected down to a 5 × 5 block size for rotation, and the scale increments were chosen so that the blocks varied in size by 5, 7, 9, and 11 cells on a side. The nearest-neighbor linking field for L₁ was a 3 × 3 square (center excluded), while the outer radius of the inverse square linking field for L₂ was fixed at 10 and the inner radius at 1. The simulation's equations were written for discrete time steps using the digital filter form from reference [1]. They are

F = Image(j, k)/255,

Llocal(t + 1) = A₁Llocal(t) + VL L₁(t),

Lext(t + 1) = A₁Lext(t) + VL L₂(t),

θ(t + 1) = A₂θ(t) + VT Y(t),

Y(t) = Step(Utotal(t) − θ(t)),    (29)

where Utotal is given by equation (18). The parameter values were A₁ = exp(−1/t₁), A₂ = exp(−1/t₂), t₁ = 1, t₂ = 5, VL = 5, VT = 20, β = 0.2, β′ = 0.3, L₁(max) = 40, and Image(j, k) was the input image. The results are shown in Figures 16 through 21. The most important result was that the time signatures were object-specific. Each test image generated a distinct periodic time signal that would never be confused with the signal from the other class (cross or "T"). This showed that the pulse-coupled net encoded the images in accordance with their geometrical configuration
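The discrete-time update can be sketched directly. This is a reconstruction, not the authors' code: the kernel shapes (3 × 3 nearest-neighbor L₁; inverse-square L₂ with inner radius 1 and outer radius 10) follow the text, while the plain convolution helper, parameter readings, and the tiny test image are illustrative assumptions.

```python
import numpy as np

def make_kernels(r_out=10):
    """L1: 3x3 nearest-neighbor (center excluded); L2: inverse-square annulus."""
    k1 = np.ones((3, 3)); k1[1, 1] = 0.0
    n = 2 * r_out + 1
    y, x = np.mgrid[-r_out:r_out + 1, -r_out:r_out + 1]
    d2 = x**2 + y**2
    k2 = np.zeros((n, n))
    mask = (d2 >= 1) & (d2 <= r_out**2)
    k2[mask] = 1.0 / d2[mask]
    return k1, k2

def convolve2d_same(img, ker):
    # plain zero-padded 2-D correlation (no SciPy dependency)
    kh, kw = ker.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * ker)
    return out

def run_pcnn(image, steps=30, beta=0.2, beta_p=0.3, l1_max=40.0,
             t1=1.0, t2=5.0, VL=5.0, VT=20.0):
    """Iterate the discrete-time model; return the total pulse count per step."""
    A1, A2 = np.exp(-1.0 / t1), np.exp(-1.0 / t2)
    F = image / 255.0
    k1, k2 = make_kernels()
    Llocal = np.zeros_like(F); Lext = np.zeros_like(F)
    theta = np.zeros_like(F); Y = np.zeros_like(F)
    signature = []
    for _ in range(steps):
        Llocal = A1 * Llocal + VL * convolve2d_same(Y, k1)
        Lext = A1 * Lext + VL * convolve2d_same(Y, k2)
        # internal activity: local group masks the external linking input
        U = F * (1 + beta * Llocal + beta_p * (1 - Llocal / l1_max) * Lext)
        Y = (U > theta).astype(float)
        theta = A2 * theta + VT * Y
        signature.append(float(Y.sum()))
    return signature

img = np.zeros((8, 8)); img[2:6, 2:6] = 200.0   # bright block on dark background
sig = run_pcnn(img, steps=15)                   # the time signal of the array
```

Summing the pulses over the array at each step yields the kind of periodic time signature discussed below; the object-specific signatures of the chapter come from running this dynamics on the block images.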


because both images were built of the same five blocks arranged in different geometrical configurations. Good invariance was achieved for translation, rotation, and scale. The time signatures of the two test images were easily distinguished in all cases except for the smallest rescaled "T" (Figure 17). Its patch size was 5 × 5. A grid coarseness analysis had indicated that below a 7 × 7 size the grid effects would be significant. The rotated "T" images, likewise, were sensitive to these effects, but their signatures were still distinct from those of the cross image (Figure 16) for patch sizes greater than 5 × 5. The rotated "T" images were translated, as well, to fit in the small slab grid of 33 × 33 cells, so Figure 17 also indicates translational invariance.

The images were tested with different scene illumination levels. It was found that their time signatures (Figure 18) were essentially invariant over a factor of two hundred in illumination. This was not expected, as the ratio of the capture zone time to the neuronal period changes in this case. What happens is that the signature period varies, as expected, but the signature itself remains the same. Detailed examination of these runs after the fact gives a possible explanation: The signatures reflect the propagation of linking waves through the scene object. These waves follow gradients, and changes in the overall scene illumination did not change the relative gradient patterns. There was less variation in the signatures due to scene illumination changes than for other image changes.

Figure 19 shows the effect of image distortion. A coordinate transform of the form x′ = x + 0.01xy, y′ = y + 0.01xy was used to approximate an out-of-plane rotation of about 30 degrees with some perspective added. The signatures retained their characteristic forms sufficiently for the cross and the "T" images to still be correctly classified by their signatures. Again, this suggests a close relationship between the image morphology and the time signature. The insensitivity to distortion arises because the signature generation is more of an area effect than an edge or angle effect.

Image intensity overlays were investigated next. The 9 x 9-scale "T" image was altered by transposing the two lower blocks. This would correspond to a shadow across the image, for example. The result, shown in Figure 20, is not invariant, but shows a distinct correspondence of the new signature to the original. Figure 21 shows the effect of combined image changes. Translation, rotation, scale, scene illumination, and distortional changes were made as indicated in the figure. The new signatures were similar enough to the originals for the altered images to be correctly classified as a cross or a "T" by using only the signatures. They are clearly not strictly invariant, but show a substantial insensitivity to the geometrical changes while retaining their object-specific character.


FIGURE 16. Periodic time signatures and invariances for the cross image. The signatures are the periodic part of the total output time signal of the pulsed array. SC is the scale factor and AC is the rotation angle in degrees. Good scale invariance was found for scales over 1:0.46, and for large rotations of 30 and 45 degrees. The five blocks arranged to form the image were scaled from 11 × 11, 9 × 9, 7 × 7, to 5 × 5 block sizes. The 33 × 33 slab had a background intensity level of zero. Grid coarseness effects were expected for 7 × 7 and smaller block sizes in scale, and for 14 × 14 block sizes in rotation. Grid effects were not severe in this image. (Reprinted with permission from [3]).


FIGURE 17. Periodic time signature and invariances for the "T" image. Same setup as for Figure 16, but with the five blocks rearranged to form a "T". The signature was very distinct as compared to the first case, showing that the net makes unique time signatures for different images even when they are rearrangements of the same components. The scale invariance was good down to the 7 × 7 block size. The rotated images' signatures still followed the overall "T" signature shape in contrast to the cross signature. Their variation from ideal is strictly due to grid effects. (Reprinted with permission from [3]).


FIGURE 18. Intensity invariance. The 9 × 9 block size images were multiplied by an intensity factor I₀ corresponding to a change in scene illumination. From I₀ = 2 to 0.01 the signature was invariant in its shape, though the period of the signature varied from 13 to 40 time units. (Reprinted with permission from [3]).

10 Segmentation

Image segmentation, the task of partitioning an image into its component parts, may be defined as the process of decomposing a given image F into disjoint nonempty regions, or subimages, R₁, R₂, …, Rk such that

• R₁ ∪ R₂ ∪ ⋯ ∪ Rk = F;

• Rᵢ is connected for all i;

• All pixels belonging to Rᵢ are similar, based on some meaningful similarity measure M;

• Pixels belonging to Rᵢ and Rⱼ (i ≠ j) are dissimilar based on M.
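The first two conditions can be checked mechanically for a candidate segmentation given as a label map (one region label per pixel, so disjointness is automatic). The sketch below is an illustration, with the similarity measure M left abstract; the helper name and the flood-fill approach are my own:

```python
import numpy as np

def regions_valid(labels):
    """Check that every label forms a nonempty, 4-connected region.

    labels: 2-D integer array, one region label per pixel. Coverage and
    disjointness are implicit in the label-map representation.
    """
    for r in np.unique(labels):
        mask = (labels == r)
        if not mask.any():
            return False
        # flood fill from one pixel; the region is connected iff it reaches all
        seen = np.zeros_like(mask)
        stack = [tuple(np.argwhere(mask)[0])]
        while stack:
            i, j = stack.pop()
            if not (0 <= i < mask.shape[0] and 0 <= j < mask.shape[1]):
                continue
            if seen[i, j] or not mask[i, j]:
                continue
            seen[i, j] = True
            stack += [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
        if seen.sum() != mask.sum():
            return False
    return True

good = np.zeros((4, 4), dtype=int)
good[1:3, 1:3] = 1                 # one connected object on a background
bad = np.zeros((4, 4), dtype=int)
bad[0, 0] = bad[3, 3] = 1          # region 1 split into two disconnected corners
```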

In general, image segmentation is a challenging problem. Intensity variations within regions, fuzzy and incomplete boundaries, changing viewing conditions, and the presence of random noise are a few of the factors that make image segmentation a difficult task. In the past, researchers have used classical deterministic and nondeterministic methods, knowledge and rule based systems, and trainable neural networks to build automatic image segmentation systems. A recent survey paper by N. R. Pal and S. K. Pal summarizes many image segmentation techniques reported in the literature [18]. It is obvious that fast and accurate image segmentation is essential to


FIGURE 19. Image distortion. A coordinate transform approximating a 30-degree out-of-plane rotation was used for both test images. Their signatures were still distinct and recognizable as belonging to the correct image classification. (Reprinted with permission from [3]).

obtain meaningful results from image analysis or computer vision systems. The next few sections describe how pulse-coupled neural networks (PCNN) may be used for segmenting digital images.


FIGURE 20. Signature of "T" image with two blocks interchanged. The two lower blocks of the full-scale unrotated "T" image were interchanged, simulating the effect of a shadow moving down the image. The new signature is similar to that of the 7 × 7 block size "T" image and still has an initial peak followed by a valley and then a higher peak. In contrast, the cross image's second peak was lower than its first peak, so this signature would still be classified as a "T" and not a cross. (Reprinted with permission from [3]).

10.1 Modified Pulse-Coupled Neuron

An area is segmented by the PCNN when a linking wave sweeps through it in a time short compared to the overall repetition rate of that area, so the linking activity is the primary process in segmentation. In order to emphasize the linking action, the feeding inputs will be constrained to be small compared to the threshold gain VT. Special attention will be given to the linking strength β and the radius r of the linking field as well. The pulse generator and the dendritic tree are accordingly modified to reflect this emphasis. The number of neurons in the network is equal to the number of pixels in the image to be segmented. For each pixel in the image there is a corresponding neuron. Let Xj and Nj be the jth image pixel and its corresponding neuron, respectively. The segmentation model is as follows:

1. The feeding, or primary, input to Nj is the intensity value of Xj or simply Xj. There are no leaky integrators in the feeding branch of the dendritic tree. If desired, the average intensity of a local neighborhood centered on Xj may also be used as the feeding input to Nj.

2. Each neuron receives a linking input from its neighbors. Let Sj denote the group of neurons that are linked with Nj. Usually, a circular linking field of radius r centered on Nj is used: all neurons that are within a distance r of Nj are linked to Nj. Other neurons are not linked to Nj. The outputs of all the leaky integrators in the linking branch of the dendritic tree decay at the same rate, as determined by the linking field decay time constant αL. The linking contribution of Nk to Nj is given by equation (1).


FIGURE 21. Effect of combined image changes. The original images were located at coordinates (16,16) with scale factors of unity, unrotated, and with no distortion (RD is the approximate out-of-plane rotation). The signatures were sufficiently insensitive to the combined changes for the images still to be correctly classified. (Reprinted with permission from [3]).

Usually, the weights Wkj are inversely proportional to the distance or to the square of the distance between Nj and Nk.

3. The feeding input Xj and the linking input Lj are combined via equation (3) to produce the total internal activity Uj(t) for the neuron Nj. At present, the value of β is the same for all neurons for a given image. However, it may be ultimately desirable to use different values of β for different regions, based on the regional intensity distribution. Then β can be viewed as an adaptive weight that adjusts to each image region for optimum segmentation.

4. The pulse generator of the neuron consists of a step-function generator and a threshold signal generator. The output of the step-function generator Yj(t) goes to 1 when the internal activity Uj(t) is greater than the threshold signal θj(t). This charges the threshold according to equation (4). Since VT is much larger than Uj(t), the output of the neuron changes back to zero. The pulse generator produces a single pulse at its output whenever Uj(t) exceeds θj(t). There are two major differences between this model and the original. The latter has the ability to produce a train of output pulses.


The model used here for segmentation produces only one pulse, which is approximated by a unit impulse function. The second difference is in the recharging of the threshold. Because the internal activity Uj(t) is much smaller than the threshold gain factor VT, the recharging is done by setting the threshold to VT rather than to θj(t) + VT. If two successive firings of Nj occur at times t₁ and t₂, then

θj(t) = VT exp(−αT(t − t₁)),  t₁ < t < t₂.    (30)

This new threshold mechanism is equivalent to the old one when the input signal level is much smaller than the threshold gain factor, as can be seen by looking at the pulse period Tj:

Tj = (1/αT) ln(1 + VT/Xj) ≈ (1/αT) ln(VT/Xj),  for Xj ≪ VT.

On the segmentation time scale, neurons corresponding to pixels of each image region are forced to pulse together periodically. The pulse rate of a region is determined by the feeding and linking inputs to its neuron group. Therefore, it is important to understand the mathematics associated with the firing rate of a neuron using the segmentation model approximation of equation (30). Consider first a totally unlinked PCNN. Such a network may be realized by setting the linking strength β to zero. The internal activity of Nj is then simply Xj. Initially, at time t = 0, θj(0) = 0 for all j. Assuming that Xj is greater than zero, all neurons fire at t = 0. From then on, each neuron fires periodically, and the period is determined by the feeding input, VT, and αT. Since VT and αT are constants, the period is a function of the intensity of Xj. The intensity I and the corresponding period T(I) are related by

T(I) = (1/αT)(ln(VT) − ln(I)).    (31)

For a given I, T(I) may be increased by increasing VT or decreasing αT. It is often convenient to express T(I) as a number of decay time constants. The period in number of decay time constants is

τ(I) = ln(VT) − ln(I).    (32)

The plot of τ(I) as a function of ln(I) is a straight line with slope −1 and intercept ln(VT). If τ(I) is known, one can compute τ(aI), τ(I + b), and τ(aI + b):

τ(aI) = τ(I) − ln(a),    (33)

τ(I + b) = τ(I) − ln(1 + b/I),    (34)

τ(aI + b) = τ(I) − ln(a + b/I).    (35)
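The relations above can be confirmed numerically; the values of VT, I, a, and b below are arbitrary illustration choices:

```python
import math

# tau(I) = ln(VT) - ln(I) is equation (32); (33)-(35) then follow algebraically.
VT = 20.0

def tau(I):
    return math.log(VT) - math.log(I)

I, a, b = 4.0, 3.0, 2.0
assert math.isclose(tau(a * I), tau(I) - math.log(a))              # (33)
assert math.isclose(tau(I + b), tau(I) - math.log(1 + b / I))      # (34)
assert math.isclose(tau(a * I + b), tau(I) - math.log(a + b / I))  # (35)

# tau(I) - tau(a*I + b) does not depend on VT: same difference for any gain
def tau_v(I, V):
    return math.log(V) - math.log(I)

d20 = tau_v(I, 20.0) - tau_v(a * I + b, 20.0)
d500 = tau_v(I, 500.0) - tau_v(a * I + b, 500.0)
assert math.isclose(d20, d500)
```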


It is interesting to note that τ(I) − τ(aI + b) is independent of VT. Also, τ(I) − τ(aI) is independent of both VT and I. The approximation of equation (30) makes the system less dependent on the prior activity of the threshold, and its behavior is more strongly governed by the linking.

Now consider the effect of linking. Let t₁, t₂, t₃, … mark the times at which the ith neuron fires. For tᵢ < t < tᵢ₊₁, let I and Lj(t) be the feeding and linking inputs to the neuron, respectively. The linking input increases the internal activity of the neuron from I to I(1 + βLj(t)). Accordingly, in the interval tᵢ₊₁ − tᵢ the period reduces from T(I) to T′(I):

T′(I) = T(I) − (1/αT) ln(1 + βLj(tᵢ₊₁)).    (36)

If the decay rate of Lj(t) is large and much greater than the decay rate of θj(t), the following statements can be made:

1. Lj(t) may be approximated as an impulse train whose magnitudes are proportional to the number of linking input pulses at time t.

2. If a subset of neurons belonging to Sj fires at tj and fails to capture Nj at that time, then the subset will not capture Nj later in the interval tᵢ < tj < tᵢ₊₁. In other words, there is no linking decay tail, and the receiving neuron's output is unaltered if the linking pulse is outside the capture zone (equation (14)).

10.2 Image Segmentation

The image segmentation approach using pulse-coupled neural networks is described in this section. Figure 22 shows an image consisting of two regions R₁ and R₂. Spatially connected object pixels form R₁. Similarly, spatially connected background pixels form R₂. Perfect segmentation is possible if there exists a linking radius r and a linking coefficient β that will force all neurons belonging to Rᵢ to pulse together periodically with period Tᵢ. Of course, T₁ is not equal to T₂.

If all pixels of R₁ are of intensity I₁ and all pixels of R₂ are of intensity I₂, the segmentation problem becomes trivial. A pulse-coupled neural network with β equal to zero will do the job. Neurons of R₁ will fire together at times t = nT(I₁), where n is an integer greater than or equal to zero.

In practice, image segmentation is not this simple. Images that consist of two regions will have bimodal histograms. Assume that [I₁, I₂] and [I₃, I₄] are the intensity ranges of the background (R₂) and object (R₁) pixels, respectively. If I₃ > I₂, simple thresholding can be used to achieve perfect segmentation. When I₃ < I₂, thresholding techniques do not produce a perfect result. Optimal thresholding techniques minimize or attempt to minimize the error. The error may be defined as the number of pixels incorrectly classified during segmentation. The presence of linking inputs


FIGURE 22. An example of a perfect image segmentation. (a) input image; (b) segmented object region; (c) segmented background region. (Reprinted with permission from [27]. © IEEE 1995.)

makes pulse-coupled neural networks fairly insensitive to noise and minor local intensity variations. As a result, the PCNN is expected to produce better segmentation results.

Consider the segmentation of the digital image in Figure 22. Assume I₂ > I₃ and I₁ > 0. At t = 0, all neurons fire and charge the outputs of all the threshold units to VT. The group of neurons corresponding to object pixels of intensity I₄ fire first at time t₁ = T(I₄). This type of firing, which is mainly due to the feeding input, is called natural firing. The natural firing at t₁ leads to the following:

1. Object neurons for which the following inequality is true are captured


at t = t₁: Xj(1 + βLj(t₁)) > I₄.    (37)

Subscript j is used to represent object pixels and neurons.

2. Background neurons for which the following inequality is not true are also captured at t₁:

Xk(1 + βLk(t₁)) < I₄.    (38)

Subscript k is used to represent background pixels and neurons.

3. Object pixels not captured at t₁ fire in several groups after t₁. The number of groups and the exact time at which each group fires are determined by the intensity distribution of R₁, β, and r.

4. Neurons corresponding to background pixels of intensity I₂, which are not captured so far, fire at t₂ = T(I₂). This primary firing has no effect on neurons that have already fired (VT is large compared to the image intensity). However, all background neurons that are in the capture zone of this primary firing will fire at t₂:

Xk(1 + βLk(t₂)) > I₂.    (39)

Other background neurons organize into several groups and fire after t₂.

If inequality (37) is true for all Nj (object neurons), and inequalities (38) and (39) are true for all Nk (background neurons), the input image is perfectly segmented even when I₂ > I₃. The value of the linking input to Nj, Lj(t₁), depends on the composition of Sj and the number of fired neurons at t₁. For pixels like P₁, where all members of Sj are object neurons, Lj(t₁) is relatively large. For pixels like P₄, where Sj consists mostly of background pixels, Lj(t₁) is small. Let Lmin1 = min Lj(t₁), Lmin2 = min Lk(t₂), and Lmax2 = max Lk(t₁). It is obvious that the values of Lmin1, Lmin2, and Lmax2 depend on r and the object-background boundary geometry. All three increase in value as r increases. However, the rate of increase varies depending on the boundary geometry. Perfect segmentation of the input image is possible if there exist β and r such that the following inequalities are true:

I₃(1 + βLmin1(t₁)) > I₄,    (40)

I₂(1 + βLmax2(t₁)) < I₄,    (41)

I₁(1 + βLmin2(t₂)) > I₂.    (42)

The above conditions, when satisfied, guarantee a perfect result for the worst case.


However, the solution may not be unique: perfect segmentation is not always possible. Inequality (40), when not true, leads to the fragmentation of R₁. Similarly, if inequality (42) is not true, R₂ gets fragmented. Some background neurons (perhaps those near the object boundary) fire with object neurons, making R₁ look larger than its actual size when inequality (41) is not true. A challenge is to find optimal parameters β* and r* that minimize the error. The determination of β* and r* requires adaptation and is not addressed in this chapter.

10.3 Segmentation Results

A pulse-coupled network was simulated on a SUN workstation. A number of real and artificial images were used. The study focused on the effects of intensity variation within regions, extent of intensity overlap, noise and smoothing, and boundary geometry.

Each artificial test image, an array of size 64 × 64, consisted of two regions, an object and a background. The object was a 32 × 32 subimage located at the center of the image. The object's intensity range was [I₃, I₄]. The remaining pixels of the image formed the background, and its intensity range was [I₁, I₂]. The object intensity range overlapped the background intensity range: I₄ > I₂ > I₃ > I₁. Since the object was rectangular, the boundary geometry was simple to handle. For r = 1 only four pixels (top, bottom, left, right) were in the linking field. It can be shown for that case that Lmin1 = 2, Lmin2 = 3, and Lmax2 = 1. Perfect segmentation is possible if β is in the range [β₁, β₂], where

β₁ = max[(I₄/I₃ − 1)/2, (I₂/I₁ − 1)/3],    (43)

β₂ = I₄/I₂ − 1.    (44)

If β₂ is not greater than β₁, then perfect segmentation is not possible. Note that the solution range of β changes with r. A number of artificial images were created by varying the object and background intensity ranges and the extent of overlap. Figure 22(a) shows an input for which I₁ = 100, I₂ = 175, I₃ = 150, and I₄ = 250. From equations (43) and (44) the solution range for β is [1/3, 3/7]. The image was segmented using r = 1 and β = 0.35. The segmented image as determined by the synchronous firing of neurons is shown in Figures 22(b) and 22(c). The PCNN gave a perfect result because a solution range for β existed.
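Equations (43) and (44) are easy to exercise on the quoted example; the helper below (an illustration, with a hypothetical function name) reproduces the [1/3, 3/7] range for the Figure 22(a) intensities:

```python
# Solution-range computation of equations (43)-(44) for the r = 1 case, using
# the worst-case linking values quoted above (Lmin1 = 2, Lmin2 = 3, Lmax2 = 1).
def beta_range(I1, I2, I3, I4):
    beta1 = max((I4 / I3 - 1) / 2, (I2 / I1 - 1) / 3)   # equation (43)
    beta2 = I4 / I2 - 1                                  # equation (44)
    # If beta2 <= beta1 the range is empty: no beta gives perfect segmentation
    return (beta1, beta2) if beta2 > beta1 else None

# The example of Figure 22: I1 = 100, I2 = 175, I3 = 150, I4 = 250
lo, hi = beta_range(100, 175, 150, 250)
```

The simulation's choice β = 0.35 lies inside this range, which is why the synchronous firing groups recover the object and background exactly.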

If the intensity distribution of the image is such that β₁ is greater than or equal to β₂, a perfect segmentation is not possible. Then the best β can be determined by trial and error. The PCNN was tested using low-resolution TV and infrared (IR) images of tanks and helicopters for this case. Each image consisted of one target in a fairly noisy background. The network successfully segmented each image into background and target.


It is obvious that wide and excessively overlapping intensity ranges have an adverse effect on image segmentation. The segmentation error can be greatly reduced by shrinking the object and background intensity ranges and also by reducing the extent of overlap in the intensity ranges. A reduction in the intensity range reduces the value of β₁. Now more image pixels satisfy the desired inequalities, increasing the number of pixels correctly classified. If the value of β₂ then exceeds the value of β₁, a perfect segmentation is possible. When the spread is due to noise, a smoothing algorithm can be used.

Neighborhood averaging smooths regions but blurs edges. A median filter suppresses random noise and also preserves edges. The PCNN is also capable of smoothing images without blurring the edges. The technique is to run the net and adjust the feeding input intensity of the pixels based on the local neuronal firing pattern. If a neuron Nj fires and a majority of its eight nearest neighbors do not fire, then the intensity is changed as follows:

1. If five or more neighbors are brighter than Xj, c is added to the value of Xj, where c is a small integer constant.

2. If five or more neighbors are darker than Xj, c is subtracted from the value of Xj.

3. If five or more neighbors are of the same intensity as Xj, the threshold signal of Xj is set to the threshold value of its neighbors. This compensates for the phase shift.
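Rules 1 and 2 above amount to a majority-vote nudge of the pixel intensity. A minimal sketch is given below; rule 3 is omitted because it needs the running threshold state, and the function name and test patch are illustrative:

```python
import numpy as np

def smooth_pixel(patch, c=1):
    """Adjust the center of a 3x3 intensity patch whose neuron fired alone.

    patch: 3x3 array of intensities; c: small integer step (rule constant).
    """
    center = patch[1, 1]
    neighbors = np.delete(patch.flatten(), 4)   # the eight surrounding pixels
    if np.sum(neighbors > center) >= 5:
        return center + c                       # rule 1: pull up toward majority
    if np.sum(neighbors < center) >= 5:
        return center - c                       # rule 2: pull down
    return center

patch = np.array([[90, 92, 91],
                  [93, 40, 90],   # dark speck of noise in a bright region
                  [92, 91, 90]])
```

Because each pass moves an isolated outlier by only the small step c, repeated passes pull noise pixels toward their neighborhood without moving pixels that sit on genuine edges, where no five-neighbor majority exists.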

A 128 × 128 image of Bambi, shown in Figure 23(a), was smoothed using the neighborhood average, a median filter, and the PCNN algorithm. The smoothed images are shown in Figures 23(b), 23(c), and 23(d). The PCNN filtered the noise without affecting the edges. In comparison, the neighborhood average blurred the edges. The median filter broke some edges and merged parallel lines running close to each other by filling in the dark spaces that existed between them. The PCNN performed better than the other two methods.

Theoretical results and simulations show that pulse-coupled neural networks can be used for segmenting digital images. The possibility of obtaining a perfect result even when the intensity ranges substantially overlap is a new and exciting result. The net can also be used to filter random noise without blurring edges. Since the network is compatible with electronic and optical hardware implementation techniques, it is a strong candidate for real-time image processing.


FIGURE 23. An example of image smoothing. (a) input image; (b) image after smoothing with PCNN algorithm; (c) image after neighborhood smoothing; (d) image after median filtering. (Reprinted with permission from [27]. © IEEE 1995.)

11 Adaptation

The Eckhorn linking field model contains synaptic weights but does not require a specific learning law. Any learning law, or none, can be used. (The Hebbian decay learning law is too rudimentary and is not considered here. It fails to retain the adapted weights after learning is complete. More realistic models such as the Grossberg competitive law [14] or a saturable law [19], either associative or causal, are more useful.) Any synaptic weight in the linking field model can be made adaptive, but for simplicity only the feeding field weights will be considered. The linking field weights will be fixed as the inverse square pattern in order to retain the invariance


properties discussed earlier. Suppose a wave of pulses sweeps over a region in which the feeding weights are adaptive (Figure 24). As the wave passes over a given cell, it is turned on and receives feeding input pulses. These weights adapt, memorizing the local pattern of the wave crest around the cell. The cells that had been active just prior to this time have been reset, and they are turned off. But the leaky integrator synapses connecting them to the currently on cells still have a residual signal on them, and those connections adapt to that strength. Likewise, the connections from the group of cells that had been active still earlier have an even more decayed signal strength, and the active cell will adapt to them as well. Each time the linking wave sweeps over the cell in question, more adaptation occurs. Whenever it is on, it sees the same pattern of active cells and decayed signals from the previously active cells due to the periodic nature of the established wave pattern. After adaptation is complete, suppose that a cell is stimulated and fires. It recalls the wave-crest pattern in its local neighborhood and also sends a pulse to the cells that had fired next as the wave passed over them after leaving the cell. These connections were adapted during training. The cell forward-biases them through the adapted feeding connections and further gives them an additional input through the linking field channel. This can cause them to fire next, just as the original linking wave had done. The process continues, each wave crest forward-biasing the next, and the slab not only recalls the wave pattern but also sets it in motion again [7]. A time average of the slab's pulse activity then approximately recovers the original spatial distribution that generated the linking wave.
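One saturable, associative update of the kind the text prefers over plain Hebbian decay can be sketched as follows. This is a hedged illustration, not the chapter's equations: the rule's exact form, the learning rate, and the bound w_max are assumptions. A postsynaptic pulse strengthens each feeding weight in proportion to the decayed trace still sitting on its leaky-integrator synapse, saturating at w_max so the adapted weights are retained:

```python
import numpy as np

def adapt(w, trace, y_post, rate=0.1, w_max=1.0):
    """One saturable associative update (illustrative form).

    w: feeding weight vector; trace: decayed presynaptic leaky-integrator
    signals; y_post: 0/1 pulse of the receiving cell.
    """
    return w + rate * y_post * trace * (w_max - w)

w = np.zeros(5)
# A periodic wave re-presents the same decayed trace pattern on every sweep;
# repeated updates drive the weights toward a copy of that pattern's ordering.
trace = np.array([1.0, 0.6, 0.36, 0.2, 0.1])   # decayed signals, newest first
for _ in range(200):
    w = adapt(w, trace, y_post=1.0)
```

The (w_max − w) factor is what makes the law saturable: unlike Hebbian decay, the weights approach a bound and stay there after training stops, so the memorized wave-crest pattern survives.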

The waves are binary fringe-like patterns very similar in appearance to holographic fringes. This suggests that it may be possible to store many wave patterns in an adaptive slab in the same sense that many holograms can be superimposed on a single photographic plate. It may be possible to have a slab with relatively few adaptive interconnects and to use the linking modulation to fill in the patterns when they are recalled. Figure 25 shows some wave patterns generated by a light square (lower left) and a light spot (lower right) on a light background.

The network stores and recalls the traveling waves. It can also do the same for sequences of images. Use a distribution of feeding time constants such that some of the feeding synapses have very long decay times. Present one image of a sequence and allow its linking waves to become established and memorized, and then do the same for the next image of the sequence. Some of the synaptic connections will overlap the images in time. Now when the first image is recalled, those connections will also stimulate the wave pattern of the next image, and it will be recalled in turn. This is the mechanism used in the time sequence memory model of Reiss and Taylor [4], except that pulses are used here. In that model an intermediate slab with leaky integrator decay characteristics was used to provide the

46 Johnson, Ranganath, Kuntimad, and Caulfield

[Figure 24 panel annotations recoverable from the original: linking wave, linking modulation, distance, adaptive bias, wave direction.]

FIGURE 24. Adaptation. (a) A linking wave sweeps over a cell, turning it on. Its feeding synapses adapt to the current wave pattern and also to the decayed inputs from previously on cells whose signal is still present on the leaky integrator synapses connecting them to the on cell. (b) After adaptation the cell fires. It recalls the wave-crest pattern and forward-biases the cells that need to fire next in order to recreate the wave motion. It also sends a linking modulation to them. The wave crest that should fire next can be stimulated in preference to the one that fired previously, and the wave motion as well as the wave-crest shape can be regenerated. (Reprinted with permission from [26].)

overlap in time, and then adaptively associated with the current input image. Then when the first few images of the sequence were applied to the adapted system, they formed the decaying time overlap image, which in turn recalled the next image in the sequence. It was then fed back to the intermediate slab to make the next overlap, and so on, until the entire sequence had been recalled.

Consider a slab on which several wave patterns have been adapted, either superimposed or in different locations on the slab. Is it possible to


FIGURE 25. Linking waves from an optical hybrid laboratory demonstration system. The underlying image is a light square (lower left) and a light spot (lower right) on a light background. Coherent, locally periodic linking waves are generated as the system attempts to pulse at a frequency driven by the input intensity at each pixel while also attempting to obey the linking requirement. To satisfy both requirements the waves evolve and bifurcate into complex fringe-like patterns. (Reprinted with permission from [2].)

selectively recall a given pattern using only its time signal as input? This would mean that the slab could access any memory in parallel. Suppose the time signal of one of the encoded patterns is globally broadcast to the entire slab. It will stimulate all the patterns to attempt to regenerate their waves. As they start up, those that have different time signals will interfere with the incoming signal. The pattern with the same time signal will also interfere, since it will not generally be in phase with the incoming signal. None of the patterns will be able to establish themselves. They will continue to compete for resonance with the input. Eventually, the pattern with the matching signal may start up in the right phase. It will establish itself at the expense of the others because it will be locked in with the incoming signal and will proceed to generate its traveling wave pattern. A time average of the slab pulse activity then recovers the original input scene. This argument shows how a pulse-coupled adaptive neural network can in principle achieve parallel memory access. It is recognized that it must be verified before it can be claimed to be a viable mechanism for global recall, but it is a specific possibility.

12 Time to Space

The pulse-coupled neural network generates a time signal that encodes a spatial distribution. Is it possible to make a network that forms a spatial distribution from a time signal? If so, then the cycle would be complete: space to time to space. The time signal is periodic and coherent. The intensity of the input maps to frequency in the time signal, while the geometrical relationships are encoded by the linking into phases in the time signal. The desired mapping should have a frequency coordinate and a phase coordinate for each amplitude component. Wavelet transforms [20], [21], [22] retain both phase and frequency information, so these transforms may be appropriate for the pulse-coupled time signals. Wavelet transforms can be done optically [23]. A way to do it with a third-order linking field is discussed below. It is not required that the resulting spatial distribution be identical to the original one that generated the time signal, but rather that it be reasonably object-specific. Then the time-to-space transform becomes the second half of a spatial remapping transform. The resulting spatial distribution can in turn make another time signal, and so on, so that an input is transmitted from one place to another as a time signal and at each place is operated on by spatial interactions. This is a parallel processor in one sense, but in another sense it is a serial processor like a digital computer. It has the advantages of the parallel processing and adaptation inherent in a neural network, yet it can perform the sequential operations necessary for causal logic operations. It does not need predefined features. It generates its own syntactical features. These are very insensitive to geometrical distortions, yet they can be object-specific. The key is weak linking. In this linking regime it is possible to make periodic, coherent, object-specific time signals, and from them the rest follows.

12.1 A Model for Time-to-Space Mapping

This model uses a third-order pulse-coupled neural network. It consists of two slabs P and Q, as shown in Figure 26(a). The P-slab generates a spatial signal distribution of frequencies in the vertical direction and phases in the horizontal direction. The Q-slab receives a globally broadcast time signal at every cell and a one-to-one input from the P-slab. These are multiplied by a linking modulation in front of each Q-slab model neuron, making it a third-order node (Figure 26(b)). The product of the global time signal input and the P-slab signal input comprise the feeding input to the Q-slab cell. The P-slab consists of rows of horizontally linked cells


[Figure 26 panels: (a) time-to-space network architecture; (b) Q-slab third-order cell with linking modulation and pulse generator; (c) P-slab second-order cell with one-way linking, feeding input I(ν), and pulse generator.]

FIGURE 26. A time-to-space architecture. A two-slab system is used. The P-slab has one-way linking across each row. Just as the last cell in a row fires, the first cell fires again. The length of the row and the feeding input of the row are chosen such that each row has a repetition rate that increases with row number. The P-slab cells are second-order neurons. The Q-slab neurons are third-order cells. A time signal S(t) is globally broadcast to the Q-slab and multiplied by the input from the P-slab at each point. A pulse in the time signal with a given frequency and phase will be coincident with one of the pulses from the P-slab at a location corresponding to its frequency and phase, giving a nonzero feeding input to the Q-slab cell at that location. This produces a distribution on the Q-slab whose geometry is a function of the frequency and phase content of the time signal. (Reprinted with permission from [26].)


with linking only in the forward direction as shown in Figure 26(c). When the leftmost cell in each row fires, a linking wave sweeps across its row. The length of the row is such that the wave reaches the other side at the same time that the leftmost cell fires again. The rows have a feeding input I that increases with increasing row number. The result is that the P-slab sustains horizontally propagating waves along each row that have a repetition rate that increases with increasing row number. Each row represents a different frequency, and the distance along each row represents the phase at that frequency. Consider a time signal input S(t) globally broadcast to the Q-slab. Suppose one of its frequency components ν has phase φ. Then it will be coincident at the Q-slab cell with the P-slab's nonzero input on the νth row and at the φth distance along that row, and the linking product will be nonzero for that Q-slab cell. This construction satisfies the basic requirements for converting a time signal to a spatial distribution.
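The coincidence mechanism can be sketched numerically. In this hedged toy (all sizes, periods, and names are illustrative assumptions, not the chapter's parameters), the P-slab reference is a pulse train per (row, column) = (frequency, phase) location, and each third-order Q-slab node accumulates the product of the global signal and its local reference:

```python
# Hedged sketch of the third-order time-to-space mapping: a Q-slab
# cell's feeding input is the product S(t) * P(t) of the global time
# signal and its P-slab reference pulse train.

def pulse_train(period, phase, steps):
    """1 at times t with (t - phase) % period == 0, else 0."""
    return [1 if (t - phase) % period == 0 else 0 for t in range(steps)]

def time_to_space(signal, periods, steps):
    """Q-slab accumulator: rows index period (frequency), cols index phase."""
    q = []
    for period in periods:
        row = []
        for phase in range(max(periods)):
            ref = pulse_train(period, phase, steps)
            # third-order node: multiply global signal by local reference
            row.append(sum(a * b for a, b in zip(signal, ref)))
        q.append(row)
    return q

periods = [3, 4, 5]          # illustrative row repetition periods
steps = 60
s = pulse_train(4, 2, steps) # "unknown" broadcast signal: period 4, phase 2
q = time_to_space(s, periods, steps)
# The strongest coincidence sits at the row and column matching the
# signal's frequency and phase.
best = max((v, r, c) for r, row in enumerate(q) for c, v in enumerate(row))
```

The accumulator peaks at row 1 (period 4) and column 2 (phase 2), recovering the frequency and phase of the broadcast signal as a location, which is exactly the property the text requires.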

13 Implementations

The nonadaptive pulse-coupled neural network has been implemented as a hybrid optical laboratory demonstration system [2], [7] and as a linear eight-element electronic array. The optical system used a liquid crystal television (LCTV) spatial light modulator from a commercially available projection television to perform the linking modulation. The scene was reimaged to an intermediate focal plane and then sent through the LCTV located slightly beyond the focus so that it was out of focus. This allowed each pixel of the LCTV to modulate a small local area of the input image, effectively forming the linking receptive field by the defocusing circle. The input image was then reimaged into a standard video camera and its signal sent to a framegrabber in a 386 PC. The signal was compared to the current value of the computed threshold in the computer, and an output array was formed that contained a one where the input exceeded the threshold and a zero elsewhere. This array represented the pulses. It was used to update the threshold array, recharging at those pixels that had a pulse output, and then sent through the framegrabber back to the LCTV. A bright pixel there indicated that the neuron for that pixel had fired, and it multiplied the incoming scene to perform the linking modulation for the next processing cycle. Each cycle took about ten seconds, which gave time to examine in detail the traveling linking wave patterns that formed.
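The processing cycle just described (compare the input with a decaying threshold, pulse, recharge) can be caricatured for a single neuron. The decay and recharge constants below are illustrative assumptions, not the laboratory system's values:

```python
# Minimal single-neuron sketch of the pulse generator loop: the input
# is compared with a leaky threshold that recharges after each pulse.
# decay and recharge are illustrative constants, not measured values.

def run_neuron(intensity, steps=100, decay=0.9, recharge=5.0):
    theta = recharge
    pulses = []
    for _ in range(steps):
        theta *= decay                 # threshold decays each cycle
        fire = 1 if intensity > theta else 0
        if fire:
            theta += recharge          # recharge at pixels that pulsed
        pulses.append(fire)
    return pulses

# A brighter input crosses the decaying threshold sooner, so it pulses
# at a higher rate: intensity maps to pulse frequency.
rate_bright = sum(run_neuron(2.0))
rate_dim = sum(run_neuron(0.5))
```

Linking would enter this loop as a multiplicative modulation of `intensity` by neighboring pulses; the sketch omits it to isolate the threshold dynamics.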

The electronic chip array had eight neurons in a linear array. Each was linked to its two nearest neighbors and had a feeding input as well. Four arrays were built. Two were entirely electronic, and two had photodetectors at each cell for the feeding inputs and ferroelectric spatial light modulator pads for outputs. Preliminary tests of the all-electronic arrays showed a pulse output range from 2 Hz to 1 MHz and that the nearest-neighbor linking was active. Further tests are in progress at this time. The optical implementation is attractive in that it allows access to the linking wave patterns for study, but it suffers from the limit of video frame rates. The best that it can do is 30 Hz for the maximum pulse frequency. On the other hand, electronic two-dimensional array architectures are entirely within current technology. The linking field receptive weight pattern can be approximated by a resistive plane or grid that is common to all the cells. It can also have local 3 × 3 linking fields in addition to the larger resistive plane field. Electronic arrays have the major advantage of high pulse rates, at or above the 1 MHz rate already demonstrated. The time signal is the sum of all the pulse activity, so the output can be a single wire. The linking modulation is straightforward, and the pulse generator architecture is electronically simple.

14 Integration into Systems

Two key features of the pulse-coupled neural network are that, first, it requires no training and, second, it can operate very fast. This makes it suitable as a preprocessor: its high-speed parallel operation can decrease the temporal complexity of many problems while producing an invariant output suitable for use by an adaptive classifier or by sequential iconic logical processors. The retina is an example of a preprocessor. It is nonadaptive and so can operate on any visual image. It is a hard-wired processor with parallel, high-speed action. It does immense bandwidth reduction, edge enhancement, noise reduction, and spectral decomposition and transmits the preprocessed results, all in real time. There is some evidence that the human vision preprocessor has further properties in terms of the ability to tolerate significant distortions. For instance, from a 1993 special issue of Science on human vision [24]: "Recognition of objects from their visual images is a key function of the primate brain. This recognition is not a template matching between the input image and stored images like the vision in lower animals but is a flexible process in which considerable changes in images, resulting from different illumination, viewing angle, and articulation of the object, can be tolerated." If the retina does in fact produce the invariant time signals of the pulse-coupled net, a view supported by the fact that the simple symmetries in the nonadaptive receptive fields are the cause of the invariances, then the "tolerance" is in the preprocessor itself.

When viewed as an image preprocessor, the pulse-coupled neural network bridges the most fundamental division in pattern recognition: the division between the syntactical and the statistical approaches. In statistical pattern recognition, the properties (features) of the scene are measured and used to form a multidimensional feature vector in an N-dimensional hyperspace. Each set of measurements forms a vector in the space. If the features form groups (i.e., if they are "good" features), then surfaces in the hyperspace can be found that "optimally" separate the groups. Then a given input feature vector can be classified as belonging to one of the groups. The problem is that the features must be correctly defined, and this has been a major problem in statistical pattern recognition. Syntactical pattern recognition goes beyond statistical pattern recognition by considering, and indeed emphasizing, the relationships among features. Since the number of possible relationships is exponential in N, this is an incomparably richer, more powerful method. It is also much harder: the number of groups is also exponential! But if the geometrical relationships are made independent of the possible geometrical distortions, then the syntactical approach yields a natural grouping method in which the large number of possibilities becomes an advantage rather than a drawback. The pulse-coupled neural nets provide the invariances essential for syntactical pattern recognition. They do this in a surprising way. The features they use are not features of the input pattern. Rather, they are features of the pulse code generated by the net when the image is presented to it. The simulations using a cross and a "T" shape illustrate this. The features are the pulse phase patterns, and they are syntactical: "Where does the bar cross the post?" The image itself no longer is used, only the syntactically derived periodic time signal. This serves as the input to a statistical pattern classifier, and the pattern it classifies is the phase structure of the time signal, not the image pattern.

When a time-to-space mapping is also possible, the pulse-coupled neural network becomes more than a preprocessor. A spatial input I_0 is first transformed into a time signal and then transmitted to another location where it is retransformed into a spatial distribution S_0 again. The new pattern will not necessarily be the same as the original, but since the time signal had invariances encoded into it, the new pattern will also be invariant against the same distortions and so will be of reduced dimensionality in the sense of information content. The information that is lost is information about the distortions. The syntactical information about the geometrical input pattern is preserved, so the new pattern is an idealization or generalization of the original. Now suppose the pattern is again transformed into another time signal, transmitted, and made into a second spatial pattern S_1. It will preserve the syntactical information of the preceding pattern. As an example, consider the information about the scale of an input image. The first transform pair (I_0, S_0) is scale invariant with respect to the pulse phase pattern, but the amplitude of the time signal connecting them was proportional to the area covered by the image I_0, and so the amplitude of S_0 still has an area dependence. However, the second transform (S_0, S_1) will be invariant with respect to amplitude, as shown in the discussions earlier, so S_1 will not depend on the original image area either by phase structure or by amplitude and will be completely independent of any scale effect in the original image. Each successive transform (S_n, S_{n+1}) results in a more invariant pattern. If the time-to-space transform is poorly chosen, this could result in a final pattern that is invariant with respect to everything, including syntax. This is not desirable! On the other hand, it may be possible to choose a time-to-space transform that becomes stable yet still contains the fundamental syntactical information of the original image I_0. If so, then in the asymptotic limit the transform pair will become idempotent: S_N = S_{N+1}. This will be a point attractor, and all the distortions of I_0 that map to it will define its basin of attraction. It will be an idealized, or platonic, icon that represents the object itself rather than a view of the object. The existence of platonic icons is shown by this argument to be critically dependent on the choice of the time-to-space mapping. The repeated transformation process, however, will always make the resultant icon more and more invariant, and since it will always be an icon, there must always be at least some syntactical information in it. Now, whenever there is a spatial distribution in a net, it is possible to perform spatial operations on it via weighted receptive fields. Thus the repeated iconic transforms can undergo processing each time they are mapped to a spatial distribution, making the pulse-coupled neural net into a full processor rather than a preprocessor. Further, since each iconic transform is sequential in time, the system possesses causality. This leads to the view of a powerful processing system combining the capabilities of parallel and serial processing techniques, where information is transmitted as time signals and operated on as spatial distributions.

15 Concluding Remarks

This work begins with the Eckhorn linking field model and then investigates the new regime of weak linking to find the existence of time signals that encode spatial distributions in their phase structure. The signals are generally periodic. They are a signature for the image that generated them. They are a syntactical signature, made by the network itself, and their temporal features are features that are about the image, not in the image. The pulse-coupled nets are general higher-order networks that provide an object-specific and reasonably invariant time signature for spatial input distributions. Multiple time scales exist, and for each time scale at which a signature exists, the next time scale permits segmentation of the part of the image generating that signature. Conditions for perfect segmentation are given and verified through simulations. The time signal may represent a possible means of communication within the brain, a way to transmit and receive information. It is analogous to the characteristic acoustic tone of a given musical instrument, in a sense bestowing a different "sound" on each distinct two-dimensional input image. The musical analogy is reinforced by the observation that pulse frequency harmonics are more stable against noise when linked; i.e., the "harmony of thought" may be literally true [25]. The time signal can be transformed back into spatial distributions and operations performed on them, and these in turn generate another time signal to be sent to other processing areas of the brain. It reduces the basic problem of image understanding to that of correlation on an invariant time signal. Much research remains to be done, but the pulse-coupled model and its time signals are a significant step forward in the understanding of the brain.

16 References

[1] R. Eckhorn, H. J. Reitboeck, M. Arndt, and P. Dicke, "Feature Linking via Synchronization Among Distributed Assemblies: Simulations of Results from Cat Cortex," Neural Computation 2, 293-307 (1990).

[2] J. L. Johnson and D. Ritter, "Observation of Periodic Waves in a Pulse-Coupled Neural Network," Optics Letters 18 (15), 1253-1255 (1993).

[3] J. L. Johnson, "Pulse-Coupled Neural Nets: Translation, Rotation, Scale, Distortion, and Intensity Signal Invariance for Images," Applied Optics 33 (26), 6239-6253 (1994).

[4] M. Reiss and J. G. Taylor, "Storing Temporal Sequences," Neural Networks 4, 773-787 (1991).

[5] R. Eckhorn, R. Bauer, M. Rosch, W. Jordan, W. Kruse, and M. Munk, "Functionally Related Modules of Cat Visual Cortex Show Stimulus-Evoked Coherent Oscillations: A Multiple Electrode Study," Invest. Ophthalmol. Visual Sci. 29 (12), 331 (1988).

[6] R. Eckhorn, "Stimulus-Evoked Synchronizations in the Visual Cortex: Linking of Local Features into Global Figures?" In Neural Cooperativity, J. Kruger (editor). Springer Series in Brain Dynamics. Springer-Verlag, Berlin (1989).

[7] J. L. Johnson, "Waves in Pulse-Coupled Neural Networks," Proc. World Congress on Neural Networks, Vol. 4, p. IV-299. INNS Press (1993).

[8] R. Eckhorn, H. J. Reitboeck, M. Arndt, and P. Dicke, "A Neural Network for Feature Linking via Synchronous Activity: Results from Cat Visual Cortex and from Simulations." In Models of Brain Function, R. M. J. Cotterill (editor), pp. 255-272. Cambridge University Press (1989).

[9] R. Eckhorn and T. Schanze, "Possible Neural Mechanisms of Feature Linking in the Visual System: Stimulus-Locked and Stimulus-Induced Synchronizations." In Self-Organization, Emerging Properties and Learning, A. Babloyantz (editor), Plenum Press, New York (in press).

[10] P. W. Dicke, "Simulation dynamischer Merkmalskopplungen in einem neuronalen Netzwerkmodell," Inaugural Dissertation. Biophysics Department, Philipps University, Renthof 7, D-3550 Marburg (1992).

[11] A. S. French and R. B. Stein, "A Flexible Neural Analog Using Integrated Circuits," IEEE Trans. Biomed. Eng. BME-17, 248-253 (1970).

[12] C. Giles and T. Maxwell, "Learning, Invariance, and Generalization in High-Order Neural Networks," Applied Optics 26 (23), 4972-4978 (1987).

[13] C. Giles, C. Miller, D. Chen, H. Chen, G. Sun, and Y. Lee, "Learning and Extracting Finite State Automata with Second-Order Recurrent Neural Networks," Neural Computation 2 (3), 393-405 (1992).

[14] S. Grossberg, Studies of Mind and Brain, Reidel Publishing Company, Dordrecht, Holland (1982).

[15] S. Grossberg and D. Somers, "Synchronized Oscillations During Cooperative Feature Linking in a Cortical Model of Visual Perception," Neural Networks 4, 453-466 (1991).

[16] N. Farhat and M. Eldefrawy, "The Bifurcating Neuron," Digest of the Annual Optical Society of America Meeting, San Jose, CA, p. 10 (1991).

[17] C. Giles, R. Griffin, and T. Maxwell, "Encoding Geometrical Invariances in Higher-Order Neural Networks," Proc. IEEE 1st Int. Neural Inf. Proc. Syst. Conf., Denver, CO, p. 301 (1987).

[18] N. R. Pal and S. K. Pal, "A Review on Image Segmentation Techniques," Pattern Recognition 26 (9), 1277-1294 (1993).

[19] J. L. Johnson, "Globally Stable Saturable Learning Laws," Neural Networks 4, 47-51 (1991).


[20] I. Daubechies, "The Wavelet Transform, Time-Frequency Localization, and Signal Analysis," IEEE Trans. Inf. Theory 36, 961-1005 (1990).

[21] S. Mallat, "Multiresolution Approximations and Wavelet Orthonormal Bases of L²(R)," Trans. Am. Math. Soc. 315, 69-87 (1989).

[22] C. K. Chui, An Introduction to Wavelets, Academic Press, Boston (1992).

[23] H. J. Caulfield and H. H. Szu, "Parallel Discrete and Continuous Wavelet Transforms," Opt. Eng. 31, 1835-1839 (1992).

[24] K. Tanaka, "Neuronal Mechanisms of Object Recognition," Science 262, 685-688 (1993).

[25] F. H. Rauscher, G. L. Shaw, and K. N. Ky, "Music and Spatial Task Performance," Nature 365, 611 (1993).

[26] J. L. Johnson, "Pulse-Coupled Neural Networks," SPIE Critical Review Volume CR-55, Adaptive Computing: Mathematics, Electronics, and Optics, S. S. Chen and H. J. Caulfield (editors), pp. 47-76, Orlando, FL (1994).

[27] H. S. Ranganath, G. Kuntimad, and J. L. Johnson, "Pulse-Coupled Neural Networks for Image Processing," Proc. IEEE Southeastcon '95, IEEE Press, Raleigh, NC (1995).


Chapter 2

A Neural Network Model for Optical Flow Computation

Hua Li and Jun Wang

ABSTRACT Optical flow computation in dynamic image processing can be formulated as a minimization problem by a variational approach. Because solving the problem is computationally intensive, we reformulate it in a way suitable for neural computing. In this paper, we propose a recurrent neural network model that may be implemented in hardware with many processing elements (neurons) operating asynchronously in parallel to achieve a possible real-time solution. We derive and prove the properties of the reformulation, and we analyze the asymptotic stability and convergence rate of the proposed neural network. Experiments using both test patterns and real laboratory images are conducted.

1 Introduction

Motion perception is one of the essential visual functions of biological organisms. Motion information processing, as generally believed, occurs at a relatively early stage of perception [Sekuler 1975], because a rapid response to a moving object is often more important than precise recognition of what has moved. In addition, the need to search for food and to avoid becoming the prey of other animals demands real-time processing. In this regard, it is not enough to come up with solutions that merely give the correct output for a given input. A solution must be available within milliseconds of the problem's presentation, and actions must be forthcoming within a few hundred milliseconds [Churchland 1992]. So far, the human vision system outperforms any sophisticated computer vision system in motion perception.

Motion detection and motion parameter estimation are challenging problems because a huge amount of image data has to be processed in real time. For example, a typical 512-by-512 black-and-white image sequence has to be processed at the rate of 30 frames per second, or equivalently, about 8 megabytes of image data per second, which is about the size of a telephone book of a city with a population of 300,000. Second, most mathematical formulations and computational models of a biological vision system are ill-posed in the sense of Hadamard. Regularization, which itself contributes to the intensive computation, is needed.
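The 8-megabyte figure follows directly, assuming one byte per pixel:

```python
# Data rate for 512-by-512 8-bit monochrome video at 30 frames/s,
# assuming one byte per pixel.
bytes_per_frame = 512 * 512               # 262,144 bytes
bytes_per_second = bytes_per_frame * 30   # 7,864,320 bytes
megabytes_per_second = bytes_per_second / 2**20
# roughly 7.5 MiB/s, i.e., about 8 million bytes per second
```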

In dynamic image processing, there is often a need to detect motion and estimate motion parameters in real time so that a system (e.g., a robot) can interact with a changing environment. Most existing image processing algorithms for motion applications, however, are too computationally intensive to provide a real-time solution. Recently, biologically inspired algorithms and hardware have been developed for motion-related vision applications. In the area of early vision computing, Poggio and Koch have conducted interesting research [1,14]. Mead et al. have built a resistive network, an "electronic retina," to compute image flow [12]. Recently, many works have been reported, including the design of an analog network for simulating a function of the human visual peripheral processes of motion perception [10], image segmentation [11], simulation of human eye saccadic movement [20], and vertebrate retinal processing [17].

Optical flow, introduced by Gibson in the 1950s [4], is a two-dimensional vector field induced by relative motion between an observer and viewed objects. Under an egocentric coordinate system, the pattern of the flow provides the motion-related information. Based on this theory, Horn [6] and Thompson [18], among others, have developed mathematical models for optical flow computation on a pixel-by-pixel basis. Nagel and Enkelmann [13] have investigated the "smoothness constraint." Kearney et al. [9] have performed an error analysis for optical flow computation. Recently, Wohn, Wu, and Brockett [23] have developed a new iterative transformation technique to compute full image flow. Snyder [16] has shown that Nagel's weight matrix is the only physically plausible constraint and has further derived a general form of the "smoothness constraint."

In this paper, we reformulate the optical flow computation in such a way that the motion information can be mapped to the node states of a recurrent neural network. The computation is distributed over the processing elements. The stabilized activation states of the network represent the solution. We provide a theoretical analysis of the asymptotic stability and convergence of the network. The proposed network can operate asynchronously in parallel. In addition, the regular structure of each processing element makes it possible to implement the proposed neural network in VLSI for real-time processing.


2 Theoretical Background

In order to derive a computational formula suitable for neural computing, we start from the problem formulation.

2.1 Optical Flow as a Minimization of Functionals

Let $E(x, y, t)$ be an image intensity function at position $(x, y)$ and time $t$. By Taylor expansion, it can be derived rather easily that

$$\frac{dE}{dt} = E_x u + E_y v + E_t + o(h),$$

where $E_x = \partial E/\partial x$, $E_y = \partial E/\partial y$, $E_t = \partial E/\partial t$, $u = dx/dt$, $v = dy/dt$, and $o(h)$ is a higher-order term. The problem of finding $u$ and $v$ is ill-posed in the sense of Hadamard. Regularization is utilized to convert the problem to a well-posed one by imposing a smoothness constraint. Therefore, computation of optical flow is formulated as a minimization of functionals [5],

$$\min_{u,v} \iint_\Omega \left[ (E_x u + E_y v + E_t)^2 + \alpha\!\left( \Big(\frac{\partial u}{\partial x}\Big)^2 + \Big(\frac{\partial u}{\partial y}\Big)^2 + \Big(\frac{\partial v}{\partial x}\Big)^2 + \Big(\frac{\partial v}{\partial y}\Big)^2 \right) \right] dx\,dy, \quad (1)$$

where $\alpha > 0$ is a regularization parameter and $\Omega$ is the image plane on which the optical flow is to be computed. From the theory of calculus of variations, the Euler necessary condition of equation (1) gives

$$\nabla^2 u = \alpha (E_x u + E_y v + E_t) E_x, \qquad \nabla^2 v = \alpha (E_x u + E_y v + E_t) E_y, \quad (2)$$

where $\nabla^2 = \partial^2/\partial x^2 + \partial^2/\partial y^2$ is the Laplacian operator. These coupled elliptic partial differential equations give the solution to equation (1). They are subject to a natural boundary condition,

$$\left(\frac{\partial u}{\partial x}, \frac{\partial u}{\partial y}\right) \left(\frac{dy}{ds}, -\frac{dx}{ds}\right)^{t} = 0, \qquad \left(\frac{\partial v}{\partial x}, \frac{\partial v}{\partial y}\right) \left(\frac{dy}{ds}, -\frac{dx}{ds}\right)^{t} = 0, \quad (3)$$

where $s$ denotes the boundary of the image plane $\Omega$, and $(\partial u/\partial x, \partial u/\partial y)^t$ and $(\partial v/\partial x, \partial v/\partial y)^t$ are column vectors.

2.2 Formulation for Neural Computing

Applying the finite difference method to equation (2), we have the following difference equations:


60 Li and Wang

$$\begin{cases} (-4 - \alpha E_x^2)\, u(x, y) + u(x+1, y) + u(x-1, y) + u(x, y+1) + u(x, y-1) - \alpha E_x E_y\, v(x, y) = \alpha E_x E_t, \\ (-4 - \alpha E_y^2)\, v(x, y) + v(x+1, y) + v(x-1, y) + v(x, y+1) + v(x, y-1) - \alpha E_x E_y\, u(x, y) = \alpha E_y E_t. \end{cases} \quad (4)$$

In view of the fact that optical flow computation is almost always performed on a square or a rectangular image, the natural boundary condition can be simplified as

$$S_0, S_2: \quad u(x, y+1) - u(x, y-1) = 0, \quad v(x, y+1) - v(x, y-1) = 0;$$
$$S_1, S_3: \quad u(x+1, y) - u(x-1, y) = 0, \quad v(x+1, y) - v(x-1, y) = 0, \quad (5)$$

where $S_i$, $i = 0, \ldots, 3$, is the boundary of a given rectangular region. Figure 1 illustrates the region that gives the above boundary condition.

From the difference equations and the boundary conditions, we can derive a linear algebraic system $AX = b$. For example, to compute optical flow on a 2 × 2 image, by labeling each pixel from 1 to 4 as illustrated in Figure 1, the matrix equations in Figure 2 are used.

FIGURE 1. Illustration of a 3-by-3 image, upon which a 2-by-2 subregion will be used as the input for optical flow computation. Note that the boundaries ($S_i$, $i = 0, 1, 2, 3$) of this given 2-by-2 region define a rectangular region that simplifies the mathematical manipulation of the boundary condition.
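Equations (4) and (5) fully determine the system $AX = b$, so its assembly can be sketched directly. In the sketch below the row-major pixel ordering, the mirrored treatment of out-of-range neighbors, and the helper names are our own illustrative assumptions, not necessarily the authors' labeling:

```python
import numpy as np

def build_flow_system(Ex, Ey, Et, alpha=0.01):
    """Assemble A and b from equations (4)-(5) for a K-by-K image.

    X stacks the unknowns as (u_1..u_{K^2}, v_1..v_{K^2}); pixels are
    flattened row-major; out-of-range neighbors are mirrored back into
    the grid per boundary condition (5)."""
    K = Ex.shape[0]
    n = K * K
    A = np.zeros((2 * n, 2 * n))
    b = np.zeros(2 * n)

    def idx(x, y):
        return x * K + y

    def reflect(i):
        # boundary condition (5): replace an out-of-range neighbor by
        # its mirror image, e.g. u(x-1, y) -> u(x+1, y) on the boundary
        if i < 0:
            return -i
        if i >= K:
            return 2 * K - 2 - i
        return i

    for x in range(K):
        for y in range(K):
            p = idx(x, y)
            ex, ey, et = Ex[x, y], Ey[x, y], Et[x, y]
            A[p, p] = -4 - alpha * ex ** 2          # u-equation of (4)
            A[n + p, n + p] = -4 - alpha * ey ** 2  # v-equation of (4)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                q = idx(reflect(x + dx), reflect(y + dy))
                A[p, q] += 1.0
                A[n + p, n + q] += 1.0
            A[p, n + p] = -alpha * ex * ey          # u-v coupling
            A[n + p, p] = -alpha * ex * ey
            b[p] = alpha * ex * et
            b[n + p] = alpha * ey * et
    return A, b
```

On a 2 × 2 image every neighbor is mirrored, which doubles the off-diagonal coefficients to 2, the pattern visible in the system of Figure 2.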


FIGURE 2. Matrix equations to compute optical flow on a 2 × 2 image.


3 Discussion on the Reformulation

In this section, we discuss some properties to assist the analysis and computation of optical flow. These properties are given to detail the algorithm construction and to show the nature of the formulation suitable for neural computing. The proofs of most of the properties are straightforward.

Property 1. The dimension of the matrix $A$, $\dim A$, is related to the size of the $K$-by-$K$ square region (image) by $\dim A = 2K^2$, where $K^2$ is the number of pixels of the region (image).

Proof: At each pixel location $(x, y)$, there are two unknowns, $u(x, y)$ and $v(x, y)$, to be determined. From equation (4), two linearly independent equations are needed to solve for them. Therefore, for a given $K \times K$ image $E(x, y, t)$, $x, y = 1, 2, \ldots, K$, there are $2(K \times K)$ linearly independent equations, which results in $\dim A = 2K^2$.

Property 2. The matrix $A$ is a sparse matrix. Except at the boundary, the ratio of nonzero elements to total elements of each row is at most $3/K^2$.

Proof: A constructive method can be employed to prove this property. From $(-4 - \alpha E_x^2)u(x,y) + u(x+1,y) + u(x-1,y) + u(x,y+1) + u(x,y-1) - \alpha E_x E_y v(x,y) = \alpha E_x E_t$ of equation (4), where $x, y = 1, 2, \ldots, K$, a single row of matrix $A$ can be constructed at a time. The nonzero elements of the equation come from the coefficients of the $u$'s and $v$'s. Hence, there are only six nonzero elements regardless of $K$. Since the size of each row is $2K^2$ from Property 1, the ratio of nonzero elements to the total number of elements of any given row is equal to $6/(2K^2) = 3/K^2$.

With these properties, we may check the matrix equations of Figure 2. One should notice that due to the small size of the image (2 × 2), each row of the matrix is affected by the boundary conditions. For example, at pixel position 1 (see Figure 1), $u(x-1, y)$ relates to the $S_1$ boundary. Its coefficient must be determined by following the boundary condition of equation (5). Similarly, $u(x, y+1)$ relates to the $S_2$ boundary, and its coefficient is determined accordingly by equation (5). Property 2 holds except at the boundary conditions. In other words, when $K > 3$, one should be able to observe Property 2 well.

Remark: The matrix is symmetric, and its bandwidth is equal to $K^2$.

The bandwidth of a sparse matrix is defined as the maximum distance, in terms of the number of entries, between two nonzero entries in a row of the given matrix. From Properties 1 and 2 it is easy to observe that the matrix bandwidth is equal to the total number of $u$'s (which is equal to $K^2$). The


symmetry of the matrix comes from the cellular structure of the $u$'s and $v$'s given in equation (4). For example, the computation of $u(x, y)$ involves its neighbors $u(x+1, y)$, $u(x-1, y)$, $u(x, y+1)$, and $u(x, y-1)$.

Based on these properties, we have constructed the matrix $A$ of the linear algebraic system $AX = b$ on a 32 × 32 window, as illustrated in Figure 3. The matrix $A$ has a desirable regular structure, and the nonzero entries are located on five subdiagonal positions and the main diagonal, as predicted by the properties developed above.

4 Choosing Regularization Parameters

Before describing the recurrent neural network model, we need to address the aperture problem, an important problem in image flow computation. It refers to the ambiguity in determining the true velocity using a local motion detector. This ambiguity can be observed from the original formulation, where $\frac{dE(x,y,t)}{dt} = E_x u + E_y v + E_t + o(h)$, which can be rewritten in vector dot product form as $(E_x, E_y) \cdot (u, v) = -E_t$ for $dE(x, y, t) \approx 0$, under the smoothness constraint. This indicates that $(u, v)$ cannot be uniquely determined when $(E_x, E_y)$ is perpendicular to $(u, v)$ [5]. It is generally accepted that any vision system, whether a biological or an artificial system, exhibits the aperture problem [7]. A work by Snyder [16] has shown an interesting result on the smoothness constraints based on Horn and Schunck's original formulation. Recently, Wohn et al. [23] have explicitly defined normal flow and full flow. Their iterative approach starts from the normal flow and successively estimates the full flow until the process converges. In this section, we provide a way of regulating the aperture problem. The regularization parameter $\alpha$ of the "smoothness constraint" in the second term of equation (1) controls the convergence of the image-flow computation.

Property 3. The regularization for handling the aperture problem can be achieved by choosing a proper $\alpha$ such that, for all $i$,

$$f_i(\alpha) = |-4 - \alpha E_x^2| - \sum_{j \neq i} |a_{ij}| > 0, \quad (7)$$

where $a_{ij}$ is an element of matrix $A$.

Proof: This property is a direct result of applying a well-known property from numerical analysis to this particular linear algebraic system. The property states that the main diagonal element of a row should be greater than or equal to the sum of the absolute values of all other elements in

FIGURE 3. Top: The test pattern image with a 32 × 32 window. Bottom: The sparse matrix $A$, of dimension $(2 \times n \times n) \times (2 \times n \times n)$ with $n = 32$, for computing optical flow, constructed by equations (4) and (5).


that row in order to ensure the convergence of iterations (in this situation, it can be very difficult to find the eigenvalues of the given linear algebraic system due to the size of the large sparse matrix) [2]. For the given problem, equation (7) gives

$$|-4 - \alpha E_x^2| \geq \sum_{j \neq i} |a_{ij}|. \quad (8)$$

Following the definition of $f_i(\alpha)$, it is immediately seen that $f_i(\alpha) > 0$ satisfies the above condition. In practice, a small value of $\alpha$ should be chosen, usually in the range of $[10^{-3}, 10^{-2}]$.
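Equations (7)-(8) amount to a row-wise diagonal-dominance check, which is easy to verify numerically (using the non-strict inequality of equation (8)). In the sketch below, the function names and the strategy of scanning candidate $\alpha$ values are our own illustrative choices, not the authors' procedure:

```python
import numpy as np

def diagonally_dominant(A):
    """Row-wise diagonal-dominance check of equations (7)-(8):
    |a_ii| >= sum_{j != i} |a_ij| for every row i."""
    d = np.abs(np.diag(A))
    off = np.abs(A).sum(axis=1) - d
    return bool(np.all(d >= off))

def choose_alpha(build_matrix, candidates):
    """Return the first candidate alpha whose assembled matrix is
    diagonally dominant; build_matrix maps alpha -> A."""
    for a in candidates:
        if diagonally_dominant(build_matrix(a)):
            return a
    return None
```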

5 A Recurrent Neural Network Model

Based on the mathematical formulation, we introduce a recurrent neural network model for optical flow computation.

5.1 The Neural Network Architecture

The proposed recurrent neural network for optical flow computation consists of $2K^2$ massively connected neurons. The state equation of the network can be described by the following vector-form differential equation,

$$C \dot{z}(t) = -W z(t) + \theta, \quad (9)$$

where $C$ is a scalar capacitive parameter, $z \in \mathbb{R}^{2K^2}$ is the activation state vector, $W = A^2$ is the connection weight matrix, and $\theta = Ab$ is the biasing threshold vector of the proposed neural network.

One of the desirable features of this configuration is that the neural network can be implemented in hardware to perform parallel computation. For example, each of the $2K^2$ neurons can be implemented by three operational amplifiers: a summer, an integrator, and an inverter. The connection weight $w_{ij}$ between neurons $i$ and $j$ can be implemented by a feedback resistor $R_f$ and a connection resistor $R_{ij}$ such that $w_{ij} = R_f / R_{ij}$; i.e., $R_{ij} = R_f / w_{ij} = R_f / \sum_{k=1}^{2K^2} a_{ki} a_{kj}$, where $a_{ij}$ is the element in the $i$th row and the $j$th column of $A$. The threshold $\theta_i$ of neuron $i$ can be implemented by a voltage source with the biasing voltage $\theta_i$ [22]. The architecture of the proposed recurrent neural network is shown in Figure 4, where the network has $n = 2K^2$ neurons.

This simple implementation scheme allows the design of a high-density network that can be implemented in VLSI to obtain a possible real-time solution.
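Equation (9) can also be sketched in software by forward-Euler integration; the steady state then solves $Az = b$ when $A$ has full rank. This is only a numerical stand-in for the analog circuit of Figure 4, and the step size, step count, and function name are illustrative assumptions:

```python
import numpy as np

def simulate_network(A, b, C=1.0, dt=1e-3, steps=20000):
    """Forward-Euler integration of equation (9): C z' = -W z + theta,
    with W = A^2 and theta = A b.  The steady state solves A z = b
    when A has full rank."""
    W = A.T @ A        # equals A^2 for the symmetric A of this chapter
    theta = A.T @ b    # equals A b for symmetric A
    z = np.zeros_like(b, dtype=float)
    for _ in range(steps):
        z = z + (dt / C) * (theta - W @ z)   # Euler step of equation (9)
    return z
```

The step size must be small relative to $C / \lambda_{\max}(A^2)$ for the discrete iteration to remain stable; the analog circuit has no such restriction.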


FIGURE 4. The configuration of the proposed recurrent neural network.


5.2 Stability and Convergence Rate

Proposition: The recurrent neural network for solving a system of linear algebraic equations is asymptotically stable in the large; i.e., $\forall z(0)$, $\exists \bar{z}$ such that $\lim_{t \to \infty} z(t) = \bar{z}$.

There is more than one way to prove the above proposition. For example, by a traditional approach we may define an energy function, $L(z) = (Az - b)^t (Az - b)/2$, and prove that $L(z)$ is a strict Liapunov function. Or, we may simply examine the eigenvalue characteristics derived from the proposed recurrent neural network to show the asymptotic stability, as given below.

Proof: Since $A$ is symmetric, $A^2 = A^t A$, and hence $W = A^2$ is symmetric. Therefore, the eigenvalues of $A^2$ are always real. Furthermore, since the eigenvalues of $A^2$ are always nonnegative, i.e., the eigenvalues of $-A^2$ are always nonpositive, the linear neural system is always asymptotically stable in the large.

The proposed recurrent network for computing image flow is essentially a linear dynamic system. According to linear systems theory [8], the convergent trajectory of the activation state z{t) can be described as

$$z_i(t) = \sum_{j=1}^{2K^2} c_{ij}(t)\, e^{-C^{-1} \lambda_j t} + \bar{z}_i \quad (10)$$

for $i = 1, \ldots, 2K^2$, where $\lambda_j$ is an eigenvalue of $W$, and the $c_{ij}(t)$ are constants or polynomials in $t$, depending on the initial condition and the uniqueness of the eigenvalues of $W$. It should be pointed out that there is another interesting result related to this work. That is, recently we have also shown that the steady state of the proposed recurrent neural network represents a solution to the set of simultaneous linear equations (i.e., $AX = b$ with $X = \bar{z}$ by equation (9)) if and only if $A$ is of full rank (i.e., $\mathrm{rank}(A) = \dim(A) = 2K^2$) [21]. The analysis of the optical flow formulation given in the previous section reveals that this condition (the condition of full rank) can be satisfied with a suitable regularization coefficient $\alpha$.

According to linear systems theory [8], the convergence rate of the proposed recurrent neural network is dominated by the term in $z(t)$ with the largest time constant, which corresponds to the smallest eigenvalue of $C^{-1} A^2$, $\min\{C^{-1} \lambda_i;\ i = 1, \ldots, 2K^2\}$. From the engineering point of view, the linear neural system can reach its steady state in $5 / \min\{C^{-1} \lambda_i;\ i = 1, \ldots, 2K^2\}$ seconds. Furthermore, since the positive capacitive parameter $C$ is directly proportional to the stabilization time required by the linear neural system, the convergence rate of the solution process can be controlled by selecting


a sufficiently small capacitance parameter $C$. The convergence rate of the proposed neural network also depends on the regularization parameter $\alpha$. Specifically, the smaller the $\alpha$, the slower the convergence rate, which will be demonstrated in the next section. Therefore, there is a trade-off between the need for regularization (smaller $\alpha$) and the need for faster convergence (relatively bigger $\alpha$).

6 Experiments

In order to demonstrate the characteristics of the proposed recurrent neural network, experiments have been conducted in two phases: experiments using artificially generated test patterns and experiments using real laboratory images.

A pair of test patterns is given in Figure 5. These two patterns were used as two consecutive image frames captured at time slices $t - dt$ and $t$. The second pattern was diagonally shifted by 1 pixel to simulate a motion, and its intensity was slightly altered to simulate random disturbance. $E_x$, $E_y$, and $E_t$ were computed first by using 3 × 3 kernels. With $\alpha = 0.01$, the optical flow was computed by using the proposed recurrent neural network. At the equilibrium state, $z$ gives the vector components $u$ and $v$, for $z = (z_1, z_2, \ldots, z_8)^t = (u_1, \ldots, u_4, v_1, \ldots, v_4)^t$. The computations used to compute the optical flow field are illustrated in Figures 6-8, and the optical flow field determined by the vectors is shown in Figure 9, which illustrates that the flow pattern matches the diagonal motion. In the figure, we define the distance between two diagonally connected pixels to be $\sqrt{2}$, and the computational result matches this definition ($\sqrt{u^2 + v^2}$), where $u$ and $v$ are given from the column vector $z$.

The laboratory images were then used. The image shown in Figure 10 has 256 × 240 resolution with 1 byte per pixel. The object of interest was displaced to a new position to create a motion after the digitization of the first image. A 32 × 32 window was chosen. $E_x$, $E_y$, and $E_t$ were computed within the window before the computation of $z(t)$. Following the criterion in Section 4, the crucial regularization parameter $\alpha$ was made equal to 0.01. The experimental result is given in Figure 11, which agrees with the motion.

7 Comparison to Other Work

Our work described in this paper includes the reformulation of Horn's model for a possible neural network implementation. Horn's original model is based on the optimization of an objective function on a global scale, as


FIGURE 5. Illustrated here are two frames of the small (2-by-2) artificially generated images. Note that the second frame of the image is shifted diagonally at time $t + dt$. Then the partial derivatives $E_x$, $E_y$, and $E_t$ are computed (repeating the boundary elements).

FIGURE 6. The linear algebraic system constructed for solving the optical flow.


FIGURE 7. The computation is performed by using a recurrent neural network. Note that the energy function of the proposed neural network is a strict Liapunov function. It decreases monotonically as the number of iterations increases. The plots of the vector $z$ are shown. Note that it takes a large number of iterations to reach the final result. Since the network can be implemented in hardware, the number of iterations is not really the concern. As pointed out in the study, the speed of convergence can be controlled by choosing a different capacitive parameter $C$ and regularization parameter $\alpha$. But trade-offs have to be made to ensure the "smoothness" constraint.


FIGURE 8. Energy as a function of iteration.

defined by equation (1). As a result of the global optimization, the model is less sensitive to local variation and random noise. The algorithm based on the reformulation is implemented as a recurrent neural network. The network can operate concurrently, in an asynchronous fashion, for potential real-time application. The behavior of the network, such as convergence, convergence speed, and stability, is analyzed. An analog VLSI implementation of the network is possible because of the nice regularity of the network structure.

Our work is based on a mathematical formulation with a smoothness constraint. This constraint is widely adopted in many currently pursued models. The constraint can be further divided into the condition of smooth motion at any given short sampling time interval and the requirement of a smooth change of illumination. Obviously, in real life, the requirement of smooth change of illumination may or may not be satisfied. Therefore, there is a need to develop an illumination-invariant model. The existing Fourier analysis technique is computationally intensive, and it does not provide accurate results. It has been reported recently that Tsao and Chen [19] have proposed a computational model for optical flow computation based on Gabor phase functions. They demonstrated that the proposed method works for synthetic test pattern images.


FIGURE 9. The computational result (image flow with $\alpha = 0.01$).

8 Summary and Discussion

In this paper, we have reformulated the optical flow computation in such a way that the optical flow can be mapped to activation states of a recurrent neural network. The advantage of this proposed approach is that the computation of optical flow is distributed to each simple and regular processing element of the network. The solution of optical flow is provided as the stabilized activation state of the network. The network operates concurrently, and it can be implemented in analog VLSI. Analog VLSI has some remarkable features, which include (1) fast computational speed, (2) lower power consumption, (3) smaller size in silicon implementation, and (4) simpler circuit configurations for realizing the same functionality. But in general, the design of analog VLSI circuit takes a longer time, and the computational accuracy is not as good as the digital counterpart. The current state-of-the-art analog VLSI technology can deliver about 6-bit res-

Page 85: NNPattern

2. Neural Network for Optical Flow Computat ion 73

FIGURE 10. The laboratory scene.

olution (for example, Intel's neural chip, 80170NX, based on CHMOS III EEPROM technology, has 10,240 modifiable analog weights in 4-quadrant analog multiplier synapses with over 6-bit precision).

Compared to the resistive network for optical flow computation in Hutchinson et al. [7], the proposed network here is based on Horn's functional analysis approach. The network can be implemented as a standard recurrent neural network, which is a desirable feature. It should also be pointed out that the massive connections of each neuron to every other neuron in this design may limit the size of images. It can be derived that the number of connections needed for each neuron is on the order of $K^2$. In this study, we have also proven the properties necessary for constructing a recurrent neural network and conducted experiments on both the test patterns and the laboratory images that confirm our theoretical analysis.

Our future work includes the further investigation of the neural network architecture to reduce the size of the network. We are also working on the analog VLSI implementation of the algorithm.


FIGURE 11. The optical flow of the laboratory scene. Note that the computation is performed within a 32-by-32 window with $\alpha = 0.001$.

9 References

1. M. Bertero and T. Poggio, "Ill-Posed Problems in Early Vision," Proc. of IEEE, Vol. 76, No. 8, pp. 869-889, 1988.

2. R. Burden, J.D. Faires, and A.C. Reynolds, Numerical Analysis, Prindle, Weber and Schmidt, Boston, 1981.

3. P.S. Churchland and T.J. Sejnowski, The Computational Brain, MIT Press, Cambridge, MA, 1992.

4. J. Gibson, The Ecological Approach to Visual Perception, Houghton Mifflin Company, Boston, 1979.


5. B.K.P. Horn, Robot Vision, MIT Press, Cambridge, MA, 1986.

6. B.K.P. Horn and B.G. Schunck, "Determining Optical Flow," Artificial Intelligence, Vol. 17, pp. 185-203, 1981.

7. J. Hutchinson, C. Koch, J. Luo, and C. Mead, "Computing Motion Using Analog and Binary Resistive Networks," Computer, Vol. 21, No. 3, pp. 52-63, March, 1988.

8. T. Kailath, Linear Systems, Prentice Hall, Englewood Cliffs, NJ, 1980.

9. J.K. Kearney, W.B. Thompson, and D.L. Boley, "Optical Flow Estimation: An Error Analysis of Gradient-Based Methods with Local Optimization," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 9, No. 2, pp. 229-244, 1987.

10. H. Li and C.H. Chen, "Simulating a Function of Visual Peripheral Processes with an Analog VLSI," IEEE MICRO, Vol. 11, No. 5, pp. 8-15, 1991.

11. A. Lumsdaine, J. Wyatt, and I. Elfadel, "Nonlinear Analog Networks for Image Smoothing and Segmentation," Proc. of IEEE Int. Symp. Circuits and Systems, Vol. 2, pp. 987-991, 1990.

12. C. Mead, Analog VLSI and Neural Systems, Addison-Wesley, Reading, MA, 1989.

13. H.H. Nagel and W. Enkelmann, "An Investigation of Smoothness Constraints for the Estimation of Displacement Vector Fields from Image Sequences," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 8, No. 5, pp. 565-593, 1986.

14. T. Poggio and C. Koch, "Ill-Posed Problems in Early Vision: From Computational Theory to Analogue Networks," Proc. of Royal Society of London Series B, Vol. 226, pp. 303-323, 1985.

15. R. Sekuler, "Visual Motion Perception," in Handbook of Perception, Vol. V, Seeing, edited by E.C. Carterette and M.P. Friedman, Academic Press, New York, 1975.

16. M.A. Snyder, "On the Mathematical Foundations of Smoothness Constraints for the Determination of Optical Flow and for Surface Reconstruction," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 13, No. 11, pp. 1105-1114, 1991.

17. J.G. Taylor, "A Silicon Model of Vertebrate Retinal Processing," Neural Networks, Vol. 3, pp. 171-178, 1990.


18. W.B. Thompson and S.T. Barnard, "Low-Level Estimation and Interpretation of Visual Motion," Computer, IEEE Computer Society, pp. 20-28, August 1981.

19. T.R. Tsao and V.C. Chen, "A Neural Computational Scheme for Extracting Optical Flow from the Gabor Phase Differences of Successive Images," Proc. of IJCNN 1992, IV-450, Baltimore, MD, 1992.

20. D.B. Tweed and T. Vilis, "The Superior Colliculus and Spatiotemporal Translation in the Saccadic System," Neural Networks, Vol. 3, pp. 75-86, 1990.

21. J. Wang and H. Li, "Solving Simultaneous Linear Equations Based on a Recurrent Neural Network," International J. of Information Science, Vol. 76, No. 3/4, pp. 255-278, Elsevier Publishing Co., New York, 1993.

22. J. Wang, "Electronic Realization of a Recurrent Neural Network for Solving Simultaneous Linear Equations," Electronics Letters, Vol. 28, No. 5, pp. 493-495, 1992.

23. K.Y. Wohn, J. Wu, and R.W. Brockett, "A Contour-Based Recovery of Image Flow: Iterative Transformation Method," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 13, No. 8, pp. 746-760, 1991.


Chapter 3

Temporal Pattern Matching Using an Artificial Neural Network
Fatih A. Unal
Nazif Tepedelenlioglu

ABSTRACT A traditional optimization method used for pattern matching is dynamic time warping, a dynamic programming algorithm that compares an input test signal with a reference template signal and obtains an optimum match. The dynamic time warping algorithm reduces the nonlinear time misalignments between the two patterns and consequently accomplishes a better comparison, as opposed to an ordinary direct template matching method that might yield a larger distance between the two patterns despite their similarity. While effective in pattern recognition, the dynamic time warping algorithm is lacking in that the processing time becomes a major consideration for real-time applications as the number and the size of the patterns increase. A parallel computing architecture becomes the only avenue to deal with the heavy computational load. It is shown in what follows that the dynamic time warping pattern matching algorithm can be effectively implemented using the Hopfield network, whereby one defines a dynamic time warping energy function to achieve an optimum match between two patterns. The energy function is mapped to the Hopfield network's Liapunov function to derive the connection weights and the bias inputs.

1 Introduction

Pattern recognition systems consist of four functional units: a feature extractor (to select and measure the representative properties of raw input data in a reduced form), a pattern matcher (to compare an input pattern to reference patterns using a distance measure), a reference template memory (against which the input pattern is compared), and a decision maker (to make the final decision as to which reference template is the closest to the input pattern) [2]. Among these units, the most crucial component is the pattern matcher, which finds the best match and the associated distance


78 Unal and Tepedelenlioglu

between the unknown test input and the reference patterns. Patterns are finite sequences of real numbers, the sequence index usually being interpreted as time.

The rate of success in the matching process is very much dependent on how close the test pattern is to one of the reference templates. Often, due to the distortion and noise introduced during the handling of the test pattern, this desired similarity may deteriorate, and consequently, the process may suffer in that one begins making errors in matching. Among the possible causes of distortion that result in significant matching errors if not compensated for are the nonlinear shifts introduced to the time scale of the test pattern.

Dynamic time warping (DTW) is one such algorithm [5] that is used to eliminate the nonlinear shifts in the time scale of temporal patterns. It reduces the nonlinear time misalignments between two patterns by finding an optimal warping path and achieves a better comparison than an ordinary direct template matching method, which might yield a large distance. It is widely utilized in pattern recognition areas such as speech recognition, speaker verification, and speaker recognition, and it contributes significantly to the performance of these speech processing systems [8, 10, 11, 13].

While effective in pattern matching, the DTW algorithm is lacking in that the processing time becomes a major consideration for real-time applications as the length of the patterns increases. A parallel computing architecture becomes the only avenue to achieve the high computational rate. A possible remedy toward this end would be the use of a Hopfield network, which can be interpreted as one form of parallel computing. It is a fully connected single-layer feedback neural network with symmetric connection weights [4].

The Hopfield network can be regarded as a compromise between finding the best warping path at a considerable computational cost and finding an acceptable suboptimal solution rapidly. Although the use of the Hopfield network is mentioned so far in relation to the DTW problem, the approach presented here is flexible enough to apply to other optimization problems.

The organization of the chapter is as follows: The Hopfield network and a general procedure to solve optimization problems with the Hopfield network are described in Section 2. The implementation of the DTW algorithm using the Hopfield network is explained in Section 3. Section 4 contains the computer simulation results, and finally, the conclusions are drawn in Section 5.


3. Pattern Matching Using an Artificial Neural Network 79

2 Solving Optimization Problems Using the Hopfield Network

The embodiment of the Hopfield network is shown in Figure 1. As seen from the figure, the network consists of neurons with self-feedback in a single-layer structure, and full connectivity is achieved through symmetric weights.

The behavior of this system is described by the differential equation

$$\dot{\mathbf{u}} = -\frac{\mathbf{u}}{\tau} + \mathbf{W}\mathbf{v} + \mathbf{b}, \quad (1)$$

where the inputs of the neurons are denoted collectively by the vector $\mathbf{u}$, the outputs by the vector $\mathbf{v}$, the connection weights between the neurons by the matrix $\mathbf{W}$, the bias inputs by the vector $\mathbf{b}$, and $\tau$ determines the rate of decay of the neurons. Also, the input-output characteristics of the neurons are taken as

$$v_i = g(u_i) = \frac{1}{2}\left(1 + \tanh\frac{u_i}{u_T}\right), \quad (2)$$

where $u_T$ determines the steepness of the sigmoidal activation function $g$ and is called the temperature [4]. The corresponding graph is shown in Figure 2.

Hopfield showed that this network, with a symmetric $\mathbf{W}$, forces the outputs of the neurons to follow a path through the state space on which the quadratic Liapunov function

$$L(\mathbf{v}) = -\frac{1}{2}\mathbf{v}^t \mathbf{W} \mathbf{v} - \mathbf{b}^t \mathbf{v} + \frac{1}{\tau} \sum_i \int_0^{v_i} g^{-1}(a)\, da \quad (3)$$

monotonically decreases with respect to time as the network evolves in accordance with equation (1), and the network converges to a steady state that is determined by the choice of the weight matrix $\mathbf{W}$ and the bias vector $\mathbf{b}$. That is, $\frac{dL(\mathbf{v})}{dt} \leq 0$ [3]. The Liapunov function $L(\mathbf{v})$ can be interpreted as the energy of the network. Note that

$$\frac{du_i}{dt} = -\frac{\partial L(\mathbf{v})}{\partial v_i} \quad (4)$$

can be derived from equations (1) and (3). Thus, the Hopfield network corresponds to a gradient system that seeks a minimum of the Liapunov function $L(\mathbf{v})$. The network converges to a stable state when a minimum is reached. So, $\frac{\partial L(\mathbf{v})}{\partial v_i} = 0$ implies $\frac{du_i}{dt} = 0$, and this is achieved when the network reaches a stable state.


FIGURE 1. The Hopfield network.

This characteristic of the network is exploited to solve optimization problems. Usually, a quadratic energy function $E(\mathbf{v})$, composed of a cost function and possibly some constraints, is defined for the optimization problem at hand and equated to the Liapunov function $L(\mathbf{v})$ to determine the connection weights $\mathbf{W}$ and the bias inputs $\mathbf{b}$.

It should be noted that the performance of the network (where it converges) critically depends on the choice of the cost function and the constraints and their relative magnitudes, since they determine $\mathbf{W}$ and $\mathbf{b}$, which in turn determine where the network settles down.


FIGURE 2. Sigmoidal activation function.

Table 1 shows the procedure that is used to set up a Hopfield network to solve an optimization problem. Each step in the procedure is briefly addressed in the next section when the implementation of DTW is described.

The decay (or damping) term $-\mathbf{u}/\tau$ in equation (1) corresponds to the integration term of equation (3). One has to include an energy component in the energy function that will balance this integration term if the Liapunov function given by equation (3) is used. Otherwise, the convergence of the system can be disturbed [7, 15], and thus the performance of the Hopfield network may be lowered. In this study, the decay term (or equivalently the integration term) is ignored, as in most of the studies reported so far, and the following differential equation and the corresponding Liapunov function are used for the Hopfield network:

du/dt = W v + b     (5)

and

L(v) = −(1/2) vᵀ W v − bᵀ v.     (6)
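The gradient-system property above can be checked numerically. The following sketch is our illustration, not code from the chapter; the small random symmetric weight matrix and the logistic activation are assumptions made for the demonstration. It integrates the dynamics du/dt = Wv + b with Euler steps and confirms that L(v) never increases along the trajectory:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n))
W = (A + A.T) / 2              # symmetric weights, as the Hopfield model requires
np.fill_diagonal(W, 0.0)
b = rng.normal(size=n)

def g(u):
    return 1.0 / (1.0 + np.exp(-u))       # sigmoidal activation

def L(v):
    return -0.5 * v @ (W @ v) - b @ v     # Liapunov function, equation (6)

u = rng.normal(size=n)
values = []
for _ in range(2000):
    v = g(u)
    values.append(L(v))
    u = u + 0.005 * (W @ v + b)           # Euler step of du/dt = W v + b

print(bool(np.all(np.diff(values) <= 1e-9)))   # L(v) is non-increasing
```

The monotone decrease follows because dL/dt = −Σ g′(uᵢ)(duᵢ/dt)² ≤ 0 whenever g is monotonically increasing.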

3 Dynamic Time Warping Using Hopfield Network

This section introduces the concept of the DTW and the use of the Hopfield network to implement it.


Step 1. Find a neural network representation for the problem

Step 2. Determine a number representation with the neurons

Step 3. Define a Liapunov function L(v) for the Hopfield network

Step 4. Devise an energy function E(v) for the optimization problem

Step 5. Derive the connection weights W and the bias inputs b by equating L of Step 3 and E of Step 4

Step 6. Compute the energy function coefficients c

TABLE 1. A general procedure to solve an optimization problem with a Hopfield network.

3.1 Dynamic Time Warping

As mentioned in the introduction, DTW is a sophisticated pattern matching algorithm that is used to compare an input test pattern with a reference pattern template and obtain an optimum match subject to certain constraints [5]. An associated distance is also determined during the process. The DTW algorithm eliminates the nonlinear time misalignments between the two patterns and consequently achieves a better comparison than an ordinary direct template matching procedure, which might yield a larger distance between the two patterns despite their similarity [12].

The DTW algorithm effectively eliminates the nonlinear x-axis variations to compensate for the nonlinear temporal distortions. Note that in speech processing applications, such distortions may arise due to the variations in the speaking rates of the speakers.

The algorithm can be formulated as a minimum-cost path problem as illustrated in Figure 3. Thus, the problem is transformed to one of finding an optimal alignment path m = w(n) between a reference signal r(n) and a test signal t(m) over a 2-D finite Cartesian grid of size N × N, where N is the length of the signals, and n and m are the discrete time scale indices for the reference and the test signals respectively. Each node v(n, m) has a specified cost d(n, m) that corresponds to the distance between the reference signal sample r(n) and the test signal sample t(m). The problem is to obtain the minimum cost path from v(0, 0) to v(N − 1, N − 1).

In order to implement an effective and efficient DTW algorithm, it is necessary to specify a number of factors and constraints on the solution [9], which could vary depending on the application field.

In what follows, to fix ideas we will assume that the application field is speech recognition, in which case the constraints become:


(a) Endpoint constraints:

w(0) = 0,     w(N − 1) = N − 1.     (7)

(b) Local path constraints: The following are Itakura path constraints, which are illustrated in Figure 4 [5]:

0 ≤ w(n) − w(n − 1) ≤ 2,     (8)
w(n − 1) − w(n − 2) > 0   if   w(n) − w(n − 1) = 0.

These constraints guarantee that the average slope of the path lies between 1/2 and 2, provide path monotonicity, and prevent excessive compression and expansion of the time scales, as shown in Figure 3.

(c) Global path constraints:

m_L(n) ≤ m ≤ m_H(n),     (9)

where

m_H(n) = min{ 2n, ⌊n/2 + (N − 1)/2⌋, N − 1 },     (10)
m_L(n) = max{ ⌈n/2⌉, 2n − (N − 1), 0 },

and ⌈x⌉ denotes the smallest integer greater than or equal to x, and ⌊x⌋ denotes the greatest integer less than or equal to x. Note that these global constraints elicit the parallelogram in Figure 3 that has the sides with slopes 1/2 and 2 emanating from the points n = 0, m = 0, and n = N − 1, m = N − 1, in which the optimal warping path w(n) lies. Strictly speaking, actual slopes of the line segments connecting the grid nodes can be 0, 1, or 2 only. However, for convenience, one can assume an average slope of 1/2 and 2 for the edges of the parallelogram that is shown in Figure 3.

(d) Local distance measure: The absolute difference metric is used as the distance measure, which is implemented in the form

d(r(n), t(m)) = |r(n) − t(m)|.     (11)

Consequently, the total distance along the optimal path w(n) from the grid point (0, 0) to the grid point (N − 1, N − 1) can be written as

D = min_{w(n)} { Σ_{n=0}^{N−1} d(r(n), t(w(n))) }.     (12)

With all these constraints in mind, we can reiterate the definition of the DTW problem as finding an optimal warping path m = w(n) through the


FIGURE 3. A DTW example depicting an optimal alignment path w(n) to match r(n) to t(m). Legend: r(n) represents the reference pattern and t(m) the test pattern; each grid node v(n, m) carries the associated distance d(r(n), t(m)) = |r(n) − t(m)| between the nth component of r and the mth component of t; the solid line marks the optimal warping path m = w(n); the parallelogram resulting from the constraint equations is also marked. Reprinted with permission from [17]. © IEEE 1992.


FIGURE 4. Itakura path constraints for the DTW. Reprinted with permission from [17]. © IEEE 1992.

grid points v(n, m) in Figure 3 to match the reference pattern r(n) with the test pattern t(m) subject to the constraints given by equations (7), (8), (9), and (10) such that the total distance D given by equation (12) is minimized. Thus, for the particular example illustrated in Figure 3, the optimal warping path m = w(n) (indicated by the solid line) goes through the grid nodes v(0, 0), v(1, 1), v(2, 1), v(3, 3), v(4, 4), and v(5, 5) and corresponds to the best match between the two patterns with the associated total distance 10. Note that none of the other valid paths (that satisfy the constraints) within the parallelogram have a smaller total distance. The results of the DTW for the example shown in Figure 3 are summarized in Table 2.

3.2 Hopfield Network Implementation

Our purpose in this section is to demonstrate that the Hopfield network can be used to realize the DTW algorithm [17]. To achieve this objective, we follow the procedure given in Table 1: First, a neural network representation for the DTW algorithm is found. Once the representation is chosen, a DTW energy function E(v) is devised to obtain an optimum match between the unknown input test and the reference patterns. Note that there is no need to determine a number representation with the neurons, since the steady state outputs of the neurons suffice to define the final warping path, and hence step 2 in Table 1 is omitted. The DTW algorithm then is mapped onto the Hopfield network by equating the energy function E(v) to the Liapunov function L(v), and the connection weights W and the bias


1 " 0 1 2 3 4 5

m 0 1 1 3 4 5

r{n) 4 5 6 3 7 1

to m) 2 4 4 3 5 4

d(r(n),f(m)) = |r(n) - t(m)| 2 1 2 0 2 3 10 1

Total Distance D 1

TABLE 2. DTW results for the example in Figure 3.
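For reference, the same minimum can be obtained with a conventional dynamic-programming recursion. The sketch below is ours, not the chapter's Hopfield formulation: it enforces the endpoint constraints of equation (7) and the Itakura local constraints of equation (8), and it reproduces the total distance D = 10 for the Figure 3 example:

```python
INF = float("inf")

def dtw_itakura(r, t):
    """Minimum-cost warping path under the Itakura local constraints."""
    N = len(r)
    d = [[abs(r[n] - t[m]) for m in range(N)] for n in range(N)]
    # cost_h[m]: best cost of reaching (n, m) by a horizontal (slope-0) step;
    # cost_o[m]: best cost of reaching (n, m) by a slope-1 or slope-2 step.
    cost_h = [INF] * N
    cost_o = [INF] * N
    cost_o[0] = d[0][0]                       # endpoint constraint w(0) = 0
    for n in range(1, N):
        new_h = [INF] * N
        new_o = [INF] * N
        for m in range(N):
            # a horizontal step is allowed only after a non-horizontal step
            new_h[m] = cost_o[m] + d[n][m]
            prev = [cost_h[m - 1] if m >= 1 else INF,
                    cost_o[m - 1] if m >= 1 else INF,
                    cost_h[m - 2] if m >= 2 else INF,
                    cost_o[m - 2] if m >= 2 else INF]
            new_o[m] = min(prev) + d[n][m]
        cost_h, cost_o = new_h, new_o
    return min(cost_h[N - 1], cost_o[N - 1])  # endpoint w(N-1) = N-1

r = [4, 5, 6, 3, 7, 1]    # reference pattern of Figure 3
t = [2, 4, 4, 3, 5, 4]    # test pattern of Figure 3
print(dtw_itakura(r, t))  # 10
```

The split into "horizontal" and "other" arrival costs is one way to encode the rule that two slope-0 segments may not occur in succession.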

inputs b are found by matching the linear and the quadratic terms in both functions. The energy function coefficients have to be determined in such a way as to obtain a balance among the energy function components to achieve a high quality result while maintaining the validity of the solution. For this purpose a method is developed that computes the energy function coefficients systematically. The reader is referred to [16] for details of this method.

Every grid point on the {n,m) plane in Figure 3 can be represented by a neuron. Therefore, a two-dimensional array representation is used for the network. The neuron outputs will be denoted by subscripts x (for ordinate m) and i (for abscissa n) showing the row and the column indices respectively.

Hence, by scrutinizing the warping path through the grid nodes and considering the objective function (12) and the constraints (7) through (10), which are described in Section 3.1, the following energy function E(v) can be constructed for the DTW algorithm:

E(v) = (c0/2) Σ_{x=0}^{N−1} Σ_{i=0}^{N−1} Σ_{y=0}^{N−1} (d_{x,i} + d_{y,i+1}) v_{x,i} v_{y,i+1}     (13)

     + (c1/2) Σ_{x=0}^{N−1} Σ_{i=0}^{N−1} Σ_{y≠x, y≠x+1, y≠x+2} v_{x,i} v_{y,i+1}

     + (c2/2) Σ_{x=0}^{N−1} Σ_{i=0}^{N−1} Σ_{y≠x} v_{x,i} v_{y,i}

     + (c3/2) ( Σ_{x=0}^{N−1} Σ_{i=0}^{N−1} v_{x,i} − N )²

     + (c4/2) Σ_{x=0}^{N−1} Σ_{i=0}^{N−1} Σ_{j≠i, |i−j|≠1} v_{x,i} v_{x,j}

     + 2 c5 Σ_{x=0}^{N−1} Σ_{i=0}^{N−1} v_{x,i} (1 − v_{x,i}),

where modulo N arithmetic is used for the subscripts wherever applicable, i.e., N ≡ 0.

The c0 term stands for the objective function that minimizes the total distance between the two signals associated with the optimal warping path through the grid points that satisfies equation (12). This component will be pulled down to a minimum, since the energy function E(v) is anchored to the Liapunov function L(v), which decreases monotonically during the operation of the neural network as described in Section 2.

The Itakura path slope constraint that is given by equation (8) is satisfied when the c1 component equals zero. Namely, the slopes of the line segments between the grid nodes that form the final warping path w(n) will be pushed to 0, 1, or 2 when this energy component is minimized.

Every signal sample in the reference pattern should be visited once while matching with the test signal when the DTW is applied. Hence, a constraint that will force only a single neuron having output 1 at each column necessitates the next energy component, which is the c2 term. Again, when this part of the energy function is minimized, there will be at most one active neuron (output 1) in each column. Note that this is a necessary but not sufficient condition, since this constraint is satisfied even if all the neurons have output 0. Therefore, the sufficient condition is also to be added to the energy function, which is the c3 component.

By minimizing the c3 energy member, the neural network will end up having N active neurons at the time the network converges to a minimum state.

The c4 component enforces the Itakura constraint given by equation (8). As the network settles down to a minimum energy configuration, this energy ingredient approaches zero, and successive zero-slope line segments are avoided in each row.


Finally, minimizing the c5 component of the energy function forces the neurons to have 0 or 1 output when the network reaches a stable state, as in [14].
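As a quick sanity check (our sketch; the example path follows Figure 3 and the matrix layout follows the row/column convention of this section), a valid warping-path matrix makes the column and count constraint components vanish — exactly one active neuron per column and N active neurons in total:

```python
import numpy as np

N = 6
w = [0, 1, 1, 3, 4, 5]        # the optimal warping path of Figure 3
v = np.zeros((N, N))          # neuron states v[x, i]: row x = m, column i = n
for i, x in enumerate(w):
    v[x, i] = 1.0

# c2-style term: cross products of distinct rows within each column
e2 = sum(v[x, i] * v[y, i]
         for i in range(N) for x in range(N) for y in range(N) if y != x)
# c3-style term: squared deviation of the total activity from N
e3 = (v.sum() - N) ** 2

print(e2, e3)   # both vanish for a valid path
```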

Now, it is straightforward to show that the connection weights and the bias inputs can be obtained by equating equation (13) to equation (6) as

W_{xi,yj} = −c0 (d_{x,i} + d_{y,j}) (δ_{i+1,j} + δ_{i−1,j})     (14)
          − c1 δ_{i+1,j} (1 − δ_{x,y}) (1 − δ_{x+1,y}) (1 − δ_{x+2,y})
          − c2 δ_{i,j} (1 − δ_{x,y})
          − c3
          − c4 δ_{x,y} (1 − δ_{i,j}) (1 − δ_{i+1,j}) (1 − δ_{i−1,j})
          + 4 c5 δ_{i,j} δ_{x,y}

and

b_{xi} = c3 N − 2 c5,     (15)

where δ_{x,y} is the Kronecker delta function, which is equal to 1 when x = y, and 0 otherwise. Modulo N arithmetic is used for the subscripts wherever applicable. The derivation of (14) and (15) is omitted to save space. The reader is referred to [16] for details.

Even though it is possible to extend the concept of DTW to multidimensional signals, in this study we confine ourselves to one-dimensional signals only. The formulation is completed to establish the necessary theoretical framework, and work is underway to conclude the experimental studies for dynamic spatial warping (DSW), which is the extension of DTW to two-dimensional signals. We plan to apply this novel method to the image recognition fields including fingerprint, signature, and handwritten character recognition.

4 Computer Simulation Results

The dynamical behavior of the Hopfield network model is represented by equation (5). The equation is solved numerically using Euler's method by replacing the derivative in equation (5) by the quotient of the forward differences as follows:

u_{x,i}^{(k+1)} = u_{x,i}^{(k)} + Δt ( Σ_{y=0}^{N−1} Σ_{j=0}^{N−1} W_{xi,yj} v_{y,j}^{(k)} + b_{xi} ),     (16)

where 0 ≤ x, i ≤ N − 1. This equation will be used throughout this work to simulate the operation of the DTW Hopfield network.

We also replace the sigmoidal activation function by the following piece-wise linear function:


g(u_i) = 0 if u_i ≤ −0.5;   u_i + 0.5 if −0.5 < u_i < 0.5;   1 if u_i ≥ 0.5.     (17)

Although the theory is developed assuming a sigmoidal function, it turns out that the above piecewise linear function works just as well, and it is more efficient to evaluate. Therefore, in the interest of speed it was used in the experiments throughout. From our experiments, we observe that it takes fewer iterations to converge to a solution with this activation function, and the quality of the results is not adversely affected.
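A direct transcription of the piecewise-linear activation (our sketch; the behavior at the two breakpoints follows from the continuity of the function):

```python
def g(u):
    """Piecewise-linear activation of equation (17)."""
    if u <= -0.5:
        return 0.0
    if u >= 0.5:
        return 1.0
    return u + 0.5

print(g(-1.0), g(-0.5), g(0.0), g(0.5), g(1.0))   # 0.0 0.0 0.5 1.0 1.0
```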

From the computer simulations, we find that the initial inputs to the neurons affect the quality of the results significantly. If the neurons are initialized in a way consistent with the c3 term of the DTW energy equation (13), better results are achieved. To avoid the symmetric stuck conditions [4], noise is added to the inputs; hence the inputs are uniformly distributed random variables in the range

u_0 − 0.1 δu ≤ u_{x,i} ≤ u_0 + 0.1 δu,     (18)

where δu is the noise, which is uniformly distributed in [0, 1], and u_0 = 1/N − 0.5. The above initial values are used for the neurons that reside inside the parallelogram defined by the path constraints that are addressed in Section 3.1. The outside neurons are clamped to zero because of the same constraints. Furthermore, the neurons at the origin and at n = N − 1, m = N − 1 are clamped to 1 because of the endpoint constraints. During the operation of the network these neurons always have these fixed states and force other neurons to acquire better final states.

Another important factor is the step size Δt, which is used in the iterative solution of equation (16). A step size of 0.02 is used in our experiments, and this Δt seems to be sufficient with N = 10, N being the number of signal samples that are matched. The number of iterations needed to reach a solution increases with smaller step sizes, and the quality of the solutions does not improve.
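Putting the iteration, the activation, and the initialization together gives the following simulation skeleton. This is our sketch only: the weight matrix and biases here are random symmetric placeholders rather than the DTW weights of equations (14)-(15), and the noise term simplifies equation (18) to a uniform perturbation of width 0.1 about u_0:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10                     # pattern length; the grid has N*N neurons
dt = 0.02                  # step size used in the experiments

M = N * N
A = rng.normal(size=(M, M))
W = (A + A.T) / 2          # placeholder symmetric weights (not equation (14))
b = rng.normal(size=M)     # placeholder biases (not equation (15))

def g(u):
    # piecewise-linear activation of equation (17), vectorized
    return np.clip(u + 0.5, 0.0, 1.0)

# noisy initialization around u0 = 1/N - 0.5, in the spirit of equation (18)
u0 = 1.0 / N - 0.5
u = u0 + 0.1 * rng.uniform(-1.0, 1.0, size=M)

for _ in range(50):
    v = g(u)
    u = u + dt * (W @ v + b)    # Euler iteration of equation (16)

v = g(u)
print(v.shape, float(v.min()) >= 0.0, float(v.max()) <= 1.0)
```

In the actual experiments, the neurons outside the parallelogram and the two endpoint neurons would additionally be clamped, as described above.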

4.1 Performance Measurement with Random Signals

To evaluate the performance of the network, uniformly distributed random reference and test signals are generated. From these signals a distance matrix d is produced using equation (11). The distances are normalized to the unit square. Using d, the optimal warping path corresponding to the global minimum total distance and the path with the global maximum distance are determined by going through all of the possible paths within the parallelogram, as shown in Figure 3. Then the DTW Hopfield network


is employed to find the optimal path. A distance measure is defined to compare the results as follows:

d_GM = (min_NN − min_G) / (max_G − min_G) × 100,     (19)

where min_G and max_G are the global minimum and maximum distances corresponding to the best and worst warping paths that are found by means of the traditional DTW using exhaustive search. The min_NN is the minimum distance corresponding to the optimal path found by the neural network. The d_GM is the percentage of the distance to the global minimum and represents the independent variable on the horizontal axis in Figure 5 and Figure 7. The y-axis denotes the number of occurrences out of 500 runs.
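Equation (19) in code (our sketch; the example numbers are invented for illustration):

```python
def d_gm(min_nn, min_g, max_g):
    """Percentage distance to the global minimum, equation (19)."""
    return (min_nn - min_g) / (max_g - min_g) * 100.0

# a network path 2 units above the global best, on a 10-unit spread
print(d_gm(12.0, 10.0, 20.0))   # 20.0
print(d_gm(10.0, 10.0, 20.0))   # 0.0 -- the network found the global best
```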

Two tests are run to measure the performance of the DTW Hopfield network with the constraint coefficients c1 = 13.8, c2 = 13.8, c3 = 4.5, c4 = 6.3, c5 = 1.0. In the first test, the objective function coefficient is taken as c0 = 2.0. Then the same test is repeated with a more dominant objective function coefficient, c0 = 4.0, to demonstrate its impact on the solution validity and quality. Figure 5 shows the test results for the first set of coefficients. It is seen that the network converged to a valid solution 96% of the time. Figure 6 displays the corresponding iteration histogram. With the second set of coefficients the results summarized in Figures 7 and 8 are obtained. Using this set, the DTW Hopfield network reaches a valid solution 72% of the time, but the quality of the paths found is superior to that in the prior case. The reason for this is that while the constraint coefficients enforce the validity of the warping path, the objective function coefficient c0 competes with them to minimize the total distance associated with the path. Thus, the quality of the DTW path can be improved by increasing the value of c0, but this results in more frequent invalid paths. For both cases, the network converges to a valid solution in fewer than 50 iterations, and the results achieved show that the network is capable of matching the reference and test signals effectively.

4.2 Comparisons with Direct Template Matching

The purpose of this experiment is to demonstrate the superiority of the pattern matching performed by the DTW Hopfield network over ordinary direct template matching. First, direct template matching is applied to the reference signal r and the test signals t1, t2, which are shown in Figures 9-10. Then, the experiment is carried out with the energy function coefficients c0 = 4.0, c1 = 13.8, c2 = 13.8, c3 = 4.5, c4 = 6.3, c5 = 1.0 using the DTW Hopfield network. The results are shown in Tables 3 through 7. Table 3 displays the samples of the signals, the total distances, and the local distances between the samples of the reference and the test signals. The


FIGURE 5. Performance measurement results with coefficients c0 = 2.0, c1 = 13.8, c2 = 13.8, c3 = 4.5, c4 = 6.3, c5 = 1.0. The horizontal axis is the distance percentage to the global minimum.


FIGURE 6. Iteration histogram with coefficients c0 = 2.0, c1 = 13.8, c2 = 13.8, c3 = 4.5, c4 = 6.3, c5 = 1.0. The horizontal axis is the number of iterations.


FIGURE 7. Performance measurement results with coefficients c0 = 4.0, c1 = 13.8, c2 = 13.8, c3 = 4.5, c4 = 6.3, c5 = 1.0. The horizontal axis is the distance percentage to the global minimum.


FIGURE 8. Iteration histogram with coefficients c0 = 4.0, c1 = 13.8, c2 = 13.8, c3 = 4.5, c4 = 6.3, c5 = 1.0. The horizontal axis is the number of iterations.


n                 0     1     2     3     4     5     6     7     8     9
r(n)           20.0  15.0   5.0   0.0   4.9  14.9  20.0  15.0   5.0   0.0
t1(n)          14.0   3.0   1.0   4.0  11.0  13.0  14.0  19.0  16.0   7.0
t2(n)          15.0  15.0  12.5  10.0  10.0  10.0  10.0   7.5   5.0   5.0
d(r(n),t1(n))   6.0  12.0   4.0   4.0   6.1   1.9   6.0   4.0  11.0   7.0   Total: 62.0
d(r(n),t2(n))   5.0   0.0   7.5  10.0   5.1   4.9  10.0   7.5   0.0   5.0   Total: 55.0

TABLE 3. Results of the direct template matching.

absolute difference distance metric is used to calculate the local distances and the distance matrices. The resultant distance matrices are given in Tables 4 and 6.

According to the direct template matching results in Table 3, the test signal t2 is more similar to the reference signal r than the test signal t1, since its total distance to r is smaller. Figures 11 and 12 illustrate the effect of DTW on the test signals, and the corresponding warping functions are displayed in Tables 5 and 7. As the results show clearly, the DTW Hopfield network can compare signals intelligently and achieve better results than the ordinary direct template matching.
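The direct template matching totals in Table 3 can be reproduced in a few lines (our sketch; signal values transcribed from the table):

```python
r  = [20.0, 15.0, 5.0, 0.0, 4.9, 14.9, 20.0, 15.0, 5.0, 0.0]
t1 = [14.0, 3.0, 1.0, 4.0, 11.0, 13.0, 14.0, 19.0, 16.0, 7.0]
t2 = [15.0, 15.0, 12.5, 10.0, 10.0, 10.0, 10.0, 7.5, 5.0, 5.0]

def direct_distance(a, b):
    # sample-by-sample absolute difference, no time warping
    return sum(abs(x - y) for x, y in zip(a, b))

print(round(direct_distance(r, t1), 1), round(direct_distance(r, t2), 1))   # 62.0 55.0
```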

5 Conclusions

The main objective of this study is to show that the Hopfield network can be used to implement the DTW algorithm, which compares two signals to accomplish the best match under certain constraints. The results obtained in Section 4 verify that the method proposed is a good candidate for this purpose, and the DTW Hopfield network has the capability to make fast and intelligent comparisons in the pattern matching phase of a pattern recognition process.

The procedure proposed in Section 2 provides a methodical approach to solve optimization problems using the Hopfield network. Most of the steps in this procedure are straightforward, except the neural network representation and the definition of the energy function. There can be more than one valid neural network representation and energy function for a given


FIGURE 9. The reference signal r and the test signal t1. Signal r is marked by △, and t1 is marked by ▽.


FIGURE 10. The reference signal r and the test signal t2. Signal r is marked by △, and t2 is marked by □.


         t1:  m=9   m=8   m=7   m=6   m=5   m=4   m=3   m=2   m=1   m=0
r: n=0       1.41  1.41  1.41  1.41  1.41  1.41  1.41  1.41  1.41  0.37
   n=1       1.41  1.41  1.41  1.41  1.41  1.41  1.41  0.99  0.84  0.02
   n=2       1.41  1.41  1.41  1.41  1.41  0.47  0.06  0.28  0.13  1.41
   n=3       1.41  1.41  1.41  1.04  0.97  0.82  0.30  0.07  0.22  1.41
   n=4       1.41  1.41  1.06  0.69  0.61  0.47  0.06  0.28  1.41  1.41
   n=5       1.41  1.41  0.35  0.02  0.09  0.24  0.76  0.99  1.41  1.41
   n=6       1.41  0.22  0.00  0.37  0.45  0.60  1.12  1.41  1.41  1.41
   n=7       1.41  0.13  0.35  0.02  0.09  1.41  1.41  1.41  1.41  1.41
   n=8       0.17  0.84  1.06  1.41  1.41  1.41  1.41  1.41  1.41  1.41
   n=9       0.52  1.41  1.41  1.41  1.41  1.41  1.41  1.41  1.41  1.41

TABLE 4. Local distance matrix for r and t1. The distances are normalized to the unit square.

         x=9   x=8   x=7   x=6   x=5   x=4   x=3   x=2   x=1   x=0
i=0      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
i=1      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
i=2      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00
i=3      0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00
i=4      0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00
i=5      0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00
i=6      0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00
i=7      0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
i=8      1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
i=9      1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

TABLE 5. The final neuron states v_{x,i} for r and t1 matching. The total distance corresponding to this path is 1.93. The DTW Hopfield network reached this state in 14 iterations.


         t2:  m=9   m=8   m=7   m=6   m=5   m=4   m=3   m=2   m=1   m=0
r: n=0       1.41  1.41  1.41  1.41  1.41  1.41  1.41  1.41  1.41  0.00
   n=1       1.41  1.41  1.41  1.41  1.41  1.41  1.41  0.12  0.35  0.35
   n=2       1.41  1.41  1.41  1.41  1.41  0.59  0.59  0.82  1.06  1.41
   n=3       1.41  1.41  1.41  0.94  0.94  0.94  0.94  1.18  1.41  1.41
   n=4       1.41  1.41  0.35  0.59  0.59  0.59  0.59  0.82  1.41  1.41
   n=5       1.41  1.41  0.35  0.12  0.12  0.12  0.12  0.12  1.41  1.41
   n=6       1.41  0.94  0.71  0.47  0.47  0.47  0.47  1.41  1.41  1.41
   n=7       1.41  0.59  0.35  0.12  0.12  1.41  1.41  1.41  1.41  1.41
   n=8       0.12  0.12  0.35  1.41  1.41  1.41  1.41  1.41  1.41  1.41
   n=9       0.47  1.41  1.41  1.41  1.41  1.41  1.41  1.41  1.41  1.41

TABLE 6. Local distance matrix for r and t2. The distances are normalized to the unit square.

         x=9   x=8   x=7   x=6   x=5   x=4   x=3   x=2   x=1   x=0
i=0      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
i=1      0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00
i=2      0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00
i=3      0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00
i=4      0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00
i=5      0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00
i=6      0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00
i=7      0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
i=8      1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
i=9      1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

TABLE 7. The final neuron states v_{x,i} for r and t2 matching. The total distance corresponding to this path is 3.77. The DTW Hopfield network reached this state in 20 iterations.


FIGURE 11. The reference signal r and the warped test signal t1; r is marked by △, t1 is marked by ▽.


FIGURE 12. The reference signal r and the warped test signal t2; r is marked by △, t2 is marked by □.


problem. In some applications, definition of the energy function may not require any work at all. In others, the neural representation may be quite intricate. The energy function defined in equation (13) is neither unique nor claimed to be the best energy function for the DTW problem. It may be possible to find better energy functions. Combining some of the constraint components and/or incorporating them into the objective function would reduce the number of energy function coefficients. But then it would not be possible to control the effects of these components independently. It should be noted that the components of the energy functions compete and cooperate with each other, while the neural network descends with the Liapunov function, as dictated by the energy function, toward a stable minimum energy state. The energy function coefficients c0 through c5 define the characteristics of this descent. There is a delicate balance among these components, which are weighted by the energy function coefficients. It would be interesting to study the effects of changing the energy function coefficients dynamically (as a function of energy) as the neural network evolves toward a solution state. This could aid the DTW Hopfield network to reach a lower minimum with a faster convergence rate.

The superiority of the DTW Hopfield network over the traditional DTW algorithm is that the pattern matching time is independent of the pattern size. It should be noted that the time required for pattern matching with the conventional DTW is directly proportional to the number of possible valid paths within the parallelogram. Obviously, the size of the parallelogram, and consequently the number of paths inside the parallelogram, is a function of the pattern size. On the other hand, for the DTW Hopfield network, the number of iterations does not increase as the size and the complexity of the problem grow; only the number of neurons and their connections expands. This aspect of the DTW Hopfield network was tested for patterns with sizes up to ten, and the number of iterations to converge to a solution was in the neighborhood of twenty, as given in Section 4.1. Twenty iterations would be completed within an elapsed time of a few characteristic time constants of the network (the decay time of a neuron), which is on the order of microseconds [1, 6].

The effect of the objective function (relative to the constraint components) could be reduced by calibrating the energy coefficients if maintaining a valid result has a higher priority than the quality of the solution.

In this study, a piecewise linear activation function was utilized instead of the standard hyperbolic tangent function to improve the simulation time. This did not adversely affect the performance of the DTW Hopfield network. The activation function could be digitized, and further advantages could be gained in the implementation of the neural network with digital VLSI (Very Large Scale Integration) technology.

The significance of this study is that for the first time, we have shown that the DTW algorithm can be realized using a Hopfield network. As the results


in Section 4 confirm, the DTW Hopfield network is capable of comparing signals intelligently and can be employed in the pattern matching phase of a pattern recognition system.

6 References

[1] S. Bhama and M. H. Hassoun, "Continuous Hopfield computational network: hardware implementation," Int. J. Electronics, vol. 69, no. 5, pp. 603-612, 1990.

[2] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[3] J. J. Hopfield, "Neurons with graded response have collective computational properties like those of two-state neurons," Proc. Natl. Acad. Sci. U.S.A., vol. 81, pp. 3088-3092, May 1984.

[4] J. J. Hopfield and D. W. Tank, "Neural computation of decisions in optimization problems," Biol. Cybernet., vol. 52, pp. 1-25, 1985.

[5] F. Itakura, "Minimum prediction residual principle applied to speech recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-23, pp. 67-72, February 1975.

[6] B. Kamgar-Parsi, J. A. Gualtieri, J. E. Devaney and B. Kamgar-Parsi, "Clustering with neural networks," Biol. Cybernet., vol. 63, pp. 201-208, 1990.

[7] K. C. Lee, N. Funabiki and Y. Takefuji, "A parallel improvement algorithm for the bipartite subgraph problem," IEEE Trans. Neural Networks, vol. 3, no. 1, pp. 139-145, January 1992.

[8] S. E. Levinson and D. B. Roe, "A perspective on speech recognition," IEEE Commun. Mag., pp. 28-34, January 1990.

[9] C. Myers, L. R. Rabiner and A. E. Rosenberg, "Performance tradeoffs in dynamic time warping algorithms for isolated word recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, pp. 623-635, December 1980.

[10] J. M. Naik, "Speaker verification: A tutorial," IEEE Commun. Mag., pp. 42-48, January 1990.

[11] D. O'Shaughnessy, "Speaker recognition," IEEE ASSP Mag., pp. 4-17, October 1986.

[12] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 26, pp. 43-49, February 1978.

[13] H. F. Silverman and D. P. Morgan, "The application of dynamic programming to connected speech recognition," IEEE ASSP Mag., pp. 6-25, July 1990.

[14] M. Takeda and J. W. Goodman, "Neural networks for computation: number representations and programming complexity," Appl. Opt., vol. 25, pp. 3033-3046, September 1986.

[15] Y. Takefuji and K. C. Lee, "Artificial neural networks for four-coloring map problems and K-colorability problems," IEEE Trans. Circuits Syst., vol. 38, pp. 326-333, March 1991.

[16] F. A. Unal, "Pattern matching using an artificial neural network," Doctoral Dissertation, Electrical and Computer Engineering, Florida Institute of Technology, Melbourne, FL, December 1992.

[17] F. A. Unal and N. Tepedelenlioglu, "Dynamic time warping using an artificial neural network," IJCNN'92 Proceedings, vol. 4, pp. 715-721, Baltimore, MD, June 7-11, 1992.


Chapter 4

Patterns of Dynamic Activity and Timing in Neural Network Processing

Judith E. Dayhoff, Peter J. Palmadesso, Fred Richards, and Daw-Tung Lin

ABSTRACT This chapter addresses topics on the dynamic behavior of neural networks as they oscillate and produce specific timing patterns in their activity. A network of simple processing units is capable of producing prolonged self-sustained oscillations and even chaotic behavior. Modulation of a controlled parameter causes the temporal dynamics to increase in complexity until chaos is reached. An external stimulus—a pattern—can be applied to a chaotic network, resulting in a simpler, limit cycle attractor, which can be recognized in a pattern-to-oscillation map. Since random networks tend to have only one observed dynamic attractor, we have designed a weight perturbation schedule to develop multiple dynamic attractors from different initial states of the network. The result is to create different basins of attraction for different patterns or pattern groups. We can observe a tremendous flexibility not only in evoked attractors (usually oscillations) but in their basins of attraction—the collections of states that lead to the same attractor. Attractor training has been done in networks with time-delay mechanisms, where an a priori chosen dynamic attractor can be trained into the network. A comparison to temporal processing in biological systems is discussed.

1 Introduction

Humans and most animals are extraordinarily capable in the temporal domain. We operate with self-sustained activity, generate oscillating patterns such as walking and swimming, think and imagine for chosen periods of time, and recognize sensed patterns in spite of temporal shifts and time-warping. This extensive set of temporal capabilities is supported by a neural system architecture with a built-in capability to operate in the temporal domain. From which anatomical and physiological structures do these temporal properties arise, and what are their operational properties? One can imagine time delays, recurrent loops, and firing threshold dynamics all created and modulated by cell anatomy and complex biochemical relationships, but the questions remain of how the underlying architectural components cause and modulate temporal dynamics and temporal processing and how the temporal processing allows for complex time-dependent behaviors.

Dynamic neural net architectures are capable of producing prolonged self-sustained activity, with changes in activations continuing across the network over long periods of time. Variations in activity may go on indefinitely, even in the absence of external stimuli. Recurrent loops in the neural connections contribute to such self-sustained activity. Networks with closed-loop connections are possible with only a single layer, or they may have multiple layers, but they must depart from the traditional feedforward structure that does not allow closed-circuit connections. The resulting dynamic networks can be modulated to move among different modes of self-sustained activity, including simple oscillations, limit cycles, transients to dynamic attractors, and chaos.

Previous investigators have shown that chaos tends to be inevitable in large models of neural networks in which interconnection weights are random and asymmetric (w_ij ≠ w_ji) [SCS88]. Alternative paths to chaos were identified for a progression of activity patterns that started with fixed-point stable states and bifurcated to chaotic oscillations [DCQS93], [CDQS94]. Even when interconnections were sparse, the same types of behavior resulted—the network did not have to be fully connected. Thus, self-sustained activity can be produced and modulated in neural networks with uncorrelated weights and with a realistic density of interconnections.

There is a rich potential for computational paradigms that could take advantage of this self-sustained activity, temporal dynamics, and dynamic attractors in neural networks. This computational potential is completely untapped in static neural networks, where feedforward architectures are used, or in recurrent configurations that are allowed to relax to a single stable state—a nondynamic fixed-point attractor. With a dynamic network paradigm, different initial states would evoke different end-state oscillations, or different external inputs would modulate those oscillations. Thus we have ways of obtaining a pattern-to-oscillation map that could potentially be used for pattern classification, representation, and identification.

Important issues remain to be resolved about dynamic neural networks before they can be fully understood and fully developed. Issues include how the weights and interconnection topology determine the dynamic self-sustained activity of the network. The number of attractors in the resulting network, the boundaries of the basins of attraction, and the training of attractors into the networks are also important for the development of computational paradigms.

Most artificial neural network models applied today are powerful pattern recognition tools but do not have interesting temporal activity. Traditional feedforward networks are static, processing fixed patterns as inputs (vectors) and producing fixed patterns, one at a time, as output (other vectors). These static networks are powerful because they can be taught a number of pattern pairs simultaneously, and they imprint the memory and pattern mapping capability into the weight matrix. But their ability is only to map one fixed pattern to another at one time; temporal patterns are not spontaneously generated, learned, or associated.

The complex dynamics generated by dynamic networks have yet to be fully tapped for learning and computational purposes. Pattern recognition, associative memory, signal generation, trajectory production for robotics, and time series prediction are a few of the cogent applications that are natural for dynamic networks. An abundance of temporally varying signals can be found in a multitude of application domains.

Models of biological systems and even consciousness have been proposed to involve dynamic neural networks [Day94]. The ability of an advanced organism to remain conscious in the presence or absence of external stimuli points to some kind of self-sustained activity occurring within the brain. Conscious experience involves a sequence of events that the organism experiences; somehow the sustained neural activity enables this sequence to be experienced and often remembered. Simulation studies have been aimed at understanding how neural activity can be sustained and what are the types of prolonged activity patterns. Ultimately, we would like to understand how self-sustaining neural circuits could support the ongoing activities and memory formation involved in conscious experience.

Since biological sensorimotor systems are so impressive in their spatiotemporal abilities, an examination of how biological systems may be modeled as dynamic networks is warranted. To capture and eventually understand biological capabilities, we must identify the temporal components and structures that make their dynamic behavior possible. In addition to recurrent, closed-loop configurations that lead to limit cycles and chaos, there are anatomical and physiological structures that are intriguingly temporal in nature. Time delays and their adaptation, variations in thresholds over time, and the spacing of action potentials in nerve impulse trains are a few of the most important structures. Trains of action potentials spaced over time offer an order of intricacy and complexity not available in most artificial networks, which have a nonpulsed structure. Impulse trains carry the communication between nerve cells in living systems and, taken together, must be responsible for representation of information and processing in the brain. Temporal patterns and coincidences between multiple impulse trains may have special significance in coding and processing schemes.


In this chapter we consider the development of dynamic behavior in neural networks and show how even a network of simple processing units, inspired by real neurons, is capable of producing oscillations and chaos. Section 2 illustrates how the movement of a controlled parameter causes modulation of the degree and type of temporal dynamics in the network's activity and describes dynamic neural networks that have prolonged self-sustained activity arising from dynamic attractors that oscillate. Different dynamic modes can be developed and controlled in such a neural network, and even a simple architecture with random weights can be forced to develop chaos from a fixed-point attractor—a "stable state" of the network. Section 3 shows how an external stimulus—a pattern—applied to a chaotic network can lock the network into a limit cycle attractor. Unique patterns then can evoke different limit cycles.

Section 4 addresses networks with multiple dynamic attractors. Since random networks tend to have only one dynamic attractor, we have designed a weight perturbation schedule to develop multiple dynamic attractors. In Section 5 we analyze the tremendous flexibility in the basins of attraction—the collections of states that lead to the same oscillation, or attractor. Dynamic networks can have high capacity for attractors and for differing basin boundaries. Section 6 shows attractor training in networks with time-delay mechanisms. The trained weights give the network an a priori chosen dynamic attractor. Time-delays are components in biological systems, as are action potentials and impulse trains. Section 7 discusses the diversity of roles that impulse timing could play in the temporal dynamics of biological neural systems, and Section 8 concludes this overview of temporal mechanisms by discussing the impact of dynamic activity patterns on neural network processing.

2 Dynamic Networks

Dynamic neural networks have an extensive armamentarium of behaviors, including dynamic attractors—finite-state oscillations, limit cycles, and chaos—as well as fixed-point attractors (stable states) and the transients that arise between attractor states. The transitions that occur from one neural state to another while a network is in a dynamic attractor comprise self-sustained activity. A wide variety of such activity is possible, with each attractor having its own dynamic pattern of changing activations. In addition, each attractor has a basin of attraction—a set of states that lead to that attractor—and tremendous variations occur in the boundaries of the basins of attraction and in the transients that lead to each attractor.

We have explored a method of developing dynamic attractors in a neural network and of modulating the network into a chaotic state. The neural units were simple biologically inspired units, performing a weighted sum and nonlinear squashing function. Thus we did not use the approach of building oscillators and chaotic components into a network, to ensure the presence of dynamics, but rather we allowed the dynamics to develop naturally as a result of the network's processing units and interconnections. We have examined a variety of paths from single fixed-point attractors to chaos. Although we use the approach of Doyon et al. [DCQS93], who showed progressions for 128-neuron networks, we show here observations of the dynamics of networks with 64 neurons, which appear to have more variability.

The neural networks were single-layer networks with recurrent connections, where reciprocal connections did not have to be the same (e.g., w_ij ≠ w_ji), and continuous-valued activations were allowed. The networks were fully connected or sparsely connected. Activations were determined by

    a_j(t+1) = f( Σ_{i=1}^{N} g w_ji a_i(t) ),    (1)

where a_j(t) = activation of unit j at time t, w_ji = weight to unit j from unit i, N = the number of processing units, the function f is a squashing function, and g a multiplier. We have used a symmetric sigmoid function

    f(x) = (1 / (1 + e^{-x}) - 0.5) * 2.0,    (2)

which allows activation values to vary from -1.0 to 1.0. The parameter g is a multiplier for the weights and can be set to any value greater than zero.

Both the interconnections and weights were chosen at random. Networks were denoted as (N, K), where N was the total number of processing units and K the number of incoming interconnections for each unit. The K units that sent interconnections to each unit were selected at random, and the values of the weights were selected from a uniform random distribution [-1/K, 1/K].
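As a concrete illustration of this construction and of the update in equations (1) and (2), the following is a minimal Python sketch; the code and all names are our own, not from the chapter.

```python
import numpy as np

def make_network(N=64, K=16, seed=0):
    """Random (N, K) network: each unit receives exactly K incoming
    connections from randomly chosen units, with weights drawn
    uniformly from [-1/K, 1/K]; all other weights are zero."""
    rng = np.random.default_rng(seed)
    W = np.zeros((N, N))
    for j in range(N):
        senders = rng.choice(N, size=K, replace=False)
        W[j, senders] = rng.uniform(-1.0 / K, 1.0 / K, size=K)
    return W

def f(x):
    """Symmetric sigmoid of equation (2); values lie in (-1, 1)."""
    return (1.0 / (1.0 + np.exp(-x)) - 0.5) * 2.0

def step(a, W, g):
    """One synchronous update of all activations, equation (1)."""
    return f(g * (W @ a))

# One update of a random initial state:
W = make_network()
a = np.random.default_rng(1).uniform(-1, 1, 64)
a = step(a, W, g=1.5)
```

Because f(0) = 0, the all-zero state is always a fixed point of this update; whether it attracts or repels depends on g, as discussed below.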

The parameter g has two interpretations. The first is as a multiplier for all weights. Thus, the original set of weights becomes amplified or de-amplified depending on whether g > 1 or g < 1. In this interpretation, the incoming sum for neuron j is

    S_j = Σ_{i=1}^{N} (g w_ji) a_i(t),    (3)

and the modulated weight is g w_ji. The neuron then performs the squashing function to determine its next activation value:

    a_j(t+1) = f(S_j).    (4)


FIGURE 1. Symmetric sigmoid function f(x), with the y = x line.

The second interpretation for g is as a scaling of the x-axis in the squashing function. Organizing equation (1) differently, we get

    R_j = Σ_{i=1}^{N} w_ji a_i(t),    (5)

where R_j is now the incoming sum for unit j and

    a_j(t+1) = f_g(R_j),    (6)

where

    f_g(x) = f(gx)    (7)

is a sigmoid squashing function that is rescaled with respect to the x-axis. Thus the weight is not modulated by g, but instead, the horizontal scale of the sigmoid is modulated by g.
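The two interpretations give identical updates, since f applied to (g w)·a equals f_g applied to w·a. A quick numerical check of this equivalence (illustrative code, not from the chapter; a fully connected network is used for brevity):

```python
import numpy as np

def f(x):
    """Symmetric sigmoid of equation (2)."""
    return (1.0 / (1.0 + np.exp(-x)) - 0.5) * 2.0

rng = np.random.default_rng(0)
W = rng.uniform(-1 / 16, 1 / 16, (64, 64))
a = rng.uniform(-1, 1, 64)
g = 1.5

# Interpretation 1: modulate the weights, then squash (equations (3)-(4)).
a_next_1 = f((g * W) @ a)

# Interpretation 2: plain weighted sum, then the rescaled sigmoid
# f_g(x) = f(g x) of equations (5)-(7).
a_next_2 = f(g * (W @ a))

assert np.allclose(a_next_1, a_next_2)
```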

Assume that f is the symmetric sigmoid function given in (2), the most commonly used squashing function. Then Figure 1 shows the sigmoid function, with asymptotes at -1 and 1, and with x = 0 as its horizontal midpoint. Note that the maximum slope of the sigmoid function is 0.5, since f′(0) = 0.5. This slope is less than one, the slope of the line y = x. Figure 1 also shows the line y = x.

Now consider the function f_g(x), the sigmoid modified by the multiplier g (as in (7)). Then Figures 1-4 plot f_g(x) for different values of g. When g > 2, the maximum slope of the modified sigmoid function, at f_g(0), reaches above 45 degrees, which causes two pockets with positive area to form between the 45-degree line and the modified sigmoid. In Figure 2, g = 5, which causes pockets to form. In Figure 3, g = 10, which causes larger pockets to form. When g < 2, the sigmoid becomes more flattened, and no pockets of positive area form (Figure 4). The pockets induce more complex dynamics in the activity of the individual neurons and the entire network, which is consistent with the fact that the parameter g has been shown to be key to producing chaos in large random networks. In large networks of neurons, for example when (N, K) = (64, 16), small values of g (usually g < 1) yield a single fixed-point attractor. Increasing g eventually gives limit cycle attractors and chaos.

FIGURE 2. Modified sigmoid function f_g(x), with g = 5.

FIGURE 3. Modified sigmoid with g = 10.

Consider a model of a single neuron with a recurrent loop, as in Figure 5. Assume that the activation of this unit is a{t) at time t. Using the second interpretation given above for g,

    a(t+1) = f_g(w a(t)),    (8)

where w is the weight on the recurrent loop. Suppose w = 1. Then

    a(t+1) = f_g(a(t)).    (9)


FIGURE 4. Modified sigmoid with g = 0.8.

FIGURE 5. A single processing unit with a recurrent loop.

If g < 2, then Figures 1 and 4 apply. If g > 2, then Figures 2 and 3 apply.

Figure 6 shows a map of the sequential activation states of the neuron in Figure 5, for a case where g < 2, with the modified sigmoid f_g. Also shown are the 45-degree line y = x and the orbits of two points, x_a = 1.95 and x_b = -1.8. Paths to the successive values (x, f_g(x)) and (f_g(x), f_g²(x)) are shown, followed by paths to (f_g^n(x), f_g^{n+1}(x)) for successive values of n (n = 1, 2, 3, ...). This graphical analysis of the orbits of x illustrates the presence of a single attracting fixed point f_g(0) = 0, as each of the initial points moves along its path to (0, 0).

Figure 7 similarly shows a graphical analysis of the orbits of x for f_g(x) when g = 5. The orbits of two points, x = 0.1 and x = -0.2, are shown. There are three intersections between y = x and f_g(x), each a fixed point. The point at the origin is repelling, and the other two points are attracting.
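These orbit diagrams can be reproduced numerically by iterating the one-neuron map of equation (9). In this sketch (our own illustrative code), the orbit converges to 0 for g = 1, while for g = 5 the origin repels and orbits settle on one of the two nonzero attracting fixed points:

```python
import math

def f_g(x, g):
    """Rescaled symmetric sigmoid f_g(x) = f(gx), equations (2) and (7)."""
    return (1.0 / (1.0 + math.exp(-g * x)) - 0.5) * 2.0

def orbit(x0, g, n=200):
    """Iterate the single-neuron map a(t+1) = f_g(a(t)) of equation (9)
    for n steps and return the final point of the orbit."""
    x = x0
    for _ in range(n):
        x = f_g(x, g)
    return x

# g = 1: both starting points are drawn to the single fixed point at 0.
print(orbit(1.95, g=1), orbit(-1.8, g=1))
# g = 5: orbits settle on the nonzero attracting fixed points.
print(orbit(0.1, g=5), orbit(-0.2, g=5))
```

Since f_g is odd, the two nonzero fixed points for g = 5 are mirror images of each other about the origin.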

FIGURE 6. The sigmoid f(x) (g = 1). Paths to its single attractor are shown.

FIGURE 7. Modified sigmoid for g = 5. Paths to its two attractors are shown.

A network with 64 units, each with 16 incoming connections, (64, 16), was constructed with random initial weights [PD95]. Transitions from fixed-point attractors to chaotic attractors were observed as g was increased, starting from values below 1.0. Figure 8 shows such a progression, from a fixed-point attractor to a chaotic attractor, with average activation a(t) graphed against a(t+1), to form a map of the dynamics. Use of the average activation in the plot was chosen to project the many dimensions (64 activation levels) to a single measured observation over time.

For low g (g = 0.9), a single fixed-point attractor was observed, shown in Figure 8(a). This graph has a single point at (0, 0). When g was increased to 1.0, a limit cycle appeared (Figure 8(b)). A limit cycle appears as a dense set of points along one or more closed loops. When g increased to 1.1, a protrusion appeared at the two tips of the limit cycle graph (Figure 8(c)). When g was increased to 1.18, asymmetric changes occurred in the two corners (Figure 8(d)). In Figure 8(e), g was increased to 1.2, and the limit cycle appears like scribbling in a closed loop. When g = 1.3 (Figure 8(f)), the network is locked into a 2-state oscillation, and when g is increased to 1.5, a bifurcation occurs to form a limit cycle that appears as 2 rings (Figure 8(g)). When g = 1.69, a locking occurs into a 14-state oscillator (Figure 8(h)), and each point becomes a ring when g = 1.7 (Figure 8(i)). When g is increased to 1.8 (Figure 8(j)) and 2.1 (Figure 8(k)), chaotic behavior is observed.

This type of exploration illustrates the tremendous range and complexity of dynamic attractors and the ability to exert some control over their appearance, through varying the single parameter g. Although Doyon et al. [DCQS93] identified the value of g at which the first bifurcation occurs, and showed four distinct paths to chaos in 128-neuron networks, they studied larger networks than are examined here. In our work, we made observations on smaller networks, with 64 neurons, and found more elaborate paths to chaos, such as the path shown in Figure 8.

In many cases the transitions between limit cycles and the transitions to chaotic activity happened rapidly. Figure 9(a) shows a limit cycle at g = 1.880 that changes to a qualitatively different limit cycle when g = 1.889 (Figure 9(c)), with a frequency locking in between (Figure 9(b)) at g = 1.8865. Only small changes in g were required at these transition points.

In some cases, increasing the number of iterations calculated allowed activity that appeared chaotic to resolve into a limit cycle. Often, the number of transients in these cases is too large to be practical to implement. Thus, for practical use to be made of dynamic neural networks, a chaotic response would be considered to be behavior that appeared to be chaotic in a limited predefined time frame.

3 Chaotic Attractors and Attractor Locking

We have considered the problem of how to utilize the attractors in dynamic neural networks to perform pattern recognition and classification. To solve this problem, a pattern communicated to the network must change the dynamic activity in an observable fashion. The attractor that the network enters is observable, through graphing and projecting as in Figure 8, and could ultimately be identified with an automatic matching algorithm. Different attractors appear different. We have a potential pattern classification device when different patterns produce different attractors in a dynamic neural network and when similar patterns produce similar attractors.

FIGURE 8. Progression from fixed point to chaos in a random (64,16) network. The horizontal axis is average activation at time t+1, a(t+1), and the vertical axis is average activation at time t, a(t). (a) g = 0.9; (b) g = 1.0; (c) g = 1.1; (d) g = 1.18; (e) g = 1.2; (f) g = 1.3; (g) g = 1.5; (h) g = 1.67; (i) g = 1.7; (j) g = 1.8; (k) g = 2.1.

FIGURE 9. A quick transition between limit cycles. (a) g = 1.880; (b) g = 1.8865; (c) g = 1.889. The horizontal axis is average activation at time t+1, a(t+1), and the vertical axis is average activation at time t, a(t).

We initially explored whether different initial patterns would produce different attractors in the random networks described above. The initial state of the network was set so that neuron activation levels matched the pattern vector to be classified (a_i(0) ← e_i). Thus a pattern E = (e_1, e_2, ..., e_N) became the initial state of the network. The network was then updated by (1) and (2) for a thousand or more iterations, to pass transients. The resulting attractor was then observed. In our networks with random weights (64 nodes), usually only one attractor was observed, which was reached from a wide variety of initial states. Sometimes, there were two limit cycle attractors, but they were symmetric with one another, having a 180-degree rotational symmetry about the origin. Figure 10 shows a pair of symmetric limit cycle attractors, drawn separately in Figures 10(a) and 10(b). In this case, each initial condition resulted in one of the two different but symmetric limit cycles. Although different initial conditions could evoke different (but symmetric) limit cycles in this case, this scenario does not offer enough flexibility to discriminate patterns in general.

FIGURE 10. A pair of symmetric limit cycle attractors. The horizontal axis is average activation at time t+1, a(t+1), and the vertical axis is average activation at time t, a(t).

A few times, different attractors were observed from different initial states during our simulations of networks with random weights. This circumstance was very rare among our observations, and the parameter g was tuned precisely near a bifurcation point to produce this effect. Then sometimes different initial states led to attractors that were different in appearance. Figure 11 shows such an example. Figure 11(a) shows an 8-state oscillator. This occurs at g = 1.72 in a (64,16) random network. Figure 11(b) shows the same network, except that a different initial state was selected at random. The 8 points appear to have bifurcated into 8 rings, which are asymmetric about the origin. A third random initial state evoked the limit cycle in Figure 11(c), which is symmetric to that in Figure 11(b).

These results do not resolve how to classify patterns in random networks when the pattern to be classified is presented as the initial state of the network. To classify an external pattern by means of evoking different dynamic attractors, we explored a different approach for presenting a pattern to the network, in which the pattern is treated as an external stimulus [PD95], [QDS95]. To include an external stimulus, the updating equation (1) can be modified as follows:

    a_j(t+1) = f( Σ_{i=1}^{N} g w_ji a_i(t) ) + α e_j,    (10)

where E = (e_1, e_2, ..., e_N) is the external input, and α the strength of the external pattern. The input E is then applied at every time step, and its strength is modulated by the multiplier α, which is usually fixed over time. The vector E can then be considered as the pattern to be classified.

A network is initially put in a chaotic oscillation, as described in Section 2, by increasing parameter g until the network reaches chaotic behavior. The chaotic net does not have an external stimulus, and it updates by (1). Typically, we do not increase g more than is necessary to produce chaotic behavior, so the network can be said to be at the "edge of chaos." An external input E is then applied to the chaotic net, and the network uses (10) to update. The externally applied input often "locks" the chaotic network into a simpler attractor, usually a limit cycle. We call this scenario "attractor locking".
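A minimal sketch of this procedure follows (our own code and names). We read equation (10) as adding the stimulus term α e_j outside the squashing function; treat that placement, and all parameter values, as assumptions.

```python
import numpy as np

def f(x):
    """Symmetric sigmoid of equation (2)."""
    return (1.0 / (1.0 + np.exp(-x)) - 0.5) * 2.0

def step_free(a, W, g):
    """Free-running update of equation (1)."""
    return f(g * (W @ a))

def step_stimulated(a, W, g, E, alpha):
    """Update of equation (10): the external pattern E is re-applied
    at every step, at strength alpha, on top of each unit's output."""
    return f(g * (W @ a)) + alpha * E

# Run a random network chaotically, then apply a fixed pattern E.
rng = np.random.default_rng(0)
N = 64
W = rng.uniform(-1 / 16, 1 / 16, (N, N))
E = rng.uniform(-1, 1, N)
a = rng.uniform(-1, 1, N)
for _ in range(1000):                  # chaotic phase, no stimulus
    a = step_free(a, W, g=2.1)
for _ in range(1000):                  # stimulated phase: attractor locking
    a = step_stimulated(a, W, g=2.1, E=E, alpha=0.1)
```

With alpha = 0 the stimulated update reduces exactly to the free-running update, so the stimulus strength can be turned up gradually from zero.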

The attractor that results depends on the characteristics of the externally applied pattern and the network weights. The attractor does not depend on the state of the net at which the external input was first applied; the external input applied at any state during the chaotic oscillation was observed to yield the same attractor. Thus, in our observation, the chaotic behavior can continue any amount of time without the external input, and regardless of when E is applied, the same dynamic attractor appears. Figure 12 shows a progression in which a chaotically behaving network (Figure 12(a)) receives an external stimulus pattern (Figure 12(b)). The application of the same pattern at an increased strength is shown in Figure 12(c).

FIGURE 11. A case where different initial states lead to different attractors, from a random (64,16) network. (a) An 8-state oscillator; (b) a limit cycle with 8 closed loops; (c) a limit cycle symmetric to part (b). The horizontal axis is average activation at time t+1, a(t+1), and the vertical axis is average activation at time t, a(t).

Figure 13 shows the result of applying a different pattern as external stimulus to the chaotic network used in Figure 12(a). The resulting limit cycle is clearly different from those in Figures 12(b)-(c). The points scattered outside of the closed figure in Figure 13 are transients that occurred after the external pattern was applied and before the limit cycle was entered. These transients demonstrate how few transients (about 25) were needed to transfer the network into the new limit cycle. Transients were stripped in Figure 12.

It is expected that a wide range of different pattern inputs can evoke unique dynamic attractors. With suitable sensitivity studies we will be able to learn the extent to which similar patterns, when applied as external stimuli, evoke the same attractor. Thus there is the potential for performing pattern classification with attractor locking of dynamic neural networks. The classification of the pattern would be read off as the attractor evoked when the pattern is applied to the network. There is some potential for gaining superior classification from dynamic neural networks because dynamical systems can have complicated basin boundaries.

The attractor locking scenario may represent a paradigm that could occur biologically. Freeman has suggested that neural circuits of the olfactory system operate in a chaotic mode in the absence of a stimulus, and that when a stimulus is applied, the network enters a simpler oscillation (a "wing" of the chaotic attractor), whereby recognition or classification occurs [YF90]. The idea of a "wing" of the attractor is analogous to the limit cycle evoked by the applied pattern in the attractor locking scenarios described above. Classification may also occur very quickly when a network transitions from an initial chaotic state [YFBY91]. A fast locking into a limit cycle attractor was observed in Figure 13, where the small number of transients is shown in addition to the limit cycle.

4 Developing Multiple Attractors

Multiple attractors were developed in a dynamic network by a weight modification scheme dependent on past performance. The goal was to develop more than one dynamic attractor in the neural network and to constrain the basins of the attractors so that a set of chosen initial states would be forced into different basins. A chosen state would not have to occur in the attractor of its basin but must be somewhere in the basin. With this scheme, we could find weights that produce multiple dynamic attractors that could be accessed from different initial states.

FIGURE 12. An attractor locking progression. (a) Chaotic state of a random network with no external pattern applied; parameter g has been raised just enough to be at the "edge of chaos" (g = 2.05). (b) An external stimulus pattern has been applied, with strength multiplier 1.1. (c) The same external pattern is applied at a greater strength, with multiplier 1.15. The horizontal axis is average activation at time t+1, a(t+1), and the vertical axis is average activation at time t, a(t).

FIGURE 13. Application of a second external pattern to the same chaotic network as in Figure 12. Transients are shown as well as the limit cycle. Strength multiplier was 1.3.

A pattern classification scheme in which a pattern is imposed as an initial state to the network would require different initial states to lead to different attractors. A static version of the same approach would be the Hopfield associative memory, where initial states of the network (vector patterns) evoke a memory state (a stable state) [Hop82], [Hop84]. This paradigm has limited capacity, and memory states must become fixed-point attractors of the network. More capacity and flexibility were sought by modifying the approach in two ways: (1) to use dynamic attractors in addition to allowing fixed-point attractors and (2) no longer to require that a memory pattern be a part of the attractor.

The neural network weights were changed according to a perturbation schedule that depends on previous success in developing new attractors [DPR94]. The networks were rewarded for increasing the number of attractors by keeping the altered weights. First, a set of n_B states, B = {B_1, B_2, ..., B_{n_B}}, is chosen. The goal is then to develop a network such that B_i, when used as an initial state, gives a different attractor compared with any B_j (i ≠ j). The results of the perturbation analysis depend on the set of initial states B_i used. This set should include a representative sample of initial states needed for an application. In our test cases we generated initial states from a uniform random distribution [-1, 1]. The network was initialized with small random weights.

Throughout the perturbation schedule, small random numbers were added to randomly selected subsets of weights in the network. At each iteration a set of q weights (w_1, w_2, ..., w_q) was selected at random and perturbed as

    w_i ← w_i + m ε_i,    (11)

where ε_i was a random variable from a uniform distribution and m was a multiplier. The new network was then tested with the set of initial states B_i (i = 1, 2, ..., n_B). For each initial state, the network was iterated forward past transients, using (1) and (2), until an attractor or final set of states was reached, and the attractor was classified as fixed-point ("order 0"), an n-cycle oscillator ("order n"), or a final state with no observed repeats. Performance was then evaluated according to the number of distinct attractors. If performance was better with the perturbed weights, then the new weights were saved. Otherwise, the perturbations were discarded. During the perturbation schedule, values of q and m that tended to increase the number of attractors were explored. Preliminary explorations showed that particular ranges of each of these values were auspicious for increasing the attractors in a network.

FIGURE 14. Development of multiple attractors in a 10-neuron network subjected to 10,000 perturbations. There are twenty pattern vectors in the set of initial states B. The number of attractors observed in each iteration is plotted as a function of iteration.
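This schedule can be sketched as a hill climb on the number of distinct attractors. Everything below is our own illustrative reconstruction: the rounded-state attractor "signature", the parameter values, and all names are assumptions, not the chapter's exact procedure.

```python
import numpy as np

def f(x):
    """Symmetric sigmoid of equation (2)."""
    return (1.0 / (1.0 + np.exp(-x)) - 0.5) * 2.0

def signature(W, g, b, n_transient=200, n_obs=20, decimals=3):
    """Crude attractor label: relax from initial state b past transients,
    then collect the set of visited states, rounded to a few decimals."""
    a = b.copy()
    for _ in range(n_transient):
        a = f(g * (W @ a))
    states = set()
    for _ in range(n_obs):
        a = f(g * (W @ a))
        states.add(tuple(np.round(a, decimals)))
    return frozenset(states)

def count_attractors(W, g, B):
    """Number of distinct attractor signatures over the initial states B."""
    return len({signature(W, g, b) for b in B})

def perturb_search(W, g, B, iters=40, q=5, m=0.05, seed=0):
    """Hill climb corresponding to equation (11): perturb q randomly
    chosen weights by m*eps and keep the altered weights only if the
    number of distinct attractors grows."""
    rng = np.random.default_rng(seed)
    best = count_attractors(W, g, B)
    for _ in range(iters):
        W_new = W.copy()
        idx = rng.choice(W.size, size=q, replace=False)
        W_new.flat[idx] += m * rng.uniform(-1.0, 1.0, q)
        score = count_attractors(W_new, g, B)
        if score > best:  # reward: keep the perturbation
            W, best = W_new, score
    return W, best

# 10-neuron network, 10 random initial states in [-1, 1].
rng = np.random.default_rng(1)
W0 = rng.uniform(-0.1, 0.1, (10, 10))
B = [rng.uniform(-1.0, 1.0, 10) for _ in range(10)]
W1, n_attractors = perturb_search(W0, g=3.0, B=B)
```

Rounded-state signatures distinguish fixed points and short finite-state oscillations reliably; ring-like or chaotic activity would need a more robust comparison.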

Figure 14 shows results from a 10-neuron network subjected to 10,000 perturbations. The set of initial states had 100 patterns, chosen at random from a uniform distribution on [-1, 1]. Figure 14 shows the number of attractors found on each iteration. At each iteration, the best network was saved. Figure 15 shows the number of attractors in the best network as a function of iteration. In the best network, after 10,000 iterations, 82 attractors were observed. Most were 2-state oscillators, but many were fixed points.

To exploit the rich dynamic behavior of neural networks for computational purposes, we must first be able to build dynamic attractors into a network through weight adjustment. Although previous work demonstrated transitions from fixed-point to chaotic attractors [DCQS93], it remains necessary to explore how multiple distinct attractors can be developed and accessed through unique initial states. Here we used the number of attractors as the performance criterion when adjusting weights in a recurrent neural network. Attractors could be fixed points or n-cycle oscillators in this experiment. We showed that a perturbation schedule could increase the number of attractors in a network and could organize their basins so that a set of a priori specified patterns fell in distinct basins. The number of attractors developed could easily exceed the capacity of the Hopfield network associative memory. In the example above, ten neurons had twenty distinct attractors, whereas the Hopfield associative memory has a limit of about 0.15n memories (n being the number of neurons) [MPRV87]. The dynamic networks used here offer a wide repertoire of differing attractors and basins. These networks, with multiple dynamic attractors, demonstrate capacity that could ultimately be tapped for engineering tasks, where the attractors could represent pattern classes, memories, optimization solutions, or control actions. Different initial states would drive the network into these different final attractors, and a set of a priori given states could be preselected to be in the same or differing basins of attraction.

124 Dayhoff, Palmadesso, Richards, and Lin

FIGURE 15. The maximum number of attractors observed so far, at each iteration, is graphed as a function of iteration. This data is from the same simulation as Figure 14, with 10,000 perturbations on a 10-neuron network.

5 Attractor Basins and Dynamic Binary Networks

To develop the use of dynamic attractors in computational paradigms, it is helpful to characterize the wide repertoire of attractors and basins of attraction that can be generated by neural networks. The neural network's capacity for attractors and for differing basins of attraction must be assessed. Different neural networks, with different numbers of neurons and differing weights, are expected to have different numbers and types of attractors as well as different basin boundaries between those attractors. Here we show results of studying basins of attraction in recurrent binary networks. Binary neural networks were proposed and studied in earlier work [Ama72a] [Ama72b] [AM88] [Ami89] [Dem89] [HTV94] [Koh74] [VPPS90], but here we emphasize the flexibility possible in attaining different sets of basins of attraction.

Figures 16-19 show network transition graphs (NT-graphs) for neural networks with three neurons. An NT-graph has as nodes all the states possible for a binary network and as edges all transitions that the network could make from one state to another. The figures have eight nodes, representing the eight (2^3) possible binary states of the 3-neuron network. The NT-graph depends on the neural network weights. However, many different weight matrices can yield the same NT-graph.

FIGURE 16. Network transition graph for a neural network with n = 3 processing units. There is one attractor, a 4-state oscillator, and thus one basin. The basin class is /01234567/, where nodes are numbered 0-7 and / delimits basins.
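A minimal sketch of computing an NT-graph and its basin class follows, assuming a synchronous binary update xi ← sign(Σj wij xj) with sign(0) taken as +1 (the chapter's exact update convention is not reproduced here).

```python
import numpy as np

def nt_graph(W):
    """Successor of each binary state under the synchronous update
    x <- sign(W x), with sign(0) taken as +1. State k encodes the
    vector whose i-th component (most significant bit first) is +1
    when the corresponding bit of k is set."""
    n = W.shape[0]
    succ = []
    for k in range(2 ** n):
        x = np.array([1 if (k >> (n - 1 - i)) & 1 else -1 for i in range(n)])
        y = np.where(W @ x >= 0, 1, -1)
        succ.append(sum(1 << (n - 1 - i) for i in range(n) if y[i] == 1))
    return succ

def attractor_of(succ, s):
    """Canonical id (smallest state) of the cycle reached from state s."""
    v = s
    for _ in range(len(succ)):      # after 2^n steps we must be on the cycle
        v = succ[v]
    smallest, u = v, succ[v]
    while u != v:                   # walk the cycle once to find its minimum
        smallest, u = min(smallest, u), succ[u]
    return smallest

def basin_class(succ):
    """Format the partition of states into basins in /.../.../ notation."""
    basins = {}
    for s in range(len(succ)):
        basins.setdefault(attractor_of(succ, s), []).append(s)
    return "/" + "/".join("".join(map(str, b)) for b in sorted(basins.values())) + "/"
```

With the identity weight matrix, for example, every state is a fixed point, so the basin class is /0/1/2/3/4/5/6/7/; the single-character state labels in this format are only unambiguous for n = 3.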

Each NT-graph shows all oscillations and fixed-point attractors, along with their basins of attraction. In Figure 16, there is one oscillator, oscillating among four states, with all eight states in its basin of attraction. In Figure 17, there are two fixed-point attractors and two 3-state oscillators. In Figure 18, there are two fixed-point attractors and one 2-state oscillator, and Figure 19 also has two fixed-point attractors and one 2-state oscillator. A basin class is a set of neural networks that have the same basins. The attractors in those basins and the paths to those attractors, however, may be different. The networks that produce the NT-graph in Figure 18 are in the same basin class as those for Figure 19, as the basins are the same even though the attractors and paths differ.

FIGURE 17. Network transition graph for a neural network with n = 3 processing units. There are four attractors, consisting of two fixed points and two 3-state oscillators, and thus four basins, with basin class /0/124/356/7/. Adapted with permission from [DP95].

FIGURE 18. Network transition graph for a neural network with n = 3 processing units. There are three attractors, consisting of two fixed points and one 2-state oscillator, and thus three basins, with basin class /0347/15/26/. Adapted with permission from [DP95].

FIGURE 19. Network transition graph for a neural network with n = 3 processing units. There are three attractors, consisting of two fixed points and one 2-state oscillator, with basin class /0347/15/26/. This is the same basin class as in Figure 18. Adapted with permission from [DP95].

To study the multiplicity of basin classes in dynamic binary networks, networks were constructed with random weights, and the state-to-state transitions of each network were simulated. The basin class was then computed and compared with basin classes previously observed. Two thousand networks of each size were simulated, with three, four, and five neurons. Figure 20 shows the results, where the number of unique basin classes is plotted as a function of the number of networks simulated. Figure 20(a) applies to networks with three neurons, and the number of basin classes observed rises to 17. For networks with four neurons (Figure 20(b)), the number of basin classes rises steeply to more than 600, with the slope getting smaller, indicating that the number is leveling off. In Figure 20(c), the slope is high for the first 2000 networks, with a new basin class observed for almost every network simulated. More than 1400 basin classes were observed among just 2000 networks.
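The sampling experiment can be mimicked as below. The update rule x ← sign(Wx) with sign(0) taken as +1 is an assumption, so the counts produced will not match Figure 20 exactly; a comma-separated variant of the basin-class notation is used so that the format also works for n > 3.

```python
import numpy as np

rng = np.random.default_rng(42)

def basin_class_of(W):
    """Basin-class string of a binary network under x <- sign(W x)."""
    n = W.shape[0]
    succ = []
    for k in range(2 ** n):   # successor of each of the 2^n binary states
        x = np.array([1 if (k >> (n - 1 - i)) & 1 else -1 for i in range(n)])
        y = np.where(W @ x >= 0, 1, -1)
        succ.append(sum(1 << (n - 1 - i) for i in range(n) if y[i] == 1))
    basins = {}
    for s in range(2 ** n):
        v = s
        for _ in range(2 ** n):            # land on the attractor cycle
            v = succ[v]
        smallest, u = v, succ[v]
        while u != v:                      # canonical id: smallest cycle state
            smallest, u = min(smallest, u), succ[u]
        basins.setdefault(smallest, []).append(s)
    return "/" + "/".join(",".join(map(str, b)) for b in sorted(basins.values())) + "/"

n = 3
classes = set()
growth = []                                # unique classes after each network
for _ in range(2000):
    W = rng.uniform(-1, 1, (n, n))         # w_ij and w_ji drawn independently
    classes.add(basin_class_of(W))
    growth.append(len(classes))
print(f"n={n}: {growth[-1]} basin classes among 2000 random networks")
```

Plotting `growth` against the network count reproduces the kind of saturation curve shown in Figure 20(a).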

The "basin class capacity" is the capacity of a set of neural networks to exhibit a variety of different basin classes. Thus binary neural networks have a basin class capacity of over a thousand classes when the network has only five neurons, and the increase in basin class capacity with increasing numbers of neurons appears to be very rapid. Our results indicate that dynamic networks have extremely high basin class capacity, even when we consider binary networks alone [DP95]. This high capacity leads to much flexibility in the way that the basins of attraction divide the set of neural network states. For future applications, adjustment of the basin boundaries could be more important than the attractor in the basin. The basin boundaries determine which attractor the network goes into, and in an application, the attractor could represent the answer, result, or memory recalled from the neural network's computation. The exact nature of the attractor (fixed or oscillating) and its location (the particular state(s) involved) can be less important than the basin boundaries. Since we aim eventually to have paradigms that allow adjustment and training of attractor basin boundaries, we have accomplished the first step towards this aim, which is to explore how many sets of basins and basin boundaries are possible with weight adjustments.


FIGURE 20. The number of basin classes found as a function of the number of networks simulated. Two thousand networks of each size (n = 3, 4, and 5) were simulated. Weights were randomly selected from -1 to 1, and reciprocal weights were allowed to be unequal (wij ≠ wji). (a) For n = 3 there is initially a steep rise; then the graph flattens off, with little increase in the number of basin classes found. (b) For n = 4, there is an initial steep rise that begins to become less steep by the 2000th network. (c) For n = 5, the initial steepness continues throughout the first 2000 nets, and the flattening out must occur later. Part (c) reprinted with permission from [DP95].


6 Time Delay Mechanisms and Attractor Training

Biological systems have anatomical and physiological mechanisms that force time delays on interconnections, thus providing a way of putting time delays in the neural network's computational circuitry. Propagation of action potentials (APs) from soma to synapse takes time, and speeds and distances vary. The diffusion of chemicals across the synapse takes time, as do postsynaptic potentials (PSPs) and integration of membrane potentials at the neuronal cell body and dendrites. In consideration of these biological mechanisms for time delays, we have used artificial neural networks that incorporate time delays on interconnections. The time-delay neural network (TDNN), proposed by Waibel, can have arbitrary delays on any connection and multiple connections between two units with different delays on each connection [WLH90]. The TDNN network is trained by adapting weights. The adaptive time-delay neural network (ATNN) adapts time delays in addition to weights [DD93], [DD91], [LDL92b], [LDL92a].
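The forward computation of a single time-delay unit can be sketched as follows; the tanh squashing function and the integer-valued delays are illustrative assumptions, and the gradient rules that adapt the weights (TDNN) and the delays (ATNN) are omitted.

```python
import numpy as np

def tdnn_unit(x_hist, weights, delays, t):
    """Output of one time-delay unit at time t. Connection j carries its own
    weight w_j and integer delay tau_j, so the unit sums w_j * x_j(t - tau_j)
    over its input lines and squashes the result."""
    total = sum(w * x_hist[j][t - d]
                for j, (w, d) in enumerate(zip(weights, delays)))
    return np.tanh(total)

# two input lines; the delays reach back to earlier samples on each line
x_hist = [[0.0, 1.0, 0.0, 0.0, 0.5],
          [0.2, 0.0, 0.0, 1.0, 0.0]]
y = tdnn_unit(x_hist, weights=[0.8, -0.5], delays=[3, 1], t=4)
```

In the TDNN only the weights are trained; the ATNN treats the delays tau_j as trainable parameters as well [DD93].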

We have explored how an ATNN with a recurrent loop can be trained to have an a priori specified dynamic attractor, and thus to act as a dynamic network [LDL94], [LDL95], [LDL93]. This ATNN network used time delays along all interconnections and adapted both its weights and time delays during training. A connection from the network's output layer was made to the input, so that the results of the output neurons were used as input during the next time step. At each iteration, the network input was a segment of a trajectory, and the network produced the values for the next position along the trajectory. There were two output units and two input units, each pair specifying x(i) and y(i) of the trajectory in the x,y-plane.

Figure 21 was generated when the network was trained on a circular, closed-loop trajectory. Figure 21(a) shows the network during training and Figure 21(b) shows the results after training was completed, when the network learned to generate the circular trajectory correctly. The network was given an initial segment of the circle, and it completed the figure, using its own output as successive inputs. Thus, at each point along the circle, the network took a segment of the circle and predicted the next point.

Figure 21(c) shows the noise resilience of the trained network. At the beginning of the experiment, an initial segment was generated with noise added to a portion of the circle. This segment was submitted to the neural network, and the network was still able to generate the circular trajectory, regenerating the circle on the second time around almost perfectly (Figure 21(d)). Figures 21(e) and 21(f) show initial segments that are smaller and larger than the circle, respectively, and the network spirals to generate the trained circle. These results suggest that the trajectory trained into the ATNN network is in fact an attractor of the network. The circular trajectory can be considered a limit cycle attractor because of the repeated sampling of points along the circular figure. Each time around the circle, a different set of points can be generated so as to fill in the circular drawing. Since the networks always arrived at the trajectory for which they were trained, the initial arcs used were within the basin of attraction for the trained figure—the attractor.

FIGURE 21. Trajectory generated by a network trained to produce a circular trajectory. (a) The trajectory generated during training on the circle. (b) The trajectory generated after training, which closely follows the circle. (c) The initial segment is noisy, but the network recovers the circular trajectory. (d) The next circle after (c) is generated almost perfectly. (e) The initial segment is from a smaller circle, and the network's trajectory spirals out to the original circle. (f) The initial segment is from a larger circle, and the network's trajectory spirals in to the original circle. Parts (a) and (b) reprinted with permission from [LDL93]. Parts (e) and (f) reprinted with permission from [LDL95].
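The closed-loop behavior can be illustrated with a toy stand-in for the trained network: a hand-built map that rotates the current point and relaxes its radius toward 1, making the unit circle a limit cycle. This is not the ATNN itself (it has no time delays and no training), only a sketch of how feeding each output back as the next input converges to a circular attractor; the step sizes are arbitrary.

```python
import numpy as np

def next_point(p, rate=0.2, dtheta=0.2):
    """Stand-in predictor: rotate (x, y) by dtheta and pull the radius
    toward 1, so the unit circle is an attracting closed trajectory."""
    r = np.hypot(p[0], p[1])
    theta = np.arctan2(p[1], p[0]) + dtheta
    r = r + rate * (1.0 - r)               # radius relaxes toward the circle
    return np.array([r * np.cos(theta), r * np.sin(theta)])

p = np.array([1.8, 0.0])                   # start on a larger circle, as in Fig. 21(f)
traj = [p]
for _ in range(300):                       # feed each output back as the next input
    p = next_point(p)
    traj.append(p)
radii = [np.hypot(q[0], q[1]) for q in traj]
```

Started inside the circle instead, the same loop spirals outward, mirroring the behavior of Figures 21(e) and 21(f).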

Once a chosen attractor is trained into a neural network, the network's ability to produce that attractor can be utilized for application purposes. Thus an autonomous or controlled system, such as a robotic arm, a vehicle, or another moving object, could be trained to generate repetitive desired motions, and it could attain this repetitive motion from arbitrary starting trajectories. The starting trajectories would have to be in the basin of attraction for the attractor that is the trained repetitive motion. The results here indicate that the basin of attraction can be quite large, and thus the trained motions would be quite stable and would be able to recover in the face of perturbations.

7 Timing of Action Potentials in Impulse Trains

An entirely new realm of possibilities arises when impulses are used to communicate signals between neurons. This construct occurs in biology, where neurons produce action potentials (APs) that travel along axons. These APs are fast waves of depolarization that travel at speeds exceeding those of other biological mechanisms for communication. Impulse trains are trains of action potentials spaced over time, with varying time intervals between them. The brain thus includes a massively parallel impulse train generator and processor. Simultaneously generated impulse trains can have patterns that are a function of the activity of ensembles of neurons. Patterns and synchronies in these impulse trains furnish important putative codes for information transmission and processing in the brain. Models can incorporate spiking neurons, temporal patterns, or coincidences in the impulse trains, and sometimes attractor states [LAM+96].

Usually, artificial neural models use activation level parameters, which are continuous real-valued numbers that are communicated from one processing unit to another. A naive assumption is that the activation level in neural network models reflects firing rates in biological neural systems. While firing rates appear to play an information encoding role in some biological subsystems, it seems likely that a more complex processing scheme is enabled by the action potentials of neurons, based on a set of computational schemes that goes beyond simple firing rate encoding.

Simultaneously recorded nerve impulse trains appear as in Figure 22. Typically, the waveform is the same on each impulse recorded from the same neuron and, as a result, is not expected to carry information. Thus, the placement of impulses in time must represent, process, and carry the information.

FIGURE 22. Simultaneously recorded nerve impulse trains from cells 1 through 30 (simulated data).

Temporal patterns have been examined in nerve impulse trains. Favored patterns are firing patterns that repeat in exact or approximate form over an extended period of time (Figure 23). Their occurrences may be placed arbitrarily in time, or they may be periodic, occurring at equal intervals. Methods have been developed for identification of recurring temporal patterns that are statistically significant [DG83a], [DG83b]. These methods overcome the problem that some number of coincidental recurrences is expected at random. The methods realistically identify neural recordings that contain recurring patterns unusually often, according to statistical tests. Favored patterns have been found in single-unit recordings and in multiple-unit recordings [DG83a], [DG83b], [AG88], [FFH90]. This research has shown the presence of favored temporal patterns in neural recordings from a variety of preparations (crayfish claw, cat visual cortex, cat brainstem). These intriguing results contribute to the accumulating studies and analysis of nerve impulse timing [NZJE96], [Les96], [SZTM96], [JSB97], [RWSB96], [MZO93], [Hop95], [SZ95], [TGK94], [Day87].
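The published detection methods rest on statistical significance tests [DG83a]; the sketch below implements only the matching step, finding interval "words" that recur within a jitter tolerance in a single simulated train, with all thresholds chosen arbitrarily.

```python
def favored_patterns(spikes, length=3, jitter=1.0, min_count=3):
    """Return interval patterns ('words') of `length` successive interspike
    intervals that recur at least `min_count` times, each interval matching
    within +/- jitter. No significance test is applied here."""
    intervals = [b - a for a, b in zip(spikes, spikes[1:])]
    words = [tuple(intervals[i:i + length])
             for i in range(len(intervals) - length + 1)]

    def match(u, v):
        return all(abs(a - b) <= jitter for a, b in zip(u, v))

    found = []
    for w in words:
        if sum(match(w, v) for v in words) >= min_count:
            if not any(match(w, f) for f in found):   # report each pattern once
                found.append(w)
    return found

# a simulated train in which the interval word (3, 5, 2) occurs three times
train = [0, 3, 8, 10, 17, 20, 25, 27, 36, 39, 44, 46]
```

A real analysis would compare the recurrence counts against the number of coincidental matches expected at random, as the favored-pattern methods do.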

Temporal patterns are consistent with models that include dynamic attractors, as oscillating attractors can produce repeating temporal patterns among one or more neurons. A temporal pattern could be elicited in exact or approximate form each time that a section of an oscillating attractor is revisited.

FIGURE 23. Particular firing patterns re-occur in the nerve impulse trains above, with some variation in interspike interval on each occurrence. The third line shows the pattern at the top occurring with an extra spike. Data was simulated. Reprinted with permission from [Day87]. © 1987 IEEE.

In multiple-unit recordings, it is cogent to evaluate data for the presence of temporal synchronies (and other patterns) among groups of two or more units. A synchrony would occur when a group of neurons each fire an impulse at approximately the same time. The study of multiunit synchronies is highly motivated for the following reasons. Neurons are natural recognizers of synchrony arriving at presynaptic sites, as synchronous stimuli sum more effectively when postsynaptic potential peaks coincide. Synchronous groups can stimulate postsynaptic activity faster than individual neurons can. Synchronies play a role in LTP learning, and synchronous groups are consistent with models of neural processing. In addition, synchronous groups can multiplex firing rate codes. Methods for identification of synchronies have been developed, synchronies have been observed in biologically recorded systems, and evidence of ensemble coding has been found [Day95], [GPD85], [LHM+92], [GKES89], [RCF96], [CDSS97], [GSM96].
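A minimal sketch of counting near-coincident group firing events follows; the tolerance window and the anchoring of the search on the first train are arbitrary choices, and the published methods additionally assess how many coincidences would be expected by chance [Day95], [GPD85].

```python
import bisect

def synchrony_events(trains, tol=2.0):
    """Count events in which every train fires within +/- tol of a spike on
    the first train. `trains` is a list of sorted spike-time lists."""
    events = 0
    for t in trains[0]:
        hit = True
        for tr in trains[1:]:
            i = bisect.bisect_left(tr, t - tol)   # first spike at or after t - tol
            if i == len(tr) or tr[i] > t + tol:
                hit = False
                break
        if hit:
            events += 1
    return events

# three simulated trains with near-coincident spikes around t = 5, 20, and 60
trains = [[5.0, 20.0, 41.0, 60.0],
          [5.5, 19.0, 33.0, 61.5],
          [4.8, 21.0, 50.0, 59.9]]
n_sync = synchrony_events(trains, tol=2.0)
```

The spike at t = 41 on the first train finds no partner on the others, so only the three group events are counted.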

In a synchrony code representation, near-coincident firing of an ensemble would represent information or its processing during cognitive tasks. The event of synchronous firing, however, would last only an instant unless repeated. Repetitions could occur at regular periods or irregularly over time. Clearly, the brain has a mechanism to sustain a representation over an arbitrary period of time, because we can imagine an image or consider an idea for any chosen length of time. Thus the proposed synchrony code could allow for sustained representations by repetition of the synchronous firing. Repetitions could in turn be caused by oscillations, or attractors, in the network dynamics. Thus, synchronies are consistent with models of dynamic attractors that oscillate to produce repeated synchronous events. Some models of networks of spiking neurons have shown synchronies, temporal patterns, or oscillations and attractors [PCD96], [MR96], [TH96], [Kel95].

8 Discussion

The research described here is motivated by the temporal dynamics of living neural systems and especially by the temporal abilities of humans and higher animals. Our brains can respond to time-varying signals, can generate time-varying patterns, can process information (think) over time, can represent concepts and images mentally for arbitrary intervals of time, and have differing states of ongoing, self-sustained activity (awake, aroused, sleeping). Furthermore, we seem automatically to have time-related skills and dynamics such as recognition of spatiotemporal patterns as they occur; coordination of internal processing in the brain, in spite of no apparent controlling time clock; and the presence of self-sustained dynamic activity in many areas of the brain, through oscillation (e.g., respiratory neurons) or other more complex continuing activity ("spontaneous activity"). This extensive array of temporal capabilities and time-varying activity points to a temporally dynamic neural network underlying these processes. To date, many neural models show pattern mapping abilities but lack the dynamics and temporal behavior of the systems they are intended to model.

We have explored a series of paradigms that concern dynamic activity in neural networks. We have illustrated how a simple model of a neural network can develop dynamic attractors, self-sustained activity, and chaos. Control over a weight-multiplier parameter allows modulation of the dynamics, with a progression from a simple fixed-point attractor to chaos. Once we generate chaotic activity patterns in a neural network, we can apply a stimulus pattern and lock the network into a limit cycle attractor. This scenario offers a potential way to perform pattern recognition and signal classification. Because dynamic systems can have complicated basin boundaries for their attractors, there is reason to expect increased performance and generalization capabilities from this type of approach.

Developing multiple attractors in a neural network can be accomplished via an accretional method with weight perturbations. In the resulting network, each of a set of initial states evokes its own attractor. Computational tasks in pattern classification and associative memory could be accomplished through differing initial states evoking differing dynamic attractors.

In dynamic binary networks, exploration of attractor basins and the flexibility of those basins of attraction showed capacities for attractors to be considerably higher than the number of memories in the static Hopfield network (0.15n). With as few as five neurons in a dynamic binary network, thousands of basin classes—divisions of patterns into different basins—can be accomplished.

To train a specific attractor into a neural network, a neural network with time delays was trained to generate a closed-loop trajectory. The trained network generates this trajectory in spite of noisy starting conditions, and with differing initial segments. The result is a robust signal and path generator for communications and control applications.

Impulse trains add a new dimension of spatiotemporal processing in biological neural systems. Temporal patterns of nerve impulses and synchronies among ensembles of neurons are putative codes for information processing and representation. The firing activity of neurons and neural ensembles could reflect transients and dynamic attractors superimposed on the impulse train structure of biological neural processing.

The general problem of recognition and generation of spatiotemporal signals appears solvable with dynamic neural networks, although much research remains to be done. The ability to generate and train self-sustained activity, based on dynamic oscillating attractors, is shown in the preliminary results described here.

As biological systems have indisputable power in the temporal domain, we experiment with tapping their mechanisms for artificial systems. Mechanisms that appear in biological systems include time delays, recurrent loops, and the adjustment of synaptic strengths. Our models lead to self-sustained activity, dynamic attractors, and the training of those attractors. Whereas observations of living neural systems catch them in the act of evolving increasingly powerful structures, we are beginning to develop a spectrum of dynamic and temporal neural networks that have far more potential than previous networks. Ultimately, we hope to exploit, in human-made systems, the mechanisms responsible for the power of biological systems in the temporal domain.

9 Acknowledgments

J. Dayhoff was supported by the Naval Research Laboratory (Special Project on Nonlinear Systems and Contract N00014-90K-2010), the National Science Foundation (Grants CDR-88-03012 and BIR9309169), the Institute for Systems Research at the University of Maryland, and the Air Force Office of Scientific Research (Summer Faculty Research Program, Phillips Laboratory, Kirtland Air Force Base). P. Palmadesso and F. Richards acknowledge support from the Office of Naval Research. D.-T. Lin was supported by the Applied Physics Laboratory of Johns Hopkins University. Thanks go to Greg Tarr, Lenore McMackin, Ed Ott, B. Doyon, B. Cessac, Manuel Samuelides, and Ira Schwartz for stimulating discussion on this and related topics.

10 REFERENCES

[AG88] M. Abeles and G. L. Gerstein. Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. Journal of Neurophysiology, 60(3):909-924, 1988.

[AM88] S.-I. Amari and K. Maginu. Statistical neurodynamics of associative memory. Neural Networks, 1:63-73, 1988.

[Ama72a] S.-I. Amari. Characteristics of random nets of analog neuron-like elements. IEEE Trans. on Systems, Man, and Cybernetics, 2(5):643-657, 1972.

[Ama72b] S.-I. Amari. Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Trans. on Computers, 21(11):1197-1206, 1972.

[Ami89] D. Amit. Modelling Brain Function. Cambridge University Press, Cambridge, U.K., 1989.


[CDQS94] B. Cessac, B. Doyon, M. Quoy, and M. Samuelides. Mean-field equations, bifurcation map and chaos in discrete time neural networks. Physica D, 74:24-44, 1994.

[CDSS97] D. Contreras, A. Destexhe, T. J. Sejnowski, and M. Steriade. Spatiotemporal patterns of spindle oscillations in cortex and thalamus. Journal of Neuroscience, 17(3):1179-1196, 1997.

[Day87] J. E. Dayhoff. Detection of favored patterns in the temporal structure of nerve cell connections. Proceedings First International Conference on Neural Networks, 3:63-77, 1987.

[Day94] J. E. Dayhoff. Artificial neural networks: biological plausibility. Abstracts, Toward a Scientific Basis for Consciousness, University of Arizona, Tucson, Arizona, 1994.

[Day95] J. E. Dayhoff. Synchrony detection in neural assemblies. Biological Cybernetics, 71(3):263-270, 1995.

[DCQS93] B. Doyon, B. Cessac, M. Quoy, and M. Samuelides. Control of the transition of chaos in neural networks with random connectivity. International Journal of Bifurcation and Chaos, 3(2):279-291, 1993.

[DD91] S. P. Day and M. Davenport. Continuous-time temporal back-propagation with adaptive time delays. Neuroprose archive, Ohio State University. Accessible on Internet via anonymous ftp on archive.cis.ohio-state.edu, in pub/neuroprose/day.tempora.ps, August 1991.

[DD93] S. P. Day and M. R. Davenport. Continuous-time temporal back-propagation with adaptive time delays. IEEE Trans. on Neural Networks, 4(2):348-354, March 1993.

[Dem89] A. Dembo. On the capacity of associative memories with linear threshold functions. IEEE Trans. on Information Theory, 35(4):709-720, 1989.

[DG83a] J. E. Dayhoff and G. L. Gerstein. Favored patterns in spike trains. I. Detection. Journal of Neurophysiology, 49(6):1334-1348, June 1983.

[DG83b] J. E. Dayhoff and G. L. Gerstein. Favored patterns in spike trains. II. Application. Journal of Neurophysiology, 49(6):1349-1363, June 1983.


[DP95] J. E. Dayhoff and P. J. Palmadesso. Capacity for basin flexibility in dynamic binary networks. Proceedings of World Congress on Neural Networks (WCNN), 1:365-368, 1995.

[DPR94] J. E. Dayhoff, P. J. Palmadesso, and F. Richards. Developing multiple attractors in a recurrent neural network. Proceedings of World Congress on Neural Networks (WCNN), 4:710-715, 1994.

[FFH90] R. D. Frostig, Z. Frostig, and R. M. Harper. Recurrent discharge patterns in multiple spike trains. Biological Cybernetics, 62:487-493, 1990.

[GKES89] C. M. Gray, P. Konig, A. K. Engel, and W. Singer. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338:334-337, 1989.

[GPD85] G. L. Gerstein, D. H. Perkel, and J. E. Dayhoff. Cooperative firing activity in simultaneously recorded populations of neurons: Detection and measurement. Journal of Neuroscience, 5(4):881-889, April 1985.

[GSM96] D. M. Gothard, W. E. Skaggs, and B. L. McNaughton. Dynamics of mismatch correction in the hippocampal ensemble code for space: interaction between path integration and environmental cues. Journal of Neuroscience, 16(24):8027-8040, 1996.

[Hop82] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 1982.

[Hop84] J. J. Hopfield. Neurons with graded responses have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, USA, 81, 1984.

[Hop95] J. J. Hopfield. Pattern recognition computation using action potential timing for stimulus representation. Nature, 376:33-36, 1995.

[HTV94] J. Hao, S. Tan, and J. Vandewalle. A new approach to the design of discrete Hopfield associative memories. Journal of Artificial Neural Networks, 1(2):247-266, 1994.

[JSB97] D. Jaeger, E. De Schutter, and J. M. Bower. The role of synaptic and voltage-gated currents in the control of Purkinje cell spiking: a modeling study. Journal of Neuroscience, 17(1):91-106, 1997.


[Kel95] J. A. S. Kelso. Dynamic Patterns: The Self-Organization of Brain and Behavior. MIT Press, Cambridge, MA, 1995.

[Koh74] T. Kohonen. An adaptive associative memory principle. IEEE Trans, on Computers, C-23:444-445, 1974.

[LAM+96] A. V. Lukashin, B. R. Amirikian, V. L. Mozhaev, G. L. Wilcox, and A. P. Georgopoulos. Modeling motor cortical operations by an attractor network of stochastic neurons. Biological Cybernetics, 74:255-261, 1996.

[LDL92a] D.-T. Lin, J. E. Dayhoff, and P. A. Ligomenides. Adaptive time-delay neural network for temporal correlation and prediction. In Intelligent Robots and Computer Vision XI: Biological, Neural Net, and 3-D Methods, Proc. SPIE, volume 1826, pages 170-181, Boston, November 1992.

[LDL92b] D.-T. Lin, J. E. Dayhoff, and P. A. Ligomenides. A learning algorithm for adaptive time-delays in a temporal neural network. Technical Report SRC-TR-92-59, Systems Research Center, University of Maryland, College Park, MD 20742, May 15, 1992.

[LDL93] D.-T. Lin, J. E. Dayhoff, and P. A. Ligomenides. Learning spatiotemporal topology using an adaptive time-delay neural network. In World Congress on Neural Networks, volume 1, pages 291-294, Portland, OR, 1993. INNS, New York.

[LDL94] D.-T. Lin, J. E. Dayhoff, and P. A. Ligomenides. Prediction of chaotic time series and resolution of embedding dynamics with the ATNN. In World Congress on Neural Networks, volume 2, pages 231-236, San Diego, CA, 1994. INNS Press, New York.

[LDL95] D.-T. Lin, J. E. Dayhoff, and P. A. Ligomenides. Trajectory production with the adaptive time-delay neural network. Neural Networks, 8(3):447-461, 1995.

[Les96] R. Lestienne. Determination of the precision of spike timing in the visual cortex of anaesthetised cats. Biological Cybernetics, 74:55-61, 1996.

[LHM+92] B. G. Lindsey, Y. M. Hernandez, K. F. Morris, R. Shannon, and G. L. Gerstein. Dynamic reconfiguration of brain stem neural assemblies: respiratory phase-dependent synchrony versus modulation of firing rates. Journal of Neurophysiology, 67:923-930, 1992.

Page 152: NNPattern

140 DayhofF, Palmadesso, Richards, and Lin

[MPRV87] R. J. McEliece, E. C. Posner, E. R. Rodemich, and S. S. Venkatesh. The capacity of the hopfield associative memory. IEEE Trans, on Information Theory, 33:461-482, 1987.

[MR96] I. MeiUjson and E. Ruppin. Optimal firing in sparsely-connected low-activity attractor networks. Biological Cybernetics, 74:479-485, 1996.

[MZ093] J. W. McClurkin, J. A. Zarbock, and L. M. Optican. Temporal codes for colors, patterns, and memories. Cerebral Cortex, 10:443-467, 1993.

[NZJE96] H. Napp-Zinn, M. Jansen, and R. Eckmiller. Recognition and tracking of impulse patterns with delay adaptation in biology-inspired pulse processing neural net (bpn) hardware. Biological Cybernetics, 74:449-453, 1996.

[PCD96] O. Parodi, P. Combe, and J.-C. Ducom. Temporal coding in vision: coding by the spike arrival times leads to oscillations in the case of moving targets. Biological Cybernetics, 74:497-509, 1996.

[PD95] P. J. Palmadesso and J. E. Dayhoff. Attractor locking in a chaotic network: stimulus patterns evoke limit cycles. Proceedings of World Congress on Neural Networks (WCNN), 1:254-257, 1995.

[QDS95] M. Quoy, B. Doyon, and M. Samuelides. Dimension reduction by learning in a discrete time chaotic neural network. Proceedings of World Congress on Neural Networks (WCNN), 1:300-303, 1995.

[RCF96] R. Ratnam, C. J. Condon, and A. S. Feng. Neural ensemble coding of target identity in echolocating bats. Biological Cybernetics, 75:153-162, 1996.

[RWSB96] F. Rieke, D. Warland, R. D. R. V. Steveninck, and W. Bialek. Spikes: Exploring the Neural Code. MIT Press, Cambridge, MA, 1996.

[SCS88] H. Sompolinsky, A. Crisanti, and H. J. Sommers. Chaos in random neural networks. Physical Review Letters, 61(3):259-262, 1988.

[SZ95] C. F. Stevens and A. Zador. Neural coding: The enigma of the brain. Current Biology, 5:1370-1371, 1995.

Page 153: NNPattern

4. Patterns of Dynamic Activity and Timing 141

[SZTM96] D. Scheuer, J. Zhang, G. M. Toney, and S. W. Mifflin. Temporal processing of aortic nerve evoked activity in the nucleus of the solitary tract. Journal of Neurophysiology, 76(6):3750-3757, 1996.

[TGK94] D. W. Tank, A. Gelperin, and D. Kleinfeld. Odors, oscillations and waves: Does it all compute? Science, 265:1819-1820, 1994.

[TH96] P. Tass and H. Hermann. Synchronized oscillations in the visual cortex—a synergetic model. Biological Cybernetics, 74:31-39, 1996.

[VPPS90] S. S. Venkatesh, G. Pancha, D. Psaltis, and G. Sirat. Shaping attraction basins in neural networks. Neural Networks, 3:613-623, 1990.

[WLH90] A. Waibel, K. J. Lang, and G. E. Hinton. A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-43, 1990.

[YF90] Y. Yao and W. J. Freeman. Model of biological pattern recognition with spatially chaotic dynamics. Neural Networks, 3(2):153-170, 1990.

[YFBY91] Y. Yao, W. J. Freeman, B. Burke, and Q. Yang. Pattern recognition by a distributed neural network: an industrial application. Neural Networks, 4:103-121, 1991.

Page 154: NNPattern

Chapter 5

A Macroscopic Model of Oscillation in Ensembles of Inhibitory and Excitatory Neurons

Joydeep Ghosh, Hung-Jen Chang, and Kadir Liano

ABSTRACT Very large networks of neurons can be characterized in a tractable and meaningful way by considering the average, or ensemble, behavior of groups of cells. This paper develops a mathematical model to characterize a homogeneous neural group at a macroscopic level, given a microscopic description of individual cells. This model is then used to study the interaction between two neuron groups. Conditions that lead to oscillatory behavior in both excitatory and inhibitory groups of cells are determined. Using Fourier series analysis, we obtain approximate expressions for the frequency of oscillations of the average input and output activities and quantitatively relate them to other network parameters. Computer simulation results show these frequency estimates to be quite accurate.¹

1 Introduction

Biological neural networks consist of very large numbers of neurons. The human brain has on the order of 10^11 neurons, with an average connectivity in the thousands. Faced with numbers of this magnitude, it is impossible and meaningless to model every single neuron and its interactions with the entire system. To gain insight into the complex functions performed by neural systems, an understanding of the overall network in terms of ensemble, or group, behavior and group interaction is required that is not overwhelmed by the details of individual neurons [Ama72, Ede87]. Such macroscopic models are useful in studying collective behavior of biological neural systems and, in particular, macroscopic oscillations in cell assemblies.

¹ Supported by NSF grant ECS-9307632, AFOSR contract F49620-93-1-0307, and ARO contract DAAH04-95-10494.

Oscillatory phenomena have been widely observed in cortical circuits at similar frequencies (in the 35-60 Hz range) and at many different spatial scales. They occur in single neurons, within small (10-100 cells) neural networks, and in large (over 10K cells) networks. Oscillations are considered fundamental to memory storage and temporal association functions in biological organisms [Bow90, vB86, Pav73].

Large-scale rhythmic/oscillatory phenomena are integral to the dynamic timing mechanisms for heartbeat, EEG waves, breathing, walking, and other activities. Recent experiments by Gray and Singer [GS89] and by Traub, Miles, and Wong [TMW89], among others, show oscillations occurring at the level of local populations of cortical neurons. Rhythmic patterns emerge even though single neurons may fire asynchronously. Some of the remarkable characteristics of cortical oscillations are the synchronization of oscillations between spatially disparate cell assemblies, and phase locking [AB89, GS89]. Such experiments, together with previous theoretical investigations [vB86], give credence to the labeling hypothesis, wherein cell assemblies are established through oscillations and labeled by their phase and/or frequency. The labeling hypothesis postulates that neural information processing is intimately related to the temporal relationships between the labels of different populations.

To study and characterize the behavior of large populations of neurons, several researchers have developed macroscopic models that characterize the aggregate population behavior, much as statistical mechanics obtains global quantities like pressure and temperature starting from a molecular-level description of gases. Pioneering research in developing macroscopic models of large neural ensembles was performed by Amari, who studied characteristics of random networks of threshold logic elements (McCulloch-Pitts formal neurons) [Ama71] and, subsequently, of continuous-time analog neurons [Ama72]. The weights and thresholds in these networks were random variables, and they did not change as the ensemble evolved; i.e., no learning mechanisms were investigated. Using some simplifying assumptions, including the stochastic independence of cell membrane potentials, Amari showed that a homogeneous random net is monostable or bistable. Moreover, oscillations could emerge from the interactions between two random nets consisting of excitatory and inhibitory classes of elements, respectively. At about the same time, Wilson and Cowan [WC72] showed the presence of oscillations in coupled subpopulations of inhibitory and excitatory neurons with refractory periods, but with nonadaptive weights. Amari's results were later expressed in a rigorous mathematical framework by Geman [Gem82]. Similarly, Wilson and Cowan's system has been further analyzed by other researchers [Som88], and oscillator models have been developed for specific circuits such as the visual cortex [SW90].

Notable among recent research along these lines is the work by Atiya and Baldi [AB89], who consider interacting cell assemblies of continuous-time analog neurons as well as "integrate-and-fire" type neurons. If the assemblies are arranged in layers, with feedback from the topmost to the bottommost layer (thus resulting in a ring structure), and the number of inhibitory layers is odd, then oscillations arise easily if the cell gains are high enough. Also, provided that the cell time constants are very similar within a layer, all the cells belonging to that layer tend to phase lock in a few time constants. As before, learning mechanisms are not incorporated in their framework.

An alternative approach to obtaining oscillatory phenomena in neural networks is to use a more involved model of the individual cells that results in these cells becoming nonlinear neuronal oscillators by themselves. For example, Kammen, Koch, and Holmes [KKH90] assume a population of neuronal oscillators firing repetitively in response to synaptic input that is purely excitatory. They investigate two basic neuronal network architectures, namely, the linear chain model and the comparator model, which incorporate either nearest-neighbor or global feedback interactions. They conclude that nonlocal feedback plays a fundamental role in the initial synchronization and dynamic stability of the oscillations. Baird [Bai90] discusses a generic model of oscillating cortex that assumes a minimal coupling structure. The network has explicit excitatory neurons with local inhibitory interneuron feedback that form a set of nonlinear oscillators coupled only by long-range excitatory connections. He argues that an oscillatory associative memory function can be realized in such a system by using a local Hebb-like learning rule. Due to the complex characterization of individual cells, it is difficult to obtain a useful macroscopic description using such approaches.

In Chang et al. [CGL92], the authors have presented a macroscopic model for a homogeneous cell assembly wherein each individual cell is an analog neuron whose characteristics are given by a well-known model [Hop84]. This model is distinguished from previous work by the fact that it relates in quite some detail the macroscopic variables to biologically motivated cell parameters, and even more so by the incorporation of adaptive weight dynamics. The latter factor makes it possible to achieve rhythmic patterns in ensemble activity even in an isolated homogeneous cell assembly with no external periodic inputs. The model not only predicts such situations, but it is also able to estimate the frequency of oscillation and indicate how the parameters could be changed to obtain a desired oscillation frequency.

Here, we apply a model similar to that developed in Chang et al. [CGL92] to study the interaction of excitatory and inhibitory neuron groups. For simplicity, the weights are not adapted, though this possibility is kept open for future research. The main contribution of this report is to quantify the situations that lead to stable macroscopic behavior and to estimate the oscillation frequencies. The frequency estimates are observed through simulations to be quite accurate. We begin in the next section by summarizing the macroscopic model. Section 3 applies this model to analyze a system with inhibitory and excitatory neurons. Stability analysis of this system is performed in Section 4, and the frequency of oscillation is estimated in Section 5 using a first-order approximation. Simulation results presented in Section 6 support the mathematical analysis given in previous sections.

2 A Macroscopic Model for Cell Assemblies

2.1 Description of Individual Cells

Macroscopic models for neuronal assemblies depend on the characterization of individual neurons as well as on the network architecture that defines how these neurons interact with one another. The model of individual neurons should be biologically plausible without incorporating details that do not significantly affect macroscopic behavior, such as ensemble oscillations. These cells should at least be able to integrate information over time. Thus, connectionist-type cells, where the instantaneous output is a linear or sigmoidal function of a weighted sum of inputs at that instant, are too simplistic and clearly inadequate.

The next level of complexity is to model a set of n asynchronous cells by n coupled first-order differential equations. A popular generic form is [Cow67, Ama72, GC83, Hop84]:

\tau \frac{du_i}{dt} = -u_i + \sum_{k=1}^{n} w_{ik}\, g_k(u_k) + h_i, \qquad 1 \le i \le n, \tag{1}

where u_i is the internal state of the ith neuron and represents the short-time averaged value of the membrane potential; τ is a time constant; w_ik represents the weight, or synaptic strength, from neuron k to neuron i; h_i is a threshold; and g_k(·) is the input-output "squashing" cell-activation function for the kth neuron. The output can be interpreted as the mean firing rate of the neuron. This system can converge to equilibrium points if the weights are symmetric [GC83, Hop84], if all weights are positive [Hir89], or if the weights and thresholds are independent, normally distributed random variables [Ama72]. However, oscillatory behavior has not been established for this system.

While the model given by equation (1) does not capture many features of biological neurons, it is quite popular because of its simple nature and because it is easily amenable to hardware implementation [Hop84]. Further detail can be added by considering "integrate-and-fire" type neurons [AB89], by applying cable theory, or by using compartmental models [KS89]. However, such detailed modeling is found to be overkill when considering large systems of neurons.

FIGURE 1. Microscopic model of a neural cell.

In this article, we use a Hopfield-type neural model, a form of equation (1) in which each cell is represented by an electrical circuit, as depicted in Figure 1. The synaptic strength from cell k to cell i is denoted by a conductance w_ik(t), and R_i, C_i are the cell membrane resistance and capacitance of cell i. Let θ_i be the threshold for neuron i to fire, U_i(t) = u_i(t) − θ_i the effective membrane potential, a_i = 1/(R_i C_i) the cell time constant, T_ik(t) = w_ik(t)/C_i the effective connection strength from cell k to cell i, I_i(t) the effective external current into cell i, and v_i(t) the cell output, which can be considered as a short-time average of its firing rate. By applying Kirchhoff's current law to the cell input and simple algebraic manipulation, one obtains

\frac{dU_i(t)}{dt} = -a_i \bigl( U_i(t) + \theta_i \bigr) + \sum_{k=1}^{n} T_{ik}(t)\, v_k(t) + \frac{I_i(t)}{C_i}. \tag{2}

The model may also include a formula for weight adaptation. In this paper, we acknowledge that the weights could be time-varying without explicitly using this fact.

A popular choice for the input-output transfer function of individual cells is

v_i(t) = \frac{1}{1 + e^{-2\lambda U_i(t)}} = g[U_i(t)]. \tag{3}

Equations (2) and (3) thus characterize the microscopic model.
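As an aside (not part of the original text), the microscopic model is easy to integrate numerically. The sketch below applies forward-Euler steps to equations (2) and (3) for a hypothetical two-cell network; the function names and all parameter values are our own choices for illustration only:

```python
import math

def g(U, lam=1.0):
    """Sigmoidal activation of equation (3): g(U) = 1 / (1 + exp(-2*lam*U))."""
    return 1.0 / (1.0 + math.exp(-2.0 * lam * U))

def simulate_micro(T, a, theta, I_over_C, U0, dt=0.01, steps=1000, lam=1.0):
    """Forward-Euler integration of equation (2):
    dU_i/dt = -a_i*(U_i + theta_i) + sum_k T_ik * v_k + I_i/C_i."""
    n = len(U0)
    U = list(U0)
    for _ in range(steps):
        v = [g(u, lam) for u in U]                      # outputs via equation (3)
        dU = [-a[i] * (U[i] + theta[i])
              + sum(T[i][k] * v[k] for k in range(n))
              + I_over_C[i]
              for i in range(n)]
        U = [U[i] + dt * dU[i] for i in range(n)]
    return U

# Two mutually excitatory cells relax to a fixed point over about
# ten membrane time constants (t = dt * steps = 10).
U_final = simulate_micro(T=[[0.0, 0.5], [0.5, 0.0]],
                         a=[1.0, 1.0], theta=[0.0, 0.0],
                         I_over_C=[0.1, 0.1], U0=[0.0, 0.0])
```

With symmetric positive weights, the trajectory converges to an equilibrium, consistent with the convergence conditions cited for equation (1).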

We note that the output of a neuron is a train of action potentials, or spikes, propagated through its axon. An action potential occurs when the membrane potential reaches a particular threshold. At that juncture, the membrane potential is "discharged" to a resting value, and the neuron is prevented from firing for a short refractory period, after which the membrane potential can charge up again. While several detailed models exist for this input-output behavior, for the purposes of studying macroscopic phenomena it is deemed sufficient to model the neuron as having a continuous output voltage v_i that is a function of the input voltage U_i at the same instant. If the refractory period is sufficiently smaller than the interval between successive firings, then v_i(t) can be interpreted as the short-term average firing rate of the ith neuron [Ami89] and be obtained by a smooth function such as equation (3).

2.2 Characterization of Cell Assemblies

Several researchers have studied the properties of large neural networks by aggregating them into interacting groups, or clusters [Ama72, Ede87, Ama90, CG93]. Typically, a group exhibits more homogeneity among its constituent neurons, has higher internal connectivity, and/or is used to represent a particular hypothesis [GH89]. Well-known examples include the neuronal groups of Edelman [Ede87] and the neural clusters used for distortion-invariant pattern matching by von der Malsburg [vdM88]. In this section, we apply methods in statistics to build a macroscopic neural model.

Macroscopically, the behavior of a cell assembly can be characterized in terms of the time- or ensemble-averaged behavior of the homogeneous collection of neurons that populate it. For large assemblies, we can consider an individual cell parameter x_i(t) as an instance of a random process x(t). Thus, for any specific time instant t, x(t) is a random variable. The corresponding system-level parameter of interest is the sample mean, x̄(t), or the expectation ⟨x(t)⟩. A macroscopic model reduces the system description from a set of 2n coupled equations to a few equations involving expectations and standard deviations of these random variables.

The Law of Large Numbers shows that the expectation approaches the sample average for large n. In the derivations below, the expectation and sample average are used interchangeably; i.e.,

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \approx \langle x \rangle,

and

\frac{d\bar{x}(t)}{dt} = \frac{d}{dt}\left[ \frac{1}{n} \sum_{i=1}^{n} x_i(t) \right] = \frac{1}{n} \sum_{i=1}^{n} \frac{dx_i(t)}{dt} \approx \left\langle \frac{dx_i(t)}{dt} \right\rangle. \tag{4}


By applying equation (4) to equation (2), we obtain

\frac{d\bar{U}(t)}{dt} = -\overline{a U}(t) - \overline{a \Theta}(t) + \overline{T v}(t), \tag{5}

where

\Theta_i(t) = \theta_i - R_i I_i(t)

and

\overline{T v}(t) = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{n} T_{ik}(t)\, v_k(t).

At first glance, the macroscopic equation (5) seems intractable because of the coupled parameters. Fortunately, it seems that when n is large, knowledge of any one of the random variables would contribute very little information about the other random variables [Gem82]. Based on this insight, Amari introduced the concept of a symmetric, random net to describe networks composed of one homogeneous class of analog elements described by equation (1), in which all the weights, T_ij, are independent, identically distributed (i.i.d.) random variables. The thresholds are also i.i.d. but subject to a probability distribution different from that for the weights. For such systems, a "local chaos" hypothesis analogous to the one widely used in statistical mechanics has been proposed. This hypothesis assumes that the solutions of the individual equations within the system (in our case, equation (2)) are mutually independent. The chaos hypothesis was originally postulated by Rozonoer [Roz69]. Since then, it has been supported by mathematical analysis and by simulations on large systems of randomly coupled equations [Ama72, Gem82], and it is key to the development of an elegant macroscopic model of cell assemblies.

For our more detailed model of equation (2), the comments made above provide a natural definition of a cell assembly as a symmetric, random network of homogeneous neurons. This implies that the parameters a_i, C_i^{-1}, T_ik(t), I_i(t), and θ_i are drawn from independent distributions at any given time instant t. The output v_i(t) of course depends on U_i(t) but is independent of the other parameters. For such a system, the coupled parameters (a_i, U_i), (a_i, Θ_i), and (T_ik, v_k) in equation (5) can be separated to yield

\frac{d\bar{U}(t)}{dt} = -\bar{a}\,\bar{U}(t) - \bar{a}\,\bar{\Theta}(t) + \bar{T}(t)\,\bar{v}(t). \tag{6}

As presented in Chang et al. [CGL92], the standard deviation of U_i, σ_U(t), approaches some constant σ as t → ∞ (with σ_U²(t) = ⟨(U − ⟨U⟩)²⟩ ≈ (1/n) Σ_{i=1}^{n} (U_i − Ū)²), and U_i(t) becomes a linear combination of R_i and θ_i as t → ∞. Thus, if R_i and θ_i are normally distributed, then U_i(t) is also normally distributed. The average output can then be computed by

\bar{v}(t) = \bar{g}[\bar{U}(t)] = \int_{-\infty}^{\infty} \frac{1}{1 + e^{-2\lambda U_i}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma}\; e^{-(U_i - \bar{U}(t))^2 / 2\sigma^2}\, dU_i. \tag{7}

Though the macroscopic transfer function ḡ(·) seems very complex compared to the sigmoidal microscopic transfer function g(·) relating the output of a single neuron to its input, it turns out that the two functions have a similar shape, as shown in Figure 2.

FIGURE 2. (a) Microscopic transfer function; (b) macroscopic transfer function.

In particular, we can prove that ḡ(·) is monotonically increasing by showing that its first derivative is always positive [CGL92], with

0 < \bar{g}(\bar{U}) < 1, \qquad 0 < \bar{g}'(\bar{U}) \le \bar{g}'(0), \qquad \bar{g}'(\bar{U}) = \bar{g}'(-\bar{U}). \tag{8}

The above discussion has been mathematically justified in Chang et al. [CGL92] and can be summarized in the following lemma:

Lemma 1. The shape of the sigmoidal microscopic transfer function g[U_i(t)] = 1/(1 + e^{-2λU_i(t)}) is conserved by the macroscopic transfer function ḡ[Ū(t)] of a large homogeneous population of Hopfield-type neurons as t → ∞.
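Lemma 1 is easy to check numerically (this sketch is not from the original text; the trapezoidal quadrature and all parameter values are our own choices). Averaging the microscopic sigmoid over a normal distribution of membrane potentials, as in equation (7), again yields a bounded, monotonically increasing, sigmoid-shaped curve:

```python
import math

def g(U, lam=1.0):
    # microscopic sigmoid of equation (3)
    return 1.0 / (1.0 + math.exp(-2.0 * lam * U))

def g_bar(U_mean, sigma, lam=1.0, n=2000, half_width=8.0):
    """Macroscopic transfer function of equation (7): the microscopic sigmoid
    averaged against a Gaussian of mean U_mean and std sigma (trapezoidal rule)."""
    lo, hi = U_mean - half_width * sigma, U_mean + half_width * sigma
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        U = lo + i * h
        w = 0.5 if i in (0, n) else 1.0            # trapezoid end weights
        pdf = math.exp(-(U - U_mean) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        total += w * g(U, lam) * pdf
    return total * h

# Sampled values of g_bar keep the sigmoid shape: values in (0, 1), monotone
# increase, and the symmetry g_bar(-U) = 1 - g_bar(U) about the origin.
vals = [g_bar(u, sigma=1.0) for u in (-3.0, -1.0, 0.0, 1.0, 3.0)]
```

The symmetry of ḡ' about the origin, used in condition (8), follows from the identity g(−U) = 1 − g(U) combined with the symmetry of the Gaussian.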

FIGURE 3. Interactions between excitatory and inhibitory neuronal assemblies. Each group receives an external input; connection strengths originating from the excitatory group are positive (T̄_PP > 0, T̄_NP > 0), and those originating from the inhibitory group are negative (T̄_PN < 0, T̄_NN < 0).

3 Interactions between Two Neural Groups

In this section, we use the macroscopic state equation of a neuron population derived in the previous section to study a simple model involving the interactions between two groups of neurons, as shown in Figure 3. The P population consists of excitatory neurons, and the N population consists of inhibitory neurons. We discuss the qualitative behavior of the equilibrium states of both the excitatory and inhibitory neurons. The equilibrium points of the system are obtained by combining the equilibrium curves of both excitatory and inhibitory neuron populations. These curves also determine the number of fixed points in the system.

The connections between the two neuron groups shown in Figure 3, as well as other notation used in this section, are defined below:

\bar{T}_{PP}(t) = \frac{1}{n_P} \sum_{i=1}^{n_P} \sum_{k=1}^{n_P} T_{PP,ik}(t) > 0, \qquad \bar{T}_{PN}(t) = \frac{1}{n_P} \sum_{i=1}^{n_P} \sum_{k=1}^{n_N} T_{PN,ik}(t) < 0,

\bar{T}_{NP}(t) = \frac{1}{n_N} \sum_{i=1}^{n_N} \sum_{k=1}^{n_P} T_{NP,ik}(t) > 0, \qquad \bar{T}_{NN}(t) = \frac{1}{n_N} \sum_{i=1}^{n_N} \sum_{k=1}^{n_N} T_{NN,ik}(t) < 0,

\bar{a}_P = \frac{1}{n_P} \sum_{i=1}^{n_P} a_i, \qquad \bar{a}_N = \frac{1}{n_N} \sum_{i=1}^{n_N} a_i,

\bar{\Theta}_P(t) = \frac{1}{n_P} \sum_{i=1}^{n_P} \Theta_{P,i}(t), \qquad \bar{\Theta}_N(t) = \frac{1}{n_N} \sum_{i=1}^{n_N} \Theta_{N,i}(t).

Here n_P and n_N are the numbers of excitatory and inhibitory neurons, and T_{PN,ik}(t), for example, denotes the effective connection strength from the kth inhibitory neuron to the ith excitatory neuron.

Assuming that the number of excitatory neurons is close to the number of inhibitory neurons, i.e., n_P ≈ n_N, the coupled differential equations describing the macroscopic behavior of the system can be derived from equation (6):

\frac{d\bar{U}_P(t)}{dt} = -\bar{a}_P \bar{U}_P(t) - \bar{a}_P \bar{\Theta}_P(t) + \bar{T}_{PP}(t)\,\bar{v}_P(t) + \bar{T}_{PN}(t)\,\bar{v}_N(t) \equiv f_P(\bar{U}_P(t), \bar{U}_N(t), t), \tag{9}

\frac{d\bar{U}_N(t)}{dt} = -\bar{a}_N \bar{U}_N(t) - \bar{a}_N \bar{\Theta}_N(t) + \bar{T}_{NN}(t)\,\bar{v}_N(t) + \bar{T}_{NP}(t)\,\bar{v}_P(t) \equiv f_N(\bar{U}_P(t), \bar{U}_N(t), t). \tag{10}

The system is in equilibrium when both f_P and f_N are equal to zero for all times t greater than some finite t₀. This means that the system stays in a stationary state Ū_P(t) = A and Ū_N(t) = B for all t > t₀. The equilibrium points can be shown graphically by plotting the curves f_P = 0 and f_N = 0 in the (Ū_P, Ū_N) plane; intersections of the two curves specify the equilibrium points of the system. In this paper, we assume that the aggregate values of the effective synaptic strengths T̄_PP, T̄_PN, T̄_NP, and T̄_NN, as well as the effective biases Θ̄_P and Θ̄_N, asymptotically reach constant values. Under this assumption, equilibrium states may exist because the intersections then become time-invariant. The assumption is trivially valid if there is no learning and if the average external forcing function is constant. Also, large groups of neurons with self-regulatory mechanisms are expected to exhibit such behavior [Ede87].

The following paragraphs describe the features of the two curves at equilibrium. We start by analyzing the curve f_P = 0. For convenience, let us define

h_P(\bar{U}_P(t)) = -\bar{a}_P \bar{U}_P(t) - \bar{a}_P \bar{\Theta}_P + \bar{T}_{PP}\, \bar{g}(\bar{U}_P(t)). \tag{11}

Then

h_P'(\bar{U}_P(t)) = -\bar{a}_P + \bar{T}_{PP}\, \bar{g}'(\bar{U}_P(t)). \tag{12}

By equation (8) we have

\frac{\bar{T}_{PP}}{\bar{a}_P}\, \lambda \sqrt{\frac{2}{\pi}}\; Q(\lambda, \eta_P) < 1 \;\Longrightarrow\; h_P'(\bar{U}_P(t)) < 0 \quad \text{for all } \bar{U}_P(t), \tag{13}

where

Q(\lambda, \eta) = \frac{1}{\eta} \int_0^{\infty} \frac{\sinh(2\sqrt{2}\,\lambda \eta y)\, e^{-y^2}}{\bigl[ 1 + \cosh(2\sqrt{2}\,\lambda \eta y) \bigr]^2}\, dy.

Also, h_P(Ū_P(t)) is continuous and changes from −∞ to +∞ as Ū_P(t) varies from +∞ to −∞; thus h_P(Ū_P(t)) is monotonically decreasing. From equation (9) at equilibrium,

f_P(\bar{U}_P(t), \bar{U}_N(t), t) = 0 \iff \bar{g}(\bar{U}_N(t)) = -\frac{1}{\bar{T}_{PN}}\, h_P(\bar{U}_P(t)),

and since ḡ(Ū_N(t)) is strictly monotonically increasing in Ū_N(t), it can be concluded that the relation between Ū_P(t) and Ū_N(t) along this curve is monotonically decreasing. Moreover, 0 < ḡ(Ū_N(t)) < 1 while h_P(Ū_P(t)) is not bounded; hence the curve f_P = 0 is bounded in Ū_P but not in Ū_N. Figure 4 shows the graph of f_P = 0 satisfying (13).

When (13) is not satisfied, i.e., (T̄_PP/ā_P) λ √(2/π) Q(λ, η_P) > 1, h_P(Ū_P(t)) may not always be monotonically decreasing. Since h_P'(Ū_P(t)) is a unimodal function, h_P(Ū_P(t)) cannot have more than three roots. If Ū_{P+} and Ū_{P−} stand for the local maximum and minimum of h_P(Ū_P(t)), then, by (8), ḡ'(Ū_P(t)) is symmetric with respect to the origin, so

\bar{U}_{P+} = (\bar{g}')^{-1}\!\left( \frac{\bar{a}_P}{\bar{T}_{PP}} \right) \quad \text{and} \quad \bar{U}_{P-} = -(\bar{g}')^{-1}\!\left( \frac{\bar{a}_P}{\bar{T}_{PP}} \right).


FIGURE 4. Generic graph of f_P = 0 with a single root.

FIGURE 5. Generic graphs of f_P = 0 with three roots.


FIGURE 6. Generic graph of f_N = 0.

From the above discussion, it is clear that h_P(Ū_P(t)) is monotonically increasing over the interval [Ū_{P−}, Ū_{P+}] and monotonically decreasing outside this interval. Since 0 < ḡ(Ū_N(t)) < 1, the curve f_P = 0 is defined only for h_P(Ū_P(t)) satisfying

0 \le h_P(\bar{U}_P(t)) \le -\bar{T}_{PN}. \tag{14}

As Ū_P(t) tends to the bounds of (14), Ū_N(t) tends to infinity. The two conditions that determine the shape of the curve f_P = 0 are listed below:

0 < h_P(\bar{U}_{P+}) < -\bar{T}_{PN}, \tag{15}

0 < h_P(\bar{U}_{P-}) < -\bar{T}_{PN}. \tag{16}

If both conditions are satisfied, the curve f_P = 0 has the shape shown in Figure 5(a). It has the general shape shown in Figure 5(b) if (15) is not satisfied, and that in Figure 5(c) if (16) is not satisfied. If neither condition is satisfied, the curve f_P = 0 has the general shape shown in Figure 5(d).

For the curve f_N = 0, equation (10) can be written as

\bar{g}(\bar{U}_P(t)) = -\frac{1}{\bar{T}_{NP}}\, h_N(\bar{U}_N(t)), \tag{17}

where

h_N(\bar{U}_N(t)) = -\bar{a}_N \bar{U}_N(t) - \bar{a}_N \bar{\Theta}_N + \bar{T}_{NN}\, \bar{g}(\bar{U}_N(t)). \tag{18}

Since ḡ(Ū_P(t)) increases monotonically and h_N(Ū_N(t)) decreases monotonically, equation (17) defines a monotonically increasing relation between Ū_P(t) and Ū_N(t). Moreover, since 0 < ḡ(Ū_P(t)) < 1 and h_N(Ū_N(t)) is not bounded, the curve f_N = 0 is bounded in Ū_N(t) but not in Ū_P(t), as shown in Figure 6. We conclude this section with the following lemma:


Lemma 2. For a system of coupled excitatory and inhibitory cell assemblies governed by equations (9) and (10), there exist at least one and at most five equilibrium points (except in degenerate cases).

Notice that the equilibria of the system are shown graphically by the intersections of the curves f_P = 0 and f_N = 0. The lemma can then be easily proved by intersecting Figure 6 with either Figure 4 or Figure 5.

4 Stability of Equilibrium States

In this section, the stability of equilibrium states is studied using Liapunov's theorem. We then apply the Poincaré-Bendixson theorem along with Dulac's criterion [AVK66] to analyze the system defined in Section 3.

In the previous section, we showed that there exists at least one equilibrium point, and we assumed that the T̄'s and Θ̄'s become constant. From Liapunov's stability theorem for differential equations [AVK66], the stability of an equilibrium state (A, B) of the system described by equations (9) and (10) can be tested by the matrix

M = \begin{pmatrix} \dfrac{\partial f_P}{\partial \bar{U}_P} & \dfrac{\partial f_P}{\partial \bar{U}_N} \\[1ex] \dfrac{\partial f_N}{\partial \bar{U}_P} & \dfrac{\partial f_N}{\partial \bar{U}_N} \end{pmatrix} \tag{19}

evaluated at (A, B). If the determinant of M is positive and the trace of M is negative, then the equilibrium state is stable; it is unstable otherwise. Combining equations (9), (10), and (19),

M = \begin{pmatrix} -\bar{a}_P + \bar{g}'(\bar{U}_P(t))\,\bar{T}_{PP} & \bar{g}'(\bar{U}_N(t))\,\bar{T}_{PN} \\[1ex] \bar{g}'(\bar{U}_P(t))\,\bar{T}_{NP} & -\bar{a}_N + \bar{g}'(\bar{U}_N(t))\,\bar{T}_{NN} \end{pmatrix}. \tag{20}

By condition (8) and the definitions given at the beginning of Section 3,

\frac{\partial f_P}{\partial \bar{U}_N} < 0 \quad \text{and} \quad \frac{\partial f_N}{\partial \bar{U}_P} > 0,

so that

\det(M) > 0 \iff \frac{\partial f_P}{\partial \bar{U}_P} \frac{\partial f_N}{\partial \bar{U}_N} > \frac{\partial f_P}{\partial \bar{U}_N} \frac{\partial f_N}{\partial \bar{U}_P} \iff -\frac{\partial f_P / \partial \bar{U}_P}{\partial f_P / \partial \bar{U}_N} < -\frac{\partial f_N / \partial \bar{U}_P}{\partial f_N / \partial \bar{U}_N}.

If we denote the slopes of the curves f_P = 0 and f_N = 0 by S_P and S_N, respectively, then

\det(M) > 0 \iff S_P(\bar{U}_P(t), \bar{U}_N(t)) < S_N(\bar{U}_P(t), \bar{U}_N(t)).

The above discussion can be summarized in the following theorem.


Theorem 1. The equilibrium state (A, B) of the system described by equations (9) and (10) is stable when both conditions below hold, and is unstable otherwise:

S_P(A, B) < S_N(A, B) \qquad (\det(M) > 0), \tag{21}

\bar{g}'(A)\,\bar{T}_{PP} + \bar{g}'(B)\,\bar{T}_{NN} < \bar{a}_P + \bar{a}_N \qquad (\operatorname{trace}(M) < 0). \tag{22}
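Theorem 1 reduces to a determinant and trace check on the 2×2 matrix of equation (20). A minimal sketch (not from the original text; a sigmoid stands in for ḡ, and all parameter values are hypothetical):

```python
import math

def g(U, lam=1.0):
    return 1.0 / (1.0 + math.exp(-2.0 * lam * U))

def g_prime(U, lam=1.0):
    # derivative of the sigmoid: g'(U) = 2*lam*g(U)*(1 - g(U))
    s = g(U, lam)
    return 2.0 * lam * s * (1.0 - s)

def is_stable(A, B, a_P, a_N, T_PP, T_PN, T_NP, T_NN, lam=1.0):
    """Theorem 1 at an equilibrium (A, B): stable iff det(M) > 0 and trace(M) < 0."""
    m11 = -a_P + g_prime(A, lam) * T_PP
    m12 = g_prime(B, lam) * T_PN
    m21 = g_prime(A, lam) * T_NP
    m22 = -a_N + g_prime(B, lam) * T_NN
    det = m11 * m22 - m12 * m21
    trace = m11 + m22
    return det > 0.0 and trace < 0.0

# Weak self-excitation: trace(M) < 0 and det(M) > 0, so the equilibrium is stable.
stable = is_stable(0.0, 0.0, 1.0, 1.0, 1.0, -2.0, 2.0, -0.5, lam=1.0)
# Strong self-excitation violates condition (22): trace(M) > 0, unstable.
unstable = is_stable(0.0, 0.0, 1.0, 1.0, 3.0, -2.0, 2.0, -0.5, lam=2.0)
```

The unstable case is exactly the regime in which, as shown below, oscillation becomes possible.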

To discuss the general behavior of the system governed by equations (9) and (10), consider a region ℜ bounded by a circle centered at an equilibrium (A, B), and let

R^2(t) = (\bar{U}_P(t) - A)^2 + (\bar{U}_N(t) - B)^2.

Along trajectories of the system,

\frac{dR^2}{dt} = 2\,(\bar{U}_P - A)\, f_P(\bar{U}_P, \bar{U}_N) + 2\,(\bar{U}_N - B)\, f_N(\bar{U}_P, \bar{U}_N).

Since ḡ(·) is bounded while the decay terms −ā_P Ū_P and −ā_N Ū_N grow linearly, dR²/dt < 0 when Ū_P(t) and Ū_N(t) become large; i.e., if the region ℜ is large enough, all trajectories eventually enter and remain in it. According to the Poincaré-Bendixson theorem [AVK66], if (Ū_P(t), Ū_N(t)) is a solution of the system that exists and stays in ℜ for t ≥ t₀, for some finite t₀, then

(a) the solution is periodic, (b) the solution spirals toward a periodic solution as t → ∞, or (c) the solution terminates at one of the equilibrium points.

A formal statement of this theorem is given in the Appendix. An alternative mathematical analysis showing the existence of periodic solutions can be found in Geman [Gem82].

Figure 7 shows the phase portrait corresponding to each of these cases. If we can choose an annular region ℜ′ that excludes all equilibrium points, and if we assume that among the limit cycles there are no "semistable" ones (these are possible only in "noncoarse" systems), then: if all paths enter the region ℜ′ as t increases, there is at least one stable limit cycle; if all paths leave ℜ′ as t increases, there exists at least one unstable limit cycle.

FIGURE 7. Phase plot of all three possible solutions.

We have shown that oscillatory solutions can exist in our model consisting of excitatory and inhibitory neurons. Furthermore, we can use Dulac's criterion [AVK66] to identify situations where no periodic solution exists. If

\frac{\partial}{\partial \bar{U}_P}\left[ \rho(\bar{U}_P, \bar{U}_N)\, f_P(\bar{U}_P, \bar{U}_N) \right] + \frac{\partial}{\partial \bar{U}_N}\left[ \rho(\bar{U}_P, \bar{U}_N)\, f_N(\bar{U}_P, \bar{U}_N) \right]
= \left[ \bar{g}'(\bar{U}_P)\,\bar{T}_{PP} + \bar{g}'(\bar{U}_N)\,\bar{T}_{NN} - (\bar{a}_P + \bar{a}_N) \right] \rho(\bar{U}_P, \bar{U}_N) + f_P\,\frac{\partial \rho}{\partial \bar{U}_P} + f_N\,\frac{\partial \rho}{\partial \bar{U}_N}

has the same sign throughout a region ℜ, there exists no limit cycle in ℜ (see Appendix). Here ρ(Ū_P, Ū_N) is any continuous function with continuous derivatives. Let ρ = 1. Then

\frac{\partial f_P}{\partial \bar{U}_P} + \frac{\partial f_N}{\partial \bar{U}_N} \le \bar{g}'(\bar{U}_P(t))\,\bar{T}_{PP} - (\bar{a}_P + \bar{a}_N) \le \bar{g}'(0)\,\bar{T}_{PP} - (\bar{a}_P + \bar{a}_N) = \lambda \sqrt{\frac{2}{\pi}}\, Q(\lambda, \eta_P)\, \bar{T}_{PP} - (\bar{a}_P + \bar{a}_N).

By Dulac's criterion, we obtain the following lemma.


Lemma 3. If

\lambda \sqrt{\frac{2}{\pi}}\, Q(\lambda, \eta_P)\, \bar{T}_{PP} < \bar{a}_P + \bar{a}_N,

then there is no periodic solution of equations (9) and (10) with constant T̄'s and Θ̄'s.

This means that the neural system must converge to a fixed point. From Lemma 2, there are at most five equilibrium points for this system in general. If the conditions in (21) and (22) are not all satisfied, these equilibrium points become unstable, and a perturbation from equilibrium causes the system to oscillate.
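The oscillatory regime is easy to exhibit numerically (this sketch is not from the original text; all parameter values are hypothetical, chosen so that condition (22) fails at the unique equilibrium, which then becomes an unstable spiral surrounded by a limit cycle):

```python
import math

def g(U, lam=2.0):
    # sigmoid standing in for the macroscopic transfer function (Lemma 1)
    return 1.0 / (1.0 + math.exp(-2.0 * lam * U))

def simulate(T_PP, T_PN, T_NP, T_NN, theta_P, theta_N,
             a_P=1.0, a_N=1.0, lam=2.0, dt=0.01, steps=30000):
    """Forward-Euler integration of equations (9)-(10) with constant coefficients."""
    U_P, U_N = 0.1, 0.0            # small perturbation from the equilibrium at (0, 0)
    trace = []
    for _ in range(steps):
        f_P = -a_P * U_P - a_P * theta_P + T_PP * g(U_P, lam) + T_PN * g(U_N, lam)
        f_N = -a_N * U_N - a_N * theta_N + T_NN * g(U_N, lam) + T_NP * g(U_P, lam)
        U_P += dt * f_P
        U_N += dt * f_N
        trace.append(U_P)
    return trace

# Thresholds place the equilibrium at the origin; the Jacobian there has
# trace 0.5 > 0 and determinant 1 > 0, so trajectories spiral outward
# onto a limit cycle (Poincare-Bendixson).
trace = simulate(T_PP=3.0, T_PN=-2.0, T_NP=2.0, T_NN=-0.5,
                 theta_P=0.5, theta_N=0.75)
late = trace[len(trace) // 2:]     # discard the transient
```

The excitatory activity remains bounded yet keeps oscillating long after the transient, as the analysis predicts.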

5 Oscillation Frequency Estimation

Finding the frequency of oscillation in the situations where it occurs is a natural extension in understanding the aggregate behavior of interacting neuron populations. However, since such systems are highly nonlinear, finding a simple closed-form solution is difficult. In this section, we derive an expression for the fundamental frequency of oscillation. Simulation results given in Section 6 show this estimate to be quite accurate.

If the system is periodic, the solution to equations (9) and (10) can be approximated by the zeroth and first harmonics of their Fourier series.

$$U_P(t) = \alpha_P + \beta_P \sin\omega t, \qquad U_N(t) = \alpha_N + \beta_N \sin(\omega t + \delta),$$
$$v_P(t) = g(U_P(t)) = \alpha'_P + \beta'_P \sin\omega t, \qquad v_N(t) = g(U_N(t)) = \alpha'_N + \beta'_N \sin(\omega t + \delta). \quad (23)$$

Here, ω is the first-harmonic frequency, and δ is the phase difference between U_P(t) and U_N(t). Since there is no delay in our model, it is obvious that U_P(t) and v_P(t), as well as U_N(t) and v_N(t), have the same phase. Substituting equations (23) into (9) and (10), we obtain

$$\beta_P\,\omega\cos\omega t = -a_P\alpha_P - a_P\beta_P\sin\omega t - a_P\theta_P + T_{PP}\,\alpha'_P + T_{PP}\,\beta'_P\sin\omega t + T_{PN}\,\alpha'_N + T_{PN}\,\beta'_N\sin(\omega t + \delta),$$
$$\beta_N\,\omega\cos(\omega t + \delta) = -a_N\alpha_N - a_N\beta_N\sin(\omega t + \delta) - a_N\theta_N + T_{NN}\,\alpha'_N + T_{NN}\,\beta'_N\sin(\omega t + \delta) + T_{NP}\,\alpha'_P + T_{NP}\,\beta'_P\sin\omega t. \quad (24)$$

For simplicity, let λ = ∞; i.e., let the activation function of an individual neuron be a step function. Equation (7) can then be written as follows:

$$v = g(U) = \frac{1}{2}\left[1 + \Phi\!\left(\frac{U}{\sqrt{2}\,\eta}\right)\right],$$

where the error function is defined as before. A series representation of the error function is given by

$$\Phi(x) = \frac{2}{\sqrt{\pi}} \sum_{k=1}^{\infty} (-1)^{k-1}\,\frac{x^{2k-1}}{(2k-1)(k-1)!}.$$

Let x = p + q sin ωt. Then the coefficient of the constant term is approximately Φ(p), and the coefficient of sin ωt is approximately (2/√π) q e^{-p²}. Equivalently, Taylor expansion of the function Φ(t) = Φ(p + q sin ωt) yields

$$\Phi(t) = \Phi(p) + \frac{2}{\sqrt{\pi}}\,q\,e^{-p^2}\sin\omega t + \text{higher-order terms}.$$

Using this result, the Fourier coefficients of the average outputs and average inputs in equation (23) can be related in the following equations:

$$v_P(t) = \frac{1}{2}\left[1 + \Phi\!\left(\frac{\alpha_P}{\sqrt{2}\,\eta_P}\right)\right] + \frac{\beta_P}{\sqrt{2\pi}\,\eta_P}\,e^{-\alpha_P^2/(2\eta_P^2)}\sin\omega t + \cdots,$$
$$v_N(t) = \frac{1}{2}\left[1 + \Phi\!\left(\frac{\alpha_N}{\sqrt{2}\,\eta_N}\right)\right] + \frac{\beta_N}{\sqrt{2\pi}\,\eta_N}\,e^{-\alpha_N^2/(2\eta_N^2)}\sin(\omega t + \delta) + \cdots. \quad (25)$$
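The first-order coefficients above can be checked numerically. The sketch below (with illustrative values of p and q; the tolerances are chosen loosely to absorb the neglected higher-order terms) compares the numerically computed Fourier coefficients of Φ(p + q sin ωt) with Φ(p) and (2/√π) q e^{-p²}:

```python
import math

# Numerical check of the first-order Fourier coefficients of
# Phi(p + q*sin(wt)); p and q are illustrative values.
p, q = 0.3, 0.2
n = 100000
xs = [2 * math.pi * k / n for k in range(n)]          # one period of s = wt
a0 = sum(math.erf(p + q * math.sin(s)) for s in xs) / n
b1 = sum(math.erf(p + q * math.sin(s)) * math.sin(s) for s in xs) * 2 / n

err0 = abs(a0 - math.erf(p))                          # constant term vs Phi(p)
err1 = abs(b1 - 2 / math.sqrt(math.pi) * q * math.exp(-p * p))
print(err0, err1)  # both small; only higher-order terms remain
```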

From equations (24) and (25), the following relations can be derived by matching the coefficients of cos ωt, sin ωt, and the constant term:

$$\beta_P\,\omega = T_{PN}\,\beta'_N \sin\delta, \quad (26)$$

$$0 = -a_P\,\beta_P + T_{PP}\,\beta'_P + T_{PN}\,\beta'_N \cos\delta, \quad (27)$$

$$0 = -a_P\,\alpha_P - a_P\,\theta_P + T_{PP}\,\alpha'_P + T_{PN}\,\alpha'_N, \quad (28)$$

$$\beta_N\,\omega\cos\delta = -a_N\,\beta_N \sin\delta + T_{NN}\,\beta'_N \sin\delta, \quad (29)$$

$$-\beta_N\,\omega\sin\delta = -a_N\,\beta_N \cos\delta + T_{NN}\,\beta'_N \cos\delta + T_{NP}\,\beta'_P, \quad (30)$$

$$0 = -a_N\,\alpha_N - a_N\,\theta_N + T_{NN}\,\alpha'_N + T_{NP}\,\alpha'_P, \quad (31)$$

$$\alpha'_P = \frac{1}{2}\left[1 + \Phi\!\left(\frac{\alpha_P}{\sqrt{2}\,\eta_P}\right)\right], \quad (32)$$

$$\beta'_P = \frac{\beta_P}{\sqrt{2\pi}\,\eta_P}\,e^{-\alpha_P^2/(2\eta_P^2)}, \quad (33)$$

$$\alpha'_N = \frac{1}{2}\left[1 + \Phi\!\left(\frac{\alpha_N}{\sqrt{2}\,\eta_N}\right)\right], \quad (34)$$

$$\beta'_N = \frac{\beta_N}{\sqrt{2\pi}\,\eta_N}\,e^{-\alpha_N^2/(2\eta_N^2)}. \quad (35)$$

Using the above equations and some algebraic manipulation, we arrive at the expression below for the frequency of oscillation:

$$\omega = \left[\frac{-T_{PN}\,T_{NP}}{2\pi\,\eta_P\,\eta_N}\,\exp\!\left(-\frac{\alpha_P^2}{2\eta_P^2}\right)\exp\!\left(-\frac{\alpha_N^2}{2\eta_N^2}\right) - \left(\frac{T_{PP}}{\sqrt{2\pi}\,\eta_P}\,\exp\!\left(-\frac{\alpha_P^2}{2\eta_P^2}\right) - a_P\right)^{2}\right]^{1/2}, \quad (36)$$

where α_P and α_N can be obtained from

$$-a_P\,\alpha_P + \frac{1}{2}\,T_{PP}\,\Phi\!\left(\frac{\alpha_P}{\sqrt{2}\,\eta_P}\right) + \frac{1}{2}\,T_{PN}\,\Phi\!\left(\frac{\alpha_N}{\sqrt{2}\,\eta_N}\right) = a_P\,\theta_P - \frac{1}{2}\,T_{PP} - \frac{1}{2}\,T_{PN}, \quad (37)$$

$$-a_N\,\alpha_N + \frac{1}{2}\,T_{NN}\,\Phi\!\left(\frac{\alpha_N}{\sqrt{2}\,\eta_N}\right) + \frac{1}{2}\,T_{NP}\,\Phi\!\left(\frac{\alpha_P}{\sqrt{2}\,\eta_P}\right) = a_N\,\theta_N - \frac{1}{2}\,T_{NN} - \frac{1}{2}\,T_{NP}. \quad (38)$$

This first-order approximation of the frequency is accurate when the condition given below is satisfied:

$$a_P\,\eta_N + a_N\,\eta_P = \frac{1}{\sqrt{2\pi}}\left[T_{PP}\,\frac{\eta_N}{\eta_P}\,\exp\!\left(-\frac{\alpha_P^2}{2\eta_P^2}\right) + T_{NN}\,\frac{\eta_P}{\eta_N}\,\exp\!\left(-\frac{\alpha_N^2}{2\eta_N^2}\right)\right]. \quad (39)$$

6 Experimental Validation

Simulations were performed to validate the results of the last two sections for λ = ∞. For this special case, the different types of solutions given by the stability analysis of Section 4 are experimentally observed. We then


show that for a system with periodic solutions, the frequency estimation in Section 5 is fairly accurate.

As stated in Section 4, there exist three types of solutions for equations (9) and (10): periodic, spiral to periodic, and fixed point. Simulation results showing all three solutions are presented in Figures 8, 9, and 10. Each figure consists of six graphs:

1. Average excitatory input over time, U_P.
2. Average inhibitory input over time, U_N.
3. Phase portrait of the average excitatory versus inhibitory input.
4. Average excitatory output over time, v_P.
5. Average inhibitory output over time, v_N.
6. Phase portrait of the average excitatory versus inhibitory output.

For the periodic case, the frequency can be approximated by equation (36) in Section 5. To simplify the computation, one of the equilibrium points is set at (0,0). Using equations (9) and (10) under these conditions, we obtain the following relations:

$$a_P\,\theta_P - \frac{1}{2}\,T_{PP} - \frac{1}{2}\,T_{PN} = 0,$$
$$a_N\,\theta_N - \frac{1}{2}\,T_{NN} - \frac{1}{2}\,T_{NP} = 0.$$

It is easy to see that α_P = 0 and α_N = 0 is then one of the solutions of equations (37) and (38). For this case, equations (36) and (39) can be reduced to

$$\omega = \left[\frac{-T_{PN}\,T_{NP}}{2\pi\,\eta_P\,\eta_N} - \left(\frac{T_{PP}}{\sqrt{2\pi}\,\eta_P} - a_P\right)^{2}\right]^{1/2}, \quad (40)$$

$$a_P + a_N = \frac{1}{\sqrt{2\pi}}\left(\frac{T_{NN}}{\eta_N} + \frac{T_{PP}}{\eta_P}\right). \quad (41)$$

Simulations confirm that the amplitude and frequency, as well as the phase shift, can be adjusted by changing the parameters used in equations (9) and (10). A wide range of oscillation frequencies is obtained by changing the T's and a's while setting η_P = η_N = 1. Numerical results from the simulations are compared with theoretical results from equation (40) in Table 1.
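As an illustration of the comparison reported in Table 1, the sketch below integrates a reconstruction of equations (9) and (10), written here as dU_P/dt = -a_P(U_P + θ_P) + T_PP g(U_P) + T_PN g(U_N) (and symmetrically for U_N) with g the error-function activation and η_P = η_N = 1. It uses the parameters of the second row of Table 1 and compares the zero-crossing frequency of U_P with the estimate of equation (40); the integration scheme, perturbation, and tolerances are illustrative:

```python
import math

# Parameters from the second row of Table 1 (eta_P = eta_N = 1)
TPP, TPN, TNP, TNN = 4.0, -6.0, 1.4, -2.0
aP, aN = 0.7, 0.1

def g(u):  # population activation for lambda -> infinity, eta = 1
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

# thresholds chosen so that (0, 0) is an equilibrium (Section 6)
thP = (TPP + TPN) / (2.0 * aP)
thN = (TNN + TNP) / (2.0 * aN)

# forward-Euler integration from a small perturbation of the equilibrium
dt, T = 0.005, 300.0
uP, uN, t = 0.1, 0.0, 0.0
crossings = []
while t < T:
    duP = -aP * (uP + thP) + TPP * g(uP) + TPN * g(uN)
    duN = -aN * (uN + thN) + TNN * g(uN) + TNP * g(uP)
    prev = uP
    uP, uN, t = uP + dt * duP, uN + dt * duN, t + dt
    if prev < 0.0 <= uP and t > 50.0:  # upward zero crossings, transient skipped
        crossings.append(t)

periods = [b - a for a, b in zip(crossings, crossings[1:])]
f_sim = 1.0 / (sum(periods) / len(periods))

# first-order estimate, equation (40), divided by 2*pi to obtain a frequency
omega = math.sqrt(-TPN * TNP / (2.0 * math.pi)
                  - (TPP / math.sqrt(2.0 * math.pi) - aP) ** 2)
f_theory = omega / (2.0 * math.pi)
print(f_theory, f_sim)  # f_theory matches the tabulated 0.116
```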

7 Conclusion

In this chapter, we have applied a macroscopic model of cell assemblies to study the qualitative behavior of a simple system consisting of two interacting groups of excitatory and inhibitory neurons. The theory not only is able


FIGURE 8. Periodic solution.


FIGURE 9. Spiral to periodic solution.


FIGURE 10. Fixed point solution.


 T_PP     T_PN     T_NP     T_NN    a_P    a_N    f (Theory)   f (Simulation)
  5.6     -5.6      1.0     -1.0    1.32   0.50     0.037          0.039
  4.0     -6.0      1.4     -2.0    0.70   0.10     0.116          0.107
  8.0    -12.0      4.0     -6.0    0.70   0.10     0.190          0.176
 10.0    -16.0      5.0     -8.0    0.70   0.10     0.220          0.195
 10.8    -10.8      3.0     -3.0    2.80   0.31     0.270          0.273
 12.8    -12.8      5.0     -5.0    2.80   0.31     0.351          0.352
 30.0    -30.0     25.0    -25.0    1.80   0.11     0.636          0.635
171.8   -171.8    163.0   -163.0    2.50   0.61     1.557          1.660

TABLE 1. Frequency of Oscillation with η_P = η_N = 1.

to determine the situations that lead to oscillatory behavior, but also to give a good estimate of the oscillation frequency in such situations. The oscillations in our system stem from the competitive-cooperative dynamics of the neuron groups, similar to those studied by Wilson and Cowan [WC72], without any imposed constructs such as periodic forcing functions or oscillator neurons.

The accuracy of the frequency estimation provides incentive to study more complex systems involving several neuronal groups, as well as more intricate phenomena such as phase locking. We believe that quantitative studies of rhythmic behavior will increase in significance with improved understanding of the role of temporal activities in information organization and processing in the brain.

8 Appendix

The Poincaré-Bendixson Theorem [AVK66]: Let ℛ be a closed bounded region consisting of nonsingular points of a two-dimensional system ẋ = X(x) such that some positive half-path ℋ of the system lies entirely within ℛ. Then either ℋ is itself a closed path, or it approaches a closed path, or it terminates at an equilibrium point.

Dulac's Criterion [AVK66]: For the system ẋ = X(x, y), ẏ = Y(x, y), there are no closed paths in a simply connected region in which

$$\frac{\partial (QX)}{\partial x} + \frac{\partial (QY)}{\partial y}$$

is of one sign, where Q(x, y) is any function having continuous first partial derivatives.

9 REFERENCES

[AB89] A. Atiya and P. Baldi. Oscillations and synchronizations in neural networks: An exploration of the labeling hypothesis. International Journal of Neural Systems, 1:103-124, 1989.

[Ama71] S. I. Amari. Characteristics of randomly connected threshold-element networks and network systems. Proceedings of the IEEE, 59:35-47, 1971.

[Ama72] S. I. Amari. Characteristics of random nets of analog neuron-like elements. IEEE Transactions on Systems, Man, and Cybernetics, 2:643-657, 1972.

[Ama90] S. I. Amari. Mathematical foundations of neurocomputing. Proceedings of the IEEE, 78:1443-1463, 1990.

[Ami89] D. J. Amit. Modeling Brain Function. Cambridge University Press, Cambridge, U. K., 1989.

[AVK66] A. A. Andronov, A. A. Vitt, and S. E. Khaikin. Theory of Oscillators. Dover, New York, 1966.

[Bai90] B. Baird. Associative memory in a simple model of oscillating cortex. In D. Touretzky, editor, Advances in Neural Information Processing Systems II, pages 69-75. Morgan Kaufmann, San Mateo, CA, 1990.

[Bow90] J. M. Bower. Reverse engineering the nervous system: An anatomical, physiological and computer based approach. In S. Zornetzer, J. Davis, and C. Lau, editors, An Introduction to Neural and Electronic Networks, pages 3-24. Academic Press, San Diego, CA, 1990.

[CG93] H.-J. Chang and J. Ghosh. Pattern association and pattern retrieval in a continuous neural system. Biological Cybernetics, 69(l):77-86, 1993.

[CGL92] H.-J. Chang, J. Ghosh, and K. Liano. A macroscopic model of neural ensembles: Learning-induced oscillations in a cell assembly. International Journal of Neural Systems, 3(2):179-198, 1992.

[Cow67] J. D. Cowan. A Mathematical Theory of Central Nervous Activity. Ph.D. thesis, University of London, 1967.

[Ede87] G. M. Edelman. Neural Darwinism. Basic Books, New York, 1987.


[GC83] S. Grossberg and M. Cohen. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 13:815-826, 1983.

[Gem82] S. Geman. Almost sure stable oscillations in a large system of randomly coupled equations. SIAM Journal on Applied Mathematics, 42:695-703, 1982.

[GH89] J. Ghosh and K. Hwang. Mapping neural networks onto message-passing multicomputers. Journal of Parallel and Distributed Computing, 6:291-330, April, 1989.

[GS89] C. M. Gray and W. Singer. Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proceedings of the National Academy of Sciences, USA, 86:1698-1702, 1989.

[Hir89] M. W. Hirsch. Convergent activation dynamics in continuous time networks. Neural Networks, 2:331-350, 1989.

[Hop84] J. J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, USA, 81:3088-3092, 1984.

[KKH90] D. Kammen, C. Koch, and P. J. Holmes. Collective oscillations in the visual cortex. In D. Touretzky, editor, Advances in Neural Information Processing Systems II, pages 77-83. Morgan Kaufmann, San Mateo, CA, 1990.

[KS89] C. Koch and I. Segev. Methods in Neuronal Modeling, from Synapses to Networks. MIT Press, Cambridge, MA, 1989.

[Pav73] T. Pavlidis. Biological Oscillators: Their Mathematical Analysis. Academic Press, New York, 1973.

[Roz69] L. I. Rozonoer. Random logical nets I, II and III. Avtomatika i Telemekhanika, 5:137-147, 1969.

[Som88] H. Sompolinsky. Statistical mechanics of neural networks. Physics Today, pages 70-80, 1988.

[SW90] H. G. Schuster and P. Wagner. A model for neuronal oscillations in the visual cortex. Biological Cybernetics, 64:77-82, 1990.

[TMW89] R. D. Traub, R. Miles, and R. K. S. Wong. Model of the origin of rhythmic population oscillations in the hippocampal slice. Science, 243:1319-1325, 1989.


[vB86] C. von der Malsburg and E. Bienenstock. Statistical coding and short term synaptic plasticity: A scheme for knowledge representation in the brain. In E. Bienenstock, F. Fogelman, and G. Weisbuch, editors, Disordered Systems and Biological Organization, pages 247-272. Springer, Berlin, 1986.

[vdM88] C. von der Malsburg. Pattern recognition by labeled graph matching. Neural Networks, 1:141-148, 1988.

[WC72] H. R. Wilson and J. D. Cowan. Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12:1-24, 1972.


Chapter 6

Finite State Machines and Recurrent Neural Networks: Automata and Dynamical Systems Approaches

Peter Tino
Bill G. Horne
C. Lee Giles
Pete C. Collingwood

ABSTRACT We present two approaches to the analysis of the relationship between a recurrent neural network (RNN) and the finite state machine M the network is able to exactly mimic. First, the network is treated as a state machine, and the relationship between the RNN and M is established in the context of the algebraic theory of automata. In the second approach, the RNN is viewed as a set of discrete-time dynamical systems associated with the input symbols of M. In particular, issues concerning the network representation of loops and cycles in the state transition diagram of M are shown to provide a basis for interpreting the learning process from the point of view of bifurcation analysis. The circumstances under which a loop corresponding to an input symbol x is represented by an attractive fixed point of the underlying dynamical system associated with x are investigated. For the case of two recurrent neurons, under some assumptions on the weight values, bifurcations can be understood in the geometrical context of intersections of increasing and decreasing parts of the curves defining fixed points. The most typical bifurcation responsible for the creation of a new fixed point is the saddle-node bifurcation.

1 Introduction

The relationship between recurrent neural networks (RNNs) and automata has been treated by many [Min61], [Jor86], [CSSM89], [DGS92], [Elm90], [GMC+92a], [Cas93], [WK92b], [ZGS93], [MF94], [DM94], [HH94]. Activations of state units represent past histories, and clusters of these activations can represent the states of the generating automaton [GMC+92b].

In this contribution, the relationship between an RNN and a finite state


machine it exactly mimics is investigated from two points of view. First (Section 5), the network is treated as a state machine. The concept of state equivalence is used to reduce the infinite, non-countable set of network states (activations of RNN state neurons) to a finite factor state set corresponding to the set of states of M. Second (Section 6), the RNN is viewed as a set of discrete-time dynamical systems associated with the input symbols of M. The dynamical systems operate on (0,1)^L, where L is the number of recurrent neurons of the RNN. In our experiments, loops and cycles corresponding to an input symbol x of M have stable representations as attractive fixed points and periodic orbits, respectively, of the dynamical system associated with the input x. Suppose there is a loop associated with an input x in a state q of M. Denote the set of network states equivalent to q by (q)_N. Then, if there is a vertex v ∈ {0,1}^L such that v is in the closure of (q)_N, the loop is likely to be represented by an attractive fixed point¹ "near" v.

Related work was independently done by Casey [Cas93], [Cas95a]. In his setting, an RNN is assumed to operate in a noisy environment (representing, for example, the noise corresponding to round-off errors in computations performed on a digital computer). RNNs are trained to perform grammatical inference. It is proved that the presence of a loop in the state transition diagram of the automaton² necessarily implies the presence of an attractive set inside the RNN state space (see the discussion in Section 6). It is also shown that the method for extraction of an automaton from a trained RNN introduced in [GMC+92a] is consistent: the method is based on dividing the RNN state space into equal hypercubes, and there is always a finite number of hypercubes that one needs in order to unambiguously cover regions of equivalent network states.
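The hypercube-based idea can be made concrete with a small sketch. The binning step below is only an illustration of quantizing network states into equal hypercubes, not the full extraction procedure of [GMC+92a]; the granularity q and the sample states are made up:

```python
def cell(state, q=4):
    """Map a network state in (0,1)^L to the index of its hypercube cell,
    with q cells per dimension (an illustrative granularity)."""
    return tuple(min(int(s * q), q - 1) for s in state)

print(cell((0.1, 0.9)))    # (0, 3)
print(cell((0.26, 0.24)))  # (1, 0)
```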

In Section 7 a more detailed analysis of the case when the RNN has two state neurons is presented. Under some conditions on the weight values, the number, position, and stability types of the fixed points of the underlying dynamical systems are analyzed, and the bifurcation mechanism is clarified. The most typical bifurcation responsible for the creation of a new fixed point is the saddle-node bifurcation. A mechanism of correct behavior of an RNN on short input strings, when for long strings the network is known to generalize poorly, is investigated in Section 8. In such cases, a correct state transition diagram of the FSM the network was trained with can still be extracted [GMC+92a]. A tool called the state degradation diagram is developed to illustrate how regions of network state space, initially acting as if they assumed the role of states of the FSM in which there is a loop associated with an input symbol x, gradually degrade upon repeated presentation of x.

¹ Of the corresponding dynamical system.
² Recognizing the same language as the RNN.


Sections 2 and 3 give brief introductions to state machines and dynamical systems, respectively. Section 4 is devoted to the model of RNN [nHG95] used for learning FSMs.

2 State Machines

This section introduces the concept of a state machine, which is a generalized finite state machine with a possibly uncountable number of states. When viewed as automata, RNNs can be described in terms of state machines.

A state machine (SM) is a 6-tuple M = (X, Y, S, f_s, f_o, s_0), where

• X is a nonempty finite set called the input set,

• Y is a nonempty finite set called the output set,

• S is a nonempty set called the set of internal states,

• f_s is a map f_s : S × X → S called the next-state function,

• f_o is a map f_o : S × X → Y called the output function,

• s_0 ∈ S is called the initial state.

SMs with a finite internal state set are called finite state machines (FSMs).

We assume that the reader is familiar with the notion of a monoid of words over a finite set. Following the standard notation, Λ, X*, X⁺, and uv denote the empty word, the set of all words over X, the set of all nonempty words over X, and the concatenation of words u and v, respectively.

At every moment M is in exactly one state s ∈ S. When an element x ∈ X is read in, the machine changes its state to f_s(s, x) and yields the output f_o(s, x). The processing of any input word w ∈ X⁺ by M always starts with M in the initial state.

If for some x ∈ X and s ∈ S it holds that f_s(s, x) = s, then it is said that there is an x-loop in the state s. If there exist m (m ≥ 2) distinct states s_1, ..., s_m ∈ S and an input x ∈ X such that f_s(s_i, x) = s_{i+1} for all i = 1, ..., m-1 and f_s(s_m, x) = s_1, then the set {s_1, ..., s_m} is said to be an x-cycle of length m passing through the states s_1, ..., s_m.

It is convenient to extend the domains of f_s and f_o from S × X to S × X* and S × X⁺, respectively:

• ∀s ∈ S: f_s(s, Λ) = s,

• ∀s ∈ S, ∀w ∈ X*, ∀x ∈ X: f_s(s, wx) = f_s(f_s(s, w), x) and f_o(s, wx) = f_o(f_s(s, w), x).


Yet a further generalization of f_o is useful:

∀s ∈ S, ∀w = x_1 x_2 ... x_n ∈ X⁺: f_o⁺(s, w) = f_o(s, x_1) f_o(s, x_1 x_2) ... f_o(s, x_1 x_2 ... x_n).
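These definitions can be made concrete with a small sketch. The machine below is a made-up example (states {0, 1, 2}, input set X = {a, b}), with run implementing the extended output map f_o⁺ starting from s_0:

```python
class FSM:
    """Minimal Mealy machine M = (X, Y, S, f_s, f_o, s_0); the concrete
    machine below is a made-up illustration, not one from this chapter."""
    def __init__(self, fs, fo, s0):
        self.fs, self.fo, self.s0 = fs, fo, s0  # dicts keyed by (state, symbol)

    def run(self, w, s=None):
        """Extended output map f_o^+(s, w): the output word produced by w."""
        s = self.s0 if s is None else s
        out = []
        for x in w:
            out.append(self.fo[(s, x)])
            s = self.fs[(s, x)]
        return "".join(out)

# an 'a'-loop in state 0 (f_s(0,'a') = 0) and a 'b'-cycle {1, 2} of length 2
fs = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 2, (2, 'a'): 0, (2, 'b'): 1}
fo = {(s, x): ('1' if s == 0 and x == 'a' else '0') for (s, x) in fs}
M = FSM(fs, fo, 0)
print(M.run("aab"))  # '110'
```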

A distinguishing sequence of M is a word w ∈ X⁺ such that there are no two distinct states s_1, s_2 of M for which f_o⁺(s_1, w) = f_o⁺(s_2, w).

The behavior of M is a map B_M : X⁺ → Y⁺ with ∀w ∈ X⁺: B_M(w) = f_o⁺(s_0, w).

A state s_2 ∈ S is said to be accessible and x-accessible from the state s_1 ∈ S if there exists some w ∈ X* and w ∈ {x}*, respectively, such that s_2 = f_s(s_1, w). M is said to be connected if every state s ∈ S is accessible from s_0. The set of all states that are x-accessible from a state s ∈ S is denoted by Acc(x, s). An x-cycle γ = {s_1, ..., s_m} is said to be x-accessible from a state p ∈ S if γ ⊆ Acc(x, p).

An input word w ∈ X* is said to lead to a state q if f_s(s_0, w) = q. An input word leading to q is minimal if there is no shorter input word leading to q.

We shall also need some concepts concerning state and machine equivalence. Let M_i = (X, Y, S_i, f_s^i, f_o^i, s_0^i), i = 1, 2, be two SMs. States s_1 ∈ S_1 and s_2 ∈ S_2 are said to be equivalent if there is no nonempty word over X that would cause M_1 to give a different output from that given by M_2, provided that M_1 and M_2 started from s_1 and s_2, respectively. This is formally represented by the equivalence relation E(M_1, M_2) ⊆ S_1 × S_2:

(s_1, s_2) ∈ E(M_1, M_2) iff ∀w ∈ X⁺: f_o^{1+}(s_1, w) = f_o^{2+}(s_2, w).

The set {p ∈ S_2 | (q, p) ∈ E(M_1, M_2)} of all states of M_2 that are equivalent to a state q ∈ S_1 of M_1 is denoted by [q]_{E(M_1,M_2)}. When M_1 = M_2 = M, the equivalence relation E(M, M) partitions the state set S of M into the set of disjoint equivalence classes S/E(M, M).

M_1 and M_2 are said to be equivalent if for every state s_1 ∈ S_1 there exists a state s_2 ∈ S_2 such that (s_1, s_2) ∈ E(M_1, M_2), and vice versa. If there exists a bijection b_S : S_1 → S_2 satisfying

• ∀s ∈ S_1, ∀x ∈ X: b_S(f_s^1(s, x)) = f_s^2(b_S(s), x) and f_o^1(s, x) = f_o^2(b_S(s), x),

• b_S(s_0^1) = s_0^2,

then M_1 and M_2 are said to be isomorphic. Isomorphic SMs can be considered identical, since they differ only in the names of their states.

An SM is said to be reduced if no two of its states are equivalent to each other. A reduced SM equivalent to M = (X, Y, S, f_s, f_o, s_0) is

(X, Y, S/E(M, M), f_s′, f_o′, [s_0]_{E(M,M)}),

with f_s′ : S/E(M, M) × X* → S/E(M, M) and f_o′ : S/E(M, M) × X⁺ → Y defined as follows:

∀s ∈ S, ∀w ∈ X*: f_s′([s]_{E(M,M)}, w) = [f_s(s, w)]_{E(M,M)}, (1)

∀s ∈ S, ∀w ∈ X⁺: f_o′([s]_{E(M,M)}, w) = f_o(s, w). (2)
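When S is finite, the factor set S/E(M, M) used in equations (1) and (2) can be computed by partition refinement. The following is only a sketch of one standard way to do this, not an algorithm from this chapter; the three-state machine at the end is made up, with two of its states turning out equivalent:

```python
def reduce_fsm(states, alphabet, fs, fo):
    """Compute the partition S/E(M,M) by iterative refinement."""
    # initial partition: group states with identical one-step outputs
    blocks = {}
    for s in states:
        key = tuple(fo[(s, x)] for x in alphabet)
        blocks.setdefault(key, set()).add(s)
    part = list(blocks.values())
    changed = True
    while changed:
        changed = False
        def block_of(s):
            return next(i for i, b in enumerate(part) if s in b)
        new_part = []
        for b in part:
            groups = {}
            for s in b:  # split by the blocks reached under each symbol
                sig = tuple(block_of(fs[(s, x)]) for x in alphabet)
                groups.setdefault(sig, set()).add(s)
            new_part.extend(groups.values())
            changed |= len(groups) > 1
        part = new_part
    return part

fs = {(0, 'a'): 1, (1, 'a'): 0, (2, 'a'): 0}
fo = {(0, 'a'): '0', (1, 'a'): '1', (2, 'a'): '1'}
part = reduce_fsm([0, 1, 2], ['a'], fs, fo)
print(sorted(map(sorted, part)))  # [[0], [1, 2]]: states 1 and 2 are equivalent
```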

3 Dynamical Systems

Analysis of dynamical systems (DSs) via state space structures plays an important role in the study and interpretation of complex systems. Most of the important qualitative behaviors of a nonlinear system can be made explicit through a state space analysis. In this chapter only discrete-time DSs (i.e., DSs evolving in discrete time) will be considered. Our theoretical knowledge about nonlinear DSs is far from complete. The state space of a nonlinear DS often consists of qualitatively different regions. It is useful to take into account the geometric information about the structures and spatial arrangements of these regions.

Among the most important characteristics of a DS are the fixed points, periodic orbits, their stability types, and the spatial arrangement of the corresponding stability regions. We review some of the basic concepts in DS theory.

A discrete-time DS can be represented as the iteration of a (differentiable, invertible) function f : A → A (A ⊆ ℝ^L), i.e.,

$$x_{t+1} = f(x_t), \quad t \in Z, \quad (3)$$

where Z denotes the set of all integers. For each x ∈ A, the iteration (3) generates a sequence of points defining the orbit, or trajectory, of x under f. Hence, the (forward) orbit of x under f is the set {f^m(x) | m ≥ 0}. For m ≥ 1, f^m is the composition of f with itself m times; f^0 is defined to be the identity map on A.

A point x* ∈ A is called a fixed point of f if f^m(x*) = x* for all m ∈ Z. A point x* ∈ A is a periodic point of f if f^q(x*) = x* for some q ≥ 1. The least such value of q is called the period of the point x* and of the orbit of x*. The set {x*, f(x*), ..., f^{q-1}(x*)} is said to be a periodic orbit of x* of period q. Notice that a fixed point is a periodic point of period 1, and a periodic point of f with period q is a fixed point of f^q. If x* is a periodic point of period q for f, then so are all of the other points in the orbit of x*.

Fixed and periodic points can be classified according to the behavior of the orbits of points in their vicinity. A fixed point x* is said to be asymptotically stable (or an attractive point of f) if there exists a neighborhood O(x*) of x* such that lim_{m→∞} f^m(x) = x* for all x ∈ O(x*). As m increases, trajectories of points near an asymptotically stable fixed point tend to it. The basin of attraction of an attractive fixed point x* is the set {x ∈ A | lim_{m→∞} f^m(x) = x*}.

A fixed point x* of f is asymptotically stable only if for each eigenvalue λ of Df(x*), the Jacobian of f at x*, |λ| < 1 holds. The eigenvalues of


Df(x*) govern whether the map f has contracting or expanding directions in a vicinity of x*. Eigenvalues larger than 1 in absolute value lead to expansion, whereas eigenvalues smaller than 1 in absolute value lead to contraction. If all the eigenvalues of Df(x*) are outside the unit circle, x* is a repulsive point, or repellor. All points from a neighborhood of a repellor move away from it as m increases or, equivalently, move towards it as -m decreases.³ If some eigenvalues of Df(x*) are inside and some are outside the unit circle, x* is said to be a saddle point. There is a set W^s of points x such that the trajectory of x tends to x* as m → ∞; W^s is called the stable invariant manifold of x*. Similarly, the unstable invariant manifold of x*, W^u, is the set of points x such that the trajectory of x tends to x* as m → -∞.

Since any periodic point of period q can be thought of as a fixed point of f^q, these remarks apply to periodic points as well.
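The eigenvalue test can be illustrated numerically. The sketch below iterates a made-up two-neuron sigmoid map x → g(Wx) to its fixed point and then checks the moduli of the Jacobian eigenvalues; the weight matrix and starting point are arbitrary:

```python
import numpy as np

# Stability check for a made-up two-neuron sigmoid map x -> g(Wx)
W = np.array([[0.5, -0.3], [0.2, 0.4]])
g = lambda x: 1.0 / (1.0 + np.exp(-x))

x = np.array([0.5, 0.5])
for _ in range(200):          # iterate towards the (attractive) fixed point
    x = g(W @ x)

z = g(W @ x)                  # Jacobian of f(x) = g(Wx): diag(g'(Wx)) @ W
Df = np.diag(z * (1.0 - z)) @ W
eig = np.linalg.eigvals(Df)
print(np.abs(eig))            # all moduli < 1: asymptotically stable
```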

An absorbing set of a set B ⊆ A under the map f is a set P such that for all x ∈ B, there exists m_0 ≥ 0 for which f^m(x) ∈ P for all m ≥ m_0. For a given x ∈ B, the least such value of m_0 is called the absorption level of x in P under the map f. An absorption region of P under the map f is defined as follows:

A_f(P) = {x ∈ A | there exists m_0 ≥ 0 such that f^m(x) ∈ P for all m ≥ m_0}.

When A ⊆ ℝ or A ⊆ ℝ², it is useful to code with colors (or different gray levels) the absorption levels of points from A_f(P) in P. We will refer to such a diagram as an absorption diagram of P under the map f.

B ⊆ A is said to be a positively invariant set of f if f(B) ⊆ B; i.e., trajectories of points from B stay in B. Trivially, A is a positively invariant set of f, but in an effort to understand the dynamics of (3), we are usually interested in finding as minimal a positively invariant set⁴ as possible. If B is open and⁵ f(B̄) ⊂ B, then the set B̃ = ∩_{m>0} f^m(B̄) is not only positively invariant, but also attracting, meaning that there is a neighborhood of B̃ such that all orbits starting in that neighborhood converge to B̃. Attractive fixed points and periodic orbits are trivial examples of attractive sets. Much more complicated attractive sets can be found in the dynamical systems literature under the name strange attractors [Dev86].⁶ As in the case of an attractive fixed point, the basin of attraction of an attractive set B is the set of all points whose orbits converge to B.

If B ⊆ A is a positively invariant set of f, then it is certainly an absorbing set of itself under f. B may be an attracting set of f, or it may contain an

³ f^{-m} = (f^{-1})^m.
⁴ In the sense of inclusion.
⁵ B̄ denotes the closure of B.
⁶ Loosely speaking, strange attractors are attractive sets that are topologically distinct from (i.e., cannot be transformed by a homeomorphism to) the trivial attractive sets mentioned above.


FIGURE 1. RNN model used for learning FSMs.

attractive set of f,⁷ or neither of the two.⁸

To learn more about the theory of DSs see, for example, Guckenheimer and Holmes [GH82].

4 Recurrent Neural Networks

The RNN presented in Figure 1 was shown to be able to learn mappings that can be described by finite state machines [nHG95]. A binary input vector I^{(t)} = (I_1^{(t)}, ..., I_N^{(t)}) corresponds to the activations of the N input neurons. There are two types of hidden neurons in the network:

• K hidden nonrecurrent neurons H_1, ..., H_K, whose activations are denoted by H_j^{(t)}, j = 1, ..., K.

⁷ Note that this does not necessarily imply that B is part of the basin of attraction of an attractive set contained in B. Think of an attractive periodic orbit inside B that encircles a repelling fixed point.
⁸ The identity map constitutes a simple example.


• L hidden recurrent neurons S_1, ..., S_L, called state neurons. We refer to the activations of the state neurons by S_i^{(t)}, i = 1, ..., L. The vector S^{(t)} = (S_1^{(t)}, ..., S_L^{(t)}) is called the state of the network.

W_{iln}, Q_{jln}, and V_{mk} are real-valued weights, and g is the sigmoid function g(x) = 1/(1 + e^{-x}). The activations of the hidden nonrecurrent neurons are determined by

$$H_j^{(t)} = g\left(\sum_{l,n} Q_{jln}\,S_l^{(t)}\,I_n^{(t)}\right).$$

The activations of the state neurons at the next time step (t+1) are computed as follows:

$$S_i^{(t+1)} = g\left(\sum_{l,n} W_{iln}\,S_l^{(t)}\,I_n^{(t)}\right) = S_i\!\left(S^{(t)}, I^{(t)}\right). \quad (4)$$

The output of the network at time t is the vector (O_1^{(t)}, ..., O_M^{(t)}) of activations of the M output neurons O_1, ..., O_M. The network output is determined by

$$O_m^{(t)} = g\left(\sum_k V_{mk}\,H_k^{(t)}\right) = O_m\!\left(S^{(t)}, I^{(t)}\right). \quad (5)$$

Network states are elements of the L-dimensional open interval (0,1)^L, the interior of the L-dimensional hypercube.
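The update equations (4) and (5) can be sketched as follows; the layer sizes and the random weights are made-up placeholders, and the one-hot input stands in for some c_I(x):

```python
import numpy as np

# Sketch of one step of the second-order RNN of equations (4) and (5)
L, N, K, M = 2, 2, 3, 2          # state, input, hidden, output neurons
rng = np.random.default_rng(0)
W = rng.normal(size=(L, L, N))   # W_iln: next-state weights
Q = rng.normal(size=(K, L, N))   # Q_jln: hidden-layer weights
V = rng.normal(size=(M, K))      # V_mk: output weights
g = lambda x: 1.0 / (1.0 + np.exp(-x))

def step(S, I):
    """Next state S^(t+1) (eq. 4) and output O^(t) (eq. 5)."""
    SI = np.outer(S, I)                       # products S_l^(t) I_n^(t)
    H = g(np.einsum('jln,ln->j', Q, SI))      # hidden activations
    S_next = g(np.einsum('iln,ln->i', W, SI))
    O = g(V @ H)
    return S_next, O

S = np.full(L, 0.5)              # network state in (0,1)^L
I = np.array([1.0, 0.0])         # one-hot code of an input symbol
S, O = step(S, I)
print(S, O)
```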

A unary encoding of the symbols of both the input and output alphabets is used, with one input and one output neuron for each input and output symbol, respectively.

The bijection defining the encoding of N input symbols into N-dimensional binary vectors with just one active bit is denoted by c_I. Similarly, the bijection that defines the encoding of M output symbols into M-dimensional one-active-bit binary vectors is denoted by c_O.

The vector I^{(t)} = (I_1^{(t)}, ..., I_N^{(t)}) ∈ {0,1}^N of activations of input neurons corresponds to the input symbol c_I^{-1}(I_1^{(t)}, ..., I_N^{(t)}).

The activation of each output neuron lies in the open interval (0,1). A threshold Δ ∈ (0, ½) is introduced such that any value from (0, Δ) is assumed to be an approximation of 0, and any value from (1-Δ, 1) represents the value 1. A mapping r : (0,1) → {0, 1, -1} is defined as follows:⁹

$$r(x) = \begin{cases} 0 & \text{if } x \in (0, \Delta), \\ 1 & \text{if } x \in (1-\Delta, 1), \\ -1 & \text{otherwise.} \end{cases}$$

⁹ -1 represents the don't know output of an output neuron.


Interpretation of the network output in terms of output symbols of the FSM it models is performed via the mapping D:¹⁰

$$D(y_1, \ldots, y_M) = \begin{cases} c_O^{-1}(y_1, \ldots, y_M) & \text{if } y_i \in \{0,1\},\ i = 1, \ldots, M, \\ * & \text{otherwise.} \end{cases}$$

If the output of the network, O^{(t)} = (O_1^{(t)}, ..., O_M^{(t)}), falls into ((0, Δ) ∪ (1-Δ, 1))^M, then it corresponds to the output symbol

$$D(r(O_1^{(t)}), \ldots, r(O_M^{(t)})) = c_O^{-1}(r(O_1^{(t)}), \ldots, r(O_M^{(t)})) = c_O^{-1}(R(O_1^{(t)}, \ldots, O_M^{(t)})),$$

where the map R is the component-wise application of the map r.

Each input word (a word over the input alphabet of the FSM used for training) is encoded into the input neurons one symbol per discrete time step t, yielding the corresponding output as well as the new network state.
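The thresholding map r and the decoding D(R(·)) can be sketched as follows; the value of Δ and the output codebook are illustrative:

```python
# Output decoding sketch: mapping r and its component-wise application R,
# followed by the codebook lookup c_O^{-1}; Delta is an assumption.
DELTA = 0.2

def r(x):
    if 0.0 < x < DELTA:
        return 0
    if 1.0 - DELTA < x < 1.0:
        return 1
    return -1          # the "don't know" value

def decode(outputs, codebook):
    """D(R(O)): map thresholded outputs to a symbol, or '*' if unknown."""
    bits = tuple(r(o) for o in outputs)
    return codebook.get(bits, '*')

codebook = {(1, 0): 'a', (0, 1): 'b'}   # one-hot output codes c_O
print(decode([0.93, 0.05], codebook))   # 'a'
print(decode([0.60, 0.10], codebook))   # '*' (0.60 lies in the don't-know band)
```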

Training is performed via optimization with respect to the error function

$$E = \frac{1}{2} \sum_{t,m} \left(T_m^{(t)} - O_m^{(t)}\right)^2,$$

where T_m^{(t)} ∈ {0, 1} is the desired response value for the m-th output neuron at time step t. For a more detailed explanation of the training procedure, see Tino et al. [nHG95].

5 RNN as a State Machine

In this section we assume that an RNN N of the type described above has learned to exactly mimic the behavior of a reduced, connected FSM M = (X, Y, S, δ, λ, s_0) it was trained with. It follows that there exists a network state S^0 for which the network output will always be in ((0, Δ) ∪ (1-Δ, 1))^M upon presentation of any input word, and such that the following correspondence holds (time is set to t = 1):¹¹

$$\forall w = x_1 \ldots x_n \in X^+:\quad \lambda(q_i, x_i) = D(R(O^{(i)})), \quad \text{for all } i = 1, \ldots, n, \quad (6)$$

where

¹⁰ It is assumed that * does not belong to the set of output symbols of the FSM modeled by the RNN; * stands for the don't know output of the net.
¹¹ In practical terms, during the learning phase the network is trained to respond to a special "reset" input symbol # (# ∉ X) by changing its state to a state equivalent to s_0, the initial state of M (more details in [nHG95]). S^0 is the "next state" computed in the layer of recurrent state neurons when the symbol # is presented to the network input after the training process has been completed.


180 Tino, Horne, Giles, and Collingwood

• q₁ = q₀,

• S(1) = S⁰,

• q_{i+1} = δ(q_i, x_i), i = 1, ..., n − 1,

• the network input I(i) at the time step i is the code c_I(x_i) of the i-th input symbol x_i of the input word w.
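The FSM side of the correspondence above is easy to make concrete: the runner below computes the output word λ⁺(q₀, w) from transition and output tables, mirroring q_{i+1} = δ(q_i, x_i) and the emitted outputs λ(q_i, x_i). The two-state machine is hypothetical, not one of the FSMs used in the chapter's experiments.

```python
# Run a Mealy machine: delta maps (state, input) -> next state,
# lam maps (state, input) -> output symbol.

def run_fsm(delta, lam, q0, word):
    """Return the output string lambda^+(q0, word)."""
    q, out = q0, []
    for x in word:
        out.append(lam[(q, x)])
        q = delta[(q, x)]
    return ''.join(out)

# Hypothetical two-state machine over inputs {a, b} and outputs {0, 1}:
delta = {('A', 'a'): 'A', ('A', 'b'): 'B',
         ('B', 'a'): 'A', ('B', 'b'): 'B'}
lam   = {('A', 'a'): '0', ('A', 'b'): '1',
         ('B', 'a'): '1', ('B', 'b'): '0'}
```

For instance, run_fsm(delta, lam, 'A', 'abba') returns '0101'.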

Automata theory provides us with the ability to connect structural and behavioral equivalence of automata [Shi87]. In particular, it can be shown that for any couple (M₁, M₂) of connected FSMs with equal input as well as output sets the following holds: if B_{M₁} = B_{M₂}, then M₁ and M₂ are equivalent and their reduced forms are isomorphic. To investigate the correspondence between N and M in this context, we represent the network N as the SM N = (X, Y ∪ {*}, S, τ, ν, S⁰), where the maps ν and τ are defined as follows:

for any S = (S₁, ..., S_L) ∈ S and any x ∈ X:

ν(S, x) = D(R(O₁(S, c_I(x)), ..., O_M(S, c_I(x)))),

and

τ(S, x) = (S₁(S, c_I(x)), ..., S_L(S, c_I(x))),

with O_i and S_j defined by (5) and (4) respectively. From (6) it follows that

∀w ∈ X⁺: λ⁺(q₀, w) = ν⁺(S⁰, w).  (7)

The set S = (0,1)^L of states of N can be partitioned into the set of equivalence classes corresponding to the equivalence relation E(N, N). By presenting inputs to the network and considering only the decoded network outputs, it is impossible to distinguish between equivalent network states.

[S⁰]_{E(N,N)} is the set of all network states equivalent to S⁰. Denote the set of network states accessible from states in [S⁰]_{E(N,N)} by S_acc. Note that for every state S ∈ S_acc and for each input word w ∈ X⁺, ν⁺(S, w) does not contain the don't know symbol *. From N, a reduced, connected SM N₁ = (X, Y, S_acc/E(N,N), τ₁, ν₁, [S⁰]_{E(N,N)}) is constructed, where τ₁ and ν₁ are defined according to (1) and (2) respectively, and restricted to S_acc/E(N,N) × X. N₁ has the same behavior as N. It is easy to see that the number of states of N₁ is finite, and hence N₁ is an FSM. It follows that N₁ and M are isomorphic.

6. Finite State Machines and Recurrent Neural Networks 181

The set of network states equivalent to the state q of M is denoted by (q)_N. States of an SM code the information about "what has happened so far in the course of input word processing." From that point of view, all network states from (q)_N code the same information, the information that is coded by the state q of M.

So far we have dealt with the existence issues concerning nonempty regions of network states equivalent to states of the FSM the network is capable of exactly mimicking. For a "constructive" approach to the determination of (q)_N, the regions N_y^x of the network state space are identified for which the network N gives the (decoded) output y, provided that the code of the input symbol x is presented at the network input. In particular, N_y^x = {S ∈ S | ν(S, x) = y}. Note that for each x ∈ X and y ∈ Y, N_y^x is an open set. For a given input word w = x₁x₂...x_n ∈ X⁺, the set N_w^{λ⁺(q,w)} of all network states originating the output equal to λ⁺(q, w) is

N_w^{λ⁺(q,w)} = N_{x₁}^{λ(q,x₁)} ∩ ∩_{i=2}^{n} (τ_{x_{i−1}} ∘ ... ∘ τ_{x₂} ∘ τ_{x₁})⁻¹(N_{x_i}^{λ(δ⁺(q, x₁...x_{i−1}), x_i)}),  (8)

where

τ_x(S) = τ(S, x) for each x ∈ X.  (9)

By f⁻¹(A), where f is a map and A is a set, we denote the set of all points whose images under f are in A. For any x ∈ X, τ_x is continuous, and so is the composition τ_{x_m} ∘ ... ∘ τ_{x₂} ∘ τ_{x₁} for any word x₁x₂...x_m ∈ X⁺. It follows that the sets N_w^{λ⁺(q,w)} are open. However, the set

(q)_N = ∩_{w∈X⁺} N_w^{λ⁺(q,w)}  (10)

of network states equivalent to the state q of M is not necessarily open, since an infinite, countable intersection of open sets is not guaranteed to be open.¹² If (q)_N is open, (q)_N ≠ ∅ implies that there exists a (finite) length L of input words such that¹³ (q)_N = ∩_{|w|≤L} N_w^{λ⁺(q,w)}.

From (8) and (10) it follows that if there is an x-loop in a state q of M producing an output symbol y, then

τ_x((q)_N) ⊆ (q)_N ⊆ ∩_{i≥0} (τ_x^i)⁻¹(N_y^x).  (11)

As in Section 3, τ_x^i is the composition of τ_x with itself i times; τ_x⁰ is defined to be the identity map.

¹²The case when trajectories in the RNN state space may be corrupted by noise is not discussed in this paper. However, we note that if (q)_N is not open, then arbitrarily close to a state S ∈ (q)_N there is a network state not equivalent to the state q of M, and an arbitrarily small perturbation of S may cause failure in the RNN modeling of M.

¹³|w| denotes the length of the word w, i.e., the number of symbols contained in w.


Analogously, if there is an x-cycle of length m passing through states q₁, ..., q_m with outputs y_i = λ(q_i, x), i = 1, ..., m, then

(q₁)_N ⊆ ∩_{j=1}^{m} (τ_x^{j−1})⁻¹(∩_{i≥0} (τ_x^{im})⁻¹(N_{y_j}^x)).  (12)

Similar bounds can be found for (q₂)_N, ..., (q_m)_N; in particular,

τ_x^{j−1}((q₁)_N) ⊆ (q_j)_N ⊆ ∩_{i≥0} (τ_x^{im})⁻¹(N_{y_j}^x),  j = 1, ..., m.  (13)

Some researchers have attempted to extract a learned automaton from a trained recurrent network [GMC+92a], [CSSM89], [WK92a], [nHG95]. Extraction procedures rely on the assumption that equivalent network states are grouped together in well-separated regions in the recurrent neurons' activation space. After training, the network state space is partitioned into clusters using some clustering tools, and for each q ∈ Q, the region (q)_N is approximated by (possibly) several clusters so obtained. For example, in Giles et al. [GMC+92a] the network state neurons' activation space is divided into several equal hypercubes. When the number of hypercubes is sufficiently high, each hypercube is believed to contain only mutually equivalent states. After training, Tino et al. [nHG95] present a large number of input words to the network input. All states the network passes through during the presentation are saved. Then the clustering of those states is performed using a Kohonen map with the "star" topology of a neural field consisting of several "branches" of neurons connected to one "central" neuron. Such a topology helped to reduce the great sensitivity to initial conditions found in vector-coding algorithms using independent cluster centers, while avoiding the time-consuming approximation of the input space topology typical of the classical regular-grid topologies of the Kohonen map [nJV94]. Other approaches to RNN state space clustering are discussed in Tino et al. [nHG95].
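The hypercube-partition idea of Giles et al. [GMC+92a] can be sketched in a few lines; the quantization level and the recorded trajectory used below are illustrative assumptions.

```python
# Quantize each recorded network state in (0,1)^L into one of q^L equal
# hypercubes, and collect transitions (cube, input symbol) -> next cube.
# Non-empty cubes then serve as candidate states of the extracted automaton.

def hypercube_index(state, q):
    """Index of the hypercube containing `state` (q cells per axis)."""
    return tuple(min(int(s * q), q - 1) for s in state)

def extract_arcs(trajectory, inputs, q):
    """Transition arcs observed along one recorded run of the network."""
    arcs = {}
    for (s, s_next), x in zip(zip(trajectory, trajectory[1:]), inputs):
        arcs[(hypercube_index(s, q), x)] = hypercube_index(s_next, q)
    return arcs
```

The arcs so obtained feed the transition-diagram construction described next.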

Having approximated the regions (q)_N, the automaton N₁ is constructed by determining the arcs in the corresponding transition diagram, followed by nondeterminism-elimination and minimization procedures.

All ideas presented in this section stem from the assumption that the network N exactly mimics the FSM M it was trained with. However, it is possible that a correct automaton is extracted from a trained RNN even though the network is known to generalize poorly on long, unseen input words [GMC+92a]. This is discussed in Section 8.

5.1 Experiments

A number of experiments were performed in which RNNs with two or three state neurons were trained with simple FSMs.

FIGURE 2. FSM M used for training an RNN. M = (X, Y, Q, δ, λ, q₀) is represented as a directed graph called the state transition diagram. The graph has a node for each state, and every node has |X| outgoing arcs (|X| denotes the number of elements of a finite set X) labeled with x|y (x ∈ X, y ∈ Y) according to the following rule: the arc from the node labeled with s₁ ∈ Q to the node labeled with s₂ ∈ Q is labeled with x|y if s₂ = δ(s₁, x) and y = λ(s₁, x). The node corresponding to the initial state is indicated by an arrow labeled START.

To show how the network learned to organize its state space in order to mimic a given FSM, the regions corresponding to (q)_N were detected. The network state space was "covered" with a regular grid G of R × R points (R is on the order of hundreds), and a finite vocabulary V of distinguishing sequences of M was created. Regions (q)_N were approximated by grouping together those network states from the grid that for each input word from the vocabulary lead to equal output strings. In other words, (q)_N = ∩_{w∈X⁺} N_w^{λ⁺(q,w)} was approximated by ∩_{w∈V} N_w^{λ⁺(q,w)} ∩ G. For example, in Figure 3 approximations of regions of equivalent network states corresponding to states of the FSM shown in Figure 2 can be seen. Figure 3 should be compared with Figure 4, showing activations of state neurons during the presentation of a training set to the RNN after training.
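Assuming the decoded response of the trained network to an input word is available as a function (a hypothetical stand-in for running the RNN and decoding its outputs), the grid-based approximation of regions of equivalent states can be sketched as:

```python
# Group grid states by the tuple of output strings they produce over a finite
# vocabulary of distinguishing words; each group approximates one region of
# mutually equivalent network states. `respond` is an assumed stand-in for
# the trained network followed by output decoding.

def approximate_regions(grid_states, vocabulary, respond):
    """respond(state, word) -> decoded output string; group states by response."""
    regions = {}
    for s in grid_states:
        signature = tuple(respond(s, w) for w in vocabulary)
        regions.setdefault(signature, []).append(s)
    return regions
```

Two grid states land in the same group exactly when no vocabulary word distinguishes their decoded responses.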

Generally, in our experiments, regions approximating (q)_N were observed to be connected and of "simple shape." Further study needs to be devoted to this matter. However, at least empirically and for simple tasks, our use of the Kohonen map as a clustering tool [nHG95], as well as the use of the simple clustering technique introduced in Giles et al. [GMC+92a], are supported.


FIGURE 3. Regions of equivalent network states. The capital letter inside each region indicates to which state of M the network states from that region are equivalent. Δ = 0.1. The two lines stemming from the origin are the lines τ_a(S)₁ = 1/2 and τ_a(S)₂ = 1/2; between them is the region V_{a,(1,1)} (see Section 6).


FIGURE 4. Activations of state neurons when the training set is presented to the network after the training process has finished (weights are frozen).


6 RNN as a Collection of Dynamical Systems

RNNs can be viewed as discrete-time DSs. The literature dealing with the relationship between RNNs and DSs is quite rich: [Hir89], [BW92], [GF89], [Cas95a], [Cas95b], [HZ92], [Jor86], [Wan91], [WB90], [Vid93], [Bee94], and [Hir94], for example. However, as has already been mentioned, the task of completely understanding the global dynamical behavior of a given DS is not at all an easy one. In [WB90] it is shown that networks with just two recurrent neurons can exhibit chaos, and hence the asymptotic network dynamical behavior (on a chaotic attractor) can be very complex.

In order to describe the behavior of the RNN N by an iterative map, we confine ourselves to only one input symbol x from the input alphabet of the FSM used for training N; the code of x is repeatedly presented to the network input. The evolution of the network is described in terms of trajectories {S, τ_x(S), τ_x²(S), ...} in (0,1)^L. The iterative map τ_x : (0,1)^L → (0,1)^L is defined in (9).

As in the previous section, here we also assume that an RNN N exactly mimics the behavior of a reduced, connected FSM M = (X, Y, Q, δ, λ, q₀). In this section we deal with the problem of how certain features of M found in its STD (such as loops and cycles) induce specific features (such as attractive points and periodic orbits) of the network's global dynamical behavior.

Assume that there is an x-loop in a state q of M, and λ(q, x) = y. Then according to (11), (q)_N is a positively invariant set of τ_x and hence an absorbing set of itself under τ_x. From (8) it follows that under τ_x, (q)_N is an absorbing set of all sets (p)_N such that q is x-accessible from p. If there is an open set B such that B ⊆ (q)_N and τ_x(B) ⊆ B, or (q)_N ⊆ B and τ_x(B) ⊆ (q)_N, then there is an attractive set ∩_{m≥0} τ_x^m(B) of τ_x in (q)_N that constitutes a stable network representation of the x-loop in the state q of M.

Similarly, assume that there is an x-cycle γ of length m passing through states q₁, ..., q_m with outputs y_j = λ(q_j, x), j = 1, ..., m. Then according to (13), the (q_j)_N are positively invariant sets of τ_x^m, and ∪_{j=1}^{m} (q_j)_N is a positively invariant set of τ_x. A statement concerning the existence of attractive sets of τ_x^m inside (q_j)_N (or an attractive set of τ_x inside ∪_{j=1}^{m} (q_j)_N) can be made analogously to the statement above. Considering (8), it can be seen that under τ_x, ∪_{q∈γ} (q)_N is an absorbing set of itself and of all sets (p)_N such that γ is x-accessible from p.

Observation 1 formulates these ideas in a more compact form.

Observation 1: Assume that an RNN N exactly mimics the behavior of a reduced, connected FSM M = (X, Y, Q, δ, λ, q₀). Then

• If there is an x-loop in a state q of M, then (q)_N ⊆ N_{λ(q,x)}^x is a positively invariant set of τ_x, and ∪_{q∈Acc(x,p)} (p)_N ⊆ A_{τ_x}((q)_N).¹⁴

• If there is an x-cycle γ of length m passing through states q₁, ..., q_m of M, then (q_j)_N, j = 1, ..., m, are positively invariant sets of τ_x^m, and ∪_{j=1}^{m} (q_j)_N is a positively invariant set of τ_x. (q₁)_N, ..., (q_m)_N are periodically visited in the process of iteration of τ_x, and ∪_{γ∈Acc(x,p)} (p)_N ⊆ A_{τ_x}(∪_{j=1}^{m} (q_j)_N).

When there was an x-loop in a state q of M, in all our experiments an attractive fixed point S*_x of τ_x "near" a vertex v ∈ {0,1}^L was detected (see Section 6.1 below). If S*_x ∈ (q)_N, S*_x constitutes a plausible network representation of the x-loop. If, furthermore, S*_x is the only attractive set of τ_x inside (q)_N, then ∪_{q∈Acc(x,p)} (p)_N is a subset of its basin of attraction.

For each input symbol x of M and each vertex v = (v₁, ..., v_L) ∈ {0,1}^L define the set¹⁵

V_{x,v} = {S ∈ ℝ^L | τ_x(S)_i < 1/2 if v_i = 0; τ_x(S)_i > 1/2 if v_i = 1; i = 1, ..., L}.

The hyperplanes τ_x(S)_i = 1/2 separate ℝ^L into 2^L partitions V_{x,v}. The map τ_x is transformed to the map τ_x^μ by multiplying the weights by a scalar μ > 0, i.e., τ_x^μ(S) = τ_x(μS); μ is also called the neuron gain. The following lemma was proved by Li [Li92]. It is stated here for the maps τ_x and accommodated to our notation. It tells us under what conditions one may expect an attractive fixed point of τ_x^μ to exist "near" a vertex v ∈ {0,1}^L.

Lemma 1: (Li, 1992) Suppose that for some input symbol x of M there exists a vertex v ∈ {0,1}^L with v ∈ cl(V_{x,v} ∩ τ_x(V_{x,v})), where cl denotes topological closure. Then there exists a neuron gain μ₀ such that for all μ > μ₀ there is an attractive fixed point of τ_x^μ in V_{x,v} ∩ τ_x(V_{x,v}).

It was also shown that as μ tends to infinity, the attractive fixed point tends to the vertex v. For two recurrent neurons, under certain conditions on the weights, this is made more specific in the next section (Corollary 1).

Theorem 1: In addition to the assumptions in Observation 1, assume that there is an x-loop in a state q of M. Suppose there is a vertex v ∈ {0,1}^L such that (q)_N ⊆ V_{x,v} and v ∈ cl(τ_x((q)_N)). Then there exists a neuron gain μ₀ such that for all μ > μ₀ there exists an attractive fixed point S* ∈ V_{x,v} ∩ τ_x(V_{x,v}) of τ_x^μ.

¹⁴Recall that A_{τ_x}((q)_N) is the absorbing region of (q)_N under the map τ_x.

¹⁵τ_x(S)_i denotes the i-th component of τ_x(S). When viewed as an iterative map, τ_x operates on (0,1)^L, but here we allow S ∈ ℝ^L.


Proof: From

τ_x((q)_N) ⊆ (q)_N ⊆ V_{x,v} and τ_x((q)_N) ⊆ τ_x(V_{x,v})

it follows that τ_x((q)_N) ⊆ V_{x,v} ∩ τ_x(V_{x,v}). Hence

v ∈ cl(τ_x((q)_N)) ⊆ cl(V_{x,v} ∩ τ_x(V_{x,v})).

Employing Lemma 1, the result follows immediately. □

Loosely speaking, Theorem 1 says that if arbitrarily close to a vertex v ∈ {0,1}^L there is a network state from τ_x((q)_N) ⊆ (q)_N ⊆ V_{x,v}, i.e., if network states that are equivalent to the state q of M in which there is an x-loop are "accumulated" around the vertex v within V_{x,v}, then if the weights are "large enough," so that μ₀ < 1, an attractive fixed point of τ_x exists in a neighborhood of v (Figures 3 and 5).
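The gain effect described by Lemma 1 and Theorem 1 can be observed numerically. The sketch below iterates a two-neuron map under τ^μ(S) = τ(μS); the weight values are illustrative assumptions chosen so that self-excitation dominates mutual inhibition.

```python
import math

def g(p):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-p))

def tau_mu(u, v, mu, alpha=8.0, beta=-5.0, gamma=-5.0, delta=8.0):
    """One step of the gain-scaled map: tau_mu(S) = tau(mu * S)."""
    u, v = mu * u, mu * v
    return g(alpha * u + beta * v), g(gamma * u + delta * v)

def attractor_from(u, v, mu, n=500):
    """Iterate tau_mu long enough to settle near an attractive fixed point."""
    for _ in range(n):
        u, v = tau_mu(u, v, mu)
    return u, v
```

Starting from (0.9, 0.1), a small gain (μ = 0.1) settles at a fixed point well inside the state space, while μ = 3 drives the trajectory to an attractive fixed point close to the vertex (1, 0).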

As mentioned in the introduction, the approach presented in Casey addresses representational issues concerning recurrent neural networks trained to act as regular language recognizers [Cas95a]. Recurrent neural networks are assumed to operate in a noisy environment. Such an assumption can be supported by the argument that in any system implemented on a digital computer there is a finite amount of noise due to round-off errors, and "we are only interested in solutions that work in spite of round-off errors" [Cas95a]. Orbits of points under a map f and attractive sets of f are replaced by the notions of an ε-pseudo-orbit of points under f and an ε-pseudo-attractor of f. These concepts correspond to the idea that instead of the precise trajectory of a point under a map, we should consider each sequence of points (pseudo-trajectory) whose distance from the precise trajectory is less than ε > 0. It is proved that when there is a loop in the reduced acceptor of a regular language also recognized by the network, then there must be an ε-pseudo-attractor (and hence an attractor) of the corresponding map in the network state space. The network accepts or rejects a string of symbols according to whether the ε-pseudo-orbits driven by the string end in subregions denoted the accept and reject regions, respectively. It is assumed that the accept and reject regions are closed in the network state space.

6.1 Experiments

To see how loops and cycles of an FSM M are transformed into global dynamical properties of an RNN N that is able to exactly mimic M, the following experiments were performed:

Consider again the FSM M presented in Figure 2. In Figure 3 it can be seen how the RNN N with two state neurons organizes its state space (0,1)²


FIGURE 5. Absorption diagrams of (A)_N and (C)_N under the map τ_a. Network states lying in the lightest region need one or no iteration step under the map τ_a to get to their absorption set. The more iteration steps that are needed, the darker the region is, with the exception of the region "close to" the "border line" between the two absorption diagrams. That region is light so that the border contours are clearly visible. The figure should be compared with the figure in the previous section showing (A)_N and (C)_N. Note the two attractive points of τ_a placed inside (A)_N and (C)_N, induced by the a-loops in states A and C respectively.

into three distinct connected regions (A)_N, (B)_N, and (C)_N, corresponding to states A, B, and C respectively. It was observed¹⁶ that trajectories starting in (A)_N converged to a single attractive point placed inside (A)_N. The same applies to the state C and its corresponding region (C)_N. So the a-loops in the states A and C induce attractive points of τ_a placed inside the corresponding regions of equivalent RNN states. In fact, this is the only stable RNN representation of loops in M we have observed during our simulations.

(A)_N and (C)_N are absorbing sets of themselves under the map τ_a. Since the state C is a-accessible from B, (C)_N is an absorbing set of (B)_N under τ_a. Absorption diagrams of (A)_N and (C)_N under τ_a, together with the attractive points, are presented in Figure 5.

¹⁶As before, during the simulations the network state space was "covered" with a regular grid of points, and only the orbits starting from these points were taken into account.


FIGURE 6. Absorption diagram of (C)_N under the map τ_b. Network states from the two white regions do not belong to the absorption region of (C)_N. The figure should be compared with the figure in the previous section showing (C)_N. Note the attractive point of τ_b placed inside (C)_N, induced by the b-loop in the state C, as well as the two periodic points of τ_b placed inside (A)_N and (B)_N, constituting an attractive periodic orbit of period two. The orbit is induced by the b-cycle {A, B}.

If we presented M only with the input symbol b, we would end up either in a b-cycle of length two involving states A and B, or in a b-loop in the state C. When, during the experiments, we started in a state from (C)_N and presented to the network input only the code of the symbol b, the trajectory converged to an attractive point inside (C)_N. An absorption diagram of (C)_N under τ_b, together with the attractive point, can be seen in Figure 6.

On the other hand, when started in a state from (A)_N, the trajectory jumped between the sets (A)_N and (B)_N, converging to a periodic orbit of length two. Again, this was observed to be the typical stable RNN representation of a cycle corresponding to an input symbol of M. The states constituting the orbit can be seen in Figure 6.

In the second experiment, an FSM M shown in Figure 7 was used to generate the training set for an RNN N with three state neurons. The a-cycle {A, B, C, D, E} of length five induced an attractive periodic orbit of τ_a of period five. Projections of the orbit to a two-dimensional subspace (0,1)² of the network state space can be seen in Figures 8, 9, and 10. To illustrate the convergence of orbits, the orbits were plotted after 60, 100, and 300 pre-iterations (Figures 8, 9, and 10 respectively). No plotting occurred during the pre-iterations.

FIGURE 7. FSM M whose state transition diagram contains a cycle of length five.

7 RNN with Two State Neurons

Usually, studies of the asymptotic behavior of recurrent neural networks assume some form of structure in the weight matrix describing the connectivity pattern among recurrent neurons. For example, symmetric connectivity and the absence of self-interactions enabled Hopfield [Hop84] to interpret the network as a physical system having energy minima at the attractive fixed points of the network. These rather strict conditions were weakened in Casey [Cas95b], where more easily satisfied conditions are formulated. Blum and Wang [BW92] globally analyze networks with asymmetrical connectivity patterns of special types. In the case of two recurrent neurons with sigmoidal activation function g, they give results for weight matrices with diagonal elements equal to zero.¹⁷ Recently, Jin, Nikiforuk, and Gupta [JNG94] reported new results on absolute stability for a rather general class of recurrent neural networks. Conditions under which all fixed points of the network are attractive were determined by the weight matrix of the network.

¹⁷In such a case the recurrent network is shown to have only one fixed point and no "genuine" periodic orbits (of period greater than one).

FIGURE 8. Convergence of orbits of the map τ_a to an attractive periodic orbit of period five. The attractive periodic orbit constitutes a stable representation of the a-cycle in the FSM M presented in the previous figure. The orbits were plotted after 60 pre-iterations; no plotting occurred during the pre-iterations. The RNN has three state neurons. Shown are the projections of the orbits to a two-dimensional subspace (0,1)² corresponding to the activations of two of the recurrent neurons.

FIGURE 9. Convergence of orbits of the map τ_a to the attractive periodic orbit of period five shown in the previous figure. This time, the number of pre-iterations is 100. No plotting occurred during the pre-iterations.

FIGURE 10. The attractive periodic orbit of period five of the map τ_a whose convergence is illustrated in the last two figures. The attractive orbit is approximated by plotting the trajectories of τ_a after the preceding 300 non-plot iterations.

The purpose of this section is to investigate the position and stability types of fixed points of the maps τ_x under certain assumptions concerning the signs and magnitudes of the weights. The iterative map under consideration can be written as follows:

(u_{n+1}, v_{n+1}) = (g(αu_n + βv_n), g(γu_n + δv_n)),  (14)

where (u_n, v_n) ∈ (0,1)² is the state of the recurrent network with two state neurons at the time step n, and α, δ and β, γ are positive and negative real coefficients respectively. Thus we investigate the case when the two recurrent neurons are self-exciting (α, δ > 0), with a tendency to inhibit each other (β, γ < 0).
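A minimal numerical sketch of the map (14); the coefficient values are illustrative choices respecting the sign pattern above and, for later use, the assumptions of Theorem 2 (α, δ > 4, α > |β|, δ > |γ|).

```python
import math

ALPHA, BETA, GAMMA, DELTA = 8.0, -5.0, -5.0, 8.0  # illustrative coefficients

def g(p):
    """Logistic sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-p))

def step(u, v):
    """One iteration of (14)."""
    return g(ALPHA * u + BETA * v), g(GAMMA * u + DELTA * v)

def iterate(u, v, n=500):
    """Follow the trajectory of (u, v) under (14) for n steps."""
    for _ in range(n):
        u, v = step(u, v)
    return u, v
```

With these coefficients, a trajectory started at (0.9, 0.1) settles at an attractive fixed point close to the vertex (1, 0) of the unit square.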

For c > 4, define

Δ(c) = √(1/4 − 1/c).

In the following it will be shown how the network state space (0,1)² can be partitioned into regions according to the stability types of the fixed points of (14) found in those regions.

Regions

(0, 1/2 − Δ(α)) × (0, 1/2 − Δ(δ)),

[1/2 − Δ(α), 1/2] × (0, 1/2 − Δ(δ)) ∪ (0, 1/2 − Δ(α)) × [1/2 − Δ(δ), 1/2],

and

[1/2 − Δ(α), 1/2] × [1/2 − Δ(δ), 1/2]

are denoted by R^A_{00}, R^S_{00}, and R^R_{00} respectively. Regions symmetrical to R^A_{00}, R^S_{00}, and R^R_{00} with respect to the line u = 1/2 are denoted by R^A_{10}, R^S_{10}, and R^R_{10}, respectively:

R^A_{10} = (1/2 + Δ(α), 1) × (0, 1/2 − Δ(δ)),

R^S_{10} = [1/2, 1/2 + Δ(α)] × (0, 1/2 − Δ(δ)) ∪ (1/2 + Δ(α), 1) × [1/2 − Δ(δ), 1/2],

R^R_{10} = [1/2, 1/2 + Δ(α)] × [1/2 − Δ(δ), 1/2].

Similarly, let R^A_{01}, R^S_{01}, and R^R_{01} denote the regions symmetrical to R^A_{00}, R^S_{00}, and R^R_{00} with respect to the line v = 1/2. Finally, R^A_{11}, R^S_{11}, and R^R_{11} denote the regions that are symmetrical to R^A_{01}, R^S_{01}, and R^R_{01} with respect to the line u = 1/2 (Figure 11).

FIGURE 11. Partitioning of the RNN state space according to the stability types of fixed points of the maps τ_x.

Theorem 2: Suppose α > 4, β < 0, γ < 0, δ > 4, α > |β|, δ > |γ|. Then the following can be said about the fixed points of (14):

• Attractive and repulsive points can lie only in ∪_{i∈I} R^A_i and ∪_{i∈I} R^R_i respectively, where I is the index set I = {00, 10, 01, 11}. If max{α(δ − 4), δ(α − 4)} < βγ, there are no repellors.


• All fixed points in ∪_{i∈I} R^S_i are saddle points.¹⁸

Proof: Any fixed point (u, v) of (14) satisfies

(u, v) = (g(αu + βv), g(γu + δv)).  (15)

The Jacobian J(u, v) of (14) at (u, v) is given by

J(u, v) = [ αG₁(u,v)   βG₁(u,v) ]
          [ γG₂(u,v)   δG₂(u,v) ],

where G₁(u,v) = g′(αu + βv) and G₂(u,v) = g′(γu + δv). Since g′(p) = g(p)(1 − g(p)), considering (15) we have

(G₁(u,v), G₂(u,v)) = (u(1 − u), v(1 − v)) = φ(u, v).  (16)

The eigenvalues of J are¹⁹

λ_{1,2} = (αG₁ + δG₂ ± √D) / 2,

where D = (αG₁ − δG₂)² + 4G₁G₂βγ. D is always positive, and so is αG₁ + δG₂. It follows that to identify

possible values of G₁ and G₂ such that |λ_{1,2}| < 1, it is sufficient to solve the inequality αG₁ + δG₂ + √D < 2, or equivalently,

2 − αG₁ − δG₂ > √D.  (17)

Consider only G₁, G₂ such that αG₁ + δG₂ < 2, that is, (G₁, G₂) lies under the line p : αG₁ + δG₂ = 2. All (G₁, G₂) above p lead to at least one eigenvalue of J greater than 1. Squaring both sides of (17), we arrive at

(αδ − βγ)G₁G₂ − αG₁ − δG₂ > −1.  (18)

The "border" curve κ : (αδ − βγ)G₁G₂ − αG₁ − δG₂ = −1 in (G₁, G₂)-space is a hyperbola G₂ = κ(G₁) = A[1 + B/(G₁ − C)], where

A = 1/(δ − βγ/α),  C = 1/(α − βγ/δ),  B = C − 1/α.

Since 0 < δ − βγ/α < δ and 0 < α − βγ/δ < α, it follows that A > 1/δ, C > 1/α, and B > 0. Furthermore, κ(1/α) = 0, κ(0) = 1/δ, and (G₁, G₂) satisfying (18) lie under the "left branch" and above the "right branch" of κ (see Figure 12). It is easy to see that since we are confined to the space below the line p,

¹⁸Note that this does not exclude the existence of saddle fixed points in other regions.

¹⁹To simplify the notation, the identification (u, v) of the fixed point at which (14) is linearized is omitted.


FIGURE 12. An illustration for the proof of Theorem 2. (G₁, G₂)-space is the space of derivatives of the sigmoid transfer functions with respect to the weighted sum of the neurons' inputs. All (G₁, G₂) ∈ (0, 1/4]² below the left branch of κ correspond to attractive fixed points.

only (G₁, G₂) under the left branch of κ will be considered. Indeed, p is a decreasing line going through (C, P), where P = (2 − αC)/δ, and A − P = 2(A − 1/δ) > 0, so p never intersects the right branch of κ.

A necessary (but not sufficient) condition for a fixed point (u, v) of (14) to be attractive is that the corresponding (G₁, G₂) = φ(u, v) ∈ (0, 1/4]² lie in (0, 1/α) × (0, 1/δ), where the map φ is defined by (16). For each (G₁, G₂) ∈ (0, 1/4]², under φ, there are four preimages:

(u, v) = φ⁻¹(G₁, G₂) = {(1/2 ± Δ(1/G₁), 1/2 ± Δ(1/G₂))}.  (19)

The set of preimages of (0, 1/α) × (0, 1/δ) is the set ∪_{i∈I} R^A_i, I = {00, 10, 01, 11}.

A fixed point (u, v) of (14) is a saddle if |λ₂| < 1 and |λ₁| = λ₁ > 1. Since αδ > βγ,

0 < √((αG₁ + δG₂)² − 4G₁G₂(αδ − βγ)) = √D < αG₁ + δG₂.


It follows that if αG₁ + δG₂ < 2, i.e., (G₁, G₂) lies under the line p, then 0 < αG₁ + δG₂ − √D < 2 holds, and 0 < λ₂ < 1. For (G₁, G₂) above the line p, i.e., αG₁ + δG₂ > 2, we solve the inequality αG₁ + δG₂ − 2 < √D, which leads to the "border" curve G₂ = κ(G₁) we have already described. This time, only (G₁, G₂) "between" the two branches of the hyperbola κ are considered.

It can be seen that at all fixed points (u, v) of (14) with

φ(u, v) ∈ (0, 1/4] × (0, min{A, 1/4}) ∪ (0, min{C, 1/4}) × (0, 1/4],

the eigenvalue λ₂ > 0 is less than 1. This is certainly true for all (u, v) such that φ(u, v) ∈ (0, 1/4] × (0, 1/δ) ∪ (0, 1/α) × (0, 1/4]. In particular, the preimages of (G₁, G₂) ∈ (1/α, 1/4] × (0, 1/δ) ∪ (0, 1/α) × (1/δ, 1/4] under φ define the region ∪_{i∈I} R^S_i where only saddle fixed points of (14) can lie.

Fixed points (u, v) whose images under φ lie above the right branch of κ are repellors. No (G₁, G₂) can lie in that region if C, A > 1/4, that is, if δ(α − 4) < βγ and α(δ − 4) < βγ, which is equivalent to max{α(δ − 4), δ(α − 4)} < βγ. □
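The eigenvalue computation in the proof can be cross-checked numerically. The classifier below evaluates λ_{1,2} from G₁ = u(1 − u) and G₂ = v(1 − v) as in (16); the coefficients and test points are illustrative (the point (0.5, 0.5) is not a fixed point of (14) for these coefficients, it merely exercises the saddle branch of the formula).

```python
import math

def classify(u, v, alpha, beta, gamma, delta):
    """Stability type read off from the eigenvalues of the Jacobian of (14)."""
    g1 = u * (1.0 - u)          # G1 = g'(alpha*u + beta*v) at a fixed point
    g2 = v * (1.0 - v)          # G2 = g'(gamma*u + delta*v) at a fixed point
    tr = alpha * g1 + delta * g2
    disc = (alpha * g1 - delta * g2) ** 2 + 4.0 * g1 * g2 * beta * gamma
    l1 = (tr + math.sqrt(disc)) / 2.0
    l2 = (tr - math.sqrt(disc)) / 2.0
    if abs(l1) < 1.0 and abs(l2) < 1.0:
        return 'attractive'
    if abs(l1) >= 1.0 and abs(l2) >= 1.0:
        return 'repulsive'
    return 'saddle'
```

With α = δ = 8 and β = γ = −5, the fixed point near (0.9996, 0.0071) found by iterating (14) classifies as attractive, while G₁ = G₂ = 1/4 (i.e., (u, v) = (0.5, 0.5)) yields λ₁ > 1 > λ₂ > 0, a saddle signature.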

The condition max{α(δ − 4), δ(α − 4)} < βγ implies that when the self-excitations of the recurrent neurons are not significantly higher than their mutual inhibition, there are no repulsive fixed points of (14). As the self-excitations α and δ grow, the stable fixed points of (14) move closer towards {0,1}². More precisely:

Corollary 1: Same assumptions as in Theorem 2. All attractive fixed points of (14) lie in the ε-neighborhood of the vertices of the unit square, where

ε = √((1/2 − Δ(α))² + (1/2 − Δ(δ))²).

The tendency of attractive fixed points in discrete-time RNNs with exclusively self-exciting recurrent neurons to move towards saturation values as the neuron gain grows is also discussed in Hirsch [Hir94].

So far, we have confined the areas of the network state space (0,1)² where (under some assumptions on the weights) fixed points of (14) of particular stability types can lie. In the following, it will be shown that those regions correspond to the monotonicity intervals of the functions defining the fixed points of (14). Reasoning about the stability type of a fixed point can then be based on knowing where these functions intersect.

Recall that any fixed point (u*, v*) of (14) satisfies

(u*, v*) = (g(αu* + βv*), g(γu* + δv*)),


or equivalently, (u*, v*) lies on the intersection of the two curves v = f_{α,β}(u) and u = f_{δ,γ}(v), where f_{c₁,c₂} : (0,1) → ℝ,

f_{c₁,c₂}(t) = −(c₁/c₂)t + (1/c₂) ln(t/(1 − t)).  (20)

lim_{t→0⁺} f_{c₁,c₂}(t) = ∞ and lim_{t→1⁻} f_{c₁,c₂}(t) = −∞.²⁰ f_{c₁,c₂} is convex on (0, 0.5) and concave on (0.5, 1). If c₁ < 4, f_{c₁,c₂} is nonincreasing; otherwise, it is decreasing on (0, 0.5 − Δ(c₁)) ∪ (0.5 + Δ(c₁), 1) and increasing on (0.5 − Δ(c₁), 0.5 + Δ(c₁)). The graph of f_{c₁,c₂}(t) is presented in Figure 13.
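The curve (20) and its bend points 1/2 ± Δ(c₁) can be checked directly; the coefficients c₁ = 8, c₂ = −5 below are illustrative, with c₁ > 4 and c₁ > |c₂|.

```python
import math

def bend(c):
    """Delta(c) = sqrt(1/4 - 1/c), defined for c > 4."""
    return math.sqrt(0.25 - 1.0 / c)

def f(t, c1, c2):
    """f_{c1,c2}(t) = -(c1/c2) t + (1/c2) ln(t / (1 - t)), for 0 < t < 1."""
    return -(c1 / c2) * t + math.log(t / (1.0 - t)) / c2
```

For c₁ = 8 the curve decreases on (0, 1/2 − Δ(8)) ≈ (0, 0.146), increases on the middle interval, and decreases again beyond 1/2 + Δ(8) ≈ 0.854.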

The "bent" graph of f_{c₁,c₂} for c₁ > 4 gives rise to a potentially complicated intersection pattern of f_{α,β}(u) and f_{δ,γ}(v). In the following, we shall consider only the case c₁ > |c₂|, since it is sufficient to explain some interesting features of the training process observed in our experiments. Note that c₁ > |c₂| means that for both neurons, the self-excitation is higher than the inhibition from the other neuron.

Lemma 2: Assume α > 0, β < 0, γ < 0, δ > 0. If α > |β| and δ > |γ|, then f_{α,β}(u) and f_{δ,γ}(v) do not intersect in (0, 0.5)².

Proof: Assume that both f_{α,β}(u) and f_{δ,γ}(v) lie in (0, 0.5)²; otherwise, the result follows trivially. For u ∈ (0, 0.5), both (ln(u/(1 − u)))/β and −αu/β are positive. It follows that in (0, 0.5)², f_{α,β}(u) lies above the line v = αu/|β|. Similarly, in (0, 0.5)², f_{δ,γ}(v) lies above the line u = δv/|γ|. In terms of the coordinate system (u, v), this can be restated as follows: in (0, 0.5)², the graph of f_{α,β} lies above the line v = αu/|β|, while the graph of f_{δ,γ} lies below the line v = (|γ|/δ)u. Since |γ|/δ < 1 < α/|β|, f_{α,β}(u) and f_{δ,γ}(v) do not intersect in (0, 0.5)². □

The correspondence between the regions R^Q_{ij}, i,j ∈ {0,1}, Q ∈ {A,S,R}, and the regions of monotonicity of f_{α,β}(u) and f_{δ,γ}(v) enables us to interpret the training process as a process of "shaping" f_{α,β} and f_{δ,γ} so that the desired behavior of (14), as prescribed by the training set, is achieved.

Denote the set {(u, f_{α,β}(u)) | u ∈ (0, 0.5 − Δ(α))} of points lying on the "first decreasing branch" of f_{α,β}(u) by f^{1-}_{α,β}. Analogously, the set of points {(u, f_{α,β}(u)) | u ∈ (0.5 + Δ(α), 1)} on the "second decreasing branch" of f_{α,β}(u) is denoted by f^{2-}_{α,β}. Finally, let f^{+}_{α,β} denote the set of points {(u, f_{α,β}(u)) | u ∈ (0.5 − Δ(α), 0.5 + Δ(α))} on the increasing part of f_{α,β}(u). Similarly, f^{1-}_{δ,γ}, f^{2-}_{δ,γ}, and f^{+}_{δ,γ} are used to denote the sets {(f_{δ,γ}(v), v) | v ∈ (0, 0.5 − Δ(δ))}, {(f_{δ,γ}(v), v) | v ∈ (0.5 + Δ(δ), 1)}, and {(f_{δ,γ}(v), v) | v ∈ (0.5 − Δ(δ), 0.5 + Δ(δ))}, respectively. Using Theorem 2 and Lemma 2, we state the following corollary:

¹²Note that since α, δ and β, γ are assumed to be positive and negative, respectively, we have c1 > 0 and c2 < 0.

6. Finite State Machines and Recurrent Neural Networks 199

FIGURE 13. Graph of the function f_{c1,c2}(t) when c2 < 0. Solid and dashed lines represent the cases 0 < c1 ≤ 4 and c1 > 4, respectively. For c1 > 4, the function "bends" and becomes increasing on (1/2 − Δ(c1), 1/2 + Δ(c1)).

Corollary 2: Same assumptions as in Theorem 2. Attractive fixed points of (14) can lie only on intersections of the decreasing parts of f_{α,β} and f_{δ,γ}. Whenever the increasing part of f_{α,β} intersects with a decreasing part of f_{δ,γ} (or vice versa), the intersection corresponds to a saddle point of (14). In particular, all attractive fixed points of (14) are from f^{1-}_{α,β} ∩ f^{2-}_{δ,γ}, f^{2-}_{α,β} ∩ f^{1-}_{δ,γ}, or f^{2-}_{α,β} ∩ f^{2-}_{δ,γ}. Every point from f^{+}_{α,β} ∩ f^{i-}_{δ,γ} or f^{i-}_{α,β} ∩ f^{+}_{δ,γ}, i ∈ {1,2}, is a saddle point of (14).
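The corollary can be checked numerically. The sketch below assumes the fixed-point curve f_{c1,c2}(t) = −(c1/c2)t + (1/c2)ln(t/(1−t)) from (20) and uses the coefficient values reported for τ_a after the 150th epoch (Figure 17); it locates every fixed point along the composed curve and classifies it by the eigenvalues of the Jacobian of (14).

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

# coefficients of the map (14) after the 150th training epoch (Figure 17)
a, b, c, d = 5.21, -2.58, -2.63, 5.23          # alpha, beta, gamma, delta

def F(u, v):                                    # the map (14)
    return g(a * u + b * v), g(c * u + d * v)

def f(t, c1, c2):                               # fixed-point curve (20)
    return -(c1 / c2) * t + math.log(t / (1.0 - t)) / c2

def h(u):
    # fixed points solve f_{delta,gamma}(f_{alpha,beta}(u)) = u
    v = f(u, a, b)
    if not 0.0 < v < 1.0:
        return None
    return f(v, d, c) - u

roots, prev = [], None
for i in range(1, 4000):                        # scan for sign changes
    u = i / 4000.0
    val = h(u)
    if val is not None:
        if prev is not None and prev[1] * val < 0:
            lo, hi = prev[0], u                 # bracket a root, then bisect
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                if h(lo) * h(mid) <= 0:
                    hi = mid
                else:
                    lo = mid
            roots.append(0.5 * (lo + hi))
        prev = (u, val)

attractive, saddles = [], []
for u in roots:
    v = f(u, a, b)
    du, dv = u * (1.0 - u), v * (1.0 - v)       # g'(x) = g(x)(1 - g(x))
    tr = a * du + d * dv                        # Jacobian trace
    det = (a * d - b * c) * du * dv             # Jacobian determinant
    disc = math.sqrt(tr * tr - 4.0 * det)       # real since beta*gamma > 0
    lam = sorted(abs(x) for x in ((tr - disc) / 2.0, (tr + disc) / 2.0))
    if lam[1] < 1.0:
        attractive.append((u, v))
    elif lam[0] < 1.0:
        saddles.append((u, v))
```

With these coefficients the scan finds three attractive fixed points (near (0,1), (1,1), and (1,0)) separated by two saddles, matching the three-attractor regime reported between the 138th and 321st epochs.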

Page 210: NNPattern

200 Tino, Home, Giles, and Collingwood

The usual scenario for the creation of a new attractive fixed point of (14) is that of a saddle-node bifurcation, in which an attractive fixed point and a saddle point are created together. Attractive fixed points disappear in the reverse manner: an attractive point coalesces with a saddle, and the two annihilate. This is illustrated in Figure 14. f_{δ,γ}(v), shown as a dashed curve, intersects f_{α,β}(u) in three points. By increasing δ, f_{δ,γ} bends further (solid curve) and intersects f_{α,β} in five points.¹³ Saddle and attractive points are marked with squares and circles, respectively. Note that as δ increases, the attractive fixed points move closer to the vertices of {0,1}².

A similar approach to determining the number and stability types of fixed points of the underlying dynamical systems in continuous-time recurrent neural networks can be found in Beer [Bee94].

FIGURE 14. Geometrical illustration of saddle-node bifurcation in RNN with two state neurons.

¹³At the same time, |γ| also has to be increased appropriately, to compensate for the increase in δ, so that the "bent" part of f_{δ,γ} does not move radically to higher values of u.


FIGURE 15. FSM M with four a-loops (with outputs 1-4) and "transition" input symbol b.

8 Experiments—Learning Loops of FSM

An RNN with two state neurons was trained with the FSM M presented in Figure 15. In each of its four states there is an a-loop. The input symbol b causes subsequent transitions between states, up to the "trap" state D. The training set representing M was constructed as follows: transitions to states B, C, and D from the initial state A are represented by one, two, and three consecutive b's, respectively. Apart from the transition prefix, each a-loop is represented by strings of consecutive a's up to length 5. The b-loop in the state D is represented by a string of 5 consecutive b's. For each input string w, its corresponding output string λ*(A, w) is determined.
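The construction above can be sketched as follows. This is a hypothetical reading of Figure 15: the a-loop in states A-D is assumed to emit the state's index (1-4), and the output of every b transition is assumed to be 1, since the figure's edge labels are only partly recoverable here.

```python
# FSM M of Figure 15 (output labels partly assumed; see lead-in)
STATES = "ABCD"

def move(q, x):                 # transition function of M
    if x == "a":
        return q                # a-loop in every state
    return STATES[min(STATES.index(q) + 1, 3)]   # b moves toward trap state D

def out(q, x):                  # output function of M
    return str(STATES.index(q) + 1) if x == "a" else "1"

def run(word, q="A"):           # extended output function lambda*(A, w)
    o = ""
    for x in word:
        o += out(q, x)
        q = move(q, x)
    return o

# transition prefixes plus a-loop strings of length 1..5, and the b-loop in D
training = []
for prefix in ["", "b", "bb", "bbb"]:
    for k in range(1, 6):
        w = prefix + "a" * k
        training.append((w, run(w)))
training.append(("bbbbb", run("bbbbb")))
```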

During training, after each epoch, attractive sets of τ_a were numerically detected. The evolution of the position and number of attractive fixed points of τ_a in (0,1)² can be seen in Figure 16. Near the points, the corresponding epoch numbers are shown. At the beginning, there is only one fixed point of τ_a. A bifurcation during the 59th epoch produces two attractive fixed points. From the 138th epoch to the 321st epoch there are three attractive fixed points and two saddle points of τ_a. These are determined by the intersections of the corresponding curves f_{α_a,β_a} and f_{δ_a,γ_a}, where α_a, β_a, γ_a, and δ_a are the coefficients of the map τ_a as in (14). The episode of existence of the attractive fixed point f^{2-}_{α_a,β_a} ∩ f^{2-}_{δ_a,γ_a} begins when f_{α_a,β_a} is "bent" enough so that f^{2-}_{δ_a,γ_a} intersects with both the increasing and decreasing parts f^{+}_{α_a,β_a} and f^{2-}_{α_a,β_a}, respectively. At the same time, in order for the intersection f^{2-}_{α_a,β_a} ∩ f^{2-}_{δ_a,γ_a} to exist, f_{δ_a,γ_a} also needs to be sufficiently "bent" (Figure 17). The degree to which f_{α_a,β_a} and f_{δ_a,γ_a} are "bent" is primarily controlled by α_a and δ_a, respectively, while the vertical positions of the bent parts are mainly determined by β_a and γ_a, respectively. During the 322nd epoch, the attractive fixed point f^{2-}_{α_a,β_a} ∩ f^{2-}_{δ_a,γ_a}, together with the saddle point f^{2-}_{α_a,β_a} ∩ f^{+}_{δ_a,γ_a}, disappears because the increase in |γ_a| pushes the "bent" part of f_{δ_a,γ_a} deeper inside the state space (0,1)² (Figure 18).
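The numerical detection of attractive sets can be sketched by brute force: iterate τ_a from a grid of initial states and merge the limit points. The coefficient values below are those reported for τ_a after the 1000th epoch (Figure 18).

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

# coefficients of tau_a after the 1000th epoch (Figure 18)
a, b, c, d = 8.61, -3.96, -3.08, 5.17

def step(u, v):
    return g(a * u + b * v), g(c * u + d * v)

# iterate from a grid of initial states and collect the limit points
attractors = []
for i in range(1, 10):
    for j in range(1, 10):
        u, v = i / 10.0, j / 10.0
        for _ in range(2000):
            u, v = step(u, v)
        # merge limit points that coincide up to a tolerance
        if not any(abs(u - p) + abs(v - q) < 1e-4 for p, q in attractors):
            attractors.append((u, v))
```

For these coefficients the grid search finds exactly two attractive fixed points, S_A near the vertex (0,1) and S_D near (1,0), consistent with the annihilation of the third attractor during the 322nd epoch.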


FIGURE 16. Evolution of position of attractive sets of τ_a during RNN training on FSM M (two state neurons).

The training error was 0.08, yet the only attractive sets of τ_a that were detected were two attractive fixed points S_A and S_D near the vertices (0,1) and (1,0), corresponding to the a-loops in states A and D, respectively. Starting in a small neighborhood of S_A or S_D, upon repeated presentation of input a, the decoded network outputs are 1 and 4, with trajectories of τ_a approaching S_A and S_D, respectively. There is no stable representation of the a-loops in states B and C; i.e., there are no positively invariant sets of τ_a leading to the network outputs 2 and 3, respectively, when input a is presented to the network.

However, the net is able to simulate the training set perfectly. It follows that after it is reset¹⁴ and presented with b, when five consecutive a's arrive, the decoded output will be five consecutive 2's. Hence, the network must have developed a mechanism for acting as if the a-loops in B and C were represented in a stable manner, at least for strings having no more than five consecutive a's. It turns out that the underlying mechanism for pretending that there are stable representations of a-loops for short input

¹⁴With (possibly repeated) presentation of the "reset" input symbol #.


FIGURE 17. f_{α_a,β_a}(u) and f_{δ_a,γ_a}(v) after the 150th training epoch. Coefficients of the map τ_a are α_a = 5.21, β_a = −2.58, γ_a = −2.63, δ_a = 5.23.

strings involves the behavior of trajectories starting "near" the stable manifold W^s of the saddle fixed point S_s lying "between" the attractive points S_A and S_D, with W^s constituting the border between the regions of attraction of S_A and S_D.

Consider a point S "near" W^s. Due to the continuity of τ_a, the orbit of S under τ_a first moves towards S_s along W^s and then away from S_s along a branch of the unstable manifold W^u of S_s, gradually approaching one of the attractive points S_A, S_D. To which of the two points the trajectory actually converges is determined by the "side" of W^s on which the initial point S lies. Assume that the trajectory of S converges to S_A. If we slightly displace S into S' on "the other side" of the curve W^s, the trajectories of S and S' move towards S_s close to each other, but as they approach S_s, the trajectory of S' follows the other branch of W^u towards S_D (see Figure 19). As we move the starting point S towards S_A or S_D, the trajectories follow the pattern described above less and less; they move towards S_A and S_D in a straightforward manner¹⁵ and approach a vicinity of S_A and S_D, respectively, much faster than trajectories starting "near" W^s. Hence, the network is able to "cheat" by pretending stable behavior as described by

¹⁵Due to the coefficients of τ_a, the eigenvalues of its Jacobian at every point of (0,1)² are real, implying an absence of rotation in neighborhoods of fixed points.


FIGURE 18. f_{α_a,β_a}(u) and f_{δ_a,γ_a}(v) after the 1000th training epoch. Coefficients of the map τ_a are α_a = 8.61, β_a = −3.96, γ_a = −3.08, δ_a = 5.17.

the a-loop in the state B, because it takes advantage of the different convergence rates of orbits starting near W^s and near S_D. The decoded output of the net with input a and a state near S_D is 4 (region D), while for states involving the first several steps of trajectories starting near W^s, the output is 2 (region B). An analogous statement can be made about trajectories starting near S_A and W^s and regions A and C, respectively. Most of the time towards the end of the learning session was spent on learning the output function ν_a(S) = ν(S, a) in the closely neighboring regions B and C, so that the outputs for states from B and C are 2 and 3, respectively (see Figures 20, 21). The map τ_# associated with the "reset" input symbol # has one attractive fixed point in the region A. Under the "reset" map τ_#, trajectories of network states S ∈ (0,1)² quickly approach region A, thus preparing the ground for processing of a new input word.
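The difference in convergence rates is easy to reproduce numerically. In this sketch (using the Figure 18 coefficients for τ_a), a point of the basin border W^s is located by bisecting along a hypothetical segment joining the two basins; a trajectory started next to W^s takes far longer to settle near S_A than one started already close to S_A.

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

# coefficients reported for tau_a after the 1000th epoch (Figure 18)
a, b, c, d = 8.61, -3.96, -3.08, 5.17

def step(p):
    u, v = p
    return g(a * u + b * v), g(c * u + d * v)

def limit(p, n=1000):
    for _ in range(n):
        p = step(p)
    return p

def to_SA(p):
    # does the orbit settle near S_A ~ (0, 1) (v large) or S_D ~ (1, 0)?
    return limit(p)[1] > 0.5

# bisect along a segment joining the two basins for a point of the basin
# border, i.e. the stable manifold W^s of the saddle S_s between S_A and S_D
seg = lambda t: (0.1 + 0.8 * t, 0.9 - 0.8 * t)   # seg(0) -> S_A, seg(1) -> S_D
lo, hi = 0.0, 1.0
for _ in range(40):
    mid = 0.5 * (lo + hi)
    if to_SA(seg(mid)):
        lo = mid
    else:
        hi = mid
S, S2 = seg(lo), seg(hi)        # nearby points on opposite sides of W^s

SA = limit((0.1, 0.9))          # the attractive point S_A itself

def steps_to(target, p, tol=1e-3, cap=5000):
    # iterations of tau_a until the orbit enters a tol-ball around target
    for k in range(cap):
        if abs(p[0] - target[0]) + abs(p[1] - target[1]) < tol:
            return k
        p = step(p)
    return cap

slow = steps_to(SA, S)              # starts near W^s: long saddle transient
fast = steps_to(SA, (0.05, 0.95))   # starts already near S_A
```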

The key role, however, is played by the map τ_b: it simulates the transitions between the states with a-loops in M. Starting at S ∈ A, the states τ_b(S) ∈ B and τ_b²(S) ∈ C lie near W^s, and the behavior of τ_a on B and C appears stable for several iterations. Upon repeated presentation of a, the orbit of τ_b³(S) ∈ D converges to S_D, with network output 4.

The delicate role of τ_b, responsible for the transitions A → B → C → D while jumping onto the "appropriate" sides of W^s and staying close to W^s,


FIGURE 19. Illustration of a mechanism that enables RNN to "pretend" stable representation of loops in M for short input strings.


FIGURE 20. The map (ν_a)₂ representing the output of the second output neuron, which corresponds to the output symbol 2. Note the sharp activity change along the border of the regions of attraction of S_A and S_D.

FIGURE 21. The map (ν_a)₃ representing the output of the third output neuron, which corresponds to the output symbol 3. A sharp activity change along the border of the regions of attraction of S_A and S_D is clearly visible.


together with the different convergence rates of orbits under τ_a starting close to W^s and near S_A, S_D, are the principal tools enabling the net to behave well on shorter test strings, although it generalizes poorly on strings with many consecutive a's after b or bb. In particular, the outputs of the net for input strings ba^n and bba^m are consistent with the training set for n ≤ 8 and m ≤ 10. As further a's keep coming, trajectories of τ_a move away from B and C towards S_D and S_A, respectively.

To visualize the process of state degradation upon repeated presentation of input a, a state degradation diagram for input a is constructed as follows (M_a denotes the set of states of M in which there is an a-loop):

• Construct a finite vocabulary Γ of short distinguishing words for M_a, such that Γ does not contain a word ua^i v, i ≥ 2, where u leads to a state of M in which there is an a-loop. To each state q of M_a, associate a minimal input word m_q leading to q.

• For each i ∈ {1, 2, ..., N_max}:

  – For each w ∈ Γ:

    * For each state q ∈ M_a:

      · present the reset network with m_q a^i, and then

      · present the network with w and check whether the net output equals λ*(q, w). If not, check whether there is a state p of M such that the network output equals λ*(p, w). If so, draw an arrow in the diagram from q to p.
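The steps above can be sketched as executable code. Since no trained network is available here, `net_out` is a hypothetical stub that emulates M except that its representation of state C degrades to B after 10 consecutive a's; the diagram-drawing logic itself follows the procedure above.

```python
# state degradation diagram construction (FSM M of Figure 15; the b outputs
# and the degraded network stub are assumptions for illustration)
STATES = "ABCD"

def move(q, x):
    return q if x == "a" else STATES[min(STATES.index(q) + 1, 3)]

def out(q, x):
    return str(STATES.index(q) + 1) if x == "a" else "1"

def fsm_out(word, q="A"):            # lambda*(q, word)
    o = ""
    for x in word:
        o += out(q, x)
        q = move(q, x)
    return o

def net_out(word):                   # hypothetical degraded network
    q, o, run_a = "A", "", 0
    for x in word:
        run_a = run_a + 1 if x == "a" else 0
        if q == "C" and run_a > 10:  # unstable representation of state C
            q = "B"
        o += out(q, x)
        q = move(q, x)
    return o

M_a = list(STATES)                               # states with an a-loop
m = {"A": "", "B": "b", "C": "bb", "D": "bbb"}   # minimal words m_q
Gamma = ["a", "b", "ab", "ba"]                   # short distinguishing words

edges = set()
for i in range(1, 31):               # N_max = 30 for the demo
    for w in Gamma:
        for q in M_a:
            prefix = m[q] + "a" * i
            got = net_out(prefix + w)[len(prefix):]
            if got != fsm_out(w, q):
                for p in STATES:     # does the net act like another state?
                    if got == fsm_out(w, p):
                        edges.add((q, p))
```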

The state degradation diagram for input a is presented in Figure 22. Note that when only short input strings are presented to the network and the quantization of the network state space individually captures the regions A, B, C, D, a correct state transition diagram can be obtained, even though on longer input strings the net generalizes poorly.

FIGURE 22. State degradation diagram for input a. N_max = 100.

When a network with three state neurons was trained with the FSM M, it generalized correctly over the training set by forming four attractive fixed points of τ_a corresponding to the a-loops in states A, B, C, D of M. The training process, viewed from the point of view of the asymptotic behavior of τ_a, is illustrated in Figure 23. The horizontal axis corresponds to time (in


epochs); the network state space (0,1)³ is orthogonally projected onto the 2-dimensional space of activations of a pair of state neurons. Bifurcations leading to the formation of new attractive fixed points appeared during the 53rd, 115th, and 121st epochs. If the network is able to exactly mimic the FSM M, the state degradation diagram for each input symbol has no arrows.

FIGURE 23. Evolution of position of attractive sets of τ_a during RNN training on FSM M (three state neurons).

As another example, consider the FSM M in Figure 24. It is an FSM taken from the database of the International Symposium on Circuits and Systems (Portland, Oregon, 1989) [BBK89]. In each of its seven states there is an a-loop, with output 0 except for the a-loops in states 4 and 7. The training set consists of 3500 training strings¹⁶ of input length 3-35 and is ordered according to string length, starting with the shortest ones. The machine M is hard to learn because the training set is very sparse in output symbols other than 0. The training process is disrupted by a tendency

¹⁶Each training pair consists of an input word w and the corresponding output word λ*(q₀, w).


dddeadfdaeaafaaadddaddadfeeedeaeee# -> 0000000000000000000000000000000002z
affedfeefaedeededfdefddaafeeeeeadd# -> 0000000000000000000000000000022200x
dffdadedfadaddffeeafeafdffdffefaad# -> 0000000000000000000000000000000000x
fdaadaafddafafdadfdffdeaffaa«feade# -> 0000000000000000000000000000000000z
ddfaddadfaaddddeafdafdfaeedaedaeda# -> 0000000000000000000000000000000000x
defadedefdeffdefdafdaaadeaeddaaefd# -> 0000000000000000000000000000000000x
ddfedaaffdedeaeadeefdfefaadadeaaff# -> 0000000000000000000000000000000000x
aafaaeefafeaffeeefeafaefeeadaefafa# -> 0000000000000000000000000000000000x
dddeeafffafeaadaddfdffadfeafdddefd# -> 0000000001100000000000000000000000x
fdaaddaadadffefaeadddfeddeafdddaea# -> 0000000000000000000000000000000000x
dedaddadaafeaaddaafaaefaefdeeffafe# -> 0000000000000000000000000000000000x
ddaeeafddfaaffffaeeefeadaefdfedfee# -> 0000000000000011100000000000000000x
dddedeeafdfddfaeeaddafdfafadedfaaf# -> 0000000000000000000000000000000000x

TABLE 1. A part of the training set characterizing the FSM M. Output strings are sparse in output symbols other than 0.

to find a trivial solution, represented by an automaton with a single state and a loop for every input symbol with output 0. An example of a part of the training set is given in Table 1.

After 53 training epochs, the RNN with 6 state neurons is able to perform well on short test strings (the training error was 0.06). Generalization on long test strings was found to be poor. Part of the problem was the unstable network representation of the a-loops in M. The state degradation diagram for input a can be seen in Figure 25. The a-loops in states 4, 6, and 7 are "well represented" by fixed points S₄, S₆, and S₇, respectively, in that when starting in a small neighborhood of S_q, q = 4, 6, 7, the resulting output sequences of the RNN for input words a^i w, w ∈ Γ, i > 0, equal λ*(q, a^i w). This is not true of the a-loops in states 1, 2, 3, and 5. When the net is reset and presented with m_q a^i w, q = 1, 2, 3, 5, for i > N_q it does not emulate λ*(q, a^i w), w ∈ Γ. States 5 and 3 degrade to states 1 and 2, respectively; in particular, N₅ = 8 and N₃ = 5. Both states 1 and 2 degrade to the attractive fixed point S₀, with N₁ = 27 and N₂ = 40. The network state S₀ does not represent any state of M, even for short input strings. S_j, j = 0, 4, 6, 7, are the only attractive sets of τ_a that were detected. There are trajectories of τ_a, starting near the border of the regions of attraction of S₀ and some other attractive fixed point of τ_a, that pass through a region assuming the role of state 5 of M for short input strings. Then, further towards S₀, they pass through the region of network states that for short input strings seem equivalent to state 1 of M, finally making their way to a close neighborhood of S₀ and converging to it. A similar statement can be made about states 3 and 2 of M.


FIGURE 24. FSM M taken from the database of the International Symposium on Circuits and Systems (Portland, Oregon, 1989). M is the reduced form of a machine defined in the file bbara.kiss2. Inputs 01, 10, and 00 are represented as the input symbol a, since in every state they initiate the same transition with the same output. Inputs 0011, -111, and 1011 are represented as input symbols d, e, and f, respectively. Outputs 00, 01, and 10 are coded as output symbols 0, 1, and 2, respectively.


FIGURE 25. State degradation diagram for input a, extended with the network state S₀ not representing any state of M. S₀ = (0.89, 0.01, 0.55, 0.95, 0.99, 0.92), S₄ = (0.16, 0.98, 0.02, 0.87, 0.04, 0.92), S₆ = (0.98, 0.03, 0.97, 0.09, 0.99, 0.87), S₇ = (0.94, 0.98, 0.95, 0.01, 0.05, 0.15). N_max = 100.

9 Discussion

Two views on the relationship between an RNN and an FSM M, such that the RNN exactly mimics M, have been presented. First, the network was treated as a state machine. The notion of regions of equivalent network states that are also equivalent to a state of M links the first approach with the second, dynamical systems approach to the RNN.

Our experiments suggest that in the most usual stable RNN N, representations of loops and cycles in M can be described as follows: an x-loop in a state q of M induces an attractive fixed point of τ_x inside (q)_N, and an x-cycle {q₁, ..., q_m} of M induces an attractive periodic orbit of period m of τ_x, periodically visiting (q₁)_N, ..., (q_m)_N.
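The cycle case can be illustrated with a minimal one-neuron toy example (an illustration chosen for this sketch, not one of the trained networks): with a strong negative self-weight, the map u → g(−10u + 5) has an attractive period-2 orbit, the dynamical analogue of an x-cycle of length 2.

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

# one state neuron with a strong negative self-weight: u -> g(-10u + 5);
# the unique fixed point u* = 0.5 is unstable (|g'(0) * (-10)| = 2.5 > 1),
# so the orbit settles on an attractive 2-cycle
u, orbit = 0.9, []
for _ in range(200):
    u = g(-10.0 * u + 5.0)
    orbit.append(u)
# on the 2-cycle, u_{n+2} ~ u_n while consecutive iterates differ a lot
```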

The present paper provides us with the opportunity to look at the learning process from the point of view of bifurcation analysis. If the network is supposed to operate as an FSM, its state space must have multiple attractor basins to store distinct internal states. The network solves the task of FSM simulation by locating point and periodic attractors and shaping their respective basins of attraction [Cum93]. Before training, the connection weights are set to small random values, and as a consequence,


the network has only one attractor basin. This implies that the network must undergo several bifurcations [Doy92a]. This can have an undesirable effect on the training process, since gradient descent learning may get into trouble: at bifurcation points, the output of a network can change discontinuously with the change of parameters, and therefore convergence of gradient descent algorithms is not guaranteed [Doy92b].

In the following, a possible application of these ideas to the problem of determining the complexity of language recognition by neural networks will be discussed briefly.

Any FSM with binary output alphabet {0,1} can function as a recognizer of a regular language: a word over the input alphabet belongs to the language if and only if the output symbol after presentation of the word's last symbol is 1. Hence, the network output is used to decide whether a word belongs to the language or not. One of the most promising neural acceptors of regular languages [Shi87] is the second-order RNN introduced by Giles et al. [GMC+92a]. However, the practical aspects of the acceptance issue are still unclear [SSG92]. The difficulty of acceptance of a given language by a neural network (the neural complexity of the language) can be quantified by the minimal number of neurons needed to recognize the language. In the context of Mealy machines and threshold networks, a similar problem was attacked by Alon et al. [AD091] and Horne and Hush [HH94]. An attempt to predict the minimal second-order RNN size needed for the network to learn to accept a given regular language is presented in Siegelmann et al. [SSG92]. The predicted numbers of neurons were shown to correlate well with the experimental findings.
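The acceptance criterion can be sketched directly. The two-state example machine below is hypothetical (it accepts words over {a, b} that end in b); only the accept-iff-last-output-is-1 rule comes from the text.

```python
def make_acceptor(move, out, q0):
    # a word is accepted iff the output emitted on its last symbol is 1
    def accepts(word):
        q, o = q0, "0"
        for x in word:
            o = out(q, x)
            q = move(q, x)
        return o == "1"
    return accepts

# hypothetical two-state Mealy machine: accepts words over {a, b} ending in b
move = lambda q, x: x            # the state remembers the last symbol read
out = lambda q, x: "1" if x == "b" else "0"
accepts = make_acceptor(move, out, "a")
```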

Essentially, a good starting point for estimating the neural complexity of a given regular language is the representation of the language by its reduced recognizer. The most usual, very rough, approach to neural complexity estimation takes into account only the number of states of such a recognizer [SSG92]. What plays a principal role in making the internal structure of a regular language rich is

• the number of input symbols of the recognizer,

• the number of loops associated with each input symbol,

• the number and corresponding lengths of cycles associated with each input symbol,

• the relationship among loops and/or cycles (e.g., an x₁-cycle passing through a state q in which there is an x₂-loop, etc.).
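These structural features can be read off mechanically from a recognizer's transition table: for each input symbol x, the map q → δ(q, x) is a function on the states, and the cycles of its functional graph (loops being cycles of length 1) are exactly the attractor structures the RNN must develop for τ_x. A sketch, using the FSM M of Figure 15 as data:

```python
def cycles_per_symbol(states, symbols, move):
    # for each symbol x, find all cycles of the map q -> move(q, x)
    out = {}
    for x in symbols:
        step = {q: move(q, x) for q in states}
        seen, cycles = set(), []
        for q in states:
            path = []
            while q not in seen:        # walk until we hit visited ground
                seen.add(q)
                path.append(q)
                q = step[q]
            if q in path:               # the walk closed on itself: a cycle
                cycles.append(path[path.index(q):])
        out[x] = cycles
    return out

# FSM M of Figure 15: an a-loop in every state, b moving toward trap state D
S = "ABCD"
move = lambda q, x: q if x == "a" else S[min(S.index(q) + 1, 3)]
feats = cycles_per_symbol(S, "ab", move)
```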

In every recognizer of a regular language, for each input symbol there exists at least one loop or cycle. During the training process, the weights of the network are modified so that the corresponding attractive sets evolve in the dynamical systems defined by the iterative maps τ_x. A hint for a lower


FIGURE 26. Acceptor of the language L = L₁ ∪ L₂, L₁ = {a,b}^n b, n ∈ {0,2,4,5,6,...}, L₂ = {a,b}^m a, m ∈ {1,3}.

bound on the minimal number of neurons can be obtained by exploring the possibilities for the existence of the attractive fixed points and/or periodic orbits that are to be induced during the training process. The expected relationship among their basins of attraction has to be taken into account at the same time [Cas93].

As an example, consider the FSMs M₁ and M₂ in Figures 26 and 27, respectively. Apparently, the internal structure of the regular language accepted by M₂ is "more complex" than that accepted by M₁. In the latter case, only one attractive fixed point of τ_a is sufficient to represent the a-loop in the state E. The same applies to the b-loop in E and the map τ_b. In the former case, an attractive periodic orbit of period four of the map τ_a and four attractive fixed points of the map τ_b have to be induced. Even though the FSM M₂ has only four states, the RNN needed four state neurons to accomplish successful learning. On the other hand, two state neurons were sufficient for the RNN to learn the FSM M₁.

A mechanism underlying the generalization loss on longer input strings, due to unstable representation of the loops in an FSM to be learned, was investigated. It was shown that even in such cases a correct state transition diagram of the FSM can potentially be extracted, even though the network performs badly on longer input strings (as reported by Giles et al. [GMC+92a]). The state degradation diagram for an input symbol x illustrates how regions of the network state space, initially acting as if they assumed the roles of states of the FSM in which there is an x-loop, gradually degrade upon repeated presentation of x. The degradation may lead to a network state not representing any state of the FSM, even for short input strings.

Zeng et al. [ZGS93] and Das and Mozer [DM94] view the RNN state space quantization as an integral part of the learning process in which the network is trained to mimic a finite state machine. In particular, in [ZGS93] the


FIGURE 27. Acceptor of the language L = L₃, where L₃ = b*ab* ∪ (b*a)^2 b^+ ∪ (b*a)^4.


activation pattern of the state units is mapped at each time step to the nearest corner of a hypercube, as if the state neurons had a hard threshold activation function. Das and Mozer [DM94] used a "soft" version of a Gaussian mixture model¹⁷ in a supervised mode as a clustering tool. The mixture model parameters were adjusted so as to minimize the overall performance error of the whole system (recurrent network + clustering tool). Both Zeng et al. and Das and Mozer report better asymptotic behavior on long, unseen test input strings. It would be interesting to investigate such approaches to training RNNs on finite state problems as a form of "dynamical self-reinforcement" learning, encouraging bifurcations to attractive fixed points and periodic orbits of the underlying dynamical systems.
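The two quantization schemes can be sketched side by side. The Gaussian-mixture version assumes equal priors and a fixed spherical variance, which is a simplification of Das and Mozer's model.

```python
import math

def snap(state):
    # Zeng et al.: map the state-unit activation pattern to the nearest
    # corner of the hypercube (a hard threshold on every state neuron)
    return tuple(1.0 if s >= 0.5 else 0.0 for s in state)

def soft_assign(state, centers, var=0.05):
    # in the spirit of Das and Mozer: a convex combination of cluster
    # centers, each weighted by its posterior probability given the current
    # network state (equal priors, fixed spherical variance assumed here)
    w = [math.exp(-sum((s - c) ** 2 for s, c in zip(state, m)) / (2.0 * var))
         for m in centers]
    z = sum(w)
    return tuple(sum(wi * m[k] for wi, m in zip(w, centers)) / z
                 for k in range(len(state)))
```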

Acknowledgments Thanks to Maria Markosova, Pavol Brunovsky, and Phil Holmes for useful discussions on dynamical systems. The work of Mike Casey and Randall Beer contributed greatly to the preparation of this chapter.

10 REFERENCES

[AD091] N. Alon, A.K. Dewdney, and T.J. Ott. Efficient simulation of finite automata by neural nets. Journal of the Association for Computing Machinery, 38(2):495-514, 1991.

[BBK89] F. Brglez, D. Bryan, and K. Kozminski. Combinational profiles of sequential benchmark circuits. In Proceedings of the International Symposium on Circuits and Systems, Portland, OR, May 1989.

[Bee94] R.D. Beer. On the dynamics of small continuous-time recurrent networks. Technical Report CES-94-18, Case Western Reserve University, Cleveland, OH, 1994.

[BW92] E.K. Blum and X. Wang. Stability of fixed points and periodic orbits and bifurcations in analog neural networks. Neural Networks, (5):577-587, 1992.

[Cas93] M.P. Casey. Computation dynamics in discrete-time recurrent neural networks. In Proceedings of the Annual Research Symposium, volume 3, pages 78-95, UCSD, La Jolla, CA, 1993. Institute for Neural Computation.

¹⁷Instead of the center with the greatest posterior probability given a pattern of state unit activations, a linear combination of centers is used, where each center is weighted by its posterior probability given the current network state.


[Cas95a] M.P. Casey. Computation in Discrete-Time Dynamical Systems. PhD thesis, University of California, San Diego, Department of Mathematics, March 1995.

[Cas95b] M.P. Casey. Relaxing the symmetric weight condition for convergent dynamics in discrete-time recurrent networks. Technical Report INC-9504, Institute for Neural Computation, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0112, 1995.

[CSSM89] A. Cleeremans, D. Servan-Schreiber, and J.L. McClelland. Finite state automata and simple recurrent networks. Neural Computation, 1(3):372-381, 1989.

[Cum93] F. Cummins. Representation of temporal patterns in recurrent neural networks. In Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society, pages 377-382, 1993.

[Dev86] R.L. Devaney. An Introduction to Chaotic Dynamical Systems. Benjamin/Cummings Publishing Company, Inc., Menlo Park, CA, 1986.

[DGS92] S. Das, C.L. Giles, and G.Z. Sun. Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society. Indiana University, 1992.

[DM94] S. Das and M.C. Mozer. A unified gradient-descent/clustering architecture for finite state machine induction. In J.D. Cowen, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 19-26. Morgan Kaufmann, San Mateo, CA, 1994.

[Doy92a] K. Doya. Bifurcations in the learning of recurrent neural networks. In Proceedings of the 1992 IEEE International Symposium on Circuits and Systems, pages 2777-2780, 1992.

[Doy92b] K. Doya. Bifurcations in the learning of recurrent neural networks. In Proceedings of the 1992 IEEE International Symposium on Circuits and Systems, pages 2777-2780, 1992.

[Elm90] J.L. Elman. Finding structure in time. Cognitive Science, 14:179-211, 1990.

[GF89] M. Garzon and S. Franklin. Global dynamics in neural networks. Complex Systems, (3):29-36, 1989.


[GH82] J. Guckenheimer and P. Holmes. Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Springer-Verlag, Berlin, 1982.

[GMC+92a] C.L. Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun, and Y.C. Lee. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3):393-405, 1992.

[GMC+92b] C.L. Giles, C.B. Miller, D. Chen, G.Z. Sun, H.H. Chen, and Y.C. Lee. Extracting and learning an unknown grammar with recurrent neural networks. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 317-324. Morgan Kaufmann, San Mateo, CA, 1992.

[HH94] B.G. Home and D.R. Hush. Bounds on the complexity of recurrent neural network implementations of finite state machines. In J.D. Cowen, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 359-366. Morgan Kaufmann, San Mateo, CA, 1994. Also submitted to Neural Networks.

[Hir89] M.W. Hirsch. Convergent activation dynamics in continuous time networks. Neural Networks, 2(5):331-349, 1989.

[Hir94] M.W. Hirsch. Saturation at high gain in discrete time recurrent networks. Neural Networks, 7(3):449-453, 1994.

[Hop84] J.J. Hopfield. Neurons with a graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Science USA, 81:3088-3092, May 1984.

[HZ92] S. Hui and S.H. Zak. Dynamical analysis of the brain-state-in-a-box neural models. IEEE Transactions on Neural Networks, (l):86-94, 1992.

[JNG94] L. Jin, P.N. Nikiforuk, and M.M. Gupta. Absolute stability conditions for discrete-time recurrent neural networks. IEEE Transactions on Neural Networks, (6):954-963, 1994.

[Jor86] M.I. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Conference of the Cognitive Science Society, pages 531-546, Hillsdale, NJ, 1986. Erlbaum.


[Li92] L.K. Li. Fixed point analysis for discrete-time recurrent neural networks. In Proceedings of IJCNN, volume 4, pages 134-139, Baltimore, 1992.

[MF94] P. Manolios and R. Fanelli. First order recurrent neural networks and deterministic finite state automata. Neural Computation, 6(6):1155-1173, 1994.

[Min61] R.C. Minnick. Linear-input logic. IRE Transactions on Electronic Computers, EC-13:6-16, 1961.

[THG95] P. Tino, B.G. Horne, and C.L. Giles. Fixed points in two-neuron discrete time recurrent networks: Stability and bifurcation considerations. Technical Report UMIACS-TR-95-51, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, 1995.

[TJV94] P. Tino, I.E. Jelly, and V. Vojtek. Non-standard topologies of the neuron field in self-organizing feature maps. In Proceedings of the AIICSR'94 Conference, Slovakia, pages 391-396. World Scientific Publishing Company, 1994.

[Shi87] M.W. Shields. An Introduction to Automata Theory. Blackwell Scientific Publications, London, UK, 1987.

[SSG92] H.T. Siegelmann, E.D. Sontag, and C.L. Giles. The complexity of language recognition by neural networks. In J. van Leeuwen, editor, Algorithms, Software, Architecture (Proceedings of IFIP 12th World Computer Congress), pages 329-335, Amsterdam, 1992. North-Holland.

[Vid93] M. Vidyasagar. Location and stability of the high-gain equilibria of nonlinear neural networks. IEEE Transactions on Neural Networks, 4(4):660-672, July 1993.

[Wan91] X. Wang. Period-doublings to chaos in a simple neural network: An analytical proof. Complex Systems, 5:425-441, 1991.

[WB90] X. Wang and E.K. Blum. Discrete-time versus continuous-time models of neural networks. Journal of Computer and System Sciences, 45:1-19, 1990.

[WK92a] R.L. Watrous and G.M. Kuhn. Induction of finite-state automata using second-order recurrent networks. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 309-316. Morgan Kaufmann, San Mateo, CA, 1992.


6. Finite State Machines and Recurrent Neural Networks 219

[WK92b] R.L. Watrous and G.M. Kuhn. Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4(3):406-414, 1992.

[ZGS93] Z. Zeng, R.M. Goodman, and P. Smyth. Learning finite state machines with self-clustering recurrent networks. Neural Computation, 5(6):976-990, 1993.


Chapter 7

Biased Random-Walk Learning: A Neurobiological Correlate to Trial-and-Error

Russell W. Anderson

ABSTRACT Neural network models offer a theoretical testbed for the study of learning at the network level. The only experimentally verified learning rule, Hebb's rule, is extremely limited in its ability to train networks to perform complex tasks. An identified cellular mechanism responsible for Hebbian-type long-term potentiation, the NMDA receptor, is highly versatile. Its operation is modulated by a wide variety of conditions and may be involved in several non-Hebbian processes. We have shown that another neural network learning rule, the chemotaxis algorithm, is theoretically much more powerful than Hebb's rule and is consistent with neurobiological observations. A biased random walk in synaptic weight space is a learning rule immanent in nervous activity and may account for some types of learning — notably the acquisition of skilled movement.

1 Introduction

In their landmark paper, "A Logical Calculus of the Ideas Immanent in Nervous Activity", McCulloch and Pitts [1943] demonstrated how a network of extremely simplified ("all-or-nothing") neurons could compute any Boolean function. Mathematical analyses of recurrent neural network models have shown them to be universal computing devices [Siegelmann and Sontag 1991, 1994, 1995].

Neural network modeling has not only been helpful in understanding the collective behavior of existing networks, it also provides a theoretical framework with which one can experiment with models of learning. Rosenblatt [1958] demonstrated that these networks, when endowed with modifiable connections ("perceptrons"), could be "trained" to classify patterns (see also Arbib [1987]). Thus Rosenblatt had developed a theoretical testbed for the study of learning.

Theoretical neural network studies (mathematical analyses and empirical computer simulations) are useful for exploring the capabilities and limitations of a proposed learning rule. The only experimentally verified learning rule, Hebb's rule, has profound limitations in this respect. Engineering optimization algorithms (such as backpropagation or genetic algorithms) are capable of training neural networks to perform much more sophisticated tasks but are biologically implausible [Churchland and Sejnowski 1989; Hecht-Nielsen 1989; Stork 1989; Crick 1989a,b; Mel 1990; Anderson 1991].

Long underestimated by both the experimental and theoretical neural network communities is perhaps the most intuitive mode of learning—trial-and-error. We have shown [Bremermann and Anderson 1989, 1991] that the mathematical analog to trial-and-error, a Gaussian biased random walk in synaptic weight space, is capable of training neural networks to perform the same complex, nonlinear mappings as backpropagation.

In this paper, the biological evidence for and theoretical limitations of Hebbian learning are reviewed. Next, theoretical and empirical studies of random-walk learning rules are presented. I argue the biological plausibility of trial-and-error learning rules through a discussion of existing neurobiological data and identified molecular mechanisms. Finally, new directions of experimental research are suggested.

2 Hebb's Rule

In 1949, Hebb proposed a neuronal learning rule that could integrate associative memories into neural networks [Hebb 1949]. Hebb postulated that when two neurons in synaptic contact fire coincidentally, the synaptic knobs are strengthened. Hebb's hypothesis is appealing as a cellular mechanism for associative learning. Hebb's rule is also appealing from a genetic point of view, since it requires very little genetic "overhead" to implement in actual nervous systems. All that is required is a mechanism for distinguishing simultaneous stimuli at the cellular level.

Verification has taken time, but there is now evidence that Hebbian-type long-term potentiation (LTP) (with some modifications of the original hypothesis) does indeed occur [Lynch 1986; Kennedy 1988; Stevens 1989; Bliss and Collingridge 1993]. Long-term depression (LTD) has been observed in the same system supporting an ancillary "Hebbian covariance learning rule" [Stanton and Sejnowski 1989].

2.1 Experimental Evidence: The NMDA Receptor

Long-term potentiation is mediated by the N-methyl-D-aspartate (NMDA) receptor. It is useful to review the mechanisms in the current model of LTP for two reasons. First, it illustrates how the proposed (Hebbian) learning rule has influenced experimental efforts. Second, the actual mechanisms discovered are subtly different from the Hebbian ideal of strengthening correlated inputs.

[Figure 1 depicts a coincident (depolarizing) axon terminal and an active axon terminal, NMDA and non-NMDA receptors, candidate retrograde messengers (nitric oxide? arachidonic acid?) and their synthase, and downstream kinases: Ca²⁺/calmodulin kinase, protein kinase C, tyrosine kinase, and MAP kinase.]

FIGURE 1. NMDA implementation of Hebbian learning. Simultaneous membrane depolarization and activation of the NMDA receptor allows calcium ions to flow into the cell. Calcium-dependent proteins trigger a cascade of intracellular events leading to structural and/or chemical changes postsynaptically, as well as potential presynaptic changes via retrograde messengers. (Adapted from [Montague et al. 1991; Kandel and O'Dell 1992].)

According to the current model of LTP [Zalutsky and Nicoll 1990; Buonomano and Byrne 1990; Kandel and O'Dell 1992], for the NMDA receptor channel to open, two conditions must be met simultaneously: (i) the receptor must bind glutamate, and (ii) the postsynaptic cell must be depolarized through activation of non-NMDA receptors. At resting potential, the NMDA receptor channel is blocked by Mg²⁺. Depolarization removes the voltage-dependent Mg²⁺ block, allowing Ca²⁺ to flow into the cell. Ca²⁺ appears to trigger LTP through the activation of several different protein kinases (see Figure 1).

There is also evidence for chemical and/or structural presynaptic changes [Zalutsky and Nicoll 1990; Edwards 1991]. Presynaptic modification is thought to be effected via retrograde messengers released across the synaptosomal junction. The retrograde messenger is presumed to be a labile, diffusible substance synthesized and released by the postsynaptic cell. The synthesis and/or release of such messengers is also thought to be a calcium-dependent process. Several substances have been postulated to function as retrograde messengers. Among them are nitric oxide [Gally et al. 1990], hydrogen peroxide [Colton et al. 1989; Zoccarato et al. 1989], and arachidonic acid [Williams et al. 1989]. (For a review, see [Montague et al. 1991].)

Many substances have been shown to have modulatory effects on LTP. A partial list of proteins, hormones, neurotransmitters, and other compounds includes glycine and D-serine [Salt 1989], serotonin [Ropert and Guy 1991], acetylcholine and noradrenaline [Bear and Singer 1986; Brocher et al. 1992], human epidermal growth factor [Abe and Saito 1992], antidepressant drugs [Birnstiel and Haas 1991], milacemide [Quartermain et al. 1991], opioids [Xie and Lewis 1991], and ethanol [Iorio et al. 1992]. Thus, it is not surprising that mental states and other factors such as "attention", blood flow, and "excitement" can influence learning. That so many compounds can modulate LTP indicates that the NMDA receptor may be a much more universal tool for synaptic modification than previously thought, and not solely employed in local, Hebbian-type learning.

Finally, NMDA clearly mediates some, but not all, forms of learning. For instance, Malenfant et al. [1991] showed that application of an NMDA receptor antagonist (MK801) could block the acquisition of a spatial maze task in a dose-dependent manner. However, MK801 did not block the acquisition of experience-based maternal behavior. The same maternal experience effects can be blocked by chemical inhibition of protein synthesis.

In summary, the NMDA receptor requires coincident events and makes possible a type of associative learning. Its discovery required intricate experiments at synaptic junctions. It is currently unclear whether synaptic change occurs at the postsynaptic dendritic spine, the presynaptic glutamate axon terminal, the presynaptic depolarizing axon, the axonal processes themselves, or a combination of all of these structures. Several chemical compounds have been identified that can facilitate or inhibit LTP. Many compounds that modulate LTP are common physiological chemical compounds, proteins, or neurotransmitters and do not necessarily originate from either the pre- or postsynaptic neuron(s). Thus, it is conceivable that several forms of learning are operating in neural tissues, and these other forms of learning can be mediated via the NMDA receptor as well as by other, independent, neural processes.

2.2 Limitations of Hebbian Learning

Theoretically, Hebbian learning can account for some types of biological learning. Hebbian mechanisms have been shown to be sufficient to account for topographic mappings [Kohonen 1984; Grajski and Merzenich 1990], plasticity in cortical representation [Merzenich et al. 1987; Montague et al. 1991] and, when applied to "sigma-pi" neurons, some nonlinear pattern recognition tasks [Mel 1992]. But there is more to the brain than conditioned reflexes and associative memories. For anything but special cases, Hebb's rule is insufficient as a learning rule [Rosenblatt 1962; Rumelhart et al. 1986].

Since Hebbian learning requires near-simultaneous or synchronous stimuli, it is limited temporally. For many tasks, instantaneous performance results are not available. Motor control problems, for example, are inherently sequential. Temporal delays are also involved in many phenomena observed in psychophysical and electrophysiological studies of classical conditioning, such as anticipation of an unconditioned stimulus [Chester 1990; Deno 1992]. Hebbian learning would have to be combined with additional memory mechanisms or neuronal structures to account for such phenomena. Recent attempts to expand Hebbian learning rules to include short-term memory [Sutton and Barto 1981; Klopf 1989; Grossberg and Schmajuk 1989] have met with limited success [Chester 1990].¹

Since the Hebbian rule applies only to correlations at the synaptic level, it is also limited locally. Strengthening a local correlation in the context of a nonlinear mapping of several variables (such as the N-bit parity problem) often reduces overall performance. Consequently, Hebbian learning is unable to train a multilayer perceptron network to learn arbitrary, nonlinear decision boundaries [Rumelhart et al. 1986].

3 Theoretical Learning Rules

Current artificial neural network (ANN) research has provided valuable insights into the collective behavior of small networks of neurons [Hopfield 1984; Lehky and Sejnowski 1988, 1990; Lockery et al. 1989]. However, most of these results were obtained using more sophisticated algorithms than Hebb's rule. Learning rules employed to train ANNs are more appropriately referred to as optimization procedures. These algorithms, most of which are based on minimization of a defined error function, are capable of overcoming the limitations of Hebb's rule. Among the most popular today are genetic algorithms [Montana and Davis 1989; Fogel et al. 1990; Austin 1990] and gradient-descent learning [Rumelhart et al. 1986]. (For an overview of "connectionist" learning rules, see Hinton [1989].) Most of these algorithms have little biological basis and are used primarily for engineering problems in pattern recognition, classification, signal reconstruction, and so on. Do any of the multitude of ANN learning rules have any implications for experimental neurobiology?

¹To account for more complex phenomena, such as skilled movement, many have postulated that the brain utilizes "model-reference control"; that is, the brain develops an internal model of the musculature and environment to predict performance of a control signal. A Hebbian mechanism can then be used to control such a system, since presumably the temporal delay has been removed from correlated events. Such a system may in fact be used, especially for rapid, open-loop eye and hand movements [Crossman and Goodeve 1983; Anderson and Vemuri 1992]. But the "model" must still be updated by a global supervisory signal, which takes its cues from the external environment.

Criticisms of the biological plausibility of ANN training algorithms are abundant in the literature [Churchland and Sejnowski 1989; Hecht-Nielsen 1989; Crick 1989a; Mel 1990; Anderson 1991]. In his article "The recent excitement about neural networks," Francis Crick [1989a] writes: "It is hardly surprising that such achievements [referring to the successes using backpropagation] have produced a heady sense of euphoria. But is this what the brain actually does? Alas, the back-prop nets are unrealistic in almost every respect.... Obviously what is really required is a brain-like algorithm which produces results of the same general character as backpropagation" (emphasis added).

Bartlett Mel [1990] poses the problem this way: "[I]s it...a fundamental law that neural associative learning algorithms must be either representationally impoverished or mechanistically overcomplex?"

What are the necessary features of a biologically plausible learning rule? First, it must have a mechanism for synaptic modification that is consistent with experimental data. Second, a learning rule must not involve so much specific neural structure that an excessive number of genes are required for its coding. Lastly, to be of any use to biologists, it must be observable. Clearly, Hebb's rule satisfies these criteria, while backpropagation, to varying degrees, violates all three. The debate over the biological plausibility of backpropagation continues [Dayhoff et al. 1994; Gardner 1993]. However, as the title of this paper suggests, there is at least one other ANN learning rule that satisfies these criteria—a biased random walk [Bremermann and Anderson 1989, 1991].

3.1 Learning via Random Walks

In its most basic form, a random walk can be generated by spontaneous, random variation in synaptic strength. This way, the mechanism for synaptic change is local and independent of any higher-level teaching signals. Successful changes in architecture or synaptic strength are rewarded or punished after the fact. Such a biased random walk in synaptic weight space can be considered a cellular analogue of trial-and-error.

The first attempt to apply such an algorithm to artificial neural networks was by Lewey Gilstrap, Jr., Cook, and Armstrong at Adaptronics, Inc. (McLean, VA) around 1970. They called their method "guided, accelerated random search" (GARS): "[T]he accelerated random search begins by exploring the vicinity of its initial estimate. The random trials are governed by a normal distribution of probabilities which is centered on the initial point.... The accelerated random search follows an unsuccessful random step with a step of equal magnitude in the opposite direction. By this means, a successful step is usually achieved on the second trial if not on the first random trial.... A successful step is always followed by another step in the same direction.... Each successive step is given double the magnitude of the prior step" [Barron 1968].

Barron [1968, 1970] used GARS to optimize control parameters in flight control systems. Mucciardi [1972] applied GARS to neural net-like classification structures called "neuromime nets." Mucciardi's paper presented an analysis of neuromime nets and the algorithm but provided only simple examples of its application. Interest in neural networks was waning at that time, especially because of well-known limitations of simple perceptrons acknowledged by Rosenblatt [1962] and highlighted in Perceptrons [Minsky and Papert 1969]. Unfortunately, Mucciardi and his colleagues never applied their algorithm to the complex classification problems emphasized in Perceptrons—the exclusive OR and "connectedness" problems. Another aspect of random search, overlooked by the group at Adaptronics, was its potential relevance to biology.

In 1988, we began experimenting with a similar algorithm, which we dubbed the "chemotaxis algorithm" [Bremermann and Anderson 1989, 1991]. (See inset.) The name was chosen by analogy to the strategy employed by bacteria to find chemoattractants in a spatial concentration gradient [Bremermann 1974; Alt 1980; Koshland 1980; Berg 1983]. Subsequently, Jabri and Flower [1992] have advocated the name "weight perturbation" for essentially the same algorithm. We showed that a biased Gaussian random walk could, in fact, train neural networks to solve the same difficult Boolean mappings that had eluded single-layer perceptrons and Hebbian networks (exclusive OR, N-bit parity, etc.). Since then, random-walk learning has been subjected to several criticisms. Here, I discuss or refute the most common criticisms:

Criticism #1: Random walks are known to get trapped in local minima in conventional optimization problems.

In the case of neural networks, local minima are not as much of a problem as one might expect. What is a local minimum in a small network with a lower-dimensional weight space often becomes a multidimensional saddle point in higher dimensions [Baldi and Hornik 1989; Conrad and Ebeling 1992; Yu 1992]. This is because of the degeneracy inherent in neural network architectures: There are usually many more free parameters (weights) than are theoretically required to solve the task at hand.


The Chemotaxis Algorithm

The "chemotaxis training algorithm" is one possible implementation of a biased random walk in weight space. One advantage of this training method is that it does not require gradient calculations or detailed error signals. It also allows for automatic adjustment of the single learning parameter, which otherwise has to be found empirically. The network is initialized with an arbitrary set of weights w⁰, and performance E(w⁰) is evaluated. A random vector Δw is chosen from a multivariate Gaussian distribution with zero mean and unit standard deviation. This random vector is added to the current weights to create a "tentative" set of weights wᵗ:

wᵗ = w⁰ + μ Δw,

where μ is a stepsize parameter. Performance E(wᵗ) is then calculated for the tentative weights. If the error of the new configuration is lower than that of the original configuration, the tentative changes in the weight vector are retained; otherwise, the system reverts to its original configuration. If a successful direction in weight space is found, weight modifications continue along the same random vector until progress ceases. A new random vector is then chosen, and the process is repeated. More details are available in the cited literature.
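The procedure in the inset can be sketched in a few lines of Python. The network size (2-5-1, sigmoid units), the step size `mu`, the trial budget, and the stopping threshold below are illustrative choices for a small XOR demonstration, not settings from the original experiments:

```python
import numpy as np

def forward(w, X):
    """2-5-1 sigmoid network; w packs 21 parameters: 10 input-to-hidden
    weights, 5 hidden biases, 5 hidden-to-output weights, 1 output bias."""
    W1, b1 = w[:10].reshape(5, 2), w[10:15]
    W2, b2 = w[15:20], w[20]
    h = 1.0 / (1.0 + np.exp(-np.clip(X @ W1.T + b1, -30, 30)))
    return 1.0 / (1.0 + np.exp(-np.clip(h @ W2 + b2, -30, 30)))

def error(w, X, y):
    return float(np.sum((forward(w, X) - y) ** 2))  # summed squared error

def chemotaxis(X, y, n_params=21, mu=0.3, max_trials=30000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n_params)           # arbitrary initial weights w0
    E0 = E = error(w, X, y)
    dw = rng.normal(size=n_params)          # Gaussian random direction
    for _ in range(max_trials):
        Et = error(w + mu * dw, X, y)       # evaluate tentative weights wt
        if Et < E:
            w, E = w + mu * dw, Et          # success: keep stepping this way
        else:
            dw = rng.normal(size=n_params)  # progress ceased: new direction
        if E < 0.05:
            break
    return w, E, E0

# XOR: the mapping that eludes single-layer perceptrons and Hebb's rule.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])
w, E, E0 = chemotaxis(X, y)
print(E < E0)  # the walk finds improving steps; full convergence is stochastic
```

Note that the only feedback the rule uses is the scalar error before and after a perturbation; no gradient or per-synapse error signal is ever computed.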

Evolutionary optimization is also easier in high-dimensional, redundant systems [Conrad 1983]. A biased random walk can be considered a rudimentary genetic algorithm—one where the environment selects one of two possible mutant structures at each step. Conrad and Ebeling [1992] have shown that saddle points, not isolated peaks, dominate high-dimensional fitness landscapes: "Increasing the dimensionality of a system...increases the chances of finding an uphill [favorable] pathway to still higher peaks." Conrad refers to this phenomenon as "extradimensional bypass."

Criticism #2: Random walks are inefficient.

A biased random walk is also a form of gradient descent (random descent) and is quite efficient. In the case of a 3-dimensional spherical gradient (a condition that is ideal for gradient descent), the path taken to reach the optimum by the chemotaxis algorithm is, on average, only 39% longer than the optimal direct gradient path [Bremermann 1974]. Empirical studies show that the chemotaxis algorithm, while usually slower to converge, compares favorably in final network performance with backpropagation on a variety of benchmark tasks [Bremermann and Anderson 1989; Wilson 1991]. Furthermore, in cases where local minima do exist, there is no reason to expect that the chemotaxis algorithm is more prone to local minima than backpropagation [Anderson 1991; Baldi 1991]. An extensive analytical comparison of random descent and gradient descent learning is given by Baldi [1991].
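The efficiency claim can be probed with a toy Monte Carlo: run a chemotaxis-style descent on the spherical bowl f(x) = ||x||² and compare the path travelled with the straight-line distance. This sketch is only illustrative (the function name, step rule, and parameters are mine, and the measured overhead depends on the step rule, so it will not reproduce Bremermann's 39% figure exactly); it shows that random descent reaches the optimum along a path of the same order as the direct one:

```python
import numpy as np

def chemotaxis_path(start, step=0.02, tol=0.1, seed=0):
    """Chemotaxis-style descent on f(x) = ||x||^2: draw a random direction;
    while a fixed-size step along it lowers ||x||, keep stepping; otherwise
    draw a new direction. Returns total path length travelled before
    entering the goal ball ||x|| <= tol."""
    rng = np.random.default_rng(seed)
    x = np.asarray(start, dtype=float)
    length = 0.0
    for _ in range(100000):              # safety cap on direction draws
        if np.linalg.norm(x) <= tol:
            return length
        d = rng.normal(size=x.size)
        d *= step / np.linalg.norm(d)    # fixed-length random step
        while np.linalg.norm(x + d) < np.linalg.norm(x):
            x = x + d                    # persist while the step improves f
            length += step
    raise RuntimeError("did not converge")

direct = 1.0 - 0.1                       # straight-line distance to the goal sphere
paths = [chemotaxis_path([1.0, 0.0, 0.0], seed=s) for s in range(20)]
print(np.mean(paths) / direct > 1.0)     # random descent pays only a modest detour
```

The ratio is bounded below by 1 (no path can beat the straight line) and stays small because, in three dimensions, roughly half of all random directions are improving.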

Criticism #3: Random walks cannot train neural networks to solve complex, nonlinear mappings such as the exclusive OR.

This belief, reinforced by the perceived problem of local minima, is simply untrue [Bremermann and Anderson 1989] (Table 1). In addition to the benchmark problems, the chemotaxis algorithm has been applied successfully to training neural networks to solve a variety of problems: discrimination of seismic signals [Dowla et al. 1990; Anderson 1991], training "recurrent" neural networks [Anderson 1991], process control [Willis et al. 1991a,b], and motor control [Anderson and Vemuri 1992; Styer and Vemuri 1992a,b, 1995]. Experiments with other stochastic training algorithms have had similar successes [Harth and Tzanakou 1974; Tzanakou et al. 1979; Harth et al. 1988; Smalz and Conrad 1991; Jabri and Flower 1992].

Criticism #4: "Reinforcement" learning models are not biologically plausible.

Reinforcement signals are generally thought to carry only general information about the overall performance ("good," "better," "target was missed by x amount," etc.). Specific information to individual synapses as to their relative responsibility in the task would be very difficult to determine. Biological mechanisms for assigning responsibility to each individual synapse are highly unlikely [Crick 1989a].

Most proposed reinforcement learning rules are also "mechanistically overcomplex" [Bremermann and Anderson 1989, 1991]. In Barto and Sutton's reinforcement learning schemes, for example, synaptic change is generated by the reinforcement signal itself, as interpreted by an adaptive critic element [Barto et al. 1981; Barto and Sutton 1983]. Although this work has generated many interesting and nontrivial applications, the complexity of its synaptic adjustment rules makes it an unlikely candidate for a biological learning rule. Other reinforcement algorithms have similar drawbacks [Williams 1992]. Surprisingly, in a comparison between adaptive critic and chemotaxis in controlling a cart-pole system, chemotaxis performed as well as or better than the more complicated (and less biological) adaptive critic networks [Styer and Vemuri 1992a,b, 1995].

Dimension (N)               2 (XOR)     3      4      5      6      7
Chemotaxis (epochs)            113     251    962   1259   4169   5789
Backpropagation (epochs)        25      33     75    130    310    800

TABLE 1. Training time for the N-bit parity problem. N-bit parity can be considered a generalization of the 2-bit "exclusive OR" (XOR) problem, since class membership of a given pattern is dependent on all N inputs. Network architecture was N-(2N+1)-1 (N inputs, 2N+1 hidden units, one output). The networks were trained on all 2^N possible binary input patterns. Training was continued until the network responses were within 10% of the ideal Boolean values. Chemotaxis averages are taken from Bremermann and Anderson [1989]. No attempt was made to optimize algorithm parameters. Backpropagation averages are taken from Tesauro and Janssens [1988], who used optimal values for the learning and momentum parameters. Note that the computational time is double these values in the case of backpropagation.
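The N-bit parity task of Table 1 is easy to state in code. This hypothetical helper (the function name is mine) enumerates all 2^N binary patterns and labels each by the parity of its bit-sum; N = 2 reduces to XOR:

```python
import itertools

def parity_dataset(n):
    """All 2^n binary input patterns with their parity labels:
    label 1 if the pattern contains an odd number of 1s, else 0."""
    X = [list(bits) for bits in itertools.product([0, 1], repeat=n)]
    y = [sum(bits) % 2 for bits in X]
    return X, y

X2, y2 = parity_dataset(2)
print(y2)  # [0, 1, 1, 0]: the XOR truth table
```

Because flipping any single input bit flips the label, no input taken alone carries any information about the class, which is what makes parity a stringent test for purely local learning rules.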

Criticism #5: Random-walk learning is not experimentally observable.

The final, and most important, obstacle to finding biological evidence for reinforcement learning has been, and continues to be, experimental observability. This is because random walks are a nonlocal phenomenon. Experimental protocols involving single neurons, synapses, or even a small collection of interacting neurons cannot directly verify a nonlocal learning rule. Local measurements of a global phenomenon can verify only two of the necessary elements: local synaptic variation and neuromodulation (facilitation or inhibition of synaptic change). The remainder of this article addresses this issue.


4 Biological Evidence

Reinforcement learning requires three components: (i) a mechanism for the generation of synaptic change, (ii) a structure for evaluating performance, or "trainer," and (iii) a reinforcement signal. To build a case for biological plausibility, one must show that all of the necessary elements are consistent with biological observations.

Two components required for random-walk learning are clearly consistent with biological observations: random synaptic variation and neural structures for evaluating performance. Indeed, it is generally believed that local random explorations account for some types of neural development [Montague et al. 1991]. In developmental models, however, the reinforcement signal is provided by the target cell. The random walk ends when a process finds its target. This type of locally reinforced random walk has the same limitations as Hebbian learning. The difference with what is being proposed here is that the reinforcement signals are not generated locally, through retrograde messengers or cell-adhesion molecules. Instead, reinforcement is generated and broadcast from "supervisory" neural structures (Figure 2).

4.1 Random Structural Variation

Cellular events are dominated by stochastic processes. It has been shown that structural variation can be guided or influenced by chemical or neural signals. What remains to be found is whether this modulation is a local phenomenon or one mediated by higher centers. Here, I cite just two examples of experimental systems that are consistent with this view.

Growth of neurites in cerebellar granule cell cultures progresses stochastically [Rashid and Cambray-Deakin 1992]. Stimulation with NMDA results in a marked increase in growth rate, while the addition of an NMDA receptor antagonist, aminophosphonovalerate (APV), causes a marked retraction of preexisting processes. Either of these effects could be directed from more distant neural structures.

In another experiment, Glanzman et al. [1990] studied an in vitro coculture of Aplysia sensory neurons and their target (L7 motor) cells. The sensorimotor cocultures were grown for 5 days and observed by fluorescence video micrographs. One group of preparations was repeatedly treated with the facilitating transmitter serotonin (5-HT) for 24 hours. At the end of the experiment, the coculture was imaged again to look for structural changes. Morphological changes (changes in the size of varicosities or new processes) at the junctions between the sensory and motor cells were rated on a subjective scale. This study was significant in that the researchers were able to directly image structural changes—rather than relying on comparisons between two different populations of neurons. In the control group, morphological changes were found to be normally distributed with a mean change of zero on their rating scale. In the cocultures treated with serotonin, however, structural change was shown to be highly biased toward increases in varicosities or processes. Furthermore, they showed that these structural changes corresponded to measurable changes in monosynaptic excitatory postsynaptic potential (EPSP) produced in L7 motor cells by firing the sensory neuron. Thus, they were able to show that both physical and electrophysiological facilitation can be induced in vitro by a single chemical signal—serotonin.

FIGURE 2. Neural implementation of a biased random walk. Random variation in synaptic connectivity and efficacy (the site of random variation in the neural circuits) is rewarded after the fact if performance has improved. Performance is evaluated by sensory systems (somatosensory, visual, auditory, etc.) monitoring the external environment and appraised by higher brain centers; a nonspecific reinforcement signal (facilitation or inhibition) is broadcast to the participating neural circuitry driving the effector organs. The reinforcement signal could be chemical (hormonal) or neural in origin.

I suggest that these random variations serve a vital role in learning, that is, generating trial connections and efficacies. Serotonin release in a cluster of neurons may serve as a local "print" (or fixing) signal to retain effective changes. However, the experiment described by Glanzman et al. was not designed to differentiate between serotonin's putative role as a simple growth factor and its role as a reinforcement signal.

Serotonin has been shown to serve a role as a neuromodulator as well as a facilitation signal. There is evidence for a brainstem serotonergic projection to the ventrobasal thalamus, thus linking the facilitatory signal to higher brain centers [Eaton and Salt 1989]. Does facilitation reinforce existing changes, or does the change occur as a result of the presence of serotonin?

4.2 Reinforcement Signals

A biased random walk requires that the performance of a net be evaluated. This requirement may not be that problematic, since evaluation of performance tends to be computationally easier than improvement, and evaluation could be accomplished by other brain circuits. For example, throwing a ball requires precise coordination and timing of numerous muscles. Good performance is hard to achieve and may require extensive training. But how close a ball comes to hitting the target is relatively easy to determine. Evaluation of accuracy can be processed separately by the visual cortex—independently of networks involved in generating the movement. One portion of the brain thus could act for another system as "supervisor."

The reinforcement signal is likely to carry only general, nonspecific information. It could be neural or chemical (hormonal) in origin. Dayhoff et al. [1992, 1993] suggest that retrograde reinforcement signals could be mediated by intracellular signaling through the neuronal cytoskeleton. Glial cells are also capable of long-range intercellular communication via Ca²⁺ signaling [Cooper 1995]. Many of the substances that have been shown to modulate LTP (including the candidate retrograde messengers) are candidate reinforcement signals as well. To complete a model of random-walk learning, one must demonstrate that other brain centers have projections to the sites of synaptic variation that release (directly or indirectly) substances that can act to facilitate or inhibit the process of structural change.

One known reverse pathway is a projection from the locus ceruleus to the olfactory bulb. Locus ceruleus neurons are activated by unconditioned stimuli [Aghajanian and Vandermaelen 1982]. Reverse pathways from the locus ceruleus are diffuse but may still serve a neuromodulatory role [Crick 1989a]. Locus ceruleus neurons release the neurotransmitter norepinephrine (NE), which when infused into the rabbit olfactory bulb can prevent or delay the habituation to unreinforced odors [Gray et al. 1986]. Several forms of use-dependent synaptic plasticity in cortical tissues require the presence of NE [Bliss et al. 1983; Bear and Singer 1986]. Sullivan and colleagues argue that "it is now clearly established that activation of NE terminals in the olfactory bulb is necessary for memory formation, but not recall" [Sullivan et al. 1992]. Taken together, these data suggest that norepinephrine signals projecting from the locus ceruleus could function as a reinforcement signal.


234 Anderson

5 Conclusions

It is self-evident that some form of trial-and-error learning is involved in the acquisition of skilled movement [Crossman 1959; Anderson 1981]. But training a tabula rasa of randomly connected masses of neurons to perform complex control tasks is evidently a hopeless endeavor [Anderson 1991]. High-level control of movement is thought to involve the coordination or modulation of existing central pattern generators (CPGs) [Selverston 1980]. A biased random walk can be used to optimize a crudely organized network of CPGs during the acquisition of skilled movement [Anderson 1991; Anderson and Vemuri 1992; Styer and Vemuri 1992a,b, 1995]. This is somewhat analogous to Edelman's selectionist hypothesis, in that learning entails the "selection," or education, of an existing repertoire of dynamical "groups" [Edelman 1987; Crick 1989b]. Furthermore, the chemotaxis algorithm is only the most primitive form of trial-and-error; undoubtedly, more sophisticated, higher-level neural mechanisms will have evolved to coordinate and complement this process [Smalz and Conrad 1991].

Experimental verification of this type of learning will require protocols involving collections or assemblies of neurons, rather than individual synaptic junctions, in order to observe the stochastic variation and the effects of putative reinforcement signals. Furthermore, a more ambitious effort must be made to trace reinforcement signals back to their sources of projection.

McCulloch and Pitts offered a solution to the embodiment problem by demonstrating the computational properties of neural networks. Hebb proposed a neurobiological correlate of associative learning, or classical conditioning. Biased random walks in synaptic weight space can be seen as the neurobiological "embodiment" of trial-and-error learning. A biased random walk may some day be shown to be a learning rule immanent in nervous activity.

6 Acknowledgments

I thank Daniel Chester for calling to my attention the work done at Adaptronics, Inc. I also thank Hans J. Bremermann, Lee Segel, Michael Conrad, Judith Dayhoff, Omid Omidvar, and V. (Rao) Vemuri for their encouragement and editorial comments. This work was performed under the auspices of the U.S. Department of Energy and supported by the Center for Nonlinear Studies at Los Alamos.


7 References

K. Abe and H. Saito (1992). Epidermal growth factor selectively enhances NMDA receptor-mediated increase of intracellular Ca2+ concentration in rat hippocampal neurons. Brain Research 587: 102-108.

G.K. Aghajanian and C.P. Vandermaelen (1982). Intracellular identification of central noradrenergic and serotonergic neurons by a new double labeling procedure. J. of Neuroscience 2: 1786-1792.

W. Alt (1980). Biased random walk models for chemotaxis and related diffusion approximations. J. of Mathematical Biology 9: 147-177.

J.R. Anderson, Ed. (1981). Cognitive Skills and Their Acquisition. Erlbaum Associates, Hillsdale, NJ

R.W. Anderson (1991). Stochastic optimization of neural networks and implications for biological learning. Ph.D. Dissertation, University of California, San Francisco.

R.W. Anderson and V. Vemuri (1992). Neural networks can be used for open-loop, dynamic control. Int. J. Neural Networks. 3(3): 71-84 (1992).

M.A. Arbib (1987). Brains, Machines, and Mathematics, Second Edition (First Edition: McGraw-Hill, 1964), Springer-Verlag, New York.

S. Austin (1990). Genetic solutions to XOR problems. AI Expert, pp. 52-57.

P. Baldi (1991). Gradient descent learning algorithms: A general overview. JPL Technical Document.

P. Baldi and K. Hornik (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks 2: 53-58.

R.L. Barron (1968). Self-organizing and learning control systems. In: Cybernetic Problems in Bionics (Bionics Symposium, May 2-5, 1966, Dayton, OH), Gordon and Breach, New York, pp. 147-203.

R.L. Barron (1970). Adaptive flight control systems. In: Principles and Practice of Bionics (NATO AGARD Bionics Symposium, Sept. 18-20, 1968, Brussels, Belgium), pp. 119-167.

A.G. Barto, R.S. Sutton and P.S. Brouwer (1981). Associative search network: A reinforcement learning associative memory. Biological Cybernetics 40: 201-211.

A.G. Barto and R.S. Sutton (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13 (5):835-846.

M.F. Bear and W. Singer (1986). Modulation of visual cortical plasticity by acetylcholine and noradrenaline. Nature 320: 172-176.

H. Berg (1983). Random Walks in Biology. Princeton University Press, Princeton, NJ.

S. Birnstiel and H.L. Haas (1991). Acute effects of antidepressant drugs on long-term potentiation (LTP) in rat hippocampal slices. Naunyn-Schmiedeberg's Archives of Pharmacology 344: 79-83.

W.W. Bledsoe (1961). The Use of Biological Concepts in the Analytical Study of Systems, Technical Report, Panoramic Research Inc., Palo Alto, CA.

T.V.P. Bliss and G.L. Collingridge (1993). A synaptic model of memory: long-term potentiation in the hippocampus. Nature 361: 31-39.

T.V.P. Bliss, G.V. Goddard and M. Riives (1983). Reduction of long-term potentiation in the dentate gyrus of the rat following selective depletion of monoamines. J. of Physiol. 334: 475-491.

H.J. Bremermann (1974). Chemotaxis and optimization. J. of the Franklin Institute (Special Issue: Mathematical Models of Biological Systems) 297: 397-404.

H.J. Bremermann and R.W. Anderson (1989). An Alternative to Back-propagation: A Simple Rule of Synaptic Modification For Neural Net Training and Memory. Technical Report: U.C. Berkeley Center for Pure and Applied Mathematics PAM-483.

H.J. Bremermann and R.W. Anderson (1991). How the brain adjusts synapses—maybe. In Automated Reasoning: Essays in Honor of Woody Bledsoe, R.S. Boyer (Ed.), Chapter 6, pp. 119-147, Kluwer Academic Publ., Boston.

S. Brocher, A. Artola and W. Singer (1992). Agonists of cholinergic and noradrenergic receptors facilitate synergistically the induction of long-term potentiation in slices of rat visual cortex. Brain Research 573: 27-36.


D.V. Buonomano and J.H. Byrne (1990). Long-term synaptic changes produced by a cellular analog of classical conditioning in Aplysia. Science 249: 420-423.

D.L. Chester (1990). A comparison of some neural network models of classical conditioning. Proc. 5th IEEE International Symposium on Intelligent Control, Philadelphia 2: 1163-1168.

P.S. Churchland and T.J. Sejnowski (1989). Neural representation and neural computation. In: Neural Connections, Mental Computations, L. Nadel, L.A. Cooper, P. Culicover, and R.M. Harnish (Eds.), pp. 15-48, MIT Press, Cambridge, MA.

C.A. Colton, L. Fagni and D. Gilbert (1989). The action of hydrogen peroxide on paired pulse and long-term potentiation in the hippocampus. Free Radical Biol. Med. 7: 3-8.

M. Conrad (1983). Adaptability (Chapter 10), Plenum Press, NY.

M. Conrad and W. Ebeling (1992). M.V. Volkenstein, evolutionary thinking and the structure of fitness landscapes. BioSystems 27: 125-128.

M.S. Cooper (1995). Intercellular signaling in neuronal-glial networks. BioSystems 34: 65-85.

F. Crick (1989a). The recent excitement about neural networks. Nature 337: 129-132.

F. Crick (1989b). Neural Edelmanism. Trends in Neurosciences 12 (7): 240-248.

E.R.F.W. Crossman (1959). A theory of the acquisition of speed-skill. Ergonomics 2 (2): 153-166.

E.R.F.W. Crossman and P.J. Goodeye (1983). Feedback control of hand-movement and Fitts' law. Quarterly Journal of Experimental Psychology 35A: 251-278.

Y. Dan and M. Poo (1992). Hebbian depression of isolated neuromuscular synapses in vitro. Science 256: 1570-1573.

J.E. Dayhoff, S.R. Hameroff, C.E. Swenberg and R. Lahoz-Beltra (1992). Biological plausibility of back-error propagation through microtubules. Technical report of the Institute for Systems Research, University of Maryland, College Park, MD 20742. SRC TR92-17.


J.E. Dayhoff, S.R. Hameroff, C.E. Swenberg and R. Lahoz-Beltra (1993). The neuronal cytoskeleton: A complex system that subserves neural learning. In: Rethinking Neural Networks, K.H. Pribram and Sir J. Eccles (Eds.), Lawrence Erlbaum Assoc.

J. Dayhoff, S. Hameroff, R. Lahoz-Beltra and C.E. Swenberg (1994). Cytoskeletal involvement in neuronal learning: a review. Eur. Biophys. J. 23: 79-93.

D.C. Deno (1992). Control theoretic investigations of the visual smooth pursuit system. Ph.D. Thesis, Dept. EECS, U.C. Berkeley.

F.U. Dowla, S.R. Taylor and R.W. Anderson (1990). Seismic discrimination with artificial neural networks: Preliminary results with regional spectral data. Bulletin of the Seismological Society of America 80 (5): 1346-1373.

S.A. Eaton and T.E. Salt (1989). Modulatory effects of serotonin on excitatory amino acid responses and sensory synaptic transmission in the ventrobasal thalamus. Neuroscience 33 (2): 285-292.

G.M. Edelman (1987). Neural Darwinism. Basic Books, N. Y.

F. Edwards (1991). LTP is a long term problem. Nature 350: 271-272.

D.B. Fogel, L.J. Fogel and V.W. Porto (1990). Evolving neural networks. Biological Cybernetics 63: 487-493.

J.A. Gally, P.R. Montague, G.N. Reeke and G.M. Edelman (1990). The NO hypothesis: Possible effects of a rapidly diffusible substance in neural development and function. Proc. Natl. Acad. Sci. USA 87: 3547-3551.

D. Gardner (1993). Backpropagation and neuromorphic plausibility. World Congress Neural Networks II: 590-593.

D.L. Glanzman, E.R. Kandel and S. Schacher (1990). Target-dependent structural changes accompanying long-term synaptic facilitation in Aplysia neurons. Science 249: 799-802.

K.A. Grajski and M.M. Merzenich (1990). Hebb-type dynamics is sufficient to account for the inverse magnification rule in cortical somatotopy. Neural Computation.

C.M. Gray, W.J. Freeman and J.E. Skinner (1986). Chemical dependencies of learning in the rabbit olfactory bulb: Acquisition of the transient spatial pattern change depends on norepinephrine. Behavioral Neuroscience 100 (4): 585-596.

S. Grossberg and N.A. Schmajuk (1989). Neural dynamics of adaptive timing and temporal discrimination during associative learning. Neural Networks 2 (2): 79-102.

E. Harth and E. Tzanakou (1974). A stochastic method for determining visual receptive fields. Vision Research 14: 1475-1482.

E. Harth, T. Kalogeropoulos, A.S. Pandya and K.P. Unnikrishnan (1988). A universal optimization network. AT&T Technical Memorandum 11118-881026-23TM.

D.O. Hebb (1949). The Organization of Behavior. Wiley, New York.

R. Hecht-Nielsen (1989). Theory of the backpropagation neural network. IJCNN, Washington, DC. June 1989, I: 593-605.

G.E. Hinton (1989). Connectionist learning procedures. Artificial Intelligence 40 (1): 143-150.

J.J. Hopfield (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. (USA) 81: 3088-3092.

K.R. Iorio, L. Reinlib, B. Tabakoff and P.L. Hoffman (1992). Chronic exposure of cerebellar granule cells to ethanol results in increased N-methyl-D-aspartate receptor function. Molecular Pharmacology 41: 1142-1148.

Y. Izumi, D.B. Clifford and C.F. Zorumski (1992). Inhibition of long-term potentiation by NMDA-mediated nitric oxide release. Science 257: 1273-1276.

M. Jabri and B. Flower (1992). Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. IEEE Transactions on Neural Networks 3 (1): 154-157.

E.R. Kandel and T.J. O'Dell (1992). Are adult learning mechanisms also used for development? Science 258: 243-245.

J.A. Kauer, R.C. Malenka and R.A. Nicoll (1988). NMDA application potentiates synaptic transmission in the hippocampus. Nature 334: 249-252.


M.B. Kennedy (1988). Synaptic memory molecules. Nature 335: 770-772.

A.H. Klopf (1989). Classical conditioning phenomena predicted by a drive-reinforcement model of neuronal function. In Neural Models of Plasticity: Experimental and Theoretical Approaches, J.H. Byrne and W.O. Berry (Eds.), Chapter 7, pp. 104-132, Academic Press, Orlando, FL.

T. Kohonen (1984). Self-Organization and Associative Memory. Springer-Verlag, Berlin.

D. Koshland (1980). Bacterial Chemotaxis as a Model Behavioral System. Raven Press, New York.

S.R. Lehky and T.J. Sejnowski (1988). Computing 3-D Curvatures from Images of Surfaces Using a Neural Model. Nature 333: 452.

S.R. Lehky and T.J. Sejnowski (1990). Neuronal Model of Stereoacuity and Depth Interpolation Based on a Distributed Representation of Stereo Disparity. J. of Neuroscience 10 (7): 2281-2299.

S.R. Lockery, G. Wittenberg, W.B. Kristan and G.W. Cottrell (1989). Function of Identified Interneurons in the Leech Elucidated Using Neural Networks Trained by Back-Propagation. Nature 340: 468-71.

G. Lynch (1986). Synapses, Circuits, and the Beginnings of Memory. Bradford/MIT Press, Cambridge, MA.

S.A. Malenfant, S. O'Hearn and A.S. Fleming (1991). MK801, an NMDA antagonist, blocks acquisition of a spatial task but does not block maternal experience effects. Physiology and Behavior 49: 1129-1137.

W.S. McCulloch and W. Pitts (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5: 115-133.

B.W. Mel (1990). Connectionist Robot Motion Planning. Academic Press, Boston, San Diego.

B.W. Mel (1992). NMDA-based pattern discrimination in a modeled cortical neuron. Neural Computation 4: 502-517.

M.M. Merzenich, R.J. Nelson, J.H. Kaas, M.P. Stryker, W.M. Jenkins, J.M. Zook, M.S. Cynader and A. Schoppman (1987). Variability in hand surface representations in areas 3b and 1 in adult owl and squirrel monkeys. J. of Comparative Neurology 258 (2): 281-296.

M. Minsky and S. Papert (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA

P.R. Montague, J.A. Gaily and G.M. Edelman (1991). Spatial signaling in the development and function of neural connections. Cerebral Cortex 1 (1): 1047-3211.

D. Montana and L. Davis (1989). Training feedforward neural networks using genetic algorithms. Proc. 11th IJCAI.

P.G. Montarolo, E.R. Kandel and S. Schacher (1988). Long-term heterosynaptic inhibition in Aplysia. Nature 333: 171-174.

A.N. Mucciardi (1972). Neuromime nets as the basis for the predictive component of robot brains. In: Cybernetics, Artificial Intelligence, and Ecology, H.W. Robinson and D.E. Knight (Eds.), Fourth Annual Symposium Amer. Soc. of Cybernetics, pp. 159-193, Spartan Books.

J.C. Pearson, L.H. Finkel and G.M. Edelman (1987). Plasticity in the organization of adult cerebral cortical maps: A computer simulation based on neuronal group selection. J. of Neuroscience 7 (12): 4209-4223.

D. Quartermain, T. Nguyen, J. Sheu and R.L. Herting (1991). Milacemide enhances memory storage and alleviates spontaneous forgetting in mice. Pharmacology, Biochemistry and Behavior 39: 31-35.

N.A. Rashid and M.A. Cambray-Deakin (1992). N-methyl-D-aspartate effects on the growth, morphology and cytoskeleton of individual neurons in vitro. Brain Research 67: 301-308.

N. Ropert and N. Guy (1991). Serotonin facilitates GABAergic transmission in the CAl region of rat hippocampus in vitro. J. of Physiology 441: 121-36.

F. Rosenblatt (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65: 386-408.

F. Rosenblatt (1962). Principles of Neurodynamics. Spartan Books, Washington, DC.

D.E. Rumelhart, G.E. Hinton and R.J. Williams (1986). Learning internal representations by error propagation. In: Parallel Distributed Processing, D.E. Rumelhart and J.L. McClelland (Eds.), Vol. 1, MIT Press, Cambridge, MA, pp. 318-362.

T.E. Salt (1989). Modulation of NMDA receptor-mediated responses by glycine and D-serine in the rat thalamus in vivo. Brain Research 481: 403-406.

A.I. Selverston (1980). Are central pattern generators understandable? Behavioral and Brain Sciences 3: 535-571.

H.T. Siegelmann and E.D. Sontag (1991). Neural nets are universal computing devices, Technical Report SYCON-91-08, Rutgers University, Center for Systems and Control, New Brunswick, NJ.

H.T. Siegelmann and E.D. Sontag (1994). Analog computation via neural networks. Theor. Comput. Sci. 131: 331-360.

H.T. Siegelmann and E.D. Sontag (1995). On the computational power of neural nets. J. Computer Syst. Sci. 50: 132-150.

R. Smalz and M. Conrad (1991). A credit apportionment algorithm for evolutionary learning with neural networks. In: Neurocomputers and Attention, A.V. Holden and V.I. Kryukov (Eds.), Vol. 2, Manchester University Press, New York, pp. 663-673.

P.K. Stanton and T.J. Sejnowski (1989). Associative long-term depression in the hippocampus induced by Hebbian covariance. Nature 339: 215-218.

C.F. Stevens (1989). Strengthening the synapses. Nature 338: 460-461.

D. Stork (1989). Is back-propagation biologically plausible? IJCNN Washington, DC. II: 241-246.

D.L. Styer and V. Vemuri (1992a). Adaptive critic and chemotaxis in adaptive control. Conf. Artificial Neural Networks in Engineering (ANNIE), St. Louis, MO.

D.L. Styer and V. Vemuri (1992b). Control by artificial neural networks using model-less reinforcement learning. Preprint: Biomedical Engineering Graduate Group, University of California, Davis (submitted to Simulations).

D.L. Styer and V. Vemuri (1995). A comparison of adaptive critic and chemotaxis methods in adaptive control. Math. Comput. Modeling 21 (1/2): 109-118.

R.M. Sullivan, D.R. Zyzak, P. Skierkowski and D.A. Wilson (1992). The role of olfactory bulb norepinephrine in early olfactory learning. Brain Res. Dev. Brain Res. 70: 279-282.

R.S. Sutton and A.G. Barto (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review 88 (2): 135-170.

G. Tesauro and B. Janssens (1988). Scaling relationships in backpropagation learning. Complex Systems 2: 39-44.

E. Tzanakou, R. Michalak and E. Harth (1979). The alopex process: visual receptive fields by response feedback. Biological Cybernetics 35: 161-174.

J.H. Williams, M.L. Errington, M.A. Lynch and T.V.P. Bliss (1989). Arachidonic acid induces a long-term activity-dependent enhancement of synaptic transmission in the hippocampus. Nature 341: 739-742.

R.J. Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8: 229-256.

M.J. Willis, C. Di Massimo, G.A. Montague, M.T. Tham and A.J. Morris (1991a). Artificial neural networks in process engineering. IEEE Proceedings D 138: 256-266.

M.J. Willis, G.A. Montague, C. Di Massimo, M.T. Tham and A.J. Morris (1991b). Non-linear predictive control using optimization techniques. Proc. ACC, Boston, pp. 2788-2793.

J.M. Wilson (1991). Back-propagation neural networks: A comparison of selected algorithms and methods of improving performance. Proc. 2nd Annual Workshop Neural Networks WNN-AIND, Auburn, AL.

C.W. Xie and D.V. Lewis (1991). Opioid-mediated facilitation of long-term potentiation at the lateral perforant path-dentate granule cell synapse. Journal of Pharmacology and Experimental Therapeutics 256: 289-296.

X.H. Yu (1992). Can backpropagation error surface not have local minima? IEEE Trans. Neural Networks 3: 1019-1021.

R.A. Zalutsky and R.A. Nicoll (1990). Comparison of two forms of long-term potentiation in single hippocampal neurons. Science 248: 1619-1624.


F. Zoccarato, R. Deana, L. Cavallini and A. Alexandre (1989). Generation of hydrogen peroxide by cerebral cortex synaptosomes. Eur. J. Biochem. 180: 473-478.


Chapter 8

Using SONNET 1 to Segment Continuous Sequences of Items

Albert Nigrin¹

ABSTRACT This chapter discusses self-organizing neural networks that were designed to classify temporal sequences. Three of the major constraints that the networks were designed to satisfy are: (1) The networks must be able to both learn and classify temporal sequences at the pace at which the sequences are presented (no off-line processing), (2) the networks must be able to learn to segment patterns that have no predefined beginnings or endings, and (3) the networks must be able to incrementally learn in an unsupervised fashion without degrading previously established categories. This chapter will discuss constraints and describe architectures for achieving these goals. The transformation of sequences of temporal events into spatial patterns of activity will be shown, and the properties that classifying systems should have, to enable them to classify the transformed patterns, will be discussed. A selection of simulations is given to show that the ideas presented are plausible.

1 Introduction

In order for an autonomous agent to operate in a real-world environment, it must overcome at least three major problems. First, to allow the agent to operate in real time, it must be able to respond to events at the pace at which they occur. Second, since real-world events usually have no predefined beginning, middle, or ending, an agent must be able to form its own segmentations. And finally, since there is often no external teacher present to guide it, the agent must be able to learn its categories in an unsupervised fashion. This chapter will examine these and other issues. It will attempt to design a real-time neural network that learns to segment a never-ending stream of input items in an unsupervised fashion.²

The ability to segment patterns in real time is important in areas such as

¹Supported in part by ARC grant DAG-29-84-K-0072.
²Some of this chapter is excerpted from Nigrin (1993).



246 Nigrin

object recognition, reinforcement learning, and speech recognition. Let us consider speech recognition. First, since language is interactive, real-time operation is a must. Second, as anyone who has ever listened to an unfamiliar foreign language can attest, there seldom are any clear-cut boundaries between the words of a sentence. Thus, the ability to form segmentations in the presence of extraneous information is also a must. And third, if we wish to model humans, our systems must have the ability to learn language in an unsupervised fashion. This is clear, since infants learn their native language simply by listening to it.

Obviously, it is still beyond the scope of current technology to create neural networks that can learn continuous speech. However, by considering a simplified version of the problem, it may be possible to model the continuous and real-time nature of speech without complicating the problem excessively. After the simplified problem has been solved, additional issues can be dealt with.

Therefore, the input patterns will not consist of continuous speech signals. Instead, they will be composed of never-ending sequences of items. For convenience, the items will be represented by capital letters. However, it is also possible to think of the items as phonemes, numbers, musical notes, etc. These items will be sequentially presented to the network at a constant rate and intensity. Then, the neural network's task will be to learn to segment the input sequences by discovering the significant patterns embedded within them.

For example, consider the sequence below, which is repeatedly presented to the network. The letters are presented one at a time, with no breaks in the presentation. Therefore, after the last letter, Z, is presented, the first letter in the sequence, E, is immediately presented again. Notice that the lists EAT and NOW are embedded within several different locations (contexts). Because of this, the network should learn to recognize EAT and NOW as significant patterns.

E A T B C D N O W F G H I E A T J K L M N O W P Q R E A T S U V N O W X Y Z
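SONNET must discover such regularities on-line and unsupervised, but the target of the segmentation can be verified with a brute-force count; treating the stream as circular, EAT and NOW are the only length-3 substrings that recur (a toy check, not part of the network):

```python
from collections import Counter

seq = "EATBCDNOWFGHIEATJKLMNOWPQREATSUVNOWXYZ"
n = 3
circ = seq + seq[:n - 1]                     # wrap around: after Z comes E again
counts = Counter(circ[i:i + n] for i in range(len(seq)))
repeated = sorted(g for g, c in counts.items() if c > 1)
print(repeated)  # ['EAT', 'NOW']
```

The count only shows that the regularity is there to be found; the network must find it without ever seeing the whole sequence at once.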

If the network is viewed as a black box, it can be pictured as in Figure 1. Each input line represents a discrete event or item. An item is presented to the network by briefly activating its input line, and a list is presented by sequentially presenting different items. Figure 2 shows the activation of the input lines when the network is presented the lists ABC and CAB.

In this chapter, several restrictions are placed on the types of inputs that are allowed. Methods to overcome these restrictions are discussed in Nigrin (1990, 1993).

FIGURE 1. Black box of the network.

FIGURE 2. (a) Sequential activation of the input lines during the presentation of the list ABC. (b) Activation of the input lines when CAB is presented. Reprinted with permission from Nigrin (1993).

1. No item can be repeated within the same list. For example, the list ABC can be presented to the network, but the list ABA cannot. This restriction is placed for two reasons. First, in the current model only one storage location exists for each distinct item, and second, it is presently unknown how to use a single node to represent multiple occurrences of an item in an unambiguous fashion. Therefore, before an item can be repeated, the activity in its storage location must be reset.

2. Items are presented by activating input lines at fixed intensities for fixed periods of time. Therefore, the only important information that can be varied is the sequence of the items. This chapter does not deal with the possibility of different rhythms or different intensities in the presentation of items.

3. For simplicity, no noise is present during any of the simulations, and the items are not garbled in any way. However, the network was tested on noisy patterns in simulations involving static spatial patterns (Nigrin, 1993).

FIGURE 3. The structure of the network in this chapter. F^(1) transforms temporal sequences of events into spatial patterns of activity. F^(2) classifies these evolving spatial patterns. Reprinted with permission from Nigrin (1993).

We will tackle this problem using a Self-Organizing Neural NETwork called SONNET 1. SONNET 1 is separated into two distinct fields of units, F^(1) and F^(2), as shown in Figure 3. F^(1) contains nodes that represent the various input items that can be presented to the network. Each node represents a specific item (or event) and receives external input from a single input line that represents the item. F^(1) transforms the successive activation of different input lines into a spatial pattern of activity in short-term memory (STM) to represent the sequence of items. F^(1) nodes represent serial order, so they will be referred to by the letter s. Both the name and the activity of the ith cell in F^(1) are given by s_i.

F^(2) has cell assemblies that represent lists of items. To be more precise, F^(2) learns to chunk the evolving spatial patterns of activity across F^(1). Both the name and the activity of the ith cell assembly in F^(2) are given by x_i.

Using the architecture described above, there are three major problems that must be solved. First, constraints must be found to govern the transformation of temporal events into a spatial pattern of activity. Second, once an adequate transformation has been devised, neural network mechanisms must be designed at F^(1) to implement this transformation. And third, networks must be designed at F^(2) that can classify these unsegmented spatial patterns in real time.

The chapter is organized in the following manner. Section 2 will discuss the general manner by which patterns are classified at F^(2). Then, Sections 3 through 5 will discuss various constraints on transforming a sequence of temporal events into spatial patterns of activity. Sections 6 and 7 will discuss the specific architectures whereby this can be implemented. Section 8 will discuss properties that a classifying system should have to enable it to classify transformed patterns. Section 9 will present a small number of simulations to show that this scheme is a plausible one, and Section 10 will present some additional discussion.

2 Learning Isolated and Embedded Spatial Patterns

Before discussing how temporal patterns are transformed, let us first discuss the general manner by which a classifying network can learn to classify arbitrary patterns at F^(2).

First, let us consider the network. In SONNET 1, F^(1) and F^(2) are fully connected. Thus, every F^(1) node sends signals to every F^(2) node, and every F^(2) node sends signals back to every F^(1) node. These signals are gated (multiplied) by excitatory weights. The long-term memory (LTM) weight from s_i to x_j will be referred to by z_ij, and the feedback weight from x_k to s_m will be referred to by z'_km. At F^(2), inhibitory connections exist between all the F^(2) nodes. Thus, when an input pattern is presented to F^(2), the F^(2) nodes will compete for the right to activate and classify the pattern.

The excitatory weights from F^(1) to F^(2) are initially small. Thus, when a novel F^(1) pattern is initially presented, many F^(2) nodes will weakly activate in response to it. However, after learning has taken place, a single F^(2) node will activate strongly to represent the F^(1) pattern. This strong activation occurs because the learning rule causes the weights at the classifying F^(2) node to become large and parallel to the F^(1) pattern.


The rule for modifying the feedforward excitatory weights is given by³

    dz_ji/dt = ε_i x_i [ -L z_ji + s_ji x_i ],    (1)

where ε_i is a constant that governs the learning rate, L is a decay constant, and s_ji is the normalized value of s_j at x_i (see equation 12).
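As a concrete reading of equation (1), here is one explicit Euler integration step (the array shapes and parameter values are illustrative assumptions; the normalization of the s values is taken as already done elsewhere, as in equation 12):

```python
import numpy as np

def ltm_step(z, s_norm, x, eps, L=1.0, dt=0.01):
    """One Euler step of dz_ji/dt = eps_i * x_i * (-L * z_ji + s_ji * x_i).

    z      : (n_s, n_x) feedforward LTM weights z_ji
    s_norm : (n_s, n_x) normalized presynaptic values s_ji
    x      : (n_x,)     F2 activities x_i
    eps    : (n_x,)     per-node learning rates eps_i
    """
    dz = eps * x * (-L * z + s_norm * x)  # broadcasts over the presynaptic index j
    return z + dt * dz
```

Because the whole bracket is gated by x_i, inactive F^(2) nodes leave their weights untouched, which is what allows a single winner to dominate learning.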

Learning occurs in the following way. When a novel pattern is presented, some F^(2) nodes will activate more strongly than others, because the LTM weights are initially set to random values. The active nodes increase their LTM weights at active input lines and decrease their weights at inactive input lines. The higher the activity of an F^(2) node, the faster its weights increase and decrease. Since higher-activity nodes increase their weights more quickly than lower-activity nodes do, they boost their competitive advantage over the lower-activity nodes.

The asymmetry in the activity of the cells is also enhanced by the competitive structure of the network. F^(2) nodes compete for the right to activate via lateral inhibitory connections. Higher-activity nodes output more inhibition than lower-activity nodes do; consequently, they inhibit the lower-activity nodes more strongly than the reverse.

The differences in inhibition, coupled with the differences in learning rates, cause the asymmetry in the activities of the nodes to continually increase. Eventually, the asymmetries become so large that a single F^(2) node activates fully and inhibits the remainder of the field. That node is then considered to have classified the F^(1) pattern.

The network learning rate can be set to provide two different types of learning. Fast learning occurs when a high learning rate enables F^(2) nodes to classify F^(1) patterns in as little as a single trial. In this case an F^(2) node simply memorizes the active F^(1) pattern. Slow learning occurs when a lower learning rate allows the network to generalize over many different examples. This allows the network to learn patterns that are embedded within larger patterns.

An embedded pattern will be learned when that pattern occurs in multiple different contexts. For example, suppose the pattern B is embedded within multiple larger patterns such as ABC, DBE, and FBG. Then, the network will learn to classify the pattern B in the following way. Suppose the F^(2) node x_i responds to the presentation of the patterns above. When ABC is presented, the LTM weights z_Ai, z_Bi, and z_Ci will increase while the remaining weights decrease (in this example, z_Di, z_Ei, z_Fi, and z_Gi all decrease). Similarly, when DBE is presented, z_Di, z_Bi, and z_Ei increase

^1 This and future equations are presented only to give the reader concrete instantiations of various quantities in the network. However, it is possible to skip over any of the equations without any loss of continuity.


252 Nigrin

while the remaining weights decrease. The weight z_Bi increases for all the patterns that are presented. Conversely, the remaining weights increase infrequently (when their input line is active) and decrease frequently (in all contexts where their input line is not active). Therefore, z_Bi will increase to much larger levels than the other weights, and x_i will establish a category for the pattern B.
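The competitive effect of equation (1) can be sketched numerically. The code below Euler-integrates the rule for a single F^(2) node whose activity is clamped on; the learning rate, decay constant, step size, and step count are illustrative choices of mine, not values from the chapter. The small number of steps per presentation models slow (partial) learning.

```python
# Sketch of the learning dynamics of equation (1), Euler-integrated.
# eps, L, dt, x, and the step count are illustrative, not from the text.
items = "ABCDEFG"
z = {c: 0.1 for c in items}          # LTM weights into one F(2) node x_i
eps, L, dt, x = 0.5, 1.0, 0.1, 1.0   # learning rate, decay, step, x_i activity

def present(pattern, steps=30):
    # normalized activity of each item line while `pattern` is shown
    s_bar = {c: (1.0 / len(pattern) if c in pattern else 0.0) for c in items}
    for _ in range(steps):
        for c in items:
            z[c] += dt * eps * x * (-L * z[c] + s_bar[c] * x)

for pattern in ["ABC", "DBE", "FBG"]:
    present(pattern)

# z_B grew during every presentation; every other weight grew at most
# once and decayed the rest of the time, so B's weight dominates.
assert z["B"] > max(v for c, v in z.items() if c != "B")
```

Because each presentation moves the weights only part of the way toward equilibrium, the weight on B accumulates across all three contexts while the others repeatedly decay, which is exactly the embedding argument above.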

In SONNET 1, by varying a single parameter it is possible to control the number of different contexts in which an embedded pattern must appear before it is learned. For example, in the paragraph above we assumed that the parameter was set so that the appearance of B in only three contexts was sufficient for generalization to occur. However, with a different parameter choice, more than three contexts might be necessary. If this were true, then when ABC, DBE, and FBG were presented, the network would not create a category for B, but would instead create categories for ABC, DBE, and FBG.

More discussion of the classifying network will be presented later, since some of the motivation for its design depends on the manner in which temporal events are transformed into spatial patterns.

3 Storing Items with Decreasing Activity

Before a temporal pattern can be classified, it must first be stored in a way that allows a neural network to process it. This section will give guidelines that show one possible way to transform a sequence of temporal events into a spatial pattern of activity. Once the network has performed this transformation, it will be able to classify the patterns using any classifying network that is sufficiently powerful.

Suppose a list of events r_1, r_2, ..., r_n sequentially activates the F^(1) nodes s_1, s_2, ..., s_n. After this list has been presented, some spatial pattern of activation must exist across F^(1) to represent the temporal information of the list. (For convenience, the following examples use the list r_1, r_2, ..., r_n. However, any arbitrary sequence of items can be presented and unambiguously stored. One possible different sequence is r_n, r_{n−1}, ..., r_1.) It is clear that this activation pattern cannot be a binary one, since then it would be impossible to distinguish between different lists composed of the same items. For example, the network should be able to distinguish between the words LEFT and FELT, even though they are composed of the same items (Grossberg, 1978).
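The LEFT/FELT point can be made concrete. In the sketch below, a binary store keeps only the set of active items, so anagrams collide, while a toy graded store (each new item enters at an illustrative value μ and earlier items are scaled by ω, anticipating the storage rule developed in Section 4) keeps order:

```python
# Binary storage keeps only the set of active items, so anagrams collide;
# a graded (here, decreasing) pattern separates them.  The mu and omega
# values are illustrative.
def binary_store(word):
    return {c: 1 for c in word}

def graded_store(word, mu=1.0, omega=2.0):
    # each new item enters at mu and multiplies earlier items by omega
    act = {}
    for c in word:
        act = {k: v * omega for k, v in act.items()}
        act[c] = mu
    return act

assert binary_store("LEFT") == binary_store("FELT")   # indistinguishable
assert graded_store("LEFT") != graded_store("FELT")   # order preserved
```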

There are two obvious ways to represent order information with a spatial pattern of activity. After the list r_1, r_2, ..., r_n has been presented, the activity across the item nodes can either be monotonically decreasing, as in equation 2, or monotonically increasing, as in equation 3.

s_1 > s_2 > ... > s_n,   (2)


FIGURE 4. Transformation of temporal events into spatial patterns of activity. (a) The activity of various nodes in a field after the list r_1 r_2 r_3 has been presented. (b) The activity of these nodes in response to the list r_1 r_2. (c) The activity of these nodes in response to the list r_3 r_2 r_1.

s_1 < s_2 < ... < s_n.   (3)

When order information is represented by a monotonically decreasing pattern, nodes representing items that occurred earlier have higher activations than nodes representing items that occurred later. The activation patterns that result from a variety of different lists are shown in Figure 4.

Herein, order information will be represented by a decreasing pattern of activation, as in Figure 4. However, many other neural networks use exactly the opposite strategy. Networks such as those implemented by Sung and Jones (1988, 1990), and by Anderson, Merrill, and Port (1988), use an increasing pattern of activity to represent order information, as in equation 3. I believe that this has been done because increasing patterns are easier to achieve. However, as Section 6 will show, it is possible to design networks that obtain decreasing patterns of activity. Furthermore, as will be shown below, networks that use a decreasing pattern of activity do not have the fundamental problems with feedback that plague networks that use an increasing pattern of activity. (The networks that use equation 3 do not use feedback to bias their classifications.) Since I believe that networks must be able to use feedback, the transformation from temporal to spatial patterns will follow equation 2.

Let me present an example that shows why networks that use increasing patterns of activity have problems with feedback. Suppose that after the items r_1, r_2, ..., r_n have been presented, the activation pattern s_1, s_2, ..., s_n is monotonically increasing, as in equation 3. Furthermore, let that pattern be learned by some F^(2) node x_j. After learning, both the bottom up and top down LTM weights will become parallel to the pattern


of activation, so that z'_j1 < z'_j2 < ... < z'_jn. This is shown in Figure 5a. Now suppose that at a later time the same list is presented. After a portion of the list r_1, r_2, ..., r_{k−1} has been presented, x_j will partially activate and send feedback signals to F^(1). Then, since z'_{j,k+1} > z'_{jk}, the gated signal received by s_{k+1} will be larger than that received by s_k. Therefore, s_{k+1} will receive more expectation signals than s_k.

Thus, because the LTM weights have equilibrated to an increasing pattern, the F^(1) field is biased to activate more easily to r_{k+1} than to r_k. This occurs even though r_k occurs earlier than r_{k+1} and thus should be more expected! The problem becomes even more apparent when we realize that the F^(1) node that is most biased to activate is the one representing the last item in the list, even though that item is not expected to occur for some time. This is clearly an error, since expectation signals should most bias the network to react to those items that are about to occur.

Conversely, if the items are stored with a decreasing pattern of activation, as in equation 2 (and shown in Figure 5b), this dilemma does not occur. Expectancy signals are generated correctly, since nodes representing earlier items receive more feedback than nodes representing later items. Thus, if feedback is used to generate expectation, then successive items should be stored with decreasing activation.
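A two-line computation shows the bias. Assuming a unit feedback signal x_j and illustrative weight values, the gated feedback x_j · z' is largest at the last item when the weights are increasing, and at the earliest item when they are decreasing:

```python
# With weights parallel to an increasing stored pattern, feedback most
# excites the *last* item; with a decreasing pattern, the *earliest*
# item.  All numbers here are illustrative.
n, x_j = 5, 1.0
increasing = [0.1 * (i + 1) for i in range(n)]   # z'_j1 < ... < z'_jn
decreasing = list(reversed(increasing))          # z'_j1 > ... > z'_jn

feedback_inc = [x_j * w for w in increasing]
feedback_dec = [x_j * w for w in decreasing]

assert max(feedback_inc) == feedback_inc[-1]   # last item most expected: wrong
assert max(feedback_dec) == feedback_dec[0]    # earliest item most expected
```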

4 The LTM Invariance Principle

The previous section discussed one possible way to transform a temporal pattern into a spatial one. However, merely transforming a sequence of temporal inputs into a decreasing pattern of activation is not sufficient. For correct operation it is necessary to satisfy another constraint, called the LTM invariance principle. This principle can be stated as follows: "Once a sequence of input items [r_1, r_2, ..., r_i] is presented, its spatial pattern represents "past" order information. Presenting a new input [r_{i+1}] can reorganize the total pattern of coded STM activity at F^(1), but ... does not recode that part of the coded pattern which involves only past order information. In other words, new inputs can weaken the strength of past codes but do not deny the fact that the past events did occur" (Grossberg, 1978).

The LTM invariance principle applies to any representation of a sequence of items. It makes no assumptions about the manner in which input items are represented. In addition, it makes no assumptions about the rule used at the classifying layer to cluster the input. It states only that if some STM representation creates a match (arbitrarily defined) at a node in the classification layer, then that match should continue to occur after additional items are presented. (This is the case even for partial matches.)

One common rule that is used at the classifying layer is a dot product


FIGURE 5. Weights that evolve for different choices of transformations. (a) Items r_1, r_2, ..., r_n are sequentially presented and are stored with increasing activity across the nodes that represent them. At equilibrium, the top down LTM weights from an F^(2) node become parallel to the pattern of activation that is present while the F^(2) node is active. (In this and all subsequent figures, larger LTM connections are indicated by larger squares incident on the cell.) (b) Same as part (a) except that items are stored with decreasing activity. Whenever modifiable feedback weights are needed, this is the correct choice for the temporal to spatial transformation. Reprinted with permission from Nigrin (1993).


rule, which states that the input to the ith F^(2) node is given by:

I_i = Σ_j s_j z_ji.   (4)

In this case, the LTM invariance principle reduces to the following rule: When new items are presented to a field, the total activity of the field can change, but the relative pattern of activation among the nodes activated by past items must remain constant. For example, suppose that after r_1 and r_2 activate s_1 and s_2, it is the case that s_1 = 2 and s_2 = 1. This might be coded by an F^(2) node x_i with weights z_1i = 1 and z_2i = 0.5. If the LTM invariance principle is followed, then when additional items such as r_3 are presented, the activities of s_1 and s_2 must stay in proportion to one another. Depending on the parameters in the system, they may rise to activities like s_1 = 4 and s_2 = 2, or s_1 = 6 and s_2 = 3. (If the activity of the field has saturated, they may even fall to activities like s_1 = 1.2 and s_2 = 0.6.) However, the relative activities between the two nodes must remain constant. This allows x_i to continue to know that its list was presented, since the input vector of activities across s_1 and s_2 remains parallel to its LTM weights.
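The dot-product rule (4) and the invariance requirement can be checked directly. The weights below are the z_1i = 1, z_2i = 0.5 values from the example above; the scale factors are arbitrary:

```python
# Equation (4): input to an F(2) node is the dot product of the item
# activities with its LTM weights.  If all past activities are rescaled
# by a common factor, the input scales by that factor, so the relative
# match is preserved.
def dot_input(s, z):
    return sum(sj * zj for sj, zj in zip(s, z))

z_i = [1.0, 0.5]          # weights learned from the pattern s = (2, 1)
base = dot_input([2.0, 1.0], z_i)
for scale in (2.0, 3.0, 0.6):
    s = [2.0 * scale, 1.0 * scale]
    assert abs(dot_input(s, z_i) - scale * base) < 1e-9
```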

The LTM invariance principle is important for the following reason. The network cannot know a priori which subsequences will be significant and which will not. (For example, can you pick out the Turkish word in the list "alborsalab"?) In principle, all subsequences of a list are legitimate sequences in their own right and should be able to be learned by the LTM weights to F^(2). This becomes difficult to do if the invariance principle is not followed. If each new item can change the relative pattern of activity across nodes representing previous items, then no pattern would remain in STM long enough to be coded by the LTM weights.

Another problem that would result if the invariance principle were not followed has to do with the STM operation of the network. Consider some F^(2) node that receives a large amount of excitatory input immediately after its sequence is presented. If the relative pattern of activity can change (at F^(1) nodes representing previous items), then after additional items are presented, that F^(2) node may no longer receive large amounts of input. This is incorrect, since there would no longer be any way of knowing that the event represented by the F^(2) node ever occurred. While it is true that future items should be able to weaken the significance of past events, these future items should not make it impossible to tell that these past events occurred. For example, after the word CARGO has been presented, it is still possible to tell that the word CAR has occurred. The significance of the word CAR may be reduced when embedded in the larger context. However, it is still possible to tell that the word occurred.


Approaches that do not satisfy the LTM invariance principle have been used by Sejnowski and Rosenberg (1987), Jordan (1986), Miikkulainen and Dyer (1991), Elman (1990), St. John and McClelland (1990), Cottrell (1985), Elman and Zipser (1988), and Hanson and Kegl (1987). Although some of these networks could recognize patterns that were embedded in a sequence, the fact that the networks did not satisfy the invariance principle meant that they could not address the problem of segmenting overlapping patterns. For example, consider the presentation of the spoken words All turn, while noticing that the phonemes in the word alter are contained within that utterance. Since the above networks do not satisfy the invariance principle, it will be the case that after All turn has been fully presented, the pattern that initially represented the word All will no longer exist. And since a reliable segmentation cannot be performed until after the word turn has been fully presented (immediately after the phonemes in All have been presented it is impossible to know whether the next phoneme will start a new word or complete the word Alter), the networks will not behave in a robust fashion.

It is possible for networks that use delay lines to satisfy the LTM invariance principle (Unnikrishnan, Hopfield, and Tank, 1992; Tank and Hopfield, 1987; Waibel, Hanazawa, Hinton, Shikano, and Lang, 1989). However, that approach has problems in hierarchical networks, since higher-level events of long duration require large amounts of network hardware to represent the required windows of time.

A rule that does satisfy the invariance principle is presented in Figure 6. The nodes s_1, s_2, s_3, and s_4 are activated sequentially by r_1, r_2, r_3, and r_4. After presentation of item r_i, node s_i attains an activation of μ_i. Concurrently, the activity of all other nodes is multiplied by the factor ω_i. Since all other nodes are multiplied by the same factor, their activities remain proportional.

To satisfy the LTM invariance principle, μ_i and ω_i can be any non-negative constants. However, to allow a classifying network to operate correctly, it is sometimes useful to place extra constraints on the values of these parameters. One useful constraint (Grossberg, 1978) is that ∀i,j: μ_i = μ_j = μ, where μ is some constant. When this constraint is followed, then immediately after r_i is presented, s_i = μ, regardless of whether r_i is the first, last, or fifth item in the list. Allowing the activity reached by s_i to be independent of r_i's position in the list is a reasonable approximation, since the network should be able to respond to any item, regardless of the number of items that have previously been presented. If μ_i were allowed to diminish with list position, then the network would be unable to attend to the last item in a long list.

An additional constraint on the invariance principle has to do with the parameter ω_i. Since μ_i remains the same at each list position, ω_i must be greater than 1 to allow the spatial pattern across F^(1) to become monotonically decreasing. For example, suppose μ = 1 and the list r_1, r_2 is presented. After r_1 is presented, s_1 = 1 and s_2 = 0. After r_2 is presented, s_2 = 1 and s_1 = ω_2. If ω_2 < 1, then s_2 > s_1, and the network would not achieve the decreasing activity pattern that was discussed in the last section.

Time   s_1              s_2            s_3        s_4
t_1    μ_1              0              0          0
t_2    μ_1ω_2           μ_2            0          0
t_3    μ_1ω_2ω_3        μ_2ω_3         μ_3        0
t_4    μ_1ω_2ω_3ω_4     μ_2ω_3ω_4      μ_3ω_4     μ_4

FIGURE 6. Sequential activation of nodes s_1 through s_4 in a field that obeys the LTM invariance principle. Node s_i is activated at time t_i and reaches an activity of μ_i. As each node s_i is activated, the activity of all other nodes is multiplied by the factor ω_i. (Figure taken from Grossberg (1978).)
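The storage rule of Figure 6 is easy to state as code. Assuming the constrained case μ_i = μ and ω_i = ω, with illustrative values μ = 1 and ω = 2, each new item enters at μ and multiplies all earlier items by ω:

```python
# Storage rule of Figure 6 in the constrained case mu_i = mu,
# omega_i = omega > 1.  Parameter values are illustrative.
def store(n, mu=1.0, omega=2.0):
    s = []
    for _ in range(n):
        s = [v * omega for v in s]   # scale every previously active node
        s.append(mu)                 # the new item enters at mu
    return s

s = store(4)
assert s[0] > s[1] > s[2] > s[3]          # decreasing with list position

# LTM invariance: presenting r_4 leaves the ratio s_1/s_2 unchanged.
ratio_before = store(3)[0] / store(3)[1]
ratio_after = store(4)[0] / store(4)[1]
assert abs(ratio_before - ratio_after) < 1e-9
```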

5 Using Rehearsal to Process Arbitrarily Long Lists

The previous section established constraints on the values of μ_i and ω_i. That section determined that it should be the case that ∀i,j: μ_i = μ_j = μ, and ω_i > 1. However, these choices necessitate that we find some way to reset items that have been active for a long time, since otherwise a problem will result.

In any real system, ω_i cannot be greater than 1 for all list positions, since there will eventually be a time at which the storage field saturates, and at this point the activity of all nodes will be prevented from rising further. In fact, after saturation has occurred, it will be the case that when new items are presented, the activity of nodes representing previous items must fall (ω < 1) to make room for the activity of nodes representing new items.

Thus, a bow can occur in the STM activation pattern. (See Figure 7.) The number of items that occur before the occurrence of the bow is defined as the transient memory span (TMS). The TMS is a useful concept, since correct order information can be obtained for all lists whose lengths are shorter than the TMS (since they are stored with a monotonically decreasing pattern).

FIGURE 7. The activation of a field of nodes after a long list has been presented. The bow in the curve occurs at position j, and thus the TMS is j items. Reprinted with permission from Nigrin (1993).

The question then arises: How can the network retain order information for lists that are longer than the TMS? This can be achieved by introducing the concept of rehearsal (Grossberg, 1978, 1985). After the presentation of a long list, order is not necessarily confused if the items of the list can be classified and then reset. This results in no loss of information, since the order information can still be obtained from the activity of the classifying nodes. For example, by using rehearsal, a network with a TMS of 4 could process the sequence THEBOYRAN even though it contains 9 letters. The network could do this by classifying the letters T, H, and E into a unitized representation of the word THE and deleting the letter representations from STM. Similarly, the network could create the chunks BOY and RAN, deleting the letters that make up those words. At no time does the network ever need to store a list in STM whose length is longer than 4.
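The THE/BOY/RAN example can be sketched as a loop. The vocabulary and the span come from the text; the matching rule (reset STM as soon as its contents spell a known word) is a deliberate simplification of the actual classification dynamics:

```python
# Rehearsal sketch: classify known chunks and delete their letters from
# STM, so the stored list never exceeds the transient memory span.
# The reset-on-exact-match rule is a simplification.
VOCAB, TMS = {"THE", "BOY", "RAN"}, 4
stm = []
for letter in "THEBOYRAN":
    stm.append(letter)
    if "".join(stm) in VOCAB:
        stm = []                     # chunk classified; its letters reset
    assert len(stm) <= TMS           # STM never exceeds the span
```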

However, if no preexisting nodes exist with which to chunk a long list, then order information would indeed be confused. This is not a problem if natural intelligence is being studied, since confusion also occurs in humans. Experiments have been done in which subjects were presented a long list of items and then asked to recall them in the correct order. It was found that subjects did well for items at the beginning and end of the list while doing poorly for items in the middle of the list (Grossberg, 1978). In fact, if the percentage of correct responses versus list position is plotted (the so-called serial position curves), a graph is obtained whose shape is identical


FIGURE 8. On-center off-surround network used to store input patterns. Shown are the excitatory connections to s_i and the inhibitory connections from it. Analogous connections exist for the other cells. Reprinted with permission from Nigrin (1993).

to Figure 7. Thus, the LTM invariance principle is in good agreement with experimental data from humans.^2

This and previous sections have given guidelines that show how to properly transform a temporal pattern into a spatial one. The next several sections will present a network that implements these guidelines.

6 Implementing the LTM Invariance Principle with an On-Center Off-Surround Circuit

Section 4 presented two constraints that should be used when transforming a sequence of temporal inputs into a spatial pattern of activity (∀i,j: μ_i = μ_j = μ and ω_i > 1). This section will show that by arranging F^(1) in the on-center off-surround configuration shown in Figure 8, it will be able to (approximately) satisfy both these constraints and transform a sequence of temporal events into a spatial pattern of activity (for a limited number of items).

The equation for the ith cell in F^(1) is given by

(d/dt) s_i = −A s_i + (B_s − s_i)[v_s f(s_i) + I_i] − v_s s_i Σ_{j≠i} f(s_j),   (5)

"^This is a good example where data from humans can help in neural network design. If this data did not exist, one might spend an inordinate amount of time trying to design a network that never confused order information. However, the data from human subjects shows that this property may be very difficult to achieve and that a designer might be better off by avoiding it for the time being.


where A, B_s, and v_s are constants such that A represents passive decay, B_s is the maximum activity of s_i, and v_s is a constant weight between cells that helps control the shape of the transformed activity pattern. I_i represents external input and is nonzero only while the item represented by s_i is being presented. Finally, f(s_i) is the output signal of s_i.

As was proved in Grossberg (1973), if a linear signal (such as f(s_i) = s_i) is used in a network with this architecture, then once the external input to a set of nodes is shut off, the relative activities of the nodes in that set will remain constant forever. (However, as also shown in that paper, a linear signal amplifies noise. Thus, if the network is to be used in the presence of noise, a sigmoid signal should be used instead of a linear signal.) This is useful, since it means that when a linear function is used, the LTM invariance principle is automatically satisfied. For example, suppose that after r_1 and r_2 are presented, s_1 and s_2 reach some levels of activity. Then, if additional items, such as r_3, are presented, the total activity across s_1 and s_2 may change, but the relative activity across the cells will remain fixed forever (as long as neither s_1 nor s_2 receives any additional external input).

With certain choices for the parameters I_i, A, v_s, and B_s, coupled with the amount of time that each input line is active (K_16), it is possible to achieve a decreasing pattern of activity across the nodes. For example, in equation (5) above, suppose the parameters are set such that A = 0, I_i = 0.007, v_s = 0.075, B_s = 50, K_16 = 0.4, and f(s_i) = s_i. Then, when a long list is presented, a decreasing pattern of activity will be obtained, as shown in Figure 9. Furthermore, Figure 10 shows that for the first few items in a list the network will achieve values of μ ≈ 0.1 and ω_i ≈ 2. (With a different set of parameters, different behaviors can be obtained.) Notice that in the parameter choices, B_s ≫ μ. This will cause the TMS to be very large and allow many items to be presented before the activity of the field saturates.

Let us examine the dynamics of this network and show why it satisfies the two constraints given at the beginning of the section. Suppose the list r_1, r_2, ..., r_n is presented to the network, where each item r_i activates cell s_i. Initially, there is no activity in the field. Then, the first item is presented by activating I_1. Suppose the parameters are set so that when I_1 is shut off, s_1 = μ.

Now let us examine what occurs when the ith item in the list is presented, under the assumptions that the total activity of the field is small compared to B_s and that v_s ≪ 1. We will see that when these assumptions are true, s_i will reach an activity of approximately μ, regardless of r_i's position in the list. To see that this is true, let us compare the positive input to the negative input at s_i. The positive input to s_i is given by (B_s − s_i)[v_s s_i + I_i]. Since s_i ≪ B_s, this quantity reduces to approximately B_s[v_s s_i + I_i]. The negative input is given by v_s s_i Σ_{j≠i} s_j. Since Σ_{j≠i} s_j is much less than B_s, it follows that the negative input is almost negligible in relation


FIGURE 9. The activity of s cells when a long list (r_0, r_1, ..., r_n) is presented. (a) Equation 5 is simulated with parameter values of A = 0, I_i = 0.007, v_s = 0.075, B_s = 50, K_16 = 0.4, and f(s_i) = s_i. With these parameters no bow is exhibited in the activity pattern. (b) Same as part (a) except that a log plot is used. Order information is represented well only in the linear region of the graph. Reprinted with permission from Nigrin (1993).


FIGURE 10. Values of μ_i and ω_i for the activities of Figure 9. Although constant values for these parameters are desired, the graphs show that this occurs for only the first few items. (a) Values of μ_i vs. list position. (b) Values of ω_i vs. list position. Notice that ω_i approaches 1 as the list position increases. This makes it increasingly difficult to represent order information. Reprinted with permission from Nigrin (1993).


to the positive input. Thus, the positive and negative input to the cell representing the ith item is approximately the same as what is seen by the cell representing the first item. As long as the negative input can be considered negligible, newly presented item nodes will reach approximately the same level of activity as earlier item nodes.

Just as the network achieves μ_i ≈ μ for all items in the TMS, the network also obtains ω_i roughly constant and greater than 1. This happens because the activity of nodes representing past events continually rises due to the self-excitation term (B_s − s_i)v_s s_i. As long as s_i remains much smaller than B_s, and while the negative input can be discounted, that rise occurs at an almost constant rate. This achieves the goal of having ω_i ≈ ω > 1 for all items in the TMS.

Furthermore, the invariance principle is satisfied. As long as s_i and s_j do not receive any further external input, the ratio between their activities will remain fixed forever. Presenting r_k will alter the relative activity between s_k and all other nodes. However, it will not alter the relative activities between any other nodes.

The size of the TMS can be varied by varying I_i, v_s, B_s, and K_16. In this network, there is no fundamental limitation on the maximum length of a list that can be represented. However, the ability to process longer lists does not come for free, since the dynamic range of the system must increase exponentially with the maximum list length. Consider that if ω = 2 and μ = 1, then after the presentation of the 10-item list r_1, r_2, ..., r_10, the activity of s_1 equals 2^9.
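Equation (5) can be simulated directly with the parameter set quoted above (A = 0, I_i = 0.007, v_s = 0.075, B_s = 50, each input line active for 0.4 time units, linear f). The Euler step size and the list length are my own choices:

```python
# Euler simulation of the shunting equation (5) with the parameters
# quoted in the text; the integration step dt is an assumption.
A, I_on, v, B, T, dt = 0.0, 0.007, 0.075, 50.0, 0.4, 0.001
n = 4
s = [0.0] * n

for item in range(n):                    # present r_1 ... r_n in turn
    for _ in range(int(T / dt)):
        total = sum(s)
        ds = []
        for i in range(n):
            I = I_on if i == item else 0.0
            excite = (B - s[i]) * (v * s[i] + I)     # on-center term
            inhibit = v * s[i] * (total - s[i])      # off-surround term
            ds.append(-A * s[i] + excite - inhibit)
        s = [si + dt * dsi for si, dsi in zip(s, ds)]

# Earlier items keep higher activity: a monotonically decreasing pattern.
assert s[0] > s[1] > s[2] > s[3] > 0.0
```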

7 Resetting Items Once They Can Be Classified

The previous section showed how an on-center off-surround architecture could create a spatial pattern that accurately represented a sequence of items. Unfortunately, this representation was accurate only for a limited number of items, since the activity of F^(1) eventually saturated. To allow unlimited numbers of items to be presented, it must be possible to reset active F^(1) nodes after they have been classified by an F^(2) node. (They must also be reset after the activity of the field saturates.)

Let us first discuss the reset of items once those items have been classified by some F^(2) nodes. This will be done through the use of feedback. To allow the feedback to affect only cells in F^(1) that are part of a classified pattern, the feedback signals will be gated by LTM weights that are modified by an equation similar to the one that modifies the feedforward weights. The equation that modifies the feedback weights is

(d/dt) z'_ji = ε x_j (−z'_ji + s̄_i x_j),   (6)


where ε is the LTM learning rate, x_j is the feedback signal from the jth F^(2) cell, z'_ji is the feedback weight from x_j to s_i, and s̄_i is the normalized activity of s_i (s̄_i = s_i / Σ_k s_k). When this equation is followed, the feedback weights become symmetric to the feedforward weights; consequently, anytime the feedforward weight from an F^(1) cell to an F^(2) cell is large, the corresponding feedback weight from the F^(2) cell to the F^(1) cell will also be large.
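A quick numerical check of equation (6): with x_j clamped at 1 and an illustrative stored pattern, the feedback weights relax to the normalized activities s̄_i, mirroring the feedforward pattern. The learning rate, step size, initial weights, and iteration count are all illustrative:

```python
# Equation (6) with x_j clamped on: each feedback weight z'_ji relaxes
# toward s_bar_i = s_i / sum(s).  Rates and counts are illustrative.
eps, x_j, dt = 0.5, 1.0, 0.05
s = [4.0, 2.0, 1.0, 0.0]              # stored F(1) pattern
s_bar = [v / sum(s) for v in s]
zf = [0.2, 0.2, 0.2, 0.2]             # feedback weights, arbitrary start

for _ in range(2000):
    zf = [w + dt * eps * x_j * (-w + sb * x_j) for w, sb in zip(zf, s_bar)]

# Equilibrium: z'_ji ~ s_bar_i, so feedback is large exactly where the
# feedforward input was large.
assert all(abs(w - sb) < 1e-6 for w, sb in zip(zf, s_bar))
```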

After the feedback weights have equilibrated to their desired values, it is possible to use rehearsal to process lists that are longer than the TMS. Recall that Section 5 showed how that could be done. That section showed how it was possible (in principle) to prevent F^(1) from becoming saturated, by resetting item representations once the items had been classified into some list. This section will show how this reset can be implemented.

The key obstacle to implementing this reset has to do with the restriction of using only local information, and it is as follows. The F^(1) patterns are classified at F^(2). However, reset must occur at F^(1). Since each node must operate on purely local information, how can the network know which F^(1) nodes to reset and which to keep active? For example, suppose the list ABC has been presented to F^(1) and that the sublist AB has been classified by x_AB. How can the network know, purely on the basis of local information, to reset s_A and s_B, while allowing s_C to remain active and be classified by some other F^(2) node?

The answer to this question depends on the fact that the feedback weights from each F^(2) cell assembly to F^(1) are modified in such a way as to become symmetric to the feedforward weights into that cell assembly. By virtue of this, those F^(1) cells that provide a large amount of input to some node x_j (by being part of the pattern classified by x_j) receive large amounts of feedback when x_j is active. Conversely, those cells that are not part of x_j's pattern do not receive large amounts of feedback.

It is this distinction that allows the network to know which cells to reset and which to keep active. This is most easily shown by diagramming the sequence of events during a classification, as is done in Figure 11. Suppose a pattern is active at F^(1), part of which can be classified by some F^(2) cell assembly x_j. Then, x_j will activate and send feedback to only those portions of the F^(1) pattern that are part of its classification. Once the activity of x_j has exceeded some threshold for a short period of time, classification will be considered to have occurred, and x_j will be reset (by a mechanism to be discussed in Nigrin (1990, 1993)).

This causes two things to occur. First, once x_j is shut off, lateral signals from it will no longer inhibit other assemblies. This will allow other cells to classify the remaining (or evolving) F^(1) pattern. Second, x_j will no longer send feedback to F^(1). It is this abrupt shutoff in feedback that provides F^(1) with enough information to know which cells to reset. Those


[Figure 11 appears here: four panels, (a)-(d), each plotting Activation and Expectancy at the two fields.]

FIGURE 11. Sequence of events during a classification. (a) Signals from F^(1) activate a node at F^(2). (b) An F^(2) node that represents the first two active items of F^(1) emits large feedback to F^(1). (c) Feedback to F^(1) ceases after that F^(2) node has reached some threshold necessary for classification. (d) The s cells that are part of the classified pattern are reset when feedback from F^(2) abruptly terminates. Reprinted with permission from Nigrin (1993).


8. Using SONNET 1 267

cells at F^(1) that previously received large amounts of feedback turn off once their feedback abruptly terminates. Only those portions of the F^(1) activity pattern not yet classified (that consequently never received large feedback from F^(2)) remain active. Then, other assemblies at F^(2) compete to classify this remaining pattern.

An ideal mechanism for incorporating this reset is called a gated dipole (Grossberg, 1982, 1987a, 1987b, 1988). However, to allow an easier implementation, SONNET 1 does not use the gated dipole. Instead, the following rule is used: whenever the feedback to an s cell drops abruptly by more than two thirds, that s cell resets itself to make room for new items at F^(1). By continually resetting F^(1) nodes once they have been classified, arbitrarily long lists can be processed.

This method works well after F^(2) has created many classifications. Unfortunately, it cannot be used before learning has taken place, since until then no F^(2) cell will fully activate. Consequently, some additional mechanism is needed to deal with the saturation of activity at F^(1). The simplest way to avoid this problem is for a network to reset the entire activity of F^(1) after many items have been presented and the field begins to saturate (Nigrin, 1993). (A more complicated procedure involves resetting only the high-activity nodes in F^(1).)
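The feedback-drop rule above can be sketched in a few lines. This is a minimal illustration, not the chapter's actual implementation; the two-thirds drop criterion comes from the text, while the cell record and all function names are hypothetical.

```python
# Minimal sketch of the reset rule: an s cell resets itself whenever the
# feedback it receives drops abruptly by more than two thirds.

def should_reset(prev_feedback: float, curr_feedback: float) -> bool:
    """True when feedback has abruptly dropped by more than two thirds."""
    if prev_feedback <= 0.0:
        return False  # the cell was receiving no feedback to lose
    return curr_feedback < prev_feedback / 3.0

class SCell:
    """An F^(1) item cell that tracks the feedback it last received."""
    def __init__(self, activity: float, prev_feedback: float):
        self.activity = activity
        self.prev_feedback = prev_feedback

    def update(self, feedback: float) -> None:
        if should_reset(self.prev_feedback, feedback):
            self.activity = 0.0  # reset to make room for new items
        self.prev_feedback = feedback

# s_A was part of a classified pattern, so its feedback terminates abruptly
# when the classifying node shuts off; s_C never received large feedback
# and therefore keeps its activity.
s_a = SCell(activity=1.0, prev_feedback=0.9)
s_c = SCell(activity=0.8, prev_feedback=0.05)
s_a.update(feedback=0.0)   # abrupt drop from 0.9: reset
s_c.update(feedback=0.04)  # no abrupt drop: stays active
print(s_a.activity, s_c.activity)  # 0.0 0.8
```

Because the criterion is purely a comparison between a cell's current and previous feedback, it respects the locality restriction discussed above.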

8 Properties of a Classifying System

The previous sections showed how to transform a sequence of temporal inputs into a spatial pattern of activity at F^(1). This section will discuss how these evolving spatial patterns are classified at F^(2). Since previous works (Nigrin, 1993) have discussed the motivation for and the construction of F^(2) in detail, this section will only briefly describe the properties that a classifying system should achieve. Some of the properties are important to any classifying system, while some are important only when processing temporal patterns. First, let us discuss the properties that are important to any classifying system.

A classifying network should be able to:

1. Self-organize using unsupervised learning. A network should be able to form its own categories in response to the invariances in the environment. This allows the network to operate in an autonomous fashion, without the need for an external teacher.

2. Form stable category codes. A network should be able to learn new categories without degrading previous categories that it established. This is one aspect of what is called the "stability-plasticity" dilemma. Networks that solve this dilemma can operate using both fast and slow learning (see next property). Conversely, those that do not are restricted to slow learning so as not to degrade previous categories (Grossberg, 1988). Networks that solve this dilemma can continue to operate when they encounter novel situations. Those that do not must be brought back into the lab and retrained on both the new and previous examples to ensure that new learning does not degrade existing categories.

3. Perform fast and slow learning. A network should be able to perform fast learning to allow it to classify patterns as quickly as in a single trial, when it is clear exactly what should be learned and it is important that the network learn quickly. (For example, one should not have to touch a hot stove 500 times before learning that one will be burned.) Furthermore, a network should also be able to perform slow learning to allow it to generalize over multiple different examples.

4. Operate under the presence of noise. Networks should be able to operate in more than just laboratory conditions; they should also be able to operate in real-world environments. This requires the ability to operate in the presence of noise. Noise can occur in three different areas. It can be present within an object, within the background of the object, and within the components of the system. A network must be able to handle noise in all of these areas.

5. Scale well to large problems. There are at least two aspects to this property. First, as the size of a problem grows, the size of the required network should not grow too quickly. Second, as the number of different patterns in a training set increases, the number of required presentations for each pattern (to obtain successful classifications) should not increase too rapidly.

6. Create arbitrarily coarse or tight classifications. Patterns in a category often differ from the prototype (average) of the category. A network should be able to vary the acceptable distortion from the prototype in at least two ways. It should be able to globally vary the overall error that is acceptable. The network should also be able to allow different amounts of variance at different dimensions of the input pattern (the different input lines). This would allow the network to create categories that are more complex than just the nearest-neighbor variety.

There are also a few properties that are especially relevant to processing temporal patterns. These properties are the ability to perform real-time operations and the ability to classify patterns that are embedded within larger patterns (this property is also important for static patterns). These properties will be discussed in greater detail in the following two sections.

8.1 Real-Time Operation

Classifying networks should achieve real-time operation. To make this discussion clear, the term "real time" must be defined. In this chapter, this term will be used in a more restrictive sense than that used in Carpenter and Grossberg (1987a, 1987b). There, the term real time was defined to be equivalent to no off-line processing. For example, when some pattern is presented to an ART network, that pattern is sustained until the network equilibrates. Then another pattern is presented, and so forth.

When dealing with purely spatial patterns, this is an adequate formulation. However, when dealing with temporal patterns, a more restrictive definition of the term real time is necessary. Here, we will say that a network operates in real time when the network performs its classifications at the correct pace (neither too slowly nor too quickly) in response to the continuous evolution of input patterns. For example, suppose the sequential presentation of the letters T, H, and E has been classified by a network into a category for the word THE. Then, when the list THEDOGRAN is sequentially presented, the network should classify the pattern THE shortly after the E has been presented. Otherwise, difficulties would result. If classification occurred too slowly, then the network would not be able to keep up with the items, and eventually the STM buffer would overflow. Conversely, if classification occurred too quickly (for example, immediately after the letter T), then after the word THE was learned, it would be difficult for the network to learn longer patterns like THEY.

Real-time processing must continue to occur at the correct pace regardless of how much or how little competition occurs between different categories. This is not necessarily trivial to achieve. For example, suppose that the only classification made by a network was of the sequence HER, by some cell x_HER. In this case, the equilibration time for the network must be calibrated so that x_HER classifies HER shortly after the R is presented. However, suppose that after some period of time the network also learns a new sequence, HE, by the node x_HE. Even though x_HER now competes with this new node, the network must still operate at the correct pace, with x_HER still classifying HER shortly after the R has been presented. This must also occur even after many additional categories like HERO, HERD, HEN, and HELP have been learned by the network. While slight variations in classification time are acceptable, that time should increase only slightly due to the increased competition.

To allow real-time operation, several changes need to be made in the operation of a classifying field. One important change concerns the manner by which input combines at F^(2). Typically, the total gated input to the ith classifying cell (I_i) is given by the dot product rule: I_i = Σ_j s_j z_ji. Unfortunately, if this rule is used, real-time operation will be very difficult to achieve.

Consider the following situation. Let the sequentially presented inputs r_1, r_2, ..., r_n form a decreasing pattern of activity across the F^(1) nodes s_1, s_2, ..., s_n. Let the activity of F^(1) be normalized and let s_j = 2 s_{j+1}. By equation 1, once the F^(2) node x_i has chunked this list, its LTM weights will become parallel to the STM activity across F^(1), and the total LTM at x_i will also become normalized.

Let us consider how the presence or absence of the last item in the list will affect the percent change in total input to x_i. Since s_j = 2 s_{j+1}, s_1 = 2^(n-1) s_n. Furthermore, since input is gated by the LTM weights at x_i, the total input to x_i from s_1 is 2^(2(n-1)) times greater than that from s_n (s_1 z_1i = 2^(2(n-1)) s_n z_ni). Thus, the percent increase in total input to x_i due to s_n is less than 1/2^(2(n-1)). If n = 2, then this is acceptable, since the second item increases input to x_i by 25%. However, if n = 4 (a reasonable-size chunk), the fourth item increases the input to x_i by less than 1.56%.

Thus, as the lists get longer, the F^(1) cells representing later items will have increasingly less significance. This will make it very difficult for the network to distinguish between lists that differ only in the last item (for example, PAR and PART). For long lists, small amounts of noise could easily cause errors. Furthermore, these tiny differences in the amount of input received make real-time operation very difficult. This follows since different nodes that receive very similar amounts of input will need to equilibrate for long periods of time before the small differences in input result in large differences in activity.
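The arithmetic above can be checked directly. Assuming activities that halve from item to item and LTM weights grown parallel to them (so item j's gated contribution scales as the square of its activity), the last item's percentage contribution shrinks as claimed:

```python
# Quick check of the scaling argument: with s_j = 2 * s_{j+1} and weights
# parallel to the activities, item j's gated contribution is proportional
# to 4**(n - j). The function returns the percent increase in total input
# contributed by the last of n items.

def last_item_percent_increase(n: int) -> float:
    contributions = [4.0 ** (n - j) for j in range(1, n + 1)]
    return 100.0 * contributions[-1] / sum(contributions[:-1])

print(round(last_item_percent_increase(2), 2))  # 25.0
print(round(last_item_percent_increase(4), 2))  # 1.19, under the 1/2**6 = 1.56% bound
```

For n = 2 the last item adds exactly 25%, while for n = 4 it adds barely over 1%, which is why the dot product rule alone cannot separate PAR from PART quickly.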

One solution to this problem that can easily be dismissed is to decrease the difference in activation between F^(1) nodes representing successive items. This solution can be dismissed because it would obscure the differences between lists that are permutations of the same items (such as LEFT and FELT).

Thus, to solve this problem, SONNET 1 resorts to a nonlinear equation to generate the total input to a cell. There are three basic properties that this nonlinear rule must achieve. First, when a large pattern is active, the input to a cell that classifies that pattern should be larger than the input to a cell that classifies a subset of that pattern. For example, when a pattern like ABC is presented, x_ABC should receive more input than x_AB.

Second, when only a subset of the pattern coded by an F^(2) cell is active, the input to that F^(2) cell should be reduced. This reduction must be done in the following way. If some pattern is presented that is a subset of a larger pattern, then the input to a cell that codes exactly the active pattern should be greater than the input to a cell that classifies a superset of the active pattern. For example, if AB is presented, x_AB should receive more input than x_ABC.

Third, even when some weight is (reasonably) small, it should still be possible for input that is gated by that weight to affect the total input received by the classifying cell. For example, suppose some cell x_i classifies ABCD. Then, even if z_Ai is eight times greater than z_Di, the presence or absence of the item D should significantly influence the total input received by x_i.

One rule that satisfies these constraints has been implemented in SONNET 1. It is presented below only to illustrate one possible way to satisfy the constraints above. Other rules are possible, and in fact improvements to this rule have been proposed in Nigrin (1993).

Let I_i be the total input to the ith cell in F^(2). Then I_i is given by

    I_i = I_i^d I_i^m,                                  (7)

where

    I_i^d = Σ_{j∈T_i} S_ji Z_ji,                        (8)

    I_i^g = max( Π_{j∈T_i} (S_ji / Z_ji), K_8 ),        (9)

    I_i^m = K_1 + K_2 min(1, I_i^g),                    (10)

where K_1 and K_2 are constants such that K_1 < 1 and K_1 + K_2 > 1. K_8 is a constant that is used to prevent the value of I_i^g from ever getting too small. (K_8 is actually an ad hoc constant that contributes to weaknesses in the current implementation.) Z_ji is the normalized LTM weight given by

    Z_ji = z_ji / Σ_{k∈T_i} z_ki.

Similarly, S_ji is the normalized input from s_j to x_i, given by

    S_ji = s_j / Σ_{k∈T_i} s_k.

In these equations, T_i is the set of indices of F^(1) cells in x_i's classified pattern. For example, if x_i has classified a pattern composed of activity in s_3 and s_5, then T_i = {3, 5}. The quantity T_i is needed for the following reason.


Recall that it was stated above that inputs gated by reasonably small weights should be able to affect the total input to a classifying cell. However, inputs gated by extremely small weights should be ignored; otherwise, small fluctuations in noise could dramatically change the total input received by a cell. T_i is determined by a cutoff used to discriminate which weights are significant and which are not. Its method of computation was illustrated in Nigrin (1990, 1993).

The quantities I_i^d and I_i^m serve different functions. The quantity I_i^d increases as the weights increase. It is exactly the dot product between the normalized input vector and the normalized weight vector. Notice that the input is normalized over only the set of F^(1) cells in T_i. This is done so that activity in F^(1) cells not in the classified pattern of an F^(2) assembly will not affect the input to the assembly. This allows patterns to be easily classified even when they are embedded within arbitrarily large patterns.

The quantity I_i^m compares how well a node's LTM weights match the current F^(1) pattern: it compares the normalized input vector to the normalized weight vector. The use of I_i^m allows the presence or absence of activities that are coded by small weights to affect the total input received by the F^(2) cell.
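The division of labor between the dot-product and match terms can be sketched numerically. The sketch below is only an illustration of the idea just described, not SONNET 1's exact equations (see Nigrin, 1993); the constant values and the use of a product of activity-to-weight ratios as the match measure are assumptions.

```python
# Illustrative sketch (not the exact SONNET 1 rule): total input combines a
# dot-product term I_d, which rewards large weights, with a match term I_m,
# which penalizes missing items even when they are coded by small weights.
from math import prod

K1, K2, K8 = 0.5, 0.6, 1e-3  # assumed values with K1 < 1 and K1 + K2 > 1

def total_input(s, z, T):
    """s: F^(1) activities; z: LTM weights into the cell; T: significant indices.
    Assumes at least one item in T is active and every weight in T is positive."""
    s_sum = sum(s[j] for j in T)
    z_sum = sum(z[j] for j in T)
    S = {j: s[j] / s_sum for j in T}  # input normalized over T only
    Z = {j: z[j] / z_sum for j in T}  # normalized LTM weights
    I_d = sum(S[j] * Z[j] for j in T)            # dot-product term
    I_g = max(prod(S[j] / Z[j] for j in T), K8)  # ratio-product match, floored
    I_m = K1 + K2 * min(1.0, I_g)
    return I_d * I_m

# With only AB active, the cell coding exactly AB beats the superset cell
# coding ABC, because the missing item C collapses the match term.
s_active = {"A": 2.0, "B": 1.0, "C": 0.0}
x_ab = total_input(s_active, {"A": 2.0, "B": 1.0}, ("A", "B"))
x_abc = total_input(s_active, {"A": 4.0, "B": 2.0, "C": 1.0}, ("A", "B", "C"))
print(x_ab > x_abc)  # True
```

Note how the missing item C, even though it carries the smallest weight, reduces the superset cell's input by roughly half through the match term, something the dot product alone could never do.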

8.2 The Classification of Embedded Patterns

An additional property that classifying networks should achieve is the ability to classify patterns that are surrounded by extraneous information. This is essential in areas such as continuous speech, where there are usually no clear-cut boundaries between words. One way a network can deal with the extraneous information is to use both inhomogeneous nodes and a nonuniform pattern of connectivity between the nodes. In SONNET 1, the nodes evolve to have different input/output characteristics, and the connectivity pattern evolves so that nodes inhibit only other nodes that classify similar patterns.

One possible justification for the necessity of inhomogeneous nodes concerns the predictive power of the classifying nodes and has been discussed elsewhere (Cohen and Grossberg, 1986, 1987; Marshall, 1990a, 1990b, 1992, 1995; Nigrin, 1990, 1993). Another justification arises if we analyze the structure that a network must have if it is to satisfy two simple constraints: (1) the network should be able to classify patterns that are surrounded by extraneous information; and (2) the network should be able to make clear-cut decisions.

For example, suppose some F^(2) cell x_CAR represents the pattern CAR. The first constraint implies that x_CAR should receive the full input that is possible for it, even when additional items like I or S are present in an input pattern like CARIS. Otherwise, if the presence of extraneous items reduced the input to x_CAR significantly, then x_CAR would not be able to activate when its pattern was embedded in larger patterns (as is often the case in speech signals).

The second constraint implies that when multiple classifications are competing for an input pattern, the network should choose whichever cell best represents the pattern and allow that cell to fully activate, while suppressing the activity of other cells. For example, if CAR is presented to a network that has the classifications x_CAR and x_CARGO, then x_CAR should fully activate and x_CARGO should be suppressed, even though x_CARGO partially represents the input pattern. Conversely, when CARGO is presented, x_CARGO should fully activate and x_CAR should be suppressed. This is true even though the pattern that x_CAR represents is entirely present, and therefore (by the first constraint) x_CAR must receive the full input that is possible for it!

To allow a single network to be able to satisfy both constraints simultaneously, it must have some kind of inhomogeneity in the structure of its classifying cells. One possible inhomogeneity that solves the problem involves the use of different cell "sizes," with larger cells classifying larger patterns and smaller cells classifying smaller patterns. Larger cells dilute their input (both excitatory and inhibitory) to a greater degree than do smaller cells. Thus, they are difficult to turn on, and they respond well only to larger patterns. However, once the larger cells are activated, they are difficult to turn off, and thus they inhibit smaller cells more easily than the reverse. For example, when the word CAR is presented, x_CARGO does not receive enough input to activate, thus allowing x_CAR to activate. However, when the word CARGO is presented, the node x_CARGO receives enough input to activate, and through unequal competition it can suppress (mask out) the activity of x_CAR.
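The effect of input dilution by cell size can be shown with a toy calculation. The size measure (the number of items in a cell's classified pattern) and the activation threshold below are hypothetical stand-ins for the dilution mechanism described above, not the chapter's actual dynamics.

```python
# Toy illustration of input dilution by cell "size": a cell classifying a
# larger pattern divides its net input over more sites, so it crosses its
# activation threshold only when most of its pattern is present.

THRESHOLD = 0.8  # assumed activation threshold

def activates(active_items: set, pattern: set) -> bool:
    overlap = len(active_items & pattern)
    return overlap / len(pattern) >= THRESHOLD  # dilution over cell size

car, cargo = set("CAR"), set("CARGO")
print(activates(set("CAR"), car))      # True:  x_CAR sees its full pattern
print(activates(set("CAR"), cargo))    # False: x_CARGO's input is too dilute
print(activates(set("CARGO"), cargo))  # True:  x_CARGO now activates
```

The asymmetric masking (an active x_CARGO suppressing x_CAR, but not the reverse) would then follow from the larger cell also being harder to turn off once active.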

A second reason to prefer inhomogeneous nodes is called the temporal chunking problem (Grossberg, 1982, 1987b, 1988). Suppose that some pattern ABCD is presented at F^(1). Furthermore, suppose that all the subparts of that pattern already exist as classifications, so that different F^(2) nodes already code the patterns A, B, C, and D. If the F^(2) nodes were homogeneous, then the pattern ABCD would continually be processed as subparts instead of eventually being treated as a unified whole. (A more realistic example occurs when the network should learn the word CARGO, even after it has established categories for CAR and GO.) To prevent this, there must be some mechanism that favors the formation of larger categories.

A second area of nonuniformity in the structure of the classifying field concerns the inhibitory connections within the field (Cohen and Grossberg, 1986, 1987; Marshall, 1990a, 1990b, 1992, 1995; Nigrin, 1990, 1993). In SONNET 1, nodes compete only with other nodes that attempt to classify similar patterns. This nonuniformity increases the power of the network, as the following example shows. Suppose that the lists AB, CD, and ABC have been learned. (Consider these lists to be abstractions for the spoken words ALL, ALTER, and TURN.) When ABC is presented, x_ABC should activate and x_AB and x_CD should be inhibited. However, when ABCD is presented, the reverse should be true. The list should be segmented as AB and CD, with x_ABC inhibited, since it is not part of the segmentation.

This will not happen if the connections are homogeneous. Since x_ABC must activate whenever ABC is presented, it must be true that neither x_AB nor x_CD can individually suppress the activity of x_ABC. When ABCD is presented, only by combining inhibition can x_AB and x_CD possibly mask out x_ABC. However, if the connections are uniform, then x_AB and x_CD will inhibit each other as much as they inhibit x_ABC. Consequently, x_ABC will activate, even for ABCD.

To remedy this, F^(2) nodes should inhibit only other nodes in F^(2) that respond to similar patterns, thus allowing multiple smaller nodes to combine and overpower larger ones. In the example above, x_ABC should compete with both x_AB and x_CD, but x_AB and x_CD should not compete with one another. (Another advantage of using nonuniform connections is that they allow a network to classify multiple patterns simultaneously. This is a great advantage when a network is forced to operate in complex, unsegmented environments.)
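The connectivity pattern just described can be sketched as a toy rule. The overlap criterion and node names below are illustrative assumptions; in SONNET 1 the inhibitory connectivity evolves through learning rather than being computed from known patterns.

```python
# Toy illustration of nonuniform competition: a node inhibits only nodes
# whose classified patterns overlap its own, so x_AB and x_CD spare each
# other while both inhibit x_ABC.

def competitors(nodes: dict) -> dict:
    """Map each node to the set of nodes it inhibits (overlapping patterns)."""
    return {
        name: {other for other, p in nodes.items()
               if other != name and p & pattern}
        for name, pattern in nodes.items()
    }

nodes = {"x_AB": set("AB"), "x_CD": set("CD"), "x_ABC": set("ABC")}
rivals = competitors(nodes)
print(sorted(rivals["x_AB"]))   # ['x_ABC']: x_AB does not inhibit x_CD
print(sorted(rivals["x_ABC"]))  # ['x_AB', 'x_CD']: inhibited from both sides
```

Because x_AB and x_CD spare each other, their combined inhibition can mask out x_ABC when ABCD is presented, yielding the segmentation AB plus CD.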

9 Simulations

This section will present some simulations to illustrate the operation of the network. This will show that the network operates as previously indicated. All the simulation equations and parameters are described in Nigrin (1993). Patterns were presented to the network, as was described in the introduction. Items were presented one at a time with fixed intensities. Immediately after the presentation of one item was finished, the next item was presented. Thus, the patterns consisted entirely of sequences, and issues involving different rhythms or input intensities were completely avoided.

Since Section 6 demonstrated the behavior of F^(1) after it was presented a sequence of inputs, the only thing left to do is examine the behavior of F^(2). This will be done with five simulations. (The reader should note that SONNET 1 was simulated more extensively on static spatial patterns than on temporal patterns. Therefore, the simulations in this section should be treated as preliminary. Furthermore, due to the minimal analysis, it is likely that the network's behavior on temporal patterns could easily be improved.)

1. The first simulation will deal with the STM response of F^(2). It will demonstrate that the network can respond in a real-time fashion to the presentation of a single list. To ensure that the parameters have not been optimized for different situations, I will compare the network's behavior when it has classified a single list against the network's behavior when it has classified multiple similar lists. This will show how the presence of similar classifications affects the dynamics of the competition.

[Figure 12 appears here: the activity of c_ABC plotted as the items A through F are presented.]

FIGURE 12. Response of the cell c_ABC to the list ABCDEF. In this and the next three figures, the labels on the x-axis refer to the time immediately after each item has been presented. Furthermore, the s cells were not reset, either because of a classification or because of saturation. Reprinted with permission from Nigrin (1993).

Figure 12 shows the response of c_ABC to the list ABCDEF, when ABC is the only pattern that has been classified by the network. Notice that c_ABC does not fully activate until after the D has been presented.

The network's response was then tested in the situation where two similar categories existed at F^(2). Figure 13 shows the network's response to the list ABCDEF when categories exist for both AB and ABC. Notice that early in the presentation, c_AB has a competitive advantage. However, this advantage is quickly eliminated once c_ABC sees its full pattern. Notice also that c_ABC takes slightly longer to equilibrate than in the previous case (reaching its full value after ABCDE rather than after ABCD).


[Figure 13 appears here: the activities of c_AB and c_ABC plotted as the items are presented.]

FIGURE 13. Response of the cells c_AB and c_ABC to the list ABCDEF. Reprinted with permission from Nigrin (1993).

The final test involved the addition of three more categories to see if this would further slow down the processing of the network. As can be seen in Figure 14 and Figure 15, this had little effect on the output of c_ABC.

2. Since the previous simulation showed that the STM dynamics of the network could operate in real time, the remaining simulations tested whether or not the network could learn in real time. This simulation showed that the network could learn a list even when it was embedded within a larger list. The following sequence was presented to the network, where after the last item in the sequence was presented, the first item was presented again. (Thus, immediately after item 23 was presented, item 0 was presented.)

0 1 2 3 4 5 6 7 8 9 0 1 2 10 11 12 13 14 15 16 0 1 2 17 18 19 20 21 22 23

Here, each item is represented by a number instead of a letter, since in later simulations more items will exist than there are letters in the alphabet. In this simulation, the list (0, 1, 2) occurs embedded in three different contexts. Thus, the network should learn to recognize that list as a significant category. With the parameters set as described in Nigrin (1993), this occurred in about 10 presentations of the full list (therefore (0, 1, 2) occurred about 30 times). The values for the weights from items 0, 1, and 2 equilibrated to 0.85, 0.46, and 0.24, respectively.

[Figure 14 appears here: the activities of the five cells listed in the caption, plotted as the items A through F are presented.]

FIGURE 14. Response of the cells c_AB, c_ABC, c_ACB, c_BCA, and c_ABCG to the list ABCDEF. Reprinted with permission from Nigrin (1993).

After learning had occurred, the F^(1) items 0, 1, and 2 were reset after they were classified. This occurred approximately three items after the list (0, 1, 2) was completely presented. For example, if the list 0 1 2 3 4 5 6 7 8 was presented, the items 0, 1, and 2 were reset while item 5 was being presented.

The network was robust in its behavior. The list (0, 1, 2) was reliably learned, and those items were reliably reset after they were classified. However, if the presentation of the full list continued, additional information was usually learned. For example, after learning (0, 1, 2), the network might later learn to classify the list (16, 0, 1, 2). This occurred since each longer list occurred in exactly one context. Therefore, since the network was designed to solve the temporal chunking problem, the continued presentation of a longer list could eventually cause the shorter list to be overshadowed.

[Figure 15 appears here: superposition of c_ABC's activity traces from runs with 1, 2, and 5 categories.]

FIGURE 15. Superposition of c_ABC's activity in each of the last three figures. Figure legends indicate the total number of categories that were present at F^(2) in the different simulation runs. Reprinted with permission from Nigrin (1993).

3. This simulation demonstrated that multiple lists in a training set could be learned by the network. The network was presented the following training set, in which the lists (0, 1, 2) and (24, 25, 26) were embedded within three contexts:

0 1 2 3 4 5 24 25 26 6 7 8 9 0 1 2 10 11 12 13 24 25 26 14 15 16 0 1 2 17 18 19 24 25 26 20 21 22 23

Both lists were classified by the network in approximately 10 trials. (In one simulation, (0, 1, 2) was classified on the 11th trial and (24, 25, 26) was classified on the 10th trial.) Both the learning of the lists and the resetting of F^(1) items were robust. However, just as in the preceding simulation, after these lists were learned, longer lists that contained these shorter lists were classified by the network.

4. This simulation was very similar to the previous one. It involved the use of exactly the same parameters as in the preceding example, and demonstrated that the network could learn lists of different lengths. The network was presented the following training set, in which the lists (0, 1) and (24, 25, 26, 27) were each embedded within four contexts.


0 1 3 4 5 24 25 26 27 6 7 8 9 0 1 10 11 12 13 24 25 26 27 14 15 16 0 1 17 18 19 24 25 26 27 20 21 22 23 0 1 28 29 30 24 25 26 27 31 32 33 34

In this case, the list (0, 1) was classified on the 8th trial, and the list (24, 25, 26, 27) was classified on the 9th trial. While this simulation showed that it was possible for the network to classify lists of different lengths, the network needed more contexts than for the three-item lists. (Even with four contexts, the network occasionally made errors.) This was especially true in the case of the two-item list, since the F^(1) representation of that list was a smaller percentage of the total activity at F^(1) (see discussion in Nigrin (1993), Chapter 3).

5. The final simulation tested what the network's response would be to the repeated presentation of the following set:

0 0 0 0 27 27 27 27 33 33 33 33 45 45 45

1 1 1 1 28 28 28 28 34 34 34 34 46 46 46

2 2 2 2 29 29 29 29 35 35 35 35 47 47 47

3 10 17 39 3 10 17 39 3 10 17 39 3 10 17

4 11 18 40 4 11 18 40 4 11 18 40 4 11 18

5 12 19 41 5 12 19 41 5 12 19 41 5 12 19

24 24 24 24 30 30 30 30 36 36 36 36 48 48 48

25 25 25 25 31 31 31 31 37 37 37 37 49 49 49

26 26 26 26 32 32 32 32 38 38 38 38 50 50 50

6 14 20 42 6 14 20 42 6 14 20 42 6 14 20

7 15 21 43 7 15 21 43 7 15 21 43 7 15 21

8 16 22 44 8 16 22 44 8 16 22 44 8 16 22

45 46 47 39 40 41 48 49 50 42 43 44

Close examination of this training set will reveal that it contains 16 three-item lists, each of which occurs in four different contexts. This training set was repeatedly presented to the network. By the 16th epoch, the network had learned all the lists that were present in the training set.

The network created the following categories: On the 6th epoch, the network learned the lists (0, 1, 2), (24, 25, 26), (14, 15, 16), (45, 46, 47), and (48, 49, 50) (in that order). On the 7th epoch, the network learned (3, 4, 5), (17, 18, 19), and (39, 40, 41). On the 8th epoch, the network learned the lists (30, 31, 32), (27, 28, 29), (6, 7, 8), and (42, 43, 44). On the 9th epoch, the network learned the list (10, 11, 12).


On the 13th epoch, the network learned the list (20, 21, 22). Finally, on the 16th epoch, the network learned the lists (36, 37, 38) and (33, 34, 35).

Thus, in this simulation run, the network learned to perfectly segment the list. However, this was not true of all simulation runs. Using the same parameters, the network sometimes correctly classified as few as 13 of the 16 lists in the training set. For the lists that were incorrectly handled, there were two possibilities: either the list was not classified at all, or the classification made by the network was erroneous. (For example, the classification (0, 1, 2, 3) would be considered incorrect.)

The main reason the network had difficulty is that while the network could classify multiple lists concurrently, it could learn only one list at a time. Thus, since the list (3, 4, 5) is embedded between (0, 1, 2) and (24, 25, 26), the nodes that classified (0, 1, 2) and (24, 25, 26) could interfere with any node attempting to classify (3, 4, 5). It was for this reason that the training set consisted of lists that were all of the same length. Simulations conducted by replacing some of the lists in the training set with lists of lengths 2 and 4 have shown that variable-length lists could also be learned, though due to the differences in list size, even more interference was created during learning.

10 Discussion

This chapter has examined one fundamental issue—designing a neural network that could learn to segment arbitrarily long temporal patterns in real time. To allow the problem to be tackled at all, it was simplified to its bare essentials. A network was presented a continuous stream of input items and was required to learn to segment them into significant chunks.

This task was broken down into two subproblems. The first subproblem was to convert a sequence of temporal events into ever-expanding spatial patterns. The second subproblem was to classify the evolving spatial patterns. Thus, the neural network was divided into two fields of cells. The input field F^(1) transformed a sequence of temporal events into a spatial pattern of activity, and the output field F^(2) classified those patterns. (Nigrin (1990, 1993) discussed how the architectures at F^(1) and F^(2) could be combined to allow the addition of extra layers above F^(2).)

Within this framework, it became possible to present at least six guidelines for the construction of networks to solve the segmentation problem. (Additional constraints were presented in Nigrin (1990, 1993).) They are as follows: (1) Any transformation of a sequence of events to a spatial pattern at F^(1) should follow the LTM invariance principle. (2) When representing a sequence of events, a monotonically decreasing pattern is a better choice than a monotonically increasing pattern. (3) One possible method for satisfying the previous two constraints is to use an on-center off-surround architecture at F^(1). (4) Rehearsal should be used to reset the activity of F^(1) nodes whose activity has been classified into a category at F^(2). (5) To allow it to operate in real time, F^(2) should combine its inputs in a nonlinear fashion instead of with the traditional dot product rule. (6) The architecture of F^(2) should be nonuniform, to allow it to deal with patterns that are subsets and supersets of one another (context sensitivity).
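Guidelines (1) and (2) can be made concrete with a small sketch. The toy encoding below is my illustration, not SONNET's actual activation equations, and the decay ratio is an arbitrary assumption: each arriving item is stored at a fixed fraction of the previous item's activity, so the stored pattern is monotonically decreasing, and the ratio between any two already-stored items is untouched by later arrivals, in the spirit of the LTM invariance principle.

```python
# Hypothetical toy encoding of an input field F(1) (not the SONNET equations):
# each arriving item is stored at a fixed fraction of the previous item's
# activity.  The resulting pattern is monotonically decreasing (guideline 2),
# and later arrivals never change the ratio between earlier items, in the
# spirit of the LTM invariance principle (guideline 1).

def present(sequence, ratio=0.7):
    """Return {item: activity} after presenting the items in order."""
    field, activity = {}, 1.0
    for item in sequence:
        field[item] = activity
        activity *= ratio          # the next item will be stored more weakly
    return field

f = present("ABC")
print(f)                           # A strongest, then B, then C
```

Note that, consistent with the restriction discussed below, this sketch assumes no item repeats within a list.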

When these guidelines are observed, it becomes possible for a network to learn to segment temporally presented sequences of items. However, significant limitations still exist. One restriction is that items cannot be repeated within the same list. (The network can process the list ABC but not ABA.) Another limitation is that while the network can classify multiple patterns simultaneously, it can learn only a single pattern at a time. These problems are addressed in a network called SONNET 2 (Nigrin, 1990, 1993) by changing the manner in which classifying nodes interact. In current implementations, competition is implemented by having the classifying nodes at F^(2) compete for the right to classify input signals on active input lines. This will be changed in SONNET 2. There, competition will be implemented by having the input lines compete for the right to activate their respective classifying nodes at F^(2). Analysis has indicated that this change will dramatically increase the power of the classifying network.

11 References

Sven Anderson, John Merrill, and Robert Port. 1988. Dynamic speech categorization with recurrent networks. Technical Report 258, Indiana University, Bloomington, IN.

Gail Carpenter and Stephen Grossberg. 1987a. A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37:54-115.

Gail Carpenter and Stephen Grossberg. 1987b. ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26(23):4919-4930.

Michael Cohen and Stephen Grossberg. 1986. Neural dynamics of speech and language coding: Developmental programs, perceptual grouping, and competition for short-term memory. Human Neurobiology, 5(1):1-22.

Michael Cohen and Stephen Grossberg. 1987. Masking fields: A massively parallel neural architecture for learning, recognizing, and predicting multiple groupings of data. Applied Optics, 26:1866-1891.

Garrison W. Cottrell. 1985. Connectionist parsing. In Proceedings of the Cognitive Science Society, pp. 201-211.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14:179-211.

Jeffrey L. Elman and David Zipser. 1988. Discovering the hidden structure of speech. Journal of the Acoustical Society of America, 83:1615-1626.

Stephen Grossberg. 1973. Contour enhancement, short term memory and constancies in reverberating neural networks. Studies in Applied Mathematics, 52:217-257.

Stephen Grossberg. 1978. A theory of human memory: Self-organization and performance of sensory-motor codes, maps, and plans. In R. Rosen and F. Snell, editors, Progress in Theoretical Biology, Vol. 5, Academic Press, New York.

Stephen Grossberg. 1982. Studies of Mind and Brain: Neural Principles of Learning, Perception, Development, Cognition, and Motor Control. Reidel Press, Boston.

Stephen Grossberg. 1985. The adaptive self-organization of serial order in behavior: Speech, language, and motor control. In E. C. Schwab and H. C. Nusbaum, editors, Pattern Recognition by Humans and Machines, Vol. 1: Speech Perception, Academic Press, New York.

Stephen Grossberg. 1987a. The Adaptive Brain, I: Cognition, Learning, Reinforcement, and Rhythm. Elsevier Science Publishing Company Inc., North Holland, Amsterdam.

Stephen Grossberg. 1987b. The Adaptive Brain, II: Vision, Speech, Language, and Motor Control. Elsevier Science Publishing Company Inc., North Holland, Amsterdam.

Stephen Grossberg. 1988. Neural Networks and Natural Intelligence. MIT Press, Cambridge, MA.

S. J. Hanson and J. Kegl. 1987. Parsnip: A connectionist network that learns natural language grammar from exposure to natural language sentences. Ninth Annual Conference of the Cognitive Science Society, Seattle, Washington. Erlbaum Associates, Hillsdale, NJ.

Michael I. Jordan. 1986. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pp. 431-546, Erlbaum Associates, Hillsdale, NJ.

Jonathan A. Marshall. 1990a. A self-organizing scale-sensitive neural network. In International Joint Conference on Neural Networks, Vol. 3, pp. 649-654, San Diego.

Jonathan A. Marshall. 1990b. Representation of uncertainty in self-organizing neural networks. In International Conference on Neural Networks, pp. 809-812, Paris, France.

Jonathan A. Marshall. 1992. Development of perceptual context-sensitivity in unsupervised neural networks: Parsing, grouping and segmentation. In International Joint Conference on Neural Networks, Vol. 3, pp. 315-320, Baltimore, MD.

Jonathan A. Marshall. 1995. Adaptive perceptual pattern recognition by self-organizing neural networks: Context, uncertainty, multiplicity and scale. Neural Networks, 8:335-362.

Risto Miikkulainen and Michael Dyer. 1991. Natural language processing with modular PDP networks and distributed lexicon. Cognitive Science, 15:343-399.

Albert Nigrin. 1990. The Stable Learning of Temporal Patterns with an Adaptive Resonance Circuit. Ph.D. thesis, Duke University.

Albert Nigrin. 1993. Neural Networks for Pattern Recognition. MIT Press, Cambridge, MA.

Terrence J. Sejnowski and C. R. Rosenberg. 1987. Parallel networks that learn to pronounce English text. Complex Systems, 1:145-168.

M.F. St. John and J. L. McClelland. 1990. Learning and applying contextual constraints in sentence comprehension. Artificial Intelligence, 46:217-258.

Chen Sung and W. Jones. 1988. Temporal pattern recognition. In IEEE 1988 International Conference on Neural Networks, Vol. I, pp. 689-696.

Chen Sung and W. Jones. 1990. A speech recognition system featuring neural network processing of global lexical features. In IJCNN 1990 Proceedings of the International Joint Conference on Neural Networks, Vol. II, pp. 437-440.

David W. Tank and John J. Hopfield. 1987. Neural computation by concentrating information in time. Proc. Natl. Acad. Sci. USA, 84:1896-1900.

K. P. Unnikrishnan, J. J. Hopfield, and D. W. Tank. 1992. Speaker-independent recognition using a neural network with time-delayed connections. Neural Computation, 4:108-119.

A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. 1989. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37:328-339.

Chapter 9

On the Use of High-Level Petri Nets in the Modeling of Biological Neural Networks

Kurapati Venkatesh, Abhijit Pandya, and Sam Hsu

ABSTRACT High-level Petri nets (HPNs)—a class of Petri nets (PNs)—are powerful and versatile tools for modeling, simulating, analyzing, designing, and controlling complex asynchronous concurrent systems. Some of the important applications of HPNs can be found in addressing problems related to computer hardware and software and flexible factory automation. In this research, an initial attempt is made to model biological neural networks (BNNs) with HPNs, since the interactions among neurons are basically asynchronous and concurrent in nature. Even though there are many types of HPNs reported in the literature, none have the constructs to model BNNs. Hence a new class of HPNs is proposed in this chapter. With this aim, the analogies between BNNs and HPNs are explored. The detailed procedure of PN modeling is elucidated by modeling the mammalian olfactory bulb. By studying the dynamic behavior of Petri net models (PNMs), temporal dynamics and time-varying pattern recognition in BNNs can be investigated. Various timing durations can be associated with the transitions in PNMs to study the temporal characteristics regarding pattern recognition. This is achieved by associating three timing functions with each transition in a PNM.

1 Introduction

Modeling of biological neural networks is paramount for a clear understanding of the functioning of the brain. This is corroborated by the DARPA report [1], which emphasizes that researchers should explore biological models to guide experimental work. Models with these aims have been actively pursued [2, 3]. As the field of neural networks uses theoretical results and insights from many research areas, the DARPA report [1] also emphasizes the need for hierarchical network models that can be easily understood by people of different backgrounds. With these models, interactions among modelers, neurobiologists, engineers, cognitive researchers, physicists, mathematicians, and computer scientists can be encouraged.

Current BNN models are understood by specialists and are usually specific to the systems they model [4]. For an overview of these models the reader is referred to the DARPA report [1] and Arbib [2]. General hierarchical models and model frameworks that are easily understood are direly needed for neural system models. In this chapter, high-level Petri nets (HPNs)—a class of Petri nets (PNs)—are explored in order to model biological neural networks (BNNs). The model of a system obtained using PNs is called a Petri net model (PNM) of the system. The advantages of using PNs to model BNNs include hierarchical modeling and elegant graphical representation. This will result in models that are easily understood by experts in different related areas. Using PNs, the various interactions among the neurons in BNNs, along with their timing values, can be modeled. The resulting PNMs can be simulated and analyzed to study the dynamics of temporal patterns in BNNs.

PNs are claimed to be ideal modeling tools for simulating, analyzing, designing, and controlling complex asynchronous concurrent systems. The details of PN theory can be found in Peterson [5]. Some of the important and diversified applications of PNs can be found in performance analysis of multiprocessor systems [6]; communication protocols [7]; software design [8]; database design [9]; process control [10]; VLSI design optimization and testing [12]; and flexible factory automation [13, 14, 15, 16, 17]. Surveys on PN applications can be found in Silva and Valette [18] and Venkatesh and Ilyas [19]. As the biological brain is also basically a complex asynchronous system, PNs can be exploited to study the brain. Modeling BNNs with PNs has many advantages, which are detailed in the later sections of the chapter. This chapter is organized as follows. In the next section, the fundamentals of PN modeling are briefly presented, and the need for a new class of PNs to model BNNs is discussed. In the third section, modeling of BNNs with HPNs is described by giving analogies between the elements of HPNs and the elements of BNNs. In the fourth section, the new/modified elements added to HPNs are presented. The fifth section is devoted to illustrating the detailed modeling of BNNs with HPNs by considering the example of the olfactory bulb of a rabbit. In the same section, the analysis of the obtained HPN model and the results that can be drawn from the analysis are discussed. Finally, conclusions are presented.

FIGURE 1. Simple assembly cell.

2 Fundamentals of PNs

2.1 Concepts and Terminology of PNs

PNs are graphical and mathematical tools for modeling information and control flow in event-driven systems. A PN has two types of nodes: transitions and places. Directed arcs link places to transitions and transitions to places. Tokens reside in places and are used to describe the state of the system being modeled. The following paragraph illustrates PN concepts by modeling a simple assembly cell before a formal definition of PNs is given.

The assembly cell considered is shown in Figure 1. It contains two part feeders (PF1 and PF2) and a robot (R). Part feeders supply the parts required for assembly, and the robot does all assembly operations. PF2 feeds a part to the empty assembly area automatically. The operational specifications of this system are as follows:

1. To start a cycle, robot (R) and parts must be available.

2. R transfers a part from PF1 to the assembly area and starts assembly.

3. R assembles the parts and transfers the finished product to the output buffer.

FIGURE 2. Petri net model of the assembly cell. Initial marking: (1, 2, 3, 0, 0).

Figure 2 shows the PN model for this assembly cell. Conditions stated in specification 1 are modeled by three places (pictured by circles): R.ready (p1), PF1.ready (p2), and PF2.ready (p3). Putting a token (pictured by a dot) in p1 represents that R is ready; two tokens in p2 mean that there are two parts in PF1; and three tokens in p3 mean that three parts are available in PF2. Specification 2 is modeled by a transition (pictured by a bar): transfer.part.and.start.assembly (t1). Once robot R starts assembly, a new condition results, i.e., assembly.in.progress, modeled by p4. Specification 3 is modeled by finish.assembly.and.transfer.to.output.buffer (t2), a transition. After R finishes assembly and transfers the finished product to the output buffer, two new conditions result. The first is that R is free again to do the next assembly task. It is modeled by an output arc from t2 to place p1, which deposits a token in p1 when t2 "fires". The second is that a finished product is ready at the output buffer. This is modeled by another place, finished.product.available (p5). The distribution of tokens in all places is called a marking of the PN. A marking indicates the status of all system components, called the system's state. It is formally defined as a vector whose components represent the number of tokens in the corresponding places. For example, the initial state of the system for the assembly cell is (1, 2, 3, 0, 0), which models that the robot is ready, PF1 and PF2 contain two and three parts respectively, assembly is not in progress, and there is no finished product in the output buffer. The marking changes when a transition fires, i.e., an event occurs. This results in a new marking according to the rules given later. Sometimes, weights (pictured as labels on the arcs) may also be present in a PN to facilitate the modeling. If there is no weight on an arc, a unit weight is assumed.

Formally, a PN Z is a five-tuple Z = (P, T, I, O, m), where

1. P is a finite set of places;

2. T is a finite set of transitions with P ∪ T ≠ ∅ and P ∩ T = ∅;

3. I: P × T → N is an input function that defines the set of directed arcs from P to T, where N = {0, 1, 2, ...};

4. O: P × T → N is an output function that defines the set of directed arcs from T to P;

5. m: P → N is a marking whose ith component represents the number of tokens in the ith place. An initial marking is denoted by m0.

The execution rules of a PN include enabling and firing rules:

1. A transition t ∈ T is enabled if and only if m(p) ≥ I(p, t) for all p ∈ P.

2. When enabled in a marking m, t may fire, resulting in a new marking m' following the rule

m'(p) = m(p) + O(p, t) - I(p, t) for all p ∈ P.

The marking m' is said to be reachable from m. Given Z and its initial marking m0, the reachability set is the set of all markings reachable from m0 through various sequences of transition firings. Several important PN qualitative properties, such as boundedness and liveness, that are related to stability and deadlock freeness can be defined; their implications for system modeling are reported in Peterson [5] and Murata [19].
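The formal definition and execution rules above can be sketched in a few lines. The following is my illustration (not code from the chapter), using the assembly cell of Figure 2 with the place order (p1 robot ready, p2 PF1 ready, p3 PF2 ready, p4 assembly in progress, p5 finished product available).

```python
# A minimal sketch of the untimed PN Z = (P, T, I, O, m) and its execution
# rules, using the assembly cell of Figure 2.

I = {"t1": (1, 1, 1, 0, 0),   # t1 consumes the robot and a part from each feeder
     "t2": (0, 0, 0, 1, 0)}   # t2 consumes the in-progress assembly
O = {"t1": (0, 0, 0, 1, 0),   # t1 puts the assembly in progress
     "t2": (1, 0, 0, 0, 1)}   # t2 frees the robot and yields a finished product

def enabled(m, t):
    """Rule 1: t is enabled iff m(p) >= I(p, t) for every place p."""
    return all(mp >= ip for mp, ip in zip(m, I[t]))

def fire(m, t):
    """Rule 2: m'(p) = m(p) + O(p, t) - I(p, t) for every place p."""
    assert enabled(m, t), f"{t} is not enabled in marking {m}"
    return tuple(mp + op - ip for mp, ip, op in zip(m, I[t], O[t]))

m0 = (1, 2, 3, 0, 0)          # initial marking of Figure 2
m1 = fire(m0, "t1")           # start assembly
m2 = fire(m1, "t2")           # finish assembly
print(m1, m2)                 # (0, 1, 2, 1, 0) (1, 1, 2, 0, 1)
```

Enumerating all markings reachable from m0 through repeated calls to `fire` yields the reachability set described above.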

2.2 Timed PNs

When a PN does not model timing information of the operations in the system, it is called an untimed PN, as shown in Figure 2. However, for the quantitative analysis and control of the system, timing information has to be included in the PN. For example, assume that transferring parts and starting assembly takes one time unit, and that doing the assembly and transferring the finished product takes two. These are modeled by associating one time unit with t1 and two with t2. The time associated with a transition is called the firing duration and is shown on the right-hand side of the transition. When a PN models timing information in the system, it is called a timed PN (TPN). Figure 3 shows various timed PNMs in chronological order.

Formally, a timed PN is a net Z in which each transition is associated with either a deterministic or a random firing delay time. Note that the random time delays may follow a general distribution. There are two events for a transition firing, namely, start.firing and end.firing. Between these two events, the firing is in progress. The deposition of tokens into a transition's output place(s) occurs at end.firing. While the firing of a transition is in progress, the time to end firing, called the remaining firing time, decreases from the firing duration to zero, at which point the firing of the transition is completed.

Instantaneous description (ID) [17] defines the state of a TPN and is a four-tuple ID = (m, F, Q, A), where:

1. m is a marking;

2. F is a binary selector function, F: T → {0, 1}. If F(t) = 1, t is enabled; otherwise it is disabled;

3. Q: T → R+ is the remaining firing time function, where R+ is the set of all positive integers. If Q(t) = q, there is q amount of time remaining to complete the firing of t. Q is a cumulatively decreasing time function;

4. A: T → R+ is the active time function. If A(t) = q', t is said to have been active for q' amount of time. A is a cumulatively increasing time function.

ID is useful for the quantitative and behavioral analysis of the system. The importance of ID can be observed from the various PNMs shown in Figure 3, corresponding to different times. Consider, for example, Figure 3(a), modeling the assembly cell before starting assembly. At time zero, the initial marking models the initial system state. The F-function models that t1 is enabled; the Q-function models that one time unit is necessary to finish t1's firing; the A-function models that no transition is active. After one time unit, Figure 3(b) shows the assembly cell after assembly starts. Notice that the marking has changed, indicating that R is not ready because it is doing assembly, PF1 and PF2 contain one and two parts respectively, assembly is in progress, and there is no finished product. The F-function shows that t2 is enabled, the Q-function shows that t2 needs two time units to complete its firing, and the A-function shows that t1 has been active for one time unit. After two time units, Figure 3(c) shows the assembly cell during the assembly operation. Observe that the marking has not changed. The F-function shows that no transition is enabled; the Q-function shows that one time unit is necessary to finish t2's firing; and the A-function models that both t1 and t2 have been active for one time unit each. After three time units, Figure 3(d) models the assembly cell after assembly is finished. The new marking indicates that R is ready, PF1 and PF2 contain one and two parts respectively, assembly is not in progress, and there is one finished product. The F-function models that t1 is enabled, the Q-function shows that one time unit is necessary to finish t1's firing, and the A-function shows that t1 and t2 have been active for one and two time units respectively. The use of ID to analyze BNNs is discussed later, in Section 5.

FIGURE 3. Timed Petri net model of the assembly cell: (a) before firing of transition 1 (before assembly started, time 0); (b) after firing of transition 1 (after assembly started, after 1 time unit); (c) during firing of transition 2 (during assembly, after 2 time units); (d) after firing of transition 2 (after assembly finished, after 3 time units).
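The sequence of instantaneous descriptions traced above can be reproduced by a small simulation. The sketch below is my reconstruction (not the authors' code) under the semantics stated in the text: tokens move only at end.firing, a transition cannot restart while its firing is in progress, and A(t) accumulates the total time t has been active.

```python
# Sketch of timed-PN execution reproducing the IDs (m, F, Q, A) of Figure 3.

DUR = {"t1": 1, "t2": 2}                              # firing durations
I = {"t1": (1, 1, 1, 0, 0), "t2": (0, 0, 0, 1, 0)}    # input arcs
O = {"t1": (0, 0, 0, 1, 0), "t2": (1, 0, 0, 0, 1)}    # output arcs

def simulate(m, horizon):
    """Return the ID (time, marking, F, Q, A) observed at each integer time."""
    ts = ("t1", "t2")
    firing = {}                          # transition -> remaining firing time
    active = {t: 0 for t in ts}          # A-function: accumulated active time
    trace = []
    for now in range(horizon + 1):
        # end.firing: completed transitions withdraw/deposit their tokens
        for t in [t for t, q in firing.items() if q == 0]:
            m = tuple(mp + op - ip for mp, ip, op in zip(m, I[t], O[t]))
            del firing[t]
        # start.firing: enabled transitions that are not already in progress
        for t in ts:
            if t not in firing and all(mp >= ip for mp, ip in zip(m, I[t])):
                firing[t] = DUR[t]
        F = tuple(int(firing.get(t) == DUR[t]) for t in ts)  # just enabled
        Q = tuple(firing.get(t, 0) for t in ts)              # remaining time
        A = tuple(active[t] for t in ts)
        trace.append((now, m, F, Q, A))
        for t in firing:                 # advance one time unit
            firing[t] -= 1
            active[t] += 1
    return trace

for row in simulate((1, 2, 3, 0, 0), 3):
    print(row)   # the four rows match Figure 3(a)-(d)
```

Running it prints, at times 0 through 3, exactly the markings and F-, Q-, and A-functions listed for panels (a) through (d).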

2.3 High-Level PNs (HPNs)

HPNs are extensions of timed PNs and are capable of modeling complex asynchronous concurrent systems. There are several classes of HPNs reported in the literature. A typical HPN is shown in Figure 4. The additional constructs in this HPN compared to earlier PNs are: (i) colors (associated with tokens), representing different conditions simultaneously; (ii) predicates (conditions embedded in the transitions), representing the conditions to be fulfilled for the occurrence of activities; (iii) labels on some arcs, modeling actions that are to be performed after firing a corresponding transition. A discussion of the theory of these classes of HPNs falls beyond the scope of this chapter. The available classes of HPNs [20, 21, 22, 23, 24, 25] are not powerful enough to model BNNs, as they cannot model some complex functions (explained in subsequent sections) taking place among the elements present in BNNs. Hence a new class of HPNs is needed to address the problem at hand. The Petri net model (PNM) can replicate the biological structure of the brain if the available HPNs are extended with additional constructs to model the presynaptic cleft, the axon transition, the dendrite transition, the receptor generating an excitatory pulse, and the receptor generating an inhibitory pulse.

3 Modeling of Biological Neural Systems with High-Level PNs

Before suggesting the detailed methodology for modeling BNNs with HPNs, it is important to recapitulate the interactions taking place in BNNs. The brain contains over one hundred billion neurons, which perform all of the computational and communication functions within the brain. This is achieved by transmitting information among the neurons in the form of electrochemical signals (action potentials). Before explaining the events underlying the transmission of signals in the brain, it is necessary to review the elements of the neuron and their functions [11].

FIGURE 4. A typical high-level Petri net.

The neuron consists of three sections: (i) the cell body, (ii) the dendrites, and (iii) the axon, each with separate but complementary functions [26]. Functionally, the dendrites receive signals from other cells at connection points called synapses. From there, the signals are passed on to the cell body, where they are essentially averaged with other such signals. If the average over a short time interval is sufficiently large, the cell fires, producing a pulse through its axon that is passed on to succeeding cells. Primarily, the axon carries the signal in the form of an action potential. Near its end, the axon has multiple branches, each terminating in a synapse, where the signal is transmitted to the next neuron through a dendrite or in some cases directly to a cell body [26].

The new elements/constructs, as well as the available ones in HPNs, that are proposed to model BNNs are explained below and summarized in Table 1. From here on, for simplicity, these PN models are called PNMs.

1. Normal places in the PNM can represent cell bodies. In the biological neural networks (BNNs), one of the primary functions of the cell body is to receive the input signals from axons/dendrites and transmit the output signals to other axons/dendrites. Similarly, a place in a PNM receives information from several input arcs and transmits the information to output arcs.

2. Arcs in PNMs replicate the functions of axons/dendrites in BNNs. Axons/dendrites are channels through which information travels. Similarly, arcs in PNMs act as channels for transmitting the information. In the PNM, a normal directed arc models an axon carrying an excitatory pulse, and an arc with a small square at its edge (called an inhibitory arc in PN terminology) models an axon carrying an inhibitory pulse.

3. Timed transitions combined with their output arcs in the PNM can model the function of axons/dendrites in BNNs. To be more specific, an output arc from an "axon transition" represents the axon, and an output arc from a "dendrite transition" represents a dendrite. Axons and dendrites connect to each other at synapses. Dendrites send signals to the cell body. Similarly, transitions in the PNM receive signals from their input places. In BNNs, all the incoming signals are averaged at the cell body. But in PNMs, the calculation of the output from a place is carried out in the axon transition.

The threshold value that decides the firing of the cell body is modeled as a weight (a standard term for threshold in PN terminology) on the arc connecting the place modeling that cell body and the axon/dendrite transition. For some neurons that do not have axons (e.g., granule cells in the olfactory bulb), this output calculation is done in the dendrite transition. In PNMs, the output signal from this transition cannot be distinguished as either excitatory or inhibitory until it passes through the "predicate transitions" that model different receptors in BNNs. (The transition is called a predicate transition because it models the receptor that checks the condition of whether the transmitter passing through it produces an excitatory pulse or an inhibitory pulse.)

4. In PNMs, weights on arcs represent (i) the threshold value of a cell body, and (ii) the number of input arcs for the place. For example, the threshold value of a cell body can be modeled as a weight on the arc connecting the place (modeling that cell body) and the transition (modeling the dendrite connecting to the cell body). The number of dendrites carrying signals to a cell body can be represented as a weight on the arc that acts as an input arc to the place modeling the cell body. For example, if there are four axons carrying excitatory signals and five axons carrying inhibitory signals to the cell body, two weights, 4 and 5, can be associated with the two input arcs carrying information to the place. Weights on arcs also represent the amount of neurotransmitter released from the presynaptic cleft to the postsynaptic cleft. Modeling the amount of transmitter released is essential, as it decides the changes in electrochemical potential at the postsynaptic membrane, which in turn affect the firing of the neuron.

TABLE 1. Analogy between HPNs and BNNs.

Element in HPN                  | Element(s)/implementation parameter(s) in BNNs
Place                           | (a) Cell body; (b) presynaptic cleft
Transition                      | (a) Axon generator; (b) dendrite generator; (c) any activity
Normal arc                      | Axon/dendrite generating an excitatory pulse
Inhibitor arc from transition   | Axon/dendrite generating an inhibitory pulse
Weight                          | (a) Threshold required to fire the neuron; (b) the number of input axons/dendrites for a neuron; (c) the amount of neurotransmitter released from the presynaptic cleft to the postsynaptic cleft
Token                           | Chemical molecule flowing, or information stored, in the neuron
Transition with predicate       | Receptor at inter-neuron communication allowing selected types of chemicals to pass through it: (a) chemical gate allowing molecules causing an excitatory pulse; (b) chemical gate allowing molecules causing an inhibitory pulse
Colors for tokens               | Different chemical molecules traveling through the axon
Firing sequence                 | Neuron firing sequence
Initial marking                 | Initial state of the BNN
Transition associated with time | Timing duration (tj) of the activity modeled by that transition (Tj)

5. In PNMs, tokens represent both the information and the chemical molecules passing through the axons. In BNNs, there are many types of chemical molecules passing through the axons simultaneously. Tokens with two attributes, the name of the chemical molecule and the amount at a given time, can be used to model all these types of molecules.

6. In PNMs, places represented as concentric circles model presynaptic areas. These places receive the signals from the normal transitions and transmit the signals to their output transitions. Each normal transition representing the dendrite can have many output places, representing many postsynaptic membrane areas.

7. The output transitions of the concentric circles are associated with predicates to model the chemical gates (receptors) for interneuron communication in BNNs. A transition with a "+" sign at its right models a receptor allowing the chemical molecules that produce an excitatory pulse. Similarly, a transition with a "-" sign at its right models a receptor allowing the chemical molecules that produce an inhibitory pulse.

8. Times associated with transitions represent the time delays involved in the interactions among neurons, such as the time required for the action potential to travel from the neuron to the presynaptic membrane area and the time required for the chemical gate at a receptor to open. Transition firing sequences in the PNM represent the flow of information among the neurons.
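The analogy in Table 1 can be sketched in code. The classes and parameter values below are hypothetical illustrations of the mapping (they are my invention, not constructs from the chapter): a cell-body place accumulates excitatory and inhibitory pulses, and its axon-generator transition sums them, compares the sum with the threshold weight on the connecting arc, and propagates a pulse with a firing delay only when the threshold is reached.

```python
# Hypothetical sketch of the HPN/BNN analogy: a cell-body place plus an
# axon-generator transition with a threshold weight and a firing delay.
# Class names and numbers are illustrative assumptions.

class CellBodyPlace:
    """Place modeling a cell body: accumulates the incoming signal."""
    def __init__(self):
        self.tokens = 0.0
    def receive(self, amount, excitatory=True):
        # A normal arc delivers an excitatory pulse; an inhibitor arc
        # (square-tipped in the chapter's notation) subtracts instead.
        self.tokens += amount if excitatory else -amount

class AxonGenerator:
    """Transition whose output arc models the axon of its parent cell body."""
    def __init__(self, parent, threshold, delay):
        self.parent, self.threshold, self.delay = parent, threshold, delay
    def try_fire(self):
        # The summed input is compared with the threshold weight on the arc;
        # if reached, a pulse passes through the axon after `delay` units.
        if self.parent.tokens >= self.threshold:
            self.parent.tokens = 0.0     # signal propagated, body resets
            return ("pulse", self.delay)
        return None                      # below threshold: nothing propagates

cell = CellBodyPlace()
axon = AxonGenerator(cell, threshold=1.0, delay=2)
cell.receive(0.6)
cell.receive(0.7)
print(axon.try_fire())                   # 0.6 + 0.7 reaches the threshold
```

For a neuron without an axon, such as the granule cells discussed below, the same summation would be attached to a dendrite-generator transition instead.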

4 New/Modified Elements Added to HPNs to Model BNNs

As discussed in an earlier section, none of the available classes of HPNs have the necessary elements to model BNNs. For example, there are no specific elements to model presynaptic and postsynaptic areas, the cell body, the threshold required to fire a cell body, etc. In this section, new elements are added to HPNs, and several other elements are modified, in order to make it possible to model BNNs. The formal description of this new class of HPNs and its specific use are described below.

4.1 New Types of Places

For the sake of discussion, let us assume that places can model both cell bodies and presynaptic areas. But places modeling cell bodies and places modeling presynaptic areas have to be distinguished, since the functions of the cell body and the presynaptic area are different. In other words, the cell body is the primary building block of BNNs, at which the signals from other cell bodies are averaged and propagated if the threshold required for that cell body is accumulated. In contrast, the function of the presynaptic area is to act as a terminal point of an axon, and it is mainly involved in communication between neurons. Hence, there should be two different types of places to model cell bodies and presynaptic clefts:

P = {Pc, Ppsc}, where

Pc = {P1, P2, P3, ..., Pn} is a set of cell bodies and

Ppsc = {Ppsc1, Ppsc2, ..., Ppscn} is a set of presynaptic areas.

9. High Level Petri Nets in Modeling Biological Neural Nets 297

4.2 New Types of Transitions

Timed transitions (time associated with a transition) and predicate transitions (predicate associated with a transition) are important elements in earlier HPNs [9]. However, these transitions alone are not sufficient to model axons and dendrites in BNNs. Hence, two additional types of transitions are proposed. The first type of transition is named the dendrite generator, as the output arcs of such transitions model dendrites. The other type of transition is called the axon generator, as the output arcs of such transitions represent axons. Further, there are specific actions associated with such transitions. The action performed at such a transition is to add all the incoming signals of its parent cell body (the parent cell body for a transition is the cell body for which it is the output transition) and compare the result with the threshold required to fire the cell body. If the resultant sum exceeds the threshold, then the axon generator allows the signal to pass through the axon (the output arc of the axon generator models the axon corresponding to its parent cell body). This summation of the incoming signals of the parent cell body is usually done at the axon generator. When a cell body fires, the information flows in the form of an axon potential through the axon. However, some specific cell bodies do not have axons; for example, in the olfactory bulb of the rabbit, granule cells do not have axons. In such cases, the aforementioned summation takes place at the dendrite generator corresponding to the granule cell:

T = {TAG, TDG, TAT, TP}, where

TAG = {TAG1, TAG2, ..., TAGo} is a set of axon generators,

TDG = {TDG1, TDG2, ..., TDGp} is a set of dendrite generators;

TAT = {TA, ta}, where

TA = {TA1, TA2, ..., TAq} is a set of activities modeling the flow of information from a presynaptic area to a postsynaptic area,

ta = {ta1, ta2, ..., taq} is a set of time durations associated with the corresponding activities modeled by TA;

TP = {TPE, TPI} is a set of receptors, where

TPE = {TPE1, TPE2, ..., TPEr} is a set of predicate transitions modeling receptors generating an excitatory pulse,

TPI = {TPI1, TPI2, ..., TPIr} is a set of predicate transitions modeling receptors generating an inhibitory pulse.
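The summation-and-threshold behavior of an axon generator can be sketched as follows (a minimal illustration; the class names and numbers are assumptions, not part of the chapter's formalism or software package):

```python
# Minimal sketch of an axon-generator transition: sum the incoming
# signals of its parent cell body and fire only if the threshold is met.
class CellBody:
    def __init__(self, name, threshold):
        self.name = name
        self.threshold = threshold   # real-valued firing threshold (an RK weight)
        self.incoming = []           # signed signal amounts received

    def receive(self, amount):
        self.incoming.append(amount)

class AxonGenerator:
    """Output transition of a cell body; its output arc models the axon."""
    def __init__(self, parent):
        self.parent = parent

    def enabled(self):
        # Enabled only when the summed input meets the cell body's threshold.
        return sum(self.parent.incoming) >= self.parent.threshold

    def fire(self):
        if self.enabled():
            total = sum(self.parent.incoming)
            self.parent.incoming.clear()
            return total             # signal passed along the axon
        return None

m1 = CellBody("M1", threshold=5.0)
t13 = AxonGenerator(m1)
m1.receive(4.0)
m1.receive(2.0)
m1.receive(-0.5)                     # an inhibitory input
print(t13.fire())                    # prints 5.5 (5.5 >= 5.0, so the cell fires)
```

For a cell without an axon (such as a granule cell), the same summation would be attached to a dendrite generator instead.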

4.3 New Types of Weights

In the conventional HPN, weights are integers. But to model BNNs, weights have to be both integers and real numbers, since they represent either the amount of transmitter released from a presynaptic area to a postsynaptic area or the threshold required to fire a neuron. For example, the threshold required to fire a mitral cell in the olfactory bulb of a rabbit is a real number, so it is logical to keep such weights as real numbers. Weights also model the number of input neurons of one type to a neuron of another type; for example, in the olfactory bulb of a rabbit, each mitral cell receives information from 200 granule cells, so weights modeling such connections should be integers. Hence, there should be two different types of weights: weights that are real numbers and weights that are integers.

As in the conventional HPN, where arcs transfer information and control among the places, axons and dendrites in BNNs carry information from one neuron to other neurons. A normal arc with a solid line represents an axon. An arc modeling an axon/dendrite that generates an excitatory pulse has an arrowhead at its end. Similarly, an arc modeling an axon/dendrite that generates an inhibitory pulse has a small square at its end:

IA: {P × T} → S and OA: {T × P} → S,

where IA and OA represent input and output functions that define directed arcs between places and transitions,

S is the set of all values of K, where

K = {IK, RK} is a set of weights on arcs;

IK is the set of all integers, used to model the number of input axons/dendrites for a cell body;

RK is the set of all real numbers modeling thresholds required to fire a cell body;

M is the marking of an HPN from set P to Q, i.e.,

M: P → Q, where M assigns tokens with attributes to every place;

M(pi) indicates the number of tokens in place pi, each token having two attributes: (1) the name of the chemical molecule and (2) the amount of the chemical molecule;

Q = {0, 1, 2, ...}, and the attributes of each token are given by a set TR modeling the transmitter information, where

TR = {TR1, TR2, ..., TRw}, where w is the total number of different types of transmitter molecules;

TRi = {NTRi, ATRi}, where NTRi is the name of the transmitter molecule and ATRi is the amount of the transmitter molecule. Note that ATRi is a positive real number, because the amount of a transmitter molecule can be represented as a real number.
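A token with its two attributes, and a marking assigning tokens to places, might be represented as follows (a minimal sketch with illustrative names):

```python
from dataclasses import dataclass

@dataclass
class Token:
    """A token with the two attributes defined in the text."""
    ntr: str     # NTR: name of the transmitter molecule
    atr: float   # ATR: amount of the transmitter molecule (a positive real)

# A marking M: P -> Q assigns tokens (with attributes) to every place.
marking = {
    "M1": [Token(ntr="TR1", atr=0.8)],   # one token in place M1
    "GR": [],                            # no tokens in place GR
}
print(len(marking["M1"]))   # number of tokens in place M1: 1
```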

5 Example of a BNN: The Olfactory Bulb

To illustrate the modeling concepts of HPNs described above, the olfactory bulb of a rabbit given in Shepherd [26] is considered. The neuronal elements of the olfactory bulb are shown in Figure 5. The olfactory bulb is considered in the present investigation because of its distinct laminations and several sharply differentiated neurons. Furthermore, the results of earlier studies show that the olfactory bulb goes significantly beyond the framework of the classical neuronal structure as formulated in the motoneuron model. Hence, it is appropriate to consider modeling the olfactory bulb and then using similar concepts to model other highly organized regions of the brain, such as the retina and the cerebellum. Before illustrating the concepts modeling BNNs with HPNs, it is logical first to summarize the internal details of the olfactory bulb. In this section, the input and output connections of the olfactory bulb, different types of neurons present in the olfactory bulb, and the connections among them, are described. For a detailed description of the olfactory bulb, see Shepherd [26].

5.1 Inputs

From Figure 5, it can be noted that the afferent (peripheral) input to the olfactory bulb is through the axons of the receptor cells in the olfactory mucosa in the nasal cavity. The olfactory axons enter at the bulb surface and terminate in a layer composed of spherical regions of neuropil, called glomeruli. There are several central inputs to the bulb from the brain; their sites of origin are indicated in Figure 5. Axons that are relatively large but

FIGURE 5. Steps for modeling BNNs by HPNs. Neuronal elements of the mammalian olfactory bulb. (See [26].) The layers shown are the olfactory mucosa, olfactory nerves, glomeruli, external plexiform layer (EPL), mitral body layer, and granule layer. Inputs: afferent fibers (above) from olfactory receptors; central fibers (below) from three sources: centrifugal fibers (C) from the nucleus of the horizontal limb of the diagonal band, ipsilateral fibers from the anterior olfactory nucleus (AON), and contralateral fibers from the anterior commissure (AC). Principal neurons: mitral cell (M), with primary (1°) and secondary (2°) dendrites and recurrent axon collaterals (rc); tufted cell (T). Intrinsic neurons: periglomerular short-axon cell (PG); deep short-axon cell (SA); granule cell (Gr). LOT, lateral olfactory tract.

few in number, come from a region at the base of the brain called the diagonal band (DB). Other axons, finer and more numerous, come from the region just posterior to the bulb, the anterior olfactory nucleus (AON). Some of these come from the AON of the same side; others come from the contralateral side through the anterior commissure (AC).

5.2 Principal Neuron

The output from the olfactory bulb is directed centrally and is carried in the axons of mitral cells. Each cell sends an unbranched primary dendrite to a glomerulus, to terminate there in a tuft of branches. Each mitral cell also gives rise to several secondary dendrites, which branch sparingly and terminate in the external plexiform layer (EPL). The mitral cell axons proceed to the depths of the bulb and then run posteriorly to emerge together as the lateral olfactory tract (LOT). During their course within the bulb, they give off two kinds of collaterals: recurrent collaterals that terminate in the EPL and deep collaterals that terminate in the granule layer (GRL).

Even though there are smaller versions of mitral cells, called tufted cells, they are not considered here, as their specific function is not known [26].

5.3 Intrinsic Neurons

There are mainly three types of intrinsic neurons: periglomerular cells, granule cells, and short-axon cells. Surrounding the glomeruli are the intrinsic neurons called periglomerular (PG) cells. Each of these cells has a short, bushy dendrite tree that arborizes within one of the glomeruli. The axon of this cell distributes to neighboring glomeruli, but not to the glomerulus containing the dendrite tree of its parent cell. Below the layer of mitral cell bodies is a thick layer containing the cell bodies of granule cells. Each granule cell has a superficial process that starts and terminates in the EPL. Each granule cell also gives off an inner process that terminates deeper in the granule layer. The outstanding feature of the granule cell is that it lacks a morphological axon. (The implications of this during modeling with HPNs are described later in this section.) The detailed ratios of the principal neurons to intrinsic neurons can be seen in Cotterill [4]. The basic circuit of the olfactory bulb is shown in Figure 6.

5.4 PNM Formulation and Analysis

To illustrate the concepts of PNs described above, the olfactory bulb of a rabbit given in Shepherd [26] is shown in Figure 5. The basic circuit of the olfactory bulb is shown in Figure 6. Certain assumptions are made while

FIGURE 6. Steps for modeling BNNs by HPNs. Basic circuit diagram for the mammalian olfactory bulb. (See [26].)

modeling. They are: (i) for simplicity's sake, as given in Shepherd [26], the terminals of dendrites are assumed to be synaptic areas; (ii) the postsynaptic area (the synaptic area on the process that is receiving the signal) is embedded in the output transition of the place modeling the presynaptic area (the synaptic area on the process that is sending the signal). The PNM of the neuronal elements of the mammalian olfactory bulb considered is shown in Figure 7. Table 2 shows the interpretation of the PNM elements in terms of activities in the brain. Figures 5-7 represent the logical steps needed to model BNNs with HPNs. First, the biological structure is converted into a circuit that describes the information flow among cells, and then HPNs are used to model the circuit.

In the PNM, each transition represents the activity of information flowing from its input place to its output place. The time duration for this activity is associated with the right-hand side of the transition. For example, T1 represents the information flow from PAON1 to PPDM1, and t1 models the time duration for this activity to take place. W represents the amount of transmitter released from a presynaptic area to a postsynaptic area. For example, W1 models the amount of transmitter molecules released from PAON1 to the postsynaptic area of M1. Similarly, certain weights on the output arcs from places modeling neurons represent the threshold required to fire the neuron. For example, TM models the threshold required to fire M1.

FIGURE 7. Steps for modeling BNNs by HPNs. PNM of the olfactory bulb. (Legend: transitions modeling chemical gates that admit molecules generating an excitatory or an inhibitory pulse, and axons/dendrites generating inhibitory pulses.)

Places:
PAON1: presynaptic area (P) on the axon of olfactory nerve (ON) 1
PAON2: P on the axon of ON 2
PA1C: P on axon 1 of a centrifugal fibre (C)
PPDM1: P on the primary dendrite of mitral cell 1 (M1)
PDPG: P on the dendrite of a periglomerular short-axon cell (PG)
M1: mitral cell 1
PG: periglomerular short-axon cell
PAPG: P on the axon of PG
M2: mitral cell 2
PA2C: P on axon 2 of C
PAAON: P on the axon of the anterior olfactory nucleus (AON)
PAAC: P on the axon of the anterior commissure (AC)
PSDM1: P on the secondary dendrite of M1
GR: granule cell
PD1GR: P on dendrite 1 of GR
PD2GR: P on dendrite 2 of GR

Transitions: each transition represents the activity of information flowing from its input place to its output place, with the time duration for the activity associated at its right-hand side (e.g., T1 and t1 as above).

Weights: TM, TPG, and TGR are the thresholds required to fire M1, PG, and GR, respectively; the weights W represent the amounts of transmitter released from presynaptic to postsynaptic areas (e.g., W1 as above).

TABLE 2. Interpretation of elements in the PNM shown.

5.5 Token Flow through the Olfactory Bulb PNM

Token flow in a PNM models the information flow in a BNN. In order to get an insight into the functioning of a PNM, consider the firing of mitral cell 1, modeled by place M1. From olfactory nerve 1 (ON1), the transmitter comes to the presynaptic area on the axon of olfactory nerve 1 (PAON1).

When PAON1 receives an amount of transmitter molecules equal to or greater than W1, transition T1 fires. For each mitral cell, there are 1000 afferent axons entering the olfactory bulb, and the information flow from each of these axons is exactly the same as described above. Hence, this is modeled by a weight of 1000 on the output arc of T1. At M1, all the signals corresponding to this information flow from ON through PAON1 are summed up. In other words, the ATR attribute of the token in M1 after firing T1 is given as (1000 × ATRi).

Similarly, M1 receives 20 and 200 inhibitory signals from PDPG and GR, respectively. All these signals are summed at M1, and if the net signal exceeds the threshold required to fire M1 (TM), transition T13 fires. The information flow from M1's axon to the presynaptic area of GR is modeled by the firing of T13, and the transfer of the resulting excitatory signal is modeled by the firing of T14. Note that T14 models a chemical gate allowing transmitter molecules that generate an excitatory pulse. To summarize, the token movement along the path PAON1 → T1 → M1 → T13 → T14 → GR models the information flow from the olfactory nerve (ON) to the granule cell via mitral cell 1. By knowing the values of the attributes of tokens in places, the exact amount of transmitter molecules, the signal status, and the status of the cell body can be determined. Similar explanations can be derived from the PNM for the firing of other neurons in the BNN.
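As a toy illustration of this token flow (only the 1000-axon multiplicity comes from the text; the transmitter amounts, W1, and TM are invented for the example, and the inhibitory inputs are omitted for brevity):

```python
# Toy walk of a token along the path PAON1 -> T1 -> M1 -> T13 -> T14 -> GR.
# A token carries two attributes: (transmitter name, amount).
marking = {"PAON1": ("TR1", 0.5), "M1": None, "GR": None}

W1 = 0.5         # transmitter amount needed to enable T1 (illustrative)
N_AXONS = 1000   # afferent axons per mitral cell (weight on T1's output arc)
TM = 400.0       # threshold required to fire M1 (illustrative)

# T1: fires once PAON1 holds at least W1 of transmitter; the output-arc
# weight of 1000 models the identical flow on each afferent axon, so the
# signals are summed at M1 as 1000 * ATR.
name, atr = marking["PAON1"]
if atr >= W1:
    marking["M1"] = (name, N_AXONS * atr)

# T13 and T14: M1 fires if its summed signal exceeds TM; T14 models the
# chemical gate passing the resulting excitatory pulse on to GR.
name, total = marking["M1"]
if total > TM:
    marking["GR"] = (name, total)

print(marking["GR"])   # ('TR1', 500.0)
```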

5.6 Information That Can Be Obtained from an HPN Model

The PN approach to the analysis of a system consists of two parts: modeling with a PN and analysis of the PNM by either analytical methods or simulation [5, 17, 18, 19]. The latter is applied in this chapter. The analysis of a PN using simulation can potentially be used for the discrete-event control of the system. For the quantitative analysis of PNMs, a software package has been developed in the C language. The following information related to ID (described in Section 2.2) can be obtained from it:

1. Marking of the PNM;

2. Enabled transitions in a marking and conflicts among them;

3. Remaining firing durations of transitions for completion of their firing;

4. Active firing times of transitions, all with respect to real time.

The advantages of PN modeling over other models are (i) for the analysis, the PNMs are easily understandable, as they are graphically elegant; (ii) the PNMs can be both qualitatively and quantitatively analyzed, yielding

invaluable information about the functioning of the brain; and (iii) PNMs give a general framework to express an entire class of related models. The qualitative information yields properties such as liveness, deadlock, boundedness, and safeness. For example, if the PNM is live, it implies that at a given time at least one transition is ready to fire, which in turn implies that at least one activity in the brain is ready to take place. If there is a deadlock in the PNM, it implies that at some time not a single transition is enabled to fire, which in turn implies that not a single activity in the brain is ready to take place. The quantitative information yields (i) the number of tokens deposited in each place, representing the stimulus and status present for each neuron. This is very important because, as stated earlier, each token has attributes that describe the chemicals, and their quantities, that reside at the particular cell body modeled by the corresponding place; (ii) the active and remaining firing times of transitions, which represent the state of the activities modeled by the transitions; (iii) the marking of the PNM, along with the firing vector and the active and remaining times of transitions, all with respect to real time, which together represent the dynamic behavior of the brain. Dynamic behavior represents various states of the brain with respect to real time using ID, as described in Section 2.2. This can be used to study the characteristics of temporal patterns in BNNs, as there are three timing functions, F, Q, and A, associated with each transition in the PNM.

The time that it takes for a system to come to a stable state is called the system transient time, and the time that a system takes to complete one cycle of its operation is called the system cycle time. The simulation cost and time can be drastically reduced if PNMs of the brain are formulated prior to simulation, since the system transient time and system cycle time can then be estimated. These times can be determined using a PNM by checking the attributes of places and transitions in the model with respect to real time. Once the transient time and cycle time of BNNs are determined, the simulation can be stopped and its results extrapolated for longer durations without actually simulating the system under study. The details of estimating the system transient and cycle times can be seen in Venkatesh and Ilyas [16] and Venkatesh et al. [17]. Furthermore, using the PNM, the effect of different firing sequences of neurons on the functioning of the brain can be investigated. To investigate the temporal effects of activities in the brain, various timing durations can be associated with transitions to study temporal dynamics. Note that each firing sequence in a PNM results in a unique neuron firing sequence, which in turn results in a particular sensory output. For the analysis of the PNM, a software package is being developed that comprises all the principles of PNs developed here. This package is an extension of the package reported in Venkatesh et al. [17].

6 Conclusions

In this chapter, an initial attempt has been made to model BNNs with HPNs. The motivation for using HPNs to model BNNs is that they provide an efficient and simple framework for expressing a broad set of BNN models. The advantages of PN modeling over other models are (i) for the analysis, the PNMs are easily understandable, as they are graphically elegant; (ii) the PNMs can be both qualitatively and quantitatively analyzed, yielding invaluable information about the functioning of the brain; and (iii) PNMs give a general framework to express an entire class of related models. The analogies between the functioning of HPNs and BNNs have been explored. The formulation of the PNM corresponding to BNNs has been elucidated by modeling the olfactory bulb of a rabbit. The qualitative and quantitative results that can be drawn using the software package under development have been presented. By studying the dynamic behavior of PNMs, various aspects of temporal patterns in BNNs can be investigated. Various timing durations can be associated with the transitions in PNMs to study temporal dynamics; this is achieved by associating three timing functions with each transition in the PNM. Thus, this chapter has attempted to generate further interest among groups of people with different backgrounds in applying HPNs to solve related problems in the area of neural networks. For the class of PNs presented here, there is a need to develop theories that guarantee well-behaved properties of PNs such as liveness, safeness, and reversibility. Also, the application of PNs to studying other examples of BNNs has to be explored.

Acknowledgments

The authors thank Oren Masory for his involvement in the initial discussions of this paper.

7 References

1. DARPA Neural Network Study, Oct. 1987-Feb. 1988, AFCEA International Press.

2. Arbib, M.A., 1987, Brains, Machines and Mathematics. Springer-Verlag, Berlin.

3. Grossberg, S., and Kuperstein, M., 1986, Neural Dynamics of Adaptive Sensory-motor Control: Ballistic Eye Movements. Elsevier / North Holland, Amsterdam.

4. Cotterill, R.M.J., 1988. Computer Simulation in Brain Science, Cambridge University Press, Cambridge, UK.

5. Peterson, J.L., 1989, Petri Net Theory and the Modeling of Systems, Prentice Hall, Englewood Cliffs, NJ.

6. Ajmone M.M., Balbo, G., and Conte, G., 1987, A Class of Generalized Petri Nets for the Performance Evaluation of Multi Processor System, MIT Press, Cambridge, MA.

7. Garg, K., 1985, An approach to performance specification of communication protocols using timed petri nets, IEEE Transactions on Software Engineering, Vol. SE-11, No. 10, pp. 1216-1225.

8. Mekly, L.J., and Yau, S.S., 1980, Software design representation using abstract process networks, IEEE Transactions on Software Engineering, Vol. SE-6, No. 5, pp. 420-435.

9. Ozsu, M.T., 1985, Modeling and analysis of distributed database concurrency control algorithms using an extended Petri net formalism, IEEE Transactions on Software Engineering, Vol. SE-11, No. 10.

10. Bruno, G., and Marchetto, M., 1986, Process-translatable Petri nets for the rapid prototyping of process control systems, IEEE Transactions on Software Engineering, Vol. SE-12, No. 2, pp. 346-357.

11. Dayhoff, J.E., 1990, Neural Network Architectures: An Introduction, Van Nostrand Reinhold, New York.

12. Distante, F., 1985, A Petri net matrix approach in VLSI functional testing, Microprocessing and Microprogramming, Vol. 16, Nos. 2-3, p. 194.

13. Venkatesh, K., Chetty, O.V.K., and Ravi Raju, K., 1990a, Simulating flexible automated forming and assembly systems, Journal of Material Processing and Technology, Vol. 24, pp. 453-462.

14. Venkatesh, K., Ravi Raju, K., and Chetty, O.V.K., 1990b, Augmenting the performance of flexible multi robot assembly systems with Petri nets. Proceedings of the International Conference on Automation, Robotics, and Computer Vision, Singapore, pp. 341-345.

15. Venkatesh, K., 1990c, Petri nets: An expeditious tool for simulation, modeling and analysis of flexible multi-robot assembly systems, M.Tech. thesis, Indian Institute of Technology, Madras, India.

16. Venkatesh, K., and Ilyas, M., 1993, Modeling, controlling, and simulation of local area networks for flexible manufacturing systems using Petri nets, Computers and Industrial Engineering, Vol. 25, Nos. 1-4, pp. 155-158.

17. Venkatesh, K., Zhou, M.C., Kaighobadi, M., and Caudill, R., 1994, Augmented timed Petri nets for modeling, simulation, and analysis of robotic systems with breakdowns, Journal of Manufacturing Systems, Vol. 13, No. 4, pp. 289-301.

18. Silva, M., and Valette, R., 1990, Petri nets and flexible manufacturing, Advances in Petri Nets, Lecture Notes in Computer Science, Springer-Verlag, Berlin, pp. 37-41.

19. Murata, T., 1989, Petri nets: Properties, analysis and applications, Proceedings of the IEEE, pp. 541-580.

20. Ghezzi, C., Mandrioli, D., Morasca, S., and Pezze, M., 1991, A unified high-level Petri net formalism for time-critical systems, IEEE Transactions on Software Engineering, Vol. SE-17, No. 2, pp. 160-172.

21. Madhavji, H.N., and Schafer, W., 1991, Prism: Methodology and process-oriented environment, IEEE Transactions on Software Engineering, Vol. SE-17, No. 2, pp. 127-283.

22. Belli, F., and Grosspietsch, K.E., 1991, Specification of fault-tolerant system issues by predicate/transition nets and regular expressions: Approach and case study, IEEE Transactions on Software Engineering, Vol. SE-17, No. 6, pp. 513-525.

23. Dotan, Y., and Arazi, B., 1991, Using flat concurrent Prolog in system modeling, IEEE Transactions on Software Engineering, Vol. SE-17, No. 6, pp. 493-512.

24. Peterka, G., and Murata, T., 1989, Proof procedure and answer extraction in Petri net model of logic programs, IEEE Transactions on Software Engineering, Vol. SE-15, No. 2, pp. 209-217.

25. Billington, J., Wheeler, R.G., and Wilbur-Ham, C.M., 1988, PROTEAN: A high-level Petri net tool for the specification and verification of communication protocols, IEEE Transactions on Software Engineering, Vol. SE-14, No. 3, pp. 301-316.

26. Shepherd, G.M., 1974, The Synaptic Organization of the Brain: An Introduction, Oxford University Press, Oxford.

Chapter 10

Locally Recurrent Networks: The Gamma Operator, Properties, and Extensions

Jose C. Principe, Samel Celebi, Bert de Vries, and John G. Harris

ABSTRACT Locally recurrent networks have shown great potential for processing time-varying signals. This paper reviews various memory structures for time-varying signal processing with neural networks. In particular, we focus on the gamma structure and variations such as the Laguerre and gamma II memory networks. The paper presents the basic theory of memory structures and several interpretations of their function.

1 Introduction

In engineering and biology, interesting patterns are often presented sequentially over time. Therefore, an information-processing system needs some kind of short-term memory to store the recent past. The most common connectionist mechanisms for short-term memory are feedforward delays (as in the Time Delay Neural Network, Lang et al., 1990) and feedback delays (as, for example, in the fully recurrent networks of Williams and Zipser, 1989). Feedforward tapped delay lines are static representations in the sense that the depth (i.e., order, number of taps) and resolution (sampling period) have to be chosen a priori. These parameters should match the characteristics of the signals and the processing goal, which in practice almost always leads to suboptimal design. Note that we should not "over-design," because too many taps lead to spurious signals in the system, which often show up as noise. In particular, for nonstationary signals (almost all real-world signals), the optimal depth and resolution may be time-varying, and the fixed representation of tapped delay lines is not optimal. However, their guaranteed stability, simple training algorithms, and conceptual simplicity make tapped delay lines still the favorite memory

312 Principe, Celebi, de Vries, and Harris

mechanism in neural networks. In unrestricted recurrent (feedback) networks, stability is hard to control.

Moreover, feedback connections introduce local minima in the error surface, which leads to poor training convergence.

When feedback is restricted to the local processing units, which in turn are connected as a feedforward network, we may hope to combine some of the lucrative properties of feedforward and feedback structures. In connectionist circles, several studies have appeared that investigate the properties and applicability of locally recurrent networks (Back and Tsoi, 1995). In this paper we present a summarizing review of a number of publications on the gamma locally recurrent network.

We will start by analyzing the different types of memory structures and provide basic definitions. The gamma memory will then be introduced as a special case of the generalized feedforward filters, and training algorithms will be provided. Section 4 presents some applications of the gamma memory, while in Section 5 several important interpretations of the gamma memory are reviewed. In Sections 6 and 7, the Laguerre and gamma II memories are presented. Finally, some important extensions to the basic gamma memory topology are presented and conclusions drawn.

2 Linear Finite Dimensional Memory Structures

In de Vries and Principe (1992), a general linear delay mechanism was defined by the convolution model as

y(t) = ∫_0^t w(t - s) x(s) ds    (1)

in continuous time and as

y(n) = Σ_{k=0}^{N} w(n - k) x(k)    (2)

in the discrete-time domain; x(n) is an input signal, w(n) a filter response, and y(n) a memory trace. We will call w(n) a memory filter if w(n) is causal and normalized in the sense that Σ_{n=0}^{∞} |w(n)| ≤ 1 (de Vries and Principe, 1992). If x(n) and w(n) are vector signals, w(n - k) x(k) should be read as the inner product Σ_i w_i(n - k) x_i(k).
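Equation (2) is an ordinary causal convolution sum; a minimal numerical sketch (the kernel values are illustrative):

```python
def memory_trace(x, w):
    # y(n) = sum_k w(n-k) x(k) over the causal range, as in eq. (2)
    return [sum((w[n - k] * x[k] for k in range(n + 1) if n - k < len(w)), 0.0)
            for n in range(len(x))]

w = [0.5, 0.25, 0.125, 0.0625]   # causal kernel with sum |w(n)| <= 1
x = [1.0, 0.0, 0.0, 0.0, 0.0]    # unit impulse input
print(memory_trace(x, w))        # [0.5, 0.25, 0.125, 0.0625, 0.0]
```

Feeding a unit impulse recovers the memory filter itself, which is a quick way to check any implementation of (2).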

If w(n) is the impulse response of a finite-dimensional linear system, then it can be implemented as an autoregressive moving average (ARMA) structure. The ARMA memory can be written as

10. Locally Recurrent Networks: The Gamma Operator 313

FIGURE 1. The ARMA memory filter, with input x(n) and output y(n). The lightly shaded area is the feedforward filter, and the darkly shaded area the leaky integrator.

y(n) = Σ_{l=1}^{N} a_l y(n - l) + Σ_{m=0}^{M} b_m x(n - m).    (3)

The ARMA system in (3) has M + N + 1 free weights. The appropriate values for the memory filter parameters W = {a_l, b_m} depend on the characteristics of x(n) and the processing goal. An implementation of the ARMA memory is shown in Figure 1. As mentioned before, the ARMA memory can be unstable for particular choices of the parameters W. Training of ARMA models is also a nontrivial problem. As a result, in practice it is common to use simpler structures, such as the tapped delay line (a.k.a. transversal filter), which can be written as

y(n) = Σ_{m=0}^{M} b_m x(n - m),    (4)

and the leaky integrator (a.k.a. context unit, memory neuron), which evaluates to

y(n) = a y(n - 1) + x(n).    (5)

The tapped delay line and the leaky integrator are outlined by the lightly shaded and more darkly shaded areas, respectively, in Figure 1. The tapped delay line memory filter is stable for all real values of b_m, and the leaky integrator is stable for |a| < 1.
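A quick numerical sketch of the two simple structures, the tapped delay line (4) and the leaky integrator (5); the impulse input is illustrative:

```python
def tapped_delay_line(x, b):
    """y(n) = sum_m b_m x(n-m), eq. (4); stable for all real b_m."""
    return [sum(b[m] * x[n - m] for m in range(len(b)) if n - m >= 0)
            for n in range(len(x))]

def leaky_integrator(x, a):
    """y(n) = a*y(n-1) + x(n), eq. (5); stable for |a| < 1."""
    y, out = 0.0, []
    for xn in x:
        y = a * y + xn
        out.append(y)
    return out

impulse = [1.0, 0.0, 0.0, 0.0]
print(tapped_delay_line(impulse, [0.5, 0.25]))  # [0.5, 0.25, 0.0, 0.0]
print(leaky_integrator(impulse, 0.5))           # [1.0, 0.5, 0.25, 0.125]
```

The contrast is visible directly: the tapped delay line's memory dies after M steps, while the leaky integrator's memory decays geometrically but never vanishes.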

In Principe et al. (1992), another memory filter was introduced, the generalized feedforward filter (GFF). In the GFF, the tap impulse response can be recursively computed from the previous tap by

g_k(n) = g(n) * g_{k-1}(n), for k ≥ 1,    (6)

FIGURE 2. The generalized feedforward filter.

where • is the convolution operation and ^o(^) is either a delta function or another predefined operator. In the ^-transformation domain, the same equation reads

G_k(z) = G(z)\,G_{k-1}(z). (7)

The memory traces in the GFF are computed by

x_k(n) = g(n) * x_{k-1}(n) (8)

and the filter output by y(n) = \sum_{k=0}^{K} w_k\, x_k(n). The generalized feedforward filter is shown in Figure 2.

In the generalized feedforward filter, the memory operator (kernel) g(n) is unspecified. Clearly, by inspection of Figure 2, if g(n) is stable, then the GFF is stable. Also, as we will outline in Section 3, the filter weights \{w_k\} can be adapted by standard feedforward adaptation algorithms such as least mean squares or recursive least squares.

The advantage of using generalized feedforward structures for memory filters is that g{n) may have an adaptable parameter set, and consequently, the memory traces can be optimized with respect to a performance criterion. Note that this freedom does not exist in the case of the regular tapped delay line filter.

The gamma memory filter is a special case of the generalized feedforward filter, where

g(n) = \mu\,(1-\mu)^{n-1} (9)

and g_0(n) = \delta(n). For the kth tap, the impulse response evaluates to

g_k(n) = \binom{n-1}{k-1}\,\mu^k (1-\mu)^{n-k}. (10)



FIGURE 3. The gamma memory, also called the gamma delay line. Reprinted with permission from Principe et al., 1994.

The functions g_k(n) happen to be discrete versions of the integrands of the gamma function (de Vries and Principe, 1992). They are complete in L2 space (i.e., one can approximate a finite energy signal arbitrarily closely as a weighted sum of these functions). An interesting property of this family is that the time axis is scaled by the parameter μ, which means that there is a change in time scale from the input to the memory traces. As we will see, the parameter μ can be adapted to minimize the mean square output error, thus finding an optimal time scale (w.r.t. MSE) to represent the input signal (or the signals in hidden layers, if gamma filters are applied to hidden nodes).
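To make (9) and (10) concrete, the following sketch (ours, in Python with NumPy) builds the tap impulse responses by the repeated convolution of eq. (6) and checks them against the closed form of eq. (10):

```python
import numpy as np
from math import comb

def gamma_kernels(mu, K, N):
    """Tap impulse responses g_k(n), n = 0..N-1, built by eq. (6):
    g_k = g * g_{k-1}, with g(n) = mu (1-mu)^(n-1) and g_0 = delta."""
    n = np.arange(N)
    g = np.where(n >= 1, mu * (1.0 - mu) ** (n - 1.0), 0.0)
    kernels = [np.zeros(N)]
    kernels[0][0] = 1.0                       # g_0(n) = delta(n)
    for _ in range(K):
        kernels.append(np.convolve(g, kernels[-1])[:N])
    return kernels

def gamma_closed_form(mu, k, n):
    """Eq. (10): g_k(n) = C(n-1, k-1) mu^k (1-mu)^(n-k); zero for n < k."""
    return 0.0 if n < k else comb(n - 1, k - 1) * mu**k * (1.0 - mu) ** (n - k)
```

Running both for, say, μ = 0.3 and K = 3 confirms that the cascade of identical leaky stages does realize the binomial-weighted kernels of eq. (10).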

In the z-domain, the gamma kernel becomes

G(z) = \frac{\mu}{z - (1-\mu)}. (11)

Figure 3 shows the gamma memory structure and its characteristics. As can be seen from Figure 3, when K = 1, the gamma memory reduces to the leaky integrator, and when μ = 1, the gamma memory becomes the tapped delay line. So the gamma memory unifies the tapped delay line and the leaky integrator into a single parametrized structure. In fact, the gamma memory is a delay line made of leaky integrators.

When the outputs of the gamma memory are linearly combined, we obtain a gamma filter (Principe and de Vries, 1992). The describing equations for the gamma filter are

x_0(n) = u(n), (12)

x_k(n) = (1-\mu)\,x_k(n-1) + \mu\,x_{k-1}(n-1), \quad k = 1, 2, \ldots, K,



y(n) = \sum_{k=0}^{K} w_k\, x_k(n).

The gamma filter is a generalization of the linear combiner, and when the weights are adapted to minimize the output mean square error, it extends the Adaline (de Vries et al., 1991). The gamma filter is also a generalization of the FIR synapse as defined by Wan (1994) and is a building block for gamma locally recurrent networks. Sometimes it is useful to write (12) as a state space model, as

x_n = [I + \mu A]\,x_{n-1} + u_n, (13)

y(n) = w_n^T x_n,

where x_n = [x_0(n), x_1(n), \ldots, x_K(n)]^T is the state vector, u_n = [u(n), 0, \ldots, 0]^T is the input vector,

A = \begin{bmatrix} -1 & 0 & \cdots & 0 \\ 1 & -1 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 1 & -1 \end{bmatrix}

is a state transition signature matrix, w_n = [w_0(n), w_1(n), \ldots, w_K(n)]^T is the filter weight vector, and y(n) is the filter output. The state equation can be compacted to x_n = \bar{A}\,x_{n-1} + u_n, where \bar{A} = I + \mu A. Hence, the gamma filter can be written as a linear model in state space.
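A minimal sketch of the recursion (12) (ours, in Python with NumPy; not the authors' code) shows the unifying behavior claimed above: μ = 1 turns the taps into an ideal delay line.

```python
import numpy as np

def gamma_memory(u, K, mu):
    """Run the gamma delay line of eq. (12):
    x_0(n) = u(n);  x_k(n) = (1-mu) x_k(n-1) + mu x_{k-1}(n-1)."""
    X = np.zeros((len(u), K + 1))       # row n holds the state vector x_n
    for n in range(len(u)):
        X[n, 0] = u[n]
        if n > 0:
            for k in range(1, K + 1):
                X[n, k] = (1.0 - mu) * X[n - 1, k] + mu * X[n - 1, k - 1]
    return X
```

A gamma filter output is then y = X @ w. With μ = 1 the recursion collapses to x_k(n) = x_{k-1}(n-1), i.e., x_k(n) = u(n-k), the tapped delay line; with K = 1 and 0 < μ < 1, each tap is a leaky integrator of the previous one.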

In nonlinear neural networks it is common to take

y(n) = \sigma(w_n^T x_n + b(n)), (14)

where \sigma(\cdot) is a nonlinear squashing function and b a bias term. A gamma locally recurrent neural net, then, is a nonrecurrent (no loops) circuit of nonlinear gamma filters where y(n) is computed by (14).

2.1 Analysis of Depth and Resolution

It is interesting to compare gamma delay lines with regular tapped delay lines and leaky integrators with respect to their properties as memory devices. First, let us quantify the notion of memory depth. As a convenient



measure of memory depth for a Kth order gamma memory we take the first moment (mean value) of the last (Kth) delay kernel in the memory. Such a measure can be interpreted as the mean sampling time for the last tap. The mean memory depth D for the Kth order memory is thus defined as

D \triangleq \sum_{n=0}^{\infty} n\,g_K(n) = -z\,\frac{dG_K(z)}{dz}\bigg|_{z=1} = \frac{K}{\mu}. (15)

Next we define the (temporal) resolution R of the memory as the number of degrees of freedom (i.e., the number of tap variables) per unit of time in the memory. This is equivalent to the number of taps (K) divided by the mean memory depth D. Thus

R = \frac{K}{D} = \mu. (16)

Clearly, there is a resolution versus memory depth trade-off in a linear memory structure for fixed order K. Such a trade-off is not possible in a nondispersive tapped delay line, since the fixed choice of μ = 1 sets the depth and resolution to D = K and R = 1, respectively. However, in the gamma memory, depth and resolution can be adapted by variation of μ. The choice μ = 1 represents a memory structure with maximal resolution and minimal depth. In this case, the order K and depth D of the memory are equal. Thus, when μ = 1, the number of weights equals the memory depth. Very often this coupling leads to overfitting of the data set (using parameters to model the noise). Hence, the parameter μ provides a means to uncouple the memory order and depth.
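A quick numeric illustration of (15) and (16) (a sketch of ours):

```python
def depth_and_resolution(K, mu):
    """Eq. (15): D = K/mu; eq. (16): R = K/D = mu."""
    D = K / mu
    R = K / D
    return D, R

# mu = 1 recovers the tapped delay line: depth K, resolution 1.
print(depth_and_resolution(5, 1.0))    # (5.0, 1.0)
# Lowering mu deepens the memory at the cost of resolution.
print(depth_and_resolution(5, 0.5))    # (10.0, 0.5)
```

With 5 taps and μ around 0.5, the memory reaches roughly twice as far into the past as the tapped delay line of the same order, at half the resolution.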

3 The Gamma Neural Network

Equation (14) represents the input-output map of a nonlinear processing element (PE), which we call the gamma PE (Figure 4). The input is fed to the gamma memory, and the taps are linearly combined to produce the PE output. This PE can be considered an extension of the well-known McCulloch-Pitts (M-P) neuron model. The M-P PE is a static model of the neuron, while the gamma PE includes a dynamic component modeling the integration over time that is known to occur at the dendritic tree. The gamma neural network is any multilayer feedforward interconnection of nonlinear PEs and gamma PEs. When the gamma PEs are restricted to the input layer, the topology is called the focused gamma neural network.



FIGURE 4. The gamma PE and a one hidden layer gamma network.

3.1 Training the Gamma Network

Training of a gamma neural network consists of adaptation of w and μ. If we use gradient descent learning, the general procedures known as real time recurrent learning (RTRL; Williams and Zipser, 1989) and backpropagation through time (BPTT; Werbos, 1990) can be applied without restrictions to update both w and μ.

As discussed in de Vries and Principe (1992), we assume that the network consists of a feedforward circuit of the filters described by

x_n = [I + \mu(n) A]\,x_{n-1} + u_n, (17)

y(n) = \sigma(w_n^T x_n + b(n)).

Assume that either BPTT or RTRL has been used to compute an estimated error e(n) for the output y(n) of (17). Defining the local cost as E(n) = \frac{1}{2} e^2(n), we can compute the gradients with respect to the weights as follows:

\frac{\delta E(n)}{\delta w_n} = -x_n\, e(n)\, \sigma'(net(n)), (18)

where we assumed \frac{\delta e(n)}{\delta y(n)} = -1 and defined net(n) = w_n^T x_n + b(n). For \mu, we derive

\frac{\delta E(n)}{\delta \mu(n)} = -w_n^T p_n\, e(n)\, \sigma'(net(n)), (19)




FIGURE 5. Normalized mean square error for different orders K. Reprinted with permission from Principe et al., 1993. © 1993 IEEE.

p_n = [I + \mu(n) A]\,p_{n-1} + A\,x_{n-1},

where we defined p_n = \frac{\delta x_n}{\delta \mu}.

It is clear that for intricate topologies of gamma memories, the update equations for w and μ become cumbersome to derive on paper. However, in practice these derivations are not necessary. Nowadays, object-oriented simulation environments take care of the derivation of the update equations when the dual topology concept is utilized to implement them. The user only has to specify the local "forward" equations and the global connectivity pattern of the network. For example, the NeuroSolutions package implements arbitrarily connected gamma neural nets (also globally recurrent) as a standard option (Lefebvre and Principe, 1993; NeuroSolutions, 1994).
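For the simplest case of a single linear gamma filter (σ equal to the identity, so σ'(·) = 1), the weight and μ updates together with the sensitivity recursion for p_n can be sketched as follows (our illustration in Python with NumPy, not the NeuroSolutions implementation; the step sizes are arbitrary choices):

```python
import numpy as np

def adapt_gamma_filter(u, d, K=3, mu=0.5, eta_w=0.02, eta_mu=0.002):
    """Jointly adapt the weights w by LMS, eq. (18), and mu by the
    gradient of eq. (19), with p_n = dx_n/dmu computed recursively."""
    w = np.zeros(K + 1)
    x = np.zeros(K + 1)       # gamma taps x_n
    p = np.zeros(K + 1)       # sensitivities p_n = dx_n/dmu
    sq_err = np.zeros(len(u))
    for n in range(len(u)):
        x_new = np.empty(K + 1)
        p_new = np.zeros(K + 1)
        x_new[0] = u[n]       # tap 0 is the input, so its sensitivity is 0
        for k in range(1, K + 1):
            x_new[k] = (1.0 - mu) * x[k] + mu * x[k - 1]
            # differentiate the tap recursion with respect to mu:
            p_new[k] = (1.0 - mu) * p[k] + mu * p[k - 1] + (x[k - 1] - x[k])
        x, p = x_new, p_new
        e = d[n] - w @ x
        w = w + eta_w * e * x                               # eq. (18)
        mu = float(np.clip(mu + eta_mu * e * (w @ p), 0.05, 1.95))  # eq. (19)
        sq_err[n] = e * e
    return w, mu, sq_err
```

Driving this sketch with white noise and a target generated by a reference gamma filter shows the squared error decreasing as both w and μ adapt.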

Since the parameter μ is part of a feedback loop, we found that the mean squared error performance surface has several minima as a function of μ. This sometimes leads to convergence to a local minimum when gradient descent is used. Figure 5 shows an example of the performance curve computed analytically with Mathematica for the identification of a third-order elliptic filter with the gamma filter.



FIGURE 6. Adaptation of μ for different memory sizes (5-tap gamma: final μ = 0.53905; 3-tap gamma: final μ = 0.306825).

4 Applications of the Gamma Memory

4.1 Control of Memory Depth in an Identification Problem

To get some intuition on how different a recurrent memory is from a tapped delay line, we present the following problem. We wish to construct a dynamic neural network that will double the frequency of an input sinusoid. The neural network consists of a focused gamma network with one hidden layer, and with an input layer formed by a gamma delay line with 5 taps. We use 2 tanh nodes in the hidden layer and 1 linear output node. The input signal was a sine wave with 40 samples per period, and the target was another sine wave with 20 samples per period. Backpropagation through time over 80 samples is utilized to adapt all the weights, including the parameter μ of the gamma memory. This simulation was carried out using NeuroSolutions. Figure 6 shows the μ track while the dynamic net is adapting.

Notice that the μ parameter starts at 1 (the default value, which corresponds to the tapped delay line). The value decreases to 0.54 in 150 iterations, yielding a mean memory depth of about 9 samples (D = K/μ) and a final MSE of 0.0005. This means that with 5 taps the system is actually processing information corresponding to 9 samples, which is beyond the 5-tap limit. This memory depth was found through adaptation of μ.

Next we reduced the size of the gamma memory from 5 taps to 3 taps, while keeping the network architecture and task the same. This time μ converged to 0.3, yielding a mean depth of about 10 samples and a final MSE of 0.0007, which is the optimal value.

In both cases, the mean memory depth converged to about the same



value. Apparently, the adaptive system was able to compensate for the smaller number of taps by decreasing the value of the parameter μ, thus achieving a similar memory depth. The memory resolution for the 3-tap system is worse than for the 5-tap system. A regular tapped delay line filter (μ = 1) with 5 taps converges to an MSE of 0.09, and a 3-tap filter never solves this problem.

4.2 Linear Time Warping Control with the Gamma Memory

We performed the following experiment to show that the gamma memory can compensate automatically for linear time warping. We adapted w and set μ = 0.5 for a gamma filter with 4 taps. The input signal was white Gaussian noise and the target a low-passed version of this signal. After convergence, the w vector was fixed. In the second phase of the experiment, the input signal is decimated so as to mimic a linear warping of the time axis. We want to find out if the gamma filter can compensate for time warping by adjusting μ while w remains unchanged. Figure 7 shows a graph of the value of μ found through adaptation for eight decimation (and interpolation) ratios between 0.5 and 2. As we can expect from the memory depth equation (15), the recursive parameter is linearly related to the time scale. This relation is experimentally demonstrated in Figure 7. μ changed from the initial value of 0.5 to the range 0.2 to 1.1, in a linear fashion as expected. In conclusion, the parameter μ, if continuously adapted, can compensate for time warping.

4.3 Other Applications

When utilized as a linear adaptive filter, the gamma filter extends Widrow's Adaline (de Vries et al., 1991) and results in a more efficient filter for echo cancellation (Palkar and Principe, 1994), system identification (Motter and Principe, 1994; Tsoi and Back, 1994), and nonlinear prediction (Mozer, 1994). Preliminary results with the gamma memory in isolated word recognition also showed that the performance of the system improved when μ is different from 1 (i.e., when it is not the tapped delay line) (Principe and Tracey, 1994). Renals (1994) also showed that the gamma memory can be used advantageously as the front end of hidden Markov models for speech recognition. The gamma memory has also been utilized in noise reduction applications to stop the training of nonlinear predictors before the noise distorted the dynamics (Kuo and Principe, 1994a, 1994b), and in a new embedding of time series for nonlinear dynamical analysis, where it would reduce noise and select the appropriate time delay for the reconstruction (Kuo and Principe, 1993).



FIGURE 7. Input signal, value of μ to compensate warping, and Adaline μ used in the experiment.

5 Interpretations of the Gamma Memory

One of the open issues is how to choose the best basis set for a memory for a given application (Back and Tsoi, 1995). Without a methodology to select a memory kernel, it seems very important to formulate the function of the memory from different points of view to guide the selection of the basis. We have investigated the representation provided by the gamma basis during adaptation, as a state space embedding, as a representation in terms of Taylor series, and as a multiscale interpretation. Hopefully, knowledge of the input signal can be cast in one of these frameworks and will help the designer select one memory kernel versus another. The discussion will be restricted to the gamma memory, but it can be extended to the other memories presented later in this chapter.

5.1 Vector Space Interpretation of the Gamma Filter Adaptation

The vector space interpretation, where a signal x(n) is approximated by a weighted sum of signals x_k(n), is presented in full detail in (Principe et al., 1994) and will be briefly reviewed here. These signals are the basis of the vector space. Let us present the most familiar connectionist memories in this framework. The context unit represents a projection of the



FIGURE 8. μ changes the relative position of the manifold to the signal vector.

large-dimensionality input signal x(n) onto a single basis function, which is the convolution of (10) with the input. As can be expected, this representation compromises the information preserved in the memory trace. Changing μ to minimize the output MSE means that one is finding the best projection of x(n) onto a single basis vector, i.e., onto a line. This representation is appropriate when one wants long memories but low resolution.

Likewise, in the tap delay line, we are projecting x(n) onto a memory space that is uniquely determined by the input signal; i.e., once the input signal x(n) is set, the axes become x(n-k), and the only degree of freedom is the memory order K. This memory structure has the highest resolution but lacks versatility, since one can improve the input signal representation only by increasing the order of the memory. In terms of versatility, the simple context unit is better (or any memory with a recursive parameter), since the neural system can adapt the parameter μ to better project the input signal. The memory depth is changed without changing the topology.

We recently proved that the gamma basis in continuous time represents a rigid memory space, even when the parameter μ is changed to minimize the output mean square error (Celebi and Principe, 1995). This means that the relative angle among the basis vectors does not change with μ. Hence, a decrease in the error must be associated with a decrease in the relative angle between the input signal and the projection space. So at least for the case of a white noise input, the recursive parameter in the gamma structure changes the span of the memory space with respect to the input signal (which can be visualized as a relative rotation between the input signal and the projection space). In terms of time domain analysis, the recursive parameter finds the length of the time window (the memory depth) containing the relevant information to decrease the output mean square error.




FIGURE 9. State space reconstruction from the memory outputs.

5.2 State Space Interpretation of the Gamma Memory

Let us for a moment shift our attention from the time series to the dynamical system that produces it. The system state will change with time, describing a trajectory in a multidimensional space called the state space. This time evolution defines the dynamical system. How can we reconstruct some of the properties of this state space evolution from the time series? Takens (Takens, 1981) proved that some properties (dynamic invariants) can be preserved if the time series is embedded into a sufficiently large reconstruction space (N, the size of the space, should be at least 2D+1, where D is the number of degrees of freedom of the dynamical system (Whitney, 1936)). He proposed that the point coordinates of the reconstructed trajectory be read as N-tuples from the time series. For instance, for a reconstruction in a 3-D space, consecutive 3-tuples of the time series should be read together, with the first time series sample being the x coordinate of the first point in the reconstruction space, the 2nd sample the y coordinate, the 3rd sample the z coordinate, the 4th sample the x coordinate of the second point, etc. (Figure 9).

When the points of the reconstructed space are connected, a trajectory is found from which properties of the original dynamical system can be estimated (such as dimension and Liapunov exponents). This means that many of the important dynamical properties of the original signal are preserved in this reconstruction.
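The N-tuple reading just described is easy to state in code. A small sketch of ours (Python with NumPy; `dim` plays the role of N, and `tau` generalizes the reading to samples taken more than one step apart):

```python
import numpy as np

def delay_embed(x, dim, tau=1):
    """Read the reconstruction-space points as dim-tuples of samples
    taken tau apart: point i is (x[i], x[i+tau], ..., x[i+(dim-1)*tau])."""
    x = np.asarray(x)
    n_points = len(x) - (dim - 1) * tau
    return np.array([x[i : i + (dim - 1) * tau + 1 : tau]
                     for i in range(n_points)])
```

For a 3-D reconstruction with consecutive samples (tau = 1), successive rows are overlapping 3-tuples of the series, exactly the coordinates sketched in Figure 9.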

What is interesting is that the Takens embedding is naturally implemented by generalized feedforward structures. In fact, a TDNN with 3 taps provides x(n), x(n-1), and x(n-2), which are exactly the coordinates



needed to reconstruct the points of the trajectory in the reconstruction space. Sauer et al. (1991) recently generalized the Takens embedding for linear filters, which correspond to the memory filter class. We have also proposed an embedding based on the gamma memory kernel (Kuo and Principe, 1994b).

This alternative view of memory structures is very enlightening, because it shows that the function of the memory at the first layer of a neural network is to provide the representation space to reconstruct the time evolution of the state of the system that produced the time series. Then the nonlinear PEs in the hidden layers extract the relationships that characterize the dynamics. This architecture has been extensively explored in time series modeling with neural networks (Lapedes and Farber, 1987). Even if the memory is placed at the hidden layers, the interpretation is the same — reconstruction of the dynamics in the projection space constructed by the nonlinear PEs.

Invoking topological arguments by Takens and Whitney, this methodology suggests setting the size N of the memory at twice the dimension D of the dynamical system that produced the time series, plus one. The dimension of the dynamical system can be estimated from the time series using, for instance, the correlation dimension algorithm (Grassberger and Procaccia, 1983). Unfortunately, this is not as straightforward as it may seem, because in Takens's embedding theorem there is an unspecified parameter, the delay parameter τ. Experimentally it was verified that the quality of the reconstructed trajectories varies tremendously with the choice of the delay parameter (Albano et al., 1987), (Theiler, 1990). There are experimental methods to determine τ that are based on the linear and nonlinear correlation time of the input signal (Fraser and Swinney, 1986). In our previous example, τ = 1, which is normally a very poor choice. For other τ's, the memory depth should span the time interval

N = (2D + 1)\tau. (20)

This required depth may produce very large input layers, which produce very large networks. Although this is the time span where there are meaningful correlations, using the arguments of signal reconstruction, the first hidden layer PEs only need to receive 2D+1 connections from the input layer. So one should use uniformly sparse connections in the layer connecting the input taps to the first hidden layer PE (i.e., the nonzero connections should be taken τ samples apart). One can use larger reconstruction spaces, i.e., more than 2D+1 connections, but this would complicate the learning unnecessarily. We have proposed a method to determine the product D × τ experimentally (Kuo and Principe, 1994b).

Recursive memory structures also provide a natural method to embed experimental time series. Note that the time delay between the taps in the



gamma memory filter is controlled independently by μ. At the same time, a soft lowpass filtering is performed that can filter unwanted noise without affecting the dynamics. So we can select the size of the gamma memory equal to N, and let μ select the best τ. The interesting question is how to find a training paradigm that adapts μ to the τ value.

5.3 Gamma Memories as Implementation of Taylor Series

In this section we will bring a new interpretation to the contents of the gamma memory by relating them to the input signal spectrum. We will show how the information at the taps of this memory can be used as an alternative time-frequency representation.

The memory traces can be regarded as the moments of the input signal. Actually, an alternative name used in the control and signal processing community for the contents of the gamma memory is the Poisson moment (Saha, 1982).

Let's expand the convolution sum of (8) using the definition of the gamma kernel given in (10):

x_k(n) = \mu^k \sum_{m=-\infty}^{\infty} x_W(m,n) \binom{m-1}{k-1}, (21)

where x_W(m,n) is the delayed, inverted, and windowed version of the input signal x(n), formulated as

x_W(m,n) = x(n-m)\,(1-\mu)^{m-k}\,u(m-k)\,u(n-m), (22)

where u(n) is the unit step. The term (1-\mu)^{m-k} u(m-k) can be regarded as a decaying window whose effective width is controlled by the parameter μ. This window is responsible for emphasizing the recent past of the signal and thereby achieving locality in time, which we think is vital in processing time-varying signals. In (22) there is also a rectangular window due to the terms u(n-m)u(m-k), but its effect can be safely ignored for n-k larger than the decaying window width.

As far as the magnitude spectrum goes (which is generally the main concern in speech recognition problems, for example), delay and sign inversion have no effect on the signal spectrum. The decaying window, however, has the side effect of blurring the spectrum of the original signal. Hence, |X_W(e^{j\omega})| can be seen as a low resolution approximation to the recent magnitude spectrum |X(e^{j\omega})| of the original signal. Having established the relationship between the magnitude spectra of x_W(m,n) and x(n), let's go back to (21), which may be rewritten as

x_k(n) = \frac{\mu^k}{(k-1)!} \sum_{m=-\infty}^{\infty} x_W(m,n)\,[(m-1)(m-2)\cdots(m-k+1)]. (23)



Differentiating the z-transform of x_W(m,n) iteratively gives

x_k(n) = \frac{\mu^k}{(k-1)!}\,\frac{d^{k-1} X_W(z,n)}{dz^{k-1}}\bigg|_{z=1}. (24)

Here, the term X_W(z,n) can be interpreted as a short-term z-transform that takes into account only the recent values of x(n). Expanding (24) and writing it in matrix form, one gets

\begin{bmatrix} x_1(n) \\ \vdots \\ x_K(n) \end{bmatrix} = A\, D\, X_W(z,n)\big|_{z=1}, (25)

where A is a K × K lower-triangular matrix of constant coefficients a_{km} (collecting the \mu^k/(k-1)! factors and the expansion of the falling products in (23)) and D = [1, d/dz, \ldots, d^{K-1}/dz^{K-1}]^T is the vector of differential operators applied to X_W(z,n).

Examining (25), the memory traces x_k(n) can be recognized as the linear sum of the Taylor series coefficients of X_W(z,n) when the series expansion is done at zero frequency. In other words, the contents of the gamma memory correspond to the derivatives of the spectrum at z = 1. Hence, they may certainly be used to estimate the recent magnitude spectrum of the original signal near zero frequency. In that sense, the memory traces can be regarded as a cost-effective time-frequency representation. However, one should be wary of the finite region of support of the Taylor approximation. At frequencies away from zero frequency, the Taylor series approximation diverges rapidly. For that reason, the memory traces make only a local representation of the spectrum. Representations at other frequencies can be obtained either by frequency shifting the signal spectrum or by employing gamma II structures (gamma memory with complex μ) or bandpass filters (Principe and Tracey, 1994). With this approach, the entire magnitude spectrum can be represented as a vector of memory traces that are obtained by concatenating the outputs of several gamma II structures, each tuned to a different band.

(If a continuous-time gamma memory were used, the matrix A would reduce to the identity matrix.)

FIGURE 10. Moment reconstruction of the spectrogram: comparison of the FFT with the moment reconstructed power spectrum.

As an example of the power of this representation, we display the spectrograms of the word "greasy" from the TIMIT database (sa1.wav) obtained by the conventional short-term Fourier transform (STFT) technique and by a piecewise polynomial approximation that uses the input memory traces as its coefficients. In obtaining the memory traces we used 16 gamma II filters, each with 4 taps and tuned to frequencies in the range 0 to 4 kHz. The proposed technique preserves the main features of the conventional spectrogram method. Hence, the memory traces are time-frequency representations of temporal patterns. Whenever a gamma memory is used in the neural net, it will represent the spectrum of its input around zero frequency, with each tap estimating a higher derivative of the Taylor series of the spectral band coefficient. This feature is particularly appealing for speech recognition, where researchers have shown the importance of using the derivatives of the spectral coefficients for good performance.

5.4 Gamma Memories as Multiresolution Representations

Another view of the gamma memory is as a multiresolution representation in the tap order domain and in the delay parameter (μ) domain. Multiresolution representations such as wavelets perform a complete decomposition in a scale-translation domain; i.e., the signal information is preserved in the coefficients of the wavelet representation. An alternative to the translation domain is the delay domain of the gamma basis, because it effectively spans the time axis just as the translation operation does, although with a change in waveshape. The different waveshapes produce a better approximation of the signal closer to the present time, which is recommended for on-line operation.

The gamma memory has basically two parameters: μ, which controls the delay scale, and k, the tap order. Up to now we sought to adapt the delay scale μ to best represent the signal of interest. An alternate approach is to consider μ as a parameter space that can be discretized and onto which several different versions of the input signal will be projected. Together with k, the tap order, these versions will constitute a multiresolution representation in delay and tap order. The interesting question is: Can we represent signals with the gamma memory kernel without loss of information in this multiresolution representation? If one can show that this multiresolution representation is a wavelet, then the substantial body of wavelet mathematics guarantees that the representation is complete.

We will present the development in continuous time. The gamma kernel in continuous time reads (de Vries and Principe, 1992)

g_k(t) = \frac{\mu^k}{(k-1)!}\, t^{k-1} e^{-\mu t}, \quad k = 1, \ldots, K \ (\mu > 0). (26)

Since we are seeking a representation that is parametric in the delay, we will drop the dependence of g_k(t) on μ; i.e., we are defining a multiresolution generating function given by

g_k(t) = \frac{t^{k-1}}{(k-1)!}\, e^{-t}, \quad k = 1, \ldots, K. (27)

Due to the shape of the gamma kernels, which are all positive, we can immediately say by the admissibility condition (Daubechies, 1992) that (27) is not a wavelet. The admissibility condition requires that

\int_{-\infty}^{\infty} \frac{|\hat{\gamma}(\omega)|^2}{|\omega|}\, d\omega < \infty. (28)

So the memory arrangement has to be slightly modified. The generating function for the multiresolution representation is obtained by computing the difference of consecutive memory taps

\gamma_k(t) = g_k(t) - g_{k-1}(t), (29)

which means that

\gamma_k(t) = \frac{t-(k-1)}{(k-1)!}\, t^{k-2} e^{-t}, \quad k = 1, \ldots, K. (30)

One can show that \gamma_k(t) now obeys the admissibility condition (28) for a wavelet basis. This wavelet is implemented by discretizing the μ parameter to lead to a bank of modified gamma memories according to (29); i.e.,




FIGURE 11. A wavelet representation made of a parallel bank of gamma memories with fixed but different μ's.

where the multiresolution signals are a difference of two consecutive gamma memory taps. There are several parallel memory structures with K+1 taps but different μ's (Figure 11).

The importance of this view for neural networks is the following. Instead of adapting the time scale μ as we presently do, an alternative approach is to utilize multiple memory structures with fixed μ's. This scheme also preserves the information of the input, since it is a wavelet representation. It may seem a waste of memory blocks, but in classification of temporal patterns, where the adaptation of μ is nontrivial, this arrangement circumvents the problem. We are currently exploring this representation in practical problems. This prefixed arrangement, in fact, resembles the method proposed by Hopfield for speech recognition (Tank and Hopfield, 1987), except that our functions are recursively computed.

6 Laguerre and Gamma II Memories

There are many other linear filters fitting our definition of generalized feedforward filters that can be used as memory structures for neural networks. A recent paper by Back and Tsoi (1995) presents several variations of locally recurrent networks with applications. We present here Laguerre and gamma II filters, two structures that are closely related to gamma structures.




FIGURE 12. The block diagram for the Laguerre memory. Reprinted with permission from Principe et al., 1994.

6.1 Laguerre Memories

Laguerre functions are intimately related to the gamma structures. In the z-domain, the ith Laguerre function is defined by

Li(z,/i) = >/l - (1 - /i)-( l - ( l - / i ) z - i ) -

(31)

which can be decomposed as a low pass filter GQ{Z) = i-h-u.)z-'^ ^ followed

by a cascade of i — 1 similar all-pass sections G{z) = ^zTiz^J^ypT- A block diagram of the Laguerre memory is shown in Figure 12.

Like the gamma memory, the Laguerre memory has one free parameter μ. In fact, it can be shown that a Gram-Schmidt orthogonalization of the gamma memory leads to the Laguerre memory (Celebi and Principe, 1995). Figure 13 compares the Laguerre and the gamma kernels.

The Laguerre kernels are orthogonal. As a result, for uncorrelated (white) input signals, the tap signals are also uncorrelated in the Laguerre filter. This is not the case for gamma filters, and therefore in general the filter weights adapt somewhat faster for Laguerre filters, depending on the input signal correlation matrix (Silva et al., 1994). Gamma kernels, on the other hand, use fewer computational resources (half the number of additions) than the Laguerre filter for the same filter order. Also, whereas traditionally Laguerre filters have been used with fixed μ (e.g., Wahlberg, 1991), the gamma filter framework has introduced LMS-adaptive μ to the Laguerre filters.
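The low-pass-plus-all-pass decomposition above can be sketched directly from the two difference equations (the μ, memory order, and signal length are arbitrary illustrations):

```python
import numpy as np

def laguerre_memory(x, K, mu):
    """Tap signals of a Laguerre memory (a sketch of Eq. (31)'s
    decomposition): a low-pass stage G0(z) = sqrt(1 - a^2)/(1 - a z^-1)
    with a = 1 - mu, then all-pass sections (z^-1 - a)/(1 - a z^-1)."""
    a = 1.0 - mu
    taps = np.zeros((len(x), K + 1))
    # low-pass front end: y[n] = a*y[n-1] + sqrt(1 - a^2)*x[n]
    for n in range(len(x)):
        y1 = taps[n - 1, 0] if n else 0.0
        taps[n, 0] = a * y1 + np.sqrt(1.0 - a * a) * x[n]
    # cascade of all-pass sections: y[n] = a*y[n-1] + u[n-1] - a*u[n]
    for k in range(1, K + 1):
        for n in range(len(x)):
            y1 = taps[n - 1, k] if n else 0.0
            u1 = taps[n - 1, k - 1] if n else 0.0
            taps[n, k] = a * y1 + u1 - a * taps[n, k - 1]
    return taps

# The impulse-response kernels of the taps are orthonormal, which is
# why the tap signals stay uncorrelated for white inputs.
x = np.zeros(400); x[0] = 1.0
L = laguerre_memory(x, K=3, mu=0.3)
gram = L.T @ L    # approximately the 4x4 identity matrix
```

The near-identity Gram matrix is the orthogonality property the text appeals to when arguing that Laguerre weights adapt faster than gamma weights for correlated inputs.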


FIGURE 13. Gamma and Laguerre kernels.

6.2 Gamma II Memories

The gamma memory has a multiple pole that can be adaptively moved along the real axis of the z domain; i.e., the gamma memory can implement only low-pass (0 < μ < 1) or high-pass (1 < μ < 2) transfer functions. For some applications, however, a resonant memory structure that favors storage of a certain frequency band is desired. For example, if we want to store the output of a filter bank, a common preprocessor in speech classification, a resonant memory structure seems adequate. Resonating circuits require filters of at least second order.

The gamma II structure, which is shown in Figure 14, adheres to the philosophy of the generalized feedforward filter. Thus the memory parameters ν and μ are global, and the feedback is local between taps. As a result, LMS for the gamma II filter also scales as O(K), and the stability conditions remain trivial.

Next, a few properties of the gamma II memory element are derived. The transfer function G_II(z) of this structure (the tap-to-tap delay operator) evaluates to

G_II(z) = (1 + ν)G(z) / (1 + νG²(z)) = μ(1 + ν)[z − (1 − μ)] / ([z − (1 − μ)]² + μ²ν).    (32)

Thus G_II(z) has a zero at z₀ = 1 − μ and poles at z_p = (1 − μ) ± jμ√ν.

The forward gain factor 1 + ν ensures the normalization of the gamma II delay element. Similar to the gamma I operator, we have

Σ_{n=0}^∞ g_{II,k}(n) = G_{II,k}(z)|_{z=1} = [G_II(1)]^k = 1.    (33)

In order to derive the stability condition, we assume μ > 0 and ν > 0. Then the system is stable if


FIGURE 14. The gamma II memory structure. G(z) represents the gamma kernel. Reprinted with permission from Principe et al., 1994.

(1 − μ)² + μ²ν < 1.    (34)

Equation (34) can be reduced to μ(μ(1 + ν) − 2) < 0, and together with μ > 0 it follows that sufficient conditions for stability are given by

(0 < μ < 1) ∧ (0 < ν < 1).    (35)

The gamma II can be related to a broader class of functions called the Kautz functions (Kautz, 1954).
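A minimal discrete sketch of one gamma II element, taking G_II(z) = μ(1 + ν)(z − (1 − μ)) / ((z − (1 − μ))² + μ²ν) as the transfer function (an assumed reconstruction of Eq. (32)); the μ, ν values and signal length are arbitrary illustrations:

```python
import numpy as np

def gamma2_element(x, mu, nu):
    """One gamma II delay element, assuming
       G_II(z) = mu*(1 + nu)*(z - b) / ((z - b)^2 + mu^2 * nu),  b = 1 - mu,
    realized as a second-order difference equation."""
    b = 1.0 - mu
    y = np.zeros(len(x))
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0.0
        x2 = x[n - 2] if n >= 2 else 0.0
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        # feedback from the complex pole pair, plus the zero at b
        y[n] = 2 * b * y1 - (b * b + mu * mu * nu) * y2 \
               + mu * (1 + nu) * (x1 - b * x2)
    return y

mu, nu = 0.3, 0.8                       # inside the stability region (35)
h = gamma2_element(np.r_[1.0, np.zeros(499)], mu, nu)
# h sums to one (normalized DC gain), and the complex pole pair makes
# the impulse response a damped oscillation: the memory favors a band.
```

The sign changes in the impulse response are what distinguish this resonant element from the purely low-pass gamma I kernels.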

6.3 Two-Dimensional Gamma Kernels

The goal of the gamma memory was to create a signal processing structure that would display a variable memory depth for a fixed number of stages (taps), unlike the finite impulse response (FIR) filter. This was accomplished by introducing a local feedback loop around the delay operator, creating a generalized feedforward filter structure. The concept of a time warping parameter extrapolates to the spatial domain as a scale parameter that controls the region of support of the corresponding two-dimensional structure, which we call the 2-D gamma kernel. The 2-D gamma kernel is defined as

g_{k,μ}(n₁, n₂) = C · g_{k,μ}(t)|_{t = √(n₁² + n₂²)},    (36)

where the constant C is a normalization factor. The resulting 2-D gamma kernels have circularly symmetric shapes given by


FIGURE 15. The 2-D gamma kernels (k = 1, k = 15) for different values of the scale parameter μ.

g_{k,μ}(n₁, n₂) = (μ^(k+1) / (2πk!)) (√(n₁² + n₂²))^(k−1) e^(−μ√(n₁² + n₂²)),  (n₁, n₂) ∈ Ω,    (37)

Ω = {(n₁, n₂) : −N ≤ n₁, n₂ ≤ N},

where Ω is the region of support of the kernel, k the kernel order, and μ the parameter that controls the shape and scale of the kernel. Figure 15 depicts the characteristic of 2-D gamma kernels in the spatial domain. The first-order (k = 1) kernel has its peak at the pivot point (0,0) with an exponentially decaying amplitude. The gamma kernels with a higher order (k > 1) have peaks at the radius k/μ, creating concentric smooth rings around the pivot point. For a fixed kernel order, the radial distances where the kernels peak are still dependent upon the parameter μ, as in the 1-D case.
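The kernel construction can be sketched as follows; the grid size, order, and μ are arbitrary, and the discrete taps are simply renormalized to sum to one, playing the role of the constant C:

```python
import math
import numpy as np

def gamma_kernel_2d(k, mu, N):
    """2-D gamma kernel on the (2N+1) x (2N+1) support of Eq. (37).
    The continuous normalization mu^(k+1)/(2*pi*k!) is kept for form;
    the discrete stencil is then renormalized to unit sum."""
    n1, n2 = np.meshgrid(np.arange(-N, N + 1), np.arange(-N, N + 1))
    r = np.sqrt(n1 ** 2 + n2 ** 2).astype(float)
    g = (mu ** (k + 1)) / (2 * math.pi * math.factorial(k)) \
        * r ** (k - 1) * np.exp(-mu * r)
    return g / g.sum()

center = gamma_kernel_2d(k=1, mu=0.8, N=20)   # peaks at the pivot (0, 0)
ring = gamma_kernel_2d(k=8, mu=0.8, N=20)     # peaks on a ring around the pivot
i, j = np.unravel_index(ring.argmax(), ring.shape)
radius = np.hypot(i - 20, j - 20)             # ring radius, close to k/mu
```

Varying k and μ moves the ring in and out, which is exactly the "stencil size" degree of freedom exploited by the CFAR detector mentioned below.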

The 2-D gamma kernels are circularly symmetric, so they lose the property of completeness. Another slight disadvantage is that there is no recursive implementation, so they have to be implemented as 2-D FIR filters. Nevertheless, they provide a very convenient set of functions to estimate the image intensity statistics in the neighborhood of the pivot point. They have been utilized with good results to enhance the constant false alarm rate (CFAR) detector for synthetic aperture radar (SAR) imagery (Kim et al., 1996). The advantage of the gamma kernel is that the parameter μ, which controls the size of the stencil, can be adapted to optimize the discrimination between targets and clutter in the same way as the 1-D counterpart.

FIGURE 16. Implementation of the NL-QGD and its adaptation.

Figure 16 shows a neural network implementation of the nonlinear quadratic gamma detector (NL-QGD), which has been utilized effectively in automatic target recognition (Kim, 1996). Once again, one can recognize the 2-D gamma kernel as a preprocessor for an MLP, but this time the application is image processing.

7 Analog VLSI Implementations of the Gamma Filter

Since it is impossible to design an ideal delay line in continuous-time hardware, many analog designers believe that the best that can be done is to try to "approximate" the ideal delays using a cascade of low-pass filters. Figure 17 shows such a strategy using a cascade of transconductance amplifiers and capacitors. Indeed, such a technique is shown in Mead (Mead, 1989). We have studied exactly this structure (cascaded low-pass filters) as memory elements for adaptive filters and neural networks for many years. An irony is that this structure—called the gamma filter—generally outperforms the ideal delay line with the same number of taps, since the former provides a mechanism to let the network choose the most appropriate memory depth/resolution for the task at hand. This is easily done by adapting the memory depth using the output mean square error in training, as we discussed early in this chapter.

We have implemented the delay-line component of the gamma filter exactly as shown in Figure 17.

FIGURE 17. Continuous time 4 tap gamma filter.

Each stage consists of a transconductance amplifier connected as a follower, with its output driving a capacitor, realizing a first-order low-pass filter with a 3 dB frequency set by the time constant τ. The CMOS transamp is operated within the subthreshold region so that a large dynamic range of τ can be obtained. For speech processing applications, the necessary dynamic range of τ is from 100 Hz to 10 kHz, which can be achieved by an exponentially controlled bias voltage. The ideal and measured impulse responses for each tap of the gamma filter are very similar, as is to be expected (Juan et al., 1996).

The weight adaptation can be formulated as a parametric least-squares problem that accepts an iterative solution based on the LMS gradient descent method. In order to adapt the weights w_k, we use the following continuous-time gradient descent update:

τ_w (d/dt) w_k(t) = e(t) x_k(t),    (38)

where τ_w, the time constant of the weight update, is much larger than τ. This dynamic equation requires basic primitives such as adders, multipliers, and integrators, which we have efficiently implemented in analog VLSI. Although gradient descent can still be used to adapt the time-scale parameter τ, the equations are more difficult and yield a nonconvex optimization problem, so here τ will be preset.
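A discrete-time sketch of this adaptation loop; the one-pole low-pass plant, step size, and signal length below are illustrative stand-ins for the analog experiment, with a small step size η playing the role of the slow time constant τ_w:

```python
import numpy as np

rng = np.random.default_rng(0)

def plant(x, a=0.5):
    """Unknown system to identify; a one-pole low-pass filter is used
    here as a stand-in for the Sallen-Key circuit of the experiment."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = a * (y[n - 1] if n else 0.0) + (1 - a) * x[n]
    return y

def gamma_taps(x, K, mu):
    # the assumed gamma tap recursion used throughout the chapter
    g = np.zeros((len(x), K + 1))
    g[:, 0] = x
    for n in range(1, len(x)):
        for k in range(1, K + 1):
            g[n, k] = (1 - mu) * g[n - 1, k] + mu * g[n - 1, k - 1]
    return g

# Discrete counterpart of Eq. (38): w_k <- w_k + eta * e[n] * x_k[n]
x = rng.standard_normal(8000)        # flat-spectrum noise input
d = plant(x)                         # output of the unknown plant
taps = gamma_taps(x, K=4, mu=0.5)    # time scale preset, as in hardware
w = np.zeros(5)
eta = 0.02
for n in range(len(x)):
    e = d[n] - taps[n] @ w
    w += eta * e * taps[n]

mse = np.mean((d[-1000:] - taps[-1000:] @ w) ** 2)  # small after convergence
```

Only the weights are adapted here; as in the hardware, the time scale is preset rather than descended, sidestepping the nonconvexity noted above.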

We have experimented with our continuous-time adaptive gamma filter with this hardware on a simple system identification problem, where the goal is to identify the input-output relationship of an unknown linear system.

FIGURE 18. Desired (solid) and gamma filter output (dashed) in the beginning of adaptation (left figure) and after convergence.

In order to make the gamma filter more resistant to consistent offsets in each stage, we have used the difference between adjacent taps as the input to the multiplier. Since we used a differential-input multiplier, this was a very easy change. The circuitry is designed to minimize the squared error between the outputs of the two filters. The unknown system is a discrete analog circuit designed to implement a standard Sallen-Key low-pass filter. This type of problem is typically solved with adaptive filters, but rarely are continuous-time aspects considered.

The input to the system was a pseudo-random bit stream that was filtered to achieve a flat-spectrum noise input. The system is set up so that the gamma filter adapts its weights to minimize the mean square error between the filter output and the output of the unknown plant. Figure 18 shows the output of the filter during and after convergence. The gamma filter is not able to exactly match the Sallen-Key circuit because the two systems have different forms. However, the circuit does a good job of approximating the unknown plant. In (Juan et al., 1996) we also present the weight tracks to show convergence with different initial conditions.

8 Conclusions

This paper provides a review of an important class of neural topologies of intermediate complexity between purely feedforward nets and globally recurrent networks. Since the processing elements of locally recurrent networks (LRNs) are a mixture of a nonlinear PE and a linear memory filter, the processing role of the new PE can still be studied at the local level, as a projection of the PE activity onto a linear manifold that we called the memory space. The global input-output map is thus a nonlinear combination of local linear maps.

We present a generating function for the memory filters (the generalized feedforward structures) that unifies the conventional connectionist memory structures (the context unit, the time delay, and the gamma memory) and that motivates research into other kernels with the potential to outperform the presently studied memories.

In this line of research the outstanding problems are:

• The choice of the generating function, which is equivalent to the choice of the basis vectors.

• The size of the memory kernels.

• The adaptation of the recursive parameter that controls the time scale of the memory representations.

• The mapping power of this class of neural networks.

The choice of the generating function is intrinsically related to the problem of finding the best basis to represent a given signal, a problem that has no known general solution in signal processing. As long as the bases form a complete set and enough taps are utilized, the information contained in the input signal is preserved. For practical purposes, however, we would like to minimize the number of taps needed to solve our problem with good performance. In the case of LRNs, the representation of past information is only one component of the mapping, since the nonlinear PEs in the net play a determinative role in the overall input-output map and can compensate for a less than optimal representation of the past. So, in our opinion, the choice of the generating function is important, but it is not as determinative as in other domains. After all, engineers have used Fourier analysis extensively in spite of the fact that complex sinusoids are hardly ever a good model for real-world signals.

Criteria to help choose the size of the memory kernel are still needed. For applications related to signal representation, the view of memory as preserving information from the past is very appropriate. The class of recursive memories has the nice property of allowing the system to find the best depth/resolution compromise for a given memory order; they do the best they can with the available number of taps. But this does not mean that the choice of the number of taps is arbitrary. In linear system theory the problem of best model order is solved with statistical criteria such as those of Akaike and Rissanen, which can potentially be applied to set the number of basis functions. It would also be very interesting if new, incrementally adaptive methods were devised to grow the network topologies to best match the incoming signal properties.

We believe that the view of memory as the implementation of an embedding operation will be very important in the future. The advantage is that


one can utilize concepts from nonlinear dynamics to help describe signals in a different (probably more relevant) way, such as in terms of the properties of the system that created the time series. We have seen that this view is able to help us develop a criterion to set up the memory size.

The problem of the adaptation of the time scale is genuinely an adaptive system problem. Here we have a parameter that needs to be adapted, but the performance surface has many local minima. Moreover, there are applications such as classification, where the output mean square error is not an appropriate criterion to determine the time scale that best discriminates among a set of classes. So innovative ways to adapt the time scale of LRNs are necessary. Speech recognition is a key area that would benefit from these developments. We showed that the gamma memory has the potential to compensate for time warping. But this means that the recursive parameter must be adapted all the time, during both training and testing.

The universality of MLPs is a strong result that lends credibility and supports the continuing research interest in this topology. We believe that a similar result is also needed for the class of LRNs, so characterizing the functional mapping produced by LRNs seems a very important topic.

The major issue in analog neural design is how to build systems with enough precision and resolution using components that are fundamentally noisy and imprecise. Not only must we cope with noise in the input signals, but we must deal with noise in the computation itself. These considerations suggest that the Laguerre structure may provide the best low-precision implementation, since the signals at the output taps are less correlated than the corresponding signals in the gamma memory.

While analog systems are necessary to interface to the fundamentally analog world, they are limited both in their ability to implement large time constants and in their algorithmic flexibility. The fact that these two drawbacks of analog systems are exactly the strengths of their digital counterparts suggests that hybrid analog/digital systems will ultimately be necessary. The primary feedforward structures would be built in dedicated analog circuitry, while the mechanisms for updating the parameters and the choice of the learning scheme would be left to a slower digital processor. Using clever feature extraction techniques such as described in Section 5.3, the slow digital processor could potentially sample the analog outputs at a rate much slower than the Nyquist rate. Such hybrid systems provide a reasonable compromise between the constraints of analog and digital hardware.

There is also an inescapable link to biology that we would like to mention. Leaky integrators are pervasive in the central nervous system, both in the dendritic tree and in the response of the neurons. The analysis conducted in this paper sheds light on the use of delays from a signal processing point of view, as projection operators in a space controlled by the component through the feedback parameter.


As a final note we would like to point out that although all the work developed here on the extension of PEs with short term memory deals with supervised networks, there are strong reasons to believe that the incorporation of memory in unsupervised nets will allow the extension to time of some of the conventional unsupervised paradigms such as principal component analysis and Kohonen self-organizing nets. We are currently exploring this path.

Acknowledgments: This work was partially supported by NSF grant ECS-9208789 and ARPA/ONR N00014-94-1-0858. The authors also want to acknowledge many generations of graduate students of the Computational Neuroengineering Laboratory, who all contributed to the ideas presented in this chapter.

9 References

Albano, A., Mees, A., Guzman, G., and Rapp, P., "Data requirements for reliable estimation of correlation dimension," in Chaos in Biological Systems, Degn, Holden, and Olsen (Eds.), 207-220, Plenum, New York, 1987.

Back, A., and Tsoi, A., "FIR and IIR synapses, a new neural network architecture for time series modelling," Neural Computation, 3(3), 375-385, 1991.

Back, A. D., and Tsoi, A. C., "A comparison of discrete time operators for nonlinear system identification," in Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen (Eds.), 883-890, MIT Press, Cambridge, MA, 1995.

Celebi, S., and Principe, J., "Parametric least squares approximation using gamma bases," IEEE Trans. on Signal Processing, 43(3), 781-784, 1995.

Daubechies, I., Ten Lectures on Wavelets, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1992.

De Vries, B., and Principe, J. C., "The gamma model—a new neural model for temporal processing," Neural Networks, 5(4), 565-576, 1992.

De Vries, B., Principe, J., and Oliveira, P., "Adaline with adaptive recursive memory," Proc. 1991 IEEE Workshop Neural Networks in Signal Processing, 101-110, Princeton, NJ, 1991.

Elman, J. L., "Finding structure in time," Cognitive Science, 14, 179-211, 1990.

Fraser, A., and Swinney, H., "Independent coordinates for strange attractors from mutual information," Phys. Rev. A, 33, 1134, 1986.

Grassberger, P., and Procaccia, I., "Measuring the strangeness of strange attractors," Physica D, 9, 189-208, 1983.

Haykin, S., Adaptive Filter Theory, Prentice Hall, Englewood Cliffs, NJ, 1991.

Horne, B., and Giles, C. L., "An experimental comparison of recurrent neural networks," Advances in Neural Information Processing Systems 7 (NIPS-7), 697-704, 1995.

Jordan, M., "Attractor dynamics and parallelism in a connectionist sequential machine," Proc. 8th Annual Conf. on Cognitive Science, 531-546, Erlbaum, Hillsdale, NJ, 1986.

Juan, J., Harris, J., and Principe, J., "Analog VLSI implementations of continuous-time memory structures," Proc. IEEE Int. Symp. on Circuits and Systems, 338-340, Atlanta, GA, May 1996.

Kautz, W., "Transient synthesis in the time domain," IRE Trans. on Circuit Theory, 1, 29-39, 1954.

Kim, M., "Focus of attention based on gamma kernels for automatic target recognition," Ph.D. dissertation, University of Florida, 1996.

Kim, M., Fisher, J., and Principe, J., "A new CFAR stencil for target detection in SAR imagery," Proc. SPIE, 2757, 432-442, 1996.

Kuo, J.-M., and Celebi, S., "Adaptation of memory depth in the gamma filter," Proc. ICASSP 94, 5, 373-376, Adelaide, Australia, 1994.

Kuo, J.-M., and Principe, J., "Using the Poisson filter chain to reconstruct attractors," Proc. SPIE Conf. on Chaos and Nonlinearities, 2037, 59-65, 1993.

Kuo, J.-M., and Principe, J., "Noise reduction in state space using the focused gamma model," Proc. ICASSP 94, 2, 533-536, 1994a.

Kuo, J.-M., and Principe, J., "Reconstructed dynamics and chaotic time series modelling," Proc. IEEE World Congr. on Computational Intelligence (WCCI), 5, 3131-3136, Orlando, FL, 1994b.

Lang, K., Waibel, A., and Hinton, G., "A time delay neural network architecture for isolated word recognition," Neural Networks, 3(1), 23-44, 1990.

Lapedes, A., and Farber, R., "Nonlinear signal processing using neural networks: prediction and system modeling," Tech. Rep. LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos, NM, 1987.

Lefebvre, C., and Principe, J., "Object-oriented artificial neural network implementations," Proc. World Conf. on Neural Networks, IV, 436-439, Portland, OR, 1993.

Mead, C., Analog VLSI and Neural Systems, Addison-Wesley, Reading, MA, 1989.

Motter, M., and Principe, J., "A gamma memory neural network for system identification," Proc. IEEE World Congr. on Computational Intelligence (WCCI), 5, 3232-3237, Orlando, FL, 1994.

Mozer, M., "Neural architectures for temporal sequence processing," in Time Series Prediction: Forecasting the Future and Understanding the Past, Weigend and Gershenfeld (Eds.), Addison-Wesley, Reading, MA, 1994.

NeuroSolutions User's Manual, NeuroDimension, Inc., Gainesville, FL, 1994.

Palkar, M., and Principe, J., "Echo cancellation with the gamma filter," Proc. ICASSP 94, 3, 369-372, Adelaide, Australia, 1994.

Principe, J., de Vries, B., and Guedes de Oliveira, P., "Generalized feedforward structures: a new class of adaptive filters," Proc. ICASSP 92, IV, 244-248, San Francisco, 1992.

Principe, J., de Vries, B., and Guedes de Oliveira, P., "The gamma filters: a new class of adaptive IIR filters with restricted feedback," IEEE Trans. on Signal Processing, 41(2), 649-656, 1993.

Principe, J., Kuo, J.-M., and Celebi, S., "An analysis of short term memory structures in dynamic neural networks," IEEE Trans. on Neural Networks, Special Issue on Dynamic Nets, 5(2), 331-337, 1994.

Principe, J., and Tracey, J., "Isolated word speech recognition using the gamma model," J. Art. Neural Net., 1(14), 481-489, 1994.

Principe, J., et al., "Analysis of short-term memories for neural networks," Advances in Neural Information Processing Systems 6 (NIPS 6), 1011-1018, Morgan Kaufmann, 1994.

Renals, S., Hochberg, M., and Robinson, T., "Learning temporal dependencies in connectionist speech recognition," Advances in Neural Information Processing Systems 6 (NIPS 6), Cowan, Tesauro, and Alspector (Eds.), 1051-1058, 1994.

Saha, D. C., and Rao, G. P., "A general algorithm for parameter identification in lumped continuous systems—the Poisson moment functional approach," IEEE Trans. on Automatic Control, 1, 223-225, 1982.

Sastry, P. S., Santharam, G., and Unnikrishnan, K. P., "Memory neuron networks for identification and control of dynamical systems," IEEE Trans. on Neural Networks, 5(2), 306, 1994.

Sauer, T., Yorke, J. A., and Casdagli, M., "Embedology," Journal of Statistical Physics, 65(3/4), 579-616, 1991.

Silva, T., de Oliveira, G., Principe, J. C., and de Vries, B., "Generalized feedforward filters with complex poles," Proc. IEEE Workshop on Neural Networks for Signal Processing, 1992.

Takens, F., "Detecting strange attractors in turbulence," Lecture Notes in Mathematics, 898, 365-381, 1981.

Tank, D. W., and Hopfield, J. J., "Neural computation by concentrating information in time," Proceedings of the National Academy of Sciences, 84, 1896-1900, 1987.

Theiler, J., "Estimating the fractal dimension of chaotic time series," Lincoln Lab J., 3(1), 63-85, 1990.

Tsoi, A., and Back, A., "Locally recurrent globally feedforward networks: a critical review of architectures," IEEE Trans. on Neural Networks, 5(2), 229-239, 1994.

Wahlberg, B., "System identification using Laguerre models," IEEE Trans. on Automatic Control, 36(5), 551-562, 1991.

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K., "Phoneme recognition using time-delay neural networks," IEEE Trans. on Acoustics, Speech, and Signal Processing, 37(3), 328-339, 1989.

Wan, E., "Time series prediction by using a connectionist network with internal delay lines," in Time Series Prediction: Forecasting the Future and Understanding the Past, Weigend and Gershenfeld (Eds.), 195-217, Addison-Wesley, Reading, MA, 1994.

Werbos, P. J., "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE, 78(10), 1550-1560, 1990.

Whitney, H., "Differentiable manifolds," Ann. Math., 37, 645, 1936.

Wiener, N., Extrapolation, Interpolation, and Smoothing of Stationary Time Series, with Engineering Applications, Wiley, New York, 1949.

Williams, R. J., and Zipser, D., "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, 1, 270-280, 1989.


Index Action potentials (APs), in im

pulse trains, 131-34 Adaptive time-delay neural net

work (ATNN), 129 Analog VLSI implementations of

gamma filter, 335-37 Aperture problem, in image flow

computation, 63-65 Artificial neural network (ANN)

biased random walk, 222, 226-30

research, 225-26 Axon generator, 297

Basin class, 125 Basin class capacity, 127 Biased random walk, 222

biological evidence, 231-33 efficiency of, 228-29 experimental observation and,

230 first attempts with, 226-27 performance of, 229-30 random structural variation,

231-33 reinforcement signals, 233 trapped in local minima, 227-

28 Binary neural networks, 124-28 Biological neural networks (BNNs).

See High-level Petri nets

Chaotic attractors and attractor locking, 114-20

developing multiple, 120-24 Chemotaxis algorithm, 221, 227,

228 efficiency of, 228-29 performance of, 229-30

Correlation dimension algorithm, 325

Delta function pulses, 20 Dendrite generator, 297 Dendritic tree, 3-4 Difference equations, 59-60 Dulac's criterion, 156, 157-59,166 Dynamical systems (DSs), discrete-

time analysis of, 175-77 basin of attraction, 175 fixed points, 175 periodic orbits, 175 recurrent neural networks with

two state neurons and fixed points, 191-200

repulsive points/repellors, 176,

197 saddle points, 176, 195, 196-

97 stability types, 175-76

Dynamic binary networks, 124-28

Dynamic image processing, 58 Dynamic neural networks

action potentials in impulse trains, 131-34

attractor basins and dynamic binary networks, 124-28

chaotic attractors and attractor locking, 114-20

description of, 108-14 developing multiple attractors,

120-24 perturbation schedule, use of,

122-23 self-sustained activity in, 106 symmetric sigmoid squashing

function, 110 temporal synchronies, 134 time delay mechanisms and

attractor training, 129-31

345

Page 354: NNPattern

346 Index

unresolved issues regarding, 106-7

Dynamic spatial warping (DSW), 88

Dynamic time warping (DTW), 77, 78

comparisons with direct template matching, 90, 95

computer simulation results, 88-95

energy function, 77, 102 Hopfield network and, 81-88 Itakura path constraints, 83,

87 optimization problems solved

using, 78-81 performance measurement with

random signals, 89-90 piecewise linear function, 89,

102

Eckhorn linking field coupling, 1 See also Pulse-coupled neu

ral networks Embedded patterns, classification

of, 272-74 Energy function, 77, 102 Equilibrium states, stability of, 56-

59 Euler's method, 88 Extradimensional bypass, 228

Feedback delays, 311-12 Feedforward delays, 311-12 Finite state machines (FSMs)

See also Recurrent neural networks (RNNs)

defined, 173 experiments with, 182-83 learning loops of, 201-11

Firing times, Petri net, 290 Fixed points

attractive, at vertices, 197-99

attractive, at saddle point intersections, 199-200

defined, 175 recurrent neural networks with

two state neurons and, 191-200

Fourier coefficients, 160 Fourier transform, 21 Fourier series, 159

Gabor phase functions, 71 Gamma delay line, 315 Gamma filter, 315-16

analog VLSI implementations of, 335-37

Gamma kernels, two-dimensional, 333-35

Gamma locally recurrent neural net, 316

Gamma memory applications, 320-21 filter, 314-15 interpretations, 322-30 Laguerre and gamma II mem

ories, 330-35 multiresolution representations

and, 328-30 state space interpretation, 324-

26 Taylor series implementation

and, 326-28 vector space interpretation, 322-

23 Gamma neural network, 317-19 Gated dipole, 267 Generalized feedforward filter (GFF),

313-14 Genetic algorithms, 225, 228 Gradient-descent learning, 226, 228-

29 Gram-Schmidt orthogonalization,

331 Group linking waves, 22-25 Grossberg competitive law, 44

Page 355: NNPattern

Index 347

Guided, accelerated random search (GARS), 226-27

Hebbian decay learning law, 44 development of, 222 limitations of, 224-25 long-term potentiation and the

NMDA receptor, 222-24 High-level Petri nets (HPNs)

advantages of Petri nets, 286 applications for, 285 arcs in, 294 cell bodies, 294 classes of, 292 concentric circles, 295-96 fundamentals of Petri nets,

287-92 for modeling biological neu

ral networks, 285-86, 292-96, 299-306

olfactory bulb example, 299-306

places for modeling cell bodies and presynaptic clefts, 296

timed transitions, 294, 297-98

tokens, 295 weights on arcs, 294-95, 298-

99 Hopfield associative memory, 120,

123 Hopfield network, 77

comparisons with direct template matching, 90, 95

components in, 77 computer simulation results,

88-95 dynamic time warping and,

81-88 optimization problems solved

using, 78-81 performance measurement with

random signals, 89-90

piecewise linear function, 89, 102

Image segmentation defined, 34 factors that affect, 34 natural firing, 40-41 pulse-coupled neural networks

for, 35-44 random noise, 43 smoothing images, 43 techniques, 34

Impulse trains, action potentials in, 131-34

Inhomogenous nodes, 272-73 Initial state, 173 Instantaneous description, 290 Invariances

description of translational, rotational and scale, 25-34

image distortion, 31 image intensity overlays, 31 receptive field patterns and,

26 simulation results, 30-31 time signatures, 31-34

Kirchhoff's current law, 147 Kohonen map, 182, 183 Kohonen self-organizing nets, 340 Kronecker delta function, 14, 88

Laguerre and gamma II memories, 330-35

Law of Large Numbers, 148 Learning rules, 44

biased random walk, 222, 226-30

biological evidence, 231-22 biological requirements for, 226 chemotaxis algorithm, 221, 227,

228 Hebb's rule, 222-25

Page 356: NNPattern

348 Index

research on, 221-22 theoretical, 225-30 trial-and-error, 222, 234

Liapunov function, 67, 77, 79, 81
Liapunov's theorem, 156
Linear finite dimensional memory structures, 312-17
Linear systems theory, 67
Linking decay tail, 15-18
Linking modulation, 4-5
Linking waves and time scales, 21-22
    group linking waves, 22-25
Locally recurrent networks (LRNs)
    analog VLSI implementations of gamma filter, 335-37
    applications of, 311
    feedback connections, 311-12
    gamma memory applications, 320-21
    gamma memory filter, 314-15
    gamma memory interpretations, 322-30
    gamma neural network, 317-19
    Laguerre and gamma II memories, 330-35
    linear finite dimensional memory structures, 312-17
    outstanding problems, 338-40
Long-term memory (LTM)
    invariance principle, 254-58
    invariance principle with on-center off-surround circuit, 260-64
    weights, 250-51
Long-term potentiation (LTP), NMDA receptor and, 222-24
McCulloch-Pitts neuron model, 317
Mean memory depth, 317
Memory
    linear finite dimensional memory structures, 312-17
    trace, 312, 314
Memory, gamma
    applications, 320-21
    filter, 314-15
    interpretations, 322-30
    Laguerre and gamma II memories, 330-35
    multiresolution representations and, 328-30
    state space interpretation, 324-26
    Taylor series implementation and, 326-28
    vector space interpretation, 322-23
Motion perception, challenges in, 57-58
Multiresolution representations, gamma memory and, 328-30
Network transition (NT) graphs, use of, 124-27
Neuron gain, 187
Neurons, oscillation in inhibitory and excitatory
    characterization of cell assemblies, 148-50
    individual cells described, 146-48
    interactions between two neural groups, 151-56
    macroscopic model for cell assemblies, 146-50
    oscillation frequency estimation, 159-61
    random process, 148
    research on, 143-46
    stability of equilibrium states, 156-59
    system-level parameter, 148
    validation of experiments, 161-62
NMDA (N-methyl-D-aspartate), long-term potentiation and the, 222-24

Nonuniform pattern of connectivity between nodes, 273-74
Olfactory bulb example, HPN modeling of
    information obtained from, 305-6
    inputs, 299-301
    intrinsic neurons, 301
    Petri net model formulation and analysis, 301-4
    principal neuron, 301
    token flow through, 304-5
On-center off-surround circuit, 260-64
Optical flow computation
    advantages of, 72
    aperture problem, 63-65
    formulation for neural computing, 59-61
    Gabor phase functions, 71
    Horn's model, 68, 71, 73
    introduction of, 58
    laboratory images experimentation, 68
    as a minimization of functionals, 59
    properties used, 62-63
    recurrent neural network architecture for, 65-68
    research on, 58
    smoothness constraints, 59, 63, 71
    stability and convergence rate, 67-68
    test pattern experiments, 68
Optimization problems, Hopfield network and, 78-81
Oscillation behavior. See Neurons, oscillation in inhibitory and excitatory
Pattern matching
    comparisons with direct template matching, 90, 95
    components in, 77
    computer simulation results, 88-95
    dynamic time warping, 77, 78
    dynamic time warping using Hopfield network, 81-88
    energy function, 77, 102
    Hopfield network for, 77, 78-81
    pattern matcher, 77, 78
    performance measurement with random signals, 89-90
    piecewise linear function, 89, 102
Periodic time series, 18, 20-21
Petri nets (PNs)
    See also High-level Petri nets
    concepts and terminology of, 287-89
    timed, 289-92
Poincaré-Bendixson theorem, 156, 157, 166
Poisson moment, 326
Presynaptic changes, 223-24
Pulse-coupled neural networks (PCNNs)
    adaptation, 44-48
    basic model, 3-10
    dendritic tree, 3-4
    group linking waves, 22-25
    image segmentation, 34-44
    implementations, 50-51
    integration of, 51-53
    invariances, 25-34
    learning laws, 44
    linking decay tail, 15-18
    linking field model of Eckhorn for, 2
    linking modulation, 4-5
    linking waves and time scales, 21-22
    multiple pulses, 10-13
    multiple receptive field inputs, 13
    periodic time series, 18, 20-21
    pulse generator, 5-7
    pulse periods, 7-10
    synaptic weights, 44-45
    time evolution of pulse outputs in a two-cell system, 13-18
    time-to-space mapping, 48-50
Pulse function, 6
Pulse generator, 5-7
Pulse periods, 7-10

Quasiharmonic pulse rates, 18
Random walk. See Biased random walk
Real-time operation, classifying networks and, 269-72
Real-time recurrent learning (RTRL), 318
Receptive field inputs, multiple, 13
Recurrent neural network architecture for optical flow computation, 65-68
Recurrent neural networks (RNNs)
    automata theory and, 180
    background research on, 171-73
    as a collection of dynamical systems, 186-91
    dynamical systems, analysis of, 175-77
    experiments with trained finite state machines, 182-83
    hidden neurons, 177-78
    Kohonen map, use of, 182, 183
    language acceptance, 212-13
    learning loops of finite state machines, 201-11
    loops and cycles, transformation of, 188-91
    second-order, 212
    state degradation diagrams, 207-9
    as state machines, 179-85
    state machines, description of, 173-75
    state space clustering, 182
    training procedure, 179
    transfer function, 204, 207
    with two state neurons, 191-200
    as universal computing devices, 221
Rehearsal to process long lists, 258-60
Reinforcement learning
    components of, 231
    random structural variation, 231-33
    signals, 233
Repulsive points/repellors, 176, 197
Retina, as an example of a preprocessor, 51
Rotational invariance, 26
Saddle points, 176, 195, 196-97
    biased random walks and, 227-28
    fixed points at the intersection of, 199-200
Saturable law, 44
Scale invariance, 26-29
Segmented patterns. See SONNET 1, segmentation of patterns and
Serotonin, 231-33
Short-term memory (STM), 249, 256, 258
Sigmoid function, 6
Smooth change of illumination, 71


Smooth motion, 71
Smoothness constraints, 59, 63, 71
SONNET 1, segmentation of patterns and
    black box, use of, 246-48
    classification of embedded patterns, 272-74
    learning isolated and embedded spatial patterns, 250-52
    long-term memory (LTM) invariance principle, 254-58
    long-term memory invariance principle with on-center off-surround circuit, 260-64
    long-term memory weights, 250-51
    operational requirements, 245-46
    properties of a classifying system, 267-74
    real-time operation, 269-72
    resetting items once they are classified, 264-67
    short-term memory (STM), 249, 256, 258
    simulations, 274-80
    storing items with decreasing activity, 252-54
    structure of units, 249-50
    transient memory span (TMS), 258-59
    using rehearsal to process long lists, 258-60
Speech recognition, 246
State degradation diagrams, 207-9
State machines
    description of, 173-75
    recurrent neural networks as, 179-85
State space interpretation, 324-26
System cycle time, 306
System transient time, 306
Taylor series implementation, gamma memory and, 326-28
Temporal chunking problem, 273
Time delay mechanisms and attractor training, 129-31
Time-delay neural network (TDNN), 129
Timed Petri nets (TPNs), 289-92
Timed transitions, HPNs and, 294, 297-98
Time scales, linking waves and, 21-22
Time-to-space mapping, 48-50
Transient memory span (TMS), 258-59
Translational invariance, 26
Trial-and-error learning, 222, 234
Vector space interpretation, 322-23
Waves
    linking of, 21-22
    linking of group, 22-25