
Page 1: Computational Explorations in Cognitive Neuroscience

Computational Explorations in Cognitive Neuroscience Chapter 6: Combined Model and Task Learning, Other Mechanisms

1

Page 2: Computational Explorations in Cognitive Neuroscience

6.1 Overview In this chapter, we deal with the combination of Hebbian model learning and error-driven task learning into one unified learning model. We also deal with two other kinds of learning mechanisms that can be handled with extensions of the basic model: 1) tasks that involve temporally extended sequential processing, and 2) tasks that have temporally delayed contingencies.

6.2 Combined Hebbian and Error-Driven Learning In section 6.2, we first compare the advantages and disadvantages of Hebbian and error-driven learning, and then consider the advantages of combining them. Hebbian learning is autonomous and reliable, but can be myopic and greedy. Error-driven learning is task-driven and cooperative, but can be co-dependent and lazy. Finally, we consider how these two forms of learning might be used in the cortex: either (1) there are separate systems for Hebbian and error-driven learning, or (2) both forms of learning occur at every synapse in the cortex (to varying degrees).

2

Page 3: Computational Explorations in Cognitive Neuroscience

6.2.1 Pros and Cons of Hebbian and Error-Driven Learning The property of locality of a learning algorithm helps to explain the essential difference between these two types of learning. Locality means that changes to a given weight depend only on the immediate activations surrounding that weight. Hebbian learning is a function of local activation correlations; it is local. Error-driven learning depends on possibly remote error signals on other layers; it is non-local. Non-locality gives error-driven learning superior task-learning performance: weights throughout the network can be adjusted to solve a task that is only manifest in terms of error signals over the output layer. All of the weights in the network “work together” to solve the task. However, this interdependency makes it unclear which weights are most important, and can result in very slow learning. Error-driven units also tend to be “lazy” – they change weights only enough to solve the problem, and nothing more. Locality gives Hebbian learning an advantage in learning the correlational structure of the input without depending on remote error signals. By representing the principal correlational structure of the inputs, Hebbian learning can rapidly develop representations that are often useful for learning.

3

Page 4: Computational Explorations in Cognitive Neuroscience

6.2.2 Advantages to Combining Hebbian and Error-Driven Learning A combination of error-driven and Hebbian learning can produce better learning than either alone: each provides an advantage that the other lacks. Computational argument: it seems more efficient to combine both types of learning. Neurobiological argument: the biological mechanism for synaptic modification suggests that both forms of learning may occur at all synapses in the cortex. The approach taken here is to treat error-driven task-based learning as the primary form of learning, and Hebbian model learning as secondary. Learning a task is the major goal, and it is accomplished by error-driven learning, but task learning is facilitated by Hebbian biases.

4

Page 5: Computational Explorations in Cognitive Neuroscience

Regularization is the use of biases to provide additional constraints to an under-constrained problem. One form of regularization is weight decay, in which a small portion of each weight value is subtracted when the weights are updated. Hebbian learning can be a better type of regularization than weight decay because it makes a positive contribution to the development of representations rather than simply subtracting away weight value.

5

Page 6: Computational Explorations in Cognitive Neuroscience

6.2.3 Inhibitory Competition as a Model-Learning Constraint Inhibitory competition may be considered a constraint on learning because it imposes a bias toward sparse distributed representations. We have seen that inhibitory competition (kWTA) combines well with CPCA Hebbian learning. In the past, inhibitory competition has often been used in various forms of model learning. However, it has not been used much in task learning. One obstacle to using it for task learning has been that it is difficult to compute the error derivatives of a competitive activation function. This is not a problem for the GeneRec algorithm, which computes error derivatives implicitly.
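As a rough illustration only (the actual Leabra kWTA places an inhibitory current between the k-th and (k+1)-th most excited units rather than applying a hard cutoff), a k-winners-take-all constraint can be sketched in Python as follows; the function name and the simple sigmoid squashing are assumptions for illustration:

import numpy as np

def kwta(net_input, k):
    # Crude k-winners-take-all sketch: keep only the k most excited units active.
    act = np.zeros_like(net_input)
    winners = np.argsort(net_input)[-k:]                       # indices of the k largest net inputs
    act[winners] = 1.0 / (1.0 + np.exp(-net_input[winners]))   # stand-in for the real activation function
    return act

# Example: 10 hidden units, only k=3 remain active
print(kwta(np.random.randn(10), k=3))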

6

Page 7: Computational Explorations in Cognitive Neuroscience

The use of kWTA competition in combination with GeneRec learning is more powerful than other attempts to use inhibitory competition in task learning. Inhibitory competition has been used in the mixtures of experts architecture (Fig. 6.2). Here, “gating units” undergo WTA competition. Their output multiplies the contribution of their corresponding “expert units” to a problem. A problem with this approach is that the inhibition usually allows only one expert network to be active at a time. This limits the extent to which different experts can cooperate to solve the problem. By contrast, the use of kWTA as a bias in task-learning encourages different units to specialize for different aspects of the task, but also allows cooperation in distributed representations.

7

Page 8: Computational Explorations in Cognitive Neuroscience

6.2.4 Implementation of Combined Model and Task Learning Hebbian learning can be combined with GeneRec error-driven learning by adding together the weight changes produced by each. The parameter k_hebb controls the relative proportion of the two types of learning:

\Delta w_{ij} = k_{hebb} \Delta_{hebb} + (1 - k_{hebb}) \Delta_{err}   (6.1)

or, written out in full,

\Delta w_{ij} = \epsilon \left[ k_{hebb} \left( y_j^+ x_i^+ - y_j^+ w_{ij} \right) + (1 - k_{hebb}) \left( x_i^+ y_j^+ - x_i^- y_j^- \right) \right]   (Box 6.1)

where the two learning components are given by the CPCA Hebbian rule (Eq 4.12) and the CHL GeneRec error-driven rule (Eq 5.39).
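A minimal NumPy sketch of Box 6.1, assuming phase-wise activation vectors and a receiver-by-sender weight matrix (variable names are illustrative, not the simulator's):

import numpy as np

def combined_dwt(x_minus, y_minus, x_plus, y_plus, w, k_hebb=0.01, lrate=0.01):
    # x_* are sending activations, y_* receiving activations, w is (receivers x senders).
    # CPCA Hebbian term, computed on plus-phase activations: y+ * (x+ - w)
    hebb = y_plus[:, None] * (x_plus[None, :] - w)
    # CHL GeneRec error-driven term: (x+ y+) - (x- y-)
    err = np.outer(y_plus, x_plus) - np.outer(y_minus, x_minus)
    return lrate * (k_hebb * hebb + (1.0 - k_hebb) * err)

# Example usage with random phase activations
ns, nr = 5, 3
w = np.random.uniform(0.4, 0.6, size=(nr, ns))
xm, xp, ym, yp = np.random.rand(ns), np.random.rand(ns), np.random.rand(nr), np.random.rand(nr)
w += combined_dwt(xm, ym, xp, yp, w)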

8

Page 9: Computational Explorations in Cognitive Neuroscience

The k_hebb parameter is typically less than or equal to 0.01. This low value reflects:
(a) the emphasis on error-driven learning
(b) the larger weight changes typically produced by Hebbian learning, giving it a greater impact on weight changes
(c) the relative consistency of Hebbian learning throughout the learning phase, also giving it a greater impact on weight changes

The Hebbian rule is computed using the plus-phase activation states because the statistical structure of the input patterns must be learned when the output state is correct.

9

Page 10: Computational Explorations in Cognitive Neuroscience

6.2.5 Summary The essential properties of the Leabra algorithm:

(a) biological realism (b) distributed representations (c) inhibitory competition (kWTA) (d) bidirectional activation propagation (e) error-driven learning (GeneRec) (f) Hebbian learning (CPCA)

10

Page 11: Computational Explorations in Cognitive Neuroscience

6.3 Generalization in Bidirectional Networks Generalization is an important property that allows a trained network to successfully process novel stimuli whose constituent features were previously learned. In other words, the novel items are similar to previously learned items. Generalization is central to the human ability to encode the structure of the environment, which allows the organism to adapt to novel situations. The basic GeneRec network does not generalize well because the unit interactions interfere with the ability to form novel combinatorial representations. In other words, the units interact too much to maintain the independence necessary for novel recombination. The addition of Hebbian learning and kWTA inhibitory competition improves generalization by imposing important constraints on the development of systematic distributed representations.

11

Page 12: Computational Explorations in Cognitive Neuroscience

6.3.1 Exploration of Generalization We first look at network learning of line patterns using pure error-driven learning (learn_rule = “PURE_ERR”; lrn.hebb = 0). The project is model_and_task.proj. The training set consists of 35 of the 45 possible pairwise combinations of horizontal and vertical lines (10 hor + 10 ver + 25 hor-ver). The network is trained on this set, and then tested for generalization on the remaining 10 possible combinations (1 hor + 2 ver + 7 hor-ver).
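A sketch of how such a training/testing environment could be constructed (the 5x5 input grid matches the simulation; the random 35/10 split is an illustrative assumption, since the project file defines its own fixed split):

import itertools
import numpy as np

def line_pattern(idx):
    # 5x5 grid with one line lit: indices 0-4 are horizontal rows, 5-9 are vertical columns.
    grid = np.zeros((5, 5))
    if idx < 5:
        grid[idx, :] = 1.0
    else:
        grid[:, idx - 5] = 1.0
    return grid

# All 45 pairwise combinations of the 10 lines
pairs = list(itertools.combinations(range(10), 2))
patterns = [np.clip(line_pattern(a) + line_pattern(b), 0, 1) for a, b in pairs]

# Train on 35 combinations, hold out the remaining 10 for the generalization test
order = np.random.default_rng(0).permutation(len(pairs))
train_set = [patterns[i] for i in order[:35]]
test_set = [patterns[i] for i in order[35:]]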

Pre-training

12

Page 13: Computational Explorations in Cognitive Neuroscience

Training Session With Pure Error-Driven Learning

Training error statistic (red – cnt_sum_se), updated every epoch (a set of 35 -/+ phases): the count of trials with errors out of 35. This statistic goes to zero with training.

Test error statistics (updated every 5 epochs):

(1) Generalization count (green – gen_cnt): the number of testing events wrong out of the 10 testing items. This statistic shows at least 2 errors out of the 10 patterns tested.

(2) Unique patterns count (yellow – unq_pats): the number of patterns distinctly represented. This statistic shows from 8 to 10 patterns distinctly represented out of the 10 patterns tested.

13

Page 14: Computational Explorations in Cognitive Neuroscience

After training by pure error-driven learning, the hidden unit weight patterns (WT_MAT_LOG) look relatively random, and do not reflect the line features of the input environment.

Weight values from input to hidden units (there are 5 rows and 5 columns, each cell representing the 5x5 grid of weights from the input layer for one hidden unit). The error-driven weights are relatively under-constrained by the learning task, and reflect a large contribution from the initial random values. This lack of systematic representation makes for poor generalization. In other words, the lack of constraint prevents the units from systematically dividing the input/output mapping into distinct subsets that can subsequently be used for categorizing novel inputs.

14

Page 15: Computational Explorations in Cognitive Neuroscience

The addition of Hebbian learning and kWTA inhibitory competition can improve generalization performance by providing biases that impose constraints on learning.

1. Hebbian learning constrains the weights to represent the correlational structure of the inputs, producing systematic weight patterns.

2. kWTA competition encourages individual units to specialize on representing a subset of input items, while also keeping the network from settling into shallow attractors.

We introduce Hebbian learning & kWTA into the simulation by setting learn_rule to “HEBB_AND_ERR” (lrn.hebb = 0.05).

With this addition, the statistics improve. The generalization count (green) goes to zero, and the unique patterns count (yellow) goes to 10.

15

Page 16: Computational Explorations in Cognitive Neuroscience

Also, the weight patterns now show much more distinct representations:

16

Page 17: Computational Explorations in Cognitive Neuroscience

You might suspect that pure Hebbian learning (learn_rule = “PURE_HEBB”) could do just as well on this task. However, with lrn.hebb set to 1, the results are not as good.

Training & Test Statistics

The network frequently gets perfect performance on the task and on the generalization test. However, roughly one run in five fails to learn perfectly and never improves. Hebbian learning clearly helps because of the strong correlational structure of the input. Nonetheless, the network is more reliable when Hebbian learning is used in combination with error-driven learning.

17

Page 18: Computational Explorations in Cognitive Neuroscience

Question 6.1: Report the summary statistics from the batch text log (Batch_1_Textlog) for your batch run. Does this indicate that your earlier observations were generally applicable? Earlier observations: although the task was solved by error-driven learning, the weights were under-constrained, meaning that there was a lack of systematic representation and a large contribution from the initial random weight values. The lack of systematic representation led to poor generalization. Summary statistics for a 5-run batch using pure error-driven learning:

Train_sse: Avg = 0.05
Unq_pats: Avg = 8.6, Min = 6, Max = 10, Cnt = 2

These statistics do support the previous observations. There were 5 100-epoch runs. The average training error (train_sse) was essentially zero (0.05), indicating that the network correctly learned to map the 35 input events to the output units. However, the network failed to uniquely identify all 10 feature patterns (8.6 avg unq_pats per run). All 10 feature patterns were uniquely identified in only 2 out of the 5 runs.

18

Page 19: Computational Explorations in Cognitive Neuroscience

Gen_cnt: Avg = 9.4, Min = 8, Max = 10, Cnt = 0

The average generalization count (avg_gen_cnt) represents the average number of errors (incorrect representations of novel events) during the test phase across the 5 runs in the batch. This number of errors was very high (9.4). The number of runs in which the network correctly represented all 10 novel events (cnt_gen_cnt) was 0. In sum, generalization was poor.

19

Page 20: Computational Explorations in Cognitive Neuroscience

Question 6.2: (a) How did this .05 of additional Hebbian learning change the results compared to purely error-driven learning? The number of errors during the testing phase was reduced to zero, meaning that the network was able to correctly identify 10 of 10 novel events. The number of unique patterns was 10. The addition of Hebbian learning, even at 0.05, enabled error-driven learning to achieve better generalization results. (b) Report the results from the batch text log (Batch_1_Textlog) for the batch run.

Train_sse: Avg = 0
Unq_pats: Avg = 10, Min = 10, Max = 10, Cnt = 5
Gen_cnt: Avg = 1.8, Min = 0, Max = 4, Cnt = 1

20

Page 21: Computational Explorations in Cognitive Neuroscience

(c) Explain these results in terms of the weight patterns, the unique pattern statistic, and the general effects of Hebbian learning in representing the correlational structure of the input. The weights of the hidden units now show patterns of vertical and horizontal lines, rather than randomly distributed patterns as with pure error-driven learning. All of the 10 lines in the input layer are uniquely represented (unq_pats = 10). These results show that adding Hebbian learning to error-driven learning causes the hidden units to have distinct systematic input weight patterns, allowing the network to generalize.

21

Page 22: Computational Explorations in Cognitive Neuroscience

6.4 Learning to Re-represent in Deep Networks The combination of model and task learning is particularly helpful in training networks with many hidden layers (deep networks). The family trees problem (Hinton, 1986) is a classic example of a problem that benefits from multiple hidden layers. The problem involves learning family relationships for two isomorphic families, i.e. families with the same tree structure. Learning is facilitated by re-representing the individuals in a family as distributed patterns in intermediate hidden layers to more specifically encode their family relationships. Pure error-driven learning has difficulty training a deep network: standard backpropagation training can take thousands of epochs to solve the problem.

22

Page 23: Computational Explorations in Cognitive Neuroscience

6.4.1 Exploration of a Deep Network

Network layers: 1. The Agent and Relation layers are Input layers. The Agent layer inputs the names of individuals. The Relation layer inputs the family relationships of individuals. These layers have localist representations of the 24 different people and 12 different relationships. 2. The Patient layer is the Output layer. (The Patient is the other individual in the relationship.) 3. The Agent_Code, Relation_Code, and Patient_Code hidden layers provide a way for the network to re-represent these localist representations as distributed patterns. This re-representation facilitates learning by emphasizing relevant distinctions. 4. The central Hidden layer performs the mapping between the re-coded representations.

23

Page 24: Computational Explorations in Cognitive Neuroscience

Events

The network is trained using the combined “Hebb_And_Error” rule (k_hebb = 0.01). The goal of training is to associate individual names together in the correct family relationship. The training events are 100 triplets of {Agent, Relation, Patient}, defining relations between Agent and Patient individuals. The network is considered to perform correctly (i.e., get the right answer) if, when presented with the Agent and Relation as input in the minus phase, it produces the correct Patient output.

24

Page 25: Computational Explorations in Cognitive Neuroscience

Error Count Statistic (red) & Average # Network Settling Cycles (yellow)

The training error count statistic (# wrong out of 100) declines to 0 with training. A typical learning curve in Leabra is faster than in a standard feedforward backpropagation network (see Fig. 6.9).

25

Page 26: Computational Explorations in Cognitive Neuroscience

We have now seen that Hebbian learning facilitates learning in deep networks when used as a constraint on error-driven learning. However, Hebbian learning alone fails miserably:

Pure Hebbian Learning Fails

We now compare Pure Error-driven, Hebbian & Error-driven, and Pure Hebbian types of learning:

26

Page 27: Computational Explorations in Cognitive Neuroscience

Errors

Pure Hebbian learning is terrible; pure error-driven learning works, but takes longer and requires more settling cycles per epoch; the combination works well and converges more quickly. Only the combined Hebbian and error-driven network achieves a significant speedup in the average number of settling cycles (yellow curves). In the pure Hebbian case, settling starts out fast but slows down with training.

27

Page 28: Computational Explorations in Cognitive Neuroscience

We now look at the cluster plot of the hidden unit activity patterns when we test all 100 inputs.

Cluster plots of hidden unit activity (Untrained vs. Trained): prior to training, there is a general lack of meaningful clustering. After training:

1) the trained network has 2 main branches corresponding to the 2 different families. 2) clusters within each family are generally organized according to generation.

Remember that these cluster plots are based on distances measured between hidden unit activity patterns. Therefore, this structure represents a meaningful transformation of the event patterns that allows the network to solve the task.

28

Page 29: Computational Explorations in Cognitive Neuroscience

Question 6.3: (a) What do you notice about the general shape of the standard backpropagation (BP) learning curve (SSE over epochs) in figure 6.9 compared to that of the PURE_ERR Leabra network you just ran? Pay special attention to the first 30 or so epochs of learning. The standard BP learning curve is much flatter during the first ~30 epochs of learning as compared to that of PURE_ERR Leabra, which is concave. The number of incorrect patterns remains near 100 for BP during this period, whereas the number for PURE_ERR Leabra shows a drastic and consistent drop from the start of learning. (b) Given that one of the primary differences between these two cases is that the PURE_ERR network has inhibitory competition via a kWTA function, whereas BP does not, speculate about the possible importance of this competition for learning based on these results (also note that the BP network has a much larger learning rate, .39, vs .01). kWTA inhibition biases the development of representations by allowing only k hidden units to be active at a time. This constrains the update of weights to only the k units that most strongly represent the input, thereby restricting the number of states that the network can settle into and resulting in more efficient re-representation of the inputs. By constraining the activation of hidden units, kWTA inhibition causes the network to settle faster.

29

Page 30: Computational Explorations in Cognitive Neuroscience

(c) Now, compare the PURE_ERR case with the original HEBB_AND_ERR case (i.e., where do the SSE learning curves – red lines – start to diverge, and how is this different from the BP case)? Both cases show a steep decline in the error from the beginning, but pure error-driven learning takes about twice as many epochs to learn as the combined Hebbian and error-driven learning. The SSE learning curves diverge after epoch 30. Comparing BP, pure error-driven, and combined Hebbian and error-driven learning, the combined case learns fastest. (d) What does this suggest about the role of Hebbian learning? (Hint: Error signals get smaller as the network has learned more.) Hebbian learning is only concerned with the local correlational structure. It thus improves error-driven learning in deep networks by helping to stabilize weights locally between individual layers (see Figure 6.6 for an analogy). This provides a constraint on the propagation of error signals in the network. Representations are able to develop more quickly than with pure error-driven learning, which takes longer because it depends on all the weights throughout the network working together to solve the task.

30

Page 31: Computational Explorations in Cognitive Neuroscience

6.5 Sequence and Temporally Delayed Learning Processes in the real world always evolve in time. There are 3 main categories of temporal dependency in task learning:

a) sequential: a sequence of discrete events having structure (a grammar) is to be learned.
b) temporally delayed: the causal relation between antecedent events and their delayed outcomes is to be learned.
c) continuous trajectories: the evolution of the state of a dynamical system is to be learned.

31

Page 32: Computational Explorations in Cognitive Neuroscience

In sequential or temporally delayed behaviors, contingencies that extend across time between events must be learned. An example of sequential learning is the delayed matching task, where a response is contingent upon information acquired in discrete sensory events separated by a delay. Another example is language comprehension, where the meaning of words is contingent upon the meanings of prior words. An example of temporally delayed learning is reinforcement learning, where the reinforcement strengthens the association of antecedent states with their likelihood of causing the reinforcement. Networks that learn direct input/output mappings can be extended to learn sequential or temporally delayed tasks by adding extra hidden layers that represent the time steps intervening between inputs and outputs. In the next section, the use of such layers for learning sequential tasks is considered.

32

Page 33: Computational Explorations in Cognitive Neuroscience

6.6 Context Representations and Sequential Learning In learning sequential tasks, useful context representations must be derived from previous events in order to produce an appropriate response at a later time. A useful simplification is to only consider Markovian processes. A Markovian process is one for which the likelihood of a given future state at any given moment depends only on its present state. Thus, it is a process without memory. To only consider Markovian processes is a useful simplification because the only prior context necessary to predict the next time step in a sequence is contained in the immediately preceding time step. The Simple Recurrent Network (SRN) is a type of neural network model that incorporates Markovian context representations. Forms of the SRN were developed by Jordan (1986) and Elman (1990).

33

Page 34: Computational Explorations in Cognitive Neuroscience

The context representation in the SRN is contained in a dedicated context layer that acts like an input layer, but has an activity pattern that is a copy of the prior hidden activity pattern (Figure 6.12). The SRN idea is used in Leabra to deal with sequential tasks. The hidden-state-based context layer gives the network flexibility in choosing the contents of the context representations, because the representations in the hidden layer are themselves learned.

34

Page 35: Computational Explorations in Cognitive Neuroscience

6.6.1 Computational Considerations for Context Representations The context layer can be any information-preserving transformation of the hidden layer. A set of adapting weights from the context to the hidden layer can accommodate a fixed or slowly updating transformation of the hidden layer representations by the context layer. Updating of the context layer must be done in a very controlled manner because the layer must be both stable and updatable. The context layer must be able to:

a) preserve information about the prior hidden layer state (working memory maintenance).

b) update its information with the new hidden layer state (working memory encoding). Controlled updating of context units allows both preserved and updated representations to govern learning.

35

Page 36: Computational Explorations in Cognitive Neuroscience

Updating of a context unit’s activity c_j(t) in Leabra is done through a simple copy operation applied to a hidden layer unit’s activity at the previous time step, h_j(t-1), and the context unit’s activity at the previous time step, c_j(t-1):

c_j(t) = fm_{hid} \, h_j(t-1) + fm_{prv} \, c_j(t-1)   (6.2)

where fm_hid is a parameter that determines the extent to which the context unit gets updated by new input from the hidden layer, and fm_prv determines the extent to which it gets updated by its own previous activity.
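A direct transcription of Eq 6.2, with fm_hid and fm_prv as plain floats (names are illustrative):

import numpy as np

def update_context(c_prev, h_prev, fm_hid=1.0, fm_prv=0.0):
    # Eq 6.2: c_j(t) = fm_hid * h_j(t-1) + fm_prv * c_j(t-1).
    # fm_hid=1, fm_prv=0 gives the standard SRN copy; other settings give a slower,
    # more persistent context.
    return fm_hid * h_prev + fm_prv * c_prev

h_prev = np.array([0.2, 0.9, 0.0, 0.5])   # prior hidden layer activations
c_prev = np.zeros_like(h_prev)            # prior context activations
context = update_context(c_prev, h_prev)  # a plain copy with the default parameters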

36

Page 37: Computational Explorations in Cognitive Neuroscience

6.6.2 Possible Biological Bases for Context Representations The prefrontal cortex is intimately involved in planning and executing temporally extended behaviors. Lesions of prefrontal cortex cause sequencing deficits in which people are incapable of performing a sequence of behaviors. It is therefore likely that this brain region is in some way involved in creating and maintaining cross-temporal context. It is not well established at this point whether its role is to actually store representations (like the context layer in the SRN) or to control that storage in other brain areas.

37

Page 38: Computational Explorations in Cognitive Neuroscience

6.6.3 Exploration: Learning the Reber Grammar The Reber grammar task illustrates how the simple recurrent network (SRN) works. Reber studied human implicit learning by having people memorize letter strings that followed a regular, but probabilistic, grammar. Subjects showed improved performance (faster response time) for grammatical letter sequences, but not for agrammatical letter sequences. However, even though they showed implicit evidence of learning the grammar, subjects had no explicit knowledge of it. The Reber grammar is generated by a finite state automaton (FSA), also called a finite state grammar. The connectivity of the nodes in the FSA graph (Figure 6.13) determines the regularities in the grammar. Cleeremans et al (1989) used an SRN with backpropagation to learn the Reber grammar. They trained the network to predict the next letter in the sequence as output, given the prior letter as input. This learning task is difficult because the letter for each link in the FSA is not unique -- different links have the same letter. An internal context is necessary to keep track of position in the FSA graph. This is accomplished by the context layer of the SRN.

38

Page 39: Computational Explorations in Cognitive Neuroscience

Leabra also uses an SRN to learn the Reber grammar. It represents the probability of following links in the FSA graph by picking one of multiple possible output patterns at random. The finite state automaton (FSA) network has an input layer that sends to the hidden layer. The input layer presents the possible FSA letters, and the output layer responds with one of the possible following letters. (Refer to Figure 6.13 for the possible valid outputs.) The context layer units have a single receiving weight from hidden layer units. The context layer units send to the hidden layer. The hidden layer units also send to the output layer. The Targets layer is unconnected – it is only for displaying the two possible valid outputs for comparison with the actual output. The steps come in pairs, a minus phase followed by a plus phase. There is noise in the unit activations so that one unit can be selected at random from the two possible units. The network only learns about one of the two possible subsequent letters on any given trial. To learn that a given node has two possible outputs requires that it integrate learning over different trials. The context units are updated with a copy of the prior hidden unit activations.

39

Page 40: Computational Explorations in Cognitive Neuroscience

Even when the network has learned completely, it still makes 50 percent errors because it makes discrete “guesses” about which output is next. However, this does not introduce a systematic error bias because any unit will be correctly active as often as it is incorrectly inactive (and correctly inactive as often as incorrectly active). Thus, the overall net error is zero. The higher-level representation in the hidden layer remains essentially the same for both outputs, reflecting the node identity. It does not substantially change when the actual output is presented in the plus phase because it encompasses both possible outputs. It is only the lower-level output representation that randomly chooses one output. The error statistic (sum_fsa_err) reports an error (as 1) if the output unit is not one of the two possible outputs. To train the network, a script environment dynamically creates 25 new event sequences on-line every other epoch, according to the Reber grammar FSA. The network can take up to 80 epochs to learn the problem to the point where it gets zero errors in an epoch. The network can be trained longer (e.g., to the point where it gets 4 zeros in a row). Training longer “stamps in” the representations, making them more robust to noise. However, occasional errors are still possible. (Not all possible sequences are used in training.)
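For concreteness, a generator of grammatical sequences might look like the following sketch; the transition table follows the standard published Reber grammar FSA, but the exact node numbering and encoding are assumptions for illustration:

import random

# state -> list of (letter, next_state); 'E' ends the string.
REBER = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', 5)],
    4: [('P', 3), ('V', 5)],
    5: [('E', None)],
}

def reber_string(rng=random):
    # Generate one grammatical string, e.g. 'BTXSE' or 'BPVVE'.
    state, letters = 0, ['B']
    while state is not None:
        letter, state = rng.choice(REBER[state])
        letters.append(letter)
    return ''.join(letters)

# 25 fresh training sequences, as the simulation creates every other epoch
sequences = [reber_string() for _ in range(25)]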

40

Page 41: Computational Explorations in Cognitive Neuroscience

The Leabra network learns much faster than the backpropagation network of Cleeremans et al (1989). Backprop required 60,000 sequences, as compared to fewer than 1,000 sequences for Leabra. (Backprop also required a much larger learning rate.) “Test” tests the network with a sequence of letters. The results are shown in the TEST_GRID_LOG. A correct output is indicated by a value of 0 for fsa_err and by a match between the Output pattern and the Target pattern. The sequence length is shown by the number of “ticks” in the sequence.

41

Page 42: Computational Explorations in Cognitive Neuroscience

Understanding the hidden unit representations requires testing with a longer sequence (i.e., more than 10 events). A cluster plot of such a sequence shows the FSA hidden unit representations. The labels for each cluster plot node identify the current and next letter and the current and next node in the FSA graph (Figure 6.13).

Cluster plot of Test sequence with 15 events

42

Page 43: Computational Explorations in Cognitive Neuroscience

Question 6.4: Interpret the cluster plot you obtained (especially the clusters with events at zero distance) in terms of the correspondence between hidden states and the current node versus the current letter. Remember that current node and current letter information is reflected in the letter and numbers before the arrow. The cluster plot nodes with zero distance have either a common FSA node or common FSA letter. Testing with a random letter sequence rather than one generated by the Reber grammar produces mostly errors. The network is not capable of predicting which letter will come next. Thus, the trained network is capable of distinguishing between sequences produced by the grammar and those that are not. The trained network has incorporated the FSA structure into its own representations. It could be used to detect grammaticality. 6.6.4 Summary The context layer of a simple recurrent network (SRN) can allow the network to learn temporally extended sequential tasks.

43

Page 44: Computational Explorations in Cognitive Neuroscience

6.7 Reinforcement Learning for Temporally Delayed Outcomes The context layer in the SRN retains immediately preceding context information. This is sufficient in many tasks. However, other tasks require learning about temporal contingencies that span many time steps. To accomplish these tasks is equivalent to solving the temporal credit assignment problem. This problem is similar to the structural credit assignment problem that was previously discussed for error-driven learning (i.e., learning which units are responsible for the current error signal). Solving the temporal credit assignment problem requires learning which past events are most responsible for a subsequent outcome. The problem can be solved using a time-based version of error-driven learning. The temporal differences (TD) learning algorithm was developed by Sutton (1988) to solve the temporal credit assignment problem. The TD algorithm was based on earlier models of reinforcement learning (RL). This learning is the basis for both classical conditioning and operant conditioning.

44

Page 45: Computational Explorations in Cognitive Neuroscience

In RL, global reinforcement signals (reward or punishment) drive learning that seeks to enhance reward or avoid punishment. TD and standard error-driven learning are closely related. We will see that the activation phases used in GeneRec can implement a version of TD.

45

Page 46: Computational Explorations in Cognitive Neuroscience

Behavior and Biology of Reinforcement Learning In classical conditioning, the animal learns that a previously neutral stimulus is predictive of reward or punishment. The stimulus is the conditioned stimulus (CS); the reward or punishment is the unconditioned stimulus (US). In operant conditioning, a behavior performed by the animal takes the place of the CS.

46

Page 47: Computational Explorations in Cognitive Neuroscience

Several brain areas appear to be specialized for RL. They include: (1) midbrain nuclei, such as ventral tegmental area (VTA) and substantia nigra (SN) (2) cortical & subcortical areas that control firing of neurons in the midbrain nuclei

Neurons with cell bodies in the midbrain nuclei send axons to frontal cortex (from VTA) and basal ganglia (from SN), where they ramify and release the neurotransmitter dopamine. Dopamine is a neuromodulator that is thought to modulate learning in the target areas. The midbrain dopaminergic nuclei thus provide a learning signal to frontal cortex and basal ganglia relevant for planning and motor control. In turn, the frontal cortex and basal ganglia compute the anticipation of future reward, which is conveyed to the midbrain dopaminergic nuclei.

47

Page 48: Computational Explorations in Cognitive Neuroscience

The top-down projection to SN from the basal ganglia (and to VTA from frontal cortex) may involve a specialized descending control system. Neurons in areas of the basal ganglia called striosomes make direct monosynaptic connections onto neurons of the SN. These areas may be the primary controllers of the dopamine signal that affects the entire basal ganglia (Figure 6.16). VTA neurons fire whenever reward can be reliably anticipated. This firing occurs after the reward itself early in learning (Figure 6.18), but comes to follow the instruction (predictive) stimulus once learning is established (Figure 6.17). VTA neurons also display second-order conditioning, where a second CS predicts the first CS, which predicts the reward – they learn to fire at the onset of the second CS, and not to the reward or first CS. TD learning uses a distinct controller system called the adaptive critic (AC). We will see that AC models of TD learning behave similarly to the anticipatory firing of VTA neurons in the brain.

48

Page 49: Computational Explorations in Cognitive Neuroscience

6.7.1 The Temporal Differences Algorithm Framework:

(1) An organism is in an environment.
(2) The organism produces actions in the environment.
(3) The environment produces rewards for the organism based on the effects of the organism’s actions.
(4) The goal of the organism is to produce actions that maximize the total amount of reward.

49

Page 50: Computational Explorations in Cognitive Neuroscience

Reward maximization is based on a value function:

V(t) = \left\langle \gamma^0 r(t) + \gamma^1 r(t+1) + \gamma^2 r(t+2) + \cdots \right\rangle   (6.3)

The value of the function at time t is the expected value of the sum of weighted expected reward values going into the future. The expected reward value at a given time t+i is 0 or a positive number depending on whether a reward is expected at that time. The weight for the reward value is called a discount factor. Note that the discount factor is raised to the power of the future time increment. If the discount factor is between 0 and 1, the contribution of reward to the value function decreases exponentially with increasing time in the future.
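As a small numeric check of Eq 6.3 (the particular reward sequence and γ value are arbitrary):

def discounted_return(rewards, gamma=0.9):
    # Eq 6.3 for a known reward sequence: V(t) = sum_i gamma**i * r(t+i)
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# A reward of 1 arriving three steps in the future, nothing before or after
print(discounted_return([0.0, 0.0, 0.0, 1.0]))   # 0.9**3 = 0.729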

50

Page 51: Computational Explorations in Cognitive Neuroscience

The value function is called the objective function for TD learning. The goal of TD learning is to maximize the objective function. TD learning attempts to achieve this goal using two basic components:

(1) the adaptive critic (AC) learns to estimate the value of the objective function (2) the actor decides which actions to take based on the estimate from the AC

The estimated value of V(t) is written as \hat{V}(t). The AC learns which sensory cues are predictive of reward by propagating reward information backwards in time. To adjust the prediction at a given point in time, it uses the prediction for the next point in time, \hat{V}(t+1). In other words, the AC updates \hat{V}(t) based on \hat{V}(t+1).

At the beginning of learning, the AC learns to predict a reward just one time step before the reward occurs. Then, it learns to predict the prediction of the reward. Then, it learns to predict the prediction of the prediction of the reward. And so forth. In this way, prediction of the reward is propagated backward in time. Note that, since learning takes place over repeated trials, it is not necessary for the AC to have information about all times leading up to the reward co-existing at the same time. Backward propagation of prediction can still occur if the AC deals with only two successive times (i.e., one time and the next) on any given trial.

51

Page 52: Computational Explorations in Cognitive Neuroscience

Equation 6.3 can be re-written in recursive form as:

V(t) = \left\langle r(t) + \gamma V(t+1) \right\rangle   (6.4)

We can use this relation to define the TD error, which tells us how to update the current estimate of the objective function in terms of the “look-ahead estimate” at the next time point. The TD error is the difference between the value of the objective function as expressed in Eq 6.4 and the current estimated value:

\delta(t) = \left( r(t) + \gamma \hat{V}(t+1) \right) - \hat{V}(t)   (6.5)

Note that the expected value notation is not used in this equation. It is not needed because the error is computed on each trial, and the expected value builds up slowly over time as long as the predictions of future reward are consistent over time. It is said that TD learning is able to span temporal delays by “building a bridge of consistency” in its predictions across time.
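Eq 6.5 translates directly into code; a minimal sketch:

def td_error(r_t, v_hat_t, v_hat_next, gamma=0.9):
    # Eq 6.5: delta(t) = (r(t) + gamma * V_hat(t+1)) - V_hat(t)
    return (r_t + gamma * v_hat_next) - v_hat_t

print(td_error(r_t=1.0, v_hat_t=0.0, v_hat_next=0.0))   # unpredicted reward -> positive error of 1
print(td_error(r_t=1.0, v_hat_t=1.0, v_hat_next=0.0))   # fully predicted reward -> no error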

52

Page 53: Computational Explorations in Cognitive Neuroscience

The AC computes \hat{V}(t) from the current external stimulus, and learns by adjusting the weights to minimize the TD error (the difference between this estimate and r(t) + γ\hat{V}(t+1), which is based on a one-time-step look-ahead). The AC network architecture is shown in Fig 6.19.

(1) \hat{V}(t) is computed based on weights from representations of the stimuli as processed by one or more hidden layers.
(2) The TD error is computed and used to train the network weights that compute \hat{V}(t).

53

Page 54: Computational Explorations in Cognitive Neuroscience

As an example of how TD learning proceeds, consider the simple conditioning experiment with results shown in Figs 6.17 & 6.18. In the experiment, a reward is predicted by a tone stimulus which precedes it by a fixed time interval. The experiment is simulated by a TD model, which is given a stimulus starting at t=2, followed by a reward at t=16. (1) On the first trial (Fig 6.20a), the TD error δ(t) is 1 when the reward occurs (at t=16), and is zero at all other times. Since learning has not yet occurred, the reward is completely unpredicted. (In Eq 6.5, δ(t) is completely determined by r(t) at t=16.) Since δ(t) is used to update the weights that produce \hat{V}(t) at t=16, these weights will increase, and \hat{V}(t) will also increase, on the next trial.

On the next trial, the weight increase:
(a) reduces the value of δ(t) at t=16
(b) propagates the error backward one time step: at t=15, δ(t) now rises to 0.2 because Eq 6.5 includes γ\hat{V}(t+1) from t=16.

54

Page 55: Computational Explorations in Cognitive Neuroscience

(2) During learning (Fig 6.20b), the reward gets predicted earlier and earlier, and the positive TD error moves correspondingly earlier. (3) At the final trial (Fig 6.20c), the stimulus onset at t=2 completely predicts the reward. The prediction cannot be propagated further back because no predictive stimulus occurs earlier in time. This TD model provides a good fit to the neural results of conditioning shown in Figs 6.17 and 6.18. (However, the model uses a continuous transition backwards in time during learning. It is not known whether classical conditioning in the nervous system is based on such a continuous transition.)
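The progression in Figure 6.20 can be reproduced with a small tabular sketch, assuming a CSC-like setup (one weight per time step the stimulus occupies, with credit assigned to whatever was active one step earlier); everything here is a simplification of the model described in the exploration below, and with a learning rate of 0.2 the second trial also reproduces the 0.2 value at t=15 mentioned above:

import numpy as np

T, stim_on, reward_t = 20, 2, 16            # stimulus on from t=2, reward delivered at t=16
gamma, lrate = 1.0, 0.2
w = np.zeros(T)                             # one CSC-style weight per time step
stim = np.zeros(T)
stim[stim_on:reward_t] = 1.0                # assume the tone stays on until the reward

for trial in range(200):
    delta = np.zeros(T)
    v_prev = 0.0                            # estimate carried over from the previous step (minus phase)
    for t in range(T):
        r = 1.0 if t == reward_t else 0.0
        v_plus = r if r > 0 else gamma * stim[t] * w[t]   # injected reward, or weight-based estimate
        delta[t] = v_plus - v_prev
        if t > 0 and stim[t - 1] > 0:       # credit the unit that was active one step earlier
            w[t - 1] += lrate * delta[t]
        v_prev = 0.0 if r > 0 else v_plus   # absorbing reward: reset the estimate after reward
    if trial == 0:
        print("first trial:", np.round(delta, 2))   # single blip of 1 at t=16

print("after training:", np.round(delta, 2))        # the remaining blip sits at stimulus onset (t=2)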

55

Page 56: Computational Explorations in Cognitive Neuroscience

The actor network is trained to produce actions that increase the total expected reward.

If the actor network produces an action that leads to a reward or to a previously unpredicted increase in estimated future rewards, then the TD error is positive.

If this error is used to train the weights in the actor network, as it is in the AC network, then the likelihood that this action will be produced again in similar circumstances is increased.

Actions that lead to the greatest reward produce the largest weight changes, and thus have a greater effect than actions leading to weaker reward.

In short, the TD error is used to train both the AC network and the actor network. This property appears to mimic the way that dopamine from the SN modulates both parts of the basal ganglia: that which controls dopamine release (the AC network) and that which does not (the actor network) (see Fig 6.16). Many varieties of TD learning exist. An important element of many models is the eligibility trace, a time-averaged activation value used for learning rather than an instantaneous activation value.
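A minimal actor-critic sketch under toy assumptions (the environment, the softmax action selection, and all names are illustrative, not the specific model in the text) shows the single TD error training both components:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
V = np.zeros(n_states)                    # adaptive critic: value estimate per state
pref = np.zeros((n_states, n_actions))    # actor: action preferences per state
gamma, lrate = 0.9, 0.1

def step(state, action):
    # Toy environment: action 0 in state 3 yields reward; the state always advances.
    reward = 1.0 if (state == 3 and action == 0) else 0.0
    return (state + 1) % n_states, reward

state = 0
for _ in range(1000):
    probs = np.exp(pref[state]) / np.exp(pref[state]).sum()   # softmax action selection
    action = rng.choice(n_actions, p=probs)
    next_state, reward = step(state, action)
    delta = reward + gamma * V[next_state] - V[state]         # TD error (Eq 6.5)
    V[state] += lrate * delta                                 # trains the critic ...
    pref[state, action] += lrate * delta                      # ... and the same error trains the actor
    state = next_state

print(np.round(V, 2), np.argmax(pref, axis=1))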

56

Page 57: Computational Explorations in Cognitive Neuroscience

6.7.2 Phase-Based Temporal Differences To use the Leabra framework for TD learning requires that phase-based activation differences be used, as they were for error-driven learning. As seen in Eq 6.5 and Fig 6.19, both \hat{V}(t) and \hat{V}(t+1) go into determining the TD error, δ(t).

TD is implemented in Leabra by setting the minus phase activation of the AC unit to \hat{V}(t), and the plus phase activation to r(t) + γ\hat{V}(t+1).

Thus, given a network that supports computation of the TD error, GeneRec learning on these two activation phases updates the weights so as to reduce that error.

57

Page 58: Computational Explorations in Cognitive Neuroscience

When the network experiences a reward, the plus phase activation should equal the reward value plus the additional discounted expected reward beyond the current reward. To simplify the situation, the additional discounted expected reward is set to zero. This is justified by the absorbing reward assumption, in which the entire network is reset after reward and a new trial is begun. Under this assumption, the AC unit is clamped to the reward value in the plus phase when the external reward is delivered. When no external reward is delivered, the plus phase represents the estimated discounted future rewards, γ\hat{V}(t+1). This estimate is computed by the AC unit in the plus phase by standard activation updating as a function of the weights. So, in the absence of external reward, the plus phase for the AC unit is an unclamped settling phase.

To summarize (see Fig 6.21):
(1) The minus phase of the AC unit for each time step is clamped to the prior estimate of future rewards, \hat{V}(t). (Thus, the minus phase of one time step equals the plus phase of the previous time step – in practice, setting γ to 1 allows the next minus phase state to be a direct copy of the prior plus phase state.)
(2) The plus phase is either a computed estimate of discounted future rewards, γ\hat{V}(t+1), or an actual injected reward, r(t), but not both.
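A sketch of this phase bookkeeping (names are illustrative, not the simulator's):

def ac_phases(v_hat_t, v_hat_next, r_t=None, gamma=1.0):
    # Minus phase: the prior estimate V_hat(t).
    # Plus phase: the injected reward r(t) if one is delivered (absorbing reward),
    # otherwise the computed estimate gamma * V_hat(t+1).
    minus = v_hat_t
    plus = r_t if r_t is not None else gamma * v_hat_next
    return minus, plus, plus - minus   # GeneRec learning reduces the plus-minus difference

# An unpredicted reward of 0.95, as in the conditioning exploration that follows
print(ac_phases(v_hat_t=0.0, v_hat_next=0.0, r_t=0.95))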

58

Page 59: Computational Explorations in Cognitive Neuroscience

6.7.3 Exploration of TD: Classical Conditioning In rl_cond.proj, the simulation is based on the simple classical conditioning task. The network learns that one stimulus reliably predicts the reward. It then learns that another stimulus reliably predicts the first stimulus. Rescorla & Wagner (1972) used the delta rule for classical conditioning. Without considering the timing of the stimulus relative to the response (i.e., if everything happens as one time step), the TD rule reduces to the delta rule. In that case, \hat{V}(t) is trained to match r(t). Of course, in TD learning we must consider stimulus timing!

In this simulation, as a simple demonstration of TD learning, the complete serial compound (CSC) is used (Sutton & Barto, 1990). In the CSC, there is a distinct unit for each stimulus for each time point. Although this provides unique information for training the AC unit’s weights, it is not necessary. The input layer, containing 3 rows of 20 units each, is the CSC. Each row represents a different stimulus (3), and each column represents a different time point (20).

59

Page 60: Computational Explorations in Cognitive Neuroscience

The first stimulus comes on at t=10 (tick 10), and remains on through t=15 (tick 15). At t=16, the stimulus disappears and the AC unit is activated, reflecting receipt of a reward (and clamping of the plus-phase activation to the reward value, 0.95). The reward remains on from t=16 to t=19. At t=16, we see the weight increased from the unit representing the stimulus at t=15 to the AC unit. To understand this, note that the reward causes the AC unit to go from 0 in the minus phase to 0.95 in the plus phase. This difference causes δ(t) to be 0.95 at t=16. This δ(t) updates the weights based on the sending activations at the previous time step (t=15). Therefore, the weight from the only non-zero sending unit at t=15 is increased from 0 to 0.2.

60

Page 61: Computational Explorations in Cognitive Neuroscience

The blue curve in the graph log shows the plus-minus phase difference, δ(t). It is 0 at all time steps, except for a “blip” to 1 at t=16. (The reward is maintained active until the end of the sequence at t=19 – so, the AC unit activation is unchanged from 16 to 19, and δ(t) = 0 after the blip.)

61

Page 62: Computational Explorations in Cognitive Neuroscience

At trial 3, t=15, the net input to the AC unit is now strong enough to exceed threshold and activate the unit. Prior to this trial, the AC unit was not activated by the weights at t=15, although it did receive positive sub-threshold net input. This results in a positive δ(t) at t=15.

62

Page 63: Computational Explorations in Cognitive Neuroscience

This means that the network is anticipating the reward one time step earlier. Effects of anticipation:

(1) at t=14, the weight from the input unit to the AC unit is increased due to the positive δ(t) at t=15. This earlier weight change is propagated further and further back in time, so that the reward is anticipated earlier and earlier.

(2) The magnitude of δ(t) at t=16 is reduced.

63

Page 64: Computational Explorations in Cognitive Neuroscience

With training, the sending unit develops positive weights to the AC unit all the way back to t=10 from t=15, and the error blip occurs at t=10.

64

Page 65: Computational Explorations in Cognitive Neuroscience

Eventually, the AC unit is activated when the stimulus first comes on at t=10.

65

Page 66: Computational Explorations in Cognitive Neuroscience

Many standard phenomena are known to occur with classical conditioning. One is extinction, which occurs when the stimulus is no longer predictive of reward. In the simulation, extinction can be brought on by turning off the reward. [Use the StdProbs button to select STIM1_NO_US to turn off the reward.]

66

Page 67: Computational Explorations in Cognitive Neuroscience

Question 6.5: (a) What happened at the point where the reward was supposed to occur? The reward was supposed to appear from t=16 to t=19, but it doesn’t appear at all. In the graph log, the TD error shows a negative blip at t=16.

(b) Explain why this happened using the TD equations. δ(t) goes negative because the estimated value function \hat{V}(t) at t=16 is larger than the reward value. From Eq 6.5, δ(t) = r(t) - \hat{V}(t) = 0 - 0.95 = -0.95 at t=16.

67

Page 68: Computational Explorations in Cognitive Neuroscience

(c) Then, continue the network and describe what occurs next in terms of the TD error signals plotted in the graph log, and explain why TD does this. The negative blip gets smaller and moves earlier as extinction continues, until it reaches t=11.

68

Page 69: Computational Explorations in Cognitive Neuroscience

After that, the blip disappears entirely, and δ(t) is zero across the entire sequence.

69

Page 70: Computational Explorations in Cognitive Neuroscience

(d) After the network is done learning again, does the stimulus still evoke an expectation of reward? No, after extinction has occurred, the weights to the AC unit are reduced to a uniform, low level.

70

Page 71: Computational Explorations in Cognitive Neuroscience

Also, activation of the input unit from t=10 to t=15 now produces only a small net input to the AC unit, and does not activate it.

Note that, during extinction, the weights are not reduced back to zero. They are only reduced enough to bring the AC unit below threshold. In the brain, it appears that the SN dopamine cells (AC unit in the model) are always active at a low level, i.e. their activation never drops below threshold. Thus, the model behavior deviates from that of the dopamine neurons in this respect.

71

Page 72: Computational Explorations in Cognitive Neuroscience

To explore the phenomenon of second order conditioning, we present stimulus 2, which goes from t=2 to t=9, before stimulus 1 (from t=10 to t=15), which is before the reward (from t=16 to t=19). [Use the StdProbs button to select STIM1_2_US.] The first stimulus acts like a reward by triggering a positive δ(t). This allows the second stimulus to learn to predict the first stimulus. The early anticipation of reward gets carried back to the onset of the second stimulus at t=2. After second order conditioning, the AC unit has positive weights from both first and second stimulus units.

72