
Modeling the Role of Working Memory and Episodic Memory in Behavioral Tasks

Eric A. Zilli1,2* and Michael E. Hasselmo1,2

1 Program in Neuroscience, Center for Memory and Brain, Boston University, Boston, Massachusetts; 2 Department of Psychology, Center for Memory and Brain, Boston University, Boston, Massachusetts

*Correspondence to: Eric A. Zilli, Program in Neuroscience, Department of Psychology, Center for Memory and Brain, Boston University, 2 Cummington St., Boston, MA 02215. E-mail: [email protected]

Grant sponsor: NIH; Grant numbers: Silvio O. Conte Center MH71702, MH60013, DA16454. Grant sponsor: NSF Science of Learning Center; Grant number: SBE 0354378.

Accepted for publication 1 October 2007. DOI 10.1002/hipo.20382. Published online in Wiley InterScience (www.interscience.wiley.com).

ABSTRACT: The mechanisms of goal-directed behavior have been studied using reinforcement learning theory, but these theoretical techniques have not often been used to address the role of memory systems in performing behavioral tasks. This work addresses this shortcoming by providing a way in which working memory (WM) and episodic memory may be included in the reinforcement learning framework, then simulating the successful acquisition and performance of six behavioral tasks, drawn from or inspired by the rat experimental literature, that require WM or episodic memory for correct performance. With no delay imposed during the tasks, simulations with WM can solve all of the tasks at above the chance level. When a delay is imposed, simulations with both episodic memory and WM can solve all of the tasks except a disambiguation of odor sequences task. © 2007 Wiley-Liss, Inc.

KEY WORDS: delayed nonmatch; hippocampus; reinforcement learning; sequence disambiguation; spatial alternation

INTRODUCTION

Reinforcement learning theory (Sutton and Barto, 1998) provides a powerful theoretical framework for describing problems in which an agent (an animal, robot, etc.) interacts with its environment by selecting actions and receives a scalar reward signal after each action is taken. Recently, much attention has been paid to the possibility that a similar system is used in the brains of animals, especially because midbrain dopaminergic neuron firing appears to resemble the error signal used in this framework (Schultz et al., 1997; Montague et al., 2004; but see Morris et al., 2006; Redgrave and Gurney, 2006). Neural network models of reinforcement learning that do not rely on dopamine have also been proposed (Hasselmo, 2005; Koene and Hasselmo, 2005; Zilli and Hasselmo, 2005).

Traditional approaches to reinforcement learning are limited by the requirement that tasks or environments have the Markov property. Roughly, this requirement means that the optimal action to take in a particular observed state must depend only on that state and not on the recent history of the agent. The theory thus does not easily apply to many of the common behavioral tasks that require an animal to make a decision based on memory of a past state, such as delayed matching to sample or spatial alternation. Lin and Mitchell (1992) overcame this limitation by using order-n Markov chains, in which an agent's state becomes the set of its n most recent observations. McCallum (1995) improved on this idea using his utile-distinction learning approach, by which an agent could learn to include aspects of a state from up to a fixed number of steps in the past into its current state. In this way, an agent could act more flexibly by incorporating only the most relevant of the most recent n observations instead of all of them. A drawback to this approach is that it still does not overcome the inflexibility that arises from referring to observations from a fixed number of states in the past.

Part of the flexibility that animals show during behavior arises from their access to multiple memory systems that provide a variety of information. Two commonly studied memory systems are working memory (WM) (Baddeley, 1986) and episodic memory (Tulving, 2002). WM has been defined as memory "active and relevant only for a short period of time" (Goldman-Rakic, 1995), whereas episodic memory is a longer-lasting memory that allows one to recall and re-experience personal events (Tulving, 2002). A given task might be solvable using more than one distinct system (Redish, 2001), which complicates understanding how an animal may perform and which memory systems will be used. Here, we introduce simple, abstract forms of these two memory systems in the reinforcement learning framework and simulate common behavioral tasks from experimental work on memory systems.

First, we describe the ideas and implementation of abstract versions of WM and episodic memory. Then, we demonstrate the memory systems on a few common tasks, some of which can be solved with either WM or episodic memory and some by only one of the two. The tasks are all presented in delayed and nondelayed conditions, in which the simulated effects of the delay prevent the agent from using a WM strategy, thus demonstrating whether episodic memory is capable of solving the task.

The primary insight that allowed these tasks to be modeled in this framework was that the memory systems of an agent can be treated as part of the environment instead of part of the agent, allowing the agent to take actions upon its memory systems in the same way it takes actions on its physical environment. Environment in this context does not refer to the common usage of the word, but rather to the technical definition from reinforcement learning, in which the environment can be defined as anything that provides state input to the agent and changes its state in response to the agent's actions. Although the memory systems are environment, we will still refer to "the agent's WM," for instance, where the possessive is meant to indicate not containment, but rather possession.

MATERIALS AND METHODS

Reinforcement Learning

Recent work from this laboratory has shown that reinforcement learning type systems can be implemented in neural circuits (Hasselmo, 2005; Koene and Hasselmo, 2005; Zilli and Hasselmo, 2005). For the purposes of this work, however, we used the more common, abstract implementation of the system, because our focus here was on demonstrating that use of memory systems can be treated as an active, learned process. These simulations were written in C++, and the source code is available upon request.

We used a tabular, actor-critic temporal difference learning architecture with ε-greedy exploration (Sutton and Barto, 1998). In this architecture, the agent maintains a table of state values V, where V(s) is the agent's current estimation of the expected, temporally discounted reward that will follow state s (thus more temporally distant rewards are considered as having less value than immediate rewards of the same magnitude). The agent also maintains a table of action values Q, where Q(s,a) is the value of taking action a in state s.

The agent followed a simple policy in which it took the action with the highest action value in its current state, with ties broken randomly, except that with probability 0 ≤ ε ≤ 1 the agent selected an action completely at random. All state and action values were initialized to 0 at the beginning of a simulation, and so the agent always selected a random action the first time it entered a novel state. This is the actor half of the architecture.
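As a concrete illustration, the sketch below implements this ε-greedy actor over a dictionary-based Q table. This is a minimal Python sketch for illustration only; the original simulations were written in C++, and the function and variable names here are our own assumptions.

```python
import random

def select_action(Q, state, actions, epsilon=0.01):
    """Epsilon-greedy actor: with probability epsilon pick a random action,
    otherwise pick the highest-valued action, breaking ties randomly."""
    if random.random() < epsilon:
        return random.choice(actions)
    # Unvisited (state, action) pairs default to a value of 0, as in the simulations.
    values = [Q.get((state, a), 0.0) for a in actions]
    best = max(values)
    return random.choice([a for a, v in zip(actions, values) if v == best])
```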

When the agent takes action a in state s and finds itself in state s′ having received a scalar reward r, the TD error δ is computed by the critic half of the architecture:

δ = r + γV(s′) − V(s)    (1)

where 0 ≤ γ ≤ 1 is a temporal discounting factor (smaller values of γ cause the agent to consider immediate rewards to be more valuable than distant rewards). This is the difference between the value of the old state (s) and the sum of the reward signal (r) and the discounted value of the new state (s′)

and is used to update the old state’s value and the action valuefor the action just taken.

V(s) ← V(s) + αδ

Q(s,a) ← Q(s,a) + αδ

where α is a learning rate parameter. If δ > 0, then V(s) was less than the combined reward and discounted value of the new state, and so V(s) should be increased to provide a better estimate. Similarly, if δ < 0, then the agent's estimated value of state s was found to be too high, and so V(s) should be decreased. No learning occurs if the predicted state value exactly matches the sum of the reward and the discounted value of the new state.

To speed up the agent's learning of the tasks, replacing eligibility traces were used (Singh and Sutton, 1996; Sutton and Barto, 1998). With eligibility traces, multiple state and action values are updated on each step, with the update scaled proportional to the amount of time since the agent last visited the state and took the action. Now receipt of a reward while entering the current state affects values not only at the previous state but also at the most recently visited states. Each time the agent leaves state s taking action a, we assign E(s,a) ← 1, and for all other states s′ we assign E(s′,a) ← λE(s′,a), where λ is a decay coefficient. The learning equations become

V(s) ← V(s) + αδE(s,a)    (2)

Q(s,a) ← Q(s,a) + αδE(s,a)    (3)

for each pair of s and a. In the simulations, E(s,a) was set to 0 if it was below a threshold value θ.

In all simulations presented here, α was set to 0.01, γ was set to 0.95, ε was set to 0.01, λ was set to 0.25, and θ was set to 0.01, so that at most four state-action pairs had nonzero eligibility trace strengths at any time.
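Combining Eqs. (1)–(3) with the trace update just described gives the following sketch of a single learning step. This is again an illustrative Python sketch under the stated parameter values, not the authors' C++ implementation; the dictionary representation of the tables and traces is an assumption.

```python
ALPHA, GAMMA, LAMBDA, THETA = 0.01, 0.95, 0.25, 0.01  # parameter values quoted above

def td_step(V, Q, E, s, a, r, s_next):
    """One actor-critic update with replacing eligibility traces (Eqs. 1-3).
    V, Q, and E are dictionaries; missing entries are treated as 0."""
    delta = r + GAMMA * V.get(s_next, 0.0) - V.get(s, 0.0)      # Eq. (1)
    # Decay existing traces, pruning any that fall below the threshold theta,
    # then set the trace of the state-action pair just taken to 1 (replacing trace).
    for key in list(E):
        E[key] *= LAMBDA
        if E[key] < THETA:
            del E[key]
    E[(s, a)] = 1.0
    # Update every state and action value with a nonzero trace.
    for (si, ai), e in E.items():
        V[si] = V.get(si, 0.0) + ALPHA * delta * e              # Eq. (2)
        Q[(si, ai)] = Q.get((si, ai), 0.0) + ALPHA * delta * e  # Eq. (3)
    return delta
```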

States and Actions

We assume a factored representation of state, S = S1 × S2 × ⋯ × Sn, so elements of S are n-tuples {(x1, x2, …, xn) | x1 ∈ S1, x2 ∈ S2, …}. In the simulations below, there were usually two or three state elements: the location state SL (numbered square in a gridworld) and the WM and episodic memory states SWM and SEP (described below). We thus had S = SL × SWM for WM simulations and S = SL × SEP × SWM for episodic memory simulations, in which, for example, each triple is of the form (location, episodic memory contents, WM contents), and each such double or triple has its own set of action values. As noted in the results, in a few tasks an extra state element was used to represent sensory input: olfactory input in the odor sequence disambiguation task, visual input for barriers in the nonmatch to position task, and auditory input in the tone-cued alternation task.
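For concreteness, such a factored state can be represented as a tuple that keys the value tables, as in the sketch below (an assumed Python representation for illustration; the element names are ours, not the authors').

```python
from collections import namedtuple

# Factored state for the episodic-memory agent: S = SL x SEP x SWM.
State = namedtuple("State", ["location", "episodic", "wm"])

EMPTY = "empty"              # placeholder for an empty WM or a "nothing recalled" SEP
s = State(location=3, episodic=EMPTY, wm=EMPTY)

# Each distinct tuple indexes its own row of state and action values.
V = {s: 0.0}
Q = {(s, "north"): 0.0}
```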

We include two types of actions: motor actions (north, south, east, west), which change the location state SL, and we introduce two new types of memory actions that change SWM and SEP.

Notice that, because the agent receives state information reflecting memory and can take actions affecting memory states, from a physiological point of view we are effectively including many parts of cortex as part of the agent's environment and not of the agent itself. We leave the specific location of the agent undefined, but it might be thought of as basal ganglia or medial prefrontal and orbitofrontal cortices, which are thought to be involved in learning and decision making (Houk et al., 1995; Schoenbaum and Setlow, 2001; Krawczyk, 2002; O'Reilly and Frank, 2006). Our present results, however, do not rely on any particular anatomical locus for the agent. Sutton and Barto (1998) suggested that muscles and sensory organs, for example, should be considered environment and not agent. We are simply further restricting the agent to represent roughly the highest levels of the cortical hierarchy, which can take actions mediated by neurophysiological interactions with other regions.

Working Memory

The traditional definition of WM is the ability to maintain a representation of a stimulus online and to be able to manipulate it mentally (Baddeley, 1986), but for our present purposes we will consider only the former aspect. The first action we introduce is an action that sets SWM to the current sensory state, keeping that state in memory until it is overwritten when the next memory action is taken. Until this action is initially taken, or if memory is cleared (e.g., to simulate imposition of a long delay during a task), an extra state representing empty memory is used. Therefore |SWM| = |SL| + 1: the number of possible WM states is one more than the number of possible location states. This functionality is summarized in Table 1. In future work, this type of action could be extended in a few obvious ways. First, one could allow multiple WM "buffers," allowing a separate action for each buffer; for example, states SWM1 and SWM2 with one action to store a sensory state in SWM1 until that storage action is taken again (and replaces SWM1) and another action to store a sensory state in SWM2 until that storage action is taken again. Another extension would be to allow separate actions to buffer the value of a specific state element. For example, one could have separate WM buffers for visuospatial state and for auditory state, as in the WM model of Baddeley and Hitch (1974).
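A minimal sketch of the basic storage action in this factored representation might look like the following (illustrative Python with assumed names, not the authors' code):

```python
from collections import namedtuple

State = namedtuple("State", ["location", "wm"])   # WM-only agent: S = SL x SWM
EMPTY = "empty"                                    # the extra empty-memory state

def add_to_buffer(state):
    """The working memory action: copy the current sensory (location) state
    into SWM; the location element itself is unchanged (cf. Table 1)."""
    return state._replace(wm=state.location)

s = State(location=5, wm=EMPTY)
s = add_to_buffer(s)           # State(location=5, wm=5)
```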

This idea is similar to the more biological neural network model of gating WM updating proposed by O'Reilly and Frank (2006). In both mechanisms, the agent learns the appropriate times in a task at which to buffer stimuli into WM by increasing the action value of the WM action (or gating action in the O'Reilly and Frank, 2006 model) at the proper states. The concept also overlaps with the idea of options as described by Sutton et al. (1999), which are temporally extended actions that change the agent's policy until the option terminates, essentially providing the agent a continuing memory of the ongoing option. Sutton et al. (1999) defined an option formally as a triple ⟨I, π, β⟩, where I is the set of states where the option is available as an action, π is the policy to be followed for the duration of the action (i.e., a set of Q-values), and β(s) is a function giving the probability for each state s that the option will terminate in that state. Thus, our WM concept is roughly equivalent to a set of options, each with a single initiation state, reflecting the specific state element to be buffered in WM (i.e., in a two-by-two environment, there would be four options or four possible states to hold in WM). Because the model uses a factored state representation in which WM content is one element, each possible item in WM effectively defines a separate policy. With this interpretation, in our delayed tasks, the probability of terminating at the location in which the delay is enforced is 1 and is 0 elsewhere. However, an item in WM can also be overwritten at any state by a new one, at which point the old item is lost. Options, on the other hand, may be nested, so that by taking one option, then taking another, when the second option terminates the first one may still apply. Hence the concept of an option and our WM mechanism are similar, but distinct in certain details.

Episodic Memory

The next two actions we introduce are episodic memory actions. First, however, we describe an abstract episodic memory system to motivate the new actions. The two important characteristics of this memory system are (1) it is content addressable and (2) it is temporally indexed (this description is somewhat different from the phenomenological characterization used by Tulving, 2002). Content addressability means that a memory can be elicited from a cue that is contained as part of the memory (Marr, 1971; e.g., given the cue word "breakfast," one can recall a specific memory containing the concept of breakfast). The term "temporally indexed" means that once a memory is retrieved, one can "replay" the memory from the retrieved point as if it were a movie. In contrast, semantic memory is effectively content-addressable memory that lacks a temporal index: one can retrieve specific facts given associated facts, but there is no sense of temporal progression (e.g., given USA and capital one might retrieve Washington, D.C., but the concept of "the next moment in time" given such a memory is undefined).

TABLE 1. Summary of States and Actions in Working Memory Agent Simulations

States | Actions | State transition
SL, a sensory state | Motor action (e.g., N/S/E/W movement, lever press, dig for reward in cup of sand) | SL ← sensory state; SWM unchanged
SWM, a sensory state or empty | Add to buffer | SL unchanged; SWM ← SL

We now describe a simple episodic-like memory system designed to model these two characteristics. A specific mathematical implementation and a biological interpretation for it can be found in the Appendix.

The model maintains a list of the n most recently visited states in order. After each action, the agent's previous state is added to the end of the list, and the oldest state in the list is removed (however, no state is removed for the first n steps). This list is the complete episodic memory of the agent. The agent's episodic memory state SEP will contain either one state from this list or an additional state representing "nothing recalled." Each time the agent takes a spatial movement action, SEP is reset to "nothing recalled."

The agent has two actions available to it with which it can interact with its episodic memory. The first action, "cue retrieval," finds the most recent instance of the agent's current state in the list and sets SEP to that state (or to the "nothing recalled" state if the current state is not found in the list). The other action, "advance retrieval," sets SEP to the state following its current state in the list (or, if SEP is "nothing recalled," it remains unchanged).
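The list and the two retrieval actions can be sketched as follows. This is an illustrative Python reading of the mechanism as described, with assumed names and end-of-list behavior; the Appendix gives the authors' actual mathematical implementation.

```python
class EpisodicMemory:
    """Sketch of the episodic memory list and its two retrieval actions.
    Encoding is automatic; retrieval is driven by the agent's actions."""
    NOTHING = "nothing recalled"

    def __init__(self, n):
        self.n = n
        self.history = []        # the n most recently visited states, oldest first
        self.index = None        # position of the currently retrieved memory

    def record(self, state):
        """Called after every action: append the previous state, drop the oldest."""
        self.history.append(state)
        if len(self.history) > self.n:
            self.history.pop(0)
            if self.index is not None:
                self.index = max(self.index - 1, 0)

    def cue_retrieval(self, current_state):
        """Set SEP to the most recent occurrence of the current state, if any."""
        for i in range(len(self.history) - 1, -1, -1):
            if self.history[i] == current_state:
                self.index = i
                return self.history[i]
        self.index = None
        return self.NOTHING

    def advance_retrieval(self):
        """Set SEP to the state following the retrieved one; if nothing is
        recalled, SEP stays 'nothing recalled' (end-of-list handling assumed)."""
        if self.index is None:
            return self.NOTHING
        if self.index + 1 < len(self.history):
            self.index += 1
        return self.history[self.index]
```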

Simulations including the episodic memory system also include a WM system to buffer episodic retrieval. Effectively, this introduces a third action that stores the current episodic memory state SEP in WM.

For some of the simulations below, SEP forms memories of SL to allow retrieval and replay of sequences of visited locations. The exceptions are as follows: in the tone-cued alternation task, SEP forms memories of both location and auditory input; in the nonmatch to position and sample tasks, SEP forms memories of both location and visual input regarding barriers and levers, respectively; and in the odor sequence disambiguation task, SEP forms memories only of odors.

The function of these two actions and their interaction with the other environments in the simulations are summarized in Table 2.

This mechanism maps closely onto the path readout model of hippocampus proposed by Hasselmo and Eichenbaum (2005), but differs in that this earlier model treated cuing retrieval and advancing multiple steps in memory as a single process that occurred automatically at every step of a simulation, instead of treating the two as separate actions that must be actively selected, as in the present framework. Thus, the important difference is that retrieval of episodes is now under the control of the agent instead of being automatically performed on every step, although here, as in that model, encoding still happens automatically and constantly (see also Morris and Frey, 1997).

The content addressability in this context also differs somewhat from the traditional definition. Content addressability as described in Marr (1971) is what has also been described as pattern completion (O'Reilly and McClelland, 1994), a process by which a subset of the representation of some cue or memory elicits the entire memory directly. The present mechanism introduces an intermediate indexing step, which still allows for a subset of a memory to elicit the complete memory via the index, but also provides an ordering on the indices so that subsequent memories can be read out in the proper order. The intermediate step has some similarity with the context signal used in some models of verbal WM (Burgess and Hitch, 2005), which are more similar to our episodic mechanism than to our WM mechanism.

Rewards and Limitations on Actions

The agent was free to take any action in any state; however, some actions did not change the agent's state and resulted in a reward of −1. These included attempting to move out of the environment or into a barrier, to store something in WM that is already stored, and to advance episodic retrieval without first cuing retrieval. Also, as is common in behavioral tasks using T-mazes, the agent was prevented from moving backward, here by giving a reward of −1 if the agent attempted to re-enter a location it had just left. In general, attempting to take any "illegal" action resulted in this negative reward and no state change. Finally, four tasks (all except spatial alternation and tone-cued spatial alternation) had an upper limit of 100 on the number of steps allowed. When this limit was reached, the episode ended and was counted as incorrect. For most tasks, the agent received a reward of +9.5 for a correct response, −6 for an incorrect response, and −0.05 for all other actions, to encourage the agent to find the shortest solution to a task (particularly important when memory actions are available, to avoid the agent repeatedly taking those actions instead of taking spatial movement actions). For the odor sequence disambiguation task, the agent received a reward of +1 for reaching a correct goal state, −5 for an incorrect response, and −0.5 for all other actions. These values were selected to produce learning at a reasonably fast rate (the odor sequence task required different values). In many cases, faster learning could be achieved by setting task-unique reward values.

TABLE 2. Summary of States and Actions in Episodic Memory Agent Simulations

States | Actions | State transition
SL, a sensory state | Motor action (e.g., N/S/E/W movement, lever press, dig for reward in cup of sand) | SL ← sensory state; SWM unchanged; SEP cleared
SWM, a sensory state or empty | Add to buffer | SL unchanged; SWM ← SEP; SEP unchanged
SEP, a sensory state or empty | Cue retrieval | SL unchanged; SWM unchanged; SEP ← SL
 | Advance retrieval | SL unchanged; SWM unchanged; SEP ← next state in temporal sequence
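The reward assignment just described can be summarized in a small sketch (illustrative Python only; the constants are those quoted in the text for most tasks, and the helper predicate names are assumptions):

```python
# Reward constants for most tasks (the odor sequence task used +1, -5, and -0.5).
CORRECT, INCORRECT, STEP, ILLEGAL = 9.5, -6.0, -0.05, -1.0

def reward_for(action_was_legal, outcome=None):
    """Return the scalar reward for one step; outcome is 'correct',
    'incorrect', or None for an ordinary (non-terminal) action."""
    if not action_was_legal:       # illegal actions also leave the state unchanged
        return ILLEGAL
    if outcome == "correct":
        return CORRECT
    if outcome == "incorrect":
        return INCORRECT
    return STEP
```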

RESULTS

Before we present results of our simulations of various tasks, one aspect of episodic memory should be emphasized to make the performance figures more clear. All simulations with episodic memory also included a WM system to allow the agent to hold the retrieved episodic memory online to guide behavior (a requirement for proper learning in these simulations; this is similar to the context-reinstantiation idea described in Redish, 2001). In general, if an agent in a nondelayed task with WM could solve a task, so could an agent in a simulation with episodic memory (an agent with episodic memory may, for example, react differently at a novel state or a state not visited for a large number of steps, as these cases may produce "nothing recalled" states in response to the cue retrieval action). It was in delayed tasks, wherein WM was cleared at locations as described below (shown as striped squares in the figures), that the difference between agents with episodic memory and WM became clear. If an agent with episodic memory could solve a delayed task, then it must be that the agent was using episodic retrieval as a means of retrieving an old memory sequence, instead of simply using its WM component to hold a state online. Table 1 summarizes the states and actions available to a WM agent; Table 2 summarizes those for an episodic memory agent.

For all tasks, there were ultimately two possible responses, of which one was correct; thus chance performance was 50% correct. In some cases, as described below, an agent consistently performed well below chance, and in one case the agent was above chance but not at 100% performance.

Spatial Sequence Disambiguation

We first consider a task that is a simple demonstration of the use of WM. Spatial sequence disambiguation was implemented here in an I-shaped maze, shown in Figure 1, top, comprising a vertical stem with left and right horizontal arms on both ends. The agent was randomly started in one of the two bottom arms and forced to enter the vertical segment by a barrier blocking the other start arm. It received a positive reward only in the opposite side arm on the top of the maze. Whether the agent was rewarded or not, it was removed from the maze and repositioned on a random start arm for the next trial. We simulated two versions of this task. In the delayed version, the agent's WM was cleared when it reached the bottom of the stem to represent a delay period imposed by an experimenter.

FIGURE 1. Spatial sequence disambiguation (SSD). Top, environment used in this task. S indicates starting position, and G indicates a position where a reward may be available. Filled black squares are static barriers. Filled gray squares are movable barriers. The striped square is where working memory is cleared as described in the text. Middle and bottom, average performance of 10 runs with both episodic memory and working memory (EP + WM), working memory only (WM), and an agent with no memory systems (plain) in the nondelayed (middle) and delayed (bottom) versions of the task. In the nondelayed task (middle), the agents with working memory learn slightly faster than the agents with episodic memory on average. In the delayed task (bottom), the working memory of the two agents with memory systems is cleared before each episode, reducing interference across trials. Performance in each block of 20 trials is the number of correct responses divided by the total number of responses.

Figure 2 demonstrates the way the agent with WM could solve this task by showing different sets of action values, depending on which location was held in WM, represented in the figure by a square of small circles around the location being held. In particular, for each location on the starting arms in memory, the choice point had one action with a positive value (dark arrows), indicating the direction the agent should go to receive a reward. It should be emphasized that these action values are learned and that the action values learned for each value of WM contents are independent.

Figure 1 shows the performance on the delayed and nondelayed versions of this task. For each version of the task, we show performance for an agent with episodic memory and WM (EP + WM), an agent with only WM, and a normal agent with no extra memory systems (plain). As shown in the middle of that figure, both the episodic and WM agents were able to solve the nondelayed task, whereas only the episodic agent could solve the delayed version (Fig. 1, middle and bottom). The standard agent could not learn the task in either version. As mentioned earlier, because the episodic agent could solve the delayed version, it must have been using something more complex than a straightforward WM approach. Because the trial types (on which arm the agent starts, and therefore on which arm it is rewarded) were randomly selected on each trial, it may be surprising that episodic memory could solve this task. At the choice point, the agent's WM had been cleared, and so it must have relied on memory of a state that followed its last visit to its current state. However, the last visit to the choice point had been on the previous trial, so the reward arm it selected on that trial was independent of the correct reward arm for the current trial. The strategy that the agent discovered was, in fact, to retrieve the memory of that previous trial and then to advance retrieval multiple steps, replaying the episode up to and past the rewarded state at the end of the arm, which then brought its memory to the beginning of its current trial. It was thus able to recover at which location it began the trial and could make its decision based on that memory.

This strategy worked in the simulations because there was effectively no intertrial interval: each trial began immediately after the previous one ended. Because experiments with real animals often use a delay between trials, either so that the experimenter may reset the apparatus for the next trial or to reduce interference across trials, real animals may not be able to use this strategy.

Spatial Alternation

The next task we consider is spatial alternation (Wood et al., 2000; Lee et al., 2006; Ainge et al., 2007). This task took place in the environment shown in Figure 3, top. The agent began at the bottom of the center hallway in the maze and proceeded up to the choice point. On the first trial, the agent could turn either left or right and received a positive reward at the end of the arm. The agent then continued along a return arm and back to the starting point and then again proceeded up the center hallway to the choice point (at the starting point, the agent was prevented from entering the side hallway instead of the center hallway by use of a barrier). After the first trial, the agent received a positive reward if it chose the opposite direction as on the previous trial; otherwise it received a negative reward. The delayed version of this task was identical, except that the agent's WM was cleared on every visit to the start position (bottom of the stem). Performance in this task is plotted in Figure 3.

As in the previous task, this task is non-Markov because the correct turn at the choice point depends on the agent's previous response. However, with a WM action available, the agent could buffer a location on the return arm while heading back to the start position and maintain it in memory until the choice point was reached, thus allowing disambiguation as to the trial type. In this way, WM can also be thought of as a sort of memory for ongoing context. The agent learned different policies for each item that may be stored in WM, which may be interpreted as learning behaviors appropriate for a given context. Specifically, this works because, due to the factored state representation, the agent had many different policies depending on which location state was in its WM. Alternatively, episodic memory could be used to solve the task by cuing retrieval at the choice point, advancing one step in memory to retrieve the previous response made, then making the opposite response (sometimes the simulated agents would instead, for example, cue retrieval one step before the choice point, then advance retrieval by two steps to recall the previous response). In the delayed version, only an agent with episodic memory could solve this task, as shown in Figure 3, bottom, in agreement with experimental work using hippocampal lesions (Ainge et al., 2007).

Electrophysiological recordings of neurons in animals performing this and similar tasks have shown that neurons in the hippocampus sometimes discriminate between left-response trials and right-response trials (Wood et al., 2000; Frank et al., 2000; Ainge et al., 2007), but in other cases the hippocampal cells do not discriminate (Lenck-Santini et al., 2001; see also Bower et al., 2005). The fact that the spatial alternation task could be successfully performed with either episodic memory or WM may provide insight into the differences seen in neuronal responses. This suggests that subtle differences in training or task demands in those studies may bias the animal to use a WM strategy over an episodic memory strategy or vice versa.

FIGURE 2. Working memory's effect on policy. The arrows show the agent's learned action values corresponding to the two trial types ("should go left" trial on top, "should go right" trial on bottom). The state in the agent's working memory is indicated by a square of circles, which determines the indicated action values. Black arrows represent positive action values; light gray arrows represent negative or zero action values. Filled black squares represent impassable barriers; filled gray squares represent movable barriers.


Nonmatch to Position

Our simulated NMP task took place on a T-maze as in the spatial alternation task, but without the return arms leading from the reward states to the starting position (Griffin et al., 2007), as shown in Figure 4, top. Each trial in this task consisted of two stages, both of which began with the agent being placed at the bottom of the stem and proceeding to the choice point. In the sample stage, one of the two arms of the maze at the choice point was blocked, forcing the agent to go in a particular direction. The agent received a reward at the end of the arm it was forced to enter, then was placed back at the starting point to begin the choice stage. In the choice stage, neither arm was blocked at the choice point. If the agent entered the same arm it had entered on the sample stage (a match response), it received a negative reward at the end of the arm; if the agent entered the opposite arm (a nonmatch response), it received a positive reward at the arm's end. The blocked arm on the sample stage was randomly selected on each trial. In the nondelayed version of this task, the agent's WM was cleared before it was placed at the start position in each sample stage but not each choice stage (this improved the learning rate by eliminating interference across trials). For the delayed version, the agent's WM was cleared before both sample and choice stages. The choice point and the state immediately below it were each represented by one of three different states: choice point with right arm blocked, choice point with left arm blocked, and choice point with no arms blocked. These correspond to a conjunctive state of spatial location and barrier position to reflect differences in visual or tactile input a real animal would receive in such a task.

FIGURE 3. Spatial alternation. Top, environment used in this task. S indicates starting position and G indicates a position where a reward may be available. Filled black squares are static barriers. Filled gray squares are movable barriers. The striped square is where working memory is cleared as described in the text. Middle and bottom, average performance of 10 runs with both episodic memory and working memory (EP + WM), working memory only (WM), and an agent with no memory systems (plain) in the nondelayed (middle) and delayed (bottom) versions of the task. Performance in each block of 2,000 steps is the number of correct responses divided by the total number of responses.

FIGURE 4. Nonmatch to position (NMP). Top, environment used in this task. S indicates starting position and G indicates a position where a reward may be available. Filled gray squares are movable barriers. The striped square is where working memory is cleared as described in the text. Middle and bottom, average performance of 10 runs with both episodic memory and working memory (EP + WM), working memory only (WM), and an agent with no memory systems (plain) in the nondelayed (middle) and delayed (bottom) versions of the task. The agent with episodic memory is always able to perform the task, and the agent with no memory is always at chance, while the working memory agent can perform correctly only in the nondelayed version (middle). Performance in each block of 40 trials is the number of correct test stage responses divided by the total number of test stage responses.

As shown in Figure 4, middle and bottom, this task could be solved by episodic memory in both the delayed and nondelayed versions and by WM in the nondelayed version. Using WM, the agent could buffer a location on the reward arm during the sample stage, which could then indicate the correct response at the choice point in the choice stage, allowing success only in the nondelayed version (Fig. 4, middle).

If the agent did not receive state information reflecting the presence of barriers, which signal the sample stage, performance was significantly impaired (unpublished observations). This requirement of representing task stage is supported by the hippocampal recordings of rats performing a delayed NMP task on a T-maze, where many units show modulation by task stage (Griffin et al., 2007). Our simulated task differs slightly from that of Griffin et al. (2007) in that we began each stage of the task with the agent at the bottom of the stem, whereas rats in the Griffin study began the sample stage on a pedestal with a ramp leading to the stem and they began the choice stage at the bottom of the stem. As in our spatial sequence disambiguation task, the different starting positions could be held in WM to differentiate the two task stages, improving performance and suggesting a means by which the stage-selective firing in real rats could have come about.

Nonmatch to Sample

We also simulated a nonmatch to sample task in which responses were made on two levers in an operant chamber (Hampson and Deadwyler, 1996). Here, each trial consisted of a sample stage in which only one of the two levers was available for making a response, then a choice stage in which both levers were available. The agent received a negative reward for making a response at the same lever as had been available during the sample stage and a positive reward for making a response at the other lever. As shown in Figure 5, top, the two levers were positioned at opposite corners of the environment along the same wall. After making a response at a lever, the animal was moved back to the starting position (a slight difference from experimental versions of this task, which may require the animal to, e.g., make a nose poke at a sensor at a location distant from the pair of levers). In the delayed version of this task, the agent's WM was cleared before starting the choice stage (WM was cleared before the sample stage in both task versions to eliminate interference between trials). To simulate visual input regarding the state of the levers, the agent had separate state representations for the sample stage with the left lever extended, the sample stage with the right lever extended, and the choice stage with both levers extended. The spatial environment comprised two rows and three columns of locations, two of which are locations at which a lever may be present. Thus, there are 18 location states in total (six locations times three different task stage conditions). The need for separate states for sample and choice stages was mentioned earlier, in the NMP task section, and will be discussed again later in the article.

FIGURE 5. Nonmatch to sample (NMS) with levers as stimuli. Top, environment used in this task. S indicates starting position. L indicates a response lever. The striped square is where working memory is cleared as described in the text. Middle and bottom, average performance of 10 runs with both episodic memory and working memory (EP + WM), working memory only (WM), and an agent with no memory systems (plain) in the nondelayed (middle) and delayed (bottom) versions of the task. The agent with episodic memory is always able to perform the task at greater than 80%, and the agent with no memory is always at chance, while the working memory agent can perform correctly only in the nondelayed version (middle). Performance in each block of 20 trials is the number of correct test stage responses divided by the total number of test stage responses.

In principle, both versions of this task were solvable using episodic memory; however, the nondelayed version was more readily solved using WM, as shown in Figure 5, middle. By maintaining in WM the lever at which a response was made during the sample stage, the agent could learn to proceed to and respond to the opposite lever during the choice stage. When the agent could not solve the task, its performance was well below chance. This occurred because the agent rarely made responses, and the episode eventually was terminated when the upper time step limit was reached.

Performance did not depend on the nonmatch rule used here and in the NMP task above; both tasks were solvable with either a match or a nonmatch rule (unpublished observations).

Odor Sequence Disambiguation

Our odor sequence disambiguation task was modeled after the task from Agster et al. (2002), though some simplifications were made. As shown in Figure 6, top, the simulated task took place on a linear track with five pairs of odors spaced along its length. On each trial, the agent was to make a response to one odor of each pair. On sequence 1, the agent was rewarded for responding at each of the odors 1–5 (the "top" odors). For sequence 2, the agent was rewarded for selecting each of the odors 6, 7, 3, 4, and 10. Thus, the two sequences overlap in that odors 3 and 4 were always correct, and odors 8 and 9 were never correct. After selecting the correct odor from a pair, the barrier in front of the rat was removed and it could advance to the next pair. Except for the final pair, if the agent attempted to dig at the incorrect odor, it was allowed to correct itself (in fact, it had to before it could continue). On the final pair, the agent could dig at either odor and the trial was recorded as correct or incorrect, depending on the response and the sequence type. In the delayed version of this task, the agent's WM was cleared before the final pair was presented.

The agent had five actions available: move forward, sample top odor, sample bottom odor, dig at top odor, and dig at bottom odor. Each odor had three corresponding states: sampling the odor, successfully digging at the odor (if it selected the correct odor for a pair), and unsuccessfully digging at the odor (if it selected the incorrect odor for a pair). Until the agent sampled an odor from the current pair, it was in a state simply corresponding to its location. After sampling an odor, the agent remained in the state of smelling that odor until it sampled the other or advanced to the next odor pair after successfully digging. In some experimental versions of this task, the scented cups had perforated lids that the rat had to remove before it could dig. To simulate these lids, we used the restriction that until the agent had sampled an odor, it could not attempt to dig at it (thus it was able to smell the odor before it could dig in the cup). Without this restriction, the agent was still able to learn the task, but at a slower rate (unpublished observations).

FIGURE 6. Odor sequence disambiguation. Top, environment used in this task. S indicates starting position. Each number indicates a different odor, and the thin rectangles represent barriers that were not explicitly included as locations (black are static barriers and gray are movable barriers). The striped square is where working memory is cleared as described in the text. Middle and bottom, average performance of 10 runs with both episodic memory and working memory (EP + WM), working memory only (WM), and an agent with no memory systems (plain) in the nondelayed (middle) and delayed (bottom) versions of the task. In the delayed task (bottom), the working memory of the two agents with memory systems is cleared before each episode, reducing interference across trials. Performance in each block of 1,000 trials is the number of correct responses divided by the total number of responses.

As with the spatial sequence disambiguation task, this task required that the agent have a form of WM to perform correctly. Specifically, the agent had to hold in WM the state of having successfully dug at one of the two initial odors to make the correct choice on the final pair. For this reason, an agent with only WM could not solve the delayed version of this task, shown in Figure 6, bottom.

The agent with episodic memory had two possible strategies to use, as in the spatial sequence disambiguation task. In the nondelayed version, it could simply take a WM approach to perform successfully (Fig. 6, middle). An alternative approach might be for the agent to use episodic retrieval with one of the cups in the final pair as the cue, which could retrieve a memory of the end of the previous trial. Then the agent could advance retrieval by multiple steps until it reaches a memory of successfully digging in one of the cups in the first pair on the current trial. Real animals may not be able to use this strategy for the same reason given in the sequence disambiguation task: in the simulations there were no intervening stimuli between the end of one trial and the beginning of the next, whereas behavioral experiments in laboratory animals often did contain such stimuli because of the presence of an intertrial delay. Agents required a significant amount of training to learn this strategy, which further argues against the likelihood of animals using this approach (so the simulation in Figure 6 was not run long enough to show the agents with episodic memory learning to correctly perform the task). In this case, the first state the agent experiences on a trial did not disambiguate the trial type, so that it was forced to continue advancing retrieval until it reached a state where an odor was successfully dug. The increased number of ambiguous states leading up to the successful digging made the agent's performance much more sensitive to random exploration, either during retrieval at the end of the trial or during behavior at the beginning of the trial itself.

Tone-Cued Spatial Alternation

Next, we consider a tone-cued spatial alternation task. To our knowledge, this task has not been used in the literature, so that the present simulation results constitute a prediction of this model. This task took place on a figure-8 maze as in the earlier spatial alternation task, shown in Figure 7, top. The agent began at the bottom of the center hallway in the maze and proceeded up to the choice point (the agent was prevented from directly entering a return arm from the start point with the use of two barriers, as shown in the figure). Here, one of two tones was randomly sounded (implemented in the simulations as two different states due to the factored state representation: choice point with tone 1 and choice point with tone 2). The agent could go in either direction on its first exposure to each tone and receive a positive reward at the end of the arm. The agent then followed the return arm back to the bottom of the center hallway and proceeded to the choice point, and a random tone was again played. Now the agent was only rewarded for choosing to move in a direction opposite to its response when the current tone was last presented. That is, if the agent last made a left response when tone 1 was played, the next time tone 1 was played the agent had to make a right response, regardless of how it had responded at any number of intermediate presentations of tone 2. Likewise, each time tone 2 was presented, the agent had to make the opposite response as on the previous presentation. In the delayed version of this task, the agent's WM was cleared on each visit to the bottom of the stem. The use of two tones as the cue was selected arbitrarily. If this experiment were attempted with animals, using cues from different modalities (such as a tone and a light) may reduce interference and make the task easier to learn.

FIGURE 7. Tone-cued spatial alternation. Top, environment used in this task. S indicates starting position, and G indicates a position where a reward may be available. T indicates the location where one of two tone cues is played. Filled black squares are static barriers. Filled gray squares are movable barriers. The striped square is where working memory is cleared as described in the text. Middle and bottom, average performance of 10 runs with both episodic memory and working memory (EP + WM), working memory only (WM), and an agent with no memory systems (plain) in the nondelayed (middle) and delayed (bottom) versions of the task. In the delayed task (bottom), the working memory of the two agents with memory systems is cleared before each episode. Performance in each block of 3,000 steps is the number of correct responses divided by the total number of responses.

To perform the task correctly, the agent was forced to use its episodic memory actions to retrieve the last presentation of the current tone and advance one step in memory to discover which direction it previously went, then choose the opposite direction. WM could not correctly solve this task, as the correct choice does not depend on the immediately preceding trial, but rather on the most recent trial in which a given tone was presented.

As indicated in Figure 7, middle, however, in the nondelayed task the WM agent was above the chance level and the agent with no memory systems was well below chance. The plain agent tended to choose the same arm over and over again, thus producing a series of incorrect trials, until it selected a random action at the choice point or the action values became low enough that the agent switched responses, receiving two correct trials followed by another series of incorrect trials.

With WM, though, a better strategy was available: simple spatial alternation. Consider the following: the agent is at the start position, having just returned from making a left response, for example. There are three possible situations the agent might face. It could be that, for both tones, the agent was to go right on the previous trial, so regardless of which tone was played, the agent just made an error (thus, in a sense, the rewards of both tones are currently on the right reward arm). Or it could be that the reward for tone 1 is on the left reward arm, and the reward for tone 2 is on the right reward arm (in which case it must be that tone 2 was played on the previous trial, because if tone 1 had been played the agent's response would have been correct, so tone 1's reward would now be on the right arm). Similarly, it could be that the reward for tone 2 is on the left reward arm and tone 1 on the right arm (the previous case, with tones 1 and 2 reversed). However, it could not be the case that both tone 1 and tone 2 currently have their rewarded arm on the left, because the agent just came from the left arm, so that one of the two would have been switched to the right arm.

With an alternation strategy, its next response will be to choose the right arm, regardless of which tone is played. In the first case, for either tone 1 or tone 2, the agent will be rewarded on the right arm, so one-third of the time it is guaranteed to receive reward. In the other two cases, its probability of receiving a reward on the right arm is 0.5, since that is the probability that the tone corresponding to the reward on the right arm will be played. Thus there are six possible outcomes, of which four are favorable, so the agent will receive a reward on 2/3 of the trials, as shown in Figure 7, middle, which is above the chance level. In contrast to the partial reward strategy for agents with only WM, the agent with episodic memory received a higher average reward than the WM agent on both the nondelayed and the delayed version of the task.
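This 2/3 figure can be verified by enumerating the six outcomes, treating the three situations and the two tones as equally likely, as in the small check below (an illustrative calculation only, not part of the original simulations):

```python
from itertools import product

# After a left response, the three possible reward configurations described above,
# written as (arm holding tone 1's reward, arm holding tone 2's reward):
configs = [("right", "right"), ("left", "right"), ("right", "left")]
tones = [0, 1]                     # which tone is played next

# Alternation strategy: always choose the right arm on the next trial.
wins = sum(cfg[tone] == "right" for cfg, tone in product(configs, tones))
print(wins, "favorable of", len(configs) * len(tones), "outcomes")   # 4 of 6 -> 2/3
```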

DISCUSSION

The goals of this work were twofold. First, we sought to extend the reinforcement learning framework to include memory systems analogous to those thought to exist in animals, so that common tasks used in behavioral experiments could be simulated in a single, quantitative framework. Second, we sought to determine the way that slight differences in these tasks can change the demands on different subsystems in the brains of animals so as to inform future experimental work. To the extent that our abstractions of these memory systems are correct, we have shown that the information required to make correct choices may require different systems whose utility may overlap, but which can be dissociated by specifically designing tasks to preclude the use of all systems except those of interest. We now discuss these points in more detail.

Working Memory Versus Episodic Memory

A formal description of working and episodic memory may guide future research on memory function by providing a means to evaluate potential tasks to see which forms of memory can be successfully used in performing them.

If a task can be solved based on which of a set of states was most recently experienced, then WM can solve the task (though the time required to learn the task increases as the amount of time or number of states between the choice point and the most recently experienced disambiguating state increases). If a task can be solved based on the states that previously followed some given experienced state on the current trial, then episodic memory can solve the task (though the time required to learn the task increases as the number of retrieval steps required to access the disambiguating state increases).

As was demonstrated, for instance, by the spatial sequence disambiguation task, even though a task may appear to be solvable with only one memory system, there may be unexpected
(sometimes artifactual) ways that other memory systems can be used instead. Even with the simple systems presented here, this was possible, so with more complex and realistic memory systems, the possible strategies may greatly increase in number.

Multistage Matching Tasks

A common feature of tasks used in behavioral or electrophysiological studies is the presence of multiple, distinct stages. For example, a nonmatch-to-sample (NMS) task consists of a sample stage during which a stimulus is presented, then a choice stage in which multiple stimuli are presented, either simultaneously or consecutively, during which the animal must respond to the nonmatching stimulus (i.e., the stimulus not presented during the sample stage). A problem in implementing such tasks in the reinforcement learning framework is that the animal must know which stage it is in to make the proper response. In an NMS task, the agent must make a response to the sample stimulus in the sample stage, but not to that stimulus in the choice stage, so without task stage information as part of its state, it cannot learn such a task. When real animals perform such tasks, there is often stage information provided to them in some form (e.g., in primate matching tasks the animal often begins the trial by pulling on a lever which it releases to respond during the choice stage (Miller et al., 1996), and in rodent NMS tasks with levers (Hampson and Deadwyler, 1996), there is one lever available during the sample stage and both available during the choice stage; in both cases sensory information or memory of recent actions can signal task stage). In such cases, one can simply include an extra element in an agent’s factored state representation that indicates task stage, but in other tasks where the animal must learn on its own which stage it is in, the proper reinforcement learning approach is not clear. Indeed, in some tasks, each trial is both a sample and a choice stage, such as the continuous odor nonmatching task used by Wood et al. (1999) and Young et al. (1997) or the analogous n-back task used in humans (Stern et al., 2001).

In this work, we have simulated only two matching tasks: a delayed NMP task in a T-shaped maze and a delayed nonmatch-to-sample task using levers in an operant chamber. There are, however, many other variants of matching tasks, each of which has its own idiosyncrasies when considered in the reinforcement learning framework. In primate experiments, the stimuli in the matching tasks tend to be either visual stimuli on a computer screen (Riches et al., 1991; Miller et al., 1996) or objects at spatial locations in front of the animal (Buckmaster et al., 2004). In experiments with rodents, the stimuli are often levers (Hampson and Deadwyler, 1996), cups of scented sand (Dudchenko et al., 2000), objects at spatial locations (Kirwan et al., 2005), or spatial locations themselves (Zironi et al., 2001).

One dimension along which we can categorize matching tasks is whether the stimuli are unique on every trial or are repeated over the course of the experiment. In tasks with levers it is clear that the levers’ identities are always the same on every trial, and, with odors, it is usually the case that the odors are eventually repeated (as the experiment would otherwise require a very large library of odors to be used). When objects are used, either as visual images on a computer screen or real objects placed around the animal, the objects are sometimes reused (Rapp and Amaral, 1989) and sometimes always unique (Buckmaster et al., 2004). This distinction is relevant to simulations using the reinforcement learning framework, as such an agent makes its decisions by selecting the action with the highest value for its current state. If the stimuli are unique on each trial, then on each trial the animal will not have learned any actions to take from the state corresponding to a specific object, and it will never fully learn the task. Even if the agent learns to hold the unique sample stimulus in WM (perhaps by learning the utility of ignoring the object’s identity and simply taking a "set WM" action in the sample stage), the agent must still be able to detect a match or nonmatch during the choice stage of the task. Even if, on the first trial of a match task, the agent makes a correct response to the factored state of (A in WM, A being experienced), on the next trial the agent will have no action values for the factored state of (B in WM, B being experienced). Thus, in addition to a WM system, an agent needs a mechanism by which it can detect a match between its WM and its sensory input. One potential means of solving this problem is a role-filler mechanism as is used in analogical reasoning models (Hummel and Holyoak, 1997). It should be noted that although increased or decreased neuronal firing rates to matching stimuli have been reported in temporal cortex (e.g., Miller et al., 1993; Holscher et al., 2003), this neuronal activity does not solve the problem noted earlier, as the increased firing is stimulus dependent and so does not provide the abstract signal of "matching" needed for the present purposes. These firing rate differences, however, are thought to reflect familiarity, and so a mechanism by which they can be used to influence decision making will likely be very important to modeling this type of task.
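As an illustration of this point (a hypothetical sketch with invented names, not the simulation code used in this work), a tabular agent keyed on raw (WM contents, current stimulus) pairs never revisits a state when stimuli are unique, whereas an abstract match/nonmatch element collapses all such pairs onto two states:

```python
# Hypothetical sketch: tabular action values keyed on factored states.
# With unique stimuli, the raw key (wm, obs) is new on every trial, so no
# learned value ever transfers; an abstract match flag collapses the states.

def raw_key(wm, obs):
    return (wm, obs)            # e.g. ('A', 'A'), ('B', 'B'), ... one key per stimulus

def match_key(wm, obs):
    return ('match',) if wm == obs else ('nonmatch',)   # only two possible states

q_values = {}                                # state key -> learned action values
print(raw_key('B', 'B') in q_values)         # False: nothing learned for a novel stimulus
print(match_key('B', 'B'))                   # ('match',): same state as ('A', 'A') produced
```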

Another dimension of matching tasks is the question of whether the identities of the choice stimuli are made known simultaneously or consecutively. For example, in the T-maze delayed NMP task simulated earlier, when the agent is at the choice point in the maze, both of the possible responses are available to it at the same time (go left or go right) and are consistent across trials. Making a response then is as simple as selecting the action with the highest action value at that state (where, through some memory mechanism, the state should be different on right trials vs. left trials). The delayed match to sample task also described earlier is somewhat in between the simultaneous and consecutive extremes. The identities of the stimuli are consistent across all trials (i.e., the left lever is always the left response) but there is no single location from which both the left and right lever responses are immediately available. Still, as the agent starts the choice stage knowing which response it should make, it can take the appropriate series of actions to bring it to the correct lever. The distinction can be made more clear by contrasting it with a purely consecutive task, such as an odor nonmatch task. Suppose the agent is presented with a cup of scented sand in the sample stage, then is presented with two scented cups of sand (one with the
same odor from the sample stage and one with a different odor as well as a food reward) for the choice stage in randomized spatial positions (Dudchenko et al., 2000; McGaughy et al., 2005). In this case, the agent from afar is not aware of which cup has which odor, and thus which cup contains the food reward. The agent must select one cup to approach, then only after sampling the odor may the agent make a response (if at the nonmatching odor) or withhold a response and approach the other cup (if at the matching odor). From a reinforcement learning standpoint, this is a relatively complex task. It appears to require first that the agent have actions to select one of the two cups of sand as a destination, which then alter the action values of the location states to direct it to the cup. The agent must then learn to select the other cup as a destination if the odor in memory matches the odor at that cup or to make a response if the odors do not match. This requirement is not trivial, as if there are more than a few odors used in the task, the number of conjunctive states for which the odor in memory does not match the experienced odor can be quite large, and, for each such state, the agent must learn to make a response. Here, a utile distinction learning approach (McCallum, 1995) may be helpful, allowing the agent to distinguish between the two states, one being the triple (choice stage, sampling odor, match detected) and the other being the pair (choice stage, sampling odor), with the former taking precedence over the latter.

Relationship to Other Models

To our knowledge, this is the first report showing that retrieval of episodic memories during behavior increases the domain of problems reinforcement learning systems can solve, but it is not the first use of hippocampal mechanisms in reinforcement learning. Foster et al. (2000) treated hippocampal place cells as radial basis function units for representing space, associating action values to individual place cells and linearly summing them, scaled by the activation of each unit. Arleo and Gerstner (2000) used a similar approach, using a form of Q-learning (Watkins, 1989) to modify the strength of synapses from hippocampal place cells representing location to neurons in the nucleus accumbens representing actions.
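For illustration, a minimal sketch of this general place-cell function-approximation scheme (our own toy example, not the cited authors' implementations): each place cell is a Gaussian radial basis function of position, and each action's value is a weighted sum of the place-cell activations.

```python
import numpy as np

# Place-cell centers tile a 1 x 1 arena; each cell is a Gaussian RBF of position.
centers = np.random.default_rng(0).uniform(0, 1, size=(50, 2))
sigma = 0.1

def place_activity(pos):
    d2 = np.sum((centers - pos) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Action values are linear in the place-cell activations (one weight vector per action).
n_actions = 4
weights = np.zeros((n_actions, len(centers)))

def q_values(pos):
    return weights @ place_activity(pos)

print(q_values(np.array([0.5, 0.5])))   # four action values, all zero before learning
```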

Johnson and Redish (2005) also used place cells as radial basis functions, but showed further that offline (during awake states, but not while engaged in the task) replay of activation in hippocampus can be interpreted as a form of the Dyna-Q reinforcement learning algorithm (Sutton, 1990), which allows the agent to learn tasks more quickly. Recent reports of hippocampal replay when rats are idle during a task (Foster and Wilson, 2006; Jackson et al., 2006) also support this suggested role.

These two functions are not at all exclusive of the present suggestion of prefrontal-initiated hippocampal retrieval of episodes for guiding behavior. In fact, all three proposals could easily work together to combine their individual advantages. The fact that these aspects of hippocampal function work well within the reinforcement learning framework, along with the suggestions that the basal ganglia may implement a reinforcement learning-like mechanism, suggests that this may be a fruitful theoretical approach for future research.

Other modeling work (Gaussier et al., 2002; Banquet et al., 2005) has interpreted hippocampal place cells as representing transitions between states, using these to form graphs that represent trajectories through space for navigation. These graphs are similar to those used by Hasselmo (2005) and Zilli and Hasselmo (2005) in neural implementations of reinforcement learning-like algorithms for navigation and behavior. The reinforcement learning framework provides an abstract account of action selection through which different neural implementations can be tested and which can guide research toward the correct neural implementation.

Acknowledgments

The authors thank Lisa Giocomo and Amy Griffin for helpful comments on an early draft of this manuscript, and two anonymous reviewers for their thorough and tireless efforts to make this manuscript as good as possible.

REFERENCES

Agster KL, Fortin NJ, Eichenbaum H. 2002. The hippocampus and disambiguation of overlapping sequences. J Neurosci 22:5760–5768.

Ainge JA, van der Meer MA, Langston RF, Wood ER. 2007. Exploring the role of context-dependent hippocampal activity in spatial alternation behavior. Hippocampus 17:988–1002.

Arleo A, Gerstner W. 2000. Spatial cognition and neuro-mimetic navigation: A model of hippocampal place cell activity. Biol Cybern 83:287–299.

Baddeley AD. 1986. Working Memory. Oxford: Clarendon Press.

Baddeley AD, Hitch G. 1974. Working memory. In Bower GH, editor. The Psychology of Learning and Motivation: Advances in Research and Theory. New York: Academic Press. pp 47–89.

Banquet JP, Gaussier P, Quoy M, Revel A, Burnod Y. 2005. A hierarchy of associations in hippocampo-cortical systems: Cognitive maps and navigation strategies. Neural Comput 17:1339–1384.

Bower MR, Euston DR, McNaughton BL. 2005. Sequential-context-dependent hippocampal activity is not necessary to learn sequences with repeated elements. J Neurosci 25:1313–1323.

Buckmaster CA, Eichenbaum H, Amaral DG, Suzuki WA, Rapp PR. 2004. Entorhinal cortex lesions disrupt the relational organization of memory in monkeys. J Neurosci 24:9811–9825.

Burgess N, Hitch G. 2005. Computational models of working memory: Putting long-term memory into context. Trends Cogn Sci 9:535–541.

Dudchenko PA, Wood ER, Eichenbaum H. 2000. Neurotoxic hippocampal lesions have no effect on odor span and little effect on odor recognition memory but produce significant impairments on spatial span, recognition, and alternation. J Neurosci 20:2964–2977.

Foster DJ, Wilson MA. 2006. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature 440:680–683.

Foster DJ, Morris RGM, Dayan P. 2000. Models of hippocampally dependent navigation using the temporal difference learning rule. Hippocampus 10:1–16.

Frank LM, Brown EN, Wilson M. 2000. Trajectory encoding in the hippocampus and entorhinal cortex. Neuron 27:169–178.

Gaussier P, Revel A, Banquet JP, Babeau V. 2002. From view cells and place cells to cognitive map learning: Processing stages of the hippocampal system. Biol Cybern 86:15–28.

Goldman-Rakic PS. 1995. Cellular basis of working memory. Neuron 14:477–485.

Griffin AL, Eichenbaum H, Hasselmo ME. 2007. Spatial representations of hippocampal CA1 neurons are modulated by behavioral context in a hippocampus-dependent memory task. J Neurosci 27:2416–2423.

Hampson RE, Deadwyler SA. 1996. Ensemble codes involving hippocampal neurons are at risk during delayed performance tests. Proc Natl Acad Sci USA 93:13487–13493.

Hasselmo ME. 2005. A model of prefrontal cortical mechanisms for goal directed behavior. J Cogn Neurosci 17:1115–1129.

Hasselmo ME, Eichenbaum H. 2005. Hippocampal mechanisms for the context-dependent retrieval of episodes. Neural Network 18:1172–1190.

Holscher C, Rolls ET, Xiang J. 2003. Perirhinal cortex neuronal activity related to long-term familiarity memory in the macaque. Eur J Neurosci 18:2037–2046.

Houk JC, Adams JL, Barto AG. 1995. A model of how the basal ganglia generate and use neural signals that predict reinforcement. In Houk JC, Davis JL, Beiser DG, editors. Models of Information Processing in the Basal Ganglia. Cambridge, MA: MIT Press. pp 249–270.

Howard MW, Kahana MJ. 2002. A distributed representation of temporal context. J Math Psychol 46:269–299.

Hummel JE, Holyoak KJ. 1997. Distributed representations of structure: A theory of analogical access and mapping. Psychol Rev 104:427–466.

Jackson JC, Johnson A, Redish AD. 2006. Hippocampal sharp waves and reactivation during awake states depend on repeated sequential experience. J Neurosci 26:12415–12426.

Johnson A, Redish AD. 2005. Hippocampal replay contributes to within session learning in a temporal difference reinforcement learning model. Neural Network 18:1163–1171.

Kirwan CB, Gilbert PE, Kesner RP. 2005. The role of the hippocampus in the retrieval of a spatial location. Neurobiol Learn Mem 83:65–71.

Koene RA, Hasselmo ME. 2005. An integrate and fire model of prefrontal cortex neuronal activity during performance of goal-directed decision making. Cereb Cortex 15:1964–1981.

Krawczyk DC. 2002. Contributions of the prefrontal cortex to the neural basis of human decision making. Neurosci Biobehav Rev 26:631–664.

Lee I, Griffin AL, Zilli EA, Eichenbaum H, Hasselmo ME. 2006. Gradual translocation of spatial correlates of neuronal firing in the hippocampus toward prospective reward locations. Neuron 51:639–650.

Lenck-Santini PP, Save E, Poucet B. 2001. Place-cell firing does not depend on the direction of turn in a Y-maze alternation task. Eur J Neurosci 13:1055–1058.

Lin L-J, Mitchell TM. 1992. Memory approaches to reinforcement learning in non-Markovian domains. Tech Rep CS-92-138, Carnegie Mellon University.

Marr D. 1971. Simple memory: A theory for archicortex. Philos Trans R Soc Lond B Biol Sci 262:23–81.

McCallum AK. 1995. Reinforcement learning with selective perception and hidden state, PhD Thesis, University of Rochester.

McGaughy J, Koene RA, Eichenbaum H, Hasselmo ME. 2005. Cholinergic deafferentation of the entorhinal cortex in rats impairs encoding of novel but not familiar stimuli in a delayed nonmatch-to-sample task. J Neurosci 25:10273–10281.

Miller EK, Li L, Desimone R. 1993. Activity of neurons in anterior inferior temporal cortex during a short-term memory task. J Neurosci 13:1460–1478.

Miller EK, Erickson CA, Desimone R. 1996. Neural mechanisms of visual working memory in prefrontal cortex of the macaque. J Neurosci 16:5154–5167.

Montague PR, Hyman SE, Cohen JD. 2004. Computational roles for dopamine in behavioural control. Nature 431:760–767.

Morris RG, Frey U. 1997. Hippocampal synaptic plasticity: Role in spatial learning or the automatic recording of attended experience? Philos Trans R Soc Lond B Biol Sci 352:1489–1503.

Morris G, Nevet A, Arkadir D, Vaadia E, Bergman H. 2006. Midbrain dopamine neurons encode decisions for future action. Nat Neurosci 9:1057–1063.

O’Reilly RC, McClelland JL. 1994. Hippocampal conjunctive encoding, storage, and recall: Avoiding a trade-off. Hippocampus 4:661–682.

O’Reilly RC, Frank MJ. 2006. Making working memory work: A computational model of learning in the prefrontal cortex and basal ganglia. Neural Comput 18:283–328.

Rapp PR, Amaral DG. 1989. Evidence for task-dependent memory dysfunction in the aged monkey. J Neurosci 9:3568–3576.

Redgrave P, Gurney K. 2006. The short-latency dopamine signal: A role in discovering novel actions? Nat Rev Neurosci 7:967–975.

Redish AD. 2001. The hippocampal debate: Are we asking the right questions? Behav Brain Res 127:81–98.

Riches IP, Wilson FA, Brown MW. 1991. The effects of visual stimulation and memory on neurons of the hippocampal formation and the neighboring parahippocampal gyrus and inferior temporal cortex of the primate. J Neurosci 11:1763–1779.

Schoenbaum G, Setlow B. 2001. Integrating orbitofrontal cortex into prefrontal theory: Common processing themes across species and subdivisions. Learn Mem 8:134–147.

Schultz W, Dayan P, Montague PR. 1997. A neural substrate of prediction and reward. Science 275:1593–1599.

Singh SP, Sutton RS. 1996. Reinforcement learning with replacing eligibility traces. Mach Learn 22:123–158.

Stern CE, Sherman SJ, Kirchhoff BA, Hasselmo ME. 2001. Medial temporal and prefrontal contributions to working memory tasks with novel and familiar stimuli. Hippocampus 11:337–346.

Sutton RS. 1990. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: Proceedings of the Seventh International Conference on Machine Learning, Austin, Texas, USA, pp 216–224.

Sutton RS, Barto AG. 1998. Reinforcement Learning: An Introduction. Cambridge: MIT Press.

Sutton RS, Precup D, Singh S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif Intell 112:181–211.

Tulving E. 2002. Episodic memory: From mind to brain. Annu Rev Psychol 53:1–25.

Watkins CJCH. 1989. Learning from delayed rewards, PhD Thesis, Psychology Department, University of Cambridge.

Wood ER, Dudchenko PA, Eichenbaum H. 1999. The global record of memory in hippocampal neuronal activity. Nature 397:613–616.

Wood ER, Dudchenko PA, Robitsek RJ, Eichenbaum H. 2000. Hippocampal neurons encode information about different types of memory episodes occurring in the same location. Neuron 27:623–633.

Young BJ, Otto T, Fox GD, Eichenbaum H. 1997. Memory representation within the parahippocampal region. J Neurosci 17:5183–5195.

Zilli EA, Hasselmo ME. 2005. A model of memory-guided behavior based on prefrontal cortex action selection and hippocampal episodic retrieval. Program No. 775.7. Society for Neuroscience.

Zironi I, Iacovelli P, Aicardi G, Liu P, Bilkey DK. 2001. Prefrontal cortex lesions augment the location-related firing properties of area TE/perirhinal cortex neurons in a working memory task. Cereb Cortex 11:1093–1100.

APPENDIX

Episodic Memory Equations

Here, we describe the abstract episodic memory system with equations of vectors and matrices, providing one possible way that the system might be implemented.

Let T be a binary row vector with n elements, where n is the number of temporal indices to be used, letting T have only one nonzero entry at any time. Let S be a binary row vector with a number of elements equal to the number of state components for each of the state elements (i.e., all of the factored state elements are treated as vectors concatenated together, the length of which is |S_1| + |S_2| + ... + |S_n|), where all entries are 0 except those representing the active states. As an example, suppose S = {S_vision, S_tactile}, where the S_vision element has a red, a green, and a blue component and S_tactile has a rough and a smooth component. If the current state is green and smooth, then, as a vector, S = [0 1 0 0 1]. We can break S into separate vectors for each nonzero element, where S_i is a vector the same size as S where the ith element equals 1 and the others are zero. That is, the example S from above can be broken into S_2 = [0 1 0 0 0] and S_5 = [0 0 0 0 1]. (These vectors S_i should not be confused with the factored state elements S_i, where S_i is the set of values that the ith element in the state can take, i.e., the components of S_i.)
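As a concrete restatement of this example, the following minimal sketch (using NumPy; the paper does not specify an implementation, and the variable names are ours) builds the vector S for the "green and smooth" state and splits it into the component vectors S_2 and S_5:

```python
import numpy as np

# Factored state: S_vision has 3 components, S_tactile has 2, so |S| = 5.
# Current state "green and smooth" -> S = [0 1 0 0 1].
S = np.array([0, 1, 0, 0, 1])

# Break S into one indicator vector per active component (S_2 and S_5 in the text).
S_parts = []
for i in np.flatnonzero(S):
    Si = np.zeros_like(S)
    Si[i] = 1
    S_parts.append(Si)

print(S_parts)   # [array([0, 1, 0, 0, 0]), array([0, 0, 0, 0, 1])]
```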

We can now define matrices for associating states with temporal indices and vice versa. Let M be a matrix where the ith row contains the state S last associated with temporal index i. This matrix will allow retrieval of a sensory vector given a temporal index vector. Then, for encoding, M is updated after each selected movement action as:

$$dM = T^T S - T^T T M$$

or

$$dM = T^T (S - TM) \qquad \text{(A1)}$$

Notice that this equation simply adds S to the row indicated by temporal index vector T (containing a single, nonzero entry) and subtracts the old memory vector, TM, that was stored in that temporally indexed row at some previous time. It follows that we can retrieve the sensory vector associated with T by using the same product:

$$S = TM \qquad \text{(A2)}$$

This can be understood from a neural network standpoint if M is treated as a synaptic weight matrix, S is a vector of postsynaptic activity, and T a vector of presynaptic activity (here with only one presynaptic unit active). With this interpretation, Eq. (A1) becomes a learning rule with two components: T^T S is a Hebbian term that associates the sensory vector to the temporal index vector. The second term then represents presynaptically gated long-term depression (LTD) that can be written in the simple form T^T T M because T has only a single active unit (a rule analogous to A3 below can handle the general case with multiple active units). This produces a matrix the same size as M where only the ith row is nonzero (equal to TM), where i is the index of the nonzero unit in T. The equation as a whole then simultaneously applies LTD to all synapses from unit i in T to set them to zero, except those where the postsynaptic units S are active.
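A minimal sketch of Eqs. (A1) and (A2), assuming a NumPy implementation (function and variable names are ours, not from the original simulations):

```python
import numpy as np

n_time = 8   # number of temporal indices (rows of M)
n_state = 5  # length of the concatenated state vector S

M = np.zeros((n_time, n_state))

def encode(M, T, S):
    """Eq. (A1): dM = T^T (S - T M); overwrite the row indexed by T with S."""
    dM = np.outer(T, S - T @ M)
    return M + dM

def retrieve_state(M, T):
    """Eq. (A2): S = T M; read back the state stored at temporal index T."""
    return T @ M

# Example: store the state [0 1 0 0 1] at temporal index 3, then retrieve it.
T = np.zeros(n_time); T[3] = 1
S = np.array([0, 1, 0, 0, 1])
M = encode(M, T, S)
print(retrieve_state(M, T))   # -> [0. 1. 0. 0. 1.]
```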

Next, we define the matrix N with one row for each state element that contains the temporal index vector for the time the given state was last experienced. This will allow a sensory vector to retrieve an associated temporal index vector. N is updated as

$$dN = S^T T - \sum_i S_i^T S_i N \qquad \text{(A3)}$$

where the summation index i is over all nonzero elements in S and S_i is a vector the same size as S with all entries zero except for the ith.

The neural network interpretation of this equation is identical to that for Eq. (A1) except this allows for the case of multiple active presynaptic units, by simply computing the LTD component separately for each active presynaptic unit in the summation over i.

To later retrieve the temporal index vector given a cue vector S_i, we multiply

$$T = S_i N \qquad \text{(A4)}$$

This can again be interpreted as a synaptic matrix and two populations of units in a neural network.
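The corresponding sketch for Eqs. (A3) and (A4), under the same assumptions (binary case, NumPy, our own function names):

```python
import numpy as np

n_time = 8
n_state = 5
N = np.zeros((n_state, n_time))

def split_components(S):
    """Break S into one indicator vector per active element (the S_i of the text)."""
    parts = []
    for i in np.flatnonzero(S):
        Si = np.zeros_like(S)
        Si[i] = 1
        parts.append(Si)
    return parts

def encode_N(N, S, T):
    """Eq. (A3): dN = S^T T - sum_i S_i^T S_i N (binary case)."""
    dN = np.outer(S, T)
    for Si in split_components(S):
        dN = dN - np.outer(Si, Si @ N)
    return N + dN

def retrieve_time(N, Si):
    """Eq. (A4): T = S_i N; the temporal index last associated with cue S_i."""
    return Si @ N

# Example: the state [0 1 0 0 1] was last experienced at temporal index 3.
S = np.array([0, 1, 0, 0, 1])
T = np.zeros(n_time); T[3] = 1
N = encode_N(N, S, T)
cue = np.array([0, 1, 0, 0, 0])       # cue with one active state element (S_2)
print(retrieve_time(N, cue))          # -> nonzero entry at index 3
```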

To finish characterizing the episodic memory abstraction in terms of matrices, we can define a shift operator matrix that takes a temporal index vector and advances the single nonzero entry by one position. This matrix, O, is simply an identity matrix with each row shifted up one and with the first row moved to the last row. As an example, the 3 × 3 shift operator matrix would be

$$O = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix} \qquad \text{(A5)}$$

So, for example, [1 0 0]O = [0 1 0].

In actual implementations, two other concerns must be addressed. If the update to M is performed upon entering a state, then any cued retrieval actions in that state will retrieve the current temporal index, which was just associated to the location. It is, however, desired that the cuing retrieves the previous visit to the present state, not the current visit, so M should only be updated upon leaving a state (in the same manner that state and action values in temporal difference learning
algorithms are updated for the state that was just left, not the upcoming state). Also, the agent should not be allowed to advance the temporal index past the current temporal index. In our implementation, attempts to do so return the "nothing recalled" state. Thus, the computations performed when the cuing retrieval action is taken on cue vector S_i are

$$T = S_i N$$

then

$$S = TM$$

On taking the advance retrieval action, we compute:

$$T_{new} = T_{old} O$$

then:

$$S = T_{new} M$$
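Putting these pieces together, a brief sketch of the shift operator of Eq. (A5) and the cued and advance retrieval actions (again a NumPy illustration with our own names; the check against advancing past the current temporal index is omitted for brevity):

```python
import numpy as np

def shift_operator(n):
    """Eq. (A5): identity with each row shifted up one, first row moved to the last."""
    return np.roll(np.eye(n), -1, axis=0)

def cued_retrieval(M, N, Si):
    """Cued retrieval: T = S_i N, then S = T M."""
    T = Si @ N
    return T, T @ M

def advance_retrieval(M, T, O):
    """Advance retrieval: T_new = T_old O, then S = T_new M."""
    T_new = T @ O
    return T_new, T_new @ M

# Example with the 3 x 3 operator from Eq. (A5):
O = shift_operator(3)
print(np.array([1, 0, 0]) @ O)   # -> [0. 1. 0.]
```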

This implementation of the model can produce retrieval errors under certain conditions. If the agent has not visited a given state in a large number of trials, it may be that the temporal index last associated with that state has been reused and no longer indexes the correct memory. Including a general decay term in the associations from states to temporal indices and thresholding retrieval may solve this problem, for example, replacing Eq. (A3) with $dN = S^T T - \sum_i S_i^T S_i N - \epsilon N$ with $0 \le \epsilon \le 1$.

Although not presented here, we have also simulated spatial

memory with a nearly identical set of equations. The primary difference is that, in spatial memory, the analog of advancing retrieval is imagined motion in a particular direction. Thus, instead of a single operator matrix to advance one step in time, there are multiple operator matrices that each advance the spatial index in a different direction (e.g., in the case of four spatial directions there would be operator matrices for each of north, south, east, and west). It is also simple to take into account head or eye direction dependence of retrieved memories by including the direction as part of the spatial index (e.g., retrieving based on looking north or south from the same position would result in different states being retrieved, and one would also have to include operator matrices for changing the head or eye direction without changing the location aspect of the spatial index). The spatial memory version would allow an agent to use a cue stimulus to retrieve a spatial location and sensory stimuli present there, then to mentally navigate through the spatial memory.

Multiple Retrieval Cues

The retrieval system described earlier has at least one major limitation: a given state cue might belong to multiple episodic memories and it may not be the most recent one whose retrieval is desired. To overcome this limitation, we can allow retrieval based on multiple cues so that the most recent episode containing all of the cues is retrieved, ignoring episodes containing only a subset of the cues. First, we change the updating of N so that it is no longer a binary process:

$$dN = S^T T - k \sum_i S_i^T S_i N \qquad \text{(A6)}$$

where 0 < k < 1 is a decay term (if k = 1 this equation reduces to the binary case described above). Now a given state element will be associated with multiple temporal indices, with the most recent association having a value of 1, the next most recent having a value of (1 − k), the next most recent (1 − k)^2, and, in general, the ith most recent having a value of (1 − k)^(i−1). This concept is similar in some ways to the temporal context model proposed by Howard and Kahana (2002) (also used in Hasselmo and Eichenbaum, 2005). The neural network interpretation also applies here. The only difference from Eq. (A3) is that there is a learning rate associated with the presynaptically gated LTD that is less than 1, so synapses are not completely zeroed after a single application of the learning rule.

Now given a state vector S with all elements 0 except i_1, i_2, ..., i_n, we compute

$$T = \left[ \prod_i S_i N \right]_{\max} \qquad \text{(A7)}$$

where the product is an entry-wise product of each of the vectors, [·]_max is an operation that sets all elements to 0 except the maximum element in the vector, and S_i is the cue vector S broken into independent components as mentioned earlier. Any temporal index that is not shared by all of the S_i will be 0 by the multiplication, and, if more than one index is shared by all the retrieval cues, all except the most recent index will be set to 0 by the max operation. If no temporal index is shared by all of the cues, the 0 vector will result and S_EP is set to the "no recall" state.
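A sketch of the decaying update of Eq. (A6) and the multi-cue retrieval of Eq. (A7), under the same assumptions (NumPy, our own names; the cue is given as a list of active element indices):

```python
import numpy as np

def encode_N_decay(N, S, T, k=0.5):
    """Eq. (A6): dN = S^T T - k * sum_i S_i^T S_i N, so old associations
    decay by a factor (1 - k) instead of being erased."""
    dN = np.outer(S, T)
    for i in np.flatnonzero(S):
        Si = np.zeros_like(S); Si[i] = 1
        dN = dN - k * np.outer(Si, Si @ N)
    return N + dN

def multi_cue_retrieval(N, cue_indices):
    """Eq. (A7): entry-wise product of the cue rows of N, then keep only the
    maximum entry; returns the zero vector (the "no recall" state) if no
    temporal index is shared by all cues."""
    T = np.ones(N.shape[1])
    for i in cue_indices:
        T = T * N[i]                 # S_i N is simply row i of N
    out = np.zeros_like(T)
    if T.max() > 0:
        out[np.argmax(T)] = 1        # most recent co-occurrence of all cues
    return out

# Example: elements 1 and 4 were both present at temporal index 3.
N = np.zeros((5, 8))
S = np.array([0, 1, 0, 0, 1]); T3 = np.zeros(8); T3[3] = 1
N = encode_N_decay(N, S, T3)
print(multi_cue_retrieval(N, [1, 4]))   # -> nonzero entry at index 3
```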

To see that this works, consider each T_i = S_i N. T_i is a vector of the temporal indices at which a given state element has been present, weighted exponentially as described earlier such that the more recent indices have higher values. For example, suppose state element S_a was present at temporal indices 7, 5, and 1. The value for S_a at index 5 is greater than that at index 1 (the value at index 1 is smaller by a factor of 1 − k, in fact) because 5 was more recently experienced. Calculating S_a N will produce the first row of N as a vector. Similar calculations of S_b N and S_c N produce rows two and three as vectors. If we desire the temporal index at which all three of S_a, S_b, and S_c were most recently present, we can first find any temporal indices for which all were present, then find which of those was the most recent.

For a given temporal index, t, we know that at least one of the state elements was not present if at least one of them has a value of 0 for that index. Therefore, if we multiply the value of
the association between each state element and temporal index t, the result will be 0 if at least one was not present and will be greater than 0 otherwise.

Finally, we ask which of the temporal indices shared by all elements was the most recent. Clearly it is the one for which the product of the temporal index strengths is largest. Consider two distinct, nonzero products, p and q, of temporal index strengths. Each element in N is of the form (1 − k)^z for some z, and each of p and q is the product of m such terms, so we have $p = (1-k)^{p_1 + p_2 + \cdots + p_m}$ and $q = (1-k)^{q_1 + q_2 + \cdots + q_m}$. If we assume without loss of generality that p corresponds to a time that happened before q (by assumption they are distinct and so do not represent the same time), then we must have p_i > q_i for all i. Then the sum of all p_i must exceed the sum of all q_i, so p < q because (1 − k) < 1. Thus, the largest entry in the entry-wise product of the rows from N corresponds to the most recent co-occurrence of all the state elements of interest.

Similar equations can be applied to spatial memory, as a given sensory state may be associated with multiple spatial locations.
