
Neural Basis of Reinforcement Learning and Decision Making

Daeyeol Lee,1,2 Hyojung Seo,1 and Min Whan Jung3

1Department of Neurobiology, Kavli Institute for Neuroscience, Yale University School of Medicine, New Haven, Connecticut 06510; email: [email protected], [email protected]
2Department of Psychology, Yale University, New Haven, Connecticut 06520
3Neuroscience Laboratory, Institute for Medical Sciences, Ajou University School of Medicine, Suwon 443-721, Republic of Korea; email: [email protected]

Annu. Rev. Neurosci. 2012. 35:287–308

First published online as a Review in Advance on March 29, 2012

The Annual Review of Neuroscience is online at neuro.annualreviews.org

This article's doi: 10.1146/annurev-neuro-062111-150512

Copyright © 2012 by Annual Reviews. All rights reserved

0147-006X/12/0721-0287$20.00

Keywords

prefrontal cortex, neuroeconomics, reward, striatum, uncertainty

Abstract

Reinforcement learning is an adaptive process in which an animal utilizes its previous experience to improve the outcomes of future choices. Computational theories of reinforcement learning play a central role in the newly emerging areas of neuroeconomics and decision neuroscience. In this framework, actions are chosen according to their value functions, which describe how much future reward is expected from each action. Value functions can be adjusted not only through reward and penalty, but also by the animal's knowledge of its current environment. Studies have revealed that a large proportion of the brain is involved in representing and updating value functions and using them to choose an action. However, how the nature of a behavioral task affects the neural mechanisms of reinforcement learning remains incompletely understood. Future studies should uncover the principles by which different computational elements of reinforcement learning are dynamically coordinated across the entire brain.


Contents

INTRODUCTION
REINFORCEMENT LEARNING THEORIES OF DECISION MAKING
  Economic Utilities and Value Functions
  Model-Free versus Model-Based Reinforcement Learning
  Learning During Social Decision Making
NEURAL REPRESENTATION OF VALUE FUNCTIONS
NEURAL MECHANISMS OF ACTION SELECTION
NEURAL MECHANISMS FOR UPDATING VALUE FUNCTIONS
  Temporal Credit Assignment and Eligibility Trace
  Integration of Chosen Value and Reward Prediction Error
  Uncertainty and Learning Rate
NEURAL SYSTEMS FOR MODEL-FREE VERSUS MODEL-BASED REINFORCEMENT LEARNING
  Model-Based Value Functions and Reward Prediction Errors
  Hypothetical Outcomes and Mental Simulation
CONCLUSIONS

INTRODUCTION

Decision making refers to the process by which an organism chooses its actions and has been studied in such diverse fields as mathematics, economics, psychology, and neuroscience. Traditionally, theories of decision making have fallen into two categories. On the one hand, normative theories in economics generate well-defined criteria for identifying best choices. Such theories, including expected utility theory (von Neumann & Morgenstern 1944), deal with choices in an idealized context and often fail to account for actual choices made by humans and animals. On the other hand, descriptive psychological theories try to account for failures of normative theories by identifying a set of heuristic rules applied by decision makers. For example, prospect theory (Kahneman & Tversky 1979) can successfully account for the failures of expected utility theory in describing human decision making under uncertainty. These two complementary theoretical frameworks are essential for neurobiological studies of decision making. However, a fundamental question not commonly addressed by either approach concerns the role of learning. How do humans and animals acquire their preference for different actions and outcomes in the first place?

Our goal in this paper is to review and organize recent findings about the functions of different brain areas that underlie experience-dependent changes in choice behaviors. Reinforcement learning theory (Sutton & Barto 1998) has been widely adopted as the main theoretical framework for designing experiments as well as interpreting empirical results. Because this topic has been frequently reviewed (Dayan & Niv 2008, Bornstein & Daw 2011, Ito & Doya 2011, van der Meer & Redish 2011), we briefly summarize only the essential elements of reinforcement learning theory and focus on the following questions: First, where and how in the brain are the estimates of expected reward represented and updated by the animal's experience? Converging evidence from a number of recent studies suggests that many of these computations are carried out in multiple interconnected regions in the frontal cortex and basal ganglia. Second, how are the value signals for potential actions transformed into the final behavioral response? Competitive interactions among different pools of recurrently connected neurons are a likely mechanism for this selection process (Usher & McClelland 2001, Wang 2002), but their neuroanatomical substrates remain poorly understood (Wang 2008). Finally, how is model-based reinforcement learning implemented in the brain? Humans and animals can acquire new knowledge about their environment without directly experiencing reward or penalty, and this knowledge can be used to influence subsequent behaviors (Tolman 1948). This is referred to as model-based reinforcement learning, whereas reinforcement learning that relies entirely on experienced reward and penalty is referred to as model free. Recent studies have begun to shed some light on how these two different types of reinforcement learning are linked in the brain. We conclude with some suggestions for future research on the neurobiological mechanisms of reinforcement learning.

REINFORCEMENT LEARNING THEORIES OF DECISION MAKING

Economic Utilities and Value Functions

Economic theories of decision making focus on how numbers can be attached to alternative actions so that choices can be understood as the selection of an action that has the maximum value among all possible actions. These hypothetical quantities are often referred to as utilities and can be applied to all types of behaviors. By definition, behaviors chosen by an organism are those that maximize the organism's utility (Figure 1a). Although such utilities are presumably constrained by evolution and individual experience, economic theories are largely agnostic about how they are determined. By contrast, reinforcement learning theories describe how the animal's experience alters its value functions, which in turn influence subsequent choices (Figure 1b).

Figure 1
Economic and reinforcement learning theories of decision making. (a) In economic theories, decision making corresponds to the selection of an action with maximum utility. (b) In reinforcement learning, actions are chosen probabilistically (i.e., softmax) on the basis of their value functions. In addition, value functions are updated on the basis of the outcome (reward or penalty) resulting from the action chosen by the animal. Abbreviation: RPE, reward prediction error.

The goal of reinforcement learning is to maximize future rewards. Analogous to utilities in economic theories, value functions in reinforcement learning theory refer to the estimates of the sum of future rewards. However, because the animal cannot perfectly predict the future changes in its environment, value functions, unlike utilities, reflect the animal's empirical estimates of its future rewards. Reinforcement learning theory utilizes two different types of value functions. First, the action value function refers to the sum of future rewards expected for taking a particular action in a particular state of the environment and is often denoted by Q(s,a), where s and a refer to the state of the environment and the animal's action, respectively. The term action is used formally: It can refer not only to a physical action, such as reaching for a coffee mug in a particular location with a specific limb, but also to an abstract choice, such as buying a particular guitar. Second, the state value function, often denoted by V(s), refers to the sum of future rewards expected from a particular state of the animal's environment. If the animal always chooses only one action in a given state, then its action value function would be equal to the state value function. Otherwise, the state value function would correspond to the average of the action value functions weighted by the probability of taking each action in a given state. State value functions can be used to evaluate action outcomes, but in general, action value functions are required for selecting an action.
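To make the relationship between the two types of value function concrete, the short Python sketch below computes a state value function as the probability-weighted average of the action value functions; the specific numbers are arbitrary and serve only as an illustration.

```python
import numpy as np

# Hypothetical action value functions Q(s, a) for three actions in one state,
# and the probabilities with which each action is chosen in that state.
q_values = np.array([1.0, 2.0, 4.0])      # Q(s, a) for a = 0, 1, 2
choice_prob = np.array([0.2, 0.3, 0.5])   # probability of taking each action

# The state value function V(s) is the average of the action value functions
# weighted by the probability of taking each action in that state.
state_value = np.dot(choice_prob, q_values)
print(state_value)  # 0.2*1.0 + 0.3*2.0 + 0.5*4.0 = 2.8

# If the animal always took a single action (e.g., action 2), V(s) would
# simply equal Q(s, 2).
```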

Model-Free versus Model-Based Reinforcement Learning

Value functions can be updated according to two different types of information. First, they can be revised according to the reward or penalty received by the animal after each action. If choice outcomes were always perfectly predicted from the current value functions, value functions would not change and no learning would be necessary. Otherwise, value functions must be modified to reduce errors in reward predictions. The signed difference between the actual reward and the reward expected by the current value functions is referred to as a reward prediction error (Sutton & Barto 1998). In a class of reinforcement learning algorithms, referred to as simple or model-free reinforcement learning, the reward prediction error is the primary source of changes in value functions. More specifically, the value function for the action chosen by the animal or the state visited by the animal is updated according to the reward prediction error, whereas the value functions for all other actions and states remain unchanged or simply decay passively (Barraclough et al. 2004, Ito & Doya 2009).
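The Python sketch below illustrates this class of algorithm for a simple two-option task: actions are chosen by a softmax over the current value functions (as in Figure 1b), and only the chosen action's value is updated by the reward prediction error. The learning rate, inverse temperature, and reward probabilities are illustrative assumptions, not values taken from the studies cited here.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 2
q = np.zeros(n_actions)            # action value functions
alpha = 0.1                        # learning rate (illustrative)
beta = 3.0                         # softmax inverse temperature (illustrative)
p_reward = np.array([0.7, 0.3])    # hypothetical reward probabilities

for trial in range(1000):
    # Softmax: actions are chosen probabilistically according to their values.
    p_choice = np.exp(beta * q) / np.exp(beta * q).sum()
    action = rng.choice(n_actions, p=p_choice)

    # Outcome of the chosen action (1 = reward, 0 = no reward).
    reward = float(rng.random() < p_reward[action])

    # Reward prediction error: actual outcome minus the value of the chosen action.
    rpe = reward - q[action]

    # Model-free update: only the chosen action's value function is revised;
    # the value functions of unchosen actions are left unchanged.
    q[action] += alpha * rpe

print(q)   # the values approach the reward probabilities [0.7, 0.3]
```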

Second, value functions can be changed more flexibly. This class of reinforcement learning is referred to as model-based reinforcement learning, and the associated algorithms can update the value functions on the basis of the animal's motivational state and its knowledge of the environment without direct reward or penalty. The use of cognitive models allows the animal to adjust its value functions immediately, whenever it acquires a new piece of information about its internal state or external environment. Many lines of evidence indicate that animals as well as humans are capable of model-based reinforcement learning. For example, when an animal is satiated by a particular type of reward, the subjective value of the same food is diminished. However, if the animal were to rely entirely on simple reinforcement learning, its tendency to choose a given action would not change until it experienced the devalued reward through the same action in the same environment. Previous work has shown that rats can change their behaviors immediately according to their current motivational states following the devaluation of specific food items in a different environment. This is often used as a test for goal-directed behaviors and indicates that animals are capable of model-based reinforcement learning (Balleine & Dickinson 1998, Daw et al. 2005). Humans and animals can also simulate the consequences of potential actions that they could have chosen. This is referred to as counterfactual thinking (Roese & Olson 1995), and the information about hypothetical outcomes from unchosen actions can be incorporated into value functions when they are different from the outcomes predicted by the current value functions (Coricelli et al. 2005, Lee et al. 2005, Boorman et al. 2011). Analogous to the reward prediction error, the difference between hypothetical and predicted outcomes is referred to as a fictive or counterfactual reward prediction error (Lohrenz et al. 2007, Boorman et al. 2011).

Learning During Social Decision Making

During social interaction, the outcomes of actions are often jointly determined by the actions of multiple players. Such strategic situations are referred to as games (von Neumann & Morgenstern 1944). Decision makers can improve the outcomes of their choices during repeated games by applying a model-free reinforcement learning algorithm. In fact, for relatively simple games, such as two-player zero-sum games, humans and animals gradually approximate optimal strategies using model-free reinforcement learning algorithms (Mookherjee & Sopher 1994, Erev & Roth 1998, Camerer 2003, Lee et al. 2004). By contrast, players employing a model-based reinforcement learning algorithm can adjust their strategies more flexibly according to the predicted behaviors of other players. The ability to predict the beliefs and intentions of other players is often referred to as the theory of mind (Premack & Woodruff 1978), and during social decision making, this may dramatically improve the efficiency of reinforcement learning. In game theory, this is also referred to as belief learning (Camerer 2003).

In pure belief learning, the outcomes of actual choices by the decision makers do not exert any additional influence on their choices, unless such outcomes can modify the beliefs or models of the decision makers about the other players. In other words, reward or penalty does not have any separate role other than affecting the decision maker's belief about the other players. Results from studies in behavioral economics show that such pure belief-learning models do not account well for human choice behaviors (Mookherjee & Sopher 1997, Erev & Roth 1998, Feltovich 2000, Camerer 2003). Instead, human behaviors during repeated games are often consistent with hybrid learning models, such as the experience-weighted attraction model (Camerer & Ho 1999, Zhu et al. 2012). In these hybrid models, value functions are adjusted by both real and fictive reward prediction errors. Learning rates for these different reward prediction errors can be set independently. Similar to the results from human studies, hybrid models are more accurate than either model-free reinforcement learning or belief-learning models at accounting for the behaviors of nonhuman primates performing a competitive game task against a computer (Lee et al. 2005, Abe & Lee 2011).
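A minimal Python sketch of this kind of hybrid update is shown below for a repeated game in which the forgone payoff of the unchosen action is observable. The function name, parameter values, and the simplified form of the update are assumptions for illustration; they do not reproduce the exact experience-weighted attraction model.

```python
import numpy as np

def hybrid_update(q, chosen, payoffs, alpha_real=0.2, alpha_fictive=0.1):
    """Update value functions using both real and fictive reward prediction errors.

    q        : array of value functions, one per action
    chosen   : index of the action actually taken
    payoffs  : payoff each action would have yielded on this trial
               (observable in many repeated games)
    The two learning rates can be set independently, as in hybrid learning models.
    """
    q = q.copy()
    for a, payoff in enumerate(payoffs):
        if a == chosen:
            q[a] += alpha_real * (payoff - q[a])      # real RPE for the chosen action
        else:
            q[a] += alpha_fictive * (payoff - q[a])   # fictive RPE for an unchosen action
    return q

# Example: two actions; the chosen action (0) earned 0, the unchosen one would have earned 1.
print(hybrid_update(np.array([0.5, 0.5]), chosen=0, payoffs=[0.0, 1.0]))   # [0.4, 0.55]
```

Setting alpha_fictive to zero recovers a purely model-free update, whereas making the two rates equal roughly approximates a pure belief-learning update in which actual and forgone payoffs carry the same weight.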

NEURAL REPRESENTATION OF VALUE FUNCTIONS

Utilities and value functions are central to economic and reinforcement learning theories of decision making, respectively. In both theories, these quantities are assumed to capture all the relevant factors influencing choices. Thus, brain areas or neurons involved in decision making are expected to harbor signals related to utilities and value functions. In fact, neural activity related to reward expectancy has been found in many different brain areas (Schultz et al. 2000, Hikosaka et al. 2006, Wallis & Kennerley 2010), including sensory cortical areas (Shuler & Bear 2006, Serences 2008, Vickery et al. 2011). The fact that signals related to reward expectancy are widespread in the brain suggests that they are likely to subserve not only reinforcement learning, but also other related cognitive processes, such as attention (Maunsell 2004, Bromberg-Martin et al. 2010a, Litt et al. 2011). Neural signals related to reward expectancy can be divided into at least two different categories, depending on whether they are related to specific actions or states. Neural signals related to action value functions would be useful in choosing a particular action, especially if such signals are observed before the execution of a motor response. Neural activity related to state value functions may play more evaluative roles. In particular, during decision making, the state value function changes from the weighted average of action values for alternative choices to the action value function for the chosen action. The latter is often referred to as the chosen value (Padoa-Schioppa & Assad 2006, Cai et al. 2011).

During decision-making experiments, choices can be made either among alternative physical movements with different spatial trajectories or among different objects regardless of the movements required to acquire them. In reinforcement learning theory, these different options are all considered actions. Nevertheless, neural signals related to the corresponding action value functions may vary substantially according to the dimension in which choices are made. In most previous neurobiological studies, different properties of reward were linked to different physical actions. These studies have identified neural activity related to action value functions in numerous brain areas, including the posterior parietal cortex (Platt & Glimcher 1999, Dorris & Glimcher 2004, Sugrue et al. 2004, Seo et al. 2009), dorsolateral prefrontal cortex (Barraclough et al. 2004, Kim et al. 2008), premotor cortex (Pastor-Bernier & Cisek 2011), medial frontal cortex (Seo & Lee 2007, 2009; So & Stuphorn 2010; Sul et al. 2010), and striatum (Samejima et al. 2005, Lau & Glimcher 2008, Kim et al. 2009, Cai et al. 2011). Despite the limited spatial and temporal resolutions available in neuroimaging studies, metabolic activity related to action value functions has also been identified in the supplementary motor area (Wunderlich et al. 2009). In contrast, neurons in the primate orbitofrontal cortex are not sensitive to spatial locations of targets associated with specific rewards (Tremblay & Schultz 1999, Wallis & Miller 2003), suggesting that they encode action value functions related to specific objects or goals. When animals chose between two different flavors of juice, neurons in the primate orbitofrontal cortex signaled the action value functions associated with specific juice flavors rather than the directions of eye movements used to indicate the animal's choices (Padoa-Schioppa & Assad 2006).

Neurons in many different brain areas often combine value functions for alternative actions and other decision-related variables. Precisely how multiple types of signals are combined in the activity of individual neurons can, therefore, provide important clues about how such signals are computed and utilized. For example, the likelihood of choosing one of two alternative choices is determined by the difference in their action value functions; therefore, neurons encoding such signals may be closely involved in the process of action selection. During a binary-choice task, neurons in the primate posterior parietal cortex (Dorris & Glimcher 2004, Sugrue et al. 2004, Seo & Lee 2008, Seo et al. 2009), dorsolateral prefrontal cortex (Kim et al. 2008), premotor cortex (Pastor-Bernier & Cisek 2011), supplementary eye field (Seo & Lee 2009), and dorsal striatum (Cai et al. 2011), as well as in the rodent secondary motor cortex (Sul et al. 2011) and striatum (Ito & Doya 2009), encode the difference between the action value functions for two alternative actions.

Signals related to state value functions are also found in many different brain areas. During a binary-choice task, the sum or average of the action value functions for two alternative choices corresponds to the state value function before a choice is made. In the posterior parietal cortex and dorsal striatum, signals related to such state value functions and action value functions coexist (Seo & Lee 2008, Seo et al. 2009, Cai et al. 2011). Neurons encoding state value functions are also found in the ventral striatum (Cai et al. 2011), anterior cingulate cortex (Seo & Lee 2007), and amygdala (Belova et al. 2008). Neural activity related to chosen values, which correspond to post-decision state value functions, is also widespread in the brain and has been found in the orbitofrontal cortex (Padoa-Schioppa & Assad 2006, Sul et al. 2010), medial frontal cortex (Sul et al. 2010), dorsolateral prefrontal cortex (Kim & Lee 2011), and striatum (Lau & Glimcher 2008, Kim et al. 2009, Cai et al. 2011). Because the reward prediction error corresponds to the difference between the outcome of a choice and the chosen value, neural activity related to chosen values may be utilized to compute reward prediction errors and update value functions. In some brain areas, such as the dorsolateral prefrontal cortex (Kim & Lee 2011) and dorsal striatum (Cai et al. 2011), activity related to chosen value signals emerges later than the signals related to the sum of the value functions for alternative actions, suggesting that action selection may take place during this delay (Figure 2).

Figure 2
Time course of signals related to different state value functions during decision making. (a) Signals related to the state value functions before (red) and after (blue) decision making in the dorsolateral prefrontal cortex (DLPFC) (Kim et al. 2008, Kim & Lee 2011) and striatum (Cai et al. 2011) during an intertemporal choice task. These two state value functions correspond to the average of the action value functions for two options and the chosen value, respectively. During these studies, monkeys chose between a small immediate reward and a large delayed reward, and the magnitude of neural signals related to different value functions was estimated by the coefficient of partial determination (CPD). Lines correspond to the mean CPD for all the neurons recorded in each brain area; the shaded area corresponds to the standard error of the mean. (b) Proportion of neurons carrying chosen value signals in multiple areas of the rodent brain. Abbreviations: ACC, anterior cingulate cortex; DS, dorsal striatum; ILC, infralimbic cortex; AGl and AGm, lateral and medial agranular cortex (corresponding to the primary and secondary motor cortex, respectively); OFC, orbitofrontal cortex; PLC, prelimbic cortex; VS, ventral striatum. During these studies (Kim et al. 2009; Sul et al. 2010, 2011), the rats performed a dynamic foraging task. Large symbols indicate that the proportions are significantly (p < 0.05) above the chance level.
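The coefficient of partial determination (CPD) mentioned in the Figure 2 legend quantifies how much of the trial-to-trial variance in a neuron's activity a given regressor (for example, chosen value) explains beyond the other regressors in a linear model. A sketch of one common way to compute it is given below, using simulated firing rates in place of recorded data; the variable names and numbers are illustrative.

```python
import numpy as np

def cpd(y, X_full, drop_col):
    """Coefficient of partial determination for one regressor.

    CPD = (SSE_reduced - SSE_full) / SSE_reduced, where SSE_full is the error of
    a linear regression using all regressors in X_full and SSE_reduced the error
    after removing the column of interest.
    """
    def sse(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return np.sum(resid ** 2)
    X_reduced = np.delete(X_full, drop_col, axis=1)
    return (sse(X_reduced) - sse(X_full)) / sse(X_reduced)

# Simulated example: firing rates that depend weakly on a "chosen value" regressor.
rng = np.random.default_rng(3)
n_trials = 500
chosen_value = rng.random(n_trials)
other_var = rng.random(n_trials)
firing_rate = 5.0 + 2.0 * chosen_value + 0.5 * other_var + rng.normal(0, 1.0, n_trials)

X = np.column_stack([np.ones(n_trials), chosen_value, other_var])
print(cpd(firing_rate, X, drop_col=1))  # variance uniquely explained by chosen value
```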

NEURAL MECHANISMS OF ACTION SELECTION

During decision making, neural activity related to action value functions must be converted to the signals related to a particular action and transmitted to motor structures. Therefore, some of the brain areas encoding action value functions or their differences are likely to be involved in action selection as well. Nevertheless, the precise anatomical location playing a primary role in action selection may vary with the nature of a behavioral task. For example, compared with flexible goal-directed behaviors, actions selected by fixed stimulus-action associations or well-practiced motor sequences may rely more on the dorsolateral striatum (Knowlton et al. 1996, Hikosaka et al. 1999, Yin & Knowlton 2006). Considering that spike trains of cortical neurons are stochastic (Softky & Koch 1993), the process of action selection is likely to rely on a network of neurons temporally integrating the activity related to the difference in action value functions (Soltani & Wang 2006, Krajbich et al. 2010). An analogous process has been extensively studied for action selection based on noisy sensory stimuli during perceptual decision making. For example, psychophysical performance during a two-alternative forced-choice task is well described by the so-called random-walk or drift-diffusion model, in which a particular action is selected when the gradual accumulation of noisy evidence reaches a threshold for that action (Laming 1968, Roitman & Shadlen 2002, Smith & Ratcliff 2004).
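The Python sketch below simulates this accumulation-to-threshold idea in its simplest form; the drift rate can be thought of as proportional to the difference in value (or sensory evidence) between two options, and all parameter values are illustrative rather than fitted to any data set.

```python
import numpy as np

def drift_diffusion_choice(drift, threshold=1.0, noise_sd=1.0, dt=0.001, rng=None):
    """Simulate one decision with a simple drift-diffusion process.

    Evidence favoring action 1 over action 2 accumulates with a constant drift
    plus Gaussian noise; a response is made when the accumulated evidence
    reaches +threshold (action 1) or -threshold (action 2).
    """
    rng = rng or np.random.default_rng()
    evidence, t = 0.0, 0.0
    while abs(evidence) < threshold:
        evidence += drift * dt + noise_sd * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return (1 if evidence > 0 else 2), t

rng = np.random.default_rng(1)
# A positive drift (e.g., a value difference favoring action 1) makes action 1
# more likely and, on average, speeds up the decision.
results = [drift_diffusion_choice(drift=0.8, rng=rng) for _ in range(200)]
p_action1 = np.mean([choice == 1 for choice, _ in results])
mean_rt = np.mean([rt for _, rt in results])
print(p_action1, mean_rt)
```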

Neurons in multiple brain areas involved in motor control often build up their activity gradually prior to specific movements, suggesting that these areas may be involved in action selection. Execution of voluntary movements is tightly coupled with phasic neural activity in a number of brain areas, such as the primary motor cortex (Georgopoulos et al. 1986), premotor cortex (Churchland et al. 2006), frontal eye field (Hanes & Schall 1996), supplementary eye field (Schlag & Schlag-Rey 1987), posterior parietal cortex (Andersen et al. 1987), and superior colliculus (Schiller & Stryker 1972, Wurtz & Goldberg 1972). All these structures are closely connected with motor structures in the brainstem and spinal cord. In addition, neurons in these areas display persistent activity related to the metrics of upcoming movements when the desired movement is indicated before a "go" signal, suggesting that they are also involved in motor planning and preparation (Weinrich & Wise 1982, Bruce & Goldberg 1985, Gnadt & Andersen 1988, Schall 1991, Glimcher & Sparks 1992, Smyrnis et al. 1992). Such persistent activity has often been associated with working memory (Funahashi et al. 1989, Wang 2001), but it may also subserve the temporal integration of noisy inputs (Shadlen & Newsome 2001, Roitman & Shadlen 2002, Wang 2002, Curtis & Lee 2010). In fact, neural activity in accordance with gradual evidence accumulation has been found in the same brain areas that show persistent activity related to motor planning, such as the posterior parietal cortex (Roitman & Shadlen 2002), frontal eye field (Ding & Gold 2011), and superior colliculus (Horwitz & Newsome 2001).

Computational studies have demonstrated that a network of neurons with recurrent excitation and lateral inhibition can perform temporal integration of noisy sensory inputs and produce a signal corresponding to an optimal action (Wang 2002, Lo & Wang 2006, Beck et al. 2008, Furman & Wang 2008). Most of these models have been developed to account for the pattern of activity observed in the lateral intraparietal (LIP) cortex during a perceptual decision-making task. Nevertheless, value-dependent action selection may also involve attractor dynamics in a similar network of neurons, provided that their input synapses are adjusted in a reward-dependent manner (Soltani & Wang 2006, Soltani et al. 2006). Therefore, neurons involved in the evaluation of unreliable sensory information may also contribute to value-based decision making. Consistent with this possibility, neurons in the LIP tend to change their activity according to the value of rewards expected from alternative actions (Platt & Glimcher 1999, Sugrue et al. 2004, Seo et al. 2009, Louie & Glimcher 2010, Rorie et al. 2010).

In contrast to the brain areas involved in selecting a specific physical movement, other areas may be involved in more abstract decision making. For example, the orbitofrontal cortex may play a particularly important role in making choices among different objects or goods (Padoa-Schioppa 2011). By contrast, an action selection process guided by memory and other endogenous cues rather than external sensory stimuli may rely more on the medial frontal cortex. Activity related to action value functions has been found in the supplementary and presupplementary motor areas (Sohn & Lee 2007, Wunderlich et al. 2009) as well as the supplementary eye field (Seo & Lee 2009, So & Stuphorn 2010). More importantly, neural activity related to an upcoming movement appears in the medial frontal cortex earlier than in other areas of the brain. For example, when human subjects are asked to initiate a movement voluntarily without any immediate sensory cue, the scalp electroencephalogram displays the so-called readiness potential well before movement onset, and its source has been localized to the supplementary motor area (Haggard 2008, Nachev et al. 2008). The hypothesis that internally generated voluntary movements are selected in the medial frontal cortex is also consistent with the results from single-neuron recording and neuroimaging studies. Individual neurons in the primate supplementary motor area often begin to change their activity according to an upcoming limb movement earlier than those in the premotor cortex or primary motor cortex, especially when the animal is required to produce such movements voluntarily without immediate sensory cues (Tanji & Kurata 1985, Okano & Tanji 1987). Similarly, neurons in the supplementary eye field begin to modulate their activity according to the direction of an upcoming saccade earlier than similar activity recorded in the frontal eye field and LIP (Coe et al. 2002). In rodents performing a dynamic foraging task, signals related to the animal's choice appear in the medial motor cortex, presumably a homolog of the primate supplementary motor cortex, earlier than in many other brain areas, including the primary motor cortex and basal ganglia (Sul et al. 2011). Furthermore, lesions in this area make the animal's choices less dependent on action value functions (Sul et al. 2011). Finally, analysis of BOLD activity patterns during a self-timed motor task has also identified signals related to an upcoming movement up to several seconds before the movement onset in the human supplementary motor area (Soon et al. 2008).

In summary, neural activity potentially reflecting the process of action selection has been identified in multiple regions, including areas involved in motor control, the orbitofrontal cortex, and the medial frontal cortex. Thus, future investigations should determine how these multiple areas interact cooperatively or competitively depending on the demands of specific behavioral tasks. The frame of reference in which different actions are represented varies across brain areas; therefore, how the actions encoded in one frame of reference, for example, in object space, are transformed to another, such as visual or joint space, needs to be investigated (Padoa-Schioppa 2011).

NEURAL MECHANISMS FOR UPDATING VALUE FUNCTIONS

Temporal Credit Assignment and Eligibility Trace

Reward resulting from a particular action is often revealed after a substantial delay, and an animal may carry out several other actions before collecting the reward resulting from a previous action. Therefore, it can be challenging to associate an action and its corresponding outcome correctly; this is referred to as the problem of temporal credit assignment (Sutton & Barto 1998). Not surprisingly, the loss of the ability to link specific outcomes to corresponding choices interferes with the process of updating value functions appropriately. Whereas normal animals can easily alter their preferences between two objects when the probabilities of getting rewards from the two objects are switched, humans, monkeys, and rats with lesions in the orbitofrontal cortex are impaired in such reversal learning tasks (Iversen & Mishkin 1970, Schoenbaum et al. 2002, Fellows & Farah 2003, Murray et al. 2007). These deficits may result from failures in temporal credit assignment. For example, during a probabilistic reversal learning task, in which the probabilities of rewards from different objects were dynamically and unpredictably changed, deficits produced by the lesions in the orbitofrontal cortex were due to erroneous associations between the choices and their outcomes (Walton et al. 2010).

In reinforcement learning theory, the problem of temporal credit assignment can be resolved in at least two different ways. First, a series of intermediate states can be introduced during the interval between an action and a reward, so that they can propagate the information about the reward to the value function of the correct action (Montague et al. 1996). This basic temporal difference model was initially proposed to account for the reward prediction error signals conveyed by dopamine neurons but was shown to be inconsistent with the actual temporal profiles of dopamine neuron signals (Pan et al. 2005). Second, animals can use short-term memory signals related to the states or actions they select. Such memory signals, termed eligibility traces (Sutton & Barto 1998), can facilitate action-outcome association even when the outcome is delayed. Therefore, eligibility traces can account for the temporally discontinuous shift in the phasic activity of dopamine neurons observed during classical conditioning (Pan et al. 2005). Signals related to the animal's previous choices have been observed in a number of brain areas, including the prefrontal cortex and posterior parietal cortex in monkeys (Fecteau & Munoz 2003, Barraclough et al. 2004, Genovesio et al. 2006, Seo et al. 2009, Seo & Lee 2009), as well as in many regions in the rodent frontal cortex and striatum (Kim et al. 2007, 2009; Sul et al. 2010, 2011) (Figure 3), and they may provide the eligibility traces necessary to form associations between actions and their outcomes (Curtis & Lee 2010). In addition, neurons in many of these areas, including the orbitofrontal cortex, often encode specific conjunctions of chosen actions and their outcomes, for example, by increasing their activity when a positive outcome is obtained from a specific action (Barraclough et al. 2004, Kim et al. 2009, Roesch et al. 2009, Seo & Lee 2009, Sul et al. 2010, Abe & Lee 2011) (Figure 3). The brain may resolve the temporal credit assignment problem using such action-outcome conjunction signals.
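A minimal Python sketch of how an eligibility trace solves the credit assignment problem is shown below, using a TD(lambda)-style update over a short sequence of states that ends in a delayed reward. The state space, learning rate, discount factor, and trace-decay parameter are all illustrative assumptions.

```python
import numpy as np

# A short-term memory trace of visited states allows a reward delivered only at
# the end of a sequence to update the value functions of the states that preceded it.
n_states = 5                        # states visited in sequence; reward after the last one
v = np.zeros(n_states)              # state value functions
alpha, gamma, lam = 0.1, 0.95, 0.8  # learning rate, discount factor, trace decay (illustrative)

for episode in range(200):
    trace = np.zeros(n_states)      # eligibility trace, reset at the start of each episode
    for s in range(n_states):
        next_v = v[s + 1] if s + 1 < n_states else 0.0
        reward = 1.0 if s == n_states - 1 else 0.0
        rpe = reward + gamma * next_v - v[s]     # temporal difference error
        trace[s] += 1.0                          # mark the current state as eligible
        v += alpha * rpe * trace                 # credit all recently visited states
        trace *= gamma * lam                     # traces decay with time

print(np.round(v, 2))   # earlier states acquire discounted values of the delayed reward
```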

Figure 3
Time course of signals related to an animal's choice, its outcome, and action-outcome conjunction in multiple brain areas of primates and rodents. (a) Spatial layout of the choice targets during a matching-pennies task used in single-neuron recording experiments in monkeys. (b) Brain regions tested during studies using monkeys (Barraclough et al. 2004, Seo & Lee 2007, Seo et al. 2009). Abbreviations: ACCd, dorsal anterior cingulate cortex; DLPFC, dorsolateral prefrontal cortex; LIP, lateral intraparietal cortex. (c) Fraction of neurons significantly modulating their activity according to the animal's choice (top), its outcome (middle), and choice-outcome conjunction (bottom) during the current (trial lag = 0) and three previous trials (trial lags = 1–3). (d) Modified T-maze used in a rodent dynamic-foraging task. (e) Anatomical areas tested in single-neuron recording experiments in rodents (Kim et al. 2009; Sul et al. 2010, 2011). Abbreviations: ACC, anterior cingulate cortex; DS, dorsal striatum; ILC, infralimbic cortex; OFC, orbitofrontal cortex; PLC, prelimbic cortex; VS, ventral striatum. Modified with permission from Elsevier (Paxinos & Watson 1998). (f) Fraction of neurons significantly modulating their activity according to the animal's choice (top), its outcome (middle), and choice-outcome conjunction (bottom) during the current (lag = 0) and previous (lag = 1) trials. Large symbols indicate that the proportions are significantly (p < 0.05) above the chance level.

Integration of Chosen Value and Reward Prediction Error

In model-free reinforcement learning, the value function for a chosen action is revised according to the reward prediction error. Therefore, signals related to the chosen value and reward prediction error must be combined in the activity of individual neurons involved in updating value functions. Signals related to reward prediction error were first identified in the midbrain dopamine neurons (Schultz 2006) but were later found to exist in many other areas, including the lateral habenula (Matsumoto & Hikosaka 2007), globus pallidus (Hong & Hikosaka 2008), dorsolateral prefrontal cortex (Asaad & Eskandar 2011), anterior cingulate cortex (Matsumoto et al. 2007, Seo & Lee 2007), orbitofrontal cortex (Sul et al. 2010), and striatum (Kim et al. 2009, Oyama et al. 2010, Asaad & Eskandar 2011). Thus, the extraction of reward prediction error signals may be gradual and implemented through a distributed network of multiple brain areas. Dopamine neurons may then play an important role in relaying these error signals to update the value functions represented broadly in different brain areas. Signals related to chosen values are also distributed in multiple brain areas, including the medial frontal cortex, orbitofrontal cortex, and striatum (Padoa-Schioppa & Assad 2006, Lau & Glimcher 2008, Kim et al. 2009, Sul et al. 2010, Cai et al. 2011). The areas in which signals related to the chosen value and reward prediction error converge, such as the orbitofrontal cortex and striatum, may therefore play an important role in updating the value functions (Kim et al. 2009, Sul et al. 2010).

It is often hypothesized that the primary site for updating and storing action value functions is at the synapses between axons from cortical neurons and dendrites of medium spiny neurons in the striatum (Reynolds et al. 2001, Hikosaka et al. 2006, Lo & Wang 2006, Hong & Hikosaka 2011). Signals related to reward prediction error arrive at these synapses via the terminals of dopamine neurons in the substantia nigra (Levey et al. 1993, Haber et al. 2000, Schultz 2006, Haber & Knutson 2010), and multiple types of dopamine receptors in the striatum can modulate the plasticity of corticostriatal synapses according to the relative timing of presynaptic versus postsynaptic action potentials (Shen et al. 2008, Gerfen & Surmeier 2011). However, the nature of the specific information stored by these synapses remains poorly understood. In addition, whether the corticostriatal circuit carries the appropriate signals related to eligibility traces for chosen actions as well as any other necessary state information at the right time needs to be tested in future studies. Given the broad dopaminergic projections to various cortical areas (Lewis et al. 2001), value functions may be updated in many of the same cortical areas encoding the value functions at the time of decision making.
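One common modeling abstraction of this arrangement is a "three-factor" plasticity rule, sketched below, in which a corticostriatal weight changes only when an eligibility trace left by recent pre- and postsynaptic activity coincides with a dopamine-like reward prediction error signal. The function, parameter values, and numbers are hypothetical illustrations, not a description of the experimental findings cited above.

```python
def three_factor_update(w, eligibility, rpe, alpha=0.05):
    """Schematic three-factor plasticity rule for a corticostriatal synapse.

    w           : current synaptic weight
    eligibility : trace left by recent coincident pre- and postsynaptic spiking
                  (decays over time; set by earlier activity)
    rpe         : dopamine-like reward prediction error signal
    The weight changes only when the eligibility trace and the dopamine signal
    overlap in time; the learning rate alpha is an illustrative value.
    """
    return w + alpha * rpe * eligibility

# Example: a synapse that was recently active (eligibility = 0.8) is strengthened
# by a positive prediction error and weakened by a negative one.
w = 0.5
print(three_factor_update(w, eligibility=0.8, rpe=+1.0))   # 0.54
print(three_factor_update(w, eligibility=0.8, rpe=-1.0))   # 0.46
```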

Uncertainty and Learning Rate

The learning rate, which controls the speed of learning, must be adjusted according to the uncertainty and volatility of the animal's environment. In natural environments, decision makers face many different types of uncertainty. When the probabilities of different outcomes are known, as when flipping a coin or during economic experiments, uncertainty about outcomes is referred to as risk (Kahneman & Tversky 1979) or expected uncertainty (Yu & Dayan 2005). In contrast, ambiguity (Ellsberg 1961) or unexpected uncertainty (Yu & Dayan 2005) is the term used when the exact probabilities are unknown. Ambiguity or unexpected uncertainty is high when the probabilities of different outcomes expected from a given action change frequently (Behrens et al. 2007). Such volatile environments require a large learning rate so that value functions can be modified quickly. By contrast, if the environment is largely known and stable, then the learning rate should be close to 0 so that value functions are not too easily altered by random events in the environment. Human learners can change their learning rates almost optimally when the rate of changes in reward probabilities for alternative actions is manipulated (Behrens et al. 2007). The level of volatility, and hence the learning rate, is reflected in the activity of the anterior cingulate cortex, suggesting that this region of the brain may be important for adjusting the learning rate according to the stability of the decision maker's environment (Behrens et al. 2007). Similarly, the brain areas that increase their activity during decision making under ambiguity, such as the lateral prefrontal cortex, orbitofrontal cortex, and amygdala, may also be involved in optimizing the learning rate (Hsu et al. 2005, Huettel et al. 2006). Single-neuron recording studies also indicate that the orbitofrontal cortex plays a role in evaluating the amount of uncertainty in choice outcomes (Kepecs et al. 2008, O'Neill & Schultz 2010).
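One simple heuristic that captures this idea, loosely in the spirit of associability-based learning rules rather than the hierarchical Bayesian model used by Behrens et al. (2007), is to let the learning rate track a running estimate of the magnitude of recent prediction errors, as in the Python sketch below; all variable names and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

v = 0.5                  # value estimate for a single option
volatility = 0.1         # running estimate of environmental volatility
alpha_vol = 0.1          # how quickly the volatility estimate itself is updated

p_reward = 0.8
for trial in range(400):
    if trial == 200:
        p_reward = 0.2   # the environment changes: the reward probability reverses

    reward = float(rng.random() < p_reward)
    rpe = reward - v

    # Large, persistent prediction errors signal a volatile environment and
    # therefore a larger learning rate; small errors allow learning to slow down.
    volatility += alpha_vol * (abs(rpe) - volatility)
    learning_rate = min(1.0, volatility)

    v += learning_rate * rpe

print(round(v, 2), round(learning_rate, 2))
```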

NEURAL SYSTEMS FOR MODEL-FREE VERSUS MODEL-BASED REINFORCEMENT LEARNING

Model-Based Value Functions and Reward Prediction Errors

During model-based reinforcement learning, decision makers utilize their knowledge of the environment to update the estimates of outcomes expected from different actions, even without reward or penalty. A wide range of algorithms can be used to implement model-based reinforcement learning. For example, decision makers may learn the configuration and dynamics of their environment separately from the values of outcomes at different locations (Tolman 1948). Doing so enables the animal to rediscover quickly an optimal path of travel whenever the location of a desired item changes. Flexibly combining these two different types of information may rely on the prefrontal cortex (Daw et al. 2005, Pan et al. 2008, Glascher et al. 2010) and hippocampus (Womelsdorf et al. 2010, Simon & Daw 2011). For example, activity in the human lateral prefrontal cortex increases when unexpected state transitions are observed, suggesting that this area is involved in learning the likelihood of state transitions (Glascher et al. 2010). The hippocampus may play a role in providing information about the layout of the environment and other contextual information necessary to update the value functions. The integration of information about the behavioral context and current task demands encoded in these two areas, especially at the time of decision making, may rely on rhythmic synchronization of neural activity in the theta frequency range (Sirota et al. 2008, Benchenane et al. 2010, Hyman et al. 2010, Womelsdorf et al. 2010).
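The Python sketch below illustrates one such scheme in a toy environment: the agent represents where each action leads (a state-transition model) and how much each state is worth, derives action values by iterating over this model rather than by direct trial and error, and can therefore recompute all action values immediately when an outcome is devalued. The environment, function names, and numbers are illustrative assumptions, not a specific published algorithm.

```python
import numpy as np

terminal = 3
# transition[s, a] = state reached by taking action a in state s
transition = np.array([[1, 2],
                       [3, 2],
                       [3, 1],
                       [3, 3]])
# reward received upon entering each state
state_reward = np.array([0.0, 0.0, 0.0, 1.0])
gamma = 0.9

def plan_action_values(transition, state_reward, gamma, n_sweeps=50):
    """Compute Q(s, a) from the learned model by simple value iteration."""
    q = np.zeros(transition.shape)
    for _ in range(n_sweeps):
        v = q.max(axis=1)          # state values under greedy choice
        v[terminal] = 0.0          # no further reward after the goal state
        q = state_reward[transition] + gamma * v[transition]
    return q

q = plan_action_values(transition, state_reward, gamma)

# If the outcome is devalued (e.g., the animal is satiated, or the desired item
# moves), the model lets all action values be recomputed immediately, without
# re-experiencing each action and its outcome.
q_devalued = plan_action_values(transition, np.array([0.0, 0.0, 0.0, 0.2]), gamma)
print(np.round(q, 2))
print(np.round(q_devalued, 2))
```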

Whether and to what extent model-free and model-based forms of reinforcement learning are supported by the same brain areas remains an important area of research. Value functions estimated by model-free and model-based algorithms may be updated or represented separately in different brain areas (Daw et al. 2005), but they may also be combined to produce a unique estimate for the outcomes expected from the chosen actions. For example, when decision makers are required to combine information about reward history and social information, the reliability of the predictions based on these two different types of information is reflected separately in two different regions of the anterior cingulate cortex (Behrens et al. 2008) (Figure 4). By contrast, signals related to reward probability predicted by both types of information were found in the ventromedial prefrontal cortex (Behrens et al. 2008). Similarly, the human ventral striatum may represent the chosen values and reward prediction errors regardless of how they are computed (Daw et al. 2011, Simon & Daw 2011, Zhu et al. 2012). This is also consistent with the finding that reward prediction error signals encoded by the midbrain dopamine neurons, as well as by the neurons in the globus pallidus and lateral habenula, are in accordance with both model-free and model-based reinforcement learning (Bromberg-Martin et al. 2010b).

Figure 4
Areas in the human brain involved in updating model-free and model-based value functions (Behrens et al. 2008). (a) Regions in which the activity is correlated with the volatility in estimating the value functions based on reward history (green) and social information (red). (b) Activity in the ventromedial prefrontal cortex was correlated with the value functions regardless of whether they were estimated from reward history or social information. (c) Subjects more strongly influenced by reward history (ordinate) tended to show greater signal change in the anterior cingulate cortex (ACC) in association with reward history (abscissa; green region in panel a). (d) Subjects more strongly influenced by social information (ordinate) showed greater signal changes in the ACC in association with social information (abscissa; red region in panel a). Reprinted by permission from Macmillan Publishers Ltd: Nature 456:245–49, copyright 2008.

Hypothetical Outcomes and Mental Simulation

From the information that becomes available after completing chosen actions, decision makers can often deduce what alternative outcomes would have been possible from other actions. They can then use this information about hypothetical outcomes to update the action value functions for unchosen actions. In particular, the observed behaviors of other decision makers during social interaction are a rich source of information about such hypothetical outcomes (Camerer & Ho 1999, Camerer 2003, Lee et al. 2005, Lee 2008). Results from lesion and neuroimaging studies have demonstrated that the information about hypothetical or counterfactual outcomes may be processed in the same brain areas that are also involved in evaluating the actual outcomes of chosen actions, such as the prefrontal cortex (Camille et al. 2004, Coricelli et al. 2005, Boorman et al. 2011) and striatum (Lohrenz et al. 2007). The hippocampus may also be involved in simulating the possible outcomes of future actions (Hassabis & Maguire 2007, Johnson & Redish 2007, Schacter et al. 2007, Luhmann et al. 2008). Single-neuron recording studies have also shown that neurons in the dorsal anterior cingulate cortex respond similarly to actual and hypothetical outcomes (Hayden et al. 2009). More recent neurophysiological experiments in the dorsolateral prefrontal cortex and orbitofrontal cortex further revealed that neurons in these areas tend to encode actual and hypothetical outcomes for the same action, suggesting that they may provide an important substrate for updating the action value functions for chosen and unchosen actions simultaneously (Abe & Lee 2011) (Figure 5).


Figure 5
Neuronal activity related to hypothetical outcomes in the primate orbitofrontal cortex. (a) Virtual rock-paper-scissors task used for single-neuron recording studies in monkeys (Abe & Lee 2011). (b) An example of a neuron recorded in the orbitofrontal cortex that modulated its activity according to the magnitude of reward that was available from the unchosen winning target (indicated by "W" in the top panels). The spike density function of this neuron was estimated separately according to the position of the winning target (columns), the position of the target chosen by the animal (rows), and the magnitude of the reward available from the winning target (colors).

CONCLUSIONS

In recent decades, reinforcement learning theory has become a central framework in the newly emerging areas of neuroeconomics and decision neuroscience. This is hardly surprising, because unlike abstract decisions analyzed in economic theories, biological organisms seldom receive complete information about the likelihoods of different outcomes expected from alternative actions. Instead, they face the challenge of learning how to predict the outcomes of their actions by trial and error, which is the essence of reinforcement learning.

The field of reinforcement learning has yielded many different algorithms, which provide neurobiologists with exciting opportunities to test whether they can successfully account for the actual behaviors of humans and animals and for how different computational elements are implemented in the brain. In some cases, particular theoretical components closely correspond to specific brain structures, as in the case of reward prediction errors and midbrain dopamine neurons. However, in general, a relatively well-circumscribed computational step in a given algorithm is often implemented in multiple brain areas, and this relationship may change with the animal's experience and task demands. The challenge that lies ahead is, therefore, to understand whether and how the signals in different brain areas related to various components of reinforcement learning, such as action value functions and chosen value, make different contributions to the overall behaviors of the animal. An important example discussed in this article is the relationship between model-free and model-based reinforcement learning algorithms. The neural machinery of model-free reinforcement learning may be phylogenetically older, and for simple decision-making problems, it may be more robust. By contrast, for complex decision-making problems, model-based reinforcement learning algorithms can be more efficient, because they can avoid the need to relearn appropriate stimulus-action associations repeatedly by exploiting the regularities in the environment and the current information about the animal's internal state. Failures in applying appropriate reinforcement learning algorithms can lead to a variety of maladaptive behaviors observed in different mental disorders. Therefore, it would be crucial for future studies to elucidate the mechanisms that allow the brain to coordinate different types of reinforcement learning and their individual elements.

DISCLOSURE STATEMENT

The authors are not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.

ACKNOWLEDGMENTS

We are grateful to Soyoun Kim, Jung Hoon Sul, and Hoseok Kim for their help with the illustrations and to Jeansok Kim, Matthew Kleinman, and Tim Vickery for helpful comments on the manuscript. The research of the authors was supported by grants from the National Institute on Drug Abuse (DA024855 and DA029330 to D.L.) and the Korea Ministry of Education, Science, and Technology (the Brain Research Center of the 21st Century Frontier Research Program, National Research Foundation grant 2011-0015618, and the Original Technology Research Program for Brain Science 2011-0019209 to M.W.J.).

LITERATURE CITED

Abe H, Lee D. 2011. Distributed coding of actual and hypothetical outcomes in the orbital and dorsolateral prefrontal cortex. Neuron 70:731–41

Andersen RA, Essick GK, Siegel RM. 1987. Neurons of area 7 activated by both visual stimuli and oculomotor behavior. Exp. Brain Res. 67:316–22

Asaad WF, Eskandar EN. 2011. Encoding of both positive and negative reward prediction errors by neurons of the primate lateral prefrontal cortex and caudate nucleus. J. Neurosci. 31:17772–87

Balleine BW, Dickinson A. 1998. Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology 37:407–19

Barraclough DJ, Conroy ML, Lee D. 2004. Prefrontal cortex and decision making in a mixed-strategy game. Nat. Neurosci. 7:404–10

Beck JM, Ma WJ, Kiani R, Hanks T, Churchland AK, et al. 2008. Probabilistic population codes for Bayesian decision making. Neuron 60:1142–52

Belova MA, Paton JJ, Salzman CD. 2008. Moment-to-moment tracking of state value in the amygdala. J. Neurosci. 28:10023–30

Behrens TE, Woolrich MW, Walton ME, Rushworth MF. 2007. Learning the value of information in an uncertain world. Nat. Neurosci. 10:1214–21

Behrens TE, Hunt LT, Woolrich MW, Rushworth MF. 2008. Associative learning of social value. Nature 456:245–49

Benchenane K, Peyrache A, Khamassi M, Tierney PL, Gioanni Y, et al. 2010. Coherent theta oscillations and reorganization of spike timing in the hippocampal-prefrontal network upon learning. Neuron 66:921–36

Boorman ED, Behrens TE, Rushworth MF. 2011. Counterfactual choice and learning in a neural network centered on human lateral frontopolar cortex. PLoS Biol. 9:e1001093


Bornstein AM, Daw ND. 2011. Multiplicity of control in the basal ganglia: computational roles of striatal subregions. Curr. Opin. Neurobiol. 21:374–80

Bromberg-Martin ES, Matsumoto M, Hikosaka O. 2010a. Dopamine in motivational control: rewarding, aversive, and alerting. Neuron 68:815–34

Bromberg-Martin ES, Matsumoto M, Hong S, Hikosaka O. 2010b. A pallidus-habenula-dopamine pathway signals inferred stimulus values. J. Neurophysiol. 104:1068–76

Bruce CJ, Goldberg ME. 1985. Primate frontal eye fields. I. Single neurons discharging before saccades. J. Neurophysiol. 53:603–35

Cai X, Kim S, Lee D. 2011. Heterogeneous coding of temporally discounted values in the dorsal and ventral striatum during intertemporal choice. Neuron 69:170–82

Camerer CF. 2003. Behavioral Game Theory: Experiments in Strategic Interaction. Princeton: Princeton Univ. Press

Camerer C, Ho TH. 1999. Experience-weighted attraction learning in normal form games. Econometrica 67:827–74

Camille N, Coricelli G, Sallet J, Pradat-Diehl P, Duhamel JR, Sirigu A. 2004. The involvement of the orbitofrontal cortex in the experience of regret. Science 304:1167–70

Churchland MM, Yu BM, Ryu SI, Santhanam G, Shenoy KV. 2006. Neural variability in premotor cortex provides a signature of motor preparation. J. Neurosci. 26:3697–712

Coe B, Tomihara K, Matsuzawa M, Hikosaka O. 2002. Visual and anticipatory bias in three cortical eye fields of the monkey during an adaptive decision-making task. J. Neurosci. 22:5081–90

Coricelli G, Critchley HD, Joffily M, O’Doherty JP, Sirigu A, Dolan RJ. 2005. Regret and its avoidance: a neuroimaging study of choice behavior. Nat. Neurosci. 8:1255–62

Curtis CE, Lee D. 2010. Beyond working memory: the role of persistent activity in decision making. Trends Cogn. Sci. 14:216–22

Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ. 2011. Model-based influences on humans’ choices and striatal prediction errors. Neuron 69:1204–15

Daw ND, Niv Y, Dayan P. 2005. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat. Neurosci. 8:1704–11

Dayan P, Niv Y. 2008. Reinforcement learning: the good, the bad and the ugly. Curr. Opin. Neurobiol. 18:185–96

Ding L, Gold JI. 2011. Neural correlates of perceptual decision making before, during, and after decision commitment in monkey frontal eye field. Cereb. Cortex. doi: 10.1093/cercor/bhr178

Dorris MC, Glimcher PW. 2004. Activity in posterior parietal cortex is correlated with the relative subjective desirability of action. Neuron 44:365–78

Ellsberg D. 1961. Risk, ambiguity, and the Savage axioms. Q. J. Econ. 61:643–69

Erev I, Roth AE. 1998. Predicting how people play games: reinforcement learning in experimental games with unique, mixed strategy equilibria. Am. Econ. Rev. 88:848–81

Fecteau JH, Munoz DP. 2003. Exploring the consequences of the previous trial. Nat. Rev. Neurosci. 4:435–43

Fellows LK, Farah MJ. 2003. Ventromedial frontal cortex mediates affective shifting in humans: evidence from a reversal learning paradigm. Brain 126:1830–37

Feltovich R. 2000. Reinforcement-based versus belief-based learning models in experimental asymmetric-information games. Econometrica 68:605–41

Funahashi S, Bruce CJ, Goldman-Rakic PS. 1989. Mnemonic coding of visual space in the monkey’s dorsolateral prefrontal cortex. J. Neurophysiol. 61:331–49

Furman M, Wang X-J. 2008. Similarity effect and optimal control of multiple-choice decision making. Neuron 60:1153–68

Genovesio A, Brasted PJ, Wise SP. 2006. Representation of future and previous spatial goals by separate neural populations in prefrontal cortex. J. Neurosci. 26:7305–16

Georgopoulos AP, Schwartz AB, Kettner RE. 1986. Neuronal population coding of movement direction. Science 233:1416–19

Gerfen CR, Surmeier DJ. 2011. Modulation of striatal projection systems by dopamine. Annu. Rev. Neurosci. 34:441–66


Glascher J, Daw N, Dayan P, O’Doherty JP. 2010. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron 66:585–95

Glimcher PW, Sparks DL. 1992. Movement selection in advance of action in the superior colliculus. Nature 355:542–45

Gnadt JW, Andersen RA. 1988. Memory related motor planning activity in posterior parietal cortex of macaque. Exp. Brain Res. 70:216–20

Haber SN, Fudge JL, McFarland R. 2000. Striatonigrostriatal pathways in primates form an ascending spiral from the shell to the dorsolateral striatum. J. Neurosci. 20:2369–82

Haber SN, Knutson B. 2010. The reward circuit: linking primate anatomy and human imaging. Neuropsychopharmacology 35:4–26

Haggard P. 2008. Human volition: towards a neuroscience of will. Nat. Rev. Neurosci. 9:934–46

Hanes DP, Schall JD. 1996. Neural control of voluntary movement initiation. Science 274:427–30

Hassabis D, Maguire EA. 2007. Deconstructing episodic memory with construction. Trends Cogn. Sci. 11:299–306

Hayden BY, Pearson JM, Platt ML. 2009. Fictive reward signals in the anterior cingulate cortex. Science 324:948–50

Hikosaka O, Nakamura K, Nakahara H. 2006. Basal ganglia orient eyes to reward. J. Neurophysiol. 95:567–84

Hikosaka O, Nakahara H, Rand MK, Sakai K, Lu X, et al. 1999. Parallel neural networks for learning sequential procedures. Trends Neurosci. 22:464–71

Hong S, Hikosaka O. 2008. The globus pallidus sends reward-related signals to the lateral habenula. Neuron 60:720–29

Hong S, Hikosaka O. 2011. Dopamine-mediated learning and switching in cortico-striatal circuit explain behavioral changes in reinforcement learning. Front. Behav. Neurosci. 5:15

Horwitz GD, Newsome WT. 2001. Target selection for saccadic eye movements: prelude activity in the superior colliculus during a direction-discrimination task. J. Neurophysiol. 86:2543–58

Hsu M, Bhatt M, Adolphs R, Tranel D, Camerer CF. 2005. Neural systems responding to degrees of uncertainty in human decision-making. Science 310:1680–83

Huettel SA, Stowe CJ, Gordon EM, Warner BT, Platt ML. 2006. Neural signatures of economic preferences for risk and ambiguity. Neuron 49:765–75

Hyman JM, Zilli EA, Paley AM, Hasselmo ME. 2010. Working memory performance correlates with prefrontal-hippocampal theta interactions but not with prefrontal neuron firing rates. Front. Integr. Neurosci. 4:2

Ito M, Doya K. 2009. Validation of decision-making models and analysis of decision variables in the rat basal ganglia. J. Neurosci. 29:9861–74

Ito M, Doya K. 2011. Multiple representations and algorithms for reinforcement learning in the cortico-basal ganglia circuit. Curr. Opin. Neurobiol. 21:368–73

Iversen SD, Mishkin M. 1970. Perseverative interference in monkeys following selective lesions of the inferior prefrontal convexity. Exp. Brain Res. 11:376–86

Johnson A, Redish AD. 2007. Neural ensembles in CA3 transiently encode paths forward of the animal at a decision point. J. Neurosci. 27:12176–89

Kahneman D, Tversky A. 1979. Prospect theory: an analysis of decision under risk. Econometrica 47:263–91

Kepecs A, Uchida N, Zariwala HA, Mainen ZF. 2008. Neural correlates, computation and behavioural impact of decision confidence. Nature 455:227–31

Kim H, Sul JH, Huh N, Lee D, Jung MW. 2009. Role of striatum in updating values of chosen actions. J. Neurosci. 29:14701–12

Kim S, Hwang J, Lee D. 2008. Prefrontal coding of temporally discounted values during intertemporal choice. Neuron 59:161–72

Kim S, Lee D. 2011. Prefrontal cortex and impulsive decision making. Biol. Psychiatry 69:1140–46

Kim Y, Huh N, Lee H, Baeg E, Lee D, Jung MW. 2007. Encoding of action history in the rat ventral striatum. J. Neurophysiol. 98:3548–56

Knowlton BJ, Mangels JA, Squire LR. 1996. A neostriatal habit learning system in humans. Science 273:1399–402


Krajbich I, Armel C, Rangel A. 2010. Visual fixations and the computation and comparison of value in simple choice. Nat. Neurosci. 13:1292–98

Laming DRJ. 1968. Information Theory of Choice-Reaction Times. London: Academic

Lau B, Glimcher PW. 2008. Value representations in the primate striatum during matching behavior. Neuron 58:451–63

Lee D. 2008. Game theory and neural basis of social decision making. Nat. Neurosci. 11:404–9

Lee D, Conroy ML, McGreevy BP, Barraclough DJ. 2004. Reinforcement learning and decision making in monkeys during a competitive game. Cogn. Brain Res. 22:45–58

Lee D, McGreevy BP, Barraclough DJ. 2005. Learning and decision making in monkeys during a rock-paper-scissors game. Cogn. Brain Res. 25:416–30

Levey AI, Hersch SM, Rye DB, Sunahara RK, Niznik HB, et al. 1993. Localization of D1 and D2 dopamine receptors in brain with subtype-specific antibodies. Proc. Natl. Acad. Sci. USA 90:8861–65

Lewis DA, Melchitzky DS, Sesack SR, Whitehead RE, Auh S, Sampson A. 2001. Dopamine transporter immunoreactivity in monkey cerebral cortex: regional, laminar, and ultrastructural localization. J. Compar. Neurol. 432:119–36

Litt A, Plassmann H, Shiv B, Rangel A. 2011. Dissociating valuation and saliency signals during decision making. Cereb. Cortex 21:95–102

Lo C-C, Wang X-J. 2006. Cortico-basal ganglia circuit mechanism for a decision threshold in reaction time tasks. Nat. Neurosci. 9:956–63

Lohrenz T, McCabe K, Camerer CF, Montague PR. 2007. Neural signature of fictive learning signals in a sequential investment task. Proc. Natl. Acad. Sci. USA 104:9493–98

Louie K, Glimcher PW. 2010. Separating value from choice: delay discounting activity in the lateral intraparietal area. J. Neurosci. 30:5498–507

Luhmann CC, Chun MM, Yi D-J, Lee D, Wang X-J. 2008. Neural dissociation of delay and uncertainty in intertemporal choice. J. Neurosci. 28:14459–66

Matsumoto M, Hikosaka O. 2007. Lateral habenula as a source of negative reward signals in dopamine neurons. Nature 447:1111–15

Matsumoto M, Matsumoto K, Abe H, Tanaka K. 2007. Medial prefrontal cell activity signaling prediction errors of action values. Nat. Neurosci. 10:647–56

Maunsell JHR. 2004. Neuronal representations of cognitive state: reward or attention? Trends Cogn. Sci. 8:261–65

Montague PR, Dayan P, Sejnowski TJ. 1996. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci. 16:1936–47

Mookherjee D, Sopher B. 1994. Learning behavior in an experimental matching pennies game. Games Econ. Behav. 7:62–91

Mookherjee D, Sopher B. 1997. Learning and decision costs in experimental constant sum games. Games Econ. Behav. 19:97–132

Murray EA, O’Doherty JP, Schoenbaum G. 2007. What we know and do not know about the functions of the orbitofrontal cortex after 20 years of cross-species studies. J. Neurosci. 27:8166–69

Nachev P, Kennard C, Husain M. 2008. Functional role of the supplementary and pre-supplementary motor areas. Nat. Rev. Neurosci. 9:856–69

Okano K, Tanji J. 1987. Neuronal activities in the primate motor fields of the agranular frontal cortex preceding visually triggered and self-paced movement. Exp. Brain Res. 66:155–66

O’Neill M, Schultz W. 2010. Coding of reward risk by orbitofrontal neurons is mostly distinct from coding of reward value. Neuron 68:789–800

Oyama K, Hernadi I, Iijima T, Tsutsui K-I. 2010. Reward prediction error coding in dorsal striatal neurons. J. Neurosci. 30:11447–57

Padoa-Schioppa C. 2011. Neurobiology of economic choice: a good-based model. Annu. Rev. Neurosci. 34:333–59

Padoa-Schioppa C, Assad JA. 2006. Neurons in the orbitofrontal cortex encode economic value. Nature 441:223–26


Pan W-X, Schmidt R, Wickens JR, Hyland BI. 2005. Dopamine cells respond to predicted events during classical conditioning: evidence for eligibility traces in the reward-learning network. J. Neurosci. 25:6235–42

Pan X, Sawa K, Tsuda I, Tsukada M, Sakagami M. 2008. Reward prediction based on stimulus categorization in primate lateral prefrontal cortex. Nat. Neurosci. 11:703–12

Pastor-Bernier A, Cisek P. 2011. Neural correlates of biased competition in premotor cortex. J. Neurosci. 31:7083–88

Paxinos G, Watson C. 1998. The Rat Brain in Stereotaxic Coordinates. San Diego: Academic

Platt ML, Glimcher PW. 1999. Neural correlates of decision variables in parietal cortex. Nature 400:233–38

Premack D, Woodruff G. 1978. Does the chimpanzee have a theory of mind? Behav. Brain Sci. 4:515–26

Reynolds JNJ, Hyland BI, Wickens JR. 2001. A cellular mechanism of reward-related learning. Nature 413:67–70

Roesch MR, Singh T, Brown PL, Mullins SE, Schoenbaum G. 2009. Ventral striatal neurons encode the value of the chosen action in rats deciding between differently delayed or sized rewards. J. Neurosci. 29:13365–76

Roese NJ, Olson JM. 1995. What Might Have Been: The Social Psychology of Counterfactual Thinking. New York: Psychol. Press

Roitman JD, Shadlen MN. 2002. Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. J. Neurosci. 22:9475–89

Rorie AE, Gao J, McClelland JL, Newsome WT. 2010. Integration of sensory and reward information during perceptual decision-making in lateral intraparietal cortex (LIP) of the macaque monkey. PLoS One 5:e9308

Samejima K, Ueda Y, Doya K, Kimura M. 2005. Representation of action-specific reward values in the striatum. Science 310:1337–40

Schacter DL, Addis DR, Buckner RL. 2007. Remembering the past to imagine the future: the prospective brain. Nat. Rev. Neurosci. 8:657–61

Schall JD. 1991. Neuronal activity related to visually guided saccadic eye movements in the supplementary motor area of rhesus monkeys. J. Neurophysiol. 66:530–58

Schiller PH, Stryker M. 1972. Single-unit recording and stimulation in superior colliculus of the alert rhesus monkey. J. Neurophysiol. 35:915–24

Schlag J, Schlag-Rey M. 1987. Evidence for a supplementary eye field. J. Neurophysiol. 57:179–200

Schoenbaum G, Nugent SL, Saddoris MP, Setlow B. 2002. Orbitofrontal lesions in rats impair reversal but not acquisition of go, no-go odor discriminations. Neuroreport 13:885–90

Schultz W. 2006. Behavioral theories and the neurophysiology of reward. Annu. Rev. Psychol. 57:87–115

Schultz W, Tremblay L, Hollerman JR. 2000. Reward processing in primate orbitofrontal cortex and basal ganglia. Cereb. Cortex 10:272–84

Seo H, Barraclough DJ, Lee D. 2009. Lateral intraparietal cortex and reinforcement learning during a mixed-strategy game. J. Neurosci. 29:7278–89

Seo H, Lee D. 2007. Temporal filtering of reward signals in the dorsal anterior cingulate cortex during a mixed-strategy game. J. Neurosci. 27:8366–77

Seo H, Lee D. 2008. Cortical mechanisms for reinforcement learning in competitive games. Philos. Trans. R. Soc. Lond. Ser. B 363:3845–57

Seo H, Lee D. 2009. Behavioral and neural changes after gains and losses of conditioned reinforcers. J. Neurosci. 29:3627–41

Serences JT. 2008. Value-based modulations in human visual cortex. Neuron 60:1169–81

Shadlen MN, Newsome WT. 2001. Neural basis of a perceptual decision in the parietal cortex of the rhesus monkey. J. Neurophysiol. 86:1916–36

Shen W, Flajolet M, Greengard P, Surmeier DJ. 2008. Dichotomous dopaminergic control of striatal synaptic plasticity. Science 321:848–51

Simon DA, Daw ND. 2011. Neural correlates of forward planning in a spatial decision task in humans. J. Neurosci. 31:5526–39

Shuler MG, Bear MF. 2006. Reward timing in the primary visual cortex. Science 311:1606–9


Sirota A, Montgomery S, Fujisawa S, Isomura Y, Zugaro M, Buzsaki G. 2008. Entrainment of neocortical neurons and gamma oscillations by the hippocampal theta rhythm. Neuron 60:683–97

Smith PL, Ratcliff R. 2004. Psychology and neurobiology of simple decisions. Trends Neurosci. 27:161–68

Smyrnis N, Taira M, Ashe J, Georgopoulos AP. 1992. Motor cortical activity in a memorized delay task. Exp. Brain Res. 92:139–51

So NY, Stuphorn V. 2010. Supplementary eye field encodes option and action value for saccades with variable reward. J. Neurophysiol. 104:2634–53

Softky WR, Koch C. 1993. The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci. 13:334–50

Soltani A, Lee D, Wang X-J. 2006. Neural mechanism for stochastic behaviour during a competitive game. Neural Netw. 19:1075–90

Soltani A, Wang X-J. 2006. A biophysically based neural model of matching law behavior: melioration by stochastic synapses. J. Neurosci. 26:3731–44

Sohn J-W, Lee D. 2007. Order-dependent modulation of directional signals in the supplementary and pre-supplementary motor areas. J. Neurosci. 27:13655–66

Soon CS, Brass M, Heinze H-J, Haynes J-D. 2008. Unconscious determinants of free decisions in the human brain. Nat. Neurosci. 11:543–45

Sugrue LP, Corrado GS, Newsome WT. 2004. Matching behavior and the representation of value in the parietal cortex. Science 304:1782–87

Sul JH, Kim H, Huh N, Lee D, Jung MW. 2010. Distinct roles of rodent orbitofrontal and medial prefrontal cortex in decision making. Neuron 66:449–60

Sul JH, Jo S, Lee D, Jung MW. 2011. Role of rodent secondary motor cortex in value-based action selection. Nat. Neurosci. 14:1202–8

Sutton RS, Barto AG. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press

Tanji J, Kurata K. 1985. Contrasting neuronal activity in supplementary and precentral motor cortex of monkeys. I. Responses to instructions determining motor responses to forthcoming signals of different modalities. J. Neurophysiol. 53:129–41

Tolman EC. 1948. Cognitive maps in rats and men. Psychol. Rev. 55:189–208

Tremblay L, Schultz W. 1999. Relative reward preference in primate orbitofrontal cortex. Nature 398:704–8

Usher M, McClelland J. 2001. On the time course of perceptual choice: the leaky, competing accumulator model. Psychol. Rev. 108:550–92

van der Meer MAA, Redish AD. 2011. Ventral striatum: a critical look at models of learning and evaluation. Curr. Opin. Neurobiol. 21:387–92

Vickery TJ, Chun MM, Lee D. 2011. Ubiquity and specificity of reinforcement signals throughout the human brain. Neuron 72:166–77

von Neumann J, Morgenstern O. 1944. Theory of Games and Economic Behavior. Princeton: Princeton Univ. Press

Wallis JD, Kennerley SW. 2010. Heterogeneous reward signals in prefrontal cortex. Curr. Opin. Neurobiol. 20:191–98

Wallis JD, Miller EK. 2003. Neuronal activity in primate dorsolateral and orbital prefrontal cortex during performance of a reward preference task. Eur. J. Neurosci. 8:2069–81

Walton ME, Behrens TE, Buckley MJ, Rudebeck PH, Rushworth MF. 2010. Separable learning systems in the macaque brain and the role of orbitofrontal cortex in contingent learning. Neuron 65:927–39

Wang X-J. 2001. Synaptic reverberation underlying mnemonic persistent activity. Trends Neurosci. 24:455–63

Wang X-J. 2002. Probabilistic decision making by slow reverberation in cortical circuits. Neuron 36:955–68

Wang X-J. 2008. Decision making in recurrent neuronal circuits. Neuron 60:215–34

Weinrich M, Wise SP. 1982. The premotor cortex of the monkey. J. Neurosci. 2:1329–45

Womelsdorf T, Vinck M, Leung LS, Everling S. 2010. Selective theta-synchronization of choice-relevant information subserves goal-directed behavior. Front. Hum. Neurosci. 4:210

Wunderlich K, Rangel A, O’Doherty JP. 2009. Neural computations underlying action-based decision making in the human brain. Proc. Natl. Acad. Sci. USA 106:17199–204


Wurtz RH, Goldberg ME. 1972. Activity of superior colliculus in behaving monkey. 3. Cells discharging before eye movements. J. Neurophysiol. 35:575–86

Yin HH, Knowlton BJ. 2006. The role of the basal ganglia in habit formation. Nat. Rev. Neurosci. 7:464–76

Yu AJ, Dayan P. 2005. Uncertainty, neuromodulation, and attention. Neuron 46:681–92

Zhu L, Mathewson KE, Hsu M. 2012. Dissociable neural representations of reinforcement and belief prediction errors underlie strategic learning. Proc. Natl. Acad. Sci. USA 109:1419–24
