Machine learning models for temporal prediction and decision making in the brain

Doina Precup, PhD
Co-Director, Reasoning & Learning Lab, McGill University
Research Team Lead, DeepMind Montreal
Senior Member, AAAI and CIFAR; Member of MILA


Page 1

Machine learning models for temporal prediction and decision making in the brain

Doina Precup, PhD
Co-Director, Reasoning & Learning Lab, McGill University
Research Team Lead, DeepMind Montreal
Senior Member, AAAI and CIFAR; Member of MILA

Page 2

Example: AlphaGo

Page 3

Supervised Learning

Given: input data and desired outputs. E.g.: images and the parts of interest within them

Goal: find a function that matches the provided examples and can then be applied to new inputs
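The fitting step above can be sketched with a toy linear model (a minimal illustration assuming a NumPy environment; the slide's image example would of course use a far richer model):

```python
import numpy as np

# Toy supervised learning: fit f(x) = w*x + b to example (input, output) pairs,
# then apply the learned function to a new input.
X = np.array([0.0, 1.0, 2.0, 3.0])   # inputs
y = 2.0 * X + 1.0                    # desired outputs

A = np.vstack([X, np.ones_like(X)]).T
w, b = np.linalg.lstsq(A, y, rcond=None)[0]  # least-squares fit

prediction = w * 5.0 + b             # a new input not in the training set
print(round(prediction, 6))          # -> 11.0
```

The key point is that the learned function generalizes: it is evaluated on an input that never appeared among the provided examples.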

Page 4

Deep learning

Page 5

Reinforcement Learning

Reward: Food or shock

Reward: Positive and negative numbers

• Learning by trial and error
• Reward is often delayed

Page 6

Computational framework

Page 7

Example: AlphaGo

• Perceptions: state of the board
• Actions: legal moves
• Reward: +1 or -1 at the end of the game
• Trained by playing games against itself
• Invented new ways of playing which seem superior

The Game of Go

~10^170 unique positions

~200 moves long

~200 branching factor

~10^360 game-tree complexity

Page 8

Basic Principles of Reinforcement Learning

• All machine learning is driven to minimize prediction errors

• We use calculus to adjust the parameters toward this goal

• In reinforcement learning, the algorithm makes predictions at every time step

• The prediction error is computed between one time step and the next

• We want these predictions to be consistent, i.e. similar to each other

• If the situation improved since the last time step, pick the last action more often

Page 9

Some definitions

Page 10

Temporal-difference learning (Sutton, 1988)
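The TD idea can be sketched concretely. Below is an illustrative TD(0) value-learning loop on a hypothetical 5-state chain (not an example from the talk): the prediction error is computed between one time step and the next, and each value is nudged toward consistency with its successor.

```python
# TD(0) sketch: learn state values V by making successive predictions
# consistent. The temporal-difference (prediction) error is
# delta = r + gamma * V[s'] - V[s].
gamma, alpha = 0.9, 0.1
V = [0.0] * 5                         # states 0..4; state 4 is terminal
for _ in range(2000):                 # repeated walks down the chain
    s = 0
    while s != 4:
        s_next = s + 1
        r = 1.0 if s_next == 4 else 0.0          # reward only at the end
        v_next = 0.0 if s_next == 4 else V[s_next]
        delta = r + gamma * v_next - V[s]        # TD error
        V[s] += alpha * delta                    # nudge prediction toward consistency
        s = s_next
print([round(v, 2) for v in V])       # approaches [0.73, 0.81, 0.9, 1.0, 0.0]
```

Note that reward arrives only at the end, yet earlier states acquire value: the delayed reward propagates backward through the chain of TD errors, one step per visit.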

Page 11

Explaining dopaminergic neural activity

Cf. Schultz et al., 1996, and much current work

Page 12

Actor-critic architecture
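The actor-critic idea can be sketched in a few lines (an illustrative toy with made-up variable names, not the talk's implementation): the critic's TD error both updates the value estimate and reinforces the last action taken by the actor.

```python
import math, random

# Actor-critic sketch on a one-state, two-action problem: the critic's TD
# error delta both updates the value estimate and makes the actor "pick the
# last action more often" when the outcome beat the prediction.
random.seed(1)
alpha_v, alpha_pi = 0.1, 0.1
prefs = [0.0, 0.0]          # actor: action preferences
V = 0.0                     # critic: value of the single state

def softmax(p):
    e = [math.exp(x) for x in p]
    total = sum(e)
    return [x / total for x in e]

for _ in range(3000):
    probs = softmax(prefs)
    a = 0 if random.random() < probs[0] else 1
    r = 1.0 if a == 0 else 0.0          # action 0 is the better one
    delta = r - V                        # TD error (single terminal step)
    V += alpha_v * delta                 # critic update
    for i in range(2):                   # actor update (policy-gradient step)
        grad = (1.0 if i == a else 0.0) - probs[i]
        prefs[i] += alpha_pi * delta * grad
print(softmax(prefs)[0] > 0.9)  # the actor learns to prefer action 0
```

The division of labor mirrors the architecture on the slide: the critic evaluates ("was that better than expected?"), and the actor adjusts behavior in the direction the critic's error indicates.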

Page 13

Learning transferable representations

• How can we use reinforcement learning to automatically build higher-level representations?

• What is the right time scale for actions and computations?


Page 14

Temporal abstraction

•Consider an activity such as cooking dinner

–  High-level steps: choose a recipe, make a grocery list, get groceries, cook, ...
–  Medium-level steps: get a pot, put ingredients in the pot, stir until smooth, check the recipe, ...
–  Low-level steps: wrist and arm movements while driving the car, stirring, ...

All have to be seamlessly integrated!

Page 15

Temporal abstraction in Reinforcement Learning

• Options framework: an option is an extended action that has an initiation condition, an internal way of choosing actions, and a termination condition

• E.g., robot navigation: if there is no obstacle in front (initiation condition), go forward (the internal policy) until something is too close (termination condition)

• Learning and planning algorithms work the same way at all levels of abstraction
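The three components of an option, and its call-and-return execution, can be sketched directly (a minimal illustration; the class and the distance-based robot state are hypothetical, not from the slides):

```python
from dataclasses import dataclass
from typing import Callable

# Minimal options sketch: an option = initiation condition + internal policy
# + termination condition, executed call-and-return until it terminates.
@dataclass
class Option:
    can_start: Callable[[int], bool]    # initiation condition I(s)
    policy: Callable[[int], str]        # internal way of choosing actions
    should_stop: Callable[[int], bool]  # termination condition beta(s)

# "Go forward until something is too close": state = distance to obstacle.
go_forward = Option(
    can_start=lambda s: s > 1,          # no obstacle directly in front
    policy=lambda s: "forward",
    should_stop=lambda s: s <= 1,       # too close -> terminate
)

def run_option(opt, s, step):
    """Call-and-return execution: follow the option's policy until it stops."""
    assert opt.can_start(s)
    trace = []
    while not opt.should_stop(s):
        a = opt.policy(s)
        trace.append(a)
        s = step(s, a)
    return s, trace

s_final, trace = run_option(go_forward, 5, lambda s, a: s - 1)
print(s_final, len(trace))  # terminates at distance 1 after 4 steps
```

To a higher-level learner, the whole `run_option` call looks like a single (temporally extended) action, which is what lets the same algorithms work at all levels of abstraction.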

Page 16

Options framework


[Diagram: high-level option choices unfolding over a low-level state trajectory through time.]

Page 17

Illustration: Navigation task

Page 18

Bottleneck states (C. Diuk et al.)

Fig. 5: Graphs underlying the maps of the cities for the first version of the experiment (a) and the second one (b). Node labels identify the betweenness of each node.

[Figure 6 plots: (a) navigation efficiency (0–1.0) vs. trial (20–120); (b) bottleneck choices (0–16) per subject.]

Fig. 6: (a) Average performance ratio, over all participants, as a function of trials of experience with the "town." The value on the y-axis represents the ratio of steps taken to the minimum number of steps, taking into account the optimal bus-stop location. A ratio of 1 indicates optimal performance, i.e., choice of the shortest path from start to goal, assuming an optimal (bottleneck) choice of bus stop. The two dashed data series indicate minimum and maximum score across participants. (b) Number of times, out of 16 five-trial blocks, that the bottleneck state was chosen for the bus-stop location. Participants on the x-axis are sorted by performance. The horizontal line indicates the expected performance if participants chose bus-stop locations randomly.

Panel a shows that, over the course of the experiment, participants increasingly picked out the shortest path from start to goal. This simply provides evidence that participants learned something about the layout of the town as they went along. More important are the data in panel b, which show the number of blocks (out of a total of 16) in which each participant chose to place the bus stop at the bottleneck location. Although there was some variability across participants, the data clearly confirm a general capacity to detect and exploit the presence of a bottleneck.

The results of this experiment do not, however, allow us to make conclusions about how participants identified the bottleneck location. In particular, while we were interested in the possibility that they leveraged structural or topological knowledge, it is possible that participants instead used simple frequency information. Over the course of multiple shortest-path deliveries, the bottleneck location would be

Botvinick, Niv et al

Page 19

People discover bottleneck states


Page 20

Neural signature for subgoals

and another third included a jump of type E. Type D jumps, by increasing the distance to the subgoal, were again intended to trigger a PPE. However, in the fMRI version of the task, unlike the EEG version, the exact increase in subgoal distance varied across trials. Therefore, type D jumps were intended to induce PPEs that varied in magnitude (Figure 2). Our analyses took a model-based approach (O'Doherty et al., 2007), testing for regions that showed phasic activation correlating positively with predicted PPE size.

A whole-brain general linear model analysis, thresholded at p < 0.01 (cluster-size thresholded to correct for multiple comparisons), revealed such a correlation in the dorsal anterior cingulate cortex (ACC; Figure 4). This region has been proposed to contain the generator of the FRN (Holroyd and Coles, 2002; although see Nieuwenhuis et al., 2005 and Discussion below). In this regard the fMRI result is consistent with the result of our EEG experiment. The same parametric fMRI effect was also observed bilaterally in the anterior insula, a region often coactivated with the ACC in the setting of unanticipated negative events (Phan et al., 2004). The effect was also detected in right supramarginal gyrus, the medial part of lingual gyrus, and, with a negative coefficient, in the left inferior frontal gyrus. However, in a follow-up analysis we controlled for subgoal displacement (e.g., the distance between the original package location and point D in Figure 2), a nuisance variable moderately correlated, across trials, with the change in distance to subgoal. Within this analysis only the ACC (p < 0.01), bilateral anterior insula (p < 0.01 left, p < 0.05 right), and right lingual gyrus (p < 0.01) continued to show significant correlations with the PPE.

In a series of region-of-interest (ROI) analyses, we focused in on additional neural structures that, like the ACC, have been previously proposed to encode negative RPEs: the habenular complex (Salas et al., 2010; Ullsperger and von Cramon, 2003), nucleus accumbens (NAcc) (Seymour et al., 2007), and amygdala (Breiter et al., 2001; Yacubian et al., 2006). (These analyses were intended to bring greater statistical power to bear on these regions, in part because their small size may have undermined our ability to detect activation in them in our whole-brain analysis, where a cluster-size threshold was employed.) The habenular complex was found to display greater activity following type D than type E jumps (p < 0.05), consistent with the idea that this structure is also engaged by negative PPEs. A comparable effect was also observed in the right, though not left, amygdala (p < 0.05).

In the NAcc, where some studies have observed deactivation accompanying negative RPEs (Knutson et al., 2005), no significant PPE effect was observed. However, it should be noted that NAcc deactivation with negative RPEs has been an inconsistent finding in previous work (for example, see Cooper and Knutson, 2008; O'Doherty et al., 2006). More robust is the association between NAcc activation and positive RPEs (Hare et al., 2008; Niv, 2009; Seymour et al., 2004). To test this directly, we ran a second, smaller fMRI study designed to elicit positive PPEs, specifically looking for activation within a NAcc ROI. A total of 14 participants performed the delivery task, with jumps of type C (in Figure 2) occurring on one-third of trials and jumps of type E on another third. As described earlier, a positive PPE is predicted to occur in association with type C jumps, and in this setting significant activation (p < 0.05) was observed in the right (though not left) NAcc, scaling with predicted PPE magnitude.

Behavioral Experiment

We have characterized the results from our EEG and fMRI experiments as displaying a "signature" of HRL, in the sense that the PPE signal is predicted by HRL but not by standard RL algorithms (Figure 2). However, there is an important caveat that we now consider. In our neuroimaging experiments we assumed that reaching the goal (the house) would be associated with primary reward. (The same points hold if "primary reward" is replaced with "secondary" or "conditioned reinforcement.") We also assumed that reaching the subgoal (the package) was not associated with primary reward but only with pseudo-reward. However, what if participants did attach primary reward to the subgoal? If this were the case, it would present a difficulty for the interpretation of our neuroimaging results because it would lead standard RL to predict an RPE in association with events that change only subgoal distance (including C and D jumps in our neuroimaging task).

In view of these points, it was necessary to establish whether participants performing the delivery task did or did not attach primary reward to subgoal attainment. In order to evaluate this, we devised a modified version of the task. Here, 22 participants delivered packages as before, though without jump events. However, at the beginning of each delivery trial, two packages were presented in the display, which defined paths that could

Figure 4. Results of fMRI Experiment 1. Shown are regions displaying a positive correlation with the PPE, independent of subgoal displacement. Talairach coordinates of peak are 0, 9, and 39 for the dorsal ACC, and 45, 12, and 0 for right anterior insula. Not shown are foci in left anterior insula (−45, 9, −3) and lingual gyrus (0, −66, 0). Color indicates general linear model parameter estimates, ranging from 3.0 × 10^−4 (palest yellow) to 1.2 × 10^−3 (darkest orange).

Neuron 71, 370–379, July 28, 2011 ("Hierarchical Reinforcement Learning"), © 2011 Elsevier Inc.

Cf. Ribas-Fernandes et al., 2011

Page 21

Option-critic architecture


in fact ergodic, and a stationary distribution over state-option pairs exists.

We will now compute the gradient of this objective, assuming the intra-option policies are stochastic and differentiable in $\theta$. From (1, 2), it follows that:

$$\frac{\partial Q_\Omega(s,\omega)}{\partial\theta} = \left(\sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial\theta}\, Q_U(s,\omega,a)\right) + \sum_a \pi_{\omega,\theta}(a \mid s) \sum_{s'} \gamma P(s' \mid s,a)\, \frac{\partial U(\omega,s')}{\partial\theta}.$$

We can further expand the right-hand side using (3) and (4), which yields the following theorem:

Theorem 1 (Intra-Option Policy Gradient Theorem). Given a set of Markov options with stochastic intra-option policies differentiable in their parameters $\theta$, the gradient of the expected discounted return with respect to $\theta$ and initial condition $(s_0, \omega_0)$ is:

$$\sum_{s,\omega} \mu_\Omega(s,\omega \mid s_0,\omega_0) \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial\theta}\, Q_U(s,\omega,a),$$

where $\mu_\Omega(s,\omega \mid s_0,\omega_0)$ is a discounted weighting of state-option pairs along trajectories starting from $(s_0,\omega_0)$: $\mu_\Omega(s,\omega \mid s_0,\omega_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s, \omega_t = \omega \mid s_0, \omega_0)$.

The proof is in the appendix. This gradient describes the effect of a local change at the primitive level on the global expected discounted return. In contrast, subgoal or pseudo-reward methods assume the objective of an option is simply to optimize its own reward function, ignoring how a proposed change would propagate in the overall objective.

We now turn our attention to computing gradients for the termination functions, assumed again to be stochastic and differentiable in $\vartheta$. From (1, 2, 3), we have:

$$\frac{\partial Q_\Omega(s,\omega)}{\partial\vartheta} = \sum_a \pi_{\omega,\theta}(a \mid s) \sum_{s'} \gamma P(s' \mid s,a)\, \frac{\partial U(\omega,s')}{\partial\vartheta}.$$

Hence, the key quantity is the gradient of $U$. This is a natural consequence of the call-and-return execution, in which the "goodness" of termination functions can only be evaluated upon entering the next state. The relevant gradient can be further expanded as:

$$\frac{\partial U(\omega,s')}{\partial\vartheta} = -\frac{\partial \beta_{\omega,\vartheta}(s')}{\partial\vartheta}\, A_\Omega(s',\omega) + \sum_{\omega'} \sum_{s''} \gamma P(s'',\omega' \mid s',\omega)\, \frac{\partial U(\omega',s'')}{\partial\vartheta}. \quad (5)$$

Expanding $\frac{\partial U(\omega',s'')}{\partial\vartheta}$ recursively leads to a similar form as in Theorem 1, but where the weighting of state-option pairs is now according to a Markov chain shifted by one time step: $\mu_\Omega(s_{t+1}, \omega_t \mid s_t, \omega_{t-1})$ (details are in the appendix).

Theorem 2 (Termination Gradient Theorem). Given a set of Markov options with stochastic termination functions differentiable in their parameters $\vartheta$, the gradient of the expected discounted return objective with respect to $\vartheta$ is:

$$-\sum_{s',\omega} \mu_\Omega(s',\omega \mid s_1,\omega_0)\, \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial\vartheta}\, A_\Omega(s',\omega),$$

where $A_\Omega$ is the advantage function (Baird 1993) over options, $A_\Omega(s',\omega) = Q_\Omega(s',\omega) - V_\Omega(s')$, and $\mu_\Omega(s',\omega \mid s_1,\omega_0)$ is a discounted weighting of state-option pairs from $(s_1,\omega_0)$: $\mu_\Omega(s,\omega \mid s_1,\omega_0) = \sum_{t=0}^{\infty} \gamma^t P(s_{t+1} = s, \omega_t = \omega \mid s_1, \omega_0)$.

The advantage function often appears in policy gradient methods (Sutton et al. 2000) when forming a baseline to reduce the variance in the gradient estimates. Its presence in that context has to do mostly with algorithm design. It is interesting that in our case it follows as a direct consequence of the derivation, and gives the theorem an intuitive interpretation: when the option choice is suboptimal with respect to the expected value over all options, the advantage function is negative and it drives the gradient corrections up, which increases the odds of terminating. After termination, the agent has the opportunity to pick a better option using $\pi_\Omega$. A similar idea also underlies the interrupting execution model of options (Sutton, Precup, and Singh 1999), in which termination is forced whenever the value $Q_\Omega(s',\omega)$ of the current option $\omega$ is less than $V_\Omega(s')$. Mann, Mankowitz, and Mannor (2014) recently studied interrupting options through the lens of an interrupting Bellman operator in a value-iteration setting. The termination gradient theorem can be interpreted as providing a gradient-based interrupting Bellman operator.

Algorithms and Architecture

[Figure 1 diagram: the policy over options π_Ω selects an option ω_t; the option's intra-option policy π_ω and termination function β_ω interact with the environment through a_t, s_t, r_t; a critic computes Q_U and A_Ω and the TD error, which drive the gradients.]

Figure 1: Diagram of the option-critic architecture. The option execution model is depicted as a switch over contacts. A new option is selected according to $\pi_\Omega$ only when the current option terminates.

Based on Theorems 1 and 2, we can now design a gradient ascent procedure for learning options. Since $\mu_\Omega$, $Q_U$ and $A_\Omega$ will generally be approximate, we will use a stochastic gradient procedure. Using a two-timescale framework (Konda and Tsitsiklis 2000), we propose to learn the values at a fast timescale while updating the intra-option policies and termination functions at a slower rate.

We refer to the resulting system as an option-critic architecture, in reference to the actor-critic architectures (Sutton 1984). The intra-option policies, termination functions and
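The two gradient theorems can be turned into simple tabular update rules. Below is an illustrative sketch with made-up variable names, not the paper's full algorithm: the intra-option policy moves toward actions with high Q_U, and the termination probability rises where the option's advantage is negative.

```python
import math

# Tabular sketch of the two option-critic updates (illustrative only).
# Intra-option policy: policy-gradient step weighted by Q_U(s, omega, a).
# Termination: beta = sigmoid(eta) rises when A(s', omega) = Q - V < 0,
# i.e. when a better option is available in the next state.

def softmax(xs):
    e = [math.exp(x) for x in xs]
    total = sum(e)
    return [x / total for x in e]

def intra_option_update(theta, s, w, a, q_u, lr):
    """theta[s][w]: action preferences for option w in state s."""
    probs = softmax(theta[s][w])
    for i in range(len(probs)):
        theta[s][w][i] += lr * q_u * ((1.0 if i == a else 0.0) - probs[i])

def termination_update(eta, s_next, w, Q, V, lr):
    """eta[s][w]: termination logit; step is -lr * A(s', w) * dbeta/deta."""
    beta = 1.0 / (1.0 + math.exp(-eta[s_next][w]))
    advantage = Q[s_next][w] - V[s_next]
    eta[s_next][w] -= lr * advantage * beta * (1.0 - beta)

# One illustrative step: option 0's action 1 looked good (Q_U = 1), and
# option 0 is worse than average in the next state (A = -0.3), so its
# termination probability grows there.
theta = {0: {0: [0.0, 0.0]}}
eta = {1: {0: 0.0}}
Q, V = {1: {0: 0.2}}, {1: 0.5}
intra_option_update(theta, 0, 0, 1, q_u=1.0, lr=0.5)
termination_update(eta, 1, 0, Q, V, lr=0.5)
print(theta[0][0], eta[1][0] > 0.0)  # preferences shift toward action 1; True
```

In the full architecture these updates run on the slow timescale, while the critic's estimates of Q_U and the advantage are learned on the fast timescale.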

Page 22

Transfer learning with options


[Plot: steps per episode (0–600) vs. episodes (0–2000) comparing SARSA(0), AC-PG, OC 4 options, and OC 8 options, i.e., temporal abstractions discovered by option-critic vs. primitive actions. Four-rooms map with hallways, walls and the initial goal; the goal moves randomly after 1000 episodes.]

Page 23

Learned options are intuitive


Option 1 Option 2

Option 3 Option 4

Page 24

Deep Learning Option-critic

Gradient-based learning:
• Optimization criterion: return
• Secondary consideration: avoid switching too often
• Internal policy over options moves to take better actions more often
• Termination conditions shorten if the next option has a large advantage, lengthen otherwise

Figure 3: Termination probabilities for the option-critic agent learning with 4 options. The darkest color represents the walls in the environment, while lighter colors encode higher termination probabilities.

In the two temporally extended settings, with 4 options and 8 options, termination events are more likely to occur near the doorways (Figure 3), agreeing with the intuition that they would be good subgoals. As opposed to (Sutton, Precup, and Singh 1999), we did not encode this knowledge ourselves but simply let the agents find options that would maximize the expected discounted return.

Pinball Domain

Figure 4: Pinball: sample trajectory of the solution found after 250 episodes of training using 4 options. All options (color-coded) are used by the policy over options in successful trajectories. The initial state is in the top left corner and the goal is in the bottom right one (red circle).

In the Pinball domain (Konidaris and Barto 2009), a ball must be guided through a maze of arbitrarily shaped polygons to a designated target location. The state space is continuous over the position and velocity of the ball in the x-y plane. At every step, the agent must choose among five discrete primitive actions: move the ball faster or slower, in the vertical or horizontal direction, or take the null action. Collisions with obstacles are elastic and can be used to the advantage of the agent. In this domain, a drag coefficient of 0.995 effectively stops ball movements after a finite number of steps when the null action is chosen repeatedly. Each thrust action incurs a penalty of −5 while taking no action costs −1. The episode terminates with +10000 reward when the agent reaches the target. We interrupted any episode taking more than 10000 steps and set the discount factor to 0.99.

We used intra-option Q-learning in the critic with linear function approximation over Fourier bases (Konidaris et al. 2011) of order 3. We experimented with 2, 3 or 4 options. We used Boltzmann policies for the intra-option policies and linear-sigmoid functions for the termination functions. The learning rates were set to 0.01 for the critic and 0.001 for both the intra-option and termination gradients. We used an epsilon-greedy policy over options with ε = 0.01.
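The Fourier-basis features used by the critic can be sketched as follows (an illustrative reconstruction of the general order-n construction from Konidaris et al., 2011; the function name and exact conventions here are assumptions):

```python
import itertools, math

# Sketch of an order-n Fourier basis over a d-dimensional state in [0, 1]^d:
# one feature cos(pi * c . x) for each coefficient vector c in {0..n}^d.
def fourier_features(x, order):
    feats = []
    for c in itertools.product(range(order + 1), repeat=len(x)):
        feats.append(math.cos(math.pi * sum(ci * xi for ci, xi in zip(c, x))))
    return feats

phi = fourier_features([0.5, 0.25], order=3)
print(len(phi))  # (order + 1) ** d = 16 features for d = 2
```

A linear critic then estimates values as a dot product of learned weights with these features; for the 4-dimensional Pinball state at order 3, that would give (3 + 1)^4 = 256 features per option-action pair.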

[Plot: undiscounted return (−5000 to 8500) vs. episodes (0–250); curves for 4, 3, and 2 options.]

Figure 5: Learning curves in the Pinball domain.

In (Konidaris and Barto 2009), an option can only be used and updated after a gestation period of 10 episodes. As learning is fully integrated in option-critic, by 40 episodes a near-optimal set of options had already been learned in all settings. From a qualitative point of view, the options exhibit temporal extension and specialization (Fig. 4). We also observed that across many successful trajectories the red option would consistently be used in the vicinity of the goal.

Arcade Learning Environment

We applied the option-critic architecture in the Arcade Learning Environment (ALE) (Bellemare et al. 2013) using a deep neural network to approximate the critic and represent the intra-option policies and termination functions. We used the same configuration as (Mnih et al. 2013) for the first 3 convolutional layers of the network. We used 32 convolutional filters of size 8×8 and stride of 4 in the first layer, 64 filters of size 4×4 with a stride of 2 in the second, and 64 3×3 filters with a stride of 1 in the third layer. We then fed the output of the third layer into a dense shared layer of 512 neurons, as depicted in Figure 6. We fixed the learning rate for the intra-option policies and termination gradient to 0.00025 and used RMSProp for the critic.

[Diagram labels: π_Ω(·|s), {β_ω(s)}, {π_ω(·|s)}]

Figure 6: Deep neural network architecture. A concatenation of the last 4 images is fed through the convolutional layers, producing a dense representation shared across intra-option policies, termination functions and the policy over options.

We represented the intra-option policies as linear-softmax

Page 25

Results (Atari)

• Results match or exceed those of DQN within a single task


[Plots: average score vs. training epoch (0–200) for Option-Critic and DQN on (a) Asterix, (b) Ms. Pacman, (c) Seaquest, (d) Zaxxon.]

Page 26

Learned options are intuitive



Option 1: downward shooting sequences
Option 2: upward shooting sequences

Transition from option 1 to 2

Action trajectory over time. White: option 1; black: option 2.

Page 27

Temporal Abstraction: What is a Good Time Scale for Actions?

(a) Without a deliberation cost, options terminate instantly and are used in any scenario without specialization.

(b) Options are used for extended periods and in specific scenarios through a trajectory, when using a deliberation cost.

(c) Termination is sparse when using the deliberation cost. The agent terminates options at intersections requiring high-level decisions.

Figure 2: We show the effects of using deliberation costs on both the option terminations and policies. In figures (a) and (b), every color in the agent trajectory represents a different option being executed. This environment is the game Amidar, of the Atari 2600 suite.

of deliberation cost with previous notions of regularization from (Mann et al. 2014) and (Bacon et al. 2017).

The deliberation cost goes beyond only the idea of penalizing lengthy computation. It can also be used to incorporate other forms of bounds intrinsic to an agent in its environment. One interesting direction for future work is to also think of deliberation cost in terms of missed opportunity, opening the way for an implicit form of regularization when interacting asynchronously with an environment. Another interesting limitation inherent to reinforcement learning agents has to do with their representational capacities when estimating action values. Preliminary work seems to indicate that the error decomposition for the action values could also be expressed in the form of a deliberation cost.

References

[Altman 1999] E. Altman. Constrained Markov Decision Processes. Chapman and Hall, 1999.

[Andreas et al. 2017] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In ICML, pages 166–175, 2017.

[Bacon et al. 2017] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pages 1726–1734, 2017.

[Baird 1993] Leemon C. Baird. Advantage updating. Technical Report WL-TR-93-1146, Wright Laboratory, 1993.

[Bellemare et al. 2013] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[Botvinick et al. 2009] Matthew M. Botvinick, Yael Niv, and Andrew C. Barto. Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition, 113(3):262–280, 2009.

[Branavan et al. 2012] S. R. K. Branavan, Nate Kushman, Tao Lei, and Regina Barzilay. Learning high-level planning from text. In ACL, pages 126–135, 2012.

[Daniel et al. 2016] C. Daniel, H. van Hoof, J. Peters, and G. Neumann. Probabilistic inference for determining options in reinforcement learning. Machine Learning, Special Issue, 104(2):337–357, 2016.

[Dayan and Hinton 1992] Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In NIPS, pages 271–278, 1992.

[Dietterich 1998] Thomas G. Dietterich. The MAXQ method for hierarchical reinforcement learning. In ICML, pages 118–126, 1998.

[Drescher 1991] Gary L. Drescher. Made-up Minds: A Constructivist Approach to Artificial Intelligence. MIT Press, Cambridge, MA, USA, 1991.

[Fikes et al. 1972] Richard Fikes, Peter E. Hart, and Nils J. Nilsson. Learning and executing generalized robot plans. Artificial Intelligence, 3(1-3):251–288, 1972.

[Gigerenzer and Selten 2001] Gerd Gigerenzer and R. Selten. Bounded Rationality: The Adaptive Toolbox. Cambridge: The MIT Press, 2001.

[Guo et al. 2014] Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L. Lewis, and Xiaoshi Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems 27, pages 3338–3346, 2014.

[Howard 1963] Ronald A. Howard. Semi-Markovian decision processes.

Page 28

Summary

• Reinforcement learning is used both to build sophisticated controllers for automated tasks and to model decision making in the brain (dopamine neurons)

• To solve large problems, we need temporal abstraction

• Abstractions can be learned automatically from data