
  • Deep Recurrent Q-Learning for Partially Observable MDPs

    Matthew Hausknecht and Peter Stone

    University of Texas at Austin

    November 13, 2015

    1

  • Motivation

    Intelligent decision making is the heart of AI

    Desire agents capable of learning to act intelligently in diverse environments

    Reinforcement Learning provides a general learning framework

    RL + deep neural networks yields robust controllers that learn from pixels (DQN)

    DQN lacks mechanisms for handling partial observability

    Extend DQN to handle Partially Observable Markov Decision Processes (POMDPs)

    2

  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

    3

  • Markov Decision Process (MDP)

    [Figure: agent-environment loop: state st, action at, reward rt]

    At each timestep the agent performs action at and receives reward rt and next state st+1 from the environment

    The Markov property ensures that st+1 depends only on st and at

    Learning an optimal policy requires no memory of past states (see the Bellman equation below)

    4
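
    For reference (an addition, not on the original slide): under the Markov property a memoryless, state-conditioned policy suffices, because the optimal action-value function satisfies the Bellman optimality equation

    Q^*(s, a) = \mathbb{E}\left[ r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s,\ a_t = a \right]

    and acting greedily with respect to Q^* is optimal.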

  • Partially Observable Markov Decision Process(POMDP)

    [Figure: agent-environment loop: observation ot, action at, reward rt]

    The true state of the environment is hidden; observations ot provide only partial information

    Memory of past observations may help infer the true system state and improve the policy (see the belief update below)

    5
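
    For reference (an addition, not on the original slide): one formal way to see why memory helps is the belief state, a distribution over hidden states updated after each action and observation. Assuming a transition model T(s' | s, a) and an observation model O(o | s', a),

    b_{t+1}(s') \propto O(o_{t+1} \mid s', a_t) \sum_{s} T(s' \mid s, a_t)\, b_t(s)

    and the belief is a sufficient statistic of the observation history for acting optimally.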

  • Atari Domain

    [Figure: Atari screen with observation, action, and score]

    160 x 210 screen downsampled to an 84 x 84 grayscale state space (see the preprocessing sketch below)

    18 discrete actions

    Rewards clipped to {-1, 0, 1}

    Source: www.arcadelearningenvironment.org

    6

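    A minimal sketch (not the authors' code) of the preprocessing described above: downsample the raw 210 x 160 RGB screen to 84 x 84 grayscale and clip rewards to {-1, 0, 1}. Function names are illustrative.

    import cv2
    import numpy as np

    def preprocess_frame(frame_rgb):
        """Convert a raw 210x160x3 Atari frame to an 84x84 grayscale observation."""
        gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
        return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA).astype(np.uint8)

    def clip_reward(raw_reward):
        """Clip the change in game score to {-1, 0, +1}."""
        return float(np.sign(raw_reward))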

  • Atari Domain: MDP or POMDP?

    Depends on the state representation!

    Single Frame: POMDP
    Four Frames: MDP
    Console RAM: MDP

    7

  • Deep Q-Network (DQN)

    [Figure: network architecture: four 84 x 84 input frames, Conv1, Conv2, Conv3, IP1 (512 units), 18 Q-values; see the code sketch below]

    Model-free Reinforcement Learning method using a deep neural network as Q-value function approximator (Mnih et al., 2015)

    Takes the last four game screens as input: enough to make most Atari games Markov

    How well does DQN perform in partially observed domains?

    8
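
    A minimal PyTorch sketch (not from the slides) of the architecture in the figure, assuming the layer sizes of Mnih et al. (2015); names are illustrative.

    import torch.nn as nn

    class DQN(nn.Module):
        """Convolutional Q-network: four stacked 84x84 frames in, one Q-value per action out."""
        def __init__(self, num_actions=18):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # Conv1
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # Conv2
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # Conv3
                nn.Flatten(),
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                  # IP1
                nn.Linear(512, num_actions),                            # Q-values
            )

        def forward(self, frames):
            # frames: (batch, 4, 84, 84) float tensor of the last four screens
            return self.net(frames)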

  • Flickering Atari

    Induce partial observability by stochastically obscuring the game screen (see the wrapper sketch below):

    o_t = \begin{cases} s_t & \text{with probability } \tfrac{1}{2} \\ \langle 0, \ldots, 0 \rangle & \text{otherwise} \end{cases}

    Game state must now be inferred from past observations

    9
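
    A minimal sketch (not the authors' implementation) of the flickering modification, assuming a gym-style environment whose step() returns (frame, reward, done, info); names are illustrative.

    import numpy as np

    class FlickerWrapper:
        """With probability 1/2, replace the observed frame with an all-zero (blank) screen."""
        def __init__(self, env, obscure_prob=0.5, seed=None):
            self.env = env
            self.obscure_prob = obscure_prob
            self.rng = np.random.default_rng(seed)

        def step(self, action):
            frame, reward, done, info = self.env.step(action)
            if self.rng.random() < self.obscure_prob:
                frame = np.zeros_like(frame)  # o_t = <0, ..., 0>
            return frame, reward, done, info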

  • DQN Pong

    [Video: True Game Screen vs. Perceived Game Screen]

    10

  • DQN Flickering Pong

    [Video: True Game Screen vs. Perceived Game Screen]

    11

  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

    12

  • Deep Recurrent Q-Network

    [Figure: network architecture: one 84 x 84 frame per timestep, convolutional layers, LSTM (512 units), 18 Q-values, unrolled over timesteps t-1 and t; LSTM: Hochreiter and Schmidhuber (1997)]

    Identical to DQN except:

    Replaces DQN's IP1 with a recurrent LSTM layer of the same dimension

    Each timestep takes a single frame as input

    LSTM provides a selective memory of past game states

    Trained end-to-end using BPTT: unrolled for the last 10 timesteps (see the sketch below)

    13
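
    A minimal PyTorch sketch (not the authors' code) of the modification: the same convolutional stack as the DQN sketch above, a single frame per timestep, and an LSTM in place of IP1. Layer sizes are assumed, not taken from the slides.

    import torch.nn as nn

    class DRQN(nn.Module):
        """DQN convolutional stack with the fully connected IP1 layer replaced by an LSTM."""
        def __init__(self, num_actions=18, hidden_size=512):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
                nn.Flatten(),
            )
            self.lstm = nn.LSTM(64 * 7 * 7, hidden_size, batch_first=True)
            self.head = nn.Linear(hidden_size, num_actions)

        def forward(self, frames, hidden=None):
            # frames: (batch, time, 1, 84, 84); e.g. time = 10 for BPTT over the last 10 steps
            b, t = frames.shape[:2]
            feats = self.conv(frames.reshape(b * t, 1, 84, 84)).reshape(b, t, -1)
            out, hidden = self.lstm(feats, hidden)   # selective memory of past frames
            return self.head(out), hidden            # Q-values at every timestep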

  • DRQN Maximal Activations

    Unit detects the agent missing the ball

    Unit detects ball reflection on paddle

    Unit detects ball reflection on wall

    14

  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

    15

  • DRQN Flickering Pong

    [Video: True Game Screen vs. Perceived Game Screen]

    16

  • Flickering Pong

    17

  • Pong Generalization: POMDP → MDP

    How does DRQN generalize when trained on Flickering Pong and evaluated on standard Pong?

    [Figure: Pong score vs. observation probability (0.0 to 1.0) for DRQN 1-frame, DQN 10-frame, and DQN 4-frame]

    18

  • Performance on Flickering Atari Games

    Game           10-frame DRQN (std)   10-frame DQN (std)
    Pong              12.1  (2.2)           -9.9  (3.3)
    Beam Rider         618  (115)         1685.6  (875)
    Asteroids         1032  (410)           1010  (535)
    Bowling           65.5  (13)            57.3  (8)
    Centipede        4319.2 (4378)        5268.1  (2052)
    Chopper Cmd       1330  (294)           1450  (787.8)
    Double Dunk        -14  (2.5)          -16.2  (2.6)
    Frostbite          414  (494)            436  (462.5)
    Ice Hockey        -5.4  (2.7)           -4.2  (1.5)
    Ms. Pacman        1739  (942)           1824  (490)

    19

  • Performance on Standard Atari Games

    Game           10-frame DRQN (std)   10-frame DQN (std)
    Double Dunk         -2  (7.8)            -10  (3.5)
    Frostbite         2875  (535)            519  (363)
    Beam Rider        3269  (1167)          6923  (1027)
    Asteroids         1020  (312)           1070  (345)
    Bowling             62  (5.9)             72  (11)
    Centipede         3534  (1601)          3653  (1903)
    Chopper Cmd       2070  (875)           1460  (976)
    Ice Hockey        -4.4  (1.6)           -3.5  (3.5)
    Ms. Pacman        2048  (653)           2363  (735)

    20

  • Performance on Standard Atari Games

    21

  • DRQN Frostbite

    [Video: True Game Screen vs. Perceived Game Screen]

    22

  • Generalization: MDP → POMDP

    How does DRQN generalize when trained on standard Atari and evaluated on flickering Atari?

    [Figure: percentage of original score retained vs. observation probability (0.1 to 0.9) for DRQN and DQN]

    23

  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

    24

  • Related Work

    Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Comput., 9(8):1735-1780.

    Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529-533.

    Narasimhan, K., Kulkarni, T., and Barzilay, R. (2015). Language understanding for text-based games using deep reinforcement learning. CoRR, abs/1506.08941.

    Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2007). Solving deep memory POMDPs with recurrent policy gradients.

    25

  • Thanks!

    [Figure: DRQN architecture: one 84 x 84 frame per timestep, convolutional layers, LSTM]

    LSTM can help deal with partial observability

    Largest gains in generalization between MDP and POMDP

    Future work: understanding why DRQN does better/worse on certain games

    Source: https://github.com/mhauskn/dqn/tree/recurrent

    Matthew Hausknecht and Peter Stone

    26

  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

    27

  • Computational Efficiency

                 Backwards (ms)          Forwards (ms)
    Frames       1      4      10        1     4     10
    Baseline     8.82   13.6   26.7      2.0   4.0   9.0
    Unroll 1     18.2   22.3   33.7      2.4   4.4   9.4
    Unroll 10    77.3   111.3  180.5     2.5   4.4   8.3
    Unroll 30    204.5  263.4  491.1     2.5   3.8   9.4

    Table: Average milliseconds per backwards/forwards pass. Frames refers to the number of channels in the input image. Baseline is a non-recurrent network (e.g. DQN). Unroll refers to an LSTM network backpropagated through time for 1/10/30 steps.

    28
