Deep Recurrent Q-Learning for Partially Observable MDPs
Matthew Hausknecht and Peter Stone
University of Texas at Austin
November 13, 2015
Motivation
Intelligent decision making is the heart of AI
Desire agents capable of learning to act intelligently in diverse environments
Reinforcement Learning provides a general learning framework
RL + deep neural networks yields robust controllers that learn from pixels (DQN)
DQN lacks mechanisms for handling partial observability
Extend DQN to handle Partially Observable Markov Decision Processes (POMDPs)
Outline
Motivation
Background: MDP, POMDP, Atari Domain, Deep Q-Network
Deep Recurrent Q-Network
Results
Related Work
Appendix
Markov Decision Process (MDP)
[Diagram: agent-environment loop exchanging state s_t, action a_t, reward r_t]
At each timestep the agent performs action a_t and receives reward r_t and state s_{t+1} from the environment
The Markov property ensures that s_{t+1} depends only on s_t and a_t
Learning an optimal policy requires no memory of past states
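Because the next state depends only on the current state and action, an optimal action-value function can be learned from one-step transitions alone. For reference (standard Q-learning, not stated on the slide; alpha and gamma are the usual learning rate and discount factor):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$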
Partially Observable Markov Decision Process (POMDP)
[Diagram: agent-environment loop exchanging observation o_t, action a_t, reward r_t]
True state of the environment is hidden. Observations o_t provide only partial information.
Memory of past observations may help understand the true system state and improve the policy
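Under partial observability, the value DQN estimates from an observation need not match the true state-action value; in general

$$Q(o_t, a_t \mid \theta) \neq Q(s_t, a_t \mid \theta)$$

so an agent acting on a single observation, without memory, can be arbitrarily far from optimal.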
Atari Domain
[Diagram: emulator provides observation and score; agent sends action]
160 × 210 state space, downsampled to 84 × 84 grayscale
18 discrete actions
Rewards clipped to {−1, 0, +1}
Source: www.arcadelearningenvironment.org
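As a concrete illustration of that preprocessing, here is a minimal sketch using OpenCV and NumPy; the helper names are hypothetical and not taken from the authors' code:

```python
import numpy as np
import cv2

def preprocess(screen):
    """Convert a raw 210x160 RGB Atari screen to 84x84 grayscale."""
    gray = cv2.cvtColor(screen, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

def clip_reward(r):
    """Clip raw game rewards to {-1, 0, +1}."""
    return int(np.sign(r))
```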
Atari Domain: MDP or POMDP?
Depends on the state representation!
Single frame → POMDP
Four frames → MDP
Console RAM → MDP
Deep Q-Network (DQN)
[Architecture: 84 × 84 × 4 input → Conv1 → Conv2 → Conv3 → IP1 (512) → 18 Q-Values]
Model-free Reinforcement Learning method using a deep neural network as Q-Value function approximator (Mnih et al., 2015)
Takes the last four game screens as input: enough to make most Atari games Markov
How well does DQN perform in partially observed domains?
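The layer sizes above follow the architecture of Mnih et al. (2015). A minimal sketch in PyTorch, offered as a reconstruction (the authors' implementation is in Caffe, linked at the end):

```python
import torch.nn as nn

# 84x84x4 input (stack of the last four grayscale frames) -> three conv
# layers -> 512-unit fully connected layer (IP1) -> one Q-value per action.
dqn = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # Conv1
    nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # Conv2
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # Conv3
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                  # IP1
    nn.Linear(512, 18),                                     # Q-values, 18 actions
)
```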
Flickering Atari
[Diagram: agent-environment loop exchanging observation o_t, action a_t, reward r_t]
Induce partial observability by stochastically obscuring the game screen:

$$o_t = \begin{cases} s_t & \text{with probability } \frac{1}{2} \\ \langle 0, \dots, 0 \rangle & \text{otherwise} \end{cases}$$

Game state must now be inferred from past observations
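In code, the obscuring step is a one-line wrapper around the emulator's screen; a hypothetical helper, not taken from the authors' implementation:

```python
import numpy as np

def flicker(frame, p=0.5, rng=np.random):
    """Return the true screen with probability p; otherwise a fully
    obscured (all-zero) frame of the same shape."""
    return frame if rng.random() < p else np.zeros_like(frame)
```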
DQN Pong
[Figure: true game screen vs. perceived game screen]
DQN Flickering Pong
[Figure: true game screen vs. perceived game screen]
Deep Recurrent Q-Network
[Architecture: 84 × 84 × 1 input per timestep → Conv1 → Conv2 → Conv3 → LSTM (512) → 18 Q-Values, unrolled over timesteps t−1, t, . . . ]
Long Short-Term Memory (Hochreiter and Schmidhuber, 1997)
Identical to DQN except:
Replaces DQN's IP1 with a recurrent LSTM layer of the same dimension
Each timestep takes a single frame as input
LSTM provides a selective memory of past game states
Trained end-to-end using BPTT: unrolled for the last 10 timesteps
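A minimal PyTorch sketch of this change, again a reconstruction under the assumptions above rather than the authors' Caffe code:

```python
import torch.nn as nn

class DRQN(nn.Module):
    """DQN's convolutional stack with the fully connected IP1 layer
    replaced by an LSTM of the same dimension (512)."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, 512, batch_first=True)
        self.q = nn.Linear(512, n_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84); one frame per timestep.
        # During training the LSTM is unrolled over the last 10 steps (BPTT).
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, 1, 84, 84)).view(b, t, -1)
        out, hidden = self.lstm(feats, hidden)  # hidden state carries memory
        return self.q(out), hidden
```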
DRQN Maximal Activations
[Figure: game frames that maximally activate three LSTM units]
Unit detects the agent missing the ball
Unit detects ball reflection on the paddle
Unit detects ball reflection on the wall
DRQN Flickering Pong
[Figure: true game screen vs. perceived game screen]
Flickering Pong
Pong Generalization: POMDP → MDP
How does DRQN generalize when trained on Flickering Pong and evaluated on standard Pong?
[Plot: score vs. observation probability (0.0–1.0) for DRQN 1-frame, DQN 10-frame, and DQN 4-frame]
Performance on Flickering Atari Games

Game          10-frame DRQN (±std)   10-frame DQN (±std)
Pong          12.1 (2.2)             -9.9 (3.3)
Beam Rider    618 (115)              1685.6 (875)
Asteroids     1032 (410)             1010 (535)
Bowling       65.5 (13)              57.3 (8)
Centipede     4319.2 (4378)          5268.1 (2052)
Chopper Cmd   1330 (294)             1450 (787.8)
Double Dunk   -14 (2.5)              -16.2 (2.6)
Frostbite     414 (494)              436 (462.5)
Ice Hockey    -5.4 (2.7)             -4.2 (1.5)
Ms. Pacman    1739 (942)             1824 (490)
Performance on Standard Atari Games

Game          10-frame DRQN (±std)   10-frame DQN (±std)
Double Dunk   -2 (7.8)               -10 (3.5)
Frostbite     2875 (535)             519 (363)
Beam Rider    3269 (1167)            6923 (1027)
Asteroids     1020 (312)             1070 (345)
Bowling       62 (5.9)               72 (11)
Centipede     3534 (1601)            3653 (1903)
Chopper Cmd   2070 (875)             1460 (976)
Ice Hockey    -4.4 (1.6)             -3.5 (3.5)
Ms. Pacman    2048 (653)             2363 (735)
DRQN Frostbite
[Figure: true game screen vs. perceived game screen]
Generalization: MDP → POMDP
How does DRQN generalize when trained on standard Atari and evaluated on flickering Atari?
[Plot: percentage of original score retained vs. observation probability (0.1–0.9) for DRQN and DQN]
Related Work
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
Narasimhan, K., Kulkarni, T., and Barzilay, R. (2015). Language understanding for text-based games using deep reinforcement learning. CoRR, abs/1506.08941.
Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2007). Solving deep memory POMDPs with recurrent policy gradients.
Thanks!
[Architecture diagram: per-frame conv stack feeding an LSTM, unrolled over timesteps t−1, t]
LSTM can help deal with partial observability
Largest gains in generalization between MDP → POMDP
Future work: understanding why DRQN does better/worse on certain games
Source: https://github.com/mhauskn/dqn/tree/recurrent
Matthew Hausknecht and Peter Stone
Computational Efficiency

             Backwards (ms)           Forwards (ms)
Frames       1      4      10         1     4     10
Baseline     8.82   13.6   26.7       2.0   4.0   9.0
Unroll 1     18.2   22.3   33.7       2.4   4.4   9.4
Unroll 10    77.3   111.3  180.5      2.5   4.4   8.3
Unroll 30    204.5  263.4  491.1      2.5   3.8   9.4

Table: Average milliseconds per backwards/forwards pass. "Frames" refers to the number of channels in the input image. Baseline is a non-recurrent network (e.g. DQN). "Unroll" refers to an LSTM network backpropagated through time for 1/10/30 steps.