Paper introduction "Playing Atari with deep reinforcement learning"

Paper introduction: “Playing Atari with Deep Reinforcement Learning”

Tuesday, June 3, 2014, 塚原裕史

https://sites.google.com/site/deeplearningworkshopnips2013/accepted-papers

Deep Learning Workshop NIPS 2013

https://sites.google.com/site/deeplearningworkshopnips2013/

Summary

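• Contribution of this paper
– Proposes a new reinforcement learning method that combines deep learning, which has been advancing rapidly, with a Q-learning-style approach

• Its benefit
– The deep network acquires features and policies automatically, without any of them being specified by hand (model-free)

• Its effect
– Applied to video games, it outperformed prior methods (and, surprisingly, in some cases even surpassed human players)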

Atari 2600

http://nonciclopedia.wikia.com/wiki/Atari_2600

• Atari 2600 Emulator

Stella http://stella.sourceforge.net/docs/index.html#Games

Supervised Learning vs Reinforcement Learning

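Supervised Learning (hunting civilization)

• Hunting skills are handed down, and the result of an action is obtained immediately

Reinforcement Learning (farming civilization)

• You pour in care, and only after many twists and turns is the harvest obtained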

？

• Supervised learning → picking someone up

• Reinforcement learning → a romantic relationship

To put it in terms of dating...

Deep Learning and Reinforcement Learning

Deep Learning and RL

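• Motivation
– We want to benefit from the recent progress in deep learning

• Challenges
– Deep learning methods cannot be applied as they are

• Ground-truth targets cannot be constructed (delayed reward)
• Training samples are highly correlated with each other
• The distribution that generates the data changes during learning
• The data is sparse (similar experiences are not repeated many times)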

– Problem with model-free RL using Q-learning
• Nonlinear value function approximation combined with off-policy learning can cause divergence.

• Recent developments
– Gradient temporal-difference methods were proven to partially resolve the divergence problem (2009)

– The experience replay technique (1993) addresses the problem of sparse training data

Deep Reinforcement Learning

• TD-Gammon (G. Tesauro, 1995)

History

http://www.bkgm.com/articles/tesauro/tdl.html

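• Observation space
– The image of the Atari game screen, $x_t$

• Actions
– $a_t \in \mathcal{A} = \{1, 2, \ldots, K\}$

• State space
– $s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$

• Reward
– $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$

Model used in this paper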
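The discounted return $R_t$ defined on this slide can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `all_returns` and the example reward sequence are hypothetical and do not come from the paper.

```python
def all_returns(rewards, gamma):
    """Compute R_t = sum_{t'=t}^{T} gamma**(t'-t) * r_{t'} for every t.

    `rewards` is the list r_1 ... r_T of one finished episode
    (hypothetical example data, indexed from 0 here).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards: R_t = r_t + gamma * R_{t+1}, with R_{T+1} = 0.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    print(all_returns([1.0, 0.0, 2.0], gamma=0.5))
```

The backward recursion avoids recomputing the sum for each $t$, which is why it is the usual way to evaluate returns for a whole episode.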

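• If the entire past history (of infinite length) is regarded as equivalent to a belief over states, the problem would be a POMDP.

• However, the emulator can be assumed to always finish a game within a finite number of operations, so the history has finite length, and by considering the transitions between all such states the problem can be regarded as an MDP.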

POMDP or MDP？

[Figure labels: observation, belief, situation (dangerous / safe)]

• Optimal Value Function

$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right]$

• Bellman equation

$Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$

Optimal Value Function

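• Solving the Bellman equation iteratively
– Converges to the optimal value function as $i$ goes to infinity

$Q_{i+1}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q_i(s', a') \mid s, a \right]$

• Problem
– The equation above has to be solved separately for every $(s, a)$

Q-Learning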
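The iterative update $Q_{i+1}$ can be tried on a tiny tabular problem to see both the convergence and the cost of visiting every $(s, a)$. The sketch below assumes a known transition table, which the Atari setting does not provide; `TOY_MDP`, its states, and the function name are hypothetical and exist only for illustration.

```python
def q_value_iteration(transitions, n_states, n_actions, gamma, n_iters):
    """Apply Q_{i+1}(s,a) = E_{s'}[r + gamma * max_a' Q_i(s',a') | s,a].

    `transitions[(s, a)]` is a list of (prob, next_state, reward, done)
    tuples, i.e. an explicit model (an assumption made for this sketch).
    """
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        new_q = [[0.0] * n_actions for _ in range(n_states)]
        # The update must sweep over every (s, a) pair separately.
        for s in range(n_states):
            for a in range(n_actions):
                expected = 0.0
                for prob, s_next, reward, done in transitions[(s, a)]:
                    bootstrap = 0.0 if done else gamma * max(q[s_next])
                    expected += prob * (reward + bootstrap)
                new_q[s][a] = expected
        q = new_q
    return q


# Hypothetical 3-state chain: action 0 = stay, action 1 = move right.
# Moving right from state 1 ends the episode with reward 1.
TOY_MDP = {
    (0, 0): [(1.0, 0, 0.0, False)],
    (0, 1): [(1.0, 1, 0.0, False)],
    (1, 0): [(1.0, 1, 0.0, False)],
    (1, 1): [(1.0, 2, 1.0, True)],
    (2, 0): [(1.0, 2, 0.0, True)],
    (2, 1): [(1.0, 2, 0.0, True)],
}

if __name__ == "__main__":
    for row in q_value_iteration(TOY_MDP, 3, 2, gamma=0.9, n_iters=50):
        print(row)
```

With three states the double loop is trivial, but with raw game screens as states the table cannot even be enumerated, which motivates the function approximation on the next slide.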

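• Parameterizing the value function
– Approximate the value function with a function approximator so that it generalizes over all $(s, a)$

$Q(s, a; \theta) \approx Q^*(s, a)$

• Q-Network
– Use a deep network as the function approximator
– Since the state (the input) is an image here, a CNN is used

• Features are learned automatically

Approximate Value Function by Q-Network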

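• Deep Q-Learning loss function

$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right]$

$y_i = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \right]$

– Off-policy sampling: behavior distribution $\rho(s, a)$
• ε-greedy exploration (a combination of the greedy on-policy action $a = \arg\max_{a} Q(s, a; \theta)$ and random sampling)

• Remarks
– A distinctive point is that the target $y_i$, although it plays the role of training data, itself depends on the parameters

Deep Q-Learning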
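The ε-greedy behavior policy and the target $y_i$ can be written down compactly. The following is a minimal sketch over plain lists of Q-values, not the paper's implementation; the function names and the `done` flag (which drops the bootstrap term at the end of an episode) are assumptions added for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise argmax_a Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, gamma, done):
    """Sample-based target y = r + gamma * max_a' Q(s', a'; theta_{i-1}).

    `next_q_values` are assumed to come from the previous parameters;
    for a terminal transition only the reward is used.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(target, q_sa):
    """Per-sample loss (y - Q(s, a; theta_i))**2."""
    return (target - q_sa) ** 2


if __name__ == "__main__":
    action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.1)
    y = td_target(1.0, [0.5, 2.0], gamma=0.5, done=False)
    print(action, y, squared_td_error(y, 1.5))
```

Because the data is collected with the ε-greedy policy while the target uses the greedy maximum, the learning is off-policy, as the slide notes.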

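• Gradient of Loss Function

$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s'}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right]$

Minibatch Update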
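To make the minibatch step concrete, the sketch below replaces the expectation with an average over a small batch and swaps the CNN for a linear model, $Q(s, a; \theta) = \theta_a \cdot \phi(s)$, whose gradient with respect to $\theta_a$ is simply the feature vector $\phi(s)$. The linear model, the function names, and the `done` flag are assumptions made to keep the example self-contained; the paper uses a convolutional network.

```python
def q_values(theta, features):
    """Linear Q-function: one weight row per action, Q(s,a) = theta[a] . features."""
    return [sum(w * x for w, x in zip(row, features)) for row in theta]


def minibatch_update(theta, batch, gamma, lr):
    """One stochastic gradient step on the squared TD error.

    `batch` holds (features, action, reward, next_features, done) tuples.
    Targets and predictions are both evaluated with the parameters from
    before the step; a new parameter list is returned.
    """
    grad = [[0.0] * len(row) for row in theta]
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * max(q_values(theta, s_next))
        td_error = target - q_values(theta, s)[a]
        # For a linear model, dQ(s,a)/dtheta[a] is the feature vector s.
        for k, x in enumerate(s):
            grad[a][k] += td_error * x
    scale = lr / len(batch)
    # Adding td_error * gradient moves Q(s, a) toward the target.
    return [[w + scale * g for w, g in zip(row, g_row)]
            for row, g_row in zip(theta, grad)]


if __name__ == "__main__":
    theta = [[0.0, 0.0], [0.0, 0.0]]
    batch = [([1.0, 0.0], 1, 1.0, [0.0, 1.0], True)]
    print(minibatch_update(theta, batch, gamma=0.9, lr=0.5))
```

Only the row of the action that was actually taken receives a gradient, which mirrors the fact that each sample carries information about a single $(s, a)$ pair.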

• Store past experiences as training data and reuse them many times
– Local episode (a single transition): $e_t = (s_t, a_t, r_t, s_{t+1})$

– Replay memory: $\mathcal{D} = e_1, e_2, \ldots, e_N$

Experience Replay
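A replay memory of the last $N$ transitions can be sketched with a bounded deque. The class and method names below are hypothetical, and the extra `done` field is an assumption added so that terminal transitions can be recognized later; the slide's $e_t$ contains only $(s_t, a_t, r_t, s_{t+1})$.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store D = e_1 ... e_N; the oldest entries are dropped."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done=False):
        """Store one transition e_t = (s_t, a_t, r_t, s_{t+1}) plus a done flag."""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a uniform random minibatch without replacement."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


if __name__ == "__main__":
    memory = ReplayMemory(capacity=3, seed=0)
    for t in range(5):
        memory.push(state=t, action=0, reward=0.0, next_state=t + 1)
    print(len(memory), memory.sample(2))
```

Sampling uniformly from the memory, rather than using consecutive transitions, lets each experience be reused many times and breaks up the correlation between successive samples.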

Algorithm

Note: the images are coarse-grained (downsampled) beforehand to lighten the processing

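• Cropping and coarse-graining the images
– Lightens the computation
– Allows an existing program to be used as-is

• Approximating the state by a fixed-length history
– Makes the size of the input data uniform (the past 4 frames)
– Has the effect of reducing the correlation between data

Practical tricks used in training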
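The two tricks on this slide, coarse-graining each frame and keeping a fixed-length history of the last 4 frames, can be sketched as follows. Block averaging is an assumption chosen for simplicity and is not claimed to be the paper's exact preprocessing; padding the history with copies of the first frame at reset is likewise an assumption, and all names are hypothetical.

```python
from collections import deque


def downsample(frame, factor):
    """Coarse-grain a 2-D grayscale frame by averaging factor x factor blocks."""
    height = len(frame) // factor
    width = len(frame[0]) // factor
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            block = [frame[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


class FrameStack:
    """Keep the most recent n_frames frames as a fixed-size network input."""

    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def reset(self, frame):
        # Assumption: fill the history with the first frame of the episode.
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return list(self.frames)

    def step(self, frame):
        # Appending to a full deque silently drops the oldest frame.
        self.frames.append(frame)
        return list(self.frames)


if __name__ == "__main__":
    small = downsample([[0, 2, 4, 4], [2, 4, 4, 4]], factor=2)
    stack = FrameStack(n_frames=4)
    stack.reset(small)
    print(len(stack.step(small)), small)
```

Whatever the episode length, the stacked input always has the same shape, which is what allows a fixed-size network to consume it.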

• Training and Stability

Experiments

• Frames and Predicted Value Functions

Experiments

• Performance

Experiments

• Introduced a new deep learning model for reinforcement learning
– Demonstrated its ability to master difficult policies for Atari 2600 computer games

• Also presented a variant of online Q-learning that combines stochastic minibatch updates with experience replay memory
– Eases the training of deep networks for RL

Conclusion

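• It beats all prior methods, but the games in which it beats humans all seem to be simple ones.
– What would happen if it were applied to something like Go? (I would like to try it)

• How should it be applied to problems where the reward is not given explicitly by the environment?
– Learn the reward itself with deep learning?

• Could it also be used as an approximate solution method for POMDPs?
– By using the past history as a substitute for the belief
– Does something equivalent to a belief distribution form in the hidden layers?

Impressions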

