Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016)
Yoonho Lee
Department of Computer Science and Engineering, Pohang University of Science and Technology
October 11, 2016



Outline

Reinforcement Learning
- Definition of RL
- Mathematical formulations

Algorithms
- Neural Fitted Q Iteration
- Deep Q Network
- Double Deep Q Network
- Prioritized Experience Replay
- Dueling Network


Definition of RL: general setting of RL


Definition of RL: Atari setting


Mathematical formulations: objective of RL

Definition: The return $G_t$ is the cumulative discounted reward from time $t$:

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$$

Definition: A policy $\pi$ is a function that selects actions given states:

$$\pi(s) = a$$

- The goal of RL is to find $\pi$ that maximizes $G_0$.


Mathematical formulations: Q-Value

$$G_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i+1}$$

Definition: The action-value (Q-value) function $Q^\pi(s, a)$ is the expectation of $G_t$ when taking action $a$ and then following policy $\pi$ afterwards:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s,\ A_t = a,\ A_{t+i} = \pi(S_{t+i})\ \forall i \in \mathbb{N} \right]$$

- "How good is action $a$ in state $s$?"


Mathematical formulations: Optimal Q-value

Definition: The optimal Q-value function $Q^*(s, a)$ is the maximum Q-value over all policies:

$$Q^*(s, a) = \max_\pi Q^\pi(s, a)$$

Theorem: There exists a policy $\pi^*$ such that $Q^{\pi^*}(s, a) = Q^*(s, a)$ for all $s, a$.

- Thus, it suffices to find $Q^*$.


Mathematical formulations: Q-Learning

Bellman Optimality Equation: $Q^*(s, a)$ satisfies the following equation:

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$$

Q-Learning: Let $a$ be ε-greedy w.r.t. $Q$, and $a'$ be optimal w.r.t. $Q$. Then $Q$ converges to $Q^*$ if we iteratively apply the following update:

$$Q(s, a) \leftarrow \alpha \left( R(s, a) + \gamma Q(s', a') \right) + (1 - \alpha)\, Q(s, a)$$
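As a concrete illustration of this update rule, here is a minimal tabular Q-learning sketch; the `env.reset`/`env.step` interface and all hyperparameter values are assumptions, not part of the slides:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # a is epsilon-greedy w.r.t. the current Q
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # a' is the greedy (optimal w.r.t. Q) action in the next state
            target = r + gamma * np.max(Q[s_next])
            Q[s, a] = alpha * target + (1 - alpha) * Q[s, a]
            s = s_next
    return Q
```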


Mathematical formulations: other approaches to RL

Value-Based RL
- Estimate $Q^*(s, a)$
- Deep Q Network

Policy-Based RL
- Search directly for the optimal policy $\pi^*$
- DDPG, TRPO, ...

Model-Based RL
- Use the (learned or given) transition model of the environment
- Tree Search, DYNA, ...


Neural Fitted Q Iteration

- Supervised learning on $(s, a, r, s')$ tuples (a sketch follows below)
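A minimal sketch of that supervised-learning view, assuming transitions are stored as (s, a, r, s', done) tuples and using a scikit-learn MLP as the regressor; the helper names and layer sizes are illustrative, not the slide's implementation:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fitted_q_iteration(D, n_actions, gamma=0.99, iterations=20):
    """Repeatedly regress Q(s, a) onto Bellman targets built from stored data."""
    def phi(s, a):
        # Regressor input: state features concatenated with a one-hot action.
        return np.concatenate([s, np.eye(n_actions)[a]])

    X = np.array([phi(s, a) for (s, a, r, s2, done) in D])
    q = None
    for _ in range(iterations):
        targets = []
        for (s, a, r, s2, done) in D:
            if q is None or done:
                y = r  # no bootstrap on the first pass or at terminal states
            else:
                y = r + gamma * max(
                    q.predict(phi(s2, a2).reshape(1, -1))[0]
                    for a2 in range(n_actions))
            targets.append(y)
        # Re-fit a fresh network on the (input, target) pairs each iteration.
        q = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
        q.fit(X, np.array(targets))
    return q
```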


Neural Fitted Q Iteration: input


Neural Fitted Q Iteration: target equation
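The equation itself is only an image in the source; presumably it is the standard fitted-Q regression target for a transition $(s, a, r, s')$:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta)$$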


Neural Fitted Q Iteration

Shortcomings
- Exploration is independent of experience
- Exploitation does not occur at all
- Policy evaluation does not occur at all
- Even in the exact (table-lookup) case, not guaranteed to converge to $Q^*$


Deep Q Network: action policy

- Choose an ε-greedy policy w.r.t. $Q$ (a minimal sketch follows)
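A minimal PyTorch-style sketch of ε-greedy action selection over a Q-network, assuming `q_net` maps a state tensor to a vector of Q-values; the names and the default ε are illustrative:

```python
import random
import torch

def epsilon_greedy(q_net, state, n_actions, eps=0.05):
    """With probability eps pick a random action, else the greedy one w.r.t. Q."""
    if random.random() < eps:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))  # shape: (1, n_actions)
        return int(q_values.argmax(dim=1).item())
```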


Deep Q Network: network freezing

- Get a copy of the network every C steps for stability (sketched below)
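A sketch of this freezing scheme, assuming `q_net` is a PyTorch module; the sync period C is an assumed value:

```python
import copy

def make_target_updater(q_net, C=10_000):
    """Create a frozen copy of q_net plus a hook that re-syncs it every C steps."""
    target_net = copy.deepcopy(q_net)
    target_net.eval()  # the copy is only ever used for inference

    def maybe_sync(step):
        if step % C == 0:
            target_net.load_state_dict(q_net.state_dict())

    return target_net, maybe_sync
```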



Deep Q Network: performance


Deep Q Network: overoptimism

- What happens when we overestimate Q? (see the illustration below)
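A quick numerical illustration (not from the slides) of why taking a max over noisy estimates is biased upward, assuming every action's true Q-value is zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# True Q-value of every action is 0; the estimates carry zero-mean noise.
n_actions, n_trials = 10, 100_000
noisy_q = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

# The max over noisy estimates is biased upward even though E[Q] = 0 everywhere:
print(noisy_q.max(axis=1).mean())  # roughly 1.54, not 0
```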


Deep Q Network

Problems
- Overestimation of the Q function at any state $s$ spills over to actions that lead to $s$ → Double DQN
- Sampling transitions uniformly from $D$ is inefficient → Prioritized Experience Replay


Double Deep Q Network: target

We can write the DQN target as:

$$Y^{\mathrm{DQN}}_t = R_{t+1} + \gamma\, Q\big(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta^-_t);\ \theta^-_t\big)$$

Double DQN's target is:

$$Y^{\mathrm{DDQN}}_t = R_{t+1} + \gamma\, Q\big(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t);\ \theta^-_t\big)$$

This has the effect of decoupling action selection from action evaluation (compare the sketch below).
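A sketch of both targets side by side, assuming PyTorch networks `q_net` (online, parameters $\theta_t$) and `target_net` (frozen, parameters $\theta^-_t$); terminal-state masking is omitted for brevity:

```python
import torch

def dqn_and_ddqn_targets(q_net, target_net, r, s_next, gamma=0.99):
    """Bootstrap targets for a batch of transitions.

    DQN:  both action selection and evaluation use the frozen target network.
    DDQN: the online network selects the action, the target network evaluates it.
    """
    with torch.no_grad():
        q_target = target_net(s_next)                 # (batch, n_actions)
        # DQN target: max over the target network's own estimates
        y_dqn = r + gamma * q_target.max(dim=1).values
        # Double DQN: argmax from the online network, value from the target net
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        y_ddqn = r + gamma * q_target.gather(1, a_star).squeeze(1)
    return y_dqn, y_ddqn
```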


Double Deep Q Network: performance

- Much more stable, with very little change in code


Prioritized Experience Replay

- Update "more surprising" experiences more frequently (a proportional-prioritization sketch follows)
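A minimal sketch of proportional prioritization, where the priority is the absolute TD error; the sum-tree storage and importance-sampling corrections used in the actual paper are omitted, and the constants are illustrative:

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, eps=1e-3):
    """Sample transition indices with probability proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```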


Dueling Network

- The scalar approximates $V$, and the vector approximates $A$


Dueling Network: forward pass equation

The exact forward pass equation is:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big( A(s, a; \theta, \alpha) - \max_{a'} A(s, a'; \theta, \alpha) \Big)$$

The following module was found to be more stable without losing much performance:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \Big)$$
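A minimal PyTorch sketch of the second (mean-subtracted) combining module; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Split shared features into V (scalar) and A (vector), then recombine."""
    def __init__(self, feat_dim, n_actions, hidden=512):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)      # (batch, 1)
        a = self.advantage(features)  # (batch, n_actions)
        # Subtract the mean advantage: the stabler variant from the slide
        return v + a - a.mean(dim=1, keepdim=True)
```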


Dueling Network: attention

- The advantage stream only pays attention when the current action is crucial


Dueling Network: performance

- Achieves state of the art in the Atari domain among DQN algorithms


Dueling Network

Summary
- Since this is an improvement only in the network architecture, methods that improve DQN (e.g. Double DQN) are all applicable here as well
- Solves the problem of $V$ and $A$ typically being of different scale
- Updates Q-values more frequently than a single-stream DQN, where only a single Q-value is updated for each observation
- Implicitly splits the credit assignment problem into a recursive binary problem of "now or later"


Dueling Network

Shortcomings
- Only works for $|\mathcal{A}| < \infty$
- Still not able to solve tasks involving long-term planning
- Better than DQN, but sample complexity is still high
- ε-greedy exploration is essentially random guessing