Reinforcement Learning

A (almost) quick (and very incomplete) introduction

Slides from David Silver, Dan Klein, Mausam, Dan Weld

Reinforcement Learning - Indian Institute of Technology Delhimausam/courses/col864/spring2017/slides/09-rlie.pdf · Reinforcement Learning Karthik Narasimhan, Adam Yala, Regina Barzilay

Reinforcement Learning

At each time step t:

• Agent executes an action A_t

• Environment emits a reward R_t

• Agent transitions to state S_t
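The interaction loop above can be sketched in a few lines of Python. `CoinFlipEnv`, `run`, and the copy-the-state policy are hypothetical stand-ins chosen for illustration; they are not from the slides.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the state is a visible coin face."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = self.rng.choice(["heads", "tails"])

    def step(self, action):
        # Reward 1 if the action names the current coin face, else 0.
        reward = 1.0 if action == self.state else 0.0
        # The environment then moves to a new random state.
        self.state = self.rng.choice(["heads", "tails"])
        return reward, self.state


def run(env, policy, num_steps):
    """Run the agent-environment loop and return the total reward."""
    total_reward = 0.0
    state = env.state
    for _ in range(num_steps):
        action = policy(state)            # agent executes an action
        reward, state = env.step(action)  # environment emits a reward and the next state
        total_reward += reward
    return total_reward


# A policy that copies the visible state earns the reward on every step.
print(run(CoinFlipEnv(), lambda s: s, num_steps=10))
```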

Rat Example

• What if agent state = last 3 items in sequence?

• What if agent state = counts for lights, bells and levers?

• What if agent state = complete sequence?
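The three questions above amount to three different state functions over the same history. A minimal sketch, assuming the history is a list of the strings "light", "bell" and "lever" (the two histories below are made up for illustration):

```python
from collections import Counter


def last_three(history):
    """Agent state = last 3 items in the sequence."""
    return tuple(history[-3:])


def counts(history):
    """Agent state = counts for lights, bells and levers."""
    c = Counter(history)
    return (c["light"], c["bell"], c["lever"])


def complete(history):
    """Agent state = the complete sequence."""
    return tuple(history)


# Two made-up histories that end the same way but differ earlier on.
h1 = ["light", "light", "lever", "bell"]
h2 = ["lever", "light", "lever", "bell"]

for name, state_fn in [("last 3", last_three), ("counts", counts), ("complete", complete)]:
    # True means the agent cannot tell the two histories apart.
    print(name, state_fn(h1) == state_fn(h2))
```

Under the "last 3" state the two histories look identical, so the agent must predict the same outcome for both; under the other two states they differ.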

Major Components of RL

An RL agent may include one or more of these components:

• Policy: agent’s behaviour function

• Value function: how good is each state and/or action

• Model: agent’s representation of the environment

Policy

• A policy is the agent’s behaviour

• It is a map from state to action

• Deterministic policy: a = π(s)

• Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
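The two kinds of policy can be sketched as plain functions. The state names, action names and probabilities below are hypothetical.

```python
import random


def deterministic_policy(state):
    """a = pi(s): each state maps to exactly one action."""
    table = {"s0": "left", "s1": "right"}
    return table[state]


def stochastic_policy(state, rng):
    """Sample a ~ pi(.|s) from a per-state distribution over actions."""
    probs = {
        "s0": {"left": 0.8, "right": 0.2},
        "s1": {"left": 0.1, "right": 0.9},
    }
    actions = list(probs[state])
    weights = [probs[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(deterministic_policy("s0"))
print([stochastic_policy("s0", rng) for _ in range(5)])
```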

Value function

• Value function is a prediction of future reward

• Used to evaluate the goodness/badness of states…

• …and therefore to select between actions
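One common way to make "prediction of future reward" precise is the expected discounted return under a policy π. The discount factor γ ∈ [0, 1] and the indexing (reward R_{t+1} follows state S_t) are assumptions taken from the standard formulation, not from this slide:

```latex
v_\pi(s) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \;\middle|\; S_t = s \right]
```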

Model

• A model predicts what the environment will do next

• It predicts the next state…

• …and predicts the next (immediate) reward

Dimensions of RL

Model-based vs. Model-free

• Model-based: Have or learn action models (i.e. transition probabilities).

    • Uses Dynamic Programming

• Model-free: Skip them and directly learn which action to take when (without necessarily finding out the exact model of the action).

    • e.g. Q-learning

On Policy vs. Off Policy

• On Policy: Makes estimates based on a policy, and improves that policy based on the estimates.

    • Learning on the job.

    • e.g. SARSA

• Off Policy: Learn a policy while following another (or re-using experience from an old policy).

    • Looking over someone's shoulder

    • e.g. Q-learning
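The on-policy / off-policy distinction shows up in the update target each method bootstraps from. A minimal sketch, assuming a tabular `Q` stored as a dict of dicts and a discount factor `gamma` (both names are illustrative):

```python
def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy: bootstrap from the action the agent actually takes next."""
    return reward + gamma * Q[next_state][next_action]


def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy: bootstrap from the best next action, whatever is taken."""
    return reward + gamma * max(Q[next_state].values())


# Hypothetical action-values for a single successor state.
Q = {"s1": {"a": 1.0, "b": 3.0}}

print(sarsa_target(Q, reward=0.5, next_state="s1", next_action="a", gamma=0.5))
print(q_learning_target(Q, reward=0.5, next_state="s1", gamma=0.5))
```

If the behaviour policy happens to pick the greedy action, the two targets coincide; they differ only when the agent explores.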

Markov Decision Process

• Set of states S = {s_i}

• Set of actions for each state A(s) = {a_i^s} (often independent of state)

• Transition model T(s -> s’ | a) = Pr(s’ | a, s)

• Reward model R(s, a, s’)

• Discount factor γ

MDP = <S, A, T, R, γ>
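The tuple <S, A, T, R, γ> maps naturally onto a small data structure. This is a sketch only: the field layout, the `is_valid` check, and the two-state toy problem are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MDP:
    states: tuple
    actions: tuple
    transition: dict  # (s, a) -> {s': Pr(s' | a, s)}
    reward: dict      # (s, a, s') -> R(s, a, s'); missing entries mean 0
    gamma: float

    def is_valid(self, tol=1e-9):
        """Check that each T(. | s, a) sums to 1 and gamma lies in [0, 1]."""
        for dist in self.transition.values():
            if abs(sum(dist.values()) - 1.0) > tol:
                return False
        return 0.0 <= self.gamma <= 1.0


# Hypothetical two-state problem.
toy = MDP(
    states=("cold", "warm"),
    actions=("stay", "move"),
    transition={
        ("cold", "stay"): {"cold": 1.0},
        ("cold", "move"): {"warm": 0.9, "cold": 0.1},
        ("warm", "stay"): {"warm": 1.0},
        ("warm", "move"): {"cold": 1.0},
    },
    reward={("cold", "move", "warm"): 1.0},
    gamma=0.9,
)
print(toy.is_valid())
```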

Bellman Equation for Value Function
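In the MDP notation (T, R, γ), the Bellman expectation equation for the state-value function of a policy π can be written as follows. This is the standard form, given here as a sketch:

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} T(s \to s' \mid a)\,\bigl[ R(s, a, s') + \gamma\, v_\pi(s') \bigr]
```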

Bellman Equation for Action-Value Function
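In the same MDP notation (T, R, γ), the standard Bellman expectation equation for the action-value function of a policy π reads as follows (again a sketch of the usual form):

```latex
q_\pi(s, a) = \sum_{s'} T(s \to s' \mid a)\,\Bigl[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \Bigr]
```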

Q vs V
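One way to relate the two quantities: the value of a state under a policy is the policy-weighted average of its action-values, and under a greedy policy it is their maximum. A minimal sketch with made-up numbers (`state_value` and `greedy_value` are illustrative names):

```python
def state_value(q_row, policy_row):
    """v(s) = sum over a of pi(a|s) * q(s, a)."""
    return sum(policy_row[a] * q_row[a] for a in q_row)


def greedy_value(q_row):
    """Value of the state if the agent always takes the best action."""
    return max(q_row.values())


q_row = {"left": 1.0, "right": 3.0}       # hypothetical q(s, .) for one state
policy_row = {"left": 0.5, "right": 0.5}  # hypothetical pi(.|s)

print(state_value(q_row, policy_row))
print(greedy_value(q_row))
```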

Exploration vs Exploitation

• Restaurant Selection

    • Exploitation: Go to your favourite restaurant

    • Exploration: Try a new restaurant

• Online Banner Advertisements

    • Exploitation: Show the most successful advert

    • Exploration: Show a different advert

• Oil Drilling

    • Exploitation: Drill at the best known location

    • Exploration: Drill at a new location

• Game Playing

    • Exploitation: Play the move you believe is best

    • Exploration: Play an experimental move

ε-Greedy solution

• Simplest idea for ensuring continual exploration

• All m actions are tried with non-zero probability

• With probability 1 − ε choose the greedy action

• With probability ε choose an action at random
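The rule above fits in a few lines. A minimal sketch, assuming the action-values for the current state are held in a dict (the names `epsilon_greedy` and `q_values` are illustrative):

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action from a dict mapping action -> estimated value."""
    actions = list(q_values)
    if rng.random() < epsilon:
        # Explore: uniform over all m actions (the greedy one included).
        return rng.choice(actions)
    # Exploit: the action with the highest estimated value.
    return max(actions, key=q_values.get)


rng = random.Random(0)
q_values = {"a": 0.1, "b": 0.7, "c": 0.3}
print([epsilon_greedy(q_values, 0.2, rng) for _ in range(10)])
```

Because the random branch can also return the greedy action, the greedy action is chosen with total probability 1 − ε + ε/m and every other action with probability ε/m.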

Off Policy Learning

• Evaluate target policy π(a|s) to compute v_π(s) or q_π(s,a) while following behaviour policy μ(a|s)

{s_1, a_1, r_2, ..., s_T} ∼ μ

• Why is this important?

    • Learn from observing humans or other agents

    • Re-use experience generated from old policies π_1, π_2, ..., π_{t−1}

    • Learn about optimal policy while following exploratory policy

    • Learn about multiple policies while following one policy

Q-Learning

• We now consider off-policy learning of action-values Q(s,a)

• Next action is chosen using behaviour policy A_{t+1} ∼ μ(·|S_{t+1})

• But we consider alternative successor action A′ ∼ π(·|S_{t+1})

• And update Q(S_t, A_t) towards value of alternative action

Q-Learning

• We now allow both behaviour and target policies to improve

• The target policy π is greedy w.r.t. Q(s,a)

• The behaviour policy μ is e.g. ε-greedy w.r.t. Q(s,a)

• The Q-learning target then simplifies:
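With a greedy target policy, the alternative action A′ is the argmax of Q in the successor state, so the target collapses to a max. In its standard form, with γ the discount factor:

```latex
R_{t+1} + \gamma\, Q(S_{t+1}, A')
  = R_{t+1} + \gamma\, Q\bigl(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a')\bigr)
  = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')
```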

Q-Learning
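A single tabular Q-learning step can be sketched as below. The step size `alpha`, the `(state, action)` key layout and the example numbers are assumptions for illustration; this is the textbook update, not necessarily the exact form shown on the slide.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, actions, alpha, gamma):
    """Move Q(state, action) a fraction alpha towards the greedy target."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]


# Unseen (state, action) pairs default to a value of 0.
Q = defaultdict(float)
Q[("s1", "right")] = 2.0  # hypothetical estimate already learned for s1

new_value = q_learning_update(
    Q, state="s0", action="right", reward=1.0, next_state="s1",
    actions=("left", "right"), alpha=0.5, gamma=0.5,
)
print(new_value)
```

The update uses the max over next actions regardless of which action the behaviour policy takes in `next_state`, which is what makes it off-policy.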

Deep RL

• We seek a single agent which can solve any human-level task

• RL defines the objective

• DL gives the mechanism

• RL + DL = general intelligence (David Silver)

Function Approximators

Deep Q-Networks

• Q-learning can diverge when used with neural networks, due to:

    • Correlations between samples

    • Non-stationary targets

Solution: Experience Replay

• Fancy biological analogy

• In reality, quite simple
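The "quite simple" part can be sketched as a fixed-size buffer of past transitions that is sampled uniformly at random, so consecutive (correlated) steps rarely land in the same training batch. The class name, capacity and transition layout are assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buf.add(state=t, action="a", reward=0.0, next_state=t + 1, done=False)

print(len(buf))
print(buf.sample(2))
```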

Improving Information Extraction by Acquiring External Evidence with Reinforcement Learning

Karthik Narasimhan, Adam Yala, Regina Barzilay

CSAIL, MIT

Slides from Karthik Narasimhan

Why try to reason, when someone else can do it for you

Doubts*

• Algo 1, line 19: the process should end when "d" == "end_episode" and not q. [Prachi] Error.

• The dimension of the match vector should be equal to the number of columns to be extracted, but Fig 3 has twice that number of dimensions. [Prachi] Error.

• Is RL the best approach? [Non believers]

• Experience Replay [Anshul]. Hope it is clear now.

• Why is RL-extract better than the meta-classifier? The explanation provided in the paper about the "long tail of noisy, irrelevant documents" is unclear. [Yash]

• To be completely fair, the meta-classifier should also cut off at the top-20 results per search, like the RL system. [Anshul]

* most mean questions

Discussions

• Experiments

    • People are happy!

• Queries

    • Cluster documents and learn queries [Yashoteja]

    • Many other query formulations [Surag (lowest confidence entity), Barun (LSTM), Gagan (highest confidence entity), DineshR]

    • Fixed set of queries [Akshay]

    • Simplicity. Search engines are robust.

• Reliance on News articles [Gagan]

    • Where else would you get News from?

• Domain limitations

    • Too narrow [Barun, Himanshu]. Domain specific [Happy]. Small ontology [Akshay]

    • It is not Open IE. It is task specific. Can be applied to any domain.

• Better meta-classifiers [Surag]

• More sophisticated RL algorithms (A3C, TRPO) [esp. if increasing action space by LSTM queries], and their effect on performance and training time.