The People Have Spoken...

The People Have Spoken.... Administrivia Final Project proposal due today Undergrad credit: please see me in office hours Dissertation defense announcements


Administrivia

•Final Project proposal due today

•Undergrad credit: please see me in office hours

•Dissertation defense announcements

•R2 assigned today

•Midterm back

•Results of the survey


ML (ish) thesis defenses

•April 5, 8:30 AM

•Rob Abbot (advisor: Forrest)

•Learning to play robo-soccer from human observation + genetic adaptation

•April 20, 1:00 PM

• John Burge (advisor: Lane)

•Learning network models of fMRI brain activity data


Reading 2

•Due: Thurs, Apr 5

• Knill, D., and Pouget, A. “The Bayesian brain: the role of uncertainty in neural coding and computation”. Trends in Neurosciences. 27(12):712-719. 2004.

http://www.bcs.rochester.edu/people/alex/Publications.htm


Midterm

•Not too shabby overall

•Some weak spots, though

•Summary:

•μ=71.3

•σ=19.1


Surveeeeeeeey says!

•Vote tally:

•Unsupervised learning: 2

•Reinforcement learning: 5

•MLE: θ=Pr[RL]=0.71

•Bayesian posterior:
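
The posterior itself is not preserved in this copy of the slide. As a rough sketch only, assuming a uniform Beta(1, 1) prior on θ (the prior is an assumption, not something the slides state), the tally of 5 RL votes and 2 unsupervised votes would give a Beta(6, 3) posterior:

```python
from fractions import Fraction

# Vote tally from the slide.
rl_votes = 5
unsup_votes = 2

# Maximum-likelihood estimate of theta = Pr[RL].
theta_mle = Fraction(rl_votes, rl_votes + unsup_votes)

# ASSUMPTION: uniform Beta(1, 1) prior; the slides do not state the prior.
prior_alpha, prior_beta = 1, 1

# Beta prior + binomial likelihood -> Beta posterior (conjugacy).
post_alpha = prior_alpha + rl_votes
post_beta = prior_beta + unsup_votes
posterior_mean = Fraction(post_alpha, post_alpha + post_beta)

print(f"MLE: {float(theta_mle):.2f}")
print(f"Posterior: Beta({post_alpha}, {post_beta}), mean {float(posterior_mean):.2f}")
```

With so few votes, the prior pulls the posterior mean noticeably below the MLE; a different prior would give a different posterior.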


Reinforcement Learning: Learning to get what you want...

Sutton & Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

http://www.cs.ualberta.ca/~sutton/book/the-book.html

Kaelbling, Littman, & Moore, “Reinforcement Learning: A Survey,” Journal of Artificial Intelligence Research, Volume 4, 1996.

http://people.csail.mit.edu/u/l/lpk/public_html/papers/rl-survey.ps


Meet Mack the Mouse*

•Mack lives a hard life as a psychology test subject

•Has to run around mazes all day, finding food and avoiding electric shocks

•Needs to know how to find cheese quickly, while getting shocked as little as possible

•Q: How can Mack learn to find his way around?

* Mickey is still under copyright


Start with an easy case

•Very simple maze:

•Whenever Mack goes left, he gets cheese

•Whenever he goes right, he gets shocked

•After reward/punishment, he’s reset back to start of maze

•Q: how can Mack learn to act well in this world?


Learning in the easy case

•Say there are two labels: “cheese” and “shock”

•Mack tries a bunch of trials in the world -- that generates a bunch of experiences:

•Now what?
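
One possible way to use such experiences, sketched under assumptions: treat each trial as an (action, label) pair and learn, for each action, the label seen most often after it. The trial log and the function name `learn_outcome_model` are hypothetical, made up for illustration.

```python
from collections import Counter, defaultdict

# HYPOTHETICAL trial log: (action, observed label) pairs.
experiences = [
    ("left", "cheese"),
    ("right", "shock"),
    ("left", "cheese"),
    ("right", "shock"),
    ("left", "cheese"),
]

def learn_outcome_model(trials):
    """Map each action to the label most often observed after it."""
    counts = defaultdict(Counter)
    for action, label in trials:
        counts[action][label] += 1
    return {action: c.most_common(1)[0][0] for action, c in counts.items()}

outcome_model = learn_outcome_model(experiences)
print(outcome_model)
```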


But what to do?

•So we know that Mack can learn a mapping from actions to outcomes

•But what should Mack do in any given situation?

•What action should he take at any given time?

•Suppose Mack is the subject of a psychotropic drug study and has actually come to like shocks and hate cheese -- how does he act now?


Reward functions

•In general, we think of a reward function:

•R() tells us whether Mack thinks a particular outcome is good or bad

•Mack before drugs:

•R(cheese)=+1

•R(shock)=-1

•Mack after drugs:

•R(cheese)=-1

•R(shock)=+1

•Behavior always depends on rewards (utilities)


Maximizing reward

•So Mack wants to get the maximum possible reward

•(Whatever that means to him)

•For the one-shot case like this, this is fairly easy

•Now what about a harder case?
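
For the one-shot maze, maximizing reward can be sketched as picking the action whose outcome has the highest R(). The outcomes and reward values below are the ones given on the slides; the dictionary layout and the name `best_action` are assumptions made for illustration.

```python
# Outcome of each action in the very simple maze (from the slides).
OUTCOME = {"left": "cheese", "right": "shock"}

# Reward functions from the slides.
R_BEFORE_DRUGS = {"cheese": +1, "shock": -1}
R_AFTER_DRUGS = {"cheese": -1, "shock": +1}

def best_action(outcome, reward):
    """Pick the action whose outcome has the highest reward."""
    return max(outcome, key=lambda action: reward[outcome[action]])

print(best_action(OUTCOME, R_BEFORE_DRUGS))  # left
print(best_action(OUTCOME, R_AFTER_DRUGS))   # right
```

The world is the same in both calls; only the reward function changes, and the chosen action flips with it.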


Reward over time

•In general: agent can be in a state $s_i$ at any time $t$

•Can choose an action $a_j$ to take in that state

•Reward associated with a state:

•$R(s_i)$

•Or with a state/action transition:

•$R(s_i, a_j)$

•Series of actions leads to series of rewards

•$(s_1, a_1) \to s_3$: $R(s_3)$; $(s_3, a_7) \to s_{14}$: $R(s_{14})$; ...
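
A minimal sketch of a series of actions producing a series of rewards, using the two transitions from the example above. The transitions are treated as deterministic and the numeric rewards are hypothetical; the slides give neither assumption explicitly.

```python
# Deterministic transitions from the slide's example:
# (s1, a1) -> s3 and (s3, a7) -> s14.
TRANSITIONS = {("s1", "a1"): "s3", ("s3", "a7"): "s14"}

# HYPOTHETICAL state rewards R(s); the slides give no numeric values.
REWARD = {"s3": 0.0, "s14": 1.0}

def run(start, actions):
    """Follow the actions from start, collecting R(next state) per step."""
    state, rewards = start, []
    for action in actions:
        state = TRANSITIONS[(state, action)]
        rewards.append(REWARD[state])
    return state, rewards

final_state, rewards = run("s1", ["a1", "a7"])
print(final_state, rewards)
```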


Reward over time

[Figure: state-transition diagram over states s1 through s11]

$V(s_1) = R(s_1) + R(s_4) + R(s_{11}) + R(s_{10}) + \dots$


Reward over time

[Figure: state-transition diagram over states s1 through s11]

$V(s_1) = R(s_1) + R(s_2) + R(s_6) + \dots$
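
The two sums for V(s1) follow different paths out of s1, so they can come out different. A sketch with made-up numbers: the reward values are hypothetical, and each path is cut off where the slides write "...".

```python
# HYPOTHETICAL rewards R(s); the slides do not give numbers.
R = {"s1": 0, "s2": -1, "s4": 1, "s6": 5, "s10": 2, "s11": 0}

def path_value(path):
    """Sum of state rewards along a path (no discounting)."""
    return sum(R[s] for s in path)

# The two paths from s1 written out on the slides, truncated at "...".
path_a = ["s1", "s4", "s11", "s10"]
path_b = ["s1", "s2", "s6"]

print(path_value(path_a), path_value(path_b))
```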


Where can you go?

•Definition: Complete set of all states agent could be in is called the state space: S

•Could be discrete or continuous

•We’ll usually work with discrete

•Size of state space: |S|

•$S = \{s_1, s_2, \dots, s_{|S|}\}$


What can you do?

•Definition: Complete set of actions an agent could take is called the action space: A

•Again, discrete or cont.

•Again, we work w/ discrete

•Again, size: |A|

•$A = \{a_1, \dots, a_{|A|}\}$


Experience & histories

•In supervised learning, “fundamental unit of experience”: feature vector + label

•Fundamental unit of experience in RL:

•At time $t$ in some state $s_i$, take action $a_j$, get reward $r_t$, end up in state $s_k$

•Called an experience tuple or SARSA tuple


The value of history...

•Set of all experience during a single episode up to time t is a history:

•A.k.a., trace, trajectory
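
The formula for a history is not preserved in this copy. One plausible representation, offered as a sketch: an experience tuple as (state, action, reward, next state), and a history as the ordered list of those tuples. The class name `Experience` and the sample values are hypothetical.

```python
from typing import List, NamedTuple

class Experience(NamedTuple):
    """One experience tuple: state, action, reward, next state."""
    state: str
    action: str
    reward: float
    next_state: str

# A history is the ordered list of tuples seen so far in one episode.
History = List[Experience]

# HYPOTHETICAL episode fragment; states, actions, rewards are made up.
history: History = [
    Experience("s1", "a1", 0.0, "s3"),
    Experience("s3", "a7", 1.0, "s14"),
]

# Consecutive tuples chain: each next_state is the following tuple's state.
chained = all(a.next_state == b.state for a, b in zip(history, history[1:]))
print(len(history), chained)
```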


Policies

•Total accumulated reward (value, V) depends on

•Where agent starts, initial s

•What agent does at each step (duh), a

•Plan of action is called a policy, π

•Policy defines what action to take in every state of the system: $\pi : S \to A$

•A.k.a. controller, control law, decision rule, etc.
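
For discrete state and action spaces, a deterministic policy can be sketched as a lookup table with one entry per state. The state names, action names, and helper `act` are hypothetical, chosen only for illustration.

```python
# HYPOTHETICAL small state and action spaces.
STATES = ["s1", "s2", "s3"]
ACTIONS = ["left", "right"]

# A deterministic policy: one action for every state.
policy = {"s1": "left", "s2": "right", "s3": "left"}

def act(pi, state):
    """Look up the action the policy prescribes in this state."""
    return pi[state]

# A policy must cover the whole state space and use only legal actions.
is_complete = set(policy) == set(STATES)
is_legal = all(a in ACTIONS for a in policy.values())
print(act(policy, "s2"), is_complete, is_legal)
```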


Policies

•Value is a function of start state and policy:

•Useful to think about finite horizon and infinite horizon values:
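
To see value as a function of both start state and policy, here is a sketch of a finite-horizon rollout in a tiny deterministic world. The world, rewards, and policies are hypothetical, and reward is assumed to be collected on arrival in each state.

```python
# HYPOTHETICAL deterministic world: next state for each (state, action).
NEXT = {
    ("s1", "left"): "s2", ("s1", "right"): "s3",
    ("s2", "left"): "s2", ("s2", "right"): "s2",
    ("s3", "left"): "s3", ("s3", "right"): "s3",
}
# HYPOTHETICAL rewards for arriving in a state.
R = {"s1": 0, "s2": 1, "s3": -1}

def value(start, pi, horizon):
    """Total reward from following policy pi for `horizon` steps."""
    state, total = start, 0
    for _ in range(horizon):
        state = NEXT[(state, pi[state])]
        total += R[state]
    return total

go_left = {"s1": "left", "s2": "left", "s3": "left"}
go_right = {"s1": "right", "s2": "right", "s3": "right"}
print(value("s1", go_left, 3), value("s1", go_right, 3), value("s3", go_left, 3))
```

Changing the policy (first versus second call) or the start state (first versus third call) changes the value.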


Finite horizon reward

•Assuming that an episode is finite:

•Agent acts in the world for a finite number of time steps, T, experiences history $h_T$

•What should total aggregate value be?

•Total accumulated reward:

•Occasionally useful to use average reward:
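
The two formulas are not preserved in this copy. Assuming the usual definitions (total reward as the sum of the per-step rewards, average reward as that sum divided by T), a sketch with a hypothetical reward sequence:

```python
def total_reward(rewards):
    """Total accumulated reward over a finite episode."""
    return sum(rewards)

def average_reward(rewards):
    """Average reward per time step over a finite episode."""
    return sum(rewards) / len(rewards)

# HYPOTHETICAL per-step rewards from one episode with T = 4.
rewards = [1, -1, 1, 1]
print(total_reward(rewards), average_reward(rewards))
```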


Gonna live forever...

•Often, we want to model a process that is indefinite

• Infinitely long

•Of unknown length (don’t know in advance when it will end)

•Runs ‘til it’s stopped (randomly)

•Have to consider infinitely long histories

•Q: what does value mean over an infinite history?


Reaaally long-term reward

•Let $h_\infty$ be an infinite history

•We define the infinite-horizon discounted value to be: $V(h_\infty) = \sum_{t=0}^{\infty} \gamma^t r_t$

•where $0 \le \gamma < 1$ is the discount factor

•Q1: Why does this work?

•Q2: if $R_{max}$ is the max possible reward attainable in the environment, what is $V_{max}$?
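
One way to approach Q1 and Q2, reading the discounted value as the sum over t of γ^t r_t with 0 ≤ γ < 1 (the usual definition, assumed here): if every reward is at most Rmax, the sum is bounded by a geometric series, which converges to Rmax / (1 - γ). The sketch below checks this numerically; γ = 0.9, Rmax = 1, and the 1000-step cutoff are hypothetical choices.

```python
def discounted_value(rewards, gamma):
    """Sum of gamma**t * r_t over a (truncated) reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# HYPOTHETICAL numbers for illustration.
gamma = 0.9
r_max = 1.0

# Receiving r_max at every step, truncated at 1000 steps.
truncated = discounted_value([r_max] * 1000, gamma)

# Closed form of the geometric series: r_max / (1 - gamma).
v_max = r_max / (1 - gamma)
print(truncated, v_max)
```

The truncated sum sits essentially on top of the closed form because the terms shrink geometrically, which is also why the infinite sum is finite at all.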