CMPUT 609/499: Reinforcement Learning for Artificial Intelligence
Instructor: Rich Sutton, Dept. of Computing Science
richsutton.com
What is Reinforcement Learning?
Agent-oriented learning—learning by interacting with an environment to achieve a goal
• more realistic and ambitious than other kinds of machine learning
Learning by trial and error, with only delayed evaluative feedback (reward)
• the kind of machine learning most like natural learning
• learning that can tell for itself when it is right or wrong
The beginnings of a science of mind that is neither natural science nor applications technology
Lecture 1: Introduction to Reinforcement Learning
About RL
Many Faces of Reinforcement Learning

[Diagram after David Silver, 2015: reinforcement learning sits at the intersection of Computer Science (machine learning), Engineering (optimal control), Mathematics (operations research), Economics (bounded rationality), Psychology (classical/operant conditioning), and Neuroscience (reward systems)]
Example: Hajime Kimura’s RL Robots

[Videos: before and after learning; walking backward; a new robot, same algorithm]
The RL Interface
• Environment may be unknown, nonlinear, stochastic and complex
• Agent learns a policy mapping states to actions
• Seeking to maximize its cumulative reward in the long run
[Diagram: the Agent sends an action (response, control) to the Environment (world); the environment returns a state (stimulus, situation) and a reward (gain, payoff, cost)]
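To make the interface concrete, here is a minimal sketch of the interaction loop in Python; the toy environment and its reset/step interface are illustrative assumptions of mine, not any particular library's API.

    import random

    class CoinFlipEnv:
        """Toy environment: guess a coin flip; reward +1 for a correct guess."""
        def reset(self):
            self.t = 0
            return 0                                  # a single dummy state

        def step(self, action):
            self.t += 1
            reward = 1.0 if action == random.randint(0, 1) else 0.0
            done = self.t >= 10                       # episodes last 10 steps
            return 0, reward, done                    # state, reward, done

    def run_episode(env, policy):
        """One episode of the agent-environment interface."""
        state, total_reward, done = env.reset(), 0.0, False
        while not done:
            action = policy(state)                    # policy: state -> action
            state, reward, done = env.step(action)
            total_reward += reward                    # accumulate the reward signal
        return total_reward

    print(run_episode(CoinFlipEnv(), policy=lambda s: random.randint(0, 1)))

The policy here is fixed and random; reinforcement learning is about improving it using only the reward signal.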
Signature challenges of RL
Evaluative feedback (reward)
Sequentiality, delayed consequences
Need for trial and error, to explore as well as exploit (see the sketch after this list)
Non-stationarity
The fleeting nature of time and online data
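As a concrete example of exploring while exploiting, here is a minimal ε-greedy bandit sketch in the spirit of Chapter 2 of the textbook; the arm means, step count, and ε below are illustrative choices of mine.

    import random

    def epsilon_greedy_bandit(arm_means, steps=10000, epsilon=0.1):
        """Sample-average action values with epsilon-greedy action selection."""
        k = len(arm_means)
        q = [0.0] * k                                     # estimated value per arm
        n = [0] * k                                       # pull counts
        total = 0.0
        for _ in range(steps):
            if random.random() < epsilon:
                a = random.randrange(k)                   # explore
            else:
                a = max(range(k), key=lambda i: q[i])     # exploit
            r = random.gauss(arm_means[a], 1.0)           # noisy evaluative feedback
            n[a] += 1
            q[a] += (r - q[a]) / n[a]                     # incremental sample mean
            total += r
        return q, total

    q, total = epsilon_greedy_bandit([0.1, 0.5, 0.9])
    print(q)                                              # estimates near 0.1, 0.5, 0.9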
Some RL Successes
• Learned the world’s best player of Backgammon (Tesauro 1995)
• Learned acrobatic helicopter autopilots (Ng, Abbeel, Coates et al 2006+)
• Widely used in the placement and selection of advertisements and pages on the web (e.g., A-B tests)
• Used to make strategic decisions in Jeopardy! (IBM’s Watson 2011)
• Achieved human-level performance on Atari games from pixel-level visual input, in conjunction with deep learning (Google Deepmind 2015)
• In all these cases, performance was better than could be obtained by any other method, and was obtained without human instruction
Example: TD-Gammon

[Figure: a backgammon board position is the state s, input to a neural network with weights w that outputs the value estimate V(s, w)]
Example: TD-Gammon (Tesauro, 1992-1995)
Start with a random Network
Play millions of games against itself
Learn a value function from this simulated experience
Six weeks later it’s the best player of backgammon in the world
Originally used expert handcrafted features; later repeated with raw board positions
Estimated state value (≈ probability of winning); action selection by a shallow search
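TD-Gammon's actual learner was a neural network trained with TD(λ); the following is only a minimal sketch of the underlying temporal-difference idea, with a lookup table standing in for V(s, w) and a 5-state random walk standing in for backgammon (both substitutions are mine, for brevity).

    import random
    from collections import defaultdict

    V = defaultdict(lambda: 0.5)   # V(s): estimated probability of "winning"
    ALPHA = 0.1                    # step size

    def play_and_learn(episodes=5000):
        for _ in range(episodes):
            s = 2                                    # start in the middle state
            while 0 < s < 4:                         # states 0 and 4 are terminal
                s_next = s + random.choice([-1, 1])  # the "game" moves at random
                if s_next in (0, 4):                 # game over: target is the outcome
                    target = 1.0 if s_next == 4 else 0.0
                else:                                # otherwise bootstrap from V(s')
                    target = V[s_next]
                V[s] += ALPHA * (target - V[s])      # TD(0) update
                s = s_next

    play_and_learn()
    print({s: round(V[s], 2) for s in (1, 2, 3)})    # tends toward 0.25, 0.5, 0.75

No human tells the learner the right move; the value estimates improve from simulated experience alone, which is the heart of the TD-Gammon story.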
RL + Deep Learning Performance on Atari Games
[Video frames: Space Invaders, Breakout, Enduro]
• Learned to play 49 games for the Atari 2600 game console, without labels or human input, from self-play and the score alone
• Learned to play better than all previous algorithms, and at human level for more than half the games
RL + Deep Learning, applied to Classic Atari Games (Google Deepmind 2015, Bowling et al. 2012)
difficult and engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout—taking high-dimensional data (210 × 160 colour video at 60 Hz) as input—to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner—illustrated by the temporal evolution of two indices of learning (the agent’s average score-per-episode and average predicted Q-values; see Fig. 2 and Supplementary Discussion for details).

We compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available (refs 12, 15). In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on the y axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games;
[Figure 1 layer labels: input, Convolution, Convolution, Fully connected, Fully connected]

Figure 1 | Schematic illustration of the convolutional neural network. The details of the architecture are explained in the Methods. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ, followed by three convolutional layers (note: snaking blue line symbolizes sliding of each filter across input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, max(0, x)).
[Figure 2 panels a-d: average score per episode and average predicted action value (Q) plotted against training epochs]

Figure 2 | Training curves tracking the agent’s average score and average predicted action-value. a, Each point is the average score achieved per episode after the agent is run with an ε-greedy policy (ε = 0.05) for 520 k frames on Space Invaders. b, Average score achieved per episode for Seaquest. c, Average predicted action-value on a held-out set of states on Space Invaders. Each point on the curve is the average of the action-value Q computed over the held-out set of states. Note that Q-values are scaled due to clipping of rewards (see Methods). d, Average predicted action-value on Seaquest. See Supplementary Discussion for details.
mapping raw screen pixels to predictions of final score for each of 18 joystick actions
Same learning algorithm applied to all 49 games!
w/o human tuning
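The core of DQN is one-step Q-learning with experience replay. Below is a minimal sketch of that update, with a linear approximator and random transitions standing in for the convolutional network and the Atari emulator; these simplifications are mine, not Deepmind's code (DQN also uses minibatched gradient descent and a separate target network, omitted here).

    import random
    import numpy as np

    N_FEATURES, N_ACTIONS = 8, 4
    W = np.zeros((N_ACTIONS, N_FEATURES))    # Q(s, a) = W[a] . s
    GAMMA, ALPHA, EPSILON = 0.99, 0.01, 0.05
    replay = []                              # buffer of (s, a, r, s2, done)

    def q_values(s):
        return W @ s                         # vector of Q(s, a) for all actions

    def act(s):
        if random.random() < EPSILON:        # explore
            return random.randrange(N_ACTIONS)
        return int(np.argmax(q_values(s)))   # exploit

    def learn(batch_size=32):
        for s, a, r, s2, done in random.sample(replay, min(batch_size, len(replay))):
            target = r if done else r + GAMMA * np.max(q_values(s2))
            W[a] += ALPHA * (target - q_values(s)[a]) * s   # semi-gradient step

    # Drive it with random transitions from a stand-in "environment":
    for _ in range(1000):
        s, s2 = np.random.rand(N_FEATURES), np.random.rand(N_FEATURES)
        replay.append((s, act(s), random.random(), s2, random.random() < 0.1))
        learn()

Replaying stored transitions in random order breaks the correlation between consecutive samples, which is one of the tricks that made training large networks with RL stable.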
Intelligence is the ability to achieve goals
“Intelligence is the most powerful phenomena in the universe” —Ray Kurzweil, c 2000
The phenomenon is that there are systems in the universe that are well thought of as goal-seeking systems
What is a goal-seeking system?
“Constant ends from variable means is the hallmark of mind” —William James, c 1890
A system that is better understood in terms of outcomes than in terms of mechanisms
The coming of artificial intelligence
• When people finally come to understand the principles of intelligence—what it is and how it works—well enough to design and create beings as intelligent as ourselves
• A fundamental goal for science, engineering, the humanities, …for all mankind
• It will change the way we work and play, our sense of self, life, and death, the goals we set for ourselves and for our societies
• But it is also of significance beyond our species, beyond history
• It will lead to new beings and new ways of being, things inevitably much more powerful than our current selves
Milestones in the development of life on Earth

    year     milestone
    14 Bya   Big Bang
    4.5 Bya  formation of the Earth and solar system
    3.7 Bya  origin of life on Earth (formation of first replicators); DNA and RNA
    1.1 Bya  sexual reproduction; multi-cellular organisms; nervous systems
    1 Mya    humans; culture
    100 Kya  language
    10 Kya   agriculture, metal tools
    5 Kya    written language
    200 ya   industrial revolution; technology
    70 ya    computers
    ?        nanotechnology, artificial intelligence, super-intelligence…

[Timeline annotation: the Age of Replicators, when self-replicated things are most prominent, is followed by the Age of Design, when designed things are most prominent]
AI is a great scientific prize
• cf. the discovery of DNA, the digital code of life, by Watson and Crick (1953)
• cf. Darwin’s discovery of evolution, how people are descendants of earlier forms of life (1860)
• cf. the splitting of the atom, by Hahn (1938)
• leading to both atomic power and atomic bombs
When will we understand the principles of intelligence well enough to create, using technology, artificial minds that rival our own in skill and generality?
Which of the following best represents your current views?
A. Never
B. Not during your lifetime
C. During your lifetime, but not before 2045
D. Before 2045
E. Before 2035
Socrative.com, Room 568225
Is human-level AI possible?
• If people are biological machines, then eventually we will reverse engineer them, and understand their workings
• Then, surely we can make improvements
• with materials and technology not available to evolution
• how could there not be something we can improve?
• design can overcome local minima, make great strides, try things much faster than biology
Yes
If AI is possible, then will it eventually, inevitably happen?
• No. Not if we destroy ourselves first
• If that doesn’t happen, then there will be strong, multi-incremental economic incentives pushing inexorably towards human and super-human AI
• It seems unlikely that they could be resisted
• or successfully forbidden or controlled
• there is too much value, too many independent actors
Very probably, say 90%
When will human-level AI first be created?
• No one knows of course; we can make an educated guess about the probability distribution:
• 25% chance by 2030
• 50% chance by 2040
• 10% chance never
• Certainly a significant chance within all of our expected lifetimes
• We should take the possibility into account in our career plans
Corporate investment in AI is way up
• Google’s prescient AI buying spree: Boston Dynamics, Nest, Deepmind Technologies, …
• New AI research labs at Facebook (Yann LeCun), Baidu (Andrew Ng), Allen Institute (Oren Etzioni), Vicarious, Maluuba…
• Also enlarged corporate AI labs: Microsoft, Amazon, Adobe…
• Yahoo makes major investment in CMU machine learning department
• Many new AI startups getting venture capital
The 2nd industrial revolution
• The 1st industrial revolution was the physical power of machines substituting for that of people
• The 2nd industrial revolution is the computational power of machines substituting for that of people
• Computation for perception, motor control, prediction, decision making, optimization, search
• Until now, people have been our cheapest source of computation
• But now our machines are starting to provide greater, cheaper computation
The computational revolution

[Chart: growth of computation over time, reaching ≈ the computational power of the human brain by ≈2025]
Advances in AI abilities are coming faster; in the last 5 years:
• IBM’s Watson beats the best human players of Jeopardy! (2011)
• Deep neural networks greatly improve the state of the art in speech recognition and computer vision (2012–)
• Google’s self-driving car becomes a plausible reality (≈2013)
• Deepmind’s DQN learns to play Atari games at the human level, from pixels, with no game-specific knowledge (≈2014, Nature)
• University of Alberta’s Cepheus solves Poker (2015, Science)
• Google Deepmind’s AlphaGo defeats the world Go champion, vastly improving over all previous programs (2016)
Cheap computation power drives progress in AI
• Deep learning algorithms are essentially the same as those used in the ’80s
• only now with larger computers (GPUs) and larger data sets
• enabling today’s vastly improved speech recognition
• Similar impacts of computer power can be seen in recent years, and throughout AI’s history, in natural language processing, computer vision, and computer chess, Go, and other games
Algorithmic advances are also essential
• Algorithmic advances such as backpropagation, MCTS, policy-gradient reinforcement learning, and LSTM were necessary but not sufficient
• They were invented early, then waited for the computational power needed for them to shine
• other algorithms are still waiting for more, cheaper computation
• Algorithmic advances are slower, less reliable
• But they will accelerate with more computation, more focused effort
AI is not like other sciences
• AI has Moore’s law, an enabling technology racing alongside it, making the present special
• Moore’s law is a slow fuse, leading to the greatest scientific and economic prize of all time
• So slow, so inevitable, yet so uncertain in timing
• The present is a special time for humanity, as we prepare for, wait for, and strive to create strong AI
Algorithmic advances in Alberta
• World’s best computer games group for decades (see Bowling’s talk), including solving Poker
• Created the Atari games environment that our alumni, at Deepmind, used to show learning of human-level play
• Trained the AlphaGo team that beat the world Go champion
• World’s leading university in reinforcement learning algorithms, theory, and applications, including TD, MCTS
• ≈20 faculty members in AI
Course Overview
Main Topics:
Learning (by trial and error)
Planning (search, reason, thought, cognition)
Prediction (evaluation functions, knowledge)
Control (action selection, decision making)

Recurring issues:
Demystifying the illusion of intelligence
Purpose (goals, reward) vs Mechanism
Model-based RL: GridWorld Example
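The lecture's GridWorld demo itself is not reproduced here; as a stand-in, the sketch below plans with a known model by value iteration on a small gridworld of my own choosing (deterministic moves, reward 1 for reaching the goal).

    GRID_W, GRID_H = 4, 3
    GOAL = (3, 2)
    ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # right, left, up, down
    GAMMA = 0.9

    def model(state, action):
        """The known model: deterministic moves; walls keep you in place."""
        x = min(max(state[0] + action[0], 0), GRID_W - 1)
        y = min(max(state[1] + action[1], 0), GRID_H - 1)
        return (x, y), (1.0 if (x, y) == GOAL else 0.0)   # next state, reward

    V = {(x, y): 0.0 for x in range(GRID_W) for y in range(GRID_H)}
    for _ in range(50):                      # sweep until approximately converged
        for s in V:
            if s == GOAL:
                continue                     # goal is terminal
            # Bellman optimality backup: V(s) = max_a [ r + gamma * V(s') ]
            V[s] = max(r + GAMMA * V[s2] for s2, r in (model(s, a) for a in ACTIONS))

    for y in reversed(range(GRID_H)):        # print the value function as a grid
        print(" ".join(f"{V[(x, y)]:4.2f}" for x in range(GRID_W)))

Planning like this requires the model; much of the course concerns what to do when the model must itself be learned from experience.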
CMPUT 609: Provisional Schedule of Classes and Assignments

class | date | lecture topic | reading assignment (in advance) | due
1 | Thu, Sep 1, 2016 | The Magic of Artificial Intelligence; reasons for taking the course | Read section 1 of the Wikipedia entry for “the technological singularity”; see also Vinge 2010 (http://www-rohan.sdsu.edu/faculty/vinge/misc/iaai10/) and Moravec 1998 (http://www.transhumanist.com/volume1/moravec.htm) |
2 | Tue, Sep 6, 2016 | Bandit problems | Sutton & Barto Chapters 1 and 2 |
3 | Thu, Sep 8, 2016 | Bandit problems plus RL examples | Sutton & Barto Chapter 2 (including Section 2.7) |
4 | Tue, Sep 13, 2016 | Defining “Intelligent Systems” | Read the definition given for artificial intelligence in Wikipedia and in the Nilsson book on p. 13; google for and read “John McCarthy basic questions” and “the intentional stance (dictionary of philosophy of mind)” | W1
5 | Thu, Sep 15, 2016 | Markov decision problems | Sutton & Barto Chapter 3 thru Section 3.5 |
6 | Tue, Sep 20, 2016 | Returns, value functions | Rest of Sutton & Barto Chapter 3 |
7 | Thu, Sep 22, 2016 | Bellman equations | Sutton & Barto Summary of Notation; Sutton & Barto Section 4.1 | W2
8 | Tue, Sep 27, 2016 | Dynamic programming (planning) | Sutton & Barto rest of Chapter 4 |
9 | Thu, Sep 29, 2016 | Monte Carlo learning | Sutton & Barto Chapter 5 |
10 | Tue, Oct 4, 2016 | More Monte Carlo learning | Sutton & Barto Chapter 5 | W3
11 | Thu, Oct 6, 2016 | Temporal-difference learning | Sutton & Barto Chapter 6 thru Section 6.3 |
12 | Tue, Oct 11, 2016 | Temporal-difference learning | Sutton & Barto rest of Chapter 6 |
13 | Thu, Oct 13, 2016 | Multi-step bootstrapping | Sutton & Barto Chapter 7 | W4
14 | Tue, Oct 18, 2016 | Models and planning | Sutton & Barto Chapter 8 thru Section 8.3 |
15 | Thu, Oct 20, 2016 | Models and planning | Sutton & Barto rest of Chapter 8 |
16 | Tue, Oct 25, 2016 | Review | Sutton & Barto Chapters 2-8 | W5
17 | Thu, Oct 27, 2016 | Midterm exam | No new reading |
18 | Tue, Nov 1, 2016 | Function approximation; online linear supervised learning | Nilsson Sec. 2.2.1 and Nilsson Ch. 4; Sutton & Barto Chapter 9 thru 9.4 |
19 | Thu, Nov 3, 2016 | Prediction with linear approximation, tile coding | Sutton & Barto rest of Chapter 9 | P1
20 | Tue, Nov 15, 2016 | Control with approximation, average reward, off-policy problems | Sutton & Barto Chapter 10 |
21 | Thu, Nov 17, 2016 | Off-policy learning | Sutton & Barto Chapter 11 |
22 | Tue, Nov 22, 2016 | Eligibility traces | Sutton & Barto Chapter 12 | P2
23 | Thu, Nov 24, 2016 | Policy-gradient learning | Sutton & Barto Chapter 13 |
24 | Tue, Nov 29, 2016 | Biological reinforcement learning | Sutton & Barto Chapters 14 and 15 |
25 | Thu, Dec 1, 2016 | Case studies | Sutton & Barto Chapter 16 | P3
26 | Tue, Dec 6, 2016 | Monte Carlo Tree Search (Mueller) | Monte-Carlo Tree Search (MCTS-survey.pdf in dropbox) |
27 | Mon, Dec 12, 2016 | Experimentation project (done throughout the course) | | P4
Help
Probability refresher Monday Sept 5, 5pm, NRE 1-001
Homework labs with TAs, subsequent Mondays
Office hours
Course Information
Course Moodle page: some official information; discussion list!
Course Dropbox (see Moodle page for link): schedule, assignments, slides, projects
Lab is on Monday, 5-7:50; a good place to do your assignments
Textbooks
Readings will be from web sources plus the following two textbooks (both of which are available online and open-access):
Reinforcement Learning: An Introduction, by R Sutton and A Barto, MIT Press.
we will use the in-progress, online 2nd edition
printed copies available at next class — $28 exact
The Quest for AI, by N Nilsson, Cambridge, 2010 (pdf)
Evaluation
≈1 assignment per week, due at the beginning of class
5 written assignments – (5)
3 programming projects – (4) (later in the course)
Midterm – (4)
Project – (4)
Prerequisites
Some comfort or interest in thinking abstractly and with mathematics
Elementary statistics, probability theory
conditional expectations of random variables
there will be a lab session devoted to a tutorial review of basic probability
Basic linear algebra: vectors, vector equations, gradients
Basic programming skills (Python)
If Python is a problem, choose a partner who is already comfortable with Python
for next time...
Read Chapters 1 & 2 of Sutton & Barto text (online)
Policies on Integrity

Do not cheat on assignments:
Discuss only general approaches to a problem
Do not take written notes on others’ work

Respect the lab environment. Do not:
Interfere with the operation of the computing system
Interfere with others’ files
Change another’s password
Copy another’s program
etc.

Cheating is reported to the university, whereupon it is out of our hands. Possible consequences:
A mark of 0 for the assignment
A mark of 0 for the course
A permanent note on the student record
Suspension / expulsion from the university
Academic Integrity
The University of Alberta is committed to the highest standards of academic integrity and honesty. Students are expected to be familiar with these standards regarding academic honesty and to uphold the policies of the University in this respect. Students are particularly urged to familiarize themselves with the provisions of the Code of Student Behavior (online at www.ualberta.ca/secretariat/appeals.htm) and avoid any behavior which could potentially result in suspicions of cheating, plagiarism, misrepresentation of facts and/or participation in an offence. Academic dishonesty is a serious offence and can result in suspension or expulsion from the University.
AI Seminar !!!
http://www.cs.ualberta.ca/~ai/cal/
Friday noons, CSC 3-33
Neat topics, great speakers, FREE PIZZA!