Thomas G. Dietterich

Department of Computer Science

Oregon State University

Corvallis, Oregon 97331

http://www.cs.orst.edu/~tgd

Machine Learning: Making Computer Science Scientific

Acknowledgements

VLSI Wafer Testing: Tony Fountain

Robot Navigation: Didac Busquets, Carles Sierra, Ramon Lopez de Mantaras

NSF grants IIS-0083292 and ITR-085836

Outline

Three scenarios where standard software engineering methods fail

Machine learning methods applied to these scenarios

Fundamental questions in machine learning

Statistical thinking in computer science

Scenario 1: Reading Checks

Find and read “courtesy amount” on checks:

Possible Methods:

Method 1: Interview humans to find out what steps they follow in reading checks

Method 2: Collect examples of checks and the correct amounts. Train a machine learning system to recognize the amounts

Scenario 2: VLSI Wafer Testing

Wafer test: Functional test of each die (chip) while on the wafer

Which Chips (and how many) should be tested?

Tradeoff:

Test all chips on the wafer? Avoid cost of packaging bad chips; incur cost of testing all chips

Test none of the chips on the wafer? May package some bad chips; no cost of testing on wafer

Possible Methods

Method 1: Guess the right tradeoff point

Method 2: Learn a probabilistic model that captures the probability that each chip will be bad. Plug this model into a Bayesian decision-making procedure to optimize expected profit

Scenario 3: Allocating mobile robot camera

Binocular

No GPS

Camera tradeoff

Mobile robot uses camera both for obstacle avoidance and landmark-based navigation

Tradeoff:

If camera is used only for navigation, robot collides with objects

If camera is used only for obstacle avoidance, robot gets lost

Possible Methods

Method 1: Manually write a program to allocate the camera

Method 2: Experimentally learn a policy for switching between obstacle avoidance and landmark tracking

Challenges for SE Methodology

Standard SE methods fail when…

1) System requirements are hard to collect

2) The system must resolve difficult tradeoffs

(1) System requirements are hard to collect

There are no human experts: cellular telephone fraud

Human experts are inarticulate: handwriting recognition

The requirements are changing rapidly: computer intrusion detection

Each user has different requirements: e-mail filtering

(2) The system must resolve difficult tradeoffs

VLSI Wafer Testing: tradeoff point depends on probability of bad chips, relative costs of testing versus packaging

Camera Allocation for Mobile Robot: tradeoff depends on probability of obstacles, number and quality of landmarks

Machine Learning: Replacing guesswork with data

In all of these cases, the standard SE methodology requires engineers to make guesses:

Guessing how to do character recognition

Guessing the tradeoff point for wafer test

Guessing the tradeoff for camera allocation

Machine Learning provides a way of making these decisions based on data

Basic Machine Learning Methods

Supervised Learning

Density Estimation

Reinforcement Learning

Supervised Learning

[Figure: training examples (digit images labeled 8, 3, 6, 0, 1) are given to a learning algorithm, which produces a classifier; the classifier labels a new example as 8]

AT&T/NCR Check Reading System

Recognition transformer is a neural network trained on 500,000 examples of characters

The entire system is trained given entire checks as input and dollar amounts as output

LeCun, Bottou, Bengio & Haffner (1998) Gradient-Based Learning Applied to Document Recognition

Check Reader Performance

82% of machine-printed checks correctly recognized

1% of checks incorrectly recognized

17% “rejected”: check is presented to a person for manual reading

Fielded by NCR in June 1996; reads millions of checks per month

Supervised Learning Summary

Desired classifier is a function y = f(x)

Training examples are desired input-output pairs (x_i, y_i)
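The mapping from training pairs to a classifier can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a 1-nearest-neighbor rule as a stand-in learning algorithm (not the neural network of the check reader), and the function name and the toy feature vectors are hypothetical.

```python
from math import dist


def train_nearest_neighbor(examples):
    """Build a classifier y = f(x) from input-output pairs (x_i, y_i).

    Assumption: 1-nearest-neighbor is used purely for illustration.
    """
    stored = list(examples)

    def f(x):
        # Predict the label of the stored example closest to x.
        _, label = min(stored, key=lambda pair: dist(pair[0], x))
        return label

    return f


# Hypothetical two-feature descriptions of digit images.
training_examples = [
    ((1.0, 1.0), "8"),
    ((0.0, 1.0), "3"),
    ((1.0, 0.0), "6"),
    ((0.0, 0.0), "0"),
]
classifier = train_nearest_neighbor(training_examples)
prediction = classifier((0.9, 0.8))
```

The learning algorithm here just memorizes the pairs; the point is only the interface: pairs go in, a function f comes out, and f is then applied to new examples.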

Density Estimation

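[Figure: training examples are given to a learning algorithm, which produces a density estimator; given a partially-tested wafer, the density estimator outputs, for example, P(chip_i is bad) = 0.42]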

On-Wafer Testing System

Trained density estimator on 600 wafers from mature product (HP; Corvallis, OR)

Probability model is “naïve Bayes” mixture model with four components (trained with EM)

[Figure: naïve Bayes network in which node W is the parent of chip nodes C1, C2, C3, . . ., C209]
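A minimal sketch of how EM could fit such a mixture, assuming each chip outcome is a 0/1 value (1 = bad) that is independent of the others given the mixture component W. The function name, the smoothing constants, and the random initialization are assumptions made for this illustration, not details of the fielded system.

```python
import random
from math import exp, log


def em_bernoulli_mixture(wafers, n_components, n_iters=50, seed=0):
    """Fit a naive Bayes mixture over 0/1 chip outcomes with EM.

    wafers: list of equal-length 0/1 lists (1 = chip is bad).
    Returns (weights, bad_rates), bad_rates[k][j] = P(C_j bad | W = k).
    """
    rng = random.Random(seed)
    n_chips = len(wafers[0])
    weights = [1.0 / n_components] * n_components
    # A random start breaks the symmetry between components.
    bad_rates = [[rng.uniform(0.25, 0.75) for _ in range(n_chips)]
                 for _ in range(n_components)]
    for _ in range(n_iters):
        # E-step: responsibility of each component for each wafer,
        # computed in log space to avoid underflow with many chips.
        resp = []
        for wafer in wafers:
            log_joint = [
                log(w) + sum(log(r if c else 1.0 - r)
                             for c, r in zip(wafer, rates))
                for w, rates in zip(weights, bad_rates)
            ]
            top = max(log_joint)
            unnorm = [exp(v - top) for v in log_joint]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate parameters from soft counts. The added
        # constants are smoothing that keeps every value inside (0, 1).
        for k in range(n_components):
            mass = sum(r[k] for r in resp)
            weights[k] = (mass + 1.0) / (len(wafers) + n_components)
            for j in range(n_chips):
                bad = sum(r[k] * wafer[j] for r, wafer in zip(resp, wafers))
                bad_rates[k][j] = (bad + 1.0) / (mass + 2.0)
    return weights, bad_rates
```

With four components and one list per training wafer, a call would look like `em_bernoulli_mixture(wafers, 4)`. EM only finds a local optimum, so in practice several random restarts are usually compared.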

One-Step Value of Information

Choose the larger of:

Expected profit if we predict remaining chips, package, and re-test

Expected profit if we test chip Ci, then predict remaining chips, package, and re-test [for all Ci not yet tested]
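The comparison above can be sketched as follows. This is a simplified illustration, not the fielded system: it assumes a naive Bayes mixture with known parameters whose bad rates lie strictly between 0 and 1, a fixed revenue per good packaged chip, a fixed packaging cost (taken to include the final re-test), and a fixed on-wafer test cost. All names and numbers are hypothetical.

```python
def chip_bad_probs(weights, bad_rates, observed):
    """P(chip j is bad | on-wafer test results so far), for every j.

    observed maps a tested chip index to True (bad) or False (good).
    """
    joint = []
    for w, rates in zip(weights, bad_rates):
        like = w
        for j, is_bad in observed.items():
            like *= rates[j] if is_bad else 1.0 - rates[j]
        joint.append(like)
    total = sum(joint)
    n_chips = len(bad_rates[0])
    return [sum(l / total * rates[j] for l, rates in zip(joint, bad_rates))
            for j in range(n_chips)]


def profit_if_stop(weights, bad_rates, observed, revenue, package_cost):
    """Expected profit if we stop testing and package by prediction."""
    p_bad = chip_bad_probs(weights, bad_rates, observed)
    profit = 0.0
    for j, p in enumerate(p_bad):
        if j in observed:
            p = 1.0 if observed[j] else 0.0  # tested chips are known
        # Package a chip only when its expected profit is positive.
        profit += max(0.0, (1.0 - p) * revenue - package_cost)
    return profit


def one_step_voi(weights, bad_rates, observed, revenue, package_cost,
                 test_cost):
    """Return (action, expected profit); action is "stop" or a chip index."""
    best = ("stop", profit_if_stop(weights, bad_rates, observed,
                                   revenue, package_cost))
    p_bad = chip_bad_probs(weights, bad_rates, observed)
    for i, p in enumerate(p_bad):
        if i in observed:
            continue
        value = -test_cost
        # Average over both possible outcomes of testing chip i.
        for outcome, prob in ((True, p), (False, 1.0 - p)):
            if prob > 0.0:
                value += prob * profit_if_stop(
                    weights, bad_rates, {**observed, i: outcome},
                    revenue, package_cost)
        if value > best[1]:
            best = (i, value)
    return best


# Hypothetical two-component model over four chips.
weights = [0.5, 0.5]
bad_rates = [[0.05] * 4, [0.9] * 4]
action, value = one_step_voi(weights, bad_rates, {}, revenue=10.0,
                             package_cost=4.0, test_cost=0.1)
```

Testing one chip is worthwhile here because its result shifts the belief about which component generated the wafer, and therefore the packaging decision for every other chip. Repeating the call after each test result gives a greedy, one-step-lookahead testing procedure.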

On-Wafer Chip Test Results

[Figure: bar chart of profit ($K), vertical axis from $1,160 to $1,230, comparing “Test all” with “VOI testing”]

3.8% increase in profit

Density Estimation Summary

Desired output is a joint probability distribution P(C1, C2, …, C203)

Training examples are points X= (C1, C2, …, C203) sampled from this distribution

Reinforcement Learning

[Figure: agent-environment loop; the agent sends action a to the environment, and the environment returns state s and reward r]

Agent’s goal: Choose actions to maximize total reward

Action Selection Rule is called a “policy”: a = π(s)

Reinforcement Learning for Robot Navigation

Learning from rewards and punishments in the environment:

Give reward for reaching goal

Give punishment for getting lost

Give punishment for collisions

Experimental Results: % of trials in which robot reaches goal

Busquets, Lopez de Mantaras, Sierra, Dietterich (2002)

Reinforcement Learning Summary

Desired output is an action selection policy

Training examples are <s,a,r,s’> tuples collected by the agent interacting with the environment
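One common way to turn such tuples into a policy is tabular Q-learning. The sketch below is an illustration under assumptions: Q-learning is chosen here as an example (the slides do not say which algorithm the robot study used), and the states, actions, and reward values are hypothetical. A next state of `None` marks the end of an episode.

```python
from collections import defaultdict


def q_learning(transitions, actions, alpha=0.5, gamma=0.9, sweeps=200):
    """Estimate action values Q(s, a) from <s, a, r, s'> tuples."""
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next in transitions:
            # Value of the best action available in the next state.
            best_next = 0.0 if s_next is None else max(
                q[(s_next, b)] for b in actions)
            # Move Q(s, a) toward the one-step bootstrapped target.
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
    return q


def greedy_policy(q, actions):
    """Policy a = pi(s): pick the action with the highest estimated value."""
    def pi(s):
        return max(actions, key=lambda a: q[(s, a)])
    return pi


# Hypothetical camera-allocation experience.
actions = ["track_landmark", "watch_obstacles"]
transitions = [
    ("clear", "track_landmark", 1.0, None),
    ("clear", "watch_obstacles", 0.0, None),
    ("obstacle_near", "track_landmark", -10.0, None),
    ("obstacle_near", "watch_obstacles", 0.0, None),
]
q = q_learning(transitions, actions)
pi = greedy_policy(q, actions)
```

In this toy data the learned policy tracks landmarks when the path is clear and watches for obstacles when one is near, which mirrors the camera tradeoff described earlier.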

Fundamental Issues in Machine Learning

Incorporating Prior Knowledge

Incorporating Learned Structures into Larger Systems

Making Reinforcement Learning Practical

Triple Tradeoff: accuracy, sample size, hypothesis complexity

Incorporating Prior Knowledge

How can we incorporate our prior knowledge into the learning algorithm?

Difficult for decision trees, neural networks, support-vector machines, etc.: mismatch between form of our knowledge and the way the algorithms work

Easier for Bayesian networks: express knowledge as constraints on the network

Incorporating Learned Structures into Larger Systems

Success story: Digit recognizer incorporated into check reader

Challenges:

Larger system may make several coordinated decisions, but learning system treated each decision as independent

Larger system may have complex cost function: errors in thousands place versus the cents place: $7,236.07

Making Reinforcement Learning Practical

Current reinforcement learning methods do not scale well to large problems

Need robust reinforcement learning methodologies

The Triple Tradeoff

Fundamental relationship between:

amount of training data

size and complexity of hypothesis space

accuracy of the learned hypothesis

Explains many phenomena observed in machine learning systems

Learning Algorithms

Set of data points

Class H of hypotheses

Optimization problem: Find the hypothesis h in H that best fits the data

[Figure: training data is used to select a hypothesis h from the hypothesis space]

Triple Tradeoff

Amount of Data – Hypothesis Complexity – Accuracy

[Figure: accuracy versus hypothesis space complexity, with one curve each for N = 10, N = 100, and N = 1000]

Triple Tradeoff (2)

[Figure: accuracy versus number of training examples N, with one curve each for hypothesis spaces H1, H2, and H3 of differing hypothesis complexity]

Intuition

With only a small amount of data, we can only discriminate between a small number of different hypotheses

As we get more data, we have more evidence, so we can consider more alternative hypotheses

Complex hypotheses give better fit to the data

Fixed versus Variable-Sized Hypothesis Spaces

Fixed size: ordinary linear regression, Bayes net with fixed structure, neural networks

Variable size: decision trees, Bayes nets with variable structure, support vector machines

Corollary 1: Fixed H will underfit

[Figure: accuracy versus number of training examples N for hypothesis spaces H1 and H2, with an underfit region marked]

Corollary 2: Variable-sized H will overfit

[Figure: accuracy curve for N = 100, with an overfit region marked]

Ideal Learning Algorithm: Adapt complexity to data

[Figure: accuracy curves for N = 10, N = 100, and N = 1000]

Adapting Hypothesis Complexity to Data Complexity

Find hypothesis h to minimize error(h) + λ complexity(h)

Many methods for adjusting λ: cross-validation, MDL
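A small sketch of this idea, assuming a piecewise-constant regressor on inputs in [0, 1] whose complexity is its number of bins, and a weight `lam` on the complexity term chosen by k-fold cross-validation. The model family, the function names, and the synthetic data are all assumptions made for illustration.

```python
import random
from statistics import mean


def fit_bins(xs, ys, k):
    """Fit a piecewise-constant hypothesis with k equal-width bins."""
    overall = mean(ys)
    sums, counts = [0.0] * k, [0] * k
    for x, y in zip(xs, ys):
        b = min(int(x * k), k - 1)
        sums[b] += y
        counts[b] += 1
    # Empty bins fall back to the overall mean.
    return [sums[b] / counts[b] if counts[b] else overall for b in range(k)]


def mse(levels, xs, ys):
    """Mean squared error of a fitted hypothesis on (xs, ys)."""
    k = len(levels)
    return mean((y - levels[min(int(x * k), k - 1)]) ** 2
                for x, y in zip(xs, ys))


def select_complexity(xs, ys, lam, max_bins=32):
    """Return the bin count k minimizing error(h) + lam * complexity(h)."""
    return min(range(1, max_bins + 1),
               key=lambda k: mse(fit_bins(xs, ys, k), xs, ys) + lam * k)


def choose_lambda(xs, ys, lambdas, folds=5):
    """Pick lam by k-fold cross-validation on held-out error."""
    n = len(xs)

    def cv_error(lam):
        errors = []
        for f in range(folds):
            train = [i for i in range(n) if i % folds != f]
            held = [i for i in range(n) if i % folds == f]
            tx, ty = [xs[i] for i in train], [ys[i] for i in train]
            levels = fit_bins(tx, ty, select_complexity(tx, ty, lam))
            errors.append(mse(levels, [xs[i] for i in held],
                              [ys[i] for i in held]))
        return mean(errors)

    return min(lambdas, key=cv_error)


# Synthetic data: a noisy step function.
rng = random.Random(0)
xs = [rng.random() for _ in range(60)]
ys = [(1.0 if x > 0.5 else 0.0) + rng.gauss(0.0, 0.1) for x in xs]
candidates = [0.0, 0.001, 0.01, 0.1]
best_lam = choose_lambda(xs, ys, candidates)
```

A very large `lam` forces the simplest hypothesis (one bin), while `lam = 0` lets training error alone decide and tends to pick many bins; cross-validation chooses between these extremes using data the fit did not see.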

The Data Explosion

NASA Data: 284 Terabytes (as of August, 1999)

Earth Observing System: 194 G/day

Landsat 7: 150 G/day

Hubble Space Telescope: 0.6 G/day

http://spsosun.gsfc.nasa.gov/eosinfo/EOSDIS_Site/index.html

The Data Explosion (2)

Google indexes 2,073,418,204 web pages

US Year 2000 Census: 62 Terabytes of scanned images

Walmart Data Warehouse: 7 (500?) Terabytes

Missouri Botanical Garden TROPICOS plant image database: 700 Gbytes

Old Computer Science Conception of Data

[Diagram: Store → Retrieve]

New Computer Science Conception of Data

[Diagram: Store → Build Models → Solve Problems; problems go in, solutions come out]

Machine Learning: Making Data Active

Methods for building models from data

Methods for collecting and/or sampling data

Methods for evaluating and validating learned models

Methods for reasoning and decision-making with learned models

Theoretical analyses

Machine Learning and Computer Science

Natural language processing

Databases and data mining

Computer architecture

Compilers

Computer graphics

Hardware Branch Prediction

Source: Jiménez & Lin (2000) Perceptron Learning for Predicting the Behavior of Conditional Branches

Instruction Scheduler for New CPU

The performance of modern microprocessors depends on the order in which instructions are executed

Modern compilers rearrange instruction order to optimize performance (“instruction scheduling”)

Each new CPU design requires modifying the instruction scheduler

Instruction Scheduling

Moss et al. (1997): A machine learning scheduler can beat the performance of commercial compilers and match the performance of a research compiler.

Training examples: small basic blocks

Experimentally determine optimal instruction order

Learn preference function

Computer Graphics: Video Textures

Generate new video by splicing together short stretches of old video

A B C D E F

B D E D E F A

Apply reinforcement learning to identify good transition points

Arno Schödl, Richard Szeliski, David H. Salesin, Irfan Essa (SIGGRAPH 2000)

Video Textures

Arno Schödl, Richard Szeliski, David H. Salesin, Irfan Essa (SIGGRAPH 2000)

You can find this video at Virtual Fish Tank Movie

http://www.gvu.gatech.edu/perception/projects/videotexture/SIGGRAPH2000/vtfishtk.mpg

Graphics: Image Analogies

[Figure: A : A’ :: B : ?]

Hertzmann, Jacobs, Oliver, Curless, Salesin (2000) SIGGRAPH

Learning to Predict Textures

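[Figure: pixel p in the image pair A(p), A’(p); pixel q in the image pair B(q), B’(q)]

Find p to minimize Euclidean distance between the neighborhood of p in A, A’ and the neighborhood of q in B, B’

B’(q) := A’(p)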
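The matching step can be sketched on one-dimensional signals. This is a deliberately simplified illustration, not the published algorithm: it compares neighborhoods in the unfiltered signals A and B only, uses brute-force search, and clamps windows at the edges. The function name and the `radius` parameter are hypothetical.

```python
def image_analogy_1d(a, a_filtered, b, radius=1):
    """For each position q in b, copy a_filtered[p] from the best match p.

    The best match p minimizes the squared Euclidean distance between
    the window of a around p and the window of b around q, which has
    the same minimizer as the Euclidean distance itself.
    """
    def window(signal, center):
        last = len(signal) - 1
        # Clamp indices so that edge positions still get a full window.
        return [signal[min(max(center + k, 0), last)]
                for k in range(-radius, radius + 1)]

    b_filtered = []
    for q in range(len(b)):
        target = window(b, q)
        best_p = min(
            range(len(a)),
            key=lambda p: sum((x - y) ** 2
                              for x, y in zip(window(a, p), target)))
        b_filtered.append(a_filtered[best_p])  # B'(q) := A'(p)
    return b_filtered


# Toy analogy: a_filtered is a scaled by 10, and b is a reversed.
a = [0, 1, 2, 3, 4]
a_filtered = [0, 10, 20, 30, 40]
b = [4, 3, 2, 1, 0]
b_filtered = image_analogy_1d(a, a_filtered, b, radius=0)
```

With `radius=0` each position is matched on its own value, so the result simply looks up the filtered value for each entry of b; larger radii make the match depend on local context, which is what lets texture-like structure carry over.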

Image Analogies

[Figure: A : A’ :: B : B’]

A video can be found at Image Analogies Movie

Summary

Standard Software Engineering methods fail in many application problems

Machine Learning methods can replace guesswork with data to make good design decisions

Machine Learning and Computer Science

Machine Learning is already at the heart of speech recognition and handwriting recognition

Statistical methods are transforming natural language processing (understanding, translation, retrieval)

Statistical methods are creating opportunities in databases, computer graphics, robotics, computer vision, networking, and computer security

Computer Power and Data Power

Data is a new source of power for computer science

Every computer science student should learn the fundamentals of machine learning and statistical thinking

By combining engineered frameworks with models learned from data, we can develop the high-performance systems of the future
