40
Machine Learning, Decision Trees Overfitting Decision Trees, Overfitting Reading: Mitchell, Chapter 3 Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Machine Learning Department Carnegie Mellon University January 14, 2008

Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Machine Learning,Decision Trees OverfittingDecision Trees, Overfitting

Reading: Mitchell, Chapter 3

Machine Learning 10-601

Tom M. MitchellMachine Learning DepartmentMachine Learning Department

Carnegie Mellon University

January 14, 2008

Page 2: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Machine Learning 10-601Instructors• William Cohen

See webpage for • Office hours

• Tom Mitchell

TA’s

• Grading policy• Final exam date• Late homeworkTA s

• Andrew Arnold• Mary McGlohon

• Late homework policy

• Syllabus details•

Course assistant• Sharon Cavlovich

• ...

webpage: www cs cmu edu/~tom/10601webpage: www.cs.cmu.edu/~tom/10601

Page 3: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Machine Learning:Machine Learning:

Study of algorithms thaty g• improve their performance P• at some task T• at some task T• with experience E

well defined learning task: <P T E>well-defined learning task: <P,T,E>

Page 4: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Learning to Predict Emergency C-SectionsLearning to Predict Emergency C-Sections[Sims et al., 2000]

9714 patient records, each with 215 features

Page 5: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Learning to detect objects in images

(Prof. H. Schneiderman)

Example training images for each orientation

Page 6: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Learning to classify text documents

Company home page

vsvs

Personal home page

vs

University home page

vs

Page 7: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Reading a noun (vs verb)

[Rustandi et al., 2005]

Page 8: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Machine Learning - Practice

Speech Recognition

Object recognitionMining DatabasesMining Databases

Control learning

• Supervised learning

• Bayesian networksControl learning

• Hidden Markov models

• Unsupervised clustering

Text analysis

• Reinforcement learning

• ....

Page 9: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Machine Learning - Theory

PAC Learning Theory

Other theories for

• Reinforcement skill learning

# examples (m)

• Semi-supervised learning

• Active student querying

(supervised concept learning)

p ( )

representational complexity (H)

error rate ( )

• …

error rate (ε)failure probability (δ)

… also relating:

• # of mistakes during learning

• learner’s query strategyprobability (δ) • learner s query strategy

• convergence rate

• asymptotic performance

• bias, variance

Page 10: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Growth of Machine Learning• Machine learning already the preferred approach to

Speech recognition Natural language processing– Speech recognition, Natural language processing– Computer vision– Medical outcomes analysis– Robot control– …

All software apps

ML apps.

• This ML niche is growing– Improved machine learning algorithms

All software apps.

Improved machine learning algorithms – Increased data capture, networking– Software too complex to write by hand– New sensors / IO devices– Demand for self-customization to user, environment

Page 11: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Function Approximation and Decision tree learning

Page 12: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Function approximationSetting:• Set of possible instances XSet of possible instances X• Unknown target function f: X Y• Set of function hypotheses H={ h | h: X Y }yp { | }

Given:• Training examples {<xi,yi>} of unknown target

function f

Determine:H th i h H th t b t i t f• Hypothesis h∈ H that best approximates f

Page 13: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

How would you yrepresent

AB ∨ CD(¬E)?

Each internal node: test one attribute Xi

Each branch from a node: selects one value for Xi

Each leaf node: predict Y (or P(Y|X ∈ leaf))

Page 14: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision
Page 15: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

node = Root

[ID3, C4.5, …]

node = Root

Page 16: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

EntropyEntropy H(X) of a random variable XEntropy H(X) of a random variable X

H(X) is the expected number of bits needed to encode a d l d l f ( d t ffi i t d )randomly drawn value of X (under most efficient code)

Why? Information theory:Why? Information theory:• Most efficient code assigns -log2P(X=i) bits to encode

the message X=i• So, expected number of bits to code one random X is:

# of possible l f Xvalues for X

Page 17: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

EntropyEntropy H(X) of a random variable XEntropy H(X) of a random variable X

Specific conditional entropy H(X|Y=v) of X given Y=v :

Conditional entropy H(X|Y) of X given Y :

Mututal information (aka information gain) of X and Y :

Page 18: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Sample Entropy

Page 19: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Subset of Sfor which A=v

Gain(S,A) = mutual information between A and target class variable over sample S

Page 20: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision
Page 21: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision
Page 22: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision
Page 23: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Decision Tree Learning Applet

http // cs alberta ca/%7Eai plore/l• http://www.cs.ualberta.ca/%7Eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet htmlreeApplet.html

Page 24: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Which Tree Should We Output?• ID3 performs heuristic

search through spacesearch through space of decision trees

• It stops at smallest pacceptable tree. Why?

Occam’s razor: prefer the l h h h simplest hypothesis that

fits the data

Page 25: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Why Prefer Short Hypotheses? (Occam’s Razor)

Argument in favor:• Fewer short hypotheses than long ones

a short hypothesis that fits the data is less likely to be a statistical coincidencehighly probable that a sufficiently complex hypothesishighly probable that a sufficiently complex hypothesis will fit the data

Argument opposed:• Also fewer hypotheses with prime number of nodes

and attributes beginning with “Z”• What’s so special about “short” hypotheses?

Page 26: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision
Page 27: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision
Page 28: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision
Page 29: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision
Page 30: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Split data into training and validation set

Create tree that classifies training set correctly

Page 31: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision
Page 32: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision
Page 33: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision
Page 34: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision
Page 35: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision
Page 36: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision
Page 37: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

What you should know:• Well posed function approximation problems:

– Instance space, X– Sample of labeled training data { <xi, yi>}Sample of labeled training data { xi, yi }– Hypothesis space, H = { f: X Y }

• Learning is a search/optimization problem over H• Learning is a search/optimization problem over H– Various objective functions

• minimize training error (0-1 loss) • among hypotheses that minimize training error select shortest• among hypotheses that minimize training error, select shortest

• Decision tree learningG d t d l i f d i i t (ID3 C4 5 )– Greedy top-down learning of decision trees (ID3, C4.5, ...)

– Overfitting and tree/rule post-pruning– Extensions…

Page 38: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Questions to think about (1)

• Why use Information Gain to select attributes in decision trees? What otherattributes in decision trees? What other criteria seem reasonable, and what are the tradeoffs in making this choice?the tradeoffs in making this choice?

Page 39: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Questions to think about (2)

• ID3 and C4.5 are heuristic algorithms that search through the space ofthat search through the space of decision trees. Why not just do an exhaustive search?exhaustive search?

Page 40: Machine Learning, Decision Trees OverfittingDecision Trees ...tom/10601_sp08/slides/DTreesAndOverfitting-1-1… · 14/01/2008  · Machine Learning, Decision Trees OverfittingDecision

Questions to think about (3)

• Consider target function f: <x1,x2> y, where x1 and x2 are real-valued y iswhere x1 and x2 are real-valued, y is boolean. What is the set of decision surfaces describable with decision treessurfaces describable with decision trees that use each attribute at most once?