INTRODUCTION TO ARTIFICIAL INTELLIGENCE
Massimo Poesio
Machine Learning: Decision Trees
A DEFINITION OF LEARNING: LEARNING AS IMPROVEMENT
Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the task or tasks drawn from the same population more efficiently and more effectively the next time. -- Herb Simon
LEARNING AS IMPROVEMENT, 2
• Improve on task, T, with respect to performance metric, P, based on experience, E.

T: Assign to words their senses.
P: Percentage of words correctly classified.
E: Corpus of words, some with human-given labels.

T: Categorize email messages as spam or legitimate.
P: Percentage of email messages correctly classified.
E: Database of emails, some with human-given labels.

T: Playing checkers.
P: Percentage of games won against an arbitrary opponent.
E: Playing practice games against itself.

T: Recognizing hand-written words.
P: Percentage of words correctly classified.
E: Database of human-labeled images of handwritten words.
SPECIFYING A LEARNING SYSTEM
• Choose the training experience
• Choose exactly what is to be learned, i.e. the target function
• Choose how to represent the target function
• Choose a learning algorithm to infer the target function from the experience
[Diagram: the components of a learning system: Environment/Experience, Learner, Knowledge, Performance Element]
FEATURES
• The functions learned by ML algorithms specify a mapping from input FEATURES to an output value (e.g., a category)
EXAMPLE 1: CHECKERS
• Features used in the linear function seen last time:
– bp(b): number of black pieces on board b
– rp(b): number of red pieces on board b
– bk(b): number of black kings on board b
– rk(b): number of red kings on board b
– bt(b): number of black pieces threatened (i.e. which can be immediately taken by red on its next turn)
– rt(b): number of red pieces threatened
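As an illustration, a minimal sketch of such a linear evaluation function over these six features, in Python; the weights shown are made up, since in the learning setting they would be estimated from experience:

# Sketch of a linear evaluation function over the checkers board features.
# The weights w are hypothetical; a learner would estimate them from
# experience (e.g. self-play), not fix them by hand.
def evaluate_board(features, w):
    """features = (bp, rp, bk, rk, bt, rt); w = (w0, w1, ..., w6)."""
    bp, rp, bk, rk, bt, rt = features
    return (w[0]
            + w[1] * bp + w[2] * rp
            + w[3] * bk + w[4] * rk
            + w[5] * bt + w[6] * rt)

# Example: 6 black pieces, 5 red pieces, 1 black king, no red kings,
# 2 black pieces threatened, 1 red piece threatened.
print(evaluate_board((6, 5, 1, 0, 2, 1), (0.0, 1.0, -1.0, 3.0, -3.0, -0.5, 0.5)))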
EXAMPLE 2: WHEN THE NEIGHBOR DRIVES
• Suppose we are trying to learn when our neighbor goes to work by car, so we can ask for a ride
• Their decision appears to be influenced by:
– Temperature
– Whether it’s going to rain or not
– Day of the week
– Whether they need to stop at a shop on the way back
– How they are dressed
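For concreteness, one such observation can be written as a feature vector; the field names and values below are illustrative, not taken from the slides:

# One past observation of the neighbor's behavior, as a feature vector.
# Field names and values are illustrative.
observation = {
    "day": "Tue",           # day of the week
    "temperature": "hot",
    "precipitation": "no",  # rain / snow / no
    "shop_stop": "no",      # do they need to stop at a shop on the way back?
    "clothing": "casual",   # how they are dressed
}
label = "Drive"             # the outcome to predict: Drive vs. Walk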
PAST EXPERIENCE IN TERMS OF FEATURES
PREDICTION TASK
PREDICTION
THE NEED FOR AVERAGING
THE NEED FOR GENERALIZATION
FIRST EXAMPLE OF ML ALGORITHM: DECISION TREES
• A method developed independently by Quinlan in AI and by Breiman et al. in statistics
DECISION TREES
• Tree-based classifiers for instances represented as feature vectors. Nodes test features, each branch corresponds to one value of the feature, and leaves specify the category.
• Can represent arbitrary conjunctions and disjunctions. Can represent any classification function over discrete feature vectors.
Example: classifying objects as positive/negative from their color and shape:

color
  red -> shape
           circle -> pos
           square -> neg
           triangle -> neg
  blue -> neg
  green -> pos

The same tree structure can assign arbitrary category labels (A, B, C) at the leaves:

color
  red -> shape
           circle -> A
           square -> B
           triangle -> C
  blue -> B
  green -> C
A DECISION TREE FOR THE DRIVING PROBLEM
LEARNING DECISION TREES FROM DATA
• Use the data in the training set to build a decision tree that will then be used to make decisions with unseen data
• The decision tree specifies a function
TRAVERSING THE DECISION TREE
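A minimal sketch of traversal in Python, assuming the tree is stored as nested dictionaries (an internal node holds a feature name and a value-to-subtree table, a leaf is just a class label); this representation is my own, chosen for illustration:

# Traverse a decision tree stored as nested dicts.
# Internal node: {"feature": name, "branches": {value: subtree, ...}}
# Leaf: a plain class label (string).
def classify(tree, instance):
    # Walk down the tree, following the branch that matches the instance's
    # value for the feature tested at the current node, until a leaf is hit.
    while isinstance(tree, dict):
        value = instance[tree["feature"]]
        tree = tree["branches"][value]
    return tree

keys_tree = {"feature": "keys", "branches": {"keys": "Drive", "no-keys": "Walk"}}
print(classify(keys_tree, {"keys": "keys"}))      # -> Drive
print(classify(keys_tree, {"keys": "no-keys"}))   # -> Walk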
Decision Tree Learning
• Discrete class values
– Slight changes in the input: either no or full effect on the classification
• Discrete feature values (or discretized)
• Fast
• Modern DT induction algorithms:
– Handling noisy feature values
– Handling noisy labels
– Handling missing feature values
Top-down DT induction
• Partition training examples into good “splits”, based on values of a single “good” feature:

(1) Sat, hot, no, casual, keys -> +
(2) Mon, cold, snow, casual, no-keys -> -
(3) Tue, hot, no, casual, no-keys -> -
(4) Tue, cold, rain, casual, no-keys -> -
(5) Wed, hot, rain, casual, keys -> +
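The same five examples can be written as Python records, together with a helper that partitions them by the value of a single feature (a sketch; the field names are mine):

# The five training examples above, as (features, label) pairs.
FIELDS = ("day", "temp", "precip", "clothing", "keys")
DATA = [
    (("Sat", "hot",  "no",   "casual", "keys"),    "+"),
    (("Mon", "cold", "snow", "casual", "no-keys"), "-"),
    (("Tue", "hot",  "no",   "casual", "no-keys"), "-"),
    (("Tue", "cold", "rain", "casual", "no-keys"), "-"),
    (("Wed", "hot",  "rain", "casual", "keys"),    "+"),
]

def split_on(examples, feature):
    """Group (features, label) pairs by the value of one feature."""
    i = FIELDS.index(feature)
    groups = {}
    for features, label in examples:
        groups.setdefault(features[i], []).append((features, label))
    return groups

# Splitting on "keys" separates the two classes perfectly.
for value, group in split_on(DATA, "keys").items():
    print(value, [label for _, label in group])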
Top-down DT induction
keys?
  yes -> Drive: 1, 5
  no  -> Walk: 2, 3, 4
Top-down DT induction
• Partition training examples into good “splits”, based on values of a single “good” feature
(1) Sat, hot, no, casual -> +
(2) Mon, cold, snow, casual -> -
(3) Tue, hot, no, casual -> -
(4) Tue, cold, rain, casual -> -
(5) Wed, hot, rain, casual -> +
• No acceptable classification: proceed recursively
Top-down DT induction
t?
  cold -> Walk: 2, 4
  hot  -> Drive: 1, 5; Walk: 3
Top-down DT induction
t?
  cold -> Walk: 2, 4
  hot  -> day?
            Sat -> Drive: 1
            Tue -> Walk: 3
            Wed -> Drive: 5
Top-down DT induction
t?
  cold -> Walk: 2, 4
  hot  -> day?
            Sat -> Drive: 1
            Tue -> Walk: 3
            Wed -> Drive: 5
            Mo, Thu, Fr, Su -> Drive?
Top-down DT induction: divide and conquer algorithm
• Pick a feature
• Split your examples into subsets based on the values of the feature
• For each subset, examine the examples:
– Zero: assign the most popular class of the parent
– All from the same class: assign this class
– Otherwise, process recursively
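A compact sketch of this divide-and-conquer procedure, reusing DATA and split_on from the earlier sketch; for simplicity it splits on the first remaining feature rather than on the most informative one (feature selection is discussed below):

from collections import Counter

def most_common_label(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]

def build_tree(examples, features, parent_examples=None):
    # Zero examples: fall back to the most popular class of the parent.
    if not examples:
        return most_common_label(parent_examples)
    # All examples share one class (or no features left): make a leaf.
    if len({label for _, label in examples}) == 1 or not features:
        return most_common_label(examples)
    # Otherwise split on a feature and recurse on each value's subset.
    feature = features[0]          # a real learner picks the best feature
    remaining = [f for f in features if f != feature]
    branches = {value: build_tree(subset, remaining, examples)
                for value, subset in split_on(examples, feature).items()}
    return {"feature": feature, "branches": branches}

print(build_tree(DATA, list(FIELDS)))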
Top-Down DT induction
Different trees can be built for the same data, depending on the order of features:
t?
  cold -> Walk: 2, 4
  hot  -> day?
            Sat -> Drive: 1
            Tue -> Walk: 3
            Wed -> Drive: 5
            Mo, Thu, Fr, Su -> Drive?
Top-down DT induction
Different trees can be built for the same data, depending on the order of features:
t?
  cold -> Walk: 2, 4
  hot  -> day?
            Sat -> Drive: 1
            Tue -> Walk: 3
            Wed -> Drive: 5
            Mo  -> Drive?
            other days -> clothing?
                            casual -> Walk?
                            halloween -> ?
Selecting features
• Intuitively
– We want more “informative” features to be higher in the tree:
• Is it Monday? Is it raining? Good political news? No halloween clothes? Hat on? Coat on? Car keys? Yes?? -> Driving! (this does not look like a good learned rule)
– We want a nice, compact tree
Selecting features, 2
• Formally
– Define “tree size” (number of nodes, leaves, depth, ...)
– Try all the possible trees, find the smallest one
– NP-hard
• Top-down DT induction: greedy search, depends on heuristics for feature ordering (=> no optimality guarantee)
– Information gain
Entropy
Information theory: entropy – number of bits needed to encode some information.
S – set of N examples: p*N positive (“Walk”) and q*N negative (“Drive”)
Entropy(S) = -p*lg(p) - q*lg(q)

p = 1, q = 0     => Entropy(S) = 0
p = 1/2, q = 1/2 => Entropy(S) = 1
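The same quantity in code (lg is the base-2 logarithm), reproducing the values above:

from math import log2

def entropy(p, q):
    """Entropy of a two-class set with class proportions p and q (p + q = 1)."""
    return sum(-x * log2(x) for x in (p, q) if x > 0)   # 0*lg(0) is taken as 0

print(entropy(1.0, 0.0))   # 0.0   -> a pure set needs no bits
print(entropy(0.5, 0.5))   # 1.0   -> a 50/50 split needs one full bit
print(entropy(0.6, 0.4))   # ~0.97 -> the E(S) value used on the next slides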
Entropy and Decision Trees
keys?
  yes -> Drive: 1, 5
  no  -> Walk: 2, 3, 4

E(S) = -0.6*lg(0.6) - 0.4*lg(0.4) = 0.97
E(S_yes) = 0   E(S_no) = 0
Entropy and Decision Trees
t?
  cold -> Walk: 2, 4
  hot  -> Drive: 1, 5; Walk: 3

E(S) = -0.6*lg(0.6) - 0.4*lg(0.4) = 0.97
E(S_cold) = 0   E(S_hot) = -0.33*lg(0.33) - 0.67*lg(0.67) = 0.92
Information gain
• For each feature f, compute the reduction in entropy on the split:
Gain(S, f) = E(S) - ∑ (|S_i|/|S|) * Entropy(S_i)

f = keys?     : Gain(S, f) = 0.97
f = t?        : Gain(S, f) = 0.97 - 0*2/5 - 0.92*3/5 = 0.42
f = clothing? : Gain(S, f) = ?
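The same computation in code for the five driving examples, reusing entropy, split_on and DATA from the earlier sketches:

def set_entropy(examples):
    """Entropy of a list of (features, label) pairs."""
    n = len(examples)
    pos = sum(1 for _, label in examples if label == "+")
    return entropy(pos / n, (n - pos) / n)

def gain(examples, feature):
    """Gain(S, f) = E(S) - sum over values of |S_i|/|S| * E(S_i)."""
    groups = split_on(examples, feature)
    remainder = sum(len(g) / len(examples) * set_entropy(g) for g in groups.values())
    return set_entropy(examples) - remainder

print(gain(DATA, "keys"))      # ~0.97: a perfect split
print(gain(DATA, "temp"))      # ~0.42: matches the hand computation above
print(gain(DATA, "clothing"))  # 0.0: every example is "casual", so no information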
Divide-and-conquer with Information gain
• Batch learning (read all the input data, compute information gain based on all the examples simultaneously)
• Greedy search, may find local optima
• Outputs a single solution
• Optimizes depth
Complexity
• Worst case: build a complete tree
– Compute gains at all nodes: at level i, we have already examined i features; m-i remain
• In practice: the tree is rarely complete; induction is linear in the number of features and in the number of examples (== very fast)
Overfitting
• Suppose we build a very complex tree. Is it good?
• Last lecture: we measure the quality (“goodness”) of the prediction on unseen data, not the performance on the training data
• Why can complex trees yield mistakes?
– Noise in the data
– Even without noise, decisions at the last levels are based on too few observations
Overfitting
Mo:  Walk (50 observations), Drive (5)
Tue: Walk (40), Drive (3)
We:  Drive (1)
Thu: Walk (42), Drive (14)
Fri: Walk (50)
Sa:  Drive (20), Walk (20)
Su:  Drive (10)
• Can we conclude that “We->Drive”?
Overfitting
• A hypothesis H is said to overfit the training data if there exists another hypothesis H' such that:
Error(H, train data) <= Error(H', train data)
Error(H, unseen data) > Error(H', unseen data)
• Overfitting is related to hypothesis complexity: a more complex hypothesis (e.g., a larger decision tree) overfits more
Overfitting Prevention for DT: Pruning
• “Prune” a complex tree: produce a smaller tree that is less accurate on the training data
Original tree: ... Mo: hot -> drive (2), cold -> walk (100)
Pruned tree:   ... Mo -> walk (100/2)
• Post-pruning / pre-pruning
Pruning criteria
• Cross-validation
– Reserve some training data to evaluate the utility of the subtrees
• Statistical tests: use a test to determine whether observations at given level can be random
• MDL (minimum description length): compare the added complexity against memorizing exceptions
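A rough sketch of the first criterion (reduced-error pruning with held-out data): a subtree is collapsed to a leaf labeled with its majority training class whenever that does not lower accuracy on the reserved examples. It reuses classify, most_common_label and FIELDS from the earlier sketches, and assumes the validation examples only contain feature values already present in the tree:

def accuracy(tree, examples):
    """Fraction of (features, label) pairs the tree classifies correctly."""
    hits = sum(classify(tree, dict(zip(FIELDS, f))) == label for f, label in examples)
    return hits / len(examples)

def route(examples, feature, value):
    """Keep only the examples that follow a given branch."""
    i = FIELDS.index(feature)
    return [(f, label) for f, label in examples if f[i] == value]

def prune(tree, train, validation):
    """Bottom-up reduced-error pruning: replace a subtree by a leaf with its
    majority training class if that does not hurt accuracy on the validation
    examples that reach it."""
    if not isinstance(tree, dict):                 # already a leaf
        return tree
    feature = tree["feature"]
    branches = {v: prune(sub, route(train, feature, v), route(validation, feature, v))
                for v, sub in tree["branches"].items()}
    pruned = {"feature": feature, "branches": branches}
    if not validation:                             # nothing to judge by: keep the subtree
        return pruned
    leaf = most_common_label(train)
    leaf_acc = sum(label == leaf for _, label in validation) / len(validation)
    return leaf if leaf_acc >= accuracy(pruned, validation) else pruned

# Usage sketch: pruned = prune(full_tree, training_examples, held_out_examples)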
DT: issues
• Splitting criteria
– Information gain is biased towards features with many values
• Non-discrete features
• Non-discrete outputs (“regression trees”)
• Costs
• Missing values
• Incremental learning
• Memory issues
ACKNOWLEDGMENTS
• Some of the slides are from:
– Ray Mooney’s UTexas ML course
– MIT OpenCourseWare AI course