Classification & Regression
Data Preprocessing
Classification & Regression
Decision Trees
• Example of inductive learning
  – The process of learning by example, where a system tries to induce a general rule from a set of observed instances.
• Directed structure composed of nodes
  – Each node specifies a test on an attribute
  – Each branch corresponds to an attribute value or condition
  – Leaves represent a class (or decision)
• Very wide application range
Constructing Decision Trees
• Top-down, recursive, divide-and-conquer:
  1. Select the best feature for the root node. Construct a branch for every possible value of that feature.
  2. Split the data into mutually exclusive subsets, one for each branch.
  3. Repeat this process recursively, using only the portion of the data arriving at each node.
  4. Stop when the training examples can be perfectly classified; create a leaf node with the class decision.
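The four steps above can be sketched as a recursive procedure (a minimal illustrative sketch, not full ID3: the "best feature" choice is left as a placeholder that simply takes the first remaining feature; the information-gain criterion used to make that choice properly is developed on the following slides):

```python
from collections import Counter

def build_tree(rows, features, target="Play?"):
    """Top-down, recursive, divide-and-conquer tree construction."""
    labels = [r[target] for r in rows]
    # Step 4: stop when the examples are perfectly classified (or no
    # features remain); create a leaf with the (majority) class decision.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: select the "best" feature (placeholder: first remaining one;
    # a real implementation would maximize information gain here).
    feat = features[0]
    node = {feat: {}}
    # Steps 2-3: split into mutually exclusive subsets, one branch per
    # value, and recurse on the portion of the data reaching each branch.
    for value in set(r[feat] for r in rows):
        subset = [r for r in rows if r[feat] == value]
        node[feat][value] = build_tree(subset, [f for f in features if f != feat], target)
    return node
```

The returned structure is a nested dict: inner nodes map a feature name to its branches, and strings are class leaves.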
How to choose the splitting attribute?
• Information Gain (used in ID3, C4.5)
• Gain Ratio (used in C4.5)
• Gini Measure (used in CART)
Determining the best split
• Greedy approach:
  – Choose splits that yield homogeneous class distributions
– Suppose we are trying to analyze a dataset to figure out if people will wait outside a restaurant for food
[Figure: two candidate splits, "Rain outside" (Yes / No) vs. "Type of food" (Chinese / Italian / Greek), with leaves labeled "Wait" / "Not wait". The better attribute yields homogeneous subsets: a low degree of impurity, i.e. lower entropy.]
Weather Data
Outlook Temp Humidity Windy Play?
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
Which attribute to select?
[Figure: the four candidate splits of the weather data]
  outlook:     sunny → 2 Yes / 3 No    overcast → 4 Yes / 0 No    rainy → 3 Yes / 2 No
  humidity:    high → 3 Yes / 4 No     normal → 6 Yes / 1 No
  windy:       false → 6 Yes / 2 No    true → 3 Yes / 3 No
  temperature: hot → 2 Yes / 2 No      mild → 4 Yes / 2 No        cool → 3 Yes / 1 No
Information Gain
• Information gain (IG) measures how much "information" an attribute gives us about the class
  – Attributes that partition the examples perfectly should give maximal information
  – Unrelated attributes should give no information
• It measures the reduction in entropy
  – Entropy: the (im)purity of an arbitrary collection of examples
Aside on Entropy
• 𝑆 is a sample of training examples
• 𝑝⊕ is the proportion of positive examples in 𝑆
• 𝑝⊖ is the proportion of negative examples in 𝑆
• Entropy measures the impurity of 𝑆:
    Entropy(S) = −p⊕ log₂ p⊕ − p⊖ log₂ p⊖
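The two-class entropy formula translates directly to Python (a small sketch; the 0 · log₂ 0 = 0 convention handles pure samples):

```python
from math import log2

def entropy(p_pos: float) -> float:
    """Entropy of a binary sample with proportion p_pos of positive examples."""
    p_neg = 1.0 - p_pos
    # By convention, 0 * log2(0) = 0, so skip zero-probability terms.
    return -sum(p * log2(p) for p in (p_pos, p_neg) if p > 0)
```

For example, entropy(0.5) gives 1 bit (maximal impurity), while entropy(1.0) gives 0 (a pure sample).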
Aside on Entropy
• Entropy(S) = the expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code)
• Why?
  – Information theory: an optimal-length code assigns −log₂ p bits to a message having probability p
  – So the expected number of bits to encode the class (⊕ or ⊖) of a random member of S is:
      p⊕(−log₂ p⊕) + p⊖(−log₂ p⊖)
    Entropy(S) = H(S) = −p⊕ log₂ p⊕ − p⊖ log₂ p⊖
Aside on Entropy
• Minimum number of bits needed for c different classes (general case):
    H(Y) = −p₁ log₂ p₁ − p₂ log₂ p₂ − … − p_c log₂ p_c = −∑_{i=1}^{c} pᵢ log₂ pᵢ
• Properties of entropy:
  1. High entropy: uniform class distribution
  2. Low entropy: skewed class distribution (more desirable for a split)
Conditional Entropy
• For an example drawn at random, the conditional entropy of Y (the class label) conditioned on the m values v₁, …, v_m taken by a feature x_k is:
    H(Y | x_k) = ∑_{j=1}^{m} P(x_k = v_j) · H(Y | x_k = v_j)
    InfGain(Y, x_k) = H(Y) − H(Y | x_k)
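These two formulas can be sketched as a pair of helper functions (illustrative names, not a standard library API): H computes the entropy of a list of labels, and info_gain subtracts the weighted conditional entropy over the feature's values:

```python
from collections import Counter
from math import log2

def H(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """InfGain(Y, x_k) = H(Y) - sum_j P(x_k = v_j) * H(Y | x_k = v_j)."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        # Labels of the examples where the feature takes value v
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        cond += (len(subset) / n) * H(subset)
    return H(labels) - cond
```

A perfectly informative feature recovers the full H(Y); an unrelated one yields a gain of 0.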
Example
School  Likes football?
ND      Yes
MSU     No
ND      No
ND      Yes
ND      No
USC     Yes
MSU     No
USC     Yes

Compute H(Y) (4 Yes, 4 No):
  H(Y) = −(4/8) log₂(4/8) − (4/8) log₂(4/8) = 1

Compute H(Y|X):
  v_j   P(x = v_j)   H(Y | x = v_j)
  MSU   0.25         0
  ND    0.5          1
  USC   0.25         0

  H(Y|X) = 0.5 · 1 + 0.25 · 0 + 0.25 · 0 = 0.5
  InfGain(Y, X) = H(Y) − H(Y|X) = 1 − 0.5 = 0.5
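The worked example can be checked mechanically (a small sketch; H is a label-entropy helper written from the formula on the earlier slide):

```python
from collections import Counter
from math import log2

def H(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

school = ["ND", "MSU", "ND", "ND", "ND", "USC", "MSU", "USC"]
likes  = ["Yes", "No", "No", "Yes", "No", "Yes", "No", "Yes"]

# H(Y): 4 Yes / 4 No -> 1 bit
h_y = H(likes)
# H(Y|X): entropy of each school's subset, weighted by P(x = v_j)
h_y_x = sum(
    (school.count(v) / len(school)) * H([y for s, y in zip(school, likes) if s == v])
    for v in set(school)
)
print(h_y, h_y_x, h_y - h_y_x)  # 1.0 0.5 0.5
```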
Back to the Decision Tree
outlook: sunny → 2 Yes / 3 No    overcast → 4 Yes / 0 No    rainy → 3 Yes / 2 No

• Information gain = (entropy before split) − (entropy after split)
• Information gain for outlook:
    InfGain(outlook) = I(9,5) − I([2,3], [4,0], [3,2])
    InfGain(outlook) = 0.940 − 0.693 = 0.247
• For the other features:
    InfGain(temperature) = 0.029
    InfGain(humidity)    = 0.152
    InfGain(windy)       = 0.048
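The 0.940 − 0.693 = 0.247 computation for outlook can be reproduced directly from the class counts (a sketch using a small entropy helper):

```python
from collections import Counter
from math import log2

def H(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Weather data: 9 Yes / 5 No overall; outlook splits the examples
# into sunny [2 Yes, 3 No], overcast [4 Yes, 0 No], rainy [3 Yes, 2 No].
before = H(["Yes"] * 9 + ["No"] * 5)
after = sum(
    (len(branch) / 14) * H(branch)
    for branch in (["Yes"] * 2 + ["No"] * 3,
                   ["Yes"] * 4,
                   ["Yes"] * 3 + ["No"] * 2)
)
print(round(before, 3), round(after, 3), round(before - after, 3))
# 0.94 0.694 0.247
```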
Continuing to split
[Figure: candidate second-level splits under outlook = sunny]
  temperature: hot → 0 Yes / 2 No    mild → 1 Yes / 1 No    cool → 1 Yes / 0 No
  windy:       false → 1 Yes / 2 No    true → 1 Yes / 1 No
  humidity:    high → 0 Yes / 3 No    normal → 2 Yes / 0 No

InfGain(temperature) = 0.571
InfGain(windy)       = 0.020
InfGain(humidity)    = 0.971
Final Tree
• Note: leaves need not be pure, as there can often be similar instances with different classes.

  outlook
  ├─ sunny    → humidity: high → No, normal → Yes
  ├─ overcast → Yes
  └─ rainy    → windy: false → Yes, true → No
Applying Model to test Data
Test data:
  Outlook Temp Humidity Windy Play?
  Rainy   Hot  High     False ?

Following the final tree: outlook = rainy → test windy; windy = false → Play = Yes
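One way to sketch this lookup is a nested-dict tree with a small prediction walker (the dict layout here is one illustrative representation, not a standard one):

```python
# Final tree: inner nodes map a feature name to its branches; strings are leaves.
tree = {"outlook": {
    "sunny":    {"humidity": {"high": "No", "normal": "Yes"}},
    "overcast": "Yes",
    "rainy":    {"windy": {"false": "Yes", "true": "No"}},
}}

def predict(node, example):
    """Walk the tree, following the branch matching each tested feature."""
    while isinstance(node, dict):
        feature = next(iter(node))
        node = node[feature][example[feature]]
    return node

test = {"outlook": "rainy", "temp": "hot", "humidity": "high", "windy": "false"}
print(predict(tree, test))  # Yes
```

Note that only the features actually tested along the path (here outlook, then windy) are consulted.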
How to Specify Test Condition
• Depends on:
  – Type of attribute/feature:
    • Nominal
    • Continuous
  – Number of ways to split:
    • 2-way split
    • Multi-way split
Splitting Based on Nominal Attributes
• Multi-way split:
  – Use as many partitions as there are values
      Car Type → Luxury | Sports | Family
• Binary split:
  – Divide the values into two subsets
      Car Type → {Luxury, Sports} | Family
      Car Type → Luxury | {Sports, Family}
Splitting Based on Continuous Attributes
• Discretization:
  – Form an ordinal categorical feature
  – Can be done once at the beginning (static, global) or at each level individually (dynamic, local)
• Binary decision:
  – (A < v) or (A ≥ v)
  – Considers all possible split points and chooses the best
  – More computationally intensive
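The binary-decision strategy can be sketched as an exhaustive scan over candidate thresholds (here, midpoints between consecutive distinct sorted values), which is what makes it more computationally intensive than a single nominal split:

```python
from collections import Counter
from math import log2

def H(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Try every candidate threshold v and return the (A < v) split
    with the highest information gain."""
    pairs = sorted(zip(values, labels))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(ys)
    best_v, best_gain = None, -1.0
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no threshold between equal values
        v = (xs[i] + xs[i - 1]) / 2
        gain = H(ys) - (i / n) * H(ys[:i]) - ((n - i) / n) * H(ys[i:])
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain
```

For cleanly separated data such as values [1, 2, 3, 10, 11, 12] with labels No/No/No/Yes/Yes/Yes, the scan finds the threshold 6.5 with a gain of 1 bit.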
Highly branching attributes
• Problematic: attributes with a large number of values (extreme case: an ID code)
• Subsets are more likely to be pure if there is a large number of values
  – Information gain is biased towards choosing attributes with a large number of values
  – This may result in overfitting (selection of an attribute that is non-optimal for prediction)
Highly branching attributes – Example
22
Day Outlook Temp Humidity Windy Play?
D1 Sunny Hot High False No
D2 Sunny Hot High True No
D3 Overcast Hot High False Yes
D4 Rainy Mild High False Yes
D5 Rainy Cool Normal False Yes
D6 Rainy Cool Normal True No
D7 Overcast Cool Normal True Yes
D8 Sunny Mild High False No
D9 Sunny Cool Normal False Yes
D10 Rainy Mild Normal False Yes
D11 Sunny Mild Normal True Yes
D12 Overcast Mild High True Yes
D13 Overcast Hot Normal False Yes
D14 Rainy Mild High True No
Highly branching attributes – Example
[Figure: splitting on Day gives one branch per value D1, D2, D3, D4, …, D13, D14, each a pure single-example leaf (No, No, Yes, Yes, …, Yes, No)]

• Entropy of the split = 0: each leaf is "pure"
• Information gain is maximal for this feature
• Is that good?
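The bias can be made concrete: because every Day value is unique, the conditional entropy H(Y | Day) is zero, so the gain equals the full H(Y) ≈ 0.940, far above outlook's 0.247, yet Day is useless for prediction (a sketch; the entropy helper follows the earlier formula):

```python
from collections import Counter
from math import log2

def H(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Play? column of the weather data, rows D1..D14
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
days = [f"D{i}" for i in range(1, 15)]

# Every Day value is unique, so every leaf is pure: H(Y | Day) = 0
cond = sum(
    (days.count(d) / 14) * H([y for dd, y in zip(days, play) if dd == d])
    for d in set(days)
)
print(round(H(play) - cond, 3))  # 0.94, i.e. the full H(Y)
```

This is why measures such as the gain ratio (C4.5), mentioned earlier, penalize highly branching attributes.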