Machine Learning, Chapter 6 CSE 574, Spring 2003

Bayes Optimal Classifier (6.7)

Bayes Optimal Classification

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$$


Bayes Optimal Classifier

• Instead of asking “What is the most probable hypothesis given the training data?”, ask:
• “What is the most probable classification of the new instance given the training data?”
• Instead of learning the function f_i, the Bayes optimal classifier assigns any given input to the most likely output v_j

[Diagram: inputs x_0, x_1, x_2 are fed to f_i, which produces the output v_j]


Bayes Optimal Classifier

• Instead of learning the function, the Bayes optimal classifier assigns any given input to the most likely output

• Calculate a posteriori probabilities

• P(x0,x1,x2|0) is the class-conditional probability

[Diagram: inputs x_0, x_1, x_2 are fed to f_i]

$$P(0 \mid x_0, x_1, x_2) = \frac{P(x_0, x_1, x_2 \mid 0)\, P(0)}{P(x_0, x_1, x_2)}$$


Example of Bayes Optimal Classifier

x0 x1 x2 | f0 f1 f2 f3 f4 ... f255
 0  0  0 |  0  1  0  1  0 ...    1
 0  0  1 |  0  0  1  1  0 ...    1
 0  1  0 |  0  0  0  0  1 ...    1
 0  1  1 |  0  0  0  0  0 ...    1
 1  0  0 |  0  0  0  0  0 ...    1
 1  0  1 |  0  0  0  0  0 ...    1
 1  1  0 |  0  0  0  0  0 ...    1

$$P(0 \mid x_0, x_1, x_2) = \frac{P(x_0, x_1, x_2 \mid 0)\, P(0)}{P(x_0, x_1, x_2)}
\qquad
P(1 \mid x_0, x_1, x_2) = \frac{P(x_0, x_1, x_2 \mid 1)\, P(1)}{P(x_0, x_1, x_2)}$$


Bayes Optimal Classifier

• To calculate a posteriori probabilities
• Need to know the class-conditional probabilities $P(x_0, x_1, x_2 \mid 0)$ and $P(x_0, x_1, x_2 \mid 1)$
• Each is a table of $2^n$ different probabilities estimated from many training samples

x0 x1 x2 | Prob(0)        x0 x1 x2 | Prob(1)
 0  0  0 | 0.10            0  0  0 | 0.05
 0  0  1 | 0.05            0  0  1 | 0.10
 0  1  0 | 0.10            0  1  0 | 0.25
 0  1  1 | 0.25            0  1  1 | 0.25
 1  0  0 | 0.30            1  0  0 | 0.10
 1  0  1 | 0.10            1  0  1 | 0.10
 1  1  0 | 0.05            1  1  0 | 0.15
 1  1  1 | 0.05            1  1  1 | 0.05


Bayes Optimal Classifier

• Need to know Class-conditional probabilities

• The two tables together have $2 \cdot 2^n$ entries
• Will need many training samples:
  • need to see every instance many times in order to obtain reliable estimates
• When the number of attributes is large, it is impossible to even list all the probabilities in a table

$$P(x_0, x_1, x_2 \mid 0) \qquad P(x_0, x_1, x_2 \mid 1)$$


Bayes Optimal Classifier

• Target function f(x)
  • takes any value from a finite set V, e.g., {0, 1}
  • each instance x is composed of attribute values $x_1, x_2, \ldots, x_n$
• Most probable target value $v_{MAP}$:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid x_1, x_2, \ldots, x_n)
          = \arg\max_{v_j \in V} \frac{P(x_1, x_2, \ldots, x_n \mid v_j)\, P(v_j)}{P(x_1, x_2, \ldots, x_n)}$$


Most Probable Hypothesis vs Most Probable Classification

• The classification result can be different!
• Suppose three hypotheses f_0, f_1, f_2 have posterior probabilities given the training data of .3, .4, .3
• Therefore the MAP hypothesis is f_1
• The instance x = <0,0,0> is classified as 1 by f_1 but as 0 by f_0 and f_2
• P(1|x,D) = P(1|f_0,x)P(f_0|D,x) + P(1|f_1,x)P(f_1|D,x) + P(1|f_2,x)P(f_2|D,x)
           = (0)(.3) + (1)(.4) + (0)(.3) = .4
• Similarly, P(0|x,D) = .6
• Therefore the most probable classification of x is 0

x0 x1 x2 | f0 f1 f2 f3 f4 ... f255
 0  0  0 |  0  1  0  1  0 ...    1
 0  0  1 |  0  0  1  1  0 ...    1
 0  1  0 |  0  0  0  0  1 ...    1
 0  1  1 |  0  0  0  0  0 ...    1
 1  0  0 |  0  0  0  0  0 ...    1
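A minimal sketch, using the numbers from this slide, of how the Bayes optimal classification is computed; the function and variable names are illustrative and not from the text.

```python
# Sketch of Bayes optimal classification for the example above.
# The hypothesis posteriors and per-hypothesis predictions come from the
# slide; the function and variable names are illustrative.

def bayes_optimal_classification(values, posteriors, predictions):
    """Return argmax_v sum_h P(v|h) P(h|D), plus all scores.

    posteriors:  {hypothesis: P(h|D)}
    predictions: {hypothesis: {value: P(v|h)}}
    """
    def score(v):
        return sum(predictions[h][v] * p_h for h, p_h in posteriors.items())
    return max(values, key=score), {v: score(v) for v in values}

# Posteriors P(h|D) for f0, f1, f2 and their (deterministic) predictions for x = <0,0,0>
posteriors = {"f0": 0.3, "f1": 0.4, "f2": 0.3}
predictions = {"f0": {0: 1.0, 1: 0.0},
               "f1": {0: 0.0, 1: 1.0},
               "f2": {0: 1.0, 1: 0.0}}

best, scores = bayes_optimal_classification([0, 1], posteriors, predictions)
print(scores)  # {0: 0.6, 1: 0.4}
print(best)    # 0 -- differs from the prediction of the MAP hypothesis f1
```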


Maximum Likelihood and Least-Squared Error Hypotheses (6.4)

• Bayesian analysis shows that under certain circumstances any learning algorithm that minimizes the squared error between output hypothesis predictions and the training data will output a maximum likelihood hypothesis.


Learning a Real-Valued Function

Figure 6.2


Probability Density Function

$$p(x_0) \equiv \lim_{\epsilon \to 0} \frac{1}{\epsilon}\, P(x_0 \le x < x_0 + \epsilon)$$


Maximum Likelihood Hypothesis

Maximum Likelihood Hypothesis: One that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi)

$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2$$
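A minimal sketch, under the Gaussian-noise setting of Section 6.4, of finding the least-squares (and hence maximum likelihood) linear hypothesis; the target function, noise level, and sample size below are made-up illustrations.

```python
# Sketch: when targets are corrupted by Gaussian noise (Section 6.4), the
# hypothesis minimizing the summed squared error is the maximum likelihood
# hypothesis.  Here a linear hypothesis h(x) = w0 + w1*x is fit by least
# squares; the true function and noise level are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
d = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=x.shape)   # d_i = f(x_i) + e_i

# Minimize sum_i (d_i - (w0 + w1 x_i))^2.
A = np.column_stack([np.ones_like(x), x])
w, residual_sq, _, _ = np.linalg.lstsq(A, d, rcond=None)

print("h_ML(x) = %.2f + %.2f x" % (w[0], w[1]))   # close to 2 + 3x
print("sum of squared errors:", float(residual_sq[0]))
```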


Maximum Likelihood Hypotheses for Predicting Probabilities (6.5)

$$P(D \mid h) = \prod_{i=1}^{m} P(x_i, d_i \mid h) = \prod_{i=1}^{m} P(d_i \mid h, x_i)\, P(x_i)$$


Maximum Likelihood Hypotheses for Predicting Probabilities, continued

$$P(d_i \mid h, x_i) =
\begin{cases}
h(x_i) & \text{if } d_i = 1 \\
1 - h(x_i) & \text{if } d_i = 0
\end{cases}$$

$$P(d_i \mid h, x_i) = h(x_i)^{d_i} \left(1 - h(x_i)\right)^{1 - d_i}$$

$$P(D \mid h) = \prod_{i=1}^{m} h(x_i)^{d_i} \left(1 - h(x_i)\right)^{1 - d_i} P(x_i)$$


Maximum Likelihood Hypotheses for Predicting Probabilities, continued

$$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i} \left(1 - h(x_i)\right)^{1 - d_i} P(x_i)$$

$$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i} \left(1 - h(x_i)\right)^{1 - d_i}$$

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln\left(1 - h(x_i)\right)$$


Gradient Search to Maximize Likelihood in a Neural Network (6.5.1)

$$\frac{\partial G(h, D)}{\partial w_{jk}}
 = \sum_{i=1}^{m} \frac{\partial G(h, D)}{\partial h(x_i)} \cdot \frac{\partial h(x_i)}{\partial w_{jk}}$$

$$= \sum_{i=1}^{m} \frac{\partial \left( d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i)) \right)}{\partial h(x_i)} \cdot \frac{\partial h(x_i)}{\partial w_{jk}}$$

$$= \sum_{i=1}^{m} \frac{d_i - h(x_i)}{h(x_i)\left(1 - h(x_i)\right)} \cdot \frac{\partial h(x_i)}{\partial w_{jk}}$$


Gradient Search to Maximize Likelihood in a Neural Network, continued

$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}, \qquad \Delta w_{jk} = \eta \sum_{i=1}^{m} \left(d_i - h(x_i)\right) x_{ijk}$$

$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}, \qquad \Delta w_{jk} = \eta \sum_{i=1}^{m} h(x_i)\left(1 - h(x_i)\right)\left(d_i - h(x_i)\right) x_{ijk}$$

(The first rule performs gradient ascent on the likelihood G(h, D); the second, shown for comparison, is the corresponding weight-update rule obtained when minimizing the sum of squared errors.)
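A minimal sketch of the first update rule above applied to a single sigmoid unit; the training data, learning rate, and iteration count are arbitrary illustrations.

```python
# Sketch of the first update rule above for a single sigmoid unit:
#   delta_w_jk = eta * sum_i (d_i - h(x_i)) * x_ijk
# which performs gradient ascent on G(h, D).  The training data, learning
# rate, and iteration count are arbitrary illustrations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each row of X is an input x_i (with a constant 1 for the bias weight);
# d holds the targets d_i in {0, 1}.
X = np.array([[1.0, 0.2], [1.0, 0.9], [1.0, 0.4], [1.0, 0.8]])
d = np.array([0.0, 1.0, 0.0, 1.0])

w = np.zeros(X.shape[1])
eta = 0.5
for _ in range(1000):
    h = sigmoid(X @ w)          # h(x_i) for every training example
    w += eta * (d - h) @ X      # batch update: eta * sum_i (d_i - h_i) x_i

print("learned weights:", w)
print("predictions:", sigmoid(X @ w))   # close to the targets d
```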


Minimum Description Length Principle (6.6)

• Occam’s razor:
  • choose the shortest explanation for the observed data
  • used in decision tree design, where the goal was to find the shortest tree
• Here we consider:
  • a Bayesian perspective on this issue
  • a closely related principle called the Minimum Description Length (MDL) principle


Minimum Description Length Principle

• Motivated by interpreting the definition of $h_{MAP}$ using concepts from information theory
• Familiar definition:

$$h_{MAP} = \arg\max_{h \in H} P(D \mid h)\, P(h)$$


Minimum Description Length Principle

$$h_{MAP} = \arg\max_{h \in H} P(D \mid h)\, P(h)$$

• Equivalently, taking logarithms:

$$h_{MAP} = \arg\max_{h \in H} \log_2 P(D \mid h) + \log_2 P(h)$$

• Equivalently, taking negatives:

$$h_{MAP} = \arg\min_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h)$$
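A small numeric sketch, with made-up priors and likelihoods, checking that maximizing $P(D \mid h)P(h)$ and minimizing $-\log_2 P(D \mid h) - \log_2 P(h)$ select the same hypothesis.

```python
# Numeric check, with invented priors and likelihoods, that maximizing
# P(D|h)P(h) and minimizing -log2 P(D|h) - log2 P(h) pick the same hypothesis.
from math import log2

hypotheses = {            # h: (P(h), P(D|h)) -- illustrative values only
    "h1": (0.30, 0.020),
    "h2": (0.10, 0.100),
    "h3": (0.60, 0.005),
}

h_by_max = max(hypotheses, key=lambda h: hypotheses[h][1] * hypotheses[h][0])
h_by_min = min(hypotheses,
               key=lambda h: -log2(hypotheses[h][1]) - log2(hypotheses[h][0]))

print(h_by_max, h_by_min)   # the same hypothesis both times
```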


Minimum Description Length Principle

$$h_{MAP} = \arg\min_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h)$$

• Interpretation of the above equation:
  • assuming a particular representation scheme for encoding hypotheses and data,
  • short hypotheses are to be preferred
  • explanation to follow


Design a compact code to transmit messages at random

• Probability of message i is $p_i$
• Find the code that minimizes the expected number of bits we must transmit to encode a message drawn at random
  • assign shorter codes to more probable messages
• Shannon and Weaver (1949): the optimal code assigns $-\log_2 p_i$ bits to encode message i
• The number of bits needed to encode message i using code C is the description length of message i with respect to C, i.e., $L_C(i)$


Minimum Length encoding

• A Huffman code C optimally assigns shorter codes to more likely symbols

Message i   Code   p_i     Bit length L_C(i)
A           0      0.5     1
B           10     0.25    2
C           110    0.125   3
D           111    0.125   3

• Uniquely decodable


Expected length of a message

• A: code 0, prob 0.5, length 1
• B: code 10, prob 0.25, length 2
• C: code 110, prob 0.125, length 3
• D: code 111, prob 0.125, length 3

• Expected length of a message
• Same as the formula for entropy:

$$-\sum_i p_i \log_2 p_i
 = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{8}\log_2\tfrac{1}{8} + \tfrac{1}{8}\log_2\tfrac{1}{8}\right)
 = 0.5 + 0.5 + 0.75 = 1.75 \text{ bits}$$
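A small sketch confirming, for the A, B, C, D example above, that the optimal code length for each message is $-\log_2 p_i$ and that the expected message length equals the entropy, 1.75 bits.

```python
# The optimal code length for each message is -log2(p_i); for this
# distribution it matches the Huffman code lengths above, and the expected
# message length equals the entropy, 1.75 bits.
from math import log2

p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

lengths = {msg: -log2(prob) for msg, prob in p.items()}
print(lengths)                                         # {'A': 1.0, 'B': 2.0, 'C': 3.0, 'D': 3.0}

expected_length = sum(prob * lengths[msg] for msg, prob in p.items())
print(expected_length)                                 # 1.75 bits
```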


Interpretation of MAP hypothesis in terms of Coding Theory

$$h_{MAP} = \arg\min_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h)$$

$-\log_2 P(h)$ is the description length of h under the optimal encoding for hypothesis space H, i.e., the size of the description of hypothesis h using this optimal representation:

$$-\log_2 P(h) = L_{C_H}(h), \quad \text{where } C_H \text{ is the optimal code for } H$$

$-\log_2 P(D \mid h)$ is the description length of the training data D given hypothesis h, under its optimal encoding:

$$-\log_2 P(D \mid h) = L_{C_{D|h}}(D \mid h), \quad \text{where } C_{D|h} \text{ is the optimal code for describing } D \text{ assuming sender and receiver know hypothesis } h$$


Interpretation of Bayes MAP hypothesis in terms of Coding Theory

$$h_{MAP} = \arg\min_{h \in H} L_{C_H}(h) + L_{C_{D|h}}(D \mid h)$$


Minimum Description Length (MDL) Principle

• Let $C_1$ and $C_2$ be codes used to represent the hypothesis and the data given the hypothesis, respectively
• The MDL principle recommends choosing $h_{MDL}$, where

$$h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D \mid h)$$


Minimum Description Length (MDL) Principle

• If $C_1$ and $C_2$ are chosen optimally, then $h_{MDL} = h_{MAP}$
• Intuitively:
  • MDL recommends the shortest method for re-encoding the training data,
  • where we count the size of the hypothesis and any additional cost of encoding the data given this hypothesis


Gibbs Algorithm (6.8)

• Bayes optimal classification can be costly to apply:
  • it computes the posterior probability for every hypothesis in H
  • it then combines the predictions of each hypothesis to classify each new instance
• The Gibbs algorithm is an alternative, less optimal method:
  • choose a hypothesis h from H at random, according to the posterior probability distribution over H
  • use h to predict the classification of the next instance x


Naïve Bayes Classifier

• Practical Bayesian learning method
• In some domains its performance is comparable to that of neural network and decision tree learning


Naïve Bayes Classifier

• Based on the simplifying assumption that the attribute values are conditionally independent given the class

$$P(x_1, x_2, \ldots, x_n \mid 0) = P(x_1 \mid 0)\, P(x_2 \mid 0) \cdots P(x_n \mid 0) = \prod_i P(x_i \mid 0)$$


Naïve Bayes Classifier

• Class-conditional probabilities assuming statistical independence
• The tables have $2 \cdot 2 \cdot n$ entries ==> much better than $2 \cdot 2^n$ entries

x0 | Prob(x0|0)     x1 | Prob(x1|0)     x2 | Prob(x2|0)
 0 | 0.65            0 | 0.4             0 | 0.15
 1 | 0.35            1 | 0.6             1 | 0.85

x0 | Prob(x0|1)     x1 | Prob(x1|1)     x2 | Prob(x2|1)
 0 | 0.65            0 | 0.4             0 | 0.15
 1 | 0.35            1 | 0.6             1 | 0.85
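A minimal sketch of how the compact tables above are used: under the independence assumption, the class-conditional joint probability of an instance is just the product of the per-attribute entries. The values below are the class-0 entries from this slide.

```python
# With the independence assumption, P(x0, x1, x2 | class) is the product of
# the per-attribute entries above, so only 2*2*n numbers need to be stored
# instead of 2*2^n.  The table values are the class-0 entries from this slide.

p_given_0 = {                 # P(x_i = value | class 0)
    "x0": {0: 0.65, 1: 0.35},
    "x1": {0: 0.40, 1: 0.60},
    "x2": {0: 0.15, 1: 0.85},
}

def joint_given_class(instance, table):
    """P(x0, x1, x2 | class) = prod_i P(x_i | class)."""
    prob = 1.0
    for attr, value in instance.items():
        prob *= table[attr][value]
    return prob

print(joint_given_class({"x0": 0, "x1": 0, "x2": 0}, p_given_0))  # 0.65*0.40*0.15 = 0.039
```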


Naïve Bayes Classifier (6.9)

• Naïve Bayes applies to learning tasks where
  • each instance x is described by a conjunction of attribute values
  • the target function f(x) can take on any value from some finite set V


Naïve Bayes Classifier, continued

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$$

$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)}
          = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$$


Naïve Bayes Classifier, continued

Naïve Bayes Classifier

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$
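As a minimal sketch (not from the slides), the v_NB rule above maps directly onto code. The probability tables are assumed to be supplied as dictionaries; log-probabilities are summed instead of multiplying raw probabilities only to avoid floating-point underflow when there are many attributes.

```python
# Direct transcription of v_NB = argmax_v P(v) * prod_i P(a_i | v).
# Probability tables are passed in as plain dictionaries; logs are summed
# only to avoid floating-point underflow with many attributes.
from math import log

def naive_bayes_classify(attribute_values, priors, conditionals):
    """
    priors:       {v: P(v)}
    conditionals: {v: [ {a: P(a|v)} for each attribute position ]}
    """
    def log_score(v):
        return log(priors[v]) + sum(log(conditionals[v][i][a])
                                    for i, a in enumerate(attribute_values))
    return max(priors, key=log_score)
```

The PlayTennis example on the following slides instantiates exactly this rule, with the probabilities estimated from training data.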


Naïve Bayes: PlayTennis Example

Classify days according to whether someone will play tennis. Given 14 examples:

Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No

Table 3.2


Naïve Bayes: PlayTennis Example

• The task is to predict the target value (yes or no) of the target concept PlayTennis for the new instance

Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong

(PlayTennis training examples from Table 3.2; see above.)


Naïve Bayes: PlayTennis Example

$v_{NB}$ is the target value output by the naïve Bayes classifier. Instantiating the naïve Bayes classifier equation to fit the task, the target value is given by

$$v_{NB} = \arg\max_{v_j \in \{yes,\, no\}} P(v_j) \prod_i P(a_i \mid v_j)$$

$$v_{NB} = \arg\max_{v_j \in \{yes,\, no\}} P(v_j)\,
P(\textit{Outlook} = \textit{sunny} \mid v_j)\,
P(\textit{Temperature} = \textit{cool} \mid v_j)\,
P(\textit{Humidity} = \textit{high} \mid v_j)\,
P(\textit{Wind} = \textit{strong} \mid v_j)$$

where $P(v_j)$ are the prior probabilities and the remaining terms are the conditional probabilities.


Naïve Bayes: PlayTennis Example

$$v_{NB} = \arg\max_{v_j \in \{yes,\, no\}} P(v_j) \prod_i P(a_i \mid v_j)$$

$$v_{NB} = \arg\max_{v_j \in \{yes,\, no\}} P(v_j)\,
P(\textit{Outlook} = \textit{sunny} \mid v_j)\,
P(\textit{Temperature} = \textit{cool} \mid v_j)\,
P(\textit{Humidity} = \textit{high} \mid v_j)\,
P(\textit{Wind} = \textit{strong} \mid v_j)$$

This requires 2 prior probabilities and 8 conditional probabilities.

• Notice that in the final expression a_i has been instantiated using the particular attribute values of the new instance
• To calculate v_NB, we need 10 probabilities that can be estimated from the training data


Estimating Prior Probabilities

• Probabilities of the different target values are estimated from their frequencies over the 14 training examples:

P(PlayTennis = yes) = 9/14 = .64
P(PlayTennis = no) = 5/14 = .36

(PlayTennis training examples from Table 3.2; see above.)


Estimating Conditional Probabilities

• Similarly, we can estimate the conditional probabilities. For example, those for Wind = strong are:

P(Wind = strong | PlayTennis = yes) = 3/9 = .33
P(Wind = strong | PlayTennis = no) = 3/5 = .60

(PlayTennis training examples from Table 3.2; see above.)


Naïve Bayes: PlayTennis Target Values

• Using these and similar probability estimates for the remaining attribute values, v_NB is calculated as follows:

P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = .0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = .0206

• Thus, the naïve Bayes classifier assigns the target value PlayTennis = no to this new instance
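The following sketch reproduces the numbers on this slide directly from the 14 training examples of Table 3.2; only the helper names are invented.

```python
# Reproduce the priors, conditionals, the two products (.0053 and .0206),
# and the normalized probability .795 from the 14 training examples.

# (Outlook, Temp, Humidity, Wind, PlayTennis) for D1..D14
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
new_instance = ("Sunny", "Cool", "High", "Strong")

def prior(v):
    return sum(1 for row in data if row[-1] == v) / len(data)

def conditional(attr_index, value, v):
    rows_v = [row for row in data if row[-1] == v]
    return sum(1 for row in rows_v if row[attr_index] == value) / len(rows_v)

scores = {}
for v in ("Yes", "No"):
    score = prior(v)
    for i, value in enumerate(new_instance):
        score *= conditional(i, value, v)
    scores[v] = score

print({v: round(s, 4) for v, s in scores.items()})              # {'Yes': 0.0053, 'No': 0.0206}
print(round(scores["No"] / (scores["Yes"] + scores["No"]), 3))  # 0.795
```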


PlayTennis: Normalizing Class Probabilities

• By normalizing the above quantities to sum to one, we can calculate the conditional probability that the target value is no, given the observed attribute values
• For the current example, this probability is

$$\frac{.0206}{.0206 + .0053} = .795$$


Estimating Probabilities

• A probability is estimated as the fraction of times the event is observed (n_c) over the total number of observations (n)
  • P(Wind = strong | PlayTennis = no) = 3/5 = .60
• When n_c is very small ==> poor estimate of the probability
  • suppose the true P(Wind = strong | PlayTennis = no) = .08 and we have n = 5; then the most probable value for n_c is 0
  • this yields a biased underestimate of the probability
  • this probability term will dominate, since it multiplies all the other probabilities (an estimate of zero drives the whole product to zero)


Estimating Probabilities with Small Sample Size

• To avoid this problem, use a Bayesian approach: the m-estimate
• m-estimate of probability:

$$\frac{n_c + m\,p}{n + m}$$

• p is a prior estimate of the probability we wish to determine
• m is a constant called the equivalent sample size


Estimating Probabilities with Small Sample Size

• m-estimate of probability:

$$\frac{n_c + m\,p}{n + m}$$

• p is a prior estimate of the probability we wish to determine
  • assume uniform priors: if the attribute has k values, we set p = 1/k
  • if k = 2, then p = .5
• m is the equivalent sample size
  • if m = 0, the m-estimate is equivalent to the simple fraction n_c/n
  • the prior and the observed fraction are combined according to the weight m
  • called the equivalent sample size because the n actual samples are augmented with m virtual samples distributed according to p
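A small sketch of the m-estimate applied to the P(Wind = strong | PlayTennis = no) example from the earlier slide; the choice m = 2 is an arbitrary illustration.

```python
# m-estimate (n_c + m*p) / (n + m), applied to the
# P(Wind = strong | PlayTennis = no) example.  m = 2 is an arbitrary
# illustration; p = 1/k with k = 2 values (strong, weak) gives p = 0.5.

def m_estimate(n_c, n, p, m):
    return (n_c + m * p) / (n + m)

print(m_estimate(n_c=3, n=5, p=0.5, m=0))   # 0.6       -- plain fraction n_c/n
print(m_estimate(n_c=3, n=5, p=0.5, m=2))   # 0.571...  -- pulled toward the prior
print(m_estimate(n_c=0, n=5, p=0.5, m=2))   # 0.142...  -- no longer a hard zero
```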


Bayesian Learning Example: Classifying Text

• Instances are text documents
• Target concept:
  • electronic news articles that I find interesting
  • pages of the world-wide web that discuss machine learning topics
• If a computer could learn the target concept accurately for instances involving text documents,
  • it could automatically filter a large volume of on-line documents and present only the most relevant ones


Text Classification Task

• General setting:
  • the instance space X consists of all possible text documents (i.e., all possible strings of words and punctuation of all possible lengths)
  • we are given training examples of some unknown target function f(x), which can take on any value from some finite set V
• Task:
  • learn to classify future documents as interesting or not interesting to a particular person,
  • using the target values like and dislike to indicate these two classes


Naïve Bayes Example: Learning To Classify Text

• Two main design issues:
  • decide how to represent an arbitrary text document in terms of attribute values
  • decide how to estimate the probabilities required by the naïve Bayes classifier


Approach to Representing Arbitrary Text Documents

• Given a text document, we define
  • an attribute for each word position in the document, and
  • the value of that attribute to be the English word found in that position
  • thus, the paragraph beginning with the sentence “Given a text document ...,” would be described by 111 attribute values, corresponding to its 111 word positions
  • the value of the first attribute is the word “Given”, the value of the second attribute is “a”, etc.
• Note: long text documents require a larger number of attributes than short documents
  • not a problem


Document Classification Task

• We are given a set of training documents that have been classified by a friend
  • 700 classified as dislike
  • 300 classified as like
• Use these to classify new documents


Naïve Bayes Classification of Text

$$v_{NB} = \arg\max_{v_j \in \{like,\, dislike\}} P(v_j) \prod_{i=1}^{111} P(a_i \mid v_j)$$

$$v_{NB} = \arg\max_{v_j \in \{like,\, dislike\}} P(v_j)\,
P(a_1 = \text{“given”} \mid v_j)\,
P(a_2 = \text{“a”} \mid v_j) \cdots
P(a_{111} = \text{“problem”} \mid v_j)$$

• Naïve Bayes Classification vNB is the classification that maximizes the probability of observing the words that were actually found in the document


Text Classification: independence assumption

• The independence assumption states that the word probabilities for one text position are independent of the words that occur in other positions, given the document classification v_j
• The assumption is incorrect
  • e.g., the probability of observing “learning” is higher if the preceding word is “machine”
• Without making the assumption, however, a prohibitive number of probability terms would be involved
• The naïve Bayes learner is nonetheless known to perform well in text classification problems


Estimating Probability Terms

• To calculate v_NB we need
  • the prior probability terms P(v_j)
    • easy: P(like) = .3 and P(dislike) = .7
  • the conditional probability terms P(a_i = w_k | v_j)
    • w_k is the kth word in the English vocabulary, e.g., P(a_1 = “given” | dislike)
    • difficult: we need one probability term for each combination of text position (111), English word (50,000), and target value (2), which implies roughly 10^7 probabilities


Reducing Number of Probability Terms

• Assume positional independence
  • the probability of encountering a specific word w_k is independent of the specific word position being encountered (a_23 versus a_95)
• This amounts to assuming that the attributes are independent and identically distributed
  • P(a_i = w_k | v_j) = P(a_m = w_k | v_j) for all i, j, k, m


Reducing Number of Probability Terms

• Replace the entire set of probabilities P(a_1 = w_k | v_j), P(a_2 = w_k | v_j), ... by the single position-independent probability P(w_k | v_j)
  • use P(w_k | v_j) regardless of word position
• Now only 2 x 50,000 = 10^5 terms are required
• When training data is limited, this
  • increases the number of samples available to estimate each required probability
  • increases the reliability of the estimates


Text Classification: Estimating Probability Terms

• Using the m-estimate with uniform priors and with m equal to the size of the word vocabulary, the estimate for $P(w_k \mid v_j)$ will be

$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |Vocabulary|}$$

• where n is the total number of word positions in all training examples whose target value is v_j
• n_k is the number of times word w_k is found among these n word positions


Naïve Bayes Algorithm for Learning and Classifying Text

Learn_Naive_Bayes_Text (Examples, V)

Examples is a set of text documents along with their target values. V is the set of all possible target values. This function learns the probability terms P(wk|vj), describing the probability that a randomly drawn word from a document in class vj will be the English word wk. It also learns the class prior probabilities P(vj).


Learn_Naive_Bayes_Text (Examples, V)

Collect all words, punctuation, and other tokens that occur in Examples

• Vocabulary ← the set of all distinct words and other tokens occurring in any text document from Examples


Learn_Naive_Bayes_Text (Examples, V)

Calculate the required P(vj) and P(wk|vj) probability terms

• For each target value v_j in V, do
  • docs_j ← the subset of documents from Examples for which the target value is v_j
  • $P(v_j) \leftarrow \dfrac{|docs_j|}{|Examples|}$
  • Text_j ← a single document created by concatenating all members of docs_j
  • n ← total number of distinct word positions in Text_j
  • for each word w_k in Vocabulary
    • n_k ← number of times word w_k occurs in Text_j
    • $P(w_k \mid v_j) \leftarrow \dfrac{n_k + 1}{n + |Vocabulary|}$


Classify_Naïve_Bayes_Text (Doc)

Return the estimated value for the document Doc. ai denotes the word found in the ith position within Doc.

• positions ← all word positions in Doc that contain tokens found in Vocabulary

• Return vNB where

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i \in positions} P(a_i \mid v_j)$$

Table 6.2
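Below is a compact sketch of the two procedures of Table 6.2, assuming a simple whitespace tokenizer; the three toy training documents are made up purely for illustration.

```python
# Compact sketch of Learn_Naive_Bayes_Text and Classify_Naive_Bayes_Text
# (Table 6.2).  The toy documents and whitespace tokenizer are illustrative;
# a real run would use the full document collection.
from collections import Counter
from math import log

def learn_naive_bayes_text(examples):
    """examples: list of (document_text, target_value) pairs."""
    vocabulary = {w for doc, _ in examples for w in doc.split()}
    priors, word_probs = {}, {}
    for v in {target for _, target in examples}:
        docs_v = [doc for doc, target in examples if target == v]
        priors[v] = len(docs_v) / len(examples)
        text_v = " ".join(docs_v).split()          # Text_j: concatenation of docs_j
        counts = Counter(text_v)
        n = len(text_v)                            # total word positions in Text_j
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary))
                         for w in vocabulary}      # m-estimate with uniform priors
    return vocabulary, priors, word_probs

def classify_naive_bayes_text(doc, vocabulary, priors, word_probs):
    positions = [w for w in doc.split() if w in vocabulary]
    def log_score(v):
        return log(priors[v]) + sum(log(word_probs[v][w]) for w in positions)
    return max(priors, key=log_score)

examples = [("machine learning is fun", "like"),
            ("deep learning models", "like"),
            ("celebrity gossip column", "dislike")]
vocab, priors, word_probs = learn_naive_bayes_text(examples)
print(classify_naive_bayes_text("learning about machine translation",
                                vocab, priors, word_probs))    # 'like'
```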


Experimental Results with Naïve Bayes for Classifying Text

• Problem of classifying news articles
  • 20 electronic newsgroups (Usenet) considered
    • comp.graphics, alt.atheism, etc.
• Target classification:
  • the name of the newsgroup in which the article appeared
  • the task is that of a newsgroup posting service that learns to assign documents to the appropriate newsgroup


Electronic Newsgroups considered in Text Classification Experiment

Table 6.3


Naïve Bayes Text Classification Experimental Results

• Data set
  • 1,000 articles collected from each newsgroup, forming a data set of 20,000 documents
• Naïve Bayes was applied using
  • 2/3 of these 20,000 documents as training examples
  • performance measured over the remaining 1/3


Naïve Bayes Text Classification Experimental Results

• Vocabulary
  • the 100 most frequent words (including “the” and “of”) were removed
  • any word occurring fewer than 3 times was removed
  • the resulting vocabulary consisted of 38,500 words


Naïve Bayes Text Classification Experimental Results

• Accuracy achieved by the program was 89%
  • random guessing would yield 5% accuracy
• Another variant of naïve Bayes: the NewsWeeder system
  • training: the user rates some news articles as interesting
  • based on this user profile, NewsWeeder then suggests subsequent articles of interest to the user
  • NewsWeeder suggests the top 10% of its automatically rated articles each day
  • result: 59% of the articles presented were interesting, as opposed to 16% in the overall pool
  • this is the precision of the system