Machine Learning, Chapter 6 CSE 574, Spring 2003

Bayes Optimal Classifier (6.7)

Bayes Optimal Classification

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$$


Bayes Optimal Classifier

• Instead of asking “What is the most probable hypothesis given the training data?”, ask:
• “What is the most probable classification of the new instance given the training data?”
• Instead of learning the function f_i, the Bayes optimal classifier assigns any given input to the most likely output v_j

[Diagram: inputs x_0, x_1, x_2 are fed to f_i, which produces the output v_j]


Bayes Optimal Classifier

• Instead of learning the function, the Bayes optimal classifier assigns any given input to the most likely output

• Calculate a posteriori probabilities

• P(x0,x1,x2|0) is the class-conditional probability

[Diagram: inputs x_0, x_1, x_2 are fed to f_i]

$$P(0 \mid x_0, x_1, x_2) = \frac{P(x_0, x_1, x_2 \mid 0)\, P(0)}{P(x_0, x_1, x_2)}$$


Example of Bayes Optimal Classifier

x0 x1 x2 | f0 f1 f2 f3 f4 ... f255
 0  0  0 |  0  1  0  1  0 ...    1
 0  0  1 |  0  0  1  1  0 ...    1
 0  1  0 |  0  0  0  0  1 ...    1
 0  1  1 |  0  0  0  0  0 ...    1
 1  0  0 |  0  0  0  0  0 ...    1
 1  0  1 |  0  0  0  0  0 ...    1
 1  1  0 |  0  0  0  0  0 ...    1

$$P(0 \mid x_0, x_1, x_2) = \frac{P(x_0, x_1, x_2 \mid 0)\, P(0)}{P(x_0, x_1, x_2)}
\qquad
P(1 \mid x_0, x_1, x_2) = \frac{P(x_0, x_1, x_2 \mid 1)\, P(1)}{P(x_0, x_1, x_2)}$$


Bayes Optimal Classifier

• To calculate a posteriori probabilities
• Need to know the class-conditional probabilities $P(x_0, x_1, x_2 \mid 0)$ and $P(x_0, x_1, x_2 \mid 1)$
• Each is a table of $2^n$ different probabilities estimated from many training samples

x0 x1 x2 | Prob(0)        x0 x1 x2 | Prob(1)
 0  0  0 | 0.10            0  0  0 | 0.05
 0  0  1 | 0.05            0  0  1 | 0.10
 0  1  0 | 0.10            0  1  0 | 0.25
 0  1  1 | 0.25            0  1  1 | 0.25
 1  0  0 | 0.30            1  0  0 | 0.10
 1  0  1 | 0.10            1  0  1 | 0.10
 1  1  0 | 0.05            1  1  0 | 0.15
 1  1  1 | 0.05            1  1  1 | 0.05


Bayes Optimal Classifier

• Need to know Class-conditional probabilities

• The two tables together have $2 \cdot 2^n$ entries
• Will need many training samples:
  • need to see every instance many times in order to obtain reliable estimates
• When the number of attributes is large, it is impossible to even list all the probabilities in a table

$$P(x_0, x_1, x_2 \mid 0) \qquad P(x_0, x_1, x_2 \mid 1)$$


Bayes Optimal Classifier

• Target function f(x)
  • takes any value from a finite set V, e.g., {0, 1}
  • each instance x is composed of attribute values $x_1, x_2, \ldots, x_n$
• Most probable target value $v_{MAP}$:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid x_1, x_2, \ldots, x_n)
          = \arg\max_{v_j \in V} \frac{P(x_1, x_2, \ldots, x_n \mid v_j)\, P(v_j)}{P(x_1, x_2, \ldots, x_n)}$$


Most Probable Hypothesis vs Most Probable Classification

• The classification result can be different!
• Suppose three hypotheses f_0, f_1, f_2 have posterior probabilities given the training data of .3, .4, .3
• Therefore the MAP hypothesis is f_1
• The instance x = <0,0,0> is classified as 1 by f_1 but as 0 by f_0 and f_2
• P(1|x,D) = P(1|f_0,x)P(f_0|D,x) + P(1|f_1,x)P(f_1|D,x) + P(1|f_2,x)P(f_2|D,x)
           = (0)(.3) + (1)(.4) + (0)(.3) = .4
• Similarly, P(0|x,D) = .6
• Therefore the most probable classification of x is 0

x0 x1 x2 | f0 f1 f2 f3 f4 ... f255
 0  0  0 |  0  1  0  1  0 ...    1
 0  0  1 |  0  0  1  1  0 ...    1
 0  1  0 |  0  0  0  0  1 ...    1
 0  1  1 |  0  0  0  0  0 ...    1
 1  0  0 |  0  0  0  0  0 ...    1
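A minimal sketch, using the numbers from this slide, of how the Bayes optimal classification is computed; the function and variable names are illustrative and not from the text.

```python
# Sketch of Bayes optimal classification for the example above.
# The hypothesis posteriors and per-hypothesis predictions come from the
# slide; the function and variable names are illustrative.

def bayes_optimal_classification(values, posteriors, predictions):
    """Return argmax_v sum_h P(v|h) P(h|D), plus all scores.

    posteriors:  {hypothesis: P(h|D)}
    predictions: {hypothesis: {value: P(v|h)}}
    """
    def score(v):
        return sum(predictions[h][v] * p_h for h, p_h in posteriors.items())
    return max(values, key=score), {v: score(v) for v in values}

# Posteriors P(h|D) for f0, f1, f2 and their (deterministic) predictions for x = <0,0,0>
posteriors = {"f0": 0.3, "f1": 0.4, "f2": 0.3}
predictions = {"f0": {0: 1.0, 1: 0.0},
               "f1": {0: 0.0, 1: 1.0},
               "f2": {0: 1.0, 1: 0.0}}

best, scores = bayes_optimal_classification([0, 1], posteriors, predictions)
print(scores)  # {0: 0.6, 1: 0.4}
print(best)    # 0 -- differs from the prediction of the MAP hypothesis f1
```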


Maximum Likelihood and Least-Squared Error Hypotheses (6.4)

• Bayesian analysis shows that under certain circumstances any learning algorithm that minimizes the squared error between output hypothesis predictions and the training data will output a maximum likelihood hypothesis.


Learning a Real-Valued Function

Figure 6.2


Probability Density Function

$$p(x_0) \equiv \lim_{\epsilon \to 0} \frac{1}{\epsilon}\, P(x_0 \le x < x_0 + \epsilon)$$


Maximum Likelihood Hypothesis

Maximum Likelihood Hypothesis: One that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi)

$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2$$
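A minimal sketch, under the Gaussian-noise setting of Section 6.4, of finding the least-squares (and hence maximum likelihood) linear hypothesis; the target function, noise level, and sample size below are made-up illustrations.

```python
# Sketch: when targets are corrupted by Gaussian noise (Section 6.4), the
# hypothesis minimizing the summed squared error is the maximum likelihood
# hypothesis.  Here a linear hypothesis h(x) = w0 + w1*x is fit by least
# squares; the true function and noise level are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
d = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=x.shape)   # d_i = f(x_i) + e_i

# Minimize sum_i (d_i - (w0 + w1 x_i))^2.
A = np.column_stack([np.ones_like(x), x])
w, residual_sq, _, _ = np.linalg.lstsq(A, d, rcond=None)

print("h_ML(x) = %.2f + %.2f x" % (w[0], w[1]))   # close to 2 + 3x
print("sum of squared errors:", float(residual_sq[0]))
```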


Maximum Likelihood Hypotheses for Predicting Probabilities (6.5)

$$P(D \mid h) = \prod_{i=1}^{m} P(x_i, d_i \mid h) = \prod_{i=1}^{m} P(d_i \mid h, x_i)\, P(x_i)$$


Maximum Likelihood Hypotheses for Predicting Probabilities, continued

$$P(d_i \mid h, x_i) =
\begin{cases}
h(x_i) & \text{if } d_i = 1 \\
1 - h(x_i) & \text{if } d_i = 0
\end{cases}$$

$$P(d_i \mid h, x_i) = h(x_i)^{d_i} \left(1 - h(x_i)\right)^{1 - d_i}$$

$$P(D \mid h) = \prod_{i=1}^{m} h(x_i)^{d_i} \left(1 - h(x_i)\right)^{1 - d_i} P(x_i)$$


Maximum Likelihood Hypotheses for Predicting Probabilities, continued

$$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i} \left(1 - h(x_i)\right)^{1 - d_i} P(x_i)$$

$$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i} \left(1 - h(x_i)\right)^{1 - d_i}$$

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln\left(1 - h(x_i)\right)$$


Gradient Search to Maximize Likelihood in a Neural Network (6.5.1)

$$\frac{\partial G(h, D)}{\partial w_{jk}}
 = \sum_{i=1}^{m} \frac{\partial G(h, D)}{\partial h(x_i)} \cdot \frac{\partial h(x_i)}{\partial w_{jk}}$$

$$= \sum_{i=1}^{m} \frac{\partial \left( d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i)) \right)}{\partial h(x_i)} \cdot \frac{\partial h(x_i)}{\partial w_{jk}}$$

$$= \sum_{i=1}^{m} \frac{d_i - h(x_i)}{h(x_i)\left(1 - h(x_i)\right)} \cdot \frac{\partial h(x_i)}{\partial w_{jk}}$$


Gradient Search to Maximize Likelihood in a Neural Network, continued

$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}, \qquad \Delta w_{jk} = \eta \sum_{i=1}^{m} \left(d_i - h(x_i)\right) x_{ijk}$$

$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}, \qquad \Delta w_{jk} = \eta \sum_{i=1}^{m} h(x_i)\left(1 - h(x_i)\right)\left(d_i - h(x_i)\right) x_{ijk}$$

(The first rule performs gradient ascent on the likelihood G(h, D); the second, shown for comparison, is the corresponding weight-update rule obtained when minimizing the sum of squared errors.)
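A minimal sketch of the first update rule above applied to a single sigmoid unit; the training data, learning rate, and iteration count are arbitrary illustrations.

```python
# Sketch of the first update rule above for a single sigmoid unit:
#   delta_w_jk = eta * sum_i (d_i - h(x_i)) * x_ijk
# which performs gradient ascent on G(h, D).  The training data, learning
# rate, and iteration count are arbitrary illustrations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each row of X is an input x_i (with a constant 1 for the bias weight);
# d holds the targets d_i in {0, 1}.
X = np.array([[1.0, 0.2], [1.0, 0.9], [1.0, 0.4], [1.0, 0.8]])
d = np.array([0.0, 1.0, 0.0, 1.0])

w = np.zeros(X.shape[1])
eta = 0.5
for _ in range(1000):
    h = sigmoid(X @ w)          # h(x_i) for every training example
    w += eta * (d - h) @ X      # batch update: eta * sum_i (d_i - h_i) x_i

print("learned weights:", w)
print("predictions:", sigmoid(X @ w))   # close to the targets d
```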


Minimum Description Length Principle (6.6)

• Occam’s razor:
  • choose the shortest explanation for the observed data
  • used in decision tree design, where the goal was to find the shortest tree
• Here we consider:
  • a Bayesian perspective on this issue
  • a closely related principle called the Minimum Description Length (MDL) principle


Minimum Description Length Principle

• Motivated by interpreting the definition of $h_{MAP}$ using concepts from information theory
• Familiar definition:

$$h_{MAP} = \arg\max_{h \in H} P(D \mid h)\, P(h)$$


Minimum Description Length Principle

$$h_{MAP} = \arg\max_{h \in H} P(D \mid h)\, P(h)$$

• Equivalently, taking logarithms:

$$h_{MAP} = \arg\max_{h \in H} \log_2 P(D \mid h) + \log_2 P(h)$$

• Equivalently, taking negatives:

$$h_{MAP} = \arg\min_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h)$$
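A small numeric sketch, with made-up priors and likelihoods, checking that maximizing $P(D \mid h)P(h)$ and minimizing $-\log_2 P(D \mid h) - \log_2 P(h)$ select the same hypothesis.

```python
# Numeric check, with invented priors and likelihoods, that maximizing
# P(D|h)P(h) and minimizing -log2 P(D|h) - log2 P(h) pick the same hypothesis.
from math import log2

hypotheses = {            # h: (P(h), P(D|h)) -- illustrative values only
    "h1": (0.30, 0.020),
    "h2": (0.10, 0.100),
    "h3": (0.60, 0.005),
}

h_by_max = max(hypotheses, key=lambda h: hypotheses[h][1] * hypotheses[h][0])
h_by_min = min(hypotheses,
               key=lambda h: -log2(hypotheses[h][1]) - log2(hypotheses[h][0]))

print(h_by_max, h_by_min)   # the same hypothesis both times
```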


Minimum Description Length Principle

$$h_{MAP} = \arg\min_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h)$$

• Interpretation of the above equation:
  • assuming a particular representation scheme for encoding hypotheses and data,
  • short hypotheses are to be preferred
  • explanation to follow


Design a compact code to transmit messages at random

• Probability of message i is $p_i$
• Find the code that minimizes the expected number of bits we must transmit to encode a message drawn at random
  • assign shorter codes to more probable messages
• Shannon and Weaver (1949): the optimal code assigns $-\log_2 p_i$ bits to encode message i
• The number of bits needed to encode message i using code C is the description length of message i with respect to C, i.e., $L_C(i)$


Minimum Length encoding

• A Huffman code C optimally assigns shorter codes to more likely symbols

Message i   Code   p_i     Bit length L_C(i)
A           0      0.5     1
B           10     0.25    2
C           110    0.125   3
D           111    0.125   3

• Uniquely decodable


Expected length of a message

• A: code 0, prob 0.5, length 1
• B: code 10, prob 0.25, length 2
• C: code 110, prob 0.125, length 3
• D: code 111, prob 0.125, length 3

• Expected length of a message
• Same as the formula for entropy:

$$-\sum_i p_i \log_2 p_i
 = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{8}\log_2\tfrac{1}{8} + \tfrac{1}{8}\log_2\tfrac{1}{8}\right)
 = 0.5 + 0.5 + 0.75 = 1.75 \text{ bits}$$
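A small sketch confirming, for the A, B, C, D example above, that the optimal code length for each message is $-\log_2 p_i$ and that the expected message length equals the entropy, 1.75 bits.

```python
# The optimal code length for each message is -log2(p_i); for this
# distribution it matches the Huffman code lengths above, and the expected
# message length equals the entropy, 1.75 bits.
from math import log2

p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

lengths = {msg: -log2(prob) for msg, prob in p.items()}
print(lengths)                                         # {'A': 1.0, 'B': 2.0, 'C': 3.0, 'D': 3.0}

expected_length = sum(prob * lengths[msg] for msg, prob in p.items())
print(expected_length)                                 # 1.75 bits
```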


Interpretation of MAP hypothesis in terms of Coding Theory

$$h_{MAP} = \arg\min_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h)$$

$-\log_2 P(h)$ is the description length of h under the optimal encoding for hypothesis space H, i.e., the size of the description of hypothesis h using this optimal representation:

$$-\log_2 P(h) = L_{C_H}(h), \quad \text{where } C_H \text{ is the optimal code for } H$$

$-\log_2 P(D \mid h)$ is the description length of the training data D given hypothesis h, under its optimal encoding:

$$-\log_2 P(D \mid h) = L_{C_{D|h}}(D \mid h), \quad \text{where } C_{D|h} \text{ is the optimal code for describing } D \text{ assuming sender and receiver know hypothesis } h$$


Interpretation of Bayes MAP hypothesis in terms of Coding Theory

$$h_{MAP} = \arg\min_{h \in H} L_{C_H}(h) + L_{C_{D|h}}(D \mid h)$$


Minimum Description Length (MDL) Principle

• Let $C_1$ and $C_2$ be codes used to represent the hypothesis and the data given the hypothesis, respectively
• The MDL principle recommends choosing $h_{MDL}$, where

$$h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D \mid h)$$


Minimum Description Length (MDL) Principle

• If $C_1$ and $C_2$ are chosen optimally, then $h_{MDL} = h_{MAP}$
• Intuitively:
  • MDL recommends the shortest method for re-encoding the training data,
  • where we count the size of the hypothesis and any additional cost of encoding the data given this hypothesis


Gibbs Algorithm (6.8)

• Bayes optimal classification can be costly to apply:
  • it computes the posterior probability for every hypothesis in H
  • it then combines the predictions of each hypothesis to classify each new instance
• The Gibbs algorithm is an alternative, less optimal method:
  • choose a hypothesis h from H at random, according to the posterior probability distribution over H
  • use h to predict the classification of the next instance x


Naïve Bayes Classifier

• Practical Bayesian learning method
• In some domains its performance is comparable to that of neural network and decision tree learning


Naïve Bayes Classifier

• Based on the simplifying assumption that the attribute values are conditionally independent given the class

$$P(x_1, x_2, \ldots, x_n \mid 0) = P(x_1 \mid 0)\, P(x_2 \mid 0) \cdots P(x_n \mid 0) = \prod_i P(x_i \mid 0)$$


Naïve Bayes Classifier

• Class-conditional probabilities assuming statistical independence
• The tables have $2 \cdot 2 \cdot n$ entries ==> much better than $2 \cdot 2^n$ entries

x0 | Prob(x0|0)     x1 | Prob(x1|0)     x2 | Prob(x2|0)
 0 | 0.65            0 | 0.4             0 | 0.15
 1 | 0.35            1 | 0.6             1 | 0.85

x0 | Prob(x0|1)     x1 | Prob(x1|1)     x2 | Prob(x2|1)
 0 | 0.65            0 | 0.4             0 | 0.15
 1 | 0.35            1 | 0.6             1 | 0.85
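A minimal sketch of how the compact tables above are used: under the independence assumption, the class-conditional joint probability of an instance is just the product of the per-attribute entries. The values below are the class-0 entries from this slide.

```python
# With the independence assumption, P(x0, x1, x2 | class) is the product of
# the per-attribute entries above, so only 2*2*n numbers need to be stored
# instead of 2*2^n.  The table values are the class-0 entries from this slide.

p_given_0 = {                 # P(x_i = value | class 0)
    "x0": {0: 0.65, 1: 0.35},
    "x1": {0: 0.40, 1: 0.60},
    "x2": {0: 0.15, 1: 0.85},
}

def joint_given_class(instance, table):
    """P(x0, x1, x2 | class) = prod_i P(x_i | class)."""
    prob = 1.0
    for attr, value in instance.items():
        prob *= table[attr][value]
    return prob

print(joint_given_class({"x0": 0, "x1": 0, "x2": 0}, p_given_0))  # 0.65*0.40*0.15 = 0.039
```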


Naïve Bayes Classifier (6.9)

• Naïve Bayes applies to learning tasks where
  • each instance x is described by a conjunction of attribute values
  • the target function f(x) can take on any value from some finite set V


Naïve Bayes Classifier, continued

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$$

$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)}
          = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$$


Naïve Bayes Classifier, continued

Naïve Bayes Classifier

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$
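As a minimal sketch (not from the slides), the v_NB rule above maps directly onto code. The probability tables are assumed to be supplied as dictionaries; log-probabilities are summed instead of multiplying raw probabilities only to avoid floating-point underflow when there are many attributes.

```python
# Direct transcription of v_NB = argmax_v P(v) * prod_i P(a_i | v).
# Probability tables are passed in as plain dictionaries; logs are summed
# only to avoid floating-point underflow with many attributes.
from math import log

def naive_bayes_classify(attribute_values, priors, conditionals):
    """
    priors:       {v: P(v)}
    conditionals: {v: [ {a: P(a|v)} for each attribute position ]}
    """
    def log_score(v):
        return log(priors[v]) + sum(log(conditionals[v][i][a])
                                    for i, a in enumerate(attribute_values))
    return max(priors, key=log_score)
```

The PlayTennis example on the following slides instantiates exactly this rule, with the probabilities estimated from training data.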


Naïve Bayes: PlayTennis Example

Classify days according to whether someone will play tennis. Given 14 examples:

Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No

Table 3.2


Naïve Bayes: PlayTennis Example

• The task is to predict the target value (yes or no) of the target concept PlayTennis for the new instance

Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong

(PlayTennis training examples from Table 3.2; see above.)


Naïve Bayes: PlayTennis Example

$v_{NB}$ is the target value output by the naïve Bayes classifier. Instantiating the naïve Bayes classifier equation to fit the task, the target value is given by

$$v_{NB} = \arg\max_{v_j \in \{yes,\, no\}} P(v_j) \prod_i P(a_i \mid v_j)$$

$$v_{NB} = \arg\max_{v_j \in \{yes,\, no\}} P(v_j)\,
P(\textit{Outlook} = \textit{sunny} \mid v_j)\,
P(\textit{Temperature} = \textit{cool} \mid v_j)\,
P(\textit{Humidity} = \textit{high} \mid v_j)\,
P(\textit{Wind} = \textit{strong} \mid v_j)$$

where $P(v_j)$ are the prior probabilities and the remaining terms are the conditional probabilities.


Naïve Bayes: PlayTennis Example

$$v_{NB} = \arg\max_{v_j \in \{yes,\, no\}} P(v_j) \prod_i P(a_i \mid v_j)$$

$$v_{NB} = \arg\max_{v_j \in \{yes,\, no\}} P(v_j)\,
P(\textit{Outlook} = \textit{sunny} \mid v_j)\,
P(\textit{Temperature} = \textit{cool} \mid v_j)\,
P(\textit{Humidity} = \textit{high} \mid v_j)\,
P(\textit{Wind} = \textit{strong} \mid v_j)$$

This requires 2 prior probabilities and 8 conditional probabilities.

• Notice that in the final expression a_i has been instantiated using the particular attribute values of the new instance
• To calculate v_NB, we need 10 probabilities that can be estimated from the training data


Estimating Prior Probabilities

• Probabilities of the different target values are estimated from their frequencies over the 14 training examples:

P(PlayTennis = yes) = 9/14 = .64
P(PlayTennis = no) = 5/14 = .36

(PlayTennis training examples from Table 3.2; see above.)


Estimating Conditional Probabilities

• Similarly, we can estimate the conditional probabilities. For example, those for Wind = strong are:

P(Wind = strong | PlayTennis = yes) = 3/9 = .33
P(Wind = strong | PlayTennis = no) = 3/5 = .60

(PlayTennis training examples from Table 3.2; see above.)


Naïve Bayes: PlayTennis Target Values

• Using these and similar probability estimates for the remaining attribute values, v_NB is calculated as follows:

P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = .0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = .0206

• Thus, the naïve Bayes classifier assigns the target value PlayTennis = no to this new instance
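The following sketch reproduces the numbers on this slide directly from the 14 training examples of Table 3.2; only the helper names are invented.

```python
# Reproduce the priors, conditionals, the two products (.0053 and .0206),
# and the normalized probability .795 from the 14 training examples.

# (Outlook, Temp, Humidity, Wind, PlayTennis) for D1..D14
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
new_instance = ("Sunny", "Cool", "High", "Strong")

def prior(v):
    return sum(1 for row in data if row[-1] == v) / len(data)

def conditional(attr_index, value, v):
    rows_v = [row for row in data if row[-1] == v]
    return sum(1 for row in rows_v if row[attr_index] == value) / len(rows_v)

scores = {}
for v in ("Yes", "No"):
    score = prior(v)
    for i, value in enumerate(new_instance):
        score *= conditional(i, value, v)
    scores[v] = score

print({v: round(s, 4) for v, s in scores.items()})              # {'Yes': 0.0053, 'No': 0.0206}
print(round(scores["No"] / (scores["Yes"] + scores["No"]), 3))  # 0.795
```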


PlayTennis: Normalizing Class Probabilities

• By normalizing the above quantities to sum to one, we can calculate the conditional probability that the target value is no, given the observed attribute values
• For the current example, this probability is

$$\frac{.0206}{.0206 + .0053} = .795$$


Estimating Probabilities

• A probability is estimated as the fraction of times the event is observed (n_c) over the total number of observations (n)
  • P(Wind = strong | PlayTennis = no) = 3/5 = .60
• When n_c is very small ==> poor estimate of the probability
  • suppose the true P(Wind = strong | PlayTennis = no) = .08 and we have n = 5; then the most probable value for n_c is 0
  • this yields a biased underestimate of the probability
  • this probability term will dominate, since it multiplies all the other probabilities (an estimate of zero drives the whole product to zero)


Estimating Probabilities with Small Sample Size

• To avoid this problem, use a Bayesian approach: the m-estimate
• m-estimate of probability:

$$\frac{n_c + m\,p}{n + m}$$

• p is a prior estimate of the probability we wish to determine
• m is a constant called the equivalent sample size


Estimating Probabilities with Small Sample Size

• m-estimate of probability:

$$\frac{n_c + m\,p}{n + m}$$

• p is a prior estimate of the probability we wish to determine
  • assume uniform priors: if the attribute has k values, we set p = 1/k
  • if k = 2, then p = .5
• m is the equivalent sample size
  • if m = 0, the m-estimate is equivalent to the simple fraction n_c/n
  • the prior and the observed fraction are combined according to the weight m
  • called the equivalent sample size because the n actual samples are augmented with m virtual samples distributed according to p
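A small sketch of the m-estimate applied to the P(Wind = strong | PlayTennis = no) example from the earlier slide; the choice m = 2 is an arbitrary illustration.

```python
# m-estimate (n_c + m*p) / (n + m), applied to the
# P(Wind = strong | PlayTennis = no) example.  m = 2 is an arbitrary
# illustration; p = 1/k with k = 2 values (strong, weak) gives p = 0.5.

def m_estimate(n_c, n, p, m):
    return (n_c + m * p) / (n + m)

print(m_estimate(n_c=3, n=5, p=0.5, m=0))   # 0.6       -- plain fraction n_c/n
print(m_estimate(n_c=3, n=5, p=0.5, m=2))   # 0.571...  -- pulled toward the prior
print(m_estimate(n_c=0, n=5, p=0.5, m=2))   # 0.142...  -- no longer a hard zero
```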


Bayesian Learning Example: Classifying Text

• Instances are text documents
• Target concept:
  • electronic news articles that I find interesting
  • pages of the world-wide web that discuss machine learning topics
• If a computer could learn the target concept accurately for instances involving text documents,
  • it could automatically filter a large volume of on-line documents and present only the most relevant ones


Text Classification Task

• General setting:
  • the instance space X consists of all possible text documents (i.e., all possible strings of words and punctuation of all possible lengths)
  • we are given training examples of some unknown target function f(x), which can take on any value from some finite set V
• Task:
  • learn to classify future documents as interesting or not interesting to a particular person,
  • using the target values like and dislike to indicate these two classes


Naïve Bayes Example: Learning To Classify Text

• Two main design issues:
  • decide how to represent an arbitrary text document in terms of attribute values
  • decide how to estimate the probabilities required by the naïve Bayes classifier


Approach to Representing Arbitrary Text Documents

• Given a text document, we define
  • an attribute for each word position in the document, and
  • the value of that attribute to be the English word found in that position
  • thus, the paragraph beginning with the sentence “Given a text document ...,” would be described by 111 attribute values, corresponding to its 111 word positions
  • the value of the first attribute is the word “Given”, the value of the second attribute is “a”, etc.
• Note: long text documents require a larger number of attributes than short documents
  • not a problem


Document Classification Task

• We are given a set of training documents that have been classified by a friend
  • 700 classified as dislike
  • 300 classified as like
• Use these to classify new documents


Naïve Bayes Classification of Text

$$v_{NB} = \arg\max_{v_j \in \{like,\, dislike\}} P(v_j) \prod_{i=1}^{111} P(a_i \mid v_j)$$

$$v_{NB} = \arg\max_{v_j \in \{like,\, dislike\}} P(v_j)\,
P(a_1 = \text{“given”} \mid v_j)\,
P(a_2 = \text{“a”} \mid v_j) \cdots
P(a_{111} = \text{“problem”} \mid v_j)$$

• Naïve Bayes Classification vNB is the classification that maximizes the probability of observing the words that were actually found in the document


Text Classification: independence assumption

• The independence assumption states that the word probabilities for one text position are independent of the words that occur in other positions, given the document classification v_j
• The assumption is incorrect
  • e.g., the probability of observing “learning” is higher if the preceding word is “machine”
• Without making the assumption, however, a prohibitive number of probability terms would be involved
• The naïve Bayes learner is nonetheless known to perform well in text classification problems


Estimating Probability Terms

• To calculate v_NB we need
  • the prior probability terms P(v_j)
    • easy: P(like) = .3 and P(dislike) = .7
  • the conditional probability terms P(a_i = w_k | v_j)
    • w_k is the kth word in the English vocabulary, e.g., P(a_1 = “given” | dislike)
    • difficult: we need one probability term for each combination of text position (111), English word (50,000), and target value (2), which implies roughly 10^7 probabilities


Reducing Number of Probability Terms

• Assume positional independence
  • the probability of encountering a specific word w_k is independent of the specific word position being encountered (a_23 versus a_95)
• This amounts to assuming that the attributes are independent and identically distributed
  • P(a_i = w_k | v_j) = P(a_m = w_k | v_j) for all i, j, k, m


Reducing Number of Probability Terms

• Replace the entire set of probabilities P(a_1 = w_k | v_j), P(a_2 = w_k | v_j), ... by the single position-independent probability P(w_k | v_j)
  • use P(w_k | v_j) regardless of word position
• Now only 2 x 50,000 = 10^5 terms are required
• When training data is limited, this
  • increases the number of samples available to estimate each required probability
  • increases the reliability of the estimates


Text Classification: Estimating Probability Terms

• Using the m-estimate with uniform priors and with m equal to the size of the word vocabulary, the estimate for $P(w_k \mid v_j)$ will be

$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |Vocabulary|}$$

• where n is the total number of word positions in all training examples whose target value is v_j
• n_k is the number of times word w_k is found among these n word positions


Naïve Bayes Algorithm for Learning and Classifying Text

Learn_Naive_Bayes_Text (Examples, V)

Examples is a set of text documents along with their target values. V is the set of all possible target values. This function learns the probability terms P(wk|vj), describing the probability that a randomly drawn word from a document in class vj will be the English word wk. It also learns the class prior probabilities P(vj).


Learn_Naive_Bayes_Text (Examples, V)

Collect all words, punctuation, and other tokens that occur in Examples

• Vocabulary ← the set of all distinct words and other tokens occurring in any text document from Examples


Learn_Naive_Bayes_Text (Examples, V)

Calculate the required P(vj) and P(wk|vj) probability terms

• For each target value v_j in V, do
  • docs_j ← the subset of documents from Examples for which the target value is v_j
  • $P(v_j) \leftarrow \dfrac{|docs_j|}{|Examples|}$
  • Text_j ← a single document created by concatenating all members of docs_j
  • n ← total number of distinct word positions in Text_j
  • for each word w_k in Vocabulary
    • n_k ← number of times word w_k occurs in Text_j
    • $P(w_k \mid v_j) \leftarrow \dfrac{n_k + 1}{n + |Vocabulary|}$


Classify_Naïve_Bayes_Text (Doc)

Return the estimated value for the document Doc. ai denotes the word found in the ith position within Doc.

• positions ← all word positions in Doc that contain tokens found in Vocabulary

• Return vNB where

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i \in positions} P(a_i \mid v_j)$$

Table 6.2
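Below is a compact sketch of the two procedures of Table 6.2, assuming a simple whitespace tokenizer; the three toy training documents are made up purely for illustration.

```python
# Compact sketch of Learn_Naive_Bayes_Text and Classify_Naive_Bayes_Text
# (Table 6.2).  The toy documents and whitespace tokenizer are illustrative;
# a real run would use the full document collection.
from collections import Counter
from math import log

def learn_naive_bayes_text(examples):
    """examples: list of (document_text, target_value) pairs."""
    vocabulary = {w for doc, _ in examples for w in doc.split()}
    priors, word_probs = {}, {}
    for v in {target for _, target in examples}:
        docs_v = [doc for doc, target in examples if target == v]
        priors[v] = len(docs_v) / len(examples)
        text_v = " ".join(docs_v).split()          # Text_j: concatenation of docs_j
        counts = Counter(text_v)
        n = len(text_v)                            # total word positions in Text_j
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary))
                         for w in vocabulary}      # m-estimate with uniform priors
    return vocabulary, priors, word_probs

def classify_naive_bayes_text(doc, vocabulary, priors, word_probs):
    positions = [w for w in doc.split() if w in vocabulary]
    def log_score(v):
        return log(priors[v]) + sum(log(word_probs[v][w]) for w in positions)
    return max(priors, key=log_score)

examples = [("machine learning is fun", "like"),
            ("deep learning models", "like"),
            ("celebrity gossip column", "dislike")]
vocab, priors, word_probs = learn_naive_bayes_text(examples)
print(classify_naive_bayes_text("learning about machine translation",
                                vocab, priors, word_probs))    # 'like'
```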


Experimental Results with Naïve Bayes for Classifying Text

• Problem of classifying news articles
  • 20 electronic newsgroups (Usenet) considered
    • comp.graphics, alt.atheism, etc.
• Target classification:
  • the name of the newsgroup in which the article appeared
  • the task is that of a newsgroup posting service that learns to assign documents to the appropriate newsgroup


Electronic Newsgroups considered in Text Classification Experiment

Table 6.3


Naïve Bayes Text Classification Experimental Results

• Data set
  • 1,000 articles collected from each newsgroup, forming a data set of 20,000 documents
• Naïve Bayes was applied using
  • 2/3 of these 20,000 documents as training examples
  • performance measured over the remaining 1/3


Naïve Bayes Text Classification Experimental Results

• Vocabulary
  • the 100 most frequent words (including “the” and “of”) were removed
  • any word occurring fewer than 3 times was removed
  • the resulting vocabulary consisted of 38,500 words


Naïve Bayes Text Classification Experimental Results

• Accuracy achieved by the program was 89%
  • random guessing would yield 5% accuracy
• Another variant of naïve Bayes: the NewsWeeder system
  • training: the user rates some news articles as interesting
  • based on this user profile, NewsWeeder then suggests subsequent articles of interest to the user
  • NewsWeeder suggests the top 10% of its automatically rated articles each day
  • result: 59% of the articles presented were interesting, as opposed to 16% in the overall pool
  • this is the precision of the system