Data Mining and Machine Learning: Naive Bayes
David Corne, HWU


A very simple dataset: one field, one class

P34 level   Prostate cancer
High        Y
Medium      Y
Low         Y
Low         N
Low         N
Medium      N
High        Y
High        N
Low         N
Medium      Y

A new patient has a blood test: his P34 level is High. What is our best guess for prostate cancer?

It's useful to know P(cancer = Y). On the basis of this tiny dataset, P(c = Y) is 5/10 = 0.5. So, with no other information, you'd expect P(cancer = Y) to be 0.5.

But we know that P34 = H, so actually we want P(cancer = Y | P34 = H): the probability that cancer is Y, given that P34 is High. Of the three patients with P34 = High, two have cancer = Y, so this is 2/3, about 0.67. So we have:

P(c = Y | P34 = H) = 0.67
P(c = N | P34 = H) = 0.33

The class value with the highest probability is our best guess.

In general we may have any number of class values. Suppose the class field can also take the value "Maybe":

P34 level   Prostate cancer
High        Y
Medium      Y
Low         Y
Low         N
Low         N
Medium      N
High        Y
High        N
High        Maybe
Medium      Y

Suppose again we know that P34 is High; here we have:

P(c = Y | P34 = H) = 0.5
P(c = N | P34 = H) = 0.25
P(c = Maybe | P34 = H) = 0.25

and again, Y is the winner.

That is the essence of Naive Bayes. But the probability calculations are much trickier when there is more than one field, so we make a "naive" assumption that makes things simpler.

Bayes' theorem

As we saw, we can compute P(cancer = Y | P34 = H) directly from the table. Now consider the reverse conditional, P(P34 = H | cancer = Y). This is a different thing: of the five patients with cancer = Y, two have P34 = High, so it turns out as 2/5 = 0.4.

Bayes' theorem is this:

P(A | B) = P(B | A) P(A) / P(B)

It is very useful when it is hard to get P(A | B) directly, but easier to get the things on the right.

Bayes' theorem in a one-non-class-field DMML context:

P(Class = X | Fieldval = F) = P(Fieldval = F | Class = X) P(Class = X) / P(Fieldval = F)

We want to evaluate this for each class and choose the class that gives the highest value. E.g. we compare:

P(Fieldval | Yes) P(Yes)
P(Fieldval | No) P(No)
P(Fieldval | Maybe) P(Maybe)

... and we can ignore P(Fieldval = F). Why? Because it is the same in every one of these expressions, so it cannot change which class wins.
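To make the counting concrete, here is a minimal Python sketch of the one-field case. This is an illustration of the recipe above, not the course's nb.awk script; the dataset literal and the predict helper are just assumed names for this example.

```python
from collections import Counter

# The toy one-field dataset from above: (P34 level, prostate cancer).
data = [("High", "Y"), ("Medium", "Y"), ("Low", "Y"), ("Low", "N"),
        ("Low", "N"), ("Medium", "N"), ("High", "Y"), ("High", "N"),
        ("Low", "N"), ("Medium", "Y")]

def predict(fieldval, data):
    """Score each class by P(Fieldval | Class) * P(Class); biggest wins."""
    classes = Counter(c for _, c in data)      # class counts: {"Y": 5, "N": 5}
    n = len(data)
    scores = {}
    for c, count in classes.items():
        prior = count / n                      # P(Class = c)
        likelihood = sum(1 for f, cc in data
                         if cc == c and f == fieldval) / count  # P(Fieldval = F | c)
        scores[c] = likelihood * prior         # numerator of Bayes' theorem
    return max(scores, key=scores.get), scores

print(predict("High", data))
# -> ('Y', {'Y': 0.2, 'N': 0.1}); note 0.2 / (0.2 + 0.1) = 0.67, as before
```

Dividing a score by the sum of all scores recovers the true posterior (0.67 here), which is exactly why the shared denominator P(Fieldval = F) can be ignored when we only want the winner.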
And that was exactly how we do Naive Bayes for a 1-field dataset. Note how this relates to our beloved histograms: the class-conditional histograms over Low / Medium / High are precisely the quantities we need, e.g. the Low bar of the N histogram is P(Low | N), and the High bar of the Y histogram is P(High | Y).

Naive Bayes with many fields

P34 level   P61 level   BMI      Prostate cancer
High        Low         Medium   Y
Medium      Low         Medium   Y
Low         Low         High     Y
Low         High        Low     N
Low         Low         Medium   N
Medium      Medium      Low      N
High        Low         Medium   Y
High        Medium      Low      N
High        Low         High     N
Medium      Low         High     Y

New patient: P34 = M, P61 = M, BMI = H. Best guess at the cancer field? We compare:

P(p34=M | Y) P(p61=M | Y) P(BMI=H | Y) P(cancer = Y)
P(p34=M | N) P(p61=M | N) P(BMI=H | N) P(cancer = N)

and ask which of these gives the highest value. Plugging in the numbers:

0.4 x 0 x 0.4 x 0.5 = 0
0.2 x 0.4 x 0.2 x 0.5 = 0.008

In practice, we finesse the zeroes (replacing 0 by a tiny value such as 0.001) and use logs (note: log(ABCD) = log(A) + log(B) + ...):

log(0.4) + log(0.001) + log(0.4) + log(0.5) = -9.43
log(0.2) + log(0.4)   + log(0.2) + log(0.5) = -4.83

(natural logs). Which of these gives the highest value? The second, so our best guess is N.
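Here is a sketch of the many-field, log-space calculation in Python. Again this is an illustration rather than the course's nb.awk; the epsilon of 0.001 follows the worked example above, and the function and variable names are assumptions for this sketch.

```python
import math
from collections import Counter

# The toy three-field dataset from above: (P34, P61, BMI, cancer).
rows = [("High","Low","Medium","Y"), ("Medium","Low","Medium","Y"),
        ("Low","Low","High","Y"),    ("Low","High","Low","N"),
        ("Low","Low","Medium","N"),  ("Medium","Medium","Low","N"),
        ("High","Low","Medium","Y"), ("High","Medium","Low","N"),
        ("High","Low","High","N"),   ("Medium","Low","High","Y")]

EPS = 0.001  # finesse zero probabilities, as in the worked example

def nb_predict(instance, rows):
    """Return the class maximising sum_i log P(F_i = v_i | c) + log P(c)."""
    class_counts = Counter(r[-1] for r in rows)
    n = len(rows)
    best_class, best_score = None, -math.inf
    for c, count in class_counts.items():
        score = math.log(count / n)                  # log P(c)
        for i, v in enumerate(instance):
            match = sum(1 for r in rows if r[-1] == c and r[i] == v)
            score += math.log(match / count or EPS)  # log P(F_i = v | c); 0 -> EPS
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(nb_predict(("Medium", "Medium", "High"), rows))  # -> 'N'
```

Running it reproduces the hand calculation: the Y score is log(0.5) + log(0.4) + log(0.001) + log(0.4) = -9.43, the N score is -4.83, so N wins.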
Naive Bayes in general

n fields, q possible class values, and a new unclassified instance: F1 = v1, F2 = v2, ..., Fn = vn. What is the class value, i.e. is it c1, c2, ... or cq? Calculate each of these q things; the biggest one gives the class:

P(F1=v1 | c1) P(F2=v2 | c1) ... P(Fn=vn | c1) P(c1)
P(F1=v1 | c2) P(F2=v2 | c2) ... P(Fn=vn | c2) P(c2)
...
P(F1=v1 | cq) P(F2=v2 | cq) ... P(Fn=vn | cq) P(cq)

As indicated, what we normally do when there are more than a handful of fields is calculate:

log(P(F1=v1 | c1)) + ... + log(P(Fn=vn | c1)) + log(P(c1))
log(P(F1=v1 | c2)) + ... + log(P(Fn=vn | c2)) + log(P(c2))
...

and choose the class based on the highest of these. Because? Because log(a b c ...) = log(a) + log(b) + log(c) + ..., and this means we won't get underflow errors, which we would otherwise get with, e.g., 100 fields: a product of 100 small probabilities is too small to represent in floating point.

Deriving NB

The essence of Naive Bayes, with one non-class field, is to calculate this for each class value, given some new instance with fieldval = F:

P(class = C | Fieldval = F)

For many fields, our new instance is (e.g.) (F1, F2, ..., Fn), and the essence of Naive Bayes is to calculate this for each class:

P(class = C | F1, F2, F3, ..., Fn)

i.e. what is the probability of class C, given all these field values together? Apply Bayes' theorem, and, if we make the naive assumption that all of the fields are independent of each other (e.g. P(F1 | F2) = P(F1), etc.), then

P(class = C | F1 and F2 and F3 and ... Fn)
  is proportional to  P(F1 and F2 and ... and Fn | C) x P(C)
  =  P(F1 | C) x P(F2 | C) x ... x P(Fn | C) x P(C)

which is what we calculate in NB.

Confusion and CW2

"80% accuracy" means our classifier predicts the correct class 80% of the time, right? It could be with this confusion matrix (rows are the true class, columns the predicted class, 100 instances per class):

True \ Pred    A    B    C
A             80   10   10
B             10   80   10
C              5   15   80

Or this one:

True \ Pred    A    B    C
A            100    0    0
B              0  100    0
C             60    0   40

Both give 80% overall accuracy, but the individual class accuracies differ: 80%, 80%, 80% in the first case; 100%, 100%, 40% in the second. The mean class accuracy is 80% in both, yet the second classifier is nearly useless on class C.

With unbalanced classes, take care in the interpretation of accuracy figures. Suppose the individual class accuracies are 100% (A), 30% (B) and 80% (C), and the classes are unbalanced, say 100 As, 100 Bs and 500 Cs. Then the overall accuracy is (100 + 30 + 400) / 700 = 75.7%, while the mean class accuracy is only 70%.

It can also go the other way. Suppose the individual class accuracies are 100% (A), 20% (B) and 100% (C), with, say, 100 As, 10,000 Bs and 100 Cs. The mean class accuracy is 73.3%, but the overall accuracy is (100 + 2,000 + 100) / 10,200 = 21.6%.

Example: classifying agricultural land use from satellite image data:
https://tidsskrift.dk/index.php/geografisktidsskrift/article/view/2519/4475
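A small sketch of the two accuracy notions in Python, applied to the matrices above (the function name and the off-diagonal counts in the unbalanced example are illustrative assumptions consistent with the stated class accuracies):

```python
def accuracies(confusion):
    """confusion[i][j] = number of class-i instances predicted as class j."""
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    overall = (sum(row[i] for i, row in enumerate(confusion))
               / sum(sum(row) for row in confusion))
    mean_class = sum(per_class) / len(per_class)
    return overall, mean_class, per_class

# Second matrix above: 80% overall, but class C is only 40% accurate.
print(accuracies([[100, 0, 0], [0, 100, 0], [60, 0, 40]]))
# -> (0.8, 0.8, [1.0, 1.0, 0.4])

# Unbalanced example: 100 As, 100 Bs, 500 Cs with class accuracies 1.0, 0.3, 0.8.
print(accuracies([[100, 0, 0], [35, 30, 35], [50, 50, 400]]))
# -> (0.757..., 0.7, [1.0, 0.3, 0.8])
```

With equal class sizes, overall accuracy and mean class accuracy coincide; it is exactly when classes are unbalanced that the two can diverge badly, which is the point of the examples above.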
CW2 datasets

In this coursework you will work with a handwritten digit recognition dataset. Get it from my site; also pick up ten other versions of the same dataset, as follows: [etc...]

[Figure: two example lines from opt2.txt, one that is not a 2 and one that is a 2.]

In opt2.txt, which fields correlate well with the class field? Which fields will tend to have high values when the class is 1? Which will tend to be low when the class is 1?

1. Produce a version of optall.txt that has the instances in a randomised order.
2. Run my naive Bayes awk script on the resulting version of optall.txt.
3. Implement a program or script that allows you to work out the correlation between any two fields (a sketch is given after this list).
4. Using your program, find out the correlation between each field and the class field, using only the first 50% of instances in the data file, for all the optN files. For each optN.txt file, keep a record of the top five fields, in order of absolute correlation value.

To run my nb.awk on optall.txt:

your-linux-prompt: awk -f nb.awk < optall.txt

In summary:
- Implement Pearson's correlation somehow.
- Work out correlations between fields and class for each of the optN.txt files.
- Select the best fields from each of those experiments.
- Build new versions of optall.txt using those fields.
- Run my nb.awk again.

Next: feature selection.
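For task 3, here is a minimal sketch of Pearson's correlation in Python. The coursework only asks for "a program or script", so nothing here is prescribed; the file name, the comma-separated layout with the class in the last column, and the choice of field 12 are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists.
    Raises ZeroDivisionError if either list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # in [-1, 1]

# E.g. correlation between field 12 and the class field, using only the
# first 50% of instances, assuming comma-separated values, class last:
rows = [line.split(",") for line in open("opt2.txt")]
half = rows[:len(rows) // 2]
field = [float(r[12]) for r in half]
klass = [float(r[-1]) for r in half]
print(pearson(field, klass))
```

Ranking all fields by the absolute value of this coefficient then gives the "top five fields, in order of absolute correlation value" asked for in task 4.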