7/30/2019 Classification2_DMM
Data Mining
Jalali.mshdiau.ac.ir
Bayesian Classification
Bayesian Learning (Naïve Bayesian Classifier)
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: based on Bayes' theorem
Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers
Incremental: Each training example can incrementally increase/decrease the probability that
a hypothesis is correct
When to use:
Moderate or large training set available
Attributes that describe instances are conditionally independent given classification
Successful applications:
Diagnosis
Classifying text documents
Bayes' Theorem: Basics
Let X be a data sample ("evidence"): the class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability
E.g., X will buy a computer, regardless of age, income, ...
P(X): the probability that the sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
E.g., given that X will buy a computer, the probability that X is aged 31..40 with medium income
Bayes' Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

    P(H|X) = P(X|H) · P(H) / P(X)

Predicts that X belongs to Ci if the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
Practical difficulty: requires initial knowledge of many probabilities, and incurs significant computational cost
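As a quick numeric sketch of Bayes' theorem above (all three input numbers are made-up illustration values, not from the slides):

```python
# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
# The prior, likelihood, and evidence below are hypothetical illustration values.
p_h = 0.6          # prior P(H), e.g. H = "customer buys a computer"
p_x_given_h = 0.3  # likelihood P(X|H)
p_x = 0.2          # evidence P(X)

p_h_given_x = p_x_given_h * p_h / p_x  # posterior P(H|X)
print(round(p_h_given_x, 3))  # 0.9
```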
Towards the Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn)
Suppose there are m classes C1, C2, ..., Cm
Classification is to derive the maximum a posteriori, i.e., the maximal P(Ci|X)
This can be derived from Bayes' theorem:

    P(Ci|X) = P(X|Ci) · P(Ci) / P(X)

Since P(X) is constant for all classes, only P(X|Ci) · P(Ci) needs to be maximized
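A minimal sketch of this MAP decision rule (the class priors are the 9/14 vs. 5/14 split used later in the slides; the likelihood values are hypothetical):

```python
# MAP rule: predict the class Ci that maximizes P(X|Ci) * P(Ci);
# P(X) is identical for every class, so it can be dropped.
priors = {"yes": 0.643, "no": 0.357}       # P(Ci): the 9/14 vs. 5/14 split from the example
likelihoods = {"yes": 0.044, "no": 0.019}  # hypothetical P(X|Ci) values for one sample X

scores = {c: likelihoods[c] * priors[c] for c in priors}
prediction = max(scores, key=scores.get)
print(prediction)  # yes
```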
Derivation of the Naïve Bayes Classifier
A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):

    P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) · P(x2|Ci) · ... · P(xn|Ci)

This greatly reduces the computation cost: only the class distribution needs to be counted
If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

    g(x, μ, σ) = (1 / (√(2π) · σ)) · e^(−(x − μ)² / (2σ²))

and P(xk|Ci) is

    P(xk|Ci) = g(xk, μCi, σCi)
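A small sketch of the Gaussian likelihood for a continuous attribute (the sample value and the class mean and standard deviation below are hypothetical):

```python
import math

def gaussian(x, mu, sigma):
    """Density g(x, mu, sigma) of a Gaussian, as used for continuous attributes."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical class statistics for a continuous attribute such as age within class Ci:
mu_ci, sigma_ci = 38.0, 12.0
p_xk_given_ci = gaussian(35.0, mu_ci, sigma_ci)
print(round(p_xk_given_ci, 4))  # 0.0322
```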
Naïve Bayesian Classifier: Training Dataset

Class:
C1: buys_computer = yes
C2: buys_computer = no

[The data sample X and the 14-tuple training table (age, income, student, credit_rating, buys_computer) were lost in extraction; only the last two table rows survive: (>40, low, yes, excellent, no) and (31..40, low, yes, excellent, yes)]
Naïve Bayesian Classifier: An Example

P(Ci): P(buys_computer = yes) = 9/14 = 0.643
       P(buys_computer = no)  = 5/14 = 0.357

Compute P(X|Ci) for each class [the per-attribute conditional probabilities were truncated in extraction]
Avoiding the 0-Probability Problem

Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability

    P(X|Ci) = ∏(k=1..n) P(xk|Ci)

will be zero
Ex. Suppose a dataset with 1000 tuples: income = low (0 tuples), income = medium (990 tuples), and income = high (10 tuples)
Use the Laplacian correction (or Laplacian estimator): add 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
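The Laplacian correction can be sketched directly from the slide's counts:

```python
# Laplacian correction: add 1 to each value's count so that no estimate is zero.
counts = {"low": 0, "medium": 990, "high": 10}  # income counts from the slide
n = sum(counts.values())                        # 1000 tuples
q = len(counts)                                 # 3 distinct income values

smoothed = {v: (c + 1) / (n + q) for v, c in counts.items()}
print(smoothed["low"])     # 1/1003
print(smoothed["medium"])  # 991/1003
print(smoothed["high"])    # 11/1003
```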
Naïve Bayesian Classifier: Comments

Advantages:
Easy to implement
Good results obtained in most of the cases

Disadvantages:
Assumption of class-conditional independence, hence loss of accuracy
Practically, dependencies exist among variables
E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by a naïve Bayesian classifier

How to deal with these dependencies? Bayesian belief networks
Bayesian Belief Networks

A Bayesian belief network allows a subset of the variables to be conditionally independent (also called Bayes nets, directed graphical models, BNs, ...)
A graphical model of causal relationships:
Represents dependencies among the variables
Gives a specification of the joint probability distribution

[Figure: a four-node network over X, Y, Z, and P]
Nodes: variables
Links: dependencies
X and Y are the parents of Z, and Y is the parent of P
There is no dependency between Z and P
The graph has no loops or cycles
Bayesian Belief Networks

Simple (naïve) Bayesian: attributes X1, X2, ..., Xn all have the single parent C (the concept):

    P(x1, x2, ..., xn, c) = P(c) · P(x1|c) · P(x2|c) · ... · P(xn|c)

Bayesian belief network: e.g., attributes X1, X2, X3, X4 with concept C, where X3 also depends on X1 and X2:

    P(x1, x2, x3, x4, c) = P(c) · P(x1|c) · P(x2|c) · P(x3|x1, x2, c) · P(x4|c)
Bayesian Belief Network: An Example

[Figure: a belief network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer]

The conditional probability table (CPT) for the variable LungCancer:

          (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
    LC      0.8       0.5        0.7        0.1
    ~LC     0.2       0.5        0.3        0.9

The CPT shows the conditional probability for each possible combination of values of its parents
Derivation of the probability of a particular combination of values of X, from the CPT:

    P(x1, ..., xn) = ∏(i=1..n) P(xi | Parents(Yi))

Node = variable
Arc = dependency
The direction of an arc represents causality
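A minimal sketch of reading this CPT in code (the dictionary layout is an assumed representation; the probabilities are the values from the table above):

```python
# CPT for LungCancer (LC), keyed by the parent combination (FamilyHistory, Smoker);
# each entry is P(LC = true | parents), read off the table above.
cpt_lc = {
    (True, True): 0.8,    # (FH, S)
    (True, False): 0.5,   # (FH, ~S)
    (False, True): 0.7,   # (~FH, S)
    (False, False): 0.1,  # (~FH, ~S)
}

def p_lung_cancer(lc, family_history, smoker):
    """P(LC = lc | FamilyHistory, Smoker); P(~LC | ...) = 1 - P(LC | ...)."""
    p_true = cpt_lc[(family_history, smoker)]
    return p_true if lc else 1.0 - p_true

print(p_lung_cancer(True, True, False))           # 0.5
print(round(p_lung_cancer(False, False, True), 2))  # 0.3 (column ~FH, S; row ~LC)
```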
Text Classification

Each document is a vector of the frequencies of all the words in the document:

    Word   sqrt  bit  chip  log  web  limit  cpu  equal  class
    Doc1     42   25     7   56    2     66    1     89   Math
    Doc2     10   28    45    2   69      8   56     40   Comp
    Doc3     11   25    22    4   40      9   44     34   Comp
    Doc4     33   40     8   48    3     40    4     77   Math
    Doc5     28   32     9   60    6     29    7     60   Math
    Doc6      8   22    30    1   29      3   40     33   Comp

    Class   Meaning
    Math    Maths related
    Comp    Computers related

Classify x = (sqrt=25, bit=30, chip=25, web=15, limit=20, cpu=12, equal=30)
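The slide leaves the classification of x unworked; a multinomial naïve Bayes sketch with Laplacian correction (one reasonable model for word-frequency vectors, assumed here rather than specified by the slide) might look like:

```python
import math

# Word-frequency table from the slide; columns follow the header order.
words = ["sqrt", "bit", "chip", "log", "web", "limit", "cpu", "equal"]
docs = [
    ([42, 25, 7, 56, 2, 66, 1, 89], "Math"),
    ([10, 28, 45, 2, 69, 8, 56, 40], "Comp"),
    ([11, 25, 22, 4, 40, 9, 44, 34], "Comp"),
    ([33, 40, 8, 48, 3, 40, 4, 77], "Math"),
    ([28, 32, 9, 60, 6, 29, 7, 60], "Math"),
    ([8, 22, 30, 1, 29, 3, 40, 33], "Comp"),
]
x = {"sqrt": 25, "bit": 30, "chip": 25, "web": 15, "limit": 20, "cpu": 12, "equal": 30}

log_scores = {}
for c in ("Math", "Comp"):
    rows = [v for v, label in docs if label == c]
    totals = [sum(col) for col in zip(*rows)]  # per-word counts within class c
    denom = sum(totals) + len(words)           # Laplacian correction denominator
    score = math.log(len(rows) / len(docs))    # log prior P(c)
    for w, freq in x.items():
        p_w = (totals[words.index(w)] + 1) / denom  # smoothed P(w|c)
        score += freq * math.log(p_w)               # frequencies as repeated draws
    log_scores[c] = score

print(max(log_scores, key=log_scores.get))  # Comp
```

Log-probabilities are summed rather than multiplying the raw probabilities, to avoid numeric underflow over many word occurrences; x's high chip, cpu, and web counts pull it toward Comp.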