7/30/2019 Classification2_DMM
Data Mining
Jalali.mshdiau.ac.ir
Bayesian Classification
Bayesian Learning (Naïve Bayesian Classifier)
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: based on Bayes' theorem
Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers
Incremental: Each training example can incrementally increase/decrease the probability that
a hypothesis is correct
When to use:
Moderate or large training set available
Attributes that describe instances are conditionally independent given classification
Successful applications:
Diagnosis
Classifying text documents
Bayes' Theorem: Basics
Let X be a data sample ("evidence"): the class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability
E.g., X will buy a computer, regardless of age, income, ...
P(X): the probability that the sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
E.g., given that X will buy a computer, the probability that X is aged 31..40 with medium income
Bayes' Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

    P(H|X) = P(X|H) · P(H) / P(X)

Predicts that X belongs to Ci if the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
Practical difficulty: requires initial knowledge of many probabilities, and incurs significant computational cost
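As a quick numeric sketch of Bayes' theorem above (all three input numbers are made-up illustration values, not from the slides):

```python
# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
# The prior, likelihood, and evidence below are hypothetical illustration values.
p_h = 0.6          # prior P(H), e.g. H = "customer buys a computer"
p_x_given_h = 0.3  # likelihood P(X|H)
p_x = 0.2          # evidence P(X)

p_h_given_x = p_x_given_h * p_h / p_x  # posterior P(H|X)
print(round(p_h_given_x, 3))  # 0.9
```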
Towards the Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn)
Suppose there are m classes C1, C2, ..., Cm
Classification is to derive the maximum a posteriori, i.e., the maximal P(Ci|X)
This can be derived from Bayes' theorem:

    P(Ci|X) = P(X|Ci) · P(Ci) / P(X)

Since P(X) is constant for all classes, only P(X|Ci) · P(Ci) needs to be maximized
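A minimal sketch of this MAP decision rule (the class priors are the 9/14 vs. 5/14 split used later in the slides; the likelihood values are hypothetical):

```python
# MAP rule: predict the class Ci that maximizes P(X|Ci) * P(Ci);
# P(X) is identical for every class, so it can be dropped.
priors = {"yes": 0.643, "no": 0.357}       # P(Ci): the 9/14 vs. 5/14 split from the example
likelihoods = {"yes": 0.044, "no": 0.019}  # hypothetical P(X|Ci) values for one sample X

scores = {c: likelihoods[c] * priors[c] for c in priors}
prediction = max(scores, key=scores.get)
print(prediction)  # yes
```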
Derivation of the Naïve Bayes Classifier
A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):

    P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) · P(x2|Ci) · ... · P(xn|Ci)

This greatly reduces the computation cost: only the class distribution needs to be counted
If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

    g(x, μ, σ) = (1 / (√(2π) · σ)) · e^(−(x − μ)² / (2σ²))

and P(xk|Ci) is

    P(xk|Ci) = g(xk, μCi, σCi)
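A small sketch of the Gaussian likelihood for a continuous attribute (the sample value and the class mean and standard deviation below are hypothetical):

```python
import math

def gaussian(x, mu, sigma):
    """Density g(x, mu, sigma) of a Gaussian, as used for continuous attributes."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical class statistics for a continuous attribute such as age within class Ci:
mu_ci, sigma_ci = 38.0, 12.0
p_xk_given_ci = gaussian(35.0, mu_ci, sigma_ci)
print(round(p_xk_given_ci, 4))  # 0.0322
```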
Naïve Bayesian Classifier: Training Dataset

Class:
C1: buys_computer = yes
C2: buys_computer = no

[The data sample X and the 14-tuple training table (age, income, student, credit_rating, buys_computer) were lost in extraction; only the last two table rows survive: (>40, low, yes, excellent, no) and (31..40, low, yes, excellent, yes)]
Naïve Bayesian Classifier: An Example

P(Ci): P(buys_computer = yes) = 9/14 = 0.643
       P(buys_computer = no)  = 5/14 = 0.357

Compute P(X|Ci) for each class [the per-attribute conditional probabilities were truncated in extraction]
Avoiding the 0-Probability Problem

Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability

    P(X|Ci) = ∏(k=1..n) P(xk|Ci)

will be zero
Ex. Suppose a dataset with 1000 tuples: income = low (0 tuples), income = medium (990 tuples), and income = high (10 tuples)
Use the Laplacian correction (or Laplacian estimator): add 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
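The Laplacian correction can be sketched directly from the slide's counts:

```python
# Laplacian correction: add 1 to each value's count so that no estimate is zero.
counts = {"low": 0, "medium": 990, "high": 10}  # income counts from the slide
n = sum(counts.values())                        # 1000 tuples
q = len(counts)                                 # 3 distinct income values

smoothed = {v: (c + 1) / (n + q) for v, c in counts.items()}
print(smoothed["low"])     # 1/1003
print(smoothed["medium"])  # 991/1003
print(smoothed["high"])    # 11/1003
```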
Naïve Bayesian Classifier: Comments

Advantages:
Easy to implement
Good results obtained in most of the cases

Disadvantages:
Assumption of class-conditional independence, hence loss of accuracy
Practically, dependencies exist among variables
E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by a naïve Bayesian classifier

How to deal with these dependencies? Bayesian belief networks
Bayesian Belief Networks

A Bayesian belief network allows a subset of the variables to be conditionally independent (also called Bayes nets, directed graphical models, BNs, ...)
A graphical model of causal relationships:
Represents dependencies among the variables
Gives a specification of the joint probability distribution

[Figure: a four-node network over X, Y, Z, and P]
Nodes: variables
Links: dependencies
X and Y are the parents of Z, and Y is the parent of P
There is no dependency between Z and P
The graph has no loops or cycles
Bayesian Belief Networks

Simple (naïve) Bayesian: attributes X1, X2, ..., Xn all have the single parent C (the concept):

    P(x1, x2, ..., xn, c) = P(c) · P(x1|c) · P(x2|c) · ... · P(xn|c)

Bayesian belief network: e.g., attributes X1, X2, X3, X4 with concept C, where X3 also depends on X1 and X2:

    P(x1, x2, x3, x4, c) = P(c) · P(x1|c) · P(x2|c) · P(x3|x1, x2, c) · P(x4|c)
Bayesian Belief Network: An Example

[Figure: a belief network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer]

The conditional probability table (CPT) for the variable LungCancer:

          (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
    LC      0.8       0.5        0.7        0.1
    ~LC     0.2       0.5        0.3        0.9

The CPT shows the conditional probability for each possible combination of values of its parents
Derivation of the probability of a particular combination of values of X, from the CPT:

    P(x1, ..., xn) = ∏(i=1..n) P(xi | Parents(Yi))

Node = variable
Arc = dependency
The direction of an arc represents causality
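A minimal sketch of reading this CPT in code (the dictionary layout is an assumed representation; the probabilities are the values from the table above):

```python
# CPT for LungCancer (LC), keyed by the parent combination (FamilyHistory, Smoker);
# each entry is P(LC = true | parents), read off the table above.
cpt_lc = {
    (True, True): 0.8,    # (FH, S)
    (True, False): 0.5,   # (FH, ~S)
    (False, True): 0.7,   # (~FH, S)
    (False, False): 0.1,  # (~FH, ~S)
}

def p_lung_cancer(lc, family_history, smoker):
    """P(LC = lc | FamilyHistory, Smoker); P(~LC | ...) = 1 - P(LC | ...)."""
    p_true = cpt_lc[(family_history, smoker)]
    return p_true if lc else 1.0 - p_true

print(p_lung_cancer(True, True, False))           # 0.5
print(round(p_lung_cancer(False, False, True), 2))  # 0.3 (column ~FH, S; row ~LC)
```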
Text Classification

Each document is a vector of the frequencies of all the words in the document:

    Word   sqrt  bit  chip  log  web  limit  cpu  equal  class
    Doc1     42   25     7   56    2     66    1     89   Math
    Doc2     10   28    45    2   69      8   56     40   Comp
    Doc3     11   25    22    4   40      9   44     34   Comp
    Doc4     33   40     8   48    3     40    4     77   Math
    Doc5     28   32     9   60    6     29    7     60   Math
    Doc6      8   22    30    1   29      3   40     33   Comp

    Class   Meaning
    Math    Maths related
    Comp    Computers related

Classify x = (sqrt=25, bit=30, chip=25, web=15, limit=20, cpu=12, equal=30)
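The slide leaves the classification of x unworked; a multinomial naïve Bayes sketch with Laplacian correction (one reasonable model for word-frequency vectors, assumed here rather than specified by the slide) might look like:

```python
import math

# Word-frequency table from the slide; columns follow the header order.
words = ["sqrt", "bit", "chip", "log", "web", "limit", "cpu", "equal"]
docs = [
    ([42, 25, 7, 56, 2, 66, 1, 89], "Math"),
    ([10, 28, 45, 2, 69, 8, 56, 40], "Comp"),
    ([11, 25, 22, 4, 40, 9, 44, 34], "Comp"),
    ([33, 40, 8, 48, 3, 40, 4, 77], "Math"),
    ([28, 32, 9, 60, 6, 29, 7, 60], "Math"),
    ([8, 22, 30, 1, 29, 3, 40, 33], "Comp"),
]
x = {"sqrt": 25, "bit": 30, "chip": 25, "web": 15, "limit": 20, "cpu": 12, "equal": 30}

log_scores = {}
for c in ("Math", "Comp"):
    rows = [v for v, label in docs if label == c]
    totals = [sum(col) for col in zip(*rows)]  # per-word counts within class c
    denom = sum(totals) + len(words)           # Laplacian correction denominator
    score = math.log(len(rows) / len(docs))    # log prior P(c)
    for w, freq in x.items():
        p_w = (totals[words.index(w)] + 1) / denom  # smoothed P(w|c)
        score += freq * math.log(p_w)               # frequencies as repeated draws
    log_scores[c] = score

print(max(log_scores, key=log_scores.get))  # Comp
```

Log-probabilities are summed rather than multiplying the raw probabilities, to avoid numeric underflow over many word occurrences; x's high chip, cpu, and web counts pull it toward Comp.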