Fakultät für Elektrotechnik und Informatik
Institut für Verteilte Systeme
AG Intelligente Systeme - Data Mining group
Data Mining I
Summer semester 2019
Lecture 7: Classification part 4
Lecturer: Prof. Dr. Eirini Ntoutsi
TAs: Tai Le Quy, Vasileios Iosifidis, Maximilian Idahl, Shaheer Asghar, Wazed Ali
Cont’d from previous lecture
How do we evaluate a classifier?
Performance measures such as accuracy, etc.
Evaluation setup
Holdout
Cross-validation
Stratification
Bootstrap
…
How do we simultaneously do model selection and error estimation?
Train/validation/test split
[Figure: holdout / cross-validation illustration: data blocks D1–D10 split into training and test sets in successive rounds]
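As a quick refresher, here is a minimal scikit-learn sketch of the holdout and stratified cross-validation setups listed above; the dataset and the classifier are placeholders, not the lecture's running example:

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: a single stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))

# Stratified 10-fold cross-validation: mean accuracy over the folds
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("10-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))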
Overview of classification techniques covered in DM1
Decision trees
k nearest neighbours
Support vector machines
Bayesian classifiers
Evaluation of classifiers
Outline
Bayesian classifiers
Reading material
Things you should know from this lecture
Bayesian classifiers
A probabilistic framework for solving classification problems
Predict class membership probabilities for an instance
The class of an instance is the most likely class for the instance (maximum a posteriori classification)
Based on Bayes’ rule
Bayesian classifiers
Naïve Bayes classifiers
Assume class-conditional independence among attributes
Bayesian Belief networks
Graphical models
Model dependencies among attributes
A popular method for, e.g., text classification and sentiment analysis
Bayes’ theorem
The probability of an event C given an observation A:
P(C|A) = P(A|C)·P(C) / P(A)
P(C): prior
P(A|C): likelihood
P(C|A): posterior
P(A): evidence
For classification:
C is the class
A is an instance
Bayes’ theorem
The probability of an event C given an observation A:
Example: Given that
A doctor knows that meningitis causes stiff neck 50% of the time
Prior probability of any patient having meningitis is P(M)=1/50,000
Prior probability of any patient having stiff neck is P(S)=1/20
If a patient has stiff neck, what’s the probability he/she has meningitis?
P(C|A) = P(A|C)·P(C) / P(A)    (posterior = likelihood × prior / evidence)
Compute P(M|S):
P(M|S) = P(S|M)·P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
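The same computation in a few lines of Python, with all values taken directly from the example:

# Bayes' rule: P(M|S) = P(S|M) * P(M) / P(S)
p_s_given_m = 0.5     # likelihood: P(stiff neck | meningitis)
p_m = 1 / 50000       # prior: P(meningitis)
p_s = 1 / 20          # evidence: P(stiff neck)
print(p_s_given_m * p_m / p_s)   # 0.0002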
Bayesian classifiers
Let C = {c1, c2, …, ck} be the class attribute.
Let X = (A1, A2, A3, …, An) be an n-dimensional instance.
Classification problem: what is the probability of a class value c ∈ C given an observed instance X?
The event to be predicted is the class value of the instance
The observation is the instance X
The class of the instance is the class value with the highest posterior probability: argmax_{c∈C} P(c|X)
P(c1|X): posterior for c1
P(c2|X): posterior for c2
…
P(ck|X): posterior for ck
Bayesian classifiers
Consider each attribute and class label as random variables
Given an instance X = (A1, A2, A3, …, An)
Goal is to predict the class label c in C
Specifically, we want to find the value c of C that maximizes P(c|X), i.e., argmax_{c∈C} P(c|X)
c = argmax_{c∈C} P(c|X)
  = argmax_{c∈C} P(X|c)·P(c) / P(X)    (Bayes’ rule)
  = argmax_{c∈C} P(X|c)·P(c)           (P(X) is the same for every class, so it can be dropped)

P(c): prior
P(X|c): likelihood
max a posteriori = the most likely class
Bayesian classifiers
How can we estimate P(X|c) and P(c)?
Short answer: from the data
Class prior P(c):
How often does c occur?
Just count the relative frequencies in the training set
Instance likelihood P(X|c):
What is the probability of an instance X given the class c?
X = (A1 A2 … An), so P(X|c) = P(A1 A2 … An | c)
i.e., the probability of an instance given the class equals the probability of its set of attribute values given the class
So:
c = argmax_{c∈C} P(X|c)·P(c) = argmax_{c∈C} P(A1 A2 … An | c)·P(c)
Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class attribute.)
Naïve Bayes classifier
How to estimate the instance likelihood P(A1 A2 … An | c)?
Assume independence among the attributes Ai when the class is given:
P(A1 A2 … An | c) = ∏_i P(Ai|c) = P(A1|c)·P(A2|c)·…·P(An|c)
Strong class-conditional independence assumption!
What does class-conditional independence mean, e.g., in the context of sentiment analysis?
[Figure: the Naïve Bayes graphical model: a class node C with arrows to the attribute nodes A1, A2, …, An]
We can estimate P(Ai|c) for all Ai and c ∈ C from the training set
The instance is finally classified into:
c = argmax_{c∈C} P(c) · ∏_i P(Ai|c)
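A minimal sketch of this classifier for categorical attributes; the data layout and function names are illustrative, not from the lecture:

from collections import Counter, defaultdict

def train_nb(instances, labels):
    """Estimate P(c) and P(Ai=v | c) by counting over the training set."""
    class_counts = Counter(labels)
    # value_counts[c][i][v] = number of class-c instances whose attribute i equals v
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for x, c in zip(instances, labels):
        for i, v in enumerate(x):
            value_counts[c][i][v] += 1
    return class_counts, value_counts

def predict_nb(x, class_counts, value_counts):
    """Return argmax_c P(c) * prod_i P(Ai|c). An unseen attribute value yields
    probability 0 for that class; see the Laplace correction later on."""
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / n                           # prior P(c)
        for i, v in enumerate(x):
            score *= value_counts[c][i][v] / nc  # likelihood P(Ai=v | c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

On the table above, training on the (Refund, Marital Status) columns and calling predict_nb(("No", "Married"), ...) would return "No", since the Yes class gets P(Married|Yes) = 0.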
How to estimate probabilities from data?
How to estimate class prior P(c)?
P(c) = Nc/N
e.g., P(No) = 7/10, P(Yes) = 3/10
How to estimate P(Ai|c)?
– For discrete attributes:
P(Ai | c) = |Aic| / Nc
|Aic|: number of instances having attribute value Ai and belonging to class c
e.g.:
P(Status=Married | No) = 4/7
P(Refund=Yes | Yes) = 0
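A small pandas sketch of these counting estimates on the slide's table (only the two categorical attributes; Taxable Income is continuous and is handled on the next slide):

import pandas as pd

# The training set from the slide (categorical attributes and the class)
df = pd.DataFrame({
    "Refund": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Status": ["Single", "Married", "Single", "Married", "Divorced",
               "Married", "Divorced", "Single", "Married", "Single"],
    "Evade":  ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# Class priors P(c) = Nc / N
print(df["Evade"].value_counts(normalize=True))   # No: 0.7, Yes: 0.3

# Conditional probabilities P(Ai|c) = |Aic| / Nc
no_rows = df[df["Evade"] == "No"]
yes_rows = df[df["Evade"] == "Yes"]
print((no_rows["Status"] == "Married").mean())    # P(Status=Married|No) = 4/7
print((yes_rows["Refund"] == "Yes").mean())       # P(Refund=Yes|Yes) = 0.0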
How to estimate probabilities from data?
How to estimate P(Ai|c) for continuous attributes?
Discretize the range into bins
Two-way split: (A < v) or (A > v)
choose only one of the two splits as new attribute
Probability density estimation:
Assume the attribute follows a known distribution
Use data to estimate parameters of distribution (e.g., mean and standard deviation)
Once the probability distribution is known, we can use it to estimate the conditional probability P(Ai|c)
e.g., assume a Gaussian (normal) distribution
How to estimate probabilities from data
Normal (Gaussian) distribution:
P(Ai | cj) = 1/√(2π·σij²) · exp( −(Ai − μij)² / (2·σij²) )
where μij is the sample mean and σij² the sample variance of attribute Ai over the class-cj instances.
e.g., for attribute Income and class No:
– sample mean μ = 110
– sample variance σ² = 2975
P(Income=120 | No) = 1/√(2π·2975) · exp( −(120−110)² / (2·2975) ) = 0.0072
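A quick numerical check of this estimate (a sketch using only the standard library):

import math

def gaussian_likelihood(x, mean, var):
    """Class-conditional density P(Ai|c) under a normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Income for class No: sample mean 110, sample variance 2975 (from the slide)
print(gaussian_likelihood(120, 110, 2975))   # ~0.0072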
Naive Bayes classifier: Example I
Training set: the play-tennis data (attributes Outlook, Temperature, Humidity, Wind; class Play) [table not extracted]
Test instance X: Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong, Play=?

P(yes|X) = P(X|yes)·P(yes) / P(X)
         = P(O=sunny|yes)·P(T=cool|yes)·P(H=high|yes)·P(W=strong|yes)·P(yes) / P(X)

with (counts from the training set):
P(O=sunny|yes) = 2/9
P(T=cool|yes) = 3/9
P(H=high|yes) = 3/9
P(W=strong|yes) = 3/9
P(yes) = 9/14

Analogously:
P(no|X) = P(X|no)·P(no) / P(X)
        = P(O=sunny|no)·P(T=cool|no)·P(H=high|no)·P(W=strong|no)·P(no) / P(X)
Naive Bayes classifier: Example II
Training set:

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test instance X: Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no, Class=?
P(X|M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(X|N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(X|M)·P(M) = 0.06 × 7/20 = 0.021
P(X|N)·P(N) = 0.0042 × 13/20 = 0.0027
Since P(X|M)·P(M) > P(X|N)·P(N), X is classified as a mammal.
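The arithmetic of this example, reproduced in Python with the counts read off the training set above:

# Class priors: 7 mammals (M) and 13 non-mammals (N) out of 20 animals
p_m, p_n = 7 / 20, 13 / 20

# Likelihoods for X = (Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no)
p_x_given_m = (6/7) * (6/7) * (2/7) * (2/7)        # ~0.06
p_x_given_n = (1/13) * (10/13) * (3/13) * (4/13)   # ~0.0042

print(p_x_given_m * p_m)   # ~0.021
print(p_x_given_n * p_n)   # ~0.0027  -> 0.021 > 0.0027: classify X as a mammal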
The problem of 0-probabilities
Naïve Bayes prediction requires each conditional probability P(Ai|c) to be non-zero; a single zero factor makes the whole product in c = argmax_{c∈C} P(c) · ∏_i P(Ai|c) zero, regardless of the other attributes.
Probability estimation via correction:
Original: P(Ai|c) = N_ic / N_c
Laplace: P(Ai|c) = (N_ic + 1) / (N_c + k)
m-estimate: P(Ai|c) = (N_ic + m·p) / (N_c + m)
k: number of classes
p: prior probability
m: parameter
The problem of 0-probabilities: example
Suppose a dataset with 1000 tuples and the following counts for attribute income:
income=low (0)
income= medium (990)
income = high (10)
Use the Laplacian correction (or Laplace estimator): add 1 to the count of each attribute value
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
Result
The probabilities are never 0
The “corrected” prob. estimates are close to their “uncorrected” counterparts
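A one-function sketch of the correction used in this example; note that the denominator grows by 3 here, one for each income value:

def laplace(count, class_total, num_values):
    """Laplace-corrected estimate (count + 1) / (class_total + num_values)."""
    return (count + 1) / (class_total + num_values)

for count in (0, 990, 10):          # income = low, medium, high
    print(laplace(count, 1000, 3))  # 1/1003, 991/1003, 11/1003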
Naïve Bayes classifiers: overview
(+) Easy to implement
(+) It works surprisingly well in practice, although the independence assumption is too strong.
– It does not require precise estimates of the probabilities
– It is enough if the maximum probability is assigned to the correct class
(+) Robust to irrelevant attributes
(+) Handles missing values by ignoring the instance during probability estimate calculations
(+) Robust to noise
(+) Incremental
(-) Strong independence assumption
(-) Practically, dependencies exist among variables
– Dependencies among these cannot be modeled by Naïve Bayesian Classifiers
– Use other techniques such as Bayesian Belief Networks (BBN)
Bayesian Belief Networks (BBN)
Bayesian belief networks allow class-conditional independence to be defined between subsets of variables, rather than over all variables (as in NB).
A graphical model of causal relationships
A belief network is defined by two components:
A directed acyclic graph of nodes encoding the dependence relationships among a set of variables.
A set of conditional probability tables (CPT) that associates each node to its immediate parent nodes.
[Figure: example network with nodes X, Y, Z, P and edges X→Z, Y→Z, Y→P]
• Nodes: random variables
• Links: dependency between variables
• X, Y are the parents of Z; Y is the parent of P.
Conditional independence in BBN
E.g., having lung cancer is influenced by a person’s family history and by whether or not the person is a smoker.
PositiveXRay is independent of FamilyHistory and Smoker once we know that the person has LungCancer.
[Figure: Bayesian network over FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]
Conditional independence in a BBN: a node in a Bayesian network is conditionally independent of its non-descendants if its parents are known.
Bayesian Belief Networks
Each variable Y is associated with a conditional probability table (CPT)
CPT of Y specifies the conditional distribution P(Y|Parents(Y))
[Figure: the lung-cancer network from the previous slide]

The conditional probability table (CPT) for the variable LungCancer (LC):

       (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC       0.8       0.5        0.7        0.1
~LC      0.2       0.5        0.3        0.9

Let X = (x1, x2, …, xn) be an instance over the variables Y1, …, Yn. The probability of X is given by:
P(x1, …, xn) = ∏_{i=1}^{n} P(xi | Parents(Yi))
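A minimal sketch of this factorization on a fragment of the lung-cancer network: the LungCancer CPT comes from the table above, while the priors for FamilyHistory and Smoker are made-up numbers for illustration:

# P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S), per the factorization above
p_fh, p_s = 0.1, 0.3   # assumed priors, not from the slide
cpt_lc = {             # P(LC = yes | FH, S), from the CPT above
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh, s, lc):
    p = (p_fh if fh else 1 - p_fh) * (p_s if s else 1 - p_s)
    p_lc = cpt_lc[(fh, s)]
    return p * (p_lc if lc else 1 - p_lc)

print(joint(True, True, True))   # 0.1 * 0.3 * 0.8 = 0.024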
Reading material
Reading material for this lecture:
Bayesian classifiers – Section 5.3, Tan et al. book
Reading material for next week:
Support Vector Machines – Section 5.5, Tan et al. book
Things you should know from this lecture
Bayesian classifiers
Naïve Bayes (NB)
Independence assumption
Dealing with 0-probabilities