
Fakultät für Elektrotechnik und Informatik, Institut für Verteilte Systeme

AG Intelligente Systeme - Data Mining group

Data Mining I

Summer semester 2019

Lecture 7: Classification part 4

Lectures: Prof. Dr. Eirini Ntoutsi

TAs: Tai Le Quy, Vasileios Iosifidis, Maximilian Idahl, Shaheer Asghar, Wazed Ali


Cont’d from the previous lecture

How do we evaluate a classifier?

Performance measures such as accuracy, etc.

Evaluation setup

Holdout

Cross-validation

Stratification

Bootstrap

….

How do we simultaneously do model selection and error estimation?

Train/validation/test split


[Figure: cross-validation — the data is partitioned into subsets D1–D10; in each round a different subset is held out as the test set and the remaining subsets form the training set]
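As a minimal sketch of how these setups look in practice (assuming scikit-learn; the iris data and the decision tree below are placeholders for illustration, not part of the lecture):

```python
# Holdout vs. stratified 10-fold cross-validation with scikit-learn.
# Dataset and classifier are placeholders for illustration only.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: a single stratified train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
holdout_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

# Stratified 10-fold cross-validation: each fold D_i is used once as the test set.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)

print(f"holdout accuracy: {holdout_acc:.3f}")
print(f"10-fold CV accuracy: {cv_acc.mean():.3f} +/- {cv_acc.std():.3f}")
```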


Overview of classification techniques covered in DM1


Decision trees

k nearest neighbours

Support vector machines

Bayesian classifiers

Evaluation of classifiers



Outline

Bayesian classifiers

Reading material

Things you should know from this lecture



Bayesian classifiers

A probabilistic framework for solving classification problems

Predict class membership probabilities for an instance

The class of an instance is the most likely class for the instance (maximum a posteriori classification)

Based on Bayes’ rule

Bayesian classifiers

Naïve Bayes classifiers

Assume class-conditional independence among attributes

Bayesian Belief networks

Graphical models

Model dependencies among attributes

A popular method for, e.g., text classification and sentiment analysis



Bayes’ theorem

The probability of an event C given an observation A:

P(C): prior

P(A|C): likelihood

P(C|A): posterior

P(A): evidence

For classification

C: the class

A: the instance


P(C|A) = P(A|C) · P(C) / P(A)


Bayes’ theorem

The probability of an event C given an observation A:

Example: Given that

A doctor knows that meningitis causes a stiff neck 50% of the time

The prior probability of any patient having meningitis is P(M) = 1/50,000

The prior probability of any patient having a stiff neck is P(S) = 1/20

If a patient has a stiff neck, what is the probability that he/she has meningitis?


P(C|A) = P(A|C) · P(C) / P(A)   (posterior = likelihood × prior / evidence)

Compute P(M|S):

P(M|S) = P(S|M) · P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
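The slide's arithmetic in a few lines of Python (a sketch; the variable names are mine):

```python
# Bayes' rule for the meningitis example: P(M|S) = P(S|M) * P(M) / P(S)
p_s_given_m = 0.5        # likelihood: stiff neck given meningitis
p_m = 1 / 50_000         # prior: meningitis
p_s = 1 / 20             # evidence: stiff neck

print(p_s_given_m * p_m / p_s)  # 0.0002 = posterior P(M|S)
```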


Bayesian classifiers

Let C={c1, c2, …, ck} be the class attribute.

Let X = (A1, A2, A3, …, An) be an n-dimensional instance.

Classification problem: What is the probability of a class value c ϵ C given an instance observation X?

The event C to be predicted is the class value of the instance

The observation is the instance values X

The class of the instance is the class value with the highest probability: argmax_{c∈C} P(c|X)

P(c1|X): posterior for c1

P(c2|X) : posterior for c2

P(ck|X) : posterior for ck



Bayesian classifiers

Consider each attribute and class label as random variables

Given an instance X=(A1, A2, A3,….An)

Goal is to predict class label c in C

Specifically, we want to find the value c of C that maximizes P(c|X), i.e., argmax_{c∈C} P(c|X)


argmax_{c∈C} P(c|X) = argmax_{c∈C} P(X|c) · P(c) / P(X)   (Bayes’ rule)

= argmax_{c∈C} P(X|c) · P(c)

Here P(c) is the prior and P(X|c) is the likelihood; the maximizing class is the maximum a posteriori class, i.e., the most likely class.



Bayesian classifiers

How can we estimate argmax_{c∈C} P(X|c) · P(c)?

Short answer: from the data

Class prior P(c):

How often does c occur?

Just count the relative frequencies in the training set

Instance likelihood P(X|c):

What is the probability of an instance X given the class c?

X = (A1 A2 … An), so P(X|c) = P(A1 A2 … An | c)

i.e., the probability of an instance given the class is equal to the probability of a set of features given the class

So:


argmax_{c∈C} P(X|c) · P(c) = argmax_{c∈C} P(A1 A2 … An | c) · P(c)

Training set (Refund, Marital Status: categorical; Taxable Income: continuous; Evade: class):

Tid | Refund | Marital Status | Taxable Income | Evade
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes


How to estimate instance likelihood P(A1A2…An |c) ?

Assume independence among the attributes Ai when the class is given:

P(A1 A2 … An | c) = ∏i P(Ai|c) = P(A1|c) · P(A2|c) · … · P(An|c)

Naïve Bayes classifier



Strong class conditional independence assumption!!!

What does class-conditional independence mean, e.g., in the context of sentiment analysis?

[Figure: naïve Bayes as a graphical model — a class node C with an arrow to each attribute node A1, A2, …, An]



We can estimate P(Ai|c) for all Ai and c ∈ C from the training set

The instance is finally classified into:

c = argmax_{c∈C} P(c) · ∏i P(Ai|c)   (the naïve Bayes classifier)
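A minimal sketch of this decision rule in Python; the probability tables below are counted from the Evade training set above, and the function and variable names are my own:

```python
import math

def nb_predict(x, priors, cond):
    """Naive Bayes rule: argmax_c P(c) * prod_i P(Ai = xi | c).
    Log-probabilities are summed to avoid numerical underflow."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior) + sum(math.log(cond[c][a][v]) for a, v in x.items())
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Estimates counted from the training table above.
priors = {"No": 7/10, "Yes": 3/10}
cond = {
    "No":  {"Refund": {"Yes": 3/7, "No": 4/7},
            "Status": {"Single": 2/7, "Married": 4/7, "Divorced": 1/7}},
    "Yes": {"Refund": {"Yes": 0.0, "No": 1.0},  # a 0 entry needs the smoothing discussed later
            "Status": {"Single": 2/3, "Married": 0.0, "Divorced": 1/3}},
}
print(nb_predict({"Refund": "No", "Status": "Divorced"}, priors, cond))  # -> Yes
```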



How to estimate probabilities from data?



How to estimate class prior P(c)?

P(c) = Nc/N

e.g., P(No) = 7/10, P(Yes) = 3/10

How to estimate P(Ai|c)?

– For discrete attributes:

P(Ai | c) = |Aic|/ Nc

|Aic|: number of instances of class c having value Ai for the attribute

e.g.:

P(Status=Married|No) = 4/7

P(Refund=Yes|Yes) = 0
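Counting these estimates might look like this (a sketch; `records` re-types the categorical columns of the slide's training set):

```python
from collections import Counter

# (Refund, Marital Status, Evade) for the 10 training tuples above.
records = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

# Class priors P(c) = Nc / N.
class_counts = Counter(evade for _, _, evade in records)
print({c: n / len(records) for c, n in class_counts.items()})  # {'No': 0.7, 'Yes': 0.3}

# P(Status=Married | No) = |Aic| / Nc = 4/7.
n_married_no = sum(1 for _, status, evade in records if status == "Married" and evade == "No")
print(n_married_no / class_counts["No"])  # 0.571...
```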


How to estimate probabilities from data?

How to estimate P(Ai|c) for continuous attributes?

Discretize the range into bins

Two-way split: (A < v) or (A > v)

choose only one of the two splits as the new attribute

Probability density estimation:

Assume the attribute follows a known distribution

Use data to estimate parameters of distribution (e.g., mean and standard deviation)

Once the probability distribution is known, we can use it to estimate the conditional probability P(Ai|c)

e.g., assume a Gaussian (normal) distribution



How to estimate probabilities from data



Normal distribution:

P(Ai | cj) = 1/√(2π σij²) · exp( −(Ai − μij)² / (2 σij²) )

where μij and σij² are the sample mean and sample variance of attribute Ai over the training instances of class cj.

e.g., for attribute Income and class No:

– Sample mean μ = 110

– Sample variance σ² = 2975

P(Income=120 | No) = 1/√(2π · 2975) · e^(−(120−110)² / (2 · 2975)) = 0.0072
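Reproducing the slide's density computation (a sketch; the function name is mine):

```python
import math

def gaussian_pdf(x, mean, var):
    """Normal density, used as the class-conditional likelihood P(Ai | cj)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Income for class No: sample mean 110, sample variance 2975 (from the slide).
print(round(gaussian_pdf(120, mean=110, var=2975), 4))  # 0.0072
```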


Naive Bayes classifier: Example I


Training set: [the play-tennis training table is not reproduced in the source]

Test instance X:

Outlook | Temperature | Humidity | Wind | Play
sunny | cool | high | strong | ?

P(yes|X) = P(X|yes) · P(yes) / P(X) = P(O="sunny"|yes) · P(T="cool"|yes) · P(H="high"|yes) · P(W="strong"|yes) · P(yes) / P(X)

P(no|X) = P(X|no) · P(no) / P(X) = P(O="sunny"|no) · P(T="cool"|no) · P(H="high"|no) · P(W="strong"|no) · P(no) / P(X)

From the training set: P(O="sunny"|yes) = 2/9, P(T="cool"|yes) = 3/9, P(H="high"|yes) = 3/9, P(W="strong"|yes) = 3/9, P(yes) = 9/14
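Multiplying out the posterior numerators in Python. The class-"yes" estimates are from the slide; the class-"no" fractions are not shown on the slide, so the values below are the counts from the standard 14-instance play-tennis data the slide appears to use (an assumption):

```python
from math import prod

# P(X|c) * P(c) for each class; P(X) cancels in the comparison.
score_yes = prod([2/9, 3/9, 3/9, 3/9]) * 9/14   # ~0.0053 (estimates from the slide)
score_no  = prod([3/5, 1/5, 4/5, 3/5]) * 5/14   # ~0.0206 (assumed standard counts)

print("Play =", "yes" if score_yes > score_no else "no")  # -> Play = no
```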


Naive Bayes classifier: Example II


Training set:

Name | Give Birth | Can Fly | Live in Water | Have Legs | Class
human | yes | no | no | yes | mammals
python | no | no | no | no | non-mammals
salmon | no | no | yes | no | non-mammals
whale | yes | no | yes | no | mammals
frog | no | no | sometimes | yes | non-mammals
komodo | no | no | no | yes | non-mammals
bat | yes | yes | no | yes | mammals
pigeon | no | yes | no | yes | non-mammals
cat | yes | no | no | yes | mammals
leopard shark | yes | no | yes | no | non-mammals
turtle | no | no | sometimes | yes | non-mammals
penguin | no | no | sometimes | yes | non-mammals
porcupine | yes | no | no | yes | mammals
eel | no | no | yes | no | non-mammals
salamander | no | no | sometimes | yes | non-mammals
gila monster | no | no | no | yes | non-mammals
platypus | no | no | no | yes | mammals
owl | no | yes | no | yes | non-mammals
dolphin | yes | no | yes | no | mammals
eagle | no | yes | no | yes | non-mammals

Test instance X:

Give Birth | Can Fly | Live in Water | Have Legs | Class
yes | no | yes | no | ?

P(X|Mammals) = 6/7 · 6/7 · 2/7 · 2/7 = 0.06

P(X|Non-mammals) = 1/13 · 10/13 · 3/13 · 4/13 = 0.0042

P(X|M) · P(M) = 0.06 · 7/20 = 0.021

P(X|N) · P(N) = 0.0042 · 13/20 = 0.0027

P(X|M)P(M) > P(X|N)P(N), so X is classified as Mammals


The problem of 0-probabilities

Naïve Bayes prediction requires each conditional probability P(Ai|c) to be non-zero; otherwise, the predicted probability for the whole instance will be zero:

Probability estimation via correction:


c = argmax_{c∈C} P(c) · ∏i P(Ai|c)

Original: P(Ai|c) = N_ic / N_c

Laplace: P(Ai|c) = (N_ic + 1) / (N_c + k)

m-estimate: P(Ai|c) = (N_ic + m·p) / (N_c + m)

N_ic: number of class-c training instances with value Ai; N_c: number of class-c training instances

k: number of distinct values of the attribute (so the corrected probabilities still sum to 1)

p: prior estimate of the probability P(Ai|c)

m: weight of the prior (equivalent sample size)


The problem of 0-probabilities: example

Suppose a dataset with 1000 tuples:

income=low (0)

income= medium (990)

income = high (10)

Use the Laplacian correction (or Laplace estimator): add 1 to the count of each attribute value

Prob(income = low) = 1/1003

Prob(income = medium) = 991/1003

Prob(income = high) = 11/1003

Result

The probabilities are never 0

The “corrected” prob. estimates are close to their “uncorrected” counterparts
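The same correction as a small function (a sketch; the name is mine):

```python
def laplace(counts):
    """Laplace correction: add 1 to each value count and k (number of values) to the total."""
    n, k = sum(counts), len(counts)
    return [(c + 1) / (n + k) for c in counts]

# income = low / medium / high out of 1000 tuples.
print(laplace([0, 990, 10]))  # [1/1003, 991/1003, 11/1003]
```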



Naïve Bayes classifiers: overview

(+) Easy to implement

(+) Works surprisingly well in practice, although the independence assumption is too strong

– It does not require precise estimates of the probabilities

– It is enough if the maximum probability is assigned to the correct class

(+) Robust to irrelevant attributes

(+) Handles missing values by ignoring the instance during probability estimation

(+) Robust to noise

(+) Incremental

(-) Strong independence assumption; in practice, dependencies exist among variables

– Such dependencies cannot be modeled by naïve Bayes classifiers

– Use other techniques, such as Bayesian belief networks (BBN)



Bayesian Belief Networks (BBN)

Bayesian belief networks allow class-conditional independence to be defined between subsets of variables, instead of over all variables (as in naïve Bayes).

A graphical model of causal relationships

A belief network is defined by two components:

A directed acyclic graph of nodes encoding the dependence relationships among a set of variables.

A set of conditional probability tables (CPTs) that associate each node with its immediate parents.


[Figure: example DAG — nodes X and Y point to Z; Y points to P]

• Nodes: random variables

• Links: dependency between variables

• X, Y are the parents of Z; Y is the parent of P.


Conditional independence in BBN


E.g., having lung cancer is influenced by a person’s family history and by whether or not the person is a smoker

PositiveXRay is independent of “family history” and “smoker” attributes once we know that the person has LungCancer

[Figure: the belief network — nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]

Conditional independence in a BBN:

A node in a Bayesian network is conditionally independent of its non-descendants, given its parents.


Bayesian Belief Networks

Each variable Y is associated with a conditional probability table (CPT)

CPT of Y specifies the conditional distribution P(Y|Parents(Y))


[Figure: the belief network from the previous slide]

The conditional probability table (CPT) for the variable LungCancer (LC), given its parents FamilyHistory (FH) and Smoker (S):

    | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S)
LC  | 0.8     | 0.5      | 0.7      | 0.1
~LC | 0.2     | 0.5      | 0.3      | 0.9

Let X = (x1, x2, …, xn). The probability of X is given by:

P(x1, …, xn) = ∏_{i=1}^{n} P(xi | Parents(Yi))
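A minimal sketch of this factorization for the fragment FH, S → LC. The LungCancer CPT is taken from the slide; the priors P(FH) and P(S) are made-up values for illustration only:

```python
# P(LC = true | FH, S), from the CPT above.
p_lc = {(True, True): 0.8, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.1}
p_fh, p_s = 0.10, 0.30  # hypothetical priors, not from the slide

def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S): each variable is
    conditioned only on its parents in the network."""
    p = (p_fh if fh else 1 - p_fh) * (p_s if s else 1 - p_s)
    return p * (p_lc[(fh, s)] if lc else 1 - p_lc[(fh, s)])

print(joint(fh=True, s=True, lc=True))  # 0.1 * 0.3 * 0.8 = 0.024
```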




Reading material

Reading material for this lecture:

Bayesian classifiers – Section 5.3, Tan et al. book

Reading material for next week:

Support vector machines – Section 5.5, Tan et al. book





Things you should know from this lecture

Bayesian classifiers

Naïve Bayes (NB)

Independence assumption

Dealing with 0-probabilities
