
Fakultät für Elektrotechnik und Informatik, Institut für Verteilte Systeme

AG Intelligente Systeme - Data Mining group

Data Mining I

Summer semester 2019

Lecture 7: Classification part 4

Lectures: Prof. Dr. Eirini Ntoutsi

TAs: Tai Le Quy, Vasileios Iosifidis, Maximilian Idahl, Shaheer Asghar, Wazed Ali


Cont’d from the previous lecture

How do we evaluate a classifier?

Performance measures such as accuracy, etc.

Evaluation setup

Holdout

Cross-validation

Stratification

Bootstrap

….

How do we simultaneously do model selection and error estimation?

Train/validation/test split


[Figure: cross-validation — the data is partitioned into subsets D1–D10; in each round a different subset is held out as the test set and the remaining subsets form the training set]
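As a minimal sketch of how these setups look in practice (assuming scikit-learn; the iris data and the decision tree below are placeholders for illustration, not part of the lecture):

```python
# Holdout vs. stratified 10-fold cross-validation with scikit-learn.
# Dataset and classifier are placeholders for illustration only.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: a single stratified train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
holdout_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

# Stratified 10-fold cross-validation: each fold D_i is used once as the test set.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)

print(f"holdout accuracy: {holdout_acc:.3f}")
print(f"10-fold CV accuracy: {cv_acc.mean():.3f} +/- {cv_acc.std():.3f}")
```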


Overview of classification techniques covered in DM1


Decision trees

k nearest neighbours

Support vector machines

Bayesian classifiers

Evaluation of classifiers



Outline

Bayesian classifiers

Reading material

Things you should know from this lecture



Bayesian classifiers

A probabilistic framework for solving classification problems

Predict class membership probabilities for an instance

The class of an instance is the most likely class for the instance (maximum a posteriori classification)

Based on Bayes’ rule

Bayesian classifiers

Naïve Bayes classifiers

Assume class-conditional independence among attributes

Bayesian Belief networks

Graphical models

Model dependencies among attributes

A popular method for, e.g., text classification and sentiment analysis



Bayes’ theorem

The probability of an event C given an observation A:

P(C): prior

P(A|C): likelihood

P(C|A): posterior

P(A): evidence

For classification

C: the class

A: the instance


P(C|A) = P(A|C) · P(C) / P(A)


Bayes’ theorem

The probability of an event C given an observation A:

Example: Given that

A doctor knows that meningitis causes a stiff neck 50% of the time

The prior probability of any patient having meningitis is P(M) = 1/50,000

The prior probability of any patient having a stiff neck is P(S) = 1/20

If a patient has a stiff neck, what is the probability that he/she has meningitis?


P(C|A) = P(A|C) · P(C) / P(A)   (posterior = likelihood × prior / evidence)

Compute P(M|S):

P(M|S) = P(S|M) · P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
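The slide's arithmetic in a few lines of Python (a sketch; the variable names are mine):

```python
# Bayes' rule for the meningitis example: P(M|S) = P(S|M) * P(M) / P(S)
p_s_given_m = 0.5        # likelihood: stiff neck given meningitis
p_m = 1 / 50_000         # prior: meningitis
p_s = 1 / 20             # evidence: stiff neck

print(p_s_given_m * p_m / p_s)  # 0.0002 = posterior P(M|S)
```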


Bayesian classifiers

Let C={c1, c2, …, ck} be the class attribute.

Let X = (A1, A2, A3, …, An) be an n-dimensional instance.

Classification problem: What is the probability of a class value c ϵ C given an instance observation X?

The event C to be predicted is the class value of the instance

The observation is the instance values X

The class of the instance is the class value with the highest probability: argmax_{c∈C} P(c|X)

P(c1|X): posterior for c1

P(c2|X) : posterior for c2

P(ck|X) : posterior for ck



Bayesian classifiers

Consider each attribute and class label as random variables

Given an instance X=(A1, A2, A3,….An)

Goal is to predict class label c in C

Specifically, we want to find the value c of C that maximizes P(c|X), i.e., argmax_{c∈C} P(c|X)


argmax_{c∈C} P(c|X) = argmax_{c∈C} P(X|c) · P(c) / P(X)   (Bayes’ rule)

= argmax_{c∈C} P(X|c) · P(c)

Here P(c) is the prior and P(X|c) is the likelihood; the maximizing class is the maximum a posteriori class, i.e., the most likely class.



Bayesian classifiers

How can we estimate argmax_{c∈C} P(X|c) · P(c)?

Short answer: from the data

Class prior P(c):

How often does c occur?

Just count the relative frequencies in the training set

Instance likelihood P(X|c):

What is the probability of an instance X given the class c?

X = (A1 A2 … An), so P(X|c) = P(A1 A2 … An | c)

i.e., the probability of an instance given the class is equal to the probability of a set of features given the class

So:


argmax_{c∈C} P(X|c) · P(c) = argmax_{c∈C} P(A1 A2 … An | c) · P(c)

Training set (Refund, Marital Status: categorical; Taxable Income: continuous; Evade: class):

Tid | Refund | Marital Status | Taxable Income | Evade
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes


How to estimate instance likelihood P(A1A2…An |c) ?

Assume independence among the attributes Ai when the class is given:

P(A1 A2 … An | c) = ∏i P(Ai|c) = P(A1|c) · P(A2|c) · … · P(An|c)

Naïve Bayes classifier



Strong class conditional independence assumption!!!

What does class-conditional independence mean, e.g., in the context of sentiment analysis?

[Figure: naïve Bayes as a graphical model — a class node C with an arrow to each attribute node A1, A2, …, An]



We can estimate P(Ai|c) for all Ai and c ∈ C from the training set

The instance is finally classified into:

c = argmax_{c∈C} P(c) · ∏i P(Ai|c)   (the naïve Bayes classifier)
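A minimal sketch of this decision rule in Python; the probability tables below are counted from the Evade training set above, and the function and variable names are my own:

```python
import math

def nb_predict(x, priors, cond):
    """Naive Bayes rule: argmax_c P(c) * prod_i P(Ai = xi | c).
    Log-probabilities are summed to avoid numerical underflow."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior) + sum(math.log(cond[c][a][v]) for a, v in x.items())
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Estimates counted from the training table above.
priors = {"No": 7/10, "Yes": 3/10}
cond = {
    "No":  {"Refund": {"Yes": 3/7, "No": 4/7},
            "Status": {"Single": 2/7, "Married": 4/7, "Divorced": 1/7}},
    "Yes": {"Refund": {"Yes": 0.0, "No": 1.0},  # a 0 entry needs the smoothing discussed later
            "Status": {"Single": 2/3, "Married": 0.0, "Divorced": 1/3}},
}
print(nb_predict({"Refund": "No", "Status": "Divorced"}, priors, cond))  # -> Yes
```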



How to estimate probabilities from data?



How to estimate class prior P(c)?

P(c) = Nc/N

e.g., P(No) = 7/10, P(Yes) = 3/10

How to estimate P(Ai|c)?

– For discrete attributes:

P(Ai | c) = |Aic|/ Nc

|Aic|: number of instances of class c having value Ai for the attribute

e.g.:

P(Status=Married|No) = 4/7

P(Refund=Yes|Yes) = 0
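Counting these estimates might look like this (a sketch; `records` re-types the categorical columns of the slide's training set):

```python
from collections import Counter

# (Refund, Marital Status, Evade) for the 10 training tuples above.
records = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

# Class priors P(c) = Nc / N.
class_counts = Counter(evade for _, _, evade in records)
print({c: n / len(records) for c, n in class_counts.items()})  # {'No': 0.7, 'Yes': 0.3}

# P(Status=Married | No) = |Aic| / Nc = 4/7.
n_married_no = sum(1 for _, status, evade in records if status == "Married" and evade == "No")
print(n_married_no / class_counts["No"])  # 0.571...
```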


How to estimate probabilities from data?

How to estimate P(Ai|c) for continuous attributes?

Discretize the range into bins

Two-way split: (A < v) or (A > v)

choose only one of the two splits as the new attribute

Probability density estimation:

Assume the attribute follows a known distribution

Use data to estimate parameters of distribution (e.g., mean and standard deviation)

Once the probability distribution is known, we can use it to estimate the conditional probability P(Ai|c)

e.g., assume a Gaussian (normal) distribution



How to estimate probabilities from data



Normal distribution:

P(Ai | cj) = 1/√(2π σij²) · exp( −(Ai − μij)² / (2 σij²) )

where μij and σij² are the sample mean and sample variance of attribute Ai over the training instances of class cj.

e.g., for attribute Income and class No:

– Sample mean μ = 110

– Sample variance σ² = 2975

P(Income=120 | No) = 1/√(2π · 2975) · e^(−(120−110)² / (2 · 2975)) = 0.0072
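Reproducing the slide's density computation (a sketch; the function name is mine):

```python
import math

def gaussian_pdf(x, mean, var):
    """Normal density, used as the class-conditional likelihood P(Ai | cj)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Income for class No: sample mean 110, sample variance 2975 (from the slide).
print(round(gaussian_pdf(120, mean=110, var=2975), 4))  # 0.0072
```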


Naive Bayes classifier: Example I


Training set: [the play-tennis training table is not reproduced in the source]

Test instance X:

Outlook | Temperature | Humidity | Wind | Play
sunny | cool | high | strong | ?

P(yes|X) = P(X|yes) · P(yes) / P(X) = P(O="sunny"|yes) · P(T="cool"|yes) · P(H="high"|yes) · P(W="strong"|yes) · P(yes) / P(X)

P(no|X) = P(X|no) · P(no) / P(X) = P(O="sunny"|no) · P(T="cool"|no) · P(H="high"|no) · P(W="strong"|no) · P(no) / P(X)

From the training set: P(O="sunny"|yes) = 2/9, P(T="cool"|yes) = 3/9, P(H="high"|yes) = 3/9, P(W="strong"|yes) = 3/9, P(yes) = 9/14
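Multiplying out the posterior numerators in Python. The class-"yes" estimates are from the slide; the class-"no" fractions are not shown on the slide, so the values below are the counts from the standard 14-instance play-tennis data the slide appears to use (an assumption):

```python
from math import prod

# P(X|c) * P(c) for each class; P(X) cancels in the comparison.
score_yes = prod([2/9, 3/9, 3/9, 3/9]) * 9/14   # ~0.0053 (estimates from the slide)
score_no  = prod([3/5, 1/5, 4/5, 3/5]) * 5/14   # ~0.0206 (assumed standard counts)

print("Play =", "yes" if score_yes > score_no else "no")  # -> Play = no
```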


Naive Bayes classifier: Example II


Training set:

Name | Give Birth | Can Fly | Live in Water | Have Legs | Class
human | yes | no | no | yes | mammals
python | no | no | no | no | non-mammals
salmon | no | no | yes | no | non-mammals
whale | yes | no | yes | no | mammals
frog | no | no | sometimes | yes | non-mammals
komodo | no | no | no | yes | non-mammals
bat | yes | yes | no | yes | mammals
pigeon | no | yes | no | yes | non-mammals
cat | yes | no | no | yes | mammals
leopard shark | yes | no | yes | no | non-mammals
turtle | no | no | sometimes | yes | non-mammals
penguin | no | no | sometimes | yes | non-mammals
porcupine | yes | no | no | yes | mammals
eel | no | no | yes | no | non-mammals
salamander | no | no | sometimes | yes | non-mammals
gila monster | no | no | no | yes | non-mammals
platypus | no | no | no | yes | mammals
owl | no | yes | no | yes | non-mammals
dolphin | yes | no | yes | no | mammals
eagle | no | yes | no | yes | non-mammals

Test instance X:

Give Birth | Can Fly | Live in Water | Have Legs | Class
yes | no | yes | no | ?

P(X|Mammals) = 6/7 · 6/7 · 2/7 · 2/7 = 0.06

P(X|Non-mammals) = 1/13 · 10/13 · 3/13 · 4/13 = 0.0042

P(X|M) · P(M) = 0.06 · 7/20 = 0.021

P(X|N) · P(N) = 0.0042 · 13/20 = 0.0027

P(X|M)P(M) > P(X|N)P(N), so X is classified as Mammals


The problem of 0-probabilities

Naïve Bayes prediction requires each conditional probability P(Ai|c) to be non-zero; otherwise, the predicted probability for the whole instance will be zero:

Probability estimation via correction:


c = argmax_{c∈C} P(c) · ∏i P(Ai|c)

Original: P(Ai|c) = N_ic / N_c

Laplace: P(Ai|c) = (N_ic + 1) / (N_c + k)

m-estimate: P(Ai|c) = (N_ic + m·p) / (N_c + m)

N_ic: number of class-c training instances with value Ai; N_c: number of class-c training instances

k: number of distinct values of the attribute (so the corrected probabilities still sum to 1)

p: prior estimate of the probability P(Ai|c)

m: weight of the prior (equivalent sample size)


The problem of 0-probabilities: example

Suppose a dataset with 1000 tuples:

income=low (0)

income= medium (990)

income = high (10)

Use the Laplacian correction (or Laplace estimator): add 1 to the count of each attribute value

Prob(income = low) = 1/1003

Prob(income = medium) = 991/1003

Prob(income = high) = 11/1003

Result

The probabilities are never 0

The “corrected” prob. estimates are close to their “uncorrected” counterparts
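The same correction as a small function (a sketch; the name is mine):

```python
def laplace(counts):
    """Laplace correction: add 1 to each value count and k (number of values) to the total."""
    n, k = sum(counts), len(counts)
    return [(c + 1) / (n + k) for c in counts]

# income = low / medium / high out of 1000 tuples.
print(laplace([0, 990, 10]))  # [1/1003, 991/1003, 11/1003]
```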



Naïve Bayes classifiers: overview

(+) Easy to implement

(+) Works surprisingly well in practice, although the independence assumption is too strong

– It does not require precise estimates of the probabilities

– It is enough if the maximum probability is assigned to the correct class

(+) Robust to irrelevant attributes

(+) Handles missing values by ignoring the instance during probability estimation

(+) Robust to noise

(+) Incremental

(-) Strong independence assumption; in practice, dependencies exist among variables

– Such dependencies cannot be modeled by naïve Bayes classifiers

– Use other techniques, such as Bayesian belief networks (BBN)



Bayesian Belief Networks (BBN)

Bayesian belief networks allow class-conditional independence to be defined between subsets of variables, instead of over all variables (as in naïve Bayes).

A graphical model of causal relationships

A belief network is defined by two components:

A directed acyclic graph of nodes encoding the dependence relationships among a set of variables.

A set of conditional probability tables (CPTs) that associate each node with its immediate parents.


[Figure: example DAG — nodes X and Y point to Z; Y points to P]

• Nodes: random variables

• Links: dependency between variables

• X, Y are the parents of Z; Y is the parent of P.


Conditional independence in BBN


E.g., having lung cancer is influenced by a person’s family history and by whether or not the person is a smoker

PositiveXRay is independent of “family history” and “smoker” attributes once we know that the person has LungCancer

[Figure: the belief network — nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]

Conditional independence in a BBN:

A node in a Bayesian network is conditionally independent of its non-descendants, given its parents.


Bayesian Belief Networks

Each variable Y is associated with a conditional probability table (CPT)

CPT of Y specifies the conditional distribution P(Y|Parents(Y))


[Figure: the belief network from the previous slide]

The conditional probability table (CPT) for the variable LungCancer (LC), given its parents FamilyHistory (FH) and Smoker (S):

    | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S)
LC  | 0.8     | 0.5      | 0.7      | 0.1
~LC | 0.2     | 0.5      | 0.3      | 0.9

Let X = (x1, x2, …, xn). The probability of X is given by:

P(x1, …, xn) = ∏_{i=1}^{n} P(xi | Parents(Yi))
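A minimal sketch of this factorization for the fragment FH, S → LC. The LungCancer CPT is taken from the slide; the priors P(FH) and P(S) are made-up values for illustration only:

```python
# P(LC = true | FH, S), from the CPT above.
p_lc = {(True, True): 0.8, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.1}
p_fh, p_s = 0.10, 0.30  # hypothetical priors, not from the slide

def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S): each variable is
    conditioned only on its parents in the network."""
    p = (p_fh if fh else 1 - p_fh) * (p_s if s else 1 - p_s)
    return p * (p_lc[(fh, s)] if lc else 1 - p_lc[(fh, s)])

print(joint(fh=True, s=True, lc=True))  # 0.1 * 0.3 * 0.8 = 0.024
```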




Reading material

Reading material for this lecture:

Bayesian classifiers – Section 5.3, Tan et al. book

Reading material for next week:

Support vector machines – Section 5.5, Tan et al. book





Things you should know from this lecture

Bayesian classifiers

Naïve Bayes (NB)

Independence assumption

Dealing with 0-probabilities
