Lecture Slides for Introduction to Machine Learning 2e
ETHEM ALPAYDIN © The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

Probability and Inference

Result of tossing a coin is in {Heads, Tails}.

Random variable $X \in \{1, 0\}$ with a Bernoulli distribution:

$$P(X = 1) = p_o, \qquad P(X = 0) = 1 - p_o$$

Sample: $\mathcal{X} = \{x^t\}_{t=1}^{N}$

Estimation: $\hat{p}_o = \#\{\text{Heads}\} / \#\{\text{Tosses}\} = \sum_t x^t / N$

Prediction of next toss: Heads if $\hat{p}_o > 1/2$, Tails otherwise.
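For illustration, here is a minimal Python sketch of the estimate and prediction above; the sample `x` is made up for the example.

```python
# Estimate p_o from a sample of coin tosses (1 = Heads, 0 = Tails)
# and predict the next toss. The sample is an arbitrary illustration.
x = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Maximum likelihood estimate: p_o = #{Heads} / #{Tosses}
p_o = sum(x) / len(x)

# Predict Heads if p_o > 1/2, Tails otherwise
print(f"p_o = {p_o:.2f}, next toss: {'Heads' if p_o > 0.5 else 'Tails'}")
```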


Classification

Credit scoring: inputs are income ($x_1$) and savings ($x_2$); output is low-risk vs. high-risk.

Input: $\mathbf{x} = [x_1, x_2]^T$, output: $C \in \{0, 1\}$

Prediction: choose $C = 1$ if $P(C = 1 \mid x_1, x_2) > 0.5$ and $C = 0$ otherwise, or equivalently, choose $C = 1$ if $P(C = 1 \mid x_1, x_2) > P(C = 0 \mid x_1, x_2)$ and $C = 0$ otherwise.


Bayes' Rule

$$P(C \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C)\,P(C)}{p(\mathbf{x})}$$

that is, posterior = likelihood × prior / evidence:

- $P(C)$ is the prior probability;
- $p(\mathbf{x} \mid C)$ is the likelihood, a class-conditional probability;
- $p(\mathbf{x})$ is the evidence, a marginal probability;
- $P(C \mid \mathbf{x})$ is the posterior probability.

For two classes,

$$P(C = 0) + P(C = 1) = 1$$

$$p(\mathbf{x}) = p(\mathbf{x} \mid C = 1)\,P(C = 1) + p(\mathbf{x} \mid C = 0)\,P(C = 0)$$

$$P(C = 0 \mid \mathbf{x}) + P(C = 1 \mid \mathbf{x}) = 1$$
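A minimal sketch of this computation in Python; the prior and likelihood values are invented for the example.

```python
# Two-class Bayes' rule: posterior = likelihood * prior / evidence.
# Priors and likelihoods are made-up illustrative values.
prior = {0: 0.7, 1: 0.3}        # P(C=0), P(C=1)
likelihood = {0: 0.2, 1: 0.8}   # p(x|C=0), p(x|C=1) at some fixed x

# Evidence: p(x) = sum_k p(x|C=k) P(C=k)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posteriors P(C=k|x); they sum to 1 by construction
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)
```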


Bayes' Rule: K > 2 Classes

$$P(C_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_i)\,P(C_i)}{p(\mathbf{x})} = \frac{p(\mathbf{x} \mid C_i)\,P(C_i)}{\sum_{k=1}^{K} p(\mathbf{x} \mid C_k)\,P(C_k)}$$

with $P(C_i) \ge 0$ and $\sum_{i=1}^{K} P(C_i) = 1$.

Choose $C_i$ if $P(C_i \mid \mathbf{x}) = \max_k P(C_k \mid \mathbf{x})$.


Losses and Risks

Action $\alpha_i$: the decision to assign the input to class $C_i$.

Loss of $\alpha_i$ when the input actually belongs to $C_k$: $\lambda_{ik}$.

Expected risk (Duda and Hart, 1973) of taking action $\alpha_i$:

$$R(\alpha_i \mid \mathbf{x}) = \sum_{k=1}^{K} \lambda_{ik}\,P(C_k \mid \mathbf{x})$$

Choose $\alpha_i$ if $R(\alpha_i \mid \mathbf{x}) = \min_k R(\alpha_k \mid \mathbf{x})$.
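A minimal sketch of risk-based decision making; the loss matrix `lam` (with `lam[i][k]` $= \lambda_{ik}$) and the posterior vector are illustrative values.

```python
# Choose the action minimizing R(a_i|x) = sum_k lam[i][k] * P(C_k|x).
# Loss matrix and posteriors are made-up illustrative numbers.
lam = [
    [0.0, 5.0],   # losses of action a_0 when the true class is C_0, C_1
    [1.0, 0.0],   # losses of action a_1
]
posterior = [0.4, 0.6]  # P(C_0|x), P(C_1|x)

risks = [sum(l * p for l, p in zip(row, posterior)) for row in lam]
best = min(range(len(risks)), key=lambda i: risks[i])
print(f"risks = {risks}, choose action a_{best}")
```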


Losses and Risks: 0/1 Loss

In the special case of the zero-one loss,

$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ 1 & \text{if } i \ne k \end{cases}$$

the expected risk simplifies to

$$R(\alpha_i \mid \mathbf{x}) = \sum_{k \ne i} P(C_k \mid \mathbf{x}) = 1 - P(C_i \mid \mathbf{x})$$

For minimum risk, choose the most probable class.


Losses and Risks: Reject

Define an additional action of reject, $\alpha_{K+1}$, with loss

$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ \lambda & \text{if } i = K + 1 \\ 1 & \text{otherwise} \end{cases} \qquad 0 < \lambda < 1$$

The risks are then

$$R(\alpha_{K+1} \mid \mathbf{x}) = \lambda \sum_{k=1}^{K} P(C_k \mid \mathbf{x}) = \lambda$$

$$R(\alpha_i \mid \mathbf{x}) = \sum_{k \ne i} P(C_k \mid \mathbf{x}) = 1 - P(C_i \mid \mathbf{x})$$

Choose $C_i$ if $P(C_i \mid \mathbf{x}) > P(C_k \mid \mathbf{x})$ for all $k \ne i$ and $P(C_i \mid \mathbf{x}) > 1 - \lambda$; reject otherwise.
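A sketch of the reject rule; the posterior vectors and the reject loss $\lambda$ are assumed values.

```python
# Classify with a reject option: pick the most probable class only if
# its posterior exceeds 1 - lam; otherwise reject. Illustrative values.
def decide(posterior, lam):
    """posterior: list of P(C_i|x); lam: reject loss, 0 < lam < 1."""
    i = max(range(len(posterior)), key=lambda k: posterior[k])
    return f"choose C_{i}" if posterior[i] > 1 - lam else "reject"

print(decide([0.50, 0.30, 0.20], lam=0.1))  # 0.50 <= 0.9 -> reject
print(decide([0.95, 0.03, 0.02], lam=0.1))  # 0.95 >  0.9 -> choose C_0
```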


Discriminant Functions

Set discriminant functions $g_i(\mathbf{x})$, $i = 1, \ldots, K$, and choose $C_i$ if $g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})$.

- Bayes' classifier: $g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})$
- Using the 0/1 loss function: $g_i(\mathbf{x}) = P(C_i \mid \mathbf{x})$
- Ignoring the common term $p(\mathbf{x})$: $g_i(\mathbf{x}) = p(\mathbf{x} \mid C_i)\,P(C_i)$

The maximum defines $K$ decision regions $\mathcal{R}_1, \ldots, \mathcal{R}_K$, where $\mathcal{R}_i = \{\mathbf{x} \mid g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})\}$.


K = 2 Classes

Dichotomizer: define a single discriminant

$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$$

and choose $C_1$ if $g(\mathbf{x}) > 0$, $C_2$ otherwise.

Log odds:

$$\log \frac{P(C_1 \mid \mathbf{x})}{P(C_2 \mid \mathbf{x})}$$

If $P(C_1 \mid \mathbf{x}) > P(C_2 \mid \mathbf{x})$, then $\log \frac{P(C_1 \mid \mathbf{x})}{P(C_2 \mid \mathbf{x})} > 0$, so we choose $C_1$ exactly when the log odds are positive.
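A short sketch of the dichotomizer with the log odds as the discriminant; the posterior value is assumed.

```python
import math

# Two-class dichotomizer: g(x) = log P(C_1|x) - log P(C_2|x).
# Choose C_1 when g(x) > 0. The posterior value is made up.
p_c1 = 0.7          # P(C_1|x)
p_c2 = 1.0 - p_c1   # P(C_2|x)

g = math.log(p_c1 / p_c2)  # log odds
print(f"log odds = {g:.3f} ->", "choose C_1" if g > 0 else "choose C_2")
```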


Utility Theory

Certain features may be costly to observe; utility theory lets us assess the value of the information that additional features may provide.

Probability of state $S_k$ given evidence $\mathbf{x}$: $P(S_k \mid \mathbf{x})$.

Utility of action $\alpha_i$ when the state is $S_k$: $U_{ik}$, which measures how good it is to take action $\alpha_i$ when the state is $S_k$.

Expected utility:

$$EU(\alpha_i \mid \mathbf{x}) = \sum_k U_{ik}\,P(S_k \mid \mathbf{x})$$

Choose $\alpha_i$ if $EU(\alpha_i \mid \mathbf{x}) = \max_j EU(\alpha_j \mid \mathbf{x})$.

For example, in classification, decisions correspond to choosing one of the classes, and maximizing the expected utility is equivalent to minimizing the expected risk.
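A sketch of this equivalence: setting $U_{ik} = -\lambda_{ik}$ turns utility maximization into risk minimization. All numbers are illustrative.

```python
# Maximize EU(a_i|x) = sum_k U[i][k] * P(S_k|x). With U[i][k] = -lambda[i][k]
# this picks the same action as minimizing expected risk. Made-up numbers.
U = [
    [0.0, -5.0],   # utilities of action a_0 in states S_0, S_1
    [-1.0, 0.0],   # utilities of action a_1
]
posterior = [0.4, 0.6]  # P(S_0|x), P(S_1|x)

eu = [sum(u * p for u, p in zip(row, posterior)) for row in U]
best = max(range(len(eu)), key=lambda i: eu[i])
print(f"expected utilities = {eu}, choose action a_{best}")
```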


Association Rules

Association rule: $X \rightarrow Y$.

People who buy/click/visit/enjoy $X$ are also likely to buy/click/visit/enjoy $Y$.

A rule implies association, not necessarily causation.


Association Measures

For an association rule $X \rightarrow Y$:

Support ($X \rightarrow Y$), the statistical significance of the rule:

$$P(X, Y) = \frac{\#\{\text{customers who bought } X \text{ and } Y\}}{\#\{\text{customers}\}}$$

Confidence ($X \rightarrow Y$), the conditional probability:

$$P(Y \mid X) = \frac{P(X, Y)}{P(X)} = \frac{\#\{\text{customers who bought } X \text{ and } Y\}}{\#\{\text{customers who bought } X\}}$$

Lift ($X \rightarrow Y$), also called the interest of the rule:

$$\frac{P(X, Y)}{P(X)\,P(Y)} = \frac{P(Y \mid X)}{P(Y)}$$

If $X$ and $Y$ are independent, we expect the lift to be close to 1. A lift greater than 1 means that $X$ makes $Y$ more likely; a lift less than 1 means that $X$ makes $Y$ less likely.
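A minimal sketch computing the three measures from a toy list of transactions (the items and data are made up).

```python
# Support, confidence, and lift of the rule X -> Y over toy transactions.
# Each transaction is the set of items one customer bought.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
    {"milk", "bread"},
]

def measures(x, y):
    n = len(transactions)
    n_x = sum(1 for t in transactions if x in t)
    n_y = sum(1 for t in transactions if y in t)
    n_xy = sum(1 for t in transactions if x in t and y in t)
    support = n_xy / n             # P(X, Y)
    confidence = n_xy / n_x        # P(Y | X)
    lift = confidence / (n_y / n)  # P(Y | X) / P(Y)
    return support, confidence, lift

s, c, l = measures("milk", "bread")
print(f"support = {s:.2f}, confidence = {c:.2f}, lift = {l:.2f}")
```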


Apriori Algorithm (Agrawal et al., 1996)

For example, {X, Y, Z} is a 3-item set, and we may look for a rule such as X, Y → Z. All such rules must have high enough support and confidence. Since a sales database is generally very large, we need an efficient algorithm (the Apriori algorithm) that finds these rules with only a small number of passes over the database.

Apriori proceeds in two steps (see the sketch after this list):

1. Finding frequent itemsets, i.e., those with enough support. If {X, Y, Z} is frequent, then {X, Y}, {X, Z}, and {Y, Z} must be frequent; conversely, if {X, Y} is not frequent, none of its supersets can be frequent.
2. Converting the frequent itemsets to rules with enough confidence. Once we find the frequent k-item sets, we convert them to rules: X, Y → Z, ...; X → Y, Z, .... For all possible single consequents, we check whether the rule has enough confidence and remove it if it does not.
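A compact sketch of these two steps under simplifying assumptions (toy transactions, single-consequent rules, naive candidate generation); a real implementation would prune candidates more carefully.

```python
# A compact Apriori sketch: (1) grow frequent itemsets level by level,
# extending only frequent sets; (2) keep single-consequent rules with
# enough confidence. Toy data and thresholds; not an optimized version.
transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread"}, {"milk", "butter"}, {"milk", "bread"}]
min_support, min_confidence = 0.4, 0.7
n = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / n

# Step 1: frequent itemsets, joining sets from the previous level
items = {i for t in transactions for i in t}
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
frequent = list(level)
while level:
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = [c for c in candidates if support(c) >= min_support]
    frequent += level

# Step 2: rules "antecedent -> single consequent" with enough confidence
for itemset in frequent:
    if len(itemset) < 2:
        continue
    for consequent in itemset:
        antecedent = itemset - {consequent}
        conf = support(itemset) / support(antecedent)
        if conf >= min_confidence:
            print(f"{set(antecedent)} -> {consequent} (confidence {conf:.2f})")
```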


Exercise

1. In a two-class, two-action problem, if the loss function is $\lambda_{11} = \lambda_{22} = 0$, $\lambda_{12} = 10$, and $\lambda_{21} = 1$, write the optimal decision rule.

2. Show that as we move an item from the antecedent to the consequent, confidence can never increase:

$$\text{confidence}(A, B, C \rightarrow D) \ge \text{confidence}(A, B \rightarrow C, D)$$


Bayesian Networks

Bayesian networks are also known as graphical models or probabilistic networks (Pearl, 1988, 2000; Jensen, 1996; Lauritzen, 1996).

- Nodes are hypotheses (random variables), and the probability of a node corresponds to our belief in the truth of the hypothesis.
- Arcs are direct influences between hypotheses.
- The structure is represented as a directed acyclic graph (DAG).
- The parameters are the conditional probabilities on the arcs.


Causes and Bayes' Rule

Causal arc: rain ($R$) causes the grass to get wet ($W$), with $P(R) = 0.4$, $P(W \mid R) = 0.9$, and $P(W \mid \lnot R) = 0.2$.

Diagnostic inference runs against the causal direction: knowing that the grass is wet, what is the probability that rain is the cause?

$$P(R \mid W) = \frac{P(W \mid R)\,P(R)}{P(W)} = \frac{P(W \mid R)\,P(R)}{P(W \mid R)\,P(R) + P(W \mid \lnot R)\,P(\lnot R)} = \frac{0.9 \times 0.4}{0.9 \times 0.4 + 0.2 \times 0.6} = 0.75$$
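The slide's numbers, checked in a few lines of Python:

```python
# Diagnostic inference for the causal arc rain -> wet grass,
# with P(R) = 0.4, P(W|R) = 0.9, P(W|~R) = 0.2 from the slide.
p_r = 0.4
p_w_r, p_w_not_r = 0.9, 0.2

p_w = p_w_r * p_r + p_w_not_r * (1 - p_r)  # evidence P(W) = 0.48
p_r_w = p_w_r * p_r / p_w                  # Bayes' rule: P(R|W) = 0.75
print(f"P(W) = {p_w:.2f}, P(R|W) = {p_r_w:.2f}")
```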


Conditional Independence

$X$ and $Y$ are independent if

$$P(X, Y) = P(X)\,P(Y)$$

$X$ and $Y$ are conditionally independent given $Z$ if

$$P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$$

or equivalently

$$P(X \mid Y, Z) = P(X \mid Z)$$

There are three canonical cases for conditional independence:

- Head-to-tail connection
- Tail-to-tail connection
- Head-to-head connection


Case 1: Head-to-Tail Connection

For a chain $X \rightarrow Y \rightarrow Z$,

$$P(X, Y, Z) = P(X)\,P(Y \mid X)\,P(Z \mid Y)$$

and $X$ and $Z$ are independent given $Y$.

Example chain cloudy ($C$) → rain ($R$) → wet grass ($W$), with $P(C) = 0.4$, $P(R \mid C) = 0.8$, $P(R \mid \lnot C) = 0.1$, $P(W \mid R) = 0.9$, $P(W \mid \lnot R) = 0.2$:

$$P(R) = P(R \mid C)\,P(C) + P(R \mid \lnot C)\,P(\lnot C) = 0.8 \times 0.4 + 0.1 \times 0.6 = 0.38$$

$$P(W) = P(W \mid R)\,P(R) + P(W \mid \lnot R)\,P(\lnot R) = 0.9 \times 0.38 + 0.2 \times 0.62 = 0.47$$

$$P(W \mid C) = P(W \mid R)\,P(R \mid C) + P(W \mid \lnot R)\,P(\lnot R \mid C) = 0.9 \times 0.8 + 0.2 \times 0.2 = 0.76$$

$$P(C \mid W) = \frac{P(W \mid C)\,P(C)}{P(W)} = \frac{0.76 \times 0.4}{0.47} = 0.65$$
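The chain computations reproduced in Python with the slide's parameters:

```python
# Head-to-tail chain C -> R -> W with the slide's parameters.
p_c = 0.4
p_r_c, p_r_not_c = 0.8, 0.1   # P(R|C), P(R|~C)
p_w_r, p_w_not_r = 0.9, 0.2   # P(W|R), P(W|~R)

p_r = p_r_c * p_c + p_r_not_c * (1 - p_c)         # P(R)   = 0.38
p_w = p_w_r * p_r + p_w_not_r * (1 - p_r)         # P(W)   = 0.47
p_w_c = p_w_r * p_r_c + p_w_not_r * (1 - p_r_c)   # P(W|C) = 0.76
p_c_w = p_w_c * p_c / p_w                         # P(C|W) = 0.65
print(f"P(R)={p_r:.2f} P(W)={p_w:.2f} P(W|C)={p_w_c:.2f} P(C|W)={p_c_w:.2f}")
```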


Case 2: Tail-to-Tail Connection

For a common cause $Y \leftarrow X \rightarrow Z$,

$$P(X, Y, Z) = P(X)\,P(Y \mid X)\,P(Z \mid X)$$

and $Y$ and $Z$ are independent given $X$.

Example with cloudy ($C$) as a common cause of sprinkler ($S$) and rain ($R$), with $P(C) = 0.5$, $P(R \mid C) = 0.8$, $P(R \mid \lnot C) = 0.1$:

$$P(C \mid R) = \frac{P(R \mid C)\,P(C)}{P(R \mid C)\,P(C) + P(R \mid \lnot C)\,P(\lnot C)} = \frac{0.8 \times 0.5}{0.8 \times 0.5 + 0.1 \times 0.5} = 0.89$$

$$P(R \mid S) = P(R \mid C)\,P(C \mid S) + P(R \mid \lnot C)\,P(\lnot C \mid S) = 0.22 \quad \text{(pages 391-392)}$$

so $P(R \mid S) = 0.22 < P(R) = 0.45$.


Case 3: Head-to-Head Connection

For a common effect $X \rightarrow Z \leftarrow Y$,

$$P(X, Y, Z) = P(X)\,P(Y)\,P(Z \mid X, Y)$$

and $X$ and $Y$ are independent.

Example with rain ($R$) and sprinkler ($S$) as causes of wet grass ($W$):

$$\begin{aligned}
P(W) &= P(W \mid R, S)\,P(R, S) + P(W \mid \lnot R, S)\,P(\lnot R, S) \\
&\quad + P(W \mid R, \lnot S)\,P(R, \lnot S) + P(W \mid \lnot R, \lnot S)\,P(\lnot R, \lnot S) \\
&= P(W \mid R, S)\,P(R)\,P(S) + P(W \mid \lnot R, S)\,P(\lnot R)\,P(S) \\
&\quad + P(W \mid R, \lnot S)\,P(R)\,P(\lnot S) + P(W \mid \lnot R, \lnot S)\,P(\lnot R)\,P(\lnot S) \\
&= 0.52
\end{aligned}$$


Causal Inference

Causal inference: if the sprinkler is on, what is the probability that the grass is wet?

$$\begin{aligned}
P(W \mid S) &= P(W \mid R, S)\,P(R \mid S) + P(W \mid \lnot R, S)\,P(\lnot R \mid S) \\
&= P(W \mid R, S)\,P(R) + P(W \mid \lnot R, S)\,P(\lnot R) \\
&= 0.95 \times 0.4 + 0.9 \times 0.6 = 0.92
\end{aligned}$$

where the second step uses the fact that $R$ and $S$ are independent.


Diagnostic Inference

Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on?

$$P(S \mid W) = 0.35 > 0.2 = P(S), \quad \text{where } P(W) = 0.52$$

$$P(S \mid R, W) = 0.21 < 0.35$$

Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.


Diagnostic Inference

The evidence:

$$\begin{aligned}
P(W) &= P(W \mid R, S)\,P(R, S) + P(W \mid \lnot R, S)\,P(\lnot R, S) \\
&\quad + P(W \mid R, \lnot S)\,P(R, \lnot S) + P(W \mid \lnot R, \lnot S)\,P(\lnot R, \lnot S) = 0.52
\end{aligned}$$

The posterior of the sprinkler given wet grass:

$$P(S \mid W) = \frac{P(W \mid S)\,P(S)}{P(W)} = \frac{0.92 \times 0.2}{0.52} = 0.35$$

The posterior given wet grass and rain:

$$\begin{aligned}
P(S \mid R, W) &= \frac{P(W \mid R, S)\,P(R, S)}{P(R, W)} = \frac{P(W \mid R, S)\,P(S \mid R)\,P(R)}{P(W \mid R)\,P(R)} \\
&= \frac{P(W \mid R, S)\,P(S \mid R)}{P(W \mid R)} = \frac{0.95 \times 0.2}{0.9} \approx 0.21
\end{aligned}$$
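These numbers can be verified by enumerating the joint of the head-to-head network, assuming the priors $P(R) = 0.4$ and $P(S) = 0.2$ together with the $P(W \mid R, S)$ table used on these slides:

```python
from itertools import product

# Head-to-head network R -> W <- S: enumerate the joint
# P(R,S,W) = P(R) P(S) P(W|R,S) and answer queries by summation.
p_r, p_s = 0.4, 0.2
p_w = {(True, True): 0.95, (True, False): 0.90,    # keyed by (R, S)
       (False, True): 0.90, (False, False): 0.10}

def joint(r, s, w):
    pr = p_r if r else 1 - p_r
    ps = p_s if s else 1 - p_s
    pw = p_w[(r, s)] if w else 1 - p_w[(r, s)]
    return pr * ps * pw

def prob(query, given=lambda r, s, w: True):
    """P(query | given), summing the joint over all 2^3 assignments."""
    states = list(product([True, False], repeat=3))
    num = sum(joint(*x) for x in states if query(*x) and given(*x))
    den = sum(joint(*x) for x in states if given(*x))
    return num / den

print(f"P(W)     = {prob(lambda r, s, w: w):.2f}")                          # 0.52
print(f"P(S|W)   = {prob(lambda r, s, w: s, lambda r, s, w: w):.2f}")       # 0.35
print(f"P(S|R,W) = {prob(lambda r, s, w: s, lambda r, s, w: r and w):.2f}") # 0.21
```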


Exercise

(Textbook, p. 417, exercise 3) Calculate $P(R \mid W)$, $P(R \mid W, S)$, and $P(R \mid W, \lnot S)$.


Bayesian Networks: Causes

With cloudy ($C$) added as a parent of both rain and sprinkler, causal inference gives

$$\begin{aligned}
P(W \mid C) &= P(W \mid R, S, C)\,P(R, S \mid C) + P(W \mid \lnot R, S, C)\,P(\lnot R, S \mid C) \\
&\quad + P(W \mid R, \lnot S, C)\,P(R, \lnot S \mid C) + P(W \mid \lnot R, \lnot S, C)\,P(\lnot R, \lnot S \mid C) = 0.76
\end{aligned}$$

using the fact that $P(R, S \mid C) = P(R \mid C)\,P(S \mid C)$, since $R$ and $S$ are conditionally independent given $C$.

Diagnostic inference: $P(C \mid W) = ?$ (exercise)


Causal Inference

$$\begin{aligned}
P(W \mid C) &= \sum_{R, S} P(W, R, S \mid C) \\
&= P(W, R, S \mid C) + P(W, \lnot R, S \mid C) + P(W, R, \lnot S \mid C) + P(W, \lnot R, \lnot S \mid C) \\
&= P(W \mid R, S, C)\,P(R, S \mid C) + P(W \mid \lnot R, S, C)\,P(\lnot R, S \mid C) \\
&\quad + P(W \mid R, \lnot S, C)\,P(R, \lnot S \mid C) + P(W \mid \lnot R, \lnot S, C)\,P(\lnot R, \lnot S \mid C) \\
&= P(W \mid R, S)\,P(R \mid C)\,P(S \mid C) + P(W \mid \lnot R, S)\,P(\lnot R \mid C)\,P(S \mid C) \\
&\quad + P(W \mid R, \lnot S)\,P(R \mid C)\,P(\lnot S \mid C) + P(W \mid \lnot R, \lnot S)\,P(\lnot R \mid C)\,P(\lnot S \mid C) \\
&= 0.95 \times 0.8 \times 0.1 + 0.90 \times 0.2 \times 0.1 + 0.90 \times 0.8 \times 0.9 + 0.1 \times 0.2 \times 0.9 \\
&= 0.076 + 0.018 + 0.648 + 0.018 = 0.76
\end{aligned}$$

Diagnostic inference:

$$P(C \mid W) = \frac{P(W \mid C)\,P(C)}{P(W)} = ?$$


Bayesian Networks

The joint distribution factorizes over the graph. For the cloudy/sprinkler/rain/wet-grass network,

$$P(C, S, R, W) = P(C)\,P(S \mid C)\,P(R \mid C)\,P(W \mid S, R)$$

and in general, for variables $X_1, \ldots, X_d$,

$$P(X_1, \ldots, X_d) = \prod_{i=1}^{d} P(X_i \mid \text{parents}(X_i))$$

Inference:

- Belief propagation (Pearl, 1988): used for inference when the network is a tree.
- Junction trees (Lauritzen and Spiegelhalter, 1988): convert a given directed acyclic graph to a tree.
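A sketch that enumerates this factorized joint for the four-node network. The conditionals $P(R \mid C)$, $P(S \mid C)$, and $P(W \mid R, S)$ are those used on the preceding slides; $P(C) = 0.5$ and $P(S \mid \lnot C) = 0.5$ are assumptions taken from the textbook's version of the example.

```python
from itertools import product

# Enumerate P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R) and answer queries.
# P(C)=0.5 and P(S|~C)=0.5 are assumed (see the lead-in); the rest
# follow the preceding slides.
p_c = 0.5
p_s = {True: 0.1, False: 0.5}                      # P(S | C), P(S | ~C)
p_r = {True: 0.8, False: 0.1}                      # P(R | C), P(R | ~C)
p_w = {(True, True): 0.95, (True, False): 0.90,    # keyed by (R, S)
       (False, True): 0.90, (False, False): 0.10}

def joint(c, s, r, w):
    pc = p_c if c else 1 - p_c
    ps = p_s[c] if s else 1 - p_s[c]
    pr = p_r[c] if r else 1 - p_r[c]
    pw = p_w[(r, s)] if w else 1 - p_w[(r, s)]
    return pc * ps * pr * pw

def prob(query, given=lambda c, s, r, w: True):
    """P(query | given), summing the joint over all 2^4 assignments."""
    states = list(product([True, False], repeat=4))
    num = sum(joint(*x) for x in states if query(*x) and given(*x))
    den = sum(joint(*x) for x in states if given(*x))
    return num / den

# Causal inference from the previous slide: P(W|C) = 0.76.
print(f"P(W|C) = {prob(lambda c, s, r, w: w, lambda c, s, r, w: c):.2f}")
# The same routine answers the diagnostic exercise P(C|W).
```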


Bayesian Networks: Classification

Treating the class $C$ as a parent of the observed input $\mathbf{x}$, classification is diagnostic inference: compute $P(C \mid \mathbf{x})$.

Bayes' rule inverts the arc:

$$P(C \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C)\,P(C)}{p(\mathbf{x})}$$