Lecture Slides for Introduction to Machine Learning 2e
ETHEM ALPAYDIN © The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

Probability and Inference

Result of tossing a coin is in {Heads, Tails}.

Random variable $X \in \{1, 0\}$ with a Bernoulli distribution:

$$P(X = 1) = p_o, \qquad P(X = 0) = 1 - p_o$$

Sample: $\mathcal{X} = \{x^t\}_{t=1}^{N}$

Estimation: $\hat{p}_o = \#\{\text{Heads}\} / \#\{\text{Tosses}\} = \sum_t x^t / N$

Prediction of next toss: Heads if $\hat{p}_o > 1/2$, Tails otherwise.
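For illustration, here is a minimal Python sketch of the estimate and prediction above; the sample `x` is made up for the example.

```python
# Estimate p_o from a sample of coin tosses (1 = Heads, 0 = Tails)
# and predict the next toss. The sample is an arbitrary illustration.
x = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Maximum likelihood estimate: p_o = #{Heads} / #{Tosses}
p_o = sum(x) / len(x)

# Predict Heads if p_o > 1/2, Tails otherwise
print(f"p_o = {p_o:.2f}, next toss: {'Heads' if p_o > 0.5 else 'Tails'}")
```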


Classification

Credit scoring: inputs are income ($x_1$) and savings ($x_2$); output is low-risk vs. high-risk.

Input: $\mathbf{x} = [x_1, x_2]^T$, output: $C \in \{0, 1\}$

Prediction: choose $C = 1$ if $P(C = 1 \mid x_1, x_2) > 0.5$ and $C = 0$ otherwise, or equivalently, choose $C = 1$ if $P(C = 1 \mid x_1, x_2) > P(C = 0 \mid x_1, x_2)$ and $C = 0$ otherwise.


Bayes' Rule

$$P(C \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C)\,P(C)}{p(\mathbf{x})}$$

that is, posterior = likelihood × prior / evidence:

- $P(C)$ is the prior probability;
- $p(\mathbf{x} \mid C)$ is the likelihood, a class-conditional probability;
- $p(\mathbf{x})$ is the evidence, a marginal probability;
- $P(C \mid \mathbf{x})$ is the posterior probability.

For two classes,

$$P(C = 0) + P(C = 1) = 1$$

$$p(\mathbf{x}) = p(\mathbf{x} \mid C = 1)\,P(C = 1) + p(\mathbf{x} \mid C = 0)\,P(C = 0)$$

$$P(C = 0 \mid \mathbf{x}) + P(C = 1 \mid \mathbf{x}) = 1$$
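A minimal sketch of this computation in Python; the prior and likelihood values are invented for the example.

```python
# Two-class Bayes' rule: posterior = likelihood * prior / evidence.
# Priors and likelihoods are made-up illustrative values.
prior = {0: 0.7, 1: 0.3}        # P(C=0), P(C=1)
likelihood = {0: 0.2, 1: 0.8}   # p(x|C=0), p(x|C=1) at some fixed x

# Evidence: p(x) = sum_k p(x|C=k) P(C=k)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posteriors P(C=k|x); they sum to 1 by construction
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)
```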


Bayes' Rule: K > 2 Classes

$$P(C_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_i)\,P(C_i)}{p(\mathbf{x})} = \frac{p(\mathbf{x} \mid C_i)\,P(C_i)}{\sum_{k=1}^{K} p(\mathbf{x} \mid C_k)\,P(C_k)}$$

with $P(C_i) \ge 0$ and $\sum_{i=1}^{K} P(C_i) = 1$.

Choose $C_i$ if $P(C_i \mid \mathbf{x}) = \max_k P(C_k \mid \mathbf{x})$.


Losses and Risks

Action $\alpha_i$: the decision to assign the input to class $C_i$.

Loss of $\alpha_i$ when the input actually belongs to $C_k$: $\lambda_{ik}$.

Expected risk (Duda and Hart, 1973) of taking action $\alpha_i$:

$$R(\alpha_i \mid \mathbf{x}) = \sum_{k=1}^{K} \lambda_{ik}\,P(C_k \mid \mathbf{x})$$

Choose $\alpha_i$ if $R(\alpha_i \mid \mathbf{x}) = \min_k R(\alpha_k \mid \mathbf{x})$.
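A minimal sketch of risk-based decision making; the loss matrix `lam` (with `lam[i][k]` $= \lambda_{ik}$) and the posterior vector are illustrative values.

```python
# Choose the action minimizing R(a_i|x) = sum_k lam[i][k] * P(C_k|x).
# Loss matrix and posteriors are made-up illustrative numbers.
lam = [
    [0.0, 5.0],   # losses of action a_0 when the true class is C_0, C_1
    [1.0, 0.0],   # losses of action a_1
]
posterior = [0.4, 0.6]  # P(C_0|x), P(C_1|x)

risks = [sum(l * p for l, p in zip(row, posterior)) for row in lam]
best = min(range(len(risks)), key=lambda i: risks[i])
print(f"risks = {risks}, choose action a_{best}")
```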


Losses and Risks: 0/1 Loss

In the special case of the zero-one loss,

$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ 1 & \text{if } i \ne k \end{cases}$$

the expected risk simplifies to

$$R(\alpha_i \mid \mathbf{x}) = \sum_{k \ne i} P(C_k \mid \mathbf{x}) = 1 - P(C_i \mid \mathbf{x})$$

For minimum risk, choose the most probable class.


Losses and Risks: Reject

Define an additional action of reject, $\alpha_{K+1}$, with loss

$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ \lambda & \text{if } i = K + 1 \\ 1 & \text{otherwise} \end{cases} \qquad 0 < \lambda < 1$$

The risks are then

$$R(\alpha_{K+1} \mid \mathbf{x}) = \lambda \sum_{k=1}^{K} P(C_k \mid \mathbf{x}) = \lambda$$

$$R(\alpha_i \mid \mathbf{x}) = \sum_{k \ne i} P(C_k \mid \mathbf{x}) = 1 - P(C_i \mid \mathbf{x})$$

Choose $C_i$ if $P(C_i \mid \mathbf{x}) > P(C_k \mid \mathbf{x})$ for all $k \ne i$ and $P(C_i \mid \mathbf{x}) > 1 - \lambda$; reject otherwise.
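A sketch of the reject rule; the posterior vectors and the reject loss $\lambda$ are assumed values.

```python
# Classify with a reject option: pick the most probable class only if
# its posterior exceeds 1 - lam; otherwise reject. Illustrative values.
def decide(posterior, lam):
    """posterior: list of P(C_i|x); lam: reject loss, 0 < lam < 1."""
    i = max(range(len(posterior)), key=lambda k: posterior[k])
    return f"choose C_{i}" if posterior[i] > 1 - lam else "reject"

print(decide([0.50, 0.30, 0.20], lam=0.1))  # 0.50 <= 0.9 -> reject
print(decide([0.95, 0.03, 0.02], lam=0.1))  # 0.95 >  0.9 -> choose C_0
```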


Discriminant Functions

Set discriminant functions $g_i(\mathbf{x})$, $i = 1, \ldots, K$, and choose $C_i$ if $g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})$.

- Bayes' classifier: $g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})$
- Using the 0/1 loss function: $g_i(\mathbf{x}) = P(C_i \mid \mathbf{x})$
- Ignoring the common term $p(\mathbf{x})$: $g_i(\mathbf{x}) = p(\mathbf{x} \mid C_i)\,P(C_i)$

The maximum defines $K$ decision regions $\mathcal{R}_1, \ldots, \mathcal{R}_K$, where $\mathcal{R}_i = \{\mathbf{x} \mid g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})\}$.


K = 2 Classes

Dichotomizer: define a single discriminant

$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$$

and choose $C_1$ if $g(\mathbf{x}) > 0$, $C_2$ otherwise.

Log odds:

$$\log \frac{P(C_1 \mid \mathbf{x})}{P(C_2 \mid \mathbf{x})}$$

If $P(C_1 \mid \mathbf{x}) > P(C_2 \mid \mathbf{x})$, then $\log \frac{P(C_1 \mid \mathbf{x})}{P(C_2 \mid \mathbf{x})} > 0$, so we choose $C_1$ exactly when the log odds are positive.
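A short sketch of the dichotomizer with the log odds as the discriminant; the posterior value is assumed.

```python
import math

# Two-class dichotomizer: g(x) = log P(C_1|x) - log P(C_2|x).
# Choose C_1 when g(x) > 0. The posterior value is made up.
p_c1 = 0.7          # P(C_1|x)
p_c2 = 1.0 - p_c1   # P(C_2|x)

g = math.log(p_c1 / p_c2)  # log odds
print(f"log odds = {g:.3f} ->", "choose C_1" if g > 0 else "choose C_2")
```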


Utility Theory

Certain features may be costly to observe; utility theory lets us assess the value of the information that additional features may provide.

Probability of state $S_k$ given evidence $\mathbf{x}$: $P(S_k \mid \mathbf{x})$.

Utility of action $\alpha_i$ when the state is $S_k$: $U_{ik}$, which measures how good it is to take action $\alpha_i$ when the state is $S_k$.

Expected utility:

$$EU(\alpha_i \mid \mathbf{x}) = \sum_k U_{ik}\,P(S_k \mid \mathbf{x})$$

Choose $\alpha_i$ if $EU(\alpha_i \mid \mathbf{x}) = \max_j EU(\alpha_j \mid \mathbf{x})$.

For example, in classification, decisions correspond to choosing one of the classes, and maximizing the expected utility is equivalent to minimizing the expected risk.
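A sketch of this equivalence: setting $U_{ik} = -\lambda_{ik}$ turns utility maximization into risk minimization. All numbers are illustrative.

```python
# Maximize EU(a_i|x) = sum_k U[i][k] * P(S_k|x). With U[i][k] = -lambda[i][k]
# this picks the same action as minimizing expected risk. Made-up numbers.
U = [
    [0.0, -5.0],   # utilities of action a_0 in states S_0, S_1
    [-1.0, 0.0],   # utilities of action a_1
]
posterior = [0.4, 0.6]  # P(S_0|x), P(S_1|x)

eu = [sum(u * p for u, p in zip(row, posterior)) for row in U]
best = max(range(len(eu)), key=lambda i: eu[i])
print(f"expected utilities = {eu}, choose action a_{best}")
```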


Association Rules

Association rule: $X \rightarrow Y$.

People who buy/click/visit/enjoy $X$ are also likely to buy/click/visit/enjoy $Y$.

A rule implies association, not necessarily causation.


Association Measures

For an association rule $X \rightarrow Y$:

Support ($X \rightarrow Y$), the statistical significance of the rule:

$$P(X, Y) = \frac{\#\{\text{customers who bought } X \text{ and } Y\}}{\#\{\text{customers}\}}$$

Confidence ($X \rightarrow Y$), the conditional probability:

$$P(Y \mid X) = \frac{P(X, Y)}{P(X)} = \frac{\#\{\text{customers who bought } X \text{ and } Y\}}{\#\{\text{customers who bought } X\}}$$

Lift ($X \rightarrow Y$), also called the interest of the rule:

$$\frac{P(X, Y)}{P(X)\,P(Y)} = \frac{P(Y \mid X)}{P(Y)}$$

If $X$ and $Y$ are independent, we expect the lift to be close to 1. A lift greater than 1 means that $X$ makes $Y$ more likely; a lift less than 1 means that $X$ makes $Y$ less likely.
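A minimal sketch computing the three measures from a toy list of transactions (the items and data are made up).

```python
# Support, confidence, and lift of the rule X -> Y over toy transactions.
# Each transaction is the set of items one customer bought.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
    {"milk", "bread"},
]

def measures(x, y):
    n = len(transactions)
    n_x = sum(1 for t in transactions if x in t)
    n_y = sum(1 for t in transactions if y in t)
    n_xy = sum(1 for t in transactions if x in t and y in t)
    support = n_xy / n             # P(X, Y)
    confidence = n_xy / n_x        # P(Y | X)
    lift = confidence / (n_y / n)  # P(Y | X) / P(Y)
    return support, confidence, lift

s, c, l = measures("milk", "bread")
print(f"support = {s:.2f}, confidence = {c:.2f}, lift = {l:.2f}")
```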


Apriori Algorithm (Agrawal et al., 1996)

For example, {X, Y, Z} is a 3-item set, and we may look for a rule such as X, Y → Z. All such rules must have high enough support and confidence. Since a sales database is generally very large, we need an efficient algorithm (the Apriori algorithm) that finds these rules with only a small number of passes over the database.

Apriori proceeds in two steps (see the sketch after this list):

1. Finding frequent itemsets, i.e., those with enough support. If {X, Y, Z} is frequent, then {X, Y}, {X, Z}, and {Y, Z} must be frequent; conversely, if {X, Y} is not frequent, none of its supersets can be frequent.
2. Converting the frequent itemsets to rules with enough confidence. Once we find the frequent k-item sets, we convert them to rules: X, Y → Z, ...; X → Y, Z, .... For all possible single consequents, we check whether the rule has enough confidence and remove it if it does not.
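A compact sketch of these two steps under simplifying assumptions (toy transactions, single-consequent rules, naive candidate generation); a real implementation would prune candidates more carefully.

```python
# A compact Apriori sketch: (1) grow frequent itemsets level by level,
# extending only frequent sets; (2) keep single-consequent rules with
# enough confidence. Toy data and thresholds; not an optimized version.
transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread"}, {"milk", "butter"}, {"milk", "bread"}]
min_support, min_confidence = 0.4, 0.7
n = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / n

# Step 1: frequent itemsets, joining sets from the previous level
items = {i for t in transactions for i in t}
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
frequent = list(level)
while level:
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = [c for c in candidates if support(c) >= min_support]
    frequent += level

# Step 2: rules "antecedent -> single consequent" with enough confidence
for itemset in frequent:
    if len(itemset) < 2:
        continue
    for consequent in itemset:
        antecedent = itemset - {consequent}
        conf = support(itemset) / support(antecedent)
        if conf >= min_confidence:
            print(f"{set(antecedent)} -> {consequent} (confidence {conf:.2f})")
```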


Exercise

1. In a two-class, two-action problem, if the loss function is $\lambda_{11} = \lambda_{22} = 0$, $\lambda_{12} = 10$, and $\lambda_{21} = 1$, write the optimal decision rule.

2. Show that as we move an item from the antecedent to the consequent, confidence can never increase:

$$\text{confidence}(A, B, C \rightarrow D) \ge \text{confidence}(A, B \rightarrow C, D)$$


Bayesian Networks

Bayesian networks are also known as graphical models or probabilistic networks (Pearl, 1988, 2000; Jensen, 1996; Lauritzen, 1996).

- Nodes are hypotheses (random variables), and the probability of a node corresponds to our belief in the truth of the hypothesis.
- Arcs are direct influences between hypotheses.
- The structure is represented as a directed acyclic graph (DAG).
- The parameters are the conditional probabilities on the arcs.


Causes and Bayes' Rule

Causal arc: rain ($R$) causes the grass to get wet ($W$), with $P(R) = 0.4$, $P(W \mid R) = 0.9$, and $P(W \mid \lnot R) = 0.2$.

Diagnostic inference runs against the causal direction: knowing that the grass is wet, what is the probability that rain is the cause?

$$P(R \mid W) = \frac{P(W \mid R)\,P(R)}{P(W)} = \frac{P(W \mid R)\,P(R)}{P(W \mid R)\,P(R) + P(W \mid \lnot R)\,P(\lnot R)} = \frac{0.9 \times 0.4}{0.9 \times 0.4 + 0.2 \times 0.6} = 0.75$$
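The slide's numbers, checked in a few lines of Python:

```python
# Diagnostic inference for the causal arc rain -> wet grass,
# with P(R) = 0.4, P(W|R) = 0.9, P(W|~R) = 0.2 from the slide.
p_r = 0.4
p_w_r, p_w_not_r = 0.9, 0.2

p_w = p_w_r * p_r + p_w_not_r * (1 - p_r)  # evidence P(W) = 0.48
p_r_w = p_w_r * p_r / p_w                  # Bayes' rule: P(R|W) = 0.75
print(f"P(W) = {p_w:.2f}, P(R|W) = {p_r_w:.2f}")
```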


Conditional Independence

$X$ and $Y$ are independent if

$$P(X, Y) = P(X)\,P(Y)$$

$X$ and $Y$ are conditionally independent given $Z$ if

$$P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$$

or equivalently

$$P(X \mid Y, Z) = P(X \mid Z)$$

There are three canonical cases for conditional independence:

- Head-to-tail connection
- Tail-to-tail connection
- Head-to-head connection


Case 1: Head-to-Tail Connection

For a chain $X \rightarrow Y \rightarrow Z$,

$$P(X, Y, Z) = P(X)\,P(Y \mid X)\,P(Z \mid Y)$$

and $X$ and $Z$ are independent given $Y$.

Example chain cloudy ($C$) → rain ($R$) → wet grass ($W$), with $P(C) = 0.4$, $P(R \mid C) = 0.8$, $P(R \mid \lnot C) = 0.1$, $P(W \mid R) = 0.9$, $P(W \mid \lnot R) = 0.2$:

$$P(R) = P(R \mid C)\,P(C) + P(R \mid \lnot C)\,P(\lnot C) = 0.8 \times 0.4 + 0.1 \times 0.6 = 0.38$$

$$P(W) = P(W \mid R)\,P(R) + P(W \mid \lnot R)\,P(\lnot R) = 0.9 \times 0.38 + 0.2 \times 0.62 = 0.47$$

$$P(W \mid C) = P(W \mid R)\,P(R \mid C) + P(W \mid \lnot R)\,P(\lnot R \mid C) = 0.9 \times 0.8 + 0.2 \times 0.2 = 0.76$$

$$P(C \mid W) = \frac{P(W \mid C)\,P(C)}{P(W)} = \frac{0.76 \times 0.4}{0.47} = 0.65$$
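The chain computations reproduced in Python with the slide's parameters:

```python
# Head-to-tail chain C -> R -> W with the slide's parameters.
p_c = 0.4
p_r_c, p_r_not_c = 0.8, 0.1   # P(R|C), P(R|~C)
p_w_r, p_w_not_r = 0.9, 0.2   # P(W|R), P(W|~R)

p_r = p_r_c * p_c + p_r_not_c * (1 - p_c)         # P(R)   = 0.38
p_w = p_w_r * p_r + p_w_not_r * (1 - p_r)         # P(W)   = 0.47
p_w_c = p_w_r * p_r_c + p_w_not_r * (1 - p_r_c)   # P(W|C) = 0.76
p_c_w = p_w_c * p_c / p_w                         # P(C|W) = 0.65
print(f"P(R)={p_r:.2f} P(W)={p_w:.2f} P(W|C)={p_w_c:.2f} P(C|W)={p_c_w:.2f}")
```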


Case 2: Tail-to-Tail Connection

For a common cause $Y \leftarrow X \rightarrow Z$,

$$P(X, Y, Z) = P(X)\,P(Y \mid X)\,P(Z \mid X)$$

and $Y$ and $Z$ are independent given $X$.

Example with cloudy ($C$) as a common cause of sprinkler ($S$) and rain ($R$), with $P(C) = 0.5$, $P(R \mid C) = 0.8$, $P(R \mid \lnot C) = 0.1$:

$$P(C \mid R) = \frac{P(R \mid C)\,P(C)}{P(R \mid C)\,P(C) + P(R \mid \lnot C)\,P(\lnot C)} = \frac{0.8 \times 0.5}{0.8 \times 0.5 + 0.1 \times 0.5} = 0.89$$

$$P(R \mid S) = P(R \mid C)\,P(C \mid S) + P(R \mid \lnot C)\,P(\lnot C \mid S) = 0.22 \quad \text{(pages 391-392)}$$

so $P(R \mid S) = 0.22 < P(R) = 0.45$.


Case 3: Head-to-Head Connection

For a common effect $X \rightarrow Z \leftarrow Y$,

$$P(X, Y, Z) = P(X)\,P(Y)\,P(Z \mid X, Y)$$

and $X$ and $Y$ are independent.

Example with rain ($R$) and sprinkler ($S$) as causes of wet grass ($W$):

$$\begin{aligned}
P(W) &= P(W \mid R, S)\,P(R, S) + P(W \mid \lnot R, S)\,P(\lnot R, S) \\
&\quad + P(W \mid R, \lnot S)\,P(R, \lnot S) + P(W \mid \lnot R, \lnot S)\,P(\lnot R, \lnot S) \\
&= P(W \mid R, S)\,P(R)\,P(S) + P(W \mid \lnot R, S)\,P(\lnot R)\,P(S) \\
&\quad + P(W \mid R, \lnot S)\,P(R)\,P(\lnot S) + P(W \mid \lnot R, \lnot S)\,P(\lnot R)\,P(\lnot S) \\
&= 0.52
\end{aligned}$$


Causal Inference

Causal inference: if the sprinkler is on, what is the probability that the grass is wet?

$$\begin{aligned}
P(W \mid S) &= P(W \mid R, S)\,P(R \mid S) + P(W \mid \lnot R, S)\,P(\lnot R \mid S) \\
&= P(W \mid R, S)\,P(R) + P(W \mid \lnot R, S)\,P(\lnot R) \\
&= 0.95 \times 0.4 + 0.9 \times 0.6 = 0.92
\end{aligned}$$

where the second step uses the fact that $R$ and $S$ are independent.


Diagnostic Inference

Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on?

$$P(S \mid W) = 0.35 > 0.2 = P(S), \quad \text{where } P(W) = 0.52$$

$$P(S \mid R, W) = 0.21 < 0.35$$

Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.


Diagnostic Inference

The evidence:

$$\begin{aligned}
P(W) &= P(W \mid R, S)\,P(R, S) + P(W \mid \lnot R, S)\,P(\lnot R, S) \\
&\quad + P(W \mid R, \lnot S)\,P(R, \lnot S) + P(W \mid \lnot R, \lnot S)\,P(\lnot R, \lnot S) = 0.52
\end{aligned}$$

The posterior of the sprinkler given wet grass:

$$P(S \mid W) = \frac{P(W \mid S)\,P(S)}{P(W)} = \frac{0.92 \times 0.2}{0.52} = 0.35$$

The posterior given wet grass and rain:

$$\begin{aligned}
P(S \mid R, W) &= \frac{P(W \mid R, S)\,P(R, S)}{P(R, W)} = \frac{P(W \mid R, S)\,P(S \mid R)\,P(R)}{P(W \mid R)\,P(R)} \\
&= \frac{P(W \mid R, S)\,P(S \mid R)}{P(W \mid R)} = \frac{0.95 \times 0.2}{0.9} \approx 0.21
\end{aligned}$$
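These numbers can be verified by enumerating the joint of the head-to-head network, assuming the priors $P(R) = 0.4$ and $P(S) = 0.2$ together with the $P(W \mid R, S)$ table used on these slides:

```python
from itertools import product

# Head-to-head network R -> W <- S: enumerate the joint
# P(R,S,W) = P(R) P(S) P(W|R,S) and answer queries by summation.
p_r, p_s = 0.4, 0.2
p_w = {(True, True): 0.95, (True, False): 0.90,    # keyed by (R, S)
       (False, True): 0.90, (False, False): 0.10}

def joint(r, s, w):
    pr = p_r if r else 1 - p_r
    ps = p_s if s else 1 - p_s
    pw = p_w[(r, s)] if w else 1 - p_w[(r, s)]
    return pr * ps * pw

def prob(query, given=lambda r, s, w: True):
    """P(query | given), summing the joint over all 2^3 assignments."""
    states = list(product([True, False], repeat=3))
    num = sum(joint(*x) for x in states if query(*x) and given(*x))
    den = sum(joint(*x) for x in states if given(*x))
    return num / den

print(f"P(W)     = {prob(lambda r, s, w: w):.2f}")                          # 0.52
print(f"P(S|W)   = {prob(lambda r, s, w: s, lambda r, s, w: w):.2f}")       # 0.35
print(f"P(S|R,W) = {prob(lambda r, s, w: s, lambda r, s, w: r and w):.2f}") # 0.21
```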


Exercise

(Textbook, p. 417, exercise 3) Calculate $P(R \mid W)$, $P(R \mid W, S)$, and $P(R \mid W, \lnot S)$.


Bayesian Networks: Causes

With cloudy ($C$) added as a parent of both rain and sprinkler, causal inference gives

$$\begin{aligned}
P(W \mid C) &= P(W \mid R, S, C)\,P(R, S \mid C) + P(W \mid \lnot R, S, C)\,P(\lnot R, S \mid C) \\
&\quad + P(W \mid R, \lnot S, C)\,P(R, \lnot S \mid C) + P(W \mid \lnot R, \lnot S, C)\,P(\lnot R, \lnot S \mid C) = 0.76
\end{aligned}$$

using the fact that $P(R, S \mid C) = P(R \mid C)\,P(S \mid C)$, since $R$ and $S$ are conditionally independent given $C$.

Diagnostic inference: $P(C \mid W) = ?$ (exercise)


Causal Inference

$$\begin{aligned}
P(W \mid C) &= \sum_{R, S} P(W, R, S \mid C) \\
&= P(W, R, S \mid C) + P(W, \lnot R, S \mid C) + P(W, R, \lnot S \mid C) + P(W, \lnot R, \lnot S \mid C) \\
&= P(W \mid R, S, C)\,P(R, S \mid C) + P(W \mid \lnot R, S, C)\,P(\lnot R, S \mid C) \\
&\quad + P(W \mid R, \lnot S, C)\,P(R, \lnot S \mid C) + P(W \mid \lnot R, \lnot S, C)\,P(\lnot R, \lnot S \mid C) \\
&= P(W \mid R, S)\,P(R \mid C)\,P(S \mid C) + P(W \mid \lnot R, S)\,P(\lnot R \mid C)\,P(S \mid C) \\
&\quad + P(W \mid R, \lnot S)\,P(R \mid C)\,P(\lnot S \mid C) + P(W \mid \lnot R, \lnot S)\,P(\lnot R \mid C)\,P(\lnot S \mid C) \\
&= 0.95 \times 0.8 \times 0.1 + 0.90 \times 0.2 \times 0.1 + 0.90 \times 0.8 \times 0.9 + 0.1 \times 0.2 \times 0.9 \\
&= 0.076 + 0.018 + 0.648 + 0.018 = 0.76
\end{aligned}$$

Diagnostic inference:

$$P(C \mid W) = \frac{P(W \mid C)\,P(C)}{P(W)} = ?$$


Bayesian Networks

The joint distribution factorizes over the graph. For the cloudy/sprinkler/rain/wet-grass network,

$$P(C, S, R, W) = P(C)\,P(S \mid C)\,P(R \mid C)\,P(W \mid S, R)$$

and in general, for variables $X_1, \ldots, X_d$,

$$P(X_1, \ldots, X_d) = \prod_{i=1}^{d} P(X_i \mid \text{parents}(X_i))$$

Inference:

- Belief propagation (Pearl, 1988): used for inference when the network is a tree.
- Junction trees (Lauritzen and Spiegelhalter, 1988): convert a given directed acyclic graph to a tree.
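A sketch that enumerates this factorized joint for the four-node network. The conditionals $P(R \mid C)$, $P(S \mid C)$, and $P(W \mid R, S)$ are those used on the preceding slides; $P(C) = 0.5$ and $P(S \mid \lnot C) = 0.5$ are assumptions taken from the textbook's version of the example.

```python
from itertools import product

# Enumerate P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R) and answer queries.
# P(C)=0.5 and P(S|~C)=0.5 are assumed (see the lead-in); the rest
# follow the preceding slides.
p_c = 0.5
p_s = {True: 0.1, False: 0.5}                      # P(S | C), P(S | ~C)
p_r = {True: 0.8, False: 0.1}                      # P(R | C), P(R | ~C)
p_w = {(True, True): 0.95, (True, False): 0.90,    # keyed by (R, S)
       (False, True): 0.90, (False, False): 0.10}

def joint(c, s, r, w):
    pc = p_c if c else 1 - p_c
    ps = p_s[c] if s else 1 - p_s[c]
    pr = p_r[c] if r else 1 - p_r[c]
    pw = p_w[(r, s)] if w else 1 - p_w[(r, s)]
    return pc * ps * pr * pw

def prob(query, given=lambda c, s, r, w: True):
    """P(query | given), summing the joint over all 2^4 assignments."""
    states = list(product([True, False], repeat=4))
    num = sum(joint(*x) for x in states if query(*x) and given(*x))
    den = sum(joint(*x) for x in states if given(*x))
    return num / den

# Causal inference from the previous slide: P(W|C) = 0.76.
print(f"P(W|C) = {prob(lambda c, s, r, w: w, lambda c, s, r, w: c):.2f}")
# The same routine answers the diagnostic exercise P(C|W).
```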


Bayesian Networks: Classification

Treating the class $C$ as a parent of the observed input $\mathbf{x}$, classification is diagnostic inference: compute $P(C \mid \mathbf{x})$.

Bayes' rule inverts the arc:

$$P(C \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C)\,P(C)}{p(\mathbf{x})}$$