Bayesian Decision Theory
Prof. Dr. Mostafa Gadal-Haqq
Faculty of Computer & Information Sciences
Computer Science Department
AIN SHAMS UNIVERSITY
CSC446 : Pattern Recognition
(Pattern Classification, Ch. 2: Sec. 2.1 to Sec. 2.3)
2.1 Bayesian Decision Theory
• Bayesian decision theory is based on
quantifying the trade-offs between various
classification decisions using probabilities
and the costs that accompany such decisions.
• It assumes that the decision problem is posed
in probabilistic terms and that all of the
relevant probability values are known.
• Back to the fish-sorting machine:
– Let ω be a random variable (the state of nature), ω ∈ {ω1, ω2}.
• For example: ω1 = sea bass, and ω2 = salmon.
• P(ω1) = the prior (a priori) probability that the
coming fish is sea bass.
• P(ω2) = the prior (a priori) probability that the
coming fish is salmon.
– The priors give us the knowledge of how likely
we are to get salmon or sea bass before the fish
actually appears.
• Decision Rule Using Priors Only:
– To make a decision about the fish that will
appear using only the priors, P(ω1) and P(ω2),
we use the following decision rule:
Decide fish ∈ ω1 if P(ω1) > P(ω2),
and fish ∈ ω2 if P(ω1) < P(ω2).
– This rule minimizes the probability of error:
Probability of error = min [ P(ω1), P(ω2) ]
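The slides give no code, so here is a minimal Python sketch of the prior-only rule, using the priors from the numerical example later in this section (P(ω1) = 2/3, P(ω2) = 1/3):

```python
# A minimal sketch of the prior-only decision rule.
priors = {"w1": 2 / 3, "w2": 1 / 3}   # P(w1), P(w2)

def decide_by_priors(p):
    """Always decide the state of nature with the largest prior."""
    return max(p, key=p.get)

print(decide_by_priors(priors))   # -> 'w1' (always the same decision)
print(min(priors.values()))       # probability of error = min prior = 1/3
```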
• That is:
– If P(ω1) >> P(ω2), we will be right most of the
time when we decide that the fish belongs to ω1.
– If P(ω1) = P(ω2), we have only a fifty-fifty chance
of being right.
– Under these conditions, no other decision rule
can yield a larger probability of being right.
• Improving the decision using an observation:
– If we know the class-conditional probability density,
p(x | ωj), of an observation x, we can improve our decision.
– For example: x describes the observed lightness of
the sea bass or salmon.
[Figure: the class-conditional densities p(x | ω1) and p(x | ω2) plotted against lightness x]
• We can improve our decision by using this
observed feature and the Bayes rule:
– Posterior = (Likelihood × Prior) / Evidence
– That is:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}$$

– where, for C categories, the evidence is:

$$p(x) = \sum_{j=1}^{C} p(x \mid \omega_j)\,P(\omega_j)$$
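A short Python sketch of this computation may help; the likelihood and prior values below are made-up placeholders, not the lecture's fish data:

```python
# A minimal sketch of the Bayes rule for C categories.
def bayes_posteriors(likelihoods, priors):
    """Return P(w_j | x) = p(x | w_j) P(w_j) / p(x) for each category j."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))  # p(x)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Placeholder values for p(x | w_j) and P(w_j), j = 1..3:
print(bayes_posteriors([0.2, 0.5, 0.3], [0.5, 0.3, 0.2]))
```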
• The Bayes decision rule is based on minimizing the
probability of error; i.e., for a given feature
value x:
Decide x ∈ ω1 if P(ω1 | x) > P(ω2 | x),
and x ∈ ω2 if P(ω1 | x) < P(ω2 | x).
• The probability of error for a particular x is:
P(error | x) = min [ P(ω1 | x), P(ω2 | x) ]
2.1 Bayesian Decision Theory: A Numerical Example
Suppose P(ω1) = 2/3 ≈ 0.67 and P(ω2) = 1/3 ≈ 0.33.
[Figure: the class-conditional densities p(x | ω1) and p(x | ω2), with p(11.5 | ω1) = 0.15 and p(11.5 | ω2) = 0.36 marked]
If x = 11.5, then p(x | ω1) = 0.15 and p(x | ω2) = 0.36, so:
p(x) = 0.15 × 0.67 + 0.36 × 0.33 ≈ 0.22
P(ω1 | x) = 0.15 × 0.67 / 0.22 ≈ 0.46
P(ω2 | x) = 0.36 × 0.33 / 0.22 ≈ 0.54
Since P(ω2 | x) > P(ω1 | x), decide fish (x) ∈ ω2.
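The arithmetic on this slide is easy to check in a few lines of Python (same numbers as above):

```python
# Verifying the slide's numbers for x = 11.5.
px_w1, px_w2 = 0.15, 0.36        # p(x|w1), p(x|w2) read off the figure
P1, P2 = 0.67, 0.33              # priors

evidence = px_w1 * P1 + px_w2 * P2
print(round(evidence, 2))                 # 0.22
print(round(px_w1 * P1 / evidence, 2))    # P(w1|x) = 0.46
print(round(px_w2 * P2 / evidence, 2))    # P(w2|x) = 0.54 -> decide w2
```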
2.1 Bayesian Decision Theory: Decision Regions
• Computing P(ωj | x) for all values of x partitions the
feature space into decision regions (rules):
– if x ∈ R1, decide ω1
– if x ∈ R2, decide ω2
[Figure: posterior probabilities versus x, with the decision regions R1 and R2 alternating along the x-axis]
Assignment 2.1
• Draw the probability densities and find the
decision regions for the following classes:
Ω = {ω1, ω2},
p(x | ω1) ~ N(20, 4),
p(x | ω2) ~ N(15, 2),
P(ω1) = 1/3, and P(ω2) = 2/3.
– Then classify a sample with feature value x = 17.
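Not a worked solution, but one possible way to set this assignment up in Python, assuming the N(μ, σ²) convention (so N(20, 4) has standard deviation 2); check your course convention before relying on it:

```python
# A sketch for Assignment 2.1 (assumes the second parameter is a variance).
import numpy as np
from scipy.stats import norm

priors = {1: 1 / 3, 2: 2 / 3}
densities = {1: norm(20, np.sqrt(4)),   # p(x|w1) ~ N(20, 4)
             2: norm(15, np.sqrt(2))}   # p(x|w2) ~ N(15, 2)

def posteriors(x):
    joint = {w: densities[w].pdf(x) * priors[w] for w in priors}
    evidence = sum(joint.values())
    return {w: joint[w] / evidence for w in priors}

print(posteriors(17))   # decide the class with the larger posterior
```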
2.2 General Bayesian Decision Theory
• Bayesian decision theory is generalized by
allowing the following:
– Having more than one feature.
– Having more than two states of nature.
– Allowing actions, and not only deciding on the
state of nature.
– Introducing a loss function, which is more
general than the probability of error.
• Allowing actions other than classification
primarily allows the possibility of rejection.
• Rejection is refusing to make a decision in close
or bad cases.
• The loss function states how costly each
action taken is.
• Suppose we have c states of nature (categories):
Ω = { ω1, ω2, …, ωc },
• a feature vector:
x = ( x1, x2, …, xd ),
• the possible actions:
A = { α1, α2, …, αa },
• and the loss, λ(αi | ωj), incurred for taking
action αi when the state of nature is ωj.
• The conditional risk, R(αi | x), for selecting the
action αi is given by:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x)$$

• The overall risk, R, is the expected loss over all x
under a given decision rule α(x):

$$R = \int R(\alpha(x) \mid x)\,p(x)\,dx$$
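A small sketch of the conditional-risk computation, with assumed loss and posterior values:

```python
# Conditional risk R(a_i|x) = sum_j lambda(a_i|w_j) P(w_j|x), assumed values.
import numpy as np

loss = np.array([[0.0, 2.0],     # lambda(a_i | w_j): rows = actions,
                 [1.0, 0.0]])    # columns = states of nature (assumed)
posteriors = np.array([0.3, 0.7])            # P(w_j | x), assumed
risks = loss @ posteriors                    # R(a_i | x) for each action
print(risks)                                 # [1.4  0.3]
print("take action", risks.argmin() + 1)     # minimum-risk action -> 2
```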
The Bayes decision rule becomes: select
the action αi for which the conditional risk,
R(αi | x), is minimum. That is:
Take action αi (i.e., decide ωi)
if R(αi | x) < R(αj | x) for all j ≠ i.
• Choosing, for every x, the action αi (i = 1, …, a)
that minimizes R(αi | x) minimizes the overall risk R.
• The overall risk R is the “expected loss
associated with a given decision rule”.
• The minimum overall risk is called the Bayes risk,
which defines the best performance that can
be achieved.
• Two-category classification example:
Suppose we have two categories {ω1, ω2} and two
actions {α1, α2}, where:
α1 : deciding ω1, and α2 : deciding ω2,
and for simplicity we write λij = λ(αi | ωj).
The conditional risks for taking α1 and α2 are:
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
There are a variety of ways to express the
minimum-risk rule, each with its own advantage:
1- The fundamental rule is:
Decide ω1 (i.e., take α1) if R(α1 | x) < R(α2 | x),
and ω2 (i.e., take α2) if R(α1 | x) > R(α2 | x).
2- The rule in terms of the posteriors is:
Decide ω1 if (λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x);
decide ω2 otherwise.
3- The rule in terms of the priors and conditional
densities is:
Decide ω1 if
(λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2);
decide ω2 otherwise.
4- The rule in terms of the likelihood ratio:
Decide ω1 if

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{\lambda_{12}-\lambda_{22}}{\lambda_{21}-\lambda_{11}}\cdot\frac{P(\omega_2)}{P(\omega_1)}$$

decide ω2 otherwise.
That is, the Bayes (optimal) decision can be
interpreted as:
“One can take an optimal decision if the
likelihood ratio exceeds a threshold value
that is independent of the observation x.”
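The likelihood-ratio form translates directly into code. This sketch reuses the numbers from the earlier numerical example, together with a zero-one loss (an assumption chosen for illustration):

```python
# Minimum-risk decision via the likelihood ratio, for assumed inputs.
def decide_lr(px_w1, px_w2, P1, P2, lam):
    """lam[i][j] = loss for taking action i+1 when the true state is j+1."""
    threshold = (lam[0][1] - lam[1][1]) / (lam[1][0] - lam[0][0]) * P2 / P1
    return 1 if px_w1 / px_w2 > threshold else 2

# Zero-one loss, p(x|w1)=0.15, p(x|w2)=0.36, P(w1)=2/3, P(w2)=1/3:
print(decide_lr(0.15, 0.36, 2 / 3, 1 / 3, [[0, 1], [1, 0]]))   # -> 2
```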
• The decision regions depend on the values of the loss
function. Let

$$\theta_\lambda = \frac{\lambda_{12}-\lambda_{22}}{\lambda_{21}-\lambda_{11}}\cdot\frac{P(\omega_2)}{P(\omega_1)}$$

then decide ω1 if:

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \theta_\lambda$$

• For different loss functions we have different thresholds:

$$\text{if } \lambda = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\ \text{then}\ \theta_a = \frac{P(\omega_2)}{P(\omega_1)}; \qquad \text{if } \lambda = \begin{pmatrix} 0 & 1 \\ 2 & 0 \end{pmatrix}\ \text{then}\ \theta_b = \frac{P(\omega_2)}{2\,P(\omega_1)}$$
2.3 Minimum-Error-Rate Classification
• Consider the zero-one (or symmetrical) loss function:

$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \dots, c$$

• Therefore, the conditional risk is:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c}\lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$

• In other words, for the symmetrical loss function the
conditional risk is the probability of error, so minimizing
the risk means maximizing the posterior P(ωi | x).
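A quick numerical check that, under the zero-one loss, the minimum-risk action coincides with the maximum-posterior class (posterior values assumed):

```python
# Zero-one loss: R(a_i|x) = 1 - P(w_i|x), so min risk = max posterior.
import numpy as np

posteriors = np.array([0.2, 0.5, 0.3])   # P(w_j | x) for c = 3, assumed
loss = 1.0 - np.eye(3)                   # zero-one (symmetrical) loss
risks = loss @ posteriors

print(np.allclose(risks, 1.0 - posteriors))    # True
print(risks.argmin() == posteriors.argmax())   # True
```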
The Minimax Criterion
• Sometimes we need to design our classifier to
perform well over a range of prior probabilities, or
for the case where the priors are not known.
• A reasonable approach is to design the classifier so
that the worst overall risk for any value of the
priors is as small as possible.
• Minimax criterion:
“minimize the maximum possible overall risk”
• For a fixed decision boundary, the overall risk is linear
in the prior P(ω1). When the constant of proportionality
(the slope) is zero, the risk is independent of the priors.
This condition gives the minimax risk, Rmm:

$$R_{mm} = \lambda_{22} + (\lambda_{12}-\lambda_{22})\int_{R_1} p(x \mid \omega_2)\,dx = \lambda_{11} + (\lambda_{21}-\lambda_{11})\int_{R_2} p(x \mid \omega_1)\,dx$$
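As a rough illustration of the slope-zero condition, the sketch below (with assumed Gaussian class-conditionals and zero-one loss) finds the boundary at which the two error integrals are equal, which makes the overall risk independent of the priors:

```python
# Minimax sketch for two assumed 1-D Gaussians under zero-one loss.
# With decision regions R1 = {x > t} and R2 = {x < t}, the risk is
# independent of P(w1) when P(x < t | w1) = P(x > t | w2); solve for t.
from scipy.stats import norm
from scipy.optimize import brentq

p1 = norm(20, 2)    # p(x | w1), assumed
p2 = norm(15, 2)    # p(x | w2), assumed

t_star = brentq(lambda t: p1.cdf(t) - p2.sf(t), 10, 25)
print(t_star)              # minimax boundary (17.5 for these densities)
print(p1.cdf(t_star))      # the prior-independent risk R_mm
```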
The Neyman-Pearson Criterion
• The Neyman-Pearson criterion:
“minimize the overall risk subject to a constraint”,
e.g., subject to:

$$\int R(\alpha_i \mid x)\,dx < \text{constant}$$

• Generally, the Neyman-Pearson criterion is satisfied by
adjusting the decision boundaries numerically. However, for
Gaussian and some other distributions, the solution can be
found analytically.
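A sketch of the Neyman-Pearson idea in its most common form: fix the false-alarm probability P(decide ω1 | ω2) at some level α and choose the threshold accordingly (Gaussians assumed for illustration):

```python
# Neyman-Pearson style threshold for two assumed 1-D Gaussians: constrain
# P(decide w1 | w2) <= alpha, with decision region R1 = {x > t}.
from scipy.stats import norm

p1, p2 = norm(20, 2), norm(15, 2)   # p(x|w1), p(x|w2), assumed
alpha = 0.05                        # allowed false-alarm probability

t = p2.ppf(1 - alpha)               # threshold from the w2 tail
print(t)                            # ~18.29
print(p1.sf(t))                     # resulting P(decide w1 | w1) ~ 0.80
```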
Assignment 2.2
• Computer exercise:
– Find the optimal decision for the following data:
Ω = {ω1, ω2},
p(x | ω1) ~ N(20, 4),
p(x | ω2) ~ N(15, 2),
P(ω1) = 2/3, and P(ω2) = 1/3,
– with the loss function:

$$\lambda = \begin{pmatrix} 1 & 2 \\ 0.5 & 1 \end{pmatrix}$$

– Then classify the samples: x = 12, 17, 18, and 20.
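Again not a worked solution, but a sketch of how Assignment 2.2 might be set up, assuming the N(μ, σ²) convention and the loss matrix as reconstructed above (verify both before use):

```python
# A sketch for Assignment 2.2; the loss matrix and variance convention
# are assumptions reconstructed from the slide, not confirmed values.
import numpy as np
from scipy.stats import norm

priors = np.array([2 / 3, 1 / 3])
densities = [norm(20, np.sqrt(4)),       # p(x|w1) ~ N(20, 4)
             norm(15, np.sqrt(2))]       # p(x|w2) ~ N(15, 2)
loss = np.array([[1.0, 2.0],             # lambda(a_i | w_j)
                 [0.5, 1.0]])

def decide(x):
    post = np.array([d.pdf(x) for d in densities]) * priors
    post /= post.sum()                   # P(w_j | x)
    risks = loss @ post                  # R(a_i | x)
    return risks.argmin() + 1            # 1 -> decide w1, 2 -> decide w2

for x in (12, 17, 18, 20):
    print(x, decide(x))
```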