Bayesian Decision Theory
Prof. Dr. Mostafa Gadal-Haqq
Faculty of Computer & Information Sciences
Computer Science Department
AIN SHAMS UNIVERSITY
CSC446 : Pattern Recognition
(Pattern Classification, Ch. 2: Sec. 2.1 to Sec. 2.3)
2.1 Bayesian Decision Theory
• Bayesian decision theory is based on
quantifying the trade-offs between various
classification decisions using probabilities
and the costs that accompany such decisions.
• It assumes that the decision problem is posed
in probabilistic terms and that all of the
relevant probability values are known.
• Back to the fish-sorting machine:
– Let ω be a random variable (the state of nature), ω ∈ {ω1, ω2}.
• For example: ω1 = sea bass, and ω2 = salmon.
• P(ω1) = the prior (a priori) probability that the
coming fish is sea bass.
• P(ω2) = the prior (a priori) probability that the
coming fish is salmon.
– The priors give us the knowledge of how likely
we are to get salmon or sea bass before the fish
actually appears.
• Decision Rule Using Priors Only:
– To make a decision about the fish that will
appear using only the priors, P(ω1) and P(ω2),
we use the following decision rule:
Decide fish ∈ ω1 if P(ω1) > P(ω2),
and fish ∈ ω2 if P(ω1) < P(ω2).
– This rule minimizes the probability of error:
Probability of error = min [ P(ω1), P(ω2) ]
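The slides give no code, so here is a minimal Python sketch of the prior-only rule, using the priors from the numerical example later in this section (P(ω1) = 2/3, P(ω2) = 1/3):

```python
# A minimal sketch of the prior-only decision rule.
priors = {"w1": 2 / 3, "w2": 1 / 3}   # P(w1), P(w2)

def decide_by_priors(p):
    """Always decide the state of nature with the largest prior."""
    return max(p, key=p.get)

print(decide_by_priors(priors))   # -> 'w1' (always the same decision)
print(min(priors.values()))       # probability of error = min prior = 1/3
```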
• That is:
– If P(ω1) >> P(ω2), we will be right most of the
time when we decide that the fish belongs to ω1.
– If P(ω1) = P(ω2), we have only a fifty-fifty chance
of being right.
– Under these conditions, no other decision rule
can yield a larger probability of being right.
• Improving the decision using an observation:
– If we know the class-conditional probability density,
p(x | ωj), of an observation x, we can improve our decision.
– For example: x describes the observed lightness of
the sea bass or salmon.
[Figure: the class-conditional densities p(x | ω1) and p(x | ω2) plotted against lightness x]
• We can improve our decision by using this
observed feature and the Bayes rule:
– Posterior = (Likelihood × Prior) / Evidence
– That is:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}$$

– where, for C categories, the evidence is:

$$p(x) = \sum_{j=1}^{C} p(x \mid \omega_j)\,P(\omega_j)$$
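A short Python sketch of this computation may help; the likelihood and prior values below are made-up placeholders, not the lecture's fish data:

```python
# A minimal sketch of the Bayes rule for C categories.
def bayes_posteriors(likelihoods, priors):
    """Return P(w_j | x) = p(x | w_j) P(w_j) / p(x) for each category j."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))  # p(x)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Placeholder values for p(x | w_j) and P(w_j), j = 1..3:
print(bayes_posteriors([0.2, 0.5, 0.3], [0.5, 0.3, 0.2]))
```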
• The Bayes decision rule is based on minimizing the
probability of error; i.e., for a given feature
value x:
Decide x ∈ ω1 if P(ω1 | x) > P(ω2 | x),
and x ∈ ω2 if P(ω1 | x) < P(ω2 | x).
• The probability of error for a particular x is:
P(error | x) = min [ P(ω1 | x), P(ω2 | x) ]
2.1 Bayesian Decision Theory: A Numerical Example
Suppose P(ω1) = 2/3 ≈ 0.67 and P(ω2) = 1/3 ≈ 0.33.
[Figure: the class-conditional densities p(x | ω1) and p(x | ω2), with p(11.5 | ω1) = 0.15 and p(11.5 | ω2) = 0.36 marked]
If x = 11.5, then p(x | ω1) = 0.15 and p(x | ω2) = 0.36, so:
p(x) = 0.15 × 0.67 + 0.36 × 0.33 ≈ 0.22
P(ω1 | x) = 0.15 × 0.67 / 0.22 ≈ 0.46
P(ω2 | x) = 0.36 × 0.33 / 0.22 ≈ 0.54
Since P(ω2 | x) > P(ω1 | x), decide fish (x) ∈ ω2.
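The arithmetic on this slide is easy to check in a few lines of Python (same numbers as above):

```python
# Verifying the slide's numbers for x = 11.5.
px_w1, px_w2 = 0.15, 0.36        # p(x|w1), p(x|w2) read off the figure
P1, P2 = 0.67, 0.33              # priors

evidence = px_w1 * P1 + px_w2 * P2
print(round(evidence, 2))                 # 0.22
print(round(px_w1 * P1 / evidence, 2))    # P(w1|x) = 0.46
print(round(px_w2 * P2 / evidence, 2))    # P(w2|x) = 0.54 -> decide w2
```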
2.1 Bayesian Decision Theory: Decision Regions
• Computing P(ωj | x) for all values of x partitions the
feature space into decision regions (rules):
– if x ∈ R1, decide ω1
– if x ∈ R2, decide ω2
[Figure: posterior probabilities versus x, with the decision regions R1 and R2 alternating along the x-axis]
Assignment 2.1
• Draw the probability densities and find the
decision regions for the following classes:
Ω = {ω1, ω2},
p(x | ω1) ~ N(20, 4),
p(x | ω2) ~ N(15, 2),
P(ω1) = 1/3, and P(ω2) = 2/3.
– Then classify a sample with feature value x = 17.
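Not a worked solution, but one possible way to set this assignment up in Python, assuming the N(μ, σ²) convention (so N(20, 4) has standard deviation 2); check your course convention before relying on it:

```python
# A sketch for Assignment 2.1 (assumes the second parameter is a variance).
import numpy as np
from scipy.stats import norm

priors = {1: 1 / 3, 2: 2 / 3}
densities = {1: norm(20, np.sqrt(4)),   # p(x|w1) ~ N(20, 4)
             2: norm(15, np.sqrt(2))}   # p(x|w2) ~ N(15, 2)

def posteriors(x):
    joint = {w: densities[w].pdf(x) * priors[w] for w in priors}
    evidence = sum(joint.values())
    return {w: joint[w] / evidence for w in priors}

print(posteriors(17))   # decide the class with the larger posterior
```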
2.2 General Bayesian Decision Theory
• Bayesian decision theory is generalized by
allowing the following:
– Having more than one feature.
– Having more than two states of nature.
– Allowing actions, and not only deciding on the
state of nature.
– Introducing a loss function, which is more
general than the probability of error.
• Allowing actions other than classification
primarily allows the possibility of rejection.
• Rejection is refusing to make a decision in close
or bad cases.
• The loss function states how costly each
action taken is.
• Suppose we have c states of nature (categories):
Ω = { ω1, ω2, …, ωc },
• a feature vector:
x = ( x1, x2, …, xd ),
• the possible actions:
A = { α1, α2, …, αa },
• and the loss, λ(αi | ωj), incurred for taking
action αi when the state of nature is ωj.
• The conditional risk, R(αi | x), for selecting the
action αi is given by:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x)$$

• The overall risk, R, is the expected loss over all x
under a given decision rule α(x):

$$R = \int R(\alpha(x) \mid x)\,p(x)\,dx$$
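A small sketch of the conditional-risk computation, with assumed loss and posterior values:

```python
# Conditional risk R(a_i|x) = sum_j lambda(a_i|w_j) P(w_j|x), assumed values.
import numpy as np

loss = np.array([[0.0, 2.0],     # lambda(a_i | w_j): rows = actions,
                 [1.0, 0.0]])    # columns = states of nature (assumed)
posteriors = np.array([0.3, 0.7])            # P(w_j | x), assumed
risks = loss @ posteriors                    # R(a_i | x) for each action
print(risks)                                 # [1.4  0.3]
print("take action", risks.argmin() + 1)     # minimum-risk action -> 2
```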
The Bayes decision rule becomes: select
the action αi for which the conditional risk,
R(αi | x), is minimum. That is:
Take action αi (i.e., decide ωi)
if R(αi | x) < R(αj | x) for all j ≠ i.
• Choosing, for every x, the action αi (i = 1, …, a)
that minimizes R(αi | x) minimizes the overall risk R.
• The overall risk R is the “expected loss
associated with a given decision rule”.
• The minimum overall risk is called the Bayes risk,
which defines the best performance that can
be achieved.
• Two-category classification example:
Suppose we have two categories {ω1, ω2} and two
actions {α1, α2}, where:
α1 : deciding ω1, and α2 : deciding ω2,
and for simplicity we write λij = λ(αi | ωj).
The conditional risks for taking α1 and α2 are:
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
There are a variety of ways to express the
minimum-risk rule, each with its own advantage:
1- The fundamental rule is:
Decide ω1 (i.e., take α1) if R(α1 | x) < R(α2 | x),
and ω2 (i.e., take α2) if R(α1 | x) > R(α2 | x).
2- The rule in terms of the posteriors is:
Decide ω1 if (λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x);
decide ω2 otherwise.
3- The rule in terms of the priors and conditional
densities is:
Decide ω1 if
(λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2);
decide ω2 otherwise.
4- The rule in terms of the likelihood ratio:
Decide ω1 if

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{\lambda_{12}-\lambda_{22}}{\lambda_{21}-\lambda_{11}}\cdot\frac{P(\omega_2)}{P(\omega_1)}$$

decide ω2 otherwise.
That is, the Bayes (optimal) decision can be
interpreted as:
“One can take an optimal decision if the
likelihood ratio exceeds a threshold value
that is independent of the observation x.”
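The likelihood-ratio form translates directly into code. This sketch reuses the numbers from the earlier numerical example, together with a zero-one loss (an assumption chosen for illustration):

```python
# Minimum-risk decision via the likelihood ratio, for assumed inputs.
def decide_lr(px_w1, px_w2, P1, P2, lam):
    """lam[i][j] = loss for taking action i+1 when the true state is j+1."""
    threshold = (lam[0][1] - lam[1][1]) / (lam[1][0] - lam[0][0]) * P2 / P1
    return 1 if px_w1 / px_w2 > threshold else 2

# Zero-one loss, p(x|w1)=0.15, p(x|w2)=0.36, P(w1)=2/3, P(w2)=1/3:
print(decide_lr(0.15, 0.36, 2 / 3, 1 / 3, [[0, 1], [1, 0]]))   # -> 2
```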
• The decision regions depend on the values of the loss
function. Let

$$\theta_\lambda = \frac{\lambda_{12}-\lambda_{22}}{\lambda_{21}-\lambda_{11}}\cdot\frac{P(\omega_2)}{P(\omega_1)}$$

then decide ω1 if:

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \theta_\lambda$$

• For different loss functions we have different thresholds:

$$\text{if } \lambda = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\ \text{then}\ \theta_a = \frac{P(\omega_2)}{P(\omega_1)}; \qquad \text{if } \lambda = \begin{pmatrix} 0 & 1 \\ 2 & 0 \end{pmatrix}\ \text{then}\ \theta_b = \frac{P(\omega_2)}{2\,P(\omega_1)}$$
2.3 Minimum-Error-Rate Classification
• Consider the zero-one (or symmetrical) loss function:

$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \dots, c$$

• Therefore, the conditional risk is:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c}\lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$

• In other words, for the symmetrical loss function the
conditional risk is the probability of error, so minimizing
the risk means maximizing the posterior P(ωi | x).
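A quick numerical check that, under the zero-one loss, the minimum-risk action coincides with the maximum-posterior class (posterior values assumed):

```python
# Zero-one loss: R(a_i|x) = 1 - P(w_i|x), so min risk = max posterior.
import numpy as np

posteriors = np.array([0.2, 0.5, 0.3])   # P(w_j | x) for c = 3, assumed
loss = 1.0 - np.eye(3)                   # zero-one (symmetrical) loss
risks = loss @ posteriors

print(np.allclose(risks, 1.0 - posteriors))    # True
print(risks.argmin() == posteriors.argmax())   # True
```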
The Minimax Criterion
• Sometimes we need to design our classifier to
perform well over a range of prior probabilities, or
for the case where the priors are not known.
• A reasonable approach is to design the classifier so
that the worst overall risk for any value of the
priors is as small as possible.
• Minimax criterion:
“minimize the maximum possible overall risk”
• For a fixed decision boundary, the overall risk is linear
in the prior P(ω1). When the constant of proportionality
(the slope) is zero, the risk is independent of the priors.
This condition gives the minimax risk, Rmm:

$$R_{mm} = \lambda_{22} + (\lambda_{12}-\lambda_{22})\int_{R_1} p(x \mid \omega_2)\,dx = \lambda_{11} + (\lambda_{21}-\lambda_{11})\int_{R_2} p(x \mid \omega_1)\,dx$$
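As a rough illustration of the slope-zero condition, the sketch below (with assumed Gaussian class-conditionals and zero-one loss) finds the boundary at which the two error integrals are equal, which makes the overall risk independent of the priors:

```python
# Minimax sketch for two assumed 1-D Gaussians under zero-one loss.
# With decision regions R1 = {x > t} and R2 = {x < t}, the risk is
# independent of P(w1) when P(x < t | w1) = P(x > t | w2); solve for t.
from scipy.stats import norm
from scipy.optimize import brentq

p1 = norm(20, 2)    # p(x | w1), assumed
p2 = norm(15, 2)    # p(x | w2), assumed

t_star = brentq(lambda t: p1.cdf(t) - p2.sf(t), 10, 25)
print(t_star)              # minimax boundary (17.5 for these densities)
print(p1.cdf(t_star))      # the prior-independent risk R_mm
```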
The Neyman-Pearson Criterion
• The Neyman-Pearson criterion:
“minimize the overall risk subject to a constraint”,
e.g., subject to:

$$\int R(\alpha_i \mid x)\,dx < \text{constant}$$

• Generally, the Neyman-Pearson criterion is satisfied by
adjusting the decision boundaries numerically. However, for
Gaussian and some other distributions, the solution can be
found analytically.
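A sketch of the Neyman-Pearson idea in its most common form: fix the false-alarm probability P(decide ω1 | ω2) at some level α and choose the threshold accordingly (Gaussians assumed for illustration):

```python
# Neyman-Pearson style threshold for two assumed 1-D Gaussians: constrain
# P(decide w1 | w2) <= alpha, with decision region R1 = {x > t}.
from scipy.stats import norm

p1, p2 = norm(20, 2), norm(15, 2)   # p(x|w1), p(x|w2), assumed
alpha = 0.05                        # allowed false-alarm probability

t = p2.ppf(1 - alpha)               # threshold from the w2 tail
print(t)                            # ~18.29
print(p1.sf(t))                     # resulting P(decide w1 | w1) ~ 0.80
```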
Assignment 2.2
• Computer exercise:
– Find the optimal decision for the following data:
Ω = {ω1, ω2},
p(x | ω1) ~ N(20, 4),
p(x | ω2) ~ N(15, 2),
P(ω1) = 2/3, and P(ω2) = 1/3,
– with the loss function:

$$\lambda = \begin{pmatrix} 1 & 2 \\ 0.5 & 1 \end{pmatrix}$$

– Then classify the samples: x = 12, 17, 18, and 20.
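Again not a worked solution, but a sketch of how Assignment 2.2 might be set up, assuming the N(μ, σ²) convention and the loss matrix as reconstructed above (verify both before use):

```python
# A sketch for Assignment 2.2; the loss matrix and variance convention
# are assumptions reconstructed from the slide, not confirmed values.
import numpy as np
from scipy.stats import norm

priors = np.array([2 / 3, 1 / 3])
densities = [norm(20, np.sqrt(4)),       # p(x|w1) ~ N(20, 4)
             norm(15, np.sqrt(2))]       # p(x|w2) ~ N(15, 2)
loss = np.array([[1.0, 2.0],             # lambda(a_i | w_j)
                 [0.5, 1.0]])

def decide(x):
    post = np.array([d.pdf(x) for d in densities]) * priors
    post /= post.sum()                   # P(w_j | x)
    risks = loss @ post                  # R(a_i | x)
    return risks.argmin() + 1            # 1 -> decide w1, 2 -> decide w2

for x in (12, 17, 18, 20):
    print(x, decide(x))
```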