Outline

What is pattern recognition? What is classification?
Need for Probabilistic Reasoning
Probabilistic Classification Theory
What is Bayesian Decision Theory? History; prior probabilities; class-conditional probabilities; Bayes formula; a casual formulation; decision
Classification by Bayes rule: basic concepts; more general forms of Bayes rule; discriminant functions; Bayesian belief networks
TYPICAL APPLICATIONS: AN IMAGE PROCESSING EXAMPLE
• Sorting Fish: incoming fish are sorted according to species using optical sensing (sea bass or salmon?)
Processing pipeline: Sensing, Segmentation, Feature Extraction
• Problem Analysis:
set up a camera and take some sample images to extract features
Consider features such as length, lightness, width, number and shape of fins, position of mouth, etc.
What is pattern recognition?
Pattern Classification System
Preprocessing: segment (isolate) fish from one another and from the background
Feature Extraction: reduce the data by measuring certain features
Classification: divide the feature space into decision regions
TYPICAL APPLICATIONS: ADD ANOTHER FEATURE
• Lightness is a better feature than length because it reduces the misclassification error.
• Can we combine features in such a way that we improve performance? (Hint: correlation)
Move the decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!)
Task of decision theory
Threshold decision boundary and cost relationship
Feature Vector
Adopt the lightness and add the width of the fish to the feature vector:
x = [x1, x2]T, where x1 = lightness and x2 = width.
TYPICAL APPLICATIONS: WIDTH AND LIGHTNESS
• Treat the features as an N-tuple (here, a two-dimensional vector)
• Create a scatter plot
• Draw a line (regression) separating the two classes
Straight line decision boundary
Features
We might add other features that are not highly correlated with the ones we already have. Be careful not to reduce performance by adding "noisy features".
Ideally, the best decision boundary might seem to be the one that provides optimal performance on the training data (see the following figure).
TYPICAL APPLICATIONS: DECISION THEORY
• Can we do better than a linear classifier?
• What is wrong with this decision surface? (hint: generalization)
Is this a good decision boundary?
Decision Boundary Choice
Our satisfaction is premature, because the central aim of designing a classifier is to correctly classify new (test) input: the issue of generalization!
TYPICAL APPLICATIONS: GENERALIZATION AND RISK
• Why might a smoother decision surface be a better choice? (hint: Occam’s Razor).
• PR investigates how to find such “optimal” decision surfaces and how to provide system designers with the tools to make intelligent trade-offs.
Better decision boundary
Need for Probabilistic Reasoning
Most everyday reasoning is based on uncertain evidence and inferences.
Classical logic, which only allows conclusions to be strictly true or strictly false, does not account for this uncertainty or the need to weigh and combine conflicting evidence.
Today's expert systems employ fairly ad hoc methods for reasoning under uncertainty and for combining evidence.
Probabilistic Classification Theory
In classification, the classes overlap, so each pattern belongs to a class only with some probability. In practice, some overlap between classes and random variation within classes occur; hence perfect separation between classes cannot be achieved, and misclassification may occur.
Bayesian decision theory lets us attach a cost function to the case where an input vector that belongs to class A is mistakenly assigned (misclassified) to class B.
HISTORY
Bayesian Probability was named after Reverend Thomas Bayes (1702-1761).
He proved a special case of what is currently known as the Bayes Theorem.
The term “Bayesian” came into use around the 1950’s.
http://en.wikipedia.org/wiki/Bayesian_probability
What is Bayesian Decision Theory?
HISTORY (Cont.)
Pierre-Simon, Marquis de Laplace (1749-1827) independently proved a generalized version of Bayes Theorem.
1970s: Bayesian belief networks at Stanford University (Judea Pearl, 1988).
The ideas proposed above were not fully developed until later; BBNs became popular in the 1990s.
HISTORY (Cont.)
Current uses of Bayesian networks: Microsoft's printer troubleshooter; diagnosing diseases (MYCIN); predicting oil and stock prices; controlling the space shuttle; risk analysis (schedule and cost overruns).
BAYESIAN DECISION THEORY: PROBABILISTIC DECISION THEORY
Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. It uses a probabilistic approach to help make decisions (e.g., classification) so as to minimize the risk (cost). Assume all relevant probability distributions are known (later we will learn how to estimate these from data).
BAYESIAN DECISION THEORY: PRIOR PROBABILITIES
The state of nature is prior information. Let ω denote the state of nature, modeled as a random variable:
ω = ω1: the event that the next fish is a sea bass
Category ω1: sea bass; category ω2: salmon.
• A priori probabilities:
P(ω1) = probability of category 1
P(ω2) = probability of category 2
P(ω1) + P(ω2) = 1 (either ω1 or ω2 must occur)
• Decision rule: decide ω1 if P(ω1) > P(ω2); otherwise, decide ω2.
But we know there will be many mistakes...
http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm
BAYESIAN DECISION THEORY: CLASS-CONDITIONAL PROBABILITIES
A decision rule based only on prior information always produces the same result and ignores measurements.
If P(ω1) >> P(ω2), we will be correct most of the time.
• Given a feature x (lightness), which is a continuous random variable, p(x|ωj) is the class-conditional probability density function.
• p(x|ω1) and p(x|ω2) describe the difference in lightness between the sea bass and salmon populations.
p(lightness | salmon)? p(lightness | sea bass)?
Let x be a continuous random variable. p(x|ω) is the probability density for x given the state of nature ω.
BAYESIAN DECISION THEORY: BAYES FORMULA
Suppose we know both P(ωj) and p(x|ωj), and we can measure x. How does this influence our decision?
The joint probability of finding a pattern that is in category ωj and has feature value x is:
p(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj)
• Rearranging terms, we arrive at Bayes formula.
How do we combine a priori and class-conditional probabilities to obtain the probability of a state of nature?
BAYESIAN DECISION THEORY: POSTERIOR PROBABILITIES
Bayes formula:
P(ωj|x) = p(x|ωj) P(ωj) / p(x)
can be expressed in words as:
posterior = (likelihood × prior) / evidence
By measuring x, we can convert the prior probability P(ωj) into a posterior probability P(ωj|x).
The evidence can be viewed as a scale factor and is often ignored in optimization applications (e.g., speech recognition).
Bayes decision (two categories): choose ω1 if P(ω1|x) > P(ω2|x); otherwise choose ω2.
A Casual Formulation
• The prior probability reflects knowledge of the relative frequency of instances of a class.
• The likelihood is a measure of the probability that a measurement value occurs in a class.
• The evidence is a scaling term.
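The prior-times-likelihood-over-evidence combination above can be sketched in a few lines of code. This is a minimal pure-Python illustration; the prior 2/3 vs 1/3 matches the fish example in these notes, but the likelihood values 0.1 and 0.4 are made-up numbers for one hypothetical lightness measurement.

```python
def posteriors(priors, likelihoods):
    """Combine priors P(w_j) with likelihoods p(x|w_j) via Bayes formula:
    posterior_j = likelihood_j * prior_j / evidence,
    where evidence = sum_j likelihood_j * prior_j."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Hypothetical fish example: priors for (sea bass, salmon) and
# class-conditional densities evaluated at one lightness value x.
post = posteriors([2 / 3, 1 / 3], [0.1, 0.4])
```

Even though the sea bass prior is twice the salmon prior, the larger salmon likelihood dominates, so the posterior favors salmon.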
BAYESIAN THEOREM
A special case of the Bayesian theorem:
P(A∩B) = P(B) P(A|B)
P(B∩A) = P(A) P(B|A)
Since P(A∩B) = P(B∩A),
P(B) P(A|B) = P(A) P(B|A)
=> P(A|B) = P(A) P(B|A) / P(B)
Preliminaries and Notations
Ω = {ω1, ω2, ..., ωc}: the states of nature
P(ωi): prior probability
x: feature vector
p(x|ωi): class-conditional density
P(ωi|x): posterior probability
Decision
P(ωi|x) = p(x|ωi) P(ωi) / p(x)
D(x) = arg max_i P(ωi|x)
Decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i
Decide ωi if p(x|ωi)P(ωi) > p(x|ωj)P(ωj) for all j ≠ i
Special cases:
1. P(ω1) = P(ω2) = ... = P(ωc): the decision is based entirely on the likelihood p(x|ωj).
2. p(x|ω1) = p(x|ω2) = ... = p(x|ωc): x gives us no useful information; the decision is based entirely on the priors.
The evidence p(x) is a scale factor that assures the conditional probabilities sum to 1: P(ω1|x) + P(ω2|x) = 1.
We can eliminate the scale factor, since it appears on both sides of the inequality.
Two Categories
Decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i
Decide ωi if p(x|ωi)P(ωi) > p(x|ωj)P(ωj) for all j ≠ i
Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2
Decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2
Special cases:
1. P(ω1) = P(ω2): decide ω1 if p(x|ω1) > p(x|ω2); otherwise decide ω2
2. p(x|ω1) = p(x|ω2): decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
Example (decision regions R1 and R2 for P(ω1) = P(ω2))
Special cases:
1. P(ω1) = P(ω2): decide ω1 if p(x|ω1) > p(x|ω2); otherwise decide ω2
2. p(x|ω1) = p(x|ω2): decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
Example (decision regions for P(ω1) = 2/3, P(ω2) = 1/3)
Bayes decision rule: decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2
BAYESIAN DECISION THEORY: POSTERIOR PROBABILITIES
Two-class fish sorting problem (P(ω1) = 2/3, P(ω2) = 1/3):
For every value of x, the posteriors sum to 1.0.
At x = 14, the probability of category ω2 is 0.08, and of category ω1 is 0.92.
BAYESIAN DECISION THEORY: BAYES DECISION RULE
Classification error. Decision rule: for an observation x, decide ω1 if P(ω1|x) > P(ω2|x); otherwise, decide ω2.
Probability of error:
P(error|x) = P(ω1|x) if we decide ω2; P(ω2|x) if we decide ω1
For two categories: P(error|x) = min[P(ω1|x), P(ω2|x)]
The average probability of error is given by:
P(error) = ∫ P(error, x) dx = ∫ P(error|x) p(x) dx
If for every x we ensure that P(error|x) is as small as possible, then the integral is as small as possible. Thus, the Bayes decision rule minimizes P(error).
GENERALIZATION OF THE TWO-CLASS PROBLEM: CONTINUOUS FEATURES
Generalization of the preceding ideas:
Use more than one feature (e.g., length and lightness)
Use more than two states of nature (e.g., N-way classification)
Allow actions other than deciding on the state of nature (e.g., rejection: refusing to take an action when the alternatives are close or confidence is low)
Introduce a loss function more general than the probability of error (e.g., errors are not equally costly)
Let us replace the scalar x by the vector x in a d-dimensional Euclidean space, Rd, called the feature space.
LOSS FUNCTION
Ω = {ω1, ω2, ..., ωc}: a set of c states of nature (c categories)
A = {α1, α2, ..., αa}: a set of a possible actions
λij = λ(αi|ωj): the loss incurred for taking action αi when the true state of nature is ωj.
We want to minimize the expected loss (risk) in making a decision. The loss for a correct decision can be zero.
Examples
Ex 1: Fish classification. X = the image of a fish; x = (brightness, length, fin count, etc.); ω is our belief of what the fish type is, Ω = {"sea bass", "salmon", "trout", ...}; α is a decision for the fish type, A = {"sea bass", "salmon", "trout", "manual inspection needed", ...}.
Ex 2: Medical diagnosis. X = all the available medical tests and imaging scans that a doctor can order for a patient; x = (blood pressure, glucose level, cough, x-ray, etc.); ω is an illness type, Ω = {"flu", "cold", "TB", "pneumonia", "lung cancer", ...}; α is a decision for treatment, A = {"Tylenol", "hospitalize", "more tests needed", ...}.
Conditional Risk
Given x, the expected loss (risk) associated with taking action αi is:
R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x)
where the posteriors come from Bayes formula:
P(ωj|x) = p(x|ωj) P(ωj) / p(x),  with  p(x) = Σ_{j=1}^{c} p(x|ωj) P(ωj)
0/1 Loss Function
λ(αi|ωj) = 0 if i = j (a correct decision associated with ωj); 1 otherwise.
Then the conditional risk is:
R(αi|x) = Σ_{j≠i} P(ωj|x) = 1 - P(ωi|x) = P(error|x)
Decision
A general decision rule is a function α(x) that tells us which action to take for every possible observation.
Bayesian decision rule:
α(x) = arg min_i R(αi|x),  where  R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x)
Overall Risk
The overall risk is given by:
R = ∫ R(α(x)|x) p(x) dx
where α(x) is the decision function. If we choose α(x) so that R(α(x)|x) is as small as possible for every x, the overall risk will be minimized.
Bayesian decision rule: α(x) = arg min_i R(αi|x) is the optimal rule for minimizing the overall risk. Compute the conditional risk for every action and select the action that minimizes R(αi|x). The resulting overall risk, denoted R*, is called the Bayes risk: the best performance that can be achieved.
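The rule "compute every conditional risk, pick the smallest" can be sketched directly. This is a minimal pure-Python illustration; the 2x2 loss matrix and the posterior values (0.7, 0.3) are hypothetical numbers, not from the text.

```python
def bayes_action(loss, post):
    """Pick the action minimizing the conditional risk
    R(a_i|x) = sum_j loss[i][j] * P(w_j|x)."""
    risks = [sum(l * p for l, p in zip(row, post)) for row in loss]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks

# Hypothetical asymmetric losses: deciding w1 when w2 is true costs 2,
# deciding w2 when w1 is true costs 1, correct decisions cost 0.
loss = [[0.0, 2.0],
        [1.0, 0.0]]
action, risks = bayes_action(loss, [0.7, 0.3])
```

With these numbers the risks are 0.6 and 0.7, so the first action is chosen even though the error costs are skewed against it.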
Two-Category Classification
Let action αi correspond to deciding ωi, and write λij = λ(αi|ωj):

             State of Nature
Action       ω1     ω2
α1           λ11    λ12
α2           λ21    λ22

The conditional risk is given by:
R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x)
R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x)
Two-Category Classification
Our decision rule is: choose ω1 if R(α1|x) < R(α2|x); otherwise decide ω2. Equivalently, decide ω1 if
(λ21 - λ11) P(ω1|x) > (λ12 - λ22) P(ω2|x)
If the loss incurred for making an error is greater than that incurred for being correct, the factors (λ21 - λ11) and (λ12 - λ22) are positive, and they simply scale the posteriors before comparison.
Two-Category Classification
By employing Bayes formula P(ωi|x) = p(x|ωi)P(ωi)/p(x), we can replace the posteriors by the prior probabilities and conditional densities (the evidence p(x) appears on both sides and is irrelevant). Decide ω1 if:
(λ21 - λ11) p(x|ω1) P(ω1) > (λ12 - λ22) p(x|ω2) P(ω2)
i.e., if
p(x|ω1) / p(x|ω2) > [(λ12 - λ22) / (λ21 - λ11)] · [P(ω2) / P(ω1)]
Two-Category Classification
Decide ω1 if
p(x|ω1) / p(x|ω2) > [(λ12 - λ22) / (λ21 - λ11)] · [P(ω2) / P(ω1)]
The left-hand side is the likelihood ratio; the right-hand side is a threshold. Stated as: choose α1 if the likelihood ratio exceeds a threshold value that is independent of the observation x. (This slide will be recalled later.)
Loss Factors
If the loss factors are identical (e.g., the zero-one loss λ11 = λ22 = 0, λ12 = λ21 = 1) and the prior probabilities are equal, this reduces to a standard likelihood ratio test:
choose ω1 if: p(x|ω1) > p(x|ω2)
MINIMUM ERROR RATE
Consider a symmetrical or zero-one loss function:
λ(αi|ωj) = 0 if i = j; 1 if i ≠ j   (i, j = 1, 2, ..., c)
The conditional risk is:
R(αi|x) = Σ_{j≠i} P(ωj|x) = 1 - P(ωi|x)
The conditional risk is the average probability of error, and the error rate (the probability of error) is to be minimized. To minimize error, maximize P(ωi|x); this is also known as maximum a posteriori (MAP) decoding.
MINIMUM ERROR RATE: LIKELIHOOD RATIO
Minimum error rate classification: choose ωi if P(ωi|x) > P(ωj|x) for all j ≠ i
Example
For the sea bass population, the lightness x is a normal random variable distributed according to N(4,1); for the salmon population, x is distributed according to N(10,1).
Select the optimal decision where:
a. The two fish are equiprobable
b. P(sea bass) = 2 × P(salmon)
c. The cost of classifying a fish as salmon when it truly is sea bass is $2, and the cost of classifying a fish as sea bass when it truly is salmon is $1.
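Parts (a) and (b) of this example can be checked numerically. A minimal sketch follows; the function names `gauss` and `decide` are my own, and the closed-form boundary 7 + ln(2)/6 for part (b) comes from equating the prior-weighted densities and is not stated in the text.

```python
import math

def gauss(x, mu, sigma=1.0):
    """Univariate normal density N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def decide(x, p_bass, p_salmon):
    """Bayes decision: compare p(x|w)P(w) for sea bass ~ N(4,1) vs salmon ~ N(10,1)."""
    return "sea bass" if gauss(x, 4) * p_bass > gauss(x, 10) * p_salmon else "salmon"

# (a) equiprobable: the boundary is the midpoint x = 7
# (b) P(sea bass) = 2 P(salmon): solving the equality shifts it to x = 7 + ln(2)/6
boundary_b = 7 + math.log(2) / 6
```

Note how the larger sea bass prior in part (b) pushes the boundary toward the salmon mean, enlarging the sea bass region.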
The Multicategory Classification
How do we represent pattern classifiers? The most common way is through discriminant functions. Recall that {ω1, ω2, ..., ωc} are the possible states of nature.
For each class we create a discriminant function gi(x). The classifier is a network or machine that computes the c discriminant functions g1(x), g2(x), ..., gc(x) and assigns x to ωi if gi(x) > gj(x) for all j ≠ i.
How should the discriminant functions be defined?
Simple Discriminant Functions
Minimum risk case: gi(x) = -R(αi|x)
Minimum error-rate case: gi(x) = P(ωi|x)
Equivalently: gi(x) = p(x|ωi) P(ωi), or gi(x) = ln p(x|ωi) + ln P(ωi)
Notice the decision is unchanged if we replace every gi(x) by f(gi(x)) for any monotonically increasing function f(·): the f(gi(·)) are also valid discriminant functions.
Decision Regions
Ri = {x | gi(x) > gj(x) for all j ≠ i}
The net effect is to divide the feature space into c decision regions (one for each class), separated by decision boundaries. (Two-category example.)
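The discriminant-function classifier above can be sketched for one-dimensional Gaussian classes using gi(x) = ln p(x|ωi) + ln P(ωi). This is a minimal illustration; the three class means 0, 5, 10 with unit variance and equal priors are hypothetical.

```python
import math

def g(x, mu, var, prior):
    """Log-discriminant g_i(x) = ln p(x|w_i) + ln P(w_i) for a 1-D Gaussian class."""
    return -(x - mu) ** 2 / (2 * var) - 0.5 * math.log(2 * math.pi * var) + math.log(prior)

def classify(x, params):
    """Assign x to the class with the largest discriminant value."""
    scores = [g(x, mu, var, prior) for (mu, var, prior) in params]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical three-class setup: means 0, 5, 10, unit variance, equal priors.
params = [(0.0, 1.0, 1 / 3), (5.0, 1.0, 1 / 3), (10.0, 1.0, 1 / 3)]
```

With equal priors and variances this reduces to nearest-mean classification, so the decision regions are the intervals around each mean.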
Basics of Probability
Discrete random variable X (assume integer-valued):
Probability mass function (pmf): p(x) = P(X = x)
Cumulative distribution function (cdf): F(x) = P(X ≤ x) = Σ_{t ≤ x} p(t)
Continuous random variable X:
Probability density function (pdf): p(x) or f(x) (not a probability)
Cumulative distribution function (cdf): F(x) = P(X ≤ x) = ∫_{-∞}^{x} p(t) dt
Probability mass function
The graph of a probability mass function: all values of this function must be non-negative and sum to 1.
Probability density function
The probability that X lies in an interval is obtained by integrating the density f(x) over that interval. For example, the probability of the variable X being within the interval [4.3, 7.8] is ∫_{4.3}^{7.8} f(x) dx.
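The interval probability above can be computed by numerically integrating a density. A minimal sketch, assuming a hypothetical N(6, 1) variable (the text does not specify a distribution for this interval) and a simple trapezoidal rule:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def prob_interval(f, a, b, n=10000):
    """P(a <= X <= b) as the integral of the density f over [a, b] (trapezoidal rule)."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

# P(4.3 <= X <= 7.8) for the hypothetical N(6, 1) variable
p = prob_interval(lambda t: normal_pdf(t, mu=6.0), 4.3, 7.8)
```

Integrating over a very wide interval recovers (approximately) the total probability 1, which is the defining normalization of a density.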
Expectations
Let g be a function of random variable X.
E[g(X)] = Σ_x g(x) p(x)  (X discrete);  E[g(X)] = ∫ g(x) p(x) dx  (X continuous)
The kth moment: E[X^k]
The kth central moment: E[(X - μX)^k]
The 1st moment: μX = E[X]
Important Expectations
Mean: μX = E[X] = Σ_x x p(x) (discrete) or ∫ x p(x) dx (continuous)
Variance: Var[X] = σX² = E[(X - μX)²] = Σ_x (x - μX)² p(x) (discrete) or ∫ (x - μX)² p(x) dx (continuous)
Fact: Var[X] = E[X²] - (E[X])²
Entropy
H[X] = -Σ_x p(x) ln p(x)  (X discrete);  H[X] = -∫ p(x) ln p(x) dx  (X continuous)
The entropy measures the fundamental uncertainty in the value of points selected randomly from a distribution.
Univariate Gaussian Distribution
X ~ N(μ, σ²):
p(x) = (1 / (√(2π) σ)) exp(-(x - μ)² / (2σ²))
E[X] = μ, Var[X] = σ²
Properties:
1. Maximizes the entropy (for a given mean and variance)
2. Central limit theorem
Illustration of the central limit theorem. Let x1, x2, ..., xn be a sequence of n independent and identically distributed random variables, each with finite expectation µ and variance σ² > 0. The central limit theorem states that as the sample size n increases, the distribution of the sample average of these random variables approaches the normal distribution with mean µ and variance σ²/n, irrespective of the shape of the original distribution.
(Figure: the original probability density function, followed by the densities of the sum of two, three, and four terms.)
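The variance-σ²/n claim can be checked empirically by simulation. A minimal sketch using uniform draws (my choice of source distribution; averaging n = 12 uniforms is a classic illustration, since Var = (1/12)/12 = 1/144):

```python
import random

random.seed(0)  # deterministic run for reproducibility

def sample_mean_of_uniforms(n):
    """Average of n iid Uniform(0,1) draws; CLT says this tends to N(0.5, 1/(12n))."""
    return sum(random.random() for _ in range(n)) / n

# Empirical mean and variance of the sample average for n = 12
means = [sample_mean_of_uniforms(12) for _ in range(20000)]
m = sum(means) / len(means)
var = sum((x - m) ** 2 for x in means) / len(means)
```

The empirical variance should land near 1/144 ≈ 0.00694, a twelfth of the single-draw variance 1/12, as the theorem predicts.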
Random Vectors
X = (X1, X2, ..., Xd)T: a d-dimensional random vector, X: Ω → R^d
Vector mean: μ = E[X] = (μ1, μ2, ..., μd)T
Covariance matrix:
Σ = E[(X - μ)(X - μ)T] = [σij], a d × d matrix with entries σij = Cov(Xi, Xj)
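The vector mean and covariance matrix can be estimated from samples in a couple of lines. A minimal sketch using NumPy; the data points are the class-1 exemplars from the worked example later in these notes, and the (1/N) normalization matches the covariance computation used there.

```python
import numpy as np

def mean_and_cov(X):
    """Sample vector mean and (1/N-normalized) covariance Sigma = E[(X-mu)(X-mu)^T]."""
    mu = X.mean(axis=0)
    D = X - mu                     # deviations from the mean, one row per sample
    return mu, D.T @ D / len(X)

# Class-1 exemplars from the worked example later in the notes
X1 = np.array([[2, 6], [3, 4], [3, 8], [4, 6]], dtype=float)
mu1, S1 = mean_and_cov(X1)
```

For these four points the mean is (3, 6) and the covariance is diagonal, since the x1 and x2 deviations never co-occur.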
Multivariate Gaussian Distribution
For a d-dimensional random vector X ~ N(μ, Σ):
p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(-(1/2)(x - μ)T Σ^{-1} (x - μ))
E[X] = μ, E[(X - μ)(X - μ)T] = Σ
(For d = 1 this reduces to the univariate density with variance σ².)
Properties of N(μ, Σ)
Let X ~ N(μ, Σ) be a d-dimensional random vector, and let Y = A^T X, where A is a d × k matrix. Then:
Y ~ N(A^T μ, A^T Σ A)
On Parameters of N(μ, Σ)
X = (X1, X2, ..., Xd)T ~ N(μ, Σ), with μ = E[X] = (μ1, ..., μd)T and Σ = E[(X - μ)(X - μ)T] = [σij] (d × d):
μi = E[Xi]
σij = E[(Xi - μi)(Xj - μj)] = Cov(Xi, Xj)
σii = E[(Xi - μi)²] = Var(Xi) = σi²
Xi ⊥ Xj  =>  σij = 0
More on the covariance matrix: Σ is symmetric and positive semidefinite, so it admits the eigendecomposition
Σ = Φ Λ Φ^T = (Φ Λ^{1/2})(Φ Λ^{1/2})^T
where Φ is an orthonormal matrix whose columns are eigenvectors of Σ, and Λ is the diagonal matrix of eigenvalues.
Whitening Transform
Let X ~ N(μ, Σ) and Y = A^T X, so Y ~ N(A^T μ, A^T Σ A). Using Σ = (Φ Λ^{1/2})(Φ Λ^{1/2})^T, choose the whitening matrix
Aw = Φ Λ^{-1/2}
Then
Aw^T Σ Aw = (Φ Λ^{-1/2})^T (Φ Λ^{1/2})(Φ Λ^{1/2})^T (Φ Λ^{-1/2}) = I
so Aw^T X ~ N(Aw^T μ, I). (The linear transform Φ^T projects onto the eigenvectors and Λ^{-1/2} rescales; together they whiten.)
Whitening Transform
• The whitening transformation is a decorrelation method that converts the covariance matrix Σ of a set of samples into the identity matrix I.
• This effectively creates new random variables that are uncorrelated and each have unit variance.
• The method is called the whitening transform because it transforms the input closer towards white noise.
This can be expressed as Aw = Φ Λ^{-1/2}, where Φ is the matrix with the eigenvectors of Σ as its columns and Λ is the diagonal matrix of (non-increasing) eigenvalues.
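The construction Aw = Φ Λ^{-1/2} maps directly onto an eigendecomposition routine. A minimal NumPy sketch; the 2 × 2 correlated covariance matrix here is a made-up example.

```python
import numpy as np

def whitening_matrix(S):
    """A_w = Phi Lambda^{-1/2}, where S = Phi Lambda Phi^T; then A_w^T S A_w = I."""
    lam, Phi = np.linalg.eigh(S)       # eigenvalues (ascending) and eigenvector columns
    return Phi @ np.diag(lam ** -0.5)

# Hypothetical correlated covariance
S = np.array([[2.0, 0.8],
              [0.8, 1.0]])
Aw = whitening_matrix(S)
I = Aw.T @ S @ Aw                      # should be the identity
```

`eigh` is used rather than `eig` because Σ is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.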
White Noise
• White noise is a random signal (or process) with a flat power spectral density.
• In other words, the signal contains equal power within any fixed bandwidth at any center frequency.
Mahalanobis Distance
For X ~ N(μ, Σ), the density p(x) is constant on the ellipsoids
r² = (x - μ)T Σ^{-1} (x - μ)
where r is the Mahalanobis distance from x to μ; the size of the ellipsoid depends on the value of r².
In statistics, the Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936. It is based on correlations between variables, by which different patterns can be identified and analyzed, and it is a useful way of determining the similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant, i.e., not dependent on the scale of the measurements.
Formally, the Mahalanobis distance from a vector x to a group of values with mean μ and covariance matrix Σ is defined as:
D_M(x) = √((x - μ)T Σ^{-1} (x - μ))
Mahalanobis distance can also be defined as a dissimilarity measure between two random vectors x and y of the same distribution with covariance matrix Σ:
d(x, y) = √((x - y)T Σ^{-1} (x - y))
If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, the resulting distance measure is called the normalized Euclidean distance:
d(x, y) = √(Σ_i (xi - yi)² / σi²)
where σi is the standard deviation of xi over the sample set.
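The definition above translates directly to code. A minimal NumPy sketch; the two covariance matrices used below (identity, and a diagonal with σ1 = 3) are made-up examples chosen to show the reduction to Euclidean and normalized Euclidean distance.

```python
import numpy as np

def mahalanobis(x, mu, S):
    """sqrt((x-mu)^T S^{-1} (x-mu)); reduces to Euclidean distance when S = I."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(d @ np.linalg.solve(S, d)))

x, mu = [3.0, 0.0], [0.0, 0.0]
d_euclid = mahalanobis(x, mu, np.eye(2))            # identity covariance: Euclidean
d_scaled = mahalanobis(x, mu, np.diag([9.0, 1.0]))  # sigma_1 = 3 along the first axis
```

Note how a point three standard deviations away along a high-variance axis is only one Mahalanobis unit from the mean: the metric measures distance in units of spread.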
Normal Density
If the features are statistically independent and the variance is the same for all features, the discriminant function is simple and linear in nature. A classifier that uses linear discriminant functions is called a linear machine. The decision surfaces are pieces of hyperplanes defined by linear equations.
Minimum-Error-Rate Classification
Using gi(x) = P(ωi|x), or equivalently gi(x) = p(x|ωi)P(ωi) or gi(x) = ln p(x|ωi) + ln P(ωi), and assuming the measurements are normally distributed, Xi ~ N(μi, Σi):
p(x|ωi) = (1 / ((2π)^{d/2} |Σi|^{1/2})) exp(-(1/2)(x - μi)T Σi^{-1} (x - μi))
so
gi(x) = -(1/2)(x - μi)T Σi^{-1} (x - μi) - (d/2) ln 2π - (1/2) ln |Σi| + ln P(ωi)
Some Algebra to Simplify the Discriminants
Since gi(x) = ln p(x|ωi) + ln P(ωi), we take the natural logarithm of the first term:
ln p(x|ωi) = ln [ (1 / ((2π)^{d/2} |Σi|^{1/2})) exp(-(1/2)(x - μi)T Σi^{-1} (x - μi)) ]
           = -(1/2)(x - μi)T Σi^{-1} (x - μi) - ln ((2π)^{d/2} |Σi|^{1/2})
           = -(1/2)(x - μi)T Σi^{-1} (x - μi) - (d/2) ln 2π - (1/2) ln |Σi|
Minimum-Error-Rate Classification
gi(x) = -(1/2)(x - μi)T Σi^{-1} (x - μi) - (d/2) ln 2π - (1/2) ln |Σi| + ln P(ωi)
Three cases:
Case 1: Σi = σ²I. Classes are centered at different means, and their feature components are pairwise independent with the same variance.
Case 2: Σi = Σ. Classes are centered at different means, but have the same variation.
Case 3: Σi ≠ Σj. Arbitrary.
Case 1: Σi = σ²I
Here Σi = diag(σ², ..., σ²), so |Σi| = σ^{2d} (a constant, independent of i), Σi^{-1} = (1/σ²) I, and the term (d/2) ln 2π is also independent of i; both may be omitted. Then gi(x) reduces to
gi(x) = -||x - μi||² / (2σ²) + ln P(ωi)
      = -(1/(2σ²)) (xT x - 2 μiT x + μiT μi) + ln P(ωi)
Since xT x is independent of i, it too may be omitted:
gi(x) = (1/σ²) μiT x - (1/(2σ²)) μiT μi + ln P(ωi)
which is linear in x!
Case 1: Σi = σ²I (continued)
gi(x) = wiT x + wi0, with
wi = μi / σ²
wi0 = -(1/(2σ²)) μiT μi + ln P(ωi)
Case 1: Σi = σ²I. Boundary between ωi and ωj
Setting gi(x) = gj(x) gives a hyperplane wT(x - x0) = 0, where
w = μi - μj
x0 = (1/2)(μi + μj) - (σ² / ||μi - μj||²) ln [P(ωi)/P(ωj)] (μi - μj)
The decision boundary is a hyperplane perpendicular to the line between the means, passing through x0 (the midpoint when P(ωi) = P(ωj)).
Case 1: Σi = σ²I, equal priors
When P(ω1) = P(ω2), x0 = (1/2)(μ1 + μ2): the decision boundary sits halfway between the means, and the rule reduces to a minimum (Euclidean) distance classifier (template matching).
Case 1: Σi = σ²I, unequal priors
When P(ω1) ≠ P(ω2), x0 = (1/2)(μ1 + μ2) - (σ² / ||μ1 - μ2||²) ln [P(ω1)/P(ω2)] (μ1 - μ2).
Note how the priors shift the boundary away from the more likely mean!
Case 2: Σi = Σ
• Covariance matrices are arbitrary, but equal for all classes. Features then form hyper-ellipsoidal clusters of equal size and shape.
The terms (d/2) ln 2π and (1/2) ln |Σ| are now independent of i and may be dropped:
gi(x) = -(1/2)(x - μi)T Σ^{-1} (x - μi) + ln P(ωi)
The quadratic form is the squared Mahalanobis distance from x to μi. If P(ωi) = P(ωj) for all i, j, the prior term is also irrelevant, and we assign x to the class whose mean is nearest in Mahalanobis distance.
Case 2: Σi = Σ (continued)
Expanding the quadratic form, the term xT Σ^{-1} x is independent of i and may be dropped, leaving a linear discriminant gi(x) = wiT x + wi0 with
wi = Σ^{-1} μi
wi0 = -(1/2) μiT Σ^{-1} μi + ln P(ωi)
Case 2: Σi = Σ. Boundary between ωi and ωj
Setting gi(x) = gj(x) again gives a hyperplane wT(x - x0) = 0, where
w = Σ^{-1}(μi - μj)
x0 = (1/2)(μi + μj) - ( ln [P(ωi)/P(ωj)] / ((μi - μj)T Σ^{-1} (μi - μj)) ) (μi - μj)
These discriminant hyperplanes are often not orthogonal to the segments joining the class means.
Case 3: Σi ≠ Σj
The covariance matrices are different for each category. Only the (d/2) ln 2π term can be dropped:
gi(x) = -(1/2)(x - μi)T Σi^{-1} (x - μi) - (1/2) ln |Σi| + ln P(ωi)
This is quadratic in x: gi(x) = xT Wi x + wiT x + wi0, with
Wi = -(1/2) Σi^{-1}
wi = Σi^{-1} μi
wi0 = -(1/2) μiT Σi^{-1} μi - (1/2) ln |Σi| + ln P(ωi)
(In Cases 1 and 2 the xT Wi x term was common to all classes and dropped out.) In the two-class case, the decision boundaries are hyperquadrics, e.g., hyperplanes, hyperspheres, hyperellipsoids, hyperhyperboloids.
Case 3: Σi ≠ Σj
Non-simply connected decision regions can arise even in one dimension for Gaussians having unequal variance.
Example: A Problem
Exemplars (transposed):
For ω1: {(2, 6), (3, 4), (3, 8), (4, 6)}
For ω2: {(1, -2), (3, 0), (3, -4), (5, -2)}
Calculated means (transposed): m1 = (3, 6), m2 = (3, -2)
(Figure: scatter plot of Class 1 and Class 2 with means m1 and m2 in the (x1, x2) plane.)
Example: Covariance Matrices
Using Σi = (1/4) Σ_{k=1}^{4} (xk - mi)(xk - mi)T, the class-1 deviations from m1 = (3, 6) are (-1, 0), (0, -2), (0, 2), (1, 0), so
Σ1 = (1/4) ( [[1, 0], [0, 0]] + [[0, 0], [0, 4]] + [[0, 0], [0, 4]] + [[1, 0], [0, 0]] ) = (1/4) [[2, 0], [0, 8]] = [[1/2, 0], [0, 2]]
Example: Covariance Matrices (continued)
The class-2 deviations from m2 = (3, -2) are (-2, 0), (0, 2), (0, -2), (2, 0), so
Σ2 = (1/4) ( [[4, 0], [0, 0]] + [[0, 0], [0, 4]] + [[0, 0], [0, 4]] + [[4, 0], [0, 0]] ) = (1/4) [[8, 0], [0, 8]] = [[2, 0], [0, 2]]
Example: Inverse and Determinant for Each of the Covariance Matrices
Σ1 = [[1/2, 0], [0, 2]]:  Σ1^{-1} = [[2, 0], [0, 1/2]]  and  |Σ1| = 1
Σ2 = [[2, 0], [0, 2]]:   Σ2^{-1} = [[1/2, 0], [0, 1/2]]  and  |Σ2| = 4
Example: A Discriminant Function for Class 1
Since gi(x) = -(1/2)(x - mi)T Σi^{-1} (x - mi) - (1/2) ln |Σi| + ln P(ωi):
g1(x) = -(1/2) [ 2(x1 - 3)² + (1/2)(x2 - 6)² ] - (1/2) ln 1 + ln P(ω1)
      = -(x1 - 3)² - (1/4)(x2 - 6)² + ln P(ω1)
Example (continued)
Expanding:
g1(x) = -x1² + 6 x1 - (1/4) x2² + 3 x2 - 18 + ln P(ω1)
Example: A Discriminant Function for Class 2
Also,
g2(x) = -(1/2) [ (1/2)(x1 - 3)² + (1/2)(x2 + 2)² ] - (1/2) ln 4 + ln P(ω2)
      = -(1/4)(x1 - 3)² - (1/4)(x2 + 2)² - ln 2 + ln P(ω2)
Example (continued)
Hence,
g2(x) = -(1/4) x1² + (3/2) x1 - (1/4) x2² - x2 - 13/4 - ln 2 + ln P(ω2)
Example: The Class Boundary
Assuming P(ω1) = P(ω2) and setting g1(x) = g2(x):
-x1² + 6 x1 + 3 x2 - 18 = -(1/4) x1² + (3/2) x1 - x2 - 13/4 - ln 2
(the -(1/4) x2² terms cancel), which simplifies to
4 x2 = (3/4) x1² - (9/2) x1 + 59/4 - ln 2
Example: A Quadratic Separator
Dividing through by 4:
x2 = (3/16) x1² - (9/8) x1 + 59/16 - (ln 2)/4 ≈ 0.1875 x1² - 1.125 x1 + 3.514   (a quadratic!)
Example: Plot of the Discriminant
(Figure: the two classes, their means m1 and m2, and the quadratic discriminant curve in the (x1, x2) plane.)
Summary: Steps for Building a Bayesian Classifier
1. Collect class exemplars
2. Estimate class a priori probabilities
3. Estimate class means
4. Form covariance matrices; find the inverse and determinant for each
5. Form the discriminant function for each class
Using the Classifier
Obtain a measurement vector x. Evaluate the discriminant function gi(x) for each class i = 1, ..., c. Decide x is in class ωj if gj(x) > gi(x) for all i ≠ j.
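The steps above can be put together end-to-end on the worked example from these notes. A minimal NumPy sketch; equal priors are assumed, the (1/N) covariance normalization matches the example, and the boundary coefficients 0.1875, 1.125, 3.514 are the ones derived in the text.

```python
import numpy as np

# Step 1: collect class exemplars (from the worked example)
X1 = np.array([[2, 6], [3, 4], [3, 8], [4, 6]], dtype=float)
X2 = np.array([[1, -2], [3, 0], [3, -4], [5, -2]], dtype=float)

# Steps 3-4: estimate means and (1/N) covariance matrices
def fit_class(X):
    mu = X.mean(axis=0)
    D = X - mu
    return mu, D.T @ D / len(X)

# Step 5: form the discriminant g_i(x) for each class (equal priors assumed)
def make_g(mu, S, prior):
    Sinv = np.linalg.inv(S)
    return lambda x: float(-0.5 * (x - mu) @ Sinv @ (x - mu)
                           - 0.5 * np.log(np.linalg.det(S)) + np.log(prior))

g1 = make_g(*fit_class(X1), prior=0.5)
g2 = make_g(*fit_class(X2), prior=0.5)

def boundary_x2(x1):
    """Class boundary derived in the text: x2 = 0.1875 x1^2 - 1.125 x1 + 3.514."""
    return 0.1875 * x1 ** 2 - 1.125 * x1 + 3.514
```

Points just above the boundary curve should score higher under g1 and points just below it higher under g2, which ties the code back to the quadratic separator in the example.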
Minimum-Error-Rate Classification
If action αi is taken and the true state is ωj, then the decision is correct if i = j and in error if i ≠ j. The error rate (the probability of error) is to be minimized.
Symmetrical or zero-one loss function: λ(αi|ωj) = 0 if i = j; 1 if i ≠ j (i, j = 1, ..., c)
Conditional risk: R(αi|x) = Σ_{j≠i} P(ωj|x) = 1 - P(ωi|x)
Bayesian Decision Rule: Two-Category Classification
Decide ω1 if
p(x|ω1) / p(x|ω2) > [(λ12 - λ22) / (λ21 - λ11)] · [P(ω2) / P(ω1)]
(likelihood ratio > threshold). The minimax criterion deals with the case where the prior probabilities are unknown.
Basic Concept of Minimax
Consider the worst-case prior probabilities (the maximum loss) and pick the decision rule that minimizes the overall risk under them. That is, minimize the maximum possible overall risk, so that the worst risk for any value of the priors is as small as possible.
Overall Risk
R = ∫ R(α(x)|x) p(x) dx = ∫_{R1} R(α1|x) p(x) dx + ∫_{R2} R(α2|x) p(x) dx
With R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x) and R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x):
R = ∫_{R1} [λ11 P(ω1|x) + λ12 P(ω2|x)] p(x) dx + ∫_{R2} [λ21 P(ω1|x) + λ22 P(ω2|x)] p(x) dx
Overall Risk (continued)
Using P(ωi|x) p(x) = p(x|ωi) P(ωi):
R = ∫_{R1} [λ11 P(ω1) p(x|ω1) + λ12 P(ω2) p(x|ω2)] dx + ∫_{R2} [λ21 P(ω1) p(x|ω1) + λ22 P(ω2) p(x|ω2)] dx
Overall Risk (continued)
Substituting P(ω2) = 1 - P(ω1):
R = ∫_{R1} { λ11 P(ω1) p(x|ω1) + λ12 [1 - P(ω1)] p(x|ω2) } dx + ∫_{R2} { λ21 P(ω1) p(x|ω1) + λ22 [1 - P(ω1)] p(x|ω2) } dx
Overall Risk (continued)
Since ∫_{R1} p(x|ωi) dx + ∫_{R2} p(x|ωi) dx = 1, the risk can be rewritten as a linear function of P(ω1):
R = λ22 + (λ12 - λ22) ∫_{R1} p(x|ω2) dx
  + P(ω1) [ (λ11 - λ22) + (λ21 - λ11) ∫_{R2} p(x|ω1) dx - (λ12 - λ22) ∫_{R1} p(x|ω2) dx ]
Overall Risk (continued)
For a fixed decision boundary, this is the overall risk for a particular P(ω1): R(P(ω1)) = a P(ω1) + b, a straight line in P(ω1). The values of a and b depend on the setting of the decision boundary.
Overall Risk (continued)
For the minimax solution we choose the decision boundary so that the coefficient of P(ω1) is zero:
(λ11 - λ22) + (λ21 - λ11) ∫_{R2} p(x|ω1) dx - (λ12 - λ22) ∫_{R1} p(x|ω2) dx = 0
Then R = b = R_mm, the minimax risk, which is independent of the value of P(ω1).
139
Minimax Risk
Setting the coefficient of P(ω₁) to zero gives

\[
R_{mm} = \lambda_{22} + (\lambda_{12}-\lambda_{22})\int_{\mathcal{R}_1} p(\mathbf{x}\mid\omega_2)\,d\mathbf{x}
       = \lambda_{11} + (\lambda_{21}-\lambda_{11})\int_{\mathcal{R}_2} p(\mathbf{x}\mid\omega_1)\,d\mathbf{x}
\]
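The minimax boundary can be found numerically by searching for the threshold at which the coefficient of P(ω₁) vanishes. A sketch using bisection; the Gaussian densities and the asymmetric losses are illustrative assumptions, not values from the slides:

```python
from statistics import NormalDist

p1 = NormalDist(0.0, 1.0)                  # p(x|w1), illustrative
p2 = NormalDist(2.0, 1.0)                  # p(x|w2), illustrative
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0    # illustrative asymmetric losses

def prior_coefficient(x_star):
    """Coefficient of P(w1) in the overall risk; zero at the minimax boundary.
    Regions: R1 = (-inf, x*), R2 = [x*, inf)."""
    int_r2_p1 = 1 - p1.cdf(x_star)   # integral of p(x|w1) over R2
    int_r1_p2 = p2.cdf(x_star)       # integral of p(x|w2) over R1
    return (l11 - l22) + (l21 - l11) * int_r2_p1 - (l12 - l22) * int_r1_p2

# The coefficient decreases monotonically in x* here, so bisect for its root
lo, hi = -5.0, 7.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if prior_coefficient(mid) > 0:
        lo = mid
    else:
        hi = mid
x_mm = 0.5 * (lo + hi)

# Minimax risk: the constant term evaluated at the minimax boundary
r_mm = l22 + (l12 - l22) * p2.cdf(x_mm)
print(round(x_mm, 3), round(r_mm, 3))
```

With λ₁₂ > λ₂₁, the boundary lands below the midpoint of the two means: the costlier error (deciding ω₁ when ω₂ is true) gets the smaller region.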
140
Error Probability
With the 0/1 loss function (λ₁₁ = λ₂₂ = 0, λ₁₂ = λ₂₁ = 1), the overall risk becomes the error probability:

\[
P(\text{error}) = \int_{\mathcal{R}_1} p(\mathbf{x}\mid\omega_2)\,d\mathbf{x}
  + P(\omega_1)\Big[\int_{\mathcal{R}_2} p(\mathbf{x}\mid\omega_1)\,d\mathbf{x}
  - \int_{\mathcal{R}_1} p(\mathbf{x}\mid\omega_2)\,d\mathbf{x}\Big]
\]
141
Minimax Error-Probability
For the minimax solution under 0/1 loss, the coefficient of P(ω₁) vanishes, so the two conditional error probabilities are equal, P(ω₁|ω₂) = P(ω₂|ω₁), and

\[
P_{mm}(\text{error}) = \int_{\mathcal{R}_1} p(\mathbf{x}\mid\omega_2)\,d\mathbf{x}
                     = \int_{\mathcal{R}_2} p(\mathbf{x}\mid\omega_1)\,d\mathbf{x}
\]
142
Minimax Error-Probability
[Figure: the two class-conditional densities p(x|ω₁) and p(x|ω₂) with decision regions R₁ and R₂; the minimax boundary is placed so that the two error areas are equal.]

\[
P_{mm}(\text{error}) = \int_{\mathcal{R}_1} p(\mathbf{x}\mid\omega_2)\,d\mathbf{x}
                     = \int_{\mathcal{R}_2} p(\mathbf{x}\mid\omega_1)\,d\mathbf{x}
\]
143
Minimax Error-Probability
\[
P(\text{error}) = \int_{\mathcal{R}_1} p(\mathbf{x}\mid\omega_2)\,d\mathbf{x}
  + P(\omega_1)\Big[\int_{\mathcal{R}_2} p(\mathbf{x}\mid\omega_1)\,d\mathbf{x}
  - \int_{\mathcal{R}_1} p(\mathbf{x}\mid\omega_2)\,d\mathbf{x}\Big]
\]
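Under 0/1 loss, the minimax boundary is the threshold that equalizes the two conditional error probabilities. A sketch assuming equal-variance Gaussian class conditionals (illustrative parameters), where by symmetry the answer is the midpoint of the two means:

```python
from statistics import NormalDist

p1 = NormalDist(0.0, 1.0)   # p(x|w1): noise only (illustrative)
p2 = NormalDist(2.0, 1.0)   # p(x|w2): noise + signal (illustrative)

def error_gap(x_star):
    """int_R1 p(x|w2) dx - int_R2 p(x|w1) dx, with R1 = (-inf, x*).
    The minimax (0/1 loss) boundary makes this zero."""
    return p2.cdf(x_star) - (1 - p1.cdf(x_star))

# error_gap increases monotonically in x*, so bisect for its root
lo, hi = -5.0, 7.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if error_gap(mid) < 0:
        lo = mid
    else:
        hi = mid
x_mm = 0.5 * (lo + hi)
print(round(x_mm, 6))            # -> 1.0 (midpoint: equal variances)
print(round(p2.cdf(x_mm), 4))    # -> 0.1587, the minimax error probability
```

For unequal variances or multimodal densities, the same bisection applies but the equal-error threshold is no longer the midpoint.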
145
Bayesian Decision Rule: Two-Category Classification

Decide ω₁ if

\[
\frac{p(\mathbf{x}\mid\omega_1)}{p(\mathbf{x}\mid\omega_2)} > \frac{(\lambda_{12}-\lambda_{22})\,P(\omega_2)}{(\lambda_{21}-\lambda_{11})\,P(\omega_1)}
\]

(likelihood ratio on the left, a fixed threshold on the right)
The Neyman-Pearson criterion deals with the case where both the loss functions and the prior probabilities are unknown.
146
Signal Detection Theory

Signal detection theory evolved from the development of communications and radar equipment in the first half of the twentieth century.

It migrated to psychology, initially as part of sensation and perception research, in the 1950s and 1960s, as an attempt to explain features of human behavior in detecting very faint stimuli that traditional threshold theories could not account for.
147
The situation of interest

A person is faced with a stimulus (signal) that is very faint or confusing.

The person must make a decision: is the signal there or not?

What makes this situation confusing and difficult is the presence of other activity that is similar to the signal. Let us call this activity noise.
148
Example
Noise is present both in the environment and in the sensory system of the observer.
The observer reacts to the momentary total activation of the sensory system, which fluctuates from moment to moment, as well as responding to environmental stimuli, which may include a signal.
149
Signal Detection Theory
Can we measure the discriminability of the problem? Can we do this independently of the threshold x*?

Suppose we want to detect a single pulse in a signal, and assume the signal is corrupted by random noise.

When the signal is present we observe a normal distribution with mean μ₂; when the signal is absent we observe a normal distribution with mean μ₁. We assume both have the same standard deviation σ.

Discriminability:

\[
d' = \frac{|\mu_2 - \mu_1|}{\sigma}
\]
150
Example

A radiologist is examining a CT scan, looking for evidence of a tumor. This is a hard job, because there is always some uncertainty.

There are four possible outcomes:
 hit (tumor present and doctor says "yes")
 miss (tumor present and doctor says "no")
 false alarm (tumor absent and doctor says "yes")
 correct rejection (tumor absent and doctor says "no")
Two types of Error
151
The Four Cases

                          Signal (tumor)
                     Absent (ω₁)            Present (ω₂)
Decision  No (α₁)    Correct rejection      Miss
                     P(α₁|ω₁)               P(α₁|ω₂)
          Yes (α₂)   False alarm            Hit
                     P(α₂|ω₁)               P(α₂|ω₂)

Signal detection theory was developed to help us understand how a continuous and ambiguous signal can lead to a binary yes/no decision.
152

Decision Making

[Figure: two overlapping normal densities, "Noise" (ω₁) and "Noise + Signal" (ω₂), whose means are separated by d′. A criterion (decision threshold), set according to expectancy (decision bias), splits the axis into "No (α₁)" and "Yes (α₂)" responses; the area of the ω₂ density beyond the criterion is the hit rate P(α₂|ω₂), and the area of the ω₁ density beyond it is the false-alarm rate P(α₂|ω₁).]

Discriminability:

\[
d' = \frac{|\mu_2 - \mu_1|}{\sigma}
\]
154
Signal Detection Theory
How do we find d′ if we do not know μ₁, μ₂, or x*?

From the data we can compute:

1. P(x > x* | ω₂): a hit.
2. P(x > x* | ω₁): a false alarm.
3. P(x < x* | ω₂): a miss.
4. P(x < x* | ω₁): a correct rejection.

If we plot the hit rate against the false-alarm rate as the criterion varies, we obtain a ROC (receiver operating characteristic) curve. With it we can distinguish discriminability from bias.
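In the equal-variance Gaussian model, d′ can be recovered from the observed hit and false-alarm rates through the inverse normal CDF, d′ = z(hit) − z(false alarm); this is the standard signal-detection estimate. A sketch using Python's stdlib NormalDist; the rates below are illustrative:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf   # inverse of the standard normal CDF

def d_prime(hit_rate, false_alarm_rate):
    """Equal-variance signal-detection estimate: d' = z(hit) - z(false alarm)."""
    return z(hit_rate) - z(false_alarm_rate)

def criterion_c(hit_rate, false_alarm_rate):
    """Response bias c; zero for an unbiased criterion."""
    return -0.5 * (z(hit_rate) + z(false_alarm_rate))

# Illustrative rates for an observer with d' = 2 and an unbiased criterion
# halfway between the distributions: hit = P(z > -1), false alarm = P(z > 1)
hit = 1 - NormalDist().cdf(-1)
fa = 1 - NormalDist().cdf(1)
print(round(d_prime(hit, fa), 6))   # -> 2.0
print(criterion_c(hit, fa))         # near zero: unbiased criterion
```

Each (false alarm, hit) pair is one point on the ROC curve; sweeping the criterion traces the whole curve, while d′ stays fixed, which is exactly the separation of discriminability from bias.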