3rd NOSE Short Course, Alpbach, 21st – 26th Mar 2004
Statistical classifiers: Bayesian decision theory and density estimation
Ricardo Gutierrez-Osuna
Department of Computer Science, Texas A&M University
[email protected] http://research.cs.tamu.edu/prism
Outline
o Chapter 1: Review of pattern classification
o Chapter 2: Review of probability theory
o Chapter 3: Bayesian Decision Theory
o Chapter 4: Quadratic classifiers
o Chapter 5: Kernel density estimation
o Chapter 6: Nearest neighbors
o Chapter 7: Perceptron and least-squares classifiers
CHAPTER 1: Review of pattern classification
o Features and patterns
Features and patterns (1)
o Feature
• A feature is any distinctive aspect, quality, or characteristic
• Features may be symbolic (e.g., color) or numeric (e.g., height)
• Feature vector: the combination of d features is represented as a d-dimensional column vector
• Feature space: the d-dimensional space defined by the feature vector
• Scatter plot: representation of an object collection in feature space
[Figure: a feature vector x = [x1 x2 … xd]ᵀ, the feature space (Feature 1 vs Feature 2), and a scatter plot of three classes]
Features and patterns (2)
o Pattern
• A pattern is a composite of traits or features characteristic of an individual
• In classification tasks, a pattern is a pair of variables {x, ω} where
• x is a collection of observations or features (feature vector)
• ω is the concept behind the observation (label)
Features and patterns (3)
o What makes a “good” feature vector?
• The quality of a feature vector is related to its ability to discriminate examples from different classes
• Examples from the same class should have similar feature values
• Examples from different classes should have different feature values
[Figure: “good” vs “bad” features; more feature properties: linear separability, non-linear separability, highly correlated features, multi-modal]
Classifiers
o The task of a classifier is to partition feature space into class-labeled decision regions
• Borders between decision regions are called decision boundaries
• The classification of a feature vector x consists of determining which decision region it belongs to, and assigning x to this class
o In this lecture we will overview two methodologies for designing classifiers
• Based on the underlying probability density functions of the data
• Based on geometric pattern-separability criteria
[Figure: examples of partitions of feature space into decision regions R1…R4]
CHAPTER 2: Review of probability theory
o What is a probability
o Probability density functions
o Conditional probability
o Bayes theorem
o Probabilistic reasoning: a case example
Basic probability concepts
• Probabilities are numbers assigned to events that indicate “how likely” it is that the event will occur when a random experiment is performed
• A probability law for a random experiment is a rule that assigns probabilities to the events in the experiment
• The sample space S of a random experiment is the set of all possible outcomes
[Figure: a probability law maps events A1…A4 of the sample space S to probabilities]
Conditional probability (1)
o If A and B are two events, the probability of event A when we already know that event B has occurred is defined by the relation
P[A|B] = P[A∩B] / P[B], for P[B] > 0
o This conditional probability P[A|B] is read:
• the “conditional probability of A conditioned on B”, or simply
• the “probability of A given B”
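The definition P[A|B] = P[A∩B]/P[B] can be exercised on a toy sample; a minimal sketch (the outcome list and variable names below are made up for illustration):

```python
# Conditional probability from counts: P[A|B] = P[A and B] / P[B].
# The outcome list below is a made-up sample for illustration.
outcomes = [("A", "B"), ("A", "notB"), ("notA", "B"),
            ("A", "B"), ("notA", "notB"), ("notA", "B")]
n = len(outcomes)
p_b = sum(1 for a, b in outcomes if b == "B") / n
p_ab = sum(1 for a, b in outcomes if a == "A" and b == "B") / n
p_a_given_b = p_ab / p_b
print(p_a_given_b)  # 0.5: two of the four B outcomes also contain A
```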
Conditional probability (2)
o Interpretation
• The new evidence “B has occurred” has the following effects
• The original sample space S (the whole square) becomes B (the rightmost circle)
• The event A becomes A∩B
• P[B] simply re-normalizes the probability of events that occur jointly with B
[Figure: Venn diagrams of S, A, B and A∩B before and after “B has occurred”]
Theorem of total probability
o Let B1, B2, …, BN be a partition of S, i.e., a set of mutually exclusive events such that S = B1 ∪ B2 ∪ … ∪ BN
• Any event A can then be represented as:
A = A∩S = A∩(B1 ∪ B2 ∪ … ∪ BN) = (A∩B1) ∪ (A∩B2) ∪ … ∪ (A∩BN)
• Since B1, B2, …, BN are mutually exclusive then, by Axiom III:
P[A] = P[A∩B1] + P[A∩B2] + … + P[A∩BN]
• and, therefore
P[A] = P[A|B1]·P[B1] + … + P[A|BN]·P[BN] = Σk=1..N P[A|Bk]·P[Bk]
[Figure: a partition B1…BN of S, with event A overlapping several Bk]
Bayes theorem
o Given {B1, B2, …, BN}, a partition of the sample space S. Suppose that event A occurs; what is the probability of event Bj?
• Using the definition of conditional probability and the theorem of total probability we obtain
• This is known as Bayes theorem or Bayes rule, and is (one of) the most useful relations in probability and statistics
P[Bj|A] = P[A∩Bj] / P[A] = P[A|Bj]·P[Bj] / Σk=1..N P[A|Bk]·P[Bk]
Applying Bayes theorem (1)
o Consider a clinical problem where we need to decide if a patient has a particular medical condition on the basis of an imperfect test:
• Someone with the condition may go undetected (false-negative)
• Someone free of the condition may yield a positive result (false-positive)
o Nomenclature• SPECIFICITY: The true-negative rate P(NEG|¬COND) of a test • SENSITIVITY: The true-positive rate P(POS|COND) of a test
Applying Bayes theorem (2)
o PROBLEM
• Assume a population of 10,000 where 1 out of every 100 people has the medical condition
• Assume that we design a test with 98% specificity P(NEG|¬COND) and 90% sensitivity P(POS|COND)
o Assume you take the test, and it yields a POSITIVE result
o What is the probability that you have the medical condition?
Applying Bayes theorem (3)
o SOLUTION A: Fill in the joint frequency table below
• The answer is the ratio of individuals with the condition to total individuals (considering only individuals that tested positive), or 90/288 = 0.3125

                     TEST IS POSITIVE           TEST IS NEGATIVE           ROW TOTAL
HAS CONDITION        True-positive              False-negative             100
                     P(POS|COND)                P(NEG|COND)
                     100×0.90 = 90              100×(1−0.90) = 10
FREE OF CONDITION    False-positive             True-negative              9,900
                     P(POS|¬COND)               P(NEG|¬COND)
                     9,900×(1−0.98) = 198       9,900×0.98 = 9,702
COLUMN TOTAL         288                        9,712                      10,000
Applying Bayes theorem (4)
o SOLUTION B: Apply Bayes theorem

P[COND|POS] = P[POS|COND]·P[COND] / P[POS]
= P[POS|COND]·P[COND] / (P[POS|COND]·P[COND] + P[POS|¬COND]·P[¬COND])
= (0.90·0.01) / (0.90·0.01 + (1−0.98)·0.99)
= 0.3125
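The same computation can be checked in a few lines of Python (a minimal sketch; the variable names are ours):

```python
# Bayes rule for the diagnostic test on the slides:
# prevalence 1%, sensitivity 90%, specificity 98%.
p_cond = 0.01
sensitivity = 0.90   # P(POS|COND)
specificity = 0.98   # P(NEG|not COND)

# Theorem of total probability: P(POS)
p_pos = sensitivity * p_cond + (1 - specificity) * (1 - p_cond)
# Bayes theorem: P(COND|POS)
p_cond_given_pos = sensitivity * p_cond / p_pos
print(round(p_cond_given_pos, 4))  # 0.3125
```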
Bayes theorem and pattern classification
o For the purpose of pattern classification, Bayes theorem is normally expressed as
• where ωj is the jth class and x is the feature vector
o Bayes theorem is relevant because, as we will see in a minute, a sensible classification rule is to choose the class ωi with the highest P[ωi|x]
• This represents the intuitive rationale of choosing the class that is more “likely” given the observed feature vector x
P[ωj|x] = P[x|ωj]·P[ωj] / Σk=1..N P[x|ωk]·P[ωk] = P[x|ωj]·P[ωj] / P[x]
Bayes theorem and pattern classification
o Each term in Bayes theorem has a special name, which you should become familiar with
• Prior probability P[ωj] (of class ωj)
• Posterior probability P[ωj|x] (of class ωj given the observation x)
• Likelihood P[x|ωj] (conditional probability of observation x given class ωj)
• Normalization constant P[x] (does not affect the decision)
CHAPTER 3: Bayesian Decision Theory
o The Likelihood Ratio Test
o The Probability of Error
o The Bayes Risk
o Bayes, MAP and ML Criteria
o Multi-class problems
o Discriminant Functions
The Likelihood Ratio Test (1)
o Assume we are to classify an object based on the evidence provided by a measurement (or feature vector) x
o Would you agree that a reasonable decision rule would be the following?
• "Choose the class that is most ‘probable’ given the observed feature vector x”
• More formally: Evaluate the posterior probability of each class P(ωi|x) and choose the class with largest P(ωi|x)
The Likelihood Ratio Test (2)
o Let us examine this decision rule for a 2-class problem
• In this case the decision rule becomes
if P(ω1|x) > P(ω2|x) choose ω1, else choose ω2
• Or, in a more compact form (≷ means: choose ω1 if >, ω2 if <)
P(ω1|x) ≷ P(ω2|x)
• Applying Bayes theorem
P(x|ω1)·P(ω1)/P(x) ≷ P(x|ω2)·P(ω2)/P(x)
The Likelihood Ratio Test (3)
• P(x) does not affect the decision rule, so it can be eliminated*. Rearranging the previous expression
Λ(x) = P(x|ω1)/P(x|ω2) ≷ P(ω2)/P(ω1)
• The term Λ(x) is called the likelihood ratio, and the decision rule is known as the likelihood ratio test
*P(x) can be disregarded in the decision rule since it is constant regardless of class ωi. However, P(x) will be needed if we want to estimate the posterior P(ωi|x) which, unlike P(x|ωi)·P(ωi), is a true probability value and, therefore, gives us an estimate of the “goodness” of our decision.
Likelihood Ratio Test: an example (1)
o Given a classification problem with the following class conditional densities:
o Derive a classification rule based on the Likelihood Ratio Test (assume equal priors)
P(x|ω1) = (1/√(2π))·exp(−(x−4)²/2)
P(x|ω2) = (1/√(2π))·exp(−(x−10)²/2)
[Figure: the two likelihoods P(x|ω1) and P(x|ω2), centered at x = 4 and x = 10]
Likelihood Ratio Test: an example (2)
o Solution
• Substituting the given likelihoods and priors into the LRT expression:
Λ(x) = [(1/√(2π))·exp(−(x−4)²/2)] / [(1/√(2π))·exp(−(x−10)²/2)] ≷ 1
• Simplifying, changing signs and taking logs:
(x−4)² − (x−10)² ≶ 0
• Which yields:
x ≶ 7   (choose ω1 if x < 7, ω2 otherwise)
• This LRT result makes intuitive sense since the likelihoods are identical and differ only in their mean value
[Figure: decision regions R1 (say ω1, x < 7) and R2 (say ω2, x > 7) over the two likelihoods]
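The resulting rule (choose ω1 when x < 7) can be verified numerically; a minimal sketch with our own function and label names:

```python
import math

# LRT for the example: P(x|w1) = N(4, 1), P(x|w2) = N(10, 1), equal priors.
def likelihood(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def decide(x):
    lam = likelihood(x, 4) / likelihood(x, 10)
    return "w1" if lam > 1 else "w2"   # LRT with unit threshold

print(decide(5), decide(9))  # w1 w2 -- the rule reduces to x < 7 for w1
```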
The probability of error
o Prob. of error is “the probability of assigning x to the wrong class”
• For a two-class problem, P[error|x] is simply
• It makes sense that the classification rule be designed to minimize the average prob. of error P[error] across all possible values of x
• To minimize P(error) we minimize the integrand P(error|x) at each x: choose the class with maximum posterior P(ωi|x)
• This is called the MAXIMUM A POSTERIORI (MAP) RULE
P(error|x) = P(ω1|x) if we decide ω2; P(ω2|x) if we decide ω1
P(error) = ∫ P(error, x) dx = ∫ P(error|x)·P(x) dx
Minimizing probability of error
o We “prove” the optimality of the MAP rule graphically
• The right plot shows the posterior for each of the two classes
• The bottom plots show the P(error) for the MAP rule and an alternative decision rule
• Which one has lower P(error) (color-filled area)?
[Figure: posteriors P(ωi|x) with the decision regions of the MAP rule vs an alternative rule; P(error) is the shaded area]
The Bayes Risk (1)
o So far we have assumed that the penalty of misclassifying a class ω1 example as class ω2 is the same as that of the converse error
o In general, this is not the case:
• For example, misclassifying a cancer sufferer as a healthy patient is a much more serious problem than the other way around
• Misclassifying salmon as sea bass has a lower cost (unhappy customers) than the opposite error
o This concept can be formalized in terms of a cost function Cij
• Cij represents the cost of choosing class ωi when class ωj is the true class
o We define the Bayes Risk as the expected value of the cost
ℜ = E[C] = Σi=1..2 Σj=1..2 Cij·P[choose ωi and x∈ωj] = Σi=1..2 Σj=1..2 Cij·P[x∈Ri|ωj]·P[ωj]
The Bayes Risk (2)
o What is the decision rule that minimizes the Bayes Risk?
• It can be shown* that the minimum risk can be achieved by using the following decision rule:
• *For an intuitive proof, visit my lecture notes at TAMU
o Notice any similarities with the LRT?
P(x|ω1)/P(x|ω2) ≷ [(C12−C22)/(C21−C11)] · P[ω2]/P[ω1]
The Bayes Risk: an example (1)
o Consider a classification problem with two classes defined by the following likelihood functions
o What is the decision rule that minimizes the Bayes Risk?
• Assume P[ω1] = P[ω2] = 0.5, C11 = C22 = 0, C12 = 1 and C21 = √3
P(x|ω1) = (1/√(3·2π))·exp(−x²/(2·3))
P(x|ω2) = (1/√(2π))·exp(−(x−2)²/2)
[Figure: the two likelihood functions plotted over x ∈ [−6, 6]]
The Bayes Risk: an example (2)
Λ(x) = [(1/√(3·2π))·e^(−x²/6)] / [(1/√(2π))·e^(−(x−2)²/2)] ≷ [(C12−C22)/(C21−C11)]·P[ω2]/P[ω1] = 1/√3
⇒ e^(−x²/6 + (x−2)²/2) ≷ 1
⇒ −x²/6 + (x−2)²/2 ≷ 0
⇒ 2x² − 12x + 12 ≷ 0 ⇒ x = 3 ± √3 = 4.73, 1.27
[Figure: the two likelihoods with the resulting decision regions R1 (x < 1.27), R2 (1.27 < x < 4.73), R1 (x > 4.73)]
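The boundaries and the region pattern R1-R2-R1 can be checked numerically; a minimal sketch (function names are ours):

```python
import math

# Risk-minimizing LRT for P(x|w1) = N(0, var=3), P(x|w2) = N(2, var=1),
# equal priors, C11 = C22 = 0, C12 = 1, C21 = sqrt(3): threshold 1/sqrt(3).
def normal(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def decide(x):
    lam = normal(x, 0, 3) / normal(x, 2, 1)
    return "w1" if lam > 1 / math.sqrt(3) else "w2"

# Roots of the quadratic 2x^2 - 12x + 12 = 0 give the boundaries 3 +/- sqrt(3)
lo, hi = 3 - math.sqrt(3), 3 + math.sqrt(3)
print(round(lo, 2), round(hi, 2))        # 1.27 4.73
print(decide(0), decide(2), decide(6))   # w1 w2 w1
```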
Variations of the LRT
o The LRT that minimizes the Bayes Risk is called the Bayes Criterion
Λ(x) = P(x|ω1)/P(x|ω2) ≷ [(C12−C22)/(C21−C11)]·P[ω2]/P[ω1]
o Many times we will simply be interested in minimizing P[error], which is a special case of the Bayes Criterion if we use a zero-one cost function (Cij = 0 for i = j, 1 for i ≠ j)
Λ(x) = P(x|ω1)/P(x|ω2) ≷ P(ω2)/P(ω1) ⇔ P(ω1|x) ≷ P(ω2|x)
• This version of the LRT is referred to as the Maximum A Posteriori (MAP) Criterion, since it seeks to maximize the posterior P(ωi|x)
o Finally, for the case of a zero-one cost function and equal priors P(ωi) = 1/C, the LRT is called the Maximum Likelihood (ML) Criterion, since it will maximize the likelihood P(x|ωi)
Λ(x) = P(x|ω1)/P(x|ω2) ≷ 1
Multi-class problems
o The previous decision rules were derived for two-class problems, but generalize gracefully to multiple classes:
• To minimize P[error], choose the class ωi with the highest posterior P[ωi|x]:
ωi = argmax(1≤i≤C) P(ωi|x)
• To minimize the Bayes risk, choose the class ωi with the lowest conditional risk ℜ[ωi|x]:
ωi = argmin(1≤i≤C) ℜ(ωi|x) = argmin(1≤i≤C) Σj=1..C Cij·P(ωj|x)
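Both multi-class rules can be sketched in a few lines; the posteriors and cost matrix below are made-up values chosen so the two rules disagree:

```python
# Multi-class rules: MAP picks argmax posterior; minimum-risk picks the
# class with lowest expected cost. Posteriors and costs are made up here.
posteriors = [0.2, 0.5, 0.3]          # P(w_i|x), i = 0..2
C = [[0, 1, 1],
     [2, 0, 2],
     [1, 1, 0]]                       # C[i][j]: cost of choosing w_i when w_j is true

map_class = max(range(3), key=lambda i: posteriors[i])
risk = [sum(C[i][j] * posteriors[j] for j in range(3)) for i in range(3)]
risk_class = min(range(3), key=lambda i: risk[i])
print(map_class, risk_class)  # 1 2 -- the cost matrix shifts the decision
```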
Discriminant functions (1)
o Note that all the decision rules have the same structure
• At each point x in feature space, choose the class ωi which maximizes (or minimizes) some measure gi(x)
• This structure can be formalized with a set of discriminant functions gi(x), i = 1..C, and the following decision rule:
“assign x to class ωi if gi(x) > gj(x) ∀j ≠ i”
• We can then express the three basic decision rules (Bayes, MAP and ML) in terms of discriminant functions:

Criterion   Discriminant Function
Bayes       gi(x) = −ℜ(ωi|x)
MAP         gi(x) = P(ωi|x)
ML          gi(x) = P(x|ωi)
Discriminant functions (2)
o Therefore, we can visualize the decision rule as a network that computes C discriminant functions and selects the category corresponding to the largest discriminant
[Figure: network with features x1…xd feeding discriminant functions g1(x)…gC(x) (and costs), followed by a “select max” stage that outputs the class assignment]
Recapping…
o The LRT is a theoretical result that can only be applied if we have complete knowledge of the likelihoods P[x|ωi]
• P[x|ωi] is generally unknown, but can be estimated from data
• If the form of the likelihood is known (e.g., Gaussian), the problem is simplified because we only need to estimate the parameters of the model (e.g., mean and covariance)
• This leads to a classifier known as QUADRATIC, which we cover next
• If the form of the likelihood is unknown, the problem becomes much harder, and requires a technique known as non-parametric density estimation
• This technique is covered in the final chapters of this lecture
CHAPTER 4: Quadratic classifiers
o Bayes classifiers for normally distributed classes
o The Euclidean-distance classifier
o The Mahalanobis-distance classifier
o Numerical example
The Normal or Gaussian distribution
o Remember that the univariate Normal distribution N(µ,σ) is
f_X(x) = (1/(√(2π)·σ)) · exp(−(1/2)·((x−µ)/σ)²)
o Similarly, the multivariate Normal distribution N(µ,Σ) is defined as
f_X(x) = (1/((2π)^(n/2)·|Σ|^(1/2))) · exp(−(1/2)·(x−µ)ᵀΣ⁻¹(x−µ))
o Gaussian pdfs are very popular since
• The parameters (µ,Σ) are sufficient to uniquely characterize the pdf
• If the xi's are mutually uncorrelated (cik=0), then they are also independent
• The covariance matrix then becomes diagonal, with the individual variances on the main diagonal
[Figure: univariate Gaussians for (µ=2, σ=3) and (µ=6, σ=1); contour plots of bivariate Gaussians in (x1, x2)]
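The multivariate pdf can be evaluated directly; a minimal sketch restricted to a diagonal covariance (our simplification, to avoid matrix inversion):

```python
import math

# Multivariate Normal pdf for a DIAGONAL covariance (a simplification we
# make here; the general case needs Sigma^-1 and |Sigma|).
def gaussian_pdf(x, mu, var):
    det = 1.0
    quad = 0.0
    for xi, mi, vi in zip(x, mu, var):
        det *= vi
        quad += (xi - mi) ** 2 / vi
    return math.exp(-0.5 * quad) / ((2 * math.pi) ** (len(x) / 2) * math.sqrt(det))

# 1D sanity check: N(0,1) at its mean is 1/sqrt(2*pi) ~ 0.3989
print(round(gaussian_pdf([0.0], [0.0], [1.0]), 4))
```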
Covariance matrix
o The covariance matrix indicates the tendency of each pair of features (dimensions in a random vector) to vary together, i.e., to co-vary*
o The covariance has several important properties
• If xi and xk tend to increase together, then cik > 0
• If xi tends to decrease when xk increases, then cik < 0
• If xi and xk are uncorrelated, then cik = 0
• |cik| ≤ σiσk, where σi is the standard deviation of xi
• cii = σi² = VAR(xi)
o The covariance terms can be expressed as cii = σi² and cik = ρik·σi·σk, where ρik is called the correlation coefficient
[Figure: scatter plots of (xi, xk) for correlation coefficients ρik = −1, −½, 0, +½, +1; from http://www.engr.sjsu.edu/~knapp/HCIRODPR/PR_home.htm]
Bayes classifier for Gaussian classes (1)
o For Normally distributed classes, the DFs can be reduced to very simple expressions
• The (multivariate) Gaussian density can be defined as
p(x) = (1/((2π)^(n/2)·|Σ|^(1/2))) · exp(−(1/2)·(x−µ)ᵀΣ⁻¹(x−µ))
• Using Bayes rule, the MAP DF can be written as
gi(x) = P(ωi|x) = P(x|ωi)·P(ωi)/P(x) = (1/((2π)^(n/2)·|Σi|^(1/2))) · exp(−(1/2)·(x−µi)ᵀΣi⁻¹(x−µi)) · P(ωi)/P(x)
Bayes classifier for Gaussian classes (2)
• Eliminating constant terms
gi(x) = |Σi|^(−1/2) · exp(−(1/2)·(x−µi)ᵀΣi⁻¹(x−µi)) · P(ωi)
• Taking logs
gi(x) = −(1/2)·(x−µi)ᵀΣi⁻¹(x−µi) − (1/2)·log|Σi| + log P(ωi)
• This is known as a QUADRATIC discriminant function (because it is a function of the square of x)
• In the next few slides we will analyze what happens to this expression under different assumptions for the covariance
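The quadratic discriminant above can be sketched directly; for brevity this illustration assumes diagonal covariances (our simplification), and the two classes are made up:

```python
import math

# Quadratic discriminant g_i(x) = -1/2 (x-mu_i)' Sigma_i^-1 (x-mu_i)
# - 1/2 log|Sigma_i| + log P(w_i), restricted to diagonal Sigma_i.
def g(x, mu, var, prior):
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    logdet = sum(math.log(vi) for vi in var)
    return -0.5 * quad - 0.5 * logdet + math.log(prior)

# Two made-up Gaussian classes with equal priors
classes = [((-1.0, 0.0), (1.0, 1.0), 0.5),
           (( 2.0, 0.0), (1.0, 1.0), 0.5)]
x = (1.5, 0.0)
scores = [g(x, mu, var, p) for mu, var, p in classes]
print(scores.index(max(scores)))  # 1 -- x is closer to the second mean
```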
Case 1: Σi=σ²I (1)
o This situation occurs when the features are statistically independent and have the same variance for all classes
• In this case, the quadratic discriminant function becomes
gi(x) = −(1/2)·(x−µi)ᵀ(σ²I)⁻¹(x−µi) − (1/2)·log|σ²I| + log P(ωi) = −(1/(2σ²))·(x−µi)ᵀ(x−µi) + log P(ωi)
• Assuming equal priors and dropping constant terms
gi(x) = −(x−µi)ᵀ(x−µi) = −Σd=1..DIM (xd − µi,d)²
• This is called a Euclidean-distance or nearest-mean classifier
From [Schalkoff, 1992]
Case 1: Σi=σ²I (2)
o This is probably the simplest statistical classifier that you can build:
• “Assign an unknown example to the class whose center is the closest using the Euclidean distance”
• How valid is the assumption Σi=σ²I in chemical sensor arrays?
[Figure: block diagram of the minimum-distance classifier: Euclidean distances from x to µ1…µC feed a minimum selector that outputs the class]
Case 1: Σi=σ²I, example
µ1 = [3 2]ᵀ, µ2 = [7 4]ᵀ, µ3 = [2 5]ᵀ
Σ1 = Σ2 = Σ3 = [2 0; 0 2]
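A nearest-mean classifier for this example can be sketched as follows (the means are as reconstructed above; labels and names are ours):

```python
# Nearest-mean (Euclidean-distance) classifier for the example means
# mu1 = [3 2], mu2 = [7 4], mu3 = [2 5]; equal spherical covariances.
means = {"w1": (3.0, 2.0), "w2": (7.0, 4.0), "w3": (2.0, 5.0)}

def classify(x):
    d2 = {c: (x[0] - m[0]) ** 2 + (x[1] - m[1]) ** 2 for c, m in means.items()}
    return min(d2, key=d2.get)   # smallest squared Euclidean distance

print(classify((6.0, 5.0)))  # w2
```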
Case 2: Σi=Σ (Σ non-diagonal)
o All the classes have the same covariance matrix, but the matrix is not diagonal
• In this case, the quadratic discriminant becomes
gi(x) = −(1/2)·(x−µi)ᵀΣ⁻¹(x−µi) − (1/2)·log|Σ| + log P(ωi)
• Assuming equal priors and eliminating constant terms
gi(x) = −(x−µi)ᵀΣ⁻¹(x−µi)
• This is known as a Mahalanobis-distance classifier
[Figure: block diagram of the Mahalanobis-distance classifier: distances from x to µ1…µC (using Σ) feed a minimum selector that outputs the class]
The Mahalanobis distance
o The quadratic term is called the Mahalanobis distance, a very important metric in Statistical Pattern Recognition (right up there with Bayes theorem)
• The Mahalanobis distance is a vector distance that uses a Σ⁻¹ norm
• Σ⁻¹ can be thought of as a stretching factor on the space
• Note that for an identity covariance matrix (Σ=I), the Mahalanobis distance becomes the familiar Euclidean distance
[Figure: loci of constant Euclidean distance ‖x−µ‖² = K (a circle) vs constant Mahalanobis distance (x−µ)ᵀΣ⁻¹(x−µ) = K (an ellipse) around µ in (x1, x2)]
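A minimal sketch of the squared Mahalanobis distance for the 2×2 case (function name is ours; the identity-covariance check confirms the reduction to the Euclidean distance):

```python
# Squared Mahalanobis distance for the 2x2 case, with an explicit inverse.
# For Sigma = I it reduces to the squared Euclidean distance.
def mahalanobis2(x, mu, S):
    a, b, c, d = S[0][0], S[0][1], S[1][0], S[1][1]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    v = (x[0] - mu[0], x[1] - mu[1])
    return (v[0] * (inv[0][0] * v[0] + inv[0][1] * v[1]) +
            v[1] * (inv[1][0] * v[0] + inv[1][1] * v[1]))

print(mahalanobis2((1, 1), (0, 0), [[1, 0], [0, 1]]))  # 2.0
```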
Case 2: Σi=Σ (Σ non-diagonal), example
µ1 = [3 2]ᵀ, µ2 = [5 4]ᵀ, µ3 = [2 5]ᵀ
Σ1 = Σ2 = Σ3 = [1 0.7; 0.7 2]
Case 3: Σi≠Σj, general case, example
µ1 = [3 2]ᵀ, µ2 = [5 4]ᵀ, µ3 = [2 5]ᵀ
Σ1 = [1 −1; −1 2], Σ2 = [1 −1; −1 7], Σ3 = [0.5 0.5; 0.5 3]
Numerical example (1)
o Derive a linear discriminant function for the two-class 3D classification problem defined by
o Would anybody dare to sketch the likelihood densities and decision boundary for this problem?
µ1 = [0 0 0]ᵀ, µ2 = [1 1 1]ᵀ
Σ1 = Σ2 = (1/4)·I = [1/4 0 0; 0 1/4 0; 0 0 1/4]
p(ω2) = 2·p(ω1)
Numerical example (2)
o Solution
gi(x) = −(1/2)·(x−µi)ᵀΣi⁻¹(x−µi) + log P(ωi)   (dropping terms that are constant across classes)
With Σi⁻¹ = diag(4, 4, 4):
g1(x) = −2·(x² + y² + z²) + log(1/3)
g2(x) = −2·((x−1)² + (y−1)² + (z−1)²) + log(2/3)
Numerical example (3)
o Solution (continued)
g1(x) ≷ g2(x) ⇒ −2·(x² + y² + z²) + log(1/3) ≷ −2·((x−1)² + (y−1)² + (z−1)²) + log(2/3)
⇒ x + y + z ≶ (6 − log 2)/4 ≈ 1.33   (choose ω1 if less, ω2 if greater)
• Classify the test example xu = [0.1 0.7 0.8]ᵀ
0.1 + 0.7 + 0.8 = 1.6 > 1.33 ⇒ xu ∈ ω2
Conclusions
o The Euclidean-distance classifier is Bayes-optimal* for
• Gaussian classes, equal covariance matrices proportional to the identity matrix, and equal priors
o The Mahalanobis-distance classifier is Bayes-optimal for
• Gaussian classes, equal covariance matrices, and equal priors
*Bayes optimal means that the classifier yields the minimum P[error], which is the best ANY classifier can achieve
CHAPTER 5: Kernel Density Estimation
o Histograms
o Parzen Windows
o Smooth Kernels
o The Naïve Bayes Classifier
Non-parametric density estimation (NPDE)
o In the previous two chapters we have assumed that either
• The likelihoods p(x|ωi) were known (Likelihood Ratio Test), or
• At least, the parametric form of the likelihoods was known (Parameter Estimation)
o The methods that will be presented in the next two chapters do not afford such luxuries
• Instead, they attempt to estimate the density directly from the data without making assumptions about the underlying distribution
o Sounds challenging? You bet!
The histogram
o The simplest form of NPDE is the familiar histogram
• Divide the sample space into a number of bins and approximate the density at the center of each bin by the fraction of points in the training data that fall into the corresponding bin
P_H(x) = [number of x^(k in the same bin as x] / [N × width of the bin containing x]
[Figure: a histogram density estimate p(x) over x]
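The histogram estimate can be sketched in a few lines (bin anchor, width, and all names below are our own choices for illustration):

```python
# Histogram density estimate: fraction of samples falling in x's bin,
# divided by the bin width (bins of width h anchored at x0).
def hist_density(x, data, x0=0.0, h=1.0):
    bin_of = lambda v: int((v - x0) // h)
    k = sum(1 for d in data if bin_of(d) == bin_of(x))
    return k / (len(data) * h)

data = [0.2, 0.4, 0.5, 1.1, 1.7, 2.3, 2.4, 2.5]
print(hist_density(0.3, data))  # 0.375: 3 of the 8 samples share bin [0, 1)
```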
Shortcomings of the histogram
o The shape of the NPDE depends on the starting position of the bins
• For multivariate data, the final shape of the NPDE also depends on the orientation of the bins
o The discontinuities are not due to the underlying density, they are only an artifact of the chosen bin locations
• These discontinuities make it very difficult, without experience, to grasp the structure of the data
o A much more serious problem is the curse of dimensionality: the number of bins grows exponentially with the number of dimensions
• In high dimensions we would require a very large number of examples or else most of the bins would be empty
o All these drawbacks make the histogram unsuitable for most practical applications except for rapid visualization of results in one or two dimensions
NPDE, general formulation (1)
o Let us return to the basic definition of probability to get a solid idea of what we are trying to accomplish
• The probability that a vector x, drawn from a distribution p(x),will fall in a given region ℜ of the sample space is
P = ∫ℜ p(x′) dx′
[Figure: the probability P is the area under p(x) over the region ℜ]
From [Bishop, 1995]
NPDE, general formulation (2)
• Suppose now that N vectors {x(1, x(2, …, x(N} are drawn from the distribution; the probability that k of these N vectors fall in ℜ is now given by the binomial distribution
• It can be shown (from the properties of the binomial) that the mean and variance of the ratio k/N are
• Note that the variance gets smaller as N→∞, so we can expect that a good estimate of P is the mean fraction of points that fall within ℜ
Prob(k) = (N choose k)·P^k·(1−P)^(N−k)
E[k/N] = P   and   Var[k/N] = E[(k/N − P)²] = P·(1−P)/N
P ≅ k/N
From [Bishop, 1995]
NPDE, general formulation (3)
• Assume now that ℜ is so small that p(x) does not vary appreciably within it, then the integral can be approximated by
• where V is the volume enclosed by region ℜ
∫ℜ p(x′) dx′ ≅ p(x)·V
[Figure: as ℜ shrinks around x, p(x) is approximately constant within ℜ]
From [Bishop, 1995]
NPDE, general formulation (4)
• Merging the two expressions we obtain

P = ∫_ℜ p(x') dx' ≅ p(x)·V   and   P ≅ k/N   ⇒   p(x) ≅ k/(N·V)

• This estimate becomes more accurate as we increase the number of sample points N and shrink the volume V
• In practice the value of N (the total number of examples) is fixed
• To improve the estimate p(x) we could let V approach zero, but then the region ℜ would become so small that it would enclose no examples
• This means that, in practice, we will have to find a compromise value for the volume V
• Large enough to include enough examples within ℜ
• Small enough to support the assumption that p(x) is constant within ℜ
From [Bishop, 1995]
NPDE, general formulation (5)
o In conclusion, the general expression for NPDE is
• When applying this result to practical density estimation problems, two basic approaches can be adopted
• Kernel Density Estimation (KDE): Choose a fixed value of the volume V and determine k from the data
• k Nearest Neighbor (kNN): Choose a fixed value of k and determine the corresponding volume V from the data
• It can be shown that both KDE and kNN converge to the true probability density as N→∞, provided that V shrinks with N, and k grows with N appropriately
p(x) ≅ k/(N·V)   where   V is the volume surrounding x
                         N is the total number of examples
                         k is the number of examples inside V
From [Bishop, 1995]
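The general formula p(x) ≅ k/(N·V) can be sketched directly before specializing to KDE or kNN. A minimal one-dimensional version with a fixed "volume" (interval length) V, assuming standard-normal data for illustration, so the true density at 0 is 1/√(2π) ≈ 0.3989:

```python
import random

random.seed(1)

def npde(x, data, V):
    """General nonparametric estimate p(x) ~ k / (N * V): count the k
    samples inside a 1-D region of length V centred on x."""
    k = sum(1 for xi in data if abs(xi - x) <= V / 2)
    return k / (len(data) * V)

data = [random.gauss(0.0, 1.0) for _ in range(50_000)]
print(npde(0.0, data, V=0.2))   # close to 1/sqrt(2*pi) ~ 0.3989
```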
Parzen windows (1)
o Suppose that the region ℜ that encloses the k examples is a hypercube of side h
• Then its volume is given by V = h^D, where D is the number of dimensions
o To find the number of examples that fall within this region we define a kernel function K(u)
• This kernel, which corresponds to a unit hypercube centered at the origin, is known as a Parzen window or the naïve estimator
K(u) = 1 if |u_j| < 1/2 ∀ j = 1..D; 0 otherwise

[Figure: a hypercube of side h centered on the estimation point x]
From [Bishop, 1995]
[Figure: in one dimension, the scaled window K((x−x')/h) equals 1 on the interval (x−h/2, x+h/2) and 0 elsewhere]
Parzen windows (2)
o The total number of points inside the hypercube is then

k = Σ_{n=1}^{N} K((x − x^{(n)})/h)

o Substituting back into the density estimate expression

p_KDE(x) = (1/(N·h^D)) Σ_{n=1}^{N} K((x − x^{(n)})/h)

• Note that the Parzen window DE resembles the histogram, with the exception that the bin locations are determined by the data points
From [Bishop, 1995]
[Figure: a Parzen-window estimate built from four points x^{(1)}…x^{(4)}; each point inside the window contributes K = 1, each point outside contributes K = 0, and each contribution has height 1/V]
Numerical exercise (1)
o Given the dataset X below, use Parzen windows to estimate the density p(x) at y=3, 10, 15.
• X = {x^{(1)}, x^{(2)}, …, x^{(N)}} = {4, 5, 5, 6, 12, 14, 15, 15, 16, 17}
• Use a bandwidth of h = 4
[Figure: the ten data points on the x axis, with the three estimation points y = 3, y = 10 and y = 15 marked]
Numerical exercise (2)
o Solution: Let’s first estimate p(y=3):
p_KDE(y=3) = (1/(10·4¹)) [K((3−4)/4) + K((3−5)/4) + K((3−5)/4) + K((3−6)/4) + … + K((3−17)/4)]
           = (1/40) [1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0] = 0.025

• Similarly

p_KDE(y=10) = (1/40) [0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0] = 0

p_KDE(y=15) = (1/40) [0 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 0] = 4/40 = 0.1
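The worked solution above is easy to verify numerically. A minimal sketch of the Parzen-window estimator, using the strict inequality |u| < 1/2 so that boundary points (e.g. x = 5 when y = 3) contribute zero, consistent with the kernel values in the solution:

```python
def parzen_window(u):
    """Unit hypercube (naive) kernel in one dimension."""
    return 1.0 if abs(u) < 0.5 else 0.0

def p_kde(y, data, h):
    """Parzen-window estimate p(y) = 1/(N*h) * sum_n K((y - x_n)/h)."""
    return sum(parzen_window((y - x) / h) for x in data) / (len(data) * h)

X = [4, 5, 5, 6, 12, 14, 15, 15, 16, 17]
print(p_kde(3, X, 4), p_kde(10, X, 4), p_kde(15, X, 4))
# -> 0.025 0.0 0.1
```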
Smooth kernels (1)
o The Parzen window has several drawbacks
• It yields density estimates that have discontinuities
• It weights all points x_i equally, regardless of their distance to the estimation point x
o Some of these difficulties can be overcome by replacing the Parzen window with a smooth kernel K(u) such that

∫_{ℝ^D} K(x) dx = 1
[Figure: the rectangular Parzen window and a smooth kernel K(u), both with unit area A = 1]
Smooth kernels (2)
• Usually, but not always, K(u) will be a radially symmetric and unimodal probability density function, such as the multivariate Gaussian density function

K(x) = (1/(2π)^{D/2}) exp(−½ xᵀx)

• where the expression of the density estimate remains the same as with Parzen windows

p_KDE(x) = (1/(N·h^D)) Σ_{n=1}^{N} K((x − x^{(n)})/h)
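A minimal one-dimensional sketch of the smooth-kernel estimate, replacing the hypercube with a Gaussian bump; the dataset reuses the ten points of the earlier exercise for illustration:

```python
import math

def gauss_kernel(u):
    """1-D Gaussian kernel K(u) = (2*pi)^(-1/2) * exp(-u^2 / 2)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def p_kde(x, data, h):
    """Smooth-kernel density estimate: same form as the Parzen estimate,
    with the hypercube replaced by a Gaussian of bandwidth h."""
    return sum(gauss_kernel((x - xn) / h) for xn in data) / (len(data) * h)

X = [4, 5, 5, 6, 12, 14, 15, 15, 16, 17]
# Unlike the Parzen window, the estimate is smooth and non-zero
# even between the two clusters of points
print(p_kde(10.0, X, 3.0))
```

Because each kernel integrates to 1, the whole estimate integrates to 1 as well, so it is a valid density.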
Smooth kernels (3)
o Just as the Parzen window DE can be considered a sum of boxes centered at the observations, the smooth kernel estimate is a sum of “bumps” placed at the data points
• The kernel function determines the shape of the bumps
• The parameter h, also called the smoothing parameter or bandwidth, determines their width
[Figure: a kernel density estimate with h = 3, shown as the sum of Gaussian "bumps" (kernel functions) placed at the data points]
Bandwidth selection, univariate case (1)
[Figure: four kernel density estimates of the same dataset with bandwidths h = 1.0, 2.5, 5.0 and 10.0; small bandwidths yield spiky estimates, large bandwidths oversmooth]
Bandwidth selection, univariate case (2)
o Subjective choice
• Plot several curves and choose the estimate that is most in accordance with one's prior (subjective) ideas
• However, this method is not practical in pattern recognition since we typically have high-dimensional data
o Reference to a standard distribution
• Assume a standard density function and find the value of the bandwidth that minimizes the mean integrated squared error (MISE)

h_opt = argmin_h { MISE(p_KDE(x)) } = argmin_h { E[ ∫ (p_KDE(x) − p(x))² dx ] }

• If we assume that the true distribution is Gaussian and we use a Gaussian kernel, it can be shown that the optimal bandwidth is

h_opt = 1.06·σ·N^{−1/5}

• where σ is the sample standard deviation and N is the number of training examples
From [Silverman, 1986]
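The rule-of-thumb bandwidth is a one-liner once σ is computed. A minimal sketch, applied to the dataset of the earlier exercise for illustration:

```python
import math

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth h_opt = 1.06 * sigma * N^(-1/5), optimal
    under a Gaussian reference distribution and a Gaussian kernel."""
    n = len(data)
    mean = sum(data) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return 1.06 * sigma * n ** (-0.2)

X = [4, 5, 5, 6, 12, 14, 15, 15, 16, 17]
print(silverman_bandwidth(X))   # roughly 3.5 for this dataset
```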
Bandwidth selection, univariate case (3)
o Likelihood cross-validation
• The ML estimate of h is degenerate since it yields h_ML = 0, a density estimate with Dirac delta functions at each training data point
• A practical alternative is to maximize the "pseudo-likelihood" computed using leave-one-out cross-validation
From [Silverman, 1986]
h_MLCV = argmax_h (1/N) Σ_{n=1}^{N} log p_{−n}(x^{(n)})  ;  where  p_{−n}(x^{(n)}) = (1/((N−1)·h)) Σ_{m=1, m≠n}^{N} K((x^{(n)} − x^{(m)})/h)
[Figure: leave-one-out density estimates p_{−n}(x) evaluated at the held-out points x^{(1)} … x^{(4)}]
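Likelihood cross-validation can be sketched with a simple grid search; the Gaussian kernel and the candidate grid below are illustrative choices, not part of the original method:

```python
import math

def loo_log_likelihood(h, data):
    """Pseudo-likelihood of bandwidth h: average log of the leave-one-out
    Gaussian-kernel density estimate p_{-n}(x_n)."""
    n = len(data)
    total = 0.0
    for i, xi in enumerate(data):
        p = sum(math.exp(-0.5 * ((xi - xj) / h) ** 2) / math.sqrt(2 * math.pi)
                for j, xj in enumerate(data) if j != i) / ((n - 1) * h)
        total += math.log(p)
    return total / n

X = [4, 5, 5, 6, 12, 14, 15, 15, 16, 17]
grid = [0.25 * k for k in range(1, 41)]       # candidate bandwidths 0.25 .. 10
h_mlcv = max(grid, key=lambda h: loo_log_likelihood(h, X))
print(h_mlcv)
```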
Multivariate density estimation
o The bandwidth needs to be selected individually for each axis
• Alternatively, one may pre-scale the axes or whiten the data, so that the same bandwidth can be used for all dimensions
o The density can be estimated with a multivariate kernel or by means of so-called product kernels (see TAMU notes)
[Figure: product-kernel density estimates P(x1, x2|ωᵢ) for a ten-class, two-dimensional dataset, shown as surfaces above the scatter plot of training data]
Naïve Bayes classifier (1)
o How do we apply KDE to classifier design?
• First, we estimate the likelihood of each class P(x|ω_i)
• Then we apply Bayes rule to derive the MAP rule

g_i(x) = P(ω_i|x) ∝ P(x|ω_i)·P(ω_i)

o However, P(x|ω_i) is multivariate: NPDE becomes hard!!
• To avoid this problem, one practical simplification is sometimes made: assume that the features are class-conditionally independent

P(x|ω_i) = Π_{d=1}^{D} P(x(d)|ω_i)
Naïve Bayes classifier (2)
o Class-conditional independence vs. independence
• Class-conditional independence means

P(x|ω_i) = Π_{d=1}^{D} P(x(d)|ω_i)

• This does NOT imply that the features are unconditionally independent:

P(x) ≠ Π_{d=1}^{D} P(x(d))

[Figure: a two-class dataset whose features are independent within each class, yet dependent when the classes are pooled]
Naïve Bayes classifier (3)
o Merging this expression into the discriminant function yields the decision rule for the Naïve Bayes classifier
o The main advantage of the Naïve Bayes classifier is that we only need to compute the univariate densities P(x(d)|ωi), which is a much easier problem than estimating the multivariate density P(x|ωi)
• Despite its simplicity, the Naïve Bayes classifier has been shown to have comparable performance to artificial neural networks and decision tree learning in some domains
g_{i,NB}(x) = P(ω_i) · Π_{d=1}^{D} P(x(d)|ω_i)     (Naïve Bayes classifier)
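The decision rule above can be sketched end to end. For brevity this sketch uses univariate Gaussian estimates for each P(x(d)|ω_i) instead of the per-feature KDE discussed in the previous chapter; the toy dataset and all names are illustrative:

```python
import math

def gauss_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def naive_bayes_fit(X, y):
    """Per class: prior P(w_i) and one univariate (mu, var) per feature."""
    model = {}
    for c in set(y):
        rows = [x for x, t in zip(X, y) if t == c]
        prior = len(rows) / len(X)
        stats = []
        for d in range(len(X[0])):
            col = [r[d] for r in rows]
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mu, var))
        model[c] = (prior, stats)
    return model

def naive_bayes_predict(model, x):
    """g_i(x) = P(w_i) * prod_d P(x(d)|w_i); return the argmax class."""
    def score(c):
        prior, stats = model[c]
        g = prior
        for xd, (mu, var) in zip(x, stats):
            g *= gauss_pdf(xd, mu, var)
        return g
    return max(model, key=score)

X = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [5.0, 4.8], [5.2, 5.1], [4.9, 5.3]]
y = [0, 0, 0, 1, 1, 1]
m = naive_bayes_fit(X, y)
print(naive_bayes_predict(m, [1.0, 1.0]), naive_bayes_predict(m, [5.0, 5.0]))
# -> 0 1
```

Only D univariate densities per class are estimated, which is exactly the simplification the slide advertises.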
CHAPTER 6: Nearest Neighbors
o Nearest Neighbors density estimationo The k Nearest Neighbors classification ruleo kNN as a lazy learnero Characteristics of the kNN classifiero Optimizing the kNN classifier
kNN Density Estimation (1)
o In the kNN method we grow the volume surrounding the estimation point x until it encloses a total of k data points
o The density estimate then becomes
• Rk(x) is the distance between the estimation point x and its k-th closest neighbor
• cD is the volume of the unit sphere in D dimensions:
• Thus c1=2, c2=π, c3=4π/3 and so on
P(x) ≅ k/(N·V) = k / (N·c_D·R_k^D(x))

c_D = π^{D/2} / Γ(D/2 + 1) = π^{D/2} / (D/2)!
[Figure: in two dimensions the volume is V = πR², so P(x) = k/(N·π·R²), where R is the distance to the k-th closest neighbor]
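A one-dimensional sketch of the kNN density estimate, where the "unit sphere" is an interval, so c_1 = 2 and V = 2·R_k:

```python
def knn_density(x, data, k):
    """p(x) ~ k / (N * V) with V = c_D * R_k^D; in one dimension
    V = 2 * R_k, where R_k is the distance to the k-th nearest
    neighbour of the estimation point x."""
    dists = sorted(abs(xi - x) for xi in data)
    r_k = dists[k - 1]
    return k / (len(data) * 2.0 * r_k)

X = [4, 5, 5, 6, 12, 14, 15, 15, 16, 17]
print(knn_density(15.0, X, k=3))   # -> 0.15 (R_3 = 1 here)
```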
kNN Density Estimation (2)
o In general, the estimates that can be obtained with the kNN method are not very satisfactory
• The estimates are prone to local noise
• The method produces estimates with very heavy tails
• Since the function R_k(x) is not differentiable, the density estimate will have discontinuities
o These properties are illustrated in the next few slides
kNN Density Estimation, example 1
o To illustrate kNN we generated several DEs for a univariate mixture of two Gaussians: P(x)=½N(0,1)+½N(10,4) and several values of N and k
kNN Density Estimation, example 2 (a)
o The performance of the kNN density estimation technique on two dimensions is illustrated in these figures
• The top figure shows the true density, a mixture of two bivariate Gaussians
• The bottom figure shows the density estimate for k=10 neighbors and N=200 examples
• In the next slide we show the contours of the two distributions overlapped with the training data used to generate the estimate
p(x) = ½·N(μ₁, Σ₁) + ½·N(μ₂, Σ₂)   with   μ₁ᵀ = [5 0], Σ₁ = [1 1; 1 2]   and   μ₂ᵀ = [0 5], Σ₂ = [1 1; 1 4]
kNN Density Estimation, example 2 (b)
[Figure: true density contours (left) vs. kNN density estimate contours (right), overlapped with the training data]
kNN as a Bayesian classifier (1)
o The main advantage of the kNN method is that it leads to a very simple approximation of the Bayes classifier
o Assume that we have a dataset with N examples, Ni from class ωi, and that we are interested in classifying an unknown sample xu
• We draw a hyper-sphere of volume V around xu. Assume this volume contains a total of k examples, ki from class ωi.
• The unconditional density is estimated by
From [Bishop, 1995]
P(x) = k/(N·V)
kNN as a Bayesian classifier (2)
• Similarly, we can then approximate the likelihood functions by counting the number of examples of each class inside volume V:
• And the priors are approximated by
• Putting everything together, the Bayes classifier becomes
P(x|ω_i) = k_i / (N_i·V)   ,   P(ω_i) = N_i / N

P(ω_i|x) = P(x|ω_i)·P(ω_i) / P(x) = [ (k_i/(N_i·V)) · (N_i/N) ] / [ k/(N·V) ] = k_i / k

From [Bishop, 1995]
The kNN classification rule (1)
o The K Nearest Neighbor Rule (kNN) is a very intuitive method that classifies unlabeled examples based on their similarity to examples in the training set
o The kNN only requires
• An integer k
• A set of labeled examples (training data)
• A metric to measure "closeness"

For a given unlabeled example x_u ∈ ℜ^D, find the k "closest" labeled examples in the training data set and assign x_u to the class that appears most frequently within the k-subset
The kNN classification rule (2)
o Example
• In the example below we have three classes; the goal is to find a class label for the unknown example x_u
• In this case we use the Euclidean distance and a value of k = 5 neighbors
• Of the 5 closest neighbors, 4 belong to ω₁ and 1 belongs to ω₃, so x_u is assigned to ω₁, the predominant class
[Figure: the unknown example x_u surrounded by its 5 nearest neighbors from classes ω₁, ω₂ and ω₃]
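The rule fits in a few lines. A minimal sketch with Euclidean distance and majority voting; the toy dataset and labels are illustrative:

```python
import math
from collections import Counter

def knn_classify(xu, X, y, k=5):
    """Assign xu to the class that appears most often among its k
    nearest training examples (Euclidean distance)."""
    neighbours = sorted(zip(X, y), key=lambda p: math.dist(p[0], xu))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

X = [[0, 0], [1, 0], [0, 1], [1, 1], [5, 5], [6, 5], [5, 6], [6, 6]]
y = ['w1', 'w1', 'w1', 'w1', 'w2', 'w2', 'w2', 'w2']
print(knn_classify([0.5, 0.5], X, y), knn_classify([5.5, 5.5], X, y))
# -> w1 w2
```

Note that all the work happens at recall time: there is no training phase, which is why kNN is called a lazy learner.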
kNN in action: example 1
o We have generated data for a 2-dimensional 3-class problem, where the class-conditional densities are multi-modal, and non-linearly separable, as illustrated in the figure
o We used the kNN rule with
• k = 5
• The Euclidean distance as a metric
o The resulting decision boundaries and decision regions are shown below
kNN in action: example 2
o We have generated data for a 2-dimensional 3-class problem, where the class-conditional densities are unimodal, and are distributed in rings around a common mean. These classes are also non-linearly separable, as illustrated in the figure
o We used the kNN rule with
• k = 5
• The Euclidean distance as a metric
o The resulting decision boundaries and decision regions are shown below
Characteristics of the kNN classifier (1)
o Advantages
• Simple implementation
• Nearly optimal in the large-sample limit (N→∞)
• P[error]_Bayes < P[error]_1NN < 2·P[error]_Bayes
• Uses local information, which can yield highly adaptive behavior
• Lends itself very easily to parallel implementations
o Disadvantages
• Large storage requirements
• Computationally intensive recall
• Highly susceptible to the curse of dimensionality
Characteristics of the kNN classifier (2)
o 1NN versus kNN
• The use of large values of k has two main advantages
• Yields smoother decision regions
• Provides probabilistic information: the ratio of examples for each class gives information about the ambiguity of the decision
• However, too large a value of k is detrimental
• It destroys the locality of the estimation, since farther examples are taken into consideration
• In addition, it increases the computational burden
kNN versus 1NN

[Figure: decision boundaries for 1-NN, 5-NN and 20-NN classifiers]
kNN and the problem of feature weighting
Feature weighting
o The previous example illustrated the Achilles' heel of the kNN classifier: its sensitivity to noisy axes
• A possible solution would be to normalize each feature to N(0,1)
• However, normalization does not resolve the curse of dimensionality. A close look at the Euclidean distance shows that this metric can become very noisy for high-dimensional problems if only a few of the features carry the classification information

d(x_u, x) = sqrt( Σ_{k=1}^{D} (x_u(k) − x(k))² )

o The solution to this problem is to modify the Euclidean metric by a set of weights that represent the information content or "goodness" of each feature
CHAPTER 7: Linear Discriminant Functions
o Perceptron learningo Minimum squared error (MSE) solutiono Least-mean squares (LMS) rule
Linear Discriminant Functions (1)
o The objective of this chapter is to present methods for learning linear discriminant functions of the form
g(x) = wᵀx + w₀   with   g(x) > 0 ⇒ x ∈ ω₁ ;  g(x) < 0 ⇒ x ∈ ω₂

• where w is the weight vector and w₀ is the threshold weight or bias
• Similar discriminant functions were derived in chapter 3 as a special case of the quadratic classifier
• In this chapter, the discriminant functions will be derived in a non-parametric fashion, that is, no assumptions will be made about the underlying densities

[Figure: a linear decision boundary wᵀx + w₀ = 0 in two dimensions, with the weight vector w normal to the separating hyperplane]
Linear Discriminant Functions (2)
o For convenience, we will focus on binary classification• Extension to the multicategory case can be easily achieved by
• Using ωi/not ωi dichotomies• Using ωi/ωi dichotomies
Gradient descent (1)
o Gradient descent is a general method for function minimization
• From basic calculus, we know that the minimum of a function J(x) is defined by the zeros of its gradient

x* = argmin_x J(x)  ⇒  ∇_x J(x)|_{x=x*} = 0

• Only in very special cases does this minimization problem have a closed-form solution
• In some other cases, a closed-form solution may exist, but is numerically ill-posed or impractical (e.g., memory requirements)
Gradient descent (2)
o Gradient descent finds the minimum in an iterative fashion by moving in the direction of steepest descent
• where η is a learning rate
1. Start with an arbitrary solution x(0)
2. Compute the gradient ∇_x J(x(k))
3. Move in the direction of steepest descent: x(k+1) = x(k) − η·∇_x J(x(k))
4. Go to 2 (until convergence)
[Figure: gradient descent on a 2-D cost surface, where a poor initial guess can converge to a local rather than the global minimum; the sign of ∇J determines the direction of the weight update]
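The four steps above can be sketched in a few lines. The quadratic cost J(x) = (x₁−3)² + (x₂+1)² is an illustrative choice with a single, known minimum at (3, −1):

```python
def gradient_descent(grad, x0, eta=0.1, iters=200):
    """Steepest descent: x(k+1) = x(k) - eta * grad(x(k)), repeated
    for a fixed number of iterations."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

# Gradient of J(x) = (x1 - 3)^2 + (x2 + 1)^2, whose minimum is at (3, -1)
grad_J = lambda x: [2 * (x[0] - 3), 2 * (x[1] + 1)]
print(gradient_descent(grad_J, [0.0, 0.0]))   # converges to (3, -1)
```

On a non-convex cost the same loop may settle in a local minimum, as the figure above illustrates; the learning rate η trades off speed against stability.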
Perceptron learning (1)
o Let’s now consider the problem of learning a binary classification problem with a linear discriminant function
• As usual, assume we have a dataset X = {x^{(1)}, x^{(2)}, …, x^{(N)}} containing examples from the two classes
• For convenience, we will absorb the intercept w₀ by augmenting the feature vector x with an additional constant dimension:

wᵀx + w₀ = [w₀ wᵀ]·[1 ; x] = aᵀy
From [Duda, Hart and Stork, 2001]
Perceptron learning (2)
• Keep in mind that our objective is to find a vector a such that

g(x) = aᵀy   with   g(x) > 0 ⇒ x ∈ ω₁ ;  g(x) < 0 ⇒ x ∈ ω₂

• To simplify the derivation, we will "normalize" the training set by replacing all examples from class ω₂ by their negative

y ← −y   ∀ y ∈ ω₂

• This allows us to ignore class labels and look for a weight vector such that

aᵀy > 0   ∀ y

From [Duda, Hart and Stork, 2001]
Perceptron learning (3)
o To find this solution we must first define an objective function J(a)
• A good choice is what is known as the Perceptron criterion
• where YM is the set of examples misclassified by a• Note that JP(a) is non-negative since aTy<0 for misclassified samples
J_P(a) = Σ_{y∈Y_M} (−aᵀy)
Perceptron learning (4)
o To find the minimum of J_P(a), we use gradient descent
• The gradient is defined by

∇_a J_P(a) = Σ_{y∈Y_M} (−y)

• And the gradient descent update rule becomes

a(k+1) = a(k) + η·Σ_{y∈Y_M(k)} y

• This is known as the perceptron batch update rule
• The weight vector may also be updated in an "on-line" fashion, that is, after the presentation of each individual example

a(k+1) = a(k) + η·y^{(i)}     (perceptron rule)

• where y^{(i)} is an example that has been misclassified by a(k)
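The on-line perceptron rule can be sketched directly on a "normalized" training set (class-ω₂ rows negated, bias absorbed as on the previous slides); the toy data below is an illustrative, linearly separable example:

```python
def perceptron(Y, eta=1.0, max_epochs=1000):
    """On-line perceptron rule: whenever a^T y <= 0 (a misclassification
    on the normalized set), update a <- a + eta * y.  Stops early once a
    full pass makes no errors."""
    a = [0.0] * len(Y[0])
    for _ in range(max_epochs):
        errors = 0
        for y in Y:
            if sum(ai * yi for ai, yi in zip(a, y)) <= 0:
                a = [ai + eta * yi for ai, yi in zip(a, y)]
                errors += 1
        if errors == 0:
            break
    return a

# Augmented examples [1, x1, x2]; class-w2 rows stored as their negatives
Y = [[1, 2, 2], [1, 2, 3], [1, 3, 2],          # class w1
     [-1, 0, 1], [-1, -1, 0], [-1, -1, -1]]    # class w2, negated
a = perceptron(Y)
print(a)
```

For separable data the loop terminates with aᵀy > 0 for every y, which is exactly the condition derived above.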
Perceptron learning (5)
o If the classes are linearly separable, the perceptron rule is guaranteed to converge to a valid solution
o However, if the two classes are not linearly separable, the perceptron rule will not converge
• Since no weight vector a can correctly classify every sample in a non-separable dataset, the corrections in the perceptron rule will never cease
• One ad-hoc solution to this problem is to enforce convergence by using variable learning rates η(k) that approach zero as k approaches infinity
Minimum Squared Error solution (1)
o The classical Minimum Squared Error (MSE) criterion provides an alternative to the perceptron rule
• The perceptron rule seeks a weight vector a that satisfies the inequality aᵀy^{(i)} > 0
• The perceptron rule only considers misclassified samples, since these are the only ones that violate the above inequality
• Instead, the MSE criterion looks for a solution to the equality aᵀy^{(i)} = b^{(i)}, where the b^{(i)} are some pre-specified target values (e.g., class labels)
• As a result, the MSE solution uses ALL of the samples in the training set
From [Duda, Hart and Stork, 2001]
Minimum Squared Error solution (2)
o The system of equations solved by MSE is
• where a is the weight vector, each row in Y is a training example, and each row in b is the corresponding class label
• For consistency, we will continue assuming that examples from class ω2 have been replaced by their negative vector, although this is not a requirement for the MSE solution
[ y₀^{(1)}  y₁^{(1)}  …  y_D^{(1)} ]   [ a₀  ]   [ b^{(1)} ]
[ y₀^{(2)}  y₁^{(2)}  …  y_D^{(2)} ] · [ a₁  ] = [ b^{(2)} ]   ⇔   Ya = b
[    ⋮         ⋮           ⋮       ]   [  ⋮  ]   [    ⋮    ]
[ y₀^{(N)}  y₁^{(N)}  …  y_D^{(N)} ]   [ a_D ]   [ b^{(N)} ]
From [Duda, Hart and Stork, 2001]
Minimum Squared Error solution (3)
o An exact solution to Ya = b can sometimes be found
• If the number of (independent) equations (N) is equal to the number of unknowns (D+1), the exact solution is defined by

a = Y⁻¹b

o In practice, however, Y will be singular, so its inverse Y⁻¹ does not exist
• Y will commonly have more rows (examples) than columns (unknowns), which yields an over-determined system for which an exact solution cannot be found
Minimum Squared Error solution (4)
o The solution in this case is to find a weight vector that minimizes some function of the error between the model (aY) and the desired output (b)
• In particular, MSE seeks to Minimize the sum of the Squares of these Errors:
• which, as usual, can be found by setting its gradient to zero
J_MSE(a) = Σ_{i=1}^{N} (aᵀy^{(i)} − b^{(i)})² = ||Ya − b||²
The pseudo-inverse solution
o The gradient of the objective function is

∇_a J_MSE(a) = Σ_{i=1}^{N} 2(aᵀy^{(i)} − b^{(i)})·y^{(i)} = 2Yᵀ(Ya − b)

• with zeros defined by

YᵀYa = Yᵀb

• Notice that YᵀY is now a square matrix!
o If YᵀY is nonsingular, the MSE solution becomes

a = (YᵀY)⁻¹Yᵀb = Y†b     (pseudo-inverse solution)

• where the matrix Y† = (YᵀY)⁻¹Yᵀ is known as the pseudo-inverse of Y (Y†Y = I)
• Note that, in general, YY† ≠ I
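A minimal NumPy sketch of the pseudo-inverse solution on a made-up, normalized toy dataset (class-ω₂ rows negated, targets b = 1):

```python
import numpy as np

# Augmented, normalized training matrix Y and target vector b
Y = np.array([[1, 2, 2], [1, 2, 3], [1, 3, 2],
              [-1, 0, -1], [-1, -1, 0], [-1, -1, -1]], dtype=float)
b = np.ones(len(Y))

# a = (Y^T Y)^-1 Y^T b = pinv(Y) @ b, the MSE / pseudo-inverse solution
a = np.linalg.pinv(Y) @ b
print(Y @ a)   # least-squares approximations to b
```

The solution satisfies the normal equations YᵀYa = Yᵀb exactly (up to floating-point error), even when Ya = b itself has no exact solution.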
Ridge-regression solution (1)
o If the training data is extremely correlated (the collinearity problem), the matrix YᵀY becomes near-singular
• The smaller eigenvalues (the noise) dominate the computation of the inverse (YᵀY)⁻¹, which leads to numerical problems
o The collinearity problem can be solved through regularization
• This is equivalent to adding a small multiple of the identity matrix to the term YᵀY, which results in

a = [ (1−ε)·YᵀY + ε·(tr(YᵀY)/D)·I ]⁻¹ Yᵀb     (ridge regression)

• where ε (0 < ε < 1) is a regularization parameter that controls the amount of shrinkage toward the identity matrix. This is known as the ridge-regression solution
Ridge-regression solution (2)
o Selection of the regularization parameter
• For ε = 0, the ridge-regression solution is equivalent to the pseudo-inverse solution
• For ε = 1, the ridge-regression solution is a constant function that predicts the average classification rate across the entire dataset
• An appropriate value for ε is typically found through cross-validation
From [Gutierrez-Osuna, 2002]
Least-mean-squares solution (1)
o The objective function J_MSE(a) = ||Ya − b||² can also be minimized using a gradient descent procedure
• This avoids the problems that arise when YᵀY is singular
• In addition, it avoids working with large matrices
o Looking at the expression of the gradient, the obvious update rule is

a(k+1) = a(k) + η(k)·Yᵀ(b − Ya(k))

• It can be shown that if η(k) = η(1)/k, where η(1) is any positive constant, this rule generates a sequence of vectors that converges to a solution of Yᵀ(Ya − b) = 0
From [Duda, Hart and Stork, 2001]
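A minimal gradient-descent sketch on the hypothetical toy data used earlier. Note one assumption: a small constant step size is used here for a fast demo (it must satisfy η < 2/λ_max(YᵀY) for stability), whereas the η(k) = η(1)/k schedule above is the one with the general convergence guarantee.

```python
import numpy as np

# Hypothetical toy data: augmented feature vectors and ±1 targets
Y = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
b = np.array([-1.0, -1.0, 1.0, 1.0])

a = np.zeros(3)
eta = 0.1   # constant step, well below 2 / lambda_max(Y'Y) for this Y
for k in range(500):
    # a(k+1) = a(k) + eta * Y'(b - Y a(k))
    a = a + eta * Y.T @ (b - Y @ a)
```

After a few hundred iterations the iterate matches the pseudo-inverse solution, without ever forming or inverting YᵀY.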
Least-mean-squares solution (2)
o The storage requirements of this algorithm can be reduced by considering each sample sequentially
• This is known as the Widrow-Hoff, least-mean-squares (LMS), or delta rule [Mitchell, 1997]

a(k+1) = a(k) + η(k) (b(i) − a(k)ᵀy(i)) y(i)        LMS rule
From [Duda, Hart and Stork, 2001]
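The sequential rule can be sketched as below, again on the hypothetical toy data; the random sample order per epoch and the small constant step size are illustrative choices, not prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: augmented feature vectors and ±1 targets
Y = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
b = np.array([-1.0, -1.0, 1.0, 1.0])

a = np.zeros(3)
eta = 0.05
for epoch in range(1000):
    for i in rng.permutation(len(Y)):
        # LMS / Widrow-Hoff update: one sample y(i) at a time,
        # so the full matrix Y never has to be processed at once
        a = a + eta * (b[i] - a @ Y[i]) * Y[i]
```

Because each update touches a single sample, the same loop works when the data arrive as a stream.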
Summary: Perceptron vs. MSE procedures
o Perceptron rule
• The perceptron rule always finds a solution if the classes are linearly separable, but does not converge if the classes are non-separable
o MSE criterion
• The MSE solution has guaranteed convergence, but it may not find a separating hyperplane even if the classes are linearly separable
• Notice that MSE minimizes the sum of squared distances from the training data to the separating hyperplane, rather than searching for the separating hyperplane itself
[Figure: two-class scatter plot in the (x1, x2) plane comparing the LMS and Perceptron decision boundaries]
Summary of classifier decision boundaries
[Figure: decision boundaries in feature space (Feature 1 vs. Feature 2) for the QUADRATIC, KNN, MLP and RBF classifiers]
References
o This material is an abridged version of my lecture notes in Pattern Recognition at Texas A&M University, which you may download from http://research.cs.tamu.edu/prism
o Additional references are:
• C. Bishop (1995), Neural Networks for Pattern Recognition, Oxford University Press
• B. W. Silverman (1986), Density Estimation for Statistics and Data Analysis, Chapman and Hall
• R. O. Duda, P. E. Hart and D. G. Stork (2001), Pattern Classification, Wiley
• R. Schalkoff (1992), Pattern Recognition: Statistical, Structural and Neural Approaches, Wiley
• R. Gutierrez-Osuna (2002), "Pattern Analysis for Machine Olfaction: A Review," IEEE Sensors Journal, 2(3), 189-202
3rd NOSE Short Course, Alpbach, 21st – 26th Mar 2004
Questions
Thank you