Discrimination and Classification
Nathaniel E. Helwig
Assistant Professor of Psychology and Statistics, University of Minnesota (Twin Cities)
Updated 14-Mar-2017
Copyright
Copyright © 2017 by Nathaniel E. Helwig
Outline of Notes
1) Classifying Two Populations: Overview of Problem; Cost of Misclassification
2) Two Multivariate Normals: Equal Covariance; Unequal Covariance
3) Evaluating Classifications: Misclassification Measures; Quality in LDA
4) Classifying g ≥ 2 Populations: Overview of Problem; Cost of Misclassification; Discriminant Analysis
5) Iris Data Example: Data Overview; LDA Example; QDA Example
Purpose of Discrimination and Classification
Discrimination attempts to separate distinct sets of objects, and classification attempts to allocate new objects to predefined groups.

There are two typical goals of discrimination and classification:
1. Data description: find “discriminants” that best separate groups
2. Data allocation: put new objects in groups via the “discriminants”

Note that goal 1 is discrimination, and goal 2 is classification/allocation.
Classifying Two Populations
The Two Population Classification Problem
Let X = (X1, . . . , Xp)′ denote a random vector, and let
f1(x) denote the probability density function (pdf) for population π1
f2(x) denote the probability density function (pdf) for population π2

Problem: Given a realization X = x, we want to assign x to π1 or π2.

We want to find some classification rule to determine whether a realization X = x should be assigned to population π1 or π2.
Visualizing a Classification Rule
Let Ω denote the sample space, i.e., all possible values of x, and
R1 ⊂ Ω is the subset of Ω for which we classify x as π1
R2 = Ω − R1 is the subset of Ω for which we classify x as π2

Figure: Figure 11.2 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 2 variables.
Probability of Misclassification
The conditional probability P(2|1) of classifying an object as π2 when the object really belongs to π1 is given by

P(2|1) = P(X ∈ R2 | π1) = ∫_{R2} f1(x) dx

The conditional probability P(1|2) of classifying an object as π1 when the object really belongs to π2 is given by

P(1|2) = P(X ∈ R1 | π2) = ∫_{R1} f2(x) dx
Visualizing the Probability of Misclassification
Figure: Figure 11.3 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 1 variable.
Incorporating Prior Probabilities
Let p1 and p2 denote the prior probabilities that an object belongs to π1 and π2, respectively, with the constraint that p1 + p2 = 1.
The overall probabilities of the four outcomes have the form
P(correctly classify as π1) = P(X ∈ R1|π1)P(π1) = P(1|1)p1
P(correctly classify as π2) = P(X ∈ R2|π2)P(π2) = P(2|2)p2
P(misclassify π1 as π2) = P(X ∈ R2|π1)P(π1) = P(2|1)p1
P(misclassify π2 as π1) = P(X ∈ R1|π2)P(π2) = P(1|2)p2
Classification Table and Misclassification Costs
In many real-world cases, costs of misclassification are not equal:
π1 and π2 are diseased and healthy
π1 and π2 are guilty and not guilty
π1 and π2 are buy and not buy stock

We can make a cost matrix to tabulate our misclassification costs:

                      Classify as:
                      π1        π2
Truth:   π1           0         c(2|1)
         π2           c(1|2)    0
The expected cost of misclassification (ECM) is defined as
ECM = c(2|1)P(2|1)p1 + c(1|2)P(1|2)p2
Classification Rule (Region) Minimizing ECM
The R1 and R2 that minimize the ECM are defined via the inequalities:

R1 : f1(x)/f2(x) ≥ (c(1|2)/c(2|1)) (p2/p1)
R2 : f1(x)/f2(x) < (c(1|2)/c(2|1)) (p2/p1)

If c(1|2) = c(2|1), then we are classifying via posterior probabilities.

If c(1|2) = c(2|1) and p1 = p2, then the classification rule reduces to

R1 : f1(x)/f2(x) ≥ 1
R2 : f1(x)/f2(x) < 1
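To make the rule concrete, here is a minimal R sketch of the minimum-ECM rule for two hypothetical univariate normal populations; the means, costs, and priors below are made-up values for illustration, not part of the notes.

# minimum-ECM rule for two populations (hypothetical univariate normal example)
f1 <- function(x) dnorm(x, mean=0, sd=1)   # assumed density for population pi_1
f2 <- function(x) dnorm(x, mean=2, sd=1)   # assumed density for population pi_2
c12 <- 1; c21 <- 1                         # costs c(1|2) and c(2|1)
p1 <- 0.5; p2 <- 0.5                       # prior probabilities
# assign x to R_1 when the density ratio meets the cost/prior threshold
classify <- function(x) ifelse(f1(x)/f2(x) >= (c12/c21)*(p2/p1), "pi_1", "pi_2")
classify(c(-1, 0.5, 1.5, 3))               # boundary is at x = 1 in this example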
Two Multivariate Normal Populations
MVN Two Population Classification Problem
Let X = (X1, . . . , Xp)′ denote a random vector, and let
f1(x) ∼ N(µ1, Σ) denote the pdf for population π1
f2(x) ∼ N(µ2, Σ) denote the pdf for population π2

Problem: Given a realization X = x, we want to assign x to π1 or π2.

We want to find some classification rule to determine whether a realization X = x should be assigned to population π1 or π2.
Classification Rule Minimizing ECM
The multivariate normal densities have the form

fk(x) = (2π)^{−p/2} |Σ|^{−1/2} exp{−(1/2)(x − µk)′ Σ^{−1} (x − µk)}

for k ∈ {1, 2}, which implies that

f* = f1(x)/f2(x)
   = exp{−(1/2)(x − µ1)′ Σ^{−1} (x − µ1) + (1/2)(x − µ2)′ Σ^{−1} (x − µ2)}
   = exp{(µ1 − µ2)′ Σ^{−1} x − (1/2)(µ1 − µ2)′ Σ^{−1} (µ1 + µ2)}

The R1 and R2 that minimize the ECM are defined via the inequalities:

R1 : log(f*) ≥ log[(c(1|2)/c(2|1)) (p2/p1)]
R2 : log(f*) < log[(c(1|2)/c(2|1)) (p2/p1)]
Classification Rule in Practice
The rule on the previous slide depends on the population parameters µ1, µ2, and Σ, which are often unknown in practice.

Given n1 independent observations from π1 and n2 independent observations from π2, we can estimate the needed parameters:

µ̂1 = x̄1 = (1/n1) ∑_{i=1}^{n1} x_{i(1)}  and  µ̂2 = x̄2 = (1/n2) ∑_{i=1}^{n2} x_{i(2)}

Σ̂ = Sp = [1/(n1 + n2 − 2)] [∑_{i=1}^{n1} (x_{i(1)} − x̄1)(x_{i(1)} − x̄1)′ + ∑_{i=1}^{n2} (x_{i(2)} − x̄2)(x_{i(2)} − x̄2)′]

The estimated classification rule replaces f* with its sample estimate:

f̂* = exp{(x̄1 − x̄2)′ Sp^{−1} x − (1/2)(x̄1 − x̄2)′ Sp^{−1} (x̄1 + x̄2)}
Classification Rule in Practice (continued)
If ν = (c(1|2)/c(2|1)) (p2/p1) = 1, then the rule becomes

R1 : y ≥ m
R2 : y < m

where y = a′x and m = (1/2)(ȳ1 + ȳ2), with a′ = (x̄1 − x̄2)′ Sp^{−1}, ȳ1 = a′x̄1, and ȳ2 = a′x̄2.

Scale of a is not uniquely determined, so normalize a using either:
1. a* = a/‖a‖ (unit length)
2. a* = a/a1 (first element 1)
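As a small illustration, the sample rule can be computed directly in R; this sketch treats two of the iris species (the data used later in these notes) as π1 and π2 and assumes equal costs and priors.

# sample linear rule y = a'x with midpoint m (two iris species as pi_1, pi_2)
X1 <- as.matrix(iris[iris$Species=="versicolor", 1:4])
X2 <- as.matrix(iris[iris$Species=="virginica", 1:4])
n1 <- nrow(X1); n2 <- nrow(X2)
xbar1 <- colMeans(X1); xbar2 <- colMeans(X2)
Sp <- ((n1-1)*cov(X1) + (n2-1)*cov(X2)) / (n1 + n2 - 2)  # pooled covariance
a <- solve(Sp, xbar1 - xbar2)              # a = Sp^{-1} (xbar1 - xbar2)
m <- 0.5 * sum(a * (xbar1 + xbar2))        # m = (ybar1 + ybar2) / 2
mean(X1 %*% a >= m)                        # proportion of pi_1 classified to R_1
mean(X2 %*% a < m)                         # proportion of pi_2 classified to R_2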
Fisher’s Linear Discriminant Function
R. A. Fisher arrived at the decision rule on the previous slide using an entirely different argument.

Fisher considered finding the linear combination Y = a′X that best separates the groups:

separation = |ȳ1 − ȳ2| / sy

where
ȳ1 is the mean of the Y scores for the observations from π1
ȳ2 is the mean of the Y scores for the observations from π2
s²y = [∑_{i=1}^{n1} (y_{i(1)} − ȳ1)² + ∑_{i=1}^{n2} (y_{i(2)} − ȳ2)²] / (n1 + n2 − 2) is the pooled variance
Fisher’s Linear Discriminant Function (continued)
Setting a′ = (x̄1 − x̄2)′ Sp^{−1} maximizes the separation

separation² = (ȳ1 − ȳ2)² / s²y
            = (a′x̄1 − a′x̄2)² / (a′ Sp a)
            = (a′d)² / (a′ Sp a)
            = d′ Sp^{−1} d
            = D²

over all possible a vectors, where d = x̄1 − x̄2.
Visualizing Fisher’s Linear Discriminant Function

Figure: Figure 11.5 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 2 variables.
MVN Two Population Classification Problem (Σ1 ≠ Σ2)
Let X = (X1, . . . , Xp)′ denote a random vector, and let
f1(x) ∼ N(µ1, Σ1) denote the pdf for population π1
f2(x) ∼ N(µ2, Σ2) denote the pdf for population π2

Problem: Given a realization X = x, we want to assign x to π1 or π2.

We want to find some classification rule to determine whether a realization X = x should be assigned to population π1 or π2.
Classification Rule Minimizing ECM (Σ1 ≠ Σ2)
The multivariate normal densities have the form

fk(x) = (2π)^{−p/2} |Σk|^{−1/2} exp{−(1/2)(x − µk)′ Σk^{−1} (x − µk)}

for k ∈ {1, 2}, which implies that

f* = f1(x)/f2(x)
   = (|Σ1|/|Σ2|)^{−1/2} exp{−(1/2)(x − µ1)′ Σ1^{−1} (x − µ1) + (1/2)(x − µ2)′ Σ2^{−1} (x − µ2)}

The R1 and R2 that minimize the ECM are defined via the inequalities:

R1 : log(f*) ≥ log[(c(1|2)/c(2|1)) (p2/p1)]
R2 : log(f*) < log[(c(1|2)/c(2|1)) (p2/p1)]
Classification Rule in Practice (Σ1 ≠ Σ2)
The rule on the previous slide depends on the population parameters µ1, µ2, Σ1, and Σ2, which are often unknown in practice.

Given n1 independent observations from π1 and n2 independent observations from π2, we can estimate the needed parameters:

µ̂1 = x̄1 = (1/n1) ∑_{i=1}^{n1} x_{i(1)}  and  Σ̂1 = S1 = [1/(n1 − 1)] ∑_{i=1}^{n1} (x_{i(1)} − x̄1)(x_{i(1)} − x̄1)′

µ̂2 = x̄2 = (1/n2) ∑_{i=1}^{n2} x_{i(2)}  and  Σ̂2 = S2 = [1/(n2 − 1)] ∑_{i=1}^{n2} (x_{i(2)} − x̄2)(x_{i(2)} − x̄2)′

The estimated classification rule replaces f* with its sample estimate:

f̂* = (|S1|/|S2|)^{−1/2} exp{−(1/2)(x − x̄1)′ S1^{−1} (x − x̄1) + (1/2)(x − x̄2)′ S2^{−1} (x − x̄2)}
Classification Rule in Practice (Σ1 ≠ Σ2), continued
Note that we can write

log(f̂*) = log[(|S1|/|S2|)^{−1/2} e^{−(1/2)(x − x̄1)′ S1^{−1} (x − x̄1) + (1/2)(x − x̄2)′ S2^{−1} (x − x̄2)}] = y − m

where

y = −(1/2) x′(S1^{−1} − S2^{−1})x + (x̄1′ S1^{−1} − x̄2′ S2^{−1})x
m = (1/2) log(|S1|/|S2|) + (1/2)(x̄1′ S1^{−1} x̄1 − x̄2′ S2^{−1} x̄2)

y is a quadratic function of x, so this is a quadratic classification rule.
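A minimal R sketch of this quadratic statistic, again treating two iris species as π1 and π2 and assuming equal costs and priors (an assumption made here for illustration):

# quadratic rule statistic log(f*) = y - m for two iris species
X1 <- as.matrix(iris[iris$Species=="versicolor", 1:4])
X2 <- as.matrix(iris[iris$Species=="virginica", 1:4])
xbar1 <- colMeans(X1); xbar2 <- colMeans(X2)
S1i <- solve(cov(X1)); S2i <- solve(cov(X2))
m <- 0.5*log(det(cov(X1))/det(cov(X2))) +
     0.5*(drop(xbar1 %*% S1i %*% xbar1) - drop(xbar2 %*% S2i %*% xbar2))
qstat <- function(x) {   # y - m for one observation x
  y <- -0.5*drop(x %*% (S1i - S2i) %*% x) +
       drop((xbar1 %*% S1i - xbar2 %*% S2i) %*% x)
  y - m
}
# classify to pi_1 when y - m >= 0 (equal costs and priors)
scores <- apply(rbind(X1, X2), 1, qstat)
mean(scores[1:50] >= 0)    # proportion of pi_1 correctly classified
mean(scores[51:100] < 0)   # proportion of pi_2 correctly classified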
Caution: Quadratic Classification of Non-Normal Data

Figure: Figure 11.6 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 1 variable.
Evaluating Classification Functions
Quantifying the Quality of a Classification Rule
To determine if a classification rule is “good”, we can examine the error rates, i.e., the misclassification probabilities.

The population parameters are unknown in practice, so we focus on approaches that can estimate the error rates from the observed data.

We want our classification rule to generalize to new data, so we consider cross-validation procedures.
Total Probability of Misclassification
The Total Probability of Misclassification (TPM) is defined as

TPM(R1, R2) = p1 ∫_{R2} f1(x) dx + p2 ∫_{R1} f2(x) dx

for any classification rule (region) that partitions Ω = R1 ∪ R2.

The Optimum Error Rate (OER) is the minimum possible value of TPM

OER = min_{R1,R2} TPM(R1, R2) subject to Ω = R1 ∪ R2

which is obtained when R1 : f1(x)/f2(x) ≥ p2/p1 and R2 : f1(x)/f2(x) < p2/p1.

If c(1|2) = c(2|1), minimizing TPM is the same as minimizing ECM.
Actual Error Rate
The error rates on the previous slide require knowledge of the (typically unknown) parameters that define the densities f1(·) and f2(·).

Example: For LDA, calculating OER requires µ1, µ2, and Σ.

The Actual Error Rate (AER) is defined using the sample estimates

AER(R̂1, R̂2) = p1 ∫_{R̂2} f1(x) dx + p2 ∫_{R̂1} f2(x) dx

where R̂1 and R̂2 denote estimates from samples of sizes n1 and n2.
Apparent Error Rate
The Apparent Error Rate (APER) is an (optimistic) estimate of the AER.
It estimates the AER using the observed (training) sample of data.

The confusion matrix for a sample of data is

                   Classified as:
                   π1      π2      Total
Truth:   π1        nC1     nM1     n1
         π2        nM2     nC2     n2

where
nCk is the number correctly classified in population k ∈ {1, 2}
nM1 = n1 − nC1 is the number from π1 that are misclassified
nM2 = n2 − nC2 is the number from π2 that are misclassified
Apparent Error Rate (continued)

Given a sample of data with confusion matrix

                   Classified as:
                   π1      π2      Total
Truth:   π1        nC1     nM1     n1
         π2        nM2     nC2     n2

the APER is calculated as

APER = (nM1 + nM2) / (n1 + n2)

which is the total proportion of misclassified sample observations.
Leave-One-Out (Ordinary) Cross-Validation
Lachenbruch proposed a better approach to estimate the AER:

1. Population 1 (for i = 1, . . . , n1)
   (a) Hold out the i-th observation from π1 and build the classification rule
   (b) Use the classification rule from Step 1(a) to classify the i-th observation
2. Population 2 (for i = 1, . . . , n2)
   (a) Hold out the i-th observation from π2 and build the classification rule
   (b) Use the classification rule from Step 2(a) to classify the i-th observation

An (almost) unbiased estimate of the expected AER is given by

Ê(AER) = (n*M1 + n*M2) / (n1 + n2)

where n*M1 and n*M2 are the numbers of misclassified observations using the above “leave-one-out” procedure.
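This procedure is easy to code directly; here is a minimal sketch using MASS::lda and the iris data (both introduced later in these notes):

# Lachenbruch's leave-one-out estimate, coded directly
# (lda(..., CV=TRUE) returns this kind of estimate automatically)
library(MASS)
n <- nrow(iris)
pred <- character(n)
for(i in 1:n) {
  fit <- lda(Species ~ ., data=iris[-i,])                # build rule without case i
  pred[i] <- as.character(predict(fit, iris[i,])$class)  # classify held-out case
}
mean(pred != iris$Species)   # estimate of the expected AER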
Revisiting Linear Discriminant Analysis
Let X = (X1, . . . , Xp)′ denote a random vector, and let
f1(x) ∼ N(µ1, Σ) denote the pdf for population π1
f2(x) ∼ N(µ2, Σ) denote the pdf for population π2

Reminder: assuming that (c(1|2)/c(2|1)) (p2/p1) = 1, the classification rule is

R1 : Y ≥ m
R2 : Y < m

where Y = a′X and m = (1/2)(µY1 + µY2), with a′ = (µ1 − µ2)′ Σ^{−1}, µY1 = a′µ1, and µY2 = a′µ2.
Revisiting Linear Discriminant Analysis (continued)
Y = a′X = (µ1 − µ2)′ Σ^{−1} X is a linear function of X, so . . .

µY1 = a′µ1 = (µ1 − µ2)′ Σ^{−1} µ1
µY2 = a′µ2 = (µ1 − µ2)′ Σ^{−1} µ2
σ²Y = a′Σa = (µ1 − µ2)′ Σ^{−1} (µ1 − µ2) = ∆²

And since X is multivariate normal, we have that

Y ∼ N(µY1, ∆²) if from π1
Y ∼ N(µY2, ∆²) if from π2

i.e., Y is univariate normal with a population-dependent mean.
Visualizing Misclassification in LDA

Figure: Figure 11.7 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern).
Calculating Misclassification in LDA (classify π1 as π2)
Defining m = (1/2)(µ1 − µ2)′ Σ^{−1} (µ1 + µ2), we have that

P(misclassify π1 as π2) = P(X ∈ R2 | π1) = P(2|1)
  = P(Y < m)
  = P((Y − µY1)/σY < [m − (µ1 − µ2)′ Σ^{−1} µ1]/∆)
  = P(Z < −(1/2)∆²/∆)
  = Φ(−∆/2)

where Φ(·) denotes the CDF of the standard normal distribution.
Calculating Misclassification in LDA (classify π2 as π1)
Defining m = (1/2)(µ1 − µ2)′ Σ^{−1} (µ1 + µ2), we have that

P(misclassify π2 as π1) = P(X ∈ R1 | π2) = P(1|2)
  = P(Y ≥ m)
  = P((Y − µY2)/σY ≥ [m − (µ1 − µ2)′ Σ^{−1} µ2]/∆)
  = P(Z ≥ (1/2)∆²/∆)
  = 1 − Φ(∆/2) = Φ(−∆/2)

where Φ(·) denotes the CDF of the standard normal distribution.
Optimum Error Rate for Linear Discriminant Analysis
For the LDA classification rule, we have that

OER = min_{R1,R2} TPM(R1, R2)
    = (1/2) P(misclassify π1 as π2) + (1/2) P(misclassify π2 as π1)
    = (1/2) Φ(−∆/2) + (1/2)[1 − Φ(∆/2)]
    = Φ(−∆/2)

so the OER is a function of the effect size

∆ = sqrt{(µ1 − µ2)′ Σ^{−1} (µ1 − µ2)}

which is a distance measure between µ1 and µ2.
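As a quick numeric illustration, the OER can be computed with pnorm() once ∆ is known; here ∆ is estimated from two iris species, so the result is only an estimate of the OER:

# estimated OER = pnorm(-Delta/2) for two iris species
X1 <- as.matrix(iris[iris$Species=="versicolor", 1:4])
X2 <- as.matrix(iris[iris$Species=="virginica", 1:4])
d <- colMeans(X1) - colMeans(X2)
Sp <- (49*cov(X1) + 49*cov(X2)) / 98     # pooled covariance (n1 = n2 = 50)
Delta <- sqrt(drop(d %*% solve(Sp, d)))  # estimated Mahalanobis distance
pnorm(-Delta/2)                          # estimated optimum error rate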
Classifying g ≥ 2 Populations
The g Population Classification Problem
Let X = (X1, . . . , Xp)′ denote a random vector and let fk(x) denote the probability density function (pdf) for population πk for k ∈ {1, . . . , g}.

Problem: Given a realization X = x, we want to assign x to a πk.

We want to find some classification rule to determine whether a realization X = x should be assigned to population π1, π2, . . ., or πg.
Classification Rule with g ≥ 2 Populations
Let Ω denote the sample space, i.e., all possible values of x, and
R1 ⊂ Ω is the subset of Ω for which we classify x as π1
R2 ⊂ Ω is the subset of Ω for which we classify x as π2
...
Rg ⊂ Ω is the subset of Ω for which we classify x as πg

Ω = R1 ∪ R2 ∪ · · · ∪ Rg and Rk ∩ Rℓ = ∅ for all k ≠ ℓ.
The classification rule partitions the sample space.
The classification regions are mutually exclusive.
Visualizing a Classification Rule: g = 3 Populations

Figure: Figure 11.10 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 2 variables.
Probability and Cost of Misclassification
The conditional probability P(ℓ|k) of classifying an object as πℓ when the object really belongs to πk is given by

P(ℓ|k) = P(X ∈ Rℓ | πk) = ∫_{Rℓ} fk(x) dx

for all k ≠ ℓ with k, ℓ ∈ {1, . . . , g}.

Note that P(k|k) = 1 − ∑_{ℓ≠k} P(ℓ|k) by definition.

Let c(ℓ|k) denote the cost of allocating an object to πℓ when the object really belongs to πk, and let pk denote the prior probability of πk.
Expected Cost of Misclassification (revisited)
The conditional expected cost of misclassifying an object from πk is

ECM(k) = ∑_{ℓ≠k} P(ℓ|k) c(ℓ|k)

Incorporating the prior probabilities, the overall ECM is given by

ECM = ∑_{k=1}^{g} pk ECM(k) = ∑_{k=1}^{g} pk ∑_{ℓ≠k} P(ℓ|k) c(ℓ|k)
Minimum ECM Classification Rule
The classification regions R1, R2, . . . , Rg that minimize the ECM are defined by allocating X = x to the population πk that minimizes

∑_{ℓ≠k} pℓ fℓ(x) c(k|ℓ)

To understand the logic of the classification rule, suppose that we have equal costs, i.e., c(ℓ|k) = c(k|ℓ) = 1 for all k, ℓ ∈ {1, . . . , g}:
We allocate x to the population πk that minimizes ∑_{ℓ≠k} pℓ fℓ(x)
Minimizing ∑_{ℓ≠k} pℓ fℓ(x) is the same as maximizing pk fk(x)
Allocate x to population πk if pk fk(x) > pℓ fℓ(x) for all ℓ ≠ k
This is equivalent to maximizing the posterior probability P(πk | x) (see the sketch below)
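A minimal R sketch of this equal-cost rule with hypothetical univariate normal populations (the parameters below are made up for illustration):

# allocate x to the population maximizing p_k f_k(x) (equal costs)
mus <- c(0, 2, 5); sds <- c(1, 1, 2)   # hypothetical parameters for g = 3
priors <- c(0.3, 0.3, 0.4)             # hypothetical prior probabilities
classify <- function(x)
  which.max(sapply(1:3, function(k) priors[k] * dnorm(x, mus[k], sds[k])))
sapply(c(-1, 1.5, 4), classify)        # population index chosen for each x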
Overview of Fisher’s Approach
Fisher developed his discriminant analysis for g > 2 populations.

Idea: find a small number of linear combinations (e.g., a′1x, a′2x, a′3x) that best separate the groups.

This offers a simple and useful procedure for classification, which also provides nice visualizations:
Plot the linear combinations to visualize the discriminants.
Assumptions of Fisher’s Discriminant Analysis
Let X = (X1, . . . , Xp)′ denote a random vector and let fk(x) ∼ (µk, Σ) denote the pdf for population πk.
Note the homogeneity of covariance matrix assumption.
We do not need the multivariate normality assumption.

Let µ̄ = (1/g) ∑_{k=1}^{g} µk denote the mean of the combined populations, and let

Bµ = ∑_{k=1}^{g} (µk − µ̄)(µk − µ̄)′

denote the “Between” sum-of-squares and crossproducts (SSCP) matrix.
Properties of a Linear Combination
Define the new variable Y = a′X, which has properties

E(Y | πk) = a′E(X | πk) = a′µk
V(Y | πk) = a′V(X | πk)a = a′Σa

and note that the overall mean of Y has the form

µ̄Y = (1/g) ∑_{k=1}^{g} µYk = (1/g) ∑_{k=1}^{g} a′µk = a′µ̄
Between versus Within Group Variability
Form the ratio of the between-group separation over the variance of Y:

F* = ∑_{k=1}^{g} (µYk − µ̄Y)² / σ²Y
   = ∑_{k=1}^{g} (a′µk − a′µ̄)² / (a′Σa)
   = a′[∑_{k=1}^{g} (µk − µ̄)(µk − µ̄)′]a / (a′Σa)
   = (a′Bµa) / (a′Σa)

Note that higher F* values relate to more separation between groups.
Population Discriminants
The k-th population discriminant is the linear combination

Yk = ak′X

where ak is proportional to the k-th eigenvector of Σ^{−1}Bµ, for k = 1, . . . , s with s = min(g − 1, p).

The ak are scaled to make the Yk have unit variance, i.e., ak′Σak = 1, and ak′Σaℓ = 0 for k ≠ ℓ.

Note that this is only useful if we somehow know the true population parameters µ1, . . . , µg and Σ.
Sample Discriminants
The sample estimated “Between” and “Within” SSCP matrices are

B = ∑_{k=1}^{g} (x̄k − x̄)(x̄k − x̄)′  and  W = ∑_{k=1}^{g} ∑_{i=1}^{nk} (x_{i(k)} − x̄k)(x_{i(k)} − x̄k)′

where x̄k = (1/nk) ∑_{i=1}^{nk} x_{i(k)} and x̄ = (1/g) ∑_{k=1}^{g} x̄k.

The k-th sample discriminant is the linear combination

Ŷk = âk′X

where âk is proportional to the k-th eigenvector of W^{−1}B.

The âk are scaled to make the Ŷk have unit variance, i.e., âk′Σ̂âk = 1, where Σ̂ = Sp = W/(n − g) with n = ∑_{k=1}^{g} nk.
Properties of Population Discriminants
Let Y = A′X where A = [a1, . . . , as].
Y = (Y1, . . . , Ys)′ contains the s discriminants.
Columns of A contain the linear combination weights.

The mean of Y is given by

E(Y | πk) = A′E(X | πk) = A′µk = µkY

and the covariance matrix for Y is

Cov(Y) = A′Cov(X | πk)A = A′ΣA = Is

because the discriminants have unit variance and are uncorrelated.
Remember: ak′Σaℓ = δkℓ, where δkℓ is Kronecker’s δ.
Classifying New Objects with Discriminants
Given a realization X = x, define y = A′x and calculate the distance between the observed y = (y1, . . . , ys)′ and the k-th population mean:

Dk = (y − µkY)′(y − µkY) = ∑_{ℓ=1}^{s} (yℓ − µkYℓ)² = ∑_{ℓ=1}^{s} [aℓ′(x − µk)]²

where µkY = A′µk, yℓ = aℓ′x, and µkYℓ = aℓ′µk.

To build a distance using r ≤ s discriminants, use

Dk^{(r)} = ∑_{ℓ=1}^{r} (yℓ − µkYℓ)² = ∑_{ℓ=1}^{r} [aℓ′(x − µk)]²

and classify x to the population πk that minimizes the distance Dk^{(r)}.
Classifying New Objects with Sample Discriminants
Given a realization X = x, define ŷ = Â′x and calculate the distance between the observed ŷ = (ŷ1, . . . , ŷs)′ and the k-th sample mean:

D̂k = (ŷ − µ̂kY)′(ŷ − µ̂kY) = ∑_{ℓ=1}^{s} (ŷℓ − µ̂kYℓ)² = ∑_{ℓ=1}^{s} [âℓ′(x − x̄k)]²

where µ̂kY = Â′x̄k, ŷℓ = âℓ′x, and µ̂kYℓ = âℓ′x̄k.

To build a distance using r ≤ s discriminants, use

D̂k^{(r)} = ∑_{ℓ=1}^{r} (ŷℓ − µ̂kYℓ)² = ∑_{ℓ=1}^{r} [âℓ′(x − x̄k)]²

and classify x to the population πk that minimizes the distance D̂k^{(r)}.
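A minimal sketch of this distance-based classification on iris, taking Â from MASS::lda (introduced in the next section) rather than recomputing it:

# classify each flower by minimizing the discriminant distance D_k
library(MASS)
A <- lda(Species ~ ., data=iris)$scaling        # \hat{A}, p x s with s = 2
Y <- as.matrix(iris[, 1:4]) %*% A               # discriminant scores y = A'x
muY <- apply(as.matrix(iris[, 1:4]), 2, tapply, iris$Species, mean) %*% A  # A' xbar_k
D <- sapply(1:3, function(k) rowSums(sweep(Y, 2, muY[k, ])^2))  # D_k for each x
pred <- levels(iris$Species)[max.col(-D)]       # arg min_k of the distance
table(iris$Species, pred)                       # resulting confusion matrix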
Relation to MVN Classification Problem
Let X = (X1, . . . , Xp)′ be a random vector and let fk(x) ∼ N(µk, Σk) denote the pdf for population πk.

Assuming equal misclassification costs, we allocate X = x to the population πk that minimizes ∑_{ℓ≠k} pℓ fℓ(x), or equivalently maximizes pk fk(x).

This is equivalent to allocating X = x to the population πk that maximizes

dQk(x) = Quadratic discriminant score
       = −(1/2) ln(|Σk|) − (1/2)(x − µk)′ Σk^{−1} (x − µk) + ln(pk)

dLk(x) = Linear discriminant score
       = µk′Σ^{−1}x − (1/2) µk′Σ^{−1}µk + ln(pk)

where dLk is used when Σk = Σ for all k ∈ {1, . . . , g}.
Relation to MVN Classification Problem (continued)
If we assume that pk = 1/g for all k ∈ {1, . . . , g}, then

dLk(x) = µk′Σ^{−1}x − (1/2) µk′Σ^{−1}µk

Define the linear combination yj = aj′x, where aj = Σ^{−1/2}vj with vj denoting the j-th eigenvector of B̃µ = Σ^{−1/2} Bµ Σ^{−1/2}. Then

Dk = ∑_{j=1}^{p} (yj − µkYj)² = ∑_{j=1}^{p} [aj′(x − µk)]² = (x − µk)′ Σ^{−1} (x − µk)
   = −2 dLk(x) + α

where α = x′Σ^{−1}x is constant across populations.

If rank(Bµ) = r, allocating to the population πk that maximizes dLk(x) is equivalent to allocating to the population πk that minimizes Dk^{(r)}.
Fisher’s Iris Data Example
Fisher’s (or Anderson’s) Famous Iris Data
R. A. Fisher published the LDA approach in 1936 and used Edgar Anderson’s iris flower dataset as an example.

The dataset consists of measurements of p = 4 variables taken from nk = 50 flowers randomly sampled from each of g = 3 species.
Variables: Sepal Length, Sepal Width, Petal Length, Petal Width
Species: setosa, versicolor, virginica

The goal was/is to build a linear discriminant function that best classifies a new flower into one of the three species.
Fisher’s Famous Iris Data in R
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> colMeans(iris[iris$Species=="setosa",1:4])
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.006        3.428        1.462        0.246
> colMeans(iris[iris$Species=="versicolor",1:4])
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.936        2.770        4.260        1.326
> colMeans(iris[iris$Species=="virginica",1:4])
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       6.588        2.974        5.552        2.026
> p <- 4L
> g <- 3L
Make Pooled Covariance Matrix
# make pooled covariance matrix
> Sp <- matrix(0, p, p)
> nx <- rep(0, g)
> lev <- levels(iris$Species)
> for(k in 1:g) {
+   x <- iris[iris$Species==lev[k], 1:p]
+   nx[k] <- nrow(x)
+   Sp <- Sp + cov(x) * (nx[k] - 1)
+ }
> Sp <- Sp / (sum(nx) - g)
> round(Sp, 3)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length        0.265       0.093        0.168       0.038
Sepal.Width         0.093       0.115        0.055       0.033
Petal.Length        0.168       0.055        0.185       0.043
Petal.Width         0.038       0.033        0.043       0.042
LDA in R via the lda Function (MASS Package)
# fit lda model
> library(MASS)
> ldamod <- lda(Species ~ ., data=iris, prior=rep(1/3, 3))

# check the LDA coefficients/scalings
> ldamod$scaling
                    LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785
> crossprod(ldamod$scaling, Sp) %*% ldamod$scaling
             LD1          LD2
LD1 1.00000e+00 -7.21645e-16
LD2 -7.21645e-16  1.00000e+00

# create the (centered) discriminant scores
> mu.k <- ldamod$means
> mu <- colMeans(mu.k)
> dscores <- scale(iris[,1:p], center=mu, scale=F) %*% ldamod$scaling
> sum((dscores - predict(ldamod)$x)^2)
[1] 1.658958e-28
Plot LDA Results: Score and Coefficients
[Figure: two panels of the LDA solution. Left, “Discriminant Scores”: LD2 vs LD1 scores for setosa, versicolor, and virginica. Right, “Discriminant Coefficients”: LD2 vs LD1 coefficients for Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.]

R code for left plot:
# spid is an integer species index (presumably spid <- as.integer(iris$Species))
plot(dscores, xlab="LD1", ylab="LD2", pch=spid, col=spid,
     main="Discriminant Scores", xlim=c(-10, 10), ylim=c(-3, 3))
legend("top", lev, pch=1:3, col=1:3, bty="n")
Plot LDA Results: Discriminant Partitions
[Figure: “Partition Plot” of LD2 vs LD1 showing the LDA classification regions for the three species (plotted as s, c, v); app. error rate: 0.02.]

library(klaR)
species <- factor(rep(c("s","c","v"), each=50))
partimat(x=dscores[,2:1], grouping=species, method="lda")
Plot LDA Results: All Pairwise Partitions
[Figure: “Partition Plot” matrix of all pairwise LDA partitions with apparent error rates: Sepal.Width vs Sepal.Length, 0.2; Petal.Length vs Sepal.Length, 0.033; Petal.Length vs Sepal.Width, 0.047; Petal.Width vs Sepal.Length, 0.04; Petal.Width vs Sepal.Width, 0.033; Petal.Width vs Petal.Length, 0.04.]

library(klaR)
species <- factor(rep(c("s","c","v"), each=50))
partimat(x=iris[,1:4], grouping=species, method="lda")
APER and Expected AER
# make confusion matrix (and APER)
> confusion <- table(iris$Species, predict(ldamod)$class)
> confusion
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          1        49
> n <- sum(confusion)
> aper <- (n - sum(diag(confusion))) / n
> aper
[1] 0.02

# use CV to get expected AER
> ldamodCV <- lda(Species ~ ., data=iris, prior=rep(1/3, 3), CV=TRUE)
> confusionCV <- table(iris$Species, ldamodCV$class)
> confusionCV
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          1        49
> eaer <- (n - sum(diag(confusionCV))) / n
> eaer
[1] 0.02
Split Data into Training (70%) and Testing (30%) Sets
> # split into separate matrices for each flower
> Xs <- subset(iris, Species=="setosa")
> Xc <- subset(iris, Species=="versicolor")
> Xv <- subset(iris, Species=="virginica")

# split into training and testing
> set.seed(1)
> sid <- sample.int(n=50, size=35)
> cid <- sample.int(n=50, size=35)
> vid <- sample.int(n=50, size=35)
> Xtrain <- rbind(Xs[sid,], Xc[cid,], Xv[vid,])
> Xtest <- rbind(Xs[-sid,], Xc[-cid,], Xv[-vid,])

# fit lda to training and evaluate on testing
> ldatrain <- lda(Species ~ ., data=Xtrain, prior=rep(1/3, 3))
> confusionTest <- table(Xtest$Species, predict(ldatrain, newdata=Xtest)$class)
> confusionTest
             setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         15         0
  virginica       0          1        14
> n <- sum(confusionTest)
> aer <- (n - sum(diag(confusionTest))) / n
> aer
[1] 0.02222222
Two-Fold CV with 100 Random 70/30 Splits
> nrep <- 100
> aer <- rep(0, nrep)
> set.seed(1)
> for(k in 1:nrep) {
+   sid <- sample.int(n=50, size=35)
+   cid <- sample.int(n=50, size=35)
+   vid <- sample.int(n=50, size=35)
+   Xtrain <- rbind(Xs[sid,], Xc[cid,], Xv[vid,])
+   Xtest <- rbind(Xs[-sid,], Xc[-cid,], Xv[-vid,])
+   ldatrain <- lda(Species ~ ., data=Xtrain, prior=rep(1/3, 3))
+   confusionTest <- table(Xtest$Species, predict(ldatrain, newdata=Xtest)$class)
+   n <- sum(confusionTest)
+   aer[k] <- (n - sum(diag(confusionTest))) / n
+ }
> mean(aer)
[1] 0.022
QDA in R via the qda Function (MASS Package)
# fit qda model
> library(MASS)
> qdamod <- qda(Species ~ ., data=iris, prior=rep(1/3, 3))
> names(qdamod)
 [1] "prior"   "counts"  "means"   "scaling" "ldet"    "lev"     "N"
 [8] "call"    "terms"   "xlevels"

# check the QDA coefficients/scalings
> dim(qdamod$scaling)
[1] 4 4 3
> round(crossprod(qdamod$scaling[,,1], cov(Xs[,1:p])) %*% qdamod$scaling[,,1], 4)
  1 2 3 4
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
> round(crossprod(qdamod$scaling[,,2], cov(Xc[,1:p])) %*% qdamod$scaling[,,2], 4)
  1 2 3 4
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
> round(crossprod(qdamod$scaling[,,3], cov(Xv[,1:p])) %*% qdamod$scaling[,,3], 4)
  1 2 3 4
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
Plot QDA Results: All Pairwise Partitions
[Figure: “Partition Plot” matrix of all pairwise QDA partitions with apparent error rates: Sepal.Width vs Sepal.Length, 0.2; Petal.Length vs Sepal.Length, 0.04; Petal.Length vs Sepal.Width, 0.047; Petal.Width vs Sepal.Length, 0.033; Petal.Width vs Sepal.Width, 0.047; Petal.Width vs Petal.Length, 0.02.]

library(klaR)
species <- factor(rep(c("s","c","v"), each=50))
partimat(x=iris[,1:4], grouping=species, method="qda")
APER and Expected AER
# make confusion matrix (and APER)
> confusion <- table(iris$Species, predict(qdamod)$class)
> confusion
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          1        49
> n <- sum(confusion)
> aper <- (n - sum(diag(confusion))) / n
> aper
[1] 0.02

# use CV to get expected AER
> qdamodCV <- qda(Species ~ ., data=iris, prior=rep(1/3, 3), CV=TRUE)
> confusionCV <- table(iris$Species, qdamodCV$class)
> confusionCV
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          1        49
> eaer <- (n - sum(diag(confusionCV))) / n
> eaer
[1] 0.02666667
Split Data into Training (70%) and Testing (30%) Sets
> # split into separate matrices for each flower
> Xs <- subset(iris, Species=="setosa")
> Xc <- subset(iris, Species=="versicolor")
> Xv <- subset(iris, Species=="virginica")

> # split into training and testing
> set.seed(1)
> sid <- sample.int(n=50, size=35)
> cid <- sample.int(n=50, size=35)
> vid <- sample.int(n=50, size=35)
> Xtrain <- rbind(Xs[sid,], Xc[cid,], Xv[vid,])
> Xtest <- rbind(Xs[-sid,], Xc[-cid,], Xv[-vid,])

# fit qda to training and evaluate on testing
> qdatrain <- qda(Species ~ ., data=Xtrain, prior=rep(1/3, 3))
> confusionTest <- table(Xtest$Species, predict(qdatrain, newdata=Xtest)$class)
> confusionTest
             setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         15         0
  virginica       0          1        14
> n <- sum(confusionTest)
> aer <- (n - sum(diag(confusionTest))) / n
> aer
[1] 0.02222222
Two-Fold CV with 100 Random 70/30 Splits
> nrep <- 100
> aer <- rep(0, nrep)
> set.seed(1)
> for(k in 1:nrep) {
+   sid <- sample.int(n=50, size=35)
+   cid <- sample.int(n=50, size=35)
+   vid <- sample.int(n=50, size=35)
+   Xtrain <- rbind(Xs[sid,], Xc[cid,], Xv[vid,])
+   Xtest <- rbind(Xs[-sid,], Xc[-cid,], Xv[-vid,])
+   qdatrain <- qda(Species ~ ., data=Xtrain, prior=rep(1/3, 3))
+   confusionTest <- table(Xtest$Species, predict(qdatrain, newdata=Xtest)$class)
+   n <- sum(confusionTest)
+   aer[k] <- (n - sum(diag(confusionTest))) / n
+ }
> mean(aer)
[1] 0.02466667
Plot LDA and QDA Results using PCA
[Figure: two panels of PCA scores (PC2 vs PC1), titled “LDA Results” and “QDA Results”, with points marked by the predicted species (setosa, versicolor, virginica).]
Plot LDA and QDA Results using PCA (R code)
R code for plot on previous slide:

# visualize LDA and QDA results via PCA
ldaid <- as.integer(predict(ldamod)$class)
qdaid <- as.integer(predict(qdamod)$class)
pcamod <- princomp(iris[,1:4])
dev.new(width=10, height=5, noRStudioGD=TRUE)
par(mfrow=c(1,2))
plot(pcamod$scores[,1:2], xlab="PC1", ylab="PC2", pch=ldaid, col=ldaid,
     main="LDA Results", xlim=c(-4, 4), ylim=c(-2, 2))
legend("topright", lev, pch=1:3, col=1:3, bty="n")
abline(h=0, lty=3)
abline(v=0, lty=3)
plot(pcamod$scores[,1:2], xlab="PC1", ylab="PC2", pch=qdaid, col=qdaid,
     main="QDA Results", xlim=c(-4, 4), ylim=c(-2, 2))
legend("topright", lev, pch=1:3, col=1:3, bty="n")
abline(h=0, lty=3)
abline(v=0, lty=3)