Tina Memo No. 2013-007
Presented at the BMVA one day meeting, London 2013.
Quantitative Pattern Recognition: Warts and All.
N.A.Thacker.
Last updated: 2/20/2014
Imaging Science and Biomedical Engineering Division, Medical School, University of Manchester,
Stopford Building, Oxford Road, Manchester, M13 9PT.
Quantitative Use of Pattern Recognition:
“Warts and All”
N. A. Thacker, ISBE, University of Manchester
Abstract
Pattern recognition is not like conventional statistical analysis methods. Much of the published work is non-quantitative. The aim is to develop methods which “seem to work”, and might be called engineering. Some application domains (science, medicine) do not sit well with this approach. This talk is intended to explain why.
Ideas are illustrated with examples of MRI segmentation and crater counting on Mars.
Hypercube Classifier

[Figure: scatter of classes a, b and c in the (g, h) feature plane, with cut thresholds g1, g2, h1 and h2 marked.]
A simple approach
eg: a ‘cuts’ based analysis, binary image arithmetic;

(G(i, j) > g1) ∗ (G(i, j) < g2) ∗ (H(i, j) > h1) ∗ (H(i, j) < h2)
Simple but does not compactly describe the true distribution of data.
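The cuts expression can be sketched directly with numpy boolean image arithmetic. The arrays G, H and the threshold values here are illustrative, not taken from the memo:

```python
import numpy as np

# Illustrative two-channel "images": grey-levels in two measurement bands.
rng = np.random.default_rng(0)
G = rng.uniform(0.0, 1.0, size=(8, 8))
H = rng.uniform(0.0, 1.0, size=(8, 8))

# Hypercube ('cuts') classifier: a pixel belongs to the class only if each
# measurement falls inside its fixed interval.  The product of the four
# binary images implements the logical AND of the four cuts.
g1, g2 = 0.2, 0.6
h1, h2 = 0.3, 0.9
mask = (G > g1) * (G < g2) * (H > h1) * (H < h2)

print(mask.sum(), "pixels pass all four cuts")
```

The product of boolean arrays is equivalent to an element-wise logical AND, which is why the slide's binary image arithmetic works as a classifier.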
Is there a theoretical optimal approach?
Bayes Classifier

[Figure: the same (g, h) scatter, overlaid with the class-conditional densities p(g,h|a), p(g,h|b) and p(g,h|c).]
A decision based upon the classification probability eg:
P(a|g, h) = Q(g, h|a) / (Q(g, h|a) + Q(g, h|b) + Q(g, h|c))
Bayes Classifier

≈ p(g, h|a)Q(a) / (p(g, h|a)Q(a) + p(g, h|b)Q(b) + p(g, h|c)Q(c))

where ∫ p(g, h|ω) dg dh = 1,

will minimise the classification error.
In general
P(ωm|X) = p(X|ωm)Q(ωm) / ∑_n p(X|ωn)Q(ωn)
Note: In practice, when constructing classifiers from finite samples, difference vectors should be constructed using ‘measure theory’.
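The general posterior formula can be sketched for the three-class (g, h) example. The Gaussian density models, their parameters and the priors Q(ω) below are illustrative assumptions, not values from the memo:

```python
import numpy as np

def gauss2(x, mean, var):
    """Isotropic 2-D Gaussian density (illustrative density model p(g,h|w))."""
    d = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return np.exp(-0.5 * (d @ d) / var) / (2.0 * np.pi * var)

# Illustrative class models for classes a, b, c in the (g, h) plane.
models = {
    "a": lambda x: gauss2(x, (0.2, 0.8), 0.01),
    "b": lambda x: gauss2(x, (0.5, 0.3), 0.02),
    "c": lambda x: gauss2(x, (0.8, 0.7), 0.01),
}
priors = {"a": 0.5, "b": 0.3, "c": 0.2}  # Q(w): assumed class proportions

def posterior(x):
    """P(w|x) = p(x|w) Q(w) / sum_n p(x|w_n) Q(w_n)."""
    joint = {w: models[w](x) * priors[w] for w in models}
    total = sum(joint.values())
    return {w: joint[w] / total for w in joint}

post = posterior((0.25, 0.75))
print(max(post, key=post.get))  # prints "a": the minimum-error decision
```

Deciding on the class with the largest posterior is exactly the Bayes-optimal rule described above; the posteriors always sum to one by construction.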
Decision Boundaries

For a two class problem we can imagine a boundary curve between the two probability distributions which passes through all points where P(ωm|x) = 0.5.

[Figure: the (g, h) scatter with densities p(g,h|a), p(g,h|b) and p(g,h|c), and the decision boundary drawn between them.]
Rapid learning of decision boundaries in high dimensions is the basis for LDA, SVM, Decision Trees and Random Forests (with or without Boosting).
Evaluation Methodology (Warts I)
Can we use the vast quantities of published results to undertake scientific analyses?
• Papers often focus on irrelevant aspects of algorithms, and results can’t be reliably replicated (experimenter effects, publication bias [Chatfield]).
• Algorithms are often adjusted using ‘test’ data, which introduces bias.
• Test datasets are often poorly constructed and permit unexpected solutions to the problem [Shamir].
• Gold standard reference datasets generally have their own biases and errors.
• The “representative data” problem precludes use for quantitative counting tasks.
• Algorithms are compared using ROC curves. ROC curves are not really what we need! (See below.)
Scientific Method

[Figure: Theory 1 and Theory 2 compared against the Data.]
The scientific method involves quantitative comparison of theory and data.
Progress is made most reliably via “falsification” (Popper).
The specific values of the measurement errors are important.
Scientific Method

[Figure: two Theory 1 / Data / Theory 2 comparisons.]
The specific values of the errors are important.
Systematic and Statistical Effects

e.g. a size/frequency distribution:

[Figure: frequency vs size histogram compared with Theory; correct for systematic effects.]

We cannot directly compare quantities of classified data to theory, due to mis-identification.
Can we use ROCs to Correct for Systematic Error?

[Figure: ROC curve in the (fp, tn) unit square.]

The decision process is only ‘Bayes Optimal’ if the relative proportions of data classes during training and practical application match [Saerens].

Similarly, the misidentification rate will change.

Q = Q′ − Q′ tn + (N − Q′) fp
Systematic and Statistical Effects

Q′ = (Q − N fp) / (1 − tn − fp)

eg: Q = 42, N = 100, tn = 0.066, fp = 0.20 → Q′ ≈ 30

Even having corrected for systematic errors, we need to know the statistical error. Is it Poisson/Binomial? (See below.)
[Figure: frequency vs size histogram compared with Theory; estimate statistical errors.]
In practice the shapes and positions of sample distributions can also change!
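The correction can be checked numerically. The short sketch below simply inverts the observed-count relation Q = Q′ − Q′ tn + (N − Q′) fp and reproduces the worked example from the slide:

```python
def corrected_quantity(Q, N, tn, fp):
    """Invert Q = Q' - Q'*tn + (N - Q')*fp to recover the true count Q'."""
    return (Q - N * fp) / (1.0 - tn - fp)

# The slide's worked example: 42 observed out of 100, tn = 0.066, fp = 0.20.
Qp = corrected_quantity(Q=42, N=100, tn=0.066, fp=0.20)
print(round(Qp))  # prints 30
```

Note that the correction can move the estimate a long way from the raw count (42 → 30 here), which is why classified quantities cannot be compared to theory directly.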
Algorithm Construction and Theory (Warts II)
• Demographics change the optimal decision boundaries.
• Probability estimates are only correct if the density models are correct [Niculescu-Mizil].
• Data generation processes are poorly understood for arbitrary data.
• Sensitivity to parameter tuning (sometimes crude techniques seem to work better on new data).
[Figure: scatter of classes a and b for a Training sample and a Testing sample; the class distributions shift between the two.]
It’s not just ‘where is the boundary?’, but also ‘how can the boundary move?’.
Summary of the Main Issues
We need to determine the following;
• Which p(X|ωn) do we use for incoming data?
It needs to be based upon knowledge of typical datasets.
• How do we correct for misclassification?
• What are the errors on observed quantities?
We could use Monte-Carlo, but this requires us to construct a simulation.

Monte-Carlos require detailed knowledge of the data, including expected sample quantities Q(ωn).
If we have a good simulation even a ‘cuts based’ analysis is practical.
Without a good simulation we are left without any idea of what a pattern recognition algorithm is doing on a new (arbitrary) data set.
Solution : White Box Analysis!
White Box Correction for Misclassification
This best case performance is called the Bayes Error Rate, which occurs when Q(ωm) is set to match the data sample.
In a sense Q(ωm) must be considered an output of the analysis.
The p(X|ωn)’s constitute our prior knowledge, not Q(ωn) !
Some algorithms assume that the data sample is fixed.
eg: K Nearest Neighbours, discriminant analysis, Support Vector Machines (SVM).
Some algorithms allow us to estimate Q(ωm) eg: Expectation Maximisation.
Q(ωm) = ∑_{i∈R} P(ωm|Xi)

This is the Likelihood estimate; it is therefore the corrected estimate of quantity!
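The quantity estimate is just a sum of posteriors over the region R, so soft classifications contribute fractionally to every class. A minimal sketch with made-up posterior probabilities:

```python
import numpy as np

# Posterior classifications P(w_m|X_i) for 5 data vectors and 3 classes
# (illustrative values; each row sums to one).
P = np.array([
    [0.90, 0.05, 0.05],
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.20, 0.30, 0.50],
    [0.05, 0.05, 0.90],
])

# Q(w_m) = sum over i in R of P(w_m|X_i): the corrected quantity estimate.
Q = P.sum(axis=0)
print(Q)        # prints [1.95 1.4  1.65]: soft counts per class
print(Q.sum())  # prints 5.0: the total number of vectors is conserved
```

Because each row sums to one, the class quantities always add up to the sample size, unlike counts of hard-labelled data passed through a decision boundary.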
Example: Multi-Spectral Segmentation
Each grey-level can be used as one component of the measured data vector X = (x1, x2, x3, x4).
P (X|ω) is based upon physics of the MR imaging process.
Pure and partial volume models (simulation).
[Figure: real data scatter plots, x1 vs x2 and x3 vs x4.]
Multi-Spectral Segmentation
Parameters for the density model are estimated using Expectation Maximisation (EM), which involves iteration of the following steps;
• use an initial estimate of probability classifications P (ωm|X) to generate a new estimate ofthe parameters.
• use the new parameters to re-estimate P (ωm|X).
There is a proof that this process will converge on a local optimum of the Likelihood of the N data vectors Xi.

L = ∏_{i∈R} ∑_m Q(ωm) p(Xi|ωm)
We can assess data conformity (representativeness) using ‘goodness of fit’.
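The two EM steps can be sketched for a one-dimensional, two-class Gaussian mixture. This is a simplified illustration, not the memo's MR model: the density shapes p(x|ω) are held fixed and only the quantities Q(ωm) are re-estimated:

```python
import numpy as np

def em_quantities(x, means, var, n_iter=50):
    """Estimate class quantities Q(w_m) by EM with fixed densities p(x|w_m)."""
    Q = np.full(len(means), len(x) / len(means))  # initial equal split
    for _ in range(n_iter):
        # E-step: P(w_m|x_i) proportional to Q(w_m) p(x_i|w_m).
        # (The shared Gaussian normalisation constant cancels in the ratio.)
        dens = np.exp(-0.5 * (x[:, None] - means) ** 2 / var)
        joint = Q * dens
        post = joint / joint.sum(axis=1, keepdims=True)
        # M-step: Q(w_m) = sum_i P(w_m|x_i), the Likelihood quantity estimate.
        Q = post.sum(axis=0)
    return Q

# Illustrative sample: 300 points from class 1, 100 from class 2.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 100)])
Q = em_quantities(x, means=np.array([0.0, 5.0]), var=1.0)
print(Q)  # soft counts, close to the true 300/100 split
```

Each iteration uses the current Q(ωm) to compute posteriors, then sums those posteriors to update Q(ωm), exactly the two steps listed above.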
Multi-Spectral Segmentation
But: what region (R) should we define? (The one which minimises measurement error?)
(See: Tina memos 2003-007, 2007-005)
White Box Construction of Density Distributions
What happens when we cannot predict the density distributions?
• We can model density distributions non-parametrically (e.g. linear models of histograms).
• These models can be constructed from data samples using a version of ICA.
• We seek the model which approximates the variations in distributions seen across multiple samples.
• We can adjust these models to ‘best’ account for incoming data during analysis.
• These models have estimation errors ‘locked in’ at the time of training.
Example: Crater Recognition
Define a set S of probability densities which describe the appearance of prototypical craters, p(X|ωm) with m ∈ S.

This has an associated mixture density

p(X|crater) = (1/Q(S)) ∑_{m∈S} Q(ωm) p(X|ωm)
Crater Recognition
This task now conforms to our earlier definition of scientific measurement.
What are the errors on measured quantities Q(ω)?
White Box estimation of Statistical Error
Start with Likelihood and compute the Cramer-Rao Bound.
−∂² log(L) / ∂Q(ωm)² = 1 / var(Q(ωm))

L = ∏_{i∈R} ∑_m^M Q(ωm) p(Xi|ωm)

Assume that the p(Xi|ωm) are known (no free parameters).

var(Q(ω)) ≥ Q(ω)² / ∑_i P(ω|Xi)²

As ∑_i P(ω|Xi)² ≤ ∑_i P(ω|Xi), this gives a (best case) Poisson estimate for non-overlapping distributions: var(Q(ω)) = Q(ω).
For a weak classifier we need to consider the off-diagonal terms in the inverse covariance matrix; as P(ω|Xi) → 0.5 the off-diagonal terms become comparable to the diagonal ones and

var(Q(ω)) → ∞

This has implications for the evaluation of pattern recognition systems, as the true accuracy of a classification assessment may be many times worse than that predicted by sampling (Poisson or Binomial) statistics.
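The diagonal part of the bound can be checked numerically: for a sharp classifier it reduces to the Poisson value Q, while posteriors near 0.5 inflate it (and, as noted above, the off-diagonal terms then make the full variance diverge). The posteriors below are illustrative:

```python
import numpy as np

def quantity_variance_bound(post):
    """Cramer-Rao bound var(Q(w)) >= Q(w)^2 / sum_i P(w|X_i)^2."""
    Q = post.sum()
    return Q ** 2 / np.sum(post ** 2)

# A confident classifier: posteriors are all 0 or 1 for 100 data vectors.
sharp = np.array([1.0] * 50 + [0.0] * 50)
# A weak classifier: every posterior is ambiguous.
weak = np.full(100, 0.5)

print(quantity_variance_bound(sharp))  # prints 50.0 -> Poisson: var = Q
print(quantity_variance_bound(weak))   # prints 100.0 -> twice the Poisson value
```

Both samples give the same quantity estimate Q = 50, but the weak classifier's bound is already double the Poisson variance even before the off-diagonal terms are considered.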
White Box Correction of Distribution Assumptions
The observed distribution can be modelled as a linear combination of sub-classes. The inverse covariance for a set of classification quantities is then given by

C_Q^{−1} = ∑_{i∈R} D(Xi)^T D(Xi)

with

D(Xi)^T = [P(ω1|Xi)/Q(ω1), ..., P(ωM|Xi)/Q(ωM)]
We can define a subset of classes m ∈ S as a quantitative measure which relates to the theory, s.t.

Q(S) = ∑_{m∈S} Q(ωm) = S·Q

Then

var(Q(S)) = S^T C_Q S
(See: Tina memo 2010-008, 2011-003)
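The covariance propagation for a subset S can be sketched as follows; the posteriors, and hence the quantities, are illustrative made-up values:

```python
import numpy as np

# Posteriors P(w_m|X_i) for 4 data vectors and 3 sub-classes
# (illustrative values; each row sums to one).
P = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
])
Q = P.sum(axis=0)  # Likelihood quantity estimates Q(w_m)

# Inverse covariance: C_Q^{-1} = sum_i D(X_i)^T D(X_i),
# with D(X_i) = [P(w_1|X_i)/Q(w_1), ..., P(w_M|X_i)/Q(w_M)].
D = P / Q                   # one row per data vector
C = np.linalg.inv(D.T @ D)  # covariance of the quantities

# Subset S = {w_1, w_2}: selection vector, Q(S) = S.Q, var(Q(S)) = S^T C_Q S.
S = np.array([1.0, 1.0, 0.0])
print(S @ Q)      # Q(S)
print(S @ C @ S)  # var(Q(S))
```

The selection vector S simply picks out which sub-class quantities contribute to the theoretical measure, and the quadratic form propagates their full covariance, including the off-diagonal terms.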
Conclusions
Methods which label data using fixed decision boundaries will be sub-optimal.
A reliance on Monte Carlo methods only shifts the problem from design of the pattern recognition system to construction of the simulation.
Automatic construction of a simulation for arbitrary data samples requires solution of the very problem which pattern recognition is supposed to solve.
Many popular pattern recognition techniques and published performance details (e.g. ROC) are unsuitable for direct use in scientific tasks.
For scientific use it is far better to understand the quantitation errors in analysis than to optimise mis-identification rates.
Many of the problems which arise when trying to apply pattern recognition to quantitative tasks are due to ‘Black Box’ methodologies.
Conclusions: White Box vs Black Box
The White Box approach supports;
• Optimal adjustment of distributions to incoming data.
• Goodness of fit measures to test data conformity (representativeness).
• Estimation of statistical errors for incoming data.
• Estimation of systematic errors arising during training.
• Quantitative tests for self-consistency [Haralick].
Remaining problems with methodology are associated with ‘gold standard’ definitions and selection of training/test data.

‘Black Box’ testing is simple and quick, resulting in rapid publication. It can be performed by students with little training.

‘White Box’ analysis requires an understanding of both methods and statistics.
Acknowledgements
MRI segmentation project;
Paul Bromiley.
Satellite images of Mars;
Paul Tar, Jamie Gilmore, Merren Jones.
References
A. Niculescu-Mizil, R. Caruana, Obtaining Calibrated Probabilities from Boosting, Proc. 21st Conf. Uncertainty in Artificial Intelligence, AUAI Press, 2005.

K. Chatfield et al., The Devil is in the Details: An Evaluation of Recent Feature Encoding Methods, BMVC, Dundee, 2011.

M. Saerens et al., Adjusting the Outputs of a Classifier to New a Priori Probabilities may Significantly Improve Classification Accuracy: Evidence from a Multi-class Problem in Remote Sensing, Proc. 18th ICML, pp. 298-305, 2001.

R.M. Haralick, X. Liu, T. Kanungo, On the Use of Error Propagation for Statistical Validation of Computer Vision Software, IEEE Pattern Analysis and Machine Intelligence, 27(10), pp. 1603-1614, 2005.

L. Shamir, Evaluation of Face Datasets as a Tool for Assessing the Performance of Face Recognition Methods, IJCV, 79(3), pp. 225-230, 2008.
All Tina memos available from www.tina-vision.net