Tina Memo No. 2013-007
Presented at the BMVA one day meeting, London 2013.
Quantitative Pattern Recognition: Warts and All.
N.A.Thacker.
Last updated: 2/20/2014
Imaging Science and Biomedical Engineering Division, Medical School, University of Manchester,
Stopford Building, Oxford Road, Manchester, M13 9PT.
Quantitative Use of Pattern Recognition:
“Warts and All”
N. A. Thacker, ISBE, University of Manchester
Abstract
Pattern recognition is not like conventional statistical analysis methods. Much of the published work is non-quantitative. The aim is to develop methods which “seem to work”, and might be called engineering. Some application domains (science, medicine) do not sit well with this approach. This talk is intended to explain why.
Ideas are illustrated with examples of MRI segmentation and crater counting on Mars.
Hypercube Classifier

[Figure: scatter of classes a, b and c in the (g, h) feature plane, with cut thresholds g1, g2, h1 and h2 marked.]
A simple approach
eg: a ‘cuts’ based analysis, binary image arithmetic;

(G(i, j) > g1) ∗ (G(i, j) < g2) ∗ (H(i, j) > h1) ∗ (H(i, j) < h2)
Simple but does not compactly describe the true distribution of data.
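The cuts expression can be sketched directly with numpy boolean image arithmetic. The arrays G, H and the threshold values here are illustrative, not taken from the memo:

```python
import numpy as np

# Illustrative two-channel "images": grey-levels in two measurement bands.
rng = np.random.default_rng(0)
G = rng.uniform(0.0, 1.0, size=(8, 8))
H = rng.uniform(0.0, 1.0, size=(8, 8))

# Hypercube ('cuts') classifier: a pixel belongs to the class only if each
# measurement falls inside its fixed interval.  The product of the four
# binary images implements the logical AND of the four cuts.
g1, g2 = 0.2, 0.6
h1, h2 = 0.3, 0.9
mask = (G > g1) * (G < g2) * (H > h1) * (H < h2)

print(mask.sum(), "pixels pass all four cuts")
```

The product of boolean arrays is equivalent to an element-wise logical AND, which is why the slide's binary image arithmetic works as a classifier.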
Is there a theoretical optimal approach?
Bayes Classifier

[Figure: the same (g, h) scatter, overlaid with the class-conditional densities p(g,h|a), p(g,h|b) and p(g,h|c).]
A decision based upon the classification probability eg:
P(a|g, h) = Q(g, h|a) / (Q(g, h|a) + Q(g, h|b) + Q(g, h|c))
Bayes Classifier

≈ p(g, h|a)Q(a) / (p(g, h|a)Q(a) + p(g, h|b)Q(b) + p(g, h|c)Q(c))

where ∫ p(g, h|ω) dg dh = 1,

will minimise the classification error.
In general
P(ωm|X) = p(X|ωm)Q(ωm) / ∑_n p(X|ωn)Q(ωn)
Note: In practice, when constructing classifiers from finite samples, difference vectors should be constructed using ‘measure theory’.
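The general posterior formula can be sketched for the three-class (g, h) example. The Gaussian density models, their parameters and the priors Q(ω) below are illustrative assumptions, not values from the memo:

```python
import numpy as np

def gauss2(x, mean, var):
    """Isotropic 2-D Gaussian density (illustrative density model p(g,h|w))."""
    d = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return np.exp(-0.5 * (d @ d) / var) / (2.0 * np.pi * var)

# Illustrative class models for classes a, b, c in the (g, h) plane.
models = {
    "a": lambda x: gauss2(x, (0.2, 0.8), 0.01),
    "b": lambda x: gauss2(x, (0.5, 0.3), 0.02),
    "c": lambda x: gauss2(x, (0.8, 0.7), 0.01),
}
priors = {"a": 0.5, "b": 0.3, "c": 0.2}  # Q(w): assumed class proportions

def posterior(x):
    """P(w|x) = p(x|w) Q(w) / sum_n p(x|w_n) Q(w_n)."""
    joint = {w: models[w](x) * priors[w] for w in models}
    total = sum(joint.values())
    return {w: joint[w] / total for w in joint}

post = posterior((0.25, 0.75))
print(max(post, key=post.get))  # prints "a": the minimum-error decision
```

Deciding on the class with the largest posterior is exactly the Bayes-optimal rule described above; the posteriors always sum to one by construction.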
Decision Boundaries

For a two class problem we can imagine a boundary curve between the two probability distributions which passes through all points where P(ωm|x) = 0.5.

[Figure: the (g, h) scatter with densities p(g,h|a), p(g,h|b) and p(g,h|c), and the decision boundary drawn between them.]
Rapid learning of decision boundaries in high dimensions is the basis for LDA, SVM, Decision Trees and Random Forests (with or without Boosting).
Evaluation Methodology (Warts I)
Can we use the vast quantities of published results to undertake scientific analyses?
• Papers often focus on irrelevant aspects of algorithms, and results can’t be reliably replicated (experimenter effects, publication bias [Chatfield]).
• Algorithms are often adjusted using ‘test’ data, which introduces bias.
• Test datasets are often poorly constructed and permit unexpected solutions to the problem [Shamir].
• Gold standard reference datasets generally have their own biases and errors.
• The “representative data” problem precludes use for quantitative counting tasks.
• Algorithms are compared using ROC curves. ROC curves are not really what we need! (See below.)
Scientific Method

[Figure: Theory 1 and Theory 2 compared against the Data.]
The scientific method involves quantitative comparison of theory and data.
Progress is made most reliably via “falsification” (Popper).
The specific values of the measurement errors are important.
Scientific Method

[Figure: two Theory 1 / Data / Theory 2 comparisons.]
The specific values of the errors are important.
Systematic and Statistical Effects

e.g. a size/frequency distribution:

[Figure: frequency vs size histogram compared with Theory; correct for systematic effects.]

We cannot directly compare quantities of classified data to theory, due to mis-identification.
Can we use ROCs to Correct for Systematic Error?

[Figure: ROC curve in the (fp, tn) unit square.]

The decision process is only ‘Bayes Optimal’ if the relative proportions of data classes during training and practical application match [Saerens].

Similarly, the misidentification rate will change.

Q = Q′ − Q′ tn + (N − Q′) fp
Systematic and Statistical Effects

Q′ = (Q − N fp) / (1 − tn − fp)

eg: Q = 42, N = 100, tn = 0.066, fp = 0.20 → Q′ ≈ 30

Even having corrected for systematic errors, we need to know the statistical error. Is it Poisson/Binomial? (See below.)
[Figure: frequency vs size histogram compared with Theory; estimate statistical errors.]
In practice the shapes and positions of sample distributions can also change!
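The correction can be checked numerically. The short sketch below simply inverts the observed-count relation Q = Q′ − Q′ tn + (N − Q′) fp and reproduces the worked example from the slide:

```python
def corrected_quantity(Q, N, tn, fp):
    """Invert Q = Q' - Q'*tn + (N - Q')*fp to recover the true count Q'."""
    return (Q - N * fp) / (1.0 - tn - fp)

# The slide's worked example: 42 observed out of 100, tn = 0.066, fp = 0.20.
Qp = corrected_quantity(Q=42, N=100, tn=0.066, fp=0.20)
print(round(Qp))  # prints 30
```

Note that the correction can move the estimate a long way from the raw count (42 → 30 here), which is why classified quantities cannot be compared to theory directly.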
Algorithm Construction and Theory (Warts II)
• Demographics change the optimal decision boundaries.
• Probability estimates are only correct if the density models are correct [Niculescu-Mizil].
• Data generation processes are poorly understood for arbitrary data.
• Sensitivity to parameter tuning (sometimes crude techniques seem to work better on new data).
[Figure: scatter of classes a and b for a Training sample and a Testing sample; the class distributions shift between the two.]
It’s not just ‘where is the boundary?’, but also ‘how can the boundary move?’.
Summary of the Main Issues
We need to determine the following;
• Which p(X|ωn) do we use for incoming data?
It needs to be based upon knowledge of typical datasets.
• How do we correct for misclassification?
• What are the errors on observed quantities?
We could use Monte-Carlo, but this requires us to construct a simulation.

Monte-Carlos require detailed knowledge of the data, including expected sample quantities Q(ωn).
If we have a good simulation even a ‘cuts based’ analysis is practical.
Without a good simulation we are left without any idea of what a pattern recognition algorithm is doing on a new (arbitrary) data set.
Solution : White Box Analysis!
White Box Correction for Misclassification
This best case performance is called the Bayes Error Rate, which occurs when Q(ωm) is set to match the data sample.
In a sense Q(ωm) must be considered an output of the analysis.
The p(X|ωn)’s constitute our prior knowledge, not Q(ωn) !
Some algorithms assume that the data sample is fixed.
eg: K Nearest Neighbours, discriminant analysis, Support Vector Machines (SVM).
Some algorithms allow us to estimate Q(ωm) eg: Expectation Maximisation.
Q(ωm) = ∑_{i∈R} P(ωm|Xi)

This is the Likelihood estimate; it is therefore the corrected estimate of quantity!
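The quantity estimate is just a sum of posteriors over the region R, so soft classifications contribute fractionally to every class. A minimal sketch with made-up posterior probabilities:

```python
import numpy as np

# Posterior classifications P(w_m|X_i) for 5 data vectors and 3 classes
# (illustrative values; each row sums to one).
P = np.array([
    [0.90, 0.05, 0.05],
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.20, 0.30, 0.50],
    [0.05, 0.05, 0.90],
])

# Q(w_m) = sum over i in R of P(w_m|X_i): the corrected quantity estimate.
Q = P.sum(axis=0)
print(Q)        # prints [1.95 1.4  1.65]: soft counts per class
print(Q.sum())  # prints 5.0: the total number of vectors is conserved
```

Because each row sums to one, the class quantities always add up to the sample size, unlike counts of hard-labelled data passed through a decision boundary.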
Example: Multi-Spectral Segmentation
Each grey-level can be used as one component of the measured data vector X = (x1, x2, x3, x4).
P (X|ω) is based upon physics of the MR imaging process.
Pure and partial volume models (simulation).
[Figure: real data scatter plots, x1 vs x2 and x3 vs x4.]
Multi-Spectral Segmentation
Parameters for the density model are estimated using Expectation Maximisation (EM), which involves iteration of the following steps;
• use an initial estimate of probability classifications P (ωm|X) to generate a new estimate ofthe parameters.
• use the new parameters to re-estimate P (ωm|X).
There is a proof that this process will converge on a local optimum of the Likelihood of the N data vectors Xi.

L = ∏_{i∈R} ∑_m Q(ωm) p(Xi|ωm)
We can assess data conformity (representativeness) using ‘goodness of fit’.
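The two EM steps can be sketched for a one-dimensional, two-class Gaussian mixture. This is a simplified illustration, not the memo's MR model: the density shapes p(x|ω) are held fixed and only the quantities Q(ωm) are re-estimated:

```python
import numpy as np

def em_quantities(x, means, var, n_iter=50):
    """Estimate class quantities Q(w_m) by EM with fixed densities p(x|w_m)."""
    Q = np.full(len(means), len(x) / len(means))  # initial equal split
    for _ in range(n_iter):
        # E-step: P(w_m|x_i) proportional to Q(w_m) p(x_i|w_m).
        # (The shared Gaussian normalisation constant cancels in the ratio.)
        dens = np.exp(-0.5 * (x[:, None] - means) ** 2 / var)
        joint = Q * dens
        post = joint / joint.sum(axis=1, keepdims=True)
        # M-step: Q(w_m) = sum_i P(w_m|x_i), the Likelihood quantity estimate.
        Q = post.sum(axis=0)
    return Q

# Illustrative sample: 300 points from class 1, 100 from class 2.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 100)])
Q = em_quantities(x, means=np.array([0.0, 5.0]), var=1.0)
print(Q)  # soft counts, close to the true 300/100 split
```

Each iteration uses the current Q(ωm) to compute posteriors, then sums those posteriors to update Q(ωm), exactly the two steps listed above.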
Multi-Spectral Segmentation
But: what region (R) should we define? (The one which minimises measurement error?)
(See: Tina memos 2003-007, 2007-005)
White Box Construction of Density Distributions
What happens when we cannot predict the density distributions?
• We can model density distributions non-parametrically (e.g. linear models of histograms).
• These models can be constructed from data samples using a version of ICA.
• We seek the model which approximates the variations in distributions seen across multiple samples.
• We can adjust these models to ‘best’ account for incoming data during analysis.
• These models have estimation errors ‘locked in’ at the time of training.
Example: Crater Recognition
Define a set S of probability densities which describe the appearance of prototypical craters, p(X|ωm) with m ∈ S.

This has an associated mixture density

p(X|crater) = (1/Q(S)) ∑_{m∈S} Q(ωm) p(X|ωm)
Crater Recognition
This task now conforms to our earlier definition of scientific measurement.
What are the errors on measured quantities Q(ω)?
White Box estimation of Statistical Error
Start with Likelihood and compute the Cramer-Rao Bound.
−∂² log(L) / ∂Q(ωm)² = 1 / var(Q(ωm))

L = ∏_{i∈R} ∑_m^M Q(ωm) p(Xi|ωm)

Assume that the p(Xi|ωm) are known (no free parameters).

var(Q(ω)) ≥ Q(ω)² / ∑_i P(ω|Xi)²

As ∑_i P(ω|Xi)² ≤ ∑_i P(ω|Xi), this gives a (best case) Poisson estimate for non-overlapping distributions: var(Q(ω)) = Q(ω).
For a weak classifier we need to consider the off-diagonal terms in the inverse covariance matrix; as P(ω|Xi) → 0.5 the off-diagonal terms become comparable to the diagonal ones and

var(Q(ω)) → ∞

This has implications for the evaluation of pattern recognition systems, as the true accuracy of a classification assessment may be many times worse than that predicted by sampling (Poisson or Binomial) statistics.
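The diagonal part of the bound can be checked numerically: for a sharp classifier it reduces to the Poisson value Q, while posteriors near 0.5 inflate it (and, as noted above, the off-diagonal terms then make the full variance diverge). The posteriors below are illustrative:

```python
import numpy as np

def quantity_variance_bound(post):
    """Cramer-Rao bound var(Q(w)) >= Q(w)^2 / sum_i P(w|X_i)^2."""
    Q = post.sum()
    return Q ** 2 / np.sum(post ** 2)

# A confident classifier: posteriors are all 0 or 1 for 100 data vectors.
sharp = np.array([1.0] * 50 + [0.0] * 50)
# A weak classifier: every posterior is ambiguous.
weak = np.full(100, 0.5)

print(quantity_variance_bound(sharp))  # prints 50.0 -> Poisson: var = Q
print(quantity_variance_bound(weak))   # prints 100.0 -> twice the Poisson value
```

Both samples give the same quantity estimate Q = 50, but the weak classifier's bound is already double the Poisson variance even before the off-diagonal terms are considered.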
White Box Correction of Distribution Assumptions
The observed distribution can be modelled as a linear combination of sub-classes. The inverse covariance for a set of classification quantities is then given by

C_Q^{−1} = ∑_{i∈R} D(Xi)^T D(Xi)

with

D(Xi)^T = [P(ω1|Xi)/Q(ω1), ..., P(ωM|Xi)/Q(ωM)]
We can define a subset of classes m ∈ S as a quantitative measure which relates to the theory, s.t.

Q(S) = ∑_{m∈S} Q(ωm) = S·Q

Then

var(Q(S)) = S^T C_Q S
(See: Tina memo 2010-008, 2011-003)
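The covariance propagation for a subset S can be sketched as follows; the posteriors, and hence the quantities, are illustrative made-up values:

```python
import numpy as np

# Posteriors P(w_m|X_i) for 4 data vectors and 3 sub-classes
# (illustrative values; each row sums to one).
P = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
])
Q = P.sum(axis=0)  # Likelihood quantity estimates Q(w_m)

# Inverse covariance: C_Q^{-1} = sum_i D(X_i)^T D(X_i),
# with D(X_i) = [P(w_1|X_i)/Q(w_1), ..., P(w_M|X_i)/Q(w_M)].
D = P / Q                   # one row per data vector
C = np.linalg.inv(D.T @ D)  # covariance of the quantities

# Subset S = {w_1, w_2}: selection vector, Q(S) = S.Q, var(Q(S)) = S^T C_Q S.
S = np.array([1.0, 1.0, 0.0])
print(S @ Q)      # Q(S)
print(S @ C @ S)  # var(Q(S))
```

The selection vector S simply picks out which sub-class quantities contribute to the theoretical measure, and the quadratic form propagates their full covariance, including the off-diagonal terms.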
Conclusions
Methods which label data using fixed decision boundaries will be sub-optimal.
A reliance on Monte Carlo methods only shifts the problem from design of the pattern recognition system to construction of the simulation.
Automatic construction of a simulation for arbitrary data samples requires solution of the very problem which pattern recognition is supposed to solve.
Many popular pattern recognition techniques and published performance details (e.g. ROC) are unsuitable for direct use in scientific tasks.
For scientific use it is far better to understand the quantitation errors in analysis than to optimise mis-identification rates.
Many of the problems which arise when trying to apply pattern recognition to quantitative tasks are due to ‘Black Box’ methodologies.
Conclusions: White Box vs Black Box
The White Box approach supports;
• Optimal adjustment of distributions to incoming data.
• Goodness of fit measures to test data conformity (representativeness).
• Estimation of statistical errors for incoming data.
• Estimation of systematic errors arising during training.
• Quantitative tests for self-consistency [Haralick].
Remaining problems with methodology are associated with ‘gold standard’ definitions and selection of training/test data.

‘Black Box’ testing is simple and quick, resulting in rapid publication. It can be performed by students with little training.

‘White Box’ analysis requires an understanding of both methods and statistics.
Acknowledgements
MRI segmentation project;
Paul Bromiley.
Satellite images of Mars;
Paul Tar, Jamie Gilmore, Merren Jones.
References
A. Niculescu-Mizil, R. Caruana, Obtaining Calibrated Probabilities from Boosting, Proc. 21st Conf. Uncertainty in Artificial Intelligence, AUAI Press, 2005.

K. Chatfield et al., The Devil is in the Details: An Evaluation of Recent Feature Encoding Methods, BMVC, Dundee, 2011.

M. Saerens et al., Adjusting the Outputs of a Classifier to New a Priori Probabilities may Significantly Improve Classification Accuracy: Evidence from a Multi-class Problem in Remote Sensing, Proc. 18th ICML, pp. 298-305, 2001.

R.M. Haralick, X. Liu, T. Kanungo, On the Use of Error Propagation for Statistical Validation of Computer Vision Software, IEEE Pattern Analysis and Machine Intelligence, 27(10), pp. 1603-1614, 2005.

L. Shamir, Evaluation of Face Datasets as a Tool for Assessing the Performance of Face Recognition Methods, IJCV, 79(3), pp. 225-230, 2008.
All Tina memos available from www.tina-vision.net