
Page 1: THOUGHTS ON MODEL ASSESSMENT

Porco, DAIDD, December 2012

Page 2: Assessment

Some models are so wrong as to be useless.

Not really possible to argue that a model is “true”

Treacherous concept of “validation”: has a model that has been “validated” been shown to be “valid”?

Philosophy of science

Page 3: Face Validity

Knowledge representation

Analogy

Does the model represent, in some approximate sense, facts about the system?

Page 4: Verification

Does your simulation actually simulate what you think it simulates?

How do you know your program is not generating nonsense?

Unit testing (see the sketch below)

Boundary cases

“Test harnesses”

Formal software testing techniques
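As an illustration (not from the original slides), a minimal sketch in R of unit tests with boundary cases, applied to the per-forecast Brier component defined in the script on Page 27:

# Hypothetical unit tests; the function under test is the one from Page 27.
brier.component <- function(forecast, observation) {
  2 * (forecast - observation)^2
}
stopifnot(brier.component(1, 1) == 0)  # boundary: a perfect forecast scores 0
stopifnot(brier.component(0, 1) == 2)  # boundary: a maximally wrong forecast scores 2
stopifnot(isTRUE(all.equal(brier.component(0.5, 0),
                           brier.component(0.5, 1))))  # p = 0.5 scores the same either way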

Page 5: Thought

“…the gold standard of model performance is predictive power on data that wasn’t used to fit your model…”

--Conway and White, Machine Learning for Hackers

Page 6: Prediction

Not everything is predictable

Predict individuals? Communities? Counterfactuals?

Does the model somehow correspond to reality?

Page 7: California TB

Page 8: Probabilistic forecasting

Weather

Forecast expressed as a probability

Correctly quantify uncertainty

Uncertainty is needed to make the best use of a forecast

If you don’t know, don’t say you do

Not bet-hedging, but honesty

Page 9: Two approaches

Assessment of probabilistic forecasts:

Brier score

Information score

Page 10: Assessing simple forecasts

Binary event: 0 for no, 1 for yes

Forecast: p is the probability it occurred

If p = 1, you forecast it with certainty

If p = 0, you forecast that it was certain not to have occurred

If 0 < p < 1, you were somewhere in the middle

Page 11: Brier score
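For a binary outcome o (0 or 1) and forecast probability p (reconstructed from the worked example on the next page):

BS = (p - o)^2 + ((1 - p) - (1 - o))^2 = 2(p - o)^2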

Page 12: Brier score

Squared error of a probabilistic forecast. For an event that occurred (o = 1):

BS = (p-1)^2 + ((1-p)-0)^2 = 2(1-p)^2

Brier score has a negative orientation (like in golf, smaller is better)

Some authors do not use the factor of 2

Page 13: Brier score

Example. Suppose I compute a forecast that trachoma will be eliminated in village A after two years as measured by pooled PCR for DNA in 50 randomly selected children. Suppose I say the elimination probability is 0.8.

If trachoma is in fact eliminated, the Brier score for this is (0.8-1)^2 + (0.2-0)^2 = [you work it out]

Page 14: Brier score

What is the smallest possible Brier score?

What is the largest possible Brier score?

Page 15: Brier score

Suppose we now forecast elimination in five villages. We can compute an overall Brier score by just adding up separate scores for each village.

Elimination Prob.   Eliminated? (1 = yes)   Score
0.8                 1                       0.08
0.1                 0                       0.02
0.5                 1                       0.50
0.4                 1                       0.72
0.8                 0                       1.28
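A quick check of this table in R (a sketch, using the deck’s factor-of-2 convention):

p <- c(0.8, 0.1, 0.5, 0.4, 0.8)  # forecast elimination probabilities
o <- c(1, 0, 1, 1, 0)            # observed outcomes (1 = eliminated)
2 * (p - o)^2                    # per-village scores: 0.08 0.02 0.50 0.72 1.28
sum(2 * (p - o)^2)               # overall Brier score: 2.6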

Page 16: Murphy decomposition

Classical form applies to binary predictions

Predictions are probabilities

Finite set of possible probabilities

Example: the chance of a meteorological event happening might be predicted as 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100%.

Page 17: Reliability (sensu Murphy)

Similar to the Brier score: less is better

Looking at all identical forecasts, how similar is what really happened to what was forecast?

Page 18: Examples

Prediction   Result      Prediction   Result
0.2          0           0.8          1
0.2          1           0.8          0
0.2          0           0.8          0
0.2          0           0.8          0
0.2          0           0.8          1
0.2          1           0.8          1

Page 19: Example

Here, we get a total Brier score of 6.96. We have to add up (0.2-0)^2 + (0.2-1)^2+… (and not forget the factor of two).
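A quick check in R (the same data reappear in the script on Page 28):

pred <- c(rep(0.2, 6), rep(0.8, 6))  # the twelve predictions
obs <- c(0,1,0,0,0,1, 1,0,0,0,1,1)   # the twelve results
sum(2 * (pred - obs)^2)              # total Brier score: 6.96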

Page 20: Some terminology

Forecast levels: the different values possible for the forecast. Notation: f_k for forecast level k.

Observed means at the forecast levels: the average outcome (the fraction of times the event happened), stratified by forecast level. Notation: ō_k (obark in the script later).

We have two forecast levels in the example, 0.2 and 0.8. For forecast level 0.2, we have an observed mean of 2/6. For forecast level 0.8, we have an observed mean of 3/6.
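One way to compute these observed means (a sketch; the Page 27 script does the same with ddply):

pred <- c(rep(0.2, 6), rep(0.8, 6))
obs <- c(0,1,0,0,0,1, 1,0,0,0,1,1)
tapply(obs, pred, mean)  # 0.2 -> 0.3333..., 0.8 -> 0.5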

Page 21: Examples

We had 2/6 that were YES when we gave a 20% chance. Let’s compare the prediction to the successes just for the 20% predictions. For each one we have the same thing: (0.2 - 2/6)^2 + (0.8 - 4/6)^2, which is about 0.0356.

We had 3/6 that were YES when we gave an 80% chance. We do a similar computation and get (0.8 - 3/6)^2 + (0.2 - 3/6)^2 = 0.18.

Page 22: Reliability

So the total reliability score is computed as (formula reconstructed to match the script on Page 27):

REL = Σ_k n_k * 2(f_k - ō_k)^2

This yields a total score of approximately 1.2933 for the reliability component REL.

Prediction   Reliability   Repeats   Total
0.2          0.0356        6         0.2133
0.8          0.18          6         1.08

Page 23: Resolution

When we said different things would happen, how different really were they? We’re going to contrast the observed means at different forecast levels.

Specifically, we want the variance of the distribution of the observed means at different forecast levels.

Page 24: Example

Using the simple example: the overall mean is 5/12. Then compute 2*(2/6 - 5/12)^2 for every observation in the first group, and 2*(3/6 - 5/12)^2 for every observation in the second group.

The total resolution component RES is 6*2*(2/6 - 5/12)^2 + 6*2*(3/6 - 5/12)^2 = 1/6.

Page 25: Uncertainty

In the classical Murphy formula, this is computed by taking the mean observation times one minus the mean (with the factor of two, in this deck’s convention) and adding this up for each observation: UNC = N * 2 * ō(1 - ō).

For our example, the uncertainty is 12 * 2 * (5/12) * (7/12) = 5.8333 or so.

Page 26: Murphy decomposition

Murphy (1973) showed that the Brier score can be decomposed as follows:

BS = REL - RES + UNC

N.B. the negative sign in front of resolution:

High uncertainty contributes to a high Brier score (all else being equal)

High discrepancy of the observed means at each level from the forecasts raises the Brier score

But having those observed means at each level separated from each other lowers the Brier score

Page 27: Script

library(plyr)  # for ddply()

brier.component <- function(forecast, observation) {
  # Per-forecast Brier score component (factor-of-2 convention)
  2 * (forecast - observation)^2
}

reliability <- function(fk, ok) {
  # Per-forecast reliability term: forecast level f_k vs. observed mean obar_k
  2 * (fk - ok)^2
}

resolution <- function(obar, ok) {
  # Per-forecast resolution term: observed mean obar_k vs. overall mean obar
  2 * (ok - obar)^2
}

gen.obark <- function(predictions, outcome, key) {
  # Observed mean outcome at each forecast level, one value per observation,
  # returned in the original observation order
  tmp <- data.frame(key = key, outcome = outcome, pred = predictions)
  f1 <- merge(tmp, ddply(tmp, .(pred), function(x) { mean(x$outcome) }), by = "pred")
  f1 <- f1[order(f1$key), ]
  f1$V1
}

Page 28: Example

ds <- c(0,1,0,0,0,1,1,0,0,0,1,1)        # outcomes from the Page 18 example
fs <- c(rep(0.2,6), rep(0.8,6))         # forecast levels
obark <- gen.obark(fs, ds, 1:12)        # observed mean at each forecast level
rel <- sum(reliability(fs, obark))      # REL = 1.2933...
res <- sum(resolution(mean(ds), obark)) # RES = 1/6
unc <- 12*2*(mean(ds)*(1-mean(ds)))     # UNC = 5.8333...
brs <- sum(brier.component(fs, ds))     # BS = 6.96
(rel - res + unc) - brs                 # zero, up to floating-point error
[1] -1.776357e-15

Page 29: General form

In this deck’s unnormalized, factor-of-2 convention (formula reconstructed to match the script on Page 27):

BS = Σ_{k=1..K} 2 n_k (f_k - ō_k)^2 - Σ_{k=1..K} 2 n_k (ō_k - ō)^2 + 2 N ō(1 - ō)

BS: Brier score; K, the number of forecast levels; n_k, the number of forecasts at level k; N, the total number of forecasts; ō, the overall observed mean. Other quantities as defined earlier.

Page 30: More general forms

More general decompositions (Stephenson, five terms)

Continuous forms

Page 31: Alternatives to Brier score

Decomposition into Reliability, Resolution, Uncertainty works for information measures (Weijs et al)

Page 32: Information measures

Surprise

Expected surprise

Surprising a model

Page 33: Surprisal

How much do you learn if two equally likely alternatives are disclosed to you?

Heads vs tails; 1 vs 2

Page 34: Surprisal

How much do you learn if 3 are disclosed? A, B, or C

How much if I disclose one of 26 equally likely outcomes?

You should have learned more

Page 35: Surprisal

Standard example: now combine two independent things. If we had 1 vs 2, and A, B, or C:

A/1, C/2, B/1, B/2, …

Page 36: Surprisal

Six equally likely things

s(2 x 3) = s(2) + s(3) (checked numerically below)

Uniquely useful way to do this: define surprisal as log(1/p), the log of the reciprocal probability

Stevens information tutorial online
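A numeric check of the additivity:

log2(6)            # surprisal of one of six equally likely outcomes: about 2.585
log2(2) + log2(3)  # s(2) + s(3): the same 2.585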

Page 37: Surprisal

Different bases can be used for the logarithm

Log base 2 is traditional

How much information is transmitted if you learn which of two equally likely outcomes happened? log2(1/(1/2)) = 1

If we use base two, then the value is 1, referred to as 1 bit

Disclosure of one of two equally likely outcomes reveals one bit of information

Notation: log base 2 is often written lg

Page 38: Example

Sequence of values of flips of a bent coin with P(heads)=1/3, P(tails)=2/3

Observed   Probability   1/Probability   Surprisal
0          2/3           1.5             0.585
0          2/3           1.5             0.585
1          1/3           3               1.585
0          2/3           1.5             0.585
1          1/3           3               1.585
1          1/3           3               1.585

The average surprisal was 1.085 or so.
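The same computation in R (a minimal sketch, coding heads as 1):

flips <- c(0, 0, 1, 0, 1, 1)       # the observed sequence (1 = heads)
p <- ifelse(flips == 1, 1/3, 2/3)  # probability of each observed value
log2(1/p)                          # surprisals: 0.585 or 1.585 bits
mean(log2(1/p))                    # average surprisal: about 1.085 bits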

Page 39: Expected surprisal

Every time an event with probability p_i happens, the surprisal is lg(1/p_i).

What is the expected surprisal?

Page 40: Expected value
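E[X] = Σ_i p_i x_i (the standard definition; the slide’s formula is reconstructed here)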

Page 41: Expected square
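E[X^2] = Σ_i p_i x_i^2 (likewise reconstructed)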

Page 42: Expected surprisal
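E[surprisal] = Σ_i p_i lg(1/p_i) = -Σ_i p_i lg(p_i) (reconstructed)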

Shannon entropy

Page 43: Shannon entropy

Shannon entropy can be used to quantify the amount of information provided by a model for a categorical outcome, in much the same way that the squared multiple correlation coefficient can quantify the amount of variance explained by a continuous model

Use of entropy and related measures to assess probabilistic predictions is sometimes recommended (Weijs)

Page 44: Estimated entropy

If you have a Bernoulli trial with success probability p, what is the entropy?

H = -(1 - p) lg(1 - p) - p lg(p)

If we estimate p from data and plug it in, we get an estimator of the entropy.

Example: if we observe 8 successes in 20 trials, the observed frequency is 0.4, and the estimated entropy would be about 0.971.
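The same computation in R:

entropy <- function(p) -(1 - p)*log2(1 - p) - p*log2(p)  # Bernoulli entropy in bits
entropy(8/20)  # about 0.971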

Page 45: What does a model do?

If we now have a statistical model that provides a better prediction, we can compute the conditional entropy.

Without the model, our best estimate of the chance of something is just the observed relative frequency.

But with the model, maybe we have a better estimate.

Page 46: Example

Logistic regression: use data from the Mycotic Ulcer Treatment Trial (Prajna et al 2012, Lietman group).

Using just the chance of having a transplant or perforation, the estimated entropy is 0.637. About 16% of patients had such an event.

Page 47: Example

Now, let’s use a covariate, such as what treatment a person got.

Now, we have a better (we hope) estimate of the chance of a bad outcome for each person. These are 11% for treatment 1 and 21% for treatment 2.

Page 48: Example

For group 1, the chance of a bad outcome was 11%. The entropy given membership in this group is about 0.503.

For group 2, the chance of a bad outcome was 21% or so. The entropy given membership in this group is about 0.744.

The weighted average of these is the conditional entropy given the model. This gives us just 0.623.

Page 49: Example

So the entropy was 0.637, and now it’s 0.623. The treatment model does not predict very much.

If we look at the proportional reduction, we get 2.2%; this model explained 2.2% of the uncertainty.

This is exactly the McFadden R^2 you get in logistic regression output.
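A rough check with the rounded figures quoted on these slides (the slides’ own numbers presumably come from the unrounded trial proportions):

entropy <- function(p) -(1 - p)*log2(1 - p) - p*log2(p)
h.marginal <- entropy(0.16)                             # about 0.63 (slides: 0.637)
h.conditional <- mean(c(entropy(0.11), entropy(0.21)))  # about 0.62 (slides: 0.623), assuming equal-sized arms
1 - h.conditional / h.marginal                          # proportional reduction: about 2%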

Page 50: Summary

Probabilistic forecasts can be assessed using the Brier score.

The Brier score can be decomposed into reliability, resolution, and uncertainty components.

Information theoretic measures can also be used in assessment.

Information theoretic measures are useful in other biostatistical applications as well.