An analysis of different bias-correction algorithms in a synthetic environment

Joo-Hyung Son (1), Zoltan Toth (2), and Dingchen Hou (3)

1) Numerical Weather Prediction Division, KMA
2) Environmental Modeling Center, NCEP/NWS/NOAA
3) EMC/NCEP/NWS/NOAA and SAIC

OUTLINE

• Introduction
• Generation of a Synthetic Data Set
• Effects of Sample Size on the Bias Estimation
• Bias Estimation Based on Bayesian Approach
• Effect of Bias Correction on Probabilistic Forecast
• Summary

Introduction

Background

• NWP products are subject to systematic and random errors.
• Estimating bias from historical data and then subtracting it from the forecast provides an effective way of reducing systematic errors.

Existing Questions

• How should the bias be estimated? Various bias-correction methods exist, e.g. the equal weight method and Kalman Filter type algorithms (Cui et al., 2005).
• What length of historical data set is required for reasonably accurate bias estimation? No systematic investigation exists.

This Study – A Simplified Approach

• Single forecast of a single variable at a single grid point.
• Simulated forecast (synthetic data), with no dynamic evolution.
• Simulated forecasts of various skill (lead time) and bias levels.
• Simulation can be extended to represent more realistic forecasts.

Generation of synthetic data - analysis

• Assumptions
– Remove annual cycle
– Standardized
– Stationary process

$$x_i = \frac{a_i - c_m}{c_s}$$

– $a_i$: daily climate data
– $c_m$: climate mean
– $c_s$: climate standard deviation

• Estimate parameters based on
– 40 years of climate data at 37.5N, 117.5W
– 2m temperature

• Analysis
– General ARMA(p,q) model

$$x_t = \sum_{i=1}^{p} \phi_i x_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} + \varepsilon_t, \qquad \varepsilon_t \sim WN(0, \sigma^2)$$

– $p$: order of autoregressive part
– $q$: order of moving average part
– $\varepsilon_t$: white noise
– $\phi_i$: autocorrelation parameter
– $\theta_j$: moving average parameter

[Figure: Autocorrelation of the analysis series; x-axis: 0 to 300; y-axis: -0.2 to 1.2; fitted with p = 20, q = 1]

Generation of synthetic data - analysis

[Figure: Time series of analysis; x-axis: 0 to 730; y-axis: -3 to 4; climate generated by ARMA(20,1)]
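The ARMA-based generation of the analysis series can be sketched as follows. This is a minimal illustration, not the fitted ARMA(20,1) model of the slides: the function names `simulate_arma` and `autocorrelation`, the coefficient values (`phi=[0.8]`, `theta=[0.2]`), the burn-in length, and the unit-variance Gaussian white noise are all assumptions chosen for the example.

```python
import random
import statistics


def simulate_arma(phi, theta, n, seed=0, burn_in=500):
    """Simulate x_t = sum_i phi[i]*x_{t-1-i} + sum_j theta[j]*eps_{t-1-j} + eps_t.

    The output is standardized to zero mean and unit standard deviation,
    mirroring the "Standardized" assumption. Coefficients are hypothetical.
    """
    rng = random.Random(seed)
    p, q = len(phi), len(theta)
    past_x = [0.0] * p      # most recent value first
    past_eps = [0.0] * q    # most recent noise first
    series = []
    for t in range(n + burn_in):
        eps = rng.gauss(0.0, 1.0)  # Gaussian white noise (assumed)
        value = eps
        value += sum(c * v for c, v in zip(phi, past_x))
        value += sum(c * v for c, v in zip(theta, past_eps))
        past_x = ([value] + past_x)[:p]
        past_eps = ([eps] + past_eps)[:q]
        if t >= burn_in:           # discard the start-up transient
            series.append(value)
    mean = statistics.fmean(series)
    sd = statistics.pstdev(series)
    return [(v - mean) / sd for v in series]


def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    mean = statistics.fmean(series)
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(len(series) - lag))
    return cov / var


# Illustrative ARMA(1,1) stand-in for the analysis series.
analysis = simulate_arma(phi=[0.8], theta=[0.2], n=5000)
lag1 = autocorrelation(analysis, 1)
```

Passing twenty autoregressive coefficients and one moving-average coefficient to the same function would give an ARMA(20,1) series; the fitted coefficients themselves are not given in the slides.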

Generation of synthetic data - forecast

Requirements:
• The time series of analysis and forecast are similar stationary stochastic processes.
• Forecast is correlated with analysis, with a correlation coefficient $\rho$ reflecting the skill of the forecast: $\rho = 1$ for perfect correlation and $\rho = 0$ for a non-correlated forecast (simulating lead times of 1 to 16 days).
• Forecast is subject to random error (independent of analysis) with variance $\sigma^2$ ($\sigma = 1$: no skill; $\sigma = 0$: no noise).
• Forecast is statistically the same as analysis (N(0,1)). This is satisfied by setting $\sigma = \sqrt{1 - \rho^2}$.
• A constant (time-independent) bias is added to the forecast.

Model:

$$f = \rho a + \sigma f_e + b$$

– $a$: analysis generated by ARMA model, N(0,1)
– $f$: forecast, N(0,1)
– $f_e$: forecast error, N(0,1)
– $b$: bias, constant
– $\rho$: correlation between forecast and analysis
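The forecast model above can be sketched in a few lines. This is an illustration under stated assumptions: the names `synthetic_forecast` and `correlation` are hypothetical, the values `rho=0.9` and `bias=0.5` are arbitrary, and the analysis is drawn as independent N(0,1) white noise as a stand-in for the ARMA-generated series.

```python
import math
import random
import statistics


def synthetic_forecast(analysis, rho, bias, seed=1):
    """Return f = rho*a + sqrt(1 - rho^2)*f_e + b for each analysis value a."""
    rng = random.Random(seed)
    sigma = math.sqrt(1.0 - rho ** 2)  # keeps the forecast variance at 1
    return [rho * a + sigma * rng.gauss(0.0, 1.0) + bias for a in analysis]


def correlation(xs, ys):
    """Pearson correlation between two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# White-noise stand-in for the analysis (assumption; the slides use ARMA output).
rng = random.Random(0)
analysis = [rng.gauss(0.0, 1.0) for _ in range(20000)]
forecast = synthetic_forecast(analysis, rho=0.9, bias=0.5)

sample_rho = correlation(analysis, forecast)   # should be close to 0.9
sample_bias = statistics.fmean(forecast) - statistics.fmean(analysis)
```

Because the noise amplitude is tied to the correlation, the forecast keeps unit variance at every skill level, and only its mean is shifted by the bias.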

Generation of synthetic data - forecast

[Figure: Time series of synthetic data (no bias); x-axis: 0 to 100; y-axis: -3 to 4; curves: analysis, corr=0.1, corr=0.9]

[Figure: Time series of real forecast and analysis; x-axis: 0 to 100; y-axis: -3 to 4; curves: analysis, day 1, day 10]

Testing Synthetic forecast model against real forecast data

Comparison between Real data & Synthetic data

Purple line:

• "Prediction" of how the forecast would look.
• Normal forecast distribution $N(\hat{\rho}\,\bar{a}, \sigma^2)$, centered on the estimated correlation times the mean analysis.
• $\hat{\rho}$: correlation estimated based on the whole observation period
• $\bar{a}$: mean of all analysis values falling between 3 and 4
• $\sigma$: standard deviation of forecast when the corresponding analysis is between 3 and 4

Histogram:

• Forecast after removing bias

[Figure: Histograms of the forecast compared with the predicted normal distribution; panel titles: day 3, 10 day, day 10, day 16; x-axis: -20 to 20; y-axis: 0 to 0.5]

Chi-square Test

Lead time    chi-square    p-value
day 1           1.78427    0.97081
day 2           9.57123    0.21420
day 3           3.73155    0.81013
day 4          39.51318    0.00000
day 5        1137.78357    0.00000
day 6          37.41785    0.00000
day 7          26.03835    0.00050
day 8           7.17229    0.41117
day 9          17.95259    0.01219
day 10          8.96989    0.25483
day 11         26.03835    0.00050
day 12          4.92888    0.66864
day 13          8.96989    0.25483
day 14          7.32356    0.39599
day 15          7.32356    0.39599
day 16          3.73155    0.81013
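A chi-square goodness-of-fit check of this kind can be sketched as below. The slides do not state the binning or the degrees of freedom; the tabulated statistic and p-value pairs appear consistent with 7 degrees of freedom, which is an inference used here only for illustration. The function names are hypothetical, and the tail probability uses the standard closed-form series for odd degrees of freedom.

```python
import math
from statistics import NormalDist


def chi_square_statistic(observed, expected):
    """Pearson statistic: sum over bins of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf_odd(x, df):
    """Upper-tail probability of a chi-square variable with odd df.

    Uses Q(x | df) = 2*Q(chi) + 2*Z(chi) * sum_r chi^(2r-1) / (1*3*...*(2r-1)),
    with chi = sqrt(x), Q the standard normal upper tail and Z its density.
    """
    if df % 2 != 1:
        raise ValueError("this sketch only handles odd degrees of freedom")
    chi = math.sqrt(x)
    std_normal = NormalDist()
    tail = 2.0 * (1.0 - std_normal.cdf(chi))
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)   # next term: chi^(2r+1) / (1*3*...*(2r+1))
    return tail + 2.0 * std_normal.pdf(chi) * total


# Hypothetical bin counts, just to show the statistic being formed.
stat = chi_square_statistic([10, 20], [15, 15])

# Tail probabilities for two tabulated statistics, assuming df = 7.
p_day1 = chi2_sf_odd(1.78427, 7)
p_day2 = chi2_sf_odd(9.57123, 7)
```

A small p-value (as on days 4 to 7) indicates that the observed forecast histogram departs from the predicted normal distribution by more than sampling noise would explain.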

Testing Synthetic forecast model against real forecast data

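Bias-correction algorithms

• Traditional method (method 1)
– Bias ~ weighted average of $f - a$
– Bias Estimation
• Equal weight

$$\hat{b}_n = \frac{n-1}{n}\,\hat{b}_{n-1} + \frac{1}{n}\,(f_n - a_n)$$

• Kalman Filter

$$\hat{b}_n = (1 - \alpha)\,\hat{b}_{n-1} + \alpha\,(f_n - a_n)$$

$\alpha$: Kalman Filter weight

– Bias Correction

$$f^{c}_{n+1} = f_{n+1} - \hat{b}_n$$

where $f^{c}_{n+1}$ is the bias-corrected forecast.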
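The two recursive estimators and the correction step can be sketched as follows. The function names, the starting estimate of zero, and the three forecast/analysis pairs are assumptions for illustration; `alpha=0.02` is the weight quoted on the Kalman Filter plot.

```python
def equal_weight_update(prev_bias, n, forecast, analysis):
    """Running mean of f - a after the n-th pair (n starts at 1)."""
    return (n - 1) / n * prev_bias + (forecast - analysis) / n


def kalman_update(prev_bias, forecast, analysis, alpha=0.02):
    """Exponentially weighted update with Kalman Filter weight alpha."""
    return (1.0 - alpha) * prev_bias + alpha * (forecast - analysis)


def correct(next_forecast, bias_estimate):
    """Subtract the latest bias estimate from the next forecast."""
    return next_forecast - bias_estimate


# Hypothetical forecast/analysis pairs; f - a is 0.5, 0.3, 0.7.
forecasts = [1.2, 0.4, 0.9]
analyses = [0.7, 0.1, 0.2]

b_equal = 0.0   # starting estimate (assumed)
b_kalman = 0.0  # starting estimate (assumed)
for n, (f, a) in enumerate(zip(forecasts, analyses), start=1):
    b_equal = equal_weight_update(b_equal, n, f, a)
    b_kalman = kalman_update(b_kalman, f, a)

corrected = correct(1.0, b_equal)
```

The equal-weight recursion reproduces the plain average of all past differences, while the Kalman Filter form forgets old samples at a rate set by the weight, so with a small weight it responds slowly when started far from the true bias.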

Absolute bias error of Method 1

[Figure: Bias error for a single case (equal weight); x-axis: 0 to 1000; y-axis: -1 to 3; curves: corr 0.0, corr 0.3]

[Figure: Kalman Filter method (alpha = 0.02), absolute bias error for 100 cases; x-axis: n (0 to 500); y-axis: absolute bias error (0 to 1); curves: m1 0.0, m1 0.3, m1 0.6, m1 0.95]

[Figure: Equal weight absolute bias error for 100 cases; x-axis: 0 to 5000; y-axis: 0 to 0.5; curves: corr 0.00, 0.15, 0.30, 0.45, 0.60, 0.75, 0.90]

Red points: the point at which the equal-weight bias error corresponds to the average of the Kalman Filter bias error from 1001 to 10000, for each correlation (~120).

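Bias-correction algorithms

• Traditional method (method 1)

Given the forecast model, with $\rho$ the forecast-analysis correlation, $\sigma$ the random-error amplitude, $f_e$ the N(0,1) forecast error, and $b$ the bias,

$$E(f - a) = E(\rho a + \sigma f_e + b - a) = E((\rho - 1)\,a + \sigma f_e + b)$$

For a particular $a$,

$$E(f - a) = (\rho - 1)\,a + b,$$

which differs from $b$ unless $\rho = 1$ or $a = 0$.

For a longer time series that samples the whole distribution of $a$, $E(f - a) \to b$, i.e.

$$E_a[E_e(f - a)] = b$$

where $E_e$ and $E_a$ denote expectations over the forecast error and over the analysis.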
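The dependence of the method 1 predictor on the analysis value can be checked numerically. The sketch below assumes independent N(0,1) draws for the analysis (not the ARMA series of the slides) and arbitrary values `rho = 0.6` and `bias = 0.5`; the bin 0.9 to 1.1 is also just an example.

```python
import math
import random
import statistics

rho, bias = 0.6, 0.5                 # illustrative values (assumed)
sigma = math.sqrt(1.0 - rho ** 2)
rng = random.Random(42)

# Draw (analysis, forecast) pairs from f = rho*a + sigma*f_e + b.
pairs = []
for _ in range(100000):
    a = rng.gauss(0.0, 1.0)
    f = rho * a + sigma * rng.gauss(0.0, 1.0) + bias
    pairs.append((a, f))

# Averaged over the whole distribution of a, f - a recovers the bias.
overall = statistics.fmean([f - a for a, f in pairs])

# Restricted to analyses near a = 1, the mean of f - a is (rho - 1)*a + b.
subset = [(a, f) for a, f in pairs if 0.9 <= a <= 1.1]
conditional = statistics.fmean([f - a for a, f in subset])
a_bar = statistics.fmean([a for a, _ in subset])
predicted = (rho - 1.0) * a_bar + bias
```

With these values the conditional mean lands near 0.1 rather than 0.5, which is why a short record that samples only part of the analysis distribution gives a poor method 1 estimate.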

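• New method (method 2)
– Based on Bayesian Approach
– Bias ~ weighted average of $f - \rho a$, where $\rho$ is the correlation between forecast and analysis

Note that $E(f - \rho a) = b$ without sampling the whole distribution of $a$, so a shorter time series can be used.

– Bias Estimation
• Equal weight

$$\hat{b}_n = \frac{n-1}{n}\,\hat{b}_{n-1} + \frac{1}{n}\,(f_n - \rho a_n)$$

• Kalman Filter

$$\hat{b}_n = (1 - \alpha)\,\hat{b}_{n-1} + \alpha\,(f_n - \rho a_n)$$

$\alpha$: Kalman Filter weight

– Bias correction

$$f^{c}_{n+1} = f_{n+1} - \hat{b}_n$$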
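A small Monte Carlo comparison of the two equal-weight estimators is sketched below. It assumes the correlation is known exactly, draws the analysis as independent N(0,1) values rather than from the ARMA model, and uses arbitrary settings (`rho=0.3`, `bias=0.5`, 100 samples, 1000 trials); the function name is hypothetical.

```python
import math
import random
import statistics


def mean_abs_bias_error(rho, bias, n, trials, use_method_2, seed=0):
    """Average |b_hat_n - b| over many independent trials (equal weight)."""
    rng = random.Random(seed)
    sigma = math.sqrt(1.0 - rho ** 2)
    errors = []
    for _ in range(trials):
        estimate = 0.0
        for k in range(1, n + 1):
            a = rng.gauss(0.0, 1.0)                       # white-noise analysis (assumed)
            f = rho * a + sigma * rng.gauss(0.0, 1.0) + bias
            # Method 1 averages f - a; method 2 averages f - rho*a.
            sample = f - rho * a if use_method_2 else f - a
            estimate = (k - 1) / k * estimate + sample / k
        errors.append(abs(estimate - bias))
    return statistics.fmean(errors)


# Same seed for both calls, so both methods see identical random draws.
m1 = mean_abs_bias_error(0.3, 0.5, n=100, trials=1000, use_method_2=False)
m2 = mean_abs_bias_error(0.3, 0.5, n=100, trials=1000, use_method_2=True)
```

In this simplified setting method 2 gives the smaller average error for the same record length, because its predictor no longer carries the $(\rho - 1)a$ term.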

Absolute bias error of Method 2

[Figure: Bias error for a single case (equal weight); x-axis: 0 to 1000; y-axis: -0.6 to 1.2; curves: corr 0.0, corr 0.3]

[Figure: Kalman Filter absolute bias error for 100 cases; x-axis: n (0 to 500); y-axis: absolute bias error (0 to 1); curves: m3 0.0, m3 0.3, m3 0.6, m3 0.95]

[Figure: Equal weight absolute bias error for 100 cases; x-axis: 0 to 5000; y-axis: 0 to 0.5; curves: corr 0.00, 0.15, 0.30, 0.45, 0.60, 0.75, 0.90]

Red points: the point at which the equal-weight bias error corresponds to the average of the Kalman Filter bias error from 1001 to 10000, for each correlation (~90).

Comparison of Methods 1 & 2

[Figure: Absolute bias error vs. n (0 to 500) for both methods; y-axis: 0 to 1; curves: m1 0.0, m1 0.3, m1 0.6, m1 0.95, m3 0.0, m3 0.3, m3 0.6, m3 0.95]

[Figure: Equal weight method: sample size required for the error to be less than a specific percentage of real bias; x-axis: correlation (0.00 to 1.00); y-axis: sample space (0 to 3000); curves: M1_5%, M1_10%, M2_5%, M2_10%; annotations: m1, m2]
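A rough back-of-the-envelope view of why the two methods can need different sample sizes, assuming (unlike the ARMA analysis used in the slides) that successive $a$ and $f_e$ are independent N(0,1) draws and that $\rho$ is known. Under those assumptions the per-sample variances of the two bias predictors are

$$\mathrm{Var}(f - a) = (1-\rho)^2 + (1-\rho^2) = 2(1-\rho), \qquad \mathrm{Var}(f - \rho a) = 1 - \rho^2,$$

so for an equal-weight average to reach a target error variance $\epsilon^2$ (a symbol introduced here for illustration), the required sample sizes $n_1$ and $n_2$ for methods 1 and 2 would be roughly

$$n_1 \approx \frac{2(1-\rho)}{\epsilon^2}, \qquad n_2 \approx \frac{1-\rho^2}{\epsilon^2}, \qquad \frac{n_2}{n_1} = \frac{1+\rho}{2}.$$

Autocorrelation in the analysis series would change the method 1 figure, so these expressions are only indicative.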

[Figure: BIAS (Kalman Filter, method 1); x-axis: correlation (0.00 to 0.95) with corresponding lead time (day): 16, 13, 11, 10, 8, 6, 5, 4, 3, 2, 1; y-axis: bias (0 to 0.25)]

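Effects on ensemble-based probabilistic forecast: Continuous Ranked Probability Score (CRPS) test

• Assumption
– Uncertainty is perfectly known (no bias in 2nd moment)

• Forecast
– Bias increases with lead time (decreases with correlation)
– Modified bias: [equation lost in extraction]
– Bias is standardized by climate standard deviation

• CRPS

$$c_i = \sum_j \{\alpha_j p_j^2 + \beta_j (1 - p_j)^2\}$$

where $p_j$ is the forecast cumulative probability on interval $j$, and $\alpha_j$, $\beta_j$ are the lengths of the parts of that interval below and above the analysis $a_i$.

[Figure: Forecast CDF and analysis $a_i$, with the $\alpha_j p_j^2$ and $\beta_j (1 - p_j)^2$ contributions to $c_i$ marked]

• Ensemble distribution = forecast uncertainty
– PDF of forecast

$$\phi(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \qquad \mu = f_i, \quad \sigma = \sqrt{1 - \rho^2}$$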
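The effect of bias removal on the CRPS can be illustrated with the well-known closed form of the CRPS for a Gaussian forecast, used here in place of the bin-by-bin sum on the slide. Everything specific in this sketch is an assumption: the function names, the values `rho=0.9` and `bias=0.5`, a white-noise analysis, and a perfectly known bias for the corrected case.

```python
import math
import random
import statistics
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def crps_gaussian(mu, sigma, obs):
    """Closed-form CRPS of the forecast N(mu, sigma^2) against obs."""
    z = (obs - mu) / sigma
    return sigma * (z * (2.0 * _STD_NORMAL.cdf(z) - 1.0)
                    + 2.0 * _STD_NORMAL.pdf(z)
                    - 1.0 / math.sqrt(math.pi))


def mean_crps(rho, bias, n=20000, remove_bias=False, seed=0):
    """Mean CRPS of forecasts N(f_i, 1 - rho^2) against the analysis a_i."""
    rng = random.Random(seed)
    sigma = math.sqrt(1.0 - rho ** 2)
    scores = []
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)                           # white-noise analysis (assumed)
        f = rho * a + sigma * rng.gauss(0.0, 1.0) + bias
        center = f - bias if remove_bias else f           # true bias assumed known
        scores.append(crps_gaussian(center, sigma, a))
    return statistics.fmean(scores)


raw = mean_crps(rho=0.9, bias=0.5)
corrected = mean_crps(rho=0.9, bias=0.5, remove_bias=True)
```

Lower CRPS is better, so the corrected forecast scoring below the raw one is the behaviour the slide's test is designed to measure; with an estimated rather than known bias the gain would be smaller.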


[Figure: CRPS (equal weight, method 1); x-axis: correlation (0.00 to 0.95) with corresponding lead time (day): 16, 13, 11, 10, 8, 6, 5, 4, 3, 2, 1; y-axis: CRPS (0.2 to 0.9); curve labels: Raw fcst, 100 warming period, 5000 warming period]

[Figure: CRPS (Kalman Filter, method 1); x-axis: correlation (0.00 to 0.95) with corresponding lead time (day): 16, 13, 11, 10, 8, 6, 5, 4, 3, 2, 1; y-axis: CRPS (0.2 to 0.9)]

For synthetic forecast with error levels similar to that in real forecast

[Figure: CRPS (Kalman Filter); x-axis: correlation (0.00 to 0.95) with corresponding lead time (day): 16, 13, 11, 10, 8, 6, 5, 4, 3, 2, 1; y-axis: CRPS (0.2 to 1)]

For synthetic forecast with error levels larger than that in real forecast

Summary

• Working with synthetic analysis/forecast data sets is useful for investigating the performance of various statistical bias correction methods (quick assessment/comparison).

• A Bayesian-type bias estimation method may have additional benefits in terms of bias error.

• Bias error is independent of the bias level, but the probabilistic forecast error can be reduced more as the bias becomes larger.

• Realistic ensemble forecasts and more complex bias estimation algorithms need to be considered (comparing frequentist and Bayesian approaches).
