Benchmark database
inhomogeneous data, surrogate data and
synthetic data
Victor Venema
Meteorological Institute Bonn
Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain
Content
Introduction to benchmark dataset
Some results
Some questions about exercise
Question about future work
Analyse and publish the results
Benchmark dataset
1) Real (inhomogeneous) climate records
– Most realistic case
– Investigate if various homogenisation algorithms (HA) find the same breaks
2) Synthetic data
– For example, Gaussian white noise
– Insert known inhomogeneities
– Test performance
3) Surrogate data
– Empirical distribution and correlations
– Insert known inhomogeneities
– Compare to synthetic data: test of assumptions
Creation benchmark – Outline talk
1) Start with homogeneous data
2) Multiple surrogate and synthetic realisations
3) Mask surrogate records
4) Add global trend
5) Insert inhomogeneities in station time series
6) Published on the web
7) Homogenize by COST participants and third parties
8) Analyse the results and publish
1) Start with homogeneous data
Monthly mean temperature and precipitation
Later also daily data (WG4), maybe other variables (pressure, wind)
Homogeneous, no missing data
Longer surrogates are based on multiple copies
Generated networks are 100 a
2) Multiple surrogate realisations
Multiple surrogate realisations
– Temporal correlations
– Station cross-correlations
– Empirical distribution function
Annual cycle removed before, added at the end
Number of stations: 5, 9 or 15
Cross correlation varies as much as possible
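The slides do not name the surrogate algorithm. As an illustration only, the sketch below assumes the iterative amplitude adjusted Fourier transform (IAAFT), which keeps the empirical distribution exactly and the temporal correlations approximately. It handles a single station, so station cross-correlations are not reproduced here. The function names and the number of iterations are hypothetical.

```python
import numpy as np

def remove_annual_cycle(monthly):
    """Split a monthly series (whole years) into anomalies and a 12-value mean cycle."""
    monthly = np.asarray(monthly, dtype=float)
    cycle = monthly.reshape(-1, 12).mean(axis=0)
    anomalies = monthly - np.tile(cycle, monthly.size // 12)
    return anomalies, cycle

def iaaft_surrogate(x, rng, n_iter=100):
    """Single-series IAAFT sketch: same values as x, similar power spectrum."""
    x = np.asarray(x, dtype=float)
    sorted_x = np.sort(x)                 # target empirical distribution
    target_amp = np.abs(np.fft.rfft(x))   # target Fourier amplitudes
    s = rng.permutation(x)                # start from a random shuffle
    for _ in range(n_iter):
        # Spectral step: keep the current phases, impose the target amplitudes.
        phases = np.angle(np.fft.rfft(s))
        s = np.fft.irfft(target_amp * np.exp(1j * phases), n=len(x))
        # Distribution step: map the ranks back onto the original values.
        ranks = np.argsort(np.argsort(s))
        s = sorted_x[ranks]
    return s

# Usage: remove the annual cycle, make the surrogate, add the cycle back.
rng = np.random.default_rng(0)
months = np.arange(240)
observed = 10.0 * np.sin(2 * np.pi * months / 12) + rng.normal(size=240)
anomalies, cycle = remove_annual_cycle(observed)
surrogate = iaaft_surrogate(anomalies, rng) + np.tile(cycle, 20)
```

Because the last step of each iteration is the rank remapping, the surrogate anomalies are a reordering of the original anomalies; the spectrum is matched only approximately.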
5) Insert inhomogeneities in stations
Independent breaks
– Determined at random for every station and time
– 5 breaks per 100 a
– Monthly slightly different perturbations
Temperature
– Additive
– Size: Gaussian distribution, σ=0.8°C
Rain
– Multiplicative
– Size: Gaussian distribution, <x>=1, σ=10%
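A minimal sketch of the independent breaks, under stated assumptions: every month has the same break probability (giving 5 breaks per 100 a on average), each perturbation applies from the break to the end of the series, and the slightly different monthly perturbations are not modelled. The function name and arguments are hypothetical.

```python
import numpy as np

def insert_breaks(series, kind, rng, breaks_per_century=5.0):
    """Insert random breaks in a monthly series; kind is 'temperature' or 'rain'."""
    out = np.array(series, dtype=float)
    # Assumption: equal break probability per month, 1200 months per 100 a.
    p = breaks_per_century / 1200.0
    positions = np.flatnonzero(rng.random(out.size) < p)
    for pos in positions:
        if kind == "temperature":
            out[pos:] += rng.normal(0.0, 0.8)    # additive, sigma = 0.8 degrees C
        else:
            out[pos:] *= rng.normal(1.0, 0.10)   # multiplicative, mean 1, sigma = 10 %
    return out, positions

rng = np.random.default_rng(1)
temperature, temp_breaks = insert_breaks(np.zeros(1200), "temperature", rng)
rain, rain_breaks = insert_breaks(np.ones(1200), "rain", rng)
```

Applying the perturbation from the break onwards is only one possible convention; which end of the series stays uncorrected is a separate choice.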
[Figure: Example break perturbations station (time axis 1900 to 2000)]
[Figure: Example break perturbations network (axis: Year; quantity shown: Temperature perturbations)]
5) Insert inhomogeneities in stations
Correlated break in network
– One break in 10 % of networks
– In 30 % of the stations simultaneously
– Position random
    – At least 10 % of data points on either side
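The rules for the correlated break can be sketched as follows. The slide does not state the size of the correlated break, so `shift_sigma` and the use of one common shift for all affected stations are assumptions, as are the function and argument names.

```python
import numpy as np

def insert_correlated_break(network, rng, p_network=0.10, frac_stations=0.30,
                            margin=0.10, shift_sigma=0.8):
    """network has shape (n_stations, n_months). Returns (data, position, stations)."""
    out = np.array(network, dtype=float)
    n_stations, n_months = out.shape
    # Only about 10 % of the networks receive a correlated break.
    if rng.random() >= p_network:
        return out, None, np.array([], dtype=int)
    # Keep at least 10 % of the data points on either side of the break.
    lo = int(np.ceil(margin * n_months))
    pos = int(rng.integers(lo, n_months - lo + 1))
    # About 30 % of the stations are affected simultaneously.
    k = max(1, round(frac_stations * n_stations))
    stations = np.sort(rng.choice(n_stations, size=k, replace=False))
    shift = rng.normal(0.0, shift_sigma)   # hypothetical size, not given in the slides
    out[stations, pos:] += shift
    return out, pos, stations

rng = np.random.default_rng(2)
data, position, affected = insert_correlated_break(np.zeros((9, 1200)), rng, p_network=1.0)
```

Setting `p_network=1.0` in the usage line forces a break so the effect is visible; the benchmark value is 0.10.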
[Figure: Example correlated break (axis: Year; quantity shown: Correlated break)]
5) Insert inhomogeneities in stations
Outliers
Size
– Temperature: < 1 or > 99 percentile
– Rain: < 0.1 or > 99.9 percentile
Frequency
– 50 % of networks: 1 %
– 50 % of networks: 3 %
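One possible reading of the outlier rules, as a sketch: outlier values are drawn from the tails of the station's own distribution and replace the original values. Whether the benchmark replaces values or adds a perturbation is not stated in the slides, so the replacement is an assumption, and the function name is hypothetical.

```python
import numpy as np

def insert_outliers(series, kind, rng, frequency):
    """Replace a fraction `frequency` of the values by values from the tails."""
    x = np.asarray(series, dtype=float)
    lo_q, hi_q = (1.0, 99.0) if kind == "temperature" else (0.1, 99.9)
    lo, hi = np.percentile(x, [lo_q, hi_q])
    tails = x[(x < lo) | (x > hi)]          # pool of extreme values
    out = x.copy()
    idx = np.flatnonzero(rng.random(x.size) < frequency)
    if tails.size and idx.size:
        out[idx] = rng.choice(tails, size=idx.size)
    return out, idx

rng = np.random.default_rng(3)
# Half of the networks get 1 % outliers, the other half 3 %.
frequency = rng.choice([0.01, 0.03])
anomalies = rng.normal(size=1200)
with_outliers, outlier_idx = insert_outliers(anomalies, "temperature", rng, frequency)
```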
[Figure: Example outlier perturbations station (time axis 1900 to 2000; quantity shown: Outliers)]
[Figure: Example outliers network (axis: Year; quantity shown: Outliers)]
5) Insert inhomogeneities in stations
Local trends (only temperature)
– Linear increase or decrease in one station
– Duration: between 30 and 60 a
– Maximum size: Gaussian distribution, σ=0.8°C
– Frequency: once in 10 % of the stations
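A sketch of a local trend under stated assumptions: the start is uniformly random, the ramp is linear from zero to the maximum size, and the station stays at that level after the trend ends. The slides do not say what happens after the trend, so that last choice is hypothetical, as is the function name.

```python
import numpy as np

def insert_local_trend(series, rng, sigma=0.8, min_years=30, max_years=60):
    """Add one linear local trend to a monthly temperature series."""
    out = np.array(series, dtype=float)
    n = out.size
    duration = min(int(rng.integers(min_years * 12, max_years * 12 + 1)), n)
    start = int(rng.integers(0, n - duration + 1))
    size = rng.normal(0.0, sigma)            # the sign decides increase or decrease
    out[start:start + duration] += np.linspace(0.0, size, duration)
    out[start + duration:] += size           # assumption: level stays shifted afterwards
    return out, start, duration, size

rng = np.random.default_rng(4)
station = np.zeros(1200)
# Once in about 10 % of the stations.
if rng.random() < 0.10:
    station, start, duration, size = insert_local_trend(station, rng)
```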
[Figure: Example local trends (axis: Year; quantity shown: Local trends)]
6) Published on the web
Inhomogeneous data are published on the COST-HOME homepage
Everyone is welcome to download and homogenize the data
http://www.meteo.uni-bonn.de/ venema/themes/homogenisation
7) Homogenize by participants
Return homogenised data
Should be in COST-HOME file format (next slide)
– For real data including quality flags
Return break detection file
– BREAK
– OUTLI
– BEGTR
– ENDTR
Multiple breaks at one date possible
Typical errors
The file format needs to be perfect!
Forgetting the station-file that describes which stations belong to the homogenised network
Changing the file names in this station file to homogeneous data files
(Forgetting to return the files with the quality flags)
The sizes of the breaks are not in the break file
Please, keep directory structure of the benchmark like it is, also for partial contributions
– The only difference is the main directory
All files are tab-delimited ASCII files
COST-HOME file format – network file
Detected breaks file
COST-HOME file format – monthly data
Contributions
Participant                           Algorithm                Remarks
1. José Guijarro                      Climatol                 6 versions with different settings
2. Péter Domonkos                     CM-D, MASH-D, SNHT-D     3 versions / detection algorithms
3. Michele Brunetti                   Brunetti                 Detection Craddock based; 2 surrogate temp. networks
4. Dubravka Rasol & Olivier Mestre    PRODIGE                  All surrogate temp.; 13 surrogate precip. networks
5. Matthew Menne & Claude Williams    Automated pairwise hom.  2 versions; “all” temp. networks (part of real #3 is missing)
6. Christine Gruber & Ingeborg Auer   HOCLIS                   1 surrogate temp. & 1 surrogate precip.
7. Gregor Vertacnik                   MASH                     All surrogate temp.
8. Petr Stepanek                      AnClim                   1 surrogate temp. & 1 surrogate precip.
9. Lucie Vincent                      Vincent                  1 surrogate temp.
10. Enric Aguilar                     SNHT                     Not in the right format yet
No. homogenised networks – algorithm
Table 1. Number of homogenised networks per algorithm
Homogenisation alg.   All networks   Real netw.   Surrogate netw.   Synthetic netw.
PRODIGE               32             0            32                0
Brunetti              2              0            2                 0
MASH                  25             0            25                0
Vincent               1              0            1                 0
HOCLIS                1              0            1                 0
AnClim                2              0            2                 0
Climatol A            92             12           40                40
Climatol C            92             12           40                40
Climatol D            92             12           40                40
Climatol E            92             12           40                40
Climatol F            92             12           40                40
ClimatolG01           2              0            2                 0
APHa2                 42             5            18                19
APHa1                 42             5            18                19
CM-D                  5              0            5                 0
MASH-D                5              0            5                 0
SNHT-D                5              0            5                 0
No. homogenised networks – input data
Table 3. Summary data: Number of homogenised networks per network
Network         No. networks   Temp. netw.   Precip. netw.
All             624            371           253
Real            70             40            30
Surrogate       316            193           123
Surrogate #1    29             17            12
Surrogate ~#1   287            176           111
Synthetic       238            138           100
Synthetic #1    12             7             5
Synthetic ~#1   226            131           95
Mean no. outliers per station
Table 21. Mean number of outliers per station for every algorithm
Homogenisation alg.   All networks   Real netw.   Surrogate netw.   Synthetic netw.
PRODIGE               0.0            NaN          0.0               NaN
Brunetti              3.4            NaN          3.4               NaN
MASH                  16.1           NaN          16.1              NaN
Vincent               0.0            NaN          0.0               NaN
HOCLIS                6.0            NaN          6.0               NaN
AnClim                5.5            NaN          5.5               NaN
Climatol A            3.6            0.2          4.0               4.2
Climatol C            88.5           56.4         94.8              91.9
Climatol D            54.4           34.7         58.0              56.7
Climatol E            54.2           31.8         60.7              54.4
Climatol F            43.8           33.2         47.8              42.9
ClimatolG01           4.4            NaN          4.4               NaN
APHa2                 1.9            0.0          2.1               2.2
APHa1                 1.9            0.0          2.1               2.2
CM-D                  15.7           NaN          15.7              NaN
MASH-D                15.5           NaN          15.5              NaN
SNHT-D                15.3           NaN          15.3              NaN
Mean no. breaks per station
Table 22. Mean number of breaks per station for every algorithm
Homogenisation alg.   All networks   Real netw.   Surrogate netw.   Synthetic netw.
PRODIGE               2.7            NaN          2.7               NaN
Brunetti              5.0            NaN          5.0               NaN
MASH                  4.6            NaN          4.6               NaN
Vincent               0.0            NaN          0.0               NaN
HOCLIS                2.8            NaN          2.8               NaN
AnClim                1.2            NaN          1.2               NaN
Climatol A            1.3            0.8          1.5               1.3
Climatol C            1.5            0.8          1.6               1.6
Climatol D            1.4            1.0          1.5               1.4
Climatol E            1.6            0.9          1.8               1.6
Climatol F            1.5            1.2          1.7               1.4
ClimatolG01           1.2            NaN          1.2               NaN
APHa2                 1.8            1.2          2.1               1.7
APHa1                 1.7            1.0          1.9               1.6
CM-D                  4.6            NaN          4.6               NaN
MASH-D                3.9            NaN          3.9               NaN
SNHT-D                3.2            NaN          3.2               NaN
Homogenising the exercise
Tab-delimited files: also space-delimited?
– Mixture of strings and numbers
Data quality files only for real data section
Do we want to use the Diurnal Temperature Range (DTR)?
– Not useful for surrogate and synthetic data!
– If we do, everyone should do it
End or begin uncorrected?
– Compute statistics independent of absolute level?
Is filling missing values part of the exercise?
Human quality control or raw algorithm output?
Homogenise all networks and times, or only the homogenisable ones?
Contributions – who is missing?
Participant                           Algorithm                Remarks
1. José Guijarro                      Climatol                 6 versions with different settings
2. Péter Domonkos                     CM-D, MASH-D, SNHT-D     3 versions / detection algorithms
3. Michele Brunetti                   Brunetti                 Detection Craddock based; 2 surrogate temp. networks
4. Dubravka Rasol & Olivier Mestre    PRODIGE                  All surrogate temp.; 13 surrogate precip. networks
5. Matthew Menne & Claude Williams    Automated pairwise hom.  2 versions; “all” temp. networks (part of real #3 is missing)
6. Christine Gruber & Ingeborg Auer   HOCLIS                   1 surrogate temp. & 1 surrogate precip.
7. Gregor Vertacnik                   MASH                     All surrogate temp.
8. Petr Stepanek                      AnClim                   1 surrogate temp. & 1 surrogate precip.
9. Lucie Vincent                      Vincent                  1 surrogate temp.
10. Enric Aguilar                     SNHT                     Not in the right format yet
Analysing the results
What measures define a well homogenised dataset?
– Real data vs. data with known truth
    – Ensemble mean for real data?
– Breaks
    – Position, hit rate
    – Size distribution
    – Detection probability as function of size
– Data itself
    – Root mean square error (RMSE)
    – RMSE (without outliers)
    – RMSE (bias corrected)
    – Uncertainty in the network mean trend
How to study which components are best?
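For data with known truth, some of the measures listed above could be computed as in the sketch below. The exact definitions are still open in the slides, so these are only one possible choice: the bias-corrected RMSE removes the mean difference first, and the hit rate counts a true break as found when a detected break lies within a hypothetical tolerance `tol` (in months).

```python
import numpy as np

def rmse(homogenised, truth):
    """Root mean square error between homogenised data and the known truth."""
    d = np.asarray(homogenised, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

def rmse_bias_corrected(homogenised, truth):
    """RMSE after removing the mean difference (independent of absolute level)."""
    d = np.asarray(homogenised, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((d - d.mean()) ** 2)))

def rmse_without_outliers(homogenised, truth, outlier_idx):
    """RMSE computed only over the dates where no outlier was inserted."""
    keep = np.ones(len(truth), dtype=bool)
    keep[np.asarray(outlier_idx, dtype=int)] = False
    return rmse(np.asarray(homogenised)[keep], np.asarray(truth)[keep])

def hit_rate(true_pos, detected_pos, tol=12):
    """Fraction of true breaks with a detected break within tol months."""
    true_pos = np.asarray(true_pos)
    detected_pos = np.asarray(detected_pos)
    if true_pos.size == 0:
        return float("nan")
    if detected_pos.size == 0:
        return 0.0
    nearest = np.abs(true_pos[:, None] - detected_pos[None, :]).min(axis=1)
    return float(np.mean(nearest <= tol))

truth = np.zeros(10)
homogenised = truth + 0.5
print(rmse(homogenised, truth), rmse_bias_corrected(homogenised, truth))
print(hit_rate([100, 500], [105, 900]))
```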
Deadline(s)
Agreed on 09/2009, September this year
Multiple deadlines
– For example: synthetic data, real data, surrogate data
– After deadline the truth can be revealed
– After deadline the other contributions can be revealed(?)
– Start earlier analysing the results
– For example: May, July, September
Bologna, 25 – 26 May, EGU, 19 – 24 April
Articles
Articles
– Overview COST Action & benchmark with very basic analysis results
    – Performance difference between synthetic (Gaussian, white noise) and surrogate data
    – How to deal with multiple contributions per algorithm?
    – Do we have references to all algorithms?
– What should the others be about?
    – Analysing results, which components are best
Who will organise, coordinate it?
– Not everyone should do the same analysis
– How to subdivide the work?
After deadline: sensitivity analysis