An evaluation of the minimum requirements for meteorological
reforecasts from the Global Ensemble Forecast System (GEFS)
of the U.S. National Weather Service (NWS) in support of the calibration and validation of the
NWS Hydrologic Ensemble Forecast Service (HEFS)
Revision number: extended, final
Prepared by Hydrologic Solutions Limited for the U.S. National Weather Service (NWS) under subcontract SubK-2013-1003 with Lynker Technologies LLC in support of NWS Prime Contract
DG133W-13-CQ-0042 (in fulfillment of Deliverable No. 4 and, in part, Deliverable No. 9 of Task 3)
Dr. James Brown ([email protected])
August 2015
HSL
2 of 120
Contents
i. List of figures .................................................................................................... 3
ii. List of tables ..................................................................................................... 9
1. Executive summary ........................................................................................ 10
2. Introduction ..................................................................................................... 16
3. Approach ......................................................................................................... 20
3.1 Study basins ..................................................................................................... 20
3.2 Experimental design ......................................................................................... 23
3.3 Datasets ........................................................................................................... 27
3.4 Verification strategy .......................................................................................... 28
4. Results and analysis ...................................................................................... 29
4.1 Minimum requirements for estimating the parameters of the MEFP ................. 29
4.1.1 Sensitivity to the historical period and interval between reforecasts ................. 29
4.1.2 Sensitivity to the number of ensemble members in the GEFS.......................... 35
4.2 Minimum requirements for verifying the HEFS forecasts .................................. 40
5. Summary and recommendations .................................................................. 45
6. Glossary of terms and acronyms .................................................................. 53
7. References ...................................................................................................... 59
8. Tables .............................................................................................................. 64
9. Figures ............................................................................................................. 68
APPENDIX A: The Hydrologic Ensemble Forecast Service (HEFS)...................... 113
APPENDIX B: Verification measures ....................................................................... 117
a. Relative mean error ........................................................................................ 117
b. Brier Score and Brier Skill Score .................................................................... 117
c. Continuous Ranked Probability Score and skill score .................................... 118
d. Reliability diagram .......................................................................................... 118
e. Relative Operating Characteristic ................................................................... 119
f. Cumulative rank histogram ............................................................................. 119
i. List of figures
Figure 1: The four study basins, including their average elevation, the location of each outlet
(gaging station), and the positions of the nearest grid nodes in the GEFS.
Figure 2: Daily average temperature, total daily precipitation and daily average streamflow by
calendar month for each study basin.
Figure 3: Selected verification metrics for the MEFP-GEFS precipitation forecasts. The results
are shown for the dependent (solid) and independent (dashed) validation scenarios of N
(the number of years of calibration data), and include several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 4: Selected verification metrics for the MEFP-GEFS precipitation forecasts at AB-CBNK1.
The results are plotted against forecast lead time for each scenario of N (the number of
years of calibration data), and are shown for several non-exceedence climatological
probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 5: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CB-DRRC2.
The results are plotted against forecast lead time for each scenario of N (the number of
years of calibration data), and are shown for several non-exceedence climatological
probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 6: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CN-DOSC1.
The results are plotted against forecast lead time for each scenario of N (the number of
years of calibration data), and are shown for several non-exceedence climatological
probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 7: Selected verification metrics for the MEFP-GEFS precipitation forecasts at NE-HOPR1.
The results are plotted against forecast lead time for each scenario of N (the number of
years of calibration data), and are shown for several non-exceedence climatological
probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 8: Selected verification metrics for the MEFP-GEFS precipitation forecasts at AB-CBNK1.
The results are plotted against climatological non-exceedence probability (Cp) for each
scenario of N (the number of years of calibration data), and are shown for several forecast
lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-
CLIM forecasts.
Figure 9: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CB-DRRC2.
The results are plotted against climatological non-exceedence probability (Cp) for each
scenario of N (the number of years of calibration data), and are shown for several forecast
lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-
CLIM forecasts.
Figure 10: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CN-
DOSC1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of N (the number of years of calibration data), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 11: Selected verification metrics for the MEFP-GEFS precipitation forecasts at NE-
HOPR1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of N (the number of years of calibration data), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 12: Selected verification metrics for the MEFP-GEFS precipitation forecasts. The results
are plotted against the interval between reforecasts (M days) used to calibrate the MEFP,
and are shown for several non-exceedence climatological probabilities (Cp). The reference
forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.
Figure 13: Selected verification metrics for the MEFP-GEFS precipitation forecasts at AB-
CBNK1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of M (the interval between reforecasts in days), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 14: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CB-
DRRC2. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of M (the interval between reforecasts in days), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 15: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CN-
FTSC1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of M (the interval between reforecasts in days), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 16: Selected verification metrics for the MEFP-GEFS precipitation forecasts at NE-
HOPR1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of M (the interval between reforecasts in days), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 17: Range (maximum-minimum) of selected verification metrics for the MEFP-GEFS
precipitation forecasts. The results are plotted against climatological non-exceedence
probability (Cp) across all scenarios of M (interval between reforecasts in days), and are
shown for several forecast lead times. The reference forecasts for the CRPSS and the
BSS comprise the MEFP-CLIM forecasts.
Figure 18: Range (maximum-minimum) of selected verification metrics for the MEFP-GEFS
precipitation forecasts. The results are plotted against climatological non-exceedence
probability (Cp) across all scenarios of N (the number of years of calibration data), and are
shown for several forecast lead times. The reference forecasts for the CRPSS and the
BSS comprise the MEFP-CLIM forecasts.
Figure 19: Box plots of forecast errors against observed precipitation amount for N={24 and 12}
years of calibration data. The results are shown at a forecast lead time of 0-24 hours.
Figure 20: Box plots of forecast errors against forecast precipitation amount (ensemble mean)
for N={24 and 12} years of calibration data. The results are shown at a forecast lead time
of 0-24 hours.
Figure 21: Box plots of forecast errors against observed precipitation amount for calibration
scenarios of M={1 and 5} days between reforecasts. The results are shown at a forecast
lead time of 0-24 hours.
Figure 22: Box plots of forecast errors against forecast precipitation amount (ensemble mean)
for calibration scenarios of M={1 and 5} days between reforecasts. The results are shown
at a forecast lead time of 0-24 hours.
Figure 23: Selected verification metrics for the MEFP-GEFS temperature forecasts. The results
are shown for the dependent (solid) and independent (dashed) validation scenarios of N
(the number of years of calibration data), and include several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 24: Selected verification metrics for the MEFP-GEFS temperature forecasts. The results
are plotted against the interval between reforecasts (M days) used to calibrate the MEFP,
and are shown for several non-exceedence climatological probabilities (Cp). The reference
forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.
Figure 25: Selected verification metrics for the MEFP-GEFS temperature forecasts at AB-
CBNK1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of N (the number of years of calibration data), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 26: Selected verification metrics for the MEFP-GEFS temperature forecasts at CB-
DRRC2. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of N (the number of years of calibration data), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 27: Selected verification metrics for the MEFP-GEFS temperature forecasts at CN-
DOSC1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of N (the number of years of calibration data), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 28: Selected verification metrics for the MEFP-GEFS temperature forecasts at NE-
HOPR1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of N (the number of years of calibration data), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 29: Selected verification metrics for the MEFP-GEFS streamflow forecasts. The results
are shown for the dependent (solid) and independent (dashed) validation scenarios of N
(the number of years of calibration data), and include several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 30: Selected verification metrics for the MEFP-GEFS streamflow forecasts. The results
are plotted against the interval between reforecasts (M days) used to calibrate the MEFP,
and are shown for several non-exceedence climatological probabilities (Cp). The reference
forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.
Figure 31: Residuals of selected verification metrics for the MEFP-GEFS precipitation forecasts
when calibrating the MEFP with an ensemble mean derived from C=11 members versus
C=1 (F=11). The results are shown by forecast lead time for several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 32: Residuals of selected verification metrics for the MEFP-GEFS precipitation forecasts
when calibrating the MEFP with an ensemble mean derived from C=11 members versus
C=1 (F=11). The results are shown by climatological non-exceedence probability at
selected forecast lead times. The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 33: Sensitivity of the MEFP-GEFS precipitation forecasts to the number of members (C)
used to calibrate the MEFP. The results comprise an average over the middle portion of
the forecast horizon (4-8 days) for selected climatological probabilities (Cp). The bold lines
show the calibration scenarios with F=11 forecast members. The dashed line shows the
(C=1, F=1) scenario.
Figure 34: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CN-
DOSC1. The results are shown by forecast lead time for multiple calibration (C) and
forecasting (F) scenarios and for several non-exceedence climatological probabilities (Cp).
The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.
Figure 35: Residuals of selected verification metrics for the MEFP-GEFS temperature forecasts
when calibrating the MEFP with an ensemble mean derived from C=11 members versus
C=1 (F=11). The results are shown by forecast lead time for several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 36: Residuals of selected verification metrics for the HEFS streamflow forecasts when
calibrating the MEFP with an ensemble mean derived from C=11 members versus C=1
member (F=11). The results are shown by forecast lead time for several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 37: Residuals of selected verification metrics for the MEFP-GEFS precipitation forecasts
when calibrating the MEFP with an ensemble mean derived from C=11 members versus
C=5 (F=11). The results are shown by forecast lead time for several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 38: Residuals of selected verification metrics for the MEFP-GEFS temperature forecasts
when calibrating the MEFP with an ensemble mean derived from C=11 members versus
C=5 (F=11). The results are shown by forecast lead time for several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 39: Residuals of selected verification metrics for the HEFS streamflow forecasts when
calibrating the MEFP with an ensemble mean derived from C=11 members versus C=5
members (F=11). The results are shown by forecast lead time for several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 40: Cumulative rank histograms for the HEFS streamflow forecasts when calibrating the
MEFP with an ensemble mean derived from C=11 members (solid) and C=5 members
(dashed). The results are shown at a forecast lead time of 96-120 hours and for observed
streamflow volumes that exceed several (non-exceedence) climatological probabilities.
Figure 41: Selected verification scores for the MEFP-GEFS precipitation forecasts. The nominal
scores are shown for each scenario of N (solid lines), together with the range of scores
across the subcases of each scenario. The results include several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 42: Selected verification scores for the MEFP-GEFS precipitation forecasts. The nominal
scores are shown for each scenario of M (solid lines), together with the range of scores
across the subcases of each scenario. The results include several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 43: Reliability diagrams and corresponding sharpness plots (base 10 logarithm of the
sample size, n) for the MEFP-GEFS precipitation forecasts at N=12. The results are shown
for selected climatological non-exceedence probabilities (Cp), including the Probability of
Precipitation (PoP; Cp=0.0), and comprise a daily aggregation between 0-24 hours.
Alongside the nominal values (bold lines), the range of scores is shown for the two sub-
periods of N=12.
Figure 44: Reliability diagrams and corresponding sharpness plots (base 10 logarithm of the
sample size, n) for the MEFP-GEFS precipitation forecasts at M=5. The results are shown
for selected climatological non-exceedence probabilities (Cp), including the Probability of
Precipitation (PoP; Cp=0.0), and comprise a daily aggregation between 0-24 hours.
Alongside the nominal values (bold lines), the range of scores is shown for the five sub-
periods of M=5.
Figure 45: Probability of Detection (PoD) and Probability of False Detection (PoFD) for flooding
at NE-HOPR1. The results are shown for each ensemble member (48 in total) and for
three validation scenarios at a reforecast interval of M=3, namely the full period of record
(daily reforecasts) and the three sub-periods (reforecasts every 3 days, offset by 1 day).
The PoD is highlighted at a PoFD of 0.015.
ii. List of tables
Table 1: Characteristics of the study basins
Table 2: Reforecast configuration parameters
Table 3: Calibration and validation scenarios for N
Table 4: Calibration and validation scenarios for M
Table 5: Average sample sizes by climatological probability (Cp) and reforecast scenario
Table 6: Calibration and validation scenarios for the number of ensemble members
1. Executive summary
Motivation
The Hydrologic Ensemble Forecast Service (HEFS) quantifies the total uncertainty
in future streamflow as a combination of the meteorological forcing uncertainties
and the hydrologic modeling uncertainties. Reliable and skillful weather and
climate forecasting is central to reliable and skillful streamflow forecasting. The
HEFS Meteorological Ensemble Forecast Processor (MEFP) quantifies the
meteorological uncertainties and corrects for biases in the forcing inputs to the
HEFS. For the medium-range (1-15 days), the MEFP uses precipitation and
temperature forecasts from the Global Ensemble Forecast System (GEFS) of the
National Centers for Environmental Prediction (NCEP).
The ability of the HEFS to provide useful information for decision making depends
upon the accuracy of the forecast probabilities. Crucially, there is a need to
demonstrate this accuracy through hindcasting and validation. Hindcasting is
necessary to benchmark and improve the HEFS, to optimize decision support
systems that rely upon the HEFS, and to build confidence among decision makers
that the forecasts are accurate, useful, and can lead to better decisions. For
example, the New York City Department of Environmental Protection (NYCDEP)
is using the HEFS to improve the management of risks to water supply objectives
in the NYC area. The NYCDEP has developed an Operational Support Tool (OST),
which optimizes the quantity and quality of water stored in the NYC reservoirs and
helps to avoid unnecessary, multi-billion-dollar infrastructure costs. The NYCDEP
relies on streamflow hindcasts from the HEFS, supported by meteorological
reforecasts from the GEFS, in order to optimize and validate the OST.
Large and extreme hydrologic events are critically important to users of the HEFS,
as they pose a significant threat to life and property. Given the manifest
uncertainties in forecasting hydrologic extremes, the ability of the HEFS to quantify
these uncertainties (and correct for systematic biases) is an important advantage
over deterministic forecasting systems. However, validating the HEFS for large and
extreme events relies upon an adequate archive of meteorological reforecasts.
In order to determine the minimum requirement of the HEFS for meteorological
reforecasts from the GEFS, this report considers the sensitivity of the HEFS to a
limited number of reforecast configuration options. Understanding the minimum
requirements for calibrating and validating the HEFS is a necessary but not a
sufficient condition for understanding the minimum requirements of end users for
meteorological and hydrologic reforecasts. The requirements of end users, such
as the NYCDEP, will be gathered and presented separately.
Approach
In order to determine the minimum requirements for meteorological reforecasting
in support of calibrating and validating the HEFS, a 26-year reforecast dataset was
obtained for the current GEFS. Among other factors, the costs associated with
meteorological reforecasting depend on the historical period considered (N years),
the interval between reforecasts (M days), and the number of ensemble members
in each forecast (C). By sub-sampling the GEFS reforecasts, the MEFP was
calibrated for different combinations of N, M and C. The sensitivities of the
temperature and precipitation forecasts from the MEFP and the streamflow
forecasts from the HEFS were then explored through hindcasting and validation.
Forcing and streamflow hindcasts were produced and validated at four headwater
basins: the Chikaskia River at Corbin, Kansas (AB-CBNK1); the Dolores River at
Rico in Colorado (CB-DRRC2); the Middle Fork of the Eel River at Dos Rios in
California (CN-DOSC1); and the Wood River at Hope Valley, Rhode Island (NE-
HOPR1). The hindcasts were generated at 12Z for each day in the historical period
of record. Within this fixed period, the calibration of the MEFP varied according to
N, M and C. To ensure that the hindcasting was both practical and statistically
reasonable, a combination of dependent and (limited) cross-validation was used.
In exploring the sensitivities to N, a 24-year validation period was sub-divided into
smaller calibration and forecasting periods, namely N={2x12, 3x8, 4x6, and 6x4}
years. Dependent validation involved calibrating the MEFP and generating
hindcasts for each sub-period and then pooling all of the sub-periods for validation.
Independent validation involved borrowing the parameters from an adjacent sub-
period. While dependent validation may be regarded as a best case scenario for
the expected forecast quality, using parameters from adjacent sub-periods should
be regarded as a worst case scenario; in practice, the MEFP would be recalibrated
more frequently. In evaluating the sensitivities to M, the MEFP was calibrated for
M={1, 3, 5, and 7} days and hindcasts produced daily for the fixed historical period.
The sensitivities to C were examined by calibrating the MEFP with an ensemble
mean derived from C={1, 5, and 11} of the ensemble members from the GEFS
reforecasts. In practice, the GEFS reforecasts contain fewer ensemble members
(F=11) than the operational forecasts (F=21). Since the operational HEFS
forecasts use all available GEFS members (F=21), the HEFS reforecasts were
also generated with all available GEFS members (F=11). However, in order to
understand the impacts of this discrepancy, a baseline reforecast was generated
with the control run only (F=1), using the corresponding MEFP calibration (C=1).
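The sub-sampling of the reforecast archive described above can be sketched in Python. This is an illustrative reconstruction, not the actual MEFP/HEFS tooling: the function names and the exact archive length are assumptions, but the logic mirrors the scenarios of N (partitioning the archive into equal calibration sub-periods) and M (thinning daily reforecasts to every M-th day, with offsets giving the sub-periods).

```python
from datetime import date, timedelta

def thin_by_interval(dates, m, offset=0):
    """Keep every m-th reforecast date, starting at the given offset.

    Illustrates the M-scenarios: for M=3, the offsets 0, 1 and 2 yield
    three sub-periods of reforecasts issued every 3 days.
    """
    return dates[offset::m]

def split_into_subperiods(dates, n_years, years_per_period):
    """Partition an N-year archive into equal calibration sub-periods,
    e.g. N=24 years split as 2x12, 3x8, 4x6 or 6x4."""
    periods = n_years // years_per_period
    days_per_period = len(dates) // periods
    return [dates[i * days_per_period:(i + 1) * days_per_period]
            for i in range(periods)]

# Hypothetical daily 12Z reforecast dates for a 24-year archive
# (365-day years for simplicity; the real archive includes leap days).
start = date(1985, 1, 1)
dates = [start + timedelta(days=d) for d in range(24 * 365)]

# The three sub-periods of the M=3 scenario, offset by 1 day each.
subsets_m3 = [thin_by_interval(dates, 3, k) for k in range(3)]

# The 2x12 scenario for N: two 12-year calibration sub-periods.
subperiods = split_into_subperiods(dates, 24, 12)
```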
The minimum requirements for validating the HEFS depend on N and M (not C)
and were examined both theoretically and empirically. Theoretically, verification is
concerned with the estimation of statistical measures. The quality of these
estimates will depend on the number of samples available and their unique
information content. Empirically, the effect of reducing the number of reforecasts
available is to increase the sampling uncertainty of the verification results and to
render some (typically large) events unverifiable, depending on the choice of
measure. In order to illustrate the effects of N and M on the uncertainties
associated with validating the HEFS, each sub-sample of N and M was verified
separately and the results compared to the nominal scores for N=24 and M=1.
Results
In terms of the quality of the MEFP forcing and HEFS streamflow forecasts, there
is no systematic decline in forecast quality as the interval between reforecasts (M)
increases from 1 to 7 days or the historical period (N) decreases from 24 to 4 years.
However, when considering the sensitivity of the verification scores to N and M,
measured by the range of scores across these scenarios, there are meaningful
differences, particularly at higher thresholds of precipitation and streamflow. In this
context, sensitivity is a necessary but not a sufficient condition for a decline in
forecast quality. These results imply some sensitivity to N and M, but they do not
suggest a consistent decline in forecast quality with increasing M or decreasing N.
In practice, a reforecast archive of N=12 years and M=1 day (among other
combinations) should be adequate to calibrate the MEFP, but it would not be
adequate to validate the HEFS for large events, as described below.
Against the best available calibration (C) and forecasting (F) scenario (C=11,
F=11), there is a material decline in the quality of the HEFS forcing and streamflow
forecasts when using the control member only (C=1, F=1). For some basins,
metrics and thresholds, this is minimized by using all ensemble members to
generate the HEFS forecasts (C=1, F=11). For precipitation and streamflow, the
greatest differences occur at CN-DOSC1, particularly in the middle and latter
portion of the forecast horizon, where using C=11 (F=11) members rather than
C=1 (F=11) yields a gain equivalent to one or more days of forecast lead time. The improvements in
temperature are greatest at AB-CBNK1 and NE-HOPR1, particularly at the hottest
observed temperatures and during the middle portion of the forecast horizon,
where the CRPSS is increased by ~10% in absolute terms (~30% relative to the
baseline CRPSS). In contrast, when calibrating the MEFP with C=5 ensemble
members (F=11), the forcing and streamflow forecasts are no more reliable or
skillful than those calibrated with C=11 members (F=11). Thus, for the locations,
thresholds, and verification metrics considered, 5 ensemble members should be
adequate to calibrate the MEFP, while the operational forecasts would benefit from
using all available ensemble members.
The minimum requirements for validating the HEFS are examined both
theoretically and empirically. At a daily aggregation, the average number of
verification pairs for which the observed value exceeds a climatological probability,
Cp, is ~365N(1-Cp). In order to estimate a lumped verification score with reasonably
small sampling uncertainty, 30 or more independent samples may be required.
Thus, if all of the large (e.g. >Cp=0.995) events in a verification sample are
statistically independent, and reforecasts are issued once per day, approximately 16.4 years of
reforecasts would be required, on average, to generate a verification sample with
30 large events. Clearly, these requirements increase dramatically with
increasing Cp; in general, the probability of flooding at a daily aggregation is less
than 1-in-200 (Cp=0.995). They also increase when the individual samples are
related to each other (e.g. one flood event that spans several days), as described
below. More detailed metrics, such as the reliability diagram and Relative
Operating Characteristic (ROC), require many more samples than a lumped
verification score (perhaps 100-200 samples).
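The sample-size arithmetic above can be made explicit with a short, hypothetical helper (the function name and the 365-day year are assumptions; the formula is the ~365N(1-Cp) relationship stated in the text):

```python
def required_years(n_events, cp, reforecasts_per_year=365.0):
    """Years of daily reforecasts needed, on average, to collect n_events
    observations exceeding the climatological probability Cp, assuming
    independent events (expected sample size ~ 365 * N * (1 - Cp))."""
    return n_events / (reforecasts_per_year * (1.0 - cp))

# 30 events above Cp=0.995 (a 1-in-200-day event): ~16.4 years
print(round(required_years(30, 0.995), 1))
```

Raising the threshold to Cp=0.999 roughly quintuples the requirement, which is why the report emphasizes that requirements "increase dramatically with increasing Cp".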
For two time-series (forecasts and observations) that are both autocorrelated in
time, the effective sample size for verification is smaller than the nominal sample
size. As the correlations increase, the effective sample size declines. For example,
the lag-1 autocorrelation of streamflow at AB-CBNK1 for a daily aggregation is
0.542, while the lag-1 autocorrelation at NE-HOPR1 is 0.897. Based on sampling
theory, the effective sample size for computing the cross-correlation between the
observed and forecast time-series would be 55% of the nominal sample size at
AB-CBNK1 and 11% at NE-HOPR1. In other words, for a given amount of
confidence in the streamflow verification, roughly 9x more data would be required
at NE-HOPR1 than implied by the nominal sample size and 2x more data at AB-
CBNK1. While precipitation is generally autocorrelated over much shorter time-scales
than streamflow, the short duration of precipitation events also increases the
probability that data thinning (e.g. from M=1 to M=3) would significantly reduce the number of extreme events in the
sample. Thus, based on sampling theory alone, reducing the reforecast period and
frequency would systematically reduce the precipitation and streamflow thresholds
at which the HEFS, and associated decision support, could be validated and
optimized, particularly for multi-day aggregations, such as reservoir inflows.
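The effective-sample-size reduction quoted above follows from a standard large-sample result for the cross-correlation of two lag-1 autocorrelated series, n_eff/n = (1 - r1*r2)/(1 + r1*r2). A minimal sketch (the function name is an assumption; the autocorrelations are those reported in the text, with the forecasts assumed to share the observed lag-1 autocorrelation):

```python
def effective_fraction(r_obs, r_fcst):
    """Fraction of the nominal sample size that is effectively independent
    when estimating the cross-correlation between two time-series with
    lag-1 autocorrelations r_obs and r_fcst:
        n_eff / n = (1 - r_obs * r_fcst) / (1 + r_obs * r_fcst)."""
    return (1.0 - r_obs * r_fcst) / (1.0 + r_obs * r_fcst)

print(round(effective_fraction(0.542, 0.542), 2))  # ~0.55 at AB-CBNK1
print(round(effective_fraction(0.897, 0.897), 2))  # ~0.11 at NE-HOPR1
```

The reciprocals of these fractions give the "2x" and "9x" data multipliers cited for AB-CBNK1 and NE-HOPR1, respectively.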
The sensitivities of the validation results to different configurations of N and M are
also explored empirically. Here, the range of verification scores between cases of
N and M is much smaller than the range of scores within cases for different subsets
of the validation data. Thus, as anticipated from theory, the minimum requirements
for validating the MEFP are much greater than the minimum requirements for
calibrating the MEFP, even for relatively simple verification scores. Of the
verification scores considered here, the cross-correlation is particularly variable
across the subsets of N and M. For example, at AB-CBNK1, precipitation amounts
that exceed Cp=0.995 show correlations of between -0.1 and 0.6 in the three
subsets of M=3. Thus, for a 1-in-200 day precipitation amount at AB-CBNK1,
forecasts issued every three days over a 24-year period would be unverifiable. For
more detailed verification metrics, such as the reliability diagram, the thresholds
for which the HEFS remains verifiable are even smaller. Thus, even with daily
reforecasts between 1985 and 2008, the sample sizes are too small to evaluate
reliability diagrams for moderately large precipitation amounts (Cp=0.99), as these
events are rarely forecast with high probability. However, this partly originates from
a conditional bias in the precipitation forecasts at high observed thresholds, which
would not be addressed by increasing the number of reforecasts alone.
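The sample-size problem at high thresholds can be illustrated with a back-of-envelope count of exceedances. The sketch below uses the thresholds and thinning configurations described above, but ignores autocorrelation and seasonality, which reduce the effective count further:

```python
# Approximate number of verification pairs exceeding the climatological
# probability threshold Cp, for N years of reforecasts issued every M days.
# A rough sketch only: autocorrelation and seasonality are ignored.
def expected_exceedances(n_years, m_days, cp):
    return (365.0 * n_years / m_days) * (1.0 - cp)

print(expected_exceedances(24, 1, 0.995))  # M=1, Cp=0.995: ~44 events
print(expected_exceedances(24, 3, 0.995))  # M=3, Cp=0.995: ~15 events
print(expected_exceedances(24, 1, 0.99))   # M=1, Cp=0.99:  ~88 events
```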
In summary, therefore, the minimum requirements for meteorological reforecasting
in support of the HEFS are determined, primarily, by the need to validate the HEFS
with reasonably small sampling uncertainty, including for large events. In general,
simple, unconditional, verification measures cannot guide operational practice,
because they are not application-specific. For example, a flood warning may be
triggered when the forecast probability of flooding exceeds some threshold. In this
context, there is trade-off between issuing warnings too regularly (low probability
threshold) and failing to warn when floods actually occur (high probability
threshold). Given an adequate sample of historical flood occurrences, this trade-
off, and hence the triggering threshold, can be defined, objectively, through
hindcasting and validation. By way of illustration, the use of a degraded reforecast
of M=3 at NE-HOPR1 could lead to flood warnings that are correct on only 40% of
occasions, when they could be correct on 58% of occasions for a warning threshold
optimized to daily reforecasts (i.e. M=1). For users of the HEFS, such as the
NYCDEP, a long and consistent record of historical forecasts is, therefore,
essential; it is necessary to optimize and improve decision support systems and to
benchmark these systems against historical analogs for future extremes.
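The warning-threshold trade-off described here can be made concrete with a small sketch. The forecast probabilities and flood outcomes below are hypothetical, not taken from the hindcasts; the point is only that raising the threshold trades missed floods for fewer false warnings:

```python
# Sketch: selecting a flood-warning probability threshold from hindcasts.
# probs are forecast probabilities of flooding; flooded records whether a
# flood occurred (hypothetical data, for illustration only).
def warning_stats(probs, flooded, threshold):
    """Return (fraction of warnings that verified, fraction of floods warned)."""
    warnings = [p >= threshold for p in probs]
    hits = sum(w and f for w, f in zip(warnings, flooded))
    n_warn = sum(warnings)
    n_flood = sum(flooded)
    precision = hits / n_warn if n_warn else float("nan")
    recall = hits / n_flood if n_flood else float("nan")
    return precision, recall

probs   = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.8]
flooded = [True, True, False, True, False, False, False, True]
for t in (0.3, 0.5, 0.7):
    p, r = warning_stats(probs, flooded, t)
    print(f"threshold={t}: fraction correct={p:.2f}, fraction warned={r:.2f}")
```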
Recommendations
Reforecasting requires significant human and computational resources.
However, unsophisticated approaches to data thinning, such as reducing the
number of historical years (N) or increasing the interval between reforecasts (M),
will also reduce the value of these reforecasts for hydrologic applications. In terms
of the HEFS, the greatest impacts of reducing the sample of historical reforecasts
would be to prevent the validation of large events with the necessary statistical
confidence. These events are critically important to users of the HEFS, such as the
NYCDEP. Thus, any approach to data thinning must accommodate a reasonable
sample of large and extreme events. The frequency of reforecasts (M) should also
accommodate rapidly evolving hydrometeorological conditions, for which M>1 day
would not be appropriate for the short-to-medium range.
Reducing the number of ensemble members in the GEFS reforecasts (C) will reduce their value for statistical post-processing and other
applications. Nevertheless, as C increases, there are diminishing gains for the
reliability and skill of the MEFP outputs. This study indicates that C=5 ensemble
members should be adequate to calibrate the MEFP with the GEFSv10. However,
this cannot be generalized to other techniques or to future implementations of the
MEFP. Indeed, the benefits of reforecasting with additional members may vary with
location and forecast conditions, and they may be greater for extreme events (for
which sample measures of forecast quality are inherently limited). Thus, any
compromise should be reviewed as models and applications evolve and diagnostic
techniques become more sophisticated.
While the costs associated with meteorological reforecasting are substantial, the
benefits are even more substantial. Thus, a concerted effort should be made to
produce reforecasts every day over the maximum historical period for which there
is adequate data to initialize the GEFS, rather than compromising on N or M. In
the absence of a complete reforecast, more sophisticated approaches to data
thinning will be required. Here, emphasis on early forecast lead times and extreme
events will increase the utility of a limited reforecast for hydrologic applications.
Spatial pooling or regionalization may improve the sample sizes for calibration and
validation of the MEFP. Studies are underway to establish whether reforecasts
from hydrometeorologically similar basins can be used to augment the calibration
and validation of the HEFS. However, spatial pooling cannot satisfy user
requirements for long historical records at critical forecast locations. Also, in
validating the streamflow forecasts, spatial pooling is inherently difficult, as
hydrologic state variables, unlike atmospheric state variables, often vary abruptly
(over short distances), and with myriad basin characteristics.
An adequate sample of historical events, including large and extreme events, is
only one of several minimum requirements for users of weather and climate
reforecasts. Other requirements include the timely communication of development
plans, use of transitional arrangements for legacy models (e.g. temporary freezing
of models), software version control, coordination of model updates with users,
timely access to reforecasts, and consistency of the reforecasts and operational
forecasts (including initializations), among others. Collectively, these requirements
should contribute to a renewed effort by the NWS and other operational forecasting
agencies to deliver weather, climate, and water (re)forecasts for improved decision
support. This broader set of requirements must be addressed separately,
alongside the minimum requirements of end users, such as the NYCDEP.
2. Introduction
The Hydrologic Ensemble Forecast Service (HEFS) is an operational hydrologic
forecasting system that is being implemented by the thirteen River Forecast Centers
(RFCs) of the U.S. National Weather Service (NWS). The HEFS quantifies the total
uncertainty in future streamflow as a combination of the meteorological forcing uncertainty
and the hydrologic modeling uncertainty, while correcting for biases in the forecast
probabilities (Seo et al., 2010; Demargne et al., 2010, 2014; Brown et al., 2014a/b). The
HEFS ingests weather and climate forecasts from, among other sources, the Global
Ensemble Forecast System (GEFS) of the National Centers for Environmental Prediction
(NCEP), as well as NCEP's Climate Forecast System Version 2 (CFSv2), and produces
ensemble streamflow forecasts for the short- to long- range. The HEFS aims to: 1) span
lead times from one hour to one year or more with seamless transitions between
forecast time horizons; 2) issue forecast probabilities that are unbiased for different
aggregation periods; 3) be spatially and temporally consistent across RFC domains; 4)
capture information from current operational weather and climate forecasting systems,
while correcting for biases; 5) be consistent with retrospective forecasts or hindcasts
that are used for verification and decision support; and 6) be properly validated, in order to identify the strengths and weaknesses of the forecasts and to guide forecasting
operations and decision support.
By explicitly accounting for the uncertainties inherent in meteorological and
hydrologic forecasting, while correcting for biases in the forecast probabilities, the HEFS
aims to support improved, risk-based, decision making for a variety of water resources
applications, including reservoir operation, flood forecasting, river navigation, and water
supply. For example, the New York City Department of Environmental Protection
(NYCDEP) is using the HEFS to improve the management of risks to water quantity and
quality objectives in the NYC area. In this context, the NYCDEP has developed an
Operational Support Tool (OST), which ingests streamflow forecasts from the HEFS that
are produced operationally by the Middle-Atlantic RFC and the Northeast RFC. The OST
optimizes the quantity and quality of water stored in the NYC reservoirs, while avoiding
unnecessary, multi-billion dollar, infrastructure costs, such as water filtration. Elsewhere,
the U.S. Army Corps of Engineers (USACE) are redeveloping their water control manual
for the Folsom Reservoir and the American River. In this context, the California-Nevada
RFC (CNRFC) are evaluating the use of streamflow hindcasts from the HEFS, in order to
establish the benefits and risks of using inflow forecasts to manage the flood control space
in the Folsom Reservoir. Elsewhere in California, the Yuba County Water Agency
(YCWA), together with CNRFC and partners, are exploring the use of probabilistic inflow
forecasts to better manage the flood control spaces in Lake Oroville, the Englebright
Reservoir and the New Bullards Bar Reservoir.
The ability of the HEFS to provide useful information for decision making depends,
crucially, upon the accuracy (unbiasedness and skillfulness) of the forecast probabilities.
There is a need to demonstrate this accuracy through retrospective forecasting and
verification. Retrospective studies are necessary to guide the development of the HEFS,
as well as decision support systems that rely upon the HEFS, and to build confidence
among decision makers that the forecasts are accurate, useful, and can lead to better
decisions. In order to provide meteorological and streamflow forecasts that are
demonstrably accurate, the HEFS must be calibrated and validated with historical data.
While recent studies have documented the quality of the precipitation, temperature and
streamflow forecasts from the HEFS, both for the short-to-medium range (Brown et al.,
2014a/b) and for the long-range (Brown, 2013), the minimum requirements for
reforecasting have not been evaluated. These requirements are largely driven by the raw
meteorological reforecasts used as input to the HEFS and, specifically, by the HEFS
Meteorological Ensemble Forecast Processor (MEFP), which aims to correct for biases
in the raw forecasts of precipitation and temperature (Schaake et al., 2007; Wu et al.,
2011). Observations of precipitation, temperature and streamflow are also required to
initialize the HEFS, calibrate the hydrologic models, and to validate the forecasts. Gauge-
based observations are typically available for many decades (often 50-100 years) at river
forecast locations. However, atmospheric models rely on a best estimate (or a range of
possibilities) of the multivariate, spatially distributed, state of the atmosphere-ocean
system at the forecast issue time. In order to conduct reforecasting, these estimates must
be produced retrospectively. In practice, reliable estimates of the atmosphere-ocean state
variables require satellite observations, which are only available since the early 1980s.
Thus, meteorological reforecasting is inherently constrained to the recent past. Also,
given the significant cost of conducting reforecasting, a trade-off emerges between
expanding reforecasting and improving the underlying weather and climate models.
However, for users of the HEFS, such as the NYCDEP and YCWA, hydrometeorological
reforecasting is critically important. It is necessary to optimize and improve decision
support systems, such as the OST, and to benchmark these systems against historical
analogs for future extremes.
In order to support NCEP in determining the requirements of the HEFS for
meteorological reforecasting, this report considers the sensitivity of the HEFS to a limited
number of reforecast configuration options. Clearly, reforecast configuration is only one
of several requirements for users of weather and climate forecasts. Other requirements
include the timely communication of development plans, use of transitional arrangements
for legacy models, software version control, coordination of model updates with users,
timely access to reforecasts, and consistency of the reforecasts and operational
forecasts, among others. Collectively, these requirements should contribute to a new
business model for NCEP and other operational forecasting agencies in delivering
weather, climate, and water (re)forecasts for improved decision support. As indicated
above, this report focuses on the minimum technical requirements of the HEFS for
meteorological reforecasts. It does not consider the broader set of requirements for
delivering an efficient and effective forecasting service, which must be addressed
separately.
In terms of the HEFS, the minimum requirements for historical data are driven by:
1) the need for an adequate sample size to estimate the statistical parameters of the
HEFS; 2) the need for an adequate sample size to validate the HEFS; and 3) the need
for users of the HEFS to calibrate and validate their decision support systems. This report
is concerned with the minimum requirements for (1) and (2) only. The requirements of
end users, such as the NYCDEP and YCWA, will be gathered and presented separately.
In this context, (1) and (2) define the minimum requirements for operating the HEFS, while
(3) is necessary to ensure the outputs from the HEFS are useful for decision making. In
other words, the minimum requirements associated with (1) and (2) should be regarded
as an incomplete baseline. In practice, the requirements of users for meteorological and
hydrologic reforecasting may exceed those for calibration and validation of the HEFS,
and they may evolve as services change and other users adopt the HEFS. Furthermore,
this study is concerned with short-to-medium range forecasting only and, specifically, with
the minimum requirements for historical data from the Global Ensemble Forecast System
(GEFS).
Raw forecasts of temperature and precipitation from the GEFS are used to produce
bias-corrected forcing for input to the HEFS. These forecasts are used in water resources decision making for the short-to-medium range, including reservoir management, flood
warning, river navigation and recreation. The GEFS uses Version 9.0.1 of the Global
Forecast System (GFS), which comprises a horizontal resolution of T254 (~55km) for 1-
8 days and T190 (~70km) for 9-16 days, and a vertical resolution of L42 or 42 levels (Wei
et al. 2008; Hamill et al. 2011; Hamill et al. 2013). Reforecasts were produced with the
GEFS for a ~26-year period between 1985 and 2010 (Hamill et al., 2013). Calibrating and
validating the HEFS with a subset of the available reforecasts will identify the sensitivities
of the HEFS to a degraded reforecast with the current GEFS only. Some applications of
the HEFS may benefit from a configuration that improves upon the available reforecasts,
but this cannot be established here. Rather, this study examines the ability to provide
accurate forecasts with the HEFS using a degraded calibration sample and the ability to
measure that accuracy with a reduced validation sample.
The minimum requirements for calibrating the HEFS include an adequate historical
period and frequency of reforecasts from which to estimate the statistical parameters of
the HEFS, and sufficient ensemble members to capture the skill in the meteorological
forecasts. Since the HEFS relies on statistical modeling, consistency of the reforecasts
and operational forecasts is also important. The minimum requirements for validating the
HEFS also include an adequate sample (historical period and frequency) of reforecasts
under varying basin conditions, again without structural changes that would undermine
their interpretation. In slow responding basins, the effective sample size is reduced by
temporal autocorrelations in streamflow, implying a longer period of record for validation
(and calibration of streamflow post-processors). In fast responding basins, conditions
evolve rapidly, implying a greater frequency of reforecasts to capture large and extreme
events. Assuming the climatology is reasonably stationary, a 25-year reforecast should
capture much of this variability. However, at a one-day aggregation, flooding may occur
with a climatological frequency of 0.001 (1-in-1000 days) or less. Thus, on average, fewer than ten (0.001 × 25 × 365 ≈ 9) flood events will occur within a 25-year period. Likewise, for long-range forecasting, where fixed aggregations are often required (e.g. April-July reservoir
volumes), a 25-year reforecast will inevitably omit some important variability.
In summary, the aims of this study are twofold, namely to determine the minimum
requirements for reforecasting with the GEFS, in order to: 1) calibrate the HEFS
adequately; that is without materially reducing the quality of the forecasts, including at
high thresholds; and 2) validate the forcing and streamflow forecasts with reasonably
small sampling uncertainty. The calibration of the HEFS depends on an adequate sample
size, for which the period of record and interval between reforecasts are important. It also
depends on the number of ensemble members in the GEFS and the consistency of the
reforecasts and operational forecasts. Likewise, the validation of the HEFS depends on
an adequate sample size, for which the period of record and interval between reforecasts
are important, and a reasonably consistent and representative sample (accepting that
these two things may not be aligned). Following a description of the study basins, datasets
and approach, the verification results are presented separately for the minimum
calibration and validation requirements.
3. Approach
3.1 Study basins
Four headwater basins were considered in this study, namely: the Chikaskia River
at Corbin, Kansas (AB-CBNK1); the Dolores River at Rico in Colorado (CB-DRRC2); the
Middle Fork of the Eel River at Dos Rios in California (CN-DOSC1); and the Wood River
at Hope Valley, Rhode Island (NE-HOPR1). Figure 1 and Table 1 show the location of
each basin, its average elevation, area, and the location of the nearest grid node in the
GEFS. Table 1 also shows the annual precipitation, the fraction of precipitation that
generates runoff (the runoff coefficient), and the ratio of precipitation to potential
evaporation (a climate index). The drainage areas range from 188 square kilometers (NE-
HOPR1) to 2,057 square kilometers (AB-CBNK1) and the runoff coefficients vary from
0.12 (AB-CBNK1) to 0.55 (NE-HOPR1). The basins were chosen for a combination of
practical and hydrological reasons. First, they all originate from RFCs for which the HEFS
has been implemented and validated, namely AB-, CB-, CN-, and NE-RFCs, and for
which the absolute quality of the forecasts has been documented (Brown, 2013, 2014;
Brown et al., 2014a/b). Here, the focus is on the minimum requirements for calibrating
and validating the HEFS; that is, on the relative quality of the forecasts for different
configurations of the GEFS; and not on the absolute quality of the forecasts. Second,
headwater basins respond quickly to forcing information and, as the uncertainties and
biases propagate from upstream to downstream locations, it is important, initially, to
understand the quality of the HEFS in headwater basins. Third, headwater basins are
important for operational forecasting of water quantity and quality, including flood warning
and reservoir operations. Further downstream, the HEFS will be impacted by additional
sources of bias and uncertainty, of which some are inherently difficult to quantify (e.g. the
downstream effects of river regulations, simplified hydraulic routing and composite timing
errors; see Raff et al., 2013). As part of the ongoing evaluation of the HEFS, more
complex regimes, as well as additional sources of forcing, will be considered in future.
Figure 2 shows the daily means of temperature, precipitation, and streamflow for
each basin, where CN-DOSC1 and CB-DRRC2 both comprise an average over two sub-
basins (see below). The averages are shown by calendar month and were derived from
gauged temperature, precipitation, and streamflow over a 24-year period between 1985
and 2008 (see Section 3.3). As indicated in Figure 2, there are marked differences in the
seasonality and covariability of precipitation and runoff among these basins.
The Chikaskia River (AB-CBNK1) experiences a warm and humid summer climate.
During the late spring and early summer, cool air from Canada and the Rocky Mountains
combines with moist air from the Gulf of Mexico and hot air from the Sonoran Desert,
leading to intense thunderstorms and tornados in Kansas and Oklahoma. At AB-CBNK1,
the relationship between precipitation and runoff is modulated by the shallow terrain and
dense vegetation cover, as well as increased evapotranspiration during the summer
months.
The Dolores River (CB-DRRC2) is a tributary of the Colorado River and occupies
a narrow valley incised into the sandstone of the San Juan Mountains. Precipitation is
reasonably constant throughout the year, but falls primarily as snow during the winter
months. The snowpack melts in the late spring and early summer, which leads to a sharp
increase in runoff between April and July (Figure 2). For the purposes of hydrologic
modeling, CB-DRRC2 is separated into two sub-basins, in order to accommodate the
varied elevations there. The lower sub-basin accounts for 67% of the total area of CB-
DRRC2.
The Eel River (CN-DOSC1) drains the windward slopes of the North Coast Ranges
in Northern California (Figure 1). During the late summer and early autumn, the upper
reaches of the Eel River experience little or no precipitation and streamflow. Low flows
are accentuated by diversions to the Russian River for use in the Potter Valley Hydro-
Electric Project. In late autumn, cooler temperatures are accompanied by rapidly
increasing precipitation, to which the streamflows respond through November and
continue increasing until January (Figure 2). During the winter months, the predictability
of heavy precipitation is increased by the onshore movement of weather fronts from the
Pacific coast and their orographic lifting in the North Coast Ranges. The coastal
mountains of northern California and the Pacific Northwest are also susceptible to
atmospheric rivers, which carry moisture in narrow bands from the tropical oceans to
the mid-latitudes. Atmospheric rivers can lead to persistent, heavy, precipitation and
extreme flooding in the North Coast Ranges and further inland (Smith et al., 2010). For
the purposes of hydrologic modeling, CN-DOSC1 is separated into two sub-basins, and
the lower sub-basin accounts for 77% of the total area of CN-DOSC1.
The Wood River flows approximately 85km from its source in Sterling, Connecticut,
through Hope Valley (NE-HOPR1) in the Arcadia Management Area to Alton, Rhode
Island, where it converges with the Pawcatuck River. As indicated in Figure 2, the daily
average precipitation at NE-HOPR1 is relatively constant throughout the year, but
includes significant snowfall during winter months (the average annual snowfall is
866mm). During the early spring, rising temperatures lead to snowmelt and to a peak in
streamflow around March or April, followed by lower flows during the summer months.
3.2 Experimental design
The HEFS quantifies the total uncertainty in future streamflow as a combination of
the meteorological and hydrologic uncertainties, while correcting for biases in both the
forcing and streamflow (Demargne et al., 2014). Further information about the HEFS
methodology can be found in Appendix A. The meteorological uncertainties and biases
are quantified with the Meteorological Ensemble Forecast Processor (MEFP). The MEFP
produces ensemble forecasts of precipitation and temperature conditionally upon a raw,
single-valued, forecast (Wu et al., 2011). For the short- to medium-range, the raw
forecasts used by the MEFP include the ensemble mean of the GEFS. In removing the
meteorological biases with the MEFP, the hydrologic uncertainties and biases can be
modeled independently of the forcing uncertainties and biases (Seo et al., 2006;
Demargne et al., 2014). The hydrologic uncertainties and biases are modeled in two
stages. First, the meteorological forecasts from the MEFP are used to generate raw
streamflow forecasts, which may contain hydrologic biases, but do not explicitly account
for any hydrologic uncertainties. Secondly, the raw streamflow forecasts are post-
processed with the Ensemble Postprocessor (EnsPost). The EnsPost models the
hydrologic uncertainties and biases from the residuals between the observed and
simulated streamflows (Seo et al., 2006); that is, streamflow predictions based on
observed temperature and precipitation at the forecast issue time.
The simulations and observations used to estimate the hydrologic uncertainties
and biases are typically available for several decades at each RFC forecast location.
Likewise, the precipitation and temperature observations used to generate the streamflow
simulations and to quantify the forcing uncertainties and biases are typically available for
several decades. In contrast, the meteorological reforecasts, which are used by the MEFP
to estimate the forcing uncertainties and biases, require satellite observations and
corresponding reanalysis of the ocean-atmosphere states, in order to initialize the
weather and climate models. These datasets are only available from the early 1980s
onwards. Thus, as indicated above, the requirements of the HEFS for historical data are
primarily constrained by the availability of (appropriate initialization for the) meteorological
reforecasts.
As indicated above, the total uncertainty in the streamflow forecasts originates from
a combination of uncertainties in the meteorological forecasting and hydrologic modeling.
Depending on basin characteristics and antecedent conditions, a large fraction of the total
uncertainty can originate from the meteorological uncertainties (Kavetski et al., 2002;
Pappenberger et al., 2005; Wu et al., 2011). Thus, the meteorological forecasts are a
central component of the HEFS and other hydrologic ensemble prediction systems. When
a meteorological model is updated, any changes in the statistical properties of the
precipitation and temperature forecasts will, to some degree, impact the streamflow
forecasts from the HEFS. For example, the MEFP may be impacted by changes in the
spatial or temporal resolution of the model, including the position of grid cells in relation
to hydrologic basins, the model physics in different layers, including at the land-surface
and ocean boundaries, and the number of (or approach to generating) ensemble
members. In terms of calibrating the MEFP, these properties are important insofar as they
influence the statistical character of the precipitation and temperature forecasts, including
any systematic biases, as well as the information content more generally (e.g. measured
in terms of correlation). In general, therefore, the MEFP must be recalibrated when the
GEFS is updated in any non-trivial way. Likewise, any non-trivial changes to the HEFS
must be accompanied by new streamflow hindcasting and validation. In many cases, this
requires further hindcasting and validation by users of the HEFS, such as the NYCDEP,
who rely upon streamflow hindcasts to calibrate and validate their own forecasting and
decision support systems. Following changes to the operational GEFS, the HEFS
requires an adequate sample of meteorological reforecasts, in order to recalibrate the
MEFP and to produce and validate new forcing and streamflow hindcasts. In this context,
the minimum requirements for reforecasting include the number of historical years of data
(N), the interval between reforecasts (M), and the number of ensemble members. These
and other variables are summarized in Table 2.
In order to evaluate the effects of N and M on the quality of the precipitation and
temperature forecasts from the MEFP, the raw GEFS reforecasts (Hamill et al., 2013)
were systematically degraded from N=24 years (1985-2008) and M=1 day to
combinations of smaller N and larger M. These thinned reforecasts were used to
calibrate the MEFP and to generate forcing and streamflow hindcasts for a consistent
validation period. As indicated above, some applications of the HEFS may benefit from a
reforecast configuration that improves upon the available reforecasts, but this cannot be
established here. In degrading the raw GEFS reforecasts, the hindcasting and validation
period was fixed to 24 years (1985-2008), with a forecast issued at 12Z each day. The
choice of validation period was motivated by: 1) the need to isolate the effects of N and
M on the quality of the MEFP forecasts, independently of any background variability (i.e.
from changes in the validation period); and 2) by the choice of experimental design for
validation. In terms of the latter, independent validation is always preferred when
evaluating statistical techniques, such as the MEFP. Unless the verifying observation is
removed from the calibration sample, the statistical parameters will benefit, unfairly, from
seeing the outcome in advance of predicting it. Depending upon the number of
parameters to estimate and their sampling properties, among other factors, this
advantage can be important. The results from dependent validation should, therefore, be
regarded as a best case scenario of the actual forecast quality. In practice, however,
the MEFP is relatively parsimonious (Wu et al., 2011). In other words, a single observation
should not greatly influence the estimated parameters. Furthermore, independent
validation poses significant practical challenges, as the HEFS is an operational
forecasting system; it is not well-suited to automatic calibration, and hindcasting is
extremely time-consuming.
In evaluating the sensitivities to N, both dependent and (limited) cross-validation
were employed. Specifically, the 24-year validation period was sub-divided into smaller
calibration periods, N={2x12, 3x8, 4x6, and 6x4} years. Dependent validation involved
estimating the parameters for each sub-period, issuing forecasts for that sub-period, and
collating the forecasts from all sub-periods for validation (i.e. 24 years in total).
Independent validation involved borrowing the parameters from an adjacent sub-period.
In practice, this should be regarded as a worst case scenario for the expected forecast
quality, because independent forecasting is conducted for multiple years (i.e. 12 years,
for N=12) without recalibrating the MEFP. Table 3 summarizes the dependent and
independent calibration scenarios for N. In evaluating the sensitivities to M, the MEFP
was calibrated for M={1, 3, 5, and 7} days and forecasts were issued at 12Z each day
between 1985 and 2008. In this context, M=1 represents dependent validation, whereas
M={3, 5 and 7} involves a mixture of dependent and independent validation. Specifically,
for M=3, 5, and 7 days, 1/3rd, 1/5th and 1/7th of the validation sample appears in the
calibration sample, respectively. The calibration scenarios for M are summarized in Table
4. Alongside the precipitation and temperature forecasts from the MEFP, streamflow
forecasts were produced at the outlet of each basin (see below).
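The data-thinning scenarios for N and M can be sketched as follows. This is a simplified illustration of the splits summarized in Tables 3 and 4, working with calendar years and issue-date indices only:

```python
# Sketch of the thinning scenarios: the 24-year record is either split into
# equal sub-periods (scenarios for N) or sub-sampled by issue date
# (scenarios for M). Simplified to calendar years / day indices.
years = list(range(1985, 2009))  # 24 years, 1985-2008

def n_scenarios(years_per_period):
    """Split the record into equal calibration sub-periods."""
    k = len(years) // years_per_period
    return [years[i * years_per_period:(i + 1) * years_per_period]
            for i in range(k)]

def m_scenario(day_indices, m):
    """Retain every M-th reforecast issue date."""
    return day_indices[::m]

print([len(p) for p in n_scenarios(12)])     # N=2x12: two 12-year periods
print(len(m_scenario(list(range(365)), 3)))  # M=3: ~1/3 of daily issue dates
```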
As a post-processing technique, the MEFP aims to improve skill by reducing bias
in the raw GEFS forecasts, but does not introduce any new predictors. Thus, the quality
of the MEFP outputs is sensitively dependent on the quality of the raw forcing inputs from
the GEFS. The MEFP uses the ensemble mean from the GEFS to capture the information
content in these (re)forecasts. In order to examine the sensitivity of the MEFP outputs to
the number of ensemble members in the GEFS inputs, the GEFS reforecasts were
systematically degraded by using only a subset of the ensemble members to derive the
ensemble mean. These thinned reforecasts were used to calibrate the MEFP and to
generate forcing and streamflow hindcasts for a consistent validation period (i.e. 26 years,
from 1985-2010). In practice, the GEFS reforecasts contain fewer ensemble members
(C) than the operational forecasts (F). Specifically, the GEFSv10 reforecasts comprise
only 11 ensemble members (10 + control), while the operational forecasts comprise 21
members (20 + control). Hindcasting and validation were conducted with all available
members (11). For example, when calibrating the MEFP with an ensemble mean derived
from C=5 members, the hindcasts were generated with an ensemble mean derived from
F=11 members. However, in order to better understand the impacts of this discrepancy,
a baseline scenario was included. Here, the control run was used to both estimate the
MEFP parameters (C=1) and to derive the forcing and streamflow hindcasts (F=1). The
scenarios for C and F are summarized in Table 5.
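The member-thinning experiment can be sketched in a few lines of numpy; the array dimensions and the convention of taking the first C members are illustrative assumptions, not the MEFP implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative reforecast array: (issue dates, members, lead times); the
# sizes are toy values, not the real GEFS dimensions
reforecasts = rng.gamma(shape=2.0, scale=3.0, size=(100, 11, 60))

def thinned_mean(ens, c):
    """Ensemble mean derived from the first c members only."""
    return ens[:, :c, :].mean(axis=1)

calib_mean = thinned_mean(reforecasts, 5)      # C=5: degraded calibration input
hindcast_mean = thinned_mean(reforecasts, 11)  # F=11: full mean for hindcasting
control_only = thinned_mean(reforecasts, 1)    # baseline: control run (C=F=1)
```

In the baseline scenario, `control_only` serves as both the calibration and hindcasting input, removing the discrepancy between C and F.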
3.3 Datasets
For each scenario of N and M, hindcasts of mean areal temperature (MAT) and
mean areal precipitation (MAP) were generated with the MEFP for a 24-year period
between 1985 and 2008. For each combination of C and F, the hindcasts were generated
for the full GEFS reforecast period (1985-2010); unlike N and M, the historical period was
not integral to the validation design for C and F (see below). The hindcasts of MAP and
MAT each comprise ~60 ensemble members (the precise number varying between
basins, as described in Wu et al., 2011), with lead times varying from 6 to 360 hours in
six-hourly increments. In order to evaluate the skill of the MEFP forecasts with GEFS
inputs (MEFP-GEFS), precipitation and temperature forecasts were also generated with
a conditional or resampled climatology (MEFP-CLIM). The latter involves resampling
the historical observations of MAP and MAT in a moving window of, respectively, 61 days
and 31 days around the forecast valid date.
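The resampled climatology can be sketched as follows; `climatology_window` is a hypothetical helper, the toy record is illustrative, and the half-width corresponds to the 61-day precipitation window (a half-width of 15 would give the 31-day temperature window).

```python
import numpy as np
import datetime as dt

def climatology_window(obs_dates, obs_values, valid_date, half_width=30):
    """Pool historical observations within +/- half_width days-of-year of valid_date."""
    target = valid_date.timetuple().tm_yday
    pooled = []
    for d, v in zip(obs_dates, obs_values):
        delta = abs(d.timetuple().tm_yday - target)
        delta = min(delta, 365 - delta)  # wrap around the year boundary
        if delta <= half_width:
            pooled.append(v)
    return np.asarray(pooled)

# Toy daily record, 1948 onward (three years)
dates = [dt.date(1948, 1, 1) + dt.timedelta(days=i) for i in range(3 * 365)]
values = np.arange(len(dates), dtype=float)

# 61-day window centred on 15 July: each historical year contributes ~61 values
sample = climatology_window(dates, values, dt.date(1985, 7, 15))
```

The pooled values form the MEFP-CLIM reference ensemble for that valid date; with a ~1948-2010 record, each valid date draws on roughly 60 years of windowed observations.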
Raw streamflow hindcasts were generated with the Community Hydrologic
Prediction System (CHPS) using the precipitation and temperature forecasts from the
MEFP. The hindcasts were produced with the hydrologic models and parameter settings
used operationally in each RFC. For all RFCs considered here, the Snow Accumulation
and Ablation Model (SNOW-17; Anderson, 1973) is used together with the Sacramento
Soil Moisture Accounting Model (SAC-SMA; Burnash, 1995). The models are executed
at a six-hourly timestep, but interpolated to an hourly timestep at CB-DRRC2 and CN-
DOSC1. Routing from the headwater to the downstream basins is conducted with Lag/K
using constant or variable lag and attenuation. Historical simulations were generated with
observed forcing for each basin and used to examine the sensitivities of the hydrologic
predictions to the meteorological forcing (see below).
Observations of precipitation and temperature were obtained from each RFC and
comprise areal averages (MAP, MAT) of the gauged precipitation and temperature in
each basin. The data comprise six-hourly observations, recorded in local time, and
covering the period ~1948-2010. In order to pair the meteorological observations and
forecasts, the observed values were chosen from the nearest available synoptic times in
{0Z, 6Z, 12Z, 18Z}. This introduced a timing error into the observations of +1 hour, 0
hours, -1 hour and -2 hours for NE-HOPR1, AB-CBNK1, CB-DRRC2 and CN-DOSC1,
respectively. As the forecasts were verified at an aggregated scale of one day or larger
(see below), this timing error was deemed acceptable. The hydrologic forecasts and
simulations were paired without any timing errors.
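The pairing step can be sketched as follows: each observation time (expressed in UTC) is mapped to the nearest synoptic hour in {0, 6, 12, 18}Z, and the residual offset is the timing error. The sign convention (observation time minus synoptic time) is an assumption.

```python
SYNOPTIC_HOURS = (0, 6, 12, 18)

def nearest_synoptic(utc_hour):
    """Nearest synoptic hour (in Z) and the signed timing error in hours."""
    # Include 24 so that late-evening hours can round forward to 0Z next day
    best = min(SYNOPTIC_HOURS + (24,), key=lambda h: abs(h - utc_hour))
    return best % 24, utc_hour - best

# A 12:00 local observation in a UTC-5 zone falls at 17Z and pairs to 18Z
hour_z, err = nearest_synoptic(17)  # pairs to 18Z with a -1 hour error
```

For daily or longer aggregations, offsets of one to two hours shift only a small fraction of each accumulation window, which is why the error was deemed acceptable here.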
3.4 Verification strategy
Verification was conducted with the NWS Ensemble Verification Service (EVS;
Brown et al., 2010). The temperature and precipitation forecasts were verified against
observed temperature and precipitation, respectively. In order to establish the sensitivities
of the hydrologic forecasts to different calibrations of the MEFP, the raw streamflow
forecasts were verified against simulated streamflows. Differences between the
hydrologic forecasts and simulations reflect the contribution of the MEFP-GEFS forcing
to the quality of the streamflow forecasts, independently of any hydrologic errors and
biases (which are ordinarily removed by the HEFS Ensemble Postprocessor, EnsPost).
Aside from eliminating these hydrologic biases, simulated streamflows avoid the timing
and other errors associated with pairing streamflow forecasts and observations. For
example, the streamflow observations are only available as daily averages and in different
time zones to the forecasts. No streamflow post-processing was conducted in this study,
as the EnsPost uses hydrologic simulations and observations only and is, therefore,
insensitive to the meteorological reforecasting. In this context, the aim is to establish the
sensitivity of the HEFS forcing and streamflow forecasts to different calibrations of the
MEFP, and not to examine the absolute quality of the forecasts, which is considered
elsewhere (Brown, 2013, 2014; Brown et al., 2014a/b).
Verification was conducted both unconditionally (i.e. for all data) and conditionally
upon observed and forecast amount. Unconditional bias and skill are important, as the
HEFS is an operational forecasting system for which many applications are anticipated.
However, average conditions, particularly the ensemble mean, generally favor drier
weather and lower flows, as precipitation and streamflow are both skewed variables. In
order to compare the verification results between basins, for different forecast lead times
and valid times, and for specific aggregation periods, common thresholds were identified
for each basin. Specifically, for each aggregation period, a, and basin, b, a climatological
distribution function, F_{n,a,b}(x), was computed from the n values of the hydrometeorological
variable, x, between 1985 and 2008. Real-valued thresholds were then determined for k
non-exceedence probabilities, c_p, as F_{n,a,b}^{-1}(c_p), where c_p ∈ [0,1] and p = 1, ..., k.
These non-exceedence probabilities provide a consistent mapping between the likelihood
of a particular hydrometeorological occurrence and its corresponding real value across
different basins and aggregation periods.
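The threshold construction amounts to evaluating an empirical quantile function; a minimal numpy sketch, with a toy precipitation climatology standing in for the 1985-2008 record:

```python
import numpy as np

def thresholds_from_climatology(x, probs):
    """Real-valued thresholds F^{-1}(c_p) from the empirical distribution of x."""
    return np.quantile(np.asarray(x), probs)

rng = np.random.default_rng(1)
# Toy daily precipitation climatology (mm), ~70% dry days
precip = np.where(rng.random(8766) < 0.7, 0.0, rng.gamma(2.0, 5.0, 8766))

cp = [0.0, 0.9, 0.99, 0.995]
thr = thresholds_from_climatology(precip, cp)
# thr[0] is the zero-precipitation (PoP) threshold; thr[-1] is exceeded,
# on average, once every 200 days
```

Because the probabilities, rather than the real values, are held fixed, the same c_p maps to a comparable climatological rarity in every basin and aggregation period.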
As indicated above, verification was performed for different magnitudes of the
observed and forecast variables. When conditioning on observed amount, the quality of
the forecasting system is evaluated for the full range of historical occurrences, including
extreme events that were forecast inadequately (as small or moderate events). When
conditioning on forecast amount, the verification results may discount important observed
extremes. However, since the observed amount is unknown when a forecast is issued,
conducting verification by forecast amount is useful for guiding operational forecasting
and real-time decision making. While some verification metrics provide integral measures
of error across multiple thresholds (e.g. the mean error), others are defined for discrete
occurrences (e.g. the probability of detection). Integral measures, such as the mean error,
were derived from the subsample in which the prescribed condition was met (e.g. the
observation exceeded the threshold). Measures defined for discrete events were
computed from the observed and forecast probabilities of exceeding the threshold.
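The two conditioning strategies can be sketched as follows: an integral measure (here, the mean error of the ensemble mean) computed from the subsample in which the observation exceeds the threshold, and a discrete measure (here, the Brier score) computed from the observed and forecast exceedance probabilities. The data are synthetic.

```python
import numpy as np

def conditional_mean_error(ens_mean, obs, threshold):
    """Mean error of the ensemble mean over pairs where the observation exceeds the threshold."""
    mask = obs > threshold
    return np.mean(ens_mean[mask] - obs[mask])

def brier_score(ens, obs, threshold):
    """Brier score for exceeding the threshold: forecast probability vs. binary outcome."""
    p_fcst = (ens > threshold).mean(axis=1)   # fraction of members exceeding
    outcome = (obs > threshold).astype(float)
    return np.mean((p_fcst - outcome) ** 2)

rng = np.random.default_rng(2)
obs = rng.gamma(2.0, 5.0, 500)
ens = obs[:, None] + rng.normal(0.0, 3.0, (500, 60))  # toy 60-member forecasts

me = conditional_mean_error(ens.mean(axis=1), obs, threshold=10.0)
bs = brier_score(ens, obs, threshold=10.0)
```

The integral measure retains the full magnitude of the errors in the subsample, whereas the Brier score responds only to the forecast and observed probabilities of crossing the threshold; this distinction also underlies the CRPSS/BSS differences discussed in the results.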
4. Results and analysis
4.1 Minimum requirements for estimating the parameters of the MEFP
4.1.1 Sensitivity to the historical period and interval between reforecasts
The precipitation and temperature forecasts from the MEFP were verified against
observed MAP and MAT, respectively. The results are shown for a daily aggregation, as
this is a representative volume for short-to-medium range forecasting. The results are
presented by forecast lead time and magnitude of the forcing variable for each scenario
of N (the number of years of reforecasts) and M (the interval between reforecasts). The
analysis focuses on the sensitivity of the forecasts to N and M in terms of bias, skill, and
other attributes of forecast quality, and not on the absolute quality of the forecasts. Figure
3 provides selected verification scores (in the rows) at three climatological probabilities
(in the columns), for the MEFP-GEFS precipitation forecasts. Here, Cp=0.0 denotes the
Probability of Precipitation (PoP), while Cp=0.995 represents a daily precipitation amount
that is exceeded, on average, once every 200 days. The scores were derived from the
subsample of verification pairs in which the observed precipitation amount exceeded the
threshold. Here, the verification statistics for the daily accumulations were averaged over
the first three days of forecast lead time. The results are shown for each calibration
scenario, N={24, 12, 8, 6, and 4 years}, and for the two validation scenarios, namely
dependent validation (all scenarios of N) and cross-validation, i.e. N={12, 8, 6, 4} (see
Table 3). The verification measures are summarized in Appendix B. The correlation
coefficient measures the degree of association between the ensemble mean of the
MEFP-GEFS precipitation forecasts and the observed precipitation amount. The relative
mean error (RME) measures the fractional bias of the ensemble mean forecast, where a
negative RME denotes an under-forecasting bias. The Continuous Ranked Probability
Skill Score (CRPSS) measures the fractional improvement of the MEFP-GEFS
precipitation forecasts when compared to the MEFP-CLIM forecasts, where 1.0 denotes
a perfect score. The Brier Skill Score (BSS) also provides a lumped measure of skill
relative to the MEFP-CLIM forecasts. However, unlike the CRPSS, the BSS measures
the ability of the forecasting system to predict the exceedence (or non-exceedence) of a
discrete threshold.
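The CRPSS can be sketched with the standard sample estimator of the CRPS, E|X - y| - 0.5*E|X - X'|, where X and X' are independent draws from the forecast ensemble and y is the observation; the synthetic "GEFS-like" and climatological ensembles below are illustrative.

```python
import numpy as np

def crps(ens, obs):
    """Sample CRPS averaged over pairs: E|X - y| - 0.5 E|X - X'|."""
    term1 = np.abs(ens - obs[:, None]).mean(axis=1)
    term2 = np.abs(ens[:, :, None] - ens[:, None, :]).mean(axis=(1, 2))
    return np.mean(term1 - 0.5 * term2)

def skill_score(score, reference):
    """Fractional improvement over the reference (1.0 is perfect, <0 is worse)."""
    return 1.0 - score / reference

rng = np.random.default_rng(3)
obs = rng.normal(20.0, 5.0, 400)
main = obs[:, None] + rng.normal(0.0, 2.0, (400, 60))  # sharper, skilful system
clim = rng.normal(20.0, 5.0, (400, 60))                # climatological reference

crpss = skill_score(crps(main, obs), crps(clim, obs))
```

Substituting the Brier score of a discrete threshold exceedance for the CRPS in `skill_score` yields the BSS in the same way.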
Figures 4-7 show selected verification scores by forecast lead time for the MEFP-
GEFS precipitation forecasts at AB-CBNK1, CB-DRRC2, CN-DOSC1, and NE-HOPR1,
respectively. Again, the results are shown for subsamples in which the observed
precipitation amount exceeded Cp={0.0,0.9,0.995}. Unlike Figure 3, the results are shown
separately for each one-day accumulation, and with a separate curve for each scenario
of N. While Figures 4-7 show the verification results for selected thresholds at all forecast
lead times, Figures 8-11 show the results for all thresholds at selected forecast lead
times. In Figures 8-11, the climatological probabilities are plotted on a non-linear scale,
in order to emphasize the larger thresholds. The origin of each curve in Figures 8-11 is
the climatological PoP, i.e. the zero-precipitation threshold. The BSS denotes the ability
of the MEFP-GEFS forecasts to predict the exceedence of this threshold. The correlation,
RME and CRPSS denote the quality of the MEFP-GEFS forecasts for wet conditions, i.e.
for the subsample that exceeds the threshold, with the lowest threshold being zero.
As indicated in Figure 3, the sensitivities of the MEFP-GEFS precipitation forecasts
to the number of years of calibration data (N) are relatively small, both for the dependent
and independent validation scenarios. In general, the forecast quality is slightly reduced
under independent validation. However, as indicated above, independent forecasting for
multiple years (up to 12 years) should be regarded as a worst case scenario for the
expected forecast quality, as the MEFP should be recalibrated more frequently. The
greatest differences between dependent and independent validation occur in CB-DRRC2,
particularly for light and moderate precipitation amounts, where the forecast quality is
generally lower. This is understandable because CB-DRRC2 lies in the San Juan
Mountains of Colorado, where the steep terrain leads to reduced predictability and
increased climatological variability on inter-annual timescales. While the MEFP assumes
that the joint distribution of forecasts and observations is reasonably stationary, any
climatological non-stationarities may introduce a trade-off between larger N (smaller
sampling uncertainty) and smaller N (greater climatological specificity). As indicated in
Figure 3, for most verification scores, locations and thresholds, there is no systematic
increase in forecast quality with increasing N. Indeed, in some cases, the forecast quality
increases slightly with decreasing N. Given the sampling uncertainties, this should not be
overstated. However, it may originate from climatological variability over the validation
period and thus a greater specificity of the estimated parameters at smaller N. As indicated
in Figures 4-7, the sensitivities to N are relatively small at all forecast lead times, although
some erratic behaviors are seen at N=4 in AB-CBNK1 and CB-DRRC2, where the
absolute forecast quality is also lower. Similarly, Figures 8-11 suggest that the MEFP is
relatively insensitive to N across a broad range of precipitation thresholds. However, at
CB-DRRC2, there is a material decline in BSS for N=4, particularly for light and moderate
precipitation amounts, while the CRPSS is higher (Figure 9). These differences originate
from the structure of the BSS and CRPSS. The CRPSS is sensitive to biases in the
ensemble mean forecast, which are also smaller for N=4. The BSS is sensitive to these
biases only insofar as they impact the forecast probability (of exceeding Cp), and not to
their absolute magnitude.
Figure 12 shows the quality of the MEFP-GEFS precipitation forecasts for different
scenarios of M. The verification scores include the correlation coefficient and the RME,
together with the BSS and CRPSS. They were computed at a daily accumulation for
Cp={0.0, 0.99, 0.995}, and averaged over the first three days of forecast lead time. Figures
13-16 show the verification results at all precipitation thresholds for AB-CBNK1, CB-
DRRC2, CN-DOSC1 and NE-HOPR1, respectively. Here, the results comprise daily
accumulations at forecast lead times of 1, 2, and 3 days. In terms of data thinning, the
scenarios of N are broadly comparable to M, with M=7 comprising 1/7th of the original
calibration sample, versus 1/6th for N=4. In principle, for atmospheric variables that are
statistically dependent over multiple days, thinning by M should have a smaller impact
than an equivalent N. In practice, however, except for large-scale systems, such as
atmospheric rivers, precipitation varies over short periods and at small spatial scales, as
evidenced by the majority of forecast skill occurring in the first 1-7 days (or less at AB-
CBNK1, which is located in the Central Plains). Thus, depending on forecast lead time
and location (among other factors), thinning by M may be more or less aggressive than
an equivalent N.
As indicated in Figure 12, when averaged across forecast lead times of 1-3 days,
there is no systematic decrease in forecast quality with increasing M at any location or
precipitation threshold considered. Similarly, when considering forecast lead times of 1,
5, and 10 days separately (Figures 13-16), the quality of the MEFP-GEFS precipitation
forecasts is relatively insensitive to M at most locations. However, at AB-CBNK1, where
the forecast skill declines rapidly over the first week (Figure 4), there is a non-trivial
sensitivity to M from 0-24 hours across a range of precipitation thresholds, particularly for
the correlation coefficient, RME and CRPSS (Figure 13). This is evidenced by the range
of verification scores for different scenarios of M. To further illustrate these sensitivities,
Figure 17 shows the range of verification scores across all scenarios of M at selected
forecast lead times. Figure 18 shows the equivalent range of scores for N. Clearly, the
range of scores is not indicative of a systematic dependence of forecast quality on N or
M (see above). However, it is indicative of a sensitivity to the amount of calibration data
available. In general, AB-CBNK1 shows the greatest sensitivities to M and N, while CB-
DRRC2 is only sensitive to N (and specifically to N=4, as indicated above).
In order to illustrate the effects of N and M on the largest observed and forecast
precipitation amounts, box plots were computed from the MEFP-GEFS precipitation
forecasts. Figure 19 shows box plots of the forecast errors for each basin (in the rows)
and for two scenarios of N (in the columns), namely N={24, 12}. The results are plotted
against observed precipitation amount and are shown at a forecast lead time of 0-24
hours. Figure 20 shows the corresponding results against forecast precipitation amount,
specifically the ensemble mean forecast. Selected quantiles of the forecasting errors are
plotted together with the median error and range (extreme residuals) as whiskers. The
verifying observation is denoted by the zero-error line. Verification pairs for which the
observation falls outside the ensemble range are denoted as misses. However, each
forecast comprises only a limited number of ensemble members (~60). Thus, some misses
should be expected, even if the forecasts are conditionally unbiased; for a reliable
m-member ensemble, the observation falls outside the ensemble range with probability
2/(m+1), or roughly 3% for m=60. Figures 21 and 22
show box plots of the errors in the MEFP-GEFS precipitation forecasts for two scenarios
of M, namely M={1, 3}, again ordered by observed and forecast precipitation amount,
respectively. Here, each box represents one ensemble forecast from the period 0-24
hours. As indicated in Figures 19 and 21, there is no systematic decline in forecast quality
at N=12 or M=3 for the most extreme observed precipitation amounts. Rather, any
differences between scenarios are consistent with sampling uncertainty. While there are
some differences in the largest precipitation forecasts (by ensemble mean) for N=12
(Figure 20) and M=3 (Figure 22), these differences are again consistent with sampling
uncertainty and do not translate into additional skill for N=24 or M=1 (e.g. Figures 8-11).
Figure 23 shows the quality of the temperature forecasts at selected thresholds for
each scenario of N, while Figure 24 shows the corresponding results for each scenario of
M. The results for N include both validation scenarios, namely dependent validation, i.e.
N={24, 12, 8, 6, 4}, and cross-validation, i.e. N={12, 8, 6, 4} (see Table 3). The verification
metrics include the mean error of the ensemble mean forecast (°C), the correlation
coefficient, BSS and CRPSS. The metrics were com