An evaluation of the minimum requirements for meteorological
reforecasts from the Global Ensemble Forecast System (GEFS)
of the U.S. National Weather Service (NWS) in support of the calibration and validation of the
NWS Hydrologic Ensemble Forecast Service (HEFS)
Revision number: extended, final
Prepared by Hydrologic Solutions Limited for the U.S. National Weather Service (NWS) under subcontract SubK-2013-1003 with Lynker Technologies LLC in support of NWS Prime Contract
DG133W-13-CQ-0042 (in fulfillment of Deliverable No. 4 and, in part, Deliverable No. 9 of Task 3)
Dr. James Brown ([email protected])
August 2015
HSL
2 of 120
Contents
i. List of figures .................................................................................................... 3
ii. List of tables ..................................................................................................... 9
1. Executive summary ........................................................................................ 10
2. Introduction ..................................................................................................... 16
3. Approach ......................................................................................................... 20
3.1 Study basins ..................................................................................................... 20
3.2 Experimental design ......................................................................................... 23
3.3 Datasets ........................................................................................................... 27
3.4 Verification strategy .......................................................................................... 28
4. Results and analysis ...................................................................................... 29
4.1 Minimum requirements for estimating the parameters of the MEFP ................. 29
4.1.1 Sensitivity to the historical period and interval between reforecasts ................. 29
4.1.2 Sensitivity to the number of ensemble members in the GEFS.......................... 35
4.2 Minimum requirements for verifying the HEFS forecasts .................................. 40
5. Summary and recommendations .................................................................. 45
6. Glossary of terms and acronyms .................................................................. 53
7. References ...................................................................................................... 59
8. Tables .............................................................................................................. 64
9. Figures ............................................................................................................. 68
APPENDIX A: The Hydrologic Ensemble Forecast Service (HEFS)...................... 113
APPENDIX B: Verification measures ....................................................................... 117
a. Relative mean error ........................................................................................ 117
b. Brier Score and Brier Skill Score .................................................................... 117
c. Continuous Ranked Probability Score and skill score .................................... 118
d. Reliability diagram .......................................................................................... 118
e. Relative Operating Characteristic ................................................................... 119
f. Cumulative rank histogram ............................................................................. 119
i. List of figures
Figure 1: The four study basins, including their average elevation, the location of each outlet
(gaging station), and the positions of the nearest grid nodes in the GEFS.
Figure 2: Daily average temperature, total daily precipitation and daily average streamflow by
calendar month for each study basin.
Figure 3: Selected verification metrics for the MEFP-GEFS precipitation forecasts. The results
are shown for the dependent (solid) and independent (dashed) validation scenarios of N
(the number of years of calibration data), and include several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 4: Selected verification metrics for the MEFP-GEFS precipitation forecasts at AB-CBNK1.
The results are plotted against forecast lead time for each scenario of N (the number of
years of calibration data), and are shown for several non-exceedence climatological
probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 5: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CB-DRRC2.
The results are plotted against forecast lead time for each scenario of N (the number of
years of calibration data), and are shown for several non-exceedence climatological
probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 6: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CN-DOSC1.
The results are plotted against forecast lead time for each scenario of N (the number of
years of calibration data), and are shown for several non-exceedence climatological
probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 7: Selected verification metrics for the MEFP-GEFS precipitation forecasts at NE-HOPR1.
The results are plotted against forecast lead time for each scenario of N (the number of
years of calibration data), and are shown for several non-exceedence climatological
probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 8: Selected verification metrics for the MEFP-GEFS precipitation forecasts at AB-CBNK1.
The results are plotted against climatological non-exceedence probability (Cp) for each
scenario of N (the number of years of calibration data), and are shown for several forecast
lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-
CLIM forecasts.
Figure 9: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CB-DRRC2.
The results are plotted against climatological non-exceedence probability (Cp) for each
scenario of N (the number of years of calibration data), and are shown for several forecast
lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-
CLIM forecasts.
Figure 10: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CN-
DOSC1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of N (the number of years of calibration data), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 11: Selected verification metrics for the MEFP-GEFS precipitation forecasts at NE-
HOPR1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of N (the number of years of calibration data), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 12: Selected verification metrics for the MEFP-GEFS precipitation forecasts. The results
are plotted against the interval between reforecasts (M days) used to calibrate the MEFP,
and are shown for several non-exceedence climatological probabilities (Cp). The reference
forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.
Figure 13: Selected verification metrics for the MEFP-GEFS precipitation forecasts at AB-
CBNK1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of M (the interval between reforecasts in days), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 14: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CB-
DRRC2. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of M (the interval between reforecasts in days), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 15: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CN-
FTSC1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of M (the interval between reforecasts in days), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 16: Selected verification metrics for the MEFP-GEFS precipitation forecasts at NE-
HOPR1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of M (the interval between reforecasts in days), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 17: Range (maximum-minimum) of selected verification metrics for the MEFP-GEFS
precipitation forecasts. The results are plotted against climatological non-exceedence
probability (Cp) across all scenarios of M (interval between reforecasts in days), and are
shown for several forecast lead times. The reference forecasts for the CRPSS and the
BSS comprise the MEFP-CLIM forecasts.
Figure 18: Range (maximum-minimum) of selected verification metrics for the MEFP-GEFS
precipitation forecasts. The results are plotted against climatological non-exceedence
probability (Cp) across all scenarios of N (the number of years of calibration data), and are
shown for several forecast lead times. The reference forecasts for the CRPSS and the
BSS comprise the MEFP-CLIM forecasts.
Figure 19: Box plots of forecast errors against observed precipitation amount for N={24 and 12}
years of calibration data. The results are shown at a forecast lead time of 0-24 hours.
Figure 20: Box plots of forecast errors against forecast precipitation amount (ensemble mean)
for N={24 and 12} years of calibration data. The results are shown at a forecast lead time
of 0-24 hours.
Figure 21: Box plots of forecast errors against observed precipitation amount for calibration
scenarios of M={1 and 5} days between reforecasts. The results are shown at a forecast
lead time of 0-24 hours.
Figure 22: Box plots of forecast errors against forecast precipitation amount (ensemble mean)
for calibration scenarios of M={1 and 5} days between reforecasts. The results are shown
at a forecast lead time of 0-24 hours.
Figure 23: Selected verification metrics for the MEFP-GEFS temperature forecasts. The results
are shown for the dependent (solid) and independent (dashed) validation scenarios of N
(the number of years of calibration data), and include several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 24: Selected verification metrics for the MEFP-GEFS temperature forecasts. The results
are plotted against the interval between reforecasts (M days) used to calibrate the MEFP,
and are shown for several non-exceedence climatological probabilities (Cp). The reference
forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.
Figure 25: Selected verification metrics for the MEFP-GEFS temperature forecasts at AB-
CBNK1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of N (the number of years of calibration data), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 26: Selected verification metrics for the MEFP-GEFS temperature forecasts at CB-
DRRC2. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of N (the number of years of calibration data), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 27: Selected verification metrics for the MEFP-GEFS temperature forecasts at CN-
DOSC1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of N (the number of years of calibration data), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 28: Selected verification metrics for the MEFP-GEFS temperature forecasts at NE-
HOPR1. The results are plotted against climatological non-exceedence probability (Cp) for
each scenario of N (the number of years of calibration data), and are shown for several
forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the
MEFP-CLIM forecasts.
Figure 29: Selected verification metrics for the MEFP-GEFS streamflow forecasts. The results
are shown for the dependent (solid) and independent (dashed) validation scenarios of N
(the number of years of calibration data), and include several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 30: Selected verification metrics for the MEFP-GEFS streamflow forecasts. The results
are plotted against the interval between reforecasts (M days) used to calibrate the MEFP,
and are shown for several non-exceedence climatological probabilities (Cp). The reference
forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.
Figure 31: Residuals of selected verification metrics for the MEFP-GEFS precipitation forecasts
when calibrating the MEFP with an ensemble mean derived from C=11 members versus
C=1 (F=11). The results are shown by forecast lead time for several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 32: Residuals of selected verification metrics for the MEFP-GEFS precipitation forecasts
when calibrating the MEFP with an ensemble mean derived from C=11 members versus
C=1 (F=11). The results are shown by climatological non-exceedence probability at
selected forecast lead times. The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 33: Sensitivity of the MEFP-GEFS precipitation forecasts to the number of members (C)
used to calibrate the MEFP. The results comprise an average over the middle portion of
the forecast horizon (4-8 days) for selected climatological probabilities (Cp). The bold lines
show the calibration scenarios with F=11 forecast members. The dashed line shows the
(C=1, F=1) scenario.
Figure 34: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CN-
DOSC1. The results are shown by forecast lead time for multiple calibration (C) and
forecasting (F) scenarios and for several non-exceedence climatological probabilities (Cp).
The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.
Figure 35: Residuals of selected verification metrics for the MEFP-GEFS temperature forecasts
when calibrating the MEFP with an ensemble mean derived from C=11 members versus
C=1 (F=11). The results are shown by forecast lead time for several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 36: Residuals of selected verification metrics for the HEFS streamflow forecasts when
calibrating the MEFP with an ensemble mean derived from C=11 members versus C=1
member (F=11). The results are shown by forecast lead time for several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 37: Residuals of selected verification metrics for the MEFP-GEFS precipitation forecasts
when calibrating the MEFP with an ensemble mean derived from C=11 members versus
C=5 (F=11). The results are shown by forecast lead time for several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 38: Residuals of selected verification metrics for the MEFP-GEFS temperature forecasts
when calibrating the MEFP with an ensemble mean derived from C=11 members versus
C=5 (F=11). The results are shown by forecast lead time for several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 39: Residuals of selected verification metrics for the HEFS streamflow forecasts when
calibrating the MEFP with an ensemble mean derived from C=11 members versus C=5
members (F=11). The results are shown by forecast lead time for several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 40: Cumulative rank histograms for the HEFS streamflow forecasts when calibrating the
MEFP with an ensemble mean derived from C=11 members (solid) and C=5 members
(dashed). The results are shown at a forecast lead time of 96-120 hours and for observed
streamflow volumes that exceed several (non-exceedence) climatological probabilities.
Figure 41: Selected verification scores for the MEFP-GEFS precipitation forecasts. The nominal
scores are shown for each scenario of N (solid lines), together with the range of scores
across the subcases of each scenario. The results include several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 42: Selected verification scores for the MEFP-GEFS precipitation forecasts. The nominal
scores are shown for each scenario of M (solid lines), together with the range of scores
across the subcases of each scenario. The results include several non-exceedence
climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS
comprise the MEFP-CLIM forecasts.
Figure 43: Reliability diagrams and corresponding sharpness plots (base 10 logarithm of the
sample size, n) for the MEFP-GEFS precipitation forecasts at N=12. The results are shown
for selected climatological non-exceedence probabilities (Cp), including the Probability of
Precipitation (PoP; Cp=0.0), and comprise a daily aggregation between 0-24 hours.
Alongside the nominal values (bold lines), the range of scores is shown for the two sub-
periods of N=12.
Figure 44: Reliability diagrams and corresponding sharpness plots (base 10 logarithm of the
sample size, n) for the MEFP-GEFS precipitation forecasts at M=5. The results are shown
for selected climatological non-exceedence probabilities (Cp), including the Probability of
Precipitation (PoP; Cp=0.0), and comprise a daily aggregation between 0-24 hours.
Alongside the nominal values (bold lines), the range of scores is shown for the five sub-
periods of M=5.
Figure 45: Probability of Detection (PoD) and Probability of False Detection (PoFD) for flooding
at NE-HOPR1. The results are shown for each ensemble member (48 in total) and for
three validation scenarios at a reforecast interval of M=3, namely the full period of record
(daily reforecasts) and the three sub-periods (reforecasts every 3 days, offset by 1 day).
The PoD is highlighted at a PoFD of 0.015.
ii. List of tables
Table 1: Characteristics of the study basins
Table 2: Reforecast configuration parameters
Table 3: Calibration and validation scenarios for N
Table 4: Calibration and validation scenarios for M
Table 5: Average sample sizes by climatological probability (Cp) and reforecast scenario
Table 6: Calibration and validation scenarios for the number of ensemble members
1. Executive summary
Motivation
The Hydrologic Ensemble Forecast Service (HEFS) quantifies the total uncertainty
in future streamflow as a combination of the meteorological forcing uncertainties
and the hydrologic modeling uncertainties. Reliable and skillful weather and
climate forecasting is central to reliable and skillful streamflow forecasting. The
HEFS Meteorological Ensemble Forecast Processor (MEFP) quantifies the
meteorological uncertainties and corrects for biases in the forcing inputs to the
HEFS. For the medium-range (1-15 days), the MEFP uses precipitation and
temperature forecasts from the Global Ensemble Forecast System (GEFS) of the
National Centers for Environmental Prediction (NCEP).
The ability of the HEFS to provide useful information for decision making depends
upon the accuracy of the forecast probabilities. Crucially, there is a need to
demonstrate this accuracy through hindcasting and validation. Hindcasting is
necessary to benchmark and improve the HEFS, to optimize decision support
systems that rely upon the HEFS, and to build confidence among decision makers
that the forecasts are accurate, useful, and can lead to better decisions. For
example, the New York City Department of Environmental Protection (NYCDEP)
is using the HEFS to improve the management of risks to water supply objectives
in the NYC area. The NYCDEP has developed an Operational Support Tool (OST),
which optimizes the quantity and quality of water stored in the NYC reservoirs and
helps to avoid unnecessary, multi-billion-dollar infrastructure costs. The NYCDEP
relies on streamflow hindcasts from the HEFS, supported by meteorological
reforecasts from the GEFS, in order to optimize and validate the OST.
Large and extreme hydrologic events are critically important to users of the HEFS,
as they pose a significant threat to life and property. Given the manifest
uncertainties in forecasting hydrologic extremes, the ability of the HEFS to quantify
these uncertainties (and correct for systematic biases) is an important advantage
over deterministic forecasting systems. However, validating the HEFS for large and
extreme events relies upon an adequate archive of meteorological reforecasts.
In order to determine the minimum requirement of the HEFS for meteorological
reforecasts from the GEFS, this report considers the sensitivity of the HEFS to a
limited number of reforecast configuration options. Understanding the minimum
requirements for calibrating and validating the HEFS is a necessary but not a
sufficient condition for understanding the minimum requirements of end users for
meteorological and hydrologic reforecasts. The requirements of end users, such
as the NYCDEP, will be gathered and presented separately.
Approach
In order to determine the minimum requirements for meteorological reforecasting
in support of calibrating and validating the HEFS, a 26-year reforecast dataset was
obtained for the current GEFS. Among other factors, the costs associated with
meteorological reforecasting depend on the historical period considered (N years),
the interval between reforecasts (M days), and the number of ensemble members
in each forecast (C). By sub-sampling the GEFS reforecasts, the MEFP was
calibrated for different combinations of N, M and C. The sensitivities of the
temperature and precipitation forecasts from the MEFP and the streamflow
forecasts from the HEFS were then explored through hindcasting and validation.
Forcing and streamflow hindcasts were produced and validated at four headwater
basins: the Chikaskia River at Corbin, Kansas (AB-CBNK1); the Dolores River at
Rico in Colorado (CB-DRRC2); the Middle Fork of the Eel River at Dos Rios in
California (CN-DOSC1); and the Wood River at Hope Valley, Rhode Island (NE-
HOPR1). The hindcasts were generated at 12Z for each day in the historical period
of record. Within this fixed period, the calibration of the MEFP varied according to
N, M and C. To ensure that the hindcasting was both practical and statistically
reasonable, a combination of dependent and (limited) cross-validation was used.
In exploring the sensitivities to N, a 24-year validation period was sub-divided into
smaller calibration and forecasting periods, namely N={2x12, 3x8, 4x6, and 6x4}
years. Dependent validation involved calibrating the MEFP and generating
hindcasts for each sub-period and then pooling all of the sub-periods for validation.
Independent validation involved borrowing the parameters from an adjacent sub-
period. While dependent validation may be regarded as a best case scenario for
the expected forecast quality, using parameters from adjacent sub-periods should
be regarded as a worst case scenario; in practice, the MEFP would be recalibrated
more frequently. In evaluating the sensitivities to M, the MEFP was calibrated for
M={1, 3, 5, and 7} days and hindcasts produced daily for the fixed historical period.
The sensitivities to C were examined by calibrating the MEFP with an ensemble
mean derived from C={1, 5, and 11} of the ensemble members from the GEFS
reforecasts. In practice, the GEFS reforecasts contain fewer ensemble members
(F=11) than the operational forecasts (F=21). Since the operational HEFS
forecasts use all available GEFS members (F=21), the HEFS reforecasts were
also generated with all available GEFS members (F=11). However, in order to
understand the impacts of this discrepancy, a baseline reforecast was generated
with the control run only (F=1), using the corresponding MEFP calibration (C=1).
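The sub-sampling of the reforecast archive described above can be sketched in Python. This is an illustrative reconstruction, not the actual MEFP/HEFS tooling: the function names and the exact archive length are assumptions, but the logic mirrors the scenarios of N (partitioning the archive into equal calibration sub-periods) and M (thinning daily reforecasts to every M-th day, with offsets giving the sub-periods).

```python
from datetime import date, timedelta

def thin_by_interval(dates, m, offset=0):
    """Keep every m-th reforecast date, starting at the given offset.

    Illustrates the M-scenarios: for M=3, the offsets 0, 1 and 2 yield
    three sub-periods of reforecasts issued every 3 days.
    """
    return dates[offset::m]

def split_into_subperiods(dates, n_years, years_per_period):
    """Partition an N-year archive into equal calibration sub-periods,
    e.g. N=24 years split as 2x12, 3x8, 4x6 or 6x4."""
    periods = n_years // years_per_period
    days_per_period = len(dates) // periods
    return [dates[i * days_per_period:(i + 1) * days_per_period]
            for i in range(periods)]

# Hypothetical daily 12Z reforecast dates for a 24-year archive
# (365-day years for simplicity; the real archive includes leap days).
start = date(1985, 1, 1)
dates = [start + timedelta(days=d) for d in range(24 * 365)]

# The three sub-periods of the M=3 scenario, offset by 1 day each.
subsets_m3 = [thin_by_interval(dates, 3, k) for k in range(3)]

# The 2x12 scenario for N: two 12-year calibration sub-periods.
subperiods = split_into_subperiods(dates, 24, 12)
```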
The minimum requirements for validating the HEFS depend on N and M (not C)
and were examined both theoretically and empirically. Theoretically, verification is
concerned with the estimation of statistical measures. The quality of these
estimates will depend on the number of samples available and their unique
information content. Empirically, the effect of reducing the number of reforecasts
available is to increase the sampling uncertainty of the verification results and to
render some (typically large) events unverifiable, depending on the choice of
measure. In order to illustrate the effects of N and M on the uncertainties
associated with validating the HEFS, each sub-sample of N and M was verified
separately and the results compared to the nominal scores for N=24 and M=1.
Results
In terms of the quality of the MEFP forcing and HEFS streamflow forecasts, there
is no systematic decline in forecast quality as the interval between reforecasts (M)
increases from 1 to 7 days or the historical period (N) decreases from 24 to 4 years.
However, when considering the sensitivity of the verification scores to N and M,
measured by the range of scores across these scenarios, there are meaningful
differences, particularly at higher thresholds of precipitation and streamflow. In this
context, sensitivity is a necessary but not a sufficient condition for a decline in
forecast quality. These results imply some sensitivity to N and M, but they do not
suggest a consistent decline in forecast quality with increasing M or decreasing N.
In practice, a reforecast archive of N=12 years and M=1 day (among other
combinations) should be adequate to calibrate the MEFP, but it would not be
adequate to validate the HEFS for large events, as described below.
Against the best available calibration (C) and forecasting (F) scenario (C=11,
F=11), there is a material decline in the quality of the HEFS forcing and streamflow
forecasts when using the control member only (C=1, F=1). For some basins,
metrics and thresholds, this is minimized by using all ensemble members to
generate the HEFS forecasts (C=1, F=11). For precipitation and streamflow, the
greatest differences occur at CN-DOSC1, particularly in the middle and latter
portion of the forecast horizon, where using C=11 (F=11) members rather than
C=1 (F=11) yields a gain equivalent to one or more days of forecast lead time. The improvements in
temperature are greatest at AB-CBNK1 and NE-HOPR1, particularly at the hottest
observed temperatures and during the middle portion of the forecast horizon,
where the CRPSS is increased by ~10% in absolute terms (~30% relative to the
baseline CRPSS). In contrast, when calibrating the MEFP with C=5 ensemble
members (F=11), the forcing and streamflow forecasts are no more reliable or
skillful than those calibrated with C=11 members (F=11). Thus, for the locations,
thresholds, and verification metrics considered, 5 ensemble members should be
adequate to calibrate the MEFP, while the operational forecasts would benefit from
using all available ensemble members.
The minimum requirements for validating the HEFS are examined both
theoretically and empirically. At a daily aggregation, the average number of
verification pairs for which the observed value exceeds a climatological probability,
Cp, is ~365N(1-Cp). In order to estimate a lumped verification score with reasonably
small sampling uncertainty, 30 or more independent samples may be required.
Thus, if all of the large (e.g. >Cp=0.995) events in a verification sample are
statistically independent, and reforecasts are issued once per day, approximately 16.4 years of
reforecasts would be required, on average, to generate a verification sample with
30 large events. Clearly, these requirements increase dramatically with
increasing Cp; in general, the probability of flooding at a daily aggregation is less
than 1-in-200 (Cp=0.995). They also increase when the individual samples are
related to each other (e.g. one flood event that spans several days), as described
below. More detailed metrics, such as the reliability diagram and Relative
Operating Characteristic (ROC), require many more samples than a lumped
verification score (perhaps 100-200 samples).
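The sample-size arithmetic above can be made explicit with a short, hypothetical helper (the function name and the 365-day year are assumptions; the formula is the ~365N(1-Cp) relationship stated in the text):

```python
def required_years(n_events, cp, reforecasts_per_year=365.0):
    """Years of daily reforecasts needed, on average, to collect n_events
    observations exceeding the climatological probability Cp, assuming
    independent events (expected sample size ~ 365 * N * (1 - Cp))."""
    return n_events / (reforecasts_per_year * (1.0 - cp))

# 30 events above Cp=0.995 (a 1-in-200-day event): ~16.4 years
print(round(required_years(30, 0.995), 1))
```

Raising the threshold to Cp=0.999 roughly quintuples the requirement, which is why the report emphasizes that requirements "increase dramatically with increasing Cp".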
For two time-series (forecasts and observations) that are both autocorrelated in
time, the effective sample size for verification is smaller than the nominal sample
size. As the correlations increase, the effective sample size declines. For example,
the lag-1 autocorrelation of streamflow at AB-CBNK1 for a daily aggregation is
0.542, while the lag-1 autocorrelation at NE-HOPR1 is 0.897. Based on sampling
theory, the effective sample size for computing the cross-correlation between the
observed and forecast time-series would be 55% of the nominal sample size at
AB-CBNK1 and 11% at NE-HOPR1. In other words, for a given amount of
confidence in the streamflow verification, roughly 9x more data would be required
at NE-HOPR1 than implied by the nominal sample size and 2x more data at AB-
CBNK1. While precipitation is generally autocorrelated over much shorter time-scales
than streamflow, the short duration of precipitation events also increases the
probability that data thinning (e.g. from M=1 to M=3) would significantly reduce the number of extreme events in the
sample. Thus, based on sampling theory alone, reducing the reforecast period and
frequency would systematically reduce the precipitation and streamflow thresholds
at which the HEFS, and associated decision support, could be validated and
optimized, particularly for multi-day aggregations, such as reservoir inflows.
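The effective-sample-size reduction quoted above follows from a standard large-sample result for the cross-correlation of two lag-1 autocorrelated series, n_eff/n = (1 - r1*r2)/(1 + r1*r2). A minimal sketch (the function name is an assumption; the autocorrelations are those reported in the text, with the forecasts assumed to share the observed lag-1 autocorrelation):

```python
def effective_fraction(r_obs, r_fcst):
    """Fraction of the nominal sample size that is effectively independent
    when estimating the cross-correlation between two time-series with
    lag-1 autocorrelations r_obs and r_fcst:
        n_eff / n = (1 - r_obs * r_fcst) / (1 + r_obs * r_fcst)."""
    return (1.0 - r_obs * r_fcst) / (1.0 + r_obs * r_fcst)

print(round(effective_fraction(0.542, 0.542), 2))  # ~0.55 at AB-CBNK1
print(round(effective_fraction(0.897, 0.897), 2))  # ~0.11 at NE-HOPR1
```

The reciprocals of these fractions give the "2x" and "9x" data multipliers cited for AB-CBNK1 and NE-HOPR1, respectively.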
The sensitivities of the validation results to different configurations of N and M are
also explored empirically. Here, the range of verification scores between cases of
N and M is much smaller than the range of scores within cases for different subsets
of the validation data. Thus, as anticipated from theory, the minimum requirements
for validating the MEFP are much greater than the minimum requirements for
calibrating the MEFP, even for relatively simple verification scores. Of the
verification scores considered here, the cross-correlation is particularly variable
across the subsets of N and M. For example, at AB-CBNK1, precipitation amounts
that exceed Cp=0.995 show correlations of between -0.1 and 0.6 in the three
subsets of M=3. Thus, for a 1-in-200 day precipitation amount at AB-CBNK1,
forecasts issued every three days over a 24-year period would be unverifiable. For
more detailed verification metrics, such as the reliability diagram, the thresholds
for which the HEFS remains verifiable are even smaller. Thus, even with daily
reforecasts between 1985 and 2008, the sample sizes are too small to evaluate
reliability diagrams for moderately large precipitation amounts (Cp=0.99), as these
events are rarely forecast with high probability. However, this partly originates from
a conditional bias in the precipitation forecasts at high observed thresholds, which
would not be addressed by increasing the number of reforecasts alone.
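The sample-size problem at high thresholds can be illustrated with a back-of-envelope count of exceedances. The sketch below uses the thresholds and thinning configurations described above, but ignores autocorrelation and seasonality, which reduce the effective count further:

```python
# Approximate number of verification pairs exceeding the climatological
# probability threshold Cp, for N years of reforecasts issued every M days.
# A rough sketch only: autocorrelation and seasonality are ignored.
def expected_exceedances(n_years, m_days, cp):
    return (365.0 * n_years / m_days) * (1.0 - cp)

print(expected_exceedances(24, 1, 0.995))  # M=1, Cp=0.995: ~44 events
print(expected_exceedances(24, 3, 0.995))  # M=3, Cp=0.995: ~15 events
print(expected_exceedances(24, 1, 0.99))   # M=1, Cp=0.99:  ~88 events
```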
In summary, therefore, the minimum requirements for meteorological reforecasting
in support of the HEFS are determined, primarily, by the need to validate the HEFS
with reasonably small sampling uncertainty, including for large events. In general,
simple, unconditional, verification measures cannot guide operational practice,
because they are not application-specific. For example, a flood warning may be
triggered when the forecast probability of flooding exceeds some threshold. In this
context, there is trade-off between issuing warnings too regularly (low probability
threshold) and failing to warn when floods actually occur (high probability
threshold). Given an adequate sample of historical flood occurrences, this trade-
off, and hence the triggering threshold, can be defined, objectively, through
hindcasting and validation. By way of illustration, the use of a degraded reforecast
of M=3 at NE-HOPR1 could lead to flood warnings that are correct on only 40% of
occasions, when they could be correct on 58% of occasions for a warning threshold
optimized to daily reforecasts (i.e. M=1). For users of the HEFS, such as the
NYCDEP, a long and consistent record of historical forecasts is, therefore,
essential; it is necessary to optimize and improve decision support systems and to
benchmark these systems against historical analogs for future extremes.
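The warning-threshold trade-off described here can be made concrete with a small sketch. The forecast probabilities and flood outcomes below are hypothetical, not taken from the hindcasts; the point is only that raising the threshold trades missed floods for fewer false warnings:

```python
# Sketch: selecting a flood-warning probability threshold from hindcasts.
# probs are forecast probabilities of flooding; flooded records whether a
# flood occurred (hypothetical data, for illustration only).
def warning_stats(probs, flooded, threshold):
    """Return (fraction of warnings that verified, fraction of floods warned)."""
    warnings = [p >= threshold for p in probs]
    hits = sum(w and f for w, f in zip(warnings, flooded))
    n_warn = sum(warnings)
    n_flood = sum(flooded)
    precision = hits / n_warn if n_warn else float("nan")
    recall = hits / n_flood if n_flood else float("nan")
    return precision, recall

probs   = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.8]
flooded = [True, True, False, True, False, False, False, True]
for t in (0.3, 0.5, 0.7):
    p, r = warning_stats(probs, flooded, t)
    print(f"threshold={t}: fraction correct={p:.2f}, fraction warned={r:.2f}")
```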
Recommendations
Reforecasting requires significant human and computational resources.
However, unsophisticated approaches to data thinning, such as reducing the
number of historical years (N) or increasing the interval between reforecasts (M),
will also reduce the value of these reforecasts for hydrologic applications. In terms
of the HEFS, the greatest impacts of reducing the sample of historical reforecasts
would be to prevent the validation of large events with the necessary statistical
confidence. These events are critically important to users of the HEFS, such as the
NYCDEP. Thus, any approach to data thinning must accommodate a reasonable
sample of large and extreme events. The frequency of reforecasts (M) should also
accommodate rapidly evolving hydrometeorological conditions, for which M>1 day
would not be appropriate for the short-to-medium range.
Reducing the number of ensemble members in the GEFS reforecasts (C) will reduce their value for statistical post-processing and other
applications. Nevertheless, as C increases, there are diminishing gains for the
reliability and skill of the MEFP outputs. This study indicates that C=5 ensemble
members should be adequate to calibrate the MEFP with the GEFSv10. However,
this cannot be generalized to other techniques or to future implementations of the
MEFP. Indeed, the benefits of reforecasting with additional members may vary with
location and forecast conditions, and they may be greater for extreme events (for
which sample measures of forecast quality are inherently limited). Thus, any
compromise should be reviewed as models and applications evolve and diagnostic
techniques become more sophisticated.
While the costs associated with meteorological reforecasting are substantial, the
benefits are even more substantial. Thus, a concerted effort should be made to
produce reforecasts every day over the maximum historical period for which there
is adequate data to initialize the GEFS, rather than compromising on N or M. In
the absence of a complete reforecast, more sophisticated approaches to data
thinning will be required. Here, emphasis on early forecast lead times and extreme
events will increase the utility of a limited reforecast for hydrologic applications.
Spatial pooling or regionalization may improve the sample sizes for calibration and
validation of the MEFP. Studies are underway to establish whether reforecasts
from hydrometeorologically similar basins can be used to augment the calibration
and validation of the HEFS. However, spatial pooling cannot satisfy user
requirements for long historical records at critical forecast locations. Also, in
validating the streamflow forecasts, spatial pooling is inherently difficult, as
hydrologic state variables, unlike atmospheric state variables, often vary abruptly
(over short distances), and with myriad basin characteristics.
An adequate sample of historical events, including large and extreme events, is
only one of several minimum requirements for users of weather and climate
reforecasts. Other requirements include the timely communication of development
plans, use of transitional arrangements for legacy models (e.g. temporary freezing
of models), software version control, coordination of model updates with users,
timely access to reforecasts, and consistency of the reforecasts and operational
forecasts (including initializations), among others. Collectively, these requirements
should contribute to a renewed effort by the NWS and other operational forecasting
agencies to deliver weather, climate, and water (re)forecasts for improved decision
support. This broader set of requirements must be addressed separately,
alongside the minimum requirements of end users, such as the NYCDEP.
2. Introduction
The Hydrologic Ensemble Forecast Service (HEFS) is an operational hydrologic
forecasting system that is being implemented by the thirteen River Forecast Centers
(RFCs) of the U.S. National Weather Service (NWS). The HEFS quantifies the total
uncertainty in future streamflow as a combination of the meteorological forcing uncertainty
and the hydrologic modeling uncertainty, while correcting for biases in the forecast
probabilities (Seo et al., 2010; Demargne et al., 2010, 2014; Brown et al., 2014a/b). The
HEFS ingests weather and climate forecasts from, among other sources, the Global
Ensemble Forecast System (GEFS) of the National Centers for Environmental Prediction
(NCEP), as well as NCEP's Climate Forecast System Version 2 (CFSv2), and produces
ensemble streamflow forecasts for the short- to long- range. The HEFS aims to: 1) span
lead times from one hour to one year or more with seamless transitions between
forecast time horizons; 2) issue forecast probabilities that are unbiased for different
aggregation periods; 3) be spatially and temporally consistent across RFC domains; 4)
capture information from current operational weather and climate forecasting systems,
while correcting for biases; 5) be consistent with retrospective forecasts or hindcasts
that are used for verification and decision support; and 6) be properly validated, in order to identify the strengths and weaknesses of the forecasts and to guide forecasting
operations and decision support.
By explicitly accounting for the uncertainties inherent in meteorological and
hydrologic forecasting, while correcting for biases in the forecast probabilities, the HEFS
aims to support improved, risk-based, decision making for a variety of water resources
applications, including reservoir operation, flood forecasting, river navigation, and water
supply. For example, the New York City Department of Environmental Protection
(NYCDEP) is using the HEFS to improve the management of risks to water quantity and
quality objectives in the NYC area. In this context, the NYCDEP has developed an
Operational Support Tool (OST), which ingests streamflow forecasts from the HEFS that
are produced operationally by the Middle-Atlantic RFC and the Northeast RFC. The OST
optimizes the quantity and quality of water stored in the NYC reservoirs, while avoiding
unnecessary, multi-billion dollar, infrastructure costs, such as water filtration. Elsewhere,
the U.S. Army Corps of Engineers (USACE) are redeveloping their water control manual
for the Folsom Reservoir and the American River. In this context, the California-Nevada
RFC (CNRFC) are evaluating the use of streamflow hindcasts from the HEFS, in order to
establish the benefits and risks of using inflow forecasts to manage the flood control space
in the Folsom Reservoir. Elsewhere in California, the Yuba County Water Agency
(YCWA), together with CNRFC and partners, are exploring the use of probabilistic inflow
forecasts to better manage the flood control spaces in Lake Oroville, the Englebright
Reservoir and the New Bullards Bar Reservoir.
The ability of the HEFS to provide useful information for decision making depends,
crucially, upon the accuracy (unbiasedness and skillfulness) of the forecast probabilities.
There is a need to demonstrate this accuracy through retrospective forecasting and
verification. Retrospective studies are necessary to guide the development of the HEFS,
as well as decision support systems that rely upon the HEFS, and to build confidence
among decision makers that the forecasts are accurate, useful, and can lead to better
decisions. In order to provide meteorological and streamflow forecasts that are
demonstrably accurate, the HEFS must be calibrated and validated with historical data.
While recent studies have documented the quality of the precipitation, temperature and
streamflow forecasts from the HEFS, both for the short-to-medium range (Brown et al.,
2014a/b) and for the long-range (Brown, 2013), the minimum requirements for
reforecasting have not been evaluated. These requirements are largely driven by the raw
meteorological reforecasts used as input to the HEFS and, specifically, by the HEFS
Meteorological Ensemble Forecast Processor (MEFP), which aims to correct for biases
in the raw forecasts of precipitation and temperature (Schaake et al., 2007; Wu et al.,
2011). Observations of precipitation, temperature and streamflow are also required to
initialize the HEFS, calibrate the hydrologic models, and to validate the forecasts. Gauge-
based observations are typically available for many decades (often 50-100 years) at river
forecast locations. However, atmospheric models rely on a best estimate (or a range of
possibilities) of the multivariate, spatially distributed, state of the atmosphere-ocean
system at the forecast issue time. In order to conduct reforecasting, these estimates must
be produced retrospectively. In practice, reliable estimates of the atmosphere-ocean state
variables require satellite observations, which are only available since the early 1980s.
Thus, meteorological reforecasting is inherently constrained to the recent past. Also,
given the significant cost of conducting reforecasting, a trade-off emerges between
expanding reforecasting and improving the underlying weather and climate models.
However, for users of the HEFS, such as the NYCDEP and YCWA, hydrometeorological
reforecasting is critically important. It is necessary to optimize and improve decision
support systems, such as the OST, and to benchmark these systems against historical
analogs for future extremes.
In order to support NCEP in determining the requirements of the HEFS for
meteorological reforecasting, this report considers the sensitivity of the HEFS to a limited
number of reforecast configuration options. Clearly, reforecast configuration is only one
of several requirements for users of weather and climate forecasts. Other requirements
include the timely communication of development plans, use of transitional arrangements
for legacy models, software version control, coordination of model updates with users,
timely access to reforecasts, and consistency of the reforecasts and operational
forecasts, among others. Collectively, these requirements should contribute to a new
business model for NCEP and other operational forecasting agencies in delivering
weather, climate, and water (re)forecasts for improved decision support. As indicated
above, this report focuses on the minimum technical requirements of the HEFS for
meteorological reforecasts. It does not consider the broader set of requirements for
delivering an efficient and effective forecasting service, which must be addressed
separately.
In terms of the HEFS, the minimum requirements for historical data are driven by:
1) the need for an adequate sample size to estimate the statistical parameters of the
HEFS; 2) the need for an adequate sample size to validate the HEFS; and 3) the need
for users of the HEFS to calibrate and validate their decision support systems. This report
is concerned with the minimum requirements for (1) and (2) only. The requirements of
end users, such as the NYCDEP and YCWA, will be gathered and presented separately.
In this context, (1) and (2) define the minimum requirements for operating the HEFS, while
(3) is necessary to ensure the outputs from the HEFS are useful for decision making. In
other words, the minimum requirements associated with (1) and (2) should be regarded
as an incomplete baseline. In practice, the requirements of users for meteorological and
hydrologic reforecasting may exceed those for calibration and validation of the HEFS,
and they may evolve as services change and other users adopt the HEFS. Furthermore,
this study is concerned with short-to-medium range forecasting only and, specifically, with
the minimum requirements for historical data from the Global Ensemble Forecast System
(GEFS).
Raw forecasts of temperature and precipitation from the GEFS are used to produce
bias-corrected forcing for input to the HEFS. These forecasts are used in water resources decision making for the short-to-medium range, including reservoir management, flood
warning, river navigation and recreation. The GEFS uses Version 9.0.1 of the Global
Forecast System (GFS), which comprises a horizontal resolution of T254 (~55km) for 1-
8 days and T190 (~70km) for 9-16 days, and a vertical resolution of L42 or 42 levels (Wei
et al. 2008; Hamill et al. 2011; Hamill et al. 2013). Reforecasts were produced with the
GEFS for a ~26-year period between 1985 and 2010 (Hamill et al., 2013). Calibrating and
validating the HEFS with a subset of the available reforecasts will identify the sensitivities
of the HEFS to a degraded reforecast with the current GEFS only. Some applications of
the HEFS may benefit from a configuration that improves upon the available reforecasts,
but this cannot be established here. Rather, this study examines the ability to provide
accurate forecasts with the HEFS using a degraded calibration sample and the ability to
measure that accuracy with a reduced validation sample.
The minimum requirements for calibrating the HEFS include an adequate historical
period and frequency of reforecasts from which to estimate the statistical parameters of
the HEFS, and sufficient ensemble members to capture the skill in the meteorological
forecasts. Since the HEFS relies on statistical modeling, consistency of the reforecasts
and operational forecasts is also important. The minimum requirements for validating the
HEFS also include an adequate sample (historical period and frequency) of reforecasts
under varying basin conditions, again without structural changes that would undermine
their interpretation. In slow responding basins, the effective sample size is reduced by
temporal autocorrelations in streamflow, implying a longer period of record for validation
(and calibration of streamflow post-processors). In fast responding basins, conditions
evolve rapidly, implying a greater frequency of reforecasts to capture large and extreme
events. Assuming the climatology is reasonably stationary, a 25-year reforecast should
capture much of this variability. However, at a one-day aggregation, flooding may occur
with a climatological frequency of 0.001 (1-in-1000 days) or less. Thus, on average, fewer than ten (0.001 × 25 × 365 ≈ 9) flood events will occur within a 25-year period. Likewise, for long-range forecasting, where fixed aggregations are often required (e.g. April-July reservoir
volumes), a 25-year reforecast will inevitably omit some important variability.
In summary, the aims of this study are twofold, namely to determine the minimum
requirements for reforecasting with the GEFS, in order to: 1) calibrate the HEFS
adequately; that is without materially reducing the quality of the forecasts, including at
high thresholds; and 2) validate the forcing and streamflow forecasts with reasonably
small sampling uncertainty. The calibration of the HEFS depends on an adequate sample
size, for which the period of record and interval between reforecasts are important. It also
depends on the number of ensemble members in the GEFS and the consistency of the
reforecasts and operational forecasts. Likewise, the validation of the HEFS depends on
an adequate sample size, for which the period of record and interval between reforecasts
are important, and a reasonably consistent and representative sample (accepting that
these two things may not be aligned). Following a description of the study basins, datasets
and approach, the verification results are presented separately for the minimum
calibration and validation requirements.
3. Approach
3.1 Study basins
Four headwater basins were considered in this study, namely: the Chikaskia River
at Corbin, Kansas (AB-CBNK1); the Dolores River at Rico in Colorado (CB-DRRC2); the
Middle Fork of the Eel River at Dos Rios in California (CN-DOSC1); and the Wood River
at Hope Valley, Rhode Island (NE-HOPR1). Figure 1 and Table 1 show the location of
each basin, its average elevation, area, and the location of the nearest grid node in the
GEFS. Table 1 also shows the annual precipitation, the fraction of precipitation that
generates runoff (the runoff coefficient), and the ratio of precipitation to potential
evaporation (a climate index). The drainage areas range from 188 square kilometers (NE-
HOPR1) to 2,057 square kilometers (AB-CBNK1) and the runoff coefficients vary from
0.12 (AB-CBNK1) to 0.55 (NE-HOPR1). The basins were chosen for a combination of
practical and hydrological reasons. First, they all originate from RFCs for which the HEFS
has been implemented and validated, namely AB-, CB-, CN-, and NE-RFCs, and for
which the absolute quality of the forecasts has been documented (Brown, 2013, 2014;
Brown et al., 2014a/b). Here, the focus is on the minimum requirements for calibrating
and validating the HEFS; that is, on the relative quality of the forecasts for different
configurations of the GEFS; and not on the absolute quality of the forecasts. Second,
headwater basins respond quickly to forcing information and, as the uncertainties and
biases propagate from upstream to downstream locations, it is important, initially, to
understand the quality of the HEFS in headwater basins. Third, headwater basins are
important for operational forecasting of water quantity and quality, including flood warning
and reservoir operations. Further downstream, the HEFS will be impacted by additional
sources of bias and uncertainty, of which some are inherently difficult to quantify (e.g. the
downstream effects of river regulations, simplified hydraulic routing and composite timing
errors; see Raff et al., 2013). As part of the ongoing evaluation of the HEFS, more
complex regimes, as well as additional sources of forcing, will be considered in future.
Figure 2 shows the daily means of temperature, precipitation, and streamflow for
each basin, where CN-DOSC1 and CB-DRRC2 both comprise an average over two sub-
basins (see below). The averages are shown by calendar month and were derived from
gauged temperature, precipitation, and streamflow over a 24-year period between 1985
and 2008 (see Section 3.3). As indicated in Figure 2, there are marked differences in the
seasonality and covariability of precipitation and runoff among these basins.
The Chikaskia River (AB-CBNK1) experiences a warm and humid summer climate.
During the late spring and early summer, cool air from Canada and the Rocky Mountains
combines with moist air from the Gulf of Mexico and hot air from the Sonoran Desert,
leading to intense thunderstorms and tornados in Kansas and Oklahoma. At AB-CBNK1,
the relationship between precipitation and runoff is modulated by the shallow terrain and
dense vegetation cover, as well as increased evapotranspiration during the summer
months.
The Dolores River (CB-DRRC2) is a tributary of the Colorado River and occupies
a narrow valley incised into the sandstone of the San Juan Mountains. Precipitation is
reasonably constant throughout the year, but falls primarily as snow during the winter
months. The snowpack melts in the late spring and early summer, which leads to a sharp
increase in runoff between April and July (Figure 2). For the purposes of hydrologic
modeling, CB-DRRC2 is separated into two sub-basins, in order to accommodate the
varied elevations there. The lower sub-basin accounts for 67% of the total area of CB-
DRRC2.
The Eel River (CN-DOSC1) drains the windward slopes of the North Coast Ranges
in Northern California (Figure 1). During the late summer and early autumn, the upper
reaches of the Eel River experience little or no precipitation and streamflow. Low flows
are accentuated by diversions to the Russian River for use in the Potter Valley Hydro-
Electric Project. In late autumn, cooler temperatures are accompanied by rapidly
increasing precipitation, to which the streamflows respond through November and
continue increasing until January (Figure 2). During the winter months, the predictability
of heavy precipitation is increased by the onshore movement of weather fronts from the
Pacific coast and their orographic lifting in the North Coast Ranges. The coastal
mountains of northern California and the Pacific Northwest are also susceptible to
atmospheric rivers, which carry moisture in narrow bands from the tropical oceans to
the mid-latitudes. Atmospheric rivers can lead to persistent, heavy, precipitation and
extreme flooding in the North Coast Ranges and further inland (Smith et al., 2010). For
the purposes of hydrologic modeling, CN-DOSC1 is separated into two sub-basins, and
the lower sub-basin accounts for 77% of the total area of CN-DOSC1.
The Wood River flows approximately 85km from its source in Sterling, Connecticut,
through Hope Valley (NE-HOPR1) in the Arcadia Management Area to Alton, Rhode
Island, where it converges with the Pawcatuck River. As indicated in Figure 2, the daily
average precipitation at NE-HOPR1 is relatively constant throughout the year, but
includes significant snowfall during winter months (the average annual snowfall is
866mm). During the early spring, rising temperatures lead to snowmelt and to a peak in
streamflow around March or April, followed by lower flows during the summer months.
3.2 Experimental design
The HEFS quantifies the total uncertainty in future streamflow as a combination of
the meteorological and hydrologic uncertainties, while correcting for biases in both the
forcing and streamflow (Demargne et al., 2014). Further information about the HEFS
methodology can be found in Appendix A. The meteorological uncertainties and biases
are quantified with the Meteorological Ensemble Forecast Processor (MEFP). The MEFP
produces ensemble forecasts of precipitation and temperature conditionally upon a raw,
single-valued, forecast (Wu et al., 2011). For the short- to medium-range, the raw
forecasts used by the MEFP include the ensemble mean of the GEFS. In removing the
meteorological biases with the MEFP, the hydrologic uncertainties and biases can be
modeled independently of the forcing uncertainties and biases (Seo et al., 2006;
Demargne et al., 2014). The hydrologic uncertainties and biases are modeled in two
stages. First, the meteorological forecasts from the MEFP are used to generate raw
streamflow forecasts, which may contain hydrologic biases, but do not explicitly account
for any hydrologic uncertainties. Secondly, the raw streamflow forecasts are post-
processed with the Ensemble Postprocessor (EnsPost). The EnsPost models the
hydrologic uncertainties and biases from the residuals between the observed and
simulated streamflows (Seo et al., 2006); that is, streamflow predictions based on
observed temperature and precipitation at the forecast issue time.
The simulations and observations used to estimate the hydrologic uncertainties
and biases are typically available for several decades at each RFC forecast location.
Likewise, the precipitation and temperature observations used to generate the streamflow
simulations and to quantify the forcing uncertainties and biases are typically available for
several decades. In contrast, the meteorological reforecasts, which are used by the MEFP
to estimate the forcing uncertainties and biases, require satellite observations and
corresponding reanalysis of the ocean-atmosphere states, in order to initialize the
weather and climate models. These datasets are only available from the early 1980s
onwards. Thus, as indicated above, the requirements of the HEFS for historical data are
primarily constrained by the availability of (appropriate initialization for the) meteorological
reforecasts.
As indicated above, the total uncertainty in the streamflow forecasts originates from
a combination of uncertainties in the meteorological forecasting and hydrologic modeling.
Depending on basin characteristics and antecedent conditions, a large fraction of the total
uncertainty can originate from the meteorological uncertainties (Kavetski et al., 2002;
Pappenberger et al., 2005; Wu et al., 2011). Thus, the meteorological forecasts are a
central component of the HEFS and other hydrologic ensemble prediction systems. When
a meteorological model is updated, any changes in the statistical properties of the
precipitation and temperature forecasts will, to some degree, impact the streamflow
forecasts from the HEFS. For example, the MEFP may be impacted by changes in the
spatial or temporal resolution of the model, including the position of grid cells in relation
to hydrologic basins, the model physics in different layers, including at the land-surface
and ocean boundaries, and the number of (or approach to generating) ensemble
members. In terms of calibrating the MEFP, these properties are important insofar as they
influence the statistical character of the precipitation and temperature forecasts, including
any systematic biases, as well as the information content more generally (e.g. measured
in terms of correlation). In general, therefore, the MEFP must be recalibrated when the
GEFS is updated in any non-trivial way. Likewise, any non-trivial changes to the HEFS
must be accompanied by new streamflow hindcasting and validation. In many cases, this
requires further hindcasting and validation by users of the HEFS, such as the NYCDEP,
who rely upon streamflow hindcasts to calibrate and validate their own forecasting and
decision support systems. Following changes to the operational GEFS, the HEFS
requires an adequate sample of meteorological reforecasts, in order to recalibrate the
MEFP and to produce and validate new forcing and streamflow hindcasts. In this context,
the minimum requirements for reforecasting include the number of historical years of data
(N), the interval between reforecasts (M), and the number of ensemble members. These
and other variables are summarized in Table 2.
In order to evaluate the effects of N and M on the quality of the precipitation and
temperature forecasts from the MEFP, the raw GEFS reforecasts (Hamill et al., 2013)
were systematically degraded from N=24 years (1985-2008) and M=1 day to
combinations of smaller N and larger M. These thinned reforecasts were used to
calibrate the MEFP and to generate forcing and streamflow hindcasts for a consistent
validation period. As indicated above, some applications of the HEFS may benefit from a
reforecast configuration that improves upon the available reforecasts, but this cannot be
established here. In degrading the raw GEFS reforecasts, the hindcasting and validation
period was fixed to 24 years (1985-2008), with a forecast issued at 12Z each day. The
choice of validation period was motivated by: 1) the need to isolate the effects of N and
M on the quality of the MEFP forecasts, independently of any background variability (i.e.
from changes in the validation period); and 2) by the choice of experimental design for
validation. In terms of the latter, independent validation is always preferred when
evaluating statistical techniques, such as the MEFP. Unless the verifying observation is
removed from the calibration sample, the statistical parameters will benefit, unfairly, from
seeing the outcome in advance of predicting it. Depending upon the number of
parameters to estimate and their sampling properties, among other factors, this
advantage can be important. The results from dependent validation should, therefore, be
regarded as a best case scenario of the actual forecast quality. In practice, however,
the MEFP is relatively parsimonious (Wu et al., 2011). In other words, a single observation
should not greatly influence the estimated parameters. Furthermore, independent
validation poses significant practical challenges, as the HEFS is an operational
forecasting system; it is not well-suited to automatic calibration, and hindcasting is
extremely time-consuming.
In evaluating the sensitivities to N, both dependent and (limited) cross-validation
were employed. Specifically, the 24-year validation period was sub-divided into smaller
calibration periods, N={2x12, 3x8, 4x6, and 6x4} years. Dependent validation involved
estimating the parameters for each sub-period, issuing forecasts for that sub-period, and
collating the forecasts from all sub-periods for validation (i.e. 24 years in total).
Independent validation involved borrowing the parameters from an adjacent sub-period.
In practice, this should be regarded as a worst case scenario for the expected forecast
quality, because independent forecasting is conducted for multiple years (i.e. 12 years,
for N=12) without recalibrating the MEFP. Table 3 summarizes the dependent and
independent calibration scenarios for N. In evaluating the sensitivities to M, the MEFP
was calibrated for M={1, 3, 5, and 7} days and forecasts were issued at 12Z each day
between 1985 and 2008. In this context, M=1 represents dependent validation, whereas
M={3, 5 and 7} involves a mixture of dependent and independent validation. Specifically,
for M=3, 5, and 7 days, 1/3rd, 1/5th and 1/7th of the validation sample appears in the
calibration sample, respectively. The calibration scenarios for M are summarized in Table
4. Alongside the precipitation and temperature forecasts from the MEFP, streamflow
forecasts were produced at the outlet of each basin (see below).
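The data-thinning scenarios for N and M can be sketched as follows. This is a simplified illustration of the splits summarized in Tables 3 and 4, working with calendar years and issue-date indices only:

```python
# Sketch of the thinning scenarios: the 24-year record is either split into
# equal sub-periods (scenarios for N) or sub-sampled by issue date
# (scenarios for M). Simplified to calendar years / day indices.
years = list(range(1985, 2009))  # 24 years, 1985-2008

def n_scenarios(years_per_period):
    """Split the record into equal calibration sub-periods."""
    k = len(years) // years_per_period
    return [years[i * years_per_period:(i + 1) * years_per_period]
            for i in range(k)]

def m_scenario(day_indices, m):
    """Retain every M-th reforecast issue date."""
    return day_indices[::m]

print([len(p) for p in n_scenarios(12)])     # N=2x12: two 12-year periods
print(len(m_scenario(list(range(365)), 3)))  # M=3: ~1/3 of daily issue dates
```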
As a post-processing technique, the MEFP aims to improve skill by reducing bias
in the raw GEFS forecasts, but does not introduce any new predictors. Thus, the quality
of the MEFP outputs is sensitively dependent on the quality of the raw forcing inputs from
the GEFS. The MEFP uses the ensemble mean from the GEFS to capture the information
content in these (re)forecasts. In order to examine the sensitivity of the MEFP outputs to
the number of ensemble members in the GEFS inputs, the GEFS reforecasts were
systematically degraded by using only a subset of the ensemble members to derive the
ensemble mean. These thinned reforecasts were used to calibrate the MEFP and to
generate forcing and streamflow hindcasts for a consistent validation period (i.e. 26 years,
from 1985-2010). In practice, the GEFS reforecasts contain fewer ensemble members
(C) than the operational forecasts (F). Specifically, the GEFSv10 reforecasts comprise
only 11 ensemble members (10 + control), while the operational forecasts comprise 21
members (20 + control). Hindcasting and validation were conducted with all available
members (11). For example, when calibrating the MEFP with an ensemble mean derived
from C=5 members, the hindcasts were generated with an ensemble mean derived from
F=11 members. However, in order to better understand the impacts of this discrepancy,
a baseline scenario was included. Here, the control run was used to both estimate the
MEFP parameters (C=1) and to derive the forcing and streamflow hindcasts (F=1). The
scenarios for C and F are summarized in Table 5.
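The member-thinning experiment can be sketched in a few lines of numpy; the array dimensions and the convention of taking the first C members are illustrative assumptions, not the MEFP implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative reforecast array: (issue dates, members, lead times); the
# sizes are toy values, not the real GEFS dimensions
reforecasts = rng.gamma(shape=2.0, scale=3.0, size=(100, 11, 60))

def thinned_mean(ens, c):
    """Ensemble mean derived from the first c members only."""
    return ens[:, :c, :].mean(axis=1)

calib_mean = thinned_mean(reforecasts, 5)      # C=5: degraded calibration input
hindcast_mean = thinned_mean(reforecasts, 11)  # F=11: full mean for hindcasting
control_only = thinned_mean(reforecasts, 1)    # baseline: control run (C=F=1)
```

In the baseline scenario, `control_only` serves as both the calibration and hindcasting input, removing the discrepancy between C and F.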
3.3 Datasets
For each scenario of N and M, hindcasts of mean areal temperature (MAT) and
mean areal precipitation (MAP) were generated with the MEFP for a 24-year period
between 1985 and 2008. For each combination of C and F, the hindcasts were generated
for the full GEFS reforecast period (1985-2010); unlike N and M, the historical period was
not integral to the validation design for C and F (see below). The hindcasts of MAP and
MAT each comprise ~60 ensemble members (the precise number varying between
basins, as described in Wu et al., 2011), with lead times varying from 6 to 360 hours in
six-hourly increments. In order to evaluate the skill of the MEFP forecasts with GEFS
inputs (MEFP-GEFS), precipitation and temperature forecasts were also generated with
a conditional or resampled climatology (MEFP-CLIM). The latter involves resampling
the historical observations of MAP and MAT in a moving window of, respectively, 61 days
and 31 days around the forecast valid date.
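The resampled climatology can be sketched as follows; `climatology_window` is a hypothetical helper, the toy record is illustrative, and the half-width corresponds to the 61-day precipitation window (a half-width of 15 would give the 31-day temperature window).

```python
import numpy as np
import datetime as dt

def climatology_window(obs_dates, obs_values, valid_date, half_width=30):
    """Pool historical observations within +/- half_width days-of-year of valid_date."""
    target = valid_date.timetuple().tm_yday
    pooled = []
    for d, v in zip(obs_dates, obs_values):
        delta = abs(d.timetuple().tm_yday - target)
        delta = min(delta, 365 - delta)  # wrap around the year boundary
        if delta <= half_width:
            pooled.append(v)
    return np.asarray(pooled)

# Toy daily record, 1948 onward (three years)
dates = [dt.date(1948, 1, 1) + dt.timedelta(days=i) for i in range(3 * 365)]
values = np.arange(len(dates), dtype=float)

# 61-day window centred on 15 July: each historical year contributes ~61 values
sample = climatology_window(dates, values, dt.date(1985, 7, 15))
```

The pooled values form the MEFP-CLIM reference ensemble for that valid date; with a ~1948-2010 record, each valid date draws on roughly 60 years of windowed observations.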
Raw streamflow hindcasts were generated with the Community Hydrologic
Prediction System (CHPS) using the precipitation and temperature forecasts from the
MEFP. The hindcasts were produced with the hydrologic models and parameter settings
used operationally in each RFC. For all RFCs considered here, the Snow Accumulation
and Ablation Model (SNOW-17; Anderson, 1973) is used together with the Sacramento
Soil Moisture Accounting Model (SAC-SMA; Burnash, 1995). The models are executed
at a six-hourly timestep, but interpolated to an hourly timestep at CB-DRRC2 and CN-
DOSC1. Routing from the headwater to the downstream basins is conducted with Lag/K
using constant or variable lag and attenuation. Historical simulations were generated with
observed forcing for each basin and used to examine the sensitivities of the hydrologic
predictions to the meteorological forcing (see below).
Observations of precipitation and temperature were obtained from each RFC and
comprise areal averages (MAP, MAT) of the gauged precipitation and temperature in
each basin. The data comprise six-hourly observations, recorded in local time, and
covering the period ~1948-2010. In order to pair the meteorological observations and
forecasts, the observed values were chosen from the nearest available synoptic times in
{0Z, 6Z, 12Z, 18Z}. This introduced a timing error into the observations of +1 hour, 0
hours, -1 hour and -2 hours for NE-HOPR1, AB-CBNK1, CB-DRRC2 and CN-DOSC1,
respectively. As the forecasts were verified at an aggregated scale of one day or larger
(see below), this timing error was deemed acceptable. The hydrologic forecasts and
simulations were paired without any timing errors.
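The pairing step can be sketched as follows: each observation time (expressed in UTC) is mapped to the nearest synoptic hour in {0, 6, 12, 18}Z, and the residual offset is the timing error. The sign convention (observation time minus synoptic time) is an assumption.

```python
SYNOPTIC_HOURS = (0, 6, 12, 18)

def nearest_synoptic(utc_hour):
    """Nearest synoptic hour (in Z) and the signed timing error in hours."""
    # Include 24 so that late-evening hours can round forward to 0Z next day
    best = min(SYNOPTIC_HOURS + (24,), key=lambda h: abs(h - utc_hour))
    return best % 24, utc_hour - best

# A 12:00 local observation in a UTC-5 zone falls at 17Z and pairs to 18Z
hour_z, err = nearest_synoptic(17)  # pairs to 18Z with a -1 hour error
```

For daily or longer aggregations, offsets of one to two hours shift only a small fraction of each accumulation window, which is why the error was deemed acceptable here.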
3.4 Verification strategy
Verification was conducted with the NWS Ensemble Verification Service (EVS;
Brown et al., 2010). The temperature and precipitation forecasts were verified against
observed temperature and precipitation, respectively. In order to establish the sensitivities
of the hydrologic forecasts to different calibrations of the MEFP, the raw streamflow
forecasts were verified against simulated streamflows. Differences between the
hydrologic forecasts and simulations reflect the contribution of the MEFP-GEFS forcing
to the quality of the streamflow forecasts, independently of any hydrologic errors and
biases (which are ordinarily removed by the HEFS Ensemble Postprocessor, EnsPost).
Aside from eliminating these hydrologic biases, simulated streamflows avoid the timing
and other errors associated with pairing streamflow forecasts and observations. For
example, the streamflow observations are only available as daily averages and in different
time zones to the forecasts. No streamflow post-processing was conducted in this study,
as the EnsPost uses hydrologic simulations and observations only and is, therefore,
insensitive to the meteorological reforecasting. In this context, the aim is to establish the
sensitivity of the HEFS forcing and streamflow forecasts to different calibrations of the
MEFP, and not to examine the absolute quality of the forecasts, which is considered
elsewhere (Brown, 2013, 2014; Brown et al., 2014a/b).
Verification was conducted both unconditionally (i.e. for all data) and conditionally
upon observed and forecast amount. Unconditional bias and skill are important, as the
HEFS is an operational forecasting system for which many applications are anticipated.
However, average conditions, particularly the ensemble mean, generally favor drier
weather and lower flows, as precipitation and streamflow are both skewed variables. In
order to compare the verification results between basins, for different forecast lead times
and valid times, and for specific aggregation periods, common thresholds were identified
for each basin. Specifically, for each aggregation period, a, and basin, b, a climatological
distribution function, F_{n,a,b}(x), was computed from the n values of the hydrometeorological
variable, x, between 1985 and 2008. Real-valued thresholds were then determined for k
non-exceedence probabilities, c_p, as F_{n,a,b}^{-1}(c_p), where c_p ∈ [0,1] and p = 1, ..., k.
These non-exceedence probabilities provide a consistent mapping between the likelihood
of a particular hydrometeorological occurrence and its corresponding real value across
different basins and aggregation periods.
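The threshold construction amounts to evaluating an empirical quantile function; a minimal numpy sketch, with a toy precipitation climatology standing in for the 1985-2008 record:

```python
import numpy as np

def thresholds_from_climatology(x, probs):
    """Real-valued thresholds F^{-1}(c_p) from the empirical distribution of x."""
    return np.quantile(np.asarray(x), probs)

rng = np.random.default_rng(1)
# Toy daily precipitation climatology (mm), ~70% dry days
precip = np.where(rng.random(8766) < 0.7, 0.0, rng.gamma(2.0, 5.0, 8766))

cp = [0.0, 0.9, 0.99, 0.995]
thr = thresholds_from_climatology(precip, cp)
# thr[0] is the zero-precipitation (PoP) threshold; thr[-1] is exceeded,
# on average, once every 200 days
```

Because the probabilities, rather than the real values, are held fixed, the same c_p maps to a comparable climatological rarity in every basin and aggregation period.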
As indicated above, verification was performed for different magnitudes of the
observed and forecast variables. When conditioning on observed amount, the quality of
the forecasting system is evaluated for the full range of historical occurrences, including
extreme events that were forecast inadequately (as small or moderate events). When
conditioning on forecast amount, the verification results may discount important observed
extremes. However, since the observed amount is unknown when a forecast is issued,
conducting verification by forecast amount is useful for guiding operational forecasting
and real-time decision making. While some verification metrics provide integral measures
of error across multiple thresholds (e.g. the mean error), others are defined for discrete
occurrences (e.g. the probability of detection). Integral measures, such as the mean error,
were derived from the subsample in which the prescribed condition was met (e.g. the
observation exceeded the threshold). Measures defined for discrete events were
computed from the observed and forecast probabilities of exceeding the threshold.
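The two conditioning strategies can be sketched as follows: an integral measure (here, the mean error of the ensemble mean) computed from the subsample in which the observation exceeds the threshold, and a discrete measure (here, the Brier score) computed from the observed and forecast exceedance probabilities. The data are synthetic.

```python
import numpy as np

def conditional_mean_error(ens_mean, obs, threshold):
    """Mean error of the ensemble mean over pairs where the observation exceeds the threshold."""
    mask = obs > threshold
    return np.mean(ens_mean[mask] - obs[mask])

def brier_score(ens, obs, threshold):
    """Brier score for exceeding the threshold: forecast probability vs. binary outcome."""
    p_fcst = (ens > threshold).mean(axis=1)   # fraction of members exceeding
    outcome = (obs > threshold).astype(float)
    return np.mean((p_fcst - outcome) ** 2)

rng = np.random.default_rng(2)
obs = rng.gamma(2.0, 5.0, 500)
ens = obs[:, None] + rng.normal(0.0, 3.0, (500, 60))  # toy 60-member forecasts

me = conditional_mean_error(ens.mean(axis=1), obs, threshold=10.0)
bs = brier_score(ens, obs, threshold=10.0)
```

The integral measure retains the full magnitude of the errors in the subsample, whereas the Brier score responds only to the forecast and observed probabilities of crossing the threshold; this distinction also underlies the CRPSS/BSS differences discussed in the results.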
4. Results and analysis
4.1 Minimum requirements for estimating the parameters of the MEFP
4.1.1 Sensitivity to the historical period and interval between reforecasts
The precipitation and temperature forecasts from the MEFP were verified against
observed MAP and MAT, respectively. The results are shown for a daily aggregation, as
this is a representative volume for short-to-medium range forecasting. The results are
presented by forecast lead time and magnitude of the forcing variable for each scenario
of N (the number of years of reforecasts) and M (the interval between reforecasts). The
analysis focuses on the sensitivity of the forecasts to N and M in terms of bias, skill, and
other attributes of forecast quality, and not on the absolute quality of the forecasts. Figure
3 provides selected verification scores (in the rows) at three climatological probabilities
(in the columns), for the MEFP-GEFS precipitation forecasts. Here, Cp=0.0 denotes the
Probability of Precipitation (PoP), while Cp=0.995 represents a daily precipitation amount
that is exceeded, on average, once every 200 days. The scores were derived from the
subsample of verification pairs in which the observed precipitation amount exceeded the
threshold. Here, the verification statistics for the daily accumulations were averaged over
the first three days of forecast lead time. The results are shown for each calibration
scenario, N={24, 12, 8, 6, and 4 years}, and for the two validation scenarios, namely
dependent validation (all scenarios of N) and cross-validation, i.e. N={12, 8, 6, 4} (see
Table 3). The verification measures are summarized in Appendix B. The correlation
coefficient measures the degree of association between the ensemble mean of the
MEFP-GEFS precipitation forecasts and the observed precipitation amount. The relative
mean error (RME) measures the fractional bias of the ensemble mean forecast, where a
negative RME denotes an under-forecasting bias. The Continuous Ranked Probability
Skill Score (CRPSS) measures the fractional improvement of the MEFP-GEFS
precipitation forecasts when compared to the MEFP-CLIM forecasts, where 1.0 denotes
a perfect score. The Brier Skill Score (BSS) also provides a lumped measure of skill
relative to the MEFP-CLIM forecasts. However, unlike the CRPSS, the BSS measures
the ability of the forecasting system to predict the exceedence (or non-exceedence) of a
discrete threshold.
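The CRPSS can be sketched with the standard sample estimator of the CRPS, E|X - y| - 0.5*E|X - X'|, where X and X' are independent draws from the forecast ensemble and y is the observation; the synthetic "GEFS-like" and climatological ensembles below are illustrative.

```python
import numpy as np

def crps(ens, obs):
    """Sample CRPS averaged over pairs: E|X - y| - 0.5 E|X - X'|."""
    term1 = np.abs(ens - obs[:, None]).mean(axis=1)
    term2 = np.abs(ens[:, :, None] - ens[:, None, :]).mean(axis=(1, 2))
    return np.mean(term1 - 0.5 * term2)

def skill_score(score, reference):
    """Fractional improvement over the reference (1.0 is perfect, <0 is worse)."""
    return 1.0 - score / reference

rng = np.random.default_rng(3)
obs = rng.normal(20.0, 5.0, 400)
main = obs[:, None] + rng.normal(0.0, 2.0, (400, 60))  # sharper, skilful system
clim = rng.normal(20.0, 5.0, (400, 60))                # climatological reference

crpss = skill_score(crps(main, obs), crps(clim, obs))
```

Substituting the Brier score of a discrete threshold exceedance for the CRPS in `skill_score` yields the BSS in the same way.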
Figures 4-7 show selected verification scores by forecast lead time for the MEFP-
GEFS precipitation forecasts at AB-CBNK1, CB-DRRC2, CN-DOSC1, and NE-HOPR1,
respectively. Again, the results are shown for subsamples in which the observed
precipitation amount exceeded Cp={0.0,0.9,0.995}. Unlike Figure 3, the results are shown
separately for each one-day accumulation, and with a separate curve for each scenario
of N. While Figures 4-7 show the verification results for selected thresholds at all forecast
lead times, Figures 8-11 show the results for all thresholds at selected forecast lead
times. In Figures 8-11, the climatological probabilities are plotted on a non-linear scale,
in order to emphasize the larger thresholds. The origin of each curve in Figures 8-11 is
the climatological PoP, i.e. the zero-precipitation threshold. The BSS denotes the ability
of the MEFP-GEFS forecasts to predict the exceedence of this threshold. The correlation,
RME and CRPSS denote the quality of the MEFP-GEFS forecasts for wet conditions, i.e.
for the subsample that exceeds the threshold, with the lowest threshold being zero.
As indicated in Figure 3, the sensitivities of the MEFP-GEFS precipitation forecasts
to the number of years of calibration data (N) are relatively small, both for the dependent
and independent validation scenarios. In general, the forecast quality is slightly reduced
under independent validation. However, as indicated above, independent forecasting for
multiple years (up to 12 years) should be regarded as a worst case scenario for the
expected forecast quality, as the MEFP should be recalibrated more frequently. The
greatest differences between dependent and independent validation occur in CB-DRRC2,
particularly for light and moderate precipitation amounts, where the forecast quality is
generally lower. This is understandable because CB-DRRC2 lies in the San Juan
Mountains of Colorado, where the steep terrain leads to reduced predictability and
increased climatological variability on inter-annual timescales. While the MEFP assumes
that the joint distribution of forecasts and observations is reasonably stationary, any
climatological non-stationarities may introduce a trade-off between larger N (smaller
sampling uncertainty) and smaller N (greater climatological specificity). As indicated in
Figure 3, for most verification scores, locations and thresholds, there is no systematic
increase in forecast quality with increasing N. Indeed, in some cases, the forecast quality
increases slightly with decreasing N. Given the sampling uncertainties, this should not be
overstated. However, it may originate from climatological variability over the validation
period and thus a greater specificity of the estimated parameters at smaller N. As indicated
in Figures 4-7, the sensitivities to N are relatively small at all forecast lead times, although
some erratic behaviors are seen at N=4 in AB-CBNK1 and CB-DRRC2, where the
absolute forecast quality is also lower. Similarly, Figures 8-11 suggest that the MEFP is
relatively insensitive to N across a broad range of precipitation thresholds. However, at
CB-DRRC2, there is a material decline in BSS for N=4, particularly for light and moderate
precipitation amounts, while the CRPSS is higher (Figure 9). These differences originate
from the structure of the BSS and CRPSS. The CRPSS is sensitive to biases in the
ensemble mean forecast, which are also smaller for N=4. The BSS is sensitive to these
biases only insofar as they impact the forecast probability (of exceeding Cp), and not to
their absolute magnitude.
Figure 12 shows the quality of the MEFP-GEFS precipitation forecasts for different
scenarios of M. The verification scores include the correlation coefficient and the RME,
together with the BSS and CRPSS. They were computed at a daily accumulation for
Cp={0.0, 0.99, 0.995}, and averaged over the first three days of forecast lead time. Figures
13-16 show the verification results at all precipitation thresholds for AB-CBNK1, CB-
DRRC2, CN-DOSC1 and NE-HOPR1, respectively. Here, the results comprise daily
accumulations at forecast lead times of 1, 2, and 3 days. In terms of data thinning, the
scenarios of N are broadly comparable to M, with M=7 comprising 1/7th of the original
calibration sample, versus 1/6th for N=4. In principle, for atmospheric variables that are
statistically dependent over multiple days, thinning by M should have a smaller impact
than an equivalent N. In practice, however, except for large-scale systems, such as
atmospheric rivers, precipitation varies over short periods and at small spatial scales, as
evidenced by the majority of forecast skill occurring in the first 1-7 days (or less at AB-
CBNK1, which is located in the Central Plains). Thus, depending on forecast lead time
and location (among other factors), thinning by M may be more or less aggressive than
an equivalent N.
As indicated in Figure 12, when averaged across forecast lead times of 1-3 days,
there is no systematic decrease in forecast quality with increasing M at any location or
precipitation threshold considered. Similarly, when considering forecast lead times of 1,
5, and 10 days separately (Figures 13-16), the quality of the MEFP-GEFS precipitation
forecasts is relatively insensitive to M at most locations. However, at AB-CBNK1, where
the forecast skill declines rapidly over the first week (Figure 4), there is a non-trivial
sensitivity to M from 0-24 hours across a range of precipitation thresholds, particularly for
the correlation coefficient, RME and CRPSS (Figure 13). This is evidenced by the range
of verification scores for different scenarios of M. To further illustrate these sensitivities,
Figure 17 shows the range of verification scores across all scenarios of M at selected
forecast lead times. Figure 18 shows the equivalent range of scores for N. Clearly, the
range of scores is not indicative of a systematic dependence of forecast quality on N or
M (see above). However, it is indicative of a sensitivity to the amount of calibration data
available. In general, AB-CBNK1 shows the greatest sensitivities to M and N, while CB-
DRRC2 is only sensitive to N (and specifically to N=4, as indicated above).
In order to illustrate the effects of N and M on the largest observed and forecast
precipitation amounts, box plots were computed from the MEFP-GEFS precipitation
forecasts. Figure 19 shows box plots of the forecast errors for each basin (in the rows)
and for two scenarios of N (in the columns), namely N={24, 12}. The results are plotted
against observed precipitation amount and are shown at a forecast lead time of 0-24
hours. Figure 20 shows the corresponding results against forecast precipitation amount,
specifically the ensemble mean forecast. Selected quantiles of the forecasting errors are
plotted together with the median error and range (extreme residuals) as whiskers. The
verifying observation is denoted by the zero-error line. Verification pairs for which the
observation falls outside the ensemble range are denoted as misses. However, each
forecast comprises only a limited number of ensemble members (~60). Thus, some misses
should be expected, even if the forecasts are conditionally unbiased; for a reliable
m-member ensemble, the observation falls outside the ensemble range with probability
2/(m+1), or roughly 3% for m=60. Figures 21 and 22
show box plots of the errors in the MEFP-GEFS precipitation forecasts for two scenarios
of M, namely M={1, 3}, again ordered by observed and forecast precipitation amount,
respectively. Here, each box represents one ensemble forecast from the period 0-24
hours. As indicated in Figures 19 and 21, there is no systematic decline in forecast quality
at N=12 or M=3 for the most extreme observed precipitation amounts. Rather, any
differences between scenarios are consistent with sampling uncertainty. While there are
some differences in the largest precipitation forecasts (by ensemble mean) for N=12
(Figure 20) and M=3 (Figure 22), these differences are again consistent with sampling
uncertainty and do not translate into additional skill for N=24 or M=1 (e.g. Figures 8-11).
Figure 23 shows the quality of the temperature forecasts at selected thresholds for
each scenario of N, while Figure 24 shows the corresponding results for each scenario of
M. The results for N include both validation scenarios, namely dependent validation, i.e.
N={24, 12, 8, 6, 4}, and cross-validation, i.e. N={12, 8, 6, 4} (see Table 3). The verification
metrics include the mean error of the ensemble mean forecast (°C), the correlation
coefficient, BSS and CRPSS. The metrics were com