
An evaluation of the minimum requirements for meteorological reforecasts from the Global Ensemble Forecast System (GEFS) of the U.S. National Weather Service (NWS) in support of the calibration and validation of the NWS Hydrologic Ensemble Forecast Service (HEFS)

Revision number: extended, final

Prepared by Hydrologic Solutions Limited for the U.S. National Weather Service (NWS) under subcontract SubK-2013-1003 with Lynker Technologies LLC in support of NWS Prime Contract DG133W-13-CQ-0042 (in fulfillment of Deliverable No. 4 and, in part, Deliverable No. 9 of Task 3)

Dr. James Brown ([email protected])

August 2015

HSL


Contents

i. List of figures
ii. List of tables
1. Executive summary
2. Introduction
3. Approach
   3.1 Study basins
   3.2 Experimental design
   3.3 Datasets
   3.4 Verification strategy
4. Results and analysis
   4.1 Minimum requirements for estimating the parameters of the MEFP
      4.1.1 Sensitivity to the historical period and interval between reforecasts
      4.1.2 Sensitivity to the number of ensemble members in the GEFS
   4.2 Minimum requirements for verifying the HEFS forecasts
5. Summary and recommendations
6. Glossary of terms and acronyms
7. References
8. Tables
9. Figures
APPENDIX A: The Hydrologic Ensemble Forecast Service (HEFS)
APPENDIX B: Verification measures
   a. Relative mean error
   b. Brier Score and Brier Skill Score
   c. Continuous Ranked Probability Score and skill score
   d. Reliability diagram
   e. Relative Operating Characteristic
   f. Cumulative rank histogram


i. List of figures

Figure 1: The four study basins, including their average elevation, the location of each outlet (gaging station), and the positions of the nearest grid nodes in the GEFS.

Figure 2: Daily average temperature, total daily precipitation and daily average streamflow by calendar month for each study basin.

Figure 3: Selected verification metrics for the MEFP-GEFS precipitation forecasts. The results are shown for the dependent (solid) and independent (dashed) validation scenarios of N (the number of years of calibration data), and include several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 4: Selected verification metrics for the MEFP-GEFS precipitation forecasts at AB-CBNK1. The results are plotted against forecast lead time for each scenario of N (the number of years of calibration data), and are shown for several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 5: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CB-DRRC2. The results are plotted against forecast lead time for each scenario of N (the number of years of calibration data), and are shown for several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 6: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CN-DOSC1. The results are plotted against forecast lead time for each scenario of N (the number of years of calibration data), and are shown for several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 7: Selected verification metrics for the MEFP-GEFS precipitation forecasts at NE-HOPR1. The results are plotted against forecast lead time for each scenario of N (the number of years of calibration data), and are shown for several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 8: Selected verification metrics for the MEFP-GEFS precipitation forecasts at AB-CBNK1. The results are plotted against climatological non-exceedence probability (Cp) for each scenario of N (the number of years of calibration data), and are shown for several forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 9: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CB-DRRC2. The results are plotted against climatological non-exceedence probability (Cp) for each scenario of N (the number of years of calibration data), and are shown for several forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 10: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CN-DOSC1. The results are plotted against climatological non-exceedence probability (Cp) for each scenario of N (the number of years of calibration data), and are shown for several forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 11: Selected verification metrics for the MEFP-GEFS precipitation forecasts at NE-HOPR1. The results are plotted against climatological non-exceedence probability (Cp) for each scenario of N (the number of years of calibration data), and are shown for several forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 12: Selected verification metrics for the MEFP-GEFS precipitation forecasts. The results are plotted against the interval between reforecasts (M days) used to calibrate the MEFP, and are shown for several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 13: Selected verification metrics for the MEFP-GEFS precipitation forecasts at AB-CBNK1. The results are plotted against climatological non-exceedence probability (Cp) for each scenario of M (the interval between reforecasts in days), and are shown for several forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 14: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CB-DRRC2. The results are plotted against climatological non-exceedence probability (Cp) for each scenario of M (the interval between reforecasts in days), and are shown for several forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 15: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CN-DOSC1. The results are plotted against climatological non-exceedence probability (Cp) for each scenario of M (the interval between reforecasts in days), and are shown for several forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 16: Selected verification metrics for the MEFP-GEFS precipitation forecasts at NE-HOPR1. The results are plotted against climatological non-exceedence probability (Cp) for each scenario of M (the interval between reforecasts in days), and are shown for several forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 17: Range (maximum-minimum) of selected verification metrics for the MEFP-GEFS precipitation forecasts. The results are plotted against climatological non-exceedence probability (Cp) across all scenarios of M (interval between reforecasts in days), and are shown for several forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 18: Range (maximum-minimum) of selected verification metrics for the MEFP-GEFS precipitation forecasts. The results are plotted against climatological non-exceedence probability (Cp) across all scenarios of N (the number of years of calibration data), and are shown for several forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 19: Box plots of forecast errors against observed precipitation amount for N={24 and 12} years of calibration data. The results are shown at a forecast lead time of 0-24 hours.

Figure 20: Box plots of forecast errors against forecast precipitation amount (ensemble mean) for N={24 and 12} years of calibration data. The results are shown at a forecast lead time of 0-24 hours.

Figure 21: Box plots of forecast errors against observed precipitation amount for calibration scenarios of M={1 and 5} days between reforecasts. The results are shown at a forecast lead time of 0-24 hours.

Figure 22: Box plots of forecast errors against forecast precipitation amount (ensemble mean) for calibration scenarios of M={1 and 5} days between reforecasts. The results are shown at a forecast lead time of 0-24 hours.

Figure 23: Selected verification metrics for the MEFP-GEFS temperature forecasts. The results are shown for the dependent (solid) and independent (dashed) validation scenarios of N (the number of years of calibration data), and include several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 24: Selected verification metrics for the MEFP-GEFS temperature forecasts. The results are plotted against the interval between reforecasts (M days) used to calibrate the MEFP, and are shown for several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 25: Selected verification metrics for the MEFP-GEFS temperature forecasts at AB-CBNK1. The results are plotted against climatological non-exceedence probability (Cp) for each scenario of N (the number of years of calibration data), and are shown for several forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 26: Selected verification metrics for the MEFP-GEFS temperature forecasts at CB-DRRC2. The results are plotted against climatological non-exceedence probability (Cp) for each scenario of N (the number of years of calibration data), and are shown for several forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 27: Selected verification metrics for the MEFP-GEFS temperature forecasts at CN-DOSC1. The results are plotted against climatological non-exceedence probability (Cp) for each scenario of N (the number of years of calibration data), and are shown for several forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 28: Selected verification metrics for the MEFP-GEFS temperature forecasts at NE-HOPR1. The results are plotted against climatological non-exceedence probability (Cp) for each scenario of N (the number of years of calibration data), and are shown for several forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 29: Selected verification metrics for the MEFP-GEFS streamflow forecasts. The results are shown for the dependent (solid) and independent (dashed) validation scenarios of N (the number of years of calibration data), and include several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 30: Selected verification metrics for the MEFP-GEFS streamflow forecasts. The results are plotted against the interval between reforecasts (M days) used to calibrate the MEFP, and are shown for several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 31: Residuals of selected verification metrics for the MEFP-GEFS precipitation forecasts when calibrating the MEFP with an ensemble mean derived from C=11 members versus C=1 (F=11). The results are shown by forecast lead time for several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 32: Residuals of selected verification metrics for the MEFP-GEFS precipitation forecasts when calibrating the MEFP with an ensemble mean derived from C=11 members versus C=1 (F=11). The results are shown by climatological non-exceedence probability at selected forecast lead times. The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 33: Sensitivity of the MEFP-GEFS precipitation forecasts to the number of members (C) used to calibrate the MEFP. The results comprise an average over the middle portion of the forecast horizon (4-8 days) for selected climatological probabilities (Cp). The bold lines show the calibration scenarios with F=11 forecast members. The dashed line shows the (C=1, F=1) scenario.

Figure 34: Selected verification metrics for the MEFP-GEFS precipitation forecasts at CN-DOSC1. The results are shown by forecast lead time for multiple calibration (C) and forecasting (F) scenarios and for several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 35: Residuals of selected verification metrics for the MEFP-GEFS temperature forecasts when calibrating the MEFP with an ensemble mean derived from C=11 members versus C=1 (F=11). The results are shown by forecast lead time for several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 36: Residuals of selected verification metrics for the HEFS streamflow forecasts when calibrating the MEFP with an ensemble mean derived from C=11 members versus C=1 member (F=11). The results are shown by forecast lead time for several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 37: Residuals of selected verification metrics for the MEFP-GEFS precipitation forecasts when calibrating the MEFP with an ensemble mean derived from C=11 members versus C=5 (F=11). The results are shown by forecast lead time for several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 38: Residuals of selected verification metrics for the MEFP-GEFS temperature forecasts when calibrating the MEFP with an ensemble mean derived from C=11 members versus C=5 (F=11). The results are shown by forecast lead time for several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 39: Residuals of selected verification metrics for the HEFS streamflow forecasts when calibrating the MEFP with an ensemble mean derived from C=11 members versus C=5 members (F=11). The results are shown by forecast lead time for several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 40: Cumulative rank histograms for the HEFS streamflow forecasts when calibrating the MEFP with an ensemble mean derived from C=11 members (solid) and C=5 members (dashed). The results are shown at a forecast lead time of 96-120 hours and for observed streamflow volumes that exceed several (non-exceedence) climatological probabilities.

Figure 41: Selected verification scores for the MEFP-GEFS precipitation forecasts. The nominal scores are shown for each scenario of N (solid lines), together with the range of scores across the subcases of each scenario. The results include several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 42: Selected verification scores for the MEFP-GEFS precipitation forecasts. The nominal scores are shown for each scenario of M (solid lines), together with the range of scores across the subcases of each scenario. The results include several non-exceedence climatological probabilities (Cp). The reference forecasts for the CRPSS and the BSS comprise the MEFP-CLIM forecasts.

Figure 43: Reliability diagrams and corresponding sharpness plots (base 10 logarithm of the sample size, n) for the MEFP-GEFS precipitation forecasts at N=12. The results are shown for selected climatological non-exceedence probabilities (Cp), including the Probability of Precipitation (PoP; Cp=0.0), and comprise a daily aggregation between 0-24 hours. Alongside the nominal values (bold lines), the range of scores is shown for the two sub-periods of N=12.

Figure 44: Reliability diagrams and corresponding sharpness plots (base 10 logarithm of the sample size, n) for the MEFP-GEFS precipitation forecasts at M=5. The results are shown for selected climatological non-exceedence probabilities (Cp), including the Probability of Precipitation (PoP; Cp=0.0), and comprise a daily aggregation between 0-24 hours. Alongside the nominal values (bold lines), the range of scores is shown for the five sub-periods of M=5.

Figure 45: Probability of Detection (PoD) and Probability of False Detection (PoFD) for flooding at NE-HOPR1. The results are shown for each ensemble member (48 in total) and for three validation scenarios at a reforecast interval of M=3, namely the full period of record (daily reforecasts) and the three sub-periods (reforecasts every 3 days, offset by 1 day). The PoD is highlighted at a PoFD of 0.015.


ii. List of tables

Table 1: Characteristics of the study basins
Table 2: Reforecast configuration parameters
Table 3: Calibration and validation scenarios for N
Table 4: Calibration and validation scenarios for M
Table 5: Average sample sizes by climatological probability (Cp) and reforecast scenario
Table 6: Calibration and validation scenarios for the number of ensemble members


1. Executive summary

Motivation

The Hydrologic Ensemble Forecast Service (HEFS) quantifies the total uncertainty in future streamflow as a combination of the meteorological forcing uncertainties and the hydrologic modeling uncertainties. Reliable and skillful weather and climate forecasting is central to reliable and skillful streamflow forecasting. The HEFS Meteorological Ensemble Forecast Processor (MEFP) quantifies the meteorological uncertainties and corrects for biases in the forcing inputs to the HEFS. For the medium-range (1-15 days), the MEFP uses precipitation and temperature forecasts from the Global Ensemble Forecast System (GEFS) of the National Centers for Environmental Prediction (NCEP).

The ability of the HEFS to provide useful information for decision making depends upon the accuracy of the forecast probabilities. Crucially, there is a need to demonstrate this accuracy through hindcasting and validation. Hindcasting is necessary to benchmark and improve the HEFS, optimize decision support systems that rely upon the HEFS, and to build confidence among decision makers that the forecasts are accurate, useful, and can lead to better decisions. For example, the New York City Department of Environmental Protection (NYCDEP) is using the HEFS to improve the management of risks to water supply objectives in the NYC area. The NYCDEP has developed an Operational Support Tool (OST), which optimizes the quantity and quality of water stored in the NYC reservoirs and helps to avoid unnecessary, multi-billion dollar, infrastructure costs. The NYCDEP relies on streamflow hindcasts from the HEFS, supported by meteorological reforecasts from the GEFS, in order to optimize and validate the OST.

Large and extreme hydrologic events are critically important to users of the HEFS, as they pose a significant threat to life and property. Given the manifest uncertainties in forecasting hydrologic extremes, the ability of the HEFS to quantify these uncertainties (and correct for systematic biases) is an important advantage over deterministic forecasting systems. However, validating the HEFS for large and extreme events relies upon an adequate archive of meteorological reforecasts.

In order to determine the minimum requirement of the HEFS for meteorological reforecasts from the GEFS, this report considers the sensitivity of the HEFS to a limited number of reforecast configuration options. Understanding the minimum requirements for calibrating and validating the HEFS is a necessary but not a sufficient condition for understanding the minimum requirements of end users for meteorological and hydrologic reforecasts. The requirements of end users, such as the NYCDEP, will be gathered and presented separately.


Approach

In order to determine the minimum requirements for meteorological reforecasting in support of calibrating and validating the HEFS, a 26-year reforecast dataset was obtained for the current GEFS. Among other factors, the costs associated with meteorological reforecasting depend on the historical period considered (N years), the interval between reforecasts (M days), and the number of ensemble members in each forecast (C). By sub-sampling the GEFS reforecasts, the MEFP was calibrated for different combinations of N, M and C. The sensitivities of the temperature and precipitation forecasts from the MEFP and the streamflow forecasts from the HEFS were then explored through hindcasting and validation.

Forcing and streamflow hindcasts were produced and validated at four headwater basins: the Chikaskia River at Corbin, Kansas (AB-CBNK1); the Dolores River at Rico in Colorado (CB-DRRC2); the Middle Fork of the Eel River at Dos Rios in California (CN-DOSC1); and the Wood River at Hope Valley, Rhode Island (NE-HOPR1). The hindcasts were generated at 12Z for each day in the historical period of record. Within this fixed period, the calibration of the MEFP varied according to N, M and C. To ensure that the hindcasting was both practical and statistically reasonable, a combination of dependent and (limited) cross-validation was used.

In exploring the sensitivities to N, a 24-year validation period was sub-divided into smaller calibration and forecasting periods, namely N={2x12, 3x8, 4x6, and 6x4} years. Dependent validation involved calibrating the MEFP and generating hindcasts for each sub-period and then pooling all of the sub-periods for validation. Independent validation involved borrowing the parameters from an adjacent sub-period. While dependent validation may be regarded as a best case scenario for the expected forecast quality, using parameters from adjacent sub-periods should be regarded as a worst case scenario; in practice, the MEFP would be recalibrated more frequently. In evaluating the sensitivities to M, the MEFP was calibrated for M={1, 3, 5, and 7} days and hindcasts produced daily for the fixed historical period. A sketch of this sub-sampling is given below.
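To make the sub-sampling concrete, the following minimal sketch (in Python) illustrates how the calibration sub-periods (N) and reforecast intervals (M) could be drawn from a fixed archive of daily reforecast dates. The function names and the treatment of the calendar are illustrative only, and are not part of the study software; the 1985 start date reflects the 24-year period of record described in Section 3.

    from datetime import date, timedelta

    def reforecast_dates(start=date(1985, 1, 1), years=24):
        """All daily 12Z reforecast issue dates in the fixed historical period."""
        end = start.replace(year=start.year + years)
        return [start + timedelta(days=i) for i in range((end - start).days)]

    def n_sub_periods(dates, n_years):
        """Split the record into consecutive calibration sub-periods of N years,
        e.g. n_years=12 yields the 2x12 scenario and n_years=4 the 6x4 scenario."""
        first_year = dates[0].year
        buckets = {}
        for d in dates:
            buckets.setdefault((d.year - first_year) // n_years, []).append(d)
        return [buckets[k] for k in sorted(buckets)]

    def m_interval(dates, m, offset=0):
        """Thin daily reforecasts to one every M days; offset=0..M-1 selects one
        of the M possible sub-periods (cf. Figure 45: M=3, offset by 1 day)."""
        return dates[offset::m]

    dates = reforecast_dates()                # daily reforecasts, 1985 onwards
    two_by_twelve = n_sub_periods(dates, 12)  # two 12-year calibration samples
    every_third = m_interval(dates, 3, 1)     # reforecasts every 3 days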

The sensitivities to C were examined by calibrating the MEFP with an ensemble mean derived from C={1, 5, and 11} of the ensemble members from the GEFS reforecasts. In practice, the GEFS reforecasts contain fewer ensemble members (F=11) than the operational forecasts (F=21). Since the operational HEFS forecasts use all available GEFS members (F=21), the HEFS reforecasts were also generated with all available GEFS members (F=11). However, in order to understand the impacts of this discrepancy, a baseline reforecast was generated with the control run only (F=1), using the corresponding MEFP calibration (C=1).

The minimum requirements for validating the HEFS depend on N and M (not C) and were examined both theoretically and empirically. Theoretically, verification is concerned with the estimation of statistical measures. The quality of these estimates will depend on the number of samples available and their unique information content. Empirically, the effect of reducing the number of reforecasts available is to increase the sampling uncertainty of the verification results and to render some (typically large) events unverifiable, depending on the choice of measure. In order to illustrate the effects of N and M on the uncertainties associated with validating the HEFS, each sub-sample of N and M was verified separately and the results compared to the nominal scores for N=24 and M=1, as sketched below.
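This comparison can be summarized schematically as follows; score() stands in for any verification metric, and the helper is a hypothetical illustration rather than part of the HEFS verification software.

    def score_range(subsets, score):
        """Verify each sub-sample of N or M separately and return the nominal
        score for the pooled sample together with the spread across subsets."""
        pooled = [pair for subset in subsets for pair in subset]
        per_subset = [score(subset) for subset in subsets]
        return score(pooled), min(per_subset), max(per_subset)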

Results

In terms of the quality of the MEFP forcing and HEFS streamflow forecasts, there is no systematic decline in forecast quality as the interval between reforecasts (M) increases from 1 to 7 days or the historical period (N) decreases from 24 to 4 years. However, when considering the sensitivity of the verification scores to N and M, measured by the range of scores across these scenarios, there are meaningful differences, particularly at higher thresholds of precipitation and streamflow. In this context, sensitivity is a necessary but not a sufficient condition for a decline in forecast quality. These results imply some sensitivity to N and M, but they do not suggest a consistent decline in forecast quality with increasing M or decreasing N. In practice, a reforecast archive of N=12 years and M=1 day (among other combinations) should be adequate to calibrate the MEFP, but it would not be adequate to validate the HEFS for large events, as described below.

Against the best available calibration (C) and forecasting (F) scenario (C=11, F=11), there is a material decline in the quality of the HEFS forcing and streamflow forecasts when using the control member only (C=1, F=1). For some basins, metrics and thresholds, this is minimized by using all ensemble members to generate the HEFS forecasts (C=1, F=11). For precipitation and streamflow, the greatest differences occur at CN-DOSC1, particularly in the middle and latter portion of the forecast horizon, where the forecast lead time is increased by 1+ days when using C=11 (F=11) members versus C=1 (F=11). The improvements in temperature are greatest at AB-CBNK1 and NE-HOPR1, particularly at the hottest observed temperatures and during the middle portion of the forecast horizon, where the CRPSS is increased by ~10% in real terms (~30% relative to the baseline CRPSS). In contrast, when calibrating the MEFP with C=5 ensemble members (F=11), the forcing and streamflow forecasts are no less reliable or skillful than those calibrated with C=11 members (F=11). Thus, for the locations, thresholds, and verification metrics considered, 5 ensemble members should be adequate to calibrate the MEFP, while the operational forecasts would benefit from using all available ensemble members.


The minimum requirements for validating the HEFS are examined both theoretically and empirically. At a daily aggregation, the average number of verification pairs for which the observed value exceeds a climatological probability, Cp, is ~365N(1-Cp). In order to estimate a lumped verification score with reasonably small sampling uncertainty, 30 or more independent samples may be required. Thus, if all of the large (e.g. exceeding Cp=0.995) events in a verification sample are statistically independent, and reforecasts are issued once per day, 16.5 years of reforecasts would be required, on average, to generate a verification sample with 30 large events. Clearly, these requirements increase dramatically with increasing Cp; in general, the probability of flooding at a daily aggregation is less than 1-in-200 (Cp=0.995). They also increase when the individual samples are related to each other (e.g. one flood event that spans several days), as described below. More detailed metrics, such as the reliability diagram and Relative Operating Characteristic (ROC), require many more samples than a lumped verification score (perhaps 100-200 samples).
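The arithmetic behind these estimates can be reproduced directly. The sketch below assumes statistically independent daily samples, which is optimistic, as the discussion of autocorrelation that follows makes clear; the choice of 150 samples for a detailed metric is illustrative of the 100-200 range quoted above.

    def years_required(n_events, cp, m_days=1):
        """Years of reforecasts needed, on average, for n_events verification
        pairs above the climatological probability Cp, with one reforecast
        every m_days (i.e. ~365*N*(1-Cp)/m_days pairs in N years)."""
        return n_events * m_days / (365.0 * (1.0 - cp))

    print(years_required(30, 0.995))    # ~16.4 years for a lumped score
    print(years_required(150, 0.995))   # ~82 years for a detailed metric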

For two time-series (forecasts and observations) that are both autocorrelated in time, the effective sample size for verification is smaller than the nominal sample size. As the correlations increase, the effective sample size declines. For example, the lag-1 autocorrelation of streamflow at AB-CBNK1 for a daily aggregation is 0.542, while the lag-1 autocorrelation at NE-HOPR1 is 0.897. Based on sampling theory, the effective sample size for computing the cross-correlation between the observed and forecast time-series would be 55% of the nominal sample size at AB-CBNK1 and 11% at NE-HOPR1. In other words, for a given amount of confidence in the streamflow verification, roughly 9x more data would be required at NE-HOPR1 than implied by the nominal sample size, and 2x more data at AB-CBNK1. Precipitation is generally autocorrelated over much shorter time-scales than streamflow; however, this short-lived character increases the probability that data thinning (e.g. from M=1 to M=3) would miss extreme events entirely, significantly reducing their number in the sample. Thus, based on sampling theory alone, reducing the reforecast period and frequency would systematically reduce the precipitation and streamflow thresholds at which the HEFS, and associated decision support, could be validated and optimized, particularly for multi-day aggregations, such as reservoir inflows.

The sensitivities of the validation results to different configurations of N and M are also explored empirically. Here, the range of verification scores between cases of N and M is much smaller than the range of scores within cases for different subsets of the validation data. Thus, as anticipated from theory, the minimum requirements for validating the MEFP are much greater than the minimum requirements for calibrating the MEFP, even for relatively simple verification scores. Of the verification scores considered here, the cross-correlation is particularly variable across the subsets of N and M. For example, at AB-CBNK1, precipitation amounts that exceed Cp=0.995 show correlations of between -0.1 and 0.6 in the three subsets of M=3. Thus, for a 1-in-200 day precipitation amount at AB-CBNK1, forecasts issued every three days over a 24-year period would be unverifiable. For more detailed verification metrics, such as the reliability diagram, the thresholds for which the HEFS remains verifiable are even smaller. Thus, even with daily reforecasts between 1985 and 2008, the sample sizes are too small to evaluate reliability diagrams for moderately large precipitation amounts (Cp=0.99), as these events are rarely forecast with high probability. However, this partly originates from a conditional bias in the precipitation forecasts at high observed thresholds, which would not be addressed by increasing the number of reforecasts alone.

In summary, therefore, the minimum requirements for meteorological reforecasting in support of the HEFS are determined, primarily, by the need to validate the HEFS with reasonably small sampling uncertainty, including for large events. In general, simple, unconditional, verification measures cannot guide operational practice, because they are not application-specific. For example, a flood warning may be triggered when the forecast probability of flooding exceeds some threshold. In this context, there is a trade-off between issuing warnings too regularly (low probability threshold) and failing to warn when floods actually occur (high probability threshold). Given an adequate sample of historical flood occurrences, this trade-off, and hence the triggering threshold, can be defined, objectively, through hindcasting and validation (see the sketch below). By way of illustration, the use of a degraded reforecast of M=3 at NE-HOPR1 could lead to flood warnings that are correct on only 40% of occasions, when they could be correct on 58% of occasions for a warning threshold optimized to daily reforecasts (i.e. M=1). For users of the HEFS, such as the NYCDEP, a long and consistent record of historical forecasts is, therefore, essential; it is necessary to optimize and improve decision support systems and to benchmark these systems against historical analogs for future extremes.
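A minimal sketch of the threshold-selection logic described above follows. The use of PoD minus PoFD (the Peirce skill score) as the objective is an illustrative choice, not a method prescribed by this report; in practice the objective would reflect the user's own costs of false alarms and missed events.

    def pod_pofd(flood_probs, flood_observed, threshold):
        """PoD and PoFD for warnings issued whenever the forecast probability
        of flooding meets or exceeds the candidate threshold."""
        hits = misses = false_alarms = correct_negatives = 0
        for p, occurred in zip(flood_probs, flood_observed):
            warned = p >= threshold
            if occurred:
                hits, misses = hits + warned, misses + (not warned)
            else:
                false_alarms += warned
                correct_negatives += not warned
        pod = hits / max(hits + misses, 1)
        pofd = false_alarms / max(false_alarms + correct_negatives, 1)
        return pod, pofd

    def best_threshold(flood_probs, flood_observed, candidates):
        """Select the warning threshold that best trades off hits against
        false alarms, here by maximizing PoD - PoFD over the hindcast record."""
        def objective(t):
            pod, pofd = pod_pofd(flood_probs, flood_observed, t)
            return pod - pofd
        return max(candidates, key=objective)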

Recommendations

Reforecasting requires both significant human and computational resources. However, unsophisticated approaches to data thinning, such as reducing the number of historical years (N) or increasing the interval between reforecasts (M), will also reduce the value of these reforecasts for hydrologic applications. In terms of the HEFS, the greatest impact of reducing the sample of historical reforecasts would be to prevent the validation of large events with the necessary statistical confidence. These events are critically important to users of the HEFS, such as the NYCDEP. Thus, any approach to data thinning must accommodate a reasonable sample of large and extreme events. The frequency of reforecasts (M) should also accommodate rapidly evolving hydrometeorological conditions, for which M>1 day would not be appropriate for the short-to-medium range.

The impact of reducing the number of ensemble members in the GEFS reforecasts (C) will be to reduce their value for statistical post-processing and other applications. Nevertheless, as C increases, there are diminishing gains for the reliability and skill of the MEFP outputs. This study indicates that C=5 ensemble members should be adequate to calibrate the MEFP with the GEFSv10. However, this cannot be generalized to other techniques or to future implementations of the MEFP. Indeed, the benefits of reforecasting with additional members may vary with location and forecast conditions, and they may be greater for extreme events (for which sample measures of forecast quality are inherently limited). Thus, any compromise should be reviewed as models and applications evolve and diagnostic techniques become more sophisticated.

While the costs associated with meteorological reforecasting are substantial, the benefits are even more substantial. Thus, a concerted effort should be made to produce reforecasts every day over the maximum historical period for which there is adequate data to initialize the GEFS, rather than compromising on N or M. In the absence of a complete reforecast, more sophisticated approaches to data thinning will be required. Here, emphasis on early forecast lead times and extreme events will increase the utility of a limited reforecast for hydrologic applications.

Spatial pooling or regionalization may improve the sample sizes for calibration and validation of the MEFP. Studies are underway to establish whether reforecasts from hydrometeorologically similar basins can be used to augment the calibration and validation of the HEFS. However, spatial pooling cannot satisfy user requirements for long historical records at critical forecast locations. Also, in validating the streamflow forecasts, spatial pooling is inherently difficult, as hydrologic state variables, unlike atmospheric state variables, often vary abruptly (over short distances), and with myriad basin characteristics.

An adequate sample of historical events, including large and extreme events, is only one of several minimum requirements for users of weather and climate reforecasts. Other requirements include the timely communication of development plans, use of transitional arrangements for legacy models (e.g. temporary freezing of models), software version control, coordination of model updates with users, timely access to reforecasts, and consistency of the reforecasts and operational forecasts (including initializations), among others. Collectively, these requirements should contribute to a renewed effort by the NWS and other operational forecasting agencies to deliver weather, climate, and water (re)forecasts for improved decision support. This broader set of requirements must be addressed separately, alongside the minimum requirements of end users, such as the NYCDEP.


2. Introduction

The Hydrologic Ensemble Forecast Service (HEFS) is an operational hydrologic forecasting system that is being implemented by the thirteen River Forecast Centers (RFCs) of the U.S. National Weather Service (NWS). The HEFS quantifies the total uncertainty in future streamflow as a combination of the meteorological forcing uncertainty and the hydrologic modeling uncertainty, while correcting for biases in the forecast probabilities (Seo et al., 2010; Demargne et al., 2010, 2014; Brown et al., 2014a/b). The HEFS ingests weather and climate forecasts from, among other sources, the Global Ensemble Forecast System (GEFS) of the National Centers for Environmental Prediction (NCEP), as well as NCEP's Climate Forecast System Version 2 (CFSv2), and produces ensemble streamflow forecasts for the short- to long-range. The HEFS aims to: 1) span lead times from one hour to one year or more with seamless transitions between forecast time horizons; 2) issue forecast probabilities that are unbiased for different aggregation periods; 3) be spatially and temporally consistent across RFC domains; 4) capture information from current operational weather and climate forecasting systems, while correcting for biases; 5) be consistent with retrospective forecasts or hindcasts that are used for verification and decision support; and 6) be properly validated, in order to identify the strengths and weaknesses of the forecasts and to guide forecasting operations and decision support.

By explicitly accounting for the uncertainties inherent in meteorological and hydrologic forecasting, while correcting for biases in the forecast probabilities, the HEFS aims to support improved, risk-based, decision making for a variety of water resources applications, including reservoir operation, flood forecasting, river navigation, and water supply. For example, the New York City Department of Environmental Protection (NYCDEP) is using the HEFS to improve the management of risks to water quantity and quality objectives in the NYC area. In this context, the NYCDEP has developed an Operational Support Tool (OST), which ingests streamflow forecasts from the HEFS that are produced operationally by the Middle-Atlantic RFC and the Northeast RFC. The OST optimizes the quantity and quality of water stored in the NYC reservoirs, while avoiding unnecessary, multi-billion dollar, infrastructure costs, such as water filtration. Elsewhere, the U.S. Army Corps of Engineers (USACE) are redeveloping their water control manual for the Folsom Reservoir and the American River. In this context, the California-Nevada RFC (CNRFC) are evaluating the use of streamflow hindcasts from the HEFS, in order to establish the benefits and risks of using inflow forecasts to manage the flood control space in the Folsom Reservoir. Elsewhere in California, the Yuba County Water Agency (YCWA), together with CNRFC and partners, are exploring the use of probabilistic inflow forecasts to better manage the flood control spaces in Lake Oroville, the Englebright Reservoir and the New Bullards Bar Reservoir.

The ability of the HEFS to provide useful information for decision making depends, crucially, upon the accuracy (unbiasedness and skillfulness) of the forecast probabilities. There is a need to demonstrate this accuracy through retrospective forecasting and verification. Retrospective studies are necessary to guide the development of the HEFS, as well as decision support systems that rely upon the HEFS, and to build confidence among decision makers that the forecasts are accurate, useful, and can lead to better decisions. In order to provide meteorological and streamflow forecasts that are demonstrably accurate, the HEFS must be calibrated and validated with historical data. While recent studies have documented the quality of the precipitation, temperature and streamflow forecasts from the HEFS, both for the short-to-medium range (Brown et al., 2014a/b) and for the long-range (Brown, 2013), the minimum requirements for reforecasting have not been evaluated. These requirements are largely driven by the raw meteorological reforecasts used as input to the HEFS and, specifically, by the HEFS Meteorological Ensemble Forecast Processor (MEFP), which aims to correct for biases in the raw forecasts of precipitation and temperature (Schaake et al., 2007; Wu et al., 2011). Observations of precipitation, temperature and streamflow are also required to initialize the HEFS, calibrate the hydrologic models, and to validate the forecasts. Gauge-based observations are typically available for many decades (often 50-100 years) at river forecast locations. However, atmospheric models rely on a best estimate (or a range of possibilities) of the multivariate, spatially distributed, state of the atmosphere-ocean system at the forecast issue time. In order to conduct reforecasting, these estimates must be produced retrospectively. In practice, reliable estimates of the atmosphere-ocean state variables require satellite observations, which are only available since the early 1980s. Thus, meteorological reforecasting is inherently constrained to the recent past. Also, given the significant cost of conducting reforecasting, a trade-off emerges between expanding reforecasting and improving the underlying weather and climate models. However, for users of the HEFS, such as the NYCDEP and YCWA, hydrometeorological reforecasting is critically important. It is necessary to optimize and improve decision support systems, such as the OST, and to benchmark these systems against historical analogs for future extremes.

In order to support NCEP in determining the requirement of the HEFS for meteorological reforecasting, this report considers the sensitivity of the HEFS to a limited number of reforecast configuration options. Clearly, reforecast configuration is only one of several requirements for users of weather and climate forecasts. Other requirements include the timely communication of development plans, use of transitional arrangements for legacy models, software version control, coordination of model updates with users, timely access to reforecasts, and consistency of the reforecasts and operational forecasts, among others. Collectively, these requirements should contribute to a new business model for NCEP and other operational forecasting agencies in delivering weather, climate, and water (re)forecasts for improved decision support. As indicated above, this report focuses on the minimum technical requirements of the HEFS for meteorological reforecasts. It does not consider the broader set of requirements for delivering an efficient and effective forecasting service, which must be addressed separately.

In terms of the HEFS, the minimum requirements for historical data are driven by: 1) the need for an adequate sample size to estimate the statistical parameters of the HEFS; 2) the need for an adequate sample size to validate the HEFS; and 3) the need for users of the HEFS to calibrate and validate their decision support systems. This report is concerned with the minimum requirements for (1) and (2) only. The requirements of end users, such as the NYCDEP and YCWA, will be gathered and presented separately. In this context, (1) and (2) define the minimum requirements for operating the HEFS, while (3) is necessary to ensure the outputs from the HEFS are useful for decision making. In other words, the minimum requirements associated with (1) and (2) should be regarded as an incomplete baseline. In practice, the requirements of users for meteorological and hydrologic reforecasting may exceed those for calibration and validation of the HEFS, and they may evolve as services change and other users adopt the HEFS. Furthermore, this study is concerned with short-to-medium range forecasting only and, specifically, with the minimum requirements for historical data from the Global Ensemble Forecast System (GEFS).

Raw forecasts of temperature and precipitation from the GEFS are used to produce bias-corrected forcing for input to the HEFS. These forecasts are used in water supply decision making for the short-to-medium range, including reservoir management, flood warning, river navigation and recreation. The GEFS uses Version 9.0.1 of the Global Forecast System (GFS), which comprises a horizontal resolution of T254 (~55km) for 1-8 days and T190 (~70km) for 9-16 days, and a vertical resolution of L42 or 42 levels (Wei et al. 2008; Hamill et al. 2011; Hamill et al. 2013). Reforecasts were produced with the GEFS for a ~26-year period between 1985 and 2010 (Hamill et al., 2013). Calibrating and validating the HEFS with a subset of the available reforecasts will identify the sensitivities of the HEFS to a degraded reforecast with the current GEFS only. Some applications of the HEFS may benefit from a configuration that improves upon the available reforecasts, but this cannot be established here. Rather, this study examines the ability to provide accurate forecasts with the HEFS using a degraded calibration sample and the ability to measure that accuracy with a reduced validation sample.

The minimum requirements for calibrating the HEFS include an adequate historical period and frequency of reforecasts from which to estimate the statistical parameters of the HEFS, and sufficient ensemble members to capture the skill in the meteorological forecasts. Since the HEFS relies on statistical modeling, consistency of the reforecasts and operational forecasts is also important. The minimum requirements for validating the HEFS also include an adequate sample (historical period and frequency) of reforecasts under varying basin conditions, again without structural changes that would undermine their interpretation. In slow-responding basins, the effective sample size is reduced by temporal autocorrelations in streamflow, implying a longer period of record for validation (and calibration of streamflow post-processors). In fast-responding basins, conditions evolve rapidly, implying a greater frequency of reforecasts to capture large and extreme events. Assuming the climatology is reasonably stationary, a 25-year reforecast should capture much of this variability. However, at a one-day aggregation, flooding may occur with a climatological frequency of 0.001 (1-in-1000 days) or less. Thus, on average, fewer than ten (0.001 x 365 x 25, or roughly 9) flood events will occur within a 25-year period. Likewise, for long-range forecasting, where fixed aggregations are often required (e.g. April-July reservoir volumes), a 25-year reforecast will inevitably omit some important variability.

In summary, the aims of this study are twofold, namely to determine the minimum requirements for reforecasting with the GEFS, in order to: 1) calibrate the HEFS adequately, that is, without materially reducing the quality of the forecasts, including at high thresholds; and 2) validate the forcing and streamflow forecasts with reasonably small sampling uncertainty. The calibration of the HEFS depends on an adequate sample size, for which the period of record and interval between reforecasts are important. It also depends on the number of ensemble members in the GEFS and the consistency of the reforecasts and operational forecasts. Likewise, the validation of the HEFS depends on an adequate sample size, for which the period of record and interval between reforecasts are important, and a reasonably consistent and representative sample (accepting that these two things may not be aligned). Following a description of the study basins, datasets and approach, the verification results are presented separately for the minimum calibration and validation requirements.

3. Approach

3.1 Study basins

Four headwater basins were considered in this study, namely: the Chikaskia River at Corbin, Kansas (AB-CBNK1); the Dolores River at Rico in Colorado (CB-DRRC2); the Middle Fork of the Eel River at Dos Rios in California (CN-DOSC1); and the Wood River at Hope Valley, Rhode Island (NE-HOPR1). Figure 1 and Table 1 show the location of each basin, its average elevation, area, and the location of the nearest grid node in the GEFS. Table 1 also shows the annual precipitation, the fraction of precipitation that generates runoff (the runoff coefficient), and the ratio of precipitation to potential evaporation (a climate index), illustrated in the sketch after this paragraph. The drainage areas range from 188 square kilometers (NE-HOPR1) to 2,057 square kilometers (AB-CBNK1) and the runoff coefficients vary from 0.12 (AB-CBNK1) to 0.55 (NE-HOPR1). The basins were chosen for a combination of practical and hydrological reasons. First, they all originate from RFCs for which the HEFS has been implemented and validated, namely AB-, CB-, CN-, and NE-RFCs, and for which the absolute quality of the forecasts has been documented (Brown, 2013, 2014; Brown et al., 2014a/b). Here, the focus is on the minimum requirements for calibrating and validating the HEFS, that is, on the relative quality of the forecasts for different configurations of the GEFS, and not on the absolute quality of the forecasts. Second, headwater basins respond quickly to forcing information and, as the uncertainties and biases propagate from upstream to downstream locations, it is important, initially, to understand the quality of the HEFS in headwater basins. Third, headwater basins are important for operational forecasting of water quantity and quality, including flood warning and reservoir operations. Further downstream, the HEFS will be impacted by additional sources of bias and uncertainty, of which some are inherently difficult to quantify (e.g. the downstream effects of river regulations, simplified hydraulic routing and composite timing errors; see Raff et al., 2013). As part of the ongoing evaluation of the HEFS, more complex regimes, as well as additional sources of forcing, will be considered in future.
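For reference, the two basin descriptors reported in Table 1 are simple ratios; the following minimal sketch states them explicitly, assuming consistent annual units (e.g. mm/year), which is an illustrative convention rather than the table's documented one.

    def runoff_coefficient(annual_runoff_mm, annual_precip_mm):
        """Fraction of precipitation that generates runoff."""
        return annual_runoff_mm / annual_precip_mm

    def climate_index(annual_precip_mm, annual_pe_mm):
        """Ratio of precipitation to potential evaporation."""
        return annual_precip_mm / annual_pe_mm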

Figure 2 shows the daily means of temperature, precipitation, and streamflow for each basin, where CN-DOSC1 and CB-DRRC2 both comprise an average over two sub-basins (see below). The averages are shown by calendar month and were derived from gauged temperature, precipitation, and streamflow over a 24-year period between 1985 and 2008 (see Section 3.3). As indicated in Figure 2, there are marked differences in the seasonality and covariability of precipitation and runoff among these basins.

The Chikaskia River (AB-CBNK1) experiences a warm and humid summer climate. During the late spring and early summer, cool air from Canada and the Rocky Mountains combines with moist air from the Gulf of Mexico and hot air from the Sonoran Desert, leading to intense thunderstorms and tornados in Kansas and Oklahoma. At AB-CBNK1, the relationship between precipitation and runoff is modulated by the shallow terrain and dense vegetation cover, as well as increased evapotranspiration during the summer months.

The Dolores River (CB-DRRC2) is a tributary of the Colorado River and occupies a narrow valley incised into the sandstone of the San Juan Mountains. Precipitation is reasonably constant throughout the year, but falls primarily as snow during the winter months. The snowpack melts in the late spring and early summer, which leads to a sharp increase in runoff between April and July (Figure 2). For the purposes of hydrologic modeling, CB-DRRC2 is separated into two sub-basins, in order to accommodate the varied elevations there. The lower sub-basin accounts for 67% of the total area of CB-DRRC2.

    The Eel River (CN-DOSC1) drains the windward slopes of the North Coast Ranges

    in Northern California (Figure 1). During the late summer and early autumn, the upper

    reaches of the Eel River experience little or no precipitation and streamflow. Low flows

    are accentuated by diversions to the Russian River for use in the Potter Valley Hydro-

    Electric Project. In late autumn, cooler temperatures are accompanied by rapidly

    increasing precipitation, to which the streamflows respond through November and

    continue increasing until January (Figure 2). During the winter months, the predictability

    of heavy precipitation is increased by the onshore movement of weather fronts from the

    Pacific coast and their orographic lifting in the North Coast Ranges. The coastal

    mountains of northern California and the Pacific Northwest are also susceptible to

    atmospheric rivers, which carry moisture in narrow bands from the tropical oceans to

the mid-latitudes. Atmospheric rivers can lead to persistent, heavy precipitation and

    extreme flooding in the North Coast Ranges and further inland (Smith et al., 2010). For

    the purposes of hydrologic modeling, CN-DOSC1 is separated into two sub-basins, and

    the lower sub-basin accounts for 77% of the total area of CN-DOSC1.

    The Wood River flows approximately 85km from its source in Sterling, Connecticut,

    through Hope Valley (NE-HOPR1) in the Arcadia Management Area to Alton, Rhode

    Island, where it converges with the Pawcatuck River. As indicated in Figure 2, the daily

    average precipitation at NE-HOPR1 is relatively constant throughout the year, but


    includes significant snowfall during winter months (the average annual snowfall is

    866mm). During the early spring, rising temperatures lead to snowmelt and to a peak in

    streamflow around March or April, followed by lower flows during the summer months.

    3.2 Experimental design

    The HEFS quantifies the total uncertainty in future streamflow as a combination of

    the meteorological and hydrologic uncertainties, while correcting for biases in both the

    forcing and streamflow (Demargne et al., 2014). Further information about the HEFS

    methodology can be found in Appendix A. The meteorological uncertainties and biases

    are quantified with the Meteorological Ensemble Forecast Processor (MEFP). The MEFP

    produces ensemble forecasts of precipitation and temperature conditionally upon a raw,

    single-valued, forecast (Wu et al., 2011). For the short- to medium-range, the raw

    forecasts used by the MEFP include the ensemble mean of the GEFS. In removing the

    meteorological biases with the MEFP, the hydrologic uncertainties and biases can be

    modeled independently of the forcing uncertainties and biases (Seo et al., 2006;

    Demargne et al., 2014). The hydrologic uncertainties and biases are modeled in two

    stages. First, the meteorological forecasts from the MEFP are used to generate raw

    streamflow forecasts, which may contain hydrologic biases, but do not explicitly account

for any hydrologic uncertainties. Second, the raw streamflow forecasts are post-

    processed with the Ensemble Postprocessor (EnsPost). The EnsPost models the

    hydrologic uncertainties and biases from the residuals between the observed and

    simulated streamflows (Seo et al., 2006); that is, streamflow predictions based on

    observed temperature and precipitation at the forecast issue time.

    The simulations and observations used to estimate the hydrologic uncertainties

    and biases are typically available for several decades at each RFC forecast location.

    Likewise, the precipitation and temperature observations used to generate the streamflow

    simulations and to quantify the forcing uncertainties and biases are typically available for

    several decades. In contrast, the meteorological reforecasts, which are used by the MEFP

    to estimate the forcing uncertainties and biases, require satellite observations and

    corresponding reanalysis of the ocean-atmosphere states, in order to initialize the


    weather and climate models. These datasets are only available from the early 1980s

    onwards. Thus, as indicated above, the requirements of the HEFS for historical data are

    primarily constrained by the availability of (appropriate initialization for the) meteorological

    reforecasts.

    As indicated above, the total uncertainty in the streamflow forecasts originates from

    a combination of uncertainties in the meteorological forecasting and hydrologic modeling.

    Depending on basin characteristics and antecedent conditions, a large fraction of the total

    uncertainty can originate from the meteorological uncertainties (Kavetski et al., 2002;

    Pappenberger et al., 2005; Wu et al., 2011). Thus, the meteorological forecasts are a

    central component of the HEFS and other hydrologic ensemble prediction systems. When

    a meteorological model is updated, any changes in the statistical properties of the

    precipitation and temperature forecasts will, to some degree, impact the streamflow

    forecasts from the HEFS. For example, the MEFP may be impacted by changes in the

    spatial or temporal resolution of the model, including the position of grid cells in relation

    to hydrologic basins, the model physics in different layers, including at the land-surface

    and ocean boundaries, and the number of (or approach to generating) ensemble

    members. In terms of calibrating the MEFP, these properties are important insofar as they

    influence the statistical character of the precipitation and temperature forecasts, including

    any systematic biases, as well as the information content more generally (e.g. measured

    in terms of correlation). In general, therefore, the MEFP must be recalibrated when the

    GEFS is updated in any non-trivial way. Likewise, any non-trivial changes to the HEFS

    must be accompanied by new streamflow hindcasting and validation. In many cases, this

    requires further hindcasting and validation by users of the HEFS, such as the NYCDEP,

    who rely upon streamflow hindcasts to calibrate and validate their own forecasting and

    decision support systems. Following changes to the operational GEFS, the HEFS

    requires an adequate sample of meteorological reforecasts, in order to recalibrate the

    MEFP and to produce and validate new forcing and streamflow hindcasts. In this context,

    the minimum requirements for reforecasting include the number of historical years of data

    (N), the interval between reforecasts (M), and the number of ensemble members. These

    and other variables are summarized in Table 2.


    In order to evaluate the effects of N and M on the quality of the precipitation and

    temperature forecasts from the MEFP, the raw GEFS reforecasts (Hamill et al., 2013)

    were systematically degraded from N=24 years (1985-2008) and M=1 day to

    combinations of smaller N and larger M. These thinned reforecasts were used to

    calibrate the MEFP and to generate forcing and streamflow hindcasts for a consistent

    validation period. As indicated above, some applications of the HEFS may benefit from a

    reforecast configuration that improves upon the available reforecasts, but this cannot be

    established here. In degrading the raw GEFS reforecasts, the hindcasting and validation

    period was fixed to 24 years (1985-2008), with a forecast issued at 12Z each day. The

    choice of validation period was motivated by: 1) the need to isolate the effects of N and

    M on the quality of the MEFP forecasts, independently of any background variability (i.e.

    from changes in the validation period); and 2) by the choice of experimental design for

    validation. In terms of the latter, independent validation is always preferred when

    evaluating statistical techniques, such as the MEFP. Unless the verifying observation is

    removed from the calibration sample, the statistical parameters will benefit, unfairly, from

    seeing the outcome in advance of predicting it. Depending upon the number of

    parameters to estimate and their sampling properties, among other factors, this

    advantage can be important. The results from dependent validation should, therefore, be

    regarded as a best case scenario of the actual forecast quality. In practice, however,

    the MEFP is relatively parsimonious (Wu et al., 2011). In other words, a single observation

    should not greatly influence the estimated parameters. Furthermore, independent

    validation poses significant practical challenges, as the HEFS is an operational

    forecasting system; it is not well-suited to automatic calibration, and hindcasting is

    extremely time-consuming.

    In evaluating the sensitivities to N, both dependent and (limited) cross-validation

    were employed. Specifically, the 24-year validation period was sub-divided into smaller

    calibration periods, N={2x12, 3x8, 4x6, and 6x4} years. Dependent validation involved

    estimating the parameters for each sub-period, issuing forecasts for that sub-period, and

    collating the forecasts from all sub-periods for validation (i.e. 24 years in total).

    Independent validation involved borrowing the parameters from an adjacent sub-period.

    In practice, this should be regarded as a worst case scenario for the expected forecast


    quality, because independent forecasting is conducted for multiple years (i.e. 12 years,

    for N=12) without recalibrating the MEFP. Table 3 summarizes the dependent and

    independent calibration scenarios for N. In evaluating the sensitivities to M, the MEFP

    was calibrated for M={1, 3, 5, and 7} days and forecasts were issued at 12Z each day

    between 1985 and 2008. In this context, M=1 represents dependent validation, whereas

    M={3, 5 and 7} involves a mixture of dependent and independent validation. Specifically,

    for M=3, 5, and 7 days, 1/3rd, 1/5th and 1/7th of the validation sample appears in the

    calibration sample, respectively. The calibration scenarios for M are summarized in Table

    4. Alongside the precipitation and temperature forecasts from the MEFP, streamflow

    forecasts were produced at the outlet of each basin (see below).
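To make the thinning concrete, the following Python sketch illustrates how the N and M scenarios could be constructed from a daily reforecast archive. The function names and fold layout are illustrative assumptions, not the actual MEFP calibration code:

    from datetime import date, timedelta

    # Full reforecast archive: one 12Z forecast per day, 1985-2008 (N=24, M=1).
    start, end = date(1985, 1, 1), date(2008, 12, 31)
    all_dates = [start + timedelta(days=i) for i in range((end - start).days + 1)]

    def thin_by_interval(dates, m):
        """Keep every m-th reforecast date (the M scenarios: M=1, 3, 5, 7)."""
        return dates[::m]

    def split_into_folds(dates, years_per_fold):
        """Partition the archive into contiguous sub-periods of equal length
        (the N scenarios: 2x12, 3x8, 4x6, and 6x4 years)."""
        folds = {}
        for d in dates:
            fold = (d.year - start.year) // years_per_fold
            folds.setdefault(fold, []).append(d)
        return folds

    # Example: N=6 yields four 6-year sub-periods. Under dependent validation,
    # each fold is used both to calibrate and to forecast; under the limited
    # cross-validation scenario, parameters are borrowed from an adjacent fold.
    folds = split_into_folds(all_dates, years_per_fold=6)
    print({k: len(v) for k, v in folds.items()})
    print(len(thin_by_interval(all_dates, 3)), "reforecast dates retained for M=3")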

    As a post-processing technique, the MEFP aims to improve skill by reducing bias

    in the raw GEFS forecasts, but does not introduce any new predictors. Thus, the quality

of the MEFP outputs is strongly dependent on the quality of the raw forcing inputs from

    the GEFS. The MEFP uses the ensemble mean from the GEFS to capture the information

    content in these (re)forecasts. In order to examine the sensitivity of the MEFP outputs to

    the number of ensemble members in the GEFS inputs, the GEFS reforecasts were

    systematically degraded by using only a subset of the ensemble members to derive the

    ensemble mean. These thinned reforecasts were used to calibrate the MEFP and to

    generate forcing and streamflow hindcasts for a consistent validation period (i.e. 26 years,

    from 1985-2010). In practice, the GEFS reforecasts contain fewer ensemble members

    (C) than the operational forecasts (F). Specifically, the GEFSv10 reforecasts comprise

    only 11 ensemble members (10 + control), while the operational forecasts comprise 21

members (20 + control). Hindcasting and validation were conducted with all available

    members (11). For example, when calibrating the MEFP with an ensemble mean derived

    from C=5 members, the hindcasts were generated with an ensemble mean derived from

    F=11 members. However, in order to better understand the impacts of this discrepancy,

    a baseline scenario was included. Here, the control run was used to both estimate the

    MEFP parameters (C=1) and to derive the forcing and streamflow hindcasts (F=1). The

    scenarios for C and F are summarized in Table 5.
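As a minimal sketch of the C/F degradation (with a synthetic array standing in for the GEFS reforecasts), the ensemble mean supplied to the MEFP at calibration time is derived from the first C members only, while the hindcast mean uses all F=11 members; the member ordering and array layout are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical reforecast block: (dates, members, lead times), with
    # member 0 standing in for the control run (11 members in total).
    gefs = rng.gamma(shape=0.5, scale=4.0, size=(1000, 11, 40))

    def thinned_ensemble_mean(reforecasts, c):
        """Ensemble mean over the first c members only (the C scenarios);
        c=1 corresponds to the control-only baseline."""
        return reforecasts[:, :c, :].mean(axis=1)

    calibration_mean = thinned_ensemble_mean(gefs, c=5)  # used to fit the MEFP
    hindcast_mean = thinned_ensemble_mean(gefs, c=11)    # used at hindcast time (F=11)
    print(calibration_mean.shape, hindcast_mean.shape)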


    3.3 Datasets

    For each scenario of N and M, hindcasts of mean areal temperature (MAT) and

    mean areal precipitation (MAP) were generated with the MEFP for a 24-year period

    between 1985 and 2008. For each combination of C and F, the hindcasts were generated

    for the full GEFS reforecast period (1985-2010); unlike N and M, the historical period was

    not integral to the validation design for C and F (see below). The hindcasts of MAP and

    MAT each comprise ~60 ensemble members (the precise number varying between

    basins, as described in Wu et al., 2011), with lead times varying from 6 to 360 hours in

    six-hourly increments. In order to evaluate the skill of the MEFP forecasts with GEFS

    inputs (MEFP-GEFS), precipitation and temperature forecasts were also generated with

    a conditional or resampled climatology (MEFP-CLIM). The latter involves resampling

    the historical observations of MAP and MAT in a moving window of, respectively, 61 days

    and 31 days around the forecast valid date.
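The resampled climatology can be sketched as follows: for a given valid date, all historical observations whose day-of-year falls within the moving window (61 days for MAP, 31 days for MAT) are pooled into a climatological ensemble. The data and function below are illustrative assumptions, not the MEFP-CLIM implementation:

    import numpy as np
    import pandas as pd

    def climatology_ensemble(obs, valid_date, window_days):
        """Pool historical observations within +/- window_days//2 of the valid
        date's day-of-year, across all years (an MEFP-CLIM analogue)."""
        half = window_days // 2
        doy = obs.index.dayofyear
        target = valid_date.dayofyear
        # Circular day-of-year distance, so the window wraps around the year
        # end (leap days are ignored for simplicity).
        dist = np.minimum(np.abs(doy - target), 365 - np.abs(doy - target))
        return obs[dist <= half].to_numpy()

    idx = pd.date_range("1985-01-01", "2008-12-31", freq="D")
    rng = np.random.default_rng(0)
    map_obs = pd.Series(rng.gamma(0.4, 5.0, len(idx)), index=idx)  # synthetic MAP
    ens = climatology_ensemble(map_obs, pd.Timestamp("2000-01-15"), window_days=61)
    print(len(ens), "climatological members from the 61-day window")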

    Raw streamflow hindcasts were generated with the Community Hydrologic

    Prediction System (CHPS) using the precipitation and temperature forecasts from the

    MEFP. The hindcasts were produced with the hydrologic models and parameter settings

    used operationally in each RFC. For all RFCs considered here, the Snow Accumulation

    and Ablation Model (SNOW-17; Anderson, 1973) is used together with the Sacramento

    Soil Moisture Accounting Model (SAC-SMA; Burnash, 1995). The models are executed

    at a six-hourly timestep, but interpolated to an hourly timestep at CB-DRRC2 and CN-

    DOSC1. Routing from the headwater to the downstream basins is conducted with Lag/K

    using constant or variable lag and attenuation. Historical simulations were generated with

    observed forcing for each basin and used to examine the sensitivities of the hydrologic

    predictions to the meteorological forcing (see below).

    Observations of precipitation and temperature were obtained from each RFC and

    comprise areal averages (MAP, MAT) of the gauged precipitation and temperature in

    each basin. The data comprise six-hourly observations, recorded in local time, and

    covering the period ~1948-2010. In order to pair the meteorological observations and

    forecasts, the observed values were chosen from the nearest available synoptic times in


{0Z, 6Z, 12Z, 18Z}. This introduced a timing error into the observations of +1, 0,

-1, and -2 hours for NE-HOPR1, AB-CBNK1, CB-DRRC2 and CN-DOSC1,

    respectively. As the forecasts were verified at an aggregated scale of one day or larger

    (see below), this timing error was deemed acceptable. The hydrologic forecasts and

    simulations were paired without any timing errors.
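A sketch of this pairing step is given below: each local-time observation is snapped to the nearest synoptic hour in {0Z, 6Z, 12Z, 18Z}, which introduces the fixed per-basin timing offsets noted above. The UTC offset used in the example is a hypothetical value for illustration:

    import pandas as pd

    SYNOPTIC_HOURS = (0, 6, 12, 18)

    def to_nearest_synoptic(local_time, utc_offset_hours):
        """Convert a local observation time to UTC and snap it to the nearest
        synoptic hour; the difference is the timing error absorbed in pairing."""
        utc = local_time - pd.Timedelta(hours=utc_offset_hours)
        snapped = utc.round("6h")  # nearest 6-hour boundary from midnight
        assert snapped.hour in SYNOPTIC_HOURS
        return snapped

    # A 6-hourly observation recorded at 07:00 local time in a basin five
    # hours behind UTC (hypothetical offset) is paired with the 12Z forecast.
    print(to_nearest_synoptic(pd.Timestamp("1995-06-01 07:00"), utc_offset_hours=-5))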

    3.4 Verification strategy

    Verification was conducted with the NWS Ensemble Verification Service (EVS;

    Brown et al., 2010). The temperature and precipitation forecasts were verified against

    observed temperature and precipitation, respectively. In order to establish the sensitivities

    of the hydrologic forecasts to different calibrations of the MEFP, the raw streamflow

    forecasts were verified against simulated streamflows. Differences between the

    hydrologic forecasts and simulations reflect the contribution of the MEFP-GEFS forcing

    to the quality of the streamflow forecasts, independently of any hydrologic errors and

    biases (which are ordinarily removed by the HEFS Ensemble Postprocessor, EnsPost).

    Aside from eliminating these hydrologic biases, simulated streamflows avoid the timing

    and other errors associated with pairing streamflow forecasts and observations. For

    example, the streamflow observations are only available as daily averages and in different

    time zones to the forecasts. No streamflow post-processing was conducted in this study,

    as the EnsPost uses hydrologic simulations and observations only and is, therefore,

    insensitive to the meteorological reforecasting. In this context, the aim is to establish the

    sensitivity of the HEFS forcing and streamflow forecasts to different calibrations of the

    MEFP, and not to examine the absolute quality of the forecasts, which is considered

    elsewhere (Brown, 2013, 2014; Brown et al., 2014a/b).

    Verification was conducted both unconditionally (i.e. for all data) and conditionally

    upon observed and forecast amount. Unconditional bias and skill are important, as the

    HEFS is an operational forecasting system for which many applications are anticipated.

However, average conditions, particularly the ensemble mean, generally favor drier

    weather and lower flows, as precipitation and streamflow are both skewed variables. In

    order to compare the verification results between basins, for different forecast lead times


    and valid times, and for specific aggregation periods, common thresholds were identified

for each basin. Specifically, for each aggregation period, $a$, and basin, $b$, a climatological

distribution function, $F_{n,a,b}(x)$, was computed from the $n$ values of the hydrometeorological

variable, $x$, between 1985 and 2008. Real-valued thresholds were then determined for $k$

non-exceedence probabilities, $c_p$, as $F_{n,a,b}^{-1}(c_p)$, where $c_p \in [0,1]$ and $p = 1, \ldots, k$.

    These non-exceedence probabilities provide a consistent mapping between the likelihood

    of a particular hydrometeorological occurrence and its corresponding real value across

    different basins and aggregation periods.
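A minimal sketch of this threshold mapping, assuming an empirical inversion of the climatological distribution via sample quantiles; the synthetic record and the probability levels shown (e.g. 0.995, a 1-in-200-day amount) are for illustration only:

    import numpy as np

    def thresholds_from_climatology(values, probs):
        """Invert the empirical CDF F_{n,a,b} at each non-exceedence
        probability c_p, returning the real-valued thresholds."""
        return {p: float(np.quantile(values, p)) for p in probs}

    rng = np.random.default_rng(1)
    daily_precip = rng.gamma(0.4, 6.0, size=24 * 365)        # synthetic 24-year record
    daily_precip[rng.random(daily_precip.size) < 0.6] = 0.0  # dry days
    for p, x in thresholds_from_climatology(daily_precip, [0.0, 0.9, 0.995]).items():
        print(f"c_p = {p:5.3f} -> threshold {x:6.2f} mm")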

    As indicated above, verification was performed for different magnitudes of the

    observed and forecast variables. When conditioning on observed amount, the quality of

    the forecasting system is evaluated for the full range of historical occurrences, including

    extreme events that were forecast inadequately (as small or moderate events). When

    conditioning on forecast amount, the verification results may discount important observed

    extremes. However, since the observed amount is unknown when a forecast is issued,

    conducting verification by forecast amount is useful for guiding operational forecasting

    and real-time decision making. While some verification metrics provide integral measures

    of error across multiple thresholds (e.g. the mean error), others are defined for discrete

    occurrences (e.g. the probability of detection). Integral measures, such as the mean error,

    were derived from the subsample in which the prescribed condition was met (e.g. the

    observation exceeded the threshold). Measures defined for discrete events were

    computed from the observed and forecast probabilities of exceeding the threshold.
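In code, conditioning on observed versus forecast amount reduces to selecting different subsamples of the verification pairs before computing a score, as in the following sketch (synthetic data; the function name is illustrative):

    import numpy as np

    def condition_pairs(obs, ens_mean, threshold, by="observed"):
        """Subsample verification pairs in which the conditioning variable
        (observed or ensemble-mean forecast amount) exceeds the threshold."""
        ref = obs if by == "observed" else ens_mean
        mask = ref > threshold
        return obs[mask], ens_mean[mask]

    # Example: mean error over the subsample in which the observation
    # exceeded the threshold (an integral measure over conditioned pairs).
    rng = np.random.default_rng(4)
    obs = rng.gamma(0.4, 6.0, 5000)
    ens_mean = obs + rng.normal(0.0, 2.0, 5000)
    o, f = condition_pairs(obs, ens_mean, threshold=np.quantile(obs, 0.9))
    print("conditional mean error:", float(np.mean(f - o)))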

    4. Results and analysis

    4.1 Minimum requirements for estimating the parameters of the MEFP

    4.1.1 Sensitivity to the historical period and interval between reforecasts

    The precipitation and temperature forecasts from the MEFP were verified against

    observed MAP and MAT, respectively. The results are shown for a daily aggregation, as

    this is a representative volume for short-to-medium range forecasting. The results are


    presented by forecast lead time and magnitude of the forcing variable for each scenario

    of N (the number of years of reforecasts) and M (the interval between reforecasts). The

    analysis focuses on the sensitivity of the forecasts to N and M in terms of bias, skill, and

    other attributes of forecast quality, and not on the absolute quality of the forecasts. Figure

    3 provides selected verification scores (in the rows) at three climatological probabilities

    (in the columns), for the MEFP-GEFS precipitation forecasts. Here, Cp=0.0 denotes the

    Probability of Precipitation (PoP), while Cp=0.995 represents a daily precipitation amount

    that is exceeded, on average, once every 200 days. The scores were derived from the

    subsample of verification pairs in which the observed precipitation amount exceeded the

    threshold. Here, the verification statistics for the daily accumulations were averaged over

    the first three days of forecast lead time. The results are shown for each calibration

    scenario, N={24, 12, 8, 6, and 4 years}, and for the two validation scenarios, namely

    dependent validation (all scenarios of N) and cross-validation, i.e. N={12, 8, 6, 4} (see

    Table 3). The verification measures are summarized in Appendix B. The correlation

    coefficient measures the degree of association between the ensemble mean of the

    MEFP-GEFS precipitation forecasts and the observed precipitation amount. The relative

    mean error (RME) measures the fractional bias of the ensemble mean forecast, where a

    negative RME denotes an under-forecasting bias. The Continuous Ranked Probability

    Skill Score (CRPSS) measures the fractional improvement of the MEFP-GEFS

    precipitation forecasts when compared to the MEFP-CLIM forecasts, where 1.0 denotes

    a perfect score. The Brier Skill Score (BSS) also provides a lumped measure of skill

    relative to the MEFP-CLIM forecasts. However, unlike the CRPSS, the BSS measures

the ability of the forecasting system to predict the exceedence (or non-exceedence) of a

    discrete threshold.
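For concreteness, the RME and BSS can be sketched directly from their verbal definitions above; the exact EVS formulations may differ, so the definitions below (total ensemble-mean error over total observation for the RME; a climatological reference probability for the BSS) are stated assumptions:

    import numpy as np

    def relative_mean_error(ens_mean, obs):
        """Fractional bias of the ensemble mean; negative values denote
        under-forecasting (one common definition of the RME)."""
        return float(np.sum(ens_mean - obs) / np.sum(obs))

    def brier_skill_score(fcst_prob, obs_event, clim_prob):
        """BSS relative to a climatological reference probability."""
        bs = np.mean((fcst_prob - obs_event) ** 2)
        bs_ref = np.mean((clim_prob - obs_event) ** 2)
        return float(1.0 - bs / bs_ref)

    # Toy example: probability of exceeding a daily precipitation threshold.
    rng = np.random.default_rng(2)
    obs = rng.gamma(0.4, 6.0, 5000)
    ens = obs[:, None] + rng.normal(0.0, 2.0, (5000, 20))  # synthetic ensemble
    threshold = np.quantile(obs, 0.9)
    event = (obs > threshold).astype(float)
    prob = (ens > threshold).mean(axis=1)
    print("RME:", relative_mean_error(ens.mean(axis=1), obs))
    print("BSS:", brier_skill_score(prob, event, clim_prob=event.mean()))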

    Figures 4-7 show selected verification scores by forecast lead time for the MEFP-

    GEFS precipitation forecasts at AB-CBNK1, CB-DRRC2, CN-DOSC1, and NE-HOPR1,

    respectively. Again, the results are shown for subsamples in which the observed

    precipitation amount exceeded Cp={0.0,0.9,0.995}. Unlike Figure 3, the results are shown

    separately for each one-day accumulation, and with a separate curve for each scenario

of N. While Figures 4-7 show the verification results for selected thresholds at all forecast

lead times, Figures 8-11 show the results for all thresholds at selected forecast lead


    times. In Figures 8-11, the climatological probabilities are plotted on a non-linear scale,

    in order to emphasize the larger thresholds. The origin of each curve in Figures 8-11 is

    the climatological PoP, i.e. the zero-precipitation threshold. The BSS denotes the ability

    of the MEFP-GEFS forecasts to predict the exceedence of this threshold. The correlation,

    RME and CRPSS denote the quality of the MEFP-GEFS forecasts for wet conditions, i.e.

    for the subsample that exceeds the threshold, with the lowest threshold being zero.

    As indicated in Figure 3, the sensitivities of the MEFP-GEFS precipitation forecasts

    to the number of years of calibration data (N) are relatively small, both for the dependent

    and independent validation scenarios. In general, the forecast quality is slightly reduced

    under independent validation. However, as indicated above, independent forecasting for

    multiple years (up to 12 years) should be regarded as a worst case scenario for the

    expected forecast quality, as the MEFP should be recalibrated more frequently. The

    greatest differences between dependent and independent validation occur in CB-DRRC2,

    particularly for light and moderate precipitation amounts, where the forecast quality is

    generally lower. This is understandable because CB-DRRC2 lies in the San Juan

    Mountains of Colorado, where the steep terrain leads to reduced predictability and

    increased climatological variability on inter-annual timescales. While the MEFP assumes

    that the joint distribution of forecasts and observations is reasonably stationary, any

    climatological non-stationarities may introduce a trade-off between larger N (smaller

    sampling uncertainty) and smaller N (greater climatological specificity). As indicated in

    Figure 3, for most verification scores, locations and thresholds, there is no systematic

    increase in forecast quality with increasing N. Indeed, in some cases, the forecast quality

    increases slightly with decreasing N. Given the sampling uncertainties, this should not be

    overstated. However, it may originate from climatological variability over the validation

period and thus a greater specificity of the estimated parameters at smaller N. As indicated

    in Figures 4-7, the sensitivities to N are relatively small at all forecast lead times, although

some erratic behavior is seen at N=4 in AB-CBNK1 and CB-DRRC2, where the

    absolute forecast quality is also lower. Similarly, Figures 8-11 suggest that the MEFP is

    relatively insensitive to N across a broad range of precipitation thresholds. However, at

    CB-DRRC2, there is a material decline in BSS for N=4, particularly for light and moderate

    precipitation amounts, while the CRPSS is higher (Figure 9). These differences originate


    from the structure of the BSS and CRPSS. The CRPSS is sensitive to biases in the

    ensemble mean forecast, which are also smaller for N=4. The BSS is sensitive to these

    biases only insofar as they impact the forecast probability (of exceeding Cp), and not to

    their absolute magnitude.

    Figure 12 shows the quality of the MEFP-GEFS precipitation forecasts for different

    scenarios of M. The verification scores include the correlation coefficient and the RME,

    together with the BSS and CRPSS. They were computed at a daily accumulation for

    Cp={0.0, 0.99, 0.995}, and averaged over the first three days of forecast lead time. Figures

    13-16 show the verification results at all precipitation thresholds for AB-CBNK1, CB-

    DRRC2, CN-DOSC1 and NE-HOPR1, respectively. Here, the results comprise daily

    accumulations at forecast lead times of 1, 2, and 3 days. In terms of data thinning, the

    scenarios of N are broadly comparable to M, with M=7 comprising 1/7th of the original

    calibration sample, versus 1/6th for N=4. In principle, for atmospheric variables that are

    statistically dependent over multiple days, thinning by M should have a smaller impact

    than an equivalent N. In practice, however, except for large-scale systems, such as

    atmospheric rivers, precipitation varies over short periods and at small spatial scales, as

    evidenced by the majority of forecast skill occurring in the first 1-7 days (or less at AB-

    CBNK1, which is located in the Central Plains). Thus, depending on forecast lead time

    and location (among other factors), thinning by M may be more or less aggressive than

    an equivalent N.

    As indicated in Figure 12, when averaged across forecast lead times of 1-3 days,

    there is no systematic decrease in forecast quality with increasing M at any location or

    precipitation threshold considered. Similarly, when considering forecast lead times of 1,

    5, and 10 days separately (Figures 13-16), the quality of the MEFP-GEFS precipitation

    forecasts is relatively insensitive to M at most locations. However, at AB-CBNK1, where

    the forecast skill declines rapidly over the first week (Figure 4), there is a non-trivial

    sensitivity to M from 0-24 hours across a range of precipitation thresholds, particularly for

    the correlation coefficient, RME and CRPSS (Figure 13). This is evidenced by the range

    of verification scores for different scenarios of M. To further illustrate these sensitivities,

    Figure 17 shows the range of verification scores across all scenarios of M at selected


    forecast lead times. Figure 18 shows the equivalent range of scores for N. Clearly, the

    range of scores is not indicative of a systematic dependence of forecast quality on N or

    M (see above). However, it is indicative of a sensitivity to the amount of calibration data

    available. In general, AB-CBNK1 shows the greatest sensitivities to M and N, while CB-

    DRRC2 is only sensitive to N (and specifically to N=4, as indicated above).

    In order to illustrate the effects of N and M on the largest observed and forecast

    precipitation amounts, box plots were computed from the MEFP-GEFS precipitation

    forecasts. Figure 19 shows box plots of the forecast errors for each basin (in the rows)

    and for two scenarios of N (in the columns), namely N={24, 12}. The results are plotted

    against observed precipitation amount and are shown at a forecast lead time of 0-24

    hours. Figure 20 shows the corresponding results against forecast precipitation amount,

    specifically the ensemble mean forecast. Selected quantiles of the forecasting errors are

    plotted together with the median error and range (extreme residuals) as whiskers. The

    verifying observation is denoted by the zero-error line. Verification pairs for which the

    observation falls outside the ensemble range are denoted as misses. However, each

    forecast comprises only a small number of ensemble members (60). Thus, some misses

    should be expected, even if the forecasts are conditionally unbiased. Figures 21 and 22

    show box plots of the errors in the MEFP-GEFS precipitation forecasts for two scenarios

    of M, namely M={1, 3}, again ordered by observed and forecast precipitation amount,

    respectively. Here, each box represents one ensemble forecast from the period 0-24

    hours. As indicated in Figures 19 and 21, there is no systematic decline in forecast quality

    at N=12 or M=3 for the most extreme observed precipitation amounts. Rather, any

    differences between scenarios are consistent with sampling uncertainty. While there are

    some differences in the largest precipitation forecasts (by ensemble mean) for N=12

    (Figure 20) and M=3 (Figure 22), these differences are again consistent with sampling

    uncertainty and do not translate into additional skill for N=24 or M=1 (e.g. Figures 8-11).
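For reference, a sketch of how such error box plots can be assembled from the verification pairs, with quantiles of the member errors and a flag for misses (observation outside the ensemble range); the quantile levels and synthetic data are assumptions:

    import numpy as np

    def error_boxplot_stats(ensemble, obs):
        """Quantiles of the forecast errors (member minus observation) per
        pair, plus a flag for misses (observation outside the ensemble range)."""
        errors = ensemble - obs[:, None]
        quantiles = np.quantile(errors, [0.05, 0.25, 0.5, 0.75, 0.95], axis=1)
        misses = (obs < ensemble.min(axis=1)) | (obs > ensemble.max(axis=1))
        return quantiles, misses

    rng = np.random.default_rng(3)
    obs = rng.gamma(0.4, 6.0, 200)
    ens = obs[:, None] + rng.normal(0.0, 1.5, (200, 60))  # ~60-member forecasts
    q, miss = error_boxplot_stats(ens, obs)
    print(q.shape, int(miss.sum()), "misses out of", len(obs))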

    Figure 23 shows the quality of the temperature forecasts at selected thresholds for

    each scenario of N, while Figure 24 shows the corresponding results for each scenario of

    M. The results for N include both validation scenarios, namely dependent validation, i.e.

    N={24, 12, 8, 6, 4}, and cross-validation, i.e. N={12, 8, 6, 4} (see Table 3). The verification


metrics include the mean error of the ensemble mean forecast (°C), the correlation

    coefficient, BSS and CRPSS. The metrics were com