

HYDROLOGICAL PROCESSES
Hydrol. Process. 28, 6135–6150 (2014)
Published online 2 December 2013 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/hyp.10096

Diagnostic evaluation of multiple hypotheses of hydrological behaviour in a limits-of-acceptability framework for 24 UK catchments

G. Coxon,1* J. Freer,1 T. Wagener,2 N. A. Odoni1 and M. Clark3

1 School of Geographical Sciences, University of Bristol, Bristol, UK
2 Department of Civil Engineering, University of Bristol, Bristol, UK
3 Research Applications Laboratory, National Center for Atmospheric Research (NCAR), Boulder, Colorado, USA


Abstract:

Testing competing conceptual model hypotheses in hydrology is complicated by uncertainties from a wide range of sources, which result in multiple simulations that explain catchment behaviour. In this study, the limits of acceptability uncertainty analysis approach is used to discriminate between 78 competing hypotheses in the Framework for Understanding Structural Errors for 24 catchments in the UK. During model evaluation, we test the model’s ability to represent observed catchment dynamics and processes by defining key hydrologic signatures and time step-based metrics from the observed discharge time series. We explicitly account for uncertainty in the evaluation data by constructing uncertainty bounds from errors in the stage-discharge rating curve relationship. Our study revealed large differences in model performance both between catchments and depending on the type of diagnostic used to constrain the simulations. Model performance varied with catchment characteristics and was best in wet catchments with a simple rainfall-runoff relationship. The analysis showed that the value of different diagnostics in constraining catchment response and discriminating between competing conceptual hypotheses varies according to catchment characteristics. The information content held within water balance signatures was found to better capture catchment dynamics in chalk catchments, where catchment behaviour is predominantly controlled by seasonal and annual changes in rainfall, whereas the information content in the flow-duration curve and time-step performance metrics was able to better capture the dynamics of rainfall-driven catchments. We also investigate the effect of model structure on model performance and demonstrate its (in)significance in reproducing catchment dynamics for different catchments. Copyright © 2013 John Wiley & Sons, Ltd.

KEY WORDS model diagnostics; flexible modelling frameworks; signatures; limits of acceptability; discharge uncertainty

Received 2 July 2013; Accepted 24 October 2013

INTRODUCTION

Conceptual hydrological models are important tools for understanding and predicting the hydrologic behaviour of catchments. The selection of an appropriate model structure is central to this task, as our ability to accurately predict the hydrologic behaviour of catchments rests, in part, on the model structure reflecting the dominant processes occurring in the catchment. There has been a shift in the hydrological community from selecting a single model structure for a particular application to using multi-model ensembles (Velázquez et al., 2010; Gudmundsson et al., 2012), multiple model structures of increasing complexity (Farmer et al., 2003; Bai et al., 2009; Pushpalatha et al., 2011) and flexible modelling frameworks (Clark et al., 2008). Flexible model approaches that incorporate many different model structures or components (Clark et al., 2011a) have been used to explore model parameter, structural and data uncertainties (Wagener et al., 2001; Krueger et al., 2010; Smith and Marshall, 2010), assess the impact of model structure choice on the representation of different flow behaviour (Staudinger et al., 2011), identify controls on model structure according to catchment type (Lee et al., 2005) and provide guidance on using field data to discriminate between competing model hypotheses (Clark et al., 2011b; McMillan et al., 2011).

With multiple flexible hypotheses more readily employed within hydrological research, the task of designing effective model identification and evaluation techniques has become increasingly important. This is particularly crucial as we aim to better understand the links between model structure and catchment type by screening multiple model structures over many catchments (Wagener and McIntyre, 2012). This is a substantial challenge as model identification is tempered by the information, uncertainty and errors within the available data used to force and to evaluate hydrological models (McMillan et al., 2012a). In the pursuit of model structural adequacy (Gupta et al., 2012), we suggest it is important that our approaches to model identification are robust, discriminatory and ensure that models are ‘working for the right reasons’ (Kirchner, 2006). Moreover, it is essential that model evaluation techniques take into account the quality and information content of the data so that we are able to discriminate between structures but not beyond the uncertainties in the data.

Within hydrological research, model diagnostics are moving away from traditional aggregate statistical metrics, such as Nash–Sutcliffe and their associated limitations (Gupta et al., 2009), to the analysis of time step-based performance measures and the exploration of hydrologic signatures (Gupta et al., 2008; see for example, Yadav et al., 2007; Yilmaz et al., 2008; Krueger et al., 2010). These, if properly formulated, can scrutinize different model conceptual choices and test hypotheses about catchment functioning (Wagener et al., 2007; Bai et al., 2009) whilst taking into account observational uncertainties (Blazkova and Beven, 2009; Liu et al., 2009). Uncertainty in the discharge data has been shown to influence the calibration of hydrological models and their parameters (McMillan et al., 2010) and also constrain model identification (Krueger et al., 2010; Renard et al., 2010). Moreover, periods of disinformation may cause models to be incorrectly fitted to data or cause type II errors whereby a model that would be useful for prediction is rejected (Beven et al., 2011). Hence, we must not only develop methods that are able to evaluate different models and discriminate between them, but also recognize the limitations of our uncertain evaluation data and develop techniques to reflect this (Juston et al., 2013).

The generalized likelihood uncertainty estimation (GLUE) limits of acceptability approach provides a framework for discriminating between competing hypotheses that accounts for the observational errors (Beven, 2006). Previous studies include Westerberg et al. (2011b), who use uncertainties in the stage-discharge relationship and evaluation points along the flow-duration curve (FDC) to build their limits of acceptability for calibrating hydrological models; Blazkova and Beven (2009), who use the limits of acceptability approach for flood frequency estimation; Krueger et al. (2010), who evaluate model structure, parameter and data uncertainties in their limits of acceptability study; and Winsemius et al. (2009), who incorporate both hard and soft information into their limits of acceptability. However, these studies focused on the calibration of hydrological models against uncertain data for a limited number of model structures and catchments and point to the need to test these approaches for other catchments, models and types of signatures.

In this paper, we use the limits of acceptability uncertainty analysis approach to evaluate competing hypotheses of hydrological behaviour in multiple catchments for the first time. For the model evaluation, we test the model’s ability to represent observed catchment dynamics and processes by defining key hydrologic signatures and time step-based metrics from the observed discharge time series. We build on the previous studies that use the limits of acceptability approach by explicitly accounting for observational error in the observed discharge to set multiple model-independent benchmarks. We apply this methodology to 24 catchments in the UK and utilize the framework for understanding structural errors (FUSE) as the basis of the multiple hypotheses of catchment behaviour. The key objectives are to (1) compare overall model performance for different catchments, (2) assess the value of different diagnostics in constraining catchment response and discriminating between model structures and (3) give guidance about the suitability of different model structures for each catchment.

*Correspondence to: G. Coxon, School of Geographical Sciences, University of Bristol, Bristol, UK. E-mail: [email protected]

CATCHMENTS AND INPUT DATA

In total, 24 small catchments (100–300 km2) distributed across England and Wales were used in this study. These catchments were selected against two criteria. Firstly, the quality and coverage of the input and evaluation datasets was a primary consideration. Daily data of precipitation, potential evapotranspiration (PET) and discharge for an 11-year period from 01/01/1998 to 31/12/2008 were used for this study. Each catchment needed to have a >99% complete record for each dataset to be selected. The year 1998 was used as a warm-up period for the model; therefore, no model evaluation was quantified in this period. A description of how the data were compiled for the 24 catchments is given in Table I. Secondly, these catchments were chosen to represent a wide climatic and hydrologic diversity across the study area, thus covering a range of catchment characteristics including topography, geology and climate (Table II). This allows us to explore the representation of diverse hydrological processes against multiple model hypotheses. The locations, as well as the FDCs and relevant climatic and hydrologic characteristics, are presented in Figure 1 for the 24 catchments. The catchments are colour coded by their baseflow index (BFI) in all plots, as this catchment characteristic was found to be the most significant in revealing differences in model behaviour in comparison with the other catchment characteristic we tested, the catchment’s runoff coefficient. The BFI is a measure of the proportion of the river runoff that can be classified as baseflow and thus reflects the underlying geology. It is calculated by dividing the total volume of baseflow by the total flow for a specified period (for more information see Gustard et al., 1992).

Table I. Data compilation for the input and evaluation data used in the study

| Input data | Source product | Data provider | Data resolution (space) | Data resolution (time) | Time period available | Data compilation |
|---|---|---|---|---|---|---|
| Precipitation | CERF gridded observational precipitation data set (a) | Met Office | 1 km | Daily | Jan 1961 to Dec 2008 | Catchment areal precipitation and potential evapotranspiration were determined by averaging the values of all the grid squares that lay within the catchment boundaries |
| Potential evapotranspiration | MORECS (b) | Met Office | 40 km | Daily | Jan 1961 to Dec 2008 | As for precipitation |
| Discharge | UK National River Flow Archive | Centre for Ecology and Hydrology | Catchment outlet | Daily | Dependent on catchment | Converted from m3/s to mm/day |

(a) Keller et al. (2006). (b) Hough and Jones (1999).

Table II. Study catchment characteristics

| Gauge number | River | Station name | Q/P | P/PE | BFI (a) | Land use | Bedrock |
|---|---|---|---|---|---|---|---|
| 27088 | Calder | Mytholmroyd | 0.51 | 2.47 | 0.35 | Grassland | Moderate permeability |
| 28031 | Manifold | Ilam | 0.66 | 2.03 | 0.54 | Moorland | Moderate permeability |
| 28116 | Maun | Whitewater Bridge | 0.25 | 1.18 | 0.75 | Arable and woodland | High permeability |
| 30003 | Bain | Fulsby Lock | 0.31 | 1.12 | 0.59 | Arable | Mixed |
| 32006 | Nene/Kislingbury | Upton | 0.26 | 1.18 | 0.6 | Arable and grassland | Low permeability |
| 33033 | Hiz | Arlesey | 0.31 | 1.05 | 0.85 | Arable | High permeability |
| 36005 | Brett | Hadleigh | 0.30 | 1.07 | 0.47 | Arable | High permeability |
| 37007 | Wid | Writtle | 0.32 | 1.00 | 0.42 | Arable and grassland | Low permeability |
| 38003 | Mimram | Panshanger Park | 0.20 | 1.09 | 0.93 | Mixed | High permeability |
| 39030 | Gade | Croxley Green | 0.27 | 1.26 | 0.88 | Mixed | High permeability |
| 40012 | Darent | Hawley | 0.18 | 1.29 | 0.73 | Mixed | High permeability |
| 43006 | Nadder | Wilton | 0.47 | 1.56 | 0.82 | Arable | Mixed |
| 45012 | Creedy | Cowley | 0.43 | 1.55 | 0.46 | Arable and grassland | Low permeability |
| 53002 | Semington Brook | Semington | 0.47 | 1.28 | 0.58 | Arable and grassland | Mixed |
| 55013 | Arrow | Titley Mill | 0.56 | 2.07 | 0.55 | Grassland | Low permeability |
| 55026 | Wye | Ddol Farm | 0.77 | 3.55 | 0.37 | Grassland | Low permeability |
| 57007 | Taff | Fiddlers Elbow | 0.64 | 3.27 | 0.46 | Grassland | Mixed |
| 57015 | Taff | Merthyr Tydfil | 0.59 | 3.68 | 0.38 | Grassland | Low permeability |
| 60006 | Gwili | Glangwili | 0.80 | 2.97 | 0.47 | Grassland and woodland | Low permeability |
| 67006 | Alwen | Druid | 0.64 | 2.68 | 0.48 | Grassland and woodland | Low permeability |
| 69024 | Croal | Farnworth Weir | 0.49 | 2.15 | 0.4 | Grassland and urban | Moderate permeability |
| 73005 | Kent | Sedgwick | 0.81 | 3.56 | 0.42 | Grassland | Low permeability |
| 75004 | Cocker | Southwaite Bridge | 0.74 | 3.96 | 0.44 | Grassland | Low permeability |
| 76015 | Eamont | Pooley Bridge | 0.79 | 4.40 | 0.54 | Grassland | Low permeability |

The dominant land use and bedrock are shown for each catchment. P, precipitation; PE, potential evapotranspiration; Q, discharge; Q/P, runoff coefficient; BFI, baseflow index. (a) Source: Marsh and Hannaford (2008).
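As a hedged illustration of the unit conversion noted in Table I and of the ratios reported in Table II, the sketch below converts discharge from m3/s to mm/day and computes a ratio of totals such as Q/P or the BFI. The function names and the example catchment (area and flow) are hypothetical; the baseflow series needed for a BFI is assumed to come from a separate baseflow separation step that is not shown here.

```python
SECONDS_PER_DAY = 86_400


def m3s_to_mm_per_day(q_m3s, area_km2):
    """Convert a daily mean discharge (m3/s) to a catchment-average depth (mm/day)."""
    area_m2 = area_km2 * 1.0e6
    depth_m_per_day = q_m3s * SECONDS_PER_DAY / area_m2
    return depth_m_per_day * 1000.0


def volume_ratio(numerator_series, denominator_series):
    """Ratio of two totals over the same period, e.g. Q/P, P/PE or baseflow/total flow."""
    return sum(numerator_series) / sum(denominator_series)


# Hypothetical example: a 200 km2 catchment with a mean daily flow of 4.0 m3/s.
q_mm = m3s_to_mm_per_day(4.0, 200.0)  # runoff depth in mm/day
```

Expressing discharge as a depth puts it in the same units as precipitation and PET, so ratios such as Q/P can be formed directly from the daily series.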


METHODS

Model hypotheses

This study uses the rainfall-runoff model building toolkit FUSE as the basis of the competing hypotheses of hydrological behaviour (see Clark et al., 2008 for more details). Multiple flexible model structures are built from different combinations of the conceptual choices of four existing hydrological models: PRMS (Leavesley et al., 1983), Topmodel (Beven and Kirkby, 1979), Sacramento (Burnash, 1995) and ARNO/VIC (Zhao, 1984). All the models are lumped, conceptual models of similar complexity, thus allowing a comparison of competing hypotheses of hydrological behaviour. To ensure the study is both practical and comprehensive in terms of model space coverage, we limited the number of model hypotheses to 78 model structures and investigate different combinations of four conceptual choices: the upper soil layer, vertical drainage, surface runoff and lower layer architecture (Figure 2). In all cases, no interflow was computed for the models, evapotranspiration was computed by the ‘sequential’ scheme (Clark et al., 2008) and a gamma distribution was used to route runoff. As with all modelling toolboxes, we recognize that the toolbox does not provide a complete coverage of the entire range of model conceptualizations that are possible (in a similar way, we are not trying to represent different spatial complexity). However, we have decided to focus on the uncertainty in different representations of catchment runoff generating processes rather than routing in order to answer the aims of this study. We stress that the parameterization of the routing function does allow for sensitivities in timing and distribution to be taken into account.

Figure 1. An overview of the 24 study catchments. Catchments are ordered and colour coded according to their baseflow index and are referred to throughout the text by their gauge numbers. (a) Map of England and Wales showing the location of the 24 catchments. (b) Catchment ratios. (c) Flow-duration curves. (d) Box plots show the overall spread of three catchment characteristics for the 1074 catchments available, with the study catchments plotted on top. P, precipitation; PE, potential evapotranspiration and Q, discharge


For each of the 78 model structures, FUSE was run within a Monte-Carlo simulation framework using the Sobol sequence, whereby 10 000 parameter sets were sampled from a uniform prior distribution. Therefore, in total 780 000 combinations of model structure and parameter sets were generated per catchment. Given the wide range of catchment behaviours, sampling of the feasible parameter space for each model structure was ensured by using wide sampling ranges. These ranges were determined primarily by evaluating model parameter ranges in previous studies that have used FUSE (Clark et al., 2008; Clark et al., 2011b) and were then checked to ensure that they were appropriate for the UK catchments in this study.
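The sampling design above can be sketched in a few lines. This is a minimal illustration under stated assumptions and not the study's implementation: it uses a Halton sequence (another low-discrepancy sequence that is simple to write with the standard library) as a stand-in for the Sobol sequence, and the parameter names and ranges are hypothetical.

```python
FIRST_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47)


def van_der_corput(index, base):
    """Radical inverse of a positive integer index in the given base, in (0, 1)."""
    value, denominator = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denominator *= base
        value += digit / denominator
    return value


def halton_parameter_sets(bounds, n_sets):
    """Draw n_sets quasi-random parameter sets that cover uniform ranges.

    bounds maps a parameter name to its (lower, upper) sampling range.
    """
    if len(bounds) > len(FIRST_PRIMES):
        raise ValueError("add more primes to sample this many parameters")
    parameter_sets = []
    for i in range(1, n_sets + 1):
        parameter_sets.append({
            name: low + (high - low) * van_der_corput(i, FIRST_PRIMES[k])
            for k, (name, (low, high)) in enumerate(bounds.items())
        })
    return parameter_sets


# Hypothetical parameter names and ranges, for illustration only.
example_bounds = {"upper_storage_mm": (25.0, 500.0), "baseflow_rate_per_day": (0.001, 1.0)}
sets_per_structure = halton_parameter_sets(example_bounds, 10_000)
total_runs = 78 * len(sets_per_structure)  # structure/parameter combinations per catchment
```

Each parameter dimension uses a different prime base, so the points fill the sampling ranges more evenly than independent pseudo-random draws of the same size would.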

Diagnostics for model evaluation and identification

Model diagnostics here are focused on understanding the discrepancies between models to inform us on catchment functioning, rather than on model improvement (although model structural inadequacies will be highlighted by model rejection). By performing robust rejections from a range of model diagnostics, we can learn not only about catchment functioning but also about the information content of the diagnostics. Effective approaches to model identification and evaluation must focus on model diagnostics that require the model to reproduce the hydrologic behaviour of each of the catchments. In this study, we employ a hierarchical system of hydrologic signatures and time step-based metrics derived from the observed discharge time series. These diagnostics vary from decadal to daily timescales and are increasingly difficult for the model to represent (Figure 3).

Figure 2. Schematic diagram for the model structural choices available in the framework for understanding structural errors. Here parameters, w, fc and sat represent soil moisture at wilting point, field capacity and saturation, respectively

Hydrologic signatures. Three signatures are utilized to evaluate how well the model can capture the predominant flow behaviour over decadal, annual and monthly timescales. These hydrologic signatures build on signatures utilized in previous studies (Farmer et al., 2003; Bai et al., 2009) and assess the model’s ability to capture a hierarchy of catchment dynamics and behaviours. The mean flow over the 10-year period reflects the catchment’s water balance and tests the model’s ability to predict the average behaviour of the catchment (Figure 3a). The second signature, inter-annual variability, tests the model’s ability to reproduce flow response under different climatic conditions (Figure 3b). The range of wetness-index values (annual precipitation/annual PET) for the catchments varies from 0.5 to 2.3 for the 10 years of input data. The third signature, intra-annual variability, provides insights into the catchment’s response to seasonal changes and the model’s ability to represent flow behaviour within the year (Figure 3c).

The FDC describes the distribution of discharge values and provides insights into the flow regime of the catchment. For each simulation, the simulated daily FDC was compared with the observed FDC at ten selected evaluation points (Figure 3d). The selection of these evaluation points is an important consideration, as sufficient coverage over both high and low flows is required to ensure that the model is tested against a range of flows. These points were selected based on equal volumes of water calculated as the area below the FDC, which has been seen to give the best overall results in previous studies (Westerberg et al., 2011b). Volume increments of 10% were used to provide sufficient coverage of the FDC, with the maximum and minimum discharge values excluded. This resulted in a total of nine evaluation points from 10 to 90% volume. Because of the nature of the method utilized for selecting evaluation points, the shape of the FDC influences the spacing of the evaluation points, and for some catchments this meant that the low flows were not considered. Hence, to ensure that low-flow performance was considered, an additional point at 95% volume was also included. This approach means that the catchments were evaluated at different points along the FDC, ranging from <1–3% time of flow exceeded for the highest flows through to 61–90% time of flow exceeded for the lowest flows across the catchments. Hence, catchment-specific benchmarks were set for the models to pass, dependent on the catchments’ flow regime.
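One way to read the equal-volume selection described above is sketched below. It is a minimal illustration rather than the authors' code: it assumes that volume is accumulated from the high-flow end of the FDC (consistent with the 10% volume point falling at <1–3% time of flow exceeded) and that exceedance is estimated with a Weibull plotting position; the function name is hypothetical.

```python
from itertools import accumulate

VOLUME_FRACTIONS = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95)


def fdc_evaluation_points(flows, volume_fractions=VOLUME_FRACTIONS):
    """Return (volume fraction, % time exceeded, discharge) for each evaluation point."""
    ranked = sorted(flows, reverse=True)   # highest flow first
    cumulative = list(accumulate(ranked))  # running volume from the high-flow end
    total = cumulative[-1]
    n = len(ranked)
    points = []
    for fraction in volume_fractions:
        # First rank at which the accumulated volume reaches the target fraction.
        rank = next(i for i, volume in enumerate(cumulative) if volume >= fraction * total)
        exceedance = 100.0 * (rank + 1) / (n + 1)  # Weibull plotting position (assumed)
        points.append((fraction, exceedance, ranked[rank]))
    return points
```

Because the points are tied to accumulated volume rather than to fixed exceedance percentages, a flashy catchment and a baseflow-dominated catchment end up with evaluation points at different positions along their FDCs.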

Figure 3. Illustration of the hydrologic signatures, corresponding evaluation points and limits of acceptability used for the Maun at Whitewater Bridge catchment: (a) Mean flow, (b) Inter-annual variability, (c) Intra-annual variability, (d) Flow-duration curve and (e) Time step evaluation point metric

Time step evaluation point analysis. Time step-based metrics are utilized here as a means of evaluating the model’s ability to reproduce the timing and magnitude of daily runoff (Krueger et al., 2010). Hydrologic signatures based on individual time steps have often focussed on splitting the hydrograph into ‘hydrologically meaningful’ periods (Boyle et al., 2000; Freer et al., 2003) and evaluating how well the model performs during these periods. However, dividing the hydrograph in this manner is problematic when utilizing daily data; thus, to cover a range of flow behaviours and to focus on certain aspects of the catchment’s flow regime, the evaluation points from the FDC were used to define the time steps of interest. This approach therefore allows a comparison of how behavioural model selection differs when the evaluation points are assessed on the flow-duration curve itself and when they are assessed with the more robust test of the corresponding points in the time series. To define the time steps of interest, the discharge values within ±0.5% time of flow exceeded around each evaluation point on the FDC were located in the observed discharge time series. The models were then evaluated against these specific time steps. This resulted in the model being tested against a total of 366–485 time steps from the observed discharge time series (Figure 3e). Here, this diagnostic is referred to as the time step EP metric throughout the text to reflect the fact that we are not evaluating every time step as carried out in previous studies (Liu et al., 2009; Krueger et al., 2010).
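The selection of time steps around each evaluation point can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' code: exceedance is again estimated with a Weibull plotting position, ties are ranked in their order of occurrence, and the function names are hypothetical.

```python
def exceedance_percent(flows):
    """Percent of time each observed flow is equalled or exceeded (rank based)."""
    n = len(flows)
    order = sorted(range(n), key=lambda i: flows[i], reverse=True)
    exceedance = [0.0] * n
    for rank, index in enumerate(order, start=1):
        exceedance[index] = 100.0 * rank / (n + 1)
    return exceedance


def select_time_steps(flows, point_exceedances, half_width=0.5):
    """Indices of time steps within +/- half_width % exceedance of any evaluation point."""
    exceedance = exceedance_percent(flows)
    selected = set()
    for point in point_exceedances:
        selected.update(
            i for i, value in enumerate(exceedance) if abs(value - point) <= half_width
        )
    return sorted(selected)
```

With roughly 3650 daily values, a 1% wide window around each of ten evaluation points selects on the order of 365 time steps under these assumptions, which is of the same order as the 366–485 time steps reported above.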

Rating curve uncertainty

The starting point for the limits of acceptability approach is to assess the uncertainty in the observed data that is being used to evaluate model performance. This study focuses on the errors in discharge data resulting from uncertainty in the rating curve. Although other sources of uncertainty could have been evaluated (for example, uncertainty in the rainfall or PET data), efforts here are concentrated on the uncertainties we can best identify and characterize as part of the modelling process so we have confidence in our representation of these uncertainties. Numerous methods have been used to estimate the uncertainty in the stage-discharge relationship including log-log linear regressions (Liu et al., 2009), envelope curves (Krueger et al., 2010), Bayesian statistics (Juston et al., 2013) and fuzzy rating curve approaches (Pappenberger et al., 2006). Here, spot flow gauging data were used in the fuzzy linear regression approach of Westerberg et al. (2011a) to estimate the discharge uncertainty (Figure 4). We adopted a non-probabilistic framework as it was easily applied across the large number of catchments used in this study and due to a lack of knowledge of the errors affecting the stage-discharge relationship at each of these catchments.

Figure 4. Uncertain rating curve for the Whitewater Bridge gauging station in the Maun catchment. The inset figure shows the transformed data, with the fuzzy representation of the estimated uncertainty in the observed stage-discharge measurements. The main figure shows the rating curve with the stage-discharge gauged points and resulting uncertainty bounds

For each catchment, all available stage-discharge measurements were utilized and a visual inspection conducted, with any assumed erroneous data points that deviated greatly from the rating curve removed from the analysis. The flow gauging data were first transformed to obtain a linear relationship via a log transformation of the gauge-height data and a Box-Cox transform of the discharge. The fuzzy representation of the observed data was then constructed. The uncertainties in gauge height and discharge data were assumed to be ±5% and ±10%, respectively; these were estimated from the typical ranges of stage measurement and discharge uncertainty from UK specific ranges (Herschy, 1999) and also McMillan et al. (2012a). As a result of limited information at each site, conservative uncertainty estimates were chosen and catchments were only utilized if the rating curve provided a good fit to the stage-discharge measurements. The fuzzy rating curve was checked to ensure that it agreed with the official rating curve and the resulting uncertainty bounds were checked to ensure that they encompassed the uncertainty in the ratings well for each catchment. The official rating curve was then used for the best-estimate of discharge and the resultant estimated uncertainty bounds consequently define the minimum and maximum discharge intervals for given stages.
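The transformation step and the assumed measurement uncertainties can be illustrated with a short sketch. It covers only the log and Box-Cox transforms and the crisp ±5% and ±10% intervals around a single gauging; it does not implement the fuzzy linear regression of Westerberg et al. (2011a). The function names, the example gauging and the Box-Cox exponent are hypothetical, since the exponent used in the study is not stated here.

```python
import math


def box_cox(value, lam):
    """Box-Cox transform of a positive value; reduces to the natural log when lam is 0."""
    if value <= 0:
        raise ValueError("Box-Cox requires a positive value")
    if lam == 0:
        return math.log(value)
    return (value ** lam - 1.0) / lam


def transform_gauging(stage, discharge, lam):
    """Log-transform the gauge height and Box-Cox transform the discharge."""
    return math.log(stage), box_cox(discharge, lam)


def relative_interval(value, relative_uncertainty):
    """Lower and upper values for a symmetric relative uncertainty (0.05 means +/- 5%)."""
    return value * (1.0 - relative_uncertainty), value * (1.0 + relative_uncertainty)


# Hypothetical gauging: stage 0.8 m, discharge 2.0 m3/s, assumed exponent 0.3.
stage_bounds = relative_interval(0.8, 0.05)      # +/- 5% on gauge height
discharge_bounds = relative_interval(2.0, 0.10)  # +/- 10% on discharge
transformed = transform_gauging(0.8, 2.0, lam=0.3)
```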

Limits of acceptability

The true value of a hydrologic signature is not known exactly; thus, for each hydrologic signature, the estimated discharge uncertainty bounds are utilized as limits of acceptability (Figure 5). Here, we do not estimate additional commensurability error as a result of mapping the sub-daily stage-discharge measurements to daily data and instead include this within the uncertainties described in the previous text. The upper and lower limits of the observed discharge uncertainty interval are averaged over the period of interest to give upper and lower limits of acceptability for each hydrologic signature. Although we could deal with the errors in discharge data in a more robust approach (see for example McMillan et al., 2010), here we adopt a conservative ‘worst case’ approach to the error characterization over longer periods due to a lack of information about the nature of the errors and their non-stationarity in time and conditions at each site. An example is the longer term biases in the flow series due to unobserved subsurface fluxes that we cannot adequately represent.

In order to evaluate how well the model is performing, a scaled score (S) is calculated to define the deviation of the simulated hydrograph from the observed hydrograph (Figure 5). The scaled score was calculated relative to the interval width from the limits of acceptability for both the signatures (1) and at each evaluated time step (2):

$$
S(sig) =
\begin{cases}
\dfrac{Q_{sim,sig} - Q_{obs,sig}}{Q_{obs,sig} - Q_{min,sig}}, & Q_{sim,sig} < Q_{obs,sig} \\[2ex]
\dfrac{Q_{sim,sig} - Q_{obs,sig}}{Q_{max,sig} - Q_{obs,sig}}, & Q_{sim,sig} \geq Q_{obs,sig}
\end{cases}
\tag{1}
$$

$$
S(t) =
\begin{cases}
\dfrac{Q_{sim,t} - Q_{obs,t}}{Q_{obs,t} - Q_{min,t}}, & Q_{sim,t} < Q_{obs,t} \\[2ex]
\dfrac{Q_{sim,t} - Q_{obs,t}}{Q_{max,t} - Q_{obs,t}}, & Q_{sim,t} \geq Q_{obs,t}
\end{cases}
\tag{2}
$$
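As a minimal sketch of Equations (1) and (2), the scaled score for a single signature or time step could be computed as follows. The function name `scaled_score` and the numerical values are hypothetical and chosen only for illustration; the sketch also assumes that the uncertainty bounds strictly bracket the best-estimate discharge.

```python
def scaled_score(q_sim, q_obs, q_min, q_max):
    """Scaled score S of Equations (1) and (2) for one signature or time step.

    Negative values indicate under-prediction, positive values over-prediction,
    and |S| <= 1 means the simulation lies inside the limits of acceptability.
    """
    if not (q_min < q_obs < q_max):
        # Assumption: the bounds strictly bracket the best-estimate discharge.
        raise ValueError("expected q_min < q_obs < q_max")
    if q_sim < q_obs:
        # Below the observation: scale by the lower half of the interval.
        return (q_sim - q_obs) / (q_obs - q_min)
    # At or above the observation: scale by the upper half of the interval.
    return (q_sim - q_obs) / (q_max - q_obs)


# Illustrative values only; they are not taken from the study.
q_obs, q_min, q_max = 10.0, 8.0, 13.0
example_scores = [scaled_score(q, q_obs, q_min, q_max) for q in (9.0, 13.0, 14.5)]
print(example_scores)
```

With these illustrative values, a simulation of 9.0 sits halfway to the lower limit (score -0.5), 13.0 sits exactly on the upper limit (score 1.0) and 14.5 lies outside the limits (score 1.5). Because the two halves of the interval are scaled separately, asymmetric uncertainty bounds are handled without further adjustment.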


Figure 5. Illustration of the method used to construct the limits of acceptability for each time step: (1) find discharge on rating curve and obtain bounds (inset figure), (2) repeat for every time step and transfer discharge interval to observed time series and (3) calculate score relative to the interval width from the limits of acceptability

In Equations (1) and (2), Qsim is the simulated discharge, Qobs is the observed discharge and Qmin and Qmax are the lower and upper uncertainty bounds, respectively, for the signature (sig) or time step (t). For the water balance and FDC signatures, the simulation needed to be within bounds for all the limits of acceptability for the model to be accepted as behavioural. For the time step EP metric, the focus was on how consistently the model fits the observations within bounds; hence, the mean absolute score over all the time steps needed to be below one for a model to be accepted as behavioural. These scaled scores form the basis of the triangular weighting function (Liu et al., 2009; Krueger et al., 2010). All models within the limits of acceptability are then assigned a likelihood weight (L), given the vector of observations (Qobs):

$$
L\left[R_j(M(\theta)) \mid Q_{obs}\right] = \frac{S_{mean}^{-1}\, L\left[R_j(M(\theta))\right]}{\sum_{j=1}^{J} S_{mean}^{-1}\, L\left[R_j(M(\theta))\right]} \tag{3}
$$
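The acceptance rules and the weighting of Equation (3) can be sketched as below. This is one possible reading, not the authors' implementation: it assumes that "within bounds" means an absolute scaled score of at most one, that the weight of each behavioural simulation is proportional to a prior weight divided by its mean absolute scaled score, and that the prior is uniform unless supplied. All function names and the small `eps` guard against division by zero are hypothetical.

```python
def is_behavioural_signatures(scores):
    """Signature rule: every scaled score must lie within the limits (|S| <= 1)."""
    return all(abs(s) <= 1.0 for s in scores)


def is_behavioural_time_steps(scores):
    """Time step EP rule: the mean absolute scaled score must be below one."""
    return sum(abs(s) for s in scores) / len(scores) < 1.0


def likelihood_weights(mean_abs_scores, prior=None, eps=1e-9):
    """Normalised weights proportional to prior / mean absolute score.

    Assumption: a uniform prior is used when none is given, and eps guards
    against division by zero for a simulation that matches perfectly.
    """
    if prior is None:
        prior = [1.0] * len(mean_abs_scores)
    raw = [p / max(s, eps) for p, s in zip(prior, mean_abs_scores)]
    total = sum(raw)
    return [r / total for r in raw]


# Illustrative scores for one simulation: one value lies outside the limits.
scores = [0.5, -1.6, 0.3]
print(is_behavioural_signatures(scores))   # rejected by the signature rule
print(is_behavioural_time_steps(scores))   # accepted by the time step rule

# Illustrative mean absolute scores for three behavioural simulations.
weights = likelihood_weights([0.2, 0.4, 0.8])
print(weights)
```

The example highlights the difference between the two rules: a single exceedance rejects a simulation under the signature rule, whereas the time step rule tolerates it as long as the average deviation stays inside the limits. Under the assumed weighting, simulations that sit closer to the observations on average receive proportionally more weight, and the weights sum to one.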

In Equation (3), Rj is an accepted model simulation, dependent on a particular model (M) and parameter set (θ). The independent hydrologic signatures are combined in the GLUE framework to give guidance about the suitability of the different model structures and parameter sets for each catchment.

To address the ability of the diagnostics to replicate the observed discharge, the 5th–95th percentile simulation bounds from the likelihood-weighted distribution of the behavioural simulations, for each time step and each diagnostic, were evaluated against two metrics. Firstly, the overlap between the observed and simulated uncertainty bounds was calculated (Reliability) (Westerberg et al., 2011b). Secondly, in order to account for the fact that reliability may be high simply because the simulation bounds are wide, the width of the overlapping range between the observed and simulated uncertainty bounds was calculated as a percentage of the width of the simulated bounds for each time step, and the average for the whole time series was then used as a Precision measure (Guerrero et al., 2013).
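A minimal sketch of the two measures follows. It assumes that Reliability is the percentage of time steps at which the observed and simulated bounds overlap at all (the source cites Westerberg et al., 2011b for the exact definition, which may differ in detail) and that Precision is the overlap width divided by the simulated width, averaged over all time steps. The function and argument names are hypothetical.

```python
def reliability_precision(obs_lo, obs_hi, sim_lo, sim_hi):
    """Return (reliability, precision) in percent for paired bound series.

    reliability: share of time steps where the two intervals overlap (assumed).
    precision:   mean overlap width as a share of the simulated bound width.
    """
    n = len(obs_lo)
    overlapping_steps = 0
    ratios = []
    for ol, oh, sl, sh in zip(obs_lo, obs_hi, sim_lo, sim_hi):
        overlap = min(oh, sh) - max(ol, sl)   # negative when intervals are disjoint
        if overlap >= 0.0:
            overlapping_steps += 1            # touching intervals count as overlap
        width = sh - sl
        ratios.append(max(overlap, 0.0) / width if width > 0.0 else 0.0)
    return 100.0 * overlapping_steps / n, 100.0 * sum(ratios) / n


# Illustrative three-step series; the third step has no overlap.
obs_lo, obs_hi = [8.0, 4.0, 1.0], [12.0, 6.0, 2.0]
sim_lo, sim_hi = [10.0, 3.0, 3.0], [14.0, 7.0, 5.0]
reliability, precision = reliability_precision(obs_lo, obs_hi, sim_lo, sim_hi)
print(round(reliability, 1), round(precision, 1))
```

In this illustration, two of the three time steps overlap, and in each of those only half of the simulated interval is shared with the observed interval. Widening the simulated bounds would eventually raise the reliability while lowering the precision, which is the trade-off the second measure is intended to expose.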

RESULTS

Performance of total model ensemble

Table III summarizes the percentage of behavioural model simulations found for each catchment and diagnostic. Clearly, there is a large difference in model performance between both the catchments and the different diagnostics. As expected, the largest percentage of behavioural model simulations was found for the mean flow signature, with the number of behavioural simulations ranging from 4.8% to 96% of the 780 000 simulations generated for each catchment. Generally, it was easier for the models to replicate the inter-annual variability than the intra-annual variability. It is also interesting to note that no behavioural model simulations were found for three catchments for the inter-annual signature and one catchment for the intra-annual signature, whereas behavioural simulations were found for all the catchments for the FDC signature. The time step evaluation point (EP) metric was found to be much harder to reproduce, with a significantly reduced number of behavioural simulations for each of the catchments and 12 out of the 24 with no behavioural model simulations. It is interesting to note the inconsistency between assessing points on the FDC and assessing these points in time; even if a simulation is considered behavioural according to EPs on the FDC, it does not mean it is good at these FDC levels when assessed in time. Regarding the model performance between catchments, it was found that the wet catchments with a low BFI had the largest number of behavioural models across all diagnostics, whereas the chalk catchments with a low runoff coefficient had the lowest number of behavioural model simulations.

Table III. Overall model results

| Gauge number | Mean flow (%) | Inter-annual (%) | Intra-annual (%) | All water balance (%) | Flow duration curve (%) | Time step EP metric (%) |
|---|---|---|---|---|---|---|
| 27088 | 53.355 | 3.582 | 3.068 | 0.748 | 0.529 | <0.001 |
| 28031 | 57.864 | 25.527 | 0.360 | 0.272 | 0.093 | 0 |
| 28116 | 16.381 | 0.120 | 0.505 | 0.067 | 0.216 | 0.004 |
| 30003 | 21.614 | 0.759 | 0.001 | 0.001 | 0.339 | 0 |
| 32006 | 14.428 | 0 | 0.009 | 0 | 0.027 | 0 |
| 33033 | 22.303 | 0.055 | 0.066 | 0.007 | 0.292 | 0.03 |
| 36005 | 33.594 | 4.897 | 0.148 | 0.121 | 0.095 | 0 |
| 37007 | 26.296 | 0.311 | 0.069 | 0.033 | 0.011 | 0 |
| 38003 | 10.503 | 0.002 | 0.312 | 0.001 | 0.175 | 0.131 |
| 39030 | 13.902 | 0.009 | 0.389 | 0.008 | 0.272 | 0.227 |
| 40012 | 4.839 | 0 | 0.004 | 0 | 0.114 | 0 |
| 43006 | 43.044 | 2.129 | 0.078 | 0.044 | 5.159 | 0.008 |
| 45012 | 40.379 | 6.025 | 0.046 | 0.027 | 0.057 | 0 |
| 53002 | 42.878 | 11.121 | 0.017 | 0.011 | 0.002 | 0 |
| 55013 | 66.552 | 16.332 | 0.368 | 0.307 | 0.749 | 0.001 |
| 55026 | 94.503 | 76.099 | 9.875 | 9.429 | 6.315 | 0 |
| 57007 | 84.477 | 21.742 | 8.243 | 5.057 | 0.317 | 0 |
| 57015 | 20.568 | 0.270 | 0.217 | 0.020 | 0.003 | 0 |
| 60006 | 87.043 | 36.535 | 11.398 | 10.663 | 0.468 | 0.003 |
| 67006 | 89.715 | 50.633 | 7.075 | 5.993 | 12.163 | 0.115 |
| 69024 | 45.493 | 0.328 | 1.105 | 0.053 | 0.084 | 0 |
| 73005 | 90.843 | 67.808 | 15.563 | 15.202 | 0.446 | 0.018 |
| 75004 | 96.312 | 83.741 | 33.735 | 31.789 | 19.111 | 4.073 |
| 76015 | 95.644 | 61.765 | 17.133 | 15.464 | 14.691 | 2.050 |

Total number of behavioural model simulations for each catchment and diagnostic metric given as a percentage of the total number of model simulations run for each catchment (780 000 in total). Mean flow, inter-annual, intra-annual and all water balance are the water balance signatures. EP, evaluation point.
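Because Table III reports percentages of a fixed total of 780 000 simulations per catchment, the entries can be converted to approximate simulation counts and back. The helper names below are hypothetical, and counts recovered from the table are only approximate because the percentages are rounded to three decimal places.

```python
TOTAL_SIMULATIONS = 780_000  # simulations per catchment, from the Table III note


def percent_to_count(percent, total=TOTAL_SIMULATIONS):
    """Approximate number of simulations for a tabulated percentage."""
    return round(percent / 100.0 * total)


def count_to_percent(count, total=TOTAL_SIMULATIONS):
    """Percentage rounded to three decimals, matching the table precision."""
    return round(100.0 * count / total, 3)


# Catchment 57015: tabulated percentages for the grouped water balance
# signatures and the FDC signature, and the counts quoted in the text.
print(percent_to_count(0.020), percent_to_count(0.003))
print(count_to_percent(159), count_to_percent(27))
```

For catchment 57015, the counts of 159 and 27 behavioural simulations quoted in the text for the grouped water balance and FDC signatures round to the tabulated 0.020% and 0.003%, so the text and table agree to within the rounding of the table.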

Behavioural ensembles for different diagnostics

To gain an idea of how the behavioural ensemble for each diagnostic is able to capture catchment dynamics, the reliability and precision of the 90% simulation limits for the different diagnostics are summarized in Table IV, and the resultant 5th and 95th percentile simulation bounds for behavioural models of each diagnostic are presented in Figure 6. The three signatures for water balance are grouped together so that only models that are behavioural for all three signatures are used. There are distinct differences between the catchments and in the value of the different diagnostics in identifying well-behaved models that have realistic dynamics compared with the observed catchment behaviour. For catchment 39030, which is dominated by baseflow and has low flow variability (Figure 6e), the behavioural simulations from the grouped water-balance signatures are seen to better constrain the observed discharge and capture the overall dynamics of the catchment compared with the FDC and time step EP metric. This is also shown in Table IV, whereby although the reliability is slightly lower for the water balance signature (94%), the precision of the 90% simulation limits is high (70%) in comparison with the FDC signature (40%) or time step EP metric (60%). This is also observed in catchment 43006, with the grouped water balance signatures able to better reproduce catchment dynamics. In contrast, the behavioural simulations for catchment 75004 from the grouped water balance and FDC signatures envelop the observed discharge well, but the ensemble of model simulations for the time step EP metric better constrains the observed discharge, particularly in the recession periods. It can be seen that for the grouped water-balance signatures, the peaks are consistently over-estimated and the uncertainty in the recession periods is much wider. As expected, the time step EP metric is able to better capture the timing of the peaks in comparison with the other two signatures. For catchment 57015, no behavioural simulations were found for the time step EP metric and very few for the grouped water-balance and FDC signatures (159 and 27, respectively). As found in catchment 75004, the ensemble of behavioural models from the FDC signature is better at reproducing both the peaks and the recession periods.

Table IV. Reliability and precision measures for the different metrics and all catchments

| Gauge number | Water balance reliability (%) | Water balance precision (%) | Flow duration curve reliability (%) | Flow duration curve precision (%) | Time step EP metric reliability (%) | Time step EP metric precision (%) |
|---|---|---|---|---|---|---|
| 27088 | 96 | 36 | 96 | 31 | 65 | 62 |
| 28031 | 84 | 28 | 92 | 28 | - | - |
| 28116 | 97 | 35 | 96 | 49 | 99 | 64 |
| 30003 | 65 | 39 | 98 | 22 | - | - |
| 32006 | - | - | 95 | 23 | - | - |
| 33033 | 83 | 49 | 98 | 54 | 96 | 61 |
| 36005 | 95 | 40 | 98 | 36 | - | - |
| 37007 | 74 | 20 | 93 | 23 | - | - |
| 38003 | 99 | 57 | 100 | 40 | 99 | 61 |
| 39030 | 94 | 70 | 98 | 40 | 99 | 62 |
| 40012 | - | - | 84 | 21 | - | - |
| 43006 | 75 | 45 | 99 | 34 | 98 | 61 |
| 45012 | 93 | 29 | 96 | 21 | - | - |
| 53002 | 82 | 25 | 91 | 38 | - | - |
| 55013 | 92 | 26 | 99 | 24 | 83 | 52 |
| 55026 | 92 | 31 | 89 | 30 | - | - |
| 57007 | 92 | 28 | 96 | 33 | - | - |
| 57015 | 79 | 23 | 85 | 29 | - | - |
| 60006 | 97 | 32 | 93 | 34 | 87 | 63 |
| 67006 | 97 | 37 | 98 | 33 | 96 | 53 |
| 69024 | 87 | 25 | 89 | 26 | - | - |
| 73005 | 96 | 32 | 95 | 36 | 94 | 58 |
| 75004 | 100 | 37 | 100 | 42 | 99 | 58 |
| 76015 | 96 | 31 | 99 | 34 | 98 | 55 |

Figure 6. Observed discharge and uncertainty bounds for the behavioural simulations (5% and 95% percentiles of the likelihood-weighted simulated discharge) for all hydrologic signatures. The plots show a 1-year period for five different catchments. Catchments are ordered from low baseflow index (top plot) to high baseflow index (bottom plot)

Hydrol. Process. 28, 6135–6150 (2014)

Page 11: Diagnostic evaluation of multiple hypotheses of

6145DIAGNOSTIC EVALUATION OF MULTIPLE HYPOTHESES OF HYDROLOGICAL BEHAVIOUR

Model structure identifiability

This section considers how well the diagnostics are able to discriminate between models, and the linkages between model performance, model structural choice and catchment characteristics. Figure 7 shows the relationship between the mean number of behavioural model simulations per model structure and the standard deviation of GLUE weights for each catchment. The GLUE weights were summed for all behavioural parameter sets for each model structure and then the standard deviation was found. A low standard deviation in the weights indicates an equal contribution of each model structure to the behavioural model ensemble, whereas a high standard deviation in the weights indicates that there is an unequal contribution and one or two model structures are dominating the behavioural model ensemble. The overall trend displayed in Figure 7, whereby catchments that have a lot of behavioural simulations tend to have a large number of model structures contributing to the ensemble, and catchments that have a smaller number of behavioural simulations tend to favour one or two model structures, is consistent across the three signatures. However, the ability of the diagnostics to discriminate between model structures is markedly different, highlighting the value of diagnostics in model identification generally and for different catchments. Generally, it can be seen that the chalk catchments (high BFI) have the lowest mean number of behavioural model simulations per model structure and the catchments on impermeable soils have the highest, with the exception of some low BFI catchments for the time step EP metric. The inverse is seen for the standard deviation of GLUE weights. Hence, catchments with a low groundwater component are generally easier to model, with many different model structures able to simulate the observed dynamics. In contrast, chalk catchments are more challenging to model and it appears that only certain model structures work well in these catchments, although this result does vary by diagnostic. These differences are likely a result of the simple rainfall-runoff relationship present in the catchments that have a low groundwater component, and hence, the model parameters are able to account for differences in model structure. In contrast, chalk catchments have a much more complex rainfall-runoff relationship and so the choice of model structure becomes more important. A similar relationship is seen for the behavioural models from the FDC signature, although there is more spread and overall the signature is less discriminative than the grouped water-balance signatures. The same pattern is not observed for the time step EP metric, which is most discriminative for some of the low BFI catchments.

Figure 7. Comparison of the average number of behavioural simulations across all model structures and the standard deviation of generalized likelihood uncertainty estimation weights between the model structures for each catchment. The circles indicate the catchment and are coloured according to the catchment’s baseflow index. The size of the circle is scaled according to the number of model structures that have behavioural simulations

Figure 8 focuses on the links between model conceptual choices and model performance in the individual catchments. Results demonstrate the (in)significance of model structure choice for different types of catchments. The choice of lower layer for the model is seen to be significant for some of the catchments and this varies by catchment characteristic. Across all three diagnostics, the lower layer choice 2.4 (Figure 2) is seen to perform well in the chalk catchments. Here, the baseflow is represented by a parabolic function, whereby small changes in the saturated zone result in large changes in baseflow. The large contributions of baseflow modelled by this particular model conceptual choice could explain why this particular model performed well in the catchments with a large groundwater component. Conversely, it is very difficult to discern any differences in the choice of model structure for some of the catchments, with most of the model structures being accepted as equally plausible (see for example catchments 76015, 55026 or 73005). This is unsurprising, given the number of model simulations that were accepted as behavioural, and suggests that the model parameters are able to compensate for differences in model structure. It is also important to note that for particular model components, there appears to be no relationship between model performance and catchment characteristics or indeed any difference in model performance between the different model structural choices (particularly the percolation options).

Figure 8. The mean generalized likelihood uncertainty estimation likelihood weight for each model structural choice for all catchments. Boxes along the x-axis refer to different model structural choices: UL, upper layer; LL, lower layer; VD, vertical drainage and SR, surface runoff. The numbers relate to the model structural choices as illustrated in Figure 2. Any model structures that gained no behavioural model simulations are shown in white
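The statistic plotted in Figure 7, in which GLUE weights are summed per model structure before the standard deviation is taken, can be sketched as follows. This is a minimal illustration under stated assumptions: the population standard deviation is used, and structures with no behavioural simulations are counted with a summed weight of zero. The source does not say which convention was applied, and the function name and structure labels are hypothetical.

```python
from collections import defaultdict
from statistics import pstdev


def structure_weight_spread(structure_ids, weights, all_structures):
    """Sum weights per model structure and return (summed weights, std dev).

    structure_ids: structure label of each behavioural simulation
    weights:       likelihood weight of each behavioural simulation
    all_structures: every structure tested, so absent ones count as zero (assumed)
    """
    totals = defaultdict(float)
    for sid, w in zip(structure_ids, weights):
        totals[sid] += w
    summed = [totals.get(s, 0.0) for s in all_structures]
    return summed, pstdev(summed)


structures = ["A", "B", "C"]  # hypothetical structure labels

# Equal contribution: each structure carries one third of the total weight.
_, spread_equal = structure_weight_spread(["A", "B", "C"], [1 / 3, 1 / 3, 1 / 3], structures)

# Dominated ensemble: structure A carries most of the weight, C has none.
summed_dom, spread_dominated = structure_weight_spread(["A", "A", "B"], [0.5, 0.4, 0.1], structures)

print(spread_equal, spread_dominated)
```

An ensemble to which every structure contributes equally gives a standard deviation of zero, whereas an ensemble dominated by one structure gives a clearly larger value. This is the contrast the text draws between catchments with a low groundwater component and the chalk catchments.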

DISCUSSION

Which diagnostics are useful for model identification and evaluation?

An interesting result here is that the value of signatures in constraining model predictions varies between catchments and that the characteristics of those catchments in part control this. For chalk catchments, such as catchments 39030 and 33033, the dynamics of the observed discharge are dominated by seasonal and climatic changes rather than individual rainfall events. This means that the information content held within the intra-annual and inter-annual signatures better describes catchment dynamics and thus the resulting ensemble of behavioural model simulations is able to better replicate flow in comparison with the FDC signature and time step EP metric (Figure 6). In contrast, wet catchments in Wales and the north-west of England with a low groundwater component, such as catchments 57015 and 75004, tend to be rainfall-driven catchments that respond much more quickly at a daily time scale. Hence, the FDC signature and time step EP metric, which focus diagnostics at the daily time scale, are able to better constrain the observed discharge, particularly in the peaks and recession periods. The information content of the diagnostic and its ability to constrain catchment response is therefore dependent on the catchment characteristics. Similarly, the effectiveness of model identification is also dependent on the type of diagnostic used. The grouped water balance signatures were most effective at discriminating between models for chalk catchments, but not as effective for catchments with a small baseflow component. The FDC signature enveloped the observed discharge well for all catchments and links between catchment characteristics and model structure were identifiable, particularly for the lower layer storage component of the model structure. However, overall, the FDC was found to be the least effective for model identification and evaluation in comparison with the other signatures utilized here; behavioural models were found for all the catchments (even when some of these models would be considered unacceptable compared with other metrics) and the FDC tended to accept the widest range of model structures. This may have been a result of limitations in the selection of EPs utilized here, particularly in the chalk catchments where often the percentage of time that flow was exceeded for the highest flows would only be 3% due to the shape of the FDC. If additional EPs were considered, this could have helped to constrain the flow better as seen in other studies (Westerberg et al., 2011b); however, this was analysed for the catchments where these additional EPs were thought to make the most difference and it was found that the overall effect was negligible.

On the basis of the model results presented here, different diagnostics will have different levels of information content depending on the catchment of interest and its characteristics. Hence, we suggest that the use of diagnostics to calibrate or discriminate between hydrological models must be carefully considered depending on the aims of the modelling study, the catchment(s) of interest and the model structures (Gupta et al., 2008). Additionally, when modelling many catchments with very different characteristics, a suite of model-independent benchmarks should be considered that contain varying levels of information content to ensure the catchment dynamics are well constrained for all catchments. If diagnostics are used in isolation, then in-depth posterior analysis (as used in Westerberg et al., 2011b) should be carried out to ensure that the simulations are well constrained and also to investigate the effect of epistemic data errors on the diagnostics.

Where do we draw the line in terms of model rejection?

We rejected all model simulations for the time step EP metric and grouped water balance signatures for 15 and 3 of the catchments, respectively. If the range of model structures and parameters fails to capture the water balance, as presented here, this suggests that the ‘model failures’ could be caused by model inadequacy (Gupta et al., 2012). The absence of behavioural models for two of the catchments could be a result of missing processes and structures in the model. As these catchments both have a high baseflow component, it suggests that processes missing in the model could relate to a threshold behaviour-based routing component or the ability for water loss to deep groundwater. Moreover, these models are necessarily simplistic and lumped, and conceptually do not include specific groundwater components or additional fast and slow flow paths that might have improved the model performance. In contrast, the 12 catchments that gained no behavioural model simulations for the time step EP metric suggest that we need to be careful about rejecting models for the wrong reasons. It could be considered that the time step-based analyses are overly harsh on the set of model structures, as these measures are more prone to the influence of observational errors and disinformative data periods (Beven and Westerberg, 2011) compared with signatures that require the model to represent certain aspects of dynamic catchment response behaviour, particularly as we have not considered input uncertainty in this study. For example, the behavioural model ensemble exhibited for catchment 57015 for the FDC signature actually captured the observed flow reasonably well (Figure 6a) and by traditional metrics, such as Nash–Sutcliffe, these models would be considered to have performed well (maximum Nash–Sutcliffe of 0.80). With the exception of catchment 27088, no behavioural models were found for the time step EP analysis for catchments with a low BFI, despite good performances for the hydrologic signatures. Because of the quick response of these catchments, as characterized by their low BFI, these catchments are affected more heavily by sub-daily processes. Hence, the poor performance for the time step EP analysis for these catchments could be due to the use of daily data in this study and the resultant timing errors in the simulated hydrographs, particularly for peak flows.
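For reference, the Nash–Sutcliffe efficiency mentioned above is commonly defined as one minus the ratio of the squared simulation error to the variance of the observations about their mean. The sketch below uses that standard definition; how the efficiency was computed in the study (for example, over which period) is not stated, so the function name and the numbers are illustrative only.

```python
def nash_sutcliffe(q_obs, q_sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(q_obs) / len(q_obs)
    residual = sum((o - s) ** 2 for o, s in zip(q_obs, q_sim))
    variance = sum((o - mean_obs) ** 2 for o in q_obs)
    return 1.0 - residual / variance


# Illustrative discharge series; not data from the study.
q_obs = [1.0, 2.0, 3.0, 4.0]
q_sim = [1.1, 1.9, 3.2, 3.8]
nse = nash_sutcliffe(q_obs, q_sim)
print(round(nse, 3))
```

Because this efficiency aggregates squared errors over the whole series, a simulation can score highly while still falling outside the discharge uncertainty bounds at many individual time steps. That is consistent with the contrast drawn above between the Nash–Sutcliffe values and the time step EP results.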

Model diagnostics and uncertainty frameworks

In the context of new techniques for model identification, the results reveal that this approach was able to identify components of model structures that dominated in particular catchments but also highlight the insignificance of model-structural choice in particular catchments. Although a wide range of model structures can capture the dynamics of wet catchments that have a simple rainfall-runoff relationship, for drier catchments with a more complex rainfall-runoff relationship, only certain model-structural components were able to capture the dynamics. Hence, the importance of selecting an appropriate model structure varies by catchment, and in some catchments, the model-structural choice is relatively unimportant in comparison with the selection of model parameters. However, there is still considerable ambiguity in the links between model structure, model performance and catchment characteristics. Similar performances between different model structures (e.g. Clark et al., 2008; Buytaert and Beven, 2011) or an absence of a clear link between model performance and model structural choice (Gudmundsson et al., 2012) are certainly not uncommon. Yet, although we do not expect every model component to be identifiable with the available evaluation data or necessarily linked to catchment characteristics given the simple lumped model structures and the lack of input uncertainty utilized in this study, it does raise questions about the amount and type of data that are needed to effectively discriminate between competing hypotheses. More incisive diagnostic testing that is not solely based on the discharge data (Freer et al., 2004; Son and Sivapalan, 2007; Winsemius et al., 2009; McMillan et al., 2011; McMillan et al., 2012b) or further in-depth analysis of different aspects of the hydrographs was beyond the scope of this paper, and this certainly becomes much more problematic when attempting to evaluate many catchments as suggested by a comparative framework (Sivapalan, 2009).

In this study, we have explicitly included the observational uncertainties in discharge and so our techniques have benchmarked the quality of the data. We suggest this is critical when making assessments between catchments and if using different types of data (for a more detailed treatment of these issues see McMillan et al., 2012a, b). However, it is important to point out the limitations of the approach utilized to estimate the stage-discharge rating curve uncertainty and the resultant limits of acceptability and its impacts on the results. Differences in model performance between catchments regionally are, in part, due to the method used to calculate the stage-discharge rating curve uncertainty and the resultant limits. For example, the good model performance found in wet catchments is, in part, a result of the wider limits of acceptability set in these areas as a result of higher daily discharge rates and thus wider uncertainty bounds. These wider limits are easier for the model to fit, as it is allowed to make larger absolute errors and still be accepted as behavioural.

There has been a continued debate within the hydrological community relating to the most suitable approach for hypothesis testing (Clark et al., 2011a; Beven et al., 2012). As with all approaches to hypothesis testing, the limits of acceptability approach adopted in this study is not without its limitations and other approaches could have been adopted, such as the use of Bayesian statistics (Kavetski and Fenicia, 2011; Euser et al., 2013). However, the need for robust hypothesis testing given the real information content of the data meant that for this study we adopted the limits of acceptability approach to hypothesis testing. Here, the resulting behavioural models for each catchment and diagnostic act as a guide to the dominant processes operating in the catchment, and further posterior analysis or incisive diagnostic testing would be needed to conclude this definitively, alongside this type of analysis over many more catchments. By evaluating our model structures within a rejectionist framework, the results are constrained by the fact that any non-behavioural models are implausible hypotheses within the limits defined here; however, it does allow us to consider model structural inadequacies or disinformative data through the rejection of models as discussed in the previous text (Beven et al., 2012).

CONCLUSIONS

This paper presents for the first time a diagnostic approach to testing multiple hypotheses of hydrological behaviour by incorporating hydrologic signatures and time step-based metrics into the limits of acceptability uncertainty analysis approach. We explicitly account for uncertainty in the discharge data by constructing uncertainty bounds from the rating curve data for each catchment. The method was applied to 24 catchments in England and Wales to draw insights into the variability of performance for multiple hydrological hypotheses in different catchments.

Model performance was evaluated against different model diagnostics, including signatures that capture the water balance and assess the model’s ability to reproduce the FDC and time step-based analysis. In general, it was found that the best model performances were found in wet catchments that have a simple rainfall-runoff relationship, and the poorest model performances were found in dry chalk catchments. The resulting behavioural model ensembles for each catchment provide insights for model identification and the processes that control catchment response. Firstly, the results demonstrate that the value of diagnostics in defining catchment response is dependent on catchment characteristics. We suggest that model diagnostics need to be tailored to the catchment of interest to ensure models are effectively interrogated and able to represent the catchment’s dynamics (Gupta et al., 2008). This is a first step towards more ‘informative signatures’ (Wagener and Montanari, 2011), which would enable us to define diagnostic tests that are useful for model identification and evaluation for a given application prior to any modelling taking place (Beven, 2006; McMillan et al., 2011). Secondly, within the context of the modelling framework presented here, we have shown that model performance is linked to model structural choice for high BFI catchments, whereas for low BFI catchments there is enough flexibility in the model structure to enable a good performance despite differences in model structural choice. This demonstrates the importance of varying the model structure for a set of catchments with wide climatic and hydrologic diversity but challenges the notion that the model structure needs to be tailored for each unique catchment (McMillan et al., 2011). Future research will be focused on extending this methodology to many more catchments across the UK to further advance our ability to link model space to catchment space, benchmark catchment processes and make predictions in ungauged basins.

ACKNOWLEDGEMENTS

The authors are grateful to Carol Langley from the Environment Agency for supplying the rating curve and stage discharge data. Gemma Coxon is supported by a studentship funded by the UK Natural Environment Research Council. Access to the data used in this study was provided by the Environmental Virtual Observatory Pilot funded by NERC; grant number NE/1002200/1. Partial support for this work was also provided by the Natural Environment Research Council [Consortium on Risk in the Environment: Diagnostics, Integration, Benchmarking, Learning and Elicitation (CREDIBLE); grant number NE/J017450/1]. We thank Ida Westerberg for her thoughtful and constructive comments on an earlier version of this manuscript.

REFERENCES

Bai Y, Wagener T, Reed P. 2009. A top-down framework for watershed model evaluation and selection under uncertainty. Environmental Modelling & Software 24: 901–916.

Beven K. 2006. A manifesto for the equifinality thesis. Journal of Hydrology 320: 18–36.

Beven KJ, Kirkby MJ. 1979. A physically based, variable contributing area model of basin hydrology/Un modèle à base physique de zone d’appel variable de l’hydrologie du bassin versant. Hydrological Sciences Bulletin 24: 43–69.

Beven K, Westerberg I. 2011. On red herrings and real herrings: disinformation and information in hydrological inference. Hydrological Processes 25: 1676–1680.

Beven K, Smith PJ, Wood A. 2011. On the colour and spin of epistemic error (and what we might do about it). Hydrology and Earth System Sciences 15: 3123–3133.

Beven K, Smith P, Westerberg I, Freer J. 2012. Comment on “Pursuing the method of multiple working hypotheses for hydrological modeling” by P. Clark et al. Water Resources Research 48: W11801. DOI:10.1029/2012WR012282

Blazkova S, Beven K. 2009. A limits of acceptability approach to model evaluation and uncertainty estimation in flood frequency estimation by continuous simulation: Skalka catchment, Czech Republic. Water Resources Research 45: W00B16. DOI:10.1029/2007WR006726

Boyle DP, Gupta HV, Sorooshian S. 2000. Toward improved calibration of hydrologic models: combining the strengths of manual and automatic methods. Water Resources Research 36: 3663–3674.

Burnash RJC. 1995. The NWS river forecast system-catchment modeling, in: computer models of watershed hydrology. Water Resour. Publ.: Littleton, Colo; 311–366.

Buytaert W, Beven K. 2011. Models as multiple working hypotheses: hydrological simulation of tropical alpine wetlands. Hydrological Processes 25: 1784–1799.

Clark MP, Slater AG, Rupp DE, Woods RA, Vrugt JA, Gupta HV, Wagener T, Hay LE. 2008. Framework for understanding structural errors (FUSE): a modular framework to diagnose differences between hydrological models. Water Resources Research 44: W00B02. DOI:10.1029/2007WR006735

Clark MP, Kavetski D, Fenicia F. 2011a. Pursuing the method of multiple working hypotheses for hydrological modeling. Water Resources Research 47: W09301. DOI:10.1029/2010WR009827

Clark MP, McMillan HK, Collins DBG, Kavetski D, Woods RA. 2011b. Hydrological field data from a modeller’s perspective: Part 2: process-based evaluation of model hypotheses. Hydrological Processes 25: 523–543.

Euser T, Winsemius HC, Hrachowitz M, Fenicia F, Uhlenbrook S, Savenije HHG. 2013. A framework to assess the realism of model structures using hydrological signatures. Hydrology and Earth System Sciences 17: 1893–1912.

Farmer D, Sivapalan M, Jothityangkoon C. 2003. Climate, soil, and vegetation controls upon the variability of water balance in temperate and semiarid landscapes: downward approach to water balance analysis. Water Resources Research 39: 1035. DOI:10.1029/2001WR000328

Freer J, Beven K, Peters N. 2003. Multivariate seasonal period model rejection within the generalised likelihood uncertainty estimation procedure. In Water science and application, Duan Q, Gupta HV, Sorooshian S, Rousseau AN, Turcotte R (eds). American Geophysical Union: Washington, D.C.; 69–87.

Freer JE, McMillan H, McDonnell JJ, Beven KJ. 2004. Constraining dynamic TOPMODEL responses for imprecise water table information using fuzzy rule based performance measures. Journal of Hydrology 291: 254–277.

Gudmundsson L, Wagener T, Tallaksen LM, Engeland K. 2012. Evaluation of nine large-scale hydrological models with respect to the seasonal runoff climatology in Europe. Water Resources Research 48: W11504. DOI:10.1029/2011WR010911

Guerrero J-L, Westerberg IK, Halldin S, Lundin L-C, Xu C-Y. 2013. Exploring the hydrological robustness of model-parameter values with alpha shapes. Water Resources Research 49. DOI:10.1002/wrcr.20533

Gupta HV, Wagener T, Liu Y. 2008. Reconciling theory with observations: elements of a diagnostic approach to model evaluation. Hydrological Processes 22: 3802–3813.

Gupta HV, Kling H, Yilmaz KK, Martinez GF. 2009. Decomposition of the mean squared error and NSE performance criteria: implications for improving hydrological modelling. Journal of Hydrology 377: 80–91.

Gupta HV, Clark MP, Vrugt JA, Abramowitz G, Ye M. 2012. Towards a comprehensive assessment of model structural adequacy. Water Resources Research 48: W08301. DOI:10.1029/2011WR011044

Gustard A, Bullock A, Dixon JM. 1992. Low flow estimation in the United Kingdom. Institute of Hydrology: Wallingford.

Herschy RW. 1999. Streamflow Measurement, 2nd edn. E & FN Spon: London.

Hough MN, Jones RJA. 1999. The United Kingdom meteorological office rainfall and evaporation calculation system: MORECS version 2.0-An overview. Hydrology and Earth System Sciences 1: 227–239.

Juston JM, Kauffeldt A, Montano BQ, Seibert J, Beven KJ, Westerberg IK. 2013. Smiling in the rain: seven reasons to be positive about uncertainty in hydrological modelling. Hydrological Processes 27: 1117–1122.

Kavetski D, Fenicia F. 2011. Elements of a flexible approach for conceptual hydrological modeling: 2. Application and experimental insights. Water Resources Research 47: W11511. DOI:10.1029/2011WR010748

Keller V, Young AR, Morris D, Davies H. 2006. Continuous estimation of river flows (CERF) technical report: estimation of precipitation inputs.

Kirchner JW. 2006. Getting the right answers for the right reasons: Linking measurements, analyses, and models to advance the science of hydrology. Water Resources Research 42: W03S04. DOI:10.1029/2005WR004362

Krueger T, Freer J, Quinton JN, Macleod CJA, Bilotta GS, Brazier RE, Butler P, Haygarth PM. 2010. Ensemble evaluation of hydrological model hypotheses. Water Resources Research 46: W07516. DOI:10.1029/2009WR007845

Leavesley GH, Lichty RW, Troutman BM, Saindon LG. 1983. Precipitation-runoff modeling system; user’s manual (No. WRI 83-4238). United States Geological Survey.

Lee H, McIntyre N, Wheater H, Young A. 2005. Selection of conceptual models for regionalisation of the rainfall-runoff relationship. Journal of Hydrology 312: 125–147.

Liu Y, Freer J, Beven K, Matgen P. 2009. Towards a limits of acceptability approach to the calibration of hydrological models: extending observation error. Journal of Hydrology 367: 93–103.

Marsh TJ, Hannaford J. 2008. UK hydrometric register, hydrological data UK series. Centre for Ecology and Hydrology: Wallingford, UK.

McMillan H, Freer J, Pappenberger F, Krueger T, Clark M. 2010. Impacts of uncertain river flow data on rainfall-runoff model calibration and discharge predictions. Hydrological Processes 24: 1270–1284.

McMillan HK, Clark MP, Bowden WB, Duncan M, Woods RA. 2011. Hydrological field data from a modeller’s perspective: Part 1. Diagnostic tests for model structure. Hydrological Processes 25: 511–522.

McMillan H, Krueger T, Freer J. 2012a. Benchmarking observational uncertainties for hydrology: rainfall, river discharge and water quality. Hydrological Processes 26: 4078–4111.

McMillan H, Tetzlaff D, Clark M, Soulsby C. 2012b. Do time-variable tracers aid the evaluation of hydrological model structure? A multimodel approach. Water Resources Research 48: W05501. DOI:10.1029/2011WR011688

Pappenberger F, Matgen P, Beven KJ, Henry J-B, Pfister L, Fraipont de P. 2006. Influence of uncertain boundary conditions and model structure on flood inundation predictions. Advances in Water Resources 29: 1430–1449.

Pushpalatha R, Perrin C, Le Moine N, Mathevet T, Andréassian V. 2011. A downward structural sensitivity analysis of hydrological models to improve low-flow simulation. Journal of Hydrology 411: 66–76.

Renard B, Kavetski D, Kuczera G, Thyer M, Franks SW. 2010. Understanding predictive uncertainty in hydrologic modeling: the challenge of identifying input and structural errors. Water Resources Research 46: W05521. DOI:10.1029/2009WR008328

Sivapalan M. 2009. The secret to ‘doing better hydrological science’: change the question! Hydrological Processes 23: 1391–1396. DOI:10.1002/HYP.7242

Smith TJ, Marshall LA. 2010. Exploring uncertainty and model predictive performance concepts via a modular snowmelt-runoff modeling framework. Environmental Modelling & Software 25: 691–701.

Son K, Sivapalan M. 2007. Improving model structure and reducing parameter uncertainty in conceptual water balance models through the use of auxiliary data. Water Resources Research 43: W01415. DOI:10.1029/2006WR005032

Staudinger M, Stahl K, Seibert J, Clark MP, Tallaksen LM. 2011. Comparison of hydrological model structures based on recession and low flow simulations. Hydrology and Earth System Sciences 15: 3447–3459.

Velázquez JA, Anctil F, Perrin C. 2010. Performance and reliability of multimodel hydrological ensemble simulations based on seventeen lumped models and a thousand catchments. Hydrology and Earth System Sciences 14: 2303–2317.

Wagener T, McIntyre N. 2012. Hydrological catchment classification using a data-based mechanistic strategy, in: system identification, environmental modelling, and control system design. Springer: London; 483–500.

Wagener T, Montanari A. 2011. Convergence of approaches toward reducing uncertainty in predictions in ungauged basins. Water Resources Research 47: W06301. DOI:10.1029/2010WR009469

Wagener T, Boyle DP, Lees MJ, Wheater HS, Gupta HV, Sorooshian S. 2001. A framework for development and application of hydrological models. Hydrology and Earth System Sciences 5: 13–26.

Wagener T, Sivapalan M, Troch P, Woods R. 2007. Catchment classification and hydrologic similarity. Geography Compass 1: 901–931.

Westerberg I, Guerrero J-L, Seibert J, Beven KJ, Halldin S. 2011a. Stage-discharge uncertainty derived with a non-stationary rating curve in the Choluteca River, Honduras. Hydrological Processes 25: 603–613.

Westerberg IK, Guerrero J-L, Younger PM, Beven KJ, Seibert J, Halldin S, Freer JE, Xu C-Y. 2011b. Calibration of hydrological models using flow-duration curves. Hydrology and Earth System Sciences 15: 2205–2227.

Winsemius HC, Schaefli B, Montanari A, Savenije HHG. 2009. On the calibration of hydrological models in ungauged basins: a framework for integrating hard and soft hydrological information. Water Resources Research 45: W12422. DOI:10.1029/2009WR007706

Yadav M, Wagener T, Gupta H. 2007. Regionalization of constraints on expected watershed response behavior for improved predictions in ungauged basins. Advances in Water Resources 30: 1756–1774.

Yilmaz KK, Gupta HV, Wagener T. 2008. A process-based diagnostic approach to model evaluation: application to the NWS distributed hydrologic model. Water Resources Research 44: W09417. DOI:10.1029/2007WR006716

Zhao RJ. 1984. Watershed hydrological modelling. Water Resour. and Electr. Power Press: Beijing.

Hydrol. Process. 28, 6135–6150 (2014)