Bayesian analysis of data-worth considering model and parameter uncertainties

Shlomo P. Neuman a,⇑, Liang Xue a, Ming Ye b, Dan Lu b

a Department of Hydrology and Water Resources, University of Arizona, Tucson, AZ 85721, United States
b Department of Scientific Computing, Florida State University, Tallahassee, FL 32306, United States

⇑ Corresponding author. Tel.: +1 520 885 3482; fax: +1 520 844 1795. E-mail address: [email protected] (S.P. Neuman).

Advances in Water Resources xxx (2011) xxx–xxx. Journal homepage: www.elsevier.com/locate/advwatres
0309-1708/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.advwatres.2011.02.007
Please cite this article in press as: Neuman SP et al. Bayesian analysis of data-worth considering model and parameter uncertainties. Adv Water Resour (2011), doi:10.1016/j.advwatres.2011.02.007

Article history: Available online xxxx

Keywords: Data worth; Value of information; Model uncertainty; Parameter uncertainty; Bayesian model averaging

Abstract

The rational management of water resource systems requires an understanding of their response to existing and planned schemes of exploitation, pollution prevention and/or remediation. Such understanding requires the collection of data to help characterize the system and monitor its response to existing and future stresses. It also requires incorporating such data in models of system makeup, water flow and contaminant transport. As the collection of subsurface characterization and monitoring data is costly, it is imperative that the design of corresponding data collection schemes be cost-effective, i.e., that the expected benefit of new information exceed its cost. A major benefit of new data is its potential to help improve one's understanding of the system, in large part through a reduction in model predictive uncertainty and corresponding risk of failure. Traditionally, value-of-information or data-worth analyses have relied on a single conceptual-mathematical model of site hydrology with prescribed parameters. Yet there is a growing recognition that ignoring model and parameter uncertainties renders model predictions prone to statistical bias and underestimation of uncertainty. This has led to a recent emphasis on conducting hydrologic analyses and rendering corresponding predictions by means of multiple models. We describe a corresponding approach to data-worth analyses within a Bayesian model averaging (BMA) framework. We focus on a maximum likelihood version (MLBMA) of BMA which (a) is compatible with both deterministic and stochastic models, (b) admits but does not require prior information about the parameters, (c) is consistent with modern statistical methods of hydrologic model calibration, (d) allows approximating lead predictive moments of any model by linearization, and (e) updates model posterior probabilities as well as parameter estimates on the basis of potential new data both before and after such data become actually available. We describe both the BMA and MLBMA versions theoretically and implement MLBMA computationally on a synthetic example with and without linearization.

1. Introduction

The world's water supply is threatened by overexploitation and contamination. To manage this supply in an optimal and sustainable manner, it is necessary to understand the response of water resource systems to existing and planned schemes of exploitation and pollution prevention and/or remediation. Such understanding requires the collection of suitable data to help characterize the system and monitor its response to existing and future stresses. It also requires incorporating such data in suitable models of water flow and contaminant transport.

As noted by Back [3], three strategies have traditionally been used to determine the magnitude of a data collection effort: minimizing cost for a specific level of accuracy or precision, minimizing uncertainty for a given budget, or responding to regulatory demands on data quantity and quality. Various combinations of these strategies have also been described such as a fitness-for-purpose approach [55]. Many today prefer a fourth approach based on value-of-information or data-worth analysis. Here the decision to collect additional data, or the design of a data collection program, is based on cost-effectiveness. A program is considered cost-effective if the expected benefit from the new information exceeds the cost. A major benefit of new data is its potential to help improve one's understanding of the system, in large part through a reduction in model predictive uncertainty. This benefit, however, is worth the cost only if it has the potential to impact decisions concerning management of the water resource system.

Value-of-information or data-worth analyses incorporating statistical decision theory have been applied to various water-related problems in the 1970s [12,13,23,36] and to groundwater problems in the 1980s [4,22,33,47]. More recent applications to groundwater resource and contamination issues have been reported in [1,11,21,27–30,37,45,50,51]. James and Freeze [28] proposed a Bayesian decision-making framework to evaluate the worth of data in the context of contaminated groundwater that has been widely cited in the subsequent literature. A comprehensive review focusing on health risk assessment can be found in [61]. Additional recent publications of relevance include [18,41].



Fig. 1. Decision tree (after [3]).


A major limitation of many existing approaches is that they rely on a single conceptual-mathematical model of geologic or watershed makeup and of hydrologic processes therein. Yet hydrologic environments are open and complex, rendering them prone to multiple interpretations and mathematical descriptions, including parameterizations. This is true regardless of the quantity and quality of available data. Predictions and analyses of uncertainty based on a single hydrologic concept are prone to statistical bias (caused by reliance on an inadequate model) and underestimation of uncertainty (caused by under-sampling of the relevant model space). Analyses of environmental data-worth which explore how different sets of conditioning data impact the predictive uncertainty of multiple models in a Bayesian context include [19,49,54]; whereas Freer et al. [19] employ Generalized Likelihood Uncertainty Estimation (GLUE; see [5,6]), Rojas et al. [49] combine GLUE with Bayesian model averaging (BMA; [7,17,25,32,34]). Diggle and Lophaven [16] describe a Bayesian approach to geostatistical design. Nowak et al. [43] introduce a Bayesian approach to data worth analysis when flow and transport take place in a random log hydraulic conductivity field. Whereas flow and transport are described by a single (linearized) model each having known parameters, other than those describing spatial variations in log hydraulic conductivity, the latter is characterized by a single drift model and a continuous family of variogram models having uncertain parameters.

In a similar spirit, we propose in this paper a multimodel approach to optimum value-of-information or data-worth analyses that is based on model averaging within a Bayesian framework. Our approach is general in that it considers multiple models of any kind, all having uncertain parameters; whereas parameterizing models in the manner of [43] is elegant and computationally efficient, it is unfortunately limited to a narrow range of variogram models and does not, generally, apply to other models such as those of flow and transport. We prefer BMA over GLUE because it (a) rests on rigorous statistical theory, (b) is compatible with deterministic as well as stochastic models and, (c) in its maximum likelihood (ML) version (MLBMA), is consistent with current ML methods of hydrologic model calibration [35,38–40,57–60]. Whereas BMA (like the closely related approach in [43]) relies heavily on prior parameter statistics, MLBMA can do without such statistics or otherwise update them on the basis of potential new data both before and after they are collected. We describe the proposed BMA and MLBMA approaches theoretically, outline ways to implement the MLBMA version computationally, and illustrate the latter on a synthetic example. Our proposed methodology should be of help in designing the collection of hydrologic characterization and monitoring data in a cost-effective manner by maximizing their benefit under given cost constraints. The benefit would accrue from optimum gain in information, or reduction in predictive uncertainty, upon considering jointly not only traditional sources of uncertainty such as those affecting model parameters and the reliability of data but also, most importantly, lack of certainty about the conceptual-mathematical models that underlie the analysis and the scenarios under which the system would operate in the future. The methodology should apply to a broad range of models representing natural processes in ubiquitously open and complex earth and environmental systems.

Fig. 2. Steps of Bayesian approach (after [3]).

2. Background

2.1. Bayesian decision analysis framework

One way to cast the data-worth issue would be within a Bayesian risk-cost-benefit decision framework such as that of Freeze et al. [20,21]. Suppose without loss of generality that the data are intended to help one decide whether or not a contaminated site should be remediated. This decision problem is illustrated in Fig. 1 [3] by a decision tree in which $U$ is the decision objective; $U_i$ is an objective function associated with each decision alternative ($i = 1, 2$) defined as

$$U_i = B_i - C_i - \gamma P_i C_{fi}, \tag{1}$$

where $B_i$ is the benefit and $C_i$ the investment cost, risk being expressed as the product $\gamma P_i C_{fi}$ of a risk aversion factor $\gamma$, the probability of failure $P_i$ and the cost of failure $C_{fi}$; $C^{+}$ designates a contaminated and $C^{-}$ an uncontaminated state of the site; costs and benefits occur at the triangular terminal nodes (only the cost of failure is indicated in the figure). Collecting additional information generally causes the risk term to decrease due to a decrease in the probability of failure. The corresponding increase in $U_i$ is the expected value (worth) of the new data. The final outcome of the analysis depends on the choice of decision rule one adopts [42]; for example, maximizing $U_i$ would result in the largest benefit and lowest cost.
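As a minimal sketch of how Eq. (1) translates a drop in failure probability into data worth, the snippet below evaluates the objective before and after hypothetical new data. All numbers, the function name `decision_objective`, and the sampling cost are illustrative assumptions, not values from the paper.

```python
def decision_objective(benefit, cost, risk_aversion, p_failure, cost_failure):
    """Eq. (1): U_i = B_i - C_i - gamma * P_i * C_fi."""
    return benefit - cost - risk_aversion * p_failure * cost_failure


# Hypothetical values in arbitrary monetary units (not taken from the paper).
benefit, cost, gamma, cost_failure = 100.0, 40.0, 1.0, 150.0

# Assumed failure probabilities before and after conditioning on new data.
u_before = decision_objective(benefit, cost, gamma, p_failure=0.30, cost_failure=cost_failure)
u_after = decision_objective(benefit, cost, gamma, p_failure=0.10, cost_failure=cost_failure)

data_worth = u_after - u_before  # increase in U_i attributed to the new data
sampling_cost = 12.0             # assumed cost of collecting the data
print(f"worth of data = {data_worth:.1f}, cost-effective: {data_worth > sampling_cost}")
```

Under these assumed numbers the data would be judged cost-effective because the gain in the objective exceeds the assumed sampling cost; a different risk aversion factor or cost of failure could reverse that conclusion.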

According to Back [3] the Bayesian approach to data-worth analysis entails five steps as illustrated in Fig. 2. The steps include (1) defining one or more data collection (sampling) programs, (2) postulating a prior probability for the state of the site (e.g., contaminated or uncontaminated), (3) using Bayes' theorem to update the prior probability to a posterior probability conditional on the new data, corresponding to each data collection program, (4) estimating corresponding costs and benefits, and (5) computing the worth of data or value of information using a given decision model and using the results to optimize the data collection scheme. This work also considers using linearized estimation of uncertainty to update the prediction variance in step 3. We add that data-worth analyses actually include a pre-posterior and a posterior mode. In the posterior mode, a given set of data is evaluated in hindsight, after spending the money to collect it. In the pre-posterior mode, possible sampling schemes are analyzed for the worth of data not yet collected. This mode requires averaging over all such data, weighted by their pre-posterior probabilities, as we do below.
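Steps (2) and (3), together with the pre-posterior averaging just described, can be sketched for a two-state site (contaminated or not) and a single binary sample outcome. The prior and the detection probabilities below are assumed for illustration only, and the Bernoulli variance is used here merely as a simple stand-in measure of uncertainty.

```python
def bayes_update(prior, like_if_true, like_if_false):
    """Return (posterior probability, probability of the observed outcome)."""
    evidence = like_if_true * prior + like_if_false * (1.0 - prior)
    return like_if_true * prior / evidence, evidence


def bernoulli_variance(p):
    """Variance of a 0/1 state indicator with probability p."""
    return p * (1.0 - p)


# Hypothetical inputs: prior probability of contamination and sampling reliability.
prior = 0.4
p_detect_if_contaminated = 0.8
p_detect_if_clean = 0.1

# Posterior mode: update after a specific outcome has been observed.
post_detect, p_detect = bayes_update(prior, p_detect_if_contaminated, p_detect_if_clean)
post_no_detect, p_no_detect = bayes_update(
    prior, 1.0 - p_detect_if_contaminated, 1.0 - p_detect_if_clean
)

# Pre-posterior mode: average over outcomes not yet observed,
# weighted by their predictive (pre-posterior) probabilities.
expected_post_variance = (
    p_detect * bernoulli_variance(post_detect)
    + p_no_detect * bernoulli_variance(post_no_detect)
)
print(bernoulli_variance(prior), expected_post_variance)
```

Averaging the two posteriors with the same weights recovers the prior, while the expected posterior variance is smaller than the prior variance; that expected reduction is what a pre-posterior analysis weighs against sampling cost.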


As noted, collecting additional information generally reduces risk due to a decrease in the probability of failure. A reduction in the probability of failure comes about through a reduction in uncertainty about the expected system state, present or future. The impact of hydrologic data on this expectation and the associated uncertainty are often evaluated by means of a hydrologic model. Commonly, the model is considered to be certain while its parameters (and in some cases its forcing terms such as sources and boundary conditions) are treated as being uncertain due to insufficient and error-prone data [18]. As already noted, we know of only one work that considers the impact of data on model predictive uncertainty within a Bayesian framework by considering the model itself to be uncertain [19] and another work that parameterizes this uncertainty [43]. Below we provide background about Bayesian model averaging and its maximum likelihood version which we propose to employ for this same purpose.

2.2. Bayesian model averaging (BMA)

Consider a random vector, $\Delta$, the multivariate statistics of which are to be predicted with a set $M$ of $K$ mutually independent models, $M_k$, each characterized by a vector of parameters $\theta_k$, conditional on a discrete set of data, $D$ (the case of correlated models has recently been considered in [52]). In analogy to the case of a scalar $\Delta$ [25] we write the joint posterior (conditional) distribution of $\Delta$ as

$$p(\Delta \mid D) = \sum_{k=1}^{K} p(\Delta \mid D, M_k)\, p(M_k \mid D), \tag{2}$$

i.e., as the average over all models of the joint posterior distributions $p(\Delta \mid D, M_k)$ associated with individual models, weighted by the model posterior probabilities $p(M_k \mid D)$. These weights are given by Bayes' rule in the form

$$p(M_k \mid D) = \frac{p(D \mid M_k)\, p(M_k)}{\sum_{l=1}^{K} p(D \mid M_l)\, p(M_l)}, \tag{3}$$

where

$$p(D \mid M_k) = \int p(D \mid M_k, \theta_k)\, p(\theta_k \mid M_k)\, d\theta_k \tag{4}$$

is the integrated likelihood of model $M_k$, $p(D \mid M_k, \theta_k)$ being the joint likelihood of this model and its parameters, $p(\theta_k \mid M_k)$ the prior density of $\theta_k$ under model $M_k$, and $p(M_k)$ the prior probability of $M_k$. The likelihood $p(D \mid M_k, \theta_k)$ contains a model of $D$ measurement errors, and the prior density $p(\theta_k \mid M_k)$ may contain a model of parameter measurement errors [8]. All probabilities are implicitly conditional on the choice of models entering into the set $M$.
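To make Eqs. (3) and (4) concrete, the following sketch estimates the integrated likelihood by plain Monte Carlo sampling of the prior and then normalizes to obtain posterior model probabilities. The two toy models (a constant mean with unit-variance Gaussian errors, differing only in their prior on that mean), the data values, and all function names are assumptions made for illustration; they are not the models used in the paper.

```python
import math
import random


def likelihood(data, theta, sigma=1.0):
    """Gaussian likelihood p(D | M_k, theta_k) with independent errors of std sigma."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    return math.prod(math.exp(-0.5 * ((y - theta) / sigma) ** 2) / norm for y in data)


def integrated_likelihood(data, prior_mean, prior_sd, n_samples, rng):
    """Monte Carlo estimate of Eq. (4): mean likelihood over draws from the prior."""
    draws = (rng.gauss(prior_mean, prior_sd) for _ in range(n_samples))
    return sum(likelihood(data, theta) for theta in draws) / n_samples


def posterior_model_probabilities(int_likelihoods, prior_probs):
    """Eq. (3): normalize integrated likelihood times prior model probability."""
    weighted = [l * p for l, p in zip(int_likelihoods, prior_probs)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical data and two hypothetical models that differ only in their prior mean.
data = [0.2, -0.1, 0.5, 0.3, 0.0]
prior_means = [0.0, 2.0]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
int_likes = [integrated_likelihood(data, m, 1.0, 20000, rng) for m in prior_means]
model_probs = posterior_model_probabilities(int_likes, [0.5, 0.5])
print(model_probs)
```

Sampling the prior in this brute-force way becomes expensive as the number of parameters grows, since most prior draws may contribute negligibly to the average.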

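The posterior mean and covariance of $\Delta$ are given, through extension of Draper's [17] analysis to our multivariate case, by

$$E(\Delta \mid D) = \sum_{k=1}^{K} E(\Delta \mid D, M_k)\, p(M_k \mid D), \tag{5}$$

$$\begin{aligned} \mathrm{Cov}(\Delta \mid D) = {} & \sum_{k=1}^{K} \mathrm{Cov}(\Delta \mid D, M_k)\, p(M_k \mid D) \\ & + \sum_{k=1}^{K} \left[E(\Delta \mid D, M_k) - E(\Delta \mid D)\right]\left[E(\Delta \mid D, M_k) - E(\Delta \mid D)\right]^{T} p(M_k \mid D), \end{aligned} \tag{6}$$

where the superscript $T$ denotes transpose. Eq. (6) is a discrete expression of the law of total covariance,

$$\mathrm{Cov}(\Delta \mid D) = E_{M_k \mid D}\,\mathrm{Cov}(\Delta \mid D, M_k) + \mathrm{Cov}_{M_k \mid D}\,E(\Delta \mid D, M_k), \tag{7}$$

where $E_{M_k \mid D}\,\mathrm{Cov}(\Delta \mid D, M_k)$ is the within-model component of $\mathrm{Cov}(\Delta \mid D)$ and $\mathrm{Cov}_{M_k \mid D}\,E(\Delta \mid D, M_k)$ is its between-model component. We will also be interested in the trace

$$\mathrm{Tr}\left[\mathrm{Cov}(\Delta \mid D)\right] = \mathrm{Tr}\left[E_{M_k \mid D}\,\mathrm{Cov}(\Delta \mid D, M_k)\right] + \mathrm{Tr}\left[\mathrm{Cov}_{M_k \mid D}\,E(\Delta \mid D, M_k)\right] \tag{8}$$

which provides a scalar measure of the posterior variance of $\Delta$. The latter is of interest because, for $K > 1$, one generally has $\mathrm{Tr}[\mathrm{Cov}_{M_k \mid D}\,E(\Delta \mid D, M_k)] > 0$ so that $\mathrm{Tr}[\mathrm{Cov}(\Delta \mid D)] > \mathrm{Tr}[E_{M_k \mid D}\,\mathrm{Cov}(\Delta \mid D, M_k)]$. Hence the consideration of multiple models generally results in greater predictive uncertainty, as measured by $\mathrm{Tr}[\mathrm{Cov}(\Delta \mid D)]$, than the uncertainty associated with a single model, as measured by $\mathrm{Tr}[\mathrm{Cov}(\Delta \mid D, M_k)]$. Note that, for any consistent norm $\|\cdot\|$, $\|\mathrm{Cov}(\Delta \mid D)\| \le \|E_{M_k \mid D}\,\mathrm{Cov}(\Delta \mid D, M_k)\| + \|\mathrm{Cov}_{M_k \mid D}\,E(\Delta \mid D, M_k)\|$ where the equality holds if and only if the two right hand side arguments are linearly dependent, which is generally not the case. It follows that $\|\mathrm{Cov}_{M_k \mid D}\,E(\Delta \mid D, M_k)\| > 0$ does not generally imply $\|\mathrm{Cov}(\Delta \mid D)\| > \|E_{M_k \mid D}\,\mathrm{Cov}(\Delta \mid D, M_k)\|$.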
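The model-averaged moments in Eqs. (5), (6) and (8) can be sketched with plain Python lists, given per-model means, covariances and posterior probabilities. The two-model, two-component inputs and the helper names below are hypothetical and serve only to show the within-model and between-model split.

```python
def bma_moments(means, covs, weights):
    """Model-averaged mean (Eq. 5) and within/between covariance parts (Eq. 6)."""
    n = len(means[0])
    mean = [sum(w * m[i] for w, m in zip(weights, means)) for i in range(n)]
    within = [
        [sum(w * c[i][j] for w, c in zip(weights, covs)) for j in range(n)]
        for i in range(n)
    ]
    between = [
        [
            sum(w * (m[i] - mean[i]) * (m[j] - mean[j]) for w, m in zip(weights, means))
            for j in range(n)
        ]
        for i in range(n)
    ]
    return mean, within, between


def trace(matrix):
    """Sum of diagonal entries, the scalar measure used in Eq. (8)."""
    return sum(matrix[i][i] for i in range(len(matrix)))


# Hypothetical posterior moments of a 2-component prediction under two models.
means = [[1.0, 2.0], [3.0, 1.0]]
covs = [[[1.0, 0.2], [0.2, 0.5]], [[0.8, 0.0], [0.0, 1.0]]]
weights = [0.6, 0.4]  # assumed posterior model probabilities p(M_k | D)

bma_mean, within, between = bma_moments(means, covs, weights)
total_trace = trace(within) + trace(between)
print(bma_mean, trace(within), trace(between), total_trace)
```

Because the two models disagree on the mean, the between-model trace is strictly positive here, so the averaged trace exceeds the weighted within-model trace.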

2.3. Maximum likelihood Bayesian model averaging (MLBMA)

BMA defines the integrated likelihood $p(D \mid M_k)$ of model $M_k$ entirely in terms of the prior parameter density $p(\theta_k \mid M_k)$ of model parameters, having thus no provision for the conditioning of model parameters on measurements $D$ (i.e., for the estimation of optimum model parameters on the basis of $D$ using inverse methods). Instead, it requires computing the integral in (4) through exhaustive sampling of the prior parameter space $\theta_k$ for each model followed by numerical integration. One way to resolve both issues is to replace $\theta_k$ by an estimate, $\hat{\theta}_k^{D}$, which maximizes the likelihood $p(D \mid M_k, \theta_k)$. Obtaining such maximum likelihood (ML) estimates entails calibrating each model against (conditioning on) the data $D$ using well-established statistical inverse methods [8–10,24,44,48]. Approximating $p(\Delta \mid D, M_k)$ in (2) by $p(\Delta \mid D, M_k)_{ML}$, where the subscript indicates approximation based on ML estimation of $\theta_k$, was shown to be useful (for the case of a scalar $\Delta$) in [17,46,56]. Neuman [40] proposed evaluating the weights $p(M_k \mid D)$ in (2), (5), and (6) based on a result due to Kashyap [31]. The corresponding expression is [57]

$$p(M_k \mid D) \simeq p(M_k \mid D)_{ML} = \frac{\exp\left(-\tfrac{1}{2}\,\delta KIC_k^{D}\right) p(M_k)}{\sum_{l=1}^{K} \exp\left(-\tfrac{1}{2}\,\delta KIC_l^{D}\right) p(M_l)}, \tag{9}$$

where

$$\delta KIC_k^{D} = KIC_k^{D} - KIC_{\min}^{D}, \tag{10}$$

$$KIC_k^{D} = -2 \ln p\left(D \mid M_k, \hat{\theta}_k^{D}\right)_{ML} - 2 \ln p\left(\hat{\theta}_k^{D} \mid M_k\right)_{ML} + N_k \ln\left(\frac{N_D}{2\pi}\right) + \ln\left|F_k(D \mid M_k)_{ML}\right|, \tag{11}$$

$KIC_k^{D}$ being the so-called Kashyap model selection (or information) criterion for model $M_k$, $KIC_{\min}^{D}$ its minimum value over all candidate models, and $-2 \ln p(D \mid M_k, \theta_k)_{ML} - 2 \ln p(\theta_k \mid M_k)_{ML}$ a negative log likelihood incorporating prior measurements of the parameters (if available; see [8]), evaluated at $\hat{\theta}_k^{D}$. Here $N_k$ is the dimension of $\theta_k$ (number of adjustable parameters associated with model $M_k$), $N_D$ is the dimension of $D$ (number of discrete data points, which may include measured parameter values), and $F_k$ is the normalized (by $N_D$) observed (as opposed to ensemble mean) Fisher information matrix having components

$$F_{k,nm} = -\frac{1}{N_D} \left[\frac{\partial^2 \ln p(D \mid M_k, \theta_k)}{\partial \theta_{kn}\, \partial \theta_{km}}\right]_{\theta_k = \hat{\theta}_k^{D}}. \tag{12}$$

In the limit of large $N_D / N_k$, $KIC_k^{D}$ reduces asymptotically to the so-called Bayesian selection (or information) criterion (e.g. [59])

$$BIC_k^{D} = -2 \ln p\left(D \mid M_k, \hat{\theta}_k^{D}\right)_{ML} + N_k \ln N_D. \tag{13}$$
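Once each model has been calibrated, Eqs. (9)–(11) and (13) reduce to simple arithmetic on a few scalars per model. The sketch below assumes those scalars (the negative log likelihood term, the prior term, and the log-determinant of the Fisher matrix) are already available from calibration; the function names and the three criterion values in the example are hypothetical.

```python
import math


def kic(neg2_log_like, neg2_log_prior, n_params, n_data, log_det_fisher):
    """Eq. (11), with the calibration by-products supplied as plain numbers."""
    return (
        neg2_log_like
        + neg2_log_prior
        + n_params * math.log(n_data / (2.0 * math.pi))
        + log_det_fisher
    )


def bic(neg2_log_like, n_params, n_data):
    """Eq. (13)."""
    return neg2_log_like + n_params * math.log(n_data)


def model_probabilities(criteria, prior_probs):
    """Eqs. (9)-(10): weights from criterion differences relative to the minimum."""
    c_min = min(criteria)
    weighted = [math.exp(-0.5 * (c - c_min)) * p for c, p in zip(criteria, prior_probs)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical criterion values for three candidate models with equal priors.
criteria = [210.3, 215.1, 208.9]
probs = model_probabilities(criteria, [1.0 / 3.0] * 3)
print(probs)
```

Subtracting the minimum criterion before exponentiating, as in Eq. (10), leaves the weights unchanged but avoids numerical underflow when the criteria are large.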


3. Effect of data augmentation on uncertainty

3.1. BMA framework

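Suppose that the original data set $D$ is augmented by another hypothetical data set, $C$, which has not yet been collected and is therefore uncertain. Assume that the multivariate statistics of $\Delta$, predicted with the model set $M$, can be conditioned on the augmented data set $\{D, C\}$. Assume further that the multivariate statistics of $C$, conditional on $D$, can be predicted either via BMA or via MLBMA with a set $P$ of $I$ mutually independent statistical models, $P_i$, having parameters $\pi_i$. The models $P_i$ may be independent of $M_k$, may form extensions of $M_k$ or may coincide with the latter as in the computational example given later in this paper. Then analogy to (2) and (5)–(8) implies

$$p(\Delta \mid D) = E_{C \mid D}\, p(\Delta \mid D, C), \tag{14}$$

$$p(\Delta \mid D, C) = \sum_{k=1}^{K} p(\Delta \mid D, C, M_k)\, p(M_k \mid D, C), \tag{15}$$

$$E(\Delta \mid D) = E_{C \mid D}\, E(\Delta \mid D, C), \tag{16}$$

$$E(\Delta \mid D, C) = \sum_{k=1}^{K} E(\Delta \mid D, C, M_k)\, p(M_k \mid D, C), \tag{17}$$

$$\mathrm{Cov}(\Delta \mid D) = E_{C \mid D}\,\mathrm{Cov}(\Delta \mid D, C) + \mathrm{Cov}_{C \mid D}\,E(\Delta \mid D, C), \tag{18}$$

$$\begin{aligned} \mathrm{Cov}(\Delta \mid D, C) = {} & \sum_{k=1}^{K} \mathrm{Cov}(\Delta \mid D, C, M_k)\, p(M_k \mid D, C) \\ & + \sum_{k=1}^{K} \left[E(\Delta \mid D, C, M_k) - E(\Delta \mid D, C)\right]\left[E(\Delta \mid D, C, M_k) - E(\Delta \mid D, C)\right]^{T} p(M_k \mid D, C), \end{aligned} \tag{19}$$

$$\mathrm{Tr}\left[\mathrm{Cov}(\Delta \mid D)\right] = \mathrm{Tr}\left[E_{C \mid D}\,\mathrm{Cov}(\Delta \mid D, C)\right] + \mathrm{Tr}\left[\mathrm{Cov}_{C \mid D}\,E(\Delta \mid D, C)\right], \tag{20}$$

where $C$ is implicitly conditional on the choice of models $P$ and on $D$. The term $\mathrm{Tr}[\mathrm{Cov}_{C \mid D}\,E(\Delta \mid D, C)] = \mathrm{Tr}[\mathrm{Cov}(\Delta \mid D)] - \mathrm{Tr}[E_{C \mid D}\,\mathrm{Cov}(\Delta \mid D, C)]$ represents the difference between the total trace conditional on $D$ and the expected trace conditional jointly on $D$ and $C$. As this difference is positive, conditioning on $D$ and $C$ jointly results in a lower trace than conditioning on $D$ alone. The difference could be viewed as an extended version of the A-criterion in optimal design.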
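The decomposition in Eqs. (18) and (20) can be checked numerically in the simplest possible setting: a single model and a scalar $\Delta$ that is Gaussian given $D$, with one potential datum $C$ equal to $\Delta$ plus independent Gaussian noise. This conjugate setup, its parameter values, and the function names are assumptions chosen so that the conditional moments are known in closed form; it is not the multimodel geostatistical case treated in the paper.

```python
import random
import statistics


def posterior_given_c(mu, s2, r2, c):
    """Mean and variance of Delta given D and a noisy datum c = Delta + error."""
    gain = s2 / (s2 + r2)
    return mu + gain * (c - mu), (1.0 - gain) * s2


def preposterior_decomposition(mu, s2, r2, n_real, rng):
    """Average over simulated C: expected posterior variance and variance of posterior mean."""
    post_means, post_vars = [], []
    for _ in range(n_real):
        delta = rng.gauss(mu, s2 ** 0.5)
        c = delta + rng.gauss(0.0, r2 ** 0.5)  # C drawn from its predictive distribution given D
        m, v = posterior_given_c(mu, s2, r2, c)
        post_means.append(m)
        post_vars.append(v)
    return statistics.fmean(post_vars), statistics.pvariance(post_means)


# Hypothetical values: Var(Delta | D) = 4, measurement error variance = 1.
mu, s2, r2 = 0.0, 4.0, 1.0
rng = random.Random(1)  # fixed seed for repeatability
expected_var, var_of_mean = preposterior_decomposition(mu, s2, r2, 20000, rng)
print(expected_var, var_of_mean, expected_var + var_of_mean)
```

In this linear Gaussian case the posterior variance does not depend on the value of $C$, so the expected variance is exact and only the variance of the conditional mean carries sampling noise; their sum should approach the variance conditional on $D$ alone as the number of realizations grows.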

The joint posterior distribution (weight) of each model is given, analogously to (3), by Bayes' rule

$$p(M_k \mid D, C) = \frac{p(D, C \mid M_k)\, p(M_k)}{\sum_{l=1}^{K} p(D, C \mid M_l)\, p(M_l)}, \tag{21}$$

where, in analogy to (4), the joint integrated likelihood of $D$ and $C$ is

$$p(D, C \mid M_k) = \int p(D, C \mid M_k, \theta_k)\, p(\theta_k \mid M_k)\, d\theta_k. \tag{22}$$

The joint likelihood of $D$ and $C$ under the integral lends itself to Bayesian updating according to

$$p(D, C \mid M_k, \theta_k) = p(C \mid D, M_k, \theta_k)\, p(D \mid M_k, \theta_k). \tag{23}$$

3.2. MLBMA framework

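Let $\hat{\theta}_k^{D,C}$ be an ML estimate of $\theta_k$ obtained by maximizing the joint likelihood $p(D, C \mid M_k, \theta_k)$ under the integral in (22) with respect to these parameters. Then MLBMA would entail approximating Eqs. (15), (17), (19)–(21) via

$$p(\Delta \mid D, C) \simeq p(\Delta \mid D, C)_{ML} = \sum_{k=1}^{K} p(\Delta \mid D, C, M_k)_{ML}\, p(M_k \mid D, C)_{ML}, \tag{24}$$

$$E(\Delta \mid D, C) \simeq E(\Delta \mid D, C)_{ML} = \sum_{k=1}^{K} E(\Delta \mid D, C, M_k)_{ML}\, p(M_k \mid D, C)_{ML}, \tag{25}$$

$$\begin{aligned} \mathrm{Cov}(\Delta \mid D, C) \simeq {} & \mathrm{Cov}(\Delta \mid D, C)_{ML} \\ = {} & \sum_{k=1}^{K} \mathrm{Cov}(\Delta \mid D, C, M_k)_{ML}\, p(M_k \mid D, C)_{ML} \\ & + \sum_{k=1}^{K} \left[E(\Delta \mid D, C, M_k)_{ML} - E(\Delta \mid D, C)_{ML}\right]\left[E(\Delta \mid D, C, M_k)_{ML} - E(\Delta \mid D, C)_{ML}\right]^{T} p(M_k \mid D, C)_{ML}, \end{aligned} \tag{26}$$

$$\begin{aligned} \mathrm{Tr}\left[\mathrm{Cov}(\Delta \mid D, C)\right] \simeq {} & \mathrm{Tr}\left[\mathrm{Cov}(\Delta \mid D, C)_{ML}\right] \\ = {} & \mathrm{Tr}\left[E_{M_k \mid D}\,\mathrm{Cov}(\Delta \mid D, C, M_k)_{ML}\right] + \mathrm{Tr}\left[\mathrm{Cov}_{M_k \mid D}\,E(\Delta \mid D, C, M_k)_{ML}\right], \end{aligned} \tag{27}$$

$$p(M_k \mid D, C) \simeq p(M_k \mid D, C)_{ML} = \frac{p(D, C \mid M_k)_{ML}\, p(M_k)}{\sum_{l=1}^{K} p(D, C \mid M_l)_{ML}\, p(M_l)}. \tag{28}$$

In analogy to (9)–(12), (28) can be rewritten as

$$p(M_k \mid D, C)_{ML} = \frac{\exp\left(-\tfrac{1}{2}\,\delta KIC_k^{D,C}\right) p(M_k)}{\sum_{l=1}^{K} \exp\left(-\tfrac{1}{2}\,\delta KIC_l^{D,C}\right) p(M_l)}, \tag{29}$$

$$\delta KIC_k^{D,C} = KIC_k^{D,C} - KIC_{\min}^{D,C}, \tag{30}$$

$$KIC_k^{D,C} = -2 \ln p\left(D, C \mid M_k, \hat{\theta}_k^{D,C}\right)_{ML} - 2 \ln p\left(\hat{\theta}_k^{D,C} \mid M_k\right)_{ML} + N_k \ln\left(\frac{N_{D,C}}{2\pi}\right) + \ln\left|F_k(D, C \mid M_k)_{ML}\right|, \tag{31}$$

where $KIC_{\min}^{D,C}$ is the minimum value of $KIC_k^{D,C}$ over all $K$ models, $M_k$; $-2 \ln p(D, C \mid M_k, \theta_k)_{ML} - 2 \ln p(\theta_k \mid M_k)_{ML}$ is a negative log likelihood incorporating prior measurements of the parameters (if available), evaluated at $\hat{\theta}_k^{D,C}$; $N_k$ is the dimension of $\hat{\theta}_k^{D,C}$ (number of estimated parameters); $N_{D,C}$ is the joint dimension of $D$ and $C$ (number of discrete data points); and $F_k$ is the normalized observed Fisher information matrix having components

$$F_{k,nm} = -\frac{1}{N_{D,C}} \left[\frac{\partial^2 \ln p(D, C \mid M_k, \theta_k)}{\partial \theta_{kn}\, \partial \theta_{km}}\right]_{\theta_k = \hat{\theta}_k^{D,C}}. \tag{32}$$

Minimizing $-2 \ln p(D, C \mid M_k, \theta_k) - 2 \ln p(\theta_k \mid M_k)$ with respect to $\theta_k$ can be optionally simplified by linearizing the residuals entering into this negative log likelihood about ML estimates $\hat{\theta}_k^{D}$, based solely on $D$, obtained through minimization of $-2 \ln p(D \mid M_k, \theta_k) - 2 \ln p(\theta_k \mid M_k)$ with respect to $\theta_k$. This option is not explored in the present paper.

Computational implementation of the proposed MLBMA framework is detailed in Appendix A. Appendix B shows how some of these computations can be rendered more efficient through linearization.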

4. Synthetic geostatistical example

Our proposed approach to the assessment of data-worth can be implemented by using either BMA or MLBMA. Here we do so using MLBMA by considering multiple variogram models of a zero-mean spatially correlated random field, $Z(x)$, having point support in two dimensions, $x = (x_1, x_2)^T$. In particular, we use a modified version of the sequential Gaussian simulation code SGSIM [14] to generate an unconditional realization of $Z$ on a grid of 50 × 50 nodes using a truncated power variogram model with Gaussian modes (TpvG) given in [15],

$$\gamma(s; \lambda_u) = \sigma^2(\lambda_u) \left\{ 1 - \exp\left[-\frac{\pi}{4}\left(\frac{s}{\lambda_u}\right)^2\right] + \left[\frac{\pi}{4}\left(\frac{s}{\lambda_u}\right)^2\right]^{H} \Gamma\left(1 - H, \frac{\pi}{4}\left(\frac{s}{\lambda_u}\right)^2\right) \right\}, \quad 0 < H < 1, \tag{33}$$

where $s = \|x_1 - x_2\|$ is separation distance (lag) between $Z$ values at any two points $x_1$ and $x_2$, $\lambda_u$ is an upper cutoff scale proportional to domain size, $A$ is a coefficient, $H$ is a Hurst scaling exponent, $\sigma^2(\lambda_u) = A\lambda_u^{2H}/2H$ is variance (sill) and $\Gamma(\cdot,\cdot)$ is the incomplete gamma function. The corresponding integral scale is $I(\lambda_u) = 2H\lambda_u/(1 + 2H)$. We set the parameters of the TpvG model equal to $\theta = (A, H, \lambda_u)^T = (0.1, 0.25, 5)^T$ which correspond to $\sigma^2 = 0.45$ and $I = 1.67$. We then generate a "true" sample of 2500 $Z$ values at 50 × 50 nodes of a square grid, spaced a unit distance apart, as shown in Fig. 3. After verifying that a sample variogram based on all the generated values reproduces the original TpvG very closely we select 100 $Z$ values at randomly located nodes to comprise a vector $D$ of "available" data, 20 values to form a vector $C'$ ($C'$ representing true but unknown data values, $C$ their estimates) at other randomly located "potential new" sampling nodes, and those at the remaining 2380 nodes to make up a vector $\Delta$ of "unknown" values that we wish to predict via MLBMA. The latter is based either on $D$ or on $\{D, C\}$ where $C$ are values simulated randomly at the "potential new" sampling nodes conditional on $D$.
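The TpvG variogram, its sill and its integral scale can be evaluated with the standard library alone, as sketched below. The upper incomplete gamma function is obtained here as the complete gamma function minus a power series for the lower incomplete gamma function; this numerical route, and the function names, are choices made for this sketch rather than anything prescribed by the paper.

```python
import math


def upper_incomplete_gamma(a, x, tol=1e-15, max_terms=2000):
    """Gamma(a, x) for a > 0, as Gamma(a) minus the series for the lower function."""
    if x <= 0.0:
        return math.gamma(a)
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < tol * total:
            break
    lower = math.exp(a * math.log(x) - x) * total
    return max(0.0, math.gamma(a) - lower)  # clamp round-off for large x


def tpvg_sill(A, H, lam_u):
    """Variance (sill): A * lam_u^(2H) / (2H)."""
    return A * lam_u ** (2.0 * H) / (2.0 * H)


def tpvg_integral_scale(H, lam_u):
    """Integral scale: 2H * lam_u / (1 + 2H)."""
    return 2.0 * H * lam_u / (1.0 + 2.0 * H)


def tpvg_variogram(s, A, H, lam_u):
    """Truncated power variogram with Gaussian modes, Eq. (33), for 0 < H < 1."""
    x = (math.pi / 4.0) * (s / lam_u) ** 2
    bracket = 1.0 - math.exp(-x) + x ** H * upper_incomplete_gamma(1.0 - H, x)
    return tpvg_sill(A, H, lam_u) * bracket


# Parameter values quoted in the text: (A, H, lam_u) = (0.1, 0.25, 5).
A, H, lam_u = 0.1, 0.25, 5.0
sill = tpvg_sill(A, H, lam_u)
integral_scale = tpvg_integral_scale(H, lam_u)
lags = [0.0, 1.0, 5.0, 20.0, 60.0]
values = [tpvg_variogram(s, A, H, lam_u) for s in lags]
print(round(sill, 2), round(integral_scale, 2), values)
```

The variogram starts at zero, increases monotonically with lag and levels off at the sill; the series-based incomplete gamma is adequate for the lags that arise on a 50 × 50 unit grid, though a library routine would be preferable for very large arguments.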

Fig. 3. "True" random field $Z$ (colored) generated using TpvG with parameters $\theta = (A, H, \lambda_u)^T = (0.1, 0.25, 5)^T$ at 2500 grid nodes (shown). $D$ represents locations of "available" data and $C$ those of "potential new" data.

Please cite this article in press as: Neuman SP et al. Bayesian analysis of data-worth considering model and parameter uncertainties. Adv Water Resour (2011), doi:10.1016/j.advwatres.2011.02.007

To predict $\Delta$ we consider a set $M$ of $K = 3$ alternative variogram models, $M_k$, having parameters $\theta_k$ (purposely excluding the generating, or "true," TpvG model): exponential (Exp), Gaussian (Gau) and spherical (Sph). Each model, $M_k$, is assigned an equal prior probability, $p(M_k) = 1/3$, and is calibrated against $D$ to yield ML estimates $\hat{\theta}_k^D$ of $\theta_k$ by minimizing the joint negative log likelihood $-2\ln p(D|M_k,\theta_k) - 2\ln p(\theta_k|M_k)$. The process, denoted by $ML_D$ as explained in Appendix A, also yields corresponding parameter estimation covariance matrices $C_k^D$, Kashyap criteria $\mathrm{KIC}_k^D$ according to (11) and Bayesian criteria $\mathrm{BIC}_k^D$ according to (13). For comparison we also compute information theoretic criteria

$$\mathrm{AIC}_k^D = -2\ln p(D|M_k)_{ML} + 2N_k, \tag{34}$$

$$\mathrm{AICc}_k^D = -2\ln p(D|M_k)_{ML} + 2N_k + \frac{2N_k(N_k + 1)}{N_D - N_k - 1} \tag{35}$$

introduced, respectively, by Akaike [2] and Hurvich and Tsai [26]. $\mathrm{KIC}_k^D$ and $\mathrm{BIC}_k^D$ are used to compute posterior model probabilities (or, in the case of $\mathrm{AIC}_k^D$ and $\mathrm{AICc}_k^D$, model averaging weights), $p(M_k|D)_{ML_D}$, for each model via (9). The results are listed in Table 1 and the corresponding fits illustrated in Fig. 4. It is evident that the sample $D$ of available data is not large enough to reproduce correctly the TpvG model used to generate it; the corresponding sample and fitted variograms underestimate the true sill and overestimate the true integral scale. All criteria favor the spherical model, assigning a very low posterior probability (or weight) to the Gaussian model. Our model averaged results are based on $\mathrm{KIC}_k^D$.

Augmenting the sample to include $\{D, C'\}$ is still not enough to reproduce correctly the TpvG model; the sample and fitted variograms underestimate the true sill to a greater extent than was the case with $D$ alone but overestimate the true integral scale to a lesser extent (compare Figs. 4 and 5, and Tables 1 and 2). The preference of all criteria for the spherical model is now more pronounced (less ambiguous) than it was in the case of $D$. These results indicate a need to account for the effect of potential new data on parameter estimation and model weighting, as we do next by following the procedure in Appendix A.
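The conversion of model selection criteria into posterior model probabilities (or weights) can be sketched as follows, using the weighting formula stated in step 3 of Appendix A. The function name is hypothetical, and taking the criterion differences relative to the smallest value is an assumption here; the normalized result does not depend on that reference value.

```python
import math


def posterior_model_probabilities(criteria, priors):
    """Weights proportional to exp(-0.5 * delta_IC_k) * p(M_k), normalized over all models."""
    ic_min = min(criteria)  # subtracting the minimum avoids overflow; it cancels on normalization
    unnormalized = [math.exp(-0.5 * (ic - ic_min)) * p for ic, p in zip(criteria, priors)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# KIC values for Exp, Gau and Sph as listed in Table 1 (rounded to two decimals there)
kic_table1 = [-146.89, -135.94, -147.29]
priors = [1 / 3, 1 / 3, 1 / 3]
probs = posterior_model_probabilities(kic_table1, priors)
for name, p in zip(["Exp", "Gau", "Sph"], probs):
    print(f"{name}: {100 * p:.2f}%")
```

With the rounded KIC values of Table 1 the result should land close to the 44.92%, 0.19% and 54.90% reported there; small differences in the last digit are attributable to rounding of the tabulated criteria.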

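Predictions of $\Delta$ are uncertain due to random spatial fluctuations in $Z$ as well as uncertainty about the variogram model, $M_k$, and its parameters, $\theta_k$. As $\Delta$ is generally nonlinear in $\theta_k$, one can estimate its lead moments either through linearization, as described in Appendix B, or via Monte Carlo simulation. We start with the latter option by (a) drawing $R_\theta = 2000$ random realizations, $\theta_k^{r_\theta}$, of $\theta_k$ from a multivariate normal distribution $\theta_k \sim N(\hat{\theta}_k^D, C_k^D)$, where $r_\theta = 1, 2, \ldots, R_\theta$, for each $M_k$; (b) obtaining kriging estimates $E(\Delta_p|D,M_k,\theta_k^{r_\theta})_{ML_D}$, kriging variances $\mathrm{Var}(\Delta_p|D,M_k,\theta_k^{r_\theta})_{ML_D}$ and covariances $\mathrm{Cov}(\Delta_p\Delta_q|D,M_k,\theta_k^{r_\theta})_{ML_D}$ for all components $\Delta_p$ and $\Delta_q$ of $\Delta$; and (c) averaging these over all $R_\theta$ realizations to obtain

$$E(\Delta_p|D,M_k)_{ML_D} = E_{\theta_k}[E(\Delta_p|D,M_k,\theta_k)_{ML_D}] \simeq \frac{1}{R_\theta}\sum_{r_\theta=1}^{R_\theta} E(\Delta_p|D,M_k,\theta_k^{r_\theta})_{ML_D}, \tag{36}$$

$$\mathrm{Cov}(\Delta_p\Delta_q|D,M_k)_{ML_D} = E_{\theta_k}[\mathrm{Cov}(\Delta_p\Delta_q|D,M_k,\theta_k)_{ML_D}] + \mathrm{Cov}_{\theta_k}[E(\Delta_p\Delta_q|D,M_k,\theta_k)_{ML_D}], \tag{37}$$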

Table 1
Parameter estimates; negative log likelihoods NLL; model selection criteria AIC, AICc, BIC and KIC; prior and posterior model probabilities; and rankings of variogram models based on $D$.

Model                           Exp        Gau        Sph
Sill estimate                   0.404      0.401      0.399
Std^a of sill                   0.226      0.233      0.222
Integral scale estimate         3.686      1.979      2.681
Std^a of integral scale         0.574      0.187      0.283
NLL                             −145.14    −136.13    −147.23
NLL rank                        2          3          1
p(M_k)                          1/3        1/3        1/3
AIC                             −139.14    −130.13    −141.23
AIC rank                        2          3          1
p(M_k|D)_AIC                    25.98%     0.29%      73.74%
AICc                            −138.91    −129.90    −141.00
AICc rank                       2          3          1
p(M_k|D)_AICc                   25.98%     0.29%      73.74%
BIC                             −131.04    −122.03    −133.12
BIC rank                        2          3          1
p(M_k|D)_BIC                    25.98%     0.29%      73.74%
KIC                             −146.89    −135.94    −147.29
KIC rank                        2          3          1
p(M_k|D)_KIC                    44.92%     0.19%      54.90%

^a Std represents standard deviation.

Fig. 4. TpvG, sample and fitted variograms based on $D$; see statistics in Table 1.

Fig. 5. TpvG, sample and fitted variograms based on $\{D, C'\}$; see statistics in Table 2.

Table 2
Parameter estimates; negative log likelihoods NLL; model selection criteria AIC, AICc, BIC and KIC; prior and posterior model probabilities; and rankings of variogram models based on $\{D, C'\}$.

Model                           Exp        Gau        Sph
Sill estimate                   0.371      0.372      0.373
Std* of sill                    0.214      0.220      0.210
Integral scale estimate         2.409      1.907      2.672
Std* of integral scale          0.403      0.151      0.311
NLL                             −183.86    −175.59    −188.36
NLL rank                        2          3          1
p(M_k)                          1/3        1/3        1/3
AIC                             −177.86    −169.59    −182.36
AIC rank                        2          3          1
p(M_k|D,C')_AIC                 9.52%      0.15%      90.33%
AICc                            −177.67    −169.40    −182.17
AICc rank                       2          3          1
p(M_k|D,C')_AICc                9.52%      0.15%      90.33%
BIC                             −169.25    −160.99    −173.76
BIC rank                        2          3          1
p(M_k|D,C')_BIC                 9.52%      0.15%      90.33%
KIC                             −184.80    −174.87    −188.51
KIC rank                        2          3          1
p(M_k|D,C')_KIC                 13.51%     0.09%      86.39%

* Std represents standard deviation.

6 S.P. Neuman et al. / Advances in Water Resources xxx (2011) xxx–xxx


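where

$$E_{\theta_k}[\mathrm{Cov}(\Delta|D,M_k,\theta_k)_{ML_D}] \simeq \frac{1}{R_\theta}\sum_{r_\theta=1}^{R_\theta} \mathrm{Cov}(\Delta|D,M_k,\theta_k^{r_\theta})_{ML_D} \tag{38}$$

and [53]

$$\begin{aligned}
\mathrm{Cov}(\Delta_p\Delta_q|D,M_k,\theta_k^{r_\theta})_{ML_D} ={}& \mathrm{Cov}(x_p - x_q|M_k,\theta_k^{r_\theta}) \\
&+ \sum_{m\in N_p}\sum_{n\in N_q} \lambda_{pm}\lambda_{qn}\,\mathrm{Cov}(x_m - x_n|M_k,\theta_k^{r_\theta}) \\
&- \sum_{m\in N_p} \lambda_{pm}\,\mathrm{Cov}(x_m - x_p|M_k,\theta_k^{r_\theta}) \\
&- \sum_{n\in N_q} \lambda_{qn}\,\mathrm{Cov}(x_n - x_q|M_k,\theta_k^{r_\theta}),
\end{aligned} \tag{39}$$

$\mathrm{Cov}(x_p - x_q|M_k,\theta_k^{r_\theta})$ being the covariance between $Z$ values at spatial locations $x_p$ and $x_q$ according to variogram model $M_k$ with parameters $\theta_k^{r_\theta}$, $\lambda_{pm}$ the kriging weight of data at $x_m$ used to estimate $Z$ at point $x_p$, and $N_p$ the number of data points employed for this purpose. When $\Delta_p = \Delta_q$, (39) reduces to

$$\mathrm{Var}(\Delta_p|D,M_k,\theta_k^{r_\theta})_{ML_D} = \mathrm{Cov}(0|M_k,\theta_k^{r_\theta}) - \sum_{m\in N_p} \lambda_{pm}\,\mathrm{Cov}(x_m - x_p|M_k,\theta_k^{r_\theta}) - \mu_p \tag{40}$$

where $\mathrm{Cov}(0)$ is the sill of variogram model $M_k$ and $\mu_p$ a Lagrange multiplier entering into the kriging equations associated with the estimation of $\Delta_p$. The last term in (37) is computed as

$$\begin{aligned}
\mathrm{Cov}_{\theta_k}[E(\Delta_p\Delta_q|D,M_k,\theta_k)_{ML_D}] \simeq \frac{1}{R_\theta - 1}\sum_{r_\theta=1}^{R_\theta} &\left[E(\Delta_p|D,M_k,\theta_k^{r_\theta})_{ML_D} - E(\Delta_p|D,M_k)_{ML_D}\right] \\
\times &\left[E(\Delta_q|D,M_k,\theta_k^{r_\theta})_{ML_D} - E(\Delta_q|D,M_k)_{ML_D}\right].
\end{aligned} \tag{41}$$

We verify that $R_\theta = 2000$ is large enough for $\mathrm{Tr}[\mathrm{Cov}(\Delta|D,M_k)_{ML_D}]$ to stabilize for each model.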
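The way (36)–(38) and (41) combine per-realization kriging results into moments for one model can be illustrated with a small pure-Python sketch. The function names and the toy inputs are hypothetical; in the paper the inputs would be kriging means and covariances obtained for each parameter realization.

```python
def mc_moments(means, covs):
    """Combine per-realization means and covariances following (36)-(38) and (41).

    means: list of R vectors E(Delta | D, M_k, theta^r)
    covs:  list of R matrices Cov(Delta | D, M_k, theta^r)
    """
    R = len(means)
    n = len(means[0])
    # (36): sample mean of the conditional means
    mean = [sum(m[p] for m in means) / R for p in range(n)]
    # (38): sample mean of the conditional covariances
    exp_cov = [[sum(c[p][q] for c in covs) / R for q in range(n)] for p in range(n)]
    # (41): sample covariance of the conditional means, normalized by R - 1
    cov_mean = [[sum((m[p] - mean[p]) * (m[q] - mean[q]) for m in means) / (R - 1)
                 for q in range(n)] for p in range(n)]
    # (37): the total covariance is the sum of the two contributions
    total = [[exp_cov[p][q] + cov_mean[p][q] for q in range(n)] for p in range(n)]
    return mean, total


def trace(matrix):
    """Sum of the diagonal entries of a square matrix given as nested lists."""
    return sum(matrix[i][i] for i in range(len(matrix)))


# Toy input: three hypothetical parameter realizations, two prediction points
means = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
covs = [[[1.0, 0.0], [0.0, 1.0]]] * 3
mean, total = mc_moments(means, covs)
print(mean, total, trace(total))
```

Monitoring `trace(total)` as the number of realizations grows is one simple way to check the kind of stabilization mentioned above.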

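Once $E(\Delta_p|D,M_k)_{ML_D}$ and $\mathrm{Cov}(\Delta_p\Delta_q|D,M_k)_{ML_D}$ have been determined for each $M_k$ by Monte Carlo simulation, we compute their model averaged equivalents according to

$$E(\Delta_p|D)_{ML_D} = \sum_{k=1}^{K} E(\Delta_p|D,M_k)_{ML_D}\, p(M_k|D)_{ML_D}, \tag{42}$$

$$\begin{aligned}
\mathrm{Cov}(\Delta_p\Delta_q|D)_{ML_D} ={}& \sum_{k=1}^{K} \mathrm{Cov}(\Delta_p\Delta_q|D,M_k)_{ML_D}\, p(M_k|D)_{ML_D} \\
&+ \sum_{k=1}^{K} \left[E(\Delta_p|D,M_k)_{ML_D} - E(\Delta_p|D)_{ML_D}\right]\left[E(\Delta_q|D,M_k)_{ML_D} - E(\Delta_q|D)_{ML_D}\right] p(M_k|D)_{ML_D}.
\end{aligned} \tag{43}$$

A similar procedure is used to obtain $E(\Delta_p|D,C')_{ML_{D,C'}}$ and $\mathrm{Cov}(\Delta_p\Delta_q|D,C')_{ML_{D,C'}}$. Figs. 6 and 7 show how $\mathrm{Var}(\Delta_p|D)_{ML_D}$ and $\mathrm{Var}(\Delta_p|D,C')_{ML_{D,C'}}$, respectively, vary across the grid, the difference $\mathrm{Var}(\Delta_p|D)_{ML_D} - \mathrm{Var}(\Delta_p|D,C')_{ML_{D,C'}}$ between them being depicted in Fig. 8. It is seen that enlarging the database from 100 (the dimension of $D$) to 120 (the dimension of $\{D, C'\}$) reduces the prediction variance across the grid, most noticeably in its upper right quadrant.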
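The model averaging in (42) and (43) amounts to a probability-weighted mean plus a within-model and a between-model covariance term. The sketch below is a minimal illustration with a hypothetical function name and toy inputs, not the authors' implementation.

```python
def bma_moments(model_means, model_covs, probs):
    """Model-averaged mean (42) and covariance (43) from per-model moments."""
    K = len(probs)
    n = len(model_means[0])
    # (42): weighted average of the per-model means
    mean = [sum(probs[k] * model_means[k][p] for k in range(K)) for p in range(n)]
    cov = [[0.0] * n for _ in range(n)]
    for k in range(K):
        for p in range(n):
            for q in range(n):
                within = model_covs[k][p][q]  # first sum in (43)
                between = ((model_means[k][p] - mean[p])
                           * (model_means[k][q] - mean[q]))  # second sum in (43)
                cov[p][q] += probs[k] * (within + between)
    return mean, cov


# Toy input: two hypothetical models, one prediction point, equal posterior probabilities
bma_mean, bma_cov = bma_moments([[0.0], [2.0]], [[[1.0]], [[1.0]]], [0.5, 0.5])
print(bma_mean, bma_cov)  # [1.0] [[2.0]]
```

In the toy case the two models agree on the within-model variance (1.0) but disagree on the mean, so the between-model term adds another 1.0 to the averaged variance.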

Since in real applications $C'$ would not be available, the next step is to compute the statistics of potential data $C$ conditional on $D$. Since in our case $\Delta$ and $C$ describe the same attribute at different locations, the statistics of $C$ are obtained simply upon replacing $\Delta$ with $C$ in the above procedure. We use these statistics to generate $R_C = 200$ random realizations, $C^{r_C}$, $r_C = 1, 2, \ldots, R_C$, of $C$ by considering it to be multivariate normal, $C \sim N(E(C|D)_{ML_D}, \mathrm{Cov}(C|D)_{ML_D})$; we see no way to avoid Monte Carlo simulation

Fig. 6. Variation of $\mathrm{Var}(\Delta_p|D)_{ML_D}$ across the grid.

Fig. 7. Variation of $\mathrm{Var}(\Delta_p|D,C')_{ML_{D,C'}}$ across the grid.

Fig. 8. Variation of $\mathrm{Var}(\Delta_p|D)_{ML_D} - \mathrm{Var}(\Delta_p|D,C')_{ML_{D,C'}}$ across the grid.

Fig. 9. $C'$ (red), $E(C|D)_{ML_D}$ (blue), 200 realizations $C^{r_C}$ (gray) and 95% confidence interval of all 200 realizations (dashed black). (For interpretation of the references in color in this figure legend, the reader is referred to the web version of this article.)

Fig. 10. Variation of $\mathrm{Var}(\Delta_p|D)_{ML_{D,C}}$ across the grid; compare with Fig. 6 for $\mathrm{Var}(\Delta_p|D)_{ML_D}$.

Fig. 11. Variation of $E_{C|D}\mathrm{Var}(\Delta_p|D,C)_{ML_{D,C}}$ across the grid; compare with Fig. 7 for $\mathrm{Var}(\Delta_p|D,C')_{ML_{D,C'}}$.

without neglecting the effect of $C$ on parameter and model uncertainties. Fig. 9 verifies that $E(C|D)_{ML_D}$ represents a smoothed version of $C'$, which in turn is well within the range of simulated $C^{r_C}$ realizations.

For every realization $C^{r_C}$ we predict $\Delta$ the same way as before but now conditional on an expanded database $\{D, C^{r_C}\}$. This yields $E(\Delta_p|D,C^{r_C})_{ML_{D,C}}$ and $\mathrm{Cov}(\Delta_p\Delta_q|D,C^{r_C})_{ML_{D,C}}$, where the subscript $ML_{D,C}$ denotes ML estimation based on $\{D, C^{r_C}\}$. The latter are then averaged over all realizations of $C^{r_C}$ to obtain

$$E(\Delta_p|D)_{ML_{D,C}} = E_{C|D}E(\Delta_p|D,C)_{ML_{D,C}} \simeq \frac{1}{R_C}\sum_{r_C=1}^{R_C} E(\Delta_p|D,C^{r_C})_{ML_{D,C}}, \tag{44}$$

$$\mathrm{Cov}(\Delta_p\Delta_q|D)_{ML_{D,C}} = E_{C|D}\mathrm{Cov}(\Delta_p\Delta_q|D,C)_{ML_{D,C}} + \mathrm{Cov}_{C|D}E(\Delta_p\Delta_q|D,C)_{ML_{D,C}} \tag{45}$$

and $\mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_{D,C}}]$. The final step entails computing $\mathrm{Tr}[\mathrm{Cov}_{C|D}E(\Delta|D,C)_{ML_{D,C}}]$ which, we recall, represents the variance reduction measure $\mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_{D,C}}] - \mathrm{Tr}[E_{C|D}\mathrm{Cov}(\Delta|D,C)_{ML_{D,C}}]$; we verify that sample estimates of all these three terms stabilize after 200 realizations. Variations of $\mathrm{Var}(\Delta|D)_{ML_{D,C}}$, $E_{C|D}\mathrm{Var}(\Delta|D,C)_{ML_{D,C}}$ and $\mathrm{Var}_{C|D}E(\Delta|D,C)_{ML_{D,C}}$ across the grid are shown, respectively, in Figs. 10–12. A comparison of Figs. 6 and 10 reveals that even though $\mathrm{Var}(\Delta|D)_{ML_{D,C}}$ tends to exceed $\mathrm{Var}(\Delta|D)_{ML_D}$ ($\mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_{D,C}}] = 712.95$ while $\mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_D}] = 682.44$ due, most likely, to sampling errors stemming from size and accuracy limitations on $D$) the two quantities have near identical spatial patterns. Likewise, though $E_{C|D}\mathrm{Var}(\Delta|D,C)_{ML_{D,C}}$ in Fig. 11 tends to exceed $\mathrm{Var}(\Delta|D,C')_{ML_{D,C'}}$ in Fig. 7, these too have near identical spatial patterns. One therefore expects the estimated variance reduction $\mathrm{Var}_{C|D}E(\Delta|D,C)_{ML_{D,C}}$ to exhibit a pattern similar to that of the true variance reduction $\mathrm{Var}(\Delta|D)_{ML_D} - \mathrm{Var}(\Delta|D,C')_{ML_{D,C'}}$, a fact verified through a comparison of Figs. 12 and 8. Correspondingly $\mathrm{Tr}[\mathrm{Cov}_{C|D}E(\Delta|D,C)_{ML_{D,C}}] = 54.61$ approximates closely the true trace reduction $\mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_D}] - \mathrm{Tr}[\mathrm{Cov}(\Delta|D,C')_{ML_{D,C'}}] = 58.95$.

The above results are based on Monte Carlo evaluation of all moments. Estimating the lead moments of $\Delta$ through linearization as described in Appendix B brings about a 10-fold reduction in central processor time without any serious effect on accuracy: the value of $\mathrm{Tr}[\mathrm{Cov}_{C|D}E(\Delta|D,C)_{ML_{D,C}}]$ drops from 54.61 to 54.54 and that of $\mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_D}] - \mathrm{Tr}[\mathrm{Cov}(\Delta|D,C')_{ML_{D,C'}}]$ from 58.95 to 55.3, all spatial patterns remaining virtually unchanged. A visual comparison of $\mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_D}]$, $\mathrm{Tr}[\mathrm{Cov}(\Delta|D,C')_{ML_{D,C'}}]$, $\mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_{D,C}}]$, $\mathrm{Tr}[\mathrm{Cov}_{C|D}E(\Delta|D,C)_{ML_{D,C}}]$ and $\mathrm{Tr}[E_{C|D}\mathrm{Cov}(\Delta|D,C)_{ML_{D,C}}]$ values obtained by the two methods is provided in Fig. 13.

Fig. 12. Variation of $\mathrm{Var}_{C|D}E(\Delta_p|D,C)_{ML_{D,C}}$ across the grid; compare with Fig. 8 for $\mathrm{Var}(\Delta_p|D)_{ML_D} - \mathrm{Var}(\Delta_p|D,C')_{ML_{D,C'}}$.

Fig. 13. Trace values obtained upon estimating lead moments of $\Delta$ through Monte Carlo simulation (MC) and second-order approximation (MA) according to Appendix B. Indexes 1–5 represent $\mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_D}]$, $\mathrm{Tr}[\mathrm{Cov}(\Delta|D,C')_{ML_{D,C'}}]$, $\mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_{D,C}}]$, $\mathrm{Tr}[\mathrm{Cov}_{C|D}E(\Delta|D,C)_{ML_{D,C}}]$ and $\mathrm{Tr}[E_{C|D}\mathrm{Cov}(\Delta|D,C)_{ML_{D,C}}]$, respectively.
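A few lines of arithmetic make the reported trace values easier to compare. The sketch below uses only numbers quoted in the text; the variable names are hypothetical, and the expected conditional trace is not reported in the text but follows from the trace form of the decomposition (45).

```python
def relative_change(reference, value):
    """Absolute change of value relative to reference."""
    return abs(value - reference) / abs(reference)


# Trace values quoted in the text (Monte Carlo and linearized evaluations)
mc = {"estimated_reduction": 54.61, "true_reduction": 58.95}
lin = {"estimated_reduction": 54.54, "true_reduction": 55.3}
total_trace = 712.95  # Tr[Cov(Delta|D)] under ML_{D,C}, Monte Carlo

# Implied by (45): total trace = expected conditional trace + estimated reduction
expected_conditional_trace = total_trace - mc["estimated_reduction"]

# Share of the true trace reduction captured by the estimated reduction
ratio = mc["estimated_reduction"] / mc["true_reduction"]

print(round(expected_conditional_trace, 2))
print(round(ratio, 3))
print(round(relative_change(mc["estimated_reduction"], lin["estimated_reduction"]), 4))
print(round(relative_change(mc["true_reduction"], lin["true_reduction"]), 4))
```

On these figures the estimated reduction recovers roughly 93% of the true one, and linearization changes the two reduction measures by roughly 0.1% and 6%, respectively.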

5. Conclusions

Our paper leads to the following major conclusions:

1. A multimodel approach to optimum value-of-information or data-worth analyses has been proposed based on a Bayesian model averaging (BMA) framework. We have focused on a maximum likelihood (MLBMA) variant of BMA that (a) is compatible with both deterministic and stochastic models, (b) admits but does not require prior information about the parameters, (c) is consistent with modern statistical methods of hydrologic model calibration, (d) allows approximating lead predictive moments of any model by linearization, and (e) updates model posterior probabilities as well as parameter estimates on the basis of potential new data both before and after such data become actually available.

2. The proposed approach should be of help in designing the collection of hydrologic characterization and monitoring data in a cost-effective manner by maximizing their benefit under given cost constraints. Benefits would accrue from optimum gain in information, or reduction in predictive uncertainty, upon considering jointly not only traditional sources of uncertainty such as those affecting model parameters and the reliability of data but also lack of certainty about the underlying models.

3. Implementation of the proposed approach on a synthetic geostatistical problem in two space dimensions demonstrates a need to account for the impact of potential new data on model and parameter uncertainties. Though neither the existing nor a potentially augmented set of data is sufficient to identify correctly the underlying geostatistical model (variogram) and its parameters, they nevertheless yield self-consistent results and allow identifying quite accurately the impacts of potential new data on the spatial distribution and magnitude of corresponding reductions in predictive variance.

4. Approximating lead predictive moments associated with each model by linearization, as described in Appendix B, has yielded results comparable to those obtained via Monte Carlo simulation with a much lesser expenditure of computational effort. The extent to which such linearization would work in strongly nonlinear situations remains an open question.


Acknowledgements

This research was supported in part through a contract between the University of Arizona and Vanderbilt University under the Consortium for Risk Evaluation with Stakeholder Participation (CRESP) III, funded by the US Department of Energy. The third and fourth authors were supported in part by NSF-EAR Grant 0911074 and DOE-ERSP Grant DE-SC0002687.

Appendix A. Computational implementation of MLBMA framework

To assess the impact of data augmentation within the above MLBMA framework computationally we propose the following approach:

1. Postulate a set $M$ of $K$ mutually independent geostatistical, statistical or stochastic models, $M_k$, with parameters $\theta_k$ for the desired output vector, $\Delta$.

2. Obtain ML estimates $\hat{\theta}_k^D$ of $\theta_k$ by calibrating each $M_k$ against available data $D$ through minimization of the negative log likelihood $-2\ln p(D|M_k,\theta_k) - 2\ln p(\theta_k|M_k)$, then compute the corresponding estimation covariance $C_k^D$ and $\mathrm{KIC}_k^D$.

3. Compute

   $$p(M_k|D)_{ML_D} = \frac{\exp\left(-\tfrac{1}{2}\delta\mathrm{KIC}_k^D\right)p(M_k)}{\sum_{l=1}^{K}\exp\left(-\tfrac{1}{2}\delta\mathrm{KIC}_l^D\right)p(M_l)}$$

   where the subscript $ML_D$ designates the ML estimation process in step 2.

4. For each model $M_k$ estimate $E(\Delta|D,M_k)_{ML_D}$ and $\mathrm{Cov}(\Delta|D,M_k)_{ML_D}$ either through second-order approximations, as described in Appendix B, or via Monte Carlo simulation (both options are explored in our synthetic example):
   a. Draw random samples (realizations) of $\theta_k$ from a multivariate Gaussian distribution with mean $\hat{\theta}_k^D$ and covariance $C_k^D$.
   b. Estimate $E(\Delta|D,M_k,\theta_k)_{ML_D}$ and $\mathrm{Cov}(\Delta|D,M_k,\theta_k)_{ML_D}$ for each realization of $\theta_k$.
   c. Average over all realizations of $\theta_k$ to obtain sample estimates of $E(\Delta|D,M_k)_{ML_D} = E_{\theta_k}E(\Delta|D,M_k,\theta_k)_{ML_D}$ and

      $$\mathrm{Cov}(\Delta|D,M_k)_{ML_D} = E_{\theta_k}\mathrm{Cov}(\Delta|D,M_k,\theta_k)_{ML_D} + \mathrm{Cov}_{\theta_k}E(\Delta|D,M_k,\theta_k)_{ML_D}.$$

5. Compute $E(\Delta|D)_{ML_D} = \sum_{k=1}^{K}E(\Delta|D,M_k)_{ML_D}\,p(M_k|D)_{ML_D}$ and

   $$\begin{aligned}
   \mathrm{Cov}(\Delta|D)_{ML_D} ={}& \sum_{k=1}^{K}\mathrm{Cov}(\Delta|D,M_k)_{ML_D}\,p(M_k|D)_{ML_D} \\
   &+ \sum_{k=1}^{K}\left[E(\Delta|D,M_k)_{ML_D} - E(\Delta|D)_{ML_D}\right]\left[E(\Delta|D,M_k)_{ML_D} - E(\Delta|D)_{ML_D}\right]^T p(M_k|D)_{ML_D}
   \end{aligned}$$

   and/or $\mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_D}]$.

6. Postulate a set $P$ of $I$ alternative geostatistical, statistical or stochastic models, $P_i$, with parameters $\pi_i$ for a potential data set $C$; the models $P_i$ may be independent of $M_k$, may form extensions of $M_k$ or may coincide with the latter as in the computational examples given in this paper.

7. Predict multivariate statistics of $C$, conditional on $D$, either via BMA or via MLBMA by means of the model set $P$; in the case of MLBMA the procedure would parallel that described for $\Delta$ in steps (2)–(6).

8. Estimate $E(\Delta|D,C)_{ML_{D,C}}$ and $\mathrm{Cov}(\Delta|D,C)_{ML_{D,C}}$, where the subscript $ML_{D,C}$ designates the ML estimation process in step 2 but now with respect to an augmented data set $\{D, C\}$, either through second-order approximations, as described in Appendix B, followed by step 10 or via Monte Carlo simulation by using the statistics of $C$ from step 7 to generate random realizations of $C$ (both options are explored in our synthetic example); for each realization and for each model $M_k$:
   a. Optionally linearize the residuals entering into the negative log likelihood $-2\ln p(D,C|M_k,\theta_k) - 2\ln p(\theta_k|M_k)$ about $\hat{\theta}_k^D$ (this option is not explored in this paper).
   b. Obtain ML estimates $\hat{\theta}_k^{D,C}$ of $\theta_k$ by minimizing this negative log likelihood with respect to $\theta_k$, then compute the corresponding estimation covariance $C_k^{D,C}$ and $\mathrm{KIC}_k^{D,C}$.
   c. Compute

      $$p(M_k|D,C)_{ML_{D,C}} = \frac{\exp\left(-\tfrac{1}{2}\delta\mathrm{KIC}_k^{D,C}\right)p(M_k)}{\sum_{l=1}^{K}\exp\left(-\tfrac{1}{2}\delta\mathrm{KIC}_l^{D,C}\right)p(M_l)}.$$

   d. For each model $M_k$ estimate $E(\Delta|D,C,M_k)_{ML_{D,C}}$ and $\mathrm{Cov}(\Delta|D,C,M_k)_{ML_{D,C}}$ via Monte Carlo simulation:
      i. Draw random samples (realizations) of $\theta_k$ from a multivariate Gaussian distribution with mean $\hat{\theta}_k^{D,C}$ and covariance $C_k^{D,C}$.
      ii. Estimate $E(\Delta|D,C,M_k,\theta_k)_{ML_{D,C}}$ and $\mathrm{Cov}(\Delta|D,C,M_k,\theta_k)_{ML_{D,C}}$ for each realization of $\theta_k$.
      iii. Average over all realizations of $\theta_k$ to obtain sample estimates of $E(\Delta|D,C,M_k)_{ML_{D,C}} = E_{\theta_k}E(\Delta|D,C,M_k,\theta_k)_{ML_{D,C}}$ and

         $$\mathrm{Cov}(\Delta|D,C,M_k)_{ML_{D,C}} = E_{\theta_k}\mathrm{Cov}(\Delta|D,C,M_k,\theta_k)_{ML_{D,C}} + \mathrm{Cov}_{\theta_k}E(\Delta|D,C,M_k,\theta_k)_{ML_{D,C}}.$$

   e. Compute $E(\Delta|D,C)_{ML_{D,C}} = \sum_{k=1}^{K}E(\Delta|D,C,M_k)_{ML_{D,C}}\,p(M_k|D,C)_{ML_{D,C}}$ and

      $$\begin{aligned}
      \mathrm{Cov}(\Delta|D,C)_{ML_{D,C}} ={}& \sum_{k=1}^{K}\mathrm{Cov}(\Delta|D,C,M_k)_{ML_{D,C}}\,p(M_k|D,C)_{ML_{D,C}} \\
      &+ \sum_{k=1}^{K}\left[E(\Delta|D,C,M_k)_{ML_{D,C}} - E(\Delta|D,C)_{ML_{D,C}}\right]\left[E(\Delta|D,C,M_k)_{ML_{D,C}} - E(\Delta|D,C)_{ML_{D,C}}\right]^T p(M_k|D,C)_{ML_{D,C}}
      \end{aligned}$$

      and/or $\mathrm{Tr}[\mathrm{Cov}(\Delta|D,C)_{ML_{D,C}}]$.

9. Average over all realizations of $C$ to obtain sample estimates of

   $$E(\Delta|D)_{ML_{D,C}} = E_{C|D}E(\Delta|D,C)_{ML_{D,C}},$$

   $$\mathrm{Cov}(\Delta|D)_{ML_{D,C}} = E_{C|D}\mathrm{Cov}(\Delta|D,C)_{ML_{D,C}} + \mathrm{Cov}_{C|D}E(\Delta|D,C)_{ML_{D,C}}$$

   and/or $\mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_{D,C}}]$; note that due to inevitable sampling errors and approximations associated with the ML estimation process, $E(\Delta|D)_{ML_{D,C}}$, $\mathrm{Cov}(\Delta|D)_{ML_{D,C}}$ and $\mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_{D,C}}]$ obtained at this step would generally differ, though ideally not by much, from $E(\Delta|D)_{ML_D}$, $\mathrm{Cov}(\Delta|D)_{ML_D}$ and $\mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_D}]$ obtained at step 5.

10. Repeat steps 6–9 for different sets $C_1, C_2, C_3, \ldots$ of potential data and select that set which maximizes the difference $\mathrm{Tr}[\mathrm{Cov}_{C|D}E(\Delta|D,C)_{ML_{D,C}}] = \mathrm{Tr}[\mathrm{Cov}(\Delta|D)_{ML_{D,C}}] - \mathrm{Tr}[E_{C|D}\mathrm{Cov}(\Delta|D,C)_{ML_{D,C}}]$ between the trace conditional on $D$ and the expected trace conditional on $D$ and $C$ (this step is not explored in the present paper).
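Step 10 is not explored in the paper, so the following is only a sketch of how the selection could be organized once steps 6–9 have produced, for each candidate set, the model-averaged means $E(\Delta|D,C)_{ML_{D,C}}$ for every realization of $C$. The function names, the dictionary layout and the toy numbers are all hypothetical.

```python
def expected_variance_reduction(means_per_realization):
    """Trace of the sample covariance (over realizations of C) of the conditional means."""
    R = len(means_per_realization)
    n = len(means_per_realization[0])
    avg = [sum(m[p] for m in means_per_realization) / R for p in range(n)]
    # Only the diagonal of the covariance of the means is needed for the trace
    return sum(sum((m[p] - avg[p]) ** 2 for m in means_per_realization) / (R - 1)
               for p in range(n))


def select_design(candidates):
    """candidates maps a design label to its list of per-realization mean vectors."""
    scores = {label: expected_variance_reduction(means)
              for label, means in candidates.items()}
    best = max(scores, key=scores.get)  # design with the largest expected trace reduction
    return best, scores


# Toy input: two hypothetical candidate data sets, three realizations, two prediction points
candidates = {
    "C1": [[0.0, 0.0], [0.2, 0.1], [0.4, 0.2]],
    "C2": [[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]],
}
best, scores = select_design(candidates)
print(best, scores)
```

A candidate whose simulated data move the predictions more strongly, such as "C2" in the toy input, scores higher because it is expected to remove more predictive variance. A cost term per candidate set could be subtracted from the score, but that extension is outside what the text specifies.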

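Appendix B. Second-order moment approximations

Step 4 of the MLBMA procedure in Appendix A entails an optional approximation of $E(\Delta|D,M_k)_{ML_D}$ and $\mathrm{Cov}(\Delta|D,M_k)_{ML_D}$ via lead order expansions of $E(\Delta|D,M_k,\theta_k)_{ML_D}$ and $\mathrm{Cov}(\Delta|D,M_k,\theta_k)_{ML_D}$ in $\theta_k$ about $\hat{\theta}_k^D$; step 8 entails a similar approximation of $E(\Delta|D,C,M_k)_{ML_{D,C}}$ and $\mathrm{Cov}(\Delta|D,C,M_k)_{ML_{D,C}}$ via lead order expansion of $E(\Delta|D,C,M_k,\theta_k)_{ML_{D,C}}$ and $\mathrm{Cov}(\Delta|D,C,M_k,\theta_k)_{ML_{D,C}}$ in $\theta_k$ about $\hat{\theta}_k^{D,C}$. In the context of step 4 one would start with second-order expansions

$$\begin{aligned}
E(\Delta_p|D,M_k,\theta_k)_{ML_D} \simeq{}& E(\Delta_p|D,M_k,\hat{\theta}_k^D) + \sum_{n=1}^{N_k}\left.\frac{\partial E(\Delta_p|D,M_k,\theta_k)_{ML_D}}{\partial\theta_{kn}}\right|_{\hat{\theta}_k^D}\delta\theta_{kn} \\
&+ \frac{1}{2}\sum_{n=1}^{N_k}\sum_{m=1}^{N_k}\left.\frac{\partial^2 E(\Delta_p|D,M_k,\theta_k)_{ML_D}}{\partial\theta_{kn}\,\partial\theta_{km}}\right|_{\hat{\theta}_k^D}\delta\theta_{kn}\,\delta\theta_{km},
\end{aligned} \tag{B1}$$

$$\begin{aligned}
\mathrm{Cov}(\Delta_p\Delta_q|D,M_k,\theta_k)_{ML_D} \simeq{}& \mathrm{Cov}(\Delta_p\Delta_q|D,M_k,\hat{\theta}_k^D) + \sum_{n=1}^{N_k}\left.\frac{\partial\,\mathrm{Cov}(\Delta_p\Delta_q|D,M_k,\theta_k)_{ML_D}}{\partial\theta_{kn}}\right|_{\hat{\theta}_k^D}\delta\theta_{kn} \\
&+ \frac{1}{2}\sum_{n=1}^{N_k}\sum_{m=1}^{N_k}\left.\frac{\partial^2\,\mathrm{Cov}(\Delta_p\Delta_q|D,M_k,\theta_k)_{ML_D}}{\partial\theta_{kn}\,\partial\theta_{km}}\right|_{\hat{\theta}_k^D}\delta\theta_{kn}\,\delta\theta_{km},
\end{aligned} \tag{B2}$$

where $\Delta_p$ is the $p$th component of $\Delta$ and $\delta\theta_k = \theta_k - \hat{\theta}_k^D$. Since $E_{\theta_k}(\delta\theta_k) = 0$ one has, to second order,

$$E_{\theta_k}E(\Delta_p|D,M_k,\theta_k)_{ML_D} \simeq E(\Delta_p|D,M_k,\hat{\theta}_k^D) + \frac{1}{2}\sum_{n=1}^{N_k}\sum_{m=1}^{N_k}\left.\frac{\partial^2 E(\Delta_p|D,M_k,\theta_k)_{ML_D}}{\partial\theta_{kn}\,\partial\theta_{km}}\right|_{\hat{\theta}_k^D}C_{knm}, \tag{B3}$$

$$E_{\theta_k}\mathrm{Cov}(\Delta_p\Delta_q|D,M_k,\theta_k)_{ML_D} \simeq \mathrm{Cov}(\Delta_p\Delta_q|D,M_k,\hat{\theta}_k^D) + \frac{1}{2}\sum_{n=1}^{N_k}\sum_{m=1}^{N_k}\left.\frac{\partial^2\,\mathrm{Cov}(\Delta_p\Delta_q|D,M_k,\theta_k)_{ML_D}}{\partial\theta_{kn}\,\partial\theta_{km}}\right|_{\hat{\theta}_k^D}C_{knm}, \tag{B4}$$

$$\begin{aligned}
\mathrm{Cov}_{\theta_k}E(\Delta_p\Delta_q|D,M_k,\theta_k)_{ML_D} ={}& E_{\theta_k}\Big\{\left[E(\Delta_p|D,M_k,\theta_k)_{ML_D} - E_{\theta_k}E(\Delta_p|D,M_k,\theta_k)_{ML_D}\right] \\
&\quad\times\left[E(\Delta_q|D,M_k,\theta_k)_{ML_D} - E_{\theta_k}E(\Delta_q|D,M_k,\theta_k)_{ML_D}\right]\Big\} \\
\simeq{}& \sum_{n=1}^{N_k}\sum_{m=1}^{N_k}\left.\frac{\partial E(\Delta_p|D,M_k,\theta_k)_{ML_D}}{\partial\theta_{kn}}\right|_{\hat{\theta}_k^D}\left.\frac{\partial E(\Delta_q|D,M_k,\theta_k)_{ML_D}}{\partial\theta_{km}}\right|_{\hat{\theta}_k^D}C_{knm},
\end{aligned} \tag{B5}$$

$$\begin{aligned}
\mathrm{Cov}(\Delta_p\Delta_q|D,M_k)_{ML_D} ={}& E_{\theta_k}\mathrm{Cov}(\Delta_p\Delta_q|D,M_k,\theta_k)_{ML_D} + \mathrm{Cov}_{\theta_k}E(\Delta_p\Delta_q|D,M_k,\theta_k)_{ML_D} \\
\simeq{}& \mathrm{Cov}(\Delta_p\Delta_q|D,M_k,\hat{\theta}_k^D) + \frac{1}{2}\sum_{n=1}^{N_k}\sum_{m=1}^{N_k}\left.\frac{\partial^2\,\mathrm{Cov}(\Delta_p\Delta_q|D,M_k,\theta_k)_{ML_D}}{\partial\theta_{kn}\,\partial\theta_{km}}\right|_{\hat{\theta}_k^D}C_{knm} \\
&+ \sum_{n=1}^{N_k}\sum_{m=1}^{N_k}\left.\frac{\partial E(\Delta_p|D,M_k,\theta_k)_{ML_D}}{\partial\theta_{kn}}\right|_{\hat{\theta}_k^D}\left.\frac{\partial E(\Delta_q|D,M_k,\theta_k)_{ML_D}}{\partial\theta_{km}}\right|_{\hat{\theta}_k^D}C_{knm},
\end{aligned} \tag{B6}$$

where $C_{knm} = E_{\theta_k}(\delta\theta_{kn}\,\delta\theta_{km})$ is the $(n,m)$th component of $C_k^D$. An analogous approach would apply to step 8 of Appendix A.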
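The structure of (B3) and (B5) can be illustrated for a single scalar prediction treated as a generic function of the parameters. The sketch below is an assumption-laden illustration rather than the authors' implementation: derivatives are taken by central finite differences (the appendix does not say how they are evaluated), and the function name, the step size `h` and the test function `toy_prediction` are hypothetical.

```python
def second_order_moments(f, theta_hat, param_cov, h=1e-3):
    """Approximate E[f(theta)] as in (B3) and Var[f(theta)] as in (B5) about theta_hat."""
    n = len(theta_hat)

    def shifted(i, di, j=None, dj=0.0):
        # Evaluate f with parameter i (and optionally j) perturbed
        t = list(theta_hat)
        t[i] += di
        if j is not None:
            t[j] += dj
        return f(t)

    f0 = f(list(theta_hat))
    # First derivatives by central differences
    grad = [(shifted(i, h) - shifted(i, -h)) / (2 * h) for i in range(n)]
    # Second derivatives by central differences
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                hess[i][j] = (shifted(i, h) - 2 * f0 + shifted(i, -h)) / h ** 2
            else:
                hess[i][j] = (shifted(i, h, j, h) - shifted(i, h, j, -h)
                              - shifted(i, -h, j, h) + shifted(i, -h, j, -h)) / (4 * h ** 2)
    # (B3): value at the estimate plus half the Hessian contracted with the covariance
    mean = f0 + 0.5 * sum(hess[i][j] * param_cov[i][j] for i in range(n) for j in range(n))
    # (B5) with p = q: gradient contracted twice with the covariance
    var = sum(grad[i] * grad[j] * param_cov[i][j] for i in range(n) for j in range(n))
    return mean, var


def toy_prediction(theta):
    """Hypothetical quadratic dependence of a prediction on two parameters."""
    return theta[0] ** 2 + 3.0 * theta[1]


approx_mean, approx_var = second_order_moments(
    toy_prediction, [1.0, 2.0], [[0.04, 0.0], [0.0, 0.01]])
print(approx_mean, approx_var)
```

For the quadratic toy function the second-order mean is exact, which makes it a convenient check; each evaluation of `f` would be a kriging calculation in the setting of the paper, which is where the savings over Monte Carlo sampling come from.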


References

[1] Abbaspour KC, Schulin R, Schlappi E, Fluhler H. A Bayesian approach forincorporating uncertainty and data worth in environmental projects. EnvironModel Assess 1996;1:151–258.

[2] Akaike H. A new look at statistical model identification. IEEE Trans AutomatControl 1974;AC-19:716–22.

[3] Back P-E. A model for estimating the value of sampling programs and theoptimal number of samples for contaminated soil. Environ Geol2007;52:573–85. doi:10.1007/s00254-0488-6.

[4] Ben-Zvi M, Berkowitz B, Kesler S. Pre-posterior analysis as a tool for dataevaluation: application to aquifer contamination. Water Resour Manage1988;2:11–20.

[5] Beven KJ, Binley AM. The future of distributed models: model calibration anduncertainty prediction. Hydrol Process 1992;6:279–98.

[6] Beven KJ, Freer J. Equifinality, data assimilation, and uncertainty estimation inmechanistic modelling of complex environmental systems using the GLUEmethodology. J Hydrol 2001;249:11–29.

[7] Box G. Sampling and Bayes’ inference in scientific modelling and robustness. JRoy Statist Soc, Ser A 1980;143(4):383–430.

[8] Carrera J, Neuman SP. Estimation of aquifer parameters under transient andsteady state conditions: 1. Maximum likelihood method incorporating priorinformation. Water Resour Res 1986;22(2):199–210.

[9] Carrera J, Neuman SP. Estimation of aquifer parameters under transient andsteady state conditions: 3. Application to synthetic and field data. WaterResour Res 1986;22(2):228–42.

[10] Carrera J, Medina A, Axness C, Zimmerman T. Formulations and computationalissues of the inversion of random fields. In: Dagan G, Neuman SP, editors.Subsurface flow and transport: a stochastic approach. Cambridge, UnitedKingdom: Cambridge University Press; 1997. p. 62–79.

[11] Dausman AM, Doherty J, Langevin CD, Sukop MC. Quantifying data worthtoward reducing predictive uncertainty. Ground Water 2010;48(5):729–40.

[12] Davis DR, Kisiel CC, Duckstein L. Bayesian decision theory applied to design inhydrology. Water Resour Res 1972;8(1):33–41.

[13] Davis DR, Dvoranchik WM. Evaluation of the worth of additional data. WaterResour Res 1971;7(4):700–7.

[14] Deutsch CV, Journel AG. GSLIB: geostatistical software library and user’s guide.2nd ed. New York: Oxford University Press; 1998.

[15] Di Federico V, Neuman SP. scaling of random fields by means of truncatedpower variograms and associated spectra. Water Resour Res1997;33(5):1075–85.

[16] Diggle P, Lophaven S. Bayesian geostatistical design. Scand J Statist2006;33:53–64. doi:10.1111/j.1467-9469.2005.00469.x.

[17] Draper D. Assessment and propagation of model uncertainty. J Roy Statist SocB 1995;57(1):45–97.

[18] Feyen L, Gorelick SM. Framework to evaluate the worth of hydraulicconductivity data for optimal groundwater resources management inecologically sensitive areas. Water Resour Res 2005;41:W03019.doi:10.1029/2003WR002901.

[19] Freer J, Beven K, Ambroise B. Bayesian estimation of uncertainty in runoffprediction and the value of data: an application of the GLUE approach. WaterResour Res 1996;32(7):2161–73.

[20] Freeze RA, Massman J, Smith L, Sperling T, James B. Hydrogeological decisionanalysis: 1. A framework. Ground Water 1990;28(5):738–66.

[21] Freeze RA, Brude J, Massman J, Sperling T, Smith L. Hydrogeological decisionanalysis: 4. The concept of data worth and its use in the development of siteinvestigation strategies. Ground Water 1992;30(4):574–88.

[22] Grosser PW, Goodman AS. Determination of groundwater samplingfrequencies through Bayesian decision theory. Civil Eng Syst1985;2(December):186–294.

[23] Gates JS, Kisiel CC. Worth of additional data to a digital computer model of agroundwater basin. Water Resour Res 1974;10(5):1031–8.

[24] Hernandez AF, Neuman SP, Guadagnini A, Carrera J. Inverse stochastic momentanalysis of steady state flow in randomly heterogeneous media. Water ResourRes 2006;42:W05425. doi:10.1029/2005WR004449.

[25] Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: atutorial. Statist Sci 1999;14(4):382–417.

[26] Hurvich CM, Tsai C-L. Regression and time series model selection in smallsample. Biometrika 1989;76(2):99–104.

[27] IT-Corporation. Value of information analysis for corrective action unit no. 98,Frenchman Flat, US Department of Energy, USA; 1997.

[28] James BR, Freeze RA. The worth of data in predicting aquitard continuity inhydrogeological design. Water Resour Res 1993;29(7):2049–65.

[29] James BR, Gorelick SM. When enough is enough: the worth of monitoringdata in aquifer remediation design. Water Resour Res 1994;30(12):3499–513.

[30] James BR, Gwo J-P, Toran L. Risk-cost decision framework for aquiferremediation design. J Water Resour Plan Manage 1996;122(6):414–20.

[31] Kashyap RL. Optimal choice of AR and MA parts in autoregressive movingaverage models. IEEE Trans Pattern Anal Mach Intell 1982;4(2):99–104.

[32] Kass R, Raftery A. Bayes factors. J Am Statist Assoc 1995;90(430):773–95.[33] Kaunas JR, Haimes YY. Risk management of groundwater contamination in a

multiobjective framework. Water Resour Res 1985;21(11):1721–30.[34] Leamer E. Specification searches: ad hoc inference with nonexperimental data.

1st ed. New York: John Wiley & Sons; 1978.

worth considering model and parameter uncertainties. Adv Water Resour

Page 11: Advances in Water Resources - Peoplemye/pdf/paper24.pdf · 2011. 4. 16. · dom log hydraulic conductivity field. Whereas flow and transport are described by a single (linearized)

S.P. Neuman et al. / Advances in Water Resources xxx (2011) xxx–xxx 11

[35] Li X, Tsai FT-C. Bayesian model averaging for groundwater head prediction anduncertainty analysis using multimodel and multimethod. Water Resour Res2009;45:W09403. doi:10.1029/2008WR007488.

[36] Maddock T. Management model as a tool for studying the worth of data. WaterResour Res 1973;9(2):270–80.

[37] McNulty G, Deshler B, Dove H. Value of information analysis. Nevada Test Site,USA; 1997.

[38] Meyer PD, Ye M, Rockhold ML, Neuman SP, Cantrell KJ. Combined estimationof hydrogeologic conceptual model, parameter, and scenario uncertainty withapplication to uranium transport at the Hanford Site 300 Area, NUREG/CR-6940, (PNNL-16396), US Nuclear Regulatory Commission, Washington, DC;2007.

[39] Morales-Casique E, Neuman SP, Vesselinov VV. Maximum likelihood Bayesianaveraging of airflow models in unsaturated tuff. Stochast Environ Res RiskAssess 2010;24:863–80. doi:10.1007/s00477-010-0383-2.

[40] Neuman SP. Maximum likelihood Bayesian averaging of alternativeconceptual-mathematical models. Stochast Environ Res Risk Assess2003;17(5):291–305. doi:10.1007/s00477-003-0151-7.

[41] Norberg T, Rosen L. Calculating the optimal number of contaminant samplesby means of data worth analysis. Environmetrics 2006;17:705–19.

[42] Norrman J. Decision analysis under risk and uncertainty at contaminated sites.A literature review. Department of Geology, Chalmers University ofTechnology, Goteborg; 2001.

[43] Nowak W, de Barros FPJ, Rubin Y. Bayesian geostatistical design: Task-drivensite investigation when the geostatistical model is uncertain. Water ResourRes 2010;46:W035535. doi:10.1029/2009WR008312.

[44] Poeter EP, Hill MC. MMA, a computer code for multi-model analysis. USGeological Survey Techniques and Methods TM6-E3; 2007.

[45] Rada ES, Schultz GA. Worth of hydrological data in water resources projects.Application: Bolivians Amazons zone. In: Braga Jr PF, Fernandez-Jauregui CA,editors. Water management of the Amazon Basin. UNESCO; 1998.

[46] Raftery AE, Madigan D, Volinsky CT. Accounting for model uncertainty insurvival analysis improves predictive performance. In: Bernardo J, Beerger J,Dawid A, Smith A, editors. Bayesian statistics. United Kingdom: OxfordUniversity Press; 1996. p. 323–49.

[47] Reichard EG, Evans JS. Assessing the value of hydrogeological information forrisk-based remedial action decisions. Water Resour Res 1989;25(7):1451–60.

[48] Riva M, Guadagnini A, Neuman SP, Bianchi Janetti E, Malama B. Inverseanalysis of stochastic moment equations for transient flow in randomly

Please cite this article in press as: Neuman SP et al. Bayesian analysis of data-(2011), doi:10.1016/j.advwatres.2011.02.007

heterogeneous media. Adv Water Resour 2009;32:1495–507. doi:10.1016/j.advwatres.2009.07.003.

[49] Rojas R, Feyen L, Batelaan O, Dassargues A. On the value of conditioning data toreduce conceptual model uncertainty in groundwater modeling. Water ResourRes, in press. doi:10.1029/2009WR008822.

[50] Rosen L, LeGrand He. An outline of a guidance framework for assessinghydrogeological risks at early stages. Ground Water 1997;35(2):195–204.

[51] Russell KT, Rabideau AJ. Decision analysis for pump-and-treat design. Groundwater Monitor Remediat 2000(Summer):159–268.

[52] Sain SR, Furrer R. Combining climate model output via model correlations. Stochast Environ Res Risk Assess 2010;24(6):821–9. doi:10.1007/s00477-010-0380-5.

[53] Samper FJ, Neuman SP. Estimation of spatial covariance structures by adjoint state maximum likelihood cross-validation: 1. Theory. Water Resour Res 1989;25(3):351–62.

[54] Sohn M, Small M, Pantazidou M. Reducing uncertainty in site characterization using Bayes Monte Carlo methods. J Environ Eng – ASCE 2000;126(10):893–902. doi:10.1061/(ASCE)0733-9372(2000)126:10(893).

[55] Thompson M, Fearn T. What exactly is fitness for purpose in analytical measurements? Analyst 1996;121(March):275–8.

[56] Volinsky CT, Madigan D, Raftery AE, Kronmal RA. Bayesian model averaging in proportional hazard models: assessing the risk of a stroke. J Roy Statist Soc, Ser C 1997;46:433–48.

[57] Ye M, Neuman SP, Meyer PD. Maximum Likelihood Bayesian averaging of spatial variability models in unsaturated fractured tuff. Water Resour Res 2004;40:W05113. doi:10.1029/2003WR002557.

[58] Ye M, Neuman SP, Meyer PD, Pohlmann KF. Sensitivity analysis and assessment of prior model probabilities in MLBMA with application to unsaturated fractured tuff. Water Resour Res 2005;41:W12429. doi:10.1029/2005WR004260.

[59] Ye M, Meyer PD, Neuman SP. On model selection criteria in multimodel analysis. Water Resour Res 2008;44:W03428. doi:10.1029/2008WR006803.

[60] Ye M, Lu D, Neuman SP, Meyer PD. Comment on “Inverse groundwater modeling for hydraulic conductivity estimation using Bayesian model averaging and variance window” by Frank T.-C. Tsai and Xiaobao Li. Water Resour Res 2010;46:W02801. doi:10.1029/2009WR008501.

[61] Yokota F, Thompson K. Value of information literature analysis: a review of applications in health risk assessment. Med Decis Making 2004;24:287–98.
